"I want to say something interesting"
For years now, more than 100 IT companies and thousands of mathematicians have been toiling to create speech recognition systems. In November 2001, at the Comdex IT expo in Las Vegas, the first fully functional pocket voice translator was demonstrated. The sensation is that the machine was created in Russia, by the offshore office of Ectaco, Inc.
iOne correspondent Serge Kolyada made a special trip to St. Petersburg to talk about the market prospects of speech technologies with Anton Epifanov, the general manager of Ectaco's development office in Russia, and Vyacheslav Baryshnikov, the lead development specialist.
Grey, pinched in the middle like a bar of soap, with small holes for the microphone and speaker and a green display, the device does not look like an electronic translator at first glance.
"So try and say something to it," suggests Vyacheslav Baryshnikov.
"Hello," almost automatically say I. Zero effect.
"Bring it closer to your mouth and speak more distinctly," instructs Baryshnikov, hissing at other developers that sidestep us. Thumbing through the back files of my mind to English lessons past, I break the silence and retort "Good Morning" in the manner of a British lieutenant into the incomprehensible device. Something clicks in the box, and a scrolling indicator-cursor goes back and forth over the screen, and in half a second, the device deafens the room with a gurgling succession of sounds - "Buenos Dias." Judging by Slava's (Vyacheslav's) beaming face, it worked: the UT-103 gave us "Good Morning" in Spanish.
Petersburg native Vyacheslav Baryshnikov, who doesn't look a day over 30, is the head of the speech recognition group at the development center of American company Ectaco.
"It's our most promising department. Slava even has a few PhD's on his staff," comments Anton Epifanov, manager of Ectaco's Petersburg office.
The UT-103 is Ectaco's latest development and the first pocket voice translator in the world. The device understands only a predetermined set of phrases from a traveler's lexicon, but voices their translations in three languages. Vyacheslav Baryshnikov doesn't deny the imperfections of this ongoing work: "What matters most to our group is achieving a specific technological result. At one point we had two different voice recognition technologies. One of them, the less reliable one, was fleshed out in the Gold Partner, a portable keyboard-equipped dictionary-organizer for self-instruction in foreign languages. The other, which we finished right before the Gold Partner came out, is built into the UT-103, the keyboardless speech-to-speech translator. The UT-103 doesn't speak Chinese yet, of course, but we're getting much closer to it being able to understand Russian."
They have been getting nearer for a long time.
"The concept of speech technologies in pocket-sized devices has a bright future. It cropped up about three years ago," observes Anton Epifanov, "and at about the same time, we already had a group of people working on such technologies. They started practically from scratch. Of course, we read a lot and invited people in from the side. We did a year and a half of work on the UT-103 - it was our third attempt to make a working system. The first two didn't work out. Didn't work out at all.
"However, now we see the next three years in development of our technology, and the UT-103 is just one incarnation, and maybe not even the most interesting one, of what Slava and his group are working on. A device that works really well will appear on the market in Spring-Summer 2002."
The way a recorder, an amplifier or a synthesizer works is understandable, in principle: these devices reproduce or alter recorded sound.
"But I don't understand how a voice translator works," say I.
"We're conducting research - how can we say this simply - in the area of speech," answers Vyacheslav, "That is to say, we are studying acoustic signals that a human voice-box produces, and according to the research, humans are incapable of producing more than a thousand sounds - in any language. If we're talking about Russian, then that's a little over one hundred sounds, and Japanese, Chinese, Korean and other languages consist of sounds with different tonalities."
iOne: Give us an example of a sound.
Vyacheslav Baryshnikov: "Oh," "Ah," "Ooh." Of course, sounds are divided into stressed and unstressed, depending on their position in a word. These sounds are studied, and on the basis of these studies, a universal "engine," which allows - much in the same way that you use the alphabet write down words - to record and transcribe any phrase and make the machine understand it with enough probability. Nothing complicated.
iOne: But then there are male voices, female voices, different timbres… and intonations, finally.
V.B.: We have learned to eliminate these "divergences." We have speech databases that give us statistics on how a given sound is pronounced in various contexts, and this is the cornerstone of a speaker-independent system. In other words, it doesn't matter whether the voice is male or female; it will only differ in tonality. This so-called base tone is the "carrier" and is the most noticeably distinct characteristic. Our goal is to "clean up" the voice, finding invariant, speaker-independent values. Certainly, this requires a complex mathematical apparatus.
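As a rough illustration of what "cleaning up" the voice can mean, the sketch below computes mel-frequency cepstral coefficients, a standard textbook feature that keeps the spectral envelope of a sound (which carries the spoken content) while largely discarding the base tone, and then subtracts the per-recording mean to remove speaker bias. This is the conventional approach, not necessarily Ectaco's own math; it assumes the librosa library is installed.

```python
import numpy as np
import librosa  # assumed available; any MFCC implementation would do

sr = 16000
t = np.arange(sr) / sr
# Two synthetic "speakers" producing the same vowel-like sound at
# different base tones: 120 Hz (low voice) and 220 Hz (high voice).
low  = (np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 600 * t)).astype(np.float32)
high = (np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 600 * t)).astype(np.float32)

def clean_features(y):
    # MFCCs describe the spectral envelope and are largely pitch-independent.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    # Subtracting the mean removes constant speaker/channel bias.
    return mfcc - mfcc.mean(axis=1, keepdims=True)

# After "cleaning", the two voices yield much more similar values
# than their raw waveforms would suggest.
print(np.mean(np.abs(clean_features(low) - clean_features(high))))
```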
iOne: Connected speech is probably harder to recognize than just separate phrases?
Anton Epifanov: Current speech developments can be rated in pretty much the same way as the first text translators were. Certainly Socrates, Prompt or any of the others translate, but in the majority of cases the translation is grammatically imperfect: at best, it allows one to understand the meaning without knowing the language. Speech technologies let a device understand a speaking person in exactly the same way. Certainly, it's still too early to talk about fantastic results in this area, but I think that in the future - in two or three years' time - a device with a recorder will appear that will allow natives of different languages to communicate with one another. The translation, understandably, will be broken, and it will sound something like this: "I to want to say one interesting thing." But you'll be able to speak in broken Chinese, which is practically impossible to achieve by any other means.
Even the imperfect UT-103 has unexpectedly sparked the interest of a rather specialized group of users.
"It's American police," says Epifanov, "who represent a considerably large vertical market, and the thing is, not only is the IT budget of a large American police department (with a staff of over 200) a few million dollars a month, but even a hammer for government use could cost $5000. Police deal with people of a non-American origin all the time, and they simply can't do without an electronic translator. They've been writing comments and suggestions to us through the Internet - they're looking for a device that could translate into Chinese, Russian and Spanish, one that would tell suspects in clear and proper wording that they're under arrest."
A.E.: Speech recognition can also free up hundreds of thousands of people - call center employees who field frequently asked questions with standard answers. Government, the military and large corporations will not be able to do without SR. Take our own customer service department: as a rule, users ask around five to seven typical questions.
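A toy sketch of that call-center scenario: once the question has been recognized as text, matching it against a handful of standard questions is simple. The questions and answers below are invented for illustration.

```python
# Hypothetical FAQ: a few typical questions with canned answers.
FAQ = {
    "how do i reset the device": "Hold the reset button for five seconds.",
    "where can i download the manual": "The manual is on our support page.",
    "which languages are supported": "English, Spanish and German for now.",
}

def answer(recognized_utterance):
    """Return the canned answer for the best-overlapping standard question."""
    words = set(recognized_utterance.lower().strip("?!. ").split())
    best = max(FAQ, key=lambda q: len(words & set(q.split())))
    return FAQ[best]

print(answer("How do I reset my device?"))  # -> the reset instructions
```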
iOne: Sounds nice, but I don't think it'll ever work in Russia. By the way, is Russian speech difficult to recognize?
A.E.: In principle, no, but the ability to recognize Russian doesn't really say anything about the quality of the product. We're working on an international scale, and there aren't as many Russians in the world as we'd like there to be. There are significantly more Chinese and Hispanics. Slava, I hope, will complete the development of a new engine that will be multilingual, and making SR for it in, say, Russian or Chinese will take two to three months, no more. It'll be sufficient to replace English in the device's memory with the sounds of Chinese; changing the math won't be necessary.
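The separation Epifanov describes - one fixed "math" engine plus swappable per-language sound data - might look something like the sketch below. The class, file names and sound inventories are hypothetical; only the idea of swapping the language pack without touching the algorithm comes from the interview.

```python
from dataclasses import dataclass

@dataclass
class LanguagePack:
    name: str
    sounds: list       # the language's inventory of sounds
    model_file: str    # trained statistics for those sounds

# Hypothetical packs; note the tone-numbered sounds for Chinese.
ENGLISH = LanguagePack("en", ["g", "uh", "d", "m", "ao"], "en_sounds.bin")
CHINESE = LanguagePack("zh", ["b", "a1", "a2", "a3", "a4"], "zh_sounds.bin")

class Engine:
    """The language-independent math; only the data it loads changes."""
    def __init__(self, pack: LanguagePack):
        self.pack = pack

    def recognize(self, audio):
        ...  # same algorithm whatever self.pack contains

ut103 = Engine(ENGLISH)
ut103.pack = CHINESE  # "replace English in memory with the sounds of Chinese"
```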
iOne: All the SR systems I have come across ran on desktop computers. Was it difficult to integrate SR into such small toys as the UT-103 or the Gold Partner?
V.B.: It was rather difficult. Each of those devices has a powerful enough processor, so we manage SR processing in two to three seconds, and this is probably not the limit: chances are that with a couple of contrivances we could increase the speed. But we had to do without adaptive filters (which filter out background noise, such as the sound of a passing car), because that strict, resource-heavy math doesn't fly on small systems.
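For a sense of why adaptive filtering was too heavy for the device, here is a standard least-mean-squares (LMS) noise canceller - the textbook form of such a filter, not Ectaco's code. For every audio sample it performs on the order of two multiply-adds per filter tap, a load that a pocket device's processor of that era could hardly sustain alongside recognition itself.

```python
import numpy as np

def lms_cancel(noisy, noise_ref, n_taps=32, mu=0.01):
    """Subtract an adaptively filtered noise reference (e.g. a second
    microphone pointed at the street) from the noisy speech signal."""
    w = np.zeros(n_taps)                       # filter weights, adapted per sample
    out = np.zeros(len(noisy))
    for i in range(n_taps - 1, len(noisy)):
        x = noise_ref[i - n_taps + 1 : i + 1]  # most recent reference samples
        y = w @ x                              # estimate of the noise in sample i
        e = noisy[i] - y                       # cleaned sample = estimation error
        w += mu * e * x                        # nudge weights toward a better estimate
        out[i] = e
    return out

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 300 * np.arange(8000) / 8000)  # a pure "voice" tone
noise = rng.normal(size=8000)                              # e.g. a passing car
cleaned = lms_cancel(speech + 0.5 * noise, noise)

print("noise power before:", np.mean((0.5 * noise) ** 2))
print("noise power after: ", np.mean((cleaned[4000:] - speech[4000:]) ** 2))
```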
iOne: But what about using, for example, GPRS or 3G network services to transfer the speech to more powerful computers for recognition? In Moscow, such a system is being actively discussed.
V.B.: We even considered that option, but doing it today would be quite difficult: it requires collaboration with mobile phone manufacturers and mobile service operators. One day, of course, this service will appear, though only in a number of years. The potential is there already, and the market for such systems will be astounding - in the billions of dollars, I think.
iOne: Aren't you afraid that the proliferation of electronic speech translators will squelch language learning?
A.E.: No. I think that people will learn languages as before. A speech translator is a niche product, which you would take with you on a tourist trip. It's a means of communicating in a foreign language about relatively simple subjects. Take for example the fact that you know English. Well, in China, a lot of people don't, and people in France do, but don't like to speak it. The market for simple speech devices will widen for the simple reason that people have become more mobile. The world has accelerated, but high-quality professional translations in business deals will still call for live people.
iOne: Talk of speech recognition has been around for 20 years, and from time to time, some company announces that it has a working application or system which ends up falling short of expectations. Don't you think that things are dragging along at a snail's pace?
A.E.: Yes. Full-fledged devices and applications capable of recognizing speech began appearing only recently - notably, with the advent of mass sales of handheld devices. Note how fast speech recognition appeared on mobile phones: it's still imperfect, but it's there. Why? Because there is nothing more natural than talking to such a device. As for the dictation programs previously developed for desktop PCs, they are, by nature, not very useful. The keyboard and mouse have remained unbeatable for more than 20 years, just like remotes for TVs.

Integrating Microsoft Word into a mobile phone is no problem, but you wouldn't be able to input text into this "mobile" MS Word. It is specifically the means of input that is going to define the development of this market segment in the coming years, and in all likelihood, the universal means of input will be speech. Within a year, speech recognition in one form or another will be part of all our dictionaries, and users will pay for this supplementary function.

When I talk about the inevitability of a speech technologies boom, it doesn't mean that next year we'll simply paint the Gold Partner a different color. Its positioning will change: in the future, our dictionaries will be used not only as learning tools but as communication equipment. The old models won't go anywhere - they'll just move to the low-end category, just as electronic organizers became mass market items with the arrival of Palm, an organizer that hooks up to computers.

The next stage in the evolution of speech technologies will last 5 to 10 years, and it is in this stage that they will develop especially fast. Here's an example: this same UT-103 can be perfected ad infinitum - improve durability, reduce the number of recognition errors, add programmability and, finally, add all the functions of a Gold Partner. The Gold Partner itself cannot really be improved upon. Right now, speech recognition development is not a profitable activity: IBM can afford to pour massive resources into these technologies; we can't. But as soon as the first profit appears, the turnaround will be instantaneous. We'll see the speech technology boom as early as 2002-2003.