November 09, 2012
Microsoft Research Machine Translator Speaks Chinese in Your Voice
By Amanda Ciccatelli
TMCnet Web Editor
One of the most natural interfaces for people is human speech. Over the past 60 years, computer scientists have been trying to find ways to understand and recognize human speech.
When they first started addressing this, they looked at it as a pattern-matching problem. The earliest systems took the waveforms of a speaker's voice and matched them up to waveforms that represented specific words. However, that approach was fragile because everyone speaks differently, and even a single speaker says the same thing differently depending on the context.
In the late 1970s, there was a major shift in speech recognition. Researchers at Carnegie Mellon University began using a statistical modeling technique that took data from many speakers and produced more robust models of speech. This was a huge improvement, and over the last 30 years speech-recognition systems have become dramatically better than they used to be.
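As a rough illustration of the pattern-matching idea, here is a minimal Python sketch. It assumes each word is stored as a single short template waveform and that inputs are compared by plain Euclidean distance. The article does not say how the earliest systems measured similarity, so the distance measure, the function names and the toy sample values are all hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length sample sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognize(waveform, templates):
    """Return the word whose stored template is closest to the input."""
    return min(templates, key=lambda word: distance(waveform, templates[word]))

# Hypothetical toy templates: one stored waveform per word.
templates = {
    "yes": [0.0, 0.8, 0.3, -0.5, -0.1, 0.0],
    "no": [0.0, -0.6, -0.9, 0.2, 0.7, 0.0],
}

# An utterance very close to the stored "yes" template.
close_match = [0.0, 0.7, 0.35, -0.45, -0.1, 0.0]

# The same shape, but starting one sample later (a slightly delayed speaker).
delayed = [0.0, 0.0, 0.8, 0.3, -0.5, -0.1]

print(recognize(close_match, templates))
print(distance(close_match, templates["yes"]))
print(distance(delayed, templates["yes"]))
```

Even a one-sample delay makes the distance to the correct template much larger, which hints at why matching raw waveforms breaks down once speakers, speeds and contexts vary.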
Today, near-real-time speech conversion from one language to another has become a reality.
Microsoft Research recently demonstrated how to convert spoken English into Mandarin with a few seconds' delay, as well as how to output that speech in the vocal style of the original speaker. Microsoft's Research Chief Rick Rashid demonstrated this new technology in Tianjin, China, a few weeks ago, where he spoke eight English sentences into the speech recognition, translation and generation system.
The system's advanced capability comes from improvements to the speech-to-speech process. Software like Nuance's Dragon NaturallySpeaking has blazed the trail for speech recognition in offices, and now products like Apple's Siri iPhone assistant can recognize spoken questions and search for answers.
According to Rashid, these systems still go wrong a lot, typically erring on one out of every four or five words, but researchers now have a better way to recognize what people are saying. Microsoft uses a novel neural network system that reduces word-recognition errors to one in seven or eight.
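To put those figures in perspective, the short Python sketch below converts the "one in N words" numbers quoted above into error rates and computes the relative reduction they imply. The function names are illustrative, and the resulting range is only arithmetic on the rounded figures in this article, not a measurement reported by Microsoft.

```python
def error_rate(one_in_n):
    """Convert 'one error in every N words' into a fractional error rate."""
    return 1.0 / one_in_n

def relative_reduction(old_rate, new_rate):
    """Fraction of the old errors that the new system eliminates."""
    return (old_rate - new_rate) / old_rate

# Figures quoted in the article.
old_rates = [error_rate(n) for n in (4, 5)]  # one in four or five words
new_rates = [error_rate(n) for n in (7, 8)]  # one in seven or eight words

# Most conservative pairing: best old rate against worst new rate.
smallest = relative_reduction(min(old_rates), max(new_rates))
# Most generous pairing: worst old rate against best new rate.
largest = relative_reduction(max(old_rates), min(new_rates))

print(f"old error rate: {min(old_rates):.1%} to {max(old_rates):.1%}")
print(f"new error rate: {min(new_rates):.1%} to {max(new_rates):.1%}")
print(f"relative reduction: {smallest:.0%} to {largest:.0%}")
```

Under these assumptions, the quoted figures correspond to somewhere between roughly a quarter and a half fewer misrecognized words.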
Most importantly, Rashid said, the system generates the Mandarin speech in a voice like the speaker's own: if the translation preserves the speaker's vocal cadence, the meaning will be more apparent and the conversation will be more effective.
Rashid said, "In a few years we hope we'll be able to break down the language barriers between people."
Edited by Rachel Ramsey