Technology Review – Microsoft researchers have demonstrated software that translates spoken English into spoken Chinese almost instantly, while preserving the unique cadence of the speaker’s voice—a trick that could make conversation more effective and personal.
The first public demonstration was made by Rick Rashid, Microsoft’s chief research officer, on October 25 at an event in Tianjin, China. “I’m speaking in English and you’ll hear my words in Chinese in my own voice,” Rashid told the audience. The system works by recognizing a person’s words, quickly converting the text into properly ordered Chinese sentences, and then handing those over to speech synthesis software that has been trained to replicate the speaker’s voice.
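The three stages described above (recognition, translation, voice-matched synthesis) can be pictured as a simple chain of functions. The sketch below is only an illustration of that structure, not Microsoft's implementation: the class `SpeechToSpeechPipeline`, the `toy_*` functions, and the two-word `TOY_LEXICON` are all hypothetical stand-ins, and the "audio" is just UTF-8 text so the example can run without any speech libraries.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical two-word lexicon; a real system would use a trained translation model.
TOY_LEXICON = {"hello": "你好", "world": "世界"}


def toy_recognize(audio: bytes) -> str:
    """Stand-in for speech recognition: treats the 'audio' bytes as UTF-8 text."""
    return audio.decode("utf-8")


def toy_translate(english: str) -> str:
    """Stand-in for translation: word-by-word lookup; unknown words pass through."""
    return "".join(TOY_LEXICON.get(word, word) for word in english.lower().split())


def toy_synthesize(chinese: str) -> bytes:
    """Stand-in for voice-matched synthesis: returns the text as UTF-8 bytes."""
    return chinese.encode("utf-8")


@dataclass
class SpeechToSpeechPipeline:
    """Chains the three stages; each stage is any callable with the right shape."""
    recognize: Callable[[bytes], str]    # audio -> source-language text
    translate: Callable[[str], str]      # source text -> target-language text
    synthesize: Callable[[str], bytes]   # target text -> audio

    def run(self, audio: bytes) -> bytes:
        text = self.recognize(audio)
        translated = self.translate(text)
        return self.synthesize(translated)


pipeline = SpeechToSpeechPipeline(toy_recognize, toy_translate, toy_synthesize)
print(pipeline.run(b"hello world").decode("utf-8"))
```

Keeping each stage behind a plain callable means an error made early, such as a misrecognized word, flows unchanged into the later stages, which is one reason recognition accuracy matters so much for the final result.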
Rashid described the work in his own words: During my October 25 presentation in China, I had the opportunity to showcase the latest results of this work. We have been able to reduce the word error rate for speech recognition by over 30% compared to previous methods. This means that rather than having one word in 4 or 5 incorrect, the error rate is now one word in 7 or 8. While still far from perfect, this is the most dramatic change in accuracy since the introduction of hidden Markov modeling in 1979, and as we add more data to the training, we believe that we will get even better results.
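To make the error-rate arithmetic concrete, here is a small sketch relating "one word in N" figures to a relative reduction. The function names are illustrative, and the inputs are the rounded figures quoted above, so the results are only approximate.

```python
def reduced_error_rate(baseline_rate: float, relative_reduction: float) -> float:
    """Error rate left after cutting a baseline rate by a relative fraction."""
    return baseline_rate * (1.0 - relative_reduction)


def implied_relative_reduction(words_per_error_before: float,
                               words_per_error_after: float) -> float:
    """Relative reduction implied by going from 1-in-before to 1-in-after errors."""
    return 1.0 - words_per_error_before / words_per_error_after


# Apply a 30% relative reduction to the quoted "one word in 4 or 5" baselines.
for n in (4, 5):
    new_rate = reduced_error_rate(1.0 / n, 0.30)
    print(f"1 in {n} -> about 1 in {1.0 / new_rate:.1f}")

# Conversely, the reduction implied by each pair of quoted before/after figures.
for before in (4, 5):
    for after in (7, 8):
        pct = 100.0 * implied_relative_reduction(before, after)
        print(f"1 in {before} -> 1 in {after}: {pct:.1f}% fewer errors")
```

A reduction of exactly 30% moves one error in five words to roughly one in seven, so the quoted "one word in 7 or 8" fits a reduction somewhat above 30%, in line with the "over 30%" claim.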
Of course, there are still likely to be errors in both the English text and the translation into Chinese, and the results can sometimes be humorous. Still, the technology has matured enough to be quite useful.
Most significantly, we have attained an important goal: enabling an English speaker like me to present in Chinese in his or her own voice, which is what I demonstrated in China. This required a text-to-speech system that Microsoft researchers built using a few hours of speech from a native Chinese speaker and properties of my own voice taken from about one hour of pre-recorded (English) data, in this case recordings of previous speeches I’d made.
Though it was a limited test, the effect was dramatic, and the audience came alive in response. When I spoke in English, the system automatically combined all the underlying technologies to deliver a robust speech-to-speech experience: my voice speaking Chinese. You can see the demo in the video above.