Slate – If you’ve got an Android phone, try this: Hit the microphone icon on the home screen, then ask, “How many angstroms in a mile?” Use your normal speaking voice—don’t speak slowly or strain to over-pronounce “angstrom.” So long as you have a good Internet connection, the phone shouldn’t take more than a second to recognize your question and shoot back a reply: 1.609344 × 10^13.
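The phone's answer checks out: a mile is defined as exactly 1,609.344 meters, and an angstrom as 10^-10 meters, so the conversion is a one-liner:

```python
# Sanity-check the phone's answer: convert one mile to angstroms.
METERS_PER_MILE = 1609.344       # exact, by definition of the international mile
METERS_PER_ANGSTROM = 1e-10      # 1 angstrom = 10^-10 meters

angstroms_per_mile = METERS_PER_MILE / METERS_PER_ANGSTROM
print(f"{angstroms_per_mile:.6e}")  # 1.609344e+13
```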
Google Mobile – When we launched Voice Search more than two years ago, we wanted it to “just work” right out of the box, without an initial setup process. And so we built speech models broad enough to accommodate a wide variety of people, regardless of gender, age, or accent, or variations in pitch, pace, and other factors. But we always knew we could build a more accurate model by listening to your voice and learning how you, as a unique individual, speak. So today we’re launching personalized recognition.
If you’ve tried speech-recognition software in the past, you may be skeptical of Android’s capabilities. Older speech software required you to talk in a stilted manner, and it was so prone to error that it was usually easier just to give up and type. Today’s top-of-the-line desktop systems—like Nuance’s Dragon software—don’t ask you to talk funny, but they tend to be slow and use up a lot of your computer’s power when deciphering your words. Google’s system, on the other hand, offloads its processing to the Internet cloud. Everything you say to Android goes back to Google’s data centers, where powerful servers apply statistical modeling to determine what you’re saying. The process is fast, can be done from anywhere, and is uncannily accurate. You can speak normally (though if you want punctuation in your email, you’ve got to say “period” and “comma”), you can speak for as long as you’d like, and you can use the biggest words you can think of. It even works if you’ve got an accent.
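The round trip the article describes, record locally and decode remotely, looks roughly like this sketch. The server URL and the request/response format below are hypothetical stand-ins; Google's actual service API is not what's shown here:

```python
import urllib.request

# Hypothetical recognition endpoint. Google's real service and wire
# format are not public, so this URL is purely illustrative.
SERVER = "https://speech.example.com/recognize"

def build_recognition_request(audio_bytes: bytes) -> urllib.request.Request:
    """Package raw audio as an HTTP POST. All of the heavy statistical
    decoding happens on the server, not on the phone."""
    return urllib.request.Request(
        SERVER,
        data=audio_bytes,
        headers={"Content-Type": "audio/wav"},
    )

def recognize(audio_bytes: bytes) -> str:
    """Upload the utterance and return the transcript the server sends back."""
    req = build_recognition_request(audio_bytes)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

The design point is that the client stays thin: it ships compressed audio upstream and gets plain text back, which is why recognition quality can improve server-side without any phone update.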
How does Android’s speech system work so well? The magic of data. Speech recognition is one of a handful of Google’s artificial intelligence programs—the others are language translation and image search—that get their power by analyzing impossibly huge troves of information. For the speech system, the data are a large number of voice recordings. If you’ve used Android’s speech recognition system, Google Voice’s e-mail transcription service, Goog411 (a now-defunct information service), or some other Google speech-related service, there’s a good chance that the company has your voice somewhere on its servers. And it’s only because Google has your voice—and millions of others—that it can recognize mine.
There’s a lot of overlap between search and speech. To decipher your speech, Google’s system doesn’t just use recorded voices. It also relies on a host of other data, including billions of written search queries that it uses to predict the words you’re most likely saying. If you say “33rd and Sixth, NYC,” your “NYC” might sound like “and I see,” but Google knows that you’re probably saying “NYC,” because that’s what a lot of other people mean when they say that phrase. Altogether, Google’s speech-recognition program comprises many billions of pieces of text and audio; Cohen says that building just one part of the speech-recognition system required “roughly 70 CPU-years” of computer time. Google’s cloud of processors can do that amount of crunching in a single day. “This is one of the things that brought me to Google,” Cohen says. “We can now iterate much more quickly, experiment much more quickly, to train these enormous models and see what works.”
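The disambiguation Cohen describes can be sketched as a tie-break by a text-derived language model: when two transcripts sound alike, prefer the one people actually type. This is a toy illustration, not Google's pipeline, and the phrase counts are invented:

```python
from collections import Counter

# Pretend these counts came from billions of logged search queries.
# (Invented numbers, purely for illustration.)
query_counts = Counter({
    "33rd and sixth nyc": 5200,
    "33rd and sixth and i see": 3,
})

def pick_transcript(candidates):
    """Among acoustically plausible transcripts, prefer the phrase
    people actually search for most often."""
    return max(candidates, key=lambda phrase: query_counts[phrase])

hypotheses = ["33rd and sixth nyc", "33rd and sixth and i see"]
print(pick_transcript(hypotheses))  # → 33rd and sixth nyc

# Back-of-envelope on the "70 CPU-years in a single day" claim:
# that much work in 24 hours implies roughly this many processors.
print(70 * 365)  # 25550 CPU-days of work per calendar day
```

Real systems weight acoustic and language-model scores together rather than letting text counts decide outright, but the tie-breaking intuition is the same.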
Speech recognition is still a very young field. “We don’t do well enough at anything right now,” Cohen says. He notes that the system keeps getting better—and more and more people keep using Android’s voice search—but we’re still many years (and maybe even decades) away from what Cohen says is Google’s long-term vision for speech-recognition. “We want it to be totally ubiquitous,” he says. “No matter what the application is, no matter what you’re trying to do with your phone, we want you to be able to talk to your phone.”