Baidu’s Deep-Learning System is better at English and Mandarin Speech Recognition than most people

China’s leading Internet-search company, Baidu, has developed a voice system that can recognize English and Mandarin speech better than people, in some cases.

The new system, called Deep Speech 2, is especially significant in how it relies entirely on machine learning for translation. Whereas older voice-recognition systems include many handcrafted components to aid audio processing and transcription, the Baidu system learned to recognize words from scratch, simply by listening to thousands of hours of transcribed audio.

The technology relies on a powerful technique known as deep learning, which involves training a very large multilayered virtual network of neurons to recognize patterns in vast quantities of data. The Baidu app for smartphones lets users search by voice, and also includes a voice-controlled personal assistant called Duer. Voice queries are more popular in China because it is more time-consuming to input text, and because some people do not know how to use Pinyin, the phonetic system for transcribing Mandarin using Latin characters.

“Historically, people viewed Chinese and English as two vastly different languages, and so there was a need to design very different features,” says Andrew Ng, a former Stanford professor and Google researcher, and now chief scientist for the Chinese company. “The learning algorithms are now so general that you can just learn.”

Deep learning has its roots in ideas first developed more than 50 years ago, but in the past few years new mathematical techniques, combined with greater computer power and huge quantities of training data, have led to remarkable progress, especially in tasks that require some sort of visual or auditory perception. The technique has already improved the performance of voice recognition and image processing, and large companies including Google, Facebook, and Baidu are applying it to the massive data sets they own.

In developing Deep Speech 2, Baidu also created new hardware architecture for deep learning that runs seven times faster than the previous version. Deep learning usually relies on graphics processors, because these are good for the intensive parallel computations involved.

Arxiv – Deep Speech 2: End-to-End Speech Recognition in English and Mandarin


They show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech–two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech including noisy environments, accents and different languages. Key to our approach is our application of HPC techniques, resulting in a 7x speedup over our previous system. Because of this efficiency, experiments that previously took weeks now run in days. This enables us to iterate more quickly to identify superior architectures and algorithms. As a result, in several cases, our system is competitive with the transcription of human workers when benchmarked on standard datasets. Finally, using a technique called Batch Dispatch with GPUs in the data center, we show that our system can be inexpensively deployed in an online setting, delivering low latency when serving users at scale.


End-to-end deep learning presents the exciting opportunity to improve speech recognition systems continually with increases in data and computation. Indeed, our results show that, compared to the previous incarnation, Deep Speech has significantly closed the gap in transcription performance with human workers by leveraging more data and larger models. Further, since the approach is highly generic, we’ve shown that it can quickly be applied to new languages. Creating high-performing recognizers for two very different languages, English and Mandarin, required essentially no expert knowledge of the languages. Finally, we have also shown that this approach can be efficiently deployed by batching user requests together on a GPU server, paving the way to deliver end-to-end Deep Learning technologies to users.

To achieve these results, we have explored various network architectures, finding several effective techniques: enhancements to numerical optimization through SortaGrad and Batch Normalization, evaluation of RNNs with larger strides with bigram outputs for English, searching through both bidirectional and unidirectional models. This exploration was powered by a well optimized, High Performance Computing inspired training system that allows us to train new, full-scale models on our large datasets in just a few days.

Overall, we believe our results confirm and exemplify the value of end-to-end Deep Learning methods for speech recognition in several settings. In those cases where our system is not already comparable to humans, the difference has fallen rapidly, largely because of application-agnostic Deep Learning techniques. We believe these techniques will continue to scale, and thus conclude that the vision of a single speech system that outperforms humans in most scenarios is imminently achievable.