They trained a model based on recurrent neural network transducer (RNN-T) technology that is compact enough to reside on a phone. This means no more network latency or spottiness: the new recognizer is always available, even when you are offline. The model works at the character level, so as you speak, it outputs words character by character, just as if someone were typing out what you say in real time, exactly as you'd expect from a keyboard dictation system.
This video compares the production, server-side speech recognizer (left panel) to the new on-device recognizer (right panel) when recognizing the same spoken sentence. Video credit: Akshay Kannan and Elnaz Sarbar
History of Speech Recognition
Traditionally, speech recognition systems consisted of several components – an acoustic model that maps segments of audio (typically 10 millisecond frames) to phonemes, a pronunciation model that connects phonemes together to form words, and a language model that expresses the likelihood of given phrases. In early systems, each of these components was trained and optimized independently.
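For intuition, here is a toy sketch of that three-component pipeline. The frame labels, lexicon entries, and probabilities below are invented stand-ins for trained statistical models, not anything from the actual systems:

```python
# Hypothetical toy pipeline: an "acoustic model" labels 10 ms frames with
# phonemes, a pronunciation lexicon joins phonemes into candidate words,
# and a language model scores the candidates. All tables are invented.

def acoustic_model(frames):
    # Maps each audio frame to its most likely phoneme (toy lookup table).
    table = {"f1": "hh", "f2": "eh", "f3": "l", "f4": "ow"}
    return [table[f] for f in frames]

# Pronunciation model: phoneme sequence -> candidate words.
LEXICON = {("hh", "eh", "l", "ow"): ["hello", "hallow"]}

# Language model: prior likelihood of each word.
LANGUAGE_MODEL = {"hello": 0.9, "hallow": 0.01}

def recognize(frames):
    phonemes = tuple(acoustic_model(frames))
    candidates = LEXICON.get(phonemes, [])
    # Pick the candidate the language model considers most likely.
    return max(candidates, key=lambda w: LANGUAGE_MODEL.get(w, 0.0))

print(recognize(["f1", "f2", "f3", "f4"]))  # "hello"
```

Because each stage here is a separate table, improving one (say, the lexicon) does nothing for the others, which is exactly the independent-optimization problem the text describes.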
Around 2014, researchers began to focus on training a single neural network to directly map an input audio waveform to an output sentence. This sequence-to-sequence approach, which generates a sequence of words or graphemes directly from a sequence of audio features, led to the development of “attention-based” and “listen-attend-spell” models.
Recurrent Neural Network Transducers
RNN-Ts are a form of sequence-to-sequence model that does not employ attention mechanisms. Unlike most sequence-to-sequence models, which typically need to process the entire input sequence (the waveform, in this case) before producing an output (the sentence), the RNN-T continuously processes input samples and streams output symbols, a welcome property for speech dictation. In Google’s implementation, the output symbols are the characters of the alphabet. The RNN-T recognizer outputs characters one by one, as you speak, with white spaces where appropriate. It does this with a feedback loop that feeds the symbols predicted by the model back into it to predict the next symbols, as described in the figure below.
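This streaming behavior can be illustrated with a minimal greedy-decoding loop. The `toy_joint` function below is a hypothetical stand-in for the learned joint and prediction networks; a real RNN-T computes a softmax over characters plus a special blank symbol from encoder and prediction-network states:

```python
BLANK = None  # special symbol meaning "no output; advance to the next frame"

def greedy_decode(frames, joint):
    """Greedy RNN-T-style decoding: for each audio frame, repeatedly ask
    the joint function for a symbol, appending each emitted character to
    the transcript, until it returns BLANK, then move to the next frame."""
    transcript = []
    for frame in frames:
        emitted_here = 0  # symbols this frame has produced so far
        while True:
            sym = joint(frame, transcript, emitted_here)
            if sym is BLANK:
                break  # nothing more for this frame; consume the next one
            transcript.append(sym)  # emit a character and feed it back
            emitted_here += 1
    return "".join(transcript)

def toy_joint(frame, transcript, emitted_here):
    # Hypothetical stand-in: each "frame" is pre-labelled with the
    # characters it should yield; a real joint network infers them
    # from acoustic and prediction-network features.
    return frame[emitted_here] if emitted_here < len(frame) else BLANK

print(greedy_decode(["he", "llo", " wor", "ld"], toy_joint))  # hello world
```

The key point is structural: the loop emits characters as frames arrive and never looks ahead at future audio, which is why the recognizer can type out your words while you are still speaking.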
Google developed a new training technique that reduced the word error rate by 5%, but training became even more computationally intensive.
They developed a parallel implementation so the RNN-T loss function could run efficiently in large batches on Google’s high-performance Cloud TPU v2 hardware. This yielded an approximate 3x speedup in training.
The RNN-T they trained offers the same accuracy as the traditional server-based models but is only 450MB, essentially making smarter use of parameters and packing information more densely. However, 450MB was still too much for a phone, and propagating signals through such a large network can be slow.
Google further reduced the model size by using the parameter quantization and hybrid kernel techniques they developed in 2016 and made publicly available through the model optimization toolkit in the TensorFlow Lite library. Model quantization delivered a 4x compression with respect to the trained floating point models and a 4x speedup at run-time, enabling their RNN-T to run faster than real-time speech on a single core. After compression, the final model is 80MB.
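The compression arithmetic follows from the storage format: each weight stored as an 8-bit integer instead of a 32-bit float takes a quarter of the space. A minimal sketch of symmetric linear quantization, with the caveat that the scale handling here is simplified (real toolkits such as TensorFlow Lite use per-channel scales, zero points, and optimized integer kernels):

```python
def quantize(weights, num_bits=8):
    """Map float weights onto signed integers in [-(2**(b-1)-1), 2**(b-1)-1]."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for 8 bits
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.81, -0.45, 0.10, 1.27]
q, scale = quantize(weights)
restored = dequantize(q, scale)
# Each value now fits in 1 byte instead of 4: a 4x smaller model on disk,
# at the cost of a rounding error bounded by scale/2 per weight.
```

Beyond the size win, integer arithmetic is also what enables the run-time speedup the text mentions, since 8-bit multiply-accumulate operations are much cheaper than floating-point ones on mobile CPUs.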
Their new all-neural, on-device Gboard speech recognizer is initially being launched to all Pixel phones in American English only.
SOURCES – Google Blog, arXiv
Written By Brian Wang, Nextbigfuture.com