They trained a model based on recurrent neural network transducer (RNN-T) technology that is compact enough to reside on a phone. This means no more network latency or spottiness: the new recognizer is always available, even when you are offline. The model works at the character level, so as you speak, it outputs words character by character, just as if someone were typing out what you say in real time, exactly as you'd expect from a keyboard dictation system.
This video compares the production, server-side speech recognizer (left panel) to the new on-device recognizer (right panel) when recognizing the same spoken sentence. Video credit: Akshay Kannan and Elnaz Sarbar
History of Speech Recognition
Traditionally, speech recognition systems consisted of several components – an acoustic model that maps segments of audio (typically 10 millisecond frames) to phonemes, a pronunciation model that connects phonemes together to form words, and a language model that expresses the likelihood of given phrases. In early systems, each of these components was trained and optimized independently.
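For intuition, here is a toy sketch of that three-component pipeline. The frame labels, lexicon entries, and probabilities below are invented stand-ins for trained statistical models, not anything from the actual systems:

```python
# Hypothetical toy pipeline: an "acoustic model" labels 10 ms frames with
# phonemes, a pronunciation lexicon joins phonemes into candidate words,
# and a language model scores the candidates. All tables are invented.

def acoustic_model(frames):
    # Maps each audio frame to its most likely phoneme (toy lookup table).
    table = {"f1": "hh", "f2": "eh", "f3": "l", "f4": "ow"}
    return [table[f] for f in frames]

# Pronunciation model: phoneme sequence -> candidate words.
LEXICON = {("hh", "eh", "l", "ow"): ["hello", "hallow"]}

# Language model: prior likelihood of each word.
LANGUAGE_MODEL = {"hello": 0.9, "hallow": 0.01}

def recognize(frames):
    phonemes = tuple(acoustic_model(frames))
    candidates = LEXICON.get(phonemes, [])
    # Pick the candidate the language model considers most likely.
    return max(candidates, key=lambda w: LANGUAGE_MODEL.get(w, 0.0))

print(recognize(["f1", "f2", "f3", "f4"]))  # "hello"
```

Because each stage here is a separate table, improving one (say, the lexicon) does nothing for the others, which is exactly the independent-optimization problem the text describes.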
Around 2014, researchers began to focus on training a single neural network to directly map an input audio waveform to an output sentence. This sequence-to-sequence approach, which generates a sequence of words or graphemes directly from a sequence of audio features, led to the development of “attention-based” and “listen-attend-spell” models.
Recurrent Neural Network Transducers
RNN-Ts are a form of sequence-to-sequence model that does not employ attention mechanisms. Unlike most sequence-to-sequence models, which typically need to process the entire input sequence (the waveform, in this case) before producing an output (the sentence), the RNN-T continuously processes input samples and streams output symbols, a welcome property for speech dictation. In Google’s implementation, the output symbols are the characters of the alphabet. The RNN-T recognizer outputs characters one by one, as you speak, with white spaces where appropriate. It does this with a feedback loop that feeds the symbols predicted by the model back into it to predict the next symbols, as described in the figure below.
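This streaming behavior can be illustrated with a minimal greedy-decoding loop. The `toy_joint` function below is a hypothetical stand-in for the learned joint and prediction networks; a real RNN-T computes a softmax over characters plus a special blank symbol from encoder and prediction-network states:

```python
BLANK = None  # special symbol meaning "no output; advance to the next frame"

def greedy_decode(frames, joint):
    """Greedy RNN-T-style decoding: for each audio frame, repeatedly ask
    the joint function for a symbol, appending each emitted character to
    the transcript, until it returns BLANK, then move to the next frame."""
    transcript = []
    for frame in frames:
        emitted_here = 0  # symbols this frame has produced so far
        while True:
            sym = joint(frame, transcript, emitted_here)
            if sym is BLANK:
                break  # nothing more for this frame; consume the next one
            transcript.append(sym)  # emit a character and feed it back
            emitted_here += 1
    return "".join(transcript)

def toy_joint(frame, transcript, emitted_here):
    # Hypothetical stand-in: each "frame" is pre-labelled with the
    # characters it should yield; a real joint network infers them
    # from acoustic and prediction-network features.
    return frame[emitted_here] if emitted_here < len(frame) else BLANK

print(greedy_decode(["he", "llo", " wor", "ld"], toy_joint))  # hello world
```

The key point is structural: the loop emits characters as frames arrive and never looks ahead at future audio, which is why the recognizer can type out your words while you are still speaking.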
Google developed a new training technique that reduced the word error rate by 5%, but training became even more computationally intensive.
They developed a parallel implementation so the RNN-T loss function could run efficiently in large batches on Google’s high-performance Cloud TPU v2 hardware. This yielded an approximate 3x speedup in training.
The RNN-T they trained offers the same accuracy as the traditional server-based models but is only 450MB, essentially making smarter use of parameters and packing information more densely. However, 450MB was still too much for a phone, and propagating signals through such a large network can be slow.
Google further reduced the model size by using the parameter quantization and hybrid kernel techniques they developed in 2016 and made publicly available through the model optimization toolkit in the TensorFlow Lite library. Model quantization delivered a 4x compression with respect to the trained floating point models and a 4x speedup at run-time, enabling their RNN-T to run faster than real-time speech on a single core. After compression, the final model is 80MB.
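The compression arithmetic follows from the storage format: each weight stored as an 8-bit integer instead of a 32-bit float takes a quarter of the space. A minimal sketch of symmetric linear quantization, with the caveat that the scale handling here is simplified (real toolkits such as TensorFlow Lite use per-channel scales, zero points, and optimized integer kernels):

```python
def quantize(weights, num_bits=8):
    """Map float weights onto signed integers in [-(2**(b-1)-1), 2**(b-1)-1]."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for 8 bits
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.81, -0.45, 0.10, 1.27]
q, scale = quantize(weights)
restored = dequantize(q, scale)
# Each value now fits in 1 byte instead of 4: a 4x smaller model on disk,
# at the cost of a rounding error bounded by scale/2 per weight.
```

Beyond the size win, integer arithmetic is also what enables the run-time speedup the text mentions, since 8-bit multiply-accumulate operations are much cheaper than floating-point ones on mobile CPUs.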
Their new all-neural, on-device Gboard speech recognizer is initially being launched to all Pixel phones in American English only.
SOURCES – Google Blog, arXiv
Written By Brian Wang, Nextbigfuture.com