Nvidia AI Speech With Complex Rhythm and Tone

Human speech has musicality. Not everyone speaks like Martin Luther King Jr. or raps like the cast of the musical Hamilton, but all speech has rhythm and tone.

There is still a gap between AI-synthesized speech and the human speech we hear in daily conversation and in the media, because people speak with complex rhythm, intonation and timbre that is challenging for AI to emulate. NVIDIA researchers are closing that gap by building models and tools for high-quality, controllable speech synthesis that capture the richness of human speech without audio artifacts.

NVIDIA's top researchers have built models that make AI speech more musical. One of those projects has the goal of replicating the voice of the great singer Etta James.

This is an area where real-world AI will surpass the AI of movies and TV. The computer voices in Star Trek and HAL in 2001: A Space Odyssey were performed by human voice actors, so they lost none of the rhythm and tone of human speech.

Expressive speech synthesis is just one element of NVIDIA Research’s work in conversational AI — a field that also encompasses natural language processing, automated speech recognition, keyword detection, audio enhancement and more.

Giving Voice to AI Developers, Researchers
With NVIDIA NeMo — an open-source Python toolkit for GPU-accelerated conversational AI — researchers, developers and creators gain a head start in experimenting with, and fine-tuning, speech models for their own applications.

Easy-to-use APIs and pretrained models in NeMo help researchers develop and customize models for text-to-speech, natural language processing and real-time automated speech recognition. Several of the models are trained with tens of thousands of hours of audio data on NVIDIA DGX systems. Developers can fine-tune any model for their use cases, speeding up training with mixed-precision computing on NVIDIA Tensor Core GPUs.
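As a rough illustration of how little code this takes, the sketch below pulls two pretrained NeMo checkpoints from NGC and synthesizes a short audio clip. This is a minimal example assuming a NeMo 1.x install; the checkpoint names and output path are illustrative and may differ across releases.

```python
import soundfile as sf
import nemo.collections.tts as nemo_tts

# Spectrogram generator (text -> mel spectrogram) and vocoder (mel -> waveform).
# Both are pretrained checkpoints downloaded from NGC; the names here are
# illustrative and may vary by NeMo release.
spec_generator = nemo_tts.models.Tacotron2Model.from_pretrained("tts_en_tacotron2")
vocoder = nemo_tts.models.HifiGanModel.from_pretrained("tts_hifigan")

# Tokenize the input text, generate a mel spectrogram, then vocode to audio.
tokens = spec_generator.parse("Human speech has musicality.")
spectrogram = spec_generator.generate_spectrogram(tokens=tokens)
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

# Save the waveform; 22,050 Hz is the usual sample rate for these checkpoints.
sf.write("speech.wav", audio.to("cpu").detach().numpy()[0], 22050)
```

Fine-tuning follows the same pattern: NeMo models train through a PyTorch Lightning Trainer, where setting `precision=16` enables the Tensor Core mixed-precision path mentioned above.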

Through NGC, NVIDIA NeMo also offers models trained on Mozilla Common Voice, a dataset with nearly 14,000 hours of crowd-sourced speech data in 76 languages. Supported by NVIDIA, the project aims to democratize voice technology with the world's largest open voice dataset.
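A hedged sketch of how one of those NGC checkpoints might be pulled down and used for transcription (again assuming NeMo 1.x; `stt_de_quartznet15x5` is one example of a published model whose training data includes Common Voice, and the audio path is hypothetical):

```python
import nemo.collections.asr as nemo_asr

# Download a pretrained speech-recognition checkpoint from NGC. The model
# name is illustrative; consult the NGC catalog for currently published models.
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(
    model_name="stt_de_quartznet15x5"
)

# Transcribe a local 16 kHz mono WAV file (argument name per NeMo 1.x).
transcripts = asr_model.transcribe(paths2audio_files=["sample_de.wav"])
print(transcripts[0])
```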

Interspeech brings together more than 1,000 researchers to showcase groundbreaking work in speech technology. At this week’s conference, NVIDIA Research is presenting conversational AI model architectures as well as fully formatted speech datasets for developers.

Kai-Fu Lee, who manages a multi-billion-dollar AI fund, predicts that AI speech mastery and text recognition will surpass human levels, just as AI vision has surpassed human vision. AI speech, voice and text recognition is the next big thing in AI.

SOURCES- Nvidia, Hamilton, Etta James
Written by Brian Wang, Nextbigfuture.com