Nvidia AI Speech With Complex Rhythm and Tone

Human speech has musicality. Not everyone talks like Martin Luther King Jr. or raps like the cast of the musical Hamilton, but there is rhythm and tone in everyone's speech.

There is still a gap between AI-synthesized speech and the human speech we hear in daily conversation and in the media, because people speak with complex rhythm, intonation and timbre that is challenging for AI to emulate. NVIDIA researchers are closing that gap by building models and tools for high-quality, controllable speech synthesis that capture the richness of human speech without audio artifacts.

Top NVIDIA researchers have models that make AI speech more musical. One of them has the goal of replicating the voice of the great singer Etta James.

This is an area where real AI will surpass the AI of movies and TV. The computer voice in Star Trek and HAL 9000 in 2001: A Space Odyssey were performed by human voice actors, so those voices lost none of the rhythm and tone of human speech.

Expressive speech synthesis is just one element of NVIDIA Research’s work in conversational AI — a field that also encompasses natural language processing, automated speech recognition, keyword detection, audio enhancement and more.

Giving Voice to AI Developers, Researchers
With NVIDIA NeMo — an open-source Python toolkit for GPU-accelerated conversational AI — researchers, developers and creators gain a head start in experimenting with, and fine-tuning, speech models for their own applications.

Easy-to-use APIs and pretrained models in NeMo help researchers develop and customize models for text-to-speech, natural language processing and real-time automated speech recognition. Several of the models are trained with tens of thousands of hours of audio data on NVIDIA DGX systems. Developers can fine-tune any model for their use cases, speeding up training with mixed-precision computing on NVIDIA Tensor Core GPUs.
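As a rough sketch of how that text-to-speech workflow fits together (the checkpoint names "tts_en_fastpitch" and "tts_hifigan" are examples, and the exact API can vary between NeMo releases), a pretrained spectrogram generator and vocoder can be chained to turn text into a waveform:

```python
# Sketch of NeMo text-to-speech inference. Checkpoint names and method
# signatures are illustrative and may differ across NeMo versions.
import soundfile as sf
import nemo.collections.tts as nemo_tts

# Spectrogram generator (text -> mel spectrogram) and vocoder (mel -> audio),
# both pulled as pretrained checkpoints from NGC.
spec_gen = nemo_tts.models.FastPitchModel.from_pretrained("tts_en_fastpitch")
vocoder = nemo_tts.models.HifiGanModel.from_pretrained("tts_hifigan")

# Tokenize the text, generate a mel spectrogram, then vocode it to audio.
tokens = spec_gen.parse("Human speech has musicality.")
spectrogram = spec_gen.generate_spectrogram(tokens=tokens)
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

# Write the waveform out at 22.05 kHz, the sample rate these English models use.
sf.write("speech.wav", audio.detach().cpu().numpy()[0], 22050)
```

Because NeMo models train through PyTorch Lightning, enabling mixed precision on Tensor Core GPUs during fine-tuning is typically just a matter of passing precision=16 to the Lightning trainer.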

Through NGC, NVIDIA NeMo also offers models trained on Mozilla Common Voice, a dataset with nearly 14,000 hours of crowd-sourced speech data in 76 languages. Supported by NVIDIA, the project aims to democratize voice technology with the world’s largest open data voice dataset.
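As an illustrative example (list_available_models is a standard NeMo class method, but the catalog of checkpoints depends on the installed release), developers can query which pretrained checkpoints a model class can download from NGC before choosing one to fine-tune:

```python
# Sketch: list the pretrained checkpoints a NeMo model class can fetch from
# NGC; the entries that appear vary by NeMo version.
import nemo.collections.asr as nemo_asr

for info in nemo_asr.models.EncDecCTCModel.list_available_models():
    print(info.pretrained_model_name)
```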

Interspeech brings together more than 1,000 researchers to showcase groundbreaking work in speech technology. At this week’s conference, NVIDIA Research is presenting conversational AI model architectures as well as fully formatted speech datasets for developers.

Kai-Fu Lee, who manages a multi-billion-dollar AI fund, predicts that AI speech and text recognition will surpass human levels the way AI vision has already surpassed human vision. Speech, voice and text recognition is the next big thing in AI.

SOURCES- Nvidia, Hamilton, Etta James
Written by Brian Wang, Nextbigfuture.com

11 thoughts on “Nvidia AI Speech With Complex Rhythm and Tone”

  1. It's all fun and games until they start showing some meta-cognition and asking questions and making assertions outside their assigned roles.

    – "I think I'm not real".

    – "See?, with the new GPT-5, they started blabbing this existentialist gibberish. Customers are upset".

    – "Too much model recursivity, turn it off".

  2. "Intonation patterns… are common across languages", even tonal ones like Thai? I lived in Thailand many years and noticed native speakers of Thai found mastering tonality in English MUCH more difficult than learners of Thai mastering Thai's five tones. Thais would typically sound like scripted robots as they were taught "English doesn't have tones".

  3. And the search for linguistic universals continues. But where did it say that? I saw something saying that they have a model trained on 76 languages, but it didn't say that they each followed the same pattern. This might be like training something on 76 different art styles – it lets it quickly figure out where in that space it should be from a smaller sample later, but doesn't imply that all art follows the same rules.

    I've seen material from a voice coach addressing the Indian accent, and it seemed like a fair amount of the difference in accents was based on different intonation patterns.

  4. Noticed that all human verbal communication uses intonation patterns that are common across languages? Also, the subconscious holds the full content of a communication string, and most people just blabber in slow motion brainlessly. Anyway, the fact that intonation patterns are common across languages and cultures suggests it's a fully genetic component of language.

  5. I just want a voice synthesizer chip that sounds like Marvin Miller. Perfect for hobby robotics.

  6. This would be great, actually; one of the downsides of voice acting in games is how designers sometimes have to jump through hoops to avoid dynamic text in lines — particularly giving nicknames to the protagonist when the player can put in a name for them.

  7. Maybe videogames will finally get fully voiced NPCs that don't sound like bots. Or that can actually say something that wasn't previously recorded.

    Mix that with GPT-3 and beyond, and videogame characters could actually surprise you.
