Baidu has demonstrated that a single deep learning voice system could learn to reproduce thousands of speaker identities, with less than half an hour of training data for each speaker. This capability was enabled by learning shared and discriminative information from speakers.
Examples are available of original speech recordings alongside speech synthesized from different numbers of cloning samples. The synthesized speech had some noise distortion, but the samples did sound like the original speakers.
Baidu attempted to learn speaker characteristics from only a few utterances (i.e., sentences a few seconds long). This problem is commonly known as “voice cloning.” Voice cloning is expected to have significant applications in personalizing human-machine interfaces.
Baidu tried two fundamental approaches to voice cloning: speaker adaptation and speaker encoding.
Speaker adaptation is based on fine-tuning a multi-speaker generative model on a few cloning samples, using backpropagation-based optimization. Adaptation can be applied to the whole model or only to the low-dimensional speaker embeddings. The latter requires far fewer parameters to represent each speaker, although it yields a longer cloning time and lower audio quality.
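The embedding-only variant of speaker adaptation can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not Baidu's implementation: the "generative model" is a toy frozen linear map, the "cloning samples" are noisy outputs of that map, and the names synthesize, reconstruction_error, and adapt_embedding are hypothetical. The point is only that gradient descent updates the speaker embedding while the model weights stay fixed.

```python
import random

def synthesize(weights, embedding):
    # Toy stand-in (assumption) for a frozen multi-speaker generative model:
    # a linear map from a speaker embedding to an output feature vector.
    return [sum(w * e for w, e in zip(row, embedding)) for row in weights]

def reconstruction_error(weights, embedding, cloning_samples):
    # Mean squared error between the model output and the cloning samples.
    output = synthesize(weights, embedding)
    total = 0.0
    for target in cloning_samples:
        total += sum((o - t) ** 2 for o, t in zip(output, target))
    return total / (len(cloning_samples) * len(output))

def adapt_embedding(weights, cloning_samples, dim, steps=2000, lr=0.3):
    # Gradient descent on the embedding only; `weights` is never modified.
    embedding = [0.0] * dim
    n_terms = len(cloning_samples) * len(weights)
    for _ in range(steps):
        output = synthesize(weights, embedding)
        grad = [0.0] * dim
        for target in cloning_samples:
            for row, o, t in zip(weights, output, target):
                for j in range(dim):
                    grad[j] += 2.0 * (o - t) * row[j] / n_terms
        embedding = [e - lr * g for e, g in zip(embedding, grad)]
    return embedding

rng = random.Random(0)
dim, out_dim = 3, 8
weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(out_dim)]

# A hypothetical unseen speaker, observed through five noisy cloning samples.
true_embedding = [0.5, -0.3, 0.8]
cloning_samples = [
    [v + rng.gauss(0, 0.01) for v in synthesize(weights, true_embedding)]
    for _ in range(5)
]

baseline_error = reconstruction_error(weights, [0.0] * dim, cloning_samples)
learned = adapt_embedding(weights, cloning_samples, dim)
adapted_error = reconstruction_error(weights, learned, cloning_samples)
```

Only three numbers are learned for the new speaker in this sketch, which mirrors the low parameter count described above. Whole-model adaptation would instead also update the entries of weights.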
Speaker encoding is based on training a separate model to directly infer a new speaker embedding from the cloning audio samples; that embedding is then used with a multi-speaker generative model. The speaker encoding model has time-and-frequency-domain processing blocks to retrieve speaker identity information from each audio sample, and attention blocks to combine them in an optimal way. The advantages of speaker encoding include fast cloning time (only a few seconds) and a low number of parameters to represent each speaker, making it favorable for low-resource deployment.
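The attention step that merges per-sample information into a single speaker embedding can be sketched as a softmax-weighted average. This is a simplified illustration under stated assumptions: in the real system both the per-sample embeddings and the attention scores would come from learned layers, whereas here they are hand-written inputs, and the function names softmax and combine_with_attention are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax: subtract the maximum before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combine_with_attention(sample_embeddings, attention_scores):
    # Weighted average of per-sample embeddings, one weight per cloning sample.
    weights = softmax(attention_scores)
    dim = len(sample_embeddings[0])
    return [
        sum(w * emb[k] for w, emb in zip(weights, sample_embeddings))
        for k in range(dim)
    ]

# Hypothetical embeddings extracted from two cloning samples.
sample_embeddings = [[1.0, 0.0], [0.0, 1.0]]

# Equal scores give a plain average of the two samples.
uniform = combine_with_attention(sample_embeddings, [0.0, 0.0])

# A much higher score lets the first sample dominate the result.
dominant = combine_with_attention(sample_embeddings, [10.0, 0.0])
```

Because this combination is a single forward computation with no optimization loop, it is consistent with the fast cloning time described above.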
In terms of the naturalness of the speech and its similarity to the original speaker, both approaches can achieve good performance, even with very few cloning audio samples. While speaker adaptation can achieve better naturalness and similarity, the cloning time and required memory for the speaker encoding approach are significantly lower, making it favorable for low-resource deployment.
Improvements in dataset quality will result in higher naturalness and similarity of generated samples. Also, increasing the number and diversity of speakers should enable a more meaningful speaker embedding space, which can improve the similarity obtained by both approaches. Baidu expects its techniques to benefit significantly from a large-scale, high-quality multi-speaker speech dataset.
Baidu believes that there are many promising directions for improvement in voice cloning. Advances in meta-learning, i.e., a systematic approach to learning to learn during training, could improve voice cloning, e.g., by integrating speaker adaptation or encoding into training, or by inferring the model weights in a more flexible way than the speaker embeddings currently allow.