The goal of text-to-speech (TTS) is to generate high-quality, diverse speech that sounds as if it were spoken by real people. Prosody, speaker identity (such as gender, accent, and timbre), speaking and singing styles, and more all contribute to the richness of human speech. TTS systems have greatly improved in intelligibility and naturalness as neural networks and deep learning have progressed; some systems (such as NaturalSpeech) have even achieved human-level speech quality on single-speaker recording-studio benchmark datasets.
Earlier recording-studio datasets, limited to a small number of speakers, lacked the diversity needed to capture the wide variety of speaker identities, prosodies, and styles in human speech. However, with few-shot or zero-shot techniques, TTS models can be trained on a large corpus to learn these variations and then generalize to virtually unlimited unseen scenarios. Quantizing the continuous speech waveform into discrete tokens and modeling these tokens with autoregressive language models is common in today's large-scale TTS systems.
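To make that prior approach concrete, here is a toy sketch of residual vector quantization, the scheme behind discrete-token codecs used by such systems. The codebook sizes, stage count, and function names are illustrative assumptions, not any specific system's implementation:

```python
import torch

def residual_vector_quantize(z, codebooks):
    """Toy residual VQ: each stage quantizes the residual left by the
    previous stage, so every frame becomes several discrete tokens."""
    residual, tokens = z, []
    for cb in codebooks:                        # cb: (codebook_size, dim)
        dists = torch.cdist(residual, cb)       # (frames, codebook_size)
        idx = dists.argmin(dim=-1)              # nearest code per frame
        tokens.append(idx)
        residual = residual - cb[idx]           # pass the residual onward
    return torch.stack(tokens, dim=-1)          # (frames, num_stages)

frames, dim = 100, 64
z = torch.randn(frames, dim)                    # continuous codec latents
codebooks = [torch.randn(1024, dim) for _ in range(8)]
tokens = residual_vector_quantize(z, codebooks) # 8 tokens per frame
```

With several quantizer stages per frame, the token sequence an acoustic model must predict grows many times longer than the frame sequence itself, which is the bottleneck NaturalSpeech 2 sidesteps.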
New Microsoft research introduces NaturalSpeech 2, a TTS system that uses latent diffusion models to deliver expressive prosody, good robustness, and, most importantly, strong zero-shot speech synthesis. The researchers began by training a neural audio codec that uses a codec encoder to transform a speech waveform into a sequence of latent vectors and a codec decoder to restore the original waveform. They then use a diffusion model to generate these latent vectors, conditioned on prior vectors derived from a phoneme encoder, duration predictor, and pitch predictor.
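A minimal PyTorch-style sketch of such a codec encoder/decoder pair operating on continuous latents is shown below. The single-convolution layers, dimensions, and hop size are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class CodecEncoder(nn.Module):
    """Maps a waveform to a sequence of continuous latent vectors.
    A strided 1-D convolution stands in for the real codec encoder."""
    def __init__(self, latent_dim=128, hop=320):
        super().__init__()
        self.conv = nn.Conv1d(1, latent_dim, kernel_size=hop, stride=hop)

    def forward(self, wav):                  # wav: (batch, samples)
        z = self.conv(wav.unsqueeze(1))      # (batch, latent_dim, frames)
        return z.transpose(1, 2)             # (batch, frames, latent_dim)

class CodecDecoder(nn.Module):
    """Reconstructs the waveform from the continuous latent sequence."""
    def __init__(self, latent_dim=128, hop=320):
        super().__init__()
        self.deconv = nn.ConvTranspose1d(latent_dim, 1,
                                         kernel_size=hop, stride=hop)

    def forward(self, z):                    # z: (batch, frames, latent_dim)
        return self.deconv(z.transpose(1, 2)).squeeze(1)

wav = torch.randn(2, 32000)                  # e.g., 2 seconds at 16 kHz
enc, dec = CodecEncoder(), CodecDecoder()
latents = enc(wav)                           # continuous, no quantization
recon = dec(latents)                         # back to (batch, samples)
```

Note that each frame stays a single continuous vector, so the sequence the acoustic model works over is just the frame sequence.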
The following design decisions are discussed in the paper:
- In previous work, speech is typically quantized with multiple residual quantizers to preserve the quality of the neural codec's speech reconstruction. This heavily burdens the acoustic model (the autoregressive language model) because the resulting sequence of discrete tokens is very long, as in the residual VQ sketch above. Instead, the team employs continuous vectors, which shorten the sequence and carry more information for fine-grained speech reconstruction.
- Replacing autoregressive models with diffusion models.
- In-context learning through speech prompting mechanisms. The team designed speech prompting mechanisms to promote in-context learning in the diffusion model and the pitch/duration predictors, improving zero-shot capability by encouraging the model to follow the characteristics of the speech prompt (see the sketch after this list).
- NaturalSpeech 2 is more reliable and stable than its autoregressive predecessors, as it requires only a single acoustic model (the diffusion model) instead of two-stage token prediction. Moreover, its duration/pitch prediction and non-autoregressive generation make it possible to apply styles beyond plain speech, such as a singing voice.
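The role of the diffusion model and the speech prompt can be illustrated with a generic DDPM-style sampler over the continuous latents. The denoiser, conditioning scheme, and noise schedule below are illustrative assumptions; the actual NaturalSpeech 2 formulation differs in its details:

```python
import torch
import torch.nn as nn

class LatentDenoiser(nn.Module):
    """Toy noise predictor conditioned on the prior (from the phoneme
    encoder and duration/pitch predictors) and a speech-prompt encoding."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

    def forward(self, z_t, t, prior, prompt):
        # Broadcast the scalar timestep and a pooled prompt across frames.
        t_emb = t.view(1, 1, 1).expand(z_t.shape[0], z_t.shape[1], 1)
        p = prompt.mean(dim=1, keepdim=True).expand_as(z_t)
        return self.net(torch.cat([z_t, prior, p, t_emb], dim=-1))

@torch.no_grad()
def sample(denoiser, prior, prompt, steps=50):
    """DDPM ancestral sampling over the latent sequence; all frames are
    denoised in parallel (non-autoregressive), guided by prior + prompt."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    z = torch.randn_like(prior)                 # start from pure noise
    for i in reversed(range(steps)):
        t = torch.tensor(i / steps)
        eps = denoiser(z, t, prior, prompt)     # predicted noise
        z = (z - betas[i] / torch.sqrt(1 - alpha_bar[i]) * eps) \
            / torch.sqrt(alphas[i])
        if i > 0:
            z = z + torch.sqrt(betas[i]) * torch.randn_like(z)
    return z

denoiser = LatentDenoiser()
prior = torch.randn(1, 200, 128)   # prior latents for the target utterance
prompt = torch.randn(1, 50, 128)   # latents of a few seconds of prompt speech
latents = sample(denoiser, prior, prompt)  # then fed to the codec decoder
```

Because every latent frame is refined simultaneously rather than predicted token by token, there is no autoregressive error accumulation, which is the stability advantage the bullets above describe.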
To demonstrate the effectiveness of these designs, the researchers trained NaturalSpeech 2 with 400 million model parameters on 44,000 hours of speech data. They then used it to synthesize speech in zero-shot scenarios (with only a few seconds of speech prompt) across various speaker identities, prosodies, and styles (e.g., singing). The findings show that NaturalSpeech 2 outperforms previous strong TTS systems in experiments and generates natural speech in zero-shot settings. It achieves prosody more similar to both the speech prompt and the ground-truth speech, and it reaches naturalness comparable to or better than (in terms of CMOS) the ground-truth speech on the LibriTTS and VCTK test sets. Experimental results also show that it can generate singing voices in a novel timbre from a short singing prompt or, interestingly, from only a speech prompt, unlocking truly zero-shot singing synthesis.
In the future, the team plans to investigate efficient methods such as consistency models to speed up the diffusion model, and to explore joint training of speaking and singing voices to enable more powerful mixed speaking-and-singing capabilities.
Check out the Paper and project page. Don't forget to join our 20k+ ML SubReddit, Discord channel, and email newsletter, where we share the latest AI research news, exciting AI projects, and more. If you have any questions about the article above or if we missed anything, feel free to email us at [email protected]
🚀 Check out 100 AI tools at AI Tools Club
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast and has a strong interest in the scope of application of artificial intelligence in various fields. She is passionate about exploring new advances in technology and their real-life applications.