With the rise of human-machine interaction and entertainment applications, text-to-speech (TTS) and singing voice synthesis (SVS) have become central speech synthesis tasks, both striving to generate realistic human audio. Methods based on deep neural networks (DNNs) have largely taken over the field. Typically, a two-stage pipeline is used: an acoustic model converts text and other control information into acoustic features (such as mel spectrograms), and a vocoder then converts those acoustic features into audible waveforms.
The two-stage pipeline has been successful because it acts as a “relay” that tames the dimension explosion of mapping short text to long audio at a high sample rate. The frame-level acoustic features that the acoustic model produces, typically mel spectrograms, strongly determine the quality of the synthesized speech. Convolutional neural networks (CNNs) and transformers are frequently employed in standard methods such as Tacotron, DurIAN, and FastSpeech to predict the mel spectrogram. More recently, diffusion models have attracted wide interest for their ability to generate high-quality samples. A diffusion model, also known as a score-based model, consists of two processes: a diffusion process that gradually perturbs data into noise, and a reverse process that slowly transforms noise back into data. The diffusion model's serious flaw is that generation requires many iterations. Several diffusion-based techniques have been proposed for acoustic modeling in speech synthesis, but most of them still suffer from slow inference.
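To make the speed problem concrete, here is a toy numpy sketch of the two diffusion processes described above: a closed-form forward step that perturbs a clean "mel frame" into noise, and a reverse loop that must call a denoiser once per timestep. The schedule values, shapes, and function names are illustrative assumptions, not taken from any of the cited models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D "mel frame" standing in for one 80-bin mel spectrogram column.
x0 = np.sin(np.linspace(0, 3 * np.pi, 80))

# Linear noise schedule (hypothetical values, for illustration only).
T = 100
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def forward_diffuse(x0, t):
    """Diffusion process: perturb clean data toward Gaussian noise, one closed-form step."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return xt, noise

def reverse_denoise(xT, predict_noise):
    """Reverse process: T sequential denoising steps, one predictor call each.
    This loop is the source of the slow inference the article describes."""
    x = xT
    for t in reversed(range(T)):
        eps = predict_noise(x, t)  # normally a trained neural network
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x = x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x
```

The point of the sketch is the loop structure: even with a fast network, inference cost scales linearly with the number of reverse steps.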
Grad-TTS formulates the noise-to-mel-spectrogram transformation as a stochastic differential equation (SDE) and generates by solving the corresponding reverse SDE. Despite producing excellent audio quality, its inference is slow because the reverse process requires many iterations (10–1000). ProDiff builds on this by adding progressive distillation to reduce the number of sampling steps. DiffGAN-TTS (Liu et al.) uses an adversarially trained model to approximate the denoising function for efficient speech synthesis. ResGrad (Chen et al.) uses a diffusion model to estimate the residual between a pretrained FastSpeech2's prediction and the ground truth.
From the above description, it is clear that speech synthesis has three objectives:
• Excellent audio quality: The generative model must faithfully capture the subtleties of the human voice that contribute to the expressiveness and naturalness of the synthesized audio. Recent research has moved beyond the typical speaking voice to voices with more intricate variation in pitch, rhythm, and emotion. DiffSinger, for example, demonstrates that a well-designed diffusion model can produce a high-quality synthesized singing voice after 100 iterations. It is also important to avoid artifacts and distortions in the generated audio.
• Fast inference: Fast audio synthesis is required for real-time applications, including music, interactive voice, and communication systems. Merely being faster than real time is insufficient when the synthesizer must share an embedded system's compute budget with other algorithms.
• Beyond speech: Voices more complex than the typical speaking voice, such as the singing voice, demand richer modeling of pitch, emotion, rhythm, breath control, and timbre.
Although numerous attempts have been made, the trade-off between synthesized audio quality, model capability, and inference speed persists in TTS. It is most evident in SVS because of the iterative sampling mechanism of the denoising diffusion process. Existing approaches often mitigate rather than fully solve the slow-inference problem; even so, they remain slower than conventional non-diffusion approaches such as FastSpeech2.
The recently developed consistency model produces high-quality images with only one sampling step: the stochastic differential equation (SDE) describing the sampling process is recast as an ordinary differential equation (ODE), and the model is trained to enforce a self-consistency property along the ODE trajectory. Despite this success in image synthesis, no speech synthesis model based on the consistency model is yet known. This suggests an opportunity to develop a consistency-model-based speech synthesis technique that combines high-quality synthesis with fast inference.
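The one-step property comes from parameterizing a consistency function f(x, t) that maps any point on the ODE trajectory straight back to clean data, with a boundary condition forcing f to be the identity at the smallest time. Below is a minimal numpy sketch of one common (Karras-style) parameterization; `sigma_data`, `eps`, and the terminal time are illustrative assumptions, not values from the CoMoSpeech paper.

```python
import numpy as np

sigma_data = 0.5  # hypothetical data standard deviation
eps = 0.002       # smallest time on the ODE trajectory

def c_skip(t):
    """Skip-connection weight; equals 1 at t = eps."""
    return sigma_data**2 / ((t - eps) ** 2 + sigma_data**2)

def c_out(t):
    """Network-output weight; equals 0 at t = eps."""
    return sigma_data * (t - eps) / np.sqrt(t**2 + sigma_data**2)

def consistency_fn(F_theta, x, t):
    """f(x, t) = c_skip(t) * x + c_out(t) * F_theta(x, t).
    At t = eps this reduces to the identity, so f(x_eps, eps) = x_eps."""
    return c_skip(t) * x + c_out(t) * F_theta(x, t)

def one_step_sample(F_theta, shape, T=80.0, rng=np.random.default_rng(0)):
    """Single-step generation: map noise at the terminal time directly to data."""
    xT = rng.standard_normal(shape) * T
    return consistency_fn(F_theta, xT, T)
```

Training then only has to make f agree with itself across all times on each trajectory; once it does, sampling is a single network call instead of a loop.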
In this study, researchers from Hong Kong Baptist University, Hong Kong University of Science and Technology, Microsoft Research Asia, and Hong Kong Institute of Science and Innovation present CoMoSpeech, a fast, high-quality speech synthesis approach based on consistency models. CoMoSpeech is distilled from a pretrained teacher model. More specifically, the teacher model uses the SDE to learn the matching score function and smoothly transform the mel spectrogram into a Gaussian noise distribution. After training, they construct the teacher's denoising function with the associated numerical ODE solver, which is then used for consistency distillation. The distillation yields CoMoSpeech with the consistency property, so it can ultimately generate high-quality audio in a single sampling step.
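The consistency distillation described above pairs a student with an EMA (exponential-moving-average) target copy of itself: the teacher's ODE solver moves a noisy sample one step along the trajectory, and the student at the earlier point is trained to match the target at the later point. The numpy sketch below shows the shape of one such loss term under toy stand-ins; `ode_step`, the decay value, and both networks are hypothetical placeholders, not the paper's actual components.

```python
import numpy as np

def ema_update(target, online, decay=0.95):
    """Exponential moving average of student parameters into the target network."""
    return decay * target + (1.0 - decay) * online

def distillation_loss(f_student, f_target, ode_step, x_next, t_next, t_cur):
    """One consistency-distillation term: the student at (x_{t_{n+1}}, t_{n+1})
    must match the EMA target at the teacher-solver point (x_hat_{t_n}, t_n)."""
    x_hat = ode_step(x_next, t_next, t_cur)  # one step of the teacher's ODE solver
    pred = f_student(x_next, t_next)
    target = f_target(x_hat, t_cur)
    return np.mean((pred - target) ** 2)
```

Because the target is evaluated at the teacher-denoised point, driving this loss to zero pushes the student toward the same output everywhere along each trajectory, which is exactly the self-consistency that enables one-step sampling.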
Findings from their TTS and SVS experiments demonstrate that CoMoSpeech can synthesize speech in a single sampling step, more than 150 times faster than real time. The audio-quality evaluation also shows that CoMoSpeech matches or surpasses other diffusion-model techniques that require tens or hundreds of iterations, making diffusion-model-based speech synthesis truly practical for the first time. Various audio examples are available on their project website.
Check out the Paper and Project page.
Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his bachelor's degree in Information Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves connecting with people and collaborating on interesting projects.