With the growing number of advancements in artificial intelligence, the fields of Natural Language Processing, Natural Language Generation and Computer Vision have gained great popularity recently, largely thanks to the introduction of Large Language Models (LLMs). Diffusion models, which have proven successful in text-to-speech (TTS) synthesis, deliver high generation quality. However, their prior distribution is a noisy representation that offers little information about the desired generation target.
In recent research, a team from Tsinghua University and Microsoft Research Asia introduced a new text-to-speech system called Bridge-TTS. It is the first attempt to replace the noisy Gaussian prior used in well-established diffusion-based TTS approaches with a clean and deterministic alternative. This replacement prior provides strong structural information about the target and is taken from the latent representation extracted from the text input.
The team has shared that the main contribution is the development of a fully tractable Schrödinger bridge connecting the ground-truth mel spectrogram and the clean prior. Bridge-TTS uses a data-to-data process, which improves the information content of the prior distribution, in contrast to diffusion models, which work through a data-to-noise process.
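To make the contrast concrete, here is a minimal NumPy sketch under stated assumptions. It is not the Bridge-TTS implementation: `diffusion_marginal` and `bridge_marginal` are hypothetical helper names, the diffusion schedule is a toy variance-preserving interpolation, and the bridge is a plain Brownian bridge with constant noise scale `sigma`, the simplest tractable process pinned at two endpoints. The random arrays merely stand in for a mel spectrogram and a text-derived prior.

```python
import numpy as np

def diffusion_marginal(x0, t, rng):
    """Data-to-noise: toy variance-preserving path from x0 (t=0) to N(0, I) (t=1)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(1.0 - t) * x0 + np.sqrt(t) * eps

def bridge_marginal(x0, x1, t, rng, sigma=1.0):
    """Data-to-data: Brownian bridge pinned at x0 (t=0) and x1 (t=1)."""
    eps = rng.standard_normal(x0.shape)
    mean = (1.0 - t) * x0 + t * x1            # linear interpolation of the endpoints
    std = sigma * np.sqrt(t * (1.0 - t))      # noise vanishes at both ends
    return mean + std * eps

rng = np.random.default_rng(0)
# Stand-ins (assumptions): 80 mel bins x 100 frames of synthetic data.
mel = rng.standard_normal((80, 100))
text_prior = mel + 0.5 * rng.standard_normal((80, 100))

x_start = bridge_marginal(mel, text_prior, 0.0, rng)   # equals the data
x_mid = bridge_marginal(mel, text_prior, 0.5, rng)     # noisy blend of both ends
x_end = bridge_marginal(mel, text_prior, 1.0, rng)     # equals the clean prior
noise_end = diffusion_marginal(mel, 1.0, rng)          # pure Gaussian noise
```

At `t = 1` the diffusion sample is pure noise regardless of the input, whereas the bridge sample equals the prior exactly, so the starting point of generation can already carry structural information about the target.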
Experiments on the LJ-Speech dataset validated the effectiveness of the method. In 50-step and 1000-step synthesis, Bridge-TTS outperformed its diffusion counterpart, Grad-TTS. In few-step scenarios, it also outperformed strong, fast TTS models. The team emphasized that the main strengths of Bridge-TTS are synthesis quality and sampling efficiency.
The team has summarized the main contributions as follows.
- Mel spectrograms are generated from a clean latent representation of the text. In diffusion models this representation serves only as condition information; here it is used as a noise-free prior. Instead of the traditional data-to-noise procedure, the Schrödinger bridge is used to explore a data-to-data process.
- For paired data, a fully tractable Schrödinger bridge has been proposed. The bridge is built on a flexible form of reference stochastic differential equation (SDE), which allows empirical investigation of the design space while also offering a theoretical explanation.
- The team studied how the noise schedule, model parameterization and sampling technique contribute to TTS quality. An asymmetric noise schedule, data prediction, and first-order bridge samplers have been implemented.
- The fully tractable Schrödinger bridge makes a complete theoretical explanation of the underlying processes possible. Empirical research has been conducted to understand how different elements affect TTS quality, including the effects of asymmetric noise schedules, model parameterization choices, and the efficiency of the sampling process.
- The method has produced strong results in both inference speed and generation quality. It significantly outperformed the equivalent diffusion-based Grad-TTS in 1000-step and 50-step generation. It also outperformed FastGrad-TTS in 4-step generation, and both the transformer-based FastSpeech 2 and the state-of-the-art distillation approach CoMoSpeech in 2-step generation.
- The method has achieved these results with just one training session. A single trained model performs well across different numbers of sampling steps, demonstrating the reliability and power of the suggested approach.
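The design choices named in the list above (asymmetric noise schedule, data prediction, first-order sampling) can be illustrated with a toy sketch. Everything here is an assumption made for illustration rather than the authors' code: the reference SDE is taken to be drift-free (`dx = g(t) dw`), the schedule endpoints `BETA0` and `BETA1` are hypothetical values, the function names are invented, and an oracle that returns the true data stands in for the trained data-prediction network. Each update draws the earlier state from the standard closed-form Brownian-bridge conditional between the predicted data and the current state.

```python
import numpy as np

BETA0, BETA1 = 0.01, 20.0  # hypothetical schedule endpoints, not from the paper

def sigma2(t):
    """Accumulated noise: integral of g^2 over [0, t], with g^2(u) = BETA0 + u * (BETA1 - BETA0)."""
    return BETA0 * t + 0.5 * (BETA1 - BETA0) * t ** 2

def bridge_std(t):
    """Standard deviation of the pinned bridge at time t (zero at t=0 and t=1)."""
    s2, total = sigma2(t), sigma2(1.0)
    return np.sqrt(s2 * (total - s2) / total)

def sample(x1, predict_x0, n_steps, rng):
    """First-order stochastic sampler: walk from the prior (t=1) back to data (t=0)."""
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    x = x1
    for t, s in zip(ts[:-1], ts[1:]):
        ratio = sigma2(s) / sigma2(t)          # in [0, 1); exactly 0 on the last step
        x0_hat = predict_x0(x, t)              # data prediction (network stand-in)
        noise = rng.standard_normal(x.shape)
        x = ratio * x + (1.0 - ratio) * x0_hat \
            + np.sqrt(sigma2(s) * (1.0 - ratio)) * noise
    return x

rng = np.random.default_rng(0)
mel = rng.standard_normal((80, 50))                 # stand-in for a mel spectrogram
prior = mel + 0.3 * rng.standard_normal((80, 50))   # stand-in for the text-derived prior
oracle = lambda x_t, t: mel                         # assumption: perfect data predictor

out = sample(prior, oracle, n_steps=4, rng=rng)

# Because g^2 grows with t, the noise level peaks closer to the prior end.
grid = np.linspace(0.0, 1.0, 101)
peak_t = grid[np.argmax(bridge_std(grid))]
```

With the oracle predictor the final step lands exactly on the data, since the noise term vanishes at `s = 0`; a learned predictor would only approximate this. The schedule is asymmetric in the sense that `peak_t` falls later than the midpoint `0.5`, so most of the noise is injected near the prior end of the bridge.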
Review the Paper and Project. All credit for this research goes to the researchers of this project.
Tanya Malhotra is a final-year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical thinking skills, along with a keen interest in acquiring new skills, leading groups and managing work in an organized manner.