In recent years, the field of speech synthesis has seen great advances. With the rapid progress of natural language systems, text is the most common input for generating speech. A Text-To-Speech (TTS) system converts natural language text into speech: given a text input, it produces natural-sounding audio. A number of text-to-speech language models now generate high-quality speech.
Traditional models are limited to producing the same robotic-sounding output, tied to a particular speaker in a particular language. With the introduction of deep neural networks, text-to-speech models have become more capable, preserving stress and intonation so that the generated speech sounds more human and natural. Cross-lingual speech synthesis, however, had remained out of reach. A team of Microsoft researchers has now presented a language model that performs speech synthesis across multiple languages.
Multilingual speech synthesis is essentially the task of transferring a speaker's voice from one language to another. The multilingual neural codec language model the researchers have introduced is called VALL-E X. It is an extended version of the VALL-E text-to-speech model and inherits VALL-E's strong in-context learning capabilities.
The team has summarized their work as follows:
- VALL-E X is a multilingual neural codec language model trained on multilingual, multi-speaker, multi-domain unclean speech data.
- VALL-E X is built by training a multilingual conditional codec language model to predict the acoustic token sequences of target-language speech, using both the source-language speech and the target-language text as prompts.
- The multilingual in-context learning framework enables VALL-E X to produce cross-lingual speech while preserving the unseen speaker's voice, emotion, and acoustic background.
- VALL-E X addresses the main challenge of multilingual speech synthesis: the foreign accent problem. It can generate speech with the native accent of the target language for any speaker.
- VALL-E X has been applied to zero-shot multilingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks. In experiments, VALL-E X outperforms strong baselines in speaker similarity, speech quality, translation quality, speech naturalness, and human evaluation.
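The prompting scheme summarized above can be illustrated with a toy sketch. Every function here is a simplified, hypothetical stand-in (not the real system, which uses a neural audio codec and a large Transformer-based codec language model); the sketch only shows how source-language phonemes, target-language phonemes, and source acoustic tokens are combined into one prompt from which target acoustic tokens are predicted.

```python
# Toy sketch of the VALL-E X inference flow. All components are
# simplified stand-ins for illustration only, not the real model.

def text_to_phonemes(text: str, lang: str) -> list[str]:
    """Stand-in grapheme-to-phoneme step: tag each character with its language."""
    return [f"{lang}:{ch}" for ch in text if not ch.isspace()]

def speech_to_acoustic_tokens(audio: list[float]) -> list[int]:
    """Stand-in codec encoder: quantize samples into discrete tokens."""
    return [int(abs(sample) * 100) % 1024 for sample in audio]

def predict_target_tokens(prompt: list, n_tokens: int) -> list[int]:
    """Stand-in for the autoregressive codec language model:
    deterministically derives 'acoustic tokens' from the prompt."""
    seed = sum(sum(ord(c) for c in str(p)) for p in prompt)
    return [(seed + i * 31) % 1024 for i in range(n_tokens)]

def vall_e_x_synthesize(src_audio, src_text, src_lang, tgt_text, tgt_lang):
    # Build the multilingual prompt: source phonemes + target phonemes
    # + source acoustic tokens (which carry the speaker's voice).
    prompt = (
        text_to_phonemes(src_text, src_lang)
        + text_to_phonemes(tgt_text, tgt_lang)
        + speech_to_acoustic_tokens(src_audio)
    )
    # Predict acoustic tokens for the target-language speech; a codec
    # decoder would then turn these tokens back into a waveform.
    return predict_target_tokens(prompt, n_tokens=8)
```

Because the source speech enters the prompt only as acoustic tokens, the same mechanism serves both zero-shot TTS and speech-to-speech translation: the model conditions on the speaker's voice without ever seeing that speaker during training.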
VALL-E X was tested on LibriSpeech and EMIME for both English and Chinese, including English TTS for Chinese speakers and Chinese TTS for English speakers, and demonstrates high-quality zero-shot multilingual speech synthesis. The new model looks promising, as it overcomes the foreign accent problem and broadens the potential of cross-lingual speech synthesis.
Check out the Paper and Project page. All credit for this research goes to the researchers of this project.
Tanya Malhotra is a final-year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with strong analytical and critical thinking skills, along with a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.