A team of researchers from the University of Science and Technology of China has developed a new machine learning model for lip-to-speech synthesis (Lip2Speech). The model can generate custom synthesized speech under zero-shot conditions, meaning it can make predictions for classes of data it did not encounter during training. The approach leverages a variational autoencoder, a generative neural-network model that encodes and decodes data.
Lip2Speech synthesis involves predicting spoken words from a person’s lip movements and has several real-world applications. For example, it can help patients who are unable to produce speech sounds communicate with others, add sound to silent movies, restore speech in noisy or damaged video, and even reconstruct conversations from voiceless CCTV footage. While some machine learning models have shown promise in Lip2Speech applications, they often struggle with real-time performance and are not trained with zero-shot learning approaches.
Typically, to achieve zero-shot Lip2Speech synthesis, machine learning models require reliable video recordings of the speakers from which to extract additional information about their speech patterns. However, when only silent or unintelligible video of a speaker’s face is available, this information is not accessible. The researchers’ model addresses this limitation by generating speech that matches the appearance and identity of a given speaker without relying on recordings of their actual speech.
The team proposed a zero-shot custom Lip2Speech synthesis method that uses facial images to control speaker identities. They employed a variational autoencoder to disentangle speaker identity and linguistic content representations, allowing speaker embeddings to control the voice characteristics of speech synthesized for unseen speakers. In addition, they introduced associated cross-modal representation learning to improve the ability of face-based speaker embeddings (FSE) to control the voice.
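To make the idea concrete, the following is a minimal, hypothetical sketch of the forward pass such a model might use: a variational encoder maps lip-movement features to a distribution over linguistic content codes, a separate branch maps a face image to a speaker embedding, and a decoder combines the two to produce a mel-spectrogram frame. All dimensions, weights, and function names here are illustrative assumptions, not the paper's actual architecture, and random matrices stand in for trained network layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- illustrative, not taken from the paper.
LIP_DIM, CONTENT_DIM, SPEAKER_DIM, MEL_DIM = 64, 16, 8, 80

# Random matrices stand in for trained encoder/decoder parameters.
W_mu = rng.normal(scale=0.1, size=(LIP_DIM, CONTENT_DIM))
W_logvar = rng.normal(scale=0.1, size=(LIP_DIM, CONTENT_DIM))
W_spk = rng.normal(scale=0.1, size=(LIP_DIM, SPEAKER_DIM))
W_dec = rng.normal(scale=0.1, size=(CONTENT_DIM + SPEAKER_DIM, MEL_DIM))

def encode(lip_features):
    """Variational encoder: map lip features to a Gaussian over content codes."""
    return lip_features @ W_mu, lip_features @ W_logvar

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps (the standard VAE reparameterization trick)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def speaker_embedding(face_features):
    """Face-based speaker embedding (FSE): identity is derived from the face
    image rather than reference audio, enabling zero-shot voice control."""
    return face_features @ W_spk

def decode(content_z, spk_emb):
    """Decoder conditions the disentangled content code on the speaker embedding."""
    return np.concatenate([content_z, spk_emb], axis=-1) @ W_dec

# One frame of features from a silent video and its speaker's face image.
lip = rng.normal(size=(1, LIP_DIM))
face = rng.normal(size=(1, LIP_DIM))

mu, logvar = encode(lip)
z = reparameterize(mu, logvar)
mel = decode(z, speaker_embedding(face))
print(mel.shape)  # one 80-dimensional mel-spectrogram frame
```

The key design point this sketch illustrates is the separation of concerns: the content code `z` carries only what is being said (inferred from lip movements), while `speaker_embedding` controls how it sounds, so a face image of a speaker never seen in training can still steer the voice characteristics.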
To assess the performance of their model, the researchers ran a series of tests. The results were remarkable: the model generated synthesized speech that precisely matched the speaker’s lip movements as well as their age, gender, and general appearance. The potential applications of this model are wide-ranging, from assistive tools for the speech impaired to video editing software and aids for police investigations. The researchers demonstrated the effectiveness of their proposed method through extensive experiments, showing that the synthetic utterances were more natural and better matched the identity of the speaker in the input video than those of competing methods. Importantly, this work represents the first attempt at zero-shot custom Lip2Speech synthesis that uses a facial image instead of reference audio to control voice characteristics.
In conclusion, the researchers have developed a machine learning model for Lip2Speech synthesis that excels under zero-shot conditions. By leveraging a variational autoencoder and face images, the model can generate custom synthesized speech that aligns with the speaker’s appearance and identity. Its successful performance opens up possibilities for various practical applications, such as helping the speech impaired, improving video editing tools, and assisting in police investigations.
Check out the Paper and reference article. If you have any questions about the article above or if we missed anything, feel free to email us at [email protected]
Niharika is a technical consulting intern at Marktechpost. She is a third-year student pursuing her B.Tech at the Indian Institute of Technology (IIT), Kharagpur. She is a highly enthusiastic individual with a strong interest in machine learning, data science, and artificial intelligence, and an avid reader of the latest developments in these fields.