When textless natural language processing (NLP) first emerged, the core idea was to train a language model on sequences of learned discrete units rather than on transcribed text, so that NLP tasks could be applied directly to spoken utterances. Speech editing adds a further requirement: a model must modify individual words or phrases to match a target transcript while leaving the rest of the original speech unaltered. Researchers are now exploring whether a single unified model can handle both text-to-speech and voice editing, which would be a significant step forward for the field.
Recent research from the University of Texas at Austin and Rembrand presents VOICECRAFT, a Transformer-based neural codec language model (NCLM) that performs infilling by generating neural speech codec tokens autoregressively while conditioning on bidirectional context. VOICECRAFT achieves state-of-the-art (SotA) results in both voice editing and zero-shot text-to-speech (TTS). The approach rests on a two-step token rearrangement procedure consisting of a causal masking step and a delayed stacking step. The causal masking technique, inspired by the success of causal masking in multimodal models for joint text-image modeling, enables autoregressive generation with bidirectional context and is applied here to speech codec sequences.
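To make the delayed stacking step concrete, here is a minimal sketch of one common delay pattern for multi-codebook codec tokens: tokens from codebook k are shifted right by k timesteps so that, at each generation step, the model predicts codebook k conditioned on earlier codebooks at the same frame. The `PAD` value and the exact shift scheme are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

PAD = -1  # hypothetical padding token id used to fill the shifted slots

def delayed_stack(codes: np.ndarray) -> np.ndarray:
    """Apply a per-codebook delay to a (K, T) grid of codec tokens.

    Row k is shifted right by k positions, producing a (K, T + K - 1)
    grid padded with PAD, so codebooks can be generated with a one-step
    stagger between them.
    """
    K, T = codes.shape
    out = np.full((K, T + K - 1), PAD, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]
    return out
```

For example, with K=2 codebooks over T=4 frames, the second row starts one step later than the first, so each column only ever depends on tokens already emitted.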
To ensure effective modeling of the codec's multiple codebooks, the team combines causal masking with delayed stacking in its token rearrangement scheme. To evaluate voice editing, the team also built REALEDIT, a distinctive, realistic, and challenging dataset. REALEDIT contains 310 real-world voice editing examples with waveforms ranging from 5 to 12 seconds, drawn from audiobooks, YouTube videos, and Spotify podcasts. Target transcripts were produced by editing the source speech transcripts so that they remain grammatically correct and semantically coherent.
The dataset covers many editing scenarios, including insertion, deletion, substitution, and editing multiple spans at once, with edited text spans ranging from one to sixteen words. Because its recordings vary widely in topic, accent, speaking style, recording environment, and background noise, REALEDIT is considerably more challenging than popular speech synthesis evaluation datasets such as VCTK, LJSpeech, and LibriTTS, which consist largely of clean, read speech. This diversity and realism make REALEDIT a good barometer of how voice editing models will fare in the real world.
Compared with the previous SotA voice editing model on REALEDIT, VOICECRAFT performs substantially better in subjective human listening tests. Most importantly, speech edited with VOICECRAFT sounds nearly identical to the unmodified original audio. The results also show that, without any fine-tuning, VOICECRAFT outperforms strong zero-shot TTS baselines such as a replicated VALL-E and the well-known commercial model XTTS v2. The team's training dataset was drawn from audiobooks and YouTube videos.
Despite the progress of VOICECRAFT, the team highlights some limitations, such as:
- The most notable failure mode during generation is occasional long stretches of silence followed by a scraping sound. The team mitigated this in the study by sampling many utterances and selecting the shortest one, but more refined and efficient solutions are needed.
- Another critical issue, related to AI safety, is how to watermark and detect synthetic speech. Watermarking and deepfake detection have recently received considerable attention, and significant progress has been made.
However, with the advent of more capable models like VOICECRAFT, the team believes security researchers face new opportunities and challenges. They have publicly released all of their code and model weights to support research on AI safety and speech synthesis.
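The sample-and-select mitigation described above can be sketched in a few lines: run the generator several times and keep the shortest candidate, on the assumption that runaway silences inflate output length. The `generate` callable here is a hypothetical stand-in for the model's sampling routine, not VOICECRAFT's actual API.

```python
from typing import Callable, List, Sequence

def generate_with_resampling(
    generate: Callable[[], Sequence],
    n_samples: int = 5,
) -> Sequence:
    """Sample n_samples outputs from a (hypothetical) stochastic generator
    and return the shortest one, a crude guard against long dead-air
    segments that manifest as abnormally long token sequences."""
    candidates: List[Sequence] = [generate() for _ in range(n_samples)]
    return min(candidates, key=len)
```

As the team notes, this is a brute-force workaround; a more principled fix would address the failure mode in the model or decoding procedure itself.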
Dhanshree Shenwai is a Computer Science Engineer with solid experience in FinTech companies spanning finance, cards and payments, and banking, and a keen interest in AI applications. She is excited to explore new technologies and advancements in today's evolving world that make life easier for everyone.