Recent advances in language models (LMs) have shown impressive zero-shot voice conversion (VC) capabilities. However, the predominant LM-based VC models convert source semantics into acoustic features offline, which requires the entire source utterance and limits their use in real-time scenarios.
In this work, a team of researchers from Northwestern Polytechnical University, China, and ByteDance presents StreamVoice, a novel streaming LM-based method for zero-shot voice conversion that enables real-time conversion given any speaker prompt and source speech. StreamVoice achieves streaming capability by employing a fully causal, context-aware LM together with a temporal-independent acoustic predictor.
The model alternately processes semantic and acoustic features at each autoregressive time step, eliminating the need for the complete source speech. To mitigate the performance degradation that streaming processing can cause due to incomplete context, two strategies are employed:
1) teacher-guided context foresight, in which a teacher model summarizes the present and future semantic context during training to guide the model's prediction of the missing context;
2) a semantic masking strategy, which encourages acoustic prediction from preceding corrupted semantic and acoustic inputs, improving the model's context-learning ability (a rough sketch follows below).
Notably, StreamVoice stands out as the first LM-based zero-shot streaming VC model that operates without any future lookahead. Experimental results demonstrate StreamVoice's streaming conversion capability while maintaining zero-shot performance comparable to non-streaming VC systems.
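To make the alternating autoregression and the semantic masking strategy concrete, here is a minimal PyTorch sketch. It is not the authors' implementation: the module names, dimensions, interleaving layout, and masking probability are illustrative assumptions, and the teacher-guided context foresight component is omitted for brevity.

```python
import torch
import torch.nn as nn

class StreamingVCSketch(nn.Module):
    """Illustrative sketch of an LM that alternates semantic and acoustic
    features at each autoregressive step (not the official StreamVoice code)."""

    def __init__(self, sem_vocab=512, ac_vocab=1024, dim=256, mask_prob=0.3):
        super().__init__()
        self.sem_emb = nn.Embedding(sem_vocab, dim)
        self.ac_emb = nn.Embedding(ac_vocab, dim)
        self.mask_emb = nn.Parameter(torch.zeros(dim))   # stands in for a masked semantic frame
        self.mask_prob = mask_prob
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.causal_lm = nn.TransformerEncoder(layer, num_layers=4)
        self.acoustic_head = nn.Linear(dim, ac_vocab)    # per-step ("temporal-independent") predictor

    def forward(self, sem_tokens, ac_tokens):
        # sem_tokens, ac_tokens: (batch, T) time-aligned streams of discrete tokens
        sem = self.sem_emb(sem_tokens)
        ac = self.ac_emb(ac_tokens)

        # Semantic masking: randomly corrupt semantic frames during training so the
        # model learns to predict acoustics from incomplete (streaming) context.
        if self.training:
            drop = torch.rand(sem.shape[:2], device=sem.device) < self.mask_prob
            sem = torch.where(drop.unsqueeze(-1), self.mask_emb.expand_as(sem), sem)

        # Interleave semantic and acoustic frames: [s1, a1, s2, a2, ...]
        b, t, d = sem.shape
        seq = torch.stack([sem, ac], dim=2).reshape(b, 2 * t, d)

        # Fully causal attention mask: each position only attends to the past.
        causal = nn.Transformer.generate_square_subsequent_mask(2 * t).to(seq.device)
        hidden = self.causal_lm(seq, mask=causal)

        # Predict the acoustic token of each step from the hidden state at its semantic position.
        return self.acoustic_head(hidden[:, 0::2, :])
```

The key design point this sketch tries to convey is that acoustics at step t are predicted only from past frames plus the current (possibly masked) semantic frame, which is what allows conversion to start before the full source utterance is available.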
The figure above illustrates the zero-shot VC concept using the widely adopted recognition-synthesis framework, the paradigm on which StreamVoice is built. The experiments show that StreamVoice can perform voice conversion in a streaming manner, achieving high speaker similarity for both seen and unseen speakers while maintaining performance comparable to non-streaming VC systems. As the first LM-based zero-shot VC model without any future lookahead, the entire StreamVoice pipeline incurs a latency of only 124 ms for the conversion process, 2.4 times faster than real time on a single A100 GPU, even without engineering optimizations.
The team's future work involves using more training data to improve StreamVoice's modeling capability. They also plan to optimize the streaming pipeline, for example by incorporating a high-fidelity, low-bitrate codec and a unified streaming model.
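For intuition, the streaming recognition-synthesis loop can be pictured as the following chunk-by-chunk sketch. It is a minimal illustration under assumed interfaces: `asr_encoder`, `vc_lm`, `codec_decoder`, and their methods are hypothetical placeholders, not the paper's actual components or API.

```python
import numpy as np

def stream_convert(source_chunks, speaker_prompt, asr_encoder, vc_lm, codec_decoder):
    """Chunk-by-chunk recognition-synthesis loop (placeholder interfaces):
    a streaming ASR extracts semantic features from each incoming chunk, the
    causal LM predicts acoustic tokens conditioned on the target-speaker prompt,
    and a codec decoder turns them back into waveform samples, so converted
    audio is emitted without waiting for the full utterance."""
    converted = []
    state = vc_lm.init_state(speaker_prompt)           # condition on the target-speaker prompt
    for chunk in source_chunks:                        # chunks arrive in real time
        semantic = asr_encoder.encode_chunk(chunk)     # semantic features for this chunk only
        acoustic, state = vc_lm.step(semantic, state)  # fully causal: no future lookahead
        converted.append(codec_decoder.decode(acoustic))
    return np.concatenate(converted)
```

Because each chunk is converted as soon as it arrives, the end-to-end delay reduces to the per-chunk processing path; the reported figures correspond to about 124 ms of latency and a real-time factor of roughly 1/2.4 ≈ 0.42.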
Check out the Paper. All credit for this research goes to the researchers of this project.
Janhavi Lande graduated in Engineering Physics from IIT Guwahati in 2023. She is an aspiring data scientist and has been working in ML/AI research for the past two years. What fascinates her most is this ever-changing world and its constant demand for humans to keep up. In her free time, she likes to travel, read, and write poems.