Large-scale language models have made significant progress on generative tasks such as multi-speaker speech synthesis, music generation, and general audio generation. Integrating the speech modality into large unified multimodal models has also become popular, as seen in models such as SpeechGPT and AnyGPT. These advances are largely due to the discrete acoustic representations produced by neural codec models. However, bridging the gap between continuous speech and token-based language models remains challenging: while current acoustic codec models offer good reconstruction quality, there is room for improvement in areas such as stronger compression (lower bitrates) and richer semantic content.
Existing methods address the challenges of acoustic codec models along three main lines. The first improves reconstruction quality: AudioDec demonstrated the importance of discriminators, while DAC raised fidelity with techniques such as quantizer dropout. The second pursues stronger compression: HiFi-Codec's parallel GRVQ structure and Language-Codec's MCRVQ mechanism both achieve good performance with fewer quantizers. The third aims to deepen understanding of the codec space: TiCodec models time-independent and time-dependent information separately, while FACodec disentangles content, style, and acoustic details. A short residual-quantization sketch follows below to show the mechanism these codecs share.
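All of these codecs build on residual vector quantization (RVQ), where each successive quantizer encodes the residual left by the previous stages, so stacking quantizers trades token count for fidelity. The following is a minimal illustrative NumPy sketch with random, hypothetical codebooks (not any model's actual weights or sizes):

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual VQ: each stage quantizes what the previous stages missed.
    x: (dim,) feature vector; codebooks: list of (num_codes, dim) arrays."""
    indices = []
    reconstruction = np.zeros_like(x)
    residual = x.copy()
    for codebook in codebooks:
        # Pick the code nearest to the current residual.
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        indices.append(idx)
        reconstruction += codebook[idx]
        residual = x - reconstruction  # what is still unexplained
    return indices, reconstruction

# Toy example: 4 codebooks of 256 codes each (hypothetical sizes).
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 16)) for _ in range(4)]
x = rng.normal(size=16)
ids, x_hat = rvq_encode(x, codebooks)
print(ids, np.linalg.norm(x - x_hat))  # error shrinks as stages are added
```

Each extra stage costs one more token per frame, which is why reducing the number of quantizers, as the codecs above attempt, directly reduces the token sequence a language model must handle.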
A team of researchers from Zhejiang University, Alibaba Group, and Fundamental AI Research (FAIR) at Meta has proposed WavTokenizer, a new acoustic codec model that offers significant advantages over previous state-of-the-art models in the audio domain. WavTokenizer achieves extreme compression by reducing both the quantizer layers and the temporal dimension of the discrete codec, needing only 40 or 75 tokens for one second of 24 kHz audio. Its design also features a broader VQ space, extended contextual windows, enhanced attention networks, a powerful multi-scale discriminator, and an inverse Fourier transform upsampling structure. The model demonstrates strong performance across multiple domains, including speech, audio, and music.
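To see what this compression means in bits, the arithmetic can be done directly. The sketch below assumes the single quantizer draws from a 4096-entry codebook (the enlarged VQ space reported for WavTokenizer; treat the exact size as an assumption here):

```python
import math

def codec_bitrate_kbps(tokens_per_second: int, codebook_size: int) -> float:
    """Bitrate in kbps for a single-quantizer codec."""
    bits_per_token = math.log2(codebook_size)
    return tokens_per_second * bits_per_token / 1000

# 40 or 75 tokens per second of 24 kHz audio, assumed 4096-entry codebook.
for tps in (40, 75):
    print(f"{tps} tokens/s -> {codec_bitrate_kbps(tps, 4096):.2f} kbps")
# 40 tokens/s -> 0.48 kbps; 75 tokens/s -> 0.90 kbps,
# versus raw 24 kHz 16-bit PCM at 384 kbps.
```

Under these assumptions, the codec operates below 1 kbps, several hundred times smaller than the raw waveform.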
The architecture of WavTokenizer is designed for unified modeling across domains such as multilingual speech, music, and general audio. The large version is trained on roughly 80,000 hours of data drawn from datasets including LibriTTS, VCTK, and CommonVoice; the medium version uses a 5,000-hour subset, while the small version is trained on 585 hours of LibriTTS data. WavTokenizer is evaluated against state-of-the-art codec models using their official weight files, including EnCodec and HiFi-Codec, among others. Training is performed on NVIDIA A800 80GB GPUs with 24 kHz input samples, and the model is optimized with the AdamW optimizer using specific learning-rate and decay settings.
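The article does not list the exact hyperparameters, but a typical AdamW setup for this kind of codec training looks like the following; the learning rate, betas, decay strength, and schedule here are illustrative placeholders, not the paper's reported values:

```python
import torch

model = torch.nn.Linear(512, 512)  # stand-in for the codec's generator

# Illustrative values only; the paper's exact settings may differ.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-4,            # assumed base learning rate
    betas=(0.9, 0.99),  # assumed momentum terms
    weight_decay=0.01,  # assumed decay strength
)
# A smooth learning-rate decay over training, e.g. cosine annealing:
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=500_000)
```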
The results demonstrate WavTokenizer's strong performance across multiple datasets and metrics. WavTokenizer-small outperforms the state-of-the-art DAC model by 0.15 on UTMOS, a metric that closely matches human perception of audio quality, on the clean test subset of LibriTTS. With only 40 and 75 tokens, the model also outperforms the 100-token DAC configuration on all metrics, demonstrating its effectiveness in reconstructing audio with a single quantizer. On objective metrics such as STOI, PESQ, and F1 score, WavTokenizer performs comparably to Vocos with 4 quantizers and SpeechTokenizer with 8 quantizers.
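For readers who want to compute such objective scores themselves, STOI and PESQ are available through the open-source pystoi and pesq packages. A minimal sketch follows; the file paths are placeholders, and note that PESQ is defined only at 8 or 16 kHz, so 24 kHz codec output must be resampled first:

```python
import soundfile as sf
import librosa
from pesq import pesq     # pip install pesq
from pystoi import stoi   # pip install pystoi

# Placeholder paths: a reference clip and its codec reconstruction.
ref, sr = sf.read("reference.wav")
deg, _ = sf.read("reconstructed.wav")

# Resample to 16 kHz for wideband PESQ.
ref16 = librosa.resample(ref, orig_sr=sr, target_sr=16000)
deg16 = librosa.resample(deg, orig_sr=sr, target_sr=16000)

print("STOI:", stoi(ref, deg, sr, extended=False))
print("PESQ:", pesq(16000, ref16, deg16, "wb"))
```

UTMOS, by contrast, is produced by a pretrained MOS-prediction network rather than a signal-level formula, so it is not shown here.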
In conclusion, WavTokenizer represents a significant advancement in acoustic codec models, quantizing one second of speech, music, or audio into as few as 40 or 75 high-quality tokens. It achieves results comparable to existing models on the clean LibriTTS test set while delivering extreme compression. The team performed a thorough analysis of the design motivations behind the VQ space and the decoder, and validated the importance of each new module through ablation studies. The findings suggest that WavTokenizer has the potential to reshape audio compression and reconstruction across multiple domains. Moving forward, the researchers plan to consolidate WavTokenizer's position as a cutting-edge solution in the field of acoustic codec models.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Sajjad Ansari is a final-year student at IIT Kharagpur. As a technology enthusiast, he delves into practical applications of AI, focusing on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.