At NVIDIA GTC25, <a target="_blank" href="http://gnani.ai">Gnani.ai</a> experts presented groundbreaking advances in voice AI, focusing on the development and deployment of voice-to-voice foundation models. This approach promises to overcome the limitations of traditional cascaded voice architectures, ushering in an era of seamless, multilingual, and emotionally aware voice interactions.
The limitations of cascaded architectures
Current state-of-the-art voice agents rely on a three-stage pipeline: speech-to-text (STT), a large language model (LLM), and text-to-speech (TTS). While effective, this cascaded architecture suffers from significant drawbacks, chiefly latency and error propagation. Each block in the pipeline adds its own latency, and the accumulated delay across the stages can reach 2.5 to 3 seconds, leading to a poor user experience, as the sketch below illustrates. Moreover, errors introduced at the STT stage propagate through the pipeline, compounding inaccuracies. This traditional architecture also discards critical paralinguistic features such as sentiment, emotion, and tone, resulting in monotonous, emotionally flat responses.
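The following minimal sketch shows why cascaded latencies add up: each stage blocks on the previous one, so end-to-end delay is the sum of all stage delays. The stage functions and their timings are illustrative stand-ins, not Gnani.ai's implementation.

```python
# Hypothetical cascaded STT -> LLM -> TTS pipeline with illustrative delays.
import time

def stt(audio: bytes) -> str:
    time.sleep(0.8)                # speech-to-text inference (illustrative timing)
    return "transcribed user utterance"

def llm(prompt: str) -> str:
    time.sleep(1.2)                # LLM response generation (illustrative timing)
    return f"response to: {prompt}"

def tts(text: str) -> bytes:
    time.sleep(0.7)                # speech synthesis (illustrative timing)
    return text.encode()

start = time.perf_counter()
reply_audio = tts(llm(stt(b"\x00" * 16000)))   # each stage waits on the previous one
elapsed = time.perf_counter() - start
print(f"end-to-end latency: {elapsed:.2f}s")   # ~2.7s: the sum of all stage latencies
```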
Introduction of voice-to-voice models
To address these limitations, Gnani.ai introduced a new voice-to-voice foundation model. This model processes and generates audio directly, eliminating the need for intermediate text representations. The key innovation lies in training a massive audio encoder on 1.5 million hours of labeled data across 14 languages, capturing nuances of emotion, empathy, and tone. The model uses a NEST XL encoder retrained with extensive data, plus an input audio projector layer that maps audio features into textual embeddings. For real-time streaming, audio and text features are interleaved, while non-streaming use cases rely on a combination layer. The LLM layer, initially based on Llama 8B, was extended to cover 14 languages, which required rebuilding the tokenizer. An output projector model generates mel spectrograms, enabling the creation of hyper-personalized voices. A structural sketch of this design follows.
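The PyTorch sketch below traces the described data flow: audio encoder, input projector into the LLM embedding space, LLM backbone, and output projector to mel spectrogram frames. All module choices and sizes are toy assumptions for illustration, not Gnani.ai's actual components or weights.

```python
# Minimal structural sketch of a voice-to-voice model (illustrative only).
import torch
import torch.nn as nn

class VoiceToVoiceModel(nn.Module):
    def __init__(self, audio_dim=256, llm_dim=512, n_mels=80):
        super().__init__()
        self.audio_encoder = nn.GRU(n_mels, audio_dim, batch_first=True)  # stand-in for the audio encoder
        self.input_projector = nn.Linear(audio_dim, llm_dim)              # audio features -> text embedding space
        self.llm = nn.TransformerEncoder(                                 # stand-in for the Llama-based backbone
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.output_projector = nn.Linear(llm_dim, n_mels)                # hidden states -> mel spectrogram frames

    def forward(self, mel_frames: torch.Tensor) -> torch.Tensor:
        encoded, _ = self.audio_encoder(mel_frames)
        embeddings = self.input_projector(encoded)   # audio tokens now live alongside text embeddings
        hidden = self.llm(embeddings)
        return self.output_projector(hidden)         # mel frames for a vocoder to render as audio

model = VoiceToVoiceModel()
out = model(torch.randn(1, 50, 80))                  # (batch, frames, mel bins)
print(out.shape)                                     # torch.Size([1, 50, 80])
```

The point of the design is that no text transcript ever needs to be materialized: audio features are projected directly into the LLM's embedding space, and the output projector drives speech synthesis from the LLM's hidden states.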
Key benefits and technical obstacles
The voice-to-voice model offers several significant benefits. First, it sharply reduces latency, from roughly 2 seconds to approximately 850-900 milliseconds for time to first token. Second, it improves accuracy by fusing ASR with the LLM layer, boosting performance especially on short and long utterances. Third, the model achieves emotional awareness by capturing and modeling tone, stress, and speaking rate. Fourth, it enables better interruption handling through contextual awareness, supporting more natural interactions. Finally, the model is designed to handle low-bandwidth audio effectively, which is crucial for telephone networks.

Building this model posed several challenges, chiefly the massive data requirements. The team created a crowd-sourced system with 4 million users to generate emotionally rich conversational data. They also leveraged foundation models for synthetic data generation and trained on 13.5 million hours of publicly available data. The final model comprises 9 billion parameters: 636 million for the audio input, 8 billion for the LLM, and 300 million for the TTS system.
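As a quick sanity check on the figures quoted above, the snippet below sums the reported parameter breakdown and computes the implied latency reduction. Only the quoted numbers are used; the midpoint of the latency range is an assumption for the calculation.

```python
# Arithmetic on the reported parameter budget and latency figures.
audio_encoder = 636e6   # audio input stack (reported)
llm_backbone  = 8e9     # LLM layer (reported)
tts_head      = 300e6   # TTS system (reported)

total = audio_encoder + llm_backbone + tts_head
print(f"total parameters: {total/1e9:.2f}B")      # ~8.94B, i.e. the ~9B quoted

cascade_latency_ms = 2000                         # quoted cascaded baseline
v2v_latency_ms = 875                              # assumed midpoint of the 850-900 ms range
print(f"latency reduction: {1 - v2v_latency_ms/cascade_latency_ms:.0%}")  # ~56%
```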
Nvidia's role in development
The development of this model relied heavily on the NVIDIA stack. NVIDIA NeMo was used to train the encoder models, and NeMo Curator facilitated synthetic text data generation. NVIDIA Eva was used to generate audio pairs, combining proprietary data with synthetic data.
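For readers unfamiliar with NeMo, the sketch below shows the toolkit's standard pattern for loading a pretrained speech model; the checkpoint name is a public illustrative model and the audio path is a placeholder, neither connected to Gnani.ai's proprietary encoder or training setup.

```python
# Minimal NeMo usage sketch: load a public pretrained ASR model and run it.
import nemo.collections.asr as nemo_asr

# Fetch a pretrained model by name (illustrative public checkpoint).
model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="stt_en_conformer_ctc_large"
)

# Transcribe a local 16 kHz mono WAV file (placeholder path).
transcripts = model.transcribe(["sample_utterance.wav"])
print(transcripts[0])
```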
Use cases
Gnani.ai showcased two main use cases: real-time language translation and customer support. The real-time translation demo featured an AI engine mediating a conversation between an English-speaking agent and a French-speaking customer. The customer support demo highlighted the model's ability to handle cross-lingual conversation, interruptions, and emotional nuances.
The voice-to-voice foundation model
The voice-to-voice foundation model represents a significant leap in voice AI. By eliminating the limitations of traditional architectures, it enables more natural, efficient, and emotionally aware voice interactions. As the technology continues to evolve, it promises to transform industries ranging from customer service to global communication.