In a surprising announcement that resonated throughout the technology world, Kyutai introduced Moshi, a revolutionary real-time native multimodal foundation model. This innovative model mirrors, and in some respects surpasses, features introduced by OpenAI's GPT-4o in May.
Moshi is designed to understand and express emotions, with capabilities such as speaking in different accents, including French. It can listen to and generate audio and speech while maintaining a running stream of textual thoughts, as Kyutai puts it. One of Moshi's standout features is its ability to handle two audio streams at once, allowing it to listen and speak at the same time. This real-time interaction is underpinned by joint pre-training on a mix of text and audio, leveraging synthetic text data from Helium, a 7-billion-parameter language model developed by Kyutai.
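To make the dual-stream idea concrete, here is a minimal Python sketch of how such a loop might be structured. The frame duration, field names, and placeholder values are illustrative assumptions, not Kyutai's actual implementation:

```python
# Minimal sketch of dual-stream stepping (illustrative only; frame size
# and names are assumptions, not Kyutai's actual API).
from dataclasses import dataclass

FRAME_MS = 80  # hypothetical codec frame duration

@dataclass
class Step:
    user_audio_frame: bytes   # incoming stream, consumed every step
    model_audio_frame: bytes  # outgoing stream, produced every step
    text_token: str           # parallel textual "thought" token

def dialogue_loop(mic_frames):
    """Both streams advance on every step, so the model can speak
    while it is still listening; there is no turn-taking barrier."""
    for frame in mic_frames:
        # In the real model, one transformer step would jointly predict
        # the next audio frame and the next text token.
        yield Step(user_audio_frame=frame,
                   model_audio_frame=b"\x00" * 160,  # placeholder frame
                   text_token="<pad>")

for step in dialogue_loop([b"\x01" * 160] * 3):
    print(len(step.user_audio_frame), step.text_token)
```

The key design point this illustrates is that listening and speaking are not alternating phases but two channels advancing in lockstep.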
Moshi's fine-tuning process involved 100,000 synthetic "speech-style" conversations converted with Text-to-Speech (TTS) technology. The model's voice was trained on synthetic data generated by a separate TTS model, and the system achieves an impressive end-to-end latency of 200 milliseconds. Remarkably, Kyutai has also developed a smaller variant of Moshi that can run on a MacBook or a consumer-grade GPU, making it accessible to a wider range of users.
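For intuition on what a 200-millisecond budget has to cover, here is a rough back-of-the-envelope breakdown. Every component figure below is an assumption chosen for illustration; Kyutai has only published the end-to-end number:

```python
# Back-of-the-envelope latency budget (all component values are
# assumptions; only the 200 ms total comes from Kyutai).
codec_frame_ms = 80   # assumed audio frame size
model_step_ms  = 40   # assumed model forward pass per frame
audio_io_ms    = 40   # assumed capture + playback buffering
network_rtt_ms = 40   # assumed round trip for a hosted demo

total = codec_frame_ms + model_step_ms + audio_io_ms + network_rtt_ms
print(f"estimated end-to-end latency: {total} ms")  # 200 ms
```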
Kyutai has emphasized the importance of responsible AI use by incorporating watermarking to detect AI-generated audio, a feature that is currently in development. The decision to launch Moshi as an open-source project underscores Kyutai's commitment to transparency and collaborative development within the AI community.
At its core, Moshi is powered by a 7-billion-parameter multimodal language model that processes voice input and output. The model uses a dual-channel I/O system, generating text tokens and audio tokens in parallel. The base text language model, Helium 7B, was trained from scratch and then jointly trained on text and audio. The audio codec is based on Kyutai's in-house Mimi model, which achieves a 300x compression factor while capturing both semantic and acoustic information.
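To see what a 300x compression factor means in practice, consider the implied bitrate. The input format below (24 kHz, 16-bit mono) is an assumption for illustration; only the compression factor comes from the announcement:

```python
# What a 300x compression factor implies for bitrate (sample rate and
# bit depth are assumed; only the 300x figure is from Kyutai).
sample_rate_hz = 24_000   # assumed mono input
bits_per_sample = 16      # assumed PCM depth

raw_bps = sample_rate_hz * bits_per_sample   # 384,000 bit/s raw audio
compressed_bps = raw_bps / 300               # ~1,280 bit/s after the codec
print(f"raw: {raw_bps/1000:.0f} kbit/s -> "
      f"compressed: {compressed_bps/1000:.2f} kbit/s")
```

Under these assumptions, the codec squeezes speech down to roughly 1.3 kbit/s, which is what makes streaming two audio channels alongside text tokens tractable in real time.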
Moshi's training involved a rigorous process of fine-tuning on 100,000 highly detailed transcripts annotated with emotion and style. The text-to-speech engine, which supports 70 different emotions and speaking styles, was fine-tuned on 20 hours of audio recorded by a licensed voiceover artist named Alice. The model is designed to be adaptable and can be fine-tuned with less than 30 minutes of audio.
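As a rough illustration of what emotion- and style-annotated transcripts might look like, here is a hypothetical record. The schema and field names are invented, since Kyutai has not published the data format:

```python
# Hypothetical shape of one annotated training record (field names are
# invented for illustration; the actual schema is unpublished).
record = {
    "conversation_id": "synthetic-000001",
    "turns": [
        {"speaker": "user", "text": "Can you whisper the answer?",
         "style": "neutral"},
        {"speaker": "moshi", "text": "Of course, here it is...",
         "emotion": "playful", "style": "whispering"},
    ],
}
print(record["turns"][1]["style"])  # whispering
```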
Moshi's deployment demonstrates its efficiency. The demo model, hosted on the Scaleway and Hugging Face platforms, can handle a batch size of two with 24 GB of VRAM. It supports multiple backends, including CUDA, Metal, and CPU, and benefits from optimizations in its Rust inference code. Improved KV caching and request caching are expected to boost performance further.
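A quick sanity check shows why 24 GB can accommodate a 7-billion-parameter model serving a batch of two. The weight precision and overheads below are assumptions; only the parameter count and VRAM figure come from the article:

```python
# Rough VRAM accounting for the hosted demo (precision is an assumption;
# only "7B parameters" and "24 GB" come from the announcement).
params = 7e9
bytes_per_param = 2  # assuming bf16/fp16 weights

weights_gb = params * bytes_per_param / 1024**3  # ~13 GB for weights
budget_gb = 24
print(f"weights ~{weights_gb:.1f} GB, leaving ~{budget_gb - weights_gb:.1f} GB "
      "for KV cache and activations across two concurrent sessions")
```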
Looking ahead, Kyutai has ambitious plans for Moshi. The team aims to publish a white paper and open-source versions of the model, including the inference codebase, the 7B model, the audio codec, and the full optimized stack. Future iterations, such as Moshi 1.1, 1.2, and 2.0, will refine the model based on user feedback. Moshi's licensing aims to be as permissive as possible, encouraging widespread adoption and innovation.
In conclusion, Moshi exemplifies the potential of small, focused teams to achieve extraordinary advancements in AI technology. This model opens up new avenues for research assistance, idea sharing, language learning, and more, and demonstrates the transformative power of AI when deployed on-device with unparalleled flexibility. As an open-source model, it invites collaboration and innovation, ensuring the benefits of this revolutionary technology are accessible to all.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary engineer and entrepreneur, Asif is committed to harnessing the potential of AI for social good. His most recent initiative is the launch of an AI media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is technically sound and easily understandable to a wide audience. The platform has over 2 million monthly views, illustrating its popularity among the public.