In the field of text-to-music synthesis, the quality of generated audio has been advancing, but fine-grained control over musical attributes remains largely unexplored. A team of researchers from the Singapore University of Technology and Design and Queen Mary University of London presented a solution to this challenge, called Mustango, which extends Tango’s text-to-audio model with the aim of controlling the generated music not only with general text captions but with richer captions that contain specific instructions related to chords, beats, tempo, and key.
The researchers present Mustango as a text-to-music system inspired by music domain knowledge and based on diffusion models. They highlight the unique challenges of generating music directly with a diffusion model, emphasizing the need to balance alignment with the conditioning text against musicality. Mustango allows musicians, producers, and sound designers to create music clips under specific conditions such as chord progression, tempo, and key.
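To make this concrete, the snippet below sketches what such a prompt-conditioned generation call could look like. This is a hypothetical usage sketch, not the confirmed API: the `Mustango` class name, the "declare-lab/mustango" checkpoint identifier, the `generate` signature, and the output sample rate are all assumptions for illustration; the project’s GitHub README documents the actual interface.

```python
# Hypothetical usage sketch: class name, checkpoint id, generate()
# signature, and 16 kHz sample rate are assumptions, not the confirmed API.
from mustango import Mustango
import soundfile as sf

model = Mustango("declare-lab/mustango")

# A control-rich caption in the spirit described above: free text plus
# explicit chord, tempo, and key instructions.
prompt = (
    "A mellow acoustic guitar piece. The chord progression is G, Em, C, D. "
    "The tempo is 90 beats per minute and the key is G major."
)

audio = model.generate(prompt)
sf.write("mustango_sample.wav", audio, samplerate=16000)
```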
As part of Mustango, the researchers propose MuNet, a music-domain-informed UNet submodule. MuNet integrates music-specific features predicted from the text prompt, including chords, beats, key, and tempo, into the diffusion denoising process. To overcome the limited availability of open datasets pairing music with text captions, the researchers introduce a novel data-augmentation method. This method alters the harmonic, rhythmic, and dynamic aspects of musical audio and uses music information retrieval (MIR) methods to extract musical features, which are then appended to the existing text descriptions, resulting in the MusicBench dataset. A minimal sketch of this feature-extraction idea follows below.
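The sketch below illustrates the kind of MIR feature extraction such a pipeline relies on, using the librosa library to pull tempo, beat positions, and an approximate key from raw audio. The paper uses its own extraction tooling; the Krumhansl-Schmuckler-style key estimate here is a simplified stand-in, and only major keys are considered.

```python
# Minimal MIR feature-extraction sketch (illustrative, not the paper's
# actual pipeline): tempo and beats via librosa's beat tracker, and a
# rough key estimate via chroma/key-profile correlation.
import numpy as np
import librosa

def extract_features(path: str) -> dict:
    y, sr = librosa.load(path, sr=None)

    # Tempo (BPM) and beat positions from onset-based beat tracking.
    tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    tempo = float(np.atleast_1d(tempo)[0])  # newer librosa may return an array
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)

    # Rough key estimate: correlate the time-averaged chroma vector with
    # the Krumhansl-Schmuckler major-key profile in all 12 rotations.
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr).mean(axis=1)
    major_profile = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                              2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
    pitch_names = ["C", "C#", "D", "D#", "E", "F",
                   "F#", "G", "G#", "A", "A#", "B"]
    scores = [np.corrcoef(np.roll(major_profile, i), chroma)[0, 1]
              for i in range(12)]
    key = pitch_names[int(np.argmax(scores))] + " major"

    return {"tempo": tempo, "beats": beat_times.tolist(), "key": key}
```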
The MusicBench dataset contains over 52,000 instances, enriching the original text descriptions with beat and downbeat placement, the underlying chord progression, key, and tempo. The researchers conduct extensive experiments demonstrating that Mustango achieves state-of-the-art musical quality. They emphasize Mustango’s controllability through music-specific text prompts, showing superior performance in capturing desired chords, beats, keys, and tempo across multiple datasets. They also evaluate how the system behaves when the control sentences are absent from the prompt and observe that Mustango outperforms Tango in such cases, indicating that the control predictors do not compromise performance.
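The caption-enrichment step itself can be pictured as appending control sentences built from the extracted attributes to an existing free-text description. The sketch below shows the idea under that assumption; the wording of the control sentences is illustrative, not copied from the dataset.

```python
# Sketch of MusicBench-style caption enrichment: append control sentences
# built from extracted musical attributes to an existing description.
# Sentence templates here are illustrative assumptions.
def enrich_caption(caption: str, tempo: float, key: str, chords: list[str]) -> str:
    control_sentences = [
        f"The tempo is around {tempo:.0f} beats per minute.",
        f"The key is {key}.",
        f"The chord progression is {', '.join(chords)}.",
    ]
    return caption.rstrip(".") + ". " + " ".join(control_sentences)

print(enrich_caption(
    "A calm piano melody with soft strings",
    tempo=72.0, key="C major", chords=["C", "Am", "F", "G"],
))
```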
The experiments include comparisons with baselines such as Tango and with Mustango variants, demonstrating the effectiveness of the proposed data-augmentation approach in improving performance. Trained from scratch, Mustango stands out as the best performer, surpassing Tango and the other variants in terms of audio quality, rhythmic presence, and harmony. Mustango has 1.4 billion parameters, considerably more than Tango.
In conclusion, the researchers present Mustango as a significant advance in text-to-music synthesis. They address the controllability gap in existing systems and demonstrate the effectiveness of their proposed method through extensive experiments. Mustango not only achieves state-of-the-art musical quality but also provides improved controllability, making it a valuable contribution to the field. The researchers release the MusicBench dataset, providing a resource for future research on text-to-music synthesis.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project. Also, don’t forget to join our 33k+ ML SubReddit, 41k+ Facebook community, Discord channel, and email newsletter, where we share the latest news on AI research, interesting AI projects, and more.
If you like our work, you’ll love our newsletter.
Pragati Jhunjhunwala is a Consulting Intern at MarktechPost. She is currently pursuing a B.Tech at the Indian Institute of Technology (IIT), Kharagpur. She is a technology enthusiast with a keen interest in data science software and applications. She is always reading about advancements in different fields of AI and ML.