Music is an art composed of harmony, melody, and rhythm that permeates all aspects of human life. With the flourishing of deep generative models, music generation has drawn a lot of attention in recent years. As a leading class of generative models, Language Models (LMs) have shown an extraordinary ability to model complex relationships in long-term contexts. In light of this, AudioLM and many subsequent works successfully applied LMs to audio synthesis. Concurrently with LM-based approaches, diffusion probabilistic models (DPMs), another competitive class of generative models, have also shown exceptional abilities in synthesizing speech, sounds, and music.
However, generating music from free-form text remains a challenge, as music descriptions can be diverse and may relate to genres, instruments, tempo, scenarios, or even subjective feelings.
Traditional text-to-music generation models often focus on specific properties, such as audio continuation or rapid sampling, while some models prioritize robust evaluation, which is occasionally performed by experts in the field, such as music producers. Furthermore, most are trained on large-scale music datasets and have demonstrated state-of-the-art generative performance, with high fidelity and adherence to various aspects of the text prompt.
However, the success of these methods, such as MusicLM or Noise2Music, comes at high computational cost, which severely hampers their practicality. By comparison, other approaches built on top of DPMs have made efficient sampling of high-quality music possible, but their demonstrated cases were comparatively short and showed limited within-sample dynamics. Aiming for a feasible music creation tool, high efficiency of the generative model is essential, as it facilitates interactive creation that takes human feedback into account, as in a previous study.
While both LM and DPM showed promising results, the relevant question is not whether one should be preferred over the other, but whether it is possible to take advantage of both approaches at the same time.
Based on this motivation, the authors developed an approach called MeLoDy. An overview of the strategy is presented in the figure below.
After analyzing the success of MusicLM, the authors take advantage of the highest-level LM in MusicLM, called the semantic LM, to model the semantic structure of music, determining the overall arrangement of melody, rhythm, dynamics, timbre, and tempo. Conditioned on this semantic LM, they exploit the non-autoregressive nature of DPMs to model the acoustics efficiently and effectively, with the help of a successful sampling acceleration technique.
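To make this division of labor concrete, here is a minimal, hypothetical PyTorch sketch of such a two-stage pipeline: an autoregressive LM samples discrete semantic tokens one at a time, and a non-autoregressive diffusion-style decoder then refines all conditioned latent frames in parallel over a small, fixed number of steps. All class names, shapes, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SemanticLM(nn.Module):
    """Toy autoregressive LM over discrete semantic tokens (illustrative only)."""
    def __init__(self, vocab=1024, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    @torch.no_grad()
    def sample(self, tokens, steps):
        # Autoregressive loop: one semantic token per forward pass.
        for _ in range(steps):
            causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
            h = self.backbone(self.embed(tokens), mask=causal)
            probs = self.head(h[:, -1]).softmax(-1)  # next-token distribution
            tokens = torch.cat([tokens, torch.multinomial(probs, 1)], dim=1)
        return tokens

class DiffusionDecoder(nn.Module):
    """Toy non-autoregressive denoiser conditioned on the semantic tokens."""
    def __init__(self, vocab=1024, dim=64):
        super().__init__()
        self.dim = dim
        self.cond = nn.Embedding(vocab, dim)
        self.net = nn.Sequential(nn.Linear(2 * dim, 256), nn.GELU(), nn.Linear(256, dim))

    @torch.no_grad()
    def sample(self, tokens, n_steps=6):
        # Every latent frame is refined in parallel at each step, so the cost
        # scales with the (small) step count rather than the sequence length.
        x = torch.randn(tokens.size(0), tokens.size(1), self.dim)
        c = self.cond(tokens)
        for _ in range(n_steps):
            x = x - self.net(torch.cat([x, c], dim=-1))  # crude denoising update
        return x

lm, decoder = SemanticLM(), DiffusionDecoder()
semantic = lm.sample(torch.zeros(1, 8, dtype=torch.long), steps=24)  # (1, 32) token ids
latents = decoder.sample(semantic, n_steps=6)                        # (1, 32, 64) frames
```

The efficiency point sits in `DiffusionDecoder.sample`: its cost grows with the number of denoising steps, not with the sequence length, which is what makes the DPM stage attractive once the semantic LM has fixed the high-level structure.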
Furthermore, instead of adopting the classical diffusion process, the authors propose the so-called dual-path diffusion (DPD) model. In fact, working directly on the raw data would drastically increase computational costs. The proposed solution is to reduce the raw data to a low-dimensional latent representation: operating on fewer dimensions lowers the cost of each operation and therefore decreases model execution time. The raw data can then be reconstructed from the latent representation via a pre-trained autoencoder.
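As a rough illustration of this latent-space idea, below is a hedged sketch of an audio autoencoder round trip: the waveform is compressed with strided convolutions, diffusion would operate on the compact latent tensor, and the decoder maps the result back to raw audio. The `AudioAutoencoder` below is a hypothetical stand-in, not MeLoDy's actual pretrained autoencoder, so treat the architecture and numbers as assumptions.

```python
import torch
import torch.nn as nn

class AudioAutoencoder(nn.Module):
    """Hypothetical autoencoder: strided 1-D convs shrink the waveform
    ~64x along time before diffusion ever touches the signal."""
    def __init__(self, latent_dim=16):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=16, stride=8, padding=4), nn.GELU(),
            nn.Conv1d(32, latent_dim, kernel_size=16, stride=8, padding=4),
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose1d(latent_dim, 32, kernel_size=16, stride=8, padding=4), nn.GELU(),
            nn.ConvTranspose1d(32, 1, kernel_size=16, stride=8, padding=4),
        )

    def encode(self, wav):  # (B, 1, n_samples) -> (B, latent_dim, n_frames)
        return self.enc(wav)

    def decode(self, z):    # map latents back to a raw waveform
        return self.dec(z)

ae = AudioAutoencoder()
wav = torch.randn(1, 1, 65536)       # ~3 s of 22 kHz audio, stand-in data
z = ae.encode(wav)                   # (1, 16, 1024): diffusion would run here
print(wav.numel(), "->", z.numel())  # 65536 -> 16384: 4x fewer values per step
recon = ae.decode(z)                 # (1, 1, 65536) reconstructed audio
```

Whatever the exact architecture of the pretrained autoencoder, the compute argument is the same: every denoising step now touches a tensor a fraction of the size of the raw waveform.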
Some output samples produced by the model are available at the following link: https://ficient-melody.github.io/. The code is not yet available, which means that, for the moment, it is not possible to test it, either online or locally.
This was the summary of MeLoDy, an efficient LM-guided diffusion model that generates state-of-the-art quality music audio. If you are interested, you can learn more about this technique in the original paper.
Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He currently works at the Christian Doppler ATHENA Laboratory and his research interests include adaptive video streaming, immersive media, machine learning and QoS / QoE evaluation.