Until now, most generative music models have produced mono sound. This means that MusicGen did not place any sounds or instruments on the left or right side of the stereo field, which results in a flatter, less lively mix. Stereo has been overlooked so far mainly because generating it is a non-trivial task.
As musicians, when we produce stereo signals, we have access to the individual instrument tracks in our mix and can place them wherever we want in the stereo field. MusicGen, however, does not generate its instruments separately; it produces one combined audio signal. Without access to these individual sources, creating convincing stereo sound is hard. Unfortunately, splitting an audio signal into its sources is itself a difficult problem (I’ve written a separate blog post, “AI Music Source Separation: How It Works and Why It Is So Hard”, about this) and the technology is not yet 100% reliable.
Therefore, Meta decided to incorporate stereo generation directly into the MusicGen model. Using a new data set consisting of stereo music, they trained MusicGen to produce stereo outputs. The researchers say that generating stereo does not incur additional computing costs compared to mono.
Although the stereo procedure is not described in much detail in the paper, my understanding is that it works like this (Figure 3): MusicGen has learned to generate two compressed audio signals (a left and a right channel) instead of a single mono signal. These compressed signals are then decoded separately and combined into the final stereo output. The reason this does not take twice as long is that MusicGen can now produce the two compressed channel signals in roughly the same time it previously needed for one.
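If you want to hear the difference yourself, here is a minimal sketch of generating stereo output with Meta's open-source audiocraft library. The checkpoint name "facebook/musicgen-stereo-small" and the exact prompt are assumptions based on the stereo models Meta has published; details may vary depending on your installed version.

```python
# Minimal sketch: generating stereo audio with MusicGen via audiocraft.
# Assumes the `audiocraft` package and a published stereo checkpoint
# such as "facebook/musicgen-stereo-small" (name may differ per release).
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-stereo-small")
model.set_generation_params(duration=10)  # seconds of audio to generate

# One text prompt -> one waveform; stereo checkpoints return two channels.
wav = model.generate(["warm lo-fi hip hop beat with soft piano"])
print(wav.shape)  # expected: (batch=1, channels=2, samples)

# Save as a loudness-normalized stereo WAV file.
audio_write("stereo_demo", wav[0].cpu(), model.sample_rate, strategy="loudness")
```

The key point is the channel dimension of the output: the mono checkpoints return a single channel, while the stereo checkpoints return two, with no noticeable difference in generation time.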
Being able to produce convincing stereo sound really sets MusicGen apart from other state-of-the-art models like MusicLM or Stable Audio. From my perspective, this “small” addition makes a big difference in how lively the generated music feels. Listen for yourselves (the difference may be hard to hear on smartphone speakers):
Mono
Stereo
MusicGen was impressive from the day it was launched. However, since then, Meta’s FAIR team has continually improved their product, allowing for higher quality results that sound more authentic. When it comes to text-to-music models that generate audio signals (not MIDI, etc.), MusicGen is ahead of its competitors from my perspective (as of November 2023).
Additionally, since MusicGen and all its related projects (EnCodec, AudioGen) are open source, they are an incredible source of inspiration and a ready-made framework for aspiring AI audio engineers. Looking at the improvements MusicGen has made in just six months, I can only imagine that 2024 will be an exciting year.
Another important point is that, with its transparent approach, Meta is also doing essential work for developers who want to integrate this technology into software for musicians. Generating samples, sketching musical ideas, or changing the genre of an existing piece: these are some of the interesting applications we are already starting to see. With a sufficient level of transparency, we can help build a future where AI makes music creation more exciting rather than merely threatening human musicality.