Google researchers have created an AI that can generate minute-long pieces of music from text prompts, and can even transform a whistled or hummed melody into other instruments, similar to how systems like DALL-E generate images based on written prompts (via TechCrunch). The model is called MusicLM, and while you can’t play with it yourself, the company has uploaded a bunch of samples it produced using the model.
The examples are impressive. There are 30-second snippets of what sound like actual songs created from paragraph-long descriptions that prescribe a specific genre, mood, and even instruments, as well as five-minute-long pieces generated from one or two words like “melodic techno.” Perhaps my favorite is a demo of “story mode,” where the model is basically given a script to morph between prompts. For example, this prompt (sketched out as plain data after the list):
electronic song played in a video game (0:00-0:15)
meditation song played by a river (0:15-0:30)
fire (0:30-0:45)
fireworks (0:45-0:60)
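That script is really just a sequence of text prompts paired with time ranges. Here’s a minimal sketch of how it could be written out as plain data (purely my own representation for illustration, not the format MusicLM actually consumes):

```python
from dataclasses import dataclass

@dataclass
class TimedPrompt:
    text: str       # description the model should follow during this window
    start_s: int    # second at which this prompt takes over
    end_s: int      # second at which the next prompt takes over

# The "story mode" script above, as data. Field names and structure are
# hypothetical; they are not MusicLM's actual input format.
story_script = [
    TimedPrompt("electronic song played in a video game", 0, 15),
    TimedPrompt("meditation song played by a river", 15, 30),
    TimedPrompt("fire", 30, 45),
    TimedPrompt("fireworks", 45, 60),
]

for p in story_script:
    print(f"{p.start_s:02d}s-{p.end_s:02d}s: {p.text}")
```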
It might not be for everyone, but I could genuinely see this being composed by a human (I also listened to it on loop dozens of times while writing this article). Also included on the demo site are examples of what the model outputs when asked to generate 10-second clips of instruments like the cello or maracas (the latter is one where the system does a relatively poor job), eight-second clips of a certain genre, music that would fit a prison break, and even what a beginner pianist would sound like compared to an advanced one. It also includes renditions of phrases like “futuristic club” and “accordion death metal.”
MusicLM can even simulate human voices, and while it seems to get the tone and overall sound of the voices right, there’s a quality to them that’s definitely off. The best way I can describe it is that they sound grainy or staticky. That quality isn’t as clear in the example above, but I think this one illustrates it pretty well:
That, by the way, is the result of asking it to make music that would play at a gym. You may also have noticed that the lyrics don’t make sense, but in a way you won’t necessarily pick up on if you’re not paying attention, like listening to someone singing in Simlish or that song that’s supposed to sound like English but isn’t.
I won’t pretend to know how Google achieved these results, but it has published a research paper explaining it in detail if you’re the kind of person who would understand this figure:
AI-generated music has a long history stretching back decades; there are systems that have been credited with composing pop songs, copying Bach better than a human could in the ’90s, and accompanying live performances. One recent approach uses the Stable Diffusion AI image-generation engine to turn text prompts into spectrograms, which are then turned into music. The paper says that MusicLM can outperform other systems in terms of its “quality and adherence to the caption,” as well as the fact that it can take in audio and copy the melody.
That last part is perhaps one of the best demonstrations the researchers came up with. The site lets you play the input audio, where someone hums or whistles a tune, then lets you hear how the model reproduces it as an electronic synth lead, string quartet, guitar solo, and so on. From the examples I’ve heard, it handles the task very well.
As with other forays into this type of AI, Google is being significantly more cautious with MusicLM than some of its peers may be with similar technology. “We have no plans to release models at this point,” the paper concludes, citing risks of “potential misappropriation of creative content” (read: plagiarism) and potential cultural appropriation or misrepresentation.
It’s always possible that the technology might show up at some point in one of Google’s fun music experiments, but for now, the only people who will be able to make use of the research are other people building AI music systems. Google says it’s publicly releasing a dataset with around 5,500 music and text pairs, which could help when training and evaluating other music AIs.
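For anyone who does end up building on that research, working with such a dataset would come down to iterating over music-text pairs. Here’s a minimal sketch, assuming the pairs ship as a CSV of audio clip identifiers, time offsets, and captions (the file name and column names are my assumption, not Google’s actual schema):

```python
import csv
from collections import Counter

def load_pairs(path: str) -> list[dict]:
    """Load music-text pairs from a CSV with hypothetical columns
    clip_id, start_s, end_s, and caption."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

if __name__ == "__main__":
    pairs = load_pairs("music_text_pairs.csv")  # hypothetical file name
    print(f"Loaded {len(pairs)} music-text pairs")

    # Quick look at caption lengths, the kind of sanity check you'd run
    # before using the pairs to train or evaluate a text-to-music model.
    lengths = Counter(len(row["caption"].split()) // 10 * 10 for row in pairs)
    for bucket, count in sorted(lengths.items()):
        print(f"{bucket:>3}-{bucket + 9:<3} words: {count} captions")
```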