This paper was accepted to the NeurIPS 2023 Workshop on Diffusion Models.
We demonstrate how conditional generation from diffusion models can be used to tackle a variety of realistic tasks in the production of music in 44.1 kHz stereo audio with sampling-time guidance. The scenarios we consider include continuation, inpainting and regeneration of musical audio, the creation of smooth transitions between two different music tracks, and the transfer of desired stylistic characteristics to existing audio clips. We achieve this by applying guidance at sampling time in a simple framework that supports both reconstruction and classification losses, or any combination of the two. This approach ensures that the generated audio can match its surrounding context, or conform to a class distribution or latent representation specified relative to any suitable pre-trained classifier or embedding model.
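As an illustration of the idea, the following is a minimal, self-contained sketch of sampling-time guidance in PyTorch. The toy networks (`ToyDenoiser`, `ToyEmbedder`), the sampler update, and the guidance step size are assumptions made for the sake of a runnable example, not the models or schedule used in the paper: a reconstruction loss pulls the denoised estimate toward a known context on a masked region, an embedding loss pulls it toward a target classifier embedding, and their gradients steer the sampler.

```python
import torch

# Toy stand-ins for the pretrained networks assumed by the method; the names
# ToyDenoiser / ToyEmbedder and all hyperparameters are placeholders.
class ToyDenoiser(torch.nn.Module):
    """Predicts the clean signal x0 from a noisy input at noise level sigma."""
    def __init__(self, channels=2):
        super().__init__()
        self.net = torch.nn.Conv1d(channels, channels, kernel_size=5, padding=2)

    def forward(self, x_noisy, sigma):
        return self.net(x_noisy)

class ToyEmbedder(torch.nn.Module):
    """Maps audio to a fixed-size embedding (stand-in for a pretrained classifier)."""
    def __init__(self, channels=2, dim=32):
        super().__init__()
        self.proj = torch.nn.Linear(channels, dim)

    def forward(self, x):
        return self.proj(x.mean(dim=-1))

def guided_sample(denoiser, sigmas, shape, *, context=None, mask=None,
                  embedder=None, target_emb=None, w_rec=1.0, w_cls=1.0):
    """Deterministic sampler with sampling-time guidance: the predicted clean
    signal is pushed toward (i) the known context where mask == 1
    (reconstruction loss) and (ii) a target embedding under a pretrained
    embedder (classification/embedding loss)."""
    x = torch.randn(shape) * sigmas[0]
    for i, sigma in enumerate(sigmas[:-1]):
        x = x.detach().requires_grad_(True)
        x0_hat = denoiser(x, sigma)

        loss = torch.zeros(())
        if context is not None and mask is not None:
            loss = loss + w_rec * ((mask * (x0_hat - context)) ** 2).mean()
        if embedder is not None and target_emb is not None:
            sim = torch.nn.functional.cosine_similarity(
                embedder(x0_hat), target_emb, dim=-1)
            loss = loss + w_cls * (1.0 - sim).mean()

        grad = (torch.autograd.grad(loss, x)[0]
                if loss.requires_grad else torch.zeros_like(x))
        with torch.no_grad():
            d = (x - x0_hat) / sigma  # score-like direction
            # Euler step plus a heuristically scaled guidance gradient step.
            x = x + (sigmas[i + 1] - sigma) * d - sigma * grad
    return x.detach()

# Example: one second of 44.1 kHz stereo audio, guided only by a (random)
# target embedding; the reconstruction and embedding terms can be combined freely.
denoiser, embedder = ToyDenoiser(), ToyEmbedder()
sigmas = torch.linspace(1.0, 1e-3, steps=50)
audio = guided_sample(denoiser, sigmas, (1, 2, 44100),
                      embedder=embedder, target_emb=torch.randn(32))
```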
In Table 1 we show randomly chosen samples for a series of creative applications, each conditioned on a given audio prompt. For each task and prompt we show samples from the different models described in the paper.
Task types (a sketch after this list shows one way each task can be phrased as a mask and guidance target for the sampler):
- inpainting: replaces the middle two seconds of the prompt
- regeneration: regenerates the middle two seconds of the prompt
- continuation: generates a new continuation from the first 2.4 seconds of the prompt
- transitions: regenerates a blended section between two tracks
- guidance: generates a new clip conditioned on matching the prompt's classifier embedding
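As referenced above, each task amounts to a choice of context, mask, and guidance target for a sampler like the one sketched earlier. The helper names, window placement, and shapes below are illustrative assumptions; only the 2-second and 2.4-second windows come from the task list.

```python
import torch

SR = 44100  # 44.1 kHz stereo audio, as in the paper

def middle_mask(length, seconds=2.0, sr=SR):
    """Mask that is 1 on the known prompt and 0 on the middle `seconds` to be
    (re)generated -- one way to set up the inpainting / regeneration tasks."""
    mask = torch.ones(1, 2, length)
    mid, half = length // 2, int(seconds * sr) // 2
    mask[..., mid - half:mid + half] = 0.0
    return mask

def continuation_mask(length, seconds=2.4, sr=SR):
    """Keep the first `seconds` of the prompt fixed and generate the rest."""
    mask = torch.zeros(1, 2, length)
    mask[..., :int(seconds * sr)] = 1.0
    return mask

def transition_context(track_a, track_b, blend_seconds=2.0, sr=SR):
    """Context for a transition: the opening of track A followed by the ending
    of track B, with the blended section in between regenerated (mask == 0)."""
    length = track_a.shape[-1]
    context = torch.cat([track_a[..., :length // 2],
                         track_b[..., -(length - length // 2):]], dim=-1)
    return context, middle_mask(length, blend_seconds, sr)

# The guidance task uses no mask at all: only the embedding term of the
# sampler, with target_emb set to the classifier embedding of the prompt.
```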
| prompt | task | CQTDiff (baseline) | latent | waveform |
|---|---|---|---|---|
| (audio) | inpainting | (audio) | (audio) | (audio) |
| (audio) | inpainting | (audio) | (audio) | (audio) |
| (audio) | inpainting | (audio) | (audio) | (audio) |
| (audio) | regeneration | (audio) | (audio) | (audio) |
| (audio) | regeneration | (audio) | (audio) | (audio) |
| (audio) | regeneration | (audio) | (audio) | (audio) |
| (audio) | continuation | (audio) | (audio) | (audio) |
| (audio) | continuation | (audio) | (audio) | (audio) |
| (audio) | continuation | (audio) | (audio) | (audio) |
| (audio) | transitions | (audio) | (audio) | (audio) |
| (audio) | transitions | (audio) | (audio) | (audio) |
| (audio) | transitions | (audio) | (audio) | (audio) |
| (audio) | guidance | (audio) | (audio) | (audio) |
| (audio) | guidance | (audio) | (audio) | (audio) |
| (audio) | guidance | (audio) | (audio) | (audio) |
The prompts are taken from a test split of the Free Music Archive dataset, published by Michaël Defferrard et al. under a Creative Commons Attribution 4.0 International License (CC BY 4.0).