For many applications, such as virtual and augmented reality, game creation, and video editing, it is crucial to produce sound, music, or voice effects based on specific criteria. Traditionally, signal processing techniques have been used to generate audio. In recent years, generative models have revolutionized this field, producing audio either unconditionally or conditioned on other modalities. Previous work focused primarily on tag-to-sound settings with a modest collection of labels, such as the ten sound classes in the UrbanSound8K dataset. Natural language, by contrast, is much more versatile than tags, as it can contain detailed descriptions of auditory cues (e.g., pitch, acoustic environment, and temporal order).
Text-to-Audio (TTA) generation is the task of producing audio from natural language descriptions. Because audio signals are high-dimensional, TTA systems typically build the generative model in a compact latent space to model the data efficiently. DiffSound follows this idea, using diffusion models to learn a compressed discrete representation of the mel spectrogram of an audio clip. AudioGen later replaced DiffSound's approach with an autoregressive model operating in a discrete waveform space. Inspired by Stable Diffusion, which employs latent diffusion models (LDMs) to produce high-quality images, the authors instead investigate LDMs for TTA generation in a continuous latent representation rather than learning discrete representations.
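To make the "compact latent space" idea concrete, below is a minimal sketch (not the authors' code) of compressing a mel spectrogram into a continuous latent with a small convolutional VAE, the kind of space in which a latent diffusion model would operate. All module names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MelVAE(nn.Module):
    def __init__(self, latent_channels: int = 8):
        super().__init__()
        # Encoder: downsample the (1, mel_bins, frames) spectrogram 4x along each axis.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.to_mu = nn.Conv2d(64, latent_channels, 1)
        self.to_logvar = nn.Conv2d(64, latent_channels, 1)
        # Decoder: mirror the encoder back up to a mel spectrogram.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1),
        )

    def forward(self, mel):
        h = self.encoder(mel)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.decoder(z), mu, logvar

mel = torch.randn(2, 1, 64, 256)   # a batch of toy mel spectrograms
recon, mu, logvar = MelVAE()(mel)
print(mu.shape)                    # compact continuous latent, e.g. (2, 8, 16, 64)
```

A diffusion model trained in this smaller latent space is far cheaper than one operating directly on the full spectrogram or waveform, which is the main appeal of the LDM approach.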
In addition, they study and achieve various zero-shot text-guided audio manipulations with LDMs, which have not been explored before, since manipulations such as style transfer are also in demand for audio. A significant barrier to generation quality for earlier TTA studies is the need for large-scale, high-quality audio-text pairs, which are typically not readily available and are restricted in both quality and quantity. Various text preprocessing techniques have been proposed to make better use of data with noisy captions. However, because these preprocessing steps discard the relationships between sound events (for example, a caption describing a dog barking in a park may be reduced to isolated tags such as "dog", "bark", and "park"), they inevitably constrain generation performance. This study addresses this problem by developing a technique that removes the need for paired audio-text data and requires only audio data to train the generative model.
This paper presents a TTA system called AudioLDM that benefits from computational efficiency and text-conditional audio manipulations while achieving state-of-the-art generation quality with continuous LDMs. In particular, AudioLDM learns to generate the audio prior in a latent space encoded by a mel-spectrogram-based variational autoencoder (VAE). An LDM conditioned on contrastive language-audio pretraining (CLAP) latent embeddings handles this prior generation. Because CLAP provides an audio-text-aligned embedding space, the condition for prior generation can come directly from audio, reducing the need for text data during LDM training.
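The following is a hedged sketch of that audio-only training trick: since CLAP maps audio and text into a shared embedding space, the diffusion prior can be conditioned on CLAP audio embeddings at training time and on CLAP text embeddings at inference time. The modules `clap_audio_encoder`, `clap_text_encoder`, and `denoiser` below are simple placeholders, not AudioLDM's actual architecture or API.

```python
import torch
import torch.nn as nn

clap_audio_encoder = nn.Linear(128, 512)   # stand-in for CLAP's audio branch
clap_text_encoder = nn.Linear(300, 512)    # stand-in for CLAP's text branch
denoiser = nn.Linear(512 + 64 + 1, 64)     # stand-in for the latent-space denoising network

def training_step(latent, audio_features):
    """One denoising-objective step conditioned only on audio (no captions needed)."""
    cond = clap_audio_encoder(audio_features)          # audio-derived condition
    t = torch.rand(latent.shape[0], 1)                 # random diffusion time
    noise = torch.randn_like(latent)
    noisy = (1 - t) * latent + t * noise               # simplified noising schedule
    pred = denoiser(torch.cat([cond, noisy, t], dim=-1))
    return nn.functional.mse_loss(pred, noise)

def sample(text_features, steps=10):
    """At inference, a text embedding replaces the audio embedding as the condition."""
    cond = clap_text_encoder(text_features)
    z = torch.randn(text_features.shape[0], 64)
    for i in reversed(range(1, steps + 1)):
        t = torch.full((z.shape[0], 1), i / steps)
        pred_noise = denoiser(torch.cat([cond, z, t], dim=-1))
        z = z - pred_noise / steps                     # crude Euler-style update
    return z                                           # latent to decode with the VAE

loss = training_step(torch.randn(4, 64), torch.randn(4, 128))
latent = sample(torch.randn(2, 300))
print(loss.item(), latent.shape)
```

The key design choice is that the text branch is only touched at sampling time, so the generative model itself never needs captioned audio during training.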
They show that training the LDM with only audio is sometimes more effective than training with pairs of audio and text data. On the AudioCaps dataset, the proposed AudioLDM outperforms the DiffSound baseline by a significant margin, achieving state-of-the-art TTA performance with a Fréchet distance (FD) of 23.31. Meanwhile, their method enables zero-shot audio manipulations during the sampling process. In summary, their contributions are the following:
• They present the first attempt to build a continuous LDM for TTA generation and outperform current techniques on both subjective and objective metrics.
• Without employing language-audio pairs to train the LDM, they generate TTA using CLAP latents.
• They demonstrate experimentally that a high-quality and computationally efficient TTA system can be built using only audio data during LDM training.
• They demonstrate that, without fine-tuning the model for a particular task, their proposed TTA system can perform text-guided audio manipulations such as style transfer, super-resolution, and inpainting (see the sketch after this list). The code can be accessed on GitHub.
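Below is an illustrative sketch of how such zero-shot, text-guided edits can work with a trained latent diffusion model; it is an assumed mechanism for explanation, not the released implementation. Style transfer partially noises the source latent and then denoises it under the new text condition, while inpainting regenerates only the masked region and re-imposes the known latents at every step. The `toy_denoiser` stands in for a trained, text-conditioned denoiser.

```python
import torch

def style_transfer(denoise_fn, source_latent, text_cond, strength=0.5, steps=50):
    """Noise the source latent part-way, then denoise it toward the text prompt."""
    start = int(steps * strength)
    z = source_latent + torch.randn_like(source_latent) * (start / steps)
    for i in reversed(range(1, start + 1)):
        z = denoise_fn(z, text_cond, t=i / steps)      # one reverse-diffusion step
    return z

def inpaint(denoise_fn, known_latent, mask, text_cond, steps=50):
    """Generate the masked region while keeping the observed latents fixed each step."""
    z = torch.randn_like(known_latent)
    for i in reversed(range(1, steps + 1)):
        z = denoise_fn(z, text_cond, t=i / steps)
        z = mask * known_latent + (1 - mask) * z       # re-impose the known region
    return z

# Toy stand-in for a trained, text-conditioned denoiser.
def toy_denoiser(z, cond, t):
    return z * (1 - 0.1 * t) + 0.01 * cond

z0 = torch.randn(1, 8, 16, 64)
cond = torch.randn(1, 8, 16, 64)
mask = (torch.rand_like(z0) > 0.5).float()
print(style_transfer(toy_denoiser, z0, cond).shape, inpaint(toy_denoiser, z0, mask, cond).shape)
```

Because these edits only change how the sampler is run, no task-specific fine-tuning of the model is required, which is what makes them zero-shot.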
Check out the Paper, Project, and GitHub. All credit for this research goes to the researchers of this project.
Aneesh Tickoo is a consulting intern at MarktechPost. She is currently pursuing her bachelor’s degree in Information Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects aimed at harnessing the power of machine learning. Her research interest is image processing, and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.