Diffusion models are advancing rapidly and making life easier. From natural language processing and understanding to computer vision, diffusion models have shown promising results in almost every domain. These deep generative models, a recent development in generative AI, can produce realistic samples from complex data distributions.
Researchers have recently introduced a new diffusion model that can edit audio clips with ease. Called AUDIT, this latent diffusion model performs instruction-guided audio editing. Audio editing primarily involves transforming an input audio signal into an edited output, and includes tasks such as adding background sound effects, replacing background music, repairing incomplete audio, or enhancing low-quality audio. AUDIT takes both the input audio and a human instruction as conditions and produces the edited audio as output.
The researchers trained the audio editing diffusion model in a supervised manner on triplet data, where each triplet consists of an instruction, an input audio clip, and the corresponding output audio. The input audio is used directly as a conditional input so that segments which do not need editing stay consistent with the original, and the editing instruction is used directly as text guidance, which makes the model more flexible and better suited to real-world scenarios.
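To make the triplet setup concrete, here is a minimal sketch of how such (instruction, input audio, output audio) training data could be loaded. The JSON-lines manifest layout, field names, and use of torchaudio are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: a minimal triplet dataset in the spirit described
# above. File layout and field names are assumptions for this example.
import json
import torch
import torchaudio
from torch.utils.data import Dataset


class AudioEditTripletDataset(Dataset):
    """Yields (instruction, input_audio, output_audio) triplets for
    supervised training of an instruction-driven audio editing model."""

    def __init__(self, manifest_path: str, sample_rate: int = 16000):
        # Each manifest line is assumed to look like:
        # {"instruction": "add a dog barking", "input": "in.wav", "output": "out.wav"}
        with open(manifest_path) as f:
            self.items = [json.loads(line) for line in f]
        self.sample_rate = sample_rate

    def __len__(self):
        return len(self.items)

    def _load(self, path: str) -> torch.Tensor:
        wav, sr = torchaudio.load(path)
        if sr != self.sample_rate:
            wav = torchaudio.functional.resample(wav, sr, self.sample_rate)
        return wav.mean(dim=0)  # collapse to mono

    def __getitem__(self, idx):
        item = self.items[idx]
        return {
            "instruction": item["instruction"],          # text condition
            "input_audio": self._load(item["input"]),    # audio condition
            "output_audio": self._load(item["output"]),  # editing target
        }
```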
The team of researchers behind AUDIT have summarized their contributions as follows:
- AUDIT is the first diffusion model trained for audio editing that takes human text instructions as a condition.
- A data construction framework has been designed to train AUDIT in a supervised way.
- AUDIT is capable of maximizing the preservation of audio segments that do not require editing.
- AUDIT works well with simple instructions as text guidance and does not need a detailed description of the editing target.
- AUDIT has achieved remarkable results in both objective and subjective metrics for a number of audio editing tasks.
The team has shared several examples where AUDIT edited audio accurately: adding the sound of car horns to a clip, replacing the sound of laughter with a trumpet, removing the sound of a woman speaking from a recording of someone whistling, and so on. AUDIT performed strongly on both objective and subjective metrics across the following audio editing tasks (a usage sketch follows the list).
- Adding a sound event to an audio clip.
- Dropping (removing) a sound event from an audio clip.
- Replacing a sound event in the input audio with another sound.
- Inpainting: filling in a masked segment of the audio based on the surrounding context or a text instruction.
- Super-resolution: converting down-sampled input audio into output audio with a higher sampling rate.
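To illustrate how a single short text instruction could drive each of these tasks, here is a hypothetical usage sketch. `AuditModel`, its `edit` method, and the example instructions are invented for illustration and do not reflect the project's actual interface.

```python
# Hypothetical usage sketch for the five editing tasks listed above.
import torch


class AuditModel:
    """Stand-in for an instruction-conditioned latent diffusion audio editor."""

    def edit(self, input_audio: torch.Tensor, instruction: str) -> torch.Tensor:
        # A real model would encode the waveform to a latent representation,
        # run instruction-guided conditional denoising, and decode the result.
        # This stand-in simply returns the input unchanged.
        return input_audio


# One short instruction per task; no detailed scene description is needed.
example_instructions = {
    "add":              "add the sound of a car horn",
    "drop":             "remove the sound of a woman speaking",
    "replace":          "replace the laughter with a trumpet",
    "inpaint":          "fill in the masked segment to match the context",
    "super_resolution": "increase the sampling rate of this audio",
}

model = AuditModel()
clip = torch.zeros(16000)  # placeholder 1-second clip at 16 kHz
edits = {task: model.edit(clip, text) for task, text in example_instructions.items()}
```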
In conclusion, AUDIT looks like a promising approach that can simplify audio editing flexibly and effectively by following human instructions.
Check out the Paper and Project. All credit for this research goes to the researchers of this project.
Tanya Malhotra is a final-year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with strong analytical and critical thinking skills, along with a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.