Self-supervised learning plays a prominent role in building intelligent systems across artificial intelligence. Transformer models such as BERT and T5 have recently become popular thanks to their strong performance, and they rely on self-supervision for natural language processing tasks: the models are first pre-trained on massive amounts of unlabeled data and then fine-tuned on labeled samples. Although self-supervised learning has been applied successfully in fields including speech processing, computer vision, and natural language processing, its application to music audio remains largely unexplored. The main reason is the difficulty of modeling musical knowledge, in particular the tonal and pitched characteristics of music.
To address this problem, a team of researchers introduced MERT, an acoustic music understanding model with large-scale self-supervised training. MERT follows the masked language modeling (MLM) paradigm used by BERT: during pre-training, teacher models generate pseudo-labels for masked segments of audio, and the transformer encoder, acting as the student model, learns to predict them, which helps it build a better understanding of music audio.
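To make the teacher-student idea concrete, here is a minimal sketch of MLM-style pre-training on pseudo-labels. It is not the authors' implementation: the encoder architecture, the zero-masking strategy, and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative masked-prediction step: a frozen teacher has already turned the
# audio into discrete pseudo-labels, a mask hides some frames from the student,
# and the student encoder is trained to predict the teacher's labels at the
# masked positions. Module names and shapes are assumptions for illustration.

class StudentEncoder(nn.Module):
    def __init__(self, dim=768, codebook_size=1024):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.head = nn.Linear(dim, codebook_size)  # predicts pseudo-label ids

    def forward(self, frames):                      # frames: (batch, time, dim)
        return self.head(self.encoder(frames))      # (batch, time, codebook_size)

def masked_prediction_loss(student, frames, pseudo_labels, mask_prob=0.3):
    """frames: (B, T, D) float features; pseudo_labels: (B, T) teacher ids."""
    mask = torch.rand(frames.shape[:2]) < mask_prob  # choose frames to hide
    masked_frames = frames.clone()
    masked_frames[mask] = 0.0                        # simple zero-masking
    logits = student(masked_frames)
    # Only masked positions contribute to the loss, as in MLM-style training.
    return nn.functional.cross_entropy(logits[mask], pseudo_labels[mask])
```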
This affordable and generalizable pre-trained acoustic music model follows the self-supervised speech learning paradigm and employs teacher models to generate pseudo-targets for sequential audio clips, using a multi-task paradigm to balance acoustic and musical representation learning. To improve the robustness of the learned representations, MERT introduces an in-batch noise mixture augmentation technique: audio recordings are corrupted by mixing them with random clips, challenging the model to extract the relevant information even when the signal is obscured. This addition improves the model's ability to generalize to situations where music is mixed with irrelevant audio.
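A rough sketch of what such an in-batch mixing step could look like follows; the mixing probability and gain range are illustrative assumptions rather than the paper's exact recipe.

```python
import torch

def in_batch_noise_mix(waveforms, mix_prob=0.5, gain_range=(0.1, 0.5)):
    """Mix each waveform with a random clip drawn from the same batch.

    waveforms: (batch, samples) float tensor. The probability and gain range
    are illustrative assumptions, not the exact values used for MERT.
    """
    batch = waveforms.shape[0]
    perm = torch.randperm(batch)                        # pick a "noise" clip per example
    gains = torch.empty(batch, 1).uniform_(*gain_range) # how loud the mixed-in clip is
    do_mix = (torch.rand(batch, 1) < mix_prob).float()  # only augment some examples
    return waveforms + do_mix * gains * waveforms[perm]
```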
The team has assembled an effective combination of teacher models that outperforms conventional speech and audio approaches. It pairs an acoustic teacher based on a Residual Vector Quantization – Variational AutoEncoder (RVQ-VAE) with a musical teacher based on the Constant-Q Transform (CQT). The acoustic teacher uses the RVQ-VAE to provide a discretized acoustic-level summary of the musical signal, capturing its acoustic characteristics, while the CQT-based musical teacher focuses on the pitch and tonal aspects of the music. Together, these teachers guide the student model toward meaningful representations of music audio.
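The two kinds of teacher targets can be sketched as follows. The CQT side is straightforward to illustrate with librosa; the RVQ-VAE side is shown only as a hypothetical codec interface, since a trained neural codec would be needed to produce real acoustic tokens.

```python
import numpy as np
import librosa

def cqt_target(waveform, sr=24000, n_bins=84, bins_per_octave=12):
    """Constant-Q spectrogram used as a pitch/tonality-oriented target.
    Parameter values are common defaults, not necessarily MERT's settings."""
    cqt = librosa.cqt(waveform, sr=sr, n_bins=n_bins,
                      bins_per_octave=bins_per_octave)
    return np.abs(cqt).T            # (time, n_bins) magnitude per frame

def acoustic_tokens(waveform, codec):
    """Placeholder for the acoustic teacher: a residual-vector-quantized codec
    maps the waveform to discrete code ids per frame (hypothetical interface)."""
    return codec.encode(waveform)   # e.g. (time, n_codebooks) integer ids
```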
The team has also explored configurations to address the instability of acoustic language model pre-training. By optimizing these settings, they were able to scale MERT from 95M to 330M parameters, resulting in a more powerful model capable of capturing intricate details of music audio. In evaluation, the experimental results demonstrated MERT's effectiveness at generalizing across a wide range of music understanding tasks: the model achieved state-of-the-art (SOTA) scores on 14 different tasks, demonstrating both strong performance and generalizability.
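For readers who want to experiment, the released checkpoints can reportedly be loaded through Hugging Face Transformers. The sketch below assumes a checkpoint id of m-a-p/MERT-v1-330M and the custom-code loading path; consult the official repository for the exact usage.

```python
import torch
from transformers import AutoModel, Wav2Vec2FeatureExtractor

# Checkpoint id and trust_remote_code requirement are assumptions based on the
# public release; check the project's GitHub page for the official instructions.
model = AutoModel.from_pretrained("m-a-p/MERT-v1-330M", trust_remote_code=True)
processor = Wav2Vec2FeatureExtractor.from_pretrained("m-a-p/MERT-v1-330M")

waveform = torch.randn(24000 * 5)   # 5 seconds of dummy audio at 24 kHz
inputs = processor(waveform.numpy(), sampling_rate=24000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Hidden states from different layers can serve as features for downstream
# music-understanding tasks (tagging, key detection, etc.).
features = torch.stack(outputs.hidden_states).mean(dim=0)
```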
In conclusion, the MERT model addresses the gap in applying self-supervised learning to music audio.
Check out the Paper and the GitHub link for more details.
Tanya Malhotra is a final-year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with strong analytical and critical thinking skills, along with a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.