Text-based large language models (LLMs) have shown impressive, even human-level, performance on several natural language processing tasks. Meanwhile, an LLM training paradigm known as instruction tuning has emerged, in which data are organized as pairs of user instructions and reference responses, allowing LLMs to comply with open-ended user commands. Researchers are increasingly interested in equipping LLMs with multimodal perception abilities. Current research focuses on connecting LLMs either to the encoder of one additional input type (such as images, silent video, audio events, or speech) or to encoders of several input types together.
A connection module and LLM adapters can be used to align the encoder output spaces with the LLM input space, and these components are typically trained with multimodal pretraining and instruction tuning. The Speech Audio Language Music Open Neural Network (SALMONN) proposed in this study is a single multimodal audio-text LLM that can perceive and understand the three main categories of sound: speech, audio events, and music. To improve performance on both speech and non-speech audio tasks, SALMONN employs a dual-encoder framework comprising the speech encoder from the Whisper model and the BEATs audio encoder.
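The article later identifies SALMONN's connection module as a window-level Q-Former. As a minimal sketch, assuming illustrative feature dimensions, window length, and query count rather than SALMONN's actual configuration, the snippet below shows how time-aligned features from the two encoders could be fused into a short sequence of tokens that an LLM can consume.

```python
# Minimal sketch (not the authors' code): a window-level Q-Former that fuses
# concatenated speech and audio-event features into LLM input tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowQFormer(nn.Module):
    """Cross-attends a few learnable queries to each short window of the
    concatenated encoder features, producing a compact token sequence that is
    linearly projected into the LLM embedding space."""
    def __init__(self, feat_dim=1792, llm_dim=4096, num_queries=1, window=17):
        super().__init__()
        self.window = window
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, feats):                          # feats: (batch, time, feat_dim)
        B, T, D = feats.shape
        pad = (-T) % self.window                       # pad so time splits into whole windows
        feats = F.pad(feats, (0, 0, 0, pad))
        n_win = feats.shape[1] // self.window
        windows = feats.view(B * n_win, self.window, D)
        queries = self.queries.unsqueeze(0).expand(B * n_win, -1, -1)
        fused, _ = self.cross_attn(queries, windows, windows)
        return self.proj(fused.reshape(B, -1, D))      # (batch, n_win * num_queries, llm_dim)

# Assumed usage: time-aligned Whisper and BEATs encoder outputs, concatenated
# along the feature dimension before the Q-Former.
speech_feats = torch.randn(2, 100, 1280)               # e.g. Whisper encoder output
audio_feats = torch.randn(2, 100, 512)                 # e.g. BEATs encoder output
audio_tokens = WindowQFormer()(torch.cat([speech_feats, audio_feats], dim=-1))
```

Applying the Q-Former per window rather than over the whole clip keeps the number of audio tokens roughly proportional to the input duration, which is helpful for long or variable-length recordings.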
Vicuna serves as the backbone LLM, and the low-rank adaptation (LoRA) approach is used as a cross-modal adapter that aligns Vicuna's augmented input space with its output space and further improves its performance. The window-level Q-Former and LoRA are trained on a number of speech, audio, and music tasks during the cross-modal pre-training and instruction tuning stages. The resulting multimodal LLMs show little or no emergent cross-modal ability and can be restricted to the specific types of tasks used in instruction tuning, notably speech recognition and audio captioning, which the authors term the task over-fitting problem. In this study, the ability to perform cross-modal tasks not seen during training is called cross-modal emergent ability; these skills are essentially the emergent capabilities of LLMs that are lost during instruction tuning.
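LoRA itself is a general-purpose technique. The snippet below is a rough, hypothetical illustration (not SALMONN's code) of how a frozen linear projection inside the LLM can be augmented with trainable low-rank matrices; the scaling factor it exposes is the quantity varied in the contributions listed below.

```python
# Hypothetical LoRA adapter around one frozen linear layer (rank and alpha are
# illustrative); only lora_a and lora_b receive gradients during tuning.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # keep the pretrained weight frozen
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)    # adapter starts as a zero update
        self.scale = alpha / rank             # the LoRA scaling factor

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Hypothetical usage on a 4096-dimensional attention projection:
layer = LoRALinear(nn.Linear(4096, 4096))
out = layer(torch.randn(1, 10, 4096))
```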
To restore these cross-modal emergent abilities without significant catastrophic forgetting of the training tasks, they propose adding an extra few-shot activation tuning stage to SALMONN training. SALMONN's cognitive hearing abilities are assessed on a wide range of speech, audio event, and music benchmarks. The tasks are grouped into three levels. The first level consists of eight tasks that are trained in instruction tuning, including speech recognition, translation, and audio captioning, while the other two levels evaluate untrained tasks. The second level comprises five speech-based natural language processing (NLP) tasks, such as slot filling and translation into untrained languages, which require high-quality multilingual alignments between speech and text tokens.
The third level of tasks, including audio-based storytelling and speech audio co-reasoning, requires understanding auditory information beyond speech. Experimental results demonstrate that SALMONN, as a single model, can complete all of these tasks and perform competitively on standard benchmarks. This suggests that it is feasible to build artificial intelligence that can “listen to” and understand a wide variety of audio inputs, including speech, audio events, and music.
The main contributions of this article can be summarized as follows.
• To the best of their knowledge, researchers from Tsinghua University and ByteDance offer SALMONN, the first multi-modal LLM that can recognize and understand general audio inputs, including speech, audio events, and music.
• By varying the scaling factor of LoRA, they investigate the presence of cross-modal emergent abilities, and they propose a low-cost activation tuning method as an extra training stage that can activate these abilities while alleviating catastrophic forgetting of the tasks seen during training (a minimal illustration of the scaling-factor probe follows this list).
• They propose two new tasks, audio-based storytelling and speech audio co-reasoning, and evaluate SALMONN on a range of tasks reflecting a spectrum of general hearing abilities.
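The scaling-factor experiment from the second bullet can be pictured with the hypothetical LoRALinear module sketched earlier: discounting the scale at test time shrinks the adapter's contribution, making it possible to probe how strongly the model's behaviour is tied to the instruction-tuned pathway. The helper and the generate call below are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical probe: lower the LoRA scaling factor at inference and observe
# how outputs on an untrained cross-modal prompt change.
def set_lora_scale(model, new_scale: float):
    for module in model.modules():
        if hasattr(module, "lora_a") and hasattr(module, "scale"):
            module.scale = new_scale

# Example sweep (generate() is a placeholder for the model's decoding routine):
# for s in (4.0, 2.0, 1.0, 0.5):
#     set_lora_scale(llm, s)
#     print(s, generate(llm, audio_tokens, "Translate the spoken English into German."))
```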
Review the Paper and GitHub. All credit for this research goes to the researchers of this project.
Aneesh Tickoo is a consulting intern at MarktechPost. She is currently pursuing her bachelor’s degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT) Bhilai. She spends most of her time working on projects aimed at harnessing the power of machine learning. Her research interest is image processing, and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.