One of the biggest obstacles facing automated speech recognition (ASR) systems is their inability to adapt to unlimited and novel domains. Audiovisual ASR (AV-ASR) improves the accuracy of ASR on multimodal video by drawing on visual signals, which is especially valuable when the audio is noisy. It is also invaluable for videos shot "in the wild," where the speaker's mouth may not even be in view. Models for this task are typically large, comprising both audio and visual encoders, while datasets for the task tend to be small.
Like other AV-ASR work, such models are typically trained and evaluated only on instructional videos. As experiments by the Google research team show, they perform poorly when applied to new domains after training on a single dataset. Meanwhile, several recently released large audio-only models have been heavily optimized through self-supervised pre-training and large-scale supervised training on audio-only data from audiobook corpora such as LibriLight and LibriSpeech. These models have billions of parameters, are widely available, and generalize well across domains. The idea is to recycle the huge investment in training these models by reusing their weights, inspired by recent work adapting frozen foundation models to a variety of domains.
While retaining the benefits of audio-only pretraining for zero-shot generalization, the resulting models integrate visual inputs in a lightweight fashion to enable AV-ASR. The AVFormer framework uses lightweight projection layers and trainable adapters to inject visual inputs into a frozen ASR model.
The researchers show that these modules can be trained with minimal additional parameters and training time on a modest amount of weakly labelled video data, which reduces the risk of domain shift and catastrophic forgetting associated with end-to-end fine-tuning. They also introduce a simple curriculum scheme during training, which they show is critical for the model to learn to process auditory and visual inputs jointly. Finally, they demonstrate that the model outperforms state-of-the-art zero-shot approaches on three multi-domain AV-ASR benchmarks while maintaining respectable performance on audio-only baselines.
The goal is zero-shot generalization across AV domains without sacrificing quality on audio-only benchmarks. A state-of-the-art ASR model is used as a starting point and adapted for unconstrained AV-ASR. Visual features derived from a strong pretrained visual model are incorporated through the following two elements (see the sketch after this list):
- A linear projection of visual features into the audio token embedding space.
- Lightweight adapters inserted into the frozen ASR encoder to enable domain adaptation.
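Below is a minimal PyTorch sketch of these two modules. The module names (VisualProjection, BottleneckAdapter) and the dimensions are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class VisualProjection(nn.Module):
    """Linearly projects frozen visual features (e.g. CLIP embeddings)
    into the audio token embedding space as a handful of visual tokens."""
    def __init__(self, visual_dim=768, audio_dim=512, num_tokens=4):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Linear(visual_dim, audio_dim * num_tokens)

    def forward(self, visual_feats):
        # visual_feats: (batch, visual_dim) -> (batch, num_tokens, audio_dim)
        out = self.proj(visual_feats)
        return out.view(visual_feats.size(0), self.num_tokens, -1)

class BottleneckAdapter(nn.Module):
    """Small residual bottleneck module inserted into each frozen encoder layer."""
    def __init__(self, dim=512, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, dim)
        self.act = nn.ReLU()

    def forward(self, x):
        # Residual connection keeps the frozen backbone's behaviour recoverable.
        return x + self.up(self.act(self.down(x)))
```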
Here are some of the most crucial parts of the architecture:
- A frozen Conformer encoder-decoder model
- A frozen visual encoder and trainable projection layers that extract and project visual features
- Lightweight adapter layers added to the frozen audio backbone for domain adaptation
To enable domain adaptation across modalities, the architecture combines a frozen Conformer encoder-decoder model and a frozen CLIP encoder (frozen layers are shown in gray with a lock symbol) with two lightweight trainable modules: a visual projection layer (shown in orange) and bottleneck adapters (shown in blue). The researchers propose a two-stage curriculum for training: the first phase trains the adapters (blue) without any visual tokens, and the second phase tunes the visual projection layer (orange) while keeping the rest of the model frozen.
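A hedged sketch of that two-stage curriculum is shown below. The loss callables, data loaders, and parameter collections referenced in the comments are placeholders assumed for illustration; the key point is that only the parameters passed to the optimizer are updated, while everything else stays frozen.

```python
import torch

def train_stage(trainable_params, data_loader, loss_fn, epochs=1, lr=1e-4):
    """Optimize only the given parameters; the rest of the model stays frozen."""
    optimizer = torch.optim.Adam(trainable_params, lr=lr)
    for _ in range(epochs):
        for batch in data_loader:
            loss = loss_fn(batch)      # forward pass returning the ASR loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Stage 1: train the bottleneck adapters on audio alone (no visual tokens).
# train_stage(adapter_params, audio_loader, asr_step_audio_only)
# Stage 2: train only the visual projection layer, with visual tokens enabled.
# train_stage(projection_params, video_loader, asr_step_with_visual_tokens)
```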
The researchers evaluate AVFormer's zero-shot performance on the How2, VisSpeech, and Ego4D AV-ASR benchmarks against BEST-RQ, the audio-only version of the model, and AVATAR, a state-of-the-art AV-ASR model. AVFormer outperforms both even when AVATAR and BEST-RQ are trained on LibriSpeech and the full HowTo100M dataset. Notably, this requires fine-tuning 600 million parameters for BEST-RQ but only 4 million for AVFormer, which therefore needs only a small subset of the training data (5% of HowTo100M). They also compare performance on the audio-only LibriSpeech benchmark, where AVFormer outperforms both baselines as well.
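The gap in trainable parameters can be made concrete with a generic PyTorch helper (illustrative, not tied to the released checkpoints):

```python
def count_parameters(model):
    """Return (trainable, total) parameter counts for a torch.nn.Module."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total
```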
Zero-shot performance is compared against the state of the art across multiple AV-ASR datasets, with performance on the audio-only LibriSpeech benchmark also reported. Lower WER percentages indicate better performance. While AVATAR and BEST-RQ are fine-tuned on all of HowTo100M, AVFormer's small set of tuned parameters lets it work effectively with as little as 5% of the dataset.
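WER counts the word-level substitutions, insertions, and deletions needed to turn a hypothesis transcript into the reference, divided by the reference length. Here is a minimal reference implementation, a simple sketch rather than the evaluation code used in the paper:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level edit distance; lower is better."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # ≈ 0.167
```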
The researchers present AVFormer, an effective method for converting frozen state-of-the-art ASR models into models suitable for AV-ASR. The approach is practical and effective, as demonstrated by its zero-shot performance. As ASR models grow in size and complexity, tuning the full set of parameters of pretrained models across domains becomes impractical. The method is parameter-efficient, allowing domain transfer and the fusion of visual inputs at the same time.
Check out the Paper and blog article. Don't forget to join our 23k+ ML SubReddit, Discord channel, and email newsletter, where we share the latest AI research news, exciting AI projects, and more. If you have any questions about the article above or if we missed anything, feel free to email us at asif@marktechpost.com
Dhanshree Shenwai is a Computer Engineer with solid experience in FinTech companies covering the Finance, Cards & Payments, and Banking domains, and a strong interest in AI applications. She is enthusiastic about exploring new technologies and advancements in today's changing world, making everyone's life easier.