Automatic speech recognition (ASR) is a well-established technology that is widely adopted for various applications such as conference calls, transcription of broadcast video, and voice commands. While the challenges for this technology center on noisy audio inputs, the visual stream in multimodal videos (e.g., TV, online edited videos) can provide strong cues for improving the robustness of ASR systems; this is called audiovisual ASR (AV-ASR).
Although lip movement can provide strong cues for speech recognition and is the most common area of focus for AV-ASR, the mouth is often not directly visible in in-the-wild videos (for example, due to egocentric viewpoints, face coverings, and low resolution), and thus a new emerging area of research is unconstrained AV-ASR (e.g., AVATAR), which investigates the contribution of entire visual frames, not just the mouth region.
However, creating audiovisual datasets to train AV-ASR models is challenging. Datasets such as How2 and VisSpeech have been created from online instructional videos, but they are small in size. In contrast, the models themselves are often large and consist of both visual and audio encoders, so they tend to overfit these small datasets. Meanwhile, a number of large-scale audio-only models have recently been released that are heavily optimized via large-scale training on audio-only data obtained from audiobooks, such as LibriLight and LibriSpeech. These models contain billions of parameters, are readily available, and show strong generalization across domains.
With the above challenges in mind, in “AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR”, we present a simple method to augment existing large-scale audio-only models with visual information, while performing lightweight domain adaptation. AVFormer injects visual embeddings into a frozen ASR model (similar to how Flamingo injects visual information into large language models for vision-text tasks) using lightweight trainable adapters that can be trained on a small amount of weakly labeled video data with minimal additional training parameters and time. We also present a simple curriculum scheme during training, which we show is crucial in enabling the model to jointly process audio and visual information effectively. The resulting AVFormer model achieves state-of-the-art zero-shot performance on three different AV-ASR benchmarks (How2, VisSpeech, and Ego4D), while crucially preserving decent performance on traditional audio-only speech recognition benchmarks (i.e., LibriSpeech).
Unconstrained audiovisual speech recognition. We inject vision into a frozen speech model (BEST-RQ, in grey) for zero-shot audiovisual ASR via lightweight modules to create a parameter- and data-efficient model called AVFormer (blue). Visual context can provide useful cues for robust speech recognition, especially when the audio signal is noisy (the visual loaf of bread helps correct the audio-only error “clove” to “loaf” in the generated transcript).
Injecting vision using lightweight modules
Our goal is to add visual comprehension capabilities to an existing audio-only ASR model while maintaining its generalization performance across multiple domains (both AV and audio-only).
To achieve this, we augment an existing state-of-the-art ASR model (BEST-RQ) with the following two components: (i) a linear visual projector and (ii) lightweight adapters. The former projects visual features into the audio token embedding space. This allows the model to properly connect the separately pre-trained visual feature and audio input token representations. The latter then minimally adapts the model to handle the multimodal inputs from videos. We then train these add-on modules on unlabeled web videos from the HowTo100M dataset, using the outputs of an ASR model as pseudo ground truth, while keeping the rest of the BEST-RQ model frozen. These lightweight modules enable data efficiency and strong generalization of performance.
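To make the wiring concrete, below is a minimal PyTorch sketch of the two add-on modules around a frozen audio encoder. This is an illustrative sketch, not the released implementation: the class names, dimensions, and the assumption that the encoder exposes an iterable `layers` attribute are ours, and the actual model builds on a BEST-RQ encoder-decoder with precomputed visual features.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Lightweight residual adapter applied after each frozen encoder layer."""
    def __init__(self, dim: int, bottleneck: int = 256):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # residual connection


class AVFormerSketch(nn.Module):
    """Frozen audio encoder plus two trainable add-ons: visual projector and adapters."""
    def __init__(self, frozen_audio_encoder, audio_dim=1024, visual_dim=768):
        super().__init__()
        self.audio_encoder = frozen_audio_encoder
        for p in self.audio_encoder.parameters():
            p.requires_grad = False  # keep the pre-trained ASR weights frozen
        # (i) Linear projector: maps visual features into the audio token space.
        self.visual_projector = nn.Linear(visual_dim, audio_dim)
        # (ii) Bottleneck adapters: one per frozen encoder layer (assumes `.layers`).
        self.adapters = nn.ModuleList(
            BottleneckAdapter(audio_dim) for _ in self.audio_encoder.layers)

    def forward(self, audio_tokens, visual_features=None):
        # audio_tokens: (B, T_audio, audio_dim); visual_features: (B, T_vis, visual_dim)
        if visual_features is not None:
            visual_tokens = self.visual_projector(visual_features)
            tokens = torch.cat([visual_tokens, audio_tokens], dim=1)  # prepend visual tokens
        else:
            tokens = audio_tokens
        for layer, adapter in zip(self.audio_encoder.layers, self.adapters):
            tokens = adapter(layer(tokens))  # frozen layer, then trainable adapter
        return tokens  # fed to the frozen decoder for transcription (omitted here)
```

Only `visual_projector` and `adapters` carry gradients; everything inherited from the pre-trained ASR model stays untouched.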
We evaluate our extended model on AV-ASR benchmarks in a zero-shot setting, where the model is never trained on a manually annotated AV-ASR dataset.
Curriculum learning for vision injection
After initial evaluation, we empirically found that with a single round of naive joint training, the model struggles to learn both the adapters and the visual projectors in one go. To mitigate this issue, we introduced a two-phase curriculum learning strategy that decouples these two factors, domain adaptation and visual feature integration, and trains the network sequentially. In the first phase, the adapter parameters are optimized without feeding in visual tokens at all. Once the adapters are trained, we add the visual tokens and train only the visual projection layers in the second phase, while the trained adapters are kept frozen.
The first phase focuses on audio domain adaptation. By the second phase, the adapters are completely frozen and the visual projector simply has to learn to generate visual prompts that project the visual tokens into the audio space. In this way, our curriculum learning strategy allows the model to incorporate visual inputs as well as adapt to new audio domains in AV-ASR benchmarks. We apply each phase only once, as an iterative application of alternating phases leads to performance degradation.
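A rough sketch of the two phases, continuing the module sketch above. The loss function, data loader, learning rate, and step counts below are hypothetical placeholders rather than values from the paper; the point is only which parameters are trainable in each phase and whether visual tokens are fed in.

```python
import torch


def train_phase(model, dataloader, params, use_visual, num_steps, lr=1e-4):
    """Run one curriculum phase, updating only the parameters passed in `params`."""
    optimizer = torch.optim.Adam(list(params), lr=lr)
    for step, batch in zip(range(num_steps), dataloader):
        visual = batch["visual_features"] if use_visual else None
        outputs = model(batch["audio_tokens"], visual)
        # Pseudo ground truth comes from an audio-only ASR model run on HowTo100M.
        loss = asr_loss(outputs, batch["pseudo_transcript"])  # hypothetical seq2seq loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


# Phase 1: audio-domain adaptation -- train the adapters only, no visual tokens.
train_phase(model, howto100m_loader,
            params=model.adapters.parameters(), use_visual=False, num_steps=50_000)

# Phase 2: freeze the trained adapters; train only the visual projector, now with visual tokens.
for p in model.adapters.parameters():
    p.requires_grad = False
train_phase(model, howto100m_loader,
            params=model.visual_projector.parameters(), use_visual=True, num_steps=50_000)
```

Each phase is applied exactly once; in our experiments, alternating the phases repeatedly degraded performance.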
Overall architecture and training procedure for AVFormer. The architecture consists of a frozen Conformer encoder-decoder model and a frozen CLIP encoder (frozen layers shown in gray with a lock symbol), along with two lightweight trainable modules: (i) a visual projection layer (orange) and (ii) bottleneck adapters (blue) to enable multimodal domain adaptation. We propose a two-phase curriculum learning strategy: the adapters (blue) are first trained without any visual tokens, after which the visual projection layer (orange) is tuned while all other parts are kept frozen.
The graphs below show that without curriculum learning, our AV-ASR model is worse than the audio-only baseline across all datasets, with the gap increasing as more visual tokens are added. In contrast, when the proposed two-phase curriculum is applied, our AV-ASR model performs significantly better than the baseline audio-only model.
Effect of curriculum learning. The red and blue lines are for audiovisual models and are shown on 3 datasets in the zero-shot setting (lower WER % is better). Using the curriculum helps on all 3 datasets (for How2 (a) and Ego4D (c) it is crucial for outperforming audio-only performance). Performance improves up to 4 visual tokens, at which point it saturates.
Zero-shot AV-ASR results
We benchmark AVFormer against BEST-RQ, the audio-only version of our model, and AVATAR, the state of the art in AV-ASR, for zero-shot performance on the three AV-ASR benchmarks: How2, VisSpeech, and Ego4D. AVFormer outperforms AVATAR and BEST-RQ on all of them, even when AVATAR and BEST-RQ are trained on LibriSpeech and the full set of HowTo100M. This is notable because for BEST-RQ, this involves training 600M parameters, whereas AVFormer only trains 4M parameters and therefore requires only a small fraction of the training dataset (5% of HowTo100M). Moreover, we also evaluate performance on LibriSpeech, which is audio-only, and AVFormer outperforms both baselines.
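For reference, zero-shot evaluation on these benchmarks boils down to decoding each video with the adapted model (which is never fine-tuned on the benchmark itself) and scoring word error rate (WER). The sketch below is illustrative: `decode` and the benchmark loaders are hypothetical helpers, while `jiwer` is a standard WER package.

```python
import jiwer  # standard word-error-rate package


def evaluate_wer(model, benchmark_loader):
    """Decode every example and compute corpus-level WER against the references."""
    references, hypotheses = [], []
    for batch in benchmark_loader:
        # `decode` is a hypothetical helper running the frozen decoder (e.g., beam search).
        hypotheses.append(decode(model, batch["audio_tokens"], batch["visual_features"]))
        references.append(batch["transcript"])
    return jiwer.wer(references, hypotheses)


# Zero-shot: How2 / VisSpeech / Ego4D are only ever seen at evaluation time, e.g.:
# for name, loader in {"How2": how2, "VisSpeech": visspeech, "Ego4D": ego4d}.items():
#     print(name, evaluate_wer(model, loader))
```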
Conclusion
We present AVFormer, a lightweight method for adapting existing, frozen state-of-the-art ASR models for AV-ASR. Our approach is practical and efficient, and achieves impressive zero-shot performance. As ASR models get larger and larger, tuning the entire parameter set of pre-trained models becomes impractical (even more so for different domains). Our method seamlessly achieves both domain transfer and visual input mixing in the same parameter-efficient model.
Acknowledgements
This research was conducted by Paul Hongsuck Seo, Arsha Nagrani, and Cordelia Schmid.