Videos have become an increasingly important part of our daily lives, spanning fields such as entertainment, education, and communication. However, understanding the content of videos is a challenging task, since videos often contain multiple events occurring on different time scales. For example, a video of a musher harnessing dogs to a dogsled before they all race away involves a long event (the dogs pulling the sled) and a short event (the dogs being harnessed to the sled). One way to stimulate research in video understanding is through the task of dense video captioning, which consists of temporally localizing and describing all the events in a minutes-long video. This differs from single-image captioning and standard video captioning, which consist of describing short videos with a single sentence.
Dense video captioning systems have wide applications, such as making videos accessible to people with visual or hearing impairments, automatically generating chapters for videos, or improving the search for moments of interest in large video databases. However, current dense video captioning approaches have several limitations; for example, they often contain highly specialized task-specific components, which make them difficult to integrate into powerful foundation models. Furthermore, they are often trained exclusively on manually annotated datasets, which are very difficult to obtain and hence not a scalable solution.
In this post, we introduce “Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning”, to appear at CVPR 2023. The Vid2Seq architecture augments a language model with special time tokens, allowing it to seamlessly predict event boundaries and textual descriptions in the same output sequence. To pretrain this unified model, we leverage unlabeled narrated videos by reformulating sentence boundaries of transcribed speech as pseudo-event boundaries, and using the transcribed speech sentences as pseudo-event captions. The resulting Vid2Seq model, pretrained on millions of narrated videos, improves the state of the art on a variety of dense video captioning benchmarks, including YouCook2, ViTT, and ActivityNet Captions. Vid2Seq also generalizes well to the few-shot dense video captioning setting, the video paragraph captioning task, and the standard video captioning task. Finally, we have also released the code for Vid2Seq here.
Vid2Seq is a visual language model that predicts dense event captions together with their temporal grounding in a video by generating a single sequence of tokens.
A visual language model for dense video captioning
Multimodal transformer architectures have improved the state of the art on a wide range of video tasks, such as action recognition. However, it is not straightforward to adapt such architectures to the complex task of jointly localizing and captioning events in minutes-long videos.
To achieve this, we augment a visual language model with special time tokens (like text tokens) that represent discretized timestamps in the video, similar to Pix2Seq in the spatial domain. Given visual inputs, the resulting Vid2Seq model can both take as input and generate sequences of text and time tokens. First, this enables the Vid2Seq model to understand the temporal information of the transcribed speech input, which is cast as a single sequence of tokens. Second, this allows Vid2Seq to jointly predict dense event captions and temporally ground them in the video while generating a single sequence of tokens.
The Vid2Seq architecture includes a visual encoder and a text encoder, which encode the video frames and the transcribed speech input, respectively. The resulting encodings are then forwarded to a text decoder, which autoregressively predicts the output sequence of dense event captions together with their temporal localization in the video. The architecture is initialized with a powerful visual backbone and a strong language model.
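To make the idea concrete, here is a minimal Python sketch (not the released implementation) of how event timestamps can be discretized into a fixed vocabulary of time tokens and interleaved with captions into one output sequence; the number of time bins and the `<time=k>` token format are assumptions chosen for illustration.

```python
# Minimal sketch: discretize timestamps into special time tokens and serialize
# (start, end, caption) events into a single target sequence. The bin count and
# the "<time=k>" token format are illustrative assumptions.

NUM_TIME_BINS = 100  # assumed number of discrete time tokens

def time_token(t_seconds: float, video_duration: float) -> str:
    """Map an absolute timestamp to one of NUM_TIME_BINS relative time tokens."""
    bin_id = min(int(t_seconds / video_duration * NUM_TIME_BINS), NUM_TIME_BINS - 1)
    return f"<time={bin_id}>"

def events_to_sequence(events, video_duration: float) -> str:
    """Serialize (start, end, caption) events into one sequence of text + time tokens."""
    parts = []
    for start, end, caption in sorted(events):
        parts += [time_token(start, video_duration), time_token(end, video_duration), caption]
    return " ".join(parts)

# Example: two events in a 120-second video.
events = [(3.0, 14.5, "the musher harnesses the dogs to the sled"),
          (14.5, 118.0, "the dogs pull the sled across the snow")]
print(events_to_sequence(events, video_duration=120.0))
```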
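The snippet below is a simplified PyTorch sketch of this encoder-decoder layout: a visual encoder over frame features, a text encoder over transcribed speech tokens, and a text decoder that attends to both and autoregressively emits text and time tokens. The layer counts, feature sizes, vocabulary size, and the use of vanilla transformer blocks are simplifying assumptions standing in for the stronger pretrained backbones mentioned above.

```python
# Simplified sketch of the Vid2Seq-style encoder-decoder structure (assumed sizes).
import torch
import torch.nn as nn

class Vid2SeqSketch(nn.Module):
    def __init__(self, vocab_size=32128, num_time_tokens=100, d_model=512):
        super().__init__()
        full_vocab = vocab_size + num_time_tokens          # text tokens + time tokens
        self.frame_proj = nn.Linear(768, d_model)          # project per-frame features
        self.visual_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.speech_embed = nn.Embedding(full_vocab, d_model)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.target_embed = nn.Embedding(full_vocab, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(d_model, full_vocab)

    def forward(self, frame_feats, speech_tokens, target_tokens):
        vis = self.visual_encoder(self.frame_proj(frame_feats))       # encode frames
        txt = self.text_encoder(self.speech_embed(speech_tokens))     # encode speech
        memory = torch.cat([vis, txt], dim=1)                         # fuse both modalities
        tgt = self.target_embed(target_tokens)
        causal = torch.triu(torch.ones(tgt.size(1), tgt.size(1), dtype=torch.bool), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)              # autoregressive decoding
        return self.lm_head(out)                                      # logits over text + time tokens

# Example shapes: 8 frames of 768-d features, 20 speech tokens, 16 target tokens.
model = Vid2SeqSketch()
logits = model(torch.randn(1, 8, 768),
               torch.randint(0, 32128, (1, 20)),
               torch.randint(0, 32128, (1, 16)))
print(logits.shape)  # (1, 16, 32228)
```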
Large-scale pretraining on untrimmed narrated videos
Due to the dense nature of the task, manually collecting annotations for dense video captioning is particularly expensive. Hence, we pretrain the Vid2Seq model using unlabeled narrated videos, which are readily available at scale. In particular, we use the YT-Temporal-1B dataset, which includes 18 million narrated videos covering a wide range of domains.
We use transcribed speech sentences and their corresponding timestamps as supervision, which are cast as a single sequence of tokens. We pretrain Vid2Seq with a generative objective that teaches the decoder to predict the transcribed speech sequence given visual inputs only, and a denoising objective that encourages multimodal learning by requiring the model to predict masked tokens given a noisy transcribed speech sequence and visual inputs. In particular, noise is added to the speech sequence by randomly masking out spans of tokens.
Vid2Seq is pretrained on unlabeled narrated videos with a generative objective (top) and a denoising objective (bottom).
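For illustration, the following sketch shows one way the span-masking corruption for the denoising objective could look: random stretches of the transcribed speech tokens are replaced by mask tokens, and the decoder is asked to recover them from the noisy speech plus the visual input. The masking ratio and span length here are assumptions, not the values used in the paper.

```python
# Sketch of span masking for the denoising objective (illustrative hyperparameters).
import random

def mask_spans(tokens, mask_token="<mask>", mask_ratio=0.25, mean_span=3, seed=0):
    """Randomly replace contiguous spans of tokens with a single mask token each."""
    rng = random.Random(seed)
    corrupted, targets = [], []
    i = 0
    while i < len(tokens):
        if rng.random() < mask_ratio / mean_span:     # start a masked span here
            targets.append(tokens[i:i + mean_span])   # span the decoder must reconstruct
            corrupted.append(mask_token)
            i += mean_span
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted, targets

speech = "first the musher ties the harness then the dogs start pulling the sled".split()
noisy, masked_spans = mask_spans(speech)
print(noisy)         # corrupted speech sequence fed to the text encoder
print(masked_spans)  # spans to be predicted by the decoder
```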
Results on downstream dense video captioning benchmarks
The resulting pretrained Vid2Seq model can be fine-tuned on downstream tasks with a simple maximum likelihood objective using teacher forcing (i.e., predicting the next token given the previous ground-truth tokens). After fine-tuning, Vid2Seq notably improves the state of the art on three standard downstream dense video captioning benchmarks (ActivityNet Captions, YouCook2, and ViTT) and two video clip captioning benchmarks (MSR-VTT, MSVD). In our paper we provide additional ablation studies, qualitative results, as well as results in the few-shot setting and on the video paragraph captioning task.
Comparison to state-of-the-art methods for dense video captioning (left) and for video clip captioning (right), on the CIDEr metric (higher is better).
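As a rough illustration of this fine-tuning objective, the sketch below runs one maximum-likelihood step with teacher forcing, reusing the hypothetical Vid2SeqSketch class from the architecture snippet above; the padding id, tensor shapes, and random inputs are placeholders for illustration only.

```python
# Sketch of one maximum-likelihood fine-tuning step with teacher forcing,
# using the Vid2SeqSketch class defined in the earlier architecture snippet.
import torch
import torch.nn.functional as F

def finetune_step(model, frame_feats, speech_tokens, target_tokens, pad_id=0):
    """One teacher-forced training step on a dense-captioning target sequence."""
    decoder_input = target_tokens[:, :-1]   # ground-truth tokens fed back to the decoder
    labels = target_tokens[:, 1:]           # next tokens the model must predict
    logits = model(frame_feats, speech_tokens, decoder_input)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           labels.reshape(-1), ignore_index=pad_id)
    loss.backward()                         # gradients for a subsequent optimizer step
    return loss.item()

# Example call with random tensors as stand-ins for real features and tokens.
model = Vid2SeqSketch()
loss = finetune_step(model,
                     torch.randn(1, 8, 768),               # frame features
                     torch.randint(1, 32128, (1, 20)),     # transcribed speech tokens
                     torch.randint(1, 32128, (1, 16)))     # target caption + time tokens
print(loss)
```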
Conclusion
We introduced Vid2Seq, a novel visual language model for dense video captioning that simply predicts all event boundaries and captions as a single sequence of tokens. Vid2Seq can be effectively pretrained on unlabeled narrated videos at scale, and achieves state-of-the-art results on various downstream dense video captioning benchmarks. Learn more from the paper and grab the code here.
Acknowledgements
This research was conducted by Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, and Cordelia Schmid.