Large language models (LLMs) have demonstrated impressive instruction-following capabilities and can serve as a universal interface for various tasks such as text generation and language translation. These models can be extended to multimodal LLMs that process other modalities in addition to language, such as image, video, and audio. Several recent works introduce models that specialize in processing videos. These Video LLMs retain the instruction-following capabilities of LLMs and allow users to ask multiple questions about a given video. However, an important piece missing from these Video LLMs is temporal localization. When prompted with "When?" questions, these models cannot accurately localize time periods and often hallucinate irrelevant information.
Three key aspects limit the temporal localization capabilities of existing Video LLMs: time representation, architecture, and data. First, existing models typically represent timestamps as plain text (e.g., 01:22 or 142 seconds). However, given a set of sampled frames, the correct timestamp still depends on the frame rate, which the model cannot access. This makes learning temporal localization difficult. Second, the architecture of existing Video LLMs may lack the temporal resolution needed to accurately interpret temporal information. For example, Video-LLaMA only uniformly samples eight frames from the entire video, which is insufficient for accurate temporal localization. Finally, temporal localization is largely ignored in the data used by existing Video LLMs. Data with timestamps make up only a small subset of video instruction-tuning data, and the accuracy of these timestamps is also not verified.
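To see why plain-text timestamps are problematic, note that mapping a timestamp onto one of the uniformly sampled frames requires knowing the video's duration (or frame rate), which the model is never given. Below is a minimal illustrative sketch of that dependency; the function and variable names are chosen here for illustration, not taken from the paper's code:

```python
# Mapping a plain-text timestamp to one of num_frames uniformly sampled frames
# requires the video duration, which the LLM never sees.
def timestamp_to_frame_index(timestamp_sec: float, duration_sec: float, num_frames: int) -> int:
    """Index of the sampled frame closest to an absolute timestamp."""
    return round(timestamp_sec / duration_sec * (num_frames - 1))

# The same "01:22" (82 s) lands on different frames for different videos:
print(timestamp_to_frame_index(82.0, duration_sec=120.0, num_frames=100))  # 68
print(timestamp_to_frame_index(82.0, duration_sec=600.0, num_frames=100))  # 14
```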
NVIDIA researchers propose the Language Instructed Temporal-Localization Assistant (LITA). The three key components they propose are: (1) Time representation: time tokens that represent relative timestamps and allow Video LLMs to communicate about time better than plain text. (2) Architecture: SlowFast tokens that capture temporal information at fine temporal resolution to enable precise temporal localization. (3) Data: an emphasis on temporal localization data for LITA. They propose a new task, Reasoning Temporal Localization (RTL), along with the ActivityNet-RTL dataset, to learn this task.
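A rough sketch of how relative time tokens could be encoded and decoded is shown below. It assumes the video is divided into a fixed number of equal-length chunks and each timestamp is rounded to the token of its nearest chunk; the token format and the chunk count of 100 are illustrative assumptions rather than the paper's exact choices:

```python
# Hypothetical sketch of relative time tokens (names and format illustrative).
def to_time_token(timestamp_sec: float, duration_sec: float, num_chunks: int = 100) -> str:
    """Map an absolute timestamp to a duration-independent time token."""
    chunk = round(timestamp_sec / duration_sec * (num_chunks - 1)) + 1  # 1-indexed
    return f"<{chunk}>"

def from_time_token(token: str, duration_sec: float, num_chunks: int = 100) -> float:
    """Recover an approximate absolute timestamp from a time token."""
    chunk = int(token.strip("<>"))
    return (chunk - 1) / (num_chunks - 1) * duration_sec

# "01:22" (82 s) in a 5-minute (300 s) video becomes a relative token:
tok = to_time_token(82.0, duration_sec=300.0)   # "<28>"
sec = from_time_token(tok, duration_sec=300.0)  # ~81.8 s
```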
LITA is built on top of Image LLaVA for its simplicity and effectiveness, but it does not depend on the underlying Image LLM architecture and can easily be adapted to other base architectures. Given a video, LITA first uniformly samples T frames and encodes each frame into M tokens. T × M is a large number of tokens that typically cannot be processed directly by the LLM module, so SlowFast pooling is used to reduce the T × M tokens to T + M tokens. The text (prompt) tokens are processed to convert any referenced timestamps into specialized time tokens. The LLM module then jointly processes all input tokens sequentially. The model is fine-tuned with RTL data and other video tasks, such as dense video captioning and event localization, and learns to use time tokens instead of absolute timestamps.
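The SlowFast pooling step can be sketched roughly as follows: fast tokens keep fine temporal resolution by averaging each frame down to a single token, while slow tokens keep more spatial detail from a small subset of frames. This is a simplified interpretation of the design, with the exact pooling operations, the number of slow frames, and the tensor shapes all being assumptions for illustration:

```python
import torch

def slowfast_pool(frame_tokens: torch.Tensor, num_slow_frames: int = 4) -> torch.Tensor:
    """Reduce T*M visual tokens to roughly T + M tokens (simplified sketch).

    frame_tokens: (T, M, D) -- T frames, M tokens per frame, D channels.
    """
    T, M, D = frame_tokens.shape
    # Fast tokens: one token per frame (spatial average) -> fine temporal resolution.
    fast = frame_tokens.mean(dim=1)                         # (T, D)
    # Slow tokens: all M spatial tokens, averaged over a few uniformly
    # spaced frames -> fine spatial resolution at coarse temporal resolution.
    idx = torch.linspace(0, T - 1, num_slow_frames).long()
    slow = frame_tokens[idx].mean(dim=0)                    # (M, D)
    return torch.cat([fast, slow], dim=0)                   # (T + M, D)

tokens = slowfast_pool(torch.randn(100, 256, 1024))         # -> shape (356, 1024)
```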
LITA is compared with LLaMA-Adapter, Video-LLaMA, VideoChat, and Video-ChatGPT. Video-ChatGPT slightly outperforms the other baselines, including Video-LLaMA-v2. LITA significantly outperforms both of these existing Video LLMs in all aspects. In particular, LITA achieves a 22% relative improvement in Correctness of Information (2.94 vs. 2.40) and a 36% relative improvement in Temporal Understanding (2.68 vs. 1.98). This shows that emphasizing temporal understanding during training not only enables accurate temporal localization but also improves LITA's overall video understanding.
In conclusion, NVIDIA researchers present LITA, a Video LLM that makes a significant advance in temporal localization. With its unique model design, LITA introduces time tokens and SlowFast tokens, significantly improving time representation and video input processing. LITA demonstrates promising capabilities for answering complex temporal localization questions and substantially improves video-based text generation compared to existing Video LLMs, even for non-temporal questions.
Review the Paper and GitHub. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter. Join our Telegram Channel, Discord Channel, and LinkedIn Group.
If you like our work, you will love our Newsletter.
Don't forget to join our 39k+ ML SubReddit
Asjad is an intern consultant at Marktechpost. He is pursuing a B.Tech in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching applications of machine learning in healthcare.