Understanding long videos, such as 24-hour CCTV footage or full-length movies, is a major challenge in video processing. Large language models (LLMs) have shown great potential in handling multimodal data, including video, but struggle with the sheer volume and processing demands of such content. Most existing methods for managing long videos lose critical details, because simplifying the visual content often eliminates subtle but essential information. This limits the ability to interpret and analyze complex or dynamic video data effectively.
Techniques currently used to understand long videos include extracting keyframes or converting video frames into text. These techniques simplify processing but discard large amounts of information, since subtle details and visual nuances are omitted. Advanced video LLMs, such as Video-LLaMA and Video-LLaVA, attempt to improve understanding using multimodal representations and specialized modules. However, these models require extensive computational resources, are task-specific, and struggle with long or unfamiliar videos. Multimodal RAG systems, such as iRAG and LlamaIndex, improve data retrieval and processing, but lose valuable information when transforming video data into text. These limitations prevent current methods from fully capturing and exploiting the depth and complexity of video content.
To address these challenges, researchers from Om AI Research and the Binjiang Institute of Zhejiang University introduced OmAgent, a two-step approach consisting of Video2RAG for preprocessing and the DnC Loop for task execution. In Video2RAG, raw video is processed with scene detection, visual cue extraction, and audio transcription to produce summarized scene captions. These captions are vectorized and stored in a knowledge database enriched with further details about time, location, and events. This avoids feeding very large contexts to the language model, sidestepping problems such as token overload and inference complexity. At task-execution time, queries are encoded and the relevant video segments are retrieved for further analysis. The result is efficient video understanding that balances detailed data representation against computational feasibility.
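To make the Video2RAG idea concrete, here is a minimal sketch of the ingest-then-retrieve pattern in Python. The helper functions (detect_scenes, caption_scene, transcribe_audio, embed_text) are hypothetical stand-ins for the actual scene detector, captioner, speech-to-text model, and text embedder, and the in-memory vector store is an assumption for illustration, not OmAgent's implementation.

```python
import numpy as np

# Hypothetical stand-ins for the real models; each returns toy data for illustration.
def detect_scenes(video_path):
    return [("scene_0", 0.0, 42.5), ("scene_1", 42.5, 97.0)]

def caption_scene(video_path, start, end):
    return f"Summarized caption of {video_path} between {start}s and {end}s"

def transcribe_audio(video_path, start, end):
    return f"Transcript for {start}s-{end}s"

def embed_text(text):
    # Deterministic fake embedding so the example runs without a model.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(128)

def video_to_rag(video_path):
    """Video2RAG-style preprocessing: scene captions + audio, vectorized and stored."""
    store = []
    for scene_id, start, end in detect_scenes(video_path):
        caption = caption_scene(video_path, start, end)
        transcript = transcribe_audio(video_path, start, end)
        doc = f"[{start:.1f}-{end:.1f}s] {caption} | audio: {transcript}"
        store.append({"scene_id": scene_id, "text": doc, "vec": embed_text(doc)})
    return store

def retrieve(store, query, top_k=2):
    """Query time: encode the question and return the most similar scene captions."""
    q = embed_text(query)
    score = lambda d: np.dot(d["vec"], q) / (np.linalg.norm(d["vec"]) * np.linalg.norm(q))
    return sorted(store, key=score, reverse=True)[:top_k]

if __name__ == "__main__":
    store = video_to_rag("movie.mp4")
    for hit in retrieve(store, "When does the chase scene happen?"):
        print(hit["scene_id"], hit["text"])
```

The key design point this illustrates is that the language model only ever sees short, retrieved scene summaries rather than the full video context.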
The DnC Loop employs a divide-and-conquer strategy, recursively decomposing tasks into manageable subtasks. The Conqueror module evaluates each task and directs it toward splitting, tool invocation, or direct resolution. The Divider module breaks complex tasks apart, while the Rescuer handles execution errors. The resulting recursive task tree supports effective management and resolution of tasks. By combining Video2RAG's structured preprocessing with the robust DnC Loop framework, OmAgent offers a comprehensive video understanding system that can handle complex queries and produce accurate results.
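The recursive pattern behind the DnC Loop can be sketched as follows. The module names (Conqueror, Divider, Rescuer) come from the article, but the function signatures, the `subgoals` field, and the control flow are illustrative assumptions rather than OmAgent's actual API.

```python
# Conceptual sketch of the DnC Loop: a task is resolved directly, handed to a tool,
# or split into subtasks; execution errors are passed to a rescuer.

def conqueror(task, tools, depth=0, max_depth=3):
    """Evaluate a task and decide: invoke a tool, split it, or resolve it directly."""
    if task.get("tool") in tools:
        try:
            return tools[task["tool"]](task["input"])
        except Exception as err:  # tool failure is handed to the Rescuer
            return rescuer(task, err)
    if task.get("subgoals") and depth < max_depth:
        results = [conqueror(sub, tools, depth + 1, max_depth) for sub in divider(task)]
        return " ; ".join(results)
    return f"Resolved directly: {task['goal']}"

def divider(task):
    """Split a complex task into child tasks, forming a recursive task tree."""
    return [{"goal": goal} for goal in task["subgoals"]]

def rescuer(task, err):
    """Handle an execution error, e.g. by reporting it and falling back."""
    return f"Recovered from {type(err).__name__} while running: {task['goal']}"

if __name__ == "__main__":
    tools = {"search": lambda query: f"search results for '{query}'"}
    task = {
        "goal": "Answer a question about a long video",
        "subgoals": ["Locate the relevant scenes", "Summarize the retrieved scenes"],
    }
    print(conqueror(task, tools))
```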
Researchers conducted experiments to validate OmAgent's ability to solve complex problems and understand long videos. They used two benchmarks, MBPP (976 Python tasks) and FreshQA (dynamic real-world question answering), to test general problem solving, focusing on planning, task execution, and tool use. They also built a benchmark of more than 2,000 question-answer pairs over a variety of long videos, covering reasoning, event localization, information summarization, and external knowledge. OmAgent consistently outperformed the baselines across all metrics. On MBPP and FreshQA, it achieved 88.3% and 79.7%, respectively, outperforming GPT-4 and XAgent. On the video tasks, OmAgent scored 45.45% overall, compared with Video2RAG alone (27.27%), Frames with STT (28.57%), and other baselines. It excelled at reasoning (81.82%) and information summarization (72.74%), but struggled with event localization (19.05%). OmAgent's rewind and Divide-and-Conquer (DnC) Loop capabilities significantly improved performance on tasks requiring detailed analysis, although accuracy in event localization remained a challenge.
In summary, OmAgent integrates multimodal RAG with a generalist AI agent framework, enabling advanced video understanding with near-infinite comprehension capacity, a secondary recall mechanism, and autonomous tool invocation, and it achieved strong performance across multiple benchmarks. While challenges such as event localization, character alignment, and audio-visual asynchrony remain, the method can serve as a foundation for future research on character disambiguation, audio-visual synchronization, and understanding of non-verbal audio cues, advancing long-form video understanding.
Check out the Paper and the GitHub page (https://github.com/om-ai-lab/OmAgent). All credit for this research goes to the researchers of this project.
Divyesh is a Consulting Intern at Marktechpost. He is pursuing a BTech in Agricultural and Food Engineering at the Indian Institute of Technology Kharagpur. He is a data science and machine learning enthusiast who wants to apply these leading technologies to agriculture and its challenges.