Advances in multimodal intelligence depend on processing and understanding images and videos. Images reveal static scenes, providing detail about objects, text, and spatial relationships. Video understanding is considerably harder: it involves tracking changes over time while maintaining consistency across frames, which requires handling dynamic content and temporal relationships. These tasks are made more difficult still by the fact that video-text datasets are harder to collect and annotate than image-text datasets.
Traditional approaches to Multimodal Large Language Models (MLLMs) face challenges in video understanding. Techniques such as sparsely sampled frames, basic connectors, and image-based encoders fail to capture temporal dependencies and dynamic content effectively. Token compression and extended context windows struggle with the complexity of long-form video, while the integration of audio and visual inputs often lacks seamless interaction. Efforts to scale model size and enable real-time processing remain inefficient, and existing architectures are not optimized for long video tasks.
To address these video understanding challenges, researchers from Alibaba Group proposed the VideoLLaMA3 framework. The framework incorporates Any-resolution Vision Tokenization (AVT) and a Differential Frame Pruner (DiffFP). AVT improves on traditional fixed-resolution tokenization by allowing the vision encoder to process variable resolutions dynamically, reducing information loss; this is achieved by adapting the ViT-based encoder with 2D rotary position embeddings (RoPE) for flexible positional encoding. To preserve vital information, DiffFP deals with long, redundant sequences of video tokens by pruning frames that show minimal change, measured by the 1-norm distance between corresponding patches of consecutive frames. Dynamic resolution handling, combined with efficient token reduction, improves the visual representation while reducing costs.
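As a concrete illustration of the pruning step, the sketch below filters out frames whose patches barely differ from the last retained frame. It is a minimal sketch, assuming frames are already encoded into patch embeddings; the function name, tensor shapes, and threshold value are illustrative assumptions, not the paper's implementation.

```python
import torch

def prune_redundant_frames(frame_patches: torch.Tensor, threshold: float = 0.1):
    """DiffFP-style pruning sketch (illustrative, not the official code).

    frame_patches: (num_frames, num_patches, dim) patch embeddings of the
    sampled frames. A frame is dropped when its mean 1-norm distance to the
    most recently kept frame falls below `threshold` (the value is arbitrary).
    """
    kept = [0]  # always keep the first frame
    for t in range(1, frame_patches.shape[0]):
        # mean absolute (1-norm) difference between corresponding patches
        diff = (frame_patches[t] - frame_patches[kept[-1]]).abs().mean()
        if diff.item() >= threshold:
            kept.append(t)
    return frame_patches[kept], kept

# Example: 16 frames, 196 patches each, 1152-dim embeddings
# pruned, kept_ids = prune_redundant_frames(torch.randn(16, 196, 1152))
```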

The model consists of a vision encoder, a video compressor, a projector, and a large language model (LLM). The vision encoder, initialized from a pre-trained SigLIP model, extracts visual tokens, while the video compressor reduces the size of the video token representation. The projector connects the vision encoder to the LLM, and Qwen2.5 models serve as the LLM. Training occurs in four stages: vision encoder adaptation, vision-language alignment, multi-task fine-tuning, and video-centric fine-tuning. The first three stages focus on image understanding, and the final stage improves video understanding by incorporating temporal information. The vision encoder adaptation stage fine-tunes the SigLIP-initialized vision encoder on a large-scale image dataset, enabling it to process images at varying resolutions. The vision-language alignment stage introduces multimodal knowledge, making both the LLM and the vision encoder trainable so that vision and language understanding are integrated. In the multi-task fine-tuning stage, instruction tuning is performed on multimodal question-answering data, including image and video questions, improving the model's ability to follow natural-language instructions and process temporal information. The video-centric fine-tuning stage updates all parameters to strengthen the model's video understanding capabilities. Training data comes from diverse sources such as scene images, documents, charts, fine-grained images, and video data, ensuring comprehensive multimodal understanding.
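The wiring of these four components can be sketched as below. This is a schematic under stated assumptions, not the released implementation: the module classes, hidden sizes (1152 for a SigLIP-style encoder, 3584 for a Qwen2.5-7B-style decoder), the two-layer MLP projector, and the Hugging Face-style `inputs_embeds` interface are illustrative choices.

```python
import torch
import torch.nn as nn

class VideoMLLMSketch(nn.Module):
    """Schematic: vision encoder -> video compressor -> projector -> LLM."""

    def __init__(self, vision_encoder: nn.Module, compressor: nn.Module,
                 llm: nn.Module, vision_dim: int = 1152, llm_dim: int = 3584):
        super().__init__()
        self.vision_encoder = vision_encoder  # SigLIP-style ViT handling variable resolutions
        self.compressor = compressor          # e.g. a DiffFP-style frame/token pruner
        self.projector = nn.Sequential(       # bridges vision tokens into the LLM embedding space
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                        # Qwen2.5-style causal decoder

    def forward(self, frames: torch.Tensor, text_embeds: torch.Tensor):
        # frames: (T, C, H, W) -> per-frame patch tokens (T, P, vision_dim)
        vision_tokens = self.vision_encoder(frames)
        vision_tokens = self.compressor(vision_tokens)   # drop redundant video tokens
        vision_embeds = self.projector(vision_tokens)    # (T', P, llm_dim)
        vision_embeds = vision_embeds.flatten(0, 1).unsqueeze(0)
        # prepend projected visual tokens to the text embeddings
        inputs = torch.cat([vision_embeds, text_embeds], dim=1)
        # assumes a Hugging Face-style decoder that accepts inputs_embeds
        return self.llm(inputs_embeds=inputs)
```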

The researchers evaluated VideoLLaMA3 on both image and video tasks. On image-based tasks, the model was tested on document comprehension, mathematical reasoning, and multi-image understanding, where it outperformed previous models, showing improvements in chart understanding and real-world knowledge question answering (QA). On video-based tasks, VideoLLaMA3 performed strongly on benchmarks such as VideoMME and MVBench, demonstrating proficiency in general video comprehension, long-form video comprehension, and temporal reasoning. Both the 2B and 7B models performed very competitively, with the 7B model leading on most video tasks, underlining the model's effectiveness across multimodal tasks. Significant improvements were also reported in OCR, mathematical reasoning, multi-image understanding, and long-form video understanding.

In conclusion, the proposed framework advances vision-centric multimodal models, offering a strong foundation for understanding images and videos. By using high-quality image-text datasets, it addresses the challenges of video understanding and temporal dynamics, achieving strong results on benchmarks. However, challenges such as video-text dataset quality and real-time processing remain. Future research can improve video-text datasets, optimize for real-time performance, and integrate additional modalities such as audio and speech. This work can serve as a baseline for future advances in multimodal understanding, efficiency, generalization, and integration.
Check out the Paper and GitHub page. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don't forget to join our 70k+ ML SubReddit.
<a target="_blank" href="https://nebius.com/blog/posts/studio-embeddings-vision-and-language-models?utm_medium=newsletter&utm_source=marktechpost&utm_campaign=embedding-post-ai-studio” target=”_blank” rel=”noreferrer noopener”> (Recommended Read) NEBIUS ai Studio Expands with Vision Models, New Language Models, Embeddings, and Lora (Promoted)

Divyesh is a Consulting Intern at Marktechpost. He is pursuing a BTech in Agricultural and Food Engineering from the Indian Institute of Technology, Kharagpur. He is a data science and machine learning enthusiast who wants to integrate these leading technologies into the agricultural domain and solve its challenges.