While large multimodal models (LMMs) have advanced significantly on text and image tasks, video-based models remain underdeveloped. Videos are intrinsically complex, combining spatial and temporal dimensions that demand more computational resources. Existing methods often adapt image-based approaches directly or rely on uniform frame sampling, which captures temporal and motion patterns poorly. Moreover, training large-scale video models is computationally expensive, making it difficult to explore design choices efficiently.
To address these issues, researchers at Meta AI and Stanford developed Apollo, a family of video-centric LMMs designed to push the boundaries of video understanding. Apollo tackles these challenges through deliberate design decisions, improving efficiency and setting a new benchmark for tasks such as temporal reasoning and video-based question answering.
Meta AI Introduces Apollo: A Family of Scalable Video LMMs
The Apollo models are designed to process videos up to an hour long while achieving strong performance on key video-language tasks. Apollo comes in three sizes (1.5B, 3B, and 7B parameters), offering the flexibility to adapt to various computational constraints and real-world needs.
Key innovations include:
Scale consistency: Design choices made in smaller models have been shown to transfer effectively to larger models, reducing the need for large-scale experiments.
Frames per second (fps) sampling: A more efficient video sampling technique than uniform frame sampling, ensuring better temporal coherence.
Dual vision encoders: Combining SigLIP for spatial understanding with InternVideo2 for temporal reasoning allows a balanced representation of video data.
ApolloBench: A curated set of benchmarks that reduces evaluation redundancy while providing detailed insight into model performance.
Highlights and technical advantages
The Apollo models are built on a series of well-researched design choices aimed at overcoming the challenges of video-based LMMs:
Frames per second sampling: Unlike uniform frame sampling, fps sampling maintains a constant temporal spacing between sampled frames, allowing Apollo to better understand the motion, speed, and sequence of events in videos (a minimal comparison sketch follows this list).
Scale consistency: Experiments show that design choices made on moderately sized models (2B-4B parameters) generalize well to larger models. This approach reduces computational costs while preserving performance gains.
Dual vision encoders: Apollo uses two complementary encoders: SigLIP, which excels at spatial understanding, and InternVideo2, which improves temporal reasoning. Their combined strengths produce more accurate video representations (see the feature-fusion sketch below).
Token resampling: Using a Perceiver Resampler, Apollo reduces the number of video tokens without losing essential information. This allows the models to process long videos without excessive computational overhead (see the resampler sketch below).
Optimized training: Apollo follows a three-stage training process in which the video encoders are first fine-tuned on video data before being integrated with text and image datasets. This staged approach ensures stable and effective learning (an illustrative stage schedule appears below).
Multi-turn conversations: Apollo models support multi-turn interactive conversations grounded in video content, making them well suited to applications such as video-based chat systems and content analysis.
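To make the difference between fps sampling and uniform sampling concrete, here is a minimal Python sketch. The function names and parameters are illustrative, not Apollo's actual API; the sketch assumes the video's native frame rate and total frame count are known.

```python
# Minimal sketch: uniform sampling vs. fps sampling of video frames.
# Illustrative only -- these helpers are assumptions, not Apollo's code.

def uniform_sample(num_frames: int, num_samples: int) -> list[int]:
    """Spread a fixed number of samples evenly over the whole video.
    The gap between samples grows with video length, so the apparent
    speed of motion changes from clip to clip."""
    step = num_frames / num_samples
    return [int(i * step) for i in range(num_samples)]

def fps_sample(num_frames: int, native_fps: float, target_fps: float) -> list[int]:
    """Sample at a fixed rate, so consecutive sampled frames are always
    the same wall-clock interval apart regardless of video length."""
    stride = max(1, round(native_fps / target_fps))
    return list(range(0, num_frames, stride))

# Example: a 2-minute clip recorded at 30 fps (3600 frames).
print(uniform_sample(3600, 8))      # 8 frames, one every ~15 seconds
print(fps_sample(3600, 30.0, 2.0))  # one frame every 0.5 seconds -> 240 frames
```

With uniform sampling, the token count stays fixed but the temporal spacing varies with duration; with fps sampling, the spacing is constant and the token count grows with duration, which is what makes downstream token resampling important.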
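The dual-encoder design can be pictured as fusing two feature streams. This PyTorch sketch assumes each encoder has already produced per-frame token features; the tensor shapes and the interpolate-then-concatenate fusion are illustrative assumptions, not Apollo's exact implementation.

```python
# Sketch of fusing features from a spatial encoder (SigLIP-style) and a
# temporal video encoder (InternVideo2-style). All shapes are assumptions.

import torch
import torch.nn.functional as F

def fuse_encoder_features(spatial: torch.Tensor, temporal: torch.Tensor) -> torch.Tensor:
    """spatial:  (frames, n_spatial_tokens, d_spatial)
    temporal: (frames, n_temporal_tokens, d_temporal)
    Returns   (frames, n_spatial_tokens, d_spatial + d_temporal)."""
    # Resample the temporal token grid to match the spatial token count,
    # then concatenate channel-wise so every fused token carries both
    # appearance and motion information.
    t = temporal.permute(0, 2, 1)                       # (F, d_temporal, n_temporal)
    t = F.interpolate(t, size=spatial.shape[1], mode="linear")
    t = t.permute(0, 2, 1)                              # (F, n_spatial, d_temporal)
    return torch.cat([spatial, t], dim=-1)

fused = fuse_encoder_features(torch.randn(16, 256, 1152),  # 16 frames
                              torch.randn(16, 196, 768))
print(fused.shape)  # torch.Size([16, 256, 1920])
```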
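Token resampling with a Perceiver-style module can be sketched as a single cross-attention block: a small, fixed set of learned query vectors attends to the full (and potentially very long) sequence of video tokens, producing a constant-length summary. The hyperparameters below are illustrative, not Apollo's.

```python
# Sketch of a Perceiver-style resampler: compresses any number of video
# tokens down to a fixed number of latents. Hyperparameters are assumptions.

import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    def __init__(self, dim: int = 1024, num_latents: int = 64, num_heads: int = 8):
        super().__init__()
        # Learned queries: output length is always num_latents,
        # no matter how many input tokens arrive.
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, video_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (batch, n_tokens, dim); n_tokens may be very large
        q = self.latents.expand(video_tokens.size(0), -1, -1)
        out, _ = self.attn(q, video_tokens, video_tokens)  # cross-attention
        return out + self.mlp(self.norm(out))              # (batch, num_latents, dim)

resampler = PerceiverResampler()
compressed = resampler(torch.randn(2, 9600, 1024))  # e.g. 240 frames x 40 tokens
print(compressed.shape)  # torch.Size([2, 64, 1024])
```

In this sketch the language model then attends over 64 tokens per video segment instead of thousands, which is what keeps hour-long inputs tractable.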
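Finally, the staged training recipe can be written down as a simple schedule. The article does not specify exactly which components are trainable at each Apollo stage, so the flags below are assumptions purely for illustration.

```python
# Hypothetical three-stage schedule in the spirit of the description above.
# Which modules are frozen at each real Apollo stage is an assumption here.

STAGES = [
    {"name": "1_video_encoder_tuning", "data": ["video"],
     "train": {"video_encoders": True, "connector": True, "llm": False}},
    {"name": "2_modality_alignment", "data": ["image-text", "video-text"],
     "train": {"video_encoders": False, "connector": True, "llm": False}},
    {"name": "3_instruction_tuning", "data": ["text", "image-text", "video-text"],
     "train": {"video_encoders": False, "connector": True, "llm": True}},
]

for stage in STAGES:
    print(f"{stage['name']}: data={stage['data']}, trainable={stage['train']}")
```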
Performance information
Apollo's capabilities are validated through strong results across multiple benchmarks, often outperforming larger models:
Apollo-1.5B:
Outperforms models such as Phi-3.5-Vision (4.2B) and LongVA-7B.
Scores: 60.8 on Video-MME, 63.3 on MLVU, 57.0 on ApolloBench.
Apollo-3B:
Competes with and often outperforms many 7B models.
Scores: 58.4 on Video-MME, 68.7 on MLVU, 62.7 on ApolloBench.
Achieves 55.1 on LongVideoBench.
Apollo-7B:
Matches and even surpasses models with more than 30B parameters, such as Oryx-34B and VILA1.5-40B.
Scores: 61.2 on Video-MME, 70.9 on MLVU, 66.3 on ApolloBench.
Conclusion
Apollo marks an important step forward in video-LMM development. By addressing key challenges such as efficient video sampling and model scalability, Apollo provides a practical and powerful solution for understanding video content. Its ability to outperform larger models highlights the importance of well-researched design and training strategies.
The Apollo family offers practical solutions for real-world applications, from video-based question answering to content analysis and interactive systems. Importantly, Meta AI's introduction of ApolloBench provides a leaner and more effective benchmark for evaluating video-LMMs, paving the way for future research.