Recent advances in large multimodal models (LMMs) have demonstrated remarkable capabilities across diverse multimodal settings, moving closer to the goal of artificial general intelligence. These models gain their visual capabilities by aligning vision encoders with large language models using large amounts of vision-language data. However, most open-source LMMs have focused primarily on single-image scenarios, leaving more complex multi-image scenarios largely unexplored. This gap matters because many real-world applications depend on multi-image capabilities, such as comprehensive multi-image analysis. Given the wide range of computer vision settings and data types, there is a strong need for a general LMM framework that works effectively with multi-image, video, and 3D data.
To address these issues, the paper reviews several related lines of work. The first is interleaved image-text data, which equips LMMs with two key capabilities: multimodal in-context learning (ICL) and instruction following in real-world multi-image scenarios. Next, interleaved LMMs such as the closed-source GPT-4V and Gemini support real-world multi-image applications with superior performance, and the community has also built open-source LMMs with strong multi-image capabilities using various public datasets. Finally, interleaved benchmarks provide several high-quality evaluation suites for assessing the multi-image capabilities of LMMs across diverse scenarios.
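To make the interleaved setting concrete, the sketch below shows what a multi-image, LLaVA-style training sample might look like. The field names ("conversations", "from", "value") and the `<image>` placeholder follow common open-source LLaVA data conventions, but the sample itself is a hypothetical illustration, not an example taken from the paper.

```python
# Hypothetical interleaved multi-image training sample in a LLaVA-style format.
# The field names and the "<image>" placeholder follow common open-source LLaVA
# conventions; the content below is illustrative only.
interleaved_sample = {
    "id": "demo-0001",
    "image": ["scene_day.jpg", "scene_night.jpg"],   # multiple images per sample
    "conversations": [
        {
            "from": "human",
            "value": "Image 1: <image>\nImage 2: <image>\n"
                     "What changed between the two photos of this street?",
        },
        {
            "from": "gpt",
            "value": "The second photo was taken at night: the shops are closed "
                     "and the street lights are on, while the first shows daytime traffic.",
        },
    ],
}
```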
Researchers from ByteDance, HKUST, CUHK, and NTU have proposed LLaVA-NeXT-Interleave, a versatile LMM that can handle various real-world settings, including Multi-image, Multi-frame (video), and Multi-view (3D), while maintaining performance in the Multi-patch (single-image) setting. These four settings are collectively called M4. To endow LMMs with M4 capabilities, the team curated M4-Instruct, a high-quality training dataset with 1,177.6K samples covering 14 tasks and 41 datasets across the four domains. Using a single model, LLaVA-NeXT-Interleave achieves leading results on diverse multi-image tasks compared to previous state-of-the-art models while still performing well on single images.
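A minimal sketch of the unifying idea behind M4: videos and 3D scenes can be fed to an interleaved LMM simply by turning them into ordered lists of images (sampled frames or rendered views) and interleaving one placeholder per image in the prompt. The helper names below (`sample_video_frames`, `build_multi_image_prompt`) are assumptions for illustration, not the paper's actual code.

```python
# Minimal sketch: treat video (multi-frame) and 3D (multi-view) inputs as
# ordered lists of images so a single interleaved LMM can handle all M4 settings.
# All function and variable names here are hypothetical.
import cv2  # OpenCV, assumed available


def sample_video_frames(video_path: str, num_frames: int = 8):
    """Uniformly sample frames from a video as a list of RGB arrays."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * (total - 1) / max(num_frames - 1, 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames


def build_multi_image_prompt(images, question: str) -> str:
    """Interleave one <image> placeholder per frame/view ahead of the question."""
    placeholders = "\n".join(f"Frame {i + 1}: <image>" for i in range(len(images)))
    return f"{placeholders}\n{question}"


# Usage: the same prompt builder works for sampled video frames or 3D multi-view renders.
frames = sample_video_frames("kitchen_walkthrough.mp4", num_frames=8)
prompt = build_multi_image_prompt(frames, "Describe what happens in this video.")
```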
The LLaVA-NeXT-Interleave model is evaluated across the M4 settings. The LLaVA-Interleave Bench is selected to cover a variety of in-domain and out-of-domain multi-image tasks. For video, evaluation includes NExT-QA, MVBench, Video Detailed Description (VDD), and ActivityNet-QA (Act), with ActivityNet-QA results reported as both accuracy and GPT score. The model is additionally evaluated on VideoChatGPT (VCG) using five criteria: information accuracy, detail orientation, context understanding, temporal understanding, and consistency. For 3D, evaluation includes ScanQA and two 3D-LLM tasks.
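For the open-ended video benchmarks (e.g., ActivityNet-QA and VideoChatGPT), scoring typically relies on an LLM judge that compares the model's answer with the reference and returns a yes/no correctness verdict plus a numeric quality score. The sketch below outlines that judging loop; the prompt wording and the `ask_judge` helper are assumptions for illustration, not the exact protocol used in the paper.

```python
# Hedged sketch of an LLM-assisted judging loop for open-ended video QA.
# `ask_judge` stands in for a call to an external LLM judge; the prompt wording
# and the JSON response format are illustrative assumptions.
import json


def ask_judge(prompt: str) -> str:
    """Placeholder for an LLM API call that returns the judge's raw text reply."""
    raise NotImplementedError("Wire this to your preferred LLM judge.")


def judge_answer(question: str, reference: str, prediction: str) -> dict:
    prompt = (
        "You are evaluating a video question-answering model.\n"
        f"Question: {question}\n"
        f"Correct answer: {reference}\n"
        f"Model answer: {prediction}\n"
        'Reply with JSON: {"correct": "yes" or "no", "score": integer 0-5}.'
    )
    return json.loads(ask_judge(prompt))


def aggregate(results: list[dict]) -> tuple[float, float]:
    """Return (accuracy, mean judge score) over all judged answers."""
    accuracy = sum(r["correct"] == "yes" for r in results) / len(results)
    mean_score = sum(r["score"] for r in results) / len(results)
    return accuracy, mean_score
```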
On multi-image tasks, the average performance of LLaVA-NeXT-Interleave surpasses previous open-source models in both in-domain and out-of-domain evaluations. After adding DPO, the proposed 7B model achieves the top performance on both VDD and VideoChatGPT, outperforming the previous LLaVA-NeXT-Video (34B). LLaVA-NeXT-Interleave uses only multi-view images to understand the 3D world, yet achieves much higher scores on challenging 3D tasks than 3D-LLM and Point-LLM. For single-image tasks, 307K samples (about 40%) of the original LLaVA-NeXT single-image data are included as the Multi-patch (single-image) portion of training, allowing the model to retain strong single-image performance.
In conclusion, the researchers have presented LLaVA-NeXT-Interleave, a versatile LMM that can handle different real-world settings, including multiple images, multiple frames (video), and multiple views (3D). They highlight the model's potential to unify and enhance LMM capabilities across diverse visual tasks. Comprehensive experiments show that LLaVA-NeXT-Interleave sets new state-of-the-art results on multi-image tasks while performing strongly on single-image tasks. This work sets a new standard in the field, opening the door to future advances in multimodal AI and complex visual understanding.
Review the Paper and GitHub. All credit for this research goes to the researchers of this project.
Sajjad Ansari is a final year student from IIT Kharagpur. As a technology enthusiast, he delves into practical applications of AI, focusing on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.