Multimodal Large Language Models (MLLMs) represent a cutting-edge area in artificial intelligence, combining modalities such as text, images, and even video to build a unified understanding across domains. These models are being developed to address increasingly complex tasks, such as visual question answering, text-to-image generation, and multimodal data interpretation. The ultimate goal of MLLMs is to enable AI systems to reason and infer with capabilities similar to human cognition by simultaneously understanding multiple forms of data. The field has seen rapid advances, but a challenge remains in creating models that can integrate these varied inputs while maintaining high performance, scalability, and generalization.
One of the critical problems facing MLLM development is achieving robust interaction between different types of data. Existing models often struggle to balance the processing of visual and textual information, leading to a drop in performance when handling text-rich images or detailed vision-based tasks. They also struggle to maintain a high degree of contextual understanding when operating over multiple images. As demand for more versatile models grows, researchers are looking for innovative ways to improve the ability of MLLMs to address these challenges, allowing models to handle complex scenarios seamlessly without sacrificing efficiency or accuracy.
Traditional MLLM approaches rely primarily on single-modality training and do not leverage the full potential of combining visual and textual data. The result is models that excel at linguistic or visual tasks in isolation but struggle in multimodal contexts. Although recent approaches have integrated larger datasets and more complex architectures, they still suffer from inefficiencies when combining the two types of data. There is a growing need for models that perform well on tasks requiring interaction between images and text, such as object referencing and visual reasoning, while remaining computationally feasible and deployable at scale.
Apple researchers developed the MM1.5 family of models, introducing several innovations to overcome these limitations. The MM1.5 models improve upon the capabilities of their predecessor, MM1, by strengthening text-rich image understanding and multi-image reasoning. The researchers took a data-centric approach, integrating high-resolution OCR data and synthetic captions into a continual pre-training phase, which allows the MM1.5 models to significantly outperform previous models on visual understanding and grounding tasks. In addition to the general-purpose MLLMs, the MM1.5 family includes two specialized variants: MM1.5-Video for video understanding and MM1.5-UI for mobile user interface (UI) understanding. These variants provide tailored solutions for use cases such as interpreting video data or analyzing mobile screen layouts.
MM1.5 uses a training strategy with three main stages: large-scale pre-training, continual high-resolution pre-training, and supervised fine-tuning (SFT). The first stage uses a massive dataset comprising 2 billion image-text pairs, 600 million interleaved image-text documents, and 2 trillion text-only tokens, providing a solid foundation for multimodal understanding. The second stage involves continual pre-training on 45 million high-quality OCR data points and 7 million synthetic captions, which improves the model's performance on text-rich image tasks. The final stage, SFT, optimizes the model using a carefully selected combination of single-image, multi-image, and text-only data, making it adept at handling detailed visual references and multi-image reasoning.
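To make the staged recipe concrete, here is a minimal sketch that writes it down as a configuration in Python. The `Stage` dataclass, field names, and objective descriptions are illustrative assumptions; only the corpus sizes come from the summary above, and this is not Apple's training code.

```python
# Illustrative sketch of MM1.5's three-stage training recipe (not the authors' code).
# The Stage dataclass and objective wording are assumptions; only the corpus sizes
# are taken from the article above.
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    datasets: dict   # dataset name -> approximate size
    objective: str


RECIPE = [
    Stage(
        name="large_scale_pretraining",
        datasets={
            "image_text_pairs": "2B pairs",
            "interleaved_image_text_docs": "600M documents",
            "text_only": "2T tokens",
        },
        objective="build a broad multimodal foundation",
    ),
    Stage(
        name="continual_high_res_pretraining",
        datasets={
            "high_res_ocr": "45M samples",
            "synthetic_captions": "7M samples",
        },
        objective="strengthen text-rich image understanding at high resolution",
    ),
    Stage(
        name="supervised_finetuning",
        datasets={
            "single_image_sft": "curated mix",
            "multi_image_sft": "curated mix",
            "text_only_sft": "curated mix",
        },
        objective="instruction following, visual referencing, multi-image reasoning",
    ),
]


def describe(recipe: list) -> None:
    """Print a human-readable summary of the staged training plan."""
    for i, stage in enumerate(recipe, start=1):
        print(f"Stage {i}: {stage.name} -> {stage.objective}")
        for dataset, size in stage.datasets.items():
            print(f"    {dataset}: {size}")


if __name__ == "__main__":
    describe(RECIPE)
```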
The MM1.5 models have been evaluated on a range of benchmarks, showing superior performance over proprietary and open-source models across tasks. The dense and MoE variants of MM1.5 range from 1 billion to 30 billion parameters and achieve competitive results even at smaller scales. The gains are particularly notable in text-rich image understanding, where MM1.5 models demonstrate a 1.4-point improvement over previous models on specific benchmarks. Furthermore, MM1.5-Video, trained solely on image data without any video-specific data, achieved state-of-the-art results on video understanding tasks by leveraging its strong general-purpose multimodal capabilities.
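One way to picture how an image-only model can handle video is to sample frames uniformly and hand them to the model as an ordinary multi-image input. The sketch below is a hypothetical illustration of that idea; `MultiImageMLLM`, its `generate` method, and the frame budget are placeholder assumptions, not MM1.5's actual interface.

```python
# Hypothetical illustration of "video understanding as multi-image reasoning".
# MultiImageMLLM and its generate() signature are placeholders, not a real API.
from typing import Protocol


class MultiImageMLLM(Protocol):
    def generate(self, images: list, prompt: str) -> str: ...


def sample_frames(video_frames: list, num_frames: int = 8) -> list:
    """Uniformly sample a fixed number of frames from a decoded video."""
    if len(video_frames) <= num_frames:
        return list(video_frames)
    step = len(video_frames) / num_frames
    return [video_frames[int(i * step)] for i in range(num_frames)]


def answer_video_question(model: MultiImageMLLM, video_frames: list, question: str) -> str:
    """Treat sampled frames as a multi-image input to an image-trained MLLM."""
    frames = sample_frames(video_frames)
    prompt = f"These {len(frames)} images are consecutive frames from a video. {question}"
    return model.generate(images=frames, prompt=prompt)
```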
The extensive empirical studies conducted on the MM1.5 models revealed several key insights. The researchers showed that data curation and optimal training strategies can lead to robust performance even at lower parameter scales. Furthermore, including OCR data and synthetic captions during the continual pre-training stage significantly improves text understanding across different image resolutions and aspect ratios. These insights pave the way for developing more efficient MLLMs that can deliver high-quality results without requiring extremely large-scale models.
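A common way to cope with varying resolutions and aspect ratios in text-rich images is to split each image into a grid of fixed-size tiles plus a low-resolution overview before encoding. The sketch below illustrates that generic idea under assumed tile sizes and budgets; the `tile_image` helper is hypothetical and does not describe MM1.5's exact preprocessing.

```python
# Generic sketch of resolution- and aspect-ratio-aware image tiling for an MLLM.
# Tile size and tile budget are assumptions for exposition.
from PIL import Image


def tile_image(img: Image.Image, tile_size: int = 448, max_tiles: int = 9) -> list:
    """Split an image into a grid of tiles (capped at max_tiles) plus a global overview."""
    w, h = img.size
    cols = max(1, round(w / tile_size))
    rows = max(1, round(h / tile_size))
    while cols * rows > max_tiles:          # shrink the grid until it fits the budget
        if cols >= rows and cols > 1:
            cols -= 1
        else:
            rows -= 1
    resized = img.resize((cols * tile_size, rows * tile_size))
    tiles = [
        resized.crop((c * tile_size, r * tile_size, (c + 1) * tile_size, (r + 1) * tile_size))
        for r in range(rows)
        for c in range(cols)
    ]
    overview = img.resize((tile_size, tile_size))   # low-resolution global view
    return [overview] + tiles
```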
Key research findings:
- Model variants: Dense and mixture-of-experts (MoE) models with parameters ranging from 1B to 30B, ensuring scalability and deployment flexibility (a minimal MoE sketch follows this list).
- Training data: 2B image-text pairs, 600 million interleaved image-text documents, and 2T text-only tokens.
- Specialized variants: MM1.5-Video and MM1.5-UI offer customized solutions for video understanding and mobile UI analysis.
- Performance improvement: A 1.4-point gain on benchmarks focused on text-rich image understanding compared to previous models.
- Data integration: Effective use of 45 million high-resolution OCR data points and 7 million synthetic captions significantly increases the model's capabilities.
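As referenced in the model-variants item above, the sketch below shows a minimal top-1 mixture-of-experts feed-forward layer to illustrate what an MoE variant means in general. The dimensions, expert count, and top-1 routing choice are assumptions for brevity and do not describe MM1.5's actual architecture.

```python
# Minimal sketch of a top-1 mixture-of-experts (MoE) feed-forward layer.
# A generic illustration of the MoE idea, not MM1.5's architecture.
import torch
import torch.nn as nn


class TopOneMoE(nn.Module):
    def __init__(self, d_model: int = 512, d_hidden: int = 2048, num_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)   # scores each token per expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); flatten to individual tokens for routing
        tokens = x.reshape(-1, x.shape[-1])
        gate = self.router(tokens).softmax(dim=-1)       # (num_tokens, num_experts)
        weight, expert_idx = gate.max(dim=-1)            # top-1 expert per token
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                out[mask] = weight[mask].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)


if __name__ == "__main__":
    layer = TopOneMoE()
    y = layer(torch.randn(2, 16, 512))
    print(y.shape)  # torch.Size([2, 16, 512])
```

Only one expert's feed-forward network runs per token, which is how MoE models grow total parameter count while keeping per-token compute close to that of a smaller dense model.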
In conclusion, the MM1.5 model family sets a new benchmark in multimodal large language models by offering enhanced text-rich image understanding, visual grounding, and multi-image reasoning capabilities. With its carefully curated data strategies, specialized variants for specific tasks, and scalable architecture, MM1.5 is poised to address key challenges in multimodal AI. The proposed models demonstrate that combining robust pre-training methods with continual pre-training and targeted fine-tuning can produce high-performance MLLMs that are versatile across applications, from general image and text understanding to specialized video and UI understanding.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an AI media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable to a wide audience. The platform has more than 2 million monthly visits, illustrating its popularity among readers.