Multimodal models represent a significant advancement in AI, enabling systems to process and understand data from multiple sources, such as text and images. These models are essential for applications such as image captioning, visual question answering, and robotic assistance, where understanding both visual and linguistic input is crucial. With advances in vision-language models (VLMs), AI systems can generate descriptive narratives from images, answer questions based on visual information, and perform tasks such as object recognition. However, many of today’s highest-performing multimodal models are built using proprietary data, limiting their accessibility to the broader research community and stifling innovation in open-access AI research.
One of the critical issues facing the development of open multimodal models is their reliance on data generated by proprietary systems. Closed systems, such as GPT-4V and Claude 3.5, have produced high-quality synthetic data that helps models achieve impressive results, but this data is not widely available. As a result, researchers face barriers when trying to replicate or improve these models, and the scientific community lacks a foundation for building such models from scratch using completely open datasets. This problem has stalled the progress of open research in AI, as researchers cannot access the fundamental components needed to create state-of-the-art multimodal models independently.
The methods commonly used to train multimodal models rely heavily on distillation from proprietary systems. Many vision-language models, for example, are trained on data such as ShareGPT4V, which was generated with GPT-4V. While highly effective, this synthetic data makes the resulting models dependent on closed systems. Open-weight models have been developed, but they often perform significantly worse than their proprietary counterparts, largely because they lack access to high-quality datasets, which makes it difficult to close the performance gap. Open models therefore tend to lag behind state-of-the-art systems from companies with access to proprietary data.
Researchers at the Allen Institute for AI and the University of Washington have presented the Molmo family of vision-language models. This new family represents a major advancement in the field by providing a fully open solution: both the data and the weights are released. Molmo does not rely on synthetic data from proprietary systems, making it a fully accessible tool for the AI research community. The researchers also developed a new dataset, PixMo, consisting of detailed image captions created entirely by human annotators. This dataset allows Molmo models to be trained on high-quality natural data, making them competitive with the best models in the industry.
The first release includes several key models (a minimal loading sketch follows the list):
- MolmoE-1B: Built on the fully open-source OLMoE-1B-7B mixture-of-experts large language model (LLM).
- Molmo-7B-O: Uses the fully open OLMo-7B-1024 LLM, scheduled for a preview release in October 2024, with the full public release planned to follow.
- Molmo-7B-D: This demo model is built on the open-weight Qwen2 7B LLM.
- Molmo-72B: The highest-performing model in the family, built on the open-weight Qwen2 72B LLM.
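
As a rough illustration, the sketch below loads one of these checkpoints with Hugging Face transformers. The repository id, the processor call, and the generation step are assumptions based on the standard transformers API rather than details confirmed in the article, so the exact names and interface may differ from the published model cards.

```python
# Hedged sketch: loading a Molmo checkpoint with Hugging Face transformers.
# The repo id below is hypothetical; consult the released model cards for the
# exact names. Molmo ships custom model code, hence trust_remote_code=True.
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

repo = "allenai/Molmo-7B-D"  # illustrative placeholder, not a confirmed name

processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True, device_map="auto")

image = Image.open("example.jpg")
prompt = "Describe this image in as much detail as possible."

# Standard multimodal processor call; Molmo's custom processor may expose a
# slightly different interface, so check the model card before relying on this.
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```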
Molmo models are trained with a simple yet powerful pipeline that pairs a pre-trained vision encoder with a language model. The vision encoder is based on OpenAI’s ViT-L/14 CLIP model, which provides reliable image tokenization. Molmo’s PixMo dataset, containing over 712,000 images and approximately 1.3 million captions, is the basis for training the models to generate dense, detailed image descriptions. Unlike previous efforts that asked annotators to write captions, PixMo relies on spoken descriptions: annotators were asked to describe every detail of an image for 60 to 90 seconds. This approach collected more descriptive data in less time and yielded high-quality image annotations while avoiding any reliance on synthetic data from closed VLMs.
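To make the pipeline concrete, here is a minimal sketch of the pattern the paragraph describes: a pre-trained vision encoder tokenizes the image, a small connector projects the patch features into the language model's embedding space, and the decoder generates the caption. Module names, dimensions, and the projection details are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a vision-encoder + connector + LLM pipeline.
# Dimensions and connector design are illustrative assumptions.
import torch
import torch.nn as nn

class VisionLanguageModel(nn.Module):
    def __init__(self, vision_encoder, language_model, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g. a CLIP ViT-L/14 backbone
        self.connector = nn.Sequential(           # projects image patch features into
            nn.Linear(vision_dim, llm_dim),       # the LLM's embedding space
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.language_model = language_model      # e.g. an OLMo or Qwen2 decoder

    def forward(self, pixel_values, input_ids):
        patch_features = self.vision_encoder(pixel_values)    # (B, N_patches, vision_dim)
        image_tokens = self.connector(patch_features)          # (B, N_patches, llm_dim)
        text_embeds = self.language_model.get_input_embeddings()(input_ids)
        # Prepend the projected image tokens to the text embeddings and decode jointly.
        inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs_embeds)
```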
The most advanced model in the family, Molmo-72B, has outperformed many leading proprietary systems, including Gemini 1.5 and Claude 3.5 Sonnet, across 11 academic benchmarks. It also ranked second in a human evaluation based on 15,000 image-text pairs, only slightly behind GPT-4o. The model achieved top scores on benchmarks such as AndroidControl, reaching 88.7% accuracy on low-level tasks and 69.0% on high-level tasks. MolmoE-1B, another model in the family, closely matches the performance of GPT-4V, making it a highly efficient and competitive open-weight model. The broad success of Molmo models in both academic and human evaluations demonstrates the potential of open models to compete with, and even outperform, proprietary systems.
In conclusion, the Molmo family offers the research community a powerful open alternative to closed systems, with fully open source code, datasets, and weights. By introducing innovative data collection techniques and refining the model architecture, researchers at the Allen Institute for AI have created a family of models that perform on par with, and in some cases outperform, the proprietary giants in the field. The release of these models, along with the associated PixMo datasets, paves the way for future innovation and collaboration in vision-language model development, ensuring that the broader scientific community has the tools to continue pushing the boundaries of AI.
Take a look at the models on the Hugging Face page, along with the demo and further details. All credit for this research goes to the researchers of this project.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary engineer and entrepreneur, Asif is committed to harnessing the potential of AI for social good. His most recent initiative is the launch of an AI media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is technically sound and easily understandable to a wide audience. The platform has over 2 million monthly views, illustrating its popularity among the public.