Multimodal foundation models are increasingly relevant in artificial intelligence, allowing systems to process and integrate multiple forms of data, such as images, text, and audio, to address a wide range of tasks. However, these systems face significant challenges. Existing models often struggle to generalize across modalities and tasks because they rely on limited datasets and a narrow set of modalities. Moreover, the architecture of many current models suffers from negative transfer, where performance on certain tasks deteriorates as new modalities are added. These challenges hinder scalability and consistency, underscoring the need for frameworks that can unify diverse data representations while preserving task performance.
EPFL researchers have introduced 4M, an open-source framework designed to train versatile and scalable multimodal foundation models that extend beyond language. 4M addresses the limitations of existing approaches by enabling predictions across multiple modalities, integrating data from sources such as images, text, semantic features, and geometric metadata. Unlike traditional frameworks that serve a limited set of tasks, 4M supports 21 modalities, roughly three times as many as many of its predecessors.
A core innovation of 4M is the use of discrete tokenization, which converts various modalities into a unified sequence of tokens. This unified representation allows the model to leverage a Transformer-based architecture for joint training on multiple data types. By streamlining the training process and eliminating the need for task-specific components, 4M strikes a balance between scalability and efficiency. As an open source project, it is accessible to the broader research community, encouraging collaboration and further development.
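To make the idea concrete, the sketch below shows how two modalities might be mapped into a single token stream. The tokenizers are simple stand-ins (a value quantizer and a hashing tokenizer), not 4M's actual VQ-VAE and WordPiece models, and the vocabulary layout is an assumption for illustration only.

```python
# Minimal sketch of discrete tokenization: each modality is mapped to integer
# tokens from its own codebook and tagged with a modality-indicator token, so a
# single Transformer can consume one unified sequence. The tokenizers below are
# simple stand-ins, not 4M's actual VQ-VAE and WordPiece models.
from typing import Dict, List

MODALITY_TOKENS = {"rgb": 0, "caption": 1}   # special "start of modality" ids
IMG_OFFSET = len(MODALITY_TOKENS)            # image codes live in [2, 2 + 1024)
TXT_OFFSET = IMG_OFFSET + 1024               # text ids live above the image codebook

def tokenize_image(pixels: List[float]) -> List[int]:
    # Placeholder for a learned VQ tokenizer: quantize each value to a code index.
    return [IMG_OFFSET + int(min(max(p, 0.0), 0.999) * 1024) for p in pixels]

def tokenize_text(text: str) -> List[int]:
    # Placeholder for a subword tokenizer: hash words into a small vocabulary.
    return [TXT_OFFSET + (hash(w) % 5000) for w in text.lower().split()]

def build_sequence(sample: Dict[str, object]) -> List[int]:
    # Concatenate modalities into one token stream, each preceded by its marker.
    return ([MODALITY_TOKENS["rgb"]] + tokenize_image(sample["rgb"])
            + [MODALITY_TOKENS["caption"]] + tokenize_text(sample["caption"]))

print(build_sequence({"rgb": [0.1, 0.5, 0.9], "caption": "a cat on a mat"}))
```

Because every modality ends up as a flat sequence of integers, the same embedding table and attention layers can serve all of them, which is what removes the need for task-specific components.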
Technical details and advantages
The 4M framework uses an encoder-decoder Transformer architecture designed for multimodal masked modeling. During training, each modality is tokenized by a specialized encoder tailored to its data type: image data, for example, is tokenized with spatial discrete VAEs, while text and structured metadata are processed with a WordPiece tokenizer. This consistent approach to tokenization ensures seamless integration of the various data types.
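The masked-modeling objective can be pictured as follows: from the tokenized modalities, one random subset of tokens is handed to the encoder as input, and a different held-out subset becomes the decoder's prediction target. The fixed budgets and uniform sampling below are simplifications, not 4M's exact sampling strategy.

```python
# Illustrative input/target sampling for multimodal masked modeling: the encoder
# sees the "input" tokens, and the decoder is trained (with cross-entropy) to
# predict the held-out "target" tokens at their positions.
import random

def sample_input_and_target(tokens_per_modality, input_budget=128, target_budget=128):
    # Index every token by (modality, position), then split a shuffled pool.
    pool = [(mod, i) for mod, toks in tokens_per_modality.items()
            for i in range(len(toks))]
    random.shuffle(pool)
    input_ids = pool[:input_budget]
    target_ids = pool[input_budget:input_budget + target_budget]
    return input_ids, target_ids

inputs, targets = sample_input_and_target(
    {"rgb": list(range(196)), "caption": list(range(20)), "depth": list(range(196))})
print(len(inputs), len(targets))  # 128 128
```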
A notable feature of 4M is its ability to generate controllable and detailed data. By conditioning the results on specific modalities, such as human poses or metadata, the model provides a high degree of control over the generated content. Additionally, 4M's cross-modal retrieval capabilities enable queries in one modality (e.g., text) to retrieve relevant information in another (e.g., images).
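A minimal sketch of how such retrieval can be set up on top of a shared encoder: embed the query and the candidates with the same pretrained model, then rank by cosine similarity. The embeddings here are stand-ins for encoder outputs; the wrapper functions and vector sizes are assumptions, not part of the released 4M API.

```python
# Cross-modal retrieval sketch: given an embedding of a text query and embeddings
# of candidate images produced by the same pretrained encoder, rank candidates
# by cosine similarity.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve(query_emb, candidate_embs, top_k=5):
    scores = [cosine(query_emb, c) for c in candidate_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]

# Toy usage with random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
query = rng.normal(size=256)
candidates = [rng.normal(size=256) for _ in range(100)]
print(retrieve(query, candidates, top_k=3))
```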
The scalability of the framework is another strong point. Trained on large datasets such as COYO-700M and CC12M, 4M ingests over 500 million samples and scales up to three billion parameters. By compressing dense data into sparse token sequences, it optimizes memory and computational efficiency, making it a practical choice for complex multimodal tasks.
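The effect of that compression is easy to quantify with a back-of-envelope calculation; the patch size and grid below are typical of VQ-style image tokenizers, not 4M's exact configuration.

```python
# Why token compression matters: a 224x224 RGB image holds ~150K raw values, but
# a typical VQ tokenizer working on 16x16 patches represents it with only 196
# discrete tokens. Numbers are illustrative, not 4M's exact configuration.
raw_values = 224 * 224 * 3            # 150,528 pixel values
tokens = (224 // 16) * (224 // 16)    # 14 x 14 = 196 tokens
print(f"compression factor: {raw_values / tokens:.0f}x")  # ~768x fewer elements
```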
Results and insights
4M's capabilities are evident across a variety of tasks. In evaluations, it delivered strong performance on 21 modalities without compromising results relative to specialized models. For example, the 4M XL model achieved a semantic segmentation mIoU of 48.1, matching or exceeding existing benchmarks while handling three times more tasks than previous models.
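For readers unfamiliar with the metric, mean IoU averages per-class intersection-over-union between predicted and ground-truth label maps; a minimal NumPy sketch of the computation follows.

```python
# Mean IoU: for each class, compute intersection / union of the predicted and
# ground-truth masks, then average over the classes that actually appear.
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                      # ignore classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```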
The framework also excels at transfer learning. Tests on downstream tasks, such as 3D object detection and multimodal semantic segmentation, show that 4M's pretrained encoders maintain high accuracy on both familiar and novel tasks. These results highlight its potential for applications in areas such as autonomous systems and healthcare, where multimodal data integration is critical.
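As a rough illustration of that transfer setup, the snippet below attaches a small task head to a pretrained encoder. The encoder is assumed to be any module with a tokens-in, features-out interface, and the mean pooling and linear head are simplifications rather than the official 4M fine-tuning recipe.

```python
# Transfer-learning sketch: reuse a pretrained multimodal encoder and train a
# lightweight task-specific head on top of its features.
import torch
import torch.nn as nn

class DownstreamModel(nn.Module):
    def __init__(self, encoder: nn.Module, embed_dim: int, num_classes: int):
        super().__init__()
        self.encoder = encoder                         # pretrained 4M-style encoder
        self.head = nn.Linear(embed_dim, num_classes)  # small task-specific head

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        features = self.encoder(tokens)                # (batch, seq, embed_dim)
        pooled = features.mean(dim=1)                  # simple mean pooling
        return self.head(pooled)
```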
Conclusion
The 4M framework marks an important step forward in the development of multimodal foundation models. By addressing cross-modal integration and scalability challenges, EPFL's contribution lays the foundation for more flexible and efficient AI systems. Its open-source release encourages the research community to build on this work, pushing the boundaries of what multimodal AI can achieve. As the field evolves, frameworks like 4M will play a crucial role in enabling new applications and improving AI capabilities.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an AI media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable to a wide audience. The platform has more than 2 million monthly visits, illustrating its popularity among readers.