Representation models have received a lot of attention in computer vision, speech, natural language processing, and related fields. After learning from large amounts of data, representation models generalize well to a wide range of downstream tasks. Demand for them has also grown with the dramatic rise of large language models (LLMs): representation models have recently proven critical for enabling LLMs to understand, perceive, and interact with other modalities such as vision. Because different modalities have very different properties, previous research has mostly focused on unimodal representation models, each with its own architecture and pretraining tasks.
Recent efforts in vision-language and audio-language learning have shown promising results thanks to unified architectures and effective pretraining tasks. However, research on universal models that span the language, audio, and vision modalities is still lacking. Despite their strong results, unimodal representation models struggle to make efficient use of multimodal data such as image-text and audio-text pairs, which limits their application to multimodal tasks. One prior approach uses a single masked prediction task with a Multiway Transformer to process the text and image modalities during pretraining.
Its scalability to other modalities, such as audio, is limited because the masked prediction task relies on a pre-trained CLIP model to discretize the image input. Another approach offers a general pretraining method that can be applied to the language, audio, and vision modalities without external models (such as CLIP), but it focuses only on unimodal data and still needs to be extended to multimodal data. In this work, the researchers investigate a scalable way to build a general representation model that can accommodate any number of modalities. They set out the following requirements for such a model: 1. The model architecture must be flexible enough to handle multiple modalities and multimodal interaction. 2. Pretraining tasks should encourage both alignment across modalities and the extraction of information within each modality. 3. Pretraining tasks should be general and simple so that they can be applied to different modalities.
Motivated by these requirements, researchers from DAMO Academy and Huazhong University of Science and Technology propose ONE-PEACE, a 4B-parameter model that can seamlessly align and integrate representations across the vision, audio, and language modalities. The ONE-PEACE architecture consists of several modality adapters and a modality fusion encoder. Each modality has an adapter that converts raw inputs into feature sequences. The modality fusion encoder, which is based on the Transformer architecture, operates on these feature sequences. Each Transformer block contains a shared self-attention layer and multiple modality-specific feed-forward networks (FFNs). The modality FFNs help extract information within each modality, while the shared self-attention layer uses the attention mechanism to enable interaction among multimodal features.
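To make this division of labor concrete, below is a minimal PyTorch-style sketch of such a block: one self-attention layer shared by all modalities and a separate FFN per modality. It is an illustration under assumptions (the class name, dimensions, and routing are hypothetical), not the official ONE-PEACE implementation.

```python
# Illustrative sketch (not the official ONE-PEACE code) of a Transformer block
# with one shared self-attention layer and per-modality feed-forward networks.
import torch
import torch.nn as nn


class ModalityFusionBlock(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 12, ffn_ratio: int = 4):
        super().__init__()
        self.norm_attn = nn.LayerNorm(dim)
        # Shared self-attention: lets features from all modalities interact.
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_ffn = nn.LayerNorm(dim)
        # One FFN per modality: extracts information within each modality.
        self.ffns = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(dim, ffn_ratio * dim), nn.GELU(),
                             nn.Linear(ffn_ratio * dim, dim))
            for m in ("vision", "audio", "language")
        })

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        # x: (batch, seq_len, dim) feature sequence from a modality adapter.
        h = self.norm_attn(x)
        attn_out, _ = self.self_attn(h, h, h)
        x = x + attn_out                                # shared interaction
        x = x + self.ffns[modality](self.norm_ffn(x))   # modality-specific FFN
        return x


# Usage: image and audio feature sequences pass through the same block,
# sharing attention weights but using different FFN weights.
block = ModalityFusionBlock()
img_feats = block(torch.randn(2, 196, 768), modality="vision")
aud_feats = block(torch.randn(2, 128, 768), modality="audio")
```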
This clear division of labor makes it straightforward to add new modalities, since doing so only requires adding adapters and FFNs. The authors design two modality-agnostic pretraining tasks for ONE-PEACE. The first is cross-modal contrastive learning, which combines vision-language contrastive learning and audio-language contrastive learning to align the semantic spaces of the vision, audio, and language modalities. The second is intra-modal denoising contrastive learning, which can be viewed as a combination of masked prediction and contrastive learning: the contrastive loss is computed between fine-grained masked features and visible features, such as image patches, language tokens, or audio waveform features.
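As an illustration of the cross-modal objective, the sketch below implements a symmetric InfoNCE-style contrastive loss between paired embeddings (for example image-text or audio-text). The function name, temperature value, and embedding sizes are assumptions made for the example, not values taken from the paper.

```python
# Hedged sketch of a cross-modal contrastive loss (symmetric InfoNCE).
# Names and hyperparameters are illustrative, not from the ONE-PEACE codebase.
import torch
import torch.nn.functional as F


def cross_modal_contrastive_loss(emb_a: torch.Tensor,
                                 emb_b: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """emb_a, emb_b: (batch, dim) global embeddings of two modalities,
    where row i of emb_a is paired with row i of emb_b."""
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature           # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Matched pairs lie on the diagonal; off-diagonal entries act as negatives.
    loss_a2b = F.cross_entropy(logits, targets)
    loss_b2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2b + loss_b2a)


# Usage: align image-text and audio-text pairs with the same objective.
img, txt, aud = torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512)
loss = cross_modal_contrastive_loss(img, txt) + cross_modal_contrastive_loss(aud, txt)
```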
Thanks to the easy-to-scale model design and pretraining tasks, ONE-PEACE can in principle be extended to an unlimited number of modalities. Together, these tasks improve performance during fine-tuning while preserving cross-modal robustness. They also remove the need for modality-specific designs because they apply universally across modalities. The authors conduct extensive experiments on a variety of tasks spanning the vision, audio, vision-language, and audio-language domains. Without using any pre-trained vision or language model for initialization, ONE-PEACE achieves state-of-the-art results on both unimodal and multimodal tasks. The code is publicly available on GitHub.
Check out the Paper and GitHub. Don’t forget to join our 21k+ ML SubReddit, Discord channel, and email newsletter, where we share the latest AI research news, exciting AI projects, and more. If you have any questions about the article above or if we missed anything, feel free to email us at [email protected]
🚀 Check out 100 AI tools at AI Tools Club
Aneesh Tickoo is a consulting intern at MarktechPost. She is currently pursuing her bachelor’s degree in Information Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects aimed at harnessing the power of machine learning. Her research interest is image processing, and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.