Multimodal large language models (MLLMs) have been making rapid strides. By incorporating images into large language models (LLMs) and building on their capabilities, MLLMs demonstrate impressive abilities in tasks such as visual question answering, instruction following, and image understanding. Despite these improvements, studies have identified a major flaw: the models still exhibit surprisingly simple and obvious visual shortcomings.
According to recent research from UC Berkeley and New York University, these deficiencies may stem from problems in the models' visual representations.
Pretrained vision and language models form the backbone of most MLLMs, with adapter modules coupling the two modalities. A common hypothesis is that any defects in the pre-trained vision model propagate to the MLLMs built on top of it, as sketched below.
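As a rough illustration of this coupling, a LLaVA-style MLLM passes the frozen vision encoder's patch features through a small learned adapter that projects them into the LLM's token-embedding space. The sketch below is a hypothetical minimal version; the class name and dimensions are chosen for illustration and are not the paper's code.

```python
# Hypothetical sketch of the vision-to-LLM coupling used by most open-source MLLMs:
# patch features from a frozen vision encoder are projected into the LLM's
# embedding space by a small adapter (a linear layer or MLP) and prepended to
# the text tokens. Dimensions are illustrative.
import torch.nn as nn

class VisionToLLMAdapter(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features):
        # patch_features: (batch, num_patches, vision_dim) from the frozen encoder
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)
```

If the frozen encoder's features already miss a visual detail, no adapter of this kind can recover it, which is why defects in the vision backbone matter downstream.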
For the visual encoder, most open-source MLLMs use the pre-trained Contrastive Language-Image Pre-training (CLIP) model. The researchers begin by cataloging failure cases: images that CLIP has difficulty encoding accurately. To find them, they exploit erroneous agreements in the embedding space; if CLIP encodes two visually distinct images similarly, at least one of them is likely encoded ambiguously. Such a pair of images is called a CLIP-blind pair. To measure how visually different the two images actually are, the team employs a vision-only self-supervised encoder, DINOv2. Concretely, CLIP-blind pairs are images with similar CLIP embeddings but clearly different DINOv2 embeddings, and these pairs turn out to lead downstream MLLMs into mistakes.
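Below is a minimal sketch of how such CLIP-blind pairs might be mined, assuming a list of PIL images. The model checkpoints, the reuse of CLIP preprocessing for DINOv2, and the similarity thresholds are illustrative choices, not the paper's exact pipeline.

```python
# Sketch: find CLIP-blind pairs, i.e. image pairs that CLIP embeds similarly
# but a vision-only encoder (DINOv2) embeds very differently.
import torch
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device).eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14").to(device).eval()

@torch.no_grad()
def clip_embed(images):
    inputs = clip_proc(images=images, return_tensors="pt").to(device)
    feats = clip.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

@torch.no_grad()
def dino_embed(images):
    # DINOv2 expects 224x224 tensors; reusing CLIP's preprocessing here is an
    # approximation for brevity (DINOv2 normally uses ImageNet normalization).
    pixels = clip_proc(images=images, return_tensors="pt")["pixel_values"].to(device)
    feats = dino(pixels)
    return torch.nn.functional.normalize(feats, dim=-1)

def find_clip_blind_pairs(images, clip_thresh=0.95, dino_thresh=0.6):
    """Return index pairs whose CLIP embeddings agree but DINOv2 embeddings disagree.
    Thresholds are illustrative."""
    c, d = clip_embed(images), dino_embed(images)
    clip_sim, dino_sim = c @ c.T, d @ d.T
    pairs = []
    for i in range(len(images)):
        for j in range(i + 1, len(images)):
            if clip_sim[i, j] > clip_thresh and dino_sim[i, j] < dino_thresh:
                pairs.append((i, j))
    return pairs
```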
From these pairs, a new benchmark called Multimodal Visual Patterns (MMVP) is introduced. It assesses the visual capabilities of state-of-the-art MLLMs with basic questions specifically designed to probe the differences within CLIP-blind pairs. The researchers tested GPT-4V and other SOTA MLLMs on the benchmark and found that all of them struggle with these simple questions about visual details. Most of the models perform worse than random guessing; GPT-4V is an outlier, yet even it shows a performance gap of more than 50% relative to humans.
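For illustration, here is a hedged sketch of pair-level scoring in the spirit of MMVP, where a CLIP-blind pair counts as correct only if the model answers the question correctly for both images in the pair; the function name and input format are assumptions rather than the benchmark's released code.

```python
# Hypothetical pair-level scoring: a pair is credited only when the model gets
# the question right for BOTH images, which penalizes lucky single-image guesses.
from collections import defaultdict

def mmvp_pair_accuracy(results):
    """results: iterable of (pair_id, is_correct) tuples, two entries per pair."""
    per_pair = defaultdict(list)
    for pair_id, ok in results:
        per_pair[pair_id].append(ok)
    correct_pairs = sum(1 for answers in per_pair.values() if all(answers))
    return correct_pairs / len(per_pair)
```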
After collecting these individual failure cases, the researchers investigated whether MMVP contains systematic visual patterns that CLIP models struggle with. Nine recurring patterns emerge in the CLIP-blind pairs, such as "orientation," "counting," and "viewpoint," all of which pose considerable difficulty for the CLIP vision encoder. Since scaling up CLIP's training data and model size has been an ongoing and substantial effort, the MMVP cases, grouped by visual pattern, were used to systematically evaluate whether scaling alone alleviates these difficulties. The results suggest it does not: model and data scale were insufficient, as no large-scale CLIP-based model resolved the nine visual patterns identified. Furthermore, CLIP's performance on these visual patterns is strongly correlated with MLLM performance: if CLIP struggles with a pattern such as "orientation," the MLLMs built on it likely will too (a correlation sketched below). The CLIP vision encoder can thus become a bottleneck for such systems.
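A hedged sketch of this pattern-level analysis follows: compute per-pattern accuracy for a CLIP model and for an MLLM, then correlate the two. The function names and the input format are assumptions, not the paper's released evaluation code.

```python
# Group question-level results by visual pattern and correlate CLIP's
# per-pattern accuracy with an MLLM's per-pattern accuracy.
from collections import defaultdict
from scipy.stats import pearsonr

def per_pattern_accuracy(results):
    """results: iterable of (pattern, is_correct) pairs, e.g. ("orientation", True)."""
    totals, hits = defaultdict(int), defaultdict(int)
    for pattern, ok in results:
        totals[pattern] += 1
        hits[pattern] += int(ok)
    return {p: hits[p] / totals[p] for p in totals}

def pattern_correlation(clip_results, mllm_results):
    """Pearson correlation between CLIP and MLLM accuracies over shared patterns."""
    clip_acc = per_pattern_accuracy(clip_results)
    mllm_acc = per_pattern_accuracy(mllm_results)
    patterns = sorted(set(clip_acc) & set(mllm_acc))
    r, _ = pearsonr([clip_acc[p] for p in patterns], [mllm_acc[p] for p in patterns])
    return r
```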
As a final step, the team improves the visual grounding of MLLMs by integrating a self-supervised, vision-only model such as DINOv2 alongside CLIP. These methods are called Mixture-of-Features (MoF). The first variant, Additive-MoF (A-MoF), linearly mixes CLIP and DINOv2 features in different ratios. It shows that DINOv2 features improve visual grounding, but at the expense of a reduced ability to follow instructions. The second variant, Interleaved-MoF (I-MoF), spatially interleaves visual tokens from the CLIP and DINOv2 models. This technique is found to greatly improve visual grounding while leaving instruction-following ability intact.
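A minimal, hypothetical sketch of the two MoF ideas is shown below, assuming both encoders already emit sequences of patch tokens projected to a common hidden size; the class names and interfaces are illustrative, not the authors' implementation.

```python
# Two ways to mix CLIP and DINOv2 visual tokens before feeding them to the LLM.
import torch
import torch.nn as nn

class AdditiveMoF(nn.Module):
    """A-MoF: linearly blend CLIP and DINOv2 tokens with a mixing ratio alpha."""
    def __init__(self, alpha: float = 0.5):
        super().__init__()
        self.alpha = alpha

    def forward(self, clip_tokens, dino_tokens):
        # clip_tokens, dino_tokens: (batch, num_patches, hidden)
        return self.alpha * clip_tokens + (1.0 - self.alpha) * dino_tokens

class InterleavedMoF(nn.Module):
    """I-MoF: keep both token sets and interleave them along the sequence
    dimension, so token i from CLIP is followed by token i from DINOv2 and
    spatial order is preserved."""
    def forward(self, clip_tokens, dino_tokens):
        b, n, h = clip_tokens.shape
        stacked = torch.stack((clip_tokens, dino_tokens), dim=2)  # (b, n, 2, h)
        return stacked.reshape(b, 2 * n, h)                        # (b, 2n, h)
```

The design trade-off mirrors the reported results: additive blending dilutes the CLIP features the LLM was aligned with, hurting instruction following, while interleaving keeps both feature streams intact at the cost of a longer visual token sequence.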
The pretrained CLIP vision encoders used in MLLMs overlook critical visual details in images and fail to capture important visual patterns, causing the MLLMs to stumble on simple questions. Even so, CLIP-style models remain the gold standard for scalable vision models. The study's findings challenge the common assumption that simply scaling data and model size will fix CLIP's problems. They also show that the two dominant families of visual representation learning, vision-language models and vision-only self-supervised models, have complementary strengths and weaknesses that are not captured by the usual comparison metrics such as linear probing and zero-shot accuracy on ImageNet. New evaluation metrics are needed to guide the development of visual representation learning algorithms, even though a well-designed feature-mixing approach can already overcome some visual limitations and combine the strengths of the two paradigms. The team hopes their work inspires further advances in vision models.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.