Current challenges facing large vision and language models (VLMs) include limitations in the capabilities of individual visual components and problems arising from excessively long sequences of visual tokens. These challenges limit a model's ability to accurately interpret complex visual information and extensive contextual details. Recognizing the importance of overcoming these obstacles to improve performance and versatility, this article presents a novel approach.
The proposed solution leverages multiple visual experts together, combining the strengths of individual visual encoders with skills such as image-text matching, OCR, and image segmentation. The method incorporates a fusion network that harmonizes the outputs of the various visual experts, bridging the gap between the image encoders and the pre-trained large language model (LLM).
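To make the fusion idea concrete, here is a minimal sketch of how such a fusion network could project features from heterogeneous visual experts into a shared embedding space before handing them to the LLM. The module name, the example expert dimensions, and the simple MLP projectors are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SimpleFusionNetwork(nn.Module):
    """Illustrative sketch: project features from several visual experts
    into the LLM's embedding space and concatenate them as one visual
    token sequence. Not the paper's exact fusion module."""

    def __init__(self, expert_dims, llm_dim):
        super().__init__()
        # One small MLP projector per expert. expert_dims is a hypothetical
        # mapping, e.g. {"clip": 1024, "dinov2": 1536, "sam": 256}.
        self.projectors = nn.ModuleDict({
            name: nn.Sequential(
                nn.Linear(dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )
            for name, dim in expert_dims.items()
        })

    def forward(self, expert_features):
        # expert_features: {name: tensor of shape (batch, num_tokens, dim)}
        projected = [self.projectors[name](feats)
                     for name, feats in expert_features.items()]
        # Concatenate expert tokens so the LLM sees a single visual sequence.
        return torch.cat(projected, dim=1)
```

In this sketch, each expert keeps its own projector, so encoders with very different feature dimensions can still feed one shared visual sequence into the language model.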
Numerous researchers have highlighted the shortcomings of the CLIP encoder, citing challenges such as its inability to reliably capture basic spatial factors in images and its susceptibility to object hallucinations. Given the varying capabilities and limitations of different vision models, a fundamental question arises: How can the strengths of multiple visual experts be leveraged to synergistically improve overall performance?
Inspired by biological systems, the approach takes a polyvisual-expert perspective, similar to the functioning of the vertebrate visual system. In developing vision-language models (VLMs) with polyvisual experts, three main concerns come to the fore:
- The effectiveness of polyvisual experts,
- The optimal integration of multiple experts, and
- Preventing the combined visual tokens from exceeding the maximum context length of the LLM.
A candidate pool of six well-known experts, consisting of CLIP, DINOv2, LayoutLMv3, ConvNeXt, SAM, and MAE, was created to evaluate the effectiveness of multiple visual experts in VLMs. Using LLaVA-1.5 as the base configuration, single-, double-, and triple-expert combinations were explored across eleven benchmarks. The results, shown in Figure 1, demonstrate that as the number of visual experts increases, VLMs obtain richer visual information (attributed to more visual channels), raising the upper limit of multimodal capability across the benchmarks.
Figure 1. Left: Compared with InstructBLIP, Qwen-VL-Chat, and LLaVA-1.5-7B, the polyvisual-expert MouSi achieves SoTA on a broad range of nine benchmarks. Right: Performance of the best models with different numbers of experts on nine benchmark datasets. In general, triple experts outperform double experts, which in turn outperform a single expert.
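For illustration, the single-, double-, and triple-expert configurations drawn from this six-expert pool could be enumerated as in the sketch below; it only maps out the search space (41 configurations in total) and is not the authors' evaluation code.

```python
from itertools import combinations

experts = ["CLIP", "DINOv2", "LayoutLMv3", "ConvNeXt", "SAM", "MAE"]

# Enumerate every single-, double-, and triple-expert configuration
# that could be plugged into the LLaVA-1.5 base setup for evaluation.
configs = [combo for k in (1, 2, 3) for combo in combinations(experts, k)]

print(len(configs))  # 6 + 15 + 20 = 41 candidate configurations
for combo in configs[:5]:
    print("+".join(combo))
```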
Additionally, the article explores several positional encoding schemes aimed at mitigating the problems associated with long sequences of image features, addressing concerns about position overflow and length limitations. For example, the implemented schemes substantially reduce the positional occupancy of experts such as SAM, from 4,096 positions down to a more manageable 64, or even to 1.
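The snippet below sketches one way such a scheme could work: groups of visual tokens share a single position id, so an expert emitting 4,096 patch tokens occupies only 64 positions, or even a single one. The function name and grouping rule are assumptions for illustration; the paper's exact positional encoding schemes may differ.

```python
import torch

def shared_position_ids(num_visual_tokens, positions_budget, start=0):
    """Assign position ids so that groups of visual tokens share one id,
    shrinking positional occupancy (e.g. 4096 tokens -> 64 ids, or a
    single id when positions_budget == 1). Illustrative sketch only."""
    group_size = -(-num_visual_tokens // positions_budget)  # ceiling division
    return torch.arange(num_visual_tokens) // group_size + start

print(shared_position_ids(4096, 64).unique().numel())  # 64 distinct positions
print(shared_position_ids(4096, 1).unique().numel())   # 1 distinct position
```

The design trade-off is that tokens sharing a position id lose fine-grained ordering information in exchange for leaving far more of the LLM's context window free for text.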
Experimental results showed that VLMs employing multiple experts consistently outperform those relying on isolated visual encoders. Integrating additional experts yielded a marked boost in performance, highlighting the effectiveness of this approach for improving vision and language models. The results illustrate that the polyvisual approach significantly raises the performance of VLMs, surpassing the accuracy and depth of understanding achieved by existing models.
The results align with the hypothesis that a cohesive set of expert encoders can substantially improve the ability of VLMs to handle complex multimodal inputs. In summary, the research shows that using different visual experts makes vision and language models (VLMs) work better and helps them understand complex information more effectively. This not only addresses the current problems but also strengthens VLMs. In the future, this approach could change the way we bring vision and language together!
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Janhavi Lande graduated in Engineering Physics from IIT Guwahati, Class of 2023. She is an aspiring data scientist and has been working in ML/AI research for the last two years. What fascinates her most is this ever-changing world and its constant demand for humans to keep up. In her free time she likes to travel, read, and write poems.