The ability to learn to evaluate is playing an increasingly central role in the development of modern large multimodal models (LMMs). As pre-training on existing web data reaches its limits, researchers are shifting toward post-training with AI-enhanced synthetic data, a transition that makes reliable evaluation even more important. Reliable AI evaluation reduces the human effort required for complex task assessment, generates effective reward signals in reinforcement learning, and guides search at inference time. Despite progress in single-image, multi-image, and video scenarios, open LMMs capable of evaluating the outputs of other multimodal models remain a gap in the field.
Existing attempts to address the AI assessment challenge have primarily relied on proprietary LMMs such as GPT-4V as generalist evaluators for vision-language tasks. These models have been used in evaluation benchmarks for complex scenarios such as visual chat and detailed captioning. Open-source alternatives such as Prometheus-Vision have also emerged as evaluators for user-designed scoring criteria. In preference learning for LMMs, techniques such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) have been applied to align models with human intent. Recent research has extended these ideas to the multimodal space, exploring strategies to improve visual chat ability and reduce hallucinations in vision-language models.
Researchers from ByteDance and the University of Maryland, College Park have proposed LLaVA-Critic, the first LMM designed specifically for evaluation tasks. The approach rests on a curated, high-quality instruction-following dataset tailored to evaluation and targets two main scenarios: serving as an LMM-as-a-Judge and supporting preference learning. In the first scenario, it aims to provide reliable evaluation scores comparable to proprietary models such as GPT-4V, offering a free alternative across evaluation benchmarks. In the second, it offers a scalable way to generate effective reward signals, reducing dependence on costly human feedback collection. LLaVA-Critic shows high correlation with commercial GPT models on evaluation tasks and superior performance in preference learning.
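In the preference-learning scenario, a critic's pairwise verdicts can be converted into training pairs for an algorithm like DPO. The sketch below is a minimal illustration of that bookkeeping; the critic call itself is stubbed out, and the function and field names are hypothetical, not the paper's actual API.

```python
# Hypothetical sketch: turning a critic's pairwise verdict into a
# DPO-style preference pair. The critic LMM call is assumed to have
# already produced a verdict of "A" or "B".
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response the critic preferred
    rejected: str  # response the critic rejected


def build_preference_pair(prompt: str, response_a: str, response_b: str,
                          verdict: str) -> PreferencePair:
    """Map a critic verdict ("A" or "B") onto a (chosen, rejected) pair."""
    if verdict not in ("A", "B"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    if verdict == "A":
        return PreferencePair(prompt, response_a, response_b)
    return PreferencePair(prompt, response_b, response_a)
```

Pairs produced this way can feed a standard DPO loss, replacing human annotators with the critic model as the source of preference labels.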
LLaVA-Critic is developed by fine-tuning a pretrained LMM that already follows diverse instructions, ensuring the model can handle a wide range of vision tasks. Training uses an evaluation prompt that combines the multimodal instruction, one or more model responses, and an optional reference response. LLaVA-Critic is trained to predict quantitative pointwise scores or pairwise rankings according to the stated criteria, along with detailed justifications for its judgments, using standard cross-entropy loss over both the judgments and the justifications. The researchers start from pretrained LLaVA-OneVision (OV) 7B/72B checkpoints and fine-tune them on the LLaVA-Critic-113k dataset for one epoch.
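The evaluation prompt described above can be sketched as a simple template assembler. This is an illustrative approximation only; the section labels, wording, and `mode` parameter are assumptions, not the paper's exact prompt format.

```python
# Illustrative sketch of a critic evaluation prompt, assuming a simple
# bracketed-section template (not the paper's actual wording).
def build_critic_prompt(instruction, responses, reference=None,
                        mode="pointwise"):
    """Assemble a critic prompt from a multimodal instruction, one or
    more model responses, and an optional reference response.

    mode: "pointwise" asks for a numeric score; "pairwise" asks for a
    ranking between two responses. Both request a justification, matching
    the training objective described above.
    """
    parts = [f"[Instruction]\n{instruction}"]
    for i, resp in enumerate(responses, start=1):
        parts.append(f"[Response {i}]\n{resp}")
    if reference is not None:
        parts.append(f"[Reference]\n{reference}")
    if mode == "pointwise":
        parts.append("Rate the response on a 1-10 scale and justify the score.")
    else:
        parts.append("State which response is better and justify the ranking.")
    return "\n\n".join(parts)
```

At training time, prompts like this would be paired with target judgments and justifications, and the model optimized with ordinary next-token cross-entropy.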
The results demonstrate significant improvements in both the pointwise scoring and pairwise ranking capabilities of LLaVA-Critic over the baseline models. In pointwise scoring, LLaVA-Critic-72B achieves the highest average Pearson-r (0.754) and Kendall's Tau (0.933), outperforming the LLaVA-OV-72B baseline. In pairwise ranking, LLaVA-Critic-72B surpasses GPT-4o and GPT-4V in no-tie comparisons, reaching an accuracy of 73.6%. LLaVA-Critic-7B also outperforms most baselines, including commercial models and other open-source LMMs, in the MLLM-as-a-Judge scenario. These results position LLaVA-Critic as an effective open-source alternative for evaluating multimodal models.
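The correlation metrics reported above measure how closely a critic's scores track a reference judge's scores. For clarity, here is a minimal pure-Python sketch of both statistics (the Kendall's Tau variant below ignores ties, i.e. tau-a); it is a pedagogical illustration, not the paper's evaluation code.

```python
# Minimal sketch of the two correlation metrics used to compare a critic's
# scores against reference scores. Pedagogical only; a real evaluation
# would typically use scipy.stats.pearsonr / kendalltau.
def pearson_r(xs, ys):
    """Linear correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def kendall_tau(xs, ys):
    """Rank correlation (tau-a: concordant minus discordant pairs,
    over all pairs; ties are not specially handled)."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A critic whose scores rise and fall exactly with the reference judge's yields 1.0 on both metrics; values near the reported 0.754 and 0.933 indicate strong but imperfect agreement.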
In conclusion, the researchers have proposed LLaVA-Critic, the first LMM designed specifically for evaluation tasks, built on a diverse, high-quality instruction-following dataset, and it excels in two critical areas. First, as a generalist evaluator, LLaVA-Critic shows remarkable alignment with human preferences and with GPT-4o across evaluation tasks, offering a viable open-source alternative to commercial models. Second, in preference learning, LLaVA-Critic works as a reliable reward model, outperforming human-feedback-based approaches at improving the visual chat capabilities of LMMs. This research is a key step toward building self-critique capabilities into open-source LMMs, enabling future advances in scalable, superhuman AI alignment feedback.
Check out the Paper and Project. All credit for this research goes to the researchers of this project.
Sajjad Ansari is a final-year student at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI, with a focus on understanding AI technologies and their real-world implications. His goal is to articulate complex AI concepts in a clear and accessible way.