Vision-language models (VLMs) are increasingly used to answer queries about visual content. Despite their advances, they often suffer from a major problem: generating plausible but incorrect responses, known as hallucinations. These hallucinations can erode trust in such systems, especially in high-stakes real-world applications. Evaluating the helpfulness and truthfulness of VLM-generated responses is challenging because it requires not only understanding the visual content but also verifying each claim made in the response. Traditional benchmarks have been inadequate for this challenge, either because they limit assessment to binary, simplistic questions or because they rely on incomplete context to judge open-ended responses.
Researchers at Salesforce AI Research have proposed Programmatic VLM Evaluation (PROVE), a new benchmarking paradigm that evaluates VLM responses to open-ended visual queries. In PROVE, the researchers build a high-fidelity scene graph representation from highly detailed image captions and use a large language model (LLM) to generate diverse question-answer (QA) pairs along with executable programs that verify each QA pair. This approach yields a challenging, visually grounded QA benchmark of 10.5K pairs. The evaluation strategy measures both the helpfulness and truthfulness of VLM responses within a unified framework based on scene graph comparisons. This programmatic evaluation provides a more reliable and interpretable assessment of VLM performance than previous benchmarks.
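To make the idea concrete, here is a minimal, hypothetical sketch (not the authors' code) of what a programmatically verifiable QA pair might look like: a question survives only if an executable program can check its answer against the scene graph. The scene contents and function name below are invented for illustration.

```python
# A toy scene graph: entities with attributes and (relation, target) edges.
# All names here are invented for illustration.
scene_graph = {
    "entities": {
        "man": {"attributes": ["standing"], "relations": [("holding", "umbrella")]},
        "umbrella": {"attributes": ["red"], "relations": []},
    }
}

def verify_umbrella_color(graph, answer):
    """Verification program for the QA pair:
    Q: 'What color is the umbrella the man is holding?'  A: 'red'."""
    # Check the question's premise first: the man must be holding an umbrella.
    if ("holding", "umbrella") not in graph["entities"]["man"]["relations"]:
        return False  # premise unverifiable -> the QA pair would be discarded
    # The answer is correct if it matches an attribute of the umbrella entity.
    return answer in graph["entities"]["umbrella"]["attributes"]
```

Because the check is an executable program rather than a judgment call, a QA pair whose premise or answer cannot be verified against the graph can be filtered out automatically.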
The PROVE benchmark uses detailed scene graph representations and executable programs to verify the correctness of VLM responses. The scene graphs, constructed from detailed image captions, contain the entities, attributes, and relationships that make up the visual scene. Prompting an LLM, the researchers generate open-ended QA pairs and corresponding verification programs that ensure questions are challenging yet verifiable. Only QA pairs that can be checked programmatically are kept in the benchmark, resulting in a high-quality dataset. Evaluation involves extracting scene graph representations from both the model response and the ground-truth response, then computing scores based on the recall and precision of these representations, which measure how helpful and truthful the responses are.
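The precision/recall idea above can be sketched as follows. This is an illustrative assumption, not the paper's implementation: each scene graph is reduced to a set of (subject, relation, object) tuples, with precision over the response graph serving as a truthfulness proxy and recall of the ground-truth graph as a helpfulness proxy.

```python
def graph_tuples(scene_graph):
    """Flatten a list of edge dicts into a set of (subject, relation, object) tuples."""
    return {(e["subject"], e["relation"], e["object"]) for e in scene_graph}

def score_response(response_graph, truth_graph):
    """Return (truthfulness, helpfulness) scores in [0, 1]."""
    resp = graph_tuples(response_graph)
    truth = graph_tuples(truth_graph)
    if not resp or not truth:
        return 0.0, 0.0
    overlap = len(resp & truth)
    truthfulness = overlap / len(resp)   # fraction of the response's claims that are grounded
    helpfulness = overlap / len(truth)   # fraction of the ground truth the response recovers
    return truthfulness, helpfulness
```

Under this toy scheme, a terse but accurate response scores high on truthfulness and low on helpfulness, while a verbose response with unsupported claims shows the opposite pattern.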
The evaluation results show that current VLMs struggle to strike a good balance between helpfulness and truthfulness. Models such as GPT-4o, Phi-3.5-Vision, and Pixtral achieved higher helpfulness scores, but not necessarily higher truthfulness. The study also found that increasing model size tends to improve helpfulness but does not always improve truthfulness. Evaluating a range of models revealed that recent advances in training better VLMs have increased helpfulness but have not consistently translated into more truthful outputs. Notably, the LLaVA-1.5 model series achieved the best truthfulness scores, indicating that smaller, more focused models can outperform larger ones at maintaining accuracy.
In conclusion, PROVE represents a significant advance in evaluating the helpfulness and truthfulness of VLM-generated responses. By leveraging detailed scene graph representations and programmatic verification, the benchmark provides a more reliable and interpretable evaluation framework. The findings underscore the need for VLMs that balance generating informative responses with generating accurate ones, especially as their use in real-world applications continues to grow. Future research is expected to focus on improving both the helpfulness and truthfulness of these models through advanced training techniques and new evaluation strategies.
Check out the Paper and the dataset card. All credit for this research goes to the researchers of this project.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an AI media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable to a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.