Large multimodal models (LMMs) are advancing rapidly and are proving capable of handling complex tasks that require a combination of integrated skills, such as graphical user interface navigation, image-to-code conversion, and video understanding. Several benchmarks, including MME, MMBench, SEEDBench, MMMU, and MM-Vet, have been established to comprehensively evaluate the performance of LMMs. Among these, MM-Vet focuses on evaluating LMMs on their ability to integrate fundamental capabilities.
In recent research, MM-Vet has established itself as one of the most popular benchmarks for assessing LMMs, particularly through its use of open-ended vision-and-language questions designed to test integrated abilities. The benchmark assesses six fundamental vision and language skills: numeracy, recognition, knowledge, spatial awareness, language generation, and optical character recognition (OCR). Many real-world applications depend on the ability to interpret written and visual information coherently, and it is these skills, working together, that make that possible.
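To make the evaluation scheme concrete, here is a minimal sketch of how per-capability scores can be aggregated in an MM-Vet-style setup, where each question is tagged with the capabilities it exercises and a grader (in MM-Vet, an LLM judge) assigns each answer a score between 0 and 1. The sample records and field names below are hypothetical illustrations, not the benchmark's actual data format.

```python
from collections import defaultdict

# Hypothetical graded samples: each exercises a set of capabilities
# and has received a judge score in [0, 1].
samples = [
    {"capabilities": {"ocr", "spatial awareness"}, "score": 1.0},
    {"capabilities": {"recognition", "knowledge"}, "score": 0.5},
    {"capabilities": {"ocr", "numeracy"}, "score": 0.0},
]

def per_capability_scores(samples):
    """Average the scores of every sample that exercises a capability."""
    totals, counts = defaultdict(float), defaultdict(int)
    for s in samples:
        for cap in s["capabilities"]:
            totals[cap] += s["score"]
            counts[cap] += 1
    return {cap: totals[cap] / counts[cap] for cap in totals}

print(per_capability_scores(samples))
# e.g. {'ocr': 0.5, 'spatial awareness': 1.0, 'recognition': 0.5, ...}
```

Because every question can be tagged with several capabilities at once, a single open-ended answer contributes to multiple capability averages, which is how the benchmark measures integration rather than isolated skills.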
However, the original format of MM-Vet has one limitation: it only supports questions built around a single text-image pair. This fails to capture the complexity of real-world situations, where information is frequently presented as interleaved sequences of text and images. Such scenarios test a model in a more sophisticated and practical way, requiring it to understand and interpret varied textual and visual information in context.
MM-Vet has been enhanced in MM-Vet v2 to overcome this restriction. "Text and image sequence understanding" is the seventh VL capability included in this edition. It is intended to assess a model's ability to process sequences that interleave text and visual information, which is more representative of the tasks LMMs encounter in real-world scenarios. With this addition, MM-Vet v2 provides a more comprehensive assessment of an LMM's overall effectiveness and its ability to handle complex, interconnected tasks.
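The sketch below illustrates what such an interleaved question might look like as data, following the widely used chat-content convention of a list of typed segments. The file names, question text, and helper function are hypothetical; the point is that image position within the text stream carries meaning the model must track.

```python
# A hypothetical interleaved text-and-image question of the kind the
# seventh capability targets: two images referenced at specific points
# in the surrounding text.
question = [
    {"type": "text", "text": "Here is a receipt:"},
    {"type": "image", "path": "receipt.jpg"},
    {"type": "text", "text": "and here is the menu it came from:"},
    {"type": "image", "path": "menu.jpg"},
    {"type": "text", "text": "Which listed item was charged at the wrong price?"},
]

def render_for_model(segments):
    """Flatten segments into the (text, images) pair a model API expects.

    Placeholder tokens mark where each image sits in the text stream,
    preserving the ordering that sequence understanding depends on.
    """
    text_parts, images = [], []
    for seg in segments:
        if seg["type"] == "text":
            text_parts.append(seg["text"])
        else:
            images.append(seg["path"])
            text_parts.append(f"<image {len(images)}>")
    return " ".join(text_parts), images

prompt, image_paths = render_for_model(question)
print(prompt)       # "Here is a receipt: <image 1> and here is the menu ... <image 2> ..."
print(image_paths)  # ['receipt.jpg', 'menu.jpg']
```

A single-image benchmark cannot express this kind of question at all, since answering it requires cross-referencing content across two images in the order the text presents them.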
MM-Vet v2 also increases the size of the evaluation set whilst preserving the high calibre of the evaluation samples, ensuring that the benchmark remains stringent and reliable even as it expands to cover more difficult and varied tasks. When various LMMs were benchmarked on MM-Vet v2, Claude 3.5 Sonnet achieved the highest score (71.8), marginally outperforming GPT-4o at 71.0, suggesting that Claude 3.5 Sonnet is slightly more adept at the challenging tasks MM-Vet v2 assesses. With a competitive score of 68.4, InternVL2-Llama3-76B stood out as the best open-weight model.
In conclusion, MM-Vet v2 represents a major advance in the evaluation of LMMs: by adding the ability to understand and process interleaved image and text sequences, and by increasing the quality and scope of the evaluation set, it enables a more complete and realistic assessment of their capabilities.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter.
Don't forget to join our 48k+ ML SubReddit.
Find Upcoming AI Webinars here.
Tanya Malhotra is a final year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with good analytical and critical thinking skills, along with a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.