The fields of natural language processing (NLP) and natural language generation (NLG) have undergone astonishing transformations since the introduction of large language models (LLMs) and multimodal foundation models. These models, which include GPT-4V, Claude, and Gemini, pair vision encoders with large language models.
Current foundation models show remarkable performance on text-only inputs as well as on combined image-and-text inputs. However, an important question arises: does their performance change depending on the form in which the input is presented?
To answer this question, a team of researchers has introduced IsoBench, a benchmark dataset containing problems from four major domains: games, science, mathematics, and algorithms. Each problem in IsoBench comes in several isomorphic representations, including textual, mathematical, and graphical formats. This diversity makes it possible to examine in detail the performance disparities that arise from different forms of representation.
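To make the idea of isomorphic representations concrete, the sketch below shows what a single sample could look like in Python. The field names and the example problem are illustrative assumptions, not IsoBench's actual schema.

```python
from dataclasses import dataclass

@dataclass
class IsoBenchSample:
    """One problem expressed in several isomorphic forms.

    The schema is hypothetical, for illustration only.
    """
    domain: str       # e.g., "mathematics", "chess", "chemistry"
    question: str     # the task, shared by all representations
    text_repr: str    # domain-specific textual form (e.g., FEN, SMILES)
    math_repr: str    # mathematical form (e.g., a LaTeX expression)
    image_path: str   # visual form (a rendered figure)
    label: str        # ground-truth answer, identical across forms

# The same function-parity problem, expressed three ways:
sample = IsoBenchSample(
    domain="mathematics",
    question="Is this function even, odd, or neither?",
    text_repr="f(x) = x**3 - x",
    math_repr=r"f(x) = x^{3} - x",
    image_path="plots/cubic_minus_linear.png",
    label="odd",
)
```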
The team reports that IsoBench can serve as a diagnostic tool, providing detailed feedback on discrepancies in model performance caused by input representation. A recurring pattern emerges across a variety of foundation models: given the same problem, the models show a predilection for textual representations. For example, when evaluated on all IsoBench problems, Claude-3 Opus scores 28.7 points lower when given images instead of text. Likewise, GPT-4 Turbo and Gemini Pro drop by 18.7 and 14.9 points, respectively, when presented with image inputs instead of text.
Two prompting strategies, IsoCombination and IsoScratchPad, have been proposed to mitigate this bias toward textual representations and improve model performance. IsoScratchPad uses a scratchpad step to translate between input forms during inference, while IsoCombination feeds the model combinations of different input representations.
By exploiting the strengths of the different input modalities, these strategies reduce the performance disparities of foundation models. The team has shown experimentally that IsoCombination and IsoScratchPad both improve model performance, pointing to intriguing directions for future research on multimodal AI systems.
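Under stated assumptions, the sketch below illustrates how the two strategies differ. Here `call_model` is a hypothetical wrapper around a multimodal foundation-model API, and the prompt wording is illustrative rather than the authors' exact prompts.

```python
def call_model(text: str, image: bytes | None = None) -> str:
    """Placeholder for a multimodal foundation-model API call."""
    raise NotImplementedError

def iso_combination(question: str, text_repr: str, image: bytes) -> str:
    # IsoCB: present both isomorphic representations at once, so the
    # model can lean on whichever modality it handles better.
    prompt = f"{question}\n\nTextual representation:\n{text_repr}"
    return call_model(prompt, image=image)

def iso_scratchpad(question: str, image: bytes) -> str:
    # IsoSP: first have the model transcribe the image into its
    # textual form, then answer using that self-generated text.
    transcription = call_model(
        "Transcribe this figure into its textual representation.",
        image=image,
    )
    return call_model(f"{question}\n\n{transcription}")
```

IsoCB spends a longer prompt on a single call, while IsoSP makes two calls but lets the model reason over text, the modality on which it performs best.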
The team has summarized its main contributions as follows.
- IsoBench, an extensive test dataset with 1,630 samples covering a range of topics, including chess, physics, chemistry, and discrete and applied mathematics, has been introduced. Each sample comes with multiple isomorphic input representations, including domain-specific textual formats and visual formats, enabling comprehensive assessments of multimodal performance.
- Using IsoBench, the team evaluated eight well-known foundation models and found a recurring pattern: multimodal models perform better with text-only prompts than with image-based prompts.
- The team also proposed two methods to close the performance gaps between input modalities: IsoScratchPad (IsoSP), which translates visual inputs into textual representations during inference, and IsoCombination (IsoCB), which combines multiple input modalities.
- Based on their experiments, the team found that, in some settings, IsoCB and IsoSP improve the performance of multimodal foundation models by almost ten percentage points. These strategies reduce the observed bias toward textual representations and help the models perform better across a variety of input modalities.
Review the Paper and Project. All credit for this research goes to the researchers of this project.
Tanya Malhotra is a final-year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with strong analytical and critical-thinking skills, along with a keen interest in acquiring new skills, leading teams, and managing work in an organized manner.