Mathematical reasoning, a core component of advanced human cognition, reveals the complexities of human intelligence. It draws on logical thinking and specialized knowledge expressed not only in words but also in images, making it crucial for assessing these skills and for practical applications in AI. However, current AI datasets are often narrowly focused and lack a thorough exploration of combining visual language understanding with mathematics.
While large language models (LLMs) and large multimodal models (LMMs) demonstrate remarkable problem-solving abilities across a variety of tasks, their suitability for mathematical reasoning in visual contexts remains understudied. To address this gap, researchers from UCLA, the University of Washington, and Microsoft present MATHVISTA, a benchmark that combines challenges from diverse mathematical and visual tasks. The benchmark comprises 6,141 examples drawn from 28 existing multimodal mathematics-related datasets and three newly created datasets (IQTest, FunctionQA, and PaperQA). Completing these tasks successfully requires nuanced visual understanding and complex compositional reasoning, which remain difficult even for the most advanced foundation models.
In this article, the authors present MATHVISTA, a comprehensive benchmark for mathematical reasoning in visual contexts. They propose a task taxonomy to guide its development, identifying seven types of mathematical reasoning and focusing on five main tasks: figure question answering (FQA), geometry problem solving (GPS), math word problems (MWP), textbook question answering (TQA), and visual question answering (VQA). The benchmark covers a wide range of visual contexts, including natural images, geometric diagrams, abstract scenes, synthetic scenes, figures, charts, and plots. MATHVISTA incorporates 28 existing multimodal datasets, comprising 9 mathematics-targeted question answering (MathQA) datasets and 19 VQA datasets.
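For readers who want to explore the benchmark programmatically, the following is a minimal sketch of loading MATHVISTA and tallying examples by task and visual context. It assumes the dataset is distributed on the Hugging Face Hub under an ID such as "AI4Math/MathVista" with a "testmini" split and per-example metadata fields for task and context; check the official project page for the actual identifiers and field names.

```python
# Minimal sketch: loading MATHVISTA and grouping examples by task type.
# The dataset ID "AI4Math/MathVista", the "testmini" split, and the
# metadata field names are assumptions; verify them against the project page.
from collections import Counter

from datasets import load_dataset

dataset = load_dataset("AI4Math/MathVista", split="testmini")

# Count examples per task (FQA, GPS, MWP, TQA, VQA) and per visual context.
task_counts = Counter(example["metadata"]["task"] for example in dataset)
context_counts = Counter(example["metadata"]["context"] for example in dataset)

print("Examples per task:", dict(task_counts))
print("Examples per visual context:", dict(context_counts))
```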
The researchers extensively evaluated 12 leading foundation models: three large language models (LLMs), namely ChatGPT, GPT-4, and Claude-2; two proprietary large multimodal models (LMMs), GPT-4V and Bard; and seven open-source LMMs. They evaluated these models on MATHVISTA in zero-shot and few-shot settings with chain-of-thought (CoT) and program-of-thought (PoT) prompting strategies. The figure above shows examples of the newly annotated datasets: IQTest, FunctionQA, and PaperQA.
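The two prompting strategies differ mainly in the prompt template and in how the final answer is obtained: CoT asks the model to reason step by step in natural language, while PoT asks it to emit executable Python whose printed output is taken as the answer. The sketch below illustrates the two styles under simplified assumptions; `query_model` is a hypothetical stand-in for whatever model API is used, and the templates are not the paper's exact prompts.

```python
# Illustrative sketch of CoT vs. PoT prompting for a MATHVISTA-style question.
# `query_model` (passed in by the caller) is a hypothetical model interface;
# the prompt templates are simplified, not the authors' exact prompts.
import contextlib
import io


def build_cot_prompt(question: str, image_caption: str) -> str:
    # Chain-of-thought: ask for step-by-step natural-language reasoning.
    return (
        f"Image description: {image_caption}\n"
        f"Question: {question}\n"
        "Let's think step by step, then state the final answer."
    )


def build_pot_prompt(question: str, image_caption: str) -> str:
    # Program-of-thought: ask for Python code whose printed output is the answer.
    return (
        f"Image description: {image_caption}\n"
        f"Question: {question}\n"
        "Write Python code that computes the answer and prints it."
    )


def answer_with_pot(question: str, image_caption: str, query_model) -> str:
    # Generate a program with the model, run it, and take its stdout as the answer.
    code = query_model(build_pot_prompt(question, image_caption))
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})  # assumes generated code runs in a trusted sandbox
    return buffer.getvalue().strip()
```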
The findings reveal that CoT GPT-4, the best-performing text-only model without visual augmentation, achieves an overall accuracy of 29.2%. In comparison, the best-performing multimodal model, Multimodal Bard, achieves 34.8%, which corresponds to 58% of human performance (34.8% vs. 60.3%). When PoT GPT-4 is augmented with Bard-generated captions and OCR text, it reaches 33.9%, close to Multimodal Bard.
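As a quick sanity check, the reported "58% of human performance" figure is simply the model's accuracy divided by the human accuracy; a one-line computation using the numbers quoted above is shown here.

```python
# Reproducing the reported relative-to-human figure from the quoted accuracies.
bard_accuracy = 34.8    # Multimodal Bard overall accuracy (%)
human_accuracy = 60.3   # reported human performance (%)

relative_to_human = bard_accuracy / human_accuracy * 100
print(f"Bard reaches {relative_to_human:.1f}% of human performance")  # ~57.7%, i.e. about 58%
```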
Further analysis suggests that Multimodal Bard's shortcomings stem from incorrect calculations and from hallucinations arising in visual perception and textual reasoning. Notably, GPT-4V, the latest multimodal version of GPT-4, achieves a state-of-the-art accuracy of 49.9%, an improvement of 15.1 percentage points over Multimodal Bard, as reported in the first comprehensive evaluation using MATHVISTA. As the field continues to advance, this work provides valuable insights for further refining mathematical reasoning in multimodal AI systems.
Check out the Paper and Project. All credit for this research goes to the researchers of this project.
Janhavi Lande graduated with a degree in Engineering Physics from IIT Guwahati, Class of 2023. She is an aspiring data scientist and has been working in ML/AI research for the last two years. What fascinates her most is this ever-changing world and its constant demand for humans to keep up. In her spare time she likes to travel, read, and write poems.