Large language models (LLMs) and their multimodal counterparts (MLLMs) have made significant strides toward artificial general intelligence (AGI) across many domains. Yet these models face a persistent challenge in visual mathematical problem solving. While MLLMs have demonstrated impressive capabilities across a variety of tasks, they struggle to realize their full potential when mathematical problems are presented in visual contexts. This limitation is particularly evident in scenarios where models must interpret geometric figures, understand spatial relationships, and integrate complex mathematical concepts with visual information.
The difficulty lies in the particular demands of visual math problem solving, which requires seamless integration of analytical reasoning from textual questions with contextual information provided by visual diagrams. Unlike purely text-based math problems, where LLMs have shown considerable progress due to the abundance of training data and their inherent mastery of language, visual math introduces an additional layer of complexity. Models must not only understand mathematical concepts but also accurately interpret visual elements such as geometric shapes, angles, measurements, and spatial relationships depicted in the diagrams.
Visual instruction tuning for MLLMs has seen significant advances through approaches such as LLaMA-Adapter, LLaVA, Flamingo, SPHINX, and InternVL, each of which introduces efficient techniques for integrating vision and language. At the same time, text-based mathematical problem solving has progressed through projects such as MAmmoTH, MetaMath, and MathCoder. In the multimodal mathematical domain, however, efforts remain limited. Datasets such as Geometry3K and UniMath have emerged, but their scope and scale are insufficient. G-LLaVA shows promise in plane geometry but struggles in other mathematical areas, highlighting the need for more robust and comprehensive approaches to visual mathematical problem solving.
Researchers from CUHK, Peking University, Shanghai AI Lab, and Oracle present MAVIS (MAthematical VISual instruction tuning), a framework that addresses the limitations of MLLMs in visual mathematical problem solving. It targets three critical problems: unsatisfactory mathematical diagram embeddings from vision encoders, diagram-language misalignment between vision encoders and LLMs, and inaccurate mathematical reasoning over visual elements. MAVIS introduces two extensive datasets, MAVIS-Caption and MAVIS-Instruct, covering multiple mathematical domains, and employs a three-stage progressive training process to improve visual diagram encoding and reasoning capabilities. The result is MAVIS-7B, a specialized MLLM optimized for visual mathematical tasks that outperforms existing open-source MLLMs on evaluation benchmarks, highlighting the effectiveness of this targeted approach.
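To make the training recipe concrete, here is a minimal Python sketch of the three-stage progressive schedule. Which modules are trainable at each stage follows common MLLM practice and the stage goals named in this article; it is an illustrative assumption, not the paper's exact recipe.

```python
# A schematic of the three-stage progressive training described above.
# Which modules are trainable at each stage is an assumption based on
# common MLLM practice and the stage goals named in this article.
STAGES = [
    {   # Stage 1: fix unsatisfactory diagram embeddings
        "data": "MAVIS-Caption",
        "objective": "math-specific vision encoder training",
        "trainable": ["vision_encoder"],
    },
    {   # Stage 2: fix diagram-language misalignment
        "data": "MAVIS-Caption",
        "objective": "caption generation (next-token prediction)",
        "trainable": ["projector"],
    },
    {   # Stage 3: develop visual mathematical reasoning
        "data": "MAVIS-Instruct",
        "objective": "instruction tuning on CoT rationales",
        "trainable": ["projector", "llm_adapters"],
    },
]

for i, stage in enumerate(STAGES, start=1):
    print(f"Stage {i}: train {stage['trainable']} on {stage['data']} "
          f"({stage['objective']})")
```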
MAVIS introduces an innovative data engine to efficiently generate high-quality mathematical diagrams, thus addressing the scarcity of visual mathematical datasets. The engine covers three main types of diagrams: plane geometry, analytic geometry, and function. For plane geometry, it employs multi-hop data curation principles, iteratively combining basic shapes to create diverse configurations. Analytic geometry diagrams are built in a Cartesian coordinate system, incorporating various geometric elements without overlap. Function diagrams focus on seven fundamental types, using parameterized equations to generate diverse plots. All diagrams are rendered using Matplotlib, with additional features such as vertex labeling and keypoint plotting to enhance mathematical understanding and reasoning capabilities.
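As an illustration of the data-engine idea, here is a minimal sketch of generating one function diagram from parameterized equations and rendering it with Matplotlib, as the article describes. The specific function families, parameter ranges, and metadata fields are hypothetical stand-ins for the paper's seven fundamental function types.

```python
import random
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Illustrative parameterized function families; the paper covers seven
# fundamental types, and these three are stand-ins.
FUNCTION_FAMILIES = {
    "quadratic": lambda a, b, c: (lambda x: a * x**2 + b * x + c),
    "sine":      lambda a, b, c: (lambda x: a * np.sin(b * x) + c),
    "linear":    lambda a, b, c: (lambda x: a * x + b),
}

def generate_function_diagram(out_path: str) -> dict:
    """Sample a random parameterized function, plot it, return metadata."""
    name = random.choice(list(FUNCTION_FAMILIES.keys()))
    a, b, c = (round(random.uniform(-3, 3), 1) for _ in range(3))
    f = FUNCTION_FAMILIES[name](a, b, c)

    x = np.linspace(-5, 5, 400)
    fig, ax = plt.subplots(figsize=(4, 4))
    ax.plot(x, f(x))
    ax.axhline(0, linewidth=0.8)  # draw coordinate axes
    ax.axvline(0, linewidth=0.8)
    ax.scatter([0], [f(0)], zorder=3)  # mark a keypoint (y-intercept)
    fig.savefig(out_path, dpi=120)
    plt.close(fig)

    # Metadata like this can later seed caption and question templates.
    return {"type": name, "params": (a, b, c), "image": out_path}

print(generate_function_diagram("diagram_0001.png"))
```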
MAVIS-Caption, a crucial component of the MAVIS framework, is a large-scale dataset comprising 588,000 diagram-caption pairs. It spans three mathematical domains: plane geometry (299,000 pairs), analytic geometry (77,000 pairs), and function (212,000 pairs). The captions are detailed, with an average length of 61.48 words and a vocabulary size of 149. Caption generation strategies vary by diagram type, using templates built with GPT-4 and domain-specific rules: plane geometry captions are built iteratively, analytic geometry captions use coordinate-based descriptions, and function captions detail various properties of the graphed functions. All captions are refined by ChatGPT for natural language expression, ensuring high-quality, diverse, and mathematically accurate descriptions of visual mathematical content.
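A minimal sketch of the template-filling step might look like the following. The templates and field names are hypothetical (in MAVIS the templates are built with GPT-4 and the filled captions are refined by ChatGPT), and it reuses the metadata dictionary from the diagram-generation sketch above.

```python
import random

# Hypothetical caption templates; in MAVIS these are built with GPT-4
# and vary per diagram type (plane geometry, analytic geometry, function).
FUNCTION_TEMPLATES = [
    "The figure shows the graph of {expr}, a {kind} function.",
    "Plotted is {expr}; the curve crosses the y-axis at {y0}.",
]

def caption_from_metadata(meta: dict) -> str:
    """Fill a caption template from the diagram metadata."""
    a, b, c = meta["params"]
    if meta["type"] == "quadratic":
        expr, y0 = f"y = {a}x^2 + {b}x + {c}", c
    elif meta["type"] == "linear":
        expr, y0 = f"y = {a}x + {b}", b
    else:  # sine
        expr, y0 = f"y = {a}*sin({b}x) + {c}", c
    # In the real pipeline, a ChatGPT pass would rewrite the filled
    # template into more natural language; here we return it raw.
    return random.choice(FUNCTION_TEMPLATES).format(
        expr=expr, kind=meta["type"], y0=y0)

print(caption_from_metadata(
    {"type": "linear", "params": (2.0, 1.0, 0.0), "image": "diagram_0001.png"}))
```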
MAVIS-Instruct is a comprehensive dataset of 834,000 visual math problems designed to enhance the visual math reasoning capabilities of MLLMs. It covers plane geometry and function problems, each accompanied by a chain-of-thought (CoT) rationale averaging 150 words. Questions in the dataset are deliberately text-light to minimize textual redundancy, encouraging MLLMs to extract critical information from the visual input. MAVIS-Instruct is compiled from four sources: manually collected problems augmented by GPT-4 (84K), existing datasets augmented by GPT-4 (80K), data-engine captions annotated by GPT-4 (51K), and problems generated directly by the data engine. This diverse sourcing ensures broad coverage of mathematical concepts and problem types while maintaining high-quality, detailed solutions and rationales for each problem.
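For a sense of what one training record could look like, here is a hypothetical MAVIS-Instruct sample. The field names are invented for illustration, but the structure reflects what the article describes: a diagram, a text-light question, a CoT rationale, and a source tag.

```python
# A hypothetical MAVIS-Instruct record; the field names are illustrative,
# not the dataset's actual schema.
sample = {
    "image": "diagrams/plane_geo_000123.png",
    "question": "In the figure, angle C = 90 degrees, AB = 5, AC = 4. Find BC.",
    "choices": ["2", "3", "4", "5"],
    "answer": "3",
    "rationale": (
        "Triangle ABC is right-angled at C, so by the Pythagorean theorem "
        "BC = sqrt(AB^2 - AC^2) = sqrt(25 - 16) = 3."
        # Real rationales average ~150 words of step-by-step reasoning.
    ),
    "source": "data_engine",  # one of the four sources listed above
}
```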
MAVIS-7B demonstrates superior performance on multiple mathematical benchmarks, confirming its effectiveness in visual mathematical problem solving. On the comprehensive MathVerse benchmark, MAVIS-7B achieves the highest overall accuracy among open-source models, outperforming both larger models and specialized mathematical MLLMs: it beats InternLM-XComposer2 (7B) by 11.0% and ShareGPT4V (13B) by 10.1%. In specific domains, MAVIS-7B excels on GeoQA for plane geometry with 66.7% accuracy and on FunctionQA with 40.3% accuracy, outperforming both traditional methods and other MLLMs. Qualitative analysis reveals MAVIS-7B's superior understanding of geometric elements, function curves, and coordinate axes, leading to higher-quality chain-of-thought reasoning compared to GPT-4V.
This study presents MAVIS, an efficient approach to visual math instruction tuning for MLLMs. The framework consists of two key components: high-quality datasets (MAVIS-Caption and MAVIS-Instruct) generated by a sophisticated data engine, and a three-stage training sequence that sequentially improves the math-specific vision encoder, enhances diagram-language alignment, and develops mathematical reasoning capabilities. The resulting specialized model, MAVIS-7B, demonstrates exceptional performance on several visual math benchmarks. MAVIS's innovative approach sets a new standard in visual math problem solving, paving the way for future advancements in this critical area of AI and educational technology.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Asjad is a consultant intern at Marktechpost. He is pursuing a Bachelor's degree in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is a Machine Learning and Deep Learning enthusiast who is always researching applications of machine learning in the healthcare domain.