Large language models (LLMs) have demonstrated the ability to generate general-purpose computer programs, which suggests some understanding of program structure. However, it is difficult to determine the true capabilities of LLMs, especially on tasks they did not encounter during training. A critical question is whether LLMs can truly “understand” symbolic graphics programs, which generate visual content when executed. The researchers define this understanding as the ability to grasp the semantic content of the rendered image based solely on the raw text of the program. In practice, this means answering questions about the content of the image without ever seeing it, a task that is easy with visual input but much more difficult when relying solely on the program text.
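To make the task concrete, here is a minimal, hypothetical illustration in Python: the model receives only the text of a symbolic graphics program (an SVG snippet in this case) plus a semantic question about the image it would render, and never sees the rendered pixels. The specific SVG program, question, and prompt wording are illustrative assumptions, not examples from the benchmark itself.

```python
# A hypothetical SVG program: a red circle with a white horizontal bar,
# which renders as something resembling a "no entry" road sign.
svg_program = """<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
  <circle cx="50" cy="50" r="40" fill="red"/>
  <rect x="20" y="45" width="60" height="10" fill="white"/>
</svg>"""

question = "What everyday object does the rendered image most resemble?"

# The LLM is asked to answer from the program text alone, without rendering.
prompt = (
    "Below is a symbolic graphics program. Answer the question about the "
    "image it produces when rendered, without rendering it.\n\n"
    f"{svg_program}\n\nQuestion: {question}"
)
print(prompt)
```

A human (or a vision-language model) shown the rendered image would answer this question easily; the challenge studied here is whether an LLM can do so from the symbolic program alone.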
Existing research on symbolic graphics programs has focused primarily on procedural modeling of 2D shapes and 3D geometry. Formats such as constructive solid geometry (CSG), computer-aided design (CAD), and scalable vector graphics (SVG) provide a clear and interpretable representation of visual content. LLMs have also been applied to various programming tasks such as code retrieval, automated testing, and code generation; however, understanding symbolic graphics programs is a largely different problem, because their semantic meaning is defined visually. Existing benchmarks for LLMs focus on comprehension of non-graphical programs, while vision-language models are evaluated on multimodal datasets for tasks such as image captioning and visual question answering.
Researchers from the Max Planck Institute for Intelligent Systems, Tübingen, the University of Cambridge, and MIT have proposed a new approach to assess and improve LLMs’ understanding of symbolic graphics programs. They present SGP-Bench, a benchmark that measures LLMs’ semantic understanding and consistency when interpreting SVG (2D vector graphics) and CAD (2D/3D objects) programs. In addition, they develop a new fine-tuning method, symbolic instruction fine-tuning (SIT), based on a collected instruction dataset, to improve performance. They also create a symbolic MNIST dataset that reveals a striking gap between how LLMs and humans understand symbolic graphics programs.
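As a rough sketch of what a symbolic instruction fine-tuning example might look like, the snippet below pairs a symbolic graphics program and a semantic question as the instruction, with the answer as the target. The field names, prompt wording, and helper function are illustrative assumptions for a generic instruction-tuning format, not the paper’s exact data schema.

```python
# Hypothetical sketch: building one SIT-style training example.
def build_sit_example(program_text: str, question: str, answer: str) -> dict:
    return {
        "instruction": (
            "You are given a symbolic graphics program. Answer the question "
            "about the image it renders.\n\n"
            f"Program:\n{program_text}\n\nQuestion: {question}"
        ),
        "output": answer,
    }

example = build_sit_example(
    program_text='<svg ...><circle cx="50" cy="50" r="40" fill="red"/></svg>',
    question="What color is the shape in the image?",
    answer="Red",
)
print(example["instruction"])
```

A dataset of such (instruction, output) pairs could then be fed to any standard supervised fine-tuning pipeline.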
The benchmark for assessing LLMs’ understanding of symbolic graphics programs is built through a scalable and efficient process. A powerful vision-language model (GPT-4o) generates semantic questions based on rendered images of the symbolic programs, and human annotators then verify the quality and accuracy of these automatically generated question-answer pairs. This approach greatly reduces the manual effort required compared with traditional dataset creation. The process is straightforward for SVG and 2D CAD programs, which directly produce 2D images; for 3D CAD programs, the 3D models are first rendered into 2D images from multiple fixed camera positions.
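A simplified sketch of this construction pipeline is given below. The helper functions render_to_images, query_vision_language_model, and approved_by_annotator are hypothetical placeholders standing in for a renderer, a GPT-4o-style API call, and the human verification step; they are assumptions made for illustration, not real library functions.

```python
# Hypothetical sketch of the question-generation pipeline described above.
def build_question_answer_pairs(symbolic_program: str, program_type: str) -> list[dict]:
    # SVG and 2D CAD programs render directly to a 2D image;
    # 3D CAD models are rendered from several fixed camera positions.
    if program_type == "cad3d":
        views = render_to_images(symbolic_program,
                                 cameras=["front", "side", "top", "iso"])
    else:
        views = render_to_images(symbolic_program)

    # A vision-language model writes semantic question-answer pairs from the
    # rendered image(s); note that it sees pixels, not the program text.
    qa_pairs = query_vision_language_model(
        images=views,
        instruction=("Write multiple-choice questions about the semantic "
                     "content of this image, marking the correct answer."),
    )

    # Human annotators verify each pair before it enters the benchmark.
    return [qa for qa in qa_pairs if approved_by_annotator(qa)]
```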
LLMs’ understanding of symbolic graphics programs is further assessed on the SGP-MNIST dataset, which consists of 1,000 SVG programs representing MNIST-like digit images, with 100 programs per digit (0–9). While humans can easily recognize the rendered digits, LLMs find it extremely difficult to interpret the symbolic programs: even the state-of-the-art GPT-4o performs barely better than random guessing. This stark contrast between human and LLM performance highlights a significant gap in how machines and humans process symbolic representations of visual information.
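The snippet below is a minimal, self-contained sketch of how accuracy on an SGP-MNIST-style evaluation compares against the random baseline of 10%. The predict_digit function is a hypothetical stub standing in for “ask the LLM which digit this SVG program renders”; here it guesses uniformly at random, which is roughly the level GPT-4o reportedly reached on this task.

```python
import random

NUM_DIGITS = 10           # digits 0-9
PROGRAMS_PER_DIGIT = 100  # 100 SVG programs per digit, 1,000 in total

def predict_digit(svg_program: str) -> int:
    # Placeholder for querying an LLM with the SVG program text.
    return random.randint(0, 9)

correct = 0
total = NUM_DIGITS * PROGRAMS_PER_DIGIT
for digit in range(NUM_DIGITS):
    for i in range(PROGRAMS_PER_DIGIT):
        svg_program = f"<svg><!-- program {i} rendering digit {digit} --></svg>"
        if predict_digit(svg_program) == digit:
            correct += 1

accuracy = correct / total
print(f"Accuracy: {accuracy:.1%} (random baseline: {1 / NUM_DIGITS:.0%})")
```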
In conclusion, the researchers present a new way to evaluate LLMs by assessing their ability to understand images directly from their symbolic graphics programs, without visual input. They created SGP-Bench, a benchmark that effectively measures LLM performance on this task, and introduced symbolic instruction fine-tuning (SIT) to improve the ability of LLMs to interpret graphics programs. This research provides a clearer picture of LLMs’ capabilities and encourages the creation of more diverse evaluation tasks. Future research directions include investigating how LLMs acquire semantic understanding in this setting and developing more advanced methods to improve their performance on these tasks.
Take a look at the Paper. All credit for this research goes to the researchers of this project.
Sajjad Ansari is a final-year student at IIT Kharagpur. As a technology enthusiast, he delves into practical applications of AI, focusing on the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.