The field of Artificial Intelligence (AI) advances with each new model and solution. Large Language Models (LLMs), which have recently become enormously popular thanks to their remarkable capabilities, are a major driver of this rise. The sub-domains of AI, whether Natural Language Processing, Natural Language Understanding, or Computer Vision, are all progressing rapidly. One research area that has recently drawn considerable interest from the AI and deep learning communities is visual question answering (VQA): the task of answering open-ended, text-based questions about an image.
Visual Question Answering systems attempt to answer natural language questions about an input image; they are designed to understand the content of an image much as a human would and to communicate their findings effectively. Recently, a team of researchers from UC Berkeley and Google Research proposed an approach called CodeVQA that tackles visual question answering by generating modular code. CodeVQA formulates VQA as a program synthesis problem and uses code-writing language models that take questions as input and generate code as output.
The main goal of this framework is to generate Python programs that call pretrained visual models and combine their outputs to produce an answer. The generated programs manipulate the visual models' results and derive a solution using arithmetic and conditional logic. Unlike previous approaches, the framework relies only on a pretrained language model, pretrained visual models built from image-caption pairs, and a small number of VQA examples used for in-context learning.
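To make this concrete, below is a minimal sketch of the kind of program such a framework might generate for the question "Are there more people on the left side of the image than on the right?". The primitive name `find_object`, its signature, and the box format are illustrative assumptions supplied by the execution environment, not the exact API from the paper.

```python
# Hypothetical program that a code-generating LLM might emit for the question
# "Are there more people on the left side of the image than on the right?".
# `find_object` is an assumed primitive returning pixel-space bounding boxes
# from a pretrained detector; its name and signature are illustrative.

def answer_question(image):
    boxes = find_object(image, "person")  # one box per detected person
    midpoint = image.width / 2

    # Arithmetic over detection results: count people on each side using the
    # horizontal center of each bounding box.
    left = sum(1 for b in boxes if (b["xmin"] + b["xmax"]) / 2 < midpoint)
    right = len(boxes) - left

    # Conditional logic turns the counts into a final answer.
    return "yes" if left > right else "no"
```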
To extract specific visual information from the image, such as captions, pixel locations of objects, or image-text similarity scores, CodeVQA exposes primitive visual APIs that wrap visual-language models. The generated code orchestrates these APIs to collect the necessary data and then uses the full expressiveness of Python (arithmetic, conditionals, loops, and other programming constructs) to reason over that data and arrive at a solution.
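As a rough illustration of such primitives, the sketch below wraps off-the-shelf pretrained models behind `get_caption` and `find_object` functions that generated programs could call. The specific model choices (a BLIP-style captioner and an OWL-ViT open-vocabulary detector via Hugging Face pipelines) are assumptions for illustration, not necessarily the models used in CodeVQA.

```python
# Minimal sketch of visual primitives wrapping pretrained models.
# Model choices below are illustrative assumptions, not CodeVQA's actual stack.
from transformers import pipeline
from PIL import Image

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")

def get_caption(image: Image.Image) -> str:
    """Return a natural-language caption for the image."""
    return captioner(image)[0]["generated_text"]

def find_object(image: Image.Image, label: str, threshold: float = 0.1) -> list:
    """Return pixel-space bounding boxes for objects matching `label`."""
    detections = detector(image, candidate_labels=[label])
    return [d["box"] for d in detections if d["score"] >= threshold]
```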
For evaluation, the team compared the new technique against a few-shot baseline that does not use code generation. Two benchmark datasets were used: GQA, which contains multi-hop questions generated from human-annotated scene graphs of individual Visual Genome images, and COVR, which contains multi-hop questions over sets of images drawn from Visual Genome and imSitu. CodeVQA outperformed the baseline on both datasets, improving accuracy by at least 3% on COVR and by roughly 2% on GQA.
The team notes that CodeVQA is easy to deploy because it requires no additional training: it relies on pretrained models and a small set of VQA examples for in-context learning, which helps tailor the generated programs to particular question-and-answer patterns. In short, the framework leverages the strengths of pretrained LLMs and visual models to provide a modular, code-based approach to VQA.
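As a rough illustration of this in-context setup, the sketch below builds a few-shot prompt from a handful of (question, program) pairs and asks a code-writing language model to produce a new program. The example pairs and the `complete_with_code_llm` call are placeholders, not part of any specific library or the paper's released code.

```python
# Sketch of few-shot prompt construction for code generation (illustrative).
# The example (question, program) pairs and `complete_with_code_llm` are
# placeholders; any code-writing LLM could fill that role.

FEW_SHOT_EXAMPLES = [
    (
        "Is the cup to the left of the laptop?",
        'def answer_question(image):\n'
        '    cup = find_object(image, "cup")[0]\n'
        '    laptop = find_object(image, "laptop")[0]\n'
        '    return "yes" if cup["xmax"] < laptop["xmin"] else "no"',
    ),
    # ... a handful more examples covering other question patterns
]

def build_prompt(question: str) -> str:
    """Concatenate example question/program pairs, then pose the new question."""
    parts = []
    for q, program in FEW_SHOT_EXAMPLES:
        parts.append(f"# Question: {q}\n{program}\n")
    parts.append(f"# Question: {question}\ndef answer_question(image):")
    return "\n".join(parts)

def generate_program(question: str) -> str:
    prompt = build_prompt(question)
    # `complete_with_code_llm` stands in for a call to a code LLM
    # (e.g., an API completion endpoint); it is an assumed helper.
    return complete_with_code_llm(prompt)
```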
Check out the Paper and GitHub link for more details.