Earlier work on Neural Module Networks aimed to decompose complex tasks into simpler modules. The hope was that, by training end to end with modules reconfigured across different problems, each module would learn its intended function and become reusable. In practice, however, this strategy proved difficult to apply: program generation either had to be learned from scratch with reinforcement learning or relied on hand-tuned natural-language parsers, both of which are hard to optimize, and in either case program generation was heavily restricted to a narrow domain. Learning the perceptual models jointly with the program generator made training even harder and often failed to produce the desired modular structure.
As an example, consider the question "How many cupcakes can each child eat for it to be fair?" (see Figure 1 above). To answer it, you find the children and the cupcakes in the picture, count how many of each there are, and then reason that "fair" implies an equal split, so you divide. People naturally compose a sequence of steps like this to understand the visual world. Yet end-to-end models, which perform no such explicit compositional reasoning, remain the dominant approach in machine vision. While the field has made significant progress on specific tasks such as object detection and depth estimation, end-to-end methods for complex tasks must implicitly learn to perform every sub-task within a single forward pass of a neural network.
This fails to take advantage of progress on the underlying vision tasks, and it ignores the fact that computers can perform mathematical operations (such as division) perfectly without any machine learning. End-to-end models also cannot be relied on to generalize systematically, say, to different numbers of cupcakes or children. And they produce fundamentally opaque decisions: there is no way to inspect the outcome of each step to localize a failure. As models grow ever more data- and compute-hungry, this approach becomes progressively impractical. Ideally, one would recombine existing models in novel ways to solve new tasks without any additional training. Why can't we build similarly modular solutions for harder tasks?
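To make that point concrete, here is a minimal sketch of the arithmetic step, with hypothetical counts standing in for detector outputs on Figure 1:

```python
# The "fair split" step needs no learning: plain integer division
# generalizes to any counts. The inputs here are hypothetical, as if
# produced by an object detector on Figure 1.
def fair_share(num_cupcakes: int, num_children: int) -> int:
    return num_cupcakes // num_children

print(fair_share(8, 4))   # 2
print(fair_share(12, 3))  # 4 -- any counts work, no retraining needed
```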
In this study, Columbia University researchers present ViperGPT, a framework that circumvents these limitations by using large code-generating language models (such as GPT-3 Codex) to flexibly compose vision models for any textual query that specifies the task. For each query, it generates a tailored program that accepts an image or video as an argument and returns the answer to that query. The authors show that generating these programs only requires providing Codex with an API exposing various visual capabilities (such as object localization and depth computation), much as one might hand to an engineer. Thanks to its prior training on code, the model can reason about how to use these functions and compose the necessary logic.
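As an illustration of what such a generated program can look like, here is a self-contained sketch for the cupcake question. The `ImagePatch` class and its `find` method stand in for the kind of API described in the paper; the detector is stubbed with hypothetical detections so the example runs:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ImagePatch:
    """Stub for an API class exposing visual functions; a real version
    would wrap an open-vocabulary object detector."""
    image: object

    def find(self, object_name: str) -> List["ImagePatch"]:
        # Hypothetical detections: one patch per detected instance.
        fake_counts = {"child": 4, "cupcake": 8}
        return [ImagePatch(self.image)] * fake_counts.get(object_name, 0)

# The kind of program a code LLM might emit for:
# "How many cupcakes can each child eat for it to be fair?"
def execute_command(image) -> str:
    patch = ImagePatch(image)
    children = patch.find("child")
    cupcakes = patch.find("cupcake")
    return str(len(cupcakes) // len(children))

print(execute_command(image=None))  # "2"
```

Because each step is an explicit call, the intermediate lists `children` and `cupcakes` can be inspected directly, which is what makes failures easy to localize.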
Their findings show that this simple approach achieves exceptional zero-shot performance (i.e., without any training on task-specific images). The method has several notable advantages:
- Interpretable: every step is an explicit function call in the code, with intermediate values that can be inspected.
- Logical: it directly uses the logical and mathematical operations built into Python.
- Flexible: any vision or language module can be incorporated simply by adding its definition to the API.
- Compositional: it decomposes tasks into smaller sub-tasks that are solved step by step.
- Adaptable to advances in the field: improvements in any of the underlying modules directly improve the method's performance.
- Training-free: it requires no retraining (or fine-tuning) of a new model for every new task.
- General: it unifies all of these tasks in a single system.
Their contributions are the following:
- Leveraging the benefits listed above, they propose a simple framework for solving complex visual queries by connecting code-generation models to vision modules through an API and the Python interpreter.
- They achieve state-of-the-art zero-shot results on visual grounding, image question answering, and video question answering tasks, demonstrating that this interpretability improves rather than hinders performance.
- To encourage research in this area, they provide a Python library enabling rapid development of programs for visual tasks, which will be open-sourced after publication.
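To illustrate how the pieces fit together, below is a hypothetical end-to-end sketch of the generate-then-execute pipeline. The function names are illustrative assumptions, not the released library's actual API, and it reuses the `ImagePatch` stub from the earlier sketch:

```python
API_PROMPT = '''class ImagePatch:
    def find(self, object_name: str) -> list: ...  # detect instances of object_name
'''  # the API specification shown to the code LLM

def code_llm(prompt: str) -> str:
    """Stub standing in for a code-generation model such as Codex."""
    return (
        "def execute_command(image):\n"
        "    patch = ImagePatch(image)\n"
        "    return str(len(patch.find('cupcake')) // len(patch.find('child')))\n"
    )

def answer(image, query: str) -> str:
    # 1. Prompt the code model with the API plus the user's query.
    program = code_llm(API_PROMPT + f'\n# Query: "{query}"\n')
    # 2. Execute the generated program with the Python interpreter,
    #    exposing the vision modules it is allowed to call.
    namespace = {"ImagePatch": ImagePatch}  # stub from the earlier sketch
    exec(program, namespace)
    return namespace["execute_command"](image)

print(answer(None, "How many cupcakes can each child eat for it to be fair?"))  # "2"
```

In the real system, the generated program would of course call genuine pretrained vision models rather than stubs.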