The search for general-purpose AI systems has driven the development of end-to-end trainable models, many of which aim to give users a simple natural language interface for interacting with the model. Large-scale unsupervised pretraining followed by supervised multitask training has been the most common recipe for building such systems. Ultimately, the goal is for these systems to scale to the indefinitely long tail of difficult tasks, but this strategy requires a carefully curated dataset for each task. In this work, the researchers study the use of large language models to handle that long tail of complex tasks by decomposing tasks described in natural language into simpler steps that specialized end-to-end trained models or other programs can handle.
Consider telling a computer vision system, “Label the seven main characters from the Big Bang Theory TV show in this image.” The system must first understand the intent of the instruction and then carry out a series of steps: detect faces, retrieve the list of Big Bang Theory main characters from a knowledge base, classify the faces using that character list, and tag the image with the names of the recognized faces. While different vision and language systems can each perform one of these steps, executing the full task from a natural language instruction is beyond the reach of end-to-end trained systems.
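To make the decomposition concrete, the four steps above can be written as the kind of short, line-per-module program VISPROG is designed to generate. The sketch below is an approximation: the module names (FACEDET, LIST, CLASSIFY, TAG, RESULT) follow the paper’s figures, but the exact syntax may differ from the released implementation.

```python
# Approximate visual program for the instruction above. Each line assigns
# the output of one module call to a variable that later lines can reference.
PROGRAM = """
OBJ0=FACEDET(image=IMAGE)
LIST0=LIST(query='main characters of The Big Bang Theory',max=7)
OBJ1=CLASSIFY(image=IMAGE,object=OBJ0,categories=LIST0)
IMAGE0=TAG(image=IMAGE,object=OBJ1)
FINAL_RESULT=RESULT(var=IMAGE0)
"""
```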
Researchers at the Allen Institute for AI propose VISPROG, a system that takes as input visual information (a single image or a collection of images) and a natural language command, generates a sequence of instructions, or “visual program,” and then executes these statements to produce the required result. Each line of a visual program invokes one of the many modules the system currently supports. Modules can be pre-built computer vision models, pre-built language models, OpenCV image-processing subroutines, or arithmetic and logical operators. Each module consumes inputs produced by executing previous lines of code and emits intermediate output that later lines can use.
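A minimal sketch of how such an interpreter might work is shown below, assuming each module is an ordinary Python callable registered by name. Everything here (the registry, the line grammar, the toy modules) is illustrative, not the actual VISPROG code.

```python
import re

# Toy module registry: each "module" is a plain Python callable. In VISPROG,
# modules wrap trained models (face detectors, CLIP, GPT-3), OpenCV routines,
# or simple arithmetic/logical operators.
MODULES = {
    "LIST": lambda query, max: [f"{query} #{i}" for i in range(int(max))],
    "RESULT": lambda var: var,
}

LINE = re.compile(r"(\w+)=(\w+)\((.*)\)")

def execute(program: str) -> object:
    """Run a visual program line by line, threading intermediate outputs."""
    state = {}
    for line in filter(None, map(str.strip, program.splitlines())):
        target, module, arg_str = LINE.match(line).groups()
        kwargs = {}
        for arg in arg_str.split(","):
            key, value = arg.split("=", 1)
            # Each argument is either a reference to an earlier output
            # (looked up in `state`) or a literal.
            kwargs[key] = state.get(value, value.strip("'"))
        state[target] = MODULES[module](**kwargs)  # call module, keep output
    return state["FINAL_RESULT"]

# Tiny demo program that uses only the toy modules defined above.
demo = "LIST0=LIST(query='character',max=3)\nFINAL_RESULT=RESULT(var=LIST0)"
print(execute(demo))  # ['character #0', 'character #1', 'character #2']
```

Because every intermediate value lands in `state` under its variable name, any step’s output can be inspected after execution, which is the property that makes the system’s predictions auditable.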
In the example above, the visual program created by VISPROG uses a face detector, GPT-3 as a knowledge-retrieval system, and CLIP as an open-vocabulary image classifier to produce the required output (see Fig. 1). VISPROG improves on prior work in both the generation and the execution of programs for vision applications. Neural Module Networks (NMNs) compose specialized, differentiable neural modules into a question-specific, end-to-end trainable network for the visual question answering (VQA) problem. Those methods either train a layout generator with weak answer supervision via REINFORCE or rely on off-the-shelf, brittle semantic parsers to deterministically generate the module layout.
By contrast, VISPROG lets users create complicated programs without any task-specific training, using a powerful language model (GPT-3) and a small number of in-context examples. VISPROG programs are also more abstract than NMNs: they invoke state-of-the-art trained models and non-neural Python subroutines at a higher level of abstraction. These advantages make VISPROG a fast, efficient, and versatile neuro-symbolic system. VISPROG is also highly interpretable. First, it produces easy-to-understand programs whose logical correctness the user can verify. Second, by breaking the prediction into manageable steps, VISPROG lets the user inspect the outputs of intermediate stages to diagnose failures and, if necessary, correct the logic.
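In practice, “a powerful language model plus a few in-context examples” can be as simple as a few-shot completion prompt. The sketch below assumes the legacy `openai` Python SDK and an illustrative prompt format with a hypothetical DETECT module; VISPROG’s actual prompts and module set are documented in the paper and its repository.

```python
import openai  # legacy SDK (<1.0); assumes OPENAI_API_KEY is set

# Hypothetical few-shot prompt: each in-context example pairs a natural
# language instruction with the visual program that solves it.
IN_CONTEXT_EXAMPLES = """\
Instruction: Tag the dogs in this image.
Program:
OBJ0=DETECT(image=IMAGE,object='dog')
IMAGE0=TAG(image=IMAGE,object=OBJ0)
FINAL_RESULT=RESULT(var=IMAGE0)
"""

def generate_program(instruction: str) -> str:
    """Ask GPT-3 to continue the pattern set by the in-context examples."""
    prompt = f"{IN_CONTEXT_EXAMPLES}\nInstruction: {instruction}\nProgram:\n"
    response = openai.Completion.create(
        model="text-davinci-003",  # a GPT-3 model
        prompt=prompt,
        max_tokens=256,
        temperature=0,             # deterministic decoding suits programs
        stop=["Instruction:"],     # stop before inventing a new example
    )
    return response["choices"][0]["text"].strip()
```

No weights are updated anywhere: adapting to a new task means swapping in a different handful of instruction-program example pairs.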
A complete program, with the outputs of intermediate steps (text, bounding boxes, segmentation masks, generated images, and so on) wired together to show the flow of information, serves as a visual rationale for the prediction. The researchers apply VISPROG to four different tasks to demonstrate its versatility. These tasks share common skills (such as image parsing) but also demand specialized reasoning and visual manipulation abilities:
- Compositional visual question answering.
- Zero-shot NLVR (natural language visual reasoning) on image pairs.
- Factual-knowledge object tagging from natural language instructions.
- Language-guided image editing.
They point out that neither the modules nor the language model is altered in any way. A few in-context examples, each pairing a natural language command with the appropriate program, are all that is needed to adapt VISPROG to a new task. VISPROG is easy to use and delivers strong results: a 2.7-point gain over a base VQA model on the compositional VQA task, promising zero-shot accuracy of 62.4% on NLVR, and convincing qualitative and quantitative results on the image-editing and knowledge-tagging tasks.
Check out the Paper, GitHub, and project page.
Aneesh Tickoo is a consulting intern at MarktechPost. She is currently pursuing her bachelor’s degree in Information Science and Artificialial Intelligence at the Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects aimed at harnessing the power of machine learning. Her research interest is image processing, and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.