Large Language Models (LLMs) are advancing rapidly and contributing to remarkable economic and social transformations. Among the many artificial intelligence (AI) tools released on the internet, one that has become extremely popular in recent months is ChatGPT. ChatGPT is a natural language processing model that generates meaningful, human-like text in response to user prompts. OpenAI's ChatGPT is based on the GPT transformer architecture, with GPT-4 being the latest language model that drives it.
With the latest developments in artificial intelligence and machine learning, computer vision has advanced rapidly, driven by improved network architectures and large-scale model training. Recently, researchers have introduced MM-REACT, a system paradigm that composes numerous vision experts with ChatGPT for multimodal reasoning and action. MM-REACT combines individual vision models with the language model in a flexible way to tackle complicated visual comprehension challenges.
MM-REACT has been developed with the goal of taking on a wide range of complex visual tasks that existing vision and vision-language models struggle with. For this, MM-REACT uses a prompt design that can represent various types of information, such as text descriptions and textualized spatial coordinates, with dense visual signals such as images and videos represented as inline filenames. This design allows ChatGPT to accept and process different types of information in combination with visual inputs, leading to a more accurate and complete understanding.
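To make the prompt design concrete, here is a minimal sketch of how such a mixed prompt might be assembled. The function name, field labels, and file path are illustrative assumptions, not the paper's exact format:

```python
# Hypothetical sketch of an MM-REACT-style prompt layout: visual content is
# referenced by an inline file path, and spatial information (e.g. bounding
# boxes) is textualized as coordinates alongside a plain-text description.
def build_prompt(image_path, caption, boxes):
    """Compose a text prompt mixing a file-path placeholder,
    a text description, and textualized spatial coordinates."""
    lines = [f"image: {image_path}"]        # inline filename stands in for the pixels
    lines.append(f"caption: {caption}")     # text description of the image
    for label, (x1, y1, x2, y2) in boxes:   # spatial info serialized as text
        lines.append(f"object: {label} at ({x1}, {y1}, {x2}, {y2})")
    return "\n".join(lines)

prompt = build_prompt(
    "/tmp/photo.jpg",
    "two people standing next to a car",
    [("person", (10, 20, 110, 220)), ("car", (130, 80, 400, 260))],
)
print(prompt)
```

Because everything is plain text, the language model can reason over the image reference and coordinates without ever receiving pixels directly.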
MM-REACT is a system that combines the capabilities of ChatGPT with a group of vision experts to add multimodal functionality. A file path is used as a placeholder and is entered into ChatGPT, allowing the system to accept images as input. Whenever the system requires specific image information, such as identifying a celebrity's name or box coordinates, ChatGPT seeks the help of a specific vision expert. The expert's output is then serialized as text and combined with the input to further trigger ChatGPT. If no external experts are needed, the answer is returned directly to the user.
ChatGPT is made aware of each vision expert's usage by adding statements to its prompts describing the expert's capability, input argument type, and output type, along with a few in-context examples for each expert. In addition, a special watchword is designated so that regular expression matching can invoke the corresponding expert.
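The invocation loop described above can be sketched as follows. The watchword phrasing, expert registry, and function names here are assumptions for illustration, not MM-REACT's actual API; the stand-in experts simply return canned strings:

```python
import re

# Hypothetical stand-in vision experts keyed by name; a real system would
# call captioning, OCR, or detection models here.
EXPERTS = {
    "celebrity recognition": lambda path: "Person: Alan Turing",
    "ocr": lambda path: "Text: E = mc^2",
}

# Assumed watchword pattern: the LLM emits "Assistant, <tool> on <file>"
# when it wants a vision expert; regex matching detects and parses it.
WATCHWORD = re.compile(r"Assistant, (?P<tool>[\w ]+) on (?P<arg>\S+)")

def step(llm_output):
    """If the LLM's output contains the watchword, invoke the matching
    expert and return its result as text; otherwise the output is final."""
    m = WATCHWORD.search(llm_output)
    if m is None:
        return ("final", llm_output)
    expert = EXPERTS[m.group("tool")]
    return ("observation", expert(m.group("arg")))

def run(user_prompt, llm):
    """Iterate: call the LLM, dispatch to an expert when the watchword
    appears, append the serialized expert output to the context, repeat."""
    context = user_prompt
    while True:
        kind, text = step(llm(context))
        if kind == "final":
            return text
        context += "\n" + text  # expert output re-triggers the LLM
```

The key design point this sketch mirrors is that expert outputs are folded back into the text context, so each round of reasoning can build on earlier observations.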
Zero-shot experiments have shown that MM-REACT effectively addresses its particular capabilities of interest, proving effective on a wide range of advanced visual tasks that require complex visual understanding. The authors share examples where MM-REACT solves linear equations displayed in an image, and demonstrates concept understanding by naming products in an image along with their ingredients. In conclusion, this system paradigm combines language and vision expertise and is capable of achieving advanced visual intelligence.
Check out the Paper, Project, and GitHub. All credit for this research goes to the researchers of this project.
Tanya Malhotra is a final-year student at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with strong analytical and critical-thinking skills, along with a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.