Recent Large Language Models (LLMs) have made remarkable progress on various NLP tasks, with notable examples including GPT-3, PaLM, LLaMA, ChatGPT, and the more recently proposed GPT-4. These models hold enormous promise for human-like planning and decision-making, as they can solve various tasks in zero-shot settings or with the help of a few examples. LLMs display emergent abilities, including in-context learning, mathematical reasoning, and commonsense reasoning. However, LLMs have built-in limitations, such as the inability to use external tools, access up-to-date information, or perform precise mathematical reasoning.
An ongoing area of research focuses on augmenting language models with access to external tools and resources, and on investigating modular, plug-and-play strategies to address these limitations of LLMs. Recent work uses LLMs to build complex programs that solve logical reasoning problems more efficiently and to leverage robust computing resources to improve mathematical reasoning skills. For example, with the help of external knowledge sources and web search engines, LLMs can acquire information in real time and draw on domain-specific knowledge. Other current lines of research, including ViperGPT, Visual ChatGPT, VisProg, and HuggingGPT, integrate various foundational computer vision models to give LLMs the skills needed to handle visual reasoning problems.
Despite substantial advances, today’s tool-augmented LLMs still encounter significant hurdles when responding to real-world queries. Most current techniques are restricted to a limited set of tools or rely on tools tailored to a given domain, making it difficult to generalize to different queries. Figure 1 illustrates this with the question “What is the main persuasive appeal used in this ad?”: 1) infer that the advertising image contains text and call a text decoder to understand its semantics; 2) retrieve background knowledge explaining what a “persuasive appeal” is and how its different types differ; 3) generate a solution using hints in the input question and the intermediate results of the previous steps; and 4) finally, present the answer in a task-specific format.
On the other hand, answering the question “Which animal’s skin is adapted to survive in cold places?” may require calling additional modules, such as an image captioner to analyze image information and a web search engine to collect domain knowledge for understanding scientific terminology. To solve these problems, researchers from UCLA and Microsoft Research propose Chameleon, a plug-and-play compositional reasoning framework that uses large language models to synthesize programs composing various tools to answer a wide range of questions.
At its core, Chameleon is an LLM-based natural-language planner. Unlike conventional approaches, it composes various tools, such as LLMs, off-the-shelf computer vision models, web search engines, Python functions, and rule-based modules designed for particular purposes. Chameleon builds these programs using the in-context learning capabilities of LLMs and requires no training. Prompted with descriptions of each tool and examples of tool usage, the planner can infer the correct sequence of tools to compose and execute in order to produce the final answer to a user query.
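The in-context planning step described above can be sketched as follows. This is a minimal, illustrative sketch, not Chameleon's actual implementation: the tool names, descriptions, few-shot example, and the `call_llm` callback are all hypothetical stand-ins (the real prompts live in the project's GitHub codebase).

```python
# Hypothetical tool descriptions and a worked example, used to build the
# in-context prompt; these are illustrative, not Chameleon's real prompts.
TOOL_DESCRIPTIONS = {
    "Image_Captioner": "Generates a caption describing the input image.",
    "Web_Search": "Retrieves domain knowledge, e.g. scientific terminology.",
    "Solution_Generator": "Produces a step-by-step solution from the context.",
    "Answer_Generator": "Extracts the final answer in a task-specific form.",
}

FEW_SHOT_EXAMPLES = [
    ("Which animal's skin is adapted to survive in cold places?",
     ["Image_Captioner", "Web_Search", "Solution_Generator", "Answer_Generator"]),
]

def build_planner_prompt(query: str) -> str:
    """Assemble the planning prompt: tool descriptions, worked
    examples of tool sequences, then the new query."""
    lines = ["You can compose the following modules:"]
    for name, desc in TOOL_DESCRIPTIONS.items():
        lines.append(f"- {name}: {desc}")
    for question, plan_seq in FEW_SHOT_EXAMPLES:
        lines.append(f"Question: {question}")
        lines.append("Modules: " + ", ".join(plan_seq))
    lines.append(f"Question: {query}")
    lines.append("Modules:")
    return "\n".join(lines)

def plan(query: str, call_llm) -> list:
    """Ask the LLM (via a caller-supplied `call_llm` function) for a
    comma-separated module sequence; no training is involved."""
    reply = call_llm(build_planner_prompt(query))
    return [m.strip() for m in reply.split(",") if m.strip()]
```

Because the planner is just a prompt over tool descriptions and examples, adding a new tool amounts to adding one description and, optionally, one demonstration.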
Chameleon generates programs that resemble natural language, unlike previous efforts that generated domain-specific programs. Such programs are less error-prone, simpler to debug, easier to use for people with little programming experience, and extensible with new modules. Each module in the program processes the query and cached context, returns a result determined by that module, and updates the query and cached context for subsequent module executions. By composing modules as a sequential program, the updated query and previously cached context are available during the execution of subsequent modules. The authors demonstrate the flexibility and power of Chameleon on two tasks, ScienceQA and TabMWP.
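The sequential execution with a shared cache described above can be sketched as follows. This is a toy illustration under assumed module behavior: the module implementations and cache keys are invented stand-ins, not Chameleon's real components.

```python
# Toy modules: each takes the query and the shared context cache,
# stores its result in the cache, and returns the updated cache.

def image_captioner(query, cache):
    # A real module would run a captioning model on the input image.
    cache["caption"] = "A seal lying on arctic ice."
    return cache

def web_search(query, cache):
    # A real module would query a web search engine.
    cache["knowledge"] = "Blubber and dense fur insulate animals against cold."
    return cache

def answer_generator(query, cache):
    # Combines everything cached by earlier modules into a final answer.
    cache["answer"] = f"{cache.get('caption', '')} {cache.get('knowledge', '')}".strip()
    return cache

MODULES = {
    "Image_Captioner": image_captioner,
    "Web_Search": web_search,
    "Answer_Generator": answer_generator,
}

def execute(plan_seq, query):
    """Run the planned modules in order over one shared context cache,
    so each module sees the results of those before it."""
    cache = {}
    for name in plan_seq:
        cache = MODULES[name](query, cache)
    return cache.get("answer")
```

Because every module shares the same interface (query and cache in, cache out), swapping or inserting a module does not require changing the others.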
TabMWP is a mathematical reasoning benchmark with diverse tabular contexts, while ScienceQA is a multimodal question-answering benchmark spanning many context formats and science topics. These two benchmarks test how effectively Chameleon coordinates multiple tools across types and domains. Notably, Chameleon with GPT-4 achieves 86.54% accuracy on ScienceQA, outperforming the best-reported few-shot model by 11.37%. On TabMWP, with GPT-4 as the underlying LLM, Chameleon offers a 7.97% improvement over chain-of-thought (CoT) GPT-4 and a 17.8% increase over the previous state of the art, reaching an overall accuracy of 98.78%.
Further analysis suggests that, compared with earlier LLMs such as ChatGPT, using GPT-4 as the planner yields more consistent and logical tool selection and can infer likely constraints from the instructions. In brief, their contributions are as follows: (1) They develop Chameleon, a plug-and-play compositional reasoning framework, to address the inherent limitations of large language models and take on various reasoning tasks. (2) They effectively combine various technologies, including LLMs, off-the-shelf vision models, web search engines, Python functions, and rule-based modules, into a flexible and adaptable AI system that answers real-world queries. (3) They significantly advance the state of the art, demonstrating the flexibility and effectiveness of the framework on two benchmarks, ScienceQA and TabMWP. The codebase is publicly available on GitHub.
Aneesh Tickoo is a consulting intern at MarktechPost. She is currently pursuing her bachelor’s degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects aimed at harnessing the power of machine learning. Her research interest is image processing, and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.