Imagine having a digital assistant that can not only answer your questions but also navigate the web, solve complex math problems, write code, and even reason about images and text-based games. Sounds too good to be true? Well, get ready, because the future of artificial intelligence just became much more accessible and transparent with the introduction of LUMOS.
In a groundbreaking development, researchers at the Allen Institute for AI, UCLA, and the University of Washington have introduced LUMOS, an open-source framework that promises to revolutionize the way we interact with language agents. Unlike existing closed-source solutions that often seem like black boxes, LUMOS offers an unprecedented level of affordability, transparency, and reproducibility, making it a game-changer in the world of AI.
But what exactly is LUMOS, and why is it causing such a stir in the AI community? Buckle up, because we're about to delve into the nitty-gritty details of this remarkable innovation, exploring how it works, what it can do, and why it matters more than you might think.
Current language agents often rely on large closed-source language models such as GPT-4 or ChatGPT as their core component. While powerful, these models are expensive, lack transparency, and offer limited reproducibility and controllability.
The LUMOS framework takes a different approach by using open-source large language models (LLMs) as its base models. It employs a unified, modular architecture consisting of three key components: a planning module, a grounding module, and an execution module.
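To make the division of labor concrete, here is a minimal Python sketch of how the three modules could fit together. The class and method names are illustrative assumptions for this article, not LUMOS's actual API; in the real framework, the planning and grounding modules are fine-tuned open-source LLMs.

```python
from dataclasses import dataclass

# Hypothetical interfaces for the three LUMOS modules.
# Names and signatures are assumptions, not the project's real API.

@dataclass
class PlanningModule:
    llm: object  # an open-source LLM fine-tuned to produce subgoals

    def next_subgoal(self, task: str, history: list[str]) -> str:
        """Produce the next high-level subgoal in natural language."""
        prompt = f"Task: {task}\nCompleted so far: {history}\nNext subgoal:"
        return self.llm.generate(prompt)

@dataclass
class GroundingModule:
    llm: object  # an open-source LLM fine-tuned to emit actions

    def ground(self, subgoal: str) -> str:
        """Translate a high-level subgoal into a low-level action string."""
        prompt = f"Subgoal: {subgoal}\nAction:"
        return self.llm.generate(prompt)

class ExecutionModule:
    def __init__(self, tools: dict):
        self.tools = tools  # e.g. {"VQA": vqa_tool, "QA": qa_tool}

    def execute(self, action: str) -> str:
        """Dispatch an action string like 'VQA(...)' to the matching tool."""
        name, _, arg = action.partition("(")
        return self.tools[name](arg.rstrip(")"))
```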
The planning module decomposes complex tasks into a sequence of high-level subgoals expressed in natural language. For example, for a multimodal question like “What country is the device in your hand from?”, the planning module could generate two subgoals: “Identify the brand of the device” and “Answer the country of the brand of the device.”
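For that example, the planning step might look like the sketch below. The prompt template is an assumption for illustration; the two subgoals are the ones given above.

```python
# Hypothetical planning step for the multimodal example.
# The prompt format is an assumption for illustration only.
task = "What country is the device in your hand from?"

plan_prompt = (
    "Decompose the task into high-level subgoals.\n"
    f"Task: {task}\n"
    "Subgoals:"
)

# A fine-tuned planner could return the two subgoals from the example:
subgoals = [
    "Identify the brand of the device",
    "Answer the country of the brand of the device",
]
```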
The grounding module then translates these high-level subgoals into low-level executable actions that can be carried out by the tools in the execution module. For example, the first subgoal could be grounded into an action like “VQA(What is the brand…?)”, which identifies the brand of the device from the image using a visual question answering tool.
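Since the grounding module emits actions as plain strings, the execution side has to parse them before dispatching to a tool. Here is one plausible parser for the “Tool(argument)” convention shown above; the exact action grammar LUMOS uses may differ.

```python
import re

def parse_action(action: str) -> tuple[str, str]:
    """Split an action string such as 'VQA(What is the brand...?)'
    into a tool name and its argument. The 'Tool(argument)' format
    is an assumed convention based on the example above."""
    match = re.fullmatch(r"(\w+)\((.*)\)", action.strip())
    if match is None:
        raise ValueError(f"Unparseable action: {action!r}")
    return match.group(1), match.group(2)

tool, arg = parse_action("VQA(What is the brand of the device in the image?)")
# tool == "VQA"
# arg == "What is the brand of the device in the image?"
```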
The execution module contains a collection of off-the-shelf tools, including APIs, neural models, and virtual simulators, that execute the grounded actions. The results of these executed actions are then fed back to the planning and grounding modules, enabling the agent to behave iteratively and adaptively.
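Put together, the feedback loop described above might look like this sketch. It reuses the hypothetical module interfaces from the earlier snippet, and the “done” stopping signal is likewise an assumption rather than LUMOS's actual convention.

```python
# A sketch of the iterative plan-ground-execute loop. The module
# objects follow the hypothetical interfaces sketched earlier, and
# the "done" stopping convention is an assumption.
def run_agent(task, planner, grounder, executor, max_steps=10):
    history = []  # executed subgoals and results, fed back each step
    for _ in range(max_steps):
        subgoal = planner.next_subgoal(task, history)
        if subgoal.strip().lower() == "done":
            break
        action = grounder.ground(subgoal)
        result = executor.execute(action)
        history.append(f"{subgoal} -> {action} -> {result}")
    return history
```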
One of the key advantages of LUMOS is its modular design, which allows for easy upgrades and broader applicability to various interactive tasks. By separating the planning, grounding, and execution components, researchers can improve or replace individual modules without affecting the others.
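As a rough illustration, swapping or extending the tool set touches only the execution module. This reuses the hypothetical ExecutionModule from the sketch above, with stub lambdas standing in for real tools.

```python
# Upgrading the execution module without touching planner or grounder.
# Stub lambdas stand in for real tools; tool names are hypothetical.
basic_tools = {"VQA": lambda q: "stub answer", "QA": lambda q: "stub answer"}
extended_tools = {**basic_tools, "Calculator": lambda expr: "stub result"}

executor = ExecutionModule(basic_tools)     # original agent
executor = ExecutionModule(extended_tools)  # upgraded; planner unchanged
```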
To train LUMOS, the researchers curated a large-scale, high-quality dataset of more than 56,000 annotations derived from ground-truth reasoning rationales across several complex interactive tasks, including question answering, mathematics, coding, web navigation, and multimodal reasoning. These annotations were obtained by using GPT-4 and other advanced language models to convert existing benchmarks into a unified format compatible with the LUMOS architecture. The resulting dataset is one of the largest open-source resources for agent fine-tuning, enabling smaller language models to be trained effectively as language agents.
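A single converted training example might resemble the structure below. The field names and layout are guesses for illustration only; the released dataset defines the actual schema.

```python
# A hypothetical unified annotation, illustrating how an existing
# benchmark example could be converted for agent training.
# Field names are assumptions, not the dataset's real schema.
annotation = {
    "task": "What country is the device in your hand from?",
    "task_type": "multimodal",
    "subgoals": [
        "Identify the brand of the device",
        "Answer the country of the brand of the device",
    ],
    "actions": [
        "VQA(What is the brand of the device in the image?)",
        "QA(What country is the brand from?)",
    ],
}
```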
In evaluations across nine datasets, LUMOS exhibited several key advantages. It outperformed several larger open-source agents on the held-out datasets for each task type, in some cases even surpassing GPT-based agents on web navigation and question answering tasks. LUMOS also outperformed agents produced by other training methods, such as chain-of-thought training and unmodularized integrated training. Notably, LUMOS demonstrated impressive generalization, significantly outperforming 30B-scale agents (WizardLM-30B and Vicuna-v1.3-33B) and domain-specific agents on unseen tasks involving novel environments and actions.
With its open-source nature, competitive performance, and strong generalization capabilities, LUMOS represents an important step forward in the development of affordable, transparent, and reproducible language agents for complex interactive tasks.
Review the Paper, HF page, and GitHub. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter. Join our Telegram channel, Discord channel, and LinkedIn Group.
If you like our work, you will love our Newsletter.
Don't forget to join our 39k+ ML SubReddit
Vibhanshu Patidar is a Consulting Intern at MarktechPost. He is currently pursuing a bachelor's degree at the Indian Institute of Technology (IIT) Kanpur. He is a robotics and machine learning enthusiast with a knack for unraveling the complexities of algorithms that bridge theory and practical applications.