Your daily to-do list is probably pretty simple: washing dishes, grocery shopping, and other minutiae. It's unlikely that you wrote “pick up the first dirty plate” or “wash that plate with a sponge,” because each of these miniature steps within the chore feels intuitive. While we can complete each step routinely without much thought, a robot requires an explicit plan that spells out each step in far more detail.
MIT's Improbable AI Lab, a group within the Computer Science and Artificial Intelligence Laboratory (CSAIL), has offered these machines a helping hand with a new multimodal framework: Compositional Foundation Models for Hierarchical Planning (HiP), which develops detailed, feasible plans using the expertise of three different foundation models. Like OpenAI's GPT-4, the foundation model on which ChatGPT and Bing Chat were built, these foundation models are trained on massive amounts of data for applications such as image generation, text translation, and robotics.
Unlike RT-2 and other multimodal models that are trained on paired vision, language, and action data, HiP uses three different foundation models, each trained on a different data modality. Each foundation model captures a different part of the decision-making process, and the models then work together when it's time to make decisions. HiP removes the need for paired vision, language, and action data, which is difficult to obtain, and it also makes the reasoning process more transparent.
What counts as a daily chore for a human may be a robot's “long-horizon goal” (an overarching objective that involves completing many smaller steps first), requiring sufficient data to plan, understand, and execute objectives. While computer vision researchers have tried to build monolithic foundation models for this problem, pairing language, visual, and action data is expensive. Instead, HiP represents a different multimodal recipe: a trio that cheaply incorporates linguistic, physical, and environmental intelligence into a robot.
“Foundation models do not have to be monolithic,” says NVIDIA AI researcher Jim Fan, who was not involved in the paper. “This work decomposes the complex task of embodied agent planning into three constituent models: a language reasoner, a visual world model, and an action planner. It makes a difficult decision-making problem more manageable and transparent.”
The team believes their system could help these machines perform household tasks, such as putting away a book or placing a container in the dishwasher. Additionally, HiP could help with multi-step construction and manufacturing tasks, such as stacking and placing different materials in specific sequences.
HiP evaluation
The CSAIL team tested HiP's acuity on three manipulation tasks, where it outperformed comparable frameworks. The system reasoned through each problem, developing intelligent plans that adapt to new information.
First, the researchers asked it to stack differently colored blocks on top of each other and then place others nearby. The catch: some of the correct colors were not present, so the robot had to place white blocks in a colored container to paint them. HiP often adjusted to these changes accurately, especially compared with state-of-the-art task-planning systems like Transformer BC and Action Diffuser, adapting its plans to stack and place each block as needed.
Another test: arranging objects such as candy and a hammer in a brown box while ignoring other items. Some of the items it needed to move were dirty, so HiP adjusted its plans to place them in a cleaning box first and then into the brown box. In a third demonstration, the robot was able to ignore unnecessary objects to complete kitchen subgoals, such as opening a microwave, clearing a kettle out of the way, and turning on a light. Some of the prompted steps had already been completed, so the robot adapted by skipping those directions.
A triple hierarchy
HiP's three-pronged planning process operates as a hierarchy, with the ability to pre-train each of its components on different datasets, including information from outside robotics. At the base of that hierarchy is a large language model (LLM), which starts the ideation process by capturing all the symbolic information needed and developing an abstract task plan. Applying common-sense knowledge found on the internet, the model breaks its objective into subgoals. For example, “make a cup of tea” becomes “fill a pot with water,” “boil the pot,” and the subsequent actions required.
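As a rough illustration of this first, language-only level, the sketch below shows how an off-the-shelf LLM could be prompted to split a long-horizon goal into ordered subgoals. It is a minimal sketch, not the authors' code: `query_llm` is a hypothetical stand-in for any LLM API, and here it simply returns a canned answer for the tea example so the snippet runs offline.

```python
# Minimal sketch of HiP's language level (hypothetical, not the paper's code):
# an LLM turns a long-horizon goal into an ordered list of subgoals.

def query_llm(prompt: str) -> str:
    """Hypothetical LLM call; returns a canned answer so the sketch runs offline."""
    return "fill a pot with water\nboil the pot\nsteep a tea bag in the water\npour the tea into a cup"

def decompose_goal(goal: str) -> list[str]:
    """Ask the language model to break a long-horizon goal into short subgoals."""
    prompt = f"Break the task '{goal}' into short, ordered subgoals, one per line."
    response = query_llm(prompt)
    return [line.strip() for line in response.splitlines() if line.strip()]

if __name__ == "__main__":
    for i, subgoal in enumerate(decompose_goal("make a cup of tea"), start=1):
        print(f"{i}. {subgoal}")
```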
“All we want to do is take existing pre-trained models and have them successfully interact with each other,” says Anurag Ajay, a PhD student in MIT's Department of Electrical Engineering and Computer Science (EECS) and a CSAIL affiliate. “Rather than pushing for one model to do it all, we combine several that leverage different modalities of Internet data. When used together, they help with robotic decision making and can potentially help with tasks in homes, factories and construction sites.”
These models also need some form of “eyes” to understand the environment they're operating in and to correctly execute each subgoal. The team used a large video diffusion model, which collects geometric and physical information about the world from footage on the internet, to augment the initial planning completed by the LLM. In turn, the video model generates an observation trajectory plan, refining the LLM's outline to incorporate new physical knowledge.
This process, known as iterative refinement, allows HiP to reason through its ideas, taking in feedback at each stage to generate a more practical outline. The feedback flow is similar to writing an article: an author sends a draft to an editor, and once those revisions are incorporated, the publisher reviews any last changes and finalizes the piece.
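The sketch below gives a rough sense of that feedback loop under stated assumptions: a hypothetical `video_model_rollout` stands in for the video diffusion model imagining an observation trajectory for a subgoal, and a hypothetical `physical_consistency` score stands in for the feedback signal. Neither function reflects the paper's actual implementation.

```python
# Rough sketch of iterative refinement (hypothetical stand-ins, not HiP's code):
# keep resampling imagined trajectories for a subgoal until one passes a feedback check.

import random

def video_model_rollout(subgoal: str, seed: int) -> list[str]:
    """Pretend video-model rollout: a short sequence of imagined observations."""
    rng = random.Random(seed)
    return [f"imagined frame {i} of '{subgoal}' (variant {rng.randint(0, 9)})" for i in range(3)]

def physical_consistency(trajectory: list[str]) -> float:
    """Pretend feedback score; a real system would use the video model's own likelihood."""
    return random.Random(len("".join(trajectory))).random()

def refine(subgoal: str, attempts: int = 5, threshold: float = 0.7) -> list[str]:
    """Resample until a trajectory clears the threshold, keeping the best seen so far."""
    best, best_score = [], -1.0
    for seed in range(attempts):
        trajectory = video_model_rollout(subgoal, seed)
        score = physical_consistency(trajectory)
        if score > best_score:
            best, best_score = trajectory, score
        if score >= threshold:
            break
    return best

print(refine("fill a pot with water"))
```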
Capping off the hierarchy is an egocentric action model, which uses a sequence of first-person images to infer what actions should take place given its surroundings. During this stage, the video model's observation plan is mapped onto the space visible to the robot, helping the machine decide how to execute each task within the long-horizon goal. If a robot uses HiP to make tea, this means it will have mapped out exactly where the pot, sink, and other key visual elements are, and it will begin completing each subgoal.
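Putting the three levels together, a toy end-to-end pipeline might look like the sketch below. Every function here is a hypothetical stub standing in for a pre-trained model (language, video, and egocentric action, respectively); the point is only to show how the outputs of one level feed the next.

```python
# Toy sketch of HiP's three-level hierarchy (hypothetical stubs, not the paper's code).

def llm_subgoals(goal: str) -> list[str]:
    """Language level: abstract subgoals (stubbed)."""
    return ["fill a pot with water", "boil the pot", "pour the tea into a cup"]

def video_plan(subgoal: str) -> list[str]:
    """Visual level: an imagined observation trajectory for one subgoal (stubbed)."""
    return [f"observation {i} of '{subgoal}'" for i in range(2)]

def egocentric_action(observation: str) -> str:
    """Action level: a low-level command inferred from a first-person observation (stubbed)."""
    return f"move toward: {observation}"

def plan(goal: str) -> list[str]:
    """Chain the three levels: goal -> subgoals -> imagined observations -> actions."""
    actions = []
    for subgoal in llm_subgoals(goal):
        for observation in video_plan(subgoal):
            actions.append(egocentric_action(observation))
    return actions

for action in plan("make a cup of tea"):
    print(action)
```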
Still, this multimodal work is limited by the lack of high-quality video foundation models. Once available, they could interface with HiP's small-scale video models to further improve visual sequence prediction and robot action generation. A higher-quality version would also reduce the video models' current data requirements.
That said, the CSAIL team's approach used only a small amount of data overall. Moreover, HiP was cheap to train and demonstrated the potential of using readily available foundation models to complete long-horizon tasks. “What Anurag has demonstrated is a proof of concept of how we can take models trained on separate tasks and data modalities and combine them into models for robotic planning. In the future, HiP could be augmented with pre-trained models that can process touch and sound to make better plans,” says senior author Pulkit Agrawal, assistant professor in EECS at MIT and director of the Improbable AI Lab. The group is also considering applying HiP to solving real-world long-horizon tasks in robotics.
Ajay and Agrawal are lead authors on a paper describing the work. They are joined by MIT professors and CSAIL principal investigators Tommi Jaakkola, Joshua Tenenbaum, and Leslie Pack Kaelbling; Akash Srivastava, CSAIL research affiliate and research director at the MIT-IBM Watson AI Lab; graduate students Seungwook Han and Yilun Du '19; former postdoc Abhishek Gupta, who is now an assistant professor at the University of Washington; and former graduate student Shuang Li PhD '23.
The team's work was supported, in part, by the National Science Foundation, the U.S. Defense Advanced Research Projects Agency, the U.S. Army Research Office, the U.S. Office of Naval Research Multidisciplinary University Research Initiative, and the MIT-IBM Watson AI Lab. Their findings were presented at the 2023 Conference on Neural Information Processing Systems (NeurIPS).