Imagine having to clean up a messy kitchen, starting with a countertop covered in sauce packets. If your goal is simply to clear the counter, you can sweep up the packets as a group. If, however, you want to pick out the mustard packets before throwing out the rest, you would sort more selectively, by sauce type. And if, among the mustards, you were craving Grey Poupon, finding that particular brand would require a more careful search.
MIT engineers have developed a method that allows robots to make equally intuitive and task-relevant decisions.
The team's new approach, called Clio, allows a robot to identify the parts of a scene that matter, given the tasks at hand. With Clio, a robot takes a list of tasks described in natural language and, based on those tasks, determines the level of granularity needed to interpret its environment and “remember” only the parts of a scene that are relevant.
In real-world experiments ranging from a crowded cubicle to a five-story building on the MIT campus, the team used Clio to automatically segment a scene at different levels of granularity, based on a set of tasks specified in natural language prompts, such as “move shelf of magazines” and “get a first aid kit.”
The team also ran Clio in real time on a quadruped robot. As the robot explored an office building, Clio identified and mapped only those parts of the scene that related to the robot's tasks (such as retrieving a dog toy while ignoring piles of office supplies), allowing the robot to grasp the objects of interest.
Clio is named after the Greek muse of history, a nod to its ability to identify and remember only the elements that matter for a given task. The researchers imagine that Clio would be useful in many situations and environments where a robot would need to quickly inspect and understand its surroundings in the context of its assigned task.
“Search and rescue is the application motivating this work, but Clio can also power home robots and robots that work in a factory alongside humans,” says Luca Carlone, associate professor in MIT's Department of Aeronautics and Astronautics (AeroAstro), principal investigator at the Laboratory for Information and Decision Systems (LIDS), and director of the MIT SPARK Laboratory. “It's really about helping the robot understand the environment and what it needs to remember to carry out its mission.”
The team details their results in a study appearing today in the journal Robotics and Automation Letters. Carlone's co-authors include SPARK Lab members Dominic Maggio, Yun Chang, Nathan Hughes, and Lukas Schmid, and MIT Lincoln Laboratory members Matthew Trang, Dan Griffith, Carlyn Dougherty, and Eric Cristofalo.
Open fields
Major advances in the fields of computer vision and natural language processing have allowed robots to identify objects in their environment. But until recently, robots could only do so in “closed” scenarios, where they are programmed to work in a carefully selected and controlled environment, with a finite number of objects that the robot has been previously trained to recognize.
In recent years, researchers have taken a more “open” approach to enabling robots to recognize objects in more realistic settings. In the field of open-set recognition, researchers have leveraged deep learning tools to build neural networks that can process billions of images from the internet, along with the text associated with each image (such as a photo of a friend's dog on Facebook, with the caption “Meet My New Puppy!”).
From millions of image-text pairs, a neural network learns and then identifies those segments of a scene that are characteristic of certain terms, such as a dog. A robot can then apply that neural network to detect a dog in an entirely new scene.
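To make the idea concrete, here is a minimal sketch of open-set scoring with a CLIP-style model, using OpenAI's `clip` package. The checkpoint name, image path, and labels are illustrative choices, not Clio's actual configuration:

```python
# Minimal open-set recognition sketch with a CLIP-style model.
# Assumes the `clip` package (github.com/openai/CLIP) and PyTorch are installed.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Any natural-language labels work -- there is no fixed, pre-trained object list.
labels = ["a dog", "a couch", "a pile of office supplies"]
text = clip.tokenize(labels).to(device)

image = preprocess(Image.open("scene.jpg").convert("RGB")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Cosine similarity between the scene image and each label.
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
scores = (image_features @ text_features.T).squeeze(0)

for label, score in zip(labels, scores.tolist()):
    print(f"{label}: {score:.3f}")
```

The same scoring applies to cropped segments of a scene rather than the whole image, which is how an individual segment can be matched against a term like “a dog.”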
But a challenge remains: how to parse a scene in a way that is useful and relevant to a particular task.
“Typical methods will choose a fixed, arbitrary level of granularity to determine how to merge segments of a scene into what can be considered an 'object,'” Maggio says. “However, the granularity of what you call 'object' is actually related to what the robot has to do. If that granularity is fixed without considering the tasks, then the robot may end up with a map that is not useful for its tasks.”
Information bottleneck
With Clio, the MIT team aimed to allow robots to interpret their environment at a level of granularity that could automatically adjust to the tasks at hand.
For example, given the task of moving a stack of books to a shelf, the robot should be able to determine that the entire stack of books is the relevant object for the task. Likewise, if the task were to move only the green book from the rest of the stack, the robot would have to distinguish the green book as a single target object and ignore the rest of the scene, including the other books in the stack.
The team's approach combines cutting-edge computer vision and large language models comprising neural networks that make connections among millions of open-source images and semantic text. They also incorporate mapping tools that automatically split an image into many small segments, which can be fed into the neural network to determine whether certain segments are semantically similar. The researchers then leverage an idea from classical information theory called the “information bottleneck,” which they use to compress a set of image segments in a way that selects and stores the segments that are semantically most relevant to a given task.
“For example, let's say there are a bunch of books in the scene and my task is simply to get the green book. In that case, we push all this information about the scene through this bottleneck and end up with a group of segments that represent the green book,” explains Maggio. “All the other segments that are not relevant are simply lumped into one group that we can simply delete. And we are left with an object with the appropriate granularity that I need to perform my task.”
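As a rough illustration of that compression step, the sketch below greedily merges semantically similar segment embeddings and then prunes the merged “objects” by their relevance to a task embedding. This is a much-simplified stand-in for the paper's information-bottleneck formulation; the `compress_segments` helper and its thresholds are hypothetical:

```python
# Illustrative task-driven compression of scene segments (a simplified stand-in
# for an information-bottleneck formulation; thresholds are arbitrary).
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def compress_segments(seg_embeds, task_embed, merge_thresh=0.9, keep_thresh=0.25):
    """Greedily merge similar segments, then keep only the merged 'objects'
    relevant to the task embedding. Returns lists of segment indices."""
    clusters = [[i] for i in range(len(seg_embeds))]
    means = [e.copy() for e in seg_embeds]

    merged = True
    while merged and len(clusters) > 1:
        merged = False
        best, best_pair = merge_thresh, None
        # Find the most similar pair of clusters above the merge threshold.
        for i in range(len(means)):
            for j in range(i + 1, len(means)):
                s = cosine(means[i], means[j])
                if s > best:
                    best, best_pair = s, (i, j)
        if best_pair is not None:
            i, j = best_pair
            clusters[i] += clusters.pop(j)  # fuse cluster j into cluster i
            means[i] = np.mean([seg_embeds[k] for k in clusters[i]], axis=0)
            means.pop(j)
            merged = True

    # Task-driven pruning: clusters irrelevant to the task are simply discarded.
    return [c for c, m in zip(clusters, means) if cosine(m, task_embed) > keep_thresh]
```

The design point the example tries to capture is that relevance to the task, not a fixed label set, decides what survives as an “object” and at what granularity.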
The researchers demonstrated Clio in different real-world environments.
“What we thought would be a really sensible experiment would be to use Clio in my apartment, where I didn't do any cleaning beforehand,” says Maggio.
The team came up with a list of natural-language tasks, such as “move a bunch of clothes,” and then applied Clio to images of Maggio's messy apartment. In these cases, Clio was able to quickly segment the apartment scenes and feed the segments through the information bottleneck algorithm to identify the segments that made up the pile of clothes.
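In terms of the sketches above, that experiment corresponds roughly to the following toy run. The embeddings here are synthetic stand-ins for image-text features of apartment segments; in a real system they would come from a class-agnostic segmenter plus a CLIP-style model:

```python
# Toy end-to-end run reusing the illustrative compress_segments helper above.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: five segments of a clothes pile share a common direction,
# while the remaining segments are unrelated clutter.
clothes_dir = rng.normal(size=512)
seg_embeds = [clothes_dir + 0.05 * rng.normal(size=512) for _ in range(5)]
seg_embeds += [rng.normal(size=512) for _ in range(35)]

# Synthetic embedding of the task "move a bunch of clothes," near the pile.
task_embed = clothes_dir + 0.05 * rng.normal(size=512)

relevant = compress_segments(seg_embeds, task_embed)
print(f"kept {len(relevant)} task-relevant object(s) out of {len(seg_embeds)} segments")
# Expected: the five clothes segments merge into one kept object; clutter is dropped.
```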
They also ran Clio on Boston Dynamics' quadruped robot, Spot. They gave the robot a list of tasks to complete, and as the robot explored and mapped the interior of an office building, Clio ran in real time on an on-board computer mounted on Spot, picking out segments in the mapped scenes that visually related to the given task. The method generated an overlay map showing only the target objects, which the robot then used to approach the identified objects and physically complete the task.
“Running Clio in real time was a big accomplishment for the team,” says Maggio. “A lot of prior work can take several hours to run.”
In the future, the team plans to adapt Clio so that it can handle higher-level tasks and take advantage of recent advances in photorealistic visual scene representations.
“We still give Clio tasks that are somewhat specific, like 'find a deck of cards,'” Maggio says. “For search and rescue, you need to give it higher-level tasks, like 'find survivors' or 'turn the power back on.' So we want to get to a more human-level understanding of how to perform more complex tasks.”
This research was supported, in part, by the U.S. National Science Foundation, the Swiss National Science Foundation, MIT Lincoln Laboratory, the U.S. Office of Naval Research, and the U.S. Army Research Laboratory's Distributed and Collaborative Intelligent Systems and Technology Collaborative Research Alliance.