To teach an AI agent a new task, such as how to open a kitchen cabinet, researchers often use reinforcement learning, a trial-and-error process in which the agent is rewarded for taking actions that bring it closer to the goal.
In many cases, a human expert must carefully design a reward function, which is an incentive mechanism that motivates the agent to explore. The human expert must iteratively update that reward function as the agent explores and attempts different actions. This can be time-consuming, inefficient, and difficult to scale, especially when the task is complex and involves many steps.
Researchers at MIT, Harvard University, and the University of Washington have developed a new reinforcement learning approach that does not rely on an expert-designed reward function. Instead, it leverages feedback from many non-expert users to guide the agent as it learns to achieve its goal.
While some other methods also try to use feedback from non-experts, this new approach enables the AI agent to learn more quickly, even though data crowdsourced from users is often full of errors. Such noisy data can cause other methods to fail.
Additionally, this new approach allows feedback to be collected asynchronously, so that non-expert users around the world can contribute to teaching the agent.
“One of the most challenging and time-consuming parts of designing a robotic agent today is designing the reward function. Today, reward functions are designed by expert researchers, a paradigm that is not scalable if we want to teach our robots many different tasks. Our work proposes a way to scale robot learning by crowdsourcing the design of the reward function and making it possible for non-experts to provide useful feedback,” says Pulkit Agrawal, assistant professor in the Department of Electrical Engineering and Computer Science (EECS) at MIT who runs the Improbable AI Lab at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL).
In the future, this method could help a robot learn to perform specific tasks in a user’s home quickly, without the owner needing to show it physical examples of each task. The robot could explore on its own, with crowdsourced feedback from non-experts guiding its exploration.
“In our method, the reward function guides the agent toward what it needs to explore, rather than telling it exactly what it needs to do to complete the task. So even if human supervision is somewhat inaccurate and noisy, the agent can still explore, which helps it learn much better,” explains lead author Marcel Torne ’23, a research assistant at the Improbable AI Lab.
Torne is joined on the paper by his MIT advisor, Agrawal; senior author Abhishek Gupta, assistant professor at the University of Washington; and others at the University of Washington and MIT. The research will be presented at the Neural Information Processing Systems Conference next month.
Noisy feedback
One way to collect user feedback for reinforcement learning is to show users two photographs of states achieved by the agent and ask which state is closer to the goal. For example, perhaps the goal of a robot is to open a kitchen cabinet. One image might show the robot opening the cabinet, while the second might show it opening the microwave. The user would pick the photo of the “better” state.
Some previous approaches attempt to use this crowdsourced binary feedback to optimize a reward function that the agent would use to learn the task. However, because non-experts are likely to make mistakes, the reward function can become very noisy, so the agent could get stuck and never reach its goal.
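To make that earlier approach concrete, here is a minimal sketch (not the authors’ code) of how a reward model might be fit from such binary comparisons, assuming for simplicity a linear score over state features and a Bradley-Terry-style update:

```python
# Illustrative sketch (not the authors' code) of the prior approach: fitting a
# reward model from binary "which state is closer to the goal?" comparisons.
# The linear score and single gradient step are simplifying assumptions.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class PreferenceReward:
    def __init__(self, state_dim):
        self.w = np.zeros(state_dim)      # reward(s) = w . s, kept linear for brevity

    def reward(self, state):
        return self.w @ state

    def update(self, preferred, rejected, lr=0.05):
        # Bradley-Terry model: P(human prefers A over B) = sigmoid(r(A) - r(B)).
        # Take one gradient step on the log-likelihood of the annotator's choice.
        p = sigmoid(self.reward(preferred) - self.reward(rejected))
        self.w += lr * (1.0 - p) * (preferred - rejected)
```

Because every mistaken label pushes the learned reward in the wrong direction, an agent that optimizes this reward directly can be led astray, which is the failure mode described above.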
“Basically, the agent would take the reward function too seriously. It would try to match the reward function perfectly. So instead of directly optimizing the reward function, we simply use it to tell the robot which areas it should explore,” says Torne.
He and his collaborators divided the process into two separate parts, each driven by its own algorithm. They call their new reinforcement learning method HuGE (Human Guided Exploration).
On the one hand, a goal selector algorithm is continually updated with crowdsourced human feedback. The feedback is not used as a reward function, but rather to guide the agent’s exploration. In a sense, non-expert users drop breadcrumbs that incrementally lead the agent toward its goal.
On the other hand, the agent explores on its own, in a self-supervised manner guided by the goal selector. It collects images or videos of the actions it attempts, which are then sent to humans and used to update the goal selector.
This narrows the area the agent explores, steering it toward more promising regions that are closer to its goal. But if there is no feedback, or if feedback takes a while to arrive, the agent keeps learning on its own, albeit more slowly. This allows feedback to be collected infrequently and asynchronously.
“The exploration loop can keep going autonomously, because it will just explore and learn new things. And then, when you get a better signal, it will explore in more concrete ways. You can just keep the two loops turning at their own pace,” adds Torne.
And because the feedback simply gently guides the agent’s behavior, it will eventually learn to complete the task even if users provide incorrect answers.
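As an illustration of the goal-selector half of this design, the sketch below shows one plausible, hypothetical implementation: human comparisons update a score over previously visited states, and the agent samples its next exploration goal from that score. The class name, linear scorer, and softmax sampling are simplifying assumptions, not the released HuGE code:

```python
# Hypothetical goal selector in the spirit of the description above; a minimal
# sketch, not the HuGE implementation. States are assumed to be feature vectors.
import numpy as np

class GoalSelector:
    def __init__(self, state_dim):
        self.w = np.zeros(state_dim)      # learned "closeness to goal" score

    def update(self, preferred, rejected, lr=0.05):
        # Same pairwise-comparison update as before, applied to the goal scorer
        # whenever a new crowdsourced label arrives (possibly rarely).
        p = 1.0 / (1.0 + np.exp(-(self.w @ preferred - self.w @ rejected)))
        self.w += lr * (1.0 - p) * (preferred - rejected)

    def pick(self, frontier):
        # Sample the next exploration goal from visited states, favoring those
        # humans judged closer to the goal. With no feedback yet (w == 0) this
        # is uniform, so the agent still explores on its own.
        scores = np.array([self.w @ s for s in frontier])
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        return frontier[np.random.choice(len(frontier), p=probs)]
```

The exploration half would run independently: the policy rolls out toward whichever state `pick` returns and learns from its own trajectories, so training continues even when no new comparisons have arrived.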
Faster learning
The researchers tested this method on a series of simulated and real-world tasks. In the simulation, they used HuGE to effectively learn tasks with long sequences of actions, such as stacking blocks in a particular order or navigating a large maze.
In real-world tests, they used HuGE to train robotic arms to draw the letter “U” and to pick and place objects. For these tests, they gathered data from 109 non-expert users in 13 different countries spanning three continents.
In simulated and real-world experiments, HuGE helped agents learn to achieve the goal faster than other methods.
The researchers also found that data collected from non-experts yielded better performance than synthetic data that was produced and labeled by the researchers themselves. For the non-expert users, labeling 30 images or videos took less than two minutes.
“This makes it very promising in terms of being able to scale up this method,” adds Torne.
In a related paper, which the researchers presented at the recent Conference on Robot Learning, they enhanced HuGE so that an AI agent can learn to perform a task and then autonomously reset the environment to continue learning. For example, if the agent learns to open a cabinet, the method also guides the agent to close the cabinet.
“Now we can make it learn completely autonomously without the need for human resets,” he says.
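A rough sketch of that reset idea, under the assumption that the agent simply alternates between a task goal and a goal that restores the starting state; `policy` and `env` here are hypothetical stand-ins, not the authors’ released interface:

```python
# Rough sketch of the autonomous-reset idea: alternate between the task goal
# (e.g., cabinet open) and a reset goal (cabinet closed). `policy` and `env`
# are hypothetical stand-ins, not the authors' released interface.
def autonomous_training(env, policy, task_goal, reset_goal, n_cycles=100):
    for _ in range(n_cycles):
        for goal in (task_goal, reset_goal):
            trajectory = policy.rollout(env, goal)   # attempt the current goal
            policy.update(trajectory)                # learn from whatever happened
```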
The researchers also emphasize that, in this and other learning approaches, it is essential to ensure that AI agents are aligned with human values.
In the future, they want to continue refining HuGE so that the agent can learn from other forms of communication, such as natural language and physical interactions with the robot. They are also interested in applying this method to teach multiple agents at once.
This research is funded, in part, by the MIT-IBM Watson ai Lab.