For roboticists, one challenge stands out above all others: generalization, the ability to create machines that can adapt to any environment or condition. Since the 1970s, the field has evolved from writing sophisticated programs to using deep learning, teaching robots to learn directly from human behavior. But a critical bottleneck remains: data quality. To improve, robots need to encounter scenarios that push the limits of their capabilities, operating at the edge of their mastery. This process has traditionally required human oversight, with operators carefully challenging robots to expand their abilities. As robots become more sophisticated, this hands-on approach runs into a scaling problem: the demand for high-quality training data far outpaces humans’ ability to provide it.
Now, a team of researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) has developed a novel approach to robot training that could significantly accelerate the deployment of intelligent, adaptive machines in real-world environments. The new system, called “LucidSim,” uses recent advances in generative AI and physics simulators to create diverse and realistic virtual training environments, helping robots reach expert-level performance on difficult tasks without any real-world data.
LucidSim combines physics simulation with generative AI models, addressing one of the most persistent challenges in robotics: transferring skills learned in simulation to the real world. “A fundamental challenge in robot learning has long been the ‘sim-to-real gap’: the disparity between simulated training environments and the complex, unpredictable real world,” says MIT CSAIL postdoc Ge Yang, a lead researcher on LucidSim. “Previous approaches often relied on depth sensors, which simplified the problem but overlooked crucial real-world complexities.”
The multifaceted system combines several technologies. At its core, LucidSim uses large language models to generate varied, structured descriptions of environments. These descriptions are then turned into images by generative image models. To ensure that these images reflect real-world physics, an underlying physics simulator guides the generation process.
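In rough Python, that pipeline might be sketched as follows. The objects and method names here (describe_environments, render_geometry_buffers, generate) are hypothetical stand-ins for whichever language model, simulator, and image model one plugs in; this is a structural sketch of the idea described above, not the authors’ code.

```python
# Structural sketch of a LucidSim-style data-generation loop.
# All interfaces (llm, simulator, image_model) are hypothetical placeholders.

def build_training_frames(simulator, llm, image_model, n_scenes=100):
    """Generate simulator-grounded, generatively textured training images."""
    frames = []
    # 1. Ask a large language model for varied, structured scene descriptions.
    prompts = llm.describe_environments(n=n_scenes)

    for prompt in prompts:
        # 2. Query the physics simulator for the ground-truth geometry of a scene.
        depth_map, semantic_mask = simulator.render_geometry_buffers()

        # 3. Condition the generative image model on both the text prompt and the
        #    geometry buffers, so the output respects the simulated scene's physics.
        image = image_model.generate(
            prompt=prompt,
            depth=depth_map,
            segmentation=semantic_mask,
        )
        frames.append((image, depth_map, semantic_mask))
    return frames
```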
The birth of an idea: from burritos to breakthroughs
The inspiration for LucidSim came from an unexpected place: a conversation outside Beantown Taqueria in Cambridge, Massachusetts. “We wanted to teach vision-equipped robots how to improve using human feedback. But then we realized we didn’t have a purely vision-based policy to begin with,” says Alan Yu, an undergraduate in electrical engineering and computer science (EECS) at MIT and co-lead author of LucidSim. “We kept talking about it as we walked down the street, and then we stopped outside the taqueria for about half an hour. That’s where we had our moment.”
To construct their data, the team generated realistic images by extracting depth maps, which provide geometric information, and semantic masks, which label the different parts of an image, from the simulated scene. However, they quickly realized that with such strict control over the composition of the image content, the model would produce nearly identical images from the same prompt. So they devised a way to source diverse text prompts from ChatGPT.
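As an illustration, the prompt-diversification step could look something like the sketch below, which uses the OpenAI Python SDK. The model name, prompt wording, and function name are assumptions made for illustration, not details taken from the paper.

```python
# Illustrative sketch: asking ChatGPT for varied scene descriptions.
# Model name and prompt wording are assumptions, not the authors' setup.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def sample_scene_prompts(n=20, setting="an outdoor obstacle course for a legged robot"):
    """Return n varied, one-line scene descriptions to feed the image model."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Write {n} short, concrete descriptions of {setting}. "
                "Vary the lighting, weather, ground materials, and clutter. "
                "Return exactly one description per line, with no numbering."
            ),
        }],
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.strip() for line in lines if line.strip()]
```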
However, this approach produced only a single image. To make short, coherent videos that serve as little “experiences” for the robot, the scientists combined this image generation with another novel technique the team created, called “Dreams In Motion.” The system computes the motion of each pixel between frames to warp a single generated image into a short, multi-frame video. Dreams In Motion does this by considering the 3D geometry of the scene and the relative changes in the robot’s perspective.
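To make the idea concrete, here is a minimal NumPy/OpenCV sketch of that kind of depth-based warping under a pinhole-camera assumption. The function name, interfaces, and the assumption that the simulator can render a depth map at the next camera pose are ours for illustration; they are not the authors’ implementation.

```python
# Minimal sketch of depth-based image warping between two camera poses.
# Interfaces and the pinhole-camera assumptions are illustrative only.
import numpy as np
import cv2

def warp_to_next_frame(image, depth_next, K, T_next_to_cur):
    """Warp the current generated `image` into the next camera view.

    image         : (H, W, 3) generated image at the current pose
    depth_next    : (H, W) depth rendered by the simulator at the next pose
    K             : (3, 3) camera intrinsics
    T_next_to_cur : (4, 4) rigid transform from the next camera frame to the current one
    """
    h, w = depth_next.shape
    # Back-project every pixel of the next view into 3D using simulator depth.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T.astype(np.float64)
    pts_next = np.linalg.inv(K) @ pix * depth_next.reshape(1, -1)

    # Move the points into the current camera frame and re-project them.
    pts_h = np.vstack([pts_next, np.ones((1, pts_next.shape[1]))])
    pts_cur = (T_next_to_cur @ pts_h)[:3]
    proj = K @ pts_cur
    uv_cur = (proj[:2] / proj[2:]).T.reshape(h, w, 2).astype(np.float32)

    # For each next-frame pixel, sample the current image at its source location.
    return cv2.remap(image, uv_cur[..., 0], uv_cur[..., 1], cv2.INTER_LINEAR)
```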
“We outperformed domain randomization, a method developed in 2017 that applies random colors and patterns to objects in the environment, which is still considered the gold standard these days,” Yu says. “Although this technique generates diverse data, it lacks realism. LucidSim addresses both diversity and realism. It’s exciting that even without seeing the real world during training, the robot can recognize and navigate obstacles in real environments.”
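For contrast, classic domain randomization can be sketched in a few lines: each training episode, every object in the simulated scene gets a random color and texture. The simulator and object interfaces below are hypothetical.

```python
# Toy sketch of classic domain randomization: randomize appearance each episode.
# The simulator/object interface here is a hypothetical placeholder.
import random

def randomize_appearance(simulator):
    for obj in simulator.objects():
        obj.set_color((random.random(), random.random(), random.random()))
        obj.set_texture(random.choice(simulator.texture_library()))
```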
The team is particularly excited about the potential of applying LucidSim to domains beyond quadrupedal locomotion and parkour, its primary testbed. One example is mobile manipulation, where a mobile robot is tasked with handling objects in an open area and where color perception is critical. “Today, these robots still learn from real-world demonstrations,” Yang says. “While collecting demonstrations is easy, scaling a real-world robot teleoperation setup to thousands of skills is challenging because a human has to physically set up each scene. We hope to make this easier, and therefore qualitatively more scalable, by moving data collection to a virtual environment.”
Who is the real expert?
The team tested LucidSim against an alternative in which an expert teacher demonstrates the skill for the robot to learn from. The results were surprising: robots trained by the expert struggled, succeeding only 15 percent of the time, and even quadrupling the amount of expert training data barely gave them an edge. But when the robots collected their own training data through LucidSim, the story changed dramatically. Simply doubling the size of the dataset catapulted success rates to 88 percent. “And giving our robot more data monotonically improves its performance; eventually, the student becomes the expert,” Yang says.
“One of the main challenges in sim-to-real transfer for robotics is achieving visual realism in simulated environments,” says Shuran Song, assistant professor of electrical engineering at Stanford University, who was not involved in the research. “The LucidSim framework provides an elegant solution by using generative models to create diverse and highly realistic visual data for any simulation. This work could significantly accelerate the deployment of robots trained in virtual environments for real-world tasks.”
From the streets of Cambridge to the cutting edge of robotics research, LucidSim is paving the way for a new generation of intelligent, adaptive machines, learning to navigate our complex world without ever setting foot in it.
Yu and Yang wrote the paper with four CSAIL affiliates: Ran Choi, an MIT postdoc in mechanical engineering; Yajvan Ravan, an MIT undergraduate in EECS; John Leonard, the Samuel C. Collins Professor of Mechanical and Ocean Engineering in MIT’s Department of Mechanical Engineering; and Phillip Isola, an MIT associate professor in EECS. Their work was supported, in part, by a Packard Fellowship, a Sloan Research Fellowship, the Office of Naval Research, Singapore’s Defence Science and Technology Agency, Amazon, MIT Lincoln Laboratory, and the National Science Foundation Institute for Artificial Intelligence and Fundamental Interactions. The researchers presented their work at the Conference on Robot Learning (CoRL) in early November.