In the current AI zeitgeist, sequence models have skyrocketed in popularity for their ability to analyze data and predict what to do next. For example, you've probably used next-token prediction models like ChatGPT, which anticipate each word (token) in a sequence to form responses to user queries. There are also full-sequence diffusion models like Sora, which turn words into dazzling, realistic visuals by successively “denoising” an entire video sequence.
Researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) have proposed a simple change to the diffusion training scheme that makes this sequence denoising considerably more flexible.
When applied to fields such as computer vision and robotics, full-sequence diffusion models and next-token models have capability trade-offs. Next-token models can generate sequences that vary in length. However, they make these generations without being aware of desirable states in the distant future (such as steering their sequence generation toward a given target 10 tokens away) and thus require additional mechanisms for long-term planning. Diffusion models can perform this kind of future-conditioned sampling, but they lack the ability of next-token models to generate sequences of variable length.
CSAIL researchers wanted to combine the strengths of both models, so they created a sequence model training technique called “Diffusion Forcing.” The name comes from “Teacher Forcing,” the conventional training scheme that breaks down full sequence generation into the smaller, easier steps of next-token generation (much like a good teacher simplifying a complex concept).
Diffusion Forcing found common ground between diffusion models and teacher forcing: both use training schemes that involve predicting masked (noisy) tokens from unmasked ones. In the case of diffusion models, they gradually add noise to the data, which can be seen as fractional masking. The MIT researchers' Diffusion Forcing method trains neural networks to clean up a collection of tokens, removing different amounts of noise within each one while simultaneously predicting the next tokens. The result: a flexible, reliable sequence model that produced higher-quality artificial videos and more accurate decision-making for robots and AI agents.
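For intuition, here is a minimal, hypothetical sketch of the training idea just described, not the authors' released code: every token in a sequence receives its own independently sampled noise level, and a causal network learns to recover the clean tokens from the partially noised sequence. The module names, toy noise schedule, and hyperparameters are all illustrative assumptions.

```python
# Simplified sketch (assumptions, not the paper's implementation) of
# per-token noise training: each token gets its own noise level, and a
# causal model is trained to recover the clean tokens.
import torch
import torch.nn as nn

SEQ_LEN, DIM, NUM_LEVELS = 16, 32, 100  # illustrative sizes


class CausalDenoiser(nn.Module):
    def __init__(self, dim=DIM, hidden=128):
        super().__init__()
        # A GRU processes tokens causally, so each prediction only sees the past.
        self.rnn = nn.GRU(dim + 1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, dim)

    def forward(self, noisy_tokens, noise_levels):
        # Condition each step on its own noise level (its "fractional mask").
        k = noise_levels.float().unsqueeze(-1) / NUM_LEVELS
        h, _ = self.rnn(torch.cat([noisy_tokens, k], dim=-1))
        return self.head(h)  # predicted clean tokens


def training_step(model, clean_tokens, opt):
    B, T, _ = clean_tokens.shape
    # Independent noise level per token: the key difference from full-sequence
    # diffusion (one shared level) and from teacher forcing (a 0/1 mask).
    levels = torch.randint(0, NUM_LEVELS, (B, T))
    alpha = 1.0 - levels.float().unsqueeze(-1) / NUM_LEVELS  # toy schedule
    noisy = alpha.sqrt() * clean_tokens + (1 - alpha).sqrt() * torch.randn_like(clean_tokens)
    loss = ((model(noisy, levels) - clean_tokens) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


model = CausalDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(3):
    batch = torch.randn(8, SEQ_LEN, DIM)  # stand-in for video/action tokens
    print(training_step(model, batch, opt))
```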
By sorting through noisy data and reliably predicting the next steps in a task, Diffusion Forcing can help a robot ignore visual distractions to complete manipulation tasks. It can also generate stable and consistent video sequences and even guide an AI agent through digital mazes. This method could potentially enable household and factory robots to generalize to new tasks and improve AI-generated entertainment.
“Sequence models aim to condition on the known past and predict the unknown future, a kind of binary masking. However, masking doesn't need to be binary,” says lead author Boyuan Chen, an MIT electrical engineering and computer science (EECS) doctoral student and CSAIL member. “With Diffusion Forcing, we add different levels of noise to each token, which effectively serves as a type of fractional masking. At test time, our system can ‘unmask’ a collection of tokens and diffuse a sequence in the near future at a lower noise level. It knows what to trust within its data to overcome out-of-distribution inputs.”
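The test-time behavior Chen describes can be pictured with a toy schedule in which the noise level grows with distance into the future, so the near future is “unmasked” first while the far future stays uncertain. The sketch below uses a stand-in denoiser purely for illustration; the schedule shape and step sizes are assumptions, not values from the paper.

```python
# Toy illustration (not the paper's sampler) of a per-token noise schedule
# at test time: nearby future tokens carry less noise than distant ones.
import numpy as np

NUM_LEVELS = 100  # illustrative number of noise levels


def horizon_noise_schedule(horizon, steepness=8):
    # Noise level per future token: low for the next step, higher further out.
    return np.clip(np.arange(horizon) * steepness, 0, NUM_LEVELS - 1)


def denoise_once(tokens, levels):
    # Stand-in for a trained denoiser: it simply shrinks tokens in proportion
    # to how much noise remains. A real model would be learned from data.
    keep = 1.0 - levels[:, None] / NUM_LEVELS
    return tokens * keep


horizon, dim = 8, 4
levels = horizon_noise_schedule(horizon)  # e.g., [0, 8, 16, ...]
future = np.random.randn(horizon, dim)    # start the future window from noise
for step in range(5):
    future = denoise_once(future, levels)
    levels = np.maximum(levels - 20, 0)   # progressively lower every level
print(levels)  # all zeros once the whole window has been "unmasked"
```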
In several experiments, Diffusion Forcing thrived at ignoring misleading data to execute tasks while anticipating future actions.
When implemented on a robotic arm, for example, it helped swap two toy fruits across three circular mats, a minimal example of a family of long-horizon tasks that require memory. The researchers trained the robot by controlling it remotely (or teleoperating it) in virtual reality. The robot was trained to imitate the user's movements from its camera. Despite starting from random positions and seeing distractions like a shopping bag blocking the markers, it placed the objects in their target locations.
To generate videos, they trained Diffusion Forcing on “Minecraft” gameplay and colorful digital environments created within Google's DeepMind Lab Simulator. When given a single frame of footage, the method produced more stable, higher-resolution videos than comparable baselines, such as a Sora-like full-sequence diffusion model and ChatGPT-like next-token prediction models. These approaches created videos that appeared inconsistent, with the latter sometimes failing to produce working videos beyond just 72 frames.
Diffusion Forcing not only generates elegant videos, but can also serve as a motion planner that steers toward desired outcomes or rewards. Thanks to its flexibility, Diffusion Forcing can uniquely generate plans with varying horizons, perform tree search, and incorporate the intuition that the distant future is more uncertain than the near future. On the task of solving a 2D maze, Diffusion Forcing outperformed six baselines by generating faster plans that led to the goal location, indicating that it could be an effective planner for robots in the future.
In each demonstration, Diffusion Forcing acted as a full-sequence model, a next-token prediction model, or both. According to Chen, this versatile approach could serve as a powerful backbone for a “world model,” an artificial intelligence system that can simulate the dynamics of the world by training on billions of internet videos. This would allow robots to perform novel tasks by imagining what they need to do based on their surroundings. For example, if a robot were asked to open a door without having been trained to do so, the model could produce a video showing the machine how to do it.
The team is currently looking to extend their method to larger data sets and the latest transformer models to improve performance. They intend to expand their work to build a robotic brain similar to ChatGPT that helps robots perform tasks in new environments without human demonstration.
“With Diffusion Forcing, we are taking a step toward bringing video generation and robotics closer together,” says senior author Vincent Sitzmann, an assistant professor at MIT and a member of CSAIL, where he leads the Scene Representation Group. “In the end, we hope to use all the knowledge stored in videos on the internet to enable robots to help in everyday life. Many interesting research challenges remain, such as how robots can learn to imitate humans by observing them even when their own bodies are so different from ours.”
Chen and Sitzmann wrote the paper alongside recent MIT visiting researcher Diego Martí Monsó and CSAIL affiliates: Yilun Du, an EECS graduate student; Max Simchowitz, former postdoc and incoming assistant professor at Carnegie Mellon University; and Russ Tedrake, the Toyota Professor of EECS, Aeronautics and Astronautics, and Mechanical Engineering at MIT, vice president of robotics research at the Toyota Research Institute, and CSAIL member. Their work was supported, in part, by the US National Science Foundation, the Singapore Defence Science and Technology Agency, the Intelligence Advanced Research Projects Activity via the US Department of the Interior, and the Amazon Science Hub. They will present their research at NeurIPS in December.