Reinforcement learning (RL) excels at solving single tasks but struggles to generalize across tasks, especially across different robot morphologies. World models, which learn to simulate environment dynamics, offer a scalable alternative but often rely on inefficient, high-variance optimization methods. While large models trained on large datasets have advanced generalization in robotics, they typically require near-expert data and do not scale to diverse morphologies. RL, by contrast, can learn from suboptimal data, making it promising for multitask settings. However, methods such as zeroth-order planning with world models face scalability issues and become less effective as model size increases, particularly with massive models such as GAIA-1 and UniSim.
Researchers at Georgia Tech and UC San Diego have introduced Policy Learning with Large World Models (PWM), a model-based reinforcement learning (MBRL) algorithm. PWM pre-trains world models on offline data and uses them for first-order gradient policy learning, allowing it to solve tasks with up to 152 action dimensions. This approach outperforms existing methods, achieving up to 27% higher rewards without expensive online planning. PWM emphasizes the utility of smooth, stable gradients over long horizons rather than raw prediction accuracy, demonstrating that efficient first-order optimization yields better policies and faster training than traditional zeroth-order methods.
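To make the idea concrete, below is a minimal, illustrative PyTorch sketch of first-order policy learning through a frozen, pre-trained world model in the spirit of PWM. The network sizes, horizon, and hyperparameters here are assumptions chosen for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch: backpropagating discounted returns through a frozen
# world model to update the policy with first-order gradients.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, HORIZON, GAMMA = 8, 4, 16, 0.99

# Stand-ins for a pre-trained world model: next-state and reward predictors.
dynamics = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ELU(),
                         nn.Linear(64, STATE_DIM))
reward_fn = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ELU(),
                          nn.Linear(64, 1))
for p in list(dynamics.parameters()) + list(reward_fn.parameters()):
    p.requires_grad_(False)  # world model stays frozen during policy learning

policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ELU(),
                       nn.Linear(64, ACTION_DIM), nn.Tanh())
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

def policy_loss(initial_state):
    """Roll the policy out inside the world model and backpropagate the
    discounted return through the model's smooth dynamics (first-order)."""
    state, total_return = initial_state, 0.0
    for t in range(HORIZON):
        action = policy(state)
        sa = torch.cat([state, action], dim=-1)
        total_return = total_return + (GAMMA ** t) * reward_fn(sa).squeeze(-1)
        state = dynamics(sa)  # differentiable transition, no real simulator
    return -total_return.mean()  # maximize return -> minimize its negative

# One illustrative update step from a batch of (imagined) start states.
loss = policy_loss(torch.randn(32, STATE_DIM))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because the world model is differentiable end to end, the policy gradient flows analytically through the imagined rollout, which is what keeps the variance low relative to sampling-based (zeroth-order) alternatives.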
Reinforcement learning is divided into model-based and model-free approaches. Model-free methods such as PPO and SAC dominate real-world applications and employ actor-critic architectures. SAC uses first-order gradients (FoG) for policy learning, which offer low variance but struggle with discontinuities in the objective. In contrast, PPO relies on zeroth-order gradients, which are robust to discontinuities but prone to high variance and slower optimization. Recently, the focus in robotics has shifted toward large multitask models trained by behavioral cloning, such as RT-1 and RT-2 for object manipulation. However, the potential of large models in model-based RL remains underexplored. MBRL methods such as DreamerV3 and TD-MPC2 leverage large world models, but their scalability remains a concern, particularly as models grow to the scale of GAIA-1 and UniSim.
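The first-order versus zeroth-order distinction can be seen on a toy problem. The following hypothetical snippet (not code from either algorithm) compares a pathwise (first-order) gradient with a score-function (zeroth-order) gradient for a one-dimensional Gaussian policy and a smooth reward; both estimate the same quantity, but the zeroth-order one only needs reward values, at the cost of higher variance.

```python
# Toy comparison of first-order vs. zeroth-order policy gradients.
import torch

mu = torch.tensor(0.5, requires_grad=True)    # policy mean (the parameter)
sigma = 0.1
reward = lambda a: -(a - 2.0) ** 2             # smooth, differentiable reward

actions = mu + sigma * torch.randn(4096)       # reparameterized samples

# First-order (pathwise): differentiate reward(a) through the samples.
fog = torch.autograd.grad(reward(actions).mean(), mu, retain_graph=True)[0]

# Zeroth-order (score function): uses only reward values, so it tolerates
# discontinuous rewards but is noisier for the same sample budget.
log_prob = -((actions.detach() - mu) ** 2) / (2 * sigma ** 2)
zog = torch.autograd.grad((log_prob * reward(actions.detach())).mean(), mu)[0]

print(f"first-order grad ~ {fog.item():.3f}, zeroth-order grad ~ {zog.item():.3f}")
```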
The study focuses on discrete-time, infinite-horizon RL settings represented by a Markov decision process (MDP) with states, actions, dynamics, and rewards. The goal is to maximize the discounted sum of rewards accrued by a policy, commonly addressed with actor-critic architectures that approximate state values and optimize policies. In MBRL, additional components such as learned dynamics and reward models, collectively called world models, are used; these models can encode true states into latent representations. Leveraging such world models, PWM optimizes policies using FoG, reducing gradient variance and improving sample efficiency even in complex environments.
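For reference, the objective described above can be written in standard MDP notation (this is the textbook formulation, not notation quoted from the paper):

```latex
% Discounted-return objective for a policy \pi_\theta in an MDP (S, A, P, r, \gamma),
% and the state-value function approximated by the critic.
J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right],
\qquad
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\middle|\, s_0 = s\right].
```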
In evaluating the proposed method, complex control tasks were tackled using the dflex differentiable simulator, covering environments such as Hopper, Ant, Anymal, Humanoid, and a muscle-actuated Humanoid. Comparisons were made with SHAC, which has access to ground-truth simulation dynamics, and TD-MPC2, a model-based method that actively plans at inference time. PWM achieved higher rewards and a smoother optimization landscape than both SHAC and TD-MPC2. Further testing on multitask benchmarks of 30 and 80 environments showed PWM's superior reward performance and faster inference relative to TD-MPC2. Ablation studies highlighted PWM's robustness to rigid contact models and its higher sample efficiency, especially with better-trained world models.
The study introduced PWM as a new approach to MBRL. PWM uses large multitask world models as differentiable physics simulators, leveraging first-order gradients for efficient policy training. Evaluations showed PWM outperforming existing methods, including baselines with access to ground-truth simulation dynamics as well as TD-MPC2. Despite its strengths, PWM relies heavily on extensive pre-existing data for world-model training, limiting its applicability in data-scarce scenarios. Furthermore, while PWM offers efficient policy training, it requires retraining for each new task, posing challenges for rapid adaptation. Future research could explore improvements in world-model training and extend PWM to image-based environments and real-world applications.
Review the Paper and GitHub. All credit for this research goes to the researchers of this project.
Sana Hassan, a Consulting Intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, she brings a fresh perspective to the intersection of AI and real-life solutions.