Reward functions play a crucial role in reinforcement learning (RL) systems, but their design presents a persistent challenge: balancing the simplicity of task specification against the ease of optimization. The conventional approach of using binary rewards offers a simple task definition but creates optimization difficulties because it provides a poor learning signal. Intrinsic rewards have emerged as a way to aid policy optimization, yet crafting them demands extensive task-specific knowledge and experience, placing substantial demands on human experts who must carefully balance multiple factors to produce reward functions that accurately represent the desired task and allow efficient learning.
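As a rough formalization (our notation, not taken from the paper), an intrinsic reward is typically added on top of the sparse task reward, so the agent maximizes a shaped signal of the form

```latex
r_t = r^{\text{ext}}_t + \lambda \, r^{\text{int}}_t(o_t), \qquad r^{\text{ext}}_t \in \{0, 1\},
```

where the binary extrinsic term defines the task, the learned intrinsic term supplies the denser learning signal, and the coefficient λ balances the two.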
Recent approaches have used large language models (LLMs) to automate reward design from natural language task descriptions, following two main methodologies. The first generates reward function code with an LLM and has demonstrated success in continuous control tasks; however, it requires access to the environment source code or detailed parameter descriptions and struggles with high-dimensional state representations. The second generates reward values directly with an LLM, exemplified by methods such as Motif, which ranks observation captions using LLM preferences; however, it requires a pre-existing dataset of captioned observations and involves a time-consuming three-stage process.
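To make the second methodology concrete, here is a minimal, hypothetical sketch (not Motif's or ONI's actual code): an LLM states which of two observation captions it prefers, and a small reward model is fit to those preferences with a Bradley-Terry style loss. The bag-of-words featurization and network shape are our own simplifications.

```python
# Sketch only: turning pairwise LLM preferences over observation captions
# into a scalar reward model via a Bradley-Terry style loss.
import torch
import torch.nn as nn

class CaptionRewardModel(nn.Module):
    """Maps a (e.g., bag-of-words) caption encoding to a scalar reward."""
    def __init__(self, vocab_size: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vocab_size, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, caption_features: torch.Tensor) -> torch.Tensor:
        return self.net(caption_features).squeeze(-1)

def preference_loss(model, caption_a, caption_b, llm_prefers_a: torch.Tensor):
    """Bradley-Terry loss: push r(a) above r(b) whenever the LLM preferred caption a."""
    r_a, r_b = model(caption_a), model(caption_b)
    p_a = torch.sigmoid(r_a - r_b)  # model's probability that a is preferred
    return nn.functional.binary_cross_entropy(p_a, llm_prefers_a.float())
```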
Researchers at Meta, the University of Texas at Austin, and UCLA have proposed ONI, a distributed architecture that simultaneously learns an RL policy and an intrinsic reward function from LLM feedback. The method uses an asynchronous LLM server to annotate the agent's collected experience, and these annotations are then distilled into an intrinsic reward model. The approach explores several algorithmic choices for reward modeling, including hashing, classification, and ranking models, to investigate their effectiveness in addressing sparse-reward problems. This unified methodology achieves strong performance on challenging sparse-reward tasks in the NetHack Learning Environment, operating solely on the agent's collected experience without requiring external datasets.
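An illustrative sketch (our naming, not the paper's code) of how these three reward-modeling flavors could each map an observation caption to an intrinsic reward; the callables passed in are placeholders for whatever classifier or preference-trained reward model is used:

```python
from typing import Callable, Dict

# 1) Retrieval / hashing: look up the LLM's stored annotation for a caption seen
#    before; captions the LLM has not annotated yet get no intrinsic reward.
def retrieval_reward(caption: str, llm_labels: Dict[str, float]) -> float:
    return llm_labels.get(caption, 0.0)

# 2) Classification: a learned classifier predicts the LLM's label, so it can
#    generalize to captions the LLM has never annotated.
def classification_reward(caption: str, classifier: Callable[[str], float]) -> float:
    return classifier(caption)  # e.g., probability the LLM would mark this caption as interesting

# 3) Ranking: a reward model trained on pairwise LLM preferences assigns a scalar
#    score whose ordering reflects those preferences (see the sketch above).
def ranking_reward(caption: str, reward_model: Callable[[str], float]) -> float:
    return reward_model(caption)
```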
ONI is built from several key components on top of the Sample Factory library and its asynchronous proximal policy optimization (APPO) variant. The system runs 480 concurrent environment instances on a Tesla A100-80GB GPU with 48 CPUs, achieving approximately 32,000 environment interactions per second. The architecture incorporates four crucial components: an LLM server on a separate node, an asynchronous process that transmits observation captions to the LLM server via HTTP requests, a hash table that stores captions and their LLM annotations, and the intrinsic reward-model learning code. This asynchronous design retains 80-95% of the original system's throughput, processing 30,000 environment interactions per second without reward-model training and 26,000 interactions per second when a ranking-based reward model is trained.
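A hedged sketch of that asynchronous annotation path: a background thread ships unlabeled observation captions to a remote LLM server over HTTP and caches the annotations in a hash table, so the RL loop never blocks on the LLM. The endpoint URL and JSON schema below are placeholders, not the actual ONI interface.

```python
import queue
import threading
import requests

LLM_SERVER_URL = "http://llm-node:8000/annotate"   # hypothetical endpoint

annotation_cache: dict[str, float] = {}            # hash table: caption -> LLM annotation
pending: "queue.Queue[str]" = queue.Queue()        # captions awaiting annotation

def annotator_loop():
    """Runs on a separate thread: batches captions and queries the LLM server over HTTP."""
    while True:
        batch = [pending.get()]                     # block until at least one caption arrives
        while not pending.empty() and len(batch) < 32:
            batch.append(pending.get())
        resp = requests.post(LLM_SERVER_URL, json={"captions": batch}, timeout=60)
        for caption, score in zip(batch, resp.json()["scores"]):
            annotation_cache[caption] = score       # later consumed by reward-model training

threading.Thread(target=annotator_loop, daemon=True).start()

def intrinsic_reward(caption: str) -> float:
    """Called inside the RL loop: enqueue unseen captions, never wait for the LLM."""
    if caption not in annotation_cache:
        pending.put(caption)
        return 0.0
    return annotation_cache[caption]
```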
Experimental results demonstrate significant improvements across multiple tasks in the NetHack Learning Environment. While the extrinsic-reward-only agent performs adequately on the dense Score task, it fails on the sparse-reward tasks. 'ONI classification' matches or approaches the performance of existing methods such as Motif on most tasks, achieving this without previously collected data or additional dense reward functions. Among the ONI variants, 'ONI retrieval' shows strong performance, while 'ONI classification' consistently improves on it thanks to its ability to generalize to unseen messages. Furthermore, 'ONI ranking' achieves the highest experience levels and leads on the other performance metrics in settings without external rewards.
In summary, the researchers presented ONI, a significant advance in RL: a distributed system that learns intrinsic rewards and agent behaviors simultaneously and online. It demonstrates state-of-the-art performance on challenging sparse-reward tasks in the NetHack Learning Environment while eliminating the pre-collected datasets and auxiliary dense reward functions that earlier methods required. This work establishes a foundation for more autonomous intrinsic reward methods that learn exclusively from the agent's own experience, operate independently of external dataset constraints, and integrate effectively with high-performance reinforcement learning systems.
Check out the Paper and GitHub page. All credit for this research goes to the researchers of this project.
Sajjad Ansari is a final-year student at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. His goal is to articulate complex AI concepts in a clear and accessible way.