As computing power and data continue to grow, autonomous agents are becoming increasingly capable. This makes it all the more important for humans to have a say in the policies agents learn and to verify that those policies align with their goals.
Currently, users either 1) hand-design reward functions for the desired behavior or 2) provide large amounts of labeled data. Both strategies present difficulties and are unlikely to be adopted in practice. Agents are vulnerable to reward hacking, which makes it hard to design reward functions that strike a balance between competing objectives. A reward function can instead be learned from annotated examples, but huge amounts of labeled data are needed to capture the subtleties of individual users' tastes and goals, which has proven costly. Moreover, the reward function must be redesigned, or a new data set collected, for every new population of users with different goals.
New research from Stanford University and DeepMind aims to design a system that makes it easy for users to share their preferences, with an interface that is more natural than writing a reward function and a cost-effective way to specify those preferences using just a few examples. Their work uses large language models (LLMs) that have been trained on massive amounts of text data from the Internet and have proven adept at in-context learning from few or no training examples. According to the researchers, LLMs are excellent in-context learners because they have been trained on a data set large enough to incorporate important common-sense insights into human behavior.
The researchers investigate how to employ a prompted LLM as a proxy reward function for training RL agents with end-user-provided data. Through a conversational interface, the proposed method has the user define a goal. The goal can be specified with a few examples or, if it involves a common-knowledge concept such as "versatility", with a single sentence. The prompt and the LLM together define a reward function used to train an RL agent. The trajectory of an RL episode and the user's prompt are fed to the LLM, which outputs a score (e.g., "No", parsed as 0) indicating whether the trajectory satisfies the user's goal; this integer is the reward given to the RL agent. One advantage of using an LLM as a proxy reward function is that users can intuitively specify their preferences through language instead of having to provide dozens of examples of desirable behavior.
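The snippet below is a minimal sketch of that idea, not the authors' code: the user's objective, a few example episodes, and the current episode's transcript are packed into a prompt, and the LLM's Yes/No answer is parsed into an integer reward for the RL agent. The names `llm_proxy_reward`, `fake_llm`, and the prompt wording are assumptions for illustration; in practice `query_llm` would wrap whichever LLM API is available.

```python
# Sketch of an LLM used as a proxy reward function (illustrative only).
# `query_llm` is a hypothetical stand-in for a real LLM API call.

def llm_proxy_reward(objective: str, examples: str, episode: str, query_llm) -> int:
    prompt = (
        f"{examples}\n"
        f"User objective: {objective}\n"
        f"Episode: {episode}\n"
        "Does this episode satisfy the user's objective? Answer Yes or No:"
    )
    answer = query_llm(prompt)  # e.g. "Yes" or "No"
    # Parse the text label into the integer reward the RL algorithm consumes.
    return 1 if answer.strip().lower().startswith("yes") else 0


# Toy stand-in for an actual LLM, used only so the sketch runs end to end.
def fake_llm(prompt: str) -> str:
    return "Yes" if "split the money evenly" in prompt else "No"


if __name__ == "__main__":
    reward = llm_proxy_reward(
        objective="be fair to the other player",
        examples="Example: 'keep everything for yourself' -> No",
        episode="The agent offered to split the money evenly.",
        query_llm=fake_llm,
    )
    print(reward)  # 1 -> the episode is rewarded
```

The returned 0/1 value can then be plugged into any standard RL update (e.g., a policy-gradient step) in place of an environment-defined reward.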
Users report that the proposed agent is much more aligned with their objective than an agent trained with a different objective. Leveraging its prior knowledge of common objectives, the LLM increases the proportion of objective-aligned reward signals generated from zero-shot prompts by an average of 48% for a standard prompt in the matrix-game experiments and 36% for a reworded prompt. In the Ultimatum Game, the DEALORNODEAL negotiation task, and matrix games, the team trains agents using only such prompts. Ten real users took part in the pilot study.
An LLM can recognize common goals and emit reward signals that align with those goals, even in a novel situation. Goal-aligned RL agents can therefore be trained even when the LLM identifies the correct outcome only part of the time. The resulting RL agents tend to be more accurate than those trained directly on labels, because they only need to learn a single correct behavior.
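A toy, self-contained illustration of this point (our own construction, not taken from the paper): even if a proxy reward labels episodes correctly only 80% of the time, a simple bandit-style agent that averages those noisy rewards still settles on the single objective-aligned action, so the learned policy ends up more accurate than the signal it was trained on.

```python
import random

random.seed(0)

ACTIONS = ["aligned", "misaligned"]
LABEL_ACCURACY = 0.8  # assumed: the proxy reward is right 80% of the time

def noisy_proxy_reward(action: str) -> int:
    correct = 1 if action == "aligned" else 0
    # Flip the label with probability 1 - LABEL_ACCURACY.
    return correct if random.random() < LABEL_ACCURACY else 1 - correct

# Estimate each action's value by averaging its noisy rewards.
totals = {a: 0.0 for a in ACTIONS}
counts = {a: 0 for a in ACTIONS}
for _ in range(2000):
    action = random.choice(ACTIONS)  # uniform exploration
    totals[action] += noisy_proxy_reward(action)
    counts[action] += 1

values = {a: totals[a] / counts[a] for a in ACTIONS}
print(values)                       # roughly {'aligned': 0.8, 'misaligned': 0.2}
print(max(values, key=values.get))  # 'aligned' -- chosen every time at deployment
```

Because the noise averages out across episodes, the greedy policy derived from these estimates picks the aligned action essentially always, outperforming the 80%-accurate labeler itself.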
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project. Also, don’t forget to join our 26k+ ML SubReddit, Discord channel, and email newsletter, where we share the latest AI research news, exciting AI projects, and more.
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a strong interest in the applications of artificial intelligence across various fields. She is passionate about exploring new advances in technology and their real-life applications.