Successful reinforcement learning (RL) applications include difficult tasks such as plasma control, molecular design, game playing, and robot control. Despite this potential, traditional RL is extremely sample-inefficient: learning a task that a human could pick up in a few attempts can take an agent hundreds of thousands of episodes.
Studies point to the following reasons for this sample inefficiency:
- Typical RL cannot condition on complex priors, such as human common sense or extensive prior experience.
- Conventional RL cannot tailor each exploration step to be as informative as possible; instead, it improves by repeatedly reinforcing previously rewarded behaviors.
- Both traditional RL and meta-RL use the same policy to explore (collect data to improve the policy) and to exploit (earn a high reward within each episode), as the sketch following this list illustrates.
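To make the last point concrete, here is a minimal sketch of a conventional single-policy learner on a toy ten-armed Gaussian bandit. The epsilon-greedy rule and all parameters are illustrative assumptions, not drawn from the paper; the point is that one policy handles both information gathering and reward earning, so its exploration is never tailored to be informative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 10-armed Gaussian bandit: each arm's reward ~ N(mu_k, 1).
arm_means = rng.normal(0.0, 1.0, size=10)

# A single epsilon-greedy policy does both jobs: it explores (random arm
# with probability epsilon) and exploits (greedy arm otherwise), so every
# exploratory pull is uniform noise rather than a targeted, informative probe.
q_estimates = np.zeros(10)
pull_counts = np.zeros(10)
epsilon = 0.1

for _ in range(1_000):
    if rng.random() < epsilon:
        arm = int(rng.integers(10))        # explore: uniformly random arm
    else:
        arm = int(np.argmax(q_estimates))  # exploit: current best guess
    reward = rng.normal(arm_means[arm], 1.0)
    pull_counts[arm] += 1
    # Incremental mean update of the value estimate for the pulled arm.
    q_estimates[arm] += (reward - q_estimates[arm]) / pull_counts[arm]

print("best arm:", int(np.argmax(arm_means)),
      "| most-pulled arm:", int(np.argmax(pull_counts)))
```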
To address these shortcomings, researchers from the University of British Columbia, the Vector Institute, and Canada CIFAR AI Chair introduce First-Explore, a lightweight meta-RL framework that learns a pair of policies: one for intelligent exploration and one for intelligent exploitation. First-Explore enables sample-efficient, in-context, human-level meta-RL learning in hard-to-explore unknown domains, including adversarial ones that require sacrificing reward in order to explore effectively.
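The following toy sketch illustrates that division of labor. The function names and hand-written rules are assumptions made for illustration, not the authors' implementation, which meta-trains both policies end to end.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative 10-armed Gaussian bandit task; each arm's reward ~ N(mu_k, 1).
arm_means = rng.normal(0.0, 1.0, size=10)

def pull(arm):
    return rng.normal(arm_means[arm], 1.0)

def explore_policy(n_arms=10):
    # Pure information gathering: try every arm once and record the outcomes,
    # without worrying about how much reward this phase earns.
    return [(arm, pull(arm)) for arm in range(n_arms)]

def exploit_policy(context, n_pulls=100):
    # Condition on the collected context and commit to the best-looking arm.
    best_arm = max(context, key=lambda entry: entry[1])[0]
    return sum(pull(best_arm) for _ in range(n_pulls)) / n_pulls

context = explore_policy()              # exploration episodes build the context
mean_exploit_reward = exploit_policy(context)
print(f"average exploit-phase reward: {mean_exploit_reward:.2f}")
```

The point of the separation is that the explore policy is free to ignore per-episode reward entirely, because only the exploit policy is judged on the reward it earns.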
Developing algorithms that reach human-level performance on previously unseen hard-exploration domains is one of the major hurdles in artificial general intelligence (AGI) development. The team suggests that combining First-Explore with a curriculum, such as the AdA curriculum, could be a step in that direction. They believe such progress would help realize AGI's great potential benefits, provided the genuine and serious safety issues associated with AGI development are adequately addressed.
Devoting computational resources to domain randomization up front allows First-Explore to learn intelligent exploration strategies, such as exhaustively trying each of the first ten actions and then prioritizing re-sampling those with high rewards. Once trained, this exploration strategy can be remarkably efficient at learning new tasks. Since standard RL appears to succeed despite conflating the two roles, one may wonder how serious the problem of exploring through exploitation really is. The researchers argue that the gap becomes noticeable when one wants to explore and exploit intelligently, with human-level adaptation, on complex tasks.
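A rough sketch of that kind of two-phase strategy on a toy bandit might look as follows; the budget split and the prioritization rule here are assumptions for illustration, not the learned policy itself.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative 10-armed Gaussian bandit; each arm's reward ~ N(mu_k, 1).
arm_means = rng.normal(0.0, 1.0, size=10)

# Phase 1: exhaustively try each of the ten actions once.
observed = {arm: [rng.normal(arm_means[arm], 1.0)] for arm in range(10)}

# Phase 2: spend the remaining exploration budget re-sampling whichever
# action currently has the highest observed mean reward.
for _ in range(20):
    best_so_far = max(observed, key=lambda arm: np.mean(observed[arm]))
    observed[best_so_far].append(rng.normal(arm_means[best_so_far], 1.0))

estimates = {arm: float(np.mean(rewards)) for arm, rewards in observed.items()}
print("true best arm:     ", int(np.argmax(arm_means)))
print("estimated best arm:", max(estimates, key=estimates.get))
```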
Even on simple domains such as the multi-armed Gaussian bandit, First-Explore performs better, and it dramatically improves performance on sacrificial-exploration domains such as the Dark Prize Room environment (where the average expected prize value is negative). Findings from both problem domains highlight the importance of understanding the differences between optimal exploitation and optimal exploration for effective in-context learning, specifically the extent to which each strategy covers the state or action space and whether or not it helps achieve a high reward.
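A toy construction of such a sacrificial-exploration setting, assuming a bandit-style layout rather than the paper's exact environment, shows why a policy trained only to maximize per-episode reward never finds the rare prize:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy sacrificial-exploration setting (an assumption for illustration, not
# the paper's exact environment): one safe action worth ~0, most actions
# mildly costly, and a single rare action holding a large prize. The
# expected value of an untried action, (18 * -0.5 + 5.0) / 19, is negative.
arm_means = np.full(20, -0.5)
arm_means[0] = 0.0                      # safe "stay put" action
arm_means[rng.integers(1, 20)] = 5.0    # rare, valuable prize

def pull(arm):
    return rng.normal(arm_means[arm], 0.5)

# A purely exploitative policy prefers the safe action, because exploring
# has negative expected value, so it never discovers the prize.
myopic_total = sum(pull(0) for _ in range(100))

# A sacrificial explorer sweeps every other action once (paying a cost),
# then exploits the best one it found for the rest of the budget.
sweep = [(arm, pull(arm)) for arm in range(1, 20)]
best_arm = max(sweep, key=lambda entry: entry[1])[0]
explorer_total = sum(r for _, r in sweep) + sum(pull(best_arm) for _ in range(81))

print(f"myopic total reward:   {myopic_total:7.1f}")
print(f"explorer total reward: {explorer_total:7.1f}")
```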
Check out the Paper and GitHub link for more details.