Artificial intelligence has seen notable advances with the development of large language models (LLMs). Thanks to techniques such as reinforcement learning from human feedback (RLHF), these models have significantly improved their performance on a wide range of tasks. The challenge, however, lies in learning to produce novel content based solely on human feedback.
One of the main challenges in advancing LLMs is optimizing how they learn from human feedback. This feedback is obtained through a process in which models are presented with prompts and generate responses, and human evaluators indicate which responses they prefer. The goal is to refine the models' outputs so that they align more closely with human preferences. However, this method requires many interactions, which poses a bottleneck to rapid model improvement.
Current methodologies for LLM training rely on passive exploration, where models generate responses to predefined prompts without actively seeking the feedback that would be most informative. Alternative exploration schemes include Boltzmann exploration, infomax, and Thompson sampling, in which queries are generated based on uncertainty estimates represented by an epistemic neural network (ENN); the choice of exploration scheme is critical, and double Thompson sampling has been shown to be effective at generating high-performing queries. While passive methods have been instrumental in the early stages of LLM development, they are inefficient, often requiring an impractical number of human interactions to achieve notable improvements.
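For concreteness, Boltzmann exploration chooses which responses to compare by sampling them in proportion to a softmax of estimated rewards at some temperature. Below is a minimal sketch of that idea in Python; the function name and arguments are illustrative, not taken from the paper.

```python
import numpy as np

def boltzmann_pair(reward_estimates, temperature=1.0, rng=np.random):
    """Sample two distinct candidate responses with probability proportional
    to exp(reward / temperature) -- a minimal sketch of Boltzmann exploration.

    reward_estimates: 1-D array of point-estimate rewards, one per candidate.
    """
    logits = np.asarray(reward_estimates, dtype=float) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Draw two distinct candidate indices to form the query pair.
    i, j = rng.choice(len(probs), size=2, replace=False, p=probs)
    return int(i), int(j)
```

At low temperatures the sampling concentrates on the highest-estimated responses; at high temperatures it approaches uniform (passive-like) selection.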
Researchers at Google DeepMind and Stanford University have introduced a novel approach to active exploration, using double Thompson sampling and ENNs for query generation. This method allows the model to actively seek the feedback that is most informative for its learning, significantly reducing the number of queries required to reach high performance levels. The ENN provides uncertainty estimates that guide the exploration process, allowing the model to make more informed decisions about which queries to present for feedback.
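Conceptually, double Thompson sampling draws two plausible reward functions from the ENN and pairs up their preferred responses, so the query reflects genuine disagreement in the model's beliefs. The following is a hedged sketch, assuming a hypothetical `enn_sample_fn` helper (not from the paper) that returns one sampled reward per candidate response:

```python
import numpy as np

def double_thompson_sampling_query(candidate_features, enn_sample_fn, max_tries=20):
    """Pick a pair of responses via double Thompson sampling (sketch).

    enn_sample_fn(features, seed) is assumed to return one sampled reward per
    candidate, drawn from the ENN's posterior over reward functions (e.g. the
    output of a randomly chosen ensemble member).
    """
    # First sample: one plausible reward function picks its favourite response.
    rewards_a = enn_sample_fn(candidate_features, seed=0)
    first = int(np.argmax(rewards_a))
    # Second sample: keep drawing reward functions until a different response
    # wins, so the pair captures epistemic disagreement.
    for seed in range(1, max_tries + 1):
        rewards_b = enn_sample_fn(candidate_features, seed=seed)
        second = int(np.argmax(rewards_b))
        if second != first:
            return first, second
    # Fallback: if every sample agrees, pair the winner with the runner-up.
    runner_up = int(np.argsort(rewards_a)[-2])
    return first, runner_up
```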
In the experimental setup, agents generate responses to 32 prompts, forming queries that are evaluated by a preference simulator. The resulting feedback is used to refine their reward models at the end of each epoch. Agents explore the response space by selecting the most informative pairs from a pool of 100 candidates, using either a multi-layer perceptron (MLP) architecture with two hidden layers of 128 units each, or an ensemble of 10 such MLPs as the epistemic neural network (ENN).
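Given that description, the ENN can be approximated as an ensemble of ten reward MLPs whose disagreement serves as an uncertainty estimate. The sketch below, in PyTorch with illustrative class names (the paper's exact implementation may differ), follows that architecture:

```python
import torch
import torch.nn as nn

class RewardMLP(nn.Module):
    """One ensemble member: an MLP with two hidden layers of 128 units,
    mapping a response embedding to a scalar reward."""
    def __init__(self, input_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

class EnsembleENN(nn.Module):
    """Epistemic neural network as an ensemble of 10 reward MLPs: indexing a
    member gives one plausible reward function, and disagreement across
    members serves as the uncertainty estimate."""
    def __init__(self, input_dim, num_members=10):
        super().__init__()
        self.members = nn.ModuleList([RewardMLP(input_dim) for _ in range(num_members)])

    def forward(self, x, index=None):
        if index is not None:
            # One posterior sample, e.g. for Thompson-style query selection.
            return self.members[index](x)
        rewards = torch.stack([m(x) for m in self.members])  # [members, batch]
        return rewards.mean(dim=0), rewards.std(dim=0)       # mean and uncertainty
```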
The results highlight the effectiveness of double Thompson sampling (double TS) over other exploration methods such as Boltzmann exploration and infomax, especially in its use of uncertainty estimates to improve query selection. While Boltzmann exploration is promising at lower temperatures, double TS consistently outperforms the alternatives by making better use of the uncertainty estimates from the ENN reward model. This approach accelerates the learning process and demonstrates how efficient exploration can dramatically reduce the volume of human feedback required, marking a significant advance in the training of large language models.
In conclusion, this research shows the potential of efficient exploration to overcome the limitations of traditional training methods. The team has opened new avenues for rapid and efficient model improvement by leveraging advanced exploration algorithms and uncertainty estimates. This approach promises to accelerate innovation in LLMs and highlights the importance of optimizing the learning process for the broader advancement of artificial intelligence.
Check out the Paper. All credit for this research goes to the researchers of this project.
Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in materials science, he is exploring new advances and creating opportunities to contribute.