Pre-trained Large Language Models (LLMs), such as GPT-3, have demonstrated remarkable abilities in understanding and answering human questions, assisting with coding tasks, and more. However, they frequently generate outputs that differ from what people prefer. Previously, researchers have tried to solve this problem by collecting human preference data and then aligning the pre-trained models through reinforcement learning or instruction tuning, both of which entail a fine-tuning stage. A more appealing goal is to align frozen LLMs, ones that have not undergone any additional training, without requiring any further data.
Recently, a team of researchers has discovered that unaligned LLMs can directly produce responses that match human preferences through a self-improvement process incorporating self-evaluation and rewind mechanisms. In the interest of AI safety, they have introduced Rewindable Auto-regressive INference (RAIN), a novel inference technique that enables pre-trained LLMs to evaluate their own generated text and use the evaluation results to guide backward rewinding and forward generation.
RAIN is notable for operating without any additional data for model alignment; it does away with the need for parameter updates, gradient computation, or training. During the self-evaluation phase, the model receives guidance on which human preferences to align with through a fixed-template prompt, removing the need to repeatedly adjust the initial query.
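To make the mechanism concrete, below is a minimal sketch of a RAIN-style evaluate-then-rewind loop built on a Hugging Face causal LM. The actual paper performs a more elaborate search combining forward generation, backward rewinding, and value updates over sets of tokens; the `EVAL_TEMPLATE` wording, segment size, and 0.5 threshold here are illustrative assumptions, not the authors' exact settings.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed fixed-template self-evaluation prompt (illustrative, not the paper's).
EVAL_TEMPLATE = (
    "Consider the following response:\n{response}\n"
    "Is the response above harmless and helpful? Answer Yes or No.\nAnswer:"
)

def yes_probability(model, tokenizer, text):
    """Probability the frozen model assigns to 'Yes' as the next token."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
    return probs[yes_id].item()

def rain_generate(model, tokenizer, prompt,
                  segment_tokens=32, max_segments=8, threshold=0.5):
    """Generate segment by segment, rewinding any segment the model itself
    judges to be harmful or unhelpful. Purely inference-time: no gradients."""
    response = ""
    for _ in range(max_segments):
        # Forward step: sample a candidate continuation segment.
        inputs = tokenizer(prompt + response, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=segment_tokens,
                             do_sample=True)
        candidate = tokenizer.decode(out[0, inputs.input_ids.shape[1]:],
                                     skip_special_tokens=True)

        # Self-evaluation step: score the draft with the fixed-template prompt.
        score = yes_probability(
            model, tokenizer,
            EVAL_TEMPLATE.format(response=response + candidate))

        if score >= threshold:
            response += candidate  # keep the segment and continue forward
        # else: rewind -- discard the candidate and resample on the next pass
    return response
```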
The experimental results, judged by GPT-4 and by human evaluators, demonstrate RAIN's effectiveness. For instance, on the HH dataset, RAIN keeps the helpfulness rate constant while dramatically boosting the harmlessness rate of LLaMA 30B compared to vanilla inference, from 82% to 97%. The team reports that RAIN even established a new baseline for defense by lowering the attack success rate from 94% to 19% when Vicuna 33B is the target of a notable adversarial attack (LLM-ATTACKS).
RAIN offers a number of benefits over existing methods for aligning Large Language Models (LLMs):
- Universality: The RAIN approach is adaptable and can be applied to a variety of language-generation tasks. It fits naturally into the auto-regressive inference paradigm, which is the norm for many LLMs, so RAIN can be quickly integrated into most current models (see the usage sketch after this list).
- Alignment with Frozen Weights: Unlike some other alignment strategies, such as RLHF, RAIN does not require maintaining extra models or storing gradients and computation graphs. Its memory overhead is minimal, comparable to that of plain auto-regressive inference. With its simple implementation and memory-efficient design, RAIN is a practical option for aligning LLMs with frozen weights, eliminating resource-intensive fine-tuning procedures.
- Learning-free: RAIN does not rely on labeled or unlabeled data or on human annotations. Because it operates in a learning-free manner, it requires neither large datasets nor training. RAIN considerably improves alignment performance across a range of tasks and makes LLMs more resistant to adversarial prompt attacks; when evaluated against a well-known adversarial attack method, it significantly lowers the attack success rate, demonstrating its potency as a defense.
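Because RAIN is learning-free and operates purely at inference time, using it amounts to wrapping an ordinary generation call around a frozen checkpoint. A hypothetical usage of the sketch above (the checkpoint name is a placeholder, and `rain_generate` is the illustrative function defined earlier, not an official API):

```python
# Hypothetical usage of the rain_generate sketch above with a frozen checkpoint.
# No gradients, parameter updates, or extra reward models are involved.
tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # placeholder checkpoint
lm = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")
print(rain_generate(lm, tok, "How should I respond to a rude email?"))
```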
In conclusion, this study introduces RAIN as a technique for aligning LLMs with human preferences without additional data or laborious fine-tuning. It accomplishes this by allowing LLMs to evaluate and improve their own outputs, ultimately yielding better-aligned and safer AI-generated responses.
Check out the Paper. All credit for this research goes to the researchers on this project.
Tanya Malhotra is a final-year undergraduate at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical thinking skills, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.