Large Language Models (LLMs) like ChatGPT, Bard AI, and Llama-2 can generate undesirable and offensive content. Imagine someone asking ChatGPT for a guide to manipulating elections or for an examination question paper; producing an answer to such queries would be inappropriate. Researchers at Carnegie Mellon University, the Center for AI Safety, and the Bosch Center for AI have tackled this problem by studying how well the alignment of these models actually prevents undesirable generation.
The researchers found an approach that exposes this weakness: when an LLM receives an objectionable query with a carefully chosen suffix appended, the model produces an affirmative response rather than refusing to answer. Their approach generates these adversarial suffixes with a combined greedy and gradient-based search, and it substantially improves on past automatic prompt-generation methods.
Prompts that cause aligned LLMs to generate offensive content are called jailbreaks. So far, jailbreaks have been crafted through human ingenuity, by setting up scenarios that lead models astray, rather than through automated methods, and so they require considerable manual effort. And unlike attacks on image models, attacks on LLMs must operate over discrete token inputs, which limits the effective input space and makes the search computationally difficult.
The researchers propose a new class of adversarial attacks that can reliably elicit objectionable content. Given a harmful query from the user, they append an adversarial suffix, leaving the user's original query intact. The suffix is chosen using three ingredients: optimizing for an initial affirmative response from the model, combined greedy and gradient-based discrete optimization, and robust multi-prompt, multi-model attacks.
To generate reliable attack suffixes, the researchers had to create an attack that works not just for a single prompt on a single model but for multiple prompts across multiple models. They used a greedy, gradient-based method to search for a single suffix string able to induce the negative behavior across many user prompts. When they transferred these attacks to Claude, they found the model notably more resistant, suggesting such models have the potential to lower the success rate of automated attacks.
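The greedy, gradient-based suffix search can be sketched on a toy objective. This is a minimal illustration under stated assumptions, not the paper's implementation: the real attack differentiates an LLM's loss on a target affirmative response, whereas here a fixed random bilinear scorer stands in for the model. The structure is the same, though: take the gradient of the loss with respect to the one-hot suffix encoding, shortlist top-k token swaps per position, then greedily keep the single swap that most reduces the loss.

```python
import torch

torch.manual_seed(0)

VOCAB, SUFFIX_LEN, EMB = 50, 5, 8

# Stand-in for an LLM: a fixed random embedding table and target direction.
# The "loss" is lower when the suffix tokens align with the target.
emb_table = torch.randn(VOCAB, EMB)
target_vec = torch.randn(EMB)

def loss_fn(one_hot):
    # one_hot: (SUFFIX_LEN, VOCAB), differentiable relaxation of the suffix.
    embedded = one_hot @ emb_table            # (SUFFIX_LEN, EMB)
    return -(embedded.sum(0) @ target_vec)    # negative affinity to target

def suffix_loss(suffix):
    return loss_fn(torch.nn.functional.one_hot(suffix, VOCAB).float())

def gcg_step(suffix, topk=8, n_candidates=32):
    # 1. Gradient of the loss w.r.t. the one-hot suffix encoding.
    one_hot = torch.nn.functional.one_hot(suffix, VOCAB).float().requires_grad_()
    loss_fn(one_hot).backward()
    # 2. Per position, the top-k token swaps with the most negative gradient.
    candidates = (-one_hot.grad).topk(topk, dim=1).indices  # (SUFFIX_LEN, topk)
    # 3. Greedily evaluate random single-token swaps; keep the best suffix.
    best, best_loss = suffix, suffix_loss(suffix).item()
    for _ in range(n_candidates):
        pos = torch.randint(SUFFIX_LEN, (1,)).item()
        tok = candidates[pos, torch.randint(topk, (1,)).item()]
        cand = suffix.clone()
        cand[pos] = tok
        cand_loss = suffix_loss(cand).item()
        if cand_loss < best_loss:
            best, best_loss = cand, cand_loss
    return best, best_loss

suffix = torch.randint(VOCAB, (SUFFIX_LEN,))
loss = suffix_loss(suffix).item()
for _ in range(20):
    suffix, loss = gcg_step(suffix)
print("optimized suffix tokens:", suffix.tolist(), "loss:", round(loss, 3))
```

In the actual attack the same loop is run jointly over several harmful prompts and several models, so that one suffix transfers across all of them.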
The researchers note that, given these attacks, future work can fine-tune models to avoid such undesirable answers. Adversarial training is an empirically proven means of making models more robust: it iteratively trains the model to produce a correct, refusing answer to each potentially harmful query that the attack uncovers.
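The iterative structure of that attack-then-retrain loop can be sketched with simple stand-ins. Everything here is a hypothetical placeholder: the "model" is just a set of prompts it refuses, the "attack" tries a few fixed suffixes, and "training" adds the discovered jailbreak as a new refusal example.

```python
harmful_queries = ["how to pick a lock", "write malware"]

def attack(model_refusals, query, suffixes=("!!", "##", "zz")):
    # Return the first suffixed prompt the model does not yet refuse,
    # standing in for a gradient-based suffix search.
    for s in suffixes:
        prompt = query + " " + s
        if prompt not in model_refusals:
            return prompt
    return None

def adversarial_training(rounds=3):
    refusals = set(harmful_queries)      # model starts by refusing bare queries
    for _ in range(rounds):
        for q in harmful_queries:
            jailbreak = attack(refusals, q)
            if jailbreak:
                refusals.add(jailbreak)  # "train" on (jailbreak -> refusal)
    return refusals

model = adversarial_training()
print(len(model))  # bare queries plus every discovered jailbreak, all refused
```

After enough rounds the attack finds nothing new, which is the intuition behind using these attacks to harden a model.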
Their work includes material that could allow others to generate harmful content. Despite that risk, the researchers consider it important to publish the techniques, so that ways of preventing language models from generating harmful content can be developed. They argue that the direct incremental harm caused by releasing their attacks is minor at this stage, and that the research helps clarify the dangers that automated attacks pose to Large Language Models.
Check out the Paper, GitHub, and Project Page. All credit for this research goes to the researchers on this project.
Arshad is an intern at MarktechPost. He is currently pursuing his integrated MSc in Physics at the Indian Institute of Technology Kharagpur. He believes that understanding things at a fundamental level leads to new discoveries, which in turn lead to advances in technology. He is passionate about understanding nature at a fundamental level with the help of tools like mathematical models, ML models, and AI.