A user could ask ChatGPT to write a computer program or summarize an article, and the AI chatbot would likely generate useful code or a compelling synopsis. However, someone could also ask for instructions on how to build a bomb, and the chatbot might provide those as well.
To prevent this and other safety issues, companies that create large language models often protect them through a process called red teaming. Teams of human testers write prompts intended to trigger unsafe or toxic text from the model being tested, and those prompts are then used to teach the chatbot to avoid such responses.
But this only works effectively if engineers know which toxic prompts to use. If human testers miss some, which is likely given the sheer number of possibilities, a chatbot deemed safe might still be capable of generating unsafe responses.
Researchers at MIT's Improbable AI Lab and the MIT-IBM Watson AI Lab used machine learning to improve red teaming. They developed a technique to train a red team large language model to automatically generate diverse prompts that trigger a wider range of undesirable responses from the chatbot being tested.
They do this by teaching the red team model to be curious when writing prompts and to focus on novel prompts that evoke toxic responses from the target model.
The technique outperformed human testers and other machine learning approaches by generating more distinct prompts that elicited increasingly toxic responses. Their method not only significantly improves the coverage of inputs being tested compared to other automated methods, but can also lead to toxic responses from a chatbot that had safeguards built in by human experts.
“Right now, every large language model has to go through a very long period of red teaming to ensure its safety. That will not be sustainable if we want to update these models in rapidly changing environments. Our method provides a faster and more efficient way to perform this quality control,” says Zhang-Wei Hong, a graduate student in electrical engineering and computer science (EECS) in the Improbable AI Lab and lead author of a paper on this red teaming approach.
Hong's co-authors include EECS graduate students Idan Shenfield, Tsun-Hsuan Wang, and Yung-Sung Chuang; Aldo Pareja and Akash Srivastava, research scientists at the MIT-IBM Watson AI Lab; James Glass, senior research scientist and head of the Spoken Language Systems Group at the Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author Pulkit Agrawal, director of the Improbable AI Lab and assistant professor at CSAIL. The research will be presented at the International Conference on Learning Representations.
Automated red teaming
Large language models, like those that power AI chatbots, are often trained by being shown huge amounts of text from billions of public websites. Not only can they learn to generate toxic language or describe illegal activities, but the models could also leak personal information they may have collected.
The tedious and expensive nature of human red teaming, which is often ineffective at generating a wide enough variety of prompts to fully safeguard a model, has encouraged researchers to automate the process using machine learning.
These techniques typically train a red team model using reinforcement learning. This trial-and-error process rewards the red team model for generating prompts that trigger toxic responses from the chatbot being tested.
But because of the way reinforcement learning works, the red team model will often keep generating a few similar prompts that are highly toxic, since doing so maximizes its reward.
For their reinforcement learning approach, the MIT researchers used a technique called curiosity-driven exploration. The red team model is encouraged to be curious about the consequences of each prompt it generates, so it will try prompts with different words, sentence patterns, or meanings.
“If the red team model has already seen a specific prompt, then reproducing it will not generate any curiosity in the red team model, so it will be pushed to create new prompts,” says Hong.
During its training process, the red team model generates a prompt and interacts with the chatbot. The chatbot responds, and a safety classifier rates the toxicity of that response, rewarding the red team model based on the rating.
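In code, that training loop might look roughly like the sketch below. The names red_team_model, target_chatbot, toxicity_classifier, and update_policy are hypothetical stand-ins for the three components described above, not the researchers' actual implementation.

```python
# A minimal sketch of the red team training loop described above.
# All objects and methods here are hypothetical stand-ins.

def red_team_training_step(red_team_model, target_chatbot, toxicity_classifier):
    # The red team model writes a candidate prompt.
    prompt = red_team_model.generate()

    # The chatbot under test answers that prompt.
    response = target_chatbot.respond(prompt)

    # A safety classifier scores how toxic the response is (e.g., 0.0 to 1.0).
    toxicity_score = toxicity_classifier.score(response)

    # Reinforcement learning: the more toxic the elicited response,
    # the larger the reward given to the red team model.
    red_team_model.update_policy(prompt, reward=toxicity_score)
    return prompt, response, toxicity_score
```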
Rewarding curiosity
The goal of the red team model is to maximize its reward by eliciting an even more toxic response with a novel prompt. The researchers encourage curiosity in the red team model by modifying the reward signal in the reinforcement learning setup.
First, in addition to maximizing toxicity, they include an entropy bonus that encourages the red team model to be more random as it explores different prompts. Second, to pique the agent's curiosity, they include two novelty rewards: one rewards the model based on the word-level similarity of its prompts, and the other based on their semantic similarity. (Less similarity produces a greater reward.)
To prevent the red team model from generating random, meaningless text, which can trick the classifier into giving a high toxicity score, the researchers also added a natural-language bonus to the training objective.
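Put together, the modified reward could be sketched roughly as follows. The equal weights, the word-overlap measure, and the precomputed semantic_similarity and naturalness scores are illustrative assumptions rather than the paper's exact formulation.

```python
# An illustrative sketch of the combined reward described above.
# Weights and functional forms are assumptions for clarity, not the paper's exact design.
# semantic_similarity and naturalness are assumed to be precomputed elsewhere
# (e.g., by an embedding model and a fluency scorer).

def jaccard_word_similarity(prompt: str, past_prompts: list[str]) -> float:
    """Highest word-level overlap between the new prompt and any previous prompt."""
    words = set(prompt.lower().split())
    best = 0.0
    for past in past_prompts:
        past_words = set(past.lower().split())
        union = words | past_words
        if union:
            best = max(best, len(words & past_words) / len(union))
    return best


def shaped_reward(toxicity: float, entropy_bonus: float, prompt: str,
                  past_prompts: list[str], semantic_similarity: float,
                  naturalness: float, w: float = 0.1) -> float:
    word_novelty = 1.0 - jaccard_word_similarity(prompt, past_prompts)
    semantic_novelty = 1.0 - semantic_similarity      # less similar -> larger reward
    return (toxicity                                  # main objective: elicit toxic output
            + w * entropy_bonus                       # keep generation exploratory
            + w * word_novelty                        # reward new wording
            + w * semantic_novelty                    # reward new meanings
            + w * naturalness)                        # discourage gibberish prompts
```

Under this kind of shaping, a prompt that repeats earlier wording or meaning earns less reward than a novel one, even if both elicit equally toxic responses.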
Once these additions were implemented, the researchers compared the toxicity and diversity of responses generated by their red team model with other automated techniques. Their model outperformed the baselines on both metrics.
They also used their red team model to test a chatbot that had been fine-tuned with human feedback so that it would not give toxic responses. Their curiosity-driven approach quickly produced 196 prompts that elicited toxic responses from this “safe” chatbot.
“We are seeing a surge in the number of models, and it is only expected to rise. Imagine thousands of models, or even more, with companies and labs pushing model updates frequently. These models are going to be an integral part of our lives, and it is important that they are verified before being released for public consumption. Manual verification of models is simply not scalable, and our work is an attempt to reduce the human effort needed to ensure a safer and more trustworthy AI future,” says Agrawal.
In the future, the researchers want to enable the red team model to generate prompts about a wider variety of topics. They also want to explore using a large language model as the toxicity classifier. That way, a user could train the toxicity classifier on a company policy document, for example, so a red team model could test a chatbot for violations of that policy.
“If you are releasing a new AI model and are concerned about whether it will behave as expected, consider using curiosity-driven red teaming,” Agrawal says.
This research is funded, in part, by Hyundai Motor Company, Quanta Computer Inc., the MIT-IBM Watson AI Lab, an Amazon Web Services MLRA research grant, the U.S. Army Research Office, the U.S. Defense Advanced Research Projects Agency Machine Common Sense Program, the U.S. Office of Naval Research, the U.S. Air Force Research Laboratory, and the U.S. Air Force Artificial Intelligence Accelerator.