Artificial intelligence (AI) systems are rigorously tested before release to determine whether they can be misused for dangerous activities such as bioterrorism, manipulation, or automated cybercrime. This is especially important for powerful frontier AI systems, which are trained to refuse requests that could cause harm. In contrast, less capable open-source models typically have weaker refusal mechanisms that can be overcome with additional fine-tuning.
In recent work, researchers at the University of California, Berkeley have shown that these safeguards are not enough when models are evaluated only in isolation. Although each model may appear safe on its own, combinations of models can be abused by adversaries. The attack relies on a tactic known as task decomposition, which splits a difficult malicious task into smaller subtasks. The subtasks are then assigned to different models: capable frontier models handle the benign but difficult subtasks, while weaker models with laxer safety precautions handle the malicious but easy ones.
To demonstrate this, the team formalized a threat model in which an adversary uses a set of AI models to try to produce a harmful output, such as a malicious Python script. The adversary chooses models and queries them iteratively until the intended harmful output is obtained. Success therefore means the adversary has combined the efforts of multiple models to produce a damaging result.
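A minimal sketch of this threat model (not the authors' code) might look like the following, assuming each model is simply a prompt-to-completion function and that the adversary's query strategy and success check are supplied separately:

```python
# Sketch of the combination threat model: an adversary routes prompts to a
# collection of models and wins if the combined transcript yields the harmful
# output. All names here are illustrative assumptions, not the paper's code.

from typing import Callable

Model = Callable[[str], str]          # prompt -> completion
Transcript = list[tuple[str, str]]    # (prompt, completion) pairs

def run_attack(
    models: dict[str, Model],
    next_query: Callable[[Transcript], tuple[str, str]],  # adversary picks (model name, prompt)
    is_success: Callable[[Transcript], bool],             # e.g. "does the generated script work?"
    max_turns: int = 10,
) -> bool:
    """Iteratively query the chosen models; succeed if the combined
    transcript produces the intended harmful result."""
    transcript: Transcript = []
    for _ in range(max_turns):
        name, prompt = next_query(transcript)
        completion = models[name](prompt)
        transcript.append((prompt, completion))
        if is_success(transcript):
            return True
    return False
```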
The team studied both manual and automated task decomposition. In manual decomposition, a human works out how to break the task into manageable pieces. For tasks that are too complicated to decompose by hand, the team used automated decomposition, which proceeds in three steps: the weak model proposes related benign subtasks, the strong model solves them, and the weak model then uses the solutions in context to carry out the original malicious task.
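A rough sketch of this automated pipeline, assuming the same prompt-to-completion interface as above and with illustrative prompt wording that is not taken from the paper, could look like this:

```python
# Sketch of automated task decomposition: the weak model proposes benign
# subtasks, the strong model solves them, and the weak model composes the
# solutions into the original (malicious) task. Prompts are assumptions.

from typing import Callable

Model = Callable[[str], str]  # prompt -> completion

def automated_decomposition(weak: Model, strong: Model, task: str, n_subtasks: int = 3) -> str:
    # Step 1: the weak model proposes related, benign-looking subtasks.
    proposal = weak(
        f"List {n_subtasks} harmless subtasks whose solutions would help with this task:\n{task}"
    )
    subtasks = [line.strip() for line in proposal.splitlines() if line.strip()][:n_subtasks]

    # Step 2: the strong, safety-aligned model solves each benign subtask.
    solutions = [strong(subtask) for subtask in subtasks]

    # Step 3: the weak model uses the solutions in context to finish the original task.
    context = "\n\n".join(solutions)
    return weak(f"Using the following reference solutions:\n{context}\n\nNow complete:\n{task}")
```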
The results show that combining models can greatly increase the success rate of producing harmful outputs compared with using individual models alone. For example, on the task of generating vulnerable code, combining Llama 2 70B and Claude 3 Opus achieved a 43% success rate, while neither model exceeded 3% on its own.
The team also found that the likelihood of misuse scales with the quality of both the weakest and the strongest model, which implies that the risk from combining models will grow as frontier models improve. The potential for misuse could rise further with other decomposition strategies, such as training the weak model to exploit the strong model with reinforcement learning, or using the weak model as a general agent that repeatedly calls the strong model.
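To illustrate that last idea, a minimal agent loop might look like the sketch below; the prompts and the "ASK:"/"DONE:" convention are assumptions made for this illustration, not part of the paper:

```python
# Sketch of the "weak model as agent" variant: the weak model decides what
# benign question to ask the strong model at each step, accumulates the
# answers, and finishes the task itself once it has enough material.

from typing import Callable

Model = Callable[[str], str]  # prompt -> completion

def weak_agent(weak: Model, strong: Model, task: str, max_steps: int = 5) -> str:
    notes = ""
    for _ in range(max_steps):
        decision = weak(
            f"Task: {task}\nNotes so far:\n{notes}\n"
            "Reply 'ASK: <benign question>' to consult the assistant, "
            "or 'DONE: <final answer>' if you can finish."
        )
        if decision.startswith("DONE:"):
            return decision[len("DONE:"):].strip()
        question = decision.removeprefix("ASK:").strip()
        notes += "\n" + strong(question)  # the strong model only sees the benign question
    return weak(f"Task: {task}\nNotes:\n{notes}\nGive your best final answer.")
```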
In conclusion, the study highlights the need for continued red-teaming, including experimenting with different combinations of AI models to uncover potential misuse risks. Developers should carry out this testing throughout a model's deployment lifecycle, since updates can create new vulnerabilities.
Check out the paper. All credit for this research goes to the researchers of this project.
Tanya Malhotra is a final-year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with strong analytical and critical thinking skills, along with a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.