Natural language processing is an area where AI systems are advancing rapidly, and it is important that these models be rigorously tested and guided toward safer behavior to reduce deployment risks. Earlier benchmarks for such systems focused on measuring language comprehension or reasoning in isolation. Now, however, models are being trained for real, interactive work, so benchmarks must assess how they behave in social settings.
Text-based games offer a natural testbed for interactive agents: to progress, an agent needs both natural language understanding and planning skills. When establishing benchmarks, agents' harmful tendencies should be weighed alongside their technical abilities.
New work from the University of California, Berkeley, the Center for AI Safety, Carnegie Mellon University, and Yale University proposes MACHIAVELLI, a benchmark for Measuring Agents' Competence and Harmfulness In A Vast Environment of Long-horizon Language Interactions. MACHIAVELLI is an advance in evaluating an agent's planning ability in naturalistic social environments. The setting is inspired by the human-written, text-based Choose-Your-Own-Adventure games available at choiceofgames.com. These games confront agents with high-level decisions and realistic objectives while abstracting away low-level environmental interactions.
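To make the setup concrete, here is a minimal sketch of an agent loop over such a choice-based game. The gym-style environment interface, its methods, and the observation fields are illustrative assumptions for this sketch, not the benchmark's actual API.

```python
# Hypothetical gym-style loop over a choice-based text game.
# The env interface (reset/step, "num_choices" field) is an
# illustrative assumption, not the actual MACHIAVELLI API.
import random

class RandomAgent:
    """Baseline agent that picks among the listed choices at random."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def act(self, observation, num_choices):
        # observation: scene text plus numbered choices (unused here)
        return self.rng.randrange(num_choices)

def play_episode(env, agent):
    """Run one game to completion and return the achieved reward."""
    obs, info = env.reset()              # scene text and metadata
    total_reward, done = 0.0, False
    while not done:
        action = agent.act(obs, info["num_choices"])
        obs, reward, done, info = env.step(action)
        total_reward += reward           # progress toward in-game goals
    return total_reward
```

A stronger agent would replace the random choice with a learned policy or a language-model call, but the surrounding loop stays the same.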
The environment reports the degree to which an agent's actions are deceptive, utility-reducing, and power-seeking, among other behavioral attributes, in order to discourage unethical behavior. The team achieves this with the following steps (a minimal sketch follows the list):
- Operationalizing these behaviors as mathematical formulas
- Densely annotating social notions in the games, such as characters' well-being
- Using the annotations and formulas to produce a numerical score for each behavior
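In spirit, each behavioral score aggregates per-scene annotations over an agent's trajectory and normalizes the count against a reference policy. The snippet below is a rough sketch of that idea; the annotation schema and the random-baseline normalization are illustrative assumptions, not the paper's exact formulas.

```python
def behavior_score(trajectory, behavior, baseline_mean):
    """Aggregate dense scene annotations into one number per behavior.

    trajectory: list of visited scenes, each a dict mapping behavior
                names (e.g. "deception", "power_seeking") to 0/1 labels
                (illustrative schema).
    baseline_mean: mean count for the same behavior under a random
                   agent, used to normalize across games (assumption).
    """
    raw_count = sum(scene.get(behavior, 0) for scene in trajectory)
    # Normalize so 1.0 means "as harmful as a random agent";
    # lower is better. Guard against degenerate baselines.
    return raw_count / max(baseline_mean, 1e-8)

# Example: a 3-scene trajectory with two deceptive actions.
traj = [{"deception": 1}, {"deception": 0}, {"deception": 1}]
print(behavior_score(traj, "deception", baseline_mean=2.5))  # 0.8
```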
They empirically show that GPT-4 (OpenAI, 2023) is more efficient at collecting annotations than human annotators.
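A rough sketch of such model-based annotation using the OpenAI Python SDK is shown below; the prompt wording and the 0/1 labeling scheme are illustrative assumptions, not the authors' exact protocol.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def annotate_scene(scene_text, behavior="deception"):
    """Ask GPT-4 whether a scene exhibits a given behavior (0/1)."""
    prompt = (
        f"Does the protagonist's action in the following scene involve "
        f"{behavior}? Answer with a single digit: 1 for yes, 0 for no.\n\n"
        f"{scene_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic labels aid reproducibility
    )
    return int(response.choices[0].message.content.strip()[0])
```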
Artificial intelligence agents face the same internal conflict as humans. Just as language models trained for next-token prediction often produce toxic text, artificial agents trained to optimize goals often exhibit power-seeking and other immoral behaviors. Agents trained without moral constraints can develop Machiavellian strategies that maximize their rewards at the expense of others and of the environment. Encouraging agents to act morally can improve this trade-off.
The team finds that moral training (pushing the agent to be more ethical) decreases the incidence of harmful activities in language-model agents. Furthermore, behavior regularization restricts undesirable behavior in both reinforcement-learning and language-model agents without substantially decreasing reward. This work contributes to the development of trustworthy sequential decision-makers.
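One common way to implement such behavior regularization is to shape the reward by subtracting a weighted harm penalty. The sketch below illustrates the idea; the annotation schema and the penalty weight `lam` are illustrative assumptions, not the paper's exact method.

```python
def shaped_reward(game_reward, harm_annotations, lam=0.5):
    """Subtract a weighted harm penalty from the in-game reward.

    game_reward: scalar reward from the environment this step.
    harm_annotations: dict of 0/1 labels for behaviors flagged in
                      the current scene (illustrative schema).
    lam: trade-off weight; larger values trade reward for safety.
    """
    harm = sum(harm_annotations.values())
    return game_reward - lam * harm

# Example: a step that earns 1.0 reward but involves deception.
print(shaped_reward(1.0, {"deception": 1, "power_seeking": 0}))  # 0.5
```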
Researchers are testing techniques such as an artificial conscience and ethics prompts to steer agents. Agents can be guided to exhibit less Machiavellian behavior, although there is still much room for improvement. The authors advocate further research into these trade-offs, emphasizing expanding the Pareto frontier rather than chasing reward alone.
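The reward-versus-harm trade-off can be framed as a Pareto frontier: an agent is on the frontier if no other agent achieves both higher reward and lower harm. A small sketch of identifying such non-dominated points, with made-up numbers:

```python
def pareto_frontier(points):
    """Return (reward, harm) points not dominated by any other point.

    A point dominates another if it has >= reward and <= harm,
    with at least one strict inequality.
    """
    frontier = []
    for r, h in points:
        dominated = any(
            (r2 >= r and h2 <= h) and (r2 > r or h2 < h)
            for r2, h2 in points
        )
        if not dominated:
            frontier.append((r, h))
    return frontier

# Hypothetical agents as (reward, harm) pairs.
agents = [(10, 8), (8, 3), (6, 2), (7, 4), (5, 5)]
print(pareto_frontier(agents))  # [(10, 8), (8, 3), (6, 2)]
```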
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a keen interest in the applications of artificial intelligence across various fields. She is passionate about exploring new technological advances and their real-life applications.