Explaining the behavior of trained neural networks remains a compelling puzzle, especially as these models grow in size and sophistication. Like other scientific challenges throughout history, reverse engineering the inner workings of artificial intelligence systems requires a substantial amount of experimentation: formulating hypotheses, intervening on behavior, and even dissecting large networks to examine individual neurons. To date, most successful experiments have involved heavy human supervision. Explaining every computation inside models the size of GPT-4 and larger will almost certainly require more automation, perhaps even using AI models themselves.
To facilitate this timely effort, researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed a novel approach that uses AI models to conduct experiments on other systems and explain their behavior. Their method uses agents built from pretrained language models to produce intuitive explanations of computations inside trained networks.
Central to this strategy is the “automated interpretability agent” (AIA), designed to mimic a scientist's experimental process. Interpretability agents plan and perform tests on other computational systems, which range in scale from individual neurons to entire models, in order to produce explanations of these systems in a variety of forms: linguistic descriptions of what a system does and where it fails, and code that reproduces the system's behavior. Unlike existing interpretability procedures that passively classify or summarize examples, the AIA actively engages in hypothesis formation, experimental testing, and iterative learning, refining its understanding of other systems in real time.
Complementing the AIA method is the new “function interpretation and description” (FIND) benchmark, a testbed of functions resembling computations inside trained networks, along with descriptions of their behavior. A key challenge in assessing the quality of descriptions of real-world network components is that descriptions are only as good as their explanatory power: researchers do not have access to ground-truth labels of units or descriptions of learned computations. FIND addresses this long-standing problem in the field by providing a reliable standard for evaluating interpretability procedures: function explanations (e.g., produced by an AIA) can be evaluated against the function descriptions in the benchmark.
For example, FIND contains synthetic neurons designed to mimic the behavior of real neurons inside language models, some of which are selective for individual concepts such as “ground transportation.” AIAs are given black-box access to synthetic neurons and design inputs (such as “tree,” “happiness,” and “car”) to test a neuron’s response. After noticing that a synthetic neuron produces higher response values for “car” than for other inputs, an AIA might design more fine-grained tests to distinguish the neuron's selectivity for cars from other forms of transportation, such as planes and boats. When the AIA produces a description such as “this neuron is selective for road transportation, and not air or sea travel,” this description is evaluated against the ground-truth description of the synthetic neuron (“selective for ground transportation”) in FIND. The benchmark can then be used to compare the capabilities of AIAs to other methods in the literature.
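To make that probing loop concrete, here is a minimal Python sketch (not the authors' code) of the kind of black-box experiment described above; the synthetic_neuron function, its hidden concept list, and the probe words are illustrative stand-ins for FIND's actual units.

```python
# Toy stand-in for a FIND synthetic neuron: it scores inputs against a hidden
# "ground transportation" concept list, mimicking a concept-selective unit.
GROUND_TRANSPORT = {"car", "truck", "bus", "train", "bicycle"}

def synthetic_neuron(text: str) -> float:
    """Black-box unit: high activation for ground-transportation words."""
    return 1.0 if text.lower() in GROUND_TRANSPORT else 0.05

# Round 1: broad probes to discover what the unit responds to.
broad_probes = ["tree", "happiness", "car"]
round1 = {p: synthetic_neuron(p) for p in broad_probes}

# Round 2: follow-up probes chosen after seeing that "car" scored highest,
# to separate road vehicles from other modes of transportation.
followup_probes = ["truck", "airplane", "ship", "train"]
round2 = {p: synthetic_neuron(p) for p in followup_probes}

print(round1)  # {'tree': 0.05, 'happiness': 0.05, 'car': 1.0}
print(round2)  # {'truck': 1.0, 'airplane': 0.05, 'ship': 0.05, 'train': 1.0}

# A language-model agent would read these observations and propose a
# description such as "selective for ground transportation, not air or sea",
# which FIND then scores against the neuron's ground-truth description.
```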
Sarah Schwettmann PhD '21, co-senior author of a paper on the new work and a research scientist at CSAIL, emphasizes the advantages of this approach. “The ability of AIAs to autonomously generate and test hypotheses can bring to light behaviors that would otherwise be difficult for scientists to detect. It is remarkable that language models, when equipped with tools for probing other systems, are capable of this type of experimental design,” says Schwettmann. “Clean, simple benchmarks with ground-truth answers have been an important driver of more general capabilities in language models, and we hope that FIND can play a similar role in interpretability research.”
Automation of interpretability
Large language models still hold celebrity status in the tech world. Recent advances in LLMs have highlighted their ability to perform complex reasoning tasks across diverse domains. The CSAIL team recognized that, given these capabilities, language models could serve as the backbone of generalized agents for automated interpretability. “Interpretability has historically been a very multifaceted field,” says Schwettmann. “There is no one-size-fits-all approach; most procedures are very specific to the individual questions we might have about a system, and to individual modalities such as vision or language. Existing approaches to labeling individual neurons inside vision models have required training specialized models on human data, where these models perform only this single task. Interpretability agents built from language models could provide a general interface for explaining other systems: synthesizing results across experiments, integrating across different modalities, and even discovering new experimental techniques at a very fundamental level.”
As we enter a regime where the models doing the explaining are themselves black boxes, external evaluations of interpretability methods become increasingly vital. The team's new benchmark addresses this need with a suite of functions of known structure, modeled after behaviors observed in the wild. The functions inside FIND span a diversity of domains, from mathematical reasoning to symbolic operations on strings to synthetic neurons built from word-level tasks. The dataset of interactive functions is procedurally constructed; real-world complexity is introduced into simple functions by adding noise, composing functions, and simulating biases. This allows for comparison of interpretability methods in a setting that translates to real-world performance.
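As an illustration of what “procedurally constructed” could look like in practice, the following sketch builds a toy FIND-style entry by wrapping a simple base function with noise and composition; the function names and the description string are assumptions made for the example, not the benchmark's actual construction code.

```python
import random

def base_fn(x: float) -> float:
    """Ground-truth behavior: quadratic on the negative side, linear otherwise."""
    return x * x if x < 0 else 2 * x

def add_noise(fn, scale: float = 0.1):
    """Wrap a function with small additive Gaussian noise."""
    return lambda x: fn(x) + random.gauss(0.0, scale)

def compose(outer, inner):
    """Compose two functions: outer(inner(x))."""
    return lambda x: outer(inner(x))

# A FIND-style entry pairs the opaque callable with a reference description
# that an agent's explanation can later be scored against.
benchmark_entry = {
    "function": add_noise(compose(abs, base_fn)),
    "description": "absolute value of a piecewise function: x^2 for x < 0, "
                   "2x otherwise, with small additive noise",
}

print(benchmark_entry["function"](-2.0))  # roughly 4.0, plus noise
```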
In addition to the function dataset, the researchers introduced an innovative evaluation protocol to assess the effectiveness of AIAs and existing automated interpretability methods. This protocol involves two approaches. For tasks that require replicating the function in code, the evaluation directly compares the AI-generated estimates against the original ground-truth functions. Evaluation becomes more intricate for tasks involving natural language descriptions of functions. In these cases, accurately gauging the quality of these descriptions requires an automated understanding of their semantic content. To tackle this challenge, the researchers developed a specialized “third-party” language model. This model is specifically trained to evaluate the accuracy and coherence of natural language descriptions provided by AI systems, and compares them against the behavior of the ground-truth function.
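A rough sketch of the first, code-replication side of that protocol might look like the following; ground_truth and agent_estimate are hypothetical stand-ins for a benchmark function and the executable reconstruction an AIA returns.

```python
import numpy as np

def ground_truth(x):
    """Reference benchmark function (assumed for illustration)."""
    return np.where(x < 0, x ** 2, 2.0 * x)

def agent_estimate(x):
    """A hypothetical reconstruction returned by an AIA as code."""
    return np.where(x < 0, x ** 2, 2.1 * x)  # slightly off on the linear branch

# Compare the two implementations on a grid of sampled inputs.
xs = np.linspace(-5, 5, 1000)
error = np.mean((ground_truth(xs) - agent_estimate(xs)) ** 2)
print(f"mean squared disagreement: {error:.4f}")

# For natural-language descriptions there is no direct numeric comparison,
# which is why a separate "third-party" judge model scores whether the
# description matches the function's observed behavior.
```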
FIND enables evaluation, and it reveals that we are still far from fully automating interpretability; although AIAs outperform existing interpretability approaches, they still fail to accurately describe almost half of the functions in the benchmark. Tamar Rott Shaham, co-lead author of the study and a postdoc at CSAIL, notes that “while this generation of AIAs is effective at describing high-level functionality, they still often overlook finer-grained details, particularly in function subdomains with noise or irregular behavior. This likely stems from insufficient sampling in these areas. One issue is that the AIAs' effectiveness may be hampered by their initial exploratory data. To counter this, we tried guiding the AIAs' exploration by initializing their search with specific, relevant inputs, which significantly improved interpretation accuracy.” This approach combines new AIA methods with previous techniques that use precomputed examples to start the interpretation process.
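The seeding idea Rott Shaham describes could look something like the sketch below, in which an AIA's first round of probes is initialized with precomputed exemplar inputs before generic exploration begins; the function name, exemplar list, and probe budget are all hypothetical.

```python
def initial_probes(precomputed_exemplars, generic_probes, budget=8):
    """Seed the agent's first probe set with known relevant exemplars,
    then fill the remaining budget with generic exploratory inputs."""
    seeded = list(precomputed_exemplars) + list(generic_probes)
    return seeded[:budget]

probes = initial_probes(
    precomputed_exemplars=["car", "train", "bus"],          # assumed exemplar set
    generic_probes=["tree", "happiness", "airplane", "ship"],
)
print(probes)  # exemplars come first, steering early hypotheses
```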
The researchers are also developing a toolkit to augment the AIAs' ability to conduct more precise experiments on neural networks, in both black-box and white-box settings. This toolkit aims to equip AIAs with better tools for selecting inputs and refining hypothesis-testing capabilities, enabling more nuanced and accurate neural network analysis. The team is also tackling practical challenges in AI interpretability, focusing on determining the right questions to ask when analyzing models in real-world scenarios. Their goal is to develop automated interpretability procedures that could eventually help people audit systems (e.g., for autonomous driving or facial recognition) to diagnose potential failure modes, hidden biases, or surprising behaviors before deployment.
Watching the watchers
The team envisions one day developing nearly autonomous AIAs that can audit other systems, with human scientists providing oversight and guidance. Advanced AIAs could develop new kinds of experiments and questions, potentially beyond what human scientists initially consider. The focus is on expanding AI interpretability to include more complex behaviors, such as entire neural circuits or subnetworks, and predicting inputs that might lead to undesired behaviors. This development represents an important step forward in AI research, aimed at making AI systems more understandable and reliable.
“A good benchmark is a powerful tool for tackling difficult challenges,” says Martin Wattenberg, a computer science professor at Harvard University who was not involved in the study. “It's wonderful to see this sophisticated benchmark for interpretability, one of the most important challenges in machine learning today. I'm particularly impressed with the automated interpretability agent the authors created. It’s a kind of interpretability jujitsu, turning AI back on itself to aid human understanding.”
Schwettmann, Rott Shaham and their colleagues presented their work at NeurIPS 2023 in December. Other MIT co-authors, all affiliated with CSAIL and the Department of Electrical Engineering and Computer Science (EECS), include graduate student Joanna Materzynska, undergraduate Neil Chowdhury, Shuang Li PhD '23, Assistant Professor Jacob Andreas and Professor Antonio Torralba. Northeastern University assistant professor David Bau is an additional co-author.
The work was supported, in part, by the MIT-IBM Watson AI Lab, Open Philanthropy, an Amazon Research Award, Hyundai NGV, the U.S. Army Research Laboratory, the U.S. National Science Foundation, the Zuckerman STEM Leadership Program, and a Viterbi Fellowship.