As AI models become increasingly common and are integrated into diverse sectors such as healthcare, finance, education, transportation, and entertainment, it is critical to understand how they work in depth. Interpreting the mechanisms underlying AI models allows us to audit them for safety and bias, with the potential to deepen our understanding of the science behind intelligence itself.
Imagine that we could directly investigate the human brain by manipulating each of its individual neurons to examine its role in perceiving a particular object. While such an experiment would be prohibitively invasive in the human brain, it is more feasible in another type of neural network: one that is artificial. However, similar to the human brain, artificial models containing millions of neurons are too large and complex to study by hand, making interpretation at scale a very difficult task.
To address this problem, researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) decided to take an automated approach to interpreting computer vision models that evaluate different properties of images. They developed “MAIA” (Multimodal Automated Interpretability Agent), a system that automates a variety of neural network interpretation tasks using a vision-language model equipped with tools for running experiments on other AI systems.
“Our goal is to create an AI interpreter that can autonomously perform interpretability experiments. Existing automated interpretability methods simply label or visualize data in a one-time process. MAIA, on the other hand, can generate hypotheses, design experiments to test them, and refine its understanding through iterative analysis,” says Tamar Rott Shaham, an MIT postdoc in electrical engineering and computer science (EECS) at CSAIL and co-author of a new research paper. “By combining a pre-trained vision-language model with a library of interpretation tools, our multimodal method can answer user queries by composing and running targeted experiments on specific models, continually refining its approach until it can provide a comprehensive answer.”
The automated agent has been shown to tackle three key tasks: it labels individual components within vision models and describes the visual concepts that trigger them, it cleans up image classifiers by removing irrelevant features to make them more robust to new situations, and it looks for hidden biases in AI systems to help uncover potential fairness issues in their outputs. “But a key advantage of a system like MAIA is its flexibility,” says Sarah Schwettmann, PhD ’21, a research scientist at CSAIL and co-leader of the research. “We demonstrated MAIA’s utility on a few specific tasks, but since the system is built from a base model with extensive reasoning capabilities, it can answer many different types of interpretation queries from users and design experiments on the fly to investigate them.”
Neuron by neuron
In an example task, a human user asks MAIA to describe the concepts that a particular neuron within a vision model is responsible for detecting. To investigate this question, MAIA first uses a tool that retrieves “dataset exemplars” from the ImageNet dataset, which maximally activate the neuron. For this example neuron, those images show people in formal attire and close-ups of their chins and collars. MAIA forms several hypotheses about what drives the neuron’s activity: facial expressions, chins, or ties. MAIA then uses its tools to design experiments to test each hypothesis individually by generating and editing synthetic images: In one experiment, adding a bow tie to an image of a human face increases the neuron’s response. “This approach allows us to determine the specific cause of the neuron’s activity, much like a real scientific experiment,” says Rott Shaham.
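To make that loop concrete, here is a minimal, self-contained Python sketch of the exemplar-then-experiment pattern described above. It is an illustration only, not MAIA’s actual code: the toy neuron, the feature-vector “images,” and the edit function are all hypothetical stand-ins.

```python
# Minimal sketch of the exemplar-then-experiment loop described above.
# All names here (toy_neuron, edit_image, CANDIDATE_EDITS) are hypothetical
# placeholders, not MAIA's actual API; "images" are stand-in feature vectors.
import numpy as np

rng = np.random.default_rng(0)

def toy_neuron(image: np.ndarray) -> float:
    """Stand-in for a single unit in a vision model: responds to feature 2 ('tie-like')."""
    return float(image[2])

# 1. Retrieve "dataset exemplars": the images that maximally activate the unit.
dataset = rng.random((1000, 8))                      # 1,000 fake images, 8 features each
activations = np.array([toy_neuron(img) for img in dataset])
exemplars = dataset[np.argsort(activations)[-5:]]    # top-5 activating images
print("mean features of top exemplars:", exemplars.mean(axis=0).round(2))

# 2. Hypotheses about what drives the unit, each paired with an edit that adds that feature.
CANDIDATE_EDITS = {
    "facial expression": 0,   # feature index the edit strengthens
    "chin": 1,
    "tie": 2,
}

def edit_image(image: np.ndarray, feature: int, strength: float = 1.0) -> np.ndarray:
    """Synthetic 'image edit': boost one feature, leaving the rest unchanged."""
    edited = image.copy()
    edited[feature] += strength
    return edited

# 3. Test each hypothesis: does adding the feature raise the unit's response?
baseline = dataset[rng.integers(len(dataset))]
for hypothesis, feature in CANDIDATE_EDITS.items():
    delta = toy_neuron(edit_image(baseline, feature)) - toy_neuron(baseline)
    print(f"{hypothesis:>18}: activation change = {delta:+.2f}")
```

In the real system, the exemplars come from ImageNet and the edits are produced by image generation and editing models, but the underlying logic (retrieve, hypothesize, intervene, measure) is the same.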
MAIA’s explanations of neuron behavior are evaluated in two key ways. First, synthetic systems with known ground-truth behaviors are used to assess the accuracy of MAIA’s interpretations. Second, for “real” neurons inside trained AI systems with no ground-truth descriptions, the authors design a new automated evaluation protocol that measures how well MAIA’s descriptions predict neuron behavior on unseen data.
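The idea behind predictive evaluation can be sketched in a few lines: use the candidate description to predict which held-out images should strongly activate the neuron, then score agreement with the neuron’s actual behavior. The snippet below is a toy version under stated assumptions, not the paper’s protocol; the judge function and the neuron are placeholders.

```python
# Toy sketch (not MAIA's actual protocol) of scoring a neuron description by how
# well it predicts behavior on unseen data. 'description_predicts_high' stands in
# for asking a judge model whether an image matches the textual description.
import numpy as np

rng = np.random.default_rng(1)

def neuron(image: np.ndarray) -> float:
    return float(image[2])                       # same toy 'tie-detecting' unit as above

def description_predicts_high(image: np.ndarray) -> bool:
    """Placeholder: does the description ('fires on ties') apply to this image?"""
    return image[2] > 0.5

held_out = rng.random((500, 8))                  # unseen images
acts = np.array([neuron(img) for img in held_out])
pred = np.array([description_predicts_high(img) for img in held_out])
true = acts > np.quantile(acts, 0.5)             # label top-half activations as 'high'

agreement = (pred == true).mean()
print(f"description/behavior agreement on unseen data: {agreement:.2%}")
```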
The CSAIL-led method outperformed baseline methods at describing individual neurons across a variety of vision models, such as ResNet, CLIP, and the DINO vision transformer. MAIA also performed well on the new dataset of synthetic neurons with known ground-truth descriptions. For both real and synthetic systems, the descriptions were often on par with those written by human experts.
How are descriptions of AI system components, such as individual neurons, useful? “Understanding and localizing behaviors within large AI systems is a key part of auditing the safety of these systems before deployment. In some of our experiments, we show how MAIA can be used to find neurons with undesirable behaviors and remove these behaviors from a model,” says Schwettmann. “We are moving toward a more resilient AI ecosystem where tools for understanding and monitoring AI systems keep pace with system scalability, allowing us to investigate and hopefully understand unforeseen challenges introduced by new models.”
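As a rough illustration of what “removing” a behavior can mean in practice, the sketch below ablates (zeroes out) one hidden unit in a toy network and compares predictions before and after. The network, the flagged unit, and the data are all hypothetical; this is not the procedure used in the paper.

```python
# Minimal illustration (not the paper's method) of removing an undesirable
# behavior by ablating one unit in a toy two-layer network.
import numpy as np

rng = np.random.default_rng(3)

W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)    # toy hidden layer with 16 units
W2, b2 = rng.normal(size=(16, 4)), np.zeros(4)     # toy output layer with 4 classes

def forward(x: np.ndarray, ablate_unit: int | None = None) -> np.ndarray:
    """Run the toy network, optionally zeroing out one hidden unit's activation."""
    h = np.maximum(x @ W1 + b1, 0.0)               # ReLU hidden activations
    if ablate_unit is not None:
        h[..., ablate_unit] = 0.0                  # 'remove' the flagged neuron
    return h @ W2 + b2

x = rng.normal(size=(5, 8))                        # a small batch of inputs
before = forward(x).argmax(axis=1)
after = forward(x, ablate_unit=7).argmax(axis=1)   # unit 7: hypothetically flagged by the agent
print("predictions before ablation:", before)
print("predictions after  ablation:", after)
```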
Taking a look inside neural networks
The nascent field of interpretability is maturing into a separate area of research alongside the rise of “black box” machine learning models. How can researchers decipher these models and understand how they work?
Current methods for peering inside a system tend to be limited, either in scale or in the accuracy of the explanations they can produce. Furthermore, existing methods tend to be tuned to a particular model and a specific task. This led the researchers to ask: How can we build a generic system that helps users answer interpretability questions about AI models while combining the flexibility of human experimentation with the scalability of automated techniques?
One critical aspect they wanted this system to address was bias. To determine whether image classifiers were showing bias against particular subcategories of images, the team looked at the final layer of the classification pipeline (the stage that sorts or labels items, much as a machine decides whether a photo shows a dog, cat, or bird) and at the likelihood scores the model assigns to input images (the confidence levels it attaches to its guesses). To probe potential biases in image classification, MAIA was asked to find a subset of images in specific classes (e.g., “Labrador retriever”) that were likely to be mislabeled by the system. In this example, MAIA found that images of black Labradors were likely to be misclassified, suggesting a bias in the model toward yellow-coated retrievers.
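A simple way to picture this kind of audit: rank the images of one class by the classifier’s confidence in the correct label, then check whether one subgroup dominates the low-confidence tail. The sketch below uses made-up confidence scores and a hypothetical “coat_color” attribute purely for illustration; it is not the paper’s exact procedure.

```python
# Illustrative sketch of a bias audit: rank images of one class by the classifier's
# confidence in the correct label, then compare subgroups among the low-confidence tail.
# The 'coat_color' metadata and the toy confidence scores are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(2)

n = 200
coat_color = rng.choice(["yellow", "black"], size=n)           # hypothetical subgroup label
# Toy confidence in the true class "Labrador retriever": lower, on average, for black coats.
confidence = np.where(coat_color == "yellow",
                      rng.normal(0.90, 0.05, n),
                      rng.normal(0.70, 0.10, n)).clip(0, 1)

# Flag the lowest-confidence images as likely misclassifications, then check
# whether one subgroup is over-represented among them.
threshold = np.quantile(confidence, 0.2)
flagged = coat_color[confidence <= threshold]
for color in ("yellow", "black"):
    share = (flagged == color).mean()
    print(f"{color:>6} Labradors among low-confidence images: {share:.0%}")
```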
Since MAIA relies on external tools to design experiments, its performance is limited by the quality of those tools; as tools such as image synthesis models improve, so will MAIA. MAIA also sometimes shows confirmation bias, incorrectly confirming its initial hypothesis. To mitigate this, the researchers created an image-to-text conversion tool, which uses a different instance of the language model to summarize experimental results. Another failure mode is overfitting to a particular experiment, where the model sometimes draws premature conclusions from minimal evidence.
“I think the natural next step for our lab is to go beyond artificial systems and apply similar experiments to human perception,” says Rott Shaham. “To test this, it has traditionally been necessary to design and test stimuli manually, which is very labor intensive. With our agent, we can extend this process, designing and testing numerous stimuli simultaneously. This could also allow us to compare human visual perception with artificial systems.”
“Understanding neural networks is difficult for humans because they have hundreds of thousands of neurons, each with complex behavioral patterns. MAIA helps overcome this by developing AI agents that can automatically analyze these neurons and report the results back to humans in an easily digestible way,” says Jacob Steinhardt, an assistant professor at the University of California, Berkeley, who was not involved in the research. “Scaling up these methods could be one of the most important avenues to safely understanding and supervising AI systems.”
Rott Shaham and Schwettmann are joined on the paper by five fellow CSAIL affiliates: undergraduate student Franklin Wang; incoming MIT student Achyuta Rajaram; EECS PhD student Evan Hernandez SM ’22; and EECS professors Jacob Andreas and Antonio Torralba. Their work was funded, in part, by the MIT-IBM Watson AI Lab, Open Philanthropy, Hyundai Motor Co., the Army Research Laboratory, Intel, the National Science Foundation, the Zuckerman STEM Leadership Program, and the Viterbi Fellowship. The researchers’ findings will be presented at the International Conference on Machine Learning this week.