Modern large language models (LLMs) are capable of a wide range of impressive feats, from solving coding tasks and translating between languages to holding extended conversations. Their social impact is therefore expanding rapidly as they become more prevalent in people's daily lives and in the goods and services they use.
Causal abstraction theory provides a generic framework for defining interpretability methods that assess how faithfully a complex causal system (such as a neural network) implements an interpretable causal system (such as a symbolic algorithm). When the answer is "yes," the expected behavior of the model is one step closer to being guaranteed. However, the space of alignments between the variables in the hypothesized causal model and the representations in the neural network grows exponentially with model size, which may explain why such interpretability methods have so far only been applied to small, task-specific models. Once a satisfactory alignment is found, certain formal guarantees follow; when the search turns up no alignment, the method simply fails to certify the hypothesis.
Real progress has been made on this problem thanks to Distributed Alignment Search (DAS). DAS makes it possible to (1) learn an alignment between distributed neural representations and causal variables via gradient descent and (2) discover causal structure that is spread across neurons rather than tied to individual ones. Still, DAS relies on a brute-force search over the dimensions of neural representations, which limits its scalability.
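At its core, DAS performs an interchange intervention in a learned rotated basis: hidden vectors from two runs are rotated, a low-dimensional subspace is swapped between them, and the result is rotated back. Below is a minimal NumPy sketch of that operation; the rotation here is randomly sampled for illustration, whereas DAS learns it by gradient descent, and all names and dimensions are hypothetical.

```python
import numpy as np

def das_interchange(h_base, h_source, rotation, k):
    """Rotate both hidden vectors into the learned basis, swap the
    first k coordinates (the aligned subspace), and rotate back.
    `rotation` is an orthogonal d x d matrix; DAS learns it via
    gradient descent rather than sampling it as we do here."""
    z_base = h_base @ rotation          # project into the rotated basis
    z_source = h_source @ rotation
    z_base[:k] = z_source[:k]           # interchange the aligned subspace
    return z_base @ rotation.T          # map back to the original basis

# toy usage with a random orthogonal rotation (hypothetical sizes)
rng = np.random.default_rng(0)
d, k = 8, 2
rotation, _ = np.linalg.qr(rng.standard_normal((d, d)))
h_base = rng.standard_normal(d)
h_source = rng.standard_normal(d)
h_new = das_interchange(h_base, h_source, rotation, k)
```

Because the rotation is orthogonal, swapping all `d` coordinates simply recovers the source vector; swapping only `k` of them transplants just the putative causal subspace.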
Boundless DAS, developed at Stanford University, replaces the remaining brute-force component of DAS with learned parameters, allowing the method to scale. The approach uses the principle of causal abstraction to identify the representations in an LLM responsible for a given causal effect. Using Boundless DAS, the researchers examine how Alpaca (7B), an instruction-tuned model based on LLaMA, follows instructions on a simple arithmetic reasoning problem. They find that, in solving this basic numerical reasoning task, Alpaca implements a causal model with interpretable intermediate variables. These causal mechanisms, they find, are also robust to changes in inputs and instructions. Their framework for discovering causal mechanisms is general and applicable to LLMs with billions of parameters.
They also propose a working causal model that uses two Boolean variables to check whether the input value satisfies the given bounds. The first Boolean variable is the target of the alignment attempts here. To test a candidate alignment, they take a pair of training examples and swap the intermediate Boolean value between them in the causal model; the activations of the proposed alignment neurons are simultaneously exchanged between the two examples in the network. Finally, the rotation matrix is trained so that the neural network responds counterfactually just as the causal model does.
The team trains Boundless DAS on token representations across multiple layers and positions for this task. They measure how faithful the alignment in the rotated subspace is using Interchange Intervention Accuracy (IIA), proposed in previous work on causal abstraction: the higher the IIA score, the more faithful the alignment. They standardize IIA using the model's task performance as the upper bound and the performance of a dummy classifier as the lower bound. The results indicate that the Boolean variables relating the input amount to the brackets are likely computed internally by the Alpaca model.
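The standardization step described above is a simple min-max rescaling between the two bounds. A minimal sketch, with hypothetical accuracy numbers:

```python
def normalized_iia(iia, task_acc, dummy_acc):
    """Rescale raw Interchange Intervention Accuracy so that the dummy
    classifier's accuracy maps to 0.0 and the model's own task
    accuracy maps to 1.0."""
    return (iia - dummy_acc) / (task_acc - dummy_acc)

# hypothetical numbers: 0.0 means no better than the dummy baseline,
# 1.0 means the alignment is as faithful as the model's task performance
score = normalized_iia(iia=0.85, task_acc=0.90, dummy_acc=0.50)
```

Normalizing this way makes IIA comparable across layers and positions, since the raw score is bounded above by how well the model does the task at all.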
The scalability of the proposed method is still limited by the size of the hidden dimension of the search space. Since the rotation matrix grows quadratically with the hidden dimension, it is infeasible to search over whole sets of token representations in an LLM at once. The method is also unrealistic in many real-world applications, because the high-level causal models required for a task are often unknown. The group suggests future work on learning high-level causal graphs, either through heuristic-based discrete search or end-to-end optimization.
Check out the preprint paper, project page, and GitHub repository.
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a strong interest in the applications of artificial intelligence across various fields, and is passionate about exploring new advances in technology and their real-life applications.