It is common to think of neural networks as adaptive “feature extractors” that learn by progressively refining raw inputs into useful representations. This raises a question: which features are being represented, and in what way? To better understand how high-level, human-interpretable features are encoded in the neuron activations of LLMs, a research team from the Massachusetts Institute of Technology (MIT), Harvard University (HU), and Northeastern University (NEU) proposes a technique called sparse probing.
In standard probing, researchers train a simple classifier (a probe) on a model’s internal activations to predict a property of the input, then inspect the network to see whether and where that property is represented. The proposed sparse probing method probes for more than 100 features to identify the relevant neurons, overcoming limitations of earlier probing methods and shedding light on the intricate structure of LLMs. It constrains the probing classifier to use no more than k neurons for its prediction, with k varying between 1 and 256.
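To make this concrete, here is a minimal sketch of the activation-extraction step, assuming a Hugging Face transformer. The model name, layer index, and toy is_python_code labels are illustrative stand-ins, not the authors’ actual setup:

```python
# Minimal probing setup sketch (illustrative, not the paper's code).
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True).eval()

def last_token_activations(texts, layer=6):
    """One activation vector per text: the hidden state of its final token."""
    feats = []
    for t in texts:
        ids = tok(t, return_tensors="pt", truncation=True)
        with torch.no_grad():
            out = model(**ids)
        feats.append(out.hidden_states[layer][0, -1])  # shape: (d_model,)
    return torch.stack(feats).numpy()

# X: (n_examples, n_neurons) activation matrix; y: binary feature labels.
texts = ["def f(x): return x + 1", "The cat sat on the mat."]
y = [1, 0]  # toy is_python_code labels
X = last_token_activations(texts)
```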
The team uses state-of-the-art optimal sparse prediction techniques to solve the k-sparse feature selection subproblem to optimality for small k, helping disentangle the quality of the probe from the quality of the underlying representation. They use sparsity as an inductive bias so that their probes remain simple while zeroing in on key neurons for granular examination. Furthermore, because the limited capacity prevents the probes from memorizing correlation patterns connected to the features of interest, the technique yields a more reliable signal of whether a specific feature is explicitly represented and used downstream.
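The exact solvers with optimality guarantees are beyond a short sketch, but a common heuristic approximation captures the idea: rank neurons with an L1-regularized probe, keep the top k, then refit an unregularized probe on just those neurons. This is a stand-in for the paper’s method, reusing X and y from the sketch above:

```python
# Heuristic two-stage k-sparse probe (an approximation, not the optimal solver).
import numpy as np
from sklearn.linear_model import LogisticRegression

def k_sparse_probe(X, y, k):
    # Stage 1: L1 regularization scores each neuron's usefulness for the feature.
    ranker = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
    top_k = np.argsort(-np.abs(ranker.coef_[0]))[:k]
    # Stage 2: refit a dense probe restricted to the k selected neurons.
    probe = LogisticRegression(max_iter=1000).fit(X[:, top_k], y)
    return top_k, probe

neuron_idx, probe = k_sparse_probe(X, y, k=8)
print("selected neurons:", neuron_idx)
```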
The research group ran their experiments on autoregressive transformer LLMs, reporting classification results after training probes across a range of k values. They conclude the following from the study:
- LLM neurons contain a wealth of interpretable structure, and sparse probing is an efficient way to locate the relevant neurons (even when features are in superposition). Still, it must be used with caution and followed up with secondary analysis (such as the k-sweep sketched after this list) if rigorous conclusions are to be drawn.
- In early layers, many neurons fire for unrelated n-grams and local patterns, with features encoded as sparse linear combinations of polysemantic neurons. Weight statistics and insights from toy models further suggest that the first 25% of fully connected layers make extensive use of superposition.
- Although definitive conclusions about monosemanticity remain methodologically out of reach, seemingly monosemantic neurons, especially in the middle layers, encode higher-level contextual and linguistic properties (such as is_python_code).
- While representational sparsity tends to increase as models get larger, the trend does not hold across the board: some features emerge with dedicated neurons as the model scales, others split into finer-grained features with scale, and many remain unchanged or appear somewhat randomly.
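The k-sweep follow-up referenced in the first finding can be sketched as below. The data here is synthetic so the example runs end to end, with the “feature” deliberately planted across three neurons to mimic superposition; real experiments would use extracted activations and a held-out labeled dataset:

```python
# Sweep k and watch where probe accuracy saturates (synthetic illustration).
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(400, 768))  # 400 examples, 768 "neurons"
y_syn = (X_syn[:, [3, 17, 42]].sum(axis=1) > 0).astype(int)  # signal spread over 3 neurons

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)
for k in [1, 2, 4, 8, 32]:
    idx, probe = k_sparse_probe(X_tr, y_tr, k)  # from the earlier sketch
    print(f"k={k:3d}  test acc={probe.score(X_te[:, idx], y_te):.3f}")
# A feature with a dedicated neuron saturates at k=1; here accuracy keeps
# improving until all three carrier neurons are included.
```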
Some advantages of sparse probing
- Probes with optimality guarantees further mitigate the risk of conflating the quality of the probe with the quality of the representation when investigating individual neurons.
- Sparse probes are also designed to have low capacity, so there is less concern that the probe is learning the task on its own rather than reading out a representation the model already has.
- Probing requires a supervised dataset, but once one has been built it can be used to interpret any model, opening the door to investigating questions such as the universality of learned circuits and the natural abstractions hypothesis (see the cross-model sketch after this list).
- Instead of relying on subjective evaluations, it can be used to automatically study how different architectural choices affect the occurrence of polysemanticity and superposition.
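As a rough sketch of the portability point above, the same labeled dataset can be swept across models and layers to compare where, and how sparsely, a feature is encoded. The model names are illustrative, `k_sparse_probe` comes from the earlier sketch, and a real comparison would need far more labeled examples than the toy pair above:

```python
# One supervised dataset, many models: compare per-layer accuracy at fixed k.
import torch
from transformers import AutoModel, AutoTokenizer

def feature_profile(model_name, texts, y, layers, k=8):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True).eval()
    accs = {}
    for layer in layers:
        feats = []
        for t in texts:
            ids = tok(t, return_tensors="pt", truncation=True)
            with torch.no_grad():
                feats.append(model(**ids).hidden_states[layer][0, -1])
        X = torch.stack(feats).numpy()
        idx, probe = k_sparse_probe(X, y, k)  # from the earlier sketch
        accs[layer] = probe.score(X[:, idx], y)
    return accs

for name in ["gpt2", "gpt2-medium"]:
    print(name, feature_profile(name, texts, y, layers=[2, 6, 10]))
```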
Sparse probing also has its limitations
- Strong inferences cannot be drawn from probing experiments alone; they require additional secondary investigation of the identified neurons.
- Because of its sensitivity to implementation details, anomalies, misspecification, and misleading correlations in the probing dataset, probing offers only limited insight into causality.
- In terms of interpretability in particular, sparse probes cannot recognize features that are constructed across multiple layers, nor can they distinguish features in superposition from features represented as the union of numerous finer-grained features.
- If sparse probing misses some significant neurons due to redundancy in the probing dataset, iterative pruning may be required to identify them all. Multi-token features require specialized handling, commonly implemented with aggregations that can further dilute the specificity of the result; a common aggregation pattern is sketched below.
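On the multi-token caveat, a typical (and lossy) aggregation is mean pooling over token positions before probing; this sketch reuses the tokenizer and model from the first code example:

```python
# Mean-pool per-token activations into one vector per example before probing.
def mean_pooled_activations(texts, layer=6):
    feats = []
    for t in texts:
        ids = tok(t, return_tensors="pt", truncation=True)
        with torch.no_grad():
            h = model(**ids).hidden_states[layer][0]  # (seq_len, d_model)
        feats.append(h.mean(dim=0))                   # average over all tokens
    return torch.stack(feats).numpy()
```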
Using a revolutionary sparse probing technique, the work reveals a wealth of rich, human-understandable structure in LLMs. The researchers plan to build an extensive repository of probing datasets, possibly with AI assistance, recording details especially pertinent to bias, fairness, safety, and high-stakes decision making. They encourage other researchers to join the exploration of this “ambitious interpretability” and argue that an empirical approach reminiscent of the natural sciences may be more productive than typical machine learning experimental loops. Large and diverse supervised datasets will enable better evaluation of the next generation of unsupervised interpretability techniques, which will be required to keep up with the advancement of AI and to automate the evaluation of new models.
Check out the Paper.
Dhanshree Shenwai is a computer engineer with experience in FinTech companies spanning the finance, cards & payments, and banking domains, and a strong interest in AI applications. She is enthusiastic about exploring new technologies and advancements in today’s changing world, making everyone’s life easier.