Although the vast majority of our explanations score low, we believe we can now use ML techniques to further improve our ability to produce explanations. For example, we found that we could improve scores by:
- Iterating on explanations. We can increase scores by asking GPT-4 to come up with possible counterexamples and then revising the explanations in light of their activations (see the sketch after this list).
- Using larger models to give explanations. The average score rises as the explainer model's capabilities increase. However, even GPT-4 gives worse explanations than humans do, suggesting room for improvement.
- Changing the architecture of the explained model. Training models with different activation functions improved explanation scores.
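
As a rough illustration of the iteration step above, the sketch below shows one way such a counterexample-and-revise loop could be wired up against the OpenAI API. The `get_neuron_activations` helper and the prompt wording are illustrative assumptions, not the released pipeline.

```python
# Hypothetical sketch of the "iterate on explanations" loop described above.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask(prompt: str) -> str:
    """Single-turn helper around the chat completions API."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def get_neuron_activations(text: str) -> list[float]:
    """Placeholder: run the subject model (e.g. GPT-2) and return the
    target neuron's per-token activations for `text`."""
    raise NotImplementedError


def revise_explanation(explanation: str, top_examples: list[str]) -> str:
    # 1. Ask the explainer model for texts that *should* strongly activate
    #    the neuron if the current explanation were correct.
    proposed = ask(
        f"A neuron is explained as: {explanation!r}\n"
        "Write three short texts that should strongly activate this neuron."
    )
    # 2. Check the proposals against the subject model's real activations.
    evidence = "\n".join(
        f"{line!r} -> max activation {max(get_neuron_activations(line)):.2f}"
        for line in proposed.splitlines()
        if line.strip()
    )
    # 3. Ask the explainer to revise the explanation in light of the evidence.
    return ask(
        f"Current explanation: {explanation!r}\n"
        f"Top-activating examples: {top_examples}\n"
        f"Proposed texts and their observed activations:\n{evidence}\n"
        "Revise the explanation so it better matches the observed behavior."
    )
```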
We are open-sourcing our datasets and visualization tools for GPT-4-written explanations of all 307,200 neurons in GPT-2, as well as code for explanation and scoring using publicly available models on the OpenAI API. We hope the research community will develop new techniques for generating higher-scoring explanations and better tools for exploring GPT-2 using explanations.
We found over 1,000 neurons with explanations that scored at least 0.8, meaning that, according to GPT-4, the explanation accounts for most of the neuron's top-activating behavior. Most of these well-explained neurons are not very interesting. However, we also found many interesting neurons that GPT-4 did not understand. We hope that as explanations improve, we will be able to rapidly uncover interesting qualitative insights into the model's computations.
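
As an illustration, once the released explanations have been downloaded, filtering for well-explained neurons might look like the following. The file name and record fields used here are hypothetical and may not match the released dataset's actual schema.

```python
# Minimal sketch: find neurons whose explanation score meets a threshold,
# assuming a local JSONL file with (hypothetical) "layer", "neuron",
# "explanation", and "score" fields.
import json


def well_explained(path: str, threshold: float = 0.8):
    """Yield records whose explanation score meets the threshold."""
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            if record["score"] >= threshold:
                yield record


if __name__ == "__main__":
    hits = list(well_explained("gpt2_neuron_explanations.jsonl"))
    print(f"{len(hits)} neurons with score >= 0.8")
    for r in hits[:5]:
        print(r["layer"], r["neuron"], r["explanation"])
```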