Although the vast majority of our explanations score low, we believe we can now use ML techniques to further improve our ability to produce explanations. For example, we found that we could improve scores by:
- Iterating on explanations. We can increase scores by asking GPT-4 to come up with possible counterexamples and then revising the explanations in light of their activations (see the sketch after this list).
- Using larger models to give explanations. The average score rises as the explainer model's capabilities increase. However, even GPT-4 gives worse explanations than humans do, suggesting room for improvement.
- Changing the architecture of the explained model. Training models with different activation functions improved explanation scores.
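
As a rough illustration of the iteration step above, the sketch below shows one way such a counterexample-and-revise loop could be wired up against the OpenAI API. The `get_neuron_activations` helper and the prompt wording are illustrative assumptions, not the released pipeline.

```python
# Hypothetical sketch of the "iterate on explanations" loop described above.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask(prompt: str) -> str:
    """Single-turn helper around the chat completions API."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def get_neuron_activations(text: str) -> list[float]:
    """Placeholder: run the subject model (e.g. GPT-2) and return the
    target neuron's per-token activations for `text`."""
    raise NotImplementedError


def revise_explanation(explanation: str, top_examples: list[str]) -> str:
    # 1. Ask the explainer model for texts that *should* strongly activate
    #    the neuron if the current explanation were correct.
    proposed = ask(
        f"A neuron is explained as: {explanation!r}\n"
        "Write three short texts that should strongly activate this neuron."
    )
    # 2. Check the proposals against the subject model's real activations.
    evidence = "\n".join(
        f"{line!r} -> max activation {max(get_neuron_activations(line)):.2f}"
        for line in proposed.splitlines()
        if line.strip()
    )
    # 3. Ask the explainer to revise the explanation in light of the evidence.
    return ask(
        f"Current explanation: {explanation!r}\n"
        f"Top-activating examples: {top_examples}\n"
        f"Proposed texts and their observed activations:\n{evidence}\n"
        "Revise the explanation so it better matches the observed behavior."
    )
```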
We are open-sourcing our datasets and visualization tools for GPT-4-written explanations of all 307,200 neurons in GPT-2, as well as code for explanation and scoring using publicly available models on the OpenAI API. We hope the research community will develop new techniques for generating higher-scoring explanations and better tools for exploring GPT-2 using explanations.
We found over 1,000 neurons with explanations that scored at least 0.8, meaning that, according to GPT-4, the explanation accounts for most of the neuron's top-activating behavior. Most of these well-explained neurons are not very interesting. However, we also found many interesting neurons that GPT-4 did not understand. We hope that as explanations improve, we will be able to rapidly uncover interesting qualitative insights into the model's computations.
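
As an illustration, once the released explanations have been downloaded, filtering for well-explained neurons might look like the following. The file name and record fields used here are hypothetical and may not match the released dataset's actual schema.

```python
# Minimal sketch: find neurons whose explanation score meets a threshold,
# assuming a local JSONL file with (hypothetical) "layer", "neuron",
# "explanation", and "score" fields.
import json


def well_explained(path: str, threshold: float = 0.8):
    """Yield records whose explanation score meets the threshold."""
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            if record["score"] >= threshold:
                yield record


if __name__ == "__main__":
    hits = list(well_explained("gpt2_neuron_explanations.jsonl"))
    print(f"{len(hits)} neurons with score >= 0.8")
    for r in hits[:5]:
        print(r["layer"], r["neuron"], r["explanation"])
```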