While language models have improved and been widely deployed, our understanding of how they work internally remains limited. For example, it can be hard to tell from their outputs alone whether they are relying on biased heuristics or being dishonest. Interpretability research aims to surface this kind of information from inside the model. OpenAI's most recent interpretability work uses the GPT-4 large language model to generate explanations of the behavior of neurons in a language model and then scores those explanations to assess their quality.
To increase confidence in AI systems, it is important to study their interpretability so that users and developers can better understand how they operate and how they arrive at decisions. Furthermore, analyzing a model's behavior helps surface its biases and errors, creating opportunities to improve performance and further strengthen human-AI cooperation.
Neurons and attention heads are the basic building blocks of these models: neurons within the network's layers and attention heads within the self-attention mechanism. Investigating the role each one plays is central to interpretability research. For neural networks containing tens of billions of parameters, however, manually inspecting neurons to determine which features of the data they represent is a prohibitively expensive and time-consuming procedure.
Learning how these parts (neurons and attention heads) work is a natural starting point for the study of interpretability. In the past, this required humans to inspect neurons by hand to determine which properties of the data they represent, an approach that cannot scale to neural networks with hundreds of billions of parameters. The researchers therefore offer an automated process that applies GPT-4 to generate and evaluate natural language descriptions of what neurons in another language model do.
This effort is aimed at automating alignment research itself, the third pillar of OpenAI's alignment strategy. What makes the approach encouraging is that it can scale to keep up with AI progress: as future models become more sophisticated and useful as assistants, we should also obtain better explanations of them.
To generate and evaluate explanations of the behavior of neurons in another language model, OpenAI now proposes an automated approach employing GPT-4. This research is crucial because AI is evolving rapidly and keeping up requires automated methods; furthermore, as more capable models are built, the quality of the explanations they produce should increase.
Neuron behavior is explained in three stages: generating an explanation, simulating the neuron with GPT-4, and comparing the simulation with reality (a minimal code sketch follows the list).
- First, GPT-4 is shown a GPT-2 neuron together with relevant text sequences and the neuron's activations on them, and is asked to write a short natural language explanation of what the neuron responds to.
- Next, GPT-4 is used to simulate the neuron: given only the explanation, it predicts how strongly the neuron would activate on each token. This tests whether the explanation is consistent with the neuron's actual firing behavior.
- Finally, the explanation is scored based on how closely the simulated activations match the real ones.
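To make the three stages concrete, here is a minimal sketch of the explain-and-simulate loop driven through the OpenAI chat API. The prompt wording, the `explain_neuron` and `simulate_neuron` helpers, and the activation format are illustrative assumptions, not the exact prompts or interfaces in OpenAI's released code.

```python
# Minimal sketch of the explain -> simulate loop (illustrative only; the
# prompts and helper names are assumptions, not OpenAI's released code).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def explain_neuron(token_activation_pairs):
    """Step 1: ask GPT-4 to summarize what a GPT-2 neuron responds to,
    given (token, activation) examples where the neuron fired strongly."""
    examples = "\n".join(f"{tok}\t{act:.2f}" for tok, act in token_activation_pairs)
    prompt = (
        "Here are tokens and one neuron's activations on them:\n"
        f"{examples}\n"
        "In one sentence, explain what this neuron is looking for."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()


def simulate_neuron(explanation, tokens):
    """Step 2: ask GPT-4 to predict an activation from 0 to 10 for each token,
    using only the explanation. The simulated activations are later compared
    with the real ones to score the explanation."""
    prompt = (
        f"A neuron is described as follows: {explanation}\n"
        "For each token below, reply with '<token><TAB><activation 0-10>' "
        "on its own line:\n" + "\n".join(tokens)
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    simulated = []
    for line in resp.choices[0].message.content.splitlines():
        parts = line.rsplit("\t", 1)
        if len(parts) == 2:
            try:
                simulated.append(float(parts[1]))
            except ValueError:
                pass  # skip lines the model did not format as requested
    return simulated
```

A scored run would then compare the list returned by `simulate_neuron` against the neuron's real activations, as sketched later in the article.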
Unfortunately, automatically generating and scoring explanations with GPT-4 is not yet useful for larger, more complex models; the researchers suspect that behavior in the later layers of the network, where explanations currently fare worst, is simply harder to summarize. The average explanation score is still quite low, but OpenAI believes it can be increased with further advances in machine learning, for example by using a more capable explainer model or by altering the architecture of the model being explained.
OpenAI is open-sourcing its code for generating and scoring explanations using publicly available models on the OpenAI API, along with visualization tools and a dataset of GPT-4-written explanations for GPT-2's roughly 300,000 neurons. OpenAI has expressed the hope that the research community will build on this work and develop more effective methods for producing high-quality explanations.
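As a rough illustration of how the released explanation dataset might be consumed, the sketch below fetches a single neuron record and prints its explanation and score. The base URL, path pattern, and field names (`explanation`, `score`) are placeholders assumed for illustration; the actual layout is documented in the repository linked at the end of this article.

```python
# Illustrative sketch of reading one neuron's explanation record.
# NOTE: the URL pattern and JSON field names below are assumptions;
# consult the repository README for the dataset's real layout.
import requests

BASE = "https://example.com/neuron-explainer/data"  # placeholder host
layer, neuron = 5, 131  # arbitrary example indices

resp = requests.get(f"{BASE}/explanations/{layer}/{neuron}.json", timeout=30)
resp.raise_for_status()
record = resp.json()

# Each record is expected to pair an explanation with its simulation-based score.
print(f"Layer {layer}, neuron {neuron}")
print("Explanation:", record.get("explanation"))
print("Score:", record.get("score"))
```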
Challenges that can be overcome with additional research
- Although the method tries to describe neuron behavior in plain natural language, some neurons may be too complex to summarize in such a short description. For example, neurons can be highly polysemantic (representing many distinct concepts) or can represent concepts that humans do not understand or have no words for.
- The researchers want computers to one day automatically discover and explain the neuron and attention-head circuits that underpin complex behavior. The current approach explains a neuron's behavior in terms of the original text input, but it says nothing about downstream effects. For example, a neuron that fires on periods might be incrementing a sentence counter or signaling that the next word should begin with a capital letter.
- Researchers must try to understand the underlying mechanisms to fully describe the actions of neurons. Since high-scoring explanations simply report a correlation (see the sketch after this list), they may perform poorly on out-of-distribution text.
- The process as a whole is very computationally intensive.
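To make the correlation point above concrete, here is a minimal sketch of correlation-based explanation scoring, assuming the real activations and the activations simulated from an explanation are already in hand. The score reflects only how well the two series move together, not whether the explanation captures the neuron's underlying mechanism.

```python
# Minimal sketch of correlation-based explanation scoring (inputs assumed).
import numpy as np


def score_explanation(real_activations, simulated_activations):
    """Return the correlation between real and simulated activations.
    1.0 means the simulation tracks the neuron perfectly; 0.0 means no
    linear relationship. This measures correlation, not mechanism."""
    real = np.asarray(real_activations, dtype=float)
    sim = np.asarray(simulated_activations, dtype=float)
    if real.std() == 0.0 or sim.std() == 0.0:
        return 0.0  # a constant series has no well-defined correlation
    return float(np.corrcoef(real, sim)[0, 1])


# Example with made-up numbers: the simulation roughly tracks the neuron.
real = [0.1, 2.3, 0.0, 1.8, 0.2]
sim = [0.0, 2.0, 0.1, 1.5, 0.3]
print(f"explanation score ~= {score_explanation(real, sim):.2f}")
```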
The research suggests that these methods can help fill in some gaps in the big picture of how transformer language models work. Identifying sets of interpretable directions in the residual stream, or finding alternative explanations that describe a neuron's behavior across its entire distribution, could help address superposition. Explanations can also be improved through better use of tools, conversational assistants, and chain-of-thought approaches. The researchers envision a future in which an explainer model can generate, test, and iterate on hypotheses as well as a human interpretability researcher can today, including hypotheses about circuit functionality and anomalous behavior. Researchers could also benefit from a more macro-focused view by examining hundreds of millions of explained neurons and querying explanation databases for commonalities. Simple applications could develop rapidly, such as identifying salient features in reward models or understanding qualitative differences between a fine-tuned model and its base model.
The dataset and source code can be accessed at https://github.com/openai/automated-interpretability
Dhanshree Shenwai is a computer engineer with solid experience in FinTech companies spanning the finance, cards & payments, and banking domains, and a strong interest in AI applications. She is enthusiastic about exploring new technologies and advancements in today's changing world that make everyone's life easier.