Disentangling Features in a Complex Neural Network with Superposition

Complex neural networks, like large language models (LLMs), often suffer from interpretability challenges. One of the most important reasons for this difficulty is superposition: a phenomenon in which a neural network has fewer dimensions than the number of features it has to represent. For example, a toy LLM with 2 neurons has to represent 6 different language features. As a result, we often observe that a single neuron has to activate for multiple features. For an explanation and a more detailed definition of superposition, please refer to my previous blog post: “Superposition: What Makes It Difficult to Explain Neural Networks.”
In this blog post, we take a step further: let us try to disentangle some superposed features. I will present a methodology called the sparse autoencoder to decompose a complex neural network, especially an LLM, into interpretable features, with an example on language features.
A sparse autoencoder is, by definition, an autoencoder with sparsity deliberately introduced in the activations of its hidden layer. With a fairly simple structure and a lightweight training process, its goal is to decompose a complex neural network and uncover its features in a way that is more interpretable and more understandable to humans.
Imagine that you have a trained neural network. The autoencoder is not part of the model's training process itself; it is a post-hoc analysis tool. The original model has its own activations, and these activations are collected afterwards and then used as input data for the sparse autoencoder.
For example, suppose your original model is a neural network with one hidden layer of 5 neurons, trained on a dataset of 5000 samples. You collect the 5-dimensional activation values of that hidden layer for all 5000 training samples, and these activations become the input to your sparse autoencoder.
The sparse autoencoder then learns a new sparse representation of these activations. The encoder maps the original MLP activations into a new vector space with more representation dimensions. Going back to the simple 5-neuron example above, we could map the activations into a space of, say, 20 features. With luck, the sparse autoencoder will effectively decompose the original MLP activations into a representation that is easier to interpret and analyze.
Sparsity is important in the autoencoder because it is what allows the autoencoder to “disentangle” the features, with more “freedom” than in a dense, superposed space. Without sparsity, the autoencoder would likely learn a trivial compression without discovering any meaningful features.
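To make the shape bookkeeping concrete, here is a minimal sketch of this collect-then-encode idea. The numbers simply follow the 5-neuron, 5000-sample, 20-feature example above, and the random tensor is only a stand-in for real collected activations; the full sparse autoencoder is built later in this post.

import torch

# Illustrative sizes taken from the example above (not from a real model)
num_samples, mlp_dim, sae_dim = 5000, 5, 20

# Stand-in for the hidden-layer activations collected from the original model
mlp_activations = torch.randn(num_samples, mlp_dim)

# The encoder of a sparse autoencoder maps each 5-dimensional activation
# into a wider (and hopefully sparser) 20-dimensional feature space
encoder = torch.nn.Sequential(torch.nn.Linear(mlp_dim, sae_dim), torch.nn.ReLU())
features = encoder(mlp_activations)
print(features.shape)  # torch.Size([5000, 20])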
Model
We are now going to build our toy model. I ask readers to keep in mind that this model is not realistic, and even a bit silly in practice, but it is enough to show how we build a sparse autoencoder and capture some features.
Suppose that we have a language model with a particular hidden layer whose activation has four dimensions. Suppose also that the training dataset contains the following tokens: “cat”, “happy cat”, “dog”, “loyal dog”, “not cat”, “not dog”, “robot” and “ai assistant”, with the following activation values.
import torch

data = torch.tensor([
    # Cat categories
    [0.8, 0.3, 0.1, 0.05],     # "cat"
    [0.82, 0.32, 0.12, 0.06],  # "happy cat" (similar to "cat")
    # Dog categories
    [0.7, 0.2, 0.05, 0.2],     # "dog"
    [0.75, 0.3, 0.1, 0.25],    # "loyal dog" (similar to "dog")
    # "Not animal" categories
    [0.05, 0.9, 0.4, 0.4],     # "not cat"
    [0.15, 0.85, 0.35, 0.5],   # "not dog"
    # Robot and AI assistant (more distinct in 4D space)
    [0.0, 0.7, 0.9, 0.8],      # "robot"
    [0.1, 0.6, 0.85, 0.75]     # "ai assistant"
], dtype=torch.float32)
Sparse autoencoder construction
Now we build the sparse autoencoder with the following code:
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(SparseAutoencoder, self).__init__()
        # Encoder: one linear layer followed by ReLU, mapping the input
        # activations to a (wider) hidden feature representation
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU()
        )
        # Decoder: a single linear layer that reconstructs the input
        self.decoder = nn.Sequential(
            nn.Linear(hidden_dim, input_dim)
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return encoded, decoded
From the code above, we see that the encoder has only one fully connected linear layer, mapping the input to a hidden representation of size hidden_dim, followed by a ReLU activation. The decoder uses just one linear layer to reconstruct the input. Note that the absence of a ReLU activation in the decoder is intentional for our particular reconstruction case, because the reconstruction may contain real-valued and potentially negative data. A ReLU would force the output to remain non-negative, which is not desirable for our reconstruction.
We train the model using the code below. Here, the loss function has two parts: the reconstruction loss, which measures how accurately the autoencoder reconstructs the input data, and a sparsity loss (with a weight), which encourages sparsity in the encoder's output.
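Before the training loop, the model, loss, and optimizer need to be instantiated. The setup below is one possible sketch; the hidden dimension, learning rate, number of epochs, and sparsity weight are illustrative choices rather than carefully tuned values.

# Setup (hyperparameter values are illustrative assumptions)
model = SparseAutoencoder(input_dim=4, hidden_dim=8)  # 4-dim activations, 8 learned features (assumed)
criterion = nn.MSELoss()                              # reconstruction loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
num_epochs = 2000
sparsity_weight = 0.1                                 # weight of the L1 penalty on encoded features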
# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()

    # Forward pass
    encoded, decoded = model(data)

    # Reconstruction loss
    reconstruction_loss = criterion(decoded, data)

    # Sparsity penalty (L1 regularization on the encoded features)
    sparsity_loss = torch.mean(torch.abs(encoded))

    # Total loss
    loss = reconstruction_loss + sparsity_weight * sparsity_loss

    # Backward pass and optimization
    loss.backward()
    optimizer.step()
Now we can take a look at the result. We have plotted the encoder's output values for each activation of the original model. Remember that the input tokens are “cat”, “happy cat”, “dog”, “loyal dog”, “not cat”, “not dog”, “robot” and “ai assistant”.
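A minimal sketch of how such an inspection could be produced is shown below; the bar-chart layout is one possible choice and assumes matplotlib is available.

import matplotlib.pyplot as plt

tokens = ["cat", "happy cat", "dog", "loyal dog",
          "not cat", "not dog", "robot", "ai assistant"]

# Learned feature activations for every token (no gradients needed after training)
with torch.no_grad():
    encoded, _ = model(data)

# One bar chart per learned feature, showing its activation on each token
num_features = encoded.shape[1]
fig, axes = plt.subplots(num_features, 1, figsize=(6, 2 * num_features), sharex=True)
for i, ax in enumerate(axes):
    ax.bar(tokens, encoded[:, i].numpy())
    ax.set_ylabel(f"feature {i + 1}")
axes[-1].tick_params(axis="x", rotation=45)
plt.tight_layout()
plt.show()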
Although the original model was designed with a very simple architecture and without any deep consideration, the autoencoder has still captured meaningful features of this trivial model. According to the plot above, we can observe at least four features that appear to be learned by the encoder.
Consider Feature 1 first. This feature has large activation values on the following 4 tokens: “cat”, “happy cat”, “dog” and “loyal dog”. The result suggests that Feature 1 may be related to “animals” or “pets”. Feature 2 is another interesting example: it is activated on the two tokens “robot” and “ai assistant”. We therefore assume this feature has something to do with “artificial and robotic” concepts, indicating the model's understanding of technological contexts. Feature 3 has activation on 4 tokens: “not cat”, “not dog”, “robot” and “ai assistant”, so it is possibly a “not an animal” feature.
Admittedly, the original model is not a real model trained on real-world text; it was artificially designed under the assumption that similar tokens have some similarity in the activation vector space. However, the results still offer interesting insights: the sparse autoencoder managed to surface meaningful, human-friendly features, or real-world concepts.
The simple result in this blog post suggests that a sparse autoencoder can effectively help extract high-level, interpretable features from complex neural networks such as LLMs.
For readers interested in a real-world implementation of sparse autoencoders, I recommend this article, where a sparse autoencoder was trained to interpret a real large language model with 512 neurons. That study provides a real application of sparse autoencoders in the context of LLM interpretability.
Finally, I provide the Google Colab notebook here with my detailed implementation mentioned in this article.