It would be ideal if the world of neural networks exhibited a one-to-one relationship: each neuron activates on one and only one feature. In such a world, interpreting the model would be simple: this neuron activates for the dog-ear feature, and that neuron activates for the car wheel. Unfortunately, that is not the case. In reality, a model of dimension d often needs to represent m features, where d < m. This is when we observe the phenomenon of superposition.
In the context of machine learning, superposition refers to the phenomenon in which a single neuron in a model represents multiple overlapping features rather than a single, distinct one. For example, InceptionV1 contains a neuron that responds to cat faces, fronts of cars, and cat legs (1). This means that different features can be superimposed in the same neuron or circuit.
The existence of superposition makes model explainability challenging, especially in deep learning models, where neurons in hidden layers represent complex combinations of patterns rather than being associated with simple, straightforward features.
In this blog post, we will present a simple example of superposition, with detailed Python implementations in this notebook.
We begin this section by discussing the term “feature.”
In tabular data, there is little ambiguity in defining what a feature is. For example, when predicting wine quality from a tabular dataset, the features might be alcohol percentage, year of production, etc.
However, defining features becomes more complex when dealing with non-tabular data, such as images or text. In these cases, there is no universally accepted definition of a feature. Generally speaking, a feature can be considered any property of the input that is recognizable to most humans. For example, a feature in a large language model (LLM) might be whether a word is in French.
Superposition occurs when the number of features is greater than the model's dimensionality. We argue that two necessary conditions must be met for superposition to occur:
- Nonlinearity: Neural networks typically include nonlinear activation functions, such as sigmoid or ReLU, at the end of each hidden layer. These activation functions allow the network to map inputs to outputs in a nonlinear way, capturing more complex relationships between features. Without nonlinearity, the model would behave as a simple linear transformation, where the features remain linearly separable, with no possibility of compressing dimensions through superposition (a short sketch after this list illustrates this).
- Feature sparsity: Feature sparsity means that only a small subset of features is non-zero at any given time. For example, in language models, many features are not present at the same time: the same word cannot be both is_french and is_other_languages. If all features were dense, we would expect significant interference between overlapping representations, making it very difficult for the model to decode individual features.
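To make the nonlinearity point concrete, here is a minimal numpy sketch (the 40/5 sizes mirror the toy example below; the variable names are illustrative assumptions). Stacking a linear encoder and a linear decoder collapses into a single matrix whose rank is at most the hidden dimension, so a purely linear model can faithfully reconstruct at most that many feature directions:

import numpy as np

n, m = 40, 5                          # number of features, hidden dimension
W_enc = np.random.randn(m, n)         # linear "encoder": n -> m
W_dec = np.random.randn(n, m)         # linear "decoder": m -> n

# Without a nonlinearity, the whole encode-decode pipeline is one linear map.
combined = W_dec @ W_enc              # shape (n, n)
print(np.linalg.matrix_rank(combined))  # at most m = 5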
Synthetic data set
Let's consider a toy example of 40 features with linearly decreasing feature importance: the first feature has an importance of 1, the last feature has an importance of 0.1, and the importance of the remaining features is evenly spaced between these two values.
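Concretely, this importance vector can be built with np.linspace (a minimal sketch; the variable name is an assumption reused in later snippets):

import numpy as np

n = 40
# Linearly decreasing importance: 1.0 for the first feature, 0.1 for the last.
feature_importance = np.linspace(1.0, 0.1, n)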
We then generate a synthetic data set with the following code:
import numpy as np

def generate_synthetic_dataset(dim_sample, num_sample, sparsity):
    """Generate a synthetic dataset with the given dimensionality and sparsity."""
    dataset = []
    for _ in range(num_sample):
        x = np.random.uniform(0, 1, dim_sample)
        mask = np.random.choice([0, 1], size=dim_sample, p=[sparsity, 1 - sparsity])
        x = x * mask  # Apply sparsity: each feature is zeroed with probability `sparsity`
        dataset.append(x)
    return np.array(dataset)
This function creates a synthetic dataset with the given number of dimensions, which is 40 in our case. For each dimension, a random value is generated from a uniform distribution on [0, 1). The sparsity parameter, which varies between 0 and 1, controls the percentage of active features in each sample. For example, when sparsity is 0.8, each feature of a sample has an 80% chance of being zero. The function applies a mask vector to enforce the sparsity setting.
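For instance, a call like the following (the sample count is an arbitrary choice) produces a matrix in which roughly 90% of the entries are zero:

dataset = generate_synthetic_dataset(40, 100000, 0.9)
print(dataset.shape)          # (100000, 40)
print((dataset == 0).mean())  # roughly 0.9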
Linear and ReLU models
We would now like to explore how ReLU-based models lead to the formation of superposition and how different sparsity values change their behavior.
We set up our experiment as follows: we compress the 40-dimensional features into a 5-dimensional space, then reconstruct the vectors by reversing the process. By observing the behavior of these transformations, we hope to see how superposition forms in each case.
For this we consider two very similar models:
- Linear model: A simple linear model whose hidden representation has only 5 dimensions. Remember that we want to represent 40 features, far more than the model has dimensions.
- ReLU model: A model almost identical to the linear one, but with an additional ReLU activation function at the end, introducing a degree of nonlinearity.
Both models are built with PyTorch. For example, we build the ReLU model with the following code:
import numpy as np
import torch
import torch.nn as nn

class ReLUModel(nn.Module):
    def __init__(self, n, m):
        super().__init__()
        self.W = nn.Parameter(torch.randn(m, n) * np.sqrt(1 / n))
        self.b = nn.Parameter(torch.zeros(n))

    def forward(self, x):
        h = torch.relu(torch.matmul(x, self.W.T))  # Project down with ReLU: x (batch, n) @ W.T (n, m) -> h (batch, m)
        x_reconstructed = torch.relu(torch.matmul(h, self.W) + self.b)  # Reconstruction with ReLU
        return x_reconstructed
According to the code, the n-dimensional input vector x is projected into a lower-dimensional space by multiplying it by the m×n weight matrix. We then reconstruct the original vector by mapping it back to the original feature space, adjusted by a bias vector and passed through a ReLU. The linear model has the same structure, with the only difference being that the reconstruction is performed with a purely linear transformation instead of ReLU. We train the models by minimizing the mean squared error between the original and reconstructed feature samples, weighted by feature importance.
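As a reference, here is a minimal training sketch, assuming an Adam optimizer, an arbitrary learning rate and number of epochs, and the feature_importance vector and generate_synthetic_dataset function defined earlier (the linear model is trained the same way, only the class differs):

X = torch.tensor(generate_synthetic_dataset(40, 100000, 0.9), dtype=torch.float32)
importance = torch.tensor(feature_importance, dtype=torch.float32)

model = ReLUModel(n=40, m=5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(500):
    optimizer.zero_grad()
    X_hat = model(X)
    # Mean squared error between original and reconstructed samples, weighted by feature importance
    loss = (importance * (X - X_hat) ** 2).mean()
    loss.backward()
    optimizer.step()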
We trained both models with different sparsity values: 0.1, 0.5, and 0.9, from least sparse to most sparse. We observed several important results.
First, whatever the level of sparsity, ReLU models “compress” features much better than linear models: while linear models primarily capture the features with the highest importance, ReLU models can also pick up less important features thanks to superposition, where a single dimension of the model represents multiple features. Let's get a glimpse of this phenomenon in the following visualizations: for the linear models, the biases are smaller for the top five features (in case you don't remember, feature importance is defined as a linearly decreasing function of the feature order). In contrast, the biases of the ReLU models do not show this ordering and are generally smaller overall.
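A minimal plotting sketch along these lines (matplotlib; linear_model and relu_model are assumed to be the two trained models) could look like this:

import matplotlib.pyplot as plt

for name, model in [("Linear", linear_model), ("ReLU", relu_model)]:
    plt.plot(model.b.detach().numpy(), label=name)  # learned bias per feature
plt.xlabel("Feature index (importance decreases from left to right)")
plt.ylabel("Learned bias")
plt.legend()
plt.show()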
Another important and interesting result is that superposition is much more likely to be observed when the level of feature sparsity is high. To get an idea of this phenomenon, we can visualize the matrix W^T@W, where W is the m×n weight matrix of the models. One can interpret the matrix W^T@W as a measure of how the input features are projected into the lower-dimensional space:
In particular:
- The diagonal of W^T@W represents the “self-similarity” of each feature within the low-dimensional transformed space.
- The off-diagonal elements represent how different features correlate with each other.
We now visualize the W^T@W values below for the linear and ReLU models we built earlier, with two different sparsity levels: 0.1 and 0.9. You can see that when the sparsity value is high, such as 0.9, the off-diagonal elements become much larger than in the case where the sparsity is 0.1 (at low sparsity, in fact, not much difference is seen between the results of the two models). This observation indicates that correlations between different features are easier to learn when sparsity is high.
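A minimal sketch of this kind of visualization (matplotlib; the model variable name is an assumption) is:

import matplotlib.pyplot as plt

def plot_wtw(model, title):
    # W has shape (m, n), so W.T @ W is (n, n): the diagonal measures each feature's
    # self-similarity, the off-diagonal entries measure interference between feature pairs.
    wtw = (model.W.T @ model.W).detach().numpy()
    plt.imshow(wtw, cmap="RdBu", vmin=-1, vmax=1)
    plt.colorbar()
    plt.title(title)
    plt.show()

plot_wtw(relu_model, "ReLU model, sparsity 0.9")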
In this blog post, I presented a simple experiment to introduce superposition in neural networks by comparing linear and ReLU models with fewer dimensions than the features they must represent. We observed that the nonlinearity introduced by the ReLU activation, combined with a certain level of sparsity, can help the model form superposition.
In real-world applications, which are much more complex than my naive example, superposition is an important mechanism for representing complex relationships in neural models, especially in vision models and LLMs.
(1) Zoom In: An Introduction to Circuits. https://distill.pub/2020/circuits/zoom-in/
(2) Toy Models of Superposition. https://transformer-circuits.pub/2022/toy_model/index.html