Following the story of Zephyra, Anthropic AI delved deeper into the quest of extracting meaningful features from a model. The idea behind this research lies in understanding how the different components of a neural network interact with each other and what role each component plays.
According to the paper “Towards Monosemanticity: Decomposing Language Models With Dictionary Learning”, a Sparse Autoencoder can successfully extract meaningful features from a model. In other words, Sparse Autoencoders help solve the problem of “polysemanticity” (neural activations that correspond to multiple meanings/interpretations at once) by focusing on sparsely activated features that each carry a single interpretation; that is, they are more monosemantic.
To understand how all this is done, we have these beautiful works of art on Autoencoders and Sparse Autoencoders by Prof. Tom Yeh, which explain the behind-the-scenes workings of these phenomenal mechanisms.
(All images below, unless otherwise noted, are by Professor Tom Yeh from the LinkedIn posts mentioned above, which I have edited with his permission.)
To get started, let's first explore what an autoencoder is and how it works.
Imagine that a writer has a desk full of different papers: some are notes for the story he is writing, some are copies of final drafts, and others are illustrations for his action-packed story. Now, in the midst of this chaos, it is difficult to find the important parts, even more so when the writer is in a hurry and the editor is on the phone demanding the book in two days. Fortunately, the writer has a very efficient assistant: this assistant makes sure the messy desk is cleaned regularly, groups similar items, organizes things, and puts them in their right place. And when necessary, the assistant retrieves the correct items for the writer, helping him meet the deadlines set by his editor.
Well, the name of this assistant is Autoencoder. It has two main functions: encoding and decoding. Encoding refers to condensing the input data and extracting the essential features (organization). Decoding is the process of reconstructing the original data from the encoded representation, with the goal of minimizing the loss of information (recovery).
Now let's see how this assistant works.
Given: four training examples X1, X2, X3, X4.
(1) Auto
The first step is to copy the training examples to the targets Y'. The Autoencoder's job is to reconstruct these training examples. Since the targets are the training examples themselves, the word 'Auto' is used, which in Greek means 'self'.
(2) Encoder: Layer 1 + ReLU
As we have seen in all our previous models, a simple weight matrix and bias coupled with ReLU is powerful and can work wonders. Therefore, using the first encoder layer we reduce the size of the original feature set from 4×4 to 3×4.
A quick summary:
Linear transformation: The input embedding vector is multiplied by the weight matrix W and then added to the bias vector b:
z = W·x + b, where W is the weight matrix, x is our word embedding and b is the bias vector.
ReLU activation function : Next, we apply ReLU to this intermediate z.
ReLU returns the element-wise maximum of the input and zero. Mathematically, h = max(0, z).
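To make this step concrete, here is a minimal NumPy sketch of the linear transformation plus ReLU, following the 4×4 → 3×4 shapes above; the numbers and the random weights are purely illustrative, not the values from the worksheet:

```python
import numpy as np

# Four training examples as the columns of a 4x4 matrix (toy values).
X = np.array([[1., 0., 2., 1.],
              [0., 3., 1., 0.],
              [2., 1., 0., 2.],
              [1., 0., 1., 3.]])

# Encoder layer 1: a 3x4 weight matrix and a 3x1 bias reduce 4x4 -> 3x4.
W1 = np.random.randn(3, 4)
b1 = np.random.randn(3, 1)

z = W1 @ X + b1          # linear transformation: z = W·x + b
h = np.maximum(0, z)     # ReLU: element-wise max of z and 0

print(h.shape)           # (3, 4)
```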
(3) Encoder: Layer 2 + ReLU
The output of the previous layer is processed by the second encoder layer, which further reduces the size to 2×4. This is where the relevant feature extraction occurs. This layer is also called the “bottleneck”, since its outputs have far fewer features than the inputs.
(4) Decoder: Layer 1 + ReLU
Once the encoding process is complete, the next step is to decode the relevant features and reconstruct the final output. To do so, we multiply the features from the last step by the corresponding weights, add the biases, and apply ReLU. The result is a 3×4 matrix.
(5) Decoder: Layer 2 + ReLU
A second decoder layer (weights, biases + ReLU) is applied to the above output to give the final result: the reconstructed 4×4 matrix. We do this to return to the original dimensions and be able to compare the result with our original targets.
(6) Loss gradients and backpropagation
Once the output of the decoder layer is obtained, we calculate the gradients of the mean squared error (MSE) between the outputs (Y) and the targets (Y'). For this we compute 2*(Y-Y'), which gives us the final gradients that trigger the backpropagation process and update the weights and biases accordingly.
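Putting steps (1) to (6) together, here is a minimal NumPy sketch of this Autoencoder; the weights are random placeholders, and only the output-layer gradient 2*(Y-Y') is shown, with the full backpropagation through the earlier layers omitted for brevity:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# Training examples (columns) and targets: Y' = X, since the task is reconstruction.
X = np.random.rand(4, 4)
Y_target = X.copy()

# Encoder: 4 -> 3 -> 2 features, Decoder: 2 -> 3 -> 4 features (random placeholders).
W_e1, b_e1 = np.random.randn(3, 4), np.random.randn(3, 1)
W_e2, b_e2 = np.random.randn(2, 3), np.random.randn(2, 1)
W_d1, b_d1 = np.random.randn(3, 2), np.random.randn(3, 1)
W_d2, b_d2 = np.random.randn(4, 3), np.random.randn(4, 1)

# Forward pass.
h1 = relu(W_e1 @ X + b_e1)       # encoder layer 1: 3 features per example
h2 = relu(W_e2 @ h1 + b_e2)      # encoder layer 2 (bottleneck): 2 features per example
h3 = relu(W_d1 @ h2 + b_d1)      # decoder layer 1: back to 3 features
Y  = relu(W_d2 @ h3 + b_d2)      # decoder layer 2: reconstructed 4x4 output

# MSE loss and its gradient with respect to the output.
mse = np.mean((Y - Y_target) ** 2)
dY  = 2 * (Y - Y_target)         # the gradient that kicks off backpropagation

print(mse, dY.shape)
```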
Now that we understand how the Autoencoder works, it's time to explore how a small variation of it is capable of achieving interpretability for large language models (LLMs).
To begin, suppose we are given:
- The output of a transformer after the feedforward layer has processed it; that is, suppose we have the model activations for five tokens (x). They are useful, but they do not shed light on how the model arrives at its decisions or makes its predictions.
The main question here is:
Is it possible to map each activation (3D) to a higher dimensional space (6D) that helps with understanding?
(1) Encoder: Linear layer
The first step in the encoder is to multiply the input x by the encoder weights and add the biases (as done in the first step of an Autoencoder).
(2) Encoder: ReLU
The next sub-step is to apply the ReLU activation function to add non-linearity and suppress negative activations. This suppression sets many features to 0, which enables sparsity: generating sparse, interpretable features f.
Interpretability occurs when we have only one or two positive features. For instance, if we examine f6, we can see that x2 and x3 are positive, and it can be said that both have 'Mountain' in common.
(3) Decoder: Reconstruction
Once we are done with the encoder, we proceed to the decoder step. We multiply f by the decoder weights and add the biases. This produces x', which is the reconstruction of x based on the interpretable features.
As in an Autoencoder, we want x' to be as close to x as possible. To ensure this, further training is essential.
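To make the shapes concrete, here is a minimal NumPy sketch of the encoder and decoder steps (1) to (3), assuming 3-dimensional activations for five tokens expanded into 6 features; the weights are random placeholders, not the trained values from the illustrations:

```python
import numpy as np

# Model activations for five tokens, each 3-dimensional (columns = tokens).
x = np.random.rand(3, 5)

# Sparse Autoencoder parameters: expand 3D activations into 6D features.
W_enc, b_enc = np.random.randn(6, 3), np.random.randn(6, 1)
W_dec, b_dec = np.random.randn(3, 6), np.random.randn(3, 1)

# Encoder: linear layer + ReLU -> sparse, interpretable features f.
f = np.maximum(0, W_enc @ x + b_enc)   # many entries become exactly 0

# Decoder: reconstruct x' from the features.
x_rec = W_dec @ f + b_dec

print(f.shape, x_rec.shape)            # (6, 5) (3, 5)
```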
(4) Decoder: Weights
As an intermediate step, we calculate the L2 norm for each of the decoder weights and set them aside to use later.
L2 norm
Also known as the Euclidean norm, the L2 norm calculates the magnitude of a vector using the formula: ||x||₂ = √(Σᵢ xᵢ²).
In other words, we add the squares of each component and then take the square root of the result. This norm provides a simple way to quantify the length of a vector, or a distance, in Euclidean space.
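As a quick worked example (with made-up numbers): for the vector x = (3, 4), ||x||₂ = √(3² + 4²) = √(9 + 16) = √25 = 5.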
As mentioned above, a Sparse Autoencoder requires further training to bring the reconstruction x' closer to x. To illustrate this, we proceed with the steps below:
(5) Sparsity: L1 Loss
The goal here is to get as many values as possible close to zero or exactly zero. We do this by introducing an L1 sparsity penalty on the absolute values of the weights; the central idea is that we want the sum to be as small as possible.
L1 Loss
The L1 loss is calculated as the sum of the absolute values of the weights: L1 = λΣ|w|, where λ is a regularization parameter.
This encourages many weights to become zero, simplifying the model and thus improving interpretability.
In other words, L1 helps focus on the most relevant features while avoiding overfitting, improving model generalization, and reducing computational complexity.
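Here is a minimal sketch of this penalty, assuming (as the gradient step in (6) below suggests) that it is applied to the sparse feature activations f; the value of λ and the features are purely illustrative:

```python
import numpy as np

lam = 0.1                                   # λ: regularization strength (illustrative value)
f = np.maximum(0, np.random.randn(6, 5))    # sparse features from the encoder (toy values)

# L1 penalty: λ times the sum of absolute values; the smaller, the sparser.
l1_penalty = lam * np.sum(np.abs(f))
# In training, this penalty is added to the MSE reconstruction loss.
print(l1_penalty)
```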
(6) Sparsity: Gradient
The next step is to calculate the L1 gradients, which are -1 for positive values. Thus, for all values of f > 0, the gradient is set to -1 (the update direction that pushes positive activations toward zero).
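A minimal sketch of this gradient step, following the walkthrough's convention of assigning -1 to positive feature values; the f values here are toy numbers:

```python
import numpy as np

f = np.array([[0.0, 1.2, 0.0],
              [0.7, 0.0, 2.5]])             # toy sparse features

# As described above: -1 wherever f > 0, and 0 elsewhere.
l1_grad = np.where(f > 0, -1.0, 0.0)
print(l1_grad)
```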