There has been new development in our neighborhood.
A 'Robo-Truck', as my son likes to call it, has made its new home on our street.
It's a Tesla Cyber Truck and I've tried to explain that name many times to my son, but he insists on calling it Robo-Truck. Now every time I look at Robo-Truck and hear that name, it reminds me of the movie Transformers, where the robots could transform to and from cars.
And isn't it strange that Transformers as we know them today could be on their way to powering these Robo-Trucks? It's almost a moment of closing the circle. But where am I going with all this?
Well, I'm heading to the destination: Transformers. Not those of robot cars but those of neural networks. And you are invited!
What are transformers?
Transformers are essentially neural networks. Neural networks that specialize in learning context from data.
But what makes them special is the presence of mechanisms that eliminate the need for labeled data sets and for convolution or recurrence in the network.
What are these special mechanisms?
There are many. But the two mechanisms that are really the force behind transformers are attention weighting and feed-forward networks (FFN).
What is attention weighting?
Attention weighting is a technique by which the model learns which part of the incoming sequence to focus on. Think of it as the 'Eye of Sauron' that scans everything at all times and sheds light on the parts that are relevant.
Fun fact: Apparently, the researchers had almost called the Transformer model 'Attention-Net', since attention is such a crucial part of it.
What is FFN?
In the context of transformers, the FFN is essentially a regular multilayer perceptron that acts on each position's vector independently. Combined with attention, it produces the right "position-dimension" combination: attention mixes information across positions, and the FFN mixes it across dimensions.
So without further ado, let's dive into how attention weighting and FFN make transformers so powerful.
This discussion is based on Prof. Tom Yeh's wonderful AI by Hand series on Transformers. (All images below, unless otherwise noted, are by Professor Tom Yeh from the LinkedIn posts mentioned above, which I have edited with his permission.)
So, here we go:
Key ideas here: attention weighting and the feed-forward network (FFN).
With this in mind, suppose we are given:
- 5 input features from a previous block (here a 3×5 matrix, where X1, X2, X3, X4 and X5 are the feature columns and each of the three rows holds one dimension of those features).
(1) Get attention weight matrix A
The first step of the process is to obtain the attention weight matrix A. This is the part where the self-attention mechanism comes into play. What you are trying to do is find the most relevant parts in this input sequence.
We do this by feeding the input features into the query-key (QK) module. For simplicity, the details of the QK module are not included here.
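(For the curious, here is a minimal NumPy sketch of one common way such a 5×5 attention weight matrix could be produced from the 3×5 input: scaled dot-product attention with query/key projections. The matrices Wq and Wk below are hypothetical stand-ins for illustration, not the values used in the hand exercise.)

```python
import numpy as np

rng = np.random.default_rng(0)

# X: 3x5 input, 5 feature columns (X1..X5), 3 dimensions per feature
X = rng.integers(0, 4, size=(3, 5)).astype(float)

d_k = 3                              # projection size (assumed)
Wq = rng.normal(size=(d_k, 3))       # hypothetical query projection
Wk = rng.normal(size=(d_k, 3))       # hypothetical key projection

Q = Wq @ X                           # queries: 3x5 -> d_k x 5
K = Wk @ X                           # keys:    3x5 -> d_k x 5

scores = K.T @ Q / np.sqrt(d_k)      # 5x5 compatibility scores

# Softmax over each column, so column j holds the weights used
# to mix the input features for output position j.
A = np.exp(scores - scores.max(axis=0, keepdims=True))
A = A / A.sum(axis=0, keepdims=True)

print(A.shape)        # (5, 5)
print(A.sum(axis=0))  # each column sums to 1
```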
(2) Attention weighting
Once we have the attention weight matrix A (5×5), we multiply the input features (3×5) by it to obtain the attention-weighted features Z.
The important part here is that the features are combined based on their positions P1, P2 and P3, i.e. horizontally.
To break it down further, consider this calculation done row by row:
P1 x A1 = Z1 → Position (1,1) = 11
P1 x A2 = Z2 → Position (1,2) = 6
P1 x A3 = Z3 → Position (1,3) = 7
P1 x A4 = Z4 → Position (1,4) = 7
P1 x A5 = Z5 → Position (1,5) = 5
...
P2 x A4 = Z4 → Position (2,4) = 3
...and the same pattern continues for the rest of P2 and for P3.
It seems a little tedious at first, but follow the row-by-column multiplication and the result is quite simple.
What is interesting is that, because of the way our attention weight matrix A is organized, the new features Z turn out to be combinations of X, as below:
Z1 = X1 + X2
Z2 = X2 + X3
Z3 = X3 + X4
Z4 = X4 + X5
Z5 = X5 + X1
(Hint: Look at the positions of the 0s and 1s in the matrix A.)
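To see this pattern concretely, here is a quick NumPy check using a hypothetical 3×5 input X and the 0/1 attention matrix A implied by the combinations above:

```python
import numpy as np

# Hypothetical 3x5 input: columns are the features X1..X5,
# rows are the three dimensions P1..P3 of each feature.
X = np.array([[4, 7, 3, 2, 5],
              [2, 1, 6, 4, 0],
              [5, 3, 2, 1, 3]], dtype=float)

# Attention weight matrix A (5x5): column j picks which inputs
# are summed to form the new feature Zj.
A = np.array([[1, 0, 0, 0, 1],   # X1 contributes to Z1 and Z5
              [1, 1, 0, 0, 0],   # X2 contributes to Z1 and Z2
              [0, 1, 1, 0, 0],   # X3 contributes to Z2 and Z3
              [0, 0, 1, 1, 0],   # X4 contributes to Z3 and Z4
              [0, 0, 0, 1, 1]],  # X5 contributes to Z4 and Z5
             dtype=float)

Z = X @ A                        # attention weighting, result is 3x5

# Z1 = X1 + X2, Z2 = X2 + X3, ..., Z5 = X5 + X1
print(np.allclose(Z[:, 0], X[:, 0] + X[:, 1]))  # True
print(np.allclose(Z[:, 4], X[:, 4] + X[:, 0]))  # True
```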
(3) FFN: First layer
The next step is to feed the attention-weighted features into the feed-forward neural network.
However, the difference here is that values are combined across dimensions, unlike the previous step, which combined across positions. It is done as follows:
What this does is it looks at the data from the other direction.
– In the attention step, we combine our inputs based on the original features to obtain new features.
– In this FFN step, we combine values across their dimensions, i.e. vertically, to obtain our new matrix.
For example, the first-row weights dotted with the first column of Z give entry (1,1) of the new matrix: W(1,1) x Z(1,1) + W(1,2) x Z(2,1) + W(1,3) x Z(3,1) + b(1) = 11, where W is the layer's weight matrix and b is the bias.
Once again, simple row-by-column operations to the rescue. Notice that here the number of dimensions of the new matrix increases to 4.
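As a rough sketch of this first FFN layer in NumPy (the weights W1 and bias b1 are hypothetical and only meant to show the shapes):

```python
import numpy as np

rng = np.random.default_rng(1)

# Attention-weighted features from the previous step (3x5), made up here.
Z = rng.integers(0, 5, size=(3, 5)).astype(float)

W1 = rng.normal(size=(4, 3))   # hypothetical first-layer weights: 3 dims in, 4 dims out
b1 = rng.normal(size=(4, 1))   # hypothetical bias, one value per output dimension

# Each column of Z (one position) is mixed across its 3 dimensions,
# so the result has 4 dimensions per position: (4x3)@(3x5)+(4x1) -> 4x5
H = W1 @ Z + b1
print(H.shape)  # (4, 5)
```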
(4) ReLU
Our favorite step: ReLU, where the negative values in the matrix above are set to zero and the positive values remain unchanged.
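In code this step is a one-liner; a small sketch with a hypothetical 4×5 matrix from the previous layer:

```python
import numpy as np

# Hypothetical output of the first FFN layer (4x5), with some negatives.
H = np.array([[ 2., -1.,  4.,  0., -3.],
              [-2.,  5.,  1., -1.,  2.],
              [ 3.,  0., -4.,  2.,  1.],
              [-1.,  2.,  3., -2.,  4.]])

# ReLU: negatives become 0, positives are unchanged.
H_relu = np.maximum(0.0, H)
print(H_relu)
```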
(5) FFN: Second layer
Finally, we pass it through the second layer, where the dimensionality of the resulting matrix is reduced from 4 back to 3.
The output here is ready to be passed to the next block (note its similarity to the original matrix), and the entire process is repeated from the beginning.
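And a matching sketch of the second layer, again with hypothetical weights, just to show the dimensionality coming back down from 4 to 3:

```python
import numpy as np

rng = np.random.default_rng(2)

# ReLU output from the previous step (4x5), made up here.
H_relu = np.maximum(0.0, rng.normal(size=(4, 5)))

W2 = rng.normal(size=(3, 4))   # hypothetical second-layer weights: 4 dims back down to 3
b2 = rng.normal(size=(3, 1))   # hypothetical bias

out = W2 @ H_relu + b2         # (3x4)@(4x5)+(3x1) -> 3x5
print(out.shape)               # (3, 5), same shape as the original input X
```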
The two key things to remember here are:
- The attention layer combines across positions (horizontally).
- The feed-forward layer combines across dimensions (vertically).
And this is the secret ingredient behind the power of transformers: the ability to analyze data from different directions.
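Putting the whole walkthrough together, here is a compact end-to-end sketch of one such block in NumPy. The projection sizes and random weights are assumptions for illustration, not the numbers from the hand calculation:

```python
import numpy as np

def softmax_cols(s):
    """Softmax over each column (i.e. over input positions)."""
    e = np.exp(s - s.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def block(X, Wq, Wk, W1, b1, W2, b2):
    """One attention-weighting + FFN block, following the steps above."""
    d_k = Wq.shape[0]
    A = softmax_cols((Wk @ X).T @ (Wq @ X) / np.sqrt(d_k))  # (1) attention weights, 5x5
    Z = X @ A                                               # (2) mix across positions
    H = np.maximum(0.0, W1 @ Z + b1)                        # (3)+(4) FFN layer 1 + ReLU
    return W2 @ H + b2                                      # (5) FFN layer 2, back to 3x5

rng = np.random.default_rng(3)
X = rng.integers(0, 4, size=(3, 5)).astype(float)   # hypothetical 3x5 input
params = dict(
    Wq=rng.normal(size=(3, 3)), Wk=rng.normal(size=(3, 3)),
    W1=rng.normal(size=(4, 3)), b1=rng.normal(size=(4, 1)),
    W2=rng.normal(size=(3, 4)), b2=rng.normal(size=(3, 1)),
)
out = block(X, **params)
print(out.shape)  # (3, 5) -- ready for the next block
```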
To summarize the ideas above, here are the key points:
- The transformer architecture can be seen as a combination of the attention layer and the feed-forward layer.
- The attention layer combines features to produce a new feature. For example, think of combining two robots, Robo-Truck and Optimus Prime, to get a new robot, Robtimus Prime.
- The feed-forward layer (FFN) combines the parts of a feature to produce new parts. For example, combining Robo-Truck's wheels and Optimus Prime's ion laser could produce a laser on wheels.
Neural networks have been around for quite some time. Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) had been reigning supreme, but things took a rather bumpy turn once Transformers were introduced in 2017. And since then, the field of AI has grown at an exponential rate, with new models, new benchmarks and new learnings arriving every day. And only time will tell if this phenomenal idea will one day pave the way for something even bigger: a true 'Transformer'.
But for now, it would not be wrong to say that an idea can really transform how we live!
PS: If you want to do this exercise on your own, here is the blank template for you to use.
Now have fun and create your own Robtimus Prime!