Since their introduction in 2017, transformers have emerged as a leading force in the field of machine learning, revolutionizing the capabilities of major translation and autocomplete services.
Recently, the popularity of transformers has skyrocketed even further with the arrival of large language models such as OpenAI's ChatGPT and GPT-4, and Meta's LLaMA (ai.meta.com/blog/large-language-model-llama-meta-ai/). These models, which have attracted immense attention and enthusiasm, are built on top of the transformer architecture. By harnessing the power of transformers, they have made notable advances in natural language understanding and generation, exposing these capabilities to the general public.
Despite the many good resources that discuss how transformers work, I found myself in a position where I understood the mechanics mathematically but found it difficult to explain how a transformer works intuitively. After conducting many interviews, talking to my colleagues, and giving a lightning talk on the topic, it seems that a lot of people share this problem!
In this blog post, I will try to provide a high-level explanation of how transformers work without relying on code or math. My goal is to avoid confusing technical jargon and comparisons to previous architectures. While I will try to keep things as simple as possible, this won’t be easy as transformers are quite complex, but I hope it provides a better intuition of what they do and how they do it.
A transformer is a type of neural network architecture that is well suited to tasks that involve processing sequences as inputs. Perhaps the most common example of a sequence in this context is a sentence, which we can think of as an ordered set of words.
The goal of these models is to create a numerical representation for each element within a sequence, encapsulating essential information about the element and its neighboring context. The resulting numerical representations can then be passed to downstream networks, which can leverage this information to perform various tasks, including generation and classification.
By creating such rich representations, these models allow downstream networks to better understand the underlying patterns and relationships within the input sequence, improving their ability to generate coherent and contextually relevant results.
The key advantage of transformers lies in their ability to handle long-range dependencies within sequences, as well as being highly efficient, since they are capable of processing sequences in parallel. This is particularly useful for tasks such as machine translation, sentiment analysis, and text generation.
To feed an input to a transformer, we must first convert it into a sequence of tokens: a sequence of integers that represents our input.
Since transformers were first applied in the NLP domain, let's consider this scenario first. The simplest way to convert a sentence into a sequence of tokens is to define a vocabulary, which acts as a lookup table mapping words to integers; we can reserve a specific number to represent any word that is not contained in this vocabulary, so that we can always assign an integer value.
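As a minimal sketch of this idea, where the toy vocabulary, the sentence, and the reserved unknown-token ID are all invented purely for illustration, such a lookup might look like this:

```python
# A toy word-level vocabulary; real vocabularies contain tens of thousands of entries.
vocab = {"<unk>": 0, "hello": 1, "the": 2, "weather": 3, "nice": 4, "today": 5}

def tokenize(sentence: str) -> list[int]:
    # Map each lowercased word to its integer id, falling back to the reserved
    # <unk> token for anything outside the vocabulary.
    return [vocab.get(word, vocab["<unk>"]) for word in sentence.lower().split()]

print(tokenize("hello the weather nice today drosval"))
# [1, 2, 3, 4, 5, 0] -- the unseen word maps to the unknown token
```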
In practice, this is a naive way of encoding text, since words like cat and cats are treated as completely different tokens, even though they are singular and plural descriptions of the same animal! To overcome this, different tokenization strategies, such as byte-pair encoding, have been devised that split words into smaller chunks before indexing them. Additionally, it is often useful to add special tokens to represent features such as the beginning and end of a sentence, to provide additional context to the model.
Let’s consider the following example to better understand the tokenization process.
“Hello, isn’t the weather nice today in Drosval?”
Drosval is a name generated by GPT-4 using the following prompt: “Can you create a fictitious place name that looks like it belongs to David Gemmell’s Drenai universe?”; it was deliberately chosen as it should not appear in the vocabulary of any trained model.
Using the bert-base-uncased tokenizer from the transformers library, this becomes the following sequence of tokens:
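A minimal sketch of how this can be produced with the Hugging Face transformers library is shown below; the exact integer values you see depend on the tokenizer's vocabulary and version.

```python
from transformers import AutoTokenizer

# Load the pretrained bert-base-uncased tokenizer mentioned above.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentence = "Hello, isn't the weather nice today in Drosval?"

# Encode the sentence into a sequence of integer token ids
# (special tokens are added automatically).
token_ids = tokenizer.encode(sentence)
print(token_ids)
```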
The integers representing each word will change depending on the tokenization and training strategy of the specific model. Decoding this, we can see the word that each token represents:
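Continuing the sketch above, the token IDs can be mapped back to the (sub)word strings they represent:

```python
# Convert the ids back into the (sub)word strings they represent.
tokens = tokenizer.convert_ids_to_tokens(token_ids)
print(tokens)
# Special tokens such as [CLS] and [SEP] appear at the start and end, and
# out-of-vocabulary words are broken into smaller word-piece "chunks".
```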
Interestingly, we can see that this is not the same as our input. Special tokens have been added, our contraction has been split into multiple tokens, and the name of our fictional place is represented by different “chunks.” Since we used the uncased model, we have also lost all capitalization context.
However, although we used a sentence for our example, transformers are not limited to text inputs; this architecture has also demonstrated good results on vision tasks. To convert an image into a sequence, the authors of ViT sliced the image into non-overlapping 16×16 pixel patches and concatenated them into a long vector before passing it to the model. If we were using a transformer in a recommender system, one approach might be to use the item IDs of the last n items browsed by a user as the input to our network. If we can create a meaningful representation of the input tokens for our domain, we can feed it into a transformer network.
Embedding our tokens
Once we have a sequence of integers representing our input, we can convert them into embeddings. Embeddings are a way of representing information that can be easily processed by machine learning algorithms; they aim to capture the meaning of the token being encoded in a compressed format, representing the information as a sequence of numbers. Initially, the embeddings are initialized as sequences of random numbers, and meaningful representations are learned during training. However, these embeddings have an inherent limitation: they do not take into account the context in which the token appears. There are two aspects to this.
Depending on the task, when we embed our tokens, we may also want to preserve their order; this is especially important in domains like NLP, where otherwise we essentially end up with a bag-of-words approach. To overcome this, we apply positional encoding to our embeddings. While there are multiple ways to create positional embeddings, the main idea is that we have another set of embeddings representing the position of each token in the input sequence, which are combined with our token embeddings.
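One minimal sketch of this, assuming learned positional embeddings (just one of the several approaches mentioned above) and illustrative sizes and token IDs, might look like the following:

```python
import torch
import torch.nn as nn

vocab_size, max_seq_len, embed_dim = 30522, 512, 768   # illustrative sizes

token_embedding = nn.Embedding(vocab_size, embed_dim)      # one vector per token id
position_embedding = nn.Embedding(max_seq_len, embed_dim)  # one vector per position

token_ids = torch.tensor([[101, 7592, 1010, 2003, 102]])   # (batch, seq_len); ids are illustrative
positions = torch.arange(token_ids.shape[1]).unsqueeze(0)  # [[0, 1, 2, 3, 4]]

# Combine the two sets of embeddings so that each vector encodes both
# the token's identity and where it appears in the sequence.
embeddings = token_embedding(token_ids) + position_embedding(positions)
print(embeddings.shape)  # torch.Size([1, 5, 768])
```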
The other problem is that tokens can have different meanings depending on the tokens around them. Consider the following sentences:
It’s dark, who turned off the light?
Wow, this pack is really lightweight!
Here, the word light is used in two different contexts, where it has completely different meanings. However, depending on the tokenization strategy, it is likely that the embedding will be the same. In a transformer, this is handled by its attention mechanism.
Perhaps the most important mechanism used by transformer architecture is known as attention, which allows the network to understand which parts of the input sequence are most relevant to the given task. For each token in the sequence, the attention mechanism identifies which other tokens are important for understanding the current token in the given context. Before exploring how this is implemented within a transformer, let’s start simple and try to understand what the attention mechanism is trying to achieve conceptually to develop our intuition.
One way to understand attention is to think of it as a method that replaces each token embedding with an embedding that includes information about its neighboring tokens; instead of using the same embedding for each token regardless of its context. If we knew which tokens were relevant to the current token, one way to capture this context would be to create a weighted average (or, more generally, a linear combination) of these embeddings.
Let’s consider a simple example of what this might look like for one of the sentences we looked at earlier. Before attention is applied, the embeddings in the sequence have no context from their neighbors. Therefore, we can visualize the embedding of the word light as the following linear combination.
Here we can see that our weights are just the identity matrix. After applying our attention mechanism, we would like to learn a weight matrix such that we can express our light embedding in a manner similar to the following.
This time, higher weights are given to the embeddings that correspond to the most relevant parts of the sequence for our chosen token; this should ensure that the most important context is captured in the new embedding vector.
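To make this concrete, here is a tiny sketch in which both the embedding values and the weights are made up purely for illustration:

```python
import numpy as np

# Toy 4-dimensional embeddings for the tokens "this pack is really light" (values are made up).
embeddings = np.array([
    [0.1, 0.3, 0.2, 0.4],   # this
    [0.5, 0.1, 0.7, 0.2],   # pack
    [0.2, 0.2, 0.1, 0.1],   # is
    [0.3, 0.6, 0.4, 0.5],   # really
    [0.9, 0.4, 0.3, 0.8],   # light
])

# Before attention: identity weights, so "light" keeps only its own embedding.
identity_weights = np.eye(5)
print(identity_weights[4] @ embeddings)   # identical to embeddings[4]

# After attention: (hypothetical) learned weights emphasise the most relevant
# neighbours, producing a contextualised embedding for "light".
learned_weights = np.array([0.05, 0.35, 0.05, 0.25, 0.30])  # sums to one
print(learned_weights @ embeddings)       # a weighted average of all five embeddings
```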
Embeddings that contain information about their current context are sometimes known as contextualized embeddings, and this is ultimately what we are trying to create.
Now that we have a high-level understanding of what attention is trying to achieve, let’s explore how it is actually implemented in the next section.
There are multiple types of attention, and the main differences lie in the way the weights used to perform the linear combination are calculated. Here we will consider scaled dot-product attention, as introduced in the original paper, since this is the most common approach. In this section, let’s assume that all of our embeddings have been positionally encoded.
Remembering that our goal is to create contextualized embeddings using linear combinations of our original embeddings, let’s start simple and assume that we can encode all the necessary information into our learned embedding vectors, and all we need to calculate are the weights.
To calculate the weights, we must first determine which tokens are relevant to each other. To achieve this, we need to establish a notion of similarity between two embeddings. One way to represent this similarity is by using the dot product, where we would like to learn embeddings such that higher scores indicate that two words are more similar.
Since, for each token, we need to calculate its relevance to every other token in the sequence, we can generalize this to a matrix multiplication, which gives us our weight matrix; these weights are often called attention scores. To ensure that our weights sum to one, we also apply the SoftMax function. However, since matrix multiplications can produce arbitrarily large numbers, this could result in the SoftMax function returning very small gradients for large attention scores, which can lead to the vanishing gradient problem during training. To counteract this, the attention scores are multiplied by a scaling factor before SoftMax is applied.
Now, to get our contextualized embedding matrix, we can multiply our attention scores by our original embedding matrix; this is equivalent to taking linear combinations of our embeddings.
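A minimal sketch of this simplified process, using the positionally encoded embeddings themselves to compute the scores and the outputs (rather than any separately learned projections), and with random values standing in for learned embeddings, might look like the following:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    exp_x = np.exp(x)
    return exp_x / exp_x.sum(axis=axis, keepdims=True)

def simple_attention(embeddings: np.ndarray) -> np.ndarray:
    """Simplified scaled dot-product attention over positionally encoded embeddings."""
    d = embeddings.shape[-1]
    # Pairwise dot products measure how relevant each token is to every other token,
    # scaled down to keep the SoftMax in a well-behaved range.
    scores = embeddings @ embeddings.T / np.sqrt(d)   # (seq_len, seq_len) attention scores
    # SoftMax turns each row of scores into weights that sum to one.
    weights = softmax(scores, axis=-1)
    # Each output row is a linear combination (weighted average) of the original embeddings.
    return weights @ embeddings

seq_len, embed_dim = 5, 8
embeddings = np.random.randn(seq_len, embed_dim)   # stand-in for learned, positionally encoded embeddings
contextualised = simple_attention(embeddings)
print(contextualised.shape)                        # (5, 8) -- one contextualised embedding per token
```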