As the title suggests, in this article I will implement the Transformer architecture from scratch with PyTorch; yes, literally from scratch. Before I get into the details, let me give you a brief overview of the architecture. The Transformer was first introduced in the paper "Attention Is All You Need" by Vaswani et al. back in 2017 (1). This neural network model is designed to perform seq2seq (sequence-to-sequence) tasks, where the model accepts a sequence as input and returns another sequence as output, as in machine translation and question answering.
Before the Transformer was introduced, seq2seq tasks were typically tackled with RNN-based models such as LSTM or GRU. These models can capture context, but only sequentially, one time step at a time. That makes long-term dependencies hard to learn, especially when the important context lies many time steps behind the current one. The Transformer, in contrast, can freely attend to any part of the sequence it deems important, without being constrained by sequential processing.
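To make that contrast concrete, here is a minimal sketch (not part of the implementation we will build later) comparing the two styles of processing. The attention here is deliberately simplified: it uses the raw embeddings as queries, keys, and values, with no learned projection matrices, just to show that every position scores every other position in a single matrix multiplication.

```python
import torch
import torch.nn.functional as F

# Toy input: batch of 1, 10 time steps, embedding size 16.
seq_len, d_model = 10, 16
x = torch.randn(1, seq_len, d_model)

# RNN-style processing: an LSTM consumes the tokens one after another,
# so information from early steps must survive every intermediate
# hidden state before it can influence later ones.
lstm = torch.nn.LSTM(input_size=d_model, hidden_size=d_model, batch_first=True)
h = None
for t in range(seq_len):
    out, h = lstm(x[:, t:t+1, :], h)  # one time step at a time

# Attention-style processing (simplified, no learned projections):
# every position attends to every other position directly, in one
# matrix multiplication, regardless of the distance between them.
scores = x @ x.transpose(1, 2) / d_model**0.5  # (1, seq_len, seq_len)
weights = F.softmax(scores, dim=-1)
attended = weights @ x                         # (1, seq_len, d_model)
print(weights[0, -1])  # last token's attention over all 10 positions
```

Notice that the LSTM needs a Python loop over time steps, while the attention weights for all pairs of positions come out of a single parallel computation; this is exactly the property the full architecture builds on.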