The stellar performance of large language models (LLMs) like ChatGPT has amazed the world. The breakthrough was made possible by the invention of the Transformer architecture, which is surprisingly simple and scalable. It is still built from deep neural networks; the main addition is the so-called “attention” mechanism, which contextualizes each word token. Moreover, its high degree of parallelism gives LLMs massive scalability, and thus impressive accuracy after training models with billions of parameters.
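To make the idea of “contextualizing each word token” a little more concrete, here is a minimal sketch of scaled dot-product self-attention in Python with NumPy. The dimensions, weight matrices, and toy inputs are invented for illustration only; a real Transformer adds multiple heads, masking, residual connections, and trained parameters.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token vectors.

    X: (seq_len, d_model) token embeddings; Wq/Wk/Wv: projection matrices.
    Each output row is a context-aware mixture of all value vectors, and the
    matrix products compute every position in parallel.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ V                                # contextualized tokens

# Toy example: 4 tokens with 8-dimensional embeddings (random placeholders).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 8): one contextualized vector per token
```

Because the whole computation reduces to a few matrix multiplications over all positions at once, it maps naturally onto parallel hardware, which is the scalability the paragraph above refers to.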
The simplicity of the Transformer architecture is, in fact, comparable to that of the Turing machine. The difference is that a Turing machine’s program explicitly prescribes what the machine does at each step, whereas the Transformer is more like a magical black box that learns from massive input data through parameter optimization. Researchers and scientists remain very interested in uncovering its potential and its theoretical implications for the study of the human mind.
In this article, we will first discuss the four main features of the Transformer architecture: word embedding, the attention mechanism, single-word prediction, and generalization capabilities such as cross-modal extension and transfer learning. The intention is to focus on why the architecture is so effective rather than on how to build it (for which readers may find many…