Large Language Models (LLMs) like ChatGPT, Gemini, and Claude have been around for a while now, and most of us are probably already using at least one of them. As of this writing, ChatGPT is built on the fourth generation of the GPT model family, GPT-4. But do you know what GPT really is and what its underlying neural network architecture looks like? In this article we are going to talk about the GPT models, especially GPT-1, GPT-2, and GPT-3. I'll also demonstrate how to code them from scratch with PyTorch so you can better understand the structure of these models.
A brief history of GPT
Before getting into GPT, we first need to understand the original Transformer architecture. Generally speaking, a Transformer consists of two main components: the encoder and the decoder. The former is responsible for understanding the input sequence, while the latter generates another sequence based on it. For example, in a question answering task the decoder produces a response to the input sequence, while in a machine translation task it generates the translation of the input.
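To make the encoder-decoder split concrete, here is a minimal sketch using PyTorch's built-in nn.Transformer module. The hyperparameters (512-dimensional embeddings, 8 heads, 6 layers each) are illustrative defaults, not values taken from this article's later models.

import torch
import torch.nn as nn

# Illustrative encoder-decoder Transformer; hyperparameters are assumptions.
model = nn.Transformer(
    d_model=512,           # embedding dimension
    nhead=8,               # number of attention heads
    num_encoder_layers=6,
    num_decoder_layers=6,
    batch_first=True,
)

src = torch.randn(1, 10, 512)   # source sequence (e.g., the sentence to translate)
tgt = torch.randn(1, 7, 512)    # target sequence generated so far

# The encoder reads and "understands" the source; the decoder attends to the
# encoder output (memory) while generating the target sequence.
memory = model.encoder(src)
out = model.decoder(tgt, memory)
print(out.shape)                # torch.Size([1, 7, 512])

The key point is that the decoder does not see the raw input directly; it conditions on the encoder's output, which is exactly the part of the design that GPT later drops by keeping only the decoder.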