LLMOps
The transformer architecture is arguably one of the most impactful innovations in modern deep learning. Proposed in the famous 2017 paper “Attention Is All You Need”, it has become the reference approach for most language-related models, including all large language models (LLMs) such as the GPT family, as well as many computer vision tasks.
As the complexity and size of these models grow, so does the need to optimize their inference speed, especially in chat applications where users expect immediate responses. Key-value (KV) caching is a clever trick to do just that – let's see how it works and when to use it.
Before we dive into KV caching, we'll need to make a brief detour into the attention mechanism used in transformers. Understanding how it works is necessary to see and appreciate how KV caching optimizes transformer inference.
We will focus on autoregressive models used to generate text. These so-called decoder models include the GPT family, Gemini, Claude, or GitHub Copilot. They are trained on a simple task: predicting the next token in a sequence. During inference, the model receives some text and its task is…
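To make this concrete, here is a minimal sketch of greedy autoregressive decoding, assuming the Hugging Face `transformers` and PyTorch libraries are available (GPT-2 is used only as a small, convenient example model, not something the article prescribes): the model is fed the sequence generated so far, the most likely next token is appended, and the process repeats.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small example model; any causal (decoder-only) LM would work the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The transformer architecture", return_tensors="pt").input_ids

# Greedy autoregressive decoding: at each step, run the whole sequence through
# the model, take the most likely next token, append it, and repeat.
for _ in range(20):
    with torch.no_grad():
        logits = model(input_ids).logits          # shape: (batch, seq_len, vocab_size)
    next_token = logits[:, -1, :].argmax(dim=-1)  # most likely next token
    input_ids = torch.cat([input_ids, next_token.unsqueeze(-1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```

Note that this naive loop re-processes the entire sequence at every step – exactly the redundancy that KV caching, discussed below, is designed to remove.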