Inference is one of the main drivers of the monetary and time costs of using large language models, and the problem grows considerably with longer inputs. The figure below shows the relationship between model performance and inference time.
Fast models, which generate more tokens per second, tend to score lower on the Open LLM Leaderboard. Increasing the model size improves quality, but at the cost of slower inference, which makes large models difficult to deploy in real-world applications (1).
Improving the speed of LLMs and reducing their resource requirements would allow individuals and small organizations to use them more widely.
Various solutions have been proposed to make LLMs more efficient; some focus on the model architecture or on the system the model runs on. However, proprietary models such as ChatGPT or Claude can only be accessed via an API, so we cannot change their internals.
We will discuss a simple and inexpensive method that relies solely on changing the input given to the model: prompt compression.
First, let's clarify how LLMs process language. The first step in making sense of natural-language text is to split it into pieces; this process is called tokenization. A token can be an entire word, a syllable, or a character sequence that occurs frequently in everyday language.
As a rule of thumb for English text, the number of tokens is roughly 33% greater than the number of words, so 1,000 words correspond to approximately 1,333 tokens.
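You can check this rule of thumb on your own text by counting tokens locally with OpenAI's tiktoken library. The snippet below is a minimal sketch; the sample sentence is only for illustration.

```python
import tiktoken

# Load the tokenizer used by gpt-3.5-turbo.
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

text = (
    "Prompt compression shortens the input we send to the model, "
    "which reduces both latency and API cost."
)

tokens = encoding.encode(text)
words = text.split()

print(f"Words:  {len(words)}")
print(f"Tokens: {len(tokens)}")
print(f"Tokens per word: {len(tokens) / len(words):.2f}")
```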
Let's look specifically at OpenAI's pricing for the gpt-3.5-turbo model, since that is the model we will be using later on.
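Since API pricing is quoted per 1,000 tokens, with separate rates for input (prompt) and output (completion) tokens, estimating the cost of a call is a simple multiplication. The sketch below uses placeholder rates passed in as parameters; they are illustrative assumptions, so replace them with the current values from OpenAI's pricing page.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Estimate the cost (in USD) of a single API call.

    Prices are passed in explicitly because they change over time;
    look up the current gpt-3.5-turbo rates on OpenAI's pricing page.
    """
    return (input_tokens / 1000) * input_price_per_1k \
         + (output_tokens / 1000) * output_price_per_1k


# Example: a 1,000-word prompt (~1,333 tokens) and a 300-token answer,
# priced with illustrative placeholder rates (not official figures).
print(estimate_cost(1333, 300, input_price_per_1k=0.0005, output_price_per_1k=0.0015))
```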