Inference is one of the main drivers of the monetary and time costs of using large language models, and the problem grows considerably with longer inputs. The figure below shows the relationship between model performance and inference time.
Fast models, which generate more tokens per second, tend to score lower on the Open LLM Leaderboard. Increasing the model size improves quality, but at the cost of slower inference, which makes large models difficult to deploy in real-world applications (1).
Improving the speed of LLMs and reducing their resource requirements would allow individuals and small organizations to use them more widely.
Various solutions have been proposed to make LLMs more efficient; some focus on the model architecture or on the system the model runs on. However, proprietary models such as ChatGPT or Claude can only be accessed via an API, so we cannot change their internals.
We will discuss a simple and inexpensive method that relies solely on changing the input given to the model: prompt compression.
First, let's clarify how LLMs process language. The first step in making sense of natural-language text is to split it into pieces; this process is called tokenization. A token can be an entire word, a syllable, or a character sequence that occurs frequently in everyday language.
As a rule of thumb for English text, the number of tokens is roughly 33% greater than the number of words, so 1,000 words correspond to approximately 1,333 tokens.
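You can check this rule of thumb on your own text by counting tokens locally with OpenAI's tiktoken library. The snippet below is a minimal sketch; the sample sentence is only for illustration.

```python
import tiktoken

# Load the tokenizer used by gpt-3.5-turbo.
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

text = (
    "Prompt compression shortens the input we send to the model, "
    "which reduces both latency and API cost."
)

tokens = encoding.encode(text)
words = text.split()

print(f"Words:  {len(words)}")
print(f"Tokens: {len(tokens)}")
print(f"Tokens per word: {len(tokens) / len(words):.2f}")
```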
Let's look specifically at OpenAI's pricing for the gpt-3.5-turbo model, since that is the model we will be using later on.
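Since API pricing is quoted per 1,000 tokens, with separate rates for input (prompt) and output (completion) tokens, estimating the cost of a call is a simple multiplication. The sketch below uses placeholder rates passed in as parameters; they are illustrative assumptions, so replace them with the current values from OpenAI's pricing page.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Estimate the cost (in USD) of a single API call.

    Prices are passed in explicitly because they change over time;
    look up the current gpt-3.5-turbo rates on OpenAI's pricing page.
    """
    return (input_tokens / 1000) * input_price_per_1k \
         + (output_tokens / 1000) * output_price_per_1k


# Example: a 1,000-word prompt (~1,333 tokens) and a 300-token answer,
# priced with illustrative placeholder rates (not official figures).
print(estimate_cost(1333, 300, input_price_per_1k=0.0005, output_price_per_1k=0.0015))
```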