What if I told you that you could save 60% or more on your LLM API spend without compromising accuracy? Surprisingly, now you can.
Large Language Models (LLMs) are now part of our everyday lives. Companies use the technology to automate processes, improve customer experiences, build better products, save money, and more.
Hosting your own LLMs is a big challenge. They offer extensive capabilities but are usually expensive to operate, often requiring complex infrastructure and massive amounts of data. Cost and complexity are why prompt engineering is used instead. You can even use Retrieval Augmented Generation (RAG) to improve context and reduce hallucinations. With both techniques, you offload running the LLMs to companies like OpenAI, Cohere, or Google. However, expanding LLM adoption to new use cases, especially with the latest powerful models, can create new costs that were previously unaccounted for. Weaker models may be cheaper, but can you trust them with complex questions? New research shows how to save money while getting LLM results that are just as good, and sometimes better.
Getting to know LLM Cascades
In the search for lower LLM costs, researchers turned to the concept of LLM Cascades. In the dark ages before the release of ChatGPT, a team from Google and the University of Toronto defined this term as programs that use probabilistic methods to get the best results from multiple LLMs.
More recently, the FrugalGPT paper defined Cascades as sending a user query to a list of LLMs, one after another, from weaker to stronger, until the response is good enough. FrugalGPT Cascades use a dedicated model to determine when the response is good enough based on a quality threshold.
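To make the idea concrete, here is a minimal sketch of a FrugalGPT-style cascade loop. The `call_llm` helper and the `score_response` quality model are placeholders of our own, not part of the FrugalGPT release, and the threshold value is arbitrary.

```python
# Minimal sketch of a FrugalGPT-style LLM cascade (illustrative only).
# `call_llm` and `score_response` are hypothetical stubs, not a real API.

def call_llm(model: str, prompt: str) -> str:
    """Send `prompt` to `model` and return its text response (stub)."""
    raise NotImplementedError

def score_response(question: str, answer: str) -> float:
    """Quality score in [0, 1] from a dedicated scoring model (stub)."""
    raise NotImplementedError

# Models ordered from weakest/cheapest to strongest/most expensive.
CASCADE = ["gpt-3.5-turbo", "gpt-4"]
QUALITY_THRESHOLD = 0.8  # tuned against your accuracy/budget constraints

def cascade_answer(question: str) -> str:
    answer = ""
    for model in CASCADE:
        answer = call_llm(model, question)
        # Accept the first answer the scorer judges "good enough".
        if score_response(question, answer) >= QUALITY_THRESHOLD:
            return answer
    return answer  # fall back to the strongest model's answer
```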
A recent paper titled 'Large Language Model Cascades with Mixture of Thought Representations for Cost-efficient Reasoning' from George Mason University, Microsoft, and Virginia Tech offers an alternative: a function that can determine whether the answer is good enough without needing to tune another model.
LLM Cascades with Mixture of Thought
Instead of using a long list of LLMs, 'Mixture of Thought' (MoT) reasoning uses only two: GPT-3.5 Turbo and GPT-4. The first model is considered the "weaker" LLM, while the second is the "stronger" LLM. The authors take advantage of LLM "answer consistency" to signal whether an LLM's response is good enough. LLMs produce consistent answers to similar questions when they are confident those answers are correct. So when the weaker LLM's responses are consistent, there is no need to call the stronger LLM. Conversely, LLMs produce inconsistent responses when they lack confidence, and that is when you need the stronger LLM to answer the question. (Note: you can also use a weaker/stronger LLM pair of your choice.)
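As an illustration, the routing decision might look like the sketch below. The `sample_answers` helper, the number of samples, and the agreement threshold are assumptions for the example, not the paper's exact implementation.

```python
from collections import Counter

def sample_answers(model: str, prompt: str, n: int = 5) -> list[str]:
    """Sample `n` answers from `model`, e.g. at temperature > 0 (stub)."""
    raise NotImplementedError

def answer_with_cascade(question: str, agreement_threshold: float = 0.6) -> str:
    # Ask the weaker model several times and measure how often it agrees with itself.
    weak_answers = sample_answers("gpt-3.5-turbo", question)
    best_answer, votes = Counter(weak_answers).most_common(1)[0]
    # Consistent answers signal confidence: accept the weaker model's answer.
    if votes / len(weak_answers) >= agreement_threshold:
        return best_answer
    # Inconsistent answers signal low confidence: escalate to the stronger model.
    return sample_answers("gpt-4", question, n=1)[0]
```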
The prompts themselves use few-shot in-context examples to improve the quality of LLM responses. These examples guide the LLM's response by showing similar questions and their answers.
To improve model reasoning and simplify consistency measurement, the researchers introduce a new prompting approach for reasoning tasks by "mixing" two prompting techniques (illustrated in the sketch after this list):
- Chain of Thought (CoT) prompts encourage LLMs to generate intermediate steps or reasoning before arriving at a final answer. Generating these steps helps the model perform better on complicated tasks and increases answer accuracy.
- Program of Thought (PoT) prompts extend Chain of Thought prompting and use the model's output as new input for further prompts. Prompts using this technique often ask the model to respond with code rather than natural language.
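For illustration, the two thought representations might look like the hypothetical few-shot prompts below. The wording and the worked example are our own assumptions, not the prompts used in the paper.

```python
# Hypothetical few-shot prompts contrasting the two thought representations.

# CoT: the example answer reasons in natural language before giving the result.
COT_PROMPT = """\
Q: A shop sells pens at $2 each. How much do 7 pens cost?
A: Each pen costs $2, so 7 pens cost 7 * 2 = 14. The answer is 14.

Q: {question}
A:"""

# PoT: the example answer is a short program whose output is the result.
POT_PROMPT = """\
Q: A shop sells pens at $2 each. How much do 7 pens cost?
price_per_pen = 2
num_pens = 7
print(price_per_pen * num_pens)

Q: {question}
"""
```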
The article also presents two methods for determining the consistency of responses:
- Voting: This method samples multiple LLM responses, either by querying with similar prompts or by varying the response temperature. It then measures how similar the LLM's answers are to each other. The answer that most closely agrees with all the other answers is assumed to be correct. The team also defined a flexible "threshold" value that balances answer consistency against budget constraints.
- Verification: This approach compares the most consistent LLM responses across two different thought representations (e.g., CoT and PoT). The algorithm accepts the weaker LLM's answer if the two responses are identical.
Since voting requires multiple prompts, it may be more appropriate when a budget is available to guide the threshold value.
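A rough sketch of the verification check is shown below. The `sample_weak` stub and the simple string comparison are assumptions for the example; the paper's implementation may normalize and compare answers differently.

```python
from collections import Counter

def sample_weak(prompt: str, n: int = 5) -> list[str]:
    """Sample `n` answers from the weaker model (stub placeholder)."""
    raise NotImplementedError

def most_consistent(answers: list[str]) -> str:
    """Return the most frequent answer among the samples."""
    return Counter(answers).most_common(1)[0][0]

def verification_accepts(cot_prompt: str, pot_prompt: str) -> tuple[bool, str]:
    """Accept the weaker model's answer only if CoT and PoT agree."""
    cot_best = most_consistent(sample_weak(cot_prompt))
    pot_best = most_consistent(sample_weak(pot_prompt))
    if cot_best.strip() == pot_best.strip():
        return True, cot_best
    return False, ""  # inconsistent: escalate to the stronger model
```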
Conclusion: Mixture of Thought Saves You Money
Let's see how much money the MoT technique saves and its impact on the accuracy of the answers.
The researchers calculated the prompting cost as the sum of:
- The cost of prompting the weaker model (which may be prompted multiple times)
- The cost of the response evaluation process
- The cost of prompting the stronger model, if the evaluation process rejects the weaker model's answer
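In code, that cost accounting might look like the following sketch. The per-call prices are made-up placeholders, not figures from the paper.

```python
def cascade_cost(weak_calls: int, weak_cost_per_call: float,
                 eval_cost: float, escalated: bool,
                 strong_cost_per_call: float) -> float:
    """Total cost of answering one question with the cascade."""
    cost = weak_calls * weak_cost_per_call   # weaker model, possibly sampled many times
    cost += eval_cost                        # consistency/verification check
    if escalated:
        cost += strong_cost_per_call         # stronger model, only when needed
    return cost

# Example with made-up prices: 5 weak samples, answer rejected, escalate to the stronger model.
print(cascade_cost(weak_calls=5, weak_cost_per_call=0.002,
                   eval_cost=0.0, escalated=True,
                   strong_cost_per_call=0.06))
```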
The results were dramatic:
- Using variants of MoT (combining voting and verification with CoT and PoT) can deliver performance comparable to GPT-4 at 40% of the cost of using GPT-4 alone.
- In tests against the CREPE Q&A dataset, MoT outperformed GPT-4 at 47% of its cost.
- Combining PoT and CoT improves decision making compared to using either technique alone.
- Increasing the threshold when using the voting method did not significantly affect quality despite the additional cost.
- The answer-consistency approach proved effective at reliably identifying correct LLM responses. It successfully predicted when to fall back to the stronger model for optimal results.
Hosting and managing large language models (LLMs) internally comes with significant challenges: complexity, high costs, and the need for extensive data infrastructure and resources. As a result, LLMs present substantial obstacles for organizations seeking to leverage their broad capabilities. That may lead you to turn to hosted LLMs. However, this approach presents companies with unforeseen cost increases and budget challenges as they expand into new use cases, especially when the latest powerful models are integrated. To avoid that fate, you face a new dilemma: can you rely on weaker, more affordable models? Can you overcome concerns about their accuracy when handling complex questions?
LLM Cascades with Mixture of Thought (MoT) offers two important steps forward:
- Substantial cost savings compared to exclusively using the latest models.
- Demonstrable results on par with the latest models.
This advancement provides organizations with a practical and efficient approach to navigate the delicate balance between the powerful capabilities of LLMs and the imperative to manage costs effectively.
Domino software engineer Subir Mansukhani contributed to this post.