My thanks to Evan Jolley for their contributions to this piece.
New evaluations of RAG systems are released seemingly every day, and many of them focus on the retrieval stage of the framework. However, the generation aspect (how a model synthesizes and articulates the retrieved information) may be of equal or greater importance in practice. Many production use cases are not simply returning a fact from the context; they require synthesizing that fact into a more complicated response.
We performed several experiments to evaluate and compare the generation capabilities of GPT-4, Claude 2.1, and Claude 3 Opus. This article details our research methodology, the results and model nuances found along the way, and why this matters for people building with generative AI.
Everything needed to reproduce the results can be found in the LLMTest_NeedleInAHaystack GitHub repository.
Takeaways
- Although initial findings indicated that Claude outperformed GPT-4, subsequent testing revealed that with strategic prompt engineering, GPT-4 performed better across a broader range of evaluations. Inherent model behaviors and prompt engineering matter a LOT in RAG systems.
- Simply adding “Explain yourself and then answer the question” to a prompt template significantly improved (more than doubled) GPT-4’s performance (see the sketch below). When an LLM talks through its reasoning before answering, it seems to help it work out its ideas. It is possible that by explaining, a model is reinforcing the correct answer in embedding/attention space.
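To make that concrete, here is a minimal sketch of the prompt tweak, assuming a simple string template; the variable names and surrounding wording are illustrative, not the exact prompts used in our test harness.

```python
# Illustrative prompt templates; names and wording are assumptions, not the
# exact prompts from our test harness.
BASE_PROMPT = (
    "Here is some context:\n"
    "{context}\n\n"
    "Question: {question}\n"
)

# The single added line that more than doubled GPT-4's score in our runs.
EXPLAIN_FIRST_PROMPT = BASE_PROMPT + "Explain yourself and then answer the question.\n"

prompt = EXPLAIN_FIRST_PROMPT.format(context="...", question="What is the date?")
```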
While retrieval is responsible for identifying and retrieving the most relevant information, it is the generation phase that takes this raw data and transforms it into a coherent, meaningful, and contextually appropriate response. The generative step has the task of synthesizing the retrieved information, filling in the gaps, and presenting it in a way that is easily understandable and relevant to the user's query.
In many real-world applications, the value of RAG systems lies not only in their ability to locate a specific fact or information, but also in their ability to integrate and contextualize that information within a broader framework. The generation phase is what allows RAG systems to go beyond simple fact retrieval and deliver truly intelligent and adaptive responses.
The initial test we performed involved generating a date string from two randomly retrieved numbers: one used to derive the month and the other the day. The models were tasked with:
- Retrieving random number #1
- Isolating the last digit and incrementing by 1
- Generating a month for our date string from the result
- Retrieving random number #2
- Generating the day for our date string from random number 2
For example, the random numbers 4827143 and 17 would represent April 17.
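To make the scoring concrete, here is a minimal sketch of how the expected answer can be derived from the two needle values; this is our own illustrative helper, not code lifted from the test harness.

```python
import calendar

def expected_date(random_number_1: int, random_number_2: int) -> str:
    """Derive the target date string from the two retrieved numbers."""
    # Last digit of the first number, incremented by 1, gives the month
    # (e.g. 4827143 -> 3 -> 4 -> "April").
    month = calendar.month_name[(random_number_1 % 10) + 1]
    # The second number is used directly as the day.
    return f"{month} {random_number_2}"

print(expected_date(4827143, 17))  # April 17
```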
These numbers were placed at different depths within contexts of varying token lengths, along the lines of the sketch below.
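The placement itself follows the standard needle-in-a-haystack setup. Below is a simplified sketch of inserting the needles at fractional depths of a filler context; the harness in the linked repository measures depth and context length in tokens, so treat this as an assumption-laden outline rather than the actual implementation.

```python
def insert_needle(sentences: list[str], needle: str, depth: float) -> list[str]:
    """Insert a needle sentence at a fractional depth (0.0 = start, 1.0 = end)."""
    position = int(len(sentences) * depth)
    return sentences[:position] + [needle] + sentences[position:]

# Example: drop the two random numbers into filler text at 25% and 75% depth.
sentences = [f"Filler sentence number {i}." for i in range(1000)]
sentences = insert_needle(sentences, "The first random number is 4827143.", 0.25)
sentences = insert_needle(sentences, "The second random number is 17.", 0.75)
context = " ".join(sentences)
```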
At first, the models had quite a bit of difficulty with this task. While neither model performed very well, Claude 2.1 significantly outperformed GPT-4 in our initial testing, nearly quadrupling its success rate. Claude's verbose nature (providing detailed, explanatory answers) seemed to give it a clear advantage here, yielding more accurate results than GPT-4's initially concise answers.
Prompted by these unexpected results, we introduced a new variable into the experiment. We instructed GPT-4 to “explain and then answer the question,” a prompt that encouraged a more detailed response similar to Claude's natural output. The impact of this small adjustment was profound.
GPT-4's performance improved dramatically, achieving impeccable results in subsequent tests. Claude's results also improved, though to a lesser extent.
This experiment not only highlights the differences in how language models approach generation tasks, but also shows the potential impact of prompt engineering on their performance. The verbosity that seemed to be Claude's advantage turned out to be a replicable strategy for GPT-4, suggesting that the way a model processes and presents its reasoning can significantly influence its accuracy on generation tasks. Overall, including the seemingly tiny “explain yourself” line in our prompt helped improve the performance of both models across all of our experiments.
We performed four more tests to evaluate the models' ability to synthesize and transform retrieved information into various formats (a short sketch of the expected outputs follows the list):
- String concatenation: Combine text fragments to form coherent strings and test the models' basic text manipulation skills.
- Money formatting: Format numbers as currency, round them, and calculate percentage changes to evaluate the models' accuracy and their ability to handle numerical data.
- Date mapping: Convert a numerical representation into a month name and date, requiring a combination of retrieval and contextual understanding.
- Modulo arithmetic: Perform complex numerical operations to test the models' mathematical generation capabilities.
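For reference, here is a minimal sketch of what a “correct” generation looks like for two of these tasks, money formatting and modulo arithmetic; the rounding conventions and the modulus below are our illustrative assumptions, not necessarily the exact targets used in the tests.

```python
def format_money(amount: float) -> str:
    """Round to two decimals and format as US currency, e.g. 1234.5 -> $1,234.50."""
    return f"${amount:,.2f}"

def percent_change(old: float, new: float) -> str:
    """Percentage change between two retrieved amounts, to one decimal place."""
    return f"{(new - old) / old * 100:.1f}%"

def modulo_answer(a: int, b: int, modulus: int = 10) -> int:
    """A simple modular-arithmetic target: (a * b) mod modulus."""
    return (a * b) % modulus

print(format_money(1234.5))         # $1,234.50
print(percent_change(80.0, 100.0))  # 25.0%
print(modulo_answer(4827143, 17))   # 1
```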
As expected, each model showed strong performance in string concatenation, reaffirming the previous understanding that text manipulation is a fundamental strength of language models.
As for the money formatting test, Claude 3 and GPT-4 performed almost flawlessly. Claude 2.1's performance was generally worse. Accuracy did not vary significantly with token length, but was generally lower when the needle was closer to the beginning of the context window.
Despite stellar results in the generation tests, Claude 3's accuracy declined in a retrieval-only experiment. In theory, simply retrieving numbers should be easier than manipulating them as well, which makes this drop in performance surprising and an area we plan to examine with further testing. If anything, this counterintuitive drop only further confirms the idea that both retrieval and generation should be tested when developing with RAG.
By testing various generation tasks, we observed that while both models excel at simpler tasks such as string manipulation, their strengths and weaknesses become clear in more complex scenarios. LLMs are still not good at math! Another key result was that introducing the “explain yourself” prompt markedly improved GPT-4's performance, underscoring the importance of how models are prompted and how they articulate their reasoning in achieving accurate results.
These findings have broader implications for the evaluation of LLMs. When comparing models such as the verbose Claude and the initially less verbose GPT-4, it becomes evident that evaluation criteria must go beyond mere correctness. The verbosity of a model's responses introduces a variable that can significantly influence its perceived performance. This nuance suggests that future model evaluations should consider average response length as a factor, providing a better understanding of a model's capabilities and ensuring a fairer comparison.