Develop an understanding of a variety of LLM benchmarks and scores, including an intuition for when they may be valuable for your purpose.
It seems like a new large language model (LLM) is released to the public almost weekly. With every announcement, the provider touts performance figures that can seem quite impressive. The challenge I found is that a wide variety of performance metrics are referenced in these press releases. While some appear more frequently than others, unfortunately there aren't just one or two metrics to fall back on. If you want a tangible example of this, see the page for GPT-4 performance. It references many different benchmarks and scores!
The first natural question one might ask is: “Why can't we just agree to use a single metric?” In short, there is no single, clear way to evaluate LLM performance, so each metric seeks to provide a quantitative assessment of one focused domain. Additionally, many of these metrics have “submetrics” that calculate the score slightly differently from the original. When I started researching this blog post, my intention was to cover each of these benchmarks and scores, but I quickly discovered that doing so would mean covering over 50 different metrics!
Because evaluating each individual metric isn't feasible, I found that we can break these benchmarks and scores down into categories based on what they are generally trying to evaluate. In the rest of this post, we will cover these categories and provide specific examples of popular metrics that fall into each one. The goal is for you to walk away with a general idea of which performance metrics matter for your specific use case.
The six categories we will cover in this post are the following. Please note that there is no “industry standard” for these categories; I created them based on how I hear these metrics referenced most frequently:
- General Knowledge Benchmarks