How to reduce the cost of evaluating LLM applications.

This is how not to waste your budget evaluating models and systems

Image created by the author using Flux1.1 Pro.

You can build a fortress in two ways: start stacking bricks on top of each other, or draw up a plan of the fortress you are about to build, work out how you will execute it, and then keep evaluating your progress against that plan.

We all know the second is probably the only way to actually build a fortress.

Sometimes I am the worst follower of my own advice. I'm talking about jumping straight into a notebook to create an LLM application. It is one of the worst things we can do to a project.

Before we start anything, we need a mechanism that tells us whether we are moving in the right direction: whether the last thing we tried was better than what we had before, or not.
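To make that concrete, here is a minimal sketch of such a mechanism in Python. Everything in it is an illustrative assumption rather than a recommended setup: the tiny `EVAL_SET`, the two stand-in functions `baseline_app` and `candidate_app` (a real application would call a model instead of looking up a dictionary), and the exact-match metric, which is only one of many possible scoring choices.

```python
from typing import Callable, List, Tuple

# Hypothetical evaluation set: (question, expected answer) pairs.
EVAL_SET: List[Tuple[str, str]] = [
    ("capital of France", "Paris"),
    ("2 + 2", "4"),
    ("largest planet", "Jupiter"),
]


def exact_match_score(app: Callable[[str], str], eval_set) -> float:
    """Fraction of examples where the app's answer equals the expected one,
    ignoring case and surrounding whitespace."""
    hits = sum(
        app(question).strip().lower() == expected.strip().lower()
        for question, expected in eval_set
    )
    return hits / len(eval_set)


# Stand-ins for two versions of an application (assumptions for illustration).
def baseline_app(question: str) -> str:
    answers = {"capital of France": "Paris", "2 + 2": "5"}
    return answers.get(question, "I don't know")


def candidate_app(question: str) -> str:
    answers = {
        "capital of France": "paris",
        "2 + 2": "4",
        "largest planet": "Saturn",
    }
    return answers.get(question, "I don't know")


def is_improvement(baseline, candidate, eval_set) -> bool:
    """True only if the candidate scores strictly higher on the same eval set."""
    return exact_match_score(candidate, eval_set) > exact_match_score(
        baseline, eval_set
    )


baseline_score = exact_match_score(baseline_app, EVAL_SET)
candidate_score = exact_match_score(candidate_app, EVAL_SET)
print(f"baseline={baseline_score:.2f} candidate={candidate_score:.2f}")
print("improved:", is_improvement(baseline_app, candidate_app, EVAL_SET))
```

The point of the sketch is only the shape of the loop: a fixed set of examples, one scoring function, and a comparison between the previous attempt and the new one on exactly the same data.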

In software engineering, this is called test-driven development. In machine learning, it is evaluation.

The first step and most valuable skill in developing LLM-based applications is defining how you will evaluate your project.

Evaluating LLM applications is nothing like software testing. I don't mean to downplay the challenges of software testing, but evaluating LLMs is not as simple as writing and running tests.