Intended Audience: Practitioners who want to learn what approaches are available and how to get started implementing them, and leaders seeking to understand the art of the possible as they build governance frameworks and technical roadmaps.
Seemingly overnight every CEO to-do list, job posting, and resume includes generative AI (genAI). And rightfully so. Applications based on foundation models have already changed the way millions work, learn, write, design, code, travel, and shop. Most, including me, feel this is just the tip of the iceberg.
In this article, I summarize research conducted on existing methods for large language model (LLM) monitoring. I spent many hours reading documentation, watching videos, and reading blogs from software vendors and open-source libraries specializing in LLM monitoring and observability. The result is a practical taxonomy for monitoring and observing LLMs. I hope you find it useful. In the near future, I plan to conduct a literature search of academic papers to add a forward-looking perspective.
Software researched*: Aporia, Arize, Arthur, Censius, Databricks/MLFlow, Datadog, DeepChecks, Evidently, Fiddler, Galileo, Giskard, Honeycomb, Hugging Face, LangSmith, New Relic, OpenAI, Parea, Trubrics, Truera, Weights & Biases, Why Labs
- *This article presents a cumulative taxonomy without grading or comparing software options. Reach out to me if you’d like to discuss a particular software covered in my research.
- Evaluating LLMs — How are LLMs evaluated and deemed ready for production?
- Tracking LLMs — What does it mean to track an LLM and what components need to be included?
- Monitoring LLMs — How are LLMs monitored once they are in production?
The race is on to incorporate LLMs in production workflows, but the technical community is scrambling to develop best practices to ensure these powerful models behave as expected over time.
Evaluating a traditional machine learning (ML) model involves checking the accuracy of its output or predictions. This is usually measured by well-known metrics such as Accuracy, RMSE, AUC, Precision, Recall, and so forth. Evaluating LLMs is a lot more complicated. Several methods are used today by data scientists.
(1) Classification and Regression Metrics
LLMs can produce numeric predictions or classification labels, in which case evaluation is easy. It’s the same as with traditional ML models. While this is helpful in some cases, we are usually concerned with evaluating LLMs that produce text.
(2) Standalone text-based Metrics
These metrics are useful for evaluating text output from an LLM when you do not have a source of ground truth. It is up to you to determine what is acceptable based on past experience, academic suggestions, or comparing scores of other models.
Perplexity is one example. It measures the likelihood the model would generate an input text sequence and can be thought of as evaluating how well the model learned the text it was trained on. Other examples include Reading Level and Non-letter Characters.
A more sophisticated standalone approach involves extracting embeddings from model output and analyzing those embeddings for unusual patterns. This can be done manually by inspecting a graph of your embeddings in a 3D plot. Coloring or comparing by key fields like gender, predicted class, or perplexity score can reveal lurking problems with your LLM application and provide a measure of bias and explainability. Several software tools exist that allow you to visualize embeddings in this way. They cluster the embeddings and map them into 3 dimensions. This is usually done with HDBSCAN and UMAP, but some leverage a K-means-based approach.
In addition to visual assessment, an anomaly detection algorithm can be run across the embeddings to look for outliers.
(3) Evaluation Datasets
A dataset with ground truth labels allows for the comparison of textual output to a baseline of approved responses.
A well-known example is the ROUGE metric. In the case of language translation tasks, ROUGE relies on a reference dataset whose answers are compared against the LLM being evaluated. Relevance, accuracy, and a host of other metrics can be calculated against a reference dataset. Embeddings play a key role. Standard distance metrics like as J-S Distance, Hellinger Distance, KS Distance, and PSI compare your LLM output embeddings to the ground truth embeddings.
Lastly, there are a number of widely accepted benchmark tests for LLMs. Stanford’s HELM page is a great place to learn about them.
(4) Evaluator LLMs
At first glance, you may think it’s cheating the system to use an LLM to evaluate an LLM, but many feel that this is the best path forward and studies have shown promise. It is highly likely that using what I call Evaluator LLMs will be the predominant method for LLM evaluation in the near future.
One widely accepted example is the Toxicity metric. It relies on an Evaluator LLM (Hugging Face recommends roberta-hate-speech-dynabench-r4) to determine if your model’s output is Toxic. All the metrics above under Evaluation Datasets apply here as we treat the output of the Evaluator LLM as the reference.
According to researchers at Arize, Evaluator LLMs should be configured to provide binary classification labels for the metrics they test. Numeric scores and ranking, they explain, need more work and are not as performant as binary labeling.
(5) Human Feedback
With all the emphasis on measurable metrics in this post, software documentation, and marketing material, you should not forget about manual human-based feedback. This is usually considered by data scientists and engineers in the early stages of building an LLM application. LLM observability software usually has an interface to aid in this task. In addition to early development feedback, it is a best practice to include human feedback in the final evaluation process as well (and ongoing monitoring). Grabbing 50 to 100 input prompts and manually analyzing the output can teach you a lot about your final product.
Tracking is the precursor to monitoring. In my research, I found enough nuance in the details of tracking LLMs to warrant its own section. The low-hanging fruit of tracking involves capturing the number of requests, response time, token usage, costs, and error rates. Standard system monitoring tools play a role here alongside the more LLM-specific options (and those traditional monitoring companies have marketing teams that are quick to claim LLM Observability and Monitoring based on simple functional metric tracking).
Deep insights are gained from capturing input prompts and output responses for future analysis. This sounds simple on the surface, but it’s not. The complexity comes from something I’ve glossed over so far (and most data scientists do the same when talking or writing about LLMs). We are not evaluating, tracking, and monitoring an LLM. We are dealing with an application; a conglomerate of one or more LLMs, pre-set instruction prompts, and agents that work together to produce the output. Some LLM applications are not that complex, but many are, and the trend is toward more sophistication. In even slightly sophisticated LLM applications it can be difficult to nail down the final prompt call. If we are debugging, we’ll need to know the state of the call at each step along the way and the sequence of those steps. Practitioners will want to leverage software that helps with unpacking these complexities.
While most LLMs and LLM applications undergo at least some form of evaluation, too few have implemented continuous monitoring. We’ll break down the components of monitoring to help you build a monitoring program that protects your users and brand.
(1) Functional Monitoring
To start, the low-hanging fruit mentioned in the Tracking section above should be monitored on a continuous basis. This includes the number of requests, response time, token usage, costs, and error rates.
(2) Monitoring Prompts
Next on your list should be monitoring user-supplied prompts or inputs. Standalone metrics like Readability could be informative. Evaluator LLMs should be utilized to check for Toxicity and the like. Embedding distances from the reference prompts are smart metrics to include. Even if your application can handle prompts that are substantially different than what you anticipated, you will want to know if your customers’ interaction with your application is new or changes over time.
At this point, we need to introduce a new evaluation class: adversarial attempts or malicious prompt injections. This is not always accounted for in the initial evaluation. Comparing against reference sets of known adversarial prompts can flag bad actors. Evaluator LLMs can also classify prompts as malicious or not.
(3) Monitoring Responses
There are a variety of useful checks to implement when comparing what your LLM application is spitting out to what you expect. Consider relevance. Is your LLM responding with relevant content or is it off in the weeds (hallucination)? Are you seeing a divergence from your anticipated topics? How about sentiment? Is your LLM responding in the right tone and is this changing over time?
You probably don’t need to monitor all these metrics on a daily basis. Monthly or quarterly will be sufficient for some. On the other hand, Toxicity and harmful output are always top on the worry list when deploying LLMs. These are examples of metrics that you will want to track on a more regular basis. Remember that the embedding visualization techniques discussed earlier may help with root cause analysis.
Prompt leakage is an adversarial approach we haven’t introduced yet. Prompt leakage occurs when someone tricks your application into divulging your stored prompts. You likely spent a lot of time figuring out which pre-set prompt instructions gave the best results. This is sensitive IP. Prompt leakage can be discovered by monitoring responses and comparing them to your database of prompt instructions. Embedding distance metrics work well.
If you have evaluation or reference datasets, you may want to periodically test your LLM application against these and compare the results of previous tests. This can give you a sense of accuracy over time and can alert you to drift. If you discover issues, some tools that manage embeddings allow you to export datasets of underperforming output so you can fine-tune your LLM on these classes of troublesome prompts.
(4) Alerting and Thresholds
Care should be taken to ensure that your thresholds and alerts do not cause too many false alarms. Multivariate drift detection and alerting can help. I have thoughts on how to do this but will save those for another article. Incidentally, I didn’t see one mention of false alarm rates or best practices for thresholds in any of my research for this article. That’s a shame.
There are several nice features related to alerts that you may want to include on your must-have list. Many monitoring systems provide integration with information feeds like Slack and Pager Duty. Some monitoring systems allow automatic response blocking if the input prompt triggers an alert. The same feature can apply to screening the response for PII leakage, Toxicity, and other quality metrics before sending it to the user.
I’ll add one more observation here as I didn’t know where else to put it. Custom metrics can be very important to your monitoring scheme. Your LLM application may be unique, or perhaps a sharp data scientist on your team thought of a metric that will add significant value to your approach. There will likely be advances in this space. You will want the flexibility of custom metrics.
(5) The Monitoring UI
If a system has a monitoring capability, it will have a UI that shows time-series graphs of metrics. That’s pretty standard. UIs start to differentiate when they allow for drilling down into alert trends in a manner that points to some level of root cause analysis. Others facilitate visualization of the embedding space based on clusters and projections (I’d like to see, or conduct, a study on the usefulness of these embedding visualizations in the wild).
More mature offerings will group monitoring by users, projects, and teams. They will have RBAC and work off the assumption that all users are on a need-to-know basis. Too often anyone in the tool can see everyone’s data, and that won’t fly at many of today’s organizations.
One cause of the problem I highlighted regarding the tendency for alerts to yield an unacceptable false alarm rate is that the UI does not facilitate a proper analysis of alerts. It is rare for software systems to attempt any kind of optimization in this respect, but some do. Again, there is much more to say on this topic at a later point.
Leaders, there is too much at stake to not place LLM monitoring and observability near the top of your organizational initiatives. I don’t say this only to prevent causing harm to users or losing brand reputation. Those are obviously on your radar. What you might not appreciate is that your company’s quick and sustainable adoption of AI could mean the difference between success and failure, and a mature responsible AI framework with a detailed technical roadmap for monitoring and observing LLM applications will provide a foundation to enable you to scale faster, better, and safer than the competition.
Practitioners, the concepts introduced in this article provide a list of tools, techniques, and metrics that should be included in the implementation of LLM observability and monitoring. You can use this as a guide to ensure that your monitoring system is up to the task. And you can use this as a basis for deeper study into each concept we discussed.
This is an exciting new field. Leaders and practitioners who become well-versed in it will be positioned to help their teams and companies succeed in the age of AI.
About the author:
Josh Poduska is an AI Leader, Strategist, and Advisor with over 20 years of experience. He is the former Chief Field Data Scientist at Domino Data Lab and has managed teams and led data science strategy at multiple companies. Josh has built and implemented data science solutions across multiple domains. He has a Bachelor’s in Mathematics from UC Irvine and a Master’s in Applied Statistics from Cornell University.