p99, or the value below which 99% of observations fall, is widely used to track and optimize worst-case performance across industries. For example, the time it takes for a page to load, complete a purchase order, or deliver a shipment can be optimized using p99 tracking.
While p99 is undoubtedly valuable, it is crucial to recognize that it ignores the top 1% of observations, which can have an unexpectedly large impact when they are correlated with other critical business metrics. Blindly optimizing for p99 without checking such correlations can undermine other business objectives.
In this article, we will discuss the limitations of p99 through an example using fictional data, understand when to trust p99, and explore alternative metrics.
Consider an e-commerce platform where a team is tasked with optimizing the shopping cart checkout experience. The team has received complaints from customers that the payment process is quite slow compared to other platforms. The team pulls the last 1,000 payments and analyzes the time each one took to complete. (I created some dummy data for this; you are free to use and modify it without restriction.)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# load the last 1,000 payments and plot the distribution of payment times
order_time = pd.read_csv('https://gist.githubusercontent.com/kkraoj/77bd8332e3155ed42a2a031ce63d8903/raw/458a67d3ebe5b649ec030b8cd21a8300d8952b2c/order_time.csv')
fig, ax = plt.subplots(figsize=(4, 2))
sns.histplot(data=order_time, x='fulfillment_time_seconds', bins=40, color='k', ax=ax)
print(f'p99 for fulfillment_time_seconds: {order_time.fulfillment_time_seconds.quantile(0.99):0.2f} s')
Unsurprisingly, most shopping cart payments complete within a few seconds, and 99% of payments complete within 12.1 seconds. In other words, the p99 is 12.1 seconds. There is a long tail of payments that take up to 30 seconds. Since there are so few of them, they may be outliers and it should be safe to ignore them, right?
Now, if we don't stop and analyze the implications of that last sentence, it could be quite dangerous. Is it really safe to ignore the top 1%? Are we sure payment times are not correlated with any other business metrics?
Let's say our e-commerce company also cares about gross merchandise value (GMV) and has an overall company-level goal of increasing it. We should immediately check if time to pay is correlated with GMV before ignoring the top 1%.
import matplotlib.pyplot as plt
from matplotlib.ticker import ScalarFormatter
# load order values and join them to the payment times on order_id
order_value = pd.read_csv('https://gist.githubusercontent.com/kkraoj/df53cac7965e340356d6d8c0ce24cd2d/raw/8f4a30db82611a4a38a90098f924300fd56ec6ca/order_value.csv')
df = pd.merge(order_time, order_value, on='order_id')
# scatter order value against payment time, with a log-scaled y-axis
fig, ax = plt.subplots(figsize=(4, 4))
sns.scatterplot(data=df, x='fulfillment_time_seconds', y='order_value_usd', color='k')
plt.yscale('log')
ax.yaxis.set_major_formatter(ScalarFormatter())
Oh boy! Not only is cart value correlated with checkout time, it grows exponentially as checkout times get longer. What is the penalty for ignoring the top 1% of payment times?
pct_revenue_ignored = df.loc[df.fulfillment_time_seconds > df.fulfillment_time_seconds.quantile(0.99), 'order_value_usd'].sum() / df.order_value_usd.sum() * 100
print(f'If we only focused on p99, we would ignore {pct_revenue_ignored:0.0f}% of revenue')
## >>> If we only focused on p99, we would ignore 27% of revenue
If we only focused on p99, we would ignore 27% of revenue (27 times the 1% of orders we thought we were ignoring). In other words, the p99 of payment times is only the p73 of revenue. Focusing on p99 in this case inadvertently hurts the business: it ignores the needs of our highest-value buyers.
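As a quick sanity check on that claim, the snippet below reuses the merged df from above and computes the share of revenue carried by orders at or below the p99 payment time; with this data it comes out to roughly 73%.

# sanity check: what share of revenue sits at or below the p99 payment time?
p99_time = df.fulfillment_time_seconds.quantile(0.99)
revenue_below_p99 = df.loc[df.fulfillment_time_seconds <= p99_time, 'order_value_usd'].sum()
print(f'Orders at or below the p99 payment time carry {revenue_below_p99 / df.order_value_usd.sum() * 100:0.0f}% of revenue')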
# sort orders by payment time and compute normalized cumulative sums
df.sort_values('fulfillment_time_seconds', inplace=True)
dfc = df[['fulfillment_time_seconds', 'order_value_usd']].cumsum()
dfc = dfc / dfc.max()  # percent cumulative sum
fig, ax = plt.subplots(figsize=(4, 4))
ax.plot(dfc.fulfillment_time_seconds.values, color='k')
ax2 = ax.twinx()
ax2.plot(dfc.order_value_usd.values, color='magenta')
ax.set_ylabel('cumulative fulfillment time')
ax.set_xlabel('orders sorted by fulfillment time')
ax2.set_ylabel('cumulative order value', color='magenta')
# mark the p99 cut-off of orders and the corresponding share of revenue
ax.axvline(0.99 * 1000, linestyle='--', color='k')
ax.annotate('99% of orders', xy=(970, 0.05), ha='right')
ax.axhline(0.73, linestyle='--', color='magenta')
ax.annotate('73% of revenue', xy=(0, 0.75), color='magenta')
Above, we see why there is a large mismatch between payment time percentiles and GMV. The GMV curve rises sharply near the 99th percentile of orders, causing the top 1% of orders to have a huge impact on GMV.
This is not just an artifact of our fictitious data. Unfortunately, such extreme correlations are not uncommon. For example, the top 1% of Slack's customers account for 50% of its revenue, and about 12% of UPS's revenue comes from a single customer (Amazon).
To avoid the pitfalls of optimizing solely for p99, we can take a more holistic approach.
One solution is to track p99 and p100 (the maximum value) simultaneously. This way, we won't be prone to ignoring high-value users.
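As a minimal sketch, using the same df as above, this can be as simple as reporting the maximum alongside the 99th percentile:

# report p99 and p100 (the maximum) of payment times side by side,
# so the slowest, highest-value checkouts are never silently dropped
p99 = df.fulfillment_time_seconds.quantile(0.99)
p100 = df.fulfillment_time_seconds.max()
print(f'p99: {p99:0.1f} s, p100 (max): {p100:0.1f} s')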
Another solution is to use p99 weighted by revenue (or weighted by gross merchandise value, profit, or any other business metric of interest), which assigns greater weight to observations with higher associated revenues. This metric ensures that optimization efforts prioritize the most valuable transactions or processes, rather than treating all observations equally.
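Here is one possible sketch of such a metric (the weighted_quantile helper below is my own, not a pandas or NumPy built-in): instead of asking below what time 99% of orders fall, we ask below what time 99% of revenue falls.

import numpy as np

def weighted_quantile(values, weights, q):
    """Smallest value below which a fraction q of the total weight falls."""
    order = np.argsort(values)
    values, weights = values[order], weights[order]    # sort observations by value
    cum_share = np.cumsum(weights) / weights.sum()     # cumulative weight share
    return values[np.searchsorted(cum_share, q)]       # first value reaching share q

revenue_weighted_p99 = weighted_quantile(df.fulfillment_time_seconds.values,
                                         df.order_value_usd.values, 0.99)
print(f'Revenue-weighted p99 of payment time: {revenue_weighted_p99:0.1f} s')

A target defined this way naturally pulls attention toward the slow, high-value checkouts that a plain p99 leaves out.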
Finally, when there are high correlations between performance and business metrics, a stricter p99.5 or p99.9 can mitigate the risk of ignoring high-value users.
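A rough comparison on the same data (same column names as above) computes how much revenue still falls above each cut-off:

# compare how much revenue falls above each cut-off as the percentile gets stricter
for label, q in [('p99', 0.99), ('p99.5', 0.995), ('p99.9', 0.999)]:
    cutoff = df.fulfillment_time_seconds.quantile(q)
    ignored = df.loc[df.fulfillment_time_seconds > cutoff, 'order_value_usd'].sum()
    print(f'{label}: cut-off {cutoff:0.1f} s, revenue above cut-off: {ignored / df.order_value_usd.sum() * 100:0.0f}%')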
It's tempting to rely solely on metrics like p99 for optimization efforts. However, as we saw, ignoring the top 1% of observations can negatively impact a large share of other business results. Tracking p99 and p100 together, or using a revenue-weighted p99, can provide a more complete view and mitigate the risks of optimizing only for p99. At a minimum, let's remember not to focus so narrowly on any single performance metric that we lose sight of the customer's overall outcomes.