Imagine you run an eCommerce platform and want to personalize your email campaigns based on user activity over the past week. If a user has been less active than in previous weeks, you plan to send them a discount offer.
You gathered user statistics and noticed the following for a user named John:
- John visited the platform for the first time 15 days ago.
- During the first 7 days (days 1 to 7), he made 9 visits.
- Over the next 7-day window (days 2 to 8), he made 8 visits.
- Sliding the window one day at a time up to today gives 9 values in total: 8 previous ones and the most recent one.
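The rolling counts above can be derived from daily visit counts. Here is a minimal sketch. The `daily_visits` array is hypothetical: it is made up so that its 7-day window sums reproduce John's numbers, and it is not data from the platform.

```python
import numpy as np

# Hypothetical daily visit counts for days 1..15 (illustrative only),
# chosen so that the 7-day sums match the values used in this article
daily_visits = np.array([2, 2, 1, 1, 2, 0, 1, 1, 0, 0, 4, 0, 2, 0, 0])

window = 7
# Sum over a sliding window: 15 days give 15 - 7 + 1 = 9 windows
rolling_visits = np.convolve(daily_visits, np.ones(window, dtype=int), mode="valid")

history = rolling_visits[:-1]  # the 8 previous windows
latest = rolling_visits[-1]    # the most recent 7 days
print(history, latest)
```

Here `history` plays the role of `visits` and `latest` the role of `num_visits_last_week` in the code that follows.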
Now you want to evaluate how extreme the most recent value is compared to previous ones.
import numpy as np
visits = np.array([9, 8, 6, 5, 8, 6, 8, 7])
num_visits_last_week = 6
Let's create a CDF of these values.
import numpy as np
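import matplotlib.pyplot as plt
# Unique visit counts in ascending order
values = np.array(sorted(set(visits)))
# How many times each unique value occurs in the history
counts = np.array([np.sum(visits == x) for x in values])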
probabilities = counts / counts.sum()
cdf = np.cumsum(probabilities)
plt.scatter(values, cdf, color='black', linewidth=10)
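As a cross-check, the same empirical CDF can be sketched with `np.unique`, which returns the sorted unique values and their counts in one call. The names `ecdf_values`, `ecdf_counts` and `ecdf` are introduced here only for illustration.

```python
import numpy as np

visits = np.array([9, 8, 6, 5, 8, 6, 8, 7])

# Sorted unique values and how often each one occurs
ecdf_values, ecdf_counts = np.unique(visits, return_counts=True)
# Running share of observations that are <= each unique value
ecdf = np.cumsum(ecdf_counts) / ecdf_counts.sum()

for v, p in zip(ecdf_values, ecdf):
    print(f"P(visits <= {v}) = {p:.3f}")
# P(visits <= 5) = 0.125
# P(visits <= 6) = 0.375
# P(visits <= 7) = 0.500
# P(visits <= 8) = 0.875
# P(visits <= 9) = 1.000
```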
Now we need to restore the function based on these values. We will use spline interpolation.
from scipy.interpolate import make_interp_spline
x_new = np.linspace(values.min(), values.max(), 300)
spline = make_interp_spline(values, cdf, k=3)
cdf_smooth = spline(x_new)
plt.plot(x_new, cdf_smooth, label='Spline CDF', color='black', linewidth=4)
plt.scatter(values, cdf, color='black', linewidth=10)
plt.scatter(values[-2:], cdf[-2:], color='#f95d5f', linewidth=10, zorder=5)
plt.show()
Not bad. But there is a small problem between the red dots: the curve is not monotonic there, while a CDF must never decrease. Let's fix this with PCHIP, a piecewise cubic Hermite interpolating polynomial that preserves monotonicity.
from scipy.interpolate import PchipInterpolator
spline_monotonic = PchipInterpolator(values, cdf)
cdf_smooth = spline_monotonic(x_new)
plt.plot(x_new, cdf_smooth, color='black', linewidth=4)
plt.scatter(values, cdf, color='black', linewidth=10)
plt.show()
Okay, now it's perfect.
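The difference between the two interpolants can also be checked numerically rather than by eye. The sketch below rebuilds both curves from the CDF points computed earlier. The helper `is_non_decreasing` is a hypothetical name introduced for this check.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator, make_interp_spline

# Empirical CDF points from the steps above
values = np.array([5.0, 6.0, 7.0, 8.0, 9.0])
cdf = np.array([0.125, 0.375, 0.5, 0.875, 1.0])
x_new = np.linspace(values.min(), values.max(), 300)

cubic = make_interp_spline(values, cdf, k=3)(x_new)
pchip = PchipInterpolator(values, cdf)(x_new)

def is_non_decreasing(y, tol=1e-12):
    """True if consecutive samples never drop by more than tol."""
    return bool(np.all(np.diff(y) >= -tol))

print("cubic spline non-decreasing:", is_non_decreasing(cubic))
print("PCHIP non-decreasing:", is_non_decreasing(pchip))
print("cubic spline maximum:", cubic.max())
```

A maximum above 1 for the cubic spline is another sign that it is not a valid CDF, since a probability cannot exceed 1.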
To calculate the p-value for our current observation (6 visits over the last week), we need the area of the filled region, which equals the value of the CDF at that point.
To do this, let's create a simple function, calculate_p_value:
def calculate_p_value(x):
    if x < values.min():
        return 0  # below every observed value
    elif x > values.max():
        return 1  # above every observed value
    else:
        return spline_monotonic(x)  # interpolated CDF value

p_value = calculate_p_value(num_visits_last_week)
print(f"Probability of getting {num_visits_last_week} visits or fewer equals: {p_value}")
Probability of getting 6 visits or fewer equals: 0.375
So the probability is quite high (compared with a threshold of 0.1, for example), and we decide not to send John the discount. We must repeat the same calculation for every user.
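Repeating the calculation for every user could look roughly like the sketch below. It wraps the steps above into one function and applies the 0.1 threshold from the example. The function name `left_tail_p_value`, the `users` dictionary and the data for "alice" are hypothetical and serve only as an illustration.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

THRESHOLD = 0.1  # example threshold mentioned in the article

def left_tail_p_value(history, current):
    """Interpolated empirical CDF of `history`, evaluated at `current`."""
    values, counts = np.unique(np.asarray(history, dtype=float), return_counts=True)
    if current < values.min():
        return 0.0
    if current > values.max():
        return 1.0
    if len(values) < 2:
        # Only one distinct value, equal to `current`: nothing to interpolate
        return 1.0
    cdf = np.cumsum(counts) / counts.sum()
    return float(PchipInterpolator(values, cdf)(current))

# Hypothetical per-user data: (previous weekly counts, most recent count)
users = {
    "john": ([9, 8, 6, 5, 8, 6, 8, 7], 6),
    "alice": ([3, 4, 4, 5, 6, 5, 4, 6], 2),
}

send_discount = {
    name: left_tail_p_value(history, current) < THRESHOLD
    for name, (history, current) in users.items()
}
print(send_discount)
```

With these inputs John gets no discount, because his p-value of 0.375 is above the threshold. The made-up user "alice" does, because her latest count is below everything in her history.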