Version 1.5 of scikit-learn includes a new class, TunedThresholdClassifierCV, which makes it easy to optimize the decision threshold of scikit-learn classifiers. A decision threshold is a cutoff point that converts the predicted probabilities generated by a machine learning model into discrete classes. In a binary classification setting, the default decision threshold of the .predict() method of scikit-learn classifiers is 0.5. Although this is a sensible default, it is rarely the best choice for classification tasks.
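To make the default concrete, here is a minimal sketch (not from the original post, using a synthetic dataset) showing that, for a model such as logistic regression, .predict() behaves like thresholding the positive-class probability at 0.5:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X)[:, 1]            # probability of the positive class
manual_pred = (proba > 0.5).astype(int)       # apply the default 0.5 cutoff by hand
# For logistic regression, this should match .predict() exactly
assert (manual_pred == clf.predict(X)).all()

TunedThresholdClassifierCV lets us replace that hard-coded 0.5 with a threshold chosen for the task at hand.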
This post introduces the TunedThresholdClassifierCV class and demonstrates how it can optimize decision thresholds for various binary classification tasks. This new class will help bridge the gap between data scientists who create models and business stakeholders who make decisions based on the model output. By adjusting decision thresholds, data scientists can improve model performance and better align with business objectives.
This post will cover the following situations where it is beneficial to adjust decision thresholds:
- Maximize a metric: Choose the threshold that maximizes a scoring metric, such as the F1 score.
- Cost-sensitive learning: Adjust the threshold when the cost of misclassifying a false positive is not equal to the cost of misclassifying a false negative and you have an estimate of the costs.
- Tuning under constraints: Optimize the operating point on the ROC or precision-recall curve to meet specific performance constraints.
The code used in this post and links to the datasets are available on GitHub.
Let's begin! First, import the necessary libraries, read the data, and split it into training and test sets.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    RocCurveDisplay,
    f1_score,
    make_scorer,
    recall_score,
    roc_curve,
    confusion_matrix,
)
from sklearn.model_selection import TunedThresholdClassifierCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

RANDOM_STATE = 26120
Maximize a metric
Before beginning the modeling process in any machine learning project, it is essential to work with stakeholders to determine which metrics to optimize. Making this decision early ensures that the project aligns with its intended goals.
Accuracy is not an ideal metric for evaluating model performance in fraud detection use cases because the data is highly imbalanced and most transactions are not fraudulent. The F1 score, the harmonic mean of precision and recall, is a better metric for imbalanced datasets such as fraud detection. Let's use the TunedThresholdClassifierCV class to optimize the decision threshold of a logistic regression model to maximize the F1 score.
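As a quick, hypothetical illustration of why accuracy misleads on imbalanced data (toy labels, not the fraud dataset): a model that always predicts the majority class scores 99% accuracy but an F1 of 0, because it never catches a single fraudulent transaction.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0] * 990 + [1] * 10)   # 1% positive class, like fraud
y_pred = np.zeros_like(y_true)            # always predict "not fraud"
print(accuracy_score(y_true, y_pred))     # 0.99
print(f1_score(y_true, y_pred))           # 0.0 (no fraud is ever caught)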
We will use the Kaggle Credit Card Fraud Detection Dataset to introduce the first situation where we need to adjust a decision threshold. First, split the data into training and test sets, then create a scikit-learn pipeline to scale the data and train a logistic regression model. Fit the pipeline to the training data so we can compare the performance of the original model with that of the tuned model.
creditcard = pd.read_csv("data/creditcard.csv")
y = creditcard["Class"]
X = creditcard.drop(columns=["Class"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
)
# Only Time and Amount need to be scaled
original_fraud_model = make_pipeline(
    ColumnTransformer(
        [("scaler", StandardScaler(), ["Time", "Amount"])],
        remainder="passthrough",
        force_int_remainder_cols=False,
    ),
    LogisticRegression(),
)
original_fraud_model.fit(X_train, y_train)
No threshold tuning has happened yet; that comes in the next block of code. The arguments to TunedThresholdClassifierCV are similar to those of other CV classes in scikit-learn, such as GridSearchCV. At a minimum, the user only needs to pass the original estimator, and TunedThresholdClassifierCV will store the decision threshold that maximizes balanced accuracy (the default) using 5-fold stratified cross-validation (also the default). It then uses this threshold when .predict() is called. However, any scikit-learn metric (or callable) can be used as the scoring metric. Additionally, the user can pass the familiar cv argument to customize the cross-validation strategy.
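As a rough sketch of the API described above (illustrative only; these objects are not fitted or used later), constructing the class with no extra arguments relies on those defaults, and scoring and cv can be customized just like other *CV estimators:

from sklearn.model_selection import StratifiedKFold

# Defaults: balanced accuracy, 5-fold stratified cross-validation
tuned_default = TunedThresholdClassifierCV(original_fraud_model)

# Custom scoring metric and cross-validation strategy
tuned_custom = TunedThresholdClassifierCV(
    original_fraud_model,
    scoring="f1",
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=RANDOM_STATE),
)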
Create the TunedThresholdClassifierCV instance and fit it to the training data. Pass in the original model and set scoring to "f1". We also set store_cv_results=True to access the thresholds evaluated during cross-validation for visualization.
tuned_fraud_model = TunedThresholdClassifierCV(
    original_fraud_model,
    scoring="f1",
    store_cv_results=True,
)
tuned_fraud_model.fit(X_train, y_train)
# average F1 across folds
avg_f1_train = tuned_fraud_model.best_score_
# Compare F1 in the test set for the tuned model and the original model
f1_test = f1_score(y_test, tuned_fraud_model.predict(X_test))
f1_test_original = f1_score(y_test, original_fraud_model.predict(X_test))
print(f"Average F1 on the training set: {avg_f1_train:.3f}")
print(f"F1 on the test set: {f1_test:.3f}")
print(f"F1 on the test set (original model): {f1_test_original:.3f}")
print(f"Threshold: {tuned_fraud_model.best_threshold_: .3f}")
Average F1 on the training set: 0.784
F1 on the test set: 0.796
F1 on the test set (original model): 0.733
Threshold: 0.071
Now that we have found the threshold that maximizes the F1 score, check tuned_fraud_model.best_score_ to see the best average F1 score across all folds in cross-validation. We can also see which threshold generated those results using tuned_fraud_model.best_threshold_. You can visualize the metric scores across the decision thresholds evaluated during cross-validation using the cv_results_ attribute:
fig, ax = plt.subplots(figsize=(5, 5))
ax.plot(
    tuned_fraud_model.cv_results_["thresholds"],
    tuned_fraud_model.cv_results_["scores"],
    marker="o",
    linewidth=1e-3,
    markersize=4,
    color="#c0c0c0",
)
ax.plot(
    tuned_fraud_model.best_threshold_,
    tuned_fraud_model.best_score_,
    "^",
    markersize=10,
    color="#ff6700",
    label=f"Optimal cut-off point = {tuned_fraud_model.best_threshold_:.2f}",
)
ax.plot(
    0.5,
    f1_test_original,
    label="Default threshold: 0.5",
    color="#004e98",
    linestyle="--",
    marker="x",
    markersize=10,
)
ax.legend(fontsize=8, loc="lower center")
ax.set_xlabel("Decision threshold", fontsize=10)
ax.set_ylabel("F1 score", fontsize=10)
ax.set_title("F1 score vs. Decision threshold -- Cross-validation", fontsize=12)
# Check that the coefficients from the original model and the tuned model are the same
assert (tuned_fraud_model.estimator_[-1].coef_ ==
        original_fraud_model[-1].coef_).all()
We used the same underlying logistic regression model to evaluate two different decision thresholds. The underlying models are identical, as demonstrated by the equality of coefficients in the assertion above. The optimization in TunedThresholdClassifierCV is achieved with a post-processing step applied directly to the predicted probabilities generated by the model. Note, however, that TunedThresholdClassifierCV uses cross-validation by default to find the decision threshold, which helps avoid overfitting the training data.
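A small, hedged sketch can make the post-processing point concrete: assuming the default refit behavior (the estimator is refit on the full training set) and that probabilities are being thresholded, the tuned model's predictions should simply be the refitted estimator's positive-class probabilities cut at best_threshold_. Exact tie-handling at the threshold is an implementation detail, so we print the agreement rather than assert it.

# Positive-class probabilities from the refitted underlying estimator
proba_test = tuned_fraud_model.estimator_.predict_proba(X_test)[:, 1]
# Apply the tuned threshold by hand
manual_pred = (proba_test >= tuned_fraud_model.best_threshold_).astype(int)
# Expect (near-)full agreement with the tuned model's .predict()
print((manual_pred == tuned_fraud_model.predict(X_test)).mean())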
Cost-sensitive learning
Cost-sensitive learning is a type of machine learning that assigns a cost to each type of misclassification. This translates model performance into units that stakeholders understand, such as dollars saved.
We will use the TELCO Customer Churn Dataset, a binary classification dataset, to demonstrate the value of cost-sensitive learning. The goal is to predict whether a customer will churn or not, given characteristics about the customer's demographics, contract details, and other technical information about the customer's account. The motivation to use this data set (and some of the code) comes from Dan Becker's course on decision threshold optimization.
data = pd.read_excel("data/Telco_customer_churn.xlsx")
drop_cols = [
    "Count", "Country", "State", "Lat Long", "Latitude", "Longitude",
    "Zip Code", "Churn Value", "Churn Score", "CLTV", "Churn Reason"
]
data.drop(columns=drop_cols, inplace=True)

# Preprocess the data
data["Churn Label"] = data["Churn Label"].map({"Yes": 1, "No": 0})
data.drop(columns=["Total Charges"], inplace=True)

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(columns=["Churn Label"]),
    data["Churn Label"],
    test_size=0.2,
    random_state=RANDOM_STATE,
    stratify=data["Churn Label"],
)
Set up a basic pipeline to process the data and generate predicted probabilities with a random forest model. This will serve as a baseline for comparison with the TunedThresholdClassifierCV.
preprocessor = ColumnTransformer(
    transformers=[("one_hot", OneHotEncoder(),
                   selector(dtype_include="object"))],
    remainder="passthrough",
)

original_churn_model = make_pipeline(
    preprocessor, RandomForestClassifier(random_state=RANDOM_STATE)
)
original_churn_model.fit(X_train.drop(columns=["CustomerID"]), y_train);
The choice of preprocessing and model type is not important for this tutorial. The company wants to offer discounts to customers who are expected to churn. While collaborating with stakeholders, you learn that offering a discount to a customer who will not churn (a false positive) costs $80. You also learn that offering a discount to a customer who would have churned is worth $200. You can represent this relationship in a cost matrix:
def cost_function(y, y_pred, neg_label, pos_label):
    cm = confusion_matrix(y, y_pred, labels=[neg_label, pos_label])
    cost_matrix = np.array([[0, -80], [0, 200]])
    return np.sum(cm * cost_matrix)

cost_scorer = make_scorer(cost_function, neg_label=0, pos_label=1)
We also wrapped the cost function in a custom scikit-learn scorer. This scorer will be used as the scoring argument in TunedThresholdClassifierCV and to evaluate the profit on the test set.
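Before fitting the tuned model, a quick sanity check of the cost matrix orientation can help (a hedged sketch with made-up toy labels, not part of the dataset): confusion_matrix with labels=[neg_label, pos_label] puts true labels in rows and predictions in columns, so cm[0, 1] counts false positives and cm[1, 1] counts true positives.

# Toy example: one false positive (-$80) and two true positives (+$200 each)
y_toy_true = [0, 0, 0, 1, 1]
y_toy_pred = [0, 1, 0, 1, 1]
print(cost_function(y_toy_true, y_toy_pred, neg_label=0, pos_label=1))
# Expected output: 2 * 200 - 1 * 80 = 320

With the scorer behaving as expected, create the tuned model and compare the profit of the original and tuned models on the test set: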
tuned_churn_model = TunedThresholdClassifierCV(
    original_churn_model,
    scoring=cost_scorer,
    store_cv_results=True,
)
tuned_churn_model.fit(X_train.drop(columns=["CustomerID"]), y_train)

# Calculate the profit on the test set
original_model_profit = cost_scorer(
    original_churn_model, X_test.drop(columns=["CustomerID"]), y_test
)
tuned_model_profit = cost_scorer(
    tuned_churn_model, X_test.drop(columns=["CustomerID"]), y_test
)
print(f"Original model profit: {original_model_profit}")
print(f"Tuned model profit: {tuned_model_profit}")
Original model profit: 29640
Tuned model profit: 35600
The profit is higher for the tuned model than for the original one. Again, we can plot the objective metric against the decision thresholds to visualize how the decision threshold was selected on the training data during cross-validation:
fig, ax = plt.subplots(figsize=(5, 5))
ax.plot(
    tuned_churn_model.cv_results_["thresholds"],
    tuned_churn_model.cv_results_["scores"],
    marker="o",
    markersize=3,
    linewidth=1e-3,
    color="#c0c0c0",
    label="Objective score (using cost-matrix)",
)
ax.plot(
    tuned_churn_model.best_threshold_,
    tuned_churn_model.best_score_,
    "^",
    markersize=10,
    color="#ff6700",
    label="Optimal cut-off point for the business metric",
)
ax.legend()
ax.set_xlabel("Decision threshold (probability)")
ax.set_ylabel("Objective score (using cost-matrix)")
ax.set_title("Objective score as a function of the decision threshold")
In reality, assigning a static cost to all instances that are misclassified in the same way is not realistic from a business perspective. There are more advanced methods that tune the threshold with a per-instance cost, assigning a weight to each instance in the dataset. This is covered in the scikit-learn cost-sensitive learning example.
Tuning under constraints
This method is not currently covered in the scikit-learn documentation, but it is a common business case for binary classification. The tuning-under-constraint method finds a decision threshold by identifying a point on the ROC curve or precision-recall curve: the point that maximizes one axis while constraining the other. For this tutorial, we will use the Pima Indians Diabetes dataset. This is a binary classification task to predict whether an individual has diabetes.
Imagine that your model will be used as a screening test for an average-risk population and applied to millions of people. There are an estimated 38 million people with diabetes in the United States, roughly 11.6% of the population, so the specificity of the model must be high to avoid misdiagnosing millions of people and referring them for unnecessary confirmatory testing. Say your imaginary CEO has told you that they will not tolerate a false positive rate above 2%. Let's build a model that achieves this using TunedThresholdClassifierCV.
For this part of the tutorial, we will define a constraint function that finds the maximum true positive rate at a false positive rate of at most 2% (that is, a true negative rate of at least 98%).
def max_tpr_at_tnr_constraint_score(y_true, y_pred, max_tnr=0.5):
    fpr, tpr, thresholds = roc_curve(y_true, y_pred, drop_intermediate=False)
    tnr = 1 - fpr
    tpr_at_tnr_constraint = tpr[tnr >= max_tnr].max()
    return tpr_at_tnr_constraint

max_tpr_at_tnr_scorer = make_scorer(
    max_tpr_at_tnr_constraint_score, max_tnr=0.98)
data = pd.read_csv("data/diabetes.csv")
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(columns=["Outcome"]),
    data["Outcome"],
    stratify=data["Outcome"],
    test_size=0.2,
    random_state=RANDOM_STATE,
)
Build two models: a logistic regression model that serves as a baseline, and a TunedThresholdClassifierCV that wraps the baseline logistic regression model to achieve the objective set by the CEO. On the tuned model, set scoring=max_tpr_at_tnr_scorer. Again, the choice of model and preprocessing is not important for this tutorial.
# A baseline model
original_model = make_pipeline(
    StandardScaler(), LogisticRegression(random_state=RANDOM_STATE)
)
original_model.fit(X_train, y_train)

# A tuned model
tuned_model = TunedThresholdClassifierCV(
    original_model,
    thresholds=np.linspace(0, 1, 150),
    scoring=max_tpr_at_tnr_scorer,
    store_cv_results=True,
    cv=8,
    random_state=RANDOM_STATE,
)
tuned_model.fit(X_train, y_train)
Compare the difference between the scikit-learn estimators' default decision threshold of 0.5 and the one found with the tuning-under-constraint approach on the ROC curve.
# Get the fpr and tpr of the original model
original_model_proba = original_model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, original_model_proba)
closest_threshold_to_05 = (np.abs(thresholds - 0.5)).argmin()
fpr_orig = fpr[closest_threshold_to_05]
tpr_orig = tpr[closest_threshold_to_05]

# Get the tnr and tpr of the tuned model
max_tpr = tuned_model.best_score_
constrained_tnr = 0.98
# Plot the ROC curve and compare the default threshold to the tuned threshold
fig, ax = plt.subplots(figsize=(5, 5))
# Note that this will be the same for both models
disp = RocCurveDisplay.from_estimator(
    original_model,
    X_test,
    y_test,
    name="Logistic Regression",
    color="#c0c0c0",
    linewidth=2,
    ax=ax,
)
disp.ax_.plot(
    1 - constrained_tnr,
    max_tpr,
    label=f"Tuned threshold: {tuned_model.best_threshold_:.2f}",
    color="#ff6700",
    linestyle="--",
    marker="o",
    markersize=11,
)
disp.ax_.plot(
    fpr_orig,
    tpr_orig,
    label="Default threshold: 0.5",
    color="#004e98",
    linestyle="--",
    marker="x",
    markersize=11,
)
disp.ax_.set_ylabel("True Positive Rate", fontsize=8)
disp.ax_.set_xlabel("False Positive Rate", fontsize=8)
disp.ax_.tick_params(labelsize=8)
disp.ax_.legend(fontsize=7)
The tuning-under-constraint approach found a threshold of 0.80, resulting in an average sensitivity of 19.2% during cross-validation on the training data. Compare the sensitivity and specificity to see how the threshold holds up on the test set. Did the model meet the CEO's specificity requirement on the test set?
# Average sensitivity and specificity on the training set
avg_sensitivity_train = tuned_model.best_score_

# Call predict from tuned_model to calculate sensitivity and specificity on the test set
specificity_test = recall_score(
    y_test, tuned_model.predict(X_test), pos_label=0)
sensitivity_test = recall_score(y_test, tuned_model.predict(X_test))
print(f"Average sensitivity on the training set: {avg_sensitivity_train:.3f}")
print(f"Sensitivity on the test set: {sensitivity_test:.3f}")
print(f"Specificity on the test set: {specificity_test:.3f}")
Average sensitivity on the training set: 0.192
Sensitivity on the test set: 0.148
Specificity on the test set: 0.990
Conclusion
The new TunedThresholdClassifierCV class is a powerful tool that can help you become a better data scientist by letting you share with business leaders how you arrived at a decision threshold. You learned how to use the new scikit-learn TunedThresholdClassifierCV class to maximize a metric, perform cost-sensitive learning, and tune a metric under a constraint. This tutorial is not intended to be comprehensive or advanced; I wanted to introduce the new feature and highlight its power and flexibility in solving binary classification problems. See the scikit-learn documentation, user guide, and examples for complete usage examples.
A big shout-out to Guillaume Lemaitre for his work on this feature.
Thank you for reading. Happy tuning.
Data licenses:
Credit card fraud: DbCL
Pima Indian Diabetes: CC0
Telco customer churn: commercial use OK