Co-authored with S. Hué, C. Hurlin and C. Pérignon.
The reliability and acceptability of artificial intelligence (AI) systems largely depend on the ability of users to understand the associated models, or at least their predictions. To unravel opaque AI applications, explainable AI (XAI) methods, such as post-hoc interpretation tools (e.g. SHAP, LIME), are commonly used today, and the insights generated from their results are now widely understood.
Beyond individual forecasts, in this article we show how to identify the drivers of performance metrics (e.g. AUC, R2) of any classification or regression model using the eXplainable PERformance (XPER) methodology. Being able to identify the driving forces of the statistical or economic performance of a predictive model is the very foundation of modeling and is of great importance to both data scientists and experts who base their decisions on such models. The XPER library described below has proven to be an efficient tool for decomposing performance metrics into individual feature contributions.
While based on the same mathematical principles, XPER and SHAP are fundamentally different tools with different goals. SHAP pinpoints the features that most influence individual model predictions, whereas XPER identifies the features that contribute most to model performance. The latter analysis can be performed at the global (model) or local (instance) level. In practice, the feature with the greatest impact on individual predictions (say, feature A) need not be the one with the greatest impact on performance: feature A drives individual decisions both when the model is correct and when it makes a mistake. Consequently, if feature A primarily affects wrong predictions, it may be ranked lower by XPER than by SHAP.
What is a performance decomposition used for? First, it can improve any post-hoc interpretability analysis by giving a more complete insight into the inner workings of the model. This allows for a deeper understanding of why the model is or is not working effectively. Second, XPER can help to identify and address heterogeneity issues. Indeed, by analyzing individual XPER values, it is possible to identify subsamples in which features have similar effects on performance. A separate model can then be estimated for each subsample to improve predictive performance. Third, XPER can help to understand the origin of overfitting. In fact, XPER allows us to identify some features that contribute more to model performance in the training sample than in the test sample.
The XPER framework is a theoretically grounded method that is based on Shapley values (Shapley, 1953), a decomposition method from coalitional game theory. While Shapley values decompose an outcome across players in a game, XPER values decompose a performance metric (e.g., AUC, R2) across features of a model.
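For completeness, here is the standard Shapley value formula that underlies XPER, where N is the set of players, v(S) is the value achieved by coalition S, and the sum runs over all coalitions S that exclude player i:

𝜙ᵢ = Σ_{S ⊆ N∖{i}} [ |S|! (|N| − |S| − 1)! / |N|! ] × [ v(S ∪ {i}) − v(S) ]

In XPER, loosely speaking, the players are the features and v(S) is the expected performance of the model when only the features in S carry information, which is what turns this game-theoretic allocation into a performance decomposition.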
Suppose we train a classification model using three features and its predictive performance is measured with an AUC equal to 0.78. An example of XPER decomposition is as follows:
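A concrete instance (the individual values below are illustrative, chosen to be consistent with the discussion that follows):

AUC = 𝜙₀ + 𝜙₁ + 𝜙₂ + 𝜙₃ = 0.50 + 0.14 + 0.09 + 0.05 = 0.78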
The first XPER value 𝜙₀ is called the benchmark and represents the performance of the model if none of the three features provided relevant information for predicting the target variable. When using AUC to evaluate the predictive performance of a model, the benchmark value corresponds to a random classification. Since the model’s AUC is greater than 0.50, it implies that at least one feature contains useful information for predicting the target variable. The difference between the model’s AUC and the benchmark represents the contribution of the features to the model’s performance, which can be decomposed using XPER values. In this example, the decomposition indicates that the first feature is the main driver of the model’s predictive performance, as it explains half of the difference between the model’s AUC and a random classification (𝜙₁), followed by the second feature (𝜙₂) and the third (𝜙₃). These results measure the overall effect of each feature on the predictive performance of the model and rank them from the least important (the third feature) to the most important (the first feature).
While the XPER framework can be used to perform a global analysis of model predictive performance, it can also be used to provide a local analysis at the instance level. At the local level, the XPER value corresponds to the contribution of a given instance and feature to the predictive performance of the model. The benchmark then represents the contribution of a given observation to predictive performance if the target variable was independent of the features, and the difference between the individual contribution and the benchmark is explained by the individual XPER values. Individual XPER values therefore allow us to understand why some observations contribute more to the predictive performance of a model than others, and can be used to address heterogeneity issues by identifying groups of individuals for which features have similar effects on performance.
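In symbols, writing AUCᵢ for the contribution of observation i to the AUC (our notation, for illustration), the local decomposition mirrors the global one:

AUCᵢ − 𝜙₀ = 𝜙ᵢ₁ + 𝜙ᵢ₂ + … + 𝜙ᵢₚ

where 𝜙ᵢⱼ denotes the individual XPER value of feature j for observation i, and p is the number of features.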
It is also important to note that XPER is model- and metric-agnostic. This implies that XPER values can be used to interpret the predictive performance of any econometric or machine learning model, and to break down any performance metric, such as predictive accuracy measures (AUC, accuracy), statistical loss functions (MSE, MAE), or economic performance measures (profit-and-loss functions).
01 — Download Library
The XPER approach is implemented in Python via the XPER library. To calculate XPER values, you must first install the library as follows:
pip install XPER
02 — Import library
import XPER
import pandas as pd
03 — Load sample dataset
To illustrate how to use XPER values in Python, let us take a concrete example. Consider a classification problem whose main objective is to predict credit default. The dataset can be imported directly from the XPER library, as:
import XPER
from XPER.datasets.load_data import loan_status
# Load the loan dataset and keep the first six columns
loan = loan_status().iloc[:, :6]
display(loan.head())
display(loan.shape)
Given the variables it contains, the natural goal with this dataset is to build a predictive model for the “loan status” of a potential borrower. In other words, we want to predict whether a loan application will be approved (“1”) or not (“0”) based on the information provided by the applicant.
# Remove the 'Loan_Status' column from the 'loan' dataframe and assign the features to 'x'
x = loan.drop(columns='Loan_Status')

# Create a series 'Y' containing only the 'Loan_Status' column from the 'loan' dataframe
Y = pd.Series(loan['Loan_Status'])
04 — Estimate the model
Next, we need to train a predictive model and measure its performance to calculate the associated XPER values. For illustrative purposes, we split the initial dataset into a training and a test set and fit an XGBoost classifier to the training set:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
# x: input features
# Y: target variable
# test_size: the proportion of the dataset to include in the testing set (in this case, 15%)
# random_state: the seed value used by the random number generator for reproducible results
X_train, X_test, y_train, y_test = train_test_split(x, Y, test_size=0.15, random_state=3)
import xgboost as xgb
# Create an XGBoost classifier object
gridXGBOOST = xgb.XGBClassifier(eval_metric="error")
# Train the XGBoost classifier on the training data
model = gridXGBOOST.fit(X_train, y_train)
05 — Evaluate performance
The XPER library provides an intuitive and simple way to measure the predictive performance of a model. Considering that the performance metric of interest is the area under the ROC curve (AUC), it can be computed on the test set as follows:
from XPER.compute.Performance import ModelPerformance

# Define the evaluation metric(s) to be used
XPER = ModelPerformance(X_train.values,
y_train.values,
X_test.values,
y_test.values,
model)
# Evaluate the model performance using the specified metric(s)
PM = XPER.evaluate(("AUC"))
# Print the performance metrics
print("Performance Metrics: ", round(PM, 3))
06 — Calculate XPER values
Finally, to explain the driving forces of AUC, XPER values can be calculated as follows:
# Calculate XPER values for the model's performance
XPER_values = XPER.calculate_XPER_values(("AUC"),kernel=False)
“XPER_values” is a tuple containing two elements: the XPER values of the features and the individual (per-observation) XPER values.
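A minimal unpacking sketch (the variable names below are ours, not part of the library’s API):

# Unpack the tuple: first the global XPER values, then the per-observation ones
phi_global, phi_individual = XPER_values

print(phi_global)      # aggregate contribution of each feature to the AUC
print(phi_individual)  # one row of XPER values per test observation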
For use cases with more than 10 feature variables, it is recommended to use the default kernel=True option for higher computational efficiency.
07 — Visualization
from XPER.viz.Visualisation import visualizationClass as viz

labels = list(loan.drop(columns='Loan_Status').columns)
To analyze the driving forces at the global level, the XPER library offers a bar chart representation of XPER values.
viz.bar_plot(XPER_values=XPER_values, X_test=X_test, labels=labels, p=5, percentage=True)
For ease of presentation, feature contributions are expressed as a percentage of the difference between the AUC and its benchmark, i.e. 0.50 for the AUC, and are ordered from highest to lowest. From this figure, we can see that more than 78% of the model's outperformance over a random predictor comes from Credit history, followed by Applicant's income, which contributes around 16% to performance, and Co-applicant's income and Loan Amount Term, each representing less than 6%. On the other hand, the variable Loan amount hardly helps the model to better predict the probability of default, as its contribution is close to 0.
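To make the percentage construction explicit, each feature’s share is its XPER value divided by the distance between the model’s AUC and the 0.50 benchmark. A hypothetical computation, assuming (as in the unpacking sketch above) that phi_global stores the benchmark first and then one value per feature:

import numpy as np

# Assumed layout: phi_global[0] is the benchmark, the rest are feature contributions
benchmark = phi_global[0]
contributions = np.array(phi_global[1:])

# Share of each feature in the model's outperformance over a random classifier
shares = 100 * contributions / (PM - benchmark)
for name, share in zip(labels, shares):
    print(f"{name}: {share:.1f}%")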
The XPER library also proposes graphical representations to analyze XPER values at a local level. First, a force plot can be used to analyze the driving forces of performance for a given observation:
viz.force_plot(XPER_values=XPER_values, instance=10, X_test=X_test, variable_name=labels, figsize=(16, 4))
The above code plots the positive (negative) XPER values of observation #10 in red (blue), as well as the benchmark value (0.33) and the contribution (0.46) of this observation to the AUC of the model. The superior performance of borrower #10 is due to the positive XPER values of Loan Amount Term, Applicant's income, and Credit history. On the other hand, Co-applicant's income and Loan amount had a negative effect and decreased the contribution of this borrower.
We can see that while Co-applicant's income and Loan amount have a positive effect on the aggregate AUC, these variables have a negative effect for borrower #10. The analysis of individual XPER values can therefore identify groups of observations for which features have different effects on performance, potentially highlighting a heterogeneity problem.
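One way to act on this observation (our own sketch, not a feature of the XPER library) is to cluster observations on their individual XPER values and then fit a separate model per cluster, as suggested earlier:

from sklearn.cluster import KMeans
import numpy as np

# Cluster test observations by their individual XPER values to look for
# subgroups in which features affect performance similarly
# (assumed shape: one row per observation, one column per feature)
phi_matrix = np.asarray(phi_individual)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(phi_matrix)

# Each cluster can then be modeled separately to improve performance
print("Observations per cluster:", np.bincount(clusters))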
Second, it is possible to represent the XPER values of each observation and feature in a single chart. To do so, one can use a beeswarm plot, which represents the XPER values for each feature as a function of the feature value.
viz.beeswarn_plot(XPER_values=XPER_values, X_test=X_test, labels=labels)
In this figure, each point represents an observation. The horizontal axis represents the contribution of each observation to the model performance, while the vertical axis represents the magnitude of the feature values. Similar to the bar chart shown above, the features are ordered from those that contribute the most to the model performance to those that contribute the least. However, with the beeswarm chart it is also possible to analyze the effect of feature values on the XPER values. In this example, we can see that large values of Credit history are associated with relatively small contributions (in absolute value), whereas low values lead to larger contributions (in absolute value).
All images, unless otherwise stated, are the author's own.
References
(1) L. Shapley, A Value for n-Person Games (1953), Contributions to the Theory of Games, 2:307–317
(2) S. Lundberg, S. Lee, A Unified Approach to Interpreting Model Predictions (2017), Advances in Neural Information Processing Systems
(3) S. Hué, C. Hurlin, C. Pérignon, S. Saurin, Measuring the Driving Forces of Predictive Performance: Application to Credit Scoring (2023), HEC Paris Research Paper No. FIN-2022-1463