Causal AI, exploring the integration of causal reasoning into machine learning
Welcome to my series on Causal AI, where we will explore the integration of causal reasoning into machine learning models. Expect to explore a number of practical applications across different business contexts.
In the last article we covered measuring the intrinsic causal influence of your marketing campaigns. In this article we move on to validating the causal impact of synthetic controls.
If you missed the last article on intrinsic causal influence, check it out here:
In this article we will focus on understanding the synthetic control method and explore how we can validate the estimated causal impact.
The following aspects will be covered:
- What is the synthetic control method?
- What challenge are you trying to overcome?
- How can we validate the estimated causal impact?
- A Python case study using realistic Google Trends data that demonstrates how we can validate the estimated causal impact of synthetic controls.
The complete notebook can be found here:
What is it?
The synthetic control method is a causal technique that can be used to evaluate the causal impact of an intervention or treatment when a randomized controlled trial (RCT) or A/B test was not possible. It was originally proposed in 2003 by Abadie and Gardeazabal. The following paper includes an excellent case study to help you understand the proposed method:
https://web.stanford.edu/~jhain/Paper/JASA2010.pdf
Let's cover some of the basics ourselves… The synthetic control method creates a counterfactual version of the treatment unit by creating a weighted combination of control units that did not receive the intervention or treatment.
- Treated unit: The unit that receives the intervention.
- Control units: A set of similar units that did not receive the intervention.
- Counterfactual: Created as a weighted combination of control units. The goal is to find weights for each control unit that produce a counterfactual which closely matches the treated unit in the pre-intervention period.
- Causal impact: The difference between the post-intervention treated unit and its counterfactual.
If we really wanted to simplify things, we could think of it as a linear regression in which each control unit is a feature and the treated unit is the target. The pre-intervention period is our training set, and we use the model to score the post-intervention period. The difference between actual and predicted is the causal impact.
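To make that framing concrete, here is a toy sketch (the mini DataFrame below is entirely made up for illustration) of fitting on the pre-intervention rows and scoring the post-intervention rows:
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up panel: control units as feature columns, the treated unit as the target
panel = pd.DataFrame({
    "period":   ["pre", "pre", "pre", "pre", "post", "post"],
    "region_a": [10, 11, 12, 11, 13, 14],
    "region_b": [20, 21, 22, 21, 23, 24],
    "treated":  [15, 16, 17, 16, 21, 23],
})

pre = panel[panel["period"] == "pre"]     # pre-intervention rows = training set
post = panel[panel["period"] == "post"]   # post-intervention rows = scoring set
controls = ["region_a", "region_b"]

model = LinearRegression().fit(pre[controls], pre["treated"])
counterfactual = model.predict(post[controls])

# Estimated causal impact = actual minus predicted in the post-intervention period
print((post["treated"] - counterfactual).sum())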
Below are a couple of examples that bring to life when we might consider using it:
- When we run a television marketing campaign, we cannot randomly assign the audience into those who can and cannot see the campaign. However, we could carefully select one region to test the campaign in and use the remaining regions as control units. Once we have measured the effect, the campaign could be rolled out to other regions. This is often called a geo-lift test.
- Policy changes that are introduced in some regions but not in others. For example, a local council may implement a policy change to reduce unemployment. Other regions where the policy was not in effect could be used as control units.
What challenge are you trying to overcome?
When we combine high dimensionality (many features) with limited observations, we can obtain a model that overfits.
Let's take the geo-lift example to illustrate. If we use weekly data from the past year as our pre-intervention period, that gives us 52 observations. If we then decide to test our intervention in one European country and use the remaining countries as control units, that gives us close to a 1:1 observation-to-feature ratio!
Previously we talked about how the synthetic control method could be implemented using linear regression. However, this observation-to-feature ratio means that linear regression is very likely to overfit, resulting in a poor causal impact estimate in the post-intervention period.
In linear regression, the weights (coefficients) for each feature (control unit) can be negative or positive, and they can sum to a number greater than 1. However, the synthetic control method learns the weights while applying the following constraints:
- Constrain the weights to sum to 1
- Constrain the weights to be ≥ 0
These constraints help with regularization and prevent extrapolation beyond the range of the observed data.
It is worth noting that in terms of regularization, Ridge and Lasso regression can achieve this and in some cases are reasonable alternatives. But we will test this in the case study!
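To preview how those two constraints can be encoded, here is a minimal sketch using scipy.optimize.minimize on made-up toy data (the full implementation used in the case study appears later in the article):
import numpy as np
from scipy.optimize import minimize

# Toy data: rows = pre-intervention weeks, columns = control units (made up for illustration)
controls = np.array([[10., 12., 9.],
                     [11., 13., 9.],
                     [12., 14., 10.],
                     [11., 15., 10.],
                     [13., 16., 11.]])
treated = np.array([10.5, 11.5, 12.5, 12.0, 13.5])

def loss(w):
    # Distance between the treated unit and the weighted combination of control units
    return np.sqrt(np.sum((treated - controls @ w) ** 2))

n = controls.shape[1]
result = minimize(
    loss,
    x0=np.ones(n) / n,                                         # start from equal weights
    method='SLSQP',
    bounds=[(0, 1)] * n,                                       # weights >= 0
    constraints={'type': 'eq', 'fun': lambda w: w.sum() - 1},  # weights sum to 1
)
print(result.x, result.x.sum())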
How can we validate the estimated causal impact?
Arguably a greater challenge is the fact that we cannot validate the estimated causal impact in the post-intervention period.
How long should my pre-intervention period be? Are we sure we haven't overfitted the pre-intervention period? How do we know whether our model generalizes well to the post-intervention period? What if I want to try different implementations of the synthetic control method?
We could randomly select some observations from the pre-intervention period and hold them out for validation. But we have already highlighted the challenge of limited observations, so holding data out could make things even worse!
What if we could run some type of simulation prior to the intervention? Could that help us answer some of the questions highlighted above and gain confidence in the estimated causal impact of our models? Everything will be explained in the case study!
Background
After convincing Finance that brand marketing drives great value, the marketing team approaches you to ask about geo-lift testing. Someone at Facebook told them it's the next big thing (although it was the same person who told them Prophet was a good forecasting model) and they want to know whether they could use it to measure their upcoming TV campaign.
They're a little worried, since the last time they ran a geo-lift test the marketing analytics team thought it was a good idea to keep playing with the pre-intervention period until they got a big causal impact.
This time, you suggest that they conduct a “pre-intervention simulation,” after which you propose that the pre-intervention period be agreed upon before the test begins.
So let’s explore what a “pre-intervention simulation” looks like!
Creating the data
To make this as realistic as possible, I pulled some Google Trends data for most countries in Europe. The search term is not relevant, just imagine that it is your company's sales (and that you operate throughout Europe).
However, if you are interested in how I got the Google Trends data, check out my notebook:
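To give a flavour of what that can look like, here is a hedged sketch using the pytrends package (an assumed approach shown purely for illustration, not necessarily the same code as the notebook):
from pytrends.request import TrendReq
import pandas as pd

pytrends = TrendReq(hl="en-US", tz=0)
countries = ["GB", "DE", "FR"]   # small subset for illustration; the case study uses ~50

frames = []
for geo in countries:
    # Pull weekly interest for a single search term in each country
    pytrends.build_payload(kw_list=["example search term"], timeframe="2021-06-01 2024-06-02", geo=geo)
    interest = pytrends.interest_over_time()
    frames.append(interest["example search term"].rename(geo))

df_trends = pd.concat(frames, axis=1)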
Below we can see the data frame. We have sales for the last 3 years in 50 European countries. The marketing team plans to run its television campaign in Great Britain.
Now here comes the smart part. We will simulate an intervention in the last 7 weeks of the time series.
import numpy as np
import pandas as pd

np.random.seed(1234)

# Create intervention flag
mask = (df['date'] >= "2024-04-14") & (df['date'] <= "2024-06-02")
df['intervention'] = mask.astype(int)

row_count = len(df)

# Create intervention uplift
df['uplift_perc'] = np.random.uniform(0.10, 0.20, size=row_count)
df['uplift_abs'] = round(df['uplift_perc'] * df['GB'])
df['y'] = df['GB']
df.loc[df['intervention'] == 1, 'y'] = df['GB'] + df['uplift_abs']
Now let's plot the actual and counterfactual sales in GB to bring what we've done to life:
import matplotlib.pyplot as plt
import seaborn as sns

def synth_plot(df, counterfactual):
    plt.figure(figsize=(14, 8))
    sns.set_style("white")

    # Create plot
    sns.lineplot(data=df, x='date', y='y', label='Actual', color='b', linewidth=2.5)
    sns.lineplot(data=df, x='date', y=counterfactual, label='Counterfactual', color='r', linestyle='--', linewidth=2.5)
    plt.title('Synthetic Control Method: Actual vs. Counterfactual', fontsize=24)
    plt.xlabel('Date', fontsize=20)
    plt.ylabel('Metric Value', fontsize=20)
    plt.legend(fontsize=16)
    plt.gca().xaxis.set_major_formatter(plt.matplotlib.dates.DateFormatter('%Y-%m-%d'))
    plt.xticks(rotation=90)
    plt.grid(True, linestyle='--', alpha=0.5)

    # Highlight the intervention point
    intervention_date = '2024-04-07'
    plt.axvline(pd.to_datetime(intervention_date), color='k', linestyle='--', linewidth=1)
    plt.text(pd.to_datetime(intervention_date), plt.ylim()[1]*0.95, 'Intervention', color='k', fontsize=18, ha='right')

    plt.tight_layout()
    plt.show()
synth_plot(df, 'GB')
Now that we have simulated an intervention, we can explore how well the synthetic control method will work.
Preprocessing
All European countries except GB are set up as control units (features). The treatment unit (target) is sales in GB with the intervention applied.
# Delete the original target column so we don't use it as a feature by accident
del df['GB']

# Set features & target
x = df.columns[1:50]
y = 'y'
Regression
Below I have set up a function that we can reuse with different pre-intervention periods and different regression models (e.g. Ridge, Lasso):
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

def train_reg(df, start_index, reg_class):
    df_temp = df.iloc[start_index:].copy().reset_index()

    X_pre = df_temp[df_temp['intervention'] == 0][x]
    y_pre = df_temp[df_temp['intervention'] == 0][y]
    X_train, X_test, y_train, y_test = train_test_split(X_pre, y_pre, test_size=0.10, random_state=42)

    model = reg_class
    model.fit(X_train, y_train)

    yhat_train = model.predict(X_train)
    yhat_test = model.predict(X_test)

    mse_train = mean_squared_error(y_train, yhat_train)
    mse_test = mean_squared_error(y_test, yhat_test)
    print(f"Mean Squared Error train: {round(mse_train, 2)}")
    print(f"Mean Squared Error test: {round(mse_test, 2)}")

    r2_train = r2_score(y_train, yhat_train)
    r2_test = r2_score(y_test, yhat_test)
    print(f"R2 train: {round(r2_train, 2)}")
    print(f"R2 test: {round(r2_test, 2)}")

    df_temp['pred'] = model.predict(df_temp.loc[:, x])
    df_temp['delta'] = df_temp['y'] - df_temp['pred']

    pred_lift = df_temp[df_temp['intervention'] == 1]['delta'].sum()
    actual_lift = df_temp[df_temp['intervention'] == 1]['uplift_abs'].sum()
    abs_error_perc = abs(pred_lift - actual_lift) / actual_lift
    print(f"Predicted lift: {round(pred_lift, 2)}")
    print(f"Actual lift: {round(actual_lift, 2)}")
    print(f"Absolute error percentage: {round(abs_error_perc, 2)}")

    return df_temp, abs_error_perc
To start, we keep things simple and use linear regression to estimate the causal impact, using a small period prior to the intervention:
df_lin_reg_100, pred_lift_lin_reg_100 = train_reg(df, 100, LinearRegression())
Looking at the results, linear regression doesn't do very well. But this is not surprising given the ratio of observations to features.
synth_plot(df_lin_reg_100, 'pred')
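As a quick sanity check (a small sketch reusing df and x from above, with the same start_index of 100), we can compare the number of pre-intervention observations in this window with the number of control-unit features:
# Observation-to-feature ratio behind the overfitting (reuses df and x from above)
df_small = df.iloc[100:]
n_pre_obs = (df_small['intervention'] == 0).sum()
print(f"Pre-intervention observations: {n_pre_obs}, control-unit features: {len(x)}")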
Synthetic control method
Let's jump in and see how it compares to the synthetic control method. Below I have set up a similar function to the previous one, but this time applying the synthetic control method using SciPy:
from scipy.optimize import minimize

def synthetic_control(weights, control_units, treated_unit):
    # Distance between the treated unit and the weighted combination of control units
    synthetic = np.dot(control_units.values, weights)
    return np.sqrt(np.sum((treated_unit - synthetic)**2))

def train_synth(df, start_index):
    df_temp = df.iloc[start_index:].copy().reset_index()

    X_pre = df_temp[df_temp['intervention'] == 0][x]
    y_pre = df_temp[df_temp['intervention'] == 0][y]
    X_train, X_test, y_train, y_test = train_test_split(X_pre, y_pre, test_size=0.10, random_state=42)

    # Start from equal weights, constrain them to sum to 1 and lie between 0 and 1
    initial_weights = np.ones(len(x)) / len(x)
    constraints = ({'type': 'eq', 'fun': lambda w: np.sum(w) - 1})
    bounds = [(0, 1) for _ in range(len(x))]

    result = minimize(synthetic_control,
                      initial_weights,
                      args=(X_train, y_train),
                      method='SLSQP',
                      bounds=bounds,
                      constraints=constraints,
                      options={'disp': False, 'maxiter': 1000, 'ftol': 1e-9},
                      )
    optimal_weights = result.x

    yhat_train = np.dot(X_train.values, optimal_weights)
    yhat_test = np.dot(X_test.values, optimal_weights)

    mse_train = mean_squared_error(y_train, yhat_train)
    mse_test = mean_squared_error(y_test, yhat_test)
    print(f"Mean Squared Error train: {round(mse_train, 2)}")
    print(f"Mean Squared Error test: {round(mse_test, 2)}")

    r2_train = r2_score(y_train, yhat_train)
    r2_test = r2_score(y_test, yhat_test)
    print(f"R2 train: {round(r2_train, 2)}")
    print(f"R2 test: {round(r2_test, 2)}")

    df_temp['pred'] = np.dot(df_temp.loc[:, x].values, optimal_weights)
    df_temp['delta'] = df_temp['y'] - df_temp['pred']

    pred_lift = df_temp[df_temp['intervention'] == 1]['delta'].sum()
    actual_lift = df_temp[df_temp['intervention'] == 1]['uplift_abs'].sum()
    abs_error_perc = abs(pred_lift - actual_lift) / actual_lift
    print(f"Predicted lift: {round(pred_lift, 2)}")
    print(f"Actual lift: {round(actual_lift, 2)}")
    print(f"Absolute error percentage: {round(abs_error_perc, 2)}")

    return df_temp, abs_error_perc
I keep the pre-intervention period the same to create a fair comparison with linear regression:
df_synth_100, pred_lift_synth_100 = train_synth(df, 100)
Wow! I'll be the first to admit that I wasn't expecting such a significant improvement!
synth_plot(df_synth_100, 'pred')
Results comparison
Let's not get carried away yet. Below we run some more experiments, exploring different model types and pre-intervention periods:
# Run regression experiments
df_lin_reg_00, pred_lift_lin_reg_00 = train_reg(df, 0, LinearRegression())
df_lin_reg_100, pred_lift_lin_reg_100 = train_reg(df, 100, LinearRegression())
df_ridge_00, pred_lift_ridge_00 = train_reg(df, 0, RidgeCV())
df_ridge_100, pred_lift_ridge_100 = train_reg(df, 100, RidgeCV())
df_lasso_00, pred_lift_lasso_00 = train_reg(df, 0, LassoCV())
df_lasso_100, pred_lift_lasso_100 = train_reg(df, 100, LassoCV())

# Run synthetic control experiments
df_synth_00, pred_lift_synth_00 = train_synth(df, 0)
df_synth_100, pred_lift_synth_100 = train_synth(df, 100)

# Collect the absolute error percentages for each experiment
experiment_data = {
    "Method": ["Linear", "Linear", "Ridge", "Ridge", "Lasso", "Lasso", "Synthetic Control", "Synthetic Control"],
    "Data Size": ["Large", "Small", "Large", "Small", "Large", "Small", "Large", "Small"],
    "Value": [pred_lift_lin_reg_00, pred_lift_lin_reg_100, pred_lift_ridge_00, pred_lift_ridge_100,
              pred_lift_lasso_00, pred_lift_lasso_100, pred_lift_synth_00, pred_lift_synth_100],
}

df_experiments = pd.DataFrame(experiment_data)
We will use the following code to display the results:
# Set the style
sns.set_style("whitegrid")

# Create the bar plot
plt.figure(figsize=(10, 6))
bar_plot = sns.barplot(x="Method", y="Value", hue="Data Size", data=df_experiments, palette="muted")
# Add labels and title
plt.xlabel("Method")
plt.ylabel("Absolute error percentage")
plt.title("Synthetic Controls - Comparison of Methods Across Different Data Sizes")
plt.legend(title="Data Size")
# Show the plot
plt.show()
The results for the small data set are really interesting! As expected, regularization helped improve causal impact estimates. Synthetic control went one step further!
The results from the large data set suggest that longer pre-intervention periods are not always better.
However, what I want you to remember is how valuable it is to conduct a pre-intervention simulation. There are so many avenues you could explore with your own data set!
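As one example of such an avenue, here is a small sketch (reusing the train_synth function and df from above; the candidate start indices are arbitrary) that sweeps several pre-intervention windows and compares the resulting errors:
# Sweep a few candidate pre-intervention start points and compare the absolute
# error percentage each one produces (reuses train_synth and df from above)
candidate_starts = [0, 25, 50, 75, 100]   # arbitrary cut-off rows for illustration
sweep_results = []
for start in candidate_starts:
    _, err = train_synth(df, start)
    sweep_results.append({"start_index": start, "abs_error_perc": err})

pd.DataFrame(sweep_results)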
Today we explored the synthetic control method and how its estimated causal impact can be validated. I leave you with some final thoughts:
- The simplicity of the synthetic control method makes it one of the most widely used techniques in the Causal AI toolbox.
- Unfortunately, it is also one of the most misused: think running the R package CausalImpact and tweaking the pre-intervention period until we see an uplift we like.
- This is where I highly recommend running pre-intervention simulations to agree on the test design in advance.
- The synthetic control method is a highly researched area. It is worth consulting the proposed adaptations Augmented SC, Robust SC, and Penalized SC.
Alberto Abadie, Alexis Diamond, and Jens Hainmueller (2010) Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program, Journal of the American Statistical Association, 105:490, 493–505, DOI: 10.1198/jasa.2009.ap08746