Have you ever seen those animated Nike ads while tuning into a podcast recapping last night's epic NBA showdown? Or how about when you stumble upon a New Balance sneaker review show on YouTube? That's the magic of contextual targeting: the matchmaking master that connects content and ads based on the vibe of the moment! Say goodbye to jarring ads and hello to personalized ad experiences that might just make you dance. Picture this: would you rather hear Nike ads on a basketball podcast, or have them spice things up on a politics podcast?
As tech giants increase their investment in protecting user privacy, the old-school behavioral approach (you know, the one that relies on users' IP addresses and devices) finds itself on the back foot. It's a complicated situation: with fewer cookies and fewer identifiable IP addresses lurking around, it's like the Wild West for traditional targeting!
Let's breathe more life into the contextual product measurement game; it's usually all about the advertisers. We're talking about the typical success metrics: advertiser adoption, retention, referrals, and, of course, ad revenue. But here's where the plot thickens: my hypothesis is that serving more relevant ads turns the ad experience into a joyride for users, too. Picture this: fewer context switches during ad breaks means users can keep enjoying contextually similar content without missing a beat.
However, it's not easy to run an A/B test to see how users react to contextual targeting. Why? When advertisers purchase contextual targeting for their ads, they don't use it in isolation: they combine it with all the other targeting options in the same campaign, which means we can't randomly assign contextual targeting as the treatment. Therefore, we can't randomize users into two groups.
Enter the superhero of alternatives: Causal Inference! When A/B testing isn't possible because users can't be shuffled like a deck of cards, we turn to historical data with causal inference!
In this blog post, I'll go over how to evaluate an ad targeting product using causal inference. So buckle up if you:
- Work in a domain where A/B testing isn't an option, whether it's unethical, expensive, or downright impossible.
- Navigate the exciting waters of the advertising/social domain, where the focus is on how well an ad is tailored to a specific user and the content they're consuming.
It's important to start a causal inference study by establishing the hypothesis and metrics!
Hypothesis: We believe users are more engaged when they hear an ad served through contextual targeting, and we plan to measure this through ad completion rate (the higher the better) and off-focus skipping (the lower the better).
Metrics: We start with ad completion rate, a standard metric in the advertising space. However, this metric turns out to be noisy, so we ultimately choose off-focus skip as our metric.
Our experimental unit: users over a 90-day window (filtered to users who received both treatment ads and control ads). It's worth mentioning that we also ran the analysis at the impression level; we did both.
Population: We collected 90 days of user windows/impressions.
We will use Propensity Score Matching (PSM) in this research, since we have two groups of samples and just need to synthesize some randomization. You can read more about PSM here; my summary of PSM is: have our samples find pairs between the control and treatment groups, then measure the average delta within each pair so that any difference we find can be attributed to the treatment. So let's start preparing the ingredients for our PSM model!
There are many things that could affect users' ad experience, and they fall into three categories:
- User attributes (e.g. age/sex/LHR)
- Advertiser attributes (e.g. the company's previous ad spend)
- Publisher attributes (e.g. the publisher's past ad revenue/content metadata)
We believe that controlling for the above isolates the treatment effect between contextually targeted ads and non-contextually targeted ads. Below is a sample data frame to help you picture what the data might look like.
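Here's a minimal, hypothetical sketch of such a data frame in pandas; the column names and values are illustrative stand-ins for the attributes above, not the actual features used:

```python
import pandas as pd

# Hypothetical, illustrative data only -- column names are stand-ins for the
# user / advertiser / publisher attributes and outcomes described in this post.
df = pd.DataFrame({
    "user_id":                [1, 2, 3, 4],
    "age":                    [24, 31, 45, 29],          # user attribute
    "listening_hours":        [10.5, 3.2, 7.8, 1.1],     # user attribute
    "advertiser_past_spend":  [5000, 1200, 5000, 300],   # advertiser attribute
    "publisher_past_revenue": [800, 150, 420, 90],       # publisher attribute
    "treatment":              [1, 0, 1, 0],              # 1 = heard a contextually targeted ad
    "ad_completed":           [1, 0, 1, 1],              # outcome: higher is better
    "off_focus_skip":         [0, 1, 0, 0],              # outcome: lower is better
})
print(df)
```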
Using logistic regression, for example, we regress treatment status (exposure) on the observed characteristics (covariates) to obtain a predicted probability of how likely each user is to be in the treatment group. This number is how we relate each pair between treatment and control. Note that you can also use another classifier of your choice! In the end, what you need is to use your classifier to score your users, so that we can match them accordingly in the next steps.
Y = Treatment (0, 1)
X = User Attributes + Advertiser Attributes + Publisher Attributes
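As a minimal sketch, assuming the hypothetical data frame from above, fitting this with scikit-learn might look like:

```python
from sklearn.linear_model import LogisticRegression

# Y = treatment (0/1), X = user + advertiser + publisher attributes.
covariates = ["age", "listening_hours", "advertiser_past_spend", "publisher_past_revenue"]

model = LogisticRegression(max_iter=1000)
model.fit(df[covariates], df["treatment"])

# Propensity score = predicted probability of being in the treatment group.
df["propensity_score"] = model.predict_proba(df[covariates])[:, 1]
```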
If we plot the propensity score distributions for the two groups, we will see two overlapping distributions, as shown in the drawing below. The distributions will probably differ between the two groups, and that is to be expected! To compare apples to apples, we want to focus on the overlapping, "matching" region.
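To eyeball that overlap yourself, a quick matplotlib sketch over the scores from the previous step might look like this:

```python
import matplotlib.pyplot as plt

# Overlay the two propensity score distributions to inspect the overlap region.
plt.hist(df.loc[df["treatment"] == 1, "propensity_score"], bins=30, alpha=0.5, label="treatment")
plt.hist(df.loc[df["treatment"] == 0, "propensity_score"], bins=30, alpha=0.5, label="control")
plt.xlabel("Propensity score")
plt.ylabel("Count")
plt.legend()
plt.show()
```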
Once we've assigned each user a propensity score, we pair-match between the treatment and control groups. In the example here, you start to see how pairs are formed. Our sample size will also start to change, since some samples may not find a match. (P.S. use the psmpy package if you are in a Python environment.)
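If you want to see what the matching does under the hood, here is a simplified, hand-rolled sketch of greedy 1:1 nearest-neighbor matching on the propensity score (psmpy wraps this kind of logic for you; the `caliper` value is an arbitrary assumption):

```python
from sklearn.neighbors import NearestNeighbors

treated = df[df["treatment"] == 1]
control = df[df["treatment"] == 0]

# For each treated user, find the control user with the closest propensity score.
nn = NearestNeighbors(n_neighbors=1).fit(control[["propensity_score"]])
dist, idx = nn.kneighbors(treated[["propensity_score"]])

caliper = 0.05   # maximum allowed score gap; drop pairs that are too far apart
pairs, used = [], set()
for t_pos, (d, c_pos) in enumerate(zip(dist[:, 0], idx[:, 0])):
    # Simplification: if a treated user's nearest control is already taken, drop that user.
    if d <= caliper and c_pos not in used:
        used.add(c_pos)
        pairs.append((treated.index[t_pos], control.index[c_pos]))

print(f"Matched {len(pairs)} treatment/control pairs")
```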
Once we pair the two groups, their user attributes will start to look more similar than before. This is because users who couldn't be matched are removed from both groups.
Now that we have matched them based on PS, we can begin our measurement work! The main calculation is essentially the following:
MEAN(Treatment group Y var) - MEAN(Control group Y var) = Treatment effect
We will then have a treatment effect estimate that we can test for statistical and practical significance. By taking the average delta within each matched pair, we measure the effect of the treatment.
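Continuing the sketch above with the hypothetical `pairs` list, the average within-pair delta and a paired t-test might look like:

```python
from scipy import stats

# Outcome values for each side of every matched pair (off-focus skip: lower is better).
treated_y = df.loc[[t for t, _ in pairs], "off_focus_skip"].to_numpy()
control_y = df.loc[[c for _, c in pairs], "off_focus_skip"].to_numpy()

treatment_effect = (treated_y - control_y).mean()   # average within-pair delta
t_stat, p_value = stats.ttest_rel(treated_y, control_y)

print(f"Estimated treatment effect: {treatment_effect:.4f} (p = {p_value:.4f})")
```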
So, if everything is set up correctly so far, we will have a measured treatment effect between the two groups. But it's essential to remember that causal inference carries more risk if confounding variables, or other potential causes we didn't think of, are overlooked. So, to further validate our research, let's run an AA test!
An AA test is a test in which, instead of using the true treatment, we randomly assign a "fake" treatment to our data and run the causal inference again. Because the treatment is fake, we should not detect any treatment effect! Running an AA test acts like a good code review and also ensures that our process minimizes bias (when the true treatment effect is 0, we detect 0).
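A minimal sketch of that placebo check, assuming a hypothetical `estimate_treatment_effect` helper that wraps the scoring, matching, and delta steps shown earlier:

```python
import numpy as np

# AA / placebo check: shuffle the treatment labels so the "treatment" is fake,
# then rerun the same pipeline on the shuffled data.
rng = np.random.default_rng(42)
df_placebo = df.copy()
df_placebo["treatment"] = rng.permutation(df_placebo["treatment"].to_numpy())

# `estimate_treatment_effect` is a hypothetical wrapper around the steps above.
effect, p_value = estimate_treatment_effect(df_placebo)
# With a fake treatment, we expect an effect near 0 and a non-significant p-value.
print(f"Placebo effect: {effect:.4f} (p = {p_value:.4f})")
```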
Once we complete our AA test without detecting a treatment effect, we are ready to communicate the findings to engineering and product management! For my project, I ended up publishing my work and sharing it in a company-wide knowledge-sharing forum as the first causal inference work measuring Spotify podcast ad targeting.
This blog post walks through each step of using causal inference to evaluate an ad targeting product that is hard to experiment on due to limitations in randomization: from framing the causal question, assigning users a propensity score, and matching users, to calculating the treatment effect and sanity-checking the result. I hope you find this article helpful, and if you have any questions, let me know!
P.S. Due to confidentiality, I can't share the specific test results for Spotify's contextual targeting product, but you can still use this blog post to develop your own causal inference work!