What interactions do, why they are like any other change in the environment after the experiment, and some peace of mind
Experiments do not run one at a time. At any given moment, hundreds or thousands of experiments are running on a mature website. The question naturally arises: what happens if these experiments interact with each other? Is that a problem? As with many interesting questions, the answer is "yes and no." Read on for a more precise, more actionable answer, one you can act on with confidence.
Definition: two experiments interact when the treatment effect of one experiment depends on which variant of the other experiment the unit is assigned to.
For example, suppose we have one experiment testing a new search model and another testing a new recommendation model that drives a "people also bought" module. Both experiments are about helping customers find what they want to buy. Units assigned to the better recommendation algorithm may show a smaller treatment effect in the search experiment because they are less likely to be influenced by the search algorithm: they already made their purchase because of the better recommendation.
Some empirical evidence suggests that typical interaction effects are small. You may not find this particularly comforting. I'm not sure I do either. After all, the size of the interaction effects depends on the experiments we run. In your particular organization, experiments may interact more or less. Interaction effects may well be larger in your context than in the companies typically profiled in this kind of analysis.
So this blog post is not an empirical argument. It's a theoretical one. That means it includes math. Here it goes. We will try to understand the problem with interactions using an explicit model, without reference to any particular company's data. Even if the interaction effects are relatively large, we will find that they rarely matter for decision making. The interaction effects would have to be massive, and follow a peculiar pattern, to change which variant wins an experiment. The goal of this post is to bring you peace of mind.
Suppose we have two A/B experiments. Let Z = 1 indicate treatment in the first experiment and W = 1 indicate treatment in the second. Let Y be the metric of interest.
The treatment effect in Experiment 1 is:

TE = E[Y | Z = 1] − E[Y | Z = 0]
We decompose these terms to see how the interaction affects the treatment effect:

E[Y | Z = z] = E[Y | Z = z, W = 1] Pr(W = 1 | Z = z) + E[Y | Z = z, W = 0] Pr(W = 0 | Z = z)
Assignment in one randomized experiment is independent of assignment in another randomized experiment, so:

Pr(W = w | Z = z) = Pr(W = w)
So the treatment effect is:

TE = {E[Y | Z = 1, W = 1] − E[Y | Z = 0, W = 1]} Pr(W = 1) + {E[Y | Z = 1, W = 0] − E[Y | Z = 0, W = 0]} Pr(W = 0)
Or, more succinctly, the treatment effect is the weighted average of the treatment effects within the populations W = 1 and W = 0:

TE = TE(W = 1) Pr(W = 1) + TE(W = 0) Pr(W = 0)

where TE(W = w) = E[Y | Z = 1, W = w] − E[Y | Z = 0, W = w].
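To see the decomposition hold numerically, here is a minimal simulation sketch in Python. Everything in it is made up for illustration: the outcome model, the effect sizes ($2 lift when W = 0, $1 when W = 1), and the variable names are assumptions, not anyone's real data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Independent 50-50 assignments for the two experiments.
z = rng.integers(0, 2, size=n)  # Experiment 1 (e.g., search model)
w = rng.integers(0, 2, size=n)  # Experiment 2 (e.g., recommendation model)

# Hypothetical outcome model WITH an interaction: the Z-treatment
# lifts Y by $2 when W = 0 but only by $1 when W = 1.
y = 10 + 2 * z + 3 * w - 1 * z * w + rng.normal(0, 1, size=n)

te = y[z == 1].mean() - y[z == 0].mean()  # overall TE, ~1.5
te_w1 = y[(z == 1) & (w == 1)].mean() - y[(z == 0) & (w == 1)].mean()  # ~1.0
te_w0 = y[(z == 1) & (w == 0)].mean() - y[(z == 0) & (w == 0)].mean()  # ~2.0
pr_w1 = w.mean()  # ~0.5

# The overall TE matches the weighted average of the within-population TEs.
print(round(te, 3), round(te_w1 * pr_w1 + te_w0 * (1 - pr_w1), 3))
```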
One of the best things about simply writing down the math is that it makes our problem concrete. We can see exactly what form the interaction bias will take and what will determine its size.
The problem is this: after the second experiment ends, only W = 1 or W = 0 will be shipped. The environment during the first experiment will therefore not be the same as the environment after it. This introduces bias into the treatment effect, as follows.

Suppose W = w* is the variant that ships. Then the post-experiment treatment effect for the first experiment, TE(W = w*), does not equal the treatment effect measured during the experiment, TE, leading to bias:

Bias = TE − TE(W = w*) = [TE(W = 1 − w*) − TE(W = w*)] Pr(W = 1 − w*)

If the second experiment interacts with the first, then TE(W = 1 − w*) − TE(W = w*) ≠ 0, so there is bias.
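Here is a quick plug-in check of that bias formula, reusing the made-up effect sizes from the sketch above: TE(W = 1) = $1, TE(W = 0) = $2, a 50–50 second experiment, and w* = 1 shipping. All numbers are hypothetical.

```python
# Hypothetical within-population effects and allocation (not real data).
te_w1, te_w0 = 1.00, 2.00  # TE(W = 1), TE(W = 0), in $ per unit
pr_w1 = 0.50               # Pr(W = 1) during the experiment

te = te_w1 * pr_w1 + te_w0 * (1 - pr_w1)  # experiment-period TE = 1.50

# Suppose w* = 1 ships: the post-experiment effect is TE(W = 1) = 1.00.
bias = te - te_w1  # 0.50

# Same number from the formula [TE(W = 1 - w*) - TE(W = w*)] * Pr(W = 1 - w*):
print(bias, (te_w0 - te_w1) * (1 - pr_w1))  # 0.5 0.5
```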
So, yes: interactions cause bias. The bias is directly proportional to the size of the interaction effect.
But interactions are not special. Anything that differs between the experiment environment and the future environment and affects the treatment effect leads to bias of exactly this form. Does your product have seasonal demand? Was there a big supply shock? Did inflation spike? What about those butterflies in Korea? Did they flap their wings?
Online experiments are not laboratory experiments. We cannot control the environment. The economy is not under our control (unfortunately). We always face biases like this.
Therefore, online experiments are not about estimating treatment effects in perpetuity. They are about making decisions. Is A better than B? That answer is unlikely to flip because of an interaction effect, for the same reason we generally don't worry about it flipping because we ran the experiment in March rather than in some other month of the year.
For interactions to matter for decision making, we need, say, TE ≥ 0 (so we would ship B in the first experiment) and TE(W = w*) < 0 (so we should not have shipped it, given what happened in the second experiment).
TE ≥ 0 if and only if:

TE(W = 1 − w*) Pr(W = 1 − w*) ≥ −TE(W = w*) Pr(W = w*)

Taking the typical allocation Pr(W = w*) = 0.50, this means:

TE(W = 1 − w*) ≥ −TE(W = w*) > 0

because TE(W = w*) < 0. This makes sense: for interactions to be a decision-making problem, the interaction effect must be large enough that a treatment effect that is negative under one variant of the other experiment is positive under the other variant.
The interaction effect has to be extreme under typical 50–50 allocations. If the treatment effect is +$2 per unit under one variant of the other experiment, it must be below −$2 per unit under the other variant for interactions to affect decision making. To lead us to the wrong decision, we would have to be cursed with massive interaction effects that flip the sign of the treatment effect and preserve its magnitude!
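To make that concrete, here is a small sketch of the decision-flip condition at a 50–50 allocation. The wrong_call helper and the dollar amounts are hypothetical, chosen to mirror the ±$2 example above.

```python
def wrong_call(te_shipped: float, te_other: float, pr_other: float = 0.5) -> bool:
    """True if Experiment 1 looks positive overall (so we ship) even though
    its effect is negative in the environment that actually ships."""
    te = te_shipped * (1 - pr_other) + te_other * pr_other
    return te >= 0 and te_shipped < 0

# -$2 under the shipped variant needs more than +$2 under the other variant:
print(wrong_call(te_shipped=-2.00, te_other=+1.00))  # False: sign flips, magnitude doesn't
print(wrong_call(te_shipped=-2.00, te_other=+2.50))  # True: sign flips AND magnitude survives
```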
That's why we don't worry about interactions, or about all those other factors (seasonality, etc.) that we can't hold fixed during and after the experiment. The change in environment would have to radically alter how users experience the feature. It probably won't.
It's always a good sign when your last take includes “probably.”