Experiments typically compare the frequency of an event (or some other sum metric) after exposure (treatment) or non-exposure (control) to some intervention. For example, we might compare the number of purchases, the minutes spent viewing content, or the number of clicks on a call to action.
While this setup may seem simple, standard, and common, it is only “common.” It is a thorny analysis problem unless we limit the post-exposure time period over which we calculate the metric.
In general, for metrics that simply sum a quantity over the whole post-exposure period (“unlimited metrics”), the following statements are NOT true:
- If I run the experiment longer, I will eventually reach significance if the experiment has any effect.
- The average treatment effect is well defined.
- When calculating the sample size, I can use standard sample size calculations to determine the duration of the experiment.
To see why, suppose we have a metric Y that is the cumulative sum of x, a metric defined over a single unit of time. For example, x could be the number of minutes watched today and Y would be the total minutes watched over the last t days. Assume discrete time:
$$Y(i,t) = \sum_{s=1}^{t} x(i,s)$$
where Y is the experiment metric described above (a count of events), t is the current time of the experiment, and i indexes the individual unit.
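As a small illustration, the cumulative metric Y can be computed as a running sum of the per-period metric x. This is only a sketch: the function name `cumulative_metric` and the sample numbers are hypothetical and not from the original analysis.

```python
import numpy as np

def cumulative_metric(x):
    """Return Y where Y[i, t-1] is the sum of x[i, 0..t-1].

    Rows are units (i); columns are time periods (s = 1..t).
    """
    x = np.asarray(x, dtype=float)
    return np.cumsum(x, axis=1)

# Hypothetical data: minutes watched per day for 2 units over 4 days.
x = np.array([[10, 0, 5, 20],
              [3, 3, 3, 3]])
y = cumulative_metric(x)
print(y[0])  # running totals for unit 0: 10, 10, 15, 35
print(y[1])  # running totals for unit 1: 3, 6, 9, 12
```

The last column of `y` is the value an unlimited metric would report at the current time t.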
Suppose that traffic arrives at our experiment at a constant rate r:
$$n = rt$$
where n is the number of units in the experiment so far and t is the number of time periods our experiment has been active.
Suppose that each x(i,s) is independent and has identical variance σ² (for simplicity; the same problem appears to a greater or lesser extent depending on autocorrelation, etc.) but not necessarily a constant mean. Then:
$$\mathrm{Var}\big(Y(i,t)\big) = \sum_{s=1}^{t} \mathrm{Var}\big(x(i,s)\big) = t\sigma^2$$
We are beginning to see the problem. The variance of our metric is not constant over time. In fact, it keeps growing for as long as the experiment runs.
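The growth in variance can be checked with a quick simulation. The sketch below assumes normally distributed per-period values with a common standard deviation `sigma` and a drifting mean; the distribution, the parameter values, and the function name are illustrative assumptions, not part of the original setup.

```python
import numpy as np

def simulate_cumulative_variance(n_units=20000, n_periods=20, sigma=2.0, seed=0):
    """Empirical variance of the cumulative metric Y at each period t.

    Assumption: x(i, s) is drawn independently from a normal distribution
    with standard deviation `sigma` and a mean that changes by period.
    """
    rng = np.random.default_rng(seed)
    # The per-period mean drifts from 1 to 3; this does not affect the variance.
    means = np.linspace(1.0, 3.0, n_periods)
    x = rng.normal(loc=means, scale=sigma, size=(n_units, n_periods))
    y = np.cumsum(x, axis=1)       # Y(i, t) for t = 1..n_periods
    return y.var(axis=0, ddof=1)   # sample variance across units

variances = simulate_cumulative_variance()
theory = 2.0 ** 2 * np.arange(1, 21)  # t * sigma^2
for t in (1, 5, 10, 20):
    print(t, round(variances[t - 1], 2), theory[t - 1])
```

Under these assumptions the empirical variance tracks t times sigma squared closely, even though the mean of x changes from period to period.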
In a typical experiment, we construct a t-test for the null hypothesis that the treatment effect is 0 and look for evidence against that null hypothesis. If we find it, we say that the experiment shows a statistically significant gain or loss.
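So what does the t-stat look like in this case, say for the hypothesis that the mean of Y is zero? Using Var(Y(i,t)) = tσ², where σ² is the common per-period variance of x(i,s), and writing \bar{Y}_t for the sample mean of Y across the n units at time t:
$$\text{t-stat} = \frac{\bar{Y}_t}{\sqrt{\mathrm{Var}\big(Y(i,t)\big)/n}} = \frac{\sqrt{n}\,\bar{Y}_t}{\sigma\sqrt{t}}$$
Plugging in n = rt, we can write the expression in terms of t:
$$\text{t-stat} = \frac{\sqrt{rt}\,\bar{Y}_t}{\sigma\sqrt{t}} = \frac{\sqrt{r}}{\sigma}\,\bar{Y}_t$$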
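To make the cancellation of t concrete, the sketch below computes the statistic two ways: from its definition, and from the simplified form with n = rt plugged in. It rests on several assumptions that are mine rather than the original's: the per-period variance `sigma**2` is treated as known (so this is strictly a z-style statistic), every unit is observed for all t periods, and the values of `r`, `sigma`, `mu`, and the function names are hypothetical.

```python
import numpy as np

def t_stat_known_variance(y, t, sigma):
    """Statistic for H0: E[Y] = 0, using Var(Y) = t * sigma**2."""
    y = np.asarray(y, dtype=float)
    n = y.size
    return y.mean() / np.sqrt(t * sigma ** 2 / n)

def simulate(r=500, sigma=2.0, mu=0.1, durations=(5, 10, 20, 40), seed=0):
    """Return (t, n, definition-based stat, simplified stat) per duration."""
    rng = np.random.default_rng(seed)
    rows = []
    for t in durations:
        n = r * t                                  # constant arrival rate
        x = rng.normal(loc=mu, scale=sigma, size=(n, t))
        y = x.sum(axis=1)                          # cumulative metric at time t
        stat = t_stat_known_variance(y, t, sigma)
        simplified = np.sqrt(r) * y.mean() / sigma  # n = r * t plugged in
        rows.append((t, n, stat, simplified))
    return rows

rows = simulate()
for t, n, stat, simplified in rows:
    print(t, n, round(stat, 3), round(simplified, 3))
```

The two columns agree at every duration, which shows that the factor multiplying the sample mean is sqrt(r) / sigma regardless of t or n.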
As with any hypothesis test, when the null hypothesis is not true we want the test statistic to grow as the sample size increases, so that we reject the null hypothesis in favor of the alternative. An implication of this requirement is that, under the alternative, the mean of the t-statistic should diverge to infinity. But…
The mean of the t-statistic at time t is just the mean of the metric up to time t, times a constant that does not vary with the sample size or the duration of the experiment. Therefore, the only way it can diverge to infinity is if E(Y