It has become something of a meme that statistical significance is a bad standard. Several blog posts have circulated recently arguing that statistical significance is a "cult" or "arbitrary." If you want a classic polemic (and who doesn't?), check out: https://www.deirdremccloskey.com/docs/jsm.pdf.
This short essay is a defense of the so-called Cult of Statistical Significance.
Statistical significance is a pretty good idea, and I have yet to see anything fundamentally better or practical enough to use in industry.
I am not going to argue that statistical significance is the perfect way of making decisions, but it is good.
A common point made by those who would besmirch the Cult is that statistical significance is not the same as commercial importance. They are right, but that is not an argument against using statistical significance to make decisions.
Statistical significance says, for example, that if the estimated impact of some change is 1% with a standard error of 0.25%, it is statistically significant (at the 5% level), while if the estimated impact of another change is 10% with a standard error of 6%, it is statistically insignificant (at the 5% level).
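As a quick sanity check on those numbers, here is a minimal sketch, assuming a normal approximation and two-sided tests (the example above doesn't spell out either assumption):

```python
# Hedged sketch: z-statistics and two-sided p-values for the two examples,
# assuming normally distributed estimates.
from scipy.stats import norm

for effect, se in [(0.01, 0.0025), (0.10, 0.06)]:
    z = effect / se
    p = 2 * norm.sf(abs(z))  # two-sided p-value
    print(f"effect={effect:.0%}, se={se:.2%}: z={z:.2f}, p={p:.3f}, "
          f"significant at 5%? {p < 0.05}")

# effect=1%, se=0.25%: z=4.00, p=0.000, significant at 5%? True
# effect=10%, se=6.00%: z=1.67, p=0.096, significant at 5%? False
```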
The argument is that the 10% impact is more meaningful to the business, even if it is less precise.
Well, let's look at this from the perspective of decision making. There are two cases.
Case 1: The two initiatives are separable.
If the two initiatives are separable, we should still roll out the initiative with the 1% effect and 0.25% standard error, right? It is a positive effect, so statistical significance does not lead us astray: we release the change with the positive, statistically significant result.
Okay, let's move on to the experiment with the larger effect size.
Suppose the effect size was +10% with a standard error of 20%, that is, the 95% confidence interval was approximately (-30%, +50%). In this case, we don't really think there's any evidence that the effect is positive, right? Despite the larger effect size, the standard error is too large to draw a meaningful conclusion.
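To check that interval (again assuming a normal approximation):

```python
# 95% confidence interval implied by a +10% estimate with a 20% standard error.
from scipy.stats import norm

effect, se = 0.10, 0.20
z_crit = norm.ppf(0.975)  # ≈ 1.96
print(f"95% CI: ({effect - z_crit * se:.1%}, {effect + z_crit * se:.1%})")
# 95% CI: (-29.2%, 49.2%)
```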
The problem is not statistical significance. The problem is believing that a 6% standard error is small enough, in this case, to release the new feature on that evidence. This example shows no problem with statistical significance as a framework; it shows that we are less worried about Type I error than an alpha of 5% implies.
Alright! The Cult accepts other alphas, as long as they are chosen before the experiment. Just use a larger alpha: the 10% estimate with a 6% standard error, for example, is statistically significant at alpha = 10%.
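Here is the same arithmetic, again under a normal approximation, showing that the estimate clears the bar at alpha = 10% but not at alpha = 5%:

```python
# Same estimate checked against two pre-registered alphas (two-sided test assumed).
from scipy.stats import norm

z = 0.10 / 0.06  # ≈ 1.67
for alpha in (0.05, 0.10):
    crit = norm.ppf(1 - alpha / 2)  # 1.96 at 5%, 1.64 at 10%
    print(f"alpha={alpha:.0%}: |z|={z:.2f} vs critical {crit:.2f} "
          f"-> significant? {abs(z) > crit}")
```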
The point is that there is a noise level we would find unacceptable. There is a level of noise where even if the estimated effect were +20%, we would say, "We don't really know what it is."
So, we have to say how much noise is too much.
Statistical inference, like art and morality, requires that we draw the line somewhere.
Case 2: The initiatives are alternatives.
Now let's assume that the two initiatives are alternatives. If we do one, we cannot do the other. Which one should we choose?
In this case, the problem with the above setup is that we are testing the wrong hypothesis. We do not want to simply compare these initiatives with control. We also want to compare them with each other.
But this is not a problem of statistical significance either. It's a problem with the hypothesis we're testing.
We want to test whether the 9-point difference in effect sizes is statistically significant, using an alpha chosen for the same reasons as above. There is a noise level at which a 9-point gap is simply indistinguishable from chance, and we have to establish that level.
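As a sketch of that comparison, assuming the two experiments are independent (so the variances add) and reusing the numbers from above:

```python
# Testing the 10% estimate against the 1% estimate directly,
# assuming the two experiments are independent.
from math import sqrt
from scipy.stats import norm

eff_a, se_a = 0.10, 0.06      # the big, noisy initiative
eff_b, se_b = 0.01, 0.0025    # the small, precise initiative
diff = eff_a - eff_b                 # 9 percentage points
se_diff = sqrt(se_a**2 + se_b**2)    # ≈ 6.0%
z = diff / se_diff                   # ≈ 1.50
p = 2 * norm.sf(abs(z))              # ≈ 0.13
print(f"difference = {diff:.1%} ± {se_diff:.1%}, z = {z:.2f}, p = {p:.2f}")
```

At alpha = 5%, the 9-point gap is not distinguishable from noise; whether that settles the choice depends on the alpha we agreed to before running the test.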
Once again, we have to draw the line somewhere.
Now, let's address some other common objections and then I'll give you a sign-up sheet to join the Cult.
The first objection is that statistical significance is "arbitrary." This objection is common, but it makes no sense.
Our attitudes toward risk and ambiguity (in the sense of statistical decision theory) are "arbitrary" in the sense that we choose them. But there is no way around that: preferences are a given in any decision-making problem.
Statistical significance is no more "arbitrary" than any other decision rule, and it has the nice intuition of balancing the amount of noise we will tolerate against the size of the effect. It has a single scalar parameter we can adjust to trade off Type I error against Type II error. It's lovely.
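To make that scalar knob concrete, here is a rough illustration using the 10% estimate with a 6% standard error from earlier (normal approximation assumed): raising alpha buys power, i.e., lowers Type II error, at the price of more Type I error.

```python
# Power of a two-sided z-test at various alphas, assuming the true effect
# is 10% and the estimate has a 6% standard error (normal approximation).
from scipy.stats import norm

effect, se = 0.10, 0.06
for alpha in (0.01, 0.05, 0.10, 0.20):
    crit = norm.ppf(1 - alpha / 2)
    power = norm.sf(crit - effect / se) + norm.cdf(-crit - effect / se)
    print(f"alpha={alpha:.0%}: power ≈ {power:.0%}")
```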
Sometimes people argue that we should use Bayesian inference to make decisions because it is easier to interpret.
I will start by admitting that, in its idealized setting, Bayesian inference has nice properties. We can treat the posterior as genuine "beliefs" and make decisions based on, for example, the probability that the effect is positive, which is not possible with frequentist statistical significance.
In practice, Bayesian inference is a different animal.
Bayesian inference only gets those nice "belief"-like properties if the prior reflects the decision maker's actual prior beliefs. That is extremely difficult to achieve in practice.
If you think choosing an alpha that caps how much noise you'll accept is complicated, imagine having to choose a prior density that correctly captures your beliefs (or the decision maker's)… before every experiment! That is a very hard problem.
As a result, the priors used in practice are usually chosen because they are "convenient", "uninformative", and so on. They have little to do with anyone's actual prior beliefs.
When the prior does not encode our actual beliefs, the posterior distribution is just a reweighting of the likelihood function. Looking at the quantiles of this so-called posterior and claiming that the parameter has a 10% chance of being less than zero is statistical nonsense.
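Here is a minimal normal-normal sketch of that point, using the 10% estimate with a 6% standard error from earlier: with a flat prior, the "probability the effect is negative" is just the one-sided p-value dressed up in Bayesian clothes, while a genuinely informative prior moves it a lot.

```python
# Posterior P(effect < 0) under two priors, for a normal likelihood with
# estimate 10% and standard error 6%.
from scipy.stats import norm

est, se = 0.10, 0.06

# Flat prior: posterior is N(est, se^2), so P(effect < 0) is just the
# one-sided p-value in disguise.
print(f"flat prior:   P(effect < 0) = {norm.cdf(0, est, se):.3f}")  # ≈ 0.048

# Sceptical prior N(0, 0.02^2): standard normal-normal conjugate update.
prior_mean, prior_sd = 0.0, 0.02
post_var = 1 / (1 / prior_sd**2 + 1 / se**2)
post_mean = post_var * (prior_mean / prior_sd**2 + est / se**2)
print(f"tight prior:  P(effect < 0) = {norm.cdf(0, post_mean, post_var**0.5):.3f}")  # ≈ 0.30
```

The number that gets reported as a "belief" depends entirely on a prior that, in practice, nobody actually holds.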
So, if anything, it is easier to misinterpret what we are doing in the Bayesian realm than in the frequentist realm. Statisticians find it difficult to translate their prior beliefs into a distribution. How much more difficult is it for the decision maker on the project?
For these reasons, Bayesian inference does not scale well, which is why, I believe, experimentation platforms across the industry generally do not use it.
The arguments against the “cult” of statistical significance are, of course, a response to a real problem. There is a dangerous cult within our Church.
The Church of Statistical Significance is quite accepting. We allow alphas other than 5%. We test hypotheses against nulls other than zero, and so on.
But sometimes our good name is tarnished by a radical element within the Church that treats anything that fails to clear the 5% bar against a null hypothesis of zero as "not real."
These heretics practice a cargo-cult version of statistical analysis, in which the significance procedure (at the 5% level) determines what is true, rather than simply being a useful way to make decisions and weigh uncertainty.
Of course, we reject any association with this dangerous sect.
Let me know if you would like to join the Church. I'll sign you up for the monthly potluck.
Thanks for reading!
zach
Connect at: https://linkedin.com/in/zlflynn