For simplicity, we will examine Simpson's paradox by focusing on two cohorts, adult men and women.
Examining these data we can make three statements about three variables of interest:
- Gender is an independent variable (it does not “listen” to the other two)
- Treatment depends on gender (in this context the dose administered differs by gender; for some reason, women have been given a higher dose).
- The outcome depends on both gender and treatment.
Based on these statements, we can draw the causal graph as follows:
Note how each arrow helps communicate the statements above. Equally important, the lack of an arrow pointing toward gender conveys that this is an independent variable.
We also observe that Gender, which has arrows pointing to both Treatment and Outcome, is considered a common cause of the two.
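To make the graph concrete, here is a minimal sketch that encodes the three statements as a plain Python dictionary. The names `causal_graph`, `independent_variables`, and `shared_parents` are my own illustrative choices, and the common-cause check only looks at direct parents, which is enough for this three-variable example.

```python
# Hypothetical encoding of the causal graph: each variable maps to the
# list of variables it "listens" to (its parents).
causal_graph = {
    "Gender": [],
    "Treatment": ["Gender"],
    "Outcome": ["Gender", "Treatment"],
}


def independent_variables(graph):
    """Variables with no incoming arrows."""
    return sorted(node for node, parents in graph.items() if not parents)


def shared_parents(graph, first, second):
    """Direct parents that point into both variables (common causes)."""
    return sorted(set(graph[first]) & set(graph[second]))


print(independent_variables(causal_graph))                   # ['Gender']
print(shared_parents(causal_graph, "Treatment", "Outcome"))  # ['Gender']
```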
The essence of Simpson's paradox is that although the outcome is affected by changes in treatment, as expected, there is also a backdoor path through which information flows via Gender.
The solution to this paradox, as you may have guessed by now, is that the common cause, Gender, is a confounding variable that needs to be controlled.
Controlling for Gender, in terms of the causal graph, means eliminating the relationship between Gender and Treatment.
This can be done in two ways:
- Pre-data collection: setting up a randomized controlled trial (RCT) in which participants are given a dose regardless of their gender.
- Post-data collection: as in this fabricated scenario, the data have already been collected, so we need to work with what is known as observational data.
In both pre- and post-data collection, removing the dependence of treatment on gender (i.e., controlling for gender) can be accomplished by modifying the graph so that the arrow between them is removed, as follows:
Applying this “graph surgery” involves modifying the last two statements (for convenience I will write all three):
- Gender is an independent variable
- Treatment is an independent variable
- The outcome depends on gender and treatment (but with no backdoor path between treatment and outcome)
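The surgery described above can be sketched in a few lines of Python. This is an illustration under my own assumptions: the graph is stored as a dictionary of parents, and `remove_arrow` is a hypothetical helper, not part of any causal library.

```python
def remove_arrow(graph, source, target):
    """Return a copy of the graph without the arrow source -> target."""
    return {
        node: [p for p in parents if not (node == target and p == source)]
        for node, parents in graph.items()
    }


# Graph before surgery: each variable maps to its parents.
observed_graph = {
    "Gender": [],
    "Treatment": ["Gender"],
    "Outcome": ["Gender", "Treatment"],
}

# Graph after surgery: Treatment no longer listens to Gender.
surgery_graph = remove_arrow(observed_graph, "Gender", "Treatment")
print(surgery_graph)
# {'Gender': [], 'Treatment': [], 'Outcome': ['Gender', 'Treatment']}
```

After the surgery both Gender and Treatment have no parents, while Outcome still listens to both, which matches the three revised statements.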
This allows us to obtain the causal relationship of interest: we can evaluate the direct impact of the treatment modification on the outcome.
The process of controlling for a confounder, i.e., manipulating the data generating process, is formally known as applying an intervention. That is, we are no longer passive observers of the data, but rather we take an active role in modifying it to assess its causal impact.
How does this manifest itself in practice?
In the case of randomized controlled trials, the researcher must ensure that important confounding variables are controlled. Here we limit the analysis to gender (but in real-world situations other variables such as age, social status, and any other variables that may be relevant to a person's health can be imagined).
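To build intuition for what randomization buys, here is a small simulation sketch. Everything in it is assumed for illustration: the data generating process, the true effect of 1.0 per unit of dose, the gender offset of 8, the dose ranges, and the helper names `slope` and `simulate`. It compares the naive dose-outcome slope when the dose depends on gender against the slope when the dose is assigned at random.

```python
import random


def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def simulate(randomized, n=5000, seed=0):
    """Naive slope of outcome on dose under an assumed, made-up process."""
    rng = random.Random(seed)
    doses, outcomes = [], []
    for _ in range(n):
        is_woman = rng.random() < 0.5
        if randomized:
            # RCT: the dose ignores gender (no Gender -> Treatment arrow).
            dose = rng.uniform(1, 7)
        else:
            # Observational: women receive a higher dose on average.
            dose = (6 if is_woman else 2) + rng.uniform(-1, 1)
        # Assumed true effect of dose is +1.0; gender shifts the baseline.
        outcome = 10 + 1.0 * dose - (8 if is_woman else 0) + rng.gauss(0, 1)
        doses.append(dose)
        outcomes.append(outcome)
    return slope(doses, outcomes)


observational_slope = simulate(randomized=False)
rct_slope = simulate(randomized=True)
print(f"observational: {observational_slope:.2f}")  # negative: confounded
print(f"randomized:    {rct_slope:.2f}")            # close to the assumed 1.0
```

In this made-up setup the observational slope has the wrong sign, while the randomized slope recovers something close to the assumed effect, because randomization cuts the arrow from Gender to Treatment.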
RCTs are considered the gold standard for causal analysis in many experimental settings thanks to their control of confounding variables. That said, they have many drawbacks:
- Recruiting individuals can be expensive and logistically tricky
- The intervention under investigation may not be physically possible or ethical to carry out (for example, you cannot ask randomly selected people to smoke or not smoke for ten years)
- The artificial environment of a laboratory is not the true natural habitat of the population
On the other hand, observational data are much more widely available in industry and academia and are therefore much cheaper and may be more representative of individuals' actual habits. But, as illustrated in the Simpson diagram, there may be confounding variables that need to be controlled for.
This is where the ingenious solutions developed in the causal community over the past few decades are gaining ground. Detailing them is beyond the scope of this article, but I briefly mention at the end how to learn more.
To resolve Simpson's paradox with the given observational data, one:
- Calculates, for each cohort, the impact of a change in treatment on the outcome.
- Calculates a weighted average of these impacts, weighting each cohort by its share of the population.
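The two steps above can be sketched as follows. The numbers are made up for illustration and are not the data discussed in this post; `slope` is a hypothetical helper, and using a least-squares slope as "the impact of the treatment change" assumes the effect is roughly linear within each cohort.

```python
def slope(doses, outcomes):
    """Least-squares slope of outcome on dose."""
    n = len(doses)
    mean_x = sum(doses) / n
    mean_y = sum(outcomes) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(doses, outcomes))
    var = sum((x - mean_x) ** 2 for x in doses)
    return cov / var


# Made-up (dose, outcome) records per cohort, for illustration only.
cohorts = {
    "men": {"dose": [1, 2, 3], "outcome": [10, 11, 12]},
    "women": {"dose": [5, 6, 7], "outcome": [2, 4, 6]},
}

# Naive analysis: pool everyone and ignore gender.
all_doses = [d for c in cohorts.values() for d in c["dose"]]
all_outcomes = [o for c in cohorts.values() for o in c["outcome"]]
naive_slope = slope(all_doses, all_outcomes)

# Step 1: impact of treatment on outcome within each cohort.
cohort_slopes = {
    name: slope(c["dose"], c["outcome"]) for name, c in cohorts.items()
}

# Step 2: average the cohort impacts, weighted by cohort share.
total = len(all_doses)
adjusted_slope = sum(
    cohort_slopes[name] * len(c["dose"]) / total
    for name, c in cohorts.items()
)

print(f"naive:    {naive_slope:.2f}")     # negative
print(cohort_slopes)                      # {'men': 1.0, 'women': 2.0}
print(f"adjusted: {adjusted_slope:.2f}")  # 1.50
```

With these toy numbers the pooled slope is negative even though the slope is positive in both cohorts, and the weighted average restores the positive sign.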
We will focus on intuition here, but in a future post we will describe the mathematics behind this solution.
I'm sure that many analysts, like myself, have detected the Simpson effect at some point in their data and, hopefully, have corrected for it. They now know the name of this effect and, hopefully, are beginning to appreciate the usefulness of causal tools.
That being said… being confused at this stage is okay.
I'll be the first to admit that I struggled to understand this concept, and it took me three weekends of digging into examples to internalize it. This was the gateway drug to causality for me. Part of my process for understanding statistics is playing with data. For this purpose I created an interactive web application, hosted on Streamlit, that I call the Simpson Calculator. I will write a separate post about this in the future.
Even if you are confused, the main conclusions from Simpson's paradox are that:
- It is a situation in which trends that exist in subgroups may be reversed for the population as a whole.
- It can be resolved by identifying confounding variables between treatment and outcome variables and controlling for them.
This raises the question: should we control for all variables except treatment and outcome? Let us keep this in mind when solving Berkson's paradox.