A new correlation coefficient. What would happen if they told you that there was a… | by Tim Sumner | March 2024

Before introducing the formula, it is important to go over some necessary groundwork. As we said earlier, correlation can be thought of as a way of measuring the relationship between two variables. Say we are measuring the correlation between X and Y. If a linear relationship exists, it is mutually shared: the correlation between X and Y always equals the correlation between Y and X. With this new approach, however, we no longer measure the linear relationship between X and Y; instead, our aim is to measure how much Y is a function of X. Understanding this subtle but important distinction from traditional correlation techniques will make the formulas much easier to follow, because in general it is no longer necessarily the case that ξ(X, Y) equals ξ(Y, X).

Continuing along the same line of thought, suppose we still want to measure how much Y is a function of X. Note that each data point is an ordered pair of X and Y. First, we sort the data as (X₍₁₎, Y₍₁₎), …, (X₍ₙ₎, Y₍ₙ₎) so that X₍₁₎ ≤ X₍₂₎ ≤ ⋯ ≤ X₍ₙ₎. Put plainly, we sort the data by X. We can then create the variables r₁, r₂, …, rₙ, where rᵢ is the rank of Y₍ᵢ₎. With these ranks identified, we are ready to calculate.
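The sorting and ranking step can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function name `ranks_after_sorting_by_x` is made up for this example, and it assumes there are no ties in either variable, with rank 1 going to the smallest Y value.

```python
def ranks_after_sorting_by_x(x, y):
    """Sort the (x, y) pairs by x, then return the rank of each y in that order.

    Hypothetical helper for illustration; assumes no ties in x or y.
    Rank 1 is the smallest y value.
    """
    pairs = sorted(zip(x, y), key=lambda p: p[0])  # order the data by x
    y_sorted = [p[1] for p in pairs]
    # Rank of each y = number of y values less than or equal to it.
    return [sum(1 for v in y_sorted if v <= yi) for yi in y_sorted]


x = [1, 2, 3, 4]
y = [2.5, 0.1, 7.0, 3.3]
print(ranks_after_sorting_by_x(x, y))  # [2, 1, 4, 3]
```

The quadratic-time ranking is deliberate: it mirrors the counting definition of a rank and is fine for small samples, though a sort-based ranking would be preferable for large ones.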

Two formulas are used, depending on the type of data you are working with. If ties in your data are impossible (or extremely unlikely), we have

$$\xi_n(X, Y) = 1 - \frac{3 \sum_{i=1}^{n-1} \lvert r_{i+1} - r_i \rvert}{n^2 - 1}$$

and if ties are allowed, we have

$$\xi_n(X, Y) = 1 - \frac{n \sum_{i=1}^{n-1} \lvert r_{i+1} - r_i \rvert}{2 \sum_{i=1}^{n} l_i (n - l_i)}$$

where lᵢ is defined as the number of j such that Y₍ⱼ₎ ≥ Y₍ᵢ₎. One last important note for when ties are allowed: in addition to using the second formula, to obtain the best possible estimate it is important to order the observed ties at random, so that one tied value is ranked above the other and (rᵢ₊₁ − rᵢ) is never equal to zero, just as before. The variable lᵢ is then simply the number of observations that are greater than or equal to Y₍ᵢ₎.
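To make the two cases concrete, here is a minimal pure-Python sketch of the coefficient ξ. It is written under stated assumptions rather than as a reference implementation: the name `xi_correlation` and its `ties` and `seed` parameters are invented for this example, tied values are ordered at random as described above, and lᵢ counts the Y values greater than or equal to Y₍ᵢ₎.

```python
import random


def xi_correlation(x, y, ties=False, seed=0):
    """Sketch of the xi coefficient measuring how much y is a function of x.

    Hypothetical helper: `ties` selects the second formula, and `seed`
    fixes the random ordering of tied values.
    """
    n = len(x)
    rng = random.Random(seed)

    # Sort by x; the random second key orders tied x values at random.
    order = sorted(range(n), key=lambda i: (x[i], rng.random()))
    y_sorted = [y[i] for i in order]

    # Rank the sorted y values, ordering tied y values at random so that
    # no two ranks coincide.
    by_y = sorted(range(n), key=lambda i: (y_sorted[i], rng.random()))
    r = [0] * n
    for rank, i in enumerate(by_y, start=1):
        r[i] = rank

    total = sum(abs(r[i + 1] - r[i]) for i in range(n - 1))

    if not ties:
        return 1 - 3 * total / (n ** 2 - 1)

    # l_i = number of y values greater than or equal to y_(i).
    l = [sum(1 for v in y_sorted if v >= yi) for yi in y_sorted]
    return 1 - n * total / (2 * sum(li * (n - li) for li in l))


# y is a function of x, but x is not a function of y.
x = list(range(-5, 6))
y = [v ** 2 for v in x]
print(xi_correlation(x, y, ties=True))  # about 0.525
print(xi_correlation(y, x, ties=True))  # noticeably smaller
```

The usage lines also illustrate the asymmetry discussed earlier: with y = x², the coefficient is clearly larger in the direction where one variable really is a function of the other. When the data contain no ties, the two formulas return the same value.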

Without diving too deep into the theory, it is also worth briefly pointing out that this new correlation comes with some nice asymptotic theory behind it, which makes it very easy to perform hypothesis testing without making assumptions about the underlying distributions. This is because the method depends on the ranks of the data and not on the values themselves, making it a nonparametric statistic. If X and Y are independent and Y is continuous, then

$$\sqrt{n}\,\xi_n(X, Y) \xrightarrow{d} N\!\left(0, \tfrac{2}{5}\right) \quad \text{as } n \to \infty.$$

What this means is that, if you have a large enough sample size, this correlation statistic approximately follows a normal distribution with mean 0 and variance 2/(5n). This can be useful if you want to test whether the two variables you are studying are independent.
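As a rough illustration of how such a test might look, the sketch below turns an observed coefficient into an approximate p-value, treating √n times the coefficient as normal with mean 0 and variance 2/5 under independence. The function name `xi_independence_pvalue` is hypothetical, the test is taken to be one-sided (large values point toward dependence), and the approximation assumes a large sample with a continuous Y.

```python
import math


def xi_independence_pvalue(xi, n):
    """Approximate one-sided p-value for the null hypothesis of independence.

    Hypothetical helper: assumes sqrt(n) * xi is approximately
    N(0, 2/5) under independence, which needs a large n and continuous Y.
    """
    z = math.sqrt(n) * xi / math.sqrt(2 / 5)  # standardized statistic
    # Upper-tail probability of a standard normal, via the complementary
    # error function: P(Z > z) = 0.5 * erfc(z / sqrt(2)).
    return 0.5 * math.erfc(z / math.sqrt(2))


# A coefficient of 0.1 from 100 observations gives a p-value a little above 0.05.
print(xi_independence_pvalue(0.1, 100))
```

A small p-value suggests the observed coefficient would be unusual if the two variables were independent; for small samples the normal approximation may be poor, and a permutation test could be a safer choice.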