Whenever we face a task involving the analysis of binary outcomes, we usually think of logistic regression as the go-to method, which is why most articles on binary outcome regression focus exclusively on it. However, logistic regression is not the only option. There are other methods, such as the Linear Probability Model (LPM), Probit regression, and Complementary Log-Log (Cloglog) regression. Unfortunately, articles on these alternatives are scarce on the Internet.
The linear probability model is rarely used because it cannot capture the curvilinear relationship between a binary outcome and the independent variables. I covered Cloglog regression in a previous article. While some articles on Probit regression are available on the Internet, they tend to be technical and hard to follow for non-technical readers. In this article, we will explain the basic principles of Probit regression and its applications, and compare it with logistic regression.
This is what a relationship between a binary outcome variable and an independent variable typically looks like:
The curve you see is called an S-shaped or sigmoid curve. If we look closely at this graph, we notice that it resembles the cumulative distribution function (CDF) of a random variable. Therefore, it makes sense to use a CDF to model the relationship between a binary outcome variable and the independent variables. The two most commonly used CDFs are the logistic and the normal. Logistic regression uses the logistic CDF, given by the following equation:
Pᵢ = P(Y = 1 | X) = 1 / (1 + e^(−(β1 + β2Xᵢ)))
In Probit regression, we use the cumulative distribution function (CDF) of the normal distribution instead. We can simply replace the logistic CDF with the standard normal CDF to obtain the Probit regression equation:
Pᵢ = P(Y = 1 | X) = Φ(β1 + β2Xᵢ)
Where Φ() represents the cumulative distribution function of the standard normal distribution.
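To see how similar the two link functions are, here is a small Python sketch (NumPy and SciPy assumed installed) that evaluates both CDFs on the same grid of values; both trace S-shaped curves and differ mainly in the tails:

```python
import numpy as np
from scipy.stats import logistic, norm

# Evaluate the logistic and standard normal CDFs on the same grid.
z = np.linspace(-4, 4, 9)
for zi in z:
    # The two S-shaped curves agree closely in the middle
    # and differ mostly in the tails.
    print(f"z = {zi:5.1f}  logistic: {logistic.cdf(zi):.4f}  normal: {norm.cdf(zi):.4f}")
```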
We could memorize these equations, but that alone would not clarify the concept behind Probit regression. Therefore, we will take a different approach to better understand how it works.
Let’s say we have data on the weight and depression status of a sample of 1,000 individuals. Our objective is to examine the relationship between weight and depression using Probit regression. (Download the data from this link.)
To build intuition, let’s imagine that whether an individual (the i-th individual) experiences depression or not depends on an unobservable latent variable, denoted Aᵢ. This latent variable is determined by one or more independent variables; in our scenario, an individual’s weight determines its value through a linear equation of the form Aᵢ = β1 + β2Xᵢ, where Xᵢ is the weight of the i-th individual. The higher the latent variable, the higher the probability of experiencing depression.
The question is: given that Aᵢ is an unobserved latent variable, how do we estimate the parameters of the previous equation? Well, if we assume that Aᵢ is normally distributed, with mean zero and unit variance, we can obtain some information about it and estimate the model parameters. I’ll explain the equations in more detail later, but first let’s do some practical calculations.
Back to our data: let’s calculate the probability of depression for each weight and tabulate it. For example, there are 7 people with a weight of 40 kg, and 1 of them has depression, so the probability of depression at a weight of 40 kg is 1/7 = 0.14286. If we do this for all weights, we get this table:
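In code, this grouping step might look like the following sketch, assuming the data sit in a pandas DataFrame; the file name "depression.csv" and the column names "weight" and "depression" are hypothetical:

```python
import pandas as pd

# Hypothetical file and column names; adjust to the actual dataset.
df = pd.read_csv("depression.csv")

# Proportion of depressed individuals at each observed weight,
# e.g. weight 40: 1 case out of 7 people -> 0.14286.
grouped = df.groupby("weight")["depression"].mean().reset_index(name="p_depression")
print(grouped.head())
```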
Now, how do we obtain the values of the latent variable? We know that the CDF of the normal distribution gives the probability that the variable takes a value less than or equal to a given point. The inverse CDF does the opposite: it returns the value associated with a given probability. Since we already have the probability values, we can determine the corresponding value of the latent variable using the inverse CDF of the standard normal distribution. (Note: the inverse normal CDF function is available in almost all statistical programs, including Excel.)
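For instance, SciPy exposes the inverse normal CDF as norm.ppf (the percent point function); a quick check on the weight-40 group:

```python
from scipy.stats import norm

p = 0.14286          # probability of depression at weight 40
print(norm.ppf(p))   # ≈ -1.07: the latent-variable value (z-score) for this group
```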
This unobserved latent variable Aᵢ is known as the normal equivalent deviate (n.e.d.), or simply the normit. Looking closely, the normits are nothing more than the z-scores associated with the unobserved latent variable. Once we have the estimated Aᵢ values, estimating β1 and β2 is relatively simple: we can run a simple linear regression of Aᵢ on our independent variable.
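A minimal sketch of this grouped-data fit in Python, continuing from the hypothetical grouped table above (statsmodels assumed installed). One practical caveat: groups with a probability of exactly 0 or 1 must be dropped or adjusted first, since their normits are infinite:

```python
import statsmodels.api as sm
from scipy.stats import norm

# Drop groups whose probability is exactly 0 or 1 (their normits are +/- infinity).
g = grouped[(grouped["p_depression"] > 0) & (grouped["p_depression"] < 1)].copy()
g["normit"] = norm.ppf(g["p_depression"])

# Simple linear regression of the normits on weight.
ols_fit = sm.OLS(g["normit"], sm.add_constant(g["weight"])).fit()
print(ols_fit.params)  # intercept and weight coefficient
```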
The weight coefficient 0.0256 gives us the change in the z-score of the outcome variable (depression) associated with a one-unit change in weight. Specifically, a one-kilogram increase in weight is associated with an increase of approximately 0.0256 in the z-score for depression. We can calculate the probability of depression for any weight using the standard normal distribution. For example, for a weight of 70 kg,
Aᵢ = −1.61279 + (0.02565 × 70)
Aᵢ = 0.1828
The probability associated with a z-score of 0.1828, P(Z ≤ 0.1828), is approximately 0.57. In other words, an individual weighing 70 kg has an estimated 57% probability of having depression.
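The same check in Python (SciPy assumed):

```python
from scipy.stats import norm

a_i = -1.61279 + 0.02565 * 70   # estimated z value at weight 70, ≈ 0.1828
print(norm.cdf(a_i))            # ≈ 0.57, the predicted probability of depression
```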
It is quite reasonable to say that the above explanation was an oversimplification of a moderately complex method. It is also important to note that it is just an illustration of the basic principle behind using the cumulative normal distribution in Probit regression. Now, let’s look at the mathematical equations.
Mathematical structure
Earlier we noted that there is a latent variable, Aᵢ, which is determined by the predictor variables. It is logical to assume that there is a critical or threshold value of the latent variable, A_c, such that if Aᵢ exceeds A_c, the individual will have depression; otherwise, they will not. Given the normality assumption, the probability that A_c is less than or equal to Aᵢ can be calculated from the standard normal CDF:
Pᵢ = P(Y = 1 | X) = P(A_c ≤ Aᵢ) = P(Z ≤ β1 + β2Xᵢ) = F(β1 + β2Xᵢ)
where Z is the standard normal variable, i.e., Z ∼ N(0, 1), and F is the standard normal CDF.
The information related to the latent variable, β1, and β2 can be obtained by taking the inverse of the previous equation:
Aᵢ = F⁻¹(Pᵢ) = β1 + β2Xᵢ
The inverse CDF of the standard normal distribution is used when we want to obtain the value of Z for a given probability.
Now, the estimation of β1, β2, and Aᵢ depends on whether we have grouped data or ungrouped, individual-level data.
When we have grouped data, calculating the probabilities is easy. In our depression example, the initial data are ungrouped, that is, there is a weight for each individual along with their depression status (1 or 0). The total sample size was 1,000, but we grouped the data by weight, resulting in 71 weight groups, and calculated the probability of depression within each group.
However, when the data are ungrouped, the maximum likelihood estimation (MLE) method is used to estimate the model parameters. The following figure shows the Probit regression on our ungrouped data (n = 1000):
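In Python, this individual-level fit might look like the following sketch (statsmodels assumed installed, with the hypothetical column names used earlier):

```python
import statsmodels.api as sm

# Probit fit by maximum likelihood on the ungrouped data (one row per individual).
X = sm.add_constant(df["weight"])
probit_fit = sm.Probit(df["depression"], X).fit()
print(probit_fit.params)  # the weight coefficient should be close to the grouped-data estimate
```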
It can be seen that the weight coefficient is very close to the one we estimated with the grouped data.
Now that we have understood the concept of Probit regression and are (hopefully) familiar with logistic regression, the question arises: which model is preferable? Which model works best under which conditions? Both models are quite similar in their application and produce comparable results in terms of predicted probabilities. The only minor distinction lies in their sensitivity to extreme values. Let’s take a closer look at both models:
From the graph we can see that the Probit and Logit models are quite similar. However, Probit is less sensitive to extreme values than Logit: at extreme values of the predictor, the change in the outcome probability for a unit change in the predictor is greater in the Logit model than in the Probit model. Therefore, if you want your model to be sensitive to extreme values, you may prefer logistic regression. In practice, this choice rarely affects the conclusions, as both models give similar predicted probabilities. It is important to note that the coefficients from the two models represent different quantities and cannot be compared directly: logistic regression gives the change in the log odds of the outcome per unit change in the predictor, while Probit regression gives the change in the z-score of the outcome. Nevertheless, if we calculate the predicted probabilities from both models, the results will be very similar.
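One way to see this similarity on your own data is to fit both models and compare their predicted probabilities (a sketch, again with the hypothetical column names used earlier):

```python
import numpy as np
import statsmodels.api as sm

X = sm.add_constant(df["weight"])
logit_p = sm.Logit(df["depression"], X).fit(disp=0).predict(X)
probit_p = sm.Probit(df["depression"], X).fit(disp=0).predict(X)

# The largest gap between the two predicted-probability curves is typically small.
print(np.abs(logit_p - probit_p).max())
```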
In practice, logistic regression is often preferred over Probit regression due to its mathematical simplicity and the easy interpretation of its coefficients as log odds.