Introduction
One of the biggest problems with linear regression is autocorrelated residuals. In this context, this article reviews linear regression, delves into the Cochrane-Orcutt procedure as a way to solve this problem, and explores a real-world application in the analysis of brain activation by fMRI.
Linear regression is probably one of the most important tools for any data scientist. However, it is common to see misconceptions about it, especially in the context of time series. So, let's spend some time reviewing the concept. The main objective of a GLM in time series analysis is to model the relationship between variables over a sequence of time points:

$y_t = \beta_0 + \beta_1 x_t + \varepsilon_t$

where $y_t$ is the target data, $x_t$ is the feature data, $\beta_0$ and $\beta_1$ are the coefficients to estimate, and $\varepsilon_t$ is the Gaussian error.
The index $t$ refers to the temporal evolution of the data. Writing it in a more compact matrix form:

$Y = X\beta + \varepsilon$
Parameter estimation is performed using ordinary least squares (OLS), which assumes that the errors, or residuals (the differences between the observed values and the values predicted by the model), are independent and identically distributed (iid).
This means that the residuals must not be autocorrelated to ensure correct estimation of the coefficients, validity of the model, and accuracy of the predictions.
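For reference, under the iid assumption the OLS estimator takes the standard closed form

$\hat{\beta} = (X^\top X)^{-1} X^\top Y$

and the usual standard errors are derived from $\widehat{\mathrm{Var}}(\hat{\beta}) = \hat{\sigma}^2 (X^\top X)^{-1}$, which is exactly what autocorrelated residuals invalidate.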
Autocorrelation refers to the correlation between observations within a time series. We can understand it as how each data point relates to lagging data points in a sequence.
Autocorrelation functions (ACF) are used to detect autocorrelation. These measure the correlation between a data point and its lagged values (lags 1, 2, …, 40), revealing whether data points are related to preceding or following values. ACF plots (Figure 1) show the correlation coefficients at different lags, indicating the strength of autocorrelation; coefficients outside the shaded region are statistically significant.
If the coefficients for certain lags differ significantly from zero, it suggests the presence of autocorrelation.
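As an illustration, here is a minimal sketch that simulates an AR(1) series with numpy and inspects its ACF with statsmodels; the series length and the AR coefficient phi are arbitrary choices for the example, not part of any real dataset.

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

rng = np.random.default_rng(0)
n, phi = 200, 0.7  # illustrative length and AR(1) coefficient
e = rng.normal(size=n)
series = np.zeros(n)
for t in range(1, n):
    # AR(1): each point depends on the previous one plus noise
    series[t] = phi * series[t - 1] + e[t]

# bars outside the shaded band indicate significant autocorrelation
plot_acf(series, lags=40)
plt.show()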
Autocorrelation in the residuals suggests that there is a relationship or dependence between current and past errors in the time series. This correlation pattern indicates that the errors are not random and may be influenced by factors not accounted for in the model. Autocorrelation leads to biased parameter estimates, especially of the variance, affecting the understanding of the relationships between variables. This results in invalid inferences drawn from the model, leading to misleading conclusions. Furthermore, it results in inefficient predictions, which means that the model does not capture the correct information.
The Cochrane-Orcutt procedure is a well-known method in econometrics and various other areas for addressing autocorrelation in a time series by modeling serial correlation in the error term (1, 2). As we have seen, such correlation violates one of the assumptions of ordinary least squares (OLS) regression, which requires that the (residual) errors be uncorrelated (1). Later in the article, we will use the procedure to remove the autocorrelation and check how biased the coefficients are.
The Cochrane-Orcutt procedure is as follows:
- 1. Initial OLS regression: Begin with an initial regression analysis using ordinary least squares (OLS) to estimate model parameters.
- 2. Residual calculation: Compute the residuals from the initial regression.
- 3. Autocorrelation test: Examine the residuals for the presence of autocorrelation using ACF plots or tests such as the Durbin-Watson test. If the autocorrelation is not significant, it is not necessary to follow the procedure.
- 4. Transformation: Transform the estimated model by quasi-differencing the dependent and independent variables to eliminate the autocorrelation. The idea here is to make the residuals closer to being uncorrelated.
- 5. Regression of the transformed model: Perform a new regression analysis with the transformed model and calculate new residuals.
- 6. Check autocorrelation: Test the autocorrelation of the new residuals again. If autocorrelation persists, return to step 4 and transform the model further until the residuals show no significant autocorrelation.
- 7. Final model estimation: Once the residuals do not show significant autocorrelation, use the final model and the coefficients derived from the Cochrane-Orcutt procedure to make inferences and draw conclusions. A minimal code sketch of this loop follows below.
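To make the steps concrete, here is a minimal sketch of the iterative loop, assuming statsmodels for the OLS fits and for the Durbin-Watson statistic; the function name cochrane_orcutt, max_iter, and tol are illustrative choices, not a standard API.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

def cochrane_orcutt(y, X, max_iter=10, tol=1e-4):
    # step 1: initial OLS regression
    X = sm.add_constant(X)
    fit = sm.OLS(y, X).fit()
    rho = 0.0
    for _ in range(max_iter):
        # step 2: residuals of the current fit
        resid = np.asarray(fit.resid)
        # step 3: lag-1 autocorrelation of the residuals
        rho_new = np.corrcoef(resid[1:], resid[:-1])[0, 1]
        # step 4: quasi-difference the original variables
        # (the constant column becomes 1 - rho, which rescales the intercept)
        y_star = y[1:] - rho_new * y[:-1]
        X_star = X[1:] - rho_new * X[:-1]
        # step 5: refit on the transformed data
        fit = sm.OLS(y_star, X_star).fit()
        # step 6: stop once rho stabilizes
        if abs(rho_new - rho) < tol:
            break
        rho = rho_new
    # step 7: final model; Durbin-Watson near 2 means no lag-1 autocorrelation
    print(durbin_watson(np.asarray(fit.resid)))
    return fit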
A brief introduction to fMRI
Functional magnetic resonance imaging (fMRI) is a neuroimaging technique that measures and maps brain activity by detecting changes in blood flow. It is based on the principle that neuronal activity is associated with increased blood flow and oxygenation. In fMRI, when a region of the brain is activated, it triggers a hemodynamic response that causes changes in blood oxygen level-dependent (BOLD) signals. fMRI data typically consist of 3D images that represent brain activation at different time points; therefore, each volume (voxel) of the brain has its own time series (Figure 2).
The general linear model (GLM)
The GLM assumes that the signal measured by fMRI is a linear combination of different factors (features), such as task information mixed with the expected response of neuronal activity known as the hemodynamic response function (HRF). For simplicity, we will ignore the nature of the HRF and simply assume that it is an important feature.
To understand the impact of the tasks on the resulting BOLD signal y (dependent variable), we are going to use a GLM. This translates into testing the effect through statistically significant coefficients associated with the task information. Hence, x1 and x2 (independent variables) are the information about the task executed by the participant during data collection, convolved with the HRF (Figure 3).
Application to real data
To test this real-world application, we will use data collected by Prof. João Sato at the Federal University of ABC, which is available on GitHub. The dependent variable fmri_data contains data from one voxel (a single time series), but we could do the same for every voxel in the brain. The independent variables that contain the task information are cong and incong. Explanations of these variables are beyond the scope of this article.
import numpy as np
import nibabel as nib
# assuming nilearn's Glover HRF for the glover call below
from nilearn.glm.first_level import glover_hrf as glover

#Reading data
fmri_img = nib.load('/Users/rodrigo/Medium/GLM_Orcutt/Stroop.nii')
cong = np.loadtxt('/Users/rodrigo/Medium/GLM_Orcutt/congruent.txt')
incong = np.loadtxt('/Users/rodrigo/Medium/GLM_Orcutt/incongruent.txt')

#Get the series from each voxel
fmri_data = fmri_img.get_fdata()

#HRF function
HRF = glover(.5)

#Convolution of task data with HRF
conv_cong = np.convolve(cong.ravel(), HRF.ravel(), mode='same')
conv_incong = np.convolve(incong.ravel(), HRF.ravel(), mode='same')
Figure 3: Display of the task information variables (features).
Fitting the GLM
Using ordinary least squares to fit the model and estimate the model parameters, we arrive at:
import statsmodels.api as sm

#Selecting one voxel (time series)
y = fmri_data[20, 30, 30]
x = np.array((conv_incong, conv_cong)).T

#add constant to predictor variables
x = sm.add_constant(x)

#fit linear regression model
model = sm.OLS(y, x).fit()

#view model summary
print(model.summary())
params = model.params
It is possible to see that the coefficient x1 is statistically significant, since P>|t| is less than 0.05. That could mean that the task does indeed affect the BOLD signal. But before using these parameters to make inferences, it is essential to check that the residuals, meaning y minus the model prediction, are not autocorrelated at any lag. Otherwise, our estimate is biased.
Checking residual autocorrelation
As already mentioned, the ACF plot is a good way to check for autocorrelation in the series.
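A minimal sketch to produce this check, assuming the model object from the OLS fit above and statsmodels' plot_acf:

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

#ACF of the residuals of the OLS fit; bars outside the shaded band are significant
plot_acf(model.resid, lags=40)
plt.show()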
By observing the ACF plot it is possible to detect a high autocorrelation at lag 1. Therefore, this linear model is biased and it is important to solve this problem.
Cochrane-Orcutt to resolve autocorrelation in residuals
The Cochrane-Orcutt procedure is widely used in the analysis of fMRI data to solve this kind of problem (2). In this specific case, the lag-1 autocorrelation in the residuals is significant, so we can use the Cochrane-Orcutt transformation for an autoregressive term of order 1, AR(1).
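For reference, assuming the errors follow an AR(1) process $\varepsilon_t = \rho\,\varepsilon_{t-1} + u_t$, the transformation applied below is the standard quasi-differencing:

$Y_t^* = Y_t - \rho Y_{t-1}, \qquad X_t^* = X_t - \rho X_{t-1}, \qquad t = 2, \dots, T$

where $\rho$ is the lag-1 autocorrelation of the residuals.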
# LAG 0
yt = y[1:180]
# LAG 1
yt1 = y[0:179]

# calculate correlation coef. for lag 1, estimated from the OLS residuals (step 3 of the procedure)
resid = model.resid
rho = np.corrcoef(resid[1:], resid[:-1])[0, 1]

# Cochrane-Orcutt equation (quasi-differencing)
Y2 = yt - rho * yt1
# drop the constant column here; it is re-added after the transformation
X2 = x[1:180, 1:] - rho * x[0:179, 1:]
Transformed model fitting
Fitting the model again but after the Cochrane-Orcutt correction.
import statsmodels.api as sm

#add constant to predictor variables
X2 = sm.add_constant(X2)

#fit linear regression model
model = sm.OLS(Y2, X2).fit()

#view model summary
print(model.summary())
params = model.params
Now the coefficient x1 is no longer statistically significant, no longer supporting the hypothesis that the task impacts the BOLD signal. The standard errors of the estimates changed substantially, indicating the strong impact of autocorrelation on parameter estimation.
Checking autocorrelation again
This makes sense, since it is possible to show that the variance estimate is biased whenever autocorrelation exists (1).
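A quick sketch to confirm the autocorrelation is gone, re-using plot_acf and adding the Durbin-Watson statistic (assuming the model object from the transformed fit above):

from statsmodels.stats.stattools import durbin_watson

#ACF of the residuals after the Cochrane-Orcutt correction
plot_acf(model.resid, lags=40)
plt.show()

#a Durbin-Watson value close to 2 indicates no remaining lag-1 autocorrelation
print(durbin_watson(model.resid))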
Autocorrelation has now been removed from the residuals, and the estimate is no longer biased. If we had ignored the autocorrelation in the residuals, we would have considered the coefficient significant. After removing it, however, the parameter turns out not to be significant, avoiding a spurious inference that the task is indeed related to the signal.
Autocorrelation in the residuals of a general linear model can lead to biased estimates, inefficient predictions, and invalid inferences. Application of the Cochrane-Orcutt procedure to real-world fMRI data demonstrates its effectiveness in removing autocorrelation from residuals and avoiding false conclusions, ensuring the reliability of the model parameters and the accuracy of inferences drawn from the analysis.
Observations
Cochrane-Orcutt is just one method for resolving autocorrelation in residuals. There are others that address this problem, such as the Hildreth-Lu procedure and the first differences procedure (1).