MLE provides a framework that addresses this issue precisely. We introduce a likelihood function, which is a function that produces another function: it takes a vector of parameters, often called theta, and returns a probability density function (PDF) that depends on theta.
The probability density function (PDF) of a distribution is a function that takes a value, x, and returns the probability (strictly, the probability density) of observing that value under the distribution. Therefore, probability functions are typically written as follows:
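In one common notation, writing f for the PDF, x for the observed value, and theta for the parameter vector:

f(x; \theta)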
The value of this function indicates the probability of observing x from the distribution defined by the PDF with theta as its parameters.
The goal
When building a forecast model, we have data samples and a parameterized model, and our goal is to estimate the model parameters. In our examples, such as the regression and moving average models, these parameters are the coefficients in the respective model formulas.
The equivalent in MLE is that we have observations and a PDF for a distribution defined over a set of parameters, theta, which are unknown and not directly observable. Our goal is to estimate theta.
The MLE approach involves finding the set of parameters, theta, that maximizes the likelihood function given the observable data, x.
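In symbols, writing L for the likelihood function and a hat for the estimate (one common convention), this reads:

\hat{\theta} = \arg\max_{\theta} L(\theta; x)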
We assume that our samples, x, are drawn from a distribution with a known probability function that depends on a set of parameters, theta. Under the true values of theta, the likelihood of observing our samples x should therefore be relatively high. So, identifying the theta values that make the likelihood of our samples as large as possible should bring us close to the true values of the parameters.
Conditional probability
Note that we have not made any assumptions about the distribution (PDF) on which the probability function is based. Now, suppose our observation x is a vector (x_1, x_2, …, x_n). We will consider a probability function that represents the probability of observing x_n conditional on having already observed (x_1, x_2, …, x_{n-1}) —
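In the same notation, this conditional density can be sketched as:

f(x_n \mid x_{n-1}, \dots, x_1; \theta)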
This represents the probability of observing only x_n given the above values (and theta, the set of parameters). Now, we define the conditional probability function as follows:
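One natural way to write this definition (a sketch; the subscript c only marks the conditional version) is as the product of the one-step conditional densities:

L_c(\theta; x_1, \dots, x_n) = \prod_{t=1}^{n} f(x_t \mid x_{t-1}, \dots, x_1; \theta)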
Later we will see why it is useful to use the conditional likelihood function instead of the exact likelihood function.
The log-likelihood
In practice, it is often convenient to use the natural logarithm of the likelihood function, called the log-likelihood function:
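Writing \ell for the log-likelihood and L for the likelihood (again, just a notational convention):

\ell(\theta; x) = \ln L(\theta; x)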
This is more convenient because we often work with a likelihood function which is a joint probability function of independent variables, which translates to the product of the probability of each variable. Taking the logarithm converts this product into a sum.
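In symbols, with f_t standing for the density of the t-th variable:

\ln \prod_{t=1}^{n} f_t = \sum_{t=1}^{n} \ln f_t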
For simplicity, I will demonstrate how to estimate the most basic moving average model: MA(1):
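Written out, consistent with the simulation code further below:

x_t = \alpha + \beta \, \epsilon_{t-1} + \epsilon_t, \qquad \epsilon_t \sim N(0, \sigma^2)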
Here, x_t represents the time series observations, alpha and beta are the model parameters to be estimated, and the epsilons are random noise terms drawn from a normal distribution with zero mean and some standard deviation, sigma, which will also be estimated. Therefore, our “theta” is (alpha, beta, sigma), which we intend to estimate.
Let's define our parameters and generate some synthetic data using Python:
import pandas as pd
import numpy as np

STD = 3.3
MEAN = 0
ALPHA = 18
BETA = 0.7
N = 1000

# simulate N noise terms and build the MA(1) series x_t = alpha + beta*e_{t-1} + e_t
df = pd.DataFrame({"et": np.random.normal(loc=MEAN, scale=STD, size=N)})
df["et-1"] = df["et"].shift(1, fill_value=0)
df["xt"] = ALPHA + (BETA * df["et-1"]) + df["et"]
Note that we have set the standard deviation of the error distribution to 3.3, with alpha at 18 and beta at 0.7. The data looks like this:
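If you want to look at the simulated series yourself, a minimal plotting snippet such as the following should do (matplotlib assumed; it is not part of the estimation itself):

import matplotlib.pyplot as plt

# plot the simulated MA(1) series
df["xt"].plot(figsize=(10, 4), title="Simulated MA(1) series")
plt.xlabel("t")
plt.show()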
Likelihood function for MA(1)
Our goal is to construct a likelihood function that addresses the question: what is the probability of observing our time series x=(x_1,…, x_n) assuming they are generated by the MA(1) process described above?
The challenge of calculating this probability lies in the mutual dependence between our samples (as is evident from the fact that both x_t and x_{t-1} depend on e_{t-1}), which makes it non-trivial to determine the joint probability of observing all samples (known as the exact likelihood).
So, as discussed above, instead of calculating the exact probability, we will work with a conditional probability. Let's start with the probability of observing a single sample given all previous samples:
This is much easier to calculate because:
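given all the previous samples (and theta), the noise terms can be recovered one by one, which is exactly what the calc_conditional_et method in the code below does. Sketching it in the notation above, with \epsilon_0 = 0:

\epsilon_t = x_t - \alpha - \beta \, \epsilon_{t-1}

so that, conditionally on the past, each observation is simply normal:

x_t \mid x_{t-1}, \dots, x_1 \sim N(\alpha + \beta \, \epsilon_{t-1}, \ \sigma^2)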
All that remains is to calculate the conditional probability of observing all samples:
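Under the normality assumption, a sketch of this product is:

L_c(\theta; x) = \prod_{t=1}^{n} \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left( -\frac{(x_t - \alpha - \beta \, \epsilon_{t-1})^2}{2\sigma^2} \right)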
Applying a natural logarithm we obtain:
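Up to the notation chosen here, this gives roughly:

\ell_c(\theta; x) = -\frac{n}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{t=1}^{n} \left( x_t - \alpha - \beta \, \epsilon_{t-1} \right)^2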
which is the function we should maximize.
Code
We will use the GenericLikelihoodModel class from statsmodels for our MLE estimation implementation. As described in the tutorial on the statsmodels website, we simply need to subclass this class and implement our likelihood function calculation:
from scipy import stats
from statsmodels.base.model import GenericLikelihoodModel
import statsmodels.api as sm


class MovingAverageMLE(GenericLikelihoodModel):

    def initialize(self):
        super().initialize()
        # register the extra parameters (beta, std) in addition to the intercept
        extra_params_names = ('beta', 'std')
        self._set_extra_params_names(extra_params_names)
        self.start_params = np.array((0.1, 0.1, 0.1))

    def calc_conditional_et(self, intercept, beta):
        # recover the noise terms recursively: e_t = x_t - alpha - beta * e_{t-1}, with e_0 = 0
        df = pd.DataFrame({"xt": self.endog})
        ets = [0.0]
        for i in range(1, len(df)):
            ets.append(df.iloc[i]["xt"] - intercept - (beta * ets[i - 1]))
        return ets

    def loglike(self, params):
        # params = (alpha, beta, std): the conditional log-likelihood is the sum of
        # the log N(0, std) densities of the recovered noise terms
        ets = self.calc_conditional_et(params[0], params[1])
        return stats.norm.logpdf(
            ets,
            scale=params[2],
        ).sum()
The loglike function is the one we must implement. Given the iterated parameter values params and the dependent variable (in this case, the time series samples), which is stored in the class member self.endog, it calculates the conditional log-likelihood value, as we discussed above.
Now let's create the model and fit it to our simulated data:
df = sm.add_constant(df)  # add intercept for estimation (alpha)
model = MovingAverageMLE(df["xt"], df["const"])
r = model.fit()
r.summary()
and the output is:
And that's it! As demonstrated, MLE successfully estimated the parameters we selected for the simulation.
Estimating even a simple MA(1) model with maximum likelihood demonstrates the power of this method, which not only allows us to make efficient use of our data but also provides a solid statistical basis for understanding and interpreting the dynamics of time series data.
I hope you liked it!
Unless otherwise stated, all images are the property of the author.