When I notice that my model is overfitting, I often think, "It's time to regularize." But how do I decide which regularization method to use (L1, L2) and which parameters to choose? Typically, I perform hyperparameter optimization using a grid search to select the settings. However, what happens if the independent variables have different scales or different levels of influence? Can I design a hyperparameter grid with a different regularization coefficient for each variable? Is this type of optimization feasible in high-dimensional spaces? And are there alternative ways to design regularization? Let's explore this with a hypothetical example.
My fictional example is a binary classification use case with 3 explanatory variables. Each of these variables is categorical and has 6 different categories. My reproducible use case is in this notebook. The function that generates the data set is the following:
import numpy as np
import pandas as pd
from scipy.special import expit


def get_classification_dataset():
    n_samples = 200
    cats = ["a", "b", "c", "d", "e", "f"]
    x = pd.DataFrame(
        data={
            "col1": np.random.choice(cats, size=n_samples),
            "col2": np.random.choice(cats, size=n_samples),
            "col3": np.random.choice(cats, size=n_samples),
        }
    )
    X_preprocessed = pd.get_dummies(x)
    theta = np.random.multivariate_normal(
        np.zeros(len(cats) * x.shape[1]),
        np.diag(np.array([1e-1] * len(cats) + [1] * len(cats) + [1e1] * len(cats))),
    )
    y = pd.Series(
        data=np.random.binomial(1, expit(np.dot(X_preprocessed.to_numpy(), theta))),
        index=X_preprocessed.index,
    )
    return X_preprocessed, y
For your information, I deliberately chose 3 different scales of values for the diagonal of the theta covariance matrix to show the benefit of the Laplace approximated Bayesian optimization method. If the values were all similar, the technique would be of less interest.
Along with a simple reference model that predicts the mean value observed on the training data set (used for comparison purposes), I chose to design a slightly more complex model. I decided to one-hot encode the three independent variables and apply a logistic regression model on top of this basic preprocessing. For regularization, I chose an L2 design and sought the optimal regularization coefficient using two techniques: grid search and Laplace approximated Bayesian optimization, as you may have already anticipated. Finally, I evaluated the model on a test data set using two (arbitrarily selected) metrics: log loss and ROC AUC.
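To make the baseline concrete, here is a minimal sketch of the grid-search variant, assuming scikit-learn and the get_classification_dataset function above; the grid, the scoring choice and the train/test split are illustrative assumptions, not the settings used for the reported results.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = get_classification_dataset()  # already one-hot encoded
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# grid search over a single, shared L2 strength C = 1 / lambda
grid = GridSearchCV(
    LogisticRegression(penalty="l2", max_iter=1000),
    param_grid={"C": np.logspace(-3, 3, 13)},
    scoring="neg_log_loss",
    cv=5,
)
grid.fit(X_train, y_train)

proba = grid.predict_proba(X_test)[:, 1]
print(log_loss(y_test, proba), roc_auc_score(y_test, proba))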
Before presenting the results, let's first take a closer look at the Bayesian model and how we optimized it.
In the Bayesian framework, the parameters are no longer fixed constants, but random variables. Instead of maximizing the likelihood to estimate these unknown parameters, we now optimize the posterior distribution of the random parameters, given the observed data. This forces us to choose, often somewhat arbitrarily, the design and parameters of the prior. However, it is also possible to treat the parameters of the prior as random variables themselves, as in Inception, where layers of uncertainty keep accumulating on top of each other…
In this study, I have chosen the following model:
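Since the original formulation was given as a figure, here is my reconstruction of the hierarchical model from the description below:

$$
Y_i \mid \theta \sim \mathrm{Bernoulli}\big(\operatorname{expit}(x_i^\top \theta)\big), \qquad
\theta \mid \Sigma \sim \mathcal{N}(0, \Sigma), \qquad
\Sigma_{ii}^{-1} \sim \Gamma(\alpha, \beta),
$$

with Σ assumed diagonal.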
Logically, I have chosen a Bernoulli model for Y_i | θ, a centered normal prior corresponding to an L2 regularization for θ | Σ, and finally, for Σ_i^{-1}, I chose a Gamma model. I chose to model the precision matrix instead of the covariance matrix, as is traditional in the literature, such as in the scikit-learn user guide for Bayesian linear regression (2).
In addition to this written model, I assumed that Y_i and Y_j are conditionally (on θ) independent, as are Y_i and Σ.
Likelihood
Given the model, the likelihood can be written:
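The original expression was a figure; my reconstruction, consistent with the terms discussed next, is the posterior over Σ written with Bayes' rule (valid for any θ):

$$
P(\Sigma \mid Y = y) = \frac{P(Y = y \mid \theta, \Sigma)\, P(\theta \mid \Sigma)\, P(\Sigma)}{P(\theta \mid Y = y, \Sigma)\, P(Y = y)}
$$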
To optimize, we need to evaluate almost all of the terms, with the exception of P(Y=y). The terms in the numerator can be evaluated using the chosen model. However, the remaining term in the denominator, P(θ | Y=y, Σ), cannot. This is where the Laplace approximation comes into play.
Laplace approximation
To evaluate the first term of the denominator, we can take advantage of the Laplace approximation. We approximate the distribution of θ | Y, Σ with:
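My reconstruction of that approximation (the original was a figure):

$$
\theta \mid Y = y, \Sigma \;\approx\; \mathcal{N}\big(\theta^*, H^{-1}\big), \qquad
H = -\nabla^2_\theta \log P(\theta \mid Y = y, \Sigma)\big|_{\theta = \theta^*},
$$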
where θ* is the mode of the density distribution of θ | Y, Σ.
Although we do not know the density function, we can evaluate the Hessian part thanks to the following decomposition:
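A plausible reconstruction of that decomposition (the last term does not depend on θ, so it vanishes from the Hessian):

$$
\log P(\theta \mid Y = y, \Sigma) = \log P(Y = y \mid \theta, \Sigma) + \log P(\theta \mid \Sigma) - \log P(Y = y \mid \Sigma)
$$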
We only need to know the first two terms of the numerator to evaluate the Hessian, which we do.
For those interested in further explanation, I recommend part 4.4, “The Laplace Approximation,” of Pattern Recognition and Machine Learning by Christopher M. Bishop (1). It helped me a lot to understand the approach.
Laplace approximated likelihood
Finally, the Laplace approximated likelihood to optimize is:
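The original expression was a figure; a reconstruction that matches the loss implemented below is the Bayes formula above with the intractable denominator term replaced by the Gaussian Laplace approximation q:

$$
P(\Sigma \mid Y = y) \;\approx\; \frac{P(Y = y \mid \theta, \Sigma)\, P(\theta \mid \Sigma)\, P(\Sigma)}{q(\theta \mid Y = y, \Sigma)\, P(Y = y)},
\qquad q(\cdot \mid Y = y, \Sigma) = \mathcal{N}\big(\theta^*, H^{-1}\big).
$$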
Once we approximate the density function of θ | Y, Σ, we could finally evaluate the likelihood at any θ we want, if the approximation were accurate everywhere. For the sake of simplicity, and because the approximation is accurate only near the mode, we evaluate the approximate likelihood at θ*.
Below is a function that evaluates this loss for a given (scalar) precision p = 1/σ² (in addition to the observed values, x and y, and the prior design values, α and β).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

from module.bayesian_model import BayesianLogisticRegression


def loss(p, x, y, alpha, beta):
    # computation of the loss for given values:
    # - p: precision, i.e. 1/sigma²
    # - x: matrix of features
    # - y: vector of observations
    # - alpha: prior Gamma distribution alpha parameter over 1/sigma²
    # - beta: prior Gamma distribution beta parameter over 1/sigma²
    n_feat = x.shape[1]
    m_vec = np.array([0] * n_feat)
    p_vec = np.array([p] * n_feat)

    # computation of theta*, the mode of theta | y, sigma
    res = minimize(
        BayesianLogisticRegression()._loss,
        np.array([0] * n_feat),
        args=(x, y, m_vec, p_vec),
        method="BFGS",
        jac=BayesianLogisticRegression()._jac,
    )
    theta_star = res.x

    # computation of the Hessian for the Laplace approximation
    H = BayesianLogisticRegression()._hess(theta_star, x, y, m_vec, p_vec)

    # loss
    out = 0
    ## first two terms: the log loss and the regularization term
    out += BayesianLogisticRegression()._loss(theta_star, x, y, m_vec, p_vec)
    ## third term: prior distribution over sigma, written p here
    out -= gamma.logpdf(p, a=alpha, scale=1 / beta)
    ## fourth term: Laplace approximated last term
    out += 0.5 * np.linalg.slogdet(H)[1] - 0.5 * n_feat * np.log(2 * np.pi)
    return out
In my use case, I have chosen to optimize it with the Adam optimizer, whose code has been taken from this repository.
import numpy as np
from scipy.optimize import OptimizeResult


def adam(
    fun,
    x0,
    jac,
    args=(),
    learning_rate=0.001,
    beta1=0.9,
    beta2=0.999,
    eps=1e-8,
    startiter=0,
    maxiter=1000,
    callback=None,
    **kwargs
):
    """``scipy.optimize.minimize`` compatible implementation of ADAM -
    (http://arxiv.org/pdf/1412.6980.pdf).
    Adapted from ``autograd/misc/optimizers.py``.
    """
    x = x0
    m = np.zeros_like(x)
    v = np.zeros_like(x)

    for i in range(startiter, startiter + maxiter):
        g = jac(x, *args)

        if callback and callback(x):
            break

        m = (1 - beta1) * g + beta1 * m  # first moment estimate.
        v = (1 - beta2) * (g**2) + beta2 * v  # second moment estimate.
        mhat = m / (1 - beta1**(i + 1))  # bias correction.
        vhat = v / (1 - beta2**(i + 1))
        x = x - learning_rate * mhat / (np.sqrt(vhat) + eps)

    i += 1
    return OptimizeResult(x=x, fun=fun(x, *args), jac=g, nit=i, nfev=i, success=True)
For this optimization, we need the derivative of the previous loss. Since we can't get an analytical form, I decided to use a numerical approximation of the derivative.
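A minimal sketch of that numerical derivative, assuming central finite differences; the helper name, the step size and the commented usage are illustrative assumptions on top of the loss and adam functions defined above.

import numpy as np


def numerical_jac(fun, eps=1e-5):
    # build a jac(x, *args) callable approximating the gradient of fun
    # with central finite differences, one coordinate at a time
    def jac(x, *args):
        x = np.asarray(x, dtype=float)
        grad = np.zeros_like(x)
        for i in range(x.size):
            step = np.zeros_like(x)
            step[i] = eps
            grad[i] = (fun(x + step, *args) - fun(x - step, *args)) / (2 * eps)
        return grad
    return jac


# hypothetical usage: optimize the scalar precision p starting from 1.0,
# wrapping loss so that it accepts a 1-element array
# scalar_loss = lambda q, *a: loss(q[0], *a)
# res = adam(scalar_loss, np.array([1.0]), jac=numerical_jac(scalar_loss),
#            args=(x, y, alpha, beta))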
Once the model is trained on the training data set, it is necessary to make predictions on the evaluation data set to evaluate its performance and compare different models. However, it is not possible to directly calculate the exact predictive distribution of a new point, since the computation is intractable.
It is possible to approximate it by plugging in the posterior mode, as sketched below.
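The original expressions were given as figures; my reconstruction, assuming the approximation collapses onto the posterior mode θ* obtained with the optimized precision Σ*, is:

$$
P(Y_{\text{new}} = 1 \mid Y = y) = \int P(Y_{\text{new}} = 1 \mid \theta)\, P(\theta \mid Y = y)\, d\theta \;\approx\; \operatorname{expit}\big(x_{\text{new}}^\top \theta^*\big),
$$

considering:

$$
\theta \mid Y = y, \Sigma^* \;\approx\; \mathcal{N}\big(\theta^*, H^{-1}\big).
$$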
I chose a non-informative prior for the precision random variable. The naive model performs poorly, with a log loss of 0.60 and a ROC AUC of 0.50. The second model performs better, with a log loss of 0.44 and a ROC AUC of 0.83, both when hyper-optimized using grid search and when using Bayesian optimization. This indicates that the logistic regression model, which incorporates the explanatory variables, outperforms the naive model. However, there is no advantage to using Bayesian optimization over grid search, so I would stick with grid search for now. Thanks for reading.
…But wait, I'm thinking. Why are my parameters regularized with the same coefficient? Shouldn't my prior depend on the underlying explanatory variables? Perhaps the parameters of the first explanatory variable could take higher values, while those of the second, with its lower influence, should be closer to zero. Let's explore these new dimensions.
So far we have considered two techniques, grid search and Bayesian optimization. We can use these same techniques in higher dimensions.
Considering new dimensions could dramatically increase the number of nodes of my grid. That's why Bayesian optimization makes sense in higher dimensions for finding the best regularization coefficients. In the considered use case, I assumed there are 3 regularization parameters, one for each independent variable. After one-hot encoding a variable, I assumed that all the generated columns shared the same regularization parameter. Hence a total of 3 regularization parameters, even though there are more than 3 columns as inputs of the logistic regression.
I updated the loss function above with the following code:
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

from module.bayesian_model import BayesianLogisticRegression


def loss(p, x, y, alpha, beta, X_columns, col_to_p_id):
    # computation of the loss for given values:
    # - p: 1/sigma² vector (one precision per explanatory variable)
    # - x: matrix of features
    # - y: vector of observations
    # - alpha: prior Gamma distribution alpha parameter over 1/sigma²
    # - beta: prior Gamma distribution beta parameter over 1/sigma²
    # - X_columns: list of names of x columns
    # - col_to_p_id: dictionary mapping a column name to a p index
    #   because many column names can share the same p value
    n_feat = x.shape[1]
    m_vec = np.array([0] * n_feat)

    # build the per-column precision vector from the per-variable precisions
    p_list = []
    for col in X_columns:
        p_list.append(p[col_to_p_id[col]])
    p_vec = np.array(p_list)

    # computation of theta*, the mode of theta | y, sigma
    res = minimize(
        BayesianLogisticRegression()._loss,
        np.array([0] * n_feat),
        args=(x, y, m_vec, p_vec),
        method="BFGS",
        jac=BayesianLogisticRegression()._jac,
    )
    theta_star = res.x

    # computation of the Hessian for the Laplace approximation
    H = BayesianLogisticRegression()._hess(theta_star, x, y, m_vec, p_vec)

    # loss
    out = 0
    ## first two terms: the log loss and the regularization term
    out += BayesianLogisticRegression()._loss(theta_star, x, y, m_vec, p_vec)
    ## third term: prior distribution over 1/sigma², written p here
    ## there is now a sum as p is now a vector
    out -= np.sum(gamma.logpdf(p, a=alpha, scale=1 / beta))
    ## fourth term: Laplace approximated last term
    out += 0.5 * np.linalg.slogdet(H)[1] - 0.5 * n_feat * np.log(2 * np.pi)
    return out
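As a usage sketch (assuming the get_classification_dataset, adam and loss functions above and the numerical_jac helper sketched earlier; the Gamma parameters and the starting point are illustrative assumptions): the mapping from one-hot columns back to their original variable gives one precision per explanatory variable, and the 3-dimensional precision vector is then optimized jointly.

X, y = get_classification_dataset()
X_columns = list(X.columns)

# map each one-hot column (e.g. "col1_a") back to the index of its original variable
variables = ["col1", "col2", "col3"]
col_to_p_id = {c: variables.index(c.split("_")[0]) for c in X_columns}

alpha, beta = 1.0, 1.0  # hypothetical Gamma prior parameters
args = (X.to_numpy(dtype=float), y.to_numpy(), alpha, beta, X_columns, col_to_p_id)

# optimize the 3 precisions jointly, starting from 1.0 each
res = adam(loss, np.array([1.0, 1.0, 1.0]), jac=numerical_jac(loss), args=args)
print(res.x)  # one optimized precision per explanatory variable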
Using this approach, the metrics evaluated on the test data set are a log loss of 0.39 and a ROC AUC of 0.88, which are better than those of the initial model, whether it was optimized using grid search or using Bayesian optimization with a single prior for all explanatory variables.
The use case can be reproduced with this notebook.
I have created this example to illustrate the usefulness of the technique. However, I have not been able to find a suitable real-world data set to fully demonstrate its potential. While working with a real data set, I could not get any significant benefit from applying this technique. If you come across one, please let me know: I would love to see a real-world application of this regularization method.
In conclusion, using Bayesian optimization (with Laplace approximation if necessary) to determine the best regularization parameters can be a good alternative to traditional hyperparameter tuning methods. By leveraging probabilistic models, Bayesian optimization not only reduces computational cost but also improves the probability of finding optimal regularization values, especially in high dimensions.
- (1) Christopher M. Bishop. (2006). Pattern Recognition and Machine Learning. Springer.
- (2) Bayesian Ridge Regression, scikit-learn User Guide: https://scikit-learn.org/1.5/modules/linear_model.html#bayesian-ridge-regression