The first thing to notice is that even though there’s no explicit regularisation, the learned boundaries are relatively smooth. For example, in the top left the sampling happened to be sparse, yet both models prefer to cut off one tip of the star rather than predicting a more complex shape around the individual points. This is an important reminder that many architectural decisions act as implicit regularisers.
From our analysis we would expect focal loss to predict complicated boundaries in areas of natural complexity. Ideally, this would be an advantage of using the focal loss. But if we inspect one of the areas of natural complexity we see that both models fail to identify that there is an additional shape inside the circles.
In regions of sparse data (dead zones) we would expect focal loss to create more complex boundaries. This isn’t necessarily desirable. If the model hasn’t learned any of the underlying patterns of the data then there are infinitely many ways to draw a boundary around sparse points. Here we can contrast two sparse areas and notice that focal loss has predicted a more complex boundary than the cross entropy:
The top row is from the central star and we can see that the focal loss has learned more about the pattern. The predicted boundary in the sparse region is more complex but also more correct. The bottom row is from the lower right corner and we can see that the predicted boundary is more complicated but it hasn’t learned a pattern about the shape. The smooth boundary predicted by BCE might be more desirable than the strange shape predicted by focal loss.
This qualitative analysis doesn’t settle which model is better. How can we quantify it? The two loss functions produce values that can’t be compared directly, so instead we’ll compare the accuracy of predictions. We’ll use a standard F1 score, noting that different risk profiles might put extra weight on recall or precision.
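As a concrete illustration (not the exact code used for these experiments), scikit-learn’s f1_score and fbeta_score make it easy to compute the standard F1 or to weight recall and precision differently. The toy labels below are placeholders:

```python
from sklearn.metrics import f1_score, fbeta_score

# Toy placeholder labels and predictions, purely for illustration.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(f1_score(y_true, y_pred))               # standard F1 (balanced)
# Different risk profiles can put extra weight on recall (beta > 1)
# or on precision (beta < 1):
print(fbeta_score(y_true, y_pred, beta=2.0))  # recall-weighted
print(fbeta_score(y_true, y_pred, beta=0.5))  # precision-weighted
```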
To assess generalisation capability we use a validation set that’s iid with our training sample. We can also use early stopping to prevent both approaches from overfitting. If we compare the validation F1 scores of the two models we see a slight boost using focal loss over binary cross entropy:
- BCE Loss: 0.936 (Validation F1)
- Focal Loss: 0.954 (Validation F1)
So it seems that the model trained with focal loss performs slightly better when applied on unseen data. So far, so good, right?
The trouble with iid generalisation
In the standard definition of generalisation, future observations are assumed to be iid with our training distribution. But this won’t help if we want our model to learn an effective representation of the underlying process that generated the data. In this example that process involves the shapes and the symmetries that determine the decision boundary. If our model has an internal representation of those shapes and symmetries then it should perform equally well in those sparsely sampled “dead zones”.
Neither model will ever work OOD because they’ve only seen data from one distribution and cannot generalise. And it would be unfair to expect otherwise. However, we can focus on robustness in the sparse sampling regions. In the paper Machine Learning Robustness: A Primer, they mostly talk about samples from the tail of the distribution which is something we saw in our house prices models. But here we have a situation where sampling is sparse but it has nothing to do with an explicit “tail”. I will continue to refer to this as an “endogenous sampling bias” to highlight that tails are not explicitly required for sparsity.
In this view of robustness the endogenous sampling bias is one possibility where models may not generalise. For more powerful models we can also explore OOD and adversarial data. Consider an image model which is trained to recognise objects in urban areas but fails to work in a jungle. That would be a situation where we would expect a powerful enough model to work OOD. Adversarial examples, on the other hand, involve adding noise to an image that changes the statistical distribution of colours in a way that’s imperceptible to humans but causes misclassification by a non-robust model. Building models that resist adversarial and OOD perturbations is out of scope for this already long article.
Robustness to perturbation
So how do we quantify this robustness? We’ll start with an accuracy function A (we previously used the F1 score). Then we consider a perturbation function φ which we can apply either to individual points or to an entire dataset. Note that this perturbation function should preserve the relationship between the predictor x and the target y (i.e. we are not purposely mislabelling examples).
For a model designed to predict house prices in any city, an OOD perturbation might involve finding samples from cities not in the training data. In our example we’ll focus on a modified version of the dataset which samples exclusively from the sparse regions.
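To make the idea concrete, here’s a minimal sketch of what φ could look like for our toy problem. The dead-zone coordinates are made-up placeholders; the important part is that the labels are never touched, so the x → y relationship is preserved:

```python
import numpy as np

# Hypothetical (x range, y range) boxes marking the sparse "dead zones".
dead_zones = [((0.1, 0.3), (0.7, 0.9)),
              ((0.6, 0.8), (0.1, 0.3))]

def phi(X, y):
    """Perturbation: keep only examples that fall inside a dead zone.

    Labels are passed through untouched, so the relationship between
    predictor x and target y is preserved.
    """
    mask = np.zeros(len(X), dtype=bool)
    for (x_lo, x_hi), (y_lo, y_hi) in dead_zones:
        mask |= ((X[:, 0] >= x_lo) & (X[:, 0] <= x_hi) &
                 (X[:, 1] >= y_lo) & (X[:, 1] <= y_hi))
    return X[mask], y[mask]
```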
The robustness score (R) of a model (h) is a measure of how well the model performs on a perturbed dataset compared to a clean dataset. Here we take the ratio of the two accuracies, R_φ(h) = A(h, D_φ) / A(h, D), so a score close to 1 means performance is preserved under the perturbation.
Consider the two models trained to predict a decision boundary: one trained with focal loss and one with binary cross entropy. Focal loss performed slightly better on the validation set which was iid with the training data. Yet we used that dataset for early stopping so there is some subtle information leakage. Let’s compare results on:
- A validation set iid to our training set and used for early stopping.
- A test set iid to our training set.
- A perturbed (φ) test set where we only sample from the sparse regions I’ve called “dead zones”.
| Loss Type | Val (iid) F1 | Test (iid) F1 | Test (φ) F1 | R(φ) |
|------------|---------------|-----------------|-------------|---------|
| BCE Loss | 0.936 | 0.959 | 0.834 | 0.869 |
| Focal Loss | 0.954 | 0.941 | 0.822 | 0.874 |
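A small sketch of how the last two columns fit together, assuming we already have hard predictions for the clean and perturbed test sets (the variable names here are illustrative):

```python
from sklearn.metrics import f1_score

def robustness_score(y_test, pred_test, y_test_phi, pred_test_phi):
    """R(phi) = accuracy on the perturbed test set / accuracy on the clean test set."""
    a_clean = f1_score(y_test, pred_test)              # A(h, D), e.g. 0.959 for BCE
    a_perturbed = f1_score(y_test_phi, pred_test_phi)  # A(h, D_phi), e.g. 0.834 for BCE
    return a_perturbed / a_clean                       # e.g. 0.834 / 0.959 ≈ 0.869
```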
The standard bias-variance decomposition suggested that we might get more robust results with focal loss by allowing increased complexity on hard examples. We knew that this might not be ideal in all circumstances so we evaluated on a validation set to confirm. So far so good. But now that we look at the performance on a perturbed test set we can see that focal loss performed slightly worse! Yet we also see that focal loss has a slightly higher robustness score. So what is going on here?
I ran this experiment several times, each time yielding slightly different results. This was one surprising instance I wanted to highlight. The bias-variance decomposition is about how our model will perform in expectation (across different possible worlds). By contrast, this robustness approach tells us how these specific models perform under perturbation. But we may need further considerations for model selection.
There are a lot of subtle lessons in these results:
- If we make significant decisions on our validation set (e.g. early stopping) then it becomes vital to have a separate test set.
- Even when training on the same dataset we can get varied results. When training neural networks there are multiple sources of randomness to consider, which will become important in the last part of this article.
- A weaker model may be more robust to perturbations. So model selection needs to consider more than just the robustness score.
- We may need to evaluate models on multiple perturbations to make informed decisions.
Comparing approaches to robustness
In one approach to robustness we consider the impact of hyperparameters on model performance through the lens of the bias-variance trade-off. We can use this knowledge to understand how different kinds of training examples affect our training process. For example, we know that mislabelled data is particularly bad to use with focal loss. We can consider whether particularly hard examples could be excluded from our training data to produce more robust models. And we can better understand the role of regularisation by considering the types of hyperparameters and how they impact bias and variance.
The other perspective largely disregards the bias-variance trade-off and focuses on how our model performs on perturbed inputs. For us this meant focusing on sparsely sampled regions, but it may also include out of distribution (OOD) and adversarial data. One drawback to this approach is that it is evaluative and doesn’t necessarily tell us how to construct better models, short of training on more (and more varied) data. A more significant drawback is that weaker models may exhibit more robustness, so we can’t use the robustness score alone for model selection.
Regularisation and robustness
If we take the standard model trained with cross entropy loss we can plot the performance on different metrics over time: training loss, validation loss, validation_φ loss, validation accuracy, and validation_φ accuracy. We can then compare the training process under different kinds of regularisation to see how it affects generalisation capability.
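The sketch below shows the shape of the training loop behind these curves. It assumes tensors X_train, y_train, X_val, y_val, X_val_phi, y_val_phi already exist (the _phi set is the perturbed validation data), and the architecture and hyperparameters are illustrative rather than the exact ones used:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(2, 64), nn.ReLU(), nn.Dropout(p=0.2),   # dropout as one regulariser
    nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(64, 1),
)
# weight_decay adds L2 regularisation on top of (or instead of) dropout.
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.BCEWithLogitsLoss()

def evaluate(X, y):
    model.eval()
    with torch.no_grad():
        logits = model(X).squeeze(-1)
        return loss_fn(logits, y).item(), ((logits > 0).float() == y).float().mean().item()

history = []
for epoch in range(500):
    model.train()
    optimiser.zero_grad()
    train_loss = loss_fn(model(X_train).squeeze(-1), y_train)
    train_loss.backward()
    optimiser.step()
    val_loss, val_acc = evaluate(X_val, y_val)
    val_phi_loss, val_phi_acc = evaluate(X_val_phi, y_val_phi)
    history.append((train_loss.item(), val_loss, val_phi_loss, val_acc, val_phi_acc))
```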
In this particular problem we can make some unusual observations:
- As we would expect, without regularisation the validation loss starts to increase as the training loss tends towards 0.
- The validation_φ loss increases much more significantly because it only contains examples from the sparse “dead zones”.
- But the validation accuracy doesn’t actually get worse as the validation loss increases. What is going on here? This is something I’ve seen in real datasets: the model’s accuracy improves but it also becomes increasingly confident in its outputs, so when it is wrong the loss is very high. The model’s predicted probabilities become useless as they all tend towards 99.99% regardless of how well the model does (see the short numerical sketch after this list).
- Adding regularisation prevents the validation losses from blowing up as the training loss cannot go to 0. However, it can also negatively impact the validation accuracy.
- Adding dropout and weight decay is better than just dropout, but both are worse than using no regularisation in terms of accuracy.
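Here is the numerical sketch promised above: the cross entropy loss for a single wrongly classified example grows rapidly as the model’s (misplaced) confidence approaches 1, even though that example’s contribution to accuracy is unchanged.

```python
import math

# Loss of a single wrong prediction as confidence in the wrong class grows.
for p_wrong in [0.9, 0.99, 0.9999]:
    print(f"confidence {p_wrong:>7} -> loss {-math.log(1 - p_wrong):.2f}")
# confidence     0.9 -> loss 2.30
# confidence    0.99 -> loss 4.61
# confidence  0.9999 -> loss 9.21
```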
Reflection
If you’ve stuck with me this far into the article I hope you’ve developed an appreciation for the limitations of the bias-variance trade-off. It will always be useful to have an understanding of the typical relationship between model complexity and expected performance. But we’ve seen some interesting observations that challenge the default assumptions:
- Model complexity can change in different parts of the feature space. Hence, a single measure of complexity vs bias/variance doesn’t always capture the whole story.
- The standard measures of generalisation error don’t capture all types of generalisation, particularly lacking in robustness under perturbation.
- Parts of our training sample can be harder to learn from than others and there are multiple ways in which a training example can be considered “hard”. Complexity might be necessary in naturally complex regions of the feature space but problematic in sparse areas. This sparsity can be driven by endogenous sampling bias and so comparing performance to an iid test set can give false impressions.
- As always we need to factor in risk and risk minimisation. If you expect all future inputs to be iid with the training data it would be detrimental to focus on sparse regions or OOD data. Especially if tail risks don’t carry major consequences. On the other hand we’ve seen that tail risks can have unique consequences so it’s important to construct an appropriate test set for your particular problem.
- Simply testing a model’s robustness to perturbations isn’t sufficient for model selection. A decision about the generalisation capability of a model can only be done under a proper risk assessment.
- The bias-variance trade-off only concerns the expected loss for models averaged over possible worlds. It doesn’t necessarily tell us how accurate our model will be using hard classification boundaries. This can lead to counter-intuitive results.
Let’s review some of the assumptions that were key to our bias-variance decomposition:
- At low complexity the total error is dominated by bias, while at high complexity it is dominated by variance, with bias ≫ variance at the minimum complexity.
- As a function of complexity bias is monotonically decreasing and variance is monotonically increasing.
- The complexity function g is differentiable.
It turns out that with sufficiently deep neural networks those first two assumptions are incorrect. And that last assumption may just be a convenient fiction to simplify some calculations. We won’t question that one but we’ll be taking a look at the first two.
Let’s briefly review what it means to overfit:
- A model overfits when it fails to distinguish noise (aleatoric uncertainty) from intrinsic variation. This means that a trained model may behave wildly differently given different training data with different noise (i.e. variance).
- We notice a model has overfit when it fails to generalise to an unseen test set. This typically means performance on test data that’s iid with the training data. We may focus on different measures of robustness and so craft a test set which is OOS, stratified, OOD, or adversarial.
We’ve so far assumed that the only way to get truly low bias is if a model is overly complex. And we’ve assumed that this complexity leads to high variance between models trained on different data. We’ve also established that many hyperparameters contribute to complexity including the number of epochs of stochastic gradient descent.
Overparameterisation and memorisation
You may have heard that a large neural network can simply memorise the training data. But what does that mean? Given sufficient parameters the model doesn’t need to learn the relationships between features and outputs. Instead it can store a function which responds perfectly to the features of every training example completely independently. It would be like writing an explicit if statement for every combination of features and simply producing the average output for those features. Consider our decision boundary dataset where every example is completely separable: that would mean 100% accuracy for everything in the training set.
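A deliberately crude sketch of what memorisation means: a hypothetical lookup-table “model” that stores every training example verbatim, scores 100% on the training set, and learns nothing about the underlying shapes.

```python
import numpy as np

class LookupMemoriser:
    """Hypothetical 'model' that memorises training examples with no generalisation."""

    def fit(self, X, y):
        # One entry per training example: exact features -> exact label.
        self.table = {tuple(x): label for x, label in zip(X, y)}
        return self

    def predict(self, X):
        # Perfect recall for seen points; an arbitrary default (0) for anything unseen.
        return np.array([self.table.get(tuple(x), 0) for x in X])

rng = np.random.default_rng(0)
X_train = rng.random((100, 2))                 # 2D points, like our decision boundary data
y_train = (X_train[:, 0] > 0.5).astype(int)    # a stand-in labelling rule

model = LookupMemoriser().fit(X_train, y_train)
print((model.predict(X_train) == y_train).mean())  # 1.0 -- 100% training accuracy
```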
If a model has sufficient parameters then the gradient descent algorithm will naturally use all of that space to do such memorisation. In general it’s believed that this is much simpler than finding the underlying relationship between the features and the target values. This is typically considered to be the case when p ≫ N (the number of trainable parameters is significantly larger than the number of training examples).
But there are 2 situations where a model can learn to generalise despite having memorised training data:
- Having too few parameters leads to weak models. Adding more parameters leads to a seemingly optimal level of complexity. Continuing to add parameters makes the model perform worse as it starts to fit to noise in the training data. Once the number of parameters exceeds the number of training examples the model may start to perform better. Once p ≫ N the model reaches another optimal point.
- Train a model until the training and validation losses begin to diverge. The training loss tends towards 0 as the model memorises the training data but the validation loss blows up and reaches a peak. After some (extended) training time the validation loss starts to decrease.
This is known as the “double descent” phenomenon, where additional complexity actually leads to better generalisation.
Does double descent require mislabelling?
One general consensus is that label noise is sufficient but not necessary for double descent to occur. For example, the paper Unraveling the Enigma of Double Descent found that overparameterised networks will learn to assign the mislabelled class to points in the training data instead of learning to ignore the noise. However, a model may “isolate” these points and learn general features around them. The paper mainly focuses on the learned features within the hidden states of neural networks and shows that poor separability of those learned features can act like label noise even when no labels are mislabelled.
The paper Double Descent Demystified describes several necessary conditions for double descent to occur in generalised linear models. These criteria largely focus on variance within the data (as opposed to model variance) which makes it difficult for a model to correctly learn the relationships between predictor and target variables. Any of these conditions can contribute to double descent:
- The presence of small singular values in the training features.
- That the test set distribution is not effectively captured by features which account for the most variance in the training data.
- A lack of variance for a perfectly fit model (i.e. a perfectly fit model seems to have no aleatoric uncertainty).
This paper also demonstrates the double descent phenomenon on a toy linear regression problem.
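Here is a minimal sketch in a similar spirit (not the paper’s exact setup): minimum-norm linear regression, where test error spikes near the interpolation threshold (number of features ≈ number of training examples) and falls again beyond it. All dimensions and noise levels are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d_total, noise = 20, 1000, 100, 0.5

w_true = rng.normal(size=d_total)
X_train = rng.normal(size=(n_train, d_total))
X_test = rng.normal(size=(n_test, d_total))
y_train = X_train @ w_true + noise * rng.normal(size=n_train)
y_test = X_test @ w_true + noise * rng.normal(size=n_test)

for p in [5, 10, 15, 20, 25, 40, 100]:                  # number of features given to the model
    w_hat = np.linalg.pinv(X_train[:, :p]) @ y_train    # minimum-norm least squares fit
    test_mse = np.mean((X_test[:, :p] @ w_hat - y_test) ** 2)
    print(f"p = {p:3d}   test MSE = {test_mse:10.2f}")
# Expect the test MSE to peak around p == n_train and then decrease again.
```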
By contrast the paper Understanding Double Descent Requires a Fine-Grained Bias-Variance Decomposition gives a detailed mathematical breakdown of different sources of noise and their impact on variance:
- Sampling — the general idea that fitting a model to different datasets leads to models with different predictions (V_D).
- Optimisation — the effects of parameter initialisation but potentially also the nature of stochastic gradient descent (V_P).
- Label noise — generally mislabelled examples (V_ϵ).
- The potential interactions between the 3 sources of variance.
The paper goes on to show that some of these variance terms actually contribute to the total error as part of a model’s bias. Additionally, conditioning the expectation first on V_D or first on V_P leads to different conclusions depending on the order of the calculation. A proper decomposition involves understanding how the total variance comes together from the interactions between the three sources of variance. The conclusion is that while label noise exacerbates double descent it is not necessary.
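Schematically, and using this article’s notation rather than the paper’s, the symmetric decomposition has the following structure (precise definitions of each term are in the paper; this is only meant to show how the interaction terms enter):

```latex
\mathrm{Var} \;=\; V_D + V_P + V_\epsilon
          \;+\; V_{DP} + V_{D\epsilon} + V_{P\epsilon}
          \;+\; V_{DP\epsilon}
```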
Regularisation and double descent
Another consensus from these papers is that regularisation may prevent double descent. But as we saw in the previous section, that does not necessarily mean the regularised model will generalise better to unseen data. It seems more that regularisation acts as a floor on the training loss, preventing the model from driving the training loss arbitrarily low. But as we know from the bias-variance trade-off, that could limit complexity and introduce bias to our models.
Reflection
Double descent is an interesting phenomenon that challenges many of the assumptions used throughout this article. We can see that under the right circumstances increasing complexity doesn’t necessarily degrade a model’s ability to generalise.
Should we think of highly complex models as special cases, or do they call into question the entire bias-variance trade-off? Personally, I think the core assumptions hold true in most cases and that highly complex models are a special case: the trade-off has other weaknesses, but its core assumptions tend to be valid.
The bias-variance trade-off is relatively straightforward when it comes to statistical inference and more typical statistical models. I didn’t go into other machine learning methods like decision trees or support vector machines, but much of what we’ve discussed continues to apply there. Even in these settings we need to consider more factors than how well our model may perform if averaged over all possible worlds, mainly because we’re comparing the performance against future data assumed to be iid with our training set.
Even if our model will only ever see data that looks like our training distribution we can still face large consequences with tail risks. Most [machine learning projects](https://medium.com/management-matters/managing-risks-in-deploying-generative-ai-393254259497) need a proper risk assessment to understand the consequences of mistakes. Instead of evaluating models under iid assumptions we should be constructing validation and test sets which fit into an appropriate risk framework.
Additionally, models which are supposed to have general capabilities need to be evaluated on OOD data. Models which perform critical functions need to be evaluated adversarially. It’s also worth pointing out that the bias-variance trade-off isn’t necessarily valid in the setting of reinforcement learning. Consider [the alignment problem in AI safety](https://medium.com/towards-data-science/exploring-the-ai-alignment-problem-with-gridworlds-2683f2f5af38), which concerns model performance beyond explicitly stated objectives.
We’ve also seen that in the case of large overparameterised models the standard assumptions about over- and underfitting simply don’t hold. The double descent phenomenon is complex and still poorly understood. Yet it holds an important lesson about trusting the validity of strongly held assumptions.
For those who’ve continued this far I want to make one last connection between the different sections of this article. In the section on inferential statistics I explained that Fisher information describes the amount of information a sample can contain about the distribution the sample was drawn from. In various parts of this article I’ve also mentioned that there are infinitely many ways to draw a decision boundary around sparsely sampled points. There’s an interesting question about whether there’s enough information in a sample to draw conclusions about sparse regions.
In my article on why scaling works I talk about the concept of an inductive prior. This is something introduced by the training process or model architecture we’ve chosen. These inductive priors bias the model into making certain kinds of inferences. For example, regularisation might encourage the model to make smooth rather than jagged boundaries. With a different kind of inductive prior it’s possible for a model to glean more information from a sample than would be possible with weaker priors. For example, there are ways to encourage symmetry, translation invariance, and even detecting repeated patterns. These are normally applied through feature engineering or through architecture decisions like convolutions or the attention mechanism.
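As a small example of an architectural inductive prior, a convolution is translation-equivariant: shifting the input simply shifts the output. With circular padding this holds exactly, as the sketch below (with arbitrary layer sizes) demonstrates:

```python
import torch
import torch.nn as nn

conv = nn.Conv1d(1, 1, kernel_size=3, padding=1, padding_mode="circular", bias=False)

x = torch.randn(1, 1, 16)                       # a random 1D "signal"
shift = lambda t: torch.roll(t, shifts=4, dims=-1)

# Convolving then shifting gives the same result as shifting then convolving.
print(torch.allclose(shift(conv(x)), conv(shift(x)), atol=1e-6))  # True
```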
I first started putting together the notes for this article over a year ago. I had one experiment where focal loss was vital for getting decent performance from my model. Then I had several experiments in a row where focal loss performed terribly for no apparent reason. I started digging into the bias-variance trade-off which led me down a rabbit hole. Eventually I learned more about double descent and realised that the bias-variance trade-off had a lot more nuance than I’d previously believed. In that time I read and annotated several papers on the topic and all my notes were just collecting digital dust.
Recently I realised that over the years I’ve read a lot of terrible articles on the bias-variance trade-off. The idea I felt was missing is that we are calculating an expectation over “possible worlds”. That insight might not resonate with everyone but it seems vital to me.
I also want to comment on a popular visualisation about bias vs variance which uses archery shots spread around a target. I feel that this visual is misleading because it makes it seem that bias and variance are about individual predictions of a single model. Yet the math behind the bias-variance error decomposition is clearly about performance averaged across possible worlds. I’ve purposely avoided that visualisation for that reason.
I’m not sure how many people will make it all the way through to the end. I put these notes together long before I started writing about AI and felt that I should put them to good use. I also just needed to get the ideas out of my head and written down. So if you’ve reached the end I hope you’ve found my observations insightful.
(1) “German tank problem,” Wikipedia, Nov. 26, 2021. https://en.wikipedia.org/wiki/German_tank_problem
(2) Wikipedia Contributors, “Minimum-variance unbiased estimator,” Wikipedia, Nov. 09, 2019. https://en.wikipedia.org/wiki/Minimum-variance_unbiased_estimator
(3) “Likelihood function,” Wikipedia, Nov. 26, 2020. https://en.wikipedia.org/wiki/Likelihood_function
(4) “Fisher information,” Wikipedia, Nov. 23, 2023. https://en.wikipedia.org/wiki/Fisher_information
(5) Why, “Why is using squared error the standard when absolute error is more relevant to most problems?,” Cross Validated, Jun. 05, 2020. https://stats.stackexchange.com/questions/470626/w (accessed Nov. 26, 2024).
(6) Wikipedia Contributors, “Bias–variance tradeoff,” Wikipedia, Feb. 04, 2020. https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
(7) B. Efron, “Prediction, Estimation, and Attribution,” International Statistical Review, vol. 88, no. S1, Dec. 2020, doi: https://doi.org/10.1111/insr.12409.
(8) T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning. Springer, 2009.
(9) T. Dzekman, “Medium,” Medium, 2024. https://medium.com/towards-data-science/why-scalin (accessed Nov. 26, 2024).
(10) H. Braiek and F. Khomh, “Machine Learning Robustness: A Primer,” 2024. Available: https://arxiv.org/pdf/2404.00897
(11) O. Wu, W. Zhu, Y. Deng, H. Zhang, and Q. Hou, “A Mathematical Foundation for Robust Machine Learning based on Bias-Variance Trade-off,” arXiv.org, 2021. https://arxiv.org/abs/2106.05522v4 (accessed Nov. 26, 2024).
(12) “bias_variance_decomp: Bias-variance decomposition for classification and regression losses — mlxtend,” rasbt.github.io. https://rasbt.github.io/mlxtend/user_guide/evaluate/bias_variance_decomp
(13) T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal Loss for Dense Object Detection,” arXiv:1708.02002 (cs), Feb. 2018, Available: https://arxiv.org/abs/1708.02002
(14) Y. Gu, X. Zheng, and T. Aste, “Unraveling the Enigma of Double Descent: An In-depth Analysis through the Lens of Learned Feature Space,” arXiv.org, 2023. https://arxiv.org/abs/2310.13572 (accessed Nov. 26, 2024).
(15) R. Schaeffer et al., “Double Descent Demystified: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle,” arXiv.org, 2023. https://arxiv.org/abs/2303.14151 (accessed Nov. 26, 2024).
(16) B. Adlam and J. Pennington, “Understanding Double Descent Requires a Fine-Grained Bias-Variance Decomposition,” Neural Information Processing Systems, vol. 33, pp. 11022–11032, Jan. 2020.