Efficiently training deep learning models is challenging, and the problem has become harder with the recent growth in the size and architectural complexity of NLP models. To handle billions of parameters, many optimization techniques have been proposed to achieve faster convergence and more stable training. One of the most notable is normalization.
In this article, we will look at several normalization techniques, how they work, and how they can be applied to deep NLP models.
BatchNorm [2] is an early normalization technique, proposed to address the internal covariate shift problem.
In simple terms, an internal covariate shift occurs when the distribution of a layer's input data changes during training. When a neural network is forced to keep fitting different data distributions, the gradient updates change dramatically from batch to batch. As a result, the model takes longer to adjust, learn the correct weights, and converge. The problem gets worse as the model size grows.
Initial workarounds included using a small learning rate (so the impact of the shifting distribution stays minor) and careful weight initialization. BatchNorm solved the problem more effectively by normalizing each input feature using the mean and variance computed across the batch.
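To make this concrete, here is a minimal NumPy sketch of the training-time computation. It omits the running mean and variance that a real implementation keeps for inference, and the names (`batch_norm`, `gamma`, `beta`) are illustrative rather than taken from any particular library.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature using statistics computed across the batch.

    x: array of shape (batch_size, num_features)
    gamma, beta: learnable scale and shift, each of shape (num_features,)
    """
    mean = x.mean(axis=0)   # per-feature mean over the batch dimension
    var = x.var(axis=0)     # per-feature variance over the batch dimension
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Toy batch: 4 samples, 3 features with very different scales
x = np.array([[1.0, 200.0, -3.0],
              [2.0, 180.0, -1.0],
              [3.0, 220.0,  0.0],
              [4.0, 190.0,  2.0]])
out = batch_norm(x, gamma=np.ones(3), beta=np.zeros(3))
print(out.mean(axis=0))  # ~0 for every feature
print(out.std(axis=0))   # ~1 for every feature
```

Note that the statistics are computed along `axis=0`, the batch dimension. This is what ties each sample's normalization to the rest of its batch, and it is exactly the source of the drawbacks discussed below.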
The technique speeds up convergence significantly and allows a higher learning rate, as the model becomes less sensitive to outliers. However, it still has some drawbacks:
- Small batch size: BatchNorm relies on the batch to estimate each feature's mean and variance. When the batch size is small, these estimates no longer represent the population, so training becomes unstable; in the extreme case of online learning, BatchNorm is not usable at all (see the sketch after this list).
- Sequence input: In BatchNorm, each sample's normalization depends on the other samples in the same batch. This does not work well with sequence data. For example…
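To illustrate the small-batch-size drawback from the first bullet, the toy sketch below compares per-feature statistics estimated from batches of different sizes against a synthetic "population" (the distribution and batch sizes here are made up purely for the demonstration).

```python
import numpy as np

rng = np.random.default_rng(0)
# Pretend this is the true distribution of a single feature across the dataset
population = rng.normal(loc=5.0, scale=2.0, size=(10_000, 1))

for batch_size in (2, 8, 256):
    idx = rng.choice(len(population), size=batch_size, replace=False)
    batch = population[idx]
    print(batch_size, batch.mean(), batch.std())
# Tiny batches give noisy mean/std estimates, so the normalized activations
# (and the running statistics later used at inference time) become unreliable.
```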