Deep learning has taken a giant step in the world of artificial intelligence. Currently, neural networks outperform other types of algorithms on non-tabular data: images, videos, audio, etc. Deep learning models are typically highly complex, containing millions or even billions of trainable parameters. That is why it is essential in the modern era to use acceleration techniques to reduce training time.
One of the most common algorithms performed during training is backpropagation, which consists of updating the weights of a neural network with respect to a given loss function. Backpropagation is usually paired with gradient descent, which tries to converge the loss function to a local minimum step by step.
It turns out that naive gradient descent is usually not a preferable option for training a deep network due to its slow convergence rate. This became a motivation for researchers to develop optimization algorithms that accelerate gradient descent.
Before reading this article, it is highly recommended that you are familiar with the exponential moving average concept used in optimization algorithms. Otherwise, you can refer to the following article.
Gradient descent is the simplest optimization algorithm that calculates the gradients of the loss function with respect to the model weights and updates them using the following formula:
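In the notation used throughout this article (dw for the gradient of the loss with respect to a weight w, α for the learning rate), the standard update can be sketched as:

$$w_t = w_{t-1} - \alpha \cdot dw$$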
To understand why gradient descent converges slowly, let's look at the following example of a ravine where a given function of two variables must be minimized.
A ravine is an area where the surface is much steeper in one dimension than in another.
From the image, we can see that the starting point and the local minimum have different horizontal coordinates but almost the same vertical coordinate. Using gradient descent to find the local minimum will likely cause the loss to bounce back and forth along the vertical axis while advancing only slowly in the horizontal direction. These bounces occur because gradient descent does not store any history of its previous gradients, so the step taken at each iteration depends only on the current gradient. This example generalizes to a larger number of dimensions.
Consequently, it would be risky to use a high learning rate, as it could cause the algorithm to diverge.
Based on the example above, it would be desirable for the optimizer to take larger steps in the horizontal direction and smaller steps in the vertical direction. In this way, convergence would be much faster. This is exactly the effect achieved by Momentum.
Momentum uses a couple of equations in each iteration:
The first formula computes an exponential moving average of the gradient values dw. Basically, this is done to store trend information about the set of previous gradient values. The second equation performs the normal gradient descent update using the moving average value calculated in the current iteration. α is the learning rate of the algorithm.
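As a minimal NumPy sketch of these two equations (the function name and the toy quadratic loss below are illustrative, not from the original article):

```python
import numpy as np

def momentum_step(w, dw, v, lr=0.01, beta=0.9):
    """One Momentum update: EMA of the gradients, then a gradient descent step."""
    v = beta * v + (1 - beta) * dw   # exponential moving average of dw
    w = w - lr * v                   # update the weights along the averaged direction
    return w, v

# Toy usage: minimize f(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([5.0, -3.0])
v = np.zeros_like(w)
for _ in range(1000):
    dw = w                           # gradient of the toy loss
    w, v = momentum_step(w, dw, v)
print(w)                             # close to the minimum at [0, 0]
```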
Momentum can be particularly useful in cases like the one above. Imagine we have computed gradients at each iteration, as in the image above. Instead of simply using them to update the weights, we take several past values and perform the update in the averaged direction.
Sebastian Ruder concisely describes the effect of Momentum in his paper: "The momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions. As a result, we gain faster convergence and reduced oscillation."
As a result, the updates made by Momentum might look like the figure below.
In practice, Momentum typically converges much faster than gradient descent. With Momentum, there is also less risk when using higher learning rates, which speeds up the training process.
In Momentum, it is recommended to choose β close to 0.9.
AdaGrad is another optimizer whose goal is to adapt the learning rate to the calculated gradient values. Situations may occur where, during training, one component of the weight vector has very large gradient values while another has extremely small ones. This happens especially with rare model parameters that appear to have little influence on the predictions. It is worth noting that these problems do not usually occur with frequent parameters, since the model uses many prediction signals to update them. Because a lot of signal information is taken into account for the gradient calculation, the gradients are usually adequate and point in a correct direction towards the local minimum. However, this is not the case for rare parameters, which can lead to extremely large and unstable gradients. The same problem can occur with sparse data, where there is very little information about certain features.
AdaGrad addresses the aforementioned problem by independently adapting the learning rate for each weight component. If the gradients corresponding to a certain component of the weight vector are large, then the respective learning rate will be small. Conversely, for smaller gradients, the learning rate will be higher. In this way, AdaGrad deals with vanishing and exploding gradient problems.
Under the hood, AdaGrad accumulates the element-wise squares dw² of the gradients from all previous iterations. During the weight update, instead of using the normal learning rate α, AdaGrad scales it by dividing α by the square root of the accumulated squared gradients, √vₜ. Additionally, a small positive term ε is added to the denominator to avoid possible division by zero.
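A minimal NumPy sketch of this accumulation and scaling, in the same dw/α notation (the helper name is illustrative):

```python
import numpy as np

def adagrad_step(w, dw, v, lr=0.01, eps=1e-8):
    """One AdaGrad update: accumulate squared gradients, then scale the step."""
    v = v + dw ** 2                       # running sum of element-wise squared gradients
    w = w - lr * dw / (np.sqrt(v) + eps)  # per-component adaptive learning rate
    return w, v
```

Since v only grows, the effective step size lr / (√v + ε) can only shrink over time, which is exactly the drawback discussed next.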
The biggest advantage of AdaGrad is that you no longer need to manually tune the learning rate, as it adapts during training. However, there is a downside: the learning rate constantly decreases as iterations increase (it is always divided by a positive, growing cumulative sum). Therefore, the algorithm tends to converge slowly during the last iterations, where the learning rate becomes very low.
RMSProp was crafted as an improvement over AdaGrad that addresses the issue of the ever-decreasing learning rate. Similar to AdaGrad, RMSProp uses a pair of equations in which the weight update rule is exactly the same.
However, instead of storing a cumulative sum of squared gradients dw² for vₜ, the exponential moving average is calculated for squared gradients dw². Experiments show that RMSProp generally converges faster than AdaGrad because, with exponential moving average, it puts more emphasis on recent gradient values rather than equally distributing importance across all gradients by simply accumulating them from the first iteration. Furthermore, compared to AdaGrad, the learning rate in RMSProp does not always decrease with increasing iterations, allowing it to be better adapted to particular situations.
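Under the same assumptions as the sketches above, the only change from AdaGrad is how v is maintained:

```python
import numpy as np

def rmsprop_step(w, dw, v, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update: EMA of squared gradients instead of a running sum."""
    v = beta * v + (1 - beta) * dw ** 2   # exponential moving average of dw^2
    w = w - lr * dw / (np.sqrt(v) + eps)  # same weight update as in AdaGrad
    return w, v
```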
In RMSProp, it is recommended to choose β close to 1.
Why not just use the squared gradient for vₜ instead of the exponential moving average?
The exponential moving average is known to assign higher weights to recent gradient values. This is one of the reasons why RMSProp adapts quickly. But wouldn't it be better if, instead of the moving average, we only took into account the squared gradient from the last iteration (vₜ = dw²)? It turns out that the update equation would be transformed as follows:
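As a sketch of that substitution (plugging vₜ = dw² into the RMSProp update and ignoring the small ε term):

$$w_t = w_{t-1} - \alpha \cdot \frac{dw}{\sqrt{dw^2}} = w_{t-1} - \alpha \cdot \text{sign}(dw)$$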
As we can see, the resulting formula is very similar to the one used in gradient descent. However, instead of using a normal gradient value for the update, we now use the sign of the gradient:
- If dw > 0, then the weight w is decreased by α.
- If dw < 0, then the weight w is increased by α.
In summary, if vₜ = dw², then the model weights can only be changed by ±α. Although this approach works sometimes, it is still not flexible: the algorithm becomes extremely sensitive to the choice of α, and the absolute magnitudes of the gradients are ignored, which can make the method painfully slow to converge. A positive aspect of this algorithm is that only a single bit is required to store the sign of each gradient, which can be handy in distributed computations with strict memory requirements.
At the moment, Adam is the most famous optimization algorithm in deep learning. At a high level, Adam combines the Momentum and RMSProp algorithms. To achieve this, it simply keeps track of the exponential moving averages of the computed gradients and of the squared gradients, respectively.
Additionally, it is possible to use bias correction on the moving averages for a more accurate approximation of the gradient trend during the first few iterations. Experiments show that Adam adapts well to almost any type of neural network architecture, taking advantage of both Momentum and RMSProp.
According to the Adam paper, good default values for the hyperparameters are β₁ = 0.9, β₂ = 0.999, ε = 1e-8.
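Putting the pieces together, here is a minimal NumPy sketch of one Adam step with bias correction, using the default hyperparameters above (the function name and the 1-based step counter t are illustrative):

```python
import numpy as np

def adam_step(w, dw, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based iteration counter."""
    m = beta1 * m + (1 - beta1) * dw        # EMA of gradients (Momentum part)
    v = beta2 * v + (1 - beta2) * dw ** 2   # EMA of squared gradients (RMSProp part)
    m_hat = m / (1 - beta1 ** t)            # bias correction for the first moment
    v_hat = v / (1 - beta2 ** t)            # bias correction for the second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```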
We have analyzed different optimization algorithms for neural networks. Considered a combination of Momentum and RMSProp, Adam is the most powerful of them and scales well to large datasets and deep networks. Additionally, it has a simple implementation and low memory requirements, making it a preferable option in most situations.
All images, unless otherwise noted, are the author's.