In my last post, we discussed how the performance of neural networks can be improved through hyperparameter tuning:
This is the process of searching for the hyperparameter values, such as the learning rate and the number of hidden layers, that give our network the best performance.
Unfortunately, this tuning process is painstakingly slow for large deep neural networks (deep learning). One way to speed it up is to use faster optimizers than the traditional “vanilla” gradient descent method. In this post, we will delve into the most popular optimizers and gradient descent variants that can improve training speed and convergence, and compare them in PyTorch!
Before we dive in, let’s quickly review gradient descent and the theory behind it.
The goal of gradient descent is to update each model parameter by subtracting the gradient (partial derivative) of the loss function with respect to that parameter. A learning rate, η, regulates this process so that each update happens on a reasonable scale and neither overshoots nor undershoots the optimal value.
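In symbols, the standard gradient descent update rule is:

θ ← θ − η∇J(θ)

where: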
- θ are the parameters of the model.
- J(θ) is the loss function.
- ∇J(θ) is the gradient of the loss function. ∇ is the gradient operator, also known as nabla.
- η is the learning rate.
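To make this concrete, here is a minimal sketch of that update loop in PyTorch using torch.optim.SGD. The single-parameter quadratic loss and the learning rate of 0.1 are purely illustrative choices, not values from a real model:

```python
import torch

# θ: a single model parameter, and a toy quadratic loss J(θ) = (θ - 3)²
# (both chosen purely for illustration).
theta = torch.tensor(0.0, requires_grad=True)

# η: the learning rate that scales each update step.
optimizer = torch.optim.SGD([theta], lr=0.1)

for _ in range(50):
    optimizer.zero_grad()        # clear gradients from the previous step
    loss = (theta - 3.0) ** 2    # J(θ)
    loss.backward()              # compute ∇J(θ)
    optimizer.step()             # θ ← θ − η∇J(θ)

print(theta.item())  # approaches 3.0, the minimum of this toy loss
```

Each call to optimizer.step() applies exactly the θ ← θ − η∇J(θ) update described above.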
I wrote a previous article about gradient descent and how it works if you want to get a little more familiar with it: