The development of any machine learning model involves a rigorous experimental process that follows the idea-experiment-evaluation cycle.
The above cycle is repeated several times until a satisfactory level of performance is reached. The "experiment" phase involves both the coding and training steps of the machine learning model. As models become more complex and are trained on larger data sets, training time inevitably grows. As a result, training a large deep neural network can be extremely slow.
Fortunately for data science professionals, there are several techniques to speed up the training process, including:
- Transfer learning.
- Weight initialization strategies such as Glorot or He initialization.
- Batch normalization.
- A well-chosen, reliable activation function.
- A faster optimizer (see the sketch after this list).
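
To make the list more concrete, here is a minimal sketch, assuming TensorFlow/Keras, that combines several of these techniques in a small feed-forward network: He weight initialization, batch normalization, a ReLU activation, and a fast adaptive optimizer. The layer sizes, learning rate, and loss are illustrative placeholders, not recommendations from this article.

```python
import tensorflow as tf

# Small feed-forward network illustrating several speed-up techniques:
# He initialization, batch normalization, a ReLU activation, and Adam.
# Layer sizes and hyperparameters are arbitrary placeholders.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(256, kernel_initializer="he_normal"),
    tf.keras.layers.BatchNormalization(),   # normalize activations before the nonlinearity
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(128, kernel_initializer="he_normal"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Adam is used here as an example of a fast adaptive optimizer.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```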
While all of the techniques listed above are important, in this post I will focus on the last one: I will describe multiple algorithms for optimizing neural network parameters, highlighting both their advantages and limitations.
In the last section of this post, I will present a visualization showing the comparison between the optimization algorithms discussed.
For practical implementation, all code used in this article can be accessed at this GitHub repository:
Traditionally, batch gradient descent is considered the default optimizer for neural networks.
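
As a point of reference, the update performed by batch gradient descent can be written in a few lines of NumPy. The sketch below uses linear regression with a mean squared error loss and toy data as an illustrative assumption; the key point is that every parameter update requires a gradient computed over the entire training set.

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.1, n_epochs=1000):
    """Batch gradient descent for linear regression with MSE loss.

    Each update uses the gradient computed over the *entire* training set,
    which is what makes plain batch gradient descent slow on large data.
    """
    m, n = X.shape
    Xb = np.c_[np.ones((m, 1)), X]                      # add a bias column
    theta = np.zeros(n + 1)                             # initialize parameters
    for _ in range(n_epochs):
        gradients = (2 / m) * Xb.T @ (Xb @ theta - y)   # full-batch gradient of the MSE
        theta -= lr * gradients                         # one parameter update per full pass
    return theta

# Usage on toy data: y = 4 + 3x plus a little noise
rng = np.random.default_rng(0)
X = rng.random((100, 1))
y = 4 + 3 * X[:, 0] + 0.1 * rng.standard_normal(100)
print(batch_gradient_descent(X, y))  # should be close to [4, 3]
```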