Before CNNs, the standard way to train a neural network to classify images was to flatten the image into a list of pixels and pass it through a feedforward neural network to predict the image class. The problem with flattening the image is that it discards its essential spatial information.
In 1989, Yann LeCun and his team introduced convolutional neural networks, the backbone of computer vision research for the past 15 years. Unlike feedforward networks, convolutional neural networks preserve the 2D structure of images and are capable of processing information spatially.
In this article, we will walk through the history of CNNs for image classification: from the early years of research in the 90s, through the golden era of the mid-2010s when many of the coolest deep learning architectures in history were conceived, and finally to the latest trends in CNN research, now competing with attention and vision transformers.
Check out the YouTube video, which explains all the concepts in this article in a visual, animated way. Unless otherwise specified, all images and illustrations used in this article were created by me while making the video version.
At the heart of a CNN is the convolution operation. We scan a small filter across the image and compute the dot product between the filter and the image at each overlapping location. The result is called a feature map, and it captures how much, and where, the filter's pattern is present in the image.
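As a concrete illustration, here is a minimal NumPy sketch of this scan-and-dot-product operation (a stride-1, no-padding convolution; the function name and the example edge filter are just illustrative):

```python
import numpy as np

def convolve2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide `kernel` over `image` and take a dot product at each location."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(patch * kernel)  # dot product with the overlapping patch
    return feature_map

# A vertical-edge filter responds strongly wherever intensity changes from left to right.
image = np.random.rand(8, 8)
vertical_edge = np.array([[1., 0., -1.],
                          [1., 0., -1.],
                          [1., 0., -1.]])
print(convolve2d(image, vertical_edge).shape)  # (6, 6) feature map
```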
In a convolutional layer, we train multiple filters that extract different feature maps from the input image. When we stack multiple convolutional layers in sequence with some nonlinearity, we get a convolutional neural network (CNN).
So each convolutional layer simultaneously does two things:
1. spatial filtering, via the convolution operation between images and kernels, and
2. combining multiple input channels to generate a new set of channels.
90 percent of the research on CNNs has focused on modifying or improving just these two things.
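To make these two operations concrete, here is a minimal PyTorch sketch (the channel counts and image size are arbitrary examples): a single convolutional layer filters spatially with 3×3 kernels and, at the same time, mixes 3 input channels into 16 output channels.

```python
import torch
import torch.nn as nn

# One convolutional layer: spatial filtering AND channel mixing in a single step.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

x = torch.randn(1, 3, 32, 32)        # a batch with one 3-channel 32x32 image
feature_maps = torch.relu(conv(x))   # nonlinearity applied after the convolution
print(feature_maps.shape)            # torch.Size([1, 16, 32, 32]) -> 16 new channels

# Each of the 16 filters has shape (3, 3, 3): it spans all 3 input channels,
# so every output channel combines spatial patterns from every input channel.
print(conv.weight.shape)             # torch.Size([16, 3, 3, 3])
```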
The 1989 paper
This paper from 1989 showed how to train non-linear CNNs from scratch using backpropagation. The authors take 16×16 grayscale images of handwritten digits and pass them through two convolutional layers with 12 filters of size 5×5. The filters are moved with a stride of 2 during scanning; convolution with stride is useful for downsampling the input image. After the convolutional layers, the output maps are flattened and passed through two fully connected layers to output the probabilities of all 10 digits. Using softmax cross-entropy loss, the network is optimized to predict the correct labels for the handwritten digits. A tanh nonlinearity is applied after each layer, allowing the learned feature maps to be more complex and expressive. With only 9,760 parameters, this was a very small network compared to today’s networks, which contain hundreds of millions of parameters.
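For readers who think in code, here is a rough PyTorch sketch of a network in the spirit of that 1989 architecture, following the description above; the padding and the width of the first fully connected layer are my assumptions, not values taken from the paper:

```python
import torch
import torch.nn as nn

class LeNet1989(nn.Module):
    """A network in the spirit of the 1989 paper, per the description above.
    Padding and the hidden-layer width are assumptions, not the paper's exact values."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 12, kernel_size=5, stride=2, padding=2),   # 1x16x16 -> 12x8x8
            nn.Tanh(),
            nn.Conv2d(12, 12, kernel_size=5, stride=2, padding=2),  # 12x8x8 -> 12x4x4
            nn.Tanh(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(12 * 4 * 4, 30),  # assumed hidden width
            nn.Tanh(),
            nn.Linear(30, 10),          # logits for the 10 digits
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNet1989()
x = torch.randn(8, 1, 16, 16)                    # a batch of 16x16 grayscale digits
targets = torch.randint(0, 10, (8,))             # dummy digit labels
loss = nn.CrossEntropyLoss()(model(x), targets)  # softmax cross-entropy
loss.backward()                                  # trained end to end with backpropagation
```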
Inductive bias
Inductive bias is a machine learning concept where we deliberately introduce specific rules and constraints into the learning process to steer our models away from fully general solutions and toward ones that follow our human understanding of the problem.
When humans classify images, we also perform spatial filtering: we look for common patterns to form multiple representations and then combine them to form our predictions. The architecture of convolutional neural networks is designed to replicate precisely that. In feedforward networks, each pixel is treated as if it were its own isolated feature, since each neuron in a layer connects to all pixels; in convolutional neural networks, parameters are shared because the same filter scans the entire image. This inductive bias also makes convolutional neural networks less data-hungry: they get local pattern recognition for free from the network design, whereas feedforward networks need to spend their training cycles learning it from scratch.
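A quick back-of-the-envelope count shows why parameter sharing matters; the numbers below are just an illustrative case (producing 12 full-resolution maps from a 16×16 grayscale image), not figures from the 1989 paper:

```python
# Fully connected: every output unit connects to every input pixel.
pixels = 16 * 16
fc_params = pixels * (12 * 16 * 16)   # 12 dense "maps" at full resolution -> 786,432 weights

# Convolutional: 12 filters of 5x5 are reused at every location in the image.
conv_params = 12 * 5 * 5              # 300 weights, regardless of image size

print(fc_params, conv_params)         # 786432 vs 300 (biases omitted for simplicity)
```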