Unlike you and me, computers only work with binary numbers, so they cannot see or understand an image directly. Instead, we represent images using pixels. In a grayscale image, each pixel takes a value between 0 (black) and 255 (white); the smaller the value, the darker the pixel, and the numbers in between form a spectrum of grays. This range of 256 values fits in exactly one byte (2⁸ in binary), the smallest addressable unit of memory for most computers.
Below is an example image I created in Python and its corresponding pixel values:
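To make this concrete, here is a minimal sketch of how such a grid of pixel values can be built with NumPy (the specific values are illustrative, not the ones from my example image):

```python
import numpy as np

# A tiny 4x4 grayscale "image": each entry is one pixel,
# 0 = black, 255 = white, values in between are grays.
image = np.array([
    [  0,  64, 128, 255],
    [ 64, 128, 255, 128],
    [128, 255, 128,  64],
    [255, 128,  64,   0],
], dtype=np.uint8)  # uint8: each pixel fits in exactly one byte

print(image.shape)              # (4, 4) -> height x width
print(image.min(), image.max()) # 0 255
```

Passing an array like this to `matplotlib.pyplot.imshow(image, cmap="gray")` renders it as an actual picture.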
Using this concept, we can develop algorithms that can see patterns in these pixels to classify images. This is exactly what a Convolutional Neural Network (CNN) does.
Most images are not grayscale; they have color. They are usually represented using RGB, which has three channels: red, green, and blue. Each channel is a two-dimensional grid of pixels, and the three channels are stacked on top of each other, so the input image is three-dimensional.
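The stacking of channels can be sketched as follows (a toy 2×2 image of pure red, just to show the shapes):

```python
import numpy as np

h, w = 2, 2  # a toy 2x2 image

# One 2-D grid of pixel values per channel.
red   = np.full((h, w), 255, dtype=np.uint8)   # red channel fully on
green = np.zeros((h, w), dtype=np.uint8)       # green channel off
blue  = np.zeros((h, w), dtype=np.uint8)       # blue channel off

# Stack the three 2-D channels into one 3-D array:
# (height, width, channels)
rgb = np.stack([red, green, blue], axis=-1)
print(rgb.shape)  # (2, 2, 3)
```

Each pixel is now a triple like `(255, 0, 0)`, and the whole image is a three-dimensional array, which is exactly the shape a CNN expects as input.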
The code used to generate the graph is available on my GitHub:
General description
The key building block of CNNs is the convolution operation. I have an entire article detailing how convolution works, but I'll give a quick summary here for completeness. If you want a deeper understanding, I recommend checking out the previous post: