Convolutional neural networks (CNNs) are the current building blocks for image classification tasks in machine learning. However, another very useful task they perform before classification is extracting relevant features from an image. Feature extraction is how CNNs recognize key patterns in an image in order to classify it. This article shows an example of how to perform feature extraction using TensorFlow and the Keras Functional API. But to formalize these CNN concepts, we must first talk about pixel space.
Pixel space
Pixel space is exactly what the name suggests: it is the space where the image is represented as an array of values, each value corresponding to an individual pixel. Therefore, the original image that we see, when fed into the CNN, is converted into an array of numbers. In grayscale images, these numbers typically range from 0 (black) to 255 (white), with values in between being shades of gray. In this article, all images have been normalized, that is, every pixel has been divided by 255 so that its value lies in the interval [0, 1].
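As a minimal sketch of what this normalization looks like in code (the pixel values below are made up purely for illustration):

import numpy as np

# A tiny hypothetical grayscale image: a 2-D array of pixel intensities
image = np.array([[0, 64, 128],
                  [32, 255, 16],
                  [200, 90, 5]], dtype=np.float32)

# Normalization used throughout this article: divide by 255 so every pixel
# value lands in the interval [0, 1]
normalized = image / 255.0
print(normalized.min(), normalized.max())  # 0.0 1.0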
CNN and pixel space
What a CNN does with the image in its pixel representation is apply filters and process it to extract the pixels that are relevant for the final “decision”, which is to assign the image to a class. For example, in the image at the top of the page, the CNN paid a lot of attention to the lion’s mouth, tongue, eyes (and strong outlines in general), and these features are drawn out even more as we go deeper into the neural network. Therefore, suffice it to say that the more specialized a CNN is at classification, the more proficient it will be at recognizing the key features of an image.
The goal
That said, the goal is simple: to see how specialized a CNN becomes when it comes to feature extraction.
The method
To do this, I trained two CNNs with the same architecture but different training set sizes: one with 50K images (this is the benchmark, the smart one), and the other with 10K images (this is the dummy one). After that, I sliced through the CNN layers to check what the algorithm sees and what sense it makes of the image fed to it.
Data set
The dataset used for this project was the widely used CIFAR-10 image dataset (1), a public domain dataset consisting of 60K images divided into 10 classes, of which 10K images are set aside as a holdout validation set. The images are 32×32 pixels in size and are in RGB color, which means 3 color channels.
To avoid data leakage, I set aside one image to use as a test image for feature extraction; this image was not used in any of the training. I present to you our guinea pig: the frog.
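For reference, the data preparation can be sketched roughly as follows; the index of the held-out frog image is a placeholder, since the article does not specify which image was set aside:

import tensorflow as tf

# Load CIFAR-10 as provided by Keras: 50K training and 10K test images
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

# Normalize pixel values to [0, 1]
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0

# One-hot encode the labels for categorical cross-entropy
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)

# Placeholder index: set aside one image (never used for training) to probe
# the feature maps later on
guinea_pig = x_test[0]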
The implementation is shown in the following code snippet. To slice the CNN layers correctly, it is necessary to use the Keras Functional API in TensorFlow instead of the Sequential API. It works like a cascade, where each layer is called on top of the previous one.
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Conv2D, MaxPool2D, Dense, Dropout, Flatten
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

def get_new_model(input_shape):
    '''
    This function returns a compiled CNN with the specifications given above.
    '''
    # Defining the architecture of the CNN
    input_layer = Input(shape=input_shape, name='input')
    h = Conv2D(filters=16, kernel_size=(3,3),
               activation='relu', padding='same', name='conv2d_1')(input_layer)
    h = Conv2D(filters=16, kernel_size=(3,3),
               activation='relu', padding='same', name='conv2d_2')(h)
    h = MaxPool2D(pool_size=(2,2), name='pool_1')(h)
    h = Conv2D(filters=16, kernel_size=(3,3),
               activation='relu', padding='same', name='conv2d_3')(h)
    h = Conv2D(filters=16, kernel_size=(3,3),
               activation='relu', padding='same', name='conv2d_4')(h)
    h = MaxPool2D(pool_size=(2,2), name='pool_2')(h)
    h = Conv2D(filters=16, kernel_size=(3,3),
               activation='relu', padding='same', name='conv2d_5')(h)
    h = Conv2D(filters=16, kernel_size=(3,3),
               activation='relu', padding='same', name='conv2d_6')(h)
    h = Dense(64, activation='relu', name='dense_1')(h)
    h = Dropout(0.5, name='dropout_1')(h)
    h = Flatten(name='flatten_1')(h)
    output_layer = Dense(10, activation='softmax', name='dense_2')(h)

    # To generate the model, we pass the input layer and the output layer
    model = Model(inputs=input_layer, outputs=output_layer, name='model_CNN')

    # Next we apply the compile method
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
The architecture specifications are shown below in Fig. 1.
The optimizer used is Adam, the loss function is categorical cross-entropy, and the evaluation metric is simply accuracy, since the dataset is perfectly balanced.
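As a rough sketch of how the two models might have been trained (the epochs, batch size, patience, and checkpoint file names below are illustrative choices, not taken from the article), using the callbacks imported earlier:

# Both models share the same architecture
model_benchmark = get_new_model(input_shape=(32, 32, 3))
model_dummy = get_new_model(input_shape=(32, 32, 3))

# Illustrative training setup: stop early when the validation loss stalls and
# keep the best model seen so far
def make_callbacks(path):
    return [EarlyStopping(monitor='val_loss', patience=5),
            ModelCheckpoint(path, save_best_only=True)]

# Benchmark: trained on all 50K training images
model_benchmark.fit(x_train, y_train, epochs=50, batch_size=64,
                    validation_data=(x_test, y_test),
                    callbacks=make_callbacks('benchmark.h5'))

# Dummy: trained on only 10K training images
model_dummy.fit(x_train[:10000], y_train[:10000], epochs=50, batch_size=64,
                validation_data=(x_test, y_test),
                callbacks=make_callbacks('dummy.h5'))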
Now we can slice out some strategic layers of the two CNNs to check the level of image processing. The code implementation is shown below:
benchmark_layers = model_benchmark.layers
benchmark_input = model_benchmark.input
layer_outputs_benchmark = [layer.output for layer in benchmark_layers]
features_benchmark = Model(inputs=benchmark_input, outputs=layer_outputs_benchmark)
What happens here is the following: the first line accesses each layer of the model and the second line returns the input layer of the entire CNN. Then, in the third line, we build a list of the outputs of each layer, and finally, we create a new model whose outputs are those layer outputs. This way we can see what happens between layers.
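As a quick sketch of how this sliced model can be used (assuming the guinea_pig image prepared earlier), running the frog through it returns the activation of every layer at once:

import numpy as np

# Run the frog through the sliced model: predict() returns one array per layer
activations = features_benchmark.predict(guinea_pig[np.newaxis, ...])

# Pick a layer's output by name; the first convolutional layer gives
# 16 feature maps of 32x32 pixels each
layer_names = [layer.name for layer in model_benchmark.layers]
first_conv = activations[layer_names.index('conv2d_1')]
print(first_conv.shape)  # (1, 32, 32, 16)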
Very similar code was written to access the layers of our dummy model, so we’ll skip it here. Now let’s proceed to look at the images of our frog, processed within different layers of our CNNs.
First convolutional layer
Dummy model
Fig. 2 shows the outputs of the 16 filters of the first convolutional layer (conv2d_1) of the dummy model. We can see that the images are not heavily processed and there is a lot of redundancy. One could argue that this is only the first convolutional layer, which explains why the processing is not that heavy, and that is a fair observation. To address it, we’ll look at the first layer of the benchmark.
Benchmark
The benchmark classifier shows a much more processed image, to the point that most of these feature maps are no longer recognizable. Remember: this is only the first convolutional layer.
Last convolutional layer
Dummy model
As expected, the image is no longer recognizable, since by this point we have gone through 6 convolutional layers and 2 pooling layers, which explains the smaller dimensions of the feature maps. Let’s see what the last layer of the benchmark looks like.
Benchmark
This is processed even further, to the point where most of the pixels are black, showing that the important features have been selected and the rest of the image is essentially thrown away.
We can see that the degrees of processing are very different at the same depth of the network. Qualitative analysis indicates that the benchmark model is more aggressive in extracting useful information from the input data. This is particularly evident in the comparison of the first convolutional layers: the output for the frog image is much less distorted and much more recognizable in the dummy model than in the benchmark model.
This suggests that the benchmark is more efficient at discarding image elements that are not useful for predicting the class, while the dummy classifier, not knowing how to proceed, keeps more features around. We can see in Fig. 6 that the benchmark (in blue) discards more pixel values than the dummy model (in red), whose distribution of pixel values shows a longer tail.
If we take a look at the pixel distribution of our original frog image, we have Fig. 7, which shows a much more symmetrical distribution, roughly centered around 0.4.
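For reference, distributions like those in Figs. 6 and 7 can be produced along these lines (reusing the activations and layer_names computed earlier; the dummy model would be handled analogously with its own sliced model):

import matplotlib.pyplot as plt

# Flatten the last convolutional layer's feature maps and compare their
# value distribution with the original image's pixel distribution
last_conv = activations[layer_names.index('conv2d_6')]

plt.hist(last_conv.ravel(), bins=50, alpha=0.5, color='blue',
         label='benchmark, last conv layer')
plt.hist(guinea_pig.ravel(), bins=50, alpha=0.5, color='gray',
         label='original image')
plt.xlabel('value')
plt.ylabel('count')
plt.legend()
plt.show()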
From an information theory point of view, the difference between the probability distribution of the original image and those of the images coming out of the convolutional layers represents a massive information gain.
Looking at Fig. 6 and comparing it with Fig. 7, we are much more certain about which pixel values we will find in the former than in the latter; therefore, there is a gain of information. This is a very brief and qualitative look at Information Theory, and it opens the door to a vast area. For more on Information (pun intended), check out this post.
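One rough way to quantify this, as a sketch (reusing last_conv and guinea_pig from the snippets above), is to estimate the Shannon entropy of the binned value distributions: the more concentrated distribution of the processed feature maps has lower entropy than the roughly symmetric pixel distribution of the original image.

import numpy as np

def shannon_entropy(values, bins=50):
    # Bin the values into a discrete distribution and compute its entropy in bits
    counts, _ = np.histogram(values, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Lower entropy for the heavily processed feature maps, higher for the raw image
print(shannon_entropy(last_conv.ravel()))
print(shannon_entropy(guinea_pig.ravel()))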
And finally, one way to observe the uncertainty in each classifier’s response is to look at the probability distribution over the classes. This is the output of the softmax function at the end of our CNN. Fig. 8 (left) shows that the benchmark is much more confident about the class, with the probability mass concentrated on the frog class, while Fig. 8 (right) shows a confused dummy classifier, with the highest probability on the wrong class.
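As a final sketch (assuming model_benchmark, model_dummy, and guinea_pig from the snippets above), the class probabilities shown in Fig. 8 come straight from each model’s softmax output:

import numpy as np

# Softmax output of each full classifier for the held-out frog image
probs_benchmark = model_benchmark.predict(guinea_pig[np.newaxis, ...])[0]
probs_dummy = model_dummy.predict(guinea_pig[np.newaxis, ...])[0]

classes = ['airplane', 'automobile', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck']
print('benchmark prediction:', classes[np.argmax(probs_benchmark)])
print('dummy prediction:', classes[np.argmax(probs_dummy)])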