BK’s Machine Learning / AI World – 5 (Convolutional Neural Network)

One day near the end of October, instead of sending and discussing cat and dog photos on Facebook, my younger brother, who is now doing his PhD, asked me if I was good at image processing. My answer was no, and I told him the most advanced(?) project I had done was classifying cats and dogs, which turned out to be a possible solution for his assignment. Basically, he had been tasked with identifying the presence of crystal formation in an image, and by slightly modifying the code I had written for differentiating cats and dogs, he was able to achieve the result he needed.

I have also blogged about using a Convolutional Neural Network (CNN) to classify digits. In this post, let’s dive deeper and use some illustrations to talk about each layer of a CNN.


The approach: use a pre-trained convolutional neural network (Vgg16) and finetune the last fully connected layer to predict the presence of crystals in the provided image.

The source code, a Jupyter notebook covering the image recognition process plus a standalone classifier with instructions for running it, can be found here: https://github.com/bklim5/keras/tree/master/protein-crystallization-classifier


In the cats and dogs example, we took a CNN model (Vgg16) with pre-trained weights and finetuned it to output only 2 categories (i.e. cats and dogs) instead of the original ~1000 categories. Using the same technique, we can finetune the model to output 2 categories, ‘GotCrystal’ and ‘NoCrystal’, supply it with training data, and it will settle the rest. We will talk more about finetuning later in the post; for now, let’s discuss exactly what a Convolutional Neural Network is.

Convolutional Neural Network

Pipeline of the full image recognition process using CNN – source: My brother’s assignment report

The diagram above shows the entire flow of image recognition using a CNN. There are 5 major layers/operations in this structure (which is used by Vgg16, Vgg19 and many other common CNNs), i.e. convolution, maxpooling, dropout, flatten and the fully connected layer.


In the convolution step, we take the input image’s raw pixels (e.g. a 3 x 224 x 224 matrix, which means 3 color channels, 224px width and 224px height) and pass them through a set of filters to produce feature maps. This step is basically a series of element-wise multiplications, each followed by a sum, as shown in the diagram below.

a 1x5x5 image convolved with a 3×3 filter (source: Google Image)

As you can see, the 5×5 green matrix holds the original pixel values of the input image. The 3×3 yellow matrix is our filter, with the values [(1,0,1), (0,1,0), (1,0,1)]. We slide the filter over the image, and at each position we multiply the overlapping values element-wise and sum them up. The resulting grid of sums is our convolved feature.
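To make the sliding multiply-and-sum concrete, here is a minimal numpy sketch. The filter values are the ones quoted above; the 5×5 image values are my own assumption for illustration (the figure's exact pixels are not reproduced in the text), and the function name `convolve2d_valid` is hypothetical.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and, at each
    position, sum the element-wise products of the overlapping values.
    Like most CNN libraries, this does not flip the kernel."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=image.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# Assumed 5x5 binary image, for illustration only
image = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
])

# The 3x3 filter quoted in the text
kernel = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])

feature_map = convolve2d_valid(image, kernel)
print(feature_map)
# [[4 3 4]
#  [2 4 3]
#  [2 3 4]]
```

A 5×5 input with a 3×3 filter and no padding gives a 3×3 feature map, because the filter only fits in 3 positions along each axis.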

What is a filter and why do we need it?

A filter can be seen as a feature identifier. Imagine a filter with the values below:

Filter to identify curve in an image (source: Siraj’s Intro to deep learning #6)

Now we slide it across an input image. Some parts of the image will produce a high activation from the multiplication, which means the filter has successfully identified a curve in the image.

High activation results – Curve detected! (Source: Siraj’s Intro to deep learning #6)

Similarly, if the result of the multiplication is very low, the filter could not find the feature in that part of the image (in this case, no curve was identified).

No Curve…. (Source: Siraj’s Intro to deep learning #6)

But how do we know that we need to identify curves, and how do we set the values of the filter? When we first start training our CNN, the values of the filters are randomly initialized. We don’t even know that we need to identify curves. Over time, as the model trains, the filter values get updated and the filters learn to identify the parts of the image needed to perform a successful classification in the end. In the case of the mouse above, aside from the curve detector, some other filters might get updated to identify the 2 circles on its head or the whiskers on its face.

After this convolution operation, we apply a non-linear activation function to the results to introduce non-linearity into our model, giving us a more complex but more powerful model. One of the most used activation functions is ‘relu’ (y = max(0, x)), due to its simplicity and the ease of calculating its derivative.
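As a quick sketch of why relu is considered simple, here it is in numpy together with its derivative. The input values are made up for illustration, and treating the derivative at exactly 0 as 0 is a convention I am assuming here.

```python
import numpy as np

def relu(x):
    # y = max(0, x), applied element-wise
    return np.maximum(0, x)

def relu_grad(x):
    # Derivative is 1 where x > 0 and 0 elsewhere
    # (the value at exactly 0 is taken as 0 by convention here)
    return (x > 0).astype(x.dtype)

# Made-up convolution output with some negative values
conv_output = np.array([[-2.0, 0.5],
                        [3.0, -0.1]])

activated = relu(conv_output)
gradients = relu_grad(conv_output)

print(activated)   # negatives become 0, positives pass through
print(gradients)   # 1 where the input was positive, else 0
```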

Jumping ahead a little before we discuss the next operation (maxpooling): these 3 operations (convolution + activation + maxpooling) are often bundled together into what is known as a convolutional block. These convolutional blocks can then be stacked multiple times, with the output of one block fed into the next. Why do we do that? Stacking lets subsequent blocks identify more abstract features (e.g. 2 curves and 1 circle becoming one eye), which helps in classifying the final result.
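Here is a rough numpy sketch of stacking two such blocks, mainly to show how the output of one block becomes the input of the next and how the feature map shrinks along the way. Everything in it is an assumption for illustration: a single channel, one random 3×3 filter per block, and 2×2 max pooling with a stride of 2. A real network such as Vgg16 uses many filters per layer, with padding.

```python
import numpy as np

def conv_block(feature_map, kernel):
    """One toy convolutional block: convolution -> relu -> 2x2 max pooling."""
    kh, kw = kernel.shape
    out_h = feature_map.shape[0] - kh + 1
    out_w = feature_map.shape[1] - kw + 1

    # Convolution: sum of element-wise products at each position
    conv = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            conv[i, j] = np.sum(feature_map[i:i + kh, j:j + kw] * kernel)

    # Activation
    activated = np.maximum(0, conv)

    # 2x2 max pooling with stride 2 (assumes even height and width)
    h, w = activated.shape
    return activated.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.random((14, 14))          # made-up single-channel input
kernel_1 = rng.normal(size=(3, 3))    # randomly initialized filters
kernel_2 = rng.normal(size=(3, 3))

block_1_out = conv_block(image, kernel_1)        # 14x14 -> 12x12 -> 6x6
block_2_out = conv_block(block_1_out, kernel_2)  # 6x6 -> 4x4 -> 2x2

print(block_1_out.shape, block_2_out.shape)  # (6, 6) (2, 2)
```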


In a nutshell, pooling reduces the dimensionality of a feature map while retaining the most important information. Reducing the dimensionality also reduces the computational cost and speeds up the entire training process. One of the most used pooling operations is max pooling, shown in the example below:

Maxpooling with 2×2 filter and stride of 2

Sliding a 2×2 filter over the red block, the maxpooling operation takes the max value of the 4 numbers (1,1,5,6) and outputs 6. Since we use a stride of 2, the filter then moves on to the green block. Similarly, the maxpooling operation outputs the maximum of those 4 numbers (in this case 8).
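The same operation as a short numpy sketch. The red block (1,1,5,6) and the maximum of 8 in the green block come from the example above; the remaining values in the 4×4 matrix are my assumption, and the helper name `maxpool2x2` is hypothetical.

```python
import numpy as np

def maxpool2x2(feature_map):
    """2x2 max pooling with a stride of 2 (assumes even height and width)."""
    h, w = feature_map.shape
    # Split into non-overlapping 2x2 blocks, then take the max of each block
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.array([
    [1, 1, 2, 4],   # top-left 2x2 is the "red" block: 1, 1, 5, 6
    [5, 6, 7, 8],   # top-right 2x2 is the "green" block, max 8
    [3, 2, 1, 0],   # bottom rows: assumed values for illustration
    [1, 2, 3, 4],
])

pooled = maxpool2x2(feature_map)
print(pooled)
# [[6 8]
#  [3 4]]
```

The 4×4 feature map becomes 2×2, so three quarters of the values are dropped while the strongest response in each region survives.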

Using maxpooling also helps generalize the feature maps (slightly): since it keeps only the strongest response in each region, the result is coarser but inherently covers more variations.


The remaining operations/layers are very straightforward. Once we have all of our feature maps in matrix form, we flatten them into a one-dimensional array/vector. This is required for the next operation in the CNN, the fully connected/dense layer, before the final classification takes place.

Fully Connected Layer (Dense Layer)

This is basically the most basic form of an artificial neural network, i.e. fully connected neurons going from input -> hidden layer -> output layer, where the output layer has X neurons and X = the number of classes we want to classify (in this case 2, GotCrystal / NoCrystal).

Artificial Neural Network (Source: Google Image)
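Flatten and the final dense layer can be sketched in a few lines of numpy. This is a simplified illustration, not the real network: I am assuming feature maps of shape 512 x 7 x 7, random (untrained) weights, no hidden layer, and a softmax on the 2 output neurons so that the outputs read as class probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed stack of feature maps coming out of the last convolutional block
feature_maps = rng.random((512, 7, 7))

# Flatten: 512 x 7 x 7 -> one vector of 25088 values
flat = feature_maps.reshape(-1)

# Dense output layer: every input value is connected to every output neuron
num_classes = 2  # GotCrystal / NoCrystal
weights = rng.normal(scale=0.01, size=(flat.size, num_classes))  # untrained
bias = np.zeros(num_classes)
logits = flat @ weights + bias

def softmax(z):
    # Subtract the max for numerical stability; the result sums to 1
    exp = np.exp(z - z.max())
    return exp / exp.sum()

probabilities = softmax(logits)
print(flat.shape, probabilities)
```

With random weights the two probabilities are meaningless; training is what turns them into a useful GotCrystal / NoCrystal prediction.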


Dropout is one more operation which we usually employ during the fully connected layer step. By specifying the dropout rate (e.g. 0.5), the layer randomly sets features to zero during training, effectively disabling those neurons (in this case each neuron has a 50% probability of being disabled). This forces the model to learn a more generic representation of the dataset, thus reducing the chance of overfitting.
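A minimal numpy sketch of the idea. Note one detail that goes beyond the description above: this is the common "inverted dropout" variant, which scales the surviving values by 1 / (1 - rate) during training so that the expected total stays the same, and does nothing at prediction time. The function name and input values are my own.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero each value with probability `rate` during
    training and scale the survivors by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return x  # dropout is only active while training
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(42)
activations = np.ones(10)  # made-up layer output

during_training = dropout(activations, rate=0.5, rng=rng)
during_prediction = dropout(activations, rate=0.5, rng=rng, training=False)

print(during_training)    # a mix of 0.0 (dropped) and 2.0 (kept and scaled)
print(during_prediction)  # unchanged: all ones
```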

That’s about it! In Keras, after defining all the layers, we compile the model by specifying a loss function (e.g. categorical crossentropy) and an optimizer (e.g. SGD or rmsprop) before we start feeding in the training data to train the model. I will probably write another blog post to discuss the different types of loss functions and optimizers.


Wait, I thought you mentioned using a pre-trained model and finetuning?

I almost forgot about this… The operations and layers discussed above are the internal pieces of how a CNN works. It’s like the engine of a car: to drive a car you might not need to know its fine details (although they are still good to know for improved performance!). The internal engine of a pre-trained model such as Vgg16 has already been set up for you, and as a user of the model we just need to know how to finetune it to suit our needs.

The reason why it is so useful to use a pre-trained network is that the network has already learned the low-level filters (e.g. edges, curves) from the large ImageNet dataset. We can then reuse the feature maps of a well-trained CNN model and finetune it to classify the classes we want. Vgg16 predicts 1000 different categories, which means the final fully connected layer contains 1000 output neurons. By finetuning, essentially what we are doing is replacing these 1000 neurons with only 2 neurons (since we are only predicting 2 classes) and training the last layer only! This technique is also known as transfer learning, as we are using the learnt weights of another model and changing them slightly for our use case. The actual code is in the notebook linked above.
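To illustrate just the "freeze everything, train only the new last layer" idea, here is a toy numpy sketch. It is not the Keras code from the notebook and it does not use Vgg16: the "pre-trained" feature extractor is a small random matrix standing in for the frozen layers, the 20 "images" are random vectors with random labels, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: 20 "images" as 8-number vectors, with 2 classes
images = rng.normal(size=(20, 8))
labels = rng.integers(0, 2, size=20)   # 0 = NoCrystal, 1 = GotCrystal
targets = np.eye(2)[labels]            # one-hot encoding

# Stand-in for the pre-trained layers: frozen, never updated below
frozen_weights = rng.normal(scale=0.1, size=(8, 16))
frozen_before = frozen_weights.copy()

def extract_features(x):
    return np.maximum(0, x @ frozen_weights)  # linear layer + relu

def softmax(z):
    exp = np.exp(z - z.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probs, one_hot):
    return -np.mean(np.sum(one_hot * np.log(probs + 1e-12), axis=1))

# The new 2-neuron last layer: the only part that gets trained
head_weights = np.zeros((16, 2))

features = extract_features(images)  # computed once, since the base is frozen
initial_loss = cross_entropy(softmax(features @ head_weights), targets)

learning_rate = 0.5
for _ in range(200):
    probs = softmax(features @ head_weights)
    gradient = features.T @ (probs - targets) / len(images)
    head_weights -= learning_rate * gradient  # update the head only

final_loss = cross_entropy(softmax(features @ head_weights), targets)
print(initial_loss, final_loss)
```

Because the frozen part never changes, its features only need to be computed once, which is a big part of why training just the last layer is so cheap compared to training the whole network.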


Sadly, we don’t have a lot of data for training and testing. In fact, we only have about 20 images 🤦. However, the flow has already been set up, and if we can get hold of more images (probably a few thousand more? Lol) I am quite confident that this can work out pretty well.

Here is some output from the overfitted model, based on what we have:

No Crystal


Got Crystal


Not bad eh? 🙂

Special thanks to my brother for such an interesting topic.

Also, I have now been informed that Vgg16 originated from Oxford. Good luck, my brother. I am waiting for the day when you develop something as powerful as this for the public!



