They are advanced neural networks that find patterns and features within images, much as a human does. We as humans look for patterns within images. For example, in the images of elephants linked below, the animal is classified and recognized not from the complete image, pixel by pixel, but from the identification of specific features.

These features could include the trunk, tusks, large ears, or simply the colour and texture of the skin. CNNs operate in a similar manner, with the difference that they must process the whole image pixel by pixel to extract those important features. The process through which this is achieved is explained below.
The convolutional operation compares the pixels of the image to a small template, usually called a kernel or feature detector; the grid of results produced by this comparison is the feature map.

In the example above, there is a 3 by 3 feature detector, usually called a kernel. The kernel moves through the pixels of the image, starting with the 0 in the second column of the second row. It compares each pixel of the feature detector with the pixel data of the image, and if they match, it counts that as a one. It does this nine times, comparing each pixel of the feature detector with the image pixel data. At the end, it sums up the ones and places that sum in the feature map.
As an example, we could look at the yellow highlighted box. None of the pixels in the box have a value of 1, so the corresponding pixel in the feature map, the one in the bottom left corner, is zero.
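The sliding-and-counting procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the exact figure from the article: the image and kernel values below are made-up binary data, and for 0/1 data, counting the positions where both the kernel and the image patch hold a 1 is the same as the element-wise multiply-and-sum used in a standard convolution.

```python
def convolve_binary(image, kernel):
    """Slide the kernel over the image; at each position, count the
    cells where both the kernel and the image patch contain a 1.
    The resulting grid of counts is the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Count positions where both the kernel and the image
            # patch hold a 1 (a "match" in the article's terms).
            feature_map[i][j] = sum(
                image[i + r][j + c] * kernel[r][c]
                for r in range(kh)
                for c in range(kw)
            )
    return feature_map

# Hypothetical 5x5 binary image and 3x3 kernel (illustrative values).
image = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
kernel = [
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
]
print(convolve_binary(image, kernel))
# → [[1, 2, 1], [1, 3, 1], [0, 2, 0]]
```

Note how the 5 by 5 image shrinks to a 3 by 3 feature map: the kernel can only be placed at positions where it fits entirely inside the image.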

Pooling is performed after the convolutional operations. There are different pooling methods, but one of the most popular is max pooling. Max pooling finds the highest value in each region of the feature map covered by, in the case above, a 2 by 2 filter. The purpose of this operation is to reduce the number of pixels in the feature map. In this case, the values in the first matrix of pixel data were derived from the convolutional operations and are then fed into the max pooling layer. The layer finds the highest value in each window and adds it to a new matrix. The image illustrates this operation.
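The max pooling step can be sketched as follows. Again, the feature map values below are made-up example data rather than the figure from the article; the window slides in steps equal to its own size, so a 4 by 4 map shrinks to 2 by 2.

```python
def max_pool(feature_map, size=2):
    """Slide a size x size window over the feature map in steps of
    `size` and keep only the largest value in each window."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, h - size + 1, size):
        row = []
        for j in range(0, w - size + 1, size):
            # Collect the values inside the current window and keep
            # only the maximum.
            window = [feature_map[i + r][j + c]
                      for r in range(size) for c in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled

# Hypothetical 4x4 feature map (illustrative values).
fm = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 2],
    [2, 0, 1, 3],
]
print(max_pool(fm))
# → [[4, 2], [2, 5]]
```

Each 2 by 2 block is reduced to a single number, which is how pooling cuts the pixel count while keeping the strongest responses.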

The final layer converts the 2D matrix into a one-column vector. This flattening step is necessary so that the data can be fed into a fully connected neural network, where the algorithm will eventually learn to identify the key patterns and features in the image.
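The flattening step amounts to unrolling the matrix row by row, as in this short sketch (the input values are the made-up pooled output from the earlier examples):

```python
def flatten(matrix):
    """Unroll a 2D matrix row by row into a single flat vector,
    ready to be fed into a fully connected layer."""
    return [value for row in matrix for value in row]

pooled = [[4, 2], [2, 5]]
print(flatten(pooled))
# → [4, 2, 2, 5]
```

The four pooled values become a single vector of four inputs, one per neuron in the first fully connected layer.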