Introduction to Convolutional Neural Networks

Posted on 12/23/2024

Convolutional neural networks were popularized in 1998 by LeCun, Bottou et al. 1 when they introduced LeNet-5 for digit recognition, which was later applied to the recognition of ZIP code digits by the U.S. Postal Service. The main idea in convolutional neural networks (CNNs) is that, rather than connecting all the units (neurons) in one layer to all the units in the next one —as is the case in an Multilayer perceptron MLP—convolutional networks consist in sliding filters —also called kernels— over the original image or input, forming the so-called feature map. Several feature maps are then stacked creating the output of that layer.

One of the main reasons for using CNNs is that they are very efficient for processing images where there is a known data structure because the paramters —or weights— of the filters in each layer are the same for all pixels in the input, so there are overall fewer weights to learn compared to a fully connected network. They are also very useful to exploit the spatial invariance of images, by creating local feature layers which are successively combined to create more meaningful and complex features.

Convolutional Layer

Let X\boldsymbol{X} be a matrix with the input image, and H\boldsymbol{H} be the successive hidden layer. Then we can write the convolutional operation that takes us from X\boldsymbol{X} to H\boldsymbol{H} as

Hi,j=b+a=ΔΔb=ΔΔWa,bXi+a,j+bH_{i,j} = b + \sum_{a=-\Delta}^{\Delta} \sum_{b=-\Delta}^{\Delta} \boldsymbol{W}_{a,b} \boldsymbol{X}_{i+a,j+b}

where bb is the bias term, W\boldsymbol{W} defines the filter, also called the weight matrix or convolution kernel, and 2Δ+12\Delta+1 is the size of the kernel (assumed squared here).

Working principle of the convolutional operator.
Working principle of the convolutional operator.

So far we assumed that the input image is given by just a matrix, however, real images consist of three channels red, green, and blue (RGB), so they are third-order tensors rather than matrices. These tensors are characterized by their height, width, and channel. In the CNN jargon height and width are the spatial coordinates while the channels are referred to as the depth. As an example, a 1920x1080 image will be represented by a 1920x1080x3 tensor.

The concept of convolutional neural network is closely related to that of image convolution and the mathematical concept of convolution, however, there are some key differences that are worth pointing out:

Hi,j=b+a=ΔΔb=ΔΔcWa,b,cXi+a,j+b,c H_{i,j} = b + \sum_{a=-\Delta}^{\Delta}\sum_{b=-\Delta}^{\Delta} \sum_c\boldsymbol{W}_{a,b,c} \boldsymbol{X}_{i+a,j+b,c}
Convolution with no padding.
Convolution with no padding.

Up to this point, we’ve seen that each convolution layer will be defined by the spatial size of the filters and also the number of filters, which will define the depth of the output layer. Additionally, other parameters that will define the convolutional layer and allow for more control over the size of the output include padding, stride, dilation, and grouping.

Padding

Padding is used to avoid convolutions reducing largely the size of the output. As a matter of fact, convolving a kernel of size kk over an input image of shape n×nn\times n will give an output size:

output size=nk+1\text{output size}=n-k+1

because we can only shift the convolutional kernel until it runs out of pixel to apply the convolution. The first CNNs such as LeNet-5 did not use padding, which meant the size shrank on each convolution layer.
In order to solve this problem, extra pixels can be added around the frame of the input image to increase its size. Normally, the value of the padding is set to zero, although pixel replication methods such as wrap, clamp, or mirroring can also be used. Alsallakh et al. (2020)2 presented an extensive overview of the existing alternatives to zero-padding.

Border padding techniques.
Border padding techniques.

From Szeliski.

When using padding pp (number of rows/columns surrounding the input image), the size of the output layer is

output size=nk+2p+1\text{output size}=n-k+2p+1
Arbitrary padding resulting in output larger than the input.
Arbitrary padding resulting in output larger than the input.

Note that when using a padding p=k12p=\frac{k-1}{2} the input and output will have the same size if kk is even. Otherwise, there will be a difference between the top/left and bottom/right padding which will be computed as k12\lceil \frac{k-1}{2} \rceil and k12\lfloor \frac{k-1}{2} \rfloor. For this reason, normally convolution kernels have odd sizes so that the padding remains symmetric with the same number of top/bottom rows and left/right columns.

Padding preserving the original input size.
Padding preserving the original input size.

Stride

It refers to the amount of pixels the filter is slid when computing the cross-correlation. By default this value is normally 1, however, for reasons such as computational efficiency or to downsample the input, we can slide by skipping intermediate pixels. For instance, the stride for the first layer in AlexNet had a stride of 4, and traditional image pyramids use a stride of 2 in the coarser levels. The output dimension of a convolution operation with stride ss is

output size=nk+2ps+1\text{output size}=\bigg\lfloor\frac{n-k+2p}{s}+1\bigg\rfloor
Padding and strides
Padding and strides
Padding and strides (odd)
Padding and strides (odd)

In general, when the input image, filter, padding, and strides are not squares, the output shape will be given by

nhkh+2phsh+1×nwkw+2pwsw+1\bigg\lfloor\frac{n_h-k_h+2p_h}{s_h}+1\bigg\rfloor \times \bigg\lfloor\frac{n_w-k_w+2p_w}{s_w}+1\bigg\rfloor

Dilation

We call dilation when extra space is inserted between the pixels sampled during the convolution, this is also called atrous (from French a trous). 3 4

Dilation
Dilation

Grouping

It’s used when we group the input and output layers into separate groups which are convoluted separately. If G=1G=1 then it is the regular convolution (using all channels) otherwise each channel is convolved independently from the other, this is also known as depthwise or channel-separated convolution. 5 6 7

Output dimension calculator

Click on the following link for a interactive calculator of the output layer shape, based on all the variables we have learnt so far.

Pooling

The pooling operator consist, like with convolution layers, of a window that slides over the input and computes a single output at each location. However, there are two key differences between pooling and cross-correlation layers:

Max-pooling and average pooling in action.
Max-pooling and average pooling in action.

Pooling layers are added to the convolutional architecture to mitigate the sensitivity of the CNN to translation and to downsample the representations. The idea behind pooling is that features can shift relative to each other so that a robust matching of features can be done even when small distortions occur while also reducing the spatial dimension of the feature map. This operator is also used in other methods like in the scale-invariant feature transform (SIFT) descriptor with a 4×4 sum pooling grid.

In general, pooling layers use a value of stride equal to the size of the pooling window since the pooling operator aggregates information from an area. In practice max-pooling is preferable to average pooling in almost all cases, and the two most popular variations of max-pooling are:

since larger sizes or larger strides are too destructive

Pooling layers are a bit outdated and nowadays many people are proposing to discard them and prioritize architectures using just repeated convolutional layers.9 The idea is to use layers with large strides once in a while to reduce the size of the feature map. Discarding pooling layers was found to be important in training good generative models, such as variational autoencoders (VAEs) and generative adversarial networks (GANs), and seems to be the trend for the future.

Footnotes

  1. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE86(11), 2278-2324.

  2. Alsallakh, B., Kokhlikyan, N., Miglani, V., Yuan, J., & Reblitz-Richardson, O. (2020). Mind the Pad—CNNs Can Develop Blind Spots. _arXiv preprint arXiv:2010.02178.

  3. Yu, F. and Koltun, V. (2016). Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations (ICLR).

  4. Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. (2018). DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848.

  5. Xie, S., Girshick, R., Dolla ́r, P., Tu, Z., and He, K. (2017). Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  6. Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.

  7. Tran, D., Wang, H., Torresani, L., and Feiszli, M. (2019). Video classification with channel-separated convolutional networks. In IEEE/CVF International Conference on Computer Vision (ICCV).

  8. Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature neuroscience2(11), 1019-1025.

  9. Springenberg, J. T., Dosovitskiy, A., Brox, T., & Riedmiller, M. (2014). Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806.

Comments

No comments yet.