Understanding VGGNet - Very Deep Convolutional Networks for Large-Scale Image Recognition

Posted on 2/16/2025

After having analyzed and implemented AlexNet, it’s time to move forward and study the work presented by Karen Simonyan and Andrew Zisserman in their 2014 paper Very Deep Convolutional Networks for Large-Scale Image Recognition. The network they created is commonly referred to as VGGNet, as the authors were part of the Visual Geometry Group (VGG) at the University of Oxford.

Like AlexNet before it, VGGNet was developed for the ILSVRC competition, in this case the 2014 edition. Even though it did not win the 2014 classification task, which went to GoogLeNet, VGGNet won the localization task and remains relevant due to its depth, simple design, and performance. Furthermore, VGGNet became the foundation for other applications such as object detection, image segmentation, and style transfer.

ILSVRC winning architectures and their score compared to 'human-level'. Note the step between 2011/2012 when deep neural networks were introduced. Source Link

The authors created a set of top-performing ConvNets that built on the results and observations of the networks that came before them.

VGGNet Architecture

An important difference between this network and its predecessors (e.g., AlexNet) lies in the kernels used in the convolutional layers. Previous networks used large kernels in the first layers (11x11 for AlexNet), which progressively became smaller in deeper layers. The approach for VGGNet was different and much simpler: a 3x3 kernel for every convolutional layer in the network. The reason for this size is that a 3x3 filter is the smallest one that can still capture the notion of left, right, up, down, and center. The use of small kernels was not pioneered by VGGNet, as they had been employed before 2, but never in such deep networks.

According to the authors, using smaller kernels has some advantages:

- A stack of 3x3 layers covers the same effective receptive field as a single larger kernel (two 3x3 layers see a 5x5 patch, three see 7x7) while using fewer parameters: with C input and output channels, three 3x3 layers use 27C² weights versus 49C² for one 7x7 layer.
- Each layer in the stack adds its own ReLU, making the decision function more discriminative than a single non-linearity.
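
These receptive-field and parameter claims are easy to check numerically. The following is a minimal sketch (the helper names `receptive_field` and `conv_weights` are mine, not from the paper):

```python
# Receptive field and parameter count of stacked 3x3 convolutions versus a
# single large kernel (C input and C output channels, biases ignored).
def receptive_field(num_layers: int, kernel: int = 3) -> int:
    """Effective receptive field of stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)

def conv_weights(kernel: int, channels: int, layers: int = 1) -> int:
    """Number of weights in `layers` conv layers with square kernels."""
    return layers * kernel * kernel * channels * channels

C = 256
print(receptive_field(3))              # 7: three 3x3 layers see a 7x7 patch
print(conv_weights(3, C, layers=3))    # 27 * C^2 = 1,769,472
print(conv_weights(7, C))              # 49 * C^2 = 3,211,264
```

For C = 256 channels, the stacked 3x3 layers need roughly 45% fewer weights than a single 7x7 layer with the same receptive field.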

The authors primarily investigated the effect of depth on network accuracy by testing different configurations, ranging from 11 to 19 weight layers, all based on the same design principles. Some of the configurations explored involve slightly different architectures, namely A-LRN, which adds a Local Response Normalization layer to the first convolutional block, and C, which uses 1x1 convolutional kernels in the last layer of its three deepest blocks.

All configurations are composed of 5 convolutional blocks, each followed by a max-pooling layer, and end with a three-layer fully connected classification block. The number of kernel channels is kept constant within each convolutional block, starting at 64 in the first layer and doubling with each subsequent convolutional block until reaching 512. The main difference between the different configurations is the number of convolutional layers in each convolutional block.
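
The channel progression described above can be sketched in a few lines of Python (a toy illustration, not paper code):

```python
# Channel progression across VGGNet's five convolutional blocks: start at 64
# and double after each max-pool, capping at 512 for the fifth block.
channels, c = [], 64
for _ in range(5):
    channels.append(c)
    c = min(c * 2, 512)
print(channels)  # [64, 128, 256, 512, 512]
```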

The table below contains all the VGGNet configurations explored in the original paper.

| | A | A-LRN | B | C | D | E |
|---|---|---|---|---|---|---|
| Layers | 11 | 11 | 13 | 16 | 16 | 19 |
| Weights (mill.) | 133 | 133 | 133 | 134 | 138 | 144 |
| Conv 1 | conv3-64 | conv3-64 | conv3-64 | conv3-64 | conv3-64 | conv3-64 |
| | | LRN | conv3-64 | conv3-64 | conv3-64 | conv3-64 |
| Maxpool | | | | | | |
| Conv 2 | conv3-128 | conv3-128 | conv3-128 | conv3-128 | conv3-128 | conv3-128 |
| | | | conv3-128 | conv3-128 | conv3-128 | conv3-128 |
| Maxpool | | | | | | |
| Conv 3 | conv3-256 | conv3-256 | conv3-256 | conv3-256 | conv3-256 | conv3-256 |
| | conv3-256 | conv3-256 | conv3-256 | conv3-256 | conv3-256 | conv3-256 |
| | | | | conv1-256 | conv3-256 | conv3-256 |
| | | | | | | conv3-256 |
| Maxpool | | | | | | |
| Conv 4 | conv3-512 | conv3-512 | conv3-512 | conv3-512 | conv3-512 | conv3-512 |
| | conv3-512 | conv3-512 | conv3-512 | conv3-512 | conv3-512 | conv3-512 |
| | | | | conv1-512 | conv3-512 | conv3-512 |
| | | | | | | conv3-512 |
| Maxpool | | | | | | |
| Conv 5 | conv3-512 | conv3-512 | conv3-512 | conv3-512 | conv3-512 | conv3-512 |
| | conv3-512 | conv3-512 | conv3-512 | conv3-512 | conv3-512 | conv3-512 |
| | | | | conv1-512 | conv3-512 | conv3-512 |
| | | | | | | conv3-512 |
| Maxpool | | | | | | |
| FC-4096 | | | | | | |
| FC-4096 | | | | | | |
| FC-1000 | | | | | | |
| Softmax | | | | | | |

All configurations reuse the same building blocks across every convolutional layer: 3x3 convolutions with a stride of 1 and padding of 1 (so the spatial resolution is preserved), each followed by a ReLU activation, and 2x2 max-pooling with a stride of 2 at the end of each convolutional block.

These networks contain around 140 million parameters, more than twice as many as AlexNet. Therefore, some of the concerns raised by Krizhevsky et al., such as data augmentation to avoid overfitting, network initialization, and training time, are even more relevant here.
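
As a check on that figure, here is a short Python sketch that counts VGG16's weights and biases from its layer spec (the `cfg_d` list mirrors configuration D; the names are mine, not the paper's):

```python
# Parameter count of VGG16 (configuration D), including biases.
# 'M' marks a max-pool layer, which has no parameters.
cfg_d = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
         512, 512, 512, 'M', 512, 512, 512, 'M']

params, in_ch = 0, 3
for v in cfg_d:
    if v == 'M':
        continue
    params += 3 * 3 * in_ch * v + v   # 3x3 conv weights + biases
    in_ch = v

fc_dims = [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]
for n_in, n_out in fc_dims:
    params += n_in * n_out + n_out    # FC weights + biases

print(params)  # 138357544
```

The result, 138,357,544, matches the ~138 million reported for configuration D; note that roughly 90% of the parameters sit in the three fully connected layers.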

Vgg16 (config D) architecture Source Link

VGGNet is similar to GoogLeNet in that both are very deep CNNs built from small convolution filters. The GoogLeNet architecture, however, is more complex, and its feature maps are reduced more aggressively in the first layers.

Data Processing and Augmentation

The data pre-processing employed for VGGNet is essentially the same as for AlexNet: every variable-size image is rescaled so that its shorter side measures 256 pixels, and then the mean RGB value, computed per channel over the entire training set (1.2 million images!), is subtracted from each pixel. This centers the input values around zero, which improves convergence.
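
A minimal NumPy sketch of this per-channel mean subtraction, on a toy dataset of 10 images instead of 1.2 million:

```python
import numpy as np

# Compute one mean value per RGB channel over the whole (toy) dataset and
# subtract it from every pixel of every image.
rng = np.random.default_rng(0)
dataset = rng.integers(0, 256, size=(10, 256, 256, 3)).astype(np.float64)

mean_rgb = dataset.mean(axis=(0, 1, 2))   # shape (3,): one value per channel
centered = dataset - mean_rgb             # broadcasts over every pixel

print(mean_rgb.shape)                     # (3,)
print(abs(centered.mean()) < 1e-6)        # True: inputs are now zero-centered
```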

As mentioned in the previous section, VGGNet is a large convolutional network with around 140 million parameters to adjust. Therefore, good data augmentation techniques are essential to avoid overfitting. Some of the data augmentation techniques used for AlexNet were also used for VGGNet, such as PCA color augmentation with the same settings and a 50% chance of horizontally flipping an image. However, one new technique was added. Remember how in AlexNet we randomly cropped 227x227 square images from the original 256x256? For VGGNet, two different approaches were used for setting the training scale S (the length of the shorter side of the rescaled image):

- Single-scale training, where S is fixed to 256 or 384.
- Multi-scale training (scale jittering), where S is randomly sampled from the range [256, 512] for each image, so objects are seen at many different sizes.
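
A minimal NumPy sketch of scale-jittered cropping (assuming nearest-neighbour resizing for brevity; the function name is illustrative): rescale the shorter side to a random S in [256, 512], then take a random 224x224 crop.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_scale_crop(img, crop=224, s_min=256, s_max=512):
    """Sample S, rescale the shorter side to S, return a random crop."""
    s = int(rng.integers(s_min, s_max + 1))
    h, w = img.shape[:2]
    scale = s / min(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    rows = (np.arange(new_h) * h / new_h).astype(int)   # nearest neighbour
    cols = (np.arange(new_w) * w / new_w).astype(int)
    resized = img[rows][:, cols]
    top = int(rng.integers(0, new_h - crop + 1))
    left = int(rng.integers(0, new_w - crop + 1))
    return resized[top:top + crop, left:left + crop]

img = rng.integers(0, 256, size=(300, 400, 3))
print(random_scale_crop(img).shape)  # (224, 224, 3)
```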

Learning

The learning settings used to train VGGNet were almost identical to those of AlexNet.

The network was trained by optimizing the multinomial logistic regression objective (i.e., softmax cross-entropy) using mini-batch stochastic gradient descent with a batch size of 256, momentum of 0.9, and weight decay of 0.0005.

The learning rate was set to 0.01 and decreased by a factor of 10 when the validation accuracy stopped improving.
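
This "divide by 10 on plateau" schedule can be sketched in a few lines of Python (an illustrative toy, not the paper's code):

```python
# Multiply the learning rate by 0.1 each time validation accuracy
# fails to improve on the best value seen so far.
def schedule(val_acc_history, lr0=1e-2, factor=0.1):
    lr, best, lrs = lr0, float('-inf'), []
    for acc in val_acc_history:
        if acc <= best:      # no improvement -> decay
            lr *= factor
        else:
            best = acc
        lrs.append(lr)
    return lrs

print(schedule([0.3, 0.5, 0.5, 0.6, 0.6]))
```

With the toy accuracy trace above, the learning rate drops from 0.01 to 0.001 and then to 0.0001 at the two plateaus.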

A correct initialization of the network is very important to avoid the vanishing or exploding gradients that deep neural networks tend to suffer from. The original paper proposes two alternatives to initialize the different VGGNets:

- Train the shallow configuration A with random initialization first, then use its first four convolutional layers and its three fully connected layers to initialize the deeper configurations, initializing the remaining layers randomly.
- Use the random initialization procedure of Glorot and Bengio 4, which, as the authors note, makes pre-training unnecessary.
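
The Glorot/Xavier scheme 4 draws zero-mean weights with variance 2 / (fan_in + fan_out), which keeps activation and gradient magnitudes roughly stable with depth. A small NumPy sketch (the function name is mine):

```python
import numpy as np

def glorot_normal(fan_in: int, fan_out: int, rng) -> np.ndarray:
    """Zero-mean normal weights with variance 2 / (fan_in + fan_out)."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_out, fan_in))

rng = np.random.default_rng(0)
w = glorot_normal(512, 512, rng)
print(w.shape)                                     # (512, 512)
print(abs(w.std() - np.sqrt(2.0 / 1024)) < 1e-3)   # True: empirical std ~ target
```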

The implementation was based on the C++ Caffe toolbox, with important modifications to enable training on multiple GPUs simultaneously and to evaluate uncropped images at multiple scales.

The training was conducted on 4 NVIDIA Titan Black GPUs with 6GB of memory each and took around 2-3 weeks per network. The paper mentions that (for an unspecified configuration) training took 74 epochs, during which the learning rate was reduced three times. This is fewer epochs than AlexNet needed, which might be explained by the pre-initialization of some of the weights.

Testing

The strategy for testing VGGNet is similar to that of Sermanet et al. 1, and allows testing (square) images of different sizes by applying the network densely over the test image. This is achieved by converting the fully connected layers into convolutional layers. For instance, in VGGNet, the first fully connected layer has 4096 neurons and an input of shape 7x7x512. It can be converted into 4096 convolutional kernels of size 7x7x512, which produce an output of 1x1x4096. We can also verify that both have the same number of parameters: the fully connected layer stores a 4096 x (7 · 7 · 512) weight matrix, and the convolutional version stores 4096 kernels of 7 x 7 x 512 weights each, i.e., 4096 x 25,088 = 102,760,448 weights in both cases.

In order to convert the fully connected layers into convolutional layers, we need to reshape each of the weight matrices in the following way:

- FC1: the (4096, 25088) matrix becomes 4096 kernels of shape 7x7x512.
- FC2: the (4096, 4096) matrix becomes 4096 kernels of shape 1x1x4096.
- FC3: the (1000, 4096) matrix becomes 1000 kernels of shape 1x1x4096.
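
The equivalence is easy to verify numerically. The NumPy sketch below uses toy dimensions (16 units over a 7x7x8 input instead of 4096 over 7x7x512) to keep the arrays small:

```python
import numpy as np

# FC -> conv equivalence: an FC layer over a 7x7xC feature map behaves like
# one convolution kernel of shape 7x7xC per output unit, applied at a
# single spatial position.
rng = np.random.default_rng(0)
x = rng.standard_normal((7, 7, 8))            # conv feature map (toy size)
w_fc = rng.standard_normal((16, 7 * 7 * 8))   # FC weight matrix

fc_out = w_fc @ x.reshape(-1)                 # classic FC forward pass

w_conv = w_fc.reshape(16, 7, 7, 8)            # 16 kernels of shape 7x7x8
conv_out = np.tensordot(w_conv, x, axes=([1, 2, 3], [0, 1, 2]))

print(np.allclose(fc_out, conv_out))  # True: identical outputs
```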

Note that when the input image is 224x224, this transformation will lead to a 1x1x1000 tensor, which will be the input to the Softmax function. However, when the input size is larger than 224, the output of the new third convolutional layer will be (>1)x(>1)x1000. In order to obtain a vector with class scores, we will sum-pool (or average pool) along these dimensions to reduce the shape to 1x1x1000.
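
The spatial pooling step at the end can be sketched in a couple of NumPy lines (using random scores as a stand-in for real network outputs):

```python
import numpy as np

# For inputs larger than 224, the network emits a spatial grid of class
# scores (here 3x3x1000); average it over the spatial dimensions to get a
# single 1000-dimensional score vector (use .sum for sum-pooling).
rng = np.random.default_rng(0)
score_map = rng.standard_normal((3, 3, 1000))

class_scores = score_map.mean(axis=(0, 1))
print(class_scores.shape)  # (1000,)
```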

This method of evaluating images at test time is called dense evaluation in the original paper, and it removes the need to create multiple crops at test time, as was done for AlexNet. The paper analyzes results for all configurations when testing at a single scale (256 or 384) and at multiple scales ({256, 384, 512}), averaging the results over the three scales.

The authors also apply horizontal flipping at test time, averaging the soft-max outputs of the original and flipped images to obtain the final scores for the image.

Simonyan and Zisserman also studied the effect of multi-crop evaluation. They used the same approach as GoogLeNet 6, with 50 crops per scale (a 5x5 regular grid with 2 flips), for a total of 150 crops over 3 scales. Nonetheless, they acknowledged that the increased computation time of so many crops does not justify the potential gains in accuracy.

Results and Conclusions

The classification performance of the network is evaluated using two measures: top-1 and top-5 error. Top-1 error measures whether the correct class is the one with the highest predicted probability, and top-5 whether the correct class is among the five classes with the highest predicted probabilities.
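
These metrics are straightforward to compute. A small NumPy sketch on toy predictions (the function name is illustrative):

```python
import numpy as np

def topk_error(probs, labels, k):
    """Fraction of samples whose true label is NOT in the top-k classes."""
    topk = np.argsort(probs, axis=1)[:, -k:]   # k highest-scoring classes
    hits = [label in row for row, label in zip(topk, labels)]
    return 1.0 - float(np.mean(hits))

probs = np.array([[0.1, 0.6, 0.3],    # predicts class 1
                  [0.5, 0.2, 0.3]])   # predicts class 0
labels = np.array([1, 2])

print(topk_error(probs, labels, 1))  # 0.5: only the first sample is correct
print(topk_error(probs, labels, 2))  # 0.0: both true labels are in the top 2
```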

From the single-scale evaluation (either 256 or 384), the authors drew the following conclusions:

- Local Response Normalization does not help: A-LRN performs no better than A, so LRN was dropped from the deeper configurations.
- Classification error decreases with depth, from 11 layers (A) to 19 layers (E).
- Configuration D (3x3 kernels only) outperforms configuration C (which includes 1x1 kernels), suggesting that capturing spatial context matters, not just adding extra non-linearities.
- Scale jittering at training time (S in [256, 512]) leads to better results than training at a fixed scale.

The authors also compared the performance of configurations D and E when using different evaluation techniques, namely dense evaluation, multi-crop, and both. They observed that multiple crops performed slightly better than dense evaluation and that using both provided the best results. They concluded that these two approaches are complementary because they treat the convolution boundary conditions differently.

| VGGNet config | Evaluation method | top-1 val. error (%) | top-5 val. error (%) |
|---|---|---|---|
| D (VGG16) | dense | 24.8 | 7.5 |
| | multi-crop | 24.6 | 7.3 |
| | multi-crop & dense | 24.4 | 7.2 |
| E (VGG19) | dense | 24.8 | 7.5 |
| | multi-crop | 24.6 | 7.4 |
| | multi-crop & dense | 24.4 | 7.1 |

The table above contains the top-1 and top-5 validation error rates for VGG16 and VGG19 under the different evaluation techniques. The differences between them are quite small, even though configuration E (VGG19) contains 6 million more parameters, and multi-crop evaluation requires 50 forward passes per scale (150 over three scales) compared to 3 with dense evaluation. For all these reasons, VGG16 with dense evaluation might be a good compromise between accuracy and computational cost.

As is typically done, the authors also used ConvNet fusion, combining the outputs of several models by averaging their predictions, to improve overall performance. With this technique, they achieved a top-5 error rate of 6.8% using combined dense evaluation and multi-crop. In contrast, the best single model achieved a 7.1% top-5 error rate.
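
ConvNet fusion amounts to averaging the soft-max posteriors of the individual models, as in this toy NumPy sketch (the probabilities are made up for illustration):

```python
import numpy as np

# Average the per-class probabilities of two hypothetical models and
# classify with the fused distribution.
model_a = np.array([0.7, 0.2, 0.1])
model_b = np.array([0.5, 0.4, 0.1])

fused = np.mean([model_a, model_b], axis=0)
print(fused)                 # [0.6 0.3 0.1]
print(int(fused.argmax()))   # 0: both models agree on class 0
```

Averaging tends to cancel out the uncorrelated mistakes of the individual models, which is why ensembles reliably beat their best single member.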

The table below is presented in the paper and compares VGGNet with the state-of-the-art models at that time. VGGNet placed 2nd, just after GoogLeNet, when using ConvNet fusion, and 1st in single-model accuracy.

| Method | top-1 val. error (%) | top-5 val. error (%) | top-5 test error (%) |
|---|---|---|---|
| VGG (2 nets, multi-crop & dense eval.) | 23.7 | 6.8 | 6.8 |
| VGG (1 net, multi-crop & dense eval.) | 24.4 | 7.1 | 7.0 |
| VGG (ILSVRC submission, 7 nets, dense eval.) | 24.7 | 7.5 | 7.3 |
| GoogLeNet 6 (1 net) | - | 7.9 | 7.9 |
| GoogLeNet 6 (7 nets) | - | 6.7 | 6.7 |
| MSRA 7 (11 nets) | - | - | 8.1 |
| MSRA 7 (1 net) | 27.9 | 9.1 | 9.1 |
| Clarifai 8 (multiple nets) | - | - | 11.7 |
| Clarifai 8 (1 net) | - | - | 12.5 |
| Zeiler & Fergus 9 (6 nets) | 36.0 | 14.7 | 14.8 |
| Zeiler & Fergus 9 (1 net) | 37.5 | 16.0 | 16.1 |
| OverFeat 1 (7 nets) | 34.0 | 13.2 | 13.6 |
| OverFeat 1 (1 net) | 35.7 | 14.2 | - |
| Krizhevsky et al. 10 (7 nets) | 38.1 | 16.4 | 16.4 |
| Krizhevsky et al. 10 (1 net) | 40.7 | 18.2 | - |

The performance of VGGNet showed the importance of depth in convolutional networks with a simple architecture, following the steps of AlexNet. The authors noted, nonetheless, that after 19 layers, the accuracy of the architecture plateaued and convergence became difficult, indicating that innovations in the architecture would be required to increase the performance of these systems. This would give rise to a new type of convolutional neural network: ResNets.

Footnotes

  1. Sermanet, P. (2013). OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229.

  2. Ciresan, D. C., Meier, U., Masci, J., Gambardella, L. M., & Schmidhuber, J. (2011, June). Flexible, high performance convolutional neural networks for image classification. In Twenty-second international joint conference on artificial intelligence.

  3. Lin, M. (2013). Network in network. arXiv preprint arXiv:1312.4400.

  4. Glorot, X., & Bengio, Y. (2010, March). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics (pp. 249-256). JMLR Workshop and Conference Proceedings.

  5. Ciresan, D. C., Meier, U., Masci, J., Gambardella, L. M., & Schmidhuber, J. (2011, June). Flexible, high performance convolutional neural networks for image classification. In Twenty-second international joint conference on artificial intelligence.

  6. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., … & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1-9).

  7. He, K., Zhang, X., Ren, S., & Sun, J. (2015). Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9), 1904-1916.

  8. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115, 211-252.

  9. Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13 (pp. 818-833). Springer International Publishing.

  10. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25.
