IMAGE CLASSIFICATION
1.1.1 Details
Architecture AlexNet
Architecture Modules ConvNets, Max Pooling, FC Layers
Parameters 60 million
Layers 8
Primary Activation ReLU
Weight Initialization N (0, 0.01) with biases set to 1
Image Input Size 227 x 227
Regularization Dropout, LocalResponseNorm
Data Augmentation RandomCrop, HorizontalFlip, RGB color shift
Optimizer SGD with momentum γ = 0.9, weight decay 0.0005
Learning Rate Schedule η0 = 0.01 - divide by 10 when val error stops improving
Batch Size 128
Epochs 90
Computation 2 GTX 580 GPUs
Training Time 132 hours
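The optimizer settings in the table can be sketched in PyTorch; the tiny stand-in model and the use of `ReduceLROnPlateau` to approximate the manual schedule are our assumptions, not the authors' code:

```python
import torch.nn as nn
import torch.optim as optim

# Tiny stand-in model; the real AlexNet has 5 conv + 3 FC layers.
model = nn.Linear(4096, 1000)

# SGD with momentum 0.9 and weight decay 0.0005, eta_0 = 0.01.
optimizer = optim.SGD(model.parameters(), lr=0.01,
                      momentum=0.9, weight_decay=5e-4)

# Divide the learning rate by 10 when validation error stops improving.
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer,
                                                 mode='min', factor=0.1)
```

During training one would call `scheduler.step(val_error)` once per epoch.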
CHAPTER 1. IMAGE CLASSIFICATION
1.1.3 Summary
AlexNet was a major turning point in computer vision research, heralding the
decline of kernel-based methods and the beginning of the deep learning era. It
was the first large-scale convolutional neural network to compete and perform
well on ImageNet. It entered the ILSVRC-2012 competition and outperformed the
next best non-deep-learning method by a significant margin, achieving a Top-5
error rate of 16.4% compared to 26.1%. This margin of victory, combined with the
asymptoting progress of kernel-based methods, led the computer vision field to
shift en masse to the deep learning approach soon after.
Much of AlexNet’s architecture built on previous research on convolutional
networks. The base architecture follows LeNet closely, with some important
differences (LeCun et al., 1998b). The most obvious difference is that AlexNet is a
much bigger network: it has much larger inputs (3 x 224 x 224 versus 32 x 32),
many more channels in the convolutional layers (e.g. 96 vs 25 in the first layer),
and wider and deeper layers.
More conceptual differences include the use of Max Pooling rather than
Average Pooling, which introduces more shift-invariance into the network; the use
of Rectified Linear Units (ReLU) instead of sigmoid activation functions, which
prevents gradient saturation; and dropout in the fully-connected layers to counter
overfitting. The paper is also notable for its heavy data augmentation, including
random cropping, resizing, horizontal flipping and color augmentation.
Experiment Result
ReLU Activations Achieves 25% training error 6 times faster than tanh
LocalResponseNorm Reduces Top-1 Error Rate by 1.4%
Overlapping Pooling Reduces Top-1 Error Rate by 0.4%
Random Cropping Aug Without it network suffers "substantial overfitting"
ColorAug Reduces Top-1 Error Rate by > 1%
Dropout Without it network suffers "substantial overfitting"
1.1.5 Code
https://github.com/dansuh17/alexnet-pytorch/blob/master/model.py
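For reference, a minimal PyTorch sketch of the architecture, written here from the paper's layer sizes; the linked repository may differ in detail:

```python
import torch
import torch.nn as nn

class AlexNet(nn.Module):
    """Sketch of AlexNet: 5 conv layers + 3 FC layers, ReLU, dropout."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))
```

With a 227 x 227 input, the feature map entering the classifier is 256 x 6 x 6, matching the first FC layer.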
1.2.1 Details
1.2.3 Summary
ZFNet was conceived in the wake of the AlexNet breakthrough on the ImageNet
2012 competition. The availability of large datasets like ImageNet, powerful GPU
implementations and new regularization strategies like Dropout gave researchers
cause to turn towards convolutional models. Nevertheless, there was still lingering
dissatisfaction in the vision community about the lack of interpretability of
deep models.
Zeiler and Fergus (2013) tackle concerns about interpretability by introducing a
detailed toolkit that unpacks and visualizes how convolutional networks work. They
dissect trained networks by inserting "deconvolutions" and unpooling operations
to project feature activations back into input space. Using this they are able to
visualize what sort of features the filters of a trained network have captured. They
also perform occlusion sensitivity analysis, showing that trained networks respond
intuitively to different parts of an image, and invariance analyses, showing that
trained networks are stable under translation and scaling (although not very
invariant to rotation).
Using their visualization techniques, they identify within a trained AlexNet
model that the first layer filters capture mostly high and low frequency phenomena
(little middle frequency), while the second layer filters appear to suffer from
aliasing artifacts caused by the large stride of 4 in the first layer convolutions.
These observations inform their changes for their ZFNet model, in which they
reduce first layer filter sizes from 11 x 11 to 7 x 7 and use a stride of 2 rather
than 4. This helps the architecture retain much more information in the first and
second layer features and improves classification performance.
Experiment Result
Remove Final FCLayer Reduces Top-1 Error Rate by 0.5%
Remove Both FCLayers Increases Top-1 Error Rate by 4.3%
Remove Mid ConvLayers Increases Top-1 Error Rate by 4.9%
Remove Both FCLayers and Mid ConvLayers Increases Top-1 Error Rate by 30.8%
Change FCLayer Size Little Difference on Performance
Increase ConvNet Size to 512, 1024, 512 maps Reduces Top-1 Error Rate by 3.0%
1.2.5 Code
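No listing survives here; as an illustrative sketch, the first-layer change described in the summary can be written in PyTorch as:

```python
import torch
import torch.nn as nn

# AlexNet's first layer: 11 x 11 filters, stride 4 (causes aliasing
# artifacts in the second-layer features).
alexnet_conv1 = nn.Conv2d(3, 96, kernel_size=11, stride=4)

# ZFNet's revision: 7 x 7 filters, stride 2, retaining more detail.
zfnet_conv1 = nn.Conv2d(3, 96, kernel_size=7, stride=2)

x = torch.randn(1, 3, 224, 224)
# The smaller stride yields a higher-resolution feature map:
print(alexnet_conv1(x).shape)  # (224-11)/4 + 1 = 54  -> 96 x 54 x 54
print(zfnet_conv1(x).shape)    # (224-7)/2 + 1 = 109  -> 96 x 109 x 109
```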
1.3.1 Details
Architecture Inception
Architecture Modules Inception Modules, Convs, 1 x 1 Convs, Max Pooling, Global Avg Pooling, FC Layers, Aux Classifiers
Parameters 5 million
Layers 22
Primary Activation ReLU
Weight Initialization Gaussian Initialization; biases set to 0.2
Image Input Size 224 x 224
Regularization Dropout, LocalResponseNorm, Polyak averaging
Data Augmentation SubCropping, HorizontalFlip, Photometric Distortions
Optimizer SGD with momentum γ = 0.9
Learning Rate Schedule η0 = 0.01 - decrease by 4% every 8 epochs
Batch Size 128
Epochs 250
Computation 3 − 4 GPUs
Training Time <168 hours
1.3.3 Summary
The goal behind Inception networks was to build deeper networks while also
achieving superior structure at local levels of the network. Szegedy et al. (2014)
achieve this through the Inception module, which combines multiple filter sizes (1 x
1, 3 x 3, 5 x 5), allowing the network to learn appropriate filter sizes at each level of
the network. This can be thought of as a learnt-filter version of Serre et al. (2007),
who address the problem of varying scale by using several filters of different sizes.
But using multiple filters means a naive Inception module carries a heavy
computational overhead, with computation increasing quadratically with each layer.
To stay within a computational budget, therefore, Szegedy et al. (2014) utilise
several ideas from the Network-In-Network architecture of Lin et al. (2013).
First, they use 1 x 1 convolutions to reduce the number of channels in the input
to each Inception Module. This is computationally efficient, but it also reflects
the intuition that the optimal local network is likely sparse - activations from the
previous layer are likely highly correlated, and not all of them are useful. Secondly,
they use global average pooling at the end of the architecture to average out
the channels after the last convolutional layer. This reduces the total number of
parameters in the network; it can be thought of as a replacement for the final
fully connected layers of AlexNet (which constituted 90% of that network's
parameters).
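Both ideas are easy to see in PyTorch; the channel counts here are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 480, 14, 14)  # hypothetical mid-network activation

# 1 x 1 convolution: cheap channel reduction (480 -> 96) before an
# expensive larger convolution.
reduce = nn.Conv2d(480, 96, kernel_size=1)

# Global average pooling: one value per channel, replacing the large
# fully connected layers at the end of AlexNet.
gap = nn.AdaptiveAvgPool2d(1)

print(reduce(x).shape)          # 1 x 96 x 14 x 14
print(gap(x).flatten(1).shape)  # 1 x 480
```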
An additional network feature is the use of auxiliary classifiers at lower levels of
the network. The intended effect of these classifiers was to get gradient information
closer to lower levels of the network so it could train more easily and more quickly.
Taken as a whole, the outcome is an efficient network with 12x fewer parameters
than AlexNet but superior accuracy on ImageNet.
1.3.4 Code
def __init__(self, in_channels, ch1x1, ch3x3red, ch3x3, ch5x5red, ch5x5, pool_proj):
    super(Inception, self).__init__()
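The fragment above can be filled out into a complete Inception module. The following sketch follows the branch composition described in the summary; the `conv_block` helper and branch names are ours:

```python
import torch
import torch.nn as nn

def conv_block(in_c, out_c, **kwargs):
    """Convolution followed by ReLU (helper, not from the paper)."""
    return nn.Sequential(nn.Conv2d(in_c, out_c, **kwargs),
                         nn.ReLU(inplace=True))

class Inception(nn.Module):
    def __init__(self, in_channels, ch1x1, ch3x3red, ch3x3,
                 ch5x5red, ch5x5, pool_proj):
        super(Inception, self).__init__()
        self.branch1 = conv_block(in_channels, ch1x1, kernel_size=1)
        self.branch2 = nn.Sequential(
            conv_block(in_channels, ch3x3red, kernel_size=1),  # 1x1 reduce
            conv_block(ch3x3red, ch3x3, kernel_size=3, padding=1))
        self.branch3 = nn.Sequential(
            conv_block(in_channels, ch5x5red, kernel_size=1),  # 1x1 reduce
            conv_block(ch5x5red, ch5x5, kernel_size=5, padding=2))
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            conv_block(in_channels, pool_proj, kernel_size=1))

    def forward(self, x):
        # Concatenate the four parallel branches along the channel axis.
        return torch.cat([self.branch1(x), self.branch2(x),
                          self.branch3(x), self.branch4(x)], dim=1)
```

The output channel count is `ch1x1 + ch3x3 + ch5x5 + pool_proj`, so stacked modules simply widen along the channel dimension.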
1.4.1 Details
Architecture VGG
Architecture Modules ConvNets, Max Pooling, FC Layers
Parameters 144 million
Layers 19
Primary Activation ReLU
Weight Initialization w0 ∼ N (0, 0.01); biases set to 0
Image Input Size 224 x 224
Regularization Weight Decay (L2 penalty 5 × 10⁻⁴), Dropout
Data Augmentation RandomCrop, HorizontalFlip, RGB color shift
Optimizer SGD with momentum γ = 0.9
Learning Rate Schedule η0 = 0.01 - divide by 10 when val error stops improving
Batch Size 256
Epochs 74
Computation 4 NVIDIA Titan Black GPUs
Training Time ∼ 504 hours
1.4.3 Summary
VGG has the advantage of a less complex topology, which makes its features
easier to reuse and extend.
VGG achieved state-of-the-art performance on ImageNet, with superior per-
formance to AlexNet, and seemed to point towards increased depth as the key to
increasing convolutional network performance.
Experiment Result
LocalResponseNorm Increases Top-1 Error Rate by 0.1%
Increasing Layers Monotonic Decrease in Top-1 Error Rate
1.4.5 Code
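No listing is given; a minimal sketch of VGG's repeated 3 x 3 convolution blocks follows. The VGG-16 stage configuration is from the paper; the `vgg_block` helper is ours:

```python
import torch
import torch.nn as nn

def vgg_block(in_c, out_c, n_convs):
    """A VGG stage: n_convs 3x3 convolutions, then 2x2 max pooling."""
    layers = []
    for _ in range(n_convs):
        layers += [nn.Conv2d(in_c, out_c, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
        in_c = out_c
    layers.append(nn.MaxPool2d(2, stride=2))
    return nn.Sequential(*layers)

# VGG-16 convolutional stem: (in, out, number of convs) per stage.
stem = nn.Sequential(*[vgg_block(i, o, n) for i, o, n in
                       [(3, 64, 2), (64, 128, 2), (128, 256, 3),
                        (256, 512, 3), (512, 512, 3)]])

x = torch.randn(1, 3, 224, 224)
print(stem(x).shape)  # five 2x halvings of 224 give 1 x 512 x 7 x 7
```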
1.5.1 Details
1.5.2 Diagram
1.5.3 Summary
Earlier solutions were pre-training and auxiliary classifiers, but these
complicated and lengthened training. He et al. (2015) note that Xavier
initialization, with its scaled uniform distribution, is one alternative, but it
assumes activations are linear, which does not hold for ReLU-like activations
(Glorot and Bengio, 2010).
They derive a new initialization scheme that accounts for ReLU-like activations
and avoids exponentially magnifying the magnitudes of the input. The result is a
zero-mean Gaussian distribution whose standard deviation is √(2/nl), where nl is
the number of connections in the layer. They find that this initialization scheme
not only converges much more quickly but also reduces the error much earlier. For
much deeper networks, Xavier initialization does not converge at all as gradients
diminish.
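The scheme can be sketched directly, and PyTorch ships it as `kaiming_normal_`; the layer sizes below are illustrative:

```python
import math
import torch
import torch.nn as nn

conv = nn.Conv2d(64, 128, kernel_size=3)

# He initialization: W ~ N(0, 2 / n_l), where n_l is the fan-in of the
# layer (kernel_h * kernel_w * in_channels for a convolution).
n_l = conv.in_channels * conv.kernel_size[0] * conv.kernel_size[1]
nn.init.normal_(conv.weight, mean=0.0, std=math.sqrt(2.0 / n_l))
nn.init.zeros_(conv.bias)

# Equivalent built-in (fan-in mode with ReLU gain sqrt(2)):
nn.init.kaiming_normal_(conv.weight, nonlinearity='relu')
```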
Experiment Result
PReLU over ReLU Decreases Top-1 Error Rate by 1.18%
Channel-shared (rather than channel-wise) PReLU Increases Top-1 Error Rate by 0.07%
1.5.5 Code
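No listing is given; PReLU is available directly in PyTorch. The channel count here is illustrative:

```python
import torch
import torch.nn as nn

# Channel-wise PReLU: one learnable slope per channel (the variant the
# paper finds best); num_parameters=1 gives the channel-shared form.
prelu = nn.PReLU(num_parameters=64, init=0.25)

x = torch.randn(1, 64, 8, 8)
y = prelu(x)  # y = max(0, x) + a_c * min(0, x)
```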