
CHAPTER 1.

IMAGE CLASSIFICATION

1.1 AlexNet (2012)

Paper ImageNet Classification with Deep Convolutional Neural Networks
Authors Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton
Year 2012
Top 1/5 Accuracy 63.3% / 84.6%

1.1.1 Details

Architecture AlexNet
Architecture Modules ConvNets, Max Pooling, FC Layers
Parameters 60 million
Layers 8
Primary Activation ReLU
Weight Initialization N (0, 0.01) with biases set to 1
Image Input Size 227 x 227
Regularization Dropout, LocalResponseNorm
Data Augmentation RandomCrop, HorizontalFlip, RGB color shift
Optimizer SGD with momentum γ = 0.9, weight decay 0.0005
Learning Rate Schedule η0 = 0.01 - divide by 10 when val error stops improving
Batch Size 128
Epochs 90
Computation 2 GTX 580 GPUs
Training Time 132 hours
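
As a rough illustration (not the authors' original code), the optimizer and learning-rate schedule in the table above could be set up in PyTorch as follows, using the AlexNet class from Section 1.1.5; train_one_epoch and evaluate are hypothetical helpers, and ReduceLROnPlateau stands in for the manual divide-by-10 rule:

import torch

model = AlexNet(num_classes=1000)   # the AlexNet class defined in Section 1.1.5
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0005)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min',
                                                       factor=0.1, patience=5)

for epoch in range(90):
    train_one_epoch(model, optimizer)   # hypothetical training helper
    val_error = evaluate(model)         # hypothetical validation helper
    scheduler.step(val_error)           # drop the learning rate by 10x when val error plateaus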


1.1.2 Architecture Diagram

Figure 1.1: AlexNet Architecture

1.1.3 Summary

AlexNet was a major turning point in computer vision research, heralding the
decline of kernel-based methods and the beginning of the deep learning era. It
was the first large-scale convolutional neural network to compete and perform
well on ImageNet. It entered the ILSVRC-2012 competition and outperformed the
next best, non-deep-learning method by a significant margin, achieving a Top-5
error of 16.4% compared to 26.1%. This margin of victory, combined with the
plateauing progress of kernel-based methods, led the computer vision field to
shift en masse to the deep learning approach soon after.
Much of AlexNet’s architecture built on previous research on convolutional
networks. The base architecture follows LeNet closely with some important
differences (Lecun et al., 1998b). The most obvious difference is that AlexNet is a
much bigger network: it has much larger inputs (3 x 224 x 224 versus 32 x 32),
many more channels in the convolutional layers (e.g. 96 vs 25 in the first layer),
and wider and deeper layers.
More conceptual differences include the use of Max Pooling rather than Average
Pooling, which introduces more shift-invariance into the network; the use of
Rectified Linear Units (ReLU) rather than sigmoid activation functions, which
prevents gradient saturation; and dropout in the fully-connected layers to counter
overfitting. The paper is also notable for its heavy use of data augmentation, including
random cropping, resizing, horizontal flipping and color augmentation.
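
The augmentation pipeline described above can be approximated with torchvision transforms; this is only a sketch, and ColorJitter (with illustrative strengths) stands in for the paper's PCA-based RGB colour shift:

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize(256),                 # rescale the shorter side to 256
    transforms.RandomCrop(227),             # random 227 x 227 crops
    transforms.RandomHorizontalFlip(),      # horizontal reflections
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
])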


1.1.4 Ablation Studies

Experiment Result
ReLU Activations Achieves 25% training error 6 times faster than tanh
LocalResponseNorm Reduces Top-1 Error Rate by 1.4%
Overlapping Pooling Reduces Top-1 Error Rate by 0.4%
Random Cropping Aug Without it network suffers "substantial overfitting"
ColorAug Reduces Top-1 Error Rate by > 1%
Dropout Without it network suffers "substantial overfitting"
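
To make the overlapping-pooling entry above concrete, a small shape check (illustrative only, not from the paper): with kernel 3 and stride 2, neighbouring pooling windows overlap, yet the output size matches that of ordinary non-overlapping pooling on AlexNet's 55 x 55 first-layer feature map.

import torch
import torch.nn as nn

x = torch.randn(1, 96, 55, 55)                           # first-layer feature map size
print(nn.MaxPool2d(kernel_size=3, stride=2)(x).shape)    # torch.Size([1, 96, 27, 27]), overlapping windows
print(nn.MaxPool2d(kernel_size=2, stride=2)(x).shape)    # torch.Size([1, 96, 27, 27]), non-overlapping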

1.1.5 Code

The following network snippet is modified from dansuh17's implementation (https://github.com/dansuh17/alexnet-pytorch/blob/master/model.py):


import torch.nn as nn


class AlexNet(nn.Module):

    def __init__(self, num_classes=1000):
        super(AlexNet, self).__init__()
        # convolutional feature extractor
        self.net = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=96, kernel_size=11, stride=4), nn.ReLU(),
            nn.LocalResponseNorm(size=5, alpha=0.0001, beta=0.75, k=2),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, 5, padding=2), nn.ReLU(),
            nn.LocalResponseNorm(size=5, alpha=0.0001, beta=0.75, k=2),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        # fully connected classifier with dropout
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5, inplace=True),
            nn.Linear(in_features=(256 * 6 * 6), out_features=4096), nn.ReLU(),
            nn.Dropout(p=0.5, inplace=True),
            nn.Linear(in_features=4096, out_features=4096), nn.ReLU(),
            nn.Linear(in_features=4096, out_features=num_classes),
        )
        self.init_bias()

    def init_bias(self):
        # weights drawn from N(0, 0.01); biases of the 2nd, 4th and 5th conv layers set to 1
        for layer in self.net:
            if isinstance(layer, nn.Conv2d):
                nn.init.normal_(layer.weight, mean=0, std=0.01)
                nn.init.constant_(layer.bias, 0)
        nn.init.constant_(self.net[4].bias, 1)
        nn.init.constant_(self.net[10].bias, 1)
        nn.init.constant_(self.net[12].bias, 1)

    def forward(self, x):
        x = self.net(x)
        x = x.view(-1, 256 * 6 * 6)
        return self.classifier(x)
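
A quick shape check of the class above (illustrative usage, not part of the referenced implementation):

import torch

model = AlexNet(num_classes=1000)
logits = model(torch.randn(1, 3, 227, 227))   # 227 x 227 input, as in the details table
print(logits.shape)                           # torch.Size([1, 1000])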

5
CHAPTER 1. IMAGE CLASSIFICATION

1.2 ZFNet (2013)

Paper Visualizing and Understanding Convolutional Networks
Authors Matthew D Zeiler, Rob Fergus
Year 2013
Top 1/5 Accuracy 64% / 85.3%

1.2.1 Details

Architecture Modified AlexNet
Architecture Modules ConvNets, Max Pooling, FC Layers
Parameters 46 million
Layers 8
Primary Activation ReLU
Weight Initialization w0 = 0.01 with biases set to 0
Image Input Size 224 x 224
Regularization Dropout, LocalResponseNorm, RMSFilterNorm
Data Augmentation SubCropping, HorizontalFlip
Optimizer SGD with momentum γ = 0.9
Learning Rate Schedule η0 = 0.01 - divide by 10 when val error stops improving
Batch Size 128
Epochs 70
Computation 1 GTX 580 GPU
Training Time 288 hours


1.2.2 Architecture Diagram

Figure 1.2: ZFNet Architecture

1.2.3 Summary

ZFNet was conceived in the wake of the AlexNet breakthrough on the ImageNet
2012 competition. The availability of large datasets like ImageNet, powerful GPU
implementations and new regularization strategies like Dropout gave researchers
cause to turn towards convolutional models. Nevertheless, there was still lingering
dissatisfaction in the vision community about the lack of interpretability of
deep models.
Zeiler and Fergus (2013) tackle concerns about interpretability by introducing a
detailed toolkit that unpacks and visualizes how convolutional networks work. They
dissect trained networks by inserting "deconvolutions" and unpooling operations
to project feature activations back into input space. Using this they are able to
visualize what sort of features the filters of a trained network have captured. They
also perform occlusion sensitivity analysis, showing that trained networks respond
intuitively to different parts of an image, and invariance analyses, showing that
trained networks are stable to translation and scaling (although not very invariant to
rotation).
Using their visualization techniques, they observe that in a trained AlexNet
model the first layer filters capture mostly high and low frequency information
(with little coverage of mid frequencies), while the second layer features suffer from
aliasing artifacts caused by the large stride of 4 in the first layer convolutions.
These observations inform their changes for their ZFNet model, in which they
reduce first layer filter sizes from 11 x 11 to 7 x 7 and use a stride of 2 rather
than 4. This helps the architecture retain much more information in the first and
second layer features and improves classification performance.
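
A minimal sketch of the deconvnet idea (not the authors' code): approximately invert one conv, ReLU and max-pool stage of a trained network by reusing its own filters. The layer sizes below are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Conv2d(3, 96, kernel_size=7, stride=2)                   # stands in for a trained first layer
pool = nn.MaxPool2d(kernel_size=3, stride=2, return_indices=True)  # record the max "switch" locations

x = torch.randn(1, 3, 224, 224)
a = F.relu(conv(x))
p, switches = pool(a)

# Project a pooled feature map back towards pixel space: unpool using the recorded
# switches, re-apply ReLU, then filter with the transposed convolution weights.
unpool = nn.MaxUnpool2d(kernel_size=3, stride=2)
r = F.relu(unpool(p, switches, output_size=a.shape))
recon = F.conv_transpose2d(r, conv.weight, stride=2)
print(recon.shape)   # torch.Size([1, 3, 223, 223]) - roughly back at input resolution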


1.2.4 Ablation Studies

Experiment Result
Remove Final FCLayer Reduces Top-1 Error Rate by 0.5%
Remove Both FCLayers Increases Top-1 Error Rate by 4.3%
Remove Mid ConvLayers Increases Top-1 Error Rate by 4.9%
Remove Both FCLayers and Mid ConvLayers Increases Top-1 Error Rate by 30.8%
Change FCLayer Size Little Difference on Performance
Increase ConvNet size to 512,1024,512 maps Reduces Top-1 Error Rate by 3.0%

1.2.5 Code

import torch.nn as nn


class ZFNet(nn.Module):

    def __init__(self, num_classes=1000):
        super(ZFNet, self).__init__()
        self.net = nn.Sequential(
            # smaller first-layer filters (7 x 7) and stride (2) than AlexNet
            nn.Conv2d(in_channels=3, out_channels=96, kernel_size=7, stride=2), nn.ReLU(),
            nn.LocalResponseNorm(size=5, alpha=0.0001, beta=0.75, k=2),
            nn.MaxPool2d(kernel_size=3, stride=2),
            # second layer uses stride 2, as in the paper, giving a final 6 x 6 feature map
            nn.Conv2d(96, 256, 5, stride=2, padding=2), nn.ReLU(),
            nn.LocalResponseNorm(size=5, alpha=0.0001, beta=0.75, k=2),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5, inplace=True),
            nn.Linear(in_features=(256 * 6 * 6), out_features=4096), nn.ReLU(),
            nn.Dropout(p=0.5, inplace=True),
            nn.Linear(in_features=4096, out_features=4096), nn.ReLU(),
            nn.Linear(in_features=4096, out_features=num_classes),
        )
        self.init_bias()

    def init_bias(self):
        # weights drawn from N(0, 0.01); all biases set to 0
        for layer in self.net:
            if isinstance(layer, nn.Conv2d):
                nn.init.normal_(layer.weight, mean=0, std=0.01)
                nn.init.constant_(layer.bias, 0)

    def forward(self, x):
        x = self.net(x)
        x = x.view(-1, 256 * 6 * 6)
        return self.classifier(x)


1.3 Inception V1 (2014)

Paper Going deeper with convolutions


Authors Christian Szegedy, Wei Liu, Yangqing Jia, Pierre
Sermanet, Scott Reed, Dragomir Anguelov, Dumitru
Erhan, Vincent Vanhoucke, Andrew Rabinovich
Year 2014
Top 1/5 Accuracy 69.8% / 89.9%

1.3.1 Details

Architecture Inception
Architecture Modules Inception Modules, Convs, 1 x 1 Convs, Max Pooling, Global Avg Pooling, FC Layers, Aux Classifiers
Parameters 5 million
Layers 22
Primary Activation ReLU
Weight Initialization Gaussian Initialization; biases set to 0.2
Image Input Size 224 x 224
Regularization Dropout, LocalResponseNorm, Polyak averaging
Data Augmentation SubCropping, HorizontalFlip, Photometric Distortions
Optimizer SGD with momentum γ = 0.9
Learning Rate Schedule η0 = 0.01 - decrease by 4% every 8 epochs
Batch Size 128
Epochs 250
Computation 3 − 4 GPUs
Training Time <168 hours


1.3.2 Architecture Diagram

Figure 1.3: Inception (GoogLeNet) Architecture

Figure 1.4: Inception Module

1.3.3 Summary

The goal behind Inception networks was to build deeper networks while keeping the
structure of each local stage of the network efficient. Szegedy et al. (2014)
achieve this through the Inception module, which combines multiple filter sizes (1 x
1, 3 x 3, 5 x 5), allowing the network to learn the appropriate filter sizes at each level of
the network. This can be thought of as a learnt-filter version of Serre et al. (2007),
who handle the problem of varying scale by using several fixed filters of different sizes.
But using multiple filters means a naive Inception module carries a heavy computational
cost, with computation growing quadratically as modules are stacked. To stay within a
computational budget, therefore, Szegedy et al. (2014) utilise several ideas from
the Network-In-Network architecture of Lin et al. (2013).


First, they use 1 x 1 convolutions to reduce the number of channels in the input
to each Inception Module. This is computationally efficient, but it also reflects
the intuition that the optimal local network is likely sparse - activations from the
previous layer are likely highly correlated and not all of them are useful. Secondly,
they use global average pooling at the end of the architecture to average out
each feature map after the last convolutional layer. This greatly reduces the total number of
parameters in the network; it can be thought of as a replacement for the final
fully connected layers of AlexNet (which constituted about 90% of that network's
parameters).
An additional network feature is the use of auxiliary classifiers at lower levels of
the network. Their intended effect is to push gradient signal down to the lower
layers of the network so that it can train more easily and quickly.
Taken as a whole, the outcome is an efficient network with 12x fewer parameters
than AlexNet but superior accuracy on ImageNet.
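
To make the two ideas above concrete, here is a minimal sketch (not the paper's code; the channel counts and feature-map size are illustrative): a 1 x 1 convolution cheaply reduces the channel count, and global average pooling collapses each feature map to a single value before one small classifier layer.

import torch
import torch.nn as nn

x = torch.randn(1, 480, 14, 14)               # an intermediate feature map

reduce = nn.Conv2d(480, 96, kernel_size=1)    # 1 x 1 "bottleneck" projection: 480 -> 96 channels
gap = nn.AdaptiveAvgPool2d((1, 1))            # global average pooling over each channel
fc = nn.Linear(96, 1000)                      # replaces AlexNet's large fully connected stack

y = fc(torch.flatten(gap(reduce(x)), 1))
print(y.shape)                                # torch.Size([1, 1000])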

1.3.4 Code

Below we isolate the Inception module from the https://github.com/pytorch/vision library:
import torch
import torch.nn as nn


class Inception(nn.Module):

    def __init__(self, in_channels, ch1x1, ch3x3red, ch3x3, ch5x5red, ch5x5, pool_proj):
        super(Inception, self).__init__()

        # branch 1: 1 x 1 convolution
        self.branch1 = BasicConv2d(in_channels, ch1x1, kernel_size=1)

        # branch 2: 1 x 1 reduction followed by a 3 x 3 convolution
        self.branch2 = nn.Sequential(
            BasicConv2d(in_channels, ch3x3red, kernel_size=1),
            BasicConv2d(ch3x3red, ch3x3, kernel_size=3, padding=1)
        )

        # branch 3: 1 x 1 reduction followed by the "5 x 5" convolution (torchvision keeps
        # kernel_size=3 here for compatibility with its released weights)
        self.branch3 = nn.Sequential(
            BasicConv2d(in_channels, ch5x5red, kernel_size=1),
            BasicConv2d(ch5x5red, ch5x5, kernel_size=3, padding=1)
        )

        # branch 4: 3 x 3 max pooling followed by a 1 x 1 projection
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1, ceil_mode=True),
            BasicConv2d(in_channels, pool_proj, kernel_size=1)
        )

    def forward(self, x):
        branch1 = self.branch1(x)
        branch2 = self.branch2(x)
        branch3 = self.branch3(x)
        branch4 = self.branch4(x)

        # concatenate the branch outputs along the channel dimension
        outputs = [branch1, branch2, branch3, branch4]
        return torch.cat(outputs, 1)
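
The snippet depends on a BasicConv2d helper; in the torchvision source this is a convolution followed by batch normalisation and ReLU, roughly as sketched below. An illustrative instantiation with the channel counts of GoogLeNet's inception3a block follows.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicConv2d(nn.Module):
    def __init__(self, in_channels, out_channels, **kwargs):
        super(BasicConv2d, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, bias=False, **kwargs)
        self.bn = nn.BatchNorm2d(out_channels, eps=0.001)

    def forward(self, x):
        return F.relu(self.bn(self.conv(x)), inplace=True)

# inception3a: 192 input channels; branch outputs of 64, 128, 32 and 32 channels
block = Inception(192, 64, 96, 128, 16, 32, 32)
out = block(torch.randn(1, 192, 28, 28))
print(out.shape)   # torch.Size([1, 256, 28, 28]) - the four branches concatenated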


1.4 VGG (2014)

Paper Very Deep Convolutional Networks For Large-Scale Image Recognition
Authors Karen Simonyan, Andrew Zisserman
Year 2014
Top 1/5 Accuracy 74.5% / 92.0%

1.4.1 Details

Architecture VGG
Architecture Modules ConvNets, Max Pooling, FC Layers
Parameters 144 million
Layers 19
Primary Activation ReLU
Weight Initialization w0 ∼ N (0, 0.01); biases set to 0
Image Input Size 224 x 224
Regularization Weight Decay L2 penalty 5 × 10^-4, Dropout
Data Augmentation RandomCrop, HorizontalFlip, RGB color shift
Optimizer SGD with momentum γ = 0.9
Learning Rate Schedule η0 = 0.01 - divide by 10 when val error stops improving
Batch Size 256
Epochs 74
Computation 4 NVIDIA Titan Black GPUs
Training Time ∼ 504 hours


1.4.2 Architecture Diagram

Figure 1.5: VGG-19 Architecture

1.4.3 Summary

Simonyan and Zisserman (2014) focus on convolutional depth as a strategy to


improve classification accuracy. Existing improvements upon AlexNet at the time
had focussed on changes to the filter sizes in the convolutions. The original AlexNet
of Krizhevsky et al. (2012) used 11 x 11 size with stride 4, while ZFNet had used
size 7 x 7 with stride 2. VGG continues the trend by making large depth feasible
through very small (3 x 3) convolutional filters with stride 1.
A 3 x 3 filter is the smallest size that captures the notion of left/right, up/down
and center. Additionally, the effective receptive field of three stacked 3 x 3 layers
is still 7 x 7; the difference is that the stack contains three non-linear ReLU layers,
so the decision function is more discriminative than that of a single layer with a large
receptive field, and the stack uses roughly 45% fewer parameters (27C² versus 49C²
weights for C input and output channels).
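
A quick back-of-the-envelope check of this saving (with a hypothetical C = 512 channels, ignoring biases):

C = 512
single_7x7 = 7 * 7 * C * C       # one 7 x 7 layer: 49 C^2 weights
stack_3x3 = 3 * (3 * 3 * C * C)  # three stacked 3 x 3 layers: 27 C^2 weights
print(single_7x7, stack_3x3)     # 12845056 7077888 - roughly 45% fewer weights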
The remaining architecture is similar to AlexNet, with more layers (19) and
some additional tweaks, such as dropping local response normalization - the authors
note it does not improve performance but increases memory consumption and computation
time. The architecture has many more parameters than Inception, but has the
advantage of a less complex topology, which makes the features easier to reuse and
extend.
VGG achieved state-of-the-art performance on ImageNet, comfortably outperforming
AlexNet, and seemed to point towards increased depth as the key to improving
convolutional network performance.

1.4.4 Ablation Studies

Experiment Result
LocalResponseNorm Increases Top-1 Error Rate by 0.1%
Increasing Layers Monotonic Decrease in Top-1 Error Rate

1.4.5 Code

The code below is reproduced from the https://github.com/pytorch/vision library:


import torch
import torch.nn as nn


class VGG(nn.Module):

    def __init__(self, features, num_classes=1000, init_weights=True):
        super(VGG, self).__init__()
        self.features = features
        self.avgpool = nn.AdaptiveAvgPool2d((7, 7))
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, num_classes),
        )
        if init_weights:
            self._initialize_weights()

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.constant_(m.bias, 0)


def make_layers(cfg, batch_norm=False):
    layers = []
    in_channels = 3
    for v in cfg:
        if v == 'M':
            layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
        else:
            conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
            if batch_norm:
                layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
            else:
                layers += [conv2d, nn.ReLU(inplace=True)]
            in_channels = v
    return nn.Sequential(*layers)
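
For example, the VGG-19 configuration (called 'E' in the torchvision source) has 16 convolutional layers plus the 3 fully connected layers above:

cfg_e = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M',
         512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M']
vgg19 = VGG(make_layers(cfg_e), num_classes=1000)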


1.5 PReLU-Net (2015)

Paper Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
Authors Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian
Sun
Year 2015
Top 1/5 Accuracy 75.73% / 92.0%

1.5.1 Details

Architecture He-Sun CNN
Architecture Modules ConvNets, Max Pooling, Spatial Pyramid Pooling Layer, FC Layers
Parameters ?million
Layers 19
Primary Activation PReLU
Weight Initialization He Initialization; ai = 0.25 for PReLU weights
Image Input Size 224 x 224
Regularization Weight Decay L2 penalty 5 × 10^-4, Dropout
Data Augmentation RandomCrop, ScaleJitter, HorizontalFlip, RGB color shift
Optimizer SGD with momentum γ = 0.9
Learning Rate Schedule η0 = 0.01 - divide by 10 when val error stops improving
Batch Size 128
Epochs 80
Computation 8 GPUs


1.5.2 Diagram

Figure 1.6: PReLU activation

1.5.3 Summary

PReLU activations and He Initialization, the two principal contributions of He
et al. (2015), arose in a context of rapid architectural and regularization progress. One
driver of that progress was the ReLU activation, which improved convergence and gave
better solutions than sigmoid activations. In the paper, He et al. (2015) propose a
generalisation called PReLU, which adds a learnable coefficient for the negative part
of the ReLU non-linearity. Secondly, they consider the consequences of ReLU-like
activations for initialization, and derive a new initialization scheme that accounts
for the non-linearity and helps train deeper models.
They use a base convolutional network from He and Sun (2014), which is
similar to an Overfeat CNN, and adapt it with PReLU activations. The number of
extra parameters equals the total number of channels. They gain a 1.4%
reduction in Top-1 error rate from using PReLU activations over ReLUs. The
coefficients learnt for the first convolutional layer are well above zero (0.681 and
0.596). Since the filters of the first convolutional layer are mostly Gabor-like edge
and texture detectors, this indicates that both the positive and negative responses
of the filters are respected. For deeper layers the coefficients are generally positive
but smaller (0 - 0.2), which indicates that the activations gradually become more
non-linear with depth. In other words, the learned model keeps more information at
earlier stages and becomes more discriminative at deeper stages.
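
PyTorch ships a PReLU module matching this description; a brief usage sketch (the channel count is illustrative, init=0.25 matches the paper's initial value):

import torch
import torch.nn as nn

# PReLU(x) = max(0, x) + a * min(0, x), where the coefficient a is learned by backpropagation
channel_wise = nn.PReLU(num_parameters=64, init=0.25)   # one coefficient per channel
shared = nn.PReLU(num_parameters=1, init=0.25)          # channel-shared variant (see ablation below)
x = torch.randn(1, 64, 56, 56)
print(channel_wise(x).shape)                            # torch.Size([1, 64, 56, 56])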
For their second contribution, they propose a new initialization scheme - now
known as He Initialization. At the time, most networks were initialized from
Gaussian distributions with a fixed standard deviation, e.g. 0.01, and deeper models
(more than 8 layers at the time) had difficulty converging. The workarounds of the time
were pre-training and auxiliary classifiers, but these complicated and
lengthened training. He et al. (2015) note that Xavier initialization, a scaled uniform
initialization, is one alternative, but it assumes the activations are linear, which does
not hold for ReLU-like activations (Glorot and Bengio, 2010).
They derive a new initialization scheme that accounts for ReLU-like activations
and avoids magnifying or diminishing the magnitudes of the input signal exponentially.
The result is a zero-mean Gaussian distribution with standard deviation √(2/nl),
where nl is the number of input connections of layer l. They find that this
initialization scheme not only converges more quickly but also reduces the error
earlier in training. For much deeper networks, Xavier initialization does not converge
at all as the gradients diminish.
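
In PyTorch this corresponds to kaiming_normal_; a minimal sketch for a single convolutional layer (the layer sizes are illustrative):

import math

import torch.nn as nn

conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)

# Built-in helper: zero-mean Gaussian with std = gain / sqrt(fan_in), where gain = sqrt(2) for ReLU
nn.init.kaiming_normal_(conv.weight, mode='fan_in', nonlinearity='relu')
nn.init.zeros_(conv.bias)

# Equivalent manual form: n_l is the number of input connections (k * k * c_in)
n_l = conv.in_channels * conv.kernel_size[0] * conv.kernel_size[1]
nn.init.normal_(conv.weight, mean=0.0, std=math.sqrt(2.0 / n_l))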

1.5.4 Ablation Studies

Experiment Result
PReLU over ReLU Decreases Top-1 Error Rate by 1.18%
Channel-shared rather than channel-wise PReLU Increases Top-1 Error Rate by 0.07%

1.5.5 Code

The code below is reproduced from the https://github.com/pytorch library:

import math

import torch
from torch.nn.init import _calculate_correct_fan, calculate_gain


def kaiming_uniform_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu'):
    fan = _calculate_correct_fan(tensor, mode)
    gain = calculate_gain(nonlinearity, a)    # math.sqrt(2.0) for 'relu'
    std = gain / math.sqrt(fan)
    bound = math.sqrt(3.0) * std              # uniform bound giving the same std as the Gaussian form
    with torch.no_grad():
        return tensor.uniform_(-bound, bound)
