
CHAPTER 1.

IMAGE CLASSIFICATION

1.1 AlexNet (2012)

Paper ImageNet Classification with Deep Convolutional Neural Networks
Authors Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton
Year 2012
Top 1/5 Accuracy 63.3% / 84.6%

1.1.1 Details

Architecture AlexNet
Architecture Modules ConvNets, Max Pooling, FC Layers
Parameters 60 million
Layers 8
Primary Activation ReLU
Weight Initialization N (0, 0.01) with biases set to 1
Image Input Size 227 x 227
Regularization Dropout, LocalResponseNorm
Data Augmentation RandomCrop, HorizontalFlip, RGB color shift
Optimizer SGD with momentum γ = 0.9, weight decay 0.0005
Learning Rate Schedule η0 = 0.01 - divide by 10 when val error stops improving
Batch Size 128
Epochs 90
Computation 2 GTX 580 GPUs
Training Time 132 hours
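
As a rough illustration (not the authors' original code), the optimizer and learning-rate schedule in the table above could be set up in PyTorch as follows, using the AlexNet class from Section 1.1.5; train_one_epoch and evaluate are hypothetical helpers, and ReduceLROnPlateau stands in for the manual divide-by-10 rule:

import torch

model = AlexNet(num_classes=1000)   # the AlexNet class defined in Section 1.1.5
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0005)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min',
                                                       factor=0.1, patience=5)

for epoch in range(90):
    train_one_epoch(model, optimizer)   # hypothetical training helper
    val_error = evaluate(model)         # hypothetical validation helper
    scheduler.step(val_error)           # drop the learning rate by 10x when val error plateaus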


1.1.2 Architecture Diagram

Figure 1.1: AlexNet Architecture

1.1.3 Summary

AlexNet was a major turning point in computer vision research, heralding the
decline of kernel-based methods and the beginning of the deep learning era. It
was the first large-scale convolutional neural network to compete and perform
well on ImageNet. It entered the ILSVRC-2012 competition and outperformed the
next best, non-deep-learning method by a significant margin, achieving a Top-5
error of 16.4% compared to 26.1%. This margin of victory, combined with the
plateauing progress of kernel-based methods, led the computer vision field to
shift en masse to the deep learning approach soon after.
Much of AlexNet’s architecture built on previous research on convolutional
networks. The base architecture follows LeNet closely with some important
differences (Lecun et al., 1998b). The most obvious difference is that AlexNet is a
much bigger network: it has much larger inputs (3 x 224 x 224 versus 32 x 32),
many more channels in the convolutional layers (e.g. 96 vs 25 in the first layer),
and wider and deeper layers.
More conceptual differences include the use of Max Pooling rather than Average
Pooling, which introduces more shift-invariance into the network; the use of
Rectified Linear Units (ReLU) rather than sigmoid activation functions, which
prevents gradient saturation; and dropout in the fully-connected layers to counter
overfitting. The paper is also notable for its heavy use of data augmentation, including
random cropping, resizing, horizontal flipping and color augmentation.
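
The augmentation pipeline described above can be approximated with torchvision transforms; this is only a sketch, and ColorJitter (with illustrative strengths) stands in for the paper's PCA-based RGB colour shift:

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize(256),                 # rescale the shorter side to 256
    transforms.RandomCrop(227),             # random 227 x 227 crops
    transforms.RandomHorizontalFlip(),      # horizontal reflections
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
])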


1.1.4 Ablation Studies

Experiment Result
ReLU Activations Achieves 25% training error 6 times faster than tanh
LocalResponseNorm Reduces Top-1 Error Rate by 1.4%
Overlapping Pooling Reduces Top-1 Error Rate by 0.4%
Random Cropping Aug Without it network suffers "substantial overfitting"
ColorAug Reduces Top-1 Error Rate by > 1%
Dropout Without it network suffers "substantial overfitting"
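
To make the overlapping-pooling entry above concrete, a small shape check (illustrative only, not from the paper): with kernel 3 and stride 2, neighbouring pooling windows overlap, yet the output size matches that of ordinary non-overlapping pooling on AlexNet's 55 x 55 first-layer feature map.

import torch
import torch.nn as nn

x = torch.randn(1, 96, 55, 55)                           # first-layer feature map size
print(nn.MaxPool2d(kernel_size=3, stride=2)(x).shape)    # torch.Size([1, 96, 27, 27]), overlapping windows
print(nn.MaxPool2d(kernel_size=2, stride=2)(x).shape)    # torch.Size([1, 96, 27, 27]), non-overlapping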

1.1.5 Code

The following network snippet is modified from dansuh17's implementation (https://github.com/dansuh17/alexnet-pytorch/blob/master/model.py):


import torch.nn as nn


class AlexNet(nn.Module):

    def __init__(self, num_classes=1000):
        super(AlexNet, self).__init__()
        # convolutional feature extractor
        self.net = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=96, kernel_size=11, stride=4), nn.ReLU(),
            nn.LocalResponseNorm(size=5, alpha=0.0001, beta=0.75, k=2),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, 5, padding=2), nn.ReLU(),
            nn.LocalResponseNorm(size=5, alpha=0.0001, beta=0.75, k=2),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        # fully connected classifier with dropout
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5, inplace=True),
            nn.Linear(in_features=(256 * 6 * 6), out_features=4096), nn.ReLU(),
            nn.Dropout(p=0.5, inplace=True),
            nn.Linear(in_features=4096, out_features=4096), nn.ReLU(),
            nn.Linear(in_features=4096, out_features=num_classes),
        )
        self.init_bias()

    def init_bias(self):
        # weights drawn from N(0, 0.01); biases of the 2nd, 4th and 5th conv layers set to 1
        for layer in self.net:
            if isinstance(layer, nn.Conv2d):
                nn.init.normal_(layer.weight, mean=0, std=0.01)
                nn.init.constant_(layer.bias, 0)
        nn.init.constant_(self.net[4].bias, 1)
        nn.init.constant_(self.net[10].bias, 1)
        nn.init.constant_(self.net[12].bias, 1)

    def forward(self, x):
        x = self.net(x)
        x = x.view(-1, 256 * 6 * 6)
        return self.classifier(x)
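
A quick shape check of the class above (illustrative usage, not part of the referenced implementation):

import torch

model = AlexNet(num_classes=1000)
logits = model(torch.randn(1, 3, 227, 227))   # 227 x 227 input, as in the details table
print(logits.shape)                           # torch.Size([1, 1000])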

5
CHAPTER 1. IMAGE CLASSIFICATION

1.2 ZFNet (2013)

Paper Visualizing and Understanding Convolutional Networks
Authors Matthew D Zeiler, Rob Fergus
Year 2013
Top 1/5 Accuracy 64% / 85.3%

1.2.1 Details

Architecture Modified AlexNet
Architecture Modules ConvNets, Max Pooling, FC Layers
Parameters 46 million
Layers 8
Primary Activation ReLU
Weight Initialization w0 = 0.01 with biases set to 0
Image Input Size 224 x 224
Regularization Dropout, LocalResponseNorm, RMSFilterNorm
Data Augmentation SubCropping, HorizontalFlip
Optimizer SGD with momentum γ = 0.9
Learning Rate Schedule η0 = 0.01 - divide by 10 when val error stops improving
Batch Size 128
Epochs 70
Computation 1 GTX 580 GPU
Training Time 288 hours


1.2.2 Architecture Diagram

Figure 1.2: ZFNet Architecture

1.2.3 Summary

ZFNet was conceived in the wake of the AlexNet breakthrough on the ImageNet
2012 competition. The availability of large datasets like ImageNet, powerful GPU
implementations and new regularization strategies like Dropout gave researchers
cause to turn towards convolutional models. Nevertheless, there was still lingering
dissatisfaction in the vision community about the lack of interpretability of
deep models.
Zeiler and Fergus (2013) tackle concerns about interpretability by introducing a
detailed toolkit that unpacks and visualizes how convolutional networks work. They
dissect trained networks by inserting "deconvolutions" and unpooling operations
to project feature activations back into input space. Using this they are able to
visualize what sort of features the filters of a trained network have captured. They
also perform occlusion sensitivity analysis, showing that trained networks respond
intuitively to different parts of an image, and invariance analyses, showing that
trained networks are stable to translation and scaling (although not very invariant to
rotation).
Using their visualization techniques, they observe that in a trained AlexNet
model the first layer filters capture mostly high and low frequency information
(with little coverage of mid frequencies), while the second layer features suffer from
aliasing artifacts caused by the large stride of 4 in the first layer convolutions.
These observations inform their changes for their ZFNet model, in which they
reduce first layer filter sizes from 11 x 11 to 7 x 7 and use a stride of 2 rather
than 4. This helps the architecture retain much more information in the first and
second layer features and improves classification performance.
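
A minimal sketch of the deconvnet idea (not the authors' code): approximately invert one conv, ReLU and max-pool stage of a trained network by reusing its own filters. The layer sizes below are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Conv2d(3, 96, kernel_size=7, stride=2)                   # stands in for a trained first layer
pool = nn.MaxPool2d(kernel_size=3, stride=2, return_indices=True)  # record the max "switch" locations

x = torch.randn(1, 3, 224, 224)
a = F.relu(conv(x))
p, switches = pool(a)

# Project a pooled feature map back towards pixel space: unpool using the recorded
# switches, re-apply ReLU, then filter with the transposed convolution weights.
unpool = nn.MaxUnpool2d(kernel_size=3, stride=2)
r = F.relu(unpool(p, switches, output_size=a.shape))
recon = F.conv_transpose2d(r, conv.weight, stride=2)
print(recon.shape)   # torch.Size([1, 3, 223, 223]) - roughly back at input resolution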


1.2.4 Ablation Studies

Experiment Result
Remove Final FCLayer Reduces Top-1 Error Rate by 0.5%
Remove Both FCLayers Increases Top-1 Error Rate by 4.3%
Remove Mid ConvLayers Increases Top-1 Error Rate by 4.9%
Remove Both FCLayers and Mid ConvLayers Increases Top-1 Error Rate by 30.8%
Change FCLayer Size Little Difference on Performance
Increase ConvNet size to 512,1024,512 maps Reduces Top-1 Error Rate by 3.0%

1.2.5 Code

import torch.nn as nn


class ZFNet(nn.Module):

    def __init__(self, num_classes=1000):
        super(ZFNet, self).__init__()
        self.net = nn.Sequential(
            # smaller first-layer filters (7 x 7) and stride (2) than AlexNet
            nn.Conv2d(in_channels=3, out_channels=96, kernel_size=7, stride=2), nn.ReLU(),
            nn.LocalResponseNorm(size=5, alpha=0.0001, beta=0.75, k=2),
            nn.MaxPool2d(kernel_size=3, stride=2),
            # second layer uses stride 2, as in the paper, giving a final 6 x 6 feature map
            nn.Conv2d(96, 256, 5, stride=2, padding=2), nn.ReLU(),
            nn.LocalResponseNorm(size=5, alpha=0.0001, beta=0.75, k=2),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5, inplace=True),
            nn.Linear(in_features=(256 * 6 * 6), out_features=4096), nn.ReLU(),
            nn.Dropout(p=0.5, inplace=True),
            nn.Linear(in_features=4096, out_features=4096), nn.ReLU(),
            nn.Linear(in_features=4096, out_features=num_classes),
        )
        self.init_bias()

    def init_bias(self):
        # weights drawn from N(0, 0.01); all biases set to 0
        for layer in self.net:
            if isinstance(layer, nn.Conv2d):
                nn.init.normal_(layer.weight, mean=0, std=0.01)
                nn.init.constant_(layer.bias, 0)

    def forward(self, x):
        x = self.net(x)
        x = x.view(-1, 256 * 6 * 6)
        return self.classifier(x)


1.3 Inception V1 (2014)

Paper Going deeper with convolutions


Authors Christian Szegedy, Wei Liu, Yangqing Jia, Pierre
Sermanet, Scott Reed, Dragomir Anguelov, Dumitru
Erhan, Vincent Vanhoucke, Andrew Rabinovich
Year 2014
Top 1/5 Accuracy 69.8% / 89.9%

1.3.1 Details

Architecture Inception
Architecture Modules Inception Modules, Convs, 1 x 1 Convs, Max Pooling, Global Avg Pooling, FC Layers, Aux Classifiers
Parameters 5 million
Layers 22
Primary Activation ReLU
Weight Initialization Gaussian Initialization; biases set to 0.2
Image Input Size 224 x 224
Regularization Dropout, LocalResponseNorm, Polyak averaging
Data Augmentation SubCropping, HorizontalFlip, Photometric Distortions
Optimizer SGD with momentum γ = 0.9
Learning Rate Schedule η0 = 0.01 - decrease by 4% every 8 epochs
Batch Size 128
Epochs 250
Computation 3 − 4 GPUs
Training Time <168 hours


1.3.2 Architecture Diagram

Figure 1.3: Inception (GoogLeNet) Architecture

Figure 1.4: Inception Module

1.3.3 Summary

The goal behind Inception networks was to build deeper networks while keeping the
structure of each local stage of the network efficient. Szegedy et al. (2014)
achieve this through the Inception module, which combines multiple filter sizes (1 x
1, 3 x 3, 5 x 5), allowing the network to learn the appropriate filter sizes at each level of
the network. This can be thought of as a learnt-filter version of Serre et al. (2007),
who handle the problem of varying scale by using several fixed filters of different sizes.
But using multiple filters means a naive Inception module carries a heavy computational
cost, with computation growing quadratically as modules are stacked. To stay within a
computational budget, therefore, Szegedy et al. (2014) utilise several ideas from
the Network-In-Network architecture of Lin et al. (2013).


First, they use 1 x 1 convolutions to reduce the number of channels in the input
to each Inception Module. This is computationally efficient, but it also reflects
the intuition that the optimal local network is likely sparse - activations from the
previous layer are likely highly correlated and not all of them are useful. Secondly,
they use global average pooling at the end of the architecture to average out
each feature map after the last convolutional layer. This greatly reduces the total number of
parameters in the network; it can be thought of as a replacement for the final
fully connected layers of AlexNet (which constituted about 90% of that network's
parameters).
An additional network feature is the use of auxiliary classifiers at lower levels of
the network. Their intended effect is to push gradient signal down to the lower
layers of the network so that it can train more easily and quickly.
Taken as a whole, the outcome is an efficient network with 12x fewer parameters
than AlexNet but superior accuracy on ImageNet.
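
To make the two ideas above concrete, here is a minimal sketch (not the paper's code; the channel counts and feature-map size are illustrative): a 1 x 1 convolution cheaply reduces the channel count, and global average pooling collapses each feature map to a single value before one small classifier layer.

import torch
import torch.nn as nn

x = torch.randn(1, 480, 14, 14)               # an intermediate feature map

reduce = nn.Conv2d(480, 96, kernel_size=1)    # 1 x 1 "bottleneck" projection: 480 -> 96 channels
gap = nn.AdaptiveAvgPool2d((1, 1))            # global average pooling over each channel
fc = nn.Linear(96, 1000)                      # replaces AlexNet's large fully connected stack

y = fc(torch.flatten(gap(reduce(x)), 1))
print(y.shape)                                # torch.Size([1, 1000])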

1.3.4 Code

Below we isolate the Inception module from the https://github.com/pytorch/vision library:
import torch
import torch.nn as nn


class Inception(nn.Module):

    def __init__(self, in_channels, ch1x1, ch3x3red, ch3x3, ch5x5red, ch5x5, pool_proj):
        super(Inception, self).__init__()

        # branch 1: 1 x 1 convolution
        self.branch1 = BasicConv2d(in_channels, ch1x1, kernel_size=1)

        # branch 2: 1 x 1 reduction followed by a 3 x 3 convolution
        self.branch2 = nn.Sequential(
            BasicConv2d(in_channels, ch3x3red, kernel_size=1),
            BasicConv2d(ch3x3red, ch3x3, kernel_size=3, padding=1)
        )

        # branch 3: 1 x 1 reduction followed by the "5 x 5" convolution (torchvision keeps
        # kernel_size=3 here for compatibility with its released weights)
        self.branch3 = nn.Sequential(
            BasicConv2d(in_channels, ch5x5red, kernel_size=1),
            BasicConv2d(ch5x5red, ch5x5, kernel_size=3, padding=1)
        )

        # branch 4: 3 x 3 max pooling followed by a 1 x 1 projection
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1, ceil_mode=True),
            BasicConv2d(in_channels, pool_proj, kernel_size=1)
        )

    def forward(self, x):
        branch1 = self.branch1(x)
        branch2 = self.branch2(x)
        branch3 = self.branch3(x)
        branch4 = self.branch4(x)

        # concatenate the branch outputs along the channel dimension
        outputs = [branch1, branch2, branch3, branch4]
        return torch.cat(outputs, 1)
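
The snippet depends on a BasicConv2d helper; in the torchvision source this is a convolution followed by batch normalisation and ReLU, roughly as sketched below. An illustrative instantiation with the channel counts of GoogLeNet's inception3a block follows.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicConv2d(nn.Module):
    def __init__(self, in_channels, out_channels, **kwargs):
        super(BasicConv2d, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, bias=False, **kwargs)
        self.bn = nn.BatchNorm2d(out_channels, eps=0.001)

    def forward(self, x):
        return F.relu(self.bn(self.conv(x)), inplace=True)

# inception3a: 192 input channels; branch outputs of 64, 128, 32 and 32 channels
block = Inception(192, 64, 96, 128, 16, 32, 32)
out = block(torch.randn(1, 192, 28, 28))
print(out.shape)   # torch.Size([1, 256, 28, 28]) - the four branches concatenated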


1.4 VGG (2014)

Paper Very Deep Convolutional Networks For Large-Scale Image Recognition
Authors Karen Simonyan, Andrew Zisserman
Year 2014
Top 1/5 Accuracy 74.5% / 92.0%

1.4.1 Details

Architecture VGG
Architecture Modules ConvNets, Max Pooling, FC Layers
Parameters 144 million
Layers 19
Primary Activation ReLU
Weight Initialization w0 ∼ N (0, 0.01); biases set to 0
Image Input Size 224 x 224
Regularization Weight Decay L2 penalty 5 × 10^-4, Dropout
Data Augmentation RandomCrop, HorizontalFlip, RGB color shift
Optimizer SGD with momentum γ = 0.9
Learning Rate Schedule η0 = 0.01 - divide by 10 when val error stops improving
Batch Size 256
Epochs 74
Computation 4 NVIDIA Titan Black GPUs
Training Time ∼ 504 hours


1.4.2 Architecture Diagram

Figure 1.5: VGG-19 Architecture

1.4.3 Summary

Simonyan and Zisserman (2014) focus on convolutional depth as a strategy to


improve classification accuracy. Existing improvements upon AlexNet at the time
had focussed on changes to the filter sizes in the convolutions. The original AlexNet
of Krizhevsky et al. (2012) used 11 x 11 size with stride 4, while ZFNet had used
size 7 x 7 with stride 2. VGG continues the trend by making large depth feasible
through very small (3 x 3) convolutional filters with stride 1.
A 3 x 3 filter is the smallest size that captures the notion of left/right, up/down
and center. Additionally, the effective receptive field of three stacked 3 x 3 layers
is still 7 x 7; the difference is that the stack contains three non-linear ReLU layers,
so the decision function is more discriminative than that of a single layer with a large
receptive field, and the stack uses roughly 45% fewer parameters (27C² versus 49C²
weights for C input and output channels).
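
A quick back-of-the-envelope check of this saving (with a hypothetical C = 512 channels, ignoring biases):

C = 512
single_7x7 = 7 * 7 * C * C       # one 7 x 7 layer: 49 C^2 weights
stack_3x3 = 3 * (3 * 3 * C * C)  # three stacked 3 x 3 layers: 27 C^2 weights
print(single_7x7, stack_3x3)     # 12845056 7077888 - roughly 45% fewer weights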
The remaining architecture is similar to AlexNet, with more layers (19) and
some additional tweaks, such as dropping local response normalization - the authors
note it does not improve performance but increases memory consumption and computation
time. The architecture has many more parameters than Inception, but has the
advantage of a less complex topology, which makes the features easier to reuse and
extend.
VGG achieved state-of-the-art performance on ImageNet, comfortably outperforming
AlexNet, and seemed to point towards increased depth as the key to improving
convolutional network performance.

1.4.4 Ablation Studies

Experiment Result
LocalResponseNorm Increases Top-1 Error Rate by 0.1%
Increasing Layers Monotonic Decrease in Top-1 Error Rate

1.4.5 Code

The code below is reproduced from the https://github.com/pytorch/vision library:


import torch
import torch.nn as nn


class VGG(nn.Module):

    def __init__(self, features, num_classes=1000, init_weights=True):
        super(VGG, self).__init__()
        self.features = features
        self.avgpool = nn.AdaptiveAvgPool2d((7, 7))
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, num_classes),
        )
        if init_weights:
            self._initialize_weights()

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.constant_(m.bias, 0)


def make_layers(cfg, batch_norm=False):
    layers = []
    in_channels = 3
    for v in cfg:
        if v == 'M':
            layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
        else:
            conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
            if batch_norm:
                layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
            else:
                layers += [conv2d, nn.ReLU(inplace=True)]
            in_channels = v
    return nn.Sequential(*layers)
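
For example, the VGG-19 configuration (called 'E' in the torchvision source) has 16 convolutional layers plus the 3 fully connected layers above:

cfg_e = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M',
         512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M']
vgg19 = VGG(make_layers(cfg_e), num_classes=1000)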


1.5 PReLU-Net (2015)

Paper Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
Authors Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian
Sun
Year 2015
Top 1/5 Accuracy 75.73% / 92.0%

1.5.1 Details

Architecture He-Sun CNN
Architecture Modules ConvNets, Max Pooling, Spatial Pyramid Pooling Layer, FC Layers
Parameters ?million
Layers 19
Primary Activation PReLU
Weight Initialization He Initialization; ai = 0.25 for PReLU weights
Image Input Size 224 x 224
Regularization Weight Decay L2 penalty 5 × 10^-4, Dropout
Data Augmentation RandomCrop, ScaleJitter, HorizontalFlip, RGB color shift
Optimizer SGD with momentum γ = 0.9
Learning Rate Schedule η0 = 0.01 - divide by 10 when val error stops improving
Batch Size 128
Epochs 80
Computation 8 GPUs


1.5.2 Diagram

Figure 1.6: PReLU activation

1.5.3 Summary

PReLU activations and He Initialization, the two principal contributions of He
et al. (2015), arose in a context of rapid architectural and regularization progress. One
driver of that progress was the ReLU activation, which improved convergence and gave
better solutions than sigmoid activations. In the paper, He et al. (2015) propose a
generalisation called PReLU, which adds a learnable coefficient for the negative part
of the ReLU non-linearity. Secondly, they consider the consequences of ReLU-like
activations for initialization, and derive a new initialization scheme that accounts
for the non-linearity and helps train deeper models.
They use a base convolutional network from He and Sun (2014), which is
similar to an Overfeat CNN, and adapt it with PReLU activations. The number of
extra parameters equals the total number of channels. They gain a 1.4%
reduction in Top-1 error rate from using PReLU activations over ReLUs. The
coefficients learnt for the first convolutional layer are well above zero (0.681 and
0.596). Since the filters of the first convolutional layer are mostly Gabor-like edge
and texture detectors, this indicates that both the positive and negative responses
of the filters are respected. For deeper layers the coefficients are generally positive
but smaller (0 - 0.2), which indicates that the activations gradually become more
non-linear with depth. In other words, the learned model keeps more information at
earlier stages and becomes more discriminative at deeper stages.
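
PyTorch ships a PReLU module matching this description; a brief usage sketch (the channel count is illustrative, init=0.25 matches the paper's initial value):

import torch
import torch.nn as nn

# PReLU(x) = max(0, x) + a * min(0, x), where the coefficient a is learned by backpropagation
channel_wise = nn.PReLU(num_parameters=64, init=0.25)   # one coefficient per channel
shared = nn.PReLU(num_parameters=1, init=0.25)          # channel-shared variant (see ablation below)
x = torch.randn(1, 64, 56, 56)
print(channel_wise(x).shape)                            # torch.Size([1, 64, 56, 56])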
For their second contribution, they propose a new initialization scheme - now
known as He Initialization. At the time, most networks were initialized from
Gaussian distributions with a fixed standard deviation, e.g. 0.01, and deeper models
(more than 8 layers at the time) had difficulty converging. The workarounds of the time
were pre-training and auxiliary classifiers, but these complicated and
lengthened training. He et al. (2015) note that Xavier initialization, a scaled uniform
initialization, is one alternative, but it assumes the activations are linear, which does
not hold for ReLU-like activations (Glorot and Bengio, 2010).
They derive a new initialization scheme that accounts for ReLU-like activations
and avoids magnifying or diminishing the magnitudes of the input signal exponentially.
The result is a zero-mean Gaussian distribution with standard deviation √(2/nl),
where nl is the number of input connections of layer l. They find that this
initialization scheme not only converges more quickly but also reduces the error
earlier in training. For much deeper networks, Xavier initialization does not converge
at all as the gradients diminish.
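
In PyTorch this corresponds to kaiming_normal_; a minimal sketch for a single convolutional layer (the layer sizes are illustrative):

import math

import torch.nn as nn

conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)

# Built-in helper: zero-mean Gaussian with std = gain / sqrt(fan_in), where gain = sqrt(2) for ReLU
nn.init.kaiming_normal_(conv.weight, mode='fan_in', nonlinearity='relu')
nn.init.zeros_(conv.bias)

# Equivalent manual form: n_l is the number of input connections (k * k * c_in)
n_l = conv.in_channels * conv.kernel_size[0] * conv.kernel_size[1]
nn.init.normal_(conv.weight, mean=0.0, std=math.sqrt(2.0 / n_l))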

1.5.4 Ablation Studies

Experiment Result
PReLU over ReLU Decreases Top-1 Error Rate by 1.18%
Channel-shared rather than channel-wise PReLU Increases Top-1 Error Rate by 0.07%

1.5.5 Code

The code below is reproduced from the https://github.com/pytorch library:

import math

import torch
from torch.nn.init import _calculate_correct_fan, calculate_gain


def kaiming_uniform_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu'):
    fan = _calculate_correct_fan(tensor, mode)
    gain = calculate_gain(nonlinearity, a)    # math.sqrt(2.0) for 'relu'
    std = gain / math.sqrt(fan)
    bound = math.sqrt(3.0) * std              # uniform bound giving the same std as the Gaussian form
    with torch.no_grad():
        return tensor.uniform_(-bound, bound)
