Abstract— In this paper, we propose a novel approach for efficient training of deep neural networks in a bottom-up fashion using a layered structure. Our algorithm, which we refer to as deep cascade learning, is motivated by the cascade correlation approach of Fahlman and Lebiere, who introduced it in the context of perceptrons. We demonstrate our algorithm on networks of convolutional layers, though its applicability is more general. Such training of deep networks in a cascade directly circumvents the well-known vanishing gradient problem by ensuring that the output is always adjacent to the layer being trained. We present empirical evaluations comparing our deep cascade training with standard end–end training using backpropagation of two convolutional neural network architectures on benchmark image classification tasks (CIFAR-10 and CIFAR-100). We then investigate the features learned by the approach and find that better, domain-specific, representations are learned in early layers when compared to what is learned in end–end training. This is partially attributable to the vanishing gradient problem, which inhibits early layer filters from changing significantly from their initial settings. While both networks perform similarly overall, recognition accuracy increases progressively with each added layer, with discriminative features learned in every stage of the network, whereas in end–end training, no such systematic feature representation was observed. We also show that such cascade training has significant computational and memory advantages over end–end training, and can be used as a pretraining algorithm to obtain better performance.

Index Terms— Adaptive learning, cascade correlation, deep convolutional neural networks (CNNs), deep learning, image classification.

I. INTRODUCTION

[...] these networks because the gradient-based weight updates derived through the chain rule for differentiation are the products of n small numbers, where n is the number of layers being backward propagated through. In this paper, we aim to directly tackle the vanishing gradient problem by proposing a training algorithm that trains the network from the bottom to the top layer incrementally, and ensures that the layers being trained are always close to the output layer. This algorithm has advantages in terms of complexity by reducing training time and can potentially also use less memory. The algorithm also has prospective use in building architectures without static depth that adapt their complexity to the data.

Several attempts have been proposed to circumvent complexity in learning. Platt [2] developed the Resource Allocating Network, which allocates memory based on the number of captured patterns and learns these representations quickly. This network was then further enhanced by changing the LMS algorithm to include the extended Kalman filter, and through pruning and replacement it was improved both in terms of memory and performance [3], [4]. Further, Shadafan et al. [5] present a sequential construction of multilayer perceptron (MLP) classifiers trained locally by the recursive least squares algorithm. Compressing, pruning, and binarization of the weights in a model have also been developed to diminish the learning complexity of convolutional neural networks (CNNs) [6], [7].

In the late 1980s, Fahlman and Lebiere [1] proposed the cascade correlation algorithm/architecture as an approach to sequentially train perceptrons and connect their outputs to [...]
Algorithm 1 Pseudocode of Cascade Learning Adapted From the Cascade Correlation Algorithm [25]. Training Is Performed in Batches, Hence Every Epoch Is Performed by Doing Backpropagation Through All the Batches of the Data

procedure CASCADELEARNING(layers, η, epochs, epochsUpdate, out)
 2:  Inputs  layers: model layer parameters (loss function, activation, regularization, number of filters, size of filters, stride)
     η: learning rate
 4:  epochs: starting number of epochs
     k: epochs update constant
 6:  out: output block specifications
     Output  W: L layers with w_l trained weights per layer
 8:  for layer_index = 1 : L do                        ▷ Cascade through the trainable layers
         Init new layer and connect output block
10:      i_l ← epochs + k × layer_index
         for i = 0; i < i_l; i++ do                    ▷ Loop through the data i_l times
12:          w_new ← w_old − η∇J(w)                    ▷ Update weights by gradient descent
             if validation error plateaus then
14:              η ← η/10                              ▷ Reduce the learning rate when the update criterion is satisfied
             end if
16:      end for
         Disconnect output block and get new inputs
18:  end for
end procedure
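As a concrete illustration, a minimal PyTorch-style sketch of Algorithm 1 might look as follows. The layer list, the output-block constructor, and the data loader are placeholders rather than the paper's exact configuration; the learning-rate drop on a validation plateau is omitted, and pseudoinputs are regenerated on the fly (the disk-caching variant discussed later is more efficient).

```python
# Minimal sketch of Algorithm 1 (hypothetical layer/output-block choices).
import torch
import torch.nn as nn
import torch.nn.functional as F

def cascade_learning(layers, make_output_block, loader, eta=0.01, epochs=10, k=10,
                     device="cpu"):
    """Train `layers` one at a time, each with a freshly attached output block."""
    trained = []                                    # layers whose weights are now fixed
    for layer_index, layer in enumerate(layers, start=1):
        out_block = make_output_block(layer)        # fresh output block matched to this layer
        model = nn.Sequential(layer, out_block).to(device)
        opt = torch.optim.SGD(model.parameters(), lr=eta)
        i_l = epochs + k * layer_index              # i_l <- epochs + k * layer_index
        for epoch in range(i_l):                    # loop through the data i_l times
            for x, y in loader:
                x, y = x.to(device), y.to(device)
                # pseudoinputs: forward the batch through the already-trained layers
                with torch.no_grad():
                    for fixed in trained:
                        x = fixed(x)
                opt.zero_grad()
                loss = F.cross_entropy(model(x), y)
                loss.backward()                     # gradient only reaches the current layer
                opt.step()                          # w_new <- w_old - eta * grad J(w)
        layer.requires_grad_(False)                 # disconnect output block; freeze the layer
        trained.append(layer.eval())                # keep it as a fixed feature extractor
    return trained
```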
[...] by forward propagating the actual inputs through the (fixed) first layer. This process can then repeat until all layers have been learned. At each stage, the pseudoinputs are generated by forward propagating the actual inputs through all the previously trained layers. It should be noted that once layer weights have been learned, they are fixed for all subsequent stages. Fig. 1 gives a graphical overview of the entire process.

Most hyperparameters in the algorithm remain the same across each layer; however, we have found it beneficial to dynamically increase the number of learning epochs as we get deeper into the network. Additionally, we start training the initial layers with orders of magnitude fewer epochs than we would if training end to end. The rationale for this is that each subnetwork fits the data faster than the end–end model and we do not want to overfit the data, especially in the lower layers. Overfitting in the lower layers would severely hamper the generalization ability of later layers. In our experiments, we have found that the number of epochs required to fit the data depends on the layer index: if a layer requires i epochs, the subsequent layer should require i + k epochs, where k is a constant whose value is set depending on the data set.

A particular advantage of such cascaded training is that the backward propagated gradient is not diminished by hidden layers as happens in end–end training. This is because every trainable layer is immediately adjacent to the output block. In essence, this should help the network obtain more robust representations at every layer. In Section I-B, we demonstrate this by comparing confusion matrices at different layers of networks trained using deep cascade learning and standard end–end backpropagation. The other advantages, as demonstrated in Section I-B, are that the complexity of learning is reduced over end–end learning, both in terms of training time and memory.

2) Cascade Learning as Supervised Pretraining Algorithm: A particular appeal of deep neural networks is pretraining the weights to obtain a better initialization, and further achieve better minima. Starting from the work of Hinton et al. [10] on deep belief networks, unsupervised learning has been considered in the past as effective pretraining, initializing the weights, which are then improved in a supervised learning setting. Although this was a great motivation, recent architectures [12], [21], [28] have ignored this and focused on pure supervised learning with random initialization.

Cascade learning can be used to initialize the filters in a CNN and diminish the impact of the vanishing gradient problem. After the weights have been pretrained using cascade learning, the network is tuned using traditional end–end training (both stages are supervised). When applying this procedure, it is imperative to reinitialize the output block after pretraining the network; otherwise, the network would rapidly reach the suboptimal minimum obtained by the cascade learning. This does not provide better performance in terms of accuracy. In Section I-B3, we discuss how this technique may lead the network to better generalization.

3) Time Complexity: In a CNN, the time complexity of the convolutional layers is

O\left(\sum_{l=1}^{d} n_{l-1}\, s_l^{2}\, n_l\, m_l^{2}\, i\right)    (1)

where i is the number of training iterations, l is the layer index, d is the last layer index, n is the number of filters,
s and m are the sizes of the input and output (spatial size), respectively¹ [29].

¹ Note that this is the time complexity of a single forward pass; training increases this by a constant factor of about 3.

Training a CNN using the deep cascade learning algorithm changes the time complexity as follows:

O\left(\sum_{l=1}^{d} n_{l-1}\, s_l^{2}\, n_l\, m_l^{2}\, i_l\right)    (2)

where i_l represents the number of training iterations for the lth layer. The main difference between the two equations is the number of epochs for every layer: in (1), i is constant, while in (2), i_l depends on the layer index. Note that in this analysis we have purposefully ignored the cost of performing the forward passes to compute the pseudoinputs, as this is essentially "free" if the algorithm is implemented in two threads (described in the following). The number of iterations in the cascade algorithm depends on the data set and the model architecture. The algorithm proportionally increases the number of epochs on every iteration, since the early layers must not be overfit, while the later layers should be trained to more closely fit the data. In practice, as shown in the simulations (Section I-B), one can choose each i_l such that i_1 ≪ i and i_L ≤ i, and obtain almost equivalent performance to the end–end trained network in a much shorter period of time. If \sum_{l=1}^{d} i_l = i, the time complexity of both training algorithms is the same, noting that improvements coming from caching the pseudoinputs are not considered.

There are two main ways of implementing the algorithm. The best and most efficient approach is to save the pseudoinputs on disk once they have been computed; to compute the pseudoinputs for the next layer, one only has to forward propagate the cached pseudoinputs through a single layer. An alternate, naive, approach would be to implement the algorithm using two threads (or two GPUs), with one thread using the already trained layers to generate the pseudoinputs on demand and the other thread training the current layer. The disadvantage of this is that it would require the input to be forward propagated on each iteration. The first approach can further drastically decrease the runtime of the algorithm and the memory required to train the model, at the expense of disk space used for storing cached pseudoinputs.

4) Space Complexity: When considering the space complexity and memory usage of a network, we consider not only the number of parameters of the model, but also the amount of data that needs to be in memory in order to perform training of those parameters. In standard end–end backpropagation, intermediary results (e.g., response maps from convolutional layers and vectors from dense layers) need to be stored for an iteration of backpropagation. With modern hardware and optimizers (based on variants of minibatch stochastic gradient descent), we usually consider batches of data being used for the training, so the amount of intermediary data at each layer is multiplied by the batch size.

Aside from offline storage for caching pseudoinputs and storing trained weights, the cascade algorithm only requires that the weights of a single model layer, the output block weights, and the pseudoinputs of the current training batch are stored in RAM (on the CPU or GPU) at any one time. This potentially allows memory to be used much more effectively and allows models to be trained whose weights exceed the amount of available memory; however, this is drastically affected by the choice of output block architecture, and also by the depth and overall architecture of the network in question.

TABLE I: Space complexity of end–end training of various depths of VGG-style networks. The number of parameters increases with depth. The data storage units of the training depend on the computational precision.

To explore this further, consider the parameter and data complexity of a Visual Geometry Group Net (VGG-style network) of different depths. Assume that we can grow the depth in the same way as going between the VGG-16 and VGG-19 models in the original paper [26] (note, we are considering Model D in the original paper to be VGG-16 and Model E to be VGG-19), whereby to generate the next deeper architecture, we add an additional convolutional layer to the last three blocks of similarly sized convolutions. This process allows us to define models VGG-22, VGG-25, VGG-28, etc. The number of parameters and training memory complexity of these models is shown in Table I. The numbers in this table were computed on the assumption of a batch size of 1, an input size of 32 × 32, and an output block (last three fully connected/dense layers) consisting of 512, 256, and 10 units, respectively. The remainder of the model matches the description in the original paper [26], with blocks of 64, 128, 256, and 512 convolutional filters with a spatial size of 3 × 3 and the relevant max pooling between blocks. For simplicity, we assume that the convolutional filters are zero-padded so the size of the input does not diminish.

The key point to note from Table I is that as the model gets bigger, the amount of memory required for both parameters and for data storage of end–end training increases. With our proposed cascade learning approach, this is not the case; the total memory complexity is purely a function of the most complex cascaded subnetwork (the network trained in one iteration of the cascade learning). In the case of all the above VGG-style networks, this happens very early in the cascading process. More specifically, this happens when cascading the second layer, as can be seen in Table II, which illustrates that after the second layer (or more concretely, after the first max pooling), the complexity of subsequent iterations of cascading reduces. The assumption in computing the numbers in Table II is that the output blocks mirrored those of the end–end training and had 512, 256, and 10 units, respectively.

If we consider Tables I and II together, we can see with the architectures in question that for smaller networks the end–end training will use less memory (although it is slower), while [...]
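To make the memory argument above easier to check, the hedged sketch below approximates training memory as parameters plus stored activations for a batch size of 1 and 32 × 32 inputs, with a 512–256–10 output block. The block layout is an assumption loosely in the spirit of [26]; the printed figures are indicative only and will not match Tables I and II, but they show the same trend: the end–end cost grows with depth while the largest cascaded stage stays constant.

```python
# Illustrative Table I/II style comparison; layer plan and counting are assumptions.
def conv_params(c_in, c_out, k=3):
    return k * k * c_in * c_out + c_out

def dense_params(n_in, n_out):
    return n_in * n_out + n_out

def output_block_cost(c, s):
    """Parameters and activations of a 512-256-10 dense block on a (c, s, s) map."""
    params = dense_params(c * s * s, 512) + dense_params(512, 256) + dense_params(256, 10)
    return params, 512 + 256 + 10

def vgg_style(extra_per_late_block=0):
    """(c_in, c_out, pool_after) conv specs; deeper variants add convs to the last 3 blocks."""
    blocks = [(64, 2), (128, 2), (256, 3 + extra_per_late_block),
              (512, 3 + extra_per_late_block), (512, 3 + extra_per_late_block)]
    specs, c_in = [], 3
    for c_out, n_convs in blocks:
        for j in range(n_convs):
            specs.append((c_in, c_out, j == n_convs - 1))   # pool after each block
            c_in = c_out
    return specs

for extra, name in [(0, "VGG-16-like"), (1, "VGG-19-like"), (2, "VGG-22-like"),
                    (3, "VGG-25-like"), (4, "VGG-28-like"), (5, "VGG-31-like")]:
    s, end_params, end_acts, stage_costs = 32, 0, 0, []
    for c_in, c_out, pool in vgg_style(extra):
        p = conv_params(c_in, c_out)
        s_out = s // 2 if pool else s
        act = c_out * s_out * s_out
        ob_params, ob_acts = output_block_cost(c_out, s_out)
        # one cascaded iteration: this layer, its pseudoinputs, and its own output block
        stage_costs.append(p + c_in * s * s + act + ob_params + ob_acts)
        end_params += p
        end_acts += act
        s = s_out
    ob_params, ob_acts = output_block_cost(512, s)
    end_total = end_params + end_acts + ob_params + ob_acts
    print(f"{name}: end-to-end ~{end_total / 1e6:.1f}M storage units, "
          f"largest cascaded stage ~{max(stage_costs) / 1e6:.1f}M storage units")
```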
MARQUEZ et al.: DEEP CASCADE LEARNING 5479
Fig. 2. Time complexity comparison between cascade learning, end–end, and pretraining using cascade learning (see Section I-B3 for details and results on pretraining using cascade learning). Multiple VGG networks were executed within a range of starting numbers of epochs (10–100) (left) and depths (3, 6, 9) (right). Black pentagons: runs executing the naive approach for both cascade learning and the pretraining stage. Blue solid dots: optimal run, which caches the pseudoinputs after every iteration.
TABLE III: Parameter complexity comparison using different output block specifications shows the effect of using between 64 and 512 units in the first fully connected layer (which is most correlated with the complexity). Left: number of parameters. Right: accuracy. The bottom row shows the parameter complexity of the end–end model. The increase in memory complexity at early stages can be naively reduced by decreasing n. Potentially, memory reduction techniques on the first fully connected layer are applicable at early stages of the network. Later layers are less complex.
Fig. 6. Magnitude of the gradient after every minibatch forward pass on the convolutional layers of the end–end training (right) and the concatenated
gradients of every cascade learning (left) iteration. Vertical lines: start of a new iteration. Curves were smoothed by averaging the gradients (of every batch)
on every epoch.
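For readers who want to reproduce this kind of measurement, the fragment below is one hedged way to record per-layer gradient magnitudes in PyTorch; the model and training-loop names are placeholders, not the authors' code.

```python
# Record the gradient magnitude of each convolutional layer after backward().
import torch

def gradient_norms(model):
    """L2 norm of the weight gradient of every Conv2d layer in `model`."""
    norms = {}
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Conv2d) and module.weight.grad is not None:
            norms[name] = module.weight.grad.norm().item()
    return norms

# Typical use inside a training loop, after loss.backward():
#   history.append(gradient_norms(model))
# Averaging the recorded norms over each epoch reproduces the smoothing
# described in the caption of Fig. 6.
```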
TABLE IV: Comparison of accuracy per layer using the cascade algorithm and end–end training on CIFAR-100 in both architectures. The cascade learning outperforms in almost all the layers in a VGG network, and almost achieves the same accuracy in the final stage. The All-CNN with end–end training outperforms in the final classification; however, the first three layers do not learn strong correlations like when using the cascade learning.
Fig. 8. Performance comparison on CIFAR-10 between pretrained network and random initialization. Left: VGG. Right: All-CNN. The step bumps in the
cascade learning are generated due to the start of a new iteration or changes in the learning rate.
Fig. 9. Filters on the first layer at different stages of the procedure on the VGG network defined in Section I-B1. (a) Initial random weights. (b) End–end. (c) Cascaded. (d) End–end trained network initialized by cascade learning.
[...] initializations of end–end trained networks and achieve performance improvements.

The weights are initialized randomly. Then the procedure is divided into two stages: first, we cascade the network with the minimum number of epochs to diminish its time complexity; finally, the network is fine-tuned using backpropagation and stochastic gradient descent, similar to the end–end training. We applied this technique using a VGG-style network and the All-CNN. For more details on the architectures, refer to Section I-B1.

Fig. 8 shows the difference in performance given random and cascade learning initialization. The learning curves in Fig. 8 are for the VGG and the All-CNN architectures trained on CIFAR-10. The improvements in testing accuracy vary between ∼2% and ∼3% for the experiments developed in this section. However, the most interesting property comes as a consequence of the variation of the resulting weights after executing the cascade learning. As shown in Section I-B, this variation is significantly smaller in contrast with its end–end counterpart. Hence, the results obtained after preinitializing the network are more stable and less affected by poor initialization. Results in Fig. 2 show that even including the time of the tuning training stage, the time complexity can be reduced if the correct parameters for the cascade learning are chosen. It is important to mention that the end–end training typically requires up to 250 epochs, while the tuning stage may only require a small fraction, since the training is stopped when the training accuracy reaches ∼0.999.

The filters generated by the cascade learning are slightly overfitted (the first layer typically achieves ∼60% on the unseen data and ∼95% on the training data), as opposed to the end–end training, in which the filters are more likely to be close to their initialization. By pretraining with cascade learning, the network learns filters that are in between both scenarios (underfitting and overfitting); this behavior can be seen in Fig. 9.

Fig. 10 shows the test accuracy during training of a cascade-pretrained VGG model on CIFAR-100. Improvements [...]
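A minimal sketch of the fine-tuning stage of this two-stage procedure is given below. It assumes the frozen layers returned by a cascade pretraining stage and the same hypothetical make_output_block and loader placeholders as the earlier sketch; the stopping rule mirrors the ∼0.999 training-accuracy criterion mentioned above, while the optimizer settings are placeholders.

```python
# Hedged sketch: reinitialize the output block after cascade pretraining and
# fine-tune the whole stack end to end with SGD.
import torch
import torch.nn as nn
import torch.nn.functional as F

def finetune_end_to_end(pretrained_layers, make_output_block, loader,
                        eta=0.01, target_acc=0.999, device="cpu"):
    # Unfreeze the cascade-pretrained layers and attach a *fresh* output block.
    for layer in pretrained_layers:
        layer.requires_grad_(True).train()
    model = nn.Sequential(*pretrained_layers,
                          make_output_block(pretrained_layers[-1])).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=eta, momentum=0.9)

    while True:
        correct = total = 0
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            logits = model(x)
            F.cross_entropy(logits, y).backward()
            opt.step()
            correct += (logits.argmax(1) == y).sum().item()
            total += y.numel()
        if correct / total >= target_acc:   # stop once training accuracy reaches ~0.999
            return model
```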
[15] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2014, pp. 1–8.
[16] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3431–3440.
[17] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "DeepFace: Closing the gap to human-level performance in face verification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2014, pp. 1701–1708.
[18] O. Abdel-Hamid, A.-R. Mohamed, H. Jiang, and G. Penn, "Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2012, pp. 4277–4280.
[19] C. Yan, H. Xie, S. Liu, J. Yin, Y. Zhang, and Q. Dai, "Effective Uyghur language text detection in complex background images for traffic prompt identification," IEEE Trans. Intell. Transp. Syst., vol. 19, no. 1, pp. 220–229, Jan. 2018.
[20] C. Yan, H. Xie, D. Yang, J. Yin, Y. Zhang, and Q. Dai, "Supervised hash coding with deep neural network for environment perception of intelligent vehicles," IEEE Trans. Intell. Transp. Syst., vol. 19, no. 1, pp. 284–295, Jan. 2018.
[21] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1026–1034.
[22] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. (2016). "Deep networks with stochastic depth." [Online]. Available: https://arxiv.org/abs/1603.09382
[23] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. 32nd Int. Conf. Mach. Learn., 2015, pp. 448–456.
[24] A. Veit, M. J. Wilber, and S. Belongie, "Residual networks behave like ensembles of relatively shallow networks," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 550–558.
[25] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. Hoboken, NJ, USA: Wiley, 2012.
[26] K. Simonyan and A. Zisserman. (2014). "Very deep convolutional networks for large-scale image recognition." [Online]. Available: https://arxiv.org/abs/1409.1556
[27] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller. (2014). "Striving for simplicity: The all convolutional net." [Online]. Available: https://arxiv.org/abs/1412.6806
[28] G. Larsson, M. Maire, and G. Shakhnarovich. (2016). "FractalNet: Ultra-deep neural networks without residuals." [Online]. Available: https://arxiv.org/abs/1605.07648
[29] K. He and J. Sun, "Convolutional neural networks at constrained time cost," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 5353–5360.
[30] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. 27th Int. Conf. Mach. Learn., 2010, pp. 1–8. [Online]. Available: http://www.icml2010.org/papers/432.pdf
[31] A. Krizhevsky, "Learning multiple layers of features from tiny images," M.S. thesis, Dept. Comput. Sci., Univ. Toronto, Toronto, ON, Canada, 2009.
[32] A. Novikov, D. Podoprikhin, A. Osokin, and D. P. Vetrov, "Tensorizing neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 442–450.
[33] W. Chen, J. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, "Compressing neural networks with the hashing trick," in Proc. Int. Conf. Mach. Learn., 2015, pp. 1–10.
[34] M.-A. Côté and H. Larochelle, "An infinite restricted Boltzmann machine," Neural Comput., vol. 28, no. 7, pp. 1265–1288, 2016.
[35] C. Cortes, X. Gonzalvo, V. Kuznetsov, M. Mohri, and S. Yang. (2016). "AdaNet: Adaptive structural learning of artificial neural networks." [Online]. Available: https://arxiv.org/abs/1607.01097

Enrique S. Marquez received the B.S. degree in electrical engineering from Universidad Rafael Urdaneta, Maracaibo, Venezuela, in 2013, and the M.Sc. degree in artificial intelligence from the University of Southampton, Southampton, U.K., in 2015, where he is currently pursuing the Ph.D. degree in computer science. His current research interests include machine learning, computer vision, and image processing.

Jonathon S. Hare received the B.Eng. degree in aerospace engineering and the Ph.D. degree in computer science from the University of Southampton. He is currently a Lecturer in computer science with the University of Southampton, Southampton, U.K. His current research interests include multimedia data mining, analysis, and retrieval. These research areas are at the convergence of machine learning and computer vision, but also encompass other modalities of data. The long-term goal of his research is to innovate techniques that can allow machines to learn to understand the information conveyed by multimedia data and use that information to fulfill the information needs of humans.

Mahesan Niranjan held academic positions with the University of Sheffield, Sheffield, U.K., from 1998 to 2007, where he was the Head of the Department of Computer Science and the Dean of the Faculty of Engineering, and with the University of Cambridge, Cambridge, U.K., from 1990 to 1997. He is currently a Professor of electronics and computer science with the University of Southampton, Southampton, U.K. His current research interests include machine learning, and he has made contributions to the algorithmic and applied aspects of machine learning.