Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
course of optimization and across different problems and The methods range from reinforcement learning (Zoph and
architectures. Le, 2016; Baker et al., 2016; Li et al., 2017), evolutionary
algorithms (e.g., (Stanley and Miikkulainen, 2002; Jozefow-
Contributions. icz et al., 2015; Real et al., 2017)), Bayesian optimization
(Bergstra et al., 2013) or differentiable architecture search
(Liu et al., 2018). These methods are resource-intensive and
• We introduce Alrao, a gradient descent method with
do not allow for finding a good learning rate in a single run.
close-to-optimal performance without learning rate
tuning. Alrao is found to be reliable over a range
of problems and architectures including convolutional 3. Motivation
networks, LSTMs, or reinforcement learning.
Alrao was inspired by the intuition that not all units in a
• We compare Alrao to the current default optimizer, neural network end up being useful. Hopefully, in a large
Adam with its default hyperparameters. While Adam enough network, a sub-network made of units with a good
sometimes outperforms Alrao, it is not reliable across learning rate could learn well, and hopefully the units with
the board when varying architectures or during train- a wrong learning rate will just be ignored. (Units with a too
ing. large learning rate may produce large activation values, so
this assumes the model has some form of protection against
those, such as BatchNorm or sigmoid/tanh activations.)
2. Related Work
Several lines of work support the idea that not all units
Automatically using the “right” learning rate for each param- of a network are useful or need to be trained. First, it is
eter was one motivation behind “adaptive” methods such as possible to prune a trained network without reducing the
RMSProp (Tieleman and Hinton, 2012), AdaGrad (Duchi performance too much (e.g., LeCun et al. 1990; Han et al.
et al., 2011) or Adam (Kingma and Ba, 2015). Adam with 2015a;b; See et al. 2016). Second, training only some of the
its default setting is currently considered the default method weights in a neural network while leaving the others at their
in many works (Wilson et al., 2017), and we use it as a initial values performs reasonably well (see experiments in
baseline. However, further global adjustement of the Adam Appendix G). So in Alrao, units with a very small learning
learning rate is common (Liu et al., 2017). rate should not hinder training. (Li et al., 2018) even show
Other heuristics for setting the learning rate have been pro- that performance is reasonable if learning only within a very
posed, e.g., (Schaul et al., 2013); these heuristics usually small-dimensional affine subspace of the parameters, chosen
start with the idea of approximating a second-order Newton in advance at random rather than post-selected.
step to define an optimal learning rate (LeCun et al., 1998). Alrao is consistent with the lottery ticket hypothesis, which
Indeed, asymptotically, an arguably optimal preconditioner posits that “large networks that train successfully contain
is either the Hessian of the loss (Newton method) or the subnetworks that—when trained in isolation—converge in
Fisher information matrix (Amari, 1998). a comparable number of iterations to comparable accuracy”
Such methods directly set per-direction learning rates, equiv- (Frankle and Carbin, 2018). This subnetwork is the lottery
alent to preconditioning the gradient descent with a (diag- ticket winner: the one which had the best initial values.
onal or non-diagonal) matrix. From this viewpoint, Alrao Arguably, given the combinatorial number of subnetworks
just replaces these preconditioners with a random diagonal in a large network, with high probability one of them is able
matrix whose entries span several orders of magnitude. to learn alone, and will make the whole network converge.
Viewing the per-feature learning rates of Alrao as part of
Another approach to optimize the learning rate is to perform the initialization, this hypothesis suggests there might be
a gradient descent on the learning rate itself through the enough sub-networks whose initialization leads to good
whole training procedure (for instance (Maclaurin et al., convergence.
2015)). This can be applied online to avoid backpropagating
through multiple training rounds (Massé and Ollivier, 2015). Alrao specifically exploits the network-type structure of
This idea has a long history, see, e.g., (Schraudolph, 1999; deep learning models, with their potential excess of pa-
Mahmood et al., 2012). Some training algorithms depart rameters compared to more traditional, lower-dimensional
from gradient descent altogether, and become learning rate- optimization. That Alrao works at all might already be
free, such as (Orabona and Tommasi, 2017) using betting informative about some phenomena at play in deep neu-
strategies to simulate gradient descent. ral networks, relying on the overall network approach of
combining a large number of features built for diversity of
The learning rate can also be optimized within the frame- behavior.
work of architecture search, exploring both the architecture
and learning rate at the same time (e.g., (Real et al., 2017)).
Learning with Random Learning Rates
Classifier
layer. For instance, for regression there may be only one
Softmax Softmax Softmax output feature. For classification, each feature in the final
...
classification layer represents a single category, and so using
different learning rates for these features would favor some
categories during learning. Instead, on the output layer
we chose to duplicate the layer using several learning rate
Pre-classifier model
... values, and use a (Bayesian) model averaging method to
obtain the overall network output (Fig. 1). Appendix B
contains a proof (under convexity assumptions) that this
mechanism works, given the initial layers.
... We set a learning rate per feature, rather than per parameter.
Otherwise, every feature would have some parameters with
large learning rates, and we would expect even a few large
incoming weights to be able to derail a feature. So having
diverging parameters within a feature is hurtful, while hav-
ing diverging features in a layer is not necessarily hurtful
Input since the next layer can choose to disregard them.
Figure 1. Alrao version of a standard fully connected network for Definitions and notation. We now describe Alrao more
a classification task with three classes. The classifier layer is precisely for deep learning models with softmax output, on
replaced with a set of parallel copies of the original classifier, classification tasks (the case of regression is similar).
averaged with a model averaging method. Each unit uses its own
learning rate for its incoming weights (represented by the styles of Let D = {(x1 , y1 ), ..., (xN , yN )}, with yi ∈ {1, ..., K}, be
the arrows). a classification dataset. The goal is to predict the yi given
the xi , using a deep learning model Φθ . For each input x,
Φθ (x) is a probability distribution over {1, ..., K}, and we
4. All Learning Rates At Once: Description want to minimizeP the categorical cross-entropy loss ` over
the dataset: N1 i `(Φθ (xi ), yi ).
Principle. Alrao starts with a standard optimization
method such as SGD, and a range of possible learning rates A deep learning model for classification Φθ is made of two
(ηmin , ηmax ). Instead of using a single learning rate, we parts: a pre-classifier φθpc which computes some quanti-
sample once and for all one learning rate for each feature, ties fed to a final classifier layer Cθcl , namely, Φθ (x) =
randomly sampled log-uniformly in (ηmin , ηmax ). Then Cθcl (φθpc (x)). The classifier layer Cθcl with K categories
these learning rates are used in the usual optimization up- is defined by Cθcl = softmax ◦ W T x + b Pwith θcl =
date: (W, b), and softmax(x1 , ..., xK )k = exk /( i exi ) .The
θl,i ← θl,i − ηl,i · ∇θl,i `(Φθ (x), y) (1) pre-classifier is a computational graph composed of any
number of layers, and each layer is made of multiple fea-
where θl,i is the set of parameters used to compute the fea- tures.
ture i of layer l from the activations of layer l −1 (the incom-
ing weights of feature i). Thus we build “slow-learning” and We denote log-U(·; ηmin , ηmax ) the log-uniform probabil-
“fast-learning” features, in the hope to get enough features ity distribution on an interval (ηmin , ηmax ): namely, if
in the “Goldilocks zone”. η ∼ log-U(·; ηmin , ηmax ), then log η is uniformly dis-
tributed between log ηmin and log ηmax . Its density function
What constitutes a feature depends on the type of layers in is
the model. For example, in a fully connected layer, each 1ηmin ≤η≤ηmax 1
component of a layer is considered as a feature: all incoming log-U(η; ηmin , ηmax ) = × (2)
log(ηmax ) − log(ηmin ) η
weights of the same unit share the same learning rate. On
the other hand, in a convolutional layer we consider each
convolution filter as constituting a feature: there is one Alrao for the pre-classifier: A random learning rate for
learning rate per filter (or channel), thus keeping translation- each feature. In the pre-classifier, for each feature i in
invariance over the input image. In LSTMs, we apply the each layer l, a learning rate ηl,i is sampled from the proba-
same learning rate to all components in each LSTM unit bility distribution log-U(.; ηmin , ηmax ), once and for all at
Learning with Random Learning Rates
Algorithm 1 Alrao for model Φθ = Cθcl ◦ φθpc with Ncl by gradient descent, but via a model averaging method from
classifiers and learning rates in [ηmin , ηmax ] the literature (see below).
1: aj ← 1/Ncl for each 1 ≤ j ≤ Ncl // Initialize the Ncl For each classifier Cθjcl , we set a learning rate ηj defined by
model averaging weights aj
PNcl log ηj = log ηmin + Nj−1 log ηmax
ηmin , so that the classi-
2: ΦAlrao
θ (x) := j=1 aj Cθjcl (φθpc (x)) // Define the Al- cl −1
rao architecture fiers’ learning rates are log-uniformly spread on the interval
3: for layers l, for all feature i in layer l do [ηmin , ηmax ].
4: Sample ηl,i ∼ log-U(.; ηmin , ηmax ). // Sample a Thus, the original model Φθ (x) leads to the Alrao model
learning rate for each feature ΦAlrao
θ (x). Only the classifier layer is modified, the pre-
5: end for classifier architecture being unchanged.
6: for classifiers j, 1 ≤ j ≤ Ncl do
7: Define log ηj = log ηmin + Nj−1 cl −1
log ηηmax
min
. // Set a Update rule. Alg. 1 presents the full Alrao algorithm.
learning rate for each classifier j The updates for the pre-classifier, classifier, and model aver-
8: end for aging weights are as follows.
9: while stopping criterion is false do
10: zt ← φθpc (xt ) // Store the pre-classifier output • The update rule for the pre-classifier is the usual SGD
11: for layers l, for all feature i in layer l do one, with per-feature learning rates. For each feature i
12: θl,i ← θl,i − ηl,i · ∇θl,i `(ΦAlrao
θ (xt ), yt ) // Update in each layer l, its incoming parameters are updated as:
the pre-classifier weights
13: end for θl,i ← θl,i − ηl,i · ∇θl,i `(ΦAlrao
θ (x), y) (5)
14: for Classifier j do
15: θjcl ← θjcl − ηj · ∇θjcl `(Cθjcl (zt ), yt ) // Update the • The parameters θjcl of each classifier clone j on the
classifiers’ weights classifier layer are updated as if this classifier alone
16: end for was the only output of the model:
17: a ← ModelAveraging(a, (Cθicl (zt ))i , yt ) // Update
the model averaging weights. θjcl ← θjcl − ηj · ∇θjcl `(Cθjcl (φθpc (x)), y) (6)
18: t ← t + 1 mod N
(still sharing the same pre-classifier φθpc ). This en-
19: end while
sures classifiers with low weights aj still learn, and
is consistent with model averaging philosophy. Algo-
rithmically this requires differentiating the loss Ncl
the beginning of training.1 Then the incoming parameters
times with respect to the last layer (but no additional
of each feature in the preclassifier are updated in the usual
backpropagations through the preclassifier).
way with this learning rate (Eq. 5).
• To set the weights aj , several model averaging tech-
Alrao for the classifier layer: Model averaging from niques are available, such as Bayesian Model Averag-
classifiers with different learning rates. In the classi- ing (Wasserman, 2000). We decided to use the Switch
fier layer, we build multiple clones of the original classifier model averaging (Van Erven et al., 2012), a Bayesian
layer, set a different learning rate for each, and then use a method which is both simple, principled and very re-
model averaging method from among them. The averaged sponsive to changes in performance of the various mod-
classifier and the overall Alrao model are: els. After each sample or mini-batch, the switch com-
putes a modified posterior distribution (aj ) over the
Ncl
X classifiers. This computation is directly taken from
CθAlrao
cl (z) := aj Cθjcl (z) (3) (Van Erven et al., 2012) and explained in Appendix A.
j=1
The observed evolution of this posterior during training
ΦAlrao
θ (x) := CθAlrao
cl (φθpc (x)) (4) is commented upon in Appendix C.
where the Cθjcl are copies of the original classifier layer, with
Implementation. We release along with this paper a Py-
non-tied parameters, and θcl := (θ1cl , ..., θN
cl
). The aj are
cl torch (Paszke et al., 2017) implementation of this method.
the parameters of the model averaging, and are such that for
P It can be used on an existing model with little modification.
all j, 0 ≤ aj ≤ 1, and j aj = 1. These are not updated
A short tutorial is given in Appendix H. Features (sets of
1
With learning rates resampled at each time, each step would weights sharing the same learning rate) need to be specified
be, in expectation, an ordinary SGD step with learning rate Eηl,i , for each layer type: for now this has been done for linear,
thus just yielding an ordinary SGD trajectory with more noise. convolutional, and LSTMs layers.
Learning with Random Learning Rates
Table 1. Performance of Alrao, of SGD with optimal learning rate from {10−5 , 10−4 , 10−3 , 10−2 , 10−1 , 1., 10.}, and of Adam with its
default setting. Three convolutional models are reported for image classification on CIFAR10, three others for ImageNet, one recurrent
model for character prediction (Penn Treebank), and two experiments on RL problems. The Alrao learning rates have been taken in wide
a priori reasonable intervals, [ηmin ; ηmax ] = [10−5 ; 10] for CNNs (CIFAR10 and ImageNet) and RL, and [10−3 ; 102 ] for RNNs (PTB).
Each experiment is run 10 times (CIFAR10 and RL), 5 times (PTB) or 1 time (ImageNet); the confidence intervals report the standard
deviation over these runs.
in Fig. 2. Fig. 3 tests the influence of ηmin and ηmax . ResNet50 (and only two of them achieve performance simi-
lar to Alrao, see Fig. 2a), and only 2 are able to learn on the
pendulum environment (and only one of them converges as
Recurrent learning on Penn Treebank. To test Alrao
fast as Alrao, Fig. 6 in Appendix D). More examples are
on a different kind of architecture, we used a recurrent
in Appendix D. It is surprising that Alrao manages to learn
neural network for text prediction on the Penn Treebank
even though most of the units in the network would have
(Marcus et al., 1993) dataset. The experimental procedure
learning rates unsuited for SGD.
is the same, with (ηmin , ηmax ) = (0.001, 100) and 6 output
classifiers for Alrao. The results appear in Table 1, where
the loss is given in bits per character and the accuracy is the
Robustness of Alrao compared to default Adam. Over-
proportion of correct character predictions.
all, Alrao learns reliably in every setup. Performance is
The model was trained for character prediction rather than close to optimal SGD in all cases (Table 1) with a somewhat
word prediction. This is technically easier for Alrao im- larger gap in one case (AlexNet on ImageNet). This obser-
plementation: since Alrao uses copies of the output layer, vation is quite stable over the course of learning, with Alrao
memory issues arise for models with most parameters on curves shadowing optimal SGD curves over time (Fig. 2).
the output layer. Word prediction (10,000 classes on PTB)
Often, Adam with its default parameters almost matches
requires more output parameters than character prediction;
optimal SGD, but this is not always the case. Over the 13
see Section 7 and Appendix F.
setups in Table 1, default Adam gives a significantly poor
The model is a two-layer LSTM (Hochreiter and Schmidhu- performance in three cases. One of those is a pure opti-
ber, 1997) with an embedding size of 100 and 100 hidden mization issue: with AlexNet on ImageNet, optimization
features. A dropout layer with rate 0.2 is included before the does not start (Fig. 2b) with default parameters. The other
decoder. The training set is divided into 20 minibatchs. Gra- two cases are due to strong overfit despite good train perfor-
dients are computed via truncated backprop through time mance: MobileNet on CIFAR (Fig. 5c in Appendix D) and
(Werbos, 1990) with truncation every 70 characters. ResNet with increased width on ImageNet (Fig. 2a).
In two further cases, Adam achieves good validation perfor-
Reinforcement Learning. We tested Alrao on two stan- mance but overfits shortly thereafter: ResNet (Fig. 2a)and
dard Reinforcement Learning problems: the Pendulum and DenseNet (Fig. 5b in Appendix D), both on ImageNet). On
Lunar Lander environments from OpenAI Gym (Brockman the whole, this confirms a known risk of overfit with Adam
et al., 2016). We use standard Deep Q-learning (Mnih et al., (Wilson et al., 2017).
2015). The Q-network is a standard MLP with 2 hidden lay-
Overall, default Adam tends to give slightly better results
ers. The experimental setting to compare Alrao, Adam and
than Alrao when it works, but does not learn reliably with
SGD is the same as above, with ηmin = 10−5 to ηmax = 10.
its default hyperparameters. It can exhibit two kinds of
Alrao uses 10 output layers (which are not classifiers in
lack of robustness: optimization failure, and overfit or
that case but regressors). More details on the Q-learning
non-robustness over the course of learning. On the other
implementation are given in Appendix D. For each environ-
hand, every single run of Alrao reached reasonably close-
ment, we selected the best epoch on evaluation runs, and
to-optimal performance. Alrao also exhibits a steady perfor-
then reported the return of the selected model on new runs
mance over the course of learning (Fig. 2).
in that environment.
loss
4
Adam 4
3
Adam, width * 3 3
2 2
1 1
0 0
0 20 40 60 80 100 120 140 0 20 40 60 80 100 120 140
epochs epochs
loss
4
lr=1e+01 4
3
Adam 3
2 2
1 1
0 0
0 20 40 60 80 100 120 140 0 20 40 60 80 100 120 140
epochs epochs
Figure 2. Learning curves for SGD with various learning rates, Alrao, and Adam with its default setting, with the Resnet50 architecture on
ImageNet and the MobileNetV2 architecture on CIFAR10. Left: training loss; right: test loss. While Alrao uses learning rates from the
entire range, its performance is comparable to the optimal learning rate.
than a direct softmax, such as a hierarchical softmax (see (Fig. 3) shows that the sensitivity to ηmin and ηmax is much
references in (Jozefowicz et al., 2016)); Alrao can be used lower than the original sensitivity to the learning rate. In our
in conjunction with such methods. experiments, convergence happens as soon as (ηmin ; ηmax )
contains a reasonable learning rate (Fig. 3).
This would clearly be a limitation of Alrao for models with
most parameters in the classifier layer and without existing A wide range of values of (ηmin ; ηmax ) will contain one
methods to streamline this layer. For such models, this good learning rate and achieve close-to-optimal perfor-
could be mitigated by handling the copies of the classifiers mance (Fig. 3). Typically, we recommend to just use an
on distinct computing units: in Alrao these copies work in interval containing all the learning rates that would have
parallel given the pre-classifier. been tested in a grid search, e.g., 10−5 to 10.
So, even if the choice of ηmin and ηmax is important, the
Adding two hyperparameters. We claim to remove a results are much more stable to varying these two hyper-
hyperparameter, the learning rate, but replace it with two parameters than to the original learning rate. For instance,
hyperparameters ηmin and ηmax . Formally, this is true. But a standard SGD fails due to numerical issues for η = 100
systematic study of the impact of these two hyperparameters while Alrao with ηmax = 100 works with any ηmin ≤ 1
Learning with Random Learning Rates
-7
-6
-5
-4
-3
-2
-1
-7
-6
-5
-4
-3
-2
-1
0
1
2
0
1
2
1e
1e
1e
1e
1e
1e
1e
1e
1e
1e
1e
1e
1e
1e
1e
1e
1e
1e
1e
1e
3.0
1e-7 1e-7
1e-6 1e-6
2.0 2.5
1e-5 1e-5
Minimum learning rate ηmin
Figure 3. Performance of Alrao with a GoogLeNet model on CIFAR10, depending on the interval (ηmin , ηmax ). Left: loss on the train
set; right: on the test set. Each point with coordinates (ηmin , ηmax ) above the diagonal represents the loss after 30 epochs for Alrao with
this interval. Points (η, η) on the diagonal represent standard SGD with learning rate η after 50 epochs. Standard SGD with η = 102 is
left blank to due numerical divergence (NaN). Alrao works as soon as (ηmin , ηmax ) contains at least one suitable learning rate.
(Fig. 3), and is thus stable to relatively large learning rates. The Alrao idea can also be used with other optimizers than
We would still expect numerical issues with very large ηmax , SGD, such as Adam. We tested combining Alrao and Adam,
but this has not been observed in our experiments. and found the combination less reliable than standard Alrao
(Appendix E, Fig. 7). This occurs mostly for test perfor-
mance, while training curves mostly look good (Fig. 7). The
Increasing network size. With Alrao, neurons with un- stark train/test discrepancy suggests that Alrao combined
suitable learning rates will not learn: those with a too large with Adam may perform well as a pure optimization method
learning rate might learn nothing, while those with too small but exacerbates the underlying risk of overfit of Adam (Wil-
learning rates will learn too slowly to be used. Thus, Al- son et al., 2017; Keskar and Socher, 2017).
rao may reduce the effective size of the network to only
a fraction of the actual architecture size, depending on The Alrao idea could be used on other hyperparameters as
(ηmin , ηmax ). Our first intuition was that increasing the well, such as momentum. However, if more hyperparame-
width of the network was going to be necessary with Alrao, ters are initialized randomly for each feature, the fraction of
to avoid wasting too many units. features having all their hyperparameters in the “Goldilocks
zone” will quickly decrease.
Tests of Alrao with increased width are reported in Table 1
and Fig. 2a.Incidentally, these tests show that width was a
limiting factor of the models used for both Alrao and SGD. 8. Conclusion
Still, to our surprise, Alrao worked well even without width Applying stochastic gradient descent with multiple learning
augmentation. rates for different features is surprisingly resilient in our
experiments, and provides performance close enough to
Other optimizers, other hyperparameters, learning rate SGD with an optimal learning rate, as soon as the range of
schedulers... Using a learning rate schedule instead of a random learning rates contains a suitable one. The same
fixed learning rate is often effective (Bengio, 2012). We resilience is not observed with default Adam. Alrao could
did not use learning rate schedulers here; this may partially save time when testing deep learning models, opening the
explain why the results in Table 1 are worse than the state- door to more out-of-the-box uses of deep learning.
of-the-art. Nothing prevents using such a scheduler within
Alrao, e.g., by dividing all Alrao learning rates by a time- Acknowledgments
dependent constant; we did not experiment with this yet.
One might have hoped that Alrao would match good stepsize We would like to thank Corentin Tallec for his technical help,
schedules thanks to the diversity of learning rates, but our and his many remarks and advice. We thank Olivier Teytaud
results do not currently support this. for pointing useful references, and Guillaume Charpiat and
Learning with Random Learning Rates
Léon Bottou for their remarks. Networks. In Advances in Neural Information Processing
Systems, 2015b.
References Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Shun-ichi Amari. Natural gradient works efficiently in learn- Deep residual learning for image recognition. In Pro-
ing. Neural Comput., 10:251–276, February 1998. ISSN ceedings of the IEEE conference on computer vision and
0899-7667. doi: 10.1162/089976698300017746. pattern recognition, pages 770–778, 2016.
Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Mark Herbster and Manfred K Warmuth. Tracking the best
Raskar. Designing neural network architectures using expert. Machine learning, 32(2):151–178, 1998.
reinforcement learning. arXiv preprint arXiv:1611.02167,
2016. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term
memory. Neural computation, 9(8):1735–1780, 1997.
Yoshua Bengio. Practical recommendations for gradient-
based training of deep architectures. In Neural networks: Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry
Tricks of the trade, pages 437–478. Springer, 2012. Kalenichenko, Weijun Wang, Tobias Weyand, Marco
Andreetto, and Hartwig Adam. Mobilenets: Efficient
James Bergstra, Daniel Yamins, and David Daniel Cox. convolutional neural networks for mobile vision applica-
Making a science of model search: Hyperparameter opti- tions. arXiv preprint arXiv:1704.04861, 2017.
mization in hundreds of dimensions for vision architec-
tures. 2013. Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and
Kilian Q Weinberger. Densely connected convolutional
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas networks. In CVPR, volume 1, page 3, 2017.
Schneider, John Schulman, Jie Tang, and Wojciech
Zaremba. Openai gym, 2016. Stanislaw Jastrzebski, Zachary Kenton, Devansh Arpit,
Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Storkey. Three factors influencing minima in sgd. arXiv
ImageNet: A Large-Scale Hierarchical Image Database. preprint arXiv:1711.04623, 2017.
In CVPR09, 2009.
Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever.
Michael Denkowski and Graham Neubig. Stronger base- An empirical exploration of recurrent network architec-
lines for trustable results in neural machine translation. tures. In International Conference on Machine Learning,
arXiv preprint arXiv:1706.09733, 2017. pages 2342–2350, 2015.
John Duchi, Elad Hazan, and Yoram Singer. Adaptive sub- Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam
gradient methods for online learning and stochastic op- Shazeer, and Yonghui Wu. Exploring the limits of lan-
timization. The Journal of Machine Learning Research, guage modeling. arXiv preprint arXiv:1602.02410, 2016.
12:2121–2159, 2011.
Nitish Shirish Keskar and Richard Socher. Improving gener-
Jonathan Frankle and Michael Carbin. The Lottery Ticket alization performance by switching from Adam to SGD.
Hypothesis: Finding Small, Trainable Neural Networks. arXiv preprint arXiv:1712.07628, 2017.
arXiv preprint arXiv:1704.04861, mar 2018.
Kianglu. pytorch-cifar, 2018. URL https://
Isabelle Guyon, Imad Chaabane, Hugo Jair Escalante, Ser- github.com/kuangliu/pytorch-cifar.
gio Escalera, Damir Jajetic, James Robert Lloyd, Núria
Macià, Bisakha Ray, Lukasz Romaszko, Michèle Sebag, Diederik P. Kingma and Jimmy Ba. Adam: A Method for
et al. A brief review of the ChaLearn AutoML challenge: Stochastic Optimization. In International Conference on
any-time any-dataset learning without human interven- Learning Representations, 2015.
tion. In Workshop on Automatic Machine Learning, pages
21–30, 2016. Wouter Koolen and Steven De Rooij. Combining expert
advice efficiently. arXiv preprint arXiv:0802.2015, 2008.
Song Han, Huizi Mao, and William J. Dally. Deep Com-
pression: Compressing Deep Neural Networks with Prun- Alex Krizhevsky. Learning Multiple Layers of Features
ing, Trained Quantization and Huffman Coding. arXiv from Tiny Images. 2009.
preprint arXiv:1510.00149, 2015a.
Alex Krizhevsky. One weird trick for parallelizing convolu-
Song Han, Jeff Pool, John Tran, and William J. Dally. Learn- tional neural networks. arXiv preprint arXiv:1404.5997,
ing both Weights and Connections for Efficient Neural 2014.
Learning with Random Learning Rates
Keita Kurita. Learning Rate Tuning in Deep Learning: Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice
A Practical Guide — Machine Learning Explained, Santorini. Building a large annotated corpus of english:
2018. URL http://mlexplained.com/2018/ The penn treebank. Comput. Linguist., 19(2):313–330,
01/29/learning-rate-tuning-in-deep- June 1993. ISSN 0891-2017.
learning-a-practical-guide/.
Pierre-Yves Massé and Yann Ollivier. Speed learning on
Yann LeCun, John S. Denker, and Sara A. Solla. Optimal the fly. arXiv preprint arXiv:1511.02540, 2015.
brain damage. In D. S. Touretzky, editor, Advances in
Neural Information Processing Systems 2, pages 598–605. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, An-
Morgan-Kaufmann, 1990. drei A Rusu, Joel Veness, Marc G Bellemare, Alex
Graves, Martin Riedmiller, Andreas K Fidjeland, Georg
Yann LeCun, Leon Bottou, Genevieve B Orr, and Klaus- Ostrovski, et al. Human-level control through deep rein-
Robert Müller. Efficient backprop. In Neural Networks: forcement learning. Nature, 518(7540):529, 2015.
Tricks of the Trade, pages 9–50. Springer, 1998.
Francesco Orabona and Tatiana Tommasi. Training deep net-
Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason works without learning rates through coin betting. In Ad-
Yosinski. Measuring the Intrinsic Dimension of Objective vances in Neural Information Processing Systems, pages
Landscapes. arXiv preprint arXiv:1804.08838, apr 2018. 2160–2170, 2017.
Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Ros- Adam Paszke, Sam Gross, Soumith Chintala, Gregory
tamizadeh, and Ameet Talwalkar. Hyperband: A novel Chanan, Edward Yang, Zachary DeVito, Zeming Lin,
bandit-based approach to hyperparameter optimization. Alban Desmaison, Luca Antiga, and Adam Lerer. Auto-
The Journal of Machine Learning Research, 18(1):6765– matic differentiation in pytorch. In NIPS-W, 2017.
6816, 2017.
Esteban Real, Sherry Moore, Andrew Selle, Saurabh Sax-
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, ena, Yutaka Leon Suematsu, Jie Tan, Quoc Le, and Alex
Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Kurakin. Large-scale evolution of image classifiers. arXiv
and Daan Wierstra. Continuous control with deep rein- preprint arXiv:1703.01041, 2017.
forcement learning. CoRR, abs/1509.02971, 2015. URL
Tom Schaul, Sixin Zhang, and Yann LeCun. No more pesky
http://arxiv.org/abs/1509.02971.
learning rates. In International Conference on Machine
Chenxi Liu, Barret Zoph, Jonathon Shlens, Wei Hua, Li-Jia Learning, pages 343–351, 2013.
Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin
Nicol N Schraudolph. Local gain adaptation in stochastic
Murphy. Progressive neural architecture search. arXiv
gradient descent. 1999.
preprint arXiv:1712.00559, 2017.
Abigail See, Minh-Thang Luong, and Christopher D Man-
Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts:
ning. Compression of Neural Machine Translation Mod-
Differentiable architecture search. arXiv preprint
els via Pruning. arXiv preprint arXiv:1606.09274, 2016.
arXiv:1806.09055, 2018.
K. Simonyan and A. Zisserman. Very deep convolu-
David Mack. How to pick the best learning rate tional networks for large-scale image recognition. CoRR,
for your machine learning project, 2016. URL abs/1409.1556, 2014.
https://medium.freecodecamp.org/how-
to-pick-the-best-learning-rate- Kenneth O Stanley and Risto Miikkulainen. Evolving neural
for-your-machine-learning-project- networks through augmenting topologies. Evolutionary
9c28865039a8. computation, 10(2):99–127, 2002.
Dougal Maclaurin, David Duvenaud, and Ryan Adams. Pavel Surmenok. Estimating an Optimal Learn-
Gradient-based hyperparameter optimization through re- ing Rate For a Deep Neural Network, 2017.
versible learning. In International Conference on Ma- URL https://towardsdatascience.com/
chine Learning, pages 2113–2122, 2015. estimating-optimal-learning-rate-for-
a-deep-neural-network-ce32f2556ce0.
Ashique Rupam Mahmood, Richard S Sutton, Thomas
Degris, and Patrick M Pilarski. Tuning-free step-size Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet,
adaptation. In Acoustics, Speech and Signal Processing Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent
(ICASSP), 2012 IEEE International Conference on, pages Vanhoucke, and Andrew Rabinovich. Going deeper with
2121–2124. IEEE, 2012. convolutions. In Proceedings of the IEEE conference
Learning with Random Learning Rates
A. Model Averaging with the Switch Then, the model averaging weight aj for the classifier j after
seeing T samples is the posterior of the switch distribution:
As explained is Section 4, we use a model averaging method π(KT +1 = j|y1:T , x1:T ).
on the classifiers of the output layer. We could have used
the Bayesian Model Averaging method (Wasserman, 2000). aj = psw (KT +1 = j|y1:T , x1:T ) (10)
But one of its main weaknesses is the catch-up phenomenon psw (y1:T , KT +1 = j|x1:T )
(Van Erven et al., 2012): plain Bayesian posteriors are slow = (11)
psw (y1:T |x1:T )
to react when the relative performance of models changes
over time. Typically, for instance, some larger-dimensional These weights can be computed online exactly in a quick
models need more training data to reach good performance: and simple way (Van Erven et al., 2012), thanks to dynamic
at the time they become better than lower-dimensional mod- programming methods from hidden Markov models.
els for predicting current data, their Bayesian posterior is so
bad that they are not used right away (their posterior needs The implementation of the switch used in Alrao exactly
to “catch up” on their bad initial performance). This leads follows the pseudo-code from (Van Erven et al., 2008), with
to very conservative model averaging methods. hyperparameter θ = 0.999 (allowing for many switches a
priori). It can be found in the accompanying online code.
The solution from (Van Erven et al., 2012) against the catch-
up phenomenon is to switch between models. It is based on
previous methods for prediction with expert advice (see for
B. Convergence result in a simple case
instance (Herbster and Warmuth, 1998; Volf and Willems, We prove a convergence result on Alrao in a simplified case:
1998) and the references in (Koolen and De Rooij, 2008; we assume that the loss is convex, that the pre-classifier is
Van Erven et al., 2012)), and is well rooted in information fixed, that we work with full batch gradients rather than
theory. The switch method maintains a Bayesian posterior stochastic gradient descent, and that the Alrao model aver-
distribution, not over the set of models, but over the set of aging method is standard Bayesian model averaging. The
switching strategies between models. Intuitively, the model convexity and fixed classifier assumptions cover, for in-
selected can be adapted online to the number of samples stance, standard logistic regression: in that case the Alrao
seen. output layer contains copies of a logistic classifier with vari-
We now give a quick overview of the switch method from ous learning rates, and the Alrao pre-classifier is the identity
(Van Erven et al., 2012): this is how the model averaging (or any fixed linear pre-classifier).
weights aj are chosen in Alrao. For each Alrao classifier j, for simplicity we denote its
Assume that we have a set of prediction strategies M = parameters by θj instead of θjcl (there is no more ambiguity
{pj , j ∈ I}. We define the set of switch sequences, S = since the pre-classifier is fixed).
{((t1 , j1 ), ..., (tL , jL )), 1 = t1 < t2 < ... < tL , j ∈ I}. The loss of some classifier C on P a dataset with features
Let s = ((t1 , j1 ), ..., (tL , jL )) be a switch sequence. The (xi ) and labels (yi ) is L(C) := N1 i `(C(xi ), yi ), where
associated prediction strategy ps (y1:n |x1:n ) uses model pji for each input xi , (C(xi )y )y∈Y is a probability distribution
on the time interval [ti ; ti+1 ), namely over the possible labels y ∈ Y, and we use the log-loss
`(C(xi ), yi ) := − log C(xi )yi .
ps (y1:i+1 |x1:i+1 , y1:i ) = pKi (yi+1 |x1:i+1 , y1:i ) (7)
For a classifier Cθ with parameter θ, let us abbreviate
where Ki is such that Ki = jl for tl ≤ i < tl+1 . We L(θ) := L(Cθ ). We assume that L(θ) is a non-negative
fix a prior distribution π over switching sequences. In this convex function, with ∇2 L(θ) λI for all θ. Let L∗ be
work, I = {1, ..., NC } the prior is, for a switch sequence its global infimum; we assume L∗ is a minimum, reached
s = ((t1 , j1 ), ..., (tL , jL )): at some point θ∗ , namely L(θ∗ ) = L∗ . Moreover we as-
L
sume that L is locally strongly convex at its minimum θ∗ :
∇2 L(θ∗ ) 0.
Y
π(s) = πL (L)πK (j1 ) πT (ti |ti > ti−1 )πK (ji ) (8)
i=2 The Alrao architecture for such a classifier Cθ uses Ncl
θ L copies of the same classifier, with different parameter values:
with πL (L) = 1−θ a geometric distribution over the switch
sequences lengths, πK (j) = N1C the uniform distribution Ncl
X
1
over the models (here the classifiers) and πT (t) = t(t+1) . ΦAlrao
θAlrao (x) = aj Cθj (12)
j=1
This defines a Bayesian mixture distribution:
where θAlrao := (θ1 , ..., θNcl ), and where the (aj )j are
the weights given by the model averaging method. We
X
psw (y1:T |x1:T ) = π(s)ps (y1:T |x1:T ) (9)
s∈S abbreviate L(θAlrao ) := L(ΦAlrao
θAlrao ).
Learning with Random Learning Rates
The Alrao classification layer uses a set of learning rates L∗ . Moreover, since L is localy stricly convex around θ∗ ,
(0) (t)
(ηj )j∈J , and starting points (θj )j∈J . Using full-batch this implies that limt θj = θ∗ .
(non stochastic) Alrao updates we have
We now show that the sum of errors for this specific j
(t+1) (t) (t) converges. We assumed that L(θ) is locally strongly convex
θj = θj − ηj ∇L(θj ) (13)
in θ∗ . Let µ > 0 such that ∇2 L(θ∗ ) µI. Since L is
(t+1) (t)
a = ModelAveraging(a , (Cθi (x1:N ))i , y1:N ) C 2 , there is ε0 such that for any θ such that kθ − θ∗ k≤ ε0 ,
(τ )
(14) then ∇2 L(θ) µ2 I. Let τ ∈ N such that kθj − θ∗ k<
0
ε . Then, from the theory of gradient descent for strongly
We assume that the model averaging method is Bayesian convex functions (Tibshirani and Marchetti-Bowick, 2013),
(t)
Model Averaging. we know there is some γ < 1 such that for t > τ , L(θj ) −
(τ )
We have assumed that the Hessian of the loss of the model L∗ ≤ Ckθj − θ∗ kγ t . We have:
satisfies ∇2 L(θ) λI. Under this condition, the standard
theory of gradient descent for convex functions requires that
t
the learning rate be less than 1/λ, otherwise the gradient X (s)
descent might diverge. Therefore, for Alrao we assume L(θj ) − L∗ = (18)
s=1
that at least one of the learning rates considered by Alrao is
τ t
below this threshold. X (s) ∗
X (s)
= L(θj ) −L + L(θj ) − L∗ (19)
Theorem 1. Assume that at least one of the Alrao learning s=1 s=τ
rates ηj satisfies ηj < 1/λ, with λ as above. Then, under τ
X (s)
(τ ) 1
the assumptions above, the Alrao loss is at most the optimal ≤ L(θj ) − L∗ + Ckθj − θ∗ kγ τ
1−γ
loss when t → ∞: s=1
(20)
(t)
lim sup L(θAlrao ) ≤ L∗ (15)
t
(t)
Thus t≥0 L(θj ) − L∗ < ∞. Therefore, A is not
P
Proof. Let us analyze the dynamics of the different models
in the model averaging method. Let us split the set of Alrao empty.
classifiers in two categories according to whether their sum
of errors is finite or infinite, namely,
(t)
We now show that the weights aj tend to 0 for any j ∈ B,
(t)
namely, limt→∞ aj = 0. Let j ∈ B and take some i ∈ A.
X
(t)
A := j ∈ J such that L(θj ) − L∗ < ∞ ,
In Bayesian model averaging, the weights are
t≥0
(16)
Qt
X (t)
s=1 pθ(s) (y1:N |x1:N )
B := j ∈ J such that L(θj ) − L∗ = ∞ (t)
aj = P Qt j
(21)
s=1 pθ (s) (y1:N |x1:N )
t≥0 k k
(17) t p (s) (y1:N |x1:N )
Y θj
≤ (22)
(t)
and in particular, for any j ∈ A, limt L(θj ) = L∗ . s=1
pθ(s) (y1:N |x1:N )
i
t (s)
The proof is organized as follows: We first show that A is Y exp(−N L(θj ))
(t) = (s)
(23)
not empty. Then, we show that limt→∞ aj = 0 for all exp(−N L(θi ))
s=1
j ∈ B: these models are eliminated by the model averaging t
(s)
X
method. Then we will be able to conclude. = exp − N L(θj ) − L∗ +
s=1
t
(24)
First, we show that A is not empty: namely, that there is X (s)
(t) N L(θi ) − L∗
least one j such that t≥0 (L(θj ) − L∗ ) < ∞. We know
P
s=1
that there is j such that ηj < λ2 . Hence, the standard theory
of gradient descent for convex functions shows that this par-
ticular classifier converges (e.g., (Tibshirani and Marchetti- Since i ∈ A and j ∈ B, by definition of A and B this tends
(t) (t)
Bowick, 2013)), namely, the loss (L(θj ))t converges to to 0. Therefore, limt aj = 0 for all j ∈ B.
Learning with Random Learning Rates
We now prove the statement of the theorem. We have: Finally, after the model has seen approximately 4,000 sam-
−`(C (t) (xi ),yi ) ples, a classifier with a slightly smaller learning rate is as-
(t) 1 X X
θ
L(θAlrao ) = − log aj e j + signed a posterior weight aj close to 1, while all the others
N i
j∈A go to 0. This means that after a number of gradient steps, the
X −`(C (t) (xi ),yi )
θ
model averaging method acts like a model selection method.
aj e j
j∈B
(25) D. Additional Experimental Details and
Results
1 X X −`(C (t) (xi ),yi )
θ
≤ − log aj e j In the case of CIFAR-10 and ImageNet, we normalize each
N i
j∈A input channel xi (1 ≤ i ≤ 3), using its mean and its stan-
(26) dard deviation over the training set. Let µi and σi be re-
(t) (t) spectively the mean and the standard deviation of the i-th
(t) ai a
For all i ∈ A, set ãi := P (t) = Pi (t) . Then channel. Then each input (x1 , x2 , x3 ) is transformed into
j∈A aj 1− j∈B aj
( x1σ−µ
1
1 x2 −µ2 x3 −µ3
, σ2 , σ3 ). This operation is done over all the
data (training, validation and test).
(t) (t)
X
L(θAlrao ) ≤ − log 1 − aj
Moreover, we use data augmentation: every time an image
j∈B
of the training set is sent as input of the NN, this image is
1 X X −`(C (t) (xi ),yi ) randomly cropped and and randomly flipped horizontally.
θ
+ − log ãj e j Cropping consists in filling with black a band at the top,
N i
j∈A bottom, left and right of the image. The size of this band is
(27) randomly chosen between 0 and 4 in our experiments.
1 XX
≤ ãj `(Cθ(t) (xi ), yi ) + o(1) On CIFAR10 and PTB, the batch size was 32 for every
N i j
architecture. On ImageNet, the batch-size is 256 for Alexnet
j∈A
(28) and ResNet50, and 128 for Densenet121.
(t)
X
= ãj L(θj ) + o(1) (29) On the Reinforcement Learning environments, we used
j∈A vanilla Q-learning (Mnih et al., 2015) with a soft target
= L∗ + o(1) (30) update as in (Lillicrap et al., 2015) τ = 0.9, and a memory
buffer of size 1,000,000. The architecture for the Q network
thanks to Jensen’s inequality for − log, then because is a MLP with 2 hidden layers. The learning curves are in
(t) (t)
limt aj = 0 for j ∈ B, and finally because limt L(θj ) = Fig. 6. For the optimisation, the switch was used with 10
L∗ for j ∈ A. Taking the lim sup, we have: output layers. An output layer is a linear layer. Since the
(t)
switch is a probability model averaging method, we consider
lim sup L(θAlrao ) ≤ L∗ (31) each output layer as a probabilistic model, defined as a
t
Normal distribution with variance 1 and mean the predicted
which ends the proof. value by the output layer. The loss for the Alrao model is
the negative log-likelihood of the model mixture.
C. Evolution of the Posterior
The evolution of the model averaging weights can be ob- E. Alrao with Adam
served during training. In Figure 4, we can see their evo- In Figure 7, we report our experiments with Alrao-Adam
lution during the training of the GoogLeNet model with on CIFAR10. As explained in Section 7, Alrao is much less
Alrao on CIFAR10, 10 classifiers, with ηmin = 10−5 and reliable with Adam than with SGD.
ηmax = 101 .
This is especially true for the test performance, which can
We can make several observations. First, after only a few even diverge while training performance remains either
gradient descent steps, the model averaging weights corre- good or acceptable (Fig. 7). Thus Alrao-Adam seems to
sponding to the three classifiers with the largest learning send the model into atypical regions of the search space.
rates go to zero. This means that their parameters are mov-
ing too fast, and their loss is getting very large. We have no definitive explanation for this at present. It might
be that changing Adam’s learning rate requires changing its
Next, for a short time, a classifier with a moderately large momentum parameters accordingly. It might be that Alrao
learning rate gets the largest posterior weight, presumably does not work on Adam because Adam is more sensitive to
because it is the first to learn a useful model.
Learning with Random Learning Rates
1.0
a1 : η1 = 1.0e-05
a2 : η2 = 4.6e-05
0.8 a3 : η3 = 2.2e-04
Model averaging weights aj
a4 : η4 = 1.0e-03
a5 : η5 = 4.6e-03
0.6 a6 : η6 = 2.2e-02
a7 : η7 = 1.0e-01
a8 : η8 = 4.6e-01
0.4 a9 : η9 = 2.2e+00
a10 : η10 = 1.0e+01
0.2
0.0
10−3 10−2 10−1 100 101
Epochs (log-scale)
Figure 4. Model averaging weights during training. During the training of the GoogLeNet model with Alrao on CIFAR10, 10 classifiers,
with ηmin = 10−5 and ηmax = 101 , we represent the evolution of the model averaging weights aj , depending on the corresponding
classifier’s learning rate.
its hyperparameters. For those model with significant parameter increase, the
various classifier copies may be done on parallel GPUs.
F. Number of Parameters
G. Frozen Features Do Not Hurt Training
As explained in Section 7, Alrao increases the number of
parameters of a model, due to output layer copies. The As explained in the introduction, several works support the
additional number of parameters is approximately equal to idea that not all units are useful when learning a deep learn-
(Ncl − 1) × K × d where Ncl is the number of classifier ing model. Additional results supporting this hypothesis are
copies used in Alrao, d is the dimension of the output of presented in Figure 8. We trained a GoogLeNet architecture
the pre-classifier, and K is the number of classes in the on CIFAR10 with standard SGD with learning rate η0 , but
classification task (assuming a standard softmax output; learned only a random fraction p of the features (chosen at
classification with many classes often uses other kinds of startup), and kept the others at their initial value. This is
output parameterization instead). equivalent to sampling each learning rate η from the proba-
bility distribution P (η = η0 ) = p and P (η = 0) = 1 − p.
Table 2. Comparison between the number of parameters in models We observe that even with a fraction of the weights not being
used without and with Alrao. LSTM (C) is a simple LSTM cell learned, the model’s performance is close to its performance
used for character prediction while LSTM (W) is the same cell when fully trained.
used for word prediction.
When training a model with Alrao, many features might not
M ODEL N UMBER OF PARAMETERS learn at all, due to too small learning rates. But Alrao is still
W ITHOUT A LRAO W ITH A LRAO able to reach good results. This could be explained by the
G OOG L E N ET 6.166M 6.258M resilience of neural networks to partial training.
VGG 20.041M 20.087M
M OBILE N ET 2.297M 2.412M
LSTM (C) 0.172M 0.197M H. Tutorial
LSTM (W) 2.171M 7.221M
In this section, we briefly show how Alrao can be used in
practice on an already implemented method in Pytorch. The
The number of parameters for the models used, with and full code will be available once the anonymity constraint is
without Alrao, are in Table 2. We used 10 classifiers in lifted.
Alrao for convolutional neural networks, and 6 classifiers
for LSTMs. Using Alrao for classification tasks with many The first step is to build the preclassifier. Here, we use the
classes, such as word prediction (10,000 classes on PTB), VGG19 architecture. The model is built without a classifier.
increases the number of parameters noticeably. Nothing else is required for Alrao at this step.
Learning with Random Learning Rates
loss
lr=1e+00
1.0 lr=1e+01 1.0
lr=1e+02
0.5 0.5
0.0 0.0
0 10 20 30 40 0 10 20 30 40
epochs epochs
loss
4
lr=1e-05 4
3
Adam 3
2 2
1 1
0 0
0 20 40 60 80 100 120 140 0 20 40 60 80 100 120 140
epochs epochs
loss
lr=1e-01
1.0 lr=1e+00 1.0
lr=1e+01
0.5 0.5
0.0 0.0
0 10 20 30 40 0 10 20 30 40
epochs epochs
Figure 5. Additional learning curves for SGD with various learning rates, Alrao, and Adam with its default setting, with the Densenet121
and Alexnet architectures on ImageNet and the GoogLeNet architecture on CIFAR10. Left: training loss; right: test loss.
Learning with Random Learning Rates
−2000 lr=1e-04
lr=1e-03 # nb_classifiers is the number of
−2500 lr=1e-02 classifiers averaged by Alrao.
lr=1e-01 nb_classifiers = 10
−3000
nb_categories = 10
−3500
net = AlraoModel(preclassifier,
nb_categories, preclassifier.
0 5 10 15 20 25 linearinputdim, nb_classifiers)
Steps
# We spread the classifiers learning rates
log-uniformly on the interval.
Figure 6. Learning curves for the Reinforcement Learning environ- classifiers_lr = [np.exp(np.log(minlr) + \
ment in the pendulum environment in Q-learning, for SGD with k /(nb_classifiers-1) * (np.log(maxlr) -
various learning rates, Alrao, and Adam with its default setting np.log(minlr)) \
) for k in range(nb_classifiers)]
loss
1.0 1.0
0.5 0.5
0.0 0.0
0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0
epochs epochs
(a) Alrao-Adam with GoogLeNet on CIFAR10: Alrao-Adam compared with standard Adam with various
learning rates. Alrao uses 10 classifiers and learning rates in the interval (10−6 , 1). Each plot is averaged on 10
experiments. We observe that optimization with Alrao-Adam is efficient, since train loss is comparable to the
usual Adam methods. But the model starkly overfits, as the test loss diverges.
1.5 1.5
loss
loss
1.0 1.0
0.5 0.5
0.0 0.0
0 10 20 30 40 0 10 20 30 40
epochs epochs
(b) Alrao-Adam with MobileNet on CIFAR10: Alrao-Adam with two different learning rate intervals,
(10−6 , 10−2 ) for the first one, (10−6 , 10−1 ) for the second one, with 10 classifiers each. The first one is
with ηmin = 10−6 . Each plot is averaged on 10 experiments. Exactly as with GoogLeNet model, optimization
itself is efficient (for both intervals). For the interval with the smallest ηmax , the test loss does not converge and
is very unstable. For the interval with the largest ηmax , the test loss diverges.
2.0 2.0
1.5 1.5
loss
loss
1.0 1.0
0.5 0.5
0.0 0.0
0 10 20 30 40 50 0 10 20 30 40 50
epochs epochs
(c) Alrao-Adam with VGG19 on CIFAR10: Alrao-Adam on the interval (10−6 , 1), with 10 classifiers. The 10
plots are 10 runs of the same experiments. While 9 of them do converge and generalize, the last one exhibits
wide oscillations, both in train and test.
Figure 7. Alrao-Adam: Experiments with the VGG19, GoogLeNet and MobileNet networks on CIFAR10.
Learning with Random Learning Rates
test_nll as a function of p
1.6
1.4
1.2
1.0
0.8
0.6
Figure 8. Loss of a model where only a random fraction p of the features are trained, and the others left at their initial value, as a function
of p. The architecture is GoogLeNet, trained on CIFAR10.