
ARTICLE

https://doi.org/10.1038/s41467-020-14663-9 OPEN

Complexity control by gradient descent in deep networks

Tomaso Poggio¹ ✉, Qianli Liao¹ & Andrzej Banburski¹

Overparametrized deep networks predict well, despite the lack of an explicit complexity control during training, such as an explicit regularization term. For exponential-type loss functions, we solve this puzzle by showing an effective regularization effect of gradient descent in terms of the normalized weights that are relevant for classification.

1 Center for Brains, Minds, and Machines, MIT, Cambridge, Massachusetts, USA. ✉email: tp@ai.mit.edu


Once upon a time, models needed more data than parameters to provide a meaningful fitting. Deep networks seem to avoid this basic constraint. In fact, more weights than data is the standard situation for deep-learning networks that typically fit the data and still show good predictive performance on new data¹. Of course, it has been known for some time that the key to good predictive performance is controlling the complexity of the network and not simply the raw number of its parameters. The complexity of the network depends on appropriate measures of complexity of the space of functions realized by the network, such as VC dimension, covering numbers and Rademacher numbers. Complexity can be controlled during optimization by imposing a constraint, often in the form of a regularization penalty, on the norm of the weights, as all the notions of complexity listed above depend on it. The problem is that there is no obvious control of complexity in the training of deep networks! This has given an aura of magic to deep learning and has contributed to the belief that classical learning theory does not hold for deep networks.
In the case of regression for shallow linear networks such as kernel machines, it is well known from work on inverse problems and in machine learning (see refs. 2,3) that iterative gradient descent (GD) has a vanishing regularizing effect, with the iteration number t (for fixed step size) being equivalent to $\frac{1}{\lambda}$ (where λ is the corresponding regularization parameter): thus t → ∞ corresponds to λ → 0. The simplest example is least-square regression on a linear network, where vanishing regularization unifies both the overdetermined and the underdetermined cases as follows

$$\min_{w\in\mathbb{R}^d}\ \frac{1}{n}\|Y - Xw\|^2 + \lambda\|w\|^2, \qquad (1)$$

yielding $w_\lambda = (X^TX + \lambda nI)^{-1}X^TY$. It is noteworthy that $\lim_{\lambda\to 0} w_\lambda = w^\dagger$ is the pseudoinverse. In this case, iterative GD minimizing $\frac{1}{n}\|Y - Xw\|^2$ performs an implicit regularization equivalent to taking λ → 0 in $w_\lambda$ above.
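To make the correspondence between running time and vanishing regularization concrete, the following minimal NumPy sketch (illustrative only; the random data, dimensions and step size are arbitrary choices, not taken from the analysis above) compares the ridge solution $w_\lambda$ for shrinking λ with plain GD on the unregularized least-squares loss started from zero: both approach the pseudoinverse solution $w^\dagger$.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 20, 50                      # underdetermined: more parameters than data
    X = rng.standard_normal((n, d))
    Y = rng.standard_normal(n)

    w_dagger = np.linalg.pinv(X) @ Y   # minimum-norm interpolating solution

    # Ridge solutions w_lambda approach the pseudoinverse as lambda -> 0
    for lam in [1.0, 1e-2, 1e-4]:
        w_lam = np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ Y)
        print(f"lambda={lam:.0e}  ||w_lam - w_dagger|| = {np.linalg.norm(w_lam - w_dagger):.3f}")

    # GD on (1/n)||Y - Xw||^2 from w = 0: the iteration count plays the role of 1/lambda
    w = np.zeros(d)
    lr = 0.01
    for t in range(1, 20001):
        w -= lr * (2.0 / n) * X.T @ (X @ w - Y)
        if t in (100, 1000, 20000):
            print(f"t={t:6d}  ||w_t - w_dagger|| = {np.linalg.norm(w - w_dagger):.3f}")

In this sketch early stopping corresponds to a larger effective λ, and running longer corresponds to letting λ vanish.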
The question is whether an effective regularization may be similarly induced in nonlinear deep networks by GD and how. This paper addresses the question of implicit regularization in the specific case of training wrt exponential-type loss functions, such as the cross-entropy or the exponential loss. It is worth noting that cross-entropy is the loss function used in training deep networks for classification problems, and that most of the practical applications and successes of deep networks, at least so far, involve classification.

This paper answers in the affirmative: there is a hidden regularization in typical training of deep networks. The basic observation is that the weights computed during minimization of the exponential loss do not matter by themselves and in fact they diverge during GD. As shown in the next section, in the case of classification, both binary and multiclass, only the normalized weights matter: thus, complexity control is implicit in defining the normalized weights $V_k$ as the variables of interest. What is not completely obvious is that commonly used GD on the unnormalized weights induces a dynamics of the normalized weights, which converges to a stationary point. We show that at any finite time of the dynamics, the stationary points of the flow of $V_k$ satisfy the necessary and sufficient conditions for a minimum. This mechanism underlies regularization in deep networks for exponential losses and, as we discuss later, is likely to be the starting point to explain their prediction performance.

Results
Loss functions. We consider for simplicity of exposition the case of binary classification. We call "loss" the measure of performance of the network f on a training set $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$. The most common loss optimized during training for binary classification is the logistic loss $L(f) = \frac{1}{N}\sum_{n=1}^N \ln\!\left(1 + e^{-y_n f(x_n)}\right)$. We focus on the closely related, simpler exponential loss $L(f(w)) = \sum_{n=1}^N e^{-f(x_n)y_n}$. We call classification "error" $\frac{1}{N}\sum_{n=1}^N H(-y_n f(x_n))$, where y is binary ($y\in\{-1, +1\}$) and H is the Heaviside function, with $H(-yf(x)) = 1$ if $-yf > 0$, which corresponds to a wrong classification. We say that f separates the data if $y_n f(x_n) > 0\ \forall n$. We will typically assume that GD at some $t > T_0$ will reach an f that separates the data (which is usually the case for overparametrized deep networks). There is a close relation between the exponential or logistic loss and the classification error: both the exponential and the logistic losses are upper bounds for the classification error. Minimizing the exponential or logistic loss implies minimizing the classification error. Minimization of the loss can be performed by GD techniques. In today's praxis, stochastic GD (SGD) is used to train deep networks. We focus here on GD for simplicity. Our main results should also hold for SGD.
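As a small numeric illustration of this bound (the margins below are arbitrary toy values, not data from the paper), the snippet evaluates the 0-1 error, the exponential loss and the logistic loss on a few margins $y_n f(x_n)$:

    import numpy as np

    # margins y_n * f(x_n): negative entries are misclassified points
    margins = np.array([2.3, 0.7, -0.4, 1.5, -1.2])

    zero_one = np.mean(margins <= 0)                  # classification error
    exp_loss = np.mean(np.exp(-margins))              # (1/N) sum exp(-y f(x))
    log_loss = np.mean(np.log(1 + np.exp(-margins)))  # logistic loss

    print(zero_one, exp_loss, log_loss)

The per-sample exponential loss dominates the 0-1 error because $e^{-m}\ge 1$ whenever $m\le 0$; the logistic loss does so up to the constant factor $1/\ln 2$.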
Deep networks. We define a deep network with K layers and coordinate-wise scalar activation functions $\sigma(z): \mathbb{R}\to\mathbb{R}$ as the set of functions $f(W; x) = \sigma(W_K\sigma(W_{K-1}\cdots\sigma(W_1 x)))$, where the input is $x\in\mathbb{R}^d$ and the weights are given by the matrices $W_k$, one per layer, with matching dimensions. The symbol W is used to denote the set of $W_k$ matrices, $k = 1, \cdots, K$. For simplicity, we consider here the case of binary classification, in which f takes scalar values, implying that the last layer matrix is $W_K\in\mathbb{R}^{1\times h_{K-1}}$. As mentioned, the labels are $y_n\in\{-1, 1\}$. The weights of hidden layer l are collected in a matrix of size $h_l\times h_{l-1}$. There are no biases apart from the input layer, where the bias is instantiated by one of the input dimensions being a constant. The activation function is the Rectified Linear Unit (ReLU) activation. For ReLU activations, the following important positive one-homogeneity property holds: $\sigma(z) = \frac{\partial\sigma(z)}{\partial z}z$. For the network, homogeneity implies $f_W(x) = \prod_{k=1}^K\rho_k\, f(V_1, \cdots, V_K; x_n)$, where $W_k = \rho_k V_k$.

The network is a function $f(x) = f(W_1, \cdots, W_K; x)$, where x is the input and the weight matrices $W_k$ are the parameters. We define the normalized network $\tilde f$ through $f_V = \rho\tilde f$, with $\|V_k\| = 1$ and $\rho = \prod_{k=1}^K\rho_k$; $\tilde F$ is the associated class of "normalized neural networks" $\tilde f(x)$. It is noteworthy that the definitions of $\rho_k$, $V_k$ and $\tilde f$ all depend on the choice of the norm used in normalization. It is also worth noting that, because of the homogeneity of the ReLU network, $f(x) = \rho\tilde f(x)$, so the signs of f and $\tilde f$ are the same.

For simplicity of notation, we consider for each weight matrix $V_k$ the corresponding "vectorized" representation in terms of vectors $W_k = W_k^{i,j}$ for each layer k.
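A quick numerical check of the one-homogeneity claim (a minimal sketch; the small random ReLU network and input are arbitrary, and the last layer is kept linear so that the output is a signed scalar): rescaling each layer as $W_k = \rho_k V_k$ rescales the output by $\prod_k\rho_k$ and leaves its sign unchanged.

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    def net(Ws, x):
        # f(W; x) with ReLU hidden layers and a linear scalar output layer
        h = x
        for W in Ws[:-1]:
            h = relu(W @ h)
        return float(Ws[-1] @ h)

    rng = np.random.default_rng(1)
    Ws = [rng.standard_normal((8, 5)), rng.standard_normal((8, 8)), rng.standard_normal((1, 8))]
    x = rng.standard_normal(5)

    rhos = [np.linalg.norm(W) for W in Ws]      # Frobenius norms rho_k
    Vs = [W / r for W, r in zip(Ws, rhos)]      # normalized matrices V_k with ||V_k|| = 1

    f = net(Ws, x)
    f_tilde = net(Vs, x)
    print(f, np.prod(rhos) * f_tilde)           # equal up to floating-point error
    print(np.sign(f) == np.sign(f_tilde))       # True: f and f_tilde have the same sign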
We use the following definitions and properties (for a vector w and the 2-norm), neglecting indices:

● Define $\frac{w}{\|w\|_2} = v$; thus $w = \|w\|_2 v$ with $\|v\|_2 = 1$.
● The following relations are easy to check:
  1. $\frac{\partial\|w\|_2}{\partial w} = v$
  2. $S = I - vv^T = I - \frac{ww^T}{\|w\|_2^2}$
  3. $\frac{\partial v}{\partial w} = \frac{S}{\|w\|_2}$.
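These identities are easy to verify numerically; the sketch below (illustrative, with an arbitrary random vector) compares finite-difference derivatives of $\|w\|_2$ and of $v = w/\|w\|_2$ with the closed forms $v$ and $S/\|w\|_2$.

    import numpy as np

    rng = np.random.default_rng(2)
    w = rng.standard_normal(6)
    norm = np.linalg.norm(w)
    v = w / norm
    S = np.eye(6) - np.outer(v, v)          # projector onto the subspace orthogonal to v

    eps = 1e-6
    grad_norm = np.zeros(6)                 # finite-difference d||w||_2/dw
    jac_v = np.zeros((6, 6))                # finite-difference dv/dw (columns are dv/dw_i)
    for i in range(6):
        dw = np.zeros(6); dw[i] = eps
        grad_norm[i] = (np.linalg.norm(w + dw) - norm) / eps
        jac_v[:, i] = ((w + dw) / np.linalg.norm(w + dw) - v) / eps

    print(np.allclose(grad_norm, v, atol=1e-5))      # d||w||_2/dw = v
    print(np.allclose(jac_v, S / norm, atol=1e-5))   # dv/dw = S/||w||_2
    print(np.allclose(S @ S, S))                      # S is a projector: S^2 = S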


Training by unconstrained gradient descent. Consider the typical training of deep networks. The exponential loss (more in general, an exponential-type loss such as the cross-entropy) is minimized by GD. The gradient dynamics is given by

$$\dot W_k^{i,j} = -\frac{\partial L}{\partial W_k^{i,j}} = \sum_{n=1}^N y_n\frac{\partial f(x_n)}{\partial W_k^{i,j}}\, e^{-y_n f(x_n)}. \qquad (2)$$

Clearly there is no explicit regularization or norm control in the typical GD dynamics of Eq. (2). Assuming that for $T > T_0$ GD achieves separability, the empirical loss goes to zero and the norms of the weights $\rho_k = \|W_k\|_2$ grow to infinity $\forall k$. For classification, however, only the normalized network outputs matter because of the softmax operation.
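The following toy sketch (a linear classifier on separable data; the data, learning rate and iteration counts are arbitrary choices) runs a discrete-time version of Eq. (2): the loss goes to zero while the weight norm keeps growing, so only the direction $w/\|w\|$ stabilizes.

    import numpy as np

    rng = np.random.default_rng(3)
    N, d = 40, 2
    X = rng.standard_normal((N, d))
    y = np.sign(X[:, 0])                 # linearly separable labels in {-1, +1}

    w = np.zeros(d)
    lr = 0.01
    for t in range(1, 100001):
        m = y * (X @ w)                  # margins y_n f(x_n) for f(x) = x . w
        grad = -(X * (y * np.exp(-m))[:, None]).sum(axis=0)   # dL/dw for L = sum exp(-y f)
        w -= lr * grad                   # unconstrained gradient descent step
        if t in (10, 1000, 100000):
            print(f"t={t:6d}  loss={np.exp(-m).sum():.2e}  ||w||={np.linalg.norm(w):.2f}  "
                  f"direction={w / np.linalg.norm(w)}")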
Training by constrained gradient descent. Let us contrast the typical GD above with a classical approach that uses complexity control. In this case the goal is to minimize $L(f_W) = \sum_{n=1}^N e^{-f_W(x_n)y_n} = \sum_{n=1}^N e^{-\rho f_V(x_n)y_n}$, with $\rho = \prod_k\rho_k$, subject to $\|V_k\|_p^p = 1\ \forall k$, that is, under a unit norm constraint for the weight matrix at each layer (if p = 2, then $\sum_{i,j}(V_k)_{i,j}^2 = 1$ is the Frobenius norm). It is noteworthy that the relevant function is not f(x) and the associated $W_k$, but rather $f_V(x)$ and the associated $V_k$, as the normalized network $\tilde f(x)$ is what matters for classification, and other key properties such as the margin depend on it.

In terms of $\rho_k$, $V_k$, the unconstrained gradient of L gives

$$\dot\rho_k = -V_k^T\frac{\partial L}{\partial W_k}, \qquad \dot V_k = -\rho_k\frac{\partial L}{\partial W_k}, \qquad (3)$$

as $\dot\rho_k = -\frac{\partial L}{\partial\rho_k} = -\frac{\partial W_k}{\partial\rho_k}\frac{\partial L}{\partial W_k}$ and $\dot V_k = -\frac{\partial L}{\partial V_k} = -\frac{\partial W_k}{\partial V_k}\frac{\partial L}{\partial W_k}$.

There are several ways to enforce the unit norm constraint on the dynamics of $V_k$. The most obvious one consists of Lagrange multipliers. We use an equivalent technique, which is also equivalent to natural gradient, called the tangent gradient transformation⁴ of a gradient increment g(t) into Sg(t). For a unit L2 norm constraint, the projector $S = I - \frac{uu^T}{\|u\|_2^2}$ enforces the unit norm constraint. According to theorem 1 in ref. 4, the dynamical system $\dot u = Sg$ with $\|u(0)\|_2 = 1$ describes the flow of a vector u that satisfies $\|u(t)\|_2 = 1$ for all t ≥ 0. Applying the tangent gradient transformation to $\frac{\partial L}{\partial V_k}$ yields

$$\dot\rho_k = -V_k^T\frac{\partial L}{\partial W_k}, \qquad \dot V_k = -S_k\rho_k\frac{\partial L}{\partial W_k}, \qquad (4)$$

with $S_k = I - \frac{V_kV_k^T}{\|V_k\|_2^2}$. It is relatively easy to check (see ref. 5) that the dynamics of Eq. (4) is the same as that of the weight normalization algorithm, originally⁶ defined for each layer in terms of $w = g\frac{v}{\|v\|}$, as

$$\dot g = -\frac{v^T}{\|v\|}\frac{\partial L}{\partial w}, \qquad \dot v = -\frac{g}{\|v\|}S\frac{\partial L}{\partial w}, \qquad (5)$$

with $S = I - \frac{vv^T}{\|v\|_2^2}$. The reason Eqs. (4) and (5) are equivalent is that $\|v\|$ in Eq. (5) can be shown to be constant in time⁵.

Weight normalization and the closely related batch normalization technique are in common use for training deep networks. Empirically, they behave similarly to unconstrained GD, with some advantages especially for very deep networks. Our derivation, however, seems to suggest that they could be different, as weight normalization enforces an explicit, although so far unrecognized, unit norm constraint (on the $V_k$ dynamics), which unconstrained GD (Eq. (2)) seems not to enforce.
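A minimal numerical restatement of this equivalence (arbitrary toy numbers; the equations are evaluated as update directions, with $\|v\| = 1$ and $g = \rho$ as assumed in the text): the projected directions of Eqs. (4) and (5) coincide, and they are tangent to the unit sphere, which is why the norm of v is conserved along the flow.

    import numpy as np

    rng = np.random.default_rng(4)
    d = 6
    dLdw = rng.standard_normal(d)        # stand-in for dL/dw at the current point
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)               # unit-norm direction
    rho = 2.5                            # current scale: w = rho * v, so g = rho and ||v|| = 1

    S = np.eye(d) - np.outer(v, v)       # tangent-space projector at v

    # Eq. (4): constrained update directions for (rho_k, V_k)
    rho_dot = -(v @ dLdw)
    V_dot = -rho * (S @ dLdw)

    # Eq. (5): weight normalization with w = g * v/||v||; here ||v|| = 1 and g = rho
    g_dot = -(v @ dLdw)
    v_dot = -(rho / 1.0) * (S @ dLdw)

    print(np.allclose(V_dot, v_dot), np.isclose(rho_dot, g_dot))  # the two dynamics coincide
    print(np.isclose(v @ V_dot, 0.0))    # the update is tangent to the sphere

    eps = 1e-3
    print(abs(np.linalg.norm(v + eps * V_dot) - 1.0))  # deviation is O(eps^2) for a small Euler step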
Implicit complexity control. The first step in solving the puzzle is to reparametrize Eq. (2) in terms of $V_k$, $\rho_k$, with $W_k^{i,j} = \rho_k V_k^{i,j}$ and $\|V_k\|_2 = 1$ at convergence. Following the chain rule for the time derivatives, the dynamics for $W_k$, Eq. (2), is identical to the following dynamics for $\rho_k = \|W_k\|$ and $V_k$:

$$\dot\rho_k = V_k^T\dot W_k, \qquad \dot V_k = \frac{S_k}{\rho_k}\dot W_k, \qquad (6)$$

where $S_k = I - V_kV_k^T$ emerges this time from the change of variables, as $\frac{\partial V_k}{\partial W_k} = \frac{S_k}{\rho_k}$. Inspection of the equation for $\dot V_k$ shows that there is a unit constraint on the L2 norm of $V_k$, because of the presence of S: in fact, a tangent gradient transformation on $\dot V_k$ would not change the dynamics, as S is a projector and $S^2 = S$. Consistently with this conclusion, unconstrained GD has the same critical points for $V_k$ as weight normalization but a somewhat different dynamics: in the one-layer case, weight normalization is

$$\dot\rho = v^T\dot w, \qquad \dot v = \rho S\dot w, \qquad (7)$$

which has to be compared with the typical gradient equations in the ρ and $V_k$ variables, given by

$$\dot\rho = v^T\dot w, \qquad \dot v = \frac{S}{\rho}\dot w. \qquad (8)$$

The two dynamical systems are thus quite similar, differing by a $\rho^2$ factor in the $\dot v$ equations. It is clear that the stationary points of the gradient for the v vectors, that is, the values for which $\dot v = 0$, are the same in both cases, as for any t > 0, ρ(t) > 0. Importantly, the almost equivalence between constrained and unconstrained GD is true only when p = 2 in the unit $L_p$ norm constraint⁵. In both cases the stationary points (for fixed but usually very large ρ, that is, a very long time) are the same and likely to be (local) minima. In the case of deep networks, we expect multiple such minima for any finite ρ.
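A small simulation (toy separable data, an arbitrary step size, and a simple Euler discretization of the flows) integrates Eqs. (7) and (8) for a one-layer exponential-loss problem: both flows settle on essentially the same direction v, while ρ grows at different speeds.

    import numpy as np

    rng = np.random.default_rng(5)
    N, d = 30, 3
    X = rng.standard_normal((N, d))
    y = np.sign(X @ np.array([1.0, 0.5, -0.2]))

    def w_dot(w):
        # w_dot = -dL/dw for the exponential loss L(w) = sum_n exp(-y_n x_n . w)
        m = y * (X @ w)
        return (X * (y * np.exp(-m))[:, None]).sum(axis=0)

    v0 = (X * y[:, None]).sum(axis=0)
    v0 /= np.linalg.norm(v0)             # start near a separating direction (the t > T0 regime)

    def integrate(power, steps=40000, dt=1e-3):
        # power = +1 integrates Eq. (7) (weight normalization), power = -1 integrates Eq. (8)
        rho, v = 1.0, v0.copy()
        for _ in range(steps):
            wd = w_dot(rho * v)
            S = np.eye(d) - np.outer(v, v)
            rho += dt * (v @ wd)
            v += dt * (rho ** power) * (S @ wd)
            v /= np.linalg.norm(v)       # the continuous flows preserve ||v||; renormalize the Euler step
        return rho, v

    rho7, v7 = integrate(+1)
    rho8, v8 = integrate(-1)
    print(rho7, rho8)                    # the scales grow at different speeds
    print(float(v7 @ v8))                # cosine typically close to 1: same stationary direction for v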
Convergence to minimum norm and maximum margin. Consider the flow of $V_k$ in Eq. (6) for fixed ρ. If we assume that for $t > T_0$, $f(V; x)$ separates the data, i.e., $y_n f(V)(x_n) > 0\ \forall n$, then $\frac{d}{dt}\rho_k > 0$, i.e., ρ diverges to ∞ with $\lim_{t\to\infty}\dot\rho = 0$. In the one-layer network case, the dynamics yields $\rho\sim\log t$ asymptotically. For deeper networks, this is different. Banburski et al.⁵ show that the product of weights at each layer diverges faster than logarithmically, but each individual layer diverges more slowly than in the one-layer case. Banburski et al.⁵ also show that the rate of growth of $\rho_k^2$ is independent of k.

The stationary points at which $\dot V_k = 0$ for fixed ρ, if they exist, satisfy the necessary condition for a minimum, i.e.,

$$\sum_n e^{-y_n\rho f_V(x_n)}\left(\frac{\partial f_V(x_n)}{\partial V_k^{i,j}} - V_k^{i,j}f_V(x_n)\right) = 0, \qquad (9)$$

and are critical points of the loss for fixed ρ (as the domain is compact, minima exist). Let us call them $V_k(\rho) = \min_{\|V_k\|=1}\sum_n e^{-y_n\rho\tilde f(x_n)}$. Consider now the limit for ρ → ∞:

$$\lim_{\rho\to\infty}\min_{\|V_k\|=1}\sum_n e^{-y_n\rho\tilde f(x_n)} = \lim_{\rho\to\infty}\min_{\|V_k\|=1}e^{-\rho\tilde f(x^*)}, \qquad (10)$$

where $x^*$ corresponds to the training data that are support vectors and have the smallest margin. This provides a solution $V_k(\infty)$, which corresponds to a maximum margin solution, because $\lim_{\rho\to\infty}V_k(\rho) = \min_n\max_{\|V_k\|=1}\tilde f(x_n)$ (see also ref. 5). Those maximum margin solutions are also minimum norm solutions in terms of the $W_k$ for a fixed margin ≥ 1.

In summary, the dynamics of $V_k$ for each fixed ρ converges to a critical point, which is likely to be a minimum on the boundary of the ρ ball. For ρ → ∞, the stationary points of the full dynamical system Eq. (6) are reached. These points are degenerate equilibria⁵.
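The limit in Eq. (10) reflects the fact that, for large ρ, the sum of exponentials is dominated by the term with the smallest normalized margin; a short check with arbitrary margin values:

    import numpy as np

    margins = np.array([0.30, 0.45, 0.62, 0.80])       # normalized margins y_n * f_tilde(x_n)
    for rho in [1, 10, 100]:
        total = np.exp(-rho * margins).sum()
        dominant = np.exp(-rho * margins.min())
        print(rho, total, dominant, dominant / total)  # the ratio approaches 1 as rho grows

Minimizing the sum over the $V_k$ therefore amounts, in the limit, to maximizing the smallest margin.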


Discussion
The main conclusion of this analysis is that unconstrained GD in $W_k$, followed by normalization of $W_k$, is equivalent to imposing an L2 unit norm constraint on $V_k = \frac{W_k(t)}{\|W_k(t)\|}$ during GD. For binary and multiclass classification, only the normalized weights are needed. To provide some intuition, consider that GD is steepest descent wrt the L2 norm and that the steepest direction of the gradient depends on the norm. The fact that the directions of the weights converge to stationary points of the gradient under a constraint is the origin of the hidden complexity control, described as implicit bias of GD by Srebro and colleagues⁷, who first showed its effect in the special case of linear networks under the exponential loss.

In addition to the approach summarized here, another elegant theory⁸⁻¹⁰, leading to several of the same results, has been developed around the notion of margin, which is closely related, as in the case of support vector machines, to minimization of an exponential-type loss function under a norm constraint.

In summary, there is an implicit regularization in deep nonlinear networks trained on exponential-type loss functions, originating in the GD technique used for optimization. The solutions are in fact the same as those obtained by regularized optimization with vanishing regularization. This is thus similar to, but more robust than, the classical implicit regularization induced by iterative GD on linear networks under the square loss and with appropriate initial conditions. In our case, the maximum margin solutions are independent of initial conditions and of the linearity of the network. The specific solutions, however, are not unique, unlike in the linear case: they depend on the trajectory of gradient flow, each corresponding to one of multiple minima of the loss, each one being a margin maximizer. In general, each solution will show a different test performance. Characterizing the conditions that lead to the best among the margin maximizers is an important open problem.

The classical analysis of ERM algorithms studies their asymptotic behavior for the number of data n going to infinity. In this limiting classical regime, n > D, where D is the fixed number of weights; consistency (informally, the expected error of the empirical minimizer converges to the best in the class) and generalization (the empirical error of the minimizer converges to the expected error of the minimizer) are equivalent. The capacity control described in this paper implies that there is asymptotic generalization and consistency in deep networks in the classical regime (see Fig. 1). However, as we mentioned, it has been shown in the case of kernel regression that there are situations in which there is simultaneously interpolation of the training data and good expected error. This is a modern regime in which D > n but $\gamma = \frac{D}{n}$ is constant. In the linear case, it corresponds to the limit for λ = 0 of regularization, i.e., the pseudoinverse. It is likely that deep nets may have a similar regime, in which case the implicit regularization described here is an important prerequisite for a full explanation, as is the case for kernel machines under the square loss. In fact, the maximum margin solutions we characterize here are equivalent to minimum norm solutions (for margin equal to 1, see ref. 5). Minimum norm of the weight matrices implies minimum uniform stability and thus suggests minimum expected error, see ref. 11. This argument would explain why deep networks trained with exponential losses predict well and why the classification error does not increase with overparametrization (see Fig. 2). It would also explain, in the case of kernel methods and square-loss regression, why the pseudoinverse solution provides good expected error and at the same time perfect interpolation on the training set¹²,¹³, with a data-dependent double-descent behavior.

[Fig. 1 | Classical generalization and consistency in deep networks. a Unnormalized cross-entropy loss in CIFAR-10 for randomly labeled data. b Cross-entropy loss for the normalized network for randomly labeled data. c Generalization cross-entropy loss (difference between training and testing loss) for the normalized network for randomly labeled data as a function of the number of data N. The generalization loss converges to zero as a function of N but very slowly. Panels plot training and test loss against the number of training examples; panel a uses a model with 9370 parameters, and panel c includes a linear fit of the last 7 points on the log scale.]

As we mentioned, capacity control follows from convergence of the normalized weights during GD to a regularized solution with vanishing λ. It is very likely that the same result we obtained for GD also holds for SGD in deep networks (the equivalence holds for linear networks¹⁴). The convergence of SGD usually follows the convergence of GD, although the rates are different. The Robbins–Siegmund theorem is a tool to establish almost sure convergence under surprisingly mild conditions.

It is not clear whether a similar effective regularization should also hold for deep networks with more than two layers trained with the square loss. In fact, we have not been able to find a mechanism that could lead to a similarly robust regularization. Therefore, we conjecture that deep networks trained with the square loss (with more than two layers and not reducible to kernels) do not converge to minimum norm solutions, unlike the same networks trained on exponential-type losses.


[Fig. 2 | No overfitting in deep networks. Empirical and expected error in CIFAR-10 as a function of the number of neurons in a 5-layer convolutional network (training data size: 50,000; training and test classification error plotted against the number of model parameters). The expected classification error does not increase when increasing the number of parameters beyond the size of the training set.]

Interestingly, the theoretical observations we have described suggest that the dynamics of ρ may be controlled independently from GD on the $V_k$, possibly leading to faster and better algorithms for training deep networks. A hint of this possibility is given by an analysis for linear networks (see ref. 5) of the dynamics of weight normalization (Eq. (7)) vs. the dynamics of the unconstrained gradient (Eq. (8)). Under the same simplified assumptions on the training data, the weight normalization dynamics converges much faster, as $\frac{1}{t^{1/2}\log t}$, than the typical dynamics, which converges to the stationary point as $\frac{1}{\log t}$. This prediction was verified with simulations. Together with the observation that ρ(t) associated with Eq. (8) is monotonic in t after separability is reached, it suggests exploring a family of algorithms that consist of an independently driven forcing term ρ(t) coupled with the equation in $V_k$ from Eq. (8):

$$\dot V_k = \frac{\rho}{\rho_k^2}\sum_{n=1}^N e^{-\rho f_V(x_n)}\left(\frac{\partial f_V(x_n)}{\partial V_k} - V_k f_V(x_n)\right). \qquad (11)$$

The open question is of course what is the optimal ρ(t) schedule for converging to the margin maximizer that is best in terms of expected error.
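A speculative sketch of one member of this family (not the authors' implementation: the linear one-layer model, the schedule ρ(t) = ρ₀ + αt, and all constants below are hypothetical choices of ours): ρ is driven by an explicit schedule while the unit-norm direction V follows a discretized version of Eq. (11), written for $f_V(x) = x\cdot V$ with the labels absorbed into the data.

    import numpy as np

    rng = np.random.default_rng(6)
    N, d = 50, 4
    Z = rng.standard_normal((N, d))
    w_true = rng.standard_normal(d)
    X = Z * np.sign(Z @ w_true)[:, None]   # labels absorbed into the data: x_n <- y_n x_n

    V = rng.standard_normal(d)
    V /= np.linalg.norm(V)                 # unit-norm direction; f_V(x) = x . V and df_V/dV = x

    dt = 1e-2
    rho0, alpha = 1.0, 0.05                # hypothetical schedule rho(t) = rho0 + alpha * t
    for step in range(1, 2001):
        t = step * dt
        rho = rho0 + alpha * t             # independently driven forcing term
        margins = X @ V                    # f_V(x_n), i.e., the normalized margins
        weights = np.exp(-rho * margins)
        bracket = X - np.outer(margins, V) # df_V(x_n)/dV - V f_V(x_n), one row per sample
        V += dt * (rho / rho ** 2) * (weights[:, None] * bracket).sum(axis=0)  # Eq. (11), one layer
        V /= np.linalg.norm(V)             # keep the discretized flow on the unit sphere
    print((X @ V).min())                   # smallest normalized margin; typically positive at the end

Different schedules ρ(t) trade off how quickly the exponential weighting concentrates on the smallest-margin points against how far the direction V can still move, which is exactly the open question raised above.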
Data availability
All relevant data are publicly available (Github: https://github.com/liaoq/natcom2020).

Code availability
All relevant code is publicly available (Github: https://github.com/liaoq/natcom2020).

Received: 14 September 2019; Accepted: 22 January 2020;

References
1. Zhang, C., Bengio, S., Hardt, M., Recht, B. & Vinyals, O. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR) (Toulon, France, 2017).
2. Rosasco, L. & Villa, S. Learning with incremental iterative regularization. In Advances in Neural Information Processing Systems, 1630–1638 (2015).
3. Engl, H., Hanke, M. & Neubauer, A. Regularization of Inverse Problems (Kluwer Academic Publishers, 2000).
4. Douglas, S. C., Amari, S. & Kung, S. Y. On gradient adaptation with unit-norm constraints. IEEE Trans. Signal Process. 48, 1843–1847 (2000).
5. Banburski, A. et al. Theory of deep learning III: dynamics and generalization in deep networks. CBMM Memo No. 090 (2019).
6. Salimans, T. & Kingma, D. P. Weight normalization: a simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, arXiv:1602.07868v3 (Barcelona, Spain, 2016).
7. Soudry, D., Hoffer, E. & Srebro, N. The implicit bias of gradient descent on separable data. J. Mach. Learn. Res. 19, 70:1–57 (2017).
8. Nacson, M. S., Gunasekar, S., Lee, J. D., Srebro, N. & Soudry, D. Lexicographic and depth-sensitive margins in homogeneous and non-homogeneous deep models. In Proceedings of the 36th International Conference on Machine Learning (ICML) (Long Beach, California, 2019).
9. Lyu, K. & Li, J. Gradient descent maximizes the margin of homogeneous neural networks. CoRR, abs/1906.05890 (2019).
10. Neyshabur, B., Tomioka, R., Salakhutdinov, R. & Srebro, N. Geometry of optimization and implicit regularization in deep learning. CoRR, abs/1705.03071 (2017).
11. Poggio, T., Kur, G. & Banburski, A. Double descent in the condition number. CBMM Memo No. 102 (2019).
12. Liang, T. & Rakhlin, A. Just interpolate: kernel "Ridgeless" regression can generalize. The Annals of Statistics (2019).
13. Rakhlin, A. & Zhai, X. Consistency of interpolation with Laplace kernels is a high-dimensional phenomenon. In Proceedings of the Thirty-Second Conference on Learning Theory (COLT), PMLR 99, 2595–2623 (2019).
14. Nacson, M. S., Srebro, N. & Soudry, D. Stochastic gradient descent on separable data: exact convergence with a fixed learning rate. In AISTATS, 3051–3059 (2019).

Acknowledgements
This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216, and in part by C-BRIC, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA.

Author contributions
T.P. developed the basic theory. Q.L. and A.B. ran the simulations. All three authors wrote the paper.

Competing interests
The authors declare no competing interests.

Additional information
Supplementary information is available for this paper at https://doi.org/10.1038/s41467-020-14663-9.
Correspondence and requests for materials should be addressed to T.P.
Peer review information: Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.
Reprints and permission information is available at http://www.nature.com/reprints.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2020
