
Mathematics of Deep Learning

Joan Bruna
Courant Institute, NYU

collaborators:
Stephane Mallat (ENS, France), Ivan Dokmanic (UIUC), M. de Hoop (Rice)
Pablo Sprechmann (NYU), Yann LeCun (NYU)
Daniel Freeman (UC Berkeley)
Curse of Dimensionality
• Learning as a high-dimensional interpolation problem:
  observe {(x_i, y_i = f(x_i))}_{i ≤ N} for some unknown f : R^d → R.
• Goal: estimate f̂ from the training data.
• “Pessimist” view: assume as little as possible on f,
  e.g. f is Lipschitz: |f(x) − f(x′)| ≤ L ‖x − x′‖.
• Q: How many points do we need to observe to guarantee an error
  |f̂(x) − f(x)| < ε over the high-dimensional input space?
• A: N should be ∼ ε^{−d}: we pay an exponential price on the input dimension.
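• A minimal numeric sketch of this scaling (an illustrative assumption:
  nearest-neighbor interpolation of a 1-Lipschitz function on a uniform grid of [0, 1]^d):

```python
import numpy as np

def grid_points_needed(eps, d):
    """Nearest-neighbor interpolation of a 1-Lipschitz f on [0,1]^d:
    to guarantee |f_hat(x) - f(x)| < eps, the grid spacing must be O(eps),
    hence roughly (1/eps)^d samples are needed."""
    return int(np.ceil(1.0 / eps)) ** d

for d in (1, 2, 5, 10, 20):
    print(d, grid_points_needed(0.1, d))
# d = 1 needs ~10 points, d = 10 needs ~10^10, d = 20 needs ~10^20:
# the sample count blows up exponentially with the input dimension.
```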
Curse of Dimensionality
• Therefore, in order to beat the curse of dimensionality, it is necessary to make
  assumptions about our data and to exploit them in our models.
• Geometric invariance and stability perspective:
  – Supervised learning: (x_i, y_i)_i, with labels y_i ∈ {1, …, K}.
    f(x) = p(y | x) satisfies f(x_τ) ≈ f(x) if {x_τ}_τ is a high-dimensional
    family of deformations of x.
  – Unsupervised learning: (x_i)_i.
    The density f(x) = p(x) also satisfies f(x_τ) ≈ f(x).
Curse of Dimensionality
• Statistical invariance (ergodicity):
  – Many physical systems have statistical regularities as they grow in size.
  – (Figures) Textures, fractals, Ising spin glasses, turbulence.
Towards a better non-convex picture
• The typical dichotomy in optimization is convex vs. non-convex problems.
• But for gradient descent (GD) methods we need a different dichotomy: problems
  that can be solved with GD vs. problems that cannot.
  (Diagram: the convex problems form a subset of the GD-solvable problems.)
• Where do typical deep learning problems sit in that diagram?
From Solvability to Practical Performance
• In convex problems, the performance (i.e. speed of convergence) depends on
  geometric properties of the loss surface, typically via curvature properties
  of the loss.
• Strongly convex example: for all u, u_0,
  ⟨∇f(u_0), u − u_0⟩ + (ℓ/2) ‖u − u_0‖^2 ≤ f(u) − f(u_0) ≤ ⟨∇f(u_0), u − u_0⟩ + (L/2) ‖u − u_0‖^2.
• Challenge: in standard deep ReLU architectures, computing curvature is NP-hard.
• Challenge: how to extend this notion to the non-convex setting? Efficient estimation?
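• A quick numerical check of the sandwich above on a quadratic f(u) = ½ uᵀHu,
  where the extreme eigenvalues of H play the roles of ℓ and L (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
H = A.T @ A + np.eye(5)                  # positive definite Hessian
f = lambda u: 0.5 * u @ H @ u
grad = lambda u: H @ u
ell, L = np.linalg.eigvalsh(H)[[0, -1]]  # strong-convexity / smoothness constants

u0, u = rng.standard_normal(5), rng.standard_normal(5)
gap = f(u) - f(u0)
lower = grad(u0) @ (u - u0) + 0.5 * ell * np.sum((u - u0) ** 2)
upper = grad(u0) @ (u - u0) + 0.5 * L * np.sum((u - u0) ** 2)
print(lower <= gap <= upper)             # True: the curvature bounds hold
```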
Plan
• CNNs vs. the curse of dimensionality in image/audio estimation.
  – Multiscale wavelet scattering convolutional networks and beyond.
  – Applications to inverse problems: image and texture super-resolution.
  (joint work with S. Mallat, ENS)
• Some perspectives on high-dimensional loss surfaces.
  (joint work with Daniel Freeman, UC Berkeley)
• Understanding fast sparse coding approximations via sparsity-preserving matrix
  factorizations.
  (joint work with Thomas Moreau, Mines, Paris)
Ill-posed Inverse Problems
• Consider the following linear problem:
  y = Φx + w,  with x, y, w ∈ L^2(R^d), where
  – Φ is singular,
  – x is drawn from a certain distribution (e.g. natural images),
  – w is noise, independent of x.
• Examples: super-resolution, inpainting, deconvolution.
• Standard regularization route:
  x̂ = arg min_x ‖Φx − y‖^2 + R(x),  with R(x) (typically) convex.
• Q: How can we leverage training data?
• Q: Is this formulation always appropriate for estimating images?
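• A minimal sketch of the regularization route above, with a Tikhonov penalty
  R(x) = λ‖x‖^2 so that the minimizer has a closed form (the operator Φ, the
  noise level and λ below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 100                    # fewer measurements than unknowns: Phi is singular
Phi = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
y = Phi @ x_true + 0.01 * rng.standard_normal(n)

lam = 0.1
# x_hat = argmin_x ||Phi x - y||^2 + lam ||x||^2   (closed form for this convex R)
x_hat = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)
print("data residual:", np.linalg.norm(Phi @ x_hat - y))
```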
Regularization and Learning
• The inverse problem requires a high-dimensional model for p(x | y).
• The underlying probabilistic model for regularized least squares is
  p(x | y) ∝ N(Φx − y; σ^2 I) · e^{−R(x)}   (Gaussian likelihood times the prior e^{−R}).
• Suppose x_1, x_2 are such that p(x_1 | y) = p(x_2 | y).
• Since −log p(x | y) is convex, it follows that
  p(αx_1 + (1 − α)x_2 | y) ≥ p(x_1 | y)  for α ∈ [0, 1].
• Jitter model: e.g. two slightly shifted patterns x_1, x_2 that are equally
  consistent with the observation y; their average αx_1 + (1 − α)x_2 receives at
  least as much posterior probability, yet it is not a plausible image.
• The conditional distribution of images is not well modeled with Gaussian noise
  alone; it is non-convex in general. How should we model it?
Learning with Sufficient Statistics
• A feature representation Φ(x) defines a class of densities (relative to a base
  measure) via sufficient statistics:
  p_θ(x) = g_θ(Φ(x)).
• Φ(x) may encode prior knowledge to break the curse of dimensionality
  (e.g. if dim(Φ(x)) ≪ dim(x)).
• If Φ(x) is stable to transformations and g_θ is smooth, then p_θ is also stable.
Canonical Representation
• The canonical ensemble representation is defined as the exponential family
  determined by Φ(x):
  p_θ(x) = exp(⟨θ, Φ(x)⟩) / Z(θ).
  – θ are the so-called canonical parameters.
  – Maximum-entropy model subject to the constraint
    E_{x∼p_θ}(Φ(x)) = E_{x∼p}(Φ(x)) =: μ.
  – θ and μ are related via a variational (Legendre duality) principle:
    −H(μ) = sup_θ (⟨μ, θ⟩ − log Z(θ)).
  – Challenge: computationally intensive; one needs to rely on MCMC methods
    (e.g. Gibbs sampling).
From Canonical to Microcanonical
• Alternatively, we may search for features Φ such that Φ(X), X ∼ p,
  concentrates, i.e. becomes Gaussian with ‖Σ‖ → 0.
• Then define p_θ as the maximum-entropy distribution such that
  Φ(X) =_d Φ(Y),  X ∼ p, Y ∼ p_θ.
• If A‖x‖ ≤ ‖Φ(x)‖ ≤ B‖x‖, then p_θ is constructed as:
  1. sample z ∼ q  (where Φ(X) ∼ q),
  2. sample x ∼ Unif({x ; Φ(x) = z}).
• Estimation: finding the parameters of the Gaussian model q.
• Sampling: solving the non-linear least-squares problem min_x ‖Φ(x) − z‖^2.
• This is a mixture of microcanonical ensembles from statistical physics;
  it is a non-asymptotic model.
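• A minimal sketch of the sampling step (gradient descent on ‖Φ(x) − z‖^2), with a
  hand-picked toy feature map standing in for the scattering features and a
  finite-difference gradient for brevity; everything here is an illustrative assumption:

```python
import numpy as np

def phi(x):
    # toy feature map: mean, energy, and lag-1 autocorrelation of a 1-D signal
    return np.array([x.mean(), (x ** 2).mean(), (x[:-1] * x[1:]).mean()])

def grad_loss(x, z, h=1e-5):
    # numerical gradient of ||phi(x) - z||^2 (finite differences, for brevity)
    base = np.sum((phi(x) - z) ** 2)
    g = np.zeros_like(x)
    for i in range(len(x)):
        xp = x.copy(); xp[i] += h
        g[i] = (np.sum((phi(xp) - z) ** 2) - base) / h
    return g

rng = np.random.default_rng(0)
x_ref = rng.standard_normal(128)      # a "training" realization of X ~ p
z = phi(x_ref)                        # target statistics
x = rng.standard_normal(128)          # start the sampler from white noise
for _ in range(300):                  # descend the non-linear least-squares objective
    x -= 5.0 * grad_loss(x, z)
print(phi(x), z)                      # the statistics of x should move toward z
```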
Learning, features and Kernels
• Change of variable Φ(x) = {φ_k(x)}_{k ≤ d′} to nearly linearize f(x), which is
  then approximated by a 1-D projection (a linear classifier w):
  f̃(x) = ⟨Φ(x), w⟩ = Σ_k w_k φ_k(x).
• Data: x ∈ R^d ↦ Φ(x) ∈ R^{d′}.
• Metric: ‖x − x′‖ is replaced by ‖Φ(x) − Φ(x′)‖.
• How and when is it possible to find such a Φ?
• What “regularity” of f is needed?
Representation Wish List
• Φ transforms unknown, complex distributions into a known model, e.g. Gaussian.
  – High-dimensional concentration mechanisms: LLNs, CLT.
  – Controlling the variance E(‖Φ(X) − EΦ(X)‖^2) is sufficient, via Chebyshev's
    inequality.
• p(x) should be smooth within the level sets of Φ.
  – Φ must capture the invariances of the data, but nothing else!
  – Level sets should be “not too large”.
  – High-dimensional mechanism to control the size of level sets: sparsity.
Representation Wish-List
• A change of variable Φ(x) must linearize the orbits {g.x}_{g∈G}.
  (Figure: an orbit x, g_1.x, …, g_1^p.x in signal space is mapped to a nearly
  linear trajectory Φ(x), …, Φ(g_1^p.x) in feature space, and similarly for x′.)
• Lipschitz: ∀ x, g : ‖Φ(x) − Φ(g.x)‖ ≤ C ‖g‖.
• Discriminative: ‖Φ(x) − Φ(x′)‖ ≥ C^{−1} |f(x) − f(x′)|.
Geometric Variability Prior
• x(u), with u the pixels, time samples, etc.; τ(u) a deformation field;
  L_τ(x)(u) = x(u − τ(u)) is the warping.  (Video: Philipp Scott Johnson.)
• Deformation cost: ‖τ‖ = sup_u |τ(u)| + sup_u |∇τ(u)|.
  – Models changes of viewpoint in images.
  – Models frequency transpositions in sounds.
  – Consistent with local translation invariance.
Geometric Variability Prior
• Blur operator: Ax = x ∗ φ, with φ a local average.
  – The only linear operator A stable to deformations:
    ‖A L_τ x − A x‖ ≤ ‖τ‖ ‖x‖.   [Bruna ’12]
• Wavelet filter bank: W x = {x ∗ ψ_k}_k, with ψ_k(u) = 2^{−j} ψ(2^{−j} R_θ u)
  a spatially localized band-pass filter.
  – W recovers the information lost by A.
  – W nearly commutes with deformations: ‖W L_τ − L_τ W‖ ≤ C ‖∇τ‖_∞.
• Point-wise non-linearity ρ(x) = |x|:
  – commutes with deformations: ρ L_τ x = L_τ ρ x;   [Bruna ’12]
  – demodulates wavelet coefficients and preserves energy.
Scattering Convolutional Network
• Starting from f, each layer applies |W_J| (wavelet filter bank followed by the
  modulus), and a low-pass average φ_J produces the output coefficients:
  – order 0:  f ∗ φ_J;
  – order 1:  |f ∗ ψ_{j_1,γ_1}| ∗ φ_J,  for all (j_1, γ_1);
  – order 2:  ||f ∗ ψ_{j_1,γ_1}| ∗ ψ_{j_2,γ_2}| ∗ φ_J,  for all (j_1, γ_1), (j_2, γ_2);
  – …
  – order m:  ||f ∗ ψ_{j_1,γ_1}| ∗ ⋯ ∗ ψ_{j_m,γ_m}| ∗ φ_J.
• Cascade of contractive operators.
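• A minimal 1-D sketch of this cascade (two layers of hand-made analytic band-pass
  filters applied in the Fourier domain); this is an illustrative toy, not the
  implementation used in the experiments:

```python
import numpy as np

def filter_bank(n, scales=(2, 4, 8, 16)):
    """Analytic band-pass filters psi_j (Gaussian bumps centered at frequency ~1/scale)
    and a low-pass phi at the coarsest scale, all defined in the Fourier domain."""
    freqs = np.fft.fftfreq(n)
    psis = [np.exp(-0.5 * ((freqs - 1.0 / s) * 4 * s) ** 2) for s in scales]
    phi = np.exp(-0.5 * (freqs * 4 * scales[-1]) ** 2)
    return psis, phi

def conv(x, filt):
    # convolution via pointwise multiplication in the Fourier domain
    return np.fft.ifft(np.fft.fft(x) * filt)

def scattering(x, scales=(2, 4, 8, 16)):
    psis, phi = filter_bank(len(x), scales)
    S = [np.real(conv(x, phi)).mean()]                    # order 0: x * phi
    for p1 in psis:
        u1 = np.abs(conv(x, p1))                          # |x * psi_{j1}|
        S.append(np.real(conv(u1, phi)).mean())           # order 1
        for p2 in psis:
            u2 = np.abs(conv(u1, p2))                     # ||x * psi_{j1}| * psi_{j2}|
            S.append(np.real(conv(u2, phi)).mean())       # order 2
    return np.array(S)

x = np.random.default_rng(0).standard_normal(1024)
print(scattering(x).shape)    # 1 + 4 + 16 = 21 coefficients
```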
Image Examples   [Bruna, Mallat ’11, ’12]
• (Figure) For several images f: the Fourier modulus f̂; first-order coefficients
  |f ∗ ψ_{λ_1}| ∗ φ (≈ SIFT when the window size equals the image size); and
  second-order scattering ||f ∗ ψ_{λ_1}| ∗ ψ_{λ_2}| ∗ φ.
Representation of Stationary Processes
• x(u): realizations of a stationary process X(u) (not Gaussian).
• Representation: Φ(X) = {E(f_i(X))}_i.
• Estimation from samples x(n):  Φ̂(X) = { (1/N) Σ_n f_i(x)(n) }_i.
• Discriminability: need to capture high-order moments.
• Stability: E(‖Φ̂(X) − Φ(X)‖^2) should be small.
Scattering Moments
• The same cascade of |W_J| operators, applied to a stationary process X and
  followed by expectations:
  – order 0:  E(X);
  – order 1:  E(|X ∗ ψ_{j_1,γ_1}|),  ∀ j_1, γ_1;
  – order 2:  E(||X ∗ ψ_{j_1,γ_1}| ∗ ψ_{j_2,γ_2}|),  ∀ j_i, γ_i;
  – …
  – order m:  E(|⋯|X ∗ ψ_{j_1,γ_1}| ∗ ⋯ ∗ ψ_{j_m,γ_m}|),  ∀ j_i, γ_i.
Properties of Scattering Moments
• Captures high-order moments [Bruna, Mallat ’11, ’12].
  (Figure: power spectrum vs. scattering moments S_J[p]X at orders m = 1 and m = 2.)
• Cascading non-linearities is necessary to reveal higher-order moments.
Scattering Consistency and Stability
• Theorem [Mallat ’10]: with appropriate wavelets, S_J is stable to additive noise,
    ‖S_J(x + n) − S_J x‖ ≤ ‖n‖,
  norm-preserving, ‖S_J x‖ = ‖x‖, and stable to deformations,
    ‖S_J x_τ − S_J x‖ ≤ C ‖x‖ ‖∇τ‖.
• Theorem [B ’15]: if ψ is a wavelet such that ‖ψ‖_1 ≤ 1, and X(t) is a linear,
  stationary process with finite energy, then
    lim_{N→∞} E(‖Ŝ_N X − S X‖^2) = 0.
• Corollary: if moreover X(t) is bounded, then
    E(‖Ŝ_N X − S X‖^2) ≤ C ‖X‖_∞^2 N^{−1/2}.
Sparse Signal Recovery
• Theorem [B, M ’14]: suppose x_0(t) = Σ_n a_n δ(t − b_n) with |b_n − b_{n+1}| ≥ Δ,
  and S_J x_0 = S_J x with m = 1 and J = ∞. If ψ has compact support, then
    x(t) = Σ_n c_n δ(t − e_n),  with |e_n − e_{n+1}| ≳ Δ.
• Sx essentially identifies sparse measures, up to log-spacing factors.
• Here, sparsity is encoded in the measurements themselves.
• In 2-D, singular measures (i.e. curves) require m = 2 to be well characterized.
• The result also extends to sparsity in the Fourier domain (oscillatory signals).
Sparse Shape Reconstructions
• Original images of N^2 pixels.
• m = 1, 2^J = N: reconstruction from O(log_2 N) scattering coefficients.
• m = 2, 2^J = N: reconstruction from O((log_2 N)^2) scattering coefficients.
Ergodic Texture Reconstruction
• (Figure) Original textures; Gaussian process model with the same second-order
  moments; and the scattering model with m = 2, 2^J = N: reconstruction from
  O((log_2 N)^2) scattering coefficients.
Bernoulli Process
• (Figure) Samples from the original p and samples from the model p̃.

  N      E(‖Φ̂(X) − EΦ̂(X)‖^2) / ‖EΦ̂(X)‖^2     (Ĥ(p, p̃) − H(p)) / H(p) ± σ̂{log p(X̃)}
  2^12   5·10^-3                                2e-1 ± 1e-1
  2^14   3·10^-3                                2e-1 ± 5e-2
Ising above critical temperature
• (Figure) Samples from the original p and samples from the model p̃.

  N      E(‖Φ̂(X) − EΦ̂(X)‖^2) / ‖EΦ̂(X)‖^2     (Ĥ(p, p̃) − H(p)) / H(p) ± σ̂{log p(X̃)}
  2^12   3·10^-4                                1.2e-3 ± 3.5e-3
  2^14   1·10^-4                                0.023 ± 2.3e-3
Ising at critical temperature
• (Figure) Samples from the original p and samples from the model p̃.

  N      E(‖Φ̂(X) − EΦ̂(X)‖^2) / ‖EΦ̂(X)‖^2     (Ĥ(p, p̃) − H(p)) / H(p) ± σ̂{log p(X̃)}
  2^12   8·10^-3                                0.02 ± 4e-2
  2^14   2.5·10^-3                              0.012 ± 1e-2
Ising below critical temperature
• (Figure) Samples from the original p and samples from the model p̃.

  N      E(‖Φ̂(X) − EΦ̂(X)‖^2) / ‖EΦ̂(X)‖^2     (Ĥ(p, p̃) − H(p)) / H(p) ± σ̂{log p(X̃)}
  2^12   2·10^-2                                0.6 ± 0.25
  2^14   5·10^-3                                0.25 ± 0.1
Cox Point Process
• (Figure) Samples from the original p and samples from the model p̃.

  N      E(‖Φ̂(X) − EΦ̂(X)‖^2) / ‖EΦ̂(X)‖^2
  2^12   2·10^-2
  2^14   8·10^-3
Ergodic Texture Reconstruction using CNNs
• Results using a deep CNN from [Gatys et al., NIPS ’15].
• It uses a much larger feature vector than scattering (which uses O((log N)^2)
  coefficients).
Application: Super-Resolution
• Estimate the high-resolution image y from the low-resolution input x.
• Best linear method: the least-squares estimate (linear interpolation),
    ŷ = (Σ̂_x^† Σ̂_{xy}) x.
• State-of-the-art methods:
  – dictionary-learning super-resolution;
  – CNN-based: just train a CNN to regress from low-res to high-res.
  – They cleverly optimize a fundamentally unstable metric criterion (see the
    sketch after this list):
    Θ* = arg min_Θ Σ_i ‖F(x_i, Θ) − y_i‖^2,   ŷ = F(x, Θ*).
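• A minimal sketch of that regression criterion (a tiny convolutional regressor F
  trained with an L2 loss on synthetic patches; the architecture, sizes and data
  below are placeholder assumptions, written with PyTorch):

```python
import torch
import torch.nn as nn

# toy data: y plays the role of high-res patches, x a crude low-res version
# brought back to the high-res grid
torch.manual_seed(0)
y = torch.randn(64, 1, 32, 32)
x = nn.functional.interpolate(nn.functional.avg_pool2d(y, 2), scale_factor=2)

F = nn.Sequential(                          # F(x, Theta): small CNN regressor
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)
opt = torch.optim.Adam(F.parameters(), lr=1e-3)
for step in range(200):                     # Theta* = argmin sum_i ||F(x_i) - y_i||^2
    loss = ((F(x) - y) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

y_hat = F(x)                                # y_hat = F(x, Theta*)
```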
Scattering Approach
• Relax the metric: instead of regressing y from x directly, map x and y to the
  scattering domain with S, estimate there, and map back with S^{−1}.
  – Start with simple linear estimation in the scattering domain.
  – Deformation stability gives more approximation power in the transformed
    domain via locally linear methods.
  – The method is not necessarily better in terms of PSNR!
Some Numerical Results
• (Figures) Original image vs. linear estimate vs. best state-of-the-art vs.
  scattering estimate.
Sparse Spike Super-Resolution
(with I. Dokmanic, S. Mallat, M. de Hoop)
Examples with Cox Processes
(inhomogeneous Poisson point processes)
Conclusions and Open Problems
• CNNs: geometric encoding with built-in deformation stability.
  – Equipped to break the curse of dimensionality.
• This statistical advantage is useful both in supervised and unsupervised learning.
  – Gibbs CNN distributions are stable to deformations.
  – Exploited in high-dimensional inverse problems.
• Challenges ahead:
  – Decode the geometry learnt by CNNs: what is the role of higher layers, new
    invariance groups or rather pattern memorization?
  – CNNs and unsupervised learning: we need better inference.
  – Non-Euclidean domains (text, genomics, n-body dynamical systems).
Analysis of Non-convex Loss Surfaces
[joint work with D. Freeman (UC Berkeley)]
• Deep learning involves mostly non-convex optimization.
• Models from statistical physics have been considered as possible approximations
  [Dauphin et al. ’14, Choromanska et al. ’15, Sagun et al. ’15],
  although sometimes under unrealistic assumptions.
• Tensor factorization models capture some of the essence of the non-convexity
  [Anandkumar et al. ’15, Cohen et al. ’15, Haeffele et al. ’15],
  but there is still a gap between these models and the actual models/optimization.
• Q: In practice, how much of the non-convexity of the loss is visible to the SGD
  algorithm? [Shafran and Shamir ’15]
• Q: How can we quantify how “non-convex” our problem is, using an efficient,
  generic tool?
Non-convexity ≠ Not optimizable
• We can perturb any convex function in such a way that it is no longer convex,
  but such that gradient descent still converges.
• In high dimensions, this is even easier!
Analysis of Non-convex Loss Surfaces
[joint work with D. Freeman (UC Berkeley)]
• Given a loss E(θ), θ ∈ R^d, we consider its representation in terms of sub-level
  sets (for E ≥ 0):
    E(θ) = ∫_0^∞ 1(θ ∉ Ω_u) du,   Ω_u = {y ∈ R^d ; E(y) ≤ u}.
• A first notion we address is the topology of the level sets Ω_u.
• In particular, we ask how connected they are, i.e. how many connected
  components N_u there are at each energy level u.
Analysis of Non-convex Loss Surfaces
[joint work with D. Freeman (UC Berkeley)]
• A first notion we address is the topology of the level sets Ω_u: how connected
  are they, i.e. how many connected components N_u at each energy level u?
• This is directly related to the question of global minima.
  Proposition: if N_u = 1 for all u, then E has no poor local minima
  (i.e. no local minima y* such that E(y*) > min_y E(y)).
• The converse is not true, though.
Linear vs Non-linear deep models
[joint work with D. Freeman (UC Berkeley)]
• Some authors have considered linear “deep” models as a first step towards
  understanding generic models ([Ganguli et al., Choromanska et al.]):
    E(W_1, W_2, …, W_K) = Σ_i (W_K ⋯ W_2 W_1 x_i − y_i)^2   (multilinear regression).
• Linear and ReLU models behave very differently in terms of the topology of
  their level sets.
  Proposition:
  1. For any distribution (x_i, y_i) ∼ P_{X×Y}, we have N_u = 1 for all u in the
     linear case.
  2. For any architecture (choice of internal dimensions), there exist
     distributions P_{X×Y} such that N_u > 1 for some u in the ReLU case.
• Remarks:
  – In accordance with [Shamir ’16, “Distribution-Specific Hardness of Learning
    Neural Networks”].
  – We are currently working on closing the gap for (2) on ReLUs.
From Topology to Geometry
• The next question we are interested in is conditioning.
• Even if level sets are connected, how easy is it to navigate through them?
• How “large” and regular are they?
• (Figure) Left: easy to move from one energy level to a lower one; right: hard
  to move from one energy level to a lower one.
Finding Connected Components
[joint work with D. Freeman and Y. Bahri (UC Berkeley)]
• Suppose θ_1, θ_2 are such that E(θ_1) = E(θ_2) = u_0.
  – They are in the same connected component of Ω_{u_0} iff there is a path γ(t)
    with γ(0) = θ_1, γ(1) = θ_2 such that E(γ(t)) ≤ u_0 for all t ∈ (0, 1).
  – Moreover, we can restrict the notion to smooth level sets by penalizing the
    length of the path:
    E(γ(t)) ≤ u_0 for all t ∈ (0, 1), and ∫ ‖γ̇(t)‖ dt ≤ M.
• Dynamic programming approach: take the midpoint θ_m = (θ_1 + θ_2)/2, project it
  back onto the level set,
    θ_3 = arg min_{θ ∈ H; E(θ) ≤ u_0} ‖θ − θ_m‖,
  and recurse on the two halves (θ_1, θ_3) and (θ_3, θ_2); a sketch follows below.
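• A minimal sketch of this bisection procedure (the toy loss, the gradient-descent
  projection and the tolerance below are illustrative stand-ins; the actual
  algorithm performs the projection with a constrained optimizer over network
  weights):

```python
import numpy as np

def project_to_sublevel(theta, E, grad_E, u0, steps=200, lr=0.1):
    """Move theta (as little as possible) until E(theta) <= u0, via gradient descent."""
    theta = theta.copy()
    for _ in range(steps):
        if E(theta) <= u0:
            break
        theta -= lr * grad_E(theta)
    return theta

def connect(theta1, theta2, E, grad_E, u0, depth=6):
    """Recursively bisect the segment [theta1, theta2]; at each midpoint, project
    back into the sub-level set {E <= u0}. Returns the polygonal path."""
    if depth == 0:
        return [theta1, theta2]
    theta_m = project_to_sublevel(0.5 * (theta1 + theta2), E, grad_E, u0)
    left = connect(theta1, theta_m, E, grad_E, u0, depth - 1)
    right = connect(theta_m, theta2, E, grad_E, u0, depth - 1)
    return left[:-1] + right

# toy non-convex loss whose sub-level sets (annuli around the unit circle) are connected
E = lambda t: (np.linalg.norm(t) - 1.0) ** 2
grad_E = lambda t: 2.0 * (np.linalg.norm(t) - 1.0) * t / (np.linalg.norm(t) + 1e-12)
theta1, theta2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
path = connect(theta1, theta2, E, grad_E, u0=0.05)
print(max(E(p) for p in path))   # stays below u0 = 0.05 at every vertex of the path
```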
Numerical Experiments
• Compute the length of the geodesic in Ω_u obtained by the algorithm and
  normalize it by the Euclidean distance: a measure of the curviness of the
  level sets.
• Experiments on small neural networks learning to fit degree-k polynomials,
  and on CNNs on small datasets (MNIST).
• (Figures) Cubic fitting and MNIST.
Preliminary analysis and perspectives
• (Figures) Cubic fitting and MNIST.
• The number of connected components does not increase: no poor local minima
  detected so far when using typical datasets and typical architectures.
• Level sets become more irregular as the energy decreases.
• Presence of an “energy barrier”?
• Factoring out the global discrete symmetries of the model.
Sparse Coding
• Consider the following inference problem: given D ∈ R^{n×m} and x ∈ R^n,
    min_z E(z) = ½ ‖x − Dz‖^2 + λ ‖z‖_1.
• Long history in statistics and signal processing:
  – Lasso estimator for variable selection [Tibshirani ’95];
  – building block in many signal processing and machine learning pipelines
    [Mairal et al. ’10].
• The problem is convex, with a unique solution for generic D, but it is not
  strongly convex in general.
Sparse Coding and Iterative Thresholding
• A popular approach to solving sparse coding is via iterative splitting
  algorithms (ISTA) [Bruck, Passty, ’70s]:
    z^(n) = ρ_{ηλ}((I − η DᵀD) z^(n−1) + η Dᵀx),  with ρ_t(x) = sign(x)·max(0, |x| − t).
• When the step size satisfies η ≤ 1/‖D‖^2, z^(n) converges to a solution, in the
  sense that
    E(z^(n)) − E(z*) ≤ ‖z^(0) − z*‖^2 / (2ηn).    [Beck, Teboulle ’09]
  – Sublinear convergence, due to the lack of strong convexity.
  – However, linear convergence can be obtained under weaker conditions
    (e.g. RSC/RSM, [Agarwal & Wainwright]).
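• A minimal ISTA sketch following the iteration above (random Gaussian dictionary
  and synthetic sparse code; the step size is set to 1/‖D‖^2):

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(0.0, np.abs(v) - t)

def ista(D, x, lam, n_iter=500):
    """z^(n) = rho_{eta*lam}((I - eta D^T D) z^(n-1) + eta D^T x), with eta = 1/||D||^2."""
    eta = 1.0 / np.linalg.norm(D, 2) ** 2
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = soft_threshold(z - eta * D.T @ (D @ z - x), eta * lam)
    return z

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128)) / np.sqrt(64)
z_true = np.zeros(128)
z_true[rng.choice(128, size=8, replace=False)] = rng.standard_normal(8)
x = D @ z_true
z_hat = ista(D, x, lam=0.05)
print(np.sum(np.abs(z_hat) > 1e-3))   # a sparse estimate, supported on few coordinates
```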
LISTA [Gregor & LeCun ’10]
• The Lasso (sparse coding operator) can be implemented as a specific deep network
  with infinitely many recursive layers:
    z = Φ(x):  z ← ρ(Dᵀx + V z), iterated until convergence (the ISTA recursion).
• Can we accelerate sparse inference with a shallower network with trained
  parameters? In practice, yes:
    F(x, W, S):  z ← ρ(W x + S z), unrolled for a fixed number M of steps, with
  W and S learned from data.
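• A minimal sketch of the truncated, parameterized network F(x, W, S) (forward
  pass only; in LISTA, W, S and the thresholds are then refined by
  back-propagating a loss against precomputed sparse codes, which is omitted here):

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(0.0, np.abs(v) - t)

def lista_forward(x, W, S, theta, n_layers=3):
    """z <- rho_theta(W x + S z), repeated for a small, fixed number of layers."""
    z = soft_threshold(W @ x, theta)
    for _ in range(n_layers - 1):
        z = soft_threshold(W @ x + S @ z, theta)
    return z

# ISTA-like initialization of the parameters (the values LISTA would then fine-tune)
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128)) / np.sqrt(64)
L = np.linalg.norm(D, 2) ** 2
W = D.T / L
S = np.eye(128) - D.T @ D / L
x = D @ np.where(rng.random(128) < 0.05, rng.standard_normal(128), 0.0)
z3 = lista_forward(x, W, S, theta=0.05 / L)   # three "unrolled" ISTA-like steps
print(z3.shape)
```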
Sparsity Stable Matrix Factorizations
• Principle of proximal splitting: the regularization term λ‖z‖_1 is separable in
  the canonical basis:
    ‖z‖_1 = Σ_i |z_i|.
• Using convexity, we find an upper bound of the energy that is also separable:
    E(z) ≤ Ẽ(z; z^(n)) = E(z^(n)) + ⟨B(z^(n) − y), z − z^(n)⟩ + Q(z, z^(n)),  with
    Q(z, u) = ½ (z − u)ᵀ S (z − u) + λ‖z‖_1,   B = DᵀD,   y = D†x,
  and S diagonal such that S − B ⪰ 0.
• Explicit minimization via the proximal operator:
    z^(n+1) = arg min_z ⟨B(z^(n) − y), z − z^(n)⟩ + Q(z, z^(n)).
Sparsity Stable Matrix Factorizations
• Consider now a unitary matrix A and
    E(z) ≤ Ẽ_A(z; z^(n)) = E(z^(n)) + ⟨B(z^(n) − y), z − z^(n)⟩ + Q(Az, Az^(n)).
• Observation: Ẽ_A(z; z^(n)) still admits an explicit solution via a proximal
  operator:
    arg min_z Ẽ_A(z; z^(n))
      = Aᵀ arg min_z ( ⟨v, z⟩ + ½ (z − Az^(n))ᵀ S (z − Az^(n)) + λ‖z‖_1 ),
  where v denotes the rotated linear term A B(z^(n) − y).
• Q: How should we choose the rotation A?
Sparsity Stable Matrix Factorizations
• We denote
    δ_A(z) = λ(‖Az‖_1 − ‖z‖_1),   R = AᵀSA − B.
• δ_A(z) measures the invariance of the ℓ_1 ball under the action of A.
• Proposition: if R ⪰ 0 and z^(n+1) = arg min_z Ẽ_A(z; z^(n)), then
    E(z^(n+1)) − E(z*) ≤ ½ (z* − z^(n))ᵀ R (z* − z^(n)) + δ_A(z*) − δ_A(z^(n+1)).
• We are thus interested in factorizations (A, S) such that
  – ‖R‖ is small, and
  – |δ_A(z) − δ_A(z′)| is small.
Improving the Upper Bound
• Using the adaptive factorization, we can improve the upper bound given by ISTA.
• Proposition: a factorization step using (A, S) followed by ISTA yields
    E(z^(n)) − E(z*) ≤
      [ (z* − z^(0))ᵀ R (z* − z^(0)) + (z* − z^(1))ᵀ R (z* − z^(1))
        + 2 L_A(z^(1)) (‖z* − z^(0)‖ + ‖z^(1) − z^(0)‖) ] / (2n).
  – L_A(z): the Lipschitz constant of δ_A at z.
  – L_A(z) is bounded by C (‖z‖_0 + ‖Az‖_0)^{1/2}.
• Q: How does this compare to the baseline ISTA?
Phase Transition
• Given a current estimate z^(n), a factorization (A, S) improves the ISTA upper
  bound whenever
    ‖R‖ + 2 L_A(z^(n+1)) / ‖z* − z^(n)‖ ≤ ‖B‖ / 2.
• During the initial iterations, ‖z* − z^(n)‖ is larger, allowing rotations A
  further from the identity.
• At some point, ‖z* − z^(n)‖ becomes small enough that the commutation error
  dominates: in that case, the bound recommends switching to the standard ISTA
  iteration (A = I).
• Grain of salt: this only concerns an upper bound!
Numerical Experiments
• Random Gaussian dictionary, with data generated from generic sparse codes
  (no structured sparsity). (Figure)
• Real data with structured dictionaries. (Figure)
Minimax or “adversarial” Dictionaries
• What are the worst possible dictionaries in terms of such a factorization?
    D = [d_1 … d_m] ∈ R^{n×m},
    d_j(k) = e^{i 2π j ζ_k / M},   with {ζ_k}_{k ≤ n} a random subset of n frequencies.
• For such dictionaries the acceleration is severely reduced, as predicted
  (a construction sketch follows below).
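• A minimal sketch of this construction (kept complex here; since the slide writes
  R^{n×m}, one may equally stack real and imaginary parts; M, n, m are illustrative):

```python
import numpy as np

def adversarial_dictionary(n, m, M, seed=0):
    """Columns d_j(k) = exp(i 2 pi j zeta_k / M), with {zeta_k} a random subset of
    n out of M frequencies: a partial Fourier-type dictionary for which, per the
    talk, the factorization-based acceleration largely disappears."""
    rng = np.random.default_rng(seed)
    zeta = rng.choice(M, size=n, replace=False)          # random frequency subset
    j = np.arange(m)
    D = np.exp(2j * np.pi * np.outer(zeta, j) / M)       # shape (n, m)
    return D / np.sqrt(n)                                # unit-norm columns

print(adversarial_dictionary(32, 64, 256).shape)
```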


Conclusions and Open Questions
• Trainable sparse coding can be studied as a form of matrix factorization.
  – The model partially explains when acceleration is possible, and for how long.
• Extension to other problems that admit separable proximal operators.
• Extension to the accelerated setting (momentum).
• The result is not constructive:
  – the objective function of the factorization is not convex;
  – LISTA optimizes the right objective function!