Joan Bruna
Courant Institute, NYU
collaborators:
Stephane Mallat (ENS, France), Ivan Dokmanic (UIUC), M. de Hoop (Rice)
Pablo Sprechmann (NYU), Yann LeCun (NYU)
Daniel Freeman (UC Berkeley)
Curse of Dimensionality
• Learning as a high-dimensional interpolation problem:
  Observe $\{(x_i, y_i = f(x_i))\}_{i \le N}$ for some unknown $f : \mathbb{R}^d \to \mathbb{R}$.
  [Figure: training samples scattered in a high-dimensional space]
• Goal: Estimate $\hat f$ from training data.
• “Pessimist” view: $N$ should be $\sim \epsilon^{-d}$.
  We pay an exponential price on the input dimension.
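A rough worked instance of this count (a hedged reading, assuming a Lipschitz target and an $\epsilon$-covering of a bounded domain): reaching accuracy $\epsilon = 0.1$ in dimension $d = 20$ already requires on the order of $N \sim \epsilon^{-d} = 10^{20}$ samples.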
Curse of Dimensionality
• Therefore, in order to beat the curse of dimensionality, it is necessary to make assumptions about our data and exploit them in our models.
  [Figure: example data classes — textures, art, fractals, Ising spin glass, turbulence]
Towards a better non-convex picture
• The typical dichotomy in Optimization is convex vs non-convex problems.
  [Figure: a convex energy $f(u)$ (GD-solvable) vs a non-convex energy $f(u)$]
Inverse Problems
• Observation model: $y = Ax + w$, with $x, y, w \in L^2(\mathbb{R}^d)$, where
  – $A$ is singular,
  – $x$ is drawn from a certain distribution (e.g. natural images),
  – $w$ is noise independent of $x$.
• Examples: Super-Resolution, Inpainting, Deconvolution.
• Standard Regularization route:
  $$\hat x = \arg\min_x \|A x - y\|^2 + R(x), \qquad R(x) \text{ (typically) convex.}$$
• Q: How to leverage training data?
• Q: Is this formulation always appropriate to estimate images?
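As a concrete (hedged) instance of this regularization route, here is a minimal numpy sketch solving the quadratically regularized case $R(x) = \lambda\|x\|^2$ for a generic degradation matrix; the operator `A`, the weight `lam`, and the toy data are illustrative placeholders, not the settings used in the talk.

```python
import numpy as np

def regularized_inverse(A, y, lam=1e-2):
    """Solve x_hat = argmin_x ||A x - y||^2 + lam * ||x||^2 in closed form."""
    d = A.shape[1]
    # Normal equations of the Tikhonov-regularized least squares problem.
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y)

# Toy usage: a non-invertible degradation (2x downsampling by local averaging).
rng = np.random.default_rng(0)
x_true = rng.standard_normal(64)
A = np.kron(np.eye(32), np.array([[0.5, 0.5]]))   # 32x64 averaging/downsampling
y = A @ x_true + 0.01 * rng.standard_normal(32)
x_hat = regularized_inverse(A, y)
```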
Regularization and Learning
• The inverse problem requires a high-dimensional model for $p(x \mid y)$.
• The underlying probabilistic model for regularized least squares is
  $$p(x \mid y) \;\propto\; \mathcal{N}(Ax - y,\, I)\, e^{-R(x)}.$$
  [Figure: convex combination $\alpha x_1 + (1-\alpha)x_2$ of two images]
• The conditional distribution of images is not well modeled with Gaussian noise alone; it is non-convex in general. How to model it?
Learning with Sufficient Statistics
• A feature representation $\Phi(x)$ defines a class of densities (relative to a base measure) via sufficient statistics:
  $$p_\theta(x) = g_\theta(\Phi(x)).$$
• $\Phi(x)$ may encode prior knowledge to break the curse of dimensionality (e.g. if $\dim(\Phi(x)) \ll \dim(x)$).
• If $\Phi(x)$ is stable to transformations and $g_\theta$ is smooth, then $p_\theta$ is also stable.
Canonical Representation
• The canonical ensemble representation is defined as the exponential family determined by $\Phi(x)$:
  $$p_\theta(x) = \frac{\exp\big(\langle \theta, \Phi(x)\rangle\big)}{Z_\theta}. \qquad \text{[L. Boltzmann]}$$
  [Figure: signals $x$, $x'$, transformed signals $g_1^p.x'$, and their representations $\Phi(x)$, $\Phi(x')$, $\Phi(g_1^p.x')$]
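To make the exponential family concrete, here is a minimal sketch (not from the talk) of fitting $\theta$ by maximum likelihood on a small discrete domain, where the partition function $Z_\theta$ can be summed exactly; the feature map `phi`, the toy domain, and the learning rate are illustrative placeholders.

```python
import numpy as np

def fit_exponential_family(samples, phi, domain, lr=0.5, n_iter=500):
    """Fit p_theta(x) = exp(<theta, phi(x)>) / Z_theta by gradient ascent on the log-likelihood.

    The gradient is E_data[phi(X)] - E_theta[phi(X)] (moment matching).
    `domain` enumerates all states so Z_theta is computed exactly (toy setting only).
    """
    feats_domain = np.array([phi(x) for x in domain])        # |domain| x k
    target = np.mean([phi(x) for x in samples], axis=0)      # empirical sufficient statistics
    theta = np.zeros(feats_domain.shape[1])
    for _ in range(n_iter):
        logits = feats_domain @ theta
        p = np.exp(logits - logits.max())
        p /= p.sum()                                          # model distribution over the domain
        model_moments = p @ feats_domain                      # E_theta[phi(X)]
        theta += lr * (target - model_moments)
    return theta

# Toy usage: binary strings of length 5, features = (mean, nearest-neighbour correlation).
domain = [np.array(b) for b in np.ndindex(*(2,) * 5)]
phi = lambda x: np.array([x.mean(), np.mean(x[:-1] * x[1:])])
rng = np.random.default_rng(0)
samples = [rng.integers(0, 2, 5) for _ in range(200)]
theta = fit_exponential_family(samples, phi, domain)
```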
Geometric Variability Prior
• Blur operator: $Ax = x * \phi$, with $\phi$ a local average.
  – The only linear operator $A$ stable to deformations [Bruna'12]:
    $$\|A L_\tau x - A x\| \lesssim \|\tau\|\, \|x\|.$$
  [Figure: local averaging window $\phi(u)$]
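A small numerical illustration of this stability claim (my own toy sketch, not from the slides): a tiny warp $L_\tau x(u) = x(u - \tau(u))$ changes a high-frequency signal a lot in $L^2$, but barely changes its local average $Ax = x * \phi$.

```python
import numpy as np

n = 1024
u = np.arange(n) / n
x = np.sin(2 * np.pi * 60 * u)                       # high-frequency signal
tau = 0.002 * np.sin(2 * np.pi * u)                  # small, smooth deformation field
x_warp = np.interp(u - tau, u, x, period=1.0)        # L_tau x (u) = x(u - tau(u))

phi = np.ones(32) / 32                               # local averaging window
blur = lambda v: np.convolve(v, phi, mode="same")    # Ax = x * phi

print(np.linalg.norm(x_warp - x) / np.linalg.norm(x))             # sizable relative change
print(np.linalg.norm(blur(x_warp) - blur(x)) / np.linalg.norm(x)) # much smaller after blurring
```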
Scattering cascade: each wavelet-modulus operator $|W_J|$ maps the current layer to the next and outputs locally averaged coefficients,
$$|W_J| : \; |f * \psi_{j_1,\gamma_1}| \;\longmapsto\; \Big\{\, |f * \psi_{j_1,\gamma_1}| * \phi_J \,,\;\; \big|\,|f * \psi_{j_1,\gamma_1}| * \psi_{j_2,\gamma_2}\big| \,\Big\}_{j_2,\gamma_2},$$
so that after $m$ layers
$$\big|\,|f * \psi_{j_1,\gamma_1}| * \cdots * \psi_{j_m,\gamma_m}\big| \;\xrightarrow{\;|W_J|\;}\; \Big\{\, \big|\,|f * \psi_{j_1,\gamma_1}| * \cdots * \psi_{j_m,\gamma_m}\big| * \phi_J \,,\;\; \big|\,|f * \psi_{j_1,\gamma_1}| * \cdots * \psi_{j_{m+1},\gamma_{m+1}}\big| \,\Big\}.$$
Cascade of contractive operators.
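A minimal 1D sketch of this cascade (my own illustration, assuming crude band-pass filters built ad hoc; a production implementation would use a proper wavelet filter bank such as Kymatio's):

```python
import numpy as np

def filter_bank(n, J):
    """Crude dyadic band-pass filters psi_j and a low-pass phi_J, defined in Fourier."""
    freqs = np.fft.fftfreq(n)
    psis = [np.exp(-0.5 * ((np.abs(freqs) - 2.0 ** (-j - 1)) / 2.0 ** (-j - 3)) ** 2)
            for j in range(J)]
    phi = np.exp(-0.5 * (freqs / 2.0 ** (-J - 1)) ** 2)
    return psis, phi

def conv(x, h_hat):
    return np.real(np.fft.ifft(np.fft.fft(x) * h_hat))

def scattering(x, J=4, m_max=2):
    """Return coefficients {| ... |x * psi_j1| ... * psi_jm| * phi_J} up to order m_max."""
    psis, phi = filter_bank(len(x), J)
    layers, coeffs = [x], [conv(x, phi)]
    for _ in range(m_max):
        next_layer = []
        for u in layers:
            for psi in psis:
                v = np.abs(conv(u, psi))       # wavelet modulus
                coeffs.append(conv(v, phi))    # local average -> output coefficient
                next_layer.append(v)           # propagated to the next layer
        layers = next_layer
    return coeffs

coeffs = scattering(np.random.randn(256))
```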
Image Examples
[Bruna, Mallat, '11,'12]
[Figure: images $f$ with their Fourier transform $\hat f$, first-order wavelet coefficients $|f * \psi_{\lambda_1}| * \phi$ (comparable to SIFT), and second-order scattering coefficients $\big|\,|f * \psi_{\lambda_1}| * \psi_{\lambda_2}\big| * \phi$]
Scattering Moments of Stationary Processes
For a stationary process $X$, each node of the cascade outputs an expected scattering moment and propagates a new wavelet modulus:
$$\big|\,|X \star \psi_{j_1,\gamma_1}| \star \psi_{j_2,\gamma_2}\big| \;\xrightarrow{\;|W_J|\;}\; \mathbb{E}\Big(\big|\,|X \star \psi_{j_1,\gamma_1}| \star \psi_{j_2,\gamma_2}\big|\Big), \qquad \forall\, j_i,\gamma_i,$$
$$\cdots$$
$$\big|\cdots|X \star \psi_{j_1,\gamma_1}| \star \cdots \star \psi_{j_m,\gamma_m}\big| \;\xrightarrow{\;|W_J|\;}\; \mathbb{E}\Big(\big|\cdots|X \star \psi_{j_1,\gamma_1}| \star \cdots \star \psi_{j_m,\gamma_m}\big|\Big), \qquad \forall\, j_i,\gamma_i,$$
which then propagates to $\big|\cdots|X \star \psi_{j_1,\gamma_1}| \star \cdots \star \psi_{j_{m+1},\gamma_{m+1}}\big|$.
Properties of Scattering Moments
• Captures high order moments [Bruna, Mallat, '11,'12].
  [Figure: power spectrum vs scattering coefficients $S_J[p]X$ for orders $m=1$ and $m=2$]
• Stability to deformations:
  $$\|S_J x_\tau - S_J x\| \le C\,\|x\|\,\|\nabla\tau\|.$$
• With $m = 2$ and $2^J = N$: reconstruction from $O((\log N)^2)$ scattering coefficients.
Ergodic Texture Reconstruction
[Figure: original textures and their reconstructions]
• With $m = 2$ and $2^J = N$: reconstruction from $O((\log N)^2)$ scattering coefficients.
Bernoulli Process

  N      E(‖Φ̂(X) − E Φ̂(X)‖²) / ‖E Φ̂(X)‖²    (Ĥ(p, p̃) − H(p) ± σ̂{log p(X̃)}) / H(p)
  2^12   5·10⁻³                                2·10⁻¹ ± 1·10⁻¹
  2^14   3·10⁻³                                2·10⁻¹ ± 5·10⁻²
Ising above critical temperature

  N      E(‖Φ̂(X) − E Φ̂(X)‖²) / ‖E Φ̂(X)‖²    (Ĥ(p, p̃) − H(p) ± σ̂{log p(X̃)}) / H(p)
  2^12   3·10⁻⁴                                1.2·10⁻³ ± 3.5·10⁻³
  2^14   1·10⁻⁴                                0.023 ± 2.3·10⁻³
Ising at critical temperature

  N      E(‖Φ̂(X) − E Φ̂(X)‖²) / ‖E Φ̂(X)‖²    (Ĥ(p, p̃) − H(p) ± σ̂{log p(X̃)}) / H(p)
  2^12   8·10⁻³                                0.02 ± 4·10⁻²
  2^14   2.5·10⁻³                              0.012 ± 1·10⁻²
Ising below critical temperature

  N      E(‖Φ̂(X) − E Φ̂(X)‖²) / ‖E Φ̂(X)‖²    (Ĥ(p, p̃) − H(p) ± σ̂{log p(X̃)}) / H(p)
  2^12   2·10⁻²                                0.6 ± 0.25
  2^14   5·10⁻³                                0.25 ± 0.1
Cox Point Process

  N      E(‖Φ̂(X) − E Φ̂(X)‖²) / ‖E Φ̂(X)‖²
  2^12   2·10⁻²
  2^14   8·10⁻³
Ergodic Texture Reconstruction using CNNs
• Results using a deep CNN from [Gatys et al, NIPS'15].
• Uses a much larger feature vector than scattering (which needs only $O((\log N)^2)$ coefficients).
Application: Super-Resolution
[Figure: low-resolution input $x$ and high-resolution target $y$]
• Best Linear Method: Least Squares estimate (linear interpolation):
  $$\hat y = \big(\hat\Sigma_x^{\dagger}\, \hat\Sigma_{xy}\big)\, x.$$
• State-of-the-art Methods:
  – Dictionary-learning Super-Resolution.
  – CNN-based: just train a CNN to regress from low-res to high-res.
  – They cleverly optimize a fundamentally unstable metric criterion:
    $$\Theta^* = \arg\min_\Theta \sum_i \|F(x_i, \Theta) - y_i\|^2, \qquad \hat y = F(x, \Theta^*).$$
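For reference, a small numpy sketch of the linear least-squares estimator above, fit from (low-res, high-res) training pairs; the variable names and the synthetic data are illustrative only, and a practical implementation would also center the data.

```python
import numpy as np

def fit_linear_sr(X, Y):
    """Least-squares linear map: y_hat = (Sigma_x^dagger Sigma_xy) applied to x.

    X: (n_samples, dx) low-resolution vectors; Y: (n_samples, dy) high-resolution vectors.
    """
    sigma_x = X.T @ X / len(X)                 # second-moment matrix of the inputs
    sigma_xy = X.T @ Y / len(X)                # input/output cross moments
    return np.linalg.pinv(sigma_x) @ sigma_xy  # (dx, dy) regression matrix

rng = np.random.default_rng(0)
Y = rng.standard_normal((1000, 16))                       # "high-res" patches
X = Y[:, ::2] + 0.05 * rng.standard_normal((1000, 8))     # "low-res": downsampled + noise
W = fit_linear_sr(X, Y)
y_hat = X @ W                                             # linear super-resolution estimates
```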
Scattering Approach
• Relax the metric: measure the regression error in a stable representation $S$ rather than in pixel space.
  [Diagram: low-res $x$ and high-res $y$; the map $F$ is learned in the scattering domain between $Sx$ and $Sy$, and $\hat y$ is recovered by inverting $S$ ($S^{-1}$)]
  [Figure: Original, Linear Estimate, state-of-the-art, Scattering]
Some Numerical Results
[Figure: Original, Linear Estimate, Best state-of-the-art, Scattering Estimate]
Sparse Spike Super-Resolution
(with I. Dokmanic, S. Mallat, M. de Hoop)
Examples with Cox Processes
(inhomogeneous Poisson point processes)
Conclusions and Open Problems
• CNNs: Geometric encoding with built-in deformation stability.
– Equipped to break the curse of dimensionality.
• Challenges Ahead:
– Decode the geometry learned by CNNs: what is the role of higher layers? New invariance groups, or rather pattern memorization?
– CNNs and unsupervised learning: we need better inference.
– Non-Euclidean domains (text, genomics, n-body dynamical systems)
Analysis of Non-convex Loss Surfaces
[joint work with D. Freeman (UC Berkeley)]
• A first notion we address is the topology of the level sets $\Omega_u = \{\theta : E(\theta) \le u\}$.
• Remarks:
  – In accordance with [Shamir'16, "Distribution-Specific Hardness of Learning Neural Networks"].
  – We are currently working on closing the gap for (2) on ReLUs.
From Topology to Geometry
• The next question we are interested in is conditioning.
• Even if level sets are connected, how easy is it to navigate through them?
• How “large” and regular are they?
  [Figure: two landscapes — one where it is easy to move from one energy level to a lower one, one where it is hard]
Finding Connected Components
[joint work with D. Freeman and Y. Bahri (UC Berkeley)]
• Suppose $\theta_1, \theta_2$ are such that $E(\theta_1) = E(\theta_2) = u_0$.
  – They are in the same connected component of $\Omega_{u_0}$ iff there is a path $\gamma(t)$ with $\gamma(0) = \theta_1$, $\gamma(1) = \theta_2$ such that
    $$\forall\, t \in (0,1), \quad E(\gamma(t)) \le u_0.$$
  – Moreover, we can restrict the notion to smooth level sets by penalizing the length of the path:
    $$\forall\, t \in (0,1), \quad E(\gamma(t)) \le u_0 \quad \text{and} \quad \int \|\dot\gamma(t)\|\,dt \le M.$$
• Dynamic programming approach (see the sketch after this slide): connect $\theta_1$ and $\theta_2$ by finding a point of the level set near their midpoint, then recurse on each half,
  $$\theta_m = \frac{\theta_1 + \theta_2}{2}, \qquad \theta_3 = \arg\min_{\theta \in H;\ E(\theta) \le u_0} \|\theta - \theta_m\|.$$
  [Figure: level set $\Omega_u$ with endpoints $\theta_1$, $\theta_2$, and the midpoint projection $\theta_3$ onto the hyperplane $H$]
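A minimal Python sketch of this bisection heuristic (my own illustration, under simplifying assumptions): the loss `E` and its gradient `grad_E` are generic callables, and the projection onto $\{E \le u_0\}$ near the midpoint is done with plain gradient descent rather than the constrained solve on the hyperplane $H$.

```python
import numpy as np

def project_to_sublevel(theta0, E, grad_E, u0, lr=0.05, max_iter=500):
    """Descend E from theta0 until E(theta) <= u0; return None if we fail."""
    theta = theta0.copy()
    for _ in range(max_iter):
        if E(theta) <= u0:
            return theta
        theta = theta - lr * grad_E(theta)
    return theta if E(theta) <= u0 else None

def connect(theta1, theta2, E, grad_E, u0, depth=0, max_depth=10, tol=1e-3):
    """Recursively build a polygonal path between theta1 and theta2 inside {E <= u0}."""
    if np.linalg.norm(theta1 - theta2) < tol or depth >= max_depth:
        return [theta1, theta2]
    theta_m = 0.5 * (theta1 + theta2)
    theta3 = project_to_sublevel(theta_m, E, grad_E, u0)
    if theta3 is None:
        return None  # could not certify connectivity at this energy level
    left = connect(theta1, theta3, E, grad_E, u0, depth + 1, max_depth, tol)
    right = connect(theta3, theta2, E, grad_E, u0, depth + 1, max_depth, tol)
    return None if (left is None or right is None) else left[:-1] + right
```

The normalized length of the returned path (its length divided by $\|\theta_1 - \theta_2\|$) is the curviness measure used in the experiments on the next slide.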
Numerical Experiments
• Compute the length of the geodesic in $\Omega_u$ obtained by the algorithm and normalize it by the Euclidean distance between the endpoints: a measure of the curviness of the level sets.
• Experiments on small neural networks learning to fit degree-$k$ polynomials, and on CNNs on small datasets (MNIST).
ISTA Convergence
• When the step size is at most $\frac{1}{\|D\|^2}$, $z^{(n)}$ converges to a solution, in the sense that
  $$E(z^{(n)}) - E(z^*) \le \frac{\|z^{(0)} - z^*\|^2}{2n}. \qquad \text{[Beck, Teboulle '09]}$$
  – Sublinear convergence due to lack of strong convexity.
  – However, linear convergence can be obtained under weaker conditions (e.g. RSC/RSM, [Agarwal & Wainwright]).
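A minimal numpy sketch of the ISTA iteration whose rate is quoted above, for $E(z) = \tfrac12\|Dz - x\|^2 + \lambda\|z\|_1$; the dictionary `D`, the weight `lam`, and the toy data are illustrative placeholders.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(D, x, lam=0.1, n_iter=500):
    """Minimize 0.5 * ||D z - x||^2 + lam * ||z||_1 by proximal gradient descent."""
    L = np.linalg.norm(D, 2) ** 2            # Lipschitz constant of the smooth part: ||D||^2
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ z - x)             # gradient of the quadratic term
        z = soft_threshold(z - grad / L, lam / L)
    return z

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 256))
z_true = np.zeros(256); z_true[rng.choice(256, 5, replace=False)] = 1.0
z_hat = ista(D, D @ z_true + 0.01 * rng.standard_normal(64))
```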
LISTA [Gregor & LeCun'10]
• The Lasso (sparse coding operator) can be implemented as a specific deep network with infinitely many, recursive layers.
  [Diagram: ISTA unrolled as a recurrent network computing the sparse code $z(x)$: $z_0 = 0$, then repeated layers $z \mapsto \rho(Vz + D^{t}x)$]
• Can we accelerate the sparse inference with a shallower network, with trained parameters? In practice, yes.
  [Diagram: LISTA $F(x, W, S)$: $M$ steps of $z \mapsto \rho(Sz + Wx)$ starting from $z_0 = 0$, with trained $W$ and $S$]
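A minimal PyTorch sketch of the unrolled, shared-weight architecture $F(x, W, S)$ sketched above (my own illustration; layer sizes, thresholds, and the omitted training loop, which would regress to Lasso codes or minimize the Lasso energy, are placeholders):

```python
import torch
import torch.nn as nn

class LISTA(nn.Module):
    """LISTA-style network: M soft-thresholding steps with learned W, S and thresholds."""
    def __init__(self, input_dim, code_dim, n_steps=3, theta0=0.1):
        super().__init__()
        self.W = nn.Linear(input_dim, code_dim, bias=False)   # plays the role of D^t / L
        self.S = nn.Linear(code_dim, code_dim, bias=False)    # plays the role of I - D^t D / L
        self.theta = nn.Parameter(torch.full((code_dim,), theta0))  # learned thresholds
        self.n_steps = n_steps

    def shrink(self, v):
        # soft-thresholding nonlinearity rho
        return torch.sign(v) * torch.relu(torch.abs(v) - self.theta)

    def forward(self, x):
        b = self.W(x)              # W x, computed once
        z = self.shrink(b)         # first step from z_0 = 0
        for _ in range(self.n_steps - 1):
            z = self.shrink(b + self.S(z))
        return z

model = LISTA(input_dim=64, code_dim=256, n_steps=3)
z = model(torch.randn(8, 64))      # batch of 8 sparse codes
```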
Sparsity Stable Matrix Factorizations
• Principle of proximal splitting: the regularization term $\|z\|_1$ is separable in the canonical basis:
  $$\|z\|_1 = \sum_i |z_i|.$$
• Using convexity, we find an upper bound of the energy that is also separable:
  $$E(z) \le \tilde E(z; z^{(n)}) = E(z^{(n)}) + \langle B(z^{(n)} - y),\, z - z^{(n)}\rangle + Q(z, z^{(n)}), \quad \text{with}$$
  $$Q(z, u) = \tfrac12 (z - u)^T S (z - u) + \|z\|_1, \qquad B = D^T D, \qquad y = D^\dagger x,$$
  and $S$ diagonal such that $S - B \succeq 0$.
• Explicit minimization via the proximal operator (a worked instance follows below):
  $$E(z) \le \tilde E_A(z; z^{(n)}) = E(z^{(n)}) + \langle B(z^{(n)} - y),\, z - z^{(n)}\rangle + Q(Az, Az^{(n)}).$$
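As a worked instance of this explicit minimization (my own derivation for the diagonal case $A = I$ and $S = \frac{1}{\eta} I$, which may differ in constants from the talk's conventions): minimizing $\tilde E(z; z^{(n)})$ coordinate-wise gives a soft-thresholding update,
$$z^{(n+1)} = \mathrm{prox}_{\eta\|\cdot\|_1}\!\Big(z^{(n)} - \eta\, B\,(z^{(n)} - y)\Big), \qquad \big(\mathrm{prox}_{t\|\cdot\|_1}(v)\big)_i = \mathrm{sign}(v_i)\,\max\big(|v_i| - t,\, 0\big),$$
which is exactly one ISTA step with step size $\eta \le 1/\|D\|^2$.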
Sparsity Stable Matrix Factorizations
• Consider now a unitary matrix $A$ and
  $$E(z) \le \tilde E_A(z; z^{(n)}) = E(z^{(n)}) + \langle B(z^{(n)} - y),\, z - z^{(n)}\rangle + Q(Az, Az^{(n)}).$$