
Invariance and Stability to Deformations of Deep Convolutional Representations

Julien Mairal
Inria Grenoble

ML and AI workshop, Telecom ParisTech, 2018



This is mostly the work of Alberto Bietti

A. Bietti and J. Mairal. Group Invariance, Stability to Deformations, and Complexity of Deep Convolutional Representations. arXiv:1706.03078, 2018.

A. Bietti and J. Mairal. Invariance and Stability of Deep Convolutional Representations. NIPS, 2017.



Learning a predictive model
The goal is to learn a prediction function f : R^p → R given labeled
training data (xi, yi)i=1,...,n with xi in R^p and yi in R:

min_{f∈F}  (1/n) ∑_{i=1}^n L(yi, f(xi)) + λ Ω(f),

where the sum is the empirical risk (data fit) and λΩ(f) is the regularization.
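As a concrete instance, here is a minimal sketch, assuming a linear model f(x) = ⟨w, x⟩, squared loss, and Ω(f) = ‖w‖²; under these assumptions the problem is ridge regression with a closed-form solution (the data below is synthetic):

```python
import numpy as np

# Synthetic labeled data (x_i, y_i), x_i in R^p, y_i in R.
rng = np.random.default_rng(0)
n, p, lam = 100, 5, 0.1
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

# With L(y, f(x)) = (y - f(x))^2 and Omega(f) = ||w||^2 for f(x) = <w, x>,
# setting the gradient to zero gives (X^T X / n + lam * I) w = X^T y / n.
w = np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

empirical_risk = np.mean((y - X @ w) ** 2)
print(empirical_risk, empirical_risk + lam * w @ w)  # data fit, full objective
```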



Objectives
Deep convolutional signal representations
Are they stable to deformations?
How can we achieve invariance to transformation groups?
Do they preserve signal information?

Learning aspects
Building a functional space for CNNs (or similar objects); this still involves the ERM problem.
Deriving a measure of model complexity.
A kernel perspective
min_{f∈H}  (1/n) ∑_{i=1}^n L(yi, f(xi)) + λ ‖f‖²_H.

Map data to a Hilbert space (RKHS) and work with linear forms:

Φ : X → H   and   f(x) = ⟨Φ(x), f⟩_H.

[Figure: the kernel mapping ϕ sends points x of the input space X to elements ϕ(x) of the Hilbert space H.]


A kernel perspective
min_{f∈H}  (1/n) ∑_{i=1}^n L(yi, f(xi)) + λ ‖f‖²_H.

Main purpose: embed data in a vectorial space where
many geometrical operations exist (angle computation, projection on linear subspaces, definition of barycenters, ...).
one may learn potentially rich infinite-dimensional models.
regularization is natural:

|f(x) − f(x′)| ≤ ‖f‖_H ‖Φ(x) − Φ(x′)‖_H.



A kernel perspective
Second purpose: unhappy with the current Euclidean structure?
lift data to a higher-dimensional space with nicer properties (e.g., linear separability, clustering structure).
then, the linear form f(x) = ⟨Φ(x), f⟩_H in H may correspond to a non-linear model in X.

[Figure: lifting data (x1, x2) in R² to (x1², x2²) can make a non-linear pattern linearly separable.]


A kernel perspective
Recipe
Map data x to a high-dimensional space Φ(x) in H (RKHS), with Hilbertian geometry (projections, barycenters, angles, . . . exist!).
Predictive models f in H are linear forms in H: f(x) = ⟨f, Φ(x)⟩_H.
Learning with a positive definite kernel K(x, x′) = ⟨Φ(x), Φ(x′)⟩_H.

What is the relation with deep neural networks?
It is possible to design an RKHS H where a large class of deep neural networks live [Mairal, 2016]:

f(x) = σk(Wk σk−1(Wk−1 · · · σ2(W2 σ1(W1 x)) · · ·)) = ⟨f, Φ(x)⟩_H.

This is the construction of "convolutional kernel networks".

Why do we care?
Φ(x) is related to the network architecture and is independent of training data. Is it stable? Does it lose signal information?
f is a predictive model. Can we control its stability?

|f(x) − f(x′)| ≤ ‖f‖_H ‖Φ(x) − Φ(x′)‖_H.

‖f‖_H controls both stability and generalization!

[Schölkopf and Smola, 2002, Shawe-Taylor and Cristianini, 2004]...
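To make the recipe concrete, a small kernel ridge regression sketch (the Gaussian kernel and synthetic data are illustrative choices, not the kernels of this talk): by the representer theorem, f = ∑ᵢ αᵢ Φ(xᵢ), so f(x) = ∑ᵢ αᵢ K(xᵢ, x) and ‖f‖²_H = αᵀKα is available in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 50, 0.1
X = rng.uniform(-1, 1, size=(n, 2))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=n)

def K(A, B, bw=0.5):
    # Gaussian kernel matrix: K(a, b) = exp(-||a - b||^2 / (2 bw^2)).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw**2))

# Kernel ridge regression: alpha = (K + n * lam * I)^{-1} y.
G = K(X, X)
alpha = np.linalg.solve(G + n * lam * np.eye(n), y)

f_norm2 = alpha @ G @ alpha   # ||f||_H^2: the quantity controlling stability
x_new = np.array([[0.2, -0.4]])
print(K(x_new, X) @ alpha, f_norm2)   # prediction f(x_new) and ||f||_H^2
```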



A signal processing perspective
plus a bit of harmonic analysis

Consider images defined on a continuous domain Ω = R^d.
τ : Ω → Ω: C¹-diffeomorphism.
Lτ x(u) = x(u − τ(u)): action operator.
Much richer group of transformations than translations.

[Figure: patterns are translated and deformed.]

Relation with deep convolutional representations
Stability to deformations was studied for the wavelet-based scattering transform.

Definition of stability
Representation Φ(·) is stable [Mallat, 2012] if

‖Φ(Lτ x) − Φ(x)‖ ≤ (C₁ ‖∇τ‖∞ + C₂ ‖τ‖∞) ‖x‖.

‖∇τ‖∞ = sup_u ‖∇τ(u)‖ controls deformation.
‖τ‖∞ = sup_u |τ(u)| controls translation.
C₂ → 0: translation invariance.

[Mallat, 2012, Bruna and Mallat, 2013, Sifre and Mallat, 2013, Allassonnière, Amit, and Trouvé, 2007, Trouvé and Younes, 2005]...
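A toy numeric sketch of the action operator and of the two sup-norms appearing in the bound (1D signal, linear interpolation; the signal and τ are arbitrary illustrative choices):

```python
import numpy as np

# (L_tau x)(u) = x(u - tau(u)) for a sampled 1D signal, via interpolation.
u = np.linspace(0.0, 1.0, 512)
x = np.sin(2 * np.pi * 5 * u)            # input signal
tau = 0.02 * np.sin(2 * np.pi * u)       # smooth (C^1) deformation field

L_tau_x = np.interp(u - tau, u, x)       # x(u - tau(u))

grad_tau_inf = np.abs(np.gradient(tau, u)).max()  # ||grad tau||_inf: deformation
tau_inf = np.abs(tau).max()                       # ||tau||_inf: translation
print(grad_tau_inf, tau_inf)
```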





Summary of our results
Multi-layer construction of the RKHS H
Contains CNNs with smooth homogeneous activation functions.

Signal representation
Signal preservation of the multi-layer kernel mapping Φ.
Conditions of non-trivial stability for Φ.
Constructions to achieve group invariance.

On learning
Bounds on the RKHS norm ‖·‖_H to control stability and generalization of a predictive model f:

|f(x) − f(x′)| ≤ ‖f‖_H ‖Φ(x) − Φ(x′)‖_H.



Outline

1 Construction of the multi-layer convolutional representation

2 Invariance and stability

3 Learning aspects: model complexity





A generic deep convolutional representation
Initial map x0 in L²(Ω, H0)
x0 : Ω → H0: continuous input signal.
u ∈ Ω = R^d: location (d = 2 for images).
x0(u) ∈ H0: input value at location u (H0 = R³ for RGB images).

Building map xk in L²(Ω, Hk) from xk−1 in L²(Ω, Hk−1)
xk : Ω → Hk: feature map at layer k,

xk = Ak Mk Pk xk−1.

Pk: patch extraction operator, extracts a small patch of the feature map xk−1 around each point u (Pk xk−1(u) is a patch centered at u).
Mk: non-linear mapping operator, maps each patch to a new Hilbert space Hk with a pointwise non-linear function ϕk(·).
Ak: (linear) pooling operator at scale σk.
A generic deep convolutional representation

[Figure: one layer of the construction, read from bottom to top]
xk−1 : Ω → Hk−1, with values xk−1(u) ∈ Hk−1
→ patch extraction: Pk xk−1(v) ∈ Pk
→ kernel mapping: Mk Pk xk−1(v) = ϕk(Pk xk−1(v)) ∈ Hk, giving Mk Pk xk−1 : Ω → Hk
→ linear pooling: xk(w) = Ak Mk Pk xk−1(w) ∈ Hk, giving xk := Ak Mk Pk xk−1 : Ω → Hk


Patch extraction operator Pk

Pk xk−1(u) := (v ∈ Sk ↦ xk−1(u + v)) ∈ Pk = (Hk−1)^{Sk}.

Sk: patch shape, e.g., a box.
Pk is linear, and preserves the norm: ‖Pk xk−1‖ = ‖xk−1‖.
Norm of a map: ‖x‖² = ∫_Ω ‖x(u)‖² du < ∞ for x in L²(Ω, H).
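A discrete 1D sketch of Pk (the patch shape Sk = {−1, 0, 1} and zero padding are illustrative choices):

```python
import numpy as np

def extract_patches(x, S=(-1, 0, 1)):
    # P x(u) = (x(u + v))_{v in S}, with zero padding at the boundary.
    n, left, right = len(x), max(-min(S), 0), max(max(S), 0)
    xp = np.pad(x, (left, right))
    return np.stack([xp[left + v : left + v + n] for v in S], axis=1)

x = np.arange(5.0)
print(extract_patches(x))   # shape (5, 3): one patch per location u
```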





Non-linear pointwise mapping operator Mk

Mk Pk xk−1(u) := ϕk(Pk xk−1(u)) ∈ Hk.

ϕk : Pk → Hk: pointwise non-linearity on patches.

We assume non-expansivity:

‖ϕk(z)‖ ≤ ‖z‖   and   ‖ϕk(z) − ϕk(z′)‖ ≤ ‖z − z′‖.

Mk then satisfies, for x, x′ ∈ L²(Ω, Pk),

‖Mk x‖ ≤ ‖x‖   and   ‖Mk x − Mk x′‖ ≤ ‖x − x′‖.





ϕk from kernels
Kernel mapping of homogeneous dot-product kernels:

Kk(z, z′) = ‖z‖ ‖z′‖ κk(⟨z, z′⟩ / (‖z‖ ‖z′‖)) = ⟨ϕk(z), ϕk(z′)⟩.

κk(u) = ∑_{j=0}^∞ bj u^j with bj ≥ 0, κk(1) = 1.
‖ϕk(z)‖ = Kk(z, z)^{1/2} = ‖z‖ (norm preservation).
‖ϕk(z) − ϕk(z′)‖ ≤ ‖z − z′‖ if κ′k(1) ≤ 1 (non-expansiveness).

Examples
κexp(⟨z, z′⟩) = e^{⟨z,z′⟩−1} = e^{−(1/2)‖z−z′‖²} (if ‖z‖ = ‖z′‖ = 1).
κinv-poly(⟨z, z′⟩) = 1/(2 − ⟨z, z′⟩).

[Schoenberg, 1942, Scholkopf, 1997, Smola et al., 2001, Cho and Saul, 2010, Zhang et al., 2016, 2017, Daniely et al., 2016, Bach, 2017, Mairal, 2016]...
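A direct sketch of such a kernel with κ = κexp (note Kk(z, z) = ‖z‖² κexp(1) = ‖z‖², i.e., norm preservation):

```python
import numpy as np

def homogeneous_kernel(z, zp, kappa=lambda u: np.exp(u - 1.0)):
    # K(z, z') = ||z|| ||z'|| kappa(<z, z'> / (||z|| ||z'||)), kappa = kappa_exp.
    nz, nzp = np.linalg.norm(z), np.linalg.norm(zp)
    if nz == 0.0 or nzp == 0.0:
        return 0.0
    return nz * nzp * kappa(z @ zp / (nz * nzp))

z, zp = np.array([1.0, 2.0]), np.array([2.0, 1.0])
print(homogeneous_kernel(z, z))    # equals ||z||^2 = 5: norm preservation
print(homogeneous_kernel(z, zp))
```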



Pooling operator Ak

xk(u) = Ak Mk Pk xk−1(u) = ∫_{R^d} hσk(u − v) Mk Pk xk−1(v) dv ∈ Hk.

hσk: pooling filter at scale σk.
hσk(u) := σk^{−d} h(u/σk) with h(u) Gaussian.
Ak is a linear, non-expansive operator: ‖Ak‖ ≤ 1 (operator norm).
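A discrete 1D sketch of Ak (truncating the Gaussian at 4σ is an implementation choice; the L1 normalization of the filter is what makes the operator non-expansive):

```python
import numpy as np

def gaussian_pool(x, sigma):
    # A x(u) = sum_v h_sigma(u - v) x(v), with h_sigma a normalized Gaussian.
    v = np.arange(-int(4 * sigma), int(4 * sigma) + 1)
    h = np.exp(-v.astype(float) ** 2 / (2 * sigma**2))
    h /= h.sum()                      # ||h||_1 = 1  =>  ||A|| <= 1
    return np.convolve(x, h, mode="same")

x = np.sin(np.linspace(0, 6 * np.pi, 200))
print(np.linalg.norm(gaussian_pool(x, sigma=3.0)) <= np.linalg.norm(x))  # True
```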



Recap: Pk, Mk, Ak

[Figure: same one-layer diagram]
xk−1 : Ω → Hk−1 → (patch extraction) Pk xk−1(v) ∈ Pk → (kernel mapping) Mk Pk xk−1(v) = ϕk(Pk xk−1(v)) → (linear pooling) xk(w) = Ak Mk Pk xk−1(w), with xk := Ak Mk Pk xk−1 : Ω → Hk.





Multilayer construction
Assumption on x0
x0 is typically a discrete signal acquired with a physical device.
Natural assumption: x0 = A0 x, with x the original continuous signal and A0 a local integrator at scale σ0 (anti-aliasing).

Multilayer representation

Φn(x) = An Mn Pn An−1 Mn−1 Pn−1 · · · A1 M1 P1 x0 ∈ L²(Ω, Hn).

Sk, σk grow exponentially in practice (i.e., fixed with subsampling).

Prediction layer
e.g., linear f(x) = ⟨w, Φn(x)⟩.
"linear kernel" K(x, x′) = ⟨Φn(x), Φn(x′)⟩ = ∫_Ω ⟨xn(u), x′n(u)⟩ du.



Outline

1 Construction of the multi-layer convolutional representation

2 Invariance and stability

3 Learning aspects: model complexity





Invariance, definitions
τ : Ω → Ω: C¹-diffeomorphism with Ω = R^d.
Lτ x(u) = x(u − τ(u)): action operator.
Much richer group of transformations than translations.

Definition of stability
Representation Φ(·) is stable [Mallat, 2012] if

‖Φ(Lτ x) − Φ(x)‖ ≤ (C₁ ‖∇τ‖∞ + C₂ ‖τ‖∞) ‖x‖.

‖∇τ‖∞ = sup_u ‖∇τ(u)‖ controls deformation.
‖τ‖∞ = sup_u |τ(u)| controls translation.
C₂ → 0: translation invariance.

[Mallat, 2012, Bruna and Mallat, 2013, Sifre and Mallat, 2013]...





Warmup: translation invariance
Representation

Φn(x) = An Mn Pn An−1 Mn−1 Pn−1 · · · A1 M1 P1 A0 x.

How to achieve translation invariance?

Translation: Lc x(u) = x(u − c).
Equivariance: all operators commute with Lc (e.g., Ak Lc = Lc Ak).

‖Φn(Lc x) − Φn(x)‖ = ‖Lc Φn(x) − Φn(x)‖
                   ≤ ‖Lc An − An‖ · ‖Mn Pn Φn−1(x)‖
                   ≤ ‖Lc An − An‖ ‖x‖.

Mallat [2012]: ‖Lτ An − An‖ ≤ (C₂/σn) ‖τ‖∞ (operator norm), i.e., ‖Lc An − An‖ ≤ (C₂/σn) c for a translation by c.
The scale σn of the last layer controls translation invariance.





Stability to deformations
Representation

Φn(x) = An Mn Pn An−1 Mn−1 Pn−1 · · · A1 M1 P1 A0 x.

How to achieve stability to deformations?

Patch extraction Pk and pooling Ak do not commute with Lτ!
‖[Ak, Lτ]‖ ≤ C₁ ‖∇τ‖∞ [from Mallat, 2012].
But: [Pk, Lτ] is unstable at high frequencies!
Adapt to the current layer resolution, with patch size controlled by σk−1:

‖[Pk Ak−1, Lτ]‖ ≤ C1,κ ‖∇τ‖∞,   where sup_{u∈Sk} |u| ≤ κ σk−1.

C1,κ grows as κ^{d+1} ⟹ more stable with small patches (e.g., 3×3, VGG et al.).




Stability to deformations: final result
Theorem
If ‖∇τ‖∞ ≤ 1/2,

‖Φn(Lτ x) − Φn(x)‖ ≤ (∏_k ρk) (C1,κ (n + 1) ‖∇τ‖∞ + (C₂/σn) ‖τ‖∞) ‖x‖.

translation invariance: large σn.
stability: small patch sizes.
signal preservation: subsampling factor ≈ patch size.
⟹ needs several layers.
requires additional discussion to make stability non-trivial.
(also valid for generic CNNs with ReLUs: multiply by ∏k ρk = ∏k ‖Wk‖, but no signal preservation).

related work on stability [Wiatowski and Bölcskei, 2017]




Beyond the translation group
Can we achieve invariance to other groups?
Group action: Lg x(u) = x(g⁻¹u) (e.g., rotations, reflections).
Feature maps x(u) defined on u ∈ G (G: locally compact group).

Recipe: equivariant inner layers + global pooling in the last layer

Patch extraction: P x(u) = (x(uv))_{v∈S}.
Non-linear mapping: equivariant because pointwise!
Pooling (µ: left-invariant Haar measure):

Ax(u) = ∫_G x(uv) h(v) dµ(v) = ∫_G x(v) h(u⁻¹v) dµ(v).

related work [Sifre and Mallat, 2013, Cohen and Welling, 2016, Raj et al., 2016]...
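A toy numeric check of this recipe on the cyclic group G = Z_N, a deliberately simple stand-in (Haar measure = uniform, group operation = addition mod N): pooling commutes with the group action, and global pooling (constant h) is exactly invariant.

```python
import numpy as np

N = 8
rng = np.random.default_rng(0)
x = rng.normal(size=N)    # feature map on G = Z_N
h = rng.random(size=N)    # pooling filter on G

def pool(x, h):
    # A x(u) = (1/N) sum_v x(u v) h(v), with u v = (u + v) mod N here.
    return np.array([(np.roll(x, -u) * h).mean() for u in range(N)])

g = 3
Lg_x = np.roll(x, g)      # group action L_g x(u) = x(g^{-1} u) = x(u - g)
print(np.allclose(pool(Lg_x, h), np.roll(pool(x, h), g)))        # equivariance
print(np.allclose(pool(Lg_x, np.ones(N)), pool(x, np.ones(N))))  # invariance
```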



Group invariance and stability
Previous construction is similar to Cohen and Welling [2016] for CNNs.

A case of interest: the roto-translation group
G = R² ⋊ SO(2) (mix of translations and rotations).
Stability with respect to the translation group.
Global invariance to rotations (global pooling only at the final layer):
Inner layers: only pool on the translation group.
Last layer: global pooling on rotations.
Cohen and Welling [2016]: pooling on rotations in inner layers hurts performance on Rotated MNIST.




Discretization and signal preservation: example in 1D
Discrete signal x̄k in ℓ²(Z, H̄k) vs. continuous ones xk in L²(R, Hk).
x̄k: subsampling factor sk after pooling at scale σk ≈ sk:

x̄k[n] = Āk M̄k P̄k x̄k−1[n sk].

Claim: we can recover x̄k−1 from x̄k if the factor sk ≤ patch size.

How? Recover patches with linear functions (contained in H̄k):

⟨fw, M̄k P̄k x̄k−1(u)⟩ = fw(P̄k x̄k−1(u)) = ⟨w, P̄k x̄k−1(u)⟩,

and

P̄k x̄k−1(u) = ∑_{w∈B} ⟨fw, M̄k P̄k x̄k−1(u)⟩ w.

Warning: no claim that recovery is practical and/or stable.



Discretization and signal preservation: example in 1D

[Figure: recovery pipeline]
Forward (bottom-up): x̄k−1 → (patch extraction) P̄k x̄k−1(u) ∈ Pk → (dot-product kernel) M̄k P̄k x̄k−1 → (linear pooling) Āk M̄k P̄k x̄k−1 → (downsampling) x̄k.
Backward (recovery): x̄k → (recovery with linear measurements) Āk x̄k−1 → (deconvolution) x̄k−1.


Outline

1 Construction of the multi-layer convolutional representation

2 Invariance and stability

3 Learning aspects: model complexity



RKHS of patch kernels Kk

Kk(z, z′) = ‖z‖ ‖z′‖ κ(⟨z, z′⟩ / (‖z‖ ‖z′‖)),   κ(u) = ∑_{j=0}^∞ bj u^j.

What does the RKHS contain?

The RKHS contains homogeneous functions:

f : z ↦ ‖z‖ σ(⟨g, z⟩/‖z‖).

Smooth activations: σ(u) = ∑_{j=0}^∞ aj u^j with aj ≥ 0.
Norm: ‖f‖²_{Hk} ≤ C²σ(‖g‖²) = ∑_{j=0}^∞ (aj²/bj) ‖g‖^{2j} < ∞.

Homogeneous version of [Zhang et al., 2016, 2017].


RKHS of patch kernels Kk
Examples:
σ(u) = u (linear): C²σ(λ²) = O(λ²).
σ(u) = u^p (polynomial): C²σ(λ²) = O(λ^{2p}).
σ ≈ sin, sigmoid, smooth ReLU: C²σ(λ²) = O(e^{cλ²}).

[Figure: left, activations σ(x) for ReLU vs. a smooth ReLU (sReLU); right, f : x ↦ |x| σ(wx/|x|) for ReLU (w = 1) and sReLU with w ∈ {0, 0.5, 1, 2}.]
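A small sketch evaluating these homogeneous functions in 1D; softplus is used as the smooth-ReLU surrogate, which is an assumption (the talk's sReLU may be defined differently):

```python
import numpy as np

def f_homog(x, w, sigma):
    # f(x) = |x| * sigma(w x / |x|); homogeneous by construction.
    a = np.abs(x)
    return a * sigma(w * x / np.maximum(a, 1e-12))

softplus = lambda u: np.log1p(np.exp(u))
relu = lambda u: np.maximum(u, 0.0)

x = np.linspace(-2, 2, 9)
print(f_homog(x, 1.0, relu))      # recovers max(x, 0): the exact ReLU
print(f_homog(x, 1.0, softplus))  # smooth homogeneous counterpart
```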





Constructing a CNN in the RKHS HK
Some CNNs live in the RKHS: "linearization" principle

f(x) = σk(Wk σk−1(Wk−1 · · · σ2(W2 σ1(W1 x)) · · ·)) = ⟨f, Φ(x)⟩_H.

Consider a CNN with filters Wk^{ij}(u), u ∈ Sk:
k: layer; i: index of filter; j: index of input channel.
"Smooth homogeneous" activations σ.
The CNN can be constructed hierarchically in HK.
Norm:

‖fσ‖² ≤ ‖Wn+1‖₂² · ‖Wn‖₂² · ‖Wn−1‖₂² · · · ‖W1‖₂².

Linear layers: product of spectral norms.
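For fully-connected layers the factors are squared spectral norms, i.e., largest singular values; a quick sketch with random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = [rng.normal(size=(64, 128)) / np.sqrt(128),
           rng.normal(size=(10, 64)) / np.sqrt(64)]

spectral = [np.linalg.norm(W, ord=2) for W in weights]  # sigma_max(W_k)
print(spectral, np.prod([s**2 for s in spectral]))      # product bounding ||f||^2
```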





Link with generalization
Direct application of classical generalization bounds
Simple bound on the Rademacher complexity of linear/kernel methods:

FB = {f ∈ HK, ‖f‖ ≤ B}   ⟹   Rad_N(FB) ≤ O(BR/√N).

Leads to a margin bound O(‖f̂N‖ R/(γ√N)) for a learned CNN f̂N with margin (confidence) γ > 0.
Related to recent generalization bounds for neural networks based on products of spectral norms [e.g., Bartlett et al., 2017, Neyshabur et al., 2018].

[see, e.g., Boucheron et al., 2005, Shalev-Shwartz and Ben-David, 2014]...





Deep convolutional representations: conclusions
Study of generic properties of signal representation
Deformation stability with small patches, adapted to resolution.
Signal preservation when subsampling ≤ patch size.
Group invariance by changing patch extraction and pooling.

Applies to learned models
The same quantity ‖f‖ controls stability and generalization.
"Higher capacity" is needed to discriminate small deformations.

Questions:
Better regularization?
How does SGD control capacity in CNNs?
What about networks with no pooling layers? ResNet?



References I
Stéphanie Allassonnière, Yali Amit, and Alain Trouvé. Towards a coherent
statistical framework for dense deformable template estimation. Journal of
the Royal Statistical Society: Series B (Statistical Methodology), 69(1):
3–29, 2007.
Francis Bach. On the equivalence between kernel quadrature rules and random
feature expansions. Journal of Machine Learning Research (JMLR), 18:1–38,
2017.
Peter Bartlett, Dylan J Foster, and Matus Telgarsky. Spectrally-normalized
margin bounds for neural networks. arXiv preprint arXiv:1706.08498, 2017.
Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of
classification: A survey of some recent advances. ESAIM: probability and
statistics, 9:323–375, 2005.
Joan Bruna and Stéphane Mallat. Invariant scattering convolution networks.
IEEE Transactions on pattern analysis and machine intelligence (PAMI), 35
(8):1872–1886, 2013.



References II
Y. Cho and L. K. Saul. Large-margin classification in infinite neural networks.
Neural Computation, 22(10):2678–2697, 2010.
Taco Cohen and Max Welling. Group equivariant convolutional networks. In
International Conference on Machine Learning (ICML), 2016.
Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of
neural networks: The power of initialization and a dual view on expressivity.
In Advances In Neural Information Processing Systems, pages 2253–2261,
2016.
Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the
physical world. arXiv preprint arXiv:1607.02533, 2016.
J. Mairal. End-to-end kernel learning with supervised convolutional kernel
networks. In Advances in Neural Information Processing Systems (NIPS),
2016.
Stéphane Mallat. Group invariant scattering. Communications on Pure and
Applied Mathematics, 65(10):1331–1398, 2012.



References III
Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan
Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds
for neural networks. In Proceedings of the International Conference on
Learning Representations (ICLR), 2018.
Anant Raj, Abhishek Kumar, Youssef Mroueh, P Thomas Fletcher, and
Bernhard Scholkopf. Local group invariant representations via orbit
embeddings. preprint arXiv:1612.01988, 2016.
I. Schoenberg. Positive definite functions on spheres. Duke Math. J., 1942.
B. Scholkopf. Support Vector Learning. PhD thesis, Technischen Universität
Berlin, 1997.
Bernhard Schölkopf and Alexander J Smola. Learning with kernels: support
vector machines, regularization, optimization, and beyond. MIT press, 2002.
Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning:
From theory to algorithms. Cambridge university press, 2014.
John Shawe-Taylor and Nello Cristianini. An introduction to support vector
machines and other kernel-based learning methods. Cambridge University
Press, 2004.



References IV
Laurent Sifre and Stéphane Mallat. Rotation, scaling and deformation invariant
scattering for texture discrimination. In Proceedings of the IEEE conference
on computer vision and pattern recognition (CVPR), 2013.
Alex J Smola and Bernhard Schölkopf. Sparse greedy matrix approximation for
machine learning. In Proceedings of the International Conference on
Machine Learning (ICML), 2000.
Alex J Smola, Zoltan L Ovari, and Robert C Williamson. Regularization with
dot-product kernels. In Advances in neural information processing systems,
pages 308–314, 2001.
Alain Trouvé and Laurent Younes. Local geometry of deformable templates.
SIAM journal on mathematical analysis, 37(1):17–59, 2005.
Thomas Wiatowski and Helmut Bölcskei. A mathematical theory of deep
convolutional neural networks for feature extraction. IEEE Transactions on
Information Theory, 2017.
C. Williams and M. Seeger. Using the Nyström method to speed up kernel
machines. In Advances in Neural Information Processing Systems (NIPS),
2001.



References V
Kai Zhang, Ivor W Tsang, and James T Kwok. Improved Nyström low-rank
approximation and error analysis. In International Conference on Machine
Learning (ICML), 2008.
Y. Zhang, P. Liang, and M. J. Wainwright. Convexified convolutional neural
networks. In International Conference on Machine Learning (ICML), 2017.
Yuchen Zhang, Jason D Lee, and Michael I Jordan. ℓ1-regularized neural
networks are improperly learnable in polynomial time. In International
Conference on Machine Learning (ICML), 2016.





ϕk from kernel approximations: CKNs [Mairal, 2016]
Approximate ϕk(z) by projection (Nyström approximation) onto

F = Span(ϕk(z1), . . . , ϕk(zp)).

[Figure: Nyström approximation, with ϕ(x) and ϕ(x′) in the Hilbert space H projected onto the subspace F.]

Leads to a tractable, p-dimensional representation ψk(z).
Norm is preserved, and the projection is non-expansive:

‖ψk(z) − ψk(z′)‖ = ‖Πk ϕk(z) − Πk ϕk(z′)‖ ≤ ‖ϕk(z) − ϕk(z′)‖ ≤ ‖z − z′‖.

Anchor points z1, . . . , zp (≈ filters) can be learned from data (K-means or backprop).

[Williams and Seeger, 2001, Smola and Schölkopf, 2000, Zhang et al., 2008]...
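A minimal Nyström sketch; the Gaussian kernel here is an illustrative stand-in (CKNs use the homogeneous dot-product kernels above). Taking ψ(z) = Kpp^{−1/2} (K(zi, z))i gives ⟨ψ(z), ψ(z′)⟩ = K(z, Z) Kpp^{−1} K(Z, z′), the Nyström approximation of K(z, z′):

```python
import numpy as np

rng = np.random.default_rng(0)
p, d = 10, 5
Z = rng.normal(size=(p, d))   # anchor points z_1, ..., z_p (~ filters in CKNs)

def K(A, B, bw=1.0):
    # Gaussian kernel matrix between rows of A and rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw**2))

# Inverse square root of K_pp via eigendecomposition (K_pp is PSD).
w, V = np.linalg.eigh(K(Z, Z))
inv_sqrt = V @ np.diag(1.0 / np.sqrt(np.maximum(w, 1e-10))) @ V.T

def psi(X):
    return K(X, Z) @ inv_sqrt  # tractable p-dimensional features

X = rng.normal(size=(3, d))
print(psi(X) @ psi(X).T)       # approximates the exact kernel matrix:
print(K(X, X))
```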



ϕk from kernel approximations: CKNs [Mairal, 2016]
Convolutional kernel networks in practice.

[Figure: one CKN layer on an image; patches of the input map x0 (on grid I0) are mapped via the kernel trick to ϕ1(x) in the Hilbert space H1, projected onto the subspace F1 to obtain ψ1(x), then linearly pooled to form the next map (grid I1).]


Discussion
The norm ‖Φ(x)‖ is of the same order (or close enough) as ‖x‖.
The kernel representation is non-expansive but not contractive:

sup_{x,x′ ∈ L²(Ω,H0)} ‖Φ(x) − Φ(x′)‖ / ‖x − x′‖ = 1.



Future of Convolutional Neural Networks
What are current high-potential problems to solve?
1 lack of robustness (see next slide).
2 learning with little labeled data.
3 learning with no supervision (see Tab. from Bojanowski and Joulin, 2017).



Future of Convolutional Neural Networks
Illustration of instability. Picture from Kurakin et al. [2016].

Figure: adversarial examples are generated on a computer, then printed on paper; a new picture taken with a smartphone still fools the classifier.



Future of Convolutional Neural Networks
min_{f∈F}  (1/n) ∑_{i=1}^n L(yi, f(xi)) + λ Ω(f),

where the sum is the empirical risk (data fit) and λΩ(f) is the regularization.

The issue of regularization
Today, heuristics are used (DropOut, weight decay, early stopping)...
...but they are not sufficient.
How to control variations of prediction functions?

|f(x) − f(x′)| should be small if x and x′ are "similar".

What does it mean for x and x′ to be "similar"?
What should be a good regularization function Ω?

