
Invariance and Stability to Deformations of Deep Convolutional Representations

Julien Mairal
Inria Grenoble

ML and AI workshop, Telecom ParisTech, 2018



This is mostly the work of Alberto Bietti

A. Bietti and J. Mairal. Group Invariance, Stability to Deformations, and Complexity of Deep Convolutional Representations. arXiv:1706.03078, 2018.

A. Bietti and J. Mairal. Invariance and Stability of Deep Convolutional Representations. NIPS, 2017.



Learning a predictive model
The goal is to learn a prediction function f : R^p → R given labeled
training data (xi, yi)i=1,...,n with xi in R^p and yi in R:

min_{f∈F}  (1/n) ∑_{i=1}^n L(yi, f(xi)) + λ Ω(f),

where the sum is the empirical risk (data fit) and λΩ(f) is the regularization.
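As a concrete instance, here is a minimal sketch, assuming a linear model f(x) = ⟨w, x⟩, squared loss, and Ω(f) = ‖w‖²; under these assumptions the problem is ridge regression with a closed-form solution (the data below is synthetic):

```python
import numpy as np

# Synthetic labeled data (x_i, y_i), x_i in R^p, y_i in R.
rng = np.random.default_rng(0)
n, p, lam = 100, 5, 0.1
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

# With L(y, f(x)) = (y - f(x))^2 and Omega(f) = ||w||^2 for f(x) = <w, x>,
# setting the gradient to zero gives (X^T X / n + lam * I) w = X^T y / n.
w = np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

empirical_risk = np.mean((y - X @ w) ** 2)
print(empirical_risk, empirical_risk + lam * w @ w)  # data fit, full objective
```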



Objectives
Deep convolutional signal representations
Are they stable to deformations?
How can we achieve invariance to transformation groups?
Do they preserve signal information?

Learning aspects
Building a functional space for CNNs (or similar objects); this still involves the ERM problem.
Deriving a measure of model complexity.
A kernel perspective
min_{f∈H}  (1/n) ∑_{i=1}^n L(yi, f(xi)) + λ ‖f‖²_H.

Map data to a Hilbert space (RKHS) and work with linear forms:

Φ : X → H   and   f(x) = ⟨Φ(x), f⟩_H.

[Figure: the kernel mapping ϕ sends points x of the input space X to elements ϕ(x) of the Hilbert space H.]


A kernel perspective
min_{f∈H}  (1/n) ∑_{i=1}^n L(yi, f(xi)) + λ ‖f‖²_H.

Main purpose: embed data in a vectorial space where
many geometrical operations exist (angle computation, projection on linear subspaces, definition of barycenters, ...).
one may learn potentially rich infinite-dimensional models.
regularization is natural:

|f(x) − f(x′)| ≤ ‖f‖_H ‖Φ(x) − Φ(x′)‖_H.



A kernel perspective
Second purpose: unhappy with the current Euclidean structure?
lift data to a higher-dimensional space with nicer properties (e.g., linear separability, clustering structure).
then, the linear form f(x) = ⟨Φ(x), f⟩_H in H may correspond to a non-linear model in X.

[Figure: lifting data (x1, x2) in R² to (x1², x2²) can make a non-linear pattern linearly separable.]


A kernel perspective
Recipe
Map data x to a high-dimensional space Φ(x) in H (RKHS), with Hilbertian geometry (projections, barycenters, angles, . . . exist!).
Predictive models f in H are linear forms in H: f(x) = ⟨f, Φ(x)⟩_H.
Learning with a positive definite kernel K(x, x′) = ⟨Φ(x), Φ(x′)⟩_H.

What is the relation with deep neural networks?
It is possible to design an RKHS H where a large class of deep neural networks live [Mairal, 2016]:

f(x) = σk(Wk σk−1(Wk−1 · · · σ2(W2 σ1(W1 x)) · · ·)) = ⟨f, Φ(x)⟩_H.

This is the construction of "convolutional kernel networks".

Why do we care?
Φ(x) is related to the network architecture and is independent of training data. Is it stable? Does it lose signal information?
f is a predictive model. Can we control its stability?

|f(x) − f(x′)| ≤ ‖f‖_H ‖Φ(x) − Φ(x′)‖_H.

‖f‖_H controls both stability and generalization!

[Schölkopf and Smola, 2002, Shawe-Taylor and Cristianini, 2004]...
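To make the recipe concrete, a small kernel ridge regression sketch (the Gaussian kernel and synthetic data are illustrative choices, not the kernels of this talk): by the representer theorem, f = ∑ᵢ αᵢ Φ(xᵢ), so f(x) = ∑ᵢ αᵢ K(xᵢ, x) and ‖f‖²_H = αᵀKα is available in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 50, 0.1
X = rng.uniform(-1, 1, size=(n, 2))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=n)

def K(A, B, bw=0.5):
    # Gaussian kernel matrix: K(a, b) = exp(-||a - b||^2 / (2 bw^2)).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw**2))

# Kernel ridge regression: alpha = (K + n * lam * I)^{-1} y.
G = K(X, X)
alpha = np.linalg.solve(G + n * lam * np.eye(n), y)

f_norm2 = alpha @ G @ alpha   # ||f||_H^2: the quantity controlling stability
x_new = np.array([[0.2, -0.4]])
print(K(x_new, X) @ alpha, f_norm2)   # prediction f(x_new) and ||f||_H^2
```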



A signal processing perspective
plus a bit of harmonic analysis

Consider images defined on a continuous domain Ω = R^d.
τ : Ω → Ω: C¹-diffeomorphism.
Lτ x(u) = x(u − τ(u)): action operator.
Much richer group of transformations than translations.

[Figure: patterns are translated and deformed.]

Relation with deep convolutional representations
Stability to deformations was studied for the wavelet-based scattering transform.

Definition of stability
Representation Φ(·) is stable [Mallat, 2012] if

‖Φ(Lτ x) − Φ(x)‖ ≤ (C₁ ‖∇τ‖∞ + C₂ ‖τ‖∞) ‖x‖.

‖∇τ‖∞ = sup_u ‖∇τ(u)‖ controls deformation.
‖τ‖∞ = sup_u |τ(u)| controls translation.
C₂ → 0: translation invariance.

[Mallat, 2012, Bruna and Mallat, 2013, Sifre and Mallat, 2013, Allassonnière, Amit, and Trouvé, 2007, Trouvé and Younes, 2005]...
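A toy numeric sketch of the action operator and of the two sup-norms appearing in the bound (1D signal, linear interpolation; the signal and τ are arbitrary illustrative choices):

```python
import numpy as np

# (L_tau x)(u) = x(u - tau(u)) for a sampled 1D signal, via interpolation.
u = np.linspace(0.0, 1.0, 512)
x = np.sin(2 * np.pi * 5 * u)            # input signal
tau = 0.02 * np.sin(2 * np.pi * u)       # smooth (C^1) deformation field

L_tau_x = np.interp(u - tau, u, x)       # x(u - tau(u))

grad_tau_inf = np.abs(np.gradient(tau, u)).max()  # ||grad tau||_inf: deformation
tau_inf = np.abs(tau).max()                       # ||tau||_inf: translation
print(grad_tau_inf, tau_inf)
```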





Summary of our results
Multi-layer construction of the RKHS H
Contains CNNs with smooth homogeneous activation functions.

Signal representation
Signal preservation of the multi-layer kernel mapping Φ.
Conditions of non-trivial stability for Φ.
Constructions to achieve group invariance.

On learning
Bounds on the RKHS norm ‖·‖_H to control stability and generalization of a predictive model f:

|f(x) − f(x′)| ≤ ‖f‖_H ‖Φ(x) − Φ(x′)‖_H.



Outline

1 Construction of the multi-layer convolutional representation

2 Invariance and stability

3 Learning aspects: model complexity





A generic deep convolutional representation
Initial map x0 in L²(Ω, H0)
x0 : Ω → H0: continuous input signal.
u ∈ Ω = R^d: location (d = 2 for images).
x0(u) ∈ H0: input value at location u (H0 = R³ for RGB images).

Building map xk in L²(Ω, Hk) from xk−1 in L²(Ω, Hk−1)
xk : Ω → Hk: feature map at layer k,

xk = Ak Mk Pk xk−1.

Pk: patch extraction operator, extracts a small patch of the feature map xk−1 around each point u (Pk xk−1(u) is a patch centered at u).
Mk: non-linear mapping operator, maps each patch to a new Hilbert space Hk with a pointwise non-linear function ϕk(·).
Ak: (linear) pooling operator at scale σk.
A generic deep convolutional representation

[Figure: one layer of the construction, read from bottom to top]
xk−1 : Ω → Hk−1, with values xk−1(u) ∈ Hk−1
→ patch extraction: Pk xk−1(v) ∈ Pk
→ kernel mapping: Mk Pk xk−1(v) = ϕk(Pk xk−1(v)) ∈ Hk, giving Mk Pk xk−1 : Ω → Hk
→ linear pooling: xk(w) = Ak Mk Pk xk−1(w) ∈ Hk, giving xk := Ak Mk Pk xk−1 : Ω → Hk


Patch extraction operator Pk

Pk xk−1(u) := (v ∈ Sk ↦ xk−1(u + v)) ∈ Pk = (Hk−1)^{Sk}.

Sk: patch shape, e.g., a box.
Pk is linear, and preserves the norm: ‖Pk xk−1‖ = ‖xk−1‖.
Norm of a map: ‖x‖² = ∫_Ω ‖x(u)‖² du < ∞ for x in L²(Ω, H).
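A discrete 1D sketch of Pk (the patch shape Sk = {−1, 0, 1} and zero padding are illustrative choices):

```python
import numpy as np

def extract_patches(x, S=(-1, 0, 1)):
    # P x(u) = (x(u + v))_{v in S}, with zero padding at the boundary.
    n, left, right = len(x), max(-min(S), 0), max(max(S), 0)
    xp = np.pad(x, (left, right))
    return np.stack([xp[left + v : left + v + n] for v in S], axis=1)

x = np.arange(5.0)
print(extract_patches(x))   # shape (5, 3): one patch per location u
```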





Non-linear pointwise mapping operator Mk

Mk Pk xk−1(u) := ϕk(Pk xk−1(u)) ∈ Hk.

ϕk : Pk → Hk: pointwise non-linearity on patches.

We assume non-expansivity:

‖ϕk(z)‖ ≤ ‖z‖   and   ‖ϕk(z) − ϕk(z′)‖ ≤ ‖z − z′‖.

Mk then satisfies, for x, x′ ∈ L²(Ω, Pk),

‖Mk x‖ ≤ ‖x‖   and   ‖Mk x − Mk x′‖ ≤ ‖x − x′‖.





ϕk from kernels
Kernel mapping of homogeneous dot-product kernels:

Kk(z, z′) = ‖z‖ ‖z′‖ κk(⟨z, z′⟩ / (‖z‖ ‖z′‖)) = ⟨ϕk(z), ϕk(z′)⟩.

κk(u) = ∑_{j=0}^∞ bj u^j with bj ≥ 0, κk(1) = 1.
‖ϕk(z)‖ = Kk(z, z)^{1/2} = ‖z‖ (norm preservation).
‖ϕk(z) − ϕk(z′)‖ ≤ ‖z − z′‖ if κ′k(1) ≤ 1 (non-expansiveness).

Examples
κexp(⟨z, z′⟩) = e^{⟨z,z′⟩−1} = e^{−(1/2)‖z−z′‖²} (if ‖z‖ = ‖z′‖ = 1).
κinv-poly(⟨z, z′⟩) = 1/(2 − ⟨z, z′⟩).

[Schoenberg, 1942, Scholkopf, 1997, Smola et al., 2001, Cho and Saul, 2010, Zhang et al., 2016, 2017, Daniely et al., 2016, Bach, 2017, Mairal, 2016]...
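A direct sketch of such a kernel with κ = κexp (note Kk(z, z) = ‖z‖² κexp(1) = ‖z‖², i.e., norm preservation):

```python
import numpy as np

def homogeneous_kernel(z, zp, kappa=lambda u: np.exp(u - 1.0)):
    # K(z, z') = ||z|| ||z'|| kappa(<z, z'> / (||z|| ||z'||)), kappa = kappa_exp.
    nz, nzp = np.linalg.norm(z), np.linalg.norm(zp)
    if nz == 0.0 or nzp == 0.0:
        return 0.0
    return nz * nzp * kappa(z @ zp / (nz * nzp))

z, zp = np.array([1.0, 2.0]), np.array([2.0, 1.0])
print(homogeneous_kernel(z, z))    # equals ||z||^2 = 5: norm preservation
print(homogeneous_kernel(z, zp))
```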



Pooling operator Ak

xk(u) = Ak Mk Pk xk−1(u) = ∫_{R^d} hσk(u − v) Mk Pk xk−1(v) dv ∈ Hk.

hσk: pooling filter at scale σk.
hσk(u) := σk^{−d} h(u/σk) with h(u) Gaussian.
Ak is a linear, non-expansive operator: ‖Ak‖ ≤ 1 (operator norm).
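A discrete 1D sketch of Ak (truncating the Gaussian at 4σ is an implementation choice; the L1 normalization of the filter is what makes the operator non-expansive):

```python
import numpy as np

def gaussian_pool(x, sigma):
    # A x(u) = sum_v h_sigma(u - v) x(v), with h_sigma a normalized Gaussian.
    v = np.arange(-int(4 * sigma), int(4 * sigma) + 1)
    h = np.exp(-v.astype(float) ** 2 / (2 * sigma**2))
    h /= h.sum()                      # ||h||_1 = 1  =>  ||A|| <= 1
    return np.convolve(x, h, mode="same")

x = np.sin(np.linspace(0, 6 * np.pi, 200))
print(np.linalg.norm(gaussian_pool(x, sigma=3.0)) <= np.linalg.norm(x))  # True
```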



Recap: Pk, Mk, Ak

[Figure: same one-layer diagram]
xk−1 : Ω → Hk−1 → (patch extraction) Pk xk−1(v) ∈ Pk → (kernel mapping) Mk Pk xk−1(v) = ϕk(Pk xk−1(v)) → (linear pooling) xk(w) = Ak Mk Pk xk−1(w), with xk := Ak Mk Pk xk−1 : Ω → Hk.





Multilayer construction
Assumption on x0
x0 is typically a discrete signal acquired with a physical device.
Natural assumption: x0 = A0 x, with x the original continuous signal and A0 a local integrator at scale σ0 (anti-aliasing).

Multilayer representation

Φn(x) = An Mn Pn An−1 Mn−1 Pn−1 · · · A1 M1 P1 x0 ∈ L²(Ω, Hn).

Sk, σk grow exponentially in practice (i.e., fixed with subsampling).

Prediction layer
e.g., linear f(x) = ⟨w, Φn(x)⟩.
"linear kernel" K(x, x′) = ⟨Φn(x), Φn(x′)⟩ = ∫_Ω ⟨xn(u), x′n(u)⟩ du.



Outline

1 Construction of the multi-layer convolutional representation

2 Invariance and stability

3 Learning aspects: model complexity





Invariance, definitions
τ : Ω → Ω: C¹-diffeomorphism with Ω = R^d.
Lτ x(u) = x(u − τ(u)): action operator.
Much richer group of transformations than translations.

Definition of stability
Representation Φ(·) is stable [Mallat, 2012] if

‖Φ(Lτ x) − Φ(x)‖ ≤ (C₁ ‖∇τ‖∞ + C₂ ‖τ‖∞) ‖x‖.

‖∇τ‖∞ = sup_u ‖∇τ(u)‖ controls deformation.
‖τ‖∞ = sup_u |τ(u)| controls translation.
C₂ → 0: translation invariance.

[Mallat, 2012, Bruna and Mallat, 2013, Sifre and Mallat, 2013]...





Warmup: translation invariance
Representation

Φn(x) = An Mn Pn An−1 Mn−1 Pn−1 · · · A1 M1 P1 A0 x.

How to achieve translation invariance?

Translation: Lc x(u) = x(u − c).
Equivariance: all operators commute with Lc (e.g., Ak Lc = Lc Ak).

‖Φn(Lc x) − Φn(x)‖ = ‖Lc Φn(x) − Φn(x)‖
                   ≤ ‖Lc An − An‖ · ‖Mn Pn Φn−1(x)‖
                   ≤ ‖Lc An − An‖ ‖x‖.

Mallat [2012]: ‖Lτ An − An‖ ≤ (C₂/σn) ‖τ‖∞ (operator norm), i.e., ‖Lc An − An‖ ≤ (C₂/σn) c for a translation by c.
The scale σn of the last layer controls translation invariance.





Stability to deformations
Representation

Φn(x) = An Mn Pn An−1 Mn−1 Pn−1 · · · A1 M1 P1 A0 x.

How to achieve stability to deformations?

Patch extraction Pk and pooling Ak do not commute with Lτ!
‖[Ak, Lτ]‖ ≤ C₁ ‖∇τ‖∞ [from Mallat, 2012].
But: [Pk, Lτ] is unstable at high frequencies!
Adapt to the current layer resolution, with patch size controlled by σk−1:

‖[Pk Ak−1, Lτ]‖ ≤ C1,κ ‖∇τ‖∞,   where sup_{u∈Sk} |u| ≤ κ σk−1.

C1,κ grows as κ^{d+1} ⟹ more stable with small patches (e.g., 3×3, VGG et al.).




Stability to deformations: final result
Theorem
If ‖∇τ‖∞ ≤ 1/2,

‖Φn(Lτ x) − Φn(x)‖ ≤ (∏_k ρk) (C1,κ (n + 1) ‖∇τ‖∞ + (C₂/σn) ‖τ‖∞) ‖x‖.

translation invariance: large σn.
stability: small patch sizes.
signal preservation: subsampling factor ≈ patch size.
⟹ needs several layers.
requires additional discussion to make stability non-trivial.
(also valid for generic CNNs with ReLUs: multiply by ∏k ρk = ∏k ‖Wk‖, but no signal preservation).

related work on stability [Wiatowski and Bölcskei, 2017]




Beyond the translation group
Can we achieve invariance to other groups?
Group action: Lg x(u) = x(g⁻¹u) (e.g., rotations, reflections).
Feature maps x(u) defined on u ∈ G (G: locally compact group).

Recipe: equivariant inner layers + global pooling in the last layer

Patch extraction: P x(u) = (x(uv))_{v∈S}.
Non-linear mapping: equivariant because pointwise!
Pooling (µ: left-invariant Haar measure):

Ax(u) = ∫_G x(uv) h(v) dµ(v) = ∫_G x(v) h(u⁻¹v) dµ(v).

related work [Sifre and Mallat, 2013, Cohen and Welling, 2016, Raj et al., 2016]...
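A toy numeric check of this recipe on the cyclic group G = Z_N, a deliberately simple stand-in (Haar measure = uniform, group operation = addition mod N): pooling commutes with the group action, and global pooling (constant h) is exactly invariant.

```python
import numpy as np

N = 8
rng = np.random.default_rng(0)
x = rng.normal(size=N)    # feature map on G = Z_N
h = rng.random(size=N)    # pooling filter on G

def pool(x, h):
    # A x(u) = (1/N) sum_v x(u v) h(v), with u v = (u + v) mod N here.
    return np.array([(np.roll(x, -u) * h).mean() for u in range(N)])

g = 3
Lg_x = np.roll(x, g)      # group action L_g x(u) = x(g^{-1} u) = x(u - g)
print(np.allclose(pool(Lg_x, h), np.roll(pool(x, h), g)))        # equivariance
print(np.allclose(pool(Lg_x, np.ones(N)), pool(x, np.ones(N))))  # invariance
```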



Group invariance and stability
Previous construction is similar to Cohen and Welling [2016] for CNNs.

A case of interest: the roto-translation group
G = R² ⋊ SO(2) (mix of translations and rotations).
Stability with respect to the translation group.
Global invariance to rotations (global pooling only at the final layer):
Inner layers: only pool on the translation group.
Last layer: global pooling on rotations.
Cohen and Welling [2016]: pooling on rotations in inner layers hurts performance on Rotated MNIST.




Discretization and signal preservation: example in 1D
Discrete signal x̄k in ℓ²(Z, H̄k) vs. continuous ones xk in L²(R, Hk).
x̄k: subsampling factor sk after pooling at scale σk ≈ sk:

x̄k[n] = Āk M̄k P̄k x̄k−1[n sk].

Claim: we can recover x̄k−1 from x̄k if the factor sk ≤ patch size.

How? Recover patches with linear functions (contained in H̄k):

⟨fw, M̄k P̄k x̄k−1(u)⟩ = fw(P̄k x̄k−1(u)) = ⟨w, P̄k x̄k−1(u)⟩,

and

P̄k x̄k−1(u) = ∑_{w∈B} ⟨fw, M̄k P̄k x̄k−1(u)⟩ w.

Warning: no claim that recovery is practical and/or stable.



Discretization and signal preservation: example in 1D

[Figure: recovery pipeline]
Forward (bottom-up): x̄k−1 → (patch extraction) P̄k x̄k−1(u) ∈ Pk → (dot-product kernel) M̄k P̄k x̄k−1 → (linear pooling) Āk M̄k P̄k x̄k−1 → (downsampling) x̄k.
Backward (recovery): x̄k → (recovery with linear measurements) Āk x̄k−1 → (deconvolution) x̄k−1.


Outline

1 Construction of the multi-layer convolutional representation

2 Invariance and stability

3 Learning aspects: model complexity



RKHS of patch kernels Kk

Kk(z, z′) = ‖z‖ ‖z′‖ κ(⟨z, z′⟩ / (‖z‖ ‖z′‖)),   κ(u) = ∑_{j=0}^∞ bj u^j.

What does the RKHS contain?

The RKHS contains homogeneous functions:

f : z ↦ ‖z‖ σ(⟨g, z⟩/‖z‖).

Smooth activations: σ(u) = ∑_{j=0}^∞ aj u^j with aj ≥ 0.
Norm: ‖f‖²_{Hk} ≤ C²σ(‖g‖²) = ∑_{j=0}^∞ (aj²/bj) ‖g‖^{2j} < ∞.

Homogeneous version of [Zhang et al., 2016, 2017].


RKHS of patch kernels Kk
Examples:
σ(u) = u (linear): C²σ(λ²) = O(λ²).
σ(u) = u^p (polynomial): C²σ(λ²) = O(λ^{2p}).
σ ≈ sin, sigmoid, smooth ReLU: C²σ(λ²) = O(e^{cλ²}).

[Figure: left, activations σ(x) for ReLU vs. a smooth ReLU (sReLU); right, f : x ↦ |x| σ(wx/|x|) for ReLU (w = 1) and sReLU with w ∈ {0, 0.5, 1, 2}.]
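A small sketch evaluating these homogeneous functions in 1D; softplus is used as the smooth-ReLU surrogate, which is an assumption (the talk's sReLU may be defined differently):

```python
import numpy as np

def f_homog(x, w, sigma):
    # f(x) = |x| * sigma(w x / |x|); homogeneous by construction.
    a = np.abs(x)
    return a * sigma(w * x / np.maximum(a, 1e-12))

softplus = lambda u: np.log1p(np.exp(u))
relu = lambda u: np.maximum(u, 0.0)

x = np.linspace(-2, 2, 9)
print(f_homog(x, 1.0, relu))      # recovers max(x, 0): the exact ReLU
print(f_homog(x, 1.0, softplus))  # smooth homogeneous counterpart
```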





Constructing a CNN in the RKHS HK
Some CNNs live in the RKHS: "linearization" principle

f(x) = σk(Wk σk−1(Wk−1 · · · σ2(W2 σ1(W1 x)) · · ·)) = ⟨f, Φ(x)⟩_H.

Consider a CNN with filters Wk^{ij}(u), u ∈ Sk:
k: layer; i: index of filter; j: index of input channel.
"Smooth homogeneous" activations σ.
The CNN can be constructed hierarchically in HK.
Norm:

‖fσ‖² ≤ ‖Wn+1‖₂² · ‖Wn‖₂² · ‖Wn−1‖₂² · · · ‖W1‖₂².

Linear layers: product of spectral norms.
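For fully-connected layers the factors are squared spectral norms, i.e., largest singular values; a quick sketch with random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = [rng.normal(size=(64, 128)) / np.sqrt(128),
           rng.normal(size=(10, 64)) / np.sqrt(64)]

spectral = [np.linalg.norm(W, ord=2) for W in weights]  # sigma_max(W_k)
print(spectral, np.prod([s**2 for s in spectral]))      # product bounding ||f||^2
```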





Link with generalization
Direct application of classical generalization bounds
Simple bound on the Rademacher complexity of linear/kernel methods:

FB = {f ∈ HK, ‖f‖ ≤ B}   ⟹   Rad_N(FB) ≤ O(BR/√N).

Leads to a margin bound O(‖f̂N‖ R/(γ√N)) for a learned CNN f̂N with margin (confidence) γ > 0.
Related to recent generalization bounds for neural networks based on products of spectral norms [e.g., Bartlett et al., 2017, Neyshabur et al., 2018].

[see, e.g., Boucheron et al., 2005, Shalev-Shwartz and Ben-David, 2014]...





Deep convolutional representations: conclusions
Study of generic properties of signal representation
Deformation stability with small patches, adapted to resolution.
Signal preservation when subsampling ≤ patch size.
Group invariance by changing patch extraction and pooling.

Applies to learned models
The same quantity ‖f‖ controls stability and generalization.
"Higher capacity" is needed to discriminate small deformations.

Questions:
Better regularization?
How does SGD control capacity in CNNs?
What about networks with no pooling layers? ResNet?



References I
Stéphanie Allassonnière, Yali Amit, and Alain Trouvé. Towards a coherent
statistical framework for dense deformable template estimation. Journal of
the Royal Statistical Society: Series B (Statistical Methodology), 69(1):
3–29, 2007.
Francis Bach. On the equivalence between kernel quadrature rules and random
feature expansions. Journal of Machine Learning Research (JMLR), 18:1–38,
2017.
Peter Bartlett, Dylan J Foster, and Matus Telgarsky. Spectrally-normalized
margin bounds for neural networks. arXiv preprint arXiv:1706.08498, 2017.
Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of
classification: A survey of some recent advances. ESAIM: probability and
statistics, 9:323–375, 2005.
Joan Bruna and Stéphane Mallat. Invariant scattering convolution networks.
IEEE Transactions on pattern analysis and machine intelligence (PAMI), 35
(8):1872–1886, 2013.



References II
Y. Cho and L. K. Saul. Large-margin classification in infinite neural networks.
Neural Computation, 22(10):2678–2697, 2010.
Taco Cohen and Max Welling. Group equivariant convolutional networks. In
International Conference on Machine Learning (ICML), 2016.
Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of
neural networks: The power of initialization and a dual view on expressivity.
In Advances In Neural Information Processing Systems, pages 2253–2261,
2016.
Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the
physical world. arXiv preprint arXiv:1607.02533, 2016.
J. Mairal. End-to-end kernel learning with supervised convolutional kernel
networks. In Advances in Neural Information Processing Systems (NIPS),
2016.
Stéphane Mallat. Group invariant scattering. Communications on Pure and
Applied Mathematics, 65(10):1331–1398, 2012.



References III
Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan
Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds
for neural networks. In Proceedings of the International Conference on
Learning Representations (ICLR), 2018.
Anant Raj, Abhishek Kumar, Youssef Mroueh, P Thomas Fletcher, and
Bernhard Scholkopf. Local group invariant representations via orbit
embeddings. preprint arXiv:1612.01988, 2016.
I. Schoenberg. Positive definite functions on spheres. Duke Math. J., 1942.
B. Scholkopf. Support Vector Learning. PhD thesis, Technischen Universität
Berlin, 1997.
Bernhard Schölkopf and Alexander J Smola. Learning with kernels: support
vector machines, regularization, optimization, and beyond. MIT press, 2002.
Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning:
From theory to algorithms. Cambridge university press, 2014.
John Shawe-Taylor and Nello Cristianini. An introduction to support vector
machines and other kernel-based learning methods. Cambridge University
Press, 2004.



References IV
Laurent Sifre and Stéphane Mallat. Rotation, scaling and deformation invariant
scattering for texture discrimination. In Proceedings of the IEEE conference
on computer vision and pattern recognition (CVPR), 2013.
Alex J Smola and Bernhard Schölkopf. Sparse greedy matrix approximation for
machine learning. In Proceedings of the International Conference on
Machine Learning (ICML), 2000.
Alex J Smola, Zoltan L Ovari, and Robert C Williamson. Regularization with
dot-product kernels. In Advances in neural information processing systems,
pages 308–314, 2001.
Alain Trouvé and Laurent Younes. Local geometry of deformable templates.
SIAM journal on mathematical analysis, 37(1):17–59, 2005.
Thomas Wiatowski and Helmut Bölcskei. A mathematical theory of deep
convolutional neural networks for feature extraction. IEEE Transactions on
Information Theory, 2017.
C. Williams and M. Seeger. Using the Nyström method to speed up kernel
machines. In Advances in Neural Information Processing Systems (NIPS),
2001.



References V
Kai Zhang, Ivor W Tsang, and James T Kwok. Improved Nyström low-rank
approximation and error analysis. In International Conference on Machine
Learning (ICML), 2008.
Y. Zhang, P. Liang, and M. J. Wainwright. Convexified convolutional neural
networks. In International Conference on Machine Learning (ICML), 2017.
Yuchen Zhang, Jason D Lee, and Michael I Jordan. ℓ1-regularized neural
networks are improperly learnable in polynomial time. In International
Conference on Machine Learning (ICML), 2016.





ϕk from kernel approximations: CKNs [Mairal, 2016]
Approximate ϕk(z) by projection (Nyström approximation) onto

F = Span(ϕk(z1), . . . , ϕk(zp)).

[Figure: Nyström approximation, with ϕ(x) and ϕ(x′) in the Hilbert space H projected onto the subspace F.]

Leads to a tractable, p-dimensional representation ψk(z).
Norm is preserved, and the projection is non-expansive:

‖ψk(z) − ψk(z′)‖ = ‖Πk ϕk(z) − Πk ϕk(z′)‖ ≤ ‖ϕk(z) − ϕk(z′)‖ ≤ ‖z − z′‖.

Anchor points z1, . . . , zp (≈ filters) can be learned from data (K-means or backprop).

[Williams and Seeger, 2001, Smola and Schölkopf, 2000, Zhang et al., 2008]...
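A minimal Nyström sketch; the Gaussian kernel here is an illustrative stand-in (CKNs use the homogeneous dot-product kernels above). Taking ψ(z) = Kpp^{−1/2} (K(zi, z))i gives ⟨ψ(z), ψ(z′)⟩ = K(z, Z) Kpp^{−1} K(Z, z′), the Nyström approximation of K(z, z′):

```python
import numpy as np

rng = np.random.default_rng(0)
p, d = 10, 5
Z = rng.normal(size=(p, d))   # anchor points z_1, ..., z_p (~ filters in CKNs)

def K(A, B, bw=1.0):
    # Gaussian kernel matrix between rows of A and rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw**2))

# Inverse square root of K_pp via eigendecomposition (K_pp is PSD).
w, V = np.linalg.eigh(K(Z, Z))
inv_sqrt = V @ np.diag(1.0 / np.sqrt(np.maximum(w, 1e-10))) @ V.T

def psi(X):
    return K(X, Z) @ inv_sqrt  # tractable p-dimensional features

X = rng.normal(size=(3, d))
print(psi(X) @ psi(X).T)       # approximates the exact kernel matrix:
print(K(X, X))
```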



ϕk from kernel approximations: CKNs [Mairal, 2016]
Convolutional kernel networks in practice.

[Figure: one CKN layer on an image; patches of the input map x0 (on grid I0) are mapped via the kernel trick to ϕ1(x) in the Hilbert space H1, projected onto the subspace F1 to obtain ψ1(x), then linearly pooled to form the next map (grid I1).]


Discussion
The norm ‖Φ(x)‖ is of the same order (or close enough) as ‖x‖.
The kernel representation is non-expansive but not contractive:

sup_{x,x′ ∈ L²(Ω,H0)} ‖Φ(x) − Φ(x′)‖ / ‖x − x′‖ = 1.



Future of Convolutional Neural Networks
What are current high-potential problems to solve?
1 lack of robustness (see next slide).
2 learning with little labeled data.
3 learning with no supervision (see Tab. from Bojanowski and Joulin, 2017).



Future of Convolutional Neural Networks
Illustration of instability. Picture from Kurakin et al. [2016].

Figure: adversarial examples are generated on a computer, then printed on paper; a new picture taken with a smartphone still fools the classifier.



Future of Convolutional Neural Networks
min_{f∈F}  (1/n) ∑_{i=1}^n L(yi, f(xi)) + λ Ω(f),

where the sum is the empirical risk (data fit) and λΩ(f) is the regularization.

The issue of regularization
Today, heuristics are used (DropOut, weight decay, early stopping)...
...but they are not sufficient.
How to control variations of prediction functions?

|f(x) − f(x′)| should be small if x and x′ are "similar".

What does it mean for x and x′ to be "similar"?
What should be a good regularization function Ω?

