KU Leuven, ESAT-STADIUS
Kasteelpark Arenberg 10
B-3001 Leuven (Heverlee), Belgium
Email: johan.suykens@esat.kuleuven.be
http://www.esat.kuleuven.be/stadius/
• Robustness
• Explainability
• Recent developments
Introduction

Self-driving cars and neural networks
Self-driving cars and deep learning
Convolutional neural networks

Historical context
• computing power
• 1992 Vapnik et al.: support vector machine classifiers
  → convex optimization, kernel machines
Different paradigms

Deep Learning, Neural Networks, Kernel methods: new synergies?
Towards a unifying picture

• Primal representation (parametric): linear, polynomial, (deep) neural network, finite or infinite dictionary, other
• Duality principle: conjugate feature duality, Lagrange duality, Legendre-Fenchel duality, other
• Dual representation (kernel-based): positive definite kernel, indefinite kernel, symmetric or non-symmetric kernel, tensor kernel
• Model settings: multi-scale, multi-task, data fusion, multiplex, ensemble, deep

[Suykens, 2017]
Function estimation and model representations
Linear function estimation (1)
• Consider estimating w, b by
$$\min_{w,b}\ \frac{1}{2} w^T w + \gamma\,\frac{1}{2}\sum_{i=1}^{N} (y_i - w^T x_i - b)^2$$
Linear function estimation (2)
• ... or write as a constrained optimization problem:
$$\min_{w,b,e}\ \frac{1}{2} w^T w + \gamma\,\frac{1}{2}\sum_i e_i^2 \quad \text{subject to } e_i = y_i - w^T x_i - b,\ i = 1, \dots, N$$

Lagrangian:
$$\mathcal{L}(w, b, e_i, \alpha_i) = \frac{1}{2} w^T w + \gamma\,\frac{1}{2}\sum_i e_i^2 - \sum_i \alpha_i (e_i - y_i + w^T x_i + b)$$
Linear model: solving in primal or dual?

Model:
(P): ŷ = w^T x + b, with w ∈ R^d
(D): ŷ = Σ_i α_i x_i^T x + b, with α ∈ R^N
Linear model: solving in primal or dual?

• primal: w ∈ R^d (preferable when N is large, since the N × N kernel matrix would be large)
• dual: α ∈ R^N (preferable when N is small, since the N × N kernel matrix is small)
Feature map and kernel
From linear to nonlinear model:
(P): ŷ = w^T ϕ(x) + b
(D): ŷ = Σ_i α_i K(x_i, x) + b

Mercer theorem: K(x, z) = ϕ(x)^T ϕ(z)
Feature map ϕ, kernel function K(x, z) (e.g. linear, polynomial, RBF, ...)
• SVMs: feature map and positive definite kernel [Cortes & Vapnik, 1995]
• Explicit or implicit choice of the feature map
• Neural networks: consider hidden layer as feature map [Suykens & Vandewalle, 1999]
• Least squares support vector machines [Suykens et al., 2002]: L2 loss and regularization
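To make the Mercer factorization concrete, here is a small numpy check (an illustrative sketch, not taken from the slides): for the polynomial kernel K(x, z) = (1 + xᵀz)² on R², an explicit 6-dimensional feature map ϕ reproduces the kernel value exactly.

```python
import numpy as np

def phi(x):
    # explicit feature map for K(x, z) = (1 + x^T z)^2 with x in R^2:
    # phi(x) = [1, sqrt(2) x1, sqrt(2) x2, x1^2, x2^2, sqrt(2) x1 x2]
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
print((1.0 + x @ z) ** 2)      # kernel evaluated directly
print(phi(x) @ phi(z))         # phi(x)^T phi(z): same value (Mercer factorization)
```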
Least Squares Support Vector Machines: “core models”
• Regression
  $$\min_{w,b,e}\ w^T w + \gamma \sum_i e_i^2 \quad \text{s.t. } y_i = w^T \varphi(x_i) + b + e_i,\ \forall i$$
• Classification
  $$\min_{w,b,e}\ w^T w + \gamma \sum_i e_i^2 \quad \text{s.t. } y_i (w^T \varphi(x_i) + b) = 1 - e_i,\ \forall i$$

[Suykens & Vandewalle, 1999; Suykens et al., 2002; Alzate & Suykens, 2010]
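In the dual, the LS-SVM regression problem reduces to a linear system in (b, α) with kernel matrix Ω_ij = K(x_i, x_j) [Suykens et al., 2002]. Below is a minimal numpy sketch; the RBF kernel and the values of γ and σ are arbitrary choices for illustration.

```python
import numpy as np

def rbf(X, Z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / sigma^2)
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def lssvm_fit(X, y, gamma=10.0, sigma=1.0):
    # Dual linear system of LS-SVM regression:
    # [ 0     1_N^T           ] [ b     ]   [ 0 ]
    # [ 1_N   Omega + I/gamma ] [ alpha ] = [ y ]
    N = X.shape[0]
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf(X, X, sigma) + np.eye(N) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]                     # b, alpha

def lssvm_predict(X_train, alpha, b, X_test, sigma=1.0):
    # y_hat(x) = sum_i alpha_i K(x_i, x) + b
    return rbf(X_test, X_train, sigma) @ alpha + b

# toy usage
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = np.sin(2 * X[:, 0]) + 0.1 * np.random.default_rng(0).normal(size=50)
b, alpha = lssvm_fit(X, y)
y_hat = lssvm_predict(X, alpha, b, X)
```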
Sparsity: through regularization or loss function
• through the loss function: model ŷ = Σ_i α_i K(x, x_i) + b, with
  $$\min\ w^T w + \gamma \sum_i L(e_i)$$
  where a sparsifying loss L (e.g. the ε-insensitive loss) yields sparse support values α_i
SVMs and neural networks

• Primal space (parametric): ŷ = sign[w^T ϕ(x) + b]
  (network with hidden units ϕ_1(x), ..., ϕ_{n_h}(x) and output weights w_1, ..., w_{n_h})
• Kernel trick (Mercer): K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j)
• Dual space (non-parametric): ŷ = sign[Σ_{i=1}^{#sv} α_i y_i K(x, x_i) + b]
  (network with hidden units K(x, x_1), ..., K(x, x_{#sv}) and weights α_1, ..., α_{#sv})

[Suykens et al., 2002]
Wider use of the “kernel trick”
• Angle between vectors (e.g. correlation analysis):
  input space: cos θ_xz = x^T z / (‖x‖_2 ‖z‖_2)
  feature space: cos θ_{ϕ(x)ϕ(z)} = ϕ(x)^T ϕ(z) / (‖ϕ(x)‖_2 ‖ϕ(z)‖_2) = K(x, z) / √(K(x, x) K(z, z))
• Classification in feature space: ŷ = sign[Σ_i α_i y_i K(x, x_i) + b]
Function estimation in RKHS
$$\min_{f \in \mathcal{H}_K}\ \frac{1}{N}\sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda \|f\|_K^2$$
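For the squared loss L(y, f(x)) = (y − f(x))², the representer theorem gives f(x) = Σ_i α_i K(x_i, x) with (K + λN I)α = y. A minimal numpy sketch follows; the RBF kernel and the parameter values are illustrative assumptions.

```python
import numpy as np

def rbf(X, Z, sigma=1.0):
    return np.exp(-((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1) / sigma ** 2)

def krr_fit(X, y, lam=1e-2, sigma=1.0):
    # squared-loss RKHS estimate: alpha = (K + lam * N * I)^{-1} y
    N = X.shape[0]
    return np.linalg.solve(rbf(X, X, sigma) + lam * N * np.eye(N), y)

def krr_predict(X_train, alpha, X_test, sigma=1.0):
    # f(x) = sum_i alpha_i K(x_i, x)
    return rbf(X_test, X_train, sigma) @ alpha
```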
Kernels
- linear: K(x, z) = x^T z
- polynomial: K(x, z) = (η + x^T z)^d
- radial basis function: K(x, z) = exp(−‖x − z‖_2^2 / σ^2)
- splines
- wavelets
- string kernel
- kernels from graphical models
- kernels for dynamical systems
- Fisher kernels
- graph kernels
- data fusion kernels
- additive kernels (good for explainability)
- other
[Schölkopf & Smola, 2002; Shawe-Taylor & Cristianini, 2004; Jebara et al., 2004; other]
Krein spaces: indefinite kernels
• also: KPCA with indefinite kernel [X. Huang et al., 2017], KSC and semi-supervised learning [Mehrkanoon et al., 2018]
Banach spaces: tensor kernels
• Regression problem:
$$\min_{(w,b,e)\,\in\,\ell^r(\mathbb{K})\times\mathbb{R}\times\mathbb{R}^N}\ \rho(\|w\|_r) + \frac{\gamma}{N}\sum_{i=1}^{N} L(e_i)
\quad \text{subject to } y_i = \langle w, \varphi(x_i)\rangle + b + e_i,\ \forall i = 1, \dots, N$$
with r = m/(m−1) for even m ≥ 2, ρ convex and even. For m large this approaches ℓ1 regularization.

• Tensor-kernel representation:
$$\hat{y} = \langle w, \varphi(x)\rangle_{r,r^*} + b = \frac{1}{N^{m-1}} \sum_{i_1, \dots, i_{m-1}=1}^{N} u_{i_1} \cdots u_{i_{m-1}}\, K(x_{i_1}, \dots, x_{i_{m-1}}, x) + b$$
[Salzo & Suykens, arXiv 1603.05876; Salzo, Suykens, Rosasco, AISTATS 2018]
related: RKBS [Zhang 2013; Fasshauer et al. 2015]
Generalization, deep learning and kernel methods
Recently it has been observed in deep learning that over-parametrized neural
networks, which would be expected to "overfit", may still perform well on test data.
This phenomenon is not yet fully understood. Several researchers have argued that
understanding kernel methods in this context is important for understanding the
generalization performance.
Related references:
• Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals,
Understanding deep learning requires rethinking generalization, 2016, arXiv:1611.03530
• Amit Daniely, SGD Learns the Conjugate Kernel Class of the Network, 2017,
arXiv:1702.08503
• Arthur Jacot, Franck Gabriel, Clement Hongler, Neural Tangent Kernel: Convergence
and Generalization in Neural Networks, 2018, arXiv:1806.07572
• Tengyuan Liang, Alexander Rakhlin, Just Interpolate: Kernel ”Ridgeless” Regression
Can Generalize, 2018, arXiv:1808.00387
• Mikhail Belkin, Siyuan Ma, Soumik Mandal, To understand deep learning we need to
understand kernel learning, 2018, arXiv:1802.01396
Generalization and deep learning - Double U curve
Figure: Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal, Reconciling modern
machine learning and the bias-variance trade-off, 2018, arXiv:1812.11118
Example: Black-box weather forecasting (1)
• Weather data: 350 stations located in the US
• Features: Tmax, Tmin, precipitation, wind speed, wind direction, ...
Black-box weather forecasting
• Multi-view learning:
- Multi-view LS-SVM regression [Houthuys, Karevan, Suykens, IJCNN 2017]
- Multi-view Deep Neural Networks [Karevan, Houthuys, Suykens, ICANN 2018]
Multi-view learning: kernel-based (1)
• Primal problem:
$$\min_{w^{[v]}, e^{[v]}}\ \frac{1}{2}\sum_{v=1}^{V} w^{[v]T} w^{[v]} + \frac{1}{2}\sum_{v=1}^{V} \gamma^{[v]} e^{[v]T} e^{[v]} + \rho \sum_{v,u=1;\, v \neq u}^{V} e^{[v]T} e^{[u]}$$
$$\text{subject to } y = \Phi^{[v]} w^{[v]} + b^{[v]} 1_N + e^{[v]},\ v = 1, \dots, V$$
• Dual:
$$\begin{bmatrix} 0_{V \times V} & 1_M^T \\ \Gamma_M 1_M + \rho\, I_M 1_M & \Gamma_M \Omega_M + I_{NV} + \rho\, I_M \Omega_M \end{bmatrix}
\begin{bmatrix} b_M \\ \alpha_M \end{bmatrix} =
\begin{bmatrix} 0_V \\ \Gamma_M y_M + (V-1)\rho\, y_M \end{bmatrix}$$
• Prediction:
$$\hat{y}(x) = \sum_{v=1}^{V} \beta_v \Big[ \sum_{k=1}^{N} \alpha_k^{[v]} K^{[v]}(x^{[v]}, x_k^{[v]}) + b^{[v]} \Big]$$
Multi-view learning: kernel-based (2)
• Data set:
– Real measurements for weather elements such as temperature, humidity, etc.
– From 2007 until mid 2014
– Two test sets:
- mid-November 2013 until mid-December 2013
- from mid-April 2014 to mid-May 2014
• Goal: forecasting minimum and maximum temperature for one to six days ahead in Brussels, Belgium
• Tuning parameters:
- kernel parameters for each view
- regularization parameters γ^[v]
- coupling parameter ρ
Multi-view learning: kernel-based (3)
Figure: forecasting results on the Apr/May and Nov/Dec test sets.
Multi-view learning: deep neural network (1)
• Primal formulation of multi-view LS-SVM:
$$\min_{w^{[v]}, e^{[v]}, b^{[v]}}\ \frac{1}{2}\sum_{v=1}^{V} w^{[v]T} w^{[v]} + \frac{1}{2}\sum_{v=1}^{V} \gamma^{[v]} e^{[v]T} e^{[v]} + \rho \sum_{v,u=1;\, v \neq u}^{V} e^{[v]T} e^{[u]}$$
$$\text{subject to } y = \Phi^{[v]} w^{[v]} + b^{[v]} 1_N + e^{[v]} \ \text{ for } v = 1, \dots, V$$
Multi-view learning: deep neural network (2)
• The weight of each view is defined based on its error on the validation set:
  $$s^{[v]} = \exp(-\mathrm{mse}_{\mathrm{val}}^{[v]})$$
Fixed-size kernel methods for large scale data
Nyström method
with
$$\hat{\lambda}_i = \frac{1}{M}\lambda_i, \qquad \hat{\phi}_i(x_k) = \sqrt{M}\, u_{ki}, \qquad \hat{\phi}_i(x') = \frac{\sqrt{M}}{\lambda_i}\sum_{k=1}^{M} u_{ki}\, K(x_k, x')$$
Fixed-size method: estimation in primal
Approximate feature map ϕ̃(·): R^d → R^M, based on the eigenvalue decomposition of the kernel
matrix, with ϕ̃_i(x') = √(λ̂_i) φ̂_i(x') (on a subset of size M ≪ N).

• Estimate in primal:
$$\min_{\tilde{w}, \tilde{b}}\ \frac{1}{2}\tilde{w}^T \tilde{w} + \gamma\,\frac{1}{2}\sum_{i=1}^{N} (y_i - \tilde{w}^T \tilde{\varphi}(x_i) - \tilde{b})^2$$
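A minimal numpy sketch of the fixed-size idea: build the Nyström approximate feature map ϕ̃ from a subsample of size M and then solve the primal ridge problem on these features. For simplicity the subset is chosen at random here (the fixed-size LS-SVM method selects it via quadratic Rényi entropy), and the bias term is regularized along with w̃.

```python
import numpy as np

def rbf(X, Z, sigma=1.0):
    return np.exp(-((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1) / sigma ** 2)

def nystrom_features(X, landmarks, sigma=1.0):
    # Eigendecomposition of the M x M kernel matrix on the subset;
    # phi_tilde_i(x') = (1 / sqrt(lambda_i)) * sum_k u_ki K(x_k, x')
    K_MM = rbf(landmarks, landmarks, sigma)
    lam, U = np.linalg.eigh(K_MM)
    keep = lam > 1e-10
    lam, U = lam[keep], U[:, keep]
    return rbf(X, landmarks, sigma) @ U / np.sqrt(lam)      # N x M approximate features

def fixed_size_fit(X, y, M=50, gamma=10.0, sigma=1.0, rng=np.random.default_rng(0)):
    landmarks = X[rng.choice(len(X), size=M, replace=False)]  # random subset (sketch only)
    Phi = nystrom_features(X, landmarks, sigma)
    Phi1 = np.hstack([Phi, np.ones((len(X), 1))])             # append bias column
    # normal equations of the regularized least-squares problem in the primal
    theta = np.linalg.solve(Phi1.T @ Phi1 + np.eye(Phi1.shape[1]) / gamma, Phi1.T @ y)
    return landmarks, theta

def fixed_size_predict(X_test, landmarks, theta, sigma=1.0):
    Phi = nystrom_features(X_test, landmarks, sigma)
    return np.hstack([Phi, np.ones((len(X_test), 1))]) @ theta
```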
Random Fourier Features
$$p(\omega) = \frac{1}{(2\pi)^d}\int \exp(-j\omega^T \Delta)\, K(\Delta)\, d\Delta$$

Draw D i.i.d. samples ω_1, ..., ω_D ∈ R^d from p.
Obtain $z(x) = \sqrt{\tfrac{1}{D}}\,[\cos(\omega_1^T x) \cdots \cos(\omega_D^T x)\ \ \sin(\omega_1^T x) \cdots \sin(\omega_D^T x)]^T$.
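A minimal numpy sketch of random Fourier features for the Gaussian kernel, written here with the convention K(x, z) = exp(−‖x − z‖²/(2σ²)), for which Bochner's theorem gives p(ω) = N(0, σ⁻²I); for the slide's exp(−‖x − z‖²/σ²) convention the sampling variance changes accordingly.

```python
import numpy as np

def rff(X, D=500, sigma=1.0, rng=np.random.default_rng(0)):
    # omega_1, ..., omega_D ~ N(0, sigma^{-2} I)
    # z(x) = sqrt(1/D) [cos(omega_1^T x) ... cos(omega_D^T x)  sin(omega_1^T x) ... sin(omega_D^T x)]^T
    Omega = rng.normal(scale=1.0 / sigma, size=(X.shape[1], D))
    proj = X @ Omega
    return np.sqrt(1.0 / D) * np.hstack([np.cos(proj), np.sin(proj)])

# sanity check: z(x)^T z(y) approximates K(x, y)
X = np.random.default_rng(1).normal(size=(5, 3))
Z = rff(X, D=5000)
K_exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)
print(np.abs(Z @ Z.T - K_exact).max())   # small approximation error
```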
Deep neural-kernel networks using random Fourier features
Use of Random Fourier Features [Rahimi & Recht, NIPS 2007] to obtain
an approximation to the feature map in a deep architecture
Example: electricity load forecasting
Figures: normalized load versus hour, for 1-hour ahead and 24-hours ahead forecasts; fixed-size
LS-SVM (panels a, c) and a linear ARX model (panels b, d) compared against the actual load.
Outliers and robustness
Figure: regression data with outliers (breakdown?)

Robust statistics: bounded derivative of the loss function, bounded kernel
Weighted versions and robustness
(connecting convex optimization and robust statistics)

• Weighted LS-SVM:
  $$\min_{w,b,e}\ \frac{1}{2} w^T w + \gamma\,\frac{1}{2}\sum_{i=1}^{N} v_i e_i^2 \quad \text{s.t. } y_i = w^T \varphi(x_i) + b + e_i,\ \forall i$$
  with v_i determined from {e_i}_{i=1}^{N} of the unweighted LS-SVM [Suykens et al., 2002].
  Robustness and stability [Debruyne et al., JMLR 2008, 2010].
• SVM solution by applying iteratively weighted LS [Perez-Cruz et al., 2005]
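A minimal numpy sketch of the weighted version: solve the unweighted LS-SVM first, read off the residuals e_i = α_i/γ, down-weight large residuals, and re-solve the dual system with Ω + diag(1/(γv_i)). The weighting rule below is a simple Huber-type choice with a MAD scale estimate; the published method uses a Hampel-type rule.

```python
import numpy as np

def rbf(X, Z, sigma=1.0):
    return np.exp(-((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1) / sigma ** 2)

def weighted_lssvm(X, y, gamma=10.0, sigma=1.0, c=2.5):
    N = X.shape[0]
    Omega = rbf(X, X, sigma)

    def solve(v):
        # dual system with weights: [0, 1^T; 1, Omega + diag(1/(gamma v))] [b; alpha] = [0; y]
        A = np.zeros((N + 1, N + 1))
        A[0, 1:] = 1.0
        A[1:, 0] = 1.0
        A[1:, 1:] = Omega + np.diag(1.0 / (gamma * v))
        sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
        return sol[0], sol[1:]

    b, alpha = solve(np.ones(N))                        # unweighted LS-SVM
    e = alpha / gamma                                   # residuals e_i = alpha_i / gamma
    s = 1.4826 * np.median(np.abs(e - np.median(e)))    # robust scale estimate (MAD)
    v = np.where(np.abs(e) <= c * s, 1.0, c * s / np.abs(e))   # Huber-type weights (sketch)
    return solve(v)                                     # weighted re-solve
```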
Example: robust regression using weighted LS-SVM
Figure: robust regression example (y versus x, two panels).
Generative models
Generative Adversarial Network (GAN)
source: https://deeplearning4j.org/generative-adversarial-network
GAN: example on MNIST
source: https://www.kdnuggets.com/2016/07/mnist-generative-adversarial-model-keras.html
Restricted Boltzmann Machines (RBM)

Joint distribution:
$$P(v, h; \theta) = \frac{1}{Z(\theta)}\exp(-E(v, h; \theta))$$
with partition function $Z(\theta) = \sum_v \sum_h \exp(-E(v, h; \theta))$

[Hinton, Osindero, Teh, Neural Computation 2006]
RBM and deep learning
in other words ...
"sandwich": E = −v^T W h
"deep sandwich": E = −v^T W_1 h_1 − h_1^T W_2 h_2 − h_2^T W_3 h_3
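A minimal numpy sketch of the "sandwich" energy for a binary RBM (bias terms omitted, as in E = −vᵀWh above), together with the conditional distributions used for block Gibbs sampling; the network sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def energy(v, h, W):
    # E(v, h) = -v^T W h   (no bias terms, as on the slide)
    return -v @ W @ h

def gibbs_step(v, W):
    # P(h_j = 1 | v) = sigmoid((W^T v)_j),  P(v_i = 1 | h) = sigmoid((W h)_i)
    h = (rng.random(W.shape[1]) < sigmoid(v @ W)).astype(float)
    v_new = (rng.random(W.shape[0]) < sigmoid(W @ h)).astype(float)
    return v_new, h

# toy usage: 6 visible and 3 hidden binary units
W = rng.normal(scale=0.1, size=(6, 3))
v = (rng.random(6) < 0.5).astype(float)
for _ in range(10):
    v, h = gibbs_step(v, W)
print(energy(v, h, W))
```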
RBM: example on MNIST
source: https://www.kaggle.com/nicw102168/restricted-boltzmann-machine-rbm-on-mnist
Convolutional Deep Belief Networks
Restricted kernel machines
Restricted Kernel Machines (RKM)
Kernel principal component analysis (KPCA)
Kernel PCA: classical LS-SVM approach
• Interpretation:
1. pool of candidate components (objective function equals zero)
2. select relevant components
From KPCA to RKM representation (1)
Model: e = W^T ϕ(x), with objective J = regularization term Tr(W^T W) − (1/λ) · variance term Σ_i e_i^T e_i

↓ use the property e^T h ≤ (1/(2λ)) e^T e + (λ/2) h^T h to obtain an upper bound J ≤ J̄(h_i, W)

RKM representation: e = Σ_j h_j K(x_j, x), with the solution following from the stationary points of J̄:
∂J̄/∂h_i = 0, ∂J̄/∂W = 0
From KPCA to RKM representation (2)
• Objective
$$\begin{aligned}
J &= \frac{\eta}{2}\mathrm{Tr}(W^T W) - \frac{1}{2\lambda}\sum_{i=1}^{N} e_i^T e_i \quad \text{s.t. } e_i = W^T \varphi(x_i),\ \forall i \\
  &\leq -\sum_{i=1}^{N} e_i^T h_i + \frac{\lambda}{2}\sum_{i=1}^{N} h_i^T h_i + \frac{\eta}{2}\mathrm{Tr}(W^T W) \quad \text{s.t. } e_i = W^T \varphi(x_i),\ \forall i \\
  &= -\sum_{i=1}^{N} \varphi(x_i)^T W h_i + \frac{\lambda}{2}\sum_{i=1}^{N} h_i^T h_i + \frac{\eta}{2}\mathrm{Tr}(W^T W) \;\triangleq\; \bar{J}
\end{aligned}$$
From KPCA to RKM representation (3)
$$\frac{1}{\eta} K H^T = H^T \Lambda$$

Model representations:
$$(P)_{\mathrm{RKM}}:\ \hat{e} = W^T \varphi(x) \qquad\qquad (D)_{\mathrm{RKM}}:\ \hat{e} = \frac{1}{\eta}\sum_j h_j K(x_j, x)$$
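A minimal numpy sketch of the dual RKM solution of kernel PCA: the rows of H are (transposed) eigenvectors of (1/η)K, and hidden features for new points follow the dual representation ê = (1/η)Σ_j h_j K(x_j, x). Kernel centering is omitted for brevity; the RBF kernel and parameter values are illustrative.

```python
import numpy as np

def rbf(X, Z, sigma=1.0):
    return np.exp(-((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1) / sigma ** 2)

def rkm_kpca(X, s=2, eta=1.0, sigma=1.0):
    # (1/eta) K H^T = H^T Lambda: columns of H^T are leading eigenvectors of (1/eta) K
    K = rbf(X, X, sigma)
    lam, V = np.linalg.eigh(K / eta)
    idx = np.argsort(lam)[::-1][:s]
    return V[:, idx].T, lam[idx]                 # H in R^{s x N}, Lambda

def rkm_scores(X_train, X_new, H, eta=1.0, sigma=1.0):
    # dual representation: e_hat(x) = (1/eta) sum_j h_j K(x_j, x)
    return rbf(X_new, X_train, sigma) @ H.T / eta
```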
Deep Restricted Kernel Machines
Deep RKM: example
$$J_{\text{deep}} = J_1 + J_2 + J_3$$
in more detail ...
$$\begin{aligned}
J_{\text{deep}} = &-\sum_{i=1}^{N} \varphi_1(x_i)^T W_1 h_i^{(1)} + \frac{\lambda_1}{2}\sum_{i=1}^{N} h_i^{(1)T} h_i^{(1)} + \frac{\eta_1}{2}\mathrm{Tr}(W_1^T W_1) \\
&-\sum_{i=1}^{N} \varphi_2(h_i^{(1)})^T W_2 h_i^{(2)} + \frac{\lambda_2}{2}\sum_{i=1}^{N} h_i^{(2)T} h_i^{(2)} + \frac{\eta_2}{2}\mathrm{Tr}(W_2^T W_2) \\
&+\sum_{i=1}^{N} (y_i^T - \varphi_3(h_i^{(2)})^T W_3 - b^T)\, h_i^{(3)} - \frac{\lambda_3}{2}\sum_{i=1}^{N} h_i^{(3)T} h_i^{(3)} + \frac{\eta_3}{2}\mathrm{Tr}(W_3^T W_3)
\end{aligned}$$
Primal and dual model representations
The framework can be used for training deep feedforward neural networks
and deep kernel machines [Suykens, 2017].
(Other approaches: e.g. kernels for deep learning [Cho & Saul, 2009], mathematics of the neural
response [Smale et al., 2010], deep Gaussian processes [Damianou & Lawrence, 2013], convolutional
kernel networks [Mairal et al., 2014], multi-layer support vector machines [Wiering & Schomaker, 2014])
Training process
Figure: objective function (logarithmic scale) during training on the ion data set,
J_deep,Pstab versus iteration step.
Generative RKM
RKM objective for training and generating (1)
$$\bar{J}(v, h, W) = -v^T W h + \frac{\lambda}{2} h^T h + \frac{1}{2} v^T v + \frac{\eta}{2}\mathrm{Tr}(W^T W)$$
RKM objective for training and generating (2)
• Training (clamp v):
$$\bar{J}_{\text{train}}(h_i, W) = -\sum_{i=1}^{N} v_i^T W h_i + \frac{\lambda}{2}\sum_{i=1}^{N} h_i^T h_i + \frac{\eta}{2}\mathrm{Tr}(W^T W)$$
Stationary points:
$$\frac{\partial \bar{J}_{\text{train}}}{\partial h_i} = 0 \ \Rightarrow\ W^T v_i = \lambda h_i,\ \forall i
\qquad\qquad
\frac{\partial \bar{J}_{\text{train}}}{\partial W} = 0 \ \Rightarrow\ W = \frac{1}{\eta}\sum_{i=1}^{N} v_i h_i^T$$
Elimination of W:
$$\frac{1}{\eta} K H^T = H^T \Delta,$$
where H = [h_1, ..., h_N] ∈ R^{s×N}, ∆ = diag{λ_1, ..., λ_s}, with s ≤ N the number of selected
components and K_ij = v_i^T v_j the kernel matrix elements.
RKM objective for training and generating (3)
• Generating (clamp h, W):
$$\bar{J}_{\text{gen}}(v^\star) = -v^{\star T} W h^\star + \frac{1}{2} v^{\star T} v^\star$$
Stationary points:
$$\frac{\partial \bar{J}_{\text{gen}}}{\partial v^\star} = 0 \ \Rightarrow\ v^\star = W h^\star$$
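A toy numpy sketch of the training/generation split above, using the identity feature map (v_i = x_i): H follows from the eigendecomposition of K_ij = v_iᵀv_j, W = (1/η)Σ_i v_i h_iᵀ, and a new point is generated as v⋆ = Wh⋆. In the full method, h⋆ is sampled from a distribution fitted over the learned h_i; here it is simply taken near the training codes.

```python
import numpy as np

def rkm_train(V, s=2, eta=1.0):
    # V: N x d data matrix with rows v_i
    K = V @ V.T                          # K_ij = v_i^T v_j
    lam, U = np.linalg.eigh(K / eta)
    idx = np.argsort(lam)[::-1][:s]
    H = U[:, idx].T                      # s x N hidden features
    W = V.T @ H.T / eta                  # W = (1/eta) sum_i v_i h_i^T
    return H, W

def rkm_generate(W, h_star):
    return W @ h_star                    # stationary point: v_star = W h_star

rng = np.random.default_rng(0)
V = rng.normal(size=(100, 5))
H, W = rkm_train(V, s=2)
h_star = H.mean(axis=1) + 0.1 * rng.normal(size=2)   # latent point near the training codes
v_star = rkm_generate(W, h_star)
```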
Dimensionality reduction and denoising: linear case
Dimensionality reduction and denoising: nonlinear case (1)
$$\varphi(x^\star) = W h^\star = \Big(\frac{1}{\eta}\sum_{i=1}^{N} \varphi(x_i) h_i^T\Big) h^\star$$
$$K(x_j, x^\star) = \Big(\frac{1}{\eta}\sum_{i=1}^{N} K(x_j, x_i) h_i^T\Big) h^\star$$
On training data:
$$\hat{\Omega} = \frac{1}{\eta}\,\Omega H^T H$$
with H ∈ R^{s×N}, Ω_ij = K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j).
Dimensionality reduction and denoising: nonlinear case (2)
with K̃(x_j, x⋆) (e.g. RBF kernel) the scaled similarity between 0 and 1, and a design parameter
S ≤ N (the S closest points based on the similarity K̃(x_j, x⋆)).
Explainable AI: latent space exploration (1)
Figure: exploration along the latent components h(1,1) and h(1,3).
Explainable AI: latent space exploration (2)
Figure: latent space with selected points A, B, C and D.
Tensor-based RKM for Multi-view KPCA
$$\min\ \langle \mathcal{W}, \mathcal{W}\rangle - \sum_{i=1}^{N}\langle \Phi(i), \mathcal{W}\rangle\, h_i + \lambda \sum_{i=1}^{N} h_i^2
\qquad \text{with } \Phi(i) = \varphi^{[1]}(x_i^{[1]}) \otimes \dots \otimes \varphi^{[V]}(x_i^{[V]})$$
Generative RKM (Gen-RKM) (1)
Figure: training and generation phases of the Gen-RKM.
Gen-RKM (2)
The objective
$$J_{\text{train}}(h_i, U, V) = \sum_{i=1}^{N}\Big(-\phi_1(x_i)^T U h_i - \phi_2(y_i)^T V h_i + \frac{\lambda}{2} h_i^T h_i\Big)
+ \frac{\eta_1}{2}\mathrm{Tr}(U^T U) + \frac{\eta_2}{2}\mathrm{Tr}(V^T V)$$
leads to the eigenvalue problem
$$\Big(\frac{1}{\eta_1} K_1 + \frac{1}{\eta_2} K_2\Big) H^T = H^T \Lambda$$
Gen-RKM (3)
$$J_{\text{gen}}(\phi_1(x^\star), \phi_2(y^\star)) = -\phi_1(x^\star)^T U h^\star - \phi_2(y^\star)^T V h^\star
+ \frac{1}{2}\phi_1(x^\star)^T \phi_1(x^\star) + \frac{1}{2}\phi_2(y^\star)^T \phi_2(y^\star)$$
giving
$$\phi_1(x^\star) = \frac{1}{\eta_1}\sum_{i=1}^{N}\phi_1(x_i)\, h_i^T h^\star, \qquad
\phi_2(y^\star) = \frac{1}{\eta_2}\sum_{i=1}^{N}\phi_2(y_i)\, h_i^T h^\star.$$
For generating x̂, ŷ one can either work with the kernel smoother or with an explicit feature map
using a (deep) neural network or CNN.
Gen-RKM (4)
Figure: Gen-RKM schematic with views X and Y, feature spaces Fx and Fy, feature maps φ1(·), φ2(·)
and pre-image maps ψ1(·), ψ2(·), interconnection matrices U and V, and shared latent variables H.
Gen-RKM: implicit feature map
Obtain
$$k_{x^\star} = \frac{1}{\eta_1} K_1 H^\top h^\star, \qquad k_{y^\star} = \frac{1}{\eta_2} K_2 H^\top h^\star,$$
with $k_{x^\star} = [k(x_1, x^\star), \dots, k(x_N, x^\star)]^\top$,
and with k̃_1(x_i, x⋆) and k̃_2(y_i, y⋆) the scaled similarities between 0 and 1; n_r is the number
of closest points based on the similarity defined by the kernels k̃_1 and k̃_2.
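A sketch of the kernel-smoother step in the implicit-feature-map case: given a latent point h⋆, compute the similarities k_{x⋆} = (1/η₁)K₁H⊤h⋆ and form x̂ as a similarity-weighted combination of the n_r closest training points. The rescaling of the similarities to [0, 1] below is an assumed implementation of the slide's description.

```python
import numpy as np

def smoother_generate(h_star, H, K1, X_train, eta1=1.0, n_r=10):
    # k_x* = (1/eta1) K1 H^T h_star  (entries k(x_i, x*), i = 1..N)
    k = (K1 @ H.T @ h_star) / eta1
    idx = np.argsort(k)[::-1][:n_r]                   # n_r most similar training points
    w = k[idx]
    w = (w - w.min()) / (w.max() - w.min() + 1e-12)   # scale similarities to [0, 1] (assumption)
    w = w / w.sum()                                   # normalized smoothing weights
    return w @ X_train[idx]                           # x_hat as weighted combination
```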
Gen-RKM: explicit feature map
Parametrized feature maps: φθ (·), ψζ (·) (e.g. CNN and transposed CNN).
Overall objective function, using a stabilization mechanism [Suykens, 2017]:
Gen-RKM: examples (1)
MNIST Fashion-MNIST
Generated samples from the model using CNN as explicit feature map in the kernel function.
The yellow boxes show training examples and the adjacent boxes show the reconstructed
samples. The other images (columns 3-6) are generated by random sampling from the
fitted distribution over the learned latent variables.
Gen-RKM: examples (2)
CIFAR-10 CelebA
Generated samples from the model using CNN as explicit feature map in the kernel function.
The yellow boxes show training examples and the adjacent boxes show the reconstructed
samples. The other images (columns 3-6) are generated by random sampling from the
fitted distribution over the learned latent variables.
Gen-RKM: multi-view generation

CelebA examples; multi-view generation (images and labels) using implicit and explicit feature maps.
Gen-RKM: latent space exploration
Future challenges
• optimal transport

Conclusions
• generative models

Acknowledgements

Thank you