
Deep Learning, Neural Networks and Kernel

Machines: towards a unifying framework


Johan Suykens

KU Leuven, ESAT-STADIUS
Kasteelpark Arenberg 10
B-3001 Leuven (Heverlee), Belgium
Email: johan.suykens@esat.kuleuven.be
http://www.esat.kuleuven.be/stadius/

AI Seminar at BeCentral Brussels, Oct 2019


Outline
• Introduction

• Function estimation, model representations, duality

• Neural networks and kernel machines

• Application examples, large scale methods

• Robustness

• Generative models: GAN, RBM, Deep BM

• Restricted kernel machines (RKM), Gen-RKM, and deep learning

• Explainability

• Recent developments

1
Introduction

1
Self-driving cars and neural networks

in the early days of neural networks:

ALVINN (Autonomous Land Vehicle In a Neural Network)


[Pomerleau, Neural Computation 1991]

2
Self-driving cars and deep learning

(27 million connections)

from: [selfdrivingcars.mit.edu (Lex Fridman et al.), 2017]

3
Convolutional neural networks

[LeCun et al., Proc. IEEE 1998]

Further advanced architectures:

Alexnet (2012): 5 convolutional layers, 3 fully connected


VGGnet (2014): 19 layers
GoogLeNet (2014): 22 layers
ResNet (2015): 152 layers

4
Historical context

1943 McCulloch & Pitts: mathematical model for the neuron


1958 Rosenblatt: perceptron learning
1960 Widrow & Hoff: adaline and lms learning rule
1969 Minsky & Papert: limitations of perceptron

1986 Rumelhart et al.: error backpropagating neural networks


→ booming of neural network universal approximators

computing power
1992 Vapnik et al.: support vector machine classifiers
→ convex optimization, kernel machines

1998 LeCun et al.: convolutional neural networks


2006 Hinton et al.: deep belief networks
2010 Bengio et al.: stacked autoencoders
→ booming of deep neural networks

4
Different paradigms

(diagram: three overlapping research areas, Deep Learning, Neural Networks,
and SVM, LS-SVM & Kernel methods)

new synergies?

5
Towards a unifying picture

(diagram: a model admits a primal representation, which is parametric (linear,
polynomial, (deep) neural network, finite or infinite dictionary, other), and a
dual representation, which is kernel-based (positive definite kernel, indefinite
kernel, symmetric or non-symmetric kernel, tensor kernel); the two are linked by
a duality principle (conjugate feature duality, Lagrange duality,
Legendre-Fenchel duality, other))

[Suykens 2017]

6
(diagram: extensions of this picture to multi-scale, multi-task, data fusion,
multiplex, ensemble and deep models)

[Suykens 2017]

7
Function estimation and model representations

7
Linear function estimation (1)

• Given training data $\{(x_i, y_i)\}_{i=1}^N$ with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$, consider $\hat{y} = f(x)$ where $f$ is
  parametrized as
  $$\hat{y} = w^T x + b$$
  with $\hat{y}$ the estimated output of the linear model.

• Consider estimating $w, b$ by
  $$\min_{w,b} \; \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^{N} (y_i - w^T x_i - b)^2$$

→ one can directly solve in $w, b$
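As an illustration (my own sketch, not part of the original slides), a minimal numpy example of this direct primal solve: the optimality conditions form a small $(d+1) \times (d+1)$ linear system in $w$ and $b$.

```python
import numpy as np

def lssvm_linear_primal(X, y, gamma=1.0):
    """Solve min_{w,b} 1/2 w'w + gamma/2 sum_i (y_i - w'x_i - b)^2 directly."""
    N, d = X.shape
    Xa = np.hstack([X, np.ones((N, 1))])          # augment with a bias column
    D = np.diag(np.r_[np.ones(d), 0.0]) / gamma   # regularize w only, not b
    theta = np.linalg.solve(Xa.T @ Xa + D, Xa.T @ y)
    return theta[:d], theta[d]                    # w, b

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 + 0.05 * rng.normal(size=50)
w, b = lssvm_linear_primal(X, y, gamma=10.0)
print(w, b)
```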

8
Linear function estimation (2)
• ... or write as a constrained optimization problem:
  $$\min_{w,b,e} \; \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_i e_i^2 \quad \text{subject to} \quad e_i = y_i - w^T x_i - b, \; i = 1, ..., N$$

  Lagrangian: $\mathcal{L}(w, b, e_i, \alpha_i) = \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_i e_i^2 - \sum_i \alpha_i (e_i - y_i + w^T x_i + b)$

• From the optimality conditions:
  $$\hat{y} = \sum_i \alpha_i x_i^T x + b$$
  where $\alpha, b$ follow from solving the linear system
  $$\begin{bmatrix} 0 & 1_N^T \\ 1_N & \Omega + I/\gamma \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix}$$
  with $\Omega_{ij} = x_i^T x_j$ for $i, j = 1, ..., N$ and $y = [y_1; ...; y_N]$.
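As a companion to the primal sketch above (again my own illustration, not from the slides), the same problem solved through this dual linear system; the primal weights are recovered from the optimality condition $w = \sum_i \alpha_i x_i$.

```python
import numpy as np

def lssvm_linear_dual(X, y, gamma=1.0):
    """Build and solve the LS-SVM dual system with linear kernel Omega_ij = x_i' x_j."""
    N = X.shape[0]
    Omega = X @ X.T
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0                          # [ 0      1_N'        ]
    A[1:, 0] = 1.0                          # [ 1_N  Omega + I/gamma]
    A[1:, 1:] = Omega + np.eye(N) / gamma
    sol = np.linalg.solve(A, np.r_[0.0, y])
    return sol[1:], sol[0]                  # alpha, b

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3
alpha, b = lssvm_linear_dual(X, y, gamma=10.0)
w_from_dual = X.T @ alpha                   # w = sum_i alpha_i x_i
print(w_from_dual, b)
```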

9
Linear model: solving in primal or dual?

inputs $x \in \mathbb{R}^d$, output $y \in \mathbb{R}$
training set $\{(x_i, y_i)\}_{i=1}^N$

Model:
  $(P): \; \hat{y} = w^T x + b$, with $w \in \mathbb{R}^d$
  $(D): \; \hat{y} = \sum_i \alpha_i x_i^T x + b$, with $\alpha \in \mathbb{R}^N$
10
Linear model: solving in primal or dual?

few inputs, many data points: d ≪ N

primal : w ∈ Rd
dual: α ∈ RN (large kernel matrix: N × N )

11
Linear model: solving in primal or dual?

many inputs, few data points: d ≫ N

primal: w ∈ Rd
dual : α ∈ RN (small kernel matrix: N × N )

11
Feature map and kernel
From linear to nonlinear model:

  $(P): \; \hat{y} = w^T \varphi(x) + b$
  $(D): \; \hat{y} = \sum_i \alpha_i K(x_i, x) + b$

Mercer theorem:
  $$K(x, z) = \varphi(x)^T \varphi(z)$$

Feature map $\varphi$, kernel function $K(x, z)$ (e.g. linear, polynomial, RBF, ...)

• SVMs: feature map and positive definite kernel [Cortes & Vapnik, 1995]
• Explicit or implicit choice of the feature map
• Neural networks: consider hidden layer as feature map [Suykens & Vandewalle, 1999]
• Least squares support vector machines [Suykens et al., 2002]: L2 loss and regularization

12
Least Squares Support Vector Machines: “core models”
• Regression
  $$\min_{w,b,e} \; w^T w + \gamma \sum_i e_i^2 \quad \text{s.t.} \quad y_i = w^T \varphi(x_i) + b + e_i, \; \forall i$$

• Classification
  $$\min_{w,b,e} \; w^T w + \gamma \sum_i e_i^2 \quad \text{s.t.} \quad y_i (w^T \varphi(x_i) + b) = 1 - e_i, \; \forall i$$

• Kernel PCA ($V = I$), kernel spectral clustering ($V = D^{-1}$)
  $$\min_{w,b,e} \; -w^T w + \gamma \sum_i v_i e_i^2 \quad \text{s.t.} \quad e_i = w^T \varphi(x_i) + b, \; \forall i$$

• Kernel canonical correlation analysis / partial least squares
  $$\min_{w,v,b,d,e,r} \; w^T w + v^T v + \nu \sum_i (e_i - r_i)^2 \quad \text{s.t.} \quad \begin{cases} e_i = w^T \varphi^{(1)}(x_i) + b \\ r_i = v^T \varphi^{(2)}(y_i) + d \end{cases}$$

[Suykens & Vandewalle, 1999; Suykens et al., 2002; Alzate & Suykens, 2010]

13
Sparsity: through regularization or loss function

• through regularization: model $\hat{y} = w^T x + b$
  $$\min \; \sum_j |w_j| + \gamma \sum_i e_i^2$$
  ⇒ sparse $w$ (e.g. Lasso)

• through loss function: model $\hat{y} = \sum_i \alpha_i K(x, x_i) + b$
  $$\min \; w^T w + \gamma \sum_i L(e_i)$$
  ⇒ sparse $\alpha$ (e.g. SVM; the $\varepsilon$-insensitive loss is zero on $[-\varepsilon, +\varepsilon]$)

14
SVMs and neural networks

Primal space (parametric):
  $\hat{y} = \mathrm{sign}[w^T \varphi(x) + b]$
  (network interpretation: hidden layer features $\varphi_1(x), ..., \varphi_{n_h}(x)$ with output weights $w_1, ..., w_{n_h}$)

$K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$ (Mercer; the "kernel trick")

Dual space (non-parametric):
  $\hat{y} = \mathrm{sign}[\sum_{i=1}^{\#sv} \alpha_i y_i K(x, x_i) + b]$
  (network interpretation: hidden units $K(x, x_1), ..., K(x, x_{\#sv})$ with output weights $\alpha_1, ..., \alpha_{\#sv}$)

(figure: two-class data in the input space mapped to the feature space; primal and
dual network representations)

[Suykens et al., 2002]

15
Wider use of the “kernel trick”

• Angle between vectors (e.g. correlation analysis):

  Input space: $\cos \theta_{xz} = \dfrac{x^T z}{\|x\|_2 \|z\|_2}$

  Feature space: $\cos \theta_{\varphi(x), \varphi(z)} = \dfrac{\varphi(x)^T \varphi(z)}{\|\varphi(x)\|_2 \|\varphi(z)\|_2} = \dfrac{K(x, z)}{\sqrt{K(x, x)} \sqrt{K(z, z)}}$

• Distance between vectors (e.g. for “kernelized” clustering methods):

  Input space: $\|x - z\|_2^2 = (x - z)^T (x - z) = x^T x + z^T z - 2 x^T z$

  Feature space: $\|\varphi(x) - \varphi(z)\|_2^2 = K(x, x) + K(z, z) - 2 K(x, z)$
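A small numpy illustration (my own sketch, not from the slides) of these two identities, using an RBF kernel so that only kernel evaluations are needed:

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """Positive definite RBF kernel K(x, z) = exp(-||x - z||^2 / sigma^2)."""
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def feature_space_cosine(x, z, K=rbf_kernel):
    # cos of the angle between phi(x) and phi(z), via kernel evaluations only
    return K(x, z) / np.sqrt(K(x, x) * K(z, z))

def feature_space_dist2(x, z, K=rbf_kernel):
    # ||phi(x) - phi(z)||^2 via the kernel trick
    return K(x, x) + K(z, z) - 2.0 * K(x, z)

x = np.array([1.0, 0.0])
z = np.array([0.5, 0.5])
print(feature_space_cosine(x, z), feature_space_dist2(x, z))
```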


16
Interpretation of kernel-based models

Decision making: classification problem (e.g. apples versus tomatoes).
Input data $x_i \in \mathbb{R}^d$ and class labels $y_i \in \{-1, +1\}$, with $N$ training data.

SVM or LS-SVM classifier: given a new $x$, obtain
$$\hat{y} = \mathrm{sign}\Big[ \sum_i \alpha_i y_i K(x, x_i) + b \Big]$$
with training points $x_i$ for $i = 1, ..., N$.

Here $K(x, x_i)$ characterizes the similarity between $x$ and $x_i$.


The bias term b can be related to prior class probabilities.

17
Function estimation in RKHS

• Find the function $f$ such that [Wahba, 1990; Evgeniou et al., 2000]
  $$\min_{f \in \mathcal{H}_K} \; \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda \|f\|_K^2$$
  with $L(\cdot, \cdot)$ the loss function and $\|f\|_K$ the norm in the RKHS $\mathcal{H}_K$ defined by $K$.

• Representer theorem: for a convex loss function, the solution is of the form
  $$f(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i)$$
  Reproducing property: $f(x) = \langle f, K_x \rangle_K$ with $K_x(\cdot) = K(x, \cdot)$

• Sparse representation by the hinge and $\varepsilon$-insensitive loss [Vapnik, 1998]

18
Kernels

Wide range of positive definite kernel functions possible:

- linear $K(x, z) = x^T z$
- polynomial $K(x, z) = (\eta + x^T z)^d$
- radial basis function $K(x, z) = \exp(-\|x - z\|_2^2 / \sigma^2)$
- splines
- wavelets
- string kernel
- kernels from graphical models
- kernels for dynamical systems
- Fisher kernels
- graph kernels
- data fusion kernels
- additive kernels (good for explainability)
- other

[Schölkopf & Smola, 2002; Shawe-Taylor & Cristianini, 2004; Jebara et al., 2004; other]

19
Krein spaces: indefinite kernels

• LS-SVM for indefinite kernel case:


  $$\min_{w_+, w_-, b, e} \; \frac{1}{2} (w_+^T w_+ - w_-^T w_-) + \frac{\gamma}{2} \sum_{i=1}^{N} e_i^2 \quad \text{s.t.} \quad y_i = w_+^T \varphi_+(x_i) + w_-^T \varphi_-(x_i) + b + e_i, \; \forall i$$

  with indefinite kernel $K(x_i, x_j) = K_+(x_i, x_j) - K_-(x_i, x_j)$, where $K_+, K_-$ are positive definite kernels:
  $K_+(x_i, x_j) = \varphi_+(x_i)^T \varphi_+(x_j)$ and $K_-(x_i, x_j) = \varphi_-(x_i)^T \varphi_-(x_j)$

• also: KPCA with indefinite kernel [X. Huang et al. 2017], KSC and
semi-supervised learning [Mehrkanoon et al., 2018]

[X. Huang, Maier, Hornegger, Suykens, ACHA 2017]


[Mehrkanoon, X. Huang, Suykens, Pattern Recognition, 2018]
Related work of RKKS: [Ong et al 2004; Haasdonk 2005; Luss 2008; Loosli et al. 2015]

20
Banach spaces: tensor kernels

• Regression problem:
  $$\min_{(w,b,e) \in \ell_r(K) \times \mathbb{R} \times \mathbb{R}^N} \; \rho(\|w\|_r) + \frac{\gamma}{N} \sum_{i=1}^{N} L(e_i) \quad \text{subject to} \quad y_i = \langle w, \varphi(x_i) \rangle + b + e_i, \; \forall i = 1, ..., N$$
  with $r = \frac{m}{m-1}$ for even $m \geq 2$, and $\rho$ convex and even.
  For $m$ large this approaches $\ell_1$ regularization.

• Tensor-kernel representation
  $$\hat{y} = \langle w, \varphi(x) \rangle_{r, r^*} + b = \frac{1}{N^{m-1}} \sum_{i_1, ..., i_{m-1} = 1}^{N} u_{i_1} \cdots u_{i_{m-1}} K(x_{i_1}, ..., x_{i_{m-1}}, x) + b$$

[Salzo & Suykens, arXiv 1603.05876; Salzo, Suykens, Rosasco, AISTATS 2018]
related: RKBS [Zhang 2013; Fasshauer et al. 2015]

21
Generalization, deep learning and kernel methods
Recently it has been observed in deep learning that over-parametrized neural
networks, which one would expect to "overfit", may still perform well on test
data. This phenomenon is not yet fully understood. Several researchers have
argued that understanding kernel methods in this context is important for
understanding the generalization performance.
Related references:

• Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals,
Understanding deep learning requires rethinking generalization, 2016, arXiv:1611.03530
• Amit Daniely, SGD Learns the Conjugate Kernel Class of the Network, 2017,
arXiv:1702.08503
• Arthur Jacot, Franck Gabriel, Clement Hongler, Neural Tangent Kernel: Convergence
and Generalization in Neural Networks, 2018, arXiv:1806.07572
• Tengyuan Liang, Alexander Rakhlin, Just Interpolate: Kernel ”Ridgeless” Regression
Can Generalize, 2018, arXiv:1808.00387
• Mikhail Belkin, Siyuan Ma, Soumik Mandal, To understand deep learning we need to
understand kernel learning, 2018, arXiv:1802.01396

22
Generalization and deep learning - Double U curve

Figure: Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal, Reconciling modern
machine learning and the bias-variance trade-off, 2018, arXiv:1812.11118

23
Example: Black-box weather forecasting (1)

Weather data:
350 stations located in the US

Features:
Tmax, Tmin, precipitation, wind speed, wind direction, ...

Black-box forecasting of multiple weather stations simultaneously

[Signoretto, Frandi, Karevan, Suykens, IEEE-SCCI, 2014]

24
Black-box weather forecasting

• Black-box weather forecasting: prediction of the temperature in Brussels

• Multi-view learning:
- Multi-view LS-SVM regression [Houthuys, Karevan, Suykens, IJCNN 2017]
- Multi-view Deep Neural Networks [Karevan, Houthuys, Suykens, ICANN 2018]

25
Multi-view learning: kernel-based (1)
• Primal problem:
  $$\min_{w^{[v]}, e^{[v]}} \; \frac{1}{2} \sum_{v=1}^{V} w^{[v]T} w^{[v]} + \frac{1}{2} \sum_{v=1}^{V} \gamma^{[v]} e^{[v]T} e^{[v]} + \rho \sum_{v,u=1; v \neq u}^{V} e^{[v]T} e^{[u]}$$
  subject to $y = \Phi^{[v]} w^{[v]} + b^{[v]} 1_N + e^{[v]}$, $v = 1, ..., V$

• Dual:
  $$\begin{bmatrix} 0_{V \times V} & 1_M^T \\ \Gamma_M 1_M + \rho I_M 1_M & \Gamma_M \Omega_M + I_{NV} + \rho I_M \Omega_M \end{bmatrix} \begin{bmatrix} b_M \\ \alpha_M \end{bmatrix} = \begin{bmatrix} 0_V \\ \Gamma_M y_M + (V-1)\rho\, y_M \end{bmatrix}$$

• Prediction:
  $$\hat{y}(x) = \sum_{v=1}^{V} \beta_v \sum_{k=1}^{N} \alpha_k^{[v]} K^{[v]}(x^{[v]}, x_k^{[v]}) + b^{[v]}$$
[Houthuys et al., 2017]

26
Multi-view learning: kernel-based (2)

• Data set:
– Real measurements for weather elements such as temperature, humidity, etc.
– From 2007 until mid 2014
– Two test sets:
- mid-November 2013 until mid-December 2013
- from mid-April 2014 to mid-May 2014

• Goal: Forecasting minimum and maximum temperature for one to six days ahead in
Brussels, Belgium

• Views: Brussels together with 9 neighboring cities

• Tuning parameters:
- kernel parameters for each view
- regularization parameters γ [v]
- coupling parameter ρ

27
Multi-view learning: kernel-based (3)

(figure: forecasting results on the two test sets, Apr/May and Nov/Dec)
28
Multi-view learning: deep neural network (1)
• Primal formulation of multi-view LS-SVM:
  $$\min_{w^{[v]}, e^{[v]}, b^{[v]}} \; \frac{1}{2} \sum_{v=1}^{V} w^{[v]T} w^{[v]} + \frac{1}{2} \sum_{v=1}^{V} \gamma^{[v]} e^{[v]T} e^{[v]} + \rho \sum_{v,u=1; v \neq u}^{V} e^{[v]T} e^{[u]}$$
  subject to $y = \Phi^{[v]} w^{[v]} + b^{[v]} 1_N + e^{[v]}$ for $v = 1, ..., V$

• Weighted multi-view approach:
  $$\min_{w^{[v]}, e^{[v]}} \; \frac{1}{2} \sum_{v=1}^{V} s^{[v]} \left( w^{[v]T} w^{[v]} + \gamma^{[v]} e^{[v]T} e^{[v]} \right) + \sum_{v,u=1; v \neq u}^{V} \rho^{[v,u]} \sqrt{s^{[v]}} \sqrt{s^{[u]}} \, e^{[v]T} e^{[u]}$$

  – $s^{[v]}$: weight of view $v$ (can be manually determined by an expert,
    or calculated during a pre-processing step)
  – $\rho^{[v,u]}$: coupling parameter for the pairwise combination of views
  – $0 \leq \rho^{[v,u]} \leq \min\{\gamma^{[v]}, \gamma^{[u]}\}$
[Karevan et al., 2018]

29
Multi-view learning: deep neural network (2)

• Weather forecasting is a time series prediction problem → Consider each


delay as a view

• Consider 5 views (i.e. the delay is considered to be 5)

• Tuning parameters: regularization parameter $\gamma^{[v]}$ and number of neurons
  for each view, and coupling parameter $\rho^{[v,u]}$ for each pair of views

• The weight of each view is defined based on its error on the validation set:
  $s^{[v]} = \exp(-\mathrm{mse}_{\mathrm{val}}^{[v]})$

• Forecasting minimum and maximum temperature for one to six days


ahead in Brussels, Belgium

30
Fixed-size kernel methods for large scale data

30
Nyström method

• “big” kernel matrix: $\Omega^{(N,N)} \in \mathbb{R}^{N \times N}$
  “small” kernel matrix: $\Omega^{(M,M)} \in \mathbb{R}^{M \times M}$ (on a subset)

• Eigenvalue decompositions: $\Omega^{(N,N)} \tilde{U} = \tilde{U} \tilde{\Lambda}$ and $\Omega^{(M,M)} U = U \Lambda$

• Relation to the eigenvalues and eigenfunctions of the integral equation
  $$\int K(x, x') \phi_i(x) p(x) dx = \lambda_i \phi_i(x')$$
  with
  $$\hat{\lambda}_i = \frac{1}{M} \lambda_i, \quad \hat{\phi}_i(x_k) = \sqrt{M} u_{ki}, \quad \hat{\phi}_i(x') = \frac{\sqrt{M}}{\lambda_i} \sum_{k=1}^{M} u_{ki} K(x_k, x')$$

[Williams & Seeger, 2001] (Nyström method in GP)
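A small numpy illustration (own sketch, not from the slides) of the resulting low-rank Nyström approximation $\Omega^{(N,N)} \approx \Omega^{(N,M)} \, \Omega^{(M,M)-1} \, \Omega^{(M,N)}$ for an RBF kernel on a random subset:

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / sigma^2)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / sigma**2)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                 # N = 500 data points
sub = rng.choice(len(X), 50, replace=False)   # subset of size M = 50

lam, U = np.linalg.eigh(rbf(X[sub], X[sub]))  # eigendecomposition of the small matrix
keep = lam > 1e-8 * lam.max()                 # drop near-zero eigenvalues for stability
K_NM = rbf(X, X[sub])
K_approx = K_NM @ U[:, keep] @ np.diag(1.0 / lam[keep]) @ U[:, keep].T @ K_NM.T

K_full = rbf(X, X)
print("relative error:", np.linalg.norm(K_full - K_approx) / np.linalg.norm(K_full))
```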

31
Fixed-size method: estimation in primal

• For the feature map $\varphi(\cdot): \mathbb{R}^d \to \mathbb{R}^h$, obtain an approximation $\tilde{\varphi}(\cdot): \mathbb{R}^d \to \mathbb{R}^M$
  based on the eigenvalue decomposition of the kernel matrix, with $\tilde{\varphi}_i(x') = \sqrt{\hat{\lambda}_i}\, \hat{\phi}_i(x')$
  (on a subset of size $M \ll N$).

• Estimate in primal:
  $$\min_{\tilde{w}, \tilde{b}} \; \frac{1}{2} \tilde{w}^T \tilde{w} + \gamma \frac{1}{2} \sum_{i=1}^{N} (y_i - \tilde{w}^T \tilde{\varphi}(x_i) - \tilde{b})^2$$

  A sparse representation is obtained: $\tilde{w} \in \mathbb{R}^M$ with $M \ll N$ and $M \ll h$.

[Suykens et al., 2002; De Brabanter et al., CSDA 2010]
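A hedged numpy sketch (own illustration, not from the slides) of this fixed-size procedure: build the approximate feature map $\tilde{\varphi}$ from a subset of size $M$ and solve the regression problem in the primal, so that only an $M$-dimensional weight vector is estimated.

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / sigma**2)

def nystrom_features(X, X_sub, sigma=1.0, tol=1e-8):
    """Approximate feature map phi_tilde so that phi_tilde(X) phi_tilde(X)' ~ rbf(X, X)."""
    lam, U = np.linalg.eigh(rbf(X_sub, X_sub, sigma))
    keep = lam > tol * lam.max()
    return rbf(X, X_sub, sigma) @ U[:, keep] / np.sqrt(lam[keep])

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(1000, 1))
y = np.sinc(X[:, 0]) + 0.1 * rng.normal(size=1000)
sub = rng.choice(1000, 50, replace=False)               # M = 50 << N = 1000

Phi = nystrom_features(X, X[sub], sigma=1.0)            # N x M approximate features
Phi_a = np.hstack([Phi, np.ones((1000, 1))])            # add a bias column
gamma = 100.0
D = np.diag(np.r_[np.ones(Phi.shape[1]), 0.0]) / gamma  # regularize w_tilde only
theta = np.linalg.solve(Phi_a.T @ Phi_a + D, Phi_a.T @ y)
print("train MSE:", np.mean((Phi_a @ theta - y) ** 2))
```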

32
Random Fourier Features

• Proposed by [Rahimi & Recht, 2007].

• It requires a positive definite shift-invariant kernel $K(x, y) = K(x - y)$.
  One obtains a randomized feature map $z(x): \mathbb{R}^d \to \mathbb{R}^{2D}$ so that
  $$z(x)^T z(y) \simeq K(x - y).$$

• Compute the Fourier transform $p$ of the kernel $K$:
  $$p(\omega) = \frac{1}{2\pi} \int \exp(-j \omega^T \Delta) K(\Delta) \, d\Delta$$
  Draw $D$ i.i.d. samples $\omega_1, ..., \omega_D \in \mathbb{R}^d$ from $p$.
  Obtain
  $$z(x) = \sqrt{\tfrac{1}{D}} \left[ \cos(\omega_1^T x) \cdots \cos(\omega_D^T x) \;\; \sin(\omega_1^T x) \cdots \sin(\omega_D^T x) \right]^T.$$
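A minimal numpy sketch (own illustration) for the Gaussian kernel $\exp(-\|x-y\|^2/(2\sigma^2))$, whose spectral density is a Gaussian $\mathcal{N}(0, \sigma^{-2} I)$; for large $D$ the inner products of the random features approach the exact kernel values.

```python
import numpy as np

def rff_features(X, D=200, sigma=1.0, rng=None):
    """Random Fourier features approximating exp(-||x - y||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(0) if rng is None else rng
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, D))  # omega_1, ..., omega_D ~ p(omega)
    Z = X @ W
    return np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(D)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Z = rff_features(X, D=5000, sigma=1.0, rng=rng)
K_exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)
print(np.abs(Z @ Z.T - K_exact).max())   # small for large D
```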

33
Deep neural-kernel networks using random Fourier features

Use of Random Fourier Features [Rahimi & Recht, NIPS 2007] to obtain
an approximation to the feature map in a deep architecture

[Mehrkanoon & Suykens, Neurocomputing 2018]

34
Example: electricity load forecasting
(figures: normalized load over one week of hourly data; top row: 1-hour ahead
predictions, bottom row: 24-hours ahead predictions; left panels (a),(c):
Fixed-size LS-SVM vs actual load, right panels (b),(d): linear ARX model vs
actual load)

[Espinoza, Suykens, Belmans, De Moor, IEEE CSM 2007]


35
Robustness

35
Outliers and robustness
(figure: regression data with an outlier; does the estimate break down?)

Robust statistics: bounded derivative of the loss function, bounded kernel

[Huber, 1981; Hampel et al., 1986; Rousseeuw & Leroy, 1987]

36
Weighted versions and robustness

(diagram: an SVM with a convex cost function is obtained by convex optimization;
a weighted LS-SVM with a modified cost function incorporates insights from
robust statistics)

• Weighted LS-SVM:
  $$\min_{w,b,e} \; \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^{N} v_i e_i^2 \quad \text{s.t.} \quad y_i = w^T \varphi(x_i) + b + e_i, \; \forall i$$
  with $v_i$ determined from $\{e_i\}_{i=1}^{N}$ of the unweighted LS-SVM [Suykens et al., 2002].
  Robustness and stability [Debruyne et al., JMLR 2008, 2010].

• The SVM solution can be obtained by applying iteratively weighted LS [Perez-Cruz et al., 2005]
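A hedged numpy sketch (own illustration) of one such reweighting pass, along the lines of the weighting scheme in [Suykens et al., 2002]: residuals of the unweighted LS-SVM ($e_i = \alpha_i / \gamma$) are compared to a robust scale estimate and large residuals are downweighted before re-solving; the constants $c_1, c_2$ are indicative.

```python
import numpy as np

def lssvm_dual(K, y, gamma, v=None):
    """Solve the (weighted) LS-SVM dual system; v are per-sample weights."""
    N = len(y)
    v = np.ones(N) if v is None else v
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.diag(1.0 / (gamma * v))
    sol = np.linalg.solve(A, np.r_[0.0, y])
    return sol[1:], sol[0]                              # alpha, b

def weighted_lssvm(K, y, gamma, c1=2.5, c2=3.0):
    alpha, b = lssvm_dual(K, y, gamma)                  # unweighted pass
    e = alpha / gamma                                   # residuals, since alpha_i = gamma e_i
    s = 1.483 * np.median(np.abs(e - np.median(e)))     # robust scale estimate (MAD)
    r = np.abs(e / s)
    v = np.where(r <= c1, 1.0,
                 np.where(r <= c2, (c2 - r) / (c2 - c1), 1e-4))
    return lssvm_dual(K, y, gamma, v)                   # weighted pass
```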

37
Example: robust regression using weighted LS-SVM

(figures: function estimation using LS-SVM with RBF kernel on data containing
outliers, comparing the LS-SVM fit with the real function; left panel with
tuned parameters γ = 0.14185, σ² = 0.047615; right panel with γ = 95025.45,
σ² = 0.66686)

using LS-SVMlab v1.8 http://www.esat.kuleuven.be/sista/lssvmlab/

38
Generative models

38
Generative Adversarial Network (GAN)

Generative Adversarial Network (GAN) [Goodfellow et al., 2014]


Training of two competing models in a zero-sum game:

(Generator) generate fake output examples from random noise


(Discriminator) discriminate between fake examples and real examples.

source: https://deeplearning4j.org/generative-adversarial-network

39
GAN: example on MNIST

MNIST training data:

GAN generated examples:

source: https://www.kdnuggets.com/2016/07/mnist-generative-adversarial-model-keras.html

40
Restricted Boltzmann Machines (RBM)

• Markov random field, bipartite graph, stochastic binary units
  Layer of visible units $v$ and layer of hidden units $h$
  No hidden-to-hidden connections

• Energy:
  $$E(v, h; \theta) = -v^T W h - c^T v - a^T h \quad \text{with} \quad \theta = \{W, c, a\}$$

  Joint distribution:
  $$P(v, h; \theta) = \frac{1}{Z(\theta)} \exp(-E(v, h; \theta))$$
  with partition function $Z(\theta) = \sum_v \sum_h \exp(-E(v, h; \theta))$
[Hinton, Osindero, Teh, Neural Computation 2006]
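A minimal numpy sketch (own illustration) of this energy function together with one block Gibbs sampling step, which the bipartite structure makes possible (all hidden units are conditionally independent given the visible units, and vice versa):

```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 6, 3
W = 0.1 * rng.normal(size=(n_v, n_h))    # visible-to-hidden weights
c = np.zeros(n_v)                        # visible biases
a = np.zeros(n_h)                        # hidden biases

def energy(v, h):
    # E(v, h) = -v' W h - c' v - a' h
    return -v @ W @ h - c @ v - a @ h

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v):
    """Sample h | v, then v | h (binary units)."""
    h = (rng.random(n_h) < sigmoid(W.T @ v + a)).astype(float)
    v_new = (rng.random(n_v) < sigmoid(W @ h + c)).astype(float)
    return v_new, h

v = (rng.random(n_v) < 0.5).astype(float)
v, h = gibbs_step(v)
print(energy(v, h))
```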

41
RBM and deep learning

RBM

p(v, h) p(v, h1, h2, h3, ...)

[Hinton et al., 2006; Salakhutdinov, 2015]

42
in other words ...

“sandwich”:
$$E = -v^T W h$$

“deep sandwich”:
$$E = -v^T W^1 h^1 - h^{1T} W^2 h^2 - h^{2T} W^3 h^3$$

43
RBM: example on MNIST

MNIST training data:

Generating new images:

source: https://www.kaggle.com/nicw102168/restricted-boltzmann-machine-rbm-on-mnist

44
Convolutional Deep Belief Networks

Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief


Networks [Lee et al. 2011]

45
Restricted kernel machines

45
Restricted Kernel Machines (RKM)

• Kernel machine interpretations in terms of visible and hidden units


(similar to Restricted Boltzmann Machines (RBM))

• Restricted Kernel Machine (RKM) representations for


– LS-SVM regression/classification
– Kernel PCA
– Matrix SVD
– Parzen-type models
– other

• Based on principle of conjugate feature duality (with hidden features


corresponding to dual variables)

• Deep Restricted Kernel Machines (Deep RKM)


[Suykens, Neural Computation, 2017]

46
Kernel principal component analysis (KPCA)
(figure: comparison of linear PCA and kernel PCA with RBF kernel on a 2D toy data set)

Kernel PCA [Schölkopf et al., 1998]:
take the eigenvalue decomposition of the kernel matrix
$$\begin{bmatrix} K(x_1, x_1) & \cdots & K(x_1, x_N) \\ \vdots & & \vdots \\ K(x_N, x_1) & \cdots & K(x_N, x_N) \end{bmatrix}$$

(applications in dimensionality reduction and denoising)

47
Kernel PCA: classical LS-SVM approach

• Primal problem [Suykens et al., 2002]: model-based approach
  $$\min_{w,b,e} \; \frac{1}{2} w^T w - \gamma \frac{1}{2} \sum_{i=1}^{N} e_i^2 \quad \text{s.t.} \quad e_i = w^T \varphi(x_i) + b, \; i = 1, ..., N.$$

• Dual problem corresponds to kernel PCA:
  $$\Omega^{(c)} \alpha = \lambda \alpha \quad \text{with} \quad \lambda = 1/\gamma$$
  with $\Omega^{(c)}_{ij} = (\varphi(x_i) - \hat{\mu}_\varphi)^T (\varphi(x_j) - \hat{\mu}_\varphi)$ the centered kernel matrix
  and $\hat{\mu}_\varphi = (1/N) \sum_{i=1}^{N} \varphi(x_i)$.

• Interpretation:
1. pool of candidate components (objective function equals zero)
2. select relevant components
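A small numpy sketch (own illustration, eigenvector normalization omitted) of kernel PCA as the eigendecomposition of the centered kernel matrix $\Omega^{(c)} = C\,\Omega\,C$ with $C = I - \frac{1}{N} 1 1^T$:

```python
import numpy as np

def kernel_pca(X, sigma=1.0, n_components=2):
    """Kernel PCA: eigendecomposition of the centered RBF kernel matrix."""
    N = X.shape[0]
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    Omega = np.exp(-d2 / sigma**2)
    C = np.eye(N) - np.ones((N, N)) / N          # centering matrix
    Omega_c = C @ Omega @ C                      # centered kernel matrix
    lam, alpha = np.linalg.eigh(Omega_c)
    idx = np.argsort(lam)[::-1][:n_components]   # select the relevant components
    scores = Omega_c @ alpha[:, idx]             # projections of the training points
    return scores, lam[idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
scores, eigvals = kernel_pca(X, sigma=2.0, n_components=2)
print(scores.shape, eigvals)
```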

48
From KPCA to RKM representation (1)

Model: $e = W^T \varphi(x)$

Objective $J$ = regularization term $\mathrm{Tr}(W^T W)$ minus $(\frac{1}{\lambda})\times$ variance term $\sum_i e_i^T e_i$

↓ use the property $e^T h \leq \frac{1}{2\lambda} e^T e + \frac{\lambda}{2} h^T h$

RKM representation: $e = \sum_j h_j K(x_j, x)$

Obtain $J \leq \overline{J}(h_i, W)$; the solution follows from the stationary points of $\overline{J}$:
$\frac{\partial \overline{J}}{\partial h_i} = 0$, $\frac{\partial \overline{J}}{\partial W} = 0$

49
From KPCA to RKM representation (2)
• Objective
  $$J = \frac{\eta}{2} \mathrm{Tr}(W^T W) - \frac{1}{2\lambda} \sum_{i=1}^{N} e_i^T e_i \quad \text{s.t.} \quad e_i = W^T \varphi(x_i), \; \forall i$$
  $$\leq \; - \sum_{i=1}^{N} e_i^T h_i + \frac{\lambda}{2} \sum_{i=1}^{N} h_i^T h_i + \frac{\eta}{2} \mathrm{Tr}(W^T W) \quad \text{s.t.} \quad e_i = W^T \varphi(x_i), \; \forall i$$
  $$= \; - \sum_{i=1}^{N} \varphi(x_i)^T W h_i + \frac{\lambda}{2} \sum_{i=1}^{N} h_i^T h_i + \frac{\eta}{2} \mathrm{Tr}(W^T W) \; \triangleq \; \overline{J}$$

• Stationary points of $\overline{J}(h_i, W)$:
  $$\frac{\partial \overline{J}}{\partial h_i} = 0 \;\Rightarrow\; W^T \varphi(x_i) = \lambda h_i, \; \forall i$$
  $$\frac{\partial \overline{J}}{\partial W} = 0 \;\Rightarrow\; W = \frac{1}{\eta} \sum_i \varphi(x_i) h_i^T$$

50
From KPCA to RKM representation (3)

• Elimination of $W$ gives the eigenvalue decomposition:
  $$\frac{1}{\eta} K H^T = H^T \Lambda$$
  where $H = [h_1 \cdots h_N] \in \mathbb{R}^{s \times N}$ and $\Lambda = \mathrm{diag}\{\lambda_1, ..., \lambda_s\}$ with $s \leq N$

• Primal and dual model representations:
  $$(P)_{\mathrm{RKM}}: \; \hat{e} = W^T \varphi(x)$$
  $$(D)_{\mathrm{RKM}}: \; \hat{e} = \frac{1}{\eta} \sum_j h_j K(x_j, x).$$

51
Deep Restricted Kernel Machines

51
Deep RKM: example

(diagram: input $x$ passes through feature maps $\varphi_1(x)$, $\varphi_2(h^{(1)})$, $\varphi_3(h^{(2)})$ with
conjugated pairs $(e^{(1)}, h^{(1)})$, $(e^{(2)}, h^{(2)})$, $(e^{(3)}, h^{(3)})$ towards the output $y$)

Deep RKM: KPCA + KPCA + LS-SVM [Suykens, 2017]

Coupling of the RKMs by taking the sum of the objectives:
$$J_{\mathrm{deep}} = J_1 + J_2 + J_3$$

Multiple levels and multiple layers per level.

52
in more detail ...

(diagram as on the previous slide)

$$\begin{aligned}
J_{\mathrm{deep}} = & - \sum_{i=1}^{N} \varphi_1(x_i)^T W_1 h_i^{(1)} + \frac{\lambda_1}{2} \sum_{i=1}^{N} h_i^{(1)T} h_i^{(1)} + \frac{\eta_1}{2} \mathrm{Tr}(W_1^T W_1) \\
& - \sum_{i=1}^{N} \varphi_2(h_i^{(1)})^T W_2 h_i^{(2)} + \frac{\lambda_2}{2} \sum_{i=1}^{N} h_i^{(2)T} h_i^{(2)} + \frac{\eta_2}{2} \mathrm{Tr}(W_2^T W_2) \\
& + \sum_{i=1}^{N} (y_i^T - \varphi_3(h_i^{(2)})^T W_3 - b^T) h_i^{(3)} - \frac{\lambda_3}{2} \sum_{i=1}^{N} h_i^{(3)T} h_i^{(3)} + \frac{\eta_3}{2} \mathrm{Tr}(W_3^T W_3)
\end{aligned}$$

53
Primal and dual model representations

$$(P)_{\mathrm{DeepRKM}}: \quad \hat{e}^{(1)} = W_1^T \varphi_1(x), \quad \hat{e}^{(2)} = W_2^T \varphi_2(\Lambda_1^{-1} \hat{e}^{(1)}), \quad \hat{y} = W_3^T \varphi_3(\Lambda_2^{-1} \hat{e}^{(2)}) + b$$

$$(D)_{\mathrm{DeepRKM}}: \quad \hat{e}^{(1)} = \frac{1}{\eta_1} \sum_j h_j^{(1)} K_1(x_j, x), \quad \hat{e}^{(2)} = \frac{1}{\eta_2} \sum_j h_j^{(2)} K_2(h_j^{(1)}, \Lambda_1^{-1} \hat{e}^{(1)}), \quad \hat{y} = \frac{1}{\eta_3} \sum_j h_j^{(3)} K_3(h_j^{(2)}, \Lambda_2^{-1} \hat{e}^{(2)}) + b$$

The framework can be used for training deep feedforward neural networks
and deep kernel machines [Suykens, 2017].
(Other approaches: e.g. kernels for deep learning [Cho & Saul, 2009], mathematics of
the neural response [Smale et al., 2010], deep gaussian processes [Damianou & Lawrence,
2013], convolutional kernel networks [Mairal et al., 2014], multi-layer support vector
machines [Wiering & Schomaker, 2014])

54
Training process
(figure: Jdeep,Pstab versus iteration step, on a logarithmic scale)

Objective function (logarithmic scale) during training on the ion data set:

• black color: level 3 objective only


• Jdeep for cstab = 1, 10, 100 (blue, red, magenta color) in stabilization term

55
Generative RKM

55
RKM objective for training and generating (1)

• RBM energy function
  $$E(v, h; \theta) = -v^T W h - c^T v - a^T h$$
  with model parameters $\theta = \{W, c, a\}$

• RKM "super-objective" function (for training and for generating)
  $$\bar{J}(v, h, W) = -v^T W h + \frac{\lambda}{2} h^T h + \frac{1}{2} v^T v + \frac{\eta}{2} \mathrm{Tr}(W^T W)$$

  Training: clamp $v$ → $\bar{J}_{\mathrm{train}}(h, W)$
  Generating: clamp $h, W$ → $\bar{J}_{\mathrm{gen}}(v)$

[Schreurs & Suykens, ESANN 2018]

56
RKM objective for training and generating (2)
• Training: (clamp $v$)
  $$\bar{J}_{\mathrm{train}}(h_i, W) = - \sum_{i=1}^{N} v_i^T W h_i + \frac{\lambda}{2} \sum_{i=1}^{N} h_i^T h_i + \frac{\eta}{2} \mathrm{Tr}(W^T W)$$

  Stationary points:
  $$\frac{\partial \bar{J}_{\mathrm{train}}}{\partial h_i} = 0 \;\Rightarrow\; W^T v_i = \lambda h_i, \; \forall i \qquad \frac{\partial \bar{J}_{\mathrm{train}}}{\partial W} = 0 \;\Rightarrow\; W = \frac{1}{\eta} \sum_{i=1}^{N} v_i h_i^T$$

  Elimination of $W$:
  $$\frac{1}{\eta} K H^T = H^T \Delta,$$
  where $H = [h_1, \ldots, h_N] \in \mathbb{R}^{s \times N}$, $\Delta = \mathrm{diag}\{\lambda_1, \ldots, \lambda_s\}$ with $s \leq N$ the number of
  selected components and $K_{ij} = v_i^T v_j$ the kernel matrix elements.

57
RKM objective for training and generating (3)

• Generating: (clamp $h, W$)

  Estimate the distribution $p(h)$ from $h_i$, $i = 1, ..., N$ (or assume it normal).
  Obtain a new value $h^\star$.
  Generate in this way $v^\star$ from
  $$\bar{J}_{\mathrm{gen}}(v^\star) = -v^{\star T} W h^\star + \frac{1}{2} v^{\star T} v^\star$$

  Stationary points:
  $$\frac{\partial \bar{J}_{\mathrm{gen}}}{\partial v^\star} = 0$$
  This gives
  $$v^\star = W h^\star$$
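A hedged numpy sketch (own illustration) of this train/generate scheme in the linear case: training reduces to an eigendecomposition of $\frac{1}{\eta} K$ with $K_{ij} = v_i^T v_j$ (here the $s$ largest components are kept), and generation samples $h^\star$ from a Gaussian fitted on the hidden features and maps it back via $v^\star = W h^\star$.

```python
import numpy as np

def rkm_train(V, s=2, eta=1.0):
    """Hidden features and W from the eigendecomposition of (1/eta) K, K_ij = v_i' v_j."""
    K = V @ V.T / eta
    lam, vecs = np.linalg.eigh(K)
    idx = np.argsort(lam)[::-1][:s]          # keep the s largest components
    H = vecs[:, idx].T                       # hidden features, shape (s, N)
    W = V.T @ H.T / eta                      # W = (1/eta) sum_i v_i h_i'
    return H, W

def rkm_generate(H, W, rng, n_samples=5):
    """Fit a Gaussian on the hidden features, sample h*, and map back: v* = W h*."""
    mu, cov = H.mean(axis=1), np.cov(H)
    h_star = rng.multivariate_normal(mu, cov, size=n_samples)
    return h_star @ W.T                      # rows are generated v*

rng = np.random.default_rng(0)
V = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 10))   # visible data, N = 100, dim 10
H, W = rkm_train(V, s=2)
print(rkm_generate(H, W, rng).shape)                       # (5, 10)
```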

58
Dimensionality reduction and denoising: linear case

• Given training data $v_i = x_i$ with $X \in \mathbb{R}^{d \times N}$, obtain the hidden features $H \in \mathbb{R}^{s \times N}$:
  $$\hat{X} = W H = \Big( \frac{1}{\eta} \sum_{i=1}^{N} x_i h_i^T \Big) H = \frac{1}{\eta} X H^T H$$

• Reconstruction error: $\|X - \hat{X}\|^2$

  (diagram: $x_i \to G(\cdot) \to h_i \to F(\cdot) \to \hat{x}_i$)

59
Dimensionality reduction and denoising: nonlinear case (1)

• A new data point $x^\star$ is generated from $h^\star$ by
  $$\varphi(x^\star) = W h^\star = \Big( \frac{1}{\eta} \sum_{i=1}^{N} \varphi(x_i) h_i^T \Big) h^\star$$

• Multiplying both sides by $\varphi(x_j)^T$ gives:
  $$K(x_j, x^\star) = \frac{1}{\eta} \Big( \sum_{i=1}^{N} K(x_j, x_i) h_i^T \Big) h^\star$$

  On the training data:
  $$\hat{\Omega} = \frac{1}{\eta} \Omega H^T H$$
  with $H \in \mathbb{R}^{s \times N}$, $\Omega_{ij} = K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$.

60
Dimensionality reduction and denoising: nonlinear case (2)

• Estimated value $\hat{x}$ for $x^\star$ by the kernel smoother:
  $$\hat{x} = \frac{\sum_{j=1}^{S} \tilde{K}(x_j, x^\star) \, x_j}{\sum_{j=1}^{S} \tilde{K}(x_j, x^\star)}$$
  with $\tilde{K}(x_j, x^\star)$ (e.g. RBF kernel) the scaled similarity between 0 and 1, and a design
  parameter $S \leq N$ (the $S$ closest points based on the similarity $\tilde{K}(x_j, x^\star)$).

[Schreurs & Suykens, ESANN 2018]
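A minimal numpy sketch (own illustration) of this kernel-smoother pre-image step, given the scaled similarities of the training points to the generated point:

```python
import numpy as np

def kernel_smoother_preimage(X, sims, S=10):
    """Pre-image x_hat as a similarity-weighted average of the S most similar points.

    X    : (N, d) training points
    sims : (N,) scaled similarities K~(x_j, x*) in [0, 1]
    """
    idx = np.argsort(sims)[::-1][:S]          # the S closest points by similarity
    w = sims[idx]
    return (w[:, None] * X[idx]).sum(axis=0) / w.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
sims = np.exp(-np.sum((X - X[0]) ** 2, axis=1))   # similarities to a target point
print(kernel_smoother_preimage(X, sims, S=10))
```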

61
Explainable AI: latent space exploration (1)

(figure: exploring the whole continuum of the hidden units; a two-dimensional
latent space (H1, H2) with generated digits shown while traversing the values of
h(1,1), h(1,2) and h(1,3))

[figures by Joachim Schreurs]

62
Explainable AI: latent space exploration (2)

(figure: two-dimensional latent space of the Yale Face database with regions A, B, C, D marked)

Yale Face database - generated faces from different regions A,B,C,D

[Winant, Schreurs, Suykens, BNAIC 2019]

63
Tensor-based RKM for Multi-view KPCA
$$\min \; \langle \mathcal{W}, \mathcal{W} \rangle - \sum_{i=1}^{N} \langle \Phi(i), \mathcal{W} \rangle h_i + \lambda \sum_{i=1}^{N} h_i^2 \quad \text{with} \quad \Phi(i) = \varphi^{[1]}(x_i^{[1]}) \otimes \ldots \otimes \varphi^{[V]}(x_i^{[V]})$$

(diagram: tensor network representation of the multi-view model)

[Houthuys & Suykens, ICANN 2018]

64
Generative RKM (Gen-RKM) (1)

(figures: schematic of the Gen-RKM training stage and the generation stage)

[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]

65
Gen-RKM (2)

The objective
$$J_{\mathrm{train}}(h_i, U, V) = \sum_{i=1}^{N} \Big( -\phi_1(x_i)^T U h_i - \phi_2(y_i)^T V h_i + \frac{\lambda}{2} h_i^T h_i \Big) + \frac{\eta_1}{2} \mathrm{Tr}(U^T U) + \frac{\eta_2}{2} \mathrm{Tr}(V^T V)$$

results for training in the eigenvalue problem
$$\Big( \frac{1}{\eta_1} K_1 + \frac{1}{\eta_2} K_2 \Big) H^T = H^T \Lambda$$

with $H = [h_1 \cdots h_N]$ and kernel matrices $K_1, K_2$ related to $\phi_1, \phi_2$.

[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]
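A hedged numpy sketch (own illustration; kernel centering and component selection are simplified) of this two-view training step, where the common latent features follow from the eigendecomposition of the sum of the per-view kernel matrices:

```python
import numpy as np

def rbf(A, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(A**2, 1)[None, :] - 2 * A @ A.T
    return np.exp(-d2 / sigma**2)

def genrkm_train_two_views(X, Y, eta1=1.0, eta2=1.0, s=2, sigma=1.0):
    """Latent features H from (K1/eta1 + K2/eta2) H' = H' Lambda."""
    K = rbf(X, sigma) / eta1 + rbf(Y, sigma) / eta2
    lam, vecs = np.linalg.eigh(K)
    idx = np.argsort(lam)[::-1][:s]
    return vecs[:, idx].T, lam[idx]        # H of shape (s, N), top eigenvalues

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 8))              # view 1 (e.g. images)
Y = rng.normal(size=(150, 3))              # view 2 (e.g. attributes/labels)
H, lam = genrkm_train_two_views(X, Y, s=2)
print(H.shape, lam)
```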

66
Gen-RKM (3)

Generating data is based on a newly generated $h^\star$ and the objective
$$J_{\mathrm{gen}}(\phi_1(x^\star), \phi_2(y^\star)) = -\phi_1(x^\star)^T U h^\star - \phi_2(y^\star)^T V h^\star + \frac{1}{2} \phi_1(x^\star)^T \phi_1(x^\star) + \frac{1}{2} \phi_2(y^\star)^T \phi_2(y^\star)$$

giving
$$\phi_1(x^\star) = \frac{1}{\eta_1} \sum_{i=1}^{N} \phi_1(x_i) h_i^T h^\star, \qquad \phi_2(y^\star) = \frac{1}{\eta_2} \sum_{i=1}^{N} \phi_2(y_i) h_i^T h^\star.$$

For generating $\hat{x}, \hat{y}$ one can either work with the kernel smoother or work
with an explicit feature map using a (deep) neural network or CNN.

[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]

67
Gen-RKM (4)
(diagram: latent space $H$ connected through the interconnection matrices $U, U^\top$ and $V, V^\top$
to the feature spaces $\mathcal{F}_x$ and $\mathcal{F}_y$, which are linked by the feature maps $\phi_1, \phi_2$ and the
pre-image maps $\psi_1, \psi_2$ to the data sources $X$ and $Y$)

Gen-RKM schematic representation modeling a common subspace $H$ between two data
sources $X$ and $Y$. Here $\phi_1, \phi_2$ are the feature maps ($\mathcal{F}_x$ and $\mathcal{F}_y$ represent the feature
spaces) corresponding to the two data sources, while $\psi_1, \psi_2$ represent the pre-image
maps. The interconnection matrices $U, V$ model dependencies between the latent variables
and the mapped data sources.

[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]

68
Gen-RKM: implicit feature map

Obtain
$$k_{x^\star} = \frac{1}{\eta_1} K_1 H^\top h^\star, \qquad k_{y^\star} = \frac{1}{\eta_2} K_2 H^\top h^\star,$$
with $k_{x^\star} = [k(x_1, x^\star), \ldots, k(x_N, x^\star)]$.

Using the kernel smoother:
$$\hat{x} = \psi_1(\phi_1(x^\star)) = \frac{\sum_{j=1}^{n_r} \tilde{k}_1(x_j, x^\star) \, x_j}{\sum_{j=1}^{n_r} \tilde{k}_1(x_j, x^\star)}, \qquad \hat{y} = \psi_2(\phi_2(y^\star)) = \frac{\sum_{j=1}^{n_r} \tilde{k}_2(y_j, y^\star) \, y_j}{\sum_{j=1}^{n_r} \tilde{k}_2(y_j, y^\star)},$$

with $\tilde{k}_1(x_i, x^\star)$ and $\tilde{k}_2(y_i, y^\star)$ the scaled similarities between 0 and 1; $n_r$ is the number
of closest points based on the similarity defined by the kernels $\tilde{k}_1$ and $\tilde{k}_2$.

69
Gen-RKM: explicit feature map
Parametrized feature maps: $\phi_\theta(\cdot)$, $\psi_\zeta(\cdot)$ (e.g. CNN and transposed CNN).
Overall objective function, using a stabilization mechanism [Suykens, 2017]:

$$\min_{\theta_1, \theta_2, \zeta_1, \zeta_2} \; J_c = J_{\mathrm{train}} + \frac{c_{\mathrm{stab}}}{2} J_{\mathrm{train}}^2 + \frac{c_{\mathrm{acc}}}{2N} \sum_{i=1}^{N} \Big[ \mathcal{L}_1\big(x_i^\star, \psi_{1\zeta_1}(\phi_{1\theta_1}(x_i^\star))\big) + \mathcal{L}_2\big(y_i^\star, \psi_{2\zeta_2}(\phi_{2\theta_2}(y_i^\star))\big) \Big]$$

with reconstruction errors
$$\mathcal{L}_1\big(x_i^\star, \psi_{1\zeta_1}(\phi_{1\theta_1}(x_i^\star))\big) = \frac{1}{N} \|x_i^\star - \psi_{1\zeta_1}(\phi_{1\theta_1}(x_i^\star))\|_2^2, \qquad \mathcal{L}_2\big(y_i^\star, \psi_{2\zeta_2}(\phi_{2\theta_2}(y_i^\star))\big) = \frac{1}{N} \|y_i^\star - \psi_{2\zeta_2}(\phi_{2\theta_2}(y_i^\star))\|_2^2$$

and with $\Phi_x = [\phi_1(x_1), \ldots, \phi_1(x_N)]$, $\Phi_y = [\phi_2(y_1), \ldots, \phi_2(y_N)]$, and $U, V$ obtained from
$$\begin{bmatrix} \frac{1}{\eta_1} \Phi_x^\top \Phi_x & \frac{1}{\eta_1} \Phi_x^\top \Phi_y \\ \frac{1}{\eta_2} \Phi_y^\top \Phi_x & \frac{1}{\eta_2} \Phi_y^\top \Phi_y \end{bmatrix} \begin{bmatrix} U \\ V \end{bmatrix} = \begin{bmatrix} U \\ V \end{bmatrix} \Lambda.$$

Hence, joint feature learning and subspace learning.

70
Gen-RKM: examples (1)

MNIST Fashion-MNIST

Generated samples from the model using CNN as explicit feature map in the kernel function.
The yellow boxes show training examples and the adjacent boxes show the reconstructed
samples. The other images (columns 3-6) are generated by random sampling from the
fitted distribution over the learned latent variables.

[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]

71
Gen-RKM: examples (2)

CIFAR-10 CelebA

Generated samples from the model using CNN as explicit feature map in the kernel function.
The yellow boxes show training examples and the adjacent boxes show the reconstructed
samples. The other images (columns 3-6) are generated by random sampling from the
fitted distribution over the learned latent variables.

[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]

72
Gen-RKM: multi-view generation (1)

CelebA

Multi-view generation on CelebA dataset showing images and attributes

[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]

73
Gen-RKM: multi-view generation (2)

MNIST: Implicit feature maps with Gaussian kernel + generation by kernel-smoother

MNIST: Explicit feature maps using Convolutional Neural Networks

CIFAR-10: Explicit feature maps using CNNs + Transposed CNNs

Multi-view Generation (images and labels) using implicit and explicit feature maps

[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]

74
Gen-RKM: latent space exploration (1)

Exploring the learned uncorrelated features by traversing along the eigenvectors


Explainability: changing one single neuron’s hidden feature changes the hair color while
preserving face structure! [Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]

75
Gen-RKM: latent space exploration (2)

MNIST reconstructed images by bilinear-interpolation in latent space


[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]

76
Gen-RKM: latent space exploration (3)

CelebA reconstructed images by bilinear-interpolation in latent space


[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]

77
Future challenges

• efficient algorithms and implementations for large data

• extension to other loss functions and regularization schemes

• multimodal data, tensor models, coupling schemes

• models for deep clustering and semi-supervised learning

• choice of kernel functions, invariances and symmetry properties

• deep generative models

• optimal transport

• synergies between neural networks, deep learning and kernel machines

78
Conclusions

• function estimation: parametric versus kernel-based

• primal and dual model representations

• neural network interpretations in primal and dual

• RKM: new connections between RBM, kernel PCA and LS-SVM

• deep kernel machines

• generative models

79
Acknowledgements (1)

• Current and former co-workers at ESAT-STADIUS:


C. Alzate, Y. Chen, J. De Brabanter, K. De Brabanter, B. De Cooman,
L. De Lathauwer, H. De Meulemeester, B. De Moor, H. De Plaen, Ph.
Dreesen, M. Espinoza, T. Falck, M. Fanuel, Y. Feng, B. Gauthier, X.
Huang, L. Houthuys, V. Jumutc, Z. Karevan, R. Langone, F. Liu, R.
Mall, S. Mehrkanoon, G. Nisol, M. Orchel, A. Pandey, P. Patrinos, K.
Pelckmans, S. RoyChowdhury, S. Salzo, J. Schreurs, M. Signoretto, Q.
Tao, F. Tonin, J. Vandewalle, T. Van Gestel, S. Van Huffel, C. Varon,
Y. Yang, and others

• Many other people for joint work, discussions, invitations, organizations

• Support from ERC AdG E-DUALITY, ERC AdG A-DATADRIVE-B, KU


Leuven, OPTEC, IUAP DYSCO, FWO projects, IWT, iMinds, BIL, COST

80
Acknowledgements (2)

81
Acknowledgements (3)

NEW: ERC Advanced Grant E-DUALITY


Exploring duality for future data-driven modelling

82
Thank you

83
