Lecture 2
Kevin Duh
Graduate School of Information Science
Nara Institute of Science and Technology
Today's Topics
1. Discussions: Why it works, when it works, and the bigger picture
Understanding in AI requires high-level abstractions, modeled by highly non-linear functions.
These abstractions must disentangle the factors of variation in the data (e.g. 3D pose, lighting).
A deep architecture is one way to achieve this: each intermediate layer is a successively higher-level abstraction.
(*Example from [Bengio, 2009])
[Figure: a deep network with inputs x1-x3, a first hidden layer h1-h3, and a second hidden layer h1'-h3'.]
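A minimal sketch of the forward pass through such a stack (the sigmoid activation, the layer sizes, and the random weights are illustrative assumptions, not something the slides specify): each layer's output becomes the next layer's input, giving successively higher-level representations.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, weights):
    """Propagate x through the stack; each layer re-represents its input."""
    activations = [x]
    for W in weights:
        x = sigmoid(W @ x)          # h = sigma(W x); biases omitted for brevity
        activations.append(x)
    return activations

rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(3, 3)) for _ in range(3)]   # a 3-layer toy stack
acts = forward(np.array([0.2, 0.7, 0.1]), weights)
print([a.round(3) for a in acts])   # input, then one representation per layer
```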
Backpropagation: define the error signal at node $j$ as $\delta_j = \partial\,\mathrm{Loss} / \partial\, in_j$, where $in_j = \sum_i w_{ij} x_i$ is the input to node $j$. Then

$$\frac{\partial\,\mathrm{Loss}}{\partial w_{ij}} = \frac{\partial\,\mathrm{Loss}}{\partial\, in_j}\cdot\frac{\partial\, in_j}{\partial w_{ij}} = \delta_j\, x_i,
\qquad
\delta_j = \Big[\sum_{j+1} w_{j(j+1)}\,\delta_{j+1}\Big]\,\sigma'(in_j)$$

[Figure: the same network, with weights $w_{ij}$ on the connections from inputs $x_i$ to hidden units $h_j$, and weights $w_{j(j+1)}$ on the connections to the next layer.]
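A small numpy sketch of the delta rule above (the function names and the sigmoid nonlinearity are assumptions for illustration): deltas from the layer above are pushed back through its weights, then scaled by the local derivative, and the weight gradients are the products of deltas and inputs.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def hidden_deltas(in_j, W_next, delta_next):
    """delta_j = [sum over the layer above of w_j(j+1) * delta_(j+1)] * sigma'(in_j)."""
    sigma_prime = sigmoid(in_j) * (1.0 - sigmoid(in_j))
    return (W_next.T @ delta_next) * sigma_prime

def weight_gradients(x, delta_j):
    """dLoss/dw_ij = delta_j * x_i, laid out as a (hidden, input) matrix."""
    return np.outer(delta_j, x)

# Toy usage: 3 inputs, 3 hidden units, 2 units in the layer above.
x = np.array([0.2, 0.7, 0.1])
in_j = np.array([0.3, -0.1, 0.5])        # pre-activations of the hidden layer
W_next = np.full((2, 3), 0.1)            # weights from hidden layer to the layer above
delta_next = np.array([0.4, -0.2])       # deltas already computed one layer up
print(weight_gradients(x, hidden_deltas(in_j, W_next, delta_next)))
```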
[Figure: a worked example with weights w1, wA, wB, w2, wC feeding into the error E.]
Layer-wise pre-training:
[Figure: Step 1: train Layer 1 on the inputs x1-x3.]
[Figure: Step 2: train Layer 2 on Layer 1's outputs, keeping Layer 1 fixed.]
[Figure: Step 3: adjust the weights and predict f(x) (supervised fine-tuning).]
[Figure: the whole stack, trained layer by layer. Extra advantage: the layer-wise stage can exploit large amounts of unlabeled data!]
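A sketch of the greedy procedure shown in these figures. The `train_unsupervised_layer` stub is a placeholder for whatever single-layer learner is used (an RBM or an auto-encoder, as in the following slides); the layer sizes and the dummy random-weight "training" are purely illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_unsupervised_layer(data, n_hidden, rng):
    """Stand-in for one unsupervised layer (e.g. an RBM or auto-encoder).
    It only returns random weights so that the skeleton runs end to end."""
    return rng.normal(scale=0.1, size=(n_hidden, data.shape[1]))

def greedy_pretrain(data, layer_sizes, seed=0):
    """Train one layer at a time; layers below stay fixed while the next is trained."""
    rng = np.random.default_rng(seed)
    weights = []
    for n_hidden in layer_sizes:
        W = train_unsupervised_layer(data, n_hidden, rng)   # fit this layer
        data = sigmoid(data @ W.T)                          # re-represent the data for the next layer
        weights.append(W)
    return weights

# The layer-wise stage needs no labels, so plenty of unlabeled data can be used here.
X_unlabeled = np.random.default_rng(1).random((100, 3))
pretrained = greedy_pretrain(X_unlabeled, layer_sizes=[3, 3])
```

After this stage, the stacked weights would be fine-tuned together with the supervised objective, the "adjust weights, predict f(x)" step in the figures.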
[Figure: a Restricted Boltzmann Machine: hidden units h1-h3 connected to visible units x1-x3, with joint distribution p(x, h) = (1/Z) exp(-E(x, h)).]
The conditional over the hidden units factorizes:

$$
\begin{aligned}
p(h \mid x) &= \frac{p(x,h)}{\sum_h p(x,h)}
 = \frac{\tfrac{1}{Z}\exp(-E(x,h))}{\sum_h \tfrac{1}{Z}\exp(-E(x,h))}\\
&= \frac{\exp(x^T W h + b^T x + d^T h)}{\sum_h \exp(x^T W h + b^T x + d^T h)}\\
&= \frac{\prod_j \exp(x^T W_j h_j + d_j h_j)\,\exp(b^T x)}{\sum_{h_1\in\{0,1\}}\sum_{h_2\in\{0,1\}}\cdots\,\prod_j \exp(x^T W_j h_j + d_j h_j)\,\exp(b^T x)}\\
&= \frac{\prod_j \exp(x^T W_j h_j + d_j h_j)}{\prod_j \sum_{h_j\in\{0,1\}} \exp(x^T W_j h_j + d_j h_j)}\\
&= \prod_j \frac{\exp(x^T W_j h_j + d_j h_j)}{\sum_{h_j\in\{0,1\}} \exp(x^T W_j h_j + d_j h_j)}
 = \prod_j p(h_j \mid x)
\end{aligned}
$$
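Because p(h | x) factorizes, each hidden unit is an independent Bernoulli with p(h_j = 1 | x) = sigma(x^T W_j + d_j). A minimal numpy sketch (the weight shapes and values are illustrative assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def p_h_given_x(x, W, d):
    """p(h_j = 1 | x) = sigma(x^T W_j + d_j), one independent logistic per hidden unit."""
    return sigmoid(x @ W + d)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 3))   # visible-to-hidden weights (columns are W_j)
d = np.zeros(3)                          # hidden biases
x = np.array([1.0, 0.0, 1.0])
print(p_h_given_x(x, W, d))              # a Bernoulli probability for each hidden unit
```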
Gradient of the log-likelihood for one data vector $x^{(m)}$:

$$
\begin{aligned}
\frac{\partial}{\partial w_{ij}} \log P_w(x = x^{(m)})
&= \frac{\partial}{\partial w_{ij}} \log \sum_h \frac{1}{Z_w} \exp(-E_w(x^{(m)}, h)) \quad (1)\\
&= -\frac{\partial}{\partial w_{ij}} \log Z_w + \frac{\partial}{\partial w_{ij}} \log \sum_h \exp(-E_w(x^{(m)}, h)) \quad (2)\\
&= \frac{1}{Z_w} \sum_{h,x} e^{-E_w(x,h)}\, \frac{\partial E_w(x,h)}{\partial w_{ij}}
 - \frac{1}{\sum_h e^{-E_w(x^{(m)},h)}} \sum_h e^{-E_w(x^{(m)},h)}\, \frac{\partial E_w(x^{(m)},h)}{\partial w_{ij}} \quad (3)\\
&= \sum_{h,x} P_w(x,h)\Big[\frac{\partial}{\partial w_{ij}} E_w(x,h)\Big]
 - \sum_h P_w(h \mid x^{(m)})\Big[\frac{\partial}{\partial w_{ij}} E_w(x^{(m)},h)\Big] \quad (4)\\
&= \mathbb{E}_{P_w(h \mid x^{(m)})}\big[x_i^{(m)} h_j\big] - \mathbb{E}_{P_w(x,h)}\big[x_i h_j\big] \quad (5)
\end{aligned}
$$

(The last step uses $\partial E_w(x,h)/\partial w_{ij} = -x_i h_j$ for the RBM energy.)
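The first expectation in (5) only needs p(h | x^(m)), but the second runs over all (x, h), which is intractable in general. A standard approximation is Contrastive Divergence (CD-1), which replaces the model expectation with a single Gibbs step started from the data vector. A sketch under the usual binary-RBM assumptions (the function and variable names are mine, not from the slides):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_update(x, W, b, d, lr=0.1, rng=np.random.default_rng(0)):
    """One CD-1 step for a binary RBM with visible biases b and hidden biases d."""
    # Positive phase: hidden probabilities given the data vector.
    ph_data = sigmoid(x @ W + d)
    h_sample = (rng.random(ph_data.shape) < ph_data).astype(float)
    # Negative phase: reconstruct the visibles, then recompute hidden probabilities.
    px_recon = sigmoid(h_sample @ W.T + b)
    ph_recon = sigmoid(px_recon @ W + d)
    # Approximate gradient: <x_i h_j>_data - <x_i h_j>_reconstruction.
    grad_W = np.outer(x, ph_data) - np.outer(px_recon, ph_recon)
    return (W + lr * grad_W,
            b + lr * (x - px_recon),
            d + lr * (ph_data - ph_recon))
```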
[Figure: a Deep Belief Net built by stacking RBMs: a Layer 1 RBM over the inputs x1-x3, a Layer 2 RBM over its hidden units h1-h3, and a Layer 3 RBM over the next hidden layer h1'-h3'.]
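A sketch of how the stack in this figure could be trained greedily, reusing the `sigmoid` and `cd1_update` helpers from the CD-1 sketch above (the layer sizes, epoch count, and names are illustrative assumptions): each RBM's hidden probabilities become the training data for the RBM above it.

```python
import numpy as np
# Assumes the sigmoid and cd1_update helpers from the CD-1 sketch above.

def train_dbn(X, hidden_sizes, epochs=5, seed=0):
    """Greedy training of stacked RBMs: fit one layer, re-represent the data, repeat."""
    rng = np.random.default_rng(seed)
    layers = []
    for n_hidden in hidden_sizes:
        n_visible = X.shape[1]
        W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
        b, d = np.zeros(n_visible), np.zeros(n_hidden)
        for _ in range(epochs):
            for x in X:                       # plain stochastic updates, one vector at a time
                W, b, d = cd1_update(x, W, b, d)
        layers.append((W, b, d))
        X = sigmoid(X @ W + d)                # hidden probabilities feed the next RBM
    return layers
```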
[Figure: Auto-encoder. Encoder: h = σ(W x + b) maps the inputs x1-x3 to hidden units h1-h2; Decoder: x' = σ(W' h + d) produces the reconstruction x1'-x3'.]
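A minimal sketch of the encoder/decoder pair and its reconstruction loss (squared error, tied decoder weights, and the toy sizes are illustrative assumptions; the slides do not commit to these choices):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def autoencoder_loss(x, W, b, W_dec, d):
    """Encode, decode, and score the reconstruction with squared error."""
    h = sigmoid(W @ x + b)             # encoder: h = sigma(W x + b)
    x_recon = sigmoid(W_dec @ h + d)   # decoder: x' = sigma(W' h + d)
    return np.sum((x - x_recon) ** 2), h

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(2, 3))   # 3 inputs compressed into 2 hidden units
W_dec = W.T                              # tied weights, a common (optional) choice
loss, h = autoencoder_loss(np.array([1.0, 0.0, 1.0]), W, np.zeros(2), W_dec, np.zeros(3))
print(loss, h)
```

Minimizing this loss over the training set makes h a compressed code for x, which is what lets the encoder serve as one pre-trained layer in a stack.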
[Figure: Stacked auto-encoders: the Layer 1 encoder maps inputs x1-x4 to h1-h3, and the Layer 2 encoder maps those to h1'-h2'.]
Denoising Auto-Encoders
[Figure: the same encoder/decoder pair, but the encoder h = σ(W x̃ + b) reads a corrupted input x̃ = x + noise, and the decoder x' = σ(W' h + d) produces the reconstruction x1'-x3'.]
Denoising Auto-Encoders
Example: randomly shift, rotate, and scale the input image; add Gaussian or salt-and-pepper noise.
A 2 is still a 2 no matter how you add noise, so the auto-encoder will be forced to cancel out the variations that are not important.
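A small sketch of the corruption step (the noise levels and the masking variant are illustrative assumptions): the encoder sees the corrupted x̃, but the loss still compares the reconstruction against the clean x.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, noise_std=0.1, drop_prob=0.2):
    """Add Gaussian noise, then zero out a random subset of inputs
    (a masking noise in the spirit of salt-and-pepper corruption)."""
    noisy = x + rng.normal(scale=noise_std, size=x.shape)
    mask = rng.random(x.shape) >= drop_prob
    return noisy * mask

x = np.array([1.0, 0.0, 1.0])        # clean input
x_tilde = corrupt(x)                 # what the encoder actually sees
# Training would minimize || x - decode(encode(x_tilde)) ||^2:
# reconstruct the clean x from its corrupted version.
```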
Even if the lower layers have random weights, the upper layer may still fit the training data well.
But this might not generalize to new data.
Pre-training with an objective on P(x) learns more generalizable features.
Neural net as kernel for SVM [Li et al., 2005], and SVM training for neural nets [Collobert and Bengio, 2004].
Decision trees are deep (but have no distributed representation). Random forests are both deep and distributed, and they do well in practice too!
Philosophical connections to:
History
References I

Bengio, Y. (2009). Learning Deep Architectures for AI. Foundations and Trends in Machine Learning. NOW Publishers.

Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2006). Greedy layer-wise training of deep networks. In NIPS'06, pages 153–160.

Bishop, C. (2006). Pattern Recognition and Machine Learning. Springer.

Bourlard, H. and Kamp, Y. (1988). Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59:291–294.
References II
References III

Erhan, D., Bengio, Y., Courville, A., Manzagol, P., Vincent, P., and Bengio, S. (2010). Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11:625–660.

Erhan, D., Manzagol, P., Bengio, Y., Bengio, S., and Vincent, P. (2009). The difficulty of training deep architectures and the effect of unsupervised pre-training. In AISTATS.

Hinton, G., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554.
References IV

Le, Q. V., Ranzato, M., Monga, R., Devin, M., Chen, K., Corrado, G. S., Dean, J., and Ng, A. Y. (2012). Building high-level features using large scale unsupervised learning. In ICML.

Li, X., Bilmes, J., and Malkin, J. (2005). Maximum margin learning and adaptation of MLP classifiers. In Interspeech.

McCulloch, W. S. and Pitts, W. H. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115–137.

Minsky, M. and Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press.
References V

Poon, H. and Domingos, P. (2011). Sum-product networks. In UAI.

Ranzato, M., Boureau, Y.-L., and LeCun, Y. (2006). Sparse feature learning for deep belief networks. In NIPS.

Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323:533–536.
References VI
References VII

Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408.

Vinyals, O., Jia, Y., Deng, L., and Darrell, T. (2012). Learning with recursive perceptual representations. In NIPS.