10 Dimensionality Reduction with Principal Component Analysis
Data in real life is often high-dimensional. For example, if we want to estimate the price of our house in a year's time, we can use data that helps us to do this: the type of house, the size, the number of bedrooms and bathrooms, the value of houses in the neighborhood when they were bought, the distance to the nearest train station and park, the number of crimes committed in the neighborhood, the economic climate, etc. There are many things that influence the house price, and we collect this information in a dataset that we can use to estimate the house price. Another example is a 640×480 pixel color image, which is a data point in a million-dimensional space, where every pixel corresponds to three dimensions, one for each color channel (red, green, blue).
Working directly with high-dimensional data comes with some difficulties: it is hard to analyze, interpretation is difficult, visualization is nearly impossible, and (from a practical point of view) storage can be expensive. However, high-dimensional data also has some nice properties: for example, high-dimensional data is often overcomplete, i.e., many dimensions are redundant and can be explained by a combination of other dimensions. Dimensionality reduction exploits structure and correlation and allows us to work with a more compact representation of the data, ideally without losing information. We can think of dimensionality reduction as a compression technique, similar to JPEG or MP3, which are compression algorithms for images and music.
In this chapter, we will discuss principal component analysis (PCA), an algorithm for linear dimensionality reduction. PCA, proposed by Pearson (1901) and Hotelling (1933), has been around for more than 100 years and is still one of the most commonly used techniques for data compression, data visualization, and the identification of simple patterns, latent factors, and structures of high-dimensional data. In the signal processing community, PCA is also known as the Karhunen-Loève transform. In this chapter, we will explore the concept of linear dimensionality reduction with PCA in more detail, drawing on our understanding of basis and basis change (see Sections 2.6.1 and 2.7.2), projections (see Section 3.6), eigenvalues (see Section 4.2), Gaussian distributions (see Section 6.6), and constrained optimization (see Section 7.2).
Dimensionality reduction generally exploits the property of high-dimensional data (e.g., images) that it often lies on a low-dimensional subspace, and that many dimensions are highly correlated, redundant, or contain little information. Figure 10.1 gives an illustrative example in two dimensions. Although the data in Figure 10.1(a) does not quite lie on a line, the data does not vary much in the $x_2$-direction, so that we can express it as if it were on a line, with nearly no loss; see Figure 10.1(b). The data in Figure 10.1(b) requires only the $x_1$-coordinate to describe and lies in a one-dimensional subspace of $\mathbb{R}^2$.

Figure 10.1 Illustration of dimensionality reduction. (a) The original dataset does not vary much along the $x_2$-direction. (b) The data from (a) can be represented using the $x_1$-coordinate alone with nearly no loss.
Figure 10.3 Examples of handwritten digits from the MNIST dataset.
10.2 Maximum Variance Perspective
i.e., the variance of the low-dimensional code does not depend on the mean of the data. Therefore, we assume without loss of generality that the data has mean 0 for the remainder of this section. With this assumption, the mean of the low-dimensional code is also 0, since $\mathbb{E}_z[z] = \mathbb{E}_x[B^\top x] = B^\top \mathbb{E}_x[x] = 0$. ♦
We maximize the variance of the low-dimensional code using a sequential approach. We start by seeking a single vector $b_1 \in \mathbb{R}^D$ that maximizes the variance of the projected data, i.e., we aim to maximize the first coordinate $z_1$ of $z \in \mathbb{R}^M$ so that
\[
V_1 := \mathbb{V}[z_1] = \frac{1}{N}\sum_{n=1}^{N} z_{1n}^2 \tag{10.5}
\]
is maximized. This leads to the constrained optimization problem
\[
\max_{b_1}\ b_1^\top S b_1 \tag{10.8}
\]
\[
\text{subject to } \|b_1\|^2 = 1\,. \tag{10.9}
\]
Introducing a Lagrange multiplier $\lambda_1$ for the constraint and setting the partial derivatives of the Lagrangian with respect to $b_1$ and $\lambda_1$ to zero yields the conditions
\[
b_1^\top b_1 = 1\,, \tag{10.13}
\]
\[
S b_1 = \lambda_1 b_1\,. \tag{10.14}
\]
Hence, $b_1$ is an eigenvector of the data covariance matrix $S$, and the projected variance $b_1^\top S b_1 = \lambda_1$ is maximized when $b_1$ is the eigenvector that belongs to the largest eigenvalue of $S$.
For the $m$th principal component, we subtract the contributions of the first $m-1$ principal components from the data, giving $\hat x_n := x_n - \sum_{i=1}^{m-1} b_i b_i^\top x_n$, and solve the analogous optimization problem. We discover that the optimal solution $b_m$ is the eigenvector of $\hat S$ that belongs to the largest eigenvalue of $\hat S$, where
\[
\hat S = \frac{1}{N}\sum_{n=1}^{N} \hat x_n \hat x_n^\top = \frac{1}{N}\sum_{n=1}^{N}\Big(x_n - \sum_{i=1}^{m-1} b_i b_i^\top x_n\Big)\Big(x_n - \sum_{i=1}^{m-1} b_i b_i^\top x_n\Big)^{\!\top}. \tag{10.17}
\]
However, it also turns out that $b_m$ is an eigenvector of $S$: multiplying out (10.17) and exploiting that $b_1, \ldots, b_{m-1}$ form an ONB of eigenvectors of $S$ shows that $\hat S$ and $S$ act identically on vectors orthogonal to $b_1, \ldots, b_{m-1}$, so that $S b_m = \hat S b_m = \lambda_m b_m$. The variance of the data projected onto $b_m$ is then $b_m^\top S b_m = \lambda_m$ since $b_m^\top b_m = 1$. This means that the variance of the data, when projected onto an $M$-dimensional subspace, equals the sum of the eigenvalues that belong to the corresponding eigenvectors of the data covariance matrix.

In practice, we do not have to compute principal components sequentially; we can compute all of them at the same time. If we are looking for a projection onto an $M$-dimensional subspace that retains as much variance as possible, then PCA tells us to choose the columns of $B$ to be the eigenvectors that belong to the $M$ largest eigenvalues of the data covariance matrix. The maximum amount of variance PCA can capture with the first $M$ principal components is
\[
V = \sum_{m=1}^{M} \lambda_m\,, \tag{10.22}
\]
where the $\lambda_m$ are the $M$ largest eigenvalues of the data covariance matrix $S$. Consequently, the variance lost by data compression via PCA is
\[
J = \sum_{j=M+1}^{D} \lambda_j\,. \tag{10.23}
\]
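The maximum-variance solution is straightforward to compute numerically. The following is a minimal NumPy sketch (function and variable names are ours, not from the book): it centers the data, eigendecomposes the covariance matrix, and reports the captured variance (10.22) and the lost variance (10.23).

```python
import numpy as np

def pca(X, M):
    """PCA via eigendecomposition of the data covariance matrix.

    X: (D, N) data matrix, one data point per column.
    M: dimension of the principal subspace.
    """
    D, N = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)   # center the data
    S = Xc @ Xc.T / N                        # data covariance matrix S
    lam, U = np.linalg.eigh(S)               # eigh returns ascending eigenvalues
    lam, U = lam[::-1], U[:, ::-1]           # sort in descending order
    B = U[:, :M]                             # eigenvectors of the M largest eigenvalues
    V = lam[:M].sum()                        # captured variance, (10.22)
    J = lam[M:].sum()                        # lost variance, (10.23)
    return B, V, J
```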
10.3 Projection Perspective
Figure 10.4 Illustration of the projection approach to PCA. We aim to find a one-dimensional subspace (line) of $\mathbb{R}^2$ so that the distance vector between projected (orange) and original (blue) data is as small as possible.
To find the subspace for which the orthogonal projection maximizes the variance of the data, we need to compute the $M$ eigenvectors that belong to the $M$ largest eigenvalues of the data covariance matrix. In Section 10.4, we will return to this point and discuss how to efficiently compute these $M$ eigenvectors.
Figure 10.5 Simplified projection setting. (a) A vector $x \in \mathbb{R}^2$ (red cross) shall be projected onto a one-dimensional subspace $U \subseteq \mathbb{R}^2$ spanned by $b$. (b) The difference vectors $x - \tilde x$ for 50 candidates $\tilde x$ are shown by the red lines.
Let us assume $\dim(U) = M$ with $M < D = \dim(\mathbb{R}^D)$. Then we can find basis vectors $b_1, \ldots, b_D$ of $\mathbb{R}^D$ so that at least $D - M$ of the coefficients $z_d$ are equal to 0, and we can rearrange the way we index the basis vectors $b_d$ such that the coefficients that are zero appear at the end. (For example, vectors $\tilde x \in U$ could be vectors on a plane in $\mathbb{R}^3$: the dimensionality of the plane is 2, but the vectors still have three coordinates in $\mathbb{R}^3$.) This allows us to express $\tilde x$ as
\[
\tilde x = \sum_{m=1}^{M} z_m b_m + \sum_{j=M+1}^{D} 0\, b_j = \sum_{m=1}^{M} z_m b_m = B z \in \mathbb{R}^D\,, \tag{10.26}
\]
where we defined $B := [b_1, \ldots, b_M] \in \mathbb{R}^{D \times M}$ and $z := [z_1, \ldots, z_M]^\top \in \mathbb{R}^M$.
Coordinates

Let us start by finding the coordinates $z_{1n}, \ldots, z_{Mn}$ of the projections $\tilde x_n$ for $n = 1, \ldots, N$. We assume that $(b_1, \ldots, b_D)$ is an ordered ONB of $\mathbb{R}^D$. From (10.32) we require the partial derivative
\[
\frac{\partial \tilde x_n}{\partial z_{in}} \overset{(10.29)}{=} \frac{\partial}{\partial z_{in}}\Big(\sum_{m=1}^{M} z_{mn} b_m\Big) = b_i \tag{10.35}
\]
for $i = 1, \ldots, M$, such that we obtain
\[
\frac{\partial J}{\partial z_{in}} \overset{(10.34),(10.35)}{=} -\frac{2}{N}(x_n - \tilde x_n)^\top b_i \overset{(10.29)}{=} -\frac{2}{N}\Big(x_n - \sum_{m=1}^{M} z_{mn} b_m\Big)^{\!\top} b_i \tag{10.36}
\]
\[
\overset{\text{ONB}}{=} -\frac{2}{N}\big(x_n^\top b_i - z_{in}\underbrace{b_i^\top b_i}_{=1}\big) = -\frac{2}{N}\big(x_n^\top b_i - z_{in}\big)\,. \tag{10.37}
\]
In other words, $b_j b_j^\top x$ is the orthogonal projection of $x$ onto the subspace spanned by the $j$th basis vector, and $z_j = b_j^\top x$ is the coordinate of this projection with respect to the basis vector $b_j$ that spans that subspace, since $z_j b_j = \tilde x$. Figure 10.6 illustrates this setting.

More generally, if we aim to project onto an $M$-dimensional subspace of $\mathbb{R}^D$, we obtain the orthogonal projection of $x$ onto the $M$-dimensional subspace with orthonormal basis vectors $b_1, \ldots, b_M$ as
\[
\tilde x = B\underbrace{(B^\top B)^{-1}}_{=I}B^\top x = B B^\top x\,, \tag{10.40}
\]
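As a quick numerical illustration of (10.40), here is a minimal NumPy sketch (the basis and vector are arbitrary example values of ours):

```python
import numpy as np

# Orthonormal basis of a two-dimensional subspace of R^3 (example values).
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
x = np.array([1.0, 2.0, 3.0])

z = B.T @ x        # coordinates of the projection, z = B^T x
x_tilde = B @ z    # orthogonal projection, (10.40): x~ = B B^T x
print(x_tilde)     # [1. 2. 0.] -- the component orthogonal to U is removed
```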
Figure 10.6 Optimal projection of a vector $x \in \mathbb{R}^2$ onto a one-dimensional subspace (continuation from Figure 10.5). (a) Distances $\|x - \tilde x\|$ for some $\tilde x \in U$. (b) The vector $\tilde x$ that minimizes the distance in panel (a) is its orthogonal projection onto $U$. The coordinate of the projection $\tilde x$ with respect to the basis vector $b$ that spans $U$ is the factor we need to scale $b$ by in order to "reach" $\tilde x$.
This will make it easier to find the basis vectors. To reformulate the loss function, we exploit our results from before and obtain
\[
\tilde x_n = \sum_{m=1}^{M} z_{mn} b_m \overset{(10.38)}{=} \sum_{m=1}^{M} (x_n^\top b_m)\, b_m\,. \tag{10.42}
\]
Since we can generally write the original data point $x_n$ as a linear combination of all basis vectors, we can also write
\[
x_n = \sum_{d=1}^{D} z_{dn} b_d \overset{(10.38)}{=} \sum_{d=1}^{D} (x_n^\top b_d)\, b_d = \Big(\sum_{d=1}^{D} b_d b_d^\top\Big) x_n \tag{10.44a}
\]
\[
= \Big(\sum_{m=1}^{M} b_m b_m^\top\Big) x_n + \Big(\sum_{j=M+1}^{D} b_j b_j^\top\Big) x_n\,, \tag{10.44b}
\]
where we split the sum with $D$ terms into a sum over $M$ and a sum over $D - M$ terms. With this result, we find that the displacement vector $x_n - \tilde x_n$, i.e., the difference vector between the original data point and its projection, is
\[
x_n - \tilde x_n = \Big(\sum_{j=M+1}^{D} b_j b_j^\top\Big) x_n \tag{10.45}
\]
\[
= \sum_{j=M+1}^{D} (x_n^\top b_j)\, b_j\,. \tag{10.46}
\]
This means the difference is exactly the projection of the data point onto the orthogonal complement of the principal subspace: we identify the matrix $\sum_{j=M+1}^{D} b_j b_j^\top$ in (10.45) as the projection matrix that performs this projection. This also means the displacement vector $x_n - \tilde x_n$ lies in the subspace that is orthogonal to the principal subspace, as illustrated in Figure 10.7.

Remark (Low-Rank Approximation). In (10.45), we saw that the projection matrix that projects $x$ onto $\tilde x$ is given by
\[
\sum_{m=1}^{M} b_m b_m^\top = B B^\top\,. \tag{10.47}
\]
In this sense, PCA finds the best rank-$M$ approximation of the identity matrix.
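To make the remark concrete, here is a small NumPy check of the projection matrix (10.47) (the example basis is ours): it has rank $M$ and is symmetric and idempotent, as a projection matrix must be.

```python
import numpy as np

# Random orthonormal basis of a 2D subspace of R^4 (QR gives orthonormal columns).
B = np.linalg.qr(np.random.default_rng(3).standard_normal((4, 2)))[0]
P = B @ B.T                                         # projection matrix, (10.47)
print(np.linalg.matrix_rank(P))                     # 2: rank M
print(np.allclose(P @ P, P), np.allclose(P, P.T))   # idempotent and symmetric
```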
Figure 10.7 Orthogonal projection and displacement vectors. When projecting data points $x_n$ (blue) onto the subspace $U$, we obtain $\tilde x_n$ (orange). The displacement vector $\tilde x_n - x_n$ lies completely in the orthogonal complement $U^\perp$ of $U$.
We now explicitly compute the squared norm and exploit the fact that the $b_j$ form an ONB, which yields
\[
J = \frac{1}{N}\sum_{n=1}^{N}\sum_{j=M+1}^{D} (b_j^\top x_n)^2 = \frac{1}{N}\sum_{n=1}^{N}\sum_{j=M+1}^{D} b_j^\top x_n\, b_j^\top x_n \tag{10.50a}
\]
\[
= \frac{1}{N}\sum_{n=1}^{N}\sum_{j=M+1}^{D} b_j^\top x_n x_n^\top b_j\,, \tag{10.50b}
\]
where we exploited the symmetry of the dot product in the last step to write $b_j^\top x_n = x_n^\top b_j$. We can now swap the sums and obtain
\[
J = \sum_{j=M+1}^{D} b_j^\top \underbrace{\Big(\frac{1}{N}\sum_{n=1}^{N} x_n x_n^\top\Big)}_{=:\,S} b_j = \sum_{j=M+1}^{D} b_j^\top S b_j \tag{10.51a}
\]
\[
= \sum_{j=M+1}^{D} \operatorname{tr}(b_j^\top S b_j) = \sum_{j=M+1}^{D} \operatorname{tr}(S\, b_j b_j^\top) = \operatorname{tr}\Big(\underbrace{\Big(\sum_{j=M+1}^{D} b_j b_j^\top\Big)}_{\text{projection matrix}} S\Big)\,, \tag{10.51b}
\]
where we exploited the property that the trace operator $\operatorname{tr}(\cdot)$, see (4.16), is linear and invariant to cyclic permutations of its arguments. Since we assumed that our dataset is centered, i.e., $\mathbb{E}[X] = 0$, we identify $S$ as the data covariance matrix. We see that the projection matrix in (10.51b) is constructed as a sum of rank-one matrices $b_j b_j^\top$, so that it itself is of rank $D - M$.

Equation (10.51a) implies that we can formulate the average squared reconstruction error equivalently in terms of the data covariance matrix, projected onto the orthogonal complement of the principal subspace. Minimizing the average squared reconstruction error is therefore equivalent to minimizing the variance of the data when projected onto the subspace we ignore, i.e., the orthogonal complement of the principal subspace. Equivalently, we maximize the variance of the projection that we retain in the principal subspace, which links the projection loss immediately to the maximum-variance formulation of PCA discussed in Section 10.2. But this then also means that we will obtain the same solution that we obtained for the maximum-variance perspective. Therefore, we skip the slightly lengthy derivation here and summarize the results from earlier in the light of the projection perspective.

The average squared reconstruction error, when projecting onto the $M$-dimensional principal subspace, is
\[
J = \sum_{j=M+1}^{D} \lambda_j\,, \tag{10.52}
\]
where $\lambda_j$ are the eigenvalues of the data covariance matrix. Therefore, to minimize (10.52) we need to select the smallest $D - M$ eigenvalues, which implies that their corresponding eigenvectors are the basis of the orthogonal complement of the principal subspace. Consequently, the basis of the principal subspace consists of the eigenvectors $b_1, \ldots, b_M$ that belong to the $M$ largest eigenvalues of the data covariance matrix.
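A numerical sanity check of (10.52) on toy data (a NumPy sketch with names of our choosing): the average squared reconstruction error matches the sum of the $D - M$ smallest eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, M = 5, 1000, 2
X = rng.standard_normal((D, N)) * np.arange(1, D + 1)[:, None]  # toy data, (D, N)
X = X - X.mean(axis=1, keepdims=True)               # center

S = X @ X.T / N                                     # data covariance matrix
lam, U = np.linalg.eigh(S)                          # ascending eigenvalues
B = U[:, ::-1][:, :M]                               # top-M eigenvectors

X_tilde = B @ (B.T @ X)                             # projections, (10.40)
J = np.mean(np.sum((X - X_tilde) ** 2, axis=0))     # average squared reconstruction error
print(np.isclose(J, lam[:D - M].sum()))             # True: sum of the D-M smallest eigenvalues
```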
Figure 10.8 Embedding of MNIST digits '0' (blue) and '1' (orange) in a two-dimensional principal subspace using PCA. Four example embeddings of the digits '0' and '1' in the principal subspace are highlighted in red together with their corresponding original digits.
Figure 10.8 visualizes the training data of the MNIST digits '0' and '1' embedded in the vector subspace spanned by the first two principal components. We can see a relatively clear separation between '0's (blue dots) and '1's (orange dots), and we can see the variation within each individual cluster.
10.4 Eigenvector Computation
We can compute the eigenvectors of $S$ via the singular value decomposition (SVD) of the data matrix $X = [x_1, \ldots, x_N] \in \mathbb{R}^{D \times N}$, whose columns are the (centered) data points:
\[
X = U \Sigma V^\top\,, \tag{10.55}
\]
where $U \in \mathbb{R}^{D \times D}$ and $V \in \mathbb{R}^{N \times N}$ are orthogonal matrices and $\Sigma \in \mathbb{R}^{D \times N}$ is a matrix whose only nonzero entries are the singular values $\sigma_{ii} \geq 0$. Then it follows that
\[
S = \frac{1}{N} X X^\top = \frac{1}{N} U \Sigma V^\top V \Sigma^\top U^\top = \frac{1}{N} U \Sigma \Sigma^\top U^\top\,. \tag{10.56}
\]
With the results from Section 4.5, we get that the columns of $U$ are the eigenvectors of $X X^\top$ (and therefore of $S$). Furthermore, the eigenvalues of $S$ are related to the singular values of $X$ via
\[
\lambda_i = \frac{\sigma_i^2}{N}\,. \tag{10.57}
\]
This means the vector $x_k$ is multiplied by $S$ in every iteration and then normalized, i.e., we always have $\|x_k\| = 1$. This sequence of vectors converges to the eigenvector associated with the largest eigenvalue of $S$. The original Google PageRank algorithm (Page et al., 1999) uses such an algorithm for ranking web pages based on their hyperlinks.
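A minimal power-iteration sketch in NumPy (our own names; it assumes, as the method requires, a unique largest eigenvalue and a start vector that is not orthogonal to the corresponding eigenvector):

```python
import numpy as np

def power_iteration(S, num_iters=100, seed=0):
    """Approximate the eigenvector of S belonging to its largest eigenvalue."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(S.shape[0])
    x /= np.linalg.norm(x)               # keep ||x_k|| = 1
    for _ in range(num_iters):
        x = S @ x                        # x_{k+1} = S x_k / ||S x_k||
        x /= np.linalg.norm(x)
    return x
```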
10.5 PCA Algorithm
4. Projection. We obtain the projection $\tilde x_* = B B^\top x_*$ of a (standardized) data point $x_*$ onto the principal subspace, with coordinates $z_* = B^\top x_*$ with respect to the basis of the principal subspace. Here, $B$ is the matrix that contains the eigenvectors that belong to the largest eigenvalues of the data covariance matrix as columns.

5. Moving back to data space. To see our projection in the original data format (i.e., before standardization), we need to undo the standardization (10.59): we multiply by the standard deviation before adding the mean, so that we obtain
\[
\tilde x_*^{(d)} \leftarrow \tilde x_*^{(d)}\, \sigma^{(d)} + \mu^{(d)}\,, \quad d = 1, \ldots, D\,, \tag{10.61}
\]
where $\mu^{(d)}$ and $\sigma^{(d)}$ are the mean and standard deviation of the training data in the $d$th dimension, respectively. Figure 10.9(f) illustrates the projection in the original data format.
Figure 10.9 Steps of PCA. (a) Original dataset. (b) Step 1: centering by subtracting the mean from each data point. (c) Step 2: dividing by the standard deviation to make the data unit-free; the data has variance 1 along each axis. (d) Step 3: compute eigenvalues and eigenvectors (arrows) of the data covariance matrix (ellipse). (e) Step 4: project the data onto the subspace spanned by the eigenvectors belonging to the largest eigenvalues (principal subspace). (f) Step 5: undo the standardization and move the projected data back into the original data space from (a).
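The five steps can be put together in a few lines of NumPy. This is a sketch of our own, not code from the book; it assumes data matrices with one data point per column and strictly positive per-dimension standard deviations.

```python
import numpy as np

def pca_pipeline(X_train, X_star, M):
    """Steps 1-5: standardize, project onto the principal subspace,
    and map the projections back to the original data space.

    X_train: (D, N) training data; X_star: (D, K) points to compress.
    """
    mu = X_train.mean(axis=1, keepdims=True)     # per-dimension mean
    sd = X_train.std(axis=1, keepdims=True)      # per-dimension standard deviation
    Z = (X_train - mu) / sd                      # Steps 1-2: standardize
    S = Z @ Z.T / Z.shape[1]                     # data covariance matrix
    lam, U = np.linalg.eigh(S)                   # Step 3: eigendecomposition (ascending)
    B = U[:, ::-1][:, :M]                        # eigenvectors of the M largest eigenvalues
    X_std = (X_star - mu) / sd                   # standardize the new points
    X_tilde = B @ (B.T @ X_std)                  # Step 4: project onto principal subspace
    return X_tilde * sd + mu                     # Step 5: undo standardization, (10.61)
```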
In the following, we will apply PCA to the MNIST digits dataset (http://yann.lecun.com/exdb/mnist/), which contains 60,000 examples of handwritten digits 0–9. Each digit is an image of size 28×28, i.e., it contains 784 pixels, so that we can interpret every image in this dataset as a vector $x \in \mathbb{R}^{784}$. Examples of these digits are shown in Figure 10.3. For illustration purposes, we apply PCA to a subset of the MNIST digits, and we focus on the digit '8'. We used 5,389 training images of the digit '8' and determined the principal subspace as detailed in this chapter. We then used the learned projection matrix to reconstruct a set of test images, which is illustrated in Figure 10.10. The first row of Figure 10.10 shows a set of four original digits from the test set. The following rows show reconstructions of exactly these digits when using a principal subspace of dimension 1, 10, 100, and 500, respectively. We can see that even with a single-dimensional principal subspace we get a halfway decent reconstruction, which becomes better as the number of principal components grows; with 500 principal components, we can essentially fully reconstruct the training data that contains the digit '8'.

Figure 10.10 Reconstructions of test images of the digit '8' using principal subspaces with 1, 10, 100, and 500 principal components (PCs).

Figure 10.11 Average reconstruction error as a function of the number of principal components.
In this case, the matrix $\frac{1}{N} X^\top X \in \mathbb{R}^{N \times N}$ has full rank and is invertible. This also implies that $\frac{1}{N} X^\top X$ has the same (nonzero) eigenvalues as the data covariance matrix $S$. But this is now an $N \times N$ matrix, so that we can compute the eigenvalues and eigenvectors much more efficiently than for the original $D \times D$ data covariance matrix.

Now that we have the eigenvectors of $\frac{1}{N} X^\top X$, we are going to recover the original eigenvectors, which we still need for PCA. If we left-multiply the eigenvalue/eigenvector equation with $X$, we get
\[
\underbrace{\frac{1}{N} X X^\top}_{S}\, X c_m = \lambda_m X c_m\,, \tag{10.67}
\]
and we recover the data covariance matrix again. This also means that we recover $X c_m$ as an eigenvector of $S$.

Remark. If we want to apply the PCA algorithm that we discussed in Section 10.5, we need to normalize the eigenvectors $X c_m$ of $S$ so that they have norm 1. ♦
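A sketch of this trick in NumPy (our own names; it assumes a centered data matrix with $N < D$):

```python
import numpy as np

def pca_high_dim(X, M):
    """Eigenvectors of S = X X^T / N via the smaller N x N matrix X^T X / N.

    X: (D, N) centered data matrix with N < D.
    """
    D, N = X.shape
    K = X.T @ X / N                       # N x N, same nonzero eigenvalues as S
    lam, C = np.linalg.eigh(K)
    lam, C = lam[::-1], C[:, ::-1]        # sort in descending order
    B = X @ C[:, :M]                      # X c_m is an eigenvector of S, (10.67)
    B /= np.linalg.norm(B, axis=0)        # normalize to norm 1 (see the Remark)
    return B, lam[:M]
```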
10.7 Latent Variable Perspective
Figure 10.12 Graphical model for probabilistic PCA. The observations $x_n$ explicitly depend on corresponding latent variables $z_n \sim \mathcal{N}\big(0, I\big)$; the model parameters $B, \mu$ and the likelihood parameter $\sigma$ are shared across the dataset ($n = 1, \ldots, N$).

The PCA solution obtained by maximizing the variance in the projected space or by minimizing the reconstruction error is recovered as a special case of maximum likelihood estimation in a noise-free setting.
finding something out about $z$ given some observations. To get there, we will apply Bayesian inference to "invert" the arrow implicitly and go from observations to latent variables. ♦
Figure 10.13 Generating new MNIST digits. The latent variables $z$ can be used to generate new data $\tilde x = Bz$. The closer we stay to the training data, the more realistic the generated data.

Figure 10.13 shows the latent coordinates of the MNIST digits '8' found by PCA when using a two-dimensional principal subspace (blue dots). We can query any vector $z_*$ in this latent space and generate an image $\tilde x_* = B z_*$ that resembles the digit '8'. We show eight such generated images with their corresponding latent space representations. Depending on where we query the latent space, the generated images look different (shape, rotation, size, ...). If we query away from the training data, we see more and more artefacts, e.g., the top-left and top-right digits. Note that the intrinsic dimensionality of these generated images is only two.
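The generation step itself is a single matrix-vector product. A minimal sketch (assuming $B$ was learned from centered digit images as in Section 10.5; names are ours):

```python
import numpy as np

def generate_image(B, z_star):
    """Decode a latent query z_star (shape (M,)) into a 28x28 image.

    B: (784, M) principal subspace basis learned from centered '8' images.
    """
    return (B @ z_star).reshape(28, 28)   # x~ = B z, assuming centered data
```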
\[
p(x) = \int p(x \mid z)\, p(z)\, dz = \int \mathcal{N}\big(x \mid Bz + \mu,\, \sigma^2 I\big)\, \mathcal{N}\big(z \mid 0, I\big)\, dz\,. \tag{10.73}
\]
From Section 6.6, we know that the solution to this integral is a Gaussian distribution with mean
\[
\mathbb{E}[x] = \mathbb{E}_z[Bz + \mu] + \mathbb{E}_\epsilon[\epsilon] = \mu \tag{10.74}
\]
and with covariance matrix
\[
\mathbb{V}[x] = \mathbb{V}_z[Bz + \mu] + \mathbb{V}_\epsilon[\epsilon] = \mathbb{V}_z[Bz] + \sigma^2 I = B\, \mathbb{V}_z[z]\, B^\top + \sigma^2 I = B B^\top + \sigma^2 I\,. \tag{10.75}
\]
The marginal distribution in (10.73) is the PPCA likelihood, which we can use for maximum likelihood or MAP estimation of the model parameters.

Remark. Although the conditional distribution in (10.69) is also a likelihood, we cannot use it for maximum likelihood estimation, as it still depends on the latent variables. The likelihood function we require for maximum likelihood (or MAP) estimation should only be a function of the data $x$ and the model parameters, not of the latent variables. ♦
From Section 6.6 we also know that a Gaussian random variable $z$ and a linear/affine transformation $x = Bz + \mu + \epsilon$ of it are jointly Gaussian distributed. We already know the marginals $p(z) = \mathcal{N}\big(z \mid 0, I\big)$ and $p(x) = \mathcal{N}\big(x \mid \mu,\, BB^\top + \sigma^2 I\big)$. The missing cross-covariance is given as
\[
\operatorname{Cov}[x, z] = \operatorname{Cov}_z[Bz + \mu,\, z] = B \operatorname{Cov}_z[z, z] = B\,. \tag{10.76}
\]
Therefore, the probabilistic model of PPCA, i.e., the joint distribution of latent and observed random variables, is explicitly given by
\[
p(x, z \mid B, \mu, \sigma^2) = \mathcal{N}\left( \begin{bmatrix} x \\ z \end{bmatrix} \,\middle|\, \begin{bmatrix} \mu \\ 0 \end{bmatrix},\, \begin{bmatrix} BB^\top + \sigma^2 I & B \\ B^\top & I \end{bmatrix} \right) \tag{10.77}
\]
with a mean vector of length $D + M$ and a covariance matrix of size $(D + M) \times (D + M)$.
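We can verify (10.74) and (10.75) empirically by simulating the generative model; a NumPy sketch with arbitrary example parameters of our choosing:

```python
import numpy as np

rng = np.random.default_rng(2)
D, M, N = 3, 2, 200_000
B = rng.standard_normal((D, M))                 # example parameters
mu = np.array([1.0, -2.0, 0.5])
sigma = 0.1

Z = rng.standard_normal((N, M))                 # z_n ~ N(0, I)
eps = sigma * rng.standard_normal((N, D))       # eps_n ~ N(0, sigma^2 I)
X = Z @ B.T + mu + eps                          # x_n = B z_n + mu + eps_n

print(np.abs(X.mean(axis=0) - mu).max())        # ~0: matches (10.74)
print(np.abs(np.cov(X.T) - (B @ B.T + sigma**2 * np.eye(D))).max())  # ~0: (10.75)
```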
If we repeat this process many times, we can explore the posterior distribution (10.78) on the latent variables $z_*$ and its implications for the observed data. The sampling process effectively hypothesizes data that is plausible under the posterior distribution.
10.8 Further Reading
The maximum likelihood estimator for the noise variance is the average variance in the orthogonal complement of the principal subspace,
\[
\sigma^2_{\text{ML}} = \frac{1}{D - M} \sum_{j=M+1}^{D} \lambda_j\,. \tag{10.82}
\]
The discussion above can be used to render PCA as a special case of a deep auto-encoder. In the deep auto-encoder, both the encoder and the decoder are represented by multi-layer feedforward neural networks, which themselves are nonlinear mappings. If we set the activation functions in these neural networks to be the identity, the model becomes equivalent to PCA. A different approach to nonlinear dimensionality reduction is the Gaussian Process Latent Variable Model (GP-LVM) proposed by Lawrence (2005). The GP-LVM starts off with the latent-variable perspective that we used to derive PPCA and replaces the linear relationship between the latent variables $z$ and the observations $x$ with a Gaussian process (GP). Instead of estimating the parameters of the mapping (as we do in PPCA), the GP-LVM marginalizes out the model parameters and makes point estimates of the latent variables $z$. Similar to Bayesian PCA, the Bayesian GP-LVM proposed by Titsias and Lawrence (2010) maintains a distribution on the latent variables $z$ and uses approximate inference to integrate them out as well.