Chapter 1
1.1
We often represent the data of interest as a set of vectors in some Euclidean space R^n. This has two major
advantages: 1) we can exploit the algebraic structure of R^n, and 2) we can exploit its Euclidean geometry.
This enables tools and concepts such as differential calculus, convexity, convex optimization, probability
distributions, and so on.
In some cases there is a natural embedding of the data into R^n. In other cases we must carefully
construct a useful embedding of the data into R^n (possibly ignoring some information).
Example 1.1.1. In a medical context the readily measured variables might be: age, gender, weight, blood
pressure, resting heart rate, respiratory rate, body temperature, blood analysis, and so on. Group an
appropriate subset of these measurable variables into a vector x ∈ R^n. The variable of interest y ∈ R
might be: the existence of an infection, the level of an infection, the degree of reaction to a drug, the
probability that the patient is about to have a heart attack, and so on. From measurements of the first set of
medical variables x we would like to predict the value of y.
Example 1.1.2. Medical Example 2. Not written up yet.
Example 1.1.3. Document Example 1. Not written up yet.
Example 1.1.4. fMRI Example 1. Not written up yet.
1.2
Below we give a quick summary of the key algebraic properties of R^n. Although we do this in the context
of R^n, the concepts and constructions described generalize to any finite dimensional vector space.
c Peter J. Ramadge, 2015. Please do not distribute without permission.
1.2.1
Linear combinations
A linear combination of vectors x_1, …, x_k ∈ R^n with scalar coefficients α_1, …, α_k is the vector ∑_{j=1}^k α_j x_j.
1.2.2
Subspaces
Lemma 1.2.1. Given scalars α_i, i = 1, …, n, the set U = {x : ∑_{i=1}^n α_i x(i) = 0} is a subspace of R^n. More generally, given
scalars α_i^(j), i = 1, …, n, j = 1, …, k, the set U = {x : ∑_{i=1}^n α_i^(j) x(i) = 0, j = 1, …, k} is a subspace of R^n.
1.2.3
U ∩ V = {x : x ∈ U and x ∈ V}
U + V = {x : x = u + v, some u ∈ U, v ∈ V}.
U ∩ V is the set of vectors in both U and V, and U + V is the set of all vectors formed by adding
a vector in U and a vector in V.
Lemma 1.2.2. For subspaces U, V of R^n, U ∩ V and U + V are subspaces of R^n. U ∩ V is the largest
subspace contained in both U and V, and U + V is the smallest subspace containing both U and V.
Proof. Exercise.
1.2.4
Linear independence
A finite set of vectors {x_1, …, x_k} ⊆ R^n is linearly independent if for each set of scalars α_1, …, α_k,
∑_{i=1}^k α_i x_i = 0 implies α_i = 0, i = 1, …, k.
A set of vectors which is not linearly independent is said to be linearly dependent. Notice that a linearly
independent set cannot contain the zero vector.
If {x_1, …, x_k} is linearly independent and ∑_{i=1}^k α_i x_i = ∑_{i=1}^k β_i x_i, then
α_i = β_i, i = 1, …, k.
1.2.5
Let U be a subspace of R^n. A finite set of vectors {x_1, …, x_k} is said to span U, or to be a spanning set
for U, if U = span{x_1, …, x_k}. In this case, every x ∈ U can be written as a linear combination of the
vectors x_j, j = 1, …, k. A spanning set may be redundant in the sense that one or more elements of the
set may be a linear combination of a subset of the elements.
A basis for a subspace U ⊆ R^n is a linearly independent finite set of vectors that spans U. The
spanning property ensures that every vector in U can be represented as a linear combination of the basis
vectors, and linear independence ensures that this representation is unique. A vector space that has a basis
is said to be finite dimensional.
We show that R^n is finite dimensional by exhibiting a basis. The standard basis for R^n is the set of
vectors e_j, j = 1, …, n, with
e_j(i) = 1, if i = j; e_j(i) = 0, otherwise.
It is clear that if ∑_{i=1}^n α_i e_i = 0, then α_i = 0, i = 1, …, n. Hence the set is linearly independent. It is
also clear that any vector in R^n can be written as a linear combination of the e_i's. Hence R^n is a finite
dimensional vector space.
We show below that every subspace U ⊆ R^n has a basis, and every basis contains the same number
of elements. We define the dimension of U to be the number of elements in any basis for U. The standard
basis for R^n implies that every basis for R^n has n elements and hence R^n has dimension n.
Lemma 1.2.4. Every nonzero subspace U ⊆ R^n has a basis, and every such basis contains the same
number of vectors.
Proof. R^n is finite dimensional and has a basis with n elements. List the basis as L_0 = {x_1, …, x_n}.
Since U ≠ {0}, U contains a nonzero vector b_1. It must hold that b_1 ∈ span(L_0) since L_0 spans R^n.
If U = span{b_1}, then we are done. Otherwise, add b_1 to the start of L_0 to form the new ordered list
L_1 = {b_1, x_1, …, x_n}. Then L_1 is a linearly dependent set that spans R^n. Proceeding from the left, there
must be a first vector in L_1 that is linearly dependent on the subset of vectors that precedes it. Suppose
Figure 1.1: The coordinates of a point x with respect to a given basis {x_1, …, x_n}.
this is x_p. Removing x_p from L_1 yields a list L_1′ that still spans R^n. Since U ≠ span{b_1}, there must
be a nonzero vector b_2 ∈ U not contained in span{b_1}. If span{b_1, b_2} = U, we are done. Otherwise,
add b_2 after b_1 in L_1′ to obtain a new ordered list L_2 = {b_1, b_2, x_1, …, x_{p−1}, x_{p+1}, …, x_n}. Then L_2 is
linearly dependent and spans R^n. Proceeding from the left, there must be a first vector in L_2 that is linearly
dependent on the subset of vectors that precedes it. This can't be one of the b_j's since these are linearly
independent. Hence we can again remove one of the remaining x_j's to obtain a reduced list L_2′ that spans
R^n. Since span{b_1, b_2} ≠ U, there must be a vector b_3 ∈ U such that b_3 ∉ span{b_1, b_2}. Adding this
after b_2 in L_2′ gives a new linearly dependent list L_3 that spans R^n. In this way, we either terminate with a
basis for U or continue to remove x_j's from the ordered set and add b_j's until all the x_j's are removed. In
that event, L_n = {b_1, …, b_n} is a linearly independent spanning set for R^n and U = R^n. In either case,
U has a basis.
Let {x_1, …, x_k} and {y_1, …, y_m} be bases for U. First form the ordered list L_0 = {x_1, …, x_k}. We
will progressively add one of the y_j's to the start of the list and, if possible, remove one of the x_j's. To begin,
set L_1 = {y_m, x_1, …, x_k}. By assumption, U = span{L_0}. Hence y_m ∈ span{L_0}. It follows that L_1 is
linearly dependent and spans U. Proceeding from the left there must be a first vector in L_1 that is linearly
dependent on the subset of vectors that precedes it. Suppose this is x_p. Removing x_p from L_1 leaves a
list L_1′ that still spans U. Hence y_{m−1} ∈ span(L_1′). So adding y_{m−1} to L_1′ in the first position gives a
new linearly dependent list L_2 = {y_{m−1}, y_m, x_1, …, x_{p−1}, x_{p+1}, …, x_k} that spans U. Proceeding from
the left there must be a first vector in L_2 that is linearly dependent on the subset of vectors that precedes
it. This can't be one of the y_j's since these are linearly independent. Hence we can again remove one of
the remaining x_j's to obtain a reduced list spanning U. Then adding y_{m−2} in the first place gives a new
linearly dependent list L_3 spanning U. In this way, we continue to remove x_j's from the ordered set and
add y_j's. If we remove all the x_j's before adding all of the y_j's we obtain a contradiction, since that would
imply that the y_j's are linearly dependent. So we must be able to add all of the y_j's. Hence m ≤ k. A
symmetric argument with the roles of the two bases interchanged shows that k ≤ m. Hence m = k.
1.2.6
Coordinates
The coordinates of x ∈ R^n with respect to a basis {x_j}_{j=1}^n are the unique scalars α_j such that
x = ∑_{j=1}^n α_j x_j. Every vector uniquely determines, and is uniquely determined by, its coordinates. For the
standard basis, the coordinates of x ∈ R^n are simply the entries of x: x = ∑_{j=1}^n x(j) e_j.
You can think of the coordinates as a way to locate (or construct) x starting from the origin. You simply
go via x_1 scaled by α_1, then via x_2 scaled by α_2, and so on. This is not a unique path, since the scaled
basis elements can be added in any order, but all such paths reach the same point x. This is illustrated
in Fig. 1.1. In this sense, coordinates are like map coordinates, with the basic map navigation primitives
specified by the basis elements. This also makes it clear that if we choose a different basis, then we must
use a different set of coordinates to reach the same point x.
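The coordinate computation above can be sketched numerically. This is an illustrative example, not part of the notes: the basis matrix B and vector x below are my own choices, and the coordinates are found by solving the linear system B α = x.

```python
import numpy as np

# Columns of B form a (non-orthogonal) example basis of R^3.
B = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
x = np.array([2.0, 3.0, 1.0])

# The unique coordinates alpha of x with respect to this basis.
alpha = np.linalg.solve(B, x)

# Reconstructing x from its coordinates recovers the same point,
# regardless of the order in which the scaled basis vectors are added.
assert np.allclose(B @ alpha, x)
```

Changing the basis B changes the coordinate vector α, but the reconstructed point B α is always x.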
1.3
Problems
1.4
Notes
We have given a brief outline of the algebraic structure of Rn . For a more detailed introduction see the
relevant sections in: Strang, Linear Algebra and its Applications, Chapter 2.
Chapter 2
The Geometry of Rn
R^n also has a geometric structure defined by the Euclidean inner product and norm. These add the important concepts of length, distance, angle, and orthogonality. It will be convenient to discuss these concepts
using the vector space (C^n, C). The concepts and definitions readily specialize to (R^n, R). In addition,
the vector spaces of complex and real matrices of given fixed dimensions can also be given a Euclidean
geometry.
2.1
The Euclidean inner product of x, y ∈ C^n is defined by
<x, y> = ∑_{k=1}^n x(k) ȳ(k),
where ȳ denotes the element-wise complex conjugate of y. This can also be written in terms of a matrix
product as <x, y> = x^T ȳ. The inner product satisfies the following basic properties.
Lemma 2.1.1 (Properties of the Inner Product). For x, y, z ∈ C^n and α ∈ C,
1) <x, x> ≥ 0 with equality if and only if x = 0
2) <x, y> = conj(<y, x>)
3) <αx, y> = α<x, y>
4) <x + y, z> = <x, z> + <y, z>
Proof. These claims follow from the definition of the inner product via simple algebra.
The inner product also specifies the corresponding Euclidean norm on C^n via the formula:
‖x‖ = <x, x>^{1/2} = ( ∑_{k=1}^n |x(k)|² )^{1/2}.
Based on the definition and properties of the inner product and the definition of the norm, we can then
derive the famous Cauchy-Schwarz inequality.
Lemma 2.1.2 (Cauchy-Schwarz Inequality). For all x, y ∈ C^n, |<x, y>| ≤ ‖x‖ ‖y‖.
Figure 2.1: The triangle inequality: ‖x + y‖ ≤ ‖x‖ + ‖y‖.
Proof. Exercise.
By Cauchy-Schwarz, for nonzero x, y we have
0 ≤ |<x, y>| / (‖x‖ ‖y‖) ≤ 1.
Hence we can define the angle θ between x and y via
cos θ = |<x, y>| / (‖x‖ ‖y‖).
For vectors in R^n we can dispense with the modulus function and write
<x, y> = ‖x‖ ‖y‖ cos(θ),
where cos θ = <x, y> / (‖x‖ ‖y‖). So in R^n, the inner product of unit length vectors is the cosine of the
angle between them.
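The angle formula above can be checked numerically. A minimal sketch, with example vectors of my own choosing (the text does not supply data):

```python
import numpy as np

# Two example vectors in R^2; the angle between them is pi/4.
x = np.array([1.0, 0.0])
y = np.array([1.0, 1.0])

# cos(theta) = <x, y> / (||x|| ||y||)
cos_theta = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
theta = np.arccos(cos_theta)

# Cauchy-Schwarz guarantees cos_theta lies in [-1, 1].
assert -1.0 <= cos_theta <= 1.0
assert np.isclose(theta, np.pi / 4)
```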
Finally, we have the properties of the norm function.
Lemma 2.1.3 (Properties of the Norm). For x, y ∈ C^n and α ∈ C:
1) ‖x‖ ≥ 0 with equality if and only if x = 0 (positivity).
2) ‖αx‖ = |α| ‖x‖ (scaling).
3) ‖x + y‖ ≤ ‖x‖ + ‖y‖ (triangle inequality).
Proof. Items 1) and 2) easily follow from the definition of the norm. Item 3) can be proved using the
Cauchy-Schwarz inequality and is left as an exercise.
The norm ‖x‖ measures the length or size of the vector x. Equivalently, ‖x‖ is the distance
between 0 and x, and ‖x − y‖ is the distance between x and y. The triangle inequality is illustrated in
Fig. 2.1. If ‖x‖ = 1, x is called a unit vector or a unit direction. The set {x : ‖x‖ = 1} of all unit vectors
is called the unit sphere.
2.2
Orthogonality
Vectors x, y ∈ C^n are orthogonal if <x, y> = 0, and a set of vectors {x_1, …, x_k} is orthogonal if its
elements are pairwise orthogonal: <x_i, x_j> = 0 for i ≠ j. For an orthogonal set, Pythagoras' theorem
holds: ‖∑_{j=1}^k x_j‖² = ∑_{j=1}^k ‖x_j‖².
Proof. Using only the definition of the norm and properties of the inner product we have:
‖∑_{j=1}^k x_j‖² = < ∑_{i=1}^k x_i, ∑_{j=1}^k x_j > = ∑_{i=1}^k ∑_{j=1}^k <x_i, x_j> = ∑_{j=1}^k <x_j, x_j> = ∑_{j=1}^k ‖x_j‖²,
where the third equality uses orthogonality: <x_i, x_j> = 0 for i ≠ j.
A set of vectors {x1 , . . . , xk } in Cn is orthonormal if it is orthogonal and every vector in the set has
unit norm (kxj k = 1, j = 1, . . . , k).
Lemma 2.2.1. An orthonormal set is linearly independent.
Proof. Let {x_1, …, x_k} be an orthonormal set and suppose that ∑_{j=1}^k α_j x_j = 0. Then for each i we
have 0 = < ∑_{j=1}^k α_j x_j, x_i > = α_i.
An orthonormal basis for C^n is a basis of orthonormal vectors. Since an orthonormal set is always
linearly independent, any set of n orthonormal vectors is an orthonormal basis for C^n. Orthonormal bases
have a particularly convenient property: it is easy to find the coordinates of any vector x with respect to such
a basis. To see this, let {x_1, …, x_n} be an orthonormal basis and x = ∑_j α_j x_j. Then
<x, x_k> = < ∑_j α_j x_j, x_k > = ∑_j α_j <x_j, x_k> = α_k.
So the coordinate of x with respect to the basis element x_k is simply α_k = <x, x_k>, k = 1, …, n.
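This convenience can be sketched in code: with an orthonormal basis, coordinates are plain inner products, with no linear system to solve. The rotated basis of R^2 below is my own example, not from the text.

```python
import numpy as np

# An orthonormal basis of R^2: the standard basis rotated by 0.3 radians.
c, s = np.cos(0.3), np.sin(0.3)
basis = [np.array([c, s]), np.array([-s, c])]

x = np.array([2.0, -1.0])

# Coordinates via inner products: alpha_k = <x, x_k>.
coords = np.array([x @ b for b in basis])

# Reconstruct x as sum_k alpha_k x_k.
x_rec = sum(a * b for a, b in zip(coords, basis))
assert np.allclose(x_rec, x)
```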
2.3
More generally, a real or complex vector space X equipped with a function <·, ·> satisfying the properties
listed in Lemma 2.1.1 is called an inner product space. We give an important example below.
2.3.1
We can define an inner product on the vector space of complex matrices C^{m×n} by:
<A, B> = ∑_{i,j} A_ij conj(B_ij).
This function satisfies the properties listed in Lemma 2.1.1. The corresponding norm is:
‖A‖_F = <A, A>^{1/2} = ( ∑_{i,j} |A_ij|² )^{1/2}.
This is frequently called the Frobenius norm; hence the special notation ‖A‖_F. The following lemma
gives a very useful alternative expression for <A, B>.
Lemma 2.3.1. For all A, B ∈ C^{m×n}, <A, B> = trace(A^T B̄).
Proof. Exercise.
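The trace identity in Lemma 2.3.1 is easy to check numerically. A hedged sketch with random complex matrices of my own construction:

```python
import numpy as np

# Random complex example matrices.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4)) + 1j * rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4)) + 1j * rng.standard_normal((3, 4))

# Element-wise definition: <A, B> = sum_ij A_ij * conj(B_ij).
ip_elementwise = np.sum(A * np.conj(B))
# Trace form of the same inner product.
ip_trace = np.trace(A.T @ np.conj(B))
assert np.isclose(ip_elementwise, ip_trace)

# The induced norm agrees with NumPy's Frobenius norm.
assert np.isclose(np.sqrt(np.sum(A * np.conj(A))).real, np.linalg.norm(A, 'fro'))
```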
2.4
A square matrix Q ∈ R^{n×n} is orthogonal if Q^T Q = Q Q^T = I_n. In this case, the columns of Q form
an orthonormal basis for R^n and Q^T is the inverse of Q. We denote the set of n × n orthogonal
matrices by O_n.
Lemma 2.4.1. If Q ∈ O_n, then for each x, y ∈ R^n, <Qx, Qy> = <x, y> and ‖Qx‖ = ‖x‖.
Proof. <Qx, Qy> = x^T Q^T Q y = x^T y = <x, y>, and ‖Qx‖² = <Qx, Qx> = <x, x> = ‖x‖².
Lemma 2.4.2. The set O_n contains the identity matrix I_n, and is closed under matrix multiplication and
matrix inverse.
Proof. If Q, W are orthogonal, then (QW)^T (QW) = W^T Q^T Q W = I_n and (QW)(QW)^T = Q W W^T Q^T = I_n.
So QW is orthogonal. If Q is orthogonal, Q^{−1} = Q^T is orthogonal. Clearly, I_n is orthogonal.
Hence the set of matrices O_n forms a (noncommutative) group under matrix multiplication. This is
called the n × n orthogonal group.
For complex matrices a slight change is required. A square complex matrix Q ∈ C^{n×n} is called a
unitary matrix if Q^H Q = Q Q^H = I, where Q^H = (Q̄)^T. It is readily verified that if Q is a unitary matrix,
then for each x, y ∈ C^n, <Qx, Qy> = <x, y> and ‖Qx‖ = ‖x‖. So multiplication of a vector by a
unitary (or orthogonal) matrix preserves inner products, angles, norms, distances, and hence Euclidean
geometry.
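These invariance properties can be sketched with a concrete orthogonal matrix. The 2 × 2 rotation below is my own example of an element of O_2:

```python
import numpy as np

# A rotation matrix is orthogonal: Q^T Q = I.
t = 0.7
Q = np.array([[np.cos(t), -np.sin(t)],
              [np.sin(t),  np.cos(t)]])
assert np.allclose(Q.T @ Q, np.eye(2))

x = np.array([3.0, -2.0])
y = np.array([0.5, 4.0])

# Inner products and norms are preserved.
assert np.isclose((Q @ x) @ (Q @ y), x @ y)
assert np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x))
```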
2.5
Problems
2.2. Define the correlation of nonzero vectors x_1, x_2 ∈ R^n by
ρ(x_1, x_2) = <x_1, x_2> / (‖x_1‖ ‖x_2‖).
For given x_1, what vectors x_2 maximize the correlation? What vectors x_2 minimize the correlation? Show that
ρ(x_1, x_2) ∈ [−1, 1] and is zero precisely when the vectors are orthogonal.
2.3. Use the definition of the inner product and its properties listed in Lemma 2.1.1, together with the definition of
the norm, to prove the Cauchy-Schwarz Inequality (Lemma 2.1.2).
a) First let x, y ∈ C^n with ‖x‖ = ‖y‖ = 1.
1) Set x̂ = <x, y> y and r_x = x − x̂. Show that <r_x, y> = <r_x, x̂> = 0.
2) Show that ‖r_x‖² = 1 − <x̂, x>.
3) Using the previous result and the definition of x̂, show that |<x, y>| ≤ 1.
b) Prove the result when ‖x‖ ≠ 0 and ‖y‖ ≠ 0.
c) Prove the result when x or y (or both) is zero.
2.4. Prove the triangle inequality for the Euclidean norm in C^n. Expand ‖x + y‖² using the properties of the inner
product, and note that 2 Re(<x, y>) ≤ 2 |<x, y>|.
2.5. Let X, Y be inner product spaces over the same field F, with F = R or F = C. A linear isometry from X to
Y is a linear function D : X → Y that preserves distances: for all x ∈ X, ‖D(x)‖ = ‖x‖. Show that a linear isometry
between inner product spaces also preserves inner products.
a) First examine ‖D(x + y)‖² and conclude that Re(<Dx, Dy>) = Re(<x, y>).
b) Now examine ‖D(x + iy)‖², where i is the imaginary unit.
2.6. Let P_n denote the set of n × n permutation matrices. Show that P_n is a (noncommutative) group under matrix
multiplication. Show that every permutation matrix is an orthogonal matrix. Hence P_n is a subgroup of O_n.
2.9. Let b_1, …, b_n be an orthonormal basis for C^n (or R^n), and set B_{j,k} = b_j b_k^T, j, k = 1, …, n. Show that
{B_{j,k} : j, k = 1, …, n} is an orthonormal basis for the corresponding space of n × n matrices under the inner
product of Section 2.3.1.
2.6
Notes
For a more detailed introduction to the Euclidean geometry of R^n, see the relevant sections in: Strang,
Linear Algebra and its Applications, Chapter 2.
Chapter 3
Orthogonal Projection
We now consider the following fundamental problem. Given a data vector x in an inner product space X
and a subspace U ⊆ X, find the closest point to x in U. This operation is a simple building block that we
will use repeatedly.
3.1
Simplest Instance
The simplest instance of our problem is: given x, u ∈ R^n with ‖u‖ = 1, find the closest point to x in
span{u}. This can be posed as the simple constrained optimization problem:
min_{z ∈ R^n} ½ ‖x − z‖²  s.t. z ∈ span{u}.   (3.1)
The subspace span{u} is a line through the origin in the direction u, and we seek the point z on this
line that is closest to x. So we must have z = αu for some scalar α. Hence we can equivalently solve the
unconstrained optimization problem:
min_{α ∈ R} ½ ‖x − αu‖².
This is a quadratic in α with a positive coefficient on the second order term. Hence there is a unique value
of α that minimizes the objective. Setting the derivative of the above expression with respect to α equal to zero gives
the unique solution α = <u, x>. Hence the optimal solution is
x̂ = <u, x> u.
The associated error vector r_x = x − x̂ is called the residual. We claim that the residual is orthogonal to
u and hence to the subspace span{u}. To see this note that
<u, r_x> = <u, x − x̂> = <u, x> − <u, x><u, u> = 0.
Thus x̂ is the unique orthogonal projection of x onto the line span{u}. This is illustrated in Figure 3.1.
By Pythagoras, we have ‖x‖² = ‖x̂‖² + ‖r_x‖².
Figure 3.1: Orthogonal projection of x onto a line through zero.
We can also write the solution using matrix notation. Noting that <u, x> = u^T x, we have
x̂ = (u u^T) x = P x,
r_x = (I − u u^T) x = (I − P) x.
So for fixed u, both x̂ and r_x are linear functions of x. As one might expect, these linear functions have
some special properties. For example, since x̂ ∈ span{u}, the projection of x̂ onto span{u} must be x̂ itself.
So we must have P² = P. We can easily check this using the formula P = u u^T: P² = (u u^T)(u u^T) =
u (u^T u) u^T = u u^T = P. In addition, we note that P = u u^T is symmetric. So P is symmetric (P^T = P) and idempotent
(P² = P). A matrix with these two properties is called a projection matrix.
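The formulas above can be sketched numerically; the unit vector u and the data vector x below are my own example values.

```python
import numpy as np

u = np.array([3.0, 4.0]) / 5.0      # a unit vector in R^2
x = np.array([2.0, 1.0])

P = np.outer(u, u)                   # P = u u^T
x_hat = P @ x                        # projection of x onto span{u}
r = x - x_hat                        # residual

assert np.isclose(u @ r, 0.0)                    # residual orthogonal to u
assert np.allclose(P @ P, P)                     # idempotent
assert np.allclose(P.T, P)                       # symmetric
assert np.isclose(x @ x, x_hat @ x_hat + r @ r)  # Pythagoras
```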
3.2
Now let U be a subspace of R^n with an orthonormal basis {u_1, …, u_k}. For a given x ∈ R^n, we seek a
point z in U that minimizes the distance to x:
min_{z ∈ R^n} ½ ‖x − z‖²  s.t. z ∈ U.   (3.2)
Since we can uniquely write z = ∑_{j=1}^k α_j u_j, we can equivalently solve the unconstrained minimization problem:
min_{α_1, …, α_k} ½ ‖x − ∑_{j=1}^k α_j u_j‖².
Using the definition of the norm and the properties of the inner product, we can expand the objective
function to obtain:
½ ‖x − ∑_{j=1}^k α_j u_j‖² = ½ ‖x‖² − ∑_{j=1}^k α_j <u_j, x> + ½ ∑_{j=1}^k α_j².
In the last line we used Pythagoras to write ‖z‖² = ∑_{j=1}^k α_j². Taking the derivative with respect to α_j and
setting this equal to zero yields the unique solution α_j = <u_j, x>, j = 1, …, k. So the unique closest
point in U to x is:
x̂ = ∑_{j=1}^k <u_j, x> u_j.   (3.3)
Figure 3.2: Orthogonal projection of x onto the subspace U.
We can also write these results as matrix equations. First, from (3.3) we have
x̂ = ∑_{j=1}^k u_j u_j^T x = ( ∑_{j=1}^k u_j u_j^T ) x = P x,
with P = ∑_{j=1}^k u_j u_j^T. If we form the matrix U = [u_1, …, u_k] with the basis vectors as columns, then
P = ∑_{j=1}^k u_j u_j^T = U U^T.
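The subspace projection P = U U^T can be sketched as follows. The orthonormal basis here is obtained from a QR factorization of random vectors, which is my own construction rather than something in the text:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 2))
U, _ = np.linalg.qr(A)              # U: 5x2 with orthonormal columns

x = rng.standard_normal(5)
P = U @ U.T                          # projection onto span of U's columns
x_hat = P @ x

# The residual is orthogonal to every basis vector of the subspace,
# and P is a projection matrix (symmetric and idempotent).
assert np.allclose(U.T @ (x - x_hat), 0.0)
assert np.allclose(P @ P, P)
assert np.allclose(P.T, P)
```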
3.3
3.4
Problems
3.1. Given x, y ∈ R^n, find the closest point to x on the line through 0 in the direction of y.
3.2. Let u_1, …, u_k ∈ R^n be an orthonormal set spanning a subspace U and let v ∈ R^n with v ∉ U. Find the point on the
linear manifold M = {x : x − v ∈ U} that is closest to a given point y ∈ R^n. [Hint: transform the problem to one
that you know how to solve.]
3.3. A Householder transformation on R^n is a linear transformation that reflects each point x in R^n about a given
n − 1 dimensional subspace U, specified by giving its unit normal u ∈ R^n. To reflect x about U we want to move it
orthogonally through the subspace to the point on the opposite side that is equidistant from the subspace.
a) Given U = u^⊥ = {x : u^T x = 0}, find the required Householder matrix.
b) Show that a Householder matrix H is symmetric, orthogonal, and is its own inverse.
Chapter 4
Principal Component Analysis
4.1
Preliminaries
We will need a few properties of symmetric matrices. Recall that a matrix S ∈ R^{n×n} is symmetric if
S^T = S. The eigenvalues and eigenvectors of real symmetric matrices have some special properties.
Lemma 4.1.1. A symmetric matrix S ∈ R^{n×n} has n real eigenvalues and n real orthonormal eigenvectors.
A matrix P ∈ R^{n×n} with the property that for all x ∈ R^n, x^T P x ≥ 0 is said to be positive semidefinite.
Similarly, P is positive definite if for all x ≠ 0, x^T P x > 0. Without loss of generality, we will always
assume P is symmetric. If not, P can be replaced by the symmetric matrix Q = ½(P + P^T), since
x^T Q x = ½ x^T (P + P^T) x = x^T P x.
Here is a fundamental property of such matrices.
Lemma 4.1.2. If P ∈ R^{n×n} is symmetric and positive semidefinite (resp. positive definite), then all the
eigenvalues of P are real and nonnegative (resp. positive) and the eigenvectors of P can be selected to be
real and orthonormal.
Proof. Since P is symmetric, all of its eigenvalues are real and it has a set of n real orthonormal eigenvectors. Let
x be an eigenvector with eigenvalue λ. Then x^T P x = λ x^T x = λ ‖x‖² ≥ 0. Hence λ ≥ 0. If P is positive definite and
x ≠ 0, then x^T P x > 0. Hence λ ‖x‖² > 0 and thus λ > 0.
4.2
Centering Data
The sample mean of a set of data {x_j ∈ R^n}_{j=1}^p is the vector μ = (1/p) ∑_{j=1}^p x_j. By subtracting μ from
each x_j, forming y_j = x_j − μ, we translate the data vectors so that the new sample mean is zero:
(1/p) ∑_{j=1}^p y_j = (1/p) ∑_{j=1}^p (x_j − μ) = μ − μ = 0.
This is called centering the data. We can also express the centering operation in matrix form as follows.
Form the data into the matrix X = [x_1, …, x_p] ∈ R^{n×p}. Then μ = (1/p) X 1, where 1 ∈ R^p denotes the
vector of all 1s. Let Y denote the corresponding matrix of centered data and u = (1/√p) 1. Then
Y = X − μ 1^T = X − (1/p) X 1 1^T = X (I − (1/p) 1 1^T) = X (I − u u^T).
From this point forward we assume that the data has been centered.
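The matrix form of the centering step can be sketched directly; the example data below (random, shifted to have a nonzero mean) is my own.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 3, 6
X = rng.standard_normal((n, p)) + 5.0        # data as columns, nonzero mean

ones = np.ones((p, 1))
mu = (X @ ones) / p                           # sample mean, mu = (1/p) X 1
Y = X @ (np.eye(p) - ones @ ones.T / p)       # centered data, Y = X (I - (1/p) 1 1^T)

assert np.allclose(Y.mean(axis=1), 0.0)       # new sample mean is zero
assert np.allclose(Y, X - mu)                 # same as subtracting mu per column
```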
4.3
4.4
We seek a k-dimensional subspace U such that the orthogonal projection of the data onto U minimizes the
sum of squared norms of the residuals. Assuming such a subspace exists, we call it an optimal projection
subspace of dimension k.
c Peter J. Ramadge, 2015. Please do not distribute without permission.
Let the columns of U ∈ R^{n×k} be an orthonormal basis for a subspace U. Then the matrix of projected
data is X̂ = U U^T X and the corresponding matrix of residuals is X − U U^T X. Hence we seek to solve:
min_{U ∈ R^{n×k}} ‖X − U U^T X‖_F²  s.t. U^T U = I_k.   (4.1)
(4.1)
The solution of this problem can't be unique since if U is a solution, so is U Q for every Q ∈ O_k. These
solutions correspond to different parameterizations of the same subspace. In addition, it is of interest to
determine if two distinct subspaces could both be optimal projection subspaces of dimension k.
Using standard equalities, the objective function of Problem 4.1 can be rewritten as
‖X − U U^T X‖_F² = trace(X^T (I − U U^T)(I − U U^T) X) = trace(X X^T) − trace(U^T X X^T U).
Hence, letting P ∈ R^{n×n} denote the symmetric positive semidefinite matrix X X^T, we can equivalently
solve the following problem:
max_{U ∈ R^{n×k}} trace(U^T P U)  s.t. U^T U = I_k.   (4.2)
Problem 4.2 is a well known problem. For the simplest version with k = 1, we have the following standard
result.
Theorem 4.4.1 (Horn and Johnson, 4.2.2). Let P ∈ R^{n×n} be a symmetric positive semidefinite matrix
with eigenvalues λ_1 ≥ λ_2 ≥ … ≥ λ_n. The problem
max_{u ∈ R^n} u^T P u  s.t. u^T u = 1   (4.3)
has the optimal value λ_1. This is achieved if and only if u is a unit norm eigenvector of P for the eigenvalue
λ_1.
Proof. We want to maximize u^T P u subject to u^T u = 1. Bring in a Lagrange multiplier λ and form the
Lagrangian L(u, λ) = u^T P u + λ(1 − u^T u). Taking the derivative of this expression with respect to u
and setting this equal to zero yields P u = λ u. Hence λ must be an eigenvalue of P with u a corresponding
eigenvector normalized so that u^T u = 1. For such u, u^T P u = λ u^T u = λ. Hence the maximum
achievable value of the objective is λ_1, and this is achieved when u is a corresponding unit norm eigenvector of P.
Conversely, if u is any unit norm eigenvector of P for λ_1, then u^T P u = λ_1 and hence u is a solution.
Theorem 4.4.1 can be generalized as follows. However, the proof uses results we have not covered yet.
Theorem 4.4.2 (Horn and Johnson, 4.3.18). Let P ∈ R^{n×n} be a symmetric positive semidefinite matrix
with eigenvalues λ_1 ≥ λ_2 ≥ … ≥ λ_n. The problem
max_{U ∈ R^{n×k}} trace(U^T P U)  s.t. U^T U = I_k   (4.4)
has the optimal value ∑_{j=1}^k λ_j. Moreover, this is achieved if the columns of U are k orthonormal eigenvectors for the largest k eigenvalues λ_1, …, λ_k of P.
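Theorems 4.4.1 and 4.4.2 can be spot-checked numerically. A hedged sketch, with a random positive semidefinite matrix of my own construction:

```python
import numpy as np

rng = np.random.default_rng(3)
Xd = rng.standard_normal((4, 10))
P = Xd @ Xd.T                                # symmetric positive semidefinite

lam, V = np.linalg.eigh(P)                   # eigh returns ascending eigenvalues
lam, V = lam[::-1], V[:, ::-1]               # reorder to descending

k = 2
Uk = V[:, :k]                                # top-k orthonormal eigenvectors
# The top-k eigenvectors attain the optimal value lambda_1 + ... + lambda_k.
assert np.isclose(np.trace(Uk.T @ P @ Uk), lam[:k].sum())

# A random orthonormal U does no better (spot check of optimality).
Ur, _ = np.linalg.qr(rng.standard_normal((4, k)))
assert np.trace(Ur.T @ P @ Ur) <= lam[:k].sum() + 1e-9
```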
It is a standard result that if we are free to select the left and right singular vectors of a matrix B, then the
inner product <A, B> is maximized when the left and right singular vectors of B are chosen to equal the
left and right singular vectors of A, respectively. Hence selecting W = [I_k 0]^T maximizes the inner
product <Λ, W W^T>. This gives U = V_k, where V_k is the matrix of the first k columns of V, and results
in the optimal objective value ∑_{j=1}^k λ_j.
It follows from Theorem 4.4.2 that a solution U⋆ to Problem 4.2 is obtained by selecting the columns
of U⋆ to be a set of orthonormal eigenvectors of P = X X^T corresponding to its k largest eigenvalues.
Working backwards, we see that U⋆ is then also a solution to Problem 4.1. In both cases, there is nothing
special about U⋆ beyond the fact that it spans U⋆. Any basis of the form U⋆ Q with Q ∈ O_k spans the
same optimal subspace U⋆.
We also note that U⋆ may not be unique. To see this, consider the situation when λ_k = λ_{k+1}. When
this holds, the selection of a k-th eigenvector in U⋆ is not unique.
In summary, a solution to Problem 4.1 can be obtained as follows. Find the k largest eigenvalues of
X X^T and a corresponding set of orthonormal eigenvectors U⋆. Then over all k dimensional subspaces,
U⋆ = R(U⋆) minimizes the sum of the squared norms of the projection residuals. By projecting each x_j
to x̂_j = U⋆ (U⋆)^T x_j we obtain a representation of the data as points on U⋆. Moreover, if we now represent
x̂_j by its coordinates y_j = (U⋆)^T x_j with respect to the orthonormal basis U⋆, then we have linearly mapped
the data into k-dimensional space.
4.5
An Alternative Viewpoint
We now consider an alternative way to view the same problem. This will give some additional insights
into the solution we have derived.
4.5.1
Since the data is centered, for any u ∈ R^n the projected data values u^T x_j, j = 1, …, p, have zero sample mean:
∑_{j=1}^p u^T x_j = u^T ∑_{j=1}^p x_j = 0.
So the spread of the data in direction u can be quantified by the scalar sample variance
σ²(u) = (1/p) ∑_{j=1}^p (u^T x_j)² = (1/p) ∑_{j=1}^p (u^T x_j)(u^T x_j)^T = u^T ( (1/p) ∑_{j=1}^p x_j x_j^T ) u.   (4.5)
The matrix
R = (1/p) ∑_{j=1}^p x_j x_j^T   (4.6)
R is called the sample covariance matrix of the (centered) data. The product x_j x_j^T is a real n × n symmetric
matrix formed by the outer product of the j-th data point with itself. The sample covariance is the mean of
these matrices and hence is also a real n × n symmetric matrix. More generally, if the data is not centered
but has sample mean μ, then the sample covariance is
R = (1/p) ∑_{j=1}^p (x_j − μ)(x_j − μ)^T.   (4.7)
4.5.2
R is positive semidefinite: for any x ∈ R^n,
x^T R x = x^T ( (1/p) ∑_{j=1}^p x_j x_j^T ) x = (1/p) ∑_{j=1}^p (x^T x_j)(x_j^T x) = (1/p) ∑_{j=1}^p (x_j^T x)² ≥ 0.
Using R and (4.5) we can concisely express the variance of the data in direction u as
σ²(u) = u^T R u.   (4.8)
Hence the direction u in which the data has maximum sample variance is given by the solution of the
problem:
arg max_{u ∈ R^n} u^T R u  s.t. u^T u = 1   (4.9)
with R a symmetric positive semidefinite matrix. This is Problem 4.3. By Theorem 4.4.1, the data has
maximum variance σ_1² in the direction v_1, where σ_1² ≥ 0 is the largest eigenvalue of R and v_1 is a corresponding unit norm eigenvector of R.
We must take care if we want to find two directions with the largest variance. Without any constraint,
the second direction can come arbitrarily close to v_1 and variance σ_1². One way to prevent this is to
constrain the second direction to be orthogonal to the first. Then if we want a third direction, constrain it
to be orthogonal to the two previous directions, and so on. In this case, for k orthogonal directions we want
to find U = [u_1, …, u_k] with orthonormal columns that maximizes ∑_{j=1}^k u_j^T R u_j = trace(U^T R U). Hence we want to solve
Problem 4.4 with P = R. By Theorem 4.4.2, the solution is attained by taking the k directions to be unit
norm eigenvectors v_1, …, v_k for the largest k eigenvalues of R.
By this means we obtain n orthonormal directions of maximum (sample) variance in
the data. These directions v_1, v_2, …, v_n and the corresponding variances σ_1² ≥ σ_2² ≥ … ≥ σ_n² are
eigenvectors and corresponding eigenvalues of R: R v_j = σ_j² v_j, j = 1, …, n. The vectors v_j are called the
principal components of the data, and this decomposition is called Principal Component Analysis (PCA).
Let V be the matrix with the v_j as its columns, and Σ² = diag(σ_1², …, σ_n²) (note σ_1² ≥ σ_2² ≥ … ≥ σ_n²).
Then PCA is an ordered eigen-decomposition of the sample covariance matrix: R = V Σ² V^T.
There is a clear connection between PCA and finding a subspace that minimizes the sum of squared
norms of the residuals. We can see this by writing
R = (1/p) ∑_{j=1}^p x_j x_j^T = (1/p) X X^T.
So the sample covariance is just a scalar multiple of the matrix X X^T. This means that the principal
components are just the eigenvectors of X X^T listed in order of decreasing eigenvalues. In particular, the
first k principal components are the first k eigenvectors (ordered by eigenvalue) of X X^T. This is exactly
the orthonormal basis U⋆ that defines an optimal k-dimensional projection subspace U⋆. So the leading k principal components give a particular orthonormal basis for an optimal k-dimensional projection
subspace.
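This connection can be sketched numerically: the leading eigenvectors of R = (1/p) X X^T and the leading left singular vectors of X span the same optimal subspace, so the corresponding projection matrices agree. The centered random data below is my own example:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, k = 3, 50, 2
X = rng.standard_normal((n, p))
X = X - X.mean(axis=1, keepdims=True)         # center the data

R = X @ X.T / p                                # sample covariance
_, V = np.linalg.eigh(R)
V = V[:, ::-1]                                 # descending eigenvalue order
Vk = V[:, :k]                                  # first k principal components

# Left singular vectors of X span the same optimal subspace, so the
# projection matrices onto the two subspaces coincide.
U_svd, _, _ = np.linalg.svd(X, full_matrices=False)
P1 = Vk @ Vk.T
P2 = U_svd[:, :k] @ U_svd[:, :k].T
assert np.allclose(P1, P2)
```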
A direction in which the data has small variance relative to σ_1² may not be an important direction: after all, the data stays close to the mean in this direction. If one accepts this hypothesis, then the directions
of largest variance are the important directions: they capture most of the variability in the data. This
suggests that we could select k < rank(R) and project the data onto the k directions of largest variance.
Let V_k = [v_1, v_2, …, v_k]. Then the projection onto the span of the columns of V_k is x̂_j = V_k (V_k^T x_j). The
term y_j = V_k^T x_j gives the coordinates of x̂_j with respect to V_k. Then the product V_k y_j synthesizes x̂_j
using these coefficients to form the appropriate linear combination of the columns of V_k.
Here is a critical observation: since the directions are fixed and known, we don't need to form x̂_j.
Instead we can simply map x_j to the coordinate vector y_j ∈ R^k. No information is lost in working with
y_j instead of x̂_j since the latter is an invertible linear function of the former. Hence {y_j}_{j=1}^p gives a new
set of data that captures most of the variation in the original data, and lies in a reduced dimension space
(k ≤ rank(R) ≤ n).
The natural next question is how to select k. Clearly this involves a tradeoff between the size of k and
the amount of variation in the original data that is captured in the projection. The variance captured by
the projection is σ̂² = ∑_{j=1}^k σ_j² and the variance in the residual is σ̃² = ∑_{j=k+1}^n σ_j². Reducing k reduces
σ̂² and increases σ̃². The selection of k thus involves determining how much of the total variance in X
needs to be captured in order to successfully use the projected data to complete the analysis or decision
task at hand. For example, if the projected data is to be used to learn a classifier, then one needs to select
the value of k that yields acceptable (or perhaps best) classifier performance. This could be done using
cross-validation.
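One simple heuristic for this tradeoff is to pick the smallest k whose captured-variance fraction exceeds a threshold. The sketch below uses my own random data and an arbitrary 90% threshold for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((5, 200))
X = X - X.mean(axis=1, keepdims=True)          # center

# Variances sigma_1^2 >= ... >= sigma_n^2 from the sample covariance.
var = np.linalg.eigvalsh(X @ X.T / X.shape[1])[::-1]
captured = np.cumsum(var) / var.sum()           # captured fraction as k grows

# Smallest k capturing at least 90% of the total variance.
k = int(np.searchsorted(captured, 0.90) + 1)
assert 1 <= k <= len(var)
assert captured[k - 1] >= 0.90
```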
4.6
Problems
4.1. Let X ∈ R^{n×p}. Show that the set of nonzero eigenvalues of XX^T is the same as the set of nonzero eigenvalues of X^T X.
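The claim in Problem 4.1 is easy to check numerically. Here is a minimal numpy sketch (the random 4×7 test matrix is an assumption for illustration) comparing the nonzero eigenvalues of the two products:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 7))           # n = 4, p = 7, rank at most 4

ev1 = np.linalg.eigvalsh(X @ X.T)     # 4 eigenvalues of X X^T
ev2 = np.linalg.eigvalsh(X.T @ X)     # 7 eigenvalues of X^T X (at least 3 ~ 0)

# Keep only the (numerically) nonzero eigenvalues of each product.
nz1 = np.sort(ev1[ev1 > 1e-10])
nz2 = np.sort(ev2[ev2 > 1e-10])
print(np.allclose(nz1, nz2))          # the nonzero eigenvalues agree
```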
Chapter 5
Overview
We now discuss in detail a very useful matrix factorization called the singular value decomposition (SVD).
The SVD extends the idea of the eigen-decomposition of square matrices to non-square matrices. It is useful
in general, but has specific applications in data analysis, dimensionality reduction (PCA), low rank matrix
approximation, and some forms of regression.
5.2 Preliminaries
Recall that for A ∈ R^{m×n}, the range of A is the subspace of R^m defined by R(A) = {y : y = Ax, some x ∈ R^n} ⊆ R^m. So the range of A is the set of all vectors that can be formed as a linear combination of the columns of A. The nullspace of A is the subspace of R^n defined by N(A) = {x : Ax = 0}.
This is the set of all vectors that are mapped to the zero vector in R^m by A.
The following fundamental result from linear algebra will be very useful.
Theorem 5.2.1. Let A ∈ R^{m×n} have nullspace N(A) and range R(A). Then N(A) = R(A^T)^⊥.
Proof. Let x ∈ N(A). Then Ax = 0 and x^T A^T = 0. So for every y ∈ R^m, x^T (A^T y) = 0. Thus
x ∈ R(A^T)^⊥. This shows that N(A) ⊆ R(A^T)^⊥. Now for all subspaces: (a) (U^⊥)^⊥ = U, and (b) U ⊆ V
implies V^⊥ ⊆ U^⊥ (see Problem 3.5). Applying these properties yields R(A^T) ⊆ N(A)^⊥.
Conversely, suppose x ∈ R(A^T)^⊥. Then for all y ∈ R^m, x^T A^T y = 0. Hence for all y ∈ R^m,
y^T Ax = 0. This implies Ax = 0 and hence that x ∈ N(A). Thus R(A^T)^⊥ ⊆ N(A) and N(A)^⊥ ⊆ R(A^T).
We have shown R(A^T)^⊥ ⊆ N(A) and N(A) ⊆ R(A^T)^⊥. Thus N(A) = R(A^T)^⊥.
The rank of A is the dimension r of the range of A. Clearly this equals the number of linearly independent columns in A. The rank r is also the number of linearly independent rows of A. Thus r ≤ min(m, n).
The matrix A is said to be full rank if r = min(m, n).
An m × n rank one matrix has the form yx^T where y ∈ R^m and x ∈ R^n are both nonzero. Notice that
for all w ∈ R^n, (yx^T)w = y(x^T w) is a scalar multiple of y. Moreover, by suitable choice of w we can
make this scalar any real value. So R(yx^T) = span(y) and the rank of yx^T is one.
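A minimal numpy sketch of these two facts about outer products (the particular vectors y, x, w are assumptions for illustration):

```python
import numpy as np

y = np.array([1.0, -2.0, 3.0])            # y in R^3, nonzero
x = np.array([4.0, 0.5])                  # x in R^2, nonzero
A = np.outer(y, x)                        # A = y x^T, a 3x2 matrix

print(np.linalg.matrix_rank(A))           # rank one
w = np.array([2.0, -1.0])
print(np.allclose(A @ w, (x @ w) * y))    # (y x^T) w = (x^T w) y, a multiple of y
```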
5.2.1
The Euclidean norm in R^p is also known as the 2-norm and is often denoted by ‖·‖₂. We will henceforth
adopt this notation.
The gain of a matrix A ∈ R^{m×n} when acting on a unit norm vector x ∈ R^n is given by the norm of the
vector Ax. This measures the change in the vector magnitude resulting from the application of A. More
generally, for x ≠ 0, define the gain by G(A, x) = ‖Ax‖₂/‖x‖₂, where in the numerator the norm is in
R^m, and in the denominator it is in R^n. The maximum gain of A over all x ∈ R^n is then:

G(A) = max_{x≠0} ‖Ax‖₂/‖x‖₂.

This is called the induced matrix 2-norm of A, and is denoted by ‖A‖₂. It is induced by the Euclidean
norms on R^n and R^m. From the definition of the induced norm we see that

‖A‖₂² = max_{‖x‖₂=1} ‖Ax‖₂² = max_{‖x‖₂=1} x^T (A^T A) x.

Since A^T A is real, symmetric and positive semidefinite, the solution of this problem is to select x to be a
unit norm eigenvector of A^T A for the largest eigenvalue. So

‖A‖₂ = √(λmax(A^T A)).    (5.1)
Because of this connection with eigenvalues, the induced matrix 2-norm is sometimes also called the
spectral norm.
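Equation (5.1) is easy to verify numerically. A minimal numpy sketch (the random 5×3 test matrix is an assumption) comparing √(λmax(AᵀA)) with the induced 2-norm:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 3))

lam_max = np.linalg.eigvalsh(A.T @ A).max()   # largest eigenvalue of A^T A
spectral = np.linalg.norm(A, 2)               # induced matrix 2-norm of A
print(np.isclose(np.sqrt(lam_max), spectral)) # equation (5.1)
```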
It is easy to check that the induced norm is indeed a norm. It also has the following additional properties.
Lemma 5.2.1. Let A, B be matrices of appropriate size and x ∈ R^n. Then
1) ‖Ax‖₂ ≤ ‖A‖₂ ‖x‖₂;
2) ‖AB‖₂ ≤ ‖A‖₂ ‖B‖₂.
Proof. Exercise.
Important: The induced matrix 2-norm and the matrix Euclidean norm are distinct norms on R^{m×n}.
Recall, the Euclidean norm on R^{m×n} is called the Frobenius norm and is denoted by ‖A‖F.
5.3
We first present the main SVD result in what is called the compact form. We then give interpretations of
the SVD and indicate an alternative version known as the full SVD. After these discussions, we turn our
attention to the ideas and constructions that form the foundation of the SVD.
Theorem 5.3.1 (Singular Value Decomposition). Let A ∈ R^{m×n} have rank r ≤ min{m, n}. Then there
exist U ∈ R^{m×r} with U^T U = Ir, V ∈ R^{n×r} with V^T V = Ir, and a diagonal matrix Σ ∈ R^{r×r} with
diagonal entries σ1 ≥ σ2 ≥ · · · ≥ σr > 0, such that

A = UΣV^T = ∑_{j=1}^r σj uj vj^T.
The positive scalars σj are called the singular values of A. The r orthonormal columns of U are called
the left or output singular vectors of A, and the r orthonormal columns of V are called the right or input
singular vectors of A. The conditions U^T U = Ir and V^T V = Ir indicate that U and V have orthonormal
columns. But in general, since U and V need not be square matrices, UU^T ≠ Im and VV^T ≠ In (in
general U and V are not orthogonal matrices). Notice also that the theorem does not claim that U and V
are unique. We discuss this issue later in the chapter. The decomposition is illustrated in Fig. 5.1.
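A minimal numpy sketch of the compact SVD (the rank-2 test matrix is an assumption for illustration): numpy's thin SVD is truncated to the r positive singular values, and the properties claimed by the theorem are checked.

```python
import numpy as np

rng = np.random.default_rng(3)
# Build a 6x4 matrix of rank 2 as a sum of two random outer products.
A = np.outer(rng.normal(size=6), rng.normal(size=4)) \
  + np.outer(rng.normal(size=6), rng.normal(size=4))
r = int(np.linalg.matrix_rank(A))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]   # keep only the r positive singular values

print(np.allclose(A, U @ np.diag(s) @ Vt))                         # A = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(r)), np.allclose(Vt @ Vt.T, np.eye(r)))
```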
Figure 5.2: A visualization of the three operational steps in the compact SVD. The projection of x ∈ R^n onto
N(A)^⊥ is represented in terms of the basis v1, v2: here x̂ = α1 v1 + α2 v2. These coordinates are scaled by the
singular values. Then the scaled coordinates are transferred to the output space R^m and used to form the result
y = Ax as the linear combination y = σ1 α1 u1 + σ2 α2 u2.
Lemma 5.3.1. The matrices U and V in the compact SVD have the following additional properties:
a) The columns of U form an orthonormal basis for the range of A.
b) The columns of V form an orthonormal basis for N(A)^⊥.
Proof. a) You can see this by writing Ax = UΣ(V^T x) and noting that Σ is invertible and V^T vj = ej,
where ej is the j-th standard basis vector for R^r. So the range of V^T is R^r. It follows that R(U) = R(A).
b) By taking transposes and using part a), the columns of V form an ON basis for the range of A^T. Then
using R(A^T) = N(A)^⊥ yields that the columns of V form an orthonormal basis for N(A)^⊥.
The above observations lead to the following operational interpretation of the SVD. For x ∈ R^n, the
operation V^T x gives the coordinates with respect to V of the orthogonal projection of x onto the subspace
N(A)^⊥. (The orthogonal projection is x̂ = V V^T x.) These r coordinates are then individually scaled using
the r diagonal entries of Σ. Finally, we synthesize the output vector by using the scaled coordinates and
the ON basis U for R(A): y = UΣ(V^T x). So the SVD has three steps: (1) an analysis step: V^T x, (2) a
scaling step: Σ(V^T x), and (3) a synthesis step: UΣ(V^T x). In particular, when x = vk, y = Ax = σk uk,
k = 1, . . . , r. So the r ON basis vectors for N(A)^⊥ are mapped to scaled versions of corresponding ON
basis vectors for R(A). This is illustrated in Fig. 5.2.
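The three steps can be traced explicitly. A minimal numpy sketch (the random test matrix and vector are assumptions) verifying that analysis, scaling, and synthesis together reproduce Ax, and that each right singular vector maps to a scaled left singular vector:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(5, 3))
r = int(np.linalg.matrix_rank(A))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]

x = rng.normal(size=3)
c = Vt @ x            # (1) analysis: coordinates of the projection onto N(A)-perp
c_scaled = s * c      # (2) scaling by the singular values
y = U @ c_scaled      # (3) synthesis in the output space using the basis U
print(np.allclose(y, A @ x))

# Each right singular vector maps to a scaled left singular vector: A v_k = sigma_k u_k.
print(np.allclose(A @ Vt[0], s[0] * U[:, 0]))
```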
5.3.1
Recall that the induced matrix 2-norm of A is the maximum gain of A (Section 5.2.1). We show below that this
is related to the singular values of the matrix A. First note that

‖Ax‖₂² = ‖UΣV^T x‖₂² = x^T (VΣ²V^T) x.

Figure 5.3: A visualization of the action of A on the unit sphere in R^n in terms of its SVD.

Maximizing this expression over unit norm x selects x to be a unit norm eigenvector of VΣ²V^T for its largest eigenvalue σ1². Hence

‖A‖₂ = σ1.    (5.2)

The Frobenius norm can also be written in terms of the singular values. Since ‖A‖F² = trace(A^T A) = trace(VΣ²V^T) = trace(Σ²),

‖A‖F = (∑_{j=1}^r σj²)^{1/2}.    (5.3)

So the Frobenius norm of A is the Euclidean norm of the singular values of A.
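Both norm identities are easy to check numerically. A minimal numpy sketch (the random 4×6 test matrix is an assumption) comparing ‖A‖₂ with σ1 and ‖A‖F with the Euclidean norm of the singular values:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(4, 6))
s = np.linalg.svd(A, compute_uv=False)   # singular values, largest first

print(np.isclose(np.linalg.norm(A, 2), s[0]))                    # ||A||_2 = sigma_1
print(np.isclose(np.linalg.norm(A, 'fro'), np.linalg.norm(s)))   # ||A||_F = ||sigma||_2
```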
5.3.2
There is a second version of the SVD that is often convenient in various proofs involving the SVD. Often
this second version is just called the SVD. However, to emphasize its distinctness from the equally useful
compact SVD, we refer to it as a full SVD.
The basic idea is very simple. Let A = Uc Σc Vc^T be a compact SVD with Uc ∈ R^{m×r}, Vc ∈ R^{n×r},
and Σc ∈ R^{r×r}. To Uc we add an orthonormal basis for R(Uc)^⊥ to form the orthogonal matrix U =
[Uc Ūc] ∈ R^{m×m}. Similarly, to Vc we add an orthonormal basis for R(Vc)^⊥ to form the orthogonal
matrix V = [Vc V̄c] ∈ R^{n×n}. To ensure that these extra columns in U and V do not interfere with the
factorization of A, we form Σ ∈ R^{m×n} by padding Σc with zero entries:

Σ = [ Σc             0_{r×(n−r)}
      0_{(m−r)×r}    0_{(m−r)×(n−r)} ]

We then have a full SVD factorization A = UΣV^T. The utility of the full SVD derives from U and V
being orthogonal (hence invertible) matrices. The full SVD is illustrated in Fig. 5.4.
If P is a symmetric positive definite matrix, a full SVD of P is simply an eigen-decomposition of P:
UΣV^T = QΛQ^T, where Q is the orthogonal matrix of eigenvectors of P. In this sense, the SVD extends
the eigen-decomposition by using different orthonormal sets of vectors in the input and output spaces.
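A minimal numpy sketch of the full SVD (the random 5×3 test matrix and the particular SPD construction are assumptions): the square orthogonal factors, the zero-padded Σ, and the agreement between the singular values and eigenvalues of a symmetric positive definite matrix are all checked.

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=True)  # full SVD: U is 5x5, Vt is 3x3
S = np.zeros((5, 3))
S[:3, :3] = np.diag(s)                           # pad Sigma_c with zero blocks

print(np.allclose(A, U @ S @ Vt))                # A = U Sigma V^T
print(np.allclose(U @ U.T, np.eye(5)), np.allclose(Vt.T @ Vt, np.eye(3)))

# For a symmetric positive definite P, the singular values equal the eigenvalues.
P = A.T @ A + np.eye(3)                          # SPD by construction
sp = np.linalg.svd(P, compute_uv=False)
print(np.allclose(np.sort(sp), np.sort(np.linalg.eigvalsh(P))))
```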
5.4
We now give a quick overview of where the matrices U, V and Σ of the SVD come from. Let A ∈ R^{m×n}
have rank r. So the range of A has dimension r and the nullspace of A has dimension n − r.
Let B = A^T A ∈ R^{n×n}. Since B is a symmetric positive semi-definite (PSD) matrix, it has nonnegative eigenvalues and a full set of orthonormal eigenvectors. Order the eigenvalues in decreasing order:
σ1² ≥ σ2² ≥ · · · ≥ σn² ≥ 0 and let vj denote the eigenvector for σj². So

Bvj = σj² vj,    j = 1, . . . , n.

Noting that Ax = 0 if and only if Bx = 0, we see that the null space of B also has dimension n − r.
It follows that n − r of the eigenvectors of B must lie in N(A) and r must lie in N(A)^⊥. Hence
σ1² ≥ · · · ≥ σr² > 0 and σ²_{r+1} = · · · = σn² = 0.

Similarly, let C = AA^T ∈ R^{m×m} with eigenvalues γ1² ≥ · · · ≥ γm² ≥ 0 and orthonormal eigenvectors uk:

Cuk = γk² uk,    k = 1, . . . , m.

Since R(A^T) = N(A)^⊥, the dimension of R(A^T) is r, and that of N(A^T) is m − r. By the same
reasoning as above, m − r of the eigenvectors of C must lie in N(A^T) and r must lie in R(A). Hence
γ1² ≥ · · · ≥ γr² > 0 and γ²_{r+1} = · · · = γm² = 0.
Next we connect the vj and uk. For j = 1, . . . , r, C(Avj) = A(A^T A vj) = σj²(Avj) and Avj ≠ 0. So Avj is an eigenvector of C for the eigenvalue σj², and hence

Avj = γ uk,    some γ ≠ 0.

We can take γ > 0 by swapping uk for −uk if necessary. Using this result we find

vj^T B vj = σj² vj^T vj = σj²;
vj^T B vj = (Avj)^T (Avj) = γ² uk^T uk = γ².

So we must have γ = σj and

Avj = σj uk.
Now do the same analysis for Cuk = γk² uk to obtain

A^T uk = δ vp,    some δ ≠ 0.

Matching the eigenvectors up in pairs (reordering the uk if necessary so that vj pairs with uj), we can write

A [v1 · · · vr] = [u1 · · · ur] diag(σ1, . . . , σr),

i.e., AV = UΣ.
From this we deduce that AV V^T = UΣV^T. Here V V^T computes the orthogonal projection of x onto N(A)^⊥.
Hence for every x ∈ R^n, AV V^T x = Ax. Thus AV V^T = A, and we have A = UΣV^T.
T
T
Finally note that j = j (A A) = j (AA ), j = 1, . . . , r. So the singular values are always
unique. If the singular values are distinct, the P
SVD is unique up to sign interchanges between the uj and
vj . But this still leaves the representation A = rj=1 j uj vjT unique. If the singular values are not distinct,
then U and V are not unique. For example, In = U In U T for every orthogonal matrix U .
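The construction above can be carried out numerically. A minimal numpy sketch (the random 4×3 test matrix is an assumption): V and the σj² come from the eigen-decomposition of B = AᵀA, and each uj is recovered from the relation Avj = σj uj.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(4, 3))              # generically full rank, r = 3

# Eigen-decomposition of B = A^T A gives V and the squared singular values.
evals, V = np.linalg.eigh(A.T @ A)
order = np.argsort(evals)[::-1]          # decreasing order, as in the text
sigma = np.sqrt(evals[order])
V = V[:, order]

U = (A @ V) / sigma                      # u_j = A v_j / sigma_j
Sigma = np.diag(sigma)

print(np.allclose(A, U @ Sigma @ V.T))   # A = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(3)))   # the u_j are orthonormal
```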
5.5 Problems
5.1. Let A ∈ R^{n×n} be invertible with SVD A = ∑_{j=1}^n σj uj vj^T. Show that A⁻¹ = ∑_{j=1}^n (1/σj) vj uj^T.
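The expression ∑_j (1/σj) vj uj^T above can be checked numerically. A minimal numpy sketch (the random, shifted test matrix is an assumption chosen to be safely invertible) confirming that this sum reproduces the matrix inverse:

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.normal(size=(3, 3)) + 3 * np.eye(3)   # shifted so A is safely invertible

U, s, Vt = np.linalg.svd(A)
# Columns of Vt.T are the v_j; dividing column j by sigma_j builds
# sum_j (1/sigma_j) v_j u_j^T = V Sigma^{-1} U^T.
A_inv = (Vt.T / s) @ U.T

print(np.allclose(A_inv, np.linalg.inv(A)))
```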