
ELE 535

Machine Learning and Pattern Recognition

Peter J. Ramadge
Fall 2015, v1.0

© P. J. Ramadge 2015. Please do not distribute without permission.


Chapter 1

Data And Data Embeddings


Machine learning is essentially about learning structural patterns and relations in data and using these to make decisions about new data points. This can include identifying low dimensional structure (dimensionality reduction), clustering the data based on a measure of similarity, the prediction of a category (classification), or the prediction of the value of an associated unmeasured variable (regression). We want the data to be the primary guide for accomplishing this but we sometimes...

1.1 Data as a Set of Vectors

We often represent the data of interest as a set of vectors in some Euclidean space $\mathbb{R}^n$. This has two major advantages: 1) we can exploit the algebraic structure of $\mathbb{R}^n$, and 2) we can exploit its Euclidean geometry. This enables tools and concepts such as differential calculus, convexity, convex optimization, probability distributions, and so on.
In some cases there is a natural embedding of the data into $\mathbb{R}^n$. In other cases we must carefully construct a useful embedding of the data into $\mathbb{R}^n$ (possibly ignoring some information).
Example 1.1.1. In a medical context the readily measured variables might be: age, gender, weight, blood pressure, resting heart rate, respiratory rate, body temperature, blood analysis, and so on. Group an appropriate subset of these measurable variables into a vector $x \in \mathbb{R}^n$. The variable of interest $y \in \mathbb{R}$ might be: the existence of an infection, the level of an infection, the degree of reaction to a drug, the probability that the patient is about to have a heart attack, and so on. From measurements of the first set of medical variables $x$ we would like to predict the value of $y$.
Example 1.1.2. Medical Example 2. Not written up yet.
Example 1.1.3. Document Example 1. Not written up yet.
Example 1.1.4. fMRI Example 1. Not written up yet.

1.2 A Quick Review of the Algebraic Structure of $\mathbb{R}^n$

Below we give a quick summary of the key algebraic properties of $\mathbb{R}^n$. Although we do this in the context of $\mathbb{R}^n$, the concepts and constructions described generalize to any finite dimensional vector space.

1.2.1 Linear combinations

A linear combination of vectors $x_1, \dots, x_k$ using scalars $\alpha_1, \dots, \alpha_k$ produces the vector $x = \sum_{j=1}^k \alpha_j x_j$.
The span of vectors $x_1, \dots, x_k \in \mathbb{R}^n$ is the set of all such linear combinations:
$$\operatorname{span}\{x_1, \dots, x_k\} = \Big\{x : x = \sum_{j=1}^k \alpha_j x_j, \text{ for some scalars } \alpha_j,\ j = 1, \dots, k\Big\}.$$
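These two operations translate directly into code. Below is a minimal numpy sketch (not part of the original notes): it forms a linear combination and tests span membership by solving a least squares problem; the vectors and tolerance are illustrative choices.

```python
import numpy as np

# Vectors x1, x2 in R^3 (illustrative data) and a linear combination of them.
x1 = np.array([1.0, 0.0, 2.0])
x2 = np.array([0.0, 1.0, -1.0])
x = 2.0 * x1 + 3.0 * x2          # x = sum_j alpha_j x_j with alpha = (2, 3)

# x lies in span{x1, x2} iff the least squares residual of A @ alpha = x
# is (numerically) zero, where A has x1, x2 as its columns.
A = np.column_stack([x1, x2])
alpha, _, _, _ = np.linalg.lstsq(A, x, rcond=None)
in_span = np.allclose(A @ alpha, x)
print(alpha, in_span)            # recovers alpha = (2, 3), True
```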

1.2.2 Subspaces

A subspace of $\mathbb{R}^n$ is a subset $\mathcal{U} \subseteq \mathbb{R}^n$ that is closed under linear combinations of its elements. So for any $x_1, \dots, x_k \in \mathcal{U}$ and any scalars $\alpha_1, \dots, \alpha_k$, $\sum_{j=1}^k \alpha_j x_j \in \mathcal{U}$. A subspace $\mathcal{U}$ of $\mathbb{R}^n$ is itself a vector space under the field $\mathbb{R}$.
There are two common ways to specify a subspace. First, for any $x_1, \dots, x_k \in \mathbb{R}^n$, $\mathcal{U} = \operatorname{span}\{x_1, \dots, x_k\}$ is a subspace of $\mathbb{R}^n$. So any subset of vectors specifies a subspace through the span operation.

Lemma 1.2.1. $\operatorname{span}\{x_1, \dots, x_k\}$ is the smallest subspace of $\mathbb{R}^n$ that contains the vectors $x_1, \dots, x_k$.

Proof. It is clear that $\mathcal{U} = \operatorname{span}\{x_1, \dots, x_k\}$ contains the vectors $x_1, \dots, x_k$ and is a subspace of $\mathbb{R}^n$. Suppose $\mathcal{V}$ is a subspace of $\mathbb{R}^n$ that contains the vectors $x_1, \dots, x_k$. Since $\mathcal{V}$ is a subspace it follows that $\operatorname{span}\{x_1, \dots, x_k\} = \mathcal{U} \subseteq \mathcal{V}$. So $\mathcal{U}$ is the smallest subspace that contains $\{x_1, \dots, x_k\}$.

The second method gives an implicit specification through a set of linear equations. Given fixed scalars $\alpha_i$, $i = 1, \dots, n$, the set $\mathcal{U} = \{x : \sum_{i=1}^n \alpha_i x(i) = 0\}$ is a subspace of $\mathbb{R}^n$. More generally, given $k$ sets of scalars $\{\alpha_i^{(j)}\}_{i=1}^n$, $j = 1, \dots, k$, the set $\mathcal{U} = \{x : \sum_{i=1}^n \alpha_i^{(j)} x(i) = 0,\ j = 1, \dots, k\}$ is a subspace of $\mathbb{R}^n$.

1.2.3 Subspace intersection and subspace sum

For $\mathcal{U}, \mathcal{V}$ subspaces of $\mathbb{R}^n$, define
$$\mathcal{U} \cap \mathcal{V} = \{x : x \in \mathcal{U} \text{ and } x \in \mathcal{V}\}$$
$$\mathcal{U} + \mathcal{V} = \{x : x = u + v, \text{ some } u \in \mathcal{U},\ v \in \mathcal{V}\}.$$
$\mathcal{U} \cap \mathcal{V}$ is the set of vectors in both $\mathcal{U}$ and $\mathcal{V}$, and $\mathcal{U} + \mathcal{V}$ is the set of all vectors formed by a linear combination of a vector in $\mathcal{U}$ and a vector in $\mathcal{V}$.

Lemma 1.2.2. For subspaces $\mathcal{U}, \mathcal{V}$ of $\mathbb{R}^n$, $\mathcal{U} \cap \mathcal{V}$ and $\mathcal{U} + \mathcal{V}$ are subspaces of $\mathbb{R}^n$. $\mathcal{U} \cap \mathcal{V}$ is the largest subspace contained in both $\mathcal{U}$ and $\mathcal{V}$, and $\mathcal{U} + \mathcal{V}$ is the smallest subspace containing both $\mathcal{U}$ and $\mathcal{V}$.
Proof. Exercise.

1.2.4 Linear independence

A finite set of vectors $\{x_1, \dots, x_k\} \subset \mathbb{R}^n$ is linearly independent if for each set of scalars $\alpha_1, \dots, \alpha_k$,
$$\sum_{i=1}^k \alpha_i x_i = 0 \ \Longrightarrow\ \alpha_i = 0,\ i = 1, \dots, k.$$
A set of vectors which is not linearly independent is said to be linearly dependent. Notice that a linearly independent set cannot contain the zero vector.

Lemma 1.2.3. The following conditions are equivalent:

1) $\{x_1, \dots, x_k\}$ is linearly independent.
2) Each $x \in \operatorname{span}\{x_1, \dots, x_k\}$ is a unique linear combination of $x_1, \dots, x_k$; that is, $\sum_{i=1}^k \alpha_i x_i = \sum_{i=1}^k \beta_i x_i \Longrightarrow \alpha_i = \beta_i,\ i = 1, \dots, k$.
3) No element of $\{x_1, \dots, x_k\}$ can be written as a linear combination of the others.


Proof. We prove (1 ⇒ 2), (2 ⇒ 3), and (3 ⇒ 1).
(1 ⇒ 2) Suppose that $x$ can be represented in two ways: $x = \sum_{i=1}^k \alpha_i x_i$ and $x = \sum_{i=1}^k \beta_i x_i$. Subtracting these equations yields $\sum_{i=1}^k (\alpha_i - \beta_i) x_i = 0$. Then the linear independence of $x_1, \dots, x_k$ implies that $\alpha_i = \beta_i$, $i = 1, \dots, k$. So the representation of $x$ is unique.
(2 ⇒ 3) Suppose $x_m = \sum_{j \neq m} \alpha_j x_j$. Then $x_m$ has two distinct representations with respect to $x_1, \dots, x_k$: the one given above and $x_m = \sum_{j=1}^k \beta_j x_j$ with $\beta_j = 0$ if $j \neq m$ and $\beta_m = 1$. This contradicts 2). Hence no element of $\{x_1, \dots, x_k\}$ is a linear combination of the others.
(3 ⇒ 1) Suppose that $\sum_{j=1}^k \alpha_j x_j = 0$. If $\alpha_m \neq 0$, then we can write $x_m = \sum_{j \neq m} \beta_j x_j$ with $\beta_j = -\alpha_j/\alpha_m$. This violates 3). Hence $\alpha_j = 0$, $j = 1, \dots, k$.
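Linear independence can also be checked numerically: stack the vectors as columns of a matrix and compare its rank to the number of vectors. A minimal numpy sketch (illustrative, not from the notes):

```python
import numpy as np

def linearly_independent(vectors):
    """Return True if the given list of 1-D arrays is linearly independent."""
    A = np.column_stack(vectors)
    return np.linalg.matrix_rank(A) == len(vectors)

x1 = np.array([1.0, 0.0, 0.0])
x2 = np.array([1.0, 1.0, 0.0])
print(linearly_independent([x1, x2]))            # True
print(linearly_independent([x1, x2, x1 + x2]))   # False: third vector is dependent
```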

1.2.5 Spanning sets and bases

Let $\mathcal{U}$ be a subspace of $\mathbb{R}^n$. A finite set of vectors $\{x_1, \dots, x_k\}$ is said to span $\mathcal{U}$, or to be a spanning set for $\mathcal{U}$, if $\mathcal{U} = \operatorname{span}\{x_1, \dots, x_k\}$. In this case, every $x \in \mathcal{U}$ can be written as a linear combination of the vectors $x_j$, $j = 1, \dots, k$. A spanning set may be redundant in the sense that one or more elements of the set may be a linear combination of a subset of the elements.
A basis for a subspace $\mathcal{U} \subseteq \mathbb{R}^n$ is a linearly independent finite set of vectors that spans $\mathcal{U}$. The spanning property ensures that every vector in $\mathcal{U}$ can be represented as a linear combination of the basis vectors and linear independence ensures that this representation is unique. A vector space that has a basis is said to be finite dimensional.
We show that $\mathbb{R}^n$ is finite dimensional by exhibiting a basis. The standard basis for $\mathbb{R}^n$ is the set of vectors $e_j$, $j = 1, \dots, n$, with
$$e_i(j) = \begin{cases} 1, & \text{if } i = j; \\ 0, & \text{otherwise.} \end{cases}$$
It is clear that if $\sum_{i=1}^n \alpha_i e_i = 0$, then $\alpha_i = 0$, $i = 1, \dots, n$. Hence the set is linearly independent. It is also clear that any vector in $\mathbb{R}^n$ can be written as a linear combination of the $e_i$'s. Hence $\mathbb{R}^n$ is a finite dimensional vector space.
We show below that every subspace $\mathcal{U} \subseteq \mathbb{R}^n$ has a basis, and every basis contains the same number of elements. We define the dimension of $\mathcal{U}$ to be the number of elements in any basis for $\mathcal{U}$. The standard basis for $\mathbb{R}^n$ implies that every basis for $\mathbb{R}^n$ has $n$ elements and hence $\mathbb{R}^n$ has dimension $n$.
Lemma 1.2.4. Every nonzero subspace $\mathcal{U} \subseteq \mathbb{R}^n$ has a basis and every such basis contains the same number of vectors.

Proof. $\mathbb{R}^n$ is finite dimensional and has a basis with $n$ elements. List the basis as $L_0 = \{x_1, \dots, x_n\}$. Since $\mathcal{U} \neq \{0\}$, $\mathcal{U}$ contains a nonzero vector $b_1$. It must hold that $b_1 \in \operatorname{span}(L_0)$ since $L_0$ spans $\mathbb{R}^n$. If $\mathcal{U} = \operatorname{span}\{b_1\}$, then we are done. Otherwise, add $b_1$ to the start of $L_0$ to form the new ordered list $L_1 = \{b_1, x_1, \dots, x_n\}$. Then $L_1$ is a linearly dependent set that spans $\mathbb{R}^n$. Proceeding from the left, there must be a first vector in $L_1$ that is linearly dependent on the subset of vectors that precedes it. Suppose this is $x_p$.

Figure 1.1: The coordinates of a point $x$ with respect to a given basis $\{x_1, \dots, x_n\}$.

Removing $x_p$ from $L_1$ yields a list $L_1'$ that still spans $\mathbb{R}^n$. Since $\mathcal{U} \neq \operatorname{span}\{b_1\}$, there must be a nonzero vector $b_2 \in \mathcal{U}$ not contained in $\operatorname{span}\{b_1\}$. If $\operatorname{span}\{b_1, b_2\} = \mathcal{U}$, we are done. Otherwise, add $b_2$ after $b_1$ in $L_1'$ to obtain a new ordered list $L_2 = \{b_1, b_2, x_1, \dots, x_{p-1}, x_{p+1}, \dots, x_n\}$. Then $L_2$ is linearly dependent and spans $\mathbb{R}^n$. Proceeding from the left, there must be a first vector in $L_2$ that is linearly dependent on the subset of vectors that precedes it. This can't be one of the $b_j$'s since these are linearly independent. Hence we can again remove one of the remaining $x_j$'s to obtain a reduced list $L_2'$ that spans $\mathbb{R}^n$. Since $\operatorname{span}\{b_1, b_2\} \neq \mathcal{U}$, there must be a vector $b_3 \in \mathcal{U}$ such that $b_3 \notin \operatorname{span}\{b_1, b_2\}$. Adding this after $b_2$ in $L_2'$ gives a new linearly dependent list $L_3$ that spans $\mathbb{R}^n$. In this way, we either terminate with a basis for $\mathcal{U}$ or continue to remove $x_j$'s from the ordered list and add $b_j$'s until all the $x_j$'s are removed. In that event, $L_n = \{b_1, \dots, b_n\}$ is a linearly independent spanning set for $\mathbb{R}^n$ and $\mathcal{U} = \mathbb{R}^n$. In either case, $\mathcal{U}$ has a basis.
Let $\{x_1, \dots, x_k\}$ and $\{y_1, \dots, y_m\}$ be bases for $\mathcal{U}$. First form the ordered list $L_0 = \{x_1, \dots, x_k\}$. We will progressively add one of the $y_j$'s to the start of the list and, if possible, remove one of the $x_j$'s. To begin, set $L_1 = \{y_m, x_1, \dots, x_k\}$. By assumption, $\mathcal{U} = \operatorname{span}\{L_0\}$. Hence $y_m \in \operatorname{span}\{L_0\}$. It follows that $L_1$ is linearly dependent and spans $\mathcal{U}$. Proceeding from the left there must be a first vector in $L_1$ that is linearly dependent on the subset of vectors that precedes it. Suppose this is $x_p$. Removing $x_p$ from $L_1$ leaves a list $L_1'$ that still spans $\mathcal{U}$. Hence $y_{m-1} \in \operatorname{span}(L_1')$. So adding $y_{m-1}$ to $L_1'$ in the first position gives a new linearly dependent list $L_2 = \{y_{m-1}, y_m, x_1, \dots, x_{p-1}, x_{p+1}, \dots, x_k\}$ that spans $\mathcal{U}$. Proceeding from the left there must be a first vector in $L_2$ that is linearly dependent on the subset of vectors that precedes it. This can't be one of the $y_j$'s since these are linearly independent. Hence we can again remove one of the remaining $x_j$'s to obtain a reduced list spanning $\mathcal{U}$. Then adding $y_{m-2}$ in the first place gives a new linearly dependent list $L_3$ spanning $\mathcal{U}$. In this way, we continue to remove $x_j$'s from the ordered list and add $y_j$'s. If we remove all the $x_j$'s before adding all of the $y_j$'s we obtain a contradiction, since that would imply that $\{y_1, \dots, y_m\}$ is linearly dependent. So we must be able to add all of the $y_j$'s. Hence $m \leq k$. A symmetric argument with the roles of the two bases interchanged shows that $k \leq m$. Hence $m = k$.

1.2.6 Coordinates

The coordinates of $x \in \mathbb{R}^n$ with respect to a basis $\{x_j\}_{j=1}^n$ are the unique scalars $\alpha_j$ such that $x = \sum_{j=1}^n \alpha_j x_j$. Every vector uniquely determines, and is uniquely determined by, its coordinates. For the standard basis, the coordinates of $x \in \mathbb{R}^n$ are simply the entries of $x$: $x = \sum_{j=1}^n x(j) e_j$.
You can think of the coordinates as a way to locate (or construct) $x$ starting from the origin. You simply go via $x_1$ scaled by $\alpha_1$, then via $x_2$ scaled by $\alpha_2$, and so on. This is not a unique path, since the scaled basis elements can be added in any order, but all such paths reach the same point $x$. This is illustrated in Fig. 1.1. In this sense, coordinates are like map coordinates with the basic map navigation primitives specified by the basis elements. That also makes it clear that if we choose a different basis, then we must use a different set of coordinates to reach the same point $x$.
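Finding the coordinates of $x$ with respect to a general basis amounts to solving a square linear system. A minimal numpy sketch with an illustrative basis (not from the notes):

```python
import numpy as np

# Columns of B form a basis for R^2 (illustrative choice).
B = np.array([[1.0, 1.0],
              [0.0, 2.0]])
x = np.array([3.0, 4.0])

# Coordinates alpha satisfy B @ alpha = x, i.e. x = sum_j alpha_j * B[:, j].
alpha = np.linalg.solve(B, x)
print(alpha)                      # coordinates of x in this basis
print(np.allclose(B @ alpha, x))  # True: the coordinates reconstruct x
```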


1.3 Problems

1.1. Show that:

a) A linearly independent set in $\mathbb{R}^n$ containing $n$ vectors is a basis for $\mathbb{R}^n$.
b) A subset of $\mathbb{R}^n$ containing $k > n$ vectors is linearly dependent.
c) If $\mathcal{U}$ is a proper subspace of $\mathbb{R}^n$, then $\dim(\mathcal{U}) < n$.

1.4 Notes

We have given a brief outline of the algebraic structure of Rn . For a more detailed introduction see the
relevant sections in: Strang, Linear Algebra and its Applications, Chapter 2.


Chapter 2

The Geometry of $\mathbb{R}^n$

$\mathbb{R}^n$ also has a geometric structure defined by the Euclidean inner product and norm. These add the important concepts of length, distance, angle, and orthogonality. It will be convenient to discuss these concepts using the vector space $(\mathbb{C}^n, \mathbb{C})$. The concepts and definitions readily specialize to $(\mathbb{R}^n, \mathbb{R})$. In addition, the vector spaces of complex and real matrices of given fixed dimensions can also be given a Euclidean geometry.

2.1 Inner Product and Norm on $\mathbb{C}^n$

The inner product of vectors $x, y \in \mathbb{C}^n$ is the scalar
$$\langle x, y\rangle = \sum_{k=1}^n x(k)\overline{y(k)},$$
where $\bar{y}$ denotes the element-wise complex conjugate of $y$. This can also be written in terms of a matrix product as $\langle x, y\rangle = x^T\bar{y}$. The inner product satisfies the following basic properties.
Lemma 2.1.1 (Properties of the Inner Product). For $x, y, z \in \mathbb{C}^n$ and $\alpha \in \mathbb{C}$,
1) $\langle x, x\rangle \geq 0$ with equality $\iff x = 0$
2) $\langle x, y\rangle = \overline{\langle y, x\rangle}$
3) $\langle \alpha x, y\rangle = \alpha\langle x, y\rangle$
4) $\langle x + y, z\rangle = \langle x, z\rangle + \langle y, z\rangle$
Proof. These claims follow from the definition of the inner product via simple algebra.
The inner product also specifies the corresponding Euclidean norm on $\mathbb{C}^n$ via the formula:
$$\|x\| = (\langle x, x\rangle)^{1/2} = \Big(\sum_{k=1}^n |x(k)|^2\Big)^{1/2}.$$

Based on the definition and properties of the inner product and the definition of the norm, we can then derive the famous Cauchy-Schwarz inequality.
Lemma 2.1.2 (Cauchy-Schwarz Inequality). For all $x, y \in \mathbb{C}^n$, $|\langle x, y\rangle| \leq \|x\|\,\|y\|$.

Figure 2.1: An illustration of the triangle inequality.

Proof. Exercise.

By Cauchy-Schwarz, we have
$$0 \leq \frac{|\langle x, y\rangle|}{\|x\|\,\|y\|} \leq 1.$$
This allows us to define the angle $\theta \in [0, \pi/2]$ between $x$ and $y$ by
$$\cos\theta = \frac{|\langle x, y\rangle|}{\|x\|\,\|y\|}.$$
For vectors in $\mathbb{R}^n$ we can dispense with the modulus function and write
$$\langle x, y\rangle = \|x\|\,\|y\|\cos\theta,$$
where $\cos\theta = \langle x, y\rangle/(\|x\|\,\|y\|)$. So in $\mathbb{R}^n$, the inner product of unit length vectors is the cosine of the angle between them.
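The angle formula translates directly into code. A small numpy sketch with illustrative vectors (not from the notes):

```python
import numpy as np

x = np.array([1.0, 0.0])
y = np.array([1.0, 1.0])

# cos(theta) = <x, y> / (||x|| ||y||); clip guards against round-off outside [-1, 1].
cos_theta = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
print(np.degrees(theta))          # 45 degrees for these vectors
```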
Finally, we have the properties of the norm function.
Lemma 2.1.3 (Properties of the Norm). For $x, y \in \mathbb{C}^n$ and $\alpha \in \mathbb{C}$:
1) $\|x\| \geq 0$ with equality if and only if $x = 0$ (positivity).
2) $\|\alpha x\| = |\alpha|\,\|x\|$ (scaling).
3) $\|x + y\| \leq \|x\| + \|y\|$ (triangle inequality).
Proof. Items 1) and 2) easily follow from the definition of the norm. Item 3) can be proved using the Cauchy-Schwarz inequality and is left as an exercise.
The norm $\|x\|$ measures the length or size of the vector $x$. Equivalently, $\|x\|$ is the distance between $0$ and $x$, and $\|x - y\|$ is the distance between $x$ and $y$. The triangle inequality is illustrated in Fig. 2.1. If $\|x\| = 1$, $x$ is called a unit vector or a unit direction. The set $\{x : \|x\| = 1\}$ of all unit vectors is called the unit sphere.

2.2 Orthogonality and Orthonormal Bases

Vectors $x, y \in \mathbb{C}^n$ are orthogonal, written $x \perp y$, if $\langle x, y\rangle = 0$. A set of vectors $\{x_1, \dots, x_k\}$ in $\mathbb{R}^n$ is orthogonal if each pair is orthogonal: $x_i \perp x_j$, $i, j = 1, \dots, k$, $i \neq j$.

Theorem 2.2.1 (Pythagoras). If $x_1, \dots, x_k$ is an orthogonal set, then $\big\|\sum_{j=1}^k x_j\big\|^2 = \sum_{j=1}^k \|x_j\|^2$.

Proof. Using only the definition of the norm and properties of the inner product we have:
$$\Big\|\sum_{j=1}^k x_j\Big\|^2 = \Big\langle\sum_{i=1}^k x_i,\ \sum_{j=1}^k x_j\Big\rangle = \sum_{i=1}^k\sum_{j=1}^k \langle x_i, x_j\rangle = \sum_{j=1}^k \langle x_j, x_j\rangle = \sum_{j=1}^k \|x_j\|^2.$$

A set of vectors $\{x_1, \dots, x_k\}$ in $\mathbb{C}^n$ is orthonormal if it is orthogonal and every vector in the set has unit norm ($\|x_j\| = 1$, $j = 1, \dots, k$).

Lemma 2.2.1. An orthonormal set is linearly independent.

Proof. Let $\{x_1, \dots, x_k\}$ be an orthonormal set and suppose that $\sum_{j=1}^k \alpha_j x_j = 0$. Then for each $x_i$ we have $0 = \big\langle\sum_{j=1}^k \alpha_j x_j,\ x_i\big\rangle = \alpha_i$.

An orthonormal basis for $\mathbb{C}^n$ is a basis of orthonormal vectors. Since an orthonormal set is always linearly independent, any set of $n$ orthonormal vectors is an orthonormal basis for $\mathbb{C}^n$. Orthonormal bases have a particularly convenient property: it is easy to find the coordinates of any vector $x$ with respect to such a basis. To see this, let $\{x_1, \dots, x_n\}$ be an orthonormal basis and $x = \sum_j \alpha_j x_j$. Then
$$\langle x, x_k\rangle = \Big\langle\sum_j \alpha_j x_j,\ x_k\Big\rangle = \sum_j \alpha_j\langle x_j, x_k\rangle = \alpha_k.$$
So the coordinate of $x$ with respect to the basis element $x_k$ is simply $\alpha_k = \langle x, x_k\rangle$, $k = 1, \dots, n$.
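With an orthonormal basis the coordinates are obtained by inner products alone; no linear solve is needed. A minimal numpy sketch using a random orthonormal basis produced by a QR factorization (an illustrative construction, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
# QR of a random matrix gives an orthonormal basis (the columns of Q) for R^n.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
x = rng.standard_normal(n)

alpha = Q.T @ x                   # alpha_k = <x, q_k> for each basis vector q_k
x_rebuilt = Q @ alpha             # x = sum_k alpha_k q_k
print(np.allclose(x_rebuilt, x))  # True
```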

2.3 General Inner Product Spaces

More generally, a real or complex vector space $\mathcal{X}$ equipped with a function $\langle\cdot,\cdot\rangle$ satisfying the properties listed in Lemma 2.1.1 is called an inner product space. We give an important example below.

2.3.1 Inner Product on $\mathbb{C}^{m\times n}$

We can define an inner product on the vector space of complex matrices $\mathbb{C}^{m\times n}$ by:
$$\langle A, B\rangle = \sum_{i,j} A_{ij}\overline{B}_{ij}.$$
This function satisfies the properties listed in Lemma 2.1.1. The corresponding norm is:
$$\|A\|_F = (\langle A, A\rangle)^{1/2} = \Big(\sum_{i,j} |A_{ij}|^2\Big)^{1/2}.$$
This is frequently called the Frobenius norm; hence the special notation $\|A\|_F$. The following lemma gives a very useful alternative expression for $\langle A, B\rangle$.

Lemma 2.3.1. For all $A, B \in \mathbb{C}^{m\times n}$, $\langle A, B\rangle = \operatorname{trace}(A^T\overline{B})$.
Proof. Exercise.
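Lemma 2.3.1 is easy to check numerically. A small numpy sketch with random complex matrices (illustrative only, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4)) + 1j * rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4)) + 1j * rng.standard_normal((3, 4))

inner = np.sum(A * np.conj(B))            # <A, B> = sum_ij A_ij * conj(B_ij)
via_trace = np.trace(A.T @ np.conj(B))    # trace(A^T B-bar), as in Lemma 2.3.1
fro = np.sqrt(np.sum(np.abs(A) ** 2))     # Frobenius norm of A

print(np.isclose(inner, via_trace))                # True
print(np.isclose(fro, np.linalg.norm(A, 'fro')))   # True
```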

2.4 Orthogonal and Unitary Matrices

A square matrix $Q \in \mathbb{R}^{n\times n}$ is orthogonal if $Q^TQ = QQ^T = I_n$. In this case, the columns of $Q$ form an orthonormal basis for $\mathbb{R}^n$ and $Q^T$ is the inverse of $Q$. We denote the set of $n \times n$ orthogonal matrices by $O_n$.

Lemma 2.4.1. If $Q \in O_n$, then for each $x, y \in \mathbb{R}^n$, $\langle Qx, Qy\rangle = \langle x, y\rangle$ and $\|Qx\| = \|x\|$.

Proof. $\langle Qx, Qy\rangle = x^TQ^TQy = x^Ty = \langle x, y\rangle$, and $\|Qx\|^2 = \langle Qx, Qx\rangle = \langle x, x\rangle = \|x\|^2$.

Lemma 2.4.2. The set $O_n$ contains the identity matrix $I_n$, and is closed under matrix multiplication and matrix inverse.

Proof. If $Q, W$ are orthogonal, then $(QW)^T(QW) = W^TQ^TQW = I_n$ and $(QW)(QW)^T = QWW^TQ^T = I_n$. So $QW$ is orthogonal. If $Q$ is orthogonal, $Q^{-1} = Q^T$ is orthogonal. Clearly, $I_n$ is orthogonal.
Hence the set of matrices $O_n$ forms a (noncommutative) group under matrix multiplication. This is called the $n \times n$ orthogonal group.
For complex matrices a slight change is required. A square complex matrix $Q \in \mathbb{C}^{n\times n}$ is called a unitary matrix if $Q^*Q = QQ^* = I$, where $Q^* = \overline{Q}^T$. It is readily verified that if $Q$ is a unitary matrix, then for each $x, y \in \mathbb{C}^n$, $\langle Qx, Qy\rangle = \langle x, y\rangle$ and $\|Qx\| = \|x\|$. So multiplication of a vector by a unitary (or orthogonal) matrix preserves inner products, angles, norms, distances, and hence Euclidean geometry.
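The preservation of norms and inner products in Lemma 2.4.1 can be verified numerically. A minimal numpy sketch using an orthogonal matrix obtained from a QR factorization (illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # Q is orthogonal: Q.T @ Q = I
x = rng.standard_normal(n)
y = rng.standard_normal(n)

print(np.allclose(Q.T @ Q, np.eye(n)))                       # True
print(np.isclose(np.dot(Q @ x, Q @ y), np.dot(x, y)))        # <Qx, Qy> = <x, y>
print(np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x)))  # ||Qx|| = ||x||
```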

2.5 Problems

2.1. The mean of a vector $x \in \mathbb{R}^n$ is the scalar $m_x = (1/n)\sum_{i=1}^n x(i)$. Show that the set of all vectors in $\mathbb{R}^n$ with mean 0 is a subspace $\mathcal{U}_0 \subseteq \mathbb{R}^n$ of dimension $n - 1$. Show that all vectors in $\mathcal{U}_0$ are orthogonal to $\mathbf{1} \in \mathbb{R}^n$, where $\mathbf{1}$ denotes the vector with all components equal to 1.

2.2. The correlation of $x_1, x_2 \in \mathbb{R}^n$ is the scalar:
$$\rho(x_1, x_2) = \frac{\langle x_1, x_2\rangle}{\|x_1\|\,\|x_2\|}.$$
For given $x_1$, what vectors $x_2$ maximize the correlation? What vectors $x_2$ minimize the correlation? Show that $\rho(x_1, x_2) \in [-1, 1]$ and is zero precisely when the vectors are orthogonal.

2.3. Use the definition of the inner product and its properties listed in Lemma 2.1.1, together with the definition of the norm, to prove the Cauchy-Schwarz Inequality (Lemma 2.1.2).
a) First let $x, y \in \mathbb{C}^n$ with $\|x\| = \|y\| = 1$.
1) Set $\hat{x} = \langle x, y\rangle y$ and $r_x = x - \hat{x}$. Show that $\langle r_x, y\rangle = \langle r_x, \hat{x}\rangle = 0$.
2) Show that $\|r_x\|^2 = 1 - \langle\hat{x}, x\rangle$.
3) Using the previous result and the definition of $\hat{x}$, show that $|\langle x, y\rangle| \leq 1$.
b) Prove the result when $\|x\| \neq 0$ and $\|y\| \neq 0$.
c) Prove the result when $x$ or $y$ (or both) is zero.

2.4. Prove the triangle inequality for the Euclidean norm in $\mathbb{C}^n$. Expand $\|x + y\|^2$ using the properties of the inner product, and note that $2\operatorname{Re}(\langle x, y\rangle) \leq 2|\langle x, y\rangle|$.

2.5. Let $\mathcal{X}, \mathcal{Y}$ be inner product spaces over the same field $F$ with $F = \mathbb{R}$ or $F = \mathbb{C}$. A linear isometry from $\mathcal{X}$ to $\mathcal{Y}$ is a linear function $D : \mathcal{X} \to \mathcal{Y}$ that preserves distances: $(\forall x \in \mathcal{X})\ \|D(x)\| = \|x\|$. Show that a linear isometry between inner product spaces also preserves inner products.
a) First examine $\|D(x + y)\|^2$ and conclude that $\operatorname{Re}(\langle Dx, Dy\rangle) = \operatorname{Re}(\langle x, y\rangle)$.
b) Now examine $\|D(x + iy)\|^2$ where $i$ is the imaginary unit.

2.6. Let $P_n$ denote the set of $n \times n$ permutation matrices. Show that $P_n$ is a (noncommutative) group under matrix multiplication. Show that every permutation matrix is an orthogonal matrix. Hence $P_n$ is a subgroup of $O_n$.

2.7. Show that for $A, B \in \mathbb{C}^{m\times n}$:

a) $\langle A, B\rangle = \operatorname{trace}(A^T\overline{B}) = \operatorname{trace}(B^*A)$.
b) $|\operatorname{trace}(B^*A)| \leq \operatorname{trace}(A^*A)^{1/2}\operatorname{trace}(B^*B)^{1/2}$.

2.8. Show that the Euclidean norm in $\mathbb{C}^n$ is:

a) permutation invariant: if $y$ is a permutation of the entries of $x$, then $\|y\| = \|x\|$.
b) an absolute norm: if $y = |x|$ component-wise, then $\|y\| = \|x\|$.

2.9. Let $b_1, \dots, b_n$ be an orthonormal basis for $\mathbb{C}^n$ (or $\mathbb{R}^n$), and set $B_{j,k} = b_jb_k^T$, $j, k = 1, \dots, n$. Show that $\{B_{j,k} = b_jb_k^T\}_{j,k=1}^n$ is an orthonormal basis for $\mathbb{C}^{n\times n}$.

2.6 Notes

For a more detailed introduction to the Euclidean geometry of Rn , see the relevant sections in: Strang,
Linear Algebra and its Applications, Chapter 2.


Chapter 3

Orthogonal Projection
We now consider the following fundamental problem. Given a data vector $x$ in an inner product space $\mathcal{X}$ and a subspace $\mathcal{U} \subseteq \mathcal{X}$, find the closest point to $x$ in $\mathcal{U}$. This operation is a simple building block that we will use repeatedly.

3.1 Simplest Instance

The simplest instance of our problem is: given $x, u \in \mathbb{R}^n$ with $\|u\| = 1$, find the closest point to $x$ in $\operatorname{span}\{u\}$. This can be posed as the simple constrained optimization problem:
$$\min_{z \in \mathbb{R}^n}\ \tfrac{1}{2}\|x - z\|^2 \quad \text{s.t. } z \in \operatorname{span}\{u\}. \tag{3.1}$$
The subspace $\operatorname{span}\{u\}$ is a line through the origin in the direction $u$, and we seek the point $z$ on this line that is closest to $x$. So we must have $z = \alpha u$ for some scalar $\alpha$. Hence we can equivalently solve the unconstrained optimization problem:
$$\min_{\alpha \in \mathbb{R}}\ \tfrac{1}{2}\|x - \alpha u\|^2.$$
Expanding the objective function in (3.1) and setting $z = \alpha u$ yields:
$$\tfrac{1}{2}\|x - z\|^2 = \tfrac{1}{2}\langle x - z, x - z\rangle = \tfrac{1}{2}\|x\|^2 - \alpha\langle u, x\rangle + \tfrac{1}{2}\alpha^2\|u\|^2.$$
This is a quadratic in $\alpha$ with a positive coefficient on the second order term. Hence there is a unique value of $\alpha$ that minimizes the objective. Setting the derivative of the above expression w.r.t. $\alpha$ equal to zero gives the unique solution, $\alpha = \langle u, x\rangle$. Hence the optimal solution is
$$\hat{x} = \langle u, x\rangle u.$$
The associated error vector $r_x = x - \hat{x}$ is called the residual. We claim that the residual is orthogonal to $u$ and hence to the subspace $\operatorname{span}\{u\}$. To see this note that
$$\langle u, r_x\rangle = \langle u, x - \hat{x}\rangle = \langle u, x\rangle - \langle u, x\rangle\langle u, u\rangle = 0.$$
Thus $\hat{x}$ is the unique orthogonal projection of $x$ onto the line $\operatorname{span}\{u\}$. This is illustrated in Figure 3.1. By Pythagoras, we have $\|x\|^2 = \|\hat{x}\|^2 + \|r_x\|^2$.

Figure 3.1: Orthogonal projection of $x$ onto a line through zero.

We can also write the solution using matrix notation. Noting that $\langle u, x\rangle = u^Tx$, we have
$$\hat{x} = (uu^T)x = Px, \qquad r_x = (I - uu^T)x = (I - P)x.$$
So for fixed $u$, both $\hat{x}$ and $r_x$ are linear functions of $x$. As one might expect, these linear functions have some special properties. For example, since $\hat{x} \in \operatorname{span}\{u\}$, the projection of $\hat{x}$ onto $\operatorname{span}\{u\}$ must be $\hat{x}$. So we must have $P^2 = P$. We can easily check this using the formula $P = uu^T$: $P^2 = (uu^T)(uu^T) = uu^T = P$. In addition, we note that $P = uu^T$ is symmetric. So $P$ is symmetric ($P^T = P$) and idempotent ($P^2 = P$). A matrix with these two properties is called a projection matrix.
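The closed-form projection onto a line is a one-liner in numpy. The sketch below (illustrative, not from the notes) also checks that $P = uu^T$ is symmetric and idempotent and that the residual is orthogonal to $u$:

```python
import numpy as np

rng = np.random.default_rng(3)
u = rng.standard_normal(4)
u = u / np.linalg.norm(u)         # unit direction
x = rng.standard_normal(4)

P = np.outer(u, u)                # projection matrix onto span{u}
x_hat = P @ x                     # same as np.dot(u, x) * u
r = x - x_hat                     # residual

print(np.allclose(P, P.T), np.allclose(P @ P, P))   # symmetric, idempotent
print(np.isclose(np.dot(u, r), 0.0))                # residual orthogonal to u
```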

3.2 Projection of x onto a Subspace $\mathcal{U}$

Now let $\mathcal{U}$ be a subspace of $\mathbb{R}^n$ with an orthonormal basis $\{u_1, \dots, u_k\}$. For a given $x \in \mathbb{R}^n$, we seek a point $z$ in $\mathcal{U}$ that minimizes the distance to $x$:
$$\min_{z \in \mathbb{R}^n}\ \tfrac{1}{2}\|x - z\|^2 \quad \text{s.t. } z \in \mathcal{U}. \tag{3.2}$$
Since we can uniquely write $z = \sum_{j=1}^k \alpha_j u_j$, we can equivalently pose this as the unconstrained optimization problem:
$$\min_{\alpha_1, \dots, \alpha_k}\ \tfrac{1}{2}\Big\|x - \sum_{j=1}^k \alpha_j u_j\Big\|^2.$$

Using the definition of the norm and the properties of the inner product, we can expand the objective function to obtain:
$$\tfrac{1}{2}\|x - z\|^2 = \tfrac{1}{2}\langle x - z, x - z\rangle = \tfrac{1}{2}\|x\|^2 - \langle z, x\rangle + \tfrac{1}{2}\|z\|^2 = \tfrac{1}{2}\|x\|^2 - \sum_{j=1}^k \alpha_j\langle u_j, x\rangle + \tfrac{1}{2}\sum_{j=1}^k \alpha_j^2.$$
In the last line we used Pythagoras to write $\|z\|^2 = \sum_{j=1}^k \alpha_j^2$. Taking the derivative with respect to $\alpha_j$ and setting this equal to zero yields the unique solution $\alpha_j = \langle u_j, x\rangle$, $j = 1, \dots, k$. So the unique closest point in $\mathcal{U}$ to $x$ is:
$$\hat{x} = \sum_{j=1}^k \langle u_j, x\rangle u_j. \tag{3.3}$$

Moreover, the residual $r_x = x - \hat{x}$ is orthogonal to every $u_j$ and hence to the subspace $\mathcal{U} = \operatorname{span}\{u_1, \dots, u_k\}$. To see this compute:
$$\langle u_j, r_x\rangle = \langle u_j, x - \hat{x}\rangle = \langle u_j, x\rangle - \langle u_j, \hat{x}\rangle = \langle u_j, x\rangle - \langle u_j, x\rangle = 0.$$
Thus $\hat{x}$ is the unique orthogonal projection of $x$ onto $\mathcal{U}$, and by Pythagoras, $\|x\|^2 = \|\hat{x}\|^2 + \|r_x\|^2$. This is illustrated in Figure 3.2. From (3.3), notice that $\hat{x}$ and the residual $r_x = x - \hat{x}$ are linear functions of $x$.
Figure 3.2: Orthogonal projection of $x$ onto the subspace $\mathcal{U}$.

We can also write these results as matrix equations. First, from (3.3) we have
$$\hat{x} = \sum_{j=1}^k u_ju_j^Tx = \Big(\sum_{j=1}^k u_ju_j^T\Big)x = Px,$$
with $P = \sum_{j=1}^k u_ju_j^T$. Let $U \in \mathbb{R}^{n\times k}$ be the matrix with columns $u_1, \dots, u_k$. Then
$$P = \sum_{j=1}^k u_ju_j^T = UU^T.$$
Hence we can write
$$\hat{x} = UU^Tx, \qquad r_x = (I - UU^T)x.$$
This confirms that $\hat{x}$ and $r_x$ are linear functions of $x$ and that $P$ is symmetric and idempotent.
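The same computation for a k-dimensional subspace uses an orthonormal basis matrix U. A minimal numpy sketch with an illustrative subspace built via QR (not from the notes):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 6, 2
U, _ = np.linalg.qr(rng.standard_normal((n, k)))   # orthonormal basis, U.T @ U = I_k
x = rng.standard_normal(n)

x_hat = U @ (U.T @ x)             # orthogonal projection of x onto R(U)
r = x - x_hat                     # residual, orthogonal to every column of U

print(np.allclose(U.T @ r, 0.0))                                        # True
print(np.isclose(np.linalg.norm(x) ** 2,
                 np.linalg.norm(x_hat) ** 2 + np.linalg.norm(r) ** 2))  # Pythagoras
```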

3.3 The Orthogonal Complement of a Subspace

The orthogonal complement of a subspace $\mathcal{U}$ of $\mathbb{R}^n$ is the subset:
$$\mathcal{U}^\perp = \{x \in \mathbb{R}^n : (\forall u \in \mathcal{U})\ x \perp u\}.$$
So $\mathcal{U}^\perp$ is the set of vectors orthogonal to every vector in the subspace $\mathcal{U}$. When $\mathcal{U} = \operatorname{span}\{u\}$ we write $\mathcal{U}^\perp$ as simply $u^\perp$. The set $\mathcal{U}^\perp$ is easily shown to be a subspace of $\mathbb{R}^n$ with $\mathcal{U} \cap \mathcal{U}^\perp = \{0\}$.

Lemma 3.3.1. $\mathcal{U}^\perp$ is a subspace of $\mathbb{R}^n$ and $\mathcal{U} \cap \mathcal{U}^\perp = \{0\}$.

Proof. Exercise.

For example, it is clear that $\{0\}^\perp = \mathbb{R}^n$ and $(\mathbb{R}^n)^\perp = \{0\}$. For a unit vector $u$, $u^\perp$ is the $n-1$ dimensional hyperplane in $\mathbb{R}^n$ passing through the origin and with normal $u$. Given a subspace $\mathcal{U}$ in $\mathbb{R}^n$ and $x \in \mathbb{R}^n$, the projection $\hat{x}$ of $x$ onto $\mathcal{U}$ lies in $\mathcal{U}$ and the residual $r_x$ lies in $\mathcal{U}^\perp$.
An important consequence of the orthogonality of $\mathcal{U}$ and $\mathcal{U}^\perp$ is that every $x \in \mathbb{R}^n$ has a unique representation of the form $x = u + v$ with $u \in \mathcal{U}$ and $v \in \mathcal{U}^\perp$. This follows from the properties of the orthogonal projection of $x$ onto $\mathcal{U}$.

Lemma 3.3.2. Every $x \in \mathbb{R}^n$ has a unique representation in the form $x = u + v$ with $u \in \mathcal{U}$ and $v \in \mathcal{U}^\perp$.

Proof. By the properties of orthogonal projection we have $x = \hat{x} + r_x$ with $\hat{x} \in \mathcal{U}$ and $r_x \in \mathcal{U}^\perp$. This gives one decomposition of the required form. Suppose there are two decompositions of this form: $x = u_i + v_i$, with $u_i \in \mathcal{U}$ and $v_i \in \mathcal{U}^\perp$, $i = 1, 2$. Subtracting these expressions gives $(u_1 - u_2) = -(v_1 - v_2)$. Now $u_1 - u_2 \in \mathcal{U}$ and $v_1 - v_2 \in \mathcal{U}^\perp$, and since $\mathcal{U} \cap \mathcal{U}^\perp = \{0\}$ (Lemma 3.3.1), we must have $u_1 = u_2$ and $v_1 = v_2$.

It follows from Lemma 3.3.2 that $\mathcal{U} + \mathcal{U}^\perp = \mathbb{R}^n$. This simply states that every vector in $\mathbb{R}^n$ is the sum of some vector in $\mathcal{U}$ and some vector in $\mathcal{U}^\perp$. Because this representation is also unique, this is sometimes written as $\mathbb{R}^n = \mathcal{U} \oplus \mathcal{U}^\perp$ and we say that $\mathbb{R}^n$ is the direct sum of $\mathcal{U}$ and $\mathcal{U}^\perp$. Several additional properties of the orthogonal complement are covered in Problem 3.5.

3.4 Problems

3.1. Given $x, y \in \mathbb{R}^n$, find the closest point to $x$ on the line through 0 in the direction of $y$.

3.2. Let $u_1, \dots, u_k \in \mathbb{R}^n$ be an ON set spanning a subspace $\mathcal{U}$ and let $v \in \mathbb{R}^n$ with $v \notin \mathcal{U}$. Find the point on the linear manifold $M = \{x : x - v \in \mathcal{U}\}$ that is closest to a given point $y \in \mathbb{R}^n$. [Hint: transform the problem to one that you know how to solve.]

3.3. A Householder transformation on $\mathbb{R}^n$ is a linear transformation that reflects each point $x$ in $\mathbb{R}^n$ about a given $n-1$ dimensional subspace $\mathcal{U}$ specified by giving its unit normal $u \in \mathbb{R}^n$. To reflect $x$ about $\mathcal{U}$ we want to move it orthogonally through the subspace to the point on the opposite side that is equidistant from the subspace.
a) Given $\mathcal{U} = u^\perp = \{x : u^Tx = 0\}$, find the required Householder matrix.
b) Show that a Householder matrix $H$ is symmetric, orthogonal, and is its own inverse.

3.4. Prove Lemma 3.3.1.

3.5. Let $\mathcal{X}$ be a real or complex inner product space of dimension $n$, and $\mathcal{U}, \mathcal{V}$ be subspaces of $\mathcal{X}$. Prove each of the following:
a) $\mathcal{U} \subseteq \mathcal{V}$ implies $\mathcal{V}^\perp \subseteq \mathcal{U}^\perp$.
b) $(\mathcal{U}^\perp)^\perp = \mathcal{U}$.
c) $(\mathcal{U} + \mathcal{V})^\perp = \mathcal{U}^\perp \cap \mathcal{V}^\perp$
d) $(\mathcal{U} \cap \mathcal{V})^\perp = \mathcal{U}^\perp + \mathcal{V}^\perp$
e) If $\dim(\mathcal{U}) = k$, then $\dim(\mathcal{U}^\perp) = n - k$


Chapter 4

Principal Component Analysis


Given a set of data $\{x_j \in \mathbb{R}^n\}_{j=1}^p$ and an integer $k < n$, we ask if there is a $k$-dimensional subspace onto which we can project the data so that the sum of the squared norms of the residuals is minimized. It turns out that for every $1 \leq k < n$, there is indeed a subspace that minimizes this metric. If the resulting approximation is reasonably accurate, then the original data lies approximately on a $k$-dimensional subspace in $\mathbb{R}^n$. Hence, at the cost of a small approximation error, we gain the benefit of reducing the dimension of the data to $k$. This is one form of linear dimensionality reduction.
There is also a connection with another way of thinking about the data: how is the data spread out about its sample mean? Directions in which the data does not have significant variation could be eliminated, allowing the data to be represented in a lower dimensional subspace. This leads to a core method of dimensionality reduction known as Principal Component Analysis (PCA). It selects a subspace onto which to project the data that maximizes the captured variance of the original data. It turns out that this subspace also minimizes the sum of squared norms of the resulting residuals.

4.1 Preliminaries

We will need a few properties of symmetric matrices. Recall that a matrix $S \in \mathbb{R}^{n\times n}$ is symmetric if $S^T = S$. The eigenvalues and eigenvectors of real symmetric matrices have some special properties.

Lemma 4.1.1. A symmetric matrix $S \in \mathbb{R}^{n\times n}$ has $n$ real eigenvalues and $n$ real orthonormal eigenvectors.

Proof. Let $Sx = \lambda x$ with $x \neq 0$ (possibly complex). Since $S$ is real and symmetric, $S\bar{x} = \bar{\lambda}\bar{x}$, and hence $\bar{x}^TSx = \lambda\bar{x}^Tx = \lambda\|x\|^2$ and $\bar{x}^TSx = (S\bar{x})^Tx = \bar{\lambda}\bar{x}^Tx = \bar{\lambda}\|x\|^2$. Subtracting these expressions and using $x \neq 0$ yields $\lambda = \bar{\lambda}$. Thus $\lambda$ is real. It follows that $x$ can be selected in $\mathbb{R}^n$.
We prove the second claim under the simplifying assumption that $S$ has $n$ distinct eigenvalues. Let $Sx_1 = \lambda_1 x_1$ and $Sx_2 = \lambda_2 x_2$. Then $x_2^TSx_1 = \lambda_1 x_2^Tx_1$ and, using the symmetry of $S$, $x_2^TSx_1 = (Sx_2)^Tx_1 = \lambda_2 x_2^Tx_1$. Subtracting these expressions and using $\lambda_1 \neq \lambda_2$ yields $x_2^Tx_1 = 0$. Thus $x_1 \perp x_2$. For a proof without our simplifying assumption, see Theorem 2.5.6 in Horn and Johnson.

A matrix $P \in \mathbb{R}^{n\times n}$ with the property that for all $x \in \mathbb{R}^n$, $x^TPx \geq 0$ is said to be positive semidefinite. Similarly, $P$ is positive definite if for all $x \neq 0$, $x^TPx > 0$. Without loss of generality, we will always assume $P$ is symmetric. If not, $P$ can be replaced by the symmetric matrix $Q = \tfrac{1}{2}(P + P^T)$ since $x^TQx = \tfrac{1}{2}x^T(P + P^T)x = x^TPx$.
Here is a fundamental property of such matrices.

Lemma 4.1.2. If $P \in \mathbb{R}^{n\times n}$ is symmetric and positive semidefinite (resp. positive definite), then all the eigenvalues of $P$ are real and nonnegative (resp. positive) and the eigenvectors of $P$ can be selected to be real and orthonormal.

Proof. Since $P$ is symmetric all of its eigenvalues are real and it has a set of $n$ real ON eigenvectors. Let $x$ be an eigenvector with eigenvalue $\lambda$. Then $x^TPx = \lambda x^Tx = \lambda\|x\|^2 \geq 0$. Hence $\lambda \geq 0$. If $P$ is PD and $x \neq 0$, then $x^TPx > 0$. Hence $\lambda\|x\|^2 > 0$ and thus $\lambda > 0$.

4.2 Centering Data

The sample mean of a set of data $\{x_j \in \mathbb{R}^n\}_{j=1}^p$ is the vector $\mu = \frac{1}{p}\sum_{j=1}^p x_j$. By subtracting $\mu$ from each $x_j$, forming $y_j = x_j - \mu$, we translate the data vectors so that the new sample mean is zero:
$$\frac{1}{p}\sum_{j=1}^p y_j = \frac{1}{p}\sum_{j=1}^p (x_j - \mu) = \mu - \mu = 0.$$
This is called centering the data. We can also express the centering operation in matrix form as follows. Form the data into the matrix $X = [x_1, \dots, x_p] \in \mathbb{R}^{n\times p}$. Then $\mu = \frac{1}{p}X\mathbf{1}$, where $\mathbf{1} \in \mathbb{R}^p$ denotes the vector of all 1's. Let $Y$ denote the corresponding matrix of centered data and $u = (1/\sqrt{p})\mathbf{1}$. Then
$$Y = X - \mu\mathbf{1}^T = X - \tfrac{1}{p}X\mathbf{1}\mathbf{1}^T = X(I - \tfrac{1}{p}\mathbf{1}\mathbf{1}^T) = X(I - uu^T).$$
From this point forward we assume that the data has been centered.
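Centering is a one-liner in matrix form. A minimal numpy sketch with illustrative data, storing data points as columns of X as in the notes:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 3, 100
# Illustrative data columns with a nonzero mean.
X = rng.standard_normal((n, p)) + np.array([[1.0], [2.0], [3.0]])

mu = X.mean(axis=1, keepdims=True)        # sample mean mu = (1/p) X 1
Y = X - mu                                # centered data Y = X - mu 1^T
print(np.allclose(Y.mean(axis=1), 0.0))   # True: centered data has zero sample mean
```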

4.3 Parameterizing the Family of k-Dimensional Subspaces

A subspace $\mathcal{U} \subseteq \mathbb{R}^n$ of dimension $k \leq n$ can be represented by an orthonormal basis for $\mathcal{U}$. However, this representation is not unique since there are infinitely many orthonormal bases for $\mathcal{U}$. Any such basis contains $k$ vectors and these can be arranged into the columns of an $n \times k$ matrix $U = [u_1, \dots, u_k] \in \mathbb{R}^{n\times k}$ with $U^TU = I_k$.
Let $U_1, U_2 \in \mathbb{R}^{n\times k}$ be two orthonormal bases for the same $k$-dimensional subspace $\mathcal{U}$. Since $U_1$ is a basis for $\mathcal{U}$ and every column of $U_2$ lies in $\mathcal{U}$, there must exist a matrix $Q \in \mathbb{R}^{k\times k}$ such that $U_2 = U_1Q$. It follows that $Q = U_1^TU_2$. Using $U_1U_1^TU_2 = U_2$ and $U_2U_2^TU_1 = U_1$, we then have
$$Q^TQ = U_2^TU_1U_1^TU_2 = U_2^TU_2 = I_k, \quad\text{and}\quad QQ^T = U_1^TU_2U_2^TU_1 = U_1^TU_1 = I_k.$$
Hence $Q \in O_k$. So any two orthonormal basis representations $U_1, U_2$ of $\mathcal{U}$ are related by a $k \times k$ orthogonal matrix $Q$: $U_2 = U_1Q$ and $U_1 = U_2Q^T$.

4.4 An Optimal Projection Subspace

We seek a k-dimensional subspace U such that the orthogonal projection of the data onto U minimizes the
sum of squared norms of the residuals. Assuming such a subspace exists, we call it an optimal projection
subspace of dimension k.

Let the columns of $U \in \mathbb{R}^{n\times k}$ be an orthonormal basis for a subspace $\mathcal{U}$. Then the matrix of projected data is $\hat{X} = UU^TX$ and the corresponding matrix of residuals is $X - UU^TX$. Hence we seek to solve:
$$\min_{U \in \mathbb{R}^{n\times k}}\ \|X - UU^TX\|_F^2 \quad \text{s.t. } U^TU = I_k. \tag{4.1}$$
The solution of this problem can't be unique since if $U$ is a solution so is $UQ$ for every $Q \in O_k$. These solutions correspond to different parameterizations of the same subspace. In addition, it is of interest to determine if two distinct subspaces could both be optimal projection subspaces of dimension $k$.
Using standard equalities, the objective function of Problem 4.1 can be rewritten as
$$\|X - UU^TX\|_F^2 = \operatorname{trace}(X^T(I - UU^T)(I - UU^T)X) = \operatorname{trace}(XX^T) - \operatorname{trace}(U^TXX^TU).$$
Hence, letting $P \in \mathbb{R}^{n\times n}$ denote the symmetric positive semidefinite matrix $XX^T$, we can equivalently solve the following problem:
$$\max_{U \in \mathbb{R}^{n\times k}}\ \operatorname{trace}(U^TPU) \quad \text{s.t. } U^TU = I_k. \tag{4.2}$$

Problem 4.2 is a well known problem. For the simplest version with $k = 1$, we have the following standard result.

Theorem 4.4.1 (Horn and Johnson, 4.2.2). Let $P \in \mathbb{R}^{n\times n}$ be a symmetric positive semidefinite matrix with eigenvalues $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_n$. The problem
$$\max_{u \in \mathbb{R}^n}\ u^TPu \quad \text{s.t. } u^Tu = 1 \tag{4.3}$$
has the optimal value $\lambda_1$. This is achieved if and only if $u$ is a unit norm eigenvector of $P$ for the eigenvalue $\lambda_1$.

Proof. We want to maximize $u^TPu$ subject to $u^Tu = 1$. Bring in a Lagrange multiplier $\lambda$ and form the Lagrangian $L(u, \lambda) = u^TPu + \lambda(1 - u^Tu)$. Taking the derivative of this expression with respect to $u$ and setting this equal to zero yields $Pu = \lambda u$. Hence $\lambda$ must be an eigenvalue of $P$ with $u$ a corresponding eigenvector normalized so that $u^Tu = 1$. For such $u$, $u^TPu = \lambda u^Tu = \lambda$. Hence the maximum achievable value of the objective is $\lambda_1$ and this is achieved when $u$ is a corresponding eigenvector of $P$. Conversely, if $u$ is any unit norm eigenvector of $P$ for $\lambda_1$, then $u^TPu = \lambda_1$ and hence $u$ is a solution.
Theorem 4.4.1 can be generalized as follows. However, the proof uses results we have not covered yet.

Theorem 4.4.2 (Horn and Johnson, 4.3.18). Let $P \in \mathbb{R}^{n\times n}$ be a symmetric positive semidefinite matrix with eigenvalues $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_n$. The problem
$$\max_{U \in \mathbb{R}^{n\times k}}\ \operatorname{trace}(U^TPU) \quad \text{s.t. } U^TU = I_k \tag{4.4}$$
has the optimal value $\sum_{j=1}^k \lambda_j$. Moreover, this is achieved if the columns of $U$ are $k$ orthonormal eigenvectors for the largest $k$ eigenvalues $\lambda_1, \dots, \lambda_k$ of $P$.

Proof. Let $P = V\Lambda V^T$ be an eigen-decomposition of $P$ with $V \in O_n$ and $\Lambda \in \mathbb{R}^{n\times n}$ a diagonal matrix with the eigenvalues $\lambda_1 \geq \dots \geq \lambda_n$ listed in decreasing order down the diagonal. We want to maximize $\operatorname{trace}(U^TV\Lambda V^TU) = \operatorname{trace}(\Lambda V^TUU^TV) = \operatorname{trace}(\Lambda WW^T)$, where $W = V^TU \in O_{n\times k}$. This is equivalent to maximizing $\langle\Lambda, WW^T\rangle$ by choice of $W \in O_{n\times k}$, then setting $U = VW$.
The maximization of $\langle\Lambda, WW^T\rangle$ can be solved as follows. Let $Z \in O_{n\times(n-k)}$ be an orthonormal basis for $R(W)^\perp$. Then the matrices $\Lambda$ and $WW^T$ have the following singular value decompositions
$$\Lambda = \begin{bmatrix} I_k & 0 \\ 0 & I_{n-k} \end{bmatrix}\Lambda\begin{bmatrix} I_k & 0 \\ 0 & I_{n-k} \end{bmatrix}^T \quad\text{and}\quad WW^T = \begin{bmatrix} W & Z \end{bmatrix}\begin{bmatrix} I_k & 0 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} W & Z \end{bmatrix}^T.$$
It is a standard result that if we are free to select the left and right singular vectors of a matrix $B$, then the inner product $\langle A, B\rangle$ is maximized when the left and right singular vectors of $B$ are chosen to equal the left and right singular vectors of $A$, respectively. Hence selecting $W = \begin{bmatrix} I_k & 0 \end{bmatrix}^T$ maximizes the inner product $\langle\Lambda, WW^T\rangle$. This gives $U = V_k$, where $V_k$ is the matrix of the first $k$ columns of $V$, and results in the optimal objective value $\sum_{j=1}^k \lambda_j$.
It follows from Theorem 4.4.2 that a solution $U^\star$ to Problem 4.2 is obtained by selecting the columns of $U^\star$ to be a set of orthonormal eigenvectors of $P = XX^T$ corresponding to its $k$ largest eigenvalues. Working backwards, we see that $U^\star$ is then also a solution to Problem 4.1. In both cases, there is nothing special about $U^\star$ beyond the fact that it spans $\mathcal{U}^\star$. Any basis of the form $U^\star Q$ with $Q \in O_k$ spans the same optimal subspace $\mathcal{U}^\star$.
We also note that $\mathcal{U}^\star$ may not be unique. To see this, consider the situation when $\lambda_k = \lambda_{k+1}$. When this holds, the selection of a $k$-th eigenvector in $U^\star$ is not unique.
In summary, a solution to Problem 4.1 can be obtained as follows. Find the $k$ largest eigenvalues of $XX^T$ and a corresponding set of orthonormal eigenvectors $U^\star$. Then over all $k$ dimensional subspaces, $\mathcal{U}^\star = R(U^\star)$ minimizes the sum of the squared norms of the projection residuals. By projecting each $x_j$ to $\hat{x}_j = U^\star(U^\star)^Tx_j$ we obtain a representation of the data as points on $\mathcal{U}^\star$. Moreover, if we now represent $\hat{x}_j$ by its coordinates $y_j = (U^\star)^Tx_j$ with respect to the orthonormal basis $U^\star$, then we have linearly mapped the data into k-dimensional space.
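This recipe can be carried out directly with an eigendecomposition. A minimal numpy sketch (illustrative data and sizes, not from the notes): it computes the top-k eigenvectors of X X^T for centered data and maps each data point to its k coordinates.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, k = 5, 200, 2
X = rng.standard_normal((n, p))
X = X - X.mean(axis=1, keepdims=True)       # center the data

# Eigendecomposition of the symmetric PSD matrix X X^T.
evals, evecs = np.linalg.eigh(X @ X.T)      # eigh returns eigenvalues in ascending order
order = np.argsort(evals)[::-1]             # sort into decreasing order
U_star = evecs[:, order[:k]]                # orthonormal basis of an optimal k-dim subspace

X_hat = U_star @ (U_star.T @ X)             # projected data (still in R^n)
Y = U_star.T @ X                            # k-dimensional coordinates of the projections
print(Y.shape)                              # (k, p)
```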

4.5 An Alternative Viewpoint

We now consider an alternative way to view the same problem. This will give some additional insights
into the solution we have derived.

4.5.1 The Sample Covariance of the Data

The data points $x_j$ are spread out around the sample mean $\mu$. In the case of scalars, to measure the spread around $\mu$ we form the sample variance $\frac{1}{p}\sum_{j=1}^p (x_j - \mu)^2$. However, for vectors the situation is more complicated since variation about the mean can also depend on direction.
We will continue with our assumption that the data has zero sample mean. Hence we examine how the data is spread around the vector 0. Select a unit norm vector $u \in \mathbb{R}^n$ and project $x_j$ onto the line through 0 in the direction $u$. This yields $\hat{x}_j = uu^Tx_j$, $j = 1, \dots, p$. Since the direction is fixed to be $u$, the projected data is effectively specified by the set of scalars $u^Tx_j$. This set of scalars also has zero sample mean:
$$\sum_{j=1}^p u^Tx_j = u^T\sum_{j=1}^p x_j = 0.$$


So the spread of the data in direction $u$ can be quantified by the scalar sample variance
$$\sigma^2(u) = \frac{1}{p}\sum_{j=1}^p (u^Tx_j)^2 = \frac{1}{p}\sum_{j=1}^p (u^Tx_j)(u^Tx_j)^T = u^T\Big(\frac{1}{p}\sum_{j=1}^p x_jx_j^T\Big)u. \tag{4.5}$$
This expresses the variance of the data as a function of the direction $u$.
Let
$$R = \frac{1}{p}\sum_{j=1}^p x_jx_j^T. \tag{4.6}$$
$R$ is called the sample covariance matrix of the (centered) data. The product $x_jx_j^T$ is a real $n\times n$ symmetric matrix formed by the outer product of the $j$-th data point with itself. The sample covariance is the mean of these matrices and hence is also a real $n\times n$ symmetric matrix. More generally, if the data is not centered but has sample mean $\mu$, then the sample covariance is
$$R = \frac{1}{p}\sum_{j=1}^p (x_j - \mu)(x_j - \mu)^T. \tag{4.7}$$

Lemma 4.5.1. The sample covariance matrix $R$ is symmetric positive semidefinite.

Proof. $R$ is clearly symmetric. Positive semidefiniteness follows by noting that for any $x \in \mathbb{R}^n$,
$$x^TRx = x^T\Big(\frac{1}{p}\sum_{j=1}^p x_jx_j^T\Big)x = \frac{1}{p}\sum_{j=1}^p (x^Tx_j)(x_j^Tx) = \frac{1}{p}\sum_{j=1}^p (x_j^Tx)^2 \geq 0.$$

4.5.2 Directions of Maximum Variance

Using $R$ and (4.5) we can concisely express the variance of the data in direction $u$ as
$$\sigma^2(u) = u^TRu. \tag{4.8}$$
Hence the direction $u$ in which the data has maximum sample variance is given by the solution of the problem:
$$\arg\max_{u \in \mathbb{R}^n}\ u^TRu \quad \text{s.t. } u^Tu = 1 \tag{4.9}$$
with $R$ a symmetric positive semidefinite matrix. This is Problem 4.3. By Theorem 4.4.1, the data has maximum variance $\sigma_1^2$ in the direction $v_1$, where $\sigma_1^2 \geq 0$ is the largest eigenvalue of $R$ and $v_1$ is a corresponding unit norm eigenvector of $R$.
We must take care if we want to find two directions with the largest variance. Without any constraint, the second direction can come arbitrarily close to $v_1$ and variance $\sigma_1^2$. One way to prevent this is to constrain the second direction to be orthogonal to the first. Then if we want a third direction, constrain it to be orthogonal to the two previous directions, and so on. In this case, for $k$ orthogonal directions we want to find $U = [u_1, \dots, u_k] \in O_{n\times k}$ to maximize $\sum_{j=1}^k u_j^TRu_j = \operatorname{trace}(U^TRU)$. Hence we want to solve Problem 4.4 with $P = R$. By Theorem 4.4.2, the solution is attained by taking the $k$ directions to be unit norm eigenvectors $v_1, \dots, v_k$ for the largest $k$ eigenvalues of $R$.

By this means we obtain $n$ orthonormal directions of maximum (sample) variance in the data. These directions $v_1, v_2, \dots, v_n$ and the corresponding variances $\sigma_1^2 \geq \sigma_2^2 \geq \dots \geq \sigma_n^2$ are eigenvectors and corresponding eigenvalues of $R$: $Rv_j = \sigma_j^2v_j$, $j = 1, \dots, n$. The vectors $v_j$ are called the principal components of the data, and this decomposition is called Principal Component Analysis (PCA). Let $V$ be the matrix with the $v_j$ as its columns, and $\Sigma^2 = \operatorname{diag}(\sigma_1^2, \dots, \sigma_n^2)$ (note $\sigma_1^2 \geq \sigma_2^2 \geq \dots \geq \sigma_n^2$). Then PCA is an ordered eigen-decomposition of the sample covariance matrix: $R = V\Sigma^2V^T$.
There is a clear connection between PCA and finding a subspace that minimizes the sum of squared norms of the residuals. We can see this by writing
$$R = \frac{1}{p}\sum_{j=1}^p x_jx_j^T = \frac{1}{p}XX^T.$$
So the sample covariance is just a scalar multiple of the matrix $XX^T$. This means that the principal components are just the eigenvectors of $XX^T$ listed in order of decreasing eigenvalues. In particular, the first $k$ principal components are the first $k$ eigenvectors (ordered by eigenvalue) of $XX^T$. This is exactly the orthonormal basis $U^\star$ that defines an optimal $k$-dimensional projection subspace $\mathcal{U}^\star$. So the leading $k$ principal components give a particular orthonormal basis for an optimal $k$-dimensional projection subspace.
A direction in which the data has small variance relative to $\sigma_1^2$ may not be an important direction; after all, the data stays close to the mean in this direction. If one accepts this hypothesis, then the directions of largest variance are the important directions: they capture most of the variability in the data. This suggests that we could select $k < \operatorname{rank}(R)$ and project the data onto the $k$ directions of largest variance.
Let $V_k = [v_1, v_2, \dots, v_k]$. Then the projection onto the span of the columns of $V_k$ is $\hat{x}_j = V_k(V_k^Tx_j)$. The term $y_j = V_k^Tx_j$ gives the coordinates of $\hat{x}_j$ with respect to $V_k$. Then the product $V_ky_j$ synthesizes $\hat{x}_j$ using these coefficients to form the appropriate linear combination of the columns of $V_k$.
Here is a critical observation: since the directions are fixed and known, we don't need to form $\hat{x}_j$. Instead we can simply map $x_j$ to the coordinate vector $y_j \in \mathbb{R}^k$. No information is lost in working with $y_j$ instead of $\hat{x}_j$ since the latter is an invertible linear function of the former. Hence $\{y_j\}_{j=1}^p$ gives a new set of data that captures most of the variation in the original data, and lies in a reduced dimension space ($k \leq \operatorname{rank}(R) \leq n$).
The natural next question is how to select $k$. Clearly this involves a tradeoff between the size of $k$ and the amount of variation in the original data that is captured in the projection. The variance captured by the projection is $\sum_{j=1}^k \sigma_j^2$ and the variance in the residual is $\sum_{j=k+1}^n \sigma_j^2$. Reducing $k$ reduces the captured variance and increases the residual variance. The selection of $k$ thus involves determining how much of the total variance in $X$ needs to be captured in order to successfully use the projected data to complete the analysis or decision task at hand. For example, if the projected data is to be used to learn a classifier, then one needs to select the value of $k$ that yields acceptable (or perhaps best) classifier performance. This could be done using cross-validation.
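A common simple heuristic, complementary to cross-validation, is to pick the smallest k whose leading eigenvalues capture a target fraction of the total variance. A minimal numpy sketch (the 95% threshold is an illustrative choice, not from the notes):

```python
import numpy as np

def choose_k(X, frac=0.95):
    """Smallest k capturing at least `frac` of the total sample variance.

    X holds centered data points as columns, as in the notes. The eigenvalues
    of X @ X.T are p times the sigma_j^2, which does not affect the ratio."""
    evals = np.linalg.eigvalsh(X @ X.T)[::-1]      # sorted in decreasing order
    captured = np.cumsum(evals) / np.sum(evals)    # fraction captured by the top j directions
    return int(np.searchsorted(captured, frac) + 1)

rng = np.random.default_rng(7)
X = rng.standard_normal((10, 500))
X = X - X.mean(axis=1, keepdims=True)
print(choose_k(X, 0.95))
```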

4.6 Problems

4.1. Let $X \in \mathbb{R}^{n\times p}$. Show that the set of nonzero eigenvalues of $XX^T$ is the same as the set of nonzero eigenvalues of $X^TX$.


Chapter 5

Singular Value Decomposition


5.1 Overview

We now discuss in detail a very useful matrix factorization called the singular value decomposition (SVD).
The SVD extends the idea of eigen-decomposition of square matrices to non-square matrices. It is useful
in general, but has specific application in data analysis, dimensionality reduction (PCA), low rank matrix
approximation, and some forms of regression.

5.2 Preliminaries

Recall that for $A \in \mathbb{R}^{m\times n}$, the range of $A$ is the subspace of $\mathbb{R}^m$ defined by $R(A) = \{y : y = Ax, \text{ some } x \in \mathbb{R}^n\} \subseteq \mathbb{R}^m$. So the range of $A$ is the set of all vectors that can be formed as a linear combination of the columns of $A$. The nullspace of $A$ is the subspace of $\mathbb{R}^n$ defined by $N(A) = \{x : Ax = 0\}$. This is the set of all vectors that are mapped to the zero vector in $\mathbb{R}^m$ by $A$.
The following fundamental result from linear algebra will be very useful.

Theorem 5.2.1. Let $A \in \mathbb{R}^{m\times n}$ have nullspace $N(A)$ and range $R(A)$. Then $N(A) = R(A^T)^\perp$.

Proof. Let $x \in N(A)$. Then $Ax = 0$ and $x^TA^T = 0$. So for every $y \in \mathbb{R}^m$, $x^T(A^Ty) = 0$. Thus $x \in R(A^T)^\perp$. This shows that $N(A) \subseteq R(A^T)^\perp$. Now for all subspaces: (a) $(\mathcal{U}^\perp)^\perp = \mathcal{U}$, and (b) $\mathcal{U} \subseteq \mathcal{V}$ implies $\mathcal{V}^\perp \subseteq \mathcal{U}^\perp$ (see Problem 3.5). Applying these properties yields $R(A^T) \subseteq N(A)^\perp$.
Conversely, suppose $x \in R(A^T)^\perp$. Then for all $y \in \mathbb{R}^m$, $x^TA^Ty = 0$. Hence for all $y \in \mathbb{R}^m$, $y^TAx = 0$. This implies $Ax = 0$ and hence that $x \in N(A)$. Thus $R(A^T)^\perp \subseteq N(A)$ and $N(A)^\perp \subseteq R(A^T)$.
We have shown $N(A) \subseteq R(A^T)^\perp$ and $R(A^T)^\perp \subseteq N(A)$. Thus $N(A) = R(A^T)^\perp$ and, equivalently, $N(A)^\perp = R(A^T)$.
The rank of $A$ is the dimension $r$ of the range of $A$. Clearly this equals the number of linearly independent columns in $A$. The rank $r$ is also the number of linearly independent rows of $A$. Thus $r \leq \min(m, n)$. The matrix $A$ is said to be full rank if $r = \min(m, n)$.
An $m \times n$ rank one matrix has the form $yx^T$ where $y \in \mathbb{R}^m$ and $x \in \mathbb{R}^n$ are both nonzero. Notice that for all $w \in \mathbb{R}^n$, $(yx^T)w = y(x^Tw)$ is a scalar multiple of $y$. Moreover, by suitable choice of $w$ we can make this scalar any real value. So $R(yx^T) = \operatorname{span}(y)$ and the rank of $yx^T$ is one.

5.2.1 Induced Norm of a Matrix

The Euclidean norm in $\mathbb{R}^p$ is also known as the 2-norm and is often denoted by $\|\cdot\|_2$. We will henceforth adopt this notation.

The gain of a matrix $A \in \mathbb{R}^{m\times n}$ when acting on a unit norm vector $x \in \mathbb{R}^n$ is given by the norm of the vector $Ax$. This measures the change in the vector magnitude resulting from the application of $A$. More generally, for $x \neq 0$, define the gain by $G(A, x) = \|Ax\|_2/\|x\|_2$, where in the numerator the norm is in $\mathbb{R}^m$, and in the denominator it is in $\mathbb{R}^n$. The maximum gain of $A$ over all $x \in \mathbb{R}^n$ is then:
$$G(A) = \max_{x \neq 0}\ \frac{\|Ax\|_2}{\|x\|_2}.$$
This is called the induced matrix 2-norm of $A$, and is denoted by $\|A\|_2$. It is induced by the Euclidean norms on $\mathbb{R}^n$ and $\mathbb{R}^m$. From the definition of the induced norm we see that
$$\|A\|_2^2 = \max_{\|x\|_2=1} \|Ax\|_2^2 = \max_{\|x\|_2=1} x^T(A^TA)x.$$
Since $A^TA$ is real, symmetric and positive semidefinite, the solution of this problem is to select $x$ to be a unit norm eigenvector of $A^TA$ for the largest eigenvalue. So
$$\|A\|_2 = \sqrt{\lambda_{\max}(A^TA)}. \tag{5.1}$$
Because of this connection with eigenvalues, the induced matrix 2-norm is sometimes also called the spectral norm.
It is easy to check that the induced norm is indeed a norm. It also has the following additional properties.

Lemma 5.2.1. Let $A, B$ be matrices of appropriate size and $x \in \mathbb{R}^n$. Then
1) $\|Ax\|_2 \leq \|A\|_2\|x\|_2$;
2) $\|AB\|_2 \leq \|A\|_2\|B\|_2$.

Proof. Exercise.

Important: the induced matrix 2-norm and the matrix Euclidean norm are distinct norms on $\mathbb{R}^{m\times n}$. Recall, the Euclidean norm on $\mathbb{R}^{m\times n}$ is called the Frobenius norm and is denoted by $\|A\|_F$.
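The max-gain characterization and formula (5.1) are easy to check numerically. A small numpy sketch comparing the induced 2-norm with the largest eigenvalue of A^T A and with the Frobenius norm (illustrative random matrix, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((4, 6))

spectral = np.linalg.norm(A, 2)                       # induced 2-norm ||A||_2
via_eig = np.sqrt(np.linalg.eigvalsh(A.T @ A).max())  # sqrt(lambda_max(A^T A)), eq. (5.1)
frob = np.linalg.norm(A, 'fro')                       # Frobenius norm, a different norm

print(np.isclose(spectral, via_eig))   # True
print(spectral <= frob)                # True: ||A||_2 <= ||A||_F in general
```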

5.3 Singular Value Decomposition

We first present the main SVD result in what is called the compact form. We then give interpretations of the SVD and indicate an alternative version known as the full SVD. After these discussions, we turn our attention to the ideas and constructions that form the foundation of the SVD.

Theorem 5.3.1 (Singular Value Decomposition). Let $A \in \mathbb{R}^{m\times n}$ have rank $r \leq \min\{m, n\}$. Then there exist $U \in \mathbb{R}^{m\times r}$ with $U^TU = I_r$, $V \in \mathbb{R}^{n\times r}$ with $V^TV = I_r$, and a diagonal matrix $\Sigma \in \mathbb{R}^{r\times r}$ with diagonal entries $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_r > 0$, such that
$$A = U\Sigma V^T = \sum_{j=1}^r \sigma_ju_jv_j^T.$$
The positive scalars $\sigma_j$ are called the singular values of $A$. The $r$ orthonormal columns of $U$ are called the left or output singular vectors of $A$, and the $r$ orthonormal columns of $V$ are called the right or input singular vectors of $A$. The conditions $U^TU = I_r$ and $V^TV = I_r$ indicate that $U$ and $V$ have orthonormal columns. But in general, since $U$ and $V$ need not be square matrices, $UU^T \neq I_m$ and $VV^T \neq I_n$ (in general $U$ and $V$ are not orthogonal matrices). Notice also that the theorem does not claim that $U$ and $V$ are unique. We discuss this issue later in the chapter. The decomposition is illustrated in Fig. 5.1.

Figure 5.1: A visualization of the matrices in the compact SVD.

Figure 5.2: A visualization of the three operational steps in the compact SVD. The projection of $x \in \mathbb{R}^n$ onto $N(A)^\perp$ is represented in terms of the basis $v_1, v_2$: here the projection is $\alpha_1v_1 + \alpha_2v_2$. These coordinates are scaled by the singular values. Then the scaled coordinates are transferred to the output space $\mathbb{R}^m$ and used to form the result $y = Ax$ as the linear combination $y = \sigma_1\alpha_1u_1 + \sigma_2\alpha_2u_2$.

Lemma 5.3.1. The matrices $U$ and $V$ in the compact SVD have the following additional properties:
a) The columns of $U$ form an orthonormal basis for the range of $A$.
b) The columns of $V$ form an orthonormal basis for $N(A)^\perp$.

Proof. a) You can see this by writing $Ax = U\Sigma(V^Tx)$ and noting that $\Sigma$ is invertible and $V^Tv_j = e_j$, where $e_j$ is the $j$-th standard basis vector for $\mathbb{R}^r$. So the range of $\Sigma V^T$ is $\mathbb{R}^r$. It follows that $R(U) = R(A)$.
b) By taking transposes and using part a), the columns of $V$ form an ON basis for the range of $A^T$. Then using $N(A)^\perp = R(A^T)$, it follows that the columns of $V$ form an orthonormal basis for $N(A)^\perp$.

The above observations lead to the following operational interpretation of the SVD. For $x \in \mathbb{R}^n$, the operation $V^Tx$ gives the coordinates with respect to $V$ of the orthogonal projection of $x$ onto the subspace $N(A)^\perp$. (The orthogonal projection is $\hat{x} = VV^Tx$.) These $r$ coordinates are then individually scaled using the $r$ diagonal entries of $\Sigma$. Finally, we synthesize the output vector by using the scaled coordinates and the ON basis $U$ for $R(A)$: $y = U\Sigma(V^Tx)$. So the SVD has three steps: (1) an analysis step: $V^Tx$; (2) a scaling step: $\Sigma(V^Tx)$; and (3) a synthesis step: $U\Sigma(V^Tx)$. In particular, when $x = v_k$, $y = Ax = \sigma_ku_k$, $k = 1, \dots, r$. So the $r$ ON basis vectors for $N(A)^\perp$ are mapped to scaled versions of corresponding ON basis vectors for $R(A)$. This is illustrated in Fig. 5.2.
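The analysis-scale-synthesis interpretation can be reproduced with numpy's SVD. A minimal sketch (illustrative matrix, not from the notes); `full_matrices=False` gives thin factors, which are truncated here to the numerical rank r:

```python
import numpy as np

rng = np.random.default_rng(9)
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 7))   # rank at most 3
U, s, Vt = np.linalg.svd(A, full_matrices=False)

r = np.sum(s > 1e-10 * s[0])            # numerical rank
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]   # compact SVD factors

x = rng.standard_normal(7)
coords = Vt @ x                         # (1) analysis: coordinates of the projection onto N(A)^perp
scaled = s * coords                     # (2) scaling by the singular values
y = U @ scaled                          # (3) synthesis in R(A)
print(np.allclose(y, A @ x))            # True: the three steps reproduce Ax
```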

5.3.1 Singular Values and Norms

Recall that the induced matrix 2-norm of $A$ is the maximum gain of $A$ (Section 5.2.1). We show below that this is related to the singular values of the matrix $A$. First note that
$$\|Ax\|_2^2 = \|U\Sigma V^Tx\|_2^2 = x^T(V\Sigma^2V^T)x.$$

Figure 5.3: A visualization of the action of $A$ on the unit sphere in $\mathbb{R}^n$ in terms of its SVD.

Figure 5.4: A visualization of the matrices in the full SVD.

To maximize this expression we select $x$ to be a unit norm eigenvector of $V\Sigma^2V^T$ with maximum eigenvalue. Hence we use $x = v_1$ and achieve $\|Ax\|_2^2 = \sigma_1^2$. So the input direction with the most gain is $v_1$, this appears in the output in the direction $u_1$, and the gain is $\sigma_1$: $Av_1 = \sigma_1u_1$. Hence
$$\|A\|_2 = \sigma_1. \tag{5.2}$$
So the induced 2-norm of $A$ is given by the maximum singular value of $A$.


We can also express the Frobenius norm of matrix in terms of its singular
To see this, let
Pr values.
T
mn
AR
have rank r and write the compact SVD of A in the form A = j=1 j uj vj . In this form we
see that the SVD expresses A as a positive linear combination of the rank one matrices uj vjT , j = 1, . . . , r.
These matrices form an orthonormal set in Rmn :
(
0, if j =
6 k;
<uk vkT , uj vjT > = trace((uk vkT )T (uj vjT )) = trace(uTk uj vjT vk ) =
1, if j = k.
So the SVD is selecting an orthonormal basis of rank one matrices {uj vjT }rj=1 specifically adapted to A,
and expressing A as a positive linear combination of this basis.
P
With these insights, we can apply Pythagorous Theorem to the expression kAk2F = k rj=1 j uj vjT k2F
to obtain:
P
1/2
r
2
kAkF =

.
(5.3)
j=1 j
So the Frobenius norm of A is the Euclidean norm of the singular values of A.
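Equations (5.2) and (5.3) can be verified directly from the singular values. A small numpy sketch (illustrative matrix, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(10)
A = rng.standard_normal((4, 6))
s = np.linalg.svd(A, compute_uv=False)     # singular values sigma_1 >= sigma_2 >= ...

print(np.isclose(np.linalg.norm(A, 2), s[0]))                        # ||A||_2 = sigma_1
print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(s**2))))   # ||A||_F = sqrt(sum sigma_j^2)
```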

5.3.2 The Full SVD

There is a second version of the SVD that is often convenient in various proofs involving the SVD. Often this second version is just called the SVD. However, to emphasize its distinctness from the equally useful compact SVD, we refer to it as a full SVD.
The basic idea is very simple. Let $A = U_c\Sigma_cV_c^T$ be a compact SVD with $U_c \in \mathbb{R}^{m\times r}$, $V_c \in \mathbb{R}^{n\times r}$, and $\Sigma_c \in \mathbb{R}^{r\times r}$. To $U_c$ we add an orthonormal basis for $R(U_c)^\perp$ to form the orthogonal matrix $U = [U_c\ U_c^\perp] \in \mathbb{R}^{m\times m}$. Similarly, to $V_c$ we add an orthonormal basis for $R(V_c)^\perp$ to form the orthogonal matrix $V = [V_c\ V_c^\perp] \in \mathbb{R}^{n\times n}$. To ensure that these extra columns in $U$ and $V$ do not interfere with the factorization of $A$, we form $\Sigma \in \mathbb{R}^{m\times n}$ by padding $\Sigma_c$ with zero entries:
$$\Sigma = \begin{bmatrix} \Sigma_c & 0_{r\times(n-r)} \\ 0_{(m-r)\times r} & 0_{(m-r)\times(n-r)} \end{bmatrix}.$$
We then have a full SVD factorization $A = U\Sigma V^T$. The utility of the full SVD derives from $U$ and $V$ being orthogonal (hence invertible) matrices. The full SVD is illustrated in Fig. 5.4.
If $P$ is a symmetric positive definite matrix, a full SVD of $P$ is simply an eigen-decomposition of $P$: $U\Sigma V^T = Q\Lambda Q^T$, where $Q$ is the orthogonal matrix of eigenvectors of $P$. In this sense, the SVD extends the eigen-decomposition by using different orthonormal sets of vectors in the input and output spaces.
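numpy exposes both shapes through the `full_matrices` flag. A small sketch comparing the two (illustrative matrix, not from the notes); note that `full_matrices=False` returns min(m, n) columns rather than the rank-r compact form, which suffices here:

```python
import numpy as np

rng = np.random.default_rng(11)
A = rng.standard_normal((5, 3))

# Thin factors: U is 5x3, Vt is 3x3, s holds min(m, n) singular values.
Uc, sc, Vct = np.linalg.svd(A, full_matrices=False)
# Full SVD: U is 5x5 and Vt is 3x3 orthogonal; Sigma is 5x3 with s on its diagonal.
U, s, Vt = np.linalg.svd(A, full_matrices=True)
Sigma = np.zeros((5, 3))
np.fill_diagonal(Sigma, s)

print(Uc.shape, U.shape, Vt.shape)        # (5, 3) (5, 5) (3, 3)
print(np.allclose(U @ Sigma @ Vt, A))     # True: A = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(5)))    # U is orthogonal in the full SVD
```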

5.4 Inner Workings of the SVD

We now give a quick overview of where the matrices $U$, $V$ and $\Sigma$ of the SVD come from. Let $A \in \mathbb{R}^{m\times n}$ have rank $r$. So the range of $A$ has dimension $r$ and the nullspace of $A$ has dimension $n - r$.
Let $B = A^TA \in \mathbb{R}^{n\times n}$. Since $B$ is a symmetric positive semi-definite (PSD) matrix, it has nonnegative eigenvalues and a full set of orthonormal eigenvectors. Order the eigenvalues in decreasing order: $\sigma_1^2 \geq \sigma_2^2 \geq \dots \geq \sigma_n^2 \geq 0$ and let $v_j$ denote the eigenvector for $\sigma_j^2$. So
$$Bv_j = \sigma_j^2v_j, \qquad j = 1, \dots, n.$$
Noting that $Ax = 0$ if and only if $Bx = 0$, we see that the null space of $B$ also has dimension $n - r$. It follows that $n - r$ of the eigenvectors of $B$ must lie in $N(A)$ and $r$ must lie in $N(A)^\perp$. Hence
$$\sigma_1^2 \geq \dots \geq \sigma_r^2 > 0 \qquad\text{and}\qquad \sigma_{r+1}^2 = \dots = \sigma_n^2 = 0,$$
with $v_1, \dots, v_r$ an orthonormal basis for $N(A)^\perp$.


Now consider $C = AA^T \in \mathbb{R}^{m\times m}$. This is also symmetric and PSD. Hence $C$ has nonnegative eigenvalues and a full set of orthonormal eigenvectors. Order the eigenvalues in decreasing order: $\gamma_1^2 \geq \gamma_2^2 \geq \dots \geq \gamma_m^2 \geq 0$ and let $u_j$ denote the eigenvector for $\gamma_j^2$. So
$$Cu_j = \gamma_j^2u_j, \qquad j = 1, \dots, m.$$
Since $R(A^T) = N(A)^\perp$, the dimension of $R(A^T)$ is $r$, and that of $N(A^T)$ is $m - r$. By the same reasoning as above, $m - r$ of the eigenvectors of $C$ must lie in $N(A^T)$ and $r$ must lie in $R(A)$. Hence
$$\gamma_1^2 \geq \dots \geq \gamma_r^2 > 0 \qquad\text{and}\qquad \gamma_{r+1}^2 = \dots = \gamma_m^2 = 0,$$
with $u_1, \dots, u_r$ an orthonormal basis for $R(A)$.



Now we show a relationship between $\sigma_j^2$, $\gamma_j^2$ and the corresponding eigenvectors $v_j$, $u_j$, for $j = 1, \dots, r$. First consider $Bv_j = \sigma_j^2v_j$ with $\sigma_j^2 > 0$. Then
$$C(Av_j) = (AA^T)(Av_j) = A(A^TAv_j) = A(Bv_j) = \sigma_j^2(Av_j).$$
So either $Av_j = 0$, or $Av_j$ is an eigenvector of $C$ with eigenvalue $\sigma_j^2$. The first case, $Av_j = 0$, contradicts $A^TAv_j = \sigma_j^2v_j$ with $\sigma_j^2 > 0$ since $Av_j = 0$ implies $(A^TA)v_j = 0$. Hence $Av_j$ must be an eigenvector of $C$ with eigenvalue $\sigma_j^2$. Assume for simplicity that the positive eigenvalues of $A^TA$ and $AA^T$ are distinct. Then for some $k$, with $1 \leq k \leq r$:
$$\sigma_j^2 = \gamma_k^2 \qquad\text{and}\qquad Av_j = \beta u_k, \text{ with } \beta > 0.$$
We can take $\beta > 0$ by swapping $u_k$ for $-u_k$ if necessary. Using this result we find
$$v_j^TBv_j = \begin{cases} \sigma_j^2v_j^Tv_j = \sigma_j^2; \\ (Av_j)^T(Av_j) = \beta^2u_k^Tu_k = \beta^2. \end{cases}$$
So we must have $\beta = \sigma_j$ and
$$Av_j = \sigma_ju_k.$$
Now do the same analysis for $Cu_k = \gamma_k^2u_k$ with $\gamma_k^2 > 0$. This yields
$$B(A^Tu_k) = (A^TA)(A^Tu_k) = A^T(AA^Tu_k) = \gamma_k^2(A^Tu_k).$$
Since $\gamma_k^2 > 0$, we can't have $A^Tu_k = 0$. So $A^Tu_k$ is an eigenvector of $A^TA$ with eigenvalue $\gamma_k^2$. Under the assumption of distinct nonzero eigenvalues, this implies that for some $p$ with $1 \leq p \leq r$,
$$\gamma_k^2 = \sigma_p^2 \qquad\text{and}\qquad A^Tu_k = \delta v_p, \text{ some } \delta \neq 0.$$
Using this expression to evaluate $u_k^TCu_k$ we find $\gamma_k^2 = \delta^2$. Hence $\delta^2 = \gamma_k^2 = \sigma_p^2$ and $A^Tu_k = \delta v_p$.
We now have two ways to evaluate $A^TAv_j$:
$$A^TAv_j = \begin{cases} \sigma_j^2v_j & \text{by definition;} \\ \sigma_jA^Tu_k = \sigma_j\delta v_p & \text{using the above analysis.} \end{cases}$$
Equating these answers gives $j = p$ and $\sigma_j\delta = \sigma_j^2$. Since $\sigma_j > 0$, it follows that $\delta > 0$ and $\delta = \sigma_j = \sigma_p = \gamma_k$. Thus $Av_j = \sigma_ju_j$, $j = 1, \dots, r$. Written in matrix form this is almost the compact SVD:
$$A\begin{bmatrix} v_1 & \cdots & v_r \end{bmatrix} = \begin{bmatrix} u_1 & \cdots & u_r \end{bmatrix}\begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_r \end{bmatrix}.$$
From this we deduce that $AVV^T = U\Sigma V^T$. $VV^T$ computes the orthogonal projection of $x$ onto $N(A)^\perp$. Hence for every $x \in \mathbb{R}^n$, $AVV^Tx = Ax$. Thus $AVV^T = A$, and we have $A = U\Sigma V^T$.
Finally, note that $\sigma_j = \sqrt{\lambda_j(A^TA)} = \sqrt{\lambda_j(AA^T)}$, $j = 1, \dots, r$. So the singular values are always unique. If the singular values are distinct, the SVD is unique up to sign interchanges between the $u_j$ and $v_j$. But this still leaves the representation $A = \sum_{j=1}^r \sigma_ju_jv_j^T$ unique. If the singular values are not distinct, then $U$ and $V$ are not unique. For example, $I_n = UI_nU^T$ for every orthogonal matrix $U$.

5.5 Problems

5.1. Let $A \in \mathbb{R}^{n\times n}$ be a square invertible matrix with SVD $A = \sum_{j=1}^n \sigma_ju_jv_j^T$. Show that $A^{-1} = \sum_{j=1}^n (1/\sigma_j)v_ju_j^T$.
