

Numerical Linear Algebra

This self-contained introduction to Numerical Linear Algebra provides a comprehensive, yet concise, overview of the subject. It includes standard material such as direct methods for solving linear systems and least-squares problems, error, stability and conditioning, basic iterative methods and the calculation of eigenvalues. Later chapters cover more advanced material, such as Krylov subspace methods, multigrid methods, domain decomposition methods, multipole expansions, hierarchical matrices and compressed sensing.
The book provides rigorous mathematical proofs throughout, and gives algorithms in a general-purpose, language-independent form. Requiring only a solid knowledge of linear algebra and basic analysis, this book will be useful for applied mathematicians, engineers, computer scientists and all those interested in efficiently solving linear problems.

HOLGER WENDLAND holds the Chair of Applied and Numerical Analysis at the University of Bayreuth. He works in the area of Numerical Analysis and is the author of two other books, Scattered Data Approximation (Cambridge, 2005) and Numerische Mathematik (Springer, 2004, with Robert Schaback).


Cambridge Texts in Applied Mathematics


All titles listed below can be obtained from good booksellers or from Cambridge
University Press. For a complete series listing, visit www.cambridge.org/mathematics.

Nonlinear Dispersive Waves
MARK J. ABLOWITZ
Flow, Deformation and Fracture
G. I. BARENBLATT
Hydrodynamic Instabilities
FRANÇOIS CHARRU
The Mathematics of Signal Processing
STEVEN B. DAMELIN & WILLARD MILLER, JR
Introduction to Magnetohydrodynamics (2nd Edition)
P. A. DAVIDSON
An Introduction to Stochastic Dynamics
JINQIAO DUAN
Singularities: Formation, Structure and Propagation
J. EGGERS & M. A. FONTELOS
A Physical Introduction to Suspension Dynamics
ÉLISABETH GUAZZELLI & JEFFREY F. MORRIS
Discrete Systems and Integrability
J. HIETARINTA, N. JOSHI & F. W. NIJHOFF
Iterative Methods in Combinatorial Optimization
LAP CHI LAU, R. RAVI & MOHIT SINGH
An Introduction to Polynomial and Semi-Algebraic Optimization
JEAN BERNARD LASSERRE
An Introduction to Computational Stochastic PDEs
GABRIEL J. LORD, CATHERINE E. POWELL & TONY SHARDLOW


Numerical Linear Algebra


An Introduction

HOLGER WENDLAND
Universität Bayreuth, Germany


University Printing House, Cambridge CB2 8BS, United Kingdom


One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
4843/24, 2nd Floor, Ansari Road, Daryaganj, Delhi – 110002, India
79 Anson Road, #06-04/06, Singapore 079906

Cambridge University Press is part of the University of Cambridge.


It furthers the University’s mission by disseminating knowledge in the pursuit of
education, learning, and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/9781107147133
DOI: 10.1017/9781316544938
© Holger Wendland 2018

This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2018
Printed in the United Kingdom by Clays, St Ives plc
A catalogue record for this publication is available from the British Library.
ISBN 978-1-107-14713-3 Hardback
ISBN 978-1-316-60117-4 Paperback
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party internet Web sites referred to in this publication,
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.


Contents

Preface page ix

PART ONE PRELIMINARIES 1


1 Introduction 3
1.1 Examples Leading to Linear Systems 5
1.2 Notation 10
1.3 Landau Symbols and Computational Cost 13
1.4 Facts from Linear Algebra 17
1.5 Singular Value Decomposition 24
1.6 Pseudo-inverse 26
Exercises 29
2 Error, Stability and Conditioning 30
2.1 Floating Point Arithmetic 30
2.2 Norms for Vectors and Matrices 32
2.3 Conditioning 46
2.4 Stability 54
Exercises 55

PART TWO BASIC METHODS 57


3 Direct Methods for Solving Linear Systems 59
3.1 Back Substitution 59
3.2 Gaussian Elimination 61
3.3 LU Factorisation 65
3.4 Pivoting 71
3.5 Cholesky Factorisation 76


3.6 QR Factorisation 78
3.7 Schur Factorisation 84
3.8 Solving Least-Squares Problems 87
Exercises 100
4 Iterative Methods for Solving Linear Systems 101
4.1 Introduction 101
4.2 Banach’s Fixed Point Theorem 102
4.3 The Jacobi and Gauss–Seidel Iterations 106
4.4 Relaxation 116
4.5 Symmetric Methods 125
Exercises 130
5 Calculation of Eigenvalues 132
5.1 Basic Localisation Techniques 133
5.2 The Power Method 141
5.3 Inverse Iteration by von Wielandt and Rayleigh 143
5.4 The Jacobi Method 153
5.5 Householder Reduction to Hessenberg Form 159
5.6 The QR Algorithm 162
5.7 Computing the Singular Value Decomposition 171
Exercises 180

PART THREE ADVANCED METHODS 181


6 Methods for Large Sparse Systems 183
6.1 The Conjugate Gradient Method 183
6.2 GMRES and MINRES 203
6.3 Biorthogonalisation Methods 226
6.4 Multigrid 244
Exercises 258
7 Methods for Large Dense Systems 260
7.1 Multipole Methods 261
7.2 Hierarchical Matrices 282
7.3 Domain Decomposition Methods 307
Exercises 327
8 Preconditioning 329
8.1 Scaling and Preconditioners Based on Splitting 331
8.2 Incomplete Splittings 338
8.3 Polynomial and Approximate Inverse Preconditioners 346


8.4 Preconditioning Krylov Subspace Methods 357


Exercises 368
9 Compressed Sensing 370
9.1 Sparse Solutions 370
9.2 Basis Pursuit and Null Space Property 372
9.3 Restricted Isometry Property 378
9.4 Numerical Algorithms 384
Exercises 393

Bibliography 395
Index 403


Preface

Numerical Linear Algebra (NLA) is a subarea of Applied Mathematics. It is mainly concerned with the development, implementation and analysis of numerical algorithms for solving linear problems. In general, such linear problems arise when discretising a continuous problem by restricting it to a finite-dimensional subspace of the original solution space. Hence, the development and analysis of numerical algorithms is almost always problem-dependent. The more that is known about the underlying problem, the better a suitable algorithm can be developed.
Nonetheless, many of the so-derived methods are more general in the sense that they can be applied to larger classes of problems than initially intended. One of the challenges in Mathematics is deciding how to describe the necessary assumptions, under which a certain method works, in the most general way. In the context of NLA, this means finding for each method the most general description of matrices to which the method can be applied. It also means extracting the most general methods from the vast number of available algorithms. Particularly for users with new problems this is crucial, as it allows them to apply and test well-established algorithms first, before starting to develop new methods or to extend existing ones.
In this book, I have attempted to use this matrix-driven approach rather than the problem-driven one. Naturally, the selection of the material is biased by my own point of view. Also, a book on NLA without any examples would be rather dire, so there are typical examples and applications included to illustrate the methods, but I have tried to restrict myself to simple examples, which do not require much previous knowledge of specific problems and discretisation techniques.
During the past years, I have given courses on Numerical Linear Algebra at advanced BSc and early MSc level at the University of Sussex (UK), the University of Oxford (UK) and the University of Bayreuth (Germany). I have also


given courses on Numerical Analysis which covered parts of the NLA material
in Oxford, Göttingen (Germany) and Bayreuth.
This book on Numerical Linear Algebra is based on these courses and their material. It covers the standard material, as well as more recent and more specific techniques, which are usually not found in standard textbooks on NLA. Examples include the multigrid method, the domain decomposition method, multipole expansions, hierarchical matrices and applications to compressed or compressive sensing. The material on each of these topics fills entire books, so that I can obviously present only a selection. However, this selection should allow readers to grasp the underlying ideas of each topic and enable them to understand current research in these areas.
Each chapter of this book contains a small number of theoretical exercises. However, to really understand NLA one has to implement the algorithms oneself and test them on some of the matrices from the examples. Hence, the most important exercises intrinsic to each chapter are to implement and test the proposed algorithms.
All algorithms are stated in clean pseudo-code; no programming language is preferred. This, I hope, allows readers to use the programming language of their choice and hence yields the greatest flexibility.
Finally, NLA is obviously closely related to Linear Algebra. However, this is not a book on Linear Algebra, and I expect readers to have a solid knowledge of Linear Algebra. Though I will review some of the material, particularly to introduce the notation, terms like linear space, linear mapping, determinant, etc. should be well known.



PART ONE

PRELIMINARIES

1
Introduction

In this book, we are concerned with the basic problem of solving a linear system
\[
Ax = b,
\]
where A ∈ R^{n×n} is a given invertible matrix, b ∈ R^n is a given vector and x ∈ R^n is the solution we seek. The solution is, of course, given by
\[
x = A^{-1} b,
\]
but does this really help if we are interested in actually computing the solution vector x ∈ R^n? What are the problems we are facing? First of all, such linear systems have a certain background. They are the results of other mathematical steps. Usually, they are at the end of a long processing chain which starts with setting up a partial differential equation to model a real-world problem, continues with discretising this differential equation using an appropriate approximation space and method, and results in such a linear system. This is important because it often tells us something about the structure of the matrix. The matrix might be symmetric or sparse. It is also important since it tells us something about the size n of the matrix. With simulations becoming more and more complex, this number nowadays easily becomes larger than a million; even values of several hundreds of millions are not unusual. Hence, the first obstacle that we encounter is the size of the matrix. Obviously, for larger dimensions n it is not possible to solve a linear system by hand. This means we need an algorithmic description of the solution process and a computer to run our program.
Unfortunately, using a computer leads to our second obstacle. We cannot represent real numbers accurately on a computer because of the limited number system used by a computer. Even worse, each calculation that we do might lead to a number which is not representable in the computer's number system.

Hence, we have to address questions like: Is a matrix that is invertible in the real numbers also invertible in the number system used by a computer? What are the errors that we make when representing the matrix in the computer and when using our algorithm to compute the solution? Further questions that easily come up are as follows.
1. How expensive is the algorithm? How much time (and space) does it require to solve the problem? What is the best way of measuring the cost of an algorithm?
2. How stable is the algorithm? If we slightly change the input, i.e. the matrix A and/or the right-hand side b, how does this affect the solution?
3. Can we exploit the structure of the matrix A, if it has a special structure?
4. What happens if we do not have a square system, i.e. a matrix A ∈ R^{m×n} and a vector b ∈ R^m? If m > n then we have an over-determined system and usually cannot find a (unique) solution but might still be interested in something which comes close to a solution. If m < n we have an under-determined system and we need to choose from several possible solutions.
Besides solving a linear system, we will also be interested in a related topic,
the computation of eigenvalues and eigenvectors of a matrix. This means we
are interested in finding numbers λ ∈ C and vectors x ∈ Cn \ {0} such that
Ax = λx.
Finding such eigenvectors and eigenvalues is again motivated by applications.
For example, in structural mechanics a vibrating system is represented by finite
elements and the eigenvectors of the corresponding discretisation matrix reflect
the shape modes and the roots of the eigenvalues reflect the frequencies with
which the system is vibrating. But eigenvalues will also be helpful in better
understanding some of the questions above. For example, they have a crucial
influence on the stability of an algorithm.
In this book, we are mainly interested in systems of real numbers, simply because they arise naturally in most applications. However, as the problem of finding eigenvalues indicates, it is sometimes necessary to consider complex-valued systems as well. Fortunately, most of our algorithms and findings will carry over from the real to the complex case in a straightforward way.
We will look at direct and iterative methods to solve linear systems. Direct methods compute the solution in a finite number of steps; iterative methods construct a sequence of approximations to the solution.
We will look at how efficient and stable these methods are. The former means that we are interested in how much time and computer memory they require. Particular emphasis will be placed on the number of floating point operations


required with respect to the dimension of the linear system. The latter means
for example investigating whether these methods converge at all, under what
conditions they converge and how they respond to small changes in the input
data.

1.1 Examples Leading to Linear Systems


As mentioned above, linear systems arise naturally during the discretisation process of mathematical models of real-world problems. Here, we want to collect three examples leading to linear systems. These examples are our model problems, which we will refer to frequently in the rest of this book. They comprise the problem of interpolating an unknown function only known at discrete data sites, the solution of a one-dimensional boundary value problem with finite differences and the solution of a (one-dimensional) integral equation with a Galerkin method. We have chosen these three examples because they are simple and easily explained, yet they are significant enough and each of them represents a specific class of problems. In particular, the second problem leads to a linear system with a matrix A which has a very simple structure. This matrix will serve us as a role model for testing and investigating most of our methods since it is simple to analyse yet complicated enough to demonstrate the advantages and drawbacks of the method under consideration.

1.1.1 Interpolation
Suppose we are given data sites X = {x_1, . . . , x_n} ⊆ R^d and observations f_1, . . . , f_n ∈ R. Suppose further that the observations follow an unknown generation process, i.e. there is a function f such that f(x_i) = f_i, 1 ≤ i ≤ n.
One possibility to approximately reconstruct the unknown function f is to choose basis functions φ_1, . . . , φ_n ∈ C(R^d) and to approximate f by a function s of the form
\[
s(x) = \sum_{j=1}^{n} \alpha_j \varphi_j(x), \qquad x \in \mathbb{R}^d,
\]
where the coefficients are determined by the interpolation conditions
\[
f_i = s(x_i) = \sum_{j=1}^{n} \alpha_j \varphi_j(x_i), \qquad 1 \le i \le n.
\]


This leads to a linear system, which can be written in matrix form as
\[
\begin{pmatrix}
\varphi_1(x_1) & \varphi_2(x_1) & \cdots & \varphi_n(x_1) \\
\varphi_1(x_2) & \varphi_2(x_2) & \cdots & \varphi_n(x_2) \\
\vdots & \vdots & & \vdots \\
\varphi_1(x_n) & \varphi_2(x_n) & \cdots & \varphi_n(x_n)
\end{pmatrix}
\begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_n \end{pmatrix}
=
\begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_n \end{pmatrix}.
\tag{1.1}
\]
From standard Numerical Analysis courses we know this topic usually in the setting that the dimension is d = 1, that the points are ordered a ≤ x_1 < x_2 < · · · < x_n ≤ b and that the basis is given as a basis for the space of polynomials of degree at most n − 1. This basis could be the basis of monomials φ_i(x) = x^{i−1}, 1 ≤ i ≤ n, in which case the matrix in (1.1) becomes the transpose of a so-called Vandermonde matrix, i.e. a matrix of the form
\[
\begin{pmatrix}
1 & x_1 & x_1^2 & \cdots & x_1^{n-1} \\
1 & x_2 & x_2^2 & \cdots & x_2^{n-1} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_n & x_n^2 & \cdots & x_n^{n-1}
\end{pmatrix}.
\]
This matrix is a full matrix, meaning that each entry is different from zero, so that the determination of the interpolant requires the solution of a linear system with a full matrix.
However, we also know, from basic Numerical Analysis, that we could alternatively choose the so-called Lagrange functions as a basis:
\[
\varphi_j(x) = L_j(x) = \prod_{\substack{i=1 \\ i \ne j}}^{n} \frac{x - x_i}{x_j - x_i}, \qquad 1 \le j \le n.
\]
They obviously have the property L_j(x_j) = 1 and L_j(x_i) = 0 for j ≠ i. Thus, with this basis, the matrix in (1.1) simply becomes the identity matrix and the interpolant can be derived without solving a linear system at all.

Figure 1.1 Typical radial basis functions: Gaussian, inverse multiquadric and a compactly supported one (from left to right).

In higher dimensions, i.e. d ≥ 2, polynomial interpolation can become quite problematic and a more elegant way employs a basis of the form φ_i = Φ(· − x_i), where Φ : R^d → R is a fixed function. In most applications, this function is chosen to be radial, i.e. it is of the form Φ(x) = φ(‖x‖_2), where φ : [0, ∞) → R is a univariate function and ‖x‖_2 = (x_1^2 + · · · + x_d^2)^{1/2} denotes the Euclidean norm. Examples of possible univariate functions are
\[
\begin{aligned}
\text{Gaussian:} \quad & \varphi(r) = \exp(-r^2), \\
\text{Multiquadric:} \quad & \varphi(r) = (r^2 + 1)^{1/2}, \\
\text{Inverse Multiquadric:} \quad & \varphi(r) = (r^2 + 1)^{-1/2}, \\
\text{Compactly Supported:} \quad & \varphi(r) = (1 - r)_+^4 (4r + 1),
\end{aligned}
\]
where (x)_+ is defined to be x if x ≥ 0 and to be 0 if x < 0. The functions are visualised in Figure 1.1.
In all these cases, except for the multiquadric basis function, it is known that the resulting interpolation matrix is positive definite (with the restriction of d ≤ 3 for the compactly supported function). Such functions are therefore called positive definite. More generally, a good choice of a basis is given by φ_j(x) = K(x, x_j) with a kernel K : R^d × R^d → R, which is positive definite in the sense that for all possible, pairwise distinct points x_1, . . . , x_n ∈ R^d, the matrix (K(x_i, x_j)) is symmetric and positive definite.
In the case of the multiquadric basis function, it is known that the interpolation matrix is invertible and has only real, non-vanishing eigenvalues and that all but one of these eigenvalues are negative.
Note that in the case of the inverse multiquadric and the Gaussian the matrices are dense, while in the case of the compactly supported basis function the matrix can have a lot of zeros, depending on the distribution of the data sites. Details on this topic can be found in Wendland [133].
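For illustration, the interpolation system (1.1) for the Gaussian basis function can be set up and solved in a few lines. The following sketch uses Python with NumPy (any language would do, cf. Section 1.2.2) and random data sites in R^2; it is only an illustration, not a recommended implementation.

import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 50
X = rng.random((n, d))                      # data sites x_1, ..., x_n in R^d
f = np.sin(2 * np.pi * X[:, 0]) * X[:, 1]   # observations f_i = f(x_i)

# Interpolation matrix A_ij = phi(||x_i - x_j||_2) with the Gaussian phi(r) = exp(-r^2)
r = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
A = np.exp(-r**2)

alpha = np.linalg.solve(A, f)               # coefficients of s(x) = sum_j alpha_j phi(||x - x_j||_2)

# Evaluating the interpolant at the data sites reproduces f up to rounding
s = A @ alpha
print(np.max(np.abs(s - f)))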

1.1.2 Boundary Value Problem


Another application is to compute a stationary solution to the heat equation. In
one dimension, we could imagine an infinitely thin rod of length one, which is
heated in the interior of (0, 1) with a heat source f and is kept at zero degrees
at the boundary points 0, 1. Mathematically, this means that we want to find a
function u : [0, 1] → R with

\[
-u''(x) = f(x), \qquad x \in (0,1),
\]
with boundary conditions u(0) = u(1) = 0. If the function f is too complicated or even given only at discrete points then it is not possible to compute the
solution u analytically. In this case a numerical scheme has to be used and the


simplest idea is to approximate the derivative by differences:
\[
u'(x) \approx \frac{u(x+h) - u(x)}{h}, \qquad\text{or}\qquad u'(x) \approx \frac{u(x) - u(x-h)}{h}.
\]
The first rule could be referred to as a forward rule while the second is a backward rule. Using first a forward and then a backward rule for the second derivative leads to
\[
u''(x) \approx \frac{u'(x+h) - u'(x)}{h} \approx \frac{1}{h}\left( \frac{u(x+h) - u(x)}{h} - \frac{u(x) - u(x-h)}{h} \right)
= \frac{u(x+h) - 2u(x) + u(x-h)}{h^2}.
\]
For finding a numerical approximation using such finite differences we may divide the domain [0, 1] into n + 1 pieces of equal length h = 1/(n + 1) with nodes
\[
x_i = ih = \frac{i}{n+1}, \qquad 0 \le i \le n+1,
\]
and set u_i := u(x_i) and f_i := f(x_i). We now define the finite difference approximation u^h to u as follows: find u^h such that u^h_0 = u^h_{n+1} = 0 and
\[
-\frac{u^h_{i+1} - 2u^h_i + u^h_{i-1}}{h^2} = f_i, \qquad 1 \le i \le n.
\]

Alternatively this linear system of n equations may be written in the form
\[
\frac{1}{h^2}
\begin{pmatrix}
2 & -1 & & & 0 \\
-1 & 2 & -1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & -1 & 2 & -1 \\
0 & & & -1 & 2
\end{pmatrix}
\begin{pmatrix} u^h_1 \\ u^h_2 \\ \vdots \\ u^h_{n-1} \\ u^h_n \end{pmatrix}
=
\begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_{n-1} \\ f_n \end{pmatrix}.
\tag{1.2}
\]

This system of equations is sparse, meaning that the number of non-zero entries is much smaller than n^2. This sparsity can be used to store the matrix and to implement matrix–vector and matrix–matrix multiplications efficiently.
To obtain an accurate approximation to u, we may have to choose h very small, thereby increasing the size of the linear system.
For a general boundary value problem in d dimensions the size of the linear system can grow rapidly. For example, three-dimensional problems grow over eight times larger with each uniform refinement of the domain.
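For illustration, a short Python/NumPy sketch that assembles and solves (1.2) for a given right-hand side f might look as follows; the matrix is stored densely here only for simplicity.

import numpy as np

def solve_bvp(f, n):
    """Solve -u'' = f on (0,1) with u(0) = u(1) = 0 by the finite
    difference scheme (1.2) on n interior nodes."""
    h = 1.0 / (n + 1)
    x = np.arange(1, n + 1) * h                     # interior nodes x_1, ..., x_n
    # tridiagonal matrix (1/h^2) * tridiag(-1, 2, -1), stored densely for simplicity
    A = (np.diag(2.0 * np.ones(n)) +
         np.diag(-np.ones(n - 1), 1) +
         np.diag(-np.ones(n - 1), -1)) / h**2
    return x, np.linalg.solve(A, f(x))

# Example: f(x) = pi^2 sin(pi x) has exact solution u(x) = sin(pi x)
x, uh = solve_bvp(lambda x: np.pi**2 * np.sin(np.pi * x), 100)
print(np.max(np.abs(uh - np.sin(np.pi * x))))       # discretisation error, O(h^2)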


1.1.3 Integral Equations


In the last section we have introduced a way of solving a differential equation.
A differential equation can also be recast as an integral equation but integral
equations often also come up naturally during the modelling process. Hence,
let us consider a typical integral equation as another example.
We now seek a function u : [0, 1] → R satisfying
\[
\int_0^1 \log(|x-y|)\, u(y)\, dy = f(x), \qquad x \in [0,1],
\]

where f : [0, 1] → R is given. Note that the integral on the left-hand side
contains the kernel K(x, y) := log(|x − y|), which is singular on the diagonal
x = y.
To solve this integral equation numerically we will use a Galerkin approximation. The idea here is to choose an approximate solution u_n from a fixed, finite-dimensional subspace V = span{φ_1, . . . , φ_n} and to test the approximate solution via
\[
\int_0^1 \int_0^1 \log(|x-y|)\, u_n(y)\, dy\, \varphi_i(x)\, dx = \int_0^1 f(x)\varphi_i(x)\, dx, \qquad 1 \le i \le n. \tag{1.3}
\]
Since we choose u_n ∈ V it must have a representation u_n = \sum_{j=1}^n c_j φ_j with certain coefficients c_j. Inserting this representation into (1.3) and changing the order of summation and integration yields
\[
\sum_{j=1}^{n} c_j \int_0^1 \int_0^1 \log(|x-y|)\, \varphi_j(y)\varphi_i(x)\, dy\, dx = \int_0^1 f(x)\varphi_i(x)\, dx, \qquad 1 \le i \le n,
\]
which we easily identify as a linear system Ac = f with the matrix A having entries
\[
a_{ij} = \int_0^1 \int_0^1 \log(|x-y|)\, \varphi_j(y)\varphi_i(x)\, dy\, dx, \qquad 1 \le i, j \le n.
\]

A typical choice for the space V is the space of piecewise constant functions. To be more precise, we can choose
\[
\varphi_i(x) = \begin{cases} 1 & \text{if } \frac{i-1}{n} \le x < \frac{i}{n}, \\ 0 & \text{else}, \end{cases}
\]
but other basis functions and approximation spaces are possible. But we note that particularly in this case the matrix A is once again a full matrix as its entries are given by
\[
a_{ij} = \int_{(i-1)/n}^{i/n} \int_{(j-1)/n}^{j/n} \log(|x-y|)\, dy\, dx.
\]


An obvious generalisation of this problem to arbitrary domains Ω ⊆ R^d leads to matrix entries of the form
\[
a_{ij} = \int_\Omega \int_\Omega K(x,y)\, \varphi_i(x)\varphi_j(y)\, dy\, dx
\]
with a given kernel K : Ω × Ω → R.
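For illustration, the entries a_ij for the piecewise constant basis can be approximated with a generic quadrature routine. The following Python/SciPy sketch is only a rough illustration: in practice the weakly singular kernel would be integrated with a dedicated rule, and the small shift added to the argument of the logarithm is merely a crude way to avoid evaluating log(0) on the diagonal.

import numpy as np
from scipy.integrate import dblquad

def galerkin_matrix(n):
    """Approximate a_ij = int int log|x-y| phi_j(y) phi_i(x) dy dx for
    piecewise constant basis functions on [0, 1]."""
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # integrate over the cell [i/n, (i+1)/n] x [j/n, (j+1)/n];
            # the 1e-15 shift crudely avoids log(0) on the diagonal
            val, _ = dblquad(lambda y, x: np.log(abs(x - y) + 1e-15),
                             i / n, (i + 1) / n,      # x-limits
                             j / n, (j + 1) / n)      # y-limits
            A[i, j] = val
    return A

A = galerkin_matrix(10)
print(A.shape, A[0, 0])   # a full (dense) 10 x 10 matrix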

1.2 Notation
Now, it is time to set up the notation which we will use throughout this book.
However, we will specify only the most basic notation and definitions here and
introduce further concepts whenever required.

1.2.1 Mathematics
We will denote the real, complex, natural and integer numbers as usual with R, C, N and Z, respectively. The natural numbers will not include zero. We will use the notation x ∈ R^n to denote vectors. The components of x will be denoted by x_j ∈ R, i.e. x = (x_1, . . . , x_n)^T. Thus vectors will always be column vectors. We will denote the unit standard basis of R^n by e_1, . . . , e_n, where the ith unit vector e_i has only zero entries except for a one at position i. In general, we will suppress the dimension n when it comes to this basis and we might use the same notation to denote the ith unit vector for R^n and, say, R^m. It should be clear from the context which one is meant. On R^n we will denote the inner product between two vectors x and y by either x^T y or ⟨x, y⟩_2, i.e.
\[
x^T y = \langle x, y \rangle_2 = \sum_{j=1}^{n} x_j y_j.
\]
For a matrix A with m rows, n columns and real entries we will write A ∈ R^{m×n} and A = (a_{ij}), where the index i refers to the rows and the index j refers to the columns:
\[
A := (a_{ij}) := \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & & & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{pmatrix}.
\]
For a non-square matrix A ∈ R^{m×n}, we can write a_{ij} = e_i^T A e_j, where the first unit vector is from R^m while the second unit vector is from R^n.


We will use the Kronecker δ-symbol δ_{ij}, which is defined as
\[
\delta_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{if } i \ne j. \end{cases}
\]
For the identity matrix in R^n we will use the symbol I ∈ R^{n×n}. We obviously have I = (δ_{ij})_{1 ≤ i,j ≤ n}. Again, as in the case of the unit vectors, we will usually refrain from explicitly indicating the dimension n of the underlying space R^n.
We will also denote the columns of a matrix A ∈ R^{m×n} by a_j := Ae_j ∈ R^m, 1 ≤ j ≤ n. Hence, we have
\[
A = (a_1, a_2, \ldots, a_n).
\]
For a matrix A ∈ R^{m×n} and a vector x ∈ R^n, we can write x = \sum_j x_j e_j and hence
\[
Ax = \sum_{j=1}^{n} x_j a_j.
\]

We will encounter specific forms of matrices and want to use the following, well-known names.

Definition 1.1 A matrix A = (a_{ij}) ∈ R^{m×n} is

• a square matrix, if m = n,
• a diagonal matrix, if a_{ij} = 0 for i ≠ j,
• an upper triangular matrix, if a_{ij} = 0 for i > j,
• a lower triangular matrix, if a_{ij} = 0 for i < j,
• a band matrix, if there are k, ℓ ∈ N_0 such that a_{ij} = 0 if j < i − k or j > i + ℓ,
• sparse, if more than half of the entries are zero,
• dense or full, if it is not sparse.

In the case of a diagonal matrix A with diagonal entries a_{ii} = λ_i, we will also use the notation
\[
A = \operatorname{diag}(\lambda_1, \ldots, \lambda_n).
\]
In particular, we have for the identity matrix I = diag(1, . . . , 1) ∈ R^{n×n}.

Most of these names are self-explanatory. In the case of a band matrix, we have all entries zero outside a diagonally bordered band. Only those entries a_{ij} with indices i − k ≤ j ≤ i + ℓ may be different from zero. This means we have at most k sub-diagonals and ℓ super-diagonals with non-zero entries. The most prominent example is given by k = ℓ = 1, which has one super-diagonal and one sub-diagonal of non-zero entries and is hence called a tridiagonal matrix.


Schematically, upper triangular, lower triangular and tridiagonal matrices look as follows:
\[
\begin{pmatrix}
\ast & \ast & \cdots & \ast \\
 & \ast & & \vdots \\
 & & \ddots & \ast \\
0 & & & \ast
\end{pmatrix},
\qquad
\begin{pmatrix}
\ast & & & 0 \\
\ast & \ast & & \\
\vdots & & \ddots & \\
\ast & \cdots & \ast & \ast
\end{pmatrix},
\qquad
\begin{pmatrix}
\ast & \ast & & & 0 \\
\ast & \ast & \ast & & \\
 & \ddots & \ddots & \ddots & \\
 & & \ast & \ast & \ast \\
0 & & & \ast & \ast
\end{pmatrix},
\]
where a ∗ marks a possible non-zero entry.
For a matrix A = (a_{ij}) ∈ R^{m×n} we denote the transpose of A by A^T. It is given by exchanging columns and rows of A, i.e. A^T = (a_{ji}) ∈ R^{n×m}. A square matrix A ∈ R^{n×n} is said to be symmetric if A^T = A. If the matrix A is invertible, we have (A^{-1})^T = (A^T)^{-1}, which we will simply denote by A^{-T}. If A is symmetric and invertible then the inverse A^{-1} is also symmetric. Both the transpose and the inverse satisfy the rules
\[
(AB)^T = B^T A^T, \qquad (AB)^{-1} = B^{-1} A^{-1},
\]
as long as these operations are well-defined.

1.2.2 Algorithms
We will not use a specific computing language to describe the algorithms in this book. However, we will assume that the reader is familiar with basic programming techniques. In particular, we expect the reader to know what a for and a while loop are. We will use if, then and else in the usual way and, when assigning a new value to a variable x, this variable might also appear on the right-hand side of the assignment, i.e. such an assignment can, for example, be of the form x := x + y, which means that x and y are first evaluated and the sum of their values is then assigned to x. For each algorithm we will declare the essential input data and the output data. There will, however, be no explicit return statement. Each algorithm should also have a deterministic stopping criterion. To demonstrate this, a first example of an algorithm is given in Algorithm 1, which computes the inner product s := x^T y of the two vectors x, y ∈ R^n.
At the beginning, we will formulate algorithms very close to actual programs using only basic operations. An implementation within a modern computing language should be straightforward. Later on, when it comes to more sophisticated algorithms, we will use higher-level mathematical notation to compress the representation of the algorithm. For example, an inner product will then only appear as s := x^T y.


Algorithm 1: Inner product
Input:  x, y ∈ R^n.
Output: s = x^T y.
1: s := 0
2: for j = 1 to n do
3:   s := s + x_j y_j
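For illustration, a direct transcription of Algorithm 1 into Python could look as follows; in practice one would of course call an optimised library routine such as numpy.dot.

def inner_product(x, y):
    """Compute s = x^T y following Algorithm 1."""
    assert len(x) == len(y)
    s = 0.0
    for xj, yj in zip(x, y):   # loop over j = 1, ..., n
        s += xj * yj
    return s

print(inner_product([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))   # 32.0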

In this book, we will not discuss low-level data structures, i.e. ways of storing a
vector or a matrix within a computer. We will assume that the reader is familiar
with the concepts of arrays, which are usually used to store (full) matrices, and
index lists, which can be used to store sparse matrices.

1.3 Landau Symbols and Computational Cost


Before developing algorithms to solve linear equations, we will introduce concepts to analyse the cost of such algorithms and their stability. Though, of course, it is possible to compare the actual run-times of two algorithms on a computer, the actual run-time is not a particularly good measure. It is more important to understand how the computational cost of an algorithm changes with the number of unknowns.
The multiplication of a matrix A ∈ R^{m×n} with a vector x ∈ R^n results in a vector b = Ax ∈ R^m with components
\[
b_i = (Ax)_i = \sum_{j=1}^{n} a_{ij} x_j, \qquad 1 \le i \le m.
\]

The multiplication of a matrix A ∈ R^{m×n} with a matrix B ∈ R^{n×p} gives a matrix C = AB ∈ R^{m×p}, with entries
\[
c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}, \qquad 1 \le i \le m, \; 1 \le j \le p.
\]
So, how much does such a computation cost? Usually, the cost is measured in flops, which stands for floating point operations. A floating point operation consists of one addition plus one multiplication. Sometimes, it is helpful to distinguish between additions and multiplications and count them separately. This was particularly true when a multiplication on a computer was substantially more expensive than an addition. However, as this is no longer the case


on modern computers, where multiplications are realised as efficiently as additions, we will stick to the above definition of flops. It is nonetheless important to note that while subtractions are as efficient as additions and multiplications, this is not true for divisions, which are significantly slower.
Most of the time, we will not be interested in the actual number of floating point operations but rather in the asymptotic behaviour with respect to the dimension.
For example, if we look at a matrix–vector multiplication b = Ax, then we must for every index i compute the sum
\[
\sum_{j=1}^{n} a_{ij} x_j,
\]
which means we have n multiplications and n − 1 additions, i.e. n floating point operations. Hence, if we double the size of the matrix, we would require twice as many flops for each component. This, however, would also be true if the actual computing cost were cn with a positive constant c > 0. The total cost of the matrix–vector multiplication becomes mn since we have m entries to compute.
If we are not interested in the constant c > 0 then we will use the following notation.

Definition 1.2 (Landau notation) For two functions f, g : N^d → R, we will write
\[
f(n) = O(g(n))
\]
if there is a constant c > 0 such that
\[
|f(n)| \le c\,|g(n)|, \qquad n \in \mathbb{N}^d.
\]

It is important to see that the constant has to be independent of the argument n ∈ N^d. Moreover, though in most cases we will ignore the constant c in our considerations, a huge c > 0 can mean that we will never, or only for very large n, see the actual asymptotic behaviour.
With this definition, we can say that matrix–vector multiplication of a matrix A ∈ R^{m×n} with a vector x ∈ R^n costs
\[
\text{time}(Ax) = O(mn).
\]
In the case of a square matrix, m = n, the cost is therefore O(n^2), which means that doubling the input size of the matrix and the vector, i.e. replacing n by 2n, will require four times the time of the original matrix–vector multiplication.


We can also use this notation to analyse the space required to store the information on a computer. We will have to store each matrix entry a_{ij} and each entry of x as well as the result Ax. This requires
\[
O(mn + n + m) = O(mn)
\]
space. When developing algorithms it is important to consider both resources, time and space, and it might sometimes be necessary to sacrifice something of one resource to gain in the other.
It is now easy to see that for the matrix–matrix multiplication C = AB, we would require O(mnp) operations. Hence, for square systems with m = n = p the time is O(n^3) and doubling the input size results in computations that are eight times longer.
Let us summarise our findings so far, with some obvious additions.

Lemma 1.3 Let A ∈ R^{m×n}, B ∈ R^{n×p}, x ∈ R^n and α ∈ R.

• It costs O(n) space to store the vector x ∈ R^n and O(mn) space to store the matrix A.
• It costs O(n) time to compute the product αx.
• It costs O(mn) time to compute the product Ax.
• It costs O(mnp) time to compute the product AB.

The cost is called linear if it is O(n), quadratic if it is O(n^2) and cubic if it is O(n^3).
More sophisticated matrix–matrix products have been developed with the goal of reducing the computational cost by reducing the number of multiplications at the cost of a mild increase in the number of additions. The most famous one is a recursive algorithm by Strassen (see [120]) which can compute the product of two n × n matrices using at most 4.7 · n^{log_2 7} = O(n^{2.807355}) flops. Since then, other such algorithms have been introduced, most notably one by Coppersmith and Winograd in [37] which reduces the cost to O(n^{2.375477}). The latter algorithm is, however, more complicated and the constant hidden in the O-notation is substantially larger, so that the algorithm is not used in practice. Though superior in this context, even the Strassen algorithm is not seriously used in practical applications, which is due to the fact that it is not as stable as the conventional scheme, see Higham [84].
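For illustration, Strassen's recursion can be sketched in a few lines of Python/NumPy; the sketch assumes that n is a power of two and falls back to the conventional product below a cut-off size.

import numpy as np

def strassen(A, B, cutoff=64):
    """Multiply two n x n matrices with Strassen's recursion (n a power of two)."""
    n = A.shape[0]
    if n <= cutoff:                      # conventional O(n^3) product for small blocks
        return A @ B
    m = n // 2
    A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]
    # seven recursive products instead of eight
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4,           M1 - M2 + M3 + M6]])

A = np.random.rand(256, 256); B = np.random.rand(256, 256)
print(np.max(np.abs(strassen(A, B) - A @ B)))   # small, but larger than the round-off of A @ B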
Finally, let us see how additional information can be used to reduce the cost.
If, for example, we have a tridiagonal matrix, i.e. a matrix which has only non-zero entries on the diagonal and the sub- and super-diagonal, i.e. which is of


the form
\[
A = \begin{pmatrix}
a_{11} & a_{12} & 0 & \cdots & 0 \\
a_{21} & a_{22} & a_{23} & & \vdots \\
0 & \ddots & \ddots & \ddots & 0 \\
\vdots & & a_{n-1,n-2} & a_{n-1,n-1} & a_{n-1,n} \\
0 & \cdots & 0 & a_{n,n-1} & a_{nn}
\end{pmatrix},
\]
then a matrix–vector multiplication reduces to
\[
(Ax)_i = \sum_{j=1}^{n} a_{ij} x_j = a_{i,i-1} x_{i-1} + a_{ii} x_i + a_{i,i+1} x_{i+1}.
\]
Hence, the time for the full matrix–vector product is now only O(n) instead of O(n^2). Also the matrix can be stored, by exploiting its special form, in O(n) space instead of O(n^2).
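For illustration, a Python/NumPy sketch which stores only the three diagonals and performs the matrix–vector product in O(n) time might look as follows.

import numpy as np

def tridiag_matvec(lower, diag, upper, x):
    """y = A x for a tridiagonal A stored by its three diagonals:
    lower = (a_21, ..., a_{n,n-1}), diag = (a_11, ..., a_nn),
    upper = (a_12, ..., a_{n-1,n}).  Costs O(n) time and O(n) storage."""
    y = diag * x
    y[:-1] += upper * x[1:]    # a_{i,i+1} x_{i+1}
    y[1:]  += lower * x[:-1]   # a_{i,i-1} x_{i-1}
    return y

n = 5
lower = -np.ones(n - 1); diag = 2.0 * np.ones(n); upper = -np.ones(n - 1)
x = np.arange(1.0, n + 1)
print(tridiag_matvec(lower, diag, upper, x))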
We will sometimes encounter algorithms which have a recursive structure. This
means, for example, that solving a problem with problem size n is reduced to
solving problems with problem size n/b, where b > 1. The Strassen algorithm
mentioned above is one such example. To analyse the cost of such an algorithm
it is often helpful to assume that n = b^k with some k ∈ N. The following result
will be helpful.

Lemma 1.4 The recursion T(1) = c and T(n) = aT(n/b) + cn^α with c, a, α > 0 and b > 1 satisfies
\[
T(n) \le
\begin{cases}
c\,\dfrac{b^\alpha}{b^\alpha - a}\, n^\alpha & \text{if } a < b^\alpha, \\[1ex]
c\, n^\alpha (\log_b n + 1) & \text{if } a = b^\alpha, \\[1ex]
c\,\dfrac{a}{a - b^\alpha}\, n^{\log_b a} & \text{if } a > b^\alpha.
\end{cases}
\]

Proof As mentioned above, we will prove this only for n = b^k. Though the results remain true for general n, the proof is more tedious in the general case.
Using induction, it is easy to see that
\[
T(b^k) = c \sum_{i=0}^{k} a^i (b^\alpha)^{k-i}.
\]
Hence, if a < b^\alpha, we have
\[
T(b^k) = c\, b^{k\alpha} \sum_{i=0}^{k} \left( \frac{a}{b^\alpha} \right)^i
= c\, n^\alpha\, \frac{1 - \left( \frac{a}{b^\alpha} \right)^{k+1}}{1 - \frac{a}{b^\alpha}}
< c\, n^\alpha\, \frac{b^\alpha}{b^\alpha - a}.
\]


For a = b^\alpha we have
\[
T(b^k) = c\, n^\alpha (k + 1) = c\, n^\alpha (\log_b n + 1)
\]
and for a > b^\alpha we finally use
\[
T(b^k) = c\, a^k \sum_{i=0}^{k} \left( \frac{b^\alpha}{a} \right)^i
= c\, a^k\, \frac{1 - \left( \frac{b^\alpha}{a} \right)^{k+1}}{1 - \frac{b^\alpha}{a}}
\le c\, a^k\, \frac{a}{a - b^\alpha}
= c\, \frac{a}{a - b^\alpha}\, n^{\log_b a},
\]
where we have used
\[
a^k = e^{k \log a} = \exp\!\left( \frac{\log a}{\log b}\, k \log b \right) = (b^k)^{\frac{\log a}{\log b}} = n^{\log_b a}. \qquad\square
\]
Finally, let us mention that the definition of the Landau symbol O can even be further generalised in the following sense.

Definition 1.5 For two functions f, g : R^n → R and x_0 ∈ R^n, we will write
\[
f(x) = O(g(x)), \qquad x \to x_0,
\]
if there is a constant c > 0 and a neighbourhood U = U(x_0) ⊆ R^n of x_0 such that
\[
|f(x)| \le c\,|g(x)|, \qquad x \in U.
\]

1.4 Facts from Linear Algebra


In this section, we want to collect further material on matrices and vectors, which should be known from classical, basic linear algebra courses. The character of this section is more to remind the reader of the material and to introduce the notation. However, we will also prove some results which are less familiar.
In R^n we have the canonical inner product defined by ⟨x, y⟩_2 := x^T y for all x, y ∈ R^n. It particularly satisfies ⟨x, x⟩_2 = x_1^2 + · · · + x_n^2 > 0 for all x ≠ 0. As mentioned above, we will use both notations ⟨x, y⟩_2 and x^T y equally. The canonical inner product defines a canonical norm or length ‖x‖_2 := √⟨x, x⟩_2, the Euclidean norm. This norm has the usual properties of a norm, which can more generally be defined for an arbitrary linear space. Of course, we assume the reader to be familiar with the concepts of linear spaces, linear subspaces, linearly independent vectors and the dimension of a linear space.
Definition 1.6 Let V be a real (or complex) linear space. A mapping ‖·‖ : V → [0, ∞) is called a norm on V if it satisfies
1. homogeneity: ‖λx‖ = |λ| ‖x‖ for all x ∈ V and λ ∈ R (or λ ∈ C),


2. definiteness: ‖x‖ = 0 if and only if x = 0,
3. triangle inequality: ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x, y ∈ V.

The space V with norm ‖·‖ is called a normed space.

In the case of V = R^n and ‖·‖ = ‖·‖_2, the first two properties follow immediately from the definition, while the triangle inequality follows from the Cauchy–Schwarz inequality
\[
|\langle x, y \rangle_2| \le \|x\|_2 \|y\|_2,
\]
which holds for all x, y ∈ R^n and where equality occurs if and only if y is a scalar multiple of x. This all remains true in the more general situation when the norm is defined by an inner product.
Definition 1.7 Let V be a real linear space. A mapping ⟨·, ·⟩ : V × V → R is called an inner product if it is
1. symmetric: ⟨x, y⟩ = ⟨y, x⟩ for all x, y ∈ V,
2. linear: ⟨αx + βy, z⟩ = α⟨x, z⟩ + β⟨y, z⟩ for all x, y, z ∈ V and α, β ∈ R,
3. definite: ⟨x, x⟩ > 0 for all x ∈ V \ {0}.

A space V with an inner product ⟨·, ·⟩ : V × V → R is called a pre-Hilbert space.

The second property together with the first property indeed guarantees that the mapping ⟨·, ·⟩ is bilinear, i.e. it is also linear in the second argument. Each pre-Hilbert space becomes a normed space upon defining the canonical norm
\[
\|x\| := \sqrt{\langle x, x \rangle}.
\]
If V is a complex linear space then an inner product on V is again a mapping ⟨·, ·⟩ : V × V → C but the first two conditions above have to be modified appropriately. For example, the first condition becomes ⟨x, y⟩ = \overline{⟨y, x⟩} for all x, y ∈ V, where \overline{α} is the complex conjugate of α ∈ C. The second property must now hold for all α, β ∈ C, which means that the inner product is linear in its first and anti-linear in its second component. The canonical norm is then defined as before.
Throughout this book, we will mainly be concerned with the space V = R^n and hence introduce most concepts only for this space. But it is worth noting that some of them immediately carry over to more general spaces.
As usual, for given x_1, . . . , x_n ∈ V, we use the notation span{x_1, . . . , x_n} to denote the linear subspace of V spanned by these elements, i.e.
\[
\operatorname{span}\{x_1, \ldots, x_n\} = \left\{ \sum_{j=1}^{n} \alpha_j x_j : \alpha_1, \ldots, \alpha_n \in \mathbb{R} \right\}.
\]


Definition 1.8 Let V be a finite-dimensional pre-Hilbert space with basis x_1, . . . , x_n. The basis is called an orthonormal basis if all basis vectors have unit length, i.e. ‖x_i‖ = ⟨x_i, x_i⟩^{1/2} = 1, and two different vectors are orthogonal, i.e. ⟨x_i, x_j⟩ = 0 for i ≠ j.

It is always possible to transform a given basis into an orthonormal basis.

Lemma 1.9 (Gram–Schmidt) Let x_1, . . . , x_n be linearly independent vectors of a pre-Hilbert space V. Define u_1 = x_1/‖x_1‖ and then, for k = 1, 2, . . . , n − 1,
\[
\tilde u_{k+1} := x_{k+1} - \sum_{j=1}^{k} \langle x_{k+1}, u_j \rangle u_j, \qquad
u_{k+1} := \tilde u_{k+1} / \|\tilde u_{k+1}\|.
\]
Then, the set {u_1, . . . , u_k} forms an orthonormal basis of span{x_1, . . . , x_k} for 1 ≤ k ≤ n.

The Gram–Schmidt orthonormalisation process should be well-known and we leave a proof of the above lemma to the reader. Though the Gram–Schmidt procedure is obviously easily implemented, it is numerically often problematic. We will later discuss better methods of finding orthonormal bases.
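For illustration, the classical Gram–Schmidt procedure of Lemma 1.9 for the Euclidean inner product of R^n can be sketched in Python/NumPy as follows; the loss of orthogonality for nearly linearly dependent input vectors is exactly the numerical problem alluded to above.

import numpy as np

def gram_schmidt(X):
    """Classical Gram-Schmidt: the columns of X are x_1, ..., x_n;
    returns U with orthonormal columns spanning the same space."""
    U = np.zeros_like(X, dtype=float)
    for k in range(X.shape[1]):
        u = X[:, k] - U[:, :k] @ (U[:, :k].T @ X[:, k])   # subtract projections onto u_1, ..., u_{k-1}
        U[:, k] = u / np.linalg.norm(u)
    return U

X = np.random.rand(6, 3)
U = gram_schmidt(X)
print(np.max(np.abs(U.T @ U - np.eye(3))))   # close to zero: columns are orthonormal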

Definition 1.10 Let A ∈ Rm×n be given. Its null space is defined to be the set

ker(A) = {x ∈ Rn : Ax = 0} ⊆ Rn

and its range to be

range(A) = {Ax : x ∈ Rn } ⊆ Rm .

The rank of A ∈ Rm×n is defined as

rank(A) = dim{Ax : x ∈ Rn } = dim range(A).

It is easy to see that the null space of a matrix A ∈ R^{m×n} is a subspace of R^n, while the range of A is a subspace of R^m.

Definition 1.11 Let A ∈ R^{n×n} be a square matrix.

1. The matrix is called positive semi-definite if, for all x ∈ R^n, we have
\[
x^T A x = \sum_{i=1}^{n} \sum_{j=1}^{n} x_i x_j a_{ij} \ge 0.
\]
It is called positive definite if the above expression is positive for all x ≠ 0.
2. The matrix is called orthogonal if A^T A = I.


3. The matrix is called diagonalisable if there exist an invertible matrix S ∈ R^{n×n} (or more generally S ∈ C^{n×n}) and a diagonal matrix D ∈ R^{n×n} (∈ C^{n×n}) such that A = S D S^{-1}.

If a matrix A ∈ R^{n×n} is symmetric then there is an orthonormal basis of R^n consisting of eigenvectors of A. This means that there are real values λ_1, . . . , λ_n ∈ R and vectors w_1, . . . , w_n ∈ R^n \ {0}, which satisfy
\[
A w_j = \lambda_j w_j, \qquad \langle w_j, w_k \rangle_2 = \delta_{jk}.
\]

The eigenvectors form an orthogonal matrix Q = (w_1, . . . , w_n) ∈ R^{n×n} and hence we have the following result.

Proposition 1.12 If A ∈ R^{n×n} is symmetric then there is an orthogonal matrix Q ∈ R^{n×n} such that Q^T A Q = D is a diagonal matrix with the eigenvalues of A as diagonal entries.
A symmetric matrix is positive definite (positive semi-definite) if and only if all its eigenvalues are positive (non-negative).

As a consequence, a symmetric and positive definite matrix possesses a square root.

Corollary 1.13 If A ∈ R^{n×n} is symmetric and positive (semi-)definite then there is a symmetric and positive (semi-)definite matrix A^{1/2} ∈ R^{n×n} such that A = A^{1/2} A^{1/2}. The matrix A^{1/2} is called a square root of A.

Proof According to Proposition 1.12, we can write A = Q D Q^T with a diagonal matrix D = diag(λ_1, . . . , λ_n) and an orthogonal matrix Q. The diagonal entries of D are non-negative. Hence, we can define the diagonal matrix D^{1/2} := diag(√λ_1, . . . , √λ_n) and then A^{1/2} := Q D^{1/2} Q^T. This matrix is obviously symmetric and, because of
\[
x^T A^{1/2} x = x^T Q D^{1/2} Q^T x = y^T D^{1/2} y = \sum_{j=1}^{n} y_j^2 \sqrt{\lambda_j} \ge 0,
\]
also positive (semi-)definite, where we have set y := Q^T x. Finally, we have
\[
A^{1/2} A^{1/2} = (Q D^{1/2} Q^T)(Q D^{1/2} Q^T) = Q D^{1/2} D^{1/2} Q^T = Q D Q^T = A. \qquad\square
\]
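For illustration, the construction from the proof of Corollary 1.13 translates directly into a short Python/NumPy sketch, using the routine numpy.linalg.eigh for the eigendecomposition of a symmetric matrix.

import numpy as np

def sqrtm_spd(A):
    """Square root A^(1/2) = Q D^(1/2) Q^T of a symmetric positive
    (semi-)definite matrix A, as in the proof of Corollary 1.13."""
    lam, Q = np.linalg.eigh(A)                    # A = Q diag(lam) Q^T
    return Q @ np.diag(np.sqrt(np.maximum(lam, 0.0))) @ Q.T

B = np.random.rand(4, 4)
A = B.T @ B + np.eye(4)          # symmetric positive definite test matrix
R = sqrtm_spd(A)
print(np.max(np.abs(R @ R - A))) # close to zero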

We also remark that for a matrix A ∈ R^{m×n}, the matrix A^T A ∈ R^{n×n} is obviously symmetric and positive semi-definite, simply because of
\[
x^T (A^T A) x = (Ax)^T (Ax) = \|Ax\|_2^2 \ge 0.
\]

As a matter of fact, each symmetric and positive definite matrix can be written


in this form, see Exercise 1.5. An orthogonal matrix is always invertible and
its rows and columns, respectively, form an orthonormal basis of Rn .
Sometimes it is necessary to consider a generalised eigenvalue problem. This
occurs, for example, when discussing discretised systems coming from elasticity theory. Later on, it will also be important when we look at preconditioning.

Definition 1.14 Let A, B ∈ R^{n×n} be matrices. Then, λ ∈ C is a generalised eigenvalue with respect to A, B if there is a non-zero vector x ∈ C^n such that
\[
Ax = \lambda Bx.
\]
The vector x is called a generalised eigenvector.

If the matrix B is invertible, finding a generalised eigenvalue and eigenvector is equivalent to finding a classical eigenvalue and eigenvector of the matrix B^{-1}A. However, even if A and B are symmetric, i.e. both have only real eigenvalues and a complete set of eigenvectors, the matrix B^{-1}A is in general not symmetric. Nonetheless, we have the following result.

Theorem 1.15 Let A, B ∈ R^{n×n} be symmetric and positive definite. Then, all eigenvalues of B^{-1}A are real and positive. Moreover, there exist n pairs of generalised eigenvalues λ_j and eigenvectors v_j, 1 ≤ j ≤ n, satisfying Av_j = λ_j Bv_j and
\[
v_i^T B v_j = v_i^T A v_j = 0, \qquad i \ne j.
\]
The matrix V = (v_1, . . . , v_n) diagonalises the matrices A and B simultaneously. If the eigenvectors are normalised such that v_i^T B v_i = 1, 1 ≤ i ≤ n, then
\[
V^T B V = I, \qquad V^T A V = \operatorname{diag}(\lambda_1, \ldots, \lambda_n).
\]

Proof As mentioned above, a generalised eigenvalue is simply an eigenvalue of B^{-1}A. Since B is symmetric and positive definite, it has a symmetric and positive definite root B^{1/2} with inverse denoted by B^{-1/2}. The matrix B^{-1/2} A B^{-1/2} is symmetric and positive definite since we have on the one hand
\[
(B^{-1/2} A B^{-1/2})^T = (B^{-1/2})^T A^T (B^{-1/2})^T = B^{-1/2} A B^{-1/2}
\]
and on the other hand for x ∈ R^n:
\[
x^T B^{-1/2} A B^{-1/2} x = (B^{-1/2} x)^T A (B^{-1/2} x) > 0,
\]
unless B^{-1/2} x = 0, which is equivalent to x = 0. Thus, all eigenvalues of


B^{-1/2} A B^{-1/2} are positive. Next, for λ ∈ C we have
\[
\det(B^{-1/2} A B^{-1/2} - \lambda I)
= \det(B^{-1/2} A - \lambda B^{1/2}) \det(B^{-1/2})
= \det(B^{-1/2}) \det(B^{-1/2} A - \lambda B^{1/2})
= \det(B^{-1} A - \lambda I),
\]
meaning that B^{-1/2} A B^{-1/2} and B^{-1}A have exactly the same eigenvalues.
Finally, let λ_1, . . . , λ_n be the eigenvalues of B^{-1/2} A B^{-1/2} with corresponding orthonormal eigenvectors w_1, . . . , w_n. If we set v_j := B^{-1/2} w_j then we have
\[
B^{-1/2} A v_j = B^{-1/2} A B^{-1/2} w_j = \lambda_j w_j = \lambda_j B^{1/2} v_j,
\]
or Av_j = λ_j B v_j. Moreover, we have
\[
v_j^T B v_k = w_j^T B^{-1/2} B B^{-1/2} w_k = w_j^T w_k = \delta_{jk}
\]
and
\[
v_j^T A v_k = w_j^T B^{-1/2} A B^{-1/2} w_k = \lambda_k \delta_{jk}. \qquad\square
\]
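In practice, such generalised symmetric-definite eigenvalue problems are solved by library routines. For illustration, the SciPy routine scipy.linalg.eigh accepts the pair (A, B) directly and returns B-orthonormal eigenvectors, exactly as in Theorem 1.15; a short sketch:

import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
M = rng.random((5, 5)); A = M.T @ M + 5 * np.eye(5)   # symmetric positive definite
N = rng.random((5, 5)); B = N.T @ N + 5 * np.eye(5)   # symmetric positive definite

lam, V = eigh(A, B)                  # solves A v = lambda B v
print(lam)                           # real and positive
print(np.max(np.abs(V.T @ B @ V - np.eye(5))))            # V^T B V = I
print(np.max(np.abs(V.T @ A @ V - np.diag(lam))))         # V^T A V = diag(lambda)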

In several places we will require the notion of orthogonal projection. Again, we define things immediately in a more general setting though we will mainly be interested in the case of R^n.

Definition 1.16 Let V be a pre-Hilbert space. A linear mapping P : V → V is called a projection if P^2 = P. A linear mapping P : V → U ⊆ V is called the orthogonal projection onto U if
\[
\langle x - Px, u \rangle = 0, \qquad x \in V, \; u \in U. \tag{1.4}
\]

It is easy to see that a linear mapping P : V → V is a projection if and only if Pu = u for all u ∈ range(P) = {Px : x ∈ V}. Moreover, an orthogonal projection is indeed a projection since for x ∈ U we also have x − Px ∈ U and hence, by the orthogonality condition, ‖x − Px‖^2 = ⟨x − Px, x − Px⟩ = 0, i.e. Px = x for all x ∈ U.
If u_1, . . . , u_k ∈ U form an orthonormal basis of the finite-dimensional subspace U ⊆ V, then the orthogonal projection onto U is given by
\[
Px = \sum_{j=1}^{k} \langle x, u_j \rangle u_j. \tag{1.5}
\]

Obviously, this so-defined P is linear and maps onto U. To see that it is also the orthogonal projection, let us assume that V is also finite-dimensional, as we are only concerned with finite-dimensional spaces in this book, though the result is also true for infinite-dimensional spaces. If V is finite-dimensional, we can extend u_1, . . . , u_k to an orthonormal basis u_1, . . . , u_k, . . . , u_n of V. Then, we have, for any x ∈ V,
\[
x = \sum_{j=1}^{n} \langle x, u_j \rangle u_j, \qquad
x - Px = \sum_{j=k+1}^{n} \langle x, u_j \rangle u_j,
\]
so that the orthogonality condition (1.4) yields for P defined by (1.5) that
\[
\langle x - Px, u_i \rangle = \Big\langle \sum_{j=k+1}^{n} \langle x, u_j \rangle u_j, u_i \Big\rangle
= \sum_{j=k+1}^{n} \langle x, u_j \rangle \langle u_j, u_i \rangle = 0
\]
for all 1 ≤ i ≤ k.
The orthogonal projection Px also describes the best approximation to x from
U.

Proposition 1.17 Let V be a pre-Hilbert space and let U ⊆ V be a finite-dimensional subspace. Then, for x ∈ V and u^* ∈ U, the following statements are equivalent:

1. u^* is the best approximation to x from U, i.e.
\[
\|x - u^*\| = \min_{u \in U} \|x - u\|.
\]
2. u^* is the orthogonal projection of x onto U, i.e.
\[
\langle x - u^*, u \rangle = 0, \qquad u \in U.
\]

Proof Let u_1, . . . , u_k be an orthonormal basis of U. Then, using the orthonormality, we have with Px from (1.5), for arbitrary α_1, . . . , α_k ∈ R,
\[
\begin{aligned}
\|x - Px\|^2
&= \Big\| x - \sum_{j=1}^{k} \langle x, u_j \rangle u_j \Big\|^2
= \Big\langle x - \sum_{i=1}^{k} \langle x, u_i \rangle u_i,\; x - \sum_{j=1}^{k} \langle x, u_j \rangle u_j \Big\rangle \\
&= \|x\|^2 - 2 \sum_{j=1}^{k} |\langle x, u_j \rangle|^2 + \sum_{i,j=1}^{k} \langle x, u_j \rangle \langle x, u_i \rangle \langle u_i, u_j \rangle \\
&= \|x\|^2 - \sum_{j=1}^{k} |\langle x, u_j \rangle|^2
\le \|x\|^2 - \sum_{j=1}^{k} |\langle x, u_j \rangle|^2 + \sum_{j=1}^{k} (\langle x, u_j \rangle - \alpha_j)^2 \\
&= \|x\|^2 - 2 \sum_{j=1}^{k} \alpha_j \langle x, u_j \rangle + \sum_{j=1}^{k} \alpha_j^2 \langle u_j, u_j \rangle
= \Big\| x - \sum_{j=1}^{k} \alpha_j u_j \Big\|^2.
\end{aligned}
\]
This shows that Px from (1.5) is the unique best approximation to x from U.


As we already know that Px is also the orthogonal projection of x onto U, the proof is finished. □
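For illustration, a short Python/NumPy sketch of the projection (1.5) and of the best-approximation property of Proposition 1.17:

import numpy as np

# Orthonormal basis u_1, u_2 of a two-dimensional subspace U of R^5
rng = np.random.default_rng(2)
U, _ = np.linalg.qr(rng.random((5, 2)))     # columns are orthonormal

x = rng.random(5)
Px = U @ (U.T @ x)                          # Px = sum_j <x, u_j> u_j, see (1.5)

print(np.max(np.abs(U.T @ (x - Px))))       # x - Px is orthogonal to U
# Px is at least as close to x as any other element of U:
print(np.linalg.norm(x - Px) <= np.linalg.norm(x - U @ rng.random(2)))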

1.5 Singular Value Decomposition


The geometric interpretation of Proposition 1.12 is simple. The linear mapping A : R^n → R^n, which is given by the matrix A in the standard basis, has a much simpler representation when going over to the basis consisting of the eigenvectors of A. If this basis is used in both the domain and the range of the mapping then it can be represented by the diagonal matrix D. While the simplicity of this representation is indeed appealing, it is not possible to find such a simple representation for all matrices A ∈ R^{n×n}, let alone for non-square matrices A ∈ R^{m×n}. However, if we relax the requirement of using the same orthogonal basis in the domain and the range, i.e. in R^n and R^m, to represent the mapping, we can find a similarly simple representation. This is the so-called singular value decomposition, which we want to discuss now.
To understand it, we need to recall a few more facts from linear algebra. We already introduced the rank and null space of a matrix. We also need the intrinsic, well-known relations
\[
n = \dim \ker(A) + \operatorname{rank}(A), \qquad \operatorname{rank}(A) = \operatorname{rank}(A^T). \tag{1.6}
\]
Note that an element x ∈ ker(A) satisfies Ax = 0 and hence also A^T Ax = 0, i.e. it belongs to ker(A^T A). If, on the other hand, x ∈ ker(A^T A) then A^T Ax = 0, thus 0 = x^T (A^T Ax) = ‖Ax‖_2^2 and hence x ∈ ker(A). This means
\[
\ker(A) = \ker(A^T A).
\]
Similarly, we can easily see that ker(A^T) = ker(AA^T). Hence, the dimension formula above immediately gives
\[
\operatorname{rank}(A) = \operatorname{rank}(A^T) = \operatorname{rank}(A^T A) = \operatorname{rank}(AA^T).
\]
This means that the symmetric and positive semi-definite matrices A^T A ∈ R^{n×n} and AA^T ∈ R^{m×m} have the same rank.
As before, we choose for A^T A a system of eigenvalues λ_1 ≥ λ_2 ≥ · · · ≥ λ_n ≥ 0 and associated eigenvectors v_1, . . . , v_n ∈ R^n satisfying A^T A v_j = λ_j v_j and v_j^T v_k = δ_{jk}. Exactly the first r = rank(A) eigenvalues are positive and the remaining ones are zero.
According to the considerations we just made, the matrix AA^T must also have exactly r positive eigenvalues. We will now see that, interestingly, the positive eigenvalues of AA^T are exactly the positive eigenvalues of A^T A and that there


is a very simple connection between the corresponding eigenvectors. To see this, let us define
\[
\sigma_j := \sqrt{\lambda_j}, \quad 1 \le j \le n, \qquad\qquad
u_j := \frac{1}{\sigma_j} A v_j, \quad 1 \le j \le r.
\]
Then, we have on the one hand that
\[
AA^T u_j = \frac{1}{\sigma_j} A A^T A v_j = \frac{1}{\sigma_j} \lambda_j A v_j = \lambda_j u_j, \qquad 1 \le j \le r.
\]
On the other hand, we have that
\[
u_j^T u_k = \frac{1}{\sigma_j \sigma_k} v_j^T A^T A v_k = \frac{\lambda_k}{\sigma_j \sigma_k} v_j^T v_k = \delta_{jk}, \qquad 1 \le j, k \le r.
\]
Thus, u_1, . . . , u_r ∈ R^m form an orthonormal system of eigenvectors corresponding to the eigenvalues λ_1 ≥ λ_2 ≥ · · · ≥ λ_r > 0 of AA^T. Since this matrix also has rank r and is positive semi-definite, all other eigenvalues have to be zero. It is now possible to complete the system {u_1, . . . , u_r} to an orthonormal basis {u_1, . . . , u_m} consisting of eigenvectors. If we define the matrices U := (u_1, . . . , u_m) ∈ R^{m×m} and V = (v_1, . . . , v_n) ∈ R^{n×n} then we have
\[
A v_j = \sigma_j u_j, \quad 1 \le j \le r, \qquad\qquad A v_j = 0, \quad r + 1 \le j \le n,
\]
by definition and by the fact that ker(A) = ker(A^T A). This can alternatively be written as
\[
AV = U\Sigma,
\]
where Σ = (σ_j δ_{ij}) ∈ R^{m×n} is a non-square diagonal matrix. This altogether proves the following result.

Theorem 1.18 (Singular Value Decomposition (SVD)) A matrix A ∈ R^{m×n} has a singular value decomposition A = UΣV^T with orthogonal matrices U ∈ R^{m×m} and V ∈ R^{n×n} and a diagonal matrix Σ = (σ_j δ_{ij}) ∈ R^{m×n}.

Note that some of the σ_j may be zero, depending on the rank of the matrix A. If r is the rank of A, then there are exactly r non-zero and hence positive σ_j. Furthermore, the singular value decomposition is not unique. To be more precise, the matrices U and V are not unique, since we can, for example, change the sign of the columns. Moreover, if some of the σ_j have the same value, then we can choose different bases for the corresponding eigenspaces. However, the values σ_j are uniquely determined by A and thus, if we order the σ_j, then the matrix Σ is unique.


Definition 1.19 Those σ j in the singular value decomposition which are pos-
itive are called the singular values of the matrix A.
Taking r = rank(A) into account, we can also rewrite the representation A =
UΣV T from Theorem 1.18 in the form
A = Σ_{j=1}^n σ_j u_j v_j^T = Σ_{j=1}^r σ_j u_j v_j^T .

Hence, if we define the matrices Û := (u1 , . . . , ur ) ∈ Rm×r , V̂ := (v1 , . . . , vr ) ∈
Rn×r and Σ̂ := diag(σ1 , . . . , σr ) ∈ Rr×r then we have the alternative representa-
tion
A = Û Σ̂V̂ T ,
which is also called the reduced singular value decomposition of the matrix A.
Since we can also write the reduced form as A = Û(V̂ Σ̂)T , we have the follow-
ing result.
Corollary 1.20 Every matrix A ∈ Rm×n of rank(A) = r can be written in the
form A = BC T with B ∈ Rm×r and C ∈ Rn×r .
In particular, we see that every rank 1 matrix is necessarily of the form bcT
with b ∈ Rm and c ∈ Rn .
There are efficient methods available to compute the singular value decompo-
sition and we will discuss some of them in Section 5.7.
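As a small numerical illustration (not part of the original exposition; the routines and names below are simply those of NumPy and are used for demonstration only), the full and the reduced decomposition can be checked for a random rectangular matrix:

import numpy as np

m, n = 5, 3
A = np.random.rand(m, n)

# Full SVD: U is m x m, Vt is n x n, s contains the singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=True)
Sigma = np.zeros((m, n))
Sigma[:len(s), :len(s)] = np.diag(s)
print(np.allclose(A, U @ Sigma @ Vt))              # A = U Sigma V^T

# Reduced SVD: keep only the first r = rank(A) columns and rows.
r = np.linalg.matrix_rank(A)
U_hat, Sigma_hat, Vt_hat = U[:, :r], np.diag(s[:r]), Vt[:r, :]
print(np.allclose(A, U_hat @ Sigma_hat @ Vt_hat))  # reduced factorisation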

1.6 Pseudo-inverse
We can use the singular value decomposition to define a kind of inverse to a
singular and even non-square matrix.
Definition 1.21 Let A ∈ Rm×n have rank r = rank(A). Let A = UΣV T be
the singular value decomposition of A with orthogonal matrices U ∈ Rm×m ,
V ∈ Rn×n and with Σ ∈ Rm×n being a diagonal matrix with non-zero diagonal
entries σ1 ≥ σ2 ≥ · · · ≥ σr > 0. Let Σ+ ∈ Rn×m be the matrix
Σ^+ = \begin{pmatrix}
1/σ_1 &        &        & 0      & \cdots & 0 \\
       & \ddots &        & \vdots &        & \vdots \\
       &        & 1/σ_r & 0      & \cdots & 0 \\
0      & \cdots & 0      & 0      & \cdots & 0 \\
\vdots &        & \vdots & \vdots &        & \vdots \\
0      & \cdots & 0      & 0      & \cdots & 0
\end{pmatrix} .
Then, the pseudo-inverse of A is defined as

A+ := VΣ+ U T ∈ Rn×m . (1.7)

We know that the singular values of a matrix A ∈ Rm×n are uniquely deter-
mined. However, the orthogonal matrices U and V are not unique and hence
we have to make sure that the pseudo-inverse is well-defined before we inves-
tigate its properties.

Theorem 1.22 Let A ∈ Rm×n with r = rank(A).

1. Each pseudo-inverse A+ ∈ Rn×m of A satisfies

AA+ = (AA+ )T , A+ A = (A+ A)T , AA+ A = A, A+ AA+ = A+ . (1.8)

2. The properties

AB = (AB)T , BA = (BA)T , ABA = A, BAB = B (1.9)

determine a matrix B ∈ Rn×m uniquely. This means in particular that the


pseudo-inverse is well-defined.

Proof 1. By definition of a pseudo-inverse, we have

AA+ = (UΣV T )(VΣ+ U T ) = UΣΣ+ U T .

Since ΣΣ+ ∈ Rm×m is a diagonal matrix with r diagonal entries equal to 1
and m − r diagonal entries equal to 0 it is obviously symmetric and hence
AA+ = (AA+ )T . The equality A+ A = (A+ A)T is shown in the same way.
Moreover, we have

AA+ A = (UΣV T )(VΣ+ U T )(UΣV T ) = UΣΣ+ ΣV T = UΣV T = A

and A+ AA+ = A+ follows in the same way.


2. Now assume that both B and C satisfy the stated equalities (1.9). Then, we
can conclude that

B = BAB = B(AB)T = (BBT )AT = BBT ATC T AT = B(AB)T (AC)T
= (BAB)AC = BAC = (BA)TC = (BA)T (CAC) = (BA)T (CA)TC
= AT BT ATC TC = ATC TC = (CA)TC = CAC = C. 

From this we have the uniqueness of the pseudo-inverse. Moreover, we see


that the pseudo-inverse is uniquely determined by the so-called Penrose condi-
tions (1.9), i.e. we could have also used these conditions to define the pseudo-
inverse.
Moreover, if A ∈ Rn×n is invertible then B = A−1 obviously satisfies the condi-


tions (1.9). This means that for an invertible matrix the pseudo-inverse and the
classical inverse are the same.
Finally, if m ≥ n and rank(A) = n, the following representation of the pseudo-
inverse will become important later on.

Lemma 1.23 Let A ∈ Rm×n satisfy rank(A) = n. Then, the pseudo-inverse of


A is given by
A+ = (AT A)−1 AT . (1.10)

Proof Since A ∈ Rm×n with rank(A) = n, we find that AT A ∈ Rn×n has also
rank n and is hence invertible. Thus, the expression B := (AT A)−1 AT is well-
defined. Moreover, B satisfies (1.9), since, using the symmetry of AT A, we
have

(AB)T = (A(AT A)−1 AT )T = A(AT A)−1 AT = AB,


(BA)T = ((AT A)−1 AT A)T = I T = I = (AT A)−1 AT A = BA,
ABA = A(AT A)−1 AT A = A,
BAB = ((AT A)−1 AT )A(AT A)−1 AT = (AT A)−1 AT = B. 
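A quick numerical check of Lemma 1.23 and of the Penrose conditions (1.9) can be carried out, for instance, with NumPy; the following sketch is an illustration only, not part of the original text:

import numpy as np

m, n = 6, 3
A = np.random.rand(m, n)                    # full column rank with probability one

A_plus_svd = np.linalg.pinv(A)              # pseudo-inverse computed via the SVD
A_plus_ne  = np.linalg.inv(A.T @ A) @ A.T   # formula (1.10)
print(np.allclose(A_plus_svd, A_plus_ne))

B = A_plus_svd                              # check the Penrose conditions (1.9)
print(np.allclose(A @ B @ A, A), np.allclose(B @ A @ B, B))
print(np.allclose(A @ B, (A @ B).T), np.allclose(B @ A, (B @ A).T))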

The pseudo-inverse will play an important role when it comes to solving and
analysing least-squares problems. The next result relates the pseudo-inverse to
the orthogonal projections onto the ranges of A and AT .

Lemma 1.24 Let A ∈ Rm×n with pseudo-inverse A+ ∈ Rn×m . Then,

1. P := AA+ : Rm → Rm is the orthogonal projection onto range(A),


2. P := A+ A : Rn → Rn is the orthogonal projection onto range(AT ).

Proof We will only prove the first statement since the proof of the second
statement is very similar, as range(A+ ) = range(AT ).
By definition Px = A(A+ x) ∈ range(A). Moreover, using the properties (1.8)
we find for each x ∈ Rm and u = Av ∈ range(A) that

x − Px, u 2 = x − Px, Av 2 = vT AT (x − AA+ x)


= vT [AT x − AT (AA+ )T x] = vT [AT x − (AA+ A)T x]
= vT [AT x − AT x] = 0,

which shows that Px = AA+ x is indeed the orthogonal projection of x onto


range(A). 
Exercises
1.1 Let V be a pre-Hilbert space. Use 0 ≤ ⟨αx + βy, αx + βy⟩ for α, β ∈ R
and x, y ∈ V to prove the Cauchy–Schwarz inequality.
1.2 Let V be a normed space with norm ‖·‖ : V → R. Show that V is a
pre-Hilbert space and that the norm comes from an inner product ⟨·, ·⟩ :
V × V → R if and only if the parallelogram equality
‖x + y‖^2 + ‖x − y‖^2 = 2‖x‖^2 + 2‖y‖^2 ,    x, y ∈ V
holds and that in this case the inner product satisfies
⟨x, y⟩ = (1/4) ( ‖x + y‖^2 − ‖x − y‖^2 ) ,    x, y ∈ V.
1.3 Let V be a pre-Hilbert space. Show that the norm induced by the inner
product satisfies ‖x + y‖ < 2 for all x ≠ y ∈ V with ‖x‖ = ‖y‖ = 1.
1.4 Let A ∈ Rn×n be symmetric and positive definite and let C ∈ Rn×m . Show:
1. the matrix C T AC is positive semi-definite,
2. rank(C T AC) = rank(C),
3. the matrix C T AC is positive definite if and only if rank(C) = m.
1.5 Show that a matrix A ∈ Rn×n is symmetric and positive definite if and
only if it is of the form A = BBT with an invertible matrix B ∈ Rn×n .
1.6 Let A, B, C ∈ R2×2 with C = AB. Let
p = (a11 + a22 )(b11 + b22 ), q = (a21 + a22 )b11 ,
r = a11 (b12 − b22 ), s = a22 (b21 − b11 ),
t = (a11 + a12 )b22 , u = (a21 − a11 )(b11 + b12 ),
v = (a12 − a22 )(b21 + b22 ).
Show that the elements of C can then be computed via
c11 = p + s − t + v, c12 = r + t,
c21 = q + s, c22 = p + r − q + u.
Compare the number of multiplications and additions for this method
with the number of multiplications and additions for the standard method
of multiplying two 2 × 2 matrices.
Finally, show that if the above method is recursively applied to matrices
A, B ∈ R^{n×n} with n = 2^k then the method requires 7^k multiplications and
6 · 7^k − 6 · 2^{2k} additions and subtractions.
2
Error, Stability and Conditioning

2.1 Floating Point Arithmetic


Since computers use a finite number of bits to represent a real number, they
can only represent a finite subset of the real numbers. In general the range of
numbers is sufficiently large but there are naturally gaps, which might lead to
problems.
Hence, it is time to discuss the representation of a number in a computer. We
are used to representing a number in digits, i.e.

105.67 = 1000 · 0.10567 = +10^3 · (1 · 10^{−1} + 0 · 10^{−2} + 5 · 10^{−3} + 6 · 10^{−4} + 7 · 10^{−5}).

However, there is no need to represent numbers with respect to base 10. A


more general representation is the B-adic representation.

Definition 2.1 A B-adic, normalised floating point number of precision m is


either x = 0 or
x = ±B^e Σ_{k=−m}^{−1} x_k B^k ,    x_{−1} ≠ 0 ,    x_k ∈ {0, 1, 2, . . . , B − 1}.

Here, e denotes the exponent within a given range e_min ≤ e ≤ e_max , B ≥ 2 is
the base and Σ_{k=−m}^{−1} x_k B^k denotes the mantissa.
The corresponding floating point number system consists of all of the numbers
which can be represented in this form.

For a computer, the different ways of representing a number, like single and
double precision, are defined by the IEEE 754 norm and we will also refer to
them as machine numbers.
For example, the IEEE 754 norm states that for a double precision number we
have B = 2 and m = 52. Hence, if one bit is used to store the sign and if
11 bits are reserved for representing the exponent, a double precision number
can be stored in 64 bits. In this case, the IEEE 754 norm defines the following
additional values:

• ±Inf for ±∞ as the result of dividing positive or negative numbers by zero,
for the evaluation of log(0) and for overflow errors, meaning results which
would require an exponent outside the range [e_min , e_max ].
• NaN (not a number) for undefined numbers as they, for example, occur when
dividing 0 by 0 or evaluating the logarithm for a negative number.

As a consequence, the distribution of floating point numbers is not uniform.


Furthermore, every real number and the result of every operation has to be
rounded. There are different rounding strategies, but the default one is rounding
x to the nearest machine number rd(x).

Definition 2.2 The machine precision, eps, is the smallest number which
satisfies
|x − rd(x)| ≤ eps|x|

for all x ∈ R within the range of the floating point system.

Another way of expressing this is that for such x there is an ε ∈ R with |ε| ≤ eps
and rd(x) = (1 + ε)x.
It is easy to determine the machine precision for the default rounding strategy.

Theorem 2.3 For a floating point number system with base B and precision
m the machine precision is given by eps = B^{1−m} , i.e. we have

|x − rd(x)| ≤ |x| B^{1−m} .

Proof For every real number x in the range of the floating point system there
exists an integer e between emin and emax such that

x = ±B^e Σ_{k=−∞}^{−1} x_k B^k = ±B^e Σ_{k=−m}^{−1} x_k B^k ± B^e Σ_{k=−∞}^{−m−1} x_k B^k = rd(x) + (x − rd(x)).

Hence, we have

|x − rd(x)| ≤ B^e Σ_{k=−∞}^{−m−1} x_k B^k ≤ B^e Σ_{k=−∞}^{−m−1} (B − 1) B^k = B^{e−m} .

However, as x_{−1} ≠ 0, we also have |x| ≥ B^e B^{−1} = B^{e−1} such that

|x − rd(x)| / |x| ≤ B^{e−m−e+1} . □
Note that for B = 2 and m = 52 this means eps = 2^{−51} ≈ 4.4409 · 10^{−16} , giving
the precision of real numbers stored in a 64-bit floating point system.
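These quantities are easy to inspect directly in Python (a small sketch, assuming NumPy; the numbers refer to IEEE 754 double precision):

import numpy as np

B, m = 2, 52
print(float(B)**(1 - m))          # the bound B^(1-m) = 2^-51 of Theorem 2.3
print(np.finfo(np.float64).eps)   # spacing between 1.0 and the next machine number: 2^-52

# The gaps between machine numbers grow with their magnitude:
print(np.spacing(1.0), np.spacing(1.0e16))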
Since numbers are usually provided in decimal form, they must be converted
to the B-adic form for their representation within the computer, which usually
differs from the original number. Even if then all computations are done in
exact arithmetic, this input error cannot be avoided.
However, the limitation of the floating point system also means that stan-
dard operations like adding, subtracting, multiplying and dividing two machine
numbers will not result in a machine number and hence the results have to be
rounded again. The best we can hope for is that the result of such an elemen-
tary operation is given by the rounded number of the exact result of such an
operation, i.e., for example, rd(x + y) instead of rd(rd(x) + rd(y)). The IEEE
754 norm guarantees that this is indeed the case and not only for adding, sub-
tracting, multiplying and dividing but also in the case of evaluating standard

functions like x, sin(x), log(x) and exp(x). This can be reformulated as fol-
lows.

Theorem 2.4 Let ∗ be one of the operations +, −, ·, / and let ⊛ be the equiv-
alent floating point operation, then for all x, y from the floating point system,
there exists an ε ∈ R with |ε| ≤ eps such that

x ⊛ y = (x ∗ y)(1 + ε).
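A familiar consequence, easily reproduced in any programming language (here sketched in Python), is that decimal numbers such as 0.1 are not machine numbers for B = 2, so both the inputs and the result of an addition are rounded:

x, y = 0.1, 0.2
print(x + y == 0.3)         # False: both inputs and the sum are rounded
print(abs((x + y) - 0.3))   # an error of the order of eps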

2.2 Norms for Vectors and Matrices


In this section we will study norms for vectors and matrices in greater detail,
as these norms are essential for analysing the stability and conditioning of a
linear system and any method for solving it.
We start with the following example.

Example 2.5 (Matrix–vector multiplication) Let us consider the goal of mul-


tiplying a matrix A ∈ Rn×n with a vector x ∈ Rn . Let us assume, however, that
the data given is inexact, for example, because it contains noise from mea-
surements or because the representation of the real numbers on the computer
causes rounding errors. For simplicity and for the time being, we will still as-
sume that the matrix A ∈ Rn×n is exactly given but that the vector x is given
as something of the form x + Δx, where Δx is a vector with, one hopes, small
entries representing the error. Hence, instead of computing b := Ax we are
computing (b + Δb) = A(x + Δx) with Δb = AΔx and we have to analyse the
error Δb in the output.
To achieve the goal of this example, we have to introduce a few theoretical
concepts. We need to know how to “measure”. After that we will subsequently
come back to this introductory example. Let us recall Definition 1.6, the defi-
nition of a norm on a linear space V as a mapping ‖·‖ : V → [0, ∞) satisfying
1. homogeneity: ‖λx‖ = |λ| ‖x‖ for all x ∈ V and λ ∈ R,
2. definiteness: ‖x‖ = 0 if and only if x = 0,
3. triangle inequality: ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x, y ∈ V.
The norm of a vector allows us to measure the length of it. In particular, a unit
vector with respect to a norm ‖·‖ is a vector x ∈ V with ‖x‖ = 1.
From basic courses on Analysis and Linear Algebra, the reader should know
the following list of standard norms on Rn .
Definition 2.6 For 1 ≤ p ≤ ∞, the ℓ_p-norm on R^n is defined to be
• ‖x‖_p := ( Σ_{i=1}^n |x_i|^p )^{1/p} for 1 ≤ p < ∞,
• ‖x‖_∞ := max_{1≤i≤n} |x_i| .
Besides the ℓ_∞-norm, the ℓ_2-norm or Euclidean norm and the ℓ_1-norm are the
most popular ones:

‖x‖_1 := Σ_{i=1}^n |x_i| ,    ‖x‖_2 := ( Σ_{i=1}^n |x_i|^2 )^{1/2} .

It is quite easy to verify that ‖·‖_1 and ‖·‖_∞ are indeed norms. For ‖·‖_2 and
more generally ‖·‖_p the first and second properties of a norm are also easily
verified. The triangle inequality, however, requires more work and needs the
Cauchy–Schwarz or more generally the Hölder inequality

Σ_{j=1}^n |x_j y_j| ≤ ‖x‖_p ‖y‖_q ,    x, y ∈ R^n , p, q ≥ 1 with 1/p + 1/q = 1. (2.1)

Here, the expression 1/p is interpreted as zero for p = ∞. Since this is done in
basic courses on Linear Algebra and Analysis, we will assume that the reader
is familiar with these facts.
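For concreteness, the three most common norms can be evaluated, for example, with NumPy (an illustrative sketch only):

import numpy as np

x = np.array([3.0, -4.0, 1.0])
print(np.linalg.norm(x, 1))       # l1-norm: |3| + |-4| + |1| = 8
print(np.linalg.norm(x, 2))       # l2-norm: sqrt(9 + 16 + 1) = sqrt(26)
print(np.linalg.norm(x, np.inf))  # l-infinity-norm: largest absolute entry = 4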
Example 2.7 The unit ball with respect to the ℓ_p-norm is defined to be the
set B_p := {x ∈ R^n : ‖x‖_p ≤ 1}. For R^2 , Figure 2.1 shows B_1 , B_2 , B_∞ .
Note that the “length” of a vector depends on the chosen norm. The choice of
the norm might depend on the application one has in mind. For example, if a
vector contains temperatures at a certain position at n different times, we can
use the ∞ -norm to determine the maximum temperature. If we are, however,
interested in an average temperature, the 1 - or 2 -norm would be our choice.
Figure 2.1 B_1 , B_2 and B_∞ in R^2 .

Though we are given different norms on Rn , the following theorem shows that
these do not differ too much in a topological sense.
Theorem 2.8 On R^n , all norms are equivalent, i.e. for two norms ‖·‖ and
‖·‖_∗ on R^n , there are two constants c_1 , c_2 > 0 such that
c_1 ‖x‖ ≤ ‖x‖_∗ ≤ c_2 ‖x‖ ,    x ∈ R^n .
Proof First of all, we note that this indeed defines an equivalence relation.
Hence, it suffices to show that all norms are equivalent to one specific norm,
say the ℓ_1-norm.
Thus, let ‖·‖ be an arbitrary norm on R^n . If x = (x_1 , . . . , x_n)^T is a vector of R^n ,
the triangle inequality yields

‖x‖ = ‖ Σ_{j=1}^n x_j e_j ‖ ≤ Σ_{j=1}^n |x_j| ‖e_j‖ ≤ max_{1≤j≤n} ‖e_j‖ ‖x‖_1 =: M ‖x‖_1 .

From this we can conclude that the norm ‖·‖, as a function on R^n , is continuous
with respect to the ℓ_1-norm, since we have
| ‖x‖ − ‖y‖ | ≤ ‖x − y‖ ≤ M ‖x − y‖_1
for x, y ∈ R^n . Since the unit sphere K = {x ∈ R^n : ‖x‖_1 = 1} with respect to the
ℓ_1-norm is a compact set, the continuous function ‖·‖ attains its minimum and
maximum on it, i.e. there are positive constants c ≤ C such that
c ≤ ‖x‖ ≤ C ,    x ∈ K.
A simple scaling argument now shows the equivalence of the norms ‖·‖ and
‖·‖_1 . □

It is sometimes good to know what the equivalence constants look like. In the
case of the  p -norms, the optimal constants are well-known.

Example 2.9 For the standard  p -norms, it is not too difficult to compute the
equivalence constants. For example,

‖x‖_2 ≤ ‖x‖_1 ≤ √n ‖x‖_2 ,
‖x‖_∞ ≤ ‖x‖_2 ≤ √n ‖x‖_∞ ,
‖x‖_∞ ≤ ‖x‖_1 ≤ n ‖x‖_∞ .

A general formula is given in the next proposition.

Proposition 2.10 Let x ∈ R^n be given and let 1 ≤ p, q ≤ ∞. If q ≥ p then

‖x‖_p ≤ n^{1/p − 1/q} ‖x‖_q . (2.2)

If p ≥ q then
‖x‖_p ≤ ‖x‖_q . (2.3)

In both cases, the constants are optimal in the sense that there is no smaller
constant for which (2.2) or (2.3) holds for all x ∈ Rn .

Proof In the case p = q = ∞ both bounds are obviously true. We start with
showing (2.2), i.e. we assume q ≥ p. There is nothing to show if x = 0 so we
may assume x ≠ 0. Moreover, if q = ∞ and 1 ≤ p < ∞ we have

‖x‖_p^p = Σ_{j=1}^n |x_j|^p ≤ n ‖x‖_∞^p ,

which gives the result in this case. For 1 ≤ p ≤ q < ∞, we can use Hölder’s
inequality with r = q/p ≥ 1 and s ≥ 1 with 1/r + 1/s = 1, i.e. with s = q/(q − p),
to derive

‖x‖_p^p = Σ_{j=1}^n |x_j|^p ≤ ( Σ_{j=1}^n 1^s )^{1/s} ( Σ_{j=1}^n |x_j|^{pr} )^{1/r} = n^{1−p/q} ‖x‖_q^p .

Taking the pth root yields the desired result. The constant is the best possible
because we have equality in (2.2) for the vector x = (1, . . . , 1)^T .
Next, let us consider the case that p ≥ q ≥ 1. If p = ∞ and q < ∞ then we
have

‖x‖_∞ = max_j |x_j| = max_j ( |x_j|^q )^{1/q} ≤ ( Σ_{j=1}^n |x_j|^q )^{1/q} = ‖x‖_q .

If, however, q ≤ p < ∞, then

|x_j|^p = |x_j|^q |x_j|^{p−q} ≤ |x_j|^q ‖x‖_∞^{p−q} ≤ |x_j|^q ‖x‖_q^{p−q} ,

such that

‖x‖_p^p = Σ_{j=1}^n |x_j|^p ≤ Σ_{j=1}^n |x_j|^q ‖x‖_q^{p−q} = ‖x‖_q^q ‖x‖_q^{p−q} = ‖x‖_q^p ,

which also gives the desired result. To see that this bound is also sharp, we can
simply pick x = e_1 = (1, 0, . . . , 0)^T . □

Next, we return to our initial motivational example, the analysis of a matrix–


vector multiplication with a perturbed input vector, see Example 2.5.

Example 2.11 (Matrix–vector multiplication) We are now looking again at


the situation

b := Ax,
b + Δb := A(x + Δx).

We would like to have Δb be small if Δx is small. However, what does small


mean? It will depend on the size of x and hence it makes more sense to look at
the relative errors

‖Δx‖ / ‖x‖ ,    ‖Δb‖ / ‖b‖ ,

provided x, b ≠ 0.
Let us assume that we have fixed a norm on R^n and that there are two constants
c_1 > 0 and c_2 > 0 such that

c_1 ‖x‖ ≤ ‖Ax‖ ≤ c_2 ‖x‖ ,    x ∈ R^n . (2.4)

The left inequality guarantees that A is injective and hence, as a square matrix,
bijective, i.e. invertible. Under these assumptions we have

‖Δb‖ / ‖b‖ = ‖A(x + Δx) − Ax‖ / ‖Ax‖ = ‖AΔx‖ / ‖Ax‖ ≤ (c_2/c_1) · ‖Δx‖ / ‖x‖ . (2.5)

This means that the relative input error ‖Δx‖/‖x‖ is “magnified” by the factor c_2/c_1
in the relative output error. If this factor is large, we have to expect large relative
output errors even for small relative input errors.

square matrices. However, a thorough inspection of what we have just outlined
shows that particularly (2.5) remains true as long as (2.4) holds. We will see
that the latter also holds for non-square matrices A ∈ Rm×n as long as A is
injective, which means that we need m ≥ n and rank(A) = n.
It is now time to describe a systematic way of handling the constants c2 and c1
from above by introducing matrix norms.
Since the space of all m × n matrices forms a vector space by adding two ma-
trices component-wise and multiplying a matrix by a scalar also component-
wise, we already know what a norm on the space of matrices is. However, in
this context it is important how matrix norms interact with vector norms.

Definition 2.12 Given vector norms ‖·‖_(n) and ‖·‖_(m) on R^n and R^m , respec-
tively, we say that a matrix norm ‖·‖_∗ is compatible with these vector norms
if
‖Ax‖_(m) ≤ ‖A‖_∗ ‖x‖_(n) ,    x ∈ R^n .

A matrix norm will be called compatible if there are two vector norms so that
the matrix norm is compatible with these vector norms.
We define the subordinate or induced matrix norm with respect to the above
vector norms as

‖A‖_(m,n) := sup_{x∈R^n, x≠0} ‖Ax‖_(m) / ‖x‖_(n) .

Obviously, for any x ∈ R^n , x ≠ 0 we have by definition

‖A‖_(m,n) ≥ ‖Ax‖_(m) / ‖x‖_(n)

or

‖Ax‖_(m) ≤ ‖A‖_(m,n) ‖x‖_(n) ,

which shows that every induced matrix norm is compatible with the vector
norms used in its definition.

Theorem 2.13 The induced matrix norm is indeed a norm. It is the “small-
est” compatible matrix norm.

Proof First of all, we notice that using the third property of a norm and the
linearity of A yields
‖A‖_(m,n) = sup_{x∈R^n, x≠0} ‖Ax‖_(m) / ‖x‖_(n) = sup_{x∈R^n, x≠0} ‖ A (x/‖x‖_(n)) ‖_(m) = sup_{‖x‖_(n)=1} ‖Ax‖_(m) .
Since x → ‖Ax‖_(m) is a continuous function it attains its maximum on the
compact set {x : ‖x‖_(n) = 1}. This means that there is an x_0 ∈ R^n with ‖x_0‖_(n) =
1 and

‖A‖_(m,n) = ‖Ax_0‖_(m) (2.6)

such that the induced norm is well-defined. The three norm properties now
easily follow from the fact that ‖·‖_(m) is a norm and that A is linear.
Finally, we have already noticed that the induced matrix norm is compatible. It
is the smallest compatible norm in the sense that there is a vector x ∈ R^n with

‖Ax‖_(m) = ‖A‖_(m,n) ‖x‖_(n) ,

which simply follows from (2.6). □

In many cases, we will suppress index notation and simply write ‖A‖, when the
chosen norm is not relevant or obvious from the context.
In many situations, the following concept will turn out useful.

Definition 2.14 Let ‖·‖_{m×n} , ‖·‖_{n×p} and ‖·‖_{m×p} denote norms on the matrix
spaces R^{m×n} , R^{n×p} and R^{m×p} , respectively. Such norms are called multiplicative
if they satisfy

‖AB‖_{m×p} ≤ ‖A‖_{m×n} ‖B‖_{n×p} ,    A ∈ R^{m×n} , B ∈ R^{n×p} .

Lemma 2.15 Let ‖·‖_(m) , ‖·‖_(n) and ‖·‖_(p) denote vector norms on R^m , R^n
and R^p , respectively. Then, the induced matrix norms satisfy

‖AB‖_(m,p) ≤ ‖A‖_(m,n) ‖B‖_(n,p)

for all A ∈ R^{m×n} and B ∈ R^{n×p} , i.e. induced matrix norms are always multi-
plicative. The same is true for compatible matrix norms.

Proof With x ∈ R^p we have Bx ∈ R^n and ABx ∈ R^m and hence

‖ABx‖_(m) ≤ ‖A‖_(m,n) ‖Bx‖_(n) ≤ ‖A‖_(m,n) ‖B‖_(n,p) ‖x‖_(p) ,

which immediately gives the stated inequality. As the above inequality holds
also for compatible matrix norms, such norms are also multiplicative. 

Coming back to our example of matrix–vector multiplication, see Example 2.5


and Example 2.11, we can now give bounds for the constants c1 and c2 in (2.4).

Example 2.16 (Matrix–vector multiplication) The constant c2 in (2.4) can


obviously be chosen to be c_2 = ‖A‖. For the constant c_1 we note that

‖x‖ = ‖A^{−1}Ax‖ ≤ ‖A^{−1}‖ ‖Ax‖.


Δb Δx
≤ A A−1  . (2.7)
b x
The number A A−1  which can be assigned to any square, invertible matrix
A ∈ Rn×n plays an important role in the upcoming analysis of algorithms and
methods.
Definition 2.17 The number κ(A) := A A−1  is called the condition num-
ber of the invertible matrix A ∈ Rn×n .
With this definition, (2.7) means that the condition number of the matrix gives
the maximum factor by which the relative input error can be enlarged. Hence,
we want to have a condition number as small as possible to ensure that the rel-
ative output error is of the same size as the relative input error. Note, however,
that for every compatible matrix norm we have
‖x‖ = ‖Ix‖ ≤ ‖I‖ ‖x‖ ,
meaning ‖I‖ ≥ 1. If the norm is induced, we obviously even have ‖I‖ = 1. In
any case, we can conclude that
1 ≤ ‖I‖ = ‖AA^{−1}‖ ≤ ‖A‖ ‖A^{−1}‖ = κ(A)
for every compatible matrix norm.

Proposition 2.18 For every compatible matrix norm ‖·‖, the condition num-
ber satisfies κ(A) ≥ 1 for every A ∈ R^{n×n} invertible. If the matrix norm is
induced, then we also have ‖I‖ = 1 if I is the identity matrix.
The condition number obviously depends on the matrix norm used. But since
we can identify the space of all m × n matrices with Rm·n , we see that also all
matrix norms are equivalent and so are the different condition numbers.
Hence, our conclusion so far is that for matrix–vector multiplications a large
condition number of the matrix is extremely problematic when it comes to
propagating input or rounding errors. So far, we were only able to draw this
conclusion for square, invertible matrices. Soon, we will return to our matrix–
vector multiplication example for non-square matrices and generalise the con-
cept of the condition number.
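As a small illustration (a sketch assuming NumPy is available; the matrix is chosen arbitrarily), the condition number of Definition 2.17 can be computed directly:

import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
A_inv = np.linalg.inv(A)

# kappa(A) = ||A|| ||A^-1|| in an induced norm, here the infinity-norm.
kappa_inf = np.linalg.norm(A, np.inf) * np.linalg.norm(A_inv, np.inf)
print(kappa_inf, np.linalg.cond(A, np.inf))   # both give the same value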
The induced norms, which are what we are mostly interested in, are given by
the standard  p vector norms defined previously. Hence, for 1 ≤ p ≤ ∞ we
define the induced matrix p-norm by
Ax p
A p = sup . (2.8)
This expression is of course only of limited use if we cannot really compute the
matrix norm explicitly. Fortunately, this is possible in the most important cases.
The proof of the following theorem uses the fact that the induced matrix norm
is the smallest compatible norm. If we want to show that the induced matrix
norm ‖A‖ equals a certain number C then we only have to show that ‖Ax‖ ≤
C‖x‖ for all x ∈ R^n and that there is at least one x ∈ R^n with ‖Ax‖ = C‖x‖.
Theorem 2.19 Let A ∈ R^{m×n} and let λ_1 ≥ 0 be the largest eigenvalue of A^T A.
Then

‖A‖_1 = max_{1≤j≤n} Σ_{i=1}^m |a_ij| , (2.9)

‖A‖_2 = √λ_1 , (2.10)

‖A‖_∞ = max_{1≤i≤m} Σ_{j=1}^n |a_ij| . (2.11)

Proof Let us first consider the 1-norm of a matrix A ∈ R^{m×n} . For every x ∈ R^n
we have

‖Ax‖_1 = Σ_{i=1}^m |(Ax)_i| = Σ_{i=1}^m | Σ_{j=1}^n a_ij x_j | ≤ Σ_{i=1}^m Σ_{j=1}^n |a_ij| |x_j|
       = Σ_{j=1}^n ( Σ_{i=1}^m |a_ij| ) |x_j| ≤ ( max_{1≤j≤n} Σ_{i=1}^m |a_ij| ) Σ_{j=1}^n |x_j| =: C ‖x‖_1 ,

showing that ‖A‖_1 ≤ C. Next, let us pick an index k such that C = Σ_{i=1}^m |a_ik|
and x = e_k the kth unit vector. Then, we have ‖x‖_1 = 1 and

‖Ax‖_1 = ‖Ae_k‖_1 = Σ_{i=1}^m |(Ae_k)_i| = Σ_{i=1}^m |a_ik| = C,

which shows ‖A‖_1 = C.
In the case of the 2-norm of a matrix A ∈ R^{m×n} , we have
‖Ax‖_2^2 = x^T A^T Ax.
Since A^T A ∈ R^{n×n} is symmetric and positive semi-definite there are n or-
thonormal eigenvectors {z_i}_{i=1}^n corresponding to the real, non-negative eigen-
values λ_1 ≥ λ_2 ≥ · · · ≥ λ_n ≥ 0 of A^T A.
Let x = α_1 z_1 + · · · + α_n z_n so that ‖x‖_2^2 = x^T x = |α_1|^2 + · · · + |α_n|^2 . Furthermore,

x^T A^T Ax = λ_1 |α_1|^2 + · · · + λ_n |α_n|^2 ≤ λ_1 x^T x. (2.12)

Hence we have ‖A‖_2^2 ≤ λ_1 . Finally, choosing x = z_1 gives equality in (2.12),
which shows ‖A‖_2^2 = λ_1 .
Finally, for the ∞-norm we have

‖A‖_∞ = max_{‖x‖_∞=1} ‖Ax‖_∞ = max_{‖x‖_∞=1} max_{1≤i≤m} | Σ_{j=1}^n a_ij x_j |
      ≤ max_{1≤i≤m} Σ_{j=1}^n |a_ij| = Σ_{j=1}^n |a_kj| ,

for some k. To show that this upper bound is also attained we may choose
x ∈ R^n with

x_j = a_kj / |a_kj|  if a_kj ≠ 0,   and   x_j = 0  else,

to obtain it. □
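The explicit formulas (2.9)–(2.11) are easy to check numerically; the following sketch (NumPy assumed, matrix chosen arbitrarily) compares them with the norms computed by the library:

import numpy as np

A = np.array([[1.0, -2.0,  0.0],
              [3.0,  4.0, -1.0]])

col_sum = np.abs(A).sum(axis=0).max()          # (2.9):  maximal absolute column sum
row_sum = np.abs(A).sum(axis=1).max()          # (2.11): maximal absolute row sum
lam_1   = np.linalg.eigvalsh(A.T @ A).max()    # largest eigenvalue of A^T A

print(col_sum, np.linalg.norm(A, 1))
print(row_sum, np.linalg.norm(A, np.inf))
print(np.sqrt(lam_1), np.linalg.norm(A, 2))    # (2.10)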

Another way of expressing the result for the Euclidean norm is to use the sin-
gular values of the matrix, see Definition 1.19. The singular values σ_1 ≥ σ_2 ≥
· · · ≥ σ_r with r = rank(A) ≤ min(m, n) of a matrix A ∈ R^{m×n} are the positive
square roots of the positive eigenvalues of A^T A. Hence, we have ‖A‖_2 = σ_1 ,
i.e. ‖A‖_2 is given by the largest singular value of A. This is the first half of the
following corollary.

Corollary 2.20 Let A ∈ R^{m×n} have the singular values σ_1 ≥ σ_2 ≥ · · · ≥ σ_r
with r = rank(A) ≤ min(m, n). Then, we have ‖A‖_2 = σ_1 . Moreover, if m ≥ n
and A has full rank r = n, then we have

‖Ax‖_2 ≥ σ_n ‖x‖_2 ,    x ∈ R^n .

Proof The second statement follows immediately from the representation
(2.12). □

Another consequence is that for symmetric matrices, the norm can directly be
computed from the eigenvalues of A rather than those of AT A.

Corollary 2.21 Let A ∈ R^{n×n} be symmetric. Then we have ‖A‖_2 = |λ_1| if
λ_1 , . . . , λ_n are the eigenvalues of A ordered so that |λ_1| ≥ |λ_2| ≥ · · · ≥ |λ_n|.
Moreover, if A ∈ R^{n×n} is of the form A = C^T C with C ∈ R^{n×n} then ‖A‖_2 = ‖C‖_2^2 .

Proof Since A ∈ Rn×n is symmetric, there is an orthogonal matrix U ∈ Rn×n


such that A = UDU T with D being a diagonal matrix with the eigenvalues of A
as diagonal entries. From this we can conclude that AT A = UD2 U T . Hence, the
eigenvalues of AT A are the squares of the eigenvalues of A. Finally, if A = C TC
then we have just proven that A2 = C TC2 is given by the largest eigenvalue
of C^T C. But Theorem 2.19 tells us that the square root of the largest eigenvalue
of C^T C is just ‖C‖_2 . Hence, we have ‖A‖_2 = ‖C‖_2^2 . □

Yet another way of computing the 2-norm of a symmetric or normal matrix is


the following one, which is based upon the so-called Rayleigh quotient.

Proposition 2.22 Let A ∈ R^{n×n} be symmetric. Then, the 2-norm is given by

‖A‖_2 = sup_{‖x‖_2=1} | ⟨Ax, x⟩_2 | .

Proof As in the last corollary, we can pick an orthogonal matrix U ∈ Rn×n


such that A = UDU T with D ∈ Rn×n containing the eigenvalues λ1 , . . . , λn ∈ R
of A, sorted such that |λ1 | ≥ · · · ≥ |λn |. Then, as in the corollary, we have
‖A‖_2 = |λ_1| . However, we also have |⟨Ax, x⟩_2| ≤ ‖Ax‖_2 ‖x‖_2 ≤ ‖A‖_2 for all
x ∈ R^n with ‖x‖_2 = 1 and if v ∈ R^n denotes a normalised eigenvector of A to
the eigenvalue λ_1 then |⟨Av, v⟩_2| = |⟨λ_1 v, v⟩_2| = |λ_1| , which shows

‖A‖_2 = |⟨Av, v⟩_2| ≤ sup_{‖x‖_2=1} |⟨Ax, x⟩_2| ≤ ‖A‖_2 ,

i.e. the stated result. 

The above result shows that if A is symmetric and positive semi-definite then
its largest eigenvalue λmax is given by
λ_max = sup_{‖x‖_2=1} ⟨Ax, x⟩_2 .

However, the ideas used in Proposition 2.22 can also be used to show that the
smallest eigenvalue λ_min of a symmetric and positive (semi-)definite matrix A
is given by

λ_min = inf_{‖x‖_2=1} ⟨Ax, x⟩_2 . (2.13)

This follows directly with the notation of the proof by noting that

⟨Ax, x⟩_2 = x^T Ax = x^T UDU^T x ≥ λ_min ‖U^T x‖_2^2 = λ_min x^T UU^T x = λ_min ‖x‖_2^2 ,

where equality holds if x is an eigenvector of A to the eigenvalue λmin . Later


on, we will generalise these results when discussing more general localisation
techniques for eigenvalues.
We will now briefly motivate how to generalise the concept of the condition
number to non-square matrices. We will, however, restrict ourselves to the ma-
trix norm induced by the 2 vector norm. Moreover, for our example, we will
only look at injective matrices.
Example 2.23 (Matrix–vector multiplication) Let A ∈ R^{m×n} be given with
m ≥ n = rank(A). If we compute once again b + Δb = A(x + Δx) for given
x, Δx ∈ R^n , then the relative output error can be bounded by

‖Δb‖_2 / ‖b‖_2 ≤ (σ_1 / σ_n) · ‖Δx‖_2 / ‖x‖_2 ,

where σ_1 is the largest and σ_n is the smallest singular value of A. This follows
immediately from (2.5), our earlier findings c_2 = ‖A‖_2 = σ_1 and

c_1^2 = inf_{‖x‖_2=1} ‖Ax‖_2^2 = inf_{‖x‖_2=1} ⟨A^T Ax, x⟩_2 = λ_min(A^T A) = σ_n^2 .

Hence, we can generalise the condition number for a non-square matrix as


follows.
Definition 2.24 Let A ∈ R^{m×n} have the singular values σ_1 ≥ σ_2 ≥ · · · ≥
σ_r > 0 with r = rank(A). Then, the condition number of A subordinate to the
ℓ_2-norm is defined to be

κ_2(A) := σ_1 / σ_r .
Obviously, this definition coincides with our earlier definition in the case of a
square and invertible matrix.
We also note that we have ‖A‖_2 = σ_1 and, using the pseudo-inverse A^+ from
(1.7), we can easily see that ‖A^+‖_2 = 1/σ_r . Hence, we can express the condition
number also as

κ_2(A) = ‖A‖_2 ‖A^+‖_2 .
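Numerically, κ_2(A) of a rectangular matrix is therefore available directly from the singular values; the following NumPy sketch (illustration only) checks this:

import numpy as np

A = np.random.rand(6, 3)                  # injective with probability one
s = np.linalg.svd(A, compute_uv=False)    # singular values in non-increasing order

kappa2 = s[0] / s[-1]                     # sigma_1 / sigma_r with r = n here
print(kappa2, np.linalg.cond(A, 2))       # NumPy's 2-norm condition number agrees
print(np.isclose(np.linalg.norm(np.linalg.pinv(A), 2), 1.0 / s[-1]))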
Remark 2.25 When dealing with complex-valued matrices A ∈ C^{n×n} , we can
proceed in exactly the same way and introduce an induced matrix norm via

‖A‖ := sup_{x∈C^n, ‖x‖=1} ‖Ax‖ ,

with x ∈ C^n now. Note that if A ∈ R^{n×n} and ‖·‖ : C^n → R is a norm, then we
can define an induced matrix norm over R^n and over C^n . Since R^n ⊆ C^n , we
obviously have

‖A‖_R := sup_{x∈R^n, ‖x‖=1} ‖Ax‖ ≤ sup_{x∈C^n, ‖x‖=1} ‖Ax‖ =: ‖A‖_C .

Indeed, there exist vector norms for which ‖A‖_R < ‖A‖_C , which we will discuss
later on. Fortunately, in the case of the matrix norms (2.8) induced by the ℓ_p-
norms, the two norms coincide. In particular, the statements of Theorem 2.19
remain true for real-valued matrices no matter which definition we use, and
even for complex-valued matrices. We only have to replace A^T A by Ā^T A in the
case of the 2-norm. Here, Ā is the complex-conjugate of the matrix A, i.e. each
entry of Ā is the complex-conjugate of the corresponding entry of A.
However, instead of using the same  p -norm in the domain and range of our
matrix A, we can also choose different  p -norms. Then, it is possible to show
the following result. A proof can be found in [62, 93].

Theorem 2.26 Let 1 ≤ p, q ≤ ∞. For a matrix A ∈ R^{m×n} , the subordinate
matrix norms

‖A‖_{q,p}^R := sup_{x∈R^n\{0}} ‖Ax‖_q / ‖x‖_p ,    ‖A‖_{q,p}^C := sup_{x∈C^n\{0}} ‖Ax‖_q / ‖x‖_p

satisfy for p ≤ q

‖A‖_{q,p}^R = ‖A‖_{q,p}^C .

For p > q and m, n ≥ 3 there exists a matrix A such that

‖A‖_{q,p}^R < ‖A‖_{q,p}^C .

The following, simple example illustrates the situation for p > q.

Example 2.27 We want to compute the norm ‖A‖_{1,∞} of the matrix A ∈ R^{2×2}
given by

A = \begin{pmatrix} 1 & −1 \\ 1 & 1 \end{pmatrix}

both in the complex setting and the real setting. To this end, we note that we
can simplify the formula for computing the matrix norm to

‖A‖_{1,∞} = sup_{x≠0} ‖Ax‖_1 / ‖x‖_∞ = sup_{‖x‖_∞=1} [ |x_1 − x_2| + |x_1 + x_2| ] .

The supremum is taken either over all x = (x1 , x2 )T from R2 or C2 , respectively.


In the first case, the possible candidates x ∈ R2 with x∞ = 1 must have either
x1 = ±1 or x2 = ±1. Since the norm is symmetric in x1 and x2 it suffices to
look at x1 = ±1. In the case of x1 = 1 we must find the maximum of

f(x_2) = |1 − x_2| + |x_2 + 1| ,    x_2 ∈ [−1, 1].

However, on this domain the function obviously reduces to f(x_2) = 1 − x_2 +
x_2 + 1 = 2. The other case of x_1 = −1 follows in the same fashion so that we
have
‖A‖_{1,∞}^R = 2.

In the situation of complex x ∈ C2 with x∞ = 1 it suffices to look at the case
|x1 | = 1 and |x2 | = r ≤ 1. Hence, we can write x1 = eiθ1 and x2 = reiθ2 with
θ j ∈ R for j = 1, 2 and r ∈ [0, 1]. We have to find the maximum of
|eiθ1 + reiθ2 | + |eiθ1 − reiθ2 | = |eiθ1 (1 + rei(θ2 −θ1 ) )| + |eiθ1 (1 − rei(θ2 −θ1 ) )|
= |1 + reiθ | + |1 − reiθ | (2.14)
=: f (r, θ)
with θ := θ2 − θ1 . Both expressions on the right-hand side of (2.14) denote
the distance of a point in C from the origin, where, in the first summand, the
point is on the circle about 1 with radius r and, in the second summand, the
point is on the circle about −1 with radius r. The maximum of this expression
is attained if both distances are the same. To see this let us further express
f(r, θ) = ((1 + r cos θ)^2 + r^2 sin^2 θ)^{1/2} + ((1 − r cos θ)^2 + r^2 sin^2 θ)^{1/2}
        = (1 + r^2 + 2r cos θ)^{1/2} + (1 + r^2 − 2r cos θ)^{1/2} .
For r = 0 the latter expression has value 2. For a fixed r ∈ (0, 1], the latter
expression is as a function of θ a periodic function with period π and attains
its maximum for θ = π/2 + πZ, which gives f(r, π/2) = 2(1 + r^2)^{1/2} , which is
largest when r = 1. This shows altogether

‖A‖_{1,∞}^C = |1 − i| + |1 + i| = 2√2 .
After this example, we return to the more common situation of p = q and in
particular to p = q = 2, where Corollary 2.21 and Proposition 2.22 can be
generalised as follows.
Corollary 2.28 Let A ∈ C^{n×n} be normal, i.e. Ā^T A = A Ā^T . Then, the 2-norm
of A is given by

‖A‖_2 = |λ_n| = sup_{‖x‖_2=1} | ⟨x, Ax⟩_2 | ,

where ⟨x, y⟩_2 = ȳ^T x and where λ_n is the eigenvalue of A with the largest abso-
lute value.
The following example shows how to compute the induced matrix norm if the
vector norm is defined by ‖x‖_A := ‖Ax‖ with an invertible matrix A and a vector
norm ‖·‖. We will run into such norms frequently.
Example 2.29 Suppose S ∈ R^{n×n} is an invertible matrix and ‖·‖ is a vector
norm on R^n . Then, it is easy to see that x → ‖x‖_{S^{−1}} := ‖S^{−1} x‖ is also a vector
norm on R^n . The induced matrix norm is given by

‖A‖_{S^{−1}} := sup_{x≠0} ‖Ax‖_{S^{−1}} / ‖x‖_{S^{−1}} = sup_{x≠0} ‖S^{−1}AS S^{−1}x‖ / ‖S^{−1}x‖ = sup_{y≠0} ‖S^{−1}AS y‖ / ‖y‖ = ‖S^{−1}AS‖ .
The above example remains true in the complex setting. Assume that S ∈ C^{n×n}
is invertible and that ‖·‖ : C^n → R is a norm with induced matrix norm. Then,
the above mapping ‖·‖_{S^{−1}} : C^n → R, x → ‖S^{−1} x‖ defines a norm on C^n with
induced matrix norm ‖·‖_{S^{−1}} : C^{n×n} → R given by ‖A‖_{S^{−1}} = ‖S^{−1}AS‖ . We can
restrict both norms to R^n and R^{n×n} and, even though the resulting matrix norm
on R^{n×n} is in general not the induced matrix norm of the restricted vector norm,
the two norms are still compatible in the real setting.
These considerations imply that not every matrix norm is an induced norm,
not even every compatible norm has to be an induced norm. Another, simpler
example is given by the so-called Frobenius norm.
Definition 2.30 The Frobenius norm of a matrix A ∈ R^{m×n} is defined to be

‖A‖_F = ( Σ_{i=1}^m Σ_{j=1}^n |a_ij|^2 )^{1/2} .

The Frobenius norm is indeed a norm. This follows immediately by realising
that it is the Euclidean norm if we identify R^{m×n} with R^{m·n} . Moreover, the
Frobenius norm is compatible with the ℓ_2 vector norm, since we have

‖Ax‖_2^2 = Σ_{i=1}^m | Σ_{j=1}^n a_ij x_j |^2 ≤ Σ_{i=1}^m ( Σ_{j=1}^n |a_ij|^2 ) ( Σ_{j=1}^n |x_j|^2 ) = ‖A‖_F^2 ‖x‖_2^2 .

Corollary 2.31 The Frobenius norm is compatible with the Euclidean norm.
Moreover, it can be expressed alternatively as
‖A‖_F^2 = tr(A^T A) = tr(AA^T ) ,    A ∈ R^{m×n} ,

where tr(B) = Σ_i b_ii denotes the trace of the matrix B.
Proof The additional representation follows from the fact that all three terms
give the sum over the squared entries of the matrix. 

The Frobenius norm cannot be induced since we have ‖I‖_F = √(min(m, n)) ≠ 1.
But as a compatible norm the Frobenius norm is multiplicative.
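Both the definition and the trace formula of Corollary 2.31 are easily verified numerically (a sketch, assuming NumPy):

import numpy as np

A = np.random.rand(4, 3)
fro = np.sqrt((A**2).sum())                      # Definition 2.30
print(np.isclose(fro, np.linalg.norm(A, 'fro')))
print(np.isclose(fro**2, np.trace(A.T @ A)))     # Corollary 2.31
print(np.isclose(fro**2, np.trace(A @ A.T)))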

2.3 Conditioning
In the last section, while discussing the problem of multiplying a matrix with
a vector, we encountered the idea of conditioning. We associated a condition
number with a matrix, which gave some insight into whether small input errors
are magnified or not.
We now want to discuss this more generally and in more detail.

08:40:18, subject to the Cambridge


03
2.3 Conditioning 47

Definition 2.32 Consider a problem f : V → W from a normed vector space


V of data to a normed vector space W of solutions. For a particular point x ∈ V
we say that the problem f (x) is well-conditioned if all small perturbations of x
lead only to small changes in f (x). A problem is ill-conditioned if there exists
a small perturbation that leads to a large change in f (x).
To make this more precise, we want to introduce the concept of a condition
number also in this context.
In general, one distinguishes between the absolute condition number, which is
the smallest number κ̃(x) which satisfies

‖ f(x + Δx) − f(x) ‖ ≤ κ̃(x) ‖Δx‖ ,

for all x and all small Δx, and the more useful relative condition number, κ(x),
given by the smallest number κ(x) satisfying

‖ f(x + Δx) − f(x) ‖ / ‖ f(x) ‖ ≤ κ(x) ‖Δx‖ / ‖x‖ , (2.15)

for all x ≠ 0 with f(x) ≠ 0 and all small Δx. We still need to make the notion
of small errors Δx more precise. In the case of the relative condition number
this is done as follows.

Definition 2.33 Let V, W be normed spaces and f : V → W. The (relative)
condition number of f at x ∈ V is defined to be

κ(x) = κ(f, x) := lim_{ε→0} sup_{‖Δx‖≤ε} ( ‖ f(x + Δx) − f(x) ‖ / ‖ f(x) ‖ ) / ( ‖Δx‖ / ‖x‖ ) .
We will soon derive a general formula for the condition number in the case of
differentiable functions. But before that, let us discuss a few simple examples.
Example 2.34 (Matrix–vector multiplication) As a first example, let us re-
view our previous example of matrix–vector multiplication. Here, we have
f (x) = Ax. Hence,
‖A(x + Δx) − Ax‖ / ‖Ax‖ = ‖AΔx‖ / ‖Ax‖ ≤ ( ‖A‖ ‖x‖ / ‖Ax‖ ) · ‖Δx‖ / ‖x‖ =: κ(x) ‖Δx‖ / ‖x‖ .

If A is invertible we have ‖x‖ = ‖A^{−1}Ax‖ ≤ ‖A^{−1}‖ ‖Ax‖ such that

κ(x) ≤ ‖A‖ ‖A^{−1}‖ = κ(A).
The next two examples deal with the simple operation of adding and multi-
plying two numbers. Since subtraction and division can be recast as addition
and multiplication, these operations are covered by these examples, as well.
We leave it to the reader to determine the actual condition number and rather
concentrate on the question of whether the operation is well-conditioned or
not.
Example 2.35 (Addition of two numbers) Let f : R2 → R be defined as
f (x, y) = x + y. Then,
( f(x + Δx, y + Δy) − f(x, y) ) / f(x, y) = ( x + Δx + y + Δy − (x + y) ) / (x + y)
                                        = (x/(x + y)) · Δx/x + (y/(x + y)) · Δy/y .
Hence, the addition of two numbers is well-conditioned if they have the same
sign and are sufficiently far away from zero. However, if the two numbers are
of the same magnitude but with different signs, i.e. x ≈ −y, then adding these
numbers is highly ill-conditioned because of cancellation.
Example 2.36 (Multiplying two numbers) The procedure is similar to the
one of the previous example. This time, we have f (x, y) = x · y. Hence, we
have
( f(x + Δx, y + Δy) − f(x, y) ) / f(x, y) = ( (x + Δx)(y + Δy) − xy ) / (xy)
                                        = Δy/y + Δx/x + (Δx/x)(Δy/y) ,
which shows that the multiplication of two numbers is well-conditioned.
Next, let us consider a more general approach. Recall that for a differentiable
function f : Rn → R, the gradient of f at x is defined to be the vector
∇f(x) = ( ∂f(x)/∂x_1 , . . . , ∂f(x)/∂x_n )^T .
Theorem 2.37 Let f : Rn → R be a continuously differentiable function.
Then, for x ∈ R^n , the condition number of f with respect to the ℓ_2-norm in R^n
satisfies

κ(x) = ‖x‖_2 ‖∇f(x)‖_2 / | f(x)| . (2.16)
Proof The Taylor expansion of f gives
f (x + Δx) − f (x) = ∇ f (ξ)T Δx
with a ξ on the straight line between x and x + Δx. Hence, we can bound the
relative output error by

( | f(x + Δx) − f(x)| / | f(x)| ) · ( ‖x‖_2 / ‖Δx‖_2 ) = ‖x‖_2 |∇f(ξ)^T Δx| / ( ‖Δx‖_2 | f(x)| ) ≤ ‖x‖_2 ‖∇f(ξ)‖_2 / | f(x)| ,
where we have used the Cauchy–Schwarz inequality in the last step. The ex-
pression on the right-hand side depends only implicitly on Δx as ξ is on the
line segment between x and x + Δx. If we now take the limit Δx2 → 0 as
in Definition 2.33 we have that ξ → x, which shows that κ(x) is at most the
expression on the right-hand side of (2.16). If Δx is parallel to ∇ f (x) we even
have equality in the above inequality, which shows that κ(x) does indeed have
the stated form. 

Example 2.38 (Square root) Let f(x) = √x , x > 0. Then, we have f′(x) =
(1/2) x^{−1/2} and hence

κ(x) = |x| | f′(x)| / | f(x)| = x / (2 √x √x) = 1/2 .
Hence, taking the square root is a well-conditioned problem with condition
number 1/2.
The next example is another application of (2.16). This time we look at the
exponential function.
Example 2.39 (Function evaluation) Let f(x) = e^{x^2} , x ∈ R. Since f′(x) =
2x e^{x^2} = 2x f(x), we can use (2.16) to see that

κ(x) = |x| | f′(x)| / | f(x)| = 2x^2 .
Hence, the problem is well-conditioned for small x and ill-conditioned for
large x.
So far, we have dealt with the condition number of a problem. While there is
not much choice in computing the sum of two real numbers, the situation is
already different when it comes to computing the square root. Here, we did not
consider a particular method to compute the square root but only the problem
itself. However, it might happen that a problem is well-conditioned but the
algorithm or method we use is not.
Definition 2.40 A method for solving a numerical problem is called well-
conditioned if its condition number is comparable to the condition number of
the problem.
Let us explain this with a simple example.
Example 2.41 We consider the quadratic equation x2 + 2px − q = 0 with
p, q > 0. Then, we know from elementary calculus that the positive solution to
this problem is given by

x = −p + √(p^2 + q). (2.17)
We can expect that this formula is ill-conditioned if p is much larger than


q since then we will have x ≈ −p + p and the last expression is prone to
cancellation. We will make this more precise later but we first want to look
at the condition number of the problem of finding the positive solution itself.
To do this, let us introduce the notation x(p, q) for the solution of the equation
with coefficients p and q. Then, the relative output error is given by
ε_x = Δx/x = ( x(p + Δp, q + Δq) − x(p, q) ) / x(p, q) .

If we also write ε_p = Δp/p and ε_q = Δq/q for the relative input errors, we find
p + Δp = p(1 + ε_p) and q + Δq = q(1 + ε_q). Using that x(p + Δp, q + Δq) solves
the quadratic equation with coefficients p + Δp and q + Δq yields

0 = (1 + ε_x)^2 x^2 + 2px(1 + ε_p)(1 + ε_x) − q(1 + ε_q)
  = (1 + 2ε_x + ε_x^2) x^2 + 2px(1 + ε_p + ε_x + ε_p ε_x) − q − q ε_q .

If we now ignore all quadratic terms in this equation (which we will denote
with a dot above the equal sign), i.e. ε_x^2 and ε_p ε_x , and rearrange the terms, we
derive

0 =˙ x^2 + 2px − q + 2ε_x (x^2 + px) + 2ε_p px − q ε_q = 2ε_x (x^2 + px) + 2ε_p px − q ε_q .

This resolves to

ε_x =˙ (1/2) ( q ε_q − 2ε_p px ) / (x^2 + px) = (1/2) ( (q − px) ε_q + px ε_q − 2ε_p px ) / (q − px)
    = (1/2) ε_q + (1/2) px ( ε_q − 2ε_p ) / (x^2 + px)

and this finally gives the bound

|ε_x| ≤˙ (1/2) |ε_q| + (1/2) px ( |ε_q| + 2|ε_p| ) / (x^2 + px) ≤ |ε_q| + |ε_p| .
This means that finding the positive solution to our quadratic equation is a
well-conditioned problem.
As mentioned above, (2.17) is not a well-conditioned method to compute the
solution. To see this, let us analyse the method in more detail. Using the same
notation as above to denote the relative error, we can, first of all, summarise
the findings of our previous examples as follows:
|ε_{√x}| ≤ (1/2) |ε_x| ,
|ε_{x+y}| ≤ |x/(x + y)| |ε_x| + |y/(x + y)| |ε_y| ,
|ε_{x·y}| ≤˙ |ε_x| + |ε_y| .
Under the assumption ε_p , ε_q ∈ [−ε, ε], this means for the solution formula
(2.17) successively

|ε_{p^2}| ≤˙ |ε_p| + |ε_p| ≤ 2ε ,
|ε_{p^2+q}| ≤˙ ( p^2/(p^2 + q) ) |ε_{p^2}| + ( q/(p^2 + q) ) |ε_q| ≤ 2ε ,
|ε_{√(p^2+q)}| ≤ (1/2) |ε_{p^2+q}| ≤˙ ε

and finally, with y = √(p^2 + q),

|ε_x| = |ε_{−p+y}| ≤˙ |−p/(−p + y)| |ε_p| + |y/(−p + y)| |ε_y| ≤ ( (p + y)/(y − p) ) ε .
The term y − p in the denominator becomes problematic in the situation that p
is much larger than q due to cancellation. Thus, in this situation, the formula is
ill-conditioned. However, in this situation we can rewrite it as
x = −p + √(p^2 + q) = ( −p + √(p^2 + q) ) · ( p + √(p^2 + q) ) / ( p + √(p^2 + q) )
  = ( p^2 + q − p^2 ) / ( p + √(p^2 + q) ) = q / ( p + √(p^2 + q) ) .

To show that this formula is well-conditioned we define z = p + p2 + q and
find
p y
| z | = | p+y | ≤ | p | + | y | ≤ .
p+y y+ p
Using the easily verifiable
x/y =˙ x − y

then finally gives


| x | = | q/z | =˙ | q − z | ≤ 2 ,

i.e. the well-conditioning of the new formula.

Let us now return to linear algebra. We will finish this section by studying
the problem of solving a linear system of the form Ax = b with an invertible
matrix A ∈ Rn×n and given b ∈ Rn .
Suppose our input data A and b have been perturbed such that we are actually
given A + ΔA and b + Δb. We can try to model the error in the output by
assuming that x + Δx solves

(A + ΔA)(x + Δx) = b + Δb.
Multiplying this out and using Ax = b yields

(A + ΔA)Δx = Δb − (ΔA)x.

For the time being, let us assume that A + ΔA is invertible, which is, by conti-
nuity, true if A is invertible and ΔA only a small perturbation. Then,

Δx = (A + ΔA)^{−1} ( Δb − (ΔA)x ) ,

i.e.

‖Δx‖ ≤ ‖(A + ΔA)^{−1}‖ ( ‖Δb‖ + ‖ΔA‖ ‖x‖ ) ,

which gives a first estimate on the absolute error. However, more relevant is
the relative error

‖Δx‖ / ‖x‖ ≤ ‖(A + ΔA)^{−1}‖ ‖A‖ ( ‖Δb‖ / (‖A‖ ‖x‖) + ‖ΔA‖ / ‖A‖ )
           ≤ ‖(A + ΔA)^{−1}‖ ‖A‖ ( ‖Δb‖ / ‖b‖ + ‖ΔA‖ / ‖A‖ ) .

The last estimate shows that the relative input error is magnified by a factor

‖(A + ΔA)^{−1}‖ ‖A‖ (2.18)
for the relative output error. This demonstrates the problem of conditioning.
The latter expression can further be modified to incorporate the condition num-
ber of a matrix.
It is now time to make our assumptions more rigorous, which ultimately also
leads to a further bound on (2.18).
Lemma 2.42 If A ∈ R^{n×n} is a non-singular matrix and ΔA ∈ R^{n×n} a pertur-
bation, which satisfies

‖A^{−1}‖ ‖ΔA‖ < 1

in a compatible matrix norm, then A + ΔA is non-singular and we have

‖(A + ΔA)^{−1}‖ ≤ ‖A^{−1}‖ / ( 1 − ‖A^{−1}‖ ‖ΔA‖ ) .

Proof Let us define the mapping T : R^n → R^n with T x = (A + ΔA)x, x ∈ R^n .
Then, we have obviously x = A^{−1}T x − A^{−1}(ΔA)x and ‖x‖ ≤ ‖A^{−1}‖ ‖T x‖ +
‖A^{−1}‖ ‖ΔA‖ ‖x‖. This means particularly

‖x‖ ≤ ( ‖A^{−1}‖ / ( 1 − ‖A^{−1}‖ ‖ΔA‖ ) ) ‖T x‖ ,

which immediately leads to the injectivity and hence bijectivity of T = A + ΔA
and to the stated bound. □
Theorem 2.43 Suppose that A ∈ R^{n×n} is invertible and that the perturbation
ΔA ∈ R^{n×n} satisfies ‖A^{−1}‖ ‖ΔA‖ < 1. Then, for b, Δb ∈ R^n with b ≠ 0, the
estimate

‖Δx‖ / ‖x‖ ≤ κ(A) ( 1 − κ(A) ‖ΔA‖ / ‖A‖ )^{−1} ( ‖Δb‖ / ‖b‖ + ‖ΔA‖ / ‖A‖ )

holds.

Proof Lemma 2.42 guarantees the invertibility of A + ΔA. Hence, our findings
so far together with Lemma 2.42 yield

‖Δx‖ / ‖x‖ ≤ ‖(A + ΔA)^{−1}‖ ‖A‖ ( ‖Δb‖ / ‖b‖ + ‖ΔA‖ / ‖A‖ )
           ≤ ( ‖A^{−1}‖ ‖A‖ / ( 1 − ‖A^{−1}‖ ‖ΔA‖ ) ) ( ‖Δb‖ / ‖b‖ + ‖ΔA‖ / ‖A‖ ) ,

from which the statement easily follows. □

If the number κ(A) ‖ΔA‖/‖A‖ is close to zero, the stability estimate in Theorem
2.43 takes the form

‖Δx‖ / ‖x‖ ≈ κ(A) ( ‖Δb‖ / ‖b‖ + ‖ΔA‖ / ‖A‖ ) ,

showing that the relative input error is amplified by κ(A) in the relative output
error.

Example 2.44 Assume the condition number behaves like κ(A) = O(10^σ)
and that the relative input errors behave like

‖ΔA‖ / ‖A‖ = O(10^{−k}) ,    ‖Δb‖ / ‖b‖ = O(10^{−k})

with k > σ. Then, the relative error of the solution is of size

‖Δx‖ / ‖x‖ ≈ 10^{σ−k} .

In the case of the maximum norm we have to expect to lose σ digits in accuracy.

As we always have to expect errors of the size of the machine precision, the
above example shows that a condition number inverse proportional to the ma-
chine precision will produce essentially unreliable results.
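The amplification described in Theorem 2.43 and Example 2.44 can be observed in a small experiment (a sketch, assuming NumPy; the Vandermonde matrix merely serves as an example of an ill-conditioned matrix):

import numpy as np

n = 10
A = np.vander(np.linspace(0.0, 1.0, n), increasing=True)  # ill-conditioned
x = np.ones(n)
b = A @ x

db = 1.0e-10 * np.random.randn(n)          # small perturbation of the right-hand side
x_pert = np.linalg.solve(A, b + db)

rel_in  = np.linalg.norm(db) / np.linalg.norm(b)
rel_out = np.linalg.norm(x_pert - x) / np.linalg.norm(x)
print(np.linalg.cond(A), rel_in, rel_out)  # rel_out is roughly bounded by cond(A) * rel_in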
2.4 Stability
There is yet another way of looking at the output of an algorithm and how well
it computes the solution of the original problem.
Consider again a problem f : V → W from a vector space V of data to a vector
space W of solutions. A numerical algorithm can then simply be defined to
be another map f˜ : V → W. More precisely, the given data x ∈ V will be
rounded to floating point precision and then supplied to the algorithm f˜. We
can and will, however, incorporate this rounding into the algorithm f˜. Hence,
the actual outcome is a solution f˜(x).
Obviously, we could call f˜ stable if the approximate result f˜(x) is close to the
desired result f (x), i.e., when taking the relative point of view again, if
‖ f˜(x) − f(x)‖ / ‖ f(x)‖
is small. Unfortunately, this is often extremely difficult to verify. Hence, a dif-
ferent concept has been developed. While the above concept requires the ap-
proximate result from the correct input data to be close to the correct result, we
could also try to interpret the outcome f˜(x) as the result of the original function
f at a perturbed point x̃ ∈ V, i.e. we assume that f˜(x) = f (x̃).

Definition 2.45 The algorithm f˜ is called backward-stable for the problem f


if for every x ∈ V there is an x̃ ∈ V with f (x̃) = f˜(x) and
‖x − x̃‖ / ‖x‖ = O(eps), (2.19)
where eps denotes the machine precision.

In other words, this means that a backward-stable algorithm solves the problem
correctly with nearly the correct input data.
The concept of backward-stability is a very powerful concept in Numerical
Linear Algebra. For many algorithms it can be shown that they are backward-
stable. A backward-stable algorithm computes a good approximation to the
original problem provided the condition number of the original problem is of
moderate size.

Theorem 2.46 Let f˜ : V → W be a backward-stable algorithm to compute


the problem f : V → W. Then, for each x ∈ V,
‖ f˜(x) − f(x)‖ / ‖ f(x)‖ = O(κ(x) eps),
where κ(x) denotes the relative condition number for computing f (x).
Proof Since f˜ is backward-stable, we can find an x̃ ∈ V with f˜(x) = f (x̃)


and (2.19). The definition of the relative condition number in Definition 2.33
yields
‖ f˜(x) − f(x)‖ / ‖ f(x)‖ = ‖ f(x̃) − f(x)‖ / ‖ f(x)‖ ≤ κ(x) ‖x − x̃‖ / ‖x‖ = O(κ(x) eps). □
Let us briefly describe how this general theory applies to our main application
of solving linear systems, though we can also use it for analysing algorithms
for solving least-squares or eigenvalue problems.
When solving linear systems Ax = b, we may assume that the right-hand side b
is exact. Hence, the linear mapping we are interested in is given by f : Rn×n →
Rn , f (A) := A−1 b. The condition number of this problem is given by Theorem
2.43 with Δb = 0. More precisely, Theorem 2.43 yields
( ‖ f(A + ΔA) − f(A)‖ / ‖ f(A)‖ ) / ( ‖ΔA‖ / ‖A‖ ) = ( ‖Δx‖ / ‖x‖ ) / ( ‖ΔA‖ / ‖A‖ ) ≤ κ(A) ( 1 − κ(A) ‖ΔA‖ / ‖A‖ )^{−1} .

Taking the limit ‖ΔA‖ → 0 as in Definition 2.33, this shows that κ( f, A) ≤
κ(A) = ‖A‖ ‖A^{−1}‖ .

Corollary 2.47 Let A ∈ Rn×n be invertible and b ∈ Rn . If x = A−1 b and if


x̃ ∈ Rn is the approximate solution computed with a backward-stable algorithm
then the relative error satisfies
‖x̃ − x‖ / ‖x‖ = O(κ(A) eps).
In the next chapter, we will discuss the backward-stability of several algo-
rithms. However, we will omit the proofs as they are usually extremely tech-
nical. Nonetheless, the reader should be very aware of this problem and may
consult the books by Higham [85] and Trefethen and Bau [124].

Exercises
2.1 Consider the matrix A = uvT , where u ∈ Rm and v ∈ Rn . Show that

‖A‖_2 = ‖u‖_2 ‖v‖_2 .

2.2 Calculate ‖A‖_1 , ‖A‖_2 and ‖A‖_∞ for the matrices

A = \begin{pmatrix} 1 & −2 \\ −2 & 4 \end{pmatrix} ,    A = \begin{pmatrix} 4 & 1 & −1 \\ 1 & 2 & 0 \\ −1 & 0 & 2 \end{pmatrix} .
2.3 For A ∈ R^{m×n} show that

‖A‖_2 ≤ √( ‖A‖_∞ ‖A‖_1 ) .
2.4 Let A ∈ R^{n×n} be invertible and 1 ≤ p, q ≤ ∞ with 1/p + 1/q = 1. Show that

κ_p(A) = κ_q(A^T ).
2.5 Let A ∈ R^{m×n} have the singular values σ_1 ≥ σ_2 ≥ · · · ≥ σ_r > 0 with
r = rank(A). Show that

‖A‖_F = ( Σ_{j=1}^r σ_j^2 )^{1/2} .

2.6 Consider the matrix

A = \begin{pmatrix} 1 & 1 \\ 1 & −1 \end{pmatrix} .

Show that for 1 ≤ p, q ≤ ∞ the norm ‖A‖_{q,p}^C is given by

‖A‖_{q,p}^C = max{ 2^{1−1/p} , 2^{1/q} , 2^{1/q − 1/p + 1/2} }.

Show further that for p ≤ 2 or q ≥ 2 we also have

‖A‖_{q,p}^C = ‖A‖_{q,p}^R = max{ 2^{1−1/p} , 2^{1/q} }

and that for q < 2 < p we have

‖A‖_{q,p}^R < 2^{1/q − 1/p + 1/2} = ‖A‖_{q,p}^C .

2.7 For ε > 0 let A_ε ∈ R^{3×3} be given by

A_ε := \begin{pmatrix} 1−ε & −ε & −ε \\ −ε & 1−ε & −ε \\ −ε & −ε & 1−ε \end{pmatrix} .

Show that for p > q ≥ 1 there exists an ε > 0 such that ‖A_ε‖_{q,p}^R < ‖A_ε‖_{q,p}^C .
PART TWO

BASIC METHODS
3
Direct Methods for Solving Linear Systems

In this chapter we will discuss direct methods for solving linear systems. Direct
means that the algorithms derived here will terminate after a fixed number of
steps, usually n for an n × n matrix, with the solution of the system.

3.1 Back Substitution


The simplest linear system to solve is, of course, a system of the form Dx = b,
where D = diag(d11 , . . . , dnn ) ∈ Rn×n is a diagonal matrix with non-vanishing
entries on the diagonal. The obvious solution is simply given by xi = bi /dii for
1 ≤ i ≤ n.
A slightly more complex problem is given if the matrix is a triangular matrix,
i.e. we seek a solution x ∈ Rn of the equation

Rx = b, (3.1)

where b ∈ Rn is given and R ∈ Rn×n is, for example, an upper triangular matrix,
i.e. a matrix of the form
R = \begin{pmatrix}
r_{11} & r_{12} & r_{13} & \cdots & r_{1n} \\
       & r_{22} & r_{23} & \cdots & r_{2n} \\
       &        & \ddots &        & \vdots \\
       &        &        & \ddots & \vdots \\
   0   &        &        &        & r_{nn}
\end{pmatrix} .

The system has a unique solution if the determinant of R is different from zero.
Since we have an upper triangular matrix, the determinant equals the product
of all diagonal entries. Hence, R is invertible once again if r_ii ≠ 0 for 1 ≤ i ≤ n.
Assuming all of this, we see that the last equation of (3.1) is simply given by
rnn xn = bn ,
which resolves to xn = bn /rnn . Next, the equation from row n − 1 is
rn−1,n−1 xn−1 + rn−1,n xn = bn−1 .
In this equation, we already know xn so that the only unknown is xn−1 , which
we can compute as
xn−1 = (bn−1 − rn−1,n xn )/rn−1,n−1 .
We obviously can proceed in this way to compute successively xn−2 , xn−3 and
so on until we finally compute x1 . The general algorithm for computing the
solution in this situation is based upon
x_i = (1/r_ii) ( b_i − Σ_{j=i+1}^n r_ij x_j ) ,    i = n − 1, n − 2, . . . , 1, (3.2)

see Algorithm 2, and is called back substitution or backward substitution.

Algorithm 2: Back substitution


Input : Upper triangular matrix R ∈ Rn×n , b ∈ Rn .
Output: Solution x ∈ Rn of Rx = b.
1 xn := bn /rnn
2 for i = n − 1 downto 1 do
3 xi := bi
4 for j = i + 1 to n do
5 xi := xi − ri j x j
6 xi := xi /rii

A standard run-time analysis shows the following.


Theorem 3.1 The system Rx = b with an invertible upper triangular matrix
R ∈ Rn×n can be solved in O(n2 ) time using back substitution.
Proof The computational cost is mainly determined by the two for loops in
Algorithm 2. The cost of line 5 is one flop, as is the cost of line 6. Hence, the
total cost becomes
$$\sum_{i=1}^{n-1}\Bigl(1 + \sum_{j=i+1}^{n}1\Bigr) = n-1 + \sum_{i=1}^{n-1}\bigl(n-(i+1)\bigr) = \sum_{i=1}^{n-1}(n-i) = \sum_{i=1}^{n-1}i = \frac{(n-1)n}{2}. \qquad\square$$


It should be clear that we can proceed in exactly the same way when considering lower triangular matrices, starting with the computation of $x_1$. The corresponding solution procedure is then called forward substitution.

Remark 3.2 Back substitution is backward-stable. To be more precise, the


computed solution x̃ ∈ Rn of a non-singular upper triangular system Rx = b
satisfies (R + ΔR)x̃ = b with ΔR ∈ Rn×n satisfying

$$|(\Delta R)_{ij}| \le n\,\mathrm{eps}\,|r_{ij}| + O(\mathrm{eps}^2), \qquad 1 \le i, j \le n.$$

A proof of this result can, for example, be found in Higham [85] and Trefethen
and Bau [124]. According to Corollary 2.47, we can assume that back substi-
tution allows us to compute a sufficiently accurate solution as long as κ(R) is
not too large. The same remains true for forward substitution.

3.2 Gaussian Elimination


We now turn to solving the system Ax = b for arbitrary, invertible matrices
A ∈ Rn×n . Probably the best-known and still very popular method for solving a
square system Ax = b is Gaussian elimination.
The aim of Gaussian elimination is to reduce a linear system of equations
Ax = b to an upper triangular system Ux = c by applying very simple lin-
ear transformations to it, which do not change the solution.
Suppose we have a linear system of the form

$$\begin{aligned} a_{11}x_1 + a_{12}x_2 + \dots + a_{1n}x_n &= b_1,\\ a_{21}x_1 + a_{22}x_2 + \dots + a_{2n}x_n &= b_2,\\ &\;\;\vdots\\ a_{n1}x_1 + a_{n2}x_2 + \dots + a_{nn}x_n &= b_n, \end{aligned}$$

then, assuming that $a_{11} \ne 0$, we can multiply the first row by $\frac{a_{21}}{a_{11}}$ and subtract the resulting row from the second row, which cancels the factor in front of $x_1$ in that row. Then, we can multiply the original first row by $\frac{a_{31}}{a_{11}}$ and subtract it from the third row, which, again, cancels the factor in front of $x_1$ in that row.
It is important to note that this procedure does not change the solution of the
system, i.e. the original system and the system modified in this way obviously
have the same solution.
Continuing like this all the way down, we end up with an equivalent linear
system, i.e. a system with the same solution, of the form

08:40:30, subject to the Cambridge


04
62 Direct Methods for Solving Linear Systems

$$\begin{aligned} a_{11}x_1 + a_{12}x_2 + \dots + a_{1n}x_n &= b_1,\\ a_{22}^{(2)}x_2 + \dots + a_{2n}^{(2)}x_n &= b_2^{(2)},\\ &\;\;\vdots\\ a_{n2}^{(2)}x_2 + \dots + a_{nn}^{(2)}x_n &= b_n^{(2)}. \end{aligned}$$

Assuming now that $a_{22}^{(2)} \ne 0$, we can repeat the whole process, using the second row now to eliminate $x_2$ from rows three to $n$, and so on. Hence, after $k-1$ steps, $k \ge 2$, the system can be written in matrix form as

$$A^{(k)}x = \begin{pmatrix} a_{11} & * & \cdots & * & \cdots & *\\ 0 & a_{22}^{(2)} & & * & \cdots & *\\ \vdots & & \ddots & & & \vdots\\ 0 & 0 & & a_{kk}^{(k)} & \cdots & a_{kn}^{(k)}\\ \vdots & \vdots & & \vdots & & \vdots\\ 0 & 0 & & a_{nk}^{(k)} & \cdots & a_{nn}^{(k)} \end{pmatrix}x = \begin{pmatrix} b_1^{(1)}\\ b_2^{(2)}\\ \vdots\\ b_k^{(k)}\\ \vdots\\ b_n^{(k)} \end{pmatrix}. \tag{3.3}$$
If in each step $a_{kk}^{(k)}$ is different from zero, Gaussian elimination produces after $n-1$ steps an upper triangular matrix. The final, resulting linear system can then be solved using back substitution as described in Algorithm 2.
In the next sections, we will address the questions of when all the diagonal entries $a_{kk}^{(k)}$ are different from zero and what to do if this is not the case.
For theoretical (not numerical) reasons, it is useful to rewrite this process using
matrices. To this end, we need to rewrite the elementary operations we just used
into matrix form. The corresponding matrices are the following elementary
lower triangular matrices.

Definition 3.3 For $1 \le k \le n-1$, let $m \in \mathbb{R}^n$ be a vector with $e_j^Tm = 0$ for $1 \le j \le k$, meaning that $m$ is of the form
$$m = (0, 0, \dots, 0, m_{k+1}, \dots, m_n)^T.$$
An elementary lower triangular matrix is a lower triangular matrix of the specific form
$$L_k(m) := I - me_k^T = \begin{pmatrix} 1 & & & & &\\ & \ddots & & & &\\ & & 1 & & &\\ & & -m_{k+1} & 1 & &\\ & & \vdots & & \ddots &\\ & & -m_n & & & 1 \end{pmatrix}.$$
Obviously, an elementary lower triangular matrix is a lower triangular matrix.
It also is normalised in the sense that the diagonal entries are all 1. It is quite
easy to see that an elementary lower triangular matrix has the following prop-
erties.

Lemma 3.4 An elementary lower triangular matrix Lk (m) has the following
properties:

1. det Lk (m) = 1.
2. Lk (m)−1 = Lk (−m).
3. Multiplying a matrix A with Lk (m) from the left leaves the first k rows un-
changed and, starting from row j = k + 1, subtracts the row m j (ak1 , . . . , akn )
from row j of A.

With this notation, we see that the first step of Gaussian elimination can be
described using the elementary lower triangular matrix

$$L_1 = L_1(m_1) = I - m_1e_1^T,$$
employing the vector
$$m_1 = \Bigl(0, \frac{a_{21}}{a_{11}}, \dots, \frac{a_{n1}}{a_{11}}\Bigr)^T.$$
The complete Gaussian elimination process can be described as follows.

Theorem 3.5 Let $A \in \mathbb{R}^{n\times n}$ be non-singular and $b \in \mathbb{R}^n$ be given. Set $A^{(1)} = A$ and $b^{(1)} = b$ and then for $1 \le k \le n-1$ iteratively
$$\begin{aligned} m_k &:= \Bigl(0, \dots, 0, \frac{a_{k+1,k}^{(k)}}{a_{kk}^{(k)}}, \dots, \frac{a_{nk}^{(k)}}{a_{kk}^{(k)}}\Bigr)^T,\\ L_k &:= I - m_ke_k^T,\\ (A^{(k+1)}, b^{(k+1)}) &:= L_k\bigl(A^{(k)}, b^{(k)}\bigr), \end{aligned}$$
provided that $a_{kk}^{(k)} \ne 0$. Assuming that the process does not finish prematurely,


it stops with a linear system A(n) x = b(n) with an upper triangular matrix A(n) ,
which has the same solution as the original system Ax = b.

Proof Obviously, the systems have the same solution since, in each step, we
multiply both sides of the equation Ax = b by an invertible matrix.
We will show by induction that the matrices $A^{(k)}$, $2 \le k \le n$, satisfy
$$a_{ij}^{(k)} = e_i^TA^{(k)}e_j = 0 \qquad\text{for } 1 \le j < k,\ j < i \le n. \tag{3.4}$$
This means particularly for $A^{(n)}$ that all sub-diagonal entries vanish.
For $k = 2$ this has been explained above. Hence, let us now assume that (3.4) holds for $k < n$. If the indices $i$ and $j$ satisfy $1 \le j < k+1$ and $j < i \le n$ then we can conclude
$$a_{ij}^{(k+1)} = e_i^TL_kA^{(k)}e_j = e_i^T(I - m_ke_k^T)A^{(k)}e_j = e_i^TA^{(k)}e_j - (e_i^Tm_k)(e_k^TA^{(k)}e_j).$$
For $j < k$ this means particularly $a_{ij}^{(k+1)} = 0$, since both $e_i^TA^{(k)}e_j$ and $e_k^TA^{(k)}e_j$ are zero by assumption. Finally, for $j = k$ and $i > j$ we have, because of
$$e_i^Tm_k = \frac{a_{ik}^{(k)}}{a_{kk}^{(k)}} = \frac{e_i^TA^{(k)}e_k}{e_k^TA^{(k)}e_k},$$
also
$$a_{ij}^{(k+1)} = e_i^TA^{(k)}e_k - \frac{e_i^TA^{(k)}e_k}{e_k^TA^{(k)}e_k}\,e_k^TA^{(k)}e_k = 0. \qquad\square$$

An algorithmic description of Gaussian elimination is given in Algorithm 3.


Note that this algorithm works in situ, meaning that each matrix $A^{(k)}$ overwrites its predecessor and no additional memory is required. Note also that Algorithm 3 not only stores the upper triangular matrix $A^{(n)}$ but also stores the essential elements of the elementary lower triangular matrices $L_k$, i.e. it stores the defining vectors $m_k$ in that part of column $k$ beneath the diagonal.

Remark 3.6 Algorithm 3 for Gaussian elimination requires O(n3 ) time and
O(n2 ) space.

Proof The remark about the space is obvious. For the computational cost, we
analyse the three for loops in Algorithm 3. Considering that line 2 needs one
flop, lines 4 and 5 require two flops and line 7 requires again one flop, we see


Algorithm 3: Gaussian elimination


Input : A ∈ Rn×n , b ∈ Rn .
Output: A(n) upper triangular and m1 , . . . , mn−1 and b(n) ∈ Rn .
1 for k = 1 to n − 1 do
2 d := 1/akk
3 for i = k + 1 to n do
4 aik := aik · d
5 bi := bi − bk · aik
6 for j = k + 1 to n do
7 ai j := ai j − aik · ak j

that the time requirement is determined by


⎛ ⎛ ⎞⎞

n−1 ⎜
⎜⎜⎜ 
n ⎜
⎜⎜⎜ 
n ⎟⎟⎟⎟⎟⎟
T (n) = ⎜
⎝⎜1 + ⎜
⎝⎜2 + 1⎟⎟⎠⎟⎟⎟⎠⎟
k=1 i=k+1 j=k+1
⎛ ⎞

n−1 ⎜
⎜⎜⎜ 
n ⎟⎟⎟
= (n − 1) + ⎜⎝ 2(n − k) + (n − k) ⎟⎟⎠
k=1 i=k+1

n−1  
= (n − 1) + 2k + k2
k=1
(n − 1)n(2n − 1)
= (n − 1) + n(n − 1) +
6
1 3
= n + O(n2 ). 
3
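A Python/NumPy sketch of Algorithm 3 is given below for readers who prefer executable code to pseudocode; the function name is our own and the routine assumes that all pivots $a_{kk}^{(k)}$ are non-zero. Together with a back-substitution routine such as the one sketched earlier, it solves $Ax = b$ in $O(n^3)$ time.

```python
import numpy as np

def gaussian_elimination(A, b):
    """In-situ Gaussian elimination as in Algorithm 3 (no pivoting).

    On return, the upper triangle of A holds A^(n), the strict lower
    triangle holds the multipliers m_k, and b holds b^(n).
    """
    A, b = A.astype(float).copy(), b.astype(float).copy()
    n = len(b)
    for k in range(n - 1):
        d = 1.0 / A[k, k]                      # assumes a_kk^(k) != 0
        for i in range(k + 1, n):
            A[i, k] *= d                       # store multiplier
            b[i] -= b[k] * A[i, k]
            A[i, k + 1:] -= A[i, k] * A[k, k + 1:]
    return A, b
```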

3.3 LU Factorisation
Naturally, Gaussian elimination is not realised by matrix multiplication, but by
programming the corresponding operations directly, as shown in Algorithm 3.
This leads to an O(n3 ) complexity. In addition, we have an O(n2 ) complexity
for solving the final linear system by back substitution.
However, there is another way of looking at the process. Suppose we may
construct a sequence of n−1 elementary lower triangular matrices L j = L j (m j )
such that
Ln−1 Ln−2 · · · L2 L1 A = U


is an upper triangular matrix. Since $L_i(m_i)^{-1} = L_i(-m_i)$, we have
$$\begin{aligned} A &= L_1(-m_1)L_2(-m_2)\cdots L_{n-2}(-m_{n-2})L_{n-1}(-m_{n-1})U\\ &= (I + m_1e_1^T)(I + m_2e_2^T)\cdots(I + m_{n-2}e_{n-2}^T)(I + m_{n-1}e_{n-1}^T)U\\ &= (I + m_{n-1}e_{n-1}^T + m_{n-2}e_{n-2}^T + \dots + m_2e_2^T + m_1e_1^T)U\\ &=: LU, \end{aligned}$$
where $L$ is lower triangular with unit diagonal elements,
$$L = I + \sum_{i=1}^{n-1} m_ie_i^T.$$

Definition 3.7 A matrix $L = (\ell_{ij}) \in \mathbb{R}^{n\times n}$ is called a normalised lower triangular matrix if $\ell_{ij} = 0$ for $i < j$ and $\ell_{ii} = 1$.
The LU factorisation of a matrix A is the decomposition A = LU into the
product of a normalised lower triangular matrix and an upper triangular matrix.
We will soon see that if an LU factorisation of a matrix A exists then it is
unique, which explains why we can speak of the LU factorisation.
Sometimes, the condition on L to be normalised is relaxed and A = LU with
a regular lower triangular matrix L and a regular upper triangular matrix U
is already called an LU factorisation. In this case, we will, however, lose the
uniqueness of the factorisation. We will also have uniqueness if we require U
to be normalised instead of L.
So far, we have assumed that Gaussian elimination does not break down, i.e. that we can divide by $a_{kk}^{(k)}$ for each $k$. It is now time to look at examples of
matrices where this condition is guaranteed.
Theorem 3.8 Let A ∈ Rn×n . Let A p ∈ R p×p be the pth principal sub-matrix
of A, i.e.
$$A_p = \begin{pmatrix} a_{11} & \dots & a_{1p}\\ \vdots & & \vdots\\ a_{p1} & \dots & a_{pp}\end{pmatrix}.$$
If $\det(A_p) \ne 0$ for $1 \le p \le n$ then $A$ has an LU factorisation. In particular,
every symmetric, positive definite matrix possesses an LU factorisation.
Proof We will prove this by induction. For $k = 1$ we have $\det A_1 = a_{11} \ne 0$, hence the first step in the Gaussian elimination procedure is possible. For $k \to k+1$ assume that our matrix is of the form as in (3.3). Then, as elementary matrix operations do not change the determinant of a matrix, we have
$$0 \ne \det A_k = \det A_k^{(k)} = a_{11}a_{22}^{(2)}\cdots a_{kk}^{(k)},$$


so that particularly $a_{kk}^{(k)} \ne 0$. Hence, the next step is also possible and we can proceed in this way.
A symmetric, positive definite matrix obviously satisfies $\det A_p > 0$ for all $1 \le p \le n$. $\square$

Another class of matrices with an LU factorisation is the following one.

Definition 3.9 A matrix $A$ is called strictly row diagonally dominant if
$$\sum_{\substack{k=1\\ k\ne i}}^{n}|a_{ik}| < |a_{ii}|, \qquad 1 \le i \le n.$$

Theorem 3.10 A strictly row diagonally dominant matrix A ∈ Rn×n is invert-


ible and possesses an LU factorisation.

Proof Since $|a_{11}| > \sum_{j=2}^{n}|a_{1j}| \ge 0$, the first step of Gaussian elimination is possible. To use once again an inductive argument, it obviously suffices to show that the new matrix $A^{(2)} = (I - m_1e_1^T)A$ is also strictly row diagonally dominant. Since the first row of $A^{(2)}$ coincides with that of $A$, we only have to look at the rows with index $i = 2, \dots, n$. For such an $i$, we have $a_{i1}^{(2)} = 0$ and
$$a_{ij}^{(2)} = a_{ij} - \frac{a_{i1}}{a_{11}}a_{1j}, \qquad 2 \le j \le n.$$
This leads to
$$\begin{aligned} \sum_{\substack{j=1\\ j\ne i}}^{n}|a_{ij}^{(2)}| = \sum_{\substack{j=2\\ j\ne i}}^{n}|a_{ij}^{(2)}| &\le \sum_{\substack{j=2\\ j\ne i}}^{n}|a_{ij}| + \frac{|a_{i1}|}{|a_{11}|}\sum_{\substack{j=2\\ j\ne i}}^{n}|a_{1j}|\\ &< |a_{ii}| - |a_{i1}| + \frac{|a_{i1}|}{|a_{11}|}\bigl\{|a_{11}| - |a_{1i}|\bigr\}\\ &= |a_{ii}| - \frac{|a_{i1}|}{|a_{11}|}|a_{1i}| \le |a_{ii}^{(2)}|. \end{aligned}$$
Thus, $A^{(2)}$ is indeed strictly row diagonally dominant and hence $a_{22}^{(2)} \ne 0$, so that we can proceed with the Gaussian elimination procedure. Continuing like this, we see that all elements $a_{kk}^{(k)}$ are different from zero, meaning that Gaussian elimination goes through all the way. It also means that the upper triangular matrix $A^{(n)}$ is non-singular and hence $A$, too, is non-singular. $\square$

We have used Gaussian elimination to compute the LU factorisation of a matrix


A. However, this might not be the only way of doing this. Hence, the question
of whether the LU factorisation is unique comes up naturally. To answer this
question, we need a result on triangular matrices.


Theorem 3.11 The set of non-singular (normalised) lower (or upper) trian-
gular matrices is a group with respect to matrix multiplication.

Proof If A and B are matrices of one of these sets, we have to show that also
A−1 and AB belong to that particular set. Suppose A is a non-singular upper
triangular matrix and ek is, as usual, the kth unit vector in Rn . Let x(k) be the
solution of Ax(k) = ek . Then, the columns of A−1 are given by these x(k) . From
Algorithm 2 we can inductively conclude that the components $x_{k+1}^{(k)}, \dots, x_n^{(k)}$ are zero and that $x_k^{(k)} = 1/a_{kk}$. This shows that $A^{-1}$ is also an upper triangular matrix. If $A$ is normalised then this shows that $A^{-1}$ is normalised, as well.
Next, assume that $B$ is another non-singular upper triangular matrix. Since $a_{ij} = b_{ij} = 0$ for $i > j$ we can conclude that
$$(AB)_{ij} = \sum_{k=i}^{j} a_{ik}b_{kj}, \qquad 1 \le i, j \le n,$$

which shows that AB is also an upper triangular matrix. If both A and B are
normalised the above formula immediately shows that AB is also normalised.
This finishes the proof for (normalised) upper triangular matrices. For (nor-
malised) lower triangular matrices the proof is essentially the same. 

Theorem 3.12 If the invertible matrix A has an LU factorisation then it is


unique.

Proof Suppose A = L1 U1 and A = L2 U2 are two LU factorisations of a


non-singular matrix A. Then, we have

L2−1 L1 = U2 U1−1 .

Furthermore, L2−1 L1 is, as the product of two normalised lower triangular ma-
trices, a normalised lower triangular matrix itself. However, U2 U1−1 is an upper
triangular matrix. Since we have equality between these matrices, this is possi-
ble only if they are the identity matrix, which shows L1 = L2 and U1 = U2 . 

As mentioned above, Algorithm 3 already overwrites A with all the information


required for the LU factorisation. To be more precise, if we leave out step 5 in
Algorithm 3 then we have an algorithm for computing the LU factorisation in
situ, because we have at the end
$$\ell_{ij} = \begin{cases} a_{ij} & \text{if } i > j,\\ \delta_{ij} & \text{if } i \le j,\end{cases} \qquad u_{ij} = \begin{cases} a_{ij} & \text{if } i \le j,\\ 0 & \text{if } i > j.\end{cases}$$

There are, however, other ways of computing the LU factorisation. One other


possible way of deriving an algorithm for calculating the LU factorisation starts with the component-wise formulation of $A = LU$, i.e.
$$a_{ij} = \sum_{k=1}^{n}\ell_{ik}u_{kj}.$$
Since $U$ is upper triangular and $L$ is lower triangular, the upper limit of the sum is actually given by $\min(i, j)$. Taking the two cases separately gives
$$a_{ij} = \sum_{k=1}^{j}\ell_{ik}u_{kj}, \qquad 1 \le j < i \le n,$$
$$a_{ij} = \sum_{k=1}^{i}\ell_{ik}u_{kj}, \qquad 1 \le i \le j \le n.$$
Rearranging these equations, and using the fact that $\ell_{ii} = 1$ for all $1 \le i \le n$, we find that
$$\ell_{ij} = \frac{1}{u_{jj}}\Bigl(a_{ij} - \sum_{k=1}^{j-1}\ell_{ik}u_{kj}\Bigr), \qquad 2 \le i \le n,\ 1 \le j \le i-1, \tag{3.5}$$
$$u_{ij} = a_{ij} - \sum_{k=1}^{i-1}\ell_{ik}u_{kj}, \qquad 1 \le i \le n,\ i \le j \le n. \tag{3.6}$$
Thus, the elements of $U$ in the first row are $u_{1j} = a_{1j}$, $1 \le j \le n$, and the elements of $L$ in the first column are $\ell_{11} = 1$ and $\ell_{i1} = a_{i1}/u_{11}$, $2 \le i \le n$.
The equations (3.5) and (3.6) can now be used for the calculation of the elements $\ell_{ij}$ and $u_{ij}$. For each value of $i$, starting with $i = 2$, we calculate first $\ell_{ij}$ for $1 \le j \le i-1$ in increasing order, and then the values of $u_{ij}$ for $i \le j \le n$, again in increasing order. We then move on to the same calculation for $i+1$ and so on until $i = n$. We refrain from giving the easily derived algorithm, but state the obvious result on the computational complexity.
Remark 3.13 The computational cost of computing the LU factorisation of
a matrix A ∈ Rn×n is O(n3 ).
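Since the algorithm based on (3.5) and (3.6) is left to the reader, here is one possible Python/NumPy sketch (function name and interface are our own); it assumes that $A$ admits an LU factorisation, i.e. that no $u_{jj}$ becomes zero.

```python
import numpy as np

def lu_doolittle(A):
    """LU factorisation A = LU via (3.5) and (3.6), with L normalised."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L, U = np.eye(n), np.zeros((n, n))
    U[0, :] = A[0, :]                 # first row of U
    L[1:, 0] = A[1:, 0] / U[0, 0]     # first column of L
    for i in range(1, n):
        for j in range(1, i):         # (3.5): l_ij for j < i
            L[i, j] = (A[i, j] - L[i, :j] @ U[:j, j]) / U[j, j]
        for j in range(i, n):         # (3.6): u_ij for j >= i
            U[i, j] = A[i, j] - L[i, :i] @ U[:i, j]
    return L, U

A = np.array([[4., 3., 2.], [2., 4., 1.], [2., 1., 5.]])
L, U = lu_doolittle(A)
assert np.allclose(L @ U, A)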
We want to address a few applications of the LU factorisation. The first one is
rather obvious.
Remark 3.14 If A = LU then the linear system b = Ax = LUx can be solved
in two steps. First, we solve Ly = b by forward substitution and then Ux = y
by back substitution. Both are possible in O(n2 ) time.
This means in particular that we could compute the inverse of a matrix A with
LU factorisation by solving the n linear systems Ax( j) = e j , 1 ≤ j ≤ n, where


e j is the jth unit vector in Rn . Solving these n systems would cost O(n3 ) time
and hence the computation of A−1 would cost O(n3 ) time.
Remark 3.15 A numerically reasonable way of calculating the determinant
of a matrix A is to first compute the LU factorisation and then use the fact that
det(A) = det(LU) = det(L) det(U) = det(U) = u11 u22 · · · unn .
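As a small illustration of Remark 3.15, and assuming the hypothetical lu_doolittle sketch from above, the determinant can be read off the diagonal of U; with partial pivoting one would additionally multiply by $(-1)$ for every row exchange.

```python
import numpy as np

def det_via_lu(A):
    # assumes A admits an LU factorisation without pivoting,
    # and uses the lu_doolittle sketch given earlier
    _, U = lu_doolittle(A)
    return float(np.prod(np.diag(U)))
```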
In the case of a tridiagonal matrix, there is an efficient way of constructing the
LU factorisation, at least under certain additional assumptions. Let the tridiag-
onal matrix be given by
$$A = \begin{pmatrix} a_1 & c_1 & & & 0\\ b_2 & a_2 & c_2 & &\\ & \ddots & \ddots & \ddots &\\ & & \ddots & \ddots & c_{n-1}\\ 0 & & & b_n & a_n \end{pmatrix}. \tag{3.7}$$
Then, we have the following theorem.
Theorem 3.16 Assume $A$ is a tridiagonal matrix of the form (3.7) with
$$|a_1| > |c_1| > 0, \qquad |a_i| \ge |b_i| + |c_i|, \quad b_i, c_i \ne 0, \quad 2 \le i \le n-1, \qquad |a_n| \ge |b_n| > 0.$$
Then, $A$ is invertible and has an LU factorisation of the form
$$A = \begin{pmatrix} 1 & & & 0\\ \ell_2 & 1 & &\\ & \ddots & \ddots &\\ 0 & & \ell_n & 1 \end{pmatrix}\begin{pmatrix} u_1 & c_1 & & 0\\ & u_2 & \ddots &\\ & & \ddots & c_{n-1}\\ 0 & & & u_n \end{pmatrix}.$$
The vectors $\ell \in \mathbb{R}^{n-1}$ and $u \in \mathbb{R}^n$ can be computed as follows: $u_1 = a_1$ and $\ell_i = b_i/u_{i-1}$ and $u_i = a_i - \ell_ic_{i-1}$ for $2 \le i \le n$.
Proof We show inductively for $1 \le i \le n-1$ that $u_i \ne 0$ and $|c_i/u_i| < 1$. From this, it follows that both $\ell$ and $u$ are well defined. By assumption, we have for $i = 1$ obviously $|u_1| = |a_1| > |c_1| > 0$. The induction step then follows from
$$|u_{i+1}| \ge |a_{i+1}| - |b_{i+1}|\frac{|c_i|}{|u_i|} > |a_{i+1}| - |b_{i+1}| \ge |c_{i+1}|.$$
This calculation also shows that $u_n \ne 0$, proving that $A$ is invertible, provided the method really gives the LU factorisation of $A$. To see the latter, we first notice that for the given $L$ and $U$ their product $LU$ must be a tridiagonal matrix. Simple calculations then show the statement. For example, we have $(LU)_{i,i} = \ell_ic_{i-1} + u_i = a_i$. The other two cases $(LU)_{i,i+1}$ and $(LU)_{i+1,i}$ follow just as easily. $\square$

Note that for such a system not only the calculation of the LU factorisation
simplifies but also the solution of the corresponding systems by forward and
back substitution. This means that in this case it is possible to solve the linear
system Ax = b in O(n) time.

Example 3.17 For our second example in the introduction, Section 1.1.2, we see that the matrix from (1.2) satisfies the assumption of Theorem 3.16. Here, the situation is even simpler since the entries in each band are constant, i.e. if we ignore the $1/h^2$ factor then we have $a_i = 2$ and $b_i = c_i = -1$. Thus, the computation reduces to $u_1 = a_1 = 2$ and then $\ell_i = b_i/u_{i-1} = -1/u_{i-1}$, $2 \le i \le n$. Inserting this into the computation of $u_i$ yields
$$u_i = a_i - \ell_ic_{i-1} = 2 + \ell_i = 2 - \frac{1}{u_{i-1}}, \qquad 2 \le i \le n,$$
which resolves to $u = \bigl(2, \tfrac32, \tfrac43, \tfrac54, \dots, \tfrac{n+1}{n}\bigr)^T$, or $u_i = \tfrac{i+1}{i}$, which also determines $\ell_i = -1/u_{i-1} = -(i-1)/i$.
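The recursion of Theorem 3.16, combined with the specialised forward and back substitution, gives the usual $O(n)$ tridiagonal solver. A Python sketch (names are ours) that also reproduces the values $u_i = (i+1)/i$ of Example 3.17 could look as follows.

```python
import numpy as np

def tridiag_lu_solve(a, b, c, rhs):
    """Solve Ax = rhs for the tridiagonal A of (3.7) in O(n) time.

    a: diagonal (n,), b: sub-diagonal (n,), b[0] unused,
    c: super-diagonal (n,), c[-1] unused.
    """
    n = len(a)
    u, ell, y, x = np.empty(n), np.empty(n), np.empty(n), np.empty(n)
    u[0], y[0] = a[0], rhs[0]
    for i in range(1, n):                     # LU factorisation of Theorem 3.16
        ell[i] = b[i] / u[i - 1]
        u[i] = a[i] - ell[i] * c[i - 1]
        y[i] = rhs[i] - ell[i] * y[i - 1]     # forward substitution Ly = rhs
    x[n - 1] = y[n - 1] / u[n - 1]
    for i in range(n - 2, -1, -1):            # back substitution Ux = y
        x[i] = (y[i] - c[i] * x[i + 1]) / u[i]
    return x, u

n = 6
a, b, c = 2.0 * np.ones(n), -1.0 * np.ones(n), -1.0 * np.ones(n)
x, u = tridiag_lu_solve(a, b, c, np.ones(n))
assert np.allclose(u, (np.arange(n) + 2) / (np.arange(n) + 1))  # u_i = (i+1)/i
```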

3.4 Pivoting
We now want to address the situation that the Gaussian elimination process breaks down prematurely. This means that there is an index $k$ with $a_{kk}^{(k)} = 0$. For example, let us consider the simple matrix
$$A = \begin{pmatrix} 0 & 1\\ 1 & 1\end{pmatrix}.$$

This matrix is non-singular and well-conditioned with a condition number $\kappa_2(A) = (3 + \sqrt5)/2$ in the 2-norm. But Gaussian elimination fails already
at its first step, since a11 = 0. Note, however, that a simple exchanging of the
rows gives an upper triangular matrix. Furthermore, a simple exchanging of
the columns leads to a lower triangular matrix. The first action corresponds to
rearranging the equations in the linear system, the latter corresponds to rear-
ranging the unknowns.
Hence, rearranging either the rows or columns of the matrix (or both) seems to
resolve the problem of a premature breakdown.


Next, let us consider the slightly perturbed matrix
$$A = \begin{pmatrix} 10^{-20} & 1\\ 1 & 1\end{pmatrix}.$$
Using Gaussian elimination on this matrix gives the factors
$$L = \begin{pmatrix} 1 & 0\\ 10^{20} & 1\end{pmatrix}, \qquad U = \begin{pmatrix} 10^{-20} & 1\\ 0 & 1 - 10^{20}\end{pmatrix}.$$
However, we are using floating point arithmetic with machine precision $\mathrm{eps} \approx 10^{-16}$. The number $1 - 10^{20}$ will not be represented exactly, but by its nearest floating point number; let us say that this is the number $-10^{20}$. Using this number produces the factorisation
$$\tilde L = \begin{pmatrix} 1 & 0\\ 10^{20} & 1\end{pmatrix}, \qquad \tilde U = \begin{pmatrix} 10^{-20} & 1\\ 0 & -10^{20}\end{pmatrix}.$$
The matrix $\tilde U$ is relatively close to the correct $U$, but on calculating the product we obtain
$$\tilde L\tilde U = \begin{pmatrix} 10^{-20} & 1\\ 1 & 0\end{pmatrix}.$$

We see that this matrix is not close to $A$, since a 1 has been replaced by a 0. If we were to solve $Ax = b$ with $b = (1, 0)^T$ using floating point arithmetic, we would obtain $\tilde x = (0, 1)^T$, while the true answer is approximately $(-1, 1)^T$. The explanation for this is that LU factorisation is not backward-stable.
A proof of the following result can again be found in Higham [85] or Trefethen
and Bau [124]. It shows that Gaussian elimination is backward-stable if the
norms of the triangular matrices L and U can be bounded. Unfortunately, as
we have just seen, this may not be the case.

Remark 3.18 Let A ∈ Rn×n have an LU factorisation. Then, Gaussian elimi-


nation computes matrices L̃, Ũ satisfying L̃Ũ = A + ΔA, where ΔA satisfies

$$\|\Delta A\| = O(\|L\|\,\|U\|\,\mathrm{eps}).$$

But the important conclusion here is that exchanging rows or columns (or both) is also necessary to stabilise the elimination process if the element $a_{kk}^{(k)}$ is close to zero.
The exchange of rows or columns during the Gaussian elimination process is
referred to as pivoting. Assume we have performed k − 1 steps and now have a


matrix of the form
$$A^{(k)} = \begin{pmatrix} a_{11} & * & \cdots & * & \cdots & *\\ 0 & a_{22}^{(2)} & & * & \cdots & *\\ \vdots & & \ddots & & & \vdots\\ 0 & 0 & & a_{kk}^{(k)} & \cdots & a_{kn}^{(k)}\\ \vdots & \vdots & & \vdots & & \vdots\\ 0 & 0 & & a_{nk}^{(k)} & \cdots & a_{nn}^{(k)} \end{pmatrix}. \tag{3.8}$$

If $\det A = \det A^{(k)} \ne 0$, not all the entries in the column vector $(a_{kk}^{(k)}, \dots, a_{nk}^{(k)})^T$ can be zero. Hence, we can pick one entry and swap the corresponding row with the $k$th row. This is usually referred to as partial (row) pivoting. For numerical reasons it is sensible to pick a row index $\ell \in \{k, k+1, \dots, n\}$ with
$$|a_{\ell k}^{(k)}| = \max_{k \le i \le n}|a_{ik}^{(k)}|.$$
In a similar way, partial column pivoting can be defined. Finally, we could use total pivoting, i.e. we could pick indices $\ell, m \in \{k, k+1, \dots, n\}$ such that
$$|a_{\ell m}^{(k)}| = \max_{k \le i, j \le n}|a_{ij}^{(k)}|.$$

Usually, partial row pivoting is implemented. This, however, depends on the


way the programming language used stores matrices. If a matrix is stored row
by row then swapping two rows can be achieved by simply swapping two row
pointers. An exchange of two columns could only be realised by really ex-
changing all entries in these two columns which is much more expensive.
To understand the process of pivoting in an abstract form, we once again use
matrices to describe the operations. Once again, this is only for theoretical
purposes and not for a numerical implementation. The required matrices to
describe pivoting are permutation matrices.
Definition 3.19 A permutation of the set {1, . . . , n} is a bijective map p :
{1, . . . , n} → {1, . . . , n}. A matrix P ∈ Rn×n is called a permutation matrix if
there is a permutation p such that
Pe j = e p( j) , 1 ≤ j ≤ n,
where e j denotes once again the jth unit vector.


A permutation matrix is simply a matrix that results from the identity matrix by
exchanging its rows (or columns). This means particularly that a permutation
matrix is invertible.
Theorem 3.20 Every permutation matrix $P \in \mathbb{R}^{n\times n}$ is orthogonal, i.e. $P^{-1} = P^T$. Let $P = P_p \in \mathbb{R}^{n\times n}$ be a permutation matrix with permutation $p$. If $p^{-1}$ is the permutation inverse to $p$, i.e. if $p^{-1}(p(j)) = j$ for $1 \le j \le n$, then $P_{p^{-1}} = P_p^{-1}$.
Proof We start with the second statement. This is true since we have
$$P_{p^{-1}}P_pe_j = P_{p^{-1}}e_{p(j)} = e_{p^{-1}(p(j))} = e_j$$
for $1 \le j \le n$. The first statement follows, for $1 \le j, k \le n$, from
$$e_{p^{-1}(j)}^TP_p^Te_k = e_k^TP_pe_{p^{-1}(j)} = e_k^Te_j = \delta_{jk} = e_{p^{-1}(j)}^Te_{p^{-1}(k)} = e_{p^{-1}(j)}^TP_p^{-1}e_k. \qquad\square$$
For partial pivoting we need to exchange exactly two rows in each step. This
is realised using elementary permutation matrices.
Definition 3.21 A permutation matrix Pi j ∈ Rn×n is called an elementary
permutation matrix if it is of the form
Pi j = I − (ei − e j )(ei − e j )T .
This means simply that the matrix Pi j is an identity matrix with rows (columns)
i and j exchanged.
Remark 3.22 An elementary permutation matrix has the properties
$$P_{ij}^{-1} = P_{ij} = P_{ji} = P_{ij}^T$$
and $\det(P_{ij}) = -1$ for $i \ne j$, as well as $P_{ii} = I$. Pre-multiplication of a matrix $A$ by $P_{ij}$


exchanges rows i and j of A. Similarly post-multiplication exchanges columns
i and j of A.
Theorem 3.23 Let A be an n × n matrix. There exist elementary lower trian-
gular matrices Li := Li (mi ) and elementary permutation matrices Pi := Pri i
with ri ≥ i, i = 1, 2, . . . , n − 1, such that
U := Ln−1 Pn−1 Ln−2 Pn−2 · · · L2 P2 L1 P1 A (3.9)
is upper triangular.
Proof This follows from the previous considerations. Note that, if the matrix is singular, we might come into a situation (3.8) where the column vector $(a_{kk}^{(k)}, \dots, a_{nk}^{(k)})^T$ is the zero vector. In such a situation we pick the corresponding elementary lower triangular and permutation matrices as the identity and proceed with the next column. $\square$


The permutation matrices in (3.9) can be “moved” to the right in the following sense. If $P$ is a symmetric permutation matrix which acts only on rows (or columns) with index larger than $j$, then we have
$$PL_jP^T = P(I - m_je_j^T)P^T = I - Pm_j(Pe_j)^T = I - \tilde m_je_j^T = \tilde L_j.$$
Obviously, the first $j$ components of $\tilde m_j$ are also vanishing, making $\tilde L_j$ again an elementary lower triangular matrix. This gives in particular
$$U = L_{n-1}P_{n-1}L_{n-2}P_{n-1}^TP_{n-1}P_{n-2}L_{n-3}\cdots L_2P_2L_1P_1A = L_{n-1}\tilde L_{n-2}P_{n-1}P_{n-2}L_{n-3}\cdots L_2P_2L_1P_1A.$$
In the same way, we can move $P_{n-1}P_{n-2}$ to the right of $L_{n-3}$. Continuing with this procedure leads to
$$U = L_{n-1}\tilde L_{n-2}\cdots\tilde L_1P_{n-1}P_{n-2}\cdots P_1A =: L^{-1}PA,$$
which establishes the following theorem.

Theorem 3.24 For every non-singular matrix A ∈ Rn×n there is a permuta-


tion matrix P such that PA possesses an LU factorisation PA = LU.

As mentioned above, permutation matrices are only used to describe the theory
mathematically. When it comes to the implementation, permutations are done
directly. For partial pivoting, this is described in Algorithm 4, which is a simple
modification of the original Algorithm 3.

Algorithm 4: LU-factorisation with partial pivoting


Input : A ∈ Rn×n , b ∈ Rn .
Output: PA = LU.
1 for k = 1, 2, . . . , n − 1 do
2 Find the pivot element akr for row k
3 Exchange row k with row r
4 d := 1/akk
5 for i = k + 1, . . . , n do
6 aik := aik d
7 for j = k + 1, . . . , n do
8 ai j := ai j − aik ak j
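A Python/NumPy sketch of Algorithm 4 (names are ours) could look as follows; it records the row exchanges in a permutation vector instead of forming the matrix P explicitly.

```python
import numpy as np

def lu_partial_pivoting(A):
    """In-situ LU factorisation with partial row pivoting, PA = LU.

    Returns the packed LU matrix (U in the upper triangle, the multipliers
    of L in the strict lower triangle) and the row permutation perm
    with (PA)[i, :] = A[perm[i], :].
    """
    A = np.asarray(A, dtype=float).copy()
    n = A.shape[0]
    perm = np.arange(n)
    for k in range(n - 1):
        r = k + np.argmax(np.abs(A[k:, k]))   # pivot row for column k
        if r != k:                            # exchange rows k and r
            A[[k, r], :] = A[[r, k], :]
            perm[[k, r]] = perm[[r, k]]
        if A[k, k] == 0.0:                    # column already eliminated
            continue
        A[k + 1:, k] /= A[k, k]
        A[k + 1:, k + 1:] -= np.outer(A[k + 1:, k], A[k, k + 1:])
    return A, perm

A = np.array([[0., 1.], [1., 1.]])
LU, perm = lu_partial_pivoting(A)
L, U = np.tril(LU, -1) + np.eye(2), np.triu(LU)
assert np.allclose(L @ U, A[perm, :])
```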


3.5 Cholesky Factorisation


Let us assume that A has an LU factorisation A = LU and that A is symmetric,
i.e. AT = A. Then we have

LU = A = AT = U T LT .

Since U T is a lower triangular matrix and LT is an upper triangular matrix,


we would like to use the uniqueness of the LU factorisation to conclude that
L = U T and U = LT . Unfortunately, this is not possible since the uniqueness
requires the lower triangular matrix to be normalised, i.e. to have only ones as
diagonal entries, and this may not be the case for U T .
But if A is invertible, then we can write

U = DŨ

with a diagonal matrix D = diag(d11 , . . . , dnn ) and a normalised upper triangu-


lar matrix Ũ. Then, we can conclude that A = LDŨ and hence AT = Ũ T DLT =
LDŨ, so that we can now apply the uniqueness result to derive L = Ũ T and
U = DLT , which gives A = LDLT .
Let us finally make the assumption that all diagonal entries of $D$ are positive, so that we can define its square root by setting $D^{1/2} = \operatorname{diag}(\sqrt{d_{11}}, \dots, \sqrt{d_{nn}})$, which leads to the decomposition
$$A = (LD^{1/2})(LD^{1/2})^T = \tilde L\tilde L^T.$$

Definition 3.25 A decomposition of a matrix A of the form A = LLT with a


lower triangular matrix L is called a Cholesky factorisation of A.

It remains to verify for what kind of matrices a Cholesky factorisation exists.

Theorem 3.26 Suppose A = AT is positive definite. Then, A possesses a


Cholesky factorisation.

Proof A matrix A is positive definite if and only if the determinant of every


upper principal sub-matrix A p is positive. From Theorem 3.8 we can therefore
conclude that A = LU is possible. Since a positive definite matrix is also invert-
ible, our previous derivation shows that A = LDLT . Finally, A being positive
definite also means that

dii = eTi Dei = eTi (L−1 AL−T )ei = (L−T ei )T A(L−T ei ) > 0,

where we have used the short notation L−T for (L−1 )T . 


For the numerical algorithm, it is again useful to write $A = LL^T$ component-wise, taking into account that $L$ is lower triangular. For $i \ge j$ we have
$$a_{ij} = \sum_{k=1}^{j}\ell_{ik}\ell_{jk} = \sum_{k=1}^{j-1}\ell_{ik}\ell_{jk} + \ell_{ij}\ell_{jj}.$$
This can first be resolved for $i = j$:
$$\ell_{jj} = \Bigl(a_{jj} - \sum_{k=1}^{j-1}\ell_{jk}^2\Bigr)^{1/2}.$$
With this, we can successively compute the other coefficients. For $i > j$ we have
$$\ell_{ij} = \frac{1}{\ell_{jj}}\Bigl(a_{ij} - \sum_{k=1}^{j-1}\ell_{ik}\ell_{jk}\Bigr).$$

With these equations it is now easy to successively compute the matrix L either
row-wise or column-wise. A column-oriented version of the Cholesky factori-
sation is given in Algorithm 5. Obviously, in a final implementation it would be
appropriate to check a j j for non-negativity before taking the square root in step
4. This is also a means to check whether the given matrix A is “numerically”
positive definite.

Remark 3.27 The computational complexity of computing the Cholesky fac-


torisation is given by n3 /6 + O(n2 ), which is about half the complexity of the
standard Gaussian elimination process. However, here, we also have to take n
roots.

Proof Leaving the n roots from line 4 of Algorithm 5 aside and counting one
flop for each of the lines 3, 7 and 8 yields
$$T(n) = \sum_{j=1}^{n}\Bigl(\sum_{k=1}^{j-1}1 + \sum_{i=j+1}^{n}\Bigl(1 + \sum_{k=1}^{j-1}1\Bigr)\Bigr).$$

After some simple calculations, similar to the ones we have done before, we
see that this leads indeed to T (n) = n3 /6 + O(n2 ). 

Remark 3.28 In contrast to Gaussian elimination, the computation of a Cholesky factorisation is backward-stable, see again [85, 124], as long as the norm of the matrix is not too large. The computed factor $\tilde L$ satisfies $\tilde L\tilde L^T = A + \Delta A$, where $\Delta A$ can be bounded by $\|\Delta A\| = O(\|A\|\,\mathrm{eps})$.


Algorithm 5: Cholesky factorisation


Input : A ∈ Rn×n symmetric, positive definite.
Output: A = LLT Cholesky factorisation.
1 for j = 1 to n do
2 for k = 1 to j − 1 do
3 a j j := a j j − a jk · a jk

4 a j j := √(a j j)
5 for i = j + 1 to n do
6 for k = 1 to j − 1 do
7 ai j := ai j − aik · a jk
8 ai j := ai j /a j j
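A compact Python/NumPy version of Algorithm 5 (our own transcription) is given below; it raises an error when a non-positive diagonal element appears, which is the “numerically positive definite” check mentioned above. For production use one would typically rely on a library routine such as numpy.linalg.cholesky or scipy.linalg.cho_factor/cho_solve.

```python
import numpy as np

def cholesky(A):
    """Cholesky factorisation A = L L^T of a symmetric positive definite A
    (column-oriented, as in Algorithm 5)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L = np.zeros((n, n))
    for j in range(n):
        d = A[j, j] - L[j, :j] @ L[j, :j]
        if d <= 0.0:
            raise ValueError("matrix is not (numerically) positive definite")
        L[j, j] = np.sqrt(d)
        L[j + 1:, j] = (A[j + 1:, j] - L[j + 1:, :j] @ L[j, :j]) / L[j, j]
    return L

A = np.array([[4., 2., 2.], [2., 5., 3.], [2., 3., 6.]])
L = cholesky(A)
assert np.allclose(L @ L.T, A)
```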

As we already know, solving triangular systems with backward and forward


substitution is also backward-stable. Hence, the solution process for solving
a linear system Ax = b with A being symmetric and positive definite using
the Cholesky factorisation will produce a reasonable solution as long as the
condition number of A is not too large.

3.6 QR Factorisation
We have seen that we can derive an LU factorisation for every matrix $A$ if we employ pivoting if necessary. The LU factorisation can be described by multiplying the original matrix $A$ with elementary lower triangular matrices from the left. Unfortunately, this might change the condition number $\kappa(A) = \|A\|\,\|A^{-1}\|$. Though we can only say that
$$\kappa(LA) = \|LA\|\,\|(LA)^{-1}\| \le \kappa(L)\kappa(A),$$
we have to expect $\kappa(LA) > \kappa(A)$ because $\kappa(L) \ge 1$. Hence, it would be better to use transformations which do not change the condition number. Candidates for this are orthogonal matrices, i.e. matrices $Q \in \mathbb{R}^{n\times n}$ with $Q^TQ = I$.

Proposition 3.29 Let $Q \in \mathbb{R}^{n\times n}$ be orthogonal and $A \in \mathbb{R}^{n\times n}$. Then $\|Qx\|_2 = \|x\|_2$ for all $x \in \mathbb{R}^n$ and $\|QA\|_2 = \|A\|_2$.
Proof We have
$$\|Qx\|_2^2 = (Qx)^T(Qx) = x^TQ^TQx = x^Tx = \|x\|_2^2.$$
Similarly,
$$\|QA\|_2 = \sup_{\|x\|_2=1}\|QAx\|_2 = \sup_{\|x\|_2=1}\|Ax\|_2 = \|A\|_2. \qquad\square$$

The goal of this section is to prove that every matrix $A \in \mathbb{R}^{m\times n}$, $m \ge n$, has a factorisation of the form
$$A = QR,$$
where $Q \in \mathbb{R}^{m\times m}$ is orthogonal and $R \in \mathbb{R}^{m\times n}$ is an upper triangular matrix, which, because of the non-square shape of $R$, takes the form
$$R = \begin{pmatrix} * & \cdots & *\\ & \ddots & \vdots\\ & & *\\ 0 & \cdots & 0\\ \vdots & & \vdots\\ 0 & \cdots & 0 \end{pmatrix}.$$

We will achieve this by employing specific, orthogonal matrices which are


geometrically reflections. It would, however, also be possible to use rotations,
which we will encounter later on.
Definition 3.30 A Householder matrix $H = H(w) \in \mathbb{R}^{m\times m}$ is a matrix of the form
$$H(w) = I - 2ww^T \in \mathbb{R}^{m\times m},$$
where $w \in \mathbb{R}^m$ satisfies either $\|w\|_2 = 1$ or $w = 0$.
A more general form of a Householder matrix is given by
$$H(w) = I - 2\frac{ww^T}{w^Tw},$$
for an arbitrary vector $w \ne 0$. Here, we find both the inner product $w^Tw \in \mathbb{R}$ and the outer product $ww^T \in \mathbb{R}^{m\times m}$. Householder matrices have quite a few
elementary but important properties, which we want to collect now.
Lemma 3.31 Let H = H(w) ∈ Rm×m be a Householder matrix.
1. H is symmetric.
2. HH = I so that H is orthogonal.
3. det(H(w)) = −1 if w ≠ 0.
4. Storing H(w) only requires storing the m elements of w.
5. The computation of the product Ha of H = H(w) with a vector a ∈ Rm
requires only O(m) time.


Proof The first property is clear. For the second note that wT w = 1 leads to

HH = (I − 2wwT )(I − 2wwT ) = I − 4wwT + 4(wwT )(wwT ) = I.

For the third property, we note that H has the eigenvalues λ = ±1. To be more
precise, we have Hw = −w so that w is the corresponding eigenvector to the
eigenvalue λ = −1. Every vector orthogonal to w is obviously an eigenvector
to the eigenvalue λ = 1. Hence, the eigenvalue λ = 1 has multiplicity (m − 1)
and the eigenvalue λ = −1 has multiplicity 1. Since the determinant is the
product of the eigenvalues, the determinant of H is −1.
The fourth property is clear again. For the last property we first compute the
number λ = aT w in O(m) time, then we multiply this number by each compo-
nent of w to form the vector λw and finally we subtract twice this vector from
the vector a. All of this can be done in O(m) time. 
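In code, property 5 is simply the observation that one never forms the $m\times m$ matrix $H(w)$ explicitly; a small NumPy sketch (our own illustration) is:

```python
import numpy as np

def apply_householder(w, a):
    """Compute H(w) a = a - 2 (w^T a) w in O(m) time, assuming ||w||_2 = 1 or w = 0."""
    return a - 2.0 * (w @ a) * w

m = 5
w = np.random.rand(m); w /= np.linalg.norm(w)
a = np.random.rand(m)
H = np.eye(m) - 2.0 * np.outer(w, w)          # dense matrix only for checking
assert np.allclose(apply_householder(w, a), H @ a)
```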

Figure 3.1 demonstrates, in the case of $\mathbb{R}^2$, that a Householder matrix represents a reflection. To be more precise, suppose we want to find a reflection, i.e. an orthogonal matrix $H$ which maps the vector $a \in \mathbb{R}^2$ onto the vector $\alpha e_1$ with $\alpha$ to be specified. First of all, because of
$$|\alpha| = \|\alpha e_1\|_2 = \|Ha\|_2 = \|a\|_2,$$
we see that $\alpha = \pm\|a\|_2$. Next, if $S_w$ denotes the bisector of the angle between $a$ and $e_1$ and $w$ its orthogonal vector, then we can express the transformation via
$$Ha = a - 2(w^Ta)w = (I - 2ww^T)a.$$

Formalising this fact yields the following result.

Theorem 3.32 For every vector $a \in \mathbb{R}^m$, there is a vector $w \in \mathbb{R}^m$ with $w = 0$ or $\|w\|_2 = 1$, such that $H(w)a = \|a\|_2e_1$.
Proof There is nothing to show if $a = 0$ or $a = \|a\|_2e_1$, as we can simply pick $w = 0$ in these cases. Otherwise, let us define $v = a - \|a\|_2e_1$ and $w = v/\|v\|_2$. Then,
$$1 = w^Tw = \frac{v^Tv}{\|v\|_2^2} = \frac{(a - \|a\|_2e_1)^T(a - \|a\|_2e_1)}{\|v\|_2^2} = \frac{a^Ta - 2\|a\|_2a^Te_1 + \|a\|_2^2}{\|v\|_2^2} = \frac{2a^T(a - \|a\|_2e_1)}{\|v\|_2^2} = \frac{2a^Tv}{\|v\|_2^2} = \frac{2a^Tw}{\|v\|_2},$$
such that $\|v\|_2 = 2a^Tw$. Hence, the application of $H(w)$ to $a$ becomes
$$H(w)a = a - 2(a^Tw)w = a - \|v\|_2\frac{v}{\|v\|_2} = a - v = \|a\|_2e_1. \qquad\square$$


Figure 3.1 Householder transformation.

Note that we could also project onto $-e_1$ by using $v = a + \|a\|_2e_1$. In general, one should pick the sign in such a way that numerical cancellation is avoided, i.e. one should use $v = a + \operatorname{sign}(a_1)\|a\|_2e_1$.

Theorem 3.33 (QR factorisation) For every A ∈ Rm×n , m ≥ n, there exist an


orthogonal matrix Q ∈ Rm×m and an upper triangular matrix R ∈ Rm×n such
that A = QR.

Proof Let $a_1 \in \mathbb{R}^m$ be the first column of $A$. We can find a Householder matrix $H_1 \in \mathbb{R}^{m\times m}$ such that
$$H_1A = \begin{pmatrix} \alpha_1 & * & \dots & *\\ 0 & & &\\ \vdots & & A_1 &\\ 0 & & & \end{pmatrix}$$
with an $(m-1)\times(n-1)$ matrix $A_1$. Considering the first column $\tilde a_2 \in \mathbb{R}^{m-1}$ of this matrix, we can find a second Householder matrix $\tilde H_2 \in \mathbb{R}^{(m-1)\times(m-1)}$ such that $\tilde H_2\tilde a_2 = \alpha_2e_1 \in \mathbb{R}^{m-1}$. The matrix
$$H_2 = \begin{pmatrix} 1 & 0 & \dots & 0\\ 0 & & &\\ \vdots & & \tilde H_2 &\\ 0 & & & \end{pmatrix}$$
is easily seen to be orthogonal and we derive
$$H_2H_1A = \begin{pmatrix} \alpha_1 & * & \dots & & *\\ 0 & \alpha_2 & * & \dots & *\\ 0 & 0 & & &\\ \vdots & \vdots & & A_2 &\\ 0 & 0 & & & \end{pmatrix},$$
and we can proceed in this form to establish
$$H_{n-1}H_{n-2}\cdots H_2H_1A = R,$$
where $R$ is an upper triangular matrix. Since all $H_j$ are orthogonal, so is their product and also their inverse. $\square$
Note that the proof is actually constructive and describes a numerical method
for deriving the QR factorisation of a matrix A. Moreover, if we take Lemma
3.31 into account we can already determine the cost.
Remark 3.34 It takes O(mn2 ) time to compute the QR factorisation of an
m × n matrix with m ≥ n.
Proof In the first step we have to apply the matrix H1 to the n columns of the
matrix A. Each of these applications can be done in O(m) time such that the
first step of the procedure requires O(mn) time. Since we have to apply n − 1
of these steps, we see that the cost is bounded by O(mn2 ). 
It is not entirely possible to program the QR factorisation in situ, since the
space of the matrix A is not enough to store all relevant information. Though
we can store the R part in the upper triangular part of A we also need to store
the relevant information on the transformations H j = I −c j v j vTj , i.e. the vectors
v j = (v j j , . . . , vm j )T , which form a lower triangular matrix, and the coefficients
c j . We can store the vectors v j in the lower triangular part of A except for v j j ,
which we can store in an additional vector v. For the square case m = n, this is
given in detail in Algorithm 6.

In the case of the LU factorisation, we have seen that it might be necessary


to use pivoting to continue the process all the way down. In contrast to this, a
QR factorisation can always be computed without pivoting even if the matrix
A is singular. We actually can compute it, even if the matrix is not square.
Nonetheless, pivoting can improve the numerical stability.
As in the case of the LU factorisation (Theorem 3.12) the factors in the QR
factorisation are unique. However, we have to make some additional assump-
tions.


Algorithm 6: QR factorisation
Input : A ∈ Rn×n .
Output: QR factorisation of A.
1 for j = 1 to n − 1 do
2 β := 0
3 for k = j to n do β := β + ak j · ak j

4 α := √β
5 if α = 0 then v j := 0
6 else
7 c j := 1/(β + α · |a j j |)
8 if a j j < 0 then α := −α
9 v j := a j j + α
10 a j j := −α
11 for k = j + 1 to n do
12 sum := v j · a jk
13 for i = j + 1 to n do sum := sum + ai j · aik
14 sum := sum · c j
15 a jk := a jk − v j · sum
16 for i = j + 1 to n do aik := aik − ai j · sum
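For comparison, here is a hedged, dense NumPy version of Householder QR that forms Q explicitly rather than storing the vectors $v_j$ in situ as Algorithm 6 does; in practice one would rather call numpy.linalg.qr.

```python
import numpy as np

def householder_qr(A):
    """QR factorisation A = QR via Householder reflections (Theorem 3.33)."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    Q, R = np.eye(m), A.copy()
    for j in range(min(n, m - 1)):
        a = R[j:, j]
        v = a.copy()
        # v = a + sign(a_1) ||a|| e_1 to avoid cancellation
        v[0] += np.sign(a[0]) * np.linalg.norm(a) if a[0] != 0 else np.linalg.norm(a)
        norm_v = np.linalg.norm(v)
        if norm_v == 0.0:              # column already zero below the diagonal
            continue
        w = v / norm_v
        # apply H = I - 2ww^T from the left to the trailing block, accumulate Q
        R[j:, j:] -= 2.0 * np.outer(w, w @ R[j:, j:])
        Q[:, j:] -= 2.0 * np.outer(Q[:, j:] @ w, w)
    return Q, R

A = np.random.rand(6, 4)
Q, R = householder_qr(A)
assert np.allclose(Q @ R, A) and np.allclose(Q.T @ Q, np.eye(6))
```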

Theorem 3.35 The factorisation A = QR of a non-singular matrix A ∈ Rn×n


into the product of an orthogonal matrix Q ∈ Rn×n and an upper triangular
matrix R ∈ Rn×n is unique, if the signs of the diagonal elements of R are pre-
scribed.

Proof Assume there are two such factorisations A = Q1 R1 = Q2 R2 with or-


thogonal matrices Q1 , Q2 and upper triangular matrices R1 , R2 . Then, we have
$S := Q_2^{-1}Q_1 = R_2R_1^{-1}$. This is an identity between an orthogonal matrix on the
one hand and an upper triangular matrix on the other. Hence, S is an orthogo-
nal upper triangular matrix and since S −1 = S T it must be a diagonal matrix.
But a diagonal orthogonal matrix can only have ±1 as diagonal entries. Thus, if
the diagonal entries of R1 and R2 have the same sign, we must have S = I. 

Remark 3.36 The QR factorisation is backward-stable. The computed fac-


tors Q̃ and R̃ satisfy Q̃R̃ = A + ΔA with ΔA satisfying

$$\|\Delta A\| = O(\|A\|\,\mathrm{eps}).$$


3.7 Schur Factorisation


We want to use Householder transformations to derive yet another factorisa-
tion. Though its derivation is algorithmic again, this factorisation will only be
of interest to us from a theoretical point of view. However, here we need to
discuss complex matrices.
Recall that a matrix $U \in \mathbb{C}^{n\times n}$ is called unitary if it satisfies $\overline U^TU = I$, where the overbar indicates the complex conjugate. A unitary matrix is thus the complex counterpart to an orthogonal matrix.
Moreover, we can introduce complex Householder matrices $H \in \mathbb{C}^{m\times m}$ simply by defining $H = H(w) = I - 2w\overline w^T$, where $w \in \mathbb{C}^m$ satisfies either $\|w\|_2^2 = \overline w^Tw = 1$ or $w = 0$. It is easy to see that a complex Householder matrix is unitary, that it satisfies $\overline H^T = H$, and that the results of the previous subsection remain true. In particular, we have the complex counterpart to Theorem 3.32.

Lemma 3.37 For every vector $a \in \mathbb{C}^m$, there is a vector $w \in \mathbb{C}^m$ with $w = 0$ or $\|w\|_2 = 1$ such that $H(w)a = \|a\|_2e_1$.

We may now use the above Lemma to obtain the following Theorem.

Theorem 3.38 (Schur factorisation) Let $A \in \mathbb{C}^{n\times n}$. There exists a unitary matrix $U \in \mathbb{C}^{n\times n}$ such that $R := \overline U^TAU$ is an upper triangular matrix.

Proof We will prove this by induction on n. Clearly the result holds for n = 1.
Hence, let us now assume that the result holds for all (n − 1) × (n − 1) matrices.
Since the characteristic polynomial p(λ) = det(A − λI) factorises over C, we
can find an eigenvalue λ and a corresponding eigenvector x ∈ Cn \ {0}, i.e. we
have Ax = λx.
By the previous lemma there exists a Householder matrix such that
$$H(w_1)x = c_1e_1 \qquad\text{or}\qquad x = c_1\overline{H(w_1)}^Te_1 = c_1H(w_1)e_1$$
with
$$|c_1| = \|x\|_2 \ne 0.$$
Using the above properties we have
$$H(w_1)AH(w_1)e_1 = H(w_1)A\frac{x}{c_1} = \frac{\lambda}{c_1}H(w_1)x = \lambda e_1.$$
We may write this in the form
$$H(w_1)AH(w_1) = \begin{pmatrix} \lambda & a^T\\ 0 & \tilde A\end{pmatrix}$$
with a vector $a \in \mathbb{C}^{n-1}$ and a matrix $\tilde A \in \mathbb{C}^{(n-1)\times(n-1)}$. By the induction hypothesis there is a unitary matrix $V \in \mathbb{C}^{(n-1)\times(n-1)}$ such that $\overline V^T\tilde AV = T$ is an upper triangular matrix. If we now define an $n\times n$ matrix by
$$U = H(w_1)\begin{pmatrix} 1 & 0^T\\ 0 & V\end{pmatrix}, \qquad \overline U^T = \begin{pmatrix} 1 & 0^T\\ 0 & \overline V^T\end{pmatrix}H(w_1),$$
then we immediately see that $U\overline U^T = I$, i.e. $U$ is unitary. Moreover, we have
$$\overline U^TAU = \begin{pmatrix} 1 & 0^T\\ 0 & \overline V^T\end{pmatrix}H(w_1)AH(w_1)\begin{pmatrix} 1 & 0^T\\ 0 & V\end{pmatrix} = \begin{pmatrix} 1 & 0^T\\ 0 & \overline V^T\end{pmatrix}\begin{pmatrix} \lambda & a^T\\ 0 & \tilde A\end{pmatrix}\begin{pmatrix} 1 & 0^T\\ 0 & V\end{pmatrix} = \begin{pmatrix} 1 & 0^T\\ 0 & \overline V^T\end{pmatrix}\begin{pmatrix} \lambda & a^TV\\ 0 & \tilde AV\end{pmatrix} = \begin{pmatrix} \lambda & a^TV\\ 0 & \overline V^T\tilde AV\end{pmatrix}.$$
Hence, $\overline U^TAU$ is also an upper triangular matrix, which completes the induction and the proof. $\square$
In the situation of a real matrix with complex eigenvalues, we cannot achieve this result using only orthogonal transformations. However, complex eigenvalues of a real matrix come in pairs, i.e. with $\lambda$ also the complex conjugate $\overline\lambda$ is an eigenvalue. Indeed, if $Ax = \lambda x$ with $x \ne 0$ then $\overline\lambda\overline x = \overline{Ax} = A\overline x$. This allows us to derive the following analogue to Theorem 3.38.
Theorem 3.39 (Real Schur factorisation) For every $A \in \mathbb{R}^{n\times n}$ there is an orthogonal matrix $Q \in \mathbb{R}^{n\times n}$ such that
$$Q^TAQ = \begin{pmatrix} R_{11} & R_{12} & \cdots & R_{1m}\\ 0 & R_{22} & \cdots & R_{2m}\\ \vdots & \ddots & \ddots & \vdots\\ 0 & \cdots & 0 & R_{mm}\end{pmatrix},$$
where the diagonal blocks Rii are either 1 × 1 or 2 × 2 matrices. A 1 × 1 block
corresponds to a real eigenvalue, a 2×2 block corresponds to a pair of complex
conjugate eigenvalues.
If $A$ has only real eigenvalues then it is orthogonally similar to an upper triangular matrix.
Proof Let $k$ be the number of pairs of complex conjugate eigenvalues. We will prove the result by induction on $k$. If $k = 0$ then the statement is obviously correct, since we can proceed as in the proof of Theorem 3.38 with real Householder matrices. Hence, let us assume that $k \ge 1$ and that the result is correct for $k-1$. Let $\lambda = \alpha + i\beta$ with $\beta \ne 0$ be a complex eigenvalue with corresponding eigenvector $z = x + iy$ with $x, y \in \mathbb{R}^n$. Then, $y \ne 0$, since the vector $Az$ would otherwise be a real vector equalling the complex vector $\lambda x$. Moreover, with $\lambda$ also $\overline\lambda$ is an eigenvalue of $A$ with corresponding eigenvector $\overline z$.
Next, we have $A(x + iy) = (\alpha + i\beta)(x + iy)$ and hence
$$Ax = \alpha x - \beta y, \qquad Ay = \beta x + \alpha y,$$
which we can write in matrix form as
$$A(x, y) = (x, y)\begin{pmatrix} \alpha & \beta\\ -\beta & \alpha\end{pmatrix}.$$
The real vectors $x, y$ must be linearly independent. Otherwise, $x = cy$ with $c \in \mathbb{R}\setminus\{0\}$ would mean that $z = (c + i)y$ and $\overline z = (c - i)y$ are linearly dependent, which contradicts the fact that they are eigenvectors to two distinct eigenvalues $\lambda$ and $\overline\lambda$, respectively. Thus, we can choose an orthonormal basis $\{x_1, x_2\}$ of $\operatorname{span}\{x, y\}$ and describe the coordinate transformation by a non-singular matrix $C \in \mathbb{R}^{2\times2}$, i.e. $(x_1, x_2) = (x, y)C$. With this, we can conclude that
$$A(x_1, x_2) = A(x, y)C = (x, y)\begin{pmatrix} \alpha & \beta\\ -\beta & \alpha\end{pmatrix}C = (x_1, x_2)C^{-1}\begin{pmatrix} \alpha & \beta\\ -\beta & \alpha\end{pmatrix}C =: (x_1, x_2)S.$$
Thus, the newly defined matrix $S$ also has the eigenvalues $\lambda = \alpha + i\beta$ and $\overline\lambda = \alpha - i\beta$. If $n = 2$ we are done. Otherwise, we now extend the vectors $\{x_1, x_2\}$ to a full orthonormal basis $\{x_1, \dots, x_n\}$ of $\mathbb{R}^n$ and define the orthogonal matrix $(x_1, \dots, x_n) =: (x_1, x_2, W)$ with $W = (x_3, \dots, x_n) \in \mathbb{R}^{n\times(n-2)}$ to see that
$$(x_1, x_2, W)^TA(x_1, x_2, W) = \begin{pmatrix} x_1^T\\ x_2^T\\ W^T\end{pmatrix}\bigl((x_1, x_2)S,\; AW\bigr) = \begin{pmatrix} S & (x_1, x_2)^TAW\\ 0 & W^TAW\end{pmatrix}.$$
The matrix $W^TAW$ has fewer than $k$ complex conjugate eigenvalue pairs. Hence, by the induction assumption, there is an orthogonal matrix $\tilde Q \in \mathbb{R}^{(n-2)\times(n-2)}$ so that the matrix $\tilde Q^T(W^TAW)\tilde Q$ has the desired form. Finally, the orthogonal matrix
$$Q := (x_1, \dots, x_n)\begin{pmatrix} I & 0\\ 0 & \tilde Q\end{pmatrix},$$
with $I$ being the $2\times2$ identity matrix, is the required orthogonal matrix, which transforms $A$ to the described quasi-triangular form. $\square$


3.8 Solving Least-Squares Problems


We now want to deal with the problem of solving an over-determined system
of the form
Ax = b, (3.10)

where A ∈ Rm×n and b ∈ Rm are given and m > n. The latter means that we have
more equations than unknowns, which makes the system over-determined.
Hence, we cannot expect a solution to (3.10) and may instead try to change
the problem to solving the least-squares problem

$$\min_{x\in\mathbb{R}^n}\|Ax - b\|_2. \tag{3.11}$$

For the time being, we will assume that the rank of the matrix $A$ is $\operatorname{rank}(A) = n$, i.e. the matrix has full rank. Then, to solve (3.11) we can look for critical points of the function
$$f(x) := \|Ax - b\|_2^2 = x^TA^TAx - 2b^TAx + b^Tb.$$
Setting the gradient to zero yields
$$0 = \nabla f(x) = 2A^TAx - 2A^Tb.$$
Since $\operatorname{rank}(A) = n$, we also have $\operatorname{rank}(A^TA) = n$ (see (1.6)), which means that the matrix $A^TA \in \mathbb{R}^{n\times n}$ is invertible. Thus we have in principle the following result.

Theorem 3.40 (Normal Equations) The solution of the least-squares problem


(3.11) with a matrix A ∈ Rm×n with m ≥ n = rank(A) is the unique solution of
the Normal Equations
(AT A)x = AT b. (3.12)

Proof The solution to the normal equations is indeed a critical point. Fur-
thermore, since the Hessian of the function f is given by the positive definite
matrix AT A, we have a local minimum, which is actually a global one, as it is
the only critical point. 

The normal equations are not a good way of computing the solution to the
least-squares problem since the involved matrix AT A is usually badly condi-
tioned.
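A quick numerical illustration of this conditioning issue (the example matrix is our own choice): since $\kappa_2(A^TA) = \kappa_2(A)^2$ for a full-rank $A$, squaring the condition number can turn a mildly ill-conditioned least-squares problem into a much more delicate linear system.

```python
import numpy as np

# a mildly ill-conditioned 100 x 5 Vandermonde-type matrix
t = np.linspace(0.0, 1.0, 100)
A = np.vander(t, 5, increasing=True)

print(np.linalg.cond(A))          # condition number of A
print(np.linalg.cond(A.T @ A))    # condition number of A^T A: the square of the above
```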
Nonetheless, we notice that we can use this result to better understand the pseudo-inverse. By looking at (3.12) and (1.10) we can immediately read off the following result.


Corollary 3.41 Let $A \in \mathbb{R}^{m\times n}$ with $\operatorname{rank}(A) = n$ be given. Then, for a given $b \in \mathbb{R}^m$, the solution of the least-squares problem (3.11) is given by $x = A^+b$.

We are now going to investigate two better ways of computing the solution to
the least-squares problem. The first way is based on the QR factorisation of the
matrix A and the second way is based upon the singular value decomposition
A = UΣV T from Theorem 1.18. In both cases, orthogonal matrices play an
important role.
Let us start by using the QR factorisation from Theorem 3.33. Hence, we can
represent our matrix A ∈ Rm×n as A = QR with an orthogonal matrix Q ∈
Rm×m and an upper triangular matrix R ∈ Rm×n . Using the fact that orthogonal
transformations leave the Euclidean norm invariant, we see that we can now
produce a solution by noting that
$$\|Ax - b\|_2^2 = \|QRx - b\|_2^2 = \|Rx - c\|_2^2 = \left\|\begin{pmatrix} R_1\\ O\end{pmatrix}x - \begin{pmatrix} c_1\\ c_2\end{pmatrix}\right\|_2^2 = \|R_1x - c_1\|_2^2 + \|c_2\|_2^2,$$

where we have set c = QT b ∈ Rm and where R1 ∈ Rn×n denotes the upper part
of the upper triangular matrix R.
The solution to the least-squares problem is therefore given by the solution of

R1 x = c1 ,

where R1 ∈ Rn×n is a square upper-triangular matrix. This system can be solved


quite easily using back substitution, as we have seen at the beginning of this
chapter.

Proposition 3.42 Let A ∈ Rm×n with m ≥ n and rank(A) = n be given. Then,


the solution of the least-squares problem (3.11) can be computed as follows:

1. Compute the QR factorisation of A,


2. Let R1 ∈ Rn×n be the upper square part of the upper triangular matrix
R ∈ Rm×n , c = QT b ∈ Rm and c1 ∈ Rn be the upper part of c.
3. Compute R1 x = c1 by back substitution.
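In NumPy, Proposition 3.42 can be carried out directly with the reduced QR factorisation returned by numpy.linalg.qr; the following sketch is ours (variable and function names are assumptions, not the book's).

```python
import numpy as np
from scipy.linalg import solve_triangular

def lstsq_qr(A, b):
    """Least-squares solution of min ||Ax - b||_2 for full-rank A via QR."""
    Q1, R1 = np.linalg.qr(A, mode="reduced")   # A = Q1 R1 with R1 n x n
    c1 = Q1.T @ b
    return solve_triangular(R1, c1)            # back substitution R1 x = c1

A = np.random.rand(50, 4)
b = np.random.rand(50)
x = lstsq_qr(A, b)
assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])
```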

The method also works if A does not have full rank, i.e. if rank(A) < n. In this
case the diagonal of R1 has additional zeros. But splitting R and c up appropri-
ately will also lead to the solution in this case.
If A does not have full rank then we cannot solve the normal equations di-
rectly, since the matrix AT A is not invertible. However, we can still compute
the pseudo-inverse of A using the singular value decomposition. Hence, we
will now discuss how to use the singular value decomposition A = UΣV T with


orthogonal matrices U ∈ Rm×m and V ∈ Rn×n and a diagonal matrix Σ ∈ Rm×n


for solving the least-squares problem and also generalise Corollary 3.41. We
will describe this now already in the case of a matrix that is not necessarily of
full rank.
The following observation is helpful. Let r denote the rank of the matrix A and
hence the number of singular values. If we set y := V T x ∈ Rn and c := U T b ∈
$\mathbb{R}^m$, then
$$\|b - Ax\|_2^2 = \|UU^Tb - U\Sigma V^Tx\|_2^2 = \|c - \Sigma y\|_2^2 = \sum_{j=1}^{r}(c_j - \sigma_jy_j)^2 + \sum_{j=r+1}^{m}c_j^2,$$

i.e. the residual is given by
$$\|b - Ax\|_2^2 = \sum_{j=1}^{r}|\sigma_jv_j^Tx - u_j^Tb|^2 + \sum_{j=r+1}^{m}|u_j^Tb|^2. \tag{3.13}$$

This means we have to choose $x$ such that $y_j = v_j^Tx = u_j^Tb/\sigma_j$ for $1 \le j \le r$. The remaining variables $y_j$, $r+1 \le j \le n$, are free to choose. Hence, we have the following result.
Theorem 3.43 The general solution to the least-squares problem (3.11) is given by
$$x = \sum_{j=1}^{r}\frac{u_j^Tb}{\sigma_j}v_j + \sum_{j=r+1}^{n}\alpha_jv_j =: x^+ + \sum_{j=r+1}^{n}\alpha_jv_j.$$

Here, σ j denote the singular values of the matrix A, v j ∈ Rn are the columns
of the matrix V and u j ∈ Rm are the columns of the matrix U from the singular
value decomposition. The coefficients α j are free to choose. From all possible
solutions, the solution x+ is the solution having minimal Euclidean norm.

Proof The representation of the general form follows by expressing x in the


orthonormal basis $v_1, \dots, v_n$, having coefficients $v_j^Tx = u_j^Tb/\sigma_j$. The solution $x^+$ has minimal norm, since we have for a general solution $x \in \mathbb{R}^n$ that
$$\|x\|_2^2 = \|x^+\|_2^2 + \sum_{j=r+1}^{n}\alpha_j^2 \ge \|x^+\|_2^2,$$

using the fact that the v j form an orthonormal system. 


With this result, it is now easy to see that the norm-minimal solution is given by applying the pseudo-inverse $A^+$ to the right-hand side $b$. Indeed, the representation $A^+ = V\Sigma^+U^T$ from (1.7) shows that
$$A^+b = V\Sigma^+U^Tb = V\Sigma^+c = x^+.$$


Hence, we have the following generalisation of Corollary 3.41.


Corollary 3.44 Let A ∈ Rm×n with m ≥ n be given. Then, the pseudo-inverse
A+ ∈ Rn×m maps every vector b ∈ Rm to its norm-minimal least-squares solu-
tion x+ ∈ Rn , i.e.
$$A^+b = \mathop{\mathrm{arg\,min}}\bigl\{\|x\|_2 : x \in \mathbb{R}^n \text{ solves } \min_z\|Az - b\|_2\bigr\}.$$
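Corollary 3.44 is exactly what numpy.linalg.pinv and numpy.linalg.lstsq compute; the following sketch also rebuilds $x^+$ directly from the singular value decomposition as in Theorem 3.43 (the rank-deficient test matrix is our own).

```python
import numpy as np

A = np.random.rand(8, 4) @ np.diag([1.0, 1.0, 1.0, 0.0])   # rank 3, not full rank
b = np.random.rand(8)

# x^+ from the SVD, summing only over the r non-zero singular values
U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = np.sum(s > 1e-12 * s[0])
x_plus = Vt[:r].T @ ((U[:, :r].T @ b) / s[:r])

assert np.allclose(x_plus, np.linalg.pinv(A) @ b)
assert np.allclose(x_plus, np.linalg.lstsq(A, b, rcond=None)[0])
```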

Next, we want to analyse the condition number of the least-squares process.


Since the Euclidean norm is the natural choice here, we will give all estimates
in it. As in the analysis of solving linear systems, the condition number of the
matrix will play a crucial role. We will continue to assume that A ∈ Rm×n with
m ≥ n. Hence, according to Definition 2.24, the condition number is given by
$$\kappa_2(A) = \frac{\sigma_1}{\sigma_r} = \|A\|_2\,\|A^+\|_2,$$
where σ1 and σr are the largest and smallest singular value of A, respectively,
and A+ denotes the pseudo-inverse of A.
We start with a simple perturbation result, where the right-hand side b but not
the matrix A is perturbed. The result employs the orthogonal projection P onto
range(A), which is, according to Lemma 1.24, given by P = AA+ .
Proposition 3.45 Let A ∈ Rm×n with m ≥ n be given. Let b, Δb ∈ Rm . If
x ∈ Rn and Δx ∈ Rn denote the norm-minimal solutions of the least-squares
problem (3.11) with data (A, b) and (A, b + Δb), respectively, then
$$\frac{\|\Delta x\|_2}{\|x\|_2} \le \kappa_2(A)\,\frac{\|P\Delta b\|_2}{\|Pb\|_2},$$
where P := AA+ is the orthogonal projection of Rm onto range(A).
Proof By Corollary 3.44 we have $x = A^+b$ and $x + \Delta x = A^+(b + \Delta b) = x + A^+\Delta b$. This means $\Delta x = A^+\Delta b = A^+AA^+\Delta b$, where we have used (1.8) in the last equality. From this, it immediately follows that
$$\|\Delta x\|_2 \le \|A^+\|_2\|AA^+\Delta b\|_2 = \|A^+\|_2\|P\Delta b\|_2.$$
Using the full singular value factorisation $A = U\Sigma V^T$, we have
$$x = A^+b = A^+AA^+b = A^+Pb = \sum_{i=1}^{r}\frac{u_i^TPb}{\sigma_i}v_i,$$
where $r = \operatorname{rank}(A)$ and $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0$ are the singular values of $A$. The orthonormality of the columns $u_i$ and $v_i$ of $U$ and $V$, respectively, shows
$$\|x\|_2^2 = \sum_{i=1}^{r}\left|\frac{u_i^TPb}{\sigma_i}\right|^2 \ge \frac{1}{\sigma_1^2}\sum_{i=1}^{r}|u_i^TPb|^2 = \frac{1}{\sigma_1^2}\|Pb\|_2^2,$$
where we have used that $Pb \in \operatorname{range}(A) = \operatorname{span}\{u_1, \dots, u_r\}$. Employing finally $\sigma_1 = \|A\|_2$ yields $\|x\|_2 \ge \|Pb\|_2/\|A\|_2$ and hence
$$\frac{\|\Delta x\|_2}{\|x\|_2} \le \|A^+\|_2\|A\|_2\,\frac{\|P\Delta b\|_2}{\|Pb\|_2}. \qquad\square$$
For the general perturbation result, where b as well as A can be perturbed we
need two auxiliary statements. The first one is a generalisation of the perturba-
tion Lemma 2.42.

Lemma 3.46 Let $A \in \mathbb{R}^{m\times n}$ be given with $\operatorname{rank}(A) = n \le m$. Let $\Delta A \in \mathbb{R}^{m\times n}$ be a perturbation with $\|A^+\|_2\|\Delta A\|_2 < 1$. Then, $A + \Delta A$ also has rank $n$ and satisfies the bound
$$\|(A+\Delta A)^+\|_2 \le \frac{\|A^+\|_2}{1 - \|A^+\|_2\|\Delta A\|_2}.$$
Proof We start by showing that $A + \Delta A$ has rank $n$. Let $x \in \ker(A + \Delta A)$, which means $Ax = -\Delta A\,x$ and, after multiplication by $A^T$, also $A^TAx = -A^T\Delta A\,x$. As $A^TA$ is invertible, (1.10) gives $x = -(A^TA)^{-1}A^T\Delta A\,x = -A^+\Delta A\,x$, and the condition $\|A^+\|_2\|\Delta A\|_2 < 1$ shows $x = 0$. From this we can conclude that the columns of $A + \Delta A$ are linearly independent and hence the rank of $A + \Delta A$ is $n$.
Next, let $\lambda_n$ be the smallest eigenvalue of $(A + \Delta A)^T(A + \Delta A)$ and let $x \in \mathbb{R}^n$ be a corresponding eigenvector with $\|x\|_2 = 1$. Then we have
$$\begin{aligned}\frac{1}{\|(A+\Delta A)^+\|_2} = \lambda_n^{1/2} &= \bigl(x^T(A+\Delta A)^T(A+\Delta A)x\bigr)^{1/2}\\ &= \|(A+\Delta A)x\|_2 \ge \|Ax\|_2 - \|\Delta A\,x\|_2\\ &\ge (x^TA^TAx)^{1/2} - \|\Delta A\|_2 \ge \sigma_n(A) - \|\Delta A\|_2\\ &= \frac{1}{\|A^+\|_2} - \|\Delta A\|_2 = \frac{1 - \|A^+\|_2\|\Delta A\|_2}{\|A^+\|_2}.\end{aligned}$$
The latter expression is positive by assumption, from which the statement then follows. $\square$

The result remains essentially true if A does not have full rank, as long as
rank(A + ΔA) ≤ rank(A) is satisfied. However, we will restrict ourselves to the
case rank(A) = n from now on.

Lemma 3.47 Let A, B ∈ Rm×n with rank(A) = rank(B) = n. Then,

$$\|BB^+(I - AA^+)\|_2 = \|AA^+(I - BB^+)\|_2.$$

Proof Let $A = U\Sigma V^T$ and $B = \tilde U\tilde\Sigma\tilde V^T$ be the full singular value decompositions of $A$ and $B$, respectively. Then, we have $AA^+ = U\Sigma V^TV\Sigma^+U^T = U\Sigma\Sigma^+U^T = UDU^T$ with the diagonal matrix $D = \operatorname{diag}(1, \dots, 1, 0, \dots, 0) \in \mathbb{R}^{m\times m}$ having exactly $n$ times a 1 and $m-n$ times a 0 on the diagonal. In the same way we have $BB^+ = \tilde UD\tilde U^T$ with the same diagonal matrix. If we define
$$W := \tilde U^TU = \begin{pmatrix} W_{11} & W_{12}\\ W_{21} & W_{22}\end{pmatrix} \in \mathbb{R}^{m\times m}$$
with block matrices $W_{11} \in \mathbb{R}^{n\times n}$, $W_{12} \in \mathbb{R}^{n\times(m-n)}$, $W_{21} \in \mathbb{R}^{(m-n)\times n}$ and $W_{22} \in \mathbb{R}^{(m-n)\times(m-n)}$ then, using the invariance of the Euclidean norm under orthogonal transformations, we find
$$\|BB^+(I - AA^+)\|_2 = \|\tilde UD\tilde U^T(I - UDU^T)\|_2 = \|DW(I - D)\|_2 = \|W_{12}\|_2$$
and, in the same way,
$$\|AA^+(I - BB^+)\|_2 = \|W_{21}^T\|_2.$$
Hence, it remains to show that $\|W_{12}\|_2 = \|W_{21}^T\|_2$ holds. Note that the transpose is irrelevant as the Euclidean norm is invariant under forming the transpose. To see this, let $z \in \mathbb{R}^{m-n}$ be given and let $x = (0^T, z^T)^T \in \mathbb{R}^m$ be its prolongation. Using again the invariance of the Euclidean norm, we find
$$\|z\|_2^2 = \|x\|_2^2 = \|\tilde U^TUx\|_2^2 = \|Wx\|_2^2 = \|W_{12}z\|_2^2 + \|W_{22}z\|_2^2,$$
which gives the norm
$$\|W_{12}\|_2^2 = \max_{\|z\|_2=1}\|W_{12}z\|_2^2 = 1 - \min_{\|z\|_2=1}\|W_{22}z\|_2^2 = 1 - \lambda_{\min}(W_{22}^TW_{22}),$$
where we have also used that $\|W_{22}z\|_2^2 = z^TW_{22}^TW_{22}z = \langle W_{22}^TW_{22}z, z\rangle_2$, so that (2.13) applies to the symmetric and positive semi-definite matrix $W_{22}^TW_{22}$.
In the same way, we can conclude from
$$\|z\|_2^2 = \|x\|_2^2 = \|(\tilde U^TU)^Tx\|_2^2 = \|W^Tx\|_2^2 = \|W_{21}^Tz\|_2^2 + \|W_{22}^Tz\|_2^2$$
that
$$\|W_{21}^T\|_2^2 = \max_{\|z\|_2=1}\|W_{21}^Tz\|_2^2 = 1 - \min_{\|z\|_2=1}\|W_{22}^Tz\|_2^2 = 1 - \lambda_{\min}(W_{22}W_{22}^T).$$
However, as we have seen when deriving the singular value decomposition, the smallest eigenvalue of $W_{22}^TW_{22}$ coincides with the smallest eigenvalue of $W_{22}W_{22}^T$. $\square$

After this, we can state and prove our main perturbation result.

Theorem 3.48 Let A ∈ Rm×n with rank(A) = n be given. Let ΔA ∈ Rm×n be


a perturbation with A+ 2 ΔA2 < 1. Let b, Δb ∈ Rm . Finally, let x, x + Δx ∈

08:40:30, subject to the Cambridge


04
3.8 Solving Least-Squares Problems 93

Rn be the solutions of the least-squares problem (3.11) with data (A, b) and
(A + ΔA, b + Δb), respectively. Then,
 
Δx2 κ2 (A) ΔA2 Ax − b2 Δb2
≤ 1 + κ2 (A) + .
x2 1 − κ2 (A)ΔA2 /A2 A2 A2 x2 A2 x2
Proof From x = A+ b and x + Δx = (A + ΔA)+ (b + Δb) it follows that
& '
Δx = (A + ΔA)+ (b + Δb) − A+ b = (A + ΔA)+ − A+ b + (A + ΔA)+ Δb.

We will now bound the norms of both terms on the right-hand side separately.
To this end, we have to rewrite the first term. As A has rank n, its pseudo-
inverse is given by (1.10) as A+ = (AT A)−1 AT which means A+ A = I. As
A + ΔA also has rank n, we also have (A + ΔA)+ (A + ΔA) = I. With this, we
derive

(A + ΔA)+ − A+ = (A + ΔA)+ (I − AA+ ) + (A + ΔA)+ AA+ − A+


= (A + ΔA)+ (I − AA+ ) + (A + ΔA)+ AA+
− (A + ΔA)+ (A + ΔA)A+
= (A + ΔA)+ (I − AA+ ) − (A + ΔA)+ (ΔA)A+ .

Applying this to b and using A+ b = x and (I − AA+ )A = A − AA+ A = 0 gives

[(A + ΔA)+ − A+ ]b = (A + ΔA)+ (I − AA+ )b − (A + ΔA)+ (ΔA)A+ b


= (A + ΔA)+ (I − AA+ )(b − Ax) − (A + ΔA)+ (ΔA)x.

Setting B := (A + ΔA) in Lemma 3.47 shows

(A + ΔA)+ (I − AA+ )2 = B+ (I − AA+ )2 = B+ BB+ (I − AA+ )2
≤ B+ 2 BB+ (I − AA+ )2
= B+ 2 AA+ (I − BB+ )2
= B+ 2 (AA+ )T (I − BB+ )2
= B+ 2 (A+ )T AT (I − BB+ )2
= B+ 2 (A+ )T (AT − BT )(I − BB+ )2
= B+ 2 (A+ )T (ΔA)T (I − BB+ )2
≤ B+ 2 A+ 2 ΔA2 I − BB+ 2 ,

where we have also used BT (I − BB+ ) = BT − BT BB+ = 0 and B − A = ΔA and


the fact that the Euclidean norm of a matrix equals the Euclidean norm of its
transpose.
If B = UΣV T is this time the full singular value decomposition of B, we have
with B+ = VΣ+ U T that BB+ = UΣΣ+ U T . As ΣΣ+ ∈ Rm×m is a diagonal matrix

08:40:30, subject to the Cambridge


04
94 Direct Methods for Solving Linear Systems

with n entries equal to 1 and m − n entries equal to 0 on the diagonal and as the
Euclidean norm is invariant under orthogonal transformations, we see that



⎨1 if m > n,
+ +
I − BB 2 = I − ΣΣ 2 = ⎪ ⎪
⎩0 if m = n.

Thus, we can conclude that

A + ΔA)+ (I − AA+ )2 ≤ (A + ΔA)+ 2 A+ 2 ΔA2 .

Finally, this allows us to finish the proof. We have

Δx2 ≤ [(A + ΔA)+ − A+ ]b2 + (A + ΔA)+ 2 Δb2


≤ (A + ΔA)+ (I − AA+ )2 b − Ax2 + (A + ΔA)+ 2 ΔA2 x2
+ (A + ΔA)+ 2 Δb2
& '
≤ (A + ΔA)+ 2 ΔA2 A+ 2 Ax − b2 + ΔA2 x2 + Δb2
A+ 2 & '
≤ +
ΔA2 A+ 2 Ax − b2 + ΔA2 x2 + Δb2
1 − A 2 ΔA2
 
κ2 (A) ΔA2 ( + ) Δb2
= A 2 Ax − b2 + x2 +
1 − κ2 (A)ΔA2 /A2 A2 A2
 
κ2 (A) ΔA2 Ax − b2 Δb2
= κ2 (A) + x2 +
1 − κ2 (A)ΔA2 /A2 A2 A2 A2
and dividing by x2 gives the stated result. 

Theorem 3.48 remains true, even if A is not of full rank, as long as the per-
turbation does not lead to a reduced rank, i.e. as long as the perturbed matrix
A + ΔA has the same rank as A, see for example Björck [20].
We note that a large defect Ax − b2 becomes the dominant error term as it is
multiplied by a factor of κ2 (A)2 . If the defect Ax − b2 is small the sensitivity
of the least-squares problem is determined by the factor κ2 (A), instead.
In the case of a quadratic matrix A ∈ Rn×n with full rank, the defect Ax − b2
is zero. As we also have b2 ≤ A2 x2 , the bound from the above Theorem
3.48 is exactly the one from Theorem 2.43 for the Euclidean norm.
If the smallest singular value of A is close to zero then this rank deficiency leads
obviously to a large condition number. In this situation it is better to reduce the
rank of A numerically by setting the smallest singular value – and any other
which is considered to be too small – to zero and solve the resulting least-
squares problem. If A = UΣV T is the original singular value decomposition of
A with singular values σ1 ≥ σ2 ≥ · · · ≥ σr > 0 and if we set σk+1 = · · · = σr =
0 for a fixed k < r then this means that we actually minimise the expression

08:40:30, subject to the Cambridge


04
3.8 Solving Least-Squares Problems 95

Ak x − b2 , where Ak ∈ Rm×n is given by



k
Ak = σ j u j vTj .
j=1

If x+k = A+k b denotes the solution to this problem and if x+ = A+ b denotes the
solution to the original least-squares problem (3.11) then (3.13) shows that the
residual b − Ax+k is given by

m 
r
b − Ax+k 22 = |uTj b|2 = b − Ax+ 22 + |uTj b|2 ,
j=k+1 j=k+1

which is manageable as long as the components of b in the direction of these


u j , k + 1 ≤ j ≤ r, are not too large. However, from the representations of x+
and x+k , we can also read off that the actual error is given by
" "
r "" uT b ""2
+ + 2
x − xk 2 = "
""
j "
" ,
j=k+1
" σ j ""

which is problematic as we are concerned with small singular values σ j , k+1 ≤


j ≤ r.
Hence, we are now going to look at a different way of dealing with the problem
of numerical rank deficiency.
Definition 3.49 Let A ∈ Rm×n , b ∈ Rm and τ > 0. Then, the Tikhonov reg-
ularisation of the least-squares problem (3.11) with regularisation parameter
τ > 0 is
minn Ax − b22 + τx22 . (3.14)
x∈R

The term τx22 is also called a penalty term and the problem a penalised least-
squares problem. The penalty term is sometimes replaced by a more general
expression of the form τLx22 , where L is a given matrix. Here, however, we
will restrict ourselves to the case of L = I.
The penalised least-squares problem (3.14) is equivalent to a standard least-
squares problem as we have
 2
 A b 
Ax − b2 + τx2 =  √ x −
2 2
 =: Aτ x − b̃2 ,
2
 τI 0  2

where Aτ ∈ R (m+n)×n
and b̃ ∈ R . Obviously, the new matrix Aτ has full rank
m+n

n, as its last n rows are linearly independent. Thus it has a unique solution and
so has (3.14) even if rank(A) = r < n. Moreover, we can use once again the
singular value decomposition to represent the solution.

08:40:30, subject to the Cambridge


04
96 Direct Methods for Solving Linear Systems

Theorem 3.50 Let A ∈ Rm×n be given with rank(A) = r ≤ n ≤ m. Let


A = Û Σ̂V̂ T be a reduced singular value decomposition of A ∈ Rm×n , where
Σ̂ ∈ Rr×r is a diagonal matrix with diagonal entries σ1 ≥ σ2 ≥ · · · ≥ σr > 0
and where Û ∈ Rm×r and V̂ ∈ Rn×r have orthonormal columns.
The penalised least-squares problem (3.14) has for each τ > 0 a unique solu-
tion x+τ which is given by

r σ uT b
j j
x+τ = v j. (3.15)
j=1
σ2j + τ

For τ → 0 the solution x+τ converges to x+ := A+ b, the norm-minimal solu-


tion of the least-squares problem (3.11). The error between x+τ and x+ can be
bounded by
τ
x+ − x+τ 2 ≤ b2 .
σr (σ2r + τ)
The condition number of the associated matrix Aτ ∈ R(m+n)×n is
⎛ 2 ⎞1/2
+
⎜⎜⎜ σ1 + τ ⎟⎟⎟
κ2 (Aτ ) = Aτ 2 Aτ 2 = ⎝⎜ 2 ⎟⎠ . (3.16)
σr + τ

Proof From the definition of Aτ and b̃ it immediately follows that

ATτ Aτ = AT A + τI, ATτ b̃ = AT b.

Using the reduced singular value decomposition and Corollary 3.41 in combi-
nation with (1.10) we can derive

x+τ = A+τ b̃ = (ATτ Aτ )−1 ATτ b̃ = (AT A + τI)−1 AT b



r σ uT b
j j
T −1 −1
= [V̂(Σ̂ + τI)V̂ ] V̂ Σ̂Û b = V̂(Σ̂ + τI) Σ̂U b =
2 T 2 T
v j.
j=1
σ2j + τ

For the error between x+ and x+τ we note that


⎡ ⎤
r ⎢⎢ 1 σ j ⎥⎥⎥ r
τ
+ + T ⎢ ⎢ ⎥⎥⎦ v j =
x − xτ = u j b ⎢⎣ − 2 uTj b v j.
j=1
σj σj + τ j=1
σ j (σ2j + τ)

As the v j are orthonormal, we immediately can conclude that


⎡ ⎤2  2 
 r ⎢⎢ τ ⎥⎥⎥ τ
r
+ + 2 T 2⎢ ⎢ ⎥
x − xτ 2 = |u j b| ⎢⎣ ⎥
⎦ ≤ |uTj b|2
j=1
σ j (σ 2
j + τ) σ r (σ 2 + τ)
r j=1
 2
τ
≤ b22 .
σr (σ2r + τ)

08:40:30, subject to the Cambridge


04
3.8 Solving Least-Squares Problems 97

For the condition number κ2 (Aτ ) = Aτ 2 A+τ 2 we start with


Aτ 22 = λmax (ATτ Aτ ) = λmax (AT A + τI) = λmax (Σ̂2 + τI) = σ21 + τ. (3.17)
For the remaining A+τ 2 we use A+τ 2 = (A+τ )T 2 and then A+τ = (ATτ Aτ )−1 ATτ
to derive
A+τ (A+τ )T = (ATτ Aτ )−1 ATτ [(ATτ Aτ )−1 ATτ ]T = (ATτ Aτ )−1 ATτ Aτ (ATτ Aτ )−1
= (ATτ Aτ )−1 = (AT A + τI)−1 = V̂(Σ̂2 + τI)−1 V̂ T .
Thus, we can conclude
1
A+τ 22 = λmax ((Σ̂2 + τI)−1 ) = ,
σ2r +τ
which gives, together with (3.17), the condition number. 
From (3.15) it obviously follows that x+τ tends with τ → 0 to the solution x+
of the standard least-squares problem. This is not too surprising since the pe-
nalised least-squares problem (3.14) becomes the least-squares problem (3.11)
for τ = 0. From (3.15) we see that the parameter τ is indeed regularising the
problem since it moves the singular values away from zero.
The remaining question is how to choose the regularisation parameter τ > 0.
Obviously, a smaller τ means that the least-squares part becomes more signif-
icant and the solution vector tries mainly to minimise this. For a larger τ, the
penalty term τx22 is dominant and the solution vector will try to achieve a
small norm. This becomes particularly important if the given data is corrupted.
Let us assume that the right-hand side b is perturbed by some noise Δb to
b + Δb.
If we denote the solution to this perturbed Tikhonov problem by xΔτ , then we
see that
r σ uT (b + Δb)
j j r σ uT Δb
j j
xΔτ = v j = x+
τ + v j,
j=1
σ 2
j + τ j=1
σ j +τ
2

where the latter summand is just the solution of the Tikhonov problem with
data Δb. Hence, we can bound the error between xΔτ and the solution of the
original least-squares problem x+ = A+ b by
 
 r σ uT Δb  
j j
x − xτ 2 ≤ x − xτ 2 + 
+ Δ + +
v j 
 σ j + τ 
2
j=1 2
τ σj
≤ b2 + max 2 Δb2 .
σr (σ2r + τ) 1≤ j≤r σ + τ
j

This shows how the error is split up into an approximation error, which goes

08:40:30, subject to the Cambridge


04
98 Direct Methods for Solving Linear Systems

to zero with τ → 0, and a data error. It would be desirable to choose τ to


minimise the overall error. This is, however, in general not possible.
There are various ways to choose τ. Often, a statistical method called cross
validation or generalised cross validation is used. Generalised cross validation
is based on the idea that a good value of the regularisation parameter should
help to predict missing data values. Hence, the parameter τ is chosen as the
minimiser of
Axτ − b22
G(τ) := ,
[tr(I − A(AT A + τI)−1 AT )]2
see for example Golub et al. [67] or Hansen [81]. Another method is the so-
called Morozov discrepancy principle (see [97]), which we want to describe
now and which is based on the strategy of choosing τ in such a way that the
defect AxΔτ − b2 equals the data error Δb2 . We will show that this is always
possible. As before let x+τ be the minimiser of

μτ (x) := Ax − b22 + τx22 .

As we are now interested in the behaviour of the defect, we define the function

J(τ) := Ax+τ − b22 .

Lemma 3.51 Let b ∈ range(A) with b  0 be given. Then, the function


J(τ) is continuous on [0, ∞) and strictly monotonically increasing with range
[0, b22 ).

Proof The continuity of τ → x+τ follows immediately from the representation


(3.15). From this, the continuity of J(τ) is obvious. Moreover, our assumptions
yield that x+τ  0 for each τ > 0, as x+τ = 0 means by (3.15) that uTj b = 0 for
1 ≤ j ≤ r, which contradicts 0  b ∈ range(A) = span{u1 , . . . , ur }.
Next, let K(τ) := x+τ 22 so that we have the decomposition

μτ (x+τ ) = J(τ) + τK(τ).

Let us assume that 0 < τ1 < τ2 . We will first show that x+τ1  x+τ2 . To see this,
we note that they satisfy the equations

(AT A + τ1 I)x+τ1 = AT b, (AT A + τ2 I)x+τ2 = AT b.

Subtracting these equations yields

AT A(x+τ1 − x+τ2 ) + τ1 (x+τ1 − x+τ2 ) = (τ2 − τ1 )x+τ2 . (3.18)

From this, we can conclude that x+τ1 = x+τ2 would imply (τ2 − τ1 )x+τ2 = 0. But
as x+τ2  0 we have the contradiction τ1 = τ2 .

08:40:30, subject to the Cambridge


04
3.8 Solving Least-Squares Problems 99

With this, we can show that K(τ) is strictly monotonically decreasing. Indeed,
multiplying (3.18) by (x+τ1 − x+τ2 )T from the left gives
A(x+τ1 − x+τ2 )22 + τ1 x+τ1 − x+τ2 22 = (τ2 − τ1 ) x+τ1 − x+τ2 , x+τ2 2 .
Since x+τ1  x+τ2 , we see that the left-hand side is positive and as τ2 > τ1 we can
conclude that
x+τ1 − x+τ2 , x+τ2 2 > 0,
showing
x+τ2 22 < x+τ1 , x+τ2 2 ≤ x+τ1 2 x+τ2 2 ,
i.e. K(τ2 ) < K(τ1 ). Finally, still under the assumption of 0 < τ1 < τ2 , we can
conclude that J(τ1 ) < J(τ2 ) because J(τ2 ) ≤ J(τ1 ) would imply that
μτ1 (x+τ2 ) = J(τ2 ) + τ1 K(τ2 ) < J(τ1 ) + τ1 K(τ1 ) = μτ1 (x+τ1 ),
which contradicts the fact that x+τ1 minimises μτ1 .
Finally, it remains to show that the range of J is given by [0, b2 ). On the one
hand, we have J(0) = 0 as b ∈ range(A). On the other hand, we know that
J is continuous and monotonically increasing, which means that the range of
J is an interval with an upper bound given by the limit of J(τ) for τ → ∞.
To determine this limit, we use that (3.15) yields x+τ → 0 for τ → ∞. This,
however, implies J(x+τ ) → b22 for τ → 0. 
Hence, for each s ∈ (0, b2 ) there exists a unique τ > 0 such that J(τ) = s. As
we can apply this also to J Δ (τ) = AxΔτ − (b + Δb)22 , this proves the following
result.
Theorem 3.52 (Morozov’s discrepancy principle) Let A ∈ Rm×n with m ≥ n.
Let b  0 be in the range of A. Let the perturbation Δb ∈ Rm satisfy 0 <
Δb2 < b2 . Then, there exists a unique τ > 0 such that the unique solution
xΔτ which minimises Ax − (b + Δb)22 + τx22 over all x ∈ Rn satisfies
AxΔτ − (b + Δb)2 = Δb2 .
The above result only guarantees the existence of a unique τ with this prop-
erty. Its numerical determination is often based upon the so-called L-curve
principle. As we know that the function K(τ) is monotonically decreasing and
the function J(τ) is monotonically increasing, we see that the function K is
monotonically decreasing as a function of J, i.e. if the curve (K(τ), J(τ)) is
monotonically decreasing in the first quadrant. In many cases this curve has
the shape of an L with a prominent corner, particularly in a logarithmic plot.
The parameter τ corresponding to this corner often yields a good choice of the
penalisation parameter. Details can be found, for example, in Hansen [81].

08:40:30, subject to the Cambridge


04
100 Direct Methods for Solving Linear Systems

Exercises
3.1 Let A ∈ Rn×n be a symmetric matrix and assume that Gaussian elimina-
tion without pivoting is possible. Show that after k − 1 steps, the lower
right (n − k + 1) × (n − k + 1) block in A(k) from (3.8) is symmetric.
3.2 Denote the columns of A ∈ Rn×n by a1 , . . . , an ∈ Rn . Use the QR factori-
sation of A to show | det A| ≤ a1 2 · · · an 2 .
3.3 Let data {y1 , . . . , ym } at positions {x1 , . . . , xm } be given.
• Use the least-squares approach to determine the linear polynomial
y(x) = ax + b which fits these data best, i.e. find the linear polyno-
mial which minimises
m
|y j − y(x j )|2 . (3.19)
j=1

• Find the function y(x) = aeλ(x−μ) which minimises (3.19) for a, λ, μ ∈


R.
3.4 A Givens rotation is an n × n matrix of the form
J(i, j, θ) = I + (cos(θ) − 1)(e j eTj + ei eTi ) + sin(θ)(e j eTi − ei eTj ).
• Show that a Givens rotation is orthogonal.
• Show that for x ∈ Rn there is an angle θ such that yk = 0 for y =
J(i, k, θ)x. Use this to show that the QR factorisations of a tridiagonal
matrix A ∈ Rn×n can be achieved in O(n) time using Givens rotations.
• Show that in R2×2 an orthogonal matrix is either a Householder matrix
or a Givens rotation.
3.5 Let A ∈ Rm×n with m ≥ n and let b, Δb ∈ Rm with b  range(A). Let x
and x + Δx be the minimal norm solutions of the least-squares problem
with data (A, b) and (A, b + Δb), respectively. Show
Δx2 Δb2 Δb2
≤ κ2 (A) ≤ κ2 (A) .
x2 A2 x2 AA+ b2
3.6 Let A ∈ Rm×n with rank(A) = n and b ∈ Rm be given. Determine the
condition number of solving the least-squares problem minx Ax − b2
under the assumption that b is exact, i.e. determine the condition number
of the function f : {A ∈ Rm×n : rank(A) = n} → Rn , f (A) := x+ .
3.7 Let A ∈ Rn×n be a symmetric, positive definite matrix. Let τ > 0 and
b ∈ Rn be given. Show that the function x → Ax − b22 + τxT Ax has a
unique minimum x∗τ ∈ Rn which solves the linear system (A + τI)x = b.

08:40:30, subject to the Cambridge


04
4
Iterative Methods for Solving Linear Systems

4.1 Introduction
We have seen that a direct algorithm for solving an n × n linear system

Ax = b

generally requires O(n3 ) time. For larger n, this becomes quite quickly im-
practical. The basic idea behind iterative methods is to produce a way of ap-
proximating the application of the inverse of A to b so that the amount of work
required is substantially less than O(n3 ). To achieve this, a sequence of approx-
imate solutions is generated, which is supposed to converge to the solution.
The computation of each iteration usually costs O(n2 ) time so that the method
becomes computationally interesting only if a sufficiently good approximation
can be achieved in far fewer than n steps.
To understand the general concept, let us write the matrix A as A = A − B + B
with an invertible matrix B at our disposal. Then, the equation Ax = b can be
reformulated as b = Ax = (A − B)x + Bx and hence as

x = B−1 (B − A)x + B−1 b =: Cx + c =: F(x), (4.1)

so that x ∈ Rn solves Ax = b if and only if it is the only fixed point of the


mapping F, i.e. it satisfies F(x) = x. To calculate the fixed point of F, we can
use the following simple iterative process. We first pick a starting point x0 and
then form
x j+1 := F(x j ) = Cx j + c, j = 0, 1, 2, . . . . (4.2)

Noting that this particular F is continuous, we see that if this sequence con-
verges then the limit has to be the fixed point of F and hence the solution of
Ax = b.

101

08:42:03, subject to the Cambridge


05
102 Iterative Methods for Solving Linear Systems

We can, however, also start with an iteration of the form (4.2). If C and c have
the form (4.1) and if (4.2) converges the limit must be the solution of Ax = b.

Definition 4.1 An iterative method (4.2) is called consistent with the linear
system Ax = b if the iteration matrix C and vector c are of the form C =
I − B−1 A and c = B−1 b with an invertible matrix B.

In the next section, we will discuss possible choices of B and conditions for
the convergence of the iterative method.

4.2 Banach’s Fixed Point Theorem


We will start by deriving a general convergence theory for an iteration process
of the form (4.2) even with a more general, not necessarily linear F : Rn → Rn .

Definition 4.2 A mapping F : Rn → Rn is called a contraction mapping with


respect to a norm  ·  on Rn if there is a constant 0 ≤ q < 1 such that

F(x) − F(y) ≤ qx − y

for all x, y ∈ Rn .

A contraction mapping is Lipschitz continuous with Lipschitz constant q < 1.

Theorem 4.3 (Banach) If F : Rn → Rn is a contraction mapping then F


has exactly one fixed point x∗ ∈ Rn . The sequence x j+1 := F(x j ) converges for
every starting point x0 ∈ Rn to x∗ . Furthermore, we have the error estimates
q
x∗ − x j  ≤ x j − x j−1  (a posteriori),
1−q
qj
x∗ − x j  ≤ x1 − x0  (a priori).
1−q
Proof First, let us assume that F has two different fixed points x∗ , x̃ ∈ Rn .
Then, the contraction property of F yields

x∗ − x̃ = F(x∗ ) − F(x̃) ≤ qx∗ − x̃ < x∗ − x̃,

which is a contradiction. Hence, if existent, the fixed point is unique.


In the next step, we will show that the sequence {x j } is a Cauchy sequence. The
completeness of Rn then proves its convergence to a point x∗ ∈ Rn . Since F is
continuous, this limit point x∗ has to be the fixed point of F.

08:42:03, subject to the Cambridge


05
4.2 Banach’s Fixed Point Theorem 103

For arbitrary j and k from N0 , we can use the contraction property j times to
derive

x j+k − x j  = F(x j+k−1 ) − F(x j−1 ) ≤ qx j+k−1 − x j−1 


≤ q j xk − x0 .

The distance between xk and x0 can then be estimated using a telescoping sum
argument:

xk − x0  ≤ xk − xk−1  + xk−1 − xk−2  + · · · + x1 − x0 


 
≤ qk−1 + qk−2 + · · · + 1 x1 − x0 
1
≤ x1 − x0 ,
1−q
using the geometric series for 0 ≤ q < 1. Hence, we have
qj
x j+k − x j  ≤ x1 − x0 , (4.3)
1−q
and this expression becomes smaller than any given > 0, provided j is large
enough. This proves that {x j } is indeed a Cauchy sequence.
Finally, letting k → ∞ in (4.3) and using the continuity of the norm yields the
a priori estimate. Since this estimate is valid for each starting point, we can
restart the iteration with y0 = x j−1 and the a priori estimate for the first two
terms y0 = x j−1 and y1 = F(y0 ) = x j gives the a posteriori estimate. 

If we apply this theorem to our specific iteration function F(x) = Cx+c, where
C is the iteration matrix, then we see that

F(x) − F(y) = Cx + c − (Cy + c) = C(x − y) ≤ C x − y,

so that we have convergence if C < 1. Unfortunately, this depends on the


chosen vector and hence matrix norm, while, since all norms on Rn are equiv-
alent, the fact that the sequence converges does not depend on the norm.
In other words, having a specific compatible matrix norm with C < 1 is suf-
ficient for convergence but not necessary. A sufficient and necessary condition
can be stated using the spectral radius of the iteration matrix.

Definition 4.4 Let A ∈ Rn×n with eigenvalues λ1 , λ2 , . . . , λn ordered so that


|λ1 | ≥ |λ2 | ≥ · · · ≥ |λn |. The spectral radius of A is given by ρ(A) := |λ1 |.

We have encountered the spectral radius before. In Theorem 2.19 we have


seen that the 2-norm of a matrix A is given by the square root of the largest
eigenvalue of AT A, which we can now write as A2 = ρ(AT A)1/2 .

08:42:03, subject to the Cambridge


05
104 Iterative Methods for Solving Linear Systems

Note that, if λ is an eigenvalue of A with eigenvector x, then λk is an eigenvalue


of Ak , k = 1, 2, 3, . . . with eigenvector x. Hence, ρ(A)k ≤ ρ(Ak ).
In the proof of the following theorem we have to be a little careful since a real
matrix A ∈ Rn×n might easily have eigenvectors x in Cn and hence Ax ∈ Cn .

Theorem 4.5 1. If  ·  : Rn×n → R is a compatible matrix norm then ρ(A) ≤


A for all matrices A ∈ Rn×n .
2. For any A ∈ Rn×n and any > 0 there exists a compatible matrix norm
 ·  : Rn×n → R such that ρ(A) ≤ A ≤ ρ(A) + .

Proof 1. Suppose λ ∈ C is an eigenvalue of A with corresponding eigenvec-


tor x ∈ Cn \ {0}. If we have λ ∈ R then we can choose the corresponding
eigenvector x ∈ Rn . To see this note that with x ∈ Cn also x ∈ Cn and hence
the real part (x) ∈ Rn are eigenvectors of A to the eigenvalue λ. Hence,
for such an x ∈ Rn we immediately have

|λ| x = λx = Ax ≤ A x

and hence |λ| ≤ A. Next, let λ = μeiθ ∈ C be a complex-valued eigenvalue


with μ = |λ| > 0 and θ  0 with corresponding normalised eigenvector x ∈
Cn . With x also each vector y := eiφ x is obviously an eigenvector of A to the
eigenvalue λ. Thus, we can choose the angle φ such that y has an imaginary
part (y) satisfying (y) ≤ (eiθ y). Then, we have A((y)) = (Ay) =
(λy) = μ(eiθ y) and hence

A(y) (eiθ y)


A ≥ =μ ≥ μ = |λ|.
(y) (y)

Since λ was an arbitrary eigenvalue this means ρ(A) ≤ A.


2. From Theorem 3.38, Schur’s factorisation, we know that there exists a uni-
T
tary matrix W such that W AW is an upper triangular matrix, with the
eigenvalues of A as the diagonal elements:
⎛ ⎞
⎜⎜⎜λ1 r12 ··· r1n ⎟⎟⎟
⎜⎜⎜⎜ 0 λ2 ···

r2n ⎟⎟⎟⎟

W AW = ⎜⎜⎜⎜⎜ . ⎟
T
.. .. .. ⎟⎟⎟⎟ .
⎜⎜⎜ .. . . . ⎟⎟⎟⎟
⎜⎝ ⎠
0 0 ··· λn

Let D be a diagonal matrix with diagonal elements 1, δ, . . . , δn−1 , where

08:42:03, subject to the Cambridge


05
4.2 Banach’s Fixed Point Theorem 105

0 < δ < min(1, /(r + )) with r = max |ri j | and > 0. Let S = WD, then
⎛ ⎞
⎜⎜⎜λ1 δr12 · · · δn−1 r1n ⎟⎟⎟
⎜⎜⎜ ⎟
⎜⎜ 0 λ2 · · · δn−2 r2n ⎟⎟⎟⎟
S AS = D W AWD = ⎜⎜⎜⎜⎜ .
−1 −1 T
.. .. .. ⎟⎟⎟⎟ .

⎜⎜⎜ .. . . . ⎟⎟⎟⎟
⎜⎝ ⎠
0 0 ··· λn
Furthermore, using Theorem 2.19 and Remark 2.25, we see that

n
−1
S AS ∞ = max |(S −1 AS )i j | ≤ ρ(A) + δr(1 + δ + · · · + δn−2 ),
1≤i≤n
j=1

which is a geometric progression. Hence, we have



−1 1 − δn−1 δr
S AS ∞ ≤ ρ(A) + δr ≤ ρ(A) + ≤ ρ(A) + .
1−δ 1−δ
Since A → S −1 AS ∞ is the (complex) matrix norm induced by the (com-
plex) vector norm x → S −1 x∞ , see Example 2.29 and the remarks right
after it, we know that these norms are compatible when restricted to the real
numbers, which completes the proof. 
Note that the proof of the above theorem, as well as the proof of the next
theorem, simplifies significantly if a complex set-up is used. Nonetheless, it is
worth knowing that everything works out even in an entirely real set-up.
Theorem 4.6 Let C ∈ Rn×n and c ∈ Rn be given. The iteration x j+1 = Cx j + c
converges for every starting point x0 ∈ Rn if and only if ρ(C) < 1.
Proof Assume first that ρ(C) < 1. Then, we can pick an > 0 such that
ρ(C) + < 1 and, by Theorem 4.5, we can find a compatible matrix norm  · 
such that C ≤ ρ(C) + < 1, which gives convergence.
Assume now that the iteration converges to x∗ for every starting point x0 . This
particularly means that if we pick two arbitrary starting points x0 , y0 ∈ Rn then
the sequences x j+1 := Cx j + c and y j+1 := Cy j + c both converge to x∗ ∈ Rn .
Hence, the sequence
z j+1 := x j+1 + iy j+1 = C(x j + iy j ) + (1 + i)c = Cz j + c̃
converges for every starting point z0 = x0 + iy0 ∈ Cn to z∗ = (1 + i)x∗ , which
satisfies z∗ = Cz∗ + c̃. If we pick the starting point z0 such that z = z0 − z∗ is
an eigenvector of C with eigenvalue λ, then
z j − z∗ = C(z j−1 − z∗ ) = · · · = C j (z0 − z∗ ) = λ j (z0 − z∗ ).
Since the expression on the left-hand side tends to zero for j → ∞, so does the

08:42:03, subject to the Cambridge


05
106 Iterative Methods for Solving Linear Systems

expression on the right-hand side. This, however, is possible only if |λ| < 1.
Since λ was an arbitrary eigenvalue of C, this shows that ρ(C) < 1. 

4.3 The Jacobi and Gauss–Seidel Iterations


After this general discussion, we return to the question of how to choose the
iteration matrix C. Our initial approach is based upon a splitting
C = B−1 (B − A) = I − B−1 A, c = B−1 b
with a matrix B, which should be sufficiently close to A but also easily invert-
ible.
From now on, we will assume that the diagonal elements of A are all non-zero.
This can be achieved by exchanging rows and/or columns as long as A is non-
singular. Next, we decompose A in its lower-left sub-diagonal part, its diagonal
part and its upper-right super-diagonal part, i.e.
A= L+D+R
with
⎧ ⎧ ⎧


⎨ai j if i > j, ⎪

⎨ai j if i = j, ⎪

⎨ai j if i < j,
L=⎪
⎪ D=⎪
⎪ R=⎪

⎩0 else, ⎩0 else, ⎩0 else.
The simplest possible approximation to A is then given by picking its diagonal
part D for B so that the iteration matrix becomes
C J = I − B−1 A = I − D−1 (L + D + R) = −D−1 (L + R),
with entries



⎨−aik /aii if i  k,
cik = ⎪
⎪ (4.4)
⎩0 else.
This means that we can write the iteration defined by
x( j+1) = −D−1 (L + R)x( j) + D−1 b
component-wise as follows.
Definition 4.7 The iteration defined by
⎛ ⎞
⎜⎜⎜  ⎟⎟⎟
1 ⎜⎜⎜ ⎟⎟
n
xi( j+1) = ⎜⎜⎜bi − aik xk( j) ⎟⎟⎟⎟ , 1 ≤ i ≤ n, (4.5)
aii ⎜⎝ k=1
⎟⎠
ki

is called the Jacobi method.

08:42:03, subject to the Cambridge


05
4.3 The Jacobi and Gauss–Seidel Iterations 107

Note that we wrote the iteration index as an upper index, which we will con-
tinue to do from now on. An algorithmic formulation is given in Algorithm 7.
Here we have realised the iteration with just two vectors x and y, where x
represents the “old” iteration and y represents the new, next iteration.
We also need a stopping criterion here. Unfortunately, we cannot use the error
x∗ − x( j) since we do not know the exact solution x∗ . But there are at least two
possibilities. The easiest one is to choose a maximum number of iteration steps
for which we are willing to run the iteration. The advantage of this stopping
criterion is that we will finish after the prescribed number of iterations. How-
ever, we might have a good approximation x( j) already after a few steps and
would thereafter do more iterations than necessary. But we might also have a
bad approximation, and would then need to perform additional iterations.
A better stopping criterion is to look at the residual Ax∗ − Ax( j) = b − Ax( j) and
to stop if the norm of the residual falls below a given threshold. Unfortunately,
computing the residual is an O(n2 ) operation for a full matrix. Another, widely
used stopping criterion is to look at the difference of two iterations x( j+1) − x( j) .
If the norm of this difference falls below a certain threshold, the process stops.
This is computationally cheaper as it costs only O(n) time.

Algorithm 7: Jacobi iteration


Input : A ∈ Rn×n invertible, b ∈ Rn , x(0) ∈ Rn .
Output: Approximate solution of Ax = b.
1 Compute r := b − Ax(0) 
2 while r ≥ do
3 for i = 1 to n do
4 yi := bi
5 for k = 1 to i − 1 do
6 yi := yi − aik xk
7 for k = i + 1 to n do
8 yi := yi − aik xk
9 yi := yi /aii
10 x := y
11 r := b − Ax

It is easy to calculate the computational cost. As is typical for an iterative


method, we only give the computational cost for one step of the iteration, so
the total cost becomes this cost times the number of iterations.

08:42:03, subject to the Cambridge


05
108 Iterative Methods for Solving Linear Systems

Remark 4.8 Let A ∈ Rn×n be a full matrix. Then, one step of the Jacobi
method costs O(n2 ) time.

Note that the computational cost drops to O(n) if the matrix A is sparse and has
only a constant number of non-zero entries per row.
Obviously, one can expect convergence of the Jacobi method if the original ma-
trix A resembles a diagonal matrix, i.e. if the entries on the diagonal dominate
the other entries in that row. Recall that a matrix A is strictly row diagonally
dominant if
n
|aik | < |aii |, 1 ≤ i ≤ n.
k=1
ki

Theorem 4.9 The Jacobi method converges for every starting point if the
matrix A is strictly row diagonally dominant.

Proof We use the row sum norm to calculate the norm of the iteration matrix
C J as

n 
n
|aik |
C J ∞ = max |cik | = max < 1.
1≤i≤n
k=1
1≤i≤n
k=1
|aii |
ki

Hence, we have convergence. 

The latter criterion can be weakened in a way which we will describe now.

Definition 4.10 A matrix A ∈ Rn×n is called weakly (row) diagonally domi-


nant, if, for 1 ≤ i ≤ n,


n
|aik | ≤ |aii | (4.6)
k=1
ki

is satisfied and if there is at least one index i for which (4.6) holds with a strict
lower sign.

If the matrix A is invertible and weakly diagonally dominant, the diagonal


elements of A have to be different from zero. To see this, let us assume that
there is an i with aii = 0. In that case the whole row i would be zero, which
contradicts the non-singularity of A.
Being weakly diagonally dominant is not enough for the Jacobi method to
achieve convergence. We also need that the matrix is not reducible in the fol-
lowing sense.

08:42:03, subject to the Cambridge


05
4.3 The Jacobi and Gauss–Seidel Iterations 109

Definition 4.11 A matrix A ∈ Rn×n is called reducible if there are non-empty


subsets N1 and N2 of N := {1, 2, . . . , n} with

1. N1 ∩ N2 = ∅,
2. N1 ∪ N2 = N,
3. for each i ∈ N1 and each k ∈ N2 we have aik = 0.

The matrix A is called irreducible if it is not reducible.

For a linear system Ax = b with reducible matrix A ∈ Rn×n we can split the
system into two smaller, independent systems. If, for example, N1 contains
exactly m < n elements, then we can first solve the m equations


n 
aik xk = aik xk = bi , i ∈ N1 .
k=1 k∈N1

This gives us the m components xk of the solution x with index k ∈ N1 . Having


these components, we can rewrite the equations for the other indices i ∈ N2 as
 
aik xk = bi − aik xk , i ∈ N2 .
k∈N2 k∈N1

Hence, without restriction, we can assume that our matrix is irreducible.

Theorem 4.12 If the matrix A ∈ Rn×n is irreducible and weakly diagonally


dominant then the Jacobi method converges for every starting point.

Proof First of all, we note that all diagonal elements of A are different from
zero. Otherwise, the sets N1 := {i ∈ N : aii = 0} and N2 := {i ∈ N : aii  0}
would give a non-trivial decomposition of N. To see this, note that N1 is non-
empty by this assumption and N2 also contains at least one element by the fact
that A is weakly diagonally dominant. The union of the two sets obviously
gives N. Furthermore, for i ∈ N1 the weakly diagonal dominance of A ensures
that also all ai j for j  i have to vanish, in particular for those j ∈ N2 .
From A being weakly diagonally dominant and Theorem 4.5 we immediately
have ρ(C J ) ≤ C J ∞ ≤ 1 for the iteration matrix C J of the Jacobi method.
Hence, it remains to ensure that there is no eigenvalue of C J with absolute
value one. Hence, let us assume there is such an eigenvalue λ ∈ C of C J with
|λ| = 1 and corresponding eigenvector x ∈ Cn , which we normalise to satisfy
x∞ = 1. Then we can form the sets N1 := {i ∈ N : |xi | = 1}  ∅ and

08:42:03, subject to the Cambridge


05
110 Iterative Methods for Solving Linear Systems

N2 = N \ N1 . The set N2 is also not empty since


"" ""
"" n ""
1 ""  " 1 
n
1 
n
|xi | = |λ||xi | = |(C J x)i | = "" aik xk """ ≤ |aik xk | ≤ |aik | ≤ 1
|aii | " k=1 " |aii | k=1 |aii | k=1
" ki " ki ki
(4.7)
shows that for i ∈ N1 we must have equality here, meaning particularly

n
|aii | = |aik |, i ∈ N1 .
k=1
ki

Hence, since A is weakly diagonally dominant, there must be at least one index
i  N1 , which shows that N2 cannot be empty.
Since A is irreducible there is at least one i ∈ N1 and one k0 ∈ N2 with aik0  0.
For this k0 ∈ N2 we have particularly |aik0 xk0 | < |aik0 |, so that we have a strict
inequality in (4.7). This, however, is a contradiction to i ∈ N1 . 

If the matrix A is strictly diagonally dominant, we have C J ∞ < 1. If the


matrix A is irreducible and weakly diagonally dominant we only have ρ(C J ) <
1. We will have C J ∞ = 1 whenever there is at least one index i with equality
in (4.6).
A closer inspection of the Jacobi method shows that the computation of xi( j+1)
is independent of any other x( j+1) . This means that on a parallel or vector com-
puter all components of the new iteration x( j+1) can be computed simultane-
ously.
However, it also gives us the possibility to improve the process. For example,
to calculate x2( j+1) we could already employ the newly computed x1( j+1) . Then,
for computing x3( j+1) we could use x1( j+1) and x2( j+1) and so on. This leads to the
following iterative scheme.

Definition 4.13 The Gauss–Seidel method is given by the iterative scheme


⎛ ⎞
1 ⎜⎜⎜⎜   ⎟⎟
( j) ⎟
i−1 n
xi( j+1) = ⎜⎝⎜bi − ( j+1)
aik xk − aik xk ⎟⎟⎟⎠ , 1 ≤ i ≤ n. (4.8)
aii k=1 k=i+1

As usual, the algorithm is given in Algorithm 8. Once, again we use the residual
as the stopping criterion. This time, we have an in situ algorithm, i.e. we can
store each iteration directly over the previous one.
As in the case of the Jacobi method, the computational cost of the Gauss–
Seidel iteration is easily derived.

Remark 4.14 The computational cost of one iteration of the Gauss–Seidel

08:42:03, subject to the Cambridge


05
4.3 The Jacobi and Gauss–Seidel Iterations 111

Algorithm 8: Gauss–Seidel iteration


Input : A ∈ Rn×n invertible, b ∈ Rn , x(0) ∈ Rn .
Output: Approximate solution of Ax = b.
1 Compute r := b − Ax(0) 
2 while r ≥ do
3 for i = 1 to n do
4 xi := bi
5 for k = 1 to i − 1 do
6 xi := xi − aik xk
7 for k = i + 1 to n do
8 xi := xi − aik xk
9 xi := xi /aii
10 r := b − Ax

method is O(n2 ). This reduces to O(n) if we have a sparse matrix with only
O(n) non-zero entries.

To see that this scheme is consistent and to analyse its convergence, we have
to find C = I − B−1 A = B−1 (B − A) and c = B−1 b. To this end, we rewrite (4.8)
as

i−1 
n
aii xi( j+1) + aik xk( j+1) = bi − aik xk( j) ,
k=1 k=i+1

which translates into (L + D)x( j+1) = −Rx( j) + b. Hence, if we define B =


(L + D)−1 and use B − A = −R, we note that the scheme is indeed consistent
and that the iteration matrix of the Gauss–Seidel method is given by

CGS = −(L + D)−1 R.

Later on, we will prove a more general version of the following theorem.

Theorem 4.15 If A ∈ Rn×n is symmetric, positive definite then the Gauss–


Seidel method converges.

Our motivation for introducing the Gauss–Seidel method was to improve the
Jacobi method. Under certain assumptions it is indeed possible to show that
the Gauss–Seidel method converges faster than the Jacobi method.

Theorem 4.16 (Sassenfeld) Let A ∈ Rn×n . Assume that the diagonal elements

08:42:03, subject to the Cambridge


05
112 Iterative Methods for Solving Linear Systems

of A are different from zero. Define recursively


⎛ i−1 ⎞
1 ⎜⎜⎜⎜  ⎟⎟⎟
n
si = ⎜⎜⎝ |aik |sk + |aik |⎟⎟⎟⎠ , 1 ≤ i ≤ n.
|aii | k=1 k=i+1

Then, we have for the iteration matrix CGS = −(L + D)−1 R of the Gauss–Seidel
method the bound
CGS ∞ ≤ max si =: s,
1≤i≤n

and the Gauss–Seidel method converges if s < 1. If A is a strictly row diago-


nally dominant matrix, we also have

CGS ∞ ≤ s ≤ C J ∞ < 1,

where C J = −D−1 (L + R) is the iteration matrix of the Jacobi method.

Proof Let y ∈ Rn . We have to show that CGS y∞ ≤ sy∞ . From this,
CGS ∞ ≤ s immediately follows. Let z = CGS y. For the components z1 , . . . , zn
of z we can conclude that
⎛ i−1 ⎞
1 ⎜⎜⎜⎜  ⎟⎟⎟
n
zi = − ⎜⎝⎜ aik zk + aik yk ⎟⎟⎠⎟ , 1 ≤ i ≤ n.
aii k=1 k=i+1

In particular for i = 1 this means

1  1 
n n
|z1 | ≤ |a1k ||yk | ≤ |a1k | y∞ = s1 y∞ ≤ sy∞ .
|a11 | k=2 |a11 | k=2

If, for 1 ≤ j ≤ i − 1 the statement |z j | ≤ s j y∞ is correct, then we can, by


induction, conclude that
⎛ i−1 ⎞
1 ⎜⎜⎜⎜  ⎟⎟⎟
n
|zi | ≤ ⎜⎜⎝ |aik ||zk | + |aik ||yk |⎟⎟⎟⎠
|aii | k=1 k=i+1
⎛ i−1 ⎞
y∞ ⎜⎜⎜ ⎜   n ⎟⎟⎟
≤ ⎜⎜⎝ |aik |sk + |aik |⎟⎟⎟⎠
|aii | k=1 k=i+1
= y∞ si .

Hence, we have proven the first part of the theorem. Now suppose that A is a
strictly row diagonally dominant matrix. Then, it follows that

1 
n
K := C J ∞ = max |aik | < 1
1≤i≤n |aii |
k=1
ki

08:42:03, subject to the Cambridge


05
4.3 The Jacobi and Gauss–Seidel Iterations 113

and particularly s1 ≤ K < 1. We can now continue by induction and may


assume that s j ≤ K < 1 for all 1 ≤ j ≤ i − 1. With this, we can conclude that
⎛ i−1 ⎞ ⎛ i−1 ⎞
1 ⎜⎜⎜⎜  ⎟⎟⎟ 1 ⎜⎜⎜⎜   ⎟⎟⎟
n n
si = ⎜⎝⎜ |aik |sk + ⎟
|aik |⎟⎠⎟ ≤ ⎜⎜⎝K |aik | + |aik |⎟⎟⎟⎠
|aii | k=1 k=i+1
|aii | k=1 k=i+1

1 
n
≤ |aik | ≤ K < 1,
|aii | k=1
ki

meaning also s ≤ C J ∞ . 

As an example, let us have a look at the discretisation of the boundary value


problem described in Section 1.1.2. Essentially, we have to solve a linear sys-
tem with a matrix of the form
⎛ ⎞
⎜⎜⎜ 2 −1 0 ⎟⎟⎟
⎜⎜⎜ ⎟⎟⎟
⎜⎜⎜−1 2 −1 ⎟⎟⎟
⎜⎜⎜⎜ −1 2 −1 ⎟⎟⎟⎟
⎜ ⎟⎟⎟
A = ⎜⎜⎜⎜ .. .. .. ⎟⎟⎟ = L + D + R, (4.9)
⎜⎜⎜ . . . ⎟

⎜⎜⎜⎜ ⎟⎟
⎜⎜⎜ −1 2 −1⎟⎟⎟⎟⎟
⎝ ⎠
0 −1 2

where D = 2I, L is a matrix with −1 on the sub-diagonal and R is a matrix with


−1 on the super-diagonal. All other entries of R and L are zero. This means that
the Jacobi method computes the new iteration u( j+1) component-wise as
⎧ ( j)


⎪ u2 if i = 1,
( j+1) 1 2 1⎪

⎨ ( j)
ui = h fi + ⎪ ⎪
( j)
ui−1 + ui+1 if 2 ≤ i ≤ n − 1, (4.10)
2 2⎪


⎩u( j) if i = n
n−1

and the Gauss–Seidel method calculates it as


⎧ ( j)


⎪ u2 if i = 1,
1 2 ⎪
1⎪
⎨ ( j+1)
( j+1)
ui = h fi + ⎪ j)
ui−1 + u(i+1 if 2 ≤ i ≤ n − 1, (4.11)
2 2⎪



⎩u( j+1) if i = n.
n−1

Note that we have used u rather than x to denote the iterations since this is
more natural in this context.
To analyse the convergence, we need to determine the iteration matrices. Un-
fortunately, the matrix A is not strictly row diagonally dominant but only weak-
ly diagonally dominant and irreducible. Nonetheless, we can have a look at the
iteration matrices and their  · ∞ norms.

08:42:03, subject to the Cambridge


05
114 Iterative Methods for Solving Linear Systems

For the Jacobi method the iteration matrix is given by


⎛1 ⎞⎛ ⎞
⎜⎜⎜ 2 ⎟⎟⎟ ⎜⎜⎜0 1 ⎟⎟⎟
⎜⎜⎜⎜ 1 ⎟⎟⎟⎟ ⎜⎜⎜⎜1 0 1
⎟⎟⎟
⎟⎟⎟
⎜⎜⎜ 2 ⎟⎟⎟ ⎜⎜⎜ ⎟⎟⎟
⎜ .. ⎟⎟⎟ ⎜⎜⎜ .. .. ..
C J = −D−1 (L + R) = ⎜⎜⎜⎜ . ⎟⎟⎟ ⎜⎜⎜ . . . ⎟⎟⎟
⎜⎜⎜⎜ ⎟ ⎜
⎟⎟⎟⎟ ⎜⎜⎜⎜
⎟⎟⎟
⎜⎜⎜ 1
⎟⎠ ⎜⎝ 1 0 1⎟⎟⎟⎟
⎝ 2
1 ⎠
2 1 0
⎛ ⎞
⎜⎜⎜ 0 1
⎟⎟⎟
⎜⎜⎜ 1 2
1 ⎟⎟⎟
⎜⎜⎜ 2 0 2 ⎟⎟⎟
⎜⎜ .. .. .. ⎟⎟⎟
= ⎜⎜⎜⎜ . . . ⎟⎟⎟ , (4.12)
⎜⎜⎜ ⎟⎟
⎜⎜⎜ 1 1⎟ ⎟⎟
⎜⎝ 2 0 2⎟ ⎠⎟
1
2 0
showing that C J ∞ = 1, as expected, so that we cannot conclude convergence
from this. The situation is slightly better when it comes to the Gauss–Seidel
method. Here, we first have to compute
⎛ ⎞−1 ⎛ 1 ⎞
⎜⎜⎜ 2 ⎟⎟⎟ ⎜⎜⎜ 2 0 ⎟⎟
⎟⎟⎟
⎜⎜⎜⎜−1 2 ⎟⎟⎟⎟ ⎜⎜⎜⎜ 1 1
⎟⎟⎟⎟
⎜⎜⎜ ⎟⎟⎟ ⎜⎜⎜ 4 2
⎟⎟⎟
−1 ⎜⎜⎜ −1 2 ⎟⎟⎟ ⎜⎜⎜ 1 1 1
⎟⎟⎟ .
(L + D) = ⎜⎜ ⎟⎟⎟ = ⎜⎜⎜ 8 4 2
⎜⎜⎜ .. .. ⎟⎟⎟ ⎜⎜⎜ .. .. .. .. ⎟⎟⎟
⎜⎜⎜ . . ⎟
⎟ ⎜
⎜ . . . . ⎟⎟⎟
⎝⎜ ⎠⎟ ⎝⎜ 1 1⎠

0 −1 2 2n
1
2n−1
... 4 2
1

Note that there is a fill in when computing the inverse. We cannot expect that
the inverse of a sparse matrix is sparse again. Here, however, we know that the
inverse at least has to be a lower triangular matrix.
With this, we can compute the iteration matrix of the Gauss–Seidel method:
⎛ ⎞
⎜⎜⎜0 12 ⎟⎟⎟
⎜⎜⎜ ⎟⎟⎟
⎜⎜⎜0 14 1
⎟⎟⎟
⎜⎜

2
. ⎟⎟⎟⎟
CGS = −(L + D)−1 R = ⎜⎜⎜⎜⎜0 18 1 .. ⎟⎟⎟ , (4.13)
⎜⎜⎜ .
4
. .. ⎟⎟⎟⎟
⎜⎜⎜0 .. .. . 2 ⎟⎟⎟
1 ⎟
⎜⎜⎝ ⎟⎠
0 2n 2n−1 . . . 14
1 1

since multiplying by this R from the right means shifting all columns to the
right and filling the first column up with zeros. This gives
1  −i 1
n−2
1 1 1
CGS ∞ = + + ··· + n = 2 = (1 − 2−n+1 ) < 1. (4.14)
4 8 2 4 i=0 2
This shows that the Gauss–Seidel method converges for our problem. How-
ever, the bound on the right-hand side becomes arbitrarily close to 1 if n tends

08:42:03, subject to the Cambridge


05
4.3 The Jacobi and Gauss–Seidel Iterations 115

to infinity. Hence, we might expect that convergence slows down for larger n,
which is quite unfortunate since a larger n already means a higher computa-
tional cost per iteration.
To determine the convergence of the Jacobi method, we need to compute the
spectral radius and hence the eigenvalues of the iteration matrices. Since the
iteration matrix is given by C J = I − D−1 A we see that every eigenvector of A
is also an eigenvector of C J . To be more precise, if Ax = λx, then
λ  λ
C J x = x − D−1 Ax = x − λD−1 x = x − x = 1 − x.
2 2
Hence, it suffices to determine the eigenvalues of A.

Remark 4.17 The matrix A ∈ Rn×n defined by (4.9) has the eigenvalues
λ j = 2(1 − cos θ j ) = 4 sin2 (θ j /2) with corresponding eigenvectors

x j = (sin θ j , sin(2θ j ), . . . , sin(nθ j ))T

for 1 ≤ j ≤ n with θ j = jπ/(n + 1).

Proof To verify this, it is helpful to recall the identity

sin(α + β) + sin(α − β) = 2 sin α cos β (4.15)

for the trigonometric functions. Now, using the above-defined quantities yields



⎪ 2 sin θ j − sin(2θ j ) for i = 1,



(Ax j )i = ⎪
⎪ −sin((i − 1)θ j ) + 2 sin(iθ j ) − sin((i + 1)θ j ) for 2 ≤ i ≤ n − 1,



⎩−sin((n − 1)θ ) + 2 sin(nθ ) for i = n.
j j

Taking sin(0θ j ) = 0 and sin(n + 1)θ j = sin( jπ) = 0 into account, this can
uniformly be written as

(Ax j )i = 2 sin(iθ j ) − [sin(i − 1)θ j + sin(i + 1)θ j ], 1 ≤ i ≤ n.

Using (4.15) with α = iθ j and β = θ j then gives

(Ax j )i = 2 sin(iθ j ) − 2 sin(iθ j ) cos θ j = λ j sin(iθ j ) = λ j (x j )i . 

From this remark and the above we can conclude the following.

Remark 4.18 The iteration matrix C J , given in (4.12), of the Jacobi method
has the eigenvalues
 jπ 
λj jπ
μj = 1 − = 1 − 2 sin 2
= cos , 1 ≤ j ≤ n.
2 2(n + 1) n+1
The eigenvectors are the same as in Remark 4.17.

08:42:03, subject to the Cambridge


05
116 Iterative Methods for Solving Linear Systems

Note that in both remarks we have used the identity cos(2α) = 1 − 2 sin2 α.
Taking the graph of the cos into account, we see that
 π 
ρ(C J ) = cos < 1. (4.16)
n+1
Hence, as expected, the Jacobi method converges. However, a Taylor expan-
sion of cos reveals
π2 π4
ρ(C J ) = 1 − + − ··· ,
2(n + 1) 2 4!(n + 1)4
which means that the convergence also slows down for larger n. If we com-
pare this with (4.14), it seems to be better and hence convergence of the Jacobi
method should be faster. But this is not true since in (4.14) we have only looked
at an upper bound for ρ(CGS ). To have a real comparison of both methods we
must compute ρ(CGS ), as well. This is slightly more complicated than comput-
ing ρ(C J ) , but the following can be shown.
Remark 4.19 The iteration matrix CGS given in (4.13) has the eigenvalues
λ j = cos2 θ j with corresponding eigenvectors
x j = (cos(θ j ) sin(θ j ), cos2 (θ j ) sin(2θ j ), . . . , cosn (θ j ) sin(nθ j ))T ,
where θ j = jπ/(n + 1).
This means that we have the relation λ j = μ2j between the eigenvalues λ j of
CGS and μ j of C J , leading also to ρ(CGS ) = ρ(C J )2 . Hence, the Gauss–Seidel
method converges faster than the Jacobi method.
The above relation between the spectral radius of the Gauss–Seidel method
and the spectral radius of the Jacobi method is always given if the matrix A is
consistently ordered. We will discuss this in more detail at the end of the next
section.

4.4 Relaxation
A further improvement of both methods can sometimes be achieved by relax-
ation. We start by looking at the Jacobi method. Here, the iterations can be
written as
x( j+1) = D−1 b − D−1 (L + R)x( j) = x( j) + D−1 b − D−1 (L + R + D)x( j)
= x( j) + D−1 (b − Ax( j) ).
The latter equality shows that the new iteration x( j+1) is given by the old iter-
ation x( j) corrected by the D−1 -multiple of the residual b − Ax( j) . In practice,

08:42:03, subject to the Cambridge


05
4.4 Relaxation 117

one often notices that the correction term is off the correct correction term by a
fixed factor. Hence, it makes sense to introduce a relaxation parameter ω and
to form the new iteration rather as

x( j+1) = x( j) + ωD−1 (b − Ax( j) ), (4.17)

which is a consistent splitting with B = B(ω) = 1


ω D.

Definition 4.20 The Jacobi relaxation is given by


⎛ ⎞
ω ⎜⎜⎜⎜ n ⎟
( j) ⎟
( j+1)
xi ( j)
= xi + ⎜⎝bi − aik xk ⎟⎟⎟⎠ , 1 ≤ i ≤ n.
aii k=1

In algorithmic form this is detailed in Algorithm 9. Similarly to Algorithm 7,


we use two vectors to realise the iteration. Obviously, to compute one iteration
requires O(n2 ) time for a full matrix A.

Algorithm 9: Jacobi relaxation


Input : A ∈ Rn×n invertible, b ∈ Rn , x(0) ∈ Rn , ω ∈ R.
Output: Approximate solution of Ax = b.
1 Compute r := b − Ax(0) 
2 while r ≥ do
3 for i = 1 to n do
4 yi := bi
5 for k = 1 to n do
6 yi := yi − aik xk
7 yi := xi + ω · yi /aii
8 x := y
9 r := b − Ax

Of course, the relaxation parameter should be chosen such that the convergence
improves compared with the original Jacobi method. It follows from

x( j+1) = x( j) + ωD−1 b − ωD−1 (L + D + R)x( j)


= [(1 − ω)I − ωD−1 (L + R)]x( j) + ωD−1 b (4.18)

that the iteration matrix is given by

C J (ω) = [(1 − ω)I − ωD−1 (L + R)] = (1 − ω)I + ωC J , (4.19)

which shows that C J (1) = C J corresponds to the classical Jacobi method.

08:42:03, subject to the Cambridge


05
118 Iterative Methods for Solving Linear Systems

f1/5

λ1
λn
−3 −2 −1 1

fω ∗
f1

Figure 4.1 Determination of the relaxation parameter.

Theorem 4.21 Assume that C J = −D−1 (L+R) has only real eigenvalues λ1 ≤
λ2 ≤ · · · ≤ λn < 1 with corresponding eigenvectors z1 , . . . , zn . Then, C J (ω) has
the same eigenvectors z1 , . . . , zn , but with eigenvalues μ j = 1 − ω + ωλ j for
1 ≤ j ≤ n. The spectral radius of C J (ω) is minimised by choosing
2
ω∗ = . (4.20)
2 − λ1 − λ n
In the case of λ1  −λn relaxation converges faster than the Jacobi method.
Proof For every eigenvector z j of C J it follows that
C J (ω)z j = (1 − ω)z j + ωλ j z j = (1 − ω + ωλ j )z j ,
i.e. z j is an eigenvector of C J (ω) for the eigenvalue 1−ω+ωλ j =: μ j (ω). Thus,
the spectral radius of C J (ω) is given by
ρ(C J (ω)) = max |μ j (ω)| = max |1 − ω + ωλ j |,
1≤ j≤n 1≤ j≤n

which should be minimised. For a fixed ω let us have a look at the function
fω (λ) := 1 − ω + ωλ, which is, as a function of λ, a straight line with fω (1) = 1.
For different choices of ω we thus have a collection of such lines (see Figure
4.1) and it follows that the maximum in the definition of ρ(C J (ω)) can only
be attained for the indices j = 1 and j = n. Moreover, it follows that ω is
optimally chosen if fω (λ1 ) = − fω (λn ) or
1 − ω + ωλ1 = −(1 − ω + ωλn ).
This gives (4.20). Finally, we have the Jacobi method if and only if ω∗ = 1,
which is equivalent to λ1 = −λn . 

08:42:03, subject to the Cambridge


05
4.4 Relaxation 119

Example 4.22 Let us have another look at the second example from the in-
troduction. We were concerned with solving a boundary value problem with
finite differences. The resulting system has been stated in Section 1.1.2. In the
last section, we have seen that the iteration matrix of the Jacobi method has
eigenvalues
 jπ 
μ j = cos , 1 ≤ j ≤ n.
n+1
Hence, the biggest eigenvalue is μ1 = cos(π/(n+1)) and the smallest eigenvalue
is given by μn = cos(nπ/(n + 1)).
This means that the relaxation of the Jacobi method has an iteration matrix
with eigenvalues
 jπ 
μ j (ω) = 1 − ω + ω cos .
n+1
However, since we have μn = −μ1 , Theorem 4.21 tells us that the optimal
relaxation parameter is ω∗ = 1 and hence the classical Jacobi method cannot
be sped up using relaxation.
An alternative interpretation of the Jacobi relaxation can be derived from
x( j+1) = (1 − ω)x( j) + ωC J x( j) + ωD−1 b
= (1 − ω)x( j) + ω(C J x( j) + D−1 b).
Hence, if we define z( j+1) = C J x( j) + D−1 b, which is one step of the classical
Jacobi method, the next iteration of the Jacobi relaxation method is
x( j+1) = (1 − ω)x( j) + ωz( j+1) ,
which is a linear interpolation between the old iteration and the new Jacobi
iteration.
This idea can be used to introduce relaxation for the Gauss–Seidel method, as
well. We start by looking at Dx( j+1) = b − Lx( j+1) − Rx( j+1) and replace the
iteration on the left-hand side by z( j+1) and then use linear interpolation again.
Hence, we set
Dz( j+1) = b − Lx( j+1) − Rx( j+1) ,
x( j+1) = (1 − ω)x( j) + ωz( j+1) .
Multiplying the second equation by D and inserting the first one yields
Dx( j+1) = (1 − ω)Dx( j) + ωb − ωLx( j+1) − ωRx( j)
and hence
(D + ωL)x( j+1) = [(1 − ω)D − ωR] x( j) + ωb.

08:42:03, subject to the Cambridge


05
120 Iterative Methods for Solving Linear Systems

Thus, the iteration matrix of the relaxed Gauss–Seidel method is given by

CGS (ω) = (D + ωL)−1 [(1 − ω)D − ωR]. (4.21)

We can rewrite this component-wise.

Definition 4.23 The Gauss–Seidel Relaxation or Successive Over-relaxation


(SOR) method is given by
⎛ ⎞
ω ⎜⎜⎜⎜   ⎟⎟
( j) ⎟
i−1 n
( j+1)
xi ( j)
= xi + ⎜⎜⎝bi − ( j+1)
aik xk − aik xk ⎟⎟⎟⎠ , 1 ≤ i ≤ n.
aii k=1 k=i

The algorithmic description is given in Algorithm 10.

Algorithm 10: SOR


Input : A ∈ Rn×n invertible, b ∈ Rn , x(0) ∈ Rn , ω ∈ R.
Output: Approximate solution of Ax = b.
1 Compute r := b − Ax(0) 
2 while r ≥ do
3 for i = 1 to n do
4 xi := bi
5 for k = 1 to n do
6 xi := xi − aik xk
7 xi := xi + ω · xi /aii
8 r := b − Ax

Again, we have to deal with the question of how to choose the relaxation pa-
rameter. The next result shows that we do not have much of a choice.

Theorem 4.24 The spectral radius of the iteration matrix CGS (ω) of SOR
satisfies
ρ(CGS (ω)) ≥ |ω − 1|.

Hence, convergence is only possible if ω ∈ (0, 2).

Proof The iteration matrix CGS (ω) can be written in the form

CGS (ω) = (I + ωD−1 L)−1 [(1 − ω)I − ωD−1 R].

The first matrix in this product is a normalised lower triangular matrix and the
second matrix is an upper triangular matrix with diagonal entries all equal to

08:42:03, subject to the Cambridge


05
4.4 Relaxation 121

1 − ω. Since the determinant of a matrix equals the product of its eigenvalues,


the result follows from
|1 − ω|n = | det CGS (ω)| ≤ ρ(CGS (ω))n . 
We will now show that for a positive definite matrix, ω ∈ (0, 2) is also sufficient
for convergence. Since ω = 1 gives the classical Gauss–Seidel method, we also
cover Theorem 4.15. The proof of the following theorem also shows that the
method is always consistent.
Theorem 4.25 Let A ∈ Rn×n be symmetric and positive definite. Then, the
SOR method converges for every relaxation parameter ω ∈ (0, 2).
Proof We have to show that ρ(CGS (ω)) < 1. To this end, we rewrite the
iteration matrix CGS (ω) in the form
CGS (ω) = (D + ωL)−1 [D + ωL − ω(L + D + R)]
−1
1
= I − ω(D + ωL)−1 A = I − D+L A
ω
= I − B−1 A,
with B := ω1 D + L. Let λ ∈ C be an eigenvalue of CGS (ω) with corresponding
eigenvector x ∈ Cn , which we assume to be normalised by x2 = 1. Then,
we have CGS (ω)x = (I − B−1 A)x = λx or Ax = (1 − λ)Bx. Since A is positive
definite, we must have λ  1. Hence, we can conclude
1 xT Bx
= T .
1 − λ x Ax
Since A is symmetric, we can also conclude that B + BT = ( ω2 − 1)D + A, such
that the real part of 1/(1 − λ) satisfies
⎧ T ⎫
1 1 xT (B + BT )x 1 ⎪⎪
⎨ 2 x Dx ⎪

⎬ 1
 = = ⎪
⎪ − 1 + 1 ⎪ > 2,

1−λ 2 T
x Ax ⎩
2 ω T
x Ax ⎭

because, on account of ω ∈ (0, 2), the expression 2/ω − 1 is positive, as also is


xT Dx/xT Ax. The latter follows since the diagonal entries of a positive definite
matrix have to be positive. If we write λ = u + iv then we can conclude that

1 1 1−u
< =
2 1−λ (1 − u)2 + v2
and hence |λ|2 = u2 + v2 < 1. 
To find an optimal parameter ω∗ is in general difficult. However, the situation
is better if the matrix is consistently ordered. We have mentioned the term

08:42:03, subject to the Cambridge


05
122 Iterative Methods for Solving Linear Systems

consistently ordered before, namely in the situation following Remark 4.19.


There, we have seen that in the specific situation of our boundary value model
problem from Section 1.1.2, we have the relation ρ(C J )2 = ρ(CGS ) between
the spectral radii of the Jacobi and Gauss–Seidel methods. We pointed out that
this always holds when A is consistently ordered. Hence, it is now time to make
this more precise.

Definition 4.26 The matrix A = L + D + R ∈ Rn×n is called consistently


ordered if the eigenvalues of the matrix

C(α) := −(αD−1 L + α−1 D−1 R) = −D−1 (αL + α−1 R)

are independent of the parameter α.

Note that we require the diagonal elements of A to be non-zero to make this


definition meaningful. This condition has already been imposed throughout
this entire section. Note also that all the matrices C(α) have the same eigenval-
ues as the matrix
C(1) = −D−1 (L + R) = C J ,

i.e. as the iteration matrix of the Jacobi method, provided that A is consistently
ordered.

Example 4.27 Any tridiagonal matrix


⎛ ⎞
⎜⎜⎜a1 c1 0 ⎟⎟⎟
⎜⎜⎜ ⎟⎟⎟
⎜⎜⎜b2 a2 c2 ⎟⎟⎟
⎜⎜⎜ .. .. .. ⎟⎟⎟
A = ⎜⎜⎜⎜⎜ . . . ⎟⎟⎟ ∈ Rn×n
⎟⎟⎟
⎜⎜⎜ . .. ⎟⎟
⎜⎜⎜ .. . cn−1 ⎟⎟⎟⎟⎟
⎜⎜⎝ ⎠
0 bn an
with ai  0 for 1 ≤ i ≤ n is consistently ordered. To see this, note that
⎛ ⎞
⎜⎜⎜ 0 r1 0 ⎟⎟⎟
⎜⎜⎜ ⎟⎟⎟
⎜⎜⎜2 0 r2 ⎟⎟⎟
⎜⎜⎜⎜ .. .. .. ⎟⎟⎟⎟
C(1) = −D (L + R) = ⎜⎜⎜⎜
−1 . . . ⎟⎟⎟
⎟⎟⎟
⎜⎜⎜
⎜⎜⎜ . . . . . r ⎟⎟⎟⎟
.
⎜⎜⎝ n−1 ⎟⎟⎠
0 n 0
with ri = −ci /ai and i = −bi /ai . Hence, if we define a diagonal matrix S (α) =
diag(1, α, α2 , . . . , αn−1 ) with α  0, we see that C(α) = S (α)C(1)S −1 (α), which
has the same eigenvalues as C(1).

08:42:03, subject to the Cambridge


05
4.4 Relaxation 123

The following theorem shows that there is an intrinsic relation between the
eigenvalues of the iteration matrices of the SOR method and the Jacobi method.
It also shows that, for consistently ordered matrices, the Gauss–Seidel method
converges faster than the Jacobi method, as long as the Jacobi method con-
verges.

Theorem 4.28 Let A ∈ Rn×n be consistently ordered. Let CGS (ω), ω ∈ (0, 2),
denote the iteration matrix of the SOR method and C J denote the iteration
matrix of the Jacobi method. Then, μ ∈ C \ {0} is an eigenvalue of CGS (ω) if
and only if
μ+ω−1
λ= √
ω μ
is an eigenvalue of C J . In particular, we have

ρ(CGS ) = ρ(C j )2 .

Proof By (4.21) the iteration matrix of SOR is given by

CGS (ω) = (D + ωL)−1 [(1 − ω)D − ωR].

For μ ∈ C \ {0} we thus have

(I + ωD−1 L)(μI − CGS (ω)) = μ(I + ωD−1 L) − (I + ωD−1 L)CGS (ω)


= μ(I + ωD−1 L) − D−1 (D + ωL)CGS (ω)
= μ(I + ωD−1 L) − D−1 [(1 − ω)D − ωR]
= (μ − (1 − ω))I + ωD−1 (μL + R)
 
= (μ − (1 − ω))I + ωμ1/2 D−1 μ1/2 L + μ−1/2 R .

Since the determinant of I + ωD−1 L is one, we can conclude from this that
μ  0 is an eigenvalue of CGS (ω) if and only if
 !
det (μ − (1 − ω))I + ωμ1/2 D−1 μ1/2 L + μ−1/2 R = 0,

i.e. if
\[
\lambda := \frac{\mu - (1-\omega)}{\omega\mu^{1/2}}
\]
 
is an eigenvalue of −D−1 μ1/2 L + μ−1/2 R . But since A is consistently ordered
the eigenvalues of the latter matrix coincide with the eigenvalues of C J .
Finally, we have for Gauss–Seidel CGS = CGS (1), i.e. ω = 1 and hence μ  0

is an eigenvalue of C_{GS} if and only if √μ is an eigenvalue of C_J. Thus we have
ρ(C_{GS}) = ρ(C_J)^2. □


If the matrix A is consistently ordered then it is even possible to show that the
optimal parameter is given by
\[
\omega^* = \frac{2}{1+\sqrt{1-\rho(C_J)^2}},
\]
where C J is the iteration matrix of the Jacobi method, for details see Young
[140]. Note that this means that ω∗ ∈ (1, 2), justifying the name successive
over relaxation.
Usually, ρ(C J ) is unknown. A possible remedy to this is to use one of the
methods to compute the largest eigenvalue, which we will discuss in the next
chapter, particularly the so-called power method to compute an approximation
to ρ(C J ).
Example 4.29 Coming back to our boundary value problem from Section
1.1.2, we know from Remark 4.18 and (4.16) that ρ(C_J) = cos(π/(n+1)) = cos(πh)
with h = 1/(n + 1). A Taylor expansion then yields
\[
\omega^* = \frac{2}{1+\sqrt{1-\cos^2(\pi h)}} = \frac{2}{1+\sin(\pi h)}
= 2 - 2\pi h + 2(\pi h)^2 - \frac{5}{3}(\pi h)^3 + \frac{4}{3}(\pi h)^4 + O(h^5)
= 2 - \frac{2\pi}{n+1} + O\Bigl(\frac{1}{(n+1)^2}\Bigr).
\]
Note that this comes arbitrarily close to 2 with n becoming large. Nonetheless,
because of Theorem 4.24 we cannot simply choose ω = 2.
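As a small illustration of these formulas, the following Python sketch (purely illustrative, with NumPy assumed) evaluates ω* for the model problem and compares it with the leading-order expansion.

import numpy as np

# optimal SOR parameter for the 1D model problem, where rho(C_J) = cos(pi*h)
# with h = 1/(n+1), compared with the leading term 2 - 2*pi/(n+1)
for n in (10, 100, 1000):
    h = 1.0 / (n + 1)
    omega_exact = 2.0 / (1.0 + np.sqrt(1.0 - np.cos(np.pi * h) ** 2))
    omega_asymp = 2.0 - 2.0 * np.pi / (n + 1)
    print(n, omega_exact, omega_asymp)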
Example 4.30 Let us stick with our second model problem from Section
1.1.2. We already know from (4.10) and (4.11) what the Jacobi and Gauss–
Seidel methods look like. It is similarly easy to see that the relaxed versions
become
\[
u_i^{(j+1)} = (1-\omega)u_i^{(j)} + \frac{\omega}{2}h^2 f_i + \frac{\omega}{2}
\begin{cases}
u_2^{(j)} & \text{if } i = 1,\\
u_{i-1}^{(j)} + u_{i+1}^{(j)} & \text{if } 2 \le i \le n-1,\\
u_{n-1}^{(j)} & \text{if } i = n
\end{cases}
\]
for the relaxation of the Jacobi iteration and
\[
u_i^{(j+1)} = (1-\omega)u_i^{(j)} + \frac{\omega}{2}h^2 f_i + \frac{\omega}{2}
\begin{cases}
u_2^{(j)} & \text{if } i = 1,\\
u_{i-1}^{(j+1)} + u_{i+1}^{(j)} & \text{if } 2 \le i \le n-1,\\
u_{n-1}^{(j+1)} & \text{if } i = n
\end{cases}
\]
for the SOR method.


Let us now assume that we want to solve the original problem −u′′ = f on (0, 1)


with u(0) = u(1) = 0 and now the right-hand side f (x) = sin(πx). Then, the true
solution is given by u(x) = (1/π^2) sin(πx). Let us run our simulations with n = 10.
Since we know that in this situation relaxation of the Jacobi method does not
improve convergence, we restrict ourselves to the Jacobi, Gauss–Seidel and
SOR methods. For the latter, we know that the best relaxation parameter is,
with h = 1/(n + 1), given by
\[
\omega^* = \frac{2}{1+\sin(\pi h)} \approx 1.5603879\ldots.
\]

In Figure 4.2 we have plotted the residual ‖Ax^{(j)} − b‖_2 for these three methods.
Clearly, as predicted, the Gauss–Seidel method converges faster than the Jacobi
method, but the SOR method is by far the fastest method. Nonetheless, each
method needs significantly more than n = 10 steps to reach machine precision.


Figure 4.2 The convergence of the Jacobi, Gauss–Seidel (dashed) and SOR (dot-
ted) methods. The number of iterations is on the x-axis, while the residual is on
the y-axis.
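The experiment behind Figure 4.2 is easy to reproduce. The following Python sketch (NumPy assumed; the number of iterations and the zero starting vector are illustrative choices) sets up the tridiagonal system with diagonal 2 and off-diagonal −1 and right-hand side h^2 f_i, consistent with the update formulas above, and runs the three splitting iterations.

import numpy as np

# Jacobi, Gauss-Seidel and SOR for the model problem -u'' = sin(pi x)
# on (0,1) with zero boundary values, discretised with n = 10 points
n = 10
h = 1.0 / (n + 1)
x = np.linspace(h, 1.0 - h, n)
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = h ** 2 * np.sin(np.pi * x)

D = np.diag(np.diag(A))
L = np.tril(A, -1)
omega = 2.0 / (1.0 + np.sin(np.pi * h))   # optimal relaxation parameter

def iterate(B_mat, steps=90):
    # splitting iteration u <- u + B^{-1} (b - A u); returns residual norms
    u = np.zeros(n)
    res = []
    for _ in range(steps):
        u = u + np.linalg.solve(B_mat, b - A @ u)
        res.append(np.linalg.norm(b - A @ u))
    return res

res_jacobi = iterate(D)                  # B = D
res_gs = iterate(D + L)                  # B = D + L
res_sor = iterate(D / omega + L)         # B = D/omega + L
print(res_jacobi[-1], res_gs[-1], res_sor[-1])   # SOR is by far the smallest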

4.5 Symmetric Methods


We started discussing iterative methods for linear systems by using the iteration

x( j+1) = B−1 (B − A)x( j) + B−1 b = x( j) + B−1 (b − Ax( j) ) (4.22)


and our goal was to choose the matrix B as an easy computable approximation
to the inverse of A. If the initial matrix A is symmetric and positive definite, it
would hence be good to choose B also as symmetric and positive definite.

Definition 4.31 A splitting iteration of the form (4.22) is called symmetric


if for a symmetric (and positive definite) matrix A ∈ Rn×n the matrix B is also
symmetric (and positive definite).

This will become even more important later on. Obviously, the matrix B = D
of the Jacobi method is symmetric but the matrix B = L + D of the Gauss–
Seidel method is not, unless the strict lower part L of A is completely zero.
Both statements are also true for the relaxed versions of the Jacobi and Gauss–
Seidel method.
It is now our goal to derive a symmetric version of the Gauss–Seidel method.
We recall that the Gauss–Seidel iteration is given by
x( j+1) = −(D + L)−1 Rx( j) + (D + L)−1 b =: CGS x( j) + cGS , (4.23)
if A = L + D + R is the usual decomposition of A into its strictly sub-diagonal
part L, its diagonal part D and its strictly upper-diagonal part R. We could,
however, also define a backward Gauss–Seidel method by iterating according
to
x( j+1) = −(D + R)−1 Lx( j) + (D + R)−1 b =: C BGS x( j) + cBGS . (4.24)
This does also not lead to a symmetric method but we could combine the two
steps into one step. Hence, if we first do a “half step” according to (4.23) and
then another “half step” according to (4.24), we have the iteration
!
x( j+1) = C BGS CGS x( j) + cGS + cBGS
= C BGS CGS x( j) + (C BGS cGS + cBGS )
= CS GS x( j) + cS GS
with the iteration matrix CS GS and vector cS GS given by
CS GS := C BGS CGS = (D + R)−1 L(D + L)−1 R,
cS GS := C BGS cGS + cBGS
= −(D + R)−1 L(D + L)−1 b + (D + R)−1 b
!
= (D + R)−1 I − L(D + L)−1 b
= (D + R)−1 D(D + L)−1 b.
To see that this is indeed a symmetric method, we have to rewrite this iteration
in the form of (4.22), i.e. we need to find the associated matrix B = BS GS .


Before that, we will introduce the relaxation of this method. It is simply de-
fined to be the just-introduced combination of the relaxed Gauss–Seidel and
backward Gauss–Seidel methods. The latter is relaxed in exactly the same way
as the Gauss–Seidel method is. Hence, the relevant matrices and vectors are
given as
\[
\begin{aligned}
C_{GS}(\omega) &= (D+\omega L)^{-1}[(1-\omega)D - \omega R]
= \Bigl(\frac{1}{\omega}D + L\Bigr)^{-1}\Bigl[\Bigl(\frac{1}{\omega}-1\Bigr)D - R\Bigr],\\
C_{BGS}(\omega) &= (D+\omega R)^{-1}[(1-\omega)D - \omega L]
= \Bigl(\frac{1}{\omega}D + R\Bigr)^{-1}\Bigl[\Bigl(\frac{1}{\omega}-1\Bigr)D - L\Bigr],\\
c_{GS}(\omega) &= (D+\omega L)^{-1}\omega b = \Bigl(\frac{1}{\omega}D + L\Bigr)^{-1}b,\\
c_{BGS}(\omega) &= (D+\omega R)^{-1}\omega b = \Bigl(\frac{1}{\omega}D + R\Bigr)^{-1}b.
\end{aligned}
\]

Definition 4.32 The symmetric Gauss–Seidel relaxation for solving the linear
system Ax = b is given by

x( j+1) = CS GS (ω)x( j) + cS GS (ω),

where

CS GS (ω) := C BGS (ω)CGS (ω),


cS GS (ω) := C BGS (ω)cGS (ω) + cBGS (ω).

In the case of a symmetric matrix A, the method is also called the symmetric
successive over-relaxation (SSOR) method.
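Written componentwise, one SSOR step is simply a forward SOR sweep followed by a backward SOR sweep. The following Python sketch (NumPy assumed; matrix, right-hand side and relaxation parameter are illustrative) implements one such step.

import numpy as np

# one SSOR step: a forward SOR half step followed by a backward one
def ssor_step(A, b, x, omega):
    n = len(b)
    x = x.copy()
    for i in range(n):            # forward (SOR) half step
        s = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
        x[i] = (1 - omega) * x[i] + omega * (b[i] - s) / A[i, i]
    for i in reversed(range(n)):  # backward (SOR) half step
        s = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
        x[i] = (1 - omega) * x[i] + omega * (b[i] - s) / A[i, i]
    return x

A = np.array([[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])
x = np.zeros(3)
for _ in range(25):
    x = ssor_step(A, b, x, omega=1.2)
print(x, np.linalg.norm(b - A @ x))   # residual is small after a few sweeps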

Obviously, the symmetric Gauss–Seidel method makes mainly sense for sym-
metric matrices. In this case, the notation further simplifies since we have
R = LT . Nonetheless, for the time being we will stick to the general notation.
The first step in understanding this method is to show that it is indeed an it-
erative method of the form (4.22), i.e. we have to find the defining matrix
B = BS GS (ω) and to show that CS GS (ω) = I − B−1 S GS (ω)A. To find BS GS
we have to investigate the iteration vector c_{SGS}(ω), which has to be given by
c_{SGS}(ω) = B_{SGS}^{-1}(ω)b. With
\[
B_{GS}(\omega) = \frac{1}{\omega}D + L \qquad\text{and}\qquad B_{BGS}(\omega) = \frac{1}{\omega}D + R
\]
we obviously have
\[
C_{GS}(\omega) = I - B_{GS}^{-1}(\omega)A \qquad\text{and}\qquad C_{BGS}(\omega) = I - B_{BGS}^{-1}(\omega)A,
\]


which is easily verified and shows again that both methods are of the form
(4.22). Furthermore, we have from the definition of c_{SGS}(ω) that
\[
c_{SGS}(\omega) = C_{BGS}(\omega)c_{GS}(\omega) + c_{BGS}(\omega)
= \bigl[C_{BGS}(\omega)B_{GS}^{-1}(\omega) + B_{BGS}^{-1}(\omega)\bigr]b,
\]

which implies
\[
B_{SGS}^{-1}(\omega) = C_{BGS}(\omega)B_{GS}^{-1}(\omega) + B_{BGS}^{-1}(\omega),
\]

as long as the matrix on the right-hand side is invertible. In this situation, we


can conclude that the symmetric Gauss–Seidel method is indeed a splitting
scheme of the form (4.22) since
\[
\begin{aligned}
I - B_{SGS}^{-1}(\omega)A &= I - \bigl(C_{BGS}(\omega)B_{GS}^{-1}(\omega) + B_{BGS}^{-1}(\omega)\bigr)A\\
&= I - B_{BGS}^{-1}(\omega)A - C_{BGS}(\omega)B_{GS}^{-1}(\omega)A\\
&= C_{BGS}(\omega) - C_{BGS}(\omega)\bigl(I - C_{GS}(\omega)\bigr)\\
&= C_{BGS}(\omega)C_{GS}(\omega) = C_{SGS}(\omega).
\end{aligned}
\]

Finally, we have to verify that all the previous operations are justified. To this
end, we note that we have for ω ≠ 2 that
\[
\begin{aligned}
c_{SGS}(\omega) &= C_{BGS}(\omega)c_{GS}(\omega) + c_{BGS}(\omega)\\
&= B_{BGS}^{-1}(\omega)\Bigl[\Bigl(\frac{1}{\omega}-1\Bigr)D - L\Bigr]B_{GS}^{-1}(\omega)b + B_{BGS}^{-1}(\omega)b\\
&= \Bigl(\frac{1}{\omega}D + R\Bigr)^{-1}\Bigl[\Bigl(\Bigl(\frac{1}{\omega}-1\Bigr)D - L\Bigr)\Bigl(\frac{1}{\omega}D + L\Bigr)^{-1} + I\Bigr]b\\
&= \Bigl(\frac{1}{\omega}D + R\Bigr)^{-1}\Bigl[\Bigl(\frac{1}{\omega}D - D - L\Bigr)\Bigl(\frac{1}{\omega}D + L\Bigr)^{-1}
   + \Bigl(\frac{1}{\omega}D + L\Bigr)\Bigl(\frac{1}{\omega}D + L\Bigr)^{-1}\Bigr]b\\
&= \Bigl(\frac{1}{\omega}D + R\Bigr)^{-1}\Bigl(\frac{1}{\omega}D - D - L + \frac{1}{\omega}D + L\Bigr)\Bigl(\frac{1}{\omega}D + L\Bigr)^{-1}b\\
&= \Bigl(\frac{2}{\omega}-1\Bigr)\Bigl(\frac{1}{\omega}D + R\Bigr)^{-1}D\Bigl(\frac{1}{\omega}D + L\Bigr)^{-1}b,
\end{aligned}
\]
from which we can conclude that
\[
B_{SGS}(\omega) = \frac{\omega}{2-\omega}\Bigl(\frac{1}{\omega}D + L\Bigr)D^{-1}\Bigl(\frac{1}{\omega}D + R\Bigr)
= \frac{\omega}{2-\omega}B_{GS}(\omega)D^{-1}B_{BGS}(\omega). \tag{4.25}
\]
This has several further consequences, which we will collect now.

Proposition 4.33 Let A ∈ Rn×n be symmetric with positive diagonal entries.


For each ω ∈ (0, 2), the SSOR method is well-defined and a symmetric splitting
method.

Proof As we have already seen, the matrix BS GS (ω) is well-defined for ω ∈


(0, 2). Furthermore, since A is symmetric, we have BGS (ω) = BTBGS (ω). Hence,
(4.25) implies that BS GS (ω) is symmetric. Since D is a diagonal matrix with
positive entries, (4.25) also implies that BS GS (ω) is positive definite. 

We finally want to prove convergence of the SSOR method. To this end, it


is helpful to recall the following context. If the matrix A is positive definite
and symmetric it possesses a square root A1/2 which is also positive definite
and symmetric and satisfies A = A1/2 A1/2 . According to Example 2.29, the
induced matrix norm of the vector norm x → A1/2 x2 is given by CA1/2 =
A1/2CA−1/2 2 .

Theorem 4.34 Let A ∈ Rn×n be symmetric and positive definite. Then, the
symmetric Gauss–Seidel relaxation converges for each ω ∈ (0, 2) and

\[
\rho(C_{SGS}(\omega)) = \|C_{SGS}(\omega)\|_{A^{1/2}} = \|C_{GS}(\omega)\|_{A^{1/2}}^2 < 1.
\]

Proof The matrix CS GS (ω) = C BGS (ω)CGS (ω) has the same eigenvalues as
the matrix
! !
A1/2CS GS (ω)A−1/2 = A1/2C BGS (ω)A−1/2 A1/2CGS (ω)A−1/2 .

Since we have B_{BGS}^T(ω) = B_{GS}(ω) we can conclude that
\[
\begin{aligned}
A^{1/2}C_{BGS}(\omega)A^{-1/2} &= A^{1/2}\bigl(I - B_{BGS}^{-1}(\omega)A\bigr)A^{-1/2}
= I - A^{1/2}B_{BGS}^{-1}(\omega)A^{1/2}\\
&= \bigl(I - A^{1/2}B_{GS}^{-1}(\omega)A^{1/2}\bigr)^T
= \bigl(A^{1/2}C_{GS}(\omega)A^{-1/2}\bigr)^T,
\end{aligned}
\]

which shows
\[
A^{1/2}C_{SGS}(\omega)A^{-1/2}
= \bigl(A^{1/2}C_{GS}(\omega)A^{-1/2}\bigr)^T\bigl(A^{1/2}C_{GS}(\omega)A^{-1/2}\bigr).
\]

From this, we can conclude that A^{1/2}C_{SGS}(ω)A^{-1/2} is symmetric and positive
semi-definite and has therefore only non-negative eigenvalues. Furthermore, with
Example 2.29 and Corollary 2.21 we find

\[
\begin{aligned}
\rho(C_{SGS}(\omega)) &= \rho\bigl(A^{1/2}C_{SGS}(\omega)A^{-1/2}\bigr)
= \|A^{1/2}C_{SGS}(\omega)A^{-1/2}\|_2 = \|C_{SGS}(\omega)\|_{A^{1/2}}\\
&= \bigl\|\bigl(A^{1/2}C_{GS}(\omega)A^{-1/2}\bigr)^T\bigl(A^{1/2}C_{GS}(\omega)A^{-1/2}\bigr)\bigr\|_2\\
&= \|A^{1/2}C_{GS}(\omega)A^{-1/2}\|_2^2 = \|C_{GS}(\omega)\|_{A^{1/2}}^2.
\end{aligned}
\]

It remains to show that the spectral radius is less than one, i.e., we have to


show that ‖A^{1/2}C_{GS}(ω)A^{-1/2}‖_2 < 1. Since the ‖·‖_2-norm of the matrix C(ω) :=
A^{1/2}C_{GS}(ω)A^{-1/2} is given by the square root of the maximum eigenvalue of
C(ω)^T C(ω), we have to compute the latter. To this end, we use again that
\[
B_{GS}^T(\omega) = B_{BGS}(\omega), \qquad
B_{GS}(\omega) + B_{BGS}(\omega) = A + \Bigl(\frac{2}{\omega}-1\Bigr)D
\]
and
\[
C(\omega) = A^{1/2}C_{GS}(\omega)A^{-1/2}
= A^{1/2}\bigl(I - B_{GS}^{-1}(\omega)A\bigr)A^{-1/2}
= I - A^{1/2}B_{GS}^{-1}(\omega)A^{1/2}
\]
to derive
\[
\begin{aligned}
C(\omega)^TC(\omega) &= \bigl(I - A^{1/2}B_{GS}^{-1}(\omega)A^{1/2}\bigr)^T\bigl(I - A^{1/2}B_{GS}^{-1}(\omega)A^{1/2}\bigr)\\
&= \bigl(I - A^{1/2}B_{BGS}^{-1}(\omega)A^{1/2}\bigr)\bigl(I - A^{1/2}B_{GS}^{-1}(\omega)A^{1/2}\bigr)\\
&= I - A^{1/2}\bigl(B_{BGS}^{-1}(\omega) + B_{GS}^{-1}(\omega)\bigr)A^{1/2}
   + A^{1/2}B_{BGS}^{-1}(\omega)AB_{GS}^{-1}(\omega)A^{1/2}\\
&= I - A^{1/2}B_{BGS}^{-1}(\omega)\Bigl(A + \Bigl(\frac{2}{\omega}-1\Bigr)D\Bigr)B_{GS}^{-1}(\omega)A^{1/2}
   + A^{1/2}B_{BGS}^{-1}(\omega)AB_{GS}^{-1}(\omega)A^{1/2}\\
&= I - \Bigl(\frac{2}{\omega}-1\Bigr)A^{1/2}B_{BGS}^{-1}(\omega)DB_{GS}^{-1}(\omega)A^{1/2}\\
&=: I - \Bigl(\frac{2}{\omega}-1\Bigr)B(\omega).
\end{aligned}
\]
The newly introduced matrix B(ω) is, because of
\[
B(\omega) := A^{1/2}B_{BGS}^{-1}(\omega)DB_{GS}^{-1}(\omega)A^{1/2}
= \bigl(D^{1/2}B_{GS}^{-1}(\omega)A^{1/2}\bigr)^T\bigl(D^{1/2}B_{GS}^{-1}(\omega)A^{1/2}\bigr),
\]
symmetric and positive definite. If λ ∈ R is an eigenvalue of C(ω)^T C(ω) with
corresponding eigenvector x ∈ R^n satisfying ‖x‖_2 = 1 then ω ∈ (0, 2) implies
\[
0 \le \lambda = x^TC(\omega)^TC(\omega)x
= 1 - \Bigl(\frac{2}{\omega}-1\Bigr)x^TB(\omega)x < 1. \qquad\Box
\]

Exercises
4.1 Let F : Rn → Rn be a contraction mapping with contraction number
q ∈ (0, 1) and fixed point x∗ ∈ Rn . Let G : Rn → Rn be a mapping
satisfying ‖F(x) − G(x)‖ ≤ ε for all x ∈ R^n. Show that the iteration
y_{j+1} := G(y_j) satisfies the error bound
\[
\|x^* - y_j\| \le \frac{1}{1-q}\bigl(\varepsilon + q^j\|y_0 - y_1\|\bigr).
\]


4.2 Show that the following modification of Banach’s fixed point theorem is
true. Let D ⊆ Rn be a compact set. Let F : D → D be a contraction
mapping. Then, F has exactly one fixed point x∗ ∈ D and the sequence
x j+1 := F(x j ) converges to x∗ for each starting point x0 ∈ D.
Use Banach’s fixed point theorem to show that the following functions
have a fixed point: f(x) = √(2 + x), x ≥ 0; f(x) = e^{−x}, x ∈ R;
f(x_1, x_2) := ((2/5)(x_1 + x_2), (1/2)e^{−x_1})^T, x = (x_1, x_2)^T ∈ R^2.
Compute the fixed point for the
first function and show that the sequence x j+1 = f (x j ) converges for
every x0 ∈ R in the case of the second function.
4.3 Show that a matrix A ∈ Rn×n is reducible if and only if there are a permu-
tation matrix P ∈ Rn×n , an index k ∈ {1, . . . , n−1} and matrices B ∈ Rk×k ,
C ∈ R(n−k)×k and D ∈ R(n−k)×(n−k) such that

\[
P^TAP = \begin{pmatrix} B & O\\ C & D \end{pmatrix}.
\]
4.4 Show that a matrix A ∈ Rn×n is irreducible if and only if we can find for
each pair of indices i, j ∈ {1, . . . , n} a chain j0 , . . . , jk ∈ {1, . . . , n} with
j_0 = i, j_k = j and a_{j_\ell, j_{\ell+1}} ≠ 0 for 0 ≤ ℓ ≤ k − 1.
4.5 Let A ∈ Rn×n satisfy aii > 0 and ai j ≤ 0 for i  j. Let C J = −D−1 (L + R)
be the iteration matrix of the Jacobi method for A = L + D + R. Show
that ρ(C J ) < 1 if and only if A has an inverse with non-negative entries
(such a matrix is called an M-matrix).
4.6 Let A ∈ Rn×n be symmetric, irreducible and weakly diagonally domi-
nant. Assume that all diagonal elements of A are positive. Show that A is
positive definite.
4.7 Let A ∈ Rn×n with n ≥ 2 have diagonal elements aii = 1 and off-diagonal
elements ai j = a, i  j.
• Determine the iteration matrix CGS of the Gauss–Seidel method for A.
Calculate the eigenvalues of CGS and A.
• Determine those a ∈ R, for which the Gauss–Seidel method con-
verges. For which a ∈ R is A positive definite?
4.8 A simple residual correction method to solve Ax = b is the so-called
Richardson iteration, given by x j+1 = (I − A)x j + b = x j + (b − Ax j ).
Its relaxation is given by x j+1 = x j + ω(b − Ax j ). Determine the iteration
matrix C(ω) of this method. Assume that A has only real eigenvalues.
Determine ρ(C(ω)) and the optimal relaxation parameter.

5
Calculation of Eigenvalues

Particularly when analysing vibrations in mechanical or electrical systems,


we are mathematically confronted with eigenvalue problems. Initially, these
are eigenvalue problems for certain differential operators but via discretisation
they can be recast as eigenvalue problems for linear systems. We have also
seen that eigenvalues play an important role when studying the conditioning of
linear systems. Hence, in this chapter we are interested in finding solutions to
the problem
Ax = λx,

where A ∈ Rn×n is given and the eigenvalue λ ∈ C and the eigenvector x ∈


Cn \ {0} are what we are looking for.
Recall from linear algebra that the eigenvalues of A are the zeros of the char-
acteristic polynomial
p(λ) := det(A − λI).

In the applications mentioned above, the root of an eigenvalue represents a fre-


quency of the vibration of the mechanical system. The corresponding eigen-
vector gives the corresponding shape mode or mode of vibration.
There are different eigenvalue problems depending on the nature of the un-
derlying problem. For each of them, there are different solution techniques.
Typical problems are as follows.

1. Computation of the largest (or smallest) eigenvalue without the correspond-


ing eigenvector.
2. Computation of all eigenvalues without the corresponding eigenvectors.
3. Computation of one eigenvalue together with a corresponding eigenvector.
4. Computation of several (or all) eigenvalues and the corresponding eigen-
vectors.


The numerical methods also depend on the special form of the matrix. Partic-
ularly, whether the matrix A is symmetric (or Hermitian) or not. In this book,
we will mainly concentrate on symmetric matrices since methods for comput-
ing the eigenvalues and eigenvectors are much easier in this case. Moreover,
eigenvalues and eigenvectors are real in this situation.
As in the case of solving linear systems, one can distinguish between direct
and iterative methods. In direct methods, first the characteristic polynomial
and then its zeros are determined. However, more popular are iterative meth-
ods, which try to compute successive approximations to the eigenvectors and
eigenvalues without computing the characteristic polynomial.
Before discussing methods for computing eigenvalues and eigenvectors, we
will address basic localisation techniques. They sometimes already give good
approximations of the eigenvalues.

5.1 Basic Localisation Techniques


The first way of estimating the location of all eigenvalues works for an arbitrary
matrix.

Theorem 5.1 (Gershgorin) The eigenvalues of a matrix A ∈ Rn×n are con-


tained in the union of the discs
\[
D_j = \Bigl\{\lambda \in \mathbb{C} : |\lambda - a_{jj}| \le \sum_{\substack{k=1\\ k\ne j}}^{n} |a_{jk}|\Bigr\}, \qquad 1 \le j \le n.
\]

Proof For every eigenvalue λ ∈ C of A we can choose an eigenvector x ∈


Cn \ {0} with x∞ = max1≤ j≤n |x j | = 1. From Ax − λx = 0 we can conclude
that
\[
(a_{jj}-\lambda)x_j + \sum_{\substack{k=1\\ k\ne j}}^{n} a_{jk}x_k = 0, \qquad 1 \le j \le n.
\]

If we pick an index j with |x_j| = 1 then the statement follows via
\[
|a_{jj}-\lambda| = |(a_{jj}-\lambda)x_j|
= \Bigl|\sum_{\substack{k=1\\ k\ne j}}^{n} a_{jk}x_k\Bigr|
\le \sum_{\substack{k=1\\ k\ne j}}^{n} |a_{jk}|\,|x_k|
\le \sum_{\substack{k=1\\ k\ne j}}^{n} |a_{jk}|. \qquad\Box
\]
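Gershgorin's theorem is easy to use in practice. The following Python sketch (NumPy assumed; the matrix is an arbitrary illustrative example) computes the disc centres and radii of Theorem 5.1 and compares them with the actual spectrum.

import numpy as np

# Gershgorin discs: centre a_jj, radius sum_{k != j} |a_jk|
A = np.array([[4.0, 1.0, 0.2],
              [0.5, 3.0, 0.3],
              [0.1, 0.4, 1.0]])
centres = np.diag(A)
radii = np.sum(np.abs(A), axis=1) - np.abs(centres)
for c, r in zip(centres, radii):
    print(f"disc: centre {c:.2f}, radius {r:.2f}")
print("eigenvalues:", np.linalg.eigvals(A))
# every eigenvalue lies in at least one of the printed discs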

Corollary 5.2 Let A ∈ Rn×n be symmetric and strictly row diagonally domi-
nant with positive diagonal elements. Then, A is positive definite.


Proof Since A is symmetric, all eigenvalues λ of A are real. Since A is strictly


row diagonally dominant with positive diagonal elements, we have
\[
r_j := \sum_{\substack{k=1\\ k\ne j}}^{n} |a_{jk}| < a_{jj}, \qquad 1 \le j \le n.
\]

According to Gershgorin’s theorem we have for every eigenvalue λ ∈ R of A


an index j with |λ − a j j | ≤ r j and hence

0 < a j j − r j ≤ λ. 

It is possible to derive an improved version of Gershgorin’s theorem.

Theorem 5.3 (Gershgorin) If all the discs D j are disjoint, then each of them
contains exactly one eigenvalue of A. If p discs D j form a connected set P
which is disjoint to all other discs, then P contains exactly p eigenvalues of A.

Proof The first statement follows immediately from the second one. Hence,
we will concentrate on the second statement. Let r j denote the radius of D j ,

i.e. r_j = \sum_{k\ne j} |a_{jk}|. Let P′ denote the union of the remaining discs D_j which are
disjoint to P. Then, both P and P′ are closed, disjoint subsets of C and hence
there is an ε_0 > 0 such that for all x ∈ P and all y ∈ P′ we have |x − y| > ε_0.
Next we define for t ∈ [0, 1] the matrix A(t) = (a_{ij}(t)) by setting a_{ii}(t) =
a_{ii} and a_{ij}(t) = t a_{ij} if i ≠ j. Obviously, we have A(1) = A and A(0) =
diag(a_{11}, . . . , a_{nn}). Furthermore, the Gershgorin discs D_j(t) associated with this
matrix A(t) are also centred at a_{jj} but have radius t r_j, which means D_j(t) ⊆ D_j
for all t ∈ [0, 1] and hence P(t) ⊆ P and P′(t) ⊆ P′. Here, P(t) denotes the
union of the p discs D_j(t) corresponding to those in P. Note that Theorem 5.1
states that all eigenvalues of A(t) must be contained in the union of P(t) and
P′(t).
Let us now define
\[
t_0 := \sup\{t \in [0, 1] : P(t) \text{ contains exactly } p \text{ eigenvalues of } A(t)\}. \tag{5.1}
\]
For every t in that set, we also have that P′(t) contains exactly n − p eigenvalues.
Our proof is finished once we have shown that t_0 = 1. We immediately see
that 0 belongs to the set since A(0) is a diagonal matrix and P(0) contains
exactly p elements, the eigenvalues of A(0). Hence, the set in (5.1) is not empty.
Next, notice that A(t) depends continuously on t. Since the eigenvalues of A(t)
are the zeros of the corresponding characteristic polynomial, they also depend
continuously on t. Hence, to every ε > 0 there is a δ > 0 such that |λ_j(t) −
λ_j(t_0)| < ε for all t ∈ [0, 1] with |t − t_0| < δ and 1 ≤ j ≤ n, provided we sort the
eigenvalues consistently. We will use this for the above given ε = ε_0.


The definition of t_0 allows us to pick a t ∈ [0, 1] with t_0 − δ < t < t_0 such that
P(t) contains exactly p eigenvalues of A(t) and P′(t) contains the remaining n −
p eigenvalues. The definition of ε_0 and the choice of δ now guarantee that also
P(t_0) contains exactly p eigenvalues of A(t_0) and P′(t_0) contains the remaining
n − p eigenvalues. Hence, the supremum in (5.1) is actually a maximum, i.e. t_0
belongs to the set in (5.1).
Let us now assume t_0 < 1. Then, we can again choose ε = ε_0 and a corresponding
δ > 0 as above. This means, however, that for every t ∈ [0, 1] with
t_0 < t < t_0 + δ we have |λ_j(t) − λ_j(t_0)| < ε_0 for every 1 ≤ j ≤ n. The choice
of ε_0 guarantees once again that we now also have exactly p eigenvalues of
A(t) in P(t), which contradicts the definition of t_0. Hence, t_0 = 1 and P = P(1)
contains exactly p eigenvalues of A = A(1). □
In this section, we want to address two additional questions. How does a small
change in the elements of the matrix A change the eigenvalues of A? What
can we say about the location of an eigenvalue if we have an approximate
eigenvalue?
To answer these questions we need sharper estimates on the eigenvalues, which
we will discuss now. To derive them, we will restrict ourselves mainly to sym-
metric matrices. Most of the results also carry immediately over to Hermitian
T T
matrices and normal matrices, i.e. matrices which satisfy AA = A A. In all
other cases the situation is more complicated and, though the eigenvalues still
depend continuously on the matrix entries, it is very likely that a small pertur-
bation of the matrix may lead to significant changes in its eigenvalues.
Theorem 5.4 (Rayleigh) Let A ∈ Rn×n be symmetric with eigenvalues λ1 ≥
λ2 ≥ · · · ≥ λn and corresponding orthonormal eigenvectors w1 , . . . , wn ∈ Rn .
For 1 ≤ j ≤ n let W j := span{w1 , . . . , w j } and W ⊥j := span{w j+1 , . . . , wn }. Let
also W0⊥ := Rn . Then,
\[
\lambda_j = \max_{\substack{x\in W_{j-1}^{\perp}\\ x\ne 0}} \frac{x^TAx}{x^Tx}, \tag{5.2}
\]
\[
\lambda_j = \min_{\substack{x\in W_j\\ x\ne 0}} \frac{x^TAx}{x^Tx} \tag{5.3}
\]

for 1 ≤ j ≤ n.

Proof Every x ∈ Rn has a representation of the form x = nk=1 ck wk with
 
ck = xT wk . This means particularly xT x = nk=1 |ck |2 , Ax = nk=1 ck λk wk and

xT Ax = nk=1 |ck |2 λk .
If x ∈ W ⊥j−1 = span{w j , . . . , wn } then we have particularly c1 = · · · = c j−1 = 0


such that, provided x ≠ 0,
\[
\frac{x^TAx}{x^Tx} = \frac{\sum_{k=j}^{n}|c_k|^2\lambda_k}{x^Tx} \le \frac{\lambda_j\,x^Tx}{x^Tx} = \lambda_j. \tag{5.4}
\]
Hence, the maximum of the quotient over all x ∈ W ⊥j−1 is less than or equal to
λ j . However, the maximum is attained for x = w j ∈ W ⊥j−1 , which shows (5.2).
If, however, x ∈ W j = span{w1 , . . . , w j } then we have this time c j+1 = · · · =
c_n = 0 and hence, again under the assumption that x ≠ 0,
\[
\frac{x^TAx}{x^Tx} = \frac{\sum_{k=1}^{j}|c_k|^2\lambda_k}{x^Tx} \ge \frac{\lambda_j\,x^Tx}{x^Tx} = \lambda_j. \tag{5.5}
\]
This shows that the minimum of the quotient over all x ∈ W j is greater than or
equal to λ j and since the minimum is attained for x = w j , we have the equality
(5.3). 
An immediate consequence of this result is the following minimum maximum
principle.
Theorem 5.5 (Courant–Fischer) Let A ∈ Rn×n be symmetric with eigenvalues
λ1 ≥ λ2 ≥ . . . ≥ λn . For 1 ≤ j ≤ n let M j denote the set of all linear subspaces
of Rn of dimension j. Then,
\[
\lambda_j = \min_{U\in M_{n+1-j}} \max_{\substack{x\in U\\ x\ne 0}} \frac{x^TAx}{x^Tx}, \tag{5.6}
\]
\[
\lambda_j = \max_{U\in M_j} \min_{\substack{x\in U\\ x\ne 0}} \frac{x^TAx}{x^Tx} \tag{5.7}
\]

for 1 ≤ j ≤ n.
Proof We will only prove (5.6) since the second equality is proven in a very
similar fashion.
Let w j , W j and W ⊥j be defined as in Theorem 5.4. Let us denote the right-hand
side of (5.6) with ρ j . Since W ⊥j−1 = span{w j , . . . , wn } ∈ Mn− j+1 we immedi-
ately have ρ j ≤ λ j . Moreover, the space W j = span{w1 , . . . , w j } obviously has
dimension j and an arbitrary U ∈ Mn+1− j has dimension n + 1 − j. Thus, the
dimension formula yields
\[
\dim(W_j \cap U) = \dim(W_j) + \dim(U) - \dim(W_j + U) \ge 1.
\]
Hence, for each U ∈ M_{n+1−j} there is an x ∈ W_j ∩ U, x ≠ 0. This means that
this x has the representation x = \sum_{k=1}^{j} c_k w_k, and we can derive, as before,
\[
\frac{x^TAx}{x^Tx} = \frac{\sum_{k=1}^{j}|c_k|^2\lambda_k}{x^Tx} \ge \lambda_j
\]


and therefore also ρ j ≥ λ j . 

A first consequence of this min–max principle is Sylvester’s theorem of inertia.


The inertia of a symmetric matrix A ∈ Rn×n is a triple (n+ , n0 , n− ) with n+ +
n− + n0 = n, where n+ , n− , n0 denote the numbers of positive, negative and
zero eigenvalues of A. The result will be of importance for us later on, when
discussing preconditioning.

Theorem 5.6 (Sylvester) Let A ∈ Rn×n be symmetric and S ∈ Rn×n be non-


singular. Then, A and B := S T AS have the same inertia, i.e. the same numbers
of positive, negative and zero eigenvalues.

Proof We start by noting that the matrix S T S is positive definite and denote
its largest and its smallest eigenvalues by λmax (S T S ) and λmin (S T S ), respec-
tively. Both are positive; they are the squares of the largest and smallest singu-
lar values of S and satisfy

\[
\begin{aligned}
\lambda_{\min}(S^TS) &= \min_{x\ne 0}\frac{x^TS^TSx}{x^Tx} = \min_{y\ne 0}\frac{y^Ty}{y^TS^{-T}S^{-1}y},\\
\lambda_{\max}(S^TS) &= \max_{x\ne 0}\frac{x^TS^TSx}{x^Tx} = \max_{y\ne 0}\frac{y^Ty}{y^TS^{-T}S^{-1}y}
\end{aligned}
\]
due to Theorem 5.4.
Next, let us denote the eigenvalues of A and B by λ1 (A) ≥ λ2 (A) ≥ · · · ≥ λn (A)
and λ1 (B) ≥ λ2 (B) ≥ · · · ≥ λn (B).
By (5.7), we have

\[
\lambda_j(B) = \max_{U\in M_j}\min_{\substack{x\in U\\ x\ne 0}}\frac{x^TS^TASx}{x^Tx}
= \max_{U\in M_j}\min_{\substack{y\in S(U)\\ y\ne 0}}\frac{y^TAy}{y^TS^{-T}S^{-1}y}
\]

using the substitution y = S x again and noting that S (U) is also in M j . Thus,
if w1 , . . . , wn denote once again an orthonormal basis consisting of the eigen-
vectors to the eigenvalues of A and setting again W j = span{w1 , . . . , w j }, we
can choose U := S −1 (W j ) ∈ M j . This, together with (5.5), consequently yields

\[
\begin{aligned}
\lambda_j(B) &\ge \min_{\substack{y\in W_j\\ y\ne 0}}\frac{y^TAy}{y^Ty}\,\frac{y^Ty}{y^TS^{-T}S^{-1}y}
\ge \lambda_j(A)\min_{\substack{y\in W_j\\ y\ne 0}}\frac{y^Ty}{y^TS^{-T}S^{-1}y}\\
&\ge \lambda_j(A)\min_{y\ne 0}\frac{y^Ty}{y^TS^{-T}S^{-1}y}
= \lambda_j(A)\min_{x\ne 0}\frac{x^TS^TSx}{x^Tx}
= \lambda_j(A)\lambda_{\min}(S^TS),
\end{aligned}
\]

since yT y/yT S −T S −1 y = y22 /S −1 y22 > 0.


In the same way, we can use (5.6) and (5.4) to derive


\[
\begin{aligned}
\lambda_j(B) &= \min_{U\in M_{n+1-j}}\max_{\substack{x\in U\\ x\ne 0}}\frac{x^TS^TASx}{x^Tx}
= \min_{U\in M_{n+1-j}}\max_{\substack{y\in S(U)\\ y\ne 0}}\frac{y^TAy}{y^TS^{-T}S^{-1}y}\\
&\le \max_{\substack{y\in W_{j-1}^{\perp}\\ y\ne 0}}\frac{y^TAy}{y^Ty}\,\frac{y^Ty}{y^TS^{-T}S^{-1}y}
\le \lambda_j(A)\max_{\substack{y\in W_{j-1}^{\perp}\\ y\ne 0}}\frac{y^Ty}{y^TS^{-T}S^{-1}y}\\
&\le \lambda_j(A)\max_{y\ne 0}\frac{y^Ty}{y^TS^{-T}S^{-1}y}
= \lambda_j(A)\lambda_{\max}(S^TS).
\end{aligned}
\]
Both bounds together yield
λmin (S T S )λ j (A) ≤ λ j (B) ≤ λmax (S T S )λ j (A) (5.8)
and since λmin (S T S ) and λmax (S T S ) are both positive, λ j (A) and λ j (B) must
have the same sign. 
The proof we have just given actually shows more than we have stated, namely
(5.8), which is also called Ostrowski’s quantitative version of Sylvester’s theo-
rem, see for example [98] and Wimmer [136].
Another consequence of Theorem 5.5 is that it allows us to give a stability
result on the eigenvalues of a symmetric matrix.
Corollary 5.7 Let A ∈ Rn×n and B ∈ Rn×n be two symmetric matrices with
eigenvalues λ1 (A) ≥ λ2 (A) ≥ · · · ≥ λn (A) and λ1 (B) ≥ λ2 (B) ≥ · · · ≥ λn (B).
Then,
\[
|\lambda_j(A) - \lambda_j(B)| \le \|A - B\|
\]
for every compatible matrix norm.
Proof Using the Cauchy–Schwarz inequality, we have
\[
x^T(Ax - Bx) \le \|Ax - Bx\|_2\|x\|_2 \le \|A - B\|_2\|x\|_2^2,
\]
meaning particularly
\[
\frac{x^TAx}{x^Tx} \le \frac{x^TBx}{x^Tx} + \|A - B\|_2
\]
for x ≠ 0. Thus, Theorem 5.5 yields λ_j(A) ≤ λ_j(B) + ‖A − B‖_2. Exchanging the
roles of A and B finally gives
\[
|\lambda_j(A) - \lambda_j(B)| \le \|A - B\|_2 = \rho(A - B) \le \|A - B\|,
\]
where we have used Theorems 4.5 and 2.19 and the fact that the matrix A − B
is symmetric. 


As mentioned earlier, the situation is not so favourable if A is not a symmet-


ric or Hermitian matrix. This is demonstrated in the next theorem, which is a
generalisation of the above corollary.

Theorem 5.8 (Bauer–Fike) Let A ∈ Cn×n be given, and let B = A+ΔA ∈ Cn×n
be a perturbation of A. If μ ∈ C is an eigenvalue of B and A = S DS −1 with
D = diag(λ1 , . . . , λn ) then
\[
\min_{1\le j\le n} |\lambda_j - \mu| \le \kappa_p(S)\,\|\Delta A\|_p.
\]
Here, ‖·‖_p, 1 ≤ p ≤ ∞, is the ℓ_p-norm and κ_p(S) is the condition number of


the matrix S in the induced norm.
Proof If μ is an eigenvalue of A this is obviously true. Hence, assume that μ
is not an eigenvalue of A. Since μ is an eigenvalue of B = A + ΔA, the matrix
S −1 (A + ΔA − μI)S is singular, i.e. there is an x ∈ Cn \ {0} with S −1 (A + ΔA −
μI)S x = 0. We can rearrange this to (D − μI)x = −S −1 (ΔA)S x and then to
x = −(D − μI)−1 S −1 (ΔA)S x.
Taking the norm immediately leads to
\[
1 \le \|(D-\mu I)^{-1}S^{-1}(\Delta A)S\|_p
\le \|(D-\mu I)^{-1}\|_p\,\|S^{-1}\|_p\,\|\Delta A\|_p\,\|S\|_p
\le \max_{1\le j\le n}\frac{1}{|\lambda_j-\mu|}\,\kappa_p(S)\,\|\Delta A\|_p. \qquad\Box
\]

This theorem shows that the perturbation ΔA is magnified by the condition


number of the transformation matrix S . If S is orthogonal or unitary then
κ2 (S ) = 1 and this is the best we can hope for. However, a unitary transfor-
mation is possible only if A is normal. Hence, in all other cases this will very
probably lead to problems.
We now return to the situation of real eigenvalues and real eigenvectors of real
matrices. Assume that we already have an approximation λ of an eigenvalue
and an approximation x of a corresponding eigenvector. Then, we can try to
use this information to derive a better estimate for the eigenvalue. To this end,
let us assume that all components x j of x are different from zero. Then, all the
quotients
\[
q_j := \frac{(Ax)_j}{x_j} = \frac{e_j^TAx}{e_j^Tx} \tag{5.9}
\]
should be about the same and coincide approximately with the eigenvalue.

Lemma 5.9 If all the components x j of x ∈ Rn are different from zero and
if Q denotes the diagonal matrix Q = diag(q1 , . . . , qn ) ∈ Rn×n with diagonal


components given by (5.9) then we have for an arbitrary vector norm and a
compatible matrix norm

1. 1 ≤ ‖(A − pI)^{−1}‖ ‖(Q − pI)x‖/‖x‖,
2. 1 ≤ ‖(A − pI)^{−1}‖ ‖Q − pI‖

for every p ∈ R for which A − pI is non-singular.

Proof The definition of the q j yields for our fixed vector x the equation Ax =
Qx so that (A − pI)x = (Q − pI)x follows. This means

x = (A − pI)−1 (Q − pI)x,

from which the stated inequalities immediately follow. 

As a special case of Lemma 5.9 we have the following quotient theorem.

Theorem 5.10 Let A ∈ Rn×n be symmetric. Let q j be defined by (5.9). Then,


for every p ∈ R there is an eigenvalue λ of A satisfying

|λ − p| ≤ max |q j − p|.
1≤ j≤n

Proof If p is an eigenvalue of A the inequality is obviously satisfied. If p is


not an eigenvalue of A then the matrix A − pI is invertible and (A − pI)−1 has
the eigenvalues 1/(λ j − p), if λ j , 1 ≤ j ≤ n, denote the eigenvalues of A.
Since (A − pI)−1 is symmetric its matrix norm induced by the Euclidean vector
norm and its spectral radius are the same. Thus,
\[
\|(A-pI)^{-1}\|_2 = \rho\bigl((A-pI)^{-1}\bigr)
= \max_{1\le j\le n}\frac{1}{|\lambda_j-p|}
= \frac{1}{\min_{1\le j\le n}|\lambda_j-p|}. \tag{5.10}
\]
In the same way, we can derive

\[
\|Q-pI\|_2 = \rho(Q-pI) = \max_{1\le j\le n}|q_j-p|
\]

and the second statement of Lemma 5.9 yields

\[
\min_{1\le j\le n}|\lambda_j-p| \le \max_{1\le j\le n}|q_j-p|. \qquad\Box
\]

Corollary 5.11 There is at least one eigenvalue λ of A satisfying

\[
q_{\min} := \min_{1\le j\le n} q_j \;\le\; \lambda \;\le\; q_{\max} := \max_{1\le j\le n} q_j.
\]

Proof If we define p = (q_min + q_max)/2 then we have
\[
\max_{1\le j\le n}|q_j-p|
= \max_{1\le j\le n}\Bigl|q_j-\frac{q_{\min}+q_{\max}}{2}\Bigr|
= \frac{1}{2}|q_{\max}-q_{\min}|.
\]


Hence, we can conclude from Theorem 5.10 that there is at least one eigenvalue
λ with
\[
\Bigl|\lambda - \frac{q_{\max}+q_{\min}}{2}\Bigr| \le \frac{1}{2}|q_{\max}-q_{\min}|,
\]

which means that λ must be in the given interval. 
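The quotients q_j are cheap to evaluate. The following Python sketch (illustrative matrix and vector, NumPy assumed) computes the interval [q_min, q_max] of Corollary 5.11 for a rough approximation x of an eigenvector and compares it with the spectrum.

import numpy as np

# quotients q_j = (Ax)_j / x_j; by Corollary 5.11 at least one
# eigenvalue lies in [q_min, q_max]
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
x = np.array([0.3, 0.7, 1.0])      # a rough guess at an eigenvector
q = (A @ x) / x
print("q_min, q_max:", q.min(), q.max())
print("eigenvalues: ", np.sort(np.linalg.eigvalsh(A)))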

5.2 The Power Method


The first numerical method we want to discuss is for computing the largest
eigenvalue of a matrix A ∈ Rn×n and a corresponding eigenvector. It is designed
for matrices A with one dominant eigenvalue λ1 , i.e. we assume in this section
that the eigenvalues of A satisfy

|λ1 | > |λ2 | ≥ |λ3 | ≥ · · · ≥ |λn |. (5.11)

We will not assume that A is symmetric but we will assume that it has real
eigenvalues and that there is a basis w1 , . . . , wn of Rn consisting of eigenvectors

of A. Then, we can represent every x as x = \sum_{j=1}^{n} c_j w_j with certain coefficients
c_j. Using (5.11) shows that
\[
A^m x = \sum_{j=1}^{n} c_j\lambda_j^m w_j
= \lambda_1^m\Bigl(c_1w_1 + \sum_{j=2}^{n} c_j\Bigl(\frac{\lambda_j}{\lambda_1}\Bigr)^m w_j\Bigr)
=: \lambda_1^m(c_1w_1 + R_m) \tag{5.12}
\]
defines a vector sequence {R_m} with R_m → 0 for m → ∞. This means, if c_1 ≠ 0, that
\[
A^m x/\lambda_1^m \to c_1 w_1, \qquad m \to \infty.
\]

The latter is an eigenvector of A to the eigenvalue λ1 . Of course, this is only of


limited use since we do not know the eigenvalue λ_1 and hence cannot form the
quotient A^m x/λ_1^m. Another problem comes from the fact that the norm of A^m x
converges to zero if |λ_1| < 1 and to infinity if |λ_1| > 1. Both problems can be
resolved by normalisation. For example, if we take the Euclidean norm of A^m x
then
\[
\|A^m x\|_2 = \Bigl(\sum_{i,j=1}^{n} c_ic_j\lambda_i^m\lambda_j^m w_i^Tw_j\Bigr)^{1/2}
=: |\lambda_1|^m\bigl(|c_1|\,\|w_1\|_2 + r_m\bigr) \tag{5.13}
\]
with R ∋ r_m → 0 for m → ∞. From this we can conclude that
\[
\frac{\|A^{m+1}x\|_2}{\|A^m x\|_2}
= |\lambda_1|\,\frac{\|A^{m+1}x\|_2\,|\lambda_1|^m}{|\lambda_1|^{m+1}\,\|A^m x\|_2}
\to |\lambda_1|, \qquad m\to\infty, \tag{5.14}
\]
Am x2 |λ1 |m+1 Am x2

08:42:21, subject to the Cambridge


06
142 Calculation of Eigenvalues

which gives λ1 up to its sign. To determine also its sign and a corresponding
eigenvector, we refine the method as follows.
Definition 5.12 The power method or von Mises iteration is defined in the

following way. First, we pick a start vector y_0 = \sum_{j=1}^{n} c_j w_j with c_1 ≠ 0 and set
x_0 = y_0/‖y_0‖_2. Then, for m = 1, 2, . . . we define
1. y_m = Ax_{m−1},
2. x_m = σ_m y_m/‖y_m‖_2, where the sign σ_m ∈ {−1, 1} is chosen such that x_m^T x_{m−1} ≥ 0.
Since the algorithm can directly be derived from this definition, we refrain
from stating it explicitly. Each step costs O(n2 ) time, which is mainly given by
the involved matrix–vector multiplication.
The choice of the sign means that the angle between xm−1 and xm is in [0, π/2],
which means that we are avoiding a “jump” in the orientation when moving
from xm−1 to xm .
The condition c1  0 is usually satisfied simply because of numerical rounding
errors and is hence no real restriction.
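Since the algorithm follows directly from Definition 5.12, a few lines of Python suffice for experiments. The sketch below (NumPy assumed) uses a fixed number of steps for simplicity; a more careful implementation would monitor the change in ‖y_m‖_2.

import numpy as np

# power method (von Mises iteration) as in Definition 5.12
def power_method(A, y0, steps=100):
    x = y0 / np.linalg.norm(y0)
    sigma = 1.0
    for _ in range(steps):
        y = A @ x                                   # y_m = A x_{m-1}
        x_new = y / np.linalg.norm(y)
        sigma = 1.0 if x_new @ x >= 0.0 else -1.0   # keep the orientation
        x = sigma * x_new
    return sigma * np.linalg.norm(y), x             # approximations to lambda_1, w_1

A = np.array([[2.0, 1.0], [1.0, 3.0]])
lam, v = power_method(A, np.array([1.0, 1.0]))
print(lam, v)   # compare with np.linalg.eigvalsh(A)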
Theorem 5.13 Let A ∈ Rn×n be a diagonalisable matrix with real eigenvalues
and eigenvectors with one dominant eigenvalue λ1 . The iterations of the power
method satisfy

1. ym 2 → |λ1 | for m → ∞,


2. xm converges to an eigenvector of A to the eigenvalue λ1 ,
3. σm → sign(λ1 ) for m → ∞, i.e. σm = sign(λ1 ) for sufficiently large m.
Proof We have
\[
\begin{aligned}
x_m &= \sigma_m\frac{Ax_{m-1}}{\|Ax_{m-1}\|_2}
= \sigma_m\frac{A\bigl(\sigma_{m-1}Ax_{m-2}/\|Ax_{m-2}\|_2\bigr)}{\bigl\|\sigma_{m-1}A^2x_{m-2}/\|Ax_{m-2}\|_2\bigr\|_2}
= \sigma_m\sigma_{m-1}\frac{A^2x_{m-2}}{\|A^2x_{m-2}\|_2}\\
&= \sigma_m\sigma_{m-1}\cdots\sigma_1\frac{A^mx_0}{\|A^mx_0\|_2}
= \sigma_m\sigma_{m-1}\cdots\sigma_1\frac{A^my_0}{\|A^my_0\|_2}
\end{aligned}
\]
for m = 1, 2, . . .. From this, we can conclude
\[
y_{m+1} = Ax_m = \sigma_m\cdots\sigma_1\frac{A^{m+1}y_0}{\|A^my_0\|_2},
\]
such that (5.14) immediately leads to
\[
\|y_{m+1}\|_2 = \frac{\|A^{m+1}y_0\|_2}{\|A^my_0\|_2} \to |\lambda_1|, \qquad m\to\infty.
\]


For the rest of the proof we may assume, without restriction, that w1 2 = 1.
Using the representations (5.12) and (5.13) yields
\[
x_m = \sigma_m\cdots\sigma_1\,\frac{\lambda_1^m(c_1w_1 + R_m)}{|\lambda_1|^m\bigl(|c_1|\,\|w_1\|_2 + r_m\bigr)}
= \sigma_m\cdots\sigma_1\,\operatorname{sign}(\lambda_1)^m\bigl(\operatorname{sign}(c_1)w_1 + \rho_m\bigr)
\]
with ρ_m → 0 for m → ∞. From this we can conclude that x_m indeed converges
to an eigenvector of A to the eigenvalue λ_1, provided that σ_m = sign(λ_1) for all
m ≥ m_0 sufficiently large. The latter follows from
\[
0 \le x_{m-1}^Tx_m
= \sigma_m\sigma_{m-1}^2\cdots\sigma_1^2\,
\frac{\lambda_1^{2m-1}(c_1w_1^T + R_{m-1}^T)(c_1w_1 + R_m)}{|\lambda_1|^{2m-1}(|c_1| + r_{m-1})(|c_1| + r_m)}
= \sigma_m\operatorname{sign}(\lambda_1)\,
\frac{|c_1|^2 + c_1w_1^TR_m + c_1R_{m-1}^Tw_1 + R_{m-1}^TR_m}{|c_1|^2 + |c_1|(r_{m-1} + r_m) + r_mr_{m-1}},
\]
because the fraction involved converges to one for m → ∞. 

The Euclidean norm involved in the normalisation process can be replaced by


any other norm without changing the convergence result. Often, the maximum
norm is used, since it is cheaper to compute. Also, we do not need all eigen-
values to be real. It suffices that the dominant eigenvalue is real. Even if the
dominant eigenvalue is complex, we can still apply the method and have at
least ym  → |λ1 |.
It follows from (5.12) that the convergence rate of the power method is mainly
determined by the quotient |λ2 /λ1 | < 1. Obviously, the convergence will be
slow if λ2 is close to λ1 .

5.3 Inverse Iteration by von Wielandt and Rayleigh


We now want to discuss a specific application of the power method. To this
end, let us assume that A has a simple eigenvalue λ j ∈ R and that we already
have an approximation λ ∈ R to this eigenvalue. Then, obviously, we may
assume that
|λ − λ j | < |λ − λi |

for all other eigenvalues λi  λ j of A. If our approximation λ is not an eigen-


value of A itself then we can invert A − λI and (A − λI)−1 has the eigenvalues
1/(λ_i − λ) =: λ̃_i. Moreover, we have
\[
\lambda_j = \lambda + \frac{1}{\tilde\lambda_j}
\]


such that a good approximation to λ̃ j allows us to improve our approximation


to λ_j. From
\[
\frac{1}{|\lambda - \lambda_i|} < \frac{1}{|\lambda - \lambda_j|} \qquad\text{for all } i \ne j
\]

we can conclude that λ̃ j is a dominant eigenvalue of (A − λI)−1 such that the


power method applied to (A − λI)−1 should give good results. This is the idea
behind the inverse iteration method by von Wielandt.

Definition 5.14 For the inverse iteration method or von Wielandt iteration
for computing an eigenvalue of a symmetric matrix A ∈ Rn×n we assume that
we have an approximation λ ∈ R, which is not an eigenvalue, and that we
have a start vector y0  0. Then, we compute x0 := y0 /y0 2 and then, for
m = 1, 2, . . ., we obtain

\[
y_m = (A - \lambda I)^{-1}x_{m-1}, \qquad x_m = \frac{y_m}{\|y_m\|_2}.
\]
The computation of ym thus requires solving the linear system

(A − λI)ym = xm−1 .

To this end, we determine, before the iteration starts, an LU factorisation of A


(or a QR factorisation) using column pivoting, i.e. we compute

P(A − λI) = LU,

with a permutation matrix P, an upper triangular matrix U and a normalised


lower triangular matrix L. Then, we can use forward and back substitution to
solve
LUym = Pxm−1 .

This reduces the complexity necessary in each step to O(n2 ) and requires only
O(n3 ) once to compute the LU decomposition.
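A possible realisation in Python could look as follows; SciPy's lu_factor/lu_solve are used here for the one-off factorisation (any LU or QR routine with column pivoting would do), and the shift and the test matrix are purely illustrative.

import numpy as np
from scipy.linalg import lu_factor, lu_solve

# inverse iteration (von Wielandt): factorise A - shift*I once,
# then each step only requires forward and back substitution
def inverse_iteration(A, shift, y0, steps=20):
    lu_piv = lu_factor(A - shift * np.eye(A.shape[0]))   # done once
    x = y0 / np.linalg.norm(y0)
    for _ in range(steps):
        y = lu_solve(lu_piv, x)        # solve (A - shift I) y = x
        x = y / np.linalg.norm(y)
    # Rayleigh quotient of the final iterate as eigenvalue estimate
    return x @ (A @ x), x

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
lam, v = inverse_iteration(A, shift=4.5, y0=np.ones(3))
print(lam, v)   # the eigenpair with eigenvalue closest to the shift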
The approximation λ can be given by any other method. From this point of
view, inverse iteration is in particular interesting for improving the results of
other methods.
Obviously, we could and should adapt the sign as we have done in the power
method. Here, however, we want to discuss a different strategy for symmet-
ric matrices, where it is possible to improve the convergence dramatically by
adapting λ in each step. To this end, let us recall the Rayleigh quotient, which
we have encountered already several times.


Definition 5.15 The Rayleigh quotient of a vector x ∈ Rn with respect to a


real matrix A ∈ R^{n×n} is the scalar
\[
r(x) := \frac{x^TAx}{x^Tx}, \qquad x \ne 0.
\]
Obviously, if x is an eigenvector then r(x) gives the corresponding eigen-
value of A. Moreover, we have the following two immediate properties of the
Rayleigh quotient for symmetric matrices.
Lemma 5.16 Let A ∈ Rn×n be symmetric.
1. If λmin (A) and λmax (A) denote the minimum and maximum eigenvalues of A,
respectively, then
λmin (A) ≤ r(x) ≤ λmax (A), x ∈ Rn \ {0}.
2. For x ∈ Rn the function f : R → R defined by f (λ) := (A − λI)x22 becomes
minimal for λ = r(x) with value
f (r(x)) = Ax22 − r(x)2 x22 .
Proof The first statement follows directly from Theorem 5.4. The second
property follows from differentiating the function
f (λ) = Ax − λx22 = Ax22 − 2λxT Ax + λ2 x22 . (5.15)
The only root of the derivative is given by λ = r(x) and inserting this into (5.15)
yields the stated function value. It is a minimum as f (λ) = 2x22 > 0. 
The idea of the Rayleigh iteration is based upon using the Rayleigh quotient as
an approximation to an eigenvalue. If x is sufficiently close to an eigenvector z
then, by continuity, r(x) is close to r(z) which is an eigenvalue.
Thus, changing the parameter λ in the inverse iteration method in each step
leads to Algorithm 11.

Algorithm 11: Rayleigh iteration


Input : A ∈ Rn×n , y0 ∈ Rn \ {0}.
Output: Approximate eigenvalue and eigenvector.
1 x0 := y0 /y0 2
2 for m = 0, 1, 2, . . . do
3 μm := xTm Axm
4 ym+1 := (A − μm I)−1 xm
5 xm+1 := ym+1 /ym+1 2


Some remarks need to be made concerning Algorithm 11. First of all, ym+1 is
computed only if μm is not an eigenvalue of A, otherwise the method stops.
An indication for this is that ym+1 2 becomes too large, which can be used
as a stopping criterion. Although, for m → ∞, the matrices A − μm I become
singular, there are usually no problems with the computation of ym+1 , provided
that one terminates the computation of the LU- or QR-decomposition of A −
μm I in time.
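A minimal Python sketch of Algorithm 11 that incorporates this stopping criterion might look as follows; the threshold and the maximal number of steps are illustrative choices.

import numpy as np

# Rayleigh iteration: stop once ||y_{m+1}|| becomes very large,
# i.e. mu_m is (numerically) an eigenvalue of A
def rayleigh_iteration(A, y0, max_steps=20, big=1e12):
    x = y0 / np.linalg.norm(y0)
    mu = x @ (A @ x)
    for _ in range(max_steps):
        try:
            y = np.linalg.solve(A - mu * np.eye(A.shape[0]), x)
        except np.linalg.LinAlgError:   # A - mu I numerically singular
            break
        x = y / np.linalg.norm(y)
        if np.linalg.norm(y) > big:
            break
        mu = x @ (A @ x)
    return mu, x

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
mu, v = rayleigh_iteration(A, np.array([1.0, 0.5, 0.25]))
print(mu, v)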
Note that this time we have to compute an LU factorisation in each step. This
additional computational cost has to be compensated for by a faster conver-
gence. This is, fortunately, given in most cases, as we will see soon. We start
with a simpler stability result.

Proposition 5.17 Let A ∈ Rn×n be symmetric. Let the sequence {(xm , μm )} be


generated by the Rayleigh iteration. Then,

(A − μm+1 I)xm+1 2 ≤ (A − μm I)xm 2 , m = 0, 1, 2, . . . .

Here, equality holds if and only if μm = μm+1 and if xm is an eigenvector of


(A − μm I)2 .

Proof We know from Lemma 5.16 that the Rayleigh quotient μm+1 minimises
f (λ) = (A − λI)xm+1 22 , which means that we have

(A − μm+1 I)xm+1 2 ≤ (A − μm I)xm+1 2 .

By construction, we have xm = (A − μm I)xm+1 ym+1 2 and xm 2 = 1, which


means
(A − μm I)xm+1
xm = .
(A − μm I)xm+1 2
Thus, we can conclude that

(A − μm+1 I)xm+1 2 ≤ (A − μm I)xm+1 2 = |xTm (A − μm I)xm+1 |


= |xTm+1 (A − μm I)xm | ≤ (A − μm I)xm 2 ,

using again that xm+1 2 = 1. Equality can hold only if it holds in both of the in-
equalities above. The first inequality means μm+1 = μm since the Rayleigh quo-
tient gives the unique minimum of the above function f . The second equality
is an equality in the Cauchy–Schwarz inequality. Hence, we must in this case
have that (A − μm I)xm is a multiple of xm+1 , say (A − μm I)xm = αm xm+1 with
αm  0. This shows that
\[
x_{m+1} = \frac{(A-\mu_mI)^{-1}x_m}{\|(A-\mu_mI)^{-1}x_m\|_2} \tag{5.16}
\]


implies that
αm
(A − μm I)2 xm = αm (A − μm I)xm+1 = xm ,
(A − μm I)−1 xm 2

proving that xm is indeed an eigenvector of (A − μm I)2 . 

Next, let us come back to the claimed fast convergence of the Rayleigh method.
We will show that if the method converges it converges cubically. We need,
however, the following auxiliary result.

Lemma 5.18 Let A ∈ Rn×n be symmetric with eigenvalues λ1 , . . . , λn ∈ R and


associated orthonormal eigenvectors w1 , . . . , wn . Let λ be one of the eigenval-
ues and let J := { j ∈ {1, . . . , n} : λ j  λ}. If u ∈ span{w j : j ∈ J} with u2 = 1
and if μ ∈ R is not an eigenvalue of A then
\[
\|(A-\mu I)^{-1}u\|_2 \le \frac{1}{\min_{j\in J}|\lambda_j - \mu|}.
\]
Proof We can expand u as u = \sum_{j\in J} \alpha_j w_j. Since ‖u‖_2 = 1, we have \sum_{j\in J} \alpha_j^2 = 1 and hence
\[
\|(A-\mu I)^{-1}u\|_2 = \Bigl(\sum_{j\in J}\frac{\alpha_j^2}{(\lambda_j-\mu)^2}\Bigr)^{1/2}
\le \frac{1}{\min_{j\in J}|\lambda_j-\mu|}. \qquad\Box
\]

Theorem 5.19 Let A ∈ Rn×n be symmetric and let {xm }, {μm } be the sequences
generated by the Rayleigh iteration. Assume that {xm } converges to an eigen-
vector w ∈ Rn with w2 = 1 associated with the eigenvalue λ ∈ R of A. Then,
the convergence of {xm } is cubic, i.e. there is a constant C > 0 such that

\[
\|x_{m+1} - w\|_2 \le C\|x_m - w\|_2^3, \qquad m = 0, 1, 2, \ldots.
\]

Furthermore, there is a constant c > 0 such that

\[
|\mu_m - \lambda| \le c\|x_m - w\|_2^2, \qquad m = 0, 1, 2, \ldots.
\]

Proof The proof is based on the idea of studying the behaviour of the angle
φm between xm and w. Since both vectors have norm one, the angle is given by
cos φm = xm , w 2 . From

xm − w22 = xm 22 + w22 − 2 xm , w 2 = 2(1 − cos φm ) = 4 sin2 (φm /2) (5.17)

we see that it suffices to show that 2 sin(φm /2) or, equivalently, sin φm con-
verges cubically. As we have Rn = span{w}⊕span{w}⊥ , we can find for each xm
a vector um ∈ Rn with um 2 = 1 and uTm w = 0 such that xm = αm w+βm um with


αm = w, xm 2 = cos φm . From 1 = xm 22 = cos2 φm + β2m we find βm = sin φm ,


i.e. we have the representation
xm = cos φm w + sin φm um . (5.18)
Multiplying this by (A − μm I)−1 yields
cos φm
ym+1 = (A − μm I)−1 xm = w + sin φm (A − μm I)−1 um
λ − μm
and because of xm+1 = ym+1 /ym+1 2 also
cos φm sin φm
xm+1 = w+ (A − μm I)−1 um
(λ − μm )ym+1 2 ym+1 2
cos φm sin φm (A − μm I)−1 um 2 (A − μm I)−1 um
= w+ .
(λ − μm )ym+1 2 ym+1 2 (A − μm I)−1 um 2
Noticing that w is also an eigenvector of (A − μm I)−1 we find that wT (A −
μm I)−1 um = 0. Hence, we can choose um+1 as
(A − μm I)−1 um
um+1 = (5.19)
(A − μm I)−1 um 2
and comparing coefficients yields
cos φm sin φm (A − μm I)−1 um 2
cos φm+1 = , sin φm+1 = .
(λ − μm )ym+1 2 ym+1 2
From (5.18) we find Axm = λ cos φm w + sin φm Aum and hence
μm = xTm Axm = λ cos2 φm + sin2 φm uTm Aum
= λ − sin2 φm [λ − r(um )] (5.20)
with the Rayleigh quotient r(um ) = uTm Aum . Inserting this into the formula of
cos φm+1 from above yields
sin φm+1
tan φm+1 = = (λ − μm )(A − μm I)−1 um 2 tan φm
cos φm+1
= [λ − r(um )](A − μm I)−1 um 2 tan φm sin2 φm . (5.21)
Since the Rayleigh quotient is bounded from above and below we know that
the term λ − r(um ) in the above expression is bounded. To show that (A −
μm I)−1 um 2 is also bounded independently of m, we will apply Lemma 5.18.
Using the notation established in that lemma, we have to show that um ∈
span{w j : j ∈ J} for all m, where J = { j ∈ {1, . . . , n} : λ j  λ}. This is
clear if λ is a simple eigenvalue. If this is not the case then we have to work
harder. We know from (5.19) that um+1 is a scalar multiple of (A − μm I)−1 um .
As A − μm I maps span{w j : j ∈ J} onto itself, we see that um ∈ span{w j :


j ∈ J} implies um+1 ∈ span{w j : j ∈ J} and it hence suffices to show that


u0 ∈ span{w j : j ∈ J}. Let M := span{w j : j  J} and denote the orthogonal
projection from Rn to M by P M . Then, we have to show that P M (u0 ) = 0. This
is done, as follows. First of all, we show that w is a multiple of P M (x0 ). To see
this, note that if xm has the representation
  
xm = α jw j + α j w j = P M (xm ) + α jw j,
jJ j∈J j∈J

then xm+1 has, as a normalised vector, the representation


1
xm+1 = (A − μm I)−1 xm
(A − μm I)−1 xm 2
1 
= α jw j
(A − μm I)−1 xm 2 (λ − μm ) jJ
1  αj
+ wj
(A − μm I)−1 xm 2 j∈J λ j − μm

and hence we can conclude that


1
P M (xm+1 ) = P M (xm ),
(A − μm I)−1 xm 2 (λ − μm )
which shows that P M (xm ) is a scalar multiple of P M (x0 ). Since we assume that
xm converges to the eigenvector w ∈ M, we can conclude by the continuity of
the orthogonal projection that w is also a scalar multiple of P M (x0 ), i.e. there
is an α ∈ R \ {0} such that w = αP M (x0 ). Now, if x0 = cos φ0 w + sin φ0 u0
then we may assume that sin φ0  0 because otherwise x0 would already be an
eigenvector. The linearity of the orthogonal projection gives
P M (x0 ) = cos φ0 P M (w) + sin φ0 P M (u0 ) = cos φ0 w + sin φ0 P M (u0 )
= α cos φ0 P M (x0 ) + sin φ0 P M (u0 ) = P M (x0 ) + sin φ0 P M (u0 ),
where in the last step we have used that w ∈ M implies
1 = w22 = αP M (x0 )T w = αxT0 w = α cos φ0 .
Hence, we have P M (u0 ) = 0 and Lemma 5.18 can indeed be applied, yielding
1
(A − μm I)−1 um 2 ≤
min j∈J |λ j − μm |
for all m. Let us now define γ := min j∈J |λ j − λ| > 0. As xm → w implies
μm → λ, we have for sufficiently large m that min j∈J |λ j − μm | ≥ γ/2 and thus
2
(A − μm I)−1 um 2 ≤
γ


for all such m sufficiently large. Next, xm → w also implies cos φm → 1 such
that cos φm ≥ 1/2 for all sufficiently large m. This means, using (5.21), that
there is a c > 0 such that
cos φm+1
sin φm+1 ≤ |λ − r(um )|(A − μm I)−1 um 2 sin3 φm ≤ c sin3 φm
cos φm
for all sufficiently large m. As sin φm behaves like φm and hence 2 sin(φm /2)
behaves like sin φm for small φm , we see that we can conclude, with (5.17), that
also
xm+1 − w2 ≤ cxm − w32 .

Finally, combining (5.20) with (5.17) yields

|λ − μm | ≤ |λ − r(um )| sin φ2m ≤ Cxm − w22 . 

The actual proof of convergence is even more difficult. Here, we will prove
only a partial, easier result. By Proposition 5.17, we know that the sequence
{(A − μm I)xm 2 } is non-increasing and obviously bounded from below such
that it converges. Let τ be the limit of this sequence, i.e.

τ := lim (A − μm I)xm 2 . (5.22)


m→∞

Theorem 5.20 Let {xm } and {μm } be the sequences generated by the Rayleigh
iteration. Let τ be defined by (5.22). Then, the following results hold.

1. The sequence {μm } converges.


2. If τ = 0 then the sequence {xm } converges to an eigenvector of A and {μm }
converges to the corresponding eigenvalue.

Proof We start by showing that {μm } always converges, though not necessar-
ily to an eigenvalue of A. This is done in two steps. First, we will show that
μm+1 −μm converges to zero. Second, we will show that {μm } has at most a finite
number of accumulation points. From this, it follows that {μm } converges. To
see this, let {ρ1 , . . . , ρL } be the set of distinct accumulation points and assume
that L > 1. Let δ := mini j |ρi − ρ j | > 0. Then, for sufficiently large m each μm
must be within a distance of δ/4 of one of the ρi , since, otherwise, the bound-
edness of the sequence would yield another accumulation point. Hence, there
is an m0 > 0 such that we have for m ≥ m0 on the one hand |μm+1 − μm | < δ/4
and on the other hand |μm − ρ j | < δ/4 for a certain j = j(m) ∈ {1, . . . , L}. But
this means
 
|μm+1 − ρi | ≥ |ρ j − ρi | − |ρ j − μm | + |μm − μm+1 | ≥ δ − 2δ/4 = δ/2

for all i  j. But since μm+1 must also be in the δ/4 neighbourhood of an


accumulation point, this accumulation point can only be ρ j . This means that
all μm , m ≥ m0 , are within a distance δ/4 of ρ j , which contradicts the fact that
we have L > 1. Thus the sequence has only one accumulation point, i.e. it
converges.
To see that μm+1 − μm converges to zero, let us set rm := (A − μm I)xm and let us
define θm by cos θm := rTm xm+1 /rm 2 . Then,

rm+1 22 = (A − μm I)xm+1 + (μm − μm+1 )xm+1 22


= (A − μm I)xm+1 22 + 2(μm − μm+1 )xTm+1 (A − μm I)xm+1 + (μm − μm+1 )2
= (A − μm I)xm+1 22 + 2(μm − μm+1 )(μm+1 − μm ) + (μm − μm+1 )2
= |xTm (A − μm I)xm+1 |2 − (μm+1 − μm )2
= rm 22 cos2 θm − (μm+1 − μm )2 ,

where we have used ideas from the beginning of the proof of Proposition 5.17.
This means in particular

(μm+1 − μm )2 ≤ rm 22 − rm+1 22 . (5.23)

However, we know from Proposition 5.17 that the sequence {rm 2 } converges
to τ so that the right-hand side of (5.23) converges to zero, which shows μm+1 −
μm → 0 for m → ∞.
Next, we need to show that {μm } has only finitely many accumulation points.
If τ = 0 then each accumulation point has to be an eigenvalue of A and since
A has only a finite number of eigenvalues, we only have to consider the case
τ > 0. In this situation, let ρ be an accumulation point. Using rm 2 → τ and
μm+1 − μm → 0 together with

rm+1 22 = rm 22 cos2 θm − (μm+1 − μm )2

allows us to conclude that cos2 θm → 1. With (5.16) we can conclude from


1
rm 2 cos θm = rTm xm+1 = >0 (5.24)
(A − μm I)−1 xm 2
that even cos θm → 1 for m → ∞. This, however, lets us conclude that

rm − rm 2 xm+1 22 = 2rm 22 (1 − cos θm ) → 0.

Finally, combining (5.24) with (5.16) yields (A − μm I)xm+1 = rm 2 cos θm xm


and thus we have
 
[(A − μm I)2 − rm 22 cos θm I]xm 2 = (A − μm I)(rm − rm 2 xm+1 )2
≤ A − μm I2 rm − rm 2 xm+1 2
→ 0.


Now assume that {μm j } converges to our accumulation point ρ. Then, as the
sub-sequence {xm j } is also bounded, it must also have an accumulation point w
with w2 = 1. Taking a sub-sequence of the sub-sequence {m j } which we will
denote with {m j } again, we see that μm j → ρ and xm j → w. This all then leads
to [(A − ρI)2 − τ2 I]w2 = 0 and hence the accumulation point is a zero of the
polynomial p(t) = det((A − tI)2 − τ2 I), but, since any polynomial has only a
finite number of different zeros, we only have a finite number of accumulation
points. This finishes the proof of the first part.
For the second part, we assume that τ = 0. Let (λ, w) be an accumulation point
of the sequence {(μm , xm )}. Then, τ = 0 implies that (A − λI)w = 0 and, since w
has norm one, it follows that w is an eigenvector of A to the eigenvalue λ. As
we know from the first part that {μm } converges, it must converge to λ. Finally,
if {xm j } is the sequence that converges to w, then for each > 0 there is a j0
such that xm j0 − w2 < . If we choose > 0 so small that C 2 < 1 with C
from Theorem 5.19 then we have

xm j0 +1 − w2 ≤ Cxm j0 − w32 < C 3 < .

From this it immediately follows that xm − w2 < for all m ≥ m j0 , i.e. the
sequence {xm } converges to w. 

The case τ > 0 is somewhat more complicated. It is possible to show that in


this situation we have x2m → w+ and x2m+1 → w− linearly, where w+ and
w− are the bisectors of a pair of eigenvectors whose eigenvalues have mean
ρ = lim μm , though this case is unstable under perturbations of xm . Details can
be found in Parlett [100].
To finish this discussion, we want to give one more result on the Rayleigh
quotient. As we have seen in Theorem 5.19, the sequence of Rayleigh quotients
μm = r(xm ) satisfies |μm − λ| ≤ cxm − w22 , if the Rayleigh iteration converges.
Interestingly, this quadratic dependence is not restricted to the sequence {xm }
generated by the Rayleigh iteration.

Theorem 5.21 Suppose A ∈ Rn×n is symmetric. Assume that we have a se-


quence {xm } with xm 2 = 1, which converges to an eigenvector w. Then we
have
|r(xm ) − r(w)| = O(xm − w22 ), xm → w.

Proof As before let λ1 ≥ · · · ≥ λn be the eigenvalues of A and w1 , . . . , wn


be the corresponding orthonormal eigenvectors, i.e. we have Aw j = λ j w j and
wTj wk = δ jk . Without restriction, we may assume that there is a j ∈ {1, . . . , n}
such that w = w j .

The representation xm = k ck wk , suppressing the fact that ck = c(m)k depends


also on m, gives Axm = k ck λk wk and, since xm 2 = 1,

n
r(xm ) = xTm Axm = λk c2k .
k=1

This shows

n 
r(xm ) − r(w) = λk c2k − λ j = λ j (c2j − 1) + λk c2k .
k=1 k j

If we set = xm − w22 then it obviously suffices to show that c2j − 1 = O( 2 )


and ck = O( ) for k  j and m → ∞. The first property follows from

2 = xm − w22 = xm 22 + w22 − 2wT xm = 2(1 − c j )

and hence
|c2j − 1| = |(c j − 1)(c j + 1)| ≤ 2|c j − 1| = 2 ,

using c2j ≤ k c2k = 1. The second property then follows for all k  j, as we

can conclude from k c2k = 1 that

c2k ≤ c2 = 1 − c2j = |1 − c2j | ≤ 2 . 
 j

5.4 The Jacobi Method


Another iterative method goes back to Jacobi, who developed an algorithm for
dealing with the eigenvalue problem for symmetric n × n matrices, which is
still feasible if the matrices are not too big.
Certain versions of this algorithm can run in parallel (see for example Golub
and Van Loan [66]), so that it sometimes outperforms on parallel computers
even the otherwise superior QR method. It computes all eigenvalues and if
necessary also all eigenvectors. It is based upon the following easy fact.

Lemma 5.22 Let A ∈ Rn×n be a symmetric matrix. For every orthogonal ma-
trix Q ∈ Rn×n both A and QT AQ have the same Frobenius norm. If λ1 , . . . , λn
are the eigenvalues of A then
\[
\sum_{j=1}^{n} |\lambda_j|^2 = \sum_{j,k=1}^{n} |a_{jk}|^2.
\]

Proof The square of the Frobenius norm of A equals j,k |a jk |2 and this is


precisely the trace of AAT , see also Corollary 2.31. Since the trace of AB is the
same as the trace of BA for two arbitrary matrices, we can conclude that

\[
\|Q^TAQ\|_F^2 = \operatorname{tr}(Q^TAQ\,Q^TA^TQ) = \operatorname{tr}(Q^TAA^TQ)
= \operatorname{tr}(A^TQQ^TA) = \|A\|_F^2.
\]

For a symmetric matrix A there is an orthogonal matrix Q, such that QT AQ = D


is a diagonal matrix having the eigenvalues of A as diagonal entries. 

If we define the outer norm, which is not really a norm in the strict sense of
Definition 1.6, by

\[
N(A) := \sum_{j\ne k} |a_{jk}|^2,
\]

then, Lemma 5.22 gives the decomposition
\[
\sum_{j=1}^{n} |\lambda_j|^2 = \sum_{j=1}^{n} |a_{jj}|^2 + N(A). \tag{5.25}
\]

Since the left-hand side is invariant to orthogonal transformations, it is now


our goal to diminish N(A) by choosing appropriate orthogonal transformations

and thus by increasing |a j j |2 to transform A into a diagonal matrix. To this
end, we choose an element ai j  0 with i  j and perform a transformation
in the plane spanned by ei and e j , which maps ai j to zero. We can choose this
transformation in R2 as a rotation with angle α. The resulting matrix

bii bi j cos α sin α aii ai j cos α −sin α
= (5.26)
bi j b j j −sin α cos α ai j a j j sin α cos α
is a diagonal matrix if the non-diagonal element bi j satisfies

0 = bi j = ai j (cos2 α − sin2 α) + (a j j − aii ) cos α sin α


1
= ai j cos(2α) + (a j j − aii ) sin(2α).
2
From ai j  0 we can conclude that the angle α is given by
aii − a j j
cot(2α) = . (5.27)
2ai j
For the implementation, it is better to avoid the calculation of the angle α by
taking the inverse cot−1 in this formula and then its sine and cosine. Instead,
one can first only compute
aii − a j j
θ := .
2ai j

08:42:21, subject to the Cambridge


06
5.4 The Jacobi Method 155

Then, since we have the trigonometric identity tan2 α + 2 cot(2α) tan α = 1, we


see that t := tan α satisfies the quadratic equation

t2 + 2θt = 1,

which has two solutions t1,2 = −θ ± 1 + θ2 . Considering numerical issues
we should pick the one with smaller absolute value in a cancellation-avoiding
form
√ sign(θ)
t := −θ + sign(θ) 1 + θ2 = √ .
|θ| + 1 + θ2
With this, we can compute all relevant quantities without using trigonometric
functions,
1
c := √ = cos α,
1 + t2
s := ct = sin α,
s α
τ := = tan ,
1+c 2
where τ will be required later on. This solves the problem for 2 × 2 matrices.
The general case is dealt with using Givens rotations.

Definition 5.23 Let i, j ∈ {1, . . . , n} and an angle α be given. Let c = cos α


and s = sin α. A Givens rotation Gi j (α) ∈ Rn×n is a matrix of the form

Gi j (α) = I + (c − 1)(e j eTj + ei eTi ) + s(e j eTi − ei eTj )


⎛ .. .. ⎞
⎜⎜⎜ ⎟⎟⎟
⎜⎜⎜⎜ 1 . . ⎟⎟
⎜⎜⎜· · · c · · · −s · · ·⎟⎟⎟⎟⎟
⎜⎜⎜ ⎟⎟⎟
= ⎜⎜⎜⎜⎜ .. .. ⎟⎟⎟ .
⎜⎜⎜ . 1 . ⎟⎟⎟
⎜⎜⎜· · · s · · · c · · ·⎟⎟⎟⎟⎟
⎜⎜⎜ ⎟⎟⎟
⎝ .. .. ⎠
. . 1

Note that in the last equation of this definition we have assumed that j > i to
show what the Givens rotation looks like. This is, of course, not required.

Proposition 5.24 Let A ∈ Rn×n be symmetric and ai j  0 for a pair i  j.


If B = GTi j (α)AGi j (α) with α determined by (5.27) then bi j = b ji = 0 and
N(B) = N(A) − 2|ai j |2 .

Proof The first part is clear. For the second part, we use the invariance prop-
erty of the Frobenius norm twice. On the one hand, we have AF = BF . On

08:42:21, subject to the Cambridge


06
156 Calculation of Eigenvalues

the other hand, we also have equality of the norms for the small matrices in
(5.26), which means
|aii |2 + |a j j |2 + 2|ai j |2 = |bii |2 + |b j j |2 ,
since bi j = 0. From this, we can conclude that

n 
n
N(B) = B2F − |bkk |2 = A2F − |bkk |2
k=1 k=1

n  
= N(A) + |akk |2 − |bkk |2 = N(A) − 2|ai j |2 ,
k=1

since akk = bkk for all k  i, j. 


The iteration of this process gives the classical Jacobi method for computing
the eigenvalues of a symmetric matrix.
Definition 5.25 Let A ∈ Rn×n be symmetric. The classical Jacobi method
for computing the eigenvalues defines A(0) = A and then proceeds for m =
0, 1, 2, . . . , as follows
1. Determine i  j with |a(m) (m)
i j | = maxk |ak | and determine G
(m)
= Gi j .
T
2. Set A(m+1) = G(m) AG(m) .
For practical computations we will again avoid matrix multiplication and code
the transformations directly, see Algorithm 12. Furthermore, we will use the
following identities to compute the ith and jth row and column, which follow
after some trigonometric manipulations from (5.26). We have bi j = b ji = 0 and
bii = aii − tai j ,
b j j = a j j + tai j ,
bi = bi = cai + sa j = ai + s(a j − τai ),   i, j,
b j = b j = −sai + ca j = a j − s(ai + τa j ),   i, j,
where we have used the previous abbreviations s = sin α, c = cos α, t =
tan α and τ = tan(α/2), which can be computed without using trigonometric
functions, as explained above. In the implementation in Algorithm 12 we have
used two additional vectors bi and b j to store the relevant update information
temporarily before overwriting the old matrix.
Note that an element which has been mapped to zero in an earlier step might
be changed again in later steps.
Theorem 5.26 The classical Jacobi method converges linearly in the outer
norm.

08:42:21, subject to the Cambridge


06
5.4 The Jacobi Method 157

Algorithm 12: Jacobi method for computing eigenvalues


Input : A ∈ Rn×n .
Output: Approximate eigenvalues of A.
1 norm := N(A)
2 while norm > do
3 Determine (i, j) with |ai j | = maxk |ak |
4 if |ai j |  0 then
5 θ := 0.5 · (aii − a j j )/ai j

6 t := sign(θ)/(|θ| + 1 + θ · θ)

7 c := 1/ 1 + t · t, s := c · t
8 bi j := 0, b ji := 0
9 bii := aii − t · ai j , b j j := a j j + t · ai j
10 for  = 1 to n do
11 if   i and   j then
12 bi := c · ai + s · a j
13 b j := −s · ai + c · a j

14 norm := norm − 2 · ai j · ai j
15 for  = 1 to n do
16 ai := ai := bi
17 a j := a j := b j

Proof We have to look at one step of the method, i.e. at the transformation
from A to B = GTi j AGi j . Since |ai j | ≥ |ak | for all   k, we have

N(A) = |ak |2 ≤ n(n − 1)|ai j |2 .
k

Thus, we can bound the outer norm by



2 2
N(B) = N(A) − 2|ai j |2 ≤ N(A) − N(A) = 1 − N(A),
n(n − 1) n(n − 1)
which means linear convergence. 

Convergence in the outer norm does not necessarily mean convergence of the
eigenvalues, since the elements on the diagonal can still permute. However, ev-
ery convergent sub-sequence converges to a diagonal matrix having the eigen-
values of A as its diagonal entries. Another way of formulating this is as fol-
lows.

08:42:21, subject to the Cambridge


06
158 Calculation of Eigenvalues

Corollary 5.27 If λ1 ≥ λ2 ≥ · · · ≥ λn are the eigenvalues of the symmetric


matrix A and if ã(m) (m) (m)
11 ≥ ã22 ≥ · · · ≥ ãnn is an appropriate permutation of the
diagonal elements of A(m) then

|λi − ã(m)
ii | ≤ N(A(m) ) → 0, for 1 ≤ i ≤ n and m → ∞.

Proof If we define B = diag(ã(m) (m)


11 , . . . , ãnn ) then we have by Corollary 5.7 the
estimate

|λ j − ã(m)
j j | = |λ j (A) − λ j (B)| ≤ A − BF = A
(m)
− BF = N(A(m) )1/2 . 

Remark 5.28 One step of the classical Jacobi method takes O(n2 ) time to
find the maximum element and then O(n) time to update the matrix.

Since the cost for finding the maximum element is critical, there are cheaper
versions. For example, the cyclic Jacobi method visits all non-diagonal entries
regardless of their sizes in a cyclic way, i.e. the index pair (i, j) is cyclically
chosen as (1, 2), (1, 3), . . . , (1, n), (2, 3), . . . , (2, n), (3, 4), . . . and then the pro-
cess starts again with (1, 2) and so on. A transformation only takes place if
ai j  0. The cyclic version is convergent and in the case of pairwise distinct
eigenvalues even quadratically convergent if one complete cycle is considered
to be one step (see Henrici [82]).
For small values of ai j it is not efficient to perform a transformation. Thus, we
can restrict ourselves to elements ai j having a square above a certain threshold,
for example N(A)/(2n2 ). Since then

1
N(B) ≤ N(A) 1 − 2
n
we still have linear convergence in the outer norm for the cyclic Jacobi method
with thresholding.
Finally, let us have a look at the eigenvectors. Since the matrices
T T T
A(m+1) = G(m) A(m)G(m) = G(m) · · · G(1) AG(1) · · · G(m) =: QTm AQm

are, at least for sufficiently large m, almost diagonal, we see that the columns
of Qm define approximate eigenvectors of A.

Corollary 5.29 If the columns of Qm are sorted as proposed in Corollary 5.27


and if the eigenvalues of A are pairwise distinct, then the columns of Qm con-
verge to a complete basis of orthonormal eigenvectors of A. If the eigenvalues
are not pairwise distinct then every accumulation point gives an orthonormal
basis of eigenvectors.

08:42:21, subject to the Cambridge


06
5.5 Householder Reduction to Hessenberg Form 159

5.5 Householder Reduction to Hessenberg Form


In the next section, we want to use the QR factorisation of a matrix A from
Section 3.6 to compute its eigenvalues in an iterative process. As we have to use
in this process similarity transformations which do not change the eigenvalues,
our transformations have to be of the form QAQT with orthogonal Q. The first,
immediate idea, however, faces the following problem.
Suppose we choose a Householder matrix H = H(w) such that the first col-
umn of A is mapped to a multiple of the first unit vector, i.e. HAe1 = α1 e1 .
Then, HAH T will in general be full again, i.e. all the zeros created by pre-
multiplication with H are destroyed again by the post-multiplication with H.
This simply comes from the fact that multiplication from the right particularly
means that the first column is replaced by linear combinations of all columns.
Hence we have to modify this approach. To this end, let us write

a bT
A = 11
ã1 Ã

with a11 ∈ R, ã1 , b ∈ Rn−1 and à ∈ R(n−1)×(n−1) . Then, we can choose a House-
holder matrix H̃1 ∈ R(n−1)×(n−1) with H̃1 ã1 = α1 e1 , and define the orthogonal
and symmetric matrix

1 0T
H1 = .
0 H̃1

This time, pre- and post-multiplication with H1 yields



1 0T a11 bT 1 0T a bT 1 0T
H1 AH1 = = 11
0 H̃1 ã1 Ã 0 H̃1 α1 e1 H̃1 Ã 0 H̃1

a11 bT H̃1 a11 bT H̃1
= =
α1 e1 H̃1 ÃH̃1 α1 e1 A2

with A2 ∈ R(n−1)×(n−1) . This means we have kept almost all the zeros in the first
column except for the one directly under the diagonal. If we proceed like this
now, we derive in the next step a transformation of the form
⎛ ⎞
⎜⎜⎜∗ ∗ · · · ∗⎟⎟⎟
⎜⎜⎜⎜∗ ∗ ∗ ⎟
· · · ∗⎟⎟⎟⎟
⎜⎜⎜ ⎟⎟⎟
⎜⎜⎜0 ∗ ⎟⎟⎟

H2 H1 AH1 H2 = ⎜⎜⎜⎜0 0 ⎟⎟⎟ ,
⎟⎟⎟
⎜⎜⎜ ⎟⎟⎟
⎜⎜⎜ . . ⎟⎟⎟
⎜⎜⎜ .. .. ⎟⎟⎠
⎜⎝
0 0

08:42:21, subject to the Cambridge


06
160 Calculation of Eigenvalues

and continuing all the way down finally yields a representation


⎛ ⎞
⎜⎜⎜∗ · · · · · · ∗⎟⎟⎟
⎜⎜⎜ .. ⎟⎟⎟⎟
⎜⎜⎜∗ . . . . ⎟⎟⎟⎟
Hn−2 · · · H1 AH1 · · · Hn−2 = ⎜⎜⎜⎜⎜ . ⎟⎟ .
⎜⎜⎜ . . . . . .. ⎟⎟⎟⎟
.
⎜⎜⎝ ⎟⎟⎠
0 ∗ ∗

Such a matrix is almost an upper triangular matrix. We have only one sub-
diagonal band which might be different from zero. A matrix of this form is
called a Hessenberg matrix or a matrix in Hessenberg form. We can define this
even for non-square matrices.

Definition 5.30 A matrix B ∈ Rm×n is of Hessenberg form if bi j = 0 for all


i > j + 1.

Hence, the above contemplations give the following result.

Theorem 5.31 Any matrix A ∈ Rn×n can be transformed into a Hessenberg


matrix using n − 2 Householder transformations in O(n3 ) time.

Proof We only have to analyse the computational cost. In the kth step, we
have to multiply a Householder matrix Hk ∈ R(n−k)×(n−k) by the significant
part of the current matrix both from the left and from the right. According to
Lemma 3.31, this can be done in O((n−k)2 ) time. Hence, the total time required
is given by

n−2
O((n − k)2 ) = O(n3 ). 
k=1

We can easily construct the algorithm from the above proof. As we have done
before, we want to overwrite the matrix A with its Hessenberg form H =
Hn−2 · · · H1 AH1 · · · Hn−2 during the process. In step k we have the relevant in-
formation given by Hk = diag(Ik , H̃k ), where Ik ∈ Rk×k denotes the identity on
Rk and H̃k = In−k − βk uk uTk with uk = (u(k) (k) (k) T
k+1 , uk+2 , . . . , un ) is a Householder
matrix. As in the case of the QR factorisation, there is not enough space in A
to store all the relevant information.
A possible way of using the available space is as follows. We can store the
partial vector (u(k) (k)
k+2 , . . . , un ) within the free n−k−1 positions of the kth column
of A and u(k)
k+1 and βk for 1 ≤ k ≤ n − 2 as the kth component of new vectors d
and β in Rn−2 . Details are given in Algorithm 13, where lines 2 to 9 represent
the computation of H̃k and the remaining lines deal with the actual update step.

08:42:21, subject to the Cambridge


06
5.5 Householder Reduction to Hessenberg Form 161

Algorithm 13: Hessenberg reduction


Input : A ∈ Rn×n .
Output: Hessenberg form of A.
1 for k = 1 to n − 2 do
2 norm := maxk+1≤i≤n |aik |
3 if norm = 0 then dk := 0, βk := 0
4 else
5 α := 0
6 for i = k + 1 to n do
7 aik := aik /norm, α := α + aik · aik

8 α := α, βk := 1/[α(α + |ak+1,k |)]
9 dk := −sign(ak+1,k )α · norm, ak+1,k := sign(ak+1,k )α
10 for j = k + 1 to n do

11 s := βk · ni=k+1 aik · ai j
12 for i = k + 1 to n do ai j := ai j − s · aik
13 for i = 1 to n do

14 s := βk · nj=k+1 ai j · a jk
15 for j = k + 1 to n do ai j := ai j − s · a jk
16 Exchange ak+1,k and dk

At the end of the algorithm, the matrix A and the vectors d, β have the follow-
ing form
⎛ ⎞
⎜⎜⎜ h11 h12 ··· ··· h1,n−1 h1n ⎟⎟⎟
⎜⎜⎜ ⎟ ⎞ ⎞
⎜⎜⎜ h21 h22 ··· ··· h2,n−1 h2n ⎟⎟⎟⎟ ⎛
⎜⎜⎜ u(1) ⎟⎟⎟

⎜⎜⎜ β1 ⎟⎟⎟
⎜⎜⎜⎜u(1) ⎟⎟ ⎟⎟⎟ ⎟⎟⎟
h3n ⎟⎟⎟⎟ ⎜⎜⎜ ⎜⎜⎜
2
⎜⎜ 3 h32 ··· ··· h3,n−1 ⎜⎜ u(2) ⎟⎟⎟ ⎜⎜ β2 ⎟⎟⎟
⎟⎟⎟
A = ⎜⎜⎜⎜ (1) d = ⎜⎜⎜⎜⎜ ⎟⎟⎟ , β = ⎜⎜⎜⎜⎜ ⎟⎟⎟ ,
3
.. ⎟, .. ..
⎜⎜⎜u4 u(2) . ··· h4,n−1 h4n ⎟⎟⎟⎟ ⎜⎜⎜ . ⎟⎟⎟ ⎜⎜⎜ . ⎟⎟⎟
⎜⎜⎜ ⎟ ⎟⎟⎠ ⎟⎟⎠
.. ⎟⎟⎟⎟ ⎜⎝ ⎜⎝
4
⎜⎜⎜ .. .. .. .. ..
u(n−2) βn−2
⎜⎜⎜ . . . . . . ⎟⎟⎟⎟ n−1
⎝ (1) ⎠
un u(2)
n ··· u(n−2)
n hn,n−1 hnn

from which all the relevant information can be retrieved.


We now want to have a closer look at the situation when A ∈ Rn×n is symmetric.
In this case its Hessenberg form must be a tridiagonal matrix. It is hence quite
natural to assume that the reduction of a symmetric matrix to its Hessenberg
form is significantly cheaper than the general case.
Let us assume that A = AT ∈ Rn×n has already been transformed using the k − 1

08:42:21, subject to the Cambridge


06
162 Calculation of Eigenvalues

Householder matrices H1 , . . . , Hk−1 , i.e. we have


⎛ ⎞
⎜⎜⎜ OT ⎟⎟⎟
⎜ A ⎟
Hk−1 · · · H1 AH1 · · · Hk−1 = ⎜⎜⎜⎜⎜ k aTk ⎟⎟⎟⎟⎟ ,
⎝ ⎠
O ak Ck
where Ak is a k × k tridiagonal matrix, Ck is a (n − k) × (n − k) matrix, ak ∈ Rn−k
and O represents a zero matrix of dimension (n − k) × (k − 1). Then, we can
pick a Householder matrix H̃k ∈ R(n−k)×(n−k) , set Hk := diag(Ik , H̃k ) and derive
⎛ ⎞
⎜⎜⎜ OT ⎟⎟⎟
⎜ Ak ⎟
Hk · · · H1 AH1 · · · Hk = ⎜⎜⎜⎜⎜ aTk H̃k ⎟⎟⎟⎟⎟ .
⎝ ⎠
O H̃k ak H̃k Ck H̃k
Obviously, we will pick H̃k such that H̃k ak is again a multiple of the first unit
vector. We can now use the symmetry to compute the product H̃k Ck H̃k effi-
ciently. Assume that H̃k = I − βk uk uTk with βk = 2/uk 22 . If we now define the
vector pk := βk Ck uk , we find
H̃k Ck H̃k = [I − βk uk uTk ]Ck [I − βk uk uTk ]
= [I − βk uk uTk ][Ck − pk uTk ]
= Ck − uk pTk − pk uTk + βk uTk pk uk uTk
= Ck − uk wTk − wk uTk
with
βk uTk pk
wk = pk − uk .
2
Using this, we can derive an algorithm as given in Algorithm 14. Here, we
store the diagonal elements and the sub-diagonal elements in two vectors δ =
(δ1 , . . . , δn )T ∈ Rn and γ = (γ1 , . . . , γn−1 )T ∈ Rn−1 . The information on the
Householder transformation H̃k = I − βk uk uTk can be stored in the kth column
of A by storing βk as the kth diagonal element and uk ∈ Rn−k in the part of the
kth column below the diagonal.

5.6 The QR Algorithm


This method aims at computing all eigenvalues of a given matrix A ∈ Rn×n
simultaneously. It benefits from the Hessenberg form of a matrix and can be
accelerated by shifts. It is extremely efficient, particularly for symmetric ma-
trices. It is also possible to derive the eigenvectors by accumulating certain
transformation matrices. It goes back to Francis [56, 57].

08:42:21, subject to the Cambridge


06
5.6 The QR Algorithm 163

Algorithm 14: Hessenberg reduction of a symmetric matrix


Input : A = AT ∈ Rn×n .
Output: Tridiagonal matrix.
1 for k = 1 to n − 2 do
2 δk := akk
3 norm := maxk+1≤i≤n |aik |
4 if norm = 0 then akk := 0, γk := 0
5 else
6 α := 0
7 for i = k + 1 to n do
8 aik := aik /norm, α := α + aik · aik

9 α := α, akk := 1/[α(α + |ak+1,k |)]
10 γk := −sign(ak+1,k ) · α · norm, ak+1,k := ak+1,k + sign(ak+1,k ) · α
11 s := 0
12 for i = k + 1 to n do 
 
13 γi := akk ij=k+1 ai j · a jk + nj=i+1 a ji · a jk
14 s := s + γi · aik
15 s := 0.5 · akk · s
16 for i = k + 1 to n do γi := γi − s · aik
17 for i = k + 1 to n do
18 for j = k + 1 to i do
19 ai j := ai j − aik · γ j − a jk · γi

20 δn−1 := an−1,n−1 , δn := ann , γn−1 := an,n−1

Definition 5.32 (QR method) Let A0 := A ∈ Rn×n . For m = 0, 1, . . . decom-


pose Am in the form Am = Qm Rm with an orthogonal matrix Qm and an upper
triangular matrix Rm . Then, form the swapped product

Am+1 := Rm Qm .

Since QTm Am = Rm we obviously have

Am+1 = QTm Am Qm , (5.28)

showing that all matrices Am have the same eigenvalues as A0 = A.


For the analysis of the convergence of the QR method, we need to recall that the
QR factorisation of a matrix is unique if the signs of the diagonal elements of

08:42:21, subject to the Cambridge


06
164 Calculation of Eigenvalues

R are prescribed, see Theorem 3.35. This means particularly that the mapping
A → Q is continuous.

Lemma 5.33 For A ∈ Rn×n invertible let A = QR be the unique QR factori-


sation with orthogonal Q ∈ Rn×n and upper triangular R ∈ Rn×n with positive
diagonal entries. Then, the mapping A → Q is continuous.

Proof We start by looking at the Cholesky factorisation of a symmetric and


positive definite matrix B ∈ Rn×n , i.e. B = LLT with a lower triangular matrix
L. We will show that this factorisation is unique and that the mapping B → L
is continuous.
In the process of deriving the Cholesky factorisation, we have seen that the
LU factorisation can be written as B = LDLT with a diagonal matrix D with
positive elements. This means that the factors L and U = DLT are uniquely
determined by the uniqueness of the LU factorisation. Hence, D is uniquely
determined and if we take the positive square root, we see that L̃ = LD1/2 is
the unique L matrix with positive elements of the Cholesky factorisation of B.
To see that B → L is continuous we use induction by n. For n = 1, we have

B = (β) with β > 0 and L = ( β) which obviously depends continuously on β.
Hence, let us assume that the statement is true for all symmetric and positive
definite (n − 1) × (n − 1) matrices. We decompose B and L into

β bT γ 0T
B= , L=
b B̂  L̂

with B̂ ∈ R(n−1)×(n−1) symmetric and positive definite, L̂ ∈ R(n−1)×(n−1) lower


triangular with positive diagonal elements and b,  ∈ Rn−1 , β, γ ∈ R. With this

notation the equation B = LLT immediately gives γ = β and  = b/γ =

b/ β which shows that both depend continuously on B. We also have B̂ =
L̂L̂T + T so that B → L̂L̂T = B̂ − T = B̂ − bbT /β maps continuously into the
set of symmetric and positive definite (n−1)×(n−1) matrices. By induction we
know that the mapping L̂L̂T → L̂ is continuous and hence the mapping B → L
is continuous.
Finally, let A ∈ Rn×n be a non-singular matrix with unique QR factorisation
A = QR, where R has positive diagonal entries. Then the matrix B = AT A is
symmetric and positive definite and has hence a unique Cholesky factorisation
AT A = LLT = R̂T R̂ with a unique upper triangular matrix R̂ = LT with positive
diagonal elements. From this uniqueness and AT A = (QR)T (QR) = RT R we
can conclude that R = R̂ depends continuously on A. Finally, A → Q = AR−1
is also continuous, as R → R−1 obviously is. 

08:42:21, subject to the Cambridge


06
5.6 The QR Algorithm 165

Obviously, the proof of the previous lemma also showed the uniqueness of the
Cholesky factorisation of a symmetric and positive definite matrix.

Lemma 5.34 Let D := diag(d1 , . . . , dn ) ∈ Rn×n be a diagonal matrix with

|d j | > |d j+1 | > 0, 1 ≤ j ≤ n − 1,



and let L = (i j ) ∈ Rn×n be a normalised lower triangular matrix. Let Lm
denote the lower triangular matrix with entries i j di /d j for i ≥ j. Then, we
m m

have
∗ m
Dm L = Lm D , m ∈ N0 .

Furthermore, Lm converges linearly to the identity matrix for m → ∞. The
asymptotic error coefficient is at most max2≤i≤n |di |/|di−1 |, i.e.
"" " m
∗ " di ""
Lm − I∞ ≤ max "" " L − I∞ .
2≤i≤n di−1 "


Proof The identity Dm L = Lm D immediately follows from the fact that mul-
tiplication by a diagonal matrix from the left means scaling the rows while
multiplication from the right means scaling the columns. Moreover, as L is
normalised, we have
" "" " "
 i "" " i−1 "" m ""
"  d m
" " d
""i j im """
∗ i j
Lm − I∞ = max "" mi − 1"" = max
1≤i≤n
j=1
" jd " 2≤i≤n
j=1
" dj "
"" " i−1 "" "
" di ""m  " di ""m
≤ max " " " |i j | ≤ max " " " L − I∞ ,
2≤i≤n di−1 " 2≤i≤n di−1 "
j=1

from which the convergence result follows. 

With this result at hand, we can now prove convergence of the QR method
for computing the eigenvalues of a matrix under some additional assumptions.
These assumptions mean in particular that A has only simple eigenvalues. For
a real matrix this also means that all eigenvalues are real. The latter is not too
surprising, as in the situation of complex eigenvalues, Theorem 3.39 tells us
that we cannot expect to find an orthogonal matrix Q ∈ Rn×n with QT AQ being
upper triangular.

Theorem 5.35 Assume that the eigenvalues of the matrix A ∈ Rn×n can be
sorted as |λ1 | > |λ2 | > · · · > |λn | > 0. Let T be the matrix of corresponding
eigenvectors of A. Assume that T −1 possesses an LU factorisation without piv-
oting. Then, the matrices Am = (a(m) i j ) created by the QR algorithm satisfy the
following conditions.

08:42:21, subject to the Cambridge


06
166 Calculation of Eigenvalues

• The sub-diagonal elements converge to zero, i.e. a(m)


i j → 0 for m → ∞ for
all i > j.
• The diagonal elements converge to the eigenvalues, i.e. a(m)
ii → λi for m →
∞ for all 1 ≤ i ≤ n.
• The sequences {A2m } and {A2m+1 } converge to upper triangular matrices.

Furthermore, the sequence {Qm } converges to an orthogonal diagonal matrix,


i.e. to an diagonal matrix having only 1 or −1 on the diagonal.

Proof We already know that all generated matrices Am have the same eigen-
values as A. If we use the notation Rm...0 = Rm · · · R0 and Q0...m = Q0 · · · Qm ,
then we have

Am = QTm−1 Am−1 Qm−1 = QTm−1 QTm−2 Am−2 Qm−2 Qm−1 = · · · = QT0...m−1 AQ0...m−1 .

By induction we can also show that the powers of A have a similar decomposi-
tion. To be more precise, we have Am = Q0...m−1 Rm−1...0 . This is clear for m = 1
and for m + 1 it follows via

Am+1 = AAm = AQ0...m−1 Rm−1...0 = Q0...m−1 Am Rm−1...0 = Q0...m Rm...0 .

However, this QR factorisation Am = Q0...m−1 Rm−1...0 is, by Theorem 3.35,


unique provided that Am is non-singular and that we assume that all upper
triangular matrices Ri have, for example, positive diagonal elements.
If we define D = diag(λ1 , . . . , λn ), we have the relation AT = T D, which leads
to Am = T Dm T −1 . By assumption, we have an LU factorisation of T −1 , i.e.
T −1 = LR with a normalised lower triangular matrix L and an upper triangular
matrix R. This gives to
Am = T Dm LR.

By Lemma 5.34, there is a sequence {Lm } of lower triangular matrices, which
∗ m
converges to I and which satisfy D L = Lm
m
D , which yields the representation
∗ m
Am = T Lm D R.

Using a QR factorisation T = Q̃R̃ of T with positive diagonal elements in R̃


finally gives
∗ m
Am = Q̃R̃Lm D R. (5.29)
∗ ∗
Since the matrices Lm converge to I, the matrices R̃Lm have to converge to R̃. If
we now compute a QR factorisation of R̃Lm as R̃Lm = Q∗∗
∗ ∗ ∗∗
m Rm , again with posi-
tive diagonal elements in R∗∗ ∗∗ ∗∗
m then we can conclude that Qm Rm converges to R̃

08:42:21, subject to the Cambridge


06
5.6 The QR Algorithm 167

so that Lemma 5.33 shows that Q∗∗m converges to the identity matrix. Equation
(5.29) can hence be rewritten as

Am = Q̃Q∗∗ ∗∗ m
m Rm D R.

Next, let us introduce diagonal matrices Δm := diag(s(m) (m)


1 , . . . , sn ) with entries
s(m)
i := sign(λmi rii ), where rii are the diagonal elements of R. Then, because of
Δm = I, we can rewrite the representation of Am as
2

Am = (Q̃Q∗∗ ∗∗ m
m Δm )(Δm Rm D R). (5.30)

Since the signs of the diagonal elements of the upper triangular matrix R∗∗ m
mD R
coincide with those of D R and hence with those of Δm , we see that (5.30) is a
m

QR factorisation of Am with positive diagonal elements in the upper triangular


matrix.
If we compare this QR factorisation to the factorisation Am = Q0...m−1 Rm−1...0
we can thus conclude that

Q0...m−1 = Q̃Q∗∗
m Δm . (5.31)

Since Q∗∗ m converges to the identity matrix, the matrices Q0...m must become
constant for sufficiently large m and up to the sign of the columns. From Am =
QT0...m−1 AQ0...m−1 and (5.31) we can derive

Am = Q−1 ∗∗−1 −1 ∗∗
0...m−1 A Q0...m−1 = Δm Qm Q̃ A Q̃Qm Δm
= Δm Q∗∗−1 R̃ T −1
A T R̃−1 Q∗∗
0123 m Δm .
0123
m
0123
→I =D →I

If m tends to infinity then, because of Δ2m = Δ0 and Δ2m+1 = Δ1 , we can con-


clude the convergence of {A2m } and {A2m+1 } to Δ0 R̃DR̃−1 Δ0 and Δ1 R̃DR̃−1 Δ1 ,
respectively. Since both limit matrices have the form
⎛ ⎞
⎜⎜⎜λ1 ∗ · · · ∗ ⎟⎟⎟
⎜⎜⎜⎜ .. ⎟⎟⎟⎟
⎜⎜⎜
⎜⎜⎜ λ n . ⎟⎟⎟⎟
⎜⎜⎜ ⎟⎟⎟
..
⎜⎜⎜ . ∗ ⎟⎟⎟⎟⎟
⎝ ⎠
λn

we have also proven the first two statements.


Finally, using the convergence of Q∗∗
m to the identity matrix again and

Qm = Q−1 −1 ∗∗ −1 −1 ∗∗
0...m−1 Q0...m = Δm (Qm ) Q̃ Q̃Qm+1 Δm+1

we see that Qm converges to Δ1 . 

08:42:21, subject to the Cambridge


06
168 Calculation of Eigenvalues

There are several things to mention here. Obviously, we have the following
computational cost. In each step, we have to compute a QR factorisation, which
will cost us O(n3 ) time. Then, we have to form the new iteration Am+1 which is
done by multiplying two full matrices, which also costs O(n3 ) time. Both are
too expensive if a large number of steps is required or the matrix dimension n
is too large.
As already indicated in the last section, it is possible to improve this perfor-
mance significantly by reducing the matrix A first to its Hessenberg form. This
still may cost O(n3 ) time but it is required only once.
After this pre-processing step, it is important to see that the QR algorithm re-
spects the specific form and that we can compute both the QR factorisation and
the product RQ in each step efficiently. We will explain how this works with
special emphasis on symmetric Hessenberg matrices, i.e. tridiagonal matrices.

Theorem 5.36 Suppose Am ∈ Rn×n is a symmetric tridiagonal matrix. Then,


it is possible to compute the QR factorisation of Am = Qm Rm in O(n) time.
The next iteration of the QR algorithm Am+1 = Rm Qm is also a symmetric
tridiagonal matrix and can be computed in linear time, as well.

Proof The symmetry of Am+1 follows immediately from (5.28). Suppose Am


has the form
⎛ ⎞
⎜⎜⎜∗ ∗ 0 · · · 0⎟⎟

⎜⎜⎜ .. ⎟⎟⎟⎟
⎜⎜⎜∗ ∗ ∗ . ⎟⎟⎟
⎜⎜⎜⎜ ⎟⎟⎟
Am = ⎜⎜⎜⎜⎜0 . . . . . . . . . 0⎟⎟⎟⎟ .
⎜⎜⎜ ⎟⎟⎟
⎜⎜⎜ .. ⎟⎟⎟
⎜⎜⎜ . ∗ ∗ ∗⎟⎟⎟⎟
⎝ ⎠
0 ··· 0 ∗ ∗
We will again proceed in an iterative way. We first concentrate on the upper-
left 2 × 2 block of A and determine a 2 × 2 Householder matrix H̃(1, 2), which
transforms the vector (a11 , a21 )T into a multiple of (1, 0). This matrix H̃(1, 2)
can be “filled-up” to give an n × n orthogonal matrix H(1, 2) in the usual way.
Important here is that the computation of H(1, 2)A costs only constant time
since it involves only changing the first two rows and, because of the tridiago-
nal structure, only the first three columns. Continuing like this, we can use the
index pairs (2, 3), (3, 4), . . . , (n − 1, n) to create the QR factorisation

R = H(n − 1, n)H(n − 2, n) · · · H(1, 2)A,

such that, because of the symmetry of the Householder matrices,

Q = H(1, 2)H(2, 3) · · · H(n − 1, n).

08:42:21, subject to the Cambridge


06
5.6 The QR Algorithm 169

In each step, at most six entries are changed. Thus the computational cost is
O(n).
This can be visualised as follows:
⎛ ⎞ ⎛ ⎞ ⎛ ⎞
⎜⎜⎜∗ ∗ 0 0 0⎟⎟⎟ ⎜⎜⎜∗ ∗ ∗ 0 0⎟⎟⎟ ⎜⎜⎜∗ ∗ ∗ 0 0⎟⎟⎟
⎜⎜⎜⎜∗ ∗ ∗ 0 0⎟⎟⎟⎟ ⎜⎜ 0 0⎟⎟⎟⎟
⎟ ⎜⎜⎜
∗ ∗ 0⎟⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟ (1,2) ⎜⎜⎜⎜⎜0 ∗ ∗
⎟⎟⎟ (2,3) ⎜⎜⎜⎜⎜
0 ∗
⎟⎟
⎜⎜⎜0 ∗ ∗ ∗ 0⎟⎟⎟ → ⎜⎜⎜0 ∗ ∗ ⎟
∗ 0⎟⎟ → ⎜⎜0 ⎜ 0 ∗ ∗ 0⎟⎟⎟⎟
⎜⎜⎜ ⎟ ⎜⎜⎜ ⎜⎜⎜
⎜⎜⎜0 0 ∗ ∗ ∗⎟⎟⎟⎟⎟ ⎜⎜⎜0 0 ∗ ∗ ∗⎟⎟⎟⎟⎟

⎜⎜⎜0 0 ∗ ∗ ∗⎟⎟⎟⎟⎟

⎜⎝ ⎟⎠ ⎜⎝ ⎠ ⎝ ⎠
0 0 0 ∗ ∗ 0 0 0 ∗ ∗ 0 0 0 ∗ ∗
⎛ ⎞ ⎛ ⎞
⎜⎜⎜∗ ∗ ∗ 0 0⎟⎟⎟ ⎜⎜⎜∗ ∗ ∗ 0 0⎟⎟⎟
⎜⎜⎜ ⎟⎟⎟ ⎜⎜⎜ ⎟
⎜⎜0 ∗ ∗ ∗ 0⎟⎟ ∗ ∗ ∗ 0⎟⎟⎟⎟
⎟⎟⎟ (4,5) ⎜⎜⎜⎜⎜
0
(3,4) ⎜ ⎜ ⎟⎟
→ ⎜⎜⎜⎜0 0 ∗ ∗ ∗⎟⎟⎟ → ⎜⎜⎜0 0 ∗ ∗ ∗⎟⎟⎟⎟ .
⎜⎜⎜ ⎟ ⎜⎜⎜ ⎟
⎜⎜⎜0 0 0 ∗ ∗⎟⎟⎟⎟⎟ ⎜⎜⎜0 0 0 ∗ ∗⎟⎟⎟⎟⎟
⎝ ⎠ ⎝ ⎠
0 0 0 ∗ ∗ 0 0 0 0 ∗

Next, to compute the new iteration Am+1 = Rm Qm = Rm H(1, 2)H(2, 3) · · · H(n−


1, n) we see that the first matrix H(1, 2) acts only on the first two columns, the
next matrix H(2, 3) on columns 2 and 3 and so on:
⎛ ⎞ ⎛ ⎞ ⎛ ⎞
⎜⎜⎜∗ ∗ ∗ 0 0⎟⎟⎟ ⎜⎜⎜∗ ∗ ∗ 0 0⎟⎟⎟ ⎜⎜∗ ∗ ∗ 0 0⎟⎟⎟

⎜⎜⎜⎜0 ∗ ∗ ∗ 0⎟⎟⎟⎟ ⎜⎜
∗ ∗ ∗ 0⎟⎟⎟⎟
⎟ ⎜⎜⎜
∗ ∗ ∗ ∗ 0⎟⎟⎟⎟

⎜⎜⎜ ⎟⎟ (1,2) ⎜⎜⎜⎜⎜∗
⎟ ⎟⎟⎟ (2,3) ⎜⎜⎜⎜⎜ ⎟⎟
⎜⎜⎜0 0 ∗ ∗ ∗⎟⎟⎟ → ⎜⎜⎜0 0 ∗ ∗ ∗⎟⎟⎟ → ⎜⎜⎜0 ∗ ∗ ∗ ∗⎟⎟⎟⎟
⎜⎜⎜ ⎟ ⎜⎜⎜ ⎜⎜⎜
⎜⎜⎜0 0 0 ∗ ∗⎟⎟⎟⎟⎟ ⎜⎜⎜0 0 0 ∗ ∗⎟⎟⎟⎟⎟

⎜⎜⎜0 0 0 ∗ ∗⎟⎟⎟⎟⎟

⎜⎝ ⎟⎠ ⎜⎝ ⎠ ⎝ ⎠
0 0 0 0 ∗ 0 0 0 0 ∗ 0 0 0 0 ∗
⎛ ⎞ ⎛ ⎞
⎜⎜⎜∗ ∗ ∗ ∗ 0⎟⎟⎟ ⎜⎜⎜∗ ∗ ∗ ∗ ∗⎟⎟⎟
⎜⎜⎜ ⎟ ⎜⎜⎜∗ ∗ ∗ ∗ ∗⎟⎟⎟⎟
⎜⎜∗ ∗ ∗ ∗ 0⎟⎟⎟⎟
(3,4) ⎜ ⎜ ⎟⎟⎟ (4,5) ⎜⎜⎜⎜⎜ ⎟⎟⎟
→ ⎜⎜⎜⎜0 ∗ ∗ ∗ ∗⎟⎟⎟ → ⎜⎜⎜0 ∗ ∗ ∗ ∗⎟⎟⎟⎟ .
⎜⎜⎜ ⎟ ⎜⎜⎜ ⎟⎟
⎜⎜⎜⎝0 0 ∗ ∗ ∗⎟⎟⎟⎟⎟ ⎜⎜⎜⎝0 0 ∗ ∗ ∗⎟⎟⎟⎟⎠

0 0 0 0 ∗ 0 0 0 ∗ ∗

The new matrix is indeed a Hessenberg matrix again. But since it is also sym-
metric, all entries above the super-diagonal have to be zero. This ensures that
the new matrix is a tridiagonal matrix and that the computation can again be
done in linear time. 

Let us note two consequences, which follow from the above proof immediately.
The first one regards tridiagonal matrices.

Corollary 5.37 If A ∈ Rn×n is a tridiagonal matrix, then the upper triangular


matrix R ∈ Rn×n has at most three non-zero entries in each row, which are
rii , ri,i+1 and ri,i+2 .

08:42:21, subject to the Cambridge


06
170 Calculation of Eigenvalues

The second consequence will be important later on. The algorithm described
above yields for a non-symmetric Hessenberg matrix the following.

Corollary 5.38 If A ∈ Rn×n is a Hessenberg matrix then its QR factorisation


can be computed in O(n2 ) time.

Proof This follows from the proof of the previous theorem. The only differ-
ence is that, in each step, we have to apply the 2 × 2 matrices H(i, i + 1) to all
entries in rows i and i + 1 and columns i + 1 ≤ k ≤ n. 

A further improvement can be achieved by introducing shifts into the algo-


rithm. Instead of decomposing the matrix Am , we now decompose the matrix
Am − μm I with a certain shift parameter μm ∈ R. Thus, one step of the QR
method with shifts consists of

• form the QR factorisation of Am − μm I = Qm Rm ,


• set Am+1 = Rm Qm + μm I.

From QTm Am − μm QTm = Rm it again follows easily that

Am+1 = QTm Am Qm − μm QTm Qm + μm I = QTm Am Qm = · · · = QT0...m AQ0...m ,

such that all of the matrices involved have again the same eigenvalues as A. It
also follows that

Q0...m Rm = Q0...m−1 Qm Rm = Q0...m−1 (Am − μm I)


= AQ0...m−1 − μm Q0...m−1
= (A − μm I)Q0...m−1 .

If we form the transpose of this equation we can conclude that

Q0...m−1 RTm = (AT − μm I)Q0...m .

Hence, if we define ρm := eTn Rm en to be the bottom diagonal element of Rm


then
Q0...m−1 RTm en = ρm Q0...m−1 en = (AT − μm I)Q0...m en ,

and with v(m+1) := Q0...m en we derive

ρm v(m) = (AT − μm I)v(m+1) ,

which is one step of the inverse iteration. Thus, in this case it is possible to
apply the convergence results on the power method and the inverse iteration to
derive convergence results for the smallest eigenvalue of A and an associated
eigenvector. In the best case, it is possible to establish cubic convergence again.

08:42:21, subject to the Cambridge


06
5.7 Computing the Singular Value Decomposition 171

In particular, if we simply use the shift a(m−1)


n,n , we will, in almost all cases, have
rapid convergence leading to

T 0
Am = Tm ,
0 λ

where T m ∈ R(n−1)×(n−1) is tridiagonal and symmetric, and λ is an eigenvalue


of A1 . Inductively, once this form has been found, the QR algorithm with shift
can be concentrated only on the (n − 1) × (n − 1) leading sub-matrix T m . This
process is called deflation.
Taking this all into account, the final algorithm for calculating the eigenvalues
of an n × n symmetric matrix is given in Algorithm 15.

Algorithm 15: QR algorithm with shifts


Input : A = AT ∈ Rn×n .
Output: Approximate eigenvalues and eigenvectors.
1 Reduce A to tridiagonal form
2 for k = n downto 2 do
3 while |ak−1,k | > do
4 Compute QR = A − akk I
5 Set A := RQ + akk I
6 Record eigenvalue λk = akk
7 Replace A by its leading k − 1 by k − 1 sub-matrix
8 Record eigenvalue λ1 = a11

5.7 Computing the Singular Value Decomposition


In this section we want to discuss numerical methods to compute the singular
value decomposition
A = UΣV T (5.32)

of a matrix A ∈ Rm×n with m ≥ n. Here, the matrices U ∈ Rm×m and V ∈ Rn×n


are orthogonal and Σ ∈ Rm×n is a diagonal matrix with diagonal entries σi ≥ 0
for 1 ≤ i ≤ n. We restrict ourselves to the case m ≥ n since it is the more
important case and since in the the case m < n we can simply consider AT
instead of A.

08:42:21, subject to the Cambridge


06
172 Calculation of Eigenvalues

As mentioned before, the factorisation (5.32) is also called the full singu-
lar value decomposition since it computes square matrices U ∈ Rm×m and
V ∈ Rn×n . However, the essential information is already given by a reduced
singular value decomposition, see Section 1.5. Sometimes a reduced singular
value decomposition for A ∈ Rm×n with m ≥ n employs matrices U ∈ Rm×n
with U T U = I, V ∈ Rn×n orthogonal and a diagonal matrix Σ ∈ Rn×n with
A = UΣV T . In any case, we have

AT A = VΣT ΣV T

and S := ΣT Σ ∈ Rn×n is a diagonal matrix with diagonal entries σ2j , 1 ≤ i ≤ n,


and that the singular values σi are the square roots of the positive eigenvalues
of AT A.
This, of course, immediately gives rise to a numerical scheme. We could first
compute the matrix à := AT A and then use one of the methods of the previous
sections to compute the eigenvalues and eigenvectors of Ã. Taking the square
roots of the eigenvalues gives us the singular values. Moreover, if we have the
eigenvalue decomposition à = VS V T we can solve for UΣ = AV, for example
using a QR factorisation to compute the missing orthogonal matrix U.
This is, unfortunately, though quite obvious, not advisable. First of all, the
computational cost expended to form the matrix AT A is already O(mn2 ), but
this is not the main problem. The main problem is that the condition number
will be the square of the condition number of the matrix A. It can be shown that
this leads to problems, particularly for computing the small singular values of
A, see for example Trefethen and Bau [124].
Hence, the general idea is to use the fact that the singular values are the positive
roots of the positive eigenvalues of AT A without computing AT A.
As in the case of computing the eigenvalues of a symmetric matrix using the
QR method, we will achieve this by first transferring our matrix to a more
suitable form. This time, it will be a bidiagonal form.

Definition 5.39 A matrix B = (bi j ) ∈ Rm×n with m ≥ n is in (upper) bidiago-


nal form if bi j = 0 for all j  i, i + 1, i.e. B is of the form

⎛ ⎞
⎜⎜⎜∗ ∗ ⎟⎟⎟
⎜⎜⎜⎜ ∗ ∗
⎟⎟⎟
⎟⎟⎟
⎜⎜⎜ ⎟⎟⎟
⎜⎜⎜ .. .. ⎟⎟⎟
⎜ . .
B = ⎜⎜⎜⎜ ⎟⎟⎟ .
⎜⎜⎜ ∗ ∗⎟⎟⎟⎟
⎜⎜⎜ ⎟
⎜⎜⎜
⎜⎝ ∗⎟⎟⎟⎟⎟

08:42:21, subject to the Cambridge


06
5.7 Computing the Singular Value Decomposition 173

To compute the bidiagonal form of A we proceed similarly to what we did to


compute the Hessenberg form of a matrix. However, this time we are more
flexible. In the case of constructing the Hessenberg form of A, we had to use
transformations of the form QT AQ with Q orthogonal to keep the same eigen-
values. Here, we want to keep the same singular values, which allows us to use
transformations of the form U T AV with orthogonal matrices U and V since we
have:
Remark 5.40 Let A ∈ Rm×n with m ≥ n be given. If B ∈ Rm×n is defined by
B = U T AV with orthogonal matrices U ∈ Rm×m and V ∈ Rn×n then A and B
have the same singular values.
Proof The singular values of B are the square roots of the positive eigenvalues
of BT B. Since
BT B = (U T AV)T (U T AV) = V T AT UU T AV = V T AT AV,
BT B and AT A have the same positive eigenvalues. 
Hence, we can start with a Householder matrix U1 ∈ Rm×m which transforms
the first column of A into a multiple of the first unit vector. We then want
to choose a Householder matrix V1 to transform the first row of U1 A into a
multiple of the transpose of the first unit vector. However, as explained in the
context of deriving the Hessenberg form, this would destroy the previously
created zeros in the first column. Hence, we have to transform the row vector
consisting of ((U1 A)12 , . . . (U1 A)1n ) into a multiple of the first row unit vector
in Rn−1 . Continuing this way, this looks schematically like
⎛ ⎞ ⎛ ⎞ ⎛ ⎞ ⎛ ⎞
⎜⎜⎜∗ ∗ ∗ ∗⎟⎟⎟ ⎜⎜⎜∗ ∗ ∗ ∗⎟⎟⎟ ⎜⎜⎜• ∗ 0 0⎟⎟⎟ ⎜⎜⎜• • 0 0⎟⎟⎟
⎜⎜⎜⎜∗ ∗ ∗ ∗⎟⎟⎟⎟ ⎜⎜ ⎟⎟ ⎜⎜⎜⎜0 ∗ ∗ ∗⎟⎟⎟⎟ ⎜⎜ ⎟⎟
⎜⎜⎜ ⎟⎟⎟ T ⎜⎜⎜⎜⎜0 ∗ ∗ ∗⎟⎟⎟⎟⎟ ⎜⎜⎜ ⎟⎟⎟ T ⎜⎜⎜⎜⎜0 ∗ ∗ ∗⎟⎟⎟⎟⎟
⎜⎜⎜∗ ∗ ∗ ∗⎟⎟⎟ U1 ⎜⎜⎜0 ∗ ∗ ∗⎟⎟⎟ V1 ⎜⎜⎜0 ∗ ∗ ∗⎟⎟⎟ U2 ⎜⎜⎜0 0 ∗ ∗⎟⎟⎟
⎜⎜⎜ ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ ⎟
⎜⎜⎜∗ ∗ ∗ ∗⎟⎟⎟⎟⎟ → ⎜⎜⎜⎜⎜0 ∗ ∗ ∗⎟⎟⎟⎟⎟ → ⎜⎜⎜⎜⎜0 ∗ ∗ ∗⎟⎟⎟⎟⎟ → ⎜⎜⎜⎜⎜0 0 ∗ ∗⎟⎟⎟⎟⎟
⎜⎜⎜ ⎟ ⎜⎜⎜ ⎟ ⎜⎜⎜ ⎟ ⎜⎜⎜ ⎟
⎜⎜⎜∗ ∗ ∗ ∗⎟⎟⎟⎟⎟ ⎜⎜⎜0 ∗ ∗ ∗⎟⎟⎟⎟⎟ ⎜⎜⎜0 ∗ ∗ ∗⎟⎟⎟⎟⎟ ⎜⎜⎜0 0 ∗ ∗⎟⎟⎟⎟⎟
⎝ ⎠ ⎝ ⎠ ⎝ ⎠ ⎝ ⎠
∗ ∗ ∗ ∗ 0 ∗ ∗ ∗ 0 ∗ ∗ ∗ 0 0 ∗ ∗
⎛ ⎞ ⎛ ⎞ ⎛ ⎞
⎜⎜⎜• • 0 0⎟⎟⎟ ⎜⎜⎜• • 0 0⎟⎟⎟ ⎜⎜⎜• • 0 0⎟⎟⎟
⎜⎜⎜ ⎟⎟⎟ ⎜⎜⎜ ⎟⎟⎟ ⎜⎜⎜ ⎟
⎜⎜⎜0 • ∗ 0⎟⎟⎟ ⎜⎜⎜0 • • 0⎟⎟⎟ ⎜⎜⎜0 • • 0⎟⎟⎟⎟⎟
V2 ⎜
⎜⎜0 0 ∗ ∗⎟⎟⎟ U T ⎜⎜⎜0 0 ∗ ∗⎟⎟⎟ U T ⎜⎜⎜0 0 • •⎟⎟⎟
→ ⎜⎜⎜⎜⎜ ⎟⎟⎟ →3 ⎜⎜⎜ ⎟ 4 ⎜
⎜⎜⎜0 0 0 ∗⎟⎟⎟⎟⎟ → ⎜⎜⎜⎜⎜0 0 0 ∗⎟⎟⎟⎟⎟ ,

⎜⎜⎜0 0 ∗ ∗⎟⎟⎟⎟⎟ ⎜⎜⎜ ⎟ ⎜⎜⎜ ⎟
⎜⎜⎜ ⎟⎟ ⎜⎜⎜0 0 0 ∗⎟⎟⎟⎟⎟ ⎜⎜⎜0 0 0 0⎟⎟⎟⎟⎟
⎜⎜⎝0 0 ∗ ∗⎟⎟⎟⎠ ⎜⎝ ⎟⎠ ⎜⎝ ⎟⎠
0 0 ∗ ∗ 0 0 0 ∗ 0 0 0 0

where the U T matrices are multiplied from the left and the V matrices are mul-
tiplied from the right. In the above diagram, elements that do not change any-

08:42:21, subject to the Cambridge


06
174 Calculation of Eigenvalues

more are denoted by a •; those elements that change are denoted by a ∗. Note
that we need to apply n Householder matrices from the left and n − 2 House-
holder transformations from the right. The method described here is called
Householder bidiagonalisation and goes back to Golub and Kahan [64]. A
possible realisation is given in Algorithm 16. To analyse its numerical cost,
we recall that multiplying a Householder matrix U ∈ Rm×m to a vector a ∈ Rm
costs O(m) time, see Lemma 3.31.

Lemma 5.41 A matrix A ∈ Rm×n with m ≥ n can be transformed into a


bidiagonal form using the Householder bidiagonalisation method in O(mn2 )
time.

Proof We can represent the U matrix of step i by an (m − 1 + i) × (m − 1 + i)


matrix, which we then have to apply to n − i + 1 partial columns of A. Hence
this multiplication can be done in O((n − i + 1)(m − i + 1)) time and we have to
do it for 1 ≤ i ≤ n, so that the total cost for multiplying with all U matrices is

n
O((n − i + 1)(m − i + 1)) = O(n2 m).
i=1

Furthermore, we can represent the V matrix of step i by an (n−i)×(n−i) matrix


which has to be applied to m − i + 1 partial row vectors of A, for 1 ≤ i ≤ n − 2.
Hence, the total cost for all these operations becomes


n−2
O((n − i)(m − i + 1)) = O(n2 m). 
i=1

Algorithm 16: Householder bidiagonalisation


Input : A ∈ Rm×n .
Output: B = UAV T bidiagonal.
1 for i = 1 to n do
2 Determine Householder matrix Ũi ∈ R(m−i+1)×(m−i+1) such that
Ũi (aii , ai+1,m , . . . , ami )T = (∗, 0, . . . , 0)T .
3 Let Ui := diag(Ii−1 , Ũi ) and compute A := Ui A
4 if i ≤ n − 2 then
5 Determine Householder matrix Ṽi ∈ R(n−i)×(n−i) such that
(ai,i+1 , ai,i+2 , . . . , ain )ṼiT = (∗, 0, . . . , 0)
6 Let ViT := diag(Ii , ṼiT ) and compute A := AViT

08:42:21, subject to the Cambridge


06
5.7 Computing the Singular Value Decomposition 175

As mentioned above, the matrix–matrix multiplications in Lines 3 and 6 of Al-


gorithm 16 have to be carried out using the special structure of the Householder
matrices. Furthermore, we can also store these matrices efficiently. The House-
holder matrix Ũi ∈ R(m−i+1)×(m−i+1) can be represented by Ũi = Im−i+1 − 2ui uTi ,
i.e. by a vector ui ∈ Rm−i+1 . Similarly, the matrix Ṽi ∈ R(n−i)×(n−i) can be repre-
sented by a vector vi ∈ Rn−i . Hence, when it comes to an actual implementation
of Algorithm 16 we could store most of the information in the original matrix
as follows. The vector associated with Ui can be stored in the lower triangu-
lar part of A while the Householder vector associated with Vi can be stored
in the upper triangular portion of A. For details, see the similar Algorithm 6.
Hence, we only need to store the diagonal and super-diagonal entries of the
transformed matrix.
In the case of m being much larger than n, the method will produce a large
number of unnecessary zeros. To avoid this, we could first compute a QR fac-
torisation of A and then the Householder bidiagonalisation B = URV T of the
upper triangular matrix R. The advantage here is that the bidiagonalisation pro-
cess then only needs to be carried out on the upper n × n part of R. A thorough
analysis of the computational cost (see for example Trefethen and Bau [124])
shows that the QR factorisation can be done in about half the time T H (m, n)
of the bidiagonalisation of A. The bidiagonalisation process on the small n × n
part of R costs O(n3 ) so that the total cost of this method will be
1
T (m, n) = T H (m, n) + O(n3 ),
2
which turns out to be better if at least m > (5/3)n.
After the transformation to an upper bidiagonal matrix, the transformed matrix
has the upper bidiagonal form
⎛ ⎞
⎜⎜⎜d1 f2 0 · · · 0 ⎟⎟⎟
⎜⎜⎜⎜ .. . ⎟⎟⎟
⎜⎜⎜ d f . .. ⎟⎟⎟⎟ ⎛⎜ ⎞
⎟⎟⎟
⎜⎜⎜ 2 3 ⎟⎟⎟ ⎜⎜
⎜⎜⎜ .. .. ⎟ ⎜ ⎜ ⎟⎟⎟
⎜⎜⎜ . . 0 ⎟⎟⎟⎟⎟ ⎜⎜⎜⎜ B ⎟⎟⎟
⎜⎜⎜ ⎟⎟⎟ ⎜⎜⎜ ⎟⎟⎟
⎜ . . . f ⎟⎟⎟ = ⎜⎜⎜ ⎜ ⎟⎟⎟
B̃ = ⎜⎜⎜⎜ n⎟ ⎟ ⎜ ⎟⎟ (5.33)
⎜⎜⎜⎜ ⎟⎟⎟⎟ ⎜⎜⎜⎜0 · · · 0⎟⎟⎟⎟
⎜⎜⎜ d n⎟ ⎟⎟⎟ ⎜⎜⎜⎜ .. .. ⎟⎟⎟
⎜⎜⎜⎜ 0 · · · 0 ⎟⎟⎟ ⎜⎜⎜ . . ⎟⎟⎟⎟
⎜⎜⎜ ⎟⎟⎟ ⎝ ⎠
⎜⎜⎜ .. .. ⎟⎟ 0 ··· 0
⎜⎜⎜ . . ⎟⎟⎟⎟
⎝ ⎠
0 ··· 0
with B ∈ Rn×n , and we have matrices U ∈ Rm×n and V ∈ Rn×n such that A =
U BV T . The singular values of A are the square roots of the positive eigenvalues
of the matrix C := BT B, which is a tridiagonal matrix with diagonal entries

08:42:21, subject to the Cambridge


06
176 Calculation of Eigenvalues

c11 = d12 and cii = di2 + fi2 for 2 ≤ i ≤ n. The off-diagonal entries are given by
ci,i+1 = ci+1,i = di fi+1 for 1 ≤ i ≤ n − 1.
Hence, the matrix C = BT B is irreducible (see Definition 4.11) if and only if
d1 · · · dn−1  0 and f2 · · · fn  0.
If the matrix is reducible, the calculation of the singular values of B can be
reduced to the calculation of two lower-dimensional matrices. This is detailed
in the following lemma, where we have to distinguish between the two cases
characterised by whether one of the di or one of the fi vanishes.
Lemma 5.42 Let B ∈ Rn×n be a bidiagonal matrix with diagonal elements
d1 , . . . , dn and super-diagonal elements f2 , . . . , fn .
1. If there is a 1 ≤ k ≤ n − 1 with fk+1 = 0 then the matrix B can be split into

B1 O
B=
O B2
with B1 ∈ Rk×k and B2 ∈ R(n−k)×(n−k) . If B1 = U1 Σ1 V1T and B2 = U2 Σ2 V2T
are singular value decompositions of B1 and B2 , respectively, then
T
U O Σ 1 O V1 O
B= 1
O U2 O Σ2 O V2
is a singular value decomposition of B.
2. If there is a 1 ≤ k ≤ n − 1 with dk = 0 and fk+1  0 then we can choose n − k
Givens rotations Gk,k+1 , . . . , Gk,n such that B̃ = (b̃i j ) := Gk,n · · · Gk,k+1 B
is a bidiagonal matrix with b̃k,k+1 = 0. Hence, the matrix B̃ satisfies the
assumptions of the first case.
Proof The first part of the lemma simply follows by straightforward calcu-
lations. For the second part, we will use Givens rotations to move the non-
vanishing fk+1 along the kth row to the right until it disappears, as described
schematically in
⎛ ⎞ ⎛ ⎞ ⎛ ⎞
⎜⎜⎜• • ⎟⎟⎟ ⎜⎜⎜• • ⎟⎟⎟ ⎜⎜⎜• • ⎟⎟⎟
⎜⎜⎜ ⎟⎟⎟ ⎜⎜⎜ ⎟⎟⎟ ⎜⎜⎜ ⎟
⎜⎜⎜ 0 • ⎟⎟⎟ ⎜⎜⎜ 0 0 ∗ ⎟⎟⎟ ⎜⎜⎜ 0 0 0 ∗⎟⎟⎟⎟⎟
⎜⎜⎜⎜ ⎟ 23 ⎜
• • ⎟⎟⎟⎟ → ⎜⎜⎜⎜
G
∗ ∗
⎟⎟⎟ 24 ⎜⎜⎜
⎟⎟⎟ → ⎜⎜⎜
G
• • ⎟⎟⎟⎟

⎜⎜⎜ ⎟
⎟ ⎜
⎜ ⎟ ⎜ ⎟
⎜⎜⎜
⎜⎝ • •⎟⎟⎟⎟ ⎜⎜⎜
⎜⎝ • •⎟⎟⎟⎟ ⎜⎜⎜
⎜⎝ ∗ ∗⎟⎟⎟⎟⎟
⎠ ⎠ ⎠
• • •
⎛ ⎞
⎜⎜⎜• • ⎟⎟⎟
⎜⎜⎜ ⎟
⎜⎜⎜ 0 0 0 0⎟⎟⎟⎟⎟
G25 ⎜
→ ⎜⎜⎜⎜ • • ⎟⎟⎟⎟ .
⎜⎜⎜ ⎟⎟
⎜⎜⎜ • •⎟⎟⎟⎟⎟
⎝ ⎠

08:42:21, subject to the Cambridge


06
5.7 Computing the Singular Value Decomposition 177

Here, unchanging entries are again denoted by a • and changing entries by a


∗. The precise proof follows the same pattern and is left to the reader, see also
Algorithm 17. 

Algorithm 17: Transformation of reducible bidiagonal matrix


Input : B ∈ Rn×n upper bidiagonal, i.e. of the form (5.33), index k with
dk = 0 and fk+1  0, matrix U ∈ Rm×n so that A = U BV T .
Output: B̃ upper bidiagonal with f˜k+1 = 0.
1 c := 0, s := 1
2 for j = k + 1 to n do
3 f := s f j , f j := c f j , u := ( f 2 + d2j )1/2
4 c := d j /u, s := − f /u, d j := u
5 for i = 1 to m do

c −s
6 (uik , ui j ) := (uik , ui j )
s c

At the end of Algorithm 17 we have an upper bidiagonal matrix B̃ ∈ Rn×n ,


which can be split as described, and a new matrix Ũ which satisfy A = Ũ B̃V T .
We can repeat this until we only have to deal with blocks B with BT B irre-
ducible.
It remains to derive a method for computing the singular values of an upper
bidiagonal matrix B with irreducible BT B. Here, the idea of Golub and Reinsch
[65] is to compute the eigenvalues of BT B with the shifted QR method without
actually computing BT B. We will now describe a method which transforms B
to a new bidiagonal matrix B̃ which should have a much smaller off-diagonal
element b̃n−1,n than the original matrix B. Hence, after a few steps it should be
possible to use deflation to reduce the problem to an (n − 1) × (n − 1) matrix.
We start by selecting the shift parameter μ ∈ R as the eigenvalue of the lower
block
2
dn−1 + fn−1
2
dn−1 fn
dn−1 fn dn2 + fn2
of BT B which is closest to dn2 + fn2 . Then, we determine a Givens rotation
V12 = V12 (c, s) satisfying

c s d12 − μ ∗
= ,
−s c d1 f2 0
i.e. V12 transforms the first column of BT B − μI into a multiple of the first unit

08:42:21, subject to the Cambridge


06
178 Calculation of Eigenvalues

vector. This means that the first column of V12 coincides with the first column
of the orthogonal matrix Q of a QR factorisation of BT B − μI.
T
If we apply this directly to B, i.e. if we form BV12 with dimensions appropri-
ately adapted, we only change the upper 2 × 2 block of B according to

d1 f2 c −s d1 c + f2 s −d1 s + f2 c
= ,
0 d2 s c d2 s d2 c
which means that the upper bidiagonal form is destroyed by the element d2 s in
position (2,1). We will now use successive Givens rotations from the left and
from the right to move this element out of the matrix. This idea can be depicted
as follows:
⎛ ⎞ ⎛ ⎞ ⎛ ⎞ ⎛ ⎞
⎜⎜⎜• • ⎟⎟⎟ ⎜∗ ∗ ⎟⎟⎟ ⎜⎜⎜∗ ∗ ∗ ⎟⎟⎟ ⎜⎜⎜• ∗ ⎟⎟⎟
⎜⎜⎜ ⎟⎟⎟ T ⎜⎜⎜⎜⎜∗ ⎟⎟⎟ ⎜ ⎟⎟⎟ T ⎜⎜⎜ ⎟⎟⎟
⎜⎜⎜ • • ⎟⎟⎟ V12 ⎜⎜⎜ ∗ • ⎟⎟⎟ U12 ⎜⎜⎜⎜⎜ ∗ ∗ ⎟⎟⎟ V23 ⎜⎜⎜ ∗ ∗ ⎟⎟⎟
⎜⎜⎜ • • ⎟⎟⎟ → ⎜⎜⎜ • • ⎟⎟⎟⎟ → ⎜⎜⎜⎜ • • ⎟⎟⎟ → ⎜⎜⎜ ∗ ∗ • ⎟⎟⎟
⎜⎜⎜
⎜⎝ • •⎟⎟⎟⎠⎟ ⎜⎜⎜
⎜⎝ • •⎟⎟⎠⎟ ⎜⎜⎜
⎝ • •⎟⎟⎟⎠⎟ ⎜⎜⎜
⎜⎝ • •⎟⎟⎟⎠⎟
• • • •
⎛ ⎞ ⎛ ⎞ ⎛ ⎞
⎜⎜⎜• • ⎟⎟⎟ ⎜• • ⎟⎟⎟ ⎜⎜⎜• • ⎟⎟⎟
⎜⎜ ∗ ∗ ∗ ⎟⎟⎟ T ⎜⎜⎜⎜⎜ • ∗ ⎟⎟⎟ ⎜⎜⎜ • • ⎟⎟⎟
U23 ⎜ ⎜⎜ ⎟⎟⎟ 34 ⎜⎜⎜
V ⎟⎟⎟ U34 ⎜⎜⎜ ⎟⎟
→ ⎜⎜⎜⎜ ∗ ∗ ⎟⎟⎟ → ⎜⎜⎜⎜ ∗ ∗ ⎟⎟⎟ → ⎜⎜⎜ ∗ ∗ ∗⎟⎟⎟⎟
⎜⎜⎜
⎜⎝ • •⎟⎟⎟⎠⎟ ⎜⎜⎜
⎝ ∗ ∗ •⎟⎟⎟⎟⎠ ⎜⎜⎜
⎜⎝ ∗ ∗⎟⎟⎟⎟⎠
• • •
⎛ ⎞ ⎛ ⎞
⎜⎜⎜• • ⎟⎟⎟ ⎜⎜⎜• • ⎟⎟⎟
⎜⎜⎜ ⎟⎟⎟ ⎜ ⎟⎟⎟
T
V45 ⎜⎜ • • ⎟⎟⎟ U45 ⎜⎜⎜⎜⎜ • • ⎟⎟⎟
→ ⎜⎜⎜⎜ • ∗ ⎟⎟⎟ → ⎜⎜⎜ • • ⎟⎟⎟ ,
⎜⎜⎜
⎜⎝ ∗ ∗⎟⎟⎟⎠⎟ ⎜⎜⎜
⎝⎜ ∗ ∗⎟⎟⎟⎟⎠
∗ ∗ ∗
T
where Vk,k+1 is multiplied from the right and hence acts only on columns k and
k + 1, while Uk,k+1 is multiplied from the left and hence acts on rows k and
k + 1. All other rows and columns are left unchanged. As before, we indicated
non-changing values with • and those which change with ∗, unless they are
transformed to zero.
The result of this procedure is a matrix

B̃ = Un−1,n · · · U12 BV12


T
· · · Vn−1,n
T

which is also upper bidiagonal and has the same singular values as B but
should, one hopes, have a smaller entry b̃n−1,n than B. The full algorithm of
this step is given in Algorithm 18.
Algorithm 18 not only gives a new upper bidiagonal matrix B̃ but also over-
writes the matrices U ∈ Rm×n and V ∈ Rn×n with

Ũ = UU12
T
· · · Un−1,n
T
, Ṽ = VV12
T
· · · Vn−1,n
T
,

so that we also have Ũ B̃Ṽ T = U BV T . To see the connection to the shifted QR

08:42:21, subject to the Cambridge


06
5.7 Computing the Singular Value Decomposition 179

Algorithm 18: QR Golub–Kahan SVD step


Input : B ∈ Rn×n bidiagonal matrix of the form (5.33) with BT B
irreducible, matrices U ∈ Rm×n and V ∈ Rn×n .
Output: Upper bidiagonal matrix B̃ ∈ Rn×n , matrices Ũ ∈ Rm×n and
Ṽ ∈ Rn×n with Ũ B̃Ṽ T = U BV T .
1 Determine μ ∈ R as the eigenvalue of
2
dn−1 + fn−1
2
dn−1 fn
,
dn−1 fn dn2 + fn2

which is closest to dn2 + fn2 .


2 Determine Givens rotation V12 = V12 (c, s) with

c s d12 − μ ∗
= .
−s c d1 f2 0
T T
Set B := BV12 and V := VV12 .
3 for k = 1 to n − 1 do
4 Determine Givens rotation Uk,k+1 = Uk,k+1 (c, s) with

c s bkk ∗
= .
−s c bk+1,k 0

T
5 Set B := Uk,k+1 B and U := UUk,k+1 .
6 if k < n − 1 then
7 Determine Givens rotation Vk+1,k+2 = Vk+1,k+2 (c, s) with

c s bk,k+1 ∗
= .
−s c bk,k+2 0

T T
8 Set B := BVk+1,k+2 and V := VVk+1,k+2 .

method applied to BT B, we note that, with the just-introduced matrices, we


also have
B̃T B̃ = Vn−1,n · · · V12 BT BV12
T
· · · Vn−1,n
T
.

As we have
(Vn−1,n · · · V12 )T e1 = V12
T
· · · Vn−1,n
T
e1 = V12
T
e1

08:42:21, subject to the Cambridge


06
180 Calculation of Eigenvalues

we see that the first column of (Vn−1,n · · · V1,2 )T coincides with the first column
T
of V12 which coincides with the first column of Q the orthogonal matrix in
the QR factorisation of BT B − μI. This means that we have indeed implicitly
applied one step of the shifted QR method and our considerations there allow
us now to expect a new matrix B̃ with considerably smaller super-diagonal
element fn .
It should be clear by now how to implement these algorithms and how to com-
bine them to create a working algorithm for computing the singular value de-
composition. After each step B → B̃ it should be checked whether the last
super-diagonal falls below a given threshold.

Exercises
5.1 • Let A ∈ Rn×n be a tridiagonal matrix with diagonal entries 5 j, 1 ≤ j ≤
n, and sub- and super-diagonal entries ±1. Use Gershgorin’s theorem
to give as tight bounds on the eigenvalues of A as possible. Show that
A is invertible.
• Let x0 ∈ Rn and define for k ∈ N xk by Axk = xk−1 , where A is the
matrix from the first part. Show that there is a constant C > 0 such
that the components eTj xk of xk satisfy |e j xk | ≤ 4−k C.
5.2 Let A, B ∈ Rn×n be symmetric with eigenvalues λ1 ≥ · · · ≥ λn and
μ1 ≥ · · · ≥ μn , respectively. Let ν1 ≥ · · · ≥ νn be the eigenvalues of A + B
Show λ j + μn ≤ ν j ≤ λ j + μ1 for 1 ≤ j ≤ n.
5.3 Let A ∈ Rn×n be symmetric with eigenvalues λ1 , . . . , λn . Show, for every
λ ∈ R and x ∈ Rn \ {0},
λx − Ax2
min |λ − λ j | ≤ .
1≤ j≤n x2
5.4 Let A ∈ Rm×n with m ≥ n. Let σ1 ≥ · · · ≥ σn be the singular values of A.
Show
Ax2
σ j = min max , 1 ≤ j ≤ n,
U∈Mn+1− j x∈U x2
x0

where M j is the set of all j-dimensional subspaces of Rn .


5.5 Let A, B ∈ Rm×n with m ≥ n and singular values σ1 ≥ · · · ≥ σn and
τ1 ≥ · · · ≥ τn . Show
|σ j − τ j | ≤ A − B2 , 1 ≤ j ≤ n.
5.6 Show that every irreducible, symmetric tridiagonal matrix A ∈ Rn×n has
pairwise distinct eigenvalues.

08:42:21, subject to the Cambridge


06
PART THREE

ADVANCED METHODS

13:06:58, subject to the Cambridge


Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316544938
13:06:58, subject to the Cambridge
Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316544938
6
Methods for Large Sparse Systems

In this chapter, we will concentrate on iterative methods which are particularly


suited for large sparse systems, though they can also be applied to non-sparse
matrices. We will discuss multigrid methods but our main focus will be on so-
called Krylov subspace methods. For such methods, the main operation within
one step of the iteration is the calculation of one or at most two matrix–vector
products. If the matrix is sparse with only O(n) non-zero entries then this op-
eration can be done in O(n) time. For a general, non-sparse matrix the cost
becomes O(n2 ), which might still be acceptable if only a very small number of
steps is required and n is not too large.

6.1 The Conjugate Gradient Method


We start by constructing an iterative method for symmetric positive definite
matrices, which was introduced in [83] by Hestenes and Stiefel. Let A ∈ Rn×n
be symmetric, positive definite and b ∈ Rn . We wish to solve the linear system

Ax = b.

Associated with this system we have a function f : Rn → R given by


1 T
f (x) := x Ax − xT b, x ∈ Rn .
2
We can calculate the first-order and second-order derivatives as

∇ f (x) = Ax − b,
H f (x) = A.

Here ∇ f = (∂1 f, . . . , ∂n f )T is the gradient of f and H f = (∂2i j f )i j is its Hessian

183

08:44:43, subject to the Cambridge


07
184 Methods for Large Sparse Systems

matrix. Since all higher-order derivatives vanish, we can rewrite f as its Taylor
polynomial of degree two as
1
f (x) = f (y) + (x − y)T ∇ f (y) + (x − y)T H f (y)(x − y)
2
1
= f (y) + (x − y)T (Ay − b) + (x − y)T A(x − y)
2
≥ f (y) + (x − y)T (Ay − b),

where, in the last step, we have used that A is positive definite. Thus, if y = x∗
solves Ax = b, we see that f (x) ≥ f (x∗ ), i.e. x∗ minimises f . On the other hand,
every minimiser y of f has to satisfy the necessary condition 0 = ∇ f (y) =
Ay − b. This is only possible for y = x∗ .

Theorem 6.1 Let A ∈ Rn×n be symmetric and positive definite. The solution of
the system Ax = b is the unique minimum of the function f (y) := 12 yT Ay − yT b.

We will now try to find the minimum of f by a simple iterative procedure of


the form
x j+1 = x j + α j p j .

Here, x j is our current position. From this position we want to move in the
direction of p j . The step-length of our move is determined by α j . Of course, it
will be our goal to select the direction p j and the step-length α j in such a way
that
f (x j+1 ) ≤ f (x j ).

Given a new direction p j , the best possible step-length in that direction can be
determined by looking at the minimum of f along the line x j + αp j . Hence, if
we set φ(α) = f (x j + αp j ) the necessary condition for a minimum yields

0 = φ (α) = pTj ∇ f (x j + αp j ) = pTj [A(x j + αp j ) − b] = pTj Ax j + αpTj Ap j − pTj b.

Resolving this for α gives the new step-length as


pTj (b − Ax j ) pTj r j r j , p j 2
αj = = = ,
pTj Ap j pTj Ap j Ap j , p j 2

where we have defined the residual of the jth step as r j := b − Ax j . Thus we


have a generic algorithm as outlined in Algorithm 19.
Note that we pick α j in such a way that

0 = pTj (Ax j+1 − b) = − r j+1 , p j 2 , (6.1)

showing that p j and r j+1 are orthogonal.

08:44:43, subject to the Cambridge


07
6.1 The Conjugate Gradient Method 185

Algorithm 19: Generic minimisation


Input : A = AT ∈ Rn×n positive definite, b ∈ Rn .
Output: Approximate solution of Ax = b.
1 Choose x0 and p0 .
2 Compute r0 := b − Ax0 and set j := 0.
3 while r j  > do
r j , p j 2
4 α j :=
Ap j , p j 2
5 x j+1 := x j + α j p j
6 r j+1 := b − Ax j+1
7 Choose next direction p j+1
8 j := j + 1

Obviously, we still have to determine how to choose the search directions. One
possible way is to pick the direction of steepest descent. It is easy to see and
known from analysis that this direction is given by the negative gradient of the
target function, i.e. by
p j = −∇ f (x j ) = −(Ax j − b) = r j .
This gives the steepest descent algorithm shown in Algorithm 20.

Algorithm 20: Steepest descent


Input : A = AT ∈ Rn×n positive definite, b ∈ Rn .
Output: Approximate solution of Ax = b.
1 Choose x0 .
2 Set p0 := b − Ax0 and j := 0
3 while p j  > do
p j , p j 2
4 α j :=
Ap j , p j 2
5 x j+1 := x j + α j p j
6 p j+1 := b − Ax j+1
7 j := j + 1

The relation (6.1) now becomes p j+1 , p j 2 = 0, which means that two succes-
sive search directions are orthogonal.
To see that the method of steepest descent converges and to determine its

08:44:43, subject to the Cambridge


07
186 Methods for Large Sparse Systems

convergence speed, we recall that a symmetric and positive definite matrix


A ∈ Rn×n has n positive eigenvalues 0 < λ1 ≤ λ2 ≤ · · · ≤ λn and Rn has an
orthonormal basis w1 , . . . , wn satisfying Aw j = λ j w j for 1 ≤ j ≤ n. We need
the following lemma.

Lemma 6.2 (Kantorovich inequality) Let λ1 and λn be the smallest and


largest eigenvalues of the symmetric and positive definite matrix A ∈ Rn×n .
Then,
Ax, x 2 A−1 x, x 2 (λ1 + λn )2
≤ , x ∈ Rn \ {0}.
x, x 2 x, x 2 4λ1 λn

Proof Without restriction we may assume that x, x 2 = x22 = 1. Using the



notation introduced above, we can expand such an x ∈ Rn as x = nj=1 x j w j
with

n 
n
Ax, x 2 = xT Ax = λ j x2j , A−1 x, x 2 = xT A−1 x = λ−1
j xj.
2

j=1 j=1


Noticing that x2j ∈ [0, 1] with x2j = 1, we see that


n
λ := λ j x2j = Ax, x 2
j=1

is a convex combination of the eigenvalues of A and hence we have λ ∈


[λ1 , λn ]. Next, let us define the points P j := (λ j , 1/λ j ) for 1 ≤ j ≤ n and

P := (λ, j λ−1 j x j ). The points P j lie on the graph of the function f (t) = 1/t,
2

which is a convex function. Thus, all P j for 2 ≤ j ≤ n − 1 lie under the straight

line connecting P1 and Pn . The point P can be written as P = j x2j P j and
is therefore a convex combination of the points P j and thus contained in the
convex hull of P1 , . . . , Pn . Hence, P cannot be above the straight line through
P1 and Pn . Since this line is given by g(t) = (λ1 + λn − t)/(λ1 λn ), we see that

n
λ1 + λn − λ
A−1 x, x 2 = λ−1
j x j ≤ g(λ) =
2

j=1
λ1 λn

and hence
λ 1 + λn − λ
Ax, x 2 A−1 x, x 2 ≤ λ . (6.2)
λ1 λn
The expression on the right-hand side is a quadratic polynomial in λ which
attains its global maximum in [λ1 , λn ] at λ = (λ1 + λn )/2. Inserting this into
(6.2) gives the desired result. 


With this, we can prove convergence of the method of steepest descent. To formulate it, we introduce the inner product
\[
\langle x, y\rangle_A := y^T A x = \langle Ax, y\rangle_2 = \langle x, Ay\rangle_2 = \langle Ay, x\rangle_2 \tag{6.3}
\]
with associated norm $\|x\|_A := \sqrt{\langle x, x\rangle_A}$, $x \in \mathbb{R}^n$.

Theorem 6.3 Let $A \in \mathbb{R}^{n\times n}$ be symmetric and positive definite. Then, for a given right-hand side $b \in \mathbb{R}^n$, the method of steepest descent converges for every starting point $x_0 \in \mathbb{R}^n$ to the solution $x^* \in \mathbb{R}^n$ of $Ax = b$. Moreover, if $x_j$ denotes the $j$th iteration, the error can be bounded by
\[
\|x^* - x_j\|_A \le \left(\frac{\kappa_2(A) - 1}{\kappa_2(A) + 1}\right)^j \|x^* - x_0\|_A, \tag{6.4}
\]
where $\kappa_2(A) = \|A\|_2\|A^{-1}\|_2$ denotes the condition number of $A$ with respect to the Euclidean norm.

Proof Let us write $e_j := x^* - x_j$. Then, we have the following two relations:
\[
e_{j+1} = x^* - x_j - \alpha_j p_j = e_j - \alpha_j p_j, \qquad
p_{j+1} = b - Ax_{j+1} = A(x^* - x_{j+1}) = Ae_{j+1}.
\]
The first one leads to
\[
\|e_{j+1}\|_A^2 = \langle e_{j+1}, e_j - \alpha_j p_j\rangle_A = \langle e_{j+1}, e_j\rangle_A - \alpha_j\langle e_{j+1}, p_j\rangle_A.
\]
The orthogonality of $p_j$ and $p_{j+1}$ and the fact that $p_{j+1} = Ae_{j+1}$ shows
\[
\langle e_{j+1}, p_j\rangle_A = \langle Ae_{j+1}, p_j\rangle_2 = \langle p_{j+1}, p_j\rangle_2 = 0,
\]
which allows us to derive
\begin{align*}
\|e_{j+1}\|_A^2 &= \langle e_{j+1}, e_j\rangle_A = \langle Ae_j, e_{j+1}\rangle_2 = \langle p_j, e_{j+1}\rangle_2 = \langle p_j, e_j - \alpha_j p_j\rangle_2\\
&= \langle p_j, e_j\rangle_2 - \alpha_j\langle p_j, p_j\rangle_2 = \langle Ae_j, e_j\rangle_2 - \alpha_j\langle p_j, p_j\rangle_2\\
&= \langle Ae_j, e_j\rangle_2\left(1 - \alpha_j\frac{\langle p_j, p_j\rangle_2}{\langle Ae_j, e_j\rangle_2}\right)
= \|e_j\|_A^2\left(1 - \frac{\langle p_j, p_j\rangle_2}{\langle Ap_j, p_j\rangle_2}\,\frac{\langle p_j, p_j\rangle_2}{\langle p_j, A^{-1}p_j\rangle_2}\right)\\
&\le \|e_j\|_A^2\left(1 - \frac{4\lambda_1\lambda_n}{(\lambda_1 + \lambda_n)^2}\right)
= \|e_j\|_A^2\,\frac{(\lambda_n - \lambda_1)^2}{(\lambda_n + \lambda_1)^2},
\end{align*}


using the Kantorovich inequality. From this, we can conclude that
\[
\|e_{j+1}\|_A \le \frac{\kappa_2(A) - 1}{\kappa_2(A) + 1}\,\|e_j\|_A
\]
and iterating this argument gives the stated error bound and convergence. □

The predicted error bound shows that the convergence is best if the condition number of $A$ is close to 1 and that the convergence will be arbitrarily slow if the condition number is large. We can even estimate the number of iterations that we would need to reduce the initial error $\|x^* - x_0\|_A$ to a given $\epsilon > 0$, though this estimate is true only when one is working in exact arithmetic.

Corollary 6.4 Let $A \in \mathbb{R}^{n\times n}$ be symmetric and positive definite. If $\{x_j\}$ denotes the iterations of the steepest descent method then we have
\[
\frac{\|x^* - x_j\|_A}{\|x^* - x_0\|_A} \le \epsilon
\]
after at most $j \ge \tfrac12\kappa_2(A)\log(1/\epsilon)$ steps.

Proof Let us write $\kappa := \kappa_2(A)$. It follows from (6.4) that the relative error drops below $\epsilon$ if
\[
\left(\frac{\kappa + 1}{\kappa - 1}\right)^j > \frac{1}{\epsilon}.
\]
Taking the logarithm of both sides shows that we need
\[
j \ge \frac{\log(1/\epsilon)}{\log\left(\frac{\kappa+1}{\kappa-1}\right)}. \tag{6.5}
\]
Since $\kappa \ge 1$, we can use the expansion
\[
\log\frac{\kappa+1}{\kappa-1} = 2\sum_{k=0}^{\infty}\frac{1}{2k+1}\,\kappa^{-2k-1} = 2\left(\frac{1}{\kappa} + \frac{1}{3\kappa^3} + \frac{1}{5\kappa^5} + \cdots\right) \ge \frac{2}{\kappa}
\]
to see that (6.5) is satisfied if $j \ge \tfrac12\kappa\log(1/\epsilon)$. □

The next example shows that the error bound of the previous theorem can be
sharp, showing also that even for moderately conditioned matrices the conver-
gence of the steepest descent method becomes unacceptably slow.

Example 6.5 Let us choose the following matrix, right-hand side and initial position:
\[
A = \begin{pmatrix} 1 & 0\\ 0 & 9\end{pmatrix}, \qquad b = \begin{pmatrix} 0\\ 0\end{pmatrix}, \qquad x_0 = \begin{pmatrix} 9\\ 1\end{pmatrix}.
\]


It is not too difficult to see that the method of steepest descent produces the iterations
\[
x_j = (0.8)^j\left(9, (-1)^j\right)^T,
\]
which are depicted in Figure 6.1.

Figure 6.1 Steepest descent for the example.

Since the solution $x^*$ is obviously given by $x^* = 0$, we can easily compute all relevant quantities from Theorem 6.3. First of all we have $\kappa_2(A) = 9$ and hence the convergence factor is $(\kappa_2(A) - 1)/(\kappa_2(A) + 1) = 0.8$. Moreover, we have $\|x_0 - x^*\|_A = \|x_0\|_A = \sqrt{90}$ and
\[
\|x_j - x^*\|_A^2 = (0.8)^{2j}\left(9, (-1)^j\right)\begin{pmatrix} 1 & 0\\ 0 & 9\end{pmatrix}\begin{pmatrix} 9\\ (-1)^j\end{pmatrix} = (0.8)^{2j}\,90 = 0.8^{2j}\,\|x_0 - x^*\|_A^2.
\]
Thus, we have equality in equation (6.4).
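The numbers in this example are easy to check numerically. The following small script (a sketch, not part of the text) runs the exact-line-search iteration by hand and compares the iterates with the closed form $x_j = 0.8^j(9, (-1)^j)^T$.

    import numpy as np

    A = np.array([[1.0, 0.0], [0.0, 9.0]])
    b = np.zeros(2)
    x = np.array([9.0, 1.0])

    for j in range(5):
        expected = 0.8**j * np.array([9.0, (-1.0)**j])
        print(j, x, np.allclose(x, expected))
        p = b - A @ x                       # steepest descent direction
        alpha = (p @ p) / (p @ (A @ p))     # exact line search
        x = x + alpha * p                   # next iterate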

Figure 6.2 The steepest descent method, level curves.

The problem of the previous example becomes more apparent if we look at the level curves of the corresponding quadratic function $f(x) = \tfrac12 x^T A x = \tfrac12(x_1^2 + 9x_2^2)$, which are ellipses, see Figure 6.2. Since the new search direction has to be orthogonal to the previous one, we are always tangential to the level curves, and hence the new direction does not point towards the centre of the ellipses, which is our solution vector.


Figure 6.3 The steepest descent method, level curves after scaling.

Thus, it would make sense to introduce a new coordinate system by setting $y := A^{1/2}x$, in which the function becomes $g(y) = f(A^{-1/2}y) = \tfrac12 y^T y$. Following this idea, we transform the positions by setting $y_j := A^{1/2}x_j$ and the directions by $s_j := A^{1/2}p_j$. If the new search directions are orthogonal as shown in Figure 6.3, we find the minimum in two steps. Moreover, being orthogonal now translates back into $0 = s_j^T s_{j+1} = p_j^T A p_{j+1} = \langle Ap_j, p_{j+1}\rangle_2$.
Following these ideas, we make the following definition.

Definition 6.6 The search directions $p_0, \ldots, p_{n-1} \in \mathbb{R}^n\setminus\{0\}$ are called A-conjugate for a given positive definite and symmetric matrix $A \in \mathbb{R}^{n\times n}$ if
\[
\langle Ap_k, p_j\rangle_2 = 0 \qquad \text{for all } 0 \le j \ne k \le n-1.
\]

The following properties of A-conjugate directions can easily be verified.

Remark 6.7 • Since $A$ is symmetric and positive definite, we have for each $p \ne 0$ that $\langle Ap, p\rangle_2 = \langle p, Ap\rangle_2 > 0$.
• As in (6.3), we can introduce an inner product via $\langle x, y\rangle_A := \langle Ax, y\rangle_2 = \langle x, Ay\rangle_2$. With respect to this inner product, the search directions are orthogonal.
• A-conjugate directions are always linearly independent.

For A-conjugate directions it turns out that the generic algorithm, Algorithm
19, terminates after n steps, so that the method is actually a direct method.
Of course, we hope to have an acceptable approximation after significantly
fewer iterations. However, in some cases, due to numerical contamination, the
algorithm might even require more than n steps.

Theorem 6.8 Let A ∈ Rn×n be symmetric and positive definite. Let x0 ∈ Rn


be given and assume that the search directions p0 , . . . , pn−1 ∈ Rn \ {0} are A-
conjugate. The generic Algorithm 19 terminates after at most n steps with the
solution x∗ of Ax = b.
Proof Since the search directions form a basis of $\mathbb{R}^n$ we can represent the vector $x^* - x_0$ in this basis, which means that we can find $\beta_0, \ldots, \beta_{n-1}$ with
\[
x^* = x_0 + \sum_{j=0}^{n-1}\beta_j p_j.
\]
Furthermore, from the generic algorithm, we can conclude that the iterations have the representation
\[
x_i = x_0 + \sum_{j=0}^{i-1}\alpha_j p_j, \qquad 0 \le i \le n. \tag{6.6}
\]
Hence, it is sufficient to show that $\alpha_i = \beta_i$ for all $0 \le i \le n-1$. We know that
\[
\alpha_i = \frac{\langle r_i, p_i\rangle_2}{\langle Ap_i, p_i\rangle_2}.
\]
To compute the $\beta_i$, we first observe that
\[
b - Ax_0 = A(x^* - x_0) = \sum_{j=0}^{n-1}\beta_j Ap_j
\]
and hence
\[
\langle b - Ax_0, p_i\rangle_2 = \sum_{j=0}^{n-1}\beta_j\langle Ap_j, p_i\rangle_2 = \beta_i\langle Ap_i, p_i\rangle_2,
\]
because the directions are A-conjugate. This gives the explicit representation
\[
\beta_i = \frac{\langle b - Ax_0, p_i\rangle_2}{\langle Ap_i, p_i\rangle_2} = \frac{\langle r_0, p_i\rangle_2}{\langle Ap_i, p_i\rangle_2}, \qquad 0 \le i \le n-1,
\]
which differs, at first sight, from the $\alpha_i$ in the numerator. Fortunately, from (6.6) we can conclude that
\[
r_i = b - Ax_i = r_0 - \sum_{j=0}^{i-1}\alpha_j Ap_j,
\]
which allows us to derive
\[
\langle r_i, p_i\rangle_2 = \langle r_0, p_i\rangle_2 - \sum_{j=0}^{i-1}\alpha_j\langle Ap_j, p_i\rangle_2 = \langle r_0, p_i\rangle_2,
\]
employing again the fact that the directions are A-conjugate. Thus we have $\alpha_i = \beta_i$ for all $i$. □


It remains to answer the question regarding how the search directions are ac-
tually determined. Obviously, we do not want to determine them a priori but
while we iterate, hoping that our iteration will terminate significantly before
the iteration step becomes n.
Assume we already have created directions $p_0, \ldots, p_j$. In the case of $Ax_{j+1} = b$ we can terminate the algorithm and hence do not need to compute another direction. Otherwise, we could try to compute the next direction by setting
\[
p_{j+1} = b - Ax_{j+1} + \sum_{k=0}^{j}\beta_{jk}p_k = r_{j+1} + \sum_{k=0}^{j}\beta_{jk}p_k,
\]
where we can determine the coefficients from the $j+1$ conditions
\[
0 = \langle Ap_i, p_{j+1}\rangle_2 = \langle Ap_i, r_{j+1}\rangle_2 + \sum_{k=0}^{j}\beta_{jk}\langle Ap_i, p_k\rangle_2 = \langle Ap_i, r_{j+1}\rangle_2 + \beta_{ji}\langle Ap_i, p_i\rangle_2, \qquad 0 \le i \le j,
\]
as
\[
\beta_{ji} = -\frac{\langle Ap_i, r_{j+1}\rangle_2}{\langle Ap_i, p_i\rangle_2}, \qquad 0 \le i \le j.
\]
Surprisingly, $\beta_{j0}, \ldots, \beta_{j,j-1}$ vanish automatically, as we will see soon, so that only the coefficient
\[
\beta_{j+1} := \beta_{jj} = -\frac{\langle Ap_j, r_{j+1}\rangle_2}{\langle Ap_j, p_j\rangle_2}
\]
remains and the new direction is given by
\[
p_{j+1} = r_{j+1} + \beta_{j+1}p_j.
\]

This gives the first preliminary version of the conjugate gradient (CG) method,
Algorithm 21.

It is now time to show that the so-defined directions are indeed A-conjugate.

Theorem 6.9 If the CG method does not stop prematurely, the introduced vectors satisfy the following equations:
\begin{align}
\langle Ap_j, p_i\rangle_2 &= 0, \qquad 0 \le j \le i-1, \tag{6.7}\\
\langle r_j, r_i\rangle_2 &= 0, \qquad 0 \le j \le i-1, \tag{6.8}\\
\langle r_j, p_j\rangle_2 &= \langle r_j, r_j\rangle_2, \qquad 0 \le j \le i. \tag{6.9}
\end{align}
The CG method terminates with an index $i \le n$ and $r_i = b - Ax_i = 0 = p_i$. Hence, it needs at most $n$ steps to compute the solution $x^*$ of $Ax = b$.


Algorithm 21: CG method, preliminary version

Input : $A = A^T \in \mathbb{R}^{n\times n}$ positive definite, $b \in \mathbb{R}^n$.
Output: Approximate solution of $Ax = b$.
1 Choose $x_0 \in \mathbb{R}^n$.
2 Set $p_0 := r_0 := b - Ax_0$ and $j := 0$.
3 while $r_j \ne 0$ do
4   $\alpha_j := \langle r_j, p_j\rangle_2 / \langle Ap_j, p_j\rangle_2$
5   $x_{j+1} := x_j + \alpha_j p_j$
6   $r_{j+1} := b - Ax_{j+1}$
7   $\beta_{j+1} := -\langle Ap_j, r_{j+1}\rangle_2 / \langle Ap_j, p_j\rangle_2$
8   $p_{j+1} := r_{j+1} + \beta_{j+1}p_j$
9   $j := j + 1$

Proof We will prove (6.7) to (6.9) by induction. For $i = 0$ there is nothing to show for (6.7) and (6.8); (6.9) follows immediately from $p_0 = r_0$.
Let us now assume everything is satisfied for an arbitrary $i \ge 0$. We will first show that then (6.8) follows for $i+1$. By definition, we have
\[
r_{i+1} = b - Ax_{i+1} = b - Ax_i - \alpha_i Ap_i = r_i - \alpha_i Ap_i. \tag{6.10}
\]
Thus, by the induction hypothesis, we have for $j \le i-1$ immediately
\[
\langle r_j, r_{i+1}\rangle_2 = \langle r_j, r_i\rangle_2 - \alpha_i\langle Ar_j, p_i\rangle_2 = \langle r_j, r_i\rangle_2 - \alpha_i\langle A(p_j - \beta_j p_{j-1}), p_i\rangle_2 = \langle r_j, r_i\rangle_2 - \alpha_i\langle Ap_j, p_i\rangle_2 + \alpha_i\beta_j\langle Ap_{j-1}, p_i\rangle_2 = 0.
\]
For $j = i$ we can use the definition of $\alpha_i$, equation (6.10) and
\[
\langle Ar_i, p_i\rangle_2 = \langle A(p_i - \beta_i p_{i-1}), p_i\rangle_2 = \langle Ap_i, p_i\rangle_2
\]
and the third induction hypothesis to conclude that
\[
\langle r_i, r_{i+1}\rangle_2 = \langle r_i, r_i\rangle_2 - \alpha_i\langle Ar_i, p_i\rangle_2 = \langle r_i, r_i\rangle_2 - \frac{\langle r_i, p_i\rangle_2}{\langle Ap_i, p_i\rangle_2}\,\langle Ar_i, p_i\rangle_2 = \langle r_i, r_i\rangle_2 - \langle r_i, p_i\rangle_2 = 0,
\]
which finishes the induction step for (6.8). Similarly, we proceed for (6.7). First of all, we have
\[
\langle Ap_j, p_{i+1}\rangle_2 = \langle Ap_j, r_{i+1} + \beta_{i+1}p_i\rangle_2 = \langle Ap_j, r_{i+1}\rangle_2 + \beta_{i+1}\langle Ap_j, p_i\rangle_2.
\]
In the case of $j = i$ this leads to
\[
\langle Ap_i, p_{i+1}\rangle_2 = \langle Ap_i, r_{i+1}\rangle_2 - \frac{\langle Ap_i, r_{i+1}\rangle_2}{\langle Ap_i, p_i\rangle_2}\,\langle Ap_i, p_i\rangle_2 = 0.
\]
In the case of $j \le i-1$ we have $\langle Ap_j, p_{i+1}\rangle_2 = \langle Ap_j, r_{i+1}\rangle_2$. Moreover, $\alpha_j = 0$ particularly means $\langle r_j, p_j\rangle_2 = \langle r_j, r_j\rangle_2 = 0$, which is equivalent to $r_j = 0$. Thus if $\alpha_j = 0$ then the iteration would have stopped before. We hence may assume that $\alpha_j \ne 0$ such that we can use (6.10) to gain the representation $Ap_j = (r_j - r_{j+1})/\alpha_j$. Together with (6.8) for $i+1$ this leads to
\[
\langle Ap_j, p_{i+1}\rangle_2 = \langle Ap_j, r_{i+1}\rangle_2 = \frac{1}{\alpha_j}\,\langle r_j - r_{j+1}, r_{i+1}\rangle_2 = 0.
\]
This proves (6.7) for $i+1$. Finally, for (6.9) we use (6.1) from the generic minimisation, which holds for any index, in the form $\langle r_{i+1}, p_i\rangle_2 = 0$. From this, we can conclude that
\[
\langle r_{i+1}, p_{i+1}\rangle_2 = \langle r_{i+1}, r_{i+1}\rangle_2 + \beta_{i+1}\langle r_{i+1}, p_i\rangle_2 = \langle r_{i+1}, r_{i+1}\rangle_2,
\]
which finalises our induction step.
For the computation of the next iteration step, we need $p_{i+1} \ne 0$. If this is not the case then the method terminates. Since $0 = \langle r_{i+1}, p_{i+1}\rangle_2 = \|r_{i+1}\|_2^2$ and hence $r_{i+1} = 0 = b - Ax_{i+1}$ we have produced a solution. If the method does not terminate early then, after $n$ steps, we have created $n$ conjugate directions and Theorem 6.8 shows that $x_n$ is the true solution $x^*$. □
It is possible to improve the preliminary version of the CG method. First of all, we can get rid of the matrix–vector multiplication in the definition of $r_{j+1}$ by using (6.10). Moreover, from (6.10) it also follows that $Ap_j = \frac{1}{\alpha_j}(r_j - r_{j+1})$ such that
\[
\beta_{j+1} = -\frac{\langle Ap_j, r_{j+1}\rangle_2}{\langle Ap_j, p_j\rangle_2} = -\frac{1}{\alpha_j}\,\frac{\langle r_j, r_{j+1}\rangle_2 - \langle r_{j+1}, r_{j+1}\rangle_2}{\langle Ap_j, p_j\rangle_2} = \frac{\langle r_{j+1}, r_{j+1}\rangle_2}{\langle Ap_j, p_j\rangle_2}\,\frac{\langle Ap_j, p_j\rangle_2}{\langle r_j, p_j\rangle_2} = \frac{\|r_{j+1}\|_2^2}{\|r_j\|_2^2}.
\]
This leads to a version of the CG method which can easily be run on a parallel computer, and which contains only one matrix–vector multiplication and three inner products and three scalar–vector multiplications, as detailed in Algorithm 22.

In the following theorems we will show some of the properties of the CG algorithm. In particular, we will investigate convergence in the energy norm $\|x\|_A := \sqrt{\langle x, x\rangle_A}$, $x \in \mathbb{R}^n$.

Definition 6.10 Let $A \in \mathbb{R}^{n\times n}$ and $r \in \mathbb{R}^n$ be given. For $i \in \mathbb{N}$ we define the $i$th Krylov space to $A$ and $r$ as
\[
K_i(A, r) = \operatorname{span}\{r, Ar, \ldots, A^{i-1}r\}.
\]


Algorithm 22: CG method

Input : $A = A^T \in \mathbb{R}^{n\times n}$ positive definite, $b \in \mathbb{R}^n$.
Output: Approximate solution of $Ax = b$.
1 Choose $x_0 \in \mathbb{R}^n$.
2 Set $p_0 := r_0 := b - Ax_0$ and $j := 0$.
3 while $\|r_j\|_2 > \epsilon$ do
4   $t_j := Ap_j$
5   $\alpha_j := \|r_j\|_2^2 / \langle t_j, p_j\rangle_2$
6   $x_{j+1} := x_j + \alpha_j p_j$
7   $r_{j+1} := r_j - \alpha_j t_j$
8   $\beta_{j+1} := \|r_{j+1}\|_2^2 / \|r_j\|_2^2$
9   $p_{j+1} := r_{j+1} + \beta_{j+1}p_j$
10  $j := j + 1$
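A direct transcription of Algorithm 22 into NumPy might look as follows (a sketch, not from the text; tol and max_iter are hypothetical parameters playing the roles of $\epsilon$ and a safeguard on the number of iterations).

    import numpy as np

    def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=None):
        """Minimal sketch of Algorithm 22 (CG) for symmetric positive definite A."""
        n = len(b)
        max_iter = n if max_iter is None else max_iter
        x = x0.astype(float)
        r = b - A @ x                      # r_0
        p = r.copy()                       # p_0 := r_0
        rr = r @ r
        for _ in range(max_iter):
            if np.sqrt(rr) <= tol:
                break
            t = A @ p                      # the only matrix-vector product
            alpha = rr / (t @ p)           # alpha_j = ||r_j||^2 / <t_j, p_j>
            x = x + alpha * p
            r = r - alpha * t              # r_{j+1} = r_j - alpha_j t_j
            rr_new = r @ r
            beta = rr_new / rr             # beta_{j+1} = ||r_{j+1}||^2 / ||r_j||^2
            p = r + beta * p
            rr = rr_new
        return x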

In general, we only have dim Ki (A, r) ≤ i, since r could, for example, be from
the null space of A. Fortunately, in our situation things are much better.
Lemma 6.11 Let A ∈ Rn×n be symmetric and positive definite. Let pi and ri
be the vectors created during the CG iterations. Then,
Ki (A, r0 ) = span{r0 , . . . , ri−1 } = span{p0 , . . . , pi−1 }
for 1 ≤ i ≤ n. In particular, we have dim Ki (A, r0 ) = i.
Proof The proof is again by induction on $i$. For $i = 1$ this follows immediately as we have $p_0 = r_0$. For the induction step $i \to i+1$ we assume that we have not stopped prematurely. By construction, we have $p_i = r_i + \beta_i p_{i-1}$ with $\beta_i \ne 0$ as we would have stopped before otherwise. This shows $p_i \in \operatorname{span}\{r_i, p_{i-1}\}$ and hence by induction $\operatorname{span}\{p_0, \ldots, p_i\} \subseteq \operatorname{span}\{r_0, \ldots, r_i\}$. As the A-conjugate directions $p_0, \ldots, p_i$ are linearly independent, comparing the dimensions shows that we actually have $\operatorname{span}\{p_0, \ldots, p_i\} = \operatorname{span}\{r_0, \ldots, r_i\}$.
Using $r_i = r_{i-1} - \alpha_{i-1}Ap_{i-1}$, where we again may assume that $\alpha_{i-1} \ne 0$, we find $r_i \in \operatorname{span}\{r_{i-1}, Ap_{i-1}\} \subseteq K_i(A, r_0) + AK_i(A, r_0) = K_{i+1}(A, r_0)$ and thus $\operatorname{span}\{r_0, \ldots, r_i\} \subseteq K_{i+1}(A, r_0)$ and a comparison of the dimensions shows equality again. □
Recall that the true solution $x^*$ of $Ax = b$ and the iterations $x_i$ of the CG method can be written as
\[
x^* = x_0 + \sum_{j=0}^{n-1}\alpha_j p_j \qquad\text{and}\qquad x_i = x_0 + \sum_{j=0}^{i-1}\alpha_j p_j.
\]


Moreover, by Lemma 6.11, it is possible to represent an arbitrary element $x \in x_0 + K_i(A, r_0)$ by
\[
x = x_0 + \sum_{j=0}^{i-1}\gamma_j p_j
\]
with certain numbers $\gamma_0, \ldots, \gamma_{i-1} \in \mathbb{R}$. Let us define $d_j = \langle Ap_j, p_j\rangle_2$. Since the directions are A-conjugate, we can conclude that
\begin{align*}
\|x^* - x_i\|_A^2 &= \Big\|\sum_{j=i}^{n-1}\alpha_j p_j\Big\|_A^2 = \sum_{j,k=i}^{n-1}\alpha_j\alpha_k\langle Ap_k, p_j\rangle_2 = \sum_{j=i}^{n-1}d_j|\alpha_j|^2\\
&\le \sum_{j=i}^{n-1}d_j|\alpha_j|^2 + \sum_{j=0}^{i-1}d_j|\alpha_j - \gamma_j|^2 = \Big\|\sum_{j=i}^{n-1}\alpha_j p_j + \sum_{j=0}^{i-1}(\alpha_j - \gamma_j)p_j\Big\|_A^2\\
&= \Big\|x^* - x_0 - \sum_{j=0}^{i-1}\gamma_j p_j\Big\|_A^2 = \|x^* - x\|_A^2.
\end{align*}

Since this holds for an arbitrary x ∈ x0 +Ki (A, r0 ) we have proven the following
theorem.

Theorem 6.12 Let $A \in \mathbb{R}^{n\times n}$ be symmetric and positive definite. The iteration $x_i$ from the CG method gives the best approximation to the solution $x^*$ of $Ax = b$ from the affine space $x_0 + K_i(A, r_0)$ with respect to the energy norm $\|\cdot\|_A$, i.e.
\[
\|x^* - x_i\|_A \le \|x^* - x\|_A, \qquad x \in x_0 + K_i(A, r_0).
\]

The same idea shows that the iteration sequence is monotone.

Corollary 6.13 The sequence of iterations of the CG method is monotone in the energy norm, i.e. for all $i$ we have
\[
\|x^* - x_{i+1}\|_A \le \|x^* - x_i\|_A.
\]
Proof Using the same notation as in the theorem, we find $x_i \in x_0 + K_i(A, r_0) \subseteq x_0 + K_{i+1}(A, r_0)$. Since $x_{i+1}$ minimises $x \mapsto \|x^* - x\|_A$ over $x_0 + K_{i+1}(A, r_0)$, the statement follows immediately. □

Next, let us rewrite this approximation problem in the original basis of the
Krylov space. We will use the following notation. We denote the set of all
polynomials of degree less than or equal to n by πn = πn (R).


For a polynomial $P(t) = \sum_{j=0}^{i-1}\gamma_j t^j \in \pi_{i-1}$, a matrix $A \in \mathbb{R}^{n\times n}$ and $x \in \mathbb{R}^n$ we will write
\[
P(A) = \sum_{j=0}^{i-1}\gamma_j A^j, \qquad P(A)x = \sum_{j=0}^{i-1}\gamma_j A^j x.
\]
If $x$ is an eigenvector of $A$ with eigenvalue $\lambda$ then
\[
P(A)x = \sum_{j=0}^{i-1}\gamma_j\lambda^j x = P(\lambda)x.
\]

Theorem 6.14 Let $A \in \mathbb{R}^{n\times n}$ be symmetric and positive definite, having the eigenvalues $0 < \lambda_1 \le \cdots \le \lambda_n$. If $x^*$ denotes the solution of $Ax = b$ and if $x_i$ denotes the $i$th iteration of the CG method, then
\[
\|x^* - x_i\|_A \le \min_{\substack{P\in\pi_i\\ P(0)=1}}\max_{1\le j\le n}|P(\lambda_j)|\;\|x^* - x_0\|_A. \tag{6.11}
\]

Proof Let us express an arbitrary $x \in x_0 + K_i(A, r_0)$ as
\[
x = x_0 + \sum_{j=0}^{i-1}\gamma_j A^j r_0 =: x_0 + Q(A)r_0,
\]
introducing the polynomial $Q(t) = \sum_{j=0}^{i-1}\gamma_j t^j \in \pi_{i-1}$, where we set $Q = 0$ for $i = 0$. This gives with $r_0 = A(x^* - x_0)$ and $P(A) := I - Q(A)A$ the representation
\[
x^* - x = x^* - x_0 - Q(A)r_0 = x^* - x_0 - Q(A)A(x^* - x_0) = (I - Q(A)A)(x^* - x_0) = P(A)(x^* - x_0)
\]
and hence
\[
\|x^* - x\|_A = \|P(A)(x^* - x_0)\|_A.
\]
For the polynomial $P$ we have
\[
P(t) = 1 - \sum_{j=0}^{i-1}\gamma_j t^{j+1} = 1 - \sum_{j=1}^{i}\gamma_{j-1}t^j,
\]
showing $P \in \pi_i$ with $P(0) = 1$. However, if, vice versa, $P \in \pi_i$ with $P(0) = 1$ is given then we can define $Q(t) = (P(t) - 1)/t \in \pi_{i-1}$, which leads to an element from $x_0 + K_i(A, r_0)$, for which the calculation above holds. Thus, Theorem 6.12 yields
\[
\|x^* - x_i\|_A \le \min_{\substack{P\in\pi_i\\ P(0)=1}}\|P(A)(x^* - x_0)\|_A. \tag{6.12}
\]


Next, let $w_1, \ldots, w_n$ be an orthonormal basis of $\mathbb{R}^n$ consisting of eigenvectors of $A$ associated with the eigenvalues $\lambda_1, \ldots, \lambda_n$. These vectors satisfy
\[
\langle w_j, w_k\rangle_A = \langle Aw_j, w_k\rangle_2 = \lambda_j\langle w_j, w_k\rangle_2 = \lambda_j\delta_{jk}.
\]
Moreover, we can represent every vector using this basis and with such a representation $x^* - x_0 = \sum_j \rho_j w_j$ we can conclude that
\[
P(A)(x^* - x_0) = \sum_{j=1}^{n}\rho_j P(\lambda_j)w_j
\]
holds, which leads to
\[
\|P(A)(x^* - x_0)\|_A^2 = \sum_{j=1}^{n}\rho_j^2\lambda_j P(\lambda_j)^2 \le \max_{1\le j\le n}P(\lambda_j)^2\sum_{j=1}^{n}\rho_j^2\lambda_j = \max_{1\le j\le n}P(\lambda_j)^2\,\|x^* - x_0\|_A^2.
\]
Plugging this into (6.12) gives the desired inequality. □

Hence, we have converted the weighted $\ell_2$-approximation problem into a discrete $\ell_\infty$-approximation problem. Recall that by Theorem 6.8 the CG method converges after at most $n$ steps. This can also directly be concluded from (6.11), since for $i = n$ we can pick a polynomial $P^* \in \pi_n$ with $P^*(0) = 1$ and $P^*(\lambda_j) = 0$ for $1 \le j \le n$, such that (6.11) gives
\[
\|x^* - x_n\|_A \le \min_{\substack{P\in\pi_n\\ P(0)=1}}\max_{1\le j\le n}|P(\lambda_j)|\;\|x^* - x_0\|_A \le \max_{1\le j\le n}|P^*(\lambda_j)|\;\|x^* - x_0\|_A = 0.
\]

This result can obviously be generalised in the following sense.

Corollary 6.15 Assume that A ∈ Rn×n is symmetric and positive definite and
has 1 ≤ d ≤ n distinct eigenvalues. Then, the CG method terminates with the
solution x∗ ∈ Rn of Ax = b after at most d steps.

Since the eigenvalues are in general unknown, we need to convert the discrete $\ell_\infty$ problem into a continuous $L_\infty$ approximation problem by further estimating
\[
\|x^* - x_i\|_A \le \min_{\substack{P\in\pi_i\\ P(0)=1}}\|P\|_{L_\infty[\lambda_1,\lambda_n]}\;\|x^* - x_0\|_A,
\]
where $\|P\|_{L_\infty[a,b]} = \max_{x\in[a,b]}|P(x)|$. Note that $\lambda_1$ and $\lambda_n$ can be replaced by estimates $\tilde\lambda_1 \le \lambda_1$ and $\tilde\lambda_n \ge \lambda_n$.
The solution of this approximation problem can be computed explicitly. It in-
volves the classical Chebyshev polynomials.


Definition 6.16 The Chebyshev polynomial of degree n ∈ N0 is defined by


T n (t) = cos(n arccos(t)), t ∈ [−1, 1].
It is easy to see that we have T 0 (t) = 1, T 1 (t) = t and
T n+1 (t) = 2tT n (t) − T n−1 (t). (6.13)
From this, it follows immediately that T n is indeed a polynomial of degree n.
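The three-term recursion (6.13) also gives a convenient way to evaluate $T_n$ numerically. The following small sketch (illustrative only, not part of the text) evaluates $T_n$ via (6.13) and checks it against the defining formula on $[-1, 1]$.

    import numpy as np

    def chebyshev(n, t):
        """Evaluate the Chebyshev polynomial T_n(t) via the recursion (6.13)."""
        T_prev, T_curr = np.ones_like(t), t      # T_0 = 1, T_1 = t
        if n == 0:
            return T_prev
        for _ in range(n - 1):
            T_prev, T_curr = T_curr, 2 * t * T_curr - T_prev
        return T_curr

    t = np.linspace(-1.0, 1.0, 7)
    print(np.allclose(chebyshev(5, t), np.cos(5 * np.arccos(t))))   # True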
We also need some further elementary properties of the Chebyshev polynomi-
als, which we will collect in the next lemma.
Lemma 6.17 The Chebyshev polynomials have the following properties.
1. For every $t \in [0, 1)$, we have
\[
\frac12\left(\frac{1+\sqrt{t}}{1-\sqrt{t}}\right)^n \le T_n\!\left(\frac{1+t}{1-t}\right).
\]
2. For every $t \in [-1, 1]$ we have $|T_n(t)| \le 1$.
3. On $[-1, 1]$, the Chebyshev polynomial $T_n$ attains its extreme values $\pm 1$ at $t_k = \cos(\pi k/n)$, $0 \le k \le n$. To be more precise we have $T_n(t_k) = (-1)^k$.

Proof The second property follows immediately from the definition. The third property follows from the second one and by simple computation. For the first property it is helpful to show that for $|x| \ge 1$, we have the representation
\[
T_n(x) = \frac12\left[\left(x + \sqrt{x^2-1}\right)^n + \left(x - \sqrt{x^2-1}\right)^n\right].
\]
This is clear for $n = 0$ and $n = 1$ and for general $n$ it follows if we can show that the right-hand side follows the same recursion (6.13) as the Chebyshev polynomials, which is easily done. If we now set $x = \frac{1+t}{1-t}$ with $t \in [0, 1)$ we have $x \ge 1$ and hence
\[
T_n\!\left(\frac{1+t}{1-t}\right) \ge \frac12\left(x + \sqrt{x^2-1}\right)^n = \frac12\left(\frac{1+\sqrt{t}}{1-\sqrt{t}}\right)^n. \qquad\square
\]
With this at hand, we can continue with bounding the error for the conjugate
gradient method.
Theorem 6.18 Let $\lambda_n > \lambda_1 > 0$. The problem
\[
\min\{\|P\|_{L_\infty[\lambda_1,\lambda_n]} : P \in \pi_i,\ P(0) = 1\}
\]
has the solution
\[
P^*(t) = T_i\!\left(\frac{\lambda_n + \lambda_1 - 2t}{\lambda_n - \lambda_1}\right)\Big/\,T_i\!\left(\frac{\lambda_n + \lambda_1}{\lambda_n - \lambda_1}\right), \qquad t \in [\lambda_1, \lambda_n].
\]


Proof Setting $\gamma = \lambda_1/\lambda_n \in (0, 1)$ shows
\[
\frac{\lambda_n + \lambda_1}{\lambda_n - \lambda_1} = \frac{1+\gamma}{1-\gamma} \ge 1.
\]
This means that the denominator of $P^*$ is positive and $P^*$ is hence a feasible candidate for the minimum.
Noting that
\[
t \mapsto \frac{\lambda_n + \lambda_1 - 2t}{\lambda_n - \lambda_1}, \qquad t \in [\lambda_1, \lambda_n],
\]
maps $[\lambda_1, \lambda_n]$ linearly to $[-1, 1]$, we can conclude from Lemma 6.17 that there are $i+1$ different points $s_j \in [\lambda_1, \lambda_n]$ on which $P^*$ attains its extreme value
\[
M := \frac{1}{T_i\!\left(\frac{\lambda_n + \lambda_1}{\lambda_n - \lambda_1}\right)}
\]
on $[\lambda_1, \lambda_n]$ with alternating signs.
Now, assume that $P^*$ is not the solution to the minimisation problem, i.e. that there is a $Q \in \pi_i$ with $Q(0) = 1$ and $|Q(t)| < M$ for all $t \in [\lambda_1, \lambda_n]$. Then, $P^* - Q$ must have alternating signs in the $s_j$, which means that $P^* - Q$ must have at least $i$ zeros in $[\lambda_1, \lambda_n]$. However, $P^* - Q$ is also zero at $0 \notin [\lambda_1, \lambda_n]$, which means that $P^* - Q$ must have at least $i+1$ different zeros. But since $P^* - Q \in \pi_i$ this means $P^* = Q$ by the fundamental theorem of algebra. This is, however, a contradiction to our assumption on $Q$. □

This altogether gives the final estimate on the convergence of the CG method. We have, again with $\gamma = \lambda_1/\lambda_n$, the bound
\begin{align*}
\|x^* - x_i\|_A &\le \min_{\substack{P\in\pi_i\\ P(0)=1}}\|P\|_{L_\infty[\lambda_1,\lambda_n]}\;\|x^* - x_0\|_A
\le \max_{t\in[\lambda_1,\lambda_n]}\frac{\left|T_i\!\left(\frac{\lambda_n+\lambda_1-2t}{\lambda_n-\lambda_1}\right)\right|}{T_i\!\left(\frac{\lambda_n+\lambda_1}{\lambda_n-\lambda_1}\right)}\;\|x^* - x_0\|_A\\
&\le \frac{1}{T_i\!\left(\frac{1+\gamma}{1-\gamma}\right)}\;\|x^* - x_0\|_A
\le 2\left(\frac{1-\sqrt{\gamma}}{1+\sqrt{\gamma}}\right)^i\|x^* - x_0\|_A.
\end{align*}
Finally using the fact that $1/\gamma = \kappa_2(A)$ shows that we have the following result.

Theorem 6.19 Let A ∈ Rn×n be symmetric and positive definite. The sequence


of iterations $\{x_i\}$ generated by the CG method satisfies the estimate
\[
\|x^* - x_i\|_A \le 2\|x^* - x_0\|_A\left(\frac{1-\sqrt{\gamma}}{1+\sqrt{\gamma}}\right)^i \le 2\|x^* - x_0\|_A\left(\frac{\sqrt{\kappa_2(A)}-1}{\sqrt{\kappa_2(A)}+1}\right)^i, \tag{6.14}
\]
where $\gamma = 1/\kappa_2(A) = \lambda_1/\lambda_n$.

As in the case of the method of steepest descent we can determine the expected number of iterations necessary to reduce the initial error to a given $\epsilon > 0$. The proof of the following result is the same as the proof of Corollary 6.4.

Corollary 6.20 Let $A \in \mathbb{R}^{n\times n}$ be symmetric and positive definite. If $\{x_j\}$ denotes the iterations of the conjugate gradient method then we have
\[
\frac{\|x^* - x_j\|_A}{\|x^* - x_0\|_A} \le \epsilon
\]
after at most $j \ge \tfrac12\sqrt{\kappa_2(A)}\,\log(1/\epsilon)$ steps.
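To get a feeling for the difference, the following snippet (illustrative only) evaluates the two iteration bounds $\tfrac12\kappa_2(A)\log(1/\epsilon)$ and $\tfrac12\sqrt{\kappa_2(A)}\log(1/\epsilon)$ for a few sample condition numbers.

    import numpy as np

    eps = 1e-8
    for kappa in [1e2, 1e4, 1e6]:
        steepest = 0.5 * kappa * np.log(1.0 / eps)
        cg = 0.5 * np.sqrt(kappa) * np.log(1.0 / eps)
        print(f"kappa = {kappa:.0e}: steepest descent ~ {steepest:.0f}, CG ~ {cg:.0f}")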

If we compare this result with $j \ge \tfrac12\kappa_2(A)\log(1/\epsilon)$, which is required for the steepest descent method, and note that $\sqrt{\kappa_2(A)} \le \kappa_2(A)$, we see that the CG method is significantly faster than the steepest descent method.
In specific situations, the estimate (6.14) can further be improved. As a matter of fact, (6.14) is useful only if the condition number is small and if the eigenvalues are more or less quasi-uniformly distributed over $[\lambda_1, \lambda_n]$. If the eigenvalues are clustered, it is possible to choose better polynomials.
For example, let us assume that $A$ has an isolated largest eigenvalue, i.e. we have $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_{n-1} \ll \lambda_n$. Then, with each polynomial $\tilde P \in \pi_{i-1}$ with $\tilde P(0) = 1$ we can define a polynomial $P \in \pi_i$ by
\[
P(t) := \frac{\lambda_n - t}{\lambda_n}\,\tilde P(t), \qquad t \in [\lambda_1, \lambda_n],
\]
which also satisfies $P(0) = 1$ but also $P(\lambda_n) = 0$ and, since the first factor has absolute value less than or equal to one on $[\lambda_1, \lambda_n]$, we also have $|P(\lambda_j)| \le |\tilde P(\lambda_j)|$. Thus, using (6.11) and then the ideas leading to (6.14), we can derive
\begin{align*}
\|x^* - x_i\|_A &\le \min_{\substack{P\in\pi_i\\ P(0)=1}}\max_{1\le j\le n}|P(\lambda_j)|\;\|x^* - x_0\|_A \le \min_{\substack{P\in\pi_{i-1}\\ P(0)=1}}\max_{1\le j\le n-1}|P(\lambda_j)|\;\|x^* - x_0\|_A\\
&\le \max_{t\in[\lambda_1,\lambda_{n-1}]}\frac{\left|T_{i-1}\!\left(\frac{\lambda_{n-1}+\lambda_1-2t}{\lambda_{n-1}-\lambda_1}\right)\right|}{T_{i-1}\!\left(\frac{\lambda_{n-1}+\lambda_1}{\lambda_{n-1}-\lambda_1}\right)}\;\|x^* - x_0\|_A \le \frac{1}{T_{i-1}\!\left(\frac{1+\gamma}{1-\gamma}\right)}\;\|x^* - x_0\|_A\\
&\le 2\left(\frac{1-\sqrt{\gamma}}{1+\sqrt{\gamma}}\right)^{i-1}\|x^* - x_0\|_A
\end{align*}


with γ = λ1 /λn−1 . Hence, we have eliminated the largest eigenvalue and ex-
pressed convergence using the reduced condition number κn−1 := λn−1 /λ1 at
the cost of one factor. Obviously, this result can be generalised in such a way
that the d largest eigenvalues are discarded.
Corollary 6.21 If $A \in \mathbb{R}^{n\times n}$ is symmetric and positive definite with eigenvalues $0 < \lambda_1 \le \cdots \le \lambda_{n-d} \ll \lambda_{n-d+1} \le \cdots \le \lambda_n$, then we have for $i \ge d$ the estimate
\[
\|x^* - x_i\|_A \le 2\left(\frac{\sqrt{\kappa_{n-d}}-1}{\sqrt{\kappa_{n-d}}+1}\right)^{i-d}\|x^* - x_0\|_A,
\]
with the reduced condition number $\kappa_{n-d} = \lambda_{n-d}/\lambda_1$.
Similarly, we can deal with an isolated smallest eigenvalue. However, the situation here is somewhat worse, in the following sense. If we proceed as above then, for a polynomial $\tilde P \in \pi_{i-1}$ with $\tilde P(0) = 1$, we define the polynomial $P \in \pi_i$ by
\[
P(t) = \frac{\lambda_1 - t}{\lambda_1}\,\tilde P(t), \qquad t \in [\lambda_1, \lambda_n].
\]
We still have $P(0) = 1$ and $P(\lambda_1) = 0$. But at the other eigenvalues $\lambda_j$ the absolute value of the first factor can only be bounded by $(\lambda_j - \lambda_1)/\lambda_1 \le (\lambda_n - \lambda_1)/\lambda_1 = \tilde\kappa_1 - 1$ with the (original) condition number $\tilde\kappa_1 = \kappa_2(A) = \lambda_n/\lambda_1$. Hence, this time we only have
\[
\|x^* - x_i\|_A \le 2\,(\tilde\kappa_1 - 1)\left(\frac{\sqrt{\tilde\kappa_2}-1}{\sqrt{\tilde\kappa_2}+1}\right)^{i-1}\|x^* - x_0\|_A
\]
with $\tilde\kappa_2 = \lambda_n/\lambda_2$. This can still be an improvement over the standard estimate since the condition number enters only as a constant factor but the convergence forcing term can significantly improve. Of course, we can iterate this process, as well.
Corollary 6.22 If $A \in \mathbb{R}^{n\times n}$ is symmetric and positive definite with eigenvalues $0 < \lambda_1 \le \cdots \le \lambda_d \ll \lambda_{d+1} \le \cdots \le \lambda_n$, then we have for $i \ge d$ the estimate
\[
\|x^* - x_i\|_A \le 2\prod_{j=1}^{d}\left(\tilde\kappa_j - 1\right)\left(\frac{\sqrt{\tilde\kappa_{d+1}}-1}{\sqrt{\tilde\kappa_{d+1}}+1}\right)^{i-d}\|x^* - x_0\|_A,
\]
with the reduced condition numbers $\tilde\kappa_j = \lambda_n/\lambda_j$.


This also shows that the method still converges if the matrix is almost singular,
i.e. if λ1 ≈ · · · ≈ λd ≈ 0, but it might take some time for the convergence
to become effective. The convergence rate depends then on [λd+1 , λn ]. Hence,


one should try to use appropriate matrix manipulations to reduce the condition
number of A. If the matrix A is almost singular one should try to find a trans-
formation of the smallest eigenvalues to zero and of the bigger eigenvalues to
an interval of the form (λ, λ(1 + ε)) ⊂ (0, ∞) with ε as small as possible.
In general it does not matter if, during this process, the rank of the matrix is
reduced. Generally, this leads only to a small additional error but also to a
significant increase in the convergence rate of the CG method.

6.2 GMRES and MINRES


We are now going to discuss how to solve the linear system Ax = b for arbitrary
matrices A ∈ Rn×n using ideas similar to those we applied in the CG method.
One possibility would be to look at the normal equations AT Ax = AT b, which
lead to the same solution, provided the matrix A is invertible. Since AT A is
symmetric and positive definite, we can use the CG method to solve this prob-
lem. If x∗ is the solution and x j ∈ x0 + K j (AT A, AT r0 ) is a typical iteration then
we have the identity
\[
\|x^* - x_j\|_{A^TA}^2 = (x^* - x_j)^T A^TA(x^* - x_j) = (Ax^* - Ax_j)^T(Ax^* - Ax_j) = \|b - Ax_j\|_2^2.
\]
This is the reason why this method is also called the CGNR method, where
CGNR stands for Conjugate Gradient on the Normal equation to minimise the
Residual.
However, the method is only of limited use since we know that we have to ex-
pect a condition number which is the square of the original condition number.
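In code, CGNR amounts to nothing more than feeding $A^TA$ and $A^Tb$ into a CG routine. A possible sketch (illustrative only; it reuses the conjugate_gradient function sketched after Algorithm 22 and forms $A^TA$ explicitly merely for brevity, which one would avoid in practice):

    import numpy as np

    def cgnr(A, b, x0, tol=1e-10):
        """Sketch of CGNR: CG applied to the normal equations A^T A x = A^T b."""
        return conjugate_gradient(A.T @ A, A.T @ b, x0, tol=tol)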
Nonetheless, we can generalise this idea in the following way.
Definition 6.23 Assume $A \in \mathbb{R}^{n\times n}$ is invertible and $b \in \mathbb{R}^n$. Let $x_0 \in \mathbb{R}^n$ and $r_0 := b - Ax_0$. The $j$th iteration of the GMRES (Generalised Minimum Residual) method is defined to be the solution of
\[
\min\{\|b - Ax\|_2 : x \in x_0 + K_j(A, r_0)\}. \tag{6.15}
\]
If $A \in \mathbb{R}^{n\times n}$ is symmetric, then the method generating the sequence $\{x_j\}$ by minimising (6.15) is called MINRES (Minimum Residual).
The difference between MINRES and GMRES is two-fold. First of all, in the
case of GMRES, an actual algorithm for creating the sequence {x j } needs to
work for all non-singular matrices A ∈ Rn×n , while in the case of MINRES
such an algorithm can exploit the fact that it will only be used for symmetric
matrices. We will soon see that this makes a significant difference in designing


such algorithms. Secondly, there is a difference in analysing the algorithms. For


MINRES we can use the fact that the matrix is diagonalisable with only real
eigenvalues and a corresponding orthonormal basis consisting of eigenvectors.
We cannot do this when analysing GMRES. Finally, historically, MINRES was
introduced in [99] by Paige and Saunders long before GMRES was presented
in [105] by Saad and Schultz, but this is obviously of no consequence for us.
Despite these differences, the starting point for analysing GMRES and MINRES is the same and very similar to the analysis of the CG method. Since every $x \in x_0 + K_j(A, r_0)$ can be written as $x = x_0 + \sum_{k=0}^{j-1}a_kA^kr_0$, we can conclude that
\[
b - Ax = b - Ax_0 - \sum_{k=0}^{j-1}a_kA^{k+1}r_0 = r_0 - \sum_{k=1}^{j}a_{k-1}A^kr_0 =: P(A)r_0
\]
with a polynomial $P \in \pi_j$ satisfying $P(0) = 1$. Hence, minimising over all such $x$ is equivalent to minimising over all these $P$.

Theorem 6.24 If $A \in \mathbb{R}^{n\times n}$ is invertible (and symmetric) and if $\{x_j\}$ denote the iterations of the GMRES (MINRES) method then
\[
\|b - Ax_j\|_2 = \min_{\substack{P\in\pi_j\\ P(0)=1}}\|P(A)r_0\|_2.
\]
Moreover, the method needs at most $n$ steps to terminate with the solution of $Ax = b$.

Proof It remains to show that the method stops with the solution $x^*$. To see this, let $p(t) = \det(A - tI)$ be the characteristic polynomial of $A$. Then, $p \in \pi_n$ and, since $A$ is invertible, $p(0) = \det A \ne 0$. Hence, we can choose the polynomial $P := p/p(0)$ in the last step. However, the Cayley–Hamilton theorem gives $p(A) = 0$ and hence $P(A) = 0$ so that we must have $b = Ax_n$, unless the method terminates before. □

In the case of the CG method, we then used the fact that there is a basis of
Rn consisting of orthonormal eigenvectors of A. This is the point where the
analyses of MINRES and GMRES diverge. Since the analysis of MINRES is
closer to that one of CG we will first look at MINRES.
Hence, for the time being, let A = AT ∈ Rn×n be symmetric. Then, A has real
eigenvalues λ j ∈ R, 1 ≤ j ≤ n, which are due to the invertibility of A all differ-
ent from zero. Moreover, we can choose an orthonormal basis w1 , . . . , wn ∈ Rn
of Rn consisting of eigenvectors of A, i.e. Aw j = λ j w j . With this we can con-
tinue as in the analysis of the CG method and expand x ∈ Rn with x2 = 1 as

 
$x = \sum_j x_jw_j$, which then gives $P(A)x = \sum_j x_jP(\lambda_j)w_j$ and hence
\[
\|P(A)\|_2^2 = \sup_{\|x\|_2=1}\|P(A)x\|_2^2 = \sup_{\|x\|_2=1}\sum_{j=1}^{n}x_j^2P^2(\lambda_j) \le \max_{1\le j\le n}P^2(\lambda_j).
\]

Thus, we have the following version of Theorem 6.14.


Proposition 6.25 Let $A \in \mathbb{R}^{n\times n}$ be symmetric and invertible, having the eigenvalues $\lambda_1 \le \cdots \le \lambda_n$. Let $x^* \in \mathbb{R}^n$ denote the solution of $Ax = b$ and let $\{x_i\}$ denote the iterations from MINRES, then the residuals $r_i := b - Ax_i$ satisfy
\[
\|r_i\|_2 \le \min_{\substack{P\in\pi_i\\ P(0)=1}}\max_{1\le j\le n}|P(\lambda_j)|\;\|r_0\|_2. \tag{6.16}
\]

In contrast to the situation of the CG method, now the eigenvalues of A, though


still real numbers, will be negative and positive, which makes the analysis
somewhat more complicated. In this context, it is usual to assume that the
eigenvalues of A are contained in [a, b] ∪ [c, d] with a < b < 0 < c < d
such that d − c and b − a are as small as possible. Ideally, a, b, c, d are given
by eigenvalues of A. However, for the purpose of the analysis it makes things
easier to assume that b − a = d − c, which can be achieved by shifting some of
the values and increasing the size of one of the intervals if necessary.
Since we now have both positive and negative eigenvalues we must expect a
slower convergence than when we only have positive (or negative) eigenvalues.
Nonetheless, using ideas similar to those in the analysis of the CG method, we
can derive the following result.
Theorem 6.26 Let $A \in \mathbb{R}^{n\times n}$ be symmetric and invertible. Let the eigenvalues of $A$ be contained in $[a, b]\cup[c, d]$ with $a < b < 0 < c < d$ and $b - a = d - c$. Then, the residual $r_i = b - Ax_i$ between the true solution $x^* \in \mathbb{R}^n$ of $Ax = b$ and the $i$th iteration $x_i$ of MINRES can be bounded by
\[
\|r_i\|_2 \le 2\left(\frac{\sqrt{|ad|}-\sqrt{|bc|}}{\sqrt{|ad|}+\sqrt{|bc|}}\right)^{\lfloor i/2\rfloor}\|r_0\|_2.
\]
Proof As we want to use our ideas on Chebyshev polynomials, we need to map the intervals $[a, b]$ and $[c, d]$ simultaneously to $[-1, 1]$. Since they have the same length, we can do this by a polynomial of degree two. It is not too difficult to see that this polynomial is given by
\[
q(t) = 1 + 2\,\frac{(t-b)(t-c)}{ad - bc}, \qquad t \in [a, b]\cup[c, d].
\]
Next, we have
\[
q(0) = \frac{ad + bc}{ad - bc} = \frac{|ad| + |bc|}{|ad| - |bc|} = \frac{1+\gamma}{1-\gamma} > 1,
\]
with $\gamma = |bc|/|ad| \in (0, 1)$. Thus
\[
P^*(t) := \frac{T_{\lfloor i/2\rfloor}(q(t))}{T_{\lfloor i/2\rfloor}(q(0))} \in \pi_i
\]
is well-defined and satisfies $P^*(0) = 1$. Thus, we can bound
\begin{align*}
\|r_i\|_2 &\le \min_{\substack{P\in\pi_i\\ P(0)=1}}\max_{1\le j\le n}|P(\lambda_j)|\;\|r_0\|_2 \le \min_{\substack{P\in\pi_i\\ P(0)=1}}\max_{t\in[a,b]\cup[c,d]}|P(t)|\;\|r_0\|_2\\
&\le \max_{t\in[a,b]\cup[c,d]}|P^*(t)|\;\|r_0\|_2 \le \frac{1}{T_{\lfloor i/2\rfloor}(q(0))}\,\|r_0\|_2
= \frac{1}{T_{\lfloor i/2\rfloor}\!\left(\frac{1+\gamma}{1-\gamma}\right)}\,\|r_0\|_2 \le 2\left(\frac{1-\sqrt{\gamma}}{1+\sqrt{\gamma}}\right)^{\lfloor i/2\rfloor}\|r_0\|_2,
\end{align*}
where we have used Lemma 6.17. The result follows using $\gamma = |bc|/|ad|$. □

In the case where the intervals $[a, b]$ and $[c, d]$ are symmetric with respect to the origin, i.e. in the case of $a = -d$ and $b = -c$, we have $\gamma = |bc|/|ad| = c^2/d^2 = \kappa_2(A)^{-2}$, if $c$ and $d$ denote the smallest and largest modulus of an eigenvalue, and hence
\[
\|r_i\|_2 \le 2\left(\frac{\kappa_2(A)-1}{\kappa_2(A)+1}\right)^{\lfloor i/2\rfloor}\|r_0\|_2.
\]
Finally, if MINRES is applied to a positive definite and symmetric matrix, then obviously we can proceed exactly as in the analysis of the CG method, yielding an estimate of the form
\[
\|r_i\|_2 \le 2\left(\frac{\sqrt{\kappa_2(A)}-1}{\sqrt{\kappa_2(A)}+1}\right)^i\|r_0\|_2.
\]
Also results similar to those in Corollaries 6.21 and 6.22 regarding large and small isolated eigenvalues remain valid if the $\|\cdot\|_A$-norm there is replaced by the $\|\cdot\|_2$-norm.
Next, let us return to the analysis of the more general GMRES method. Hence, we are back at the representation
\[
\|b - Ax_j\|_2 = \min_{\substack{P\in\pi_j\\ P(0)=1}}\|P(A)r_0\|_2. \tag{6.17}
\]

Since A ∈ Rn×n is a general invertible matrix, we cannot expect anymore to


have an orthonormal basis of Rn consisting of eigenvectors of A.
Nonetheless, let us assume that the matrix A is at least diagonalisable, i.e. that
there is a matrix V ∈ Cn×n and a diagonal matrix D = diag(λ1 , . . . , λn ) ∈ Cn×n ,
such that
A = V DV −1 .


The diagonal elements of $D$ are the eigenvalues of $A$. However, from now on, we have to work with complex matrices.
If $P \in \pi_j$ is of the form $P(t) = 1 + \sum_{k=1}^{j}a_kt^k$ then we can conclude that
\[
P(A) = P(VDV^{-1}) = I + \sum_{k=1}^{j}a_k(VDV^{-1})^k = I + \sum_{k=1}^{j}a_kVD^kV^{-1} = VP(D)V^{-1},
\]
and we derive the following error estimate, which is, compared with the corresponding result of the CG method, worse by a factor $\kappa_2(V)$.

Corollary 6.27 Assume the matrix $A \in \mathbb{R}^{n\times n}$ is similar to a diagonal matrix with eigenvalues $\lambda_k \in \mathbb{C}$. The $j$th iteration of the GMRES method satisfies the error bound
\[
\|b - Ax_j\|_2 \le \kappa_2(V)\min_{\substack{P\in\pi_j\\ P(0)=1}}\max_{1\le k\le n}|P(\lambda_k)|\;\|b - Ax_0\|_2. \tag{6.18}
\]

Note that, in contrast to the corresponding minimisation problem of the CG or


the MINRES method, here, the eigenvalues λk can be complex, which com-
plicates matters. Even more, the role of the matrix V is problematic. Estimate
(6.18) is useful only if κ2 (V) ≈ 1. For example, if A is a normal matrix, then
V is unitary and hence κ2 (V) = 1. In this situation, (6.18) is optimal. How-
ever, if κ2 (V) is large, then the error estimate (6.18) is not useful and it is
unclear whether in such a situation GMRES converges only poorly or whether
estimate (6.18) is simply too pessimistic. If the condition number of V is suf-
ficiently small, then it is possible to use complex Chebyshev polynomials to
derive bounds similar to those of the CG method. Details can, for example, be
found in Saad’s book [104].
Moreover, in some situations the convergence of GMRES is difficult to predict
from the eigenvalues of A alone. To see this, let us have a look at the following
example.
Example 6.28 Let $A$ be a matrix of the form
\[
\begin{pmatrix}
0 & * & 0 & \cdots & 0\\
0 & * & * & \cdots & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
0 & * & * & \cdots & *\\
* & * & * & \cdots & *
\end{pmatrix},
\]
where a $*$ denotes a non-zero entry, and let $x_0$ be such that $r_0 = b - Ax_0$ is a multiple of $e_1$. Then we see that $Ar_0$ is a multiple of $e_n$, $A^2r_0$ is a linear


combination of en and en−1 and so on. Hence, all vectors Ak r0 for 1 ≤ k ≤ n − 1


are orthogonal to r0 .
Since we can write x j as x j = x0 + y j , where y j minimises r0 − Ay2 over all
y ∈ K j (A, r0 ), we see that this minimising of r0 − Ay2 over all y ∈ K j (A, r0 )
will always produce y j = 0 as long as j ≤ n − 1, i.e. we will have x j = x0 and
hence no progress for 1 ≤ j ≤ n − 1.
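This stagnation is easy to observe numerically. The following self-contained snippet (illustrative only) builds a companion-type matrix as in the example, chooses $x_0$ so that $r_0 = e_1$, and computes the minimal residual over $x_0 + K_j(A, r_0)$ for $j = 1, \ldots, n$ by a dense least-squares solve; the residual norm stays at 1 until $j = n$.

    import numpy as np

    n = 8
    A = np.zeros((n, n))
    A[:-1, 1:] = np.eye(n - 1)          # super-diagonal of ones
    A[-1, :] = np.random.rand(n) + 0.1  # last row: nonzero coefficients

    b = np.zeros(n)
    x0 = -np.linalg.solve(A, np.eye(n)[:, 0])   # chosen so that r0 = b - A x0 = e1

    r0 = b - A @ x0
    K = np.column_stack([np.linalg.matrix_power(A, k) @ r0 for k in range(n)])
    for j in range(1, n + 1):
        # minimise ||r0 - A y||_2 over y in K_j(A, r0)
        c, *_ = np.linalg.lstsq(A @ K[:, :j], r0, rcond=None)
        print(j, np.linalg.norm(r0 - A @ K[:, :j] @ c))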

Typical matrices of the form given in the example are companion matrices, i.e. matrices of the form
\[
A = \begin{pmatrix}
0 & 1 & 0 & \cdots & 0\\
0 & 0 & 1 & \cdots & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
0 & 0 & 0 & \cdots & 1\\
a_0 & a_1 & a_2 & \cdots & a_{n-1}
\end{pmatrix}.
\]
The eigenvalues are the roots of the polynomial $t^n - \sum_{j=0}^{n-1}a_jt^j$, which shows that matrices of the above form can occur with any desired eigenvalue distribution.
We will not further pursue this here. Instead we will look at a different possible
approach.
To describe this, we need to broaden our perspective a little. First of all, it is
easy to see that for a real matrix $A \in \mathbb{R}^{n\times n}$, the GMRES iteration $x_j$ satisfies (6.17) even if we allow polynomials $P \in \pi_j$ with complex coefficients. This follows because such a polynomial would have the form $P(t) = Q(t) + iR(t)$ with real polynomials $Q, R \in \pi_j$ satisfying $Q(0) = 1$ and $R(0) = 0$ and from
\[
\|P(A)r_0\|_2 = \left(\|Q(A)r_0\|_2^2 + \|R(A)r_0\|_2^2\right)^{1/2}
\]

we see that any nonzero R would only increase this norm.


In the case of a complex matrix A ∈ Cn×n we would define the GMRES iter-
ation by looking at complex Krylov spaces and hence would immediately use
polynomials with complex coefficients in (6.17). Hence, for the time being let
us work in the complex setting.
For a symmetric matrix we know that the maximum of the Rayleigh quotient
defines its maximum eigenvalue. The same is true for normal matrices. For a
general matrix we can generalise this concept as follows.

Definition 6.29 For a matrix $A \in \mathbb{C}^{n\times n}$, the field of values of $A$ is defined to be
\[
\mathcal{F}(A) := \left\{\bar{x}^TAx : x \in \mathbb{C}^n,\ \bar{x}^Tx = 1\right\}.
\]


The numerical radius of $A$ is defined as
\[
\nu(A) := \max\{|z| : z \in \mathcal{F}(A)\}.
\]
Since the function $x \mapsto \bar{x}^TAx$ is continuous, the field of values $\mathcal{F}(A)$ is compact and hence the numerical radius is well-defined. Moreover, by Corollary 2.28 we know that in the case of a normal matrix $A$ the numerical radius coincides with the 2-norm, i.e. $\nu(A) = \|A\|_2$. For a general matrix we have $\lambda \in \mathcal{F}(A)$ for every eigenvalue of $A$ because if $x \in \mathbb{C}^n$ denotes a corresponding, normalised eigenvector then we obviously have $\bar{x}^TAx = \lambda$. We have the following properties.

Lemma 6.30 Let $A \in \mathbb{C}^{n\times n}$ and $\alpha \in \mathbb{C}$. Then,
1. $\mathcal{F}(A + \alpha I) = \mathcal{F}(A) + \alpha$,
2. $\mathcal{F}(\alpha A) = \alpha\mathcal{F}(A)$,
3. $\tfrac12\|A\|_2 \le \nu(A) \le \|A\|_2$,
4. $\nu(A^m) \le [\nu(A)]^m$, $m = 1, 2, \ldots$.
Proof The first two properties follow from
\[
\bar{x}^T(A + \alpha I)x = \bar{x}^TAx + \alpha, \qquad \bar{x}^T\alpha Ax = \alpha\,\bar{x}^TAx,
\]
which hold for all $\bar{x}^Tx = 1$. For the third property we first note that
\[
|\bar{x}^TAx| \le \|x\|_2\|Ax\|_2 \le \|A\|_2,
\]
which gives the right inequality. For the left inequality we split $A$ into $A = H + N$ with $H = (A + \bar{A}^T)/2$ and $N = (A - \bar{A}^T)/2$. Since both matrices are normal, we have
\begin{align*}
\|A\|_2 &\le \|H\|_2 + \|N\|_2 = \nu(H) + \nu(N)
= \frac12\left(\sup_{\|x\|_2=1}|\bar{x}^T(A + \bar{A}^T)x| + \sup_{\|x\|_2=1}|\bar{x}^T(A - \bar{A}^T)x|\right)\\
&\le \frac12\left(2\sup_{\|x\|_2=1}|\bar{x}^TAx| + 2\sup_{\|x\|_2=1}|\bar{x}^T\bar{A}^Tx|\right) = 2\nu(A).
\end{align*}
For the last property, $\nu(A^m) \le [\nu(A)]^m$, we have to work a little harder. First note that because of the second property we have $\nu(\alpha A) = |\alpha|\nu(A)$. Thus, if we can show that $\nu(A) \le 1$ implies $\nu(A^m) \le 1$ for all $m \in \mathbb{N}$, then we will have for an arbitrary $A \in \mathbb{C}^{n\times n}$ with $\alpha = 1/\nu(A)$ that $\nu(\alpha A) \le 1$ and hence $\nu((\alpha A)^m) \le 1$, i.e.
\[
\nu((\alpha A)^m) = \alpha^m\nu(A^m) = \frac{1}{\nu(A)^m}\,\nu(A^m) \le 1.
\]


To see that $\nu(A) \le 1$ implies $\nu(A^m) \le 1$, let us introduce the roots of unity $\omega_k := \exp(2\pi ik/m)$ for $k \in \mathbb{Z}$. Obviously, there are only $m$ different roots of unity, for example for $1 \le k \le m$. Nonetheless, whenever necessary, we will use them in this periodic form. In particular, we will use that $\omega_m = \omega_0$. In this sense, the linear polynomial $\ell_k(z) = 1 - \omega_kz$ satisfies $\ell_k(\omega_{m-k}) = 0$ for all $k$ so that we can factorise the polynomial $1 - z^m$ as
\[
1 - z^m = \prod_{k=1}^{m}\ell_k(z) = \prod_{k=1}^{m}(1 - \omega_kz). \tag{6.19}
\]
Next, let $\pi_m(\mathbb{C})$ denote the space of complex polynomials of degree at most $m$ and let us define the Lagrange functions $L_j \in \pi_{m-1}(\mathbb{C})$, $1 \le j \le m$, via
\[
L_j(z) := \prod_{\substack{k=1\\ k\ne j}}^{m}\frac{\ell_k(z)}{\ell_k(\omega_{m-j})} = \prod_{\substack{k=1\\ k\ne j}}^{m}\frac{1 - \omega_kz}{1 - \omega_{k+m-j}} = \frac{1}{m}\prod_{\substack{k=1\\ k\ne j}}^{m}(1 - \omega_kz).
\]
In the last step, we have used that the denominator can be significantly simplified employing the identities
\[
\prod_{\substack{k=1\\ k\ne j}}^{m}(1 - \omega_{k+m-j}) = \prod_{k=1}^{m-1}(1 - \omega_k) = \lim_{z\to 1}\prod_{k=1}^{m-1}(1 - \omega_kz) = \lim_{z\to 1}\frac{1}{1 - \omega_mz}\prod_{k=1}^{m}(1 - \omega_kz) = \lim_{z\to 1}\frac{1 - z^m}{1 - z} = m.
\]
As we well know, these Lagrangian functions form a very specific basis of the polynomials of degree less than $m$ as they obviously satisfy $L_j(\omega_{m-k}) = \delta_{jk}$ for $1 \le j, k \le m$. In particular, we can reconstruct the constant function 1 by its interpolant
\[
1 = \sum_{j=1}^{m}1\cdot L_j(z) = \frac{1}{m}\sum_{j=1}^{m}\prod_{\substack{k=1\\ k\ne j}}^{m}(1 - \omega_kz). \tag{6.20}
\]

Equations (6.19) and (6.20) remain obviously valid if $z$ is replaced by a matrix $B \in \mathbb{C}^{n\times n}$.
For $x \in \mathbb{C}^n$ with $\|x\|_2 = 1$, we define the vectors
\[
x_j := \Bigg[\prod_{\substack{k=1\\ k\ne j}}^{m}(I - \omega_kB)\Bigg]x, \qquad 1 \le j \le m,
\]


and use (6.19) and (6.20) to derive
\begin{align*}
\frac{1}{m}\sum_{j=1}^{m}\|x_j\|_2^2\left(1 - \omega_j\Big\langle\frac{Bx_j}{\|x_j\|_2}, \frac{x_j}{\|x_j\|_2}\Big\rangle_2\right)
&= \frac{1}{m}\sum_{j=1}^{m}\langle(I - \omega_jB)x_j, x_j\rangle_2 = \frac{1}{m}\sum_{j=1}^{m}\Big\langle\Big[\prod_{k=1}^{m}(I - \omega_kB)\Big]x, x_j\Big\rangle_2\\
&= \frac{1}{m}\sum_{j=1}^{m}\langle(I - B^m)x, x_j\rangle_2 = \Big\langle(I - B^m)x, \frac{1}{m}\sum_{j=1}^{m}\prod_{\substack{k=1\\ k\ne j}}^{m}(I - \omega_kB)x\Big\rangle_2\\
&= \langle(I - B^m)x, x\rangle_2 = 1 - \langle B^mx, x\rangle_2.
\end{align*}
If we now set $B = e^{i\theta}A$ with $\theta \in \mathbb{R}$ chosen such that $e^{im\theta}\langle A^mx, x\rangle_2 = |\langle A^mx, x\rangle_2|$ we thus have
\begin{align*}
1 - |\langle A^mx, x\rangle_2| &= 1 - e^{im\theta}\langle A^mx, x\rangle_2
= \frac{1}{m}\sum_{j=1}^{m}\|x_j\|_2^2\left(1 - \omega_je^{i\theta}\Big\langle\frac{Ax_j}{\|x_j\|_2}, \frac{x_j}{\|x_j\|_2}\Big\rangle_2\right)\\
&= \frac{1}{m}\sum_{j=1}^{m}\|x_j\|_2^2\left(1 - \mathrm{Re}\,\Big(\omega_je^{i\theta}\Big\langle\frac{Ax_j}{\|x_j\|_2}, \frac{x_j}{\|x_j\|_2}\Big\rangle_2\Big)\right)\\
&\ge \frac{1}{m}\sum_{j=1}^{m}\|x_j\|_2^2\left(1 - \Big|\Big\langle\frac{Ax_j}{\|x_j\|_2}, \frac{x_j}{\|x_j\|_2}\Big\rangle_2\Big|\right) \ge 0
\end{align*}
since $\nu(A) \le 1$ implies that the term in the brackets on the right-hand side of the last line is non-negative. Thus we have $|\langle A^mx, x\rangle_2| \le 1$ for all normalised vectors $x \in \mathbb{C}^n$ showing $\nu(A^m) \le 1$. □

Let us see how we can use the concept of the numerical radius to provide
possible convergence estimates for GMRES.

Theorem 6.31 Suppose that $\mathcal{F}(A)$ is contained in the disc $D := \{z \in \mathbb{C} : |z - c| \le s\}$ and assume that $0 \notin D$, i.e. that $s < |c|$. Then the residuals of the GMRES method satisfy
\[
\|r_j\|_2 \le 2\left(\frac{s}{|c|}\right)^j\|r_0\|_2.
\]


Proof Let us define $P(z) = (1 - z/c)^j$. Then, Lemma 6.30 yields
\[
\frac{\|r_j\|_2}{\|r_0\|_2} \le \|P(A)\|_2 \le 2\nu(P(A)) = 2\nu\!\left(\left(I - \frac{1}{c}A\right)^j\right) \le 2\left[\nu\!\left(I - \frac{1}{c}A\right)\right]^j.
\]
Furthermore, if $x \in \mathbb{C}^n$ satisfies $\|x\|_2 = 1$ then we know by assumption that $z := \bar{x}^TAx$ satisfies $|z - c| \le s$ or $|1 - z/c| \le s/|c|$. Hence,
\[
\nu\!\left(I - \frac{1}{c}A\right) = \max_{\|x\|_2=1}\left|\bar{x}^T\left(I - \frac{1}{c}A\right)x\right| = \max_{\|x\|_2=1}\left|1 - \frac{1}{c}\bar{x}^TAx\right| \le \frac{s}{|c|},
\]
which completes the proof. □

Now, we will have a look at a possible implementation for both GMRES and
MINRES.
We start by deriving an algorithm for GMRES and will then work out how the
algorithm can be changed for MINRES.
The implementation is based upon the idea to find an orthonormal basis for
K j (A, r0 ) and then to solve the minimisation problem within this basis.
We can use the Gram–Schmidt orthogonalisation method (see Lemma 1.9) to compute an orthogonal basis of $K_j(A, r_0)$. Recall that the idea of the Gram–Schmidt method is to compute orthogonal vectors successively, starting with a vector $v_1 \in \mathbb{R}^n$ with $\|v_1\|_2 = 1$. Next, provided that $\{v_1, \ldots, v_k\}$ are already orthonormal and that $v \notin \operatorname{span}\{v_1, \ldots, v_k\}$, the Gram–Schmidt procedure calculates
\begin{align*}
\tilde{v}_{k+1} &:= v - \sum_{\ell=1}^{k}\langle v, v_\ell\rangle_2\,v_\ell,\\
v_{k+1} &:= \tilde{v}_{k+1}/\|\tilde{v}_{k+1}\|_2.
\end{align*}

We will use this by setting $r_0 = b - Ax_0$ and $v_1 = r_0/\|r_0\|_2$ and then choose $v$ as $v = Av_k$. This gives
\begin{align}
\tilde{v}_{k+1} &= Av_k - \sum_{\ell=1}^{k}\langle Av_k, v_\ell\rangle_2\,v_\ell = Av_k - \sum_{\ell=1}^{k}h_{\ell k}v_\ell, \tag{6.21}\\
v_{k+1} &= \tilde{v}_{k+1}/\|\tilde{v}_{k+1}\|_2 = \tilde{v}_{k+1}/h_{k+1,k}, \tag{6.22}
\end{align}
where we have defined $h_{\ell k} = \langle Av_k, v_\ell\rangle_2$ for $1 \le \ell \le k$ and $h_{k+1,k} = \|\tilde{v}_{k+1}\|_2$.


This method is called the Arnoldi method and was introduced in [2]. We will
analyse it in this simple form. However, from a numerical point of view, a more
stable version can be derived as follows. Since v1 , . . . , vk are orthogonal, we


see that we have for arbitrary numbers $c_i$, $1 \le i \le k$, the equivalent formulation
\begin{align*}
\tilde{v}_{k+1} &= Av_k - \sum_{\ell=1}^{k}h_{\ell k}v_\ell
= Av_k - \sum_{\ell=1}^{k}\langle Av_k, v_\ell\rangle_2\,v_\ell - \sum_{\ell=1}^{k}\Big\langle\sum_{i=1}^{\ell-1}c_iv_i, v_\ell\Big\rangle_2\,v_\ell\\
&= Av_k - \sum_{\ell=1}^{k}\Big\langle Av_k - \sum_{i=1}^{\ell-1}c_iv_i, v_\ell\Big\rangle_2\,v_\ell.
\end{align*}
Choosing $c_i = \langle Av_k, v_i\rangle_2$ gives an algorithm which is in general not only much more stable than the basic algorithm but also can easily be implemented. This modified Arnoldi method is given in Algorithm 23.
modified Arnoldi method is given in Algorithm 23.

Algorithm 23: Arnoldi method (modified Gram–Schmidt)

Input : $A \in \mathbb{R}^{n\times n}$, $x_0 \in \mathbb{R}^n$.
Output: Orthonormal basis $\{v_1, \ldots, v_j\}$ of $K_j(A, r_0)$.
1 $r_0 := b - Ax_0$, $v_1 := r_0/\|r_0\|_2$
2 for $k = 1, 2, \ldots, j-1$ do
3   $w_k := Av_k$
4   for $\ell = 1$ to $k$ do
5     $h_{\ell k} := \langle w_k, v_\ell\rangle_2$
6     $w_k := w_k - h_{\ell k}v_\ell$
7   $h_{k+1,k} := \|w_k\|_2$
8   if $h_{k+1,k} = 0$ then break
9   else $v_{k+1} := w_k/h_{k+1,k}$
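A possible NumPy transcription of Algorithm 23 is sketched below (illustrative only; the function name and return convention are not from the text).

    import numpy as np

    def arnoldi(A, r0, j):
        """Sketch of Algorithm 23 (modified Gram-Schmidt Arnoldi).

        Returns V_{j+1} = [v_1, ..., v_{j+1}] with orthonormal columns and the
        (j+1) x j Hessenberg matrix H_j such that A V_j = V_{j+1} H_j.
        """
        n = len(r0)
        V = np.zeros((n, j + 1))
        H = np.zeros((j + 1, j))
        V[:, 0] = r0 / np.linalg.norm(r0)
        for k in range(j):
            w = A @ V[:, k]
            for l in range(k + 1):
                H[l, k] = w @ V[:, l]          # h_{l,k} = <w_k, v_l>_2
                w = w - H[l, k] * V[:, l]
            H[k + 1, k] = np.linalg.norm(w)
            if H[k + 1, k] == 0.0:             # premature termination
                return V[:, :k + 2], H[:k + 2, :k + 1]
            V[:, k + 1] = w / H[k + 1, k]
        return V, H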

Obviously, the Arnoldi method can produce an orthonormal basis only if it does not stop prematurely. Hence, for the time being, let us assume that always $w_k = \tilde{v}_{k+1} \ne 0$.
Lemma 6.32 If the Arnoldi method is well-defined up to step j then v1 , . . . , v j
form an orthonormal basis of K j (A, r0 ).
Proof It is easy to check that {vk } forms an orthonormal set. That the elements
actually form a basis for K j (A, r0 ) can be proven by induction on j. Obviously,
for j = 1 there is nothing to show. For the induction step note that the definition
gives v j+1 ∈ span{v1 , . . . , v j , Av j } and the induction hypothesis leads to v j+1 ∈
K j+1 (A, r0 ). Equality of the spaces follows on comparing the dimensions. 


To further analyse Arnoldi's method and to see how it can be used for an implementation of GMRES, we stick to the original formulation (6.21), (6.22). If we want to compute a basis for $K_j(A, r_0)$ we need to compute up to $j$ orthogonal vectors. As before, let us define $h_{\ell k} = \langle Av_k, v_\ell\rangle_2$ for $1 \le \ell \le k$ and $1 \le k \le j$ and $h_{k+1,k} = \|\tilde{v}_{k+1}\|_2$ for $1 \le k \le j$. Then, using (6.22), equation (6.21) can be rewritten as
\[
Av_k = \sum_{\ell=1}^{k+1}h_{\ell k}v_\ell. \tag{6.23}
\]
If we now set additionally $h_{\ell k} = 0$ for $1 \le k \le j-1$ and $k+2 \le \ell \le j+1$, this defines a non-square Hessenberg matrix
\[
H_j := (h_{\ell k}) = \begin{pmatrix}
h_{11} & h_{12} & \cdots & \cdots & h_{1j}\\
h_{21} & h_{22} & \cdots & \cdots & h_{2j}\\
 & h_{32} & \ddots & & \vdots\\
 & & \ddots & \ddots & \vdots\\
 & & & h_{j,j-1} & h_{jj}\\
 & & & & h_{j+1,j}
\end{pmatrix} \in \mathbb{R}^{(j+1)\times j}.
\]
If we further define the matrix $V_j = (v_1, \ldots, v_j) \in \mathbb{R}^{n\times j}$ then we can use (6.23) to derive
\[
AV_j = \left(Av_1, \ldots, Av_j\right) = \left(\sum_{\ell=1}^{2}h_{\ell 1}v_\ell, \sum_{\ell=1}^{3}h_{\ell 2}v_\ell, \ldots, \sum_{\ell=1}^{j+1}h_{\ell j}v_\ell\right),
\]
i.e. the matrix $AV_j \in \mathbb{R}^{n\times j}$ has entries
\[
(AV_j)_{ik} = \sum_{\ell=1}^{j+1}h_{\ell k}(v_\ell)_i = \sum_{\ell=1}^{j+1}h_{\ell k}(V_{j+1})_{i\ell} = (V_{j+1}H_j)_{ik}, \qquad 1 \le i \le n,\ 1 \le k \le j.
\]
Moreover, if $\tilde{H}_j \in \mathbb{R}^{j\times j}$ is the matrix $H_j$ without its last row then we can rewrite this as
\begin{align*}
(AV_j)_{ik} &= \sum_{\ell=1}^{j+1}h_{\ell k}(v_\ell)_i = \sum_{\ell=1}^{j}h_{\ell k}(v_\ell)_i + h_{j+1,k}(v_{j+1})_i\\
&= (V_j\tilde{H}_j)_{ik} + h_{j+1,k}(v_{j+1})_i = (V_j\tilde{H}_j)_{ik} + \delta_{kj}\|\tilde{v}_{j+1}\|_2(v_{j+1})_i\\
&= (V_j\tilde{H}_j)_{ik} + \delta_{kj}(\tilde{v}_{j+1})_i = (V_j\tilde{H}_j)_{ik} + (\tilde{v}_{j+1}e_j^T)_{ik},
\end{align*}


where we have particularly used that h j+1,k = 0 for k  j. Hence, we can


summarise this as
AV j = V j+1 H j = V j H̃ j + ṽ j+1 eTj . (6.24)
Lemma 6.33 Assume that $A$ is invertible. If the Arnoldi method stops with iteration $j$, then the solution $x^*$ of $Ax = b$ is contained in $x_0 + K_j(A, r_0)$.

Proof That the method stops with iteration $j$ means that $\tilde{v}_{j+1} = 0$. Hence, equation (6.24) implies $AV_j = V_j\tilde{H}_j$, which means that the space $K_j(A, r_0)$ is invariant under $A$. Next, since $A$ is invertible, we also must have that $\tilde{H}_j$ is invertible. For each $x \in x_0 + K_j(A, r_0)$ there is a vector $y \in \mathbb{R}^j$ with $x - x_0 = V_jy$. Moreover, we have $r_0 = \beta v_1 = \beta V_je_1$ with $\beta = \|r_0\|_2$. This leads to
\[
\|b - Ax\|_2 = \|r_0 - A(x - x_0)\|_2 = \|r_0 - AV_jy\|_2 = \|V_j(\beta e_1 - \tilde{H}_jy)\|_2 = \|\beta e_1 - \tilde{H}_jy\|_2,
\]
where we also have used the orthogonality of the columns of $V_j$. Since $\tilde{H}_j$ is invertible we can set $y = \beta\tilde{H}_j^{-1}e_1$, which leads to $x^* = x_0 + V_jy \in x_0 + K_j(A, r_0)$. □

Algorithm 24: GMRES($j$)

Input : $A \in \mathbb{R}^{n\times n}$, $b \in \mathbb{R}^n$, $x_0 \in \mathbb{R}^n$, $j \in \{1, \ldots, n\}$.
Output: Approximate solution of $Ax = b$.
1 $r_0 := b - Ax_0$, $\beta := \|r_0\|_2$, $v_1 := r_0/\beta$
2 for $k = 1, 2, \ldots, j$ do
3   Compute $v_{k+1}$ and $h_{\ell k}$ for $1 \le \ell \le k+1$ using Arnoldi's method
4 Compute $y_j$ as the solution of $\min_y\|\beta e_1 - H_jy\|_2$
5 $x_j = x_0 + V_jy_j$
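Combining the Arnoldi sketch above with a small least-squares solve gives a direct, unoptimised transcription of Algorithm 24 (a sketch only; it solves the least-squares problem with numpy.linalg.lstsq rather than the incremental QR update described below).

    import numpy as np

    def gmres_j(A, b, x0, j):
        """Sketch of Algorithm 24: GMRES(j) via Arnoldi plus dense least squares."""
        r0 = b - A @ x0
        beta = np.linalg.norm(r0)
        V, H = arnoldi(A, r0, j)           # V has m+1 columns, H is (m+1) x m
        m = H.shape[1]                     # may be smaller than j if Arnoldi stopped
        rhs = np.zeros(H.shape[0])
        rhs[0] = beta                      # beta * e_1
        y, *_ = np.linalg.lstsq(H, rhs, rcond=None)
        return x0 + V[:, :m] @ y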

The proof of Lemma 6.33 gives a constructive way of computing the iterations of the GMRES method. Equation (6.24) implies, with $\beta = \|r_0\|_2$ and $x = x_0 + V_jy \in x_0 + K_j(A, r_0)$, as before,
\[
b - Ax = r_0 - AV_jy = \beta V_{j+1}e_1 - V_{j+1}H_jy = V_{j+1}(\beta e_1 - H_jy).
\]
Hence, we see that $x_j = x_0 + V_jy_j$, where $y_j$ minimises $\|\beta e_1 - H_jy\|_2$. This leads to Algorithm 24.
The GMRES method can stop prematurely. The only possible reason for this is that the Arnoldi method stops prematurely, i.e. if $h_{k+1,k} = 0$ for a $k < j$. According to Lemma 6.33 this means that $x^* \in x_0 + K_k(A, r_0)$ and hence, by definition, $x_k = x^*$.


Remark 6.34 The solution of the least-squares problem $\min_y\|\beta e_1 - H_jy\|_2$ can be computed using a QR factorisation, see Section 3.8. This can be done in $O(j^2)$ time since $H_j$ is a Hessenberg matrix, see Corollary 5.38.

There is another consequence of this process, which we want to mention here,


though it is not related to solving linear systems but rather to computing the
eigenvalues of the matrix A. Recall from Section 5.5 that it is desirable to
reduce A to a Hessenberg matrix and then to compute the eigenvalues of this
Hessenberg matrix. The following remark is a simple consequence of (6.24).

Remark 6.35 If the Arnoldi method does not stop prematurely then, after n
steps, v1 , . . . , vn form an orthonormal basis of Rn and Vn = (v1 , . . . , vn ) trans-
forms A into the Hessenberg matrix H̃n = VnT AVn having the same eigenvalues
as A.

In the present form, the GMRES algorithm does not compute an approximate
solution in each step, i.e. for each Krylov space K j (A, r0 ). Instead, we first
compute the orthonormal basis of K j (A, r0 ) and after that the approximate so-
lution x j . Since one usually does not know beforehand which iteration has to
be computed to achieve a desired accuracy, one could compute iterations al-
ways after a fixed number of steps of the Arnoldi method. Another possibility
is to produce an iteration in each step of the Arnoldi method by computing
the QR factorisation of H j+1 , which is necessary for computing a new iteration
x j+1 , by exploiting the already-computed QR factorisation of H j . Finally, we
can use this repeated updating of the QR factorisation to compute only an ap-
proximate solution if a prescribed accuracy is achieved. We will describe this
now in more detail, covering also the solution of the least-squares problem.
The idea for this comes from the proofs of Corollary 5.38 or Theorem 5.36, where we have successively removed the sub-diagonal entries using $2\times 2$ Householder matrices. Instead of Householder matrices, we can of course also use Givens rotations. As usual, we describe the idea using full matrices of the form
\[
G_i := \begin{pmatrix}
I_{i-1} & & \\
 & \begin{matrix} c_i & s_i\\ -s_i & c_i\end{matrix} & \\
 & & I_{j-i}
\end{pmatrix}
= \begin{pmatrix}
I_{i-1} & & \\
 & \tilde{G}_i & \\
 & & I_{j-i}
\end{pmatrix} \in \mathbb{R}^{(j+1)\times(j+1)},
\]
where $I_{i-1}$ and $I_{j-i}$ are the identity matrices of size $(i-1)\times(i-1)$ and $(j-i)\times(j-i)$, respectively.
As described in the proof of Theorem 5.36, j such orthogonal matrices suffice


to reduce $H_j$ to upper triangular form
\[
G_jG_{j-1}\cdots G_1\begin{pmatrix}
h_{11} & h_{12} & \cdots & \cdots & h_{1j}\\
h_{21} & h_{22} & \cdots & \cdots & h_{2j}\\
 & h_{32} & \ddots & & \vdots\\
 & & \ddots & \ddots & \vdots\\
 & & & h_{j,j-1} & h_{jj}\\
 & & & & h_{j+1,j}
\end{pmatrix}
= \begin{pmatrix}
r_{11} & r_{12} & \cdots & \cdots & r_{1j}\\
0 & r_{22} & \cdots & \cdots & r_{2j}\\
 & \ddots & \ddots & & \vdots\\
 & & \ddots & \ddots & \vdots\\
 & & & & r_{jj}\\
 & & & & 0
\end{pmatrix}. \tag{6.25}
\]
From this, we can conclude two things. First of all, defining $Q_j^T := G_j\cdots G_1 \in \mathbb{R}^{(j+1)\times(j+1)}$ and $R_j := Q_j^TH_j \in \mathbb{R}^{(j+1)\times j}$ allows us to transform $\|\beta e_1 - H_jy\|_2$ as usual. To this end, let us define $q_j := Q_j^Te_1 \in \mathbb{R}^{j+1}$ and write $q_j = (\tilde{q}_j, q_{j+1})^T$ with $\tilde{q}_j \in \mathbb{R}^j$. Then, we have
\begin{align*}
\|\beta e_1 - H_jy\|_2^2 &= \|\beta e_1 - Q_jR_jy\|_2^2 = \|\beta Q_j^Te_1 - R_jy\|_2^2 = \|\beta q_j - R_jy\|_2^2\\
&= \|\beta\tilde{q}_j - \tilde{R}_jy\|_2^2 + |\beta q_{j+1}|^2,
\end{align*}
where $\tilde{R}_j \in \mathbb{R}^{j\times j}$ is $R_j$ without its last row. The solution $y_j$, which minimises the latter expression, is hence given by
\[
y_j := \beta\tilde{R}_j^{-1}\tilde{q}_j. \tag{6.26}
\]

Obviously, we can compute $q_j = Q_j^Te_1$ step by step by applying the rotations $\tilde{G}_j \in \mathbb{R}^{2\times 2}$ to the corresponding entries of the right-hand side. From this, it also immediately follows that the previously computed components do not change from now on. To be more precise, when going over from $q_j \in \mathbb{R}^{j+1}$ to $q_{j+1} \in \mathbb{R}^{j+2}$ we only have to apply the $2\times 2$ matrix $\tilde{G}_{j+1}$ to the last two components of $(q_j, 0)^T = (\tilde{q}_j, q_{j+1}, 0)^T$, i.e. we have
\[
q_{j+1} = \begin{pmatrix} I_j & \\ & \tilde{G}_{j+1}\end{pmatrix}\begin{pmatrix}\tilde{q}_j\\ q_{j+1}\\ 0\end{pmatrix} = \begin{pmatrix}\tilde{q}_j\\ *\\ *\end{pmatrix} =: \begin{pmatrix}\tilde{q}_{j+1}\\ q_{j+2}\end{pmatrix}. \tag{6.27}
\]

Next, assume that we have already computed the rotations G1 , . . . , G j and ap-
plied them to H j . Since H j is the upper left sub-matrix of H j+1 , this means that
we only have to apply G1 , . . . , G j (after adjusting their dimensions) to the last
column of H j+1 , to obtain


\[
G_jG_{j-1}\cdots G_1H_{j+1} = \begin{pmatrix}
r_{11} & r_{12} & \cdots & \cdots & r_{1j} & r_{1,j+1}\\
0 & r_{22} & \cdots & \cdots & r_{2j} & r_{2,j+1}\\
 & \ddots & \ddots & & \vdots & \vdots\\
 & & \ddots & \ddots & \vdots & \vdots\\
 & & & & r_{jj} & r_{j,j+1}\\
 & & & & 0 & d\\
 & & & & 0 & h
\end{pmatrix}. \tag{6.28}
\]
Here, the entry $h$ in the last row and column is simply given by $h = h_{j+2,j+1}$. Thus, besides these applications to the last column, we only need one additional rotation $G_{j+1}$ to eliminate this entry. This can be done by choosing
\[
c_{j+1} = \frac{|d|}{\sqrt{d^2 + h^2}}, \qquad s_{j+1} = c_{j+1}\frac{h}{d}
\]
if $d \ne 0$ and $c_{j+1} = 0$, $s_{j+1} = 1$ if $d = 0$. Taking this all into account, we can state the final version of the GMRES algorithm. In principle, after initialisation, which also contains the initialisation of $\xi = \beta e_1$, the method consists in each iteration of the following steps.
1. Compute vk+1 and hk using Arnoldi’s method.
2. Apply the rotations G1 , . . . , Gk−1 to the last column of H.
3. Compute the kth rotation to eliminate hk+1,k .
4. Apply the kth rotation to ξ and to the last column of H.
The iteration stops if the residual norm estimate |ξk+1 | becomes sufficiently
small. Algorithm 25 contains all these steps in detail.
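For concreteness, one such elimination step can be sketched as follows (illustrative only; the pair $(c, s)$ follows the formulas above, with the convention $c = 0$, $s = 1$ when $d = 0$).

    import numpy as np

    def givens(d, h):
        """Rotation (c, s) with [c s; -s c] @ (d, h)^T = (r, 0)^T, as used above."""
        if d == 0.0:
            return 0.0, 1.0
        c = abs(d) / np.sqrt(d * d + h * h)
        s = c * h / d
        return c, s

    c, s = givens(3.0, 4.0)
    print(c * 3.0 + s * 4.0, -s * 3.0 + c * 4.0)   # (5.0, 0.0) up to rounding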
Even in this form, the GMRES method suffers from two problems. First of
all, the computational cost increases from iteration to iteration since the cost
is still O(k2 ) in the kth iteration. Moreover, with increasing k, the memory
requirement grows. In each step we have to store the vectors v1 , . . . , vk ∈ Rn .
For this alone we need O(nk) space and that is even the case if the original
matrix A was sparse.
An often-suggested remedy to this problem is the restarted GMRES method,
which stops after a fixed number of orthogonalisation steps, computes an ap-
proximate solution and then restarts the algorithm using this approximate so-
lution as the new initialisation, see Algorithm 26.
As for the CG method it is easy to see that the iterations of the GMRES (and
also MINRES) method yield a decreasing sequence of residuals {rk }, this time,
however, in the 2 -norm. This remains obviously true for the restarted GMRES
method. Unfortunately, while we have (at least theoretical) convergence after

08:44:43, subject to the Cambridge


07
6.2 GMRES and MINRES 219

Algorithm 25: GMRES


Input : A ∈ Rn×n , b ∈ Rn , x0 ∈ Rn , > 0.
Output: Approximate solution of Ax = b.
1 r0 := b − Ax0 , β := r0 2 , v1 := r0 /β, ξ = (β, 0, . . . , 0)T
2 for k = 1, 2, . . . , do
3 wk := Avk
4 for  = 1 to k do
5 hk := wk , v 2
6 wk := wk − hk v
7 hk+1,k := wk 2
8 for i = 1, . . . , k − 1 do

hik ci si hik
9 :=
hi+1,k −si ci hi+1,k
10 γ := (h2kk + h2k+1,k )1/2 , ck := hkk /γ, sk := ck hk+1,k /hkk

ξk c k sk ξ k
11 :=
ξk+1 −sk ck 0
12 hkk := γ, hk+1,k := 0
13 if |ξk+1 | < then
14 for  = k, k − 1, . . . , 1 do 

15 y := h1 ξ − ki=+1 hi yi

16 xk := x0 + ki=1 yk vk
17 break
18 else
19 vk+1 := wk /hk+1,k

n steps for GMRES, this may not be true for the restarted GMRES. As a mat-
ter of fact, it can happen that the sequence {rk 2 } “levels out”, i.e. becomes
stationary with k → ∞. However, in the case of a positive definite but not
necessarily symmetric matrix A, the following result shows that the restarted
GMRES method converges. The matrix A is positive definite if xT Ax > 0 for
all x  0. Since xT Ax = xT AT x, we have that the smallest eigenvalue μ of the
symmetric matrix (A + AT )/2 is positive by Theorem 5.4:

1 xT (AT + A)x xT Ax
μ := min T
= min T > 0.
x0 2 x x x0 x x

Also, if x0 is a corresponding eigenvector to the eigenvalue μ of (A + AT )/2

08:44:43, subject to the Cambridge


07
220 Methods for Large Sparse Systems

Algorithm 26: GMRES restarted


Input : A ∈ Rn×n , b ∈ Rn , x0 ∈ Rn , j ∈ N, > 0.
Output: Approximate solution of Ax = b.
1 r0 := b − Ax0 , β := r0 2
2 while β > do
3 Compute the GMRES approximation x j using Algorithm 24
4 Set x0 := x j , r0 := b − Ax0 , β := r0 2

with x0 2 = 1 then


1 T
μ= x (A + AT )x0 = xT0 Ax0 ≤ Ax0 2 ≤ A2 .
2 0
Proposition 6.36 Let A ∈ Rn×n be positive definite. Let x j ∈ x0 + K j (A, r0 )
be the jth iteration of the GMRES method. Then, the residual r j = b − Ax j
satisfies
j/2
μ2
r j 2 ≤ 1 − 2 r0 2 ,
σ

where μ is the smallest eigenvalue of (A + AT )/2 and σ = A2 .

Proof From Theorem 6.24 we know that

r j 2 ≤ min P(A)2 r0 2 .


P∈π j
P(0)=1

If we now define p ∈ π1 by p(t) := 1 − αt with α ∈ R then we have P = p j ∈ π j


with P(0) = 1 such that P(A)2 = p j (A)2 ≤ p(A)2j gives an upper bound
on the minimum above. Next, we want to choose α in the definition of p to
have a bound on p(A)2 . To this end, note that, for a fixed x ∈ Rn \ {0},

p(A)x22 = x − αAx22 = x22 − 2αxT Ax + α2 Ax22

becomes minimal for α = xT Ax/Ax22 , see also the second part of Lemma
5.16 for a similar argument, with value
T 2 ⎡ ⎛ T ⎞2 ⎤
x Ax 2⎢
⎢⎢ ⎜⎜⎜ x Ax ⎟⎟⎟ x2 2 ⎥⎥⎥
p(A)x2 = x2 −
2 2 ⎢
= x2 ⎣⎢1 − ⎜⎝ ⎟⎠ ⎥⎥
Ax2 x22 Ax2 ⎦

μ2
≤ x22 1 − 2 ,
σ

08:44:43, subject to the Cambridge


07
6.2 GMRES and MINRES 221

where we have used that xT Ax/x22 ≥ μ and x2 /Ax2 ≥ 1/σ. This yields

μ2
p(A)22 ≤ 1 −
σ2
and hence the overall statement. 

As mentioned above, this means that the restarted GMRES method converges.

Corollary 6.37 If A ∈ Rn×n is positive definite then the restarted GMRES( j)


method converges for every j ≥ 1.

Further improvements of GMRES can be achieved by preconditioning, which


we will discuss later on, and by replacing the Gram–Schmidt process by a more
stable Householder variant. Details can, for example, be found in Saad [104].
To see how things change when going from GMRES to MINRES, we first note
that in the case of a symmetric matrix A, most entries computed in the Arnoldi
method are actually zero.

Lemma 6.38 Let A ∈ Rn×n be symmetric. Let {v } be the sequence of vectors
computed by the Arnoldi method and let hk = Avk , v 2 for 1 ≤  ≤ k and
hk+1,k = ṽk+1 2 . Then,

hk = 0 1 ≤  < k − 1,
hk,k+1 = hk+1,k .

Proof From the fact that span{v1 , . . . , v } = K (A, r0 ) we see that Av ∈
K+1 (A, r0 ) = span{v1 , . . . , v+1 }. Thus, using the symmetry of A yields

hk = Avk , v 2 = vk , Av 2 = 0

for all  + 1 < k. For the second property note that the orthogonality of ṽk+1 to
v1 , . . . , vk gives

1
hk,k+1 = Avk+1 , vk 2 = vk+1 , Avk 2 = ṽk+1 , Avk 2
ṽk+1 2
 
k 
1
= ṽk+1 , Avk − Avk , v 2 v = ṽk+1 2
ṽk+1 2 =1 2
= hk+1,k . 

Thus, the previously defined matrix H̃ j = (hk ) ∈ R j× j reduces to a symmetric


tridiagonal matrix with diagonal entries αk := hkk and sub- and super-diagonal

08:44:43, subject to the Cambridge


07
222 Methods for Large Sparse Systems

entries βk := hk,k+1 = hk+1,k :


⎛ ⎞
⎜⎜⎜α1 β1 ⎟⎟⎟
⎜⎜⎜ .. .. ⎟⎟⎟
⎜⎜⎜ β . . ⎟⎟⎟
H̃ j = ⎜⎜⎜⎜⎜ 1 ⎟⎟⎟ .
⎟⎟⎟
⎜⎜⎜ .. ..
⎜⎜⎝ . . β j−1 ⎟⎟⎟⎟

β j−1 αj

Taking this new notation into account, we see that the Arnoldi algorithm, Algo-
rithm 23, reduces to the algorithm given in Algorithm 27, which is the Lanczos
orthogonalisation method for symmetric matrices, see [89, 90].

Algorithm 27: Lanczos method


Input : A = AT ∈ Rn×n , x0 ∈ Rn , j ∈ {1, . . . , n}.
Output: Orthonormal basis {v1 , . . . , v j } of K j (A, r0 ).
1 r0 := b − Ax0 , v0 := 0, v1 := r0 /r0 2 , β0 := 0
2 for k = 1, 2, . . . , j − 1 do
3 wk := Avk − βk−1 vk−1
4 αk := wk , vk 2
5 wk := wk − αk vk
6 βk := wk 2
7 if βk = 0 then break
8 else vk+1 := wk /βk

It is interesting to note that we could derive the CG method from Lanczos’


method. We will not do this here, but later on, when we discuss biorthogo-
nalisation methods, we will derive a method similar to the CG method from a
two-sided or biorthogonal Lanczos method.
As in the non-symmetric case, we now also have an alternative method for
reducing A to a tridiagonal matrix having the same eigenvalues as A.

Remark 6.39 If the Lanczos method does not stop prematurely, the orthog-
onal matrix V = (v1 , . . . , vn ) transforms the symmetric matrix A into the tridi-
agonal matrix VnT AVn = H̃n having the same eigenvalues as A.

We will now work out the details of a possible implementation of MINRES.


As in the case of GMRES, we use Givens rotations to transform the matrix H j
to an upper triangular matrix R j . However, by Corollary 5.37, we know that
row i of R has at most non-zero entries rii , ri,i+1 and ri,i+2 . Hence, applying the

08:44:43, subject to the Cambridge


07
6.2 GMRES and MINRES 223

Givens rotations G1 to G j−1 to H j−1 leads to the j × ( j − 1) matrix


⎛ ⎞
⎜⎜⎜r11 r12 r13 0 · · · 0 ⎟⎟
⎜⎜⎜ ⎟
⎜⎜⎜ .. .. .. .. .. ⎟⎟⎟⎟
⎜⎜⎜ . . . . . ⎟⎟⎟⎟
⎜⎜⎜ .. .. .. ⎟⎟⎟
⎜⎜⎜
⎜⎜
. . . 0 ⎟⎟⎟⎟⎟
⎟⎟⎟
G j−1 · · · G1 H j−1 = ⎜⎜⎜⎜⎜ .. ..
. . r j−3, j−1 ⎟⎟⎟⎟⎟ .
⎜⎜⎜
⎜⎜⎜ .. ⎟⎟⎟⎟
⎜⎜⎜ . r j−2, j−1 ⎟⎟⎟⎟
⎜⎜⎜ ⎟⎟
⎜⎜⎜ r j−1, j−1 ⎟⎟⎟⎟
⎜⎝ ⎠
0
The next step is to carry this over to H j . This means we have to add the jth
column and and ( j + 1)st row and then apply G j−1 · · · G1 to the jth column.
However, since the jthe column of H j has only the entries β j−1 , α j and β j in
rows j − 1, j and j + 1, respectively, and since each Gi acts only on rows i and
i + 1, we see that we only have to apply G j−2 and then G j−1 to the new jth
column. This essentially means that we have to form

r j−2, j c j−2 s j−2 0
= ,
r̃ j−1, j −s j−2 c j−2 β j−1

r j−1, j c j−1 s j−1 r̃ j−1, j
= ,
r̃ j j −s j−1 c j−1 αj
which leads to the representation
⎛ ⎞
⎜⎜⎜r11 r12 r13 0 ··· 0 0 ⎟⎟⎟
⎜⎜⎜⎜ .. .. .. .. .. .. ⎟⎟⎟
⎟⎟⎟
⎜⎜⎜ . . . . . . ⎟⎟⎟
⎜⎜⎜ ⎟⎟⎟
⎜⎜⎜ .. .. .. .. ⎟⎟⎟
⎜⎜⎜ . . . 0 . ⎟⎟⎟
⎜⎜⎜ ⎟⎟
⎜ .. ..
G j−1 · · · G1 H j = ⎜⎜⎜⎜ . . r j−3, j−1 0 ⎟⎟⎟⎟ .
⎜⎜⎜ ⎟⎟⎟
⎜⎜⎜ .. ⎟
⎜⎜⎜ . r j−2, j−1 r j−2, j ⎟⎟⎟⎟
⎜⎜⎜ ⎟⎟
⎜⎜⎜ r j−1, j−1 r j−1, j ⎟⎟⎟⎟
⎜⎜⎜ ⎟
⎜⎜⎝ 0 r̃ j j ⎟⎟⎟⎟⎟

0 βj
Then, we have to determine a final rotation to remove the β j from the last
column and row, i.e. we determine c j and s j such that

c j s j r̃ j j r
= jj .
−s j c j β j 0

Obviously, we can choose c j = r̃ j j / r̃2j j + β2j and s j = β j / r̃2j j + β2j , but

08:44:43, subject to the Cambridge


07
224 Methods for Large Sparse Systems

see also the discussion about numerical stability in this context just before
Definition 5.23. We will see that we do not even have to save the numbers
r j−2, j , r j−1, j and r j j since we do not need them in the next iteration. Hence, we
can compute all relevant quantities by

γ0 := r̃ j j = −s j−1 r̃ j−1, j + c j−1 α j


= −s j−1 c j−2 β j−1 + c j−1 α j ,

γ1 := γ02 + β2j ,
γ2 := r j−1, j = c j−1 r̃ j−1, j + s j−1 α j
= s j−1 α j + c j−1 c j−2 β j−1 ,
γ3 := r j−2, j = s j−2 β j−1 ,
c j := γ0 /γ1 , s j := β j /γ1 ,

using the above considerations. Moreover, we also have r j j = c j r̃ j j + s j β j = γ1 .


As in the case of GMRES, we can write the iteration xk = x0 + Vk y, where
y ∈ Rk minimises βe1 − Hk y2 . However, for MINRES it is not necessary
to save the orthonormal basis, i.e. the matrix Vk . Thus we need a different
approach to actually compute the iteration xk . According to (6.26), we can
express the iteration as

xk = x0 + βVk R̃−1
k q̃k = x0 + βPk q̃k

if we define the matrix Pk := Vk R̃−1 k , where R̃k is again the upper k × k block
of the matrix Rk ∈ R (k+1)×k
, which is the upper triangular matrix from the QR
factorisation of Hk = Qk Rk . Since the upper k × k block H̃k of Hk is now a
tridiagonal matrix, the only entries ri j of Rk in row i which might be non-zero
are rii , ri,i+1 and ri,i+2 . We also know from our previous considerations that
when going over from Rk ∈ R(k+1)×k to Rk+1 ∈ R(k+2)×(k+1) we keep Rk as the
upper left part of Rk+1 . This altogether means that we can compute the columns
of Pk ∈ Rn×k iteratively. If we set Pk = (p1 , . . . , pk ) and Vk = (v1 , . . . , vk ) then
the relation Pk R̃k = Vk yields

v1 = r11 p1 , v2 = r12 p1 + r22 p2

and then

v j = r j−2, j p j−2 + r j−1, j p j−1 + r j j p j , 3 ≤ j ≤ k.

Noting that r j j  0 as long as the Lanczos method does not break down, this

08:44:43, subject to the Cambridge


07
6.2 GMRES and MINRES 225

can be rearranged and, using the notation above, rewritten as


1
pj = (v j − r j−1, j p j−1 − r j−2, j p j−2 ) (6.29)
rjj
1
= (v j − γ2 p j−1 − γ3 p j−2 ), 1 ≤ j ≤ k,
γ1
where we have set p0 = p−1 = 0 and recall that the constants γ1 , γ2 , γ3 have to
be computed in each iteration, i.e. for each j.
A consequence of these considerations is that we can update the iterations in
the following form. We have

k 
k−1
xk = x0 + βPk q̃k = x0 + β q j p j = x0 + β q j p j + βqk pk
j=1 j=1
= xk−1 + βqk pk . (6.30)
From this, we can also conclude that we do not need the whole vector q̃k
but only the scalar δ := βqk to update the iteration. Since βqk = βQTk e1 =
Gk · · · G1 (βe1 ) we compute with each application of a G j two new entries q j
and q̃ j+1 of which the first one is not changed by the following updates while
the latter one is changed only by the next update. This can be expressed by

qj q̃ j c j s j q̃ j c j q̃ j
= G̃ j = = ,
q̃ j+1 0 −s j c j 0 −s j q̃ j
which is started with q̃1 = β, which means that we are “absorbing” β into
qk . Hence, if we have computed δ := q̃k , then we need qk = ck δ to update
xk = xk−1 +ck δpk and we need to replace δ by −sk δ to prepare the next iteration.
This altogether shows that we can implement MINRES as given in Algorithm
28.
Let us make a few more remarks. First of all, as we know, the residual of the
kth iteration satisfies
b − Axk 22 = βe1 − H̃k yk 22 = βq̃k − R̃k yk 22 + |βqk+1 |2 = |βqk+1 |2
or, if we again absorb β into qk as described above then we simply have b −
Axk 2 = |qk+1 |. The algorithm described above shows that |qk+1 | = |ck+1 q̃k+1 | =
|ck+1 sk q̃k | is at hand at any time during the algorithm without any additional
cost, which can be used as a stopping criterion.
There is an alternative to MINRES in the case of symmetric matrices, which we
only want to mention here. The method is called SYMMLQ . Here the iteration
xk ∈ x0 + AKk (A, r0 ) is computed as the minimiser of x∗ − x2 over x0 +
AKk (A, r0 ), where x∗ is again the solution to Ax = b. Details can be found in
Paige and Saunders [99].

08:44:43, subject to the Cambridge


07
226 Methods for Large Sparse Systems

Algorithm 28: MINRES


Input : A ∈ Rn×n symmetric, b ∈ Rn , x0 ∈ Rn .
Output: Approximate solution of Ax = b.
1 β0 := 0, s−1 = s0 = 0, c−1 = c0 = 1
2 p−1 := p0 := v0 = 0
3 r0 := b − Ax0 , δ := r0 2 , v1 := r0 /δ
4 for k = 1, 2, . . . , do
5 wk := Avk − βk−1 vk−1
6 αk := wk , vk 2
7 wk := wk − αk vk
8 βk := wk 2
9 if βk = 0 then stop
10 γ0 := ck−1 αk − sk−1 ck−2 βk−1

11 γ1 := γ02 + β2k
12 γ2 := sk−1 αk + ck−1 ck−2 βk−1
13 γ3 := sk−2 βk−1
14 ck := γ0 /γ1 sk := βk /γ1
15 pk := [vk − γ2 pk−1 − γ3 pk−2 ]/γ1
16 xk := xk−1 + ck δpk
17 δ := −sk δ
18 vk+1 := wk /βk

6.3 Biorthogonalisation Methods


For symmetric, sparse matrices A ∈ Rn×n , the Krylov subspace method of
choice would be MINRES if the matrix is indefinite and either CG or MINRES
if the matrix is definite.
For non-symmetric matrices, GMRES often works well but has the disadvan-
tage of an increasing memory and time requirement. Moreover, though the
restarted GMRES avoids this increase in required resources, it might not con-
verge but only level out. Hence, in the case of non-symmetric matrices it makes
sense to look for alternative Krylov subspace methods.
Recall that both MINRES and GMRES were based upon an orthogonalisa-
tion of the Krylov space Kk (A, r0 ). In the case of MINRES this was given by
Lanczos’ method, which essentially boils down to the three-term recursion

ṽ j+1 = Av j − α j v j − β j−1 v j−1 ,


v j+1 = ṽ j+1 /ṽ j+1 2

08:44:43, subject to the Cambridge


07
6.3 Biorthogonalisation Methods 227

and leads to orthonormal vectors v1 , . . . , vk ∈ Rn , which form a basis of the


Krylov space Kk (A, r0 ). In the case of GMRES the construction of an orthonor-
mal basis v1 , . . . , vk of Kk (A, r0 ) was based upon Arnoldi’s method and could
not be reduced to such a simple three-term recursion.
This naturally raises the question of whether it is in general possible to find
a short recurrence relation for computing an orthonormal basis of the Krylov
space of an arbitrary matrix A ∈ Rn×n . The answer, in general, is negative as
shown in [49] by Faber and Manteuffel.
Hence, to have a memory-efficient orthogonalisation procedure also in the case
of a non-symmetric matrix, we will now relax the idea of orthonormality.
Instead of having a set of orthonormal basis vectors v1 , . . . , vk we will con-
struct two sets v1 , . . . , vk and w1 , . . . , wk which are biorthogonal, i.e. satisfy
vi , w j 2 = 0 for all i  j and which are bases for

Kk (A, r0 ) = span{r0 , Ar0 , . . . Ak−1 r0 },


Kk (AT , r̂0 ) = span{r̂0 , AT r̂0 , . . . , (AT )k−1 r̂0 },

respectively.
The corresponding procedure is the Lanczos biorthogonalisation method, also
called the two-sided Lanczos method, and is given in detail in Algorithm 29.

Algorithm 29: Two-sided Lanczos method


Input : A ∈ Rn×n , r0 , r̂0 ∈ Rn with r0 , r̂0 2  0.
Output: Biorthogonal bases of Kk (A, r0 ) and Kk (A, r̂0 ).
1 v1 = r0 /r0 2 , w1 = r̂0 / r̂0 , v1 2 , β0 := γ0 := 0, v0 = w0 = 0.
2 for j = 1, 2, . . . , k do
3 α j := Av j , w j 2
4 ṽ j+1 := Av j − α j v j − β j−1 v j−1
5 w̃ j+1 := AT w j − α j w j − γ j−1 w j−1
6 γ j := ṽ j+1 2
7 v j+1 := ṽ j+1 /γ j
8 β j := v j+1 , w̃ j+1 2
9 w j+1 := w̃ j+1 /β j

The two-sided Lanczos method in Algorithm 29 requires two matrix–vector


products per iteration, one with A and one with AT . It does indeed compute
biorthogonal vectors, as long as it does not stop prematurely. This and further
properties are given in the next proposition, which will help us to define Krylov
subspace methods based upon this biorthogonalisation technique.

08:44:43, subject to the Cambridge


07
228 Methods for Large Sparse Systems

Proposition 6.40 Assume that the two-sided Lanczos method does not termi-
nate prematurely. Then, we have v j 2 = 1 and v j , w j 2 = 1 for 1 ≤ j ≤ k
and

vi , w j 2 = 0, 1 ≤ i  j ≤ k. (6.31)

Furthermore, we have

Kk (A, r0 ) = span{v1 , . . . , vk }, and Kk (AT , r̂0 ) = span{w1 , . . . , wk }.

Finally, if Vk := (v1 , . . . , vk ) ∈ Rn×k and Wk := (w1 , . . . , wk ) ∈ Rn×k then

AVk = Vk T k + γk vk+1 eTk , (6.32)


A Wk =
T
Wk T kT + βk wk+1 eTk , (6.33)
WkT AVk = Tk , (6.34)

where T k ∈ Rk×k is the tridiagonal matrix


⎛ ⎞
⎜⎜⎜α1 β1 ⎟⎟⎟
⎜⎜⎜⎜ γ α2 β2
⎟⎟⎟
⎟⎟⎟
⎜⎜⎜ 1 ⎟⎟⎟
⎜⎜ .. .. ..
T k := ⎜⎜⎜⎜⎜ . . . ⎟⎟⎟ .
⎟⎟⎟ (6.35)
⎜⎜⎜ .. .. ⎟⎟
⎜⎜⎜ . . βk−1 ⎟⎟⎟⎟⎟
⎜⎜⎝ ⎠
γk−1 αk

Proof We will show the first three properties by induction on k. For k = 1 we


have v1 2 = 1 and v1 , w1 2 = v1 , r̂0 2 / v1 , r̂0 2 = 1 by definition and there is
nothing to show for (6.31). Let us now assume that we have biorthogonal vec-
tors v1 , . . . , vk and w1 , . . . , wk normalised in the stated way. Then, obviously,
we also have vk+1 2 = 1 and vk+1 , wk+1 2 = vk+1 , w̃k+1 2 /βk = 1. Hence, it
remains to show that vk+1 is orthogonal to w1 , . . . , wk and that wk+1 is orthog-
onal to v1 , . . . , vk . Since these two claims are verified in very similar ways, we
will show the reasoning just for the first property. Here, we start with

1 & '
vk+1 , wk 2 = Avk , wk 2 − αk vk , wk 2 − βk−1 vk−1 , wk 2
γk
1
= [ Avk , wk 2 − Avk , wk 2 vk , wk 2 ] = 0,
γk

using the orthogonality of vk−1 to wk , the definition of αk and the fact that
vk , wk 2 = 1.

08:44:43, subject to the Cambridge


07
6.3 Biorthogonalisation Methods 229

Next, let j < k. Then we have


1 !
vk+1 , w j 2 = Avk , w j 2 − αk vk , w j 2 − βk−1 vk−1 , w j 2
γk
1 !
= vk , AT w j 2 − βk−1 vk−1 , w j 2
γk
1 !
= vk , β j w j+1 + α j w j + γ j−1 w j−1 2 − βk−1 vk−1 , w j 2
γk
1 !
= β j vk , w j+1 2 − βk−1 vk−1 , w j 2 ,
γk
using the orthogonality relation that we have by the induction hypothesis and
the definition of w j+1 . Now, if even j < k − 1 then the last two remaining terms
on the right-hand side also vanish, which shows orthogonality in this case. If,
however, j = k − 1, this becomes
1 & '
vk+1 , w j 2 = βk−1 vk , wk 2 − βk−1 vk−1 , wk−1 2 = 0
γk
since vk , wk 2 = vk−1 , wk−1 2 = 1. This establishes the biorthogonality rela-
tion (6.31). Equation (6.32) follows from
 
AVk = Av1 , . . . , Av j , . . . , Avk
 
= α1 v1 + γ1 v2 , . . . , β j−1 v j−1 + α j v j + γ j v j+1 , . . . , βk−1 vk−1 + αk vk + γk vk+1
= Vk T k + γk vk+1 eTk ,

and equation (6.33) can be proven in the same way. Finally, equation (6.34)
follows from (6.32) and the fact that WkT Vk = I and WkT vk+1 = 0. 

To simplify the presentation of the algorithm, we have left out any test for
division by zero. However, it can happen that either γ j or β j become zero and
in this case the corresponding vectors v j+1 or w j+1 become undefined and the
algorithm in this form will stop prematurely.
Such a premature ending can happen for various reasons. First of all, we could
have that either ṽ j+1 = 0 or w̃ j+1 = 0. These cases are usually referred to as
regular termination. The Lanczos process has found an invariant subspace. If
ṽ j+1 = 0 then the vectors v1 , . . . , v j , which have been created so far, form an
A-invariant space. If w̃ j+1 = 0 then the vectors w1 , . . . , w j form an AT -invariant
space. More problematic is the situation when both vectors ṽ j+1 and w̃ j+1 are
different from the zero vector but are orthogonal, i.e. satisfy ṽ j+1 , w̃ j+1 2 = 0.
This situation is called a serious breakdown. It means that there are no vec-
tors v j+1 ∈ K j+1 (A, r0 ) and w j+1 ∈ K j+1 (AT , r̂0 ) with v j+1 , wi 2 = 0 and
vi , w j+1 2 = 0 for all 1 ≤ i ≤ j. However, this does not necessarily mean that

08:44:43, subject to the Cambridge


07
230 Methods for Large Sparse Systems

later on, say for k > j + 1, there are such vectors vk ∈ Kk (A, r0 ) orthogonal
to Kk−1 (AT , r̂0 ) and wk ∈ Kk (AT , r̂0 ) orthogonal to Kk−1 (A, r0 ). In this case,
a remedy to the problem is to simply skip those steps in the algorithm which
lead to undefined Lanczos vectors and to continue with those steps, where
these vectors are well-defined. Such a method is called a look-ahead Lanczos
method.
After having a procedure for computing the biorthogonal bases for the Krylov
spaces Kk (A, r0 ) and Kk (AT , r̂0 ) we have to address the question of how to
compute a possible approximation xk ∈ Kk (A, r0 ). Naturally, we will write
xk = x0 + Vk yk
and could, for example, determine yk such that the residual rk = b − Axk =
r0 − AVk yk is orthogonal to Kk (AT , r̂0 ), i.e.
0 = WkT rk = WkT r0 − WkT AVk yk = r0 2 e1 − T k yk ,
where we have also used that v1 = r0 /r0 2 and v1 , w1 2 = 1. Thus, if the
tridiagonal matrix T k is invertible, we have a unique solution yk ∈ Rk , which
then defines an iteration xk ∈ Rn .
Definition 6.41 Under the assumption that the Lanczos biorthogonalisation
process is defined and the tridiagonal matrix T k is invertible, the biconjugate
gradient (BiCG) method computes the iteration xk ∈ x0 + Kk (A, r0 ) as
xk = x0 + Vk T k−1 (βe1 ),
where β = r0 2 .
Note, however, that we have encountered another possible situation, where the
iteration xk might not be well-defined, namely if the matrix T k is not invertible.
This situation is different from the previously described breakdowns of the bi-
Lanczos process.
To derive an algorithmic description of the BiCG method, we will now derive
characteristic properties, which can then be used for stating the algorithm.
Let us assume that the tridiagonal matrix T k ∈ Rk×k from (6.35) has an LU-
factorisation T k = Lk Uk . According to Theorem 3.16 the factors have the form
⎛ ⎞ ⎛ ⎞
⎜⎜⎜ 1 0⎟⎟⎟ ⎜⎜⎜u1 β1 0 ⎟⎟
⎟⎟⎟
⎜⎜⎜ ⎟⎟⎟ ⎜
⎜⎜⎜ .. ⎟⎟⎟
⎜⎜⎜2 1 ⎟⎟⎟ ⎜⎜⎜ u . ⎟⎟⎟
Lk = ⎜⎜⎜ ⎜ .. .. ⎟
⎟⎟⎟⎟ , Uk = ⎜⎜⎜ ⎜ 2
⎟⎟ ,
⎜⎜⎜ . . ⎟⎟⎟ ⎜
⎜⎜⎜ . . . β ⎟⎟⎟⎟
⎜⎝ ⎠ ⎜⎝ k−1 ⎟⎟⎠
0 k 1 0 uk
where the components ui and i are defined recursively as u1 = α1 and i =

08:44:43, subject to the Cambridge


07
6.3 Biorthogonalisation Methods 231

γi−1 /ui−1 and ui = αi − i βi−1 for 2 ≤ i ≤ k. Defining Pk := Vk Uk−1 and denoting


the columns of Pk as Pk = (p0 , . . . , pk−1 ), we can rewrite the iteration xk in an
update form as

xk = x0 + Vk T k−1 (βe1 ) = x0 + Vk Uk−1 Lk−1 (βe1 )


k−1
= x0 + Pk ak = x0 + a jp j
j=0
= xk−1 + ak−1 pk−1

with ak = (a0 , . . . , ak−1 )T := Lk−1 (βe1 ). For the residual rk = b − Axk we have
on the one hand

rk = b − Axk−1 − ak−1 Apk−1 = rk−1 − ak−1 Apk−1

and on the other hand, using (6.32) and the fact that eTk T k−1 e1 ∈ R,

rk = b − Ax0 − AVk T k−1 (βe1 ) = r0 − (Vk T k + γk vk+1 eTk )T k−1 (βe1 )


= r0 − r0 2 v1 − δk vk+1 = δk vk+1

with a certain constant δk ∈ R. This means that the residual rk is always parallel
to vk+1 and hence orthogonal to w1 , . . . , wk . Next, from Pk = Vk Uk−1 or Pk Uk =
Vk we find vk = βk−1 pk−2 + uk pk−1 and hence

1 1 1
pk−1 = (vk − βk−1 pk−2 ) = rk−1 − βk−1 pk−2 , (6.36)
uk uk δk−1
i.e. we can reconstruct pk recursively from pk−1 and rk . Finally, if we define
P̂k = (p̂0 , . . . , p̂k−1 ) := Wk Lk−T , then we see that

P̂Tk APk = Lk−1 WkT AVk Uk−1 = Lk−1 T k Uk−1 = Lk−1 Lk = I,

i.e. the vectors p j and p̂i are A-orthogonal.


Finally, the vectors generated by the biorthogonal Lanczos method can also be
used to solve the adjoint linear system AT x = b̂, if r̂0 is defined as r̂0 := b̂−Ax̂0 .
As a matter of fact, we can define a sequence of approximations and residuals

x̂k := x̂0 + Wk zk ,
r̂k := b̂ − AT x̂k = r̂0 − AT Wk zk ,

where zk ∈ Rk is determined by the orthogonality condition

0 = VkT r̂k = VkT r̂0 − VkT AT Wk zk = r̂0 , v1 2 e1 − T kT zk ,

i.e. zk = T k−T (β̂e1 ) with β̂ := r̂0 , v1 2 . Using this and the above definition

08:44:43, subject to the Cambridge


07
232 Methods for Large Sparse Systems

P̂k = Wk Lk−T , we can rewrite the iterations as


x̂k = x̂0 + Wk T k−T (β̂e1 ) = x̂0 + Wk Lk−T Uk−T (β̂e1 )
= x̂0 + P̂k âk = x̂k−1 + âk−1 p̂k−1 ,
where we have set âk = (â0 , . . . , âk−1 )T := Uk−T (β̂e1 ).
Using (6.33) and the definition of w1 gives again that r̂k and wk+1 are parallel:
 
r̂k = r̂0 − AT Wk zk = r̂0 − Wk T kT + βk wk+1 eTk zk
 
= r̂0 − Wk T kT + βk wk+1 eTk T k−T (β̂e1 )
= r̂0 − Wk (β̂e1 ) − βk β̂(eTk T k−T e1 )wk+1
=: δ̂k wk+1 .
From the relation P̂k LkT = Wk we can also conclude that wk = k p̂k−2 + p̂k−1 or
1
p̂k−1 = wk − k p̂k−2 = r̂k−1 − k p̂k−2 . (6.37)
δ̂k−1
So far, we have update formulas for all vectors. However, we can simplify the
update formulas for the “search directions” pk and p̂k if we rescale them. To
be more precise, if we replace pk by uk+1 δk pk , then (6.36) shows that the new
pk satisfies a recursion of the form pk−1 = rk−1 − β̃k−1 pk−2 . In a similar way, we
can rescale p̂k such that the coefficient in front of r̂k−1 in the update formula
(6.37) can be chosen to be one. Of course, this also changes the coefficients
in the other update formulas and the normalisation, but the general form and
the biorthogonality will not be modified. In conclusion we have the following
result.
Proposition 6.42 If the Lanczos biorthogonalisation process does not stop
prematurely and if the matrix T k is invertible, the process generates sequences
of vectors satisfying the relations
x j+1 = x j + α j p j , x̂ j+1 = x̂ j + α̂ j p̂ j ,
r j+1 = r j − α j Ap j , r̂ j+1 = r̂ j − α̂ j AT p̂ j ,
p j+1 = r j+1 + β j+1 p j , p̂ j+1 = r̂ j+1 + β̂ j+1 p̂ j
with certain constants α j , β j+1 , α̂ j , β̂ j+1 ∈ R. Moreover, we have the biorthogo-
nalities
r j , r̂i 2 = 0, Ap j , p̂i 2 = 0, i  j. (6.38)
Finally, we have the alternative bases
K j (A, r0 ) = span{r0 , . . . , r j−1 } = span{p0 , . . . , p j−1 },
K j (AT , r̂0 ) = span{r̂0 , . . . , r̂ j−1 } = span{p̂0 , . . . , p̂ j−1 }.

08:44:43, subject to the Cambridge


07
6.3 Biorthogonalisation Methods 233

Proof The recursion formulas and the biorthogonalities (6.38) follow directly
from the derivation above. Hence, the only things that remain to be proven are
the identities regarding the Krylov spaces at the end of the proposition. This,
however, follows from the recurrence relations for r j+1 and p j+1 on the one
hand and for r̂ j+1 and p̂ j+1 on the other hand, once we have established that
p0 and r0 (and p̂0 and r̂0 ) are parallel. This can be seen as follows. Taking the
above-mentioned rescaling into account, p0 is given by
p0 = u1 δ0 Vk Uk−1 e1 .
Since Uk is an upper triangular matrix, so is Uk−1 and we can conclude that
Uk−1 e1 = (1/u1 )e1 and hence that p0 = δ0 Vk e1 = δ0 v1 = r0 . 
We can also start with the recursion formulas above and use the biorthogonal-
ities (6.38) to determine the coefficients in Proposition 6.42. First of all, from
p̂ j = r̂ j + β̂ j p̂ j−1 we can conclude that
Ap j , r̂ j 2 = Ap j , p̂ j − β̂ j p̂ j−1 2 = Ap j , p̂ j 2 .
Then, from r j+1 = r j − α j Ap j it follows that
0 = r j+1 , r̂ j 2 = r j , r̂ j 2 − α j Ap j , r̂ j 2 ,
i.e. the coefficient α j must have the form
r j , r̂ j 2 r j , r̂ j 2
αj = = .
Ap j , r̂ j 2 Ap j , p̂ j 2
In the same way we can conclude that α̂ j must have the form
r j , r̂ j 2 r j , r̂ j 2 r j , r̂ j 2
α̂ j = = = = α j.
r j , AT p̂ j 2 p j , AT p̂ j 2 Ap j , p̂ j 2
From p j+1 = r j+1 + β j+1 p j we have
0 = p j+1 , AT p̂ j 2 = r j+1 , AT p̂ j 2 + β j+1 p j , AT p̂ j 2
which resolves to
r j+1 , AT p̂ j 2
β j+1 = − .
p j , AT p̂ j 2
Taking α j = α̂ j and AT p̂ j = −(r̂ j+1 − r̂ j )/α̂ j into account, we can simplify the
formula for β j+1 to
r j+1 , AT p̂ j 2 1 r j+1 , r̂ j+1 − r̂ j 2 r j+1 , r̂ j+1 2
β j+1 = − = = .
p j , A p̂ j 2
T α j p j , AT p̂ j 2 r j , r̂ j 2
The symmetry of the argument also shows that we once again have β̂ j+1 = β j+1 .

08:44:43, subject to the Cambridge


07
234 Methods for Large Sparse Systems

Thus, we have determined all the coefficients and the final algorithm for the
BiCG method is stated in Algorithm 30.

Algorithm 30: BiCG method


Input : A ∈ Rn×n , b ∈ Rn .
Output: Approximate solution of Ax = b.
1 Choose x0 ∈ Rn .
2 Set r0 := p0 = b − Ax0 and j := 0.
3 Choose p̂0 = r̂0 such that r0 , r̂0 2  0
4 while r j 2 > do
5 α j := r j , r̂ j 2 / Ap j , p̂ j 2
6 x j+1 := x j + α j p j
7 r j+1 := r j − α j Ap j
8 r̂ j+1 := r̂ j − α j AT p̂ j
9 β j+1 := r j+1 , r̂ j+1 2 / r j , r̂ j 2
10 p j+1 := r j+1 + β j+1 p j
11 p̂ j+1 := r̂ j+1 + β j+1 p̂ j
12 j := j + 1

Technically, we must check whether these necessary conditions to compute


the coefficients indeed lead to sequences of vectors with the desired properties.
This can, however, be done as in the proof of Theorem 6.9.
When choosing r̂0 = r0 , the BiCG algorithm reduces for a symmetric matrix
A = AT to the standard CG algorithm (Algorithm 22), though all vectors are
computed twice, since in this case we obviously have x j = x̂ j , r j = r̂ j and
p j = p̂ j . If the matrix is in addition positive definite then we will also have
convergence.
As mentioned above, the BiCG algorithm can also be used to solve a system
of the form AT x = b̂ simultaneously. To this end, we only have to define r̂0 :=
b̂ − AT x̂0 and compute x̂ j+1 = x̂ j + α j p̂ j after step 6 of Algorithm 30.
As mentioned above, the BiCG method can fail to produce a new iteration if
the two-sided Lanczos process terminates early or if the matrix T k is not invert-
ible. The latter can be dealt with by changing the actual idea of computing the
iteration to a more GMRES-like method without the drawback of increasing
memory requirements.
Recall from (6.32) that we have the relation

Tk
AVk = Vk T k + γk vk+1 eTk = Vk+1 Hk , Hk = ∈ R(k+1)×k , (6.39)
γk eTk

08:44:43, subject to the Cambridge


07
6.3 Biorthogonalisation Methods 235

which corresponds to the GMRES representation (6.24). Thus, as before, we


can write our iteration as xk = x0 + Vk y with a corresponding residual

rk := b−Axk = b−A(x0 +Vk y) = r0 −AVk y = βv1 −Vk+1 Hk y = Vk+1 (βe1 −Hk y),

where we have used that v1 = r0 /r0 2 . Minimising the 2 -norm of rk would


be equivalent to minimising the 2 -norm of βe1 − Hk y if the columns of Vk+1
were orthonormal. In general this will not be the case but it is still a possible
option to just find a minimiser for βe1 − Hk y2 , particularly since T k and hence
Hk have a tridiagonal structure.

Definition 6.43 Under the assumption that the Lanczos biorthogonalisation


process is defined, the quasi-minimal residual (QMR) method computes the
iteration xk ∈ x0 + Kk (A, r0 ) as xk = x0 + Vk yk , where yk ∈ Rk solve

min βe1 − Hk y2 ,


y∈Rk

with β = r0 2 .

In an algorithmic description of the QMR method, we can proceed as in the


case of the MINRES method and avoid the need to save the Lanczos vectors.
As for MINRES (see equation (6.25)), we use successive Givens rotations to
transform Hk into an upper triangular matrix. However, this time Hk ∈ R(k+1)×k
is a tridiagonal matrix and thus the transformed matrix has the band structure
⎛ ⎞ ⎛⎜r11 r12 r13 ⎞
⎜⎜⎜α1 β1 ⎟⎟⎟ ⎜⎜⎜ ⎟⎟⎟
⎟⎟⎟
⎜⎜⎜⎜ γ1 α2 β2 ⎟⎟⎟⎟ ⎜⎜⎜⎜ .. ..
. . ⎟⎟⎟
⎜⎜⎜ ⎟⎟⎟ ⎜⎜⎜ r22 ⎟⎟⎟
⎜⎜⎜⎜ .. .. .. ⎟⎟ ⎜⎜⎜
⎟ . . ⎟⎟⎟
⎜ . . . ⎟⎟⎟ ⎜⎜ ⎜ .. .. r ⎟⎟
Gk Gk−1 · · · G1 ⎜⎜⎜⎜ ⎟⎟⎟ = ⎜⎜⎜ k−2,k ⎟
⎜⎜⎜ .. .. ⎟⎟⎟ ⎜⎜⎜ ⎟⎟⎟⎟ .
. . βk−1 ⎟⎟ ⎜⎜ .. ⎟⎟⎟
⎜⎜⎜⎜ ⎟⎟⎟ ⎜⎜⎜ . r k−1,k ⎟ ⎟⎟⎟
⎜⎜⎜ γk−1 αk ⎟⎟⎟ ⎜⎜⎜ ⎜
⎝⎜ ⎠ ⎜⎝ rkk ⎟⎟⎟⎟
γk ⎠
0

As in the case of MINRES, we can now easily compute the solution of the
minimisation problem and also update the QR factorisation. If we define again
QTk := Gk · · · G1 and Rk := QTk Hk ∈ R(k+1)×k then we can transform βe1 −Hk y2
as before. Recalling the definitions qk := QTk ek ∈ Rk+1 , qk = (q̃k , qk+1 )T with
q̃k ∈ Rk and R̃k ∈ Rk×k as the upper k × k block of Rk , equation (6.26) shows
that the minimiser of βe1 − Hk y2 is given by

yk := βR̃−1
k q̃k .

Furthermore, when using the rotations G j appropriately, we see that (6.28)

08:44:43, subject to the Cambridge


07
236 Methods for Large Sparse Systems

becomes
⎛ ⎞
⎜⎜⎜r11 r12 r13 ⎟⎟⎟
⎜⎜⎜⎜ .. .. ⎟⎟⎟
⎟⎟⎟
⎜⎜⎜ r22 . . ⎟⎟⎟
⎜⎜⎜ ⎟⎟⎟
⎜⎜⎜ .. .. ⎟⎟⎟
⎜⎜⎜ . . rk−2,k ⎟⎟⎟
Gk Gk−1 · · · G1 Hk+1 = ⎜⎜⎜⎜⎜ .. ⎟⎟ ,
⎜⎜⎜ . rk−1,k rk−1,k+1 ⎟⎟⎟⎟
⎜⎜⎜ ⎟⎟
⎜⎜⎜ rkk rk,k+1 ⎟⎟⎟⎟
⎜⎜⎜ ⎟
⎜⎜⎜ 0 d ⎟⎟⎟⎟⎟
⎝ ⎠
0 h

from which we can again determine the next iteration Gk+1 , as before. Setting
finally again Pk = (p1 , . . . , pk ) = Vk R̃−1
k we can compute the iterations via

xk = x0 + Vk yk = x0 + βPk q̃k = xk−1 + βqk pk ,

see equation (6.30) for details. Taking this all into account, gives the final al-
gorithm for QMR depicted in Algorithm 31, where the first 10 lines cover the
two-sided Lanczos method.
There is an obvious connection between QMR and GMRES/MINRES. To see
this, recall that the iterations of GMRES xGk are defined in such a way that the
residual rGk = b − AxGk satisfies

rGk 2 = min{b − Ax2 : x ∈ x0 + Kk (A, r0 )}.

As before, we can use actually any basis v1 , . . . , vk of Kk (A, r0 ) to express


xGk as xGk = x0 + Vk yGk with yGk ∈ Rk . If we use the basis constructed by the
bi-Lanczos process, this leads to

rGk 2 = b − Ax0 − AVk yGk 2 = Vk+1 (βe1 − Hk yGk )2


≥ σk+1 βe1 − Hk yGk 2 ,

using (6.39) and Corollary 2.20 with the matrix Vk+1 ∈ Rn×(k+1) having full
rank k + 1 and smallest singular value σk+1 > 0. Moreover, we can bound the
residual of the QMR method by

rkQ 2 = Vk+1 (βe1 − Hk ykQ )2 ≤ Vk+1 2 βe1 − Hk ykQ 2


σ1 G
≤ Vk+1 2 βe1 − Hk yGk 2 ≤ r 2 ,
σk+1 k
using also that Vk+1 2 = σ1 , the largest singular value of Vk+1 . This means
that the residual of QMR does not deviate too much from the optimal residual
of GMRES as long as Vk remains well-conditioned.

08:44:43, subject to the Cambridge


07
6.3 Biorthogonalisation Methods 237

Algorithm 31: QMR (quasi-minimum residual)


Input : A ∈ Rn×n , b ∈ Rn , x0 ∈ Rn .
Output: Approximate solution of Ax = b.
1 r0 := b − Ax0 , β := r0 2 , v1 := r0 /β, ξ = (β, 0, . . . , 0)T
2 Choose r̂0 with r0 , r̂0 2  0 and set w1 := r̂0 / r0 , r̂0 2
3 for k = 1, 2, . . . , do
4 hkk := Avk , wk 2
5 ṽk+1 := Avk − hkk vk − hk−1,k vk−1 (undefined terms are zero)
6 w̃k+1 := AT wk − hkk wk − hk,k−1 wk−1 (undefined terms are zero)
7 hk+1,k := ṽk+1 2
8 vk+1 := ṽk+1 /hk+1,k
9 hk,k+1 := vk+1 , w̃k+1 2
10 wk+1 := w̃k+1 /hk,k+1 .
11 if k > 2 then

hk−2,k c sk−2 0
12 := k−2
hk−1,k −sk−2 ck−2 hk−1,k
13 if k > 1 then

hk−1,k c sk−1 hk−1,k
14 := k−1
hk,k −sk−1 ck−1 hkk
15 γ := (h2kk + h2k+1,k )1/2 , ck := hkk /γ, sk := ck hk+1,k /hkk

ξk c sk ξ k
16 := k
ξk+1 −sk ck 0
17 hkk := γ, hk+1,k := 0
18 pk := [vk − hk−1,k pk−1 − hk−2,k pk−2 ]/hkk (undefined terms are zero)
19 xk := xk−1 + ξk pk

Lemma 6.44 Provided that the bi-Lanczos process does not stop prema-
turely, the residuals of the QMR and GMRES method satisfy

rkQ 2 ≤ κ2 (Vk+1 )rGk 2 ,

where κ2 (Vk+1 ) is the condition number of the matrix Vk+1 consisting of the
basis constructed by the bi-Lanczos process.

Both the BiCG method and the QMR method require a matrix–vector multipli-
cation with A and a matrix–vector multiplication with AT . This can be avoided,
though only at the price of an additional matrix–vector multiplication with the
original matrix. One possible way of achieving this is based on the following

08:44:43, subject to the Cambridge


07
238 Methods for Large Sparse Systems

observation. If we look at the iterations defined by the BiCG method closely,


we see that, as usual in the context of Krylov subspace methods, we can write
the direction p j = r j + β j p j−1 as
p j = P j (A)r0
with P ∈ π j . In the same way, we can express the residuals r j = r j−1 −
α j−1 Ap j−1 as
r j = R j (A)r0
with R j ∈ π j satisfying R j (0) = 1. Furthermore, the dual directions and residu-
als are defined by the same polynomials, i.e. we have
p̂ j = P j (AT )r̂0 = (P j (A))T r̂0 , r̂ j = R j (AT )r̂0 = (R j (A))T r̂0 .
With this, the defining coefficients can be expressed as
r j , r̂ j 2 R j (A)r0 , R j (AT )r̂0 2 R2j (A)r0 , r̂0 2
αj = = = , (6.40)
Ap j , p̂ j 2 AP j (A)r0 , P j (AT )r̂0 2 AP2j (A)r0 , r̂0 2
r j+1 r̂ j+1 2 R j+1 (A)r0 , r̂0 2
2
β j+1 = = . (6.41)
r j , r̂ j 2 R2j (A), r0 , r̂0 2

Hence, if we can somehow compute R2j (A)r0 and AP2j (A)r0 , then we will be
able to compute these coefficients directly. The defining polynomials satisfy
the recurrence
P j (t) = R j (t) + β j P j−1 (t), (6.42)
R j (t) = R j−1 (t) − α j−1 tP j−1 (t). (6.43)
Squaring these equations yields
P2j (t) = R2j (t) + 2β j P j−1 (t)R j (t) + β2j P2j−1 (t),
R2j (t) = R2j−1 (t) − 2α j−1 tP j−1 (t)R j−1 (t) + α2j−1 t2 P2j−1 (t),
which would be update formulas for the squared terms P̃ j := P2j and R̃ j = R2j
if there were no mixed terms. To remove these mixed terms, we will simply
introduce a third quantity Q̃ j := P j−1 R j . To derive recursion formulas for all
three quantities, let us multiply (6.42) by R j , which gives
P j (t)R j (t) = R2j (t) + β j P j−1 (t)R j (t). (6.44)
If we further multiply (6.43) by P j−1 then we find
P j−1 (t)R j (t) = P j−1 (t)R j−1 (t) − α j−1 tP2j−1 (t)
= R2j−1 (t) + β j−1 P j−2 (t)R j−1 (t) − α j−1 tP2j−1 (t).

08:44:43, subject to the Cambridge


07
6.3 Biorthogonalisation Methods 239

With this, we immediately find the recurrence relations for the three quantities
P̃ j := P2j , R̃ j := R2j and Q̃ j := P j−1 R j :
Q̃ j (t) = R̃ j−1 (t) + β j−1 Q̃ j−1 (t) − α j−1 t P̃ j−1 (t),
 
R̃ j (t) = R̃ j−1 (t) − 2α j−1 t R̃ j−1 (t) + β j−1 Q̃ j−1 (t) + α2j−1 t2 P̃ j−1 (t),
P̃ j (t) = R̃ j (t) + 2β j Q̃ j (t) + β2j P̃ j−1 (t).
These recurrence relations can now be used to define recurrence relations for
the vectors defined by
r j := R̃ j (A)r0 , p j := P̃ j (A)r0 , q j−1 := Q̃ j (A)r0 .
Note, however, that these vectors are not identical with those defined in the
BiCG method. For the final derivation of our new algorithm, let us also intro-
duce the quantity u j := r j + β j q j−1 . Then, we have the update formulas
q j−1 = r j−1 + β j−1 q j−2 − α j−1 Ap j−1
= u j−1 − α j−1 Ap j−1 ,
r j = r j−1 − α j−1 A(2r j−1 + 2β j−1 q j−2 − α j−1 Ap j−1 )
= r j−1 − α j−1 A(2u j−1 − α j−1 Ap j−1 )
= r j−1 − α j−1 A(u j−1 + q j−1 ),
p j = r j + 2β j q j−1 + β2j p j−1
= u j + β j (q j−1 + β j p j−1 ).
To finalise the algorithm, we only have to note two more things. First of all,
from (6.40) and (6.41) we see that we can compute the coefficients α j and β j+1
as
r j , r̂0 2 r j+1 , r̂0 2
αj = , β j+1 =
Ap j , r̂0 2 r j , r̂0 2
and from r j+1 = r j − α j A(u j + q j ) we can conclude that the next iteration is
defined by
x j+1 := x j + α j (u j + q j ).
This procedure gives a method called the Conjugate Gradient Squared (CGS)
method. The complete algorithm is given in Algorithm 32. This method does
not require us to form a matrix–vector product using the transpose AT of A.
However, it requires us to form two matrix–vector products with the matrix
A instead. Hence, the computational cost is still comparable to the cost of the
BiCG method.
The CGS method has a residual vector which can be described by R2j (A)r0
while R j (A)r0 is the residual vector of the BiCG method. This means that,

08:44:43, subject to the Cambridge


07
240 Methods for Large Sparse Systems

Algorithm 32: CGS method


Input : A ∈ Rn×n , b ∈ Rn .
Output: Approximate solution of Ax = b.
1 Choose x0 ∈ Rn and r̂0 ∈ Rn \ {0}
2 Set r0 := u0 := p0 := b − Ax0
3 Set j := 0.
4 while r j 2 > do
5 α j := r j , r̂0 2 / Ap j , r̂0 2
6 q j := u j − α j Ap j
7 x j+1 := x j + α j (u j + q j )
8 r j+1 := r j − α j A(u j + q j )
9 β j+1 := r j+1 , r̂0 2 / r j , r̂0 2
10 u j+1 := r j+1 + β j+1 q j
11 p j+1 := u j+1 + β j+1 (q j + β j+1 p j )
12 j := j + 1

in the case of convergence, we can expect that the CGS method converges
approximately twice as fast as the BiCG method. However, it also means that
the effect of rounding errors will be worse in CGS than in BiCG.
A possible remedy to this is to change the residual polynomial once again. The
idea here is to use the additional degrees of freedom to possibly smooth the
error. To this end, we will write the residual vector r j and the direction p j now
in the form

r j := χ j (A)R j (A)r0 ,
p j := χ j (A)P j (A)r0 ,

where χ j ∈ π j is a polynomial defined by χ0 (t) := 1 and


j
χ j (t) = (1 − ωi t) = (1 − ω j t)χ j−1 (t),
i=1

with certain weights ωi , which we will discuss later on. Moreover, we still
expect the polynomials P j and R j to satisfy the recurrence relations (6.42) and
(6.43), i.e.

P j (t) = R j (t) + β j P j−1 (t), (6.45)


R j (t) = R j−1 (t) − α j−1 tP j−1 (t). (6.46)

08:44:43, subject to the Cambridge


07
6.3 Biorthogonalisation Methods 241

For the vectors p j and r j these recurrence relations lead to the recursion
r j = (I − ω j A)χ j−1 (A)R j (A)r0
= (I − ω j A)χ j−1 (A)[R j−1 (A) − α j−1 AP j−1 (A)]r0
= (I − ω j A)[r j−1 − α j−1 Ap j−1 ],
p j = χ j (A)[R j (A) + β j P j−1 (A)]r0
= r j + β j (I − ω j A)p j−1 .
Note that recurrence relations (6.45) and (6.46) mean that the vectors P j (A)r0
and R j (A)r0 are the original directions and residuals of the BiCG method.
Hence, we can use the biorthogonality from Proposition 6.42, which means
in particular
R j (A)r0 ⊥ K j (AT , r̂0 ), AP j (A)r0 ⊥ K j (AT , r̂0 ).
From this, using also (6.45) and (6.46), we can conclude that
R j (A)r0 , R j (AT )r̂0 2 = R j (A)r0 , [R j−1 (AT ) − α j−1 AT P j−1 (AT )]r̂0 2
= −α j−1 R j (A)r0 , AT P j−1 (A)r̂0 2
= −α j−1 R j (A)r0 , AT [R j−1 (AT ) + β j−1 P j−2 (AT )]r̂0 2
= −α j−1 R j (A)r0 , AT R j−1 (AT )r̂0 2
= ···
= (−1) j α j−1 α j−2 · · · α0 R j (A)r0 , (AT ) j r̂0 2 (6.47)
and also in the same fashion that
AP j (A)r0 , P j (AT )r̂0 2 = AP j (A)r0 , [R j (AT ) + β j P j−1 (AT )]r̂0 2
= AP j (A)r0 , R j (AT )r̂0 2
= (−1) j α j−1 α j−2 · · · α0 AP j (A)r0 , (AT ) j r̂0 2 .
Hence, we can rewrite the coefficient α j from (6.40) as
R j (A)r0 , R j (AT )r̂0 2 R j (A)r0 , (AT ) j r̂0 2
αj = = , (6.48)
AP j (A)r0 , P j (AT )r̂0 2 AP j (A)r0 , (AT ) j r̂0 2
which is not yet what we want, since we still do not use the newly defined
vectors r j and p j . However, these vectors satisfy
r j , r̂0 2 = χ j (A)R j (A)r0 , r̂0 2 = R j (A)r0 , χ j (AT )r̂0 2
= R j (A)r0 , (I − ω j AT )χ j−1 (AT )r̂0 2
= −ω j R j (A)r0 , AT χ j−1 (AT )r̂0 2
= ···
= (−1) j ω j · · · ω1 R j (A)r0 , (AT ) j r̂0 2 (6.49)

08:44:43, subject to the Cambridge


07
242 Methods for Large Sparse Systems

and also
Ap j , r̂0 2 = Aχ j (A)P j (A)r0 , r̂0 2 = AP j (A)r0 , χ j (AT )r̂0 2
= (−1) j ω j · · · ω1 AP j (A)r0 , (AT ) j r̂0 2 .
Hence, we can express (6.48) as
r j , r̂0 2
αj = .
Ap j , r̂0 2
Moreover, (6.47) and (6.49) allow us also to express β j+1 from (6.41) now as
R j+1 (A)r0 , R j+1 (AT )r̂0 2 α j r j+1 , r̂0 2
β j+1 = = .
R j (A)r0 , R j (A)r̂0 2 ω j+1 r j , r̂0 2
Thus, the only parameters that we have not yet determined are the weights ω j .
Definition 6.45 If the coefficients ω j are in each step determined in such a
way that r j minimises its 2-norm, i.e. if ω j ∈ R minimises
ω → (I − ωA)χ j−1 (A)R j (A)r0 2
then the resulti