
This book provides a comprehensive, yet concise, overview of the subject. It includes standard material

such as direct methods for solving linear systems and least-squares problems, error,

stability and conditioning, basic iterative methods and the calculation of eigenvalues.

Later chapters cover more advanced material, such as Krylov subspace methods,

multigrid methods, domain decomposition methods, multipole expansions,

hierarchical matrices and compressed sensing.

The book provides rigorous mathematical proofs throughout, and gives algorithms

in general-purpose language-independent form. Requiring only a solid knowledge in

linear algebra and basic analysis, this book will be useful for applied mathematicians,

engineers, computer scientists and all those interested in efficiently solving linear

problems.

Holger Wendland is a professor at the University of Bayreuth. He works in the area of Numerical Analysis and is the author

of two other books, Scattered Data Approximation (Cambridge, 2005) and

Numerische Mathematik (Springer 2004, with Robert Schaback).




Numerical Linear Algebra

An Introduction

HOLGER WENDLAND

Universität Bayreuth, Germany


One Liberty Plaza, 20th Floor, New York, NY 10006, USA

477 Williamstown Road, Port Melbourne, VIC 3207, Australia

4843/24, 2nd Floor, Ansari Road, Daryaganj, Delhi – 110002, India

79 Anson Road, #06-04/06, Singapore 079906

Cambridge University Press is part of the University of Cambridge. It furthers the University's mission by disseminating knowledge in the pursuit of

education, learning, and research at the highest international levels of excellence.

www.cambridge.org

Information on this title: www.cambridge.org/9781107147133

DOI: 10.1017/9781316544938

© Holger Wendland 2018

This publication is in copyright. Subject to statutory exception

and to the provisions of relevant collective licensing agreements,

no reproduction of any part may take place without the written

permission of Cambridge University Press.

First published 2018

Printed in the United Kingdom by Clays, St Ives plc

A catalogue record for this publication is available from the British Library.

ISBN 978-1-107-14713-3 Hardback

ISBN 978-1-316-60117-4 Paperback

Cambridge University Press has no responsibility for the persistence or accuracy of

URLs for external or third-party internet Web sites referred to in this publication,

and does not guarantee that any content on such websites is, or will remain,

accurate or appropriate.


Contents

Preface page ix

1 Introduction 3

1.1 Examples Leading to Linear Systems 5

1.2 Notation 10

1.3 Landau Symbols and Computational Cost 13

1.4 Facts from Linear Algebra 17

1.5 Singular Value Decomposition 24

1.6 Pseudo-inverse 26

Exercises 29

2 Error, Stability and Conditioning 30

2.1 Floating Point Arithmetic 30

2.2 Norms for Vectors and Matrices 32

2.3 Conditioning 46

2.4 Stability 54

Exercises 55

3 Direct Methods for Solving Linear Systems 59

3.1 Back Substitution 59

3.2 Gaussian Elimination 61

3.3 LU Factorisation 65

3.4 Pivoting 71

3.5 Cholesky Factorisation 76


3.6 QR Factorisation 78

3.7 Schur Factorisation 84

3.8 Solving Least-Squares Problems 87

Exercises 100

4 Iterative Methods for Solving Linear Systems 101

4.1 Introduction 101

4.2 Banach’s Fixed Point Theorem 102

4.3 The Jacobi and Gauss–Seidel Iterations 106

4.4 Relaxation 116

4.5 Symmetric Methods 125

Exercises 130

5 Calculation of Eigenvalues 132

5.1 Basic Localisation Techniques 133

5.2 The Power Method 141

5.3 Inverse Iteration by von Wielandt and Rayleigh 143

5.4 The Jacobi Method 153

5.5 Householder Reduction to Hessenberg Form 159

5.6 The QR Algorithm 162

5.7 Computing the Singular Value Decomposition 171

Exercises 180

6 Methods for Large Sparse Systems 183

6.1 The Conjugate Gradient Method 183

6.2 GMRES and MINRES 203

6.3 Biorthogonalisation Methods 226

6.4 Multigrid 244

Exercises 258

7 Methods for Large Dense Systems 260

7.1 Multipole Methods 261

7.2 Hierarchical Matrices 282

7.3 Domain Decomposition Methods 307

Exercises 327

8 Preconditioning 329

8.1 Scaling and Preconditioners Based on Splitting 331

8.2 Incomplete Splittings 338

8.3 Polynomial and Approximate Inverse Preconditioners 346


Exercises 368

9 Compressed Sensing 370

9.1 Sparse Solutions 370

9.2 Basis Pursuit and Null Space Property 372

9.3 Restricted Isometry Property 378

9.4 Numerical Algorithms 384

Exercises 393

Bibliography 395

Index 403


Preface

Numerical Linear Algebra (NLA) is mainly concerned with the development, implementation and analysis of nu-

merical algorithms for solving linear problems. In general, such linear prob-

lems arise when discretising a continuous problem by restricting it to a finite-

dimensional subspace of the original solution space. Hence, the development

and analysis of numerical algorithms is almost always problem-dependent. The

more is known about the underlying problem, the better a suitable algorithm

can be developed.

Nonetheless, many of the so-derived methods are more general in the sense

that they can be applied to larger classes of problems than initially intended.

One of the challenges in Mathematics is deciding how to describe the neces-

sary assumptions, under which a certain method works, in the most general

way. In the context of NLA, this means finding for each method the most gen-

eral description of matrices to which the method can be applied. It also means

extracting the most general methods from the vast number of available algo-

rithms. Particularly for users with new problems this is crucial, as it allows

them to apply and test well-established algorithms first, before starting to de-

velop new methods or to extend existing ones.

In this book, I have attempted to use this matrix-driven approach rather than

the problem-driven one. Naturally, the selection of the material is biased by

my own point of view. Also, a book on NLA without any examples would be

rather dire, so there are typical examples and applications included to illustrate

the methods, but I have tried to restrict myself to simple examples, which do

not require much previous knowledge on specific problems and discretisation

techniques.

During the past years, I have given courses on Numerical Linear Algebra at

advanced BSc and early MSc level at the University of Sussex (UK), the Uni-

versity of Oxford (UK) and the University of Bayreuth (Germany). I have also


given courses on Numerical Analysis which covered parts of the NLA material

in Oxford, Göttingen (Germany) and Bayreuth.

This book on Numerical Linear Algebra is based on these courses and their material. It covers the standard material, as well as more recent

and more specific techniques, which are usually not found in standard text-

books on NLA. Examples include the multigrid method, the domain decompo-

sition method, multipole expansions, hierarchical matrices and applications to

compressed or compressive sensing. The material on each of these topics fills

entire books so that I can obviously present only a selection. However, this

selection should allow the readers to grasp the underlying ideas of each topic

and enable them to understand current research in these areas.

Each chapter of this book contains a small number of theoretical exercises.

However, to really understand NLA one has to implement the algorithms by

oneself and test them on some of the matrices from the examples. Hence, the

most important exercises intrinsic to each chapter are to implement and test the

proposed algorithms.

All algorithms are stated in a clean pseudo-code; no programming language

is preferred. This, I hope, allows readers to use the programming language of

their choice and hence yields the greatest flexibility.

Finally, NLA is obviously closely related to Linear Algebra. However, this is

not a book on Linear Algebra, and I expect readers to have a solid knowledge

of Linear Algebra. Though I will review some of the material, particularly to

introduce the notation, terms like linear space, linear mapping, determinant etc.

should be well-known.

PART ONE

PRELIMINARIES


1

Introduction

In this book, we are concerned with the basic problem of solving a linear sys-

tem

Ax = b,

where A ∈ Rⁿˣⁿ is a given invertible matrix, b ∈ Rⁿ is a given right-hand side and x ∈ Rⁿ
is the solution we seek. The solution is, of course, given by

x = A⁻¹b,

but does this really help if we are interested in actually computing the solu-

tion vector x ∈ Rn ? What are the problems we are facing? First of all, such

linear systems have a certain background. They are the results of other math-

ematical steps. Usually, they are at the end of a long processing chain which

starts with setting up a partial diﬀerential equation to model a real-world prob-

lem, continues with discretising this diﬀerential equation using an appropriate

approximation space and method, and results in such a linear system. This is

important because it often tells us something about the structure of the matrix.

The matrix might be symmetric or sparse. It is also important since it tells us

something about the size n of the matrix. With simulations becoming more

and more complex, this number nowadays becomes easily larger than a mil-

lion, even values of several hundreds of millions are not unusual. Hence, the

ﬁrst obstacle that we encounter is the size of the matrix. Obviously, for larger

dimensions n it is not possible to solve a linear system by hand. This means

we need an algorithmic description of the solution process and a computer to

run our program.

Unfortunately, using a computer leads to our second obstacle. We cannot rep-

resent real numbers accurately on a computer because of the limited number

system used by a computer. Even worse, each calculation that we do might

lead to a number which is not representable in the computer’s number system.


Is a matrix that is invertible over the real numbers also invertible in the number system used by a computer? What
are the errors that we make when representing the matrix in the computer and
when using our algorithm to compute the solution? Further questions that easily

come up are as follows.

1. How expensive is the algorithm? How much time (and space) does it require

to solve the problem? What is the best way of measuring the cost of an

algorithm?

2. How stable is the algorithm? If we slightly change the input, i.e. the matrix

A and/or the right-hand side b, how does this aﬀect the solution?

3. Can we exploit the structure of the matrix A, if it has a special structure?

4. What happens if we do not have a square system, i.e. a matrix A ∈ Rm×n

and a vector b ∈ Rᵐ? If m > n then we have an over-determined system

and usually cannot ﬁnd a (unique) solution but might still be interested in

something which comes close to a solution. If m < n we have an under-

determined system and we need to choose from several possible solutions.

Besides solving a linear system, we will also be interested in a related topic,

the computation of eigenvalues and eigenvectors of a matrix. This means we

are interested in ﬁnding numbers λ ∈ C and vectors x ∈ Cn \ {0} such that

Ax = λx.

Finding such eigenvectors and eigenvalues is again motivated by applications.

For example, in structural mechanics a vibrating system is represented by ﬁnite

elements and the eigenvectors of the corresponding discretisation matrix reﬂect

the shape modes and the roots of the eigenvalues reﬂect the frequencies with

which the system is vibrating. But eigenvalues will also be helpful in better

understanding some of the questions above. For example, they have a crucial

inﬂuence on the stability of an algorithm.

In this book, we are mainly interested in systems of real numbers, simply be-

cause they arise naturally in most applications. However, as the problem of

ﬁnding eigenvalues indicates, it is sometimes necessary to consider complex

valued systems, as well. Fortunately, most of our algorithms and ﬁndings will

carry over from the real to the complex case in a straightforward way.

We will look at direct and iterative methods to solve linear systems. Direct

methods compute the solution in a ﬁnite number of steps, iterative methods

construct a sequence of approximations to the solution.

We will look at how eﬃcient and stable these methods are. The former means

that we are interested in how much time and computer memory they require.

Particular emphasis will be placed on the number of ﬂoating point operations


required with respect to the dimension of the linear system. The latter means

for example investigating whether these methods converge at all, under what

conditions they converge and how they respond to small changes in the input

data.

1.1 Examples Leading to Linear Systems

As mentioned above, linear systems arise naturally during the discretisation

process of mathematical models of real-world problems. Here, we want to col-

lect three examples leading to linear systems. These examples are our model

problems, which we will refer to frequently in the rest of this book. They com-

prise the problem of interpolating an unknown function only known at discrete

data sites, the solution of a one-dimensional boundary value problem with ﬁ-

nite diﬀerences and the solution of a (one-dimensional) integral equation with

a Galerkin method. We have chosen these three examples because they are

simple and easily explained, yet they are signiﬁcant enough and each of them

represents a speciﬁc class of problems. In particular, the second problem leads

to a linear system with a matrix A which has a very simple structure. This

matrix will serve us as a role model for testing and investigating most of our

methods since it is simple to analyse yet complicated enough to demonstrate

the advantages and drawbacks of the method under consideration.

1.1.1 Interpolation

Suppose we are given data sites X = {x1 , . . . , xn } ⊆ Rd and observations

f1 , . . . , fn ∈ R. Suppose further that the observations follow an unknown gen-

eration process, i.e. there is a function f such that f (xi ) = fi , 1 ≤ i ≤ n.

One possibility to approximately reconstruct the unknown function f is to

choose basis functions φ1 , . . . , φn ∈ C(Rd ) and to approximate f by a function

s of the form

$$s(x) = \sum_{j=1}^{n} \alpha_j \phi_j(x), \qquad x \in \mathbb{R}^d.$$

The interpolation conditions then lead to the equations

$$f_i = s(x_i) = \sum_{j=1}^{n} \alpha_j \phi_j(x_i), \qquad 1 \le i \le n.$$


In matrix–vector form, this means solving the linear system

$$\begin{pmatrix} \phi_1(x_1) & \phi_2(x_1) & \cdots & \phi_n(x_1)\\ \phi_1(x_2) & \phi_2(x_2) & \cdots & \phi_n(x_2)\\ \vdots & \vdots & & \vdots\\ \phi_1(x_n) & \phi_2(x_n) & \cdots & \phi_n(x_n) \end{pmatrix} \begin{pmatrix} \alpha_1\\ \alpha_2\\ \vdots\\ \alpha_n \end{pmatrix} = \begin{pmatrix} f_1\\ f_2\\ \vdots\\ f_n \end{pmatrix}. \tag{1.1}$$

From standard Numerical Analysis courses we know this topic usually in the

setting that the dimension is d = 1, that the points are ordered a ≤ x1 < x2 <

· · · < xn ≤ b and that the basis is given as a basis for the space of polynomials

of degree at most n−1. This basis could be the basis of monomials φi (x) = xi−1 ,

1 ≤ i ≤ n, in which case the matrix in (1.1) becomes the transpose of a so-

called Vandermonde matrix, i.e. a matrix of the form

$$\begin{pmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^{n-1}\\ 1 & x_2 & x_2^2 & \cdots & x_2^{n-1}\\ \vdots & \vdots & \vdots & & \vdots\\ 1 & x_n & x_n^2 & \cdots & x_n^{n-1} \end{pmatrix}.$$

This matrix is a full matrix, meaning that each entry is diﬀerent from zero, so

that the determination of the interpolant requires the solution of a linear system

with a full matrix.
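The following short sketch is my own illustration and is not part of the book; it assumes NumPy is available. It sets up this Vandermonde-type matrix for a few data sites and solves the resulting full linear system:

import numpy as np

x = np.linspace(-1.0, 1.0, 6)          # data sites x_1, ..., x_n
f = np.sin(np.pi * x)                  # observations f_i = f(x_i)

V = np.vander(x, increasing=True)      # V[i, j] = x_i**j, the transposed Vandermonde matrix
alpha = np.linalg.solve(V, f)          # coefficients of the interpolating polynomial

print(np.allclose(V @ alpha, f))       # interpolation conditions s(x_i) = f_i hold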

However, we also know, from basic Numerical Analysis, that we could alter-

natively choose the so-called Lagrange functions as a basis:

$$\phi_j(x) = L_j(x) = \prod_{\substack{i=1\\ i \neq j}}^{n} \frac{x - x_i}{x_j - x_i}, \qquad 1 \le j \le n.$$

With this basis, the matrix in (1.1) simply becomes the identity matrix and the
interpolant can be derived without solving a linear system at all.

Figure 1.1 Typical radial basis functions: Gaussian, inverse multiquadric and a
compactly supported one (from left to right).

In higher dimensions, however, choosing a suitable polynomial basis becomes
problematic and a more elegant way employs a basis of the form φᵢ = Φ(· − xᵢ),
where Φ : Rᵈ → R is chosen to be radial, i.e. it is of the form Φ(x) = φ(‖x‖₂),
where φ : [0, ∞) → R is a univariate function and ‖x‖₂ = (x₁² + · · · + x_d²)^{1/2}
denotes the Euclidean norm.

Examples of possible univariate functions are

Gaussian: φ(r) = exp(−r²),
Multiquadric: φ(r) = (r² + 1)^{1/2},
Inverse Multiquadric: φ(r) = (r² + 1)^{−1/2},
Compactly Supported: φ(r) = (1 − r)⁴₊ (4r + 1);

the Gaussian, the inverse multiquadric and the compactly supported function are visualised in Figure 1.1.

In all these cases, except for the multiquadric basis function, it is known that

the resulting interpolation matrix is positive deﬁnite (with the restriction of

d ≤ 3 for the compactly supported function). Such functions are therefore

called positive deﬁnite. More generally, a good choice of a basis is given by

φ j (x) = K(x, x j ) with a kernel K : Rd × Rd → R, which is positive deﬁnite

in the sense that for all possible, pairwise distinct points x1 , . . . , xn ∈ Rd , the

matrix (K(xi , x j )) is symmetric and positive deﬁnite.

In the case of the multiquadric basis function, it is known that the interpolation

matrix is invertible and has only real, non-vanishing eigenvalues and that all

but one of these eigenvalues are negative.

Note that in the case of the inverse multiquadric and the Gaussian the matri-

ces are dense while in the case of the compactly supported basis function the

matrix can have a lot of zeros depending on the distribution of the data sites.

Details on this topic can be found in Wendland [133].
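To show how such an interpolation system might be assembled and solved in practice, here is a small sketch of my own (not from the book; it assumes NumPy and uses the inverse multiquadric, so the matrix is symmetric positive definite for pairwise distinct data sites):

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((30, 2))                      # data sites x_1, ..., x_n in R^2
f = np.sin(X[:, 0]) * np.cos(X[:, 1])        # observations f_i

def phi(r):
    return 1.0 / np.sqrt(r**2 + 1.0)         # inverse multiquadric

# A[i, j] = phi(||x_i - x_j||_2); this interpolation matrix is dense
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
A = phi(dist)

alpha = np.linalg.solve(A, f)                # coefficients of the interpolant

def s(x):
    """Evaluate the interpolant s(x) = sum_j alpha_j * phi(||x - x_j||_2)."""
    return phi(np.linalg.norm(x - X, axis=-1)) @ alpha

print(np.allclose([s(xi) for xi in X], f))   # interpolation conditions hold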

Another application is to compute a stationary solution to the heat equation. In

one dimension, we could imagine an inﬁnitely thin rod of length one, which is

heated in the interior of (0, 1) with a heat source f and is kept at zero degrees

at the boundary points 0, 1. Mathematically, this means that we want to ﬁnd a

function u : [0, 1] → R with

$$-u''(x) = f(x), \quad x \in (0, 1), \qquad u(0) = u(1) = 0.$$

If f is a complicated function or even given only at discrete points, then it is not possible to compute the
solution u analytically. In this case a numerical scheme has to be used, and the


simplest approach replaces derivatives by difference quotients:

$$u'(x) \approx \frac{u(x + h) - u(x)}{h}, \qquad\text{or}\qquad u'(x) \approx \frac{u(x) - u(x - h)}{h}.$$

The ﬁrst rule could be referred to as a forward rule while the second is a back-

ward rule. Using ﬁrst a forward and then a backward rule for the second deriva-

tive leads to

$$u''(x) \approx \frac{u'(x + h) - u'(x)}{h} \approx \frac{1}{h}\left[\frac{u(x + h) - u(x)}{h} - \frac{u(x) - u(x - h)}{h}\right] = \frac{u(x + h) - 2u(x) + u(x - h)}{h^2}.$$

For ﬁnding a numerical approximation using such ﬁnite diﬀerences we may

divide the domain [0, 1] into n + 1 pieces of equal length h = 1/(n + 1) with

nodes

$$x_i = ih = \frac{i}{n + 1}, \qquad 0 \le i \le n + 1,$$

and set ui := u(xi ). We now deﬁne the ﬁnite diﬀerence approximation uh to u,

as follows: ﬁnd uh such that uh0 = uhn+1 = 0 and

$$-\frac{u^h_{i+1} - 2u^h_i + u^h_{i-1}}{h^2} = f_i, \qquad 1 \le i \le n.$$

In matrix form, these equations become the linear system

$$\frac{1}{h^2}\begin{pmatrix} 2 & -1 & & & & 0\\ -1 & 2 & -1 & & & \\ & -1 & 2 & -1 & & \\ & & \ddots & \ddots & \ddots & \\ & & & -1 & 2 & -1\\ 0 & & & & -1 & 2 \end{pmatrix} \begin{pmatrix} u^h_1\\ u^h_2\\ u^h_3\\ \vdots\\ u^h_{n-1}\\ u^h_n \end{pmatrix} = \begin{pmatrix} f_1\\ f_2\\ f_3\\ \vdots\\ f_{n-1}\\ f_n \end{pmatrix}. \tag{1.2}$$

This system of equations is sparse, meaning that the number of non-zero en-

tries is much smaller than n2 . This sparsity can be used to store the matrix and

to implement matrix–vector and matrix–matrix multiplications eﬃciently.

To obtain an accurate approximation to u, we may have to choose h very small,

thereby increasing the size of the linear system.
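One possible way to assemble and solve the sparse system (1.2) is sketched below with NumPy/SciPy; this is my own illustration and not part of the book's language-independent presentation:

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 100
h = 1.0 / (n + 1)
x = np.arange(1, n + 1) * h                 # interior nodes x_1, ..., x_n

# (1/h^2) * tridiag(-1, 2, -1), stored in sparse format
A = sp.diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csr") / h**2

f = np.pi**2 * np.sin(np.pi * x)            # heat source; exact solution u(x) = sin(pi x)
u_h = spla.spsolve(A, f)

print(np.max(np.abs(u_h - np.sin(np.pi * x))))   # O(h^2) discretisation error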

For a general boundary value problem in d-dimensions the size of the linear

system can grow rapidly. For example, three-dimensional problems grow over

eight times larger with each uniform reﬁnement of the domain.


In the last section we have introduced a way of solving a diﬀerential equation.

A diﬀerential equation can also be recast as an integral equation but integral

equations often also come up naturally during the modelling process. Hence,

let us consider a typical integral equation as another example.

We now seek a function u : [0, 1] → R satisfying

$$\int_0^1 \log(|x - y|)\, u(y)\, dy = f(x), \qquad x \in [0, 1],$$

where f : [0, 1] → R is given. Note that the integral on the left-hand side

contains the kernel K(x, y) := log(|x − y|), which is singular on the diagonal

x = y.

To solve this integral equation numerically we will use a Galerkin approxi-

mation. The idea here is to choose an approximate solution un from a ﬁxed,

ﬁnite-dimensional subspace V = span{φ1 , . . . , φn } and to test the approximate

solution via

$$\int_0^1 \left(\int_0^1 \log(|x - y|)\, u_n(y)\, dy\right) \phi_i(x)\, dx = \int_0^1 f(x)\, \phi_i(x)\, dx, \qquad 1 \le i \le n. \tag{1.3}$$

Since we choose u_n ∈ V it must have a representation $u_n = \sum_{j=1}^{n} c_j \phi_j$ with

certain coeﬃcients c j . Inserting this representation into (1.3) and changing the

order of summation and integration yields

$$\sum_{j=1}^{n} c_j \int_0^1\!\!\int_0^1 \log(|x - y|)\, \phi_j(y)\, \phi_i(x)\, dy\, dx = \int_0^1 f(x)\, \phi_i(x)\, dx, \qquad 1 \le i \le n,$$

which is a linear system for the coefficients c₁, . . . , cₙ with a matrix A ∈ Rⁿˣⁿ having entries

$$a_{ij} = \int_0^1\!\!\int_0^1 \log(|x - y|)\, \phi_j(y)\, \phi_i(x)\, dy\, dx, \qquad 1 \le i, j \le n.$$
A typical choice for the space V is the space of piece-wise constant functions.

To be more precise, we can choose

⎧

⎪

⎪

⎨1 if n ≤ x < n ,

i−1 i

φi (x) = ⎪

⎪

⎩0 else,

but other basis functions and approximation spaces are possible. But we note

that particularly in this case the matrix A is once again a full matrix as its

entries are given by

i/n
j/n

ai j = log(|x − y|)dy dx.

(i−1)/n ( j−1)/n


More generally, a Galerkin discretisation of an integral equation with a kernel K over a domain Ω ⊆ Rᵈ leads
to matrix entries of the form

$$a_{ij} = \int_\Omega\!\int_\Omega K(x, y)\, \phi_i(x)\, \phi_j(y)\, dy\, dx,$$

which are in general all non-zero, so that the matrix is again dense.
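To make the structure of such a Galerkin matrix concrete, the following sketch (my own, not from the book) approximates the entries a_ij for the logarithmic kernel by a crude midpoint rule; it is only meant to show that every entry is non-zero, not to be an accurate quadrature:

import numpy as np

n = 16                                        # number of piecewise constant basis functions
qx, qy = 20, 21                               # different numbers of midpoints, so quadrature
                                              # points in x and y never coincide (avoids log(0))
A = np.zeros((n, n))
for i in range(n):
    xi = (i + (np.arange(qx) + 0.5) / qx) / n     # midpoints inside [i/n, (i+1)/n)
    for j in range(n):
        yj = (j + (np.arange(qy) + 0.5) / qy) / n
        # midpoint rule for a_ij = int int log|x - y| dy dx over cell i x cell j
        A[i, j] = np.mean(np.log(np.abs(xi[:, None] - yj[None, :]))) / n**2

print(np.count_nonzero(A) == n * n)           # every entry is non-zero: A is dense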

1.2 Notation

Now, it is time to set up the notation which we will use throughout this book.

However, we will specify only the most basic notation and deﬁnitions here and

introduce further concepts whenever required.

1.2.1 Mathematics

We will denote the real, complex, natural and integer numbers as usual with

R, C, N and Z, respectively. The natural numbers will not include zero. We

will use the notation x ∈ Rn to denote vectors. The components of x will be

denoted by x j ∈ R, i.e. x = (x1 , . . . , xn )T . Thus vectors will always be column

vectors. We will denote the unit standard basis of Rn by e1 , . . . , en , where the

ith unit vector ei has only zero entries except for a one at position i. In general,

we will suppress the dimension n when it comes to this basis and we might use

the same notation to denote the ith unit vector for Rn and, say, Rm . It should

be clear from the context which one is meant. On Rn we will denote the inner

product between two vectors x and y by either xᵀy or ⟨x, y⟩₂, i.e.

$$x^T y = \langle x, y\rangle_2 = \sum_{j=1}^{n} x_j y_j.$$

For a matrix A with m rows, n columns and real entries we will write A ∈ Rm×n

and A = (ai j ), where the index i refers to the rows and the index j refers to the

columns:

$$A := (a_{ij}) := \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n}\\ a_{21} & a_{22} & \cdots & a_{2n}\\ \vdots & & & \vdots\\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}.$$

For a non-square matrix A ∈ Rm×n , we can write ai j = eTi Ae j , where the ﬁrst

unit vector is from Rm while the second unit vector is from Rn .


The Kronecker symbol is defined as

$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j,\\ 0 & \text{if } i \neq j. \end{cases}$$

For the identity matrix in Rn we will use the symbol I ∈ Rn×n . We obviously

have I = (δi j )1≤i, j≤n . Again, as in the case of the unit vectors, we will usually

refrain from explicitly indicating the dimension n of the underlying space Rn .

We will also denote the columns of a matrix A ∈ Rm×n by a j := Ae j ∈ Rm ,

1 ≤ j ≤ n. Hence, we have

A = (a1 , a2 , . . . , an ).

For a matrix A ∈ Rᵐˣⁿ and a vector x ∈ Rⁿ, we can write $x = \sum_j x_j e_j$ and hence

$$Ax = \sum_{j=1}^{n} x_j a_j.$$

We will encounter speciﬁc forms of matrices and want to use the following,

well-known names.

A matrix A ∈ Rᵐˣⁿ is called
• a square matrix, if m = n,
• a diagonal matrix, if a_ij = 0 for i ≠ j,
• an upper triangular matrix, if a_ij = 0 for i > j,
• a lower triangular matrix, if a_ij = 0 for i < j,
• a band matrix, if there are k, ℓ ∈ N₀ such that a_ij = 0 if j < i − k or j > i + ℓ,
• sparse, if more than half of the entries are zero,
• dense or full, if it is not sparse.

In the case of a diagonal matrix A with diagonal entries aii = λi , we will also

use the notation

A = diag(λ1 , . . . , λn ).

Most of these names are self-explanatory. In the case of a band matrix, we have

all entries zero outside a diagonally bordered band. Only those entries ai j with

indices i − k ≤ j ≤ i + ℓ may be diﬀerent from zero. This means we have at
most k sub-diagonals and ℓ super-diagonals with non-zero entries. The most
prominent example is given by k = ℓ = 1, which has one super-diagonal and

one sub-diagonal of non-zero entries and is hence called a tridiagonal matrix.


Schematically, an upper triangular, a lower triangular and a tridiagonal matrix look as follows:

$$\begin{pmatrix} \ast & \ast & \ast & \cdots & \ast\\ & \ast & \ast & & \vdots\\ & & \ddots & \ddots & \vdots\\ & & & \ast & \ast\\ 0 & & & & \ast \end{pmatrix}, \qquad \begin{pmatrix} \ast & & & & 0\\ \ast & \ast & & & \\ \vdots & \ast & \ddots & & \\ \vdots & & \ddots & \ast & \\ \ast & \ast & \cdots & \ast & \ast \end{pmatrix}, \qquad \begin{pmatrix} \ast & \ast & 0 & \cdots & 0\\ \ast & \ast & \ast & \ddots & \vdots\\ 0 & \ddots & \ddots & \ddots & 0\\ \vdots & \ddots & \ast & \ast & \ast\\ 0 & \cdots & 0 & \ast & \ast \end{pmatrix},$$

where a ∗ marks a possible non-zero entry.

For a matrix A = (ai j ) ∈ Rm×n we denote the transpose of A by AT . It is given by

exchanging columns and rows from A, i.e. AT = (a ji ) ∈ Rn×m . A square matrix

A ∈ Rn×n is said to be symmetric if AT = A. If the matrix A is invertible, we

have (A−1 )T = (AT )−1 which we will simply denote with A−T . If A is symmetric

and invertible then also the inverse A−1 is symmetric. Both the transpose and

the inverse satisfy the rules

(AB)T = BT AT , (AB)−1 = B−1 A−1 ,

as long as these operations are well-deﬁned.

1.2.2 Algorithms

We will not use a speciﬁc computing language to describe the algorithms in

this book. However, we will assume that the reader is familiar with basic pro-

gramming techniques. In particular, we expect the reader to know what a for

and a while loop are. We will use if, then and else in the usual way and, when

assigning a new value to a variable x, this variable might appear also on the

right-hand side of the assignment, i.e. such an assignment can, for example, be

of the form x := x + y, which means that x and y are ﬁrst evaluated and the sum

of their values is then assigned to x. For each algorithm we will declare the es-

sential input data and the output data. There will, however, be no explicit return

statement. Each algorithm should also have a deterministic stopping criterion.

To demonstrate this, a ﬁrst example of an algorithm is given in Algorithm 1,

which computes the inner product s := xT y of the two vectors x, y ∈ Rn .

At the beginning, we will formulate algorithms very close to actual programs

using only basic operations. An implementation within a modern computing

language should be straightforward. Later on, when it comes to more sophisti-

cated algorithms, we will use higher-level mathematical notation to compress

the representation of the algorithm. For example, an inner product will then

only appear as s := xT y.


Algorithm 1: Inner product
Input : x, y ∈ Rⁿ.
Output: s = xᵀy.
1  s := 0
2  for j = 1 to n do
3      s := s + x_j y_j
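For readers who want to see the pseudo-code in an actual language, here is a direct Python translation (my own illustration; in practice one would of course simply call a library routine such as numpy.dot):

def inner_product(x, y):
    """Compute s = x^T y for two vectors of equal length, as in Algorithm 1."""
    s = 0.0
    for xj, yj in zip(x, y):
        s += xj * yj
    return s

print(inner_product([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))   # 32.0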

In this book, we will not discuss low-level data structures, i.e. ways of storing a

vector or a matrix within a computer. We will assume that the reader is familiar

with the concepts of arrays, which are usually used to store (full) matrices, and

index lists, which can be used to store sparse matrices.

1.3 Landau Symbols and Computational Cost

Before developing algorithms to solve linear equations, we will introduce con-

cepts to analyse the cost of such algorithms and their stability. Though, of

course, it is possible to compare the actual run-times of two algorithms on a

computer, the actual run-time is not a particularly good measure. It is more

important to understand how the computational cost of an algorithm changes

with the number of unknowns.

The multiplication of a matrix A ∈ Rm×n with a vector x ∈ Rn results in a vector

b = Ax ∈ Rm with components

$$b_i = (Ax)_i = \sum_{j=1}^{n} a_{ij} x_j, \qquad 1 \le i \le m.$$

Similarly, multiplying A ∈ Rᵐˣⁿ with a matrix B ∈ Rⁿˣᵖ results in a matrix C = AB ∈ Rᵐˣᵖ, with entries

$$c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}, \qquad 1 \le i \le m,\ 1 \le j \le p.$$

So, how much does the computation of it cost? Usually, the cost is measured

in ﬂops, which stands for ﬂoating point operations. A ﬂoating point operation

consists of one addition plus one multiplication. Sometimes, it is helpful to

distinguish between additions and multiplications and count them separately.

This was particularly true when a multiplication on a computer was substan-

tially more expensive than an addition. However, as this is no longer the case


and since additions and multiplications typically occur in comparable numbers, we will stick to the above deﬁnition of ﬂops. It is nonetheless important

to note that while subtractions are as eﬃcient as additions and multiplications,

this is not true for divisions, which are signiﬁcantly slower.

Most of the time, we will not be interested in the actual number of ﬂoating

point operations but rather in the asymptotic behaviour with respect to the di-

mension.

For example, if we look at a matrix–vector multiplication b = Ax, then we

must for every index i compute the sum

$$\sum_{j=1}^{n} a_{ij} x_j,$$

which takes essentially n ﬂoating point operations. Hence, if we double the size of the matrix, we would require twice

as many ﬂops for each component. This, however, would also be true if the

actual computing cost would be cn with a positive constant c > 0. The total

cost of the matrix–vector multiplication becomes mn since we have m entries

to compute.

If we are not interested in the constant c > 0 then we will use the following
notation: for two functions f, g : Nᵈ → R we write

f(n) = O(g(n))

if there is a constant c > 0 such that

|f(n)| ≤ c |g(n)|,   n ∈ Nᵈ.

Moreover, though in most cases we will ignore the constant c in our
considerations, a huge c > 0 can mean that we will never or only for very large
n see the actual asymptotic behaviour.

With this deﬁnition, we can say that matrix–vector multiplication of a matrix

A ∈ Rm×n with a vector x ∈ Rn costs

time(Ax) = O(mn).

In the case of a square matrix m = n, the cost is therefore O(n2 ), which means

that doubling the input size of the matrix and the vector, i.e. replacing n by 2n,

will require four times the time of the original matrix–vector multiplication.


We can also use this notation to analyse the space required to store the infor-

mation on a computer. We will have to store each matrix entry ai j and each

entry of x as well as the result Ax. This requires

O(mn + n + m) = O(mn) units of memory. There is often a trade-oﬀ between
time and space, and it might sometimes be necessary to sacriﬁce something of

one resource to gain in the other.

It is now easy to see that for the matrix–matrix multiplication C = AB, we

would require O(mnp) operations. Hence, for square systems with m = n = p

the time is O(n3 ) and doubling the input size results in computations that are

eight times longer.

Let us summarise our ﬁndings so far, with some obvious additions.

• It costs O(n) space to store the vector x ∈ Rn and O(mn) space to store the

matrix A.

• It costs O(n) time to compute the product αx.

• It costs O(mn) time to compute the product Ax.

• It costs O(mnp) time to compute the product AB.

• In particular, computing the product of two square n × n matrices costs O(n³) time.
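A quick, informal way to observe these growth rates is to time the two products for increasing n; the following sketch (my own, not from the book, assuming NumPy) does exactly that:

import time
import numpy as np

for n in (500, 1000, 2000):
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)
    x = np.random.rand(n)

    t0 = time.perf_counter(); _ = A @ x; t1 = time.perf_counter()
    t2 = time.perf_counter(); _ = A @ B; t3 = time.perf_counter()
    print(f"n = {n}:  Ax takes {t1 - t0:.2e} s,  AB takes {t3 - t2:.2e} s")

# Doubling n should roughly quadruple the matrix-vector time and multiply the
# matrix-matrix time by about eight (for large enough n; BLAS overheads and
# caching blur the picture for small n).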

More sophisticated matrix–matrix products have been developed with the goal

of reducing the computational cost by reducing the number of multiplications

at the cost of a mild increase in the number of additions. The most famous

one is a recursive algorithm by Strassen (see [120]) which can compute the

product of two n × n matrices using at most 4.7 · n^{log₂ 7} = O(n^{2.807355}) ﬂops.
Since then, other such algorithms have been introduced, most notably one by
Coppersmith and Winograd in [37] which reduces the cost to O(n^{2.375477}). The

latter algorithm is, however, more complicated and the constant hidden in the

O-notation substantially larger so that the algorithm is not used in practice.

Though superior in this context, even the Strassen algorithm is not seriously

used in practical applications, which is due to the fact that it is not as stable as

the conventional scheme, see Higham [84].

Finally, let us see how additional information can be used to reduce the cost.

If, for example, we have a tridiagonal matrix, i.e. a matrix which has only non-

zero entries on the diagonal and the sub- and super-diagonal, i.e. which is of


the form

$$A = \begin{pmatrix} a_{11} & a_{12} & 0 & \cdots & 0\\ a_{21} & a_{22} & a_{23} & \ddots & \vdots\\ 0 & \ddots & \ddots & \ddots & 0\\ \vdots & \ddots & a_{n-1,n-2} & a_{n-1,n-1} & a_{n-1,n}\\ 0 & \cdots & 0 & a_{n,n-1} & a_{nn} \end{pmatrix},$$

then each component of the product Ax reduces to at most three terms,

$$(Ax)_i = \sum_{j=1}^{n} a_{ij} x_j = a_{i,i-1} x_{i-1} + a_{ii} x_i + a_{i,i+1} x_{i+1}.$$

Hence, the time for the full matrix–vector product is now only O(n) instead

of O(n2 ). Also the matrix can be stored by exploiting its special form in O(n)

space instead of O(n2 ).
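A possible implementation of such an O(n) tridiagonal matrix–vector product, storing only the three diagonals, is sketched below (my own illustration, not from the book):

import numpy as np

def tridiag_matvec(lower, diag, upper, x):
    """Compute A x where A has sub-diagonal `lower` (length n-1),
    diagonal `diag` (length n) and super-diagonal `upper` (length n-1)."""
    b = diag * x
    b[1:] += lower * x[:-1]      # a_{i,i-1} * x_{i-1}
    b[:-1] += upper * x[1:]      # a_{i,i+1} * x_{i+1}
    return b

n = 5
d = 2.0 * np.ones(n)
off = -1.0 * np.ones(n - 1)
x = np.arange(1.0, n + 1.0)
A = np.diag(d) + np.diag(off, -1) + np.diag(off, 1)      # dense reference matrix
print(np.allclose(tridiag_matvec(off, d, off, x), A @ x))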

We will sometimes encounter algorithms which have a recursive structure. This

means, for example, that solving a problem with problem size n is reduced to

solving problems with problem size n/b, where b > 1. The Strassen algorithm

mentioned above is one such example. To analyse the cost of such an algorithm

it is often helpful to assume that n = bk with some k ∈ N. The following result

will be helpful.

Lemma 1.4 The recursion T(1) = c and T(n) = aT(n/b) + cn^α with c, a, α > 0 and b > 1 satisﬁes

$$T(n) \le \begin{cases} c\,\dfrac{b^\alpha}{b^\alpha - a}\, n^\alpha & \text{if } a < b^\alpha,\\[1ex] c\, n^\alpha (\log_b n + 1) & \text{if } a = b^\alpha,\\[1ex] c\,\dfrac{a}{a - b^\alpha}\, n^{\log_b a} & \text{if } a > b^\alpha. \end{cases}$$

Proof As mentioned above, we will prove this only for n = bk . Though the

results remain true for general n, the proof is more tedious in the general case.

Using induction, it is easy to see that

$$T(b^k) = c \sum_{i=0}^{k} a^i (b^\alpha)^{k-i}.$$

For a < b^α this yields

$$T(b^k) = c\, b^{k\alpha} \sum_{i=0}^{k} \left(\frac{a}{b^\alpha}\right)^i = c\, n^\alpha\, \frac{1 - \left(\frac{a}{b^\alpha}\right)^{k+1}}{1 - \frac{a}{b^\alpha}} < c\, n^\alpha\, \frac{b^\alpha}{b^\alpha - a}.$$


For a = b^α we have

$$T(b^k) = c\, n^\alpha (k + 1) = c\, n^\alpha (\log_b n + 1)$$

and for a > b^α we ﬁnally use

$$T(b^k) = c\, a^k \sum_{i=0}^{k} \left(\frac{b^\alpha}{a}\right)^i = c\, a^k\, \frac{1 - \left(\frac{b^\alpha}{a}\right)^{k+1}}{1 - \frac{b^\alpha}{a}} \le c\, a^k\, \frac{a}{a - b^\alpha} = c\, \frac{a}{a - b^\alpha}\, n^{\log_b a},$$

where we have used

$$a^k = e^{k \log a} = \exp\left(k\, \frac{\log a}{\log b}\, \log b\right) = (b^k)^{\frac{\log a}{\log b}} = n^{\log_b a}.$$

Finally, let us mention that the deﬁnition of the Landau symbol O can even be

further generalised in the following sense.

Deﬁnition 1.5 For two functions f, g : Rn → R and x0 ∈ Rn , we will write

f (x) = O(g(x)), x → x0 ,

if there is a constant c > 0 and a neighbourhood U = U(x₀) ⊆ Rⁿ of x₀ such that

| f (x)| ≤ c|g(x)|, x ∈ U.

1.4 Facts from Linear Algebra

In this section, we want to collect further material on matrices and vectors,

which should be known from classical, basic linear algebra courses. The char-

acter of this section is more to remind the reader of the material and to intro-

duce the notation. However, we will also prove some results which are less

familiar.

In Rⁿ we have the canonical inner product deﬁned by ⟨x, y⟩₂ := xᵀy for all
x, y ∈ Rⁿ. It particularly satisﬁes ⟨x, x⟩₂ = x₁² + · · · + xₙ² > 0 for all x ≠ 0.
As mentioned above, we will use both notations ⟨x, y⟩₂ and xᵀy equally. The
canonical inner product deﬁnes a canonical norm or length ‖x‖₂ := √⟨x, x⟩₂,
the Euclidean norm. This norm has the usual properties of a norm, which can

the Euclidean norm. This norm has the usual properties of a norm, which can

more generally be deﬁned for an arbitrary linear space. Of course, we assume

the reader to be familiar with the concept of linear spaces, linear sub-spaces,

linear independent vectors and the dimension of a linear space.

Deﬁnition 1.6 Let V be a real (or complex) linear space. A mapping ‖·‖ :
V → [0, ∞) is called a norm on V if it satisﬁes

1. homogeneity: ‖λx‖ = |λ| ‖x‖ for all x ∈ V and λ ∈ R (or λ ∈ C),
2. deﬁniteness: ‖x‖ = 0 if and only if x = 0,
3. triangle inequality: ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x, y ∈ V.

The space V with norm ‖·‖ is called a normed space.

In the case of V = Rⁿ and ‖·‖ = ‖·‖₂, the ﬁrst two properties follow immediately
from the deﬁnition, while the triangle inequality follows from the Cauchy–
Schwarz inequality

$$|\langle x, y\rangle_2| \le \|x\|_2\, \|y\|_2,$$

which holds for all x, y ∈ Rn and where equality occurs if and only if y is a

scalar multiple of x. This all remains true, in the more general situation when

the norm is deﬁned by an inner product.

Deﬁnition 1.7 Let V be a real linear space. A mapping ⟨·, ·⟩ : V × V → R is
called an inner product if it is

1. symmetric: ⟨x, y⟩ = ⟨y, x⟩ for all x, y ∈ V,
2. linear: ⟨αx + βy, z⟩ = α⟨x, z⟩ + β⟨y, z⟩ for all x, y, z ∈ V and α, β ∈ R,
3. deﬁnite: ⟨x, x⟩ > 0 for all x ∈ V \ {0}.

A space V with an inner product ⟨·, ·⟩ : V × V → R is called a pre-Hilbert space.

The second property together with the ﬁrst property indeed guarantees that
the mapping ⟨·, ·⟩ is bilinear, i.e. it is also linear in the second argument. Each
pre-Hilbert space becomes a normed space upon deﬁning the canonical norm
‖x‖ := √⟨x, x⟩.

If V is a complex linear space then an inner product on V is again a mapping
⟨·, ·⟩ : V × V → C but the ﬁrst two conditions above have to be modiﬁed
appropriately. For example, the ﬁrst condition becomes $\langle x, y\rangle = \overline{\langle y, x\rangle}$ for all
x, y ∈ V, where $\overline{\alpha}$ denotes the complex conjugate of α ∈ C. The second property
must now hold for all α, β ∈ C, which means that the inner product is linear
in its ﬁrst and anti-linear in its second component. The canonical norm is then
deﬁned as before.

Throughout this book, we will mainly be concerned with the space V = Rn and

hence introduce most concepts only for this space. But it is worth noting that

some of them immediately carry over to more general spaces.

As usual, for given x1 , . . . , xn ∈ V, we use the notation span{x1 , . . . , xn } to

denote the linear sub-space of V spanned by these elements, i.e.

$$\mathrm{span}\{x_1, \ldots, x_n\} = \left\{\sum_{j=1}^{n} \alpha_j x_j : \alpha_1, \ldots, \alpha_n \in \mathbb{R}\right\}.$$


A basis of an n-dimensional linear space V consists of n linearly independent
elements x₁, . . . , xₙ. A basis is called an orthonormal basis if all basis vectors have unit
length, i.e. ‖xᵢ‖ = ⟨xᵢ, xᵢ⟩^{1/2} = 1, and two diﬀerent basis vectors are orthogonal, i.e.
⟨xᵢ, xⱼ⟩ = 0 for i ≠ j.

Lemma (Gram–Schmidt) Let x₁, . . . , xₙ be linearly independent elements
of a pre-Hilbert space V. Deﬁne u₁ = x₁/‖x₁‖ and then, for k = 1, 2, . . . , n − 1,

$$\tilde u_{k+1} := x_{k+1} - \sum_{j=1}^{k} \langle x_{k+1}, u_j\rangle u_j, \qquad u_{k+1} := \tilde u_{k+1}/\|\tilde u_{k+1}\|.$$

Then u₁, . . . , u_k form an orthonormal basis of span{x₁, . . . , x_k} for each
1 ≤ k ≤ n.

We leave a proof of the above lemma to the reader. Though the Gram–Schmidt

procedure is obviously easily implemented, it is numerically often problematic.

We will later discuss better methods of ﬁnding orthonormal bases.
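For illustration, here is a minimal Python sketch of the classical Gram–Schmidt procedure (my own, not from the book; as stated above, this variant can lose orthogonality in floating point arithmetic, which is why better methods are discussed later):

import numpy as np

def gram_schmidt(X):
    """Orthonormalise the columns of X (assumed linearly independent)."""
    U = np.zeros_like(X, dtype=float)
    for k in range(X.shape[1]):
        u = X[:, k].astype(float)
        for j in range(k):
            u = u - (X[:, k] @ U[:, j]) * U[:, j]   # subtract projections onto u_1, ..., u_{k-1}
        U[:, k] = u / np.linalg.norm(u)
    return U

X = np.random.rand(6, 4)
U = gram_schmidt(X)
print(np.allclose(U.T @ U, np.eye(4)))              # columns are orthonormal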

Deﬁnition 1.10 Let A ∈ Rᵐˣⁿ be given. Its null space is deﬁned to be the set

ker(A) = {x ∈ Rⁿ : Ax = 0} ⊆ Rⁿ

and its range is deﬁned to be the set

range(A) = {Ax : x ∈ Rⁿ} ⊆ Rᵐ.

The null space of A is a subspace of Rⁿ, while the range of A is a subspace of Rᵐ.

A symmetric matrix A ∈ Rⁿˣⁿ is called positive semi-deﬁnite if, for all x ∈ Rⁿ,

$$x^T A x = \sum_{i=1}^{n}\sum_{j=1}^{n} x_i x_j a_{ij} \ge 0,$$

and positive deﬁnite if this expression is positive for all x ≠ 0. The matrix is called orthogonal if AᵀA = I.


A matrix A ∈ Rⁿˣⁿ is called diagonalisable if there exist an invertible matrix S ∈
Rⁿˣⁿ (or more generally S ∈ Cⁿˣⁿ) and a diagonal matrix D ∈ Rⁿˣⁿ (∈ Cⁿˣⁿ)
such that A = S D S⁻¹.

It is well known that a symmetric matrix A ∈ Rⁿˣⁿ has only real eigenvalues and
that Rⁿ possesses an orthonormal basis consisting of eigenvectors of A. This means
that there are real values λ₁, . . . , λₙ ∈ R and vectors w₁, . . . , wₙ ∈ Rⁿ \ {0}, which satisfy

$$A w_j = \lambda_j w_j, \qquad \langle w_j, w_k\rangle_2 = \delta_{jk}.$$

Collecting these eigenvectors as the columns of a matrix Q := (w₁, . . . , wₙ) yields an orthogonal matrix, and
hence we have the following result.

Proposition 1.12 For every symmetric matrix A ∈ Rⁿˣⁿ there is an orthogonal matrix
Q ∈ Rⁿˣⁿ such that QᵀAQ = D is a diagonal matrix with the eigenvalues of A
as diagonal entries.

A symmetric matrix is positive deﬁnite (positive semi-deﬁnite) if and only if all
its eigenvalues are positive (non-negative). This allows us to deﬁne a square
root: for every symmetric and positive (semi-)deﬁnite matrix A ∈ Rⁿˣⁿ
there is a symmetric and positive (semi-)deﬁnite matrix A^{1/2} ∈ Rⁿˣⁿ such that
A = A^{1/2}A^{1/2}. The matrix A^{1/2} is called a square root of A.

Proof Since A is symmetric, we can write A = QDQᵀ with a diagonal matrix D = diag(λ₁, . . . , λₙ) and an orthogonal matrix Q. The diagonal entries of D are non-negative. Hence, we can deﬁne the diagonal matrix
D^{1/2} := diag(√λ₁, . . . , √λₙ) and then A^{1/2} := QD^{1/2}Qᵀ. This matrix is obviously symmetric and because of

$$x^T A^{1/2} x = x^T Q D^{1/2} Q^T x = y^T D^{1/2} y = \sum_{j=1}^{n} y_j^2 \sqrt{\lambda_j} \ge 0,$$

with y := Qᵀx, it is also positive semi-deﬁnite. Finally, A^{1/2}A^{1/2} = QD^{1/2}QᵀQD^{1/2}Qᵀ = QDQᵀ = A.

We also remark that for a matrix A ∈ Rᵐˣⁿ, the matrix AᵀA ∈ Rⁿˣⁿ is obviously
symmetric and positive semi-deﬁnite, simply because of (AᵀA)ᵀ = AᵀA and
xᵀAᵀAx = ‖Ax‖₂² ≥ 0 for all x ∈ Rⁿ.

As a matter of fact, each symmetric and positive deﬁnite matrix can be written


in this form, see Exercise 1.5. An orthogonal matrix is always invertible and

its rows and columns, respectively, form an orthonormal basis of Rn .

Sometimes it is necessary to consider a generalised eigenvalue problem. This

occurs, for example, when discussing discretised systems coming from elastic-

ity theory. Later on, it will also be important when we look at preconditioning.

Given two matrices A, B ∈ Rⁿˣⁿ, a number λ ∈ C is called a generalised
eigenvalue with respect to A, B if there is a non-zero vector x ∈ Cⁿ such that

Ax = λBx.

If B is invertible, ﬁnding such a pair (λ, x) is obviously
equivalent to ﬁnding a classical eigenvalue and eigenvector of the matrix B⁻¹A.

However, even if A and B are symmetric, i.e. both have only real eigenvalues

and a complete set of eigenvectors, then the matrix B−1 A is in general not

symmetric. Nonetheless, we have the following result.

If A, B ∈ Rⁿˣⁿ are both symmetric and positive deﬁnite, then
all eigenvalues of B⁻¹A are real and positive. Moreover, there exist n pairs of
generalised eigenvalues λⱼ and eigenvectors vⱼ, 1 ≤ j ≤ n, satisfying Avⱼ =
λⱼBvⱼ and

$$v_i^T B v_j = v_i^T A v_j = 0, \qquad i \neq j.$$

If the eigenvectors are normalised such that vᵢᵀBvᵢ = 1, 1 ≤ i ≤ n, then, with V := (v₁, . . . , vₙ),

$$V^T B V = I, \qquad V^T A V = \mathrm{diag}(\lambda_1, \ldots, \lambda_n).$$

Proof We ﬁrst consider the eigenvalues of B⁻¹A. Since B is symmetric and positive deﬁnite, it has a symmetric and positive deﬁnite root B^{1/2} with inverse denoted by B^{−1/2}. The matrix B^{−1/2}AB^{−1/2}
is symmetric and positive deﬁnite since we have on the one hand
(B^{−1/2}AB^{−1/2})ᵀ = B^{−1/2}AᵀB^{−1/2} = B^{−1/2}AB^{−1/2} and on the other hand
xᵀB^{−1/2}AB^{−1/2}x = (B^{−1/2}x)ᵀA(B^{−1/2}x) > 0 for all x ≠ 0. In particular, its eigenvalues are real and positive. Moreover,

$$\det(B^{-1/2}AB^{-1/2} - \lambda I) = \det\bigl((B^{-1/2}A - \lambda B^{1/2})B^{-1/2}\bigr) = \det(B^{-1/2})\det(B^{-1/2}A - \lambda B^{1/2}) = \det(B^{-1}A - \lambda I),$$

meaning that B^{−1/2}AB^{−1/2} and B⁻¹A have exactly the same eigenvalues.

Finally, let λ₁, . . . , λₙ be the eigenvalues of B^{−1/2}AB^{−1/2} with corresponding,
orthonormal eigenvectors w₁, . . . , wₙ. If we set vⱼ := B^{−1/2}wⱼ then we have
B^{−1/2}AB^{−1/2}wⱼ = λⱼwⱼ, i.e. AB^{−1/2}wⱼ = λⱼB^{1/2}wⱼ, or Avⱼ = λⱼBvⱼ. Moreover, we have

$$v_j^T B v_k = w_j^T B^{-1/2} B B^{-1/2} w_k = w_j^T w_k = \delta_{jk}$$

and

$$v_j^T A v_k = w_j^T B^{-1/2} A B^{-1/2} w_k = \lambda_k \delta_{jk}.$$
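In practice, generalised eigenvalue problems of this kind are solved with standard library routines; the following sketch (my own, not from the book, assuming NumPy and SciPy) checks the statements of the result numerically. scipy.linalg.eigh normalises the eigenvectors so that VᵀBV = I:

import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
M = rng.random((5, 5)); A = M @ M.T + 5 * np.eye(5)    # symmetric positive definite
N = rng.random((5, 5)); B = N @ N.T + 5 * np.eye(5)

lam, V = eigh(A, B)                                    # generalised eigenpairs A v = lam B v

print(np.all(lam > 0))                                 # eigenvalues are real and positive
print(np.allclose(V.T @ B @ V, np.eye(5)))             # V^T B V = I
print(np.allclose(V.T @ A @ V, np.diag(lam)))          # V^T A V = diag(lambda_1, ..., lambda_n)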

Next, we need the concept of orthogonal projections. We deﬁne things immediately in a more general setting though we will mainly
be interested in the case of Rⁿ.

A linear mapping P : V → V on a pre-Hilbert space V is called a projection if P² = P. A linear mapping P : V → U ⊆ V is called
the orthogonal projection onto U if

⟨x − Px, u⟩ = 0,   x ∈ V, u ∈ U.  (1.4)

Note that a linear mapping P : V → V is a projection if and only if Pu = u for all u ∈ range(P) = {Px : x ∈ V}. Moreover, an orthogonal
projection is indeed a projection since we have for x ∈ U also x − Px ∈ U and
hence by the orthogonality condition ‖x − Px‖² = ⟨x − Px, x − Px⟩ = 0, i.e.
Px = x for all x ∈ U.

If u1 , . . . , uk ∈ U form an orthonormal basis of the ﬁnite-dimensional subspace

U ⊆ V, then the orthogonal projection onto U is given by

$$Px = \sum_{j=1}^{k} \langle x, u_j\rangle u_j. \tag{1.5}$$

Obviously, this so-deﬁned P is linear and maps onto U. To see that it is also

the orthogonal projection, let us assume that V is also ﬁnite-dimensional, as

we are only concerned with ﬁnite-dimensional spaces in this book, though the

result is also true for inﬁnite-dimensional spaces. If V is ﬁnite-dimensional, we


can extend u₁, . . . , u_k to an orthonormal basis u₁, . . . , uₙ of V. Then we have, for any x ∈ V,

$$x = \sum_{j=1}^{n} \langle x, u_j\rangle u_j, \qquad x - Px = \sum_{j=k+1}^{n} \langle x, u_j\rangle u_j,$$

so that the orthogonality condition (1.4) holds for P deﬁned by (1.5), since

$$\langle x - Px, u_i\rangle = \Big\langle \sum_{j=k+1}^{n} \langle x, u_j\rangle u_j,\, u_i\Big\rangle = \sum_{j=k+1}^{n} \langle x, u_j\rangle \langle u_j, u_i\rangle = 0$$

for all 1 ≤ i ≤ k.

The orthogonal projection Px also describes the best approximation to x from

U.

Let V be a pre-Hilbert space and let U ⊆ V be a ﬁnite-dimensional subspace. Then, for x ∈ V and u* ∈ U, the following statements are equivalent:

1. u* is the best approximation to x from U, i.e.

$$\|x - u^*\| = \min_{u \in U} \|x - u\|.$$

2. u* is the orthogonal projection of x onto U, i.e.

$$\langle x - u^*, u\rangle = 0, \qquad u \in U.$$

Proof Using the orthonormality of u₁, . . . , u_k, we have with Px from (1.5), for arbitrary α₁, . . . , α_k ∈ R,

$$\begin{aligned}
\|x - Px\|^2 &= \Big\|x - \sum_{j=1}^{k} \langle x, u_j\rangle u_j\Big\|^2 = \Big\langle x - \sum_{i=1}^{k} \langle x, u_i\rangle u_i,\; x - \sum_{j=1}^{k} \langle x, u_j\rangle u_j\Big\rangle\\
&= \|x\|^2 - 2\sum_{j=1}^{k} |\langle x, u_j\rangle|^2 + \sum_{i,j=1}^{k} \langle x, u_j\rangle \langle x, u_i\rangle \langle u_i, u_j\rangle\\
&= \|x\|^2 - \sum_{j=1}^{k} |\langle x, u_j\rangle|^2 \le \|x\|^2 - \sum_{j=1}^{k} |\langle x, u_j\rangle|^2 + \sum_{j=1}^{k} \bigl(\langle x, u_j\rangle - \alpha_j\bigr)^2\\
&= \|x\|^2 - 2\sum_{j=1}^{k} \alpha_j \langle x, u_j\rangle + \sum_{j=1}^{k} \alpha_j^2 \langle u_j, u_j\rangle = \Big\|x - \sum_{j=1}^{k} \alpha_j u_j\Big\|^2.
\end{aligned}$$

This shows that Px from (1.5) is the unique best approximation to x from U.


Since any best approximation to x from U therefore has to coincide with Px, the proof is ﬁnished.
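A small numerical illustration of (1.5) and of the best-approximation property (my own sketch, not from the book, assuming NumPy):

import numpy as np

rng = np.random.default_rng(2)
U, _ = np.linalg.qr(rng.random((8, 3)))     # columns u_1, u_2, u_3 are orthonormal
x = rng.random(8)

Px = U @ (U.T @ x)                          # Px = sum_j <x, u_j> u_j

# the residual is orthogonal to the subspace ...
print(np.allclose(U.T @ (x - Px), 0.0))
# ... and no other element of the subspace is closer to x
alpha = rng.random(3)
print(np.linalg.norm(x - Px) <= np.linalg.norm(x - U @ alpha))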

1.5 Singular Value Decomposition

The geometric interpretation of Proposition 1.12 is simple. The linear mapping

A : Rn → Rn , which is given by the matrix A in the standard basis, has a

much simpler representation when going over to the basis consisting of the

eigenvectors of A. If this basis is used in both the domain and the range of

the mapping then it can be represented by the diagonal matrix D. While the

simplicity of this representation is indeed appealing, it is not possible to ﬁnd

such a simple representation for all matrices A ∈ Rn×n let alone for non-square

matrices A ∈ Rm×n . However, if we relax the requirement of using the same

orthogonal basis in the domain and the range, i.e. in Rn and Rm , to represent the

mapping, we can ﬁnd a similarly simple representation. This is the so-called

singular value decomposition, which we want to discuss now.

To understand it, we need to recall a few more facts from linear algebra. We al-

ready introduced the rank and null space of a matrix. We also need the intrinsic,

well-known relations

n = dim ker(A) + rank(A), rank(A) = rank(AT ). (1.6)

Note that an element x ∈ ker(A) satisﬁes Ax = 0 and hence also AT Ax = 0, i.e.

it belongs to ker(AT A). If, on the other hand x ∈ ker(AT A) then AT Ax = 0, thus

0 = xᵀ(AᵀAx) = ‖Ax‖₂² and hence x ∈ ker(A). This means

ker(A) = ker(AT A).

Similarly, we can easily see that ker(AT ) = ker(AAT ). Hence, the dimension

formula above immediately gives

rank(A) = rank(AT ) = rank(AT A) = rank(AAT ).

This means that the symmetric and positive semi-deﬁnite matrices AT A ∈ Rn×n

and AAT ∈ Rm×m have the same rank.

As before, we choose for AT A a system of eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λn ≥

0 and associated eigenvectors v1 , . . . , vn ∈ Rn satisfying AT Av j = λ j v j and

vTj vk = δ jk . Exactly the ﬁrst r = rank(A) eigenvalues are positive and the

remaining ones are zero.

According to the considerations we just made, also the matrix AAT must have

exactly r positive eigenvalues. We will now see that, interestingly, the positive

eigenvalues of AAT are exactly the positive eigenvalues of AT A and that there


is a direct relation between the corresponding eigenvectors. To see this, let us deﬁne

$$\sigma_j := \sqrt{\lambda_j}, \quad 1 \le j \le n, \qquad\qquad u_j := \frac{1}{\sigma_j} A v_j, \quad 1 \le j \le r.$$

Then, we have on the one hand that

$$A A^T u_j = \frac{1}{\sigma_j} A A^T A v_j = \frac{\lambda_j}{\sigma_j} A v_j = \lambda_j u_j, \qquad 1 \le j \le r.$$

On the other hand, we have that

$$u_j^T u_k = \frac{1}{\sigma_j \sigma_k} v_j^T A^T A v_k = \frac{\lambda_k}{\sigma_j \sigma_k} v_j^T v_k = \delta_{jk}, \qquad 1 \le j, k \le r.$$

Thus, u1 , . . . , ur ∈ Rm form an orthonormal system of eigenvectors corre-

sponding to the eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λr > 0 of AAT . Since this matrix

also has rank r and is positive semi-deﬁnite, all other eigenvalues have to be

zero. It is now possible to complete the system {u1 , . . . , ur } to an orthonor-

mal basis {u1 , . . . , um } consisting of eigenvectors. If we deﬁne the matrices

U := (u1 , . . . , um ) ∈ Rm×m and V = (v1 , . . . , vn ) ∈ Rn×n then we have

Av j = σ j u j , 1 ≤ j ≤ r,

Av j = 0, r + 1 ≤ j ≤ n,

by deﬁnition and by the fact that ker(A) = ker(AT A). This can alternatively be

written as

AV = UΣ,

where Σ = (σ j δi j ) ∈ Rm×n is a non-square diagonal matrix. This altogether

proves the following result.

Theorem 1.18 (Singular Value Decomposition (SVD)) A matrix A ∈ Rm×n

has a singular value decomposition A = UΣV T with orthogonal matrices U ∈

Rm×m and V ∈ Rn×n and a diagonal matrix Σ = (σ j δi j ) ∈ Rm×n .

Note that some of the σ j may be zero, depending on the rank of the matrix A.

If r is the rank of A, then there are exactly r non-zero and hence positive σ j .

Furthermore, the singular value decomposition is not unique. To be more pre-

cise, the matrices U and V are not unique, since we can, for example, change

the sign of the columns. Moreover, if some of the σ j have the same value, then

we can choose diﬀerent bases for the corresponding eigenspaces. However, the

values σ j are uniquely determined by A and thus, if we order the σ j , then the

matrix Σ is unique.


Deﬁnition 1.19 Those σ j in the singular value decomposition which are pos-

itive are called the singular values of the matrix A.

Taking r = rank(A) into account, we can also rewrite the representation A =

UΣV T from Theorem 1.18 in the form

$$A = \sum_{j=1}^{n} \sigma_j u_j v_j^T = \sum_{j=1}^{r} \sigma_j u_j v_j^T.$$

If we deﬁne Û := (u₁, . . . , u_r) ∈ Rᵐˣʳ, V̂ := (v₁, . . . , v_r) ∈ Rⁿˣʳ and Σ̂ := diag(σ₁, . . . , σ_r) ∈ Rʳˣʳ then we have the alternative representa-
tion

A = Û Σ̂V̂ T ,

which is also called reduced singular value decomposition of the matrix A.

Since we can also write the reduced form as A = Û(V̂ Σ̂)T , we have the follow-

ing result.

Corollary 1.20 Every matrix A ∈ Rm×n of rank(A) = r can be written in the

form A = BC T with B ∈ Rm×r and C ∈ Rn×r .

In particular, we see that every rank 1 matrix is necessarily of the form bcT

with b ∈ Rm and c ∈ Rn .

There are eﬃcient methods available to compute the singular value decompo-

sition and we will discuss some of them in Section 5.7.
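For illustration (my own sketch, not from the book, assuming NumPy), the reduced singular value decomposition of a low-rank matrix can be computed and checked as follows:

import numpy as np

rng = np.random.default_rng(3)
B = rng.random((7, 2)); C = rng.random((5, 2))
A = B @ C.T                                          # a 7 x 5 matrix of rank 2

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
r = np.sum(sigma > 1e-12 * sigma[0])                 # numerical rank
print(r)                                             # 2

A_r = (U[:, :r] * sigma[:r]) @ Vt[:r, :]             # reduced SVD: U_hat Sigma_hat V_hat^T
print(np.allclose(A_r, A))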

1.6 Pseudo-inverse

We can use the singular value decomposition to deﬁne a kind of inverse to a

singular and even non-square matrix.

Deﬁnition 1.21 Let A ∈ Rm×n have rank r = rank(A). Let A = UΣV T be

the singular value decomposition of A with orthogonal matrices U ∈ Rm×m ,

V ∈ Rn×n and with Σ ∈ Rm×n being a diagonal matrix with non-zero diagonal

entries σ1 ≥ σ2 ≥ · · · ≥ σr > 0. Let Σ+ ∈ Rn×m be the matrix

$$\Sigma^+ = \begin{pmatrix} 1/\sigma_1 & & & 0 & \cdots & 0\\ & \ddots & & \vdots & & \vdots\\ & & 1/\sigma_r & 0 & \cdots & 0\\ 0 & \cdots & 0 & 0 & \cdots & 0\\ \vdots & & \vdots & \vdots & & \vdots\\ 0 & \cdots & 0 & 0 & \cdots & 0 \end{pmatrix} \in \mathbb{R}^{n \times m}.$$

Then A⁺ := VΣ⁺Uᵀ ∈ Rⁿˣᵐ is called the pseudo-inverse of A.


We know that the singular values of a matrix A ∈ Rm×n are uniquely deter-

mined. However, the orthogonal matrices U and V are not unique and hence

we have to make sure that the pseudo-inverse is well-deﬁned before we inves-

tigate its properties.

The pseudo-inverse has the following properties.

1. The pseudo-inverse A⁺ is well-deﬁned, i.e. it does not depend on the particular singular value decomposition used in its construction.
2. The matrix A⁺ satisﬁes

AA⁺ = (AA⁺)ᵀ,   A⁺A = (A⁺A)ᵀ,   AA⁺A = A,   A⁺AA⁺ = A⁺,  (1.8)

and it is the only matrix B ∈ Rⁿˣᵐ satisfying

AB = (AB)ᵀ,   BA = (BA)ᵀ,   ABA = A,   BAB = B.  (1.9)

Proof 1. Since AA⁺ = UΣΣ⁺Uᵀ and ΣΣ⁺ ∈ Rᵐˣᵐ is a diagonal matrix with r diagonal entries equal to 1
and m − r diagonal entries equal to 0, it is obviously symmetric and hence
AA⁺ = (AA⁺)ᵀ. The equality A⁺A = (A⁺A)ᵀ is shown in the same way.
Moreover, we have

AA⁺A = UΣΣ⁺ΣVᵀ = UΣVᵀ = A   and   A⁺AA⁺ = VΣ⁺ΣΣ⁺Uᵀ = VΣ⁺Uᵀ = A⁺.

These identities hold for every singular value decomposition of A, so the uniqueness shown next implies that the pseudo-inverse is well-deﬁned.

2. Now assume that both B and C satisfy the stated equalities (1.9). Then, we
can conclude that

B = BAB = B(AB)ᵀ = BBᵀAᵀ = BBᵀ(ACA)ᵀ = B(AB)ᵀ(AC)ᵀ = B(AB)(AC)
  = (BAB)AC = BAC = (BA)ᵀC = (BA)ᵀ(CAC) = (BA)ᵀ(CA)ᵀC
  = AᵀBᵀAᵀCᵀC = AᵀCᵀC = (CA)ᵀC = CAC = C.

In particular, this shows that the pseudo-inverse is uniquely determined by the so-called Penrose condi-

tions (1.9), i.e. we could have also used these conditions to deﬁne the pseudo-

inverse.


If A ∈ Rⁿˣⁿ is invertible, then the classical inverse A⁻¹ obviously satisﬁes the Penrose conditions (1.9). This means that for an invertible matrix the pseudo-inverse and the

classical inverse are the same.

Finally, if m ≥ n and rank(A) = n, the following representation of the pseudo-

inverse will become important later on.

In this situation, the pseudo-inverse of A is given by

A⁺ = (AᵀA)⁻¹Aᵀ.  (1.10)

Proof Since A ∈ Rm×n with rank(A) = n, we ﬁnd that AT A ∈ Rn×n has also

rank n and is hence invertible. Thus, the expression B := (AT A)−1 AT is well-

deﬁned. Moreover, B satisﬁes (1.9), since, using the symmetry of AT A, we

have

(BA)T = ((AT A)−1 AT A)T = I T = I = (AT A)−1 AT A = BA,

ABA = A(AT A)−1 AT A = A,

BAB = ((AT A)−1 AT )A(AT A)−1 AT = (AT A)−1 AT = B.
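As a numerical illustration (my own sketch, not from the book, assuming NumPy), the pseudo-inverse of a full-rank tall matrix can be computed from the singular value decomposition and compared with the representation (1.10):

import numpy as np

rng = np.random.default_rng(4)
A = rng.random((8, 3))                               # m >= n and rank(A) = n (almost surely)

U, sigma, Vt = np.linalg.svd(A, full_matrices=True)
Sigma_plus = np.zeros((3, 8))
Sigma_plus[:3, :3] = np.diag(1.0 / sigma)
A_plus = Vt.T @ Sigma_plus @ U.T                     # A^+ = V Sigma^+ U^T

print(np.allclose(A_plus, np.linalg.inv(A.T @ A) @ A.T))   # representation (1.10)
print(np.allclose(A_plus, np.linalg.pinv(A)))               # NumPy's built-in pseudo-inverse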

The pseudo-inverse will play an important role when it comes to solving and

analysing least-squares problems. The next result relates the pseudo-inverse to

the orthogonal projections onto the ranges of A and AT .

1. P := AA⁺ : Rᵐ → Rᵐ is the orthogonal projection onto range(A).
2. P := A⁺A : Rⁿ → Rⁿ is the orthogonal projection onto range(Aᵀ).

Proof We will only prove the ﬁrst statement since the proof of the second

statement is very similar, as range(A+ ) = range(AT ).

By deﬁnition Px = A(A+ x) ∈ range(A). Moreover, using the properties (1.8)

we ﬁnd for each x ∈ Rm and u = Av ∈ range(A) that

uᵀ(x − Px) = (Av)ᵀ(x − AA⁺x) = vᵀ[Aᵀx − Aᵀ(AA⁺)ᵀx] = vᵀ[Aᵀx − (AA⁺A)ᵀx] = vᵀ[Aᵀx − Aᵀx] = 0,

which shows that P is indeed the orthogonal projection onto range(A).


Exercises

1.1 Let V be a pre-Hilbert space. Use 0 ≤ ⟨αx + βy, αx + βy⟩ for α, β ∈ R

and x, y ∈ V to prove the Cauchy–Schwarz inequality.

1.2 Let V be a normed space with norm ‖·‖ : V → R. Show that V is a
pre-Hilbert space and that the norm comes from an inner product ⟨·, ·⟩ :
V × V → R if and only if the parallelogram equality

‖x + y‖² + ‖x − y‖² = 2‖x‖² + 2‖y‖²,   x, y ∈ V,

holds and that in this case the inner product satisﬁes

⟨x, y⟩ = (1/4)(‖x + y‖² − ‖x − y‖²),   x, y ∈ V.

1.3 Let V be a pre-Hilbert space. Show that the norm induced by the inner
product satisﬁes ‖x + y‖ < 2 for all x ≠ y ∈ V with ‖x‖ = ‖y‖ = 1.

1.4 Let A ∈ Rn×n be symmetric and positive deﬁnite and let C ∈ Rn×m . Show:

1. the matrix C T AC is positive semi-deﬁnite,

2. rank(C T AC) = rank(C),

3. the matrix C T AC is positive deﬁnite if and only if rank(C) = m.

1.5 Show that a matrix A ∈ Rn×n is symmetric and positive deﬁnite if and

only if it is of the form A = BBT with an invertible matrix B ∈ Rn×n .

1.6 Let A, B, C ∈ R2×2 with C = AB. Let

p = (a11 + a22 )(b11 + b22 ), q = (a21 + a22 )b11 ,

r = a11 (b12 − b22 ), s = a22 (b21 − b11 ),

t = (a11 + a12 )b22 , u = (a21 − a11 )(b11 + b12 ),

v = (a12 − a22 )(b21 + b22 ).

Show that the elements of C can then be computed via

c11 = p + s − t + v, c12 = r + t,

c21 = q + s, c22 = p + r − q + u.

Compare the number of multiplications and additions for this method

with the number of multiplications and additions for the standard method

of multiplying two 2 × 2 matrices.

Finally, show that if the above method is recursively applied to matrices
A, B ∈ Rⁿˣⁿ with n = 2ᵏ then the method requires 7ᵏ multiplications and
6 · 7ᵏ − 6 · 2²ᵏ additions and subtractions.


2

Error, Stability and Conditioning

Since computers use a ﬁnite number of bits to represent a real number, they

can only represent a ﬁnite subset of the real numbers. In general the range of

numbers is suﬃciently large but there are naturally gaps, which might lead to

problems.

Hence, it is time to discuss the representation of a number in a computer. We

are used to representing a number in decimal digits; a more general representation is the
B-adic representation: each number x in a ﬂoating point number system with base
B ∈ N, B ≥ 2, and precision m ∈ N satisﬁes either x = 0 or

$$x = \pm B^e \sum_{k=-m}^{-1} x_k B^k, \qquad x_{-1} \neq 0, \quad x_k \in \{0, 1, 2, \ldots, B - 1\}.$$

Here, e ∈ Z with e_min ≤ e ≤ e_max is the exponent, B is
the base and $\sum_{k=-m}^{-1} x_k B^k$ denotes the mantissa.

The corresponding ﬂoating point number system consists of all of the numbers

which can be represented in this form.

For a computer, the diﬀerent ways of representing a number, like single and

double precision, are deﬁned by the IEEE 754 norm and we will also refer to

them as machine numbers.

For example, the IEEE 754 norm states that for a double precision number we

have B = 2 and m = 52. Hence, if one bit is used to store the sign and if

11 bits are reserved for representing the exponent, a double precision number


can be stored in 64 bits. In this case, the IEEE 754 norm deﬁnes the following

additional values:

• ±∞ (inﬁnity), for example for the evaluation of log(0) and for overﬂow errors, meaning results which
would require an exponent outside the range [e_min, e_max].

• NaN (not a number) for undeﬁned numbers as they, for example, occur when

dividing 0 by 0 or evaluating the logarithm for a negative number.

Furthermore, every real number and the result of every operation has to be

rounded. There are diﬀerent rounding strategies, but the default one is rounding

x to the nearest machine number rd(x).

Deﬁnition 2.2 The machine precision, eps, is the smallest number which

satisﬁes

|x − rd(x)| ≤ eps|x|

Another way of expressing this is that for such x there is an ∈ R with | | ≤ eps

and rd(x) = (1 + )x.

It is easy to determine the machine precision for the default rounding strategy.

Theorem 2.3 For a ﬂoating point number system with base B and precision

m the machine precision is given by eps = B1−m , i.e. we have

|x − rd(x)| ≤ |x|B1−m .

Proof For every real number x in the range of the ﬂoating point system there

exists an integer e between emin and emax such that

−1

−1

−m−1

x = ±Be xk Bk = ±Be xk Bk ± Be xk Bk = rd(x) + (x − rd(x)).

k=−∞ k=−m k=−∞

Hence, we have

−m−1

−m−1

|x − rd(x)| ≤ Be xk Bk ≤ Be (B − 1)Bk = Be−m .

k=−∞ k=−∞

|x − rd(x)|

≤ Be−m−e+1 .

|x|

03

32 Error, Stability and Conditioning

Note that for B = 2 and m = 52 this means eps = 2−51 ≈ 4.4409 · 10−16 , giving

the precision of real numbers stored in a 64-bit ﬂoating point system.

Since numbers are usually provided in decimal form, they must be converted

to the B-adic form for their representation within the computer, which usually

diﬀers from the original number. Even if then all computations are done in

exact arithmetic, this input error cannot be avoided.

However, the limitation of the ﬂoating point system also means that stan-

dard operations like adding, subtracting, multiplying and dividing two machine

numbers will not result in a machine number and hence the results have to be

rounded again. The best we can hope for is that the result of such an elemen-

tary operation is given by the rounded number of the exact result of such an

operation, i.e., for example, rd(x + y) instead of rd(rd(x) + rd(y)). The IEEE

754 norm guarantees that this is indeed the case and not only for adding, sub-

tracting, multiplying and dividing but also in the case of evaluating standard

√

functions like x, sin(x), log(x) and exp(x). This can be reformulated as fol-

lows.

Theorem 2.4 Let ∗ be one of the operations +, −, ·, / and let be the equiv-

alent ﬂoating point operation, then for all x, y from the ﬂoating point system,

there exists an ∈ R with | | ≤ eps such that

x y = (x ∗ y)(1 + ).

In this section we will study norms for vectors and matrices in greater detail,

as these norms are essential for analysing the stability and conditioning of a

linear system and any method for solving it.

We start with the following example.

tiplying a matrix A ∈ Rn×n with a vector x ∈ Rn . Let us assume, however, that

the data given is inexact, for example, because it contains noise from mea-

surements or because the representation of the real numbers on the computer

causes rounding errors. For simplicity and for the time being, we will still as-

sume that the matrix A ∈ Rn×n is exactly given but that the vector x is given

as something of the form x + Δx, where Δx is a vector with, one hopes, small

entries representing the error. Hence, instead of computing b := Ax we are

computing (b + Δb) = A(x + Δx) with Δb = AΔx and we have to analyse the

error Δb in the output.

03

2.2 Norms for Vectors and Matrices 33

concepts. We need to know how to “measure”. After that we will subsequently

come back to this introductory example. Let us recall Deﬁnition 1.6, the deﬁ-

nition of a norm on a linear space V as a mapping · : V → [0, ∞) satisfying

1. homogeneity: λx = |λ|x for all x ∈ V and λ ∈ R,

2. deﬁniteness: x = 0 if and only if x = 0,

3. triangle inequality: x + y ≤ x + y for all x, y ∈ V.

The norm of a vector allows us to measure the length of it. In particular, a unit

vector with respect to a norm · is a vector x ∈ V with x = 1.

From basic courses on Analysis and Linear Algebra, the reader should know

the following list of standard norms on Rn .

Deﬁnition 2.6 For 1 ≤ p ≤ ∞, the p -norm on Rn is deﬁned to be

1/p

• x p := ni=1 |xi | p for 1 ≤ p < ∞,

• x∞ := max1≤i≤n |xi |.

Besides the ∞ -norm, the 2 -norm or Euclidean norm and the 1 -norm are the

most popular ones:

⎛ n ⎞

n ⎜⎜⎜ 2 ⎟⎟⎟1/2

x1 := |xi |, ⎜

x2 := ⎜⎝ |xi | ⎟⎟⎠ .

i=1 i=1

It is quite easy to verify that · 1 and · ∞ are indeed norms. For · 2 and

more generally · p the ﬁrst and second properties of a norm are also easily

veriﬁed. The triangle inequality, however, requires more work and needs the

Cauchy–Schwarz or more generally the Hölder inequality

n

1 1

|x j y j | ≤ x p yq , x, y ∈ Rn , p, q ≥ 1 with + = 1. (2.1)

j=1

p q

Here, the expression 1/p is interpreted as zero for p = ∞. Since this is done in

basic courses on Linear Algebra and Analysis, we will assume that the reader

is familiar with these facts.

Example 2.7 The unit ball with respect to the p -norm is deﬁned to be the

set B p := {x ∈ Rn : x p ≤ 1}. For R2 , Figure 2.1 shows B1 , B2 , B∞ .

Note that the “length” of a vector depends on the chosen norm. The choice of

the norm might depend on the application one has in mind. For example, if a

vector contains temperatures at a certain position at n diﬀerent times, we can

use the ∞ -norm to determine the maximum temperature. If we are, however,

interested in an average temperature, the 1 - or 2 -norm would be our choice.

03

34 Error, Stability and Conditioning

B∞

1

B2

B1

−1 1

−1

Though we are given diﬀerent norms on Rn , the following theorem shows that

these do not diﬀer too much in a topological sense.

Theorem 2.8 On Rn , all norms are equivalent, i.e. for two norms · and

· ∗ on Rn , there are two constants c1 , c2 > 0 such that

c1 x ≤ x∗ ≤ c2 x, x ∈ Rn .

Proof First of all, we note that this indeed deﬁnes an equivalence relation.

Hence, it suﬃces to show that all norms are equivalent to one speciﬁc norm,

say the 1 -norm.

Thus, let · be an arbitrary norm on Rn . If x = (x1 , . . . , xn )T is a vector of Rn ,

the triangle inequality yields

n n

x =

x je j ≤ |x j |e j ≤ max e j x1 =: Mx1 .

1≤ j≤n

j=1 j=1

From this we can conclude that the norm ·, as a function on Rn , is continuous

with respect to the 1 -norm, since we have

| x − y | ≤ x − y ≤ Mx − y1

for x, y ∈ Rn . Since the unit sphere K = {x ∈ Rn : x1 = 1} with respect to the

1 -norm is a compact set, the continuous function · attains its minimum and

maximum on it, i.e. there are positive constants c ≤ C such that

c ≤ x ≤ C, x ∈ K.

03

2.2 Norms for Vectors and Matrices 35

A simple scaling argument now shows the equivalence of the norms · and

· 1 .

It is sometimes good to know what the equivalence constants look like. In the

case of the p -norms, the optimal constants are well-known.

Example 2.9 For the standard p -norms, it is not too diﬃcult to compute the

equivalence constants. For example,

√

x2 ≤ x1 ≤ nx2 ,

√

x∞ ≤ x2 ≤ nx∞ ,

x∞ ≤ x1 ≤ nx∞ .

p + 1

q = 1. If

q ≥ p then

x p ≤ n p − q xq .

1 1

(2.2)

If p ≥ q then

x p ≤ xq . (2.3)

In both cases, the constants are optimal in the sense that there is no smaller

constant for which (2.2) or (2.3) holds for all x ∈ Rn .

Proof In the case p = q = ∞ both bounds are obviously true. We start with

showing (2.2), i.e. we assume q ≥ p. There is nothing to show if x = 0 so we

may assume x 0. Moreover, if q = ∞ and 1 ≤ p < ∞ we have

n

x pp = p

|x j | p ≤ nx∞ ,

j=1

which gives the result in this case. For 1 ≤ p ≤ q < ∞, we can use Hölder’s

inequality with r = q/p ≥ 1 and s ≥ 1 with 1r + 1s = 1, i.e. with s = q/(q − p),

to derive

⎛ n ⎞ ⎛ n ⎞1/r

n ⎜⎜⎜ ⎟⎟⎟1/s ⎜⎜⎜ ⎟⎟⎟

p

x p = |x j | ≤ ⎜⎜⎝⎜ 1 ⎟⎟⎠⎟ ⎜⎜⎝⎜ |x j | ⎟⎟⎠⎟ = n1−p/q xqp/q .

p s pr

Taking the pth root yields the desired result. The constant is the best possible

because we have equality in (2.2) for the vector x = (1, . . . , 1)T .

Next, let us consider the case that p ≥ q ≥ 1. If p = ∞ and q < ∞ then we

03

36 Error, Stability and Conditioning

have

⎛ n ⎞1/q

⎜⎜⎜ ⎟⎟⎟

x∞ = max |x j | = max(|x j | ) q 1/q

≤ ⎜⎜⎝⎜ |x j | ⎟⎟⎟⎠ = xq .

q

j j

j=1

p−q

|x j | p = |x j |q |x j | p−q ≤ |x j |q x∞ ≤ |x j |q xqp−q ,

such that

n

n

x pp = |x j | p ≤ |x j |q xqp−q = xqq xqp−q = xqp ,

j=1 j=1

which also gives the desired result. To see that this bound is also sharp, we can

simply pick x = e1 = (1, 0, . . . , 0)T .

vector multiplication with a perturbed input vector, see Example 2.5.

the situation

b := Ax,

b + Δb := A(x + Δx).

mean? It will depend on the size of x and hence it makes more sense to look at

the relative errors

Δx Δb

, ,

x b

provided x, b 0.

Let us assume that we have ﬁxed a norm on Rn and that there are two constants

c1 > 0 and c2 > 0 such that

The left inequality guarantees that A is injective and hence, as a square matrix,

bijective, i.e. invertible. Under these assumptions we have

Δb A(x + Δx) − Ax AΔx c2 Δx

= = ≤ . (2.5)

b Ax Ax c1 x

This means that the relative input error Δx x is “magniﬁed” by the factor c2 /c1

in the relative output error. If this factor is large, we have to expect large relative

output errors even for small relative input errors.

03

2.2 Norms for Vectors and Matrices 37

So far, the ﬁndings for our matrix–vector multiplication example have been for

square matrices. However, a thorough inspection of what we have just outlined

shows that particularly (2.5) remains true as long as (2.4) holds. We will see

that the latter also holds for non-square matrices A ∈ Rm×n as long as A is

injective, which means that we need m ≥ n and rank(A) = n.

It is now time to describe a systematic way of handling the constants c2 and c1

from above by introducing matrix norms.

Since the space of all m × n matrices forms a vector space by adding two ma-

trices component-wise and multiplying a matrix by a scalar also component-

wise, we already know what a norm on the space of matrices is. However, in

this context it is important how matrix norms interact with vector norms.

Deﬁnition 2.12 Given vector norms · (n) and · (m) on Rn and Rm , respec-

tively, we say that a matrix norm · ∗ is compatible with these vector norms

if

Ax(m) ≤ A∗ x(n) , x ∈ Rn .

A matrix norm will be called compatible if there are two vector norms so that

the matrix norm is compatible with these vector norms.

We deﬁne the subordinate or induced matrix norm with respect to the above

vector norms as

Ax(m)

A(m,n) := sup .

x∈Rn x(n)

x0

n

Ax(m)

A(m,n) ≥

x(n)

or

Ax(m) ≤ A(m,n) x(n) ,

which shows that every induced matrix norm is compatible with the vector

norms used in its deﬁnition.

Theorem 2.13 The induced matrix norm is indeed a norm. It is the “small-

est” compatible matrix norm.

Proof First of all, we notice that using the third property of a norm and the

linearity of A yields

Ax(m) x

A(m,n) = sup = sup A = sup Ax(m) .

x∈Rn x(n) x∈Rn x(n) (m) x(n) =1

x0 x0

03

38 Error, Stability and Conditioning

compact set {x : x(n) = 1}. This means that there is an x0 ∈ Rn with x0 (n) =

1 and

A(m,n) = Ax0 (m) (2.6)

such that the induced norm is well-deﬁned. The three norm properties now

easily follow from the fact that · (m) is a norm and that A is linear.

Finally, we have already noticed that the induced matrix norm is compatible. It

is the smallest compatible norm in the sense that there is a vector x ∈ Rn with

In many cases, we will suppress index notation and simply write A, when the

chosen norm is not relevant or obvious from the context.

In many situations, the following concept will turn out useful.

Deﬁnition 2.14 Let · m×n , · n×p and · m×p denote norms on the matrix

spaces Rm×n , Rn×p and Rm×p , respectively. Such norms are called multiplicative

if they satisfy

Lemma 2.15 Let · (m) , · (n) and · (p) denote vector norms on Rn , Rm

and R p , respectively. Then, the induced matrix norms satisfy

for all A ∈ Rm×n and B ∈ Rn×p , i.e. induced matrix norms are always multi-

plicative. The same is true for compatible matrix norms.

which immediately gives the stated inequality. As the above inequality holds

also for compatible matrix norms, such norms are also multiplicative.

and Example 2.11, we can now give bounds for the constants c1 and c2 in (2.4).

obviously be chosen to be c2 = A. For the constant c1 we note that

03

2.2 Norms for Vectors and Matrices 39

Δb Δx

≤ A A−1 . (2.7)

b x

The number A A−1 which can be assigned to any square, invertible matrix

A ∈ Rn×n plays an important role in the upcoming analysis of algorithms and

methods.

Deﬁnition 2.17 The number κ(A) := A A−1 is called the condition num-

ber of the invertible matrix A ∈ Rn×n .

With this deﬁnition, (2.7) means that the condition number of the matrix gives

the maximum factor by which the relative input error can be enlarged. Hence,

we want to have a condition number as small as possible to ensure that the rel-

ative output error is of the same size as the relative input error. Note, however,

that for every compatible matrix norm we have

x = Ix ≤ I x,

meaning I ≥ 1. If the norm is induced, we obviously even have I = 1. In

any case, we can conclude that

1 ≤ I = AA−1 ≤ A A−1 = κ(A)

for every compatible matrix norm.

Proposition 2.18 For every compatible matrix norm · , the condition num-

ber satisﬁes κ(A) ≥ 1 for every A ∈ Rn×n invertible. If the matrix norm is

induced, then we also have I = 1 if I is the identity matrix.

The condition number obviously depends on the matrix norm used. But since

we can identify the space of all m × n matrices with Rm·n , we see that also all

matrix norms are equivalent and so are the diﬀerent condition numbers.

Hence, our conclusion so far is that for matrix–vector multiplications a large

condition number of the matrix is extremely problematic when it comes to

propagating input or rounding errors. So far, we were only able to draw this

conclusion for square, invertible matrices. Soon, we will return to our matrix–

vector multiplication example for non-square matrices and generalise the con-

cept of the condition number.

The induced norms, which are what we are mostly interested in, are given by

the standard p vector norms deﬁned previously. Hence, for 1 ≤ p ≤ ∞ we

deﬁne the induced matrix p-norm by

Ax p

A p = sup . (2.8)

x0 x p

03

40 Error, Stability and Conditioning

This expression is of course only of limited use if we cannot really compute the

matrix norm explicitly. Fortunately, this is in the most important cases possible.

The proof of the following theorem uses the fact that the induced matrix norm

is the smallest compatible norm. If we want to show that the induced matrix

norm A equals a certain number C then we only have to show that Ax ≤

Cx for all x ∈ Rn and that there is at least one x ∈ Rn with Ax = Cx.

Theorem 2.19 Let A ∈ Rm×n and let λ1 ≥ 0 be the largest eigenvalue of AT A.

Then

m

A1 = max |ai j |, (2.9)

1≤ j≤n

i=1

A2 = λ1 , (2.10)

n

A∞ = max |ai j |. (2.11)

1≤i≤m

j=1

Proof Let us ﬁrst consider the 1-norm of a matrix A ∈ Rm×n . For every x ∈ Rn

we have

" ""

m m "" n "" m n

""

Ax1 = |(Ax)i | = "" ai j x j """ ≤ |ai j ||x j |

i=1 i=1 " j=1 " i=1 j=1

⎛ m ⎞ ⎛ ⎞ n

n ⎜

⎜⎜ ⎟⎟⎟ ⎜⎜⎜ m ⎟⎟

= ⎜

⎜⎝ |ai j |⎟⎟⎠ |x j | ≤ ⎜⎜⎝ max |ai j |⎟⎟⎟⎠ |x j | =: Cx1 ,

1≤ j≤n

j=1 i=1 i=1 j=1

m

showing that A1 ≤ C. Next, let us pick an index k such that C = i=1 |aik |

and x = ek the kth unit vector. Then, we have x1 = 1 and

m

m

Ax1 = Aek 1 = |(Aek )i | = |aik | = C,

i=1 i=1

In the case of the 2-norm of a matrix A ∈ Rm×n , we have

Ax22 = xT AT Ax.

Since, AT A ∈ Rn×n is symmetric and positive semi-deﬁnite there are n or-

thonormal eigenvectors, {zi }ni=1 corresponding to the real, non-negative eigen-

values λ1 ≥ λ2 ≥ · · · ≥ λn ≥ 0 of AT A.

Let x = α1 z1 + · · · + αn zn so that x22 = xT x = |α1 |2 + · · · + |αn |2 . Furthermore,

xT AT Ax = λ1 |α1 |2 + · · · + λn |αn |2 ≤ λ1 xT x. (2.12)

Hence we have A22 ≤ λ1 . Finally, choosing x = z1 gives equality in (2.12),

which shows A22 = λ1 .

03

2.2 Norms for Vectors and Matrices 41

"" ""

""

n ""

A∞ = max Ax∞ = max max "" ai j x j ""

x∞ =1 x∞ =1 1≤i≤m "" ""

j=1

n

n

≤ max |ai j | = |ak j |,

1≤i≤m

j=1 j=1

for some k. To show that this upper bound is also attained we may choose

x ∈ Rn with

⎧ ak j

⎪

⎪

⎨ |ak j | if ak j 0,

xj = ⎪

⎪

⎩0 else

to obtain it.

Another way of expressing the result for the Euclidean norm is to use the sin-

gular values of the matrix, see Deﬁnition 1.19. The singular values σ1 ≥ σ2 ≥

· · · ≥ σr with r = rank(A) ≤ min(m, n) of a matrix A ∈ Rm×n are the positive

square roots of the positive eigenvalues of AT A. Hence, we have A2 = σ1 ,

i.e. A2 is given by the largest singular value of A. This is the ﬁrst half of the

following corollary.

with r = rank(A) ≤ min(m, n). Then, we have A2 = σ1 . Moreover, if m ≥ n

and A has full rank r = n, then we have

Ax2 ≥ σn x2 , x ∈ Rn .

(2.12).

Another consequence is that for symmetric matrices, the norm can directly be

computed from the eigenvalues of A rather than those of AT A.

λ1 , . . . , λn are the eigenvalues of A ordered so that |λ1 | ≥ |λ2 | ≥ · · · ≥ |λn |.

Moreover, if A ∈ Rn×n is of the form A = C TC with C ∈ Rn×n then A2 = C22 .

such that A = UDU T with D being a diagonal matrix with the eigenvalues of A

as diagonal entries. From this we can conclude that AT A = UD2 U T . Hence, the

eigenvalues of AT A are the squares of the eigenvalues of A. Finally, if A = C TC

then we have just proven that A2 = C TC2 is given by the largest eigenvalue

03

42 Error, Stability and Conditioning

of C TC. But Theorem 2.19 tells us that the square root of the largest eigenvalue

of C TC is just C2 . Hence, we have A2 = C22 .

the following one, which is based upon the so-called Rayleigh quotient.

x2 =1

such that A = UDU T with D ∈ Rn×n containing the eigenvalues λ1 , . . . , λn ∈ R

of A, sorted such that |λ1 | ≥ · · · ≥ |λn |. Then, as in the corollary, we have

A2 = |λ1 |. However, we also have |Ax, x2 | ≤ Ax2 x2 ≤ A2 for all

x ∈ Rn with x2 = 1 and if v ∈ Rn denotes a normalised eigenvector of A to

the eigenvalue λ1 then |Av, v2 | = |λ1 v, v2 | = |λ1 |, which shows

x2 =1

The above result shows that if A is symmetric and positive semi-deﬁnite then

its largest eigenvalue λmax is given by

x2 =1

However, the ideas used in Proposition 2.22 can also be used to show that the

smallest eigenvalue λmin of a symmetric and positive (semi-)deﬁnite matrix A

is given by

λmin = inf Ax, x2 . (2.13)

x2 =1

This follows directly with the notation of the proof by noting that

on, we will generalise these results when discussing more general localisation

techniques for eigenvalues.

We will now brieﬂy motivate how to generalise the concept of the condition

number to non-square matrices. We will, however, restrict ourselves to the ma-

trix norm induced by the 2 vector norm. Moreover, for our example, we will

only look at injective matrices.

03

2.2 Norms for Vectors and Matrices 43

m ≥ n = rank(A). If we compute once again b + Δb = A(x + Δx) for given

x, Δx ∈ Rn , then the relative output error can be bounded by

Δb2 σ1 Δx2

≤ ,

b2 σn x2

where σ1 is the largest and σn is the smallest singular value of A. This follows

immediately from (2.5), our earlier ﬁndings c2 = A2 = σ1 and

c21 = inf Ax22 = inf AT Ax, x2 = λmin (AT A) = σ2n .

x2 =1 x2 =1

follows.

Deﬁnition 2.24 Let A ∈ Rm×n have the singular values σ1 ≥ σ2 ≥ · · · ≥

σr > 0 with r = rank(A). Then, the condition number of A subordinate to the

2 -norm is deﬁned to be

σ1

κ2 (A) := .

σr

Obviously, this deﬁnition coincides with our earlier deﬁnition in the case of a

square and invertible matrix.

We also note that we have A2 = σ1 and, using the pseudo-inverse A+ from

(1.7), we can easily see that A+ 2 = 1/σr . Hence, we can express the condition

number also as

κ2 (A) = A2 A+ 2 .

Remark 2.25 When dealing with complex-valued matrices A ∈ Cn×n , we can

proceed in exactly the same way and introduce an induced matrix norm via

A := sup Ax,

x∈Cn

x=1

can deﬁne an induced matrix norm over Rn and over Cn . Since Rn ⊆ Cn , we

obviously have

AR := sup Ax ≤ sup Ax =: AC .

x∈Rn x∈Cn

x=1 x=1

Indeed, there exist vector norms for which AR < AC , which we will discuss

later on. Fortunately, in the case of the matrix norms (2.8) induced by the p -

norms, the two norms coincide. In particular, the statements of Theorem 2.19

remain true for real-valued matrices no matter which deﬁnition we use, and

T

even for complex-valued matrices. We only have to replace AT A by A A in the

03

44 Error, Stability and Conditioning

case of the 2-norm. Here, A is the complex-conjugate of the matrix A, i.e. each

entry of A is the complex-conjugate of the corresponding entry of A.

However, instead of using the same p -norm in the domain and range of our

matrix A, we can also choose diﬀerent p -norms. Then, it is possible to show

the following result. A proof can be found in [62, 93].

matrix norms

Axq Axq

ARq,p := sup , ACq,p := sup

x∈Rn \{0} x p x∈Cn \{0} x p

satisfy for p ≤ q

ARq,p = ACq,p .

Example 2.27 We want to compute the norm A1,∞ of the matrix A ∈ R2×2

given by

1 −1

A=

1 1

both in the complex setting and the real setting. To this end, we note that we

can simplify the formula for computing the matrix norm to

Ax1

A1,∞ = sup = sup [|x1 − x2 | + |x1 + x2 |] .

x0 x∞ x∞ =1

In the ﬁrst case, the possible candidates x ∈ R2 with x∞ = 1 must have either

x1 = ±1 or x2 = ±1. Since the norm is symmetric in x1 and x2 it suﬃces to

look at x1 = ±1. In the case of x1 = 1 we must ﬁnd the maximum of

x2 − 1 = 2. The other case of x1 = −1 follows in the same fashion so that we

have

AR1,∞ = 2.

03

2.2 Norms for Vectors and Matrices 45

|x1 | = 1 and |x2 | = r ≤ 1. Hence, we can write x1 = eiθ1 and x2 = reiθ2 with

θ j ∈ R for j = 1, 2 and r ∈ [0, 1]. We have to ﬁnd the maximum of

|eiθ1 + reiθ2 | + |eiθ1 − reiθ2 | = |eiθ1 (1 + rei(θ2 −θ1 ) )| + |eiθ1 (1 − rei(θ2 −θ1 ) )|

= |1 + reiθ | + |1 − reiθ | (2.14)

=: f (r, θ)

with θ := θ2 − θ1 . Both expressions on the right-hand side of (2.14) denote

the distance of a point in C from the origin, where, in the ﬁrst summand, the

point is on the circle about 1 with radius r and, in the second summand, the

point is on the circle about −1 with radius r. The maximum of this expression

is attained if both distances are the same. To see this let us further express

f (r, θ) = (1 + r cos θ)2 + r2 sin2 θ)1/2 + (1 − r cos θ)2 + r2 sin2 θ)1/2

= (1 + r2 + 2r cos θ)1/2 + (1 + r2 − 2r cos θ)1/2 .

For r = 0 the latter expression has value 2. For a ﬁxed r ∈ (0, 1], the latter

expression is as a function of θ a periodic function with period π and attains

its maximum for θ = π/2 + πZ, which gives f (r, π/2) = 2(1 + r2 )1/2 which is

largest when r = 1. This shows altogether

√

AC1,∞ = |1 − i| + |1 + i| = 2 2.

After this example, we return to the more common situation of p = q and in

particular to p = q = 2, where Corollary 2.21 and Proposition 2.22 can be

generalised as follows.

T T

Corollary 2.28 Let A ∈ Cn×n be normal, i.e. A A = AA . Then, the 2-norm

of A is given by

A2 = |λn | = sup |x, Ax2 |,

x2 =1

T

where x, y2 = y x and where λn is the eigenvalue of A with the largest abso-

lute value.

The following example shows how to compute the induced matrix norm if the

vector norm is deﬁned by xA := Ax with an invertible matrix A and a vector

norm · . We will run into such norms frequently.

Example 2.29 Suppose S ∈ Rn×n is an invertible matrix and · is a vector

norm on Rn . Then, it is easy to see that x → xS −1 := S −1 x is also a vector

norm on Rn . The induced matrix norm is given by

AxS −1 S −1 AS S −1 x S −1 AS y

AS −1 := sup = sup = sup = S −1 AS .

x0 xS −1 x0 S −1 x y0 y

03

46 Error, Stability and Conditioning

The above example remains true in the complex setting. Assume that S ∈ Cn×n

is invertible and that · : Cn → C is a norm with induced matrix norm. Then,

the above mapping · S −1 : Cn → R, x → S −1 x deﬁnes a norm on Cn with

induced matrix norm · S −1 : Cn×n → R given by AS −1 = S −1 AS . We can

restrict both norms to Rn and Rn×n and, even though the resulting matrix norm

on Rn×n is in general not the induced matrix norm of the restricted vector norm,

the two norms are still compatible in the real setting.

These considerations imply that not every matrix norm is an induced norm,

not even every compatible norm has to be an induced norm. Another, simpler

example is given by the so-called Frobenius norm.

Deﬁnition 2.30 The Frobenius norm of a matrix A ∈ Rm×n is deﬁned to be

⎛ m n ⎞1/2

⎜⎜⎜ ⎟⎟⎟

AF = ⎜⎜⎝ ⎜ |ai j | ⎟⎟⎟⎠ .

2

i=1 j=1

that it is the Euclidean norm if we identify Rm×n with Rm·n . Moreover, the

Frobenius norm is compatible with the 2 vector norm, since we have

" ""2 ⎛ n ⎞

m "" "" m ⎜ ⎟⎟⎟

"

n

⎜⎜⎜ n

Ax2 =

2 " "

"" ai j x j "" ≤ ⎜⎜⎝ |ai j |2

|x j | ⎟⎟⎟⎠ = A2F x22 .

2

Corollary 2.31 The Frobenius norm is compatible with the Euclidean norm.

Moreover, it can be expressed alternatively as

A2F = tr(AT A) = tr(AAT ), A ∈ Rm×n ,

where tr(B) = i bii denotes the trace of the matrix B.

Proof The additional representation follows from the fact that all three terms

give the sum over the squared entries of the matrix.

√

The Frobenius norm cannot be induced since we have IF = min(m, n) 1.

But as a compatible norm the Frobenius norm is multiplicative.

2.3 Conditioning

In the last section, while discussing the problem of multiplying a matrix with

a vector, we encountered the idea of conditioning. We associated a condition

number with a matrix , which gave some insight into whether small input errors

are magniﬁed or not.

We now want to discuss this more generally and in more detail.

03

2.3 Conditioning 47

V of data to a normed vector space W of solutions. For a particular point x ∈ V

we say that the problem f (x) is well-conditioned if all small perturbations of x

lead only to small changes in f (x). A problem is ill-conditioned if there exists

a small perturbation that leads to a large change in f (x).

To make this more precise, we want to introduce the concept of a condition

number also in this context.

In general, one distinguishes between the absolute condition number, which is

the smallest number κ̃(x) which satisﬁes

f (x + Δx) − f (x) ≤ κ̃(x)Δx,

for all x and all small Δx, and the more useful relative condition number, κ(x),

given by the smallest number κ(x) satisfying

f (x + Δx) − f (x) Δx

≤ κ(x) , (2.15)

f (x) x

for all x 0 with f (x) 0 and all small Δx. We still need to make the notion

of small errors Δx more precise. In the case of the relative condition number

this is done as follows.

Deﬁnition 2.33 Let V, W be normed spaces and f : V → W. The (relative)

condition number of f at x ∈ V is deﬁned to be

f (x + Δx) − f (x) # Δx

κ(x) = κ( f, x) := lim sup .

→0 Δx≤ f (x) x

We will soon derive a general formula for the condition number in the case of

diﬀerentiable functions. But before that, let us discuss a few simple examples.

Example 2.34 (Matrix–vector multiplication) As a ﬁrst example, let us re-

view our previous example of matrix–vector multiplication. Here, we have

f (x) = Ax. Hence,

A(x + Δx) − Ax A Δx A x Δx Δx

= ≤ =: κ(x) .

Ax Ax Ax x x

If A is invertible we have x = A−1 Ax ≤ A−1 Ax such that

κ(x) ≤ A A−1 = κ(A).

The next two examples deal with the simple operation of adding and multi-

plying two numbers. Since subtraction and division can be recast as addition

and multiplication, these operations are covered by these examples, as well.

We leave it to the reader to determine the actual condition number and rather

03

48 Error, Stability and Conditioning

not.

Example 2.35 (Addition of two numbers) Let f : R2 → R be deﬁned as

f (x, y) = x + y. Then,

f (x + Δx, y + Δy) − f (x, y) x + Δx + y + Δy − (x + y)

=

f (x, y) x+y

x Δx y Δy

= + .

x+y x x+y y

Hence, the addition of two numbers is well-conditioned if they have the same

sign and are suﬃciently far away from zero. However, if the two numbers are

of the same magnitude but with diﬀerent signs, i.e. x ≈ −y, then adding these

numbers is highly ill-conditioned because of cancellation.

Example 2.36 (Multiplying two numbers) The procedure is similar to the

one of the previous example. This time, we have f (x, y) = x · y. Hence, we

have

f (x + Δx, y + Δy) − f (x, y) (x + Δx)(y + Δy) − (xy)

=

f (x, y) xy

Δy Δx Δx Δy

= + + ,

y x x y

which shows that the multiplication of two numbers is well-conditioned.

Next, let us consider a more general approach. Recall that for a diﬀerentiable

function f : Rn → R, the gradient of f at x is deﬁned to be the vector

T

∂ f (x) ∂ f (x)

∇ f (x) = ,..., .

∂x1 ∂xn

Theorem 2.37 Let f : Rn → R be a continuously diﬀerentiable function.

Then, for x ∈ Rn , the condition number of f with respect to the 2 -norm in Rn

satisﬁes

∇ f (x)2

κ(x) = x2 . (2.16)

| f (x)|

Proof The Taylor expansion of f gives

f (x + Δx) − f (x) = ∇ f (ξ)T Δx

with a ξ on the straight line between x and x + Δx. Hence, we can bound the

relative output error by

| f (x + Δx) − f (x)| x2 |∇ f (ξ)T Δx| x2 ∇ f (ξ)2

= ≤ x2 ,

| f (x)| Δx2 | f (x)| Δx2 | f (x)|

03

2.3 Conditioning 49

where we have used the Cauchy–Schwarz inequality in the last step. The ex-

pression on the right-hand side depends only implicitly on Δx as ξ is on the

line segment between x and x + Δx. If we now take the limit Δx2 → 0 as

in Deﬁnition 2.33 we have that ξ → x, which shows that κ(x) is at most the

expression on the right-hand side of (2.16). If Δx is parallel to ∇ f (x) we even

have equality in the above inequality, which shows that κ(x) does indeed have

the stated form.

√

Example 2.38 (Square root) Let f (x) = x, x > 0. Then, we have f (x) =

1 −1/2

2x and hence

| f (x)| 1 x 1

κ(x) = |x| = √ √ = .

| f (x)| 2 x x 2

Hence, taking the square root is a well-conditioned problem with condition

number 1/2.

The next example is another application of (2.16). This time we look at the

exponential function.

Example 2.39 (Function evaluation) Let f (x) = e x , x ∈ R. Since f (x) =

2

2

2xe x = 2x f (x), we can use (2.16) to see that

| f (x)|

κ(x) = |x| = 2x2 .

| f (x)|

Hence, the problem is well-conditioned for small x and ill-conditioned for

large x.

So far, we have dealt with the condition number of a problem. While there is

not much choice in computing the sum of two real numbers, the situation is

already diﬀerent when it comes to computing the square root. Here, we did not

consider a particular method to compute the square root but only the problem

itself. However, it might happen that a problem is well-conditioned but the

algorithm or method we use is not.

Deﬁnition 2.40 A method for solving a numerical problem is called well-

conditioned if its condition number is comparable to the condition number of

the problem.

Let us explain this with a simple example.

Example 2.41 We consider the quadratic equation x2 + 2px − q = 0 with

p, q > 0. Then, we know from elementary calculus that the positive solution to

this problem is given by

x = −p + p2 + q. (2.17)

03

50 Error, Stability and Conditioning

q since then we will have x ≈ −p + p and the last expression is prone to

cancellation. We will make this more precise later but we ﬁrst want to look

at the condition number of the problem of ﬁnding the positive solution itself.

To do this, let us introduce the notation x(p, q) for the solution of the equation

with coeﬃcients p and q. Then, the relative output error is given by

Δx x(p + Δp, q + Δq) − x(p, q)

x = = .

x x(p, q)

Δq

If we also write p = Δp p and q = q for the relative input error, we ﬁnd

p + Δp = p(1 + p ) and q + Δq = q(1 + q ). Using that x(p + Δp, q + Δq) solves

the quadratic equation with coeﬃcients p + Δp and q + Δq yields

= (1 + 2 x + x2 )x2 + 2px(1 + p + x + p x ) − q − q q .

If we now ignore all quadratic terms in this equation (which we will denote

with a dot above the equal sign), i.e. x2 and p x , and rearrange the terms, we

derive

This resolves to

1 q q − 2 p px 1 (q − px) q + px q − 2 p px 1 px( q − 2 p )

x =˙ = = q +

2 x2 + px 2 q − px 2 x2 + px

and this ﬁnally gives the bound

1 1 px(| q | + 2| p |)

| x | ≤˙ | q | + = | q | + | p |.

2 2 px

This means that ﬁnding the positive solution to our quadratic equation is a

well-conditioned problem.

As mentioned above, (2.17) is not a well-conditioned method to compute the

solution. To see this, let us analyse the method in more detail. Using the same

notation as above to denote the relative error, we can, ﬁrst of all, summarise

the ﬁndings of our previous examples as follows:

1

| √ x | ≤ | x |,

""2 " "" "

" x "" " y ""

| x+y | ≤ " " " | x | + "" " | y |,

x + y" x + y"

| x·y | ≤˙ | x | + | y |.

03

2.3 Conditioning 51

(2.17) successively

| p2 | ≤˙ | p | + | p | ≤ 2 ,

p2 q

| p2 +q | ≤˙ | p2 | + 2 | q | ≤ 2 ,

p2 +q p +q

1

| √ p2 +q | ≤ | p2 +q | ≤˙

2

and ﬁnally, with y = p2 + q,

"" " "" "

−p "" y "" p+y

| x | = | −p+y | ≤˙ """ "" | p | + """ " | y | ≤ .

−p + y −p + y " y− p

The term y − p in the denominator becomes problematic in the situation that p

is much larger than q due to cancellation. Thus, in this situation, the formula is

ill-conditioned. However, in this situation we can rewrite it as

p + p2 + q

x = −p + p2 + q = −p + p2 + q

p + p2 + q

p2 + q − p2 q

= = .

p + p2 + q p + p2 + q

To show that this formula is well-conditioned we deﬁne z = p + p2 + q and

ﬁnd

p y

| z | = | p+y | ≤ | p | + | y | ≤ .

p+y y+ p

Using the easily veriﬁable

x/y =˙ x − y

| x | = | q/z | =˙ | q − z | ≤ 2 ,

Let us now return to linear algebra. We will ﬁnish this section by studying

the problem of solving a linear system of the form Ax = b with an invertible

matrix A ∈ Rn×n and given b ∈ Rn .

Suppose our input data A and b have been perturbed such that we are actually

given A + ΔA and b + Δb. We can try to model the error in the output by

assuming that x + Δx solves

03

52 Error, Stability and Conditioning

(A + ΔA)Δx = Δb − (ΔA)x.

For the time being, let us assume that A + ΔA is invertible, which is, by conti-

nuity, true if A is invertible and ΔA only a small perturbation. Then,

Δx = (A + ΔA)−1 (Δb − (ΔA)x) ,

i.e.

Δx ≤ (A + ΔA)−1 (Δb + ΔA x) ,

which gives a ﬁrst estimate on the absolute error. However, more relevant is

the relative error

Δx −1 Δb ΔA

≤ (A + ΔA) A +

x A x A

−1 Δb ΔA

≤ (A + ΔA) A + .

b A

The last estimate shows that the relative input error is magniﬁed by a factor

(A + ΔA)−1 A (2.18)

for the relative output error. This demonstrates the problem of conditioning.

The latter expression can further be modiﬁed to incorporate the condition num-

ber of a matrix.

It is now time to make our assumptions more rigorous, which ultimately also

leads to a further bound on (2.18).

Lemma 2.42 If A ∈ Rn×n is a non-singular matrix and ΔA ∈ Rn×n a pertur-

bation, which satisﬁes

A−1 ΔA < 1

in a compatible matrix norm, then, A + ΔA is non-singular and we have

A−1

(A + ΔA)−1 ≤ .

1 − A−1 ΔA

Proof Let us deﬁne the mapping T : Rn → Rn with T x = (A + ΔA)x, x ∈ Rn .

Then, we have obviously x = A−1 T x − A−1 (ΔA)x and x ≤ A−1 T x +

A−1 ΔA x. This means particularly

A−1

x ≤ T x,

1 − A−1 ΔA

which immediately leads to the injectivity and hence bijectivity of T = A + ΔA

and to the stated bound.

03

2.3 Conditioning 53

Theorem 2.43 Suppose that A ∈ Rn×n is invertible and that the perturbation

ΔA ∈ Rn×n satisﬁes A−1 ΔA < 1. Then, for b, Δb ∈ Rn with b 0, the

estimate

−1

Δx ΔA Δb ΔA

≤ κ(A) 1 − κ(A) +

x A b A

holds.

Proof Lemma 2.42 guarantees the invertibility of A+ΔA. Hence, our ﬁndings

so far together with Lemma 2.42 yield

Δx −1 Δb ΔA

≤ (A + ΔA) A +

x b A

A−1 A Δb ΔA

≤ + ,

1 − A−1 ΔA b A

2.43 takes the form

Δx Δb ΔA

≈ κ(A) + ,

x b A

showing that the relative input error is ampliﬁed by κ(A) in the relative output

error.

Example 2.44 Assume the condition number behaves like κ(A) = O(10σ )

and that the relative input errors behave like

ΔA Δb

= O(10−k ), = O(10−k )

A b

with k > σ. Then, the relative error of the solution is of size

Δx

≈ 10σ−k .

x

In the case of the maximum norm we have to expect to lose σ digits in accu-

racy.

As we always have to expect errors of the size of the machine precision, the

above example shows that a condition number inverse proportional to the ma-

chine precision will produce essentially unreliable results.

03

54 Error, Stability and Conditioning

2.4 Stability

There is yet another way of looking at the output of an algorithm and how well

it computes the solution of the original problem.

Consider again a problem f : V → W from a vector space V of data to a vector

space W of solutions. A numerical algorithm can then simply be deﬁned to

be another map f˜ : V → W. More precisely, the given data x ∈ V will be

rounded to ﬂoating point precision and then supplied to the algorithm f˜. We

can and will, however, incorporate this rounding into the algorithm f˜. Hence,

the actual outcome is a solution f˜(x).

Obviously, we could call f˜ stable if the approximate result f˜(x) is close to the

desired result f (x), i.e., when taking the relative point of view again, if

f˜(x) − f (x)

f (x)

is small. Unfortunately, this is often extremely diﬃcult to verify. Hence, a dif-

ferent concept has been developed. While the above concept requires the ap-

proximate result from the correct input data to be close to the correct result, we

could also try to interpret the outcome f˜(x) as the result of the original function

f at a perturbed point x̃ ∈ V, i.e. we assume that f˜(x) = f (x̃).

if for every x ∈ V there is an x̃ ∈ V with f (x̃) = f˜(x) and

x − x̃

= O(eps), (2.19)

x

where eps denotes the machine precision.

In other words, this means that a backward-stable algorithm solves the problem

correctly with nearly the correct input data.

The concept of backward-stability is a very powerful concept in Numerical

Linear Algebra. For many algorithms it can be shown that they are backward-

stable. A backward-stable algorithm computes a good approximation to the

original problem provided the condition number of the original problem is of

moderate size.

the problem f : V → W. Then, for each x ∈ V,

f˜(x) − f (x)

= O(κ(x) eps),

f (x)

where κ(x) denotes the relative condition number for computing f (x).

03

Exercises 55

and (2.19). The deﬁnition of the relative condition number in Deﬁnition 2.33

yields

f˜(x) − f (x) f (x̃) − f (x) x − x̃

= ≤ κ(x) = O(κ(x) eps).

f (x) f (x) x

Let us brieﬂy describe how this general theory applies to our main application

of solving linear systems, though we can also use it for analysing algorithms

for solving least-squares or eigenvalue problems.

When solving linear systems Ax = b, we may assume that the right-hand side b

is exact. Hence, the linear mapping we are interested in is given by f : Rn×n →

Rn , f (A) := A−1 b. The condition number of this problem is given by Theorem

2.43 with Δb = 0. More precisely, Theorem 2.43 yields

−1

f (A + ΔA) − f (A) # ΔA Δx # ΔA ΔA

= ≤ κ(A) 1 − κ(A) .

f (A) A x A A

Taking the limit ΔA → 0 as in Deﬁnition 2.33, this shows that κ( f, A) ≤

κ(A) = A A−1 .

x̃ ∈ Rn is the approximate solution computed with a backward-stable algorithm

then the relative error satisﬁes

x̃ − x

= O(κ(A) eps).

x

In the next chapter, we will discuss the backward-stability of several algo-

rithms. However, we will omit the proofs as they are usually extremely tech-

nical. Nonetheless, the reader should be very aware of this problem and may

consult the books by Higham [85] and Trefethen and Bau [124].

Exercises

2.1 Consider the matrix A = uvT , where u ∈ Rm and v ∈ Rn . Show that

⎛ ⎞

⎜⎜⎜ 4 1 −1⎟⎟⎟

1 −2 ⎜ ⎟

A= , A = ⎜⎜⎜⎜⎜ 1 2 0 ⎟⎟⎟⎟⎟ .

−2 4 ⎝ ⎠

−1 0 2

03

56 Error, Stability and Conditioning

A2 ≤ A∞ A1 .

2.4 Let A ∈ Rn×n be invertible and 1 ≤ p, q ≤ ∞ with 1

p + 1

q = 1. Show that

κ p (A) = κq (AT ).

2.5 Let A ∈ Rm×n have the singular values σ1 ≥ σ2 ≥ · · · ≥ σr > 0 with

r = rank(A). Show that

⎛ r ⎞

⎜⎜⎜ ⎟⎟⎟1/2

AF = ⎜⎜⎜⎝ σ2j ⎟⎟⎟⎠ .

j=1

1 1

A= .

1 −1

Show that for 1 ≤ p, q ≤ ∞ the norm ACq,p is given by

ACq,p = max{21− p , 2 q , 2 q − p + 2 }.

1 1 1 1 1

ACq,p = ARq,p = max{21− p , 2 q }

1 1

Aq,p < 2 q − p + 2 = ACq,p .

1 1 1

⎛ ⎞

⎜⎜⎜1 − − − ⎟⎟⎟

⎜ ⎟

A := ⎜⎜⎜⎜⎜ − 1− − ⎟⎟⎟⎟⎟ .

⎝ ⎠

− − 1−

Show that for p > q ≥ 1 there exists an > 0 such that A Rq,p ≤ A Cq,p .

03

PART TWO

BASIC METHODS

Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316544938

13:06:52, subject to the Cambridge

Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316544938

3

Direct Methods for Solving Linear Systems

In this chapter we will discuss direct methods for solving linear systems. Direct

means that the algorithms derived here will terminate after a ﬁxed number of

steps, usually n for an n × n matrix, with the solution of the system.

The simplest linear system to solve is, of course, a system of the form Dx = b,

where D = diag(d11 , . . . , dnn ) ∈ Rn×n is a diagonal matrix with non-vanishing

entries on the diagonal. The obvious solution is simply given by xi = bi /dii for

1 ≤ i ≤ n.

A slightly more complex problem is given if the matrix is a triangular matrix,

i.e. we seek a solution x ∈ Rn of the equation

Rx = b, (3.1)

where b ∈ Rn is given and R ∈ Rn×n is, for example, an upper triangular matrix,

i.e. a matrix of the form

⎛ ⎞

⎜⎜⎜r11 r12 r13 . . . r1n ⎟⎟⎟

⎜⎜⎜⎜ ⎟

⎜⎜⎜ r22 r23 . . . r2n ⎟⎟⎟⎟

⎟

⎜⎜ .. .. .. ⎟⎟⎟⎟

R = ⎜⎜⎜⎜⎜ . . . ⎟⎟⎟⎟ .

⎜⎜⎜⎜ .. .. ⎟⎟⎟⎟

⎜⎜⎜

⎜⎝ 0 . . ⎟⎟⎟⎟

⎠

rnn

The system has a unique solution if the determinant of R is diﬀerent from zero.

Since we have an upper triangular matrix, the determinant equals the product

of all diagonal entries. Hence, R is invertible once again if rii 0 for 1 ≤ i ≤ n.

59

04

60 Direct Methods for Solving Linear Systems

Assuming all of this, we see that the last equation of (3.1) is simply given by

rnn xn = bn ,

which resolves to xn = bn /rnn . Next, the equation from row n − 1 is

rn−1,n−1 xn−1 + rn−1,n xn = bn−1 .

In this equation, we already know xn so that the only unknown is xn−1 , which

we can compute as

xn−1 = (bn−1 − rn−1,n xn )/rn−1,n−1 .

We obviously can proceed in this way to compute successively xn−2 , xn−3 and

so on until we ﬁnally compute x1 . The general algorithm for computing the

solution in this situation is based upon

⎛ ⎞

1 ⎜⎜⎜⎜ n ⎟⎟⎟

xi = ⎜⎜⎝bi − ri j x j ⎟⎟⎟⎠ , i = n − 1, n − 2, . . . , 1, (3.2)

rii j=i+1

Input : Upper triangular matrix R ∈ Rn×n , b ∈ Rn .

Output: Solution x ∈ Rn of Rx = b.

1 xn := bn /rnn

2 for i = n − 1 downto 1 do

3 xi := bi

4 for j = i + 1 to n do

5 xi := xi − ri j x j

6 xi := xi /rii

Theorem 3.1 The system Rx = b with an invertible upper triangular matrix

R ∈ Rn×n can be solved in O(n2 ) time using back substitution.

Proof The computational cost is mainly determined by the two for loops in

Algorithm 2. The cost of line 5 is one ﬂop, as is the cost of line 6. Hence, the

total cost becomes

⎛ ⎞

n−1 ⎜

⎜⎜⎜

n

⎟⎟⎟

n−1

n−1

n−1

(n − 1)n

⎜⎜⎝1 + 1⎟⎟⎟⎠ = n − 1 + (n − (i + 1)) = (n − i) = i= .

i=1 j=i+1 i=1 i=1 i=1

2

04

3.2 Gaussian Elimination 61

It should be clear that we can proceed in the exactly same way when con-

sidering lower triangular matrices, starting with the computation of x1 . The

corresponding solution procedure is then called forward substitution.

computed solution x̃ ∈ Rn of a non-singular upper triangular system Rx = b

satisﬁes (R + ΔR)x̃ = b with ΔR ∈ Rn×n satisfying

A proof of this result can, for example, be found in Higham [85] and Trefethen

and Bau [124]. According to Corollary 2.47, we can assume that back substi-

tution allows us to compute a suﬃciently accurate solution as long as κ(R) is

not too large. The same remains true for forward substitution.

We now turn to solving the system Ax = b for arbitrary, invertible matrices

A ∈ Rn×n . Probably the best-known and still very popular method for solving a

square system Ax = b is Gaussian elimination.

The aim of Gaussian elimination is to reduce a linear system of equations

Ax = b to an upper triangular system Ux = c by applying very simple lin-

ear transformations to it, which do not change the solution.

Suppose we have a linear system of the form

a21 x1 + a22 x2 + · · · + a2n xn = b2 ,

.. ..

. .

an1 x1 + an2 x2 + · · · + ann xn = bn ,

then, assuming that a11 0, we can multiply the ﬁrst row by aa1121

and subtract

the resulting row from the second row, which cancels the factor in front of x1

in that row. Then, we can multiply the original ﬁrst row by aa31

11

and subtract it

from the third row, which, again, cancels the factor in front of x1 in that row.

It is important to note that this procedure does not change the solution of the

system, i.e. the original system and the system modiﬁed in this way obviously

have the same solution.

Continuing like this all the way down, we end up with an equivalent linear

system, i.e. a system with the same solution, of the form

04

62 Direct Methods for Solving Linear Systems

22 x2 + · · · + a2n xn = b2 ,

.. ..

. .

a(2)

n2 x2 + · · · + ann xn = bn .

(2) (2)

Assuming now that a(2) 22 0, we can repeat the whole process using the second

row now to eliminate x2 from row three to n, and so on. Hence, after k − 1

steps, k ≥ 2, the system can be written in matrix form as

⎛ ⎞ ⎛ (1) ⎞

⎜⎜⎜a11 ∗ ∗ ... ∗ ⎟⎟ ⎜⎜⎜b1 ⎟⎟⎟

⎜⎜⎜⎜ ⎟⎟⎟ ⎜⎜⎜ ⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟ ⎜⎜⎜ ⎟⎟⎟

⎜⎜⎜ 0 ⎟⎟⎟ ⎜⎜⎜ (2) ⎟⎟⎟

⎜⎜⎜ a(2) ∗ ... ∗ ⎟⎟⎟ ⎜⎜⎜b2 ⎟⎟⎟

⎜⎜⎜

22 ⎟⎟⎟ ⎜⎜⎜ ⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟ ⎜⎜⎜ ⎟⎟⎟

⎜⎜⎜ .. ⎟⎟⎟ ⎜⎜⎜ . ⎟⎟⎟

⎜⎜⎜ . ⎟⎟⎟ ⎜⎜⎜ .. ⎟⎟⎟

⎟⎟⎟ ⎜ ⎟

A x = ⎜⎜⎜⎜

(k) ⎜ ⎟⎟⎟ x = ⎜⎜⎜⎜ ⎟⎟⎟⎟ .

⎜⎜⎜ ⎟⎟⎟ ⎜⎜⎜ ⎟⎟⎟ (3.3)

⎜⎜⎜ 0 a(k) ... a(k) ⎟⎟⎟ ⎜⎜⎜b(k) ⎟⎟⎟

⎜⎜⎜

0 kk kn ⎟ ⎟⎟ ⎜⎜⎜⎜ k ⎟⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟ ⎜⎜⎜⎜ ⎟⎟⎟⎟

⎜⎜⎜⎜ . ⎟⎟⎟ ⎜⎜⎜ . ⎟⎟⎟

.. .. .. ⎟⎟⎟

⎜⎜⎜ .. . . . ⎟⎟⎟ ⎜⎜⎜⎜ .. ⎟⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟ ⎜⎜⎜ ⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟ ⎜⎜⎜ ⎟⎟⎟

⎜⎝ (k) ⎠ ⎜⎝ (k) ⎟⎠

0 0 a(k)

nk ... ann bn

(k)

If in each step akk is diﬀerent from zero, Gaussian elimination produces after

n − 1 steps an upper triangular matrix. The ﬁnal, resulting linear system can

then be solved using back substitution as described in Algorithm 2.

In the next sections, we will address the questions of when all the diagonal

entries a(k)

kk are diﬀerent from zero and what to do if this is not the case.

For theoretical (not numerical) reasons, it is useful to rewrite this process using

matrices. To this end, we need to rewrite the elementary operations we just used

into matrix form. The corresponding matrices are the following elementary

lower triangular matrices.

1 ≤ j ≤ k, meaning that m is of the form

m = (0, 0, . . . , 0, mk+1 , . . . , mn )T .

04

3.2 Gaussian Elimination 63

ciﬁc form

⎛ ⎞

⎜⎜⎜1 ⎟⎟⎟

⎜⎜⎜ .. ⎟⎟⎟

⎜⎜⎜ . ⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟

⎜⎜⎜ 1 ⎟⎟⎟

Lk (m) := I − mek = ⎜⎜⎜⎜

T ⎟⎟⎟ .

⎟⎟⎟

⎜⎜⎜ −mk+1 1 ⎟⎟⎟

⎜⎜⎜ .. ⎟⎟⎟

⎜⎜⎜ ..

⎜⎜⎝ . . ⎟⎟⎟

⎠

−mn 1

Obviously, an elementary lower triangular matrix is a lower triangular matrix.

It also is normalised in the sense that the diagonal entries are all 1. It is quite

easy to see that an elementary lower triangular matrix has the following prop-

erties.

Lemma 3.4 An elementary lower triangular matrix Lk (m) has the following

properties:

1. det Lk (m) = 1.

2. Lk (m)−1 = Lk (−m).

3. Multiplying a matrix A with Lk (m) from the left leaves the ﬁrst k rows un-

changed and, starting from row j = k + 1, subtracts the row m j (ak1 , . . . , akn )

from row j of A.

With this notation, we see that the ﬁrst step of Gaussian elimination can be

described using the elementary lower triangular matrix

L1 = L1 (m1 ) = I − m1 eT1

T

a21 an1

m1 = 0, ,..., .

a11 a11

The complete Gaussian elimination process can be described as follows.

and b(1) = b and then for 1 ≤ k ≤ n − 1 iteratively

⎛ ⎞T

⎜⎜⎜ a(k) a(k) ⎟⎟

nk ⎟

⎜

mk := ⎜⎜⎝0, . . . , 0, (k) , . . . , (k) ⎟⎟⎟⎠ ,

k+1,k

akk akk

Lk := I − mk eTk ,

(A(k+1) , b(k+1) ) := Lk A(k) , b(k) ,

kk 0. Assuming that the process does not ﬁnish prematurely,

04

64 Direct Methods for Solving Linear Systems

it stops with a linear system A(n) x = b(n) with an upper triangular matrix A(n) ,

which has the same solution as the original system Ax = b.

Proof Obviously, the systems have the same solution since, in each step, we

multiply both sides of the equation Ax = b by an invertible matrix.

We will show by induction that the matrices A(k) , 2 ≤ k ≤ n, satisfy

a(k)

i j = ei A e j = 0

T (k)

for 1 ≤ j < k, j < i ≤ n. (3.4)

This means particularly for A(n) that all sub-diagonal entries vanish.

For k = 2 this has been explained above. Hence, let us now assume that (3.4)

holds for k < n. If the indices i and j satisfy 1 ≤ j < k + 1 and j < i ≤ n then

we can conclude

a(k+1)

ij = eTi Lk A(k) e j = eTi (I − mk eTk )A(k) e j

= eTi A(k) e j − (eTi mk )(eTk A(k) e j ).

ij = 0, since both eTi A(k) e j and eTk A(k) e j

are zero by assumption. Finally, for j = k and i > j we have because of

eTi mk = ik

=

a(k)

kk

eTk A(k) ek

also

eTi A(k) ek

a(k+1)

ij = eTi A(k) ek − eTk A(k) ek = 0.

eTk A(k) ek

Note that this algorithm works in situ, meaning that the sequence of matrices

A(k) all overwrite the previous matrix and no additional memory is required.

Note also that Algorithm 3 does not only store the upper triangular matrix A(n)

but even stores the essential elements of the elementary lower triangular ma-

trices Lk , i.e. it stores the deﬁning vectors mk in that part of column k beneath

the diagonal.

Remark 3.6 Algorithm 3 for Gaussian elimination requires O(n3 ) time and

O(n2 ) space.

Proof The remark about the space is obvious. For the computational cost, we

analyse the three for loops in Algorithm 3. Considering that line 2 needs one

ﬂop, lines 4 and 5 require two ﬂops and line 7 requires again one ﬂop, we see

04

3.3 LU Factorisation 65

Input : A ∈ Rn×n , b ∈ Rn .

Output: A(n) upper triangular and m1 , . . . , mn−1 and b(n) ∈ Rn .

1 for k = 1 to n − 1 do

2 d := 1/akk

3 for i = k + 1 to n do

4 aik := aik · d

5 bi := bi − bk · aik

6 for j = k + 1 to n do

7 ai j := ai j − aik · ak j

⎛ ⎛ ⎞⎞

n−1 ⎜

⎜⎜⎜

n ⎜

⎜⎜⎜

n ⎟⎟⎟⎟⎟⎟

T (n) = ⎜

⎝⎜1 + ⎜

⎝⎜2 + 1⎟⎟⎠⎟⎟⎟⎠⎟

k=1 i=k+1 j=k+1

⎛ ⎞

n−1 ⎜

⎜⎜⎜

n ⎟⎟⎟

= (n − 1) + ⎜⎝ 2(n − k) + (n − k) ⎟⎟⎠

k=1 i=k+1

n−1

= (n − 1) + 2k + k2

k=1

(n − 1)n(2n − 1)

= (n − 1) + n(n − 1) +

6

1 3

= n + O(n2 ).

3

3.3 LU Factorisation

Naturally, Gaussian elimination is not realised by matrix multiplication, but by

programming the corresponding operations directly, as shown in Algorithm 3.

This leads to an O(n3 ) complexity. In addition, we have an O(n2 ) complexity

for solving the ﬁnal linear system by back substitution.

However, there is another way of looking at the process. Suppose we may

construct a sequence of n−1 elementary lower triangular matrices L j = L j (m j )

such that

Ln−1 Ln−2 · · · L2 L1 A = U

04

66 Direct Methods for Solving Linear Systems

A = L1 (−m1 )L2 (−m2 ) · · · Ln−2 (−mn−2 )Ln−1 (−mn−1 )U

= (I + m1 eT1 )(I + m2 eT2 ) · · · (I + mn−2 eTn−2 )(I + mn−1 eTn−1 )U

= (I + mn−1 eTn−1 + mn−2 eTn−2 + · · · + m2 eT2 + m1 eT1 )U

=: LU,

where L is lower triangular with unit diagonal elements

n−1

L=I+ mi eTi .

i=1

gular matrix if i j = 0 for i < j and ii = 1.

The LU factorisation of a matrix A is the decomposition A = LU into the

product of a normalised lower triangular matrix and an upper triangular matrix.

We will soon see that if an LU factorisation of a matrix A exists then it is

unique, which explains why we can speak of the LU factorisation.

Sometimes, the condition on L to be normalised is relaxed and A = LU with

a regular lower triangular matrix L and a regular upper triangular matrix U

is already called an LU factorisation. In this case, we will, however, lose the

uniqueness of the factorisation. We will also have uniqueness if we require U

to be normalised instead of L.

So far, we have assumed that Gaussian elimination does not break down, i.e.

that we can divide by a(k)

kk for each k. It is now time to look at examples of

matrices where this condition is guaranteed.

Theorem 3.8 Let A ∈ Rn×n . Let A p ∈ R p×p be the pth principal sub-matrix

of A, i.e.

⎛ ⎞

⎜⎜⎜a11 ... a1p ⎟⎟⎟

⎜⎜ . .. ⎟⎟⎟⎟

A p = ⎜⎜⎜⎜⎜ .. . ⎟⎟⎟⎟ .

⎜⎝ ⎠

a p1 ... a pp

If det(A p ) 0 for 1 ≤ p ≤ n then A has an LU factorisation. In particular,

every symmetric, positive deﬁnite matrix possesses an LU factorisation.

Proof We will prove this by induction. For k = 1 we have det A1 = a11 0,

hence the ﬁrst step in the Gaussian elimination procedure is possible. For k →

k + 1 assume that our matrix is of the form as in (3.3). Then, as elementary

matrix operations do not change the determinant of a matrix, we have

0 det Ak = det A(k) (2) (k)

k = a11 a22 · · · akk ,

04

3.3 LU Factorisation 67

kk 0. Hence, the next step is also possible and we can

proceed in this way.

A symmetric, positive deﬁnite matrix satisﬁes obviously det A p > 0 for all

1 ≤ p ≤ n.

n

|aik | < |aii |, 1 ≤ i ≤ n.

k=1

ki

ible and possesses an LU factorisation.

Proof Since |a11 | > nj=2 |a1 j | ≥ 0, the ﬁrst step of Gaussian elimination

is possible. To use once again an inductive argument, it obviously suﬃces to

show that the new matrix A(2) = (I − m1 eT1 )A is also strictly row diagonally

dominant. Since the ﬁrst row of A(2) coincides with that of A we only have to

look at the rows with index i = 2, . . . , n. For such an i, we have a(2)

i1 = 0 and

ai1

a(2)

i j = ai j − a1 j , 2 ≤ j ≤ n.

a11

This leads to

n

n

n

|ai1 |

n

|a(2)

ij | = |a(2)

ij | ≤ |ai j | + |a1 j |

j=1 j=2 j=2

|a11 | j=2

ji ji ji ji

|ai1 |

< |aii | − |ai1 | + {|a11 | − |a1i |}

|a11 |

|ai1 |

= |aii | − |a1i | ≤ |a(2)

ii |.

|a11 |

Thus, A(2) is indeed strictly row diagonally dominant and hence a(2)

22 0 so that

we can proceed in the Gaussian elimination procedure. Continuing like this,

we see that all elements a(k)kk are diﬀerent from zero, meaning that Gaussian

elimination goes through all the way. It also means that the upper triangular

matrix A(n) is non-singular and hence A too is non-singular.

A. However, this might not be the only way of doing this. Hence, the question

of whether the LU factorisation is unique comes up naturally. To answer this

question, we need a result on triangular matrices.

04

68 Direct Methods for Solving Linear Systems

Theorem 3.11 The set of non-singular (normalised) lower (or upper) trian-

gular matrices is a group with respect to matrix multiplication.

Proof If A and B are matrices of one of these sets, we have to show that also

A−1 and AB belong to that particular set. Suppose A is a non-singular upper

triangular matrix and ek is, as usual, the kth unit vector in Rn . Let x(k) be the

solution of Ax(k) = ek . Then, the columns of A−1 are given by these x(k) . From

(k)

Algorithm 2 we can inductively conclude that the components xk+1 , . . . , xn(k)

(k) −1

are zero and that xk = 1/akk . This shows that A is also an upper triangular

matrix. If A is normalised then this shows that A−1 is normalised, as well.

Next, assume that B is another non-singular upper triangular matrix. Since

ai j = bi j = 0 for i > j we can conclude that

j

(AB)i j = aik bk j , 1 ≤ i, j ≤ n,

k=i

which shows that AB is also an upper triangular matrix. If both A and B are

normalised the above formula immediately shows that AB is also normalised.

This ﬁnishes the proof for (normalised) upper triangular matrices. For (nor-

malised) lower triangular matrices the proof is essentially the same.

unique.

non-singular matrix A. Then, we have

L2−1 L1 = U2 U1−1 .

Furthermore, L2−1 L1 is, as the product of two normalised lower triangular ma-

trices, a normalised lower triangular matrix itself. However, U2 U1−1 is an upper

triangular matrix. Since we have equality between these matrices, this is possi-

ble only if they are the identity matrix, which shows L1 = L2 and U1 = U2 .

required for the LU factorisation. To be more precise, if we leave out step 5 in

Algorithm 3 then we have an algorithm for computing the LU factorisation in

situ, because we have at the end

⎧ ⎧

⎪

⎪

⎨ai j if i > j, ⎪

⎪

⎨ai j if i ≤ j,

i j = ⎪

⎪ ui j = ⎪

⎪

⎩δi j if i ≤ j, ⎩0 if i > j.

There are, however, other ways of computing the LU factorisation. One other

04

3.3 LU Factorisation 69

starts with the component-wise formulation of A = LU, i.e.

n

ai j = ik uk j .

k=1

Since U is upper triangular and L is lower triangular, the upper limit of the sum

is actually given by min(i, j). Taking the two cases separately gives

j

ai j = ik uk j , 1 ≤ j < i ≤ n,

k=1

i

ai j = ik uk j , 1 ≤ i ≤ j ≤ n.

k=1

Rearranging these equations, and using the fact that ii = 1 for all 1 ≤ i ≤ n,

we ﬁnd that

⎛ ⎞

1 ⎜⎜⎜⎜

j−1 ⎟⎟⎟

i j = ⎜⎝⎜ai j − ik uk j ⎟⎟⎠⎟ , 2 ≤ i ≤ n, 1 ≤ j ≤ i − 1, (3.5)

ujj k=1

i−1

ui j = ai j − ik uk j , 1 ≤ i ≤ n, i ≤ j ≤ n. (3.6)

k=1

elements of L in the ﬁrst column are 11 = 1 and i1 = ai1 /u11 , 2 ≤ i ≤ n.

The equations (3.5) and (3.6) can now be used for the calculation of the ele-

ments i j and ui j . For each value of i, starting with i = 2, we calculate ﬁrst i j

for 1 ≤ j ≤ i − 1 in increasing order, and then the values of ui j for i ≤ j ≤ n,

again in increasing order. We then move on to the same calculation for i + 1

and so on until i = n. We refrain from giving the easily derived algorithm, but

state the obvious result on the computational complexity.

Remark 3.13 The computational cost of computing the LU factorisation of

a matrix A ∈ Rn×n is O(n3 ).

We want to address a few applications of the LU factorisation. The ﬁrst one is

rather obvious.

Remark 3.14 If A = LU then the linear system b = Ax = LUx can be solved

in two steps. First, we solve Ly = b by forward substitution and then Ux = y

by back substitution. Both are possible in O(n2 ) time.

This means in particular that we could compute the inverse of a matrix A with

LU factorisation by solving the n linear systems Ax( j) = e j , 1 ≤ j ≤ n, where

04

70 Direct Methods for Solving Linear Systems

e j is the jth unit vector in Rn . Solving these n systems would cost O(n3 ) time

and hence the computation of A−1 would cost O(n3 ) time.

Remark 3.15 A numerically reasonable way of calculating the determinant

of a matrix A is to ﬁrst compute the LU factorisation and then use the fact that

det(A) = det(LU) = det(L) det(U) = det(U) = u11 u22 · · · unn .

In the case of a tridiagonal matrix, there is an eﬃcient way of constructing the

LU factorisation, at least under certain additional assumptions. Let the tridiag-

onal matrix be given by

⎛ ⎞

⎜⎜⎜a1 c1 0 ⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟

⎜⎜⎜b2 a2 c2 ⎟⎟⎟

⎜⎜⎜ .. .. .. ⎟⎟⎟

A = ⎜⎜⎜⎜ ⎜ . . . ⎟⎟⎟ .

⎜⎜⎜ ⎟⎟⎟⎟ (3.7)

⎜⎜⎜ . .

. . . . c ⎟⎟⎟ ⎟

⎜⎜⎝ n−1 ⎟ ⎟⎠

0 bn an

Then, we have the following theorem.

Theorem 3.16 Assume A is a tridiagonal matrix of the form (3.7) with

|a1 | > |c1 | > 0,

|ai | ≥ |bi | + |ci |, bi , ci 0, 2 ≤ i ≤ n − 1,

|an | ≥ |bn | > 0.

Then, A is invertible and has an LU factorisation of the form

⎛ ⎞⎛ ⎞

⎜⎜⎜ 1 0⎟⎟⎟ ⎜⎜⎜⎜u1 c1 0 ⎟⎟

⎜⎜⎜ ⎟⎟⎟ ⎜⎜⎜ ⎟⎟⎟

⎜⎜⎜2 1 ⎟⎟⎟ ⎜⎜⎜ .. ⎟⎟⎟

u . ⎟⎟⎟

A = ⎜⎜⎜ ⎜ . . ⎟

⎟⎟⎟ ⎜⎜⎜ ⎜ 2

⎟⎟ .

⎜⎜⎜ . . . . ⎟⎟⎟ ⎜⎜⎜ . . . c ⎟⎟⎟⎟

⎜⎝ ⎟⎠ ⎜⎜⎝ n−1 ⎟ ⎟⎠

0 n 1 0 un

i = bi /ui−1 and ui = ai − i ci−1 for 2 ≤ i ≤ n.

Proof We show inductively for 1 ≤ i ≤ n − 1 that ui 0 and |ci /ui | < 1. From

this, it follows that both and u are well deﬁned. By assumption, we have for

i = 1 obviously |u1 | = |a1 | > |c1 | > 0. The induction step then follows from

|ci |

|ui+1 | ≥ |ai+1 | − |bi+1 | > |ai+1 | − |bi+1 | ≥ |ci+1 |.

|ui |

This calculation also shows that un 0, proving that A is invertible, provided

the method really gives the LU factorisation of A. To see the latter, we ﬁrst

04

3.4 Pivoting 71

notice that for the given L and U their product LU must be a tridiagonal matrix.

Simple calculations then show the statement. For example, we have (LU)i,i =

i ci−1 + ui = ai . The other two cases (LU)i,i+1 and (LU)i+1,i follow just as

easily.

Note that for such a system not only the calculation of the LU factorisation

simpliﬁes but also the solution of the corresponding systems by forward and

back substitution. This means that in this case it is possible to solve the linear

system Ax = b in O(n) time.

Example 3.17 For our second example in the introduction, Section 1.1.2, we

see that the matrix from (1.2) satisﬁes the assumption of Theorem 3.16. Here,

the situation is even simpler since the entries in each band are constant, i.e.

if we ignore the 1/h2 factor then we have ai = 2 and bi = ci = −1. Thus, the

computation reduces to u1 = a1 = 2 and then i = bi /ui−1 = −1/ui−1 , 2 ≤ i ≤ n.

Inserting this into the computation of ui yields

1

ui = ai − i ci−1 = 2 + i = 2 − , 2 ≤ i ≤ n,

ui−1

T

which resolve to u = 2, 23 , 34 , 54 , . . . , n+1

n , or ui = i+1

i , which also determines

i = −1/ui−1 = −(i − 1)/i.

3.4 Pivoting

We now want to address the situation that the Gaussian elimination process

breaks down prematurely. This means that there is an index k with a(k)

kk = 0.

For example, let us consider the simple matrix

0 1

A= .

1 1

κ2 (A) = (3 + 5)/2, in the 2-norm. But Gaussian elimination fails already

at its ﬁrst step, since a11 = 0. Note, however, that a simple exchanging of the

rows gives an upper triangular matrix. Furthermore, a simple exchanging of

the columns leads to a lower triangular matrix. The ﬁrst action corresponds to

rearranging the equations in the linear system, the latter corresponds to rear-

ranging the unknowns.

Hence, rearranging either the rows or columns of the matrix (or both) seems to

resolve the problem of a premature breakdown.

04

72 Direct Methods for Solving Linear Systems

−20

10 1

A= .

1 1

−20

1 0 10 1

L= , U = .

1020 1 0 1 − 1020

10−16 . The number 1 − 1020 will not be represented exactly, but by its near-

est ﬂoating point number, let us say that this is the number −1020 . Using this

number produces the factorisation

−20

1 0 10 1

L̃ = , Ũ = .

1020 1 0 −1020

The matrix Ũ is relatively close to the correct U, but on calculating the product

we obtain

−20

10 1

L̃Ũ = .

1 0

We see that this matrix is not close to A, since a 1 has been replaced by a

0. Suppose we wish to solve Ax = b with b = (1, 0)T using ﬂoating point

arithmetic we would obtain x̃ = (0, 1)T while the true answer is approximately

(−1, 1)T . The explanation for this is that LU factorisation is not backward-

stable.

A proof of the following result can again be found in Higham [85] or Trefethen

and Bau [124]. It shows that Gaussian elimination is backward-stable if the

norms of the triangular matrices L and U can be bounded. Unfortunately, as

we have just seen, this may not be the case.

nation computes matrices L̃, Ũ satisfying L̃Ũ = A + ΔA, where ΔA satisﬁes

But the important conclusion here is that exchanging rows or columns or both

is also necessary if the element a(k)

kk is close to zero to stabilise the elimination

process.

The exchange of rows or columns during the Gaussian elimination process is

referred to as pivoting. Assume we have performed k − 1 steps and now have a

04

3.4 Pivoting 73

⎛ ⎞

⎜⎜⎜a11 ∗ ∗ ... ∗ ⎟⎟

⎜⎜⎜ ⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟

⎜⎜⎜ 0 ⎟

⎜⎜⎜ a(2) ∗ . . . ∗ ⎟⎟⎟⎟⎟

⎜⎜⎜

22 ⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟

⎜⎜⎜ .. ⎟⎟⎟

⎜⎜⎜ . ⎟⎟⎟

⎜⎜ ⎟⎟⎟

A(k) = ⎜⎜⎜⎜ ⎟⎟⎟ .

⎟⎟ (3.8)

⎜⎜⎜⎜ (k) ⎟

⎜⎜⎜ 0 0 a(k) . . . akn ⎟⎟⎟⎟

⎜⎜⎜ kk ⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟

⎜⎜⎜ .

⎜⎜⎜ .. .. .. .. ⎟⎟⎟⎟⎟

⎜⎜⎜ . . . ⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟

⎜⎝ ⎟⎟⎟

⎠

0 0 a(k)

nk . . . a(k)

nn

(k) (k)

If det A = det A(k) 0, not all the entries in the column vector (akk , . . . , ank )T

can be zero. Hence, we can pick one entry and swap the corresponding row

with the kth row. This is usually referred to as partial (row) pivoting. For nu-

merical reasons it is sensible to pick a row index ∈ {k, k + 1, . . . , n} with

|ak | = max |a(k)

ik |.

k≤i≤n

In a similar way, partial column pivoting can be deﬁned. Finally, we could use

total pivoting, i.e. we could pick indices , m ∈ {k, k + 1, . . . , n} such that

|am | = max |a(k)

i j |.

k≤i, j≤n

way the programming language used stores matrices. If a matrix is stored row

by row then swapping two rows can be achieved by simply swapping two row

pointers. An exchange of two columns could only be realised by really ex-

changing all entries in these two columns which is much more expensive.

To understand the process of pivoting in an abstract form, we once again use

matrices to describe the operations. Once again, this is only for theoretical

purposes and not for a numerical implementation. The required matrices to

describe pivoting are permutation matrices.

Deﬁnition 3.19 A permutation of the set {1, . . . , n} is a bijective map p :

{1, . . . , n} → {1, . . . , n}. A matrix P ∈ Rn×n is called a permutation matrix if

there is a permutation p such that

Pe j = e p( j) , 1 ≤ j ≤ n,

where e j denotes once again the jth unit vector.

04

74 Direct Methods for Solving Linear Systems

A permutation matrix is simply a matrix that results from the identity matrix by

exchanging its rows (or columns). This means particularly that a permutation

matrix is invertible.

Theorem 3.20 Every permutation matrix P ∈ Rn×n is orthogonal, i.e. P−1 =

PT . Let P = P p ∈ Rn×n be a permutation matrix with permutation p. If p−1

is the permutation inverse to p, i.e. if p−1 (p( j)) = j for 1 ≤ j ≤ n, then

P p−1 = P−1

p .

Proof We start with the second statement. This is true since we have

P p−1 P p e j = P p−1 e p( j) = e p−1 (p( j)) = e j ,

for 1 ≤ j ≤ n. The ﬁrst statement follows, for 1 ≤ j, k ≤ n, from

eTp−1 ( j) PTp ek = eTk P p e p−1 ( j) = eTk e j = δ jk = eTp−1 ( j) e p−1 (k) = eTp−1 ( j) P−1

p ek .

For partial pivoting we need to exchange exactly two rows in each step. This

is realised using elementary permutation matrices.

Deﬁnition 3.21 A permutation matrix Pi j ∈ Rn×n is called an elementary

permutation matrix if it is of the form

Pi j = I − (ei − e j )(ei − e j )T .

This means simply that the matrix Pi j is an identity matrix with rows (columns)

i and j exchanged.

Remark 3.22 An elementary permutation matrix has the properties

P−1

i j = Pi j = P ji = Pi j

T

exchanges rows i and j of A. Similarly post-multiplication exchanges columns

i and j of A.

Theorem 3.23 Let A be an n × n matrix. There exist elementary lower trian-

gular matrices Li := Li (mi ) and elementary permutation matrices Pi := Pri i

with ri ≥ i, i = 1, 2, . . . , n − 1, such that

U := Ln−1 Pn−1 Ln−2 Pn−2 · · · L2 P2 L1 P1 A (3.9)

is upper triangular.

Proof This follows from the previous considerations. Note that, if the ma-

trix is singular, we might come into a situation (3.8), where the column vector

(a(k) (k) T

kk , . . . , ank ) is the zero vector. In such a situation we pick the correspond-

ing elementary lower triangular and permutation matrices as the identity and

proceed with the next column.

04

3.4 Pivoting 75

The permutation matrices in (3.9) can be “moved” to the right in the following

sense. If P is a symmetric permutation matrix, which acts only on rows (or

columns) with index larger than j then we have

an elementary lower triangular matrix. This gives in particular

= Ln−1 L̃n−2 Pn−1 Pn−2 Ln−3 · · · L2 P2 L1 P1 A.

In the same way, we can move Pn−1 Pn−2 to the right of Ln−3 . Continuing with

this procedure leads to

tion matrix P such that PA possesses an LU factorisation PA = LU.

As mentioned above, permutation matrices are only used to describe the theory

mathematically. When it comes to the implementation, permutations are done

directly. For partial pivoting, this is described in Algorithm 4, which is a simple

modiﬁcation of the original Algorithm 3.

Input : A ∈ Rn×n , b ∈ Rn .

Output: PA = LU.

1 for k = 1, 2, . . . , n − 1 do

2 Find the pivot element akr for row k

3 Exchange row k with row r

4 d := 1/akk

5 for i = k + 1, . . . , n do

6 aik := aik d

7 for j = k + 1, . . . , n do

8 ai j := ai j − aik ak j

04

76 Direct Methods for Solving Linear Systems

Let us assume that A has an LU factorisation A = LU and that A is symmetric,

i.e. AT = A. Then we have

LU = A = AT = U T LT .

we would like to use the uniqueness of the LU factorisation to conclude that

L = U T and U = LT . Unfortunately, this is not possible since the uniqueness

requires the lower triangular matrix to be normalised, i.e. to have only ones as

diagonal entries, and this may not be the case for U T .

But if A is invertible, then we can write

U = DŨ

lar matrix Ũ. Then, we can conclude that A = LDŨ and hence AT = Ũ T DLT =

LDŨ, so that we can now apply the uniqueness result to derive L = Ũ T and

U = DLT , which gives A = LDLT .

Let us ﬁnally make the assumption that all diagonal entries of D are positive,

√ √

so that we can deﬁne its square root by setting D1/2 = diag( d11 , . . . , dnn ),

which leads to the decomposition

lower triangular matrix L is called a Cholesky factorisation of A.

Cholesky factorisation.

upper principal sub-matrix A p is positive. From Theorem 3.8 we can therefore

conclude that A = LU is possible. Since a positive deﬁnite matrix is also invert-

ible, our previous derivation shows that A = LDLT . Finally, A being positive

deﬁnite also means that

dii = eTi Dei = eTi (L−1 AL−T )ei = (L−T ei )T A(L−T ei ) > 0,

04

3.5 Cholesky Factorisation 77

wise taking into account that L is lower triangular. For i ≥ j we have

j

j−1

ai j = ik jk = ik jk + i j j j .

k=1 k=1

⎛ ⎞1/2

⎜⎜⎜

j−1 ⎟⎟⎟

⎜

j j = ⎜⎜⎝a j j − jk ⎟⎟⎟⎠ .

2

k=1

With this, we can successively compute the other coeﬃcients. For i > j we

have

⎛ ⎞

1 ⎜⎜⎜⎜

j−1 ⎟⎟⎟

i j = ⎜⎜⎝ai j − ik jk ⎟⎟⎟⎠ .

jj k=1

With these equations it is now easy to successively compute the matrix L either

row-wise or column-wise. A column-oriented version of the Cholesky factori-

sation is given in Algorithm 5. Obviously, in a ﬁnal implementation it would be

appropriate to check a j j for non-negativity before taking the square root in step

4. This is also a means to check whether the given matrix A is “numerically”

positive deﬁnite.

torisation is given by n3 /6 + O(n2 ), which is about half the complexity of the

standard Gaussian elimination process. However, here, we also have to take n

roots.

Proof Leaving the n roots from line 4 of Algorithm 5 aside and counting one

ﬂop for each of the lines 3, 7 and 8 yields

⎛ j−1 ⎛ ⎞⎞

n ⎜

⎜⎜⎜ n ⎜

⎜⎜⎜

j−1 ⎟⎟

⎟⎟⎟⎟

T (n) = ⎜⎜⎝ 1 + ⎜⎜⎝1 + 1⎟⎟⎟⎠⎟⎟⎟⎠ .

j=1 k=1 i= j+1 k=1

After some simple calculations, similar to the ones we have done before, we

see that this leads indeed to T (n) = n3 /6 + O(n2 ).

lesky factorisation is backward-stable, see again [85, 124], as long as the norm

of the matrix is not too large. The computed factor L̃ satisﬁes L̃T L̃ = A + ΔA

where ΔA can be bounded by ΔA = O(A eps).

04

78 Direct Methods for Solving Linear Systems

Input : A ∈ Rn×n symmetric, positive deﬁnite.

Output: A = LLT Cholesky factorisation.

1 for j = 1 to n do

2 for k = 1 to j − 1 do

3 a j j := a j j − a jk · a jk

√

4 a j j := a j j

5 for i = j + 1 to n do

6 for k = 1 to j − 1 do

7 ai j := ai j − aik · a jk

8 ai j := ai j /a j j

substitution is also backward-stable. Hence, the solution process for solving

a linear system Ax = b with A being symmetric and positive deﬁnite using

the Cholesky factorisation will produce a reasonable solution as long as the

condition number of A is not too large.

3.6 QR Factorisation

We have seen that we can derive an LU factorisation for every matrix A if

we employ pivoting if necessary. The LU factorisation can be described by

multiplying the original matrix A with elementary lower triangular matrices

from the left. Unfortunately, this might change the condition number κ(A) =

A A−1 . Though we can only say that

we have to expect κ(LA) > κ(A) because κ(L) ≥ 1. Hence, it would be better

to use transformations which do not change the condition number. Candidates

for this are orthogonal matrices, i.e. matrices Q ∈ Rn×n with QT Q = I.

x2 for all x ∈ Rn and QA2 = A2 .

Proof We have

04

3.6 QR Factorisation 79

Similarly,

QA2 = sup QAx2 = sup Ax2 = A2 .

x2 =1 x2 =1

The goal of this section is to prove that every matrix A ∈ Rm×n , m ≥ n, has a

factorisation of the form

A = QR,

where Q ∈ Rm×m is orthogonal and R ∈ Rm×n is an upper triangular matrix,

which, because of the non-quadratic form of R, takes the form

⎛ ⎞

⎜⎜⎜∗ · · · ∗⎟⎟⎟

⎜⎜⎜⎜ ∗ · · · ∗⎟⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟

⎜⎜⎜ .. ⎟⎟⎟

⎜⎜⎜ . ⎟⎟⎟

R = ⎜⎜ ⎟⎟⎟ .

⎜⎜⎜ ∗ ⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟

⎜⎝ ⎠

geometrically reﬂections. It would, however, also be possible to use rotations,

which we will encounter later on.

Deﬁnition 3.30 A Householder matrix H = H(w) ∈ Rm×m is a matrix of the

form

H(w) = I − 2wwT ∈ Rm×m ,

where w ∈ Rm satisﬁes either w2 = 1 or w = 0.

A more general form of a Householder matrix is given by

wwT

H(w) = I − 2 ,

wT w

for an arbitrary vector w 0. Here, we ﬁnd both the inner product wT w ∈ R

and the outer product wwT ∈ Rm×m . Householder matrices have quite a few

elementary but important properties, which we want to collect now.

Lemma 3.31 Let H = H(w) ∈ Rm×m be a Householder matrix.

1. H is symmetric.

2. HH = I so that H is orthogonal.

3. det(H(w)) = −1 if w 0.

4. Storing H(w) only requires storing the m elements of w.

5. The computation of the product Ha of H = H(w) with a vector a ∈ Rm

requires only O(m) time.

04

80 Direct Methods for Solving Linear Systems

Proof The ﬁrst property is clear. For the second note that wT w = 1 leads to

For the third property, we note that H has the eigenvalues λ = ±1. To be more

precise, we have Hw = −w so that w is the corresponding eigenvector to the

eigenvalue λ = −1. Every vector orthogonal to w is obviously an eigenvector

to the eigenvalue λ = 1. Hence, the eigenvalue λ = 1 has multiplicity (m − 1)

and the eigenvalue λ = −1 has multiplicity 1. Since the determinant is the

product of the eigenvalues, the determinant of H is −1.

The fourth property is clear again. For the last property we ﬁrst compute the

number λ = aT w in O(m) time, then we multiply this number by each compo-

nent of w to form the vector λw and ﬁnally we subtract twice this vector from

the vector a. All of this can be done in O(m) time.

sents a reﬂection. To be more precise, suppose we want to ﬁnd a reﬂection, i.e.

an orthogonal matrix H which maps the vector a ∈ R2 onto the vector αe1 with

α to be speciﬁed. First of all, because of

we see that α = ±a2 . Next, if S w denotes the bisector of the angle between a

and e1 and w its orthogonal vector, then we can express the transformation via

or w2 = 1, such that H(w)a = a2 e1 .

w = 0 in these cases. Otherwise, let us deﬁne v = a − a2 e1 and w = v/v2 .

Then,

vT v (a − a2 e1 )T (a − a2 e1 ) aT a − 2a2 aT e1 + a22

1 = wT w = = =

v22 v22 v22

2aT (a − a2 e1 ) 2aT v 2aT w

= = = ,

v22 v22 v2

v

H(w)a = a − 2(aT w)w = a − v2 = a − v = a2 e1 .

v2

04

3.6 QR Factorisation 81

a w

T

Sw

a w

w

T

a2 e1

Note that we could also project onto −e1 by using v = a + a2 e1 . In general,

one should pick the sign in such a way that numerical cancellation is avoided,

i.e. one should use v = a + sign(a1 )a2 e1 .

orthogonal matrix Q ∈ Rm×m and an upper triangular matrix R ∈ Rm×n such

that A = QR.

H1 ∈ Rm×m such that

⎛ ⎞

⎜⎜⎜α1 ∗ . . . ∗⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟

⎜⎜⎜ 0 ⎟⎟⎟

H1 A = ⎜⎜⎜ . ⎜ ⎟⎟⎟

⎜⎜⎜ .. ⎟⎟⎟

⎜⎝ A1 ⎟⎟⎠

0

this matrix, we can ﬁnd a second Householder matrix H̃2 ∈ R(m−1)×(m−1) such

that H̃2 ã2 = α2 e1 ∈ Rm−1 . The matrix

⎛ ⎞

⎜⎜⎜1 0 . . . 0⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟

⎜⎜⎜0 ⎟⎟⎟

H2 = ⎜⎜⎜⎜ . ⎟⎟⎟⎟

⎜⎜⎜ .. H̃ ⎟⎟⎟

⎜⎝ 2 ⎟⎠

0

04

82 Direct Methods for Solving Linear Systems

⎛ ⎞

⎜⎜⎜α1 ∗ . . . ∗⎟⎟

⎟⎟

⎜⎜⎜⎜ 0 α2 ∗ . . . ∗⎟⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟

⎜

H2 H1 A = ⎜⎜⎜⎜ 0 0 ⎟⎟⎟ ,

⎟⎟⎟

⎜⎜⎜⎜ .. .. ⎟⎟⎟

⎜⎜⎜ . . A2 ⎟⎟⎠

⎝

0 0

and we can proceed in this form to establish

Hn−1 Hn−2 · · · H2 H1 A = R,

where R is an upper triangular matrix. Since all H j are orthogonal so is their

product and also their inverse.

Note that the proof is actually constructive and describes a numerical method

for deriving the QR factorisation of a matrix A. Moreover, if we take Lemma

3.31 into account we can already determine the cost.

Remark 3.34 It takes O(mn2 ) time to compute the QR factorisation of an

m × n matrix with m ≥ n.

Proof In the ﬁrst step we have to apply the matrix H1 to the n columns of the

matrix A. Each of these applications can be done in O(m) time such that the

ﬁrst step of the procedure requires O(mn) time. Since we have to apply n − 1

of these steps, we see that the cost is bounded by O(mn2 ).

It is not entirely possible to program the QR factorisation in situ, since the

space of the matrix A is not enough to store all relevant information. Though

we can store the R part in the upper triangular part of A we also need to store

the relevant information on the transformations H j = I −c j v j vTj , i.e. the vectors

v j = (v j j , . . . , vm j )T , which form a lower triangular matrix, and the coeﬃcients

c j . We can store the vectors v j in the lower triangular part of A except for v j j ,

which we can store in an additional vector v. For the square case m = n, this is

given in detail in Algorithm 6.

to use pivoting to continue the process all the way down. In contrast to this, a

QR factorisation can always be computed without pivoting even if the matrix

A is singular. We actually can compute it, even if the matrix is not square.

Nonetheless, pivoting can improve the numerical stability.

As in the case of the LU factorisation (Theorem 3.12) the factors in the QR

factorisation are unique. However, we have to make some additional assump-

tions.

04

3.6 QR Factorisation 83

Algorithm 6: QR factorisation

Input : A ∈ Rn×n .

Output: QR factorisation of A.

1 for j = 1 to n − 1 do

2 β := 0

3 for k = j to n do β := β + ak j · ak j

√

4 α := β

5 if α = 0 then v j := 0

6 else

7 c j := 1/(β + α · |a j j |)

8 if a j j < 0 then α := −α

9 v j := a j j + α

10 a j j := −α

11 for k = j + 1 to n do

12 sum := v j · a jk

13 for i = j + 1 to n do sum := sum + ai j · aik

14 sum := sum · c j

15 a jk := a jk − v j · sum

16 for i = j + 1 to n do aik := aik − ai j · sum

into the product of an orthogonal matrix Q ∈ Rn×n and an upper triangular

matrix R ∈ Rn×n is unique, if the signs of the diagonal elements of R are pre-

scribed.

thogonal matrices Q1 , Q2 and upper triangular matrices R1 , R2 . Then, we have

S := Q−1 −1

2 Q1 = R2 R1 . This is an identity between an orthogonal matrix on the

one hand and an upper triangular matrix on the other. Hence, S is an orthogo-

nal upper triangular matrix and since S −1 = S T it must be a diagonal matrix.

But a diagonal orthogonal matrix can only have ±1 as diagonal entries. Thus, if

the diagonal entries of R1 and R2 have the same sign, we must have S = I.

tors Q̃ and R̃ satisfy Q̃R̃ = A + ΔA with ΔA satisfying

04

84 Direct Methods for Solving Linear Systems

We want to use Householder transformations to derive yet another factorisa-

tion. Though its derivation is algorithmic again, this factorisation will only be

of interest to us from a theoretical point of view. However, here we need to

discuss complex matrices.

T

Recall that a matrix U ∈ Cn×n is called unitary if it satisﬁes U U = I, where the

overbar indicates the complex conjugate. A unitary matrix is thus the complex

counterpart to an orthogonal matrix.

Moreover, we can introduce complex Householder matrices H ∈ Cm×m simply

by deﬁning H = H(w) = I − 2wwT where w ∈ Cm satisﬁes either w22 =

wT w = 1 or w = 0. It is easy to see that a complex Householder matrix is

T

unitary, that it satisﬁes H = H, and that the results of the previous subsection

remain true. In particular, we have the complex counterpart to Theorem 3.32.

or w2 = 1 such that H(w)a = a2 e1 .

We may now use the above Lemma to obtain the following Theorem.

T

matrix, U ∈ Cn×n , such that R := U AU is an upper triangular matrix.

Proof We will prove this by induction on n. Clearly the result holds for n = 1.

Hence, let us now assume that the result holds for all (n − 1) × (n − 1) matrices.

Since the characteristic polynomial p(λ) = det(A − λI) factorises over C, we

can ﬁnd an eigenvalue λ and a corresponding eigenvector x ∈ Cn \ {0}, i.e. we

have Ax = λx.

By the previous lemma there exists a Householder matrix such that

T

H(w1 )x = c1 e1 or x = c1 H(w1 ) e1 = c1 H(w1 )e1

with

|c1 | = x2 0.

x λ

H(w1 )AH(w1 )e1 = H(w1 )A = H(w1 )x = λe1 .

c1 c1

We may write this in the form

λ aT

H(w1 )AH(w1 ) =

0 Ã

04

3.7 Schur Factorisation 85

T

esis there is a unitary matrix V ∈ C(n−1)×(n−1) such that V ÃV = T is an upper

triangular matrix. If we now deﬁne an n × n matrix by

⎛ ⎞

1 0T ⎜⎜⎜1 0T ⎟⎟⎟

U = H(w1 ) ,

T

U = ⎜⎝ ⎜ T⎟⎟⎠ H(w1 )

0 V 0 V

T

then we immediately see that UU = I, i.e. U is unitary. Moreover, we have

⎛ ⎞ ⎛ ⎞

T ⎜⎜⎜1 0T ⎟⎟⎟ 1 0T ⎜⎜⎜1 0T ⎟⎟⎟ λ aT 1 0T

U AU = ⎜⎝ ⎜ ⎟

T ⎟ H(w1 )AH(w1 ) = ⎜⎝ ⎜ T⎟ ⎟

0 V ⎠ 0 V 0 V ⎠ 0 Ã 0 V

⎛ ⎞ ⎛ ⎞

⎜⎜⎜1 0T ⎟⎟⎟ λ aT V ⎜⎜⎜λ aT V ⎟⎟⎟

= ⎜⎝ ⎜ T⎟ ⎟

⎠ 0 ÃV = ⎜⎝ ⎜ T ⎟⎟⎠ .

0 V 0 V ÃV

T

Hence, U AU is also an upper triangular matrix, which completes the induc-

tion and the proof.

In the situation of a real matrix with complex eigenvalues, we cannot achieve

this result using only orthogonal transformations. However, each complex ei-

genvalue λ of a real matrix comes in pairs, i.e. also the complex conjugate of

λ is an eigenvalue. Indeed, if Ax = λx with x 0 then λx = Ax = Ax. This

allows us to derive the following analogue to Theorem 3.38.

Theorem 3.39 (Real Schur factorisation) To A ∈ Rn×n there is an orthogonal

matrix Q ∈ Rn×n such that

⎛ ⎞

⎜⎜⎜R11 R12 · · · R1m ⎟⎟⎟

⎜⎜⎜⎜ 0 R ⎟

· · · R2m ⎟⎟⎟⎟

⎜⎜⎜ 22

⎟

Q AQ = ⎜⎜⎜ . .. ⎟⎟⎟⎟ ,

T

.. ..

⎜⎜⎜ .. . . . ⎟⎟⎟

⎜⎝ ⎟⎠

0 0 · · · Rmm

where the diagonal blocks Rii are either 1 × 1 or 2 × 2 matrices. A 1 × 1 block

corresponds to a real eigenvalue, a 2×2 block corresponds to a pair of complex

conjugate eigenvalues.

If A has only real eigenvalues then it is orthogonal similar to an upper trian-

gular matrix.

Proof Let k be the number of pairs of complex conjugate eigenvalues. We

will prove the result by induction on k. If k = 0 then the statement is obviously

correct since we can proceed as in the proof of Theorem 3.38 with real House-

holder matrices. Hence, let us assume that k ≥ 1 and that the result is correct

for k−1. Let λ = α+iβ with β 0 be a complex eigenvalue with corresponding

eigenvector z = x + iy with x, y ∈ Rn . Then, y 0 since the vector Az would

04

86 Direct Methods for Solving Linear Systems

otherwise be a real vector equalling the complex vector λx. Moreover, with λ

also λ is an eigenvalue of A with corresponding eigenvector z.

Next, we have A(x + iy) = (α + iβ)(x + iy) and hence

Ax = αx − βy, Ay = βx + αy,

α β

A(x, y) = (x, y) .

−β α

c ∈ R \ {0} would mean that z = (c + i)y and z = (c − i)y are linearly dependent,

which contradicts the fact that they are eigenvectors to two distinct eigenvalues

λ and λ, respectively. Thus, we can choose an orthonormal basis {x1 , x2 } of

span{x, y} and describe the coordinate transformation by a non-singular matrix

C ∈ R2×2 , i.e. (x1 , x2 ) = (x, y)C. With this, we can conclude that

α β

A(x1 , x2 ) = A(x, y)C = (x, y) C

−β α

α β

= (x1 , x2 )C −1 C =: (x1 , x2 )S .

−β α

Thus, the newly deﬁned matrix S has also the eigenvalues λ = α + iβ and

λ = α − iβ. If n = 2 we are done. Otherwise, we now extend the vectors {x1 , x2 }

to a full orthonormal basis {x1 , . . . , xn } of Rn and deﬁne the orthogonal matrix

(x1 , . . . , xn ) =: (x1 , x2 , W) with W = (x3 , . . . , xn ) ∈ Rn×(n−2) to see that

⎛ T⎞

⎜⎜⎜ x1 ⎟⎟⎟

⎜ ⎟ (x1 , x2 )T AW

(x1 , x2 , W)T A(x1 , x2 , W) = ⎜⎜⎜⎜⎜ xT2 ⎟⎟⎟⎟⎟ ((x1 , x2 )S , AW) =

S

.

⎝ T⎠ 0 W T AW

W

The matrix W T AW has less than k complex conjugate eigenvalue pairs. Hence,

by the induction assumption, there is an orthogonal matrix Q̃ ∈ R(n−2)×(n−2)

so that the matrix Q̃T (W T AW)Q̃ has the desired form. Finally, the orthogonal

matrix

I 0

Q := (x1 , . . . , xn ) ,

0 Q̃

with I being the 2 × 2 identity matrix, is the required orthogonal matrix, which

transforms A to the described quasi-triangular form.

04

3.8 Solving Least-Squares Problems 87

We now want to deal with the problem of solving an over-determined system

of the form

Ax = b, (3.10)

where A ∈ Rm×n and b ∈ Rm are given and m > n. The latter means that we have

more equations than unknowns, which makes the system over-determined.

Hence, we cannot expect a solution to (3.10) and may instead try to change

the problem to solving the least-squares problem

x∈R

For the time being, we will assume that the rank of the matrix rank(A) is

rank(A) = n, i.e. the matrix has full rank. Then, to solve (3.11) we can look for

critical points of the function

Since rank(A) = n, we also have rank(AT A) = n (see (1.6)) which means that

the matrix AT A ∈ Rn×n is invertible. Thus we have in principle the following

result.

(3.11) with a matrix A ∈ Rm×n with m ≥ n = rank(A) is the unique solution of

the Normal Equations

(AT A)x = AT b. (3.12)

Proof The solution to the normal equations is indeed a critical point. Fur-

thermore, since the Hessian of the function f is given by the positive deﬁnite

matrix AT A, we have a local minimum, which is actually a global one, as it is

the only critical point.

The normal equations are not a good way of computing the solution to the

least-squares problem since the involved matrix AT A is usually badly condi-

tioned.

Nonetheless, we notice that we can use this result to better understand the

pseudo-inverse. By looking at (3.12) and (1.10) We can immediately read oﬀ

the following result.

04

88 Direct Methods for Solving Linear Systems

Corollary 3.41 Let A ∈ Rm×n with rank(A) = n be given. Then, for a given

b ∈ Rn , the solution of the least-squares problem (3.11) is given by x = A+ b.

We are now going to investigate two better ways of computing the solution to

the least-squares problem. The ﬁrst way is based on the QR factorisation of the

matrix A and the second way is based upon the singular value decomposition

A = UΣV T from Theorem 1.18. In both cases, orthogonal matrices play an

important role.

Let us start by using the QR factorisation from Theorem 3.33. Hence, we can

represent our matrix A ∈ Rm×n as A = QR with an orthogonal matrix Q ∈

Rm×m and an upper triangular matrix R ∈ Rm×n . Using the fact that orthogonal

transformations leave the Euclidean norm invariant, we see that we can now

produce a solution by noting that

R c

Ax − b22 = QRx − b22 = Rx − c22 = 1 x − 1

O c2 2

= R1 x − c1 22 + c2 22 ,

where we have set c = QT b ∈ Rm and where R1 ∈ Rn×n denotes the upper part

of the upper triangular matrix R.

The solution to the least-squares problem is therefore given by the solution of

R1 x = c1 ,

quite easily using back substitution, as we have seen at the beginning of this

chapter.

the solution of the least-squares problem (3.11) can be computed as follows:

2. Let R1 ∈ Rn×n be the upper square part of the upper triangular matrix

R ∈ Rm×n , c = QT b ∈ Rm and c1 ∈ Rn be the upper part of c.

3. Compute R1 x = c1 by back substitution.

The method also works if A does not have full rank, i.e. if rank(A) < n. In this

case the diagonal of R1 has additional zeros. But splitting R and c up appropri-

ately will also lead to the solution in this case.

If A does not have full rank then we cannot solve the normal equations di-

rectly, since the matrix AT A is not invertible. However, we can still compute

the pseudo-inverse of A using the singular value decomposition. Hence, we

will now discuss how to use the singular value decomposition A = UΣV T with

04

3.8 Solving Least-Squares Problems 89

for solving the least-squares problem and also generalise Corollary 3.41. We

will describe this now already in the case of a matrix that is not necessarily of

full rank.

The following observation is helpful. Let r denote the rank of the matrix A and

hence the number of singular values. If we set y := V T x ∈ Rn and c := U T b ∈

Rm , then

r

m

b − Ax22 = UU T b − UΣV T x22 = c − Σy22 = (c j − σ j y j )2 + c2j ,

j=1 j=r+1

r

m

b − Ax22 = |σ j vTj x − uTj b|2 + |uTj b|2 . (3.13)

j=1 j=r+1

This means, we have to choose x such that y j = vTj x = uTj b/σ j for 1 ≤ j ≤ r.

The remaining variables y j , r + 1 ≤ j ≤ n are free to choose. Hence, we have

the following result

Theorem 3.43 The general solution to the least-squares problem (3.11) is

given by

r uT b

j

n

n

x= vj + α j v j =: x+ + α jv j.

j=1

σj j=r+1 j=r+1

Here, σ j denote the singular values of the matrix A, v j ∈ Rn are the columns

of the matrix V and u j ∈ Rm are the columns of the matrix U from the singular

value decomposition. The coeﬃcients α j are free to choose. From all possible

solutions, the solution x+ is the solution having minimal Euclidean norm.

orthonormal basis v1 , . . . , vn , having coeﬃcients vTj x = uTj b/σ j . The solution

x+ has minimal norm, since we have for a general solution x ∈ Rn that

n

x22 = x+ 22 + α2j ≥ x+ 22 ,

j=r+1

With this result, it is now easy to see that the norm minimal solution is given

by applying the pseudo-inverse A+ to the right-hand side b. Indeed, The repre-

sentation A+ = VΣ+ U T from (1.7) shows that

A+ b = VΣ+ U T b = VΣ+ c = x+ .

04

90 Direct Methods for Solving Linear Systems

Corollary 3.44 Let A ∈ Rm×n with m ≥ n be given. Then, the pseudo-inverse

A+ ∈ Rn×m maps every vector b ∈ Rm to its norm-minimal least-squares solu-

tion x+ ∈ Rn , i.e.

$ %

A+ b = min x2 : x ∈ Rn solves min Az − b2 .

z

Since the Euclidean norm is the natural choice here, we will give all estimates

in it. As in the analysis of solving linear systems, the condition number of the

matrix will play a crucial role. We will continue to assume that A ∈ Rm×n with

m ≥ n. Hence, according to Deﬁnition 2.24, the condition number is given by

σ1

κ2 (A) = = A2 A+ 2 ,

σr

where σ1 and σr are the largest and smallest singular value of A, respectively,

and A+ denotes the pseudo-inverse of A.

We start with a simple perturbation result, where the right-hand side b but not

the matrix A is perturbed. The result employs the orthogonal projection P onto

range(A), which is, according to Lemma 1.24, given by P = AA+ .

Proposition 3.45 Let A ∈ Rm×n with m ≥ n be given. Let b, Δb ∈ Rm . If

x ∈ Rn and Δx ∈ Rn denote the norm-minimal solutions of the least-squares

problem (3.11) with data (A, b) and (A, b + Δb), respectively, then

Δx2 PΔb2

≤ κ2 (A) ,

x2 Pb2

where P := AA+ is the orthogonal projection of Rm onto range(A).

Proof By Corollary 3.44 we have x = A+ b and x + Δx = A+ (b + Δb) =

x + A+ Δb. This means Δx = A+ Δb = A+ AA+ Δb, where we have used (1.8) in

the last equality. From this, it immediately follows that

Δx2 ≤ A+ 2 AA+ Δb2 = A+ 2 PΔb2 .

Using the full singular value factorisation A = UΣV T , we have

r

uT Pb

x = A+ b = A+ AA+ b = A+ Pb = i

vi ,

i=1

σi

where r = rank(A) and σ1 ≥ σ2 ≥ · · · ≥ σr > 0 are the singular values of A.

The orthonormality of the columns ui and vi of U and V, respectively, shows

r "" T "2

"" ui Pb """ 1 T 2

r

1

x2 =

2

"" "" ≥ 2 |ui Pb| = 2 Pb22 ,

i=1

σ i σ 1 i=1 σ 1

04

3.8 Solving Least-Squares Problems 91

σ1 = A2 yields x2 ≥ Pb2 /A2 and hence

Δx2 PΔb2

≤ A+ 2 A2 .

x2 Pb2

For the general perturbation result, where b as well as A can be perturbed we

need two auxiliary statements. The ﬁrst one is a generalisation of the perturba-

tion Lemma 2.42.

be a perturbation with A+ 2 ΔA2 < 1. Then, A + ΔA has also rank n and

satisﬁes the bound

A+ 2

(A + ΔA)+ 2 ≤ .

1 − A+ 2 ΔA2

Proof We start by showing that A+ΔA has rank n. Let x ∈ ker(A+ΔA), which

means Ax = −ΔA x and, after multiplication by AT , also AT Ax = −AT ΔA x.

As AT A is invertible, (1.10) gives x = −(AT A)−1 AT ΔA x = −A+ ΔA x. and the

condition A+ 2 ΔA2 < 1 shows x = 0. From this we can conclude that the

columns of A + ΔA are linearly independent and hence the rank of A + ΔA is n.

Next, let λn be the smallest eigenvalue of (A + ΔA)T (A + ΔA) and let x ∈ Rn be

a corresponding eigenvector with x2 = 1. Then we have

1 !1/2

= λ 1/2

= xT

(A + ΔA) T

(A + ΔA)x

(A + ΔA)+ 2 n

≥ (xT AT Ax)1/2 − ΔA2 = σn (A) − ΔA2

1 1 − A+ 2 ΔA2

= + − ΔA2 = .

A 2 A+ 2

The latter expression is positive by assumption, from which the statement then

follows.

The result remains essentially true if A does not have full rank, as long as

rank(A + ΔA) ≤ rank(A) is satisﬁed. However, we will restrict ourselves to the

case rank(A) = n from now on.

Proof Let A = UΣV T and B = Ũ Σ̃Ṽ T be the full singular value decom-

positions of A and B, respectively. Then, we have AA+ = UΣV T VΣ+ U T =

UΣΣ+ U T = UDU T with the diagonal matrix D = diag(1, . . . , 1, 0, . . . , 0) ∈

04

92 Direct Methods for Solving Linear Systems

same way we have BB+ = ŨDŨ T with the same diagonal matrix. If we deﬁne

W W12

W := Ũ T U = 11 ∈ Rm×m

W21 W22

with block matrices W11 ∈ Rn×n , W12 ∈ Rn×(m−n) , W21 ∈ R(m−n)×n and W22 ∈

R(m−n)×(m−n) then, using the invariance of the Euclidean norm under orthogonal

transformations, we ﬁnd

T

2 .

2 holds. Note that the transpose

is irrelevant as the Euclidean norm is invariant under forming the transpose. To

see this, let z ∈ Rm−n be given and let x = (0T , zT )T ∈ Rm be its prolongation.

Using again the invariance of the Euclidean norm, we ﬁnd

W12 22 = max W12 z22 = 1 − min W22 z22 = 1 − λmin (W22

T

W22 ),

z2 =1 z2 =1

W22 z = W22

T

W22 z, z2 so that

T

(2.13) applies to the symmetric and positive semi-deﬁnite matrix W22 W22 .

In the same way, we can conclude from

T

z22 + W22

T

z22

that

W21

T 2

= max W21

T

z22 = 1 − min W22

T

z22 = 1 − λmin (W22 W22

T

).

z2 =1 z2 =1

T

the smallest eigenvalue of W22 W22 coincides with the smallest eigenvalue of

T

W22 W22 .

After this, we can state and prove our main perturbation result.

a perturbation with A+ 2 ΔA2 < 1. Let b, Δb ∈ Rm . Finally, let x, x + Δx ∈

04

3.8 Solving Least-Squares Problems 93

Rn be the solutions of the least-squares problem (3.11) with data (A, b) and

(A + ΔA, b + Δb), respectively. Then,

Δx2 κ2 (A) ΔA2 Ax − b2 Δb2

≤ 1 + κ2 (A) + .

x2 1 − κ2 (A)ΔA2 /A2 A2 A2 x2 A2 x2

Proof From x = A+ b and x + Δx = (A + ΔA)+ (b + Δb) it follows that

& '

Δx = (A + ΔA)+ (b + Δb) − A+ b = (A + ΔA)+ − A+ b + (A + ΔA)+ Δb.

We will now bound the norms of both terms on the right-hand side separately.

To this end, we have to rewrite the ﬁrst term. As A has rank n, its pseudo-

inverse is given by (1.10) as A+ = (AT A)−1 AT which means A+ A = I. As

A + ΔA also has rank n, we also have (A + ΔA)+ (A + ΔA) = I. With this, we

derive

= (A + ΔA)+ (I − AA+ ) + (A + ΔA)+ AA+

− (A + ΔA)+ (A + ΔA)A+

= (A + ΔA)+ (I − AA+ ) − (A + ΔA)+ (ΔA)A+ .

= (A + ΔA)+ (I − AA+ )(b − Ax) − (A + ΔA)+ (ΔA)x.

(A + ΔA)+ (I − AA+ )2 = B+ (I − AA+ )2 = B+ BB+ (I − AA+ )2

≤ B+ 2 BB+ (I − AA+ )2

= B+ 2 AA+ (I − BB+ )2

= B+ 2 (AA+ )T (I − BB+ )2

= B+ 2 (A+ )T AT (I − BB+ )2

= B+ 2 (A+ )T (AT − BT )(I − BB+ )2

= B+ 2 (A+ )T (ΔA)T (I − BB+ )2

≤ B+ 2 A+ 2 ΔA2 I − BB+ 2 ,

the fact that the Euclidean norm of a matrix equals the Euclidean norm of its

transpose.

If B = UΣV T is this time the full singular value decomposition of B, we have

with B+ = VΣ+ U T that BB+ = UΣΣ+ U T . As ΣΣ+ ∈ Rm×m is a diagonal matrix

04

94 Direct Methods for Solving Linear Systems

with n entries equal to 1 and m − n entries equal to 0 on the diagonal and as the

Euclidean norm is invariant under orthogonal transformations, we see that

⎧

⎪

⎪

⎨1 if m > n,

+ +

I − BB 2 = I − ΣΣ 2 = ⎪ ⎪

⎩0 if m = n.

≤ (A + ΔA)+ (I − AA+ )2 b − Ax2 + (A + ΔA)+ 2 ΔA2 x2

+ (A + ΔA)+ 2 Δb2

& '

≤ (A + ΔA)+ 2 ΔA2 A+ 2 Ax − b2 + ΔA2 x2 + Δb2

A+ 2 & '

≤ +

ΔA2 A+ 2 Ax − b2 + ΔA2 x2 + Δb2

1 − A 2 ΔA2

κ2 (A) ΔA2 ( + ) Δb2

= A 2 Ax − b2 + x2 +

1 − κ2 (A)ΔA2 /A2 A2 A2

κ2 (A) ΔA2 Ax − b2 Δb2

= κ2 (A) + x2 +

1 − κ2 (A)ΔA2 /A2 A2 A2 A2

and dividing by x2 gives the stated result.

Theorem 3.48 remains true, even if A is not of full rank, as long as the per-

turbation does not lead to a reduced rank, i.e. as long as the perturbed matrix

A + ΔA has the same rank as A, see for example Björck [20].

We note that a large defect Ax − b2 becomes the dominant error term as it is

multiplied by a factor of κ2 (A)2 . If the defect Ax − b2 is small the sensitivity

of the least-squares problem is determined by the factor κ2 (A), instead.

In the case of a quadratic matrix A ∈ Rn×n with full rank, the defect Ax − b2

is zero. As we also have b2 ≤ A2 x2 , the bound from the above Theorem

3.48 is exactly the one from Theorem 2.43 for the Euclidean norm.

If the smallest singular value of A is close to zero then this rank deﬁciency leads

obviously to a large condition number. In this situation it is better to reduce the

rank of A numerically by setting the smallest singular value – and any other

which is considered to be too small – to zero and solve the resulting least-

squares problem. If A = UΣV T is the original singular value decomposition of

A with singular values σ1 ≥ σ2 ≥ · · · ≥ σr > 0 and if we set σk+1 = · · · = σr =

0 for a ﬁxed k < r then this means that we actually minimise the expression

04

3.8 Solving Least-Squares Problems 95

k

Ak = σ j u j vTj .

j=1

If x+k = A+k b denotes the solution to this problem and if x+ = A+ b denotes the

solution to the original least-squares problem (3.11) then (3.13) shows that the

residual b − Ax+k is given by

m

r

b − Ax+k 22 = |uTj b|2 = b − Ax+ 22 + |uTj b|2 ,

j=k+1 j=k+1

u j , k + 1 ≤ j ≤ r, are not too large. However, from the representations of x+

and x+k , we can also read oﬀ that the actual error is given by

" "

r "" uT b ""2

+ + 2

x − xk 2 = "

""

j "

" ,

j=k+1

" σ j ""

j ≤ r.

Hence, we are now going to look at a diﬀerent way of dealing with the problem

of numerical rank deﬁciency.

Deﬁnition 3.49 Let A ∈ Rm×n , b ∈ Rm and τ > 0. Then, the Tikhonov reg-

ularisation of the least-squares problem (3.11) with regularisation parameter

τ > 0 is

minn Ax − b22 + τx22 . (3.14)

x∈R

The term τx22 is also called a penalty term and the problem a penalised least-

squares problem. The penalty term is sometimes replaced by a more general

expression of the form τLx22 , where L is a given matrix. Here, however, we

will restrict ourselves to the case of L = I.

The penalised least-squares problem (3.14) is equivalent to a standard least-

squares problem as we have

2

A b

Ax − b2 + τx2 = √ x −

2 2

=: Aτ x − b̃2 ,

2

τI 0 2

where Aτ ∈ R (m+n)×n

and b̃ ∈ R . Obviously, the new matrix Aτ has full rank

m+n

n, as its last n rows are linearly independent. Thus it has a unique solution and

so has (3.14) even if rank(A) = r < n. Moreover, we can use once again the

singular value decomposition to represent the solution.

04

96 Direct Methods for Solving Linear Systems

A = Û Σ̂V̂ T be a reduced singular value decomposition of A ∈ Rm×n , where

Σ̂ ∈ Rr×r is a diagonal matrix with diagonal entries σ1 ≥ σ2 ≥ · · · ≥ σr > 0

and where Û ∈ Rm×r and V̂ ∈ Rn×r have orthonormal columns.

The penalised least-squares problem (3.14) has for each τ > 0 a unique solu-

tion x+τ which is given by

r σ uT b

j j

x+τ = v j. (3.15)

j=1

σ2j + τ

tion of the least-squares problem (3.11). The error between x+τ and x+ can be

bounded by

τ

x+ − x+τ 2 ≤ b2 .

σr (σ2r + τ)

The condition number of the associated matrix Aτ ∈ R(m+n)×n is

⎛ 2 ⎞1/2

+

⎜⎜⎜ σ1 + τ ⎟⎟⎟

κ2 (Aτ ) = Aτ 2 Aτ 2 = ⎝⎜ 2 ⎟⎠ . (3.16)

σr + τ

Using the reduced singular value decomposition and Corollary 3.41 in combi-

nation with (1.10) we can derive

r σ uT b

j j

T −1 −1

= [V̂(Σ̂ + τI)V̂ ] V̂ Σ̂Û b = V̂(Σ̂ + τI) Σ̂U b =

2 T 2 T

v j.

j=1

σ2j + τ

⎡ ⎤

r ⎢⎢ 1 σ j ⎥⎥⎥ r

τ

+ + T ⎢ ⎢ ⎥⎥⎦ v j =

x − xτ = u j b ⎢⎣ − 2 uTj b v j.

j=1

σj σj + τ j=1

σ j (σ2j + τ)

⎡ ⎤2 2

r ⎢⎢ τ ⎥⎥⎥ τ

r

+ + 2 T 2⎢ ⎢ ⎥

x − xτ 2 = |u j b| ⎢⎣ ⎥

⎦ ≤ |uTj b|2

j=1

σ j (σ 2

j + τ) σ r (σ 2 + τ)

r j=1

2

τ

≤ b22 .

σr (σ2r + τ)

04

3.8 Solving Least-Squares Problems 97

Aτ 22 = λmax (ATτ Aτ ) = λmax (AT A + τI) = λmax (Σ̂2 + τI) = σ21 + τ. (3.17)

For the remaining A+τ 2 we use A+τ 2 = (A+τ )T 2 and then A+τ = (ATτ Aτ )−1 ATτ

to derive

A+τ (A+τ )T = (ATτ Aτ )−1 ATτ [(ATτ Aτ )−1 ATτ ]T = (ATτ Aτ )−1 ATτ Aτ (ATτ Aτ )−1

= (ATτ Aτ )−1 = (AT A + τI)−1 = V̂(Σ̂2 + τI)−1 V̂ T .

Thus, we can conclude

1

A+τ 22 = λmax ((Σ̂2 + τI)−1 ) = ,

σ2r +τ

which gives, together with (3.17), the condition number.

From (3.15) it obviously follows that x+τ tends with τ → 0 to the solution x+

of the standard least-squares problem. This is not too surprising since the pe-

nalised least-squares problem (3.14) becomes the least-squares problem (3.11)

for τ = 0. From (3.15) we see that the parameter τ is indeed regularising the

problem since it moves the singular values away from zero.

The remaining question is how to choose the regularisation parameter τ > 0.

Obviously, a smaller τ means that the least-squares part becomes more signif-

icant and the solution vector tries mainly to minimise this. For a larger τ, the

penalty term τx22 is dominant and the solution vector will try to achieve a

small norm. This becomes particularly important if the given data is corrupted.

Let us assume that the right-hand side b is perturbed by some noise Δb to

b + Δb.

If we denote the solution to this perturbed Tikhonov problem by xΔτ , then we

see that

r σ uT (b + Δb)

j j r σ uT Δb

j j

xΔτ = v j = x+

τ + v j,

j=1

σ 2

j + τ j=1

σ j +τ

2

where the latter summand is just the solution of the Tikhonov problem with

data Δb. Hence, we can bound the error between xΔτ and the solution of the

original least-squares problem x+ = A+ b by

r σ uT Δb

j j

x − xτ 2 ≤ x − xτ 2 +

+ Δ + +

v j

σ j + τ

2

j=1 2

τ σj

≤ b2 + max 2 Δb2 .

σr (σ2r + τ) 1≤ j≤r σ + τ

j

This shows how the error is split up into an approximation error, which goes

04

98 Direct Methods for Solving Linear Systems

minimise the overall error. This is, however, in general not possible.

There are various ways to choose τ. Often, a statistical method called cross

validation or generalised cross validation is used. Generalised cross validation

is based on the idea that a good value of the regularisation parameter should

help to predict missing data values. Hence, the parameter τ is chosen as the

minimiser of

Axτ − b22

G(τ) := ,

[tr(I − A(AT A + τI)−1 AT )]2

see for example Golub et al. [67] or Hansen [81]. Another method is the so-

called Morozov discrepancy principle (see [97]), which we want to describe

now and which is based on the strategy of choosing τ in such a way that the

defect AxΔτ − b2 equals the data error Δb2 . We will show that this is always

possible. As before let x+τ be the minimiser of

As we are now interested in the behaviour of the defect, we deﬁne the function

J(τ) is continuous on [0, ∞) and strictly monotonically increasing with range

[0, b22 ).

(3.15). From this, the continuity of J(τ) is obvious. Moreover, our assumptions

yield that x+τ 0 for each τ > 0, as x+τ = 0 means by (3.15) that uTj b = 0 for

1 ≤ j ≤ r, which contradicts 0 b ∈ range(A) = span{u1 , . . . , ur }.

Next, let K(τ) := x+τ 22 so that we have the decomposition

Let us assume that 0 < τ1 < τ2 . We will ﬁrst show that x+τ1 x+τ2 . To see this,

we note that they satisfy the equations

From this, we can conclude that x+τ1 = x+τ2 would imply (τ2 − τ1 )x+τ2 = 0. But

as x+τ2 0 we have the contradiction τ1 = τ2 .

04

3.8 Solving Least-Squares Problems 99

With this, we can show that K(τ) is strictly monotonically decreasing. Indeed,

multiplying (3.18) by (x+τ1 − x+τ2 )T from the left gives

A(x+τ1 − x+τ2 )22 + τ1 x+τ1 − x+τ2 22 = (τ2 − τ1 )x+τ1 − x+τ2 , x+τ2 2 .

Since x+τ1 x+τ2 , we see that the left-hand side is positive and as τ2 > τ1 we can

conclude that

x+τ1 − x+τ2 , x+τ2 2 > 0,

showing

x+τ2 22 < x+τ1 , x+τ2 2 ≤ x+τ1 2 x+τ2 2 ,

i.e. K(τ2 ) < K(τ1 ). Finally, still under the assumption of 0 < τ1 < τ2 , we can

conclude that J(τ1 ) < J(τ2 ) because J(τ2 ) ≤ J(τ1 ) would imply that

μτ1 (x+τ2 ) = J(τ2 ) + τ1 K(τ2 ) < J(τ1 ) + τ1 K(τ1 ) = μτ1 (x+τ1 ),

which contradicts the fact that x+τ1 minimises μτ1 .

Finally, it remains to show that the range of J is given by [0, b2 ). On the one

hand, we have J(0) = 0 as b ∈ range(A). On the other hand, we know that

J is continuous and monotonically increasing, which means that the range of

J is an interval with an upper bound given by the limit of J(τ) for τ → ∞.

To determine this limit, we use that (3.15) yields x+τ → 0 for τ → ∞. This,

however, implies J(x+τ ) → b22 for τ → 0.

Hence, for each s ∈ (0, b2 ) there exists a unique τ > 0 such that J(τ) = s. As

we can apply this also to J Δ (τ) = AxΔτ − (b + Δb)22 , this proves the following

result.

Theorem 3.52 (Morozov’s discrepancy principle) Let A ∈ Rm×n with m ≥ n.

Let b 0 be in the range of A. Let the perturbation Δb ∈ Rm satisfy 0 <

Δb2 < b2 . Then, there exists a unique τ > 0 such that the unique solution

xΔτ which minimises Ax − (b + Δb)22 + τx22 over all x ∈ Rn satisﬁes

AxΔτ − (b + Δb)2 = Δb2 .

The above result only guarantees the existence of a unique τ with this prop-

erty. Its numerical determination is often based upon the so-called L-curve

principle. As we know that the function K(τ) is monotonically decreasing and

the function J(τ) is monotonically increasing, we see that the function K is

monotonically decreasing as a function of J, i.e. if the curve (K(τ), J(τ)) is

monotonically decreasing in the ﬁrst quadrant. In many cases this curve has

the shape of an L with a prominent corner, particularly in a logarithmic plot.

The parameter τ corresponding to this corner often yields a good choice of the

penalisation parameter. Details can be found, for example, in Hansen [81].

04

100 Direct Methods for Solving Linear Systems

Exercises

3.1 Let A ∈ Rn×n be a symmetric matrix and assume that Gaussian elimina-

tion without pivoting is possible. Show that after k − 1 steps, the lower

right (n − k + 1) × (n − k + 1) block in A(k) from (3.8) is symmetric.

3.2 Denote the columns of A ∈ Rn×n by a1 , . . . , an ∈ Rn . Use the QR factori-

sation of A to show | det A| ≤ a1 2 · · · an 2 .

3.3 Let data {y1 , . . . , ym } at positions {x1 , . . . , xm } be given.

• Use the least-squares approach to determine the linear polynomial

y(x) = ax + b which ﬁts these data best, i.e. ﬁnd the linear polyno-

mial which minimises

m

|y j − y(x j )|2 . (3.19)

j=1

R.

3.4 A Givens rotation is an n × n matrix of the form

J(i, j, θ) = I + (cos(θ) − 1)(e j eTj + ei eTi ) + sin(θ)(e j eTi − ei eTj ).

• Show that a Givens rotation is orthogonal.

• Show that for x ∈ Rn there is an angle θ such that yk = 0 for y =

J(i, k, θ)x. Use this to show that the QR factorisations of a tridiagonal

matrix A ∈ Rn×n can be achieved in O(n) time using Givens rotations.

• Show that in R2×2 an orthogonal matrix is either a Householder matrix

or a Givens rotation.

3.5 Let A ∈ Rm×n with m ≥ n and let b, Δb ∈ Rm with b range(A). Let x

and x + Δx be the minimal norm solutions of the least-squares problem

with data (A, b) and (A, b + Δb), respectively. Show

Δx2 Δb2 Δb2

≤ κ2 (A) ≤ κ2 (A) .

x2 A2 x2 AA+ b2

3.6 Let A ∈ Rm×n with rank(A) = n and b ∈ Rm be given. Determine the

condition number of solving the least-squares problem minx Ax − b2

under the assumption that b is exact, i.e. determine the condition number

of the function f : {A ∈ Rm×n : rank(A) = n} → Rn , f (A) := x+ .

3.7 Let A ∈ Rn×n be a symmetric, positive deﬁnite matrix. Let τ > 0 and

b ∈ Rn be given. Show that the function x → Ax − b22 + τxT Ax has a

unique minimum x∗τ ∈ Rn which solves the linear system (A + τI)x = b.

04

4

Iterative Methods for Solving Linear Systems

4.1 Introduction

We have seen that a direct algorithm for solving an n × n linear system

Ax = b

generally requires O(n3 ) time. For larger n, this becomes quite quickly im-

practical. The basic idea behind iterative methods is to produce a way of ap-

proximating the application of the inverse of A to b so that the amount of work

required is substantially less than O(n3 ). To achieve this, a sequence of approx-

imate solutions is generated, which is supposed to converge to the solution.

The computation of each iteration usually costs O(n2 ) time so that the method

becomes computationally interesting only if a suﬃciently good approximation

can be achieved in far fewer than n steps.

To understand the general concept, let us write the matrix A as A = A − B + B

with an invertible matrix B at our disposal. Then, the equation Ax = b can be

reformulated as b = Ax = (A − B)x + Bx and hence as

mapping F, i.e. it satisﬁes F(x) = x. To calculate the ﬁxed point of F, we can

use the following simple iterative process. We ﬁrst pick a starting point x0 and

then form

x j+1 := F(x j ) = Cx j + c, j = 0, 1, 2, . . . . (4.2)

Noting that this particular F is continuous, we see that if this sequence con-

verges then the limit has to be the ﬁxed point of F and hence the solution of

Ax = b.

101

05

102 Iterative Methods for Solving Linear Systems

We can, however, also start with an iteration of the form (4.2). If C and c have

the form (4.1) and if (4.2) converges the limit must be the solution of Ax = b.

Deﬁnition 4.1 An iterative method (4.2) is called consistent with the linear

system Ax = b if the iteration matrix C and vector c are of the form C =

I − B−1 A and c = B−1 b with an invertible matrix B.

In the next section, we will discuss possible choices of B and conditions for

the convergence of the iterative method.

We will start by deriving a general convergence theory for an iteration process

of the form (4.2) even with a more general, not necessarily linear F : Rn → Rn .

respect to a norm · on Rn if there is a constant 0 ≤ q < 1 such that

for all x, y ∈ Rn .

has exactly one ﬁxed point x∗ ∈ Rn . The sequence x j+1 := F(x j ) converges for

every starting point x0 ∈ Rn to x∗ . Furthermore, we have the error estimates

q

x∗ − x j ≤ x j − x j−1 (a posteriori),

1−q

qj

x∗ − x j ≤ x1 − x0 (a priori).

1−q

Proof First, let us assume that F has two diﬀerent ﬁxed points x∗ , x̃ ∈ Rn .

Then, the contraction property of F yields

In the next step, we will show that the sequence {x j } is a Cauchy sequence. The

completeness of Rn then proves its convergence to a point x∗ ∈ Rn . Since F is

continuous, this limit point x∗ has to be the ﬁxed point of F.

05

4.2 Banach’s Fixed Point Theorem 103

For arbitrary j and k from N0 , we can use the contraction property j times to

derive

≤ q j xk − x0 .

The distance between xk and x0 can then be estimated using a telescoping sum

argument:

≤ qk−1 + qk−2 + · · · + 1 x1 − x0

1

≤ x1 − x0 ,

1−q

using the geometric series for 0 ≤ q < 1. Hence, we have

qj

x j+k − x j ≤ x1 − x0 , (4.3)

1−q

and this expression becomes smaller than any given > 0, provided j is large

enough. This proves that {x j } is indeed a Cauchy sequence.

Finally, letting k → ∞ in (4.3) and using the continuity of the norm yields the

a priori estimate. Since this estimate is valid for each starting point, we can

restart the iteration with y0 = x j−1 and the a priori estimate for the ﬁrst two

terms y0 = x j−1 and y1 = F(y0 ) = x j gives the a posteriori estimate.

If we apply this theorem to our speciﬁc iteration function F(x) = Cx+c, where

C is the iteration matrix, then we see that

chosen vector and hence matrix norm, while, since all norms on Rn are equiv-

alent, the fact that the sequence converges does not depend on the norm.

In other words, having a speciﬁc compatible matrix norm with C < 1 is suf-

ﬁcient for convergence but not necessary. A suﬃcient and necessary condition

can be stated using the spectral radius of the iteration matrix.

|λ1 | ≥ |λ2 | ≥ · · · ≥ |λn |. The spectral radius of A is given by ρ(A) := |λ1 |.

seen that the 2-norm of a matrix A is given by the square root of the largest

eigenvalue of AT A, which we can now write as A2 = ρ(AT A)1/2 .

05

104 Iterative Methods for Solving Linear Systems

of Ak , k = 1, 2, 3, . . . with eigenvector x. Hence, ρ(A)k ≤ ρ(Ak ).

In the proof of the following theorem we have to be a little careful since a real

matrix A ∈ Rn×n might easily have eigenvectors x in Cn and hence Ax ∈ Cn .

A for all matrices A ∈ Rn×n .

2. For any A ∈ Rn×n and any > 0 there exists a compatible matrix norm

· : Rn×n → R such that ρ(A) ≤ A ≤ ρ(A) + .

tor x ∈ Cn \ {0}. If we have λ ∈ R then we can choose the corresponding

eigenvector x ∈ Rn . To see this note that with x ∈ Cn also x ∈ Cn and hence

the real part (x) ∈ Rn are eigenvectors of A to the eigenvalue λ. Hence,

for such an x ∈ Rn we immediately have

with μ = |λ| > 0 and θ 0 with corresponding normalised eigenvector x ∈

Cn . With x also each vector y := eiφ x is obviously an eigenvector of A to the

eigenvalue λ. Thus, we can choose the angle φ such that y has an imaginary

part (y) satisfying (y) ≤ (eiθ y). Then, we have A((y)) = (Ay) =

(λy) = μ(eiθ y) and hence

A ≥ =μ ≥ μ = |λ|.

(y) (y)

2. From Theorem 3.38, Schur’s factorisation, we know that there exists a uni-

T

tary matrix W such that W AW is an upper triangular matrix, with the

eigenvalues of A as the diagonal elements:

⎛ ⎞

⎜⎜⎜λ1 r12 ··· r1n ⎟⎟⎟

⎜⎜⎜⎜ 0 λ2 ···

⎟

r2n ⎟⎟⎟⎟

⎜

W AW = ⎜⎜⎜⎜⎜ . ⎟

T

.. .. .. ⎟⎟⎟⎟ .

⎜⎜⎜ .. . . . ⎟⎟⎟⎟

⎜⎝ ⎠

0 0 ··· λn

05

4.2 Banach’s Fixed Point Theorem 105

0 < δ < min(1, /(r + )) with r = max |ri j | and > 0. Let S = WD, then

⎛ ⎞

⎜⎜⎜λ1 δr12 · · · δn−1 r1n ⎟⎟⎟

⎜⎜⎜ ⎟

⎜⎜ 0 λ2 · · · δn−2 r2n ⎟⎟⎟⎟

S AS = D W AWD = ⎜⎜⎜⎜⎜ .

−1 −1 T

.. .. .. ⎟⎟⎟⎟ .

⎟

⎜⎜⎜ .. . . . ⎟⎟⎟⎟

⎜⎝ ⎠

0 0 ··· λn

Furthermore, using Theorem 2.19 and Remark 2.25, we see that

n

−1

S AS ∞ = max |(S −1 AS )i j | ≤ ρ(A) + δr(1 + δ + · · · + δn−2 ),

1≤i≤n

j=1

−1 1 − δn−1 δr

S AS ∞ ≤ ρ(A) + δr ≤ ρ(A) + ≤ ρ(A) + .

1−δ 1−δ

Since A → S −1 AS ∞ is the (complex) matrix norm induced by the (com-

plex) vector norm x → S −1 x∞ , see Example 2.29 and the remarks right

after it, we know that these norms are compatible when restricted to the real

numbers, which completes the proof.

Note that the proof of the above theorem, as well as the proof of the next

theorem, simpliﬁes signiﬁcantly if a complex set-up is used. Nonetheless, it is

worth knowing that everything works out even in an entirely real set-up.

Theorem 4.6 Let C ∈ Rn×n and c ∈ Rn be given. The iteration x j+1 = Cx j + c

converges for every starting point x0 ∈ Rn if and only if ρ(C) < 1.

Proof Assume ﬁrst that ρ(C) < 1. Then, we can pick an > 0 such that

ρ(C) + < 1 and, by Theorem 4.5, we can ﬁnd a compatible matrix norm ·

such that C ≤ ρ(C) + < 1, which gives convergence.

Assume now that the iteration converges to x∗ for every starting point x0 . This

particularly means that if we pick two arbitrary starting points x0 , y0 ∈ Rn then

the sequences x j+1 := Cx j + c and y j+1 := Cy j + c both converge to x∗ ∈ Rn .

Hence, the sequence

z j+1 := x j+1 + iy j+1 = C(x j + iy j ) + (1 + i)c = Cz j + c̃

converges for every starting point z0 = x0 + iy0 ∈ Cn to z∗ = (1 + i)x∗ , which

satisﬁes z∗ = Cz∗ + c̃. If we pick the starting point z0 such that z = z0 − z∗ is

an eigenvector of C with eigenvalue λ, then

z j − z∗ = C(z j−1 − z∗ ) = · · · = C j (z0 − z∗ ) = λ j (z0 − z∗ ).

Since the expression on the left-hand side tends to zero for j → ∞, so does the

05

106 Iterative Methods for Solving Linear Systems

expression on the right-hand side. This, however, is possible only if |λ| < 1.

Since λ was an arbitrary eigenvalue of C, this shows that ρ(C) < 1.

After this general discussion, we return to the question of how to choose the

iteration matrix C. Our initial approach is based upon a splitting

C = B−1 (B − A) = I − B−1 A, c = B−1 b

with a matrix B, which should be suﬃciently close to A but also easily invert-

ible.

From now on, we will assume that the diagonal elements of A are all non-zero.

This can be achieved by exchanging rows and/or columns as long as A is non-

singular. Next, we decompose A in its lower-left sub-diagonal part, its diagonal

part and its upper-right super-diagonal part, i.e.

A= L+D+R

with

⎧ ⎧ ⎧

⎪

⎪

⎨ai j if i > j, ⎪

⎪

⎨ai j if i = j, ⎪

⎪

⎨ai j if i < j,

L=⎪

⎪ D=⎪

⎪ R=⎪

⎪

⎩0 else, ⎩0 else, ⎩0 else.

The simplest possible approximation to A is then given by picking its diagonal

part D for B so that the iteration matrix becomes

C J = I − B−1 A = I − D−1 (L + D + R) = −D−1 (L + R),

with entries

⎧

⎪

⎪

⎨−aik /aii if i k,

cik = ⎪

⎪ (4.4)

⎩0 else.

This means that we can write the iteration deﬁned by

x( j+1) = −D−1 (L + R)x( j) + D−1 b

component-wise as follows.

Deﬁnition 4.7 The iteration deﬁned by

⎛ ⎞

⎜⎜⎜ ⎟⎟⎟

1 ⎜⎜⎜ ⎟⎟

n

xi( j+1) = ⎜⎜⎜bi − aik xk( j) ⎟⎟⎟⎟ , 1 ≤ i ≤ n, (4.5)

aii ⎜⎝ k=1

⎟⎠

ki

05

4.3 The Jacobi and Gauss–Seidel Iterations 107

Note that we wrote the iteration index as an upper index, which we will con-

tinue to do from now on. An algorithmic formulation is given in Algorithm 7.

Here we have realised the iteration with just two vectors x and y, where x

represents the “old” iteration and y represents the new, next iteration.

We also need a stopping criterion here. Unfortunately, we cannot use the error

x∗ − x( j) since we do not know the exact solution x∗ . But there are at least two

possibilities. The easiest one is to choose a maximum number of iteration steps

for which we are willing to run the iteration. The advantage of this stopping

criterion is that we will ﬁnish after the prescribed number of iterations. How-

ever, we might have a good approximation x( j) already after a few steps and

would thereafter do more iterations than necessary. But we might also have a

bad approximation, and would then need to perform additional iterations.

A better stopping criterion is to look at the residual Ax∗ − Ax( j) = b − Ax( j) and

to stop if the norm of the residual falls below a given threshold. Unfortunately,

computing the residual is an O(n2 ) operation for a full matrix. Another, widely

used stopping criterion is to look at the diﬀerence of two iterations x( j+1) − x( j) .

If the norm of this diﬀerence falls below a certain threshold, the process stops.

This is computationally cheaper as it costs only O(n) time.

Input : A ∈ Rn×n invertible, b ∈ Rn , x(0) ∈ Rn .

Output: Approximate solution of Ax = b.

1 Compute r := b − Ax(0)

2 while r ≥ do

3 for i = 1 to n do

4 yi := bi

5 for k = 1 to i − 1 do

6 yi := yi − aik xk

7 for k = i + 1 to n do

8 yi := yi − aik xk

9 yi := yi /aii

10 x := y

11 r := b − Ax

method, we only give the computational cost for one step of the iteration, so

the total cost becomes this cost times the number of iterations.

05

108 Iterative Methods for Solving Linear Systems

Remark 4.8 Let A ∈ Rn×n be a full matrix. Then, one step of the Jacobi

method costs O(n2 ) time.

Note that the computational cost drops to O(n) if the matrix A is sparse and has

only a constant number of non-zero entries per row.

Obviously, one can expect convergence of the Jacobi method if the original ma-

trix A resembles a diagonal matrix, i.e. if the entries on the diagonal dominate

the other entries in that row. Recall that a matrix A is strictly row diagonally

dominant if

n

|aik | < |aii |, 1 ≤ i ≤ n.

k=1

ki

Theorem 4.9 The Jacobi method converges for every starting point if the

matrix A is strictly row diagonally dominant.

Proof We use the row sum norm to calculate the norm of the iteration matrix

C J as

n

n

|aik |

C J ∞ = max |cik | = max < 1.

1≤i≤n

k=1

1≤i≤n

k=1

|aii |

ki

The latter criterion can be weakened in a way which we will describe now.

nant, if, for 1 ≤ i ≤ n,

n

|aik | ≤ |aii | (4.6)

k=1

ki

is satisﬁed and if there is at least one index i for which (4.6) holds with a strict

lower sign.

elements of A have to be diﬀerent from zero. To see this, let us assume that

there is an i with aii = 0. In that case the whole row i would be zero, which

contradicts the non-singularity of A.

Being weakly diagonally dominant is not enough for the Jacobi method to

achieve convergence. We also need that the matrix is not reducible in the fol-

lowing sense.

05

4.3 The Jacobi and Gauss–Seidel Iterations 109

subsets N1 and N2 of N := {1, 2, . . . , n} with

1. N1 ∩ N2 = ∅,

2. N1 ∪ N2 = N,

3. for each i ∈ N1 and each k ∈ N2 we have aik = 0.

For a linear system Ax = b with reducible matrix A ∈ Rn×n we can split the

system into two smaller, independent systems. If, for example, N1 contains

exactly m < n elements, then we can ﬁrst solve the m equations

n

aik xk = aik xk = bi , i ∈ N1 .

k=1 k∈N1

these components, we can rewrite the equations for the other indices i ∈ N2 as

aik xk = bi − aik xk , i ∈ N2 .

k∈N2 k∈N1

dominant then the Jacobi method converges for every starting point.

Proof First of all, we note that all diagonal elements of A are diﬀerent from

zero. Otherwise, the sets N1 := {i ∈ N : aii = 0} and N2 := {i ∈ N : aii 0}

would give a non-trivial decomposition of N. To see this, note that N1 is non-

empty by this assumption and N2 also contains at least one element by the fact

that A is weakly diagonally dominant. The union of the two sets obviously

gives N. Furthermore, for i ∈ N1 the weakly diagonal dominance of A ensures

that also all ai j for j i have to vanish, in particular for those j ∈ N2 .

From A being weakly diagonally dominant and Theorem 4.5 we immediately

have ρ(C J ) ≤ C J ∞ ≤ 1 for the iteration matrix C J of the Jacobi method.

Hence, it remains to ensure that there is no eigenvalue of C J with absolute

value one. Hence, let us assume there is such an eigenvalue λ ∈ C of C J with

|λ| = 1 and corresponding eigenvector x ∈ Cn , which we normalise to satisfy

x∞ = 1. Then we can form the sets N1 := {i ∈ N : |xi | = 1} ∅ and

05

110 Iterative Methods for Solving Linear Systems

"" ""

"" n ""

1 "" " 1

n

1

n

|xi | = |λ||xi | = |(C J x)i | = "" aik xk """ ≤ |aik xk | ≤ |aik | ≤ 1

|aii | " k=1 " |aii | k=1 |aii | k=1

" ki " ki ki

(4.7)

shows that for i ∈ N1 we must have equality here, meaning particularly

n

|aii | = |aik |, i ∈ N1 .

k=1

ki

Hence, since A is weakly diagonally dominant, there must be at least one index

i N1 , which shows that N2 cannot be empty.

Since A is irreducible there is at least one i ∈ N1 and one k0 ∈ N2 with aik0 0.

For this k0 ∈ N2 we have particularly |aik0 xk0 | < |aik0 |, so that we have a strict

inequality in (4.7). This, however, is a contradiction to i ∈ N1 .

matrix A is irreducible and weakly diagonally dominant we only have ρ(C J ) <

1. We will have C J ∞ = 1 whenever there is at least one index i with equality

in (4.6).

A closer inspection of the Jacobi method shows that the computation of xi( j+1)

is independent of any other x( j+1) . This means that on a parallel or vector com-

puter all components of the new iteration x( j+1) can be computed simultane-

ously.

However, it also gives us the possibility to improve the process. For example,

to calculate x2( j+1) we could already employ the newly computed x1( j+1) . Then,

for computing x3( j+1) we could use x1( j+1) and x2( j+1) and so on. This leads to the

following iterative scheme.

⎛ ⎞

1 ⎜⎜⎜⎜ ⎟⎟

( j) ⎟

i−1 n

xi( j+1) = ⎜⎝⎜bi − ( j+1)

aik xk − aik xk ⎟⎟⎟⎠ , 1 ≤ i ≤ n. (4.8)

aii k=1 k=i+1

As usual, the algorithm is given in Algorithm 8. Once, again we use the residual

as the stopping criterion. This time, we have an in situ algorithm, i.e. we can

store each iteration directly over the previous one.

As in the case of the Jacobi method, the computational cost of the Gauss–

Seidel iteration is easily derived.

05

4.3 The Jacobi and Gauss–Seidel Iterations 111

Input : A ∈ Rn×n invertible, b ∈ Rn , x(0) ∈ Rn .

Output: Approximate solution of Ax = b.

1 Compute r := b − Ax(0)

2 while r ≥ do

3 for i = 1 to n do

4 xi := bi

5 for k = 1 to i − 1 do

6 xi := xi − aik xk

7 for k = i + 1 to n do

8 xi := xi − aik xk

9 xi := xi /aii

10 r := b − Ax

method is O(n2 ). This reduces to O(n) if we have a sparse matrix with only

O(n) non-zero entries.

To see that this scheme is consistent and to analyse its convergence, we have

to ﬁnd C = I − B−1 A = B−1 (B − A) and c = B−1 b. To this end, we rewrite (4.8)

as

i−1

n

aii xi( j+1) + aik xk( j+1) = bi − aik xk( j) ,

k=1 k=i+1

(L + D)−1 and use B − A = −R, we note that the scheme is indeed consistent

and that the iteration matrix of the Gauss–Seidel method is given by

Later on, we will prove a more general version of the following theorem.

Seidel method converges.

Our motivation for introducing the Gauss–Seidel method was to improve the

Jacobi method. Under certain assumptions it is indeed possible to show that

the Gauss–Seidel method converges faster than the Jacobi method.

Theorem 4.16 (Sassenfeld) Let A ∈ Rn×n . Assume that the diagonal elements

05

112 Iterative Methods for Solving Linear Systems

⎛ i−1 ⎞

1 ⎜⎜⎜⎜ ⎟⎟⎟

n

si = ⎜⎜⎝ |aik |sk + |aik |⎟⎟⎟⎠ , 1 ≤ i ≤ n.

|aii | k=1 k=i+1

Then, we have for the iteration matrix CGS = −(L + D)−1 R of the Gauss–Seidel

method the bound

CGS ∞ ≤ max si =: s,

1≤i≤n

nally dominant matrix, we also have

CGS ∞ ≤ s ≤ C J ∞ < 1,

Proof Let y ∈ Rn . We have to show that CGS y∞ ≤ sy∞ . From this,

CGS ∞ ≤ s immediately follows. Let z = CGS y. For the components z1 , . . . , zn

of z we can conclude that

⎛ i−1 ⎞

1 ⎜⎜⎜⎜ ⎟⎟⎟

n

zi = − ⎜⎝⎜ aik zk + aik yk ⎟⎟⎠⎟ , 1 ≤ i ≤ n.

aii k=1 k=i+1

1 1

n n

|z1 | ≤ |a1k ||yk | ≤ |a1k | y∞ = s1 y∞ ≤ sy∞ .

|a11 | k=2 |a11 | k=2

induction, conclude that

⎛ i−1 ⎞

1 ⎜⎜⎜⎜ ⎟⎟⎟

n

|zi | ≤ ⎜⎜⎝ |aik ||zk | + |aik ||yk |⎟⎟⎟⎠

|aii | k=1 k=i+1

⎛ i−1 ⎞

y∞ ⎜⎜⎜ ⎜ n ⎟⎟⎟

≤ ⎜⎜⎝ |aik |sk + |aik |⎟⎟⎟⎠

|aii | k=1 k=i+1

= y∞ si .

Hence, we have proven the ﬁrst part of the theorem. Now suppose that A is a

strictly row diagonally dominant matrix. Then, it follows that

1

n

K := C J ∞ = max |aik | < 1

1≤i≤n |aii |

k=1

ki

05

4.3 The Jacobi and Gauss–Seidel Iterations 113

assume that s j ≤ K < 1 for all 1 ≤ j ≤ i − 1. With this, we can conclude that

⎛ i−1 ⎞ ⎛ i−1 ⎞

1 ⎜⎜⎜⎜ ⎟⎟⎟ 1 ⎜⎜⎜⎜ ⎟⎟⎟

n n

si = ⎜⎝⎜ |aik |sk + ⎟

|aik |⎟⎠⎟ ≤ ⎜⎜⎝K |aik | + |aik |⎟⎟⎟⎠

|aii | k=1 k=i+1

|aii | k=1 k=i+1

1

n

≤ |aik | ≤ K < 1,

|aii | k=1

ki

meaning also s ≤ C J ∞ .

problem described in Section 1.1.2. Essentially, we have to solve a linear sys-

tem with a matrix of the form

⎛ ⎞

⎜⎜⎜ 2 −1 0 ⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟

⎜⎜⎜−1 2 −1 ⎟⎟⎟

⎜⎜⎜⎜ −1 2 −1 ⎟⎟⎟⎟

⎜ ⎟⎟⎟

A = ⎜⎜⎜⎜ .. .. .. ⎟⎟⎟ = L + D + R, (4.9)

⎜⎜⎜ . . . ⎟

⎟

⎜⎜⎜⎜ ⎟⎟

⎜⎜⎜ −1 2 −1⎟⎟⎟⎟⎟

⎝ ⎠

0 −1 2

−1 on the super-diagonal. All other entries of R and L are zero. This means that

the Jacobi method computes the new iteration u( j+1) component-wise as

⎧ ( j)

⎪

⎪

⎪ u2 if i = 1,

( j+1) 1 2 1⎪

⎪

⎨ ( j)

ui = h fi + ⎪ ⎪

( j)

ui−1 + ui+1 if 2 ≤ i ≤ n − 1, (4.10)

2 2⎪

⎪

⎪

⎩u( j) if i = n

n−1

⎧ ( j)

⎪

⎪

⎪ u2 if i = 1,

1 2 ⎪

1⎪

⎨ ( j+1)

( j+1)

ui = h fi + ⎪ j)

ui−1 + u(i+1 if 2 ≤ i ≤ n − 1, (4.11)

2 2⎪

⎪

⎪

⎪

⎩u( j+1) if i = n.

n−1

Note that we have used u rather than x to denote the iterations since this is

more natural in this context.

To analyse the convergence, we need to determine the iteration matrices. Un-

fortunately, the matrix A is not strictly row diagonally dominant but only weak-

ly diagonally dominant and irreducible. Nonetheless, we can have a look at the

iteration matrices and their · ∞ norms.

05

114 Iterative Methods for Solving Linear Systems

⎛1 ⎞⎛ ⎞

⎜⎜⎜ 2 ⎟⎟⎟ ⎜⎜⎜0 1 ⎟⎟⎟

⎜⎜⎜⎜ 1 ⎟⎟⎟⎟ ⎜⎜⎜⎜1 0 1

⎟⎟⎟

⎟⎟⎟

⎜⎜⎜ 2 ⎟⎟⎟ ⎜⎜⎜ ⎟⎟⎟

⎜ .. ⎟⎟⎟ ⎜⎜⎜ .. .. ..

C J = −D−1 (L + R) = ⎜⎜⎜⎜ . ⎟⎟⎟ ⎜⎜⎜ . . . ⎟⎟⎟

⎜⎜⎜⎜ ⎟ ⎜

⎟⎟⎟⎟ ⎜⎜⎜⎜

⎟⎟⎟

⎜⎜⎜ 1

⎟⎠ ⎜⎝ 1 0 1⎟⎟⎟⎟

⎝ 2

1 ⎠

2 1 0

⎛ ⎞

⎜⎜⎜ 0 1

⎟⎟⎟

⎜⎜⎜ 1 2

1 ⎟⎟⎟

⎜⎜⎜ 2 0 2 ⎟⎟⎟

⎜⎜ .. .. .. ⎟⎟⎟

= ⎜⎜⎜⎜ . . . ⎟⎟⎟ , (4.12)

⎜⎜⎜ ⎟⎟

⎜⎜⎜ 1 1⎟ ⎟⎟

⎜⎝ 2 0 2⎟ ⎠⎟

1

2 0

showing that C J ∞ = 1, as expected, so that we cannot conclude convergence

from this. The situation is slightly better when it comes to the Gauss–Seidel

method. Here, we ﬁrst have to compute

⎛ ⎞−1 ⎛ 1 ⎞

⎜⎜⎜ 2 ⎟⎟⎟ ⎜⎜⎜ 2 0 ⎟⎟

⎟⎟⎟

⎜⎜⎜⎜−1 2 ⎟⎟⎟⎟ ⎜⎜⎜⎜ 1 1

⎟⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟ ⎜⎜⎜ 4 2

⎟⎟⎟

−1 ⎜⎜⎜ −1 2 ⎟⎟⎟ ⎜⎜⎜ 1 1 1

⎟⎟⎟ .

(L + D) = ⎜⎜ ⎟⎟⎟ = ⎜⎜⎜ 8 4 2

⎜⎜⎜ .. .. ⎟⎟⎟ ⎜⎜⎜ .. .. .. .. ⎟⎟⎟

⎜⎜⎜ . . ⎟

⎟ ⎜

⎜ . . . . ⎟⎟⎟

⎝⎜ ⎠⎟ ⎝⎜ 1 1⎠

⎟

0 −1 2 2n

1

2n−1

... 4 2

1

Note that there is a ﬁll in when computing the inverse. We cannot expect that

the inverse of a sparse matrix is sparse again. Here, however, we know that the

inverse at least has to be a lower triangular matrix.

With this, we can compute the iteration matrix of the Gauss–Seidel method:

⎛ ⎞

⎜⎜⎜0 12 ⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟

⎜⎜⎜0 14 1

⎟⎟⎟

⎜⎜

⎜

2

. ⎟⎟⎟⎟

CGS = −(L + D)−1 R = ⎜⎜⎜⎜⎜0 18 1 .. ⎟⎟⎟ , (4.13)

⎜⎜⎜ .

4

. .. ⎟⎟⎟⎟

⎜⎜⎜0 .. .. . 2 ⎟⎟⎟

1 ⎟

⎜⎜⎝ ⎟⎠

0 2n 2n−1 . . . 14

1 1

since multiplying by this R from the right means shifting all columns to the

right and ﬁlling the ﬁrst column up with zeros. This gives

1 −i 1

n−2

1 1 1

CGS ∞ = + + ··· + n = 2 = (1 − 2−n+1 ) < 1. (4.14)

4 8 2 4 i=0 2

This shows that the Gauss–Seidel method converges for our problem. How-

ever, the bound on the right-hand side becomes arbitrarily close to 1 if n tends

05

4.3 The Jacobi and Gauss–Seidel Iterations 115

to inﬁnity. Hence, we might expect that convergence slows down for larger n,

which is quite unfortunate since a larger n already means a higher computa-

tional cost per iteration.

To determine the convergence of the Jacobi method, we need to compute the

spectral radius and hence the eigenvalues of the iteration matrices. Since the

iteration matrix is given by C J = I − D−1 A we see that every eigenvector of A

is also an eigenvector of C J . To be more precise, if Ax = λx, then

λ λ

C J x = x − D−1 Ax = x − λD−1 x = x − x = 1 − x.

2 2

Hence, it suﬃces to determine the eigenvalues of A.

Remark 4.17 The matrix A ∈ Rn×n deﬁned by (4.9) has the eigenvalues

λ j = 2(1 − cos θ j ) = 4 sin2 (θ j /2) with corresponding eigenvectors

for the trigonometric functions. Now, using the above-deﬁned quantities yields

⎧

⎪

⎪

⎪ 2 sin θ j − sin(2θ j ) for i = 1,

⎪

⎪

⎨

(Ax j )i = ⎪

⎪ −sin((i − 1)θ j ) + 2 sin(iθ j ) − sin((i + 1)θ j ) for 2 ≤ i ≤ n − 1,

⎪

⎪

⎪

⎩−sin((n − 1)θ ) + 2 sin(nθ ) for i = n.

j j

Taking sin(0θ j ) = 0 and sin(n + 1)θ j = sin( jπ) = 0 into account, this can

uniformly be written as

From this remark and the above we can conclude the following.

Remark 4.18 The iteration matrix C J , given in (4.12), of the Jacobi method

has the eigenvalues

jπ

λj jπ

μj = 1 − = 1 − 2 sin 2

= cos , 1 ≤ j ≤ n.

2 2(n + 1) n+1

The eigenvectors are the same as in Remark 4.17.

05

116 Iterative Methods for Solving Linear Systems

Note that in both remarks we have used the identity cos(2α) = 1 − 2 sin2 α.

Taking the graph of the cos into account, we see that

π

ρ(C J ) = cos < 1. (4.16)

n+1

Hence, as expected, the Jacobi method converges. However, a Taylor expan-

sion of cos reveals

π2 π4

ρ(C J ) = 1 − + − ··· ,

2(n + 1) 2 4!(n + 1)4

which means that the convergence also slows down for larger n. If we com-

pare this with (4.14), it seems to be better and hence convergence of the Jacobi

method should be faster. But this is not true since in (4.14) we have only looked

at an upper bound for ρ(CGS ). To have a real comparison of both methods we

must compute ρ(CGS ), as well. This is slightly more complicated than comput-

ing ρ(C J ) , but the following can be shown.

Remark 4.19 The iteration matrix CGS given in (4.13) has the eigenvalues

λ j = cos2 θ j with corresponding eigenvectors

x j = (cos(θ j ) sin(θ j ), cos2 (θ j ) sin(2θ j ), . . . , cosn (θ j ) sin(nθ j ))T ,

where θ j = jπ/(n + 1).

This means that we have the relation λ j = μ2j between the eigenvalues λ j of

CGS and μ j of C J , leading also to ρ(CGS ) = ρ(C J )2 . Hence, the Gauss–Seidel

method converges faster than the Jacobi method.

The above relation between the spectral radius of the Gauss–Seidel method

and the spectral radius of the Jacobi method is always given if the matrix A is

consistently ordered. We will discuss this in more detail at the end of the next

section.

4.4 Relaxation

A further improvement of both methods can sometimes be achieved by relax-

ation. We start by looking at the Jacobi method. Here, the iterations can be

written as

x( j+1) = D−1 b − D−1 (L + R)x( j) = x( j) + D−1 b − D−1 (L + R + D)x( j)

= x( j) + D−1 (b − Ax( j) ).

The latter equality shows that the new iteration x( j+1) is given by the old iter-

ation x( j) corrected by the D−1 -multiple of the residual b − Ax( j) . In practice,

05

4.4 Relaxation 117

one often notices that the correction term is oﬀ the correct correction term by a

ﬁxed factor. Hence, it makes sense to introduce a relaxation parameter ω and

to form the new iteration rather as

ω D.

⎛ ⎞

ω ⎜⎜⎜⎜ n ⎟

( j) ⎟

( j+1)

xi ( j)

= xi + ⎜⎝bi − aik xk ⎟⎟⎟⎠ , 1 ≤ i ≤ n.

aii k=1

we use two vectors to realise the iteration. Obviously, to compute one iteration

requires O(n2 ) time for a full matrix A.

Input : A ∈ Rn×n invertible, b ∈ Rn , x(0) ∈ Rn , ω ∈ R.

Output: Approximate solution of Ax = b.

1 Compute r := b − Ax(0)

2 while r ≥ do

3 for i = 1 to n do

4 yi := bi

5 for k = 1 to n do

6 yi := yi − aik xk

7 yi := xi + ω · yi /aii

8 x := y

9 r := b − Ax

Of course, the relaxation parameter should be chosen such that the convergence

improves compared with the original Jacobi method. It follows from

= [(1 − ω)I − ωD−1 (L + R)]x( j) + ωD−1 b (4.18)

05

118 Iterative Methods for Solving Linear Systems

f1/5

λ1

λn

−3 −2 −1 1

fω ∗

f1

Theorem 4.21 Assume that C J = −D−1 (L+R) has only real eigenvalues λ1 ≤

λ2 ≤ · · · ≤ λn < 1 with corresponding eigenvectors z1 , . . . , zn . Then, C J (ω) has

the same eigenvectors z1 , . . . , zn , but with eigenvalues μ j = 1 − ω + ωλ j for

1 ≤ j ≤ n. The spectral radius of C J (ω) is minimised by choosing

2

ω∗ = . (4.20)

2 − λ1 − λ n

In the case of λ1 −λn relaxation converges faster than the Jacobi method.

Proof For every eigenvector z j of C J it follows that

C J (ω)z j = (1 − ω)z j + ωλ j z j = (1 − ω + ωλ j )z j ,

i.e. z j is an eigenvector of C J (ω) for the eigenvalue 1−ω+ωλ j =: μ j (ω). Thus,

the spectral radius of C J (ω) is given by

ρ(C J (ω)) = max |μ j (ω)| = max |1 − ω + ωλ j |,

1≤ j≤n 1≤ j≤n

which should be minimised. For a ﬁxed ω let us have a look at the function

fω (λ) := 1 − ω + ωλ, which is, as a function of λ, a straight line with fω (1) = 1.

For diﬀerent choices of ω we thus have a collection of such lines (see Figure

4.1) and it follows that the maximum in the deﬁnition of ρ(C J (ω)) can only

be attained for the indices j = 1 and j = n. Moreover, it follows that ω is

optimally chosen if fω (λ1 ) = − fω (λn ) or

1 − ω + ωλ1 = −(1 − ω + ωλn ).

This gives (4.20). Finally, we have the Jacobi method if and only if ω∗ = 1,

which is equivalent to λ1 = −λn .

05

4.4 Relaxation 119

Example 4.22 Let us have another look at the second example from the in-

troduction. We were concerned with solving a boundary value problem with

ﬁnite diﬀerences. The resulting system has been stated in Section 1.1.2. In the

last section, we have seen that the iteration matrix of the Jacobi method has

eigenvalues

jπ

μ j = cos , 1 ≤ j ≤ n.

n+1

Hence, the biggest eigenvalue is μ1 = cos(π/(n+1)) and the smallest eigenvalue

is given by μn = cos(nπ/(n + 1)).

This means that the relaxation of the Jacobi method has an iteration matrix

with eigenvalues

jπ

μ j (ω) = 1 − ω + ω cos .

n+1

However, since we have μn = −μ1 , Theorem 4.21 tells us that the optimal

relaxation parameter is ω∗ = 1 and hence the classical Jacobi method cannot

be sped up using relaxation.

An alternative interpretation of the Jacobi relaxation can be derived from

x( j+1) = (1 − ω)x( j) + ωC J x( j) + ωD−1 b

= (1 − ω)x( j) + ω(C J x( j) + D−1 b).

Hence, if we deﬁne z( j+1) = C J x( j) + D−1 b, which is one step of the classical

Jacobi method, the next iteration of the Jacobi relaxation method is

x( j+1) = (1 − ω)x( j) + ωz( j+1) ,

which is a linear interpolation between the old iteration and the new Jacobi

iteration.

This idea can be used to introduce relaxation for the Gauss–Seidel method, as

well. We start by looking at Dx( j+1) = b − Lx( j+1) − Rx( j+1) and replace the

iteration on the left-hand side by z( j+1) and then use linear interpolation again.

Hence, we set

Dz( j+1) = b − Lx( j+1) − Rx( j+1) ,

x( j+1) = (1 − ω)x( j) + ωz( j+1) .

Multiplying the second equation by D and inserting the ﬁrst one yields

Dx( j+1) = (1 − ω)Dx( j) + ωb − ωLx( j+1) − ωRx( j)

and hence

(D + ωL)x( j+1) = [(1 − ω)D − ωR] x( j) + ωb.

05

120 Iterative Methods for Solving Linear Systems

(SOR) method is given by

⎛ ⎞

ω ⎜⎜⎜⎜ ⎟⎟

( j) ⎟

i−1 n

( j+1)

xi ( j)

= xi + ⎜⎜⎝bi − ( j+1)

aik xk − aik xk ⎟⎟⎟⎠ , 1 ≤ i ≤ n.

aii k=1 k=i

Input : A ∈ Rn×n invertible, b ∈ Rn , x(0) ∈ Rn , ω ∈ R.

Output: Approximate solution of Ax = b.

1 Compute r := b − Ax(0)

2 while r ≥ do

3 for i = 1 to n do

4 xi := bi

5 for k = 1 to n do

6 xi := xi − aik xk

7 xi := xi + ω · xi /aii

8 r := b − Ax

Again, we have to deal with the question of how to choose the relaxation pa-

rameter. The next result shows that we do not have much of a choice.

Theorem 4.24 The spectral radius of the iteration matrix CGS (ω) of SOR

satisﬁes

ρ(CGS (ω)) ≥ |ω − 1|.

Proof The iteration matrix CGS (ω) can be written in the form

The ﬁrst matrix in this product is a normalised lower triangular matrix and the

second matrix is an upper triangular matrix with diagonal entries all equal to

05

4.4 Relaxation 121

the result follows from

|1 − ω|n = | det CGS (ω)| ≤ ρ(CGS (ω))n .

We will now show that for a positive deﬁnite matrix, ω ∈ (0, 2) is also suﬃcient

for convergence. Since ω = 1 gives the classical Gauss–Seidel method, we also

cover Theorem 4.15. The proof of the following theorem also shows that the

method is always consistent.

Theorem 4.25 Let A ∈ Rn×n be symmetric and positive deﬁnite. Then, the

SOR method converges for every relaxation parameter ω ∈ (0, 2).

Proof We have to show that ρ(CGS (ω)) < 1. To this end, we rewrite the

iteration matrix CGS (ω) in the form

CGS (ω) = (D + ωL)−1 [D + ωL − ω(L + D + R)]

−1

1

= I − ω(D + ωL)−1 A = I − D+L A

ω

= I − B−1 A,

with B := ω1 D + L. Let λ ∈ C be an eigenvalue of CGS (ω) with corresponding

eigenvector x ∈ Cn , which we assume to be normalised by x2 = 1. Then,

we have CGS (ω)x = (I − B−1 A)x = λx or Ax = (1 − λ)Bx. Since A is positive

deﬁnite, we must have λ 1. Hence, we can conclude

1 xT Bx

= T .

1 − λ x Ax

Since A is symmetric, we can also conclude that B + BT = ( ω2 − 1)D + A, such

that the real part of 1/(1 − λ) satisﬁes

⎧ T ⎫

1 1 xT (B + BT )x 1 ⎪⎪

⎨ 2 x Dx ⎪

⎪

⎬ 1

= = ⎪

⎪ − 1 + 1 ⎪ > 2,

⎪

1−λ 2 T

x Ax ⎩

2 ω T

x Ax ⎭

xT Dx/xT Ax. The latter follows since the diagonal entries of a positive deﬁnite

matrix have to be positive. If we write λ = u + iv then we can conclude that

1 1 1−u

< =

2 1−λ (1 − u)2 + v2

and hence |λ|2 = u2 + v2 < 1.

To ﬁnd an optimal parameter ω∗ is in general diﬃcult. However, the situation

is better if the matrix is consistently ordered. We have mentioned the term

05

122 Iterative Methods for Solving Linear Systems

There, we have seen that in the speciﬁc situation of our boundary value model

problem from Section 1.1.2, we have the relation ρ(C J )2 = ρ(CGS ) between

the spectral radii of the Jacobi and Gauss–Seidel methods. We pointed out that

this always holds when A is consistently ordered. Hence, it is now time to make

this more precise.

ordered if the eigenvalues of the matrix

deﬁnition meaningful. This condition has already been imposed throughout

this entire section. Note also that all the matrices C(α) have the same eigenval-

ues as the matrix

C(1) = −D−1 (L + R) = C J ,

i.e. as the iteration matrix of the Jacobi method, provided that A is consistently

ordered.

⎛ ⎞

⎜⎜⎜a1 c1 0 ⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟

⎜⎜⎜b2 a2 c2 ⎟⎟⎟

⎜⎜⎜ .. .. .. ⎟⎟⎟

A = ⎜⎜⎜⎜⎜ . . . ⎟⎟⎟ ∈ Rn×n

⎟⎟⎟

⎜⎜⎜ . .. ⎟⎟

⎜⎜⎜ .. . cn−1 ⎟⎟⎟⎟⎟

⎜⎜⎝ ⎠

0 bn an

with ai 0 for 1 ≤ i ≤ n is consistently ordered. To see this, note that

⎛ ⎞

⎜⎜⎜ 0 r1 0 ⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟

⎜⎜⎜2 0 r2 ⎟⎟⎟

⎜⎜⎜⎜ .. .. .. ⎟⎟⎟⎟

C(1) = −D (L + R) = ⎜⎜⎜⎜

−1 . . . ⎟⎟⎟

⎟⎟⎟

⎜⎜⎜

⎜⎜⎜ . . . . . r ⎟⎟⎟⎟

.

⎜⎜⎝ n−1 ⎟⎟⎠

0 n 0

with ri = −ci /ai and i = −bi /ai . Hence, if we deﬁne a diagonal matrix S (α) =

diag(1, α, α2 , . . . , αn−1 ) with α 0, we see that C(α) = S (α)C(1)S −1 (α), which

has the same eigenvalues as C(1).

05

4.4 Relaxation 123

The following theorem shows that there is an intrinsic relation between the

eigenvalues of the iteration matrices of the SOR method and the Jacobi method.

It also shows that, for consistently ordered matrices, the Gauss–Seidel method

converges faster than the Jacobi method, as long as the Jacobi method con-

verges.

Theorem 4.28 Let A ∈ Rn×n be consistently ordered. Let CGS (ω), ω ∈ (0, 2),

denote the iteration matrix of the SOR method and C J denote the iteration

matrix of the Jacobi method. Then, μ ∈ C \ {0} is an eigenvalue of CGS (ω) if

and only if

μ+ω−1

λ= √

ω μ

is an eigenvalue of C J . In particular, we have

ρ(CGS ) = ρ(C j )2 .

= μ(I + ωD−1 L) − D−1 (D + ωL)CGS (ω)

= μ(I + ωD−1 L) − D−1 [(1 − ω)D − ωR]

= (μ − (1 − ω))I + ωD−1 (μL + R)

= (μ − (1 − ω))I + ωμ1/2 D−1 μ1/2 L + μ−1/2 R .

Since the determinant of I + ωD−1 L is one, we can conclude from this that

μ 0 is an eigenvalue of CGS (ω) if and only if

!

det (μ − (1 − ω))I + ωμ1/2 D−1 μ1/2 L + μ−1/2 R = 0,

i.e. if

μ − (1 − ω)

λ :=

ωμ1/2

is an eigenvalue of −D−1 μ1/2 L + μ−1/2 R . But since A is consistently ordered

the eigenvalues of the latter matrix coincide with the eigenvalues of C J .

Finally, we have for Gauss–Seidel CGS = CGS (1), i.e. ω = 1 and hence μ 0

√

is an eigenvalue of CGS if and only if μ is an eigenvalue of C J . Thus we have

ρ(CGS ) = ρ(C J )2 .

05

124 Iterative Methods for Solving Linear Systems

If the matrix A is consistently ordered then it is even possible to show that the

optimal parameter is given by

2

ω∗ = ,

1 + 1 − ρ(C J )2

where C J is the iteration matrix of the Jacobi method, for details see Young

[140]. Note that this means that ω∗ ∈ (1, 2), justifying the name successive

over relaxation.

Usually, ρ(C J ) is unknown. A possible remedy to this is to use one of the

methods to compute the largest eigenvalue, which we will discuss in the next

chapter, particularly the so-called power method to compute an approximation

to ρ(C J ).

Example 4.29 Coming back to our boundary value problem from

Section

π

1.1.2, we know from Remark 4.18 and (4.16) that ρ(C J ) = cos n+1 = cos(πh)

with h = 1/(n + 1). A Taylor expansion then yields

2 2

ω∗ = =

1 + 1 − cos2 (πh) 1 + sin(πh)

5 4

= 2 − 2πh + 2(πh)2 − (πh)3 + (πh)4 + O(h5 )

3 3

2π 1

=2− +O .

n+1 (n + 1)2

Note that this comes arbitrarily close to 2 with n becoming large. Nonetheless,

because of Theorem 4.24 we cannot simply choose ω = 2.

Example 4.30 Let us stick with our second model problem from Section

1.1.2. We already know from (4.10) and (4.11) what the Jacobi and Gauss–

Seidel methods look like. It is similarly easy to see that the relaxed versions

become

⎧ ( j)

⎪

⎪

⎪ u2 if i = 1,

( j+1) ( j) ω 2 ω⎪ ⎪

⎨ ( j)

ui = (1 − ω)ui + h fi + ⎪ ⎪

( j)

ui−1 + ui+1 if 2 ≤ i ≤ n − 1,

2 2⎪⎪

⎪

⎩u( j) if i = n

n−1

⎧ ( j)

⎪

⎪

⎪ u2 if i = 1,

( j+1) ( j) ω 2 ω⎪⎪

⎨ ( j+1)

ui = (1 − ω)ui + h fi + ⎪ ⎪

j)

ui−1 + u(i+1 if 2 ≤ i ≤ n − 1,

2 2⎪⎪

⎪

⎩u ( j+1)

if i = n

n−1

Let us now assume that we want to solve the original problem −u = f on (0, 1)

05

4.5 Symmetric Methods 125

with u(0) = u(1) = 0 and now the right-hand side f (x) = sin(πx). Then, the true

solution is given by u(x) = π12 sin(πx). Let us run our simulations with n = 10.

Since we know that in this situation relaxation of the Jacobi method does not

improve convergence, we restrict ourselves to the Jacobi, Gauss–Seidel and

SOR methods. For the latter, we know that the best relaxation parameter is,

with h = 1/(n + 1), given by

2

ω∗ = ≈ 1.5603879 . . . .

1 + sin(πh)

In Figure 4.2 we have plotted the residual Ax( j) − b2 for these three methods.

Clearly, as predicted, the Gauss–Seidel method converges faster than the Jacobi

method, but the SOR method is by far the fastest method. Nonetheless, each

method needs signiﬁcantly more than n = 10 steps to reach machine precision.

10−3 ×

2.0

1.5

1.0

0.5

0

0 10 20 30 40 50 60 70 80 90

Figure 4.2 The convergence of the Jacobi, Gauss–Seidel (dashed) and SOR (dot-

ted) methods. The number of iterations is on the x-axis, while the residual is on

the y-axis.

We started discussing iterative methods for linear systems by using the iteration

05

126 Iterative Methods for Solving Linear Systems

and our goal was to choose the matrix B as an easy computable approximation

to the inverse of A. If the initial matrix A is symmetric and positive deﬁnite, it

would hence be good to choose B also as symmetric and positive deﬁnite.

if for a symmetric (and positive deﬁnite) matrix A ∈ Rn×n the matrix B is also

symmetric (and positive deﬁnite).

This will become even more important later on. Obviously, the matrix B = D

of the Jacobi method is symmetric but the matrix B = L + D of the Gauss–

Seidel method is not, unless the strict lower part L of A is completely zero.

Both statements are also true for the relaxed versions of the Jacobi and Gauss–

Seidel method.

It is now our goal to derive a symmetric version of the Gauss–Seidel method.

We recall that the Gauss–Seidel iteration is given by

x( j+1) = −(D + L)−1 Rx( j) + (D + L)−1 b =: CGS x( j) + cGS , (4.23)

if A = L + D + R is the usual decomposition of A into its strictly sub-diagonal

part L, its diagonal part D and its strictly upper-diagonal part R. We could,

however, also deﬁne a backward Gauss–Seidel method by iterating according

to

x( j+1) = −(D + R)−1 Lx( j) + (D + R)−1 b =: C BGS x( j) + cBGS . (4.24)

This does also not lead to a symmetric method but we could combine the two

steps into one step. Hence, if we ﬁrst do a “half step” according to (4.23) and

then another “half step” according to (4.24), we have the iteration

!

x( j+1) = C BGS CGS x( j) + cGS + cBGS

= C BGS CGS x( j) + (C BGS cGS + cBGS )

= CS GS x( j) + cS GS

with the iteration matrix CS GS and vector cS GS given by

CS GS := C BGS CGS = (D + R)−1 L(D + L)−1 R,

cS GS := C BGS cGS + cBGS

= −(D + R)−1 L(D + L)−1 b + (D + R)−1 b

!

= (D + R)−1 I − L(D + L)−1 b

= (D + R)−1 D(D + L)−1 b.

To see that this is indeed a symmetric method, we have to rewrite this iteration

in the form of (4.22), i.e. we need to ﬁnd the associated matrix B = BS GS .

05

4.5 Symmetric Methods 127

Before that, we will introduce the relaxation of this method. It is simply de-

ﬁned to be the just-introduced combination of the relaxed Gauss–Seidel and

backward Gauss–Seidel methods. The latter is relaxed in exactly the same way

as the Gauss–Seidel method is. Hence, the relevant matrices and vectors are

given as

−1

1 1

CGS (ω) = (D + ωL)−1 [(1 − ω)D − ωR] = D+L −1 D−R ,

ω ω

−1

−1 1 1

C BGS (ω) = (D + ωR) [(1 − ω)D − ωL] = D+R −1 D−L ,

ω ω

−1

1

cGS (ω) = (D + ωL)−1 ωb = D + L b,

ω

−1

−1 1

cBGS (ω) = (D + ωR) ωb = D + R b.

ω

Deﬁnition 4.32 The symmetric Gauss–Seidel relaxation for solving the linear

system Ax = b is given by

where

cS GS (ω) := C BGS (ω)cGS (ω) + cBGS (ω).

In the case of a symmetric matrix A, the method is also called the symmetric

successive over-relaxation (SSOR) method.

Obviously, the symmetric Gauss–Seidel method makes mainly sense for sym-

metric matrices. In this case, the notation further simpliﬁes since we have

R = LT . Nonetheless, for the time being we will stick to the general notation.

The ﬁrst step in understanding this method is to show that it is indeed an it-

erative method of the form (4.22), i.e. we have to ﬁnd the deﬁning matrix

B = BS GS (ω) and to show that CS GS (ω) = I − B−1 S GS (ω)A. To ﬁnd BS GS

we have to investigate the iteration vector cS GS (ω) which has to be given by

cS GS (ω) = B−1

S GS (ω)b. With

1 1

BGS (ω) = D+L and BBGS (ω) = D+R

ω ω

we obviously have

−1

CGS (ω) = I − BGS (ω)A and C BGS (ω) = I − B−1

BGS (ω)A,

05

128 Iterative Methods for Solving Linear Systems

which is easily veriﬁed and shows again that both methods are of the form

(4.22). Furthermore, we have from the deﬁnition of cS GS (ω) that

!

−1

cS GS (ω) = C BGS (ω)cGS (ω) + cBGS (ω) = C BGS (ω)BGS (ω) + B−1

BGS (ω) b

which implies

B−1 −1 −1

S GS (ω) = C BGS (ω)BGS (ω) + BBGS (ω),

can conclude that the symmetric Gauss–Seidel method is indeed a splitting

scheme of the form (4.22) since

I − B−1 −1 −1

S GS (ω)A = I − C BGS (ω)BGS (ω) + BBGS (ω) A

= I − B−1 −1

BGS (ω)A − C BGS (ω)BGS (ω)A

= C BGS (ω) − C BGS (ω)(I − CGS (ω))

= C BGS (ω)CGS (ω) = CS GS (ω).

Finally, we have to verify that all the previous operations are justiﬁed. To this

end, we note that we have for ω 2 that

−1 1 −1

= BBGS (ω) − 1 D − L BGS (ω)b + B−1

BGS (ω)b

ω

⎡

−1 −1 ⎤

⎢⎢⎢ 1 ⎥⎥

D + L + I ⎥⎥⎥⎦ b

1 1

= D + R ⎢⎢⎣ −1 D−L

ω ω ω

−1 ⎡ −1 −1 ⎤

⎢⎢⎢ 1 ⎥⎥

D + L ⎥⎥⎥⎦ b

1 1 1 1

= D + R ⎢⎢⎣ D − D − L D+L + D+L

ω ω ω ω ω

−1 −1

1 1 1 1

= D+R D−D−L+ D+L D+L b

ω ω ω ω

−1 −1

2 1 1

= −1 D+R D D + L b,

ω ω ω

from which we can conclude that

ω 1 −1 1

BS GS (ω) = D+L D D+R

2−ω ω ω

ω

= BGS (ω)D−1 BBGS (ω). (4.25)

2−ω

This has several further consequences, which we will collect now.

05

4.5 Symmetric Methods 129

For each ω ∈ (0, 2), the SSOR method is well-deﬁned and a symmetric splitting

method.

(0, 2). Furthermore, since A is symmetric, we have BGS (ω) = BTBGS (ω). Hence,

(4.25) implies that BS GS (ω) is symmetric. Since D is a diagonal matrix with

positive entries, (4.25) also implies that BS GS (ω) is positive deﬁnite.

is helpful to recall the following context. If the matrix A is positive deﬁnite

and symmetric it possesses a square root A1/2 which is also positive deﬁnite

and symmetric and satisﬁes A = A1/2 A1/2 . According to Example 2.29, the

induced matrix norm of the vector norm x → A1/2 x2 is given by CA1/2 =

A1/2CA−1/2 2 .

Theorem 4.34 Let A ∈ Rn×n be symmetric and positive deﬁnite. Then, the

symmetric Gauss–Seidel relaxation converges for each ω ∈ (0, 2) and

Proof The matrix CS GS (ω) = C BGS (ω)CGS (ω) has the same eigenvalues as

the matrix

! !

A1/2CS GS (ω)A−1/2 = A1/2C BGS (ω)A−1/2 A1/2CGS (ω)A−1/2 .

!

A1/2C BGS (ω)A−1/2 = A1/2 I − B−1

BGS (ω)A A

−1/2

= I − A1/2 B−1

BGS (ω)A

1/2

!T !T

−1

= I − A1/2 BGS (ω)A1/2 = A1/2CGS (ω)A−1/2 ,

which shows

!T !

A1/2CS GS (ω)A−1/2 = A1/2CGS (ω)A−1/2 A1/2CGS (ω)A−1/2 .

From this, we can conclude that A1/2CS GS (ω)A−1/2 is symmetric and positive

deﬁnite and has therefore only non-negative eigenvalues. Furthermore, with

Example 2.29 and Corollary 2.21 we ﬁnd

= A1/2CS GS (ω)A−1/2 2 = CS GS (ω)A1/2

!T !

= A1/2CGS (ω)A−1/2 A1/2CGS (ω)A−1/2

2

= A 1/2

CGS (ω)A−1/2 22 = CGS (ω)2A1/2 .

It remains to show that the spectral radius is less than one, i.e., we have to

05

130 Iterative Methods for Solving Linear Systems

show that A1/2CGS (ω)A−1/2 2 < 1. Since the · 2 -norm of the matrix C(ω) :=

A1/2CGS (ω)A−1 is given by the square root of the maximum eigenvalue of

C(ω)TC(ω), we have to compute the latter. To this end, we use again that

T

BGS (ω) = BBGS (ω),

2

BGS (ω) + BBGS (ω) = A + −1 D

ω

and

!

C(ω) = A1/2CGS (ω)A−1/2 = A1/2 I − BGS

−1

(ω)A A−1/2 = I − A1/2 BGS

−1

(ω)A1/2

to derive

!T !

−1 −1

C(ω)TC(ω) = I − A1/2 BGS (ω)A1/2 I − A1/2 BGS (ω)A1/2

! !

= I − A1/2 B−1

BGS (ω)A

1/2 −1

I − A1/2 BGS (ω)A1/2

!

= I − A1/2 B−1 −1

BGS (ω) + BGS (ω) A

1/2

+ A1/2 B−1 −1

BGS (ω)ABGS (ω)A

1/2

2

= I − A1/2 B−1

BGS (ω) A + − 1 D BGS −1

(ω)A1/2

ω

+ A1/2 B−1 −1

BGS (ω)ABGS (ω)A

1/2

2

= I− A1/2 B−1 −1

BGS (ω)DBGS (ω)A

1/2

−1

ω

2

=: I − − 1 B(ω).

ω

The newly introduced matrix B(ω) is, because of

!T !

B(ω) := A1/2 B−1 −1

BGS (ω)DBGS (ω)A

1/2 −1

= D1/2 BGS (ω)A1/2 −1

D1/2 BGS (ω)A1/2 ,

symmetric and positive deﬁnite. If λ ∈ R is an eigenvalue of C(ω)TC(ω) with

corresponding eigenvector x ∈ Rn satisfying x2 = 1 then ω ∈ (0, 2) implies

2

0 ≤ λ = xTC(ω)TC(ω)x = 1 − − 1 xT B(ω)x < 1.

ω

Exercises

4.1 Let F : Rn → Rn be a contraction mapping with contraction number

q ∈ (0, 1) and ﬁxed point x∗ ∈ Rn . Let G : Rn → Rn be a mapping

satisfying F(x) − G(x) ≤ for all x ∈ Rn . Show that the iteration

y j+1 := G(y j ) satisﬁes the error bound

1

x∗ − y j ≤ + q j y0 − y1 .

1−q

05

Exercises 131

4.2 Show that the following modiﬁcation of Banach’s ﬁxed point theorem is

true. Let D ⊆ Rn be a compact set. Let F : D → D be a contraction

mapping. Then, F has exactly one ﬁxed point x∗ ∈ D and the sequence

x j+1 := F(x j ) converges to x∗ for each starting point x0 ∈ D.

Use Banach’s ﬁxed point theorem

√ to show that the following functions

have a ﬁxed point f (x) = 2 + x, x ≥ 0, f (x) = e−x , x ∈ R, f (x1 , x2 ) :=

T

2 1 −x1

5 (x1 + x2 ), 2 e , x = (x1 , x2 ) ∈ R2 . Compute the ﬁxed point for the

1 2

ﬁrst function and show that the sequence x j+1 = f (x j ) converges for

every x0 ∈ R in the case of the second function.

4.3 Show that a matrix A ∈ Rn×n is reducible if and only if there are a permu-

tation matrix P ∈ Rn×n , an index k ∈ {1, . . . , n−1} and matrices B ∈ Rk×k ,

C ∈ R(n−k)×k and D ∈ R(n−k)×(n−k) such that

B O

PT AP = .

C D

4.4 Show that a matrix A ∈ Rn×n is irreducible if and only if we can ﬁnd for

each pair of indices i, j ∈ {1, . . . , n} a chain j0 , . . . , jk ∈ {1, . . . , n} with

j0 = i, jk = j and a j , j+1 0 for 0 ≤ ≤ k − 1.

4.5 Let A ∈ Rn×n satisfy aii > 0 and ai j ≤ 0 for i j. Let C J = −D−1 (L + R)

be the iteration matrix of the Jacobi method for A = L + D + R. Show

that ρ(C J ) < 1 if and only if A has an inverse with non-negative entries

(such a matrix is called an M-matrix).

4.6 Let A ∈ Rn×n be symmetric, irreducible and weakly diagonally domi-

nant. Assume that all diagonal elements of A are positive. Show that A is

positive deﬁnite.

4.7 Let A ∈ Rn×n with n ≥ 2 have diagonal elements aii = 1 and oﬀ-diagonal

elements ai j = a, i j.

• Determine the iteration matrix CGS of the Gauss–Seidel method for A.

Calculate the eigenvalues of CGS and A.

• Determine those a ∈ R, for which the Gauss–Seidel method con-

verges. For which a ∈ R is A positive deﬁnite?

4.8 A simple residual correction method to solve Ax = b is the so-called

Richardson iteration, given by x j+1 = (I − A)x j + b = x j + (b − Ax j ).

Its relaxation is given by x j+1 = x j + ω(b − Ax j ). Determine the iteration

matrix C(ω) of this method. Assume that A has only real eigenvalues.

Determine ρ(C(ω)) and the optimal relaxation parameter.

05

5

Calculation of Eigenvalues

we are mathematically confronted with eigenvalue problems. Initially, these

are eigenvalue problems for certain diﬀerential operators but via discretisation

they can be recast as eigenvalue problems for linear systems. We have also

seen that eigenvalues play an important role when studying the conditioning of

linear systems. Hence, in this chapter we are interested in ﬁnding solutions to

the problem

Ax = λx,

Cn \ {0} are what we are looking for.

Recall from linear algebra that the eigenvalues of A are the zeros of the char-

acteristic polynomial

p(λ) := det(A − λI).

quency of the vibration of the mechanical system. The corresponding eigen-

vector gives the corresponding shape mode or mode of vibration.

There are diﬀerent eigenvalue problems depending on the nature of the un-

derlying problem. For each of them, there are diﬀerent solution techniques.

Typical problems are as follows.

ing eigenvector.

2. Computation of all eigenvalues without the corresponding eigenvectors.

3. Computation of one eigenvalue together with a corresponding eigenvector.

4. Computation of several (or all) eigenvalues and the corresponding eigen-

vectors.

132

06

5.1 Basic Localisation Techniques 133

The numerical methods also depend on the special form of the matrix. Partic-

ularly, whether the matrix A is symmetric (or Hermitian) or not. In this book,

we will mainly concentrate on symmetric matrices since methods for comput-

ing the eigenvalues and eigenvectors are much easier in this case. Moreover,

eigenvalues and eigenvectors are real in this situation.

As in the case of solving linear systems, one can distinguish between direct

and iterative methods. In direct methods, ﬁrst the characteristic polynomial

and then its zeros are determined. However, more popular are iterative meth-

ods, which try to compute successive approximations to the eigenvectors and

eigenvalues without computing the characteristic polynomial.

Before discussing methods for computing eigenvalues and eigenvectors, we

will address basic localisation techniques. They sometimes already give good

approximations of the eigenvalues.

The ﬁrst way of estimating the location of all eigenvalues works for an arbitrary

matrix.

tained in the union of the discs

⎧ ⎫

⎪

⎪

⎪ ⎪

⎪

⎪

⎪

⎪

⎨ n ⎪

⎪

⎬

Dj = ⎪

⎪ λ ∈ C : |λ − a | ≤ |a | ⎪

⎪ , 1 ≤ j ≤ n.

⎪

⎪

⎪

j j jk

⎪

⎪

⎪

⎩ k=1

k j

⎭

Cn \ {0} with x∞ = max1≤ j≤n |x j | = 1. From Ax − λx = 0 we can conclude

that

n

(a j j − λ)x j + a jk xk = 0, 1 ≤ j ≤ n.

k=1

k j

"" ""

"" n ""

"" "

n n

|a j j − λ| = |(a j j − λ)x j | = " a jk xk "" ≤ |a jk | |xk | ≤ |a jk |.

"" ""

" k j

k=1 " k j

k=1 k=1

k j

Corollary 5.2 Let A ∈ Rn×n be symmetric and strictly row diagonally domi-

nant with positive diagonal elements. Then, A is positive deﬁnite.

06

134 Calculation of Eigenvalues

row diagonally dominant with positive diagonal elements, we have

n

r j := |a jk | < a j j , 1 ≤ j ≤ n.

k=1

k j

an index j with |λ − a j j | ≤ r j and hence

0 < a j j − r j ≤ λ.

Theorem 5.3 (Gershgorin) If all the discs D j are disjoint, then each of them

contains exactly one eigenvalue of A. If p discs D j form a connected set P

which is disjoint to all other discs, then P contains exactly p eigenvalues of A.

Proof The ﬁrst statement follows immediately from the second one. Hence,

we will concentrate on the second statement. Let r j denote the radius of D j ,

i.e. r j = k j |a jk |. Let P denote the union of the remaining discs D j which are

disjoint to P. Then, both P and P are closed, disjoint subsets of C and hence

there is an 0 > 0 such that for all x ∈ P and all y ∈ P we have |x − y| > 0 .

Next we deﬁne for t ∈ [0, 1] the matrix A(t) = (ai j (t)) by setting aii (t) =

aii and ai j (t) = tai j if i j. Obviously, we have A(1) = A and A(0) =

diag(a11 , . . . , ann ). Furthermore, the Gershgorin discs D j (t) associated with this

matrix A(t) are also centred at a j j but have radius tr j , which means D j (t) ⊆ D j

for all t ∈ [0, 1] and hence P(t) ⊆ P and P (t) ⊆ P . Here, P(t) denotes the

union of the p discs D j (t) corresponding to those in P. Note that Theorem 5.1

states that all eigenvalues of A(t) must be contained in the union of P(t) and

P (t).

Let us now deﬁne

For every t in that set, we also have that P (t) contains exactly n− p eigenvalues.

Our proof is ﬁnished once we have shown that t0 = 1. We immediately see

that 0 belongs to the set since A(0) is a diagonal matrix and P(0) contains

exactly p elements, the eigenvalues of A(0). Hence, the set in (5.1) is not empty.

Next, notice that A(t) depends continuously on t. Since the eigenvalues of A(t)

are the zeros of the corresponding characteristic polynomial, they also depend

continuously on t. Hence, to every > 0 there is a δ > 0 such that |λ j (t) −

λ j (t0 )| < for all t ∈ [0, 1] with |t − t0 | < δ and 1 ≤ j ≤ n, provided we sort the

eigenvalues consistently. We will use this for the above given = 0 .

06

5.1 Basic Localisation Techniques 135

The deﬁnition of t0 allows us to pick a t ∈ [0, 1] with t0 − δ < t < t0 such that

P(t) contains exactly p eigenvalues of A(t) and P (t) contains the remaining n−

p eigenvalues. The deﬁnition of 0 and the choice of δ now guarantee that also

P(t0 ) contains exactly p eigenvalues of A(t0 ) and P (t0 ) contains the remaining

n − p eigenvalues. Hence, the supremum in (5.1) is actually a maximum, i.e. t0

belongs to the set in (5.1).

Let us now assume t0 < 1. Then, we can again choose = 0 and a corre-

sponding δ > 0 as above. This means, however, that for every t ∈ [0, 1] with

t0 < t < t0 + δ we have |λ j (t) − λ j (t0 )| < 0 for every 1 ≤ j ≤ n. The choice

of 0 guarantees once again that we now also have exactly p eigenvalues of

A(t) in P(t), which contradicts the deﬁnition of t0 . Hence, t0 = 1 and P = P(1)

contains exactly p eigenvalues of A = A(1).

In this section, we want to address two additional questions. How does a small

change in the elements of the matrix A change the eigenvalues of A? What

can we say about the location of an eigenvalue if we have an approximate

eigenvalue?

To answer these questions we need sharper estimates on the eigenvalues, which

we will discuss now. To derive them, we will restrict ourselves mainly to sym-

metric matrices. Most of the results also carry immediately over to Hermitian

T T

matrices and normal matrices, i.e. matrices which satisfy AA = A A. In all

other cases the situation is more complicated and, though the eigenvalues still

depend continuously on the matrix entries, it is very likely that a small pertur-

bation of the matrix may lead to signiﬁcant changes in its eigenvalues.

Theorem 5.4 (Rayleigh) Let A ∈ Rn×n be symmetric with eigenvalues λ1 ≥

λ2 ≥ · · · ≥ λn and corresponding orthonormal eigenvectors w1 , . . . , wn ∈ Rn .

For 1 ≤ j ≤ n let W j := span{w1 , . . . , w j } and W ⊥j := span{w j+1 , . . . , wn }. Let

also W0⊥ := Rn . Then,

xT Ax

λ j = max , (5.2)

⊥

x∈W j−1 xT x

x0

xT Ax

λ j = min (5.3)

x∈W j xT x

x0

for 1 ≤ j ≤ n.

Proof Every x ∈ Rn has a representation of the form x = nk=1 ck wk with

ck = xT wk . This means particularly xT x = nk=1 |ck |2 , Ax = nk=1 ck λk wk and

xT Ax = nk=1 |ck |2 λk .

If x ∈ W ⊥j−1 = span{w j , . . . , wn } then we have particularly c1 = · · · = c j−1 = 0

06

136 Calculation of Eigenvalues

n

xT Ax k= j |ck |2 λk λ j xT x

= ≤ = λ j. (5.4)

xT x xT x xT x

Hence, the maximum of the quotient over all x ∈ W ⊥j−1 is less than or equal to

λ j . However, the maximum is attained for x = w j ∈ W ⊥j−1 , which shows (5.2).

If, however, x ∈ W j = span{w1 , . . . , w j } then we have this time c j+1 = · · · =

cn = 0 and hence, again under the assumption that x 0,

j

k=1 |ck | λk λ j xT x

2

xT Ax

= ≥ = λ j. (5.5)

xT x xT x xT x

This shows that the minimum of the quotient over all x ∈ W j is greater than or

equal to λ j and since the minimum is attained for x = w j , we have the equality

(5.3).

An immediate consequence of this result is the following minimum maximum

principle.

Theorem 5.5 (Courant–Fischer) Let A ∈ Rn×n be symmetric with eigenvalues

λ1 ≥ λ2 ≥ . . . ≥ λn . For 1 ≤ j ≤ n let M j denote the set of all linear subspaces

of Rn of dimension j. Then,

xT Ax

λj = min max , (5.6)

U∈Mn+1− j x∈U xT x

x0

T

x Ax

λ j = max min (5.7)

U∈M j x∈U xT x

x0

for 1 ≤ j ≤ n.

Proof We will only prove (5.6) since the second equality is proven in a very

similar fashion.

Let w j , W j and W ⊥j be deﬁned as in Theorem 5.4. Let us denote the right-hand

side of (5.6) with ρ j . Since W ⊥j−1 = span{w j , . . . , wn } ∈ Mn− j+1 we immedi-

ately have ρ j ≤ λ j . Moreover, the space W j = span{w1 , . . . , w j } obviously has

dimension j and an arbitrary U ∈ Mn+1− j has dimension n + 1 − j. Thus, the

dimension formula yields

dim W j ∩ U = dim(W j ) + dim(U) − dim W j + U ≥ 1.

Hence, for each U ∈ Mn+1− j there is an x ∈ W j ∩ U, x 0. This means that

j

this x has the representation x = k=1 ck wk , and we can derive, as before,

j

k=1 |ck | λk

2

xT Ax

= ≥ λj

xT x xT x

06

5.1 Basic Localisation Techniques 137

The inertia of a symmetric matrix A ∈ Rn×n is a triple (n+ , n0 , n− ) with n+ +

n− + n0 = n, where n+ , n− , n0 denote the numbers of positive, negative and

zero eigenvalues of A. The result will be of importance for us later on, when

discussing preconditioning.

singular. Then, A and B := S T AS have the same inertia, i.e. the same numbers

of positive, negative and zero eigenvalues.

Proof We start by noting that the matrix S T S is positive deﬁnite and denote

its largest and its smallest eigenvalues by λmax (S T S ) and λmin (S T S ), respec-

tively. Both are positive; they are the squares of the largest and smallest singu-

lar values of S and satisfy

xT S T S x yT y

λmin (S T S ) = min T

= min T −T −1 ,

x0 x x y0 y S S y

T T T

x S Sx y y

λmax (S T S ) = max = max T −T −1

x0 xT x y0 y S S y

due to Theorem 5.4.

Next, let us denote the eigenvalues of A and B by λ1 (A) ≥ λ2 (A) ≥ · · · ≥ λn (A)

and λ1 (B) ≥ λ2 (B) ≥ · · · ≥ λn (B).

By (5.7), we have

xT S T AS x yT Ay

λ j (B) = max min T

= max min T −T −1

U∈M j x∈U x x U∈M j y∈S (U) y S S y

x0 y0

using the substitution y = S x again and noting that S (U) is also in M j . Thus,

if w1 , . . . , wn denote once again an orthonormal basis consisting of the eigen-

vectors to the eigenvalues of A and setting again W j = span{w1 , . . . , w j }, we

can choose U := S −1 (W j ) ∈ M j . This, together with (5.5), consequently yields

yT Ay yT y yT y

λ j (B) ≥ min ≥ λ j (A) min

y∈W j yT y yT S −T S −1 y y∈W j yT S −T S −1 y

y0 y0

T

y y xT S T S x

≥ λ j (A) min = λ j (A) min

y0 yT S −T S −1 y x0 xT x

= λ j (A)λmin (S S ),

T

06

138 Calculation of Eigenvalues

xT S T AS x yT Ay

λ j (B) = min max T

= min max T −T −1

U∈Mn+1− j x∈U x x U∈Mn+1− j y∈S (U) y S S y

x0 y0

T T

y Ay y y yT y

≤ max −T −1

≤ λ j (A) max

⊥

y∈W j−1 T T

y y y S S y y∈W j−1 y S −T S −1 y

⊥ T

y0 y0

T

y y

≤ λ j (A) max

yT S T S −1 y

y0

= λ j (A)λmax (S T S ).

Both bounds together yield

λmin (S T S )λ j (A) ≤ λ j (B) ≤ λmax (S T S )λ j (A) (5.8)

and since λmin (S T S ) and λmax (S T S ) are both positive, λ j (A) and λ j (B) must

have the same sign.

The proof we have just given actually shows more than we have stated, namely

(5.8), which is also called Ostrowski’s quantitative version of Sylvester’s theo-

rem, see for example [98] and Wimmer [136].

Another consequence of Theorem 5.5 is that it allows us to give a stability

result on the eigenvalues of a symmetric matrix.

Corollary 5.7 Let A ∈ Rn×n and B ∈ Rn×n be two symmetric matrices with

eigenvalues λ1 (A) ≥ λ2 (A) ≥ · · · ≥ λn (A) and λ1 (B) ≥ λ2 (B) ≥ · · · ≥ λn (B).

Then,

|λ j (A) − λ j (B)| ≤ A − B

for every compatible matrix norm.

Proof Using the Cauchy–Schwarz inequality, we have

xT (Ax − Bx) ≤ Ax − Bx2 x2 ≤ A − B2 x22 ,

meaning particularly

xT Ax xT Bx

≤ T + A − B2

xT x x x

for x 0. Thus, Theorem 5.5 yields λ j (A) ≤ λ j (B) + A − B2 . Exchanging the

roles of A and B ﬁnally gives

|λ j (A) − λ j (B)| ≤ A − B2 = ρ(A − B) ≤ A − B,

where we have used Theorems 4.5 and 2.19 and the fact that the matrix A − B

is symmetric.

06

5.1 Basic Localisation Techniques 139

ric or Hermitian matrix. This is demonstrated in the next theorem, which is a

generalisation of the above corollary.

Theorem 5.8 (Bauer–Fike) Let A ∈ Cn×n be given, and let B = A+ΔA ∈ Cn×n

be a perturbation of A. If μ ∈ C is an eigenvalue of B and A = S DS −1 with

D = diag(λ1 , . . . , λn ) then

min |λ j − μ| ≤ κ p (S )ΔA p .

1≤ j≤n

the matrix S in the induced norm.

Proof If μ is an eigenvalue of A this is obviously true. Hence, assume that μ

is not an eigenvalue of A. Since μ is an eigenvalue of B = A + ΔA, the matrix

S −1 (A + ΔA − μI)S is singular, i.e. there is an x ∈ Cn \ {0} with S −1 (A + ΔA −

μI)S x = 0. We can rearrange this to (D − μI)x = −S −1 (ΔA)S x and then to

x = −(D − μI)−1 S −1 (ΔA)S x.

Taking the norm immediately leads to

1 = (D − μI)−1 S −1 (ΔA)S p ≤ (D − μI)−1 p S −1 p ΔA p S p

1

≤ max κ p (S )ΔA p .

1≤ j≤n |λ j − μ|

number of the transformation matrix S . If S is orthogonal or unitary then

κ2 (S ) = 1 and this is the best we can hope for. However, a unitary transfor-

mation is possible only if A is normal. Hence, in all other cases this will very

probably lead to problems.

We now return to the situation of real eigenvalues and real eigenvectors of real

matrices. Assume that we already have an approximation λ of an eigenvalue

and an approximation x of a corresponding eigenvector. Then, we can try to

use this information to derive a better estimate for the eigenvalue. To this end,

let us assume that all components x j of x are diﬀerent from zero. Then, all the

quotients

T

(Ax) j e j Ax

q j := = T (5.9)

xj ej x

should be about the same and coincide approximately with the eigenvalue.

Lemma 5.9 If all the components x j of x ∈ Rn are diﬀerent from zero and

if Q denotes the diagonal matrix Q = diag(q1 , . . . , qn ) ∈ Rn×n with diagonal

06

140 Calculation of Eigenvalues

components given by (5.9) then we have for an arbitrary vector norm and a

compatible matrix norm

2. 1 ≤ (A − pI)−1 Q − pI

Proof The deﬁnition of the q j yields for our ﬁxed vector x the equation Ax =

Qx so that (A − pI)x = (Q − pI)x follows. This means

x = (A − pI)−1 (Q − pI)x,

for every p ∈ R there is an eigenvalue λ of A satisfying

|λ − p| ≤ max |q j − p|.

1≤ j≤n

not an eigenvalue of A then the matrix A − pI is invertible and (A − pI)−1 has

the eigenvalues 1/(λ j − p), if λ j , 1 ≤ j ≤ n, denote the eigenvalues of A.

Since (A − pI)−1 is symmetric its matrix norm induced by the Euclidean vector

norm and its spectral radius are the same. Thus,

1 1

(A − pI)−1 2 = ρ((A − pI)−1 ) = max "" "= " " . (5.10)

1≤ j≤n "λ j − p"" min1≤ j≤n ""λ j − p""

In the same way, we can derive

1≤ j≤n

1≤ j≤n 1≤ j≤n

1≤ j≤n 1≤ j≤n

"" "

qmin + qmax "" 1

max |q j − p| = max """q j − "" = |qmax − qmin |.

1≤ j≤n 1≤ j≤n 2 2

06

5.2 The Power Method 141

Hence, we can conclude from Theorem 5.10 that there is at least one eigenvalue

λ with

"" "

""λ − qmax + qmin """ ≤ 1 |q − q |,

" 2 " 2 max min

The ﬁrst numerical method we want to discuss is for computing the largest

eigenvalue of a matrix A ∈ Rn×n and a corresponding eigenvector. It is designed

for matrices A with one dominant eigenvalue λ1 , i.e. we assume in this section

that the eigenvalues of A satisfy

We will not assume that A is symmetric but we will assume that it has real

eigenvalues and that there is a basis w1 , . . . , wn of Rn consisting of eigenvectors

of A. Then, we can represent every x as x = nj=1 c j w j with certain coeﬃcients

c j . Using (5.11) shows that

⎛ n m ⎞⎟

n ⎜⎜ λ ⎟⎟

m⎜

c j λ j w j = λ1 ⎜⎜⎝⎜c1 w1 + w j ⎟⎟⎠⎟ =: λm

j

A x=

m m

1 (c1 w1 + Rm ) (5.12)

j=1 j=2

λ1

c1 0 that

1 → c1 w1 ,

Am x/λm m → ∞.

limited use since we do not know the eigenvalue λ1 and hence cannot form the

quotient Am x/λm 1 . Another problem comes from the fact that the norm of A x

m

converges to zero if |λ1 | < 1 and to inﬁnity if |λ1 | > 1. Both problems can be

resolved by normalisation. For example, if we take the Euclidean norm of Am x

then

⎛ n ⎞1/2

⎜⎜⎜ ⎟⎟⎟

A x2 = ⎜⎜⎝

m ⎜ ci c j λi λ j wi w j ⎟⎟⎟⎠ =: |λ1 |m (|c1 | w1 2 + rm )

m m T

(5.13)

i, j=1

Am+1 x2 Am+1 x2 |λ1 |m

= |λ1 | → |λ1 |, m → ∞, (5.14)

Am x2 |λ1 |m+1 Am x2

06

142 Calculation of Eigenvalues

which gives λ1 up to its sign. To determine also its sign and a corresponding

eigenvector, we reﬁne the method as follows.

Deﬁnition 5.12 The power method or von Mises iteration is deﬁned in the

following way. First, we pick a start vector y0 = nj=1 c j w j with c1 0 and set

x0 = y0 /y0 2 . Then, for m = 1, 2, . . . we deﬁne

1. ym = Axm−1 ,

2. xm = σymmym2 , where the sign σm ∈ {−1, 1} is chosen such that xTm xm−1 ≥ 0.

Since the algorithm can directly be derived from this deﬁnition, we refrain

from stating it explicitly. Each step costs O(n2 ) time, which is mainly given by

the involved matrix–vector multiplication.

The choice of the sign means that the angle between xm−1 and xm is in [0, π/2],

which means that we are avoiding a “jump” in the orientation when moving

from xm−1 to xm .

The condition c1 0 is usually satisﬁed simply because of numerical rounding

errors and is hence no real restriction.

Theorem 5.13 Let A ∈ Rn×n be a diagonalisable matrix with real eigenvalues

and eigenvectors with one dominant eigenvalue λ1 . The iterations of the power

method satisfy

2. xm converges to an eigenvector of A to the eigenvalue λ1 ,

3. σm → sign(λ1 ) for m → ∞, i.e. σm = sign(λ1 ) for suﬃciently large m.

Proof We have

Axm−1 A σm−1 Ax Axm−2

m−2 2

xm = σm = σm

Axm−1 2 σm−1 A2 xm−2

Axm−2 2 2

2

A xm−2 Am x0

= σm σm−1 = σ m σ m−1 · · · σ 1

A2 xm−2 2 Am x0 2

m

A y0

= σm σm−1 · · · σ1 m

A y0 2

for m = 1, 2, . . .. From this, we can conclude

Am+1 y0

ym+1 = Axm = σm · · · σ1 ,

Am y0 2

such that (5.14) immediately leads to

Am+1 y0 2

ym+1 2 = → |λ1 |, m → ∞.

Am y0 2

06

5.3 Inverse Iteration by von Wielandt and Rayleigh 143

For the rest of the proof we may assume, without restriction, that w1 2 = 1.

Using the representations (5.12) and (5.13) yields

λm1 (c1 w1 + Rm )

xm = σm · · · σ1

|λ1 | (|c1 |w1 2 + rm )

m

to an eigenvector of A to the eigenvalue λ1 , provided that σm = sign(λ1 ) for all

m ≥ m0 suﬃciently large. The latter follows from

λ12m−1 (c1 wT1 + RTm−1 )(c1 w1 + Rm )

0 ≤ xTm−1 xm = σm σ2m−1 · · · σ21

|λ1 |2m−1 (|c1 | + rm−1 )(|c1 | + rm )

|c1 | + c1 wT1 Rm + c1 RTm−1 w1 + RTm−1 Rm

2

= σm sign(λ1 ) ,

|c1 |2 + |c1 |(rm−1 + rm ) + rm rm−1

because the fraction involved converges to one for m → ∞.

any other norm without changing the convergence result. Often, the maximum

norm is used, since it is cheaper to compute. Also, we do not need all eigen-

values to be real. It suﬃces that the dominant eigenvalue is real. Even if the

dominant eigenvalue is complex, we can still apply the method and have at

least ym → |λ1 |.

It follows from (5.12) that the convergence rate of the power method is mainly

determined by the quotient |λ2 /λ1 | < 1. Obviously, the convergence will be

slow if λ2 is close to λ1 .

We now want to discuss a speciﬁc application of the power method. To this

end, let us assume that A has a simple eigenvalue λ j ∈ R and that we already

have an approximation λ ∈ R to this eigenvalue. Then, obviously, we may

assume that

|λ − λ j | < |λ − λi |

value of A itself then we can invert A − λI and (A − λI)−1 has the eigenvalues

1/(λi − λ) =: λ̃i . Moreover, we have

1

λj = λ +

λ̃ j

06

144 Calculation of Eigenvalues

to λ j . From

1 1

< for all i j

|λ − λi | |λ − λ j |

power method applied to (A − λI)−1 should give good results. This is the idea

behind the inverse iteration method by von Wielandt.

Deﬁnition 5.14 For the inverse iteration method or von Wielandt iteration

for computing an eigenvalue of a symmetric matrix A ∈ Rn×n we assume that

we have an approximation λ ∈ R, which is not an eigenvalue, and that we

have a start vector y0 0. Then, we compute x0 := y0 /y0 2 and then, for

m = 1, 2, . . ., we obtain

ym = (A − λI)−1 xm−1 ,

ym

xm = .

ym 2

The computation of ym thus requires solving the linear system

(A − λI)ym = xm−1 .

(or a QR factorisation) using column pivoting, i.e. we compute

lower triangular matrix L. Then, we can use forward and back substitution to

solve

LUym = Pxm−1 .

This reduces the complexity necessary in each step to O(n2 ) and requires only

O(n3 ) once to compute the LU decomposition.

The approximation λ can be given by any other method. From this point of

view, inverse iteration is in particular interesting for improving the results of

other methods.

Obviously, we could and should adapt the sign as we have done in the power

method. Here, however, we want to discuss a diﬀerent strategy for symmet-

ric matrices, where it is possible to improve the convergence dramatically by

adapting λ in each step. To this end, let us recall the Rayleigh quotient, which

we have encountered already several times.

06

5.3 Inverse Iteration by von Wielandt and Rayleigh 145

real matrix A ∈ Rn×n is the scalar

xT Ax

r(x) := , x 0.

xT x

Obviously, if x is an eigenvector then r(x) gives the corresponding eigen-

value of A. Moreover, we have the following two immediate properties of the

Rayleigh quotient for symmetric matrices.

Lemma 5.16 Let A ∈ Rn×n be symmetric.

1. If λmin (A) and λmax (A) denote the minimum and maximum eigenvalues of A,

respectively, then

λmin (A) ≤ r(x) ≤ λmax (A), x ∈ Rn \ {0}.

2. For x ∈ Rn the function f : R → R deﬁned by f (λ) := (A − λI)x22 becomes

minimal for λ = r(x) with value

f (r(x)) = Ax22 − r(x)2 x22 .

Proof The ﬁrst statement follows directly from Theorem 5.4. The second

property follows from diﬀerentiating the function

f (λ) = Ax − λx22 = Ax22 − 2λxT Ax + λ2 x22 . (5.15)

The only root of the derivative is given by λ = r(x) and inserting this into (5.15)

yields the stated function value. It is a minimum as f (λ) = 2x22 > 0.

The idea of the Rayleigh iteration is based upon using the Rayleigh quotient as

an approximation to an eigenvalue. If x is suﬃciently close to an eigenvector z

then, by continuity, r(x) is close to r(z) which is an eigenvalue.

Thus, changing the parameter λ in the inverse iteration method in each step

leads to Algorithm 11.

Input : A ∈ Rn×n , y0 ∈ Rn \ {0}.

Output: Approximate eigenvalue and eigenvector.

1 x0 := y0 /y0 2

2 for m = 0, 1, 2, . . . do

3 μm := xTm Axm

4 ym+1 := (A − μm I)−1 xm

5 xm+1 := ym+1 /ym+1 2

06

146 Calculation of Eigenvalues

Some remarks need to be made concerning Algorithm 11. First of all, ym+1 is

computed only if μm is not an eigenvalue of A, otherwise the method stops.

An indication for this is that ym+1 2 becomes too large, which can be used

as a stopping criterion. Although, for m → ∞, the matrices A − μm I become

singular, there are usually no problems with the computation of ym+1 , provided

that one terminates the computation of the LU- or QR-decomposition of A −

μm I in time.

Note that this time we have to compute an LU factorisation in each step. This

additional computational cost has to be compensated for by a faster conver-

gence. This is, fortunately, given in most cases, as we will see soon. We start

with a simpler stability result.

generated by the Rayleigh iteration. Then,

(A − μm I)2 .

Proof We know from Lemma 5.16 that the Rayleigh quotient μm+1 minimises

f (λ) = (A − λI)xm+1 22 , which means that we have

means

(A − μm I)xm+1

xm = .

(A − μm I)xm+1 2

Thus, we can conclude that

= |xTm+1 (A − μm I)xm | ≤ (A − μm I)xm 2 ,

using again that xm+1 2 = 1. Equality can hold only if it holds in both of the in-

equalities above. The ﬁrst inequality means μm+1 = μm since the Rayleigh quo-

tient gives the unique minimum of the above function f . The second equality

is an equality in the Cauchy–Schwarz inequality. Hence, we must in this case

have that (A − μm I)xm is a multiple of xm+1 , say (A − μm I)xm = αm xm+1 with

αm 0. This shows that

(A − μm I)−1 xm

xm+1 = (5.16)

(A − μm I)−1 xm 2

06

5.3 Inverse Iteration by von Wielandt and Rayleigh 147

implies that

αm

(A − μm I)2 xm = αm (A − μm I)xm+1 = xm ,

(A − μm I)−1 xm 2

Next, let us come back to the claimed fast convergence of the Rayleigh method.

We will show that if the method converges it converges cubically. We need,

however, the following auxiliary result.

associated orthonormal eigenvectors w1 , . . . , wn . Let λ be one of the eigenval-

ues and let J := { j ∈ {1, . . . , n} : λ j λ}. If u ∈ span{w j : j ∈ J} with u2 = 1

and if μ ∈ R is not an eigenvalue of A then

1

(A − μI)−1 u2 ≤ .

min j∈J |λ j − μ|

Proof We can expand u as u = j∈J α j w j . Since u2 = 1, we have j∈J α2j =

1 and hence

⎛ ⎞1/2

⎜⎜⎜ α2j ⎟⎟⎟ 1

−1

(A − μI) u2 = ⎜⎜⎝ ⎜ ⎟⎟ ≤ .

2⎟ ⎠

j∈J

(λ j − μ) min j∈J |λ j − μ|

Theorem 5.19 Let A ∈ Rn×n be symmetric and let {xm }, {μm } be the sequences

generated by the Rayleigh iteration. Assume that {xm } converges to an eigen-

vector w ∈ Rn with w2 = 1 associated with the eigenvalue λ ∈ R of A. Then,

the convergence of {xm } is cubic, i.e. there is a constant C > 0 such that

Proof The proof is based on the idea of studying the behaviour of the angle

φm between xm and w. Since both vectors have norm one, the angle is given by

cos φm = xm , w2 . From

xm − w22 = xm 22 + w22 − 2xm , w2 = 2(1 − cos φm ) = 4 sin2 (φm /2) (5.17)

we see that it suﬃces to show that 2 sin(φm /2) or, equivalently, sin φm con-

verges cubically. As we have Rn = span{w}⊕span{w}⊥ , we can ﬁnd for each xm

a vector um ∈ Rn with um 2 = 1 and uTm w = 0 such that xm = αm w+βm um with

06

148 Calculation of Eigenvalues

i.e. we have the representation

xm = cos φm w + sin φm um . (5.18)

Multiplying this by (A − μm I)−1 yields

cos φm

ym+1 = (A − μm I)−1 xm = w + sin φm (A − μm I)−1 um

λ − μm

and because of xm+1 = ym+1 /ym+1 2 also

cos φm sin φm

xm+1 = w+ (A − μm I)−1 um

(λ − μm )ym+1 2 ym+1 2

cos φm sin φm (A − μm I)−1 um 2 (A − μm I)−1 um

= w+ .

(λ − μm )ym+1 2 ym+1 2 (A − μm I)−1 um 2

Noticing that w is also an eigenvector of (A − μm I)−1 we ﬁnd that wT (A −

μm I)−1 um = 0. Hence, we can choose um+1 as

(A − μm I)−1 um

um+1 = (5.19)

(A − μm I)−1 um 2

and comparing coeﬃcients yields

cos φm sin φm (A − μm I)−1 um 2

cos φm+1 = , sin φm+1 = .

(λ − μm )ym+1 2 ym+1 2

From (5.18) we ﬁnd Axm = λ cos φm w + sin φm Aum and hence

μm = xTm Axm = λ cos2 φm + sin2 φm uTm Aum

= λ − sin2 φm [λ − r(um )] (5.20)

with the Rayleigh quotient r(um ) = uTm Aum . Inserting this into the formula of

cos φm+1 from above yields

sin φm+1

tan φm+1 = = (λ − μm )(A − μm I)−1 um 2 tan φm

cos φm+1

= [λ − r(um )](A − μm I)−1 um 2 tan φm sin2 φm . (5.21)

Since the Rayleigh quotient is bounded from above and below we know that

the term λ − r(um ) in the above expression is bounded. To show that (A −

μm I)−1 um 2 is also bounded independently of m, we will apply Lemma 5.18.

Using the notation established in that lemma, we have to show that um ∈

span{w j : j ∈ J} for all m, where J = { j ∈ {1, . . . , n} : λ j λ}. This is

clear if λ is a simple eigenvalue. If this is not the case then we have to work

harder. We know from (5.19) that um+1 is a scalar multiple of (A − μm I)−1 um .

As A − μm I maps span{w j : j ∈ J} onto itself, we see that um ∈ span{w j :

06

5.3 Inverse Iteration by von Wielandt and Rayleigh 149

u0 ∈ span{w j : j ∈ J}. Let M := span{w j : j J} and denote the orthogonal

projection from Rn to M by P M . Then, we have to show that P M (u0 ) = 0. This

is done, as follows. First of all, we show that w is a multiple of P M (x0 ). To see

this, note that if xm has the representation

xm = α jw j + α j w j = P M (xm ) + α jw j,

jJ j∈J j∈J

1

xm+1 = (A − μm I)−1 xm

(A − μm I)−1 xm 2

1

= α jw j

(A − μm I)−1 xm 2 (λ − μm ) jJ

1 αj

+ wj

(A − μm I)−1 xm 2 j∈J λ j − μm

1

P M (xm+1 ) = P M (xm ),

(A − μm I)−1 xm 2 (λ − μm )

which shows that P M (xm ) is a scalar multiple of P M (x0 ). Since we assume that

xm converges to the eigenvector w ∈ M, we can conclude by the continuity of

the orthogonal projection that w is also a scalar multiple of P M (x0 ), i.e. there

is an α ∈ R \ {0} such that w = αP M (x0 ). Now, if x0 = cos φ0 w + sin φ0 u0

then we may assume that sin φ0 0 because otherwise x0 would already be an

eigenvector. The linearity of the orthogonal projection gives

P M (x0 ) = cos φ0 P M (w) + sin φ0 P M (u0 ) = cos φ0 w + sin φ0 P M (u0 )

= α cos φ0 P M (x0 ) + sin φ0 P M (u0 ) = P M (x0 ) + sin φ0 P M (u0 ),

where in the last step we have used that w ∈ M implies

1 = w22 = αP M (x0 )T w = αxT0 w = α cos φ0 .

Hence, we have P M (u0 ) = 0 and Lemma 5.18 can indeed be applied, yielding

1

(A − μm I)−1 um 2 ≤

min j∈J |λ j − μm |

for all m. Let us now deﬁne γ := min j∈J |λ j − λ| > 0. As xm → w implies

μm → λ, we have for suﬃciently large m that min j∈J |λ j − μm | ≥ γ/2 and thus

2

(A − μm I)−1 um 2 ≤

γ

06

150 Calculation of Eigenvalues

for all such m suﬃciently large. Next, xm → w also implies cos φm → 1 such

that cos φm ≥ 1/2 for all suﬃciently large m. This means, using (5.21), that

there is a c > 0 such that

cos φm+1

sin φm+1 ≤ |λ − r(um )|(A − μm I)−1 um 2 sin3 φm ≤ c sin3 φm

cos φm

for all suﬃciently large m. As sin φm behaves like φm and hence 2 sin(φm /2)

behaves like sin φm for small φm , we see that we can conclude, with (5.17), that

also

xm+1 − w2 ≤ cxm − w32 .

The actual proof of convergence is even more diﬃcult. Here, we will prove

only a partial, easier result. By Proposition 5.17, we know that the sequence

{(A − μm I)xm 2 } is non-increasing and obviously bounded from below such

that it converges. Let τ be the limit of this sequence, i.e.

m→∞

Theorem 5.20 Let {xm } and {μm } be the sequences generated by the Rayleigh

iteration. Let τ be deﬁned by (5.22). Then, the following results hold.

2. If τ = 0 then the sequence {xm } converges to an eigenvector of A and {μm }

converges to the corresponding eigenvalue.

Proof We start by showing that {μm } always converges, though not necessar-

ily to an eigenvalue of A. This is done in two steps. First, we will show that

μm+1 −μm converges to zero. Second, we will show that {μm } has at most a ﬁnite

number of accumulation points. From this, it follows that {μm } converges. To

see this, let {ρ1 , . . . , ρL } be the set of distinct accumulation points and assume

that L > 1. Let δ := mini j |ρi − ρ j | > 0. Then, for suﬃciently large m each μm

must be within a distance of δ/4 of one of the ρi , since, otherwise, the bound-

edness of the sequence would yield another accumulation point. Hence, there

is an m0 > 0 such that we have for m ≥ m0 on the one hand |μm+1 − μm | < δ/4

and on the other hand |μm − ρ j | < δ/4 for a certain j = j(m) ∈ {1, . . . , L}. But

this means

|μm+1 − ρi | ≥ |ρ j − ρi | − |ρ j − μm | + |μm − μm+1 | ≥ δ − 2δ/4 = δ/2

for all i j. But since μm+1 must also be in the δ/4 neighbourhood of an

06

5.3 Inverse Iteration by von Wielandt and Rayleigh 151

accumulation point, this accumulation point can only be ρ j . This means that

all μm , m ≥ m0 , are within a distance δ/4 of ρ j , which contradicts the fact that

we have L > 1. Thus the sequence has only one accumulation point, i.e. it

converges.

To see that μm+1 − μm converges to zero, let us set rm := (A − μm I)xm and let us

deﬁne θm by cos θm := rTm xm+1 /rm 2 . Then,

= (A − μm I)xm+1 22 + 2(μm − μm+1 )xTm+1 (A − μm I)xm+1 + (μm − μm+1 )2

= (A − μm I)xm+1 22 + 2(μm − μm+1 )(μm+1 − μm ) + (μm − μm+1 )2

= |xTm (A − μm I)xm+1 |2 − (μm+1 − μm )2

= rm 22 cos2 θm − (μm+1 − μm )2 ,

where we have used ideas from the beginning of the proof of Proposition 5.17.

This means in particular

However, we know from Proposition 5.17 that the sequence {rm 2 } converges

to τ so that the right-hand side of (5.23) converges to zero, which shows μm+1 −

μm → 0 for m → ∞.

Next, we need to show that {μm } has only ﬁnitely many accumulation points.

If τ = 0 then each accumulation point has to be an eigenvalue of A and since

A has only a ﬁnite number of eigenvalues, we only have to consider the case

τ > 0. In this situation, let ρ be an accumulation point. Using rm 2 → τ and

μm+1 − μm → 0 together with

1

rm 2 cos θm = rTm xm+1 = >0 (5.24)

(A − μm I)−1 xm 2

that even cos θm → 1 for m → ∞. This, however, lets us conclude that

and thus we have

[(A − μm I)2 − rm 22 cos θm I]xm 2 = (A − μm I)(rm − rm 2 xm+1 )2

≤ A − μm I2 rm − rm 2 xm+1 2

→ 0.

06

152 Calculation of Eigenvalues

Now assume that {μm j } converges to our accumulation point ρ. Then, as the

sub-sequence {xm j } is also bounded, it must also have an accumulation point w

with w2 = 1. Taking a sub-sequence of the sub-sequence {m j } which we will

denote with {m j } again, we see that μm j → ρ and xm j → w. This all then leads

to [(A − ρI)2 − τ2 I]w2 = 0 and hence the accumulation point is a zero of the

polynomial p(t) = det((A − tI)2 − τ2 I), but, since any polynomial has only a

ﬁnite number of diﬀerent zeros, we only have a ﬁnite number of accumulation

points. This ﬁnishes the proof of the ﬁrst part.

For the second part, we assume that τ = 0. Let (λ, w) be an accumulation point

of the sequence {(μm , xm )}. Then, τ = 0 implies that (A − λI)w = 0 and, since w

has norm one, it follows that w is an eigenvector of A to the eigenvalue λ. As

we know from the ﬁrst part that {μm } converges, it must converge to λ. Finally,

if {xm j } is the sequence that converges to w, then for each > 0 there is a j0

such that xm j0 − w2 < . If we choose > 0 so small that C 2 < 1 with C

from Theorem 5.19 then we have

From this it immediately follows that xm − w2 < for all m ≥ m j0 , i.e. the

sequence {xm } converges to w.

this situation we have x2m → w+ and x2m+1 → w− linearly, where w+ and

w− are the bisectors of a pair of eigenvectors whose eigenvalues have mean

ρ = lim μm , though this case is unstable under perturbations of xm . Details can

be found in Parlett [100].

To ﬁnish this discussion, we want to give one more result on the Rayleigh

quotient. As we have seen in Theorem 5.19, the sequence of Rayleigh quotients

μm = r(xm ) satisﬁes |μm − λ| ≤ cxm − w22 , if the Rayleigh iteration converges.

Interestingly, this quadratic dependence is not restricted to the sequence {xm }

generated by the Rayleigh iteration.

quence {xm } with xm 2 = 1, which converges to an eigenvector w. Then we

have

|r(xm ) − r(w)| = O(xm − w22 ), xm → w.

be the corresponding orthonormal eigenvectors, i.e. we have Aw j = λ j w j and

wTj wk = δ jk . Without restriction, we may assume that there is a j ∈ {1, . . . , n}

such that w = w j .

The representation xm = k ck wk , suppressing the fact that ck = c(m)k depends

06

5.4 The Jacobi Method 153

also on m, gives Axm = k ck λk wk and, since xm 2 = 1,

n

r(xm ) = xTm Axm = λk c2k .

k=1

This shows

n

r(xm ) − r(w) = λk c2k − λ j = λ j (c2j − 1) + λk c2k .

k=1 k j

and ck = O( ) for k j and m → ∞. The ﬁrst property follows from

and hence

|c2j − 1| = |(c j − 1)(c j + 1)| ≤ 2|c j − 1| = 2 ,

using c2j ≤ k c2k = 1. The second property then follows for all k j, as we

can conclude from k c2k = 1 that

c2k ≤ c2 = 1 − c2j = |1 − c2j | ≤ 2 .

j

Another iterative method goes back to Jacobi, who developed an algorithm for

dealing with the eigenvalue problem for symmetric n × n matrices, which is

still feasible if the matrices are not too big.

Certain versions of this algorithm can run in parallel (see for example Golub

and Van Loan [66]), so that it sometimes outperforms on parallel computers

even the otherwise superior QR method. It computes all eigenvalues and if

necessary also all eigenvectors. It is based upon the following easy fact.

Lemma 5.22 Let A ∈ Rn×n be a symmetric matrix. For every orthogonal ma-

trix Q ∈ Rn×n both A and QT AQ have the same Frobenius norm. If λ1 , . . . , λn

are the eigenvalues of A then

n

n

|λ j |2 = |a jk |2 .

j=1 k, j=1

Proof The square of the Frobenius norm of A equals j,k |a jk |2 and this is

06

154 Calculation of Eigenvalues

precisely the trace of AAT , see also Corollary 2.31. Since the trace of AB is the

same as the trace of BA for two arbitrary matrices, we can conclude that

= tr(AT QQT A) = A2F .

is a diagonal matrix having the eigenvalues of A as diagonal entries.

If we deﬁne the outer norm, which is not really a norm in the strict sense of

Deﬁnition 1.6, by

N(A) := |a jk |2 ,

jk

n

n

|λ j |2 = |a j j |2 + N(A). (5.25)

j=1 j=1

our goal to diminish N(A) by choosing appropriate orthogonal transformations

and thus by increasing |a j j |2 to transform A into a diagonal matrix. To this

end, we choose an element ai j 0 with i j and perform a transformation

in the plane spanned by ei and e j , which maps ai j to zero. We can choose this

transformation in R2 as a rotation with angle α. The resulting matrix

bii bi j cos α sin α aii ai j cos α −sin α

= (5.26)

bi j b j j −sin α cos α ai j a j j sin α cos α

is a diagonal matrix if the non-diagonal element bi j satisﬁes

1

= ai j cos(2α) + (a j j − aii ) sin(2α).

2

From ai j 0 we can conclude that the angle α is given by

aii − a j j

cot(2α) = . (5.27)

2ai j

For the implementation, it is better to avoid the calculation of the angle α by

taking the inverse cot−1 in this formula and then its sine and cosine. Instead,

one can ﬁrst only compute

aii − a j j

θ := .

2ai j

06

5.4 The Jacobi Method 155

see that t := tan α satisﬁes the quadratic equation

t2 + 2θt = 1,

√

which has two solutions t1,2 = −θ ± 1 + θ2 . Considering numerical issues

we should pick the one with smaller absolute value in a cancellation-avoiding

form

√ sign(θ)

t := −θ + sign(θ) 1 + θ2 = √ .

|θ| + 1 + θ2

With this, we can compute all relevant quantities without using trigonometric

functions,

1

c := √ = cos α,

1 + t2

s := ct = sin α,

s α

τ := = tan ,

1+c 2

where τ will be required later on. This solves the problem for 2 × 2 matrices.

The general case is dealt with using Givens rotations.

and s = sin α. A Givens rotation Gi j (α) ∈ Rn×n is a matrix of the form

⎛ .. .. ⎞

⎜⎜⎜ ⎟⎟⎟

⎜⎜⎜⎜ 1 . . ⎟⎟

⎜⎜⎜· · · c · · · −s · · ·⎟⎟⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟

= ⎜⎜⎜⎜⎜ .. .. ⎟⎟⎟ .

⎜⎜⎜ . 1 . ⎟⎟⎟

⎜⎜⎜· · · s · · · c · · ·⎟⎟⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟

⎝ .. .. ⎠

. . 1

Note that in the last equation of this deﬁnition we have assumed that j > i to

show what the Givens rotation looks like. This is, of course, not required.

If B = GTi j (α)AGi j (α) with α determined by (5.27) then bi j = b ji = 0 and

N(B) = N(A) − 2|ai j |2 .

Proof The ﬁrst part is clear. For the second part, we use the invariance prop-

erty of the Frobenius norm twice. On the one hand, we have AF = BF . On

06

156 Calculation of Eigenvalues

the other hand, we also have equality of the norms for the small matrices in

(5.26), which means

|aii |2 + |a j j |2 + 2|ai j |2 = |bii |2 + |b j j |2 ,

since bi j = 0. From this, we can conclude that

n

n

N(B) = B2F − |bkk |2 = A2F − |bkk |2

k=1 k=1

n

= N(A) + |akk |2 − |bkk |2 = N(A) − 2|ai j |2 ,

k=1

The iteration of this process gives the classical Jacobi method for computing

the eigenvalues of a symmetric matrix.

Deﬁnition 5.25 Let A ∈ Rn×n be symmetric. The classical Jacobi method

for computing the eigenvalues deﬁnes A(0) = A and then proceeds for m =

0, 1, 2, . . . , as follows

1. Determine i j with |a(m) (m)

i j | = maxk |ak | and determine G

(m)

= Gi j .

T

2. Set A(m+1) = G(m) AG(m) .

For practical computations we will again avoid matrix multiplication and code

the transformations directly, see Algorithm 12. Furthermore, we will use the

following identities to compute the ith and jth row and column, which follow

after some trigonometric manipulations from (5.26). We have bi j = b ji = 0 and

bii = aii − tai j ,

b j j = a j j + tai j ,

bi = bi = cai + sa j = ai + s(a j − τai ), i, j,

b j = b j = −sai + ca j = a j − s(ai + τa j ), i, j,

where we have used the previous abbreviations s = sin α, c = cos α, t =

tan α and τ = tan(α/2), which can be computed without using trigonometric

functions, as explained above. In the implementation in Algorithm 12 we have

used two additional vectors bi and b j to store the relevant update information

temporarily before overwriting the old matrix.

Note that an element which has been mapped to zero in an earlier step might

be changed again in later steps.

Theorem 5.26 The classical Jacobi method converges linearly in the outer

norm.

06

5.4 The Jacobi Method 157

Input : A ∈ Rn×n .

Output: Approximate eigenvalues of A.

1 norm := N(A)

2 while norm > do

3 Determine (i, j) with |ai j | = maxk |ak |

4 if |ai j | 0 then

5 θ := 0.5 · (aii − a j j )/ai j

√

6 t := sign(θ)/(|θ| + 1 + θ · θ)

√

7 c := 1/ 1 + t · t, s := c · t

8 bi j := 0, b ji := 0

9 bii := aii − t · ai j , b j j := a j j + t · ai j

10 for = 1 to n do

11 if i and j then

12 bi := c · ai + s · a j

13 b j := −s · ai + c · a j

14 norm := norm − 2 · ai j · ai j

15 for = 1 to n do

16 ai := ai := bi

17 a j := a j := b j

Proof We have to look at one step of the method, i.e. at the transformation

from A to B = GTi j AGi j . Since |ai j | ≥ |ak | for all k, we have

N(A) = |ak |2 ≤ n(n − 1)|ai j |2 .

k

2 2

N(B) = N(A) − 2|ai j |2 ≤ N(A) − N(A) = 1 − N(A),

n(n − 1) n(n − 1)

which means linear convergence.

Convergence in the outer norm does not necessarily mean convergence of the

eigenvalues, since the elements on the diagonal can still permute. However, ev-

ery convergent sub-sequence converges to a diagonal matrix having the eigen-

values of A as its diagonal entries. Another way of formulating this is as fol-

lows.

06

158 Calculation of Eigenvalues

matrix A and if ã(m) (m) (m)

11 ≥ ã22 ≥ · · · ≥ ãnn is an appropriate permutation of the

diagonal elements of A(m) then

|λi − ã(m)

ii | ≤ N(A(m) ) → 0, for 1 ≤ i ≤ n and m → ∞.

11 , . . . , ãnn ) then we have by Corollary 5.7 the

estimate

|λ j − ã(m)

j j | = |λ j (A) − λ j (B)| ≤ A − BF = A

(m)

− BF = N(A(m) )1/2 .

Remark 5.28 One step of the classical Jacobi method takes O(n2 ) time to

ﬁnd the maximum element and then O(n) time to update the matrix.

Since the cost for ﬁnding the maximum element is critical, there are cheaper

versions. For example, the cyclic Jacobi method visits all non-diagonal entries

regardless of their sizes in a cyclic way, i.e. the index pair (i, j) is cyclically

chosen as (1, 2), (1, 3), . . . , (1, n), (2, 3), . . . , (2, n), (3, 4), . . . and then the pro-

cess starts again with (1, 2) and so on. A transformation only takes place if

ai j 0. The cyclic version is convergent and in the case of pairwise distinct

eigenvalues even quadratically convergent if one complete cycle is considered

to be one step (see Henrici [82]).

For small values of ai j it is not eﬃcient to perform a transformation. Thus, we

can restrict ourselves to elements ai j having a square above a certain threshold,

for example N(A)/(2n2 ). Since then

1

N(B) ≤ N(A) 1 − 2

n

we still have linear convergence in the outer norm for the cyclic Jacobi method

with thresholding.

Finally, let us have a look at the eigenvectors. Since the matrices

T T T

A(m+1) = G(m) A(m)G(m) = G(m) · · · G(1) AG(1) · · · G(m) =: QTm AQm

are, at least for suﬃciently large m, almost diagonal, we see that the columns

of Qm deﬁne approximate eigenvectors of A.

and if the eigenvalues of A are pairwise distinct, then the columns of Qm con-

verge to a complete basis of orthonormal eigenvectors of A. If the eigenvalues

are not pairwise distinct then every accumulation point gives an orthonormal

basis of eigenvectors.

06

5.5 Householder Reduction to Hessenberg Form 159

In the next section, we want to use the QR factorisation of a matrix A from

Section 3.6 to compute its eigenvalues in an iterative process. As we have to use

in this process similarity transformations which do not change the eigenvalues,

our transformations have to be of the form QAQT with orthogonal Q. The ﬁrst,

immediate idea, however, faces the following problem.

Suppose we choose a Householder matrix H = H(w) such that the ﬁrst col-

umn of A is mapped to a multiple of the ﬁrst unit vector, i.e. HAe1 = α1 e1 .

Then, HAH T will in general be full again, i.e. all the zeros created by pre-

multiplication with H are destroyed again by the post-multiplication with H.

This simply comes from the fact that multiplication from the right particularly

means that the ﬁrst column is replaced by linear combinations of all columns.

Hence we have to modify this approach. To this end, let us write

a bT

A = 11

ã1 Ã

with a11 ∈ R, ã1 , b ∈ Rn−1 and Ã ∈ R(n−1)×(n−1) . Then, we can choose a House-

holder matrix H̃1 ∈ R(n−1)×(n−1) with H̃1 ã1 = α1 e1 , and deﬁne the orthogonal

and symmetric matrix

1 0T

H1 = .

0 H̃1

1 0T a11 bT 1 0T a bT 1 0T

H1 AH1 = = 11

0 H̃1 ã1 Ã 0 H̃1 α1 e1 H̃1 Ã 0 H̃1

a11 bT H̃1 a11 bT H̃1

= =

α1 e1 H̃1 ÃH̃1 α1 e1 A2

with A2 ∈ R(n−1)×(n−1) . This means we have kept almost all the zeros in the ﬁrst

column except for the one directly under the diagonal. If we proceed like this

now, we derive in the next step a transformation of the form

⎛ ⎞

⎜⎜⎜∗ ∗ · · · ∗⎟⎟⎟

⎜⎜⎜⎜∗ ∗ ∗ ⎟

· · · ∗⎟⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟

⎜⎜⎜0 ∗ ⎟⎟⎟

⎜

H2 H1 AH1 H2 = ⎜⎜⎜⎜0 0 ⎟⎟⎟ ,

⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟

⎜⎜⎜ . . ⎟⎟⎟

⎜⎜⎜ .. .. ⎟⎟⎠

⎜⎝

0 0

06

160 Calculation of Eigenvalues

⎛ ⎞

⎜⎜⎜∗ · · · · · · ∗⎟⎟⎟

⎜⎜⎜ .. ⎟⎟⎟⎟

⎜⎜⎜∗ . . . . ⎟⎟⎟⎟

Hn−2 · · · H1 AH1 · · · Hn−2 = ⎜⎜⎜⎜⎜ . ⎟⎟ .

⎜⎜⎜ . . . . . .. ⎟⎟⎟⎟

.

⎜⎜⎝ ⎟⎟⎠

0 ∗ ∗

Such a matrix is almost an upper triangular matrix. We have only one sub-

diagonal band which might be diﬀerent from zero. A matrix of this form is

called a Hessenberg matrix or a matrix in Hessenberg form. We can deﬁne this

even for non-square matrices.

i > j + 1.

matrix using n − 2 Householder transformations in O(n3 ) time.

Proof We only have to analyse the computational cost. In the kth step, we

have to multiply a Householder matrix Hk ∈ R(n−k)×(n−k) by the signiﬁcant

part of the current matrix both from the left and from the right. According to

Lemma 3.31, this can be done in O((n−k)2 ) time. Hence, the total time required

is given by

n−2

O((n − k)2 ) = O(n3 ).

k=1

We can easily construct the algorithm from the above proof. As we have done

before, we want to overwrite the matrix A with its Hessenberg form H =

Hn−2 · · · H1 AH1 · · · Hn−2 during the process. In step k we have the relevant in-

formation given by Hk = diag(Ik , H̃k ), where Ik ∈ Rk×k denotes the identity on

Rk and H̃k = In−k − βk uk uTk with uk = (u(k) (k) (k) T

k+1 , uk+2 , . . . , un ) is a Householder

matrix. As in the case of the QR factorisation, there is not enough space in A

to store all the relevant information.

A possible way of using the available space is as follows. We can store the

partial vector (u(k) (k)

k+2 , . . . , un ) within the free n−k−1 positions of the kth column

of A and u(k)

k+1 and βk for 1 ≤ k ≤ n − 2 as the kth component of new vectors d

and β in Rn−2 . Details are given in Algorithm 13, where lines 2 to 9 represent

the computation of H̃k and the remaining lines deal with the actual update step.

06

5.5 Householder Reduction to Hessenberg Form 161

Input : A ∈ Rn×n .

Output: Hessenberg form of A.

1 for k = 1 to n − 2 do

2 norm := maxk+1≤i≤n |aik |

3 if norm = 0 then dk := 0, βk := 0

4 else

5 α := 0

6 for i = k + 1 to n do

7 aik := aik /norm, α := α + aik · aik

√

8 α := α, βk := 1/[α(α + |ak+1,k |)]

9 dk := −sign(ak+1,k )α · norm, ak+1,k := sign(ak+1,k )α

10 for j = k + 1 to n do

11 s := βk · ni=k+1 aik · ai j

12 for i = k + 1 to n do ai j := ai j − s · aik

13 for i = 1 to n do

14 s := βk · nj=k+1 ai j · a jk

15 for j = k + 1 to n do ai j := ai j − s · a jk

16 Exchange ak+1,k and dk

At the end of the algorithm, the matrix A and the vectors d, β have the follow-

ing form

⎛ ⎞

⎜⎜⎜ h11 h12 ··· ··· h1,n−1 h1n ⎟⎟⎟

⎜⎜⎜ ⎟ ⎞ ⎞

⎜⎜⎜ h21 h22 ··· ··· h2,n−1 h2n ⎟⎟⎟⎟ ⎛

⎜⎜⎜ u(1) ⎟⎟⎟

⎛

⎜⎜⎜ β1 ⎟⎟⎟

⎜⎜⎜⎜u(1) ⎟⎟ ⎟⎟⎟ ⎟⎟⎟

h3n ⎟⎟⎟⎟ ⎜⎜⎜ ⎜⎜⎜

2

⎜⎜ 3 h32 ··· ··· h3,n−1 ⎜⎜ u(2) ⎟⎟⎟ ⎜⎜ β2 ⎟⎟⎟

⎟⎟⎟

A = ⎜⎜⎜⎜ (1) d = ⎜⎜⎜⎜⎜ ⎟⎟⎟ , β = ⎜⎜⎜⎜⎜ ⎟⎟⎟ ,

3

.. ⎟, .. ..

⎜⎜⎜u4 u(2) . ··· h4,n−1 h4n ⎟⎟⎟⎟ ⎜⎜⎜ . ⎟⎟⎟ ⎜⎜⎜ . ⎟⎟⎟

⎜⎜⎜ ⎟ ⎟⎟⎠ ⎟⎟⎠

.. ⎟⎟⎟⎟ ⎜⎝ ⎜⎝

4

⎜⎜⎜ .. .. .. .. ..

u(n−2) βn−2

⎜⎜⎜ . . . . . . ⎟⎟⎟⎟ n−1

⎝ (1) ⎠

un u(2)

n ··· u(n−2)

n hn,n−1 hnn

We now want to have a closer look at the situation when A ∈ Rn×n is symmetric.

In this case its Hessenberg form must be a tridiagonal matrix. It is hence quite

natural to assume that the reduction of a symmetric matrix to its Hessenberg

form is signiﬁcantly cheaper than the general case.

Let us assume that A = AT ∈ Rn×n has already been transformed using the k − 1

06

162 Calculation of Eigenvalues

⎛ ⎞

⎜⎜⎜ OT ⎟⎟⎟

⎜ A ⎟

Hk−1 · · · H1 AH1 · · · Hk−1 = ⎜⎜⎜⎜⎜ k aTk ⎟⎟⎟⎟⎟ ,

⎝ ⎠

O ak Ck

where Ak is a k × k tridiagonal matrix, Ck is a (n − k) × (n − k) matrix, ak ∈ Rn−k

and O represents a zero matrix of dimension (n − k) × (k − 1). Then, we can

pick a Householder matrix H̃k ∈ R(n−k)×(n−k) , set Hk := diag(Ik , H̃k ) and derive

⎛ ⎞

⎜⎜⎜ OT ⎟⎟⎟

⎜ Ak ⎟

Hk · · · H1 AH1 · · · Hk = ⎜⎜⎜⎜⎜ aTk H̃k ⎟⎟⎟⎟⎟ .

⎝ ⎠

O H̃k ak H̃k Ck H̃k

Obviously, we will pick H̃k such that H̃k ak is again a multiple of the ﬁrst unit

vector. We can now use the symmetry to compute the product H̃k Ck H̃k eﬃ-

ciently. Assume that H̃k = I − βk uk uTk with βk = 2/uk 22 . If we now deﬁne the

vector pk := βk Ck uk , we ﬁnd

H̃k Ck H̃k = [I − βk uk uTk ]Ck [I − βk uk uTk ]

= [I − βk uk uTk ][Ck − pk uTk ]

= Ck − uk pTk − pk uTk + βk uTk pk uk uTk

= Ck − uk wTk − wk uTk

with

βk uTk pk

wk = pk − uk .

2

Using this, we can derive an algorithm as given in Algorithm 14. Here, we

store the diagonal elements and the sub-diagonal elements in two vectors δ =

(δ1 , . . . , δn )T ∈ Rn and γ = (γ1 , . . . , γn−1 )T ∈ Rn−1 . The information on the

Householder transformation H̃k = I − βk uk uTk can be stored in the kth column

of A by storing βk as the kth diagonal element and uk ∈ Rn−k in the part of the

kth column below the diagonal.

This method aims at computing all eigenvalues of a given matrix A ∈ Rn×n

simultaneously. It beneﬁts from the Hessenberg form of a matrix and can be

accelerated by shifts. It is extremely eﬃcient, particularly for symmetric ma-

trices. It is also possible to derive the eigenvectors by accumulating certain

transformation matrices. It goes back to Francis [56, 57].

06

5.6 The QR Algorithm 163

Input : A = AT ∈ Rn×n .

Output: Tridiagonal matrix.

1 for k = 1 to n − 2 do

2 δk := akk

3 norm := maxk+1≤i≤n |aik |

4 if norm = 0 then akk := 0, γk := 0

5 else

6 α := 0

7 for i = k + 1 to n do

8 aik := aik /norm, α := α + aik · aik

√

9 α := α, akk := 1/[α(α + |ak+1,k |)]

10 γk := −sign(ak+1,k ) · α · norm, ak+1,k := ak+1,k + sign(ak+1,k ) · α

11 s := 0

12 for i = k + 1 to n do

13 γi := akk ij=k+1 ai j · a jk + nj=i+1 a ji · a jk

14 s := s + γi · aik

15 s := 0.5 · akk · s

16 for i = k + 1 to n do γi := γi − s · aik

17 for i = k + 1 to n do

18 for j = k + 1 to i do

19 ai j := ai j − aik · γ j − a jk · γi

pose Am in the form Am = Qm Rm with an orthogonal matrix Qm and an upper

triangular matrix Rm . Then, form the swapped product

Am+1 := Rm Qm .

For the analysis of the convergence of the QR method, we need to recall that the

QR factorisation of a matrix is unique if the signs of the diagonal elements of

06

164 Calculation of Eigenvalues

R are prescribed, see Theorem 3.35. This means particularly that the mapping

A → Q is continuous.

sation with orthogonal Q ∈ Rn×n and upper triangular R ∈ Rn×n with positive

diagonal entries. Then, the mapping A → Q is continuous.

positive deﬁnite matrix B ∈ Rn×n , i.e. B = LLT with a lower triangular matrix

L. We will show that this factorisation is unique and that the mapping B → L

is continuous.

In the process of deriving the Cholesky factorisation, we have seen that the

LU factorisation can be written as B = LDLT with a diagonal matrix D with

positive elements. This means that the factors L and U = DLT are uniquely

determined by the uniqueness of the LU factorisation. Hence, D is uniquely

determined and if we take the positive square root, we see that L̃ = LD1/2 is

the unique L matrix with positive elements of the Cholesky factorisation of B.

To see that B → L is continuous we use induction by n. For n = 1, we have

√

B = (β) with β > 0 and L = ( β) which obviously depends continuously on β.

Hence, let us assume that the statement is true for all symmetric and positive

deﬁnite (n − 1) × (n − 1) matrices. We decompose B and L into

β bT γ 0T

B= , L=

b B̂ L̂

triangular with positive diagonal elements and b, ∈ Rn−1 , β, γ ∈ R. With this

√

notation the equation B = LLT immediately gives γ = β and = b/γ =

√

b/ β which shows that both depend continuously on B. We also have B̂ =

L̂L̂T + T so that B → L̂L̂T = B̂ − T = B̂ − bbT /β maps continuously into the

set of symmetric and positive deﬁnite (n−1)×(n−1) matrices. By induction we

know that the mapping L̂L̂T → L̂ is continuous and hence the mapping B → L

is continuous.

Finally, let A ∈ Rn×n be a non-singular matrix with unique QR factorisation

A = QR, where R has positive diagonal entries. Then the matrix B = AT A is

symmetric and positive deﬁnite and has hence a unique Cholesky factorisation

AT A = LLT = R̂T R̂ with a unique upper triangular matrix R̂ = LT with positive

diagonal elements. From this uniqueness and AT A = (QR)T (QR) = RT R we

can conclude that R = R̂ depends continuously on A. Finally, A → Q = AR−1

is also continuous, as R → R−1 obviously is.

06

5.6 The QR Algorithm 165

Obviously, the proof of the previous lemma also showed the uniqueness of the

Cholesky factorisation of a symmetric and positive deﬁnite matrix.

∗

and let L = (i j ) ∈ Rn×n be a normalised lower triangular matrix. Let Lm

denote the lower triangular matrix with entries i j di /d j for i ≥ j. Then, we

m m

have

∗ m

Dm L = Lm D , m ∈ N0 .

∗

Furthermore, Lm converges linearly to the identity matrix for m → ∞. The

asymptotic error coeﬃcient is at most max2≤i≤n |di |/|di−1 |, i.e.

"" "m

∗ " di ""

Lm − I∞ ≤ max "" " L − I∞ .

2≤i≤n di−1 "

∗

Proof The identity Dm L = Lm D immediately follows from the fact that mul-

tiplication by a diagonal matrix from the left means scaling the rows while

multiplication from the right means scaling the columns. Moreover, as L is

normalised, we have

" "" " "

i "" " i−1 "" m ""

" d m

" " d

""i j im """

∗ i j

Lm − I∞ = max "" mi − 1"" = max

1≤i≤n

j=1

" jd " 2≤i≤n

j=1

" dj "

"" " i−1 "" "

" di ""m " di ""m

≤ max " " " |i j | ≤ max " " " L − I∞ ,

2≤i≤n di−1 " 2≤i≤n di−1 "

j=1

With this result at hand, we can now prove convergence of the QR method

for computing the eigenvalues of a matrix under some additional assumptions.

These assumptions mean in particular that A has only simple eigenvalues. For

a real matrix this also means that all eigenvalues are real. The latter is not too

surprising, as in the situation of complex eigenvalues, Theorem 3.39 tells us

that we cannot expect to ﬁnd an orthogonal matrix Q ∈ Rn×n with QT AQ being

upper triangular.

Theorem 5.35 Assume that the eigenvalues of the matrix A ∈ Rn×n can be

sorted as |λ1 | > |λ2 | > · · · > |λn | > 0. Let T be the matrix of corresponding

eigenvectors of A. Assume that T −1 possesses an LU factorisation without piv-

oting. Then, the matrices Am = (a(m) i j ) created by the QR algorithm satisfy the

following conditions.

06

166 Calculation of Eigenvalues

i j → 0 for m → ∞ for

all i > j.

• The diagonal elements converge to the eigenvalues, i.e. a(m)

ii → λi for m →

∞ for all 1 ≤ i ≤ n.

• The sequences {A2m } and {A2m+1 } converge to upper triangular matrices.

i.e. to an diagonal matrix having only 1 or −1 on the diagonal.

Proof We already know that all generated matrices Am have the same eigen-

values as A. If we use the notation Rm...0 = Rm · · · R0 and Q0...m = Q0 · · · Qm ,

then we have

Am = QTm−1 Am−1 Qm−1 = QTm−1 QTm−2 Am−2 Qm−2 Qm−1 = · · · = QT0...m−1 AQ0...m−1 .

By induction we can also show that the powers of A have a similar decomposi-

tion. To be more precise, we have Am = Q0...m−1 Rm−1...0 . This is clear for m = 1

and for m + 1 it follows via

unique provided that Am is non-singular and that we assume that all upper

triangular matrices Ri have, for example, positive diagonal elements.

If we deﬁne D = diag(λ1 , . . . , λn ), we have the relation AT = T D, which leads

to Am = T Dm T −1 . By assumption, we have an LU factorisation of T −1 , i.e.

T −1 = LR with a normalised lower triangular matrix L and an upper triangular

matrix R. This gives to

Am = T Dm LR.

∗

By Lemma 5.34, there is a sequence {Lm } of lower triangular matrices, which

∗ m

converges to I and which satisfy D L = Lm

m

D , which yields the representation

∗ m

Am = T Lm D R.

ﬁnally gives

∗ m

Am = Q̃R̃Lm D R. (5.29)

∗ ∗

Since the matrices Lm converge to I, the matrices R̃Lm have to converge to R̃. If

we now compute a QR factorisation of R̃Lm as R̃Lm = Q∗∗

∗ ∗ ∗∗

m Rm , again with posi-

tive diagonal elements in R∗∗ ∗∗ ∗∗

m then we can conclude that Qm Rm converges to R̃

06

5.6 The QR Algorithm 167

so that Lemma 5.33 shows that Q∗∗m converges to the identity matrix. Equation

(5.29) can hence be rewritten as

Am = Q̃Q∗∗ ∗∗ m

m Rm D R.

1 , . . . , sn ) with entries

s(m)

i := sign(λmi rii ), where rii are the diagonal elements of R. Then, because of

Δm = I, we can rewrite the representation of Am as

2

Am = (Q̃Q∗∗ ∗∗ m

m Δm )(Δm Rm D R). (5.30)

Since the signs of the diagonal elements of the upper triangular matrix R∗∗ m

mD R

coincide with those of D R and hence with those of Δm , we see that (5.30) is a

m

matrix.

If we compare this QR factorisation to the factorisation Am = Q0...m−1 Rm−1...0

we can thus conclude that

Q0...m−1 = Q̃Q∗∗

m Δm . (5.31)

Since Q∗∗ m converges to the identity matrix, the matrices Q0...m must become

constant for suﬃciently large m and up to the sign of the columns. From Am =

QT0...m−1 AQ0...m−1 and (5.31) we can derive

Am = Q−1 ∗∗−1 −1 ∗∗

0...m−1 A Q0...m−1 = Δm Qm Q̃ A Q̃Qm Δm

= Δm Q∗∗−1 R̃ T −1

A T R̃−1 Q∗∗

0123 m Δm .

0123

m

0123

→I =D →I

clude the convergence of {A2m } and {A2m+1 } to Δ0 R̃DR̃−1 Δ0 and Δ1 R̃DR̃−1 Δ1 ,

respectively. Since both limit matrices have the form

⎛ ⎞

⎜⎜⎜λ1 ∗ · · · ∗ ⎟⎟⎟

⎜⎜⎜⎜ .. ⎟⎟⎟⎟

⎜⎜⎜

⎜⎜⎜ λ n . ⎟⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟

..

⎜⎜⎜ . ∗ ⎟⎟⎟⎟⎟

⎝ ⎠

λn

Finally, using the convergence of Q∗∗

m to the identity matrix again and

Qm = Q−1 −1 ∗∗ −1 −1 ∗∗

0...m−1 Q0...m = Δm (Qm ) Q̃ Q̃Qm+1 Δm+1

06

168 Calculation of Eigenvalues

There are several things to mention here. Obviously, we have the following

computational cost. In each step, we have to compute a QR factorisation, which

will cost us O(n3 ) time. Then, we have to form the new iteration Am+1 which is

done by multiplying two full matrices, which also costs O(n3 ) time. Both are

too expensive if a large number of steps is required or the matrix dimension n

is too large.

As already indicated in the last section, it is possible to improve this perfor-

mance signiﬁcantly by reducing the matrix A ﬁrst to its Hessenberg form. This

still may cost O(n3 ) time but it is required only once.

After this pre-processing step, it is important to see that the QR algorithm re-

spects the speciﬁc form and that we can compute both the QR factorisation and

the product RQ in each step eﬃciently. We will explain how this works with

special emphasis on symmetric Hessenberg matrices, i.e. tridiagonal matrices.

it is possible to compute the QR factorisation of Am = Qm Rm in O(n) time.

The next iteration of the QR algorithm Am+1 = Rm Qm is also a symmetric

tridiagonal matrix and can be computed in linear time, as well.

has the form

⎛ ⎞

⎜⎜⎜∗ ∗ 0 · · · 0⎟⎟

⎟

⎜⎜⎜ .. ⎟⎟⎟⎟

⎜⎜⎜∗ ∗ ∗ . ⎟⎟⎟

⎜⎜⎜⎜ ⎟⎟⎟

Am = ⎜⎜⎜⎜⎜0 . . . . . . . . . 0⎟⎟⎟⎟ .

⎜⎜⎜ ⎟⎟⎟

⎜⎜⎜ .. ⎟⎟⎟

⎜⎜⎜ . ∗ ∗ ∗⎟⎟⎟⎟

⎝ ⎠

0 ··· 0 ∗ ∗

We will again proceed in an iterative way. We ﬁrst concentrate on the upper-

left 2 × 2 block of A and determine a 2 × 2 Householder matrix H̃(1, 2), which

transforms the vector (a11 , a21 )T into a multiple of (1, 0). This matrix H̃(1, 2)

can be “ﬁlled-up” to give an n × n orthogonal matrix H(1, 2) in the usual way.

Important here is that the computation of H(1, 2)A costs only constant time

since it involves only changing the ﬁrst two rows and, because of the tridiago-

nal structure, only the ﬁrst three columns. Continuing like this, we can use the

index pairs (2, 3), (3, 4), . . . , (n − 1, n) to create the QR factorisation

06

5.6 The QR Algorithm 169

In each step, at most six entries are changed. Thus the computational cost is

O(n).

This can be visualised as follows:

⎛ ⎞ ⎛ ⎞ ⎛ ⎞

⎜⎜⎜∗ ∗ 0 0 0⎟⎟⎟ ⎜⎜⎜∗ ∗ ∗ 0 0⎟⎟⎟ ⎜⎜⎜∗ ∗ ∗ 0 0⎟⎟⎟

⎜⎜⎜⎜∗ ∗ ∗ 0 0⎟⎟⎟⎟ ⎜⎜ 0 0⎟⎟⎟⎟

⎟ ⎜⎜⎜

∗ ∗ 0⎟⎟⎟⎟

⎟

⎜⎜⎜ ⎟⎟⎟ (1,2) ⎜⎜⎜⎜⎜0 ∗ ∗

⎟⎟⎟ (2,3) ⎜⎜⎜⎜⎜

0 ∗

⎟⎟

⎜⎜⎜0 ∗ ∗ ∗ 0⎟⎟⎟ → ⎜⎜⎜0 ∗ ∗ ⎟

∗ 0⎟⎟ → ⎜⎜0 ⎜ 0 ∗ ∗ 0⎟⎟⎟⎟

⎜⎜⎜ ⎟ ⎜⎜⎜ ⎜⎜⎜

⎜⎜⎜0 0 ∗ ∗ ∗⎟⎟⎟⎟⎟ ⎜⎜⎜0 0 ∗ ∗ ∗⎟⎟⎟⎟⎟

⎟

⎜⎜⎜0 0 ∗ ∗ ∗⎟⎟⎟⎟⎟

⎟

⎜⎝ ⎟⎠ ⎜⎝ ⎠ ⎝ ⎠

0 0 0 ∗ ∗ 0 0 0 ∗ ∗ 0 0 0 ∗ ∗

⎛ ⎞ ⎛ ⎞

⎜⎜⎜∗ ∗ ∗ 0 0⎟⎟⎟ ⎜⎜⎜∗ ∗ ∗ 0 0⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟ ⎜⎜⎜ ⎟

⎜⎜0 ∗ ∗ ∗ 0⎟⎟ ∗ ∗ ∗ 0⎟⎟⎟⎟

⎟⎟⎟ (4,5) ⎜⎜⎜⎜⎜

0

(3,4) ⎜ ⎜ ⎟⎟

→ ⎜⎜⎜⎜0 0 ∗ ∗ ∗⎟⎟⎟ → ⎜⎜⎜0 0 ∗ ∗ ∗⎟⎟⎟⎟ .

⎜⎜⎜ ⎟ ⎜⎜⎜ ⎟

⎜⎜⎜0 0 0 ∗ ∗⎟⎟⎟⎟⎟ ⎜⎜⎜0 0 0 ∗ ∗⎟⎟⎟⎟⎟

⎝ ⎠ ⎝ ⎠

0 0 0 ∗ ∗ 0 0 0 0 ∗

1, n) we see that the ﬁrst matrix H(1, 2) acts only on the ﬁrst two columns, the

next matrix H(2, 3) on columns 2 and 3 and so on:

⎛ ⎞ ⎛ ⎞ ⎛ ⎞

⎜⎜⎜∗ ∗ ∗ 0 0⎟⎟⎟ ⎜⎜⎜∗ ∗ ∗ 0 0⎟⎟⎟ ⎜⎜∗ ∗ ∗ 0 0⎟⎟⎟

⎜

⎜⎜⎜⎜0 ∗ ∗ ∗ 0⎟⎟⎟⎟ ⎜⎜

∗ ∗ ∗ 0⎟⎟⎟⎟

⎟ ⎜⎜⎜

∗ ∗ ∗ ∗ 0⎟⎟⎟⎟

⎟

⎜⎜⎜ ⎟⎟ (1,2) ⎜⎜⎜⎜⎜∗

⎟ ⎟⎟⎟ (2,3) ⎜⎜⎜⎜⎜ ⎟⎟

⎜⎜⎜0 0 ∗ ∗ ∗⎟⎟⎟ → ⎜⎜⎜0 0 ∗ ∗ ∗⎟⎟⎟ → ⎜⎜⎜0 ∗ ∗ ∗ ∗⎟⎟⎟⎟

⎜⎜⎜ ⎟ ⎜⎜⎜ ⎜⎜⎜

⎜⎜⎜0 0 0 ∗ ∗⎟⎟⎟⎟⎟ ⎜⎜⎜0 0 0 ∗ ∗⎟⎟⎟⎟⎟

⎟

⎜⎜⎜0 0 0 ∗ ∗⎟⎟⎟⎟⎟

⎟

⎜⎝ ⎟⎠ ⎜⎝ ⎠ ⎝ ⎠

0 0 0 0 ∗ 0 0 0 0 ∗ 0 0 0 0 ∗

⎛ ⎞ ⎛ ⎞

⎜⎜⎜∗ ∗ ∗ ∗ 0⎟⎟⎟ ⎜⎜⎜∗ ∗ ∗ ∗ ∗⎟⎟⎟

⎜⎜⎜ ⎟ ⎜⎜⎜∗ ∗ ∗ ∗ ∗⎟⎟⎟⎟

⎜⎜∗ ∗ ∗ ∗ 0⎟⎟⎟⎟

(3,4) ⎜ ⎜ ⎟⎟⎟ (4,5) ⎜⎜⎜⎜⎜ ⎟⎟⎟

→ ⎜⎜⎜⎜0 ∗ ∗ ∗ ∗⎟⎟⎟ → ⎜⎜⎜0 ∗ ∗ ∗ ∗⎟⎟⎟⎟ .

⎜⎜⎜ ⎟ ⎜⎜⎜ ⎟⎟

⎜⎜⎜⎝0 0 ∗ ∗ ∗⎟⎟⎟⎟⎟ ⎜⎜⎜⎝0 0 ∗ ∗ ∗⎟⎟⎟⎟⎠

⎠

0 0 0 0 ∗ 0 0 0 ∗ ∗

The new matrix is indeed a Hessenberg matrix again. But since it is also sym-

metric, all entries above the super-diagonal have to be zero. This ensures that

the new matrix is a tridiagonal matrix and that the computation can again be

done in linear time.

Let us note two consequences, which follow from the above proof immediately.

The ﬁrst one regards tridiagonal matrices.

matrix R ∈ Rn×n has at most three non-zero entries in each row, which are

rii , ri,i+1 and ri,i+2 .

06

170 Calculation of Eigenvalues

The second consequence will be important later on. The algorithm described

above yields for a non-symmetric Hessenberg matrix the following.

can be computed in O(n2 ) time.

Proof This follows from the proof of the previous theorem. The only diﬀer-

ence is that, in each step, we have to apply the 2 × 2 matrices H(i, i + 1) to all

entries in rows i and i + 1 and columns i + 1 ≤ k ≤ n.

rithm. Instead of decomposing the matrix Am , we now decompose the matrix

Am − μm I with a certain shift parameter μm ∈ R. Thus, one step of the QR

method with shifts consists of

• set Am+1 = Rm Qm + μm I.

such that all of the matrices involved have again the same eigenvalues as A. It

also follows that

= AQ0...m−1 − μm Q0...m−1

= (A − μm I)Q0...m−1 .

then

Q0...m−1 RTm en = ρm Q0...m−1 en = (AT − μm I)Q0...m en ,

which is one step of the inverse iteration. Thus, in this case it is possible to

apply the convergence results on the power method and the inverse iteration to

derive convergence results for the smallest eigenvalue of A and an associated

eigenvector. In the best case, it is possible to establish cubic convergence again.

06

5.7 Computing the Singular Value Decomposition 171

n,n , we will, in almost all cases, have

rapid convergence leading to

T 0

Am = Tm ,

0 λ

of A1 . Inductively, once this form has been found, the QR algorithm with shift

can be concentrated only on the (n − 1) × (n − 1) leading sub-matrix T m . This

process is called deﬂation.

Taking this all into account, the ﬁnal algorithm for calculating the eigenvalues

of an n × n symmetric matrix is given in Algorithm 15.

Input : A = AT ∈ Rn×n .

Output: Approximate eigenvalues and eigenvectors.

1 Reduce A to tridiagonal form

2 for k = n downto 2 do

3 while |ak−1,k | > do

4 Compute QR = A − akk I

5 Set A := RQ + akk I

6 Record eigenvalue λk = akk

7 Replace A by its leading k − 1 by k − 1 sub-matrix

8 Record eigenvalue λ1 = a11

In this section we want to discuss numerical methods to compute the singular

value decomposition

A = UΣV T (5.32)

are orthogonal and Σ ∈ Rm×n is a diagonal matrix with diagonal entries σi ≥ 0

for 1 ≤ i ≤ n. We restrict ourselves to the case m ≥ n since it is the more

important case and since in the the case m < n we can simply consider AT

instead of A.

06

172 Calculation of Eigenvalues

As mentioned before, the factorisation (5.32) is also called the full singu-

lar value decomposition since it computes square matrices U ∈ Rm×m and

V ∈ Rn×n . However, the essential information is already given by a reduced

singular value decomposition, see Section 1.5. Sometimes a reduced singular

value decomposition for A ∈ Rm×n with m ≥ n employs matrices U ∈ Rm×n

with U T U = I, V ∈ Rn×n orthogonal and a diagonal matrix Σ ∈ Rn×n with

A = UΣV T . In any case, we have

AT A = VΣT ΣV T

and that the singular values σi are the square roots of the positive eigenvalues

of AT A.

This, of course, immediately gives rise to a numerical scheme. We could ﬁrst

compute the matrix Ã := AT A and then use one of the methods of the previous

sections to compute the eigenvalues and eigenvectors of Ã. Taking the square

roots of the eigenvalues gives us the singular values. Moreover, if we have the

eigenvalue decomposition Ã = VS V T we can solve for UΣ = AV, for example

using a QR factorisation to compute the missing orthogonal matrix U.

This is, unfortunately, though quite obvious, not advisable. First of all, the

computational cost expended to form the matrix AT A is already O(mn2 ), but

this is not the main problem. The main problem is that the condition number

will be the square of the condition number of the matrix A. It can be shown that

this leads to problems, particularly for computing the small singular values of

A, see for example Trefethen and Bau [124].

Hence, the general idea is to use the fact that the singular values are the positive

roots of the positive eigenvalues of AT A without computing AT A.

As in the case of computing the eigenvalues of a symmetric matrix using the

QR method, we will achieve this by ﬁrst transferring our matrix to a more

suitable form. This time, it will be a bidiagonal form.

nal form if bi j = 0 for all j i, i + 1, i.e. B is of the form

⎛ ⎞

⎜⎜⎜∗ ∗ ⎟⎟⎟

⎜⎜⎜⎜ ∗ ∗

⎟⎟⎟

⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟

⎜⎜⎜ .. .. ⎟⎟⎟

⎜ . .

B = ⎜⎜⎜⎜ ⎟⎟⎟ .

⎜⎜⎜ ∗ ∗⎟⎟⎟⎟

⎜⎜⎜ ⎟

⎜⎜⎜

⎜⎝ ∗⎟⎟⎟⎟⎟

⎠

06

5.7 Computing the Singular Value Decomposition 173

compute the Hessenberg form of a matrix. However, this time we are more

ﬂexible. In the case of constructing the Hessenberg form of A, we had to use

transformations of the form QT AQ with Q orthogonal to keep the same eigen-

values. Here, we want to keep the same singular values, which allows us to use

transformations of the form U T AV with orthogonal matrices U and V since we

have:

Remark 5.40 Let A ∈ Rm×n with m ≥ n be given. If B ∈ Rm×n is deﬁned by

B = U T AV with orthogonal matrices U ∈ Rm×m and V ∈ Rn×n then A and B

have the same singular values.

Proof The singular values of B are the square roots of the positive eigenvalues

of BT B. Since

BT B = (U T AV)T (U T AV) = V T AT UU T AV = V T AT AV,

BT B and AT A have the same positive eigenvalues.

Hence, we can start with a Householder matrix U1 ∈ Rm×m which transforms

the ﬁrst column of A into a multiple of the ﬁrst unit vector. We then want

to choose a Householder matrix V1 to transform the ﬁrst row of U1 A into a

multiple of the transpose of the ﬁrst unit vector. However, as explained in the

context of deriving the Hessenberg form, this would destroy the previously

created zeros in the ﬁrst column. Hence, we have to transform the row vector

consisting of ((U1 A)12 , . . . (U1 A)1n ) into a multiple of the ﬁrst row unit vector

in Rn−1 . Continuing this way, this looks schematically like

⎛ ⎞ ⎛ ⎞ ⎛ ⎞ ⎛ ⎞

⎜⎜⎜∗ ∗ ∗ ∗⎟⎟⎟ ⎜⎜⎜∗ ∗ ∗ ∗⎟⎟⎟ ⎜⎜⎜• ∗ 0 0⎟⎟⎟ ⎜⎜⎜• • 0 0⎟⎟⎟

⎜⎜⎜⎜∗ ∗ ∗ ∗⎟⎟⎟⎟ ⎜⎜ ⎟⎟ ⎜⎜⎜⎜0 ∗ ∗ ∗⎟⎟⎟⎟ ⎜⎜ ⎟⎟

⎜⎜⎜ ⎟⎟⎟ T ⎜⎜⎜⎜⎜0 ∗ ∗ ∗⎟⎟⎟⎟⎟ ⎜⎜⎜ ⎟⎟⎟ T ⎜⎜⎜⎜⎜0 ∗ ∗ ∗⎟⎟⎟⎟⎟

⎜⎜⎜∗ ∗ ∗ ∗⎟⎟⎟ U1 ⎜⎜⎜0 ∗ ∗ ∗⎟⎟⎟ V1 ⎜⎜⎜0 ∗ ∗ ∗⎟⎟⎟ U2 ⎜⎜⎜0 0 ∗ ∗⎟⎟⎟

⎜⎜⎜ ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ ⎟

⎜⎜⎜∗ ∗ ∗ ∗⎟⎟⎟⎟⎟ → ⎜⎜⎜⎜⎜0 ∗ ∗ ∗⎟⎟⎟⎟⎟ → ⎜⎜⎜⎜⎜0 ∗ ∗ ∗⎟⎟⎟⎟⎟ → ⎜⎜⎜⎜⎜0 0 ∗ ∗⎟⎟⎟⎟⎟

⎜⎜⎜ ⎟ ⎜⎜⎜ ⎟ ⎜⎜⎜ ⎟ ⎜⎜⎜ ⎟

⎜⎜⎜∗ ∗ ∗ ∗⎟⎟⎟⎟⎟ ⎜⎜⎜0 ∗ ∗ ∗⎟⎟⎟⎟⎟ ⎜⎜⎜0 ∗ ∗ ∗⎟⎟⎟⎟⎟ ⎜⎜⎜0 0 ∗ ∗⎟⎟⎟⎟⎟

⎝ ⎠ ⎝ ⎠ ⎝ ⎠ ⎝ ⎠

∗ ∗ ∗ ∗ 0 ∗ ∗ ∗ 0 ∗ ∗ ∗ 0 0 ∗ ∗

⎛ ⎞ ⎛ ⎞ ⎛ ⎞

⎜⎜⎜• • 0 0⎟⎟⎟ ⎜⎜⎜• • 0 0⎟⎟⎟ ⎜⎜⎜• • 0 0⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟ ⎜⎜⎜ ⎟⎟⎟ ⎜⎜⎜ ⎟

⎜⎜⎜0 • ∗ 0⎟⎟⎟ ⎜⎜⎜0 • • 0⎟⎟⎟ ⎜⎜⎜0 • • 0⎟⎟⎟⎟⎟

V2 ⎜

⎜⎜0 0 ∗ ∗⎟⎟⎟ U T ⎜⎜⎜0 0 ∗ ∗⎟⎟⎟ U T ⎜⎜⎜0 0 • •⎟⎟⎟

→ ⎜⎜⎜⎜⎜ ⎟⎟⎟ →3 ⎜⎜⎜ ⎟ 4 ⎜

⎜⎜⎜0 0 0 ∗⎟⎟⎟⎟⎟ → ⎜⎜⎜⎜⎜0 0 0 ∗⎟⎟⎟⎟⎟ ,

⎟

⎜⎜⎜0 0 ∗ ∗⎟⎟⎟⎟⎟ ⎜⎜⎜ ⎟ ⎜⎜⎜ ⎟

⎜⎜⎜ ⎟⎟ ⎜⎜⎜0 0 0 ∗⎟⎟⎟⎟⎟ ⎜⎜⎜0 0 0 0⎟⎟⎟⎟⎟

⎜⎜⎝0 0 ∗ ∗⎟⎟⎟⎠ ⎜⎝ ⎟⎠ ⎜⎝ ⎟⎠

0 0 ∗ ∗ 0 0 0 ∗ 0 0 0 0

where the U T matrices are multiplied from the left and the V matrices are mul-

tiplied from the right. In the above diagram, elements that do not change any-

06

174 Calculation of Eigenvalues

more are denoted by a •; those elements that change are denoted by a ∗. Note

that we need to apply n Householder matrices from the left and n − 2 House-

holder transformations from the right. The method described here is called

Householder bidiagonalisation and goes back to Golub and Kahan [64]. A

possible realisation is given in Algorithm 16. To analyse its numerical cost,

we recall that multiplying a Householder matrix U ∈ Rm×m to a vector a ∈ Rm

costs O(m) time, see Lemma 3.31.

bidiagonal form using the Householder bidiagonalisation method in O(mn2 )

time.

matrix, which we then have to apply to n − i + 1 partial columns of A. Hence

this multiplication can be done in O((n − i + 1)(m − i + 1)) time and we have to

do it for 1 ≤ i ≤ n, so that the total cost for multiplying with all U matrices is

n

O((n − i + 1)(m − i + 1)) = O(n2 m).

i=1

which has to be applied to m − i + 1 partial row vectors of A, for 1 ≤ i ≤ n − 2.

Hence, the total cost for all these operations becomes

n−2

O((n − i)(m − i + 1)) = O(n2 m).

i=1

Input : A ∈ Rm×n .

Output: B = UAV T bidiagonal.

1 for i = 1 to n do

2 Determine Householder matrix Ũi ∈ R(m−i+1)×(m−i+1) such that

Ũi (aii , ai+1,m , . . . , ami )T = (∗, 0, . . . , 0)T .

3 Let Ui := diag(Ii−1 , Ũi ) and compute A := Ui A

4 if i ≤ n − 2 then

5 Determine Householder matrix Ṽi ∈ R(n−i)×(n−i) such that

(ai,i+1 , ai,i+2 , . . . , ain )ṼiT = (∗, 0, . . . , 0)

6 Let ViT := diag(Ii , ṼiT ) and compute A := AViT

06

5.7 Computing the Singular Value Decomposition 175

gorithm 16 have to be carried out using the special structure of the Householder

matrices. Furthermore, we can also store these matrices eﬃciently. The House-

holder matrix Ũi ∈ R(m−i+1)×(m−i+1) can be represented by Ũi = Im−i+1 − 2ui uTi ,

i.e. by a vector ui ∈ Rm−i+1 . Similarly, the matrix Ṽi ∈ R(n−i)×(n−i) can be repre-

sented by a vector vi ∈ Rn−i . Hence, when it comes to an actual implementation

of Algorithm 16 we could store most of the information in the original matrix

as follows. The vector associated with Ui can be stored in the lower triangu-

lar part of A while the Householder vector associated with Vi can be stored

in the upper triangular portion of A. For details, see the similar Algorithm 6.

Hence, we only need to store the diagonal and super-diagonal entries of the

transformed matrix.

In the case of m being much larger than n, the method will produce a large

number of unnecessary zeros. To avoid this, we could ﬁrst compute a QR fac-

torisation of A and then the Householder bidiagonalisation B = URV T of the

upper triangular matrix R. The advantage here is that the bidiagonalisation pro-

cess then only needs to be carried out on the upper n × n part of R. A thorough

analysis of the computational cost (see for example Trefethen and Bau [124])

shows that the QR factorisation can be done in about half the time T H (m, n)

of the bidiagonalisation of A. The bidiagonalisation process on the small n × n

part of R costs O(n3 ) so that the total cost of this method will be

1

T (m, n) = T H (m, n) + O(n3 ),

2

which turns out to be better if at least m > (5/3)n.

After the transformation to an upper bidiagonal matrix, the transformed matrix

has the upper bidiagonal form

⎛ ⎞

⎜⎜⎜d1 f2 0 · · · 0 ⎟⎟⎟

⎜⎜⎜⎜ .. . ⎟⎟⎟

⎜⎜⎜ d f . .. ⎟⎟⎟⎟ ⎛⎜ ⎞

⎟⎟⎟

⎜⎜⎜ 2 3 ⎟⎟⎟ ⎜⎜

⎜⎜⎜ .. .. ⎟ ⎜ ⎜ ⎟⎟⎟

⎜⎜⎜ . . 0 ⎟⎟⎟⎟⎟ ⎜⎜⎜⎜ B ⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟ ⎜⎜⎜ ⎟⎟⎟

⎜ . . . f ⎟⎟⎟ = ⎜⎜⎜ ⎜ ⎟⎟⎟

B̃ = ⎜⎜⎜⎜ n⎟ ⎟ ⎜ ⎟⎟ (5.33)

⎜⎜⎜⎜ ⎟⎟⎟⎟ ⎜⎜⎜⎜0 · · · 0⎟⎟⎟⎟

⎜⎜⎜ d n⎟ ⎟⎟⎟ ⎜⎜⎜⎜ .. .. ⎟⎟⎟

⎜⎜⎜⎜ 0 · · · 0 ⎟⎟⎟ ⎜⎜⎜ . . ⎟⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟ ⎝ ⎠

⎜⎜⎜ .. .. ⎟⎟ 0 ··· 0

⎜⎜⎜ . . ⎟⎟⎟⎟

⎝ ⎠

0 ··· 0

with B ∈ Rn×n , and we have matrices U ∈ Rm×n and V ∈ Rn×n such that A =

U BV T . The singular values of A are the square roots of the positive eigenvalues

of the matrix C := BT B, which is a tridiagonal matrix with diagonal entries

06

176 Calculation of Eigenvalues

c11 = d12 and cii = di2 + fi2 for 2 ≤ i ≤ n. The oﬀ-diagonal entries are given by

ci,i+1 = ci+1,i = di fi+1 for 1 ≤ i ≤ n − 1.

Hence, the matrix C = BT B is irreducible (see Deﬁnition 4.11) if and only if

d1 · · · dn−1 0 and f2 · · · fn 0.

If the matrix is reducible, the calculation of the singular values of B can be

reduced to the calculation of two lower-dimensional matrices. This is detailed

in the following lemma, where we have to distinguish between the two cases

characterised by whether one of the di or one of the fi vanishes.

Lemma 5.42 Let B ∈ Rn×n be a bidiagonal matrix with diagonal elements

d1 , . . . , dn and super-diagonal elements f2 , . . . , fn .

1. If there is a 1 ≤ k ≤ n − 1 with fk+1 = 0 then the matrix B can be split into

B1 O

B=

O B2

with B1 ∈ Rk×k and B2 ∈ R(n−k)×(n−k) . If B1 = U1 Σ1 V1T and B2 = U2 Σ2 V2T

are singular value decompositions of B1 and B2 , respectively, then

T

U O Σ 1 O V1 O

B= 1

O U2 O Σ2 O V2

is a singular value decomposition of B.

2. If there is a 1 ≤ k ≤ n − 1 with dk = 0 and fk+1 0 then we can choose n − k

Givens rotations Gk,k+1 , . . . , Gk,n such that B̃ = (b̃i j ) := Gk,n · · · Gk,k+1 B

is a bidiagonal matrix with b̃k,k+1 = 0. Hence, the matrix B̃ satisﬁes the

assumptions of the ﬁrst case.

Proof The ﬁrst part of the lemma simply follows by straightforward calcu-

lations. For the second part, we will use Givens rotations to move the non-

vanishing fk+1 along the kth row to the right until it disappears, as described

schematically in

⎛ ⎞ ⎛ ⎞ ⎛ ⎞

⎜⎜⎜• • ⎟⎟⎟ ⎜⎜⎜• • ⎟⎟⎟ ⎜⎜⎜• • ⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟ ⎜⎜⎜ ⎟⎟⎟ ⎜⎜⎜ ⎟

⎜⎜⎜ 0 • ⎟⎟⎟ ⎜⎜⎜ 0 0 ∗ ⎟⎟⎟ ⎜⎜⎜ 0 0 0 ∗⎟⎟⎟⎟⎟

⎜⎜⎜⎜ ⎟ 23 ⎜

• • ⎟⎟⎟⎟ → ⎜⎜⎜⎜

G

∗ ∗

⎟⎟⎟ 24 ⎜⎜⎜

⎟⎟⎟ → ⎜⎜⎜

G

• • ⎟⎟⎟⎟

⎟

⎜⎜⎜ ⎟

⎟ ⎜

⎜ ⎟ ⎜ ⎟

⎜⎜⎜

⎜⎝ • •⎟⎟⎟⎟ ⎜⎜⎜

⎜⎝ • •⎟⎟⎟⎟ ⎜⎜⎜

⎜⎝ ∗ ∗⎟⎟⎟⎟⎟

⎠ ⎠ ⎠

• • •

⎛ ⎞

⎜⎜⎜• • ⎟⎟⎟

⎜⎜⎜ ⎟

⎜⎜⎜ 0 0 0 0⎟⎟⎟⎟⎟

G25 ⎜

→ ⎜⎜⎜⎜ • • ⎟⎟⎟⎟ .

⎜⎜⎜ ⎟⎟

⎜⎜⎜ • •⎟⎟⎟⎟⎟

⎝ ⎠

∗

06

5.7 Computing the Singular Value Decomposition 177

∗. The precise proof follows the same pattern and is left to the reader, see also

Algorithm 17.

Input : B ∈ Rn×n upper bidiagonal, i.e. of the form (5.33), index k with

dk = 0 and fk+1 0, matrix U ∈ Rm×n so that A = U BV T .

Output: B̃ upper bidiagonal with f˜k+1 = 0.

1 c := 0, s := 1

2 for j = k + 1 to n do

3 f := s f j , f j := c f j , u := ( f 2 + d2j )1/2

4 c := d j /u, s := − f /u, d j := u

5 for i = 1 to m do

c −s

6 (uik , ui j ) := (uik , ui j )

s c

which can be split as described, and a new matrix Ũ which satisfy A = Ũ B̃V T .

We can repeat this until we only have to deal with blocks B with BT B irre-

ducible.

It remains to derive a method for computing the singular values of an upper

bidiagonal matrix B with irreducible BT B. Here, the idea of Golub and Reinsch

[65] is to compute the eigenvalues of BT B with the shifted QR method without

actually computing BT B. We will now describe a method which transforms B

to a new bidiagonal matrix B̃ which should have a much smaller oﬀ-diagonal

element b̃n−1,n than the original matrix B. Hence, after a few steps it should be

possible to use deﬂation to reduce the problem to an (n − 1) × (n − 1) matrix.

We start by selecting the shift parameter μ ∈ R as the eigenvalue of the lower

block

2

dn−1 + fn−1

2

dn−1 fn

dn−1 fn dn2 + fn2

of BT B which is closest to dn2 + fn2 . Then, we determine a Givens rotation

V12 = V12 (c, s) satisfying

c s d12 − μ ∗

= ,

−s c d1 f2 0

i.e. V12 transforms the ﬁrst column of BT B − μI into a multiple of the ﬁrst unit

06

178 Calculation of Eigenvalues

vector. This means that the ﬁrst column of V12 coincides with the ﬁrst column

of the orthogonal matrix Q of a QR factorisation of BT B − μI.

T

If we apply this directly to B, i.e. if we form BV12 with dimensions appropri-

ately adapted, we only change the upper 2 × 2 block of B according to

d1 f2 c −s d1 c + f2 s −d1 s + f2 c

= ,

0 d2 s c d2 s d2 c

which means that the upper bidiagonal form is destroyed by the element d2 s in

position (2,1). We will now use successive Givens rotations from the left and

from the right to move this element out of the matrix. This idea can be depicted

as follows:

⎛ ⎞ ⎛ ⎞ ⎛ ⎞ ⎛ ⎞

⎜⎜⎜• • ⎟⎟⎟ ⎜∗ ∗ ⎟⎟⎟ ⎜⎜⎜∗ ∗ ∗ ⎟⎟⎟ ⎜⎜⎜• ∗ ⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟ T ⎜⎜⎜⎜⎜∗ ⎟⎟⎟ ⎜ ⎟⎟⎟ T ⎜⎜⎜ ⎟⎟⎟

⎜⎜⎜ • • ⎟⎟⎟ V12 ⎜⎜⎜ ∗ • ⎟⎟⎟ U12 ⎜⎜⎜⎜⎜ ∗ ∗ ⎟⎟⎟ V23 ⎜⎜⎜ ∗ ∗ ⎟⎟⎟

⎜⎜⎜ • • ⎟⎟⎟ → ⎜⎜⎜ • • ⎟⎟⎟⎟ → ⎜⎜⎜⎜ • • ⎟⎟⎟ → ⎜⎜⎜ ∗ ∗ • ⎟⎟⎟

⎜⎜⎜

⎜⎝ • •⎟⎟⎟⎠⎟ ⎜⎜⎜

⎜⎝ • •⎟⎟⎠⎟ ⎜⎜⎜

⎝ • •⎟⎟⎟⎠⎟ ⎜⎜⎜

⎜⎝ • •⎟⎟⎟⎠⎟

• • • •

⎛ ⎞ ⎛ ⎞ ⎛ ⎞

⎜⎜⎜• • ⎟⎟⎟ ⎜• • ⎟⎟⎟ ⎜⎜⎜• • ⎟⎟⎟

⎜⎜ ∗ ∗ ∗ ⎟⎟⎟ T ⎜⎜⎜⎜⎜ • ∗ ⎟⎟⎟ ⎜⎜⎜ • • ⎟⎟⎟

U23 ⎜ ⎜⎜ ⎟⎟⎟ 34 ⎜⎜⎜

V ⎟⎟⎟ U34 ⎜⎜⎜ ⎟⎟

→ ⎜⎜⎜⎜ ∗ ∗ ⎟⎟⎟ → ⎜⎜⎜⎜ ∗ ∗ ⎟⎟⎟ → ⎜⎜⎜ ∗ ∗ ∗⎟⎟⎟⎟

⎜⎜⎜

⎜⎝ • •⎟⎟⎟⎠⎟ ⎜⎜⎜

⎝ ∗ ∗ •⎟⎟⎟⎟⎠ ⎜⎜⎜

⎜⎝ ∗ ∗⎟⎟⎟⎟⎠

• • •

⎛ ⎞ ⎛ ⎞

⎜⎜⎜• • ⎟⎟⎟ ⎜⎜⎜• • ⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟ ⎜ ⎟⎟⎟

T

V45 ⎜⎜ • • ⎟⎟⎟ U45 ⎜⎜⎜⎜⎜ • • ⎟⎟⎟

→ ⎜⎜⎜⎜ • ∗ ⎟⎟⎟ → ⎜⎜⎜ • • ⎟⎟⎟ ,

⎜⎜⎜

⎜⎝ ∗ ∗⎟⎟⎟⎠⎟ ⎜⎜⎜

⎝⎜ ∗ ∗⎟⎟⎟⎟⎠

∗ ∗ ∗

T

where Vk,k+1 is multiplied from the right and hence acts only on columns k and

k + 1, while Uk,k+1 is multiplied from the left and hence acts on rows k and

k + 1. All other rows and columns are left unchanged. As before, we indicated

non-changing values with • and those which change with ∗, unless they are

transformed to zero.

The result of this procedure is a matrix

T

· · · Vn−1,n

T

which is also upper bidiagonal and has the same singular values as B but

should, one hopes, have a smaller entry b̃n−1,n than B. The full algorithm of

this step is given in Algorithm 18.

Algorithm 18 not only gives a new upper bidiagonal matrix B̃ but also over-

writes the matrices U ∈ Rm×n and V ∈ Rn×n with

Ũ = UU12

T

· · · Un−1,n

T

, Ṽ = VV12

T

· · · Vn−1,n

T

,

06

5.7 Computing the Singular Value Decomposition 179

Input : B ∈ Rn×n bidiagonal matrix of the form (5.33) with BT B

irreducible, matrices U ∈ Rm×n and V ∈ Rn×n .

Output: Upper bidiagonal matrix B̃ ∈ Rn×n , matrices Ũ ∈ Rm×n and

Ṽ ∈ Rn×n with Ũ B̃Ṽ T = U BV T .

1 Determine μ ∈ R as the eigenvalue of

2

dn−1 + fn−1

2

dn−1 fn

,

dn−1 fn dn2 + fn2

2 Determine Givens rotation V12 = V12 (c, s) with

c s d12 − μ ∗

= .

−s c d1 f2 0

T T

Set B := BV12 and V := VV12 .

3 for k = 1 to n − 1 do

4 Determine Givens rotation Uk,k+1 = Uk,k+1 (c, s) with

c s bkk ∗

= .

−s c bk+1,k 0

T

5 Set B := Uk,k+1 B and U := UUk,k+1 .

6 if k < n − 1 then

7 Determine Givens rotation Vk+1,k+2 = Vk+1,k+2 (c, s) with

c s bk,k+1 ∗

= .

−s c bk,k+2 0

T T

8 Set B := BVk+1,k+2 and V := VVk+1,k+2 .

also have

B̃T B̃ = Vn−1,n · · · V12 BT BV12

T

· · · Vn−1,n

T

.

As we have

(Vn−1,n · · · V12 )T e1 = V12

T

· · · Vn−1,n

T

e1 = V12

T

e1

06

180 Calculation of Eigenvalues

we see that the ﬁrst column of (Vn−1,n · · · V1,2 )T coincides with the ﬁrst column

T

of V12 which coincides with the ﬁrst column of Q the orthogonal matrix in

the QR factorisation of BT B − μI. This means that we have indeed implicitly

applied one step of the shifted QR method and our considerations there allow

us now to expect a new matrix B̃ with considerably smaller super-diagonal

element fn .

It should be clear by now how to implement these algorithms and how to com-

bine them to create a working algorithm for computing the singular value de-

composition. After each step B → B̃ it should be checked whether the last

super-diagonal falls below a given threshold.

Exercises

5.1 • Let A ∈ Rn×n be a tridiagonal matrix with diagonal entries 5 j, 1 ≤ j ≤

n, and sub- and super-diagonal entries ±1. Use Gershgorin’s theorem

to give as tight bounds on the eigenvalues of A as possible. Show that

A is invertible.

• Let x0 ∈ Rn and deﬁne for k ∈ N xk by Axk = xk−1 , where A is the

matrix from the ﬁrst part. Show that there is a constant C > 0 such

that the components eTj xk of xk satisfy |e j xk | ≤ 4−k C.

5.2 Let A, B ∈ Rn×n be symmetric with eigenvalues λ1 ≥ · · · ≥ λn and

μ1 ≥ · · · ≥ μn , respectively. Let ν1 ≥ · · · ≥ νn be the eigenvalues of A + B

Show λ j + μn ≤ ν j ≤ λ j + μ1 for 1 ≤ j ≤ n.

5.3 Let A ∈ Rn×n be symmetric with eigenvalues λ1 , . . . , λn . Show, for every

λ ∈ R and x ∈ Rn \ {0},

λx − Ax2

min |λ − λ j | ≤ .

1≤ j≤n x2

5.4 Let A ∈ Rm×n with m ≥ n. Let σ1 ≥ · · · ≥ σn be the singular values of A.

Show

Ax2

σ j = min max , 1 ≤ j ≤ n,

U∈Mn+1− j x∈U x2

x0

5.5 Let A, B ∈ Rm×n with m ≥ n and singular values σ1 ≥ · · · ≥ σn and

τ1 ≥ · · · ≥ τn . Show

|σ j − τ j | ≤ A − B2 , 1 ≤ j ≤ n.

5.6 Show that every irreducible, symmetric tridiagonal matrix A ∈ Rn×n has

pairwise distinct eigenvalues.

06

PART THREE

ADVANCED METHODS

Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316544938

13:06:58, subject to the Cambridge

Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316544938

6

Methods for Large Sparse Systems

suited for large sparse systems, though they can also be applied to non-sparse

matrices. We will discuss multigrid methods but our main focus will be on so-

called Krylov subspace methods. For such methods, the main operation within

one step of the iteration is the calculation of one or at most two matrix–vector

products. If the matrix is sparse with only O(n) non-zero entries then this op-

eration can be done in O(n) time. For a general, non-sparse matrix the cost

becomes O(n2 ), which might still be acceptable if only a very small number of

steps is required and n is not too large.

We start by constructing an iterative method for symmetric positive deﬁnite

matrices, which was introduced in [83] by Hestenes and Stiefel. Let A ∈ Rn×n

be symmetric, positive deﬁnite and b ∈ Rn . We wish to solve the linear system

Ax = b.

1 T

f (x) := x Ax − xT b, x ∈ Rn .

2

We can calculate the ﬁrst-order and second-order derivatives as

∇ f (x) = Ax − b,

H f (x) = A.

183

07

184 Methods for Large Sparse Systems

matrix. Since all higher-order derivatives vanish, we can rewrite f as its Taylor

polynomial of degree two as

1

f (x) = f (y) + (x − y)T ∇ f (y) + (x − y)T H f (y)(x − y)

2

1

= f (y) + (x − y)T (Ay − b) + (x − y)T A(x − y)

2

≥ f (y) + (x − y)T (Ay − b),

where, in the last step, we have used that A is positive deﬁnite. Thus, if y = x∗

solves Ax = b, we see that f (x) ≥ f (x∗ ), i.e. x∗ minimises f . On the other hand,

every minimiser y of f has to satisfy the necessary condition 0 = ∇ f (y) =

Ay − b. This is only possible for y = x∗ .

Theorem 6.1 Let A ∈ Rn×n be symmetric and positive deﬁnite. The solution of

the system Ax = b is the unique minimum of the function f (y) := 12 yT Ay − yT b.

the form

x j+1 = x j + α j p j .

Here, x j is our current position. From this position we want to move in the

direction of p j . The step-length of our move is determined by α j . Of course, it

will be our goal to select the direction p j and the step-length α j in such a way

that

f (x j+1 ) ≤ f (x j ).

Given a new direction p j , the best possible step-length in that direction can be

determined by looking at the minimum of f along the line x j + αp j . Hence, if

we set φ(α) = f (x j + αp j ) the necessary condition for a minimum yields

pTj (b − Ax j ) pTj r j r j , p j 2

αj = = = ,

pTj Ap j pTj Ap j Ap j , p j 2

have a generic algorithm as outlined in Algorithm 19.

Note that we pick α j in such a way that

07

6.1 The Conjugate Gradient Method 185

Input : A = AT ∈ Rn×n positive deﬁnite, b ∈ Rn .

Output: Approximate solution of Ax = b.

1 Choose x0 and p0 .

2 Compute r0 := b − Ax0 and set j := 0.

3 while r j > do

r j , p j 2

4 α j :=

Ap j , p j 2

5 x j+1 := x j + α j p j

6 r j+1 := b − Ax j+1

7 Choose next direction p j+1

8 j := j + 1

Obviously, we still have to determine how to choose the search directions. One

possible way is to pick the direction of steepest descent. It is easy to see and

known from analysis that this direction is given by the negative gradient of the

target function, i.e. by

p j = −∇ f (x j ) = −(Ax j − b) = r j .

This gives the steepest descent algorithm shown in Algorithm 20.

Input : A = AT ∈ Rn×n positive deﬁnite, b ∈ Rn .

Output: Approximate solution of Ax = b.

1 Choose x0 .

2 Set p0 := b − Ax0 and j := 0

3 while p j > do

p j , p j 2

4 α j :=

Ap j , p j 2

5 x j+1 := x j + α j p j

6 p j+1 := b − Ax j+1

7 j := j + 1

The relation (6.1) now becomes p j+1 , p j 2 = 0, which means that two succes-

sive search directions are orthogonal.

To see that the method of steepest descent converges and to determine its

07

186 Methods for Large Sparse Systems

A ∈ Rn×n has n positive eigenvalues 0 < λ1 ≤ λ2 ≤ · · · ≤ λn and Rn has an

orthonormal basis w1 , . . . , wn satisfying Aw j = λ j w j for 1 ≤ j ≤ n. We need

the following lemma.

largest eigenvalues of the symmetric and positive deﬁnite matrix A ∈ Rn×n .

Then,

Ax, x2 A−1 x, x2 (λ1 + λn )2

≤ , x ∈ Rn \ {0}.

x, x2 x, x2 4λ1 λn

notation introduced above, we can expand such an x ∈ Rn as x = nj=1 x j w j

with

n

n

Ax, x2 = xT Ax = λ j x2j , A−1 x, x2 = xT A−1 x = λ−1

j xj.

2

j=1 j=1

Noticing that x2j ∈ [0, 1] with x2j = 1, we see that

n

λ := λ j x2j = Ax, x2

j=1

[λ1 , λn ]. Next, let us deﬁne the points P j := (λ j , 1/λ j ) for 1 ≤ j ≤ n and

P := (λ, j λ−1 j x j ). The points P j lie on the graph of the function f (t) = 1/t,

2

which is a convex function. Thus, all P j for 2 ≤ j ≤ n − 1 lie under the straight

line connecting P1 and Pn . The point P can be written as P = j x2j P j and

is therefore a convex combination of the points P j and thus contained in the

convex hull of P1 , . . . , Pn . Hence, P cannot be above the straight line through

P1 and Pn . Since this line is given by g(t) = (λ1 + λn − t)/(λ1 λn ), we see that

n

λ1 + λn − λ

A−1 x, x2 = λ−1

j x j ≤ g(λ) =

2

j=1

λ1 λn

and hence

λ 1 + λn − λ

Ax, x2 A−1 x, x2 ≤ λ . (6.2)

λ1 λn

The expression on the right-hand side is a quadratic polynomial in λ which

attains its global maximum in [λ1 , λn ] at λ = (λ1 + λn )/2. Inserting this into

(6.2) gives the desired result.

07

6.1 The Conjugate Gradient Method 187

formulate it, we introduce the inner product

√

with associated norm xA := x, xA , x ∈ Rn .

Theorem 6.3 Let A ∈ Rn×n be symmetric and positive deﬁnite. Then, for a

given right-hand side b ∈ Rn , the method of steepest descent converges for

every starting point x0 ∈ Rn to the solution x∗ ∈ Rn of Ax = b. Moreover, if x j

denotes the jth iteration, the error can be bounded by

j

∗ κ2 (A) − 1

x − x j A ≤ x∗ − x0 A , (6.4)

κ2 (A) + 1

where κ2 (A) = A2 A−1 2 denotes the condition number of A with respect to

the Euclidean norm.

e j+1 = x∗ − x j − α j p j = e j − α j p j ,

p j+1 = b − Ax j+1 = A(x∗ − x j+1 ) = Ae j+1 .

The orthogonality of p j and p j+1 and the fact that p j+1 = Ae j+1 shows

= p j , e j 2 − α j p j , p j 2 = Ae j , e j 2 − α j p j , p j 2

p j , p j 2

= Ae j , e j 2 1 − α j

Ae j , e j 2

p j , p j 2 p j , p j 2

= e j 2A 1 −

Ap j , p j 2 p j , A−1 p j 2

4λ1 λn

≤ e j 2A 1 −

(λ1 + λn )2

(λn − λ1 )2

= e j 2A ,

(λn + λ1 )2

07

188 Methods for Large Sparse Systems

κ2 (A) − 1

e j+1 A ≤ e j A

κ2 (A) + 1

and iterating this argument gives the stated error bound and convergence.

The predicted error bound shows that the convergence is best if the condition

number of A is close to 1 and that the convergence will be arbitrarily slow if the

condition number is large. We can even estimate the number of iterations that

we would need to reduce the initial error x∗ − x0 A to a given > 0, though

this estimate is true only when one is working in exact arithmetic.

the iterations of the steepest descent method then we have

x∗ − x j A

≤

x∗ − x0 A

after at most j ≥ 12 κ2 (A) log(1/ ) steps.

Proof Let us write κ := κ2 (A). It follows from (6.4) that the relative error

drops below if

j

κ+1 1

> .

κ−1

Taking the logarithm of both sides shows that we need

log(1/ )

j≥ . (6.5)

log κ+1

κ−1

∞

κ+1 1 −2k−1 1 1 1 2

log =2 κ = 2 + 3 + 5 + ··· ≥

κ−1 k=0

2k + 1 κ 3κ 5κ κ

The next example shows that the error bound of the previous theorem can be

sharp, showing also that even for moderately conditioned matrices the conver-

gence of the steepest descent method becomes unacceptably slow.

Example 6.5 Let us choose the following matrix, right-hand side and initial

position:

1 0 0 9

A= , b= , x0 = .

0 9 0 1

07

6.1 The Conjugate Gradient Method 189

It is not too diﬃcult to see that the method of steepest descent produces the

iterations

x j = (0.8) j (9, (−1) j )T ,

0.5

0 1 2 3 4 5 6 7 8

relevant quantities from Theorem 6.3. First of all we have κ2 (A) = 9 and hence

the convergence factor√is (κ2 (A) − 1)/(κ2 (A) + 1) = 0.8. Moreover, we have

x0 − x∗ A = x0 A = 90 and

∗ 2 j 1 0 9

x j − x A = (0.8) (9, (−1) )

2j

= (0.8)2 j 90 = 0.82 j x0 − x∗ 2A .

0 9 (−1) j

Thus, we have equality in equation (6.4).

xj

x j+1

pj

The problem of the previous example becomes more apparent if we look at the

level curves of the corresponding quadratic function f (x) = xT Ax = 12 (x12 +

9x22 ), which are ellipses, see Figure 6.2. Since the new search direction has to

be orthogonal to the previous one, we are always tangential to the level curves,

and hence the new direction does not point towards the centre of the ellipses,

which is our solution vector.

07

190 Methods for Large Sparse Systems

yj

s j+1

y j+1

sj

Figure 6.3 The steepest descent method, level curves after scaling.

y := A1/2 x, in which the function becomes g(y) = f (A−1/2 y) = 12 yT y. Follow-

ing this idea, we transform the positions by setting y j := A1/2 x j and the di-

rections by s j := A1/2 p j . If the new search directions are orthogonal as shown

in Figure 6.3, we ﬁnd the minimum in two steps. Moreover, being orthogonal

now translates back into 0 = sTj s j+1 = pTj Ap j+1 = Ap j , p j+1 2 .

Following these ideas, we make the following deﬁnition.

conjugate for a given positive deﬁnite and symmetric matrix A ∈ Rn×n if

Apk , p j 2 = 0 for all 0 ≤ j k ≤ n − 1.

Remark 6.7 • Since A is symmetric and positive deﬁnite, we have for each

p 0 that Ap, p2 = p, Ap2 > 0.

• As in (6.3), we can introduce an inner product via x, yA := Ax, y2 =

x, Ay2 . With respect to this inner product, the search directions are orthog-

onal.

• A-conjugate directions are always linearly independent.

For A-conjugate directions it turns out that the generic algorithm, Algorithm

19, terminates after n steps, so that the method is actually a direct method.

Of course, we hope to have an acceptable approximation after signiﬁcantly

fewer iterations. However, in some cases, due to numerical contamination, the

algorithm might even require more than n steps.

07

6.1 The Conjugate Gradient Method 191

be given and assume that the search directions p0 , . . . , pn−1 ∈ Rn \ {0} are A-

conjugate. The generic Algorithm 19 terminates after at most n steps with the

solution x∗ of Ax = b.

Proof Since the search directions form a basis of Rn we can represent the

vector x∗ − x0 in this basis, which means that we can ﬁnd β0 , . . . , βn−1 with

n−1

x∗ = x0 + β jp j.

j=0

Furthermore, from the generic algorithm, we can conclude that the iterations

have the representation

i−1

xi = x0 + α jp j, 0 ≤ i ≤ n. (6.6)

j=0

ri , pi 2

αi = .

Api , pi 2

To compute the βi , we ﬁrst observe that

n−1

b − Ax0 = A(x∗ − x0 ) = β j Ap j

j=0

and hence

n−1

b − Ax0 , pi 2 = β j Ap j , pi 2 = βi Api , pi 2 ,

j=0

because the directions are A-conjugate. This gives the explicit representation

b − Ax0 , pi 2 r0 , pi 2

βi = = , 0 ≤ i ≤ n − 1,

Api , pi 2 Api , pi 2

which diﬀers, at ﬁrst sight, from the αi in the numerator. Fortunately, from

(6.6) we can conclude that

i−1

ri = b − Axi = r0 − α j Ap j ,

j=0

i−1

ri , pi 2 = r0 , pi 2 − α j Ap j , pi 2 = r0 , pi 2 ,

j=0

employing again the fact that the directions are A-conjugate.Thus we have αi =

βi for all i.

07

192 Methods for Large Sparse Systems

It remains to answer the question regarding how the search directions are ac-

tually determined. Obviously, we do not want to determine them a priori but

while we iterate, hoping that our iteration will terminate signiﬁcantly before

the iteration step becomes n.

Assume we already have created directions p0 , . . . , p j . In the case of Ax j+1 = b

we can terminate the algorithm and hence do not need to compute another

direction. Otherwise, we could try to compute the next direction by setting

j

j

p j+1 = b − Ax j+1 + β jk pk = r j+1 + β jk pk ,

k=0 k=0

j

0 = Api , p j+1 2 = Api , r j+1 2 + β jk Api , pk 2

k=0

= Api , r j+1 2 + β ji Api , pi 2 , 0 ≤ i ≤ j,

as

Api , r j+1 2

β ji = − , 0 ≤ i ≤ j.

Api , pi 2

Surprisingly, β j0 , . . . , β j, j−1 vanish automatically, as we will see soon, so that

only the coeﬃcient

Ap j , r j+1 2

β j+1 := β j j = −

Ap j , p j 2

remains and the new direction is given by

This gives the ﬁrst preliminary version of the conjugate gradient (CG) method,

Algorithm 21.

It is now time to show that the so-deﬁned directions are indeed A-conjugate.

Theorem 6.9 If the CG method does not stop prematurely, the introduced

vectors satisfy the following equations:

Ap j , pi 2 = 0, 0 ≤ j ≤ i − 1, (6.7)

r j , ri 2 = 0, 0 ≤ j ≤ i − 1, (6.8)

r j , p j 2 = r j , r j 2 , 0 ≤ j ≤ i. (6.9)

Hence, it needs at most n steps to compute the solution x∗ of Ax = b.

07

6.1 The Conjugate Gradient Method 193

Input : A = AT ∈ Rn×n positive deﬁnite, b ∈ Rn .

Output: Approximate solution of Ax = b.

1 Choose x0 ∈ Rn .

2 Set p0 := r0 := b − Ax0 and j := 0.

3 while r j 0 do

4 α j := r j , p j 2 /Ap j , p j 2

5 x j+1 := x j + α j p j

6 r j+1 := b − Ax j+1

7 β j+1 := −Ap j , r j+1 2 /Ap j , p j 2

8 p j+1 := r j+1 + β j+1 p j

9 j := j + 1

show for (6.7) and (6.8); (6.9) follows immediately from p0 = r0 .

Let us now assume everything is satisﬁed for an arbitrary i ≥ 0. We will ﬁrst

show that then (6.8) follows for i + 1. By deﬁnition, we have

ri+1 = b − Axi+1 = b − Axi − αi Api = ri − αi Api . (6.10)

Thus, by the induction hypothesis, we have for j ≤ i − 1 immediately

r j , ri+1 2 = r j , ri 2 − αi Ar j , pi 2 = r j , ri 2 − αi A(p j − β j p j−1 ), pi 2

= r j , ri 2 − αi Ap j , pi 2 + αi β j Ap j−1 , pi 2 = 0.

For j = i we can use the deﬁnition of αi , equation (6.10) and

Ari , pi 2 = A(pi − βi pi−1 ), pi 2 = Api , pi 2

and the third induction hypothesis to conclude that

ri , pi 2

ri , ri+1 2 = ri , ri 2 − αi Ari , pi 2 = ri , ri 2 − Ari , pi 2

Api , pi 2

= ri , ri 2 − ri , pi 2 = 0,

which ﬁnishes the induction step for (6.8). Similarly, we proceed for (6.7).

First of all, we have

Ap j , pi+1 2 = Ap j , (ri+1 + βi+1 pi )2 = Ap j , ri+1 2 + βi+1 Ap j , pi 2 .

In the case of j = i this leads to

Api , ri+1 2

Api , pi+1 2 = Api , ri+1 2 − Api , pi 2 = 0.

Api , pi 2

07

194 Methods for Large Sparse Systems

particularly means r j , p j 2 = r j , r j 2 = 0, which is equivalent to r j = 0.

Thus if α j = 0 then the iteration would have stopped before. We hence may

assume that α j 0 such that we can use (6.10) to gain the representation

Ap j = (r j − r j+1 )/α j . Together with (6.8) for i + 1 this leads to

1

Ap j , pi+1 2 = Ap j , ri+1 2 = r j − r j+1 , ri+1 2 = 0.

αj

This proves (6.7) for i + 1. Finally, for (6.9) we use (6.1) from the generic

minimisation, which holds for any index, in the form ri+1 , pi 2 = 0. From this,

we can conclude that

ri+1 , pi+1 2 = ri+1 , ri+1 2 + βi+1 ri+1 , pi 2 = ri+1 , ri+1 2 ,

which ﬁnalises our induction step.

For the computation of the next iteration step, we need pi+1 0. If this is

not the case then the method terminates. Since 0 = ri+1 , pi+1 2 = ri+1 22 and

hence ri+1 = 0 = b − Axi+1 we have produced a solution. If the method does

not terminate early then, after n steps, we have created n conjugate directions

and Theorem 6.8 shows that xn is the true solution x∗ .

It is possible to improve the preliminary version of the CG method. First of

all, we can get rid of the matrix–vector multiplication in the deﬁnition of r j+1

by using (6.10). Moreover, from (6.10) it also follows that Ap j = α1j (r j − r j+1 )

such that

Ap j , r j+1 2 1 r j , r j+1 2 − r j+1 , r j+1 2

β j+1 = − =−

Ap j , p j 2 αj Ap j , p j 2

r j+1 , r j+1 Ap j , p j 2 r j+1 22

= = .

Ap j , p j 2 r j , p j 2 r j 22

This leads to a version of the CG method which can easily be run on a par-

allel computer, and which contains only one matrix–vector multiplication and

three inner products and three scalar–vector multiplications, as detailed in Al-

gorithm 22.

algorithm. In particular, we will investigate convergence in the energy norm

√

xA := x, xA , x ∈ Rn .

Deﬁnition 6.10 Let A ∈ Rn×n and r ∈ Rn be given. For i ∈ N we deﬁne the

ith Krylov space to A and r as

Ki (A, r) = span{r, Ar, . . . , Ai−1 r}.

07

6.1 The Conjugate Gradient Method 195

Input : A = AT ∈ Rn×n positive deﬁnite, b ∈ Rn .

Output: Approximate solution of Ax = b.

1 Choose x0 ∈ Rn .

2 Set p0 := r0 := b − Ax0 and j := 0.

3 while r j 2 > do

4 t j := Ap j

5 α j := r j 22 /t j , p j 2

6 x j+1 := x j + α j p j

7 r j+1 := r j − α j t j

8 β j+1 := r j+1 22 /r j 22

9 p j+1 := r j+1 + β j+1 p j

10 j := j + 1

In general, we only have dim Ki (A, r) ≤ i, since r could, for example, be from

the null space of A. Fortunately, in our situation things are much better.

Lemma 6.11 Let A ∈ Rn×n be symmetric and positive deﬁnite. Let pi and ri

be the vectors created during the CG iterations. Then,

Ki (A, r0 ) = span{r0 , . . . , ri−1 } = span{p0 , . . . , pi−1 }

for 1 ≤ i ≤ n. In particular, we have dim Ki (A, r0 ) = i.

Proof The proof is again by induction on i. For i = 1 this follows immediately

as we have p0 = r0 . For the induction step i → i + 1 we assume that we have

not stopped prematurely. By construction, we have pi = ri + βi pi−1 with βi 0

as we would have stopped before otherwise. This shows pi ∈ span{ri , pi−1 } and

hence by induction span{p0 , . . . , pi } ⊆ span{r0 , . . . , ri }. As the A-conjugate di-

rections p0 , . . . , pi are linearly independent, comparing the dimensions shows

that we actually have span{p0 , . . . , pi } = span{r0 , . . . , ri }.

Using ri = ri−1 − αi−1 Api−1 , where we again may assume that αi−1 0,

we ﬁnd ri ∈ span{ri−1 , Api−1 } ⊆ Ki (A, r0 ) + AKi (A, r0 ) = Ki+1 (A, r0 ) and

thus span{r0 , . . . , ri } ⊆ Ki+1 (A, r0 ) and a comparison of the dimensions shows

equality again.

Recall that the true solution x∗ of Ax = b and the iterations xi of the CG method

can be written as

n−1 i−1

x∗ = x0 + α j p j and xi = x0 + α jp j.

j=0 j=0

07

196 Methods for Large Sparse Systems

x0 + Ki (A, r0 ) by

i−1

x = x0 + γ jp j

j=0

directions are A-conjugate, we can conclude that

2

n−1

n−1

n−1

x∗ − xi 2A = α j p j = α j αk Apk , p j 2 = d j |α j |2

j=i A j,k=i j=i

2

n−1

i−1 n−1

i−1

≤ d j |α j |2 + d j |α j − γ j |2 = α j p j + (α j − γ j )p j

j=i j=0 j=i j=0 A

2

i−1

= x∗ − x0 − γ j p j = x∗ − x2A .

j=0 A

Since this holds for an arbitrary x ∈ x0 +Ki (A, r0 ) we have proven the following

theorem.

Theorem 6.12 Let A ∈ Rn×n be symmetric and positive deﬁnite. The iteration

xi from the CG method gives the best approximation to the solution x∗ of Ax =

b from the aﬃne space x0 + Ki (A, r0 ) with respect to the energy norm · A , i.e.

the energy norm, i.e. for all i we have

Since xi+1 minimises x → x∗ −xA over x0 +Ki+1 (A, r0 ), the statement follows

immediately.

Next, let us rewrite this approximation problem in the original basis of the

Krylov space. We will use the following notation. We denote the set of all

polynomials of degree less than or equal to n by πn = πn (R).

07

6.1 The Conjugate Gradient Method 197

For a polynomial P(t) = i−1

j=0 γ j t ∈ πi−1 , a matrix A ∈ R

j n×n

and x ∈ Rn we

will write

i−1

i−1

P(A) = γ j A j, P(A)x = γ j A j x.

j=0 j=0

i−1

P(A)x = γ j λ j x = P(λ)x.

j=0

Theorem 6.14 Let A ∈ Rn×n be symmetric and positive deﬁnite, having the

eigenvalues 0 < λ1 ≤ · · · ≤ λn . If x∗ denotes the solution of Ax = b and if xi

denotes the ith iteration of the CG method, then

P∈πi 1≤ j≤n

P(0)=1

i−1

x = x0 + γ j A j r0 =: x0 + Q(A)r0 ,

j=0

introducing the polynomial Q(t) = i−1 j=0 γ j t ∈ πi−1 , where we set Q = 0 for

j

∗

i = 0. This gives with r0 = A(x −x0 ) and P(A) := I − Q(A)A the representation

x∗ − x = x∗ − x0 − Q(A)r0 = x∗ − x0 − Q(A)A(x∗ − x0 )

= (I − Q(A)A)(x∗ − x0 ) = P(A)(x∗ − x0 )

and hence

x∗ − xA = P(A)(x∗ − x0 )A .

i−1

i

P(t) = 1 − γ j t j+1 = 1 − γ j−1 t j

j=0 j=1

given then we can deﬁne Q(t) = (P(t) − 1)/t ∈ πi−1 , which leads to an element

from x0 +Ki (A, r0 ), for which the calculation above holds. Thus, Theorem 6.12

yields

x∗ − xi A ≤ min P(A)(x∗ − x0 )A . (6.12)

P∈πi

P(0)=1

07

198 Methods for Large Sparse Systems

of A associated with the eigenvalues λ1 , . . . , λn . These vectors satisfy

w j , wk A = Aw j , wk 2 = λ j w j , wk 2 = λ j δ jk .

Moreover, we can represent every vector using this basis and with such a rep-

resentation x∗ − x0 = ρ j w j we can conclude that

n

P(A)(x∗ − x0 ) = ρ j P(λ j )w j

j=1

n

n

P(A)(x∗ − x0 )2A = ρ2j λ j P(λ j )2 ≤ max P(λ j )2 ρ2j λ j

1≤ j≤n

j=1 j=1

∗

= max P(λ j ) x −2

x0 2A .

1≤ j≤n

crete ∞ -approximation problem. Recall that by Theorem 6.8 the CG method

converges after at most n steps. This can also directly be concluded from (6.11),

since for i = n we can pick a polynomial P∗ ∈ πn with P∗ (0) = 1 and P∗ (λ j ) = 0

for 1 ≤ j ≤ n, such that (6.11) gives

P∈πn 1≤ j≤n 1≤ j≤n

P(0)=1

Corollary 6.15 Assume that A ∈ Rn×n is symmetric and positive deﬁnite and

has 1 ≤ d ≤ n distinct eigenvalues. Then, the CG method terminates with the

solution x∗ ∈ Rn of Ax = b after at most d steps.

Since the eigenvalues are in general unknown, we need to convert the discrete

∞ problem into a continuous L∞ approximation problem by further estimating

P∈πi

P(0)=1

where PL∞ [a,b] = max x∈[a,b] |P(x)|. Note that λ1 and λn can be replaced by

estimates λ̃1 ≤ λ1 and λ̃n ≥ λn .

The solution of this approximation problem can be computed explicitly. It in-

volves the classical Chebyshev polynomials.

07

6.1 The Conjugate Gradient Method 199

T n (t) = cos(n arccos(t)), t ∈ [−1, 1].

It is easy to see that we have T 0 (t) = 1, T 1 (t) = t and

T n+1 (t) = 2tT n (t) − T n−1 (t). (6.13)

From this, it follows immediately that T n is indeed a polynomial of degree n.

We also need some further elementary properties of the Chebyshev polynomi-

als, which we will collect in the next lemma.

Lemma 6.17 The Chebyshev polynomials have the following properties.

1. For every t ∈ [0, 1), we have

√ n

1 1+ t 1+t

√ ≤ Tn .

2 1− t 1−t

2. For every t ∈ [−1, 1] we have |T n (t)| ≤ 1.

3. On [−1, 1], the Chebyshev polynomial T n attains its extreme values ±1 at

tk = cos(πk/n), 0 ≤ k ≤ n. To be more precise we have T n (tk ) = (−1)k .

Proof The second property follows immediately from the deﬁnition. The

third property follows from the second one and by simple computation. For

the ﬁrst property it is helpful to show that for |x| ≥ 1, we have the representa-

tion

1 √ n √ n

T n (x) = x + x2 − 1 + x − x2 − 1 .

2

This is clear for n = 0 and n = 1 and for general n it follows if we can show

that the right-hand side follows the same recursion (6.13) as the Chebyshev

polynomials, which is easily done. If we now set x = 1−t 1+t

with t ∈ [0, 1) we

have x ≥ 1 and hence

√

1+t 1 √ n 1 1 + t n

Tn ≥ x+ x −1 =

2 √ .

1−t 2 2 1− t

With this at hand, we can continue with bounding the error for the conjugate

gradient method.

Theorem 6.18 Let λn > λ1 > 0. The problem

min{PL∞ [λ1 ,λn ] : P ∈ πi , P(0) = 1}

has the solution

λn + λ1 − 2t # λn + λ1

P∗ (t) = T i Ti , t ∈ [λ1 , λn ].

λn − λ 1 λn − λ 1

07

200 Methods for Large Sparse Systems

λn + λ 1 1 + γ

= ≥ 1.

λn − λ 1 1 − γ

This means that the denominator of P∗ is positive and P∗ is hence a feasible

candidate for the minimum.

Noting that

λn + λ1 − 2t

t → , t ∈ [λ1 , λn ]

λn − λ 1

maps [λ1 , λn ] linearly to [−1, 1], we can conclude from Lemma 6.17 that there

are i + 1 diﬀerent points s j ∈ [λ1 , λn ] on which P∗ attains its extreme value

1

M := λ +λ

Ti n 1

λn −λ1

Now, assume that P∗ is not the solution to the minimisation problem, i.e. that

there is a Q ∈ πi with Q(0) = 1 and |Q(t)| < M for all t ∈ [λ1 , λn ]. Then, P∗ − Q

must have alternating signs in the s j , which means that P∗ − Q must have at

least i zeros in [λ1 , λn ]. However, P∗ − Q is also zero at 0 [λ1 , λn ], which

means that P∗ − Q must have at least i + 1 diﬀerent zeros. But since P∗ − Q ∈ πi

this means P∗ = Q by the fundamental theorem of algebra. This is, however, a

contradiction to our assumption on Q.

This altogether gives the ﬁnal estimate on the convergence of the CG method.

We have, again with γ = λ1 /λn , the bound

P∈πi

P(0)=1

"" "

""T i λn +λ1 −2t """

λn −λ1

≤ max x∗ − x0 A

t∈[λ1 ,λn ] T λn +λ1

i λn −λ1

1

≤ 1+γ x∗ − x0 A

T i 1−γ

√

1− γ i ∗

≤2 √ x − x0 A .

1+ γ

Finally using the fact that 1/γ = κ2 (A) shows that we have the following result.

Theorem 6.19 Let A ∈ Rn×n be symmetric and positive deﬁnite. The sequence

07

6.1 The Conjugate Gradient Method 201

√ √ i

1− γ i κ2 (A) − 1

x∗ − xi A ≤ 2x∗ − x0 A √ ≤ 2x∗

− x

0 A √ , (6.14)

1+ γ κ2 (A) + 1

where γ = 1/κ2 (A) = λ1 /λn .

As in the case of the method of steepest descent we can determine the expected

number of iterations necessary to reduce the initial error to a given > 0. The

proof of the following result is the same as the proof of Corollary 6.4.

notes the iterations of the conjugate gradient method then we have

x∗ − x j A

≤

x∗ − x0 A

√

after at most j ≥ 1

2 κ2 (A) log(1/ ) steps.

If we compare this result with j ≥ 21 κ2 (A) log(1/ ), which is required for the

√

steepest descent method and note that κ2 (A) ≤ κ2 (A) we see that the CG

method is signiﬁcantly faster than the steepest descent method.

In speciﬁc situations, the estimate (6.14) can further be improved. As a mat-

ter of fact, (6.14) is useful only if the condition number is small and if the

eigenvalues are more or less quasi-uniformly distributed over [λ1 , λn ]. If the

eigenvalues are clustered, it is possible to choose better polynomials.

For example, let us assume that A has an isolated largest eigenvalue, i.e. we

have λ1 ≤ λ2 ≤ · · · ≤ λn−1 λn . Then, with each polynomial P̃ ∈ πi−1 with

P̃(0) = 1 we can deﬁne a polynomial P ∈ πi by

λn − t

P(t) := P̃(t), t ∈ [λ1 , λn ],

λn

which also satisﬁes P(0) = 1 but also P(λn ) = 0 and, since the ﬁrst factor

has absolute value less than or equal to one on [λ1 , λn ], we also have |P(λ j )| ≤

|P̃(λ j )|. Thus, using (6.11) and then the ideas leading to (6.14), we can derive

P∈πi 1≤ j≤n P∈πi−1 1≤ j≤n−1

P(0)=1 P(0)=1

"" "

""T i−1 λn−1 +λ1 −2t """

λn−1 −λ1 1

≤ max λ +λ x∗ − x0 A ≤ 1+γ x∗ − x0 A

t∈[λ1 ,λn−1 ] T n−1 1

T i−1 1−γ

i−1 λn−1 −λ1

√ i−1

1− γ

≤2 √ x∗ − x0 A

1+ γ

07

202 Methods for Large Sparse Systems

with γ = λ1 /λn−1 . Hence, we have eliminated the largest eigenvalue and ex-

pressed convergence using the reduced condition number κn−1 := λn−1 /λ1 at

the cost of one factor. Obviously, this result can be generalised in such a way

that the d largest eigenvalues are discarded.

Corollary 6.21 If A ∈ Rn×n is symmetric and positive deﬁnite with eigenval-

ues 0 < λ1 ≤ · · · ≤ λn−d λn−d+1 ≤ · · · ≤ λn , then we have for i ≥ d the

estimate

√

∗ κn−d − 1 i−d ∗

x − xi A ≤ 2 √ x − x0 A ,

κn−d + 1

with the reduced condition number κn−d = λn−d /λ1 .

Similarly, we can deal with an isolated smallest eigenvalue. However, the sit-

uation here is somewhat worse, in the following sense. If we proceed as above

then, for a polynomial P̃ ∈ πi−1 with P̃(0) = 1, we deﬁne the polynomial P ∈ πi

by

λ1 − t

P(t) = P̃(t), t ∈ [λ1 , λn ].

λ1

We still have P(0) = 1 and P(λ1 ) = 0. But at the other eigenvalues λ j the

absolute value of the ﬁrst factor can only be bounded by (λ j − λ1 )/λ1 ≤ (λn −

λ1 )/λ1 = κ̃1 − 1 with the (original) condition number κ̃1 = κ2 (A) = λn /λ1 .

Hence, this time we only have

√ i−1

∗ κ̃2 − 1

x − xi A ≤ 2 (κ̃1 − 1) √ x∗ − x0 A

κ̃2 + 1

with κ̃2 = λn /λ2 . This can still be an improvement over the standard estimate

since the condition number enters only as a constant factor but the convergence

forcing term can signiﬁcantly improve. Of course, we can iterate this process,

as well.

Corollary 6.22 If A ∈ Rn×n is symmetric and positive deﬁnite with eigen-

values 0 < λ1 ≤ · · · ≤ λd λd+1 ≤ · · · ≤ λn , then we have for i ≥ d the

estimate

d √κ̃d+1 − 1 i−d

x∗ − xi A ≤ 2 κ̃ j − 1 √ x∗ − x0 A ,

j=1

κ̃d+1 + 1

This also shows that the method still converges if the matrix is almost singular,

i.e. if λ1 ≈ · · · ≈ λd ≈ 0, but it might take some time for the convergence

to become eﬀective. The convergence rate depends then on [λd+1 , λn ]. Hence,

07

6.2 GMRES and MINRES 203

one should try to use appropriate matrix manipulations to reduce the condition

number of A. If the matrix A is almost singular one should try to ﬁnd a trans-

formation of the smallest eigenvalues to zero and of the bigger eigenvalues to

an interval of the form (λ, λ(1 + ε)) ⊂ (0, ∞) with ε as small as possible.

In general it does not matter if, during this process, the rank of the matrix is

reduced. Generally, this leads only to a small additional error but also to a

signiﬁcant increase in the convergence rate of the CG method.

We are now going to discuss how to solve the linear system Ax = b for arbitrary

matrices A ∈ Rn×n using ideas similar to those we applied in the CG method.

One possibility would be to look at the normal equations AT Ax = AT b, which

lead to the same solution, provided the matrix A is invertible. Since AT A is

symmetric and positive deﬁnite, we can use the CG method to solve this prob-

lem. If x∗ is the solution and x j ∈ x0 + K j (AT A, AT r0 ) is a typical iteration then

we have the identity

x∗ − x j 2AT A = (x∗ − x j )T AT A(x∗ − x j ) = (Ax∗ − Ax j )T (Ax∗ − Ax j )

= b − Ax j 22 .

This is the reason why this method is also called the CGNR method, where

CGNR stands for Conjugate Gradient on the Normal equation to minimise the

Residual.

However, the method is only of limited use since we know that we have to ex-

pect a condition number which is the square of the original condition number.

Nonetheless, we can generalise this idea in the following way.

Deﬁnition 6.23 Assume A ∈ Rn×n is invertible and b ∈ Rn . Let x0 ∈ Rn

and r0 := b − Ax0 . The jth iteration of the GMRES (Generalised Minimum

Residual) method is deﬁned to be the solution of

min{b − Ax2 : x ∈ x0 + K j (A, r0 )}. (6.15)

If A ∈ Rn×n is symmetric, then the method generating the sequence {x j } by

minimising (6.15) is called MINRES (Minimum Residual).

The diﬀerence between MINRES and GMRES is two-fold. First of all, in the

case of GMRES, an actual algorithm for creating the sequence {x j } needs to

work for all non-singular matrices A ∈ Rn×n , while in the case of MINRES

such an algorithm can exploit the fact that it will only be used for symmetric

matrices. We will soon see that this makes a signiﬁcant diﬀerence in designing

07

204 Methods for Large Sparse Systems

MINRES we can use the fact that the matrix is diagonalisable with only real

eigenvalues and a corresponding orthonormal basis consisting of eigenvectors.

We cannot do this when analysing GMRES. Finally, historically, MINRES was

introduced in [99] by Paige and Saunders long before GMRES was presented

in [105] by Saad and Schultz, but this is obviously of no consequence for us.

Despite these diﬀerences, the starting point for analysing GMRES and MIN-

RES is the same and very similar to the analysis of the CG method. Since every

j−1

x ∈ x0 + K j (A, r0 ) can be written as x = x0 + k=0 ak Ak r0 , we can conclude

that

j−1

j

b − Ax = b − Ax0 − ak Ak+1 r0 = r0 − ak−1 Ak r0 =: P(A)r0

k=0 k=1

x is equivalent to minimising over all these P.

the iterations of the GMRES (MINRES) method then

b − Ax j 2 = min P(A)r0 2 .

P∈π j

P(0)=1

Moreover, the method needs at most n steps to terminate with the solution of

Ax = b.

Proof It remains to show that the method stops with the solution x∗ . To see

this, let p(t) = det(A − tI) be the characteristic polynomial of A. Then, p ∈ πn

and, since A is invertible, p(0) = det A 0. Hence, we can choose the poly-

nomial P := p/p(0) in the last step. However, the Cayley–Hamilton theorem

gives p(A) = 0 and hence P(A) = 0 so that we must have b = Axn , unless the

method terminates before.

In the case of the CG method, we then used the fact that there is a basis of

Rn consisting of orthonormal eigenvectors of A. This is the point where the

analyses of MINRES and GMRES diverge. Since the analysis of MINRES is

closer to that one of CG we will ﬁrst look at MINRES.

Hence, for the time being, let A = AT ∈ Rn×n be symmetric. Then, A has real

eigenvalues λ j ∈ R, 1 ≤ j ≤ n, which are due to the invertibility of A all diﬀer-

ent from zero. Moreover, we can choose an orthonormal basis w1 , . . . , wn ∈ Rn

of Rn consisting of eigenvectors of A, i.e. Aw j = λ j w j . With this we can con-

tinue as in the analysis of the CG method and expand x ∈ Rn with x2 = 1 as

07

6.2 GMRES and MINRES 205

x = x j w j which then gives P(A)x = x j P(λ j )w j and hence

n

P(A)22 = sup P(A)x22 = sup x2j P2 (λ j ) ≤ max P2 (λ j ).

x2 =1 x2 =1 j=1 1≤ j≤n

Proposition 6.25 Let A ∈ Rn×n be symmetric and invertible, having the eigen-

values λ1 ≤ · · · ≤ λn . Let x∗ ∈ Rn denote the solution of Ax = b and let {xi }

denote the iterations from MINRES, then the residuals ri := b − Axi satisfy

ri 2 ≤ min max |P(λ j )| r0 2 . (6.16)

P∈πi 1≤ j≤n

P(0)=1

still real numbers, will be negative and positive, which makes the analysis

somewhat more complicated. In this context, it is usual to assume that the

eigenvalues of A are contained in [a, b] ∪ [c, d] with a < b < 0 < c < d

such that d − c and b − a are as small as possible. Ideally, a, b, c, d are given

by eigenvalues of A. However, for the purpose of the analysis it makes things

easier to assume that b − a = d − c, which can be achieved by shifting some of

the values and increasing the size of one of the intervals if necessary.

Since we now have both positive and negative eigenvalues we must expect a

slower convergence than when we only have positive (or negative) eigenvalues.

Nonetheless, using ideas similar to those in the analysis of the CG method, we

can derive the following result.

Theorem 6.26 Let A ∈ Rn×n be symmetric and invertible. Let the eigenvalues

of A be contained in [a, b] ∪ [c, d] with a < b < 0 < c < d and b − a = d − c.

Then, the residual ri = b − Axi between the true solution x∗ ∈ Rn of Ax = b

and the ith iteration xi of MINRES can be bounded by

√ √ i/2

|ad| − |bc|

ri 2 ≤ 2 √ √ r0 2 .

|ad| + |bc|

Proof As we want to use our ideas on Chebyshev polynomials, we need to

map the intervals [a, b] and [c, d] simultaneously to [−1, 1]. Since they have

the same length, we can do this by a polynomial of degree two. It is not too

diﬃcult to see that this polynomial is given by

(t − b)(t − c)

q(t) = 1 + 2 , t ∈ [a, b] ∪ [c, d].

ad − bc

Next, we have

ad + bc |ad| + |bc| 1 + γ

q(0) = = = > 1,

ad − bc |ad| − |bc| 1 − γ

07

206 Methods for Large Sparse Systems

T i/2 (q(t))

P∗ (t) := ∈ πi

T i/2 (q(0))

is well-deﬁned and satisﬁes P∗ (0) = 1. Thus, we can bound

P∈πi 1≤ j≤n P∈πi t∈[a,b]∪[c,d]

P(0)=1 P(0)=1

1

≤ max |P∗ (t)| r0 2 ≤ r0 2

t∈[a,b]∪[c,d] T i/2 (q(0))

√

1 1 − γ i/2

= 1+γ r0 2 ≤ 2 √ r0 2 ,

T i/2 1+ γ

1−γ

where we have used Lemma 6.17. The result follows using γ = |bc|/|ad|.

In the case where the intervals [a, b] and [c, d] are symmetric with respect to

the origin, i.e. in the case of a = −d and b = −c, we have γ = |bc|/|ad| =

c2 /d2 ≤ κ2 (A)2 and hence

i/2

κ2 (A) − 1

ri 2 ≤ 2 r0 2 .

κ2 (A) + 1

Finally, if MINRES is applied to a positive deﬁnite and symmetric matrix, then

obviously we can proceed exactly as in the analysis of the CG method, yielding

an estimate of the form

√ i

κ2 (A) − 1

ri 2 ≤ 2 √ r0 2 .

κ2 (A) + 1

Also results similar to those in Corollaries 6.21 and 6.22 regarding large and

small isolated eigenvalues remain valid if the · A -norm there is replaced by

the · 2 -norm.

Next, let us return to the analysis of the more general GMRES method. Hence,

we are back at the representation

P∈π j

P(0)=1

have an orthonormal basis of Rn consisting of eigenvectors of A.

Nonetheless, let us assume that the matrix A is at least diagonalisable, i.e. that

there is a matrix V ∈ Cn×n and a diagonal matrix D = diag(λ1 , . . . , λn ) ∈ Cn×n ,

such that

A = V DV −1 .

07

6.2 GMRES and MINRES 207

The diagonal elements of D are the eigenvalues of A. However, from now on,

we have to work with complex matrices.

j

If P ∈ π j is of the form P(t) = 1 + k=1 ak tk then we can conclude that

j

j

P(A) = P(V DV −1 ) = 1 + ak (V DV −1 )k = 1 + ak V Dk V −1

k=1 k=1

−1

= V P(D)V ,

and we derive the following error estimate, which is, compared with the corre-

sponding result of the CG method, worse by a factor κ2 (V).

Corollary 6.27 Assume the matrix A ∈ Rn×n is similar to a diagonal matrix

with eigenvalues λk ∈ C. The jth iteration of the GMRES method satisﬁes the

error bound

b − Ax j 2 ≤ κ2 (V) min max |P(λk )| b − Ax0 2 . (6.18)

P∈π j 1≤k≤n

P(0)=1

the MINRES method, here, the eigenvalues λk can be complex, which com-

plicates matters. Even more, the role of the matrix V is problematic. Estimate

(6.18) is useful only if κ2 (V) ≈ 1. For example, if A is a normal matrix, then

V is unitary and hence κ2 (V) = 1. In this situation, (6.18) is optimal. How-

ever, if κ2 (V) is large, then the error estimate (6.18) is not useful and it is

unclear whether in such a situation GMRES converges only poorly or whether

estimate (6.18) is simply too pessimistic. If the condition number of V is suf-

ﬁciently small, then it is possible to use complex Chebyshev polynomials to

derive bounds similar to those of the CG method. Details can, for example, be

found in Saad’s book [104].

Moreover, in some situations the convergence of GMRES is diﬃcult to predict

from the eigenvalues of A alone. To see this, let us have a look at the following

example.

Example 6.28 Let A be a matrix of the form

⎛ ⎞

⎜⎜⎜0 ∗ 0 . . . 0⎟⎟⎟

⎜⎜⎜ ⎟

⎜⎜⎜0 ∗ ∗ . . . 0⎟⎟⎟⎟⎟

⎜⎜⎜⎜ .. .. .. . . ⎟⎟⎟⎟

⎜⎜⎜ . . . . ⎟⎟⎟ ,

⎜⎜⎜ ⎟

⎜⎜⎜0 ∗ ∗ . . . ∗⎟⎟⎟⎟⎟

⎝ ⎠

∗ ∗ ∗ ... ∗

where a ∗ denotes a non-zero entry, and let x0 be such that r0 = b − Ax0 is

a multiple of e1 . Then we see that Ar0 is a multiple of en , A2 r0 is a linear

07

208 Methods for Large Sparse Systems

are orthogonal to r0 .

Since we can write x j as x j = x0 + y j , where y j minimises r0 − Ay2 over all

y ∈ K j (A, r0 ), we see that this minimising of r0 − Ay2 over all y ∈ K j (A, r0 )

will always produce y j = 0 as long as j ≤ n − 1, i.e. we will have x j = x0 and

hence no progress for 1 ≤ j ≤ n − 1.

Typical matrices of the form given in the example are companion matrices, i.e.

matrices of the form

⎛ ⎞

⎜⎜⎜ 0 1 0 . . . 0 ⎟⎟

⎟⎟

⎜⎜⎜⎜ 0 0 1 . . . 0 ⎟⎟⎟⎟

⎜⎜⎜ ⎟

A = ⎜⎜⎜⎜ ...

⎜ .. .. . . .. ⎟⎟⎟⎟

. . . . ⎟⎟⎟ .

⎜⎜⎜⎜ ⎟⎟

⎜⎜⎜ 0 0 0 . . . 1 ⎟⎟⎟⎟

⎝ ⎠

a0 a1 a2 . . . an−1

The eigenvalues are the roots of the polynomial tn − n−1 j

j=0 a j t , which shows that

matrices of the above form can occur with any desired eigenvalue distribution.

We will not further pursue this here. Instead we will look at a diﬀerent possible

approach.

To describe this, we need to broaden our perspective a little. First of all, it is

easy to see that for a real matrix A ∈ Rn×n , the GMRES iteration x j satisﬁes

(6.17) even if we allow polynomials P ∈ π j with complex coeﬃcients. This

follows because such a polynomial would have the form P(t) = Q(t) + iR(t)

with real polynomials Q, R ∈ π j satisfying Q(0) = 1 and R(0) = 0 and from

1/2

P(A)r0 2 = Q(A)r0 22 + R(A)r0 22

In the case of a complex matrix A ∈ Cn×n we would deﬁne the GMRES iter-

ation by looking at complex Krylov spaces and hence would immediately use

polynomials with complex coeﬃcients in (6.17). Hence, for the time being let

us work in the complex setting.

For a symmetric matrix we know that the maximum of the Rayleigh quotient

deﬁnes its maximum eigenvalue. The same is true for normal matrices. For a

general matrix we can generalise this concept as follows.

be

4 5

F (A) := xT Ax : x ∈ Cn , xT x = 1 .

07

6.2 GMRES and MINRES 209

ν(A) := max{|z| : z ∈ F (A)}.

Since the function x → xT Ax is continuous, the ﬁeld of values F (A) is compact

and hence the numerical radius is well-deﬁned. Moreover, by Corollary 2.28

we know that in the case of a normal matrix A the numerical radius coincides

with the 2-norm, i.e. ν(A) = A2 . For a general matrix we have λ ∈ F (A) for

every eigenvalue of A because if x ∈ Cn denotes a corresponding, normalised

eigenvector then we obviously have xT Ax = λ. We have the following proper-

ties.

Lemma 6.30 Let A ∈ Cn×n and α ∈ C. Then,

1. F (A + αI) = F (A) + α,

2. F (αA) = αF (A),

3. 21 A2 ≤ ν(A) ≤ A,

4. ν(Am ) ≤ [ν(A)]m , m = 1, 2, . . ..

Proof The ﬁrst two properties follow from

xT (A + αI)x = xT Ax + α, xT αAx = αxT Ax

which hold for all xT x = 1. For the third property we ﬁrst note that

xT Ax ≤ x2 Ax2 ≤ A2 ,

which gives the right inequality. For the left inequality we split A into A =

T T

H + N with H = (A + A )/2 and N = (A − A )/2. Since both matrices are

normal, we have

A2 ≤ H2 + N2 = ν(H) + ν(N)

1 T T

= sup |xT (A + A )x| + sup |xT (A − A )x|

2 x2 =1 x2 =1

1 T T T

≤ 2 sup |x Ax| + sup |x A x|

2 x2 =1 x2 =1

= 2ν(A).

For the last property, ν(Am ) ≤ [ν(A)]m , we have to work a little harder. First

note that because of the second property we have ν(αA) = |α|ν(A). Thus, if we

can show that ν(A) ≤ 1 implies ν(Am ) ≤ 1 for all m ∈ N, then we will have for

an arbitrary A ∈ Cn×n with α = 1/ν(A) that ν(αA) ≤ 1 and hence ν((αA)m ) ≤ 1,

i.e.

1

ν((αA)m ) = αm ν(Am ) = ν(Am ) ≤ 1.

ν(A)m

07

210 Methods for Large Sparse Systems

To see that ν(A) ≤ 1 implies ν(Am ) ≤ 1, let us introduce the roots of unity

ωk := exp(2πik/m) for k ∈ Z. Obviously, there are only m diﬀerent roots of

unity, for example for 1 ≤ k ≤ m. Nonetheless, whenever necessary, we will

use them in this periodic form. In particularly, we will use that ωm = ω0 . In

this sense, the linear polynomial k (z) = 1 − ωk z satisﬁes k (ωm−k ) = 0 for all k

so that we can factorise the polynomial 1 − zm as

m

m

1 − zm = k (z) = (1 − ωk z). (6.19)

k=1 k=1

Next, let πm (C) denote the space of complex polynomials of degree at most m

and let us deﬁne the Lagrange functions L j ∈ πm−1 (C), 1 ≤ j ≤ m via

m

k (z) 1 − ωk z

m

1

m

L j (z) := = = (1 − ωk z).

k=1

k (ωm− j ) k=1 1 − ωk+m− j m k=1

k j k j k j

In the last step, we have used that the denominator can be signiﬁcantly simpli-

ﬁed employing the identities

m

m−1

m−1

(1 − ωk+m− j ) = (1 − ωk ) = lim (1 − ωk z)

z→1

k=1 k=1 k=1

k j

1 m

1 − zm

= lim (1 − ωk z) = lim

z→1 1 − ωm z z→1 1 − z

k=1

= m.

As we well know, these Lagrangian functions form a very speciﬁc basis of the

polynomials of degree less than m as they obviously satisfy L j (ωm−k ) = δ jk for

1 ≤ j, k ≤ m. In particular, we can reconstruct the constant function 1 by its

interpolant

m

1

m m

1= 1 · L j (z) = (1 − ωk z). (6.20)

j=1

m j=1 k=1

k j

B ∈ Cn×n .

For x ∈ Cn with x2 = 1, we deﬁne the vectors

⎡ ⎤

⎢⎢⎢ m ⎥⎥⎥

⎢⎢⎢ ⎥⎥

x j := ⎢⎢⎢ (I − ωk B)⎥⎥⎥⎥⎥ x,

⎢ 1 ≤ j ≤ m,

⎢⎣k=1

⎥⎦

k j

07

6.2 GMRES and MINRES 211

1

m

Bx j xj

x j 2 1 − ω j

2

,

m j=1 x j 2 x j 2 2

⎡

m m

⎤

1 1 ⎢⎢⎢⎢ ⎥⎥

m

= (I − ω j B)x j , x j 2 = ⎢⎣ (I − ωk B)⎥⎥⎥⎦ x, x j

m j=1 m j=1 k=1 2

m

m

1 1

= (I − Bm )x, x j 2 = (I − Bm )x, xj

m j=1 m j=1 2

m m

1

= (I − Bm )x, (I − ωk B) x

m j=1 k=1 2

k j

= (I − Bm )x, x2 = 1 − Bm x, x2 .

we thus have

1 − |Am x, x2 | = 1 − eimθ Am x, x2

1

m

Ax j xj

= x j 2 1 − ω j e

2 iθ

,

m j=1 x j 2 x j 2 2

1

m

Ax j xj

= x j 22 1 − ω j eiθ ,

m j=1 x j 2 x j 2 2

"

"" Ax ""

1

m

" j xj ""

≥ x j 2 1 − " , " ≥0

2

m j=1 " x j 2 x j 2 2 "

since ν(A) ≤ 1 implies that the term in the brackets on the right-hand side of

the last line is non-negative. Thus we have |Am x, x2 | ≤ 1 for all normalised

vectors x ∈ Cn showing ν(Am ) ≤ 1.

Let us see how we can use the concept of the numerical radius to provide

possible convergence estimates for GMRES.

|z − c| ≤ s} and assume that 0 D, i.e. that s < |c|. Then the residuals of the

GMRES method satisfy

j

s

r j 2 ≤ 2 r0 2 .

|c|

07

212 Methods for Large Sparse Systems

⎛ j⎞ j

r j 2 ⎜⎜⎜ 1 ⎟⎟⎟ 1

≤ P(A)2 ≤ 2ν(P(A)) = 2ν ⎜⎝ I − A ⎟⎠ ≤ 2 ν I − A .

r0 2 c c

z := xT Ax satisﬁes |z − c| ≤ s or |1 − z/c| ≤ s/|c|. Hence,

"" "" "" "

1 "" T 1 "" " 1 T "" s

ν I − A = max "x I − A x" = max "1 − x Ax"" ≤ , "

c x2 =1 " c " x2 =1 c |c|

Now, we will have a look at a possible implementation for both GMRES and

MINRES.

We start by deriving an algorithm for GMRES and will then work out how the

algorithm can be changed for MINRES.

The implementation is based upon the idea to ﬁnd an orthonormal basis for

K j (A, r0 ) and then to solve the minimisation problem within this basis.

We can use the Gram–Schmidt orthogonalisation method (see Lemma 1.9) to

compute an orthogonal basis of K j (A, r0 ). Recall that the idea of the Gram–

Schmidt method is to compute orthogonal vectors successively, starting with

a vector v1 ∈ Rn with v1 2 = 1. Next, provided that {v1 , . . . , vk } are already

orthonormal and that v span{v1 , . . . , vk }, the Gram–Schmidt procedure cal-

culates

k

ṽk+1 := v − v, v 2 v ,

=1

vk+1 := ṽk+1 /ṽk+1 2 .

We will use this by setting r0 = b − Ax0 and v1 = r0 /r0 2 and then choose v

as v = Avk . This gives

k

k

ṽk+1 = Avk − Avk , v 2 v = Avk − hk v , (6.21)

=1 =1

vk+1 = ṽk+1 /ṽk+1 2 = ṽk+1 /hk+1,k , (6.22)

This method is called the Arnoldi method and was introduced in [2]. We will

analyse it in this simple form. However, from a numerical point of view, a more

stable version can be derived as follows. Since v1 , . . . , vk are orthogonal, we

07

6.2 GMRES and MINRES 213

k

ṽk+1 = Avk − hk v

=1

k

k

−1

= Avk − Avk , v 2 v − ci vi , v 2 v

=1 =1 i=1

k

−1

= Avk − Avk − ci vi , v v .

=1 i=1 2

more stable than the basic algorithm but also can easily be implemented. This

modiﬁed Arnoldi method is given in Algorithm 23.

Input : A ∈ Rn×n , x0 ∈ Rn .

Output: Orthonormal basis {v1 , . . . , v j } of K j (A, r0 ).

1 r0 := b − Ax0 , v1 := r0 /r0

2 for k = 1, 2, . . . , j − 1 do

3 wk := Avk

4 for = 1 to k do

5 hk := wk , v 2

6 wk := wk − hk v

7 hk+1,k := wk 2

8 if hk+1,k = 0 then break

9 else vk+1 := wk /hk+1,k

does not stop prematurely. Hence, for the time being, let us assume that always

wk = ṽk+1 0.

Lemma 6.32 If the Arnoldi method is well-deﬁned up to step j then v1 , . . . , v j

form an orthonormal basis of K j (A, r0 ).

Proof It is easy to check that {vk } forms an orthonormal set. That the elements

actually form a basis for K j (A, r0 ) can be proven by induction on j. Obviously,

for j = 1 there is nothing to show. For the induction step note that the deﬁnition

gives v j+1 ∈ span{v1 , . . . , v j , Av j } and the induction hypothesis leads to v j+1 ∈

K j+1 (A, r0 ). Equality of the spaces follows on comparing the dimensions.

07

214 Methods for Large Sparse Systems

To further analyse Arnoldi’s method and to see how it can be used for an im-

plementation of GMRES, we stick to the original formulation (6.21), (6.22).

If we want to compute a basis for K j (A, r0 ) we need to compute up to j or-

thogonal vectors. As before, let us deﬁne hk = Avk , v 2 for 1 ≤ ≤ k and

1 ≤ k ≤ j and hk+1,k = ṽk+1 2 for 1 ≤ k ≤ j. Then, using (6.22), equation

(6.21) can be rewritten as

k+1

Avk = hk v . (6.23)

=1

deﬁnes a non-square Hessenberg matrix

⎛ ⎞

⎜⎜⎜h11 h12 · · · · · · h1 j ⎟⎟

⎟⎟

⎜⎜⎜

⎜⎜⎜h21 h22 · · · · · · h2 j ⎟⎟⎟⎟

⎜⎜⎜⎜ . .. ⎟⎟⎟⎟⎟

⎜⎜⎜ h32 . . . ⎟⎟⎟

H j := (hk ) = ⎜⎜⎜ ⎟

.. ⎟⎟⎟⎟ ∈ R

( j+1)× j

.

⎜⎜⎜ .. ..

⎜⎜⎜ . . . ⎟⎟⎟

⎜⎜⎜ ⎟⎟

⎜⎜⎝ h j, j−1 h j j ⎟⎟⎟⎟

⎠

h j+1, j

If we further deﬁne the matrix V j = (v1 , . . . , v j ) ∈ Rn× j then we can use (6.23)

to derive

AV j = Av1 , . . . , Av j

⎛ 2 ⎞

⎜⎜⎜

3

j+1 ⎟⎟⎟

= ⎜⎜⎝⎜ h1 v , h2 v , . . . , h j v ⎟⎟⎟⎠

=1 =1 =1

j+1

= h1 v , h2 v , . . . , h j v ,

=1

has entries

j+1

j+1

(AV j )ik = hk (v )i = hk (V j+1 )i = (V j+1 H j )ik , 1 ≤ i ≤ n, 1 ≤ k ≤ j.

=1 =1

Moreover, if H̃ j ∈ R j× j is the matrix H j without its last row then we can rewrite

this as

j+1

j

(AV j )ik = hk (v )i = hk (v )i + h j+1,k (v j+1 )i

=1 =1

= (V j H̃ j )ik + h j+1,k (v j+1 )i = (V j H̃ j )ik + δk j ṽ j+1 2 (v j+1 )i

= (V j H̃ j )ik + δk j (ṽ j+1 )i = (V j H̃ j )ik + (ṽ j+1 eTj )ik ,

07

6.2 GMRES and MINRES 215

summarise this as

AV j = V j+1 H j = V j H̃ j + ṽ j+1 eTj . (6.24)

Lemma 6.33 Assume that A is invertible. If the Arnoldi method stops with

iteration j, then the solution x∗ of Ax = b is contained in x0 + K j (A, r0 ).

Proof That the method stops with iteration j means that ṽ j+1 = 0. Hence,

equation (6.24) implies AV j = V j H̃ j , which means that the space K j (A, r0 )

is invariant under A. Next, since A is invertible, we also must have that H̃ j is

invertible. For each x ∈ x0 +K j (A, r0 ) there is a vector y ∈ R j with x−x0 = V j y.

Moreover, we have r0 = βv1 = βV j e1 with β = r0 2 . This leads to

b − Ax2 = r0 − A(x − x0 )2 = r0 − AV j y2 = V j (βe1 − H̃ j y)2

= βe1 − H̃ j y2 ,

where we also have used the orthogonality of the columns of V j . Since H̃ j

is invertible we can set y = βH̃ −1 ∗

j e1 , which leads to x = x0 + V j y ∈ x0 +

K j (A, r0 ).

Input : A ∈ Rn×n , b ∈ Rn , x0 ∈ Rn , j ∈ {1, . . . , n}.

Output: Approximate solution of Ax = b.

1 r0 := b − Ax0 , β := r0 2 , v1 := r0 /β

2 for k = 1, 2, . . . , j do

3 Compute vk+1 and hk for 1 ≤ ≤ k + 1 using Arnoldi’s method

4 Compute y j as the solution of βe1 − H j y2

5 x j = x0 + V j y j

The proof of Lemma 6.33 gives a constructive way of computing the iterations

of the GMRES method. Equation (6.24) implies, with β = r0 2 and x = x0 +

V j y ∈ x0 + K j (A, r0 ), as before,

b − Ax = r0 − AV j y = βV j+1 e1 − V j+1 H j y = V j+1 (βe1 − H j y).

Hence, we see that x j = x0 +V j y j , where y j minimises βe1 − H j y2 . This leads

to Algorithm 24.

The GMRES method can stop prematurely. The only possible reason for this

is that the Arnoldi method stops prematurely, i.e. if hk+1,k = 0 for a k < j.

According to Lemma 6.33 this means that x∗ ∈ x0 + Kk (A, r0 ) and hence, by

deﬁnition, xk = x∗ .

07

216 Methods for Large Sparse Systems

Remark 6.34 The solution of the least-squares problem βe1 − H j y2 can be

computed using a QR factorisation, see Section 3.8. This can be done in O( j2 )

time since H j is a Hessenberg matrix, see Corollary 5.38.

though it is not related to solving linear systems but rather to computing the

eigenvalues of the matrix A. Recall from Section 5.5 that it is desirable to

reduce A to a Hessenberg matrix and then to compute the eigenvalues of this

Hessenberg matrix. The following remark is a simple consequence of (6.24).

Remark 6.35 If the Arnoldi method does not stop prematurely then, after n

steps, v1 , . . . , vn form an orthonormal basis of Rn and Vn = (v1 , . . . , vn ) trans-

forms A into the Hessenberg matrix H̃n = VnT AVn having the same eigenvalues

as A.

In the present form, the GMRES algorithm does not compute an approximate

solution in each step, i.e. for each Krylov space K j (A, r0 ). Instead, we ﬁrst

compute the orthonormal basis of K j (A, r0 ) and after that the approximate so-

lution x j . Since one usually does not know beforehand which iteration has to

be computed to achieve a desired accuracy, one could compute iterations al-

ways after a ﬁxed number of steps of the Arnoldi method. Another possibility

is to produce an iteration in each step of the Arnoldi method by computing

the QR factorisation of H j+1 , which is necessary for computing a new iteration

x j+1 , by exploiting the already-computed QR factorisation of H j . Finally, we

can use this repeated updating of the QR factorisation to compute only an ap-

proximate solution if a prescribed accuracy is achieved. We will describe this

now in more detail, covering also the solution of the least-squares problem.

The idea for this comes from the proofs of Corollary 5.38 or Theorem 5.36,

where we have successively removed the sub-diagonal entries using 2 × 2

Householder matrices. Instead of Householder matrices, we can of course also

use Givens rotations. As usual, we describe the idea using full matrices of the

form

⎛ ⎞

⎜⎜⎜Ii−1 ⎟⎟⎟ ⎛ ⎞

⎜⎜⎜ ⎟⎟⎟ ⎜⎜⎜Ii−1 ⎟⎟⎟

⎜⎜⎜ ci si ⎟⎟⎟ ⎜⎜⎜ ⎟⎟⎟⎟ ∈ R( j+1)×( j+1) ,

Gi := ⎜⎜

⎜⎜⎜ ⎟⎟⎟ = ⎜⎜⎜ G̃i ⎟⎠⎟

−si ci ⎟⎟⎠ ⎝

⎝ I j−i

I j−i

where Ii−1 and I j−i are the identities matrices of size (i − 1) × (i − 1) and ( j −

i) × ( j − i), respectively.

As described in the proof of Theorem 5.36, j such orthogonal matrices suﬃce

07

6.2 GMRES and MINRES 217

⎛ ⎞ ⎞ ⎛

⎜⎜⎜h11 h12 ··· ··· · · · r1 j ⎟⎟

h1 j ⎟⎟ ⎜⎜r11 r12 ···

⎜⎜⎜⎜h21 ⎟⎟ ⎟⎟ ⎜⎜

⎜⎜⎜ h22 ··· ··· · · · r2 j ⎟⎟⎟⎟

h2 j ⎟⎟⎟⎟ ⎜⎜⎜⎜ 0 r22 ···

⎜⎜⎜ .. .. ⎟⎟⎟⎟⎟

.. ⎟⎟⎟⎟⎟ ⎜⎜⎜⎜⎜ ..

h32 ⎜⎜ . . ⎟⎟⎟

. ⎟⎟⎟ ⎜⎜⎜ .

G jG j−1 · · · G1 ⎜⎜⎜⎜ .

⎟⎟⎟ .

⎟ ⎜

.. ⎟⎟⎟⎟ = ⎜⎜⎜⎜

..

⎜⎜⎜ .. .. .. ⎟

.

⎜⎜⎜ . . . .. ⎟⎟⎟⎟

. ⎟⎟⎟ ⎜⎜⎜

⎜⎜⎜ ⎟⎟

⎟⎟ ⎜⎜

⎜⎜⎝

h j, j−1 r j j ⎟⎟⎟⎟

h j j ⎟⎟⎟⎟ ⎜⎜⎜⎜

⎠⎠ ⎝

h j+1, j

0

(6.25)

From this, we can conclude two things. First of all, deﬁning QTj := G j · · · G1 ∈

R( j+1)×( j+1) and R j = QTj H j ∈ R( j+1)× j allows us to transform βe1 − H j y2 as

usual. To this end, let us deﬁne q j := QTj e1 ∈ R j+1 and write q j = (q̃ j , q j+1 )T

with q̃ j ∈ R j . Then, we have

= βQTj e1 − R j y22

= βq j − R j y22

= βq̃ j − R̃ j y22 + |βq j+1 |2 ,

the latter expression, is hence given by

y j := βR̃−1

j q̃ j . (6.26)

G̃ j ∈ R2×2 to the corresponding entries of the right-hand side. From this, it also

immediately follows that the previously computed components do not change

from now on. To be more precise, when going over from q j ∈ R j+1 to q j+1 ∈

R j+2 we only have to apply the 2 × 2 matrix G̃ j+1 to the last two components

of (q j , 0)T = (q̃ j , q j+1 , 0)T , i.e. we have

⎛ ⎞ ⎛ ⎞ ⎛ ⎞

⎜⎜⎜ ⎟ ⎜q̃ ⎟ ⎜q̃ ⎟

⎟⎟⎟⎟ ⎜⎜⎜⎜ j ⎟⎟⎟⎟ ⎜⎜⎜⎜ j+1 ⎟⎟⎟⎟

q̃ j

⎜⎜⎜

q j+1 = ⎜⎜⎜ q j+1 ⎟⎟⎟⎟ = ⎜⎜⎜⎜ ∗ ⎟⎟⎟⎟ =: ⎜⎜⎜⎜ ⎟⎟⎟ .

⎟⎠ (6.27)

⎝G̃ j+1 ⎠ ⎝ ⎠ ⎝

0 ∗ q j+2

Next, assume that we have already computed the rotations G1 , . . . , G j and ap-

plied them to H j . Since H j is the upper left sub-matrix of H j+1 , this means that

we only have to apply G1 , . . . , G j (after adjusting their dimensions) to the last

column of H j+1 , to obtain

07

218 Methods for Large Sparse Systems

⎛ ⎞

⎜⎜⎜r11 r12 ··· · · · r1 j r1, j+1 ⎟⎟

⎜⎜⎜⎜ 0 ⎟⎟

⎜⎜⎜ r22 ··· · · · r2 j r2, j+1 ⎟⎟⎟⎟

⎟

⎜⎜⎜ .. .. .. ⎟⎟⎟⎟

⎜⎜⎜ . . . ⎟⎟⎟

⎜⎜ ⎟

G jG j−1 · · · G1 H j+1 = ⎜⎜⎜⎜ .. .. .. .. ⎟⎟⎟⎟ . (6.28)

⎜⎜⎜ . . . . ⎟⎟⎟⎟

⎜⎜⎜ ⎟

⎜⎜⎜ rjj r j, j+1 ⎟⎟⎟⎟

⎜⎜⎜ ⎟⎟

⎜⎜⎝ 0 d ⎟⎟⎟⎟

⎠

0 h

Here, the entry h in the last row and column is simply given by h = h j+2, j+1 .

Thus, besides these applications to the last column, we only need one addi-

tional rotation G j+1 to eliminate this entry. This can be done by choosing

√

c j+1 = |d|/ d2 + h2 , s j+1 = c j+1 h/d

if d 0 and c j+1 = 0, s j+1 = 1 if d = 0. Taking this all into account, we can

state the ﬁnal version of the GMRES algorithm. In principle, after initialisa-

tion, which also contains the initialisation of ξ = βe1 , the method consists in

each iteration of the following steps.

1. Compute vk+1 and hk using Arnoldi’s method.

2. Apply the rotations G1 , . . . , Gk−1 to the last column of H.

3. Compute the kth rotation to eliminate hk+1,k .

4. Apply the kth rotation to ξ and to the last column of H.

The iteration stops if the residual norm estimate |ξk+1 | becomes suﬃciently

small. Algorithm 25 contains all these steps in detail.

Even in this form, the GMRES method suﬀers from two problems. First of

all, the computational cost increases from iteration to iteration since the cost

is still O(k2 ) in the kth iteration. Moreover, with increasing k, the memory

requirement grows. In each step we have to store the vectors v1 , . . . , vk ∈ Rn .

For this alone we need O(nk) space and that is even the case if the original

matrix A was sparse.

An often-suggested remedy to this problem is the restarted GMRES method,

which stops after a ﬁxed number of orthogonalisation steps, computes an ap-

proximate solution and then restarts the algorithm using this approximate so-

lution as the new initialisation, see Algorithm 26.

As for the CG method it is easy to see that the iterations of the GMRES (and

also MINRES) method yield a decreasing sequence of residuals {rk }, this time,

however, in the 2 -norm. This remains obviously true for the restarted GMRES

method. Unfortunately, while we have (at least theoretical) convergence after

07

6.2 GMRES and MINRES 219

Input : A ∈ Rn×n , b ∈ Rn , x0 ∈ Rn , > 0.

Output: Approximate solution of Ax = b.

1 r0 := b − Ax0 , β := r0 2 , v1 := r0 /β, ξ = (β, 0, . . . , 0)T

2 for k = 1, 2, . . . , do

3 wk := Avk

4 for = 1 to k do

5 hk := wk , v 2

6 wk := wk − hk v

7 hk+1,k := wk 2

8 for i = 1, . . . , k − 1 do

hik ci si hik

9 :=

hi+1,k −si ci hi+1,k

10 γ := (h2kk + h2k+1,k )1/2 , ck := hkk /γ, sk := ck hk+1,k /hkk

ξk c k sk ξ k

11 :=

ξk+1 −sk ck 0

12 hkk := γ, hk+1,k := 0

13 if |ξk+1 | < then

14 for = k, k − 1, . . . , 1 do

15 y := h1 ξ − ki=+1 hi yi

16 xk := x0 + ki=1 yk vk

17 break

18 else

19 vk+1 := wk /hk+1,k

n steps for GMRES, this may not be true for the restarted GMRES. As a mat-

ter of fact, it can happen that the sequence {rk 2 } “levels out”, i.e. becomes

stationary with k → ∞. However, in the case of a positive deﬁnite but not

necessarily symmetric matrix A, the following result shows that the restarted

GMRES method converges. The matrix A is positive deﬁnite if xT Ax > 0 for

all x 0. Since xT Ax = xT AT x, we have that the smallest eigenvalue μ of the

symmetric matrix (A + AT )/2 is positive by Theorem 5.4:

1 xT (AT + A)x xT Ax

μ := min T

= min T > 0.

x0 2 x x x0 x x

07

220 Methods for Large Sparse Systems

Input : A ∈ Rn×n , b ∈ Rn , x0 ∈ Rn , j ∈ N, > 0.

Output: Approximate solution of Ax = b.

1 r0 := b − Ax0 , β := r0 2

2 while β > do

3 Compute the GMRES approximation x j using Algorithm 24

4 Set x0 := x j , r0 := b − Ax0 , β := r0 2

1 T

μ= x (A + AT )x0 = xT0 Ax0 ≤ Ax0 2 ≤ A2 .

2 0

Proposition 6.36 Let A ∈ Rn×n be positive deﬁnite. Let x j ∈ x0 + K j (A, r0 )

be the jth iteration of the GMRES method. Then, the residual r j = b − Ax j

satisﬁes

j/2

μ2

r j 2 ≤ 1 − 2 r0 2 ,

σ

P∈π j

P(0)=1

with P(0) = 1 such that P(A)2 = p j (A)2 ≤ p(A)2j gives an upper bound

on the minimum above. Next, we want to choose α in the deﬁnition of p to

have a bound on p(A)2 . To this end, note that, for a ﬁxed x ∈ Rn \ {0},

becomes minimal for α = xT Ax/Ax22 , see also the second part of Lemma

5.16 for a similar argument, with value

T 2 ⎡ ⎛ T ⎞2 ⎤

x Ax 2⎢

⎢⎢ ⎜⎜⎜ x Ax ⎟⎟⎟ x2 2 ⎥⎥⎥

p(A)x2 = x2 −

2 2 ⎢

= x2 ⎣⎢1 − ⎜⎝ ⎟⎠ ⎥⎥

Ax2 x22 Ax2 ⎦

μ2

≤ x22 1 − 2 ,

σ

07

6.2 GMRES and MINRES 221

where we have used that xT Ax/x22 ≥ μ and x2 /Ax2 ≥ 1/σ. This yields

μ2

p(A)22 ≤ 1 −

σ2

and hence the overall statement.

As mentioned above, this means that the restarted GMRES method converges.

method converges for every j ≥ 1.

we will discuss later on, and by replacing the Gram–Schmidt process by a more

stable Householder variant. Details can, for example, be found in Saad [104].

To see how things change when going from GMRES to MINRES, we ﬁrst note

that in the case of a symmetric matrix A, most entries computed in the Arnoldi

method are actually zero.

Lemma 6.38 Let A ∈ Rn×n be symmetric. Let {v } be the sequence of vectors

computed by the Arnoldi method and let hk = Avk , v 2 for 1 ≤ ≤ k and

hk+1,k = ṽk+1 2 . Then,

hk = 0 1 ≤ < k − 1,

hk,k+1 = hk+1,k .

Proof From the fact that span{v1 , . . . , v } = K (A, r0 ) we see that Av ∈

K+1 (A, r0 ) = span{v1 , . . . , v+1 }. Thus, using the symmetry of A yields

for all + 1 < k. For the second property note that the orthogonality of ṽk+1 to

v1 , . . . , vk gives

1

hk,k+1 = Avk+1 , vk 2 = vk+1 , Avk 2 = ṽk+1 , Avk 2

ṽk+1 2

k

1

= ṽk+1 , Avk − Avk , v 2 v = ṽk+1 2

ṽk+1 2 =1 2

= hk+1,k .

tridiagonal matrix with diagonal entries αk := hkk and sub- and super-diagonal

07

222 Methods for Large Sparse Systems

⎛ ⎞

⎜⎜⎜α1 β1 ⎟⎟⎟

⎜⎜⎜ .. .. ⎟⎟⎟

⎜⎜⎜ β . . ⎟⎟⎟

H̃ j = ⎜⎜⎜⎜⎜ 1 ⎟⎟⎟ .

⎟⎟⎟

⎜⎜⎜ .. ..

⎜⎜⎝ . . β j−1 ⎟⎟⎟⎟

⎠

β j−1 αj

Taking this new notation into account, we see that the Arnoldi algorithm, Algo-

rithm 23, reduces to the algorithm given in Algorithm 27, which is the Lanczos

orthogonalisation method for symmetric matrices, see [89, 90].

Input : A = AT ∈ Rn×n , x0 ∈ Rn , j ∈ {1, . . . , n}.

Output: Orthonormal basis {v1 , . . . , v j } of K j (A, r0 ).

1 r0 := b − Ax0 , v0 := 0, v1 := r0 /r0 2 , β0 := 0

2 for k = 1, 2, . . . , j − 1 do

3 wk := Avk − βk−1 vk−1

4 αk := wk , vk 2

5 wk := wk − αk vk

6 βk := wk 2

7 if βk = 0 then break

8 else vk+1 := wk /βk

method. We will not do this here, but later on, when we discuss biorthogo-

nalisation methods, we will derive a method similar to the CG method from a

two-sided or biorthogonal Lanczos method.

As in the non-symmetric case, we now also have an alternative method for

reducing A to a tridiagonal matrix having the same eigenvalues as A.

Remark 6.39 If the Lanczos method does not stop prematurely, the orthog-

onal matrix V = (v1 , . . . , vn ) transforms the symmetric matrix A into the tridi-

agonal matrix VnT AVn = H̃n having the same eigenvalues as A.

As in the case of GMRES, we use Givens rotations to transform the matrix H j

to an upper triangular matrix R j . However, by Corollary 5.37, we know that

row i of R has at most non-zero entries rii , ri,i+1 and ri,i+2 . Hence, applying the

07

6.2 GMRES and MINRES 223

⎛ ⎞

⎜⎜⎜r11 r12 r13 0 · · · 0 ⎟⎟

⎜⎜⎜ ⎟

⎜⎜⎜ .. .. .. .. .. ⎟⎟⎟⎟

⎜⎜⎜ . . . . . ⎟⎟⎟⎟

⎜⎜⎜ .. .. .. ⎟⎟⎟

⎜⎜⎜

⎜⎜

. . . 0 ⎟⎟⎟⎟⎟

⎟⎟⎟

G j−1 · · · G1 H j−1 = ⎜⎜⎜⎜⎜ .. ..

. . r j−3, j−1 ⎟⎟⎟⎟⎟ .

⎜⎜⎜

⎜⎜⎜ .. ⎟⎟⎟⎟

⎜⎜⎜ . r j−2, j−1 ⎟⎟⎟⎟

⎜⎜⎜ ⎟⎟

⎜⎜⎜ r j−1, j−1 ⎟⎟⎟⎟

⎜⎝ ⎠

0

The next step is to carry this over to H j . This means we have to add the jth

column and and ( j + 1)st row and then apply G j−1 · · · G1 to the jth column.

However, since the jthe column of H j has only the entries β j−1 , α j and β j in

rows j − 1, j and j + 1, respectively, and since each Gi acts only on rows i and

i + 1, we see that we only have to apply G j−2 and then G j−1 to the new jth

column. This essentially means that we have to form

r j−2, j c j−2 s j−2 0

= ,

r̃ j−1, j −s j−2 c j−2 β j−1

r j−1, j c j−1 s j−1 r̃ j−1, j

= ,

r̃ j j −s j−1 c j−1 αj

which leads to the representation

⎛ ⎞

⎜⎜⎜r11 r12 r13 0 ··· 0 0 ⎟⎟⎟

⎜⎜⎜⎜ .. .. .. .. .. .. ⎟⎟⎟

⎟⎟⎟

⎜⎜⎜ . . . . . . ⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟

⎜⎜⎜ .. .. .. .. ⎟⎟⎟

⎜⎜⎜ . . . 0 . ⎟⎟⎟

⎜⎜⎜ ⎟⎟

⎜ .. ..

G j−1 · · · G1 H j = ⎜⎜⎜⎜ . . r j−3, j−1 0 ⎟⎟⎟⎟ .

⎜⎜⎜ ⎟⎟⎟

⎜⎜⎜ .. ⎟

⎜⎜⎜ . r j−2, j−1 r j−2, j ⎟⎟⎟⎟

⎜⎜⎜ ⎟⎟

⎜⎜⎜ r j−1, j−1 r j−1, j ⎟⎟⎟⎟

⎜⎜⎜ ⎟

⎜⎜⎝ 0 r̃ j j ⎟⎟⎟⎟⎟

⎠

0 βj

Then, we have to determine a ﬁnal rotation to remove the β j from the last

column and row, i.e. we determine c j and s j such that

c j s j r̃ j j r

= jj .

−s j c j β j 0

07

224 Methods for Large Sparse Systems

see also the discussion about numerical stability in this context just before

Deﬁnition 5.23. We will see that we do not even have to save the numbers

r j−2, j , r j−1, j and r j j since we do not need them in the next iteration. Hence, we

can compute all relevant quantities by

= −s j−1 c j−2 β j−1 + c j−1 α j ,

γ1 := γ02 + β2j ,

γ2 := r j−1, j = c j−1 r̃ j−1, j + s j−1 α j

= s j−1 α j + c j−1 c j−2 β j−1 ,

γ3 := r j−2, j = s j−2 β j−1 ,

c j := γ0 /γ1 , s j := β j /γ1 ,

As in the case of GMRES, we can write the iteration xk = x0 + Vk y, where

y ∈ Rk minimises βe1 − Hk y2 . However, for MINRES it is not necessary

to save the orthonormal basis, i.e. the matrix Vk . Thus we need a diﬀerent

approach to actually compute the iteration xk . According to (6.26), we can

express the iteration as

xk = x0 + βVk R̃−1

k q̃k = x0 + βPk q̃k

if we deﬁne the matrix Pk := Vk R̃−1 k , where R̃k is again the upper k × k block

of the matrix Rk ∈ R (k+1)×k

, which is the upper triangular matrix from the QR

factorisation of Hk = Qk Rk . Since the upper k × k block H̃k of Hk is now a

tridiagonal matrix, the only entries ri j of Rk in row i which might be non-zero

are rii , ri,i+1 and ri,i+2 . We also know from our previous considerations that

when going over from Rk ∈ R(k+1)×k to Rk+1 ∈ R(k+2)×(k+1) we keep Rk as the

upper left part of Rk+1 . This altogether means that we can compute the columns

of Pk ∈ Rn×k iteratively. If we set Pk = (p1 , . . . , pk ) and Vk = (v1 , . . . , vk ) then

the relation Pk R̃k = Vk yields

and then

Noting that r j j 0 as long as the Lanczos method does not break down, this

07

6.2 GMRES and MINRES 225

1

pj = (v j − r j−1, j p j−1 − r j−2, j p j−2 ) (6.29)

rjj

1

= (v j − γ2 p j−1 − γ3 p j−2 ), 1 ≤ j ≤ k,

γ1

where we have set p0 = p−1 = 0 and recall that the constants γ1 , γ2 , γ3 have to

be computed in each iteration, i.e. for each j.

A consequence of these considerations is that we can update the iterations in

the following form. We have

k

k−1

xk = x0 + βPk q̃k = x0 + β q j p j = x0 + β q j p j + βqk pk

j=1 j=1

= xk−1 + βqk pk . (6.30)

From this, we can also conclude that we do not need the whole vector q̃k

but only the scalar δ := βqk to update the iteration. Since βqk = βQTk e1 =

Gk · · · G1 (βe1 ) we compute with each application of a G j two new entries q j

and q̃ j+1 of which the ﬁrst one is not changed by the following updates while

the latter one is changed only by the next update. This can be expressed by

qj q̃ j c j s j q̃ j c j q̃ j

= G̃ j = = ,

q̃ j+1 0 −s j c j 0 −s j q̃ j

which is started with q̃1 = β, which means that we are “absorbing” β into

qk . Hence, if we have computed δ := q̃k , then we need qk = ck δ to update

xk = xk−1 +ck δpk and we need to replace δ by −sk δ to prepare the next iteration.

This altogether shows that we can implement MINRES as given in Algorithm

28.

Let us make a few more remarks. First of all, as we know, the residual of the

kth iteration satisﬁes

b − Axk 22 = βe1 − H̃k yk 22 = βq̃k − R̃k yk 22 + |βqk+1 |2 = |βqk+1 |2

or, if we again absorb β into qk as described above then we simply have b −

Axk 2 = |qk+1 |. The algorithm described above shows that |qk+1 | = |ck+1 q̃k+1 | =

|ck+1 sk q̃k | is at hand at any time during the algorithm without any additional

cost, which can be used as a stopping criterion.

There is an alternative to MINRES in the case of symmetric matrices, which we

only want to mention here. The method is called SYMMLQ . Here the iteration

xk ∈ x0 + AKk (A, r0 ) is computed as the minimiser of x∗ − x2 over x0 +

AKk (A, r0 ), where x∗ is again the solution to Ax = b. Details can be found in

Paige and Saunders [99].

07

226 Methods for Large Sparse Systems

Input : A ∈ Rn×n symmetric, b ∈ Rn , x0 ∈ Rn .

Output: Approximate solution of Ax = b.

1 β0 := 0, s−1 = s0 = 0, c−1 = c0 = 1

2 p−1 := p0 := v0 = 0

3 r0 := b − Ax0 , δ := r0 2 , v1 := r0 /δ

4 for k = 1, 2, . . . , do

5 wk := Avk − βk−1 vk−1

6 αk := wk , vk 2

7 wk := wk − αk vk

8 βk := wk 2

9 if βk = 0 then stop

10 γ0 := ck−1 αk − sk−1 ck−2 βk−1

11 γ1 := γ02 + β2k

12 γ2 := sk−1 αk + ck−1 ck−2 βk−1

13 γ3 := sk−2 βk−1

14 ck := γ0 /γ1 sk := βk /γ1

15 pk := [vk − γ2 pk−1 − γ3 pk−2 ]/γ1

16 xk := xk−1 + ck δpk

17 δ := −sk δ

18 vk+1 := wk /βk

For symmetric, sparse matrices A ∈ Rn×n , the Krylov subspace method of

choice would be MINRES if the matrix is indeﬁnite and either CG or MINRES

if the matrix is deﬁnite.

For non-symmetric matrices, GMRES often works well but has the disadvan-

tage of an increasing memory and time requirement. Moreover, though the

restarted GMRES avoids this increase in required resources, it might not con-

verge but only level out. Hence, in the case of non-symmetric matrices it makes

sense to look for alternative Krylov subspace methods.

Recall that both MINRES and GMRES were based upon an orthogonalisa-

tion of the Krylov space Kk (A, r0 ). In the case of MINRES this was given by

Lanczos’ method, which essentially boils down to the three-term recursion

v j+1 = ṽ j+1 /ṽ j+1 2

07

6.3 Biorthogonalisation Methods 227

Krylov space Kk (A, r0 ). In the case of GMRES the construction of an orthonor-

mal basis v1 , . . . , vk of Kk (A, r0 ) was based upon Arnoldi’s method and could

not be reduced to such a simple three-term recursion.

This naturally raises the question of whether it is in general possible to ﬁnd

a short recurrence relation for computing an orthonormal basis of the Krylov

space of an arbitrary matrix A ∈ Rn×n . The answer, in general, is negative as

shown in [49] by Faber and Manteuﬀel.

Hence, to have a memory-eﬃcient orthogonalisation procedure also in the case

of a non-symmetric matrix, we will now relax the idea of orthonormality.

Instead of having a set of orthonormal basis vectors v1 , . . . , vk we will con-

struct two sets v1 , . . . , vk and w1 , . . . , wk which are biorthogonal, i.e. satisfy

vi , w j 2 = 0 for all i j and which are bases for

Kk (AT , r̂0 ) = span{r̂0 , AT r̂0 , . . . , (AT )k−1 r̂0 },

respectively.

The corresponding procedure is the Lanczos biorthogonalisation method, also

called the two-sided Lanczos method, and is given in detail in Algorithm 29.

Input : A ∈ Rn×n , r0 , r̂0 ∈ Rn with r0 , r̂0 2 0.

Output: Biorthogonal bases of Kk (A, r0 ) and Kk (A, r̂0 ).

1 v1 = r0 /r0 2 , w1 = r̂0 /r̂0 , v1 2 , β0 := γ0 := 0, v0 = w0 = 0.

2 for j = 1, 2, . . . , k do

3 α j := Av j , w j 2

4 ṽ j+1 := Av j − α j v j − β j−1 v j−1

5 w̃ j+1 := AT w j − α j w j − γ j−1 w j−1

6 γ j := ṽ j+1 2

7 v j+1 := ṽ j+1 /γ j

8 β j := v j+1 , w̃ j+1 2

9 w j+1 := w̃ j+1 /β j

products per iteration, one with A and one with AT . It does indeed compute

biorthogonal vectors, as long as it does not stop prematurely. This and further

properties are given in the next proposition, which will help us to deﬁne Krylov

subspace methods based upon this biorthogonalisation technique.

07

228 Methods for Large Sparse Systems

Proposition 6.40 Assume that the two-sided Lanczos method does not termi-

nate prematurely. Then, we have v j 2 = 1 and v j , w j 2 = 1 for 1 ≤ j ≤ k

and

vi , w j 2 = 0, 1 ≤ i j ≤ k. (6.31)

Furthermore, we have

A Wk =

T

Wk T kT + βk wk+1 eTk , (6.33)

WkT AVk = Tk , (6.34)

⎛ ⎞

⎜⎜⎜α1 β1 ⎟⎟⎟

⎜⎜⎜⎜ γ α2 β2

⎟⎟⎟

⎟⎟⎟

⎜⎜⎜ 1 ⎟⎟⎟

⎜⎜ .. .. ..

T k := ⎜⎜⎜⎜⎜ . . . ⎟⎟⎟ .

⎟⎟⎟ (6.35)

⎜⎜⎜ .. .. ⎟⎟

⎜⎜⎜ . . βk−1 ⎟⎟⎟⎟⎟

⎜⎜⎝ ⎠

γk−1 αk

have v1 2 = 1 and v1 , w1 2 = v1 , r̂0 2 /v1 , r̂0 2 = 1 by deﬁnition and there is

nothing to show for (6.31). Let us now assume that we have biorthogonal vec-

tors v1 , . . . , vk and w1 , . . . , wk normalised in the stated way. Then, obviously,

we also have vk+1 2 = 1 and vk+1 , wk+1 2 = vk+1 , w̃k+1 2 /βk = 1. Hence, it

remains to show that vk+1 is orthogonal to w1 , . . . , wk and that wk+1 is orthog-

onal to v1 , . . . , vk . Since these two claims are veriﬁed in very similar ways, we

will show the reasoning just for the ﬁrst property. Here, we start with

1 & '

vk+1 , wk 2 = Avk , wk 2 − αk vk , wk 2 − βk−1 vk−1 , wk 2

γk

1

= [Avk , wk 2 − Avk , wk 2 vk , wk 2 ] = 0,

γk

using the orthogonality of vk−1 to wk , the deﬁnition of αk and the fact that

vk , wk 2 = 1.

07

6.3 Biorthogonalisation Methods 229

1 !

vk+1 , w j 2 = Avk , w j 2 − αk vk , w j 2 − βk−1 vk−1 , w j 2

γk

1 !

= vk , AT w j 2 − βk−1 vk−1 , w j 2

γk

1 !

= vk , β j w j+1 + α j w j + γ j−1 w j−1 2 − βk−1 vk−1 , w j 2

γk

1 !

= β j vk , w j+1 2 − βk−1 vk−1 , w j 2 ,

γk

using the orthogonality relation that we have by the induction hypothesis and

the deﬁnition of w j+1 . Now, if even j < k − 1 then the last two remaining terms

on the right-hand side also vanish, which shows orthogonality in this case. If,

however, j = k − 1, this becomes

1 & '

vk+1 , w j 2 = βk−1 vk , wk 2 − βk−1 vk−1 , wk−1 2 = 0

γk

since vk , wk 2 = vk−1 , wk−1 2 = 1. This establishes the biorthogonality rela-

tion (6.31). Equation (6.32) follows from

AVk = Av1 , . . . , Av j , . . . , Avk

= α1 v1 + γ1 v2 , . . . , β j−1 v j−1 + α j v j + γ j v j+1 , . . . , βk−1 vk−1 + αk vk + γk vk+1

= Vk T k + γk vk+1 eTk ,

and equation (6.33) can be proven in the same way. Finally, equation (6.34)

follows from (6.32) and the fact that WkT Vk = I and WkT vk+1 = 0.

To simplify the presentation of the algorithm, we have left out any test for

division by zero. However, it can happen that either γ j or β j become zero and

in this case the corresponding vectors v j+1 or w j+1 become undeﬁned and the

algorithm in this form will stop prematurely.

Such a premature ending can happen for various reasons. First of all, we could

have that either ṽ j+1 = 0 or w̃ j+1 = 0. These cases are usually referred to as

regular termination. The Lanczos process has found an invariant subspace. If

ṽ j+1 = 0 then the vectors v1 , . . . , v j , which have been created so far, form an

A-invariant space. If w̃ j+1 = 0 then the vectors w1 , . . . , w j form an AT -invariant

space. More problematic is the situation when both vectors ṽ j+1 and w̃ j+1 are

diﬀerent from the zero vector but are orthogonal, i.e. satisfy ṽ j+1 , w̃ j+1 2 = 0.

This situation is called a serious breakdown. It means that there are no vec-

tors v j+1 ∈ K j+1 (A, r0 ) and w j+1 ∈ K j+1 (AT , r̂0 ) with v j+1 , wi 2 = 0 and

vi , w j+1 2 = 0 for all 1 ≤ i ≤ j. However, this does not necessarily mean that

07

230 Methods for Large Sparse Systems

later on, say for k > j + 1, there are such vectors vk ∈ Kk (A, r0 ) orthogonal

to Kk−1 (AT , r̂0 ) and wk ∈ Kk (AT , r̂0 ) orthogonal to Kk−1 (A, r0 ). In this case,

a remedy to the problem is to simply skip those steps in the algorithm which

lead to undeﬁned Lanczos vectors and to continue with those steps, where

these vectors are well-deﬁned. Such a method is called a look-ahead Lanczos

method.

After having a procedure for computing the biorthogonal bases for the Krylov

spaces Kk (A, r0 ) and Kk (AT , r̂0 ) we have to address the question of how to

compute a possible approximation xk ∈ Kk (A, r0 ). Naturally, we will write

xk = x0 + Vk yk

and could, for example, determine yk such that the residual rk = b − Axk =

r0 − AVk yk is orthogonal to Kk (AT , r̂0 ), i.e.

0 = WkT rk = WkT r0 − WkT AVk yk = r0 2 e1 − T k yk ,

where we have also used that v1 = r0 /r0 2 and v1 , w1 2 = 1. Thus, if the

tridiagonal matrix T k is invertible, we have a unique solution yk ∈ Rk , which

then deﬁnes an iteration xk ∈ Rn .

Deﬁnition 6.41 Under the assumption that the Lanczos biorthogonalisation

process is deﬁned and the tridiagonal matrix T k is invertible, the biconjugate

gradient (BiCG) method computes the iteration xk ∈ x0 + Kk (A, r0 ) as

xk = x0 + Vk T k−1 (βe1 ),

where β = r0 2 .

Note, however, that we have encountered another possible situation, where the

iteration xk might not be well-deﬁned, namely if the matrix T k is not invertible.

This situation is diﬀerent from the previously described breakdowns of the bi-

Lanczos process.

To derive an algorithmic description of the BiCG method, we will now derive

characteristic properties, which can then be used for stating the algorithm.

Let us assume that the tridiagonal matrix T k ∈ Rk×k from (6.35) has an LU-

factorisation T k = Lk Uk . According to Theorem 3.16 the factors have the form

⎛ ⎞ ⎛ ⎞

⎜⎜⎜ 1 0⎟⎟⎟ ⎜⎜⎜u1 β1 0 ⎟⎟

⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟ ⎜

⎜⎜⎜ .. ⎟⎟⎟

⎜⎜⎜2 1 ⎟⎟⎟ ⎜⎜⎜ u . ⎟⎟⎟

Lk = ⎜⎜⎜ ⎜ .. .. ⎟

⎟⎟⎟⎟ , Uk = ⎜⎜⎜ ⎜ 2

⎟⎟ ,

⎜⎜⎜ . . ⎟⎟⎟ ⎜

⎜⎜⎜ . . . β ⎟⎟⎟⎟

⎜⎝ ⎠ ⎜⎝ k−1 ⎟⎟⎠

0 k 1 0 uk

where the components ui and i are deﬁned recursively as u1 = α1 and i =

07

6.3 Biorthogonalisation Methods 231

the columns of Pk as Pk = (p0 , . . . , pk−1 ), we can rewrite the iteration xk in an

update form as

k−1

= x0 + Pk ak = x0 + a jp j

j=0

= xk−1 + ak−1 pk−1

with ak = (a0 , . . . , ak−1 )T := Lk−1 (βe1 ). For the residual rk = b − Axk we have

on the one hand

and on the other hand, using (6.32) and the fact that eTk T k−1 e1 ∈ R,

= r0 − r0 2 v1 − δk vk+1 = δk vk+1

with a certain constant δk ∈ R. This means that the residual rk is always parallel

to vk+1 and hence orthogonal to w1 , . . . , wk . Next, from Pk = Vk Uk−1 or Pk Uk =

Vk we ﬁnd vk = βk−1 pk−2 + uk pk−1 and hence

1 1 1

pk−1 = (vk − βk−1 pk−2 ) = rk−1 − βk−1 pk−2 , (6.36)

uk uk δk−1

i.e. we can reconstruct pk recursively from pk−1 and rk . Finally, if we deﬁne

P̂k = (p̂0 , . . . , p̂k−1 ) := Wk Lk−T , then we see that

Finally, the vectors generated by the biorthogonal Lanczos method can also be

used to solve the adjoint linear system AT x = b̂, if r̂0 is deﬁned as r̂0 := b̂−Ax̂0 .

As a matter of fact, we can deﬁne a sequence of approximations and residuals

x̂k := x̂0 + Wk zk ,

r̂k := b̂ − AT x̂k = r̂0 − AT Wk zk ,

i.e. zk = T k−T (β̂e1 ) with β̂ := r̂0 , v1 2 . Using this and the above deﬁnition

07

232 Methods for Large Sparse Systems

x̂k = x̂0 + Wk T k−T (β̂e1 ) = x̂0 + Wk Lk−T Uk−T (β̂e1 )

= x̂0 + P̂k âk = x̂k−1 + âk−1 p̂k−1 ,

where we have set âk = (â0 , . . . , âk−1 )T := Uk−T (β̂e1 ).

Using (6.33) and the deﬁnition of w1 gives again that r̂k and wk+1 are parallel:

r̂k = r̂0 − AT Wk zk = r̂0 − Wk T kT + βk wk+1 eTk zk

= r̂0 − Wk T kT + βk wk+1 eTk T k−T (β̂e1 )

= r̂0 − Wk (β̂e1 ) − βk β̂(eTk T k−T e1 )wk+1

=: δ̂k wk+1 .

From the relation P̂k LkT = Wk we can also conclude that wk = k p̂k−2 + p̂k−1 or

1

p̂k−1 = wk − k p̂k−2 = r̂k−1 − k p̂k−2 . (6.37)

δ̂k−1

So far, we have update formulas for all vectors. However, we can simplify the

update formulas for the “search directions” pk and p̂k if we rescale them. To

be more precise, if we replace pk by uk+1 δk pk , then (6.36) shows that the new

pk satisﬁes a recursion of the form pk−1 = rk−1 − β̃k−1 pk−2 . In a similar way, we

can rescale p̂k such that the coeﬃcient in front of r̂k−1 in the update formula

(6.37) can be chosen to be one. Of course, this also changes the coeﬃcients

in the other update formulas and the normalisation, but the general form and

the biorthogonality will not be modiﬁed. In conclusion we have the following

result.

Proposition 6.42 If the Lanczos biorthogonalisation process does not stop

prematurely and if the matrix T k is invertible, the process generates sequences

of vectors satisfying the relations

x j+1 = x j + α j p j , x̂ j+1 = x̂ j + α̂ j p̂ j ,

r j+1 = r j − α j Ap j , r̂ j+1 = r̂ j − α̂ j AT p̂ j ,

p j+1 = r j+1 + β j+1 p j , p̂ j+1 = r̂ j+1 + β̂ j+1 p̂ j

with certain constants α j , β j+1 , α̂ j , β̂ j+1 ∈ R. Moreover, we have the biorthogo-

nalities

r j , r̂i 2 = 0, Ap j , p̂i 2 = 0, i j. (6.38)

Finally, we have the alternative bases

K j (A, r0 ) = span{r0 , . . . , r j−1 } = span{p0 , . . . , p j−1 },

K j (AT , r̂0 ) = span{r̂0 , . . . , r̂ j−1 } = span{p̂0 , . . . , p̂ j−1 }.

07

6.3 Biorthogonalisation Methods 233

Proof The recursion formulas and the biorthogonalities (6.38) follow directly

from the derivation above. Hence, the only things that remain to be proven are

the identities regarding the Krylov spaces at the end of the proposition. This,

however, follows from the recurrence relations for r j+1 and p j+1 on the one

hand and for r̂ j+1 and p̂ j+1 on the other hand, once we have established that

p0 and r0 (and p̂0 and r̂0 ) are parallel. This can be seen as follows. Taking the

above-mentioned rescaling into account, p0 is given by

p0 = u1 δ0 Vk Uk−1 e1 .

Since Uk is an upper triangular matrix, so is Uk−1 and we can conclude that

Uk−1 e1 = (1/u1 )e1 and hence that p0 = δ0 Vk e1 = δ0 v1 = r0 .

We can also start with the recursion formulas above and use the biorthogonal-

ities (6.38) to determine the coeﬃcients in Proposition 6.42. First of all, from

p̂ j = r̂ j + β̂ j p̂ j−1 we can conclude that

Ap j , r̂ j 2 = Ap j , p̂ j − β̂ j p̂ j−1 2 = Ap j , p̂ j 2 .

Then, from r j+1 = r j − α j Ap j it follows that

0 = r j+1 , r̂ j 2 = r j , r̂ j 2 − α j Ap j , r̂ j 2 ,

i.e. the coeﬃcient α j must have the form

r j , r̂ j 2 r j , r̂ j 2

αj = = .

Ap j , r̂ j 2 Ap j , p̂ j 2

In the same way we can conclude that α̂ j must have the form

r j , r̂ j 2 r j , r̂ j 2 r j , r̂ j 2

α̂ j = = = = α j.

r j , AT p̂ j 2 p j , AT p̂ j 2 Ap j , p̂ j 2

From p j+1 = r j+1 + β j+1 p j we have

0 = p j+1 , AT p̂ j 2 = r j+1 , AT p̂ j 2 + β j+1 p j , AT p̂ j 2

which resolves to

r j+1 , AT p̂ j 2

β j+1 = − .

p j , AT p̂ j 2

Taking α j = α̂ j and AT p̂ j = −(r̂ j+1 − r̂ j )/α̂ j into account, we can simplify the

formula for β j+1 to

r j+1 , AT p̂ j 2 1 r j+1 , r̂ j+1 − r̂ j 2 r j+1 , r̂ j+1 2

β j+1 = − = = .

p j , A p̂ j 2

T α j p j , AT p̂ j 2 r j , r̂ j 2

The symmetry of the argument also shows that we once again have β̂ j+1 = β j+1 .

07

234 Methods for Large Sparse Systems

Thus, we have determined all the coeﬃcients and the ﬁnal algorithm for the

BiCG method is stated in Algorithm 30.

Input : A ∈ Rn×n , b ∈ Rn .

Output: Approximate solution of Ax = b.

1 Choose x0 ∈ Rn .

2 Set r0 := p0 = b − Ax0 and j := 0.

3 Choose p̂0 = r̂0 such that r0 , r̂0 2 0

4 while r j 2 > do

5 α j := r j , r̂ j 2 /Ap j , p̂ j 2

6 x j+1 := x j + α j p j

7 r j+1 := r j − α j Ap j

8 r̂ j+1 := r̂ j − α j AT p̂ j

9 β j+1 := r j+1 , r̂ j+1 2 /r j , r̂ j 2

10 p j+1 := r j+1 + β j+1 p j

11 p̂ j+1 := r̂ j+1 + β j+1 p̂ j

12 j := j + 1

the coeﬃcients indeed lead to sequences of vectors with the desired properties.

This can, however, be done as in the proof of Theorem 6.9.

When choosing r̂0 = r0 , the BiCG algorithm reduces for a symmetric matrix

A = AT to the standard CG algorithm (Algorithm 22), though all vectors are

computed twice, since in this case we obviously have x j = x̂ j , r j = r̂ j and

p j = p̂ j . If the matrix is in addition positive deﬁnite then we will also have

convergence.

As mentioned above, the BiCG algorithm can also be used to solve a system

of the form AT x = b̂ simultaneously. To this end, we only have to deﬁne r̂0 :=

b̂ − AT x̂0 and compute x̂ j+1 = x̂ j + α j p̂ j after step 6 of Algorithm 30.

As mentioned above, the BiCG method can fail to produce a new iteration if

the two-sided Lanczos process terminates early or if the matrix T k is not invert-

ible. The latter can be dealt with by changing the actual idea of computing the

iteration to a more GMRES-like method without the drawback of increasing

memory requirements.

Recall from (6.32) that we have the relation

Tk

AVk = Vk T k + γk vk+1 eTk = Vk+1 Hk , Hk = ∈ R(k+1)×k , (6.39)

γk eTk

07

6.3 Biorthogonalisation Methods 235

can write our iteration as xk = x0 + Vk y with a corresponding residual

rk := b−Axk = b−A(x0 +Vk y) = r0 −AVk y = βv1 −Vk+1 Hk y = Vk+1 (βe1 −Hk y),

be equivalent to minimising the 2 -norm of βe1 − Hk y if the columns of Vk+1

were orthonormal. In general this will not be the case but it is still a possible

option to just ﬁnd a minimiser for βe1 − Hk y2 , particularly since T k and hence

Hk have a tridiagonal structure.

process is deﬁned, the quasi-minimal residual (QMR) method computes the

iteration xk ∈ x0 + Kk (A, r0 ) as xk = x0 + Vk yk , where yk ∈ Rk solve

y∈Rk

with β = r0 2 .

case of the MINRES method and avoid the need to save the Lanczos vectors.

As for MINRES (see equation (6.25)), we use successive Givens rotations to

transform Hk into an upper triangular matrix. However, this time Hk ∈ R(k+1)×k

is a tridiagonal matrix and thus the transformed matrix has the band structure

⎛ ⎞ ⎛⎜r11 r12 r13 ⎞

⎜⎜⎜α1 β1 ⎟⎟⎟ ⎜⎜⎜ ⎟⎟⎟

⎟⎟⎟

⎜⎜⎜⎜ γ1 α2 β2 ⎟⎟⎟⎟ ⎜⎜⎜⎜ .. ..

. . ⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟ ⎜⎜⎜ r22 ⎟⎟⎟

⎜⎜⎜⎜ .. .. .. ⎟⎟ ⎜⎜⎜

⎟ . . ⎟⎟⎟

⎜ . . . ⎟⎟⎟ ⎜⎜ ⎜ .. .. r ⎟⎟

Gk Gk−1 · · · G1 ⎜⎜⎜⎜ ⎟⎟⎟ = ⎜⎜⎜ k−2,k ⎟

⎜⎜⎜ .. .. ⎟⎟⎟ ⎜⎜⎜ ⎟⎟⎟⎟ .

. . βk−1 ⎟⎟ ⎜⎜ .. ⎟⎟⎟

⎜⎜⎜⎜ ⎟⎟⎟ ⎜⎜⎜ . r k−1,k ⎟ ⎟⎟⎟

⎜⎜⎜ γk−1 αk ⎟⎟⎟ ⎜⎜⎜ ⎜

⎝⎜ ⎠ ⎜⎝ rkk ⎟⎟⎟⎟

γk ⎠

0

As in the case of MINRES, we can now easily compute the solution of the

minimisation problem and also update the QR factorisation. If we deﬁne again

QTk := Gk · · · G1 and Rk := QTk Hk ∈ R(k+1)×k then we can transform βe1 −Hk y2

as before. Recalling the deﬁnitions qk := QTk ek ∈ Rk+1 , qk = (q̃k , qk+1 )T with

q̃k ∈ Rk and R̃k ∈ Rk×k as the upper k × k block of Rk , equation (6.26) shows

that the minimiser of βe1 − Hk y2 is given by

yk := βR̃−1

k q̃k .

07

236 Methods for Large Sparse Systems

becomes

⎛ ⎞

⎜⎜⎜r11 r12 r13 ⎟⎟⎟

⎜⎜⎜⎜ .. .. ⎟⎟⎟

⎟⎟⎟

⎜⎜⎜ r22 . . ⎟⎟⎟

⎜⎜⎜ ⎟⎟⎟

⎜⎜⎜ .. .. ⎟⎟⎟

⎜⎜⎜ . . rk−2,k ⎟⎟⎟

Gk Gk−1 · · · G1 Hk+1 = ⎜⎜⎜⎜⎜ .. ⎟⎟ ,

⎜⎜⎜ . rk−1,k rk−1,k+1 ⎟⎟⎟⎟

⎜⎜⎜ ⎟⎟

⎜⎜⎜ rkk rk,k+1 ⎟⎟⎟⎟

⎜⎜⎜ ⎟

⎜⎜⎜ 0 d ⎟⎟⎟⎟⎟

⎝ ⎠

0 h

from which we can again determine the next iteration Gk+1 , as before. Setting

ﬁnally again Pk = (p1 , . . . , pk ) = Vk R̃−1

k we can compute the iterations via

see equation (6.30) for details. Taking this all into account, gives the ﬁnal al-

gorithm for QMR depicted in Algorithm 31, where the ﬁrst 10 lines cover the

two-sided Lanczos method.

There is an obvious connection between QMR and GMRES/MINRES. To see

this, recall that the iterations of GMRES xGk are deﬁned in such a way that the

residual rGk = b − AxGk satisﬁes

xGk as xGk = x0 + Vk yGk with yGk ∈ Rk . If we use the basis constructed by the

bi-Lanczos process, this leads to

≥ σk+1 βe1 − Hk yGk 2 ,

using (6.39) and Corollary 2.20 with the matrix Vk+1 ∈ Rn×(k+1) having full

rank k + 1 and smallest singular value σk+1 > 0. Moreover, we can bound the

residual of the QMR method by

σ1 G

≤ Vk+1 2 βe1 − Hk yGk 2 ≤ r 2 ,

σk+1 k

using also that Vk+1 2 = σ1 , the largest singular value of Vk+1 . This means

that the residual of QMR does not deviate too much from the optimal residual

of GMRES as long as Vk remains well-conditioned.

07

6.3 Biorthogonalisation Methods 237

Input : A ∈ Rn×n , b ∈ Rn , x0 ∈ Rn .

Output: Approximate solution of Ax = b.

1 r0 := b − Ax0 , β := r0 2 , v1 := r0 /β, ξ = (β, 0, . . . , 0)T

2 Choose r̂0 with r0 , r̂0 2 0 and set w1 := r̂0 /r0 , r̂0 2

3 for k = 1, 2, . . . , do

4 hkk := Avk , wk 2

5 ṽk+1 := Avk − hkk vk − hk−1,k vk−1 (undeﬁned terms are zero)

6 w̃k+1 := AT wk − hkk wk − hk,k−1 wk−1 (undeﬁned terms are zero)

7 hk+1,k := ṽk+1 2

8 vk+1 := ṽk+1 /hk+1,k

9 hk,k+1 := vk+1 , w̃k+1 2

10 wk+1 := w̃k+1 /hk,k+1 .

11 if k > 2 then

hk−2,k c sk−2 0

12 := k−2

hk−1,k −sk−2 ck−2 hk−1,k

13 if k > 1 then

hk−1,k c sk−1 hk−1,k

14 := k−1

hk,k −sk−1 ck−1 hkk

15 γ := (h2kk + h2k+1,k )1/2 , ck := hkk /γ, sk := ck hk+1,k /hkk

ξk c sk ξ k

16 := k

ξk+1 −sk ck 0

17 hkk := γ, hk+1,k := 0

18 pk := [vk − hk−1,k pk−1 − hk−2,k pk−2 ]/hkk (undeﬁned terms are zero)

19 xk := xk−1 + ξk pk

Lemma 6.44 Provided that the bi-Lanczos process does not stop prema-

turely, the residuals of the QMR and GMRES method satisfy

where κ2 (Vk+1 ) is the condition number of the matrix Vk+1 consisting of the

basis constructed by the bi-Lanczos process.

Both the BiCG method and the QMR method require a matrix–vector multipli-

cation with A and a matrix–vector multiplication with AT . This can be avoided,

though only at the price of an additional matrix–vector multiplication with the

original matrix. One possible way of achieving this is based on the following

07

238 Methods for Large Sparse Systems

we see that, as usual in the context of Krylov subspace methods, we can write

the direction p j = r j + β j p j−1 as

p j = P j (A)r0

with P ∈ π j . In the same way, we can express the residuals r j = r j−1 −

α j−1 Ap j−1 as

r j = R j (A)r0

with R j ∈ π j satisfying R j (0) = 1. Furthermore, the dual directions and residu-

als are deﬁned by the same polynomials, i.e. we have

p̂ j = P j (AT )r̂0 = (P j (A))T r̂0 , r̂ j = R j (AT )r̂0 = (R j (A))T r̂0 .

With this, the deﬁning coeﬃcients can be expressed as

r j , r̂ j 2 R j (A)r0 , R j (AT )r̂0 2 R2j (A)r0 , r̂0 2

αj = = = , (6.40)

Ap j , p̂ j 2 AP j (A)r0 , P j (AT )r̂0 2 AP2j (A)r0 , r̂0 2

r j+1 r̂ j+1 2 R j+1 (A)r0 , r̂0 2

2

β j+1 = = . (6.41)

r j , r̂ j 2 R2j (A), r0 , r̂0 2

Hence, if we can somehow compute R2j (A)r0 and AP2j (A)r0 , then we will be

able to compute these coeﬃcients directly. The deﬁning polynomials satisfy

the recurrence

P j (t) = R j (t) + β j P j−1 (t), (6.42)

R j (t) = R j−1 (t) − α j−1 tP j−1 (t). (6.43)

Squaring these equations yields

P2j (t) = R2j (t) + 2β j P j−1 (t)R j (t) + β2j P2j−1 (t),

R2j (t) = R2j−1 (t) − 2α j−1 tP j−1 (t)R j−1 (t) + α2j−1 t2 P2j−1 (t),

which would be update formulas for the squared terms P̃ j := P2j and R̃ j = R2j

if there were no mixed terms. To remove these mixed terms, we will simply

introduce a third quantity Q̃ j := P j−1 R j . To derive recursion formulas for all

three quantities, let us multiply (6.42) by R j , which gives

P j (t)R j (t) = R2j (t) + β j P j−1 (t)R j (t). (6.44)

If we further multiply (6.43) by P j−1 then we ﬁnd

P j−1 (t)R j (t) = P j−1 (t)R j−1 (t) − α j−1 tP2j−1 (t)

= R2j−1 (t) + β j−1 P j−2 (t)R j−1 (t) − α j−1 tP2j−1 (t).

07

6.3 Biorthogonalisation Methods 239

With this, we immediately ﬁnd the recurrence relations for the three quantities

P̃ j := P2j , R̃ j := R2j and Q̃ j := P j−1 R j :

Q̃ j (t) = R̃ j−1 (t) + β j−1 Q̃ j−1 (t) − α j−1 t P̃ j−1 (t),

R̃ j (t) = R̃ j−1 (t) − 2α j−1 t R̃ j−1 (t) + β j−1 Q̃ j−1 (t) + α2j−1 t2 P̃ j−1 (t),

P̃ j (t) = R̃ j (t) + 2β j Q̃ j (t) + β2j P̃ j−1 (t).

These recurrence relations can now be used to deﬁne recurrence relations for

the vectors deﬁned by

r j := R̃ j (A)r0 , p j := P̃ j (A)r0 , q j−1 := Q̃ j (A)r0 .

Note, however, that these vectors are not identical with those deﬁned in the

BiCG method. For the ﬁnal derivation of our new algorithm, let us also intro-

duce the quantity u j := r j + β j q j−1 . Then, we have the update formulas

q j−1 = r j−1 + β j−1 q j−2 − α j−1 Ap j−1

= u j−1 − α j−1 Ap j−1 ,

r j = r j−1 − α j−1 A(2r j−1 + 2β j−1 q j−2 − α j−1 Ap j−1 )

= r j−1 − α j−1 A(2u j−1 − α j−1 Ap j−1 )

= r j−1 − α j−1 A(u j−1 + q j−1 ),

p j = r j + 2β j q j−1 + β2j p j−1

= u j + β j (q j−1 + β j p j−1 ).

To ﬁnalise the algorithm, we only have to note two more things. First of all,

from (6.40) and (6.41) we see that we can compute the coeﬃcients α j and β j+1

as

r j , r̂0 2 r j+1 , r̂0 2

αj = , β j+1 =

Ap j , r̂0 2 r j , r̂0 2

and from r j+1 = r j − α j A(u j + q j ) we can conclude that the next iteration is

deﬁned by

x j+1 := x j + α j (u j + q j ).

This procedure gives a method called the Conjugate Gradient Squared (CGS)

method. The complete algorithm is given in Algorithm 32. This method does

not require us to form a matrix–vector product using the transpose AT of A.

However, it requires us to form two matrix–vector products with the matrix

A instead. Hence, the computational cost is still comparable to the cost of the

BiCG method.

The CGS method has a residual vector which can be described by R2j (A)r0

while R j (A)r0 is the residual vector of the BiCG method. This means that,

07

240 Methods for Large Sparse Systems

Input : A ∈ Rn×n , b ∈ Rn .

Output: Approximate solution of Ax = b.

1 Choose x0 ∈ Rn and r̂0 ∈ Rn \ {0}

2 Set r0 := u0 := p0 := b − Ax0

3 Set j := 0.

4 while r j 2 > do

5 α j := r j , r̂0 2 /Ap j , r̂0 2

6 q j := u j − α j Ap j

7 x j+1 := x j + α j (u j + q j )

8 r j+1 := r j − α j A(u j + q j )

9 β j+1 := r j+1 , r̂0 2 /r j , r̂0 2

10 u j+1 := r j+1 + β j+1 q j

11 p j+1 := u j+1 + β j+1 (q j + β j+1 p j )

12 j := j + 1

in the case of convergence, we can expect that the CGS method converges

approximately twice as fast as the BiCG method. However, it also means that

the eﬀect of rounding errors will be worse in CGS than in BiCG.

A possible remedy to this is to change the residual polynomial once again. The

idea here is to use the additional degrees of freedom to possibly smooth the

error. To this end, we will write the residual vector r j and the direction p j now

in the form

r j := χ j (A)R j (A)r0 ,

p j := χ j (A)P j (A)r0 ,

j

χ j (t) = (1 − ωi t) = (1 − ω j t)χ j−1 (t),

i=1

with certain weights ωi , which we will discuss later on. Moreover, we still

expect the polynomials P j and R j to satisfy the recurrence relations (6.42) and

(6.43), i.e.

R j (t) = R j−1 (t) − α j−1 tP j−1 (t). (6.46)

07

6.3 Biorthogonalisation Methods 241

For the vectors p j and r j these recurrence relations lead to the recursion

r j = (I − ω j A)χ j−1 (A)R j (A)r0

= (I − ω j A)χ j−1 (A)[R j−1 (A) − α j−1 AP j−1 (A)]r0

= (I − ω j A)[r j−1 − α j−1 Ap j−1 ],

p j = χ j (A)[R j (A) + β j P j−1 (A)]r0

= r j + β j (I − ω j A)p j−1 .

Note that recurrence relations (6.45) and (6.46) mean that the vectors P j (A)r0

and R j (A)r0 are the original directions and residuals of the BiCG method.

Hence, we can use the biorthogonality from Proposition 6.42, which means

in particular

R j (A)r0 ⊥ K j (AT , r̂0 ), AP j (A)r0 ⊥ K j (AT , r̂0 ).

From this, using also (6.45) and (6.46), we can conclude that

R j (A)r0 , R j (AT )r̂0 2 = R j (A)r0 , [R j−1 (AT ) − α j−1 AT P j−1 (AT )]r̂0 2

= −α j−1 R j (A)r0 , AT P j−1 (A)r̂0 2

= −α j−1 R j (A)r0 , AT [R j−1 (AT ) + β j−1 P j−2 (AT )]r̂0 2

= −α j−1 R j (A)r0 , AT R j−1 (AT )r̂0 2

= ···

= (−1) j α j−1 α j−2 · · · α0 R j (A)r0 , (AT ) j r̂0 2 (6.47)

and also in the same fashion that

AP j (A)r0 , P j (AT )r̂0 2 = AP j (A)r0 , [R j (AT ) + β j P j−1 (AT )]r̂0 2

= AP j (A)r0 , R j (AT )r̂0 2

= (−1) j α j−1 α j−2 · · · α0 AP j (A)r0 , (AT ) j r̂0 2 .

Hence, we can rewrite the coeﬃcient α j from (6.40) as

R j (A)r0 , R j (AT )r̂0 2 R j (A)r0 , (AT ) j r̂0 2

αj = = , (6.48)

AP j (A)r0 , P j (AT )r̂0 2 AP j (A)r0 , (AT ) j r̂0 2

which is not yet what we want, since we still do not use the newly deﬁned

vectors r j and p j . However, these vectors satisfy

r j , r̂0 2 = χ j (A)R j (A)r0 , r̂0 2 = R j (A)r0 , χ j (AT )r̂0 2

= R j (A)r0 , (I − ω j AT )χ j−1 (AT )r̂0 2

= −ω j R j (A)r0 , AT χ j−1 (AT )r̂0 2

= ···

= (−1) j ω j · · · ω1 R j (A)r0 , (AT ) j r̂0 2 (6.49)

07

242 Methods for Large Sparse Systems

and also

Ap j , r̂0 2 = Aχ j (A)P j (A)r0 , r̂0 2 = AP j (A)r0 , χ j (AT )r̂0 2

= (−1) j ω j · · · ω1 AP j (A)r0 , (AT ) j r̂0 2 .

Hence, we can express (6.48) as

r j , r̂0 2

αj = .

Ap j , r̂0 2

Moreover, (6.47) and (6.49) allow us also to express β j+1 from (6.41) now as

R j+1 (A)r0 , R j+1 (AT )r̂0 2 α j r j+1 , r̂0 2

β j+1 = = .

R j (A)r0 , R j (A)r̂0 2 ω j+1 r j , r̂0 2

Thus, the only parameters that we have not yet determined are the weights ω j .

Deﬁnition 6.45 If the coeﬃcients ω j are in each step determined in such a

way that r j minimises its 2-norm, i.e. if ω j ∈ R minimises

ω → (I − ωA)χ j−1 (A)R j (A)r0 2

then the resulti

## Molto più che documenti.

Scopri tutto ciò che Scribd ha da offrire, inclusi libri e audiolibri dei maggiori editori.

Annulla in qualsiasi momento.