
Convergence Rate Estimates


for the Conjugate Gradient Method
Igor Kaporin

Parallel Computing Lab.,


Dept. Applied Optimization Problems
A. A. Dorodnicyn Computing Center
Russian Academy of Sciences, Moscow, RUSSIA
The Rome-Moscow School of Matrix Methods
Moscow Part: September 10-15, 2012


Outline of the talk


• CG as the Optimum Krylov Subspace Method
• Spectral bound convergence rate estimates
• New estimates via the K-condition number
• The Preconditioned CG method
• Preconditioning via K-optimization


CG as the Optimum Krylov Subspace Method (1)

Consider a system of linear algebraic equations


    Ax = b,    x \in \mathbb{R}^n,    b \in \mathbb{R}^n,    A^T = A > 0,

where A is large, sparse, and not well-conditioned.
The CG approximations x_k to the solution x of the
linear system are constructed from the initial residual
r_0 = b - A x_0 in the form

    x_k = x_0 + r_0 \alpha_1^{(k)} + \dots + A^{k-1} r_0 \alpha_k^{(k)},

where the scalar coefficients are chosen such that

    \{\alpha_1^{(k)}, \dots, \alpha_k^{(k)}\} = \arg\min \|x - x_k\|_A.


CG as the Optimum Krylov Subspace Method (2)

The CG algorithm is as follows:


    r_0 = b - A x_0,    p_0 = r_0;
    for i = 0, 1, \dots:
        \alpha_i = r_i^T r_i / (p_i^T A p_i),
        x_{i+1} = x_i + p_i \alpha_i,
        r_{i+1} = r_i - A p_i \alpha_i,
        \beta_i = r_{i+1}^T r_{i+1} / (r_i^T r_i),
        p_{i+1} = r_{i+1} + p_i \beta_i.

Note that r_k = b - A x_k = \psi_k(A) r_0.
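
These recurrences translate directly into code. Below is a minimal NumPy sketch of the plain CG iteration; the function name, the residual-based stopping test, and the iteration cap are illustrative choices, not part of the algorithm above.

```python
import numpy as np

def cg(A, b, x0=None, tol=1e-8, maxit=None):
    """Plain conjugate gradient for SPD A, following the recurrences above."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x                    # r_0 = b - A x_0
    p = r.copy()                     # p_0 = r_0
    rr = r @ r
    maxit = n if maxit is None else maxit
    for _ in range(maxit):
        Ap = A @ p
        alpha = rr / (p @ Ap)        # alpha_i = r_i^T r_i / p_i^T A p_i
        x += alpha * p               # x_{i+1} = x_i + p_i alpha_i
        r -= alpha * Ap              # r_{i+1} = r_i - A p_i alpha_i
        rr_new = r @ r
        if np.sqrt(rr_new) <= tol * np.linalg.norm(b):
            break
        beta = rr_new / rr           # beta_i = r_{i+1}^T r_{i+1} / r_i^T r_i
        p = r + beta * p             # p_{i+1} = r_{i+1} + p_i beta_i
        rr = rr_new
    return x
```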


CG as the Optimum Krylov Subspace Method (3)

Therefore, one has

    \psi_k(t) = 1 - \sum_{i=1}^{k} \alpha_i^{(k)} t^i,
    \qquad
    x - x_k = A^{-1} \psi_k(A) r_0,

and the solution is well approximated even for k \ll n,
a sufficient condition for which is \|\psi_k(A)\| \ll 1.
Indeed, by the optimality of \psi_k(A) one has

    \|x - x_k\|_A = \|\psi_k(A) r_0\|_{A^{-1}}
        \le \|\tilde\psi_k(A) r_0\|_{A^{-1}}
        \le \|\tilde\psi_k(A)\| \, \|r_0\|_{A^{-1}}
        = \|\tilde\psi_k(A)\| \, \|x - x_0\|_A,

where \tilde\psi_k(\lambda) is any polynomial such that

    \deg \tilde\psi_k \le k,    \qquad    \tilde\psi_k(0) = 1.


CG as the Optimum Krylov Subspace Method (4)

Since A is SPD, it holds (e.g., by the spectral decomposition of A) that

    \|\tilde\psi_k(A)\| = \max_{i=1,\dots,n} |\tilde\psi_k(\lambda_i)|,

where \lambda_i = \lambda_i(A) > 0 are the eigenvalues of A,
numbered in nondecreasing order.
Using different particular choices of \tilde\psi_k(\lambda), one can
construct various CG convergence estimates of the type

    \frac{\|x - x_k\|_A}{\|x - x_0\|_A}
        \le \max_{\lambda = \lambda_i} |\tilde\psi_k(\lambda)|
        \le \varphi(\lambda_1, \dots, \lambda_n),

the right-hand side of which in any case depends on
the spectrum of A.


Spectral bound convergence rate estimates (1)

The well-known standard result follows from


    \frac{\|x - x_k\|_A}{\|x - x_0\|_A}
        \le \max_{\lambda = \lambda_i} |\tilde\psi_k(\lambda)|
        \le \max_{\lambda_1 \le \lambda \le \lambda_n} |\tilde\psi_k(\lambda)|,

where \tilde\psi_k is expressed via a properly translated and
scaled kth degree Chebyshev polynomial T_k(\lambda) of the
1st kind, which yields

    \frac{\|x - x_k\|_A}{\|x - x_0\|_A}
        \le 1 \Big/ T_k\!\left(\frac{\lambda_n + \lambda_1}{\lambda_n - \lambda_1}\right)
        < 2 \exp\!\left(-2k\sqrt{\frac{\lambda_1}{\lambda_n}}\right).

Recall that

    T_k(z) = \frac{1}{2}\left[\left(z + \sqrt{z^2 - 1}\right)^{k}
                            + \left(z - \sqrt{z^2 - 1}\right)^{k}\right].
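
As a small numerical illustration (not part of the original slides), the helper below evaluates both quantities on the right: the exact Chebyshev factor via the explicit formula for T_k, and its exponential upper bound.

```python
import numpy as np

def chebyshev_factor(lam_min, lam_max, k):
    """Exact factor 1/T_k((lam_max+lam_min)/(lam_max-lam_min)) and its
    exponential upper bound 2*exp(-2*k*sqrt(lam_min/lam_max))."""
    z = (lam_max + lam_min) / (lam_max - lam_min)
    s = np.sqrt(z * z - 1.0)
    Tk = 0.5 * ((z + s)**k + (z - s)**k)
    return 1.0 / Tk, 2.0 * np.exp(-2.0 * k * np.sqrt(lam_min / lam_max))

print(chebyshev_factor(1.0, 100.0, 10))   # exact factor vs. its upper bound
```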


Spectral bound convergence rate estimates (2)

This estimate readily yields an iteration number bound


for the CG to converge with the relative precision \varepsilon:

    k \ge \left\lceil \frac{1}{2} \sqrt{\frac{\lambda_n}{\lambda_1}}
          \,\log\frac{2}{\varepsilon} \right\rceil.
Note that in practice, an a priori estimation of the spectral bounds (more precisely, a control of their ratio) may
be impossible to perform.
However, a posteriori estimates of \lambda_1 and \lambda_n are readily
available from the \alpha_i and \beta_i generated by the CG algorithm.
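
One standard way to obtain such a posteriori estimates (not spelled out on the slide) uses the Lanczos tridiagonal matrix implicitly built by CG: its entries are formed from the \alpha_i and \beta_i, and its extreme eigenvalues (Ritz values) approximate \lambda_1 and \lambda_n. A hedged NumPy sketch:

```python
import numpy as np

def extreme_ritz_values(alphas, betas):
    """A posteriori estimates of lambda_1 and lambda_n from the CG
    coefficients alpha_0..alpha_{k-1} and beta_0..beta_{k-2}: the extreme
    eigenvalues of the associated Lanczos tridiagonal matrix."""
    k = len(alphas)
    T = np.zeros((k, k))
    for i in range(k):
        T[i, i] = 1.0 / alphas[i]
        if i > 0:
            T[i, i] += betas[i - 1] / alphas[i - 1]
            T[i, i - 1] = T[i - 1, i] = np.sqrt(betas[i - 1]) / alphas[i - 1]
    ev = np.linalg.eigvalsh(T)
    return ev[0], ev[-1]
```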


Spectral bound convergence rate estimates (3)

Thus we have the estimate


    \frac{\|x - x_k\|_A}{\|x - x_0\|_A} < 2 \exp\!\left(-\frac{2k}{\sqrt{C(A)}}\right),

where

    C(A) = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)} \ge 1

is the spectral condition number of an SPD matrix A.
In this case, for some \tau > 0,

    A \to \tau I    iff    C(A) \to 1.


New estimates via the K-condition number (1)

In the presented theory, a key role is played by the matrix
functional K(A) defined as

    K(A) = \frac{\left(n^{-1}\,\mathrm{trace}\,A\right)^{n}}{\det A}
         = \left(\frac{1}{n}\sum_{i=1}^{n}\lambda_i\right)^{n}
           \Big/ \prod_{i=1}^{n}\lambda_i
         = K(\lambda_1, \dots, \lambda_n).

The latter holds by the well-known property

    \mathrm{trace}\,A = \sum_{i=1}^{n} \lambda_i,
    \qquad
    \det A = \prod_{i=1}^{n} \lambda_i.
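
For illustration, K(A) can be evaluated directly from this definition; working with logarithms (slogdet) avoids overflow of the determinant for larger n. A small sketch of my own, not from the slides:

```python
import numpy as np

def k_cond(A):
    """K(A) = (trace(A)/n)^n / det(A) for SPD A, evaluated via logarithms."""
    n = A.shape[0]
    sign, logdet = np.linalg.slogdet(A)
    assert sign > 0, "A must be SPD"
    return np.exp(n * np.log(np.trace(A) / n) - logdet)
```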


New estimates via the K-condition number (2)

In complete analogy with the spectral condition
number, for the K-condition number it holds (\tau > 0) that

    A \to \tau I    iff    K(A) \to 1,

when A is an SPD matrix. This is nothing but the
Arithmetic-Geometric Mean (AGM) inequality written
for \{\lambda_1, \dots, \lambda_n\}.
First we demonstrate an elementary proof of a (rather
rough but instructive) estimate for the decrease of
\|x - x_k\|_A in the CG method.


New estimates via the K-condition number (3)

Theorem 1. [Kaporin,Axelsson'00] Let A be SPD.


Then for any even k satisfying

    2 \log_2 K(A) < k < n

it holds

    \frac{\|x - x_k\|_A}{\|x - x_0\|_A} < \left( K(A)^{2/k} - 1 \right)^{k/2}.

Note: this bound is not sharp.
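
As a hedged illustration of how Theorem 1 translates into an iteration count, the snippet below searches for the smallest admissible even k at which the right-hand side drops below a prescribed \varepsilon; the function name and the brute-force search are my own choices.

```python
import numpy as np

def iterations_from_theorem1(K, eps, n):
    """Smallest even k with 2*log2(K) < k < n and (K**(2/k)-1)**(k/2) <= eps."""
    for k in range(2, n, 2):
        if k <= 2 * np.log2(K):      # Theorem 1 requires k > 2 log2 K(A)
            continue
        if (K**(2.0 / k) - 1.0)**(k / 2.0) <= eps:
            return k
    return None

print(iterations_from_theorem1(K=1e3, eps=1e-8, n=10**6))
```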


New estimates via the K-condition number (4)

Proof. Let k = 2m. Using the general estimate with

    \tilde\psi_k(t) = \prod_{i=1}^{m} (1 - t/\lambda_i)(1 - t/\lambda_{n+1-i}),

one readily gets

    \frac{\|x - x_k\|_A}{\|x - x_0\|_A}
        \le \max_{i=1,\dots,n} |\tilde\psi_k(\lambda_i)|
        \le \max_{i=m+1,\dots,n-m} |\tilde\psi_k(\lambda_i)|

        \le \prod_{i=1}^{m} \max_{\lambda_{m+1} \le t \le \lambda_{n-m}}
            \left| (1 - t/\lambda_i)(1 - t/\lambda_{n+1-i}) \right|

        \le \prod_{i=1}^{m} \left( K(\lambda_i, \lambda_{n+1-i}) - 1 \right).


New estimates via the K-condition number (5)

Using an obvious consequence of the AGM inequality,

    \left(\prod_{i=1}^{m} \theta_i\right)^{1/m}
  + \left(\prod_{i=1}^{m} (1 - \theta_i)\right)^{1/m} \le 1,
    \qquad 0 < \theta_i < 1,

with \theta_i = 1/K(\lambda_i, \lambda_{n+1-i}),
we obtain

    \frac{\|x - x_k\|_A}{\|x - x_0\|_A}
        \le \left( \left( \prod_{i=1}^{m} K(\lambda_i, \lambda_{n+1-i}) \right)^{1/m} - 1 \right)^{m},

and it only remains to prove that

    \prod_{i=1}^{m} K(\lambda_i, \lambda_{n+1-i}) \le K(A).


New estimates via the K-condition number (6)

The latter also follows from the AGM inequality:

    K(A) \prod_{i=1}^{n} \lambda_i
      = \left( \frac{1}{n} \sum_{i=1}^{n} \lambda_i \right)^{n}
      = \left( \frac{1}{n} \left[ \sum_{i=1}^{m} \frac{\lambda_i + \lambda_{n+1-i}}{2}
                                + \sum_{i=1}^{m} \frac{\lambda_i + \lambda_{n+1-i}}{2}
                                + \sum_{i=m+1}^{n-m} \lambda_i \right] \right)^{n}

      \ge \left( \prod_{i=1}^{m} \left( \frac{\lambda_i + \lambda_{n+1-i}}{2} \right)^{2} \right)
          \left( \prod_{i=m+1}^{n-m} \lambda_i \right)

      = \left( \prod_{i=1}^{m} K(\lambda_i, \lambda_{n+1-i}) \right)
        \prod_{i=1}^{n} \lambda_i.

Q.E.D.


New estimates via the K-condition number (7)

The corresponding iteration number bound is

    k < \left\lceil \frac{4 \log K(A) + 3 \log(\varepsilon^{-1})}
                         {\log\!\left( 4 + \log_{K(A)}(\varepsilon^{-1}) \right)} \right\rceil ;

for a given 0 < \varepsilon \le 1, this number of iterations yields

    \|x - x_k\|_A \le \varepsilon \|x - x_0\|_A.
This can be shown by setting t = K(A)^{2/k} in the inequality

    t - 1 \le (\gamma - 1)^{\gamma - 1} \, \gamma^{-\gamma} \, t^{\gamma},
    \qquad \gamma > 1, \quad t > 1.

Hence we establish a CG iteration number bound which
grows sublinearly with respect to \log\frac{1}{\varepsilon}.


New estimates via the K-condition number (8)

Concerning the roughness of the above estimate: an
unimprovable (best possible) bound was found, but it
relates to the reduction of another error norm:

Theorem 2. [Kaporin'92,94] Let A be SPD. Then
for any even k satisfying

    1 \le k < n

it holds

    \frac{\|r_k\|}{\|r_0\|} \le \left( K(A)^{1/k} - 1 \right)^{k/2}.

This means a nearly 2 times reduction of the above
iteration number bound.


New estimates via the K-condition number (9)

An example is shown in the Figure, where we consider
a matrix A of order n = 50 with the prescribed
eigenvalues \lambda_{n+1-j}(A) = n^2 + 1 - j^2, j = 1, \dots, n.
The dashed lines show the A^{-1}-norm of the residual and
its upper bound via C(A),
while the solid lines correspond to the Euclidean norm
of the residual and its estimate via K(A).
It is quite clear that the two solid lines behave much more
similarly to each other than the two dashed lines do.
Note that the considered example has isolated smaller
eigenvalues and clustered largest eigenvalues, which
is exactly the class of eigenvalue distributions we
prefer to deal with.
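
The experiment can be reproduced along the following lines; this is a sketch under stated assumptions: A is taken diagonal with the prescribed eigenvalues and the right-hand side is a fixed random vector, neither of which is specified on the slides.

```python
import numpy as np

n = 50
j = np.arange(1, n + 1)
lam = np.sort((n**2 + 1 - j**2).astype(float))   # prescribed spectrum, ascending
A = np.diag(lam)                                  # any SPD matrix with this spectrum
rng = np.random.default_rng(0)
b = rng.standard_normal(n)

# plain CG, recording the Euclidean and A^{-1}-norms of the residual
x = np.zeros(n); r = b.copy(); p = r.copy(); rr = r @ r
res2 = [np.sqrt(rr)]
resAinv = [np.sqrt(r @ np.linalg.solve(A, r))]
for _ in range(n):
    Ap = A @ p
    alpha = rr / (p @ Ap)
    x += alpha * p
    r -= alpha * Ap
    rr_new = r @ r
    res2.append(np.sqrt(rr_new))
    resAinv.append(np.sqrt(abs(r @ np.linalg.solve(A, r))))
    p = r + (rr_new / rr) * p
    rr = rr_new

# bounds: 2*exp(-2k/sqrt(C(A))) for the A^{-1}-norm (dashed lines in the figure),
# (K^{1/k}-1)^{k/2} for the Euclidean norm (solid lines, Theorem 2)
C = lam[-1] / lam[0]
K = np.exp(n * np.log(lam.mean()) - np.log(lam).sum())
for k in (10, 20, 30, 40):
    print(k, resAinv[k] / resAinv[0], 2 * np.exp(-2 * k / np.sqrt(C)),
          res2[k] / res2[0], (K**(1.0 / k) - 1.0)**(k / 2.0))
```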


New estimates via the K-condition number (10)

The A^{-1}- and H-norms of the residuals and
their upper bounds vs. the CG iteration number.


The Preconditioned CG method (1)

Obviously, if some condition number of the matrix A
is very large, then the CG method may require a huge
number of iterations to converge (especially in finite-precision
arithmetic), despite its optimality.
To overcome this drawback, a very simple but powerful
idea of preconditioning is applied.
Namely, let us substitute x = G^T y, where \det G \ne 0,
and solve the preconditioned linear system

    G A G^T y = G b

by the same CG method. This time, we will have the
preconditioned matrix M = G A G^T instead of A in every
formula above!


The Preconditioned CG method (2)

Using appropriate substitutions, and denoting H = G^T G,
we readily obtain the preconditioned CG method:

    r_0 = b - A x_0,    p_0 = H r_0;
    for i = 0, 1, \dots:
        \alpha_i = r_i^T H r_i / (p_i^T A p_i),
        x_{i+1} = x_i + p_i \alpha_i,
        r_{i+1} = r_i - A p_i \alpha_i,
        \beta_i = r_{i+1}^T H r_{i+1} / (r_i^T H r_i),
        p_{i+1} = H r_{i+1} + p_i \beta_i.

Here we have r_k = b - A x_k = \psi_k(A H) r_0.
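
A minimal NumPy sketch of these recurrences; the preconditioner is passed as a callable apply_H representing the action of H = G^T G on a vector, which is an illustrative interface rather than anything prescribed by the slides.

```python
import numpy as np

def pcg(A, b, apply_H, x0=None, tol=1e-8, maxit=None):
    """Preconditioned CG: the recurrences above, with apply_H(v) = H v."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x
    z = apply_H(r)                   # H r_0
    p = z.copy()
    rz = r @ z
    maxit = n if maxit is None else maxit
    for _ in range(maxit):
        Ap = A @ p
        alpha = rz / (p @ Ap)        # alpha_i = r_i^T H r_i / p_i^T A p_i
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = apply_H(r)
        rz_new = r @ z
        beta = rz_new / rz           # beta_i = r_{i+1}^T H r_{i+1} / r_i^T H r_i
        p = z + beta * p             # p_{i+1} = H r_{i+1} + p_i beta_i
        rz = rz_new
    return x

# example: Jacobi preconditioning, H = (Diag(A))^{-1}
# x = pcg(A, b, lambda r: r / np.diag(A))
```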


The Preconditioned CG method (3)

Both of the above iteration number bounds are (nearly)
proportional to

    \sqrt{C(HA)}

in the standard (spectral-bound-based) CG theory, or to

    \log K(HA)

when using the nth power of the arithmetic-to-geometric
mean ratio of the spectrum of HA
(i.e., the K-condition number) for the same purposes.
Hence, the central problem in the PCG theory is:

    using matrices H that are cheap to multiply a vector by,
    reduce the condition number of HA as much as possible.

The Preconditioned CG method (4)

The K-optimization vs. the C-optimization:

• the (second) estimate via K is as sharp
  as the one via C;
• the estimates via K reflect the superlinear
  convergence of CG, while the one via C does not;
• generally, the K(HA)-optimization can be feasible,
  while the C(HA)-optimization may not be;
• K(HA)-optimization tends to cluster the
  spectrum of HA near the largest eigenvalues of HA,
  while C(HA)-optimization may not.


Preconditioning via K-optimization (1)

The simplest possible example is preconditioning
by a symmetric diagonal scaling (i.e., G = D):

    M = D A D.

It can be shown that K-optimality is attained for

    D_* = (\mathrm{Diag}(A))^{-1/2},

that is,

    D_* = \arg\min_{D} K(D A D),

where the minimum is taken over the set of all
SPD diagonal matrices D.
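
A quick numerical check of this claim (my own illustration, using an arbitrary SPD test matrix): K(DAD) for D_* = (Diag(A))^{-1/2} is compared against a few random positive diagonal scalings.

```python
import numpy as np

def k_cond(A):
    n = A.shape[0]
    return np.exp(n * np.log(np.trace(A) / n) - np.linalg.slogdet(A)[1])

rng = np.random.default_rng(1)
B = rng.standard_normal((20, 20))
A = B @ B.T + 20.0 * np.eye(20)              # a generic SPD test matrix

D_star = np.diag(1.0 / np.sqrt(np.diag(A)))
print("K(D*AD*) =", k_cond(D_star @ A @ D_star))
for _ in range(3):                           # a few random positive diagonal scalings
    D = np.diag(np.exp(rng.standard_normal(20)))
    print("K(DAD)  =", k_cond(D @ A @ D))
```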


Preconditioning via K-optimization (2)

Indeed, denoting \delta_i = (A)_{ii} (D)_{ii}^2, one has

    K(DAD) = \frac{\left(n^{-1}\sum_{i=1}^{n}\delta_i\right)^{n}}{\det(DAD)}
           \ge \frac{\prod_{i=1}^{n}\delta_i}{\det(DAD)}
           = K(D_* A D_*),

where D_* is defined above. The AGM inequality shows
that equality in the latter estimate is attained iff
\delta_i = \tau > 0.
Finally, under the natural restriction \tau = 1, one gets the
required equality D = D_*.


Preconditioning via K-optimization (3)

It must be stressed that this (so-called Jacobi) scaling
is not optimum in the sense of the spectral condition
number C(DAD) for arbitrary SPD matrices. The exceptions
are some special cases, e.g., consistently ordered matrices.
An example: the Toeplitz SPD tridiagonal matrix
T = tridiag[-1, 2, -1] is consistently ordered and therefore is
C-optimally scaled. Therefore, its inverse has
the same property, since for any SPD A it always holds that
C(A) = C(A^{-1}).
At the same time, T^{-1} has a non-constant diagonal and
therefore is not K-optimally scaled.
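
This is easy to verify numerically; the sketch below (sizes are my choice) builds T, inspects the diagonal of T^{-1}, and checks that Jacobi rescaling of T^{-1} lowers its K-condition number.

```python
import numpy as np

def k_cond(A):
    n = A.shape[0]
    return np.exp(n * np.log(np.trace(A) / n) - np.linalg.slogdet(A)[1])

n = 10
T = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # tridiag[-1, 2, -1]
Tinv = np.linalg.inv(T)

print(np.diag(Tinv))                                      # non-constant diagonal
D = np.diag(1.0 / np.sqrt(np.diag(Tinv)))
print(k_cond(Tinv), ">", k_cond(D @ Tinv @ D))            # Jacobi scaling lowers K
```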


Preconditioning via K-optimization (4)

In a similar way, it can be shown that the Block Jacobi


preconditioning for an SPD matrix is K-optimum over
the set of all block-diagonal matrices with prescribed
structure.
However, we prefer to consider this structure later as a
particular case of a more general construction.


References
• [Kaporin'92] Explicitly preconditioned conjugate gradient
  method for the solution of nonsymmetric linear systems. Int.
  J. Computer Math., 40, 169-187, 1992.
• [Kaporin'94] New convergence results and preconditioning
  strategies for the conjugate gradient method. Numer. Linear
  Algebra with Appls., 1, no. 2, 179-210, 1994.
• [Kaporin,Axelsson'00] On the sublinear and superlinear rate
  of convergence of conjugate gradient methods. Numerical
  Algorithms, 25, 1-22, 2000.
