
Lecture Notes in

Mathematics
Edited by A. Dold and B. Eckmann

1133

Krzysztof C. Kiwiel

Methods of Descent for


Nondifferentiable Optimization

Springer-Verlag
Berlin Heidelberg New York Tokyo
Author
Krzysztof C. Kiwiel
Systems Research Institute, Polish Academy of Sciences
ul. Newelska 6, 01-447 Warsaw, Poland

Mathematics Subject Classification: 49-02, 49 D 37, 65-02, 65 K 05, 90-02, 90 C 30

ISBN 3-540-15642-9 Springer-Verlag Berlin Heidelberg New York Tokyo


ISBN 0-387-15642-9 Springer-Verlag New York Heidelberg Berlin Tokyo

This work is subject to copyright. All rights are reserved, whether the whole or part of the material
is concerned, specifically those of translation, reprinting, re-use of illustrations, broadcasting,
reproduction by photocopying machine or similar means, and storage in data banks. Under
§ 54 of the German Copyright Law where copies are made for other than private use, a fee is
payable to "Verwertungsgesellschaft Wort", Munich.
© by Springer-Verlag Berlin Heidelberg 1985
Printed in Germany
Printing and binding: Beltz Offsetdruck, Hemsbach/Bergstr.
2146/3140-543210
PREFACE

This book is about numerical methods for problems of finding the
largest or smallest values which can be attained by functions of several
real variables subject to several inequality constraints. If such
problems involve continuously differentiable functions, they can be
solved by a variety of methods well documented in the literature. We
are concerned with more general problems in which the functions are
locally Lipschitz continuous, but not necessarily differentiable or
convex. More succinctly, this book is about numerical methods for non-
differentiable optimization.
Nondifferentiable optimization, also called nonsmooth optimization,
has many actual and potential applications in industry and science.
For this reason, a great deal of effort has been devoted to it during
the last decade. Most research has gone into the theory of nonsmooth
optimization, while surprisingly few algorithms have been proposed,
these mainly by C. Lemaréchal, R. Mifflin and P. Wolfe. Frequently such
algorithms are conceptual, since their storage and work per iteration
grow without bound in the course of calculations. Also their convergence
properties are usually weaker than those of classical methods for
smooth optimization problems.
This book gives a complete state-of-the-art account of general-purpose
methods of descent for nonsmooth minimization. The methods use piece-
wise linear approximations to the problem functions constructed from
several subgradients evaluated at certain trial points. At each itera-
tion, a search direction is found by solving a quadratic programming
subproblem, and then a line search produces both the next improved
approximation to a solution and a new trial point so as to detect gra-
dient discontinuities. The algorithms converge to points satisfying
necessary optimality conditions. Also they are widely applicable, since
they require only a weak semismoothness hypothesis on the problem func-
tions, which is likely to hold in most applications.
A unifying theme of this book is the use of subgradient selection
and aggregation techniques in the construction of methods for nondiffe-
rentiable optimization. It is shown that these techniques give rise in
a totally systematic manner to new implementable and globally conver-
gent modifications and extensions of all the most promising algorithms
which have been recently proposed. In effect, this book should give the
reader a feeling for the way in which the subject has developed and is
developing, even though it mainly reflects the author's research.
This book does not discuss methods without a monotonic descent
(or ascent) property, which have been developed in the Soviet Union.
The reason is that the subject of their effective implementations is
still a mystery. Moreover, these subgradient methods are well descri-
bed in the monograph of Shor (1979). We refer the reader to Shor's
excellent book (its English translation was published by Springer-
Verlag in 1985) for an extensive discussion of specific nondifferent-
iable optimization problems that arise in applications. Due to space
limitations, such applications will not be treated in this book.
In order to make the contents of this book accessible to as wide
a range of readers as possible, our analysis of algorithms will use
only a few results from nonsmooth optimization theory. These, as well
as certain other results that may help the reader in applications, are
briefly reviewed in the introductory chapter, which also contains a
review of representative existing algorithms. The reader who has basic
familiarity with nonsmooth functions may skip this chapter and start
reading from Chapter 2, where methods for unconstrained convex minimi-
zation are described in detail. The basic constructions of Chapter 2
are extended to the unconstrained nonconvex case in two fundamentally
different ways in Chapters 3 and 4, giving rise to competitive methods.
Algorithms for constrained convex problems are treated in Chapter 5,
and their extensions to the nonconvex case are described in Chapter 6.
Chapter 7 presents new versions of the bundle method of Lemaréchal and
its extensions to constrained and nonconvex problems. Chapter 8 con-
tains a few numerical results.
The book should enable research workers in various branches of
science and engineering to use methods for nondifferentiable optimiza-
tion more efficiently. Although no computer codes are given in the text,
the methods are described unambiguously, so computer programs may rea-
dily be written.
The author would like to thank Claude Lemaréchal and Dr. A. Rusz-
czyński for introducing him to the field of nonsmooth optimization, and
Prof. K. Malanowski for suggesting the idea of the book. Without A. Rusz-
czyński's continuing help and encouragement this book would not have
been written. Part of the results of this book were obtained when the
author worked on his doctoral dissertation under the supervision of
Prof. A. P. Wierzbicki at the Institute of Automatic Control of the Tech-
nical University of Warsaw. The help of Prof. R. Kulikowski and Prof.
J. Hołubiec from the Systems Research Institute of the Polish Academy of
Sciences, where this book was written, is gratefully acknowledged.
Finally, the author wishes to thank Mrs. I. Forowicz and Mrs. E. Grudziń-
ska for patiently typing the manuscript.
TABLE OF CONTENTS

Page

Chapter 1. Fundamentals
1.1. Introduction .................................... 1
1.2. Basic Results of Nondifferentiable Optimization
     Theory .......................................... 2
1.3. A Review of Existing Algorithms and Original
     Contributions of This Work ...................... 22
Chapter 2. Aggregate Subgradient Methods for Unconstrained
     Convex Minimization
2.1. Introduction .................................... 44
2.2. Derivation of the Algorithm Class ............... 44
2.3. The Basic Algorithm ............................. 57
2.4. Convergence of the Basic Algorithm .............. 59
2.5. The Method with Subgradient Selection ........... 71
2.6. Finite Convergence for Piecewise Linear Functions 76
2.7. Line Search Modifications ....................... 84

Chapter 3. Methods with Subgradient Locality Measures for
     Minimizing Nonconvex Functions
3.1. Introduction .................................... 87
3.2. Derivation of the Methods ....................... 88
3.3. The Algorithm with Subgradient Aggregation ...... 99
3.4. Convergence ..................................... 106
3.5. The Algorithm with Subgradient Selection ........ 123
3.6. Modifications of the Methods .................... 131

Chapter 4. Methods with Subgradient Deletion Rules for
     Unconstrained Nonconvex Minimization
4.1. Introduction .................................... 139
4.2. Derivation of the Methods ....................... 141
4.3. The Algorithm with Subgradient Aggregation ...... 150
4.4. Convergence ..................................... 156
4.5. The Algorithm with Subgradient Selection ........ 168
4.6. Modified Resetting Strategies ................... 171
4.7. Simplified Versions That Neglect Linearization
     Errors .......................................... 185

Chapter 5. Feasible Point Methods for Convex Constrained
     Minimization Problems
5.1. Introduction ..................................... 190
5.2. Derivation of the Algorithm Class ................ 191
5.3. The Algorithm with Subgradient Aggregation ....... 205
5.4. Convergence ...................................... 207
5.5. The Method with Subgradient Selection ............ 215
5.6. Line Search Modifications ........................ 217
5.7. Phase I - Phase II Methods ....................... 219

Chapter 6. Methods of Feasible Directions for Nonconvex
     Constrained Problems
6.1. Introduction ..................................... 229
6.2. Derivation of the Methods ........................ 230
6.3. The Algorithm with Subgradient Aggregation ....... 245
6.4. Convergence ...................................... 252
6.5. The Algorithm with Subgradient Selection ......... 264
6.6. Modifications of the Methods ..................... 269
6.7. Methods with Subgradient Deletion Rules .......... 275
6.8. Methods That Neglect Linearization Errors ........ 293
6.9. Phase I - Phase II Methods ....................... 294

Chapter 7. Bundle Methods
7.1. Introduction ..................................... 299
7.2. Derivation of the Methods ........................ 300
7.3. The Algorithm with Subgradient Aggregation ....... 307
7.4. Convergence ...................................... 312
7.5. The Algorithm with Subgradient Selection ......... 318
7.6. Modified Line Search Rules and Approximation
     Tolerance Updating Strategies .................... 320
7.7. Extension to Nonconvex Unconstrained Problems .... 325
7.8. Bundle Methods for Convex Constrained Problems ... 330
7.9. Extensions to Nonconvex Constrained Problems ..... 339

Chapter 8. Numerical Examples
8.1. Introduction ..................................... 345
8.2. Numerical Results ................................ 345

References ....................................................... 354

Index ............................................................ 361


CHAPTER 1

Fundamentals

1. Introduction

The nonlinear programming problem, also known as the mathematical
programming problem, can be taken to have the form

P : minimize f(x), subject to Fi(x) ≤ 0 for i = 1,...,m,

where the objective function f and the constraint functions Fi are
real-valued functions defined on the N-dimensional Euclidean space R^N.
The value of m ≥ 0 is finite; when m = 0 the problem is unconstrained.
Often the optimization problem P is smooth: the problem functions f
and Fi are continuously differentiable, i.e. they have continuous gra-
dients ∇f and ∇Fi, i = 1,...,m. But in many applications this is not
true. Nonsmooth problems are the subject of nonsmooth optimization, also
called nondifferentiable optimization.
Owing to actual and potential applications in industry and science,
recently much research has been conducted in the area of nonsmooth opti-
mization both in the East (see the excellent monographs by Gupal (1979),
Nurminski (1979) and Shor (1979)) and in the West (see the comprehensive
bibliographies of Gwinner (1981) and Nurminski (1982)).
Nonsmooth problems that arise in applications have certain common
features. They are more complex and have poorer analytical properties
than standard mathematical programming problems, cf. (Bazaraa and Shetty,
1979; Pshenichny and Danilin, 1975). A single evaluation of the problem
functions usually requires solutions of auxiliary optimization subprob-
lems. In particular, it is very common to encounter a nondifferentiable
function which is the pointwise supremum of a collection of functions that
may themselves be differentiable - a max function.
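As a minimal numerical illustration (our example, not the book's): the pointwise maximum f(x) = max(x, -x) = |x| of two differentiable functions has a gradient jump at x = 0, which one can see from one-sided difference quotients.

```python
# Hedged illustration (assumed example, not from the text): a max
# function with a gradient discontinuity.  f(x) = max(x, -x) = |x|
# is the pointwise maximum of two differentiable functions; its
# derivative jumps from -1 to +1 at the kink x = 0.
def f(x):
    return max(x, -x)

h = 1e-6
# one-sided difference quotients of f at 0
left  = (f(0.0) - f(-h)) / h   # slope from the left,  about -1
right = (f(h) - f(0.0)) / h    # slope from the right, about +1
print(left, right)
```

Since the two one-sided slopes disagree, no gradient of f exists at 0, and classical gradient-based methods have nothing to evaluate there.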
Functions with discontinuous gradients, such as max functions, cannot
be minimized by classical nonlinear programming algorithms. This observa-
tion applies both to gradient-type algorithms (the method of steepest descent,
conjugate direction methods, quasi-Newton methods) and to direct search
methods which do not require calculation of derivatives (the method of
Nelder and Mead, the method of Powell, etc.), see (Lemaréchal, 1978 and
1982; Wolfe, 1975).
This work is concerned with numerical methods for finding (approxi-
mate) solutions to problem P when the problem functions are locally Lip-
schitzian, i.e. Lipschitz continuous on each bounded subset of R^N, but not
necessarily differentiable.
The advent of F. H. Clarke's (1975) analysis of locally Lipschitzian
functions provided a unified approach to both nondifferentiable and non-
convex problems (Clarke, 1976). Clarke's subdifferential analysis, the
pertinent part of which is briefly reviewed in the following section, suf-
fices for establishing properties of a vast class of optimization pro-
blems that arise in applications (Pshenichny, 1980; Rockafellar, 1978).

2. Basic Results of Nondifferentiable Optimization Theory

In this section we describe general properties of nondifferentiable
optimization problems that are the subject of this work. Basic familia-
rity is, however, assumed. Source material may be found in (Clarke, 1975;
Clarke, 1976; Rockafellar, 1970; Rockafellar, 1978; Rockafellar, 1981).
The section is organized as follows. First, we review concepts of
differentiability and elementary properties of the Clarke subdifferen-
tial. The proofs are omitted, because only simple results, such as Lemma
2.2, will be used in subsequent chapters. Other results, in particular
the calculus of subgradients, should help the reader who is mainly in-
terested in applications. Secondly, we study convex first order approxi-
mations to nondifferentiable functions. Such approximations are then used
for deriving necessary conditions of optimality for nondifferentiable
problems. Our approach is elementary and may appear artificial. However,
it yields useful interpretations of the algorithms described in subse-
quent chapters.
The following notation is used. We denote by <·,·> and |·|, res-
pectively, the usual inner product and norm in finite-dimensional, real
Euclidean space. R^N denotes Euclidean space of dimension N < ∞. We use x_i
to denote the i-th component of the vector x. Thus

<x,y> = Σ_{i=1}^N x_i y_i   and   |x| = <x,x>^{1/2}   for x, y ∈ R^N.

Superscripts are used to denote different vectors, e.g. x^1 and x^2. All
vectors are column vectors. However, for convenience a column vector in
R^{N+n} is sometimes denoted by (x,y) even though x and y are column
vectors in R^N and R^n, respectively. [x,y] denotes the line segment
joining x and y in R^N, i.e.
[x,y] = {z ∈ R^N : z = λx + (1−λ)y for some λ satisfying 0 ≤ λ ≤ 1}.

A set S ⊂ R^N is called convex if [x,y] ⊂ S for all x and y be-
longing to S. A linear combination Σ_{j=1}^k λ_j x^j is called a convex
combination of points x^1,...,x^k in R^N if each λ_j ≥ 0 and
Σ_{j=1}^k λ_j = 1. The convex hull of a set S ⊂ R^N, denoted conv S,
is the set of all convex combinations of points in S. conv S is the
smallest convex set containing S, and S is convex if and only if
S = conv S. An important property of convex hulls is described in

Lemma 2.1 (Caratheodory's theorem; see Theorem 17.1 in (Rockafellar,
1970)).

If S ⊂ R^N then x ∈ conv S if and only if x is expressible as a con-
vex combination of N+1 (not necessarily different) points of S.

Any nonzero vector g ∈ R^N and number γ define a hyperplane

H = {x ∈ R^N : <g,x> = γ},

which is a translation of the (N−1)-dimensional subspace
{x ∈ R^N : <g,x> = 0} of R^N. H divides R^N into two closed half-spaces
{x ∈ R^N : <g,x> ≤ γ} and {x ∈ R^N : <g,x> ≥ γ}, respectively. We say
that H is a supporting hyperplane to a set S ⊂ R^N at x̄ ∈ S if
<g,x̄> = γ and <g,x> ≤ γ for all x ∈ S. Any closed convex set S can be
described as an intersection of all the closed half-spaces that contain S.

We use the set notation

S1 + S2 = {z^1 + z^2 : z^1 ∈ S1, z^2 ∈ S2},

conv{Si : i = 1,2} = conv{z : z ∈ S1 ∪ S2}

for any subsets S1 and S2 of R^N.

A function f : R^N → R is called convex if

f(λx^1 + (1−λ)x^2) ≤ λf(x^1) + (1−λ)f(x^2) for all λ ∈ [0,1] and x^1, x^2 ∈ R^N.

This is equivalent to the epigraph of f

epi f = {(x,β) ∈ R^{N+1} : β ≥ f(x)}

being a convex subset of R^{N+1}. A function f : R^N → R is called con-
cave if the function (−f)(x) = −f(x) is convex. If f_i : R^N → R is con-
vex and λ_i ≥ 0 for each i = 1,...,k, then the functions

f̂_1(x) = Σ_{i=1}^k λ_i f_i(x),
                                                            (2.1)
f̂_2(x) = max{f_i(x) : i = 1,...,k}

are convex.
A function f : R^N → R is strictly convex if
f(λx^1 + (1−λ)x^2) < λf(x^1) + (1−λ)f(x^2) for all λ ∈ (0,1) and
x^1 ≠ x^2. For instance, the function |·|^2 is strictly convex.
A function f : R^N → R is said to be locally Lipschitzian if for
each bounded subset B of R^N there exists a Lipschitz constant
L = L(B) < ∞ such that

|f(x^1) − f(x^2)| ≤ L|x^1 − x^2| for all x^1, x^2 ∈ B.   (2.2)

Then in particular f is continuous. Examples of locally Lipschitzian
functions include continuously differentiable functions, convex functions,
concave functions and any linear combination or pointwise maximum of a
finite collection of such functions, cf. (2.1).
Following (Rockafellar, 1978), we shall now describe differentiabi-
lity properties of locally Lipschitzian functions. Henceforth let f de-
note a function satisfying (2.2) and let x be an interior point of B,
i.e. x ∈ int B.
The Clarke generalized directional derivative of f at x in a
direction d

f°(x;d) = lim sup_{y→x, t↓0} [f(y+td) − f(y)]/t   (2.3)

is a finite, convex function of d and f°(x;d) ≤ L|d|. The Dini upper
directional derivative of f at x in a direction d

f_D(x;d) = lim sup_{t↓0} [f(x+td) − f(x)]/t   (2.4)

exists for each d ∈ R^N and satisfies

f(x+td) ≤ f(x) + t f_D(x;d) + o(t),   (2.5)

where o(t)/t → 0 as t↓0. The limit

f'(x;d) = lim_{t↓0} [f(x+td) − f(x)]/t   (2.6)

is called the (one-sided) directional derivative of f at x with re-
spect to d, if it exists. The two-sided derivative (the Gateaux derivative)
corresponds to the case f'(x;−d) = −f'(x;d). Clearly,

f_D(x;d) ≤ f°(x;d),
                                   (2.7)
f'(x;d) ≤ f_D(x;d),

whenever f'(x;d) exists.


If f'(x;d) is linear in d (Gateaux differentiable at x)

f'(x;d) = <g_f, d> for all d ∈ R^N,   (2.8)

then the vector g_f is called the gradient of f at x and denoted by
∇f(x). The components of ∇f(x) = (∂f/∂x_1(x),...,∂f/∂x_N(x)) are the
coordinatewise two-sided partial derivatives of f at x. The function f
is (Frechet) differentiable at x if

f(x+d) = f(x) + <∇f(x), d> + o(|d|) for all d ∈ R^N,   (2.9)

where o(t)/t → 0 as t↓0. The above relation is equivalent to

lim_{d'→d, t↓0} [f(x+td') − f(x)]/t = <∇f(x), d> for all d ∈ R^N.   (2.10)

If

lim_{y→x, t↓0} [f(y+td) − f(y)]/t = <∇f(x), d> for all d in R^N,   (2.11)

then f is called strictly differentiable at x. In this case f is diffe-
rentiable at x and the gradient ∇f : R^N → R^N is continuous at x relative
to its domain

dom ∇f = {y ∈ R^N : f is differentiable at y}.

It is known that a locally Lipschitzian function f : R^N → R is diffe-
rentiable at almost all points x ∈ R^N, and moreover that the gradient
mapping ∇f is locally bounded on its domain. Suppose that (2.2) holds
for some neighborhood B of a point x ∈ R^N. Then

<∇f(y), d> = f'(y;d) = lim_{t↓0} [f(y+td) − f(y)]/t ≤ L|d|

for all y ∈ B ∩ dom ∇f and d ∈ R^N, and this implies

|∇f(y)| ≤ L for all y ∈ B ∩ dom ∇f.   (2.12)

Since dom ∇f is dense in B, there exist sequences {y^j} such that f is
differentiable at y^j and y^j → x. The corresponding sequence of gradients
{∇f(y^j)} is bounded and has accumulation points (each being the limit
of some convergent subsequence). It follows that the set

Mf(x) = {z ∈ R^N : ∇f(y^j) → z for some sequence y^j → x with f diffe-
rentiable at y^j}   (2.13a)

is nonempty, bounded and closed. The set

∂f(x) = conv Mf(x)   (2.13b)

is called the subdifferential of f at x (called the generalized gradient
by Clarke (1975)). Each element g_f ∈ ∂f(x) is called a subgradient of f at
x. Thus

∂f(x) = conv{lim ∇f(y^j) : y^j → x, f differentiable at y^j}.   (2.14)

In particular therefore, ∂f(x) = −∂(−f)(x). Three immediate consequences
of the definition are listed in

Lemma 2.2. (i) ∂f(x) is a nonempty convex compact set.
(ii) The point-to-set mapping ∂f(·) is locally bounded (bounded on bound-
ed subsets of R^N), i.e. if B ⊂ R^N is bounded then the set
{g_f ∈ ∂f(y) : y ∈ B} is bounded.
(iii) ∂f(·) is upper semicontinuous, i.e. if a sequence {y^j} converges
to x and g_f^j ∈ ∂f(y^j) for each j, then each accumulation point g_f of
{g_f^j} satisfies g_f ∈ ∂f(x).

In general, ∂f(x) does not reduce to {∇f(x)} when the gradient ∇f
is discontinuous at x.
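A concrete instance of definition (2.14) (our illustration, not the book's): for f(x) = |x|, the gradients sign(y^j) along sequences y^j → 0 accumulate only at −1 and +1, so ∂f(0) = conv{−1, +1} = [−1, 1].

```python
import math

# Hedged sketch of definition (2.14) for f(x) = |x| at x = 0
# (assumed example, not from the text).  f is differentiable at
# every y != 0 with gradient sign(y).
def grad_f(y):
    return math.copysign(1.0, y)

# gradients along two sequences y^j -> 0, one from each side
seq_plus  = [10.0**(-j) for j in range(1, 8)]
seq_minus = [-t for t in seq_plus]
limits = {grad_f(seq_plus[-1]), grad_f(seq_minus[-1])}   # accumulation points

# The subdifferential is the convex hull of these limits, the
# interval [-1, 1]: every convex combination is a subgradient.
subgradients = [lam * 1.0 + (1 - lam) * (-1.0) for lam in (0.0, 0.25, 0.5, 1.0)]
print(sorted(limits), subgradients)
```

Here the singleton gradient is replaced by a whole interval at the kink, which is exactly the situation the lemma above addresses.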

Lemma 2.3. The following are equivalent:

(i) ∂f(x) consists of a single vector;
(ii) ∇f(x) exists and ∇f is continuous at x relative to dom ∇f;
(iii) f is strictly differentiable at x.
Moreover, when these properties hold one has ∂f(x) = {∇f(x)}.

Frequently ∂f(x) is a singleton for almost every x. A locally Lip-
schitzian function f : R^N → R is subdifferentially regular at x ∈ R^N if for
every d ∈ R^N the ordinary directional derivative (2.6) exists and coinci-
des with the generalized one in (2.3):

f'(x;d) = f°(x;d) for all d.   (2.15)

If (2.15) holds at each x ∈ R^N then ∂f(x) is actually single-valued at
almost every x. Below we give two important examples of subdifferential-
ly regular functions.
Lemma 2.4. If f is convex then f is subdifferentially regular and

f'(x;d) = max{<g_f, d> : g_f ∈ ∂f(x)} for all x, d.   (2.16)

Lemma 2.5. Suppose that

f(x) = max{f_u(x) : u ∈ U} for all x ∈ R^N,   (2.17)

where the index set U is a compact topological space (e.g. a finite set
in the discrete topology), each f_u is locally Lipschitzian, uniformly
for u in U, and the mappings f_u(x) and ∂f_u(x) are upper semicontinuous
in (x,u) (e.g. each f_u is a differentiable function such that f_u(x) and
∇f_u(x) depend continuously on (x,u)). Let

U(x) = {u ∈ U : f_u(x) = f(x)}.   (2.18)

Then f is locally Lipschitzian and

∂f(x) ⊂ conv{∂f_u(x) : u ∈ U(x)}.   (2.19)

If each f_u is subdifferentially regular at x, then so is f, equality
holds in (2.19), and

f'(x;d) = max{<g_u, d> : g_u ∈ ∂f_u(x), u ∈ U(x)} for all d.   (2.20)

Corollary 2.6. Suppose that

f(x) = max{f_i(x) : i ∈ I} for all x in R^N,   (2.21)

where the index set I is finite, and let I(x) = {i ∈ I : f_i(x) = f(x)}.

(i) If each f_i is continuously differentiable then

f'(x;d) = max{<∇f_i(x), d> : i ∈ I(x)} for all d,
                                                            (2.22)
∂f(x) = conv{∇f_i(x) : i ∈ I(x)}.

(ii) If each f_i is convex then

f'(x;d) = max{<g_{f_i}, d> : g_{f_i} ∈ ∂f_i(x), i ∈ I(x)} for all d,
                                                            (2.23)
∂f(x) = conv{g_{f_i} ∈ ∂f_i(x) : i ∈ I(x)}.
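Part (i) can be sketched on a small assumed example (ours, not the book's): f = max{f_1, f_2} with f_1(x) = x_1^2 + x_2^2 and f_2(x) = 2x_1. At x = (1,1) both pieces are active, and the directional derivative is the larger of the two linear rates.

```python
# Hedged numeric sketch of Corollary 2.6(i) on an assumed example
# (not from the text): f(x) = max{f1(x), f2(x)} with smooth pieces
#   f1(x) = x1^2 + x2^2,  grad f1(x) = (2*x1, 2*x2)
#   f2(x) = 2*x1,         grad f2(x) = (2, 0)
x = (1.0, 1.0)
f1, g1 = x[0]**2 + x[1]**2, (2*x[0], 2*x[1])
f2, g2 = 2*x[0], (2.0, 0.0)
f = max(f1, f2)

# active index set I(x) of (2.21): here f1(x) = f2(x) = 2, so both
active = [i for i, fi in ((1, f1), (2, f2)) if fi == f]

def dir_deriv(d):
    # f'(x;d) = max of <grad fi(x), d> over active pieces  (2.22)
    dots = [g[0]*d[0] + g[1]*d[1] for g in (g1, g2)]
    return max(dots)

print(active)                   # both pieces active at (1, 1)
print(dir_deriv((0.0, 1.0)))    # max(<(2,2),d>, <(2,0),d>) = max(2, 0)
print(dir_deriv((0.0, -1.0)))   # max(-2, 0)
```

The subdifferential at this point is the segment conv{(2,2), (2,0)}, so d → f'(x;d) is piecewise linear in d, as (2.22) predicts.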
When f is smooth, there exists an apparatus for computing ∇f in
terms of the derivatives of other functions from which f is composed. The
calculus of subgradients, which generalizes rules like
∇(f1+f2)(x) = ∇f1(x) + ∇f2(x), is based on the following results.

Lemma 2.7. Let g : R^n → R and h_i : R^N → R, i = 1,...,n, be locally Lipschitzian.
Let h(x) = (h_1(x),...,h_n(x)) and (g∘h)(x) = g(h(x)) for all x ∈ R^N. Then g∘h
is locally Lipschitzian and

∂(g∘h)(x) ⊂ conv{Σ_{i=1}^n u_i ∂h_i(x) : (u_1,...,u_n) ∈ ∂g(h(x))}.   (2.24)

Moreover, equality holds in (2.24) if one of the following is satisfied:

(i) g is subdifferentially regular at h(x), each h_i is subdifferentially
regular at x and ∂g(h(x)) ⊂ R^n_+ (R^n_+ = {z ∈ R^n : z_i ≥ 0 for all i});
(ii) g is subdifferentially regular at h(x) and each h_i is continuously
differentiable at x;
(iii) each h_i is continuously differentiable at x and either g (or −g)
is subdifferentially regular at h(x) or the Jacobian matrix of h at x
is surjective;
(iv) n = 1, g is continuously differentiable at h(x), or g (or −g) is
subdifferentially regular at h(x) and h is continuously differentiable
at x. In cases (ii) - (iv) the symbol "conv" is superfluous in (2.24).
If (ii) holds then g∘h is subdifferentially regular at x.

Corollary 2.8. Suppose that f1 and f2 are locally Lipschitzian on R^N.
For each x ∈ R^N let (f1+f2)(x) = f1(x)+f2(x), (f1f2)(x) = f1(x)f2(x) and
(f1/f2)(x) = f1(x)/f2(x) if f2(x) ≠ 0. Then

∂(f1+f2)(x) ⊂ ∂f1(x) + ∂f2(x),   (2.25a)

∂(f1f2)(x) ⊂ f2(x)∂f1(x) + f1(x)∂f2(x),   (2.25b)

∂(f1/f2)(x) ⊂ [f2(x)∂f1(x) − f1(x)∂f2(x)] / (f2(x))^2.   (2.25c)

Equality holds in (2.25a) if each f_i is subdifferentially regular at x,
and in (2.25b) if in addition f_i(x) ≥ 0.

Clarke (1975) established the following crucial relations between
the subdifferential and the generalized directional derivative of a lo-
cally Lipschitzian function f defined on R^N:

f°(x;d) = max{<g_f, d> : g_f ∈ ∂f(x)} for all x, d,   (2.26)

∂f(x) = {g_f ∈ R^N : <g_f, d> ≤ f°(x;d) for all d} for all x.   (2.27)

We shall now interpret these relations in geometric terms. In what fol-
lows let x be a fixed point in R^N.
First, suppose that f is continuously differentiable at x. From
Lemma 2.3, (2.26) and (2.8) we have

∂f(x) = {∇f(x)},   (2.28a)

f°(x;d) = f'(x;d) = <∇f(x), d> for all d.   (2.28b)

Suppose that ∇f(x) ≠ 0. Then ∇f(x) corresponds to the hyperplane

H_∇f = {(z,β) ∈ R^{N+1} : β = f(x) + <∇f(x), z−x>}

being tangent to the graph of f

graph f = {(z,β) ∈ R^{N+1} : β = f(z)}

at the point (x, f(x)). Here β denotes the "vertical" coordinate of
a point (x,β) ∈ R^{N+1}. Moreover, the hyperplane

H_C = {z ∈ R^N : <∇f(x), z−x> = 0}

is tangent at x to the contour of f at x

C = {z ∈ R^N : f(z) = f(x)}.

∇f(x) is perpendicular to C at x and is the direction of steepest
ascent for f at x. Define the following linearization of f at x

f̄(z) = f(x) + <∇f(x), z−x> for all z in R^N   (2.29)

and observe that ∇f̄(z) = ∇f(x) for all z (x is fixed). Therefore this
linearization has the same differentiability properties as f at x in
the sense that

∂f̄(x) = ∂f(x),   (2.30a)

f̄°(x;d) = f̄'(x;d) = f°(x;d) for all d,   (2.30b)

cf. (2.28). In particular, by (2.28a), (2.9) and (2.30b), for any
d ∈ R^N we have

f(x+td) = f(x) + t f̄'(x;d) + o(t),   (2.31)

where o(t)/t → 0 as t↓0. Moreover, the graph of f̄ equals H_∇f, while the
contour of f̄ at x is equal to H_C. We conclude that linearizations
based on ∂f(·) = {∇f(·)} provide convenient differential approximations
to f when f is smooth.
Next suppose that f is convex. Then f is locally Lipschitzian and
∂f is the subdifferential in the sense of convex analysis:

∂f(x) = {g_f ∈ R^N : f(z) ≥ f(x) + <g_f, z−x> for all z}.   (2.32)

The above relation says that each subgradient g_f ∈ ∂f(x) defines a li-
nearization of f at x

f_{g_f}(z) = f(x) + <g_f, z−x> for all z in R^N,   (2.33)

which is a lower approximation to f at x

f(x) = f_{g_f}(x),   (2.34a)

f(z) ≥ f_{g_f}(z) for all z,   (2.34b)

and a hyperplane

H_{g_f} = {(z,β) ∈ R^{N+1} : β = f_{g_f}(z)}   (2.35)

supporting the epigraph of f at (x, f(x)). Observe that

H_{g_f} = graph f_{g_f}.   (2.36)

Also if g_f ≠ 0 then the hyperplane

H_1 = {z ∈ R^N : <g_f, z−x> = 0}

supports at x the level set {z ∈ R^N : f(z) ≤ f(x)}. Such hyperplanes
are nonunique when ∂f(x) is not a singleton. However, one easily checks
that a convex combination of supporting hyperplanes to epi f at
(x, f(x)) is still a supporting hyperplane. This gives the reason for
the symbol "conv" in the definition of ∂f(x), see (2.13).
When f is convex but not continuously differentiable at x, rela-
tions of the form (2.30) and (2.31) do not, in general, hold if one
replaces f̄ with f_{g_f}. Yet such relations may be extended as follows.

Define the following approximation to f at x

f̂(z) = max{f_{g_f}(z) : g_f ∈ ∂f(x)} for all z in R^N,
                                                            (2.37)
f_{g_f}(z) = f(x) + <g_f, z−x> for each g_f ∈ ∂f(x) and all z in R^N.

Observe that the "max" above is attained, because ∂f(x) is a compact set
by Lemma 2.2. By (2.34), f̂ is a lower approximation to f at x

f̂(x) = f(x),   (2.38a)

f(z) ≥ f̂(z) for all z.   (2.38b)

The epigraph of f̂ can be expressed in the form

epi f̂ = (x, f(x)) + K_f,   (2.39)

where

K_f = {(d,β) ∈ R^{N+1} : β ≥ <g_f, d> for all g_f ∈ ∂f(x)}   (2.40)

is a closed convex cone (it contains all nonnegative multiples of its
elements). Moreover, we deduce from (2.32) and (2.37) that the epigraph
of f̂, being an intersection of all the epigraphs of f_{g_f} containing epi f,
is a convex outer approximation to the epigraph of f:

epi f ⊂ epi f̂.   (2.41)

Observe that the convexity of f̂ follows directly from (2.37) even when
f is nonconvex, since

f̂(λz^1 + (1−λ)z^2) = max{λf_{g_f}(z^1) + (1−λ)f_{g_f}(z^2) : g_f ∈ ∂f(x)}

≤ λ max{f_{g_f}(z^1) : g_f ∈ ∂f(x)} + (1−λ) max{f_{g_f}(z^2) : g_f ∈ ∂f(x)}

= λ f̂(z^1) + (1−λ) f̂(z^2)

for all λ ∈ [0,1] and z^1, z^2 ∈ R^N. If convexity fails, relations (2.34b),
(2.38b) and (2.41) are no longer valid. However, f̂ is still a useful
approximation to f, as will be shown below.

Lemma 2.9. Suppose that f : R^N → R is locally Lipschitzian, x ∈ R^N and
f̂ : R^N → R is defined by (2.37). Then f̂ is convex and subdifferentially
regular on R^N, and

∂f̂(x) = ∂f(x),   (2.42a)

f̂°(x;d) = f°(x;d) for all d.   (2.42b)

Moreover, for each d in R^N one has

f̂°(x;d) = f̂'(x;d) = max{<g_f, d> : g_f ∈ ∂f(x)},   (2.43)

f(x+td) ≤ f(x) + t f̂'(x;d) + o(t) for all t ≥ 0,   (2.44)

where o(t)/t → 0 as t↓0.

Proof. The convexity of f̂ was shown above. For each g_f ∈ ∂f(x), f_{g_f} is
continuously differentiable with ∂f_{g_f}(z) = {g_f} for all z. Therefore
the compactness of ∂f(x), (2.37), (2.34a) and Lemma 2.5 imply that f̂
is subdifferentially regular and satisfies (2.43), and
∂f̂(x) = conv{g_f : g_f ∈ ∂f(x)}. The last relation and the convexity of
∂f(x) yield (2.42a). Then (2.42b) follows from (2.43) and (2.26). In
view of (2.5), (2.7), (2.42a) and (2.43), for each d ∈ R^N we have

f(x+td) ≤ f(x) + t f_D(x;d) + o(t) ≤ f(x) + t f°(x;d) + o(t) = f(x) + t f̂'(x;d) + o(t)

for all t > 0, which proves (2.44). []

A basic question in nondifferentiable optimization is how to find
a descent direction d for f at x, i.e. a direction that satisfies

f(x+td) < f(x)  for all small t > 0.                   (2.45)

This problem is tackled in the following lemma.

Lemma 2.10. (i) Suppose that f: R^N → R is locally Lipschitzian, x ∈ R^N
and d ∈ R^N satisfies

max{<gf,d>: gf ∈ ∂f(x)} < 0.                           (2.46)

Then d is a descent direction for f at x.

(ii) Suppose that d is a descent direction for f^ at x, i.e.

f^(x+td) < f^(x)  for all small t > 0,                 (2.47)

where f^ defined by (2.37) is an approximation to a locally Lipschitzian
function f: R^N → R at x. Then d is also a descent direction for f at x.
Moreover, d satisfies (2.46).

Proof. (i) From (2.46), (2.43) and (2.44), we have f^'(x;d) < 0 and

f(x+td) <= f(x) + t[f^'(x;d) + o(t)/t] < f(x)  for all small t > 0,

because o(t)/t → 0 as t ↓ 0.

(ii) Using (2.47), we choose T > 0 satisfying f^(x+Td) < f^(x). By Lemma
2.9, f^ is convex, hence

f^(x+td) = f^((1 - t/T)x + (t/T)(x+Td)) <= (1 - t/T)f^(x) + (t/T)f^(x+Td)

and hence

[f^(x+td) - f^(x)]/t <= [f^(x+Td) - f^(x)]/T

for all t ∈ (0,T]. Therefore f^'(x;d) <= [f^(x+Td) - f^(x)]/T < 0, and
(2.46) follows from (2.43). []

The above lemma will be used below in two schemes for finding
descent directions. Relation (2.46) means that the set ∂f(x) can be
separated from the origin by a hyperplane. Since ∂f(x) is a convex
compact set, this is possible if and only if 0 ∉ ∂f(x). Therefore we
shall first state two auxiliary results.

Lemma 2.11. Suppose that f: R^N → R is convex. Then a point x̄ ∈ R^N
minimizes f, i.e. f(x̄) <= f(y) for all y, if and only if 0 ∈ ∂f(x̄).

Proof. This follows immediately from (2.32). []

Lemma 2.12. Suppose that G ⊂ R^N is a convex compact set and let

Nr G = arg min{|g|: g ∈ G}

denote the point in G that is nearest to the origin (the projection of
the origin onto G). Then p ∈ G is Nr G if and only if <g,p> >= |p|^2 for
all g ∈ G.

Proof. We note that Nr G is well-defined, because the convex function
|·| attains its unique minimum on the convex compact set G. Let g ∈ G and
0 <= t <= 1; then p + t(g-p) ∈ G and

|p+t(g-p)|^2 = |p|^2 + 2t[<g,p> - |p|^2] + t^2|g-p|^2,

which is less than |p|^2 for small t > 0 unless <g,p> >= |p|^2. []

The following lemma shows how one may find descent directions for
nonsmooth functions.

Lemma 2.13. Consider a locally Lipschitzian function f: R^N → R, a point
x ∈ R^N and an approximation f^ to f at x defined by (2.37). Let

p̄ = Nr ∂f(x) = arg min{(1/2)|gf|^2: gf ∈ ∂f(x)}        (2.48)

and let d̄ denote a solution to the problem

minimize f^(x+d) + (1/2)|d|^2 over all d ∈ R^N.        (2.49)

Then
(i) d̄ exists, is uniquely determined and satisfies

-d̄ = p̄ ∈ ∂f(x),                                       (2.50)

max{<gf,d̄>: gf ∈ ∂f(x)} = -|p̄|^2,                     (2.51)

f^(x+d̄) = f^(x) - |p̄|^2,                              (2.52a)

f^(x+td̄) <= f^(x) - t|p̄|^2  for all t ∈ [0,1];        (2.52b)

(ii) d̄ ≠ 0 if and only if 0 ∉ ∂f(x);

(iii) 0 ∈ ∂f(x) if and only if x is a global minimum point for f^.

Proof. (i) The objective function of (2.49) can be written as

p(d) = f^(x) + v(d) + (1/2)|d|^2,

v(d) = max{<gf,d>: gf ∈ ∂f(x)}.

Let gf ∈ ∂f(x) be fixed. Then, by the Cauchy-Schwarz inequality,
p(d) >= f^(x) + <gf,d> + (1/2)|d|^2 >= f^(x) - |gf||d| + (1/2)|d|^2 → +∞
as |d| → ∞, hence d̄ exists. Since p is strictly convex, d̄ is unique.
In view of Lemma 2.11, Corollary 2.8 and Lemma 2.5, we have
0 ∈ ∂p(d̄) = d̄ + ∂v(d̄) and

∂v(d̄) = {gf ∈ ∂f(x): <gf,d̄> = v(d̄)},

hence there exists p ∈ ∂f(x) satisfying p = -d̄ and v(d̄) = <p,d̄> = -|p|^2.
Thus p ∈ ∂f(x) and <gf,p> >= |p|^2 for all gf ∈ ∂f(x), therefore p = p̄ by
Lemma 2.12. Combining the preceding relations, we establish (2.50),
(2.51) and (2.52a). Then (2.52b) follows from (2.52a) and the convexity
of f^.
(ii) This follows from (2.50) and (2.48).
(iii) By Lemma 2.9 and Lemma 2.11, 0 ∈ ∂f(x) = ∂f^(x) is equivalent to x
minimizing the convex function f^. []

We conclude from Lemma 2.13 and Lemma 2.10 that if a point x ∈ R^N
satisfies 0 ∉ ∂f(x), then one may find a descent direction for f at x,
cf. (2.48), (2.52) and (2.47). In particular, therefore, f cannot have
local minima at such points. Thus we have derived the following neces-
sary condition of optimality.

Lemma 2.14. If x̄ is a local minimum point for a locally Lipschitzian
function f, then 0 ∈ ∂f(x̄).

A point x ∈ R^N satisfying 0 ∈ ∂f(x) will be called stationary for
f. Thus stationarity is necessary for optimality.
We may add that if f is strictly differentiable at x and
∇f(x) ≠ 0, then the direction d̄ = -∇f(x) defined in Lemma 2.13 is the
direction of steepest descent for f at x. In general, we have

f°(x;d̄/|d̄|) = min{f°(x;d): |d| <= 1},
see (Wolfe, 1975).
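To make the preceding constructions concrete, here is a small numerical sketch (our own illustration, not part of the text): it computes Nr G of Lemma 2.12 for the convex hull of finitely many subgradients by a Frank-Wolfe (Gilbert-type) iteration, whose stopping test is exactly the characterization <g,p> >= |p|^2 of Lemma 2.12, and then forms the steepest descent direction d̄ = -p̄ of Lemma 2.13. The function f(x1,x2) = |x1| + x2^2 and all identifiers are assumptions made for this example.

```python
def nearest_point(G, iters=1000, tol=1e-10):
    """Nr G = arg min{|g| : g in conv G} for a finite list G of points,
    computed by the Frank-Wolfe (Gilbert) iteration."""
    dot = lambda a, b: sum(u * w for u, w in zip(a, b))
    p = list(G[0])
    for _ in range(iters):
        g = min(G, key=lambda v: dot(v, p))      # vertex most opposed to p
        d = [pi - gi for pi, gi in zip(p, g)]    # p - g
        gap = dot(p, d)                          # 0 iff <g,p> >= |p|^2 for all g
        if gap <= tol:
            break                                # optimal by Lemma 2.12
        t = min(1.0, gap / dot(d, d))            # exact line search on |p+t(g-p)|^2
        p = [pi + t * (gi - pi) for pi, gi in zip(p, g)]
    return p

# ∂f(0,1) for f(x1,x2) = |x1| + x2**2 is conv{(-1,2), (1,2)}
p = nearest_point([(-1.0, 2.0), (1.0, 2.0)])     # -> [0.0, 2.0]
d = [-pi for pi in p]                            # steepest descent direction (0,-2)
```

The stopping test gap <= tol is the optimality condition of Lemma 2.12 rewritten as |p|^2 - min{<g,p>: g ∈ G} = 0.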
Consider the following constrained problem

minimize f(x), subject to F_i(x) <= 0 for i=1,...,m,   (2.53)

where the objective function f and the constraint functions F_i are
real-valued functions defined on R^N, and m >= 1 is finite. Define the
total constraint function

F(x) = max{F_i(x): i=1,...,m}                          (2.54)

and the feasible set for (2.53)

S = {x ∈ R^N: F(x) <= 0}.

A point x̄ ∈ S is a local solution to problem (2.53) if for all x in
a neighborhood B of x̄ one has f(x̄) <= f(x) when x ∈ S.
We note that if x̄ is a local solution of (2.53), then the function

H(x;x̄) = max{f(x)-f(x̄), F(x)}  for all x               (2.55)

has a local (unconstrained) minimum at x̄ with

H(x̄;x̄) = 0,                                            (2.56)

for if H(x;x̄) were smaller than 0 for some x ∈ B then x would be fea-
sible for (2.53) and strictly better than x̄, which cannot be.
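A tiny numeric illustration of the improvement function (2.55) may help; the problem below (minimize x subject to x^2 - 1 <= 0, solved by x̄ = -1) is an example of our own choosing, not taken from the text.

```python
# improvement function (2.55) for: minimize f(x) = x
# subject to F(x) = x**2 - 1 <= 0; the solution is xbar = -1
xbar = -1.0
H = lambda x: max(x - xbar, x * x - 1.0)     # max{f(x)-f(xbar), F(x)}

# sampling H on a grid around xbar shows an unconstrained local
# minimum of value 0 at xbar, as (2.56) asserts
vals = [H(xbar + 0.001 * i) for i in range(-50, 51)]
```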
In what follows we assume that the problem functions of (2.53)
are locally Lipschitzian.
We shall now derive necessary optimality conditions for problem
(2.53). Define the point-to-set mappings

∂̄F(x) = conv{∂F_i(x): F_i(x) = F(x)},                  (2.57)

       { ∂f(x)                 if F(x) < 0,
M(x) = { conv{∂f(x) ∪ ∂F(x)}   if F(x) = 0,            (2.58)
       { ∂F(x)                 if F(x) > 0,

       { ∂f(x)                 if F(x) < 0,
M̄(x) = { conv{∂f(x) ∪ ∂̄F(x)}   if F(x) = 0,            (2.59)
       { ∂̄F(x)                 if F(x) > 0.

By Lemma 2.5, F(·) and H(·;x) are locally Lipschitzian and the above
mappings satisfy

∂F(x) ⊂ ∂̄F(x),                                         (2.60)

∂H(x;x) ⊂ M(x),                                        (2.61)

M(x) ⊂ M̄(x),                                           (2.62)

where ∂H(·;x) denotes the subdifferential of H(·;x) for fixed x.
We have the following necessary condition of optimality.

Lemma 2.15. If x̄ solves (2.53) locally, then

0 ∈ ∂H(x̄;x̄),                                           (2.63)

0 ∈ M̄(x̄).                                              (2.64)

In particular, there exist numbers u_i, i=0,...,m, satisfying

0 ∈ u_0 ∂f(x̄) + Σ_{i=1}^m u_i ∂F_i(x̄),

u_i >= 0, i=0,...,m,   Σ_{i=0}^m u_i = 1,              (2.65)

u_i F_i(x̄) = 0, i=1,...,m.

Proof. Since x̄ must minimize H(·;x̄) locally, from Lemma 2.14 we obtain
(2.63), which in turn implies (2.64) by (2.61) and (2.62). To see that
(2.65) follows from (2.64), (2.59) and (2.57), note that F(x̄) <= 0 and
that one may set u_i = 0 if F_i(x̄) < 0. []

A point x̄ ∈ S is called stationary for f on S if it satisfies the
necessary optimality condition (2.65), or equivalently (2.64).
The multipliers u_i in (2.65) may be nonunique. In particular, one
may have u_0 = 0. If u_0 = 0 then relation (2.65) reduces to

F(x̄) = 0 and 0 ∈ ∂̄F(x̄),

which describes only the geometry of the feasible set at x̄, without
providing any information about the objective function. This degenerate
case is eliminated if the Cottle constraint qualification holds at x̄:

either F(x̄) < 0 or 0 ∉ ∂̄F(x̄).                          (2.66)

If the constraint functions are convex and x̄ ∈ S, then (2.66) is equiva-
lent to the Slater constraint qualification

F(x) < 0  for some x ∈ R^N.                            (2.67)

This follows from the fact that in the convex case we have ∂̄F = ∂F, see
Corollary 2.8, and the condition 0 ∈ ∂F(x̄) is equivalent to F(x̄) <= F(y)
for all y, see Lemma 2.11.
Relation (2.65) is known as the F. John necessary condition of
optimality. It becomes the Kuhn-Tucker condition

0 ∈ ∂f(x̄) + Σ_{i=1}^m ū_i ∂F_i(x̄),                     (2.68a)

ū_i >= 0,  ū_i F_i(x̄) = 0,  i=1,...,m,                 (2.68b)

when u_0 ≠ 0, since one may take ū_i = u_i/u_0. When problem (2.53) is
convex, i.e. f and each F_i are convex functions, then the Kuhn-Tucker
condition and the Slater constraint qualification yield the following
sufficient condition for optimality.

Lemma 2.16. Suppose that problem (2.53) is convex and satisfies the
Slater constraint qualification (2.67). Then the following are equiva-
lent:
(i) x̄ solves problem (2.53);
(ii) x̄ satisfies

min{H(x;x̄): x ∈ R^N} = H(x̄;x̄) = 0;                     (2.69)

(iii) x̄ satisfies

0 ∈ ∂H(x̄;x̄);                                           (2.70)

(iv) x̄ is stationary for f on S;

(v) the Kuhn-Tucker condition (2.68) holds at x̄ ∈ S.

Proof. (a) As noted above, (i) implies (ii). Suppose that (2.69) holds,
but f(x) < f(x̄) for some x satisfying F(x) <= 0. By (2.67) and convexity,
we may choose such x with F(x) < 0. Then

f(x̄+t(x-x̄)) <= (1-t)f(x̄) + tf(x) < f(x̄),

F(x̄+t(x-x̄)) <= (1-t)F(x̄) + tF(x) <= tF(x) < 0,

H(x̄+t(x-x̄);x̄) < 0 = H(x̄;x̄)

for sufficiently small t > 0, which contradicts (2.69). Therefore (ii)
implies (i).
(b) By convexity and Corollary 2.8, we have

∂̄F(x) = ∂F(x)  for all x,
                                                       (2.71)
∂H(x;x̄) = M(x) = M̄(x)  for all x.

By Lemma 2.11, x̄ minimizes H(·;x̄) if and only if 0 ∈ ∂H(x̄;x̄). But
F(x̄) = H(x̄;x̄) > 0 and 0 ∈ ∂H(x̄;x̄) = ∂F(x̄) is impossible in view of
(2.67). We conclude that (ii) and (iii) are equivalent.
(c) The equivalence of (iii) and (iv) follows from (2.71) and (b).
(d) As noted above, owing to (2.67), (iv) implies (v). As for the reverse
implication, use (2.68) and let u_i = ū_i/(1 + Σ_{j=1}^m ū_j) in (2.65),
with ū_0 = 1. []

For unconstrained problems the necessary optimality condition
0 ∈ ∂f(x) is equivalent to x being a minimum point for the convex first
order approximation f^ to f at x, see Lemma 2.13(iii). We shall now
provide a similar interpretation of the stationarity condition (2.65)
for the constrained problem (2.53) in terms of properties of the follow-
ing convex approximation to problem (2.53), defined at each x ∈ S:

P(x): minimize f^(z), subject to F^_i(z) <= 0, i=1,...,m,   (2.72)

where for any fixed x ∈ S and all z ∈ R^N

f^(z) = max{f_g(z): g ∈ ∂f(x)},

f_g(z) = f(x) + <g,z-x>  for each g ∈ ∂f(x),
                                                       (2.73)
F^_i(z) = max{F_i,g(z): g ∈ ∂F_i(x)}, i=1,...,m,

F_i,g(z) = F_i(x) + <g,z-x>  for each g ∈ ∂F_i(x), i=1,...,m,

are convex first order approximations to the problem functions at x.
Also let

F^(z) = max{F^_i(z): i=1,...,m}  for all z,
                                                       (2.74)
H^(z) = max{f^(z)-f^(x), F^(z)}  for all z,

denote convex first order approximations to F(·) and H(·;x), respective-
ly, at x. Since F(x) <= 0 by assumption, and F^_i(x) = F_i(x) for all i,
we have F^(x) <= 0 and H^(x) = 0, hence x is feasible for P(x). Also, as
shown above, functions of the form (2.37) and (2.73) are convex. Thus
P(x) is a convex problem.
Differential properties of functions of the form (2.73) were stud-
ied above. In particular, in view of Lemma 2.9, we have

∂f^(x) = ∂f(x),  ∂F^_i(x) = ∂F_i(x), i=1,...,m,        (2.75)

hence (2.73), Corollary 2.8, (2.57) and (2.59) yield

∂F^(x) = ∂̄F(x),                                        (2.76)

∂H^(x) = M̄(x).                                         (2.77)

To study the relations between the original problem and its convex
approximations we shall need the following concepts. We say that d ∈ R^N
is a feasible direction for S at x if

x+td ∈ S  for all small t > 0.                         (2.78)

Note that S is closed, because F is continuous, hence (2.78) implies
x ∈ S. Let

I(x) = {i: F_i(x) = F(x) >= 0}                         (2.79)

and observe that I(x) is empty when F(x) < 0. Relation (2.78) is equiva-
lent to the following

F_i(x+td) <= 0  for all small t > 0 and i ∈ I(x),      (2.80)

since, by the continuity of F_i, we have F_i(x+td) < 0 for small t if


F_i(x) < 0. One easily checks that if d is a descent direction for
H(·;x) at x ∈ S, i.e.

H(x+td;x) < H(x;x) = 0  for all small t > 0,

then d is also a feasible direction of descent for f at x relative
to S, i.e. (2.45) and (2.80) are satisfied.
Using the above observations and arguing essentially as in the
proof of Lemma 2.10, one may obtain the following sufficient condition
for d to be a feasible descent direction for f at x relative to S.

Lemma 2.17. Under the above assumptions and conventions, if d is a
descent direction for H^ at x ∈ S, i.e.

H^(x+td) < H^(x) = 0  for all small t > 0,

then d is a feasible direction of descent for f at x relative to
S. []
The following result demonstrates how one may find feasible des-
cent directions.

Lemma 2.18. Consider a locally Lipschitzian problem (2.53) and its con-
vex approximation P(x) at x ∈ S defined via (2.72) and (2.73). Let

p̄ = Nr M̄(x),                                           (2.81)

and let d̄ denote a solution to the problem

minimize H^(x+d) + (1/2)|d|^2 over all d in R^N.       (2.82)

Then
(i) d̄ exists, is unique and satisfies

-d̄ = p̄ ∈ M̄(x),                                        (2.83)

H^(x+d̄) = H^(x) - |p̄|^2,                              (2.84a)

H^(x+td̄) <= H^(x) - t|p̄|^2  for all t ∈ [0,1];        (2.84b)

(ii) d̄ ≠ 0 if and only if 0 ∉ M̄(x);

(iii) 0 ∈ M̄(x) if and only if x is a minimum point for H^.

(iv) If additionally the Cottle constraint qualification is satisfied at
x, i.e.

either F(x) < 0 or 0 ∉ ∂̄F(x),                          (2.85)

then x is stationary for f on S if and only if x solves problem
P(x).

Proof. In view of the preceding results, the objective function of
(2.82) can be written as

φ(d) = max{<g,d>: g ∈ M̄(x)} + (1/2)|d|^2  for all d,

hence (i)-(iii) may be proved similarly to Lemma 2.13. Therefore, we
shall only prove (iv). Since F^(x) = F(x) and ∂F^(x) = ∂̄F(x), see (2.76),
we observe that (2.85) is also the Cottle constraint qualification for
F^ at x, hence the Slater constraint qualification holds for the con-
vex problem P(x). Therefore we deduce from Lemma 2.16 and (2.77) that
x ∈ S solves problem P(x) if and only if 0 ∈ ∂H^(x) = M̄(x), which proves
(iv). []
We conclude from Lemma 2.17 and Lemma 2.18 that if a point x ∈ S
is nonstationary for f on S, then one may use the convex first order ap-
proximations (2.73) and (2.74) for finding a feasible direction of des-
cent for f at x. Moreover, if the Cottle constraint qualification is
satisfied at all feasible points, then the stationary points of the
original problem are precisely the solutions of its convex first order
approximations.
We end this section by recalling the notion of ε-subdifferential.
If f: R^N → R is convex, x ∈ R^N and ε >= 0, then the ε-subdifferential
of f at x is the convex set

∂_ε f(x) = {gf ∈ R^N: f(z) >= f(x) + <gf,z-x> - ε for all z}.   (2.86)

Each element of ∂_ε f(x) is called an ε-subgradient of f at x. Clear-
ly, ∂_0 f(x) = ∂f(x), see (2.32).

3. A Review of Existing Algorithms and Original Contributions of This
Work

In this section we briefly review general properties of several
existing algorithms for nonsmooth minimization. A fuller discussion of
those algorithms is postponed to subsequent chapters. Our intention here
is to motivate the need for the class of methods which is introduced
in this work.
Throughout this work we assume that the functions of the problem

P: minimize f(x), subject to F_i(x) <= 0 for i=1,...,m

are locally Lipschitzian on R^N. Also, we place strict limitations on
our ability to get information about the problem functions. Let us
define the (total) constraint function

F(x) = { max{F_i(x): i=1,...,m}  if m >= 1,
       { 0                       if m = 0,

and the feasible set of problem P

S = {x ∈ R^N: F(x) <= 0}.

We assume that we have a subroutine that can evaluate f(x) and a cer-
tain subgradient gf(x) ∈ ∂f(x) of f at each x ∈ S, and F_i(x) and
one subgradient gF_i(x) ∈ ∂F_i(x) for each x ∈ S and i=1,...,m. We do
not impose any further assumptions on the calculated subgradients. Such
a limitation is realistic for problems of interest to us, where the de-
termination of all elements of a subdifferential is either very expen-
sive or just impossible, see (Wolfe, 1975). On the other hand, sometimes
the objective function cannot be evaluated at infeasible points. Also
at feasible points we require no knowledge of F_i, other than F_i being
nonpositive. Such assumptions are common in the literature on nonsmooth
optimization. However, for convenience we sometimes assume temporarily
that gf and gF_i are defined at each x ∈ R^N.
Before discussing algorithms for nonsmooth minimization, let us
recall basic ideas behind classical algorithms for solving smooth ver-
sions of problem P. Given a starting point x^1 ∈ R^N, an iterative method
constructs a sequence of points x^2, x^3,... in R^N that is intended to
converge to the required solution. An algorithm is a feasible point meth-
od if it generates a sequence {x^k} ⊂ S. If additionally f(x^{k+1}) < f(x^k)
for all k, then an algorithm is a descent method. A descent algorithm
usually proceeds by searching from x^k ∈ S along a direction d^k for a
scalar stepsize t_k > 0 that gives a reduction in the objective function

value f(x^k+t_k d^k) < f(x^k) and the next feasible point
x^{k+1} = x^k + t_k d^k ∈ S. Such a stepsize can be found if d^k is a
descent direction

f(x^k+td^k) < f(x^k)  for small t > 0

which is also feasible

F_i(x^k+td^k) <= 0  for small t > 0 and i=1,...,m.

A feasible descent direction d^k is usually found by solving an auxil-
iary optimization subproblem which approximates problem P in a neigh-
borhood of x^k. The idea is that if d^k is a feasible descent direction
for the subproblem then d^k should also be a feasible descent direction
for problem P at x^k. To construct a suitable direction finding subprob-
lem many algorithms use differentiation for linearizing the problem func-
tions. When the problem functions are smooth, they can be approximated
for x in some neighborhood of x^k by the following linearizations

f̄(x;x^k) = f(x^k) + <∇f(x^k),x-x^k>,
                                                       (3.1)
F̄_i(x;x^k) = F_i(x^k) + <∇F_i(x^k),x-x^k>  for i=1,...,m.

Replacing the problem functions by their linearizations, we obtain the
search direction finding subproblem

minimize f̄(x^k+d;x^k) subject to F̄_i(x^k+d;x^k) <= 0 for i=1,...,m,
                                                       (3.2)
whose various modifications are used in many well-known algorithms, see
(Bazaraa and Shetty, 1979; Pshenichny and Danilin, 1975). For instance,
in the Pironneau and Polak (1973) method of feasible directions d^k is
found from the solution (d^k,v^k) to the problem

minimize (1/2)|d|^2 + v over all (d,v) ∈ R^N × R^1 satisfying
                                                       (3.3)
f̄(x^k+d;x^k) - f(x^k) <= v,  F̄_i(x^k+d;x^k) <= v  for i=1,...,m.

In particular, in such algorithms one usually has

f̄(x^k+d^k;x^k) < f(x^k),
                                                       (3.4)
F̄_i(x^k+d^k;x^k) < 0  for i=1,...,m

if x^k is nonstationary for problem P. Then

f(x^k) + <∇f(x^k),d^k> < f(x^k),

F_i(x^k) + <∇F_i(x^k),d^k> < 0  for i=1,...,m,

and it follows from the differentiability of f and F_i that

f(x^k+td^k) = f(x^k) + t<∇f(x^k),d^k> + o(t) < f(x^k),

F_i(x^k+td^k) = (1-t)F_i(x^k) + t[F_i(x^k) + <∇F_i(x^k),d^k>] + o(t) < 0

for small t > 0 and i=1,...,m, because each F_i(x^k) is nonpositive.
We conclude that for smooth problems the linearizations (3.1) provide a
sufficient condition (3.4) for d^k to be a feasible descent direction.
When the problem functions are nonsmooth, the linearizations

f̄(x;x^k) = f(x^k) + <gf(x^k),x-x^k>,
                                                       (3.5)
F̄_i(x;x^k) = F_i(x^k) + <gF_i(x^k),x-x^k>  for i=1,...,m

may not suffice for assessing the behavior of the problem functions aro-
und x^k. For instance, consider an unconstrained problem and an analogue
of the steepest descent direction

d^k = -gf(x^k).

If gf(x^k) ≠ 0, e.g. x^k is nonstationary for f, then f̄(x^k+d^k;x^k) =
f(x^k) - |gf(x^k)|^2 < f(x^k), so that (3.4) holds. But d^k need not be a
descent direction for f at x^k, because we no longer have f(x^k+td^k) =
f(x^k) + t<gf(x^k),d^k> + o(t) in the nondifferentiable case. This is
shown by the example

f(x) = |x|  for x ∈ R^1,  x^k = 0,  gf(x^k) = 1 ∈ ∂f(x^k) = [-1,1],   (3.6)

in which no descent direction exists.
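Example (3.6) can be checked in a few lines (an illustration of ours):

```python
f = abs                  # f(x) = |x| on R, with x^k = 0
g = 1.0                  # a legitimate subgradient: 1 lies in ∂f(0) = [-1, 1]
d = -g                   # the analogue of steepest descent, d^k = -g_f(x^k)

# the linearization predicts f(0) + g*d = -1 < f(0) = 0, yet
# f(0 + t*d) = t > f(0) for every t > 0: d is not a descent direction
vals = [f(0.0 + t * d) for t in (1.0, 0.1, 0.01)]
```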


In general, for nonsmooth problems the information given by f(x^k),
gf(x^k), F_i(x^k) and gF_i(x^k) for i=1,...,m may not suffice for con-
structing a feasible descent direction at x^k. For this reason several
descent algorithms that require the knowledge of full subdifferentials
at x^k have been proposed; for an excellent survey see (Dixon and Ga-
viano, 1980). However, most existing descent algorithms for nonsmooth
minimization must be regarded as theoretical or conceptual methods, since
the optimization subproblems involved in computing a descent direc-
tion are constrained nondifferentiable problems, which generally cannot
be solved exactly. Therefore, such methods will not be discussed here.
In view of the difficulties mentioned above, recently much research
has been devoted to methods that do not maintain feasibility and
descent at each iteration. Two examples of such methods are given below.
The subgradient algorithms developed mainly in the Soviet Union

(Gupal, 1979; Nurminski, 1979; Shor, 1979) are nondescent methods. An
example of a typical iteration of the subgradient method for unconstrain-
ed minimization is given by

x^{k+1} = x^k - t_k gf(x^k)                            (3.7)

with scalar stepsizes {t_k} satisfying

t_k → 0,   Σ_{k=1}^∞ t_k = +∞,   t_{k+1}/t_k → 1.

Owing to their simplicity, the subgradient algorithms have no reliable
stopping criteria for terminating the iteration when the current iterate
is sufficiently close to the required solution. For instance, when f
is smooth and {x^k} converges to a minimum point x̄ of f, then
gf(x^k) → gf(x̄) = 0. However, this need not occur when f is nondifferen-
tiable. Therefore stopping criteria of the form |gf(x^k)| <= 10^{-6}, which
are customary in mathematical programming, are useless in nonsmooth op-
timization. More advanced implementations of the subgradient algorithms
require interactive tuning of certain tolerances, which regulate stepsi-
zes, during the calculations (Lemarechal, 1982; Nurminski, 1979). The
tuning requires much experimentation, but, when properly tuned to a given
problem, the subgradient algorithms can be very efficient, see (Lemare-
chal, 1982; Shor, 1979).
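A minimal sketch of iteration (3.7) with the stepsize rule above; the test function f(x1,x2) = |x1| + 2|x2|, the choice t_k = 1/k and the subgradient selection sign(0) := 1 are assumptions of ours, not prescriptions of the text.

```python
f = lambda x: abs(x[0]) + 2.0 * abs(x[1])    # minimized at (0, 0)

def subgrad(x):
    # one element of the subdifferential of f; sign(0) := 1 is a valid choice
    s = lambda t: 1.0 if t >= 0 else -1.0
    return (s(x[0]), 2.0 * s(x[1]))

x = (3.0, 1.0)
for k in range(1, 2001):
    t = 1.0 / k                              # t_k -> 0, sum t_k = +inf, t_{k+1}/t_k -> 1
    g = subgrad(x)
    x = (x[0] - t * g[0], x[1] - t * g[1])   # x^{k+1} = x^k - t_k g_f(x^k)
```

The iterates end up oscillating around the minimizer with amplitude on the order of t_k; f(x^k) is not monotone, and the subgradient norm stays near sqrt(5) throughout, which illustrates why gradient-norm stopping tests fail here.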
To simplify notation below, we observe that an equivalent formula-
tion of problem P is given by

P: minimize f(x) subject to F(x) <= 0.

The formulation with F emphasizes the possibility of not keeping the
functions F_i completely in view at all times, which is essential in
classical algorithms for smooth versions of problem P. Also, as noted in
the preceding section, the differential properties of problem P can be
studied with the help of the mapping

∂̄F(x) = conv{g ∈ R^N: g ∈ ∂F_i(x) for some i satisfying F_i(x) = F(x)},

which satisfies ∂F(x) ⊂ ∂̄F(x); ∂̄F(x) = ∂F(x) when each F_i is convex.
Therefore we assume below that we have a function gF satisfying
gF(x) ∈ ∂̄F(x) for each x ∈ R^N \ S. For instance, gF(x) = gF_i(x) if
i is the smallest index in {1,...,m} satisfying F_i(x) = F(x). For con-
venience, we also temporarily assume that gF(x) ∈ ∂̄F(x) at each x ∈ R^N.
Thus gF(x) ∈ ∂F(x) when problem P is convex.
The Kelley (1960) cutting plane method is a nondescent method for
solving convex problems, i.e. problems with convex functions f and F.

The method is based on the following crucial observation. For any fixed
y ∈ R^N, define the linearizations

f̄(x;y) = f(y) + <gf(y),x-y>,
                                                       (3.8)
F̄(x;y) = F(y) + <gF(y),x-y>.

Then

f(y) = f̄(y;y),
                                                       (3.9)
F(y) = F̄(y;y)

and, since gf(y) ∈ ∂f(y) and gF(y) ∈ ∂F(y), for each x ∈ R^N we have

f(x) >= f̄(x;y),
                                                       (3.10)
F(x) >= F̄(x;y).

For points y^1,...,y^k in R^N define the following piecewise linear
(polyhedral) functions

f^k(x) = max{f̄(x;y^j): j=1,...,k},
                                                       (3.11)
F^k(x) = max{F̄(x;y^j): j=1,...,k}.

It follows from (3.9) and (3.10) that

f(x) >= f^k(x),
                                                       (3.12)
F(x) >= F^k(x),

and

f(y^j) = f^k(y^j),
                                                       (3.13)
F(y^j) = F^k(y^j)

for all x in R^N and j=1,...,k. Thus the polyhedral functions (3.11)
are lower approximations to the problem functions. If the points y^j are
close to any given point, say x^k, then such polyhedral functions can
approximate the problem functions around x^k much more accurately than
the linearizations (3.5) alone. This property will be frequently used in
what follows.
At the k-th iteration of the cutting plane method, one sets y^j = x^j
for j=1,...,k and uses (3.11) to calculate a direction d^k as a solu-
tion to the problem

minimize f^k(x^k+d) subject to F^k(x^k+d) <= 0.        (3.14)

By (3.8) and (3.11), this is equivalent to the following linear program-
ming problem with respect to variables (d,u) ∈ R^N × R^1:

minimize u,

subject to f_j^k + <g_f^j,d> <= u,  j=1,...,k,         (3.15)

           F_j^k + <g_F^j,d> <= 0,  j=1,...,k,

where

f_j^k = f̄(x^k;y^j) and F_j^k = F̄(x^k;y^j),
                                                       (3.16)
g_f^j = gf(y^j) and g_F^j = gF(y^j)

for all j=1,...,k. Setting x^{k+1} = y^{k+1} = x^k + d^k completes the
k-th iteration.
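The iteration just described can be sketched in one dimension, where the model minimizer can be found by inspecting cut intersections instead of calling an LP solver. The sketch restricts the search to an interval [a,b] so that subproblem (3.15) always has a solution (a convention added here; the unrestricted subproblem may fail to have one, as discussed below); the test function and all identifiers are our own assumptions.

```python
def cutting_plane(f, g, a, b, iters=40):
    """Kelley's method for an unconstrained convex f on [a, b] in 1-D.
    Each cut is the linearization (3.8): slope g(y), intercept f(y)-g(y)*y."""
    cuts = []                                    # (slope, intercept) pairs
    model = lambda x: max(s * x + c for s, c in cuts)   # f^k of (3.11)
    y, best = b, float("inf")
    for _ in range(iters):
        fy = f(y)
        best = min(best, fy)
        cuts.append((g(y), fy - g(y) * y))
        # minimize the polyhedral model over [a, b]: the minimizer is an
        # endpoint or an intersection of two cuts
        cand = [a, b]
        for i in range(len(cuts)):
            for j in range(i + 1, len(cuts)):
                (s1, c1), (s2, c2) = cuts[i], cuts[j]
                if s1 != s2:
                    x = (c2 - c1) / (s1 - s2)
                    if a <= x <= b:
                        cand.append(x)
        y = min(cand, key=model)                 # next point y^{k+1} = x^{k+1}
    return best

sgn = lambda x: 1.0 if x >= 0 else -1.0
best = cutting_plane(lambda x: abs(x) + 0.5 * x * x,   # f* = 0 at x* = 0
                     lambda x: sgn(x) + x, a=-2.0, b=3.0)
```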
An interesting feature of the cutting plane method is its use of
the linearizations provided by each newly generated point for improving
the polyhedral approximations to the problem functions. In other words,
the next search direction finding subproblem (3.15) is modified by appen-
ding the constraints generated by the latest linearizations. This idea
is used in many algorithms for nonsmooth optimization.
Convergence of the cutting plane algorithm can be very slow (Wol-
fe, 1975). This is mainly due to the fact that |d^k| may be so large
that the point x^k+d^k is far from the points y^j, j=1,...,k. Then
x^k+d^k is in the region where f^k and F^k poorly approximate f and
F. Also subproblems (3.14) and (3.15) may have no solutions. These draw-
backs can be eliminated by adding to the objective functions of (3.14)
and (3.15) a penalizing term (1/2)|d|^2, which will prevent large values
of |d^k|. Thus we obtain the following regularized modification of sub-
problem (3.14)

minimize f^k(x^k+d) + (1/2)|d|^2 subject to F^k(x^k+d) <= 0,   (3.17)

and its quadratic programming formulation

minimize (1/2)|d|^2 + u                                (3.18a)

subject to f_j^k + <g_f^j,d> <= u,  j=1,...,k,         (3.18b)

           F_j^k + <g_F^j,d> <= 0,  j=1,...,k.         (3.18c)

Search direction finding subproblems like (3.18) were introduced
by Lemarechal (1978) in an algorithm for unconstrained convex minimiza-
tion. He also showed how to construct sequences of points {x^k} and
auxiliary points of the form

y^{k+1} = x^k + d^k  for k=1,2,...,                    (3.19)

with y^1 = x^1 being an arbitrary starting point in R^N, so that his al-
gorithm is a descent method in the sense that

f(x^{k+1}) < f(x^k) if x^{k+1} ≠ x^k, for all k.       (3.20)

This relaxed version of the usual requirement for a descent method, i.e.
f(x^{k+1}) < f(x^k) for all k, is much easier to attain in practice. The
main idea consists in taking the trial point y^{k+1} = x^k + d^k as x^{k+1}
only if this leads to an improvement in the objective function value, i.e.
f(y^{k+1}) < f(x^k). This is called a serious step. Otherwise a null step
is taken by setting x^{k+1} = x^k.
To analyze Lemarechal's (1978) line search rules in more detail,
let (d^k,u^k) denote the solution of (3.18a)-(3.18b) and let

v^k = u^k - f(x^k).                                    (3.21)

Letting v = u - f(x^k) we obtain the following subproblem equivalent to
(3.18a)-(3.18b):

minimize (1/2)|d|^2 + v,
                                                       (3.22)
subject to f_j^k - f(x^k) + <g_f^j,d> <= v,  j=1,...,k,

whose solution (d^k,v^k) satisfies (3.21). By (3.12) and (3.16), we
have

f(x^k) >= f^k(x^k) = max{f_j^k: j=1,...,k}.            (3.23)

This shows that (d,v) = (0,0) is feasible for (3.22). Hence the optimal
value of (3.22) satisfies (1/2)|d^k|^2 + v^k <= (1/2)|0|^2 + 0 = 0. Therefore

v^k <= -(1/2)|d^k|^2 <= 0.                             (3.24)

Thus v^k is nonpositive. If v^k = 0 then x^k minimizes f and the
algorithm can stop, as will be shown in Chapter 2. Therefore we may as-
sume that v^k < 0. Since (d^k,u^k) solves (3.18a)-(3.18b), we have

u^k = max{f_j^k + <g_f^j,d^k>: j=1,...,k},             (3.25)

hence (3.8), (3.16) and (3.21) give

u^k = f^k(x^k+d^k),                                    (3.26a)

v^k = f^k(x^k+d^k) - f(x^k).                           (3.26b)

Thus v^k < 0 is an estimate of f(x^k+d^k) - f(x^k) = f(y^{k+1}) - f(x^k).
If the actual reduction in the objective function value is within m·100%
of the predicted value, i.e.

f(y^{k+1}) - f(x^k) <= m v^k,                          (3.27)

where m ∈ (0,1) is a fixed line search parameter, then the trial point
y^{k+1} is accepted as the next iterate: x^{k+1} = y^{k+1}. Otherwise the
algorithm stays at x^{k+1} = x^k. In both cases f(x^{k+1}) <= f(x^k), since
m > 0 and v^k < 0.
The following remarks on the above line search rules will be use-
ful in what follows. The condition for a serious step of the form

f(x^{k+1}) <= f(x^k) + m v^k,

instead of the simpler test f(x^{k+1}) < f(x^k), prevents the algorithm
from taking infinitely many serious steps without significantly reducing
the objective value, which could impair convergence. On the other hand,
at a null step we have x^{k+1} = x^k and

f(y^{k+1}) > f(x^k) + m v^k > f(x^k) + v^k = u^k,

because v^k < 0, m ∈ (0,1) and (3.21) holds. But f(y^{k+1}) = f̄(y^{k+1};y^{k+1})
from (3.9), hence the above inequality and the fact that y^{k+1} = x^k + d^k =
x^{k+1} + d^k yield f̄(x^{k+1}+d^k;y^{k+1}) > u^k. Therefore we obtain from (3.11)

f^{k+1}(x^{k+1}+d^k) > u^k.                            (3.28a)

In view of (3.26a) and the fact that x^{k+1} = x^k, we have

f^k(x^{k+1}+d^k) = u^k,                                (3.28b)

f^{k+1}(x^{k+1}+d^{k+1}) = u^{k+1}.                    (3.28c)

From (3.28) we conclude that after a null step the linearization ob-
tained at the trial point y^{k+1} leads to significant modifications of
both the next polyhedral approximation and the next search direction
finding subproblem. Therefore eventually a serious step must be taken
if the current point is not a solution.
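The serious/null step mechanism, together with subproblem (3.22), can be sketched in one dimension; there the quadratic subproblem can be solved by enumerating candidate points (each piece's unconstrained minimizer d = -g_j and the kinks between pieces) instead of calling a QP solver. The test function, the parameter values and all identifiers below are assumptions of this sketch, not prescriptions of the book.

```python
def direction_finding(bundle, x, fx):
    """1-D instance of (3.22): minimize max_j(-alpha_j + g_j*d) + d*d/2,
    where alpha_j = f(x) - [f(y_j) + g_j*(x - y_j)] is the linearization
    error (3.35). Returns (d^k, v^k)."""
    cuts = [(gj, fyj + gj * (x - yj) - fx) for yj, fyj, gj in bundle]  # (g_j, -alpha_j)
    v_of = lambda d: max(c + gj * d for gj, c in cuts)
    phi = lambda d: v_of(d) + 0.5 * d * d
    cand = [-gj for gj, _ in cuts]           # unconstrained minimizer of each piece
    for i in range(len(cuts)):
        for j in range(i + 1, len(cuts)):    # kinks where two pieces meet
            (g1, c1), (g2, c2) = cuts[i], cuts[j]
            if g1 != g2:
                cand.append((c2 - c1) / (g1 - g2))
    d = min(cand, key=phi)
    return d, v_of(d)

f = lambda x: abs(x) + 0.5 * x * x           # convex, minimized at 0
g = lambda x: (1.0 if x >= 0 else -1.0) + x  # one subgradient
x, m = 3.0, 0.1                              # start point; line search parameter
bundle = [(x, f(x), g(x))]                   # triples (y_j, f(y_j), g_f(y_j))
for _ in range(30):
    d, v = direction_finding(bundle, x, f(x))
    if v > -1e-9:
        break                                # v^k ~ 0: x is (nearly) optimal
    y = x + d                                # trial point (3.19)
    if f(y) <= f(x) + m * v:                 # test (3.27): serious step
        x = y
    bundle.append((y, f(y), g(y)))           # serious or null: append the new cut
```

After a null step the cut collected at y makes the old pair (d, v) infeasible for the next subproblem, which is the mechanism behind (3.28).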
The above algorithm of Lemarechal (1978) was extended by Mifflin
(1982) to constrained convex problems as follows. The set

S^k = {x ∈ R^N: F̄(x;y^j) <= 0 for j=1,...,k}           (3.29)

is an outer polyhedral approximation to the feasible set

S = {x ∈ R^N: F(x) <= 0},

that is

S ⊂ S^k.                                               (3.30)

This follows easily from (3.10). If the auxiliary points y^1,...,y^k are
near to x^k, then S^k is a close approximation to S in some neigh-
borhood of x^k. However, the solution (d^k,u^k) of (3.18) would usually
give x^k+d^k lying at some "corner" of S^k which is outside S. Therefo-
re almost every trial point y^{k+1} = x^k+d^k would be infeasible. For this
reason, Mifflin (1982) obtains d^k from the solution (d^k,v^k) to the
problem

minimize (1/2)|d|^2 + v,

subject to f_j^k - f(x^k) + <g_f^j,d> <= v,  j=1,...,k,   (3.31)

           F_j^k + <g_F^j,d> <= v,  j=1,...,k.

This subproblem is a blend of (3.18) and (3.22). Clearly,

max{F_j^k + <g_F^j,d^k>: j=1,...,k} <= v^k.            (3.32)

In Mifflin's algorithm one always has x^k ∈ S. Then (3.10) and (3.23)
show that (d,v) = (0,0) is feasible for (3.31), hence (3.24) also holds
here. Additionally, v^k = 0 only if x^k solves problem P. Therefore, in
general, one has

max{F̄(x^k+d^k;y^j): j=1,...,k} <= v^k < 0              (3.33)

from (3.32), (3.16) and (3.8). Combining this with (3.29), we see that
the trial point y^{k+1} = x^k+d^k lies in the interior of S^k. Therefore we
shall have y^{k+1} ∈ S whenever S^k is sufficiently close to S around
x^k.
The above described line search rules of Lemarechal (1978) need
only a simple modification in the presence of the constraints. If the
trial point y^{k+1} is feasible, then one may use the test (3.27) and
proceed as above. If y^{k+1} ∉ S then a null step x^{k+1} = x^k is declared.
Thus f(x^{k+1}) <= f(x^k) and x^k ∈ S for all k.
Whenever a null step results from

F(y^{k+1}) > 0,

then

F^{k+1}(x^{k+1}+d^k) = F̄(y^{k+1};y^{k+1}) = F(y^{k+1}) > 0 > v^k

from (3.9)-(3.13), and

v^{k+1} >= F^{k+1}(x^{k+1}+d^{k+1})

from (3.33) and (3.11). The above inequalities imply that (d^{k+1},v^{k+1}) ≠
(d^k,v^k). We conclude that a null step due to infeasibility provides a
significant modification of the polyhedral approximation to the const-
raint function. This explains why a feasible trial point is generated
after finitely many null steps.
The following remark on Lemarechal's (1978) search direction finding subproblem (3.22) (see also (3.17) and (3.18a)-(3.18b)) will be useful in what follows. Observe that at the k-th iteration the j-th linearization

f(x; y^j) = f(y^j) + <g_f(y^j), x - y^j>   for all x

can be written as

f(x; y^j) = f(x^k) - α_f(x^k, y^j) + <g_f(y^j), x - x^k>,      (3.34)

where

α_f(x^k, y^j) = f(x^k) - f(x^k; y^j) ≥ 0      (3.35)

is the linearization error (nonnegative by (3.10)). Therefore the k-th polyhedral approximation f̂^k is also given by

f̂^k(x) = max{f(x^k) - α_f(x^k, y^j) + <g_f(y^j), x - x^k> : j=1,...,k}.      (3.36)

Since d^k is found by minimizing f̂^k(x^k + d) + ½|d|² over all d, we see that if the linearization error α_f(x^k, y^j) is large then it tends to make the subgradient g_f(y^j) less active in the determination of the current search direction d^k. Indeed, it will be shown in Chapter 2 that

d^k = -Σ_{j=1}^k λ_j^k g_f(y^j),      (3.37)

where the numbers λ_j^k, j=1,...,k, are the Lagrange multipliers of (3.22), which solve the following dual of (3.22):

minimize   ½|Σ_{j=1}^k λ_j g_f(y^j)|² + Σ_{j=1}^k λ_j α_f(x^k, y^j),
                                                                    (3.38)
subject to λ_j ≥ 0, j=1,...,k,  Σ_{j=1}^k λ_j = 1.
Moreover, by (3.10) and (3.34),

f(x) ≥ f(x^k) + <g_f(y^j), x - x^k> - α_f(x^k, y^j)   for all x,

hence

g_f(y^j) ∈ ∂_ε f(x^k)   for ε = α_f(x^k, y^j) ≥ 0,      (3.39)

which means that the value of α_f(x^k, y^j) indicates how far g_f(y^j) is from ∂f(x^k) (g_f(y^j) ∈ ∂f(x^k) if α_f(x^k, y^j) = 0). Thus the algorithm of Lemarechal (1978) uses automatic weighting of the past subgradients on the basis of the corresponding linearization errors, which can be interpreted as subgradient locality measures.
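To make this weighting mechanism concrete, the following illustrative sketch (ours, not part of the original text) solves the dual subproblem (3.38) in the simplest case k = 2, where minimization over the unit simplex reduces to a scalar quadratic with a closed-form solution; the function name and variables are hypothetical.

```python
def dual_direction_2(g1, g2, a1, a2):
    """Solve (3.38) for two subgradients g1, g2 with linearization
    errors a1, a2 >= 0: minimize over lam in [0, 1]
        0.5*|lam*g1 + (1-lam)*g2|^2 + lam*a1 + (1-lam)*a2,
    and return (lam, d) with d = -(lam*g1 + (1-lam)*g2), cf. (3.37)."""
    u = [x - y for x, y in zip(g1, g2)]          # g1 - g2
    uu = sum(x * x for x in u)                   # |g1 - g2|^2
    if uu == 0.0:
        lam = 0.0 if a2 <= a1 else 1.0
    else:
        # stationary point of the scalar quadratic, clamped to [0, 1]
        lam = (a2 - a1 - sum(x * y for x, y in zip(g2, u))) / uu
        lam = min(1.0, max(0.0, lam))
    p = [lam * x + (1.0 - lam) * y for x, y in zip(g1, g2)]
    return lam, [-x for x in p]
```

With equal errors the two subgradients are averaged; a large error a1 drives lam to 0, so g1 barely influences d^k, which is exactly the automatic weighting by locality measures described above.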

One may distinguish three classes of descent methods for nonsmooth minimization. The classes differ in the form of their search direction finding subproblems and the associated line search rules. The first class originated with the above described algorithms of Lemarechal (1978) and Mifflin (1982). We shall now describe the remaining two classes, confining ourselves - for simplicity - to the unconstrained case.
The second class of methods stems from the algorithms for unconstrained convex minimization due to Lemarechal (1975) and Wolfe (1975). They use polyhedral approximations of the form

f̂_LW^k(x) = max{f(x^k) + <g_f(y^j), x - x^k> : j ∈ J_f^k}      (3.40)

for some set J_f^k ⊂ {1,...,k}, and choose the k-th direction d^k to

minimize f̂_LW^k(x^k + d) + ½|d|² over all d ∈ R^N.      (3.41)

Comparing (3.36) and (3.40), we see that f̂_LW^k = f̂^k if J_f^k = {1,...,k} and α_f(x^k, y^j) = 0 for all j ∈ J_f^k. In general, however, f̂_LW^k does not satisfy global relations of the form (3.12)-(3.13). Thus f̂_LW^k is a useful approximation to f around x^k only if the set J_f^k is chosen in a way that ensures for each j ∈ J_f^k that y^j is close enough to x^k, so that the linearization error α_f(x^k, y^j) may be neglected. In view of (3.39), this means that in the algorithms of Lemarechal (1975) and Wolfe (1975) each past subgradient g_f(y^j), j ∈ J_f^k, is treated as if it were a subgradient of f at the current point x^k. Hence, in order to provide another interpretation of (3.40) and (3.41), assume temporarily that g_f(y^j) ∈ ∂f(x^k) for all j ∈ J_f^k. Define

f̃(x) = max{f(x^k) + <g_f, x - x^k> : g_f ∈ ∂f(x^k)}      (3.42)

and let d̃ denote a solution to the problem

minimize f̃(x^k + d) + ½|d|² over all d ∈ R^N.      (3.43)

Then f̂_LW^k and subproblem (3.41) may be regarded as approximate versions of the "theoretical" constructions (3.42) and (3.43). Moreover, from Lemma 2.13 we deduce that

-d̃ = p̃_f = Nr ∂f(x^k),      (3.44a)

max{<g_f, d̃> : g_f ∈ ∂f(x^k)} ≤ -|p̃_f|²,      (3.44b)

-d^k = p_f^k = Nr conv{g_f(y^j) : j ∈ J_f^k},      (3.45a)

max{<g_f(y^j), d^k> : j ∈ J_f^k} ≤ -|p_f^k|².      (3.45b)

Thus p_f^k = -d^k is found by projecting the origin onto the set conv{g_f(y^j) : j ∈ J_f^k}, which approximates ∂f(x^k). Moreover, as in (2.52), we have

f̂_LW^k(x^k + t d^k) ≤ f(x^k) - t|p_f^k|²   for all t ∈ [0,1],      (3.46)

hence the value of -|p_f^k|² = -|d^k|² may be thought of as an approximate derivative of f at x^k in the direction d^k.
We may add that search direction finding subproblems of the form (3.41) are also used in the algorithms of Mifflin (1977b) and Polak, Mayne and Wardi (1983). A quadratic programming formulation of (3.41) is to find (d^k, v^k) to

minimize   ½|d|² + v,
                                                                    (3.47)
subject to <g_f(y^j), d> ≤ v,  j ∈ J_f^k,

with its dual

minimize   ½|Σ_{j ∈ J_f^k} λ_j g_f(y^j)|²,
                                                                    (3.48)
subject to λ_j ≥ 0, j ∈ J_f^k,  Σ_{j ∈ J_f^k} λ_j = 1,

corresponding to (3.45a) (cf. (3.38)). Moreover, v^k = -|d^k|² = -|p_f^k|².
k
Several strategies have been proposed for selecting the sets J_f^k so that f̂_LW^k is a local approximation to f at x^k. Mifflin (1977b) sets

J_f^k = {j : α_f(x^k, y^j) ≤ δ_k}      (3.49)

for a suitably chosen sequence δ_k ↓ 0. The algorithm of Lemarechal (1975) uses y^j = x^j for j ∈ J^k = {1,...,k} until for some k

α_f(x^1, x^k) ≥ f(x^1) - f(x^k) + ε,

where ε > 0 is a parameter. Then the algorithm is reset by starting from the point x^1 = x^k (with g_f(x^1) = g_f(x^k) and J_f^1 = {1}). After sufficiently many resets one has f(x^1) ≈ f(x^k) and α_f(x^1, x^k) < 2ε between the resets, so that g_f(y^j) ∈ ∂_{2ε} f(x^1) for all j ∈ J_f^k. The algorithm of Wolfe

(1975) uses J_f^k = {1,...,k} until |d^k| ≤ ε, where ε > 0 is a parameter. If |d^k| ≤ ε then the algorithm stops provided that max{|y^j - x^k| : j ∈ J_f^k} ≤ ε; otherwise x^k is taken as the new starting point x^1. Wolfe (1975) shows that his strategy makes the value of max{|y^j - x^k| : j ∈ J_f^k} arbitrarily small after sufficiently many resets. Another strategy (Mifflin, 1977b; Polak, Mayne and Wardi, 1983) is to set

J_f^k = {j : |y^j - x^k| ≤ δ_k},      (3.50)

where δ_k > 0 converges to zero. Such strategies will be discussed in detail in subsequent chapters.
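As a small illustration (our own, hypothetical naming), the deletion rule (3.50) is just a distance filter on the stored trial points:

```python
def select_indices(trial_points, x_k, radius):
    """Rule (3.50): keep index j (1-based) only if |y^j - x^k| <= radius,
    where radius plays the role of the vanishing tolerance."""
    def dist(y, x):
        return sum((yi - xi) ** 2 for yi, xi in zip(y, x)) ** 0.5
    return [j for j, y in enumerate(trial_points, start=1)
            if dist(y, x_k) <= radius]
```

Shrinking the radius localizes the approximation f̂_LW^k, at the cost of discarding subgradient information.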
To sum up, algorithms based on the polyhedral approximations
(3.40) that neglect the linearization errors need suitable rules for
reducing the past subgradient information, i.e. deleting the obsolete
subgradients. Such rules should be implemented carefully, since any
premature reduction slows down convergence until sufficiently many new
subgradients are accumulated.
Lemarechal (1975) and Wolfe (1975) describe important modifications of their algorithms that require storing only finitely many subgradients. The modification consists in setting (cf. (3.45a))

-d^k = p_f^k = Nr conv[{p_f^{k-1}} ∪ {g_f(y^j) : j ∈ J_f^k}]      (3.51)

between each two consecutive resets (with p_f^0 = g_f(y^1) for k = 1). The vector p_f^{k-1}, satisfying

p_f^{k-1} ∈ conv{g_f(y^j) : j=1,...,k-1},

carries over from the previous iteration the relevant past subgradient information. In this case J_f^k may be selected subject only to the requirement

k ∈ J_f^k,

e.g. one may set J_f^k = {k}. The use of (3.51) corresponds to setting

f̂_LW^k(x) = max{f(x^k) + <p_f^{k-1}, x - x^k>, f(x^k) + <g_f(y^j), x - x^k> : j ∈ J_f^k}

in subproblem (3.41), and appending the additional constraint

<p_f^{k-1}, d> ≤ v

in subproblem (3.47). Thus for direction finding p_f^{k-1} is treated as any past subgradient. Therefore we may call it the (k-1)-st aggregate subgradient.
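A minimal sketch (ours) of the bounded-storage recursion (3.51) under the simplest choice J_f^k = {k}: the new aggregate is the projection of the origin onto the segment joining the previous aggregate and the newest subgradient, which has a closed form.

```python
def aggregate(p_prev, g_new):
    """Project the origin onto conv{p_prev, g_new}, i.e. (3.51) with
    J_f^k = {k}; returns the new aggregate subgradient p_f^k."""
    u = [a - b for a, b in zip(p_prev, g_new)]       # p_prev - g_new
    uu = sum(x * x for x in u)
    if uu == 0.0:
        return list(p_prev)
    # minimize |g_new + mu*u|^2 over mu in [0, 1]
    mu = -sum(b * x for b, x in zip(g_new, u)) / uu
    mu = min(1.0, max(0.0, mu))
    return [b + mu * x for b, x in zip(g_new, u)]
```

Note that the norm of the aggregate never increases, in line with the monotone decrease of |p_f^k| discussed below (3.57c).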

We now pass to the line search rules used in (Wolfe, 1975) and (Mifflin, 1977b). To this end recall that the Lemarechal (1978) algorithm described above generates sequences related by

x^{k+1} = x^k + t_L^k d^k,
                                                                    (3.52)
y^{k+1} = x^k + t_R^k d^k,

with t_L^k = 1 at serious steps, t_L^k = 0 at null steps, and t_R^k = 1 for all k. Moreover, at each step we have

f(x^{k+1}) ≤ f(x^k) + m t_L^k v^k,      (3.53)

and if a null step occurs at the k-th iteration then

f(y^{k+1}) - f(x^{k+1}) > m v^k.      (3.54)

The above relations follow from the criterion (3.27) and the fact that t_L^k = 1 at a serious step, while a null step occurs with t_L^k = 0 and x^{k+1} = x^k. At a null step we also have y^{k+1} = x^k + d^k = x^{k+1} + d^k, hence y^{k+1} - x^{k+1} = d^k and

f(y^{k+1}) - f(x^{k+1}) = f(y^{k+1}) + <g_f(y^{k+1}), x^{k+1} - y^{k+1}> - f(x^{k+1}) + <g_f(y^{k+1}), y^{k+1} - x^{k+1}> =

= -α_f(x^{k+1}, y^{k+1}) + <g_f(y^{k+1}), d^k>

from (3.35), therefore (3.54) can be written as

-α_f(x^{k+1}, y^{k+1}) + <g_f(y^{k+1}), d^k> > m v^k.      (3.55)

We have shown above that the direction finding subproblems in the Wolfe (1975) algorithm can essentially be obtained from those in (Lemarechal, 1978) by neglecting the linearization errors. Now, if we assume that α_f(x^{k+1}, y^{k+1}) = 0 in (3.55) then we obtain

<g_f(y^{k+1}), d^k> > m v^k,      (3.56)

which is essentially the criterion used in (Wolfe, 1975). To ensure that the value of the linearization error α_f(x^{k+1}, y^{k+1}) is sufficiently small, Wolfe (1975) imposed an additional condition

|y^{k+1} - x^{k+1}| ≤ δ_k      (3.57a)

for some sequence δ_k ↓ 0. In fact, he used the following modification of

(3.53) and (3.56):

f(x^{k+1}) ≤ f(x^k) + m_L t_L^k v^k,      (3.57b)

<g_f(y^{k+1}), d^k> ≥ m_R v^k,      (3.57c)

where m_L and m_R are fixed line search parameters satisfying 0 < m_L < m_R < 1. Line search procedures for finding stepsizes t_L^k and t_R^k satisfying 0 ≤ t_L^k ≤ t_R^k, (3.52) and (3.57) can be found in (Mifflin, 1977b; Wolfe, 1975).
Wolfe, 1975).
We may add that the criterion (3.57c) ensures that the new subgradient g_f(y^{k+1}) will significantly modify the next polyhedral approximation f̂_LW^{k+1} and the corresponding direction finding subproblem. This follows from the fact that if k+1 ∈ J_f^{k+1} then (d^{k+1}, v^{k+1}), being the solution of the (k+1)-st subproblem (3.47), satisfies

<g_f(y^{k+1}), d^{k+1}> ≤ v^{k+1}.

Combining this with (3.57c) and the fact that m_R v^k > v^k since v^k = -|d^k|² < 0 and m_R ∈ (0,1), we obtain (d^{k+1}, v^{k+1}) ≠ (d^k, v^k).
In algorithms based on (3.45) (or (3.51)) and (3.57c) the value of |d^k| = |p_f^k| decreases at each iteration, provided that no reduction of the past subgradient information occurs. To see this, note that (3.45) says that p_f^k ≠ 0 defines a hyperplane separating conv{g_f(y^j) : j ∈ J_f^k} from the null vector:

<g_f(y^j), p_f^k> ≥ |p_f^k|²   for all j ∈ J_f^k,

whereas (3.57c), written as

<g_f(y^{k+1}), p_f^k> ≤ m_R |p_f^k|²,

means that g_f(y^{k+1}) lies in the open halfspace containing the origin. It follows that the next separating hyperplane, corresponding to J_f^{k+1} = J_f^k ∪ {k+1}, must be closer to the null vector, i.e. |p_f^{k+1}| < |p_f^k|. Thus eventually the direction degenerates (one can have d^k = 0), which provides another motivation for resetting strategies.
To sum up, the second class of algorithms discussed above (Lemarechal, 1975; Mifflin, 1977b; Polak, Mayne and Wardi, 1983; Wolfe, 1975), which neglect the linearization errors at search direction finding, need rules for discarding obsolete subgradients. This is in contrast with the first class (Lemarechal, 1978; Mifflin, 1982), in which the linearization errors automatically weight the past subgradients, cf. (3.38) and (3.48).
We shall now review the third class of methods, which is intermediate between the two classes discussed above. It contains the so-called bundle methods (Lemarechal, 1976; Lemarechal, Strodiot and Bihain, 1981; Strodiot, Nguyen and Heukemes, 1983). At the k-th iteration of the algorithms based on relation (3.45) the set

G^k = conv{g_f(y^j) : j=1,...,k}

was supposed to approximate ∂f(x^k). In bundle methods the convex polyhedron

G^k(ε) = {Σ_{j=1}^k λ_j g_f(y^j) : λ_j ≥ 0, j=1,...,k, Σ_{j=1}^k λ_j = 1, Σ_{j=1}^k λ_j α_f(x^k, y^j) ≤ ε}      (3.58)

is used for approximating ∂_ε f(x^k), where ε > 0. Indeed, using (3.39) and taking convex combinations we obtain

f(x) = Σ_{j=1}^k λ_j f(x) ≥ f(x^k) + <Σ_{j=1}^k λ_j g_f(y^j), x - x^k> - Σ_{j=1}^k λ_j α_f(x^k, y^j)

for all x, which shows that

G^k(ε) ⊂ ∂_ε f(x^k).      (3.59)

If we now choose ε_k > 0 and substitute G^k(ε_k) for G^k in (3.45), we obtain the k-th direction d^k of the bundle methods as follows:

-d^k = p_f^k = Nr G^k(ε_k),      (3.60)

with

<g, p_f^k> ≥ |p_f^k|²   for all g ∈ G^k(ε_k).      (3.61)

Thus d^k can be computed by finding multipliers λ_j^k, j=1,...,k, to

minimize   ½|Σ_{j=1}^k λ_j g_f(y^j)|²,

subject to λ_j ≥ 0, j=1,...,k,  Σ_{j=1}^k λ_j = 1,      (3.62)

           Σ_{j=1}^k λ_j α_f(x^k, y^j) ≤ ε_k,

and setting

d^k = -p_f^k = -Σ_{j=1}^k λ_j^k g_f(y^j).      (3.63)

Search direction finding via (3.60) can be motivated as follows. Suppose that we want to find a direction d such that f(x^k + d) < f(x^k) - ε_k. Then, since

f(x^k + d) ≥ f(x^k) + <g, d> - ε_k   for all g ∈ ∂_{ε_k} f(x^k)

and G^k(ε_k) ⊂ ∂_{ε_k} f(x^k), d must satisfy

<g, d> < 0   for all g ∈ G^k(ε_k),

i.e. we must find a hyperplane separating G^k(ε_k) from the origin. The best such hyperplane is defined by p_f^k, cf. (3.60) and (3.61). Note also that if p_f^k = 0 then f(x) ≥ f(x^k) - ε_k for all x, which means that the value of ε_k should be decreased.
Observe that if α_f(x^k, y^j) ≤ ε_k for all j, then subproblem (3.62) is equivalent to (3.48). For smaller values of ε_k the last constraint of (3.62) tends to make the subgradients with larger linearization errors contribute less to d^k, since the corresponding multipliers must be smaller, cf. (3.63). Thus the weighting of the past subgradients depends on the value of ε_k. Since it is difficult to design convergent rules for the automatic choice of the value of ε_k (Lemarechal, 1980), this is the main drawback of bundle methods in comparison with the first class of methods based on polyhedral approximations.
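The effect of the ε-constraint can be seen in the following sketch (ours, with hypothetical names) of subproblem (3.62) for two subgradients with errors a1 ≥ a2; it assumes eps ≥ a2, so that the subproblem is feasible.

```python
def bundle_direction_2(g1, g2, a1, a2, eps):
    """Subproblem (3.62) with two subgradients: minimize
    0.5*|lam*g1 + (1-lam)*g2|^2 subject to lam in [0, 1] and
    lam*a1 + (1-lam)*a2 <= eps.  Assumes a1 >= a2 and eps >= a2.
    Returns (lam, d) with d = -(lam*g1 + (1-lam)*g2)."""
    if eps < a2:
        raise ValueError("infeasible: eps below the smallest error")
    # largest feasible weight on the high-error subgradient g1
    hi = 1.0 if a1 <= eps else (eps - a2) / (a1 - a2)
    u = [x - y for x, y in zip(g1, g2)]          # g1 - g2
    uu = sum(x * x for x in u)
    lam = 0.0 if uu == 0.0 else -sum(y * x for y, x in zip(g2, u)) / uu
    lam = min(hi, max(0.0, lam))                 # clamp to feasible interval
    p = [lam * x + (1.0 - lam) * y for x, y in zip(g1, g2)]
    return lam, [-x for x in p]
```

Because the objective is a convex quadratic in lam, clamping the unconstrained minimizer into the feasible interval gives the optimum; shrinking eps visibly caps the multiplier of the high-error subgradient.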
Lemarechal, Strodiot and Bihain (1981) have proposed a bundle method that requires storing only a finite number, say M_g ≥ 1, of the past subgradients. Suppose that at the k-th iteration we have the (k-1)-st aggregate subgradient (p_f^{k-1}, f_p^k) ∈ R^N × R, satisfying

(p_f^{k-1}, f_p^k) ∈ conv{(g_f(y^j), f(x^k; y^j)) : j=1,...,k-1}.

Then, since

f(x) ≥ f(x^k; y^j) + <g_f(y^j), x - x^k>   for all x,

we have

f(x) ≥ f_p^k + <p_f^{k-1}, x - x^k>   for all x,

hence

p_f^{k-1} ∈ ∂_ε f(x^k)   for ε = α_p^k,

where

α_p^k = f(x^k) - f_p^k.

Subproblem (3.62) is replaced by the following one: find values of multipliers λ_j^k, j ∈ J_f^k, and λ_p^k to

minimize   ½|Σ_{j ∈ J_f^k} λ_j g_f(y^j) + λ_p p_f^{k-1}|²,

subject to λ_j ≥ 0, j ∈ J_f^k,  λ_p ≥ 0,  Σ_{j ∈ J_f^k} λ_j + λ_p = 1,      (3.64)

           Σ_{j ∈ J_f^k} λ_j α_f(x^k, y^j) + λ_p α_p^k ≤ ε_k,

and (3.63) by

d^k = -p_f^k = -(Σ_{j ∈ J_f^k} λ_j^k g_f(y^j) + λ_p^k p_f^{k-1}).

Thus for search direction finding (p_f^{k-1}, α_p^k) is treated as any "ordinary" pair (g_f(y^j), α_f(x^k, y^j)). The algorithm uses resets for selecting the sets J_f^k. Between any two resets, one sets J_f^{k+1} = J_f^k ∪ {k+1}. When J_f^k has M_g elements, the algorithm is reset by setting J_f^{k+1} = {k+1}. Of course, such a strategy is not very efficient when M_g is small, since then too frequent reduction of the subgradient information hinders convergence.
Line search criteria of bundle methods are essentially of the form (3.57) with the additional requirement

α_f(x^{k+1}, y^{k+1}) ≤ ε_{k+1},

ensuring that subproblem (3.62) (or (3.64)) is always feasible.


Up till now we have dealt mainly with the three classes of algorithms for convex problems. We shall now review extensions of these algorithms to the nonconvex case.

As shown above, in the convex case the algorithms of the first class (Lemarechal, 1978; Mifflin, 1982) have a much clearer interpretation than the remaining methods. This is mainly due to the global properties (3.12) and (3.13) of their polyhedral approximations, which make it possible to weight the past subgradients by the corresponding linearization errors. Of course, such global properties no longer hold in the nonconvex case. For this reason, Mifflin (1982) proposed the following subgradient locality measure

α_f(x, y) = max{f(x) - f(x; y), γ|x - y|²},      (3.65)

where γ > 0 is a parameter (γ can be set to zero when f is convex). The value of α_f(x, y) indicates how far g_f(y) is from ∂f(x). Note that if f is convex and γ = 0 then α_f(x^k, y^j) defined by (3.65) reduces to the linearization error f(x^k) - f(x^k; y^j), as in (3.35). Therefore in (Mifflin, 1982) the k-th polyhedral approximation f̂^k is defined via (3.36) and (3.65), and (3.55) is still used at line searches. As before, d^k minimizes f̂^k(x^k + d) + ½|d|² over all d. The line search is more complicated in the nonconvex case, since (3.54) need not imply (3.55).
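As a small numerical illustration (our example, not from the text) of (3.65): for the nonconvex function f(t) = -t², the plain linearization error at x = 0 from y = 1 is negative, while the γ-term keeps the locality measure nonnegative.

```python
def locality_measure(f_x, fbar_xy, x, y, gamma):
    """Mifflin's subgradient locality measure (3.65):
    max{f(x) - fbar(x;y), gamma*|x-y|^2}, where f_x = f(x) and
    fbar_xy = f(y) + <g_f(y), x - y> is the linearization value."""
    dist2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return max(f_x - fbar_xy, gamma * dist2)

# Nonconvex example f(t) = -t^2 with gradient -2t: the linearization
# from y = 1 overestimates f at x = 0, so the plain error is negative,
# but the gamma-term keeps the measure nonnegative.
f = lambda t: -t * t
x, y = [0.0], [1.0]
fbar = f(y[0]) + (-2.0 * y[0]) * (x[0] - y[0])   # = -1 + 2 = 1
alpha = locality_measure(f(x[0]), fbar, x, y, gamma=1.0)
```

Here f(x) - fbar = -1 while alpha = 1, so the measure still signals that g_f(y) is far from ∂f(x).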


As far as the second class is concerned, observe that we have interpreted the direction finding subproblems and line search rules of these algorithms only in terms of local properties of the corresponding polyhedral approximations f̂_LW^k, with no reference to convexity. This explains why these approximations were used by Mifflin (1977b) and Polak, Mayne and Wardi (1983) also for nonconvex problems, with subgradient deletion rules based on (3.50) localizing the approximations.

The third class (the bundle methods) has been extended by Lemarechal, Strodiot and Bihain (1981) by using the subgradient locality measures α_f(x^k, y^j) defined by (3.65) in subproblem (3.62). In connection with subproblem (3.64), they have also considered using the "path lengths"

s_j^k = |y^j - x^j| + Σ_{i=j}^{k-1} |x^{i+1} - x^i|

instead of |x^k - y^j| in the definition of α_f(x^k, y^j). Then the points y^j need not be stored, since s_j^{k+1} = s_j^k + |x^{k+1} - x^k|.
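The path length update can be checked in a few lines (our naming, 1-based indices as in the text):

```python
def path_length_direct(y_j, xs, j):
    """s_j^k computed from its definition: |y^j - x^j| plus the lengths
    of the steps x^j -> x^{j+1} -> ... -> x^k, with xs = [x^1,...,x^k]."""
    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5
    s = dist(y_j, xs[j - 1])
    for i in range(j - 1, len(xs) - 1):
        s += dist(xs[i + 1], xs[i])
    return s

# incremental update: s_j^{k+1} = s_j^k + |x^{k+1} - x^k|
xs = [[0.0], [1.0], [1.5]]                  # x^1, x^2, x^3
y2 = [0.75]                                 # trial point y^2
s_k = path_length_direct(y2, xs[:2], 2)     # s_2^2 = |y^2 - x^2|
s_inc = s_k + abs(xs[2][0] - xs[1][0])      # add |x^3 - x^2|
```

The incremental value agrees with the defining sum, so y^j can indeed be discarded once s_j^k is initialized.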

So far we have concentrated on describing the algorithms. We shall now comment on the known results on their convergence and computational efficiency.

The algorithms of the first class discussed above have a potential for fast convergence (Lemarechal and Mifflin, 1982). However, at the k-th iteration they have k linear inequalities in their quadratic programming subproblems. This would present serious problems with storage and computation after a large number of iterations. Under additional boundedness assumptions on the generated sequences of points and the corresponding subgradients, these algorithms have at least one stationary accumulation point.

The second class of methods requires bounded storage and uses simple quadratic programming subproblems, but seems to converge slowly in practice (Lemarechal, 1982). As for convergence, Polak, Mayne and Wardi (1983) have modified the line search rules of the earlier versions so as to obtain global convergence in the sense that each of the algorithm's accumulation points is stationary.

The bundle method of Lemarechal, Strodiot and Bihain (1981), which is representative of the third class, requires bounded storage. Numerical experiments (Lemarechal, 1982) indicate that the method usually converges much more rapidly than the algorithms of the second class. However, no global convergence of that method seems to have been established in the nonconvex case.

Of course, much more work remains to be done before practical efficiency of each class of algorithms is fully assessed.

In view of the advantages and drawbacks of existing methods, our aim has been to construct methods for nonsmooth minimization which are characterized by:
(a) applicability - the algorithms should use only general properties of problem P so as to be applicable to a broad class of practical problems;
(b) implementability - the algorithms should not require unbounded storage or an infinite number of arithmetic operations per iteration;
(c) reliability - a guarantee should exist, at least in the form of a proof of convergence, that the algorithms can find (approximate) solutions to a broad class of problems;
(d) efficiency - the ability to provide satisfactory approximate solutions with minimal computational effort.

As far as efficiency is concerned, we note that function evaluations in the problems of interest to us are very time-consuming. Therefore, even relatively complex algorithms are admissible, provided that the computational overhead incurred in their auxiliary operations is smaller than the gain from a decrease in the number of function evaluations. For this reason the algorithms that are the subject of this work are rather complex and will be described in detail in subsequent chapters. Here we want to comment on their relations with the methods discussed so far.

In this work we shall present new versions, modifications and extensions of algorithms belonging to all the three existing classes of methods for nonsmooth optimization. We shall concentrate mainly on the first class of algorithms, since it seems to be particularly promising.

In Chapter 2 we extend the first class by describing aggregate subgradient methods for unconstrained convex minimization. In order to provide upper bounds on the amount of the past subgradient information which is stored and processed during the calculations, we give basic rules for selecting and aggregating the past subgradient information.
In Chapter 3 and Chapter 4 we show that the methods of Chapter 2 can be extended to the nonconvex case in two fundamentally different ways. The first strategy consists in modifying the polyhedral approximations by using subgradient locality measures of the form (3.65). The second, alternative strategy is to use subgradient deletion rules for localizing the polyhedral approximations. This approach is adopted in Chapter 4, where we also show that it leads to new algorithms belonging to the second class of methods that neglect the linearization errors. It will be seen that the two approaches to the nonconvex case result in significantly different algorithms.

In Chapter 5 and Chapter 6 we extend the preceding methods to inequality constrained problems. We present feasible point methods in which the past subgradient information about the problem functions is separately accumulated in two aggregate subgradients, one corresponding to the objective function and the other to the constraints. The methods differ in their line search rules and treatment of nonconvexity.

In Chapter 7 we apply our techniques of subgradient selection and aggregation to the third class of algorithms, obtaining new versions of bundle methods that require bounded storage. We also give bundle methods for problems with nonlinear inequality constraints, while up till now only bundle methods for linearly constrained problems have been considered.

We shall present apparently novel techniques for analyzing convergence of algorithms for nonsmooth optimization. In the absence of convexity, we will content ourselves with finding stationary points for problem P, i.e. points which satisfy the F. John necessary optimality condition for problem P, see Section 2. For each algorithm introduced in this work we prove that it is globally convergent in the sense that all its accumulation points are stationary. In the convex case, each of our algorithms generates a minimizing sequence of points, which in addition converges to a solution of problem P whenever problem P has any solution. Moreover, the convergence is finite in the piecewise linear case.

We may add that the algorithms discussed in this monograph are first-order methods. Some research is currently being done to obtain faster convergence; see Auslender (1962), Demyanov, Lemarechal and Zowe (1985), Hiriart-Urruty (1983), Lemarechal and Mifflin (1982), Lemarechal and Strodiot (1985), Lemarechal and Zowe (1983), Mifflin (1983 and 1984). This research is not discussed here, for our purpose is to establish some general convergence theory in the higher dimensional and constrained case.
CHAPTER 2

Aggregate Subgradient Methods for Unconstrained Convex Minimization

1. Introduction

In this chapter we consider the problem of minimizing a convex, not necessarily differentiable, function f: R^N → R. We introduce a class of readily implementable algorithms, differing in complexity and efficiency, and analyze their convergence under no additional assumption on f. Each of the algorithms generates a minimizing sequence of points; if f attains its minimum then this sequence converges to a minimum point of f. Particular members of this algorithm class terminate when f happens to be piecewise linear (Kiwiel, 1983).

The algorithms presented are descent methods which combine a generalized cutting plane idea with quadratic approximation. They can be interpreted as an extension of Pshenichny's method of linearizations (Pshenichny and Danilin, 1975) to the nonsmooth case. Stemming from the pioneering algorithm of Lemarechal (1978), they differ from it in the updating of search direction finding subproblems. More specifically, instead of using all previously computed subgradients in quadratic programming subproblems, the methods use an aggregate subgradient, which is a convex combination of the past subgradients. It is recursively updated in a way that preserves that part of the past subgradient information which is essential for convergence.

In Section 2 we derive basic versions of the methods, comparing them with the algorithms of Pshenichny and Lemarechal. A formal description of an algorithmic procedure is given in Section 3. Its global convergence is demonstrated in Section 4, where we also introduce certain concepts that will be useful for analyzing subsequent extensions. In Section 5 we study convergence of methods with subgradient selection. Section 6 is devoted to the piecewise linear case. Further modifications of the methods are described in Section 7.

2. Derivation of the Algorithm Class

In this section we derive a class of methods for minimizing a convex function f: R^N → R. To deal with the nondifferentiability of f, the methods construct polyhedral approximations to f with the help of previously evaluated subgradients of f. To this end we introduce two general strategies for selecting and aggregating the past subgradient information. Such strategies enable one to impose a uniform upper bound on the amount of storage and work per iteration without impairing convergence. Our detailed description should help the reader to devise his or her own strategies that are tailored to particular optimization problems.
Since the algorithms to be described have a structural relationship with Pshenichny's method of linearizations (Pshenichny and Danilin, 1975), we shall now review this method. To this end, suppose momentarily that

f(x) = max{f_j(x) : j ∈ J} for all x,      (2.1)

where each f_j is a convex function with continuous gradient ∇f_j on R^N, and J is finite. Given the k-th approximation to a solution x^k ∈ R^N, the method of linearizations finds a search direction d_p^k from the solution (d_p^k, u_p^k) ∈ R^{N+1} to the following problem

minimize   ½|d|² + u,
                                                                    (2.2)
subject to f_j(x^k) + <∇f_j(x^k), d> ≤ u,  j ∈ J.

The above subproblem may be interpreted as a local first order approximation to the problem of minimizing f(x^k + d) over all d ∈ R^N. Indeed, let us introduce the following polyhedral approximation to f at x^k:

f̂_p^k(z) = max{f_j(x^k) + <∇f_j(x^k), z - x^k> : j ∈ J} for all z.      (2.3)

Then subproblem (2.2) is equivalent to the following

minimize f̂_p^k(x^k + d) + ½|d|² over all d ∈ R^N,      (2.4)

and we have

u_p^k = f̂_p^k(x^k + d_p^k).      (2.5)

At first sight it may appear that a more natural way of finding a search direction, by minimizing f̂_p^k(x^k + d) over all d, could be better. However, the latter problem may have no solution; moreover, f̂_p^k(x^k + d) is a doubtful approximation to f(x^k + d) if |d| is large. This gives the reason for the regularizing penalty term ½|d|² in (2.4).

The next point x^{k+1} = x^k + t^k d_p^k is found by searching for a stepsize t^k > 0 satisfying

f(x^k + t^k d_p^k) ≤ f(x^k) + m t^k v_p^k,      (2.6)



where m ∈ (0,1) is a fixed line search parameter and

v_p^k = u_p^k - f(x^k).      (2.7)

More specifically, t^k is the largest number from the sequence {1, 1/2, 1/4, ...} that satisfies (2.6). Such a positive number exists if v_p^k < 0. This follows from the fact that

f'(x^k; d_p^k) = max{<∇f_j(x^k), d_p^k> : f_j(x^k) = f(x^k)}

≤ max{f_j(x^k) - f(x^k) + <∇f_j(x^k), d_p^k> : j ∈ J}

= max{f_j(x^k) + <∇f_j(x^k), d_p^k> : j ∈ J} - f(x^k)

= u_p^k - f(x^k) = v_p^k,

see Corollary 1.2.6, (2.3), (2.5) and (2.7). Therefore

lim_{t↓0} [f(x^k + t d_p^k) - f(x^k)]/t = f'(x^k; d_p^k) ≤ v_p^k < m v_p^k

if m ∈ (0,1) and v_p^k < 0, which shows that (2.6) must hold if t^k is sufficiently small. On the other hand, if v_p^k ≥ 0 then the method of linearizations stops, because x^k is stationary for f.
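The stepsize rule can be sketched as follows (our naming; a cap on the number of halvings is added so the loop always terminates, a guard the text does not need since it assumes v_p^k < 0):

```python
def armijo_halving(f, x, d, v, m=0.5, max_halve=30):
    """Find the largest t in {1, 1/2, 1/4, ...} with
    f(x + t*d) <= f(x) + m*t*v (test (2.6)); v < 0 is the predicted
    descent.  Returns None if no such t is found within max_halve
    halvings."""
    fx = f(x)
    t = 1.0
    for _ in range(max_halve):
        if f(x + t * d) <= fx + m * t * v:
            return t
        t *= 0.5
    return None

# f(t) = t^2 at x = 1 with d = -2 (minus the gradient) and v = -4:
t = armijo_halving(lambda z: z * z, 1.0, -2.0, -4.0, m=0.1)
```

In this example t = 1 overshoots the minimizer, so one halving is performed and t = 1/2 is accepted.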
In fact, Pshenichny defined v_p^k in (2.6) as -|d_p^k|², which is slightly larger than v_p^k given by (2.7), and assumed that the gradients of f_j are Lipschitz continuous. However, it is easy to prove that the above version of the method of linearizations is globally convergent when each f_j is continuously differentiable, and that the rate of convergence is at least linear under standard second order sufficiency conditions even when f is nonconvex, see (Kiwiel, 1981a). Moreover, if all the functions f_j are affine, then the method finds a solution in a finite number of iterations. Therefore it seems worthwhile to extend this method to more general nondifferentiable problems.
Although our methods will not require the special form (2.1) of the objective function, they are in fact based on a similar, but implicit, representation

f(x) = max{f(y) + <g_y, x - y> : g_y ∈ ∂f(y), y ∈ R^N},      (2.8)

which is due to convexity. Since we do not assume the availability of the whole subdifferential ∂f(y) at each y in R^N, the methods will use approximate versions of (2.8) constructed as follows.

We suppose that we have a subroutine that can evaluate a subgradient g_f(x) ∈ ∂f(x) at each x ∈ R^N. Suppose that at the k-th iteration

of the algorithm we have the current point x^k ∈ R^N together with some auxiliary points y^j and subgradients g^j = g_f(y^j) for j ∈ J^k, where J^k is a nonempty subset of {1,...,k}. Define the linearizations

f_j(x) = f(y^j) + <g^j, x - y^j>,  j ∈ J^k,      (2.9)

and the current polyhedral approximation to f

f̂_s^k(x) = max{f_j(x) : j ∈ J^k}.      (2.10)

Comparing (2.1) with (2.9) and (2.10), we see that an application of one step of the method of linearizations to f̂_s^k at x^k leads to the following search direction finding subproblem

minimize   ½|d|² + u,
                                                                    (2.11)
subject to f_j^k + <g^j, d> ≤ u,  j ∈ J^k,

where

f_j^k = f_j(x^k) for j=1,...,k.      (2.12)

The above subproblem may be interpreted as a local approximation to the problem of minimizing f̂_s^k(x^k + d) over all d, and hence to the problem of minimizing f(x^k + d) over d. Let (d^k, u^k) denote the solution of (2.11), and let

v^k = u^k - f(x^k).      (2.13)

Then, as in (2.5) and (2.7), we have

u^k = f̂_s^k(x^k + d^k),      (2.14)

v^k = f̂_s^k(x^k + d^k) - f(x^k).      (2.15)

Moreover, (2.8) implies that f(x^k) ≥ f̂_s^k(x^k), hence (2.15) and the convexity of f̂_s^k yield

f̂_s^k(x^k + t d^k) ≤ (1-t) f̂_s^k(x^k) + t f̂_s^k(x^k + d^k)

≤ f(x^k) + t v^k for all t ∈ [0,1].      (2.17)

Therefore v^k may be interpreted as an approximate directional derivative of f at x^k in the direction d^k. We shall show later that v^k < 0 if x^k is nonstationary for f. Thus we may assume that v^k < 0. In general,

we may have f'(x^k; d^k) > v^k, because f̂_s^k may poorly approximate f. For this reason, the line search rule of the method of linearizations needs the following modifications.

We assume that m ∈ (0,1) and t̄ ∈ (0,1] are fixed line search parameters. First we shall search for the largest number t_L^k in {1, 1/2, 1/4, ...} that satisfies

f(x^k + t_L^k d^k) ≤ f(x^k) + m t_L^k v^k      (2.18)

and t_L^k ≥ t̄. This involves a finite number of function evaluations, because t̄ > 0 (only one if t̄ = 1). If such a number t_L^k > 0 exists, we shall set x^{k+1} = x^k + t_L^k d^k and y^{k+1} = x^{k+1} (a serious step). Otherwise we have to accept a null step by setting x^{k+1} = x^k. In this case we also know a number t_R^k ∈ [t̄, 1] satisfying

f(xk+t~d k) > f(x k) + mt~v k.

Therefore at a null step we shall set y^{k+1} = x^k + t_R^k d^k, because this new trial point will define a linearization f_{k+1} by (2.9) that satisfies

f_{k+1}(x^k + d^k) > f(x^k) + m v^k,                                 (2.19)

see Section 7. Comparing (2.10), (2.14), (2.15) and (2.19), and using the fact that x^{k+1} = x^k, v^k < 0 and m ∈ (0,1), we deduce that after a null step we have

f_s^{k+1}(x^{k+1} + d^k) > u^k,

f_s^{k+1}(x^{k+1} + d^{k+1}) = u^{k+1},

provided that

k+1 ∈ J^{k+1}.

Thus after a null step the linearization from the trial point y^{k+1} will modify both the next polyhedral approximation and the next search direction finding subproblem.
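The serious/null-step decision above can be sketched as follows (an illustrative fragment, not the book's code; the halving stepsize set and the default parameter values are assumptions):

```python
# Backtracking test of (2.18): accept the largest t in {1, 1/2, 1/4, ...}
# with t >= t_min satisfying the descent test; otherwise report a null step.
def line_search(f, x, d, v, m=0.1, t_min=0.25):
    fx = f(x)
    t = 1.0
    while t >= t_min:
        y = [xi + t * di for xi, di in zip(x, d)]
        if f(y) <= fx + m * t * v:          # sufficient descent (2.18)
            return t, y, True               # serious step
        t *= 0.5
    t_r = 2.0 * t                           # last tried stepsize; violates (2.18)
    y = [xi + t_r * di for xi, di in zip(x, d)]
    return t_r, y, False                    # null step: y only refines the model
```

In the null-step branch the returned trial point plays the role of y^{k+1} = x^k + t_R^k d^k: it is not accepted as the next iterate, but its subgradient enriches the next polyhedral approximation.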
We shall now show how to choose the next subgradient index set J^{k+1}. As noted above, we should have k+1 ∈ J^{k+1}, which is satisfied if

J^{k+1} = Ĵ^k ∪ {k+1}                                                (2.20)

for some set Ĵ^k ⊂ J^k. The obvious choice Ĵ^k = J^k, suggested by the cutting plane methods (Cheney and Goldstein, 1959; Kelley, 1960), would present serious problems with storage and computation after a large number of iterations. We may add that such a choice, i.e.

J^k = {1,...,k} for all k,

is used by Lemarechal (1978) and Mifflin (1982). It is therefore important to be able to construct Ĵ^k in a way that permits dropping some of the linear inequalities in subproblem (2.11). Part (iv) of the following lemma suggests a strategy analogous to constraint dropping strategies of the cutting plane methods due to Eaves and Zangwill (1971) and Topkis (1970a, 1970b, 1982). The other parts are taken from (Lemarechal, 1978; Wierzbicki, 1982).

Lemma 2.1. (i) The unique solution (d^k, u^k) of subproblem (2.11) always exists.
(ii) (d^k, u^k) solves (2.11) if and only if there exist Lagrange multipliers λ_j^k, j ∈ J^k, and a vector p^k ∈ R^N satisfying

λ_j^k ≥ 0, j ∈ J^k,   Σ_{j∈J^k} λ_j^k = 1,                           (2.21a)

[f_j^k + <g^j, d^k> - u^k] λ_j^k = 0,  j ∈ J^k,                      (2.21b)

p^k = Σ_{j∈J^k} λ_j^k g^j,                                           (2.21c)

d^k = -p^k,                                                          (2.21d)

u^k = -{ |p^k|^2 - Σ_{j∈J^k} λ_j^k f_j^k },                          (2.21e)

f_j^k + <g^j, d^k> ≤ u^k,  j ∈ J^k.                                  (2.21f)

(iii) The multipliers λ_j^k, j ∈ J^k, satisfy (2.21) if and only if they solve the following dual subproblem

minimize over λ   (1/2) |Σ_{j∈J^k} λ_j g^j|^2 - Σ_{j∈J^k} λ_j f_j^k,
subject to        λ_j ≥ 0, j ∈ J^k,   Σ_{j∈J^k} λ_j = 1.             (2.22)

(iv) There exists a solution λ_j^k, j ∈ J^k, of subproblem (2.22) such that the set

Ĵ^k = {j ∈ J^k: λ_j^k > 0}                                           (2.23a)

satisfies

|Ĵ^k| ≤ N+1.                                                         (2.23b)

Such a solution can be obtained by solving the following linear programming problem by the simplex method:

minimize over λ   -Σ_{j∈J^k} λ_j f_j^k,
subject to        Σ_{j∈J^k} λ_j = 1,
                  Σ_{j∈J^k} λ_j g^j = p^k,                           (2.24)
                  λ_j ≥ 0,  j ∈ J^k,

where p^k = -d^k. Moreover, (d^k, u^k) solves the following reduced subproblem

minimize    (1/2)|d|^2 + u,
subject to  f_j^k + <g^j, d> ≤ u,  j ∈ Ĵ^k.                          (2.25)

Proof. (i) Subproblem (2.11) is equivalent to the following problem

minimize  φ(d) = (1/2)|d|^2 + u_s(d)  over all d,

where u_s(d) = max{f_j^k + <g^j, d>: j ∈ J^k}. The function φ is strictly convex and satisfies φ(d) → ∞ if |d| → ∞, because u_s(d) ≥ f_j^k + <g^j, d> ≥ f_j^k - |g^j||d| for any j ∈ J^k and d ∈ R^N. Therefore d^k, the unique minimum point of φ, exists and satisfies u^k = u_s(d^k).
(ii) Subproblem (2.11) is convex and satisfies the Slater constraint qualification (let d = 0 and u = max{f_j^k: j ∈ J^k} + 1). Therefore we deduce from Lemma 1.2.16 that (2.21) is the Kuhn-Tucker condition for (d^k, u^k) to solve (2.11).
(iii) One may check that (2.21) is the Kuhn-Tucker condition for λ_j^k, j ∈ J^k, to be a solution of (2.22). Although subproblem (2.22) does not satisfy the Slater constraint qualification, the Kuhn-Tucker condition (2.21) is both necessary and sufficient here, because the constraints of (2.22) are linear; see, e.g., (Bazaraa and Shetty, 1979).
(iv) Since any solution λ_j^k, j ∈ J^k, of (2.22) must satisfy (2.21c) and p^k = -d^k is unique, we deduce that (2.22) and (2.24) have a common, nonempty set of solutions. The simplex method will find an optimal basic solution of (2.24) with no more than N+1 strictly positive components (Dantzig, 1963). Thus we get the desired multipliers satisfying (2.23). Since these multipliers also solve (2.22), parts (ii)-(iii) of the lemma imply (2.21). Therefore one may use (2.23a) and part (ii) of the lemma to complete the proof. []
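For a bundle of just two subgradients the dual subproblem (2.22) is a one-dimensional quadratic over [0,1] and has a closed-form solution. The following sketch (illustrative only, not from the book; scalar weights, vectors as lists) solves this two-element case:

```python
# Minimize (1/2)|(1-nu) g1 + nu g2|^2 - ((1-nu) f1 + nu f2) over nu in [0,1].
def dual_two(g1, f1, g2, f2):
    diff = [b - a for a, b in zip(g1, g2)]
    denom = sum(di * di for di in diff)
    if denom == 0.0:
        nu = 0.0 if f1 >= f2 else 1.0       # objective is linear in nu
    else:
        # stationary point of the quadratic, projected onto [0,1]
        nu = (f2 - f1 - sum(a * di for a, di in zip(g1, diff))) / denom
        nu = min(1.0, max(0.0, nu))
    p = [(1.0 - nu) * a + nu * b for a, b in zip(g1, g2)]
    return nu, p                            # optimal weight and p (= -d)
```

This two-element step is exactly the building block that the aggregation strategy of the next pages reuses: the aggregate pair plays the role of one of the two elements.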

We shall use the above lemma to design two different constraint dropping strategies, which will yield different algorithms. Both choices of constraints for the next direction finding subproblem are based on the following generalized cutting plane idea: having solved the current search direction finding subproblem, construct an auxiliary reduced subproblem which yields the same solution. Then obtain the next search direction finding subproblem by appending to the auxiliary subproblem the constraint generated by the new subgradient.
The first application of this principle makes use of Lemma 2.1(iv). Subproblem (2.25) is the desired auxiliary problem, which is equivalent to the original subproblem (2.11). Therefore the choice of J^{k+1} specified by (2.20) and (2.23) conforms with the above cutting plane concept. Observe that this amounts to discarding those subgradients g^j which have null multipliers λ_j^k. Such subgradients do not contribute to the current direction d^k, see (2.21d) and (2.21c).
The above strategy, based on (2.23), may be termed the subgradient selection strategy. It leads to implementable algorithms that require storage of at most N+1 past subgradients. However, this may pose serious difficulties if N is large. Therefore we describe below another strategy that overcomes this drawback.
The second strategy may be termed the subgradient aggregation strategy, because it aggregates the constraints generated by the past subgradients. The auxiliary subproblem is constructed by forming a surrogate constraint based on the Lagrange multipliers of the original subproblem. Let λ_j^k, j ∈ J^k, denote any Lagrange multipliers of (2.11), which do not necessarily satisfy (2.23), and let

f̃_p^k = Σ_{j∈J^k} λ_j^k f_j^k.                                       (2.26)

Combining this with (2.21c), we obtain

(p^k, f̃_p^k) = Σ_{j∈J^k} λ_j^k (g^j, f_j^k).                         (2.27)

The following lemma describes the auxiliary subproblem of the subgradient aggregation strategy.
Lemma 2.2. Subproblem (2.11) is equivalent to the following reduced subproblem:

minimize    (1/2)|d|^2 + u,
subject to  f̃_p^k + <p^k, d> ≤ u,                                    (2.28)
            f_j^k + <g^j, d> ≤ u,  j ∈ Ĵ^k,

where Ĵ^k is any subset of J^k (possibly empty).

Proof. Let λ̂_p^k = 1, λ̂_j^k = 0, j ∈ Ĵ^k. From (2.21) and (2.27),

λ̂_p^k ≥ 0,  λ̂_j^k ≥ 0, j ∈ Ĵ^k,  λ̂_p^k + Σ_{j∈Ĵ^k} λ̂_j^k = 1,

[f̃_p^k + <p^k, d^k> - u^k] λ̂_p^k = 0,

[f_j^k + <g^j, d^k> - u^k] λ̂_j^k = 0,  j ∈ Ĵ^k,

p^k = λ̂_p^k p^k + Σ_{j∈Ĵ^k} λ̂_j^k g^j,

d^k = -p^k,

u^k = -{ |p^k|^2 - λ̂_p^k f̃_p^k - Σ_{j∈Ĵ^k} λ̂_j^k f_j^k },

f̃_p^k + <p^k, d^k> ≤ u^k,   f_j^k + <g^j, d^k> ≤ u^k,  j ∈ Ĵ^k.

Subproblem (2.28) is of the form (2.11). Therefore the above relations and Lemma 2.1(ii) imply that (d^k, u^k) solves (2.28). Hence Lemma 2.1(i) yields that (2.11) and (2.28) have the same unique solution (d^k, u^k). []
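Computationally, the surrogate pair of (2.26)-(2.27) is nothing more than a multiplier-weighted average of the stored pairs; a minimal sketch (illustrative names, vectors as lists):

```python
# Form the aggregate subgradient (p^k, f_p^k) of (2.27) from multipliers lam,
# subgradients gs and linearization values fs (all indexed consistently).
def aggregate(lam, gs, fs):
    dim = len(gs[0])
    p = [sum(l * g[i] for l, g in zip(lam, gs)) for i in range(dim)]
    f_p = sum(l * f for l, f in zip(lam, fs))
    return p, f_p
```

Since the multipliers form a convex combination (2.21a), the resulting pair inherits the bound f(x) ≥ f_p^k + <p^k, x - x^k>, which is the content of Lemma 2.3 below.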

We shall now interpret the above results in terms of linearizations and polyhedral approximations to f. Each subgradient g^j ∈ ∂f(y^j) provides global information about the objective function, since the corresponding linearization f_j, given by (2.9), satisfies

f(x) ≥ f_j(x)  for all x.                                            (2.29)

Since f_j^k = f_j(x^k), we have

f_j(x) = f_j^k + <g^j, x - x^k>  for all x,                          (2.30)

and (2.29) becomes

f(x) ≥ f_j^k + <g^j, x - x^k>  for all x,                            (2.31)

and j = 1,...,k. Observe that the linearizations can be updated recursively:

f_j^{k+1} = f_j(x^{k+1}) = f_j^k + <g^j, x^{k+1} - x^k>,             (2.32)

so the points {y^j} need not be stored. Summing up, we see that at iteration k the subgradient information collected at the j-th iteration consists of the (N+1)-vector (g^j, f_j^k), which generates the corresponding linearization (2.30) and the constraint in subproblem (2.11), and yields the bound (2.31). Therefore we shall refer to (g^j, f_j^k) as the j-th subgradient of f at the k-th iteration, for any j = 1,...,k.
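The recursion (2.32) is what lets an implementation keep only the pairs (g^j, f_j^k) and shift the values whenever the iterate moves; a sketch (illustrative, vectors as lists):

```python
# Shift stored linearization values when the center moves, via (2.32):
# f_j^{k+1} = f_j^k + <g^j, x^{k+1} - x^k>; the trial points y^j are not needed.
def update_values(fs, gs, x_old, x_new):
    step = [a - b for a, b in zip(x_new, x_old)]
    return [f + sum(gi * si for gi, si in zip(g, step))
            for f, g in zip(fs, gs)]
```

The same one-line shift applies to the aggregate pair, cf. (2.38) below.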
We also note that in terms of the selective polyhedral approximation to f at x^k

f_s^k(x) = max{f_j^k + <g^j, x - x^k>: j ∈ J^k}  for all x,          (2.33)

the search direction finding subproblem (2.11) can be written as

minimize  f_s^k(x^k + d) + (1/2)|d|^2  over all d.                   (2.34)

Proceeding in the same spirit, we may associate with the first constraint of the reduced subproblem (2.28) the following aggregate linearization

f̃_p^k(x) = f̃_p^k + <p^k, x - x^k>  for all x,                        (2.35)

and call the associated (N+1)-vector (p^k, f̃_p^k) the aggregate subgradient of f at the k-th iteration. In view of Lemma 2.2, the aggregate subgradient (p^k, f̃_p^k) embodies all the past subgradient information that is essential for the k-th search direction finding, since an equivalent formulation of subproblems (2.28) and (2.11) is

minimize  f̃_p^k(x^k + d) + (1/2)|d|^2  over all d.                   (2.36)

Therefore one may use the aggregate linearization (2.35) for search direction finding at the next point x^{k+1}, where

f̃_p^k(x) = f_p^{k+1} + <p^k, x - x^{k+1}>  for all x,                (2.37)

with f_p^{k+1} defined similarly to (2.32):

f_p^{k+1} = f̃_p^k(x^{k+1}) = f̃_p^k + <p^k, x^{k+1} - x^k>.           (2.38)

Thus at the (k+1)-st iteration the linearization (2.35) is generated by the updated aggregate subgradient (p^k, f_p^{k+1}).
Our use for aggregation of multipliers that form convex combinations, cf. (2.21a), yields the following useful property.

Lemma 2.3. The aggregate linearization (2.35) defined by (2.27) is a convex combination of the linearizations (2.9). Moreover

f(x) ≥ f̃_p^k(x)  for all x.                                          (2.39)

Proof. From (2.35), (2.27) and (2.30),

f̃_p^k(x) = Σ_{j∈J^k} λ_j^k f_j^k + <Σ_{j∈J^k} λ_j^k g^j, x - x^k> = Σ_{j∈J^k} λ_j^k [f_j^k + <g^j, x - x^k>] = Σ_{j∈J^k} λ_j^k f_j(x)

for each x. The above relations, (2.21a) and (2.29) yield (2.39). []

Following the generalized cutting plane concept introduced above, we obtain the next search direction finding subproblem of the method with aggregation in two steps. First, we use aggregation for deriving the auxiliary subproblem (2.28) and the aggregate linearization (2.35). Next, we update the linearizations according to (2.32) and (2.38) and append the new constraint generated by the latest subgradient g^{k+1} = g_f(y^{k+1}). Thus the next subproblem becomes: find (d^{k+1}, u^{k+1}) to

minimize    (1/2)|d|^2 + u,
subject to  f_j^{k+1} + <g^j, d> ≤ u,  j ∈ J^{k+1} = Ĵ^k ∪ {k+1},    (2.40)
            f_p^{k+1} + <p^k, d> ≤ u.

Of course, the above subproblem need not be equivalent to the (k+1)-st subproblem (2.11), e.g. we may have Ĵ^k = ∅ in (2.40), hence the resulting algorithms will differ. However, in order to stress their similarities, we denote the corresponding variables by the same symbols.
Since the second step in the above derivation of (2.40) required no reference to subproblem (2.11), if we now show how to aggregate subproblem (2.40) we shall in fact define recursively the aggregate subgradient method that does not need the points y^j, j = 1,...,k-1, at the k-th iteration. This is quite easy if one observes that (2.40) is similar to (2.11). Consequently, one can aggregate subproblem (2.40) in essentially the same manner as shown above for subproblem (2.11).
In this way we arrive at the following description of consecutive aggregate subproblems. Let (d^k, u^k) ∈ R^{N+1} denote the solution to the following k-th aggregate search direction finding subproblem (cf. (2.40)):

minimize    (1/2)|d|^2 + u,
subject to  f_j^k + <g^j, d> ≤ u,  j ∈ J^k,                          (2.41)
            f_p^k + <p^{k-1}, d> ≤ u.

In order to be able to use (2.41) for k = 1, we shall initialize the method by choosing x^1 ∈ R^N and setting y^1 = x^1 and

p^0 = g^1 = g_f(y^1),  f_p^1 = f_1^1 = f(y^1),  J^1 = {1}.           (2.42)

Let λ_j^k, j ∈ J^k, and λ_p^k denote any Lagrange multipliers of (2.41). Since subproblem (2.41) is of the form (2.11), Lemma 2.1 implies that these multipliers satisfy

λ_j^k ≥ 0, j ∈ J^k,  λ_p^k ≥ 0,  Σ_{j∈J^k} λ_j^k + λ_p^k = 1,        (2.43a)

[f_j^k + <g^j, d^k> - u^k] λ_j^k = 0,  j ∈ J^k,                      (2.43b)

[f_p^k + <p^{k-1}, d^k> - u^k] λ_p^k = 0,                            (2.43c)

p^k = Σ_{j∈J^k} λ_j^k g^j + λ_p^k p^{k-1},                           (2.43d)

d^k = -p^k,                                                          (2.43e)

u^k = -{ |p^k|^2 - Σ_{j∈J^k} λ_j^k f_j^k - λ_p^k f_p^k }.            (2.43f)

Similarly to (2.26), we define the value of the current aggregate linearization

f̃_p^k = Σ_{j∈J^k} λ_j^k f_j^k + λ_p^k f_p^k,                         (2.44)

and obtain analogously to (2.27)

(p^k, f̃_p^k) = Σ_{j∈J^k} λ_j^k (g^j, f_j^k) + λ_p^k (p^{k-1}, f_p^k). (2.45)

As above, we shall use (2.38) to update the aggregate linearization (2.35) when x^{k+1} ≠ x^k. This completes the derivation of the method with subgradient aggregation.
We may add that for the method with aggregation Lemma 2.2 can be rephrased as follows: subproblem (2.28) is equivalent to subproblem (2.41), which in turn is equivalent to the problem

minimize  f̃^k(x^k + d) + (1/2)|d|^2  over all d,                     (2.46)

where f̃^k is the k-th aggregate polyhedral approximation to f:

f̃^k(x) = max{f̃_p^{k-1}(x), f_j(x): j ∈ J^k}
        = max{f_p^k + <p^{k-1}, x - x^k>, f_j^k + <g^j, x - x^k>: j ∈ J^k}.  (2.47)

In Section 4 we shall show that Lemma 2.3 holds also for the method with aggregation.

Remark 2.4. Convergence of the method which uses the aggregate subproblems (2.41) with J^k = {k} can be slow, since only two linearizations may provide insufficient approximation to the nondifferentiable objective function. Using more subgradients for search direction finding enhances convergence, but at the cost of increased storage and work per iteration. To strike a balance, one may use the following strategy. Let M_g ≥ 2 denote a user-supplied bound on the number of subgradients (including the aggregate subgradient) that the algorithm may use for each search direction finding. Then one may choose the set J^{k+1} on the basis of the k-th Lagrange multipliers subject to the following requirements:

J^{k+1} = {k+1} ∪ Ĵ^k,                                               (2.48a)

Ĵ^k ⊂ {j ∈ J^k: λ_j^k > 0},                                          (2.48b)

|Ĵ^k| ≤ M_g - 2,                                                     (2.48c)

with Ĵ^k containing the largest indices corresponding to λ_j^k > 0. This ensures that the most "active" subgradients will not be prematurely discarded.
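The rule (2.48) can be sketched as follows (illustrative only; the tie-breaking and data-structure choices are assumptions):

```python
# Keep at most M_g - 2 of the largest indices with positive multipliers,
# then add the newest index k+1, cf. (2.48a)-(2.48c); 'lam' maps index -> multiplier.
def next_index_set(k, J, lam, M_g):
    active = sorted(j for j in J if lam[j] > 0.0)
    kept = active[-(M_g - 2):] if M_g > 2 else []
    return set(kept) | {k + 1}
```

With M_g = 2 the set reduces to {k+1}, which is the aggregate-only variant discussed at the start of the remark.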

Remark 2.5. If the objective function is of the form

f(x) = max{f_j(x): j ∈ J}

and it is possible to calculate some subgradients g^{k,j} ∈ ∂f_j(x^k), j ∈ J, then one may increase the efficiency of the above methods by appending the constraints

f_j(x^k) + <g^{k,j}, d> ≤ u,  j ∈ J,                                 (2.49)

to the search direction finding subproblems (2.11) and (2.41), for each k. One may also replace the set J in (2.49) with the set of indices of ε-active functions J(ε) = {j ∈ J: f_j(x^k) ≥ f(x^k) - ε}, for some ε ≥ 0. Clearly, aggregation can be used for the resulting augmented subproblems under an appropriate change in notation. The subsequent convergence results remain valid for such modifications.

The above remarks suggest that there is much freedom within the framework of subgradient selection and aggregation strategies for constructing particular algorithms tailored to special classes of problems.
We end this section by commenting on the relations of the above described methods with other algorithms. As shown above, the methods generalize Pshenichny's method of linearizations for minimax problems, which in turn extends the classical method of steepest descent for minimizing smooth functions. On the other hand, subproblems (2.11) and (2.41) may be seen as reduced versions of the Lemarechal (1978) subproblems with J^k = {1,...,k}, cf. (1.3.22).

Remark 2.6. We recall from Section 1.3 that an approximation to f at x^k

f̂(x) = max{f(x^k) + <g, x - x^k>: g ∈ ∂f(x^k)}                       (2.50a)

may yield a descent direction for f at x^k as a solution to the problem

minimize  f̂(x^k + d) + (1/2)|d|^2  over all d,                       (2.50b)

provided that x^k is nonstationary, see Lemma 1.2.13. Comparing (2.50) with (2.33), (2.34), (2.46) and (2.47), we arrive at the following interpretation of the methods described above. Instead of using the "complete" approximation (2.50a), which would require the knowledge of all the linearizations associated with the current point x^k via ∂f(x^k), the methods use approximations (2.33) and (2.47), which result from linearizations calculated at many trial points around x^k.

3. The Basic Algorithm

We now have all the necessary ingredients to state the simplest version of the aggregate subgradient method for solving the problem in question. Its more efficient modifications are discussed in subsequent sections.

Algorithm 3.1.
Step 0 (Initialization). Select the starting point x^1 ∈ R^N and set y^1 = x^1. Choose a final accuracy tolerance ε_s ≥ 0 and a line search parameter m ∈ (0,1). Set p^0 = g^1 = g_f(y^1), f_p^1 = f_1^1 = f(y^1) and J^1 = {1}. Set the counters k = 1, l = 0 and k(0) = 1.
Step 1 (Direction finding). Find multipliers λ_j^k, j ∈ J^k, and λ_p^k that solve the following k-th dual subproblem

minimize    (1/2) |Σ_{j∈J^k} λ_j g^j + λ_p p^{k-1}|^2 - Σ_{j∈J^k} λ_j f_j^k - λ_p f_p^k,
subject to  λ_j ≥ 0, j ∈ J^k,  λ_p ≥ 0,  Σ_{j∈J^k} λ_j + λ_p = 1.    (3.1)

Calculate the aggregate subgradient (p^k, f̃_p^k) by (2.45). Set d^k = -p^k and

v^k = -{ |p^k|^2 + f(x^k) - f̃_p^k }.                                 (3.2)

Step 2 (Stopping criterion). Set

w^k = (1/2)|p^k|^2 + f(x^k) - f̃_p^k.                                 (3.3)

If w^k ≤ ε_s, terminate; otherwise, go to Step 3.
Step 3 (Line search). Set y^{k+1} = x^k + d^k. If

f(x^k + d^k) ≤ f(x^k) + m v^k,                                       (3.4)

then set t_L^k = 1 (a serious step), set k(l+1) = k+1 and increase l by 1; otherwise, i.e. if (3.4) is violated, set t_L^k = 0 (a null step).
Step 4 (Linearization updating). Set x^{k+1} = x^k + t_L^k d^k. Choose a set Ĵ^k ⊂ {1,...,k} and calculate the linearization values f_j^{k+1}, j ∈ Ĵ^k, and f_p^{k+1} by (2.32) and (2.38). Evaluate g^{k+1} = g_f(y^{k+1}) and

f_{k+1}^{k+1} = f(y^{k+1}) + <g^{k+1}, x^{k+1} - y^{k+1}>.           (3.5)

Set J^{k+1} = Ĵ^k ∪ {k+1}. Increase k by 1 and go to Step 1.
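As an illustration only (not the book's implementation), the simplest instance of Algorithm 3.1, with Ĵ^k = ∅ so that each direction comes from the newest subgradient and the aggregate pair alone, can be sketched in one dimension; the closed-form solution of the then one-dimensional dual (3.1), and all parameter values, are assumptions made for this sketch:

```python
# 1-D sketch of Algorithm 3.1 with J^k = {k}: the dual (3.1) has weight
# (1 - nu) on the aggregate pair (p, fp) and nu on the newest pair (g, fj).
def solve(f, gf, x, eps=1e-6, m=0.1, max_it=100):
    p, fp = gf(x), f(x)                  # Step 0: aggregate pair, cf. (2.42)
    g, fj = p, fp                        # newest subgradient pair
    for _ in range(max_it):
        denom = (g - p) ** 2             # Step 1: closed-form 1-D dual
        if denom == 0.0:
            nu = 0.0 if fp >= fj else 1.0
        else:
            nu = min(1.0, max(0.0, (fj - fp - p * (g - p)) / denom))
        p = (1.0 - nu) * p + nu * g      # aggregate subgradient, cf. (2.45)
        fp = (1.0 - nu) * fp + nu * fj
        w = 0.5 * p * p + f(x) - fp      # Step 2: stationarity measure (3.3)
        if w <= eps:
            return x
        d, v = -p, -(p * p + f(x) - fp)  # (3.2)
        y = x + d                        # Step 3: trial point y^{k+1}
        g = gf(y)
        if f(y) <= f(x) + m * v:         # serious step (3.4)
            fp += p * d                  # Step 4: shift value, cf. (2.38)
            x, fj = y, f(y)
        else:                            # null step: keep x
            fj = f(y) + g * (x - y)      # cf. (3.5)
    return x
```

On f(x) = |x| this sketch takes serious steps toward the origin and then, after one null step, aggregates the two opposite subgradients into p^k = 0, driving w^k to zero, which illustrates why even the two-element bundle suffices for convergence (if slowly, cf. Remark 2.4).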

Remark 3.2. It follows from Lemma 2.1 that in Algorithm 3.1 (d^k, u^k) solves the primal subproblem (2.41), where u^k is given by (2.43f), and that λ_j^k, j ∈ J^k, and λ_p^k are the associated Lagrange multipliers. Thus one may equivalently solve subproblem (2.41) in Step 1 of the above algorithm.

Remark 3.3. In Step 2 of Algorithm 3.1 we always have

p^k ∈ ∂_ε f(x^k) and |p^k| ≤ (2w^k)^{1/2}  for ε = w^k,              (3.6a)

see the next section. Therefore

f(x) ≥ f(x^k) + <p^k, x - x^k> - w^k ≥ f(x^k) - |p^k||x - x^k| - w^k
     ≥ f(x^k) - (2w^k)^{1/2}|x - x^k| - w^k                          (3.6b)

for each x in R^N. It follows that if f has a minimum point x̄, then

min{f(x): x ∈ R^N} = f(x̄) ≥ f(x^k) - (2w^k)^{1/2}|x̄ - x^k| - w^k.

The above estimate justifies the stopping criterion in Step 2 of the algorithm.
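The bound (3.6b) gives a computable optimality certificate over any ball around x^k; a one-line sketch (illustrative, assuming a known radius R bounding |x̄ - x^k|):

```python
import math

# Lower bound from (3.6b): for all x with |x - x_k| <= R,
# f(x) >= f(x_k) - sqrt(2 * w_k) * R - w_k.
def lower_bound(f_xk, w_k, R):
    return f_xk - math.sqrt(2.0 * w_k) * R - w_k
```

In particular, when w^k = 0 the bound says f(x^k) is optimal over the whole ball, which is the limiting case the stopping test ε_s = 0 targets.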

4. Convergence of the Basic Algorithm

In this section we shall show that each sequence {x^k} generated by Algorithm 3.1 is minimizing, i.e. f(x^k) → inf{f(x): x ∈ R^N}, and that {x^k} converges to a minimum point of f whenever f attains its minimum. Naturally, the convergence results assume that the final accuracy tolerance ε_s is set to zero. For convenience, we precede the main results by several lemmas that analyze the properties of Algorithm 3.1.
We start by showing that the aggregate subgradients are convex combinations of the past subgradients.

Lemma 4.1. Suppose that Algorithm 3.1 did not stop before the n-th iteration, n ≥ 1. Then for each k = 1,...,n there exist multipliers λ̄_j^k, j = 1,...,k, satisfying

(p^k, f̃_p^k) = Σ_{j=1}^k λ̄_j^k (g^j, f_j^k),                         (4.1a)

λ̄_j^k ≥ 0, j = 1,...,k,   Σ_{j=1}^k λ̄_j^k = 1.                       (4.1b)

Moreover, for each k satisfying 1 < k ≤ n one has

(p^{k-1}, f_p^k) = Σ_{j=1}^{k-1} λ̄_j^{k-1} (g^j, f_j^k),             (4.2a)

λ̄_j^{k-1} ≥ 0, j = 1,...,k-1,   Σ_{j=1}^{k-1} λ̄_j^{k-1} = 1.         (4.2b)

Proof. The proof will proceed by induction. Let

λ̄_1^1 = 1,                                                           (4.3a)

λ_j^k = 0  for each j ∈ {1,...,k} \ J^k and k ≥ 1,                   (4.3b)

λ̄_j^k = λ_j^k + λ_p^k λ̄_j^{k-1}  for j = 1,...,k-1,  λ̄_k^k = λ_k^k,  for k > 1.  (4.3c)

If k = 1 then (4.1) follows from (4.3a) and (2.45), because (p^1, f̃_p^1) = (g^1, f_1^1) = (p^0, f_p^1). Therefore (2.32) and (2.38) yield (4.2) for k = 2. Suppose that (4.2) holds for some k = n ≥ 2. Then, since

(p^k, f̃_p^k) = Σ_{j=1}^k λ_j^k (g^j, f_j^k) + λ_p^k (p^{k-1}, f_p^k),

λ_j^k ≥ 0, j = 1,...,k,  λ_p^k ≥ 0,  Σ_{j=1}^k λ_j^k + λ_p^k = 1

by (2.43a), (2.45) and (4.3b), we obtain

(p^k, f̃_p^k) = λ_k^k (g^k, f_k^k) + Σ_{j=1}^{k-1} (λ_j^k + λ_p^k λ̄_j^{k-1})(g^j, f_j^k) = Σ_{j=1}^k λ̄_j^k (g^j, f_j^k),

λ̄_j^k ≥ 0  for j = 1,...,k,

Σ_{j=1}^k λ̄_j^k = λ_k^k + Σ_{j=1}^{k-1} (λ_j^k + λ_p^k λ̄_j^{k-1}) = Σ_{j=1}^k λ_j^k + λ_p^k Σ_{j=1}^{k-1} λ̄_j^{k-1} = Σ_{j=1}^k λ_j^k + λ_p^k = 1,

which yields (4.1) for k = n. Next,

f_p^{k+1} = f̃_p^k + <p^k, x^{k+1} - x^k> = Σ_{j=1}^k λ̄_j^k f_j^k + <Σ_{j=1}^k λ̄_j^k g^j, x^{k+1} - x^k> = Σ_{j=1}^k λ̄_j^k [f_j^k + <g^j, x^{k+1} - x^k>] = Σ_{j=1}^k λ̄_j^k f_j^{k+1}

from (2.38), (4.1a) and (2.32). Therefore (4.2) holds for k = n+1, and the induction step is complete. []

Our convergence analysis hinges on the interpretation of the past subgradients and the aggregate subgradients in terms of ε-subgradients of the objective function. In the following, suppose that Algorithm 3.1 did not terminate before the k-th iteration, for some k ≥ 1. Define the linearization errors

α_j^k = f(x^k) - f_j^k,  j = 1,...,k,                                (4.4a)

α_p^k = f(x^k) - f_p^k,                                              (4.4b)

α̃_p^k = f(x^k) - f̃_p^k,                                              (4.4c)

which may be associated with the subgradients g^j, p^{k-1} and p^k as follows.

Lemma 4.2. At the k-th iteration of Algorithm 3.1, one has

g^j ∈ ∂_ε f(x^k)  for ε = α_j^k,  j = 1,...,k,                       (4.5a)

p^{k-1} ∈ ∂_ε f(x^k)  for ε = α_p^k,                                 (4.5b)

p^k ∈ ∂_ε f(x^k)  for ε = α̃_p^k,                                     (4.5c)

α_j^k, α_p^k, α̃_p^k ≥ 0,  j = 1,...,k.                               (4.5d)

Proof. From (2.31) and (4.4a), for each x in R^N we have

f(x) ≥ f(x^k) + <g^j, x - x^k> - [f(x^k) - f_j^k] = f(x^k) + <g^j, x - x^k> - α_j^k,   (4.6)

hence (4.5a) follows from the definition of the ε-subdifferential, see (1.2.86). Setting x = x^k in (4.6), we obtain α_j^k ≥ 0. By (4.1) and (4.6),

f(x) = Σ_{j=1}^k λ̄_j^k f(x) ≥ Σ_{j=1}^k λ̄_j^k [f_j^k + <g^j, x - x^k>] = f̃_p^k + <p^k, x - x^k> = f(x^k) + <p^k, x - x^k> - α̃_p^k,

which proves (4.5c). Setting x = x^k, we get α̃_p^k ≥ 0. The rest follows similarly from (4.2) and (4.6). []
Remark 4.3. In view of (4.5), the values of α_j^k, α_p^k and α̃_p^k indicate the distance from g^j, p^{k-1} and p^k to the subdifferential of f at x^k, respectively. For instance, the value of α̃_p^k ≥ 0 indicates how much p^k differs from being a member of ∂f(x^k); if α̃_p^k = 0 we have p^k ∈ ∂f(x^k).
P

The following result will justify the stopping criterion of the algorithm.

Lemma 4.4. At the k-th iteration of Algorithm 3.1, one has

w^k = (1/2)|p^k|^2 + α̃_p^k,                                          (4.7a)

v^k = -{ |p^k|^2 + α̃_p^k },                                          (4.7b)

v^k ≤ -w^k ≤ 0.                                                      (4.7c)

Proof. This follows immediately from (3.2), (3.3), (4.4c) and (4.5d). []

Remark 4.5. The variable w^k may be termed a stationarity measure of the current point x^k, for each k, because (1/2)|p^k|^2 indicates how much p^k differs from the null vector and α̃_p^k measures the distance from p^k to ∂f(x^k) (x^k is stationary if 0 ∈ ∂f(x^k)). The estimates (3.6), which follow from (4.5c) and (4.7a), show that x^k is approximately optimal when the value of w^k is small.

In what follows we assume that the final accuracy tolerance ε_s is set to zero. Since the algorithm stops if and only if w^k ≤ ε_s, (4.7c) and (3.6) yield

Lemma 4.5. If Algorithm 3.1 terminates at the k-th iteration, then x^k is a minimum point of f.

From now on we suppose that the algorithm does not terminate, i.e. w^k > 0 for all k. Since the line search rules imply that we always have

f(x^{k+1}) ≤ f(x^k) + m t_L^k v^k                                    (4.8)

with m > 0 and t_L^k ≥ 0, the fact that v^k ≤ -w^k < 0 (see (4.7c)) yields that the sequence {f(x^k)} is nonincreasing.
The next result states a fundamental property of the stationarity measures {w^k}.

Lemma 4.6. Suppose that there exist an infinite set K ⊂ {1,2,...} and a point x̄ ∈ R^N satisfying x^k →_K x̄ and w^k →_K 0. Then x̄ minimizes f.

Proof. Passing to the limit in (3.6b) with k → ∞, k ∈ K, we obtain f(x) ≥ f(x̄) for each x in R^N. []

We shall need the following auxiliary result.

[.emma 4.7. S u p p o s e that the sequence {f(xk)} is bounded from below,


i.e. f(xk)_>c for some fixed c and all k. Then

{tkLlpkt 2 + tTk ~p}


-k -< [f(x I) c]/m. (4.9)
k=l

Proof. It follows from (4.8) that

f(x I) - f(x k) = f(x I) - f(x 2) +...+ f(x k-l) _ f(x k)

k-i
_~m ~ tL( -v i ) .
i=l

Dividing the above inequality by m>0, letting k approach infinity and


using (4.7b) and the assumption that f(xk)_>c, we obtain the desired
relation (4.9). []

Note that the rules of Step 3 of Algorithm 3.1 imply

x^k = x^{k(l)}  for k = k(l), k(l)+1, ..., k(l+1)-1,                 (4.10)

where we set k(l+1) = ∞ if the number l of serious steps stays bounded, i.e. if x^k = x^{k(l)} for some fixed l and all k ≥ k(l).
The case of infinitely many serious steps is analyzed in the following lemma.
Lemma 4.8. Suppose that there exist an infinite set L ⊂ {1,2,...} and a point x̄ ∈ R^N such that x^{k(l)} → x̄ as l → ∞, l ∈ L. Then x̄ is a minimum point of f.

Proof. Let K = {k(l+1)-1: l ∈ L}. Observe that the line search rules imply t_L^k = 1 for all k ∈ K, while (4.10) yields

x^k →_K x̄.                                                           (4.11a)

Since {f(x^k)} is nonincreasing, (4.11a) and the continuity of f imply f(x^k) → f(x̄). Then Lemma 4.7, (4.7) and the fact that t_L^k = 1 for all k ∈ K yield

w^k →_K 0.                                                           (4.11b)

In view of Lemma 4.6, (4.11) yields the desired conclusion. []

In order to show that the stationarity measures {w^k} tend to zero in the case of a finite number of serious steps, we have to analyze the dual search direction finding subproblems.

Lemma 4.9. At the k-th iteration of Algorithm 3.1, k ≥ 1, w^k is the optimal value of the following problem

minimize    (1/2) |Σ_{j∈J^k} λ_j g^j + λ_p p^{k-1}|^2 + Σ_{j∈J^k} λ_j α_j^k + λ_p α_p^k,
subject to  λ_j ≥ 0, j ∈ J^k,  λ_p ≥ 0,  Σ_{j∈J^k} λ_j + λ_p = 1,    (4.12)

which is equivalent to subproblem (3.1).

Proof. For each λ satisfying the constraints of (4.12),

Σ_{j∈J^k} λ_j α_j^k + λ_p α_p^k = f(x^k) - Σ_{j∈J^k} λ_j f_j^k - λ_p f_p^k

from (4.4a) and (4.4b), which proves the equivalence of (3.1) and (4.12). Since λ_j^k, j ∈ J^k, and λ_p^k solve (3.1), the optimal value of (4.12) is

(1/2) |Σ_{j∈J^k} λ_j^k g^j + λ_p^k p^{k-1}|^2 + f(x^k) - Σ_{j∈J^k} λ_j^k f_j^k - λ_p^k f_p^k = (1/2)|p^k|^2 + f(x^k) - f̃_p^k = w^k

from (2.45) and (3.3). []

The following result, which describes problems similar to (4.12), will be frequently used in subsequent chapters. It generalizes Lemma 4.10 in (Mifflin, 1977b).

Lemma 4.10. Suppose that N-vectors p, g and d and numbers m ∈ (0,1), C, v, w, α̃_p ≥ 0 and α satisfy

d = -p,                                                              (4.13a)

w = (1/2)|p|^2 + α̃_p,                                                (4.13b)

v = -{ |p|^2 + α̃_p },                                                (4.13c)

-α + <g, d> ≥ m v,                                                   (4.13d)

C ≥ max{ |p|, |g|, α̃_p, 1 }.                                         (4.13e)

Let

Q(ν) = (1/2)|(1-ν)p + νg|^2 + (1-ν)α̃_p + να  for ν ∈ R,              (4.14a)

w̄ = min{Q(ν): ν ∈ [0,1]}.                                            (4.14b)

Then

w̄ ≤ φ_C(w),                                                          (4.15)

where

φ_C(t) = t - (1-m)^2 t^2 / (8C^2).                                   (4.16)

Proof. Simple calculations yield

Q(ν) = (1/2) ν^2 |p-g|^2 + ν[<p,g> - |p|^2] + ν(α - α̃_p) + w.        (4.17)

From (4.13a,c,d),

<p,g> ≤ m{ |p|^2 + α̃_p } - α,

hence (4.17) yields

Q(ν) ≤ (1/2) ν^2 |p-g|^2 - ν(1-m)[ |p|^2 + α̃_p ] + w

for all ν ≥ 0. Since m ∈ (0,1) and |p|^2 ≥ 0, we obtain

Q(ν) ≤ (1/2) ν^2 |p-g|^2 - ν(1-m)w + w  for all ν ∈ [0,1].

By (4.13e), |p-g|^2 ≤ (|p|+|g|)^2 ≤ 4C^2, hence

Q(ν) ≤ 2C^2 ν^2 - ν(1-m)w + w  for all ν ∈ [0,1].                    (4.18)

Denoting the right side of (4.18) by q(ν), we check that q is minimized by ν̄ = (1-m)w/(4C^2) ≤ (1-m)[C^2/2 + C]/(4C^2) < 1, yielding

q(ν̄) = w - (1-m)^2 w^2/(8C^2).                                       (4.19)

Since ν̄ ∈ [0,1], (4.14b), (4.18) and (4.19) complete the proof. []
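A quick numerical check of Lemma 4.10 in the scalar case (illustrative only; the concrete values below are chosen to satisfy (4.13)) confirms that the best interpolated pair never exceeds φ_C(w):

```python
# Q from (4.14a) for scalar p, g, and phi_C from (4.16).
def Q(nu, p, g, a_p, a):
    q = (1.0 - nu) * p + nu * g
    return 0.5 * q * q + (1.0 - nu) * a_p + nu * a

def phi(t, C, m):
    return t - (1.0 - m) ** 2 * t ** 2 / (8.0 * C * C)
```

Minimizing Q over a fine grid of ν ∈ [0,1] then realizes the bound (4.15), which is the contraction that drives w^k → 0 during consecutive null steps.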

Applying the above lemma to Algorithm 3.1, we obtain

Lemma 4.11. Suppose that t_L^{k-1} = 0 for some k > 1. Then

-α_k^k + <g^k, d^{k-1}> ≥ m v^{k-1},                                 (4.20)

w^k ≤ φ_{C^k}(w^{k-1}),                                              (4.21)

where φ_C is given by (4.16) and C^k is any number satisfying

C^k ≥ max{ |p^{k-1}|, |g^k|, α̃_p^{k-1}, 1 }.                         (4.22)

Proof. (i) t_L^{k-1} = 0 if

f(y^k) - f(x^{k-1}) > m v^{k-1}.

Then x^k = x^{k-1} + t_L^{k-1} d^{k-1} = x^{k-1} and y^k = x^{k-1} + d^{k-1}, so

-α_k^k + <g^k, d^{k-1}> = -[f(x^k) - f(y^k) - <g^k, x^k - y^k>] + <g^k, d^{k-1}> = f(y^k) - f(x^k).

Combining the above relations, we obtain (4.20).
(ii) Define the multipliers

λ_k(ν) = ν,  λ_j(ν) = 0, j ∈ J^k \ {k},  λ_p(ν) = 1 - ν              (4.23)

for each ν ∈ [0,1]. Note that k ∈ J^k by the rules of Step 4. Moreover, for each ν,

Σ_{j∈J^k} λ_j(ν) g^j + λ_p(ν) p^{k-1} = (1-ν) p^{k-1} + ν g^k,
                                                                     (4.24)
Σ_{j∈J^k} λ_j(ν) α_j^k + λ_p(ν) α_p^k = (1-ν) α̃_p^{k-1} + ν α_k^k.

This follows from the fact that α_p^k = f(x^k) - f_p^k = f(x^{k-1}) - f̃_p^{k-1} = α̃_p^{k-1} if x^k = x^{k-1}, see (4.4) and (2.38). Since for each ν ∈ [0,1] the multipliers (4.23) are feasible for (4.12), we deduce from (4.24) that w^k, the optimal value of (4.12), cannot exceed the optimal value of the following problem

minimize    (1/2)|(1-ν)p^{k-1} + νg^k|^2 + (1-ν)α̃_p^{k-1} + να_k^k,
subject to  ν ∈ [0,1].                                               (4.25)

Therefore we obtain the desired conclusion from Lemma 4.10, (2.43e), (4.7a), (4.7b), (4.20) and (4.22). []

The following result deals with the case of a finite number of serious steps of the algorithm.

Lemma 4.12. Suppose that the number l of serious steps of Algorithm 3.1 stays bounded, i.e. x^k = x^{k(l)} for all k ≥ k(l). Then the point x̄ = x^{k(l)} minimizes f.

Proof. Suppose that t_L^k = 0 for all k ≥ k̄ and some fixed k̄. From Lemma 4.11, we have

0 < w^k ≤ w^{k-1}  for all k > k̄.                                    (4.26a)

In particular (1/2)|p^k|^2 + α̃_p^k = w^k ≤ w^k̄ for all k ≥ k̄, hence there exists a constant C_1 < ∞ satisfying

max{ |p^{k-1}|, α̃_p^{k-1}, 1 } ≤ C_1  for all k > k̄.                 (4.26b)

Since y^k = x^{k-1} + d^{k-1} = x̄ + d^{k-1} = x̄ - p^{k-1} for all k > k̄, it follows from (4.26b) that the sequence {y^k} is bounded. Therefore from the local boundedness of ∂f (see Lemma 1.2.2) follows the existence of a constant C_2 < ∞ satisfying |g^k| = |g_f(y^k)| ≤ C_2 for all k > k̄. By (4.26), this gives

max{ |p^{k-1}|, |g^k|, α̃_p^{k-1}, 1 } ≤ C  for all k > k̄,

where C = max{C_1, C_2}. Thus (4.22) holds for constant C^k = C and all k > k̄. Consequently, (4.26a), (4.21), (4.16) and the fact that m ∈ (0,1) is fixed imply w^k → 0. Combining this with Lemma 4.6 and the fact that x^k = x̄ for all k ≥ k̄, we complete the proof. []

Combining Lemma 4.8 with Lemma 4.12 and using (4.10), we obtain

Theorem 4.13. Every accumulation point of the sequence {x^k} generated by Algorithm 3.1 is a minimum point for f.

The following lemma provides a sufficient condition for {x^k} to have accumulation points.

Lemma 4.14. Suppose that a point x̂ ∈ R^N satisfies f(x̂) ≤ f(x^k) for all k. Then the sequence {x^k} is bounded and

|x̂ - x^k|^2 ≤ |x̂ - x^n|^2 + Σ_{i=n}^{k-1} { |x^{i+1} - x^i|^2 + 2 t_L^i α̃_p^i }  for k > n ≥ 1,   (4.27a)

Σ_{i=n}^∞ { |x^{i+1} - x^i|^2 + 2 t_L^i α̃_p^i } → 0  as n → ∞.       (4.27b)

Proof. From (4.5c), 0 ≥ f(x̂) - f(x^k) ≥ <p^k, x̂ - x^k> - α̃_p^k, hence

<p^k, x̂ - x^k> ≤ α̃_p^k  for all k.                                   (4.28)

Since we always have x^{k+1} - x^k = t_L^k d^k = -t_L^k p^k and t_L^k ≥ 0, (4.28) implies -<x̂ - x^k, x^{k+1} - x^k> ≤ t_L^k α̃_p^k. Therefore

|x̂ - x^{k+1}|^2 = |x̂ - x^k|^2 - 2<x̂ - x^k, x^{k+1} - x^k> + |x^{k+1} - x^k|^2
               ≤ |x̂ - x^k|^2 + |x^{k+1} - x^k|^2 + 2 t_L^k α̃_p^k,

which yields (4.27a). Next, we always have |x^{k+1} - x^k|^2 = (t_L^k)^2 |p^k|^2 ≤ t_L^k |p^k|^2, because t_L^k ∈ [0,1], hence Lemma 4.7 implies

Σ_{k=1}^∞ { |x^{k+1} - x^k|^2 + 2 t_L^k α̃_p^k } < ∞.

This proves both the boundedness of {x^k} (by (4.27a)) and (4.27b). []


We now state the principal result. Let

X̄ = {x̄ ∈ R^N: f(x̄) ≤ f(x) for all x}.

Theorem 4.15. If the solution set X̄ is nonempty, then each sequence {x^k} calculated by Algorithm 3.1 converges to some point x̄ ∈ X̄.

Proof. If x̂ ∈ X̄ then f(x̂) ≤ f(x^k) for all k, hence Lemma 4.14 implies the boundedness of {x^k}. By Theorem 4.13, {x^k} has an accumulation point x̄ ∈ X̄. It remains to show that x^k → x̄. Take any δ > 0. Since f(x̄) ≤ f(x^k) for all k, Lemma 4.14 implies that there exists a number n_1 such that

|x̄ - x^k|^2 ≤ |x̄ - x^n|^2 + δ/2  for all k > n ≥ n_1.                (4.29)

Since x̄ is an accumulation point of {x^k}, there exists a number n ≥ n_1 such that |x̄ - x^n|^2 ≤ δ/2. Then (4.29) yields |x̄ - x^k|^2 ≤ δ for all k > n. Since δ was arbitrary, this proves x^k → x̄ as k → ∞. []

Even when f has no minimum points, we still have the following result.

Theorem 4.16. Each sequence {x^k} constructed by Algorithm 3.1 is minimizing:

f(x^k) → inf{f(x): x ∈ R^N}.

Proof. In view of Theorem 4.15, it suffices to consider the case of an empty X̄. Let {z^i} be a minimizing sequence, i.e. f(z^i) → inf{f(x): x ∈ R^N} and f(z^{i+1}) ≤ f(z^i) for all i. To obtain a contradiction, suppose that for some fixed index i, f(z^i) < f(x^k) for all k. Then Lemma 4.14 and Theorem 4.13 imply the existence of some x̄ ∈ X̄, which contradicts the emptiness of X̄. Therefore f(x^k) ≤ f(z^i) for every fixed i and all large k ({f(x^k)} is nonincreasing), hence {x^k} minimizes f. []

The next result provides further substantiation of the stopping criterion.

Lemma 4.17. Suppose that {x^k} is a sequence generated by Algorithm 3.1 satisfying f(x^k) ≥ c for a fixed number c and all k. Then

  w^k → 0 as k → ∞.   (4.30)

Proof. If Algorithm 3.1 executes a finite number of serious steps, then (4.30) follows from the proof of Lemma 4.12. Suppose that l → ∞. By the monotonicity of {f(x^k)}, we have f(x^k) ↓ f̄ as k → ∞, where f̄ ≥ c. Therefore

  f(x^{k(l)}) - f(x^{k(l-1)}) → 0 as l → ∞.   (4.31)

Observe that -<p^k, x^{k+1} - x^k> = -<p^k, -t_L^k p^k> = t_L^k |p^k|^2 for all k, hence (4.9), (4.10) and (4.31) yield

  f(x^{k(l)}) - f(x^{k(l-1)}) - <p^{k(l)-1}, x^{k(l)} - x^{k(l)-1}> → 0 as l → ∞.   (4.32a)

From the proof of Lemma 4.8 we deduce that

  w^{k(l)-1} → 0 as l → ∞.   (4.32b)

Arguing as in the proof of Lemma 4.11, we deduce that

  Σ_{j∈J^k} λ_j(0) g^j + λ_p(0) p^{k-1} = p^{k-1},

  Σ_{j∈J^k} λ_j(0) α_j^k + λ_p(0) α̃_p^k = α̃_p^k,

see (4.23) and (4.24), and that

  w^k ≤ ½|p^{k-1}|^2 + α̃_p^k = w^{k-1} + α̃_p^k - α̃_p^{k-1}  for all k.   (4.32c)

By (2.38) and (4.4),

  α̃_p^k - α̃_p^{k-1} = f(x^k) - f(x^{k-1}) - <p^{k-1}, x^k - x^{k-1}>  for all k.   (4.32d)

From (4.10) and (4.32c,d),

  w^k ≤ w^{k-1}  for all k(l) < k ≤ k(l+1) - 1 and l ≥ 1.   (4.32e)

Using (4.32), we obtain (4.30). □

Corollary 4.18. Suppose that inf{f(x): x ∈ R^N} > -∞. Then Algorithm 3.1 terminates if its final accuracy tolerance ε_s is positive.

5. The Method with Subgradient Selection

In this section we analyze the method that uses subgradient selection, as specified in Section 2 by (2.20) and (2.23).
To save space, we shall use the notation of Algorithm 3.1 with certain modifications. Algorithm 5.1 is obtained from Algorithm 3.1 by replacing Step 1 with the following.

Step 1' (Direction finding). Find multipliers λ_j^k, j ∈ J^k, that solve the k-th dual search direction finding subproblem (2.22), and a set Ĵ^k = {j ∈ J^k: λ_j^k > 0} satisfying |Ĵ^k| ≤ N+1. Calculate the aggregate subgradient (p^k, f_p^k) by (2.27). Set d^k = -p^k and v^k = -{|p^k|^2 + f(x^k) - f_p^k}.

Thus in Algorithm 5.1 the index set Ĵ^k of the retained past subgradients is chosen by direction finding, whereas in Algorithm 3.1 Ĵ^k may be determined arbitrarily. We also note that in Algorithm 5.1 there is no need for recursive updating of the aggregate subgradients, since they are calculated directly from the past subgradients retained at each iteration.

Remark 5.2. In view of the results of Section 2, in Algorithm 5.1 (d^k, u^k) solves the k-th primal subproblem (2.11), where u^k is given by (2.21e), for any k, and λ_j^k, j ∈ J^k, are the associated Lagrange multipliers. Therefore, one may equivalently solve the primal subproblem (2.11) and then obtain the Lagrange multipliers as described in Lemma 2.1(iv). We may add that most quadratic programming subroutines applied to (2.11) or (2.22) will automatically calculate Lagrange multipliers satisfying (2.23), i.e. at most N+1 nonzero multipliers. This follows from the fact that the primal problem (2.11) has N+1 variables, whereas in the dual subproblem (2.22) at most N+1 vectors of the form (g^j, 1) can be linearly independent, i.e. at most N+1 vectors g^j can be affinely independent (see also (Kiwiel, 1983)).

Remark 5.3. The requirement of Algorithm 5.1 that the set Ĵ^k should contain at most N+1 indices can be modified as follows. Let M_g ≥ N+2 denote the maximum number of the past subgradients that Algorithm 5.1 may use for each search direction finding. Then one may choose the Lagrange multipliers λ_j^k and the set J^{k+1} subject only to the following requirements

  J^{k+1} ⊃ {k+1} ∪ {j ∈ J^k: λ_j^k ≠ 0},
                                            (5.1)
  |J^{k+1}| ≤ M_g,

for all k. In view of Lemma 2.1, this is always possible if M_g ≥ N+2. Also the extensions discussed in Remark 2.5 are covered by the subsequent analysis. In particular, setting M_g = +∞ and J^{k+1} = {1,...,k+1} for all k, we see that (5.1) is satisfied. Therefore the analysis below applies also to the method of Lemarechal (1978), which uses J^k = {1,...,k} for all k.
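The bookkeeping in (5.1) is simple enough to sketch in code. The fragment below is an illustrative sketch only (the helper name `next_index_set` and its calling convention are our assumptions, not part of the text): it forms J^{k+1} from J^k, the multipliers λ_j^k, and the bound M_g.

```python
def next_index_set(J_k, lambdas, k, M_g):
    """Sketch of the selection rule (5.1): J^{k+1} must contain k+1 and
    every j in J^k with a nonzero multiplier, and have at most M_g elements.
    lambdas maps j -> lambda_j^k."""
    keep = {j for j in J_k if lambdas.get(j, 0.0) != 0.0}
    J_next = sorted(keep | {k + 1})
    if len(J_next) > M_g:
        # per the remark, M_g >= N+2 guarantees this cannot happen for
        # multipliers delivered by the QP subproblem
        raise ValueError("selection impossible: need M_g >= N+2")
    return J_next
```

For example, with J^k = {1,2,3} and λ_2^k = 0, index 2 is dropped and the new index k+1 = 4 enters.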

Global convergence of Algorithm 5.1 can be demonstrated by an appropriate modification of the results of Section 4. To save space, we provide here only an outline of the required results.
Define the aggregate linearization by (2.35) and the linearization errors by (4.4). Also let

  λ̃_j^k = λ_j^k for j ∈ J^k,  λ̃_j^k = 0 for j ∈ {1,...,k} \ J^k.

Then Lemma 4.1 follows directly from the definition of (p^k, f̃_p^k). Moreover, it is straightforward to check that all the results from Lemma 4.2 to Lemma 4.8 also hold for Algorithm 5.1. Lemma 4.9 is substituted by
by

Lemma 5.4. At the k-th iteration of Algorithm 5.1, k ≥ 1, w^k is the optimal value of the following problem:

  minimize  ½ | Σ_{j∈J^k} λ_j g^j |^2 + Σ_{j∈J^k} λ_j α_j^k
                                                              (5.2)
  subject to  λ_j ≥ 0, j ∈ J^k,  Σ_{j∈J^k} λ_j = 1.
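Subproblem (5.2) is a convex quadratic program over the unit simplex, so any QP subroutine applies. As a self-contained illustration (not the method prescribed by the text), the sketch below approximates its solution by the Frank-Wolfe (conditional gradient) iteration; the name `solve_dual` and the iteration budget are our assumptions.

```python
def solve_dual(gs, alphas, iters=2000):
    """Frank-Wolfe sketch for subproblem (5.2): minimize
    0.5*|sum_j lam_j*g_j|^2 + sum_j lam_j*alpha_j over the unit simplex."""
    m, n = len(gs), len(gs[0])
    lam = [1.0 / m] * m                     # start at the simplex barycenter
    for t in range(iters):
        # p = sum_j lam_j g_j; the objective's gradient is (g_j . p) + alpha_j
        p = [sum(lam[j] * gs[j][i] for j in range(m)) for i in range(n)]
        grad = [sum(gs[j][i] * p[i] for i in range(n)) + alphas[j]
                for j in range(m)]
        s = min(range(m), key=grad.__getitem__)   # best simplex vertex
        gamma = 2.0 / (t + 2.0)                   # open-loop FW stepsize
        lam = [(1.0 - gamma) * lj for lj in lam]
        lam[s] += gamma
    p = [sum(lam[j] * gs[j][i] for j in range(m)) for i in range(n)]
    w = 0.5 * sum(pi * pi for pi in p) + sum(lam[j] * alphas[j]
                                             for j in range(m))
    return lam, p, w
```

The returned p approximates the aggregate subgradient p^k = Σ_j λ_j^k g^j (so d^k = -p^k in Step 1'), and w approximates the optimal value w^k of (5.2).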

In the proof of Lemma 4.11, replace (4.23) by

  λ_k(ν) = ν,  λ_j(ν) = (1-ν) λ_j^{k-1},  j ∈ Ĵ^{k-1},   (5.3)

for each ν ∈ [0,1], and note that (2.21a), (2.23a), (2.27), (2.32) and (2.38) imply

  λ_j^{k-1} ≥ 0, j ∈ Ĵ^{k-1},  Σ_{j∈Ĵ^{k-1}} λ_j^{k-1} = 1,

  (p^{k-1}, f_p^{k-1}) = Σ_{j∈Ĵ^{k-1}} λ_j^{k-1} (g^j, f_j^{k-1}),

  f_p^k = f_p^{k-1},

  Σ_{j∈J^k} λ_j(ν) g^j = (1-ν) p^{k-1} + ν g^k,   (5.4a)

  Σ_{j∈J^k} λ_j(ν) α_j^k = (1-ν) α̃_p^k + ν α_k^k,   (5.4b)

  λ_j(ν) ≥ 0, j ∈ J^k,  Σ_{j∈J^k} λ_j(ν) = 1,   (5.4c)

for each ν ∈ [0,1], if t_L^{k-1} = 0 (x^k = x^{k-1}). Since (5.4c) means that the multipliers (5.3) are feasible in (5.2) for each ν ∈ [0,1], we compare (4.25), (5.2) and (5.4) to deduce that the optimal value of (5.2) is not greater than the optimal value of (4.25). This observation suffices for the proofs of all the remaining results of Section 4 also for Algorithm 5.1.

Remark 5.5. It should be clear by now that the above approach to convergence analysis can be applied to methods that use more subgradients for search direction finding, cf. Remark 2.5 and Remark 5.3. For instance, if the sets J^k are chosen subject to the requirement (5.1), then one may replace (5.3) with the following definition:

  λ_k(ν) = ν,  λ_j(ν) = (1-ν) λ_j^{k-1},  j ∈ {i ∈ J^{k-1}: λ_i^{k-1} > 0},
                                                                              (5.5)
  λ_j(ν) = 0,  j ∈ {1,...,k-1} \ {i ∈ J^{k-1}: λ_i^{k-1} > 0}.

We shall now show that methods with subgradient selection may be interpreted as regularized versions of the cutting plane method, see Section 1.3. Let us recall that

  f_j(x) = f(y^j) + <g^j, x-y^j>,  j=1,2,...,   (5.6)

  f̂_s^k(x) = max{f_j(x): j ∈ J^k},  k=1,2,...,   (5.7a)

and define the reduced polyhedral approximation to f at x^k

  f̂_r^k(x) = max{f_j(x): j ∈ Ĵ^k}   (5.7b)

for all k. Let

  X_s^k = Argmin f̂_s^k = {x ∈ R^N: f̂_s^k(x) ≤ f̂_s^k(y) for all y},   (5.8a)

  X_r^k = Argmin f̂_r^k   (5.8b)

denote the optimal sets of f̂_s^k and f̂_r^k, respectively, for any k. Let

  v_s^k(d) = f̂_s^k(x^k + d) - f(x^k) for all d,   (5.9a)

  D^k = Argmin v_s^k   (5.9b)

for all k. Clearly,

  D^k = Argmin_d f̂_s^k(x^k + d).   (5.9c)

We also recall from Section 2 that at each iteration of Algorithm 5.1 one has

  d^k = argmin_d { f̂_s^k(x^k + d) + ½|d|^2 },   (5.10a)

  y^{k+1} = x^k + d^k = argmin_y { f̂_s^k(y) + ½|y - x^k|^2 },   (5.10b)

see (2.36), where "arg" denotes the unique element of "Arg", if any.
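To see the regularized step (5.10b) concretely, one can minimize the model on a fine grid; the sketch below is purely illustrative (hypothetical names, brute-force grid search standing in for the quadratic programming of Algorithm 5.1) and contrasts the proximal point of (5.10b) with the bare model minimizer for a one-dimensional polyhedral model of f(y) = |y|.

```python
def prox_step(pieces, x, grid):
    """Contrast the regularized step (5.10b) with plain model minimization
    for a 1-D polyhedral model fhat(y) = max_j (a_j*y + b_j)."""
    def fhat(y):
        return max(a * y + b for a, b in pieces)
    # proximal (regularized) step of (5.10b): model + 0.5*|y - x|^2
    y_prox = min(grid, key=lambda y: fhat(y) + 0.5 * (y - x) ** 2)
    # unregularized step: minimize the bare model
    y_cut = min(grid, key=fhat)
    return y_prox, y_cut
```

With the model pieces of |y| and x^k = 2, the proximal step moves to y ≈ 1, while minimizing the bare model jumps straight to 0 — the term ½|y - x^k|^2 keeps trial points near the current iterate.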
We may compare (5.10b) with the cutting plane algorithm as follows. The k-th iteration of the cutting plane method that uses a polyhedral approximation of the form (5.7a) would calculate the next trial point y_C^{k+1} as a solution to the problem

  minimize f̂_s^k(y) over all y,   (5.11a)

i.e. y_C^{k+1} is any point satisfying

  y_C^{k+1} ∈ X_s^k,   (5.11b)

see (5.8a). We shall now show that under certain conditions also

  y^{k+1} ∈ X_s^k.   (5.12)

Lemma 5.6. (i) At the k-th iteration of Algorithm 5.1 one has

  f_j^k + <g^j, d^k> ≤ u^k,  j ∈ J^k,   (5.13a)

  f_j^k + <g^j, d^k> = u^k,  j ∈ Ĵ^k,   (5.13b)

  f̂_s^k(x^k + d^k) = f̂_r^k(x^k + d^k) = u^k,   (5.13c)

  v_s^k(d^k) = v^k = u^k - f(x^k).   (5.13d)

(ii) If additionally the subgradients g^j, j ∈ Ĵ^k, are positively linearly dependent, i.e.

  0 ∈ conv{g^j: j ∈ Ĵ^k},   (5.14)

then

  y^{k+1} = Pr_{X_s^k} x^k = Pr_{X_r^k} x^k,   (5.15a)

  d^k = Pr_{D^k} 0,   (5.15b)

where Pr_X x denotes the projection of x on X.

Proof. (i) (5.13a,b,c) follow from (2.21b), (2.23a) and (5.7). By (5.13c) and (5.9a), we have v_s^k(d^k) = u^k - f(x^k). Combining this with (2.21e), (2.26) and (3.2), we obtain (5.13d).
(ii) Since

  ∂f̂_s^k(x^k + d^k) = conv{g^j: f_j^k + <g^j, d^k> = f̂_s^k(x^k + d^k)}

(see Corollary 1.2.6), we deduce from (5.13b) and (5.14) that

  0 ∈ ∂f̂_s^k(x^k + d^k) and f̂_s^k(x^k + d^k) = u^k.

It follows that y^{k+1} = x^k + d^k minimizes the convex function f̂_s^k, i.e. y^{k+1} ∈ X_s^k and f̂_s^k(y) = u^k for all y ∈ X_s^k. Therefore (5.10b) can be formulated as

  y^{k+1} = argmin_y { f̂_s^k(y) + ½|y - x^k|^2 }
         = argmin_{y ∈ X_s^k} { u^k + ½|y - x^k|^2 }
         = argmin_{y ∈ X_s^k} |y - x^k| = Pr_{X_s^k} x^k.

Since

  ∂f̂_r^k(y^{k+1}) = conv{g^j: f_j^k + <g^j, y^{k+1} - x^k> = f̂_r^k(y^{k+1}), j ∈ Ĵ^k},

we similarly deduce that y^{k+1} ∈ X_r^k and y^{k+1} = Pr_{X_r^k} x^k. Then (5.15b) follows from (5.15a), (5.8) and (5.9c). □
To interpret the above result, consider the following condition

  0 ∈ conv{g^j: j ∈ J^k},   (5.16)

which is slightly weaker than (5.14). One may show that (5.16) is equivalent to

  inf{f̂_s^k(x): x ∈ R^N} > -∞,

which in turn is equivalent to X_s^k being nonempty. Similarly, (5.14) is equivalent to nonemptiness of X_r^k. The cutting plane method chooses any point in X_s^k as the next trial point, so (5.16) must hold. In Algorithm 5.1 the augmentation of the subproblem objective function with the regularizing term ½|d|^2 uniquely determines y^{k+1} as the point in X_r^k nearest to x^k (if X_r^k is nonempty).
We conclude this section by remarking that results similar to
Lemma 5.6 may be obtained also for methods that use more subgradients
for search direction finding than Algorithm 5.1, cf. Remark 2.5 and
Remark 5.3. Such results are crucial for showing that Algorithm 5.1
terminates when f happens to be piecewise linear (Kiwiel, 1983).

6. Finite Convergence for Piecewise Linear Functions

In this section we show that many versions of the method with subgradient selection are finite, terminating methods for minimizing piecewise linear functions.
The problem of finite convergence of nonsmooth optimization algorithms in the polyhedral case is interesting for at least two reasons. First, piecewise linear functions frequently arise in applications (Lasdon, 1970; Shor, 1979; Wolfe, 1975), e.g. in decomposition methods for large scale linear programming problems (Nurminski, 1981). Secondly, many objective functions can often be well approximated in the vicinity of minimum points by piecewise linear functions of the form (2.33) (Madsen and Schjaer-Jacobsen, 1979). Then the finite termination property of an algorithm in the polyhedral case ensures fast convergence.
The above problem was analyzed for descent methods by Wolfe (1975) and Mifflin (1977b). They tried to modify their algorithms to ensure finite termination for polyhedral functions. To this end they changed line search rules, demanding exact minimization of f(x^k + t d^k) over all t ≥ 0, and required storing all the past subgradients. This led to finite termination of the algorithms in (Mifflin, 1977b; Wolfe, 1975) only for Powell's function

  f(x) = max{<a_i, x>: i=1,...,5},  x ∈ R^2,
                                                                  (6.1)
  a_i = (cos(2π(i-1)/5), sin(2π(i-1)/5))^T,  i=1,...,5.
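For experimentation, Powell's example (6.1) is easy to code. The sketch below is ours, not from the source (the function names are hypothetical): it evaluates f and returns one active gradient a_{i(x)} with i(x) ∈ I(x), in the spirit of the subgradient choice (6.5a) below.

```python
import math

def pentagon():
    """The five gradients of Powell's example (6.1):
    a_i = (cos(2*pi*(i-1)/5), sin(2*pi*(i-1)/5))^T, i = 1,...,5."""
    return [(math.cos(2.0 * math.pi * i / 5.0),
             math.sin(2.0 * math.pi * i / 5.0)) for i in range(5)]

def f_powell(x):
    """Evaluate f(x) = max_i <a_i, x> and return one maximizing a_i,
    which is a subgradient of f at x."""
    ais = pentagon()
    vals = [a[0] * x[0] + a[1] * x[1] for a in ais]
    fx = max(vals)
    return fx, ais[vals.index(fx)]
```

Since 0 lies in the interior of the convex hull of the five unit vectors, f is positive everywhere except at its unique minimizer (0,0).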

No guarantee exists that such modified algorithms are convergent for other functions.
Throughout this section we assume that the minimized function f is piecewise linear, i.e.

  f(x) = max{f_i(x): i ∈ I},  x ∈ R^N,
                                                          (6.2)
  f_i(x) = <a_i, x> - b_i,  a_i ∈ R^N, b_i ∈ R, i ∈ I,

where the index set I is finite. Since the subdifferential of f at x is given by (see Corollary 1.2.6)

  ∂f(x) = conv{a_i: i ∈ I(x)},   (6.3)

  I(x) = {i ∈ I: f_i(x) = f(x)},   (6.4)

we assume that Algorithm 5.1 applied to f uses subgradients of the form

  g^k = a_{i(k)} ∈ {a_i: i ∈ I(y^k)} for all k,   (6.5a)

i.e. g^k = a_{i(k)} ∈ ∂f(y^k) and

  f_{i(k)}(y^k) = f(y^k) for all k.   (6.5b)

We also assume that Argmin f ≠ ∅. By the results of Section 5, we know that Algorithm 5.1 either terminates, finding some x̄ ∈ Argmin f in a finite number of iterations, or it generates an infinite sequence {x^k} such that

  x^k → x̄ as k → ∞,   (6.6)

  f(x^k) → f(x̄) as k → ∞,   (6.7)

  w^k → 0 as k → ∞,   (6.8)

for some x̄ ∈ Argmin f. In the former case there is nothing to prove as far as finite convergence is concerned. Therefore we shall initially suppose that Algorithm 5.1 does not stop, and then show that in fact this is impossible if f satisfies a certain condition given below.
this is impossible if f satisfies a certain condition given below.

Thus suppose that {x^k} is infinite, so that (6.6)-(6.8) are satisfied.
We start by collecting some useful results. By (6.8), (4.7) and (2.21d), we have

  v^k → 0, d^k → 0 and p^k → 0 as k → ∞.   (6.9)

From (6.2), (6.5) and (2.9) we always have

  f_j(x) = f_{i(j)}(x) for all x,   (6.10)

hence α_j^k = f(x^k) - f_j(x^k) can be expressed as

  α_j^k = f(x^k) - f_{i(j)}(x^k),   (6.11)

and we have

  -α_j^k + <g^j, d^k> = f_{i(j)}(x^k + d^k) - f(x^k).   (6.12)

Define the sequence of sets

  I^k = {i(j): j ∈ Ĵ^k} for all k,   (6.13)

and note that each I^k has at most N+1 elements, since so has Ĵ^k. Asymptotic properties of {I^k} are described in

Lemma 6.1. There exists a number n₁ satisfying

  I^k ⊂ I(x̄) for all k ≥ n₁.   (6.14)

Proof. Let i ∈ I be fixed and let K_i = {k: i ∈ I^k}. Then (6.2), (6.10), (5.13b) and (5.13d) imply

  0 ≤ f(x^k) - f_i(x^k) = -v^k + <a_i, d^k> ≤ -v^k + |a_i| |d^k| for all k ∈ K_i,

hence (6.9) implies that i ∈ I(x̄) if K_i is infinite, because then f_i(x^k) → f_i(x̄) = f(x̄). If the lemma were false, one could choose, since I is finite, an index i ∈ I \ I(x̄) and a corresponding infinite set K_i. But then, by the above result, we would have i ∈ I(x̄). This contradiction completes the proof. □

Let us introduce an auxiliary function ω defined on subsets Î ⊂ I:

  ω(Î) = min{|a|: a ∈ conv{a_i: i ∈ Î}}.   (6.15)

Note that any minimum point x̄ of f is characterized by

  ω(Î) = 0 for some Î ⊂ I(x̄),   (6.16)

since (6.16) means that 0 ∈ ∂f(x̄), cf. (6.3).
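The quantity in (6.15) — the distance from the origin to conv{a_i: i ∈ Î} — can be approximated numerically. The sketch below (the routine name and iteration budget are our illustrative assumptions, not part of the algorithmic framework) uses the conditional gradient iteration on min |v|^2 over the hull; a numerically zero value certifies the optimality test (6.16).

```python
def omega(vectors, iters=5000):
    """Conditional-gradient sketch of (6.15): the distance from the
    origin to conv{a_i}. A value (numerically) equal to zero indicates
    that 0 lies in conv{a_i}, i.e. the minimum-point test (6.16)."""
    n = len(vectors[0])
    v = list(vectors[0])                 # current point of the hull
    for t in range(iters):
        # vertex of the hull most opposed to v, i.e. minimizing <a, v>
        s = min(vectors, key=lambda a: sum(a[i] * v[i] for i in range(n)))
        gamma = 2.0 / (t + 2.0)
        v = [(1.0 - gamma) * v[i] + gamma * s[i] for i in range(n)]
    return sum(vi * vi for vi in v) ** 0.5
```

For the five pentagon gradients of (6.1) the value is (numerically) zero, while for two vectors spanning a segment away from the origin it is the positive distance to that segment.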



We shall now show that the selected past subgradients g^j, j ∈ Ĵ^k, become positively linearly dependent for large k.

Lemma 6.2. There exists a number n₂ ≥ n₁ such that

  ω(I^k) = 0, i.e. 0 ∈ conv{a_i: i ∈ I^k}, for all k ≥ n₂.

Proof. Since ω has finitely many values (I is finite), the constant δ defined by δ = min{ω(Î): Î ⊂ I and ω(Î) > 0} is positive, and ω(Î) < δ implies that ω(Î) = 0. By (6.5a), (6.13), (2.21a,c) and (2.23), we always have

  ω(I^k) ≤ | Σ_{j∈Ĵ^k} λ_j^k g^j | = |p^k|.

Therefore (6.9) yields ω(I^k) ≤ |p^k| < δ for all sufficiently large k, and the assertion follows. □

Since the linearizations f_j used by the algorithm are in fact defined by the linear pieces f_{i(j)} (see (6.10)), the polyhedral approximations can be expressed as

  f̂_r^k(x) = max{f_i(x): i ∈ I^k},   (6.17a)

  f̂_s^{k+1}(x) = max{f_i(x): i ∈ I^k ∪ {i(k+1)}},   (6.17b)

for any k. This follows from (5.7), (6.5), (6.13) and the fact that we always have J^{k+1} = Ĵ^k ∪ {k+1}. Combining Lemma 6.2 with Lemma 5.6, we get

Corollary 6.3. Relations (5.15) hold for all k ≥ n₂, i.e. for sufficiently large k the point y^{k+1} = x^k + d^k is the nearest point to x^k that minimizes the functions f̂_s^k and f̂_r^k given by (6.17). Moreover,

  f̂_s^k(y^{k+1}) = f̂_r^k(y^{k+1}) = f(x^k) + v^k for all k ≥ n₂,   (6.18)

  f̂_r^k(y^{k+1}) ≤ f̂_r^k(x) ≤ f(x) for all x and k ≥ n₂.   (6.19)

Remark 6.4. By (6.18) and (6.19),

  f(x^k) + v^k ≤ min{f(x): x ∈ R^N} ≤ f(x^k) for all k ≥ n₂,   (6.20)

which is similar to the global estimates employed by cutting plane algorithms.

We are now ready to prove

Theorem 6.5. If Algorithm 5.1 executes only a finite number of serious steps, then in fact it must terminate.

Proof. If x^k = x̄ for all k large enough, then Lemma 6.1 implies that I^k ⊂ I(x̄) for such k, hence (6.10), (6.11) and (6.13) yield

  α_j^{k+1} = 0 for all j ∈ Ĵ^k and k large enough.   (6.21)

By Lemma 5.4, the multipliers λ_j^{k+1}, j ∈ J^{k+1} = Ĵ^k ∪ {k+1}, solve subproblem (5.2), hence (6.21) implies

  w^{k+1} ≤ min{ ½ | Σ_{j∈Ĵ^k} λ_j g^j |^2 : λ_j ≥ 0, j ∈ Ĵ^k, Σ_{j∈Ĵ^k} λ_j = 1 }

for large k. In view of Lemma 6.2, the right side of the above inequality is equal to zero for large k, hence the algorithm must stop owing to w^k = 0 for some k. □

In view of the above result, we assume below that Algorithm 5.1 executes an infinite number of serious steps. We shall need the following auxiliary results.

Lemma 6.6. If l → ∞ then for all k ≥ n₂ one has

  f(y^{k+1}) > f̂_r^k(y^{k+1}),   (6.22)

  i(k+1) ∉ I^k.   (6.23)

Proof. Suppose that f(y^{k+1}) ≤ f(x^k) + v^k for some k ≥ n₂. By (6.20), we have f(y^{k+1}) = min f. On the other hand, the line search rules yield x^{k+1} = y^{k+1}, hence f(x^{k+1}) = min f. The next serious step must decrease the objective value, which contradicts f(x^{k+1}) = min f. Therefore we have f(y^{k+1}) > f(x^k) + v^k for all k ≥ n₂, and (6.18) yields

  f(y^{k+1}) = f_{i(k+1)}(y^{k+1}) > f(x^k) + v^k = f̂_r^k(y^{k+1}) = max{f_i(y^{k+1}): i ∈ I^k},

which proves (6.22) and (6.23). □

Lemma 6.7. If l → ∞ then for all k ≥ n₂ one has

  x̄ ∈ Argmin f̂_s^k ∩ Argmin f̂_r^k,   (6.24)

  f̂_s^k(y^{k+1}) = f̂_r^k(y^{k+1}) = f(x̄) = min f,   (6.25)

  y^{k+1} ∉ Argmin f.   (6.26)

Proof. By Lemma 6.1 and Lemma 6.2, we have I^k ⊂ I(x̄) and ω(I^k) = 0. I^k ⊂ I(x̄) and (6.17a) imply f̂_r^k(x̄) = f_i(x̄) = f(x̄) for all i ∈ I^k. Thus we have f̂_r^k(x̄) = f_i(x̄) for all i ∈ I^k and ω(I^k) = 0. Therefore x̄ ∈ Argmin f̂_r^k, cf. (6.16) and (6.17a), and f̂_r^k(x̄) = min{f̂_r^k(x): x ∈ R^N} = f(x̄) = min f. Since f_i ≤ f for all i and f̂_r^k ≤ f̂_s^k, we obtain f(x̄) = f̂_r^k(x̄) ≤ f̂_s^k(x̄) ≤ f(x̄) and f̂_s^k(y^{k+1}) ≥ f̂_r^k(y^{k+1}) ≥ f̂_r^k(x̄). Combining this with (6.19), we obtain (6.24) and (6.25). If we had f(y^{k+1}) = min f for some k ≥ n₂, this would contradict (6.22) and (6.25). □

Consider the following assumption on f and its minimum point x̄.

Assumption 6.8. If Î ⊂ I(x̄) satisfies ω(Î) = 0 then at least one of the following conditions is satisfied:
(i) Argmin_x max{f_i(x): i ∈ Î} ⊂ Argmin f;
(ii) rank{a_i: i ∈ Î} ≥ N-1 (the rank of the N × |Î| matrix with columns a_i is greater than N-2) and cone{a_i: i ∈ Î} = span{a_i: i ∈ Î} ("cone" and "span" denote the convex cone and the linear subspace generated by the vectors a_i, i ∈ Î, respectively).

Remark 6.9. The well-known Haar condition at x̄:

  rank{a_i: i ∈ Î} = min{N, |Î|} for all Î ⊂ I(x̄)   (6.27)

implies that Assumption 6.8 is satisfied. This follows from the fact that (6.27) and ω(Î) = 0 yield rank{a_i: i ∈ Î} = N.
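For N = 2 the Haar condition (6.27) reduces to a finite determinant test: rank{a_i: i ∈ Î} = min(2, |Î|) holds for every subset Î exactly when each a_i is nonzero and each pair {a_i, a_j} is linearly independent. A hypothetical checker (our sketch, not from the source):

```python
from itertools import combinations

def haar_condition_2d(vectors, tol=1e-9):
    """Check the Haar condition (6.27) for N = 2: every a_i must be
    nonzero and every pair {a_i, a_j} must be linearly independent
    (nonzero 2x2 determinant)."""
    if any(abs(a) < tol and abs(b) < tol for a, b in vectors):
        return False
    for (a, b) in combinations(vectors, 2):
        if abs(a[0] * b[1] - a[1] * b[0]) < tol:
            return False
    return True
```

The five pentagon gradients of Powell's example (6.1) pass the test, while an antiparallel pair such as {(1,0), (-1,0)} fails it.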

Under Assumption 6.8 one may describe geometric properties of the set of minimum points of f̂_r^k as follows.

Lemma 6.9. Suppose that Assumption 6.8(ii) is satisfied and l → ∞. Then for all k ≥ n₂ one has

  Argmin f̂_r^k = x̄ + span{y^{k+1} - x̄} = {x̄ + t(y^{k+1} - x̄): t ∈ R},   (6.28a)

  y^{k+1} ≠ x̄,   (6.28b)

i.e. Argmin f̂_r^k is the straight line passing through the different points y^{k+1} and x̄.

Proof. Let k ≥ n₂ be fixed, so that I^k ⊂ I(x̄) and ω(I^k) = 0. Let b̄_i = b_i + min f for all i ∈ I. From Lemma 6.7 and Corollary 6.3 we deduce that

  Argmin f̂_r^k = {x: <a_i, x> ≤ b̄_i for all i ∈ I^k},   (6.29a)

  <a_i, y^{k+1}> = <a_i, x̄> = b̄_i for all i ∈ I^k,   (6.29b)

and that (6.28b) is satisfied. Using Assumption 6.8(ii) and a classical theorem on finite systems of linear inequalities (see, e.g., Theorem 4.1.10 in (Pshenichny, 1980)), we deduce that any point x satisfying <a_i, x> ≤ 0 for all i ∈ I^k must satisfy <a_i, x> = 0 for all i ∈ I^k. Therefore it follows from (6.29) that

  Argmin f̂_r^k = x̄ + {y: <a_i, y> = 0 for all i ∈ I^k}.

Let L = span{a_i: i ∈ I^k} and let L^⊥ denote the orthogonal complement of the subspace L. Then Argmin f̂_r^k = x̄ + L^⊥ and 0 ≠ x̄ - y^{k+1} ∈ L^⊥ from (6.28b) and (6.26). Therefore to complete the proof it suffices to show that dim L^⊥ = 1. We have L + L^⊥ = R^N and dim L + dim L^⊥ = N. If rank{a_i: i ∈ I^k} = dim L = N then dim L^⊥ = 0, which contradicts 0 ≠ x̄ - y^{k+1} ∈ L^⊥. Therefore rank{a_i: i ∈ I^k} = N-1 by Assumption 6.8(ii), and hence dim L^⊥ = 1. □

We may now prove the main result.

Theorem 6.10. If Assumption 6.8 is satisfied then Algorithm 5.1 terminates.

Proof. In view of Theorem 6.5 we need only consider the case of an infinite number of serious steps. Using Lemma 6.1 and Lemma 6.2, we choose a fixed index k ≥ n₂ such that

  I^k ∪ I^{k+1} ⊂ I(x̄) and ω(I^k) = 0.   (6.30)

Thus we may use Assumption 6.8 with Î = I^k. We have two cases:
(i) If Assumption 6.8(i) is satisfied then (6.19) yields y^{k+1} ∈ Argmin f.
(ii) If Assumption 6.8(ii) holds then Argmin f̂_r^k is given by (6.29a). On the other hand, (6.17), Corollary 6.3 and Lemma 6.7 imply

  y^{k+2} ∈ Argmin f̂_s^{k+1} = Argmin f̂_r^k ∩ {x: <a_{i(k+1)}, x> ≤ b̄_{i(k+1)}}.   (6.31)

By (6.22) and (6.25), <a_{i(k+1)}, y^{k+1}> > b̄_{i(k+1)}, hence y^{k+1} lies outside the halfspace

  H = {x ∈ R^N: <a_{i(k+1)}, x> ≤ b̄_{i(k+1)}}.

It is easy to check that if i(k+1) ∉ I^{k+1}, then Corollary 6.3 implies y^{k+2} = y^{k+1}, and then (6.25) yields f̂_s^{k+1}(y^{k+1}) = f̂_s^{k+1}(y^{k+2}) = f̂_r^{k+1}(y^{k+2}) = f̂_r^{k+1}(y^{k+1}) = min f. But f̂_s^{k+1}(y^{k+1}) = f(y^{k+1}) > f̂_r^k(y^{k+1}), so we have f(y^{k+1}) = min f, which contradicts (6.22). Therefore we must have i(k+1) ∈ I^{k+1} ⊂ I(x̄). Then <a_{i(k+1)}, x̄> - b_{i(k+1)} = f(x̄) = min f, which means that x̄ belongs to the boundary of H. Consequently, the straight line Argmin f̂_r^k, given by (6.28a), intersects the boundary of H at x̄. Similarly to (6.29b), we obtain <a_i, y^{k+2}> = b̄_i for all i ∈ I^{k+1}. Since i(k+1) ∈ I^{k+1}, we deduce that y^{k+2} belongs to the boundary of H, hence (6.31) yields y^{k+2} = x̄.
In both cases considered above there exists k ≥ n₂ such that y^{k+1} ∈ Argmin f. Since this contradicts (6.26), the proof is complete. □

Remark 6.11. Powell's polyhedral example (6.1) satisfies the Haar condition and has the unique solution x̄ = (0,0)^T. Algorithm 5.1 finds x̄ in a finite number of iterations and terminates from any starting point, using at most 3 subgradients for each search direction finding. Note that |I(x̄)| = 5.

By allowing the algorithm to use more subgradients for search direction finding, we can ensure finite convergence even when Assumption 6.8 fails. Suppose that we know a number n̄ such that

  |I(x)| < n̄ for all x ∈ Argmin f,

and consider the following modification of Step 4 of Algorithm 5.1 for large k. If Ĵ^k has n_k elements, let J^{k+1} contain Ĵ^k ∪ {k+1} together with the n̄ - n_k - 1 largest elements from the set {1,...,k} \ Ĵ^k. One may use the preceding results, especially (6.23), to show that if l → ∞ then for some k the set {i(j): j ∈ J^k} ⊂ I(x̄) will have n̄ elements. This contradiction proves finite termination of the modified algorithm. In particular, by choosing n̄ = |I| + 1 we demonstrate finite convergence of the algorithm of Lemarechal (1978) (with J^k = {1,...,k} for all k).

7. Line Search M o d i f i c a t i o n s

In this section we give general line search rules that may be implemented in efficient procedures for stepsize selection.
The algorithms discussed so far used stepsizes t_L^k ∈ {0,1} and t_R^k = 1 for generating the sequences

  x^{k+1} = x^k + t_L^k d^k and y^{k+1} = x^k + t_R^k d^k for k=1,2,...

from the starting point x^1 = y^1. At the k-th iteration the objective function was evaluated only at y^{k+1} = x^k + d^k, and a serious step was taken if f(x^k + d^k) ≤ f(x^k) + m v^k. The requirement t_L^k = 1 for a serious step may result in too many null steps. Therefore, following Lemarechal (1978), we may introduce a fixed threshold value t̄ ∈ (0,1] for a serious step, say t̄ = 0.1, and replace Step 3 in Algorithm 3.1 and Algorithm 5.1 by the following more general

Step 3' (Line search). Select an auxiliary stepsize t_R^k ∈ [t̄, 1] and set y^{k+1} = x^k + t_R^k d^k. If

  f(y^{k+1}) ≤ f(x^k) + m t_R^k v^k   (7.1)

then set t_L^k = t_R^k (a serious step), set k(l+1) = k+1 and increase l by 1; otherwise, i.e. if (7.1) is violated, set t_L^k = 0 (a null step).

Observe that if t̄ = 1 then Step 3' reduces to Step 3. Also one may use t_R^k = 1 as before. When t̄ < 1, the search for a suitable value of t_R^k ∈ [t̄, 1] may use geometrical contraction, as described in Section 2, or interpolation based upon the values of f(x^k + t d^k) and <g_f(x^k + t d^k), d^k> for trial values of t > 0, and f(x^k) and the approximate derivative v^k of f at x^k, corresponding to t = 0. Many efficient procedures for executing Step 3' can be designed, see (Lemarechal, 1978 and 1981; Wierzbicki, 1978b; Wolfe, 1975 and 1978). The results of Section 6 indicate that an efficient line search procedure should try a unit stepsize in the neighborhood of a solution.
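As one concrete (and deliberately naive) instance of Step 3', the sketch below picks the stepsize by geometric contraction from t = 1 and applies test (7.1); the function name and the halving factor are our assumptions, not prescriptions of the text.

```python
def line_search(f, x, d, v, m=0.1, t_bar=0.1):
    """Step 3' sketch: contract t geometrically from 1 until the descent
    test (7.1), f(x + t*d) <= f(x) + m*t*v, holds or t drops below t_bar.
    Returns (t_L, t_R, y): a serious step has t_L = t_R, a null step t_L = 0.
    Here x, d are tuples, v < 0 is the predicted descent, m in (0, 1)."""
    fx = f(x)
    t, last = 1.0, 1.0
    while t >= t_bar:
        y = tuple(xi + t * di for xi, di in zip(x, d))
        if f(y) <= fx + m * t * v:
            return t, t, y                  # serious step
        last = t
        t *= 0.5                            # geometric contraction
    y = tuple(xi + last * di for xi, di in zip(x, d))
    return 0.0, last, y                     # null step: y still supplies g^{k+1}
```

The interpolation-based procedures cited above would replace the halving loop; the serious/null outcome and the returned trial point play the same role either way.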
We shall now indicate the modifications necessary for the results of Section 4 and Section 5 to hold also for the algorithms with Step 3'. In the proof of Lemma 4.8 observe that t_L^k ≥ t̄ > 0 for all k ∈ K. Part (i) of the proof of Lemma 4.11 may be substituted by the following result, which is due to Lemarechal (1978).

Lemma 7.1. Suppose that a point y = x^k + t d^k satisfies f(y) > f(x^k) + m t v^k for some t ∈ (0,1]. Let g = g_f(y) ∈ ∂f(y) and α = f(x^k) - [f(y) + <g, x^k - y>]. Then

  -α + <g, d^k> > m v^k.

Proof. By assumption, we have

  -α + <g, d^k> = f(y) - f(x^k) - t<g, d^k> + <g, d^k>
               > t m v^k + (1-t)<g, d^k>.

By convexity, α = f(x^k) - f(y) + t<g, d^k> ≥ 0, hence

  <g, d^k> ≥ [f(y) - f(x^k)]/t > m v^k.

We conclude that -α + <g, d^k> > t m v^k + (1-t) m v^k = m v^k. □

The rules of Step 3' and Lemma 7.1 imply

  f(x^{k+1}) ≤ f(x^k) + m t_L^k v^k,   (7.2a)

  t_L^k = 0 if t_L^k < t̄,   (7.2b)

  -α(x^{k+1}, y^{k+1}) + <g_f(y^{k+1}), d^k> ≥ m v^k if t_L^k < t̄,   (7.2c)

  0 ≤ t_L^k ≤ t_R^k,   (7.2d)

  t_R^k ≤ t̃,   (7.2e)

where t̃ = 1 and

  α(x, y) = f(x) - [f(y) + <g_f(y), x-y>]

is the linearization error.


One may check that all the results of Section 4 and Section 5 hold also for algorithms that use stepsizes satisfying the criteria (7.2) on each iteration for some fixed positive values of the parameters t̄ and t̃. Mifflin (1982) used the criteria (7.2a,c,d,e) in his version of the Lemarechal (1978) method, but with t̄ = t̃ = +∞. In the next chapter we show that the convergence results of the present chapter remain valid for the line search criteria (7.2a), (7.2c)-(7.2e) with any fixed positive t̄ ≤ t̃ (including t̄ = t̃ = +∞). We feel, however, that such generalizations, which allow for very short serious steps, are mainly of theoretical interest in the convex case. In practice, the use of (7.2b) with a small value of t̄ means that instead of trying to make an insignificant step in the current direction, we prefer to find the next, hopefully better, search direction.
CHAPTER 3

Methods with Subgradient Locality Measures for Minimizing Nonconvex Functions

i. Introduction

In this chapter we consider the problem of minimizing a locally Lipschitzian function f: R^N → R, which is not necessarily convex or differentiable. This problem abounds with applications and has been treated in many papers; see, e.g. (Goldstein, 1977; Mifflin, 1977b and 1982; Shor, 1979).
Several iterative algorithms for solving the problem in question have been proposed. Given a starting point x^1 ∈ R^N, they generate a sequence of points x^k, k=2,3,..., that is intended to converge to the required solution. Much attention has been devoted to descent methods (Dixon and Gaviano, 1980; Goldstein, 1977; Kiwiel, 1981b; Lemarechal, Strodiot and Bihain, 1981; Mifflin, 1977b and 1982; Polak, Mayne and Wardi, 1983; Shor, 1979). They obtain x^{k+1} by searching from x^k along a direction d^k for a scalar stepsize t^k that gives a reduction in the objective value: f(x^k + t^k d^k) < f(x^k); then x^{k+1} = x^k + t^k d^k. Most existing descent algorithms for nonconvex minimization must be regarded as theoretical or conceptual methods (Dixon and Gaviano, 1980), since the optimization subproblems involved in computing their descent directions are constrained nondifferentiable problems, which generally cannot be solved exactly. Only the algorithms in (Kiwiel, 1981b; Lemarechal, Strodiot and Bihain, 1981; Mifflin, 1977b and 1982; Polak, Mayne and Wardi, 1983) have quadratic programming subproblems. However, these algorithms are descent methods in the broader sense that

  f(x^{k+1}) < f(x^k) if x^{k+1} ≠ x^k.

In this chapter we present readily implementable methods of descent, which extend the algorithms described in Chapter 2 to the nonconvex case. The methods are based on the Mifflin (1982) extension of an algorithm due to Lemarechal (1978). Our algorithms differ from the algorithm of Mifflin (1982) mainly in their rules for line searches and the updating of the search direction finding subproblems, which are quadratic programming problems. At each iteration of the Lemarechal and Mifflin algorithms, every previously computed subgradient generates one linear inequality in the current search direction finding subproblem; hence there are k such inequalities at the k-th iteration. This would present serious problems with storage and computation after a large number of iterations. As in Chapter 2, we overcome this difficulty by introducing rules for selecting and aggregating the past subgradient information. This leads to the concept of aggregate subgradients in the nonconvex case.

A new aspect of the nonconvex case is the necessity to introduce so-called subgradient locality measures, which depend on distances from the current point to the trial points at which past subgradients were computed. This is due to the local nature of the subdifferential in the nonconvex case and the resulting lack of appropriate generalizations of the notion of ε-subdifferential. The locality measures are used for weighing past subgradients at search direction finding, so that local subgradients contribute to the current direction more significantly than the obsolete ones.

In the absence of convexity, we will content ourselves with finding stationary points for f, i.e. points x̄ that satisfy the necessary optimality condition 0 ∈ ∂f(x̄). We show that the methods are both readily implementable and globally convergent in the sense that all their accumulation points are stationary. This seems to be a novel result for descent methods that do not neglect linearization errors, cf. Section 1.3.

For convex f, the algorithms are extensions of the methods of Chapter 2, differing mainly by more general line search rules. In the convex case each algorithm generates a minimizing sequence of points, which converges to a solution whenever f attains its infimum.

In Section 2 we derive the methods. Section 3 contains a detailed description of the method with subgradient aggregation. Its convergence is analyzed in Section 4. Section 5 is devoted to the method with subgradient selection. In Section 6 the results of the preceding sections are used for establishing convergence of various modifications of the methods. In particular, we strengthen the existing results on the convergence of Mifflin's (1982) method.

2. Derivation of the Methods

In this section we derive two methods for minimizing a locally Lipschitzian function f: R^N → R. We concentrate on the search direction finding subproblems, leaving other details to the next section.
In order to implement the methods, we suppose that we have a subroutine that can evaluate f(x) and a function g_f(x) ∈ ∂f(x) at each x ∈ R^N, i.e. an arbitrary subgradient g_f(x) of f at x on which we cannot impose any further assumptions.
Given a starting point x^1 ∈ R^N, the algorithms will generate se-
quences of points x^2, x^3, ... in R^N, search directions {d^k} ⊂ R^N and nonne-
gative stepsizes {t_L^k}, related by

    x^{k+1} = x^k + t_L^k d^k   for k = 1, 2, ... .

The algorithms are methods of descent in the sense that

    f(x^{k+1}) < f(x^k)   if x^{k+1} ≠ x^k, for all k.

Due to nondifferentiability of f, only one subgradient g_f(x^k) may
not suffice for calculating a usable direction of descent for f at x^k;
see Section 1.3. This would, in general, require the knowledge of the
full subdifferential ∂f(x^k), cf. Lemma 1.2.13. Therefore, following
Lemarechal (1975), Wolfe (1975) and Mifflin (1977b), we shall use bund-
ling of subgradients calculated at trial points

    y^{k+1} = x^k + t_R^k d^k   for k = 1, 2, ... ,   y^1 = x^1,

where the auxiliary stepsizes t_R^k > 0 satisfy t_L^k ≤ t_R^k for all k. The two-
-point line search will detect discontinuities in the gradient of f. The
algorithms evaluate subgradients

    g^j = g_f(y^j)   for j = 1, 2, ... .

With each such subgradient we associate the corresponding linearization
of f

    f_j(x) = f(y^j) + <g^j, x - y^j>   for all x.                         (2.1)

In order to use a subgradient g_f(y) and the corresponding lineari-
zation

    f̄(x;y) = f(y) + <g_f(y), x - y>   for all x                          (2.2)

for search direction finding at any x ∈ R^N, one needs a measure, say
α(x,y) ≥ 0, that indicates how much the subgradient g_f(y) ∈ ∂f(y) deviates
from being a member of ∂f(x); i.e. α(x,y) should measure the distance
from g_f(y) to ∂f(x). In the convex case (see Remark 2.4.3), it suffices
to take α(x,y) equal to the linearization error

    f(x) - f̄(x;y),

because we have

    g_f(y) ∈ ∂_ε f(x)   for ε = f(x) - f̄(x;y) ≥ 0,

cf. Lemma 2.4.2. Thus one may have g_f(y) ∈ ∂f(x) even when y is far from
x, provided that the linearization error vanishes. This is no longer true
when f is nonconvex; in particular, we may have f(x) - f̄(x;y) < 0. For this
reason, Mifflin (1982) introduced measures of the form

    α_M(x,y) = max{f(x) - f̄(x;y), γ|x-y|²},                              (2.3)

where γ is a positive parameter, which can be set to zero if f is convex.
Clearly, α_M(x,y) ≥ 0, and α_M(x,y) = 0 implies g_f(y) ∈ ∂f(x). Our methods will
use the following subgradient locality measure

    α(x,y) = max{|f(x) - f̄(x;y)|, γ|x-y|²}                               (2.4)

with the convention that γ = 0 if f is convex, and γ > 0 in the nonconvex
case. Of course, (2.3) and (2.4) are equivalent in the convex case,
since then

    α(x,y) = α_M(x,y) = f(x) - f̄(x;y) ≥ 0,

while our definition (2.4) puts more stress on the value of the lineari-
zation error for nonconvex f. This will allow for choosing a small value
of the distance measure parameter γ.
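The linearization (2.2) and the locality measure (2.4) are straightforward
to evaluate; a minimal Python sketch on a hypothetical convex example
(the choice of f, g_f and the test points is ours, not the text's):

```python
import numpy as np

def f(x):
    # illustrative objective: f(x) = |x_1| + x_2^2, convex and nonsmooth
    return abs(x[0]) + x[1] ** 2

def gf(x):
    # one arbitrary subgradient of f at x (sign(0) = 0 is a valid choice)
    return np.array([np.sign(x[0]), 2.0 * x[1]])

def linearization(x, y):
    # f_bar(x; y) = f(y) + <g_f(y), x - y>, cf. (2.2)
    return f(y) + gf(y) @ (x - y)

def locality_measure(x, y, gamma):
    # alpha(x, y) = max{|f(x) - f_bar(x; y)|, gamma |x - y|^2}, cf. (2.4)
    return max(abs(f(x) - linearization(x, y)),
               gamma * float((x - y) @ (x - y)))

x = np.array([1.0, 2.0])
y = np.array([-0.5, 1.0])
assert linearization(x, y) <= f(x) + 1e-12   # convexity: f_bar(.; y) <= f
assert locality_measure(x, x, 1.0) == 0.0    # alpha vanishes at y = x
```

For convex f with γ = 0 the measure reduces to the linearization error, as
the text notes.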
The algorithm of Mifflin (1982) uses for search direction finding
at the k-th iteration the following polyhedral approximation to f at x^k:

    f̂_M^k(x) = max{f(x^k) - α_M(x^k, y^j) + <g^j, x - x^k> : j = 1, ..., k}   for all x.   (2.5)

The k-th search direction is obtained by solving the problem

    minimize f̂_M^k(x^k + d) + ½|d|²   over all d ∈ R^N.                  (2.6)

It is easy to observe that when f is convex and γ = 0 then

    f_j(x) = f̄(x; y^j) = f̄(x^k; y^j) + <g^j, x - x^k> =

           = f(x^k) - α_M(x^k, y^j) + <g^j, x - x^k>                     (2.7)

and

    f̂_M^k(x) = max{f_j(x) : j = 1, ..., k}   for all x,                  (2.8)

so that subproblem (2.6) reduces to a problem of the form (2.2.34) (if
J^k = {1, ..., k} in (2.2.33)). In this case the Mifflin algorithm falls


within the framework of the methods discussed in Chapter 2. In the non-
convex case, the term α_M(x^k, y^j) in (2.5) tends to make the subgradient
g_f(y^j) less active in the search direction finding if y^j is far from
x^k. Of course, subproblem (2.6) can be formulated as a quadratic prog-
ramming problem, cf. (2.2.34) and (2.2.11). Unfortunately, this prob-
lem will have k linear inequalities at the k-th iteration, which
creates difficulties with storage and work per iteration.
For completeness, we recall below the line search criteria of Mif-
flin (1982):

    f(x^{k+1}) ≤ f(x^k) + m_L t_L^k v̂^k,                                (2.9a)

    -α_M(x^{k+1}, y^{k+1}) + <g_f(y^{k+1}), d^k> ≥ m_R v̂^k,             (2.9b)

    y^{k+1} = x^{k+1} = x^k   if <g_f(x^k), d^k> ≥ m_R v̂^k,             (2.9c)

where x^{k+1} = x^k + t_L^k d^k, y^{k+1} = x^k + t_R^k d^k, 0 ≤ t_L^k ≤ t_R^k, and the variable

    v̂^k = f̂_M^k(x^k + d^k) - f(x^k) < 0                                 (2.10)

may be interpreted as an approximate directional derivative of f at x^k
in the direction d^k. Here m_L and m_R are line search parameters satis-
fying

    0 < m_L < m_R < 1.

One can relate the rules (2.9) to the line search criteria (2.7.2)
discussed in Section 2.7. We shall return to this subject in the next
section and in Section 6.
The Mifflin algorithm requires the storage of points y^j for cal-
culating the distances |x^k - y^j| involved in α_M(x^k, y^j). This can be avoid-
ed by using the following upper estimates of |x^k - y^j|:

    s_j^k = |x^j - y^j| + Σ_{i=j}^{k-1} |x^{i+1} - x^i|   for j < k,   s_k^k = |x^k - y^k|,   (2.11)

which can be recursively updated according to the following formula

    s_j^{k+1} = s_j^k + |x^{k+1} - x^k|.                                 (2.12)

We shall call s_j^k the j-th distance measure at the k-th iteration. Deno-
ting

    f_j^k = f_j(x^k) = f̄(x^k; y^j)                                      (2.13)

and substituting |x^k - y^j| with s_j^k in the definition of α(x^k, y^j), we
obtain the following subgradient locality measure

    α_j^k = max{|f(x^k) - f_j^k|, γ(s_j^k)²},                            (2.14)

which indicates how far g^j is from ∂f(x^k). For this reason, we shall
call the triple

    (g^j, f_j^k, s_j^k) ∈ R^N × R × R

the j-th subgradient of f at the k-th iteration, for all j ≤ k.
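The recursion (2.12) needs only scalar bookkeeping per stored index; a
small sketch with made-up iterates (the points xs, ys below are ours),
checking the upper-bound property behind (2.11):

```python
import numpy as np

# hypothetical iterates x^1, ..., x^5 and trial points y^j in R^2
rng = np.random.default_rng(0)
xs = [rng.standard_normal(2) for _ in range(5)]
ys = [xs[0]] + [x + 0.1 * rng.standard_normal(2) for x in xs[1:]]

s = {0: float(np.linalg.norm(xs[0] - ys[0]))}    # s_1^1 = |x^1 - y^1|
for k in range(1, len(xs)):
    step = float(np.linalg.norm(xs[k] - xs[k - 1]))
    for j in s:                                  # s_j^{k+1} = s_j^k + |x^{k+1} - x^k|, (2.12)
        s[j] += step
    s[k] = float(np.linalg.norm(xs[k] - ys[k]))  # new index: s_{k+1}^{k+1} = |x^{k+1} - y^{k+1}|

# by the triangle inequality, s_j^k >= |x^k - y^j| for every stored j, cf. (2.11)
assert all(s[j] >= np.linalg.norm(xs[-1] - ys[j]) - 1e-12 for j in s)
```

The point of the recursion is exactly this bound: the trial points y^j
themselves never need to be stored.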


Since we want to extend the methods presented in Chapter 2, suppose
momentarily that f is convex. Then the k-th search direction finding
subproblem (2.2.11) of the method with subgradient selection can be writ-
ten as

    minimize    ½|d|² + u   over (d,u) ∈ R^N × R,
                                                                         (2.15)
    subject to  f(x^k) - α_j^k + <g^j, d> ≤ u,   j ∈ J^k,

since

    f(x^k) - α_j^k = f(x^k) - [f(x^k) - f_j^k] = f_j^k   for all j,

because γ = 0 and α_j^k = f(x^k) - f_j^k ≥ 0. The above problem can be formulated simi-
larly to (2.6):

    minimize f̂^k(x^k + d) + ½|d|²   over all d ∈ R^N                    (2.16)

in terms of the k-th polyhedral approximation to f

    f̂^k(x) = max{f(x^k) - α_j^k + <g^j, x - x^k> : j ∈ J^k},            (2.17)

with the solution (d^k, u^k) of (2.15) satisfying

    f̂^k(x^k + d^k) = u^k.                                               (2.18)

Moreover, letting

    v̂^k = u^k - f(x^k),                                                 (2.19)

we see that (d^k, v̂^k) is a solution to the following problem

    minimize    ½|d|² + v̂   over (d,v̂) ∈ R^{N+1},
                                                                         (2.20)
    subject to  -α_j^k + <g^j, d> ≤ v̂,   j ∈ J^k.

Also the variable

    v̂^k = f̂^k(x^k + d^k) - f(x^k)                                      (2.21)

approximates the derivative of f at x^k in the direction d^k.
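Subproblem (2.20) can be attacked through its dual, which minimizes
½|Σ_j λ_j g^j|² + Σ_j λ_j α_j^k over the unit simplex (cf. (3.17)-(3.19)). A
rough Frank-Wolfe sketch on made-up data (the subgradients g and measures
alpha below are ours, and Frank-Wolfe is just one possible solver, not the
one used in the book):

```python
import numpy as np

g = np.array([[1.0, 0.0], [-1.0, 0.5], [0.2, -1.0]])   # rows: subgradients g^j
alpha = np.array([0.0, 0.1, 0.05])                     # locality measures alpha_j^k

lam = np.full(len(g), 1.0 / len(g))                    # start at the simplex center
for i in range(2000):
    p = lam @ g                                        # p = sum_j lam_j g^j
    grad = g @ p + alpha                               # gradient of the dual objective
    j = int(np.argmin(grad))                           # best simplex vertex e_j
    direction = -lam
    direction[j] += 1.0                                # e_j - lam
    lam = lam + (2.0 / (i + 2.0)) * direction          # standard Frank-Wolfe step

p = lam @ g
d = -p                                                 # d^k = -p^k, cf. (3.18)
v = -(float(p @ p) + float(lam @ alpha))               # cf. (3.19)
```

Any convergent simplex-constrained QP solver would do here; the primal
solution is then recovered from the multipliers exactly as in (3.18)-(3.19).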


To sum up, subproblem (2.20) reduces in the convex case to the
search direction finding subproblem of the method with subgradient se-
lection from Chapter 2. Moreover, we can use the subgradient selection
rules developed in Chapter 2 for constructing sets J^k ⊂ {1, ..., k} such
that |J^k| ≤ N+2 for all k. These observations justify using sub-
problem (2.20) in the method with subgradient selection for nonconvex
minimization, which will be described below. However, it turns out that
the subgradient selection rules of Chapter 2 need to be modified in the
nonconvex case. Since the same modifications will apply also to the sub-
gradient aggregation rules, we shall first describe search direction
finding subproblems based on subgradient aggregation.
To this end, let us suppose momentarily that f is convex and con-
sider the k-th iteration of the method with subgradient aggregation
(Algorithm 2.3.1) from Chapter 2. For search direction finding this
method replaces the past subgradients (g^j, f_j^k), j = 1, ..., k-1, by just one
aggregate subgradient (p^{k-1}, f_p^k), which is their convex combination

    (p^{k-1}, f_p^k) ∈ conv{(g^j, f_j^k) : j = 1, ..., k-1}              (2.22)

calculated at the (k-1)-st iteration. The "ordinary" linearizations

    f_j(x) = f_j^k + <g^j, x - x^k>   for all x

and the (k-1)-st aggregate linearization

    f̃_p^{k-1}(x) = f_p^k + <p^{k-1}, x - x^k>   for all x

satisfy

    f(x) ≥ f_j(x)   for j = 1, ..., k,

    f(x) ≥ f̃_p^{k-1}(x)

for all x ∈ R^N when f is convex. Therefore the k-th aggregate polyhedral
approximation f̂_a^k to f, defined by choosing a set J^k ⊂ {1, ..., k}, k ∈ J^k,
and setting

    f̂_a^k(x) = max{f̃_p^{k-1}(x), f_j(x) : j ∈ J^k}   for all x         (2.23)

is a lower approximation to f,

    f(x) ≥ f̂_a^k(x)   for all x.

Note that f̂_a^k can be expressed as

    f̂_a^k(x) = max{f(x^k) - [f(x^k) - f_p^k] + <p^{k-1}, x - x^k>,

                   f(x^k) - [f(x^k) - f_j^k] + <g^j, x - x^k> : j ∈ J^k},   (2.24)

where the linearization errors satisfy

    f(x^k) - f_j^k ≥ 0   for j ∈ J^k,
                                                                         (2.25)
    f(x^k) - f_p^k ≥ 0.

To extend the above construction to the nonconvex case, suppose
that at the k-th iteration we have the (k-1)-st aggregate subgradient
(p^{k-1}, f_p^k, s_p^k) ∈ R^N × R × R that satisfies the following generalization
of (2.22):

    (p^{k-1}, f_p^k, s_p^k) = Σ_{j=1}^{k-1} λ_j^{k-1} (g^j, f_j^k, s_j^k),   (2.26a)

    λ_j^{k-1} ≥ 0   for j = 1, ..., k-1,   Σ_{j=1}^{k-1} λ_j^{k-1} = 1.  (2.26b)

Similarly to (2.14), define the following aggregate subgradient locali-
ty measure

    α_p^k = max{|f(x^k) - f_p^k|, γ(s_p^k)²}.                            (2.27)

The value of α_p^k indicates how far p^{k-1} is from ∂f(x^k). Indeed, for con-
vex f we have (γ = 0)

    α_p^k = f(x^k) - f_p^k

by (2.25), while relation (2.26) implies

    p^{k-1} ∈ ∂_ε f(x^k)   for ε = f(x^k) - f_p^k,

as in Lemma 2.4.2. On the other hand, if the value of α_p^k is small and
γ > 0 then the value of

    s_p^k = Σ_{j=1}^{k-1} λ_j^{k-1} s_j^k

is small, hence (2.26b) and the fact that s_j^k ≥ |x^k - y^j| imply that the
value of λ_j^{k-1} must be small if y^j is far from x^k, i.e. only local subgra-
dients g^j = g_f(y^j) with small values of s_j^k contribute significantly to

p^{k-1}. Therefore, in this case p^{k-1} is close to ∂f(x^k) by the local up-
per semicontinuity of ∂f (see Lemma 1.2.2).


We may now define the k-th aggregate polyhedral approximation to f

    f̂_a^k(x) = max{f(x^k) - α_p^k + <p^{k-1}, x - x^k>,

                   f(x^k) - α_j^k + <g^j, x - x^k> : j ∈ J^k}   for all x   (2.28)

and use it for finding the k-th search direction d^k that solves the
problem

    minimize f̂_a^k(x^k + d) + ½|d|²   over all d ∈ R^N.                 (2.29)

Observe that for convex f the above definition of f̂_a^k reduces to (2.23)
and (2.24). In effect, in the convex case we shall calculate d^k precise-
ly as in the method with subgradient aggregation in Chapter 2, see
(2.2.46) and (2.2.47).
Similarly to (2.20), one can find the solution of (2.29) by solving
the following quadratic programming problem for (d^k, v̂^k) ∈ R^N × R:

    minimize    ½|d|² + v̂   over (d,v̂) ∈ R^{N+1},

    subject to  -α_j^k + <g^j, d> ≤ v̂,   j ∈ J^k,                       (2.30)

                -α_p^k + <p^{k-1}, d> ≤ v̂.

Then the variable

    v̂^k = f̂_a^k(x^k + d^k) - f(x^k),                                   (2.31)

which approximates the derivative of f at x^k in the direction d^k, can
be used for line searching.
Next, we have to show how to update the aggregate subgradient re-
cursively, i.e. so that if (2.26) is satisfied for some k then it also
holds for k increased by 1. This is easy if one observes that subprob-
lem (2.30) is of the form (2.2.41), hence the updating rules introduced
in Section 2.2 can be used here. To this end, let λ_j^k, j ∈ J^k, and λ_p^k de-
note Lagrange multipliers of the k-th subproblem (2.30). By Lemma 2.2.1,
these multipliers form a convex combination

    λ_j^k ≥ 0,  j ∈ J^k,   λ_p^k ≥ 0,   Σ_{j∈J^k} λ_j^k + λ_p^k = 1.

Therefore, similarly to (2.2.45), we shall first calculate the current
aggregate subgradient

    (p^k, f̃_p^k, s̃_p^k) = Σ_{j∈J^k} λ_j^k (g^j, f_j^k, s_j^k) + λ_p^k (p^{k-1}, f_p^k, s_p^k).   (2.32)

In view of (2.26), this leads to the following desirable property

    (p^k, f̃_p^k, s̃_p^k) ∈ conv{(g^j, f_j^k, s_j^k) : j = 1, ..., k},   (2.33)

since a convex combination of convex combinations is again a convex com-
bination. Then, having found x^{k+1}, we can update the linearization va-
lues by setting

    f_j^{k+1} = f_j^k + <g^j, x^{k+1} - x^k>   for j ∈ J^k,

    f_p^{k+1} = f̃_p^k + <p^k, x^{k+1} - x^k>.

On the other hand, since

    s_j^{k+1} = s_j^k + |x^{k+1} - x^k|   for j ∈ J^k,

we shall set

    s_p^{k+1} = s̃_p^k + |x^{k+1} - x^k|.

It is easy to check (see Lemma 4.1) that the above updating formulae and
relation (2.33) yield

    (p^k, f_p^{k+1}, s_p^{k+1}) ∈ conv{(g^j, f_j^{k+1}, s_j^{k+1}) : j = 1, ..., k}.

Comparing the above relation with (2.26) we conclude that we have com-
pleted the recursion without using the subgradients (g^j, f_j^k, s_j^k) for
j ∈ {1, ..., k} \ J^k. Consequently, these subgradients need not be stored.
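The aggregation (2.32) and the subsequent translation of the stored
values are a few lines of arithmetic; a sketch with illustrative data
(the triples, multipliers and step below are ours):

```python
import numpy as np

# stored subgradient triples (g^j, f_j^k, s_j^k), a previous aggregate triple,
# and Lagrange multipliers summing to one -- all hypothetical
G = {1: np.array([1.0, 0.0]), 2: np.array([-0.5, 1.0]), 3: np.array([0.3, -0.2])}
fj = {1: 0.2, 2: -0.1, 3: 0.05}
sj = {1: 0.4, 2: 0.3, 3: 0.1}
p_prev, fp, sp = np.array([0.1, 0.2]), 0.0, 0.5
lam = {1: 0.2, 2: 0.3, 3: 0.1}
lam_p = 0.4

# aggregation (2.32): one convex combination of all the triples
p = sum(lam[j] * G[j] for j in G) + lam_p * p_prev
fp_new = sum(lam[j] * fj[j] for j in G) + lam_p * fp
sp_new = sum(lam[j] * sj[j] for j in G) + lam_p * sp

# after the step to x^{k+1}, translate the stored values
x_step = np.array([0.05, -0.02])                       # x^{k+1} - x^k
fj = {j: fj[j] + float(G[j] @ x_step) for j in G}      # f_j^{k+1}
fp_next = fp_new + float(p @ x_step)                   # f_p^{k+1}
sj = {j: sj[j] + float(np.linalg.norm(x_step)) for j in sj}   # s_j^{k+1}
sp_next = sp_new + float(np.linalg.norm(x_step))              # s_p^{k+1}
```

Because the multipliers form a convex combination, the updated aggregate
triple again lies in the convex hull of the updated ordinary triples, which
is exactly the invariant (2.26) carried to iteration k+1.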
In order to be able to use the above notation for k = 1, we shall
initialize the method by setting

    y^1 = x^1,  p^0 = g^1 = g_f(y^1),  f_p^1 = f_1^1 = f(y^1),  s_p^1 = s_1^1 = 0  and  J^1 = {1}.

In the method with subgradient aggregation the sets {J^k} can be
chosen as in Chapter 2. Thus let M_g ≥ 2 denote a user-supplied, fixed up-
per bound on the number of subgradients (including the aggregate subgra-
dient) that the algorithm may use for search direction finding at any
iteration. If the sets {J^k} are selected recursively so that

    k+1 ∈ J^{k+1} ⊂ J^k ∪ {k+1},

    |J^{k+1}| ≤ M_g - 1,

e.g.

    J^{k+1} = {k+1} ∪ Ĵ^k,

    Ĵ^k ⊂ {j ∈ J^k : λ_j^k > 0},

    |Ĵ^k| ≤ M_g - 1,

as in Remark 2.2.4, then one can control the size of subproblem (2.30),
and hence the core requirement and work per iteration.
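The index-set update above is a one-line filtering rule; a sketch with
hypothetical multipliers (the tie-breaking by recency is our choice, one
of several admissible ones):

```python
# J^{k+1} = {k+1} ∪ Ĵ^k with Ĵ^k drawn from the indices whose multipliers
# are positive, |Ĵ^k| <= Mg - 1; the data below are illustrative.
Mg = 4
k = 7
J = [3, 4, 5, 6, 7]                                    # current J^k
lam = {3: 0.0, 4: 0.5, 5: 0.0, 6: 0.3, 7: 0.2}         # multipliers lam_j^k

active = [j for j in J if lam[j] > 0]                  # candidates for Ĵ^k
J_hat = sorted(active, reverse=True)[:Mg - 1]          # keep at most Mg - 1, prefer recent
J_next = sorted(set(J_hat) | {k + 1})                  # J^{k+1}
```

With these data J_next becomes [4, 6, 7, 8], so the subproblem never grows
beyond Mg constraints regardless of the iteration count.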
The subgradient selection and aggregation strategy described above
has a certain theoretical drawback that precludes obtaining global con-
vergence results. Therefore we shall modify this strategy as follows.
Convergence analysis of existing subgradient algorithms usually
requires an assumption of uniform boundedness of all calculated subgra-
dients, see (Dixon and Gaviano, 1980; Goldstein, 1977; Kiwiel, 1981b; Mif-
flin, 1977b and 1982). We shall dispense with this assumption by modi-
fying the search direction finding subproblem (2.30) at some iterations.
To this end the algorithm will calculate a variable a^k satisfying

    a^k ≥ max{|x^k - y^j| : 1 ≤ j ≤ k-1 and λ_j^{k-1} ≠ 0},

where the multipliers λ_j^{k-1} satisfy (2.26). Simple recursive rules for
computing a^k, which do not require the knowledge of λ_j^{k-1}, will be given
in the next section. Thus a^k estimates the radius of the ball around
x^k from which the past subgradient information was collected to form
the aggregate subgradient (p^{k-1}, f_p^k, s_p^k). Therefore, we shall call a^k the
locality radius of the aggregate subgradient. Whenever the value of a^k
exceeds a fixed, large threshold value ā > 0, the aggregate subgradient
will be discarded at the k-th iteration and the set J^k will be reduced
by deleting the j-s with large distance measures s_j^k > ā/2. In this case
we shall say that a distance resetting occurs and set the reset indica-
tor r_a^k = 1; otherwise r_a^k = 0. The aggregate subgradient is discarded by de-
leting the last constraint of subproblem (2.30) and setting the corres-
ponding Lagrange multiplier λ_p^k to zero if r_a^k = 1. This modification ensu-
res that the algorithm will use only local subgradient information, i.e.
only subgradients g^j = g_f(y^j) with |x^k - y^j| ≤ ā, for all k. In view of the
local boundedness of ∂f, this suffices for locally uniform boundedness
of such subgradients g^j.
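The resetting rule just described reduces to scalar bookkeeping; a sketch
with hypothetical data (threshold ā, distance measures s_j), deleting the
indices with large distance measures as described above (the algorithm in
the next section states the rule more precisely):

```python
# Hypothetical data: threshold, distance measures s_j^{k+1}, previous radius.
a_bar = 1.0
s = {1: 0.9, 2: 0.2, 3: 0.7, 4: 0.1}     # s_j^{k+1} for j in J^{k+1}
a_prev, step, s_new = 0.8, 0.3, 0.15     # a^k, |x^{k+1} - x^k|, s_{k+1}^{k+1}

a = max(a_prev + step, s_new)            # recursive locality radius update
reset = a > a_bar
if reset:
    # distance resetting: drop indices with s_j > a_bar/2 until the
    # reset radius max_j s_j is at most a_bar/2
    for j in sorted(s):
        if s[j] > a_bar / 2:
            del s[j]
    a = max(s.values())
```

With these data the radius 1.1 exceeds ā = 1.0, indices 1 and 3 are dropped,
and the reset radius becomes 0.2 ≤ ā/2; note that the most recent index
survives, in line with the locality requirement on new trial points.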
Having described the subgradient aggregation strategy, we may now
return to the method with subgradient selection. As in Chapter 2, this
method solves subproblem (2.20) to find the k-th search direction d^k
and a set Ĵ^k satisfying

    Ĵ^k = {j ∈ J^k : λ_j^k ≠ 0},

    |Ĵ^k| ≤ N+1,

where λ_j^k, j ∈ J^k, are Lagrange multipliers of (2.20), which can be compu-
ted as shown in Lemma 2.2.1 (see also Remark 2.5.2). Then the aggregate
subgradient (p^k, f̃_p^k, s̃_p^k) is calculated according to (2.32), but with λ_p^k = 0.
The next subgradient index set J^{k+1} is of the form

    J^{k+1} = Ĵ^k ∪ {k+1},

which is analogous to the one used in Chapter 2. However, if at the
k-th iteration the locality radius a^k is too large, i.e. a^k > ā, then the
set J^k should be reduced to achieve

    a^k = max{s_j^k : j ∈ J^k} ≤ ā/2,

so that only local subgradients indexed by j ∈ J^k are used for the k-th
search direction finding.

Remark 2.1. As noted above, the use of distance measures s_j^k for estimat-
ing |x^k - y^j| enables us to dispense with storing trial points y^j. Still,
for theoretical reasons, one may consider the following version of the
method with subgradient selection. At the k-th iteration, let (d^k, v̂^k)
denote the solution to the following quadratic programming problem

    minimize    ½|d|² + v̂   over (d,v̂) ∈ R^{N+1},
                                                                         (2.34)
    subject to  -α(x^k, y^j) + <g^j, d> ≤ v̂,   j ∈ J^k,

and let λ_j^k, j ∈ J^k, denote the corresponding Lagrange multipliers satis-
fying |Ĵ^k| ≤ N+1 for Ĵ^k = {j ∈ J^k : λ_j^k > 0}. If we choose

    J^{k+1} = Ĵ^k ∪ {k+1}   for all k,

then this version will need additionally to store at most N+2 points
{y^j}_{j∈J^k} for calculating the locality measures α(x^k, y^j), for all k.
In this case the locality radius a^{k+1} can be computed directly by
setting

    a^{k+1} = max{|x^{k+1} - y^j| : j ∈ J^{k+1}},

and the set J^{k+1} should be reduced, if necessary, so that a^{k+1} ≤ ā. The
subsequent convergence results remain valid for this version of the
method. However, we do not think that this version would be more ef-
ficient in practice, since s_j^k is not, usually, much larger than |x^k - y^j|,
and the distance terms in the definitions of the locality measures
α(x^k, y^j) and α_j^k are rather arbitrary, anyway.

We end this section by commenting on the relations of the above
described methods with other algorithms. As shown above, the search
direction finding subproblems (2.20) and (2.30) generalize the subprob-
lems used by the methods of Chapter 2, and so also the subproblems of
Pshenichny's method of linearizations for minimax problems and the clas-
sical method of steepest descent. At the same time, subproblems (2.20)
and (2.34) are reduced versions of the Mifflin (1982) subproblem, which
is of the form (2.34), but with J^k = {1, ..., k}. Also they can be re-
lated to the "conceptual" search direction finding subproblem describ-
ed in Lemma 1.2.13; see Remark 2.2.6.

3. The Algorithm with Subgradient Aggregation

We now state an algorithmic procedure for solving the problem con-
sidered. Ways of implementing each step of the method are discussed be-
low.

Algorithm 3.1.
Step 0 (Initialization). Select the starting point x^1 ∈ R^N and a final
accuracy parameter ε_s ≥ 0. Choose fixed positive line search parameters
m_L, m_R, ā and t̄, with t̄ ≤ 1 and 0 < m_L < m_R < 1, and a distance measure parameter
γ > 0 (γ = 0 if f is convex). Set

    y^1 = x^1,  p^0 = g^1 = g_f(y^1),  f_p^1 = f_1^1 = f(y^1),  s_p^1 = s_1^1 = 0,  J^1 = {1}.

Set a^1 = 0 and the reset indicator r_a^1 = 1. Set the counter k = 1.
Step 1 (Direction finding). Find the solution (d^k, v̂^k) to the following
k-th quadratic programming problem

    minimize    ½|d|² + v̂   over (d,v̂) ∈ R^{N+1},

    subject to  -α_j^k + <g^j, d> ≤ v̂,   j ∈ J^k,                       (3.1)

                -α_p^k + <p^{k-1}, d> ≤ v̂   if r_a^k = 0,

where

    α_j^k = max{|f(x^k) - f_j^k|, γ(s_j^k)²}   for j ∈ J^k,              (3.2)

    α_p^k = max{|f(x^k) - f_p^k|, γ(s_p^k)²}.                            (3.3)

Compute Lagrange multipliers λ_j^k, j ∈ J^k, and λ_p^k of (3.1), setting λ_p^k = 0
if r_a^k = 1. Set

    (p^k, f̃_p^k, s̃_p^k) = Σ_{j∈J^k} λ_j^k (g^j, f_j^k, s_j^k) + λ_p^k (p^{k-1}, f_p^k, s_p^k),   (3.4)

    α̃_p^k = max{|f(x^k) - f̃_p^k|, γ(s̃_p^k)²},                         (3.5)

    v^k = -{|p^k|² + α̃_p^k}.                                            (3.6)

If λ_p^k = 0 set

    a^k = max{s_j^k : j ∈ J^k}.                                          (3.7)

Step 2 (Stopping criterion). Set

    w^k = ½|p^k|² + α̃_p^k.                                              (3.8)

If w^k ≤ ε_s then terminate. Otherwise, go to Step 3.
Step 3 (Line search). By a line search procedure as given below, find
two stepsizes t_L^k and t_R^k such that 0 ≤ t_L^k ≤ t_R^k and such that the two corre-
sponding points defined by

    x^{k+1} = x^k + t_L^k d^k   and   y^{k+1} = x^k + t_R^k d^k

satisfy t_L^k ≤ 1 and

    f(x^{k+1}) ≤ f(x^k) + m_L t_L^k v^k,                                 (3.9)

    t_R^k = t_L^k   if t_L^k ≥ t̄,                                       (3.10a)

    -α(x^{k+1}, y^{k+1}) + <g_f(y^{k+1}), d^k> ≥ m_R v^k   if t_L^k < t̄,   (3.10b)

    |y^{k+1} - x^{k+1}| ≤ ā/2.                                           (3.11)

Step 4 (Subgradient updating). Select a set Ĵ^k ⊂ J^k and set

    J^{k+1} = Ĵ^k ∪ {k+1}.                                               (3.12)

Set g^{k+1} = g_f(y^{k+1}) and

    f_j^{k+1} = f_j^k + <g^j, x^{k+1} - x^k>   for j ∈ Ĵ^k,             (3.13a)

    f_{k+1}^{k+1} = f(y^{k+1}) + <g^{k+1}, x^{k+1} - y^{k+1}>,           (3.13b)

    f_p^{k+1} = f̃_p^k + <p^k, x^{k+1} - x^k>,                           (3.13c)

    s_j^{k+1} = s_j^k + |x^{k+1} - x^k|   for j ∈ Ĵ^k,                   (3.14a)

    s_{k+1}^{k+1} = |x^{k+1} - y^{k+1}|,                                 (3.14b)

    s_p^{k+1} = s̃_p^k + |x^{k+1} - x^k|.                                (3.14c)

Step 5 (Distance resetting test). Set

    a^{k+1} = max{a^k + |x^{k+1} - x^k|, s_{k+1}^{k+1}}.                 (3.15)

If a^{k+1} ≤ ā then set r_a^{k+1} = 0 and go to Step 7. Otherwise, set r_a^{k+1} = 1
and go to Step 6.
Step 6 (Distance resetting). Keep deleting from J^{k+1} the smallest indi-
ces until the reset value of a^{k+1} satisfies

    a^{k+1} = max{s_j^{k+1} : j ∈ J^{k+1}} ≤ ā/2.                        (3.16)

Step 7. Increase k by 1 and go to Step 1.
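The overall flow of the algorithm can be illustrated by a toy version on
the convex function f(x) = |x| in one dimension. Everything below is our
own drastic simplification (γ = 0, a two-element bundle, no aggregation,
resets or t̄ logic, and a crude backtracking search), kept only to show how
serious steps, null steps and the stopping test interact:

```python
def f(x):
    return abs(x)

def gf(x):
    return 1.0 if x >= 0 else -1.0       # one subgradient of |x|

x = 5.0
bundle = [(gf(x), f(x), x)]              # triples (g^j, f(y^j), y^j)
for k in range(50):
    # convex locality measures alpha_j = f(x) - [f(y^j) + g^j (x - y^j)]
    data = [(g, f(x) - (fy + g * (x - yj))) for (g, fy, yj) in bundle]
    (g1, a1), (g2, a2) = (data + data)[:2]        # duplicate if only one entry
    # closed-form dual of the two-constraint QP (2.20) in one dimension
    denom = (g1 - g2) ** 2
    lam = 0.5 if denom == 0 else min(1.0, max(0.0, (g2 * (g2 - g1) - (a1 - a2)) / denom))
    p = lam * g1 + (1.0 - lam) * g2
    alpha = lam * a1 + (1.0 - lam) * a2
    w = 0.5 * p * p + alpha              # stationarity measure, cf. (3.8)
    if w <= 1e-10:
        break                            # Step 2: approximately stationary
    d, v = -p, -(p * p + alpha)          # cf. (3.18) and (3.6)
    t = 1.0                              # crude backtracking replaces Step 3
    while t > 1e-12 and f(x + t * d) > f(x) + 0.1 * t * v:
        t *= 0.5
    y = x + max(t, 1e-12) * d            # trial point y^{k+1}
    if f(x + t * d) <= f(x) + 0.1 * t * v:
        x = x + t * d                    # serious step
    bundle = [bundle[-1], (gf(y), f(y), y)]   # keep the two newest subgradients
assert abs(x) < 1e-6 and w <= 1e-10
```

The run takes serious steps from 5 down to the kink at 0, then a null step
supplies the opposite subgradient, the two bundle elements cancel, and the
stopping test fires with w essentially zero.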

A few remarks on the algorithm are in order.
By Lemma 2.2.1, the k-th subproblem dual to (3.1) is to find va-
lues of multipliers λ_j, j ∈ J^k, and λ_p to

    minimize    ½|Σ_{j∈J^k} λ_j g^j + λ_p p^{k-1}|² + Σ_{j∈J^k} λ_j α_j^k + λ_p α_p^k,

    subject to  λ_j ≥ 0, j ∈ J^k,   λ_p ≥ 0,   Σ_{j∈J^k} λ_j + λ_p = 1,   (3.17)

                λ_p = 0   if r_a^k = 1.

Any solution of (3.17) is a Lagrange multiplier vector for (3.1) and
it yields the unique solution (d^k, v̂^k) of (3.1) as follows:

    d^k = -p^k,                                                          (3.18)

    v̂^k = -{|p^k|² + Σ_{j∈J^k} λ_j^k α_j^k + λ_p^k α_p^k},              (3.19)

where p^k is given by (3.4). Moreover, any Lagrange multipliers of
(3.1) also solve (3.17). In particular, they form a convex combination:

    λ_j^k ≥ 0, j ∈ J^k,  λ_p^k ≥ 0,  Σ_{j∈J^k} λ_j^k + λ_p^k = 1,  λ_p^k = 0 if r_a^k = 1.   (3.20)

Thus one may equivalently solve the dual search direction finding sub-
problem (3.17) in Step 1 of the algorithm.
The stopping criterion in Step 2 admits of the following interpre-
tation. The value of the locality measure α̃_p^k given by (3.5) indicates
how far p^k is from ∂f(x^k) (see the discussion of (2.27) in Section 2).
A small value of w^k indicates both that |p^k| is small and that p^k is
close to ∂f(x^k), because the value of the locality measure α̃_p^k is small.
Thus the null vector is close to ∂f(x^k), i.e. x^k is approximately sta-
tionary. In general, w^k may be thought of as a measure of stationari-
ty of x^k. If f is convex and x̄ is a minimum point of f, then we have

    min{f(x) : x ∈ R^N} ≥ f(x^k) - (2ε_s)^{1/2} |x̄ - x^k| - ε_s          (3.21)

upon termination at Step 2, see Remark 2.3.3 and results in the next
section. In the nonconvex case our stopping criterion is a generaliza-
tion of the standard criterion of a small value of the gradient of f.
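In the convex case the bound (3.21) follows from the aggregate lineariza-
tion inequality f(x) ≥ f(x^k) + <p^k, x - x^k> - α̃_p^k; a sketch of the de-
rivation (details are in Remark 2.3.3 and the next section):

```latex
% w^k = \tfrac12|p^k|^2 + \tilde\alpha_p^k \le \varepsilon_s gives
% |p^k| \le (2\varepsilon_s)^{1/2} and \tilde\alpha_p^k \le \varepsilon_s, so
\begin{align*}
f(\bar x) &\ge f(x^k) + \langle p^k, \bar x - x^k\rangle - \tilde\alpha_p^k\\
          &\ge f(x^k) - |p^k|\,|\bar x - x^k| - \tilde\alpha_p^k\\
          &\ge f(x^k) - (2\varepsilon_s)^{1/2}\,|\bar x - x^k| - \varepsilon_s .
\end{align*}
```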
The line search rules of Step 3 modify Mifflin's rules (2.9) and
(2.10). In the next section we prove that the line search is always
entered with

    v̂^k = f̂_a^k(x^k + d^k) - f(x^k) ≤ v^k < 0.                         (3.22)

Hence the criterion (3.9) guarantees that x^{k+1} has a significantly smal-
ler objective value than x^k if x^{k+1} ≠ x^k. This prevents the algorithm
from taking infinitely many serious steps (t_L^k > 0) with no significant
improvement in the objective value, which could impair convergence.
The parameter t̄ > 0, e.g. t̄ = 0.01, is introduced to decrease the number
of function and subgradient evaluations at line searches (observe that
our requirements (3.10) are less severe than (2.9b)). Recall that in
the line search rules for convex minimization described in Section 2.7
we had serious steps with t_L^k ≥ t̄ and null steps with t_L^k = 0. Here the para-
meter t̄ distinguishes "long" serious steps with t_L^k ≥ t̄, and "short" se-
rious steps with 0 < t_L^k < t̄, for which (3.10b) is satisfied. It will
be seen that as far as convergence analysis is concerned, short se-
rious steps are essentially equivalent to null steps. If t_L^k ≥ t̄, i.e. if
a significant decrease of the objective value occurs, then there is no
need for detecting discontinuities in the gradient of f, so the algo-
rithm sets g^{k+1} = g_f(x^{k+1}). On the other hand, if t_L^k < t̄, which indicates
that the algorithm is blocked at x^k due to the nondifferentiability of
f, then the criterion (3.10b) ensures that the new subgradient
g^{k+1} = g_f(y^{k+1}), with y^{k+1} and x^k lying on the opposite sides of a dis-
continuity of the gradient of f, will force a significant modification
of the next search direction finding subproblem. The criterion (3.11),
which is related to the distance resetting test, prevents the algorithm
from collecting irrelevant subgradient information.
Clearly, the line search rules (3.9) - (3.11) are so general that
one can devise many procedures for implementing Step 3, see (Lemarechal,
1981; Mifflin, 1977b and 1982; Wierzbicki, 1982; Wolfe, 1978). For com-
pleteness, we give below a procedure for finding stepsizes t_L = t_L^k and
t_R = t_R^k, which is based on the ideas of Mifflin (1977b and 1982). In this
procedure ζ is a fixed parameter satisfying ζ ∈ (0, 0.5), x = x^k, d = d^k and
v = v^k.

Line Search Procedure 3.2.

(i)   Set t_L = 0 and t = t_U = 1. Set i = 1.
(ii)  If f(x + td) ≤ f(x) + m_L t v, set t_L = t; otherwise set t_U = t.
(iii) If t_L ≥ t̄, set t_R = t_L and return.
(iv)  If -α(x + t_L d, x + td) + <g_f(x + td), d> ≥ m_R v and t_L < t̄ and
      (t - t_L)|d| ≤ ā/2, set t_R = t and return.
(v)   Choose t ∈ [t_L + ζ(t_U - t_L), t_U - ζ(t_U - t_L)] by some interpolation pro-
      cedure.
(vi)  Increase i by 1 and go to (ii).
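A hedged Python sketch of the procedure on a one-dimensional convex
example (γ = 0, so the locality measure is the linearization error; f, the
parameter values and the calling data are our illustrations, and plain
bisection is used for step (v), which lies in the required interval for
ζ ≤ 0.5):

```python
def f(z):
    return max(z, -2.0 * z)              # convex piecewise-linear, kink at 0

def gf(z):
    return 1.0 if z > 0 else -2.0        # one subgradient of f at z

def line_search(x, d, v, mL=0.1, mR=0.5, t_bar=0.1, a_bar=4.0):
    tL, tU, t = 0.0, 1.0, 1.0
    for _ in range(100):
        if f(x + t * d) <= f(x) + mL * t * v:     # step (ii): descent test
            tL = t
        else:
            tU = t
        if tL >= t_bar:                           # step (iii): long serious step
            return tL, tL
        g = gf(x + t * d)                         # step (iv): short/null step test
        alpha = abs(f(x + tL * d) - (f(x + t * d) + g * (tL - t) * d))
        if -alpha + g * d >= mR * v and (t - tL) * abs(d) <= a_bar / 2:
            return tL, t
        t = (tL + tU) / 2.0                       # step (v): bisection
    return tL, t                                  # not reached for semismooth f

tL, tR = line_search(0.05, -1.0, -1.0)            # start just right of the kink
```

Started just right of the kink, the descent test fails at t = 1 and the
short-step test fires at once, so the call returns a null step tL = 0 with
tR = 1: the trial point lands on the other side of the gradient jump, which
is exactly what (3.10b) is meant to detect.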

Convergence of the above procedure can be established as in (Mif-
flin, 1977b, 1982) if f satisfies the following "semismoothness" hypoth-
esis (see Bihain, 1984):

    for any x ∈ R^N, d ∈ R^N and sequences {g̃^i} ⊂ R^N and {t^i} ⊂ R_+ satisfying

    g̃^i ∈ ∂f(x + t^i d) and t^i ↓ 0, one has                            (3.23)

    limsup_{i→∞} <g̃^i, d> ≥ liminf_{i→∞} [f(x + t^i d) - f(x)]/t^i.

Lemma 3.3. If f has the property (3.23) then Line Search Procedure 3.2
terminates with t_L^k = t_L and t_R^k = t_R satisfying (3.9) - (3.11).

Proof. Assume, for contradiction purposes, that the search does not
terminate. We recall that the line search is entered with 0 < m_L < m_R < 1,
ā > 0, t̄ > 0 and v < 0. Let t^i, t_L^i and t_U^i denote the values taken on by t,
t_L and t_U, respectively, after the i-th execution of step (ii) of the
procedure, so that t^i ∈ {t_L^i, t_U^i} for all i. Since ζ ∈ (0, 0.5),
t_L^i ≤ t_L^{i+1} ≤ t_U^{i+1} ≤ t_U^i and t_U^{i+1} - t_L^{i+1} ≤ (1-ζ)(t_U^i - t_L^i) for all i, there exists
t̃ ≥ 0 satisfying t_L^i ↑ t̃ and t_U^i ↓ t̃. Also t̃ ≤ t̄, since we must have t_L^i < t̄ for all
i. Let

    T_L = {t ≥ 0 : f(x + td) ≤ f(x) + m_L t v}.

Since {t_L^i} ⊂ T_L, t_L^i ↑ t̃ and f is continuous, we have t̃ ∈ T_L, i.e.

    f(x + t̃d) - f(x) ≤ m_L t̃ v.                                          (3.24a)

Since t_U^i ↓ t̃, t_U^i ∉ T_L and t^i = t_U^i if t^i ∉ T_L, there exists an infinite set
I ⊂ {1, 2, ...} such that t^i = t_U^i > t̃ and

    f(x + t^i d) - f(x) > m_L t^i v   for all i ∈ I.                      (3.24b)

By (3.24), we have

    [f(x + t^i d) - f(x + t̃d)]/(t^i - t̃) > m_L v   for all i ∈ I,

hence

    liminf_{i→∞, i∈I} [f(x + t̃d + (t^i - t̃)d) - f(x + t̃d)]/(t^i - t̃) ≥ m_L v.   (3.25)

Next, for sufficiently large i we have t_L^i < t̄ and (t^i - t_L^i)|d| ≤ ā/2, beca-
use t_L^i ↑ t̃ and t^i ↓ t̃. Therefore

    -α(x + t_L^i d, x + t^i d) + <g̃^i, d> < m_R v   for all large i,

where g̃^i = g_f(x + t^i d) for all i. But

    α(x + t_L^i d, x + t^i d) = max{|f(x + t_L^i d) - f(x + t^i d) - (t_L^i - t^i)<g̃^i, d>|,

                                    γ(t^i - t_L^i)²|d|²} → 0,

because t_L^i ↑ t̃, t^i ↓ t̃, f is continuous and the subgradient mapping g_f is
locally bounded (see Lemma 1.2.2). Therefore

    limsup_{i→∞} <g̃^i, d> ≤ m_R v.                                       (3.26)

Since 0 < m_L < m_R < 1 and v < 0, we have m_R v < m_L v; hence (3.25) and (3.26) con-
tradict the property (3.23) applied to the subsequences indexed by
i ∈ I. Therefore the search terminates. It is easy to show that (3.9) -
(3.11) hold at termination.

Remark 3.4. Following Mifflin (1977b), we say that f is weakly upper
semismooth if in addition to (3.23) it satisfies the condition

    liminf_{i→∞} <g̃^i, d> ≥ limsup_{i→∞} [f(x + t^i d) - f(x)]/t^i.

In particular, every convex function is weakly upper semismooth, since

    f(x) ≥ f(x + t^i d) + <g_f(x + t^i d), x - (x + t^i d)>

         = f(x + t^i d) - t^i <g̃^i, d>

and hence

    <g̃^i, d> ≥ [f(x + t^i d) - f(x)]/t^i

if f is convex and g̃^i = g_f(x + t^i d) ∈ ∂f(x + t^i d). Many important classes of
weakly upper semismooth functions are well described in (Lemarechal,
1981; Mifflin, 1977a and 1977b). In general, it may be difficult to
verify whether (3.23) is satisfied in any specific situation. We be-
lieve, however, that (3.23) is likely to hold for most locally Lips-
chitzian functions that arise in applications. Our computational expe-
rience indicates that the above line search procedure always terminates,
provided that function and subgradient evaluations are not distorted
too much by gross numerical (or programming) errors.

Remark 3.5. One may choose trial stepsizes t in step (v) of Line Search
Procedure 3.2 as follows. If on entering step (v) of the procedure we
have t_L = 0, which means that t = t_U > 0 and

    f(x + t_U d) > f(x) + m_L t_U v,                                      (3.27)

then one may set

    t = max{ζ t_U, t̃},                                                   (3.28a)

where

    t̃ = 0.5 v (t_U)² / [t_U v + f(x) - f(x + t_U d)]                     (3.28b)

minimizes the quadratic function h: R_+ → R that interpolates the function
t ↦ f(x + td) in the sense that h(0) = f(x), h(t_U) = f(x + t_U d) and h'(0) = v. It
is easy to check that if (3.27) holds and m_L ∈ (0, 0.5) then

    t ≤ max{ζ, 0.5/(1 - m_L)} t_U,

which ensures the necessary contraction. On the other hand, if t_L > 0

then one may choose the next trial stepsize either by arithmetic bisec-
tion

    t = (t_L + t_U)/2,

or geometric bisection

    t = (t_L t_U)^{1/2}.
It is important to observe that the algorithm never deletes the
latest subgradient, i.e. we have

    k ∈ J^k   for all k.                                                 (3.29)

To see this, note that J^1 = {1} and that in Step 4 the index k+1 is the
largest in J^{k+1}, and s_{k+1}^{k+1} = |y^{k+1} - x^{k+1}| ≤ ā/2 owing to (3.14b) and (3.11).
Therefore k+1 cannot be deleted from J^{k+1} in Step 6.

4. Convergence

In this section we shall establish global convergence of Algorithm
3.1. We suppose that each execution of Line Search Procedure 3.2 is fi-
nite, e.g. that f has the additional semismoothness property (3.23).
Naturally, convergence results assume that the final accuracy toleran-
ce ε_s is set to zero. In the absence of convexity, we will content
ourselves with finding stationary points for f. Our principal result
states that Algorithm 3.1 either terminates at a stationary point or
generates an infinite sequence {x^k} whose accumulation points are sta-
tionary for f. When f is convex, {x^k} is a minimizing sequence, which
converges to a minimum point of f whenever f attains its infimum.
For convenience, we precede the main result by several lemmas.
Besides being of interest in their own right, they can be used for
establishing global convergence of the algorithms described in Chap-
ter 2 that employ the general line search rules (2.6.2).
We shall need the following notation connected with the resets
of the algorithm. Observe that the relation r_a^k = 1 indicates that the algo-
rithm is reset at the k-th iteration. Also λ_p^k = 0 implies that the ag-
gregate subgradient is discarded at the k-th iteration. Let

    k_r(k) = max{j : j ≤ k and r_a^j = 1},                                (4.1a)

    J_r^k = J^{k_r(k)} ∪ {j : k_r(k) < j ≤ k},                            (4.1b)

    k_p(k) = max{j : j ≤ k and λ_p^j = 0},                                (4.2a)

    Ĵ_p^k = J^{k_p(k)} ∪ {j : k_p(k) < j ≤ k},                            (4.2b)

for all k.
The following lemma shows that the aggregate subgradient is a
convex combination of the subgradients retained at the latest reset
and the subgradients calculated after the latest reset.

Lemma 4.1. Suppose k ≥ 1 is such that Algorithm 3.1 did not stop before
the k-th iteration. Then there exist numbers λ̃_j^k, j ∈ Ĵ_p^k, satisfying

    (p^k, f̃_p^k, s̃_p^k) = Σ_{j∈Ĵ_p^k} λ̃_j^k (g^j, f_j^k, s_j^k),       (4.3a)

    λ̃_j^k ≥ 0,  j ∈ Ĵ_p^k,   Σ_{j∈Ĵ_p^k} λ̃_j^k = 1.                   (4.3b)

Moreover,

    a^k = max{s_j^k : j ∈ Ĵ_p^k},                                         (4.4)

    |x^k - y^j| ≤ a^k ≤ ā   for all j ∈ Ĵ_p^k.                            (4.5)

Proof. (i) It follows from (4.2) and the rules of the algorithm that
we always have J^{k+1} ⊂ J^k ∪ {k+1} and J^k ⊂ Ĵ_p^k. Therefore, in view of (3.4)
and (3.20), we can define additional multipliers

    λ_j^k = 0   for j ∈ Ĵ_p^k \ J^k,

so that

    (p^k, f̃_p^k, s̃_p^k) = Σ_{j∈Ĵ_p^k} λ_j^k (g^j, f_j^k, s_j^k) + λ_p^k (p^{k-1}, f_p^k, s_p^k),   (4.6a)

    λ_j^k ≥ 0,  j ∈ Ĵ_p^k,   Σ_{j∈Ĵ_p^k} λ_j^k + λ_p^k = 1,              (4.6b)

for any k. Suppose that λ_p^k = 0 for some k ≥ 1. Then Ĵ_p^k = J^k by (4.2), and
(4.3) follows from (4.6) if one sets λ̃_j^k = λ_j^k for all j ∈ Ĵ_p^k = J^k. Also
(4.4) is implied by (3.7) if λ_p^k = 0. Observe that λ_p^1 = 0 since r_a^1 = 1. Hence
to prove that relations (4.3) - (4.4) are valid for any k, it suf-
fices to show that if they hold for some fixed k and λ_p^{k+1} > 0, then they
are true also for k increased by 1. Therefore, suppose that (4.3) and

(4.4) are satisfied for some k=n and ~ + < 0. By (4.2), we have ~k+l
p =
_~k
-Jp u{k+l}. Let

~k+l k+l Ik+l ~k for j e ^k ~k+l = lk+l


lj = lJ + P 3 JP' k+l k+l" (4.7)

From (4.3b), (4.6b) and (4.7),

~k+l~0 for all j 6 3k+l


3 P '
z ~k+l - ~k+ik+l + Z ^ ~k+lj = lk+ik+l+ Z Ik+l+j lk+l:k)=p
3'
j : ~k: : j ~ jk" j 6 jk (
P P P
= xk + 1 + z x}+: + ~k+1 z ~k =
k+l %k 3 P ^
j ~ jk 3
j~
P P

kk+l + kk+l = 1
i

j 3k+l 3 P
P

which yields (4.3b) for k increased by i. From (4.6), (4.7), (4.3) and
the fact that 3k+l=~k
p p u {k+l} we obtain
k+l k+l gj + ~ + I pk =
p = Z lj
j e 3 k+l
P
= ik+l k+l k+l k+l ~ )g3' = Z -k+l g j ,
lj
k+l g + ~ (~j + Ip
j E ~k j ~ ~k+l
P P
and

~k+l Z k+l fk+l + ik+l fk+l


P = ~k+l kj J P P =
j~

z kk+l fk+l + _k+l [~k + <pk,xk+l_xk>] =


tk+l J J ~p
jE dp

k+l~ fk+l + ik+l -k [fk + < g J , x k + l _ x k > ] =


k] ] P lj
ek+l
j e dp j 6 jk
P

k+l fk+l . k+l ik ~ fk+l -k+ifk+l


= Ik+l k+l + j Z3 ~ ~j + P ~ ) 3 = jgZ ~k+ip ~j J

from (3.13), and


109

~k+l = ~ ik+l sk+l + k+l s k + l


Sp ~k+l j 3 Ip p
jE
P
lk+l sk+l + k + l ~k k+l k
j ~ ~k+l j 3 P (Sp + Ix -x I) =
P

Z lk+l sk+l + lk+l ~k k k+l k


J 3k+l J 3 P j eZ2k 3 (sj + Ix -x I) =
P P

= ik+l k+l Ik+l+ Ik+l ~ ) s. = Z s.


k+l Sk+l + Z ( J P 3 ~k+l 3 3
j~3~ je

k+l
from (3.14). This yields
(4.3a) for k=n+l. Next, since l_ >0 by assump-
k+l P k+l
tion, the rules of the algorithm imply that r a =0, and so a is co-
mputed by (3.15), i.e.

k+l k+l_xk ] k+l k+l


a = max{a k + Ix , Sk+ I} if ~p >0.

Combining this with (4.4) and the fact that s3k+l=s3+k xk+l-xkl for all
j E ~ and that ~k+l
P = ~kP u {k+l}, we obtain

max{s +i: j ~k+l}


P _ = max{max{s~+l: j e J }, Sk+l~ =

= max{max{sk3 +Ixk+l-xkl: j e J~}, Sk+l


}k+l =

= max{a k +Ixk+l-xkl, k+l~ = a k+l ,


Sk+l~

which shows that (4.4) holds for k=n+l.


(ii) Since $|x^k - y^j| \le s_j^k$ for all $j \le k$, and $a^k \le \bar a$ by the rules of Step 5
and Step 6, (4.5) follows from (4.4). $\square$
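The induction above can be checked numerically. The sketch below (our own toy data, not from the book) builds the aggregate $p^k$ by the recursion $p^k = \sum_j \lambda_j^k g^j + \lambda_p^k p^{k-1}$ of (4.6a) and, in parallel, unfolds the multipliers by (4.7); the unfolded coefficients stay nonnegative, sum to one, and reproduce $p^k$, as Lemma 4.1 asserts (the $f$ and $s$ components behave identically).

```python
# Toy check of Lemma 4.1: the recursive aggregate unfolds into a
# convex combination of past subgradients via the recursion (4.7).

g = {1: 2.0, 2: -1.0, 3: 0.5}           # subgradients g^j (toy values)
steps = [                               # (lam_j^k, lam_p^k) per iteration
    ({1: 1.0}, 0.0),                    # k = 1: lam_p^1 = 0 (reset)
    ({1: 0.3, 2: 0.3}, 0.4),            # k = 2
    ({2: 0.1, 3: 0.5}, 0.4),            # k = 3
]

p = 0.0
tilde = {}                              # unfolded multipliers tilde-lam_j^k
for lam, lam_p in steps:
    p = sum(l * g[j] for j, l in lam.items()) + lam_p * p      # (4.6a)
    tilde = {j: lam_p * l for j, l in tilde.items()}           # (4.7)
    for j, l in lam.items():
        tilde[j] = tilde.get(j, 0.0) + l

assert all(l >= 0 for l in tilde.values())
assert abs(sum(tilde.values()) - 1.0) < 1e-12                  # (4.3b)
assert abs(p - sum(l * g[j] for j, l in tilde.items())) < 1e-12  # (4.3a)
print(p, tilde)
```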

In the convex case we have the following additional result.

Lemma 4.2. Suppose that $k \ge 1$ is such that Algorithm 3.1 did not stop
before the $k$-th iteration, and that $f$ is convex. Then
$$ p^k \in \partial_\varepsilon f(x^k) \quad \text{for } \varepsilon = f(x^k) - \tilde f_p^k = \tilde\alpha_p^k \ge 0. \tag{4.8} $$

Proof. As in the proof of Lemma 2.4.2, use (4.3) and the fact that
$\tilde\alpha_p^k = |f(x^k) - \tilde f_p^k|$ if $f$ is convex, since $\gamma = 0$ in the convex case. $\square$

Our next result states that in fact the aggregate subgradient can
be expressed as a convex combination of $N+3$ (not necessarily different)
past subgradients calculated at points whose distances from the current
point do not exceed the threshold value $\bar a$.

Lemma 4.3. Suppose $k \ge 1$ is such that Algorithm 3.1 did not stop before
the $k$-th iteration, and let $M = N+3$. Then there exist numbers $\tilde\lambda_i^k$ and
vectors $(y^{k,i}, f^{k,i}, s^{k,i}) \in R^N \times R \times R$, $i = 1, \dots, M$, satisfying
$$ (p^k, \tilde f_p^k, \tilde s_p^k) = \sum_{i=1}^M \tilde\lambda_i^k\,(g_f(y^{k,i}), f^{k,i}, s^{k,i}), \tag{4.9a} $$
$$ \tilde\lambda_i^k \ge 0,\ i = 1, \dots, M, \qquad \sum_{i=1}^M \tilde\lambda_i^k = 1, \tag{4.9b} $$
$$ (g_f(y^{k,i}), f^{k,i}, s^{k,i}) \in \{(g_f(y^j), f_j^k, s_j^k):\ j \in \hat J_p^k\}, \quad i = 1, \dots, M, \tag{4.9c} $$
$$ |y^{k,i} - x^k| \le s^{k,i}, \quad i = 1, \dots, M, \tag{4.9d} $$
$$ \max\{s^{k,i}:\ i = 1, \dots, M\} \le a^k \le \bar a. \tag{4.9e} $$

Proof. (4.9) follows from Lemma 4.1, Caratheodory's theorem (Lemma
1.2.1), and the fact that $g^j = g_f(y^j)$ for $1 \le j \le k$. $\square$
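Lemma 4.3 hinges on Caratheodory's theorem: a convex combination of arbitrarily many points in $R^d$ can be rewritten using at most $d+1$ of them (here the triples $(g^j, f_j^k, s_j^k)$ live in dimension $d = N+2$). The following self-contained sketch of the standard reduction step is our own illustration, not code from the book:

```python
def null_vector(A):
    """Return a nonzero mu with A mu = 0 for an m x n system, n > m.
    Plain Gauss-Jordan elimination; adequate for the tiny systems here."""
    m, n = len(A), len(A[0])
    A = [row[:] for row in A]
    piv, r = [], 0                         # pivot columns, row i <-> piv[i]
    for c in range(n):
        p = next((i for i in range(r, m) if abs(A[i][c]) > 1e-12), None)
        if p is None:
            continue
        A[r], A[p] = A[p], A[r]
        A[r] = [a / A[r][c] for a in A[r]]
        for i in range(m):
            if i != r:
                A[i] = [a - A[i][c] * b for a, b in zip(A[i], A[r])]
        piv.append(c)
        r += 1
        if r == m:
            break
    free = next(c for c in range(n) if c not in piv)   # a free column exists
    mu = [0.0] * n
    mu[free] = 1.0
    for i, c in enumerate(piv):
        mu[c] = -A[i][free]
    return mu

def caratheodory(points, lam, d):
    """Reduce a convex combination of points in R^d to at most d+1 terms
    with the same weighted sum -- the reduction behind Lemma 4.3."""
    pts, lam = [list(p) for p in points], list(lam)
    while len(pts) > d + 1:
        # affine dependence: sum_i mu_i * (v_i, 1) = 0 with mu nonzero
        A = [[p[r] for p in pts] for r in range(d)] + [[1.0] * len(pts)]
        mu = null_vector(A)
        t = min(l / m for l, m in zip(lam, mu) if m > 1e-12)
        lam = [l - t * m for l, m in zip(lam, mu)]     # zero out one weight
        keep = [i for i, l in enumerate(lam) if l > 1e-12]
        pts, lam = [pts[i] for i in keep], [lam[i] for i in keep]
    return pts, lam

pts = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (1.0, 1.0)]
lam = [0.2] * 5
bary = [sum(l * p[r] for l, p in zip(lam, pts)) for r in range(2)]
rpts, rlam = caratheodory(pts, lam, 2)
rbary = [sum(l * p[r] for l, p in zip(rlam, rpts)) for r in range(2)]
assert len(rpts) <= 3 and abs(sum(rlam) - 1.0) < 1e-9
assert all(abs(a - b) < 1e-9 for a, b in zip(bary, rbary))
print(rpts, rlam)
```

The reduction preserves the combined point exactly while discarding at least one term per pass, which is why the bound $M = N+3$ in (4.9) is independent of how many subgradients were aggregated.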

In order to deduce stationarity results from the representation


(4.9), we shall need the following lemma.

Lemma 4.4. Let $\bar x \in R^N$ be given and suppose that the following hypothesis
is fulfilled:
there exist $N$-vectors $\bar p$, $\bar y^i$, $\bar g^i$ for $i = 1, \dots, M$, $M = N+3$, and numbers
$\bar f_p$, $\bar s_p$, $\bar\lambda_i$, $\bar f^i$, $\bar s^i$, satisfying
$$ (\bar p, \bar f_p, \bar s_p) = \sum_{i=1}^M \bar\lambda_i\,(\bar g^i, \bar f^i, \bar s^i), \tag{4.10a} $$
$$ \bar\lambda_i \ge 0,\ i = 1, \dots, M, \qquad \sum_{i=1}^M \bar\lambda_i = 1, \tag{4.10b} $$
$$ \bar g^i \in \partial f(\bar y^i), \quad i = 1, \dots, M, \tag{4.10c} $$
$$ \bar f^i = f(\bar y^i) + \langle \bar g^i, \bar x - \bar y^i \rangle, \quad i = 1, \dots, M, \tag{4.10d} $$
$$ |\bar x - \bar y^i| \le \bar s^i, \quad i = 1, \dots, M, \tag{4.10e} $$
$$ f(\bar x) = \bar f_p, \tag{4.10f} $$
$$ \gamma \bar s_p = 0. \tag{4.10g} $$
(Recall that $\gamma = 0$ only if $f$ is convex; otherwise $\gamma > 0$.) Then $\bar p \in \partial f(\bar x)$.

Proof. (i) First, suppose that $\gamma > 0$. Let $I = \{i:\ \bar\lambda_i \ne 0\}$. By (4.10g),
$\bar s_p = 0$, hence (4.10a,b) and (4.10e) imply $\bar y^i = \bar x$ for all $i \in I$, so (4.10c)
yields $\bar g^i \in \partial f(\bar x)$ for all $i \in I$. Thus we have $\bar p = \sum_{i \in I} \bar\lambda_i \bar g^i$, $\bar\lambda_i > 0$ for
$i \in I$, $\sum_{i \in I} \bar\lambda_i = 1$ and $\bar g^i \in \partial f(\bar x)$, $i \in I$, so $\bar p \in \partial f(\bar x)$ by the convexity
of $\partial f(\bar x)$.
(ii) Next, suppose that $\gamma = 0$. Then $f$ is convex, and (4.10c) and (4.10d)
give
$$ f(z) \ge f(\bar y^i) + \langle \bar g^i, z - \bar y^i \rangle = f(\bar x) + \langle \bar g^i, z - \bar x \rangle - [f(\bar x) - \bar f^i] $$
for all $z \in R^N$ and $i = 1, \dots, M$. Multiplying the above inequality by $\bar\lambda_i$
and summing, we obtain for each $z$
$$ f(z) \ge f(\bar x) + \langle \bar p, z - \bar x \rangle - [f(\bar x) - \bar f_p] = f(\bar x) + \langle \bar p, z - \bar x \rangle $$
from (4.10b), (4.10a) and (4.10f). Thus $\bar p \in \partial f(\bar x)$ by the definition of
the subdifferential in the convex case. $\square$

First we consider the case when the method terminates.

Lemma 4.5. If Algorithm 3.1 terminates at the $k$-th iteration, $k \ge 1$, then
the point $\bar x = x^k$ is stationary for $f$.

Proof. If the algorithm terminates at Step 2 due to $w^k \le \varepsilon_s = 0$, then,
since $w^k = \tfrac{1}{2}|p^k|^2 + \tilde\alpha_p^k$ and $\tilde\alpha_p^k \ge 0$, we have $p^k = 0$ and
$\tilde\alpha_p^k = \max\{|f(x^k) - \tilde f_p^k|,\ \gamma(\tilde s_p^k)^2\} = 0$,
hence $f(x^k) = \tilde f_p^k$ and $\gamma \tilde s_p^k = 0$. Combining this with (4.9a)-(4.9d), we see
that the hypothesis of Lemma 4.4 is fulfilled by $\bar x = x^k$, $\bar p = p^k$, $\bar f_p = \tilde f_p^k$,
$\bar s_p = \tilde s_p^k$, etc. Therefore $0 = p^k \in \partial f(\bar x)$. $\square$

From now on we suppose that the algorithm calculates an infinite
sequence $\{x^k\}$, i.e. $w^k > 0$ for all $k$.
The following lemma states useful asymptotic properties of the
aggregate subgradients.

Lemma 4.6. Suppose that there exist a point $\bar x \in R^N$ and an infinite set
$K \subset \{1, 2, \dots\}$ satisfying $x^k \to_K \bar x$. Then there exists an infinite set $\bar K \subset K$
such that the hypothesis (4.10a)-(4.10e) is fulfilled at $\bar x$ and
$$ (p^k, \tilde f_p^k, \tilde s_p^k) \to_{\bar K} (\bar p, \bar f_p, \bar s_p). \tag{4.11} $$
If additionally $\tilde\alpha_p^k \to_K 0$, then $\bar p \in \partial f(\bar x)$.

Proof. (i) From (4.9d,e), the fact that $\bar a < \infty$ and the assumption that
$x^k \to_K \bar x$, we deduce the existence of points $\bar y^i$, $i = 1, \dots, M$, and an
infinite set $K_1 \subset K$, satisfying
$$ y^{k,i} \to_{K_1} \bar y^i \quad \text{for } i = 1, \dots, M. \tag{4.12a} $$
By (4.12a), (4.9c), and the local boundedness and upper semicontinuity
of $\partial f$ (see Lemma 1.2.2), there exist $N$-vectors $\bar g^i$ and numbers $\bar f^i$,
$i = 1, \dots, M$, and an infinite set $K_2 \subset K_1$, satisfying
$$ g_f(y^{k,i}) \to_{K_2} \bar g^i \in \partial f(\bar y^i) \quad \text{for } i = 1, \dots, M, \tag{4.12b} $$
$$ f^{k,i} \to_{K_2} \bar f^i = f(\bar y^i) + \langle \bar g^i, \bar x - \bar y^i \rangle \quad \text{for } i = 1, \dots, M, \tag{4.12c} $$
since $f^{k,i} = f(y^{k,i}) + \langle g_f(y^{k,i}), x^k - y^{k,i} \rangle$ for $i = 1, \dots, M$ and all $k$. In
view of (4.9b,e) there exist numbers $\bar\lambda_i$ and $\bar s^i$, $i = 1, \dots, M$, and an
infinite set $\bar K \subset K_2$ such that
$$ \tilde\lambda_i^k \to_{\bar K} \bar\lambda_i \quad \text{for } i = 1, \dots, M, \tag{4.12d} $$
$$ s^{k,i} \to_{\bar K} \bar s^i \quad \text{for } i = 1, \dots, M. \tag{4.12e} $$
Letting $k \in \bar K$ approach infinity in (4.9a), (4.9b) and (4.9d), and using
(4.12), we obtain (4.10a)-(4.10e) and (4.11).
(ii) Suppose that $\tilde\alpha_p^k = \max\{|f(x^k) - \tilde f_p^k|,\ \gamma(\tilde s_p^k)^2\} \to_K 0$. Then
$\tilde f_p^k \to_{\bar K} \bar f_p = f(\bar x)$ and $\gamma \tilde s_p^k \to_{\bar K} \gamma \bar s_p = 0$, so $\bar p \in \partial f(\bar x)$ by Lemma 4.4. $\square$

Our next result describes a crucial property of the stationarity
measure $w^k$ of the current point $x^k$.

Lemma 4.7. Suppose that for some point $\bar x \in R^N$ we have
$$ \liminf_{k \to \infty}\ \max\{w^k,\ |\bar x - x^k|\} = 0, \tag{4.13} $$
or equivalently
there exists an infinite set $K \subset \{1, 2, \dots\}$ such that $x^k \to_K \bar x$
and $w^k \to_K 0$. $\qquad$ (4.14)
Then $0 \in \partial f(\bar x)$.

Proof. The equivalence of (4.13) and (4.14) follows from the fact that
$w^k$ is nonnegative for all $k$, since we always have $w^k = \tfrac{1}{2}|p^k|^2 + \tilde\alpha_p^k$ and
$\tilde\alpha_p^k \ge 0$. Thus (4.14) implies $p^k \to_K 0$ and $\tilde\alpha_p^k \to_K 0$, so Lemma 4.6 yields
the desired conclusion. $\square$

The above lemma enables us to reduce further convergence analysis
to checking whether $w^k$ approaches zero around any accumulation point
of $\{x^k\}$. To this end, as in Chapter 2, we shall now relate the
stationarity measures to the optimal values of the dual search direction
finding subproblems.
Let $\hat w^k$ denote the optimal value of the $k$-th dual search direction
finding subproblem (3.17), for all $k$. By (3.4) and the fact that the
Lagrange multipliers of (3.1) solve (3.17), we always have
$$ \hat w^k = \tfrac{1}{2}|p^k|^2 + \hat\alpha_p^k, \tag{4.15a} $$
where
$$ \hat\alpha_p^k = \sum_{j \in J^k} \lambda_j^k \alpha_j^k + \lambda_p^k \alpha_p^k. \tag{4.15b} $$
A useful relation between $w^k$ and $\hat w^k$ is established in the following
lemma.
ma.

Lemma 4.8. (i) At the $k$-th iteration of Algorithm 3.1, one has
$$ 0 \le \tilde\alpha_p^k \le \hat\alpha_p^k, \tag{4.16} $$
$$ 0 \le w^k \le \hat w^k, \tag{4.17} $$
$$ v^k \le -w^k \le 0, \tag{4.18} $$
$$ \hat v^k \le v^k. \tag{4.19} $$
(ii) If $f$ is convex then $\tilde\alpha_p^k = \hat\alpha_p^k$, $w^k = \hat w^k$ and $\hat v^k = v^k$, for all $k$.

Proof. (i) By (3.4) and (3.20),
$$ f(x^k) - \tilde f_p^k = \sum_{j \in J^k} \lambda_j^k \big(f(x^k) - f_j^k\big) + \lambda_p^k \big(f(x^k) - f_p^k\big), \tag{4.20a} $$
and, since the function $\gamma|y|^2$ is convex ($\gamma \ge 0$),
$$ \gamma(\tilde s_p^k)^2 = \gamma \Big( \sum_{j \in J^k} \lambda_j^k s_j^k + \lambda_p^k s_p^k \Big)^2 \le \sum_{j \in J^k} \lambda_j^k \gamma (s_j^k)^2 + \lambda_p^k \gamma (s_p^k)^2, \tag{4.20b} $$
for all $k$. Since the Lagrange multipliers $\lambda_j^k$ and $\lambda_p^k$ are nonnegative,
we obtain from (4.20)
$$ \tilde\alpha_p^k = \max\{|f(x^k) - \tilde f_p^k|,\ \gamma(\tilde s_p^k)^2\} \le $$
$$ \le \sum_{j \in J^k} \lambda_j^k \max\{|f(x^k) - f_j^k|,\ \gamma(s_j^k)^2\} + \lambda_p^k \max\{|f(x^k) - f_p^k|,\ \gamma(s_p^k)^2\} = $$
$$ = \sum_{j \in J^k} \lambda_j^k \alpha_j^k + \lambda_p^k \alpha_p^k = \hat\alpha_p^k, $$
which yields (4.16). Next, (4.17)-(4.19) follow immediately from
(4.16), (3.6), (3.8), (3.19) and (4.15).
(ii) If $f$ is convex then we have $f(x^k) - f_j^k \ge 0$ for all $j \in J^k$,
$f(x^k) - f_p^k \ge 0$ and $f(x^k) - \tilde f_p^k \ge 0$, hence equality holds in (4.20a).
Therefore, since $\gamma = 0$ in the convex case, we have $\tilde\alpha_p^k = \hat\alpha_p^k$, and the
preceding argument yields $w^k = \hat w^k$ and $v^k = \hat v^k$. $\square$
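The key inequality (4.16) is simply convexity of $(u, s) \mapsto \max\{|u|, \gamma s^2\}$ for $s \ge 0$. A toy numerical check (our own data, not from the book; for simplicity the aggregate components are formed as plain convex combinations, which corresponds to taking $\lambda_p^k = 0$):

```python
# Toy check of (4.16)-(4.17): the aggregate locality measure alpha_tilde,
# computed from the combined pair (f_p, s_p), never exceeds the convex
# combination alpha_hat of the individual measures.

gamma = 0.1
fx = 1.0                                   # f(x^k)
f_j = [0.7, 1.2, 0.9]                      # linearization values f_j^k
s_j = [0.5, 1.0, 0.2]                      # distance measures s_j^k
lam = [0.5, 0.3, 0.2]                      # dual multipliers, summing to 1
alpha = lambda f, s: max(abs(fx - f), gamma * s * s)

f_p = sum(l * f for l, f in zip(lam, f_j))      # aggregate linearization
s_p = sum(l * s for l, s in zip(lam, s_j))      # aggregate distance
alpha_tilde = alpha(f_p, s_p)
alpha_hat = sum(l * alpha(f, s) for l, f, s in zip(lam, f_j, s_j))

p = 0.4                                    # norm of a sample aggregate p^k
w = 0.5 * p * p + alpha_tilde              # w^k
w_hat = 0.5 * p * p + alpha_hat            # w-hat^k of (4.15a)
assert alpha_tilde <= alpha_hat + 1e-12    # (4.16)
assert w <= w_hat + 1e-12                  # (4.17)
print(alpha_tilde, alpha_hat)
```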

We conclude from the above lemma that in the convex case the
variables involved in line searches and the search direction finding
subproblems satisfy relations analogous to those developed for the
algorithms in Chapter 2.
Returning to relation (3.22), we see that (3.22) follows from
(4.18), (4.19) and the fact that $w^k$ is always positive at line searches.
Note that for nonconvex $f$ our estimate $v^k$ of the derivative of $f$ at $x^k$
in the direction $d^k$ can be less optimistic than the primal estimate $\hat v^k$,
since $\hat v^k \le v^k$ for all $k$. Thus $v^k$ is always negative, hence the criterion
(3.9) with $m_L > 0$ and $t_L^k \ge 0$ ensures that the sequence $\{f(x^k)\}$ is
nonincreasing and $f(x^{k+1}) < f(x^k)$ if $x^{k+1} \ne x^k$.
Consider the following condition for some fixed point $\bar x \in R^N$:
there exists an infinite set $K \subset \{1, 2, \dots\}$ satisfying $x^k \to_K \bar x$. $\qquad$ (4.21)

Our aim is to show that $w^k \to_{\bar K} 0$ for some set $\bar K \subset K$. In Chapter 2 this
was done by considering first the case of an infinite number of serious
steps, where the line search criterion (3.9) and the fact that
$t_L^k \ge \tilde t > 0$ for each serious step yielded the desired conclusion. Since in
Algorithm 3.1 one can have arbitrarily small serious stepsizes $t_L^k > 0$,
the same argument cannot be applied here. However, we can still analyze
"long" serious steps as follows.

Lemma 4.9. (i) If (4.21) holds then
$$ f(x^k) \downarrow f(\bar x) \quad \text{as } k \to \infty, \tag{4.22} $$
$$ t_L^k v^k \to 0 \quad \text{as } k \to \infty. \tag{4.23} $$
(ii) If (4.21) is fulfilled and there exist a number $\bar t > 0$ and an
infinite set $\bar K \subset K$ such that $t_L^k \ge \bar t$ for all $k \in \bar K$, then (4.14) holds.

Proof. (i) (4.21) and the continuity of $f$ yield $f(x^k) \to_K f(\bar x)$, so (4.22)
follows from the monotonicity of $\{f(x^k)\}$. Since we always have $v^k < 0$,
$m_L \in (0,1)$ and $t_L^k \ge 0$, we obtain from (3.9)
$$ 0 \le -t_L^k v^k \le [f(x^k) - f(x^{k+1})]/m_L \quad \text{for all } k, $$
which yields (4.23) in virtue of (4.22).
(ii) If $x^k \to_{\bar K} \bar x$ and $t_L^k \ge \bar t > 0$ for all $k \in \bar K$, then (4.23) yields
$v^k \to_{\bar K} 0$, and $w^k \to_{\bar K} 0$ by (4.18). Thus $x^k \to_{\bar K} \bar x$ and $w^k \to_{\bar K} 0$, which
implies (4.14). $\square$

Corollary 4.10. Suppose (4.21) is satisfied, but (4.13) does not hold,
i.e.
$$ \liminf_{k \to \infty}\ \max\{w^k,\ |\bar x - x^k|\} \ge \bar\varepsilon > 0 \tag{4.24} $$
for some $\bar\varepsilon$. Then $t_L^k \to_K 0$.

Proof. Since we always have $t_L^k \ge 0$, the desired conclusion follows from
Lemma 4.9 and the equivalence of (4.13) and (4.14). $\square$

In view of the above results, it remains to consider the case of
arbitrarily short serious and null steps, i.e. $t_L^k \to_K 0$. Recall that in
Chapter 2 this case was equivalent to having $t_L^k = 0$ for all sufficiently
large $k$, and was analyzed by showing that the optimal value of the
dual search direction finding subproblem decreases after a null step
owing to line search requirements of the form (3.10b). Proceeding along
similar lines, we shall now show that the stationarity measure $w^k$
decreases whenever the algorithm cannot obtain a significant improvement
in the objective value, i.e. after a null step or a short serious step.

Lemma 4.11. Suppose that $t_L^{k-1} < \tilde t$ and $r_a^k = 0$ for some $k > 1$. Then
$$ w^k \le \hat w^k \le \varphi_C(w^{k-1}) + |\alpha_p^k - \tilde\alpha_p^{k-1}|, \tag{4.25} $$
where the function $\varphi_C$ is defined (for the fixed value of the line search
parameter $m_R \in (0,1)$) by
$$ \varphi_C(t) = t - (1 - m_R)^2 t^2 / (8 C^2), \tag{4.26} $$
and $C$ is any number satisfying
$$ \max\{|p^{k-1}|,\ |g^k|,\ \tilde\alpha_p^{k-1},\ 1\} \le C. \tag{4.27} $$

Proof. (i) Observe that $k > 1$, $t_L^{k-1} < \tilde t$ and the line search rule (3.10b)
yield $-\alpha(x^k, y^k) + \langle g_f(y^k), d^{k-1} \rangle \ge m_R v^{k-1}$, so
$$ -\alpha_k^k + \langle g^k, d^{k-1} \rangle \ge m_R v^{k-1}. \tag{4.28} $$
(ii) Define the multipliers
$$ \lambda_k(\nu) = \nu, \quad \lambda_j(\nu) = 0 \ \text{ for } j \in J^k \setminus \{k\}, \quad \lambda_p(\nu) = 1 - \nu $$
for each $\nu \in [0,1]$. Since $r_a^k = 0$ by assumption and $k \in J^k$ by (3.29), the
above multipliers are feasible for the $k$-th dual subproblem (3.17).
Therefore, reasoning as in the proof of Lemma 2.4.11, we obtain
$$ \hat w^k \le \min\{\tfrac{1}{2}|(1-\nu)p^{k-1} + \nu g^k|^2 + (1-\nu)\alpha_p^k + \nu \alpha_k^k:\ \nu \in [0,1]\} \le $$
$$ \le \min\{\tfrac{1}{2}|(1-\nu)p^{k-1} + \nu g^k|^2 + (1-\nu)\tilde\alpha_p^{k-1} + \nu \alpha_k^k:\ \nu \in [0,1]\} + |\alpha_p^k - \tilde\alpha_p^{k-1}|. $$
Then one may use (4.28) and the various definitions of the algorithm
to obtain the desired conclusion by invoking Lemma 2.4.10 as in the
proof of Lemma 2.4.11. $\square$

Observe that for a null step ($t_L^{k-1} = 0$) the above lemma reduces to
Lemma 2.4.11, since then $\alpha_p^k = \tilde\alpha_p^{k-1}$. In this case $w^k$ is a fraction of
$w^{k-1}$. For short serious steps the rate of decrease of $w^k$ depends on the
value of $|\alpha_p^k - \tilde\alpha_p^{k-1}|$ and the following properties of the function $\varphi_C$.
Note that in fact $\varphi_C$ depends on the value of $m_R \in (0,1)$, which is fixed
in our analysis.

Lemma 4.12. For any $\varepsilon_w > 0$ and $C > 0$ there exist numbers
$\varepsilon_\alpha = \varepsilon_\alpha(\varepsilon_w, C) > 0$ and $\bar N = \bar N(\varepsilon_w, C) \ge 1$ such that for any sequence of
numbers $\{t^i\}$ satisfying
$$ 0 \le t^{i+1} \le \varphi_C(t^i) + \varepsilon_\alpha \ \text{ for } i \ge 1, \qquad 0 \le t^1 \le 4C^2, \tag{4.29} $$
one has $t^i < \varepsilon_w$ for all $i \ge \bar N$.

Proof. For any $\varepsilon_\alpha > 0$ define the number $t(\varepsilon_\alpha)$ by $t(\varepsilon_\alpha) = \varphi_C(t(\varepsilon_\alpha)) + \varepsilon_\alpha$,
and observe that $\varphi_C(t) + \varepsilon_\alpha < t$ for any $t > t(\varepsilon_\alpha)$. Then it is easy to show
that $\limsup_{i \to \infty} t^i \le t(\varepsilon_\alpha)$ for any sequence $\{t^i\}$ satisfying (4.29), because
the function $\varphi_C(\cdot) + \varepsilon_\alpha$ is continuous. Define the sequence $\bar t^1 = 4C^2$,
$\bar t^{i+1} = \varphi_C(\bar t^i) + \varepsilon_\alpha$ for $i \ge 1$. Clearly, $\limsup \bar t^i \le t(\varepsilon_\alpha)$ and for any
sequence $\{t^i\}$ satisfying (4.29) we have $t^i \le \bar t^i$ if $\bar t^i \ge t(\varepsilon_\alpha)$, for all $i$.
Then the desired conclusion follows from the fact that $t(\varepsilon_\alpha) \downarrow 0$ as $\varepsilon_\alpha \downarrow 0$. $\square$

We conclude from Lemma 4.11 and Lemma 4.12 that $w^k$ will become
arbitrarily small, i.e. $w^k < \varepsilon_w$ for any fixed $\varepsilon_w > 0$, provided that for
sufficiently many $\bar N = \bar N(\varepsilon_w, C)$ consecutive iterations a local bound of
the form (4.27) is valid, we have sufficiently small $|\alpha_p^k - \tilde\alpha_p^{k-1}| \le \varepsilon_\alpha$
and $t_L^{k-1} < \tilde t$, and no reset occurs. These properties will be established
by the following four lemmas.
A locally uniform bound of the form (4.27) will result from the
following lemma, which gives the reason for the line search rule (3.11).

Lemma 4.13. (i) For each $k \ge 1$
$$ \max\{|p^k|,\ \tilde\alpha_p^k\} \le \max\{\tfrac{1}{2}|g^k|^2 + \alpha_k^k,\ (|g^k|^2 + 2\alpha_k^k)^{1/2}\}. \tag{4.30} $$
(ii) Suppose $\bar x \in R^N$, $B = \{y \in R^N:\ |\bar x - y| \le 2\bar a\}$, where $\bar a > 0$ is the line
search parameter involved in (3.11), and let
$$ C_g = \sup\{|g_f(y)|:\ y \in B\}, \tag{4.31a} $$

$$ C_\alpha = \sup\{\alpha(x, y):\ x \in B,\ y \in B\}, \tag{4.31b} $$
$$ C = \max\{\tfrac{1}{2} C_g^2 + C_\alpha,\ (C_g^2 + 2 C_\alpha)^{1/2},\ 1\}. \tag{4.31c} $$
Then $C$ is finite and
$$ \max\{|p^k|,\ \tilde\alpha_p^k,\ |g^k|,\ 1\} \le C \quad \text{if } |x^k - \bar x| \le \bar a. \tag{4.32} $$

Proof. (i) Let $k \ge 1$ be fixed and define the multipliers
$$ \lambda_k = 1, \quad \lambda_j = 0 \ \text{ for } j \in J^k \setminus \{k\}, \quad \lambda_p = 0. $$
Since $k \in J^k$, the above multipliers are feasible for the $k$-th dual
subproblem (3.17). Therefore the optimal value $\hat w^k$ of (3.17) satisfies
$\hat w^k \le \tfrac{1}{2}|g^k|^2 + \alpha_k^k$, hence $\tfrac{1}{2}|p^k|^2 + \tilde\alpha_p^k = w^k \le \hat w^k \le \tfrac{1}{2}|g^k|^2 + \alpha_k^k$ and
(4.30) follows.
(ii) We deduce from the local boundedness of $\partial f$ (Lemma 1.2.1) that the
mappings $g_f(\cdot)$ and $\alpha(\cdot,\cdot)$ are bounded on the bounded sets $B$ and $B \times B$,
respectively. Therefore, the constants defined by (4.31) are finite.
If $|\bar x - x^k| \le \bar a$ then $|\bar x - y^k| \le |\bar x - x^k| + |x^k - y^k| \le 2\bar a$, because
$|y^k - x^k| \le \bar a$ by (3.11). Thus we have $x^k \in B$, $y^k \in B$, $g^k = g_f(y^k)$ and
$\alpha_k^k = \alpha(x^k, y^k)$, hence (4.32) follows from (4.30) and (4.31). $\square$
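The mechanism of Lemma 4.12 can be observed numerically: iterating $t^{i+1} = \varphi_C(t^i) + \varepsilon_\alpha$ with $\varphi_C$ of (4.26) from $t^1 = 4C^2$ drives the sequence below any target $\varepsilon_w$ exceeding the fixed point $t(\varepsilon_\alpha)$, in finitely many steps. A small sketch with parameters of our own choosing:

```python
# Illustration (our parameters) of the recursion behind Lemma 4.12.

m_R, C = 0.5, 1.0
phi = lambda t: t - (1 - m_R) ** 2 * t ** 2 / (8 * C ** 2)   # (4.26)

eps_w, eps_a = 0.5, 1e-4      # target threshold and perturbation
t = 4 * C ** 2                # worst-case start t^1 = 4*C**2
n = 0
while t >= eps_w:             # each step: t <- phi(t) + eps_a
    t = phi(t) + eps_a
    n += 1

assert 0 < t < eps_w          # the sequence fell below eps_w after n steps
print(n, t)
```

Here the fixed point $t(\varepsilon_\alpha)$ solves $t^2/32 = 10^{-4}$, i.e. $t(\varepsilon_\alpha) \approx 0.057 < \varepsilon_w$, so the loop must terminate; for larger $\varepsilon_\alpha$ the attainable level degrades, which is why (4.36c) controls $|\alpha_p^k - \tilde\alpha_p^{k-1}|$.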

Our next result will provide bounds on the term $|\alpha_p^k - \tilde\alpha_p^{k-1}|$ involved
in (4.25).

Lemma 4.14. Suppose (4.21) holds. Then
$$ \big|\,|f(x^{k+1}) - f_p^{k+1}| - |f(x^k) - \tilde f_p^k|\,\big| \to 0 \quad \text{as } k \to \infty. \tag{4.33} $$

Proof. For any $k$
$$ \big|\,|f(x^{k+1}) - f_p^{k+1}| - |f(x^k) - \tilde f_p^k|\,\big| \le |f(x^{k+1}) - f_p^{k+1} - f(x^k) + \tilde f_p^k| \le $$
$$ \le |f(x^{k+1}) - f(x^k)| + |f_p^{k+1} - \tilde f_p^k| \le |f(x^{k+1}) - f(x^k)| + |\langle p^k, x^{k+1} - x^k \rangle| \tag{4.34a} $$
from (3.13c). Next, since $t_L^k \ge 0$ and $x^{k+1} = x^k + t_L^k d^k = x^k - t_L^k p^k$ (see
(3.18)) for all $k$, we obtain from (3.6)
$$ 0 \le \langle x^k - x^{k+1}, p^k \rangle = t_L^k |p^k|^2 \le -t_L^k v^k. \tag{4.34b} $$
Combining (4.34) with Lemma 4.9(i) we establish (4.33). $\square$

The following lemma will imply that $t_L^{k-1} < \tilde t$ for sufficiently many
consecutive iterations.

Lemma 4.15. Suppose (4.21) and (4.24) hold. Then for any fixed integer
$m \ge 0$ there exists a number $k_m$ such that for any integer $n \in [0, m]$
$$ x^{k+n} \to \bar x \quad \text{as } k \to \infty,\ k \in K, \tag{4.35a} $$
$$ w^{k+n} \ge \bar\varepsilon/2 \quad \text{for all } k \ge k_m,\ k \in K, \tag{4.35b} $$
$$ t_L^{k+n} \to 0 \quad \text{as } k \to \infty,\ k \in K. \tag{4.35c} $$
Moreover, for any numbers $\bar k$, $\bar N$ and $\varepsilon_\alpha > 0$ there exists a number $\hat k \ge \bar k$,
$\hat k \in K$, such that
$$ w^k \ge \bar\varepsilon/2 \quad \text{for } \hat k \le k \le \hat k + \bar N, \tag{4.36a} $$
$$ \max\{|p^{k-1}|,\ |g^k|,\ \tilde\alpha_p^{k-1},\ 1\} \le C \quad \text{for } \hat k < k \le \hat k + \bar N, \tag{4.36b} $$
$$ |\alpha_p^k - \tilde\alpha_p^{k-1}| \le \varepsilon_\alpha \quad \text{for } \hat k < k \le \hat k + \bar N, \tag{4.36c} $$
$$ t_L^k < \tilde t \quad \text{for } \hat k \le k \le \hat k + \bar N, \tag{4.36d} $$
where $C$ is the constant defined in Lemma 4.13.

Proof. (i) We shall first establish (4.35). For $m = n = 0$, (4.35a) follows
from our assumption (4.21). Suppose that (4.35a) holds for some fixed
$n \ge 0$. From (4.35a) and (4.24) we deduce the existence of a number $k_n$
such that
$$ w^{k+n} \ge \bar\varepsilon/2 \quad \text{for all } k \ge k_n,\ k \in K. \tag{4.37a} $$
(4.35c) follows from (4.35a), (4.24) and Corollary 4.10. Using (4.35a)
and Lemma 4.13, we deduce that $|p^{k+n}| \le C$ for all large $k \in K$. Then
$$ |x^{k+n+1} - x^{k+n}| = t_L^{k+n} |d^{k+n}| = t_L^{k+n} |p^{k+n}| \le C\, t_L^{k+n} \tag{4.37b} $$
for all $k \in K$, since we always have $d^k = -p^k$ by (3.18). (4.37b) and
(4.35c) yield $|x^{k+n+1} - x^{k+n}| \to 0$ as $k \to \infty$, $k \in K$, hence (4.35a) implies
$x^{k+n+1} \to \bar x$ as $k \to \infty$, $k \in K$. Thus (4.35a) holds for $n$ increased by 1.
Therefore one can repeat the above reasoning for all $n \in [0, m]$. Setting
$k_m = \max\{k_n:\ n \in [0, m]\}$ (see (4.37a)), we complete the proof of (4.35).
(ii) If $\varepsilon_\alpha > 0$ then (4.21) and Lemma 4.14 imply the existence of a
number $k_1 \ge \bar k$ satisfying
$$ \big|\,|f(x^{k+1}) - f_p^{k+1}| - |f(x^k) - \tilde f_p^k|\,\big| \le \varepsilon_\alpha/2 \quad \text{for all } k \ge k_1. \tag{4.38a} $$
Next, by (3.14c) and (3.5), we always have
$$ \gamma (s_p^{k+1})^2 = \gamma (\tilde s_p^k + |x^{k+1} - x^k|)^2 \le $$
$$ \le \gamma (\tilde s_p^k)^2 + 2\gamma \tilde s_p^k |x^{k+1} - x^k| + \gamma |x^{k+1} - x^k|^2 \le $$
$$ \le \gamma (\tilde s_p^k)^2 + 2(\gamma C)^{1/2} |x^{k+1} - x^k| + \gamma |x^{k+1} - x^k|^2, $$
hence
$$ \gamma (s_p^{k+1})^2 - \gamma (\tilde s_p^k)^2 \le |x^{k+1} - x^k| \big( 2(\gamma C)^{1/2} + \gamma |x^{k+1} - x^k| \big) \tag{4.38b} $$
if $|x^k - \bar x| \le \bar a$, by (4.32). It follows from (3.2), (3.5) and (4.38) that
$$ |\alpha_p^{k+1} - \tilde\alpha_p^k| \le \varepsilon_\alpha/2 + |x^{k+1} - x^k| \big( 2(\gamma C)^{1/2} + \gamma |x^{k+1} - x^k| \big) \tag{4.39} $$
for any $k \ge k_1$ such that $|\bar x - x^k| \le \bar a$. Using (4.32) and (4.39), we deduce
from the first part of the lemma the existence of $\hat k \ge k_1$ satisfying
(4.36). $\square$

We can now prove the principal result of this section.

Lemma 4.16. Suppose that (4.21) holds. Then (4.13) is satisfied.

Proof. For purposes of a proof by contradiction, assume that (4.13) does
not hold, i.e. (4.24) is satisfied for some $\bar\varepsilon > 0$.
(i) Let $\varepsilon_w = \bar\varepsilon/2 > 0$ and choose $\varepsilon_\alpha = \varepsilon_\alpha(\varepsilon_w, C)$ and $\bar N = \bar N(\varepsilon_w, C) < +\infty$,
$\bar N \ge 1$, as specified in Lemma 4.12, where $C$ is the constant defined in
Lemma 4.13.
(ii) Let $\hat N = 10\bar N(\varepsilon_w, C)$. Using Lemma 4.15 and the fact that $\bar\varepsilon > 0$ by
assumption, we can choose $\hat k$ satisfying (4.36) and
$$ \sum_{k=\hat k}^{\hat k + \hat N} |x^{k+1} - x^k| \le \bar a/4. \tag{4.40} $$
(iii) Suppose that there exists a number $\tilde k$ satisfying
$$ \hat k \le \tilde k \le \hat k + \hat N - 2\bar N, \tag{4.41a} $$
$$ r_a^k = 0 \quad \text{for all } k \in [\tilde k, \tilde k + \bar N]. \tag{4.41b} $$
Then (4.36b)-(4.36d), (4.41b), Lemma 4.11, Lemma 4.12 and our choice
of $\varepsilon_\alpha$ and $\bar N$ imply $w^k < \varepsilon_w = \bar\varepsilon/2$ for some $k \in [\tilde k, \tilde k + \bar N]$, which
contradicts (4.36a) and (4.41a). Consequently, we have shown that for any
number $\tilde k$ satisfying (4.41a) we have
$$ r_a^k = 1 \quad \text{for some } k \in [\tilde k, \tilde k + \bar N]. $$
(iv) Letting $\tilde k = \hat k$ we obtain from part (iii) of the proof that $r_a^{k_1} = 1$
for some $k_1 \in [\hat k, \hat k + \bar N]$, and the rules of Step 5 and Step 6 yield
$$ a^{k_1} \le \bar a/2. \tag{4.42} $$
(v) Since we always have $|y^{k+1} - x^{k+1}| \le \bar a/2$, at Step 5
$$ a^{k+1} = \max\{a^k + |x^{k+1} - x^k|,\ |y^{k+1} - x^{k+1}|\} \le \max\{a^k + |x^{k+1} - x^k|,\ \bar a/2\}. \tag{4.43} $$
The above estimate, (4.40) and (4.42) yield $a^k \le \tfrac{3}{4}\bar a < \bar a$ for all
$k \in (k_1, \hat k + \hat N]$, hence for such $k$ no resetting due to $a^k > \bar a$ occurs, i.e.
$$ r_a^k = 0 \quad \text{for } k = k_1 + 1, \dots, k_1 + 1 + \bar N. \tag{4.44} $$
(vi) Since $\tilde k = k_1 + 1$ satisfies (4.41a), from part (iii) of the proof we
deduce a contradiction with (4.44). Therefore, (4.13) must hold. $\square$

Combining Lemma 4.16 with Lemma 4.7, we obtain

Theorem 4.17. Each accumulation point of the sequence $\{x^k\}$ generated
by Algorithm 3.1 is stationary for $f$.

In the convex case, the above result can be strengthened as follows.

Theorem 4.18. If $f$ is convex then Algorithm 3.1 constructs a minimizing
sequence $\{x^k\}$: $f(x^k) \downarrow \inf\{f(x):\ x \in R^N\}$. Moreover, if $f$ attains its
minimum value, then $\{x^k\}$ converges to a minimum point of $f$.

Proof. One can check that Lemma 2.4.7 and Lemma 2.4.14 hold also for
Algorithm 3.1 in the convex case. Then Theorem 4.17 and the proofs of
Theorem 2.4.15 and Theorem 2.4.16 yield the desired result. $\square$

The following result substantiates the stopping criterion of the
method.

Corollary 4.19. If the level set $S = \{x \in R^N:\ f(x) \le f(x^1)\}$ is bounded and
the final accuracy tolerance $\varepsilon_s$ is positive, then Algorithm 3.1
terminates in a finite number of iterations.

Proof. If the assertion were false, then the infinite sequence $\{x^k\} \subset S$
would have an accumulation point, say $\bar x$. Then Lemma 4.16 would yield
(4.14), and the algorithm would stop owing to $w^k \le \varepsilon_s$ for large $k$. $\square$

Remark 4.20. It is worth observing that our convergence analysis does
not use explicitly the semismoothness hypothesis (3.23). In fact, it
remains valid when a procedure different from Line Search Procedure 3.2
is used for finding stepsizes satisfying the line search requirements
(3.9)-(3.11) at each iteration of the method. For instance, one may use
expansion in Line Search Procedure 3.2 by repeating its step (ii) with a
doubled $t$ if $t = t_L$ until $t$ becomes $t_U$, i.e. by increasing the initial
stepsize if this yields significantly smaller objective values; see
(Mifflin, 1982). However, one should ensure boundedness of $\{t_L^k\}$, since
it is necessary for the proof of Theorem 4.18 in the convex case. For
this reason we had $t_L^k \le 1$ for all $k$ in Algorithm 3.1, although this
bound may be larger. Also one may delete the criterion (3.10a).
5. The Algorithm with Subgradient Selection

In this section we state in detail and analyze the method for
nonconvex minimization that uses subgradient selection in the way
described in Section 2.

Algorithm 5.1.

Step 0 (Initialization). Select the starting point $x^1 \in R^N$ and a final
accuracy parameter $\varepsilon_s \ge 0$. Choose fixed positive line search parameters
$m_L$, $m_R$, $\bar a$ and $\tilde t$ satisfying $0 < m_L < m_R < 1$, and a distance measure
parameter $\gamma > 0$ ($\gamma = 0$ if $f$ is convex). Set $a^1 = 0$, $J^1 = \{1\}$ and
$$ y^1 = x^1, \quad g^1 = g_f(y^1), \quad f_1^1 = f(y^1), \quad s_1^1 = 0. $$
Set the counter $k = 1$.

Step 1 (Direction finding). Find the solution $(d^k, \hat v^k)$ to the following
$k$-th quadratic programming problem:
$$ \text{minimize } \tfrac{1}{2}|d|^2 + v \ \text{ over } (d, v) \in R^{N+1}, $$
$$ \text{subject to } -\alpha_j^k + \langle g^j, d \rangle \le v, \quad j \in J^k, \tag{5.1} $$
where
$$ \alpha_j^k = \max\{|f(x^k) - f_j^k|,\ \gamma(s_j^k)^2\} \quad \text{for } j \in J^k. \tag{5.2} $$
Find Lagrange multipliers $\lambda_j^k$, $j \in J^k$, of (5.1) and a set $\hat J^k$ satisfying
$$ \hat J^k = \{j \in J^k:\ \lambda_j^k \ne 0\}, \tag{5.3a} $$
$$ |\hat J^k| \le N + 1. \tag{5.3b} $$

Step 2 (Stopping criterion). Set
$$ \hat\alpha_p^k = \sum_{j \in J^k} \lambda_j^k \alpha_j^k, \tag{5.4} $$
$$ \hat w^k = \tfrac{1}{2}|d^k|^2 + \hat\alpha_p^k. \tag{5.5} $$
If $\hat w^k \le \varepsilon_s$ then terminate. Otherwise, go to Step 3.

Step 3 (Line search). By a line search procedure as discussed below,
find two stepsizes $t_L^k$ and $t_R^k$ such that $0 \le t_L^k \le t_R^k$ and such that the
two corresponding points defined by
$$ x^{k+1} = x^k + t_L^k d^k \quad \text{and} \quad y^{k+1} = x^k + t_R^k d^k $$
satisfy $t_L^k \le 1$ and
$$ f(x^{k+1}) \le f(x^k) + m_L t_L^k \hat v^k, \tag{5.6a} $$
$$ t_R^k = t_L^k \quad \text{if } t_L^k \ge \tilde t, \tag{5.6b} $$
$$ -\alpha(x^{k+1}, y^{k+1}) + \langle g_f(y^{k+1}), d^k \rangle \ge m_R \hat v^k \quad \text{if } t_L^k < \tilde t, \tag{5.6c} $$
$$ |y^{k+1} - x^{k+1}| \le \bar a/2. \tag{5.6d} $$

Step 4 (Subgradient updating). Set
$$ J^{k+1} = \hat J^k \cup \{k+1\}. \tag{5.7} $$
Set $g^{k+1} = g_f(y^{k+1})$ and
$$ f_{k+1}^{k+1} = f(y^{k+1}) + \langle g^{k+1}, x^{k+1} - y^{k+1} \rangle, \tag{5.8a} $$
$$ f_j^{k+1} = f_j^k + \langle g^j, x^{k+1} - x^k \rangle \quad \text{for } j \in \hat J^k, \tag{5.8b} $$
$$ s_{k+1}^{k+1} = |y^{k+1} - x^{k+1}|, \tag{5.8c} $$
$$ s_j^{k+1} = s_j^k + |x^{k+1} - x^k| \quad \text{for } j \in \hat J^k. \tag{5.8d} $$

Step 5 (Distance resetting test). Set
$$ a^{k+1} = \max\{s_j^{k+1}:\ j \in J^{k+1}\}. \tag{5.9} $$
If $a^{k+1} \le \bar a$ then set $r_a^{k+1} = 0$ and go to Step 7. Otherwise, set $r_a^{k+1} = 1$
and go to Step 6.

Step 6 (Distance resetting). Keep deleting from $J^{k+1}$ the smallest
indices until the reset value of $a^{k+1}$ satisfies
$$ a^{k+1} = \max\{s_j^{k+1}:\ j \in J^{k+1}\} \le \bar a/2. \tag{5.10} $$

Step 7. Increase $k$ by 1 and go to Step 1.
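For a concrete feel for Step 1, consider $f(x) = |x|$ on $R$ (our own toy instance, not an example from the book). With two bundle elements the dual (5.11) is one-dimensional in the weight put on the first subgradient, and a crude grid search suffices to recover $d^k = -p^k$, $\hat v^k$ and $\hat w^k$ of (5.12)-(5.13) and (5.5):

```python
# Toy direction-finding step in the spirit of Algorithm 5.1 (sketch only;
# a production code would solve (5.11) with a proper QP solver).

f = abs
xk = 0.1
bundle = [(1.0, 1.0), (-0.5, -1.0)]      # pairs (y^j, g^j), g^j in df(y^j)
gamma = 0.0                              # f is convex, so gamma = 0

fj = [f(y) + g * (xk - y) for y, g in bundle]          # linearizations f_j^k
sj = [abs(xk - y) for y, g in bundle]                  # distances s_j^k
alph = [max(abs(f(xk) - fv), gamma * s * s)            # measures (5.2)
        for fv, s in zip(fj, sj)]

def dual_obj(lam):                        # lam = weight on bundle[0]
    p = lam * bundle[0][1] + (1 - lam) * bundle[1][1]
    return 0.5 * p * p + lam * alph[0] + (1 - lam) * alph[1]

lam = min((i / 1000 for i in range(1001)), key=dual_obj)   # grid-solve (5.11)
p = lam * bundle[0][1] + (1 - lam) * bundle[1][1]
alpha_hat = lam * alph[0] + (1 - lam) * alph[1]
d = -p                                    # (5.12)
v_hat = -(p * p + alpha_hat)              # (5.13)
w_hat = 0.5 * p * p + alpha_hat           # (5.5)

assert abs(lam - 0.55) < 1e-9 and abs(d + 0.1) < 1e-9
assert v_hat < 0 <= w_hat                 # line search entered with v_hat < 0
print(d, v_hat, w_hat)
```

The optimal weights balance the two opposing subgradients almost evenly, so $|p^k|$ is small; this is exactly how the stationarity measure $\hat w^k$ detects proximity to the kink of $|x|$ at $0$.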

We shall now comment on relations between the above method and
Algorithm 3.1.
By Lemma 2.2.1, the $k$-th subproblem dual to (5.1) is to find values
of the multipliers $\lambda_j^k$, $j \in J^k$, that
$$ \text{minimize } \tfrac{1}{2}\Big| \sum_{j \in J^k} \lambda_j g^j \Big|^2 + \sum_{j \in J^k} \lambda_j \alpha_j^k, $$
$$ \text{subject to } \lambda_j \ge 0,\ j \in J^k, \quad \sum_{j \in J^k} \lambda_j = 1. \tag{5.11} $$
Any solution of (5.11) is a Lagrange multiplier vector for (5.1) and it
yields the unique solution $(d^k, \hat v^k)$ of (5.1) as follows:
$$ d^k = -p^k, \tag{5.12} $$
$$ \hat v^k = -\Big\{ |p^k|^2 + \sum_{j \in J^k} \lambda_j^k \alpha_j^k \Big\}, \tag{5.13} $$
where
$$ p^k = \sum_{j \in J^k} \lambda_j^k g^j. \tag{5.14} $$
Moreover, any Lagrange multipliers of (5.1) also solve (5.11). In
particular, we have
$$ \lambda_j^k \ge 0 \ \text{ for } j \in J^k, \qquad \sum_{j \in J^k} \lambda_j^k = 1. \tag{5.15} $$
Thus we see that, as far as the search direction finding is concerned,
the above relations (5.11)-(5.15) can be obtained from the corresponding
relations developed for Algorithm 3.1 in Section 3 by setting
$$ \lambda_p^k = 0 \quad \text{for all } k. \tag{5.16} $$
This corresponds to deleting the last constraint of the $k$-th primal
search direction finding subproblem (3.1), and thus reducing subproblem
(3.1) to subproblem (5.1).
We refer the reader to Remark 2.5.2 for a discussion of the possible
ways of finding the $k$-th Lagrange multipliers satisfying the
requirement (5.3).
The stopping criterion in Step 2 can be interpreted similarly to
the termination rule of Algorithm 3.1. A slight difference between the
two stopping criteria arises from the fact that the values of $\hat\alpha_p^k$ and
$\hat w^k$ can be larger than the values of the variables $\tilde\alpha_p^k$ and $w^k$ defined by
$$ (\tilde f_p^k, \tilde s_p^k) = \sum_{j \in J^k} \lambda_j^k (f_j^k, s_j^k), \tag{5.17} $$
$$ \tilde\alpha_p^k = \max\{|f(x^k) - \tilde f_p^k|,\ \gamma(\tilde s_p^k)^2\}, \tag{5.18} $$
$$ w^k = \tfrac{1}{2}|p^k|^2 + \tilde\alpha_p^k. \tag{5.19} $$
To see this, note that, by (5.5) and (5.12), we always have
$$ \hat w^k = \tfrac{1}{2}|p^k|^2 + \hat\alpha_p^k, \tag{5.20} $$
hence one can use the proof of Lemma 4.8 and the convention (5.16) to
show that
$$ \tilde\alpha_p^k \le \hat\alpha_p^k, \tag{5.21} $$
$$ w^k \le \hat w^k, \tag{5.22} $$
and that the above variables are nonnegative. Thus both $w^k$ and $\hat w^k$
can be regarded as stationarity measures of $x^k$; see Section 3. We also
have the optimality estimate (3.21) upon termination if $f$ is convex.
The line search rules (5.6) differ from the rules (3.9)-(3.11)
inasmuch as the value of $\hat v^k$ can be lower than the value of the variable
$$ v^k = -\{|p^k|^2 + \tilde\alpha_p^k\}, \tag{5.23} $$
which corresponds to the construction used in Algorithm 3.1. By (5.13),
(5.4), (5.21) and (5.23), we have
$$ \hat v^k = -\{|p^k|^2 + \hat\alpha_p^k\}, \tag{5.24} $$
$$ \hat v^k \le v^k. \tag{5.25} $$
Also $\hat v^k \le -\hat w^k$ from (5.20) and (5.24), so the line search is always
entered with negative $\hat v^k$. As observed in Section 2,
$$ \hat v^k = \hat f^k(x^k + d^k) - f(x^k) $$
is an approximation to the derivative of $f$ at $x^k$ in the direction $d^k$.
Thus, except for the difference in the values of $v^k$ and $\hat v^k$, the line
search criteria (5.6) may be interpreted essentially as in Section 3.
We may add that for implementing Step 3 of Algorithm 5.1 one can use
Line Search Procedure 3.2 with $v^k$ replaced by $\hat v^k$.
In Algorithm 5.1 the locality radius $a^{k+1}$ is calculated directly
via (5.9) and (5.10), instead of using the recursive formulae (3.7)
and (3.15). We also observe that the subgradient deletion rules ensure
that
$$ k \in J^k \quad \text{for all } k, \tag{5.26} $$
as in Section 3, i.e. the latest subgradient is always used for the
current search direction finding.

Remark 5.2. The requirement of Algorithm 5.1 that at any iteration at
most $N+1$ past subgradients should be retained for the next search
direction finding can be modified as follows. Let $M_g \ge N+2$ denote the
maximum number of subgradients that the algorithm may store. Then one
may choose Lagrange multipliers $\lambda_j^k$ and a set $\hat J^k$ subject only to the
following requirement
$$ \hat J^k = \{j \in J^k:\ \lambda_j^k \ne 0\} \quad \text{and} \quad |\hat J^k| \le M_g - 1, $$
and set $J^{k+1} = \hat J^k \cup \{k+1\}$, for all $k$. Such a choice is always possible
if $M_g - 1 \ge N+1$; cf. Lemma 2.2.1 and Remark 2.5.2. It will be seen that
such modifications do not impair the subsequent convergence results.

We now pass to convergence analysis. Global convergence of Algorithm
5.1 can be established by modifying the results of Section 4. To
this end one may proceed as in Section 2.5, where the convergence of
the method with subgradient selection was deduced from the results on
the convergence of the corresponding method with subgradient
aggregation. Therefore we shall give only an outline of the required results.
In the proof of Lemma 4.1 for Algorithm 5.1, observe that no
induction is needed, since
$$ \hat J_p^k = J^k \quad \text{for all } k $$
by (4.2) and (5.16), hence we may let $\tilde\lambda_j^k = \lambda_j^k$ for all $j \in J^k$ to
obtain (4.3) from (5.14), (5.17) and (5.15). Also relations (4.4) and
(4.5) follow immediately from (5.9) and (5.10).
The proofs of Lemma 4.2 through Lemma 4.8 require no modifications
if one uses the simplifying convention (5.16). Also it can be checked
that the assertions of Lemma 4.7 and Lemma 4.8 remain valid upon
replacing $\tilde\alpha_p^k$ and $w^k$ by $\hat\alpha_p^k$ and $\hat w^k$; see (5.21)-(5.22).
In the formulation of Lemma 4.9, substitute relation (4.23) by
the following:
$$ t_L^k \hat v^k \to 0 \quad \text{as } k \to \infty, $$

while in the proof one can refer to (5.6a) instead of (3.9), replace
$v^k$ by $\hat v^k$ and use (5.25). Of course, Corollary 4.10 remains valid,
even if one replaces $w^k$ by $\hat w^k$ in (4.13), (4.14) and (4.24).
Lemma 4.11 is replaced by

Lemma 5.3. Suppose that $t_L^{k-1} < \tilde t$ and $r_a^k = 0$ for some $k > 1$. Then
$$ w^k \le \hat w^k \le \varphi_C(\hat w^{k-1}) + \Delta_\alpha^k, \tag{5.27} $$
where $\varphi_C$ is defined by (4.26) for any $C$ satisfying
$$ \max\{|p^{k-1}|,\ |g^k|,\ \hat\alpha_p^{k-1},\ 1\} \le C, \tag{5.28} $$
and
$$ \Delta_\alpha^k = \Big| \sum_{j \in \hat J^{k-1}} \lambda_j^{k-1} \big( \alpha_j^k - \alpha_j^{k-1} \big) \Big|. \tag{5.29} $$

Proof. (i) If $k > 1$ and $t_L^{k-1} < \tilde t$ then the line search rule (5.6c)
yields $-\alpha(x^k, y^k) + \langle g_f(y^k), d^{k-1} \rangle \ge m_R \hat v^{k-1}$, so
$$ -\alpha_k^k + \langle g^k, d^{k-1} \rangle \ge m_R \hat v^{k-1}. \tag{5.30} $$
(ii) Define the multipliers
$$ \lambda_k(\nu) = \nu, \quad \lambda_j(\nu) = (1 - \nu)\lambda_j^{k-1} \ \text{ for } j \in \hat J^{k-1}, \tag{5.31} $$
for each $\nu \in [0,1]$. By (5.14), (5.15), (5.4) and (5.3a), we have
$$ p^{k-1} = \sum_{j \in \hat J^{k-1}} \lambda_j^{k-1} g^j, \tag{5.32a} $$
$$ \lambda_j^{k-1} \ge 0 \ \text{ for } j \in \hat J^{k-1}, \qquad \sum_{j \in \hat J^{k-1}} \lambda_j^{k-1} = 1, \tag{5.32b} $$
$$ \hat\alpha_p^{k-1} = \sum_{j \in \hat J^{k-1}} \lambda_j^{k-1} \alpha_j^{k-1}. \tag{5.32c} $$
Since $r_a^k = 0$ by assumption, we have $J^k = \hat J^{k-1} \cup \{k\}$, so (5.32b) implies
that the multipliers defined by (5.31) satisfy the constraints of the
$k$-th dual subproblem (5.11) for each $\nu \in [0,1]$. Noting that $\hat w^k$ is
the optimal value of (5.11) (see (5.14), (5.4) and (5.20)), we deduce
from (5.31) and (5.32a) that
$$ \hat w^k \le \tfrac{1}{2}|(1-\nu)p^{k-1} + \nu g^k|^2 + (1-\nu) \sum_{j \in \hat J^{k-1}} \lambda_j^{k-1} \alpha_j^k + \nu \alpha_k^k $$
for all $\nu \in [0,1]$, so (5.32c) and (5.29) yield
$$ \hat w^k \le \tfrac{1}{2}|(1-\nu)p^{k-1} + \nu g^k|^2 + (1-\nu)\hat\alpha_p^{k-1} + \nu \alpha_k^k + \Delta_\alpha^k \tag{5.33} $$
for all $\nu \in [0,1]$. Using Lemma 2.4.10 and relations (5.12), (5.20),
(5.24), (5.30) and (5.28), we obtain
$$ \min\{\tfrac{1}{2}|(1-\nu)p^{k-1} + \nu g^k|^2 + (1-\nu)\hat\alpha_p^{k-1} + \nu \alpha_k^k:\ \nu \in [0,1]\} \le \varphi_C(\hat w^{k-1}). \tag{5.34} $$
Combining (5.33), (5.34) and (5.22) we obtain (5.27), as required. $\square$

We conclude from the above lemma that after a null step (or a short serious step) of Algorithm 5.1 one can expect a significant decrease of the stationarity measure ŵ^k, while in Algorithm 3.1 the same observation applies to the stationarity measure w^k; cf. (5.27) and (4.25). The rate of decrease of ŵ^k is established by Lemma 4.12 (with ŵ^k in place of w^k). As far as Lemma 4.13 is concerned, substitute α_p^k by α̂_p^k in relations (4.30) and (4.32), and use the fact that ŵ^k is the optimal value of the dual subproblem (5.11).

Recall that Lemma 4.14 was instrumental only in the proof of Lemma 4.15. Therefore it suffices now to consider the following substitute for Lemma 4.15.

Lemma 5.4. Suppose that (4.21) and (4.24) hold. Then the assertions of Lemma 4.15 are valid for Algorithm 5.1 if one replaces (4.36b)-(4.36c) by

max{ |p^{k-1}|, |g^k|, α̂_p^{k-1}, 1 } ≤ C   for k̄ ≤ k ≤ k̄ + N̄,   (5.35a)

A_s^k ≤ ε   for k̄ ≤ k ≤ k̄ + N̄.                                    (5.35b)

Proof. It is easily verified that we only need to prove the assertion concerning (5.35b). To this end, observe that

A_s^k = | Σ_{j∈J^{k-1}} λ_j^{k-1} α_j^k − Σ_{j∈J^{k-1}} λ_j^{k-1} α_j^{k-1} | = | Σ_{j∈J^{k-1}} λ_j^{k-1} (α_j^k − α_j^{k-1}) | ≤

≤ max{ |α_j^k − α_j^{k-1}| : j ∈ J^{k-1} } Σ_{j∈J^{k-1}} λ_j^{k-1},

i.e.

A_s^{k+1} ≤ max{ |α_j^{k+1} − α_j^k| : j ∈ J^k }   for all k,      (5.36)

from (5.15) and (5.3a). Next, for all k and j ∈ J^k we have

| |f(x^{k+1}) − f_j^{k+1}| − |f(x^k) − f_j^k| | ≤ | f(x^{k+1}) − f_j^{k+1} − f(x^k) + f_j^k | ≤

≤ |f(x^{k+1}) − f(x^k)| + | <g^j, x^{k+1} − x^k> | ≤

≤ |f(x^{k+1}) − f(x^k)| + |g_f(y^j)| |x^{k+1} − x^k|               (5.37a)

from (5.8b), while (5.8d) yields

γ(s_j^{k+1})² = γ(s_j^k + |x^{k+1} − x^k|)² ≤

≤ γ(s_j^k)² + 2γ s_j^k |x^{k+1} − x^k| + γ |x^{k+1} − x^k|² ≤

≤ γ(s_j^k)² + |x^{k+1} − x^k| (2γā + γ|x^{k+1} − x^k|)             (5.37b)

since s_j^k ≤ a^k ≤ ā. From (5.2) and (5.37),

|α_j^{k+1} − α_j^k| ≤ |f(x^{k+1}) − f(x^k)| + |x^{k+1} − x^k| ( |g_f(y^j)| + 2γā + γ|x^{k+1} − x^k| )   (5.38)

for all j ∈ J^k and k ≥ 1. Since |x^k − y^j| ≤ s_j^k ≤ a^k ≤ ā for all k and j ∈ J^k, from Lemma 4.13 we obtain that

max{ |g_f(y^j)| : j ∈ J^k } ≤ C   if |x̄ − x^k| ≤ ā, for all k,

and that |f(x^{k+1}) − f(x^k)| → 0 as k → ∞. Therefore one may complete the proof by using (5.36), (5.38) and the first assertion of Lemma 4.15.
In the proof of Lemma 4.16, substitute (4.43) with the following relation

a^{k+1} = max{ s_j^{k+1} : j ∈ J^{k+1} } = max{ s_j^{k+1} : j ∈ Ĵ^k ∪ {k+1} } =

= max[ max{ s_j^k + |x^{k+1} − x^k| : j ∈ Ĵ^k }, |y^{k+1} − x^{k+1}| ] ≤

≤ max{ a^k + |x^{k+1} − x^k|, ā/2 },

which follows from the fact that max{ s_j^k : j ∈ Ĵ^k } ≤ max{ s_j^k : j ∈ J^k } = a^k.

In this way we arrive at the following result.

Theorem 5.5. Every accumulation point of a sequence {x^k} generated by Algorithm 5.1 is stationary for f.

Using the results of Section 2.5 one can check that in the convex case the convergence properties of Algorithm 5.1 may be expressed in the form of Theorem 4.18. Moreover, Corollary 4.19 is valid for Algorithm 5.1. This follows from the fact that the proof of Lemma 4.16 shows that one may substitute in the lemma relation (4.14) by a modification of (4.14) in which ŵ^k replaces w^k.

To sum up, we have extended all the convergence results of Section 4 to Algorithm 5.1.

6. Modifications of the Methods

In this section we describe several modifications of the methods discussed so far and analyze their convergence within the framework established in the preceding sections.

First, we want to demonstrate global convergence of versions of the methods for convex minimization presented in Chapter 2 that use the general line search criteria (2.7.2), which allow for arbitrarily small serious stepsizes. To this end, suppose that f is convex and consider the following method, which is obtained from Algorithm 2.3.1 by modifying its line search rules.

Algorithm 6.1.

Step 0 through Step 2. These are the corresponding steps of Algorithm 2.3.1.

Step 3 (Line search). Find two stepsizes t_L^k and t_R^k such that 0 ≤ t_L^k ≤ t_R^k and such that the two corresponding points defined by

x^{k+1} = x^k + t_L^k d^k   and   y^{k+1} = x^k + t_R^k d^k

satisfy

f(x^{k+1}) ≤ f(x^k) + m_L t_L^k v^k,                               (6.1a)

−α(x^{k+1}, y^{k+1}) + <g_f(y^{k+1}), d^k> ≥ m_R v^k   if t_L^k < t̄,   (6.1b)

|y^{k+1} − x^{k+1}| ≤ ā,                                           (6.1c)

t_R^k ≤ t̄.                                                        (6.1d)

Step 4. The same as in Algorithm 2.3.1.

We suppose that in (6.1) ā, m_L, m_R and t̄ are fixed positive parameters satisfying 0 < m_L < m_R < 1, and

α(x,y) = f(x) − f(y) − <g_f(y), x − y>.

It is easy to verify that Line Search Procedure 3.2 can be employed for finding stepsizes t_L^k and t_R^k satisfying (6.1).
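As a concrete illustration, the acceptance tests (6.1a)-(6.1b) can be checked numerically. The sketch below (function name and the sample parameter values are my own; this is not Line Search Procedure 3.2 itself) verifies the criteria for the one-dimensional convex function f(x) = |x|.

```python
def accepts(f, gf, x, d, tL, tR, v, mL=0.1, mR=0.5, tbar=0.01):
    """Check the line search criteria (6.1a)-(6.1b) for given stepsizes
    0 <= tL <= tR; v < 0 is the predicted descent along d."""
    x1, y1 = x + tL * d, x + tR * d
    if not f(x1) <= f(x) + mL * tL * v:            # descent test (6.1a)
        return False
    if tL >= tbar:                                 # long serious step: (6.1b) not required
        return True
    alpha = f(x1) - (f(y1) + gf(y1) * (x1 - y1))   # linearization error alpha(x^{k+1}, y^{k+1})
    return -alpha + gf(y1) * d >= mR * v           # null/short-step test (6.1b)

f = abs
gf = lambda y: 1.0 if y >= 0 else -1.0             # a subgradient of |.|
# serious step from x = 1 straight to the minimizer 0 is accepted:
assert accepts(f, gf, x=1.0, d=-1.0, tL=1.0, tR=1.0, v=-1.0)
# overshooting to x = -1 violates the descent test (6.1a):
assert not accepts(f, gf, x=1.0, d=-1.0, tL=2.0, tR=2.0, v=-1.0)
```

A bisection procedure such as Line Search Procedure 3.2 simply shrinks or extends t until a pair (t_L, t_R) passing these tests is found.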

One may observe that, except for Step 3, Algorithm 6.1 is obtained from Algorithm 3.1 by deleting in the latter method Step 5 and Step 6, and setting γ = 0 and r_a^k = 0 for all k. In other words, the constructions involving the distance measures s_j^k and the resetting strategy are not necessary in the convex case.

To show that Algorithm 6.1 is globally convergent in the sense of Theorem 4.18 and Corollary 4.19, one may proceed as follows. In view of the results of Section 2.4, it suffices to establish Lemma 4.16 for Algorithm 6.1. To this end, observe that Lemma 4.9 through Lemma 4.15 remain true for Algorithm 6.1, since we have γ = 0, r_a^k = 0 and |y^{k+1} − x^{k+1}| ≤ ā for all k. Therefore we may use parts (i)-(iii) of the proof of Lemma 4.16, deducing in part (iii) a contradiction, since we now have r_a^k = 0 for all k. Therefore Lemma 4.16 and the subsequent results hold for Algorithm 6.1.
Let us now consider the method with subgradient selection for convex minimization which is obtained by using the line search criteria (6.1) in Algorithm 2.5.1. This method can also be derived from Algorithm 5.1 by replacing (5.6) with (6.1) and deleting Step 5 and Step 6. To establish global convergence of the method, one may reason as above, but this time using the results of Section 2.5 and Section 5.

To sum up, we have shown that one may employ the general line search criteria (6.1) in the methods for convex minimization from Chapter 2 without impairing the global convergence results.

Next, consider the version of the subgradient selection method for nonconvex minimization that uses the measures α(x^k, y^j) instead of α_j^k; see Remark 2.1. The method results from replacing in Algorithm 5.1 the variables α_j^k by α(x^k, y^j), and calculating a^{k+1} by

a^{k+1} = max{ |x^{k+1} − y^j| : j ∈ J^{k+1} }                     (6.2)

instead of (5.9) and (5.10). For verifying that the global convergence results of Section 5 cover this version of the method, modify (5.17) as follows

(p^k, f_p^k, s_p^k) = Σ_{j∈J^k} λ_j^k (g^j, f_j^k, |x^k − y^j|)    (6.3)

and use the fact that we always have

|x^{k+1} − y^j| ≤ |x^k − y^j| + |x^{k+1} − x^k|.                   (6.4)

In effect, this version can be analyzed by assuming that the variables s_j^k are substituted by |x^k − y^j| everywhere in Algorithm 5.1 (including relation (5.2) defining α_j^k).

For the sake of completeness of the theory, let us now consider a method that uses all the past subgradients for search direction finding at each iteration. Of course, such a strategy of total subgradient accumulation is only of theoretical interest, since it requires infinite storage and an increasing amount of work per iteration. Our method with subgradient accumulation is obtained from Algorithm 5.1 by replacing everywhere α_j^k by α(x^k, y^j), deleting Step 5 and Step 6, and setting

J^k = {1,...,k} for all k.

Thus we always have

J^{k+1} = J^k ∪ {k+1},

hence there is no need for selecting Lagrange multipliers of (5.1) to meet the requirement (5.3). As far as our techniques of convergence analysis are concerned, the principal difference between this method and the corresponding algorithm with subgradient selection described above consists in the fact that the locality radii defined by

a^k = max{ |x^k − y^j| : j = 1,...,k }

need not be bounded, since the method has no subgradient deletion rules.


In this context we recall that the proofs in Section 4 and Section 5 rely heavily on the local boundedness arguments, which depend on the boundedness of a^k. However, if one makes the additional assumption that there exists a constant ā such that a^k ≤ ā for all k, then one can establish global convergence of the method with subgradient accumulation by using (6.3) and (6.4) as above. For this reason, consider the following assumption on f and the starting point x¹:

the set S = {x ∈ R^N : f(x) ≤ f(x¹)} is bounded.                   (6.5)

Then there exists ā > 0 such that

sup{ |x − x̄| : x ∈ S, x̄ ∈ S } ≤ ā/2,

so that we always have

|x^k − y^j| ≤ |x^k − x^j| + |x^j − y^j| ≤ ā/2 + ā/2 = ā,

since {x^k} ⊂ S by the monotonicity of {f(x^k)}, while |x^j − y^j| ≤ ā/2 owing to the line search requirement (5.6d). It follows that a^k ≤ ā for all k. In effect, we have shown that the above-described method with subgradient accumulation is convergent in the sense of Theorem 5.5, Theorem 4.18 and Corollary 4.19 under the additional assumption (6.5).
We shall now present a simple modification of the methods of this chapter that we have found useful in calculations. This modification, which amounts to calculating and using more subgradients for search direction finding, can be motivated as follows. The methods described so far evaluate subgradients g^j = g_f(y^j) at trial points y^j which can, in general, be different from the points x^j. This is due to the use of the two-point line search for detecting discontinuities in the gradient of f. However, the lack of subgradient information associated with the points x^j may unnecessarily slow down convergence. For instance, consider the k-th iteration of Algorithm 3.1. Recall that for line searches the variable v^k is regarded as an approximation to the directional derivative of f at x^k in the direction d^k. For this interpretation to be valid, we should try to achieve the relation

max{ <g, d^k> : g ∈ ∂f(x^k) } ≤ v^k,                               (6.6)

since the left side of (6.6) is equal to that directional derivative (see Section 1.2). For instance, if f is smooth at x^k, so that the gradient ∇f(x^k) exists and ∂f(x^k) = {∇f(x^k)}, then we should have

<g_f(x^k), d^k> ≤ v^k,                                             (6.7)

which would yield f'(x^k; d^k) = <∇f(x^k), d^k> ≤ v^k. But if y^k ≠ x^k and g_f(x^k) is not evaluated, then we cannot even verify (6.7), let alone ensure that it holds. A simple way out of this difficulty is to calculate g_f(x^k) and use it for the k-th search direction finding by appending the following additional constraint

<g_f(x^k), d> ≤ v̂                                                 (6.8)

to the k-th primal subproblem (3.1). Then we shall have

<g_f(x^k), d^k> ≤ v̂^k,                                            (6.9)

which will yield (6.7), since v̂^k ≤ v^k. Noting that (6.8) can be formulated as

−α(x^k, x^k) + <g_f(x^k), d> ≤ v̂,                                 (6.10)

we conclude that g_f(x^k) is treated here as any other subgradient g^j = g_f(y^j).
Once the subgradient g_f(x^j) is evaluated, it can be employed for search direction finding at x^k for k > j. Thus, for ease of subsequent notation, let

y^{−j} = x^j   for j = 1,2,...,k and all k,                        (6.11)

and

g^j = g_f(y^j)   for j = ±1, ±2,...,                               (6.12)

i.e. g^j = g_f(x^{|j|}) if j < 0, and g^j = g_f(y^j) if j > 0, as before. Then at the k-th iteration the past points y^j, |j| = 1,2,...,k, are characterized by the linearizations

f_j(x) = f(y^j) + <g^j, x − y^j> = f_j^k + <g^j, x − x^k>   for all x,

and the following upper estimates of |x^k − y^j|

s_j^k = |x^{|j|} − y^{|j|}| + Σ_{i=|j|}^{k−1} |x^{i+1} − x^i|   for |j| < k,   s_k^k = |x^k − y^k|,   (6.13)

from which we can calculate the subgradient locality measures

α_j^k = max{ |f(x^k) − f_j^k|, γ(s_j^k)² }   for 1 ≤ |j| ≤ k.

To sum up, the points x^j are treated exactly as the points y^j, j ≥ 1, used before; see (2.11) and (2.14). Therefore we may now choose sets

J^k ⊂ {j : |j| = 1,...,k} for all k.

Noting that α_{−k}^k = 0, since f_{−k}^k = f(x^k) and s_{−k}^k = 0, we see that the k-th subproblem (3.1) will have (6.8) among its constraints if −k ∈ J^k.

Together with the previously motivated condition k ∈ J^k, this leads to the requirement

{−k, k} ⊂ J^k for all k,                                           (6.14)

which can be met by selecting in Step 4 of Algorithm 3.1 a (possibly empty) set Ĵ^k ⊂ J^k and setting

J^{k+1} = Ĵ^k ∪ {k+1, −(k+1)}                                      (6.15)

with g^{−(k+1)} = g_f(x^{k+1}) and

f_{−(k+1)}^{k+1} = f(x^{k+1}),                                     (6.16a)

s_{−(k+1)}^{k+1} = 0,                                              (6.16b)

instead of using (3.12).


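The bookkeeping in (6.11)-(6.16) can be sketched with signed indices: subgradients taken at trial points y^j get the positive index j, while subgradients taken at the iterates x^j get the negative index −j. The data layout below is my own illustrative choice, not the book's.

```python
def update_index_set(J_hat, k, f_x_next):
    """Update (6.15): J^{k+1} = J_hat^k ∪ {k+1, -(k+1)}.
    Also return the data (6.16) stored for the new negative index -(k+1):
    f_{-(k+1)}^{k+1} = f(x^{k+1}) and s_{-(k+1)}^{k+1} = 0."""
    Jk1 = set(J_hat) | {k + 1, -(k + 1)}
    new_entry = {"f": f_x_next, "s": 0.0}    # (6.16a)-(6.16b)
    return Jk1, new_entry

J_hat = {2, 3, -3}                  # indices kept after direction finding
Jk1, entry = update_index_set(J_hat, k=3, f_x_next=1.25)
assert Jk1 == {2, 3, -3, 4, -4}     # requirement (6.14): {-(k+1), k+1} in J^{k+1}
assert entry == {"f": 1.25, "s": 0.0}
```

The subgradient g^{−(k+1)} = g_f(x^{k+1}) would be stored alongside this entry; since x^{k+1} is its own linearization point, both its linearization error and its distance measure start at zero.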

Of course, Algorithm 5.1 may also be modified in this way. To this end, replace the requirement (5.7) by (6.15), and use (6.16) in addition to (5.8). Moreover, similar modifications may be introduced in all the versions of the methods considered in this section.

It is elementary to check that the preceding results on the global convergence of the methods are not influenced by the additional use of the subgradients g_f(x^j). This exercise is left to the reader, who may, for instance, define additional multipliers λ̃_j^k involved in (4.3) for j belonging to the set

J̃^k = J^{k_p(k)} ∪ {j : k_p(k) < |j| ≤ k},

etc.
To sum up, the above-described modification has no impact on the theory of our methods, but may increase their computational efficiency by decreasing the number of null steps (or short serious steps), thus leading to faster convergence. Of course, this advantage should be weighed against the work involved in evaluating additional subgradients. We may add that in many applications it is relatively easy to calculate g_f(x^{k+1}) once f has been evaluated at x^{k+1}.

It is worthwhile to observe that, although the notation of Mifflin's (1982) line search rules (2.9) may be misleading, in fact his algorithm calculates, in addition to g^j = g_f(y^j), all the subgradients g_f(x^j), and uses many of them for search direction finding at the k-th iteration. First, let us note that g^{k+1} = g_f(x^{k+1}) at some iterations. This is the case when

<g_f(x^k), d^k> ≥ m_R v̂^k,                                        (6.17)

since then the line search rule (2.9c) yields y^{k+1} = x^{k+1} = x^k and g^{k+1} = g_f(y^{k+1}) = g_f(x^{k+1}). The test (6.17) is related to the desirable relation (6.9), since if (6.17) holds then the facts that m_R ∈ (0,1) and v̂^k < 0 at line searches yield

<g_f(x^k), d^k> ≥ m_R v̂^k > v̂^k,

so that (6.9) cannot be satisfied. Of course, if y^{k+1} = x^{k+1} = x^k then at the next iteration one has

<g_f(x^{k+1}), d^{k+1}> = −α(x^{k+1}, y^{k+1}) + <g^{k+1}, d^{k+1}> ≤ v̂^{k+1},

since (d^{k+1}, v̂^{k+1}) solves the (k+1)-st subproblem of the form (2.34) with k+1 ∈ J^{k+1} = {1,...,k+1}. Thus relation (6.9) is satisfied for k increased by 1. We may add that the line search rule (2.9c) plays a

crucial role in Mifflin's (1982) convergence analysis, who proved that the method has at least one stationary accumulation point if the sequences {x^k} and {y^j} are bounded (or if {x^k} is bounded and the line search rule (3.11) is used together with (2.9)). One may remark that under such assumptions every accumulation point of the Mifflin algorithm is stationary. This claim can be easily verified; see the preceding discussion of the method with subgradient accumulation and the following remark.

Remark 6.2. It is worth noting that the definitions of subgradient locality measures used in the methods described so far are arbitrary to a large extent. For instance, in Algorithm 3.1 one may use, instead of (3.2), (3.3) and (3.5), the following definitions

α_j^k = max{ f(x^k) − f_j^k, γ(s_j^k)² },                          (6.18a)

α_p^k = max{ f(x^k) − f_p^k, γ(s_p^k)² },                          (6.18b)

α̃_p^k = max{ f(x^k) − f̃_p^k, γ(s̃_p^k)² },                        (6.18c)

which correspond to the following modified definition of α

α(x,y) = max{ f(x) − f̄(x;y), γ|x − y|² }   for all x, y,          (6.18d)

which is equivalent to Mifflin's definition (2.3). As before, in (6.18) γ is a positive parameter, which can be set to zero if f is convex. The above modified definitions of subgradient locality measures may be used in all the methods described so far. It is straightforward to check that such modifications do not impair the preceding convergence results. One may also use in all of the above-described methods (excepting Algorithm 6.1) the following simplified version of (6.18)

α_j^k = γ(s_j^k)²,

α_p^k = γ(s_p^k)²,                                                 (6.19)

α̃_p^k = γ(s̃_p^k)²,

α(x,y) = γ|x − y|²   for all x and y,

where this time γ is fixed and positive even if f is convex. One can easily verify that global convergence results of the form of Theorem 4.17 remain valid for definitions (6.19). However, these results can no longer be strengthened in the convex case to the form of Theorem 4.18, because subgradient locality measures (6.19) neglect linearization errors. For this reason, definitions (6.19) are inferior to the two definitions discussed above.
CHAPTER 4

Methods with Subgradient Deletion Rules for Unconstrained Nonconvex Minimization

1. Introduction

This chapter is a continuation of Chapter 3, in which we began discussing methods for minimizing a locally Lipschitzian function f: R^N → R, which is not necessarily convex or differentiable.

We shall present a class of readily implementable methods of descent, extending the algorithms of Chapter 2 to the nonconvex case in a way different from the one followed in Chapter 3. In effect, the methods generalize Pshenichny's method of linearizations for minimax problems (see Section 2.2). Simplified versions of the methods may be regarded as extensions of the Wolfe (1975) conjugate subgradient method to the nonconvex case, and as modifications of algorithms due to Mifflin (1977b) and Polak, Mayne and Wardi (1983).
The methods for convex minimization described in Chapter 2 use linearizations of f obtained by evaluating f and its subgradient at certain trial points. At each iteration, the pointwise maximum of several past linearizations defines the current polyhedral approximation to f that is used for search direction finding. A fundamental property of the convex case is that the current polyhedral approximation depends on the past trial points only implicitly, via the corresponding linearizations, so that any two trial points are equivalent if they yield the same linearization, i.e. the same supporting hyperplane to the epigraph of f.

Since in the nonconvex case each past linearization is not necessarily a lower approximation to f, in Chapter 3 we modified the methods of Chapter 2 by using Mifflin's (1982) ideas for ensuring that the search direction generated at any point is based mainly on local past subgradient information, i.e. information collected at trial points close to the current point. In effect, we considered polyhedral approximations defined by past subgradients and their locality measures, which depend on so-called distance measures that majorize distances from the corresponding trial points to the current point. Thus these distance measures directly influence each search direction finding subproblem.

In this chapter we adopt a different approach to extending the methods of Chapter 2 to the nonconvex case. Our approach is motivated by the observation that the polyhedral approximations to f used in Chapter 2 remain useful even if f is nonconvex, provided that they are

based on only local past subgradient information. Thus we retain definitions of polyhedral approximations in terms of past subgradients and the corresponding linearization errors. Although in this case, in contrast with the methods of Chapter 3, distance measures are not directly involved in each search direction finding subproblem, they are used for deciding which of the past subgradients should be retained for the next search direction finding. Thus distance measures are employed only in subgradient deletion rules for localizing the past subgradient information by resets. This locality resetting is associated here with estimating the degree of stationarity of the current point, while in Chapter 3 resets were used only for ensuring locally uniform boundedness of the subgradients stored by the methods.

We shall show that each of our methods with subgradient deletion rules is globally convergent in the sense that all its accumulation points are stationary. In the convex case each method generates a minimizing sequence of points, which converges to a solution if f attains its infimum. These convergence results, as well as the techniques for deriving them, are similar to those of Chapter 3.

To sum up, the methods with subgradient deletion rules may be regarded as alternatives to the methods with subgradient locality measures. Theoretical results on their global convergence are the same, but they are based on constructions with fundamentally different motivations. More importantly, our preliminary numerical experience indicates that those two classes of methods perform differently in practice. Therefore, one has to address the question of relative advantages and drawbacks of both classes of methods in specific applications. We leave this important question open for future theoretical and experimental investigations.

It turns out that the framework for convergence analysis presented here can easily accommodate methods that neglect linearization errors (see Section 1.3 for a discussion of this class of methods). For this reason, we shall establish global convergence results for such methods, which subsume those of Mifflin (1977b) and Polak, Mayne and Wardi (1983).

In Section 2 we derive the methods. Section 3 contains a detailed description of the method with subgradient aggregation. Its convergence is analyzed in Section 4. In Section 5 we establish convergence of the method with subgradient selection. In Section 6 we discuss modified subgradient deletion rules and their practical implications. Section 7 is devoted to methods that neglect linearization errors.

2. Derivation of the Methods

In this section we derive two methods for minimizing a locally Lipschitzian function f defined on R^N. Detailed descriptions of the methods will be given in subsequent sections.

In order to implement the methods, we suppose that we have a finite process that can calculate f(x) and a subgradient g_f(x) ∈ ∂f(x) at each x ∈ R^N.

The algorithms will generate sequences of points {x^k} ⊂ R^N, search directions {d^k} ⊂ R^N and nonnegative stepsizes {t_L^k} ⊂ R_+, related by

x^{k+1} = x^k + t_L^k d^k   for k = 1,2,...,

where x¹ is a given starting point in R^N. The algorithms are descent methods in the sense that

f(x^{k+1}) < f(x^k)   if x^{k+1} ≠ x^k, for all k,

and the sequence {x^k} is intended to converge to a minimum point of f.

To deal with the nondifferentiability of f, the methods will use a two-point line search, similar to Line Search Procedure 3.3.2, for detecting discontinuities in the gradient of f. Thus each algorithm will calculate auxiliary stepsizes {t_R^k}, t_R^k ≥ t_L^k for all k, and trial points

y^{k+1} = x^k + t_R^k d^k   for k = 1,2,...,   y¹ = x¹,

and evaluate subgradients

g^j = g_f(y^j)   for j = 1,2,... .

With each such subgradient we associate the corresponding linearization of f

f_j(x) = f(y^j) + <g^j, x − y^j>   for all x.                      (2.1)

The two algorithms to be described differ in the way in which they make use of the past subgradient information. In order to employ only a finite number of the past subgradients for search direction finding at each iteration, the methods will use modified versions of the subgradient selection strategy and the subgradient aggregation strategy introduced in Chapter 2, and extended in Chapter 3. For convenience of the reader, relevant features of these strategies are briefly recalled below.

We start with the method with subgradient selection. Suppose that at the k-th iteration we have a nonempty set J^k ⊂ {1,...,k} and the corresponding past subgradients (g^j, f_j^k), j ∈ J^k, which determine the linearizations f_j, j ∈ J^k, via the relation

f_j(x) = f_j^k + <g^j, x − x^k>   for all x,                       (2.2)

where f_j^k = f_j(x^k) for all j ≤ k. In the convex case we found it convenient to use the following polyhedral approximation to f

f_s^k(x) = max{ f_j(x) : j ∈ J^k }   for all x,                    (2.3)

since

f(x) ≥ f_s^k(x)   for all x,                                       (2.4a)

f(y^j) = f_s^k(y^j)   for all j ∈ J^k                              (2.4b)

if f is convex. These properties followed from the fact that

f(x) ≥ f_j(x)   for all x and j                                    (2.5)

in the convex case. In view of (2.4), we had reasons to suppose that f_s^k(x^k + d) is close to f(x^k + d) if |d| is small, so we found the k-th search direction d^k to

minimize f_s^k(x^k + d) + ½|d|²   over all d ∈ R^N.               (2.6)

We also noted in Section 3.2 that in terms of the linearization errors

α_j^k = f(x^k) − f_j^k,                                            (2.7)

the k-th selective polyhedral approximation to f could be written as

f_s^k(x) = max{ f(x^k) − α_j^k + <g^j, x − x^k> : j ∈ J^k }   for all x,   (2.8)

so that d^k could be found by solving the following quadratic programming problem for (d^k, v̂^k) ∈ R^N × R:

minimize   ½|d|² + v̂,   (d, v̂) ∈ R^{N+1},
                                                                   (2.9)
subject to −α_j^k + <g^j, d> ≤ v̂,   j ∈ J^k.

The Lagrange multipliers λ_j^k, j ∈ J^k, of (2.9) solve the following dual of (2.9)

minimize   ½ | Σ_{j∈J^k} λ_j g^j |² + Σ_{j∈J^k} λ_j α_j^k,
                                                                   (2.10)
subject to λ_j ≥ 0, j ∈ J^k,   Σ_{j∈J^k} λ_j = 1,

and yield d^k via the relation

d^k = − Σ_{j∈J^k} λ_j^k g^j,                                       (2.11)

and hence only linearizations f_j with small values of the linearization errors α_j^k ≥ 0 are significantly active in the determination of d^k, i.e. λ_j^k must be small if α_j^k is large in comparison with α_i^k for i ≠ j, i,j ∈ J^k. Thus in the convex case the nonnegative linearization errors (2.7) could be used for weighing the past subgradients at each search direction finding. Moreover, each linearization error α_j^k measures the distance of g^j from ∂f(x^k) in the sense that

g^j ∈ ∂_ε f(x^k)   for ε = α_j^k.                                  (2.12)
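For convex f, the inclusion (2.12) says that f(z) ≥ f(x^k) + <g^j, z − x^k> − α_j^k for all z. A small numerical sketch (function names are my own) for f(x) = x², where the linearization error from (2.7) works out to α_j^k = (x^k − y^j)²:

```python
def lin_error(f, g, xk, yj):
    """alpha_j^k = f(x^k) - f_j(x^k), with f_j(x) = f(y^j) + g(y^j)*(x - y^j);
    this is the (1-D) linearization error of (2.7)."""
    return f(xk) - (f(yj) + g(yj) * (xk - yj))

f = lambda x: x * x
g = lambda y: 2.0 * y          # gradient of x^2
xk, yj = 1.0, 0.4
alpha = lin_error(f, g, xk, yj)
assert abs(alpha - (xk - yj) ** 2) < 1e-12      # for x^2: alpha = (x^k - y^j)^2
# epsilon-subgradient inequality behind (2.12):
# f(z) >= f(x^k) + g(y^j)*(z - x^k) - alpha for all z
for z in [i / 10.0 for i in range(-30, 31)]:
    assert f(z) >= f(xk) + g(yj) * (z - xk) - alpha - 1e-12
```

For f(x) = x² the slack in the inequality is exactly (z − y^j)², so the bound is tight at z = y^j, which mirrors property (2.4b).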

In the nonconvex case considered in Chapter 3, we had to modify the above constructions for the following reasons. If f is nonconvex then each linearization f_j need not globally approximate f from below, so that a polyhedral approximation of the form (2.3) does not, in general, satisfy (2.4). Moreover, it is important to observe that if one used linearization errors defined via (2.7) in the k-th search direction finding subproblem (2.9), then such linearization errors would no longer consistently weigh the past subgradients as in the convex case. This follows from the fact that although d^k would be expressed in the form of (2.11) in terms of a solution to (2.10), we could have negative α_j^k. For this reason, in Chapter 3 we used polyhedral approximations of the form (2.8) and the corresponding k-th search direction finding subproblem (2.9) with

α_j^k = max{ |f(x^k) − f_j^k|, γ(s_j^k)² },                        (2.13)

where γ was a positive parameter (which could be set to zero in the convex case) and the distance measures

s_j^k = |x^j − y^j| + Σ_{i=j}^{k−1} |x^{i+1} − x^i|

were used for estimating |x^k − y^j| without storing {y^j} (although one could use

α_j^k = max{ |f(x^k) − f_j^k|, γ|x^k − y^j|² }                     (2.14)

if {y^j} were stored).
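The distance measures above can be maintained recursively without storing the trial points: after each step, every s_j is increased by |x^{k+1} − x^k|. A minimal one-dimensional sketch (data layout is my own; for brevity both trial points are attached to the first iterate) verifying the overestimation property s_j^k ≥ |x^k − y^j| that follows from the triangle inequality:

```python
def update_distances(s, step_len):
    """After moving from x^k to x^{k+1}, add |x^{k+1} - x^k| to every stored
    distance measure: s_j^{k+1} = s_j^k + |x^{k+1} - x^k|."""
    return {j: sj + step_len for j, sj in s.items()}

xs = [0.0, 1.0, 0.7, 0.9]           # iterates x^1, ..., x^4 (1-D)
ys = {1: 0.2, 2: 1.3}               # trial points y^j
s = {j: abs(xs[0] - yj) for j, yj in ys.items()}   # initialized at x^1
for k in range(1, len(xs)):
    s = update_distances(s, abs(xs[k] - xs[k - 1]))
    for j, yj in ys.items():        # overestimation: s_j^k >= |x^k - y^j|
        assert s[j] >= abs(xs[k] - yj) - 1e-12
```

This is exactly why (2.13) can be evaluated with O(1) storage per subgradient, at the price of s_j^k growing even when the iterates later return toward y^j.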

Observe that if y^j ≠ x^k for all j ∈ J^k then f_s^k defined by (2.8) and (2.13) is locally a lower approximation to f at x^k in the sense that

f(x^k + d) ≥ f_s^k(x^k + d)   for sufficiently small |d|,          (2.15)

since then each α_j^k is positive (s_j^k ≥ |x^k − y^j| > 0), and so each term f(x^k) − α_j^k + <g^j, d> is less than f(x^k + d) for small |d| because of the Lipschitz continuity of f. We also have (2.4a) for f_s^k defined by (2.8) and (2.13) if f is convex and γ = 0. Thus (2.15) may be regarded as a local version of (2.4a) in the nonconvex case.
Relation (2.13) defines subgradient locality measures, since in the convex case (γ = 0) (2.13) reduces to (2.7) and we have (2.12), while in the nonconvex case we may use the following definition of the Goldstein (1977) ε-subdifferential

∂f(x; ε) = conv{ ∂f(y) : |y − x| ≤ ε }                             (2.16)

to deduce that

g^j ∈ ∂f(x^k; ε)   for ε = (α_j^k / γ)^{1/2},                      (2.17)

since

|y^j − x^k| ≤ s_j^k ≤ (α_j^k / γ)^{1/2}.

Moreover, the above inequality implies that if d^k is obtained by solving (2.9) with α_j^k defined via (2.13), then in (2.11) we have small λ_j^k corresponding to large |x^k − y^j|, i.e. d^k is based mainly on the local past subgradients. We summarize our observations in
Remark 2.1. The methods considered in Chapter 3 will automatically ensure that each past subgradient g^j becomes progressively less active in successive search direction finding subproblems, i.e. the values of λ_j^k decrease when the algorithm moves away from the point y^j, so that eventually we have λ_j^k = 0 and the subgradient g^j is dropped (j ∉ J^{k+1}). This mechanism depends on the value of the distance measure parameter γ. If the value of γ is too large then the algorithm will keep deleting even the subgradients collected in relatively small neighborhoods of the iterates x^k. In this case the algorithm proceeds in "leaps", i.e. after each long serious step almost all the past subgradients are dropped, so the next long serious step can occur only after the algorithm has accumulated sufficiently many subgradients by executing a series of null or very short serious steps. This should, of course, be avoided. At the same time, too small a value of γ may not ensure proper weighing of the past subgradients at the search direction finding; for instance, (2.15) may hold only for very small |d|. Then the algorithm may select for retaining even nonlocal past subgradients in preference

to the local ones, which is inefficient.

We conclude from the above remark that in practice it may be dif-
ficult to choose a suitable value of the distance measure parameter γ
in the methods with subgradient locality measures. For this reason,
we shall assume in this chapter that γ=0 in (2.13), i.e. we shall
use the following linearization errors

   α_j^k = |f(x^k) − f_j^k|                                    (2.18)

and polyhedral approximations of the form (2.8) even in the nonconvex
case. In effect, the direction d^k found by solving the k-th primal
subproblem (2.9) (or, equivalently, the k-th dual subproblem (2.10))
will satisfy (2.11) as in the corresponding method with subgradient
selection of Chapter 3. However, since linearization errors will no
longer measure locality of the past subgradients, and since f̂^k given
by (2.8) and (2.18) is a useful local approximation of f at x^k only
if y^j is close to x^k for all j ∈ J^k, we shall modify the subgradi-
ent selection strategy of Chapter 3 to ensure that the sets J^k are
chosen so that we have sufficiently small values of |y^j − x^k| for all
j ∈ J^k. To this end we shall use subgradient deletion rules for decid-
ing which of the past subgradients should be retained for the next
search direction finding. Namely, we shall first find Lagrange multi-
pliers λ_j^k of (2.9) and the corresponding set

   Ĵ^k = {j ∈ J^k : λ_j^k ≠ 0},

and set J^{k+1} = Ĵ^k ∪ {k+1} as in Chapter 3. Next, we shall use suitable
rules for deciding whether J^{k+1} should be reduced by deleting indi-
ces j corresponding to large values of the distance measures s_j^k. Since
specific deletion rules are applicable also to the method with subgra-
dient aggregation described below, we postpone their discussion till
the end of this section.
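The linearization errors (2.18) and a distance-measure-based deletion rule can be pictured by the following fragment. This is an illustrative sketch, not the book's implementation; the names f_vals, s and radius are our own, and f_vals maps a retained index j to the linearization value f_j^k.

```python
# Illustrative sketch (assumed names, not the book's code): linearization
# errors (2.18) and a deletion rule that retains only subgradients whose
# distance measures s_j (with s_j >= |y^j - x^k|) are small.

def linearization_errors(f_xk, f_vals):
    """alpha_j^k = |f(x^k) - f_j^k| for each retained index j, cf. (2.18)."""
    return {j: abs(f_xk - fj) for j, fj in f_vals.items()}

def delete_distant(J, s, radius):
    """Subgradient deletion rule: drop indices with large distance measures."""
    return {j for j in J if s[j] <= radius}
```

Deleting indices j with large s_j keeps the polyhedral approximation built from the retained subgradients local around x^k.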
Let us now pass to the method with subgradient aggregation. As in
Chapter 3, for search direction finding at the k-th iteration the agg-
regate subgradient (p^{k−1}, f_p^k), satisfying

   (p^{k−1}, f_p^k) ∈ conv{(g^j, f_j^k) : j ∈ Ĵ_p^{k−1}},  Ĵ_p^{k−1} ⊂ {1,...,k−1},   (2.19)

may replace the past subgradients (g^j, f_j^k), j=1,...,k−1. Since (2.18)
corresponds to the following linearization error

   α(x,y) = |f(x) − f̄(x;y)|,                                  (2.20)

we may define the linearization error

   α_p^k = |f(x^k) − f_p^k|                                    (2.21)

associated with the (k−1)-st aggregate linearization

   f̃_p^{k−1}(x) = f_p^k + <p^{k−1}, x − x^k>  for all x.

(Observe that (2.20) and (2.21) can be obtained by setting γ=0 in
the corresponding definitions of Section 3.2.) Then, as in Chapter 3,
we may define the k-th aggregate polyhedral approximation to f

   f̂_a^k(x) = max{f(x^k) − α_j^k + <g^j, x − x^k> : j ∈ J^k;
                 f(x^k) − α_p^k + <p^{k−1}, x − x^k>}  for all x   (2.22)

and find the k-th search direction d^k to

   minimize f̂_a^k(x^k + d) + ½|d|² over all d ∈ R^N.           (2.23)

This can be done by solving the following quadratic programming prob-
lem for (d^k, v^k) ∈ R^N × R:

   minimize ½|d|² + v over (d,v) ∈ R^{N+1},                    (2.24a)

   subject to −α_j^k + <g^j, d> ≤ v,  j ∈ J^k,                 (2.24b)

              −α_p^k + <p^{k−1}, d> ≤ v.                       (2.24c)

Moreover, the Lagrange multipliers λ_j^k, j ∈ J^k, and λ_p^k of (2.24), sa-
tisfying

   λ_j^k ≥ 0, j ∈ J^k,  λ_p^k ≥ 0,  Σ_{j∈J^k} λ_j^k + λ_p^k = 1,   (2.25)

determine the current aggregate subgradient

   (p^k, f̃_p^k) = Σ_{j∈J^k} λ_j^k (g^j, f_j^k) + λ_p^k (p^{k−1}, f_p^k)   (2.26)

such that

   d^k = −p^k,

   (p^k, f̃_p^k) ∈ conv{(g^j, f_j^k) : j ∈ Ĵ_p^{k−1} ∪ J^k}.   (2.27)

Therefore, by setting

   f_p^{k+1} = f̃_p^k + <p^k, x^{k+1} − x^k>

one can proceed to the next iteration.
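In code, the aggregation formula (2.26) and the shift of the aggregate linearization value reduce to a few lines of arithmetic. The sketch below is our own (plain lists stand in for N-vectors) and assumes the multipliers have already been obtained from the quadratic programming subproblem (2.24):

```python
# Sketch of the aggregation step (2.26): combine the active subgradients
# (g^j, f_j^k) with the previous aggregate (p^{k-1}, f_p^k) using the
# Lagrange multipliers of (2.24).  Hypothetical helper, not the book's code.

def aggregate(lmbda, G, F, lmbda_p, p_prev, fp_prev):
    n = len(p_prev)
    p = [lmbda_p * p_prev[i] for i in range(n)]
    fp = lmbda_p * fp_prev
    for lj, g, fj in zip(lmbda, G, F):
        for i in range(n):
            p[i] += lj * g[i]      # p^k   = sum_j lambda_j g^j + lambda_p p^{k-1}
        fp += lj * fj              # f~_p^k = sum_j lambda_j f_j^k + lambda_p f_p^k
    return p, fp
```

After the next step is taken, f_p^{k+1} = fp + <p, x^{k+1} − x^k> implements the shift displayed above.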

One may observe that the method with subgradient aggregation de-
scribed so far differs from the corresponding method of Chapter 3 only
in the choice of γ=0, i.e. here we use linearization errors instead of
the corresponding subgradient locality measures. For this reason, in
the nonconvex case the function f̂_a^k defined by (2.22) is a useful lo-
cal approximation to f around x^k only if it is based on sufficiently
local subgradient information, i.e. if the sets J^k and Ĵ_p^{k−1} (see
(2.19)) are such that the values of the following variables

   â_J^k = max{|x^k − y^j| : j ∈ J^k},

   â_p^k = max{|x^k − y^j| : j ∈ Ĵ_p^{k−1}},                   (2.28)

   â^k = max{â_J^k, â_p^k}

are sufficiently small. This can be ensured by a suitable choice of J^k
and Ĵ_p^{k−1}. In this context we observe that the value of â_J^k can be
made as small as desired at the k-th iteration by choosing J^k on the
basis of the values s_j^k ≥ |x^k − y^j|, provided that we have s_j^k = 0 for some
j. On the other hand, since the set Ĵ_p^{k−1} is defined by (see (3.4.2))

   Ĵ_p^{k−1} = Ĵ_p^{k_p(k−1)} ∪ {j : k_p(k−1) < j ≤ k−1},

   k_p(k−1) = max{j : j ≤ k−1 and λ_p^j = 0},

the only way of reducing â_p^k, if this is necessary, is to reset Ĵ_p^{k−1}
to an empty set, so that â_p^k vanishes. This is equivalent to dis-
carding the (k−1)-st aggregate subgradient in the definition (2.22) of
f̂_a^k, and to dropping the last constraint of (2.24) at the search direc-
tion finding.
We shall now describe a simple strategy for localizing the past
subgradient information used for search direction finding. It is based
on an idea due to Wolfe (1975) in the convex case. The concept of the
Goldstein subdifferential (2.16) is useful in the nonconvex case, be-
cause it is defined directly in terms of neighborhoods of a given
point. Moreover, ∂f(x;0) = ∂f(x) and, owing to the definition (1.2.14)
of ∂f, ∂f(x;ε) is a close approximation to ∂f(x) if the value of ε is
small. Our aim is to obtain at some iteration p^k ∈ ∂f(x^k;ε) for some
small values of ε and |p^k|; then x^k will be approximately stationary
(stationary points x̄ satisfy 0 ∈ ∂f(x̄) ⊂ ∂f(x̄;ε) for all ε ≥ 0). From
Chapter 3 we know how to construct locality radii a^k of the aggregate
subgradients, by using only the distance measures s_j^k, such that (see Lem-
ma 4.1)

   p^k = Σ_{j∈Ĵ_p^k} λ̃_j^k g^j,                               (2.29a)

   λ̃_j^k ≥ 0, j ∈ Ĵ_p^k,  Σ_{j∈Ĵ_p^k} λ̃_j^k = 1,              (2.29b)

   max{|y^j − x^k| : j ∈ Ĵ_p^k} ≤ a^k,                         (2.29c)

   Ĵ_p^k = Ĵ_p^{k_p(k)} ∪ {j : k_p(k) < j ≤ k},                (2.29d)

   k_p(k) = max{j : j ≤ k and λ_p^j = 0}.                      (2.29e)

Relations (2.29a)-(2.29c) and definition (2.16) lead to the following
fundamental property

   p^k ∈ ∂f(x^k; a^k).                                         (2.30)

In view of (2.30) we want to obtain small values of both |p^k| and
a^k at some iteration. This will occur if both

   |p^k| ≤ m_a a^k                                             (2.31)

and the value of a^k is small, where m_a > 0 is a fixed scaling para-
meter. However, if the variables a^k were generated as in the methods
of Chapter 3 with no resettings, i.e. by the formula

   a^{k+1} = max{a^k + |x^{k+1} − x^k|, |y^{k+1} − x^{k+1}|},

then we would always have a^{k+1} ≥ a^k. Therefore (2.31) would hold only
at later iterations, for large values of a^k.

Thus a mechanism is needed for decreasing the value of a^k if
(2.31) occurs for some k. In this case we shall discard the aggregate
subgradient by resetting the algorithm. The resetting will involve a
repeated calculation of d^k and p^k from the k-th subproblem (2.24a)-
(2.24b), i.e. the aggregate constraint (2.24c) is ignored at resets.
By setting the corresponding Lagrange multiplier λ_p^k to zero, we
again obtain (2.30) from (2.29), but this time for the reduced value of
a^k given by

   a^k = max{s_j^k : j ∈ J^k}.                                 (2.32)

Next, if (2.31) holds once again, the resetting procedure should be re-
peated with a further reduction of the search direction finding subprob-
lem (2.24a)-(2.24b), which consists in deleting some elements j of
J^k with large values of s_j^k, so that the value of a^k given by (2.32)
decreases. Finally, we should have |p^k| > m_a a^k after a reset. Since
|p^k| = |d^k|, this may be interpreted as ensuring that the length of the
current search direction is comparable to the value of the current lo-
cality radius.
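The resetting loop just described can be sketched as follows. Here solve_direction stands for a routine returning p^k from the reduced subproblem (2.24a)-(2.24b); the whole fragment is an illustrative sketch under these assumptions, not the book's implementation.

```python
# Sketch of the resetting loop: while |p^k| <= m_a * a^k, with a^k given by
# (2.32), delete the oldest subgradients and re-solve for the direction.

def norm(p):
    return sum(pi * pi for pi in p) ** 0.5

def reset_until_comparable(J, s, m_a, solve_direction):
    J = set(J)
    p = solve_direction(J)
    while len(J) > 1 and norm(p) <= m_a * max(s[j] for j in J):
        J.remove(min(J))           # delete the oldest retained subgradient
        p = solve_direction(J)
    return J, p                    # now |p| > m_a * a, or only one index is left
```

The case |J| = 1 corresponds to the complete reset of Step 4(iii) below, which redefines y^k = x^k.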
Except for possible reductions due to resets, the sets J^k can be
chosen as follows. Let M_g ≥ 2 denote a user-supplied, fixed upper
bound on the number of the past subgradients (including the aggregate sub-
gradient) that the algorithm may use for each search direction finding.
By choosing the sets J^k recursively so that

   k+1 ∈ J^{k+1} ⊂ J^k ∪ {k+1},

   J^{k+1} ⊂ {j : k − M_g + 3 ≤ j ≤ k+1},

one can control storage and work per iteration.


Having described the subgradient aggregation strategy, we may now
return to the method with subgradient selection. In this method the ag-
gregate subgradient (p^k, f̃_p^k) is defined by (2.26), but with λ_p^k = 0. If
the resetting test (2.31) is fulfilled with a^k given by (2.32), then
the set J^k is reduced and the search direction finding subproblem
(2.10) is solved again. The resetting is repeated until either |p^k| > m_a a^k,
or the algorithm stops with |p^k| ≤ m_a a^k and a small value of a^k. The
next index set is of the form J^{k+1} = Ĵ^k ∪ {k+1}, which is analogous to the
one used in Chapter 3, i.e. we have |Ĵ^k| ≤ N+1, so that the method can
store M_g = N+2 past subgradients.

Remark 2.2. In the above-described method with subgradient selection one
may use, instead of (2.31), the following test for resetting the method

   |p^k| ≤ m_a â^k,

where

   â^k = max{|y^j − x^k| : j ∈ Ĵ^k}.

The subsequent convergence results remain valid for this modification.
However, the calculation of â^k would require storing the N+2 points y^j,
j ∈ J^k. For this reason, we prefer to use the easily computable variable
a^k.

3. The Algorithm with Subgradient Aggregation

We now state an algorithmic procedure for solving the problem in
question. Ways of implementing each step of the method are discussed
below.

Algorithm 3.1.

Step 0 (Initialization). Select the starting point x^1 ∈ R^N and a final
accuracy parameter ε_s ≥ 0. Choose fixed positive line search parameters
m_L, m_R, ā, ζ and θ̄ with ζ ≤ 1, m_L < m_R < 1 and θ̄ < 1. Set M_g equal to
the fixed maximum number of subgradients that the algorithm may use for
search direction finding; M_g ≥ 2. Choose a predicted shift in x at the
first iteration s^1 > 0 and set θ^1 = θ̄. Select a positive reset toleran-
ce m_a and set the reset indicator r_a^1 = 1. Set

   J^1 = {1},  y^1 = x^1,  p^0 = g^1 = g_f(y^1),  f_p^1 = f_1^1 = f(y^1),

   s_1^1 = a^1 = 0.

Set the counters k=1, l=0 and k(0)=1.

Step 1 (Direction finding). Find the solution (d^k, v̂^k) to the following
k-th quadratic programming problem:

   minimize ½|d|² + v over (d,v) ∈ R^{N+1},

   subject to −α_j^k + <g^j, d> ≤ v,  j ∈ J^k,                 (3.1)

              −α_p^k + <p^{k−1}, d> ≤ v  if r_a^k = 0,

where

   α_j^k = |f(x^k) − f_j^k| for j ∈ J^k,                       (3.2a)

   α_p^k = |f(x^k) − f_p^k|.                                   (3.2b)

Find Lagrange multipliers λ_j^k, j ∈ J^k, and λ_p^k of (3.1), setting λ_p^k = 0
if r_a^k = 1. Set

   (p^k, f̃_p^k) = Σ_{j∈J^k} λ_j^k (g^j, f_j^k) + λ_p^k (p^{k−1}, f_p^k),   (3.3)

   α̃_p^k = |f(x^k) − f̃_p^k|,                                  (3.4)

   v^k = −{|p^k|² + α̃_p^k}.                                   (3.5)

If λ_p^k = 0 set

   a^k = max{s_j^k : j ∈ J^k}.                                 (3.6)

Step 2 (Stopping criterion). If max{|p^k|, m_a a^k} ≤ ε_s then terminate.
Otherwise, go to Step 3.

Step 3 (Resetting test). If |p^k| ≤ m_a a^k then go to Step 4; otherwise,
go to Step 5.

Step 4 (Resetting). (i) If r_a^k = 0 then set r_a^k = 1, replace J^k by
J^k ∩ {j : j ≥ k − M_g + 2} and go to Step 1.

(ii) If |J^k| > 1 then delete the smallest number from J^k and go to
Step 1.

(iii) Set y^k = x^k, g^k = g_f(y^k), f_k^k = f(y^k), s_k^k = 0, J^k = {k} and go to Step 1.

Step 5 (Line search). By a line search procedure as discussed below,
find two stepsizes t_L^k and t_R^k such that 0 ≤ t_L^k ≤ t_R^k ≤ 1 and such that
the two corresponding points defined by

   x^{k+1} = x^k + t_L^k d^k  and  y^{k+1} = x^k + t_R^k d^k

satisfy

   f(x^{k+1}) ≤ f(x^k) + m_L t_L^k v^k,                        (3.7)

   t_R^k = t_L^k  if t_L^k ≥ ā,                                (3.8a)

   −α(x^{k+1}, y^{k+1}) + <g_f(y^{k+1}), d^k> ≥ m_R v^k  if t_L^k < ā,   (3.8b)

   |y^{k+1} − x^{k+1}| ≤ ā,                                    (3.9)

   |y^{k+1} − x^{k+1}| ≤ θ^k s^k  if t_L^k = 0,                (3.10a)

   |y^{k+1} − x^{k+1}| ≤ ζ|x^{k+1} − x^k|  if t_L^k > 0,       (3.10b)

where

   α(x,y) = |f(x) − f(y) − <g_f(y), x − y>|.                   (3.11)

Step 6. If t_L^k = 0 (null step), set s^{k+1} = s^k and θ^{k+1} = θ̄θ^k. Otherwise,
i.e. if t_L^k > 0 (serious step), set s^{k+1} = |x^{k+1} − x^k|, θ^{k+1} = θ̄, k(l+1) = k+1
and increase l by 1.

Step 7 (Subgradient updating). Select a set Ĵ^k satisfying

   Ĵ^k ⊂ J^k,                                                  (3.12a)

   |Ĵ^k| ≤ M_g − 2,                                            (3.12b)

and set

   J^{k+1} = Ĵ^k ∪ {k+1}.                                      (3.12c)

Set g^{k+1} = g_f(y^{k+1}) and

   f_{k+1}^{k+1} = f(y^{k+1}) + <g^{k+1}, x^{k+1} − y^{k+1}>,  (3.13a)

   f_j^{k+1} = f_j^k + <g^j, x^{k+1} − x^k>  for j ∈ Ĵ^k,      (3.13b)

   f_p^{k+1} = f̃_p^k + <p^k, x^{k+1} − x^k>,                   (3.13c)

   s_{k+1}^{k+1} = |y^{k+1} − x^{k+1}|,                        (3.14a)

   s_j^{k+1} = s_j^k + |x^{k+1} − x^k|  for j ∈ Ĵ^k.           (3.14b)

Calculate

   a^{k+1} = max{a^k + |x^{k+1} − x^k|, s_{k+1}^{k+1}}         (3.15)

and set r_a^{k+1} = 0.

Step 8. Increase k by 1 and go to Step 1.
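The bookkeeping of Step 7 amounts to shifting the stored linearization values and distance measures along the step x^{k+1} − x^k. The following sketch uses our own names (dictionaries indexed by j) and is illustrative only:

```python
# Sketch of the Step 7 updates (3.13b), (3.14b) and the first term of (3.15):
# shift each f_j by <g^j, x^{k+1}-x^k>, and grow each s_j and the locality
# radius a by |x^{k+1}-x^k|.  Illustrative helper, not the book's code.

def step7_update(f_vals, g_vecs, s, a, x_old, x_new):
    shift = [xn - xo for xn, xo in zip(x_new, x_old)]
    step_len = sum(d * d for d in shift) ** 0.5
    for j in f_vals:
        f_vals[j] += sum(gi * di for gi, di in zip(g_vecs[j], shift))  # (3.13b)
        s[j] += step_len                                               # (3.14b)
    return a + step_len     # to be compared with s_{k+1}^{k+1} as in (3.15)
```

At a null step the shift is zero, so all stored quantities are left unchanged, in agreement with (3.13)-(3.15).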

We shall now comment on each step of the algorithm.

By Lemma 2.2.1, the dual of the k-th subproblem (3.1) is to find values
of the multipliers λ_j, j ∈ J^k, and λ_p that

   minimize ½|Σ_{j∈J^k} λ_j g^j + λ_p p^{k−1}|² + Σ_{j∈J^k} λ_j α_j^k + λ_p α_p^k,

   subject to λ_j ≥ 0, j ∈ J^k,  λ_p ≥ 0,  Σ_{j∈J^k} λ_j + λ_p = 1,   (3.16)

              λ_p = 0  if r_a^k = 1.

Any solution of (3.16) is a Lagrange multiplier vector for (3.1) and it
yields the unique solution (d^k, v̂^k) of (3.1) as follows

   d^k = −p^k,                                                 (3.17)

   v̂^k = −{|p^k|² + Σ_{j∈J^k} λ_j^k α_j^k + λ_p^k α_p^k},      (3.18)

where p^k is given by (3.3). Moreover, any Lagrange multipliers of
(3.1) also solve (3.16). In particular, we always have

   λ_j^k ≥ 0, j ∈ J^k,  λ_p^k ≥ 0,  Σ_{j∈J^k} λ_j^k + λ_p^k = 1,
   λ_p^k = 0 if r_a^k = 1.                                     (3.19)

Thus one may equivalently solve the k-th dual search direction finding
subproblem (3.16) in Step 1 of the algorithm.
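The dual (3.16) is a convex quadratic program over the unit simplex; the book leaves the choice of QP routine open. Purely for illustration, the sketch below solves the reset case (λ_p = 0) by the Frank-Wolfe method with exact line search, using only the data G = {g^j} and alpha = {α_j^k}; this is an assumed solver, not the method prescribed in the text.

```python
# Illustrative Frank-Wolfe solver (an assumption, not the book's QP method)
# for the dual (3.16) without the aggregate term:
#   minimize (1/2)|sum_j lam_j g^j|^2 + sum_j lam_j alpha_j over the simplex.

def solve_dual_fw(G, alpha, iters=200):
    m, n = len(G), len(G[0])
    lam = [1.0 / m] * m
    for _ in range(iters):
        p = [sum(lam[j] * G[j][i] for j in range(m)) for i in range(n)]
        abar = sum(lam[j] * alpha[j] for j in range(m))
        # partial derivatives of the dual objective: <g^j, p> + alpha_j
        grad = [sum(G[j][i] * p[i] for i in range(n)) + alpha[j]
                for j in range(m)]
        js = min(range(m), key=grad.__getitem__)   # best vertex of the simplex
        diff = [G[js][i] - p[i] for i in range(n)]
        denom = sum(di * di for di in diff)
        if denom < 1e-16:
            break
        # exact minimizing step towards the vertex e_{js}
        t = -(sum(p[i] * diff[i] for i in range(n)) + alpha[js] - abar) / denom
        t = max(0.0, min(1.0, t))
        if t == 0.0:
            break
        lam = [(1.0 - t) * lj for lj in lam]
        lam[js] += t
    p = [sum(lam[j] * G[j][i] for j in range(m)) for i in range(n)]
    return lam, p, [-pi for pi in p]               # d^k = -p^k, cf. (3.17)
```

For example, with the two one-dimensional subgradients g^1 = 2, g^2 = −1 and zero linearization errors, the minimizing weights are (1/3, 2/3) and p^k = 0, so the search direction vanishes at the optimum of the model.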
The algorithm stops at Step 2 when

   p^k ∈ ∂f(x^k; ε_s/m_a) and |p^k| ≤ ε_s,

i.e. when x^k is approximately stationary for f. This follows from the
fact that p^k ∈ ∂f(x^k; a^k) at Step 2, as will be shown below.

The sole purpose of resettings at Step 3 and Step 4 is to ensure
that at line searching at Step 5 we have |d^k| > m_a a^k. Note that there
can be only finitely many returns from Step 4 to Step 1, since Step 4
reduces the number of subgradients that can be used for the next
search direction finding. Step 4(iii) is entered after p^{k−1} has been
discarded and J^k has been reduced to J^k = {k}, while |g^k| ≤ m_a s_k^k. The-
refore, in this case we have to completely reset the algorithm by re-
defining y^k to y^k = x^k. Then a^k = s_k^k = 0 at Step 1, and max{|p^k|,
m_a a^k} = |p^k| at Step 2. If the algorithm does not stop at Step 2, i.e.
|p^k| > ε_s ≥ 0, then |p^k| > m_a a^k = 0 in Step 3, hence the algorithm goes
to Step 5.

Observe that each reset involves a repetition of search direction
finding. To avoid the need for solving too many quadratic programming
problems at resetting, one may delete more than one subgradient in
Step 4(i) and Step 4(iii). We shall return to this subject in Section 6.

The line search rules of Step 5 are modifications of the rules
(3.3.9)-(3.3.11) of the method with subgradient locality measures. We
shall prove in the next section that the line search is always entered
with

   v̂^k = f̂_a^k(x^k + d^k) − f(x^k) ≤ v^k < 0.                 (3.20)

Thus v^k < 0 is an estimate of the directional derivative of f at x^k
in the direction d^k, and the criteria (3.7)-(3.9) may be interpreted
similarly to the line search criteria of Section 3.3. It will be seen
that the additional requirement (3.10) ensures that the subgradi-
ent g^{k+1} = g_f(y^{k+1}) is calculated sufficiently close to the next point
x^{k+1}. We show below that the variable θ^k decreases and s^k is constant
whenever the algorithm executes a series of null steps at deblocking,
i.e. when the algorithm is blocked at x^k due to the nondifferentiabili-
ty of f. Then (3.10a) ensures that the new subgradient information is
increasingly local, which yields better search directions. We may add
that the methods with subgradient locality measures of Chapter 3 had
no need for such criteria, since the distance |y^{k+1} − x^{k+1}| was account-
ed for by using a requirement of the form (3.8b), but with α(x^{k+1},
y^{k+1}) depending on |y^{k+1} − x^{k+1}|.

The following modification of Line Search Procedure 3.3.2 may be
used for finding stepsizes t_L = t_L^k and t_R = t_R^k satisfying the require-
ments of Step 5.

Line Search Procedure 3.2.

(i) Set t_L = 0 and t = t_u = min{1, ā/|d^k|}.

(ii) If f(x^k + t d^k) ≤ f(x^k) + m_L t v^k set t_L = t; otherwise set t_u = t.

(iii) If t_L ≥ ā set t_R = t_L and return.

(iv) If −α(x^k + t_L d^k, x^k + t d^k) + <g_f(x^k + t d^k), d^k> ≥ m_R v^k and either t_L = 0
and t|d^k| ≤ θ^k s^k, or t − t_L ≤ ζ t_L, then set t_R = t and return.

(v) Set t = t_L + ½(t_u − t_L) and go to (ii).
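An executable transcription of the procedure may clarify the control flow. The sketch below is our own (f and gf are caller-supplied oracles for the objective and an arbitrary subgradient g_f); it follows steps (i)-(v) directly, with a safeguard on the number of bisections:

```python
# Sketch of Line Search Procedure 3.2 (illustrative, not the book's code).
# v < 0 is the predicted descent v^k; theta_s stands for theta^k * s^k.

def line_search(f, gf, x, d, v, m_L, m_R, abar, zeta, theta_s, max_iter=100):
    dnorm = sum(di * di for di in d) ** 0.5
    tL, tu = 0.0, min(1.0, abar / dnorm)
    t = tu                                           # (i)
    for _ in range(max_iter):
        xt = [xi + t * di for xi, di in zip(x, d)]
        if f(xt) <= f(x) + m_L * t * v:              # (ii): sufficient descent
            tL = t
        else:
            tu = t
        if tL >= abar:                               # (iii): long serious step
            return tL, tL
        xL = [xi + tL * di for xi, di in zip(x, d)]
        yt = [xi + t * di for xi, di in zip(x, d)]
        g = gf(yt)
        lin = abs(f(xL) - f(yt) - sum(gi * (a - b) for gi, a, b in zip(g, xL, yt)))
        if (-lin + sum(gi * di for gi, di in zip(g, d)) >= m_R * v
                and ((tL == 0.0 and t * dnorm <= theta_s)
                     or t - tL <= zeta * tL)):       # (iv)
            return tL, t
        t = tL + 0.5 * (tu - tL)                     # (v): bisection
    return tL, t
```

On f(x) = x² from x = 1 with d = −2 and v = −4, for instance, the first trial stepsize t = 1/2 already satisfies (ii) and (iv), giving a serious step to the minimizer x = 0.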

Convergence of the above procedure can be established similarly
to Lemma 3.3.3 under additional semismoothness assumptions.

Lemma 3.3. If f has the semismoothness property (3.3.23) then Line
Search Procedure 3.2 terminates with t_L = t_L^k and t_R = t_R^k satisfying
(3.7)-(3.10). (Here we use the properties 0 < m_L < m_R < 1, ā > 0, ζ ≥ 0,
0 < θ̄ < 1, θ^k s^k > 0 and v^k < 0.)

Proof. Use the proof of Lemma 3.3.3, observing that we must have either
t_L = 0 for all i and t^i|d^k| ≤ θ^k s^k for large i, or t^i − t_L ≤ ζ t_L for
large i. □

We refer the reader to Remark 3.3.4 concerning the semismoothness
assumptions.

We may add that in efficient versions of Line Search Procedure 3.2
one may use safeguarded interpolation instead of bisection for choosing
trial stepsizes; see Remark 3.3.5.
In Step 6 of the algorithm we update the variables θ^k and s^k
that localize, via the line search requirement (3.10a), the new sub-
gradient information at null steps. Upon completion of Step 6, the cur-
rent value of l is equal to the number of serious steps taken so
far. In general, we have k(l) < k(l+1) and

   x^k = x^{k(l)}  if k(l) ≤ k < k(l+1),                       (3.21a)

   t_L^k = 0  if k(l) ≤ k < k(l+1)−1,                          (3.21b)

   t_L^k > 0  if k = k(l+1)−1,                                 (3.21c)

   s^k = s^{k(l)}  if k(l) ≤ k < k(l+1),                       (3.21d)

   s^k = |x^{k(l+1)} − x^{k(l)}|  if k = k(l+1),               (3.21e)

where we set k(l+1) = +∞ if the number l of serious steps stays
bounded during the run of the algorithm, i.e. if t_L^k = 0 for some fixed
l and all k ≥ k(l). By (3.10), (3.21) and the rules of Step 6, we have

   |y^{k+1} − x^{k+1}| ≤ θ̄^{k+1−k(l)} s^{k(l)}  if k(l) ≤ k ≤ k(l+1)−2,   (3.22a)

   |y^{k+1} − x^{k+1}| ≤ ζ s^{k(l+1)}  if k+1 = k(l+1).        (3.22b)

Thus the value of |y^{k+1} − x^{k+1}| is always a fraction of the length of
the latest serious step.

The rules of Step 7 ensure that the algorithm uses at most M_g sub-
gradients for each search direction finding, so that at most M_g (N+2)-
vectors of the form (g^j, f_j^k, s_j^k) need to be stored. In fact, it will
be seen that, as far as convergence is concerned, the set Ĵ^k may con-
tain any indices corresponding to the subgradients used since the la-
test reset, i.e. the requirement (3.12b) is not important. At the same
time, the rule of Step 4(i), ensuring that only a finite number of the la-
test subgradients are retained after each reset, is crucial for global
convergence. We also note that the latest subgradient is always used
for search direction finding, i.e. we have

   k ∈ J^k,  g^k = g_f(y^k)  and  |y^k − x^k| ≤ ā  for all k.  (3.23)

This follows from (3.9), (3.12c) and the rules of Step 4.

4. Convergence

In this section we shall establish global convergence of Algorithm
3.1. We suppose that each execution of Line Search Procedure 3.2 is fi-
nite, e.g. that f is semismooth in the sense of (3.3.23). For conver-
gence results we assume that the final accuracy tolerance ε_s is set
to zero. As usually done, in the absence of convexity we will content
ourselves with finding stationary points for f. Our main result states
that Algorithm 3.1 either terminates at a stationary point, or genera-
tes an infinite sequence {x^k} whose accumulation points are stationa-
ry for f. If f is convex then {x^k} is a minimizing sequence, which
converges to a minimum point of f whenever f attains its infimum.

In the following we implicitly assume, unless otherwise stated,
that d^k, p^k, v^k, etc. denote the variables at Step 5 of Algorithm 3.1
at the k-th iteration, for any k.

As in Section 3.4, we shall use the following notation for describ-
ing resets of the algorithm. Recall that the condition r_a^k = 1 indicates
that a reset occurs at the k-th iteration, while λ_p^k = 0 means that the
(k−1)-st aggregate subgradient is dropped. Let

   k_r(k) = max{j : j ≤ k and r_a^j = 1},                      (4.1a)

   J_r^k = {j : k_r(k) ≤ j ≤ k},                               (4.1b)

   k_p(k) = max{j : j ≤ k and λ_p^j = 0},                      (4.2a)

   Ĵ_p^k = Ĵ_p^{k_p(k)} ∪ {j : k_p(k) < j ≤ k},                (4.2b)

   J_*^k = {j : k_r(k) − M_g + 2 ≤ j ≤ k},                     (4.3)

for all k.

The following lemma shows that the aggregate subgradient is a con-
vex combination of the subgradients that were used for direction fin-
ding since the latest reset.

Lemma 4.1. Suppose k ≥ 1 is such that Algorithm 3.1 did not stop before
the k-th iteration. Then

   (p^k, f̃_p^k) ∈ conv{(g^j, f_j^k) : j ∈ Ĵ_p^k},             (4.4)

   a^k ≥ max{s_j^k : j ∈ Ĵ_p^k},                               (4.5)

   |y^j − x^k| ≤ a^k  for all j ∈ Ĵ_p^k,                       (4.6)

   Ĵ_p^k ⊂ J_*^k.                                              (4.7)

Proof. One can establish (4.4)-(4.6) as in the proof of Lemma 3.4.1.
To prove (4.7), observe that r_a^k = 1 implies λ_p^k = 0, for any k. Therefore
(4.1) and (4.2) imply that k_p(k) ≥ k_r(k) and Ĵ_p^k ⊂ J_r^k for all k.
Hence it suffices to prove that J_r^k ⊂ J_*^k for any k. If r_a^k = 1 for some
k, then k_r(k) = k by (4.1a) and J^k ⊂ {j : k − M_g + 2 ≤ j} by the rules of
Step 4, hence J_r^k ⊂ J_*^k by (4.1) and (4.3). Since J_r^{k+1} = J_r^k ∪ {k+1}
and J_*^{k+1} = J_*^k ∪ {k+1} if k_r(k+1) < k+1, this proves that
J_r^k ⊂ J_*^k for all k. □
It is easy to observe that relations (4.4)-(4.6) imply

   p^k ∈ ∂f(x^k; a^k)                                          (4.8)

by the definition of the Goldstein subdifferential (see (2.16)). On
the other hand, in the convex case we have the following additional re-
sult.

Lemma 4.2. Suppose that f is convex and k ≥ 1 is such that Algorithm
3.1 did not stop before the k-th iteration. Then

   p^k ∈ ∂_ε f(x^k)  for  ε = f(x^k) − f̃_p^k = α̃_p^k ≥ 0.

Proof. As in the proof of Lemma 2.4.2, use (4.4) and the fact that
α̃_p^k = |f(x^k) − f̃_p^k|. □
First, we consider the case when the algorithm terminates.

Lemma 4.3. If Algorithm 3.1 stops at the k-th iteration, then x^k is
stationary for f.

Proof. If max{|p^k|, m_a a^k} ≤ ε_s = 0 then (4.8) and the fact that m_a > 0
yield 0 ∈ ∂f(x^k). □

From now on we suppose that the method constructs an infinite se-
quence {x^k}.
158

We shall now collect a few useful results. In Step 5 of the algo-
rithm we always have

   |p^k| > m_a a^k ≥ 0.                                        (4.9)

Since d^k = −p^k by (3.17), v^k = −{|p^k|² + α̃_p^k} by (3.5), and α̃_p^k ≥ 0 by
(3.4), we obtain from (4.9) that the line search is always entered
with

   d^k ≠ 0,                                                    (4.10)

   v^k < 0.                                                    (4.11)

This establishes (3.20). Moreover, the criterion (3.7) with m_L > 0 and
t_L^k ≥ 0 ensures that the sequence {f(x^k)} is nonincreasing and

   f(x^{k+1}) < f(x^k)  if x^{k+1} ≠ x^k.

Our next result states that the aggregate subgradient can be ex-
pressed as a convex combination of N+2 (not necessarily different) pa-
st subgradients.

Lemma 4.4. At the k-th iteration of Algorithm 3.1 there exist numbers
λ̄_i^k and vectors (y^{k,i}, f^{k,i}) ∈ R^N × R, i=1,...,M, M=N+2, satisfying

   (p^k, f̃_p^k) = Σ_{i=1}^M λ̄_i^k (g_f(y^{k,i}), f^{k,i}),

   λ̄_i^k ≥ 0, i=1,...,M,  Σ_{i=1}^M λ̄_i^k = 1,                (4.12)

   (y^{k,i}, f^{k,i}) ∈ {(y^j, f_j^k) : j ∈ Ĵ_p^k}, i=1,...,M,

   max{|y^{k,i} − x^k| : i=1,...,M} ≤ a^k.

Proof. The assertion follows from Lemma 4.1, Caratheodory's theorem
(Lemma 1.2.1), and the fact that g^j = g_f(y^j) for all j. □

Comparing Lemma 4.4 with Lemma 3.4.3 we see that the only differ-
ence stems from the fact that we are now considering pairs (p^k, f̃_p^k)
instead of triples (p^k, f̃_p^k, s̃_p^k).

To deduce stationarity results from the representation (4.12), we
shall need the following lemma, which is similar to Lemma 3.4.4.

Lemma 4.5. Let x̄ ∈ R^N be given and suppose that the following hypoth-
esis is fulfilled:

there exist N-vectors p̄, ȳ^i, ḡ^i for i=1,...,M=N+2, and numbers
f̄_p, f̄^i, λ̄_i, s̄_i, i=1,...,M, satisfying

   (p̄, f̄_p) = Σ_{i=1}^M λ̄_i (ḡ^i, f̄^i),                     (4.13a)

   λ̄_i ≥ 0, i=1,...,M,  Σ_{i=1}^M λ̄_i = 1,                    (4.13b)

   ḡ^i = g_f(ȳ^i), i=1,...,M,                                  (4.13c)

   f̄^i = f(ȳ^i) + <ḡ^i, x̄ − ȳ^i>, i=1,...,M,                 (4.13d)

   s̄_i = |ȳ^i − x̄|, i=1,...,M,                                (4.13e)

   max{s̄_i : λ̄_i ≠ 0} = 0.                                    (4.13f)

Then p̄ ∈ ∂f(x̄) and f̄_p = f(x̄).

Proof. Since s̄_p = Σ_{i=1}^M λ̄_i s̄_i = 0, we may use part (i) of the proof of Lemma
3.4.4. □

The following lemma states useful asymptotic properties of the
aggregate subgradients.

Lemma 4.6. Suppose that there exist a point x̄ ∈ R^N, a number a > 0 and
an infinite set K ⊂ {1,2,...} satisfying x^k → x̄ (k ∈ K) and a^k ≤ a for
all k ∈ K. Then there exist an infinite set K̄ ⊂ K and numbers s̄_i ≤
liminf_{k∈K̄} a^k, i=1,...,N+2, such that the hypothesis (4.13a)-(4.13e) is ful-
filled at x̄ and

   (p^k, f̃_p^k) → (p̄, f̄_p)  (k ∈ K̄).

If additionally a^k → 0 (k ∈ K) then p̄ ∈ ∂f(x̄) and α̃_p^k → 0 (k ∈ K̄).

Proof. Using Lemma 4.4 and Lemma 4.5, let s^{k,i} = |y^{k,i} − x^k| for i=1,..
.,M and k ∈ K, and argue as in the proof of Lemma 3.4.6. □

The following result is crucial for establishing convergence of the
method. Define the stationarity measure

   w^k = ½|p^k|² + α̃_p^k                                      (4.14)

at the k-th iteration (at Step 5) of Algorithm 3.1, for all k.

Lemma 4.7. (i) Suppose that for some point x̄ ∈ R^N we have

   liminf_{k→∞} max{w^k, |x̄ − x^k|} = 0,                      (4.15)

or equivalently

   there exists an infinite set K ⊂ {1,2,...} such that
   x^k → x̄ and w^k → 0 (k ∈ K).                               (4.16)

Then 0 ∈ ∂f(x̄).

(ii) Relations (4.15) and (4.16) are equivalent to the following

   liminf_{k→∞} max{|p^k|, |x̄ − x^k|} = 0.                    (4.17)

Proof. (i) The equivalence of (4.15) and (4.16) follows from the non-
negativity of w^k and |x̄ − x^k|. If (4.16) holds, then w^k = ½|p^k|² +
α̃_p^k → 0 (k ∈ K), so |p^k| → 0 (k ∈ K) by the nonnegativity of α̃_p^k. Since we always
have |p^k| > m_a a^k ≥ 0 at Step 5 (see (4.9)) and m_a > 0 is fixed, we
have |p^k| → 0, a^k → 0 and x^k → x̄ (k ∈ K). Consequently, 0 ∈ ∂f(x̄) by
Lemma 4.6. Also max{|p^k|, |x̄ − x^k|} → 0 (k ∈ K), hence we have shown that (4.16)
implies (4.17).

(ii) It remains to show that (4.17) implies (4.16). Suppose that (4.17)
holds. Then |p^k| → 0 and x^k → x̄ (k ∈ K) for some infinite set K ⊂ {1,2,
...}. Since 0 ≤ a^k ≤ |p^k|/m_a, we obtain a^k → 0 (k ∈ K). Then Lemma 4.6 yields
α̃_p^k → 0 (k ∈ K), hence w^k = ½|p^k|² + α̃_p^k → 0 (k ∈ K). Thus (4.16) holds, as required. □

The above lemma, which is similar to Lemma 3.4.7, will enable us
to reduce further convergence analysis to verifying whether the stationari-
ty measures w^k vanish in the neighbourhood of an arbitrary accumula-
tion point x̄ of {x^k}. Therefore, as in the preceding two chapters,
it is useful to relate the stationarity measures to the optimal va-
lues of the dual search direction finding subproblems.

Let ŵ^k denote the optimal value of the k-th dual search direc-
tion finding subproblem (3.16), for all k. Since the Lagrange multipli-
ers of (3.1) solve (3.16) and yield p^k via (3.3), we always have

   ŵ^k = ½|p^k|² + α̂_p^k,                                     (4.18a)

where

   α̂_p^k = Σ_{j∈J^k} λ_j^k α_j^k + λ_p^k α_p^k.               (4.18b)

The following lemma, which can be proved similarly to Lemma 3.4.8,
shows that ŵ^k majorizes w^k.

Lemma 4.8. (i) At the k-th iteration of Algorithm 3.1, one has

   0 ≤ α̃_p^k ≤ α̂_p^k,                                        (4.19a)

   0 ≤ w^k ≤ ŵ^k,                                              (4.19b)

   v^k ≤ −w^k ≤ 0,                                             (4.19c)

   v̂^k ≤ v^k.                                                  (4.19d)

(ii) If f is convex then α̃_p^k = α̂_p^k, w^k = ŵ^k and v^k = v̂^k, for all k.

Consider the following condition:

   there exist a point x̄ ∈ R^N and an infinite set K ⊂ {1,2,...}
   such that x^k → x̄ (k ∈ K).                                 (4.20)

In view of (3.21a), (4.20) implies that either of the following two
disjoint cases must arise:

   there exists an infinite set L ⊂ {1,2,...} such that
   x^{k(l)} → x̄ as l → ∞, l ∈ L;                              (4.21a)

or

   there exists a fixed number l̄ such that x^k = x^{k(l̄)} = x̄
   for all k ≥ k(l̄).                                           (4.21b)

Conversely, (4.21) implies (4.20), so in fact (4.20) and (4.21) are equi-
valent.
Our aim is to show that (4.20) implies that w^k → 0 (k ∈ K̄) for some
K̄ ⊂ K. To this end we shall first analyze the case of "long" serious
steps in the following lemma, which can be established similarly to
Lemma 3.4.9 if one uses (3.7), (4.19c) and (3.21a).

Lemma 4.9. (i) If (4.20) holds then

   f(x^k) → f(x̄) as k → ∞,

   t_L^k v^k → 0 as k → ∞.

(ii) If (4.21a) is fulfilled and there exist a number t̂ > 0 and an
infinite set L̂ ⊂ L such that t_L^k ≥ t̂ for all k = k(l+1)−1 and l ∈ L̂,
then (4.16) holds.

Corollary 4.10. Suppose (4.20) is satisfied, but (4.15) does not hold,
i.e.

   liminf_{k→∞} max{w^k, |x̄ − x^k|} ≥ ε̄ > 0                   (4.22)

for some ε̄. Then

   max{t_L^k : k(l) ≤ k < k(l+1)} → 0 as l → ∞, l ∈ L.         (4.23)

Proof. The assertion follows from Lemma 4.9(ii), (3.21b), and the equi-
valence of (4.15) and (4.16) on the one hand, and of (4.20) and (4.21)
on the other. □

In view of the above result, we need only consider the case of
arbitrarily small stepsizes. Therefore, we shall now show that whenever the
value of t_L^k is small, i.e. no significant improvement in the objecti-
ve value occurs at the k-th iteration (cf. (3.7)), then w^{k+1} is a frac-
tion of w^k.

Lemma 4.11. Suppose that t_L^{k−1} < ā and r_a^k = 0 for some k > 1. Then

   w^k ≤ ŵ^k ≤ Φ_C(w^{k−1}) + |α_p^k − α̃_p^{k−1}|,            (4.24)

where the function Φ_C is defined by

   Φ_C(t) = t − (1 − m_R)² t² / (8C²),

and C is any number satisfying

   max{|p^{k−1}|, |g^k|, α̃_p^{k−1}, 1} ≤ C.                   (4.25)

Proof. Use the proof of Lemma 3.4.11. □

To obtain locally uniform bounds of the form (4.24) we shall need
the following result, which is a consequence of (3.23). Its proof, which
is similar to the proof of Lemma 3.4.13, is left to the reader.

Lemma 4.12. The assertions of Lemma 3.4.13 remain valid for Algorithm
3.1. In particular,

   max{|p^k|, |g^k|, α̃_p^k, 1} ≤ C  if |x^k − x̄| ≤ ā,         (4.26)

where ā is the line search parameter involved in (3.9), and C is the
constant defined in Lemma 3.4.13.

Our next result demonstrates that the term |α_p^k − α̃_p^{k−1}| involved
in (4.24) vanishes in the limit.

Lemma 4.13. Suppose that (4.20) holds. Then

   |α_p^{k+1} − α̃_p^k| → 0 as k → ∞.

Proof. Since α_p^{k+1} = |f(x^{k+1}) − f_p^{k+1}| and α̃_p^k = |f(x^k) − f̃_p^k| for all k, one
may use Lemma 4.9(i) to obtain the desired conclusion as in the proof
of Lemma 3.4.14. □

The following lemma will imply that the decrease of stationarity
measures established in Lemma 4.11 takes place for sufficiently
many consecutive iterations, provided that no reset occurs.

Lemma 4.14. Suppose that (4.21a) and (4.22) hold. Then:

(i) For any fixed integer m ≥ 0 there exists a number l̄_m such that

   max{|x̄ − x^k| : k(l) ≤ k < k(l+m)} → 0 as l → ∞, l ∈ L,    (4.27a)

   min{w^k : k(l) ≤ k < k(l+m+1)} ≥ ε̄/2 for all l ≥ l̄_m, l ∈ L,   (4.27b)

   max{t_L^k : k(l) ≤ k < k(l+m+1)} → 0 as l → ∞, l ∈ L,       (4.27c)

   max{s^k : k(l+1) ≤ k < k(l+m+1)} → 0 as l → ∞, l ∈ L,       (4.28a)

   max{|y^{k+1} − x^{k+1}| : k(l+1) ≤ k < k(l+m+1)} → 0 as l → ∞, l ∈ L.  (4.28b)

(ii) For any numbers k̃, Ñ and ε̃ > 0 there exists a number k̄ ≥ k̃
such that k̄ ∈ K̄ = {k(l+1)−1 : l ∈ L} and

   w^k ≥ ε̄/2 for k = k̄,...,k̄+Ñ,

   max{|p^{k−1}|, |g^k|, α̃_p^{k−1}, 1} ≤ C for k = k̄,...,k̄+Ñ,
                                                               (4.29)
   |α_p^k − α̃_p^{k−1}| ≤ ε̃ for k = k̄,...,k̄+Ñ,

   t_L^{k−1} < ā for k = k̄,...,k̄+Ñ,

where C is the constant defined in Lemma 3.4.13.

Proof. (4.27) can be proved by using Corollary 4.10 and Lemma 4.12
as in the proof of Lemma 3.4.12. Then (4.28) follows from (4.27),
(3.21), (3.22) and the fact that θ̄ ∈ (0,1). Part (ii) of the lemma
follows from part (i), (4.26) and Lemma 4.13. □

Since the above lemma dealt with the case of an infinite number
of serious steps, we now have to analyze the remaining case.

Lemma 4.15. Suppose that (4.21b) holds. Then t_L^k = 0 for each k ≥ k(l̄)
and y^k → x̄ as k → ∞. If additionally (4.22) holds then the second
assertion of Lemma 4.14 is true.

Proof. Suppose that (4.21b) holds. Then, since we always have x^{k+1} =
x^k + t_L^k d^k and d^k ≠ 0 (see (4.10)), we deduce from the line search rule
(3.10a) that t_L^k = 0 and |y^{k+1} − x^{k+1}| = |y^{k+1} − x̄| ≤ θ^k s^k for all k ≥ k(l̄).
But t_L^k = 0 always yields θ^{k+1} = θ̄θ^k and s^{k+1} = s^k, hence θ^k → 0 (0 < θ̄ < 1)
and y^k → x̄ as k → ∞. Having established that |x^k − x̄| → 0, t_L^k → 0 and
|y^{k+1} − x^{k+1}| → 0 as k → ∞, we see that the proof of the second assertion
of Lemma 4.14 remains valid in this case. □

Up till now we have concentrated on the relations between the varia-
bles at Step 5 of the algorithm and their impact on convergence. We now
turn our attention to the properties of variables generated at each
search direction finding. Recall that Step 1 of the method may be re-
peated if a reset occurs at any iteration.

Lemma 4.16. For any k ≥ 1, let b^k denote the minimum value taken on
by max{|p^k|, m_a a^k} for the successive values of p^k and a^k calcula-
ted at each execution of Step 1 of Algorithm 3.1 at the k-th iteration.
(The variable b^k is well-defined, because there can be only finitely
many returns to Step 1 at any iteration.) Suppose that

   liminf_{k→∞} max{b^k, |x̄ − x^k|} = 0                       (4.30)

for some x̄ ∈ R^N. Then 0 ∈ ∂f(x̄).

Proof. For any k ≥ 1, let a^k and p^k have the values calculated at
the first execution of Step 1 such that b^k = max{|p^k|, m_a a^k}. One easi-
ly checks that Lemma 4.1, Lemma 4.4 and Lemma 4.6 remain valid for the
variables generated by this particular execution of Step 1, for all k.
Therefore one may use the proof of Lemma 4.7(i) to obtain the desired
conclusion. □

Remark 4.17. One may check that, except for Lemma 4.7, all the preced-
ing results of this section hold also for variables generated by each
execution of Step 1 at any iteration. Lemma 4.7 assumes that |p^k| > m_a a^k,
hence it deals only with the variables calculated by the last execu-
tion of Step 1 at any iteration.

We are now ready to prove the p r i n c i p a l result of this section.

Lemma 4.18. Suppose that (4.20) holds. Then at least one of the relations (4.15) and (4.30) is satisfied.

Proof. Suppose that (4.20) holds. For purposes of a proof by contradiction, assume that neither (4.15) nor (4.30) is fulfilled. Thus suppose that there exist positive constants ε̄ and ε_p satisfying (4.22) and

liminf_{k→∞} max{b^k, |x̄ − x^k|} ≥ ε_p.                           (4.31)

(i) Let ε_w = ε̄/2 > 0 and choose ε = ε(ε_w, C) and N̄ = N̄(ε_w, C) < +∞, N̄ ≥ 1, as specified in Lemma 3.4.12, where C is the constant defined in Lemma 3.4.13.

(ii) Let N = 10(N̄ + M_g). Combining (4.22), (4.31), Lemma 4.14 and Lemma 4.15 with the fact that ε_p/(2m_a) > 0, we deduce the existence of k̄ satisfying (4.29) and

max{s_j^j : j = k̄,...,k̄+N} + Σ_{j=k̄}^{k̄+N} |x^{j+1} − x^j| < ε_p/(2m_a),   (4.32a)

b^k ≥ ε_p/2 for k = k̄,...,k̄+N,                                    (4.32b)

since s_j^j = |y^j − x^j| for all j, m_a > 0, ε_p > 0 and N < +∞.

(iii) Suppose that there exists a number k̂ satisfying

k̂ ∈ [k̄, k̄+N−2N̄],                                                 (4.33)

r_a^k = 0 for all k ∈ [k̂, k̂+N̄].                                   (4.34)

Then (4.29), (4.34), Lemma 4.11, Lemma 3.4.12 and our choice of ε and N̄ imply that w^k ≤ ε_w = ε̄/2 for some k ∈ [k̂, k̂+N̄], which contradicts (4.33) and (4.29). Consequently, we have shown that for any number k̂ satisfying (4.33) we have r_a^k = 1 for some k ∈ [k̂, k̂+N̄].

(iv) Let k_1 = k̄ + 2M_g. Then part (iii) of the proof implies that r_a^{k_2−1} = 1 for some k_2 ∈ [k_1, k_1+N̄]. Since k_2 − M_g > k̄ and r_a^{k_2−1} = 1, we obtain from (4.1), (4.3) and Lemma 4.1 (see Remark 4.17) that

J^k ⊂ [k̄, k̄+N] for all k ∈ [k_2, k_2+N̄],                          (4.35a)

a^k ≤ max{s_j^k : j ∈ J^k} for all k ∈ [k_2, k_2+N̄].               (4.35b)

We want to stress that (4.35) holds for the values of a^k and J^k computed by every execution of Step 1 at the k-th iteration, k = k_2,...,k_2+N̄. Also, in view of (4.32b) and the definition of b^k (see Lemma 4.16), we have

max{|p^k|, m_a a^k} ≥ ε_p/2 for k = k_2,...,k_2+N̄,                 (4.36)

i.e. the inequality in (4.36) holds on each entrance to Step 3 at the k-th iteration, for k = k_2,...,k_2+N̄. From (4.35), (4.32a) and the fact that we always have s_j^k = s_j^j + Σ_{i=j}^{k−1} |x^{i+1} − x^i|, we deduce that at Step 3

a^k ≤ max{s_j^j : j ∈ J^k} + Σ_{j=k̄}^{k̄+N} |x^{j+1} − x^j| < ε_p/(2m_a)   (4.37)

for k = k_2,...,k_2+N̄. Using this in (4.36), we get that |p^k| > m_a a^k on each entrance to Step 3 at the k-th iteration, for k = k_2,...,k_2+N̄, so

r_a^k = 0 for k = k_2,...,k_2+N̄.                                   (4.38)

(v) Since k̂ = k_2 satisfies (4.33), from part (iv) of the proof we deduce a contradiction with (4.38). Therefore, either (4.15) or (4.30) must hold.
Combining Lemma 4.18 with Lemma 4.7 and Lemma 4.16, we obtain

Theorem 4.19. Every accumulation point of the sequence {x^k} generated by Algorithm 3.1 is stationary for f.

In the convex case, the above result can be strengthened as follows.

Theorem 4.20. If f is convex then Algorithm 3.1 constructs a minimizing sequence {x^k}, i.e. f(x^k) → inf{f(x) : x ∈ R^N}. Moreover, if f attains its infimum then {x^k} converges to a minimum point of f.

Proof. Similar to the proof of Theorem 3.4.18.

The following result validates the stopping test of the method.

Corollary 4.21. If the level set S_f = {x ∈ R^N : f(x) ≤ f(x^1)} is bounded and the final accuracy tolerance ε_s is positive, then Algorithm 3.1 terminates in a finite number of iterations.

Proof. If the assertion were false, then the infinite sequence {x^k} ⊂ S_f would have an accumulation point, say x̄. Then Lemma 4.18, Lemma 4.7 and Lemma 4.16 would yield liminf_{k→∞} max{|p^k|, m_a a^k} = 0, so the algorithm should stop owing to max{|p^k|, m_a a^k} ≤ ε_s for large k.
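In outline, the stopping criterion of Step 2 and the resetting test of Step 3 interact as follows. The sketch below is only our own illustration (the function name and argument layout are not from the text), assuming |p^k| and a^k have already been computed at the current execution of Step 1:

```python
def step2_step3(p_norm, a, m_a, eps_s):
    """Sketch of Steps 2-3 of Algorithm 3.1: decide whether to stop,
    reset (go to Step 4) or proceed to the line search (Step 5)."""
    if max(p_norm, m_a * a) <= eps_s:   # stopping criterion of Step 2
        return "stop"
    if p_norm <= m_a * a:               # resetting test of Step 3
        return "reset"
    return "line_search"
```

Near a stationary point |p^k| decreases rapidly while a^k does not, so the "reset" branch fires more and more often; this is exactly the practical drawback discussed at the beginning of Section 6.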

5. The Algorithm with Subgradient Selection

In this section we state and analyze the method with subgradient selection.

Algorithm 5.1.

Step 0 (Initialization). Select x^1 ∈ R^N and a final accuracy tolerance ε_s ≥ 0. Choose positive parameters m_a, m_L, m_R, ā, t̄, θ̄, M_g and s̄ with t̄ ≤ 1, m_L < m_R < 1, θ̄ < 1 and M_g ≥ N+2. Set θ^1 = 1, r_a^1 = 0, J^1 = {1}, y^1 = x^1, g^1 = g_f(y^1), f_1^1 = f(y^1) and s_1^1 = 0. Set k = 1, l = 0 and k(0) = 1.

Step 1 (Direction finding). Find the solution (d^k, v̂^k) to the following k-th quadratic programming subproblem:

minimize ½|d|² + v̂ over (d, v̂) ∈ R^{N+1},
subject to −α_j^k + <g^j, d> ≤ v̂, j ∈ J^k,                         (5.1)

where

α_j^k = |f(x^k) − f_j^k| for j ∈ J^k.                               (5.2)

Find Lagrange multipliers λ_j^k, j ∈ J^k, of (5.1) and a set Ĵ^k satisfying

Ĵ^k ⊃ {j ∈ J^k : λ_j^k ≠ 0},                                       (5.3a)

|Ĵ^k| ≤ M_g − 1.                                                   (5.3b)

Set

a^k = max{s_j^k : j ∈ J^k}.                                        (5.4)

Step 2 (Stopping criterion). If max{|d^k|, m_a a^k} ≤ ε_s then terminate. Otherwise, go to Step 3.

Step 3 (Resetting test). If |d^k| ≤ m_a a^k then go to Step 4; otherwise, go to Step 5.

Step 4 (Resetting). (i) Replace J^k by J^k ∩ {j : j ≥ k − M_g + 1} and set r_a^k = 1.
(ii) If |J^k| > 1 then delete the smallest number from J^k and go to Step 1.
(iii) Set y^k = x^k, g^k = g_f(y^k), f_k^k = f(y^k), s_k^k = 0, J^k = {k} and go to Step 1.
Step 5 (Line search). Find two nonnegative stepsizes t_L^k and t_R^k and the two corresponding points x^{k+1} = x^k + t_L^k d^k and y^{k+1} = x^k + t_R^k d^k such that

f(x^{k+1}) ≤ f(x^k) + m_L t_L^k v̂^k,                              (5.5a)

t_R^k = t_L^k if t_L^k ≥ t̄,                                       (5.5b)

−α(x^{k+1}, y^{k+1}) + <g_f(y^{k+1}), d^k> ≥ m_R v̂^k if t_L^k < t̄,   (5.5c)

|y^{k+1} − x^{k+1}| ≤ ā,                                           (5.5e)

|y^{k+1} − x^{k+1}| ≤ θ^k s^k if t_L^k = 0,                        (5.5f)

|y^{k+1} − x^{k+1}| ≤ θ̄|x^{k+1} − x^k| if t_L^k > 0.              (5.5g)

Step 6. If t_L^k = 0, set s^{k+1} = s^k and θ^{k+1} = θ̄θ^k. If t_L^k > 0, then set s^{k+1} = |x^{k+1} − x^k|, θ^{k+1} = 1, k(l+1) = k+1 and increase l by 1.

Step 7 (Subgradient updating). Set J^{k+1} = Ĵ^k ∪ {k+1}, g^{k+1} = g_f(y^{k+1}),

f_{k+1}^{k+1} = f(y^{k+1}) + <g^{k+1}, x^{k+1} − y^{k+1}>,

f_j^{k+1} = f_j^k + <g^j, x^{k+1} − x^k> for j ∈ Ĵ^k,

s_{k+1}^{k+1} = |y^{k+1} − x^{k+1}|,

s_j^{k+1} = s_j^k + |x^{k+1} − x^k| for j ∈ Ĵ^k.

Step 8. Set r_a^{k+1} = 0, increase k by 1 and go to Step 1.
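By way of illustration, the dual of a subproblem of the form (5.1) minimizes ½|Σ_j λ_j g^j|² + Σ_j λ_j α_j^k over the unit simplex. For a bundle of just two scalar subgradients this reduces to a one-variable quadratic with a closed-form minimizer. The sketch below is our own construction (it is not the dual procedure (3.5.11)-(3.5.14) of the text; the one-dimensional setting and all names are assumptions); it recovers d^k = −p^k and the predicted descent v̂^k = −(|p^k|² + Σ_j λ_j α_j^k):

```python
def direction_find_two(g1, g2, a1, a2):
    # Dual of a two-element instance of (5.1) in one dimension: minimize
    # over lam in [0, 1] the quadratic
    #     0.5 * (lam*g1 + (1-lam)*g2)**2 + lam*a1 + (1-lam)*a2.
    dg = g1 - g2
    if dg == 0.0:
        lam = 0.0 if a2 <= a1 else 1.0      # degenerate: pick the cheaper one
    else:
        # stationary point of the quadratic, clipped to the simplex [0, 1]
        lam = (-g2 * dg - (a1 - a2)) / (dg * dg)
        lam = min(1.0, max(0.0, lam))
    p = lam * g1 + (1.0 - lam) * g2          # aggregate subgradient p^k
    v = -(p * p + lam * a1 + (1.0 - lam) * a2)
    return -p, v, lam                        # d^k = -p^k, predicted descent v
```

For g^1 = 1, g^2 = −1 and α_1 = α_2 = 0 (the bundle of f(x) = |x| at x = 0) the multipliers balance at λ = ½ and d^k = 0, correctly signalling stationarity.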

We shall now compare the above method with Algorithm 3.1. It is easy to observe that Algorithm 5.1 is related to Algorithm 3.1 in the same way as Algorithm 3.5.1 is related to Algorithm 3.3.1. Therefore, we may shorten our discussion of Algorithm 5.1 by using suitable modifications of the results and remarks of Section 3.5.

We refer the reader to Remark 2.5.2 for a discussion of possible ways of finding the k-th Lagrange multipliers satisfying the requirement (5.3). To this end we observe that such multipliers exist, since M_g satisfies N+1 ≤ M_g − 1 by assumption (see also Remark 3.5.2). In Step 1 of the method one may also solve the dual of the k-th search direction finding subproblem (5.1), which is of the form (3.5.11), and then recover (d^k, v̂^k) via (3.5.12)-(3.5.14).
As in Section 3.5, we may derive useful relations between variables generated by Algorithm 5.1 by setting

λ_p^k = 0 for all k

in the corresponding results of Section 3 and Section 4. Thus, defining at the k-th iteration of Algorithm 5.1 the variables

(p^k, f̃_p^k) = Σ_{j∈Ĵ^k} λ_j^k (g^j, f_j^k),

ε̃_p^k = f(x^k) − f̃_p^k,

α̂_p^k = Σ_{j∈Ĵ^k} λ_j^k α_j^k,                                    (5.6)

w^k = ½|p^k|² + ε̃_p^k,

ŵ^k = ½|p^k|² + α̂_p^k,

v^k = −(|p^k|² + ε̃_p^k)

for all k, we obtain

d^k = −p^k,                                                        (5.7a)

ε̃_p^k ≤ α̂_p^k,                                                   (5.7b)

w^k ≤ ŵ^k,                                                        (5.7c)

v̂^k ≤ v^k,                                                        (5.7d)

cf. (3.5.12)-(3.5.25).

In view of (5.7a), the stopping criteria and the resetting tests of Algorithm 3.1 and Algorithm 5.1 are equivalent.

Observe that the line search criteria (5.5) can be derived by substituting v̂^k for v^k in the criteria (3.7)-(3.10). Therefore, by replacing v^k with v̂^k in Line Search Procedure 3.2 we obtain a procedure for executing Step 5 of Algorithm 5.1. Since the line search is always entered with v̂^k < 0 (cf. (3.20) and (5.7)), Lemma 3.3 remains valid for this modification.

We also note that the subgradient deletion rules of Algorithm 5.1 ensure that at most M_g latest subgradients are retained after each reset, and that the latest subgradient g^k = g_f(y^k) with |y^k − x^k| ≤ ā is always used for search direction finding, i.e. (3.23) holds.

We shall now analyze convergence of Algorithm 5.1.

Theorem 5.2. Theorem 4.19, Theorem 4.20 and Corollary 4.21 are true for Algorithm 5.1.

Proof. One can prove the theorem by modifying the results of Section 4 similarly as we modified in Section 3.5 the results of Section 3.4 to establish convergence of Algorithm 3.5.1. For instance, Lemma 4.11 should be replaced by Lemma 3.5.3, and the expression |ε̃_p^k − ε̃_p^{k−1}| in (4.29) by Δ^k defined in (3.5.29), with the corresponding part of the proof of Lemma 4.14 being changed to the form of the proof of Lemma 3.5.4. To save space, we leave the details to the reader.

6. Modified Resetting Strategies

The resetting of Algorithm 3.1 and Algorithm 5.1 is crucial for obtaining strong results on their convergence. In this section we shall consider earlier resetting strategies due to Wolfe (1975) and Mifflin (1977b). It turns out that these strategies can be easily analyzed. However, at present it seems impossible to establish for the resulting algorithms, which include those in (Wolfe, 1975; Mifflin, 1977b) as special cases, global convergence results similar to Theorem 4.19, even under additional assumptions. We also propose a new resetting strategy based on aggregating the distance measures s_j^k as in Algorithm 3.3.1. This strategy, on the one hand, may be more efficient in practice, and on the other hand, retains all the preceding global convergence results.

To motivate the subsequent theoretical developments, we start with the following practical observation. Algorithm 3.1 has a certain drawback that slows down its convergence in practice. This drawback stems from the definition of the locality radius a^k (estimating the radius of the ball around x^k from which the past subgradient information was accumulated), which resets the algorithm whenever |p^k| ≤ m_a a^k. Namely, the values of a^k are nondecreasing between every two consecutive resets, while in the neighborhood of a solution the values of |p^k| decrease rapidly, thus forcing frequent resets due to |p^k| ≤ m_a a^k. Too frequent reduction of the past subgradient information by discarding the aggregate subgradient hinders convergence, especially when, due to storage limitations, only a small number M_g (M_g << N) of past subgradients is used for search direction finding. This drawback is eliminated to a certain extent in the following modification of Algorithm 3.1.
Before stating the modified method, let us briefly recall the basic tasks of subgradient deletion rules. In this chapter we concentrated on rules for localizing the accumulated subgradient information. Such rules ensure that polyhedral functions of the form (2.8) and (2.22) are close approximations to f in a neighborhood of x^k. On the other hand, in Chapter 3 we used resetting tests of the form a^k ≤ ā to ensure locally uniform boundedness of the subgradients that were aggregated at any iteration. Observe that in the methods of this chapter there was no need for distance resets through a^k > ā, since we had a^k ≤ |p^k|/m_a and |p^k| was locally bounded (cf. Lemma 4.12). However, if we replace the resetting test |p^k| ≤ m_a a^k with some other test, then we shall no longer have estimates of the form a^k ≤ |p^k|/m_a. Therefore, we shall additionally use a distance resetting test of the form a^k ≤ ā. For a sufficiently large value of ā, say ā = 10³, a reduction of the past subgradient information due to a^k being larger than ā will occur infrequently.
To derive a new resetting strategy, suppose that in Step 1 of Algorithm 3.1 we calculate the aggregate distance measure s̃_p^k by setting

(p^k, f̃_p^k, s̃_p^k) = Σ_{j∈J^k} λ_j^k (g^j, f_j^k, s_j^k) + λ_p^k (p^{k−1}, f_p^k, s_p^k)   (6.1)

and then calculating

s_p^{k+1} = s̃_p^k + |x^{k+1} − x^k|                               (6.2)

in Step 7, for all k. Then, according to Lemma 3.4.1, we shall have

(p^k, f̃_p^k, s̃_p^k) = Σ_{j∈J̃^k} λ̃_j^k (g^j, f_j^k, s_j^k),      (6.3a)

λ̃_j^k ≥ 0 for j ∈ J̃^k, Σ_{j∈J̃^k} λ̃_j^k = 1,                    (6.3b)

max{|y^j − x^k| : j ∈ J̃^k} ≤ max{s_j^k : j ∈ J̃^k} = a^k,          (6.3c)

where

J̃^k = J^{k_p(k)} ∪ {j : k_p(k) < j ≤ k},

k_p(k) = max{j : j ≤ k and λ_p^j = 0}.

Thus the aggregate distance measure s̃_p^k, being a convex combination of the distance measures s_j^k, is no greater than the locality radius a^k, which is the maximum of the corresponding s_j^k's. The value of s̃_p^k, in contrast with that of a^k, can decrease even if no reset occurs. Also, the value of s̃_p^k can be small even if the value of s_j^k is large for some j ∈ J̃^k, provided that the value of λ̃_j^k is small. This means that if s̃_p^k has a small value then only local past subgradients g^j (with relatively small s_j^k ≥ |y^j − x^k|) contribute significantly to the current direction d^k = −p^k (have large λ̃_j^k in (6.3a)). For this reason we shall reset the method only if

|p^k| ≤ m_a s̃_p^k.                                                (6.4)

This will decrease the frequency of resettings in comparison with the one that would occur if the test |p^k| ≤ m_a a^k were used.
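In scalar form, the aggregation (6.1) and the comparison of the new test (6.4) with the old test |p^k| ≤ m_a a^k can be sketched as follows (our own illustration; the names, the one-dimensional setting and the argument layout are assumptions, with lam_p standing for the multiplier λ_p^k of the previous aggregate):

```python
def aggregate_and_test(lams, lam_p, subgrads, dists, p_prev, s_p_prev, m_a):
    # Sketch of (6.1) in one dimension: the aggregate subgradient p^k and
    # the aggregate distance measure s~_p^k are the same convex combination
    # of the bundle entries (g^j, s_j^k) and the previous aggregate.
    p = sum(l * g for l, g in zip(lams, subgrads)) + lam_p * p_prev
    s_tilde = sum(l * s for l, s in zip(lams, dists)) + lam_p * s_p_prev
    a = max(dists)                       # locality radius: max of the s_j^k
    new_reset = abs(p) <= m_a * s_tilde  # the new test (6.4)
    old_reset = abs(p) <= m_a * a        # the old test of Algorithm 3.1
    return p, s_tilde, new_reset, old_reset
```

With a far-away subgradient that is almost inactive (large s_j^k but small λ_j^k), s̃_p^k stays small while a^k is large, so (6.4) can avoid a reset that the old test would force.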
To save space, we shall now describe the modified resetting strategy in detail by using the notation of Algorithm 3.1.

Algorithm 6.1 is obtained from Algorithm 3.1 as follows. In Step 1 we substitute relation (6.1) for (3.3). In Step 2 we replace the stopping criterion max{|p^k|, m_a a^k} ≤ ε_s by the criterion

max{|p^k|, m_a s̃_p^k} ≤ ε_s.                                      (6.5)

The resetting test |p^k| ≤ m_a a^k of Step 3 is replaced by the test (6.4). In Step 7 we update the aggregate distance measure via (6.2). Step 8 is replaced by the following two steps.

Step 8' (Distance resetting). If a^{k+1} ≤ ā then go to Step 9'. Otherwise, keep deleting from J^{k+1} the smallest indices until the reset value of a^{k+1} satisfies

a^{k+1} = max{s_j^{k+1} : j ∈ J^{k+1}} ≤ ā/2.

Set r_a^{k+1} = 1.

Step 9'. Increase k by 1 and go to Step 1.
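The deletion loop of Step 8' can be sketched as follows (our own illustration; dists is an assumed mapping from bundle index j to s_j^{k+1}, and the guard that keeps at least one entry stands in for the line search bounds, which keep the latest distance measure small):

```python
def distance_reset(dists, a_bar):
    # Sketch of Step 8': dists maps bundle index j to s_j^{k+1}.
    if max(dists.values()) <= a_bar:
        return dists, False                  # no distance reset needed
    # Keep deleting the smallest (oldest) indices until the locality
    # radius max s_j drops to a_bar/2 or less.
    while len(dists) > 1 and max(dists.values()) > a_bar / 2.0:
        del dists[min(dists)]                # drop the oldest entry
    return dists, True                       # reset occurred: r_a^{k+1} = 1
```

Since the distance measures s_j^{k+1} grow with j deleted in increasing order, the loop removes exactly the stale subgradients accumulated farthest in the past.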

Convergence of the above method is established in the following theorem.

Theorem 6.2. Algorithm 6.1 is globally convergent in the sense of Theorem 4.19, Theorem 4.20 and Corollary 4.21.

Proof. Since a formal proof of the theorem would involve lengthy repetitions of the results of Section 4 and Section 3.4, we give only an outline, hoping that the reader can fill in the necessary details. First, we observe that we always have |p^k| > m_a s̃_p^k and a^k ≤ ā at Step 5 of the method. Using this, one may replace Lemma 4.1 through Lemma 4.6 by Lemma 3.4.1 through Lemma 3.4.6, with the condition "s_p^k →_K 0" in Lemma 3.4.6 being replaced by the condition "s̃_p^k →_K 0". In the proof of Lemma 4.7, use the fact that s̃_p^k ≤ |p^k|/m_a at Step 5. In the formulation and the proof of Lemma 4.16, replace a^k by s̃_p^k. Finally, while proving Lemma 4.18, observe that s̃_p^k ≤ a^k from (6.3), replace (4.36) with

max{|p^k|, m_a s̃_p^k} ≥ ε_p/2 for k = k_2,...,k_2+N̄,

assume with no loss of generality that ε_p/(2m_a) < ā/2, and deduce that in part (iv) of the proof we have a^k ≤ ā and |p^k| > m_a s̃_p^k for k = k_2,...,k_2+N̄, so that (4.38) holds, as required.

For the sake of completeness of our discussion of algorithms for nonsmooth minimization, we shall now discuss a simple resetting strategy due to Wolfe (1975).

Consider the following modification of Algorithm 3.1, in which resets will be regulated by a sequence of positive numbers {δ^k}. In Step 0 we set δ^1 equal to any positive number. The resetting test |p^k| ≤ m_a a^k of Step 3 is replaced by the test

|p^k| ≤ δ^k.                                                       (6.6)

Part (i) of Step 4 is replaced by the following

Step 4'(i). If m_a a^k ≤ δ^k then replace δ^k by δ^k/2. If r_a^k = 0 then set r_a^k = 1, replace J^k by {j ∈ J^k : j ≥ k − M_g + 2} and go to Step 1.

In Step 5 we replace the line search requirements (3.9)-(3.10) by the following

|y^{k+1} − x^{k+1}| ≤ δ^k/(2m_a).                                  (6.7)

Of course, Line Search Procedure 3.2 is easily modified to handle the above requirement, since δ^k/(2m_a) > 0. The remaining steps of Algorithm 3.1 remain unchanged, except that in Step 8 we set δ^{k+1} = δ^k.
The aim of the above modification of Algorithm 3.1 is to calculate in a finite number of iterations a point x^k such that max{|p^k|, m_a a^k} ≤ δ^k for a small value of δ^k, so that the method can stop at Step 2 owing to δ^k being smaller than ε_s. Note that δ^k is halved at Step 4'(i) only after we have |p^k| ≤ δ^k and m_a a^k ≤ δ^k, i.e. when max{|p^k|, m_a a^k} ≤ δ^k, which means that the algorithm has found a significantly better approximation to a stationary point. When δ^k decreases, the line search requirement (6.7) ensures that the algorithm collects progressively more local subgradient information.
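The combined effect of the test (6.6) and of Step 4'(i) on δ^k can be sketched as follows (our own illustration; the function name and scalar arguments are assumptions):

```python
def wolfe_reset_update(p_norm, a, delta, m_a):
    # Test (6.6): a reset occurs when |p^k| <= delta^k; by Step 4'(i),
    # delta is halved only if additionally m_a * a^k <= delta^k.
    if p_norm > delta:
        return delta, False        # no reset: proceed to the line search
    if m_a * a <= delta:
        delta = delta / 2.0        # both quantities small: refine delta
    return delta, True             # reset: shrink the bundle
```

Thus δ^k is only decreased when max{|p^k|, m_a a^k} ≤ δ^k, i.e. when a significantly better approximation to a stationary point has been found.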
We may add that in the original version of the Wolfe (1975) strategy, one would use only part (iii) of Step 4 of Algorithm 3.1, i.e. each reset would involve restarting the method from the current point x^k and discarding all the past subgradients. As observed by Mifflin (1977b), such a strategy is inefficient, because it leads to many null steps accumulating subgradient information to compensate for total resets.
The following result describes the convergence properties of this modification of Algorithm 3.1.

Theorem 6.3. Suppose that the above-described modification of Algorithm 3.1 (with the Wolfe resetting strategy) calculates an infinite sequence of points {x^k} with the stopping parameter ε_s set to zero. Let K denote the (possibly empty) set of indices of iterations k at which δ^k is decreased at Step 4'(i), i.e.

K = {k : δ^{k+1} ≤ δ^k/2}.

Then either of the following two cases arises:

(i) f(x^k) → −∞ as k → ∞;

(ii) K is an infinite set, δ^k → 0 as k → ∞ and every accumulation point of the subsequence {x^k}_{k∈K} is stationary for f.

Moreover, if the sequence {x^k} is bounded (e.g. the level set S_f = {x ∈ R^N : f(x) ≤ f(x^1)} is bounded), then case (ii) above occurs and {x^k} has at least one stationary accumulation point.

Proof. If f(x^k) → −∞ as k → ∞ then there is nothing to prove. So suppose that f(x^k) ≥ c for some fixed c and all k.

(i) We shall first prove that K is infinite. For purposes of a proof by contradiction, suppose that K is finite, i.e. δ^k = δ̄ > 0 for some fixed k_δ and all k ≥ k_δ. Then at Step 5 we have |p^k| > δ^k = δ̄ and

t_L^k |p^k|² = |p^k| t_L^k |d^k| = |p^k| |x^{k+1} − x^k| ≥ δ̄ |x^{k+1} − x^k|,

i.e.

t_L^k |p^k|² ≥ δ̄ |x^{k+1} − x^k| for all k ≥ k_δ.                  (6.8)

From the assumption that f(x^k) ≥ c for all k we obtain, as in Lemma 2.4.7, that Σ_{k=1}^∞ t_L^k |p^k|² < +∞, hence (6.8) yields

Σ_{k=1}^∞ |x^{k+1} − x^k| < +∞.                                    (6.9)

Therefore, by the triangle inequality,

|x^{k+n} − x^k| ≤ Σ_{i=k}^{k+n−1} |x^{i+1} − x^i| → 0 as k, n → ∞,

and {x^k}, being a Cauchy sequence, has a limit

x^k → x̄ as k → ∞.                                                 (6.10)

Since δ^k = δ̄ for all k ≥ k_δ, we have

m_a a^k > δ̄ at Step 4' for all k ≥ k_δ,                           (6.11)

b^k = max{|p^k|, m_a a^k} > δ̄ at Step 4' for all k ≥ k_δ,          (6.12)

and, because |p^k| ≥ δ^k and w^k = ½|p^k|² + ε̃_p^k ≥ ½|p^k|² at Step 5,

w^k ≥ ε̄ > 0 at Step 5 for all k ≥ k_δ,                            (6.13)

where ε̄ = δ̄²/2. By (6.7), we always have |y^{k+1} − x^{k+1}| ≤ ā for ā = δ^1/(2m_a), hence Lemma 4.12 remains valid. Therefore, we deduce from (6.10) and (6.13) that the second assertion of Lemma 4.14 is true, so, in view of (6.10), (6.12) and (6.13), we may use the proof of Lemma 4.18. To this end, observe that, since for large j we have s_j^j = |y^j − x^j| ≤ δ̄/(2m_a) from (6.7), and s_j^k = s_j^j + Σ_{i=j}^{k−1} |x^{i+1} − x^i|, (6.9) yields

s_j^k ≤ δ̄/m_a for all large j ≤ k.                                (6.14)

Then one may set ε_p = 2δ̄ and proceed as in the proof of Lemma 4.18, deleting (4.32a) and using (6.14) to replace (4.37) by

a^k = max{s_j^k : j ∈ J^k} ≤ δ̄/m_a for k = k_2,...,k_2+N̄,          (6.15)

to get (4.38) from (6.11) and (6.15). Thus we obtain a contradiction, showing that K must be infinite.

(ii) If K is infinite, then for infinitely many k ∈ K we have δ^{k+1} = δ^k/2, and 0 ≤ δ^{k+1} ≤ δ^k for all k, so δ^k → 0 as k → ∞. Suppose that x^k →_{K̄} x̄ for some point x̄ and an infinite set K̄ ⊂ K. Then, since

b^k = max{|p^k|, m_a a^k} ≤ δ^k for all k ∈ K

and δ^k → 0, we obtain from Lemma 4.16 that 0 ∈ ∂f(x̄).

(iii) If {x^k} is bounded then {f(x^k)} is bounded (f is continuous), so case (i) cannot occur and the bounded subsequence {x^k}_{k∈K} must have at least one accumulation point, which is stationary by the preceding results.
We conclude from the above theorem that if f is bounded from below and the stopping parameter ε_s is positive, then the above-described modification of Algorithm 3.1 with the Wolfe resetting strategy stops after a finite number of iterations, finding an approximately stationary point x^k with p^k ∈ ∂f(x^k; ε_s/m_a) and |p^k| ≤ ε_s.

Let us now see what happens if the Wolfe resetting strategy is used in Algorithm 6.1. To this end, consider the following modification of Algorithm 6.1. In Step 0 we choose a positive δ^1 satisfying δ^1 ≤ m_a ā, so that we shall have

δ^k/(2m_a) ≤ ā/2 for all k.

In Step 3 we replace the resetting test |p^k| ≤ m_a s̃_p^k by the test |p^k| ≤ δ^k. In Step 4(i) we insert the following additional instruction:

if m_a s̃_p^k ≤ δ^k then replace δ^k by δ^k/2.

We also employ the line search requirement (6.7) in place of (3.9)-(3.10) in Step 5 of the method. In Step 9' we set δ^{k+1} = δ^k. Thus δ^{k+1} ≤ δ^k/2 only if

max{|p^k|, m_a s̃_p^k} ≤ δ^k,

and δ^{k+1} = δ^k otherwise.

Using the preceding results, one may check that the convergence properties of the above modification of Algorithm 6.1 can be expressed in the form of Theorem 6.3.

We shall now describe a resetting strategy based on the ideas of Mifflin (1977b). Thus consider the following modification of Algorithm 3.1. In Step 0 we choose a positive δ^1. Step 3 and Step 4 are replaced by the following

Step 3" (Resetting test). If |p^k| ≤ δ^k then go to Step 4"; otherwise, replace δ^k by min{δ^k, |p^k|} and go to Step 5.

Step 4" (Resetting). Replace δ^k by θ̄δ^k, and then J^k by {j ∈ J^k : s_j^k ≤ δ^k/m_a}. If k ∉ J^k then set y^k = x^k, g^k = g_f(y^k), f_k^k = f(y^k), s_k^k = 0 and add k to J^k. Set r_a^k = 1 and go to Step 1.

In Step 5 we replace the line search requirements (3.9)-(3.10) by the following

|y^{k+1} − x^{k+1}| ≤ θ̄δ^k/m_a.                                   (6.16)

Step 8 is substituted by the following two steps.

Step 8" (Distance resetting). Set δ^{k+1} = δ^k. If m_a a^{k+1} ≤ δ^{k+1} then go to Step 9". Otherwise, set r_a^{k+1} = 1 and replace J^{k+1} by the set {j ∈ J^{k+1} : s_j^{k+1} ≤ θ̄δ^{k+1}/m_a}, so that the reset value of a^{k+1} satisfies

a^{k+1} = max{s_j^{k+1} : j ∈ J^{k+1}} ≤ θ̄δ^{k+1}/m_a.            (6.17)

Step 9". Increase k by 1 and go to Step 1.
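Step 3" together with the δ-reduction of Step 4" can be sketched as follows (our own illustration; the names are assumptions, with theta_bar standing for θ̄ ∈ (0,1)):

```python
def mifflin_reset_test(p_norm, delta, theta_bar):
    # Step 3'': if |p^k| <= delta^k, a reset is triggered and, in Step 4'',
    # delta shrinks by the factor theta_bar in (0, 1); otherwise delta is
    # pulled down to min(delta, |p^k|) and the line search is entered.
    if p_norm <= delta:
        return theta_bar * delta, True    # go to Step 4'' (reset)
    return min(delta, p_norm), False      # go to Step 5 (line search)
```

Note that δ^k never increases, so the sequence {δ^k} is monotonically nonincreasing, as used in the proof of Theorem 6.4.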

To compare the above modification of Algorithm 3.1 with the original method, we observe that, since θ̄ ∈ (0,1) by assumption, in the above algorithm we have

m_a a^k ≤ δ^k for all k.                                           (6.18)

Therefore, if no reset occurs at the k-th iteration at Step 4", i.e. if |p^k| > δ^k, then we have the relation |p^k| > m_a a^k, which is similar to the corresponding relation |p^k| > m_a a^k of Algorithm 3.1. In general, the resetting strategy ensures that

|p^k| ≥ m_a' a^k at Step 5 for all k,                              (6.19)

where m_a' = θ̄m_a > 0, and

|p^k| ≥ δ^k at Step 5 for all k.                                   (6.20)

At the same time, one can have a reset at Step 8" due to m_a a^{k+1} > δ^{k+1} even if the value of |p^k| ≥ δ^k is large. In this case a reset occurs even though |p^k| > m_a a^k, in contrast with the rules of Algorithm 3.1. We note that on each entrance to Step 4" one has

max{|p^k|, m_a a^k} ≤ δ^k                                          (6.21)

for successively smaller δ^k's, so in this case δ^k measures the stationarity of x^k. Moreover, the line search requirement (6.16) and the rules of Step 8" ensure that the latest subgradient g^{k+1} is never deleted at Step 8". This is similar to the corresponding property of Algorithm 3.3.1.

We shall now establish convergence of the above modification of Algorithm 3.1.

Theorem 6.4. Suppose that the above-described modification of Algorithm 3.1 (with the Mifflin resetting strategy) generates an infinite sequence of points {x^k} with the stopping parameter ε_s set to zero. Let K denote the possibly empty set of iterations k at which |p^k| ≤ δ^k at Step 3", i.e. K = {k : δ^{k+1} < δ^k}. Then either of the following two cases arises:

(i) f(x^k) → −∞ as k → ∞;

(ii) K is an infinite set, δ^k → 0 as k → ∞ and every accumulation point of the subsequence {x^k}_{k∈K} is stationary for f.

Moreover, if the sequence {x^k} is bounded (e.g. the level set S_f = {x : f(x) ≤ f(x^1)} is bounded), then case (ii) above occurs and {x^k} has at least one stationary accumulation point.

Proof. Suppose that {f(x^k)} is bounded from below. There exists a number δ̄ ≥ 0 such that δ^k ↓ δ̄, because we have 0 ≤ δ^{k+1} ≤ δ^k for all k.

(i) We shall first prove that δ̄ = 0. To obtain a contradiction, suppose that δ̄ > 0. Then (6.20) and the boundedness of {f(x^k)} yield, as in the proof of Theorem 6.3, that

Σ_{k=1}^∞ |x^{k+1} − x^k| < +∞,                                    (6.22)

x^k → x̄ as k → ∞.                                                 (6.23)

Note that Step 4" cannot be entered infinitely often, because, since θ̄ ∈ (0,1), the change of δ^k at Step 4" would imply δ^k → 0, a contradiction. Hence there exists a number k_δ such that Step 4" is never entered at iterations k ≥ k_δ. Thus we have |p^k| > δ^k ≥ δ̄ and w^k = ½|p^k|² + ε̃_p^k ≥ ½|p^k|², i.e.

w^k ≥ ε̄ = ½δ̄² > 0 for all k ≥ k_δ.                                (6.24)

By (6.16), |y^{k+1} − x^{k+1}| ≤ θ̄δ^k/m_a ≤ θ̄δ^1/m_a, so Lemma 4.12 remains valid. Therefore, from (6.23) and (6.24) we deduce that the second assertion of Lemma 4.14 is true, and that we may now use the proof of Lemma 4.18 (deleting (4.32a), since we no longer have (4.28b)). To this end, we note that, since θ̄ ∈ (0,1) and δ^k ↓ δ̄ > 0, we may choose η ∈ (θ̄,1) and k_η > k_δ such that θ̄δ^k < ηδ̄ for all k ≥ k_η. Then the line search requirement (6.16) yields

s_{k+1}^{k+1} = |y^{k+1} − x^{k+1}| ≤ θ̄δ^k/m_a < ηδ̄/m_a for all k ≥ k_η.   (6.25a)

In view of (6.22) and the fact that (1−η)δ̄/m_a > 0, we may choose k_s > k_η such that

Σ_{k=k_s}^∞ |x^{k+1} − x^k| < (1−η)δ̄/m_a.                         (6.25b)

Now, using parts (i)-(iii) of the proof of Lemma 4.18, we find k_2 > k_s such that r_a^{k_2} = 1 and

max{s_j^{k_2} : j ∈ J^{k_2}} ≤ θ̄δ^{k_2}/m_a < ηδ̄/m_a             (6.25c)

from (6.17). Since s_j^k = s_j^n + Σ_{i=n}^{k−1} |x^{i+1} − x^i| for all k ≥ n ≥ j, from (6.25) we obtain

max{s_j^k : j ∈ J^{k_2} or j > k_2} < δ̄/m_a for all k > k_2.       (6.26)

But Lemma 4.1 shows that on each entrance to Step 8" we have

a^{k+1} = max{s_j^{k+1} : j ∈ Ĵ^k or j = k+1},

hence (6.26) and the fact that Ĵ^k ⊂ J^{k_2} ∪ {j : j ≥ k_2} yield

m_a a^{k+1} < δ̄ ≤ δ^{k+1} for all k > k_2,                         (6.27)

showing that no reset due to m_a a^{k+1} > δ^{k+1} can occur at Step 8" for any k > k_2. Thus (4.38) holds and we obtain a contradiction. Therefore, δ̄ = 0.

(ii) Note that for each k ∈ K we have |p^k| ≤ δ^k and m_a a^k ≤ δ^k from (6.18), so

b^k = max{|p^k|, m_a a^k} ≤ δ^k for all k ∈ K.

Using this relation one can complete the proof analogously to parts (ii)-(iii) of the proof of Theorem 6.3.
We conclude from the preceding results that the resetting strategies of Wolfe (1975) and Mifflin (1977b) have certain theoretical drawbacks. Namely, they lead to algorithms that are convergent in the sense that they have at least one stationary accumulation point, provided that they generate bounded sequences {x^k}. Such convergence results are weaker than the ones established for Algorithm 3.1 and Algorithm 6.1, which say that every accumulation point of these algorithms is stationary, under no additional assumptions on f.

The difference between these two kinds of convergence results may be explained as follows. The resetting strategies of Wolfe (1975) and Mifflin (1977b) are based on the use of certain monotonically nonincreasing sequences {δ^k}, such that δ^k decreases at each reset, so that each reset is influenced by all the preceding resets. On the other hand, resetting tests of the form |p^k| ≤ m_a a^k or |p^k| ≤ m_a s̃_p^k are more local in nature, in the sense that they do not depend explicitly on any of the preceding resets.

This difference directly influences the required techniques of convergence analysis. Namely, the local character mentioned above is well-suited to proving that an arbitrary accumulation point of {x^k} is stationary. On the other hand, contradicting the statement that every accumulation point is nonstationary requires only the global arguments mentioned earlier.

It is worthwhile to point out that, from the theoretical point of view, in the convex case the above-described modifications of Algorithm 3.1 and Algorithm 6.1 with the resetting strategies of Wolfe and Mifflin have the same global convergence properties as the original methods. Namely, one may augment Theorem 6.3 and Theorem 6.4 with the statement that if f is convex then {x^k} is a minimizing sequence, which converges to a minimum point of f if f attains its infimum. This statement can be proved similarly to Theorem 4.20. In this context we observe, once again, that resetting strategies have no impact on convergence in the convex case, since in fact the convex case needs no resettings; see Chapter 2.
We shall now use the preceding results of this section to design a modified resetting strategy for Algorithm 3.1. This time our motivation is practical and stems from the following observation. Each resetting of Algorithm 3.1 involves solving a quadratic programming subproblem. Hence we may have to solve many search direction finding subproblems at any iteration, especially if few past subgradients are deleted at each reset. For simplicity, at Step 4 of Algorithm 3.1 we discard only one past subgradient, although the preceding convergence results are not impaired if one deletes more subgradients. At the same time, a premature reduction of the past subgradient information can be wasteful, since then the algorithm has to accumulate new subgradients at null steps, when no movement towards a solution occurs. Thus we need a rule for detecting the case when it is more efficient to discard several past subgradients, rather than wait until they are dropped after we have solved several quadratic programming subproblems. Of course, such a rule should not impair the preceding global convergence results. Guided by these observations, we shall now present a subgradient deletion rule based on the ideas of Mifflin (1977b).

Consider the following modification of Algorithm 3.1. In Step 0 we choose a positive δ^1 and a parameter γ ∈ (0,1). Step 3 and Step 4 are replaced by the following

Step 3 " I R e s e t t i n 9 test). Set

~k = max{ Ipkl,maak}- (6.28)

If ~k < ~ 6 k set 6k+l=~k; otherwise set 6k+i=6 k. If IP k I -<ma ak


then go to Step 4"'; otherwise, go to Step 5.

Step 4'" (Resetting). Replace jk by the set {j jk : sk3 ~ ~ 6 k + i / m a } "

If k ~ Jk then set yk=xk, gk=gf(yk) , fk=f(yk), sk=0 and add k to ~ .

Set rk=l
a and go to Step i.

The line search requirements (3.9)-(3.10) of Step 5 are replaced by the following

  |y^{k+1} − x^{k+1}| ≤ ζ δ^{k+1}/m_a.  (6.29)

Finally, Step 8 is replaced by

Step 8'' (Distance resetting). If δ^{k+1} = δ^k then go to Step 9. Otherwise, replace J^{k+1} by the set {j ∈ J^{k+1} : s_j^{k+1} ≤ ζ δ^{k+1}/m_a}. If a^{k+1} > ζ δ^{k+1}/m_a then set r_a^{k+1} = 1 and

  a^{k+1} = max{ s_j^{k+1} : j ∈ J^{k+1} } ≤ ζ δ^{k+1}/m_a.  (6.30)

Step 9''. Increase k by 1 and go to Step 1.
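In code, the δ-update and resetting test of Step 3'' amount to a few comparisons. A minimal sketch (the function and variable names are ours, not the book's):

```python
def resetting_test(p_norm, a, delta, m_a, zeta):
    """Sketch of Step 3'': update the threshold delta and decide whether
    a reset (Step 4'') is needed.

    p_norm = |p^k|, a = a^k (locality radius), m_a > 0, zeta in (0,1)."""
    delta_hat = max(p_norm, m_a * a)                        # (6.28)
    delta_next = delta_hat if delta_hat < zeta * delta else delta
    do_reset = p_norm <= m_a * a                            # go to Step 4'' if true
    return delta_next, do_reset
```

Step 4'' then keeps only those indices j with s_j^k ≤ ζ δ^{k+1}/m_a.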

For comparing the above modification of Algorithm 3.1 with the original method we shall use the fact that, as will be proved below, δ^k ↓ 0 whenever {x^k} has at least one accumulation point. Thus the two methods differ in several aspects. First, the above criteria for retaining past subgradients g^j at a reset are formulated directly in terms of the subgradient distance measures s_j^k ≥ |y^j − x^k|, while in Algorithm 3.1 a suitable reduction of the locality radius a^k at a reset was ensured by retaining a limited number of the latest subgradients, so that g^j was retained only if both j ≥ k−M_g+2 and the value of s_j^k was smaller than |p^k|/m_a after a reset. Secondly, the line search rule (6.29) makes y^{k+1} sufficiently close to x^{k+1} when the algorithm approaches a solution, as indicated by a small value of δ^k, while in Algorithm 3.1 the value of |y^{k+1} − x^{k+1}| is controlled by a combination of the line search criteria (3.10) and the rules of Step 6. Thirdly, the subgradient deletion rules of Step 8'' reduce the number of resets at Step 4'', thus saving work required by quadratic programming subproblems. Namely, at Step 8'' we detect the situation when the algorithm approaches a solution, i.e. δ^{k+1} < δ^k, and then discard a few (seemingly) irrelevant subgradients, trying to forestall a reset at Step 4'' at the next iteration.
We shall now show that the preceding global convergence results cover this modification.

Theorem 6.5. Suppose that the above-described modification of Algorithm 3.1 (with Step 3'', Step 4'', etc.) generates an infinite sequence {x^k}. Then every accumulation point of {x^k} is stationary. Moreover, Theorem 4.20 and Corollary 4.21 remain true.

Proof. (i) Suppose that there exist a point x̄ ∈ R^N and an infinite set K ⊂ {1,2,...} such that x^k →_K x̄. We claim that δ^k ↓ 0. The reader may verify this claim by using the proof of Theorem 6.4.

(ii) We now claim that Lemma 4.18 and its proof remain valid if one replaces in Algorithm 3.1 Step 4 and the line search requirements (3.9)-(3.10) by Step 4'' and (6.29), respectively, provided that δ^k ↓ 0. To justify this claim, use (6.29) and the assumption that δ^k ↓ 0 for showing that Lemma 4.12, Lemma 4.14 with (4.28a) deleted, and Lemma 4.15 are true.

(iii) The theorem will be proved if we show how to modify the proof of Lemma 4.18. Thus suppose that (4.20), (4.22) and (4.31) hold. Let K_δ = {k : δ^{k+1} < δ^k}. From part (i) above and the assumption (4.20) we know that K_δ is infinite and δ^k ↓ 0. Therefore, in view of part (ii) above, in the proof of Lemma 4.18 we need only consider additional resets occurring at Step 8'' for k ∈ K_δ. To this end, suppose that k₂ is so large that ζ δ^{k₂} < ā_p/2, where ā_p > 0 is the constant involved in (4.36). Then (4.36) and the rules of Step 3'' yield δ^{k+1} = δ^k for k = k₂,...,k₂+N, so that k ∉ K_δ for k = k₂,...,k₂+N. Thus no resets occur at Step 8'' for k = k₂,...,k₂+N, and part (iv) of the proof of Lemma 4.18 remains valid. Thus Lemma 4.18, and hence also Theorem 4.19, Theorem 4.20 and Corollary 4.21, are true.

Remark 6.6. We conclude from the above proof that the global convergence results established in Section 4 for Algorithm 3.1 are not impaired if one replaces Step 4 by Step 4'' above, and the line search requirements (3.9)-(3.10) by (6.29), provided that the rules for choosing {δ^k} are such that {δ^k} is bounded and δ^k ↓ 0 whenever {x^k} has at least one accumulation point. This observation may be used in designing rules different from the ones of Algorithm 3.1 and its above-described modification.

Let us now consider another modification of Algorithm 3.1, which is similar to Algorithm 6.1. Thus suppose that we use Step 3'' with

  δ̂^k = max{ |p^k|, m_a s_p^k }  (6.31)

instead of (6.28), and replace the resetting test |p^k| ≤ m_a a^k by |p^k| ≤ m_a s_p^k, where the aggregate distance measures s_p^k are generated via (6.1)-(6.2). We also use Step 4'' and substitute the line search requirements (3.9)-(3.10) by (6.29), and Step 8 by Step 8'' with the condition "δ^{k+1} = δ^k" replaced by "δ^{k+1} = δ^k and a^{k+1} ≤ ā".

The resulting method is globally convergent in the sense of Theorem 6.5. To see this, use a combination of the proofs of Theorem 6.2 and Theorem 6.5. At the same time, the use of Step 8'' and of the resetting test |p^k| ≤ m_a s_p^k decreases the frequency of resets occurring at Step 4''.

We may add that each of the modified subgradient deletion rules and line search requirements of this section may be incorporated in Algorithm 5.1. The corresponding results on convergence are the same.

Remark 6.7. As observed in Section 3.6, in many cases it is efficient to calculate subgradients also at the points {x^k}, and to use such additional subgradients for each search direction finding. This idea can be readily incorporated in the methods discussed so far in this chapter. Namely, using the notation of (3.6.11)-(3.6.16), one may evaluate additional subgradients

  g^{−j} = g_f(y^{−j}) = g_f(x^j) for j = 1,2,...,

and choose sets J^{k+1} subject to the preceding requirements and the following

  −(k+1) ∈ J^{k+1} for all k.

Then there is no need for Step 4(iii) in Algorithm 3.1, Algorithm 6.1 and their various modifications discussed so far. Also the preceding global convergence results are not influenced, although in practice the use of additional subgradients may speed up convergence.

7. Simplified Versions That Neglect Linearization Errors

In this section we shall consider simplified versions of the previously discussed methods that are obtained by neglecting linearization errors at each search direction finding, i.e. by setting α_j^k = α_p^k = 0 in the search direction finding subproblems (3.1) and (5.1). The resulting dual search direction finding subproblems (3.16) and (3.5.11) have special structure, which enables one to use efficient quadratic programming subroutines. Particular variants of such methods include the algorithms of Wolfe (1975) and Mifflin (1977b).
We may add that methods that neglect linearization errors, proposed by Wolfe (1975) and extended to the nonconvex case by Mifflin (1977b), seem to be less efficient in practice than other algorithms (Lemarechal, 1982). However, they are relatively simple to implement and still attract considerable theoretical attention (Polak, Mayne and Wardi, 1983).
Let us, therefore, consider the following modification of Algorithm 3.1. In the primal search direction finding subproblem (3.1) we set

  α_j^k = 0 for j ∈ J^k, and α_p^k = 0,  (7.1)

i.e. we use (7.1) instead of (3.2). Then the corresponding dual search direction finding subproblem (3.16) is of the form

  minimize over λ, λ_p:  (1/2) | Σ_{j∈J^k} λ_j g^j + λ_p p^{k−1} |²,
  subject to  λ_j ≥ 0, j ∈ J^k,  λ_p ≥ 0,  Σ_{j∈J^k} λ_j + λ_p = 1,  (7.2)
              λ_p = 0 if r_a^k = 1.

Solving (7.2) is equivalent to finding the vector of minimum length in the convex hull of {p^{k−1}, g^j : j ∈ J^k} if r_a^k = 0 (or of {g^j : j ∈ J^k} if r_a^k = 1). This can be done by using the very efficient and stable Wolfe (1976) algorithm, designed specially for quadratic programming problems of the form (7.2). Another advantage of this simplified version is that the linearization values f_j^k and f_p^k are not needed, hence one can save the effort previously required for updating linearization values by (3.13). We shall also neglect linearization errors at line searches by setting

  v^k = −|p^k|²,  (7.3)

  α(x,y) = 0 for all x and y,  (7.4)

instead of using (3.5) and (3.11). Then v^k < 0, so that Line Search Procedure 3.2 can be used as before, with Lemma 3.3 remaining valid.
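A building block of the minimum-norm computation that (7.2) calls for — the minimum-length point on the segment between two vectors — has a closed form; Wolfe's (1976) algorithm repeatedly applies such low-dimensional minimizations. The following pure-Python sketch (names ours) illustrates only that two-point subcase, not the full algorithm:

```python
def min_norm_on_segment(g1, g2):
    """Return (lam, p) with p = (1-lam)*g1 + lam*g2 the point of minimum
    Euclidean norm on the segment [g1, g2], lam in [0, 1]."""
    d = [a - b for a, b in zip(g1, g2)]            # g1 - g2
    dd = sum(x * x for x in d)
    if dd == 0.0:                                  # degenerate: g1 == g2
        lam = 0.0
    else:                                          # unconstrained minimizer, clipped to [0, 1]
        lam = max(0.0, min(1.0, sum(a * x for a, x in zip(g1, d)) / dd))
    p = [(1 - lam) * a + lam * b for a, b in zip(g1, g2)]
    return lam, p
```

For example, the segment between two opposite subgradients contains the origin, reflecting near-stationarity.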
As far as convergence of the above algorithm is concerned, one may reason as if the values of α_p^k were zero for all k, since in this case (7.3) is equivalent to the previously employed relation (3.5). Then it is easy to check that the modification defined by (7.1)-(7.4) does not impair Theorem 4.19 and Corollary 4.21; in fact, the relevant proofs of Section 4 are simplified. We conclude that in the nonconvex case the above-described version of Algorithm 3.1 has the same global convergence properties as the original method.

At the same time, it is not clear whether Theorem 4.20 holds for the above-described method, since the additional results on convergence in the convex case depended strongly on the property that we had

  Σ_{k=1}^∞ t_L^k { |p^k|² + α_p^k } < +∞  (7.5)

whenever {f(x^k)} was bounded from below; see the proofs of Theorem 2.4.15 and Theorem 2.4.16. We recall from the proof of Lemma 2.4.7 that (7.5) resulted from summing the inequalities due to the line search criterion (3.7)

  t_L^k (−v^k) ≤ [f(x^k) − f(x^{k+1})]/m_L,  (7.6)

with −v^k = |p^k|² + α_p^k, for all k. But now we have −v^k = |p^k|² in the simplified version of the method, so that (7.6) yields

  Σ_{k=1}^∞ t_L^k |p^k|² < +∞  (7.7)

if {f(x^k)} is bounded from below, and, unfortunately, (7.5) need not hold.
In view of the difficulties mentioned above, we shall now show that by using a suitable resetting strategy one may neglect linearization errors in Algorithm 3.1 without sacrificing strong global convergence properties in the convex case.

Our new resetting strategy can be motivated as follows. We want to ensure (7.5) if {f(x^k)} is bounded from below. In view of (7.7), this will be the case if we always have

  |p^k|² > m̄ α_p^k  (7.8)

at line searches, where m̄ is a fixed, positive parameter. To this end, we shall reset the method whenever (7.8) is violated.

Thus consider the version of Algorithm 3.1 that uses (7.1)-(7.4) and the following modification of Step 3.

Step 3' (Resetting test). If |p^k| ≤ m_a a^k or |p^k|² ≤ m̄ α_p^k then go to Step 4; otherwise, go to Step 5.

Reasoning as in Section 3, one may check that only finitely many resets can occur at any iteration of the modified method.

We shall now show that the modified method retains all the convergence properties of Algorithm 3.1. Since Lemma 4.3 remains true, we may suppose that the method does not terminate.

Theorem 7.1. The above-described modification of Algorithm 3.1 (with (7.1)-(7.4) and Step 3') is globally convergent in the sense of Theorem 4.19, Theorem 4.20 and Corollary 4.21.

Proof. In view of the preceding results, we only need to establish Theorem 4.19, since then, by (7.7)-(7.8), Theorem 4.20 and Corollary 4.21 will follow as before. To this end, we show how to modify the proof of Lemma 4.18. Suppose that in addition to (4.35) we can assert that

  α_p^k < ā_p²/(4m̄) for all k ∈ [k₂, k₂+N].  (7.9)

Then (4.36), (4.37) and (7.9) yield |p^k| > m_a a^k and |p^k|² > m̄ α_p^k for all k ∈ [k₂,...,k₂+N], so (4.38) holds and we may complete the proof of Lemma 4.18. Thus it suffices to show that in part (ii) of the proof of Lemma 4.18 we can choose k̄ such that we have (4.29), (4.32), and then (4.35) and (7.9) in part (iv) of the proof. This is easy if we observe that, by Lemma 4.4,

  α_p^k = | f(x^k) − f_p^k | =
        = | Σ_{i=1}^M ν_i^k [ f(x^k) − f(y^{k,i}) − ⟨g_f(y^{k,i}), x^k − y^{k,i}⟩ ] |
        ≤ max{ | f(x^k) − f(y^{k,i}) − ⟨g_f(y^{k,i}), x^k − y^{k,i}⟩ | : i = 1,...,M }
        ≤ max_i | f(x^k) − f(y^{k,i}) | + max_i | g_f(y^{k,i}) | | x^k − y^{k,i} |
        ≤ 2L max{ | x^k − y^{k,i} | : i = 1,...,M } ≤ 2L a^k,

so, by (4.5) and (4.7),

  α_p^k ≤ 2L max{ s_j^k : j ∈ J^k },  (7.10)

where L is the Lipschitz constant of f on B̄:

  | f(x) − f(y) | ≤ L | x − y | for all x and y in B̄,

and B̄ is any bounded set containing x^k and y^{k,i}, i = 1,...,M. Let B̄ = {y : |x̄ − y| ≤ 2}. From (4.22), (4.31), Lemma 4.14, Lemma 4.15 and (4.20) we deduce the existence of k̄ satisfying (4.29), (4.32) and

  max{ s_j^k : k̄ ≤ j ≤ k ≤ k̄+N } < min{ 1, ā_p²/(8m̄L) },  (7.11a)

  | x̄ − x^k | < 1 for k = k̄,...,k̄+N.  (7.11b)

Then in part (iv) we have, for k = k₂,...,k₂+N and i = 1,...,M, that

  | x̄ − y^{k,i} | ≤ | x̄ − x^k | + | x^k − y^{k,i} | < 1 + 1 = 2

from (7.11), since {y^{k,i} : i = 1,...,M} ⊂ {y^j : j ∈ J^k} and |x^k − y^j| ≤ s_j^k. Thus x^k ∈ B̄ and y^{k,i} ∈ B̄ for k = k₂,...,k₂+N and i = 1,...,M. Then (4.35a), (7.10) and (7.11a) yield (7.9), as required.
CHAPTER 5

Feasible Point Methods for Convex Constrained Minimization Problems

1. Introduction

In this chapter we consider the following convex constrained minimization problem

  minimize f(x), subject to F(x) ≤ 0,  (1.1)

where the functions f : R^N → R and F : R^N → R are convex, but not necessarily differentiable. We assume that the Slater constraint qualification is fulfilled, i.e. there exists x̃ ∈ R^N satisfying F(x̃) < 0. This implies that the feasible set

  S = {x ∈ R^N : F(x) ≤ 0}

has a nonempty interior.

We present a class of readily implementable algorithms, which differ in complexity and efficiency. The algorithms require only the calculation of f or F and one subgradient of f or F at designated points. Each algorithm yields a minimizing sequence of points. Moreover, this sequence converges to a solution of problem (1.1) whenever the solution set of (1.1)

  X̄ = Arg min_S f = {x̄ ∈ S : f(x̄) ≤ f(x) for all x ∈ S}

is nonempty. Storage requirements and work per iteration of the algorithms can be controlled by the user.

The algorithms are feasible point descent methods, i.e. they generate a sequence of points {x^k} satisfying

  x^k ∈ S and f(x^{k+1}) < f(x^k) if x^{k+1} ≠ x^k, for k = 1,2,...,

where x¹ ∈ S is the starting point. They may be viewed as extensions of the Pironneau and Polak (1972,1973) method of centers and method of feasible directions to the nondifferentiable case.

The methods generate search directions by using the subgradient selection and aggregation strategies introduced in Chapter 2. The important extension to the constrained case is that we use separate selection and aggregation of subgradients of each of the problem functions. This enables us to construct suitable polyhedral approximations of f and F.


One of the algorithms presented (the method of feasible directions from Section 6) may be obtained by employing subgradient aggregation in the algorithm of Mifflin (1982). We may add that another type of aggregation was used by Lemarechal, Strodiot and Bihain (1981) for linearly constrained problems, but with no global convergence results.

The algorithms require a feasible starting point x¹ ∈ S, but require no knowledge of f at infeasible points. Also at feasible points we require no knowledge of F (other than F being nonpositive); this is important in certain applications, see Section 1.3. Each of the algorithms can find a feasible starting point by minimizing F, starting from any point.

We shall also present phase I - phase II methods which can be used when the initial approximation to a solution is infeasible. These methods stem from an algorithm of Polak, Trahan and Mayne (1979) for smooth problems. They differ significantly from the phase I - phase II algorithm of Polak, Mayne and Wardi (1983) for nonsmooth problems, both in their construction and in stronger global convergence properties.

In Section 2 we derive basic versions of the methods and give comparisons with other algorithms. An algorithmic procedure is presented in Section 3, and its global convergence is demonstrated in Section 4. Section 5 is devoted to methods with subgradient selection. Useful modifications of the methods are given in Section 6. Phase I - phase II methods are described in Section 7.

2. Derivation of the Algorithm Class

We start by recalling the necessary and sufficient optimality conditions for problem (1.1); see Section 1.2. For any x and y in R^N, let

  H(y;x) = max{ f(y) − f(x), F(y) },  (2.1)

and, for any fixed x, let ∂H(y;x) denote the subdifferential of the convex function H(·;x) at y, i.e.

             ⎧ ∂f(y)                  if f(y) − f(x) > F(y),
  ∂H(y;x) =  ⎨ conv{ ∂f(y) ∪ ∂F(y) }  if f(y) − f(x) = F(y),
             ⎩ ∂F(y)                  if f(y) − f(x) < F(y).

The Slater constraint qualification implies that the feasible set S is nonempty and that the solution set X̄ can be characterized as follows; see Lemma 1.2.16.

Lemma 2.1. The condition x̄ ∈ X̄ is equivalent to

  min{ H(y;x̄) : y ∈ R^N } = H(x̄;x̄) = 0,  (2.2)

which in turn is equivalent to

  0 ∈ ∂H(x̄;x̄).

Remark 2.2. There is no loss of generality in requiring that F be scalar-valued. If the original formulation of the problem involves a finite number of convex constraints F_j(x) ≤ 0, j ∈ J_F, then one may set

  F(x) = max{ F_j(x) : j ∈ J_F }.

By Corollary 1.2.6, ∂F(x) = conv{ g ∈ ∂F_j(x) : F_j(x) = F(x) }.

Observe that if H(x;x) = 0, i.e. x is feasible, and H(y;x) < H(x;x), then f(y) − f(x) < 0 and F(y) < 0, hence y is both feasible and has a strictly lower objective value than x. For this reason the function H is sometimes called an improvement function for problem (1.1).
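For a finite max of constraints, the recipe of Remark 2.2 is easy to realize: evaluate all F_j, take the maximum, and return a subgradient of any constraint attaining it. A small sketch (function names ours):

```python
def max_constraint(x, Fs, grads):
    """F(x) = max_j F_j(x); returns (F(x), g) with g a subgradient of F at x.

    Fs and grads are parallel lists of callables: constraint values F_j
    and their (sub)gradients."""
    vals = [Fj(x) for Fj in Fs]
    Fmax = max(vals)
    j = vals.index(Fmax)            # any active constraint will do
    return Fmax, grads[j](x)
```

In one dimension with F_1(x) = x − 1 and F_2(x) = −x − 1, this gives F(x) = |x| − 1 with subgradient ±1 according to which branch is active.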

To calculate a point x̄ satisfying (2.2), one may use the following method of centers due to Huard (1967):

Step 0. Set k = 1 and find a point x¹ ∈ S.

Step 1. Find a direction d^k solving the problem

  minimize H(x^k + d; x^k) over all d ∈ R^N,  (2.3)

and set y^{k+1} = x^k + d^k.

Step 2. If

  H(y^{k+1}; x^k) < H(x^k; x^k)

then set x^{k+1} = y^{k+1}; otherwise set x^{k+1} = x^k.

Step 3. Increase k by 1 and go to Step 1.
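As a toy illustration of these steps (not of the implementable algorithms below), the following sketch runs the conceptual method of centers on a one-dimensional example, replacing the exact minimization in (2.3) by a crude grid search; the problem data f, F, the grid and all names are our own choices:

```python
def improvement(y, x, f, F):
    # H(y; x) = max{f(y) - f(x), F(y)}, cf. (2.1)
    return max(f(y) - f(x), F(y))

def method_of_centers(x, f, F, iters=30):
    # Step 1 should solve (2.3) exactly; a grid search only works
    # for this toy one-dimensional example.
    grid = [-3.0 + 0.001 * i for i in range(6001)]
    for _ in range(iters):
        y = min(grid, key=lambda z: improvement(z, x, f, F))   # Step 1
        if improvement(y, x, f, F) < improvement(x, x, f, F):  # Step 2
            x = y
    return x

# Toy problem: minimize (x-2)^2 subject to x - 1 <= 0; solution x* = 1.
f = lambda x: (x - 2.0) ** 2
F = lambda x: x - 1.0
x_star = method_of_centers(0.0, f, F)   # start x = 0 is feasible: F(0) = -1 < 0
```

Each accepted center is both feasible and strictly better in objective value, as the improvement-function argument above guarantees.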


By (Polak, 1970; Section 4.2), if a sequence {x^k} ⊂ S generated by the method of centers has an accumulation point x̄, then (2.2) holds, i.e. x̄ solves problem (1.1).

Note that (2.3) is a nonsmooth unconstrained minimization problem whose exact solution generally cannot be computed. Therefore, in implementable versions of the method it is replaced by a simpler subproblem based on a suitable local approximation to the function H(·;x^k).

Since the algorithms to be described are structurally similar to the Pironneau and Polak (1972,1973) method of centers for smooth problems, we shall now review an extension of that method to constrained minimax problems (Kiwiel, 1981a). To this end, suppose momentarily that

  f(x) = max{ f_j(x) : j ∈ J_f } and F(x) = max{ F_j(x) : j ∈ J_F },  (2.4)

where f_j, j ∈ J_f, and F_j, j ∈ J_F, are convex functions with continuous gradients ∇f_j and ∇F_j, respectively, and the sets J_f and J_F are finite. To calculate a point x̄ satisfying (2.2), the method of centers (Kiwiel, 1981a) proceeds as follows. Given the k-th approximation to a solution x^k ∈ S, a search direction d^k is found from the solution (d^k, v^k) ∈ R^N × R to the following problem

  minimize (1/2)|d|² + v,
  subject to f_j(x^k) − f(x^k) + ⟨∇f_j(x^k), d⟩ ≤ v, j ∈ J_f,  (2.5)
             F_j(x^k) + ⟨∇F_j(x^k), d⟩ ≤ v, j ∈ J_F.

The above problem may be interpreted as a local first order approximation to the problem of minimizing H(x^k + d; x^k) with respect to d ∈ R^N. To see this, define the following polyhedral approximations to f, F and H(·;x^k):

  f̂^k(x) = max{ f_j(x^k) + ⟨∇f_j(x^k), x − x^k⟩ : j ∈ J_f },

  F̂^k(x) = max{ F_j(x^k) + ⟨∇F_j(x^k), x − x^k⟩ : j ∈ J_F },  (2.6)

  Ĥ^k(x) = max{ f̂^k(x) − f(x^k), F̂^k(x) },

respectively. Then subproblem (2.5) is equivalent to the following

  minimize Ĥ^k(x^k + d) + (1/2)|d|² over all d,  (2.7)

and we have

  v^k = Ĥ^k(x^k + d^k).  (2.8)

We note that subproblems (2.5) and (2.7) are extensions of the search direction finding subproblems of the method of linearizations described in Section 2.2, see (2.2.2) and (2.2.4). Also if |J_f| = |J_F| = 1 then subproblem (2.5) reduces to subproblems of the method of centers and method of feasible directions due to Pironneau and Polak (1972,1973). The regularizing term (1/2)|d|² in (2.7) serves to keep x^k + d^k in the region where Ĥ^k(·) is a close approximation to H(·;x^k).

Reasoning as in Section 2.2, one may prove that v^k majorizes the directional derivative of H(·;x^k) at x^k in the direction d^k. Therefore the method of centers can choose a stepsize t^k > 0 as the largest number in {1, 1/2, 1/4, ...} satisfying

  H(x^k + t^k d^k; x^k) ≤ H(x^k; x^k) + m t^k v^k,  (2.9)

where m ∈ (0,1) is a fixed line search parameter. This is possible when v^k < 0, which is the case if x^k ∉ X̄. The method, of course, stops if x^k ∈ X̄. Otherwise (2.9) yields the next point x^{k+1} satisfying

  f(x^{k+1}) < f(x^k) and x^{k+1} ∈ S.

This follows from the negativity of m t^k v^k and the fact that H(x^k;x^k) = 0 owing to x^k ∈ S.

It is known (Kiwiel, 1981a) that the above method of centers is globally convergent (to stationary points of (1.1) if the problem functions are nonconvex), and that the rate of convergence is at least linear under standard second order sufficiency conditions of optimality. This justifies our efforts to extend the method to more general nondifferentiable problems.
Although the methods given below will not require the special form (2.4) of the problem functions, they are based on the similar representations

  f(x) = max{ f(y) + ⟨g_{f,y}, x − y⟩ : g_{f,y} ∈ ∂f(y), y ∈ R^N },

  F(x) = max{ F(y) + ⟨g_{F,y}, x − y⟩ : g_{F,y} ∈ ∂F(y), y ∈ R^N },

which are due to convexity. Since such implicit representations cannot be computed, the methods will use their approximate versions constructed as follows.
We suppose that we have subroutines that can evaluate subgradient functions g_f(x) ∈ ∂f(x) at each x ∈ S, and g_F(x) ∈ ∂F(x) at each x ∈ R^N \ S. For simplicity, we shall temporarily assume that g_f(x) ∈ ∂f(x) and g_F(x) ∈ ∂F(x) for each x ∈ R^N.

Suppose that at the k-th iteration we have the current point x^k ∈ S and some auxiliary points y^j, j ∈ J_f^k ∪ J_F^k, and subgradients g_f^j = g_f(y^j), j ∈ J_f^k, and g_F^j = g_F(y^j), j ∈ J_F^k, where J_f^k and J_F^k are some subsets of {1,...,k}. Define the linearizations

  f_j(x) = f(y^j) + ⟨g_f^j, x − y^j⟩, j ∈ J_f^k,
                                                  (2.10)
  F_j(x) = F(y^j) + ⟨g_F^j, x − y^j⟩, j ∈ J_F^k,

and the current polyhedral approximations f̂_s^k and F̂_s^k to f and F, respectively,

  f̂_s^k(x) = max{ f_j(x) : j ∈ J_f^k },
                                           (2.11)
  F̂_s^k(x) = max{ F_j(x) : j ∈ J_F^k }.

Noting the similarities in (2.4) and (2.10)-(2.11), we see that applying one iteration of the method of centers to f̂_s^k and F̂_s^k at x^k would lead to the following search direction finding subproblem

  minimize (1/2)|d|² + v,
  subject to f_j^k − f(x^k) + ⟨g_f^j, d⟩ ≤ v, j ∈ J_f^k,  (2.12)
             F_j^k + ⟨g_F^j, d⟩ ≤ v, j ∈ J_F^k,

where f_j^k = f_j(x^k), j ∈ J_f^k, and F_j^k = F_j(x^k), j ∈ J_F^k. Defining the following polyhedral approximation to H(·;x^k)

  Ĥ_s^k(x) = max{ f̂_s^k(x) − f(x^k), F̂_s^k(x) },  (2.13)

we deduce that (2.12) is a quadratic programming formulation of the subproblem

  minimize Ĥ_s^k(x^k + d) + (1/2)|d|² over all d,  (2.14)

with the solution (d^k, v^k) of (2.12) satisfying

  Ĥ_s^k(x^k + d^k) = v^k,  (2.15)

cf. (2.6)-(2.8). Thus subproblem (2.12) is a local approximation to the problem of minimizing H(x^k + d; x^k) over all d.

In the next section we prove that v^k ≤ 0 and that v^k = 0 only if x^k ∈ X̄. Therefore we may now suppose that v^k is negative. The line search rule of the above-described method of centers must be modified here, because v^k need no longer be an upper estimate of the directional derivative of H(·;x^k) at x^k in the direction d^k. This is due to the fact that Ĥ_s^k(·) may poorly approximate H(·;x^k). However, we still have

  Ĥ_s^k(x^k + t d^k) ≤ H(x^k;x^k) + t v^k for all t ∈ [0,1],  (2.16)

owing to (2.15), the convexity of Ĥ_s^k and the fact that Ĥ_s^k(x^k) ≤ H(x^k;x^k), which is zero by assumption (x^k ∈ S). Therefore the variable

  v^k = Ĥ_s^k(x^k + d^k) − H(x^k;x^k)  (2.17)

may be thought of as an approximate directional derivative of H(·;x^k) at x^k in the direction d^k. Consequently, if we use the rules of Section 2.2 for searching from x^k along d^k for a stepsize that gives a reduction in H(·;x^k), we obtain the following line search rules.
Let m ∈ (0,1) and t̄ ∈ (0,1] be fixed line search parameters. We shall start by searching for the largest number t_L^k ≥ t̄ in {1, 1/2, 1/4, ...} that satisfies

  H(x^k + t_L^k d^k; x^k) ≤ H(x^k;x^k) + m t_L^k v^k.  (2.18)

This requires a finite number of the problem function evaluations. For instance, if t̄ = 1 then (2.18) reduces to the following test

  H(x^k + d^k; x^k) − H(x^k;x^k) ≤ m [ Ĥ_s^k(x^k + d^k) − H(x^k;x^k) ].  (2.19)

If a stepsize t_L^k ≥ t̄ satisfying (2.18) is found, then the method can execute a serious step by setting x^{k+1} = x^k + t_L^k d^k and y^{k+1} = x^{k+1}. Otherwise a null step is taken by setting x^{k+1} = x^k. In this case an auxiliary stepsize t_R^k ≥ t̄ satisfying

  H(x^k + t_R^k d^k; x^k) > H(x^k;x^k) + m t_R^k v^k

is known from the search for t_L^k. Then the trial point y^{k+1} = x^k + t_R^k d^k and the subgradients g_f^{k+1} = g_f(y^{k+1}) and g_F^{k+1} = g_F(y^{k+1}) will define the corresponding linearizations f_{k+1} and F_{k+1} by (2.10) that will significantly modify the next polyhedral approximations f̂_s^{k+1} and F̂_s^{k+1}, provided that k+1 ∈ J_f^{k+1} ∪ J_F^{k+1}, see Section 6. Thus after a null step the method will improve its model of the problem functions, increasing the chance of generating a descent direction for H(·;x^{k+1}).
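The search for t_L^k in (2.18) is a finite backtracking loop; a minimal sketch, with H_dir(t) standing for the map t ↦ H(x^k + t d^k; x^k) and all names ours:

```python
def line_search(H_dir, v, m=0.1, t_bar=0.01):
    """Backtrack over {1, 1/2, 1/4, ...} for the largest t >= t_bar with
    H_dir(t) <= H_dir(0) + m*t*v, cf. (2.18); v < 0 is the predicted descent.

    Returns ('serious', t_L) on success; otherwise ('null', t_R), where
    t_R >= t_bar violates the test and yields the next trial point."""
    h0 = H_dir(0.0)
    t = 1.0
    while t >= t_bar:
        if H_dir(t) <= h0 + m * t * v:
            return 'serious', t
        t *= 0.5
    return 'null', 2.0 * t        # last stepsize tried, which failed the test
```

A serious step moves the center; a null step only feeds the model with a new subgradient from the failed trial point.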
We shall now discuss how to select the next subgradient index sets J_f^{k+1} and J_F^{k+1}. In order for the algorithm to use the latest subgradient information, we should have k+1 ∈ J_f^{k+1} ∩ J_F^{k+1}, which is satisfied if

  J_f^{k+1} = Ĵ_f^k ∪ {k+1} and J_F^{k+1} = Ĵ_F^k ∪ {k+1}  (2.20)

for some sets Ĵ_f^k ⊂ J_f^k and Ĵ_F^k ⊂ J_F^k. From Chapter 1 and Chapter 2 we know that at least three approaches to the selection of Ĵ_f^k and Ĵ_F^k are possible. First, one may use subgradient accumulation by choosing Ĵ_f^k = J_f^k and Ĵ_F^k = J_F^k, which results in

  J_f^k = {1,...,k} and J_F^k = {1,...,k} for all k.

This strategy, which is employed in the algorithm of Mifflin (1982) (see Section 1.3), encounters serious difficulties in storage and computation after a large number of iterations. For this reason, we shall now present extensions of the other two approaches from Chapter 2: the subgradient selection strategy and the subgradient aggregation strategy. Both strategies are based on analyzing Lagrange multipliers of the search direction finding subproblems. Therefore we shall need the following generalization of Lemma 2.2.1.

Lemma 2.3. (i) The unique solution (d^k, v^k) of subproblem (2.12) always exists.

(ii) (d^k, v^k) solves (2.12) if and only if there exist Lagrange multipliers λ_j^k, j ∈ J_f^k, and μ_j^k, j ∈ J_F^k, and a vector p^k ∈ R^N satisfying

  λ_j^k ≥ 0, j ∈ J_f^k, μ_j^k ≥ 0, j ∈ J_F^k, Σ_{j∈J_f^k} λ_j^k + Σ_{j∈J_F^k} μ_j^k = 1,  (2.21a)

  [ f_j^k − f(x^k) + ⟨g_f^j, d^k⟩ − v^k ] λ_j^k = 0, j ∈ J_f^k,  (2.21b)

  [ F_j^k + ⟨g_F^j, d^k⟩ − v^k ] μ_j^k = 0, j ∈ J_F^k,  (2.21c)

  p^k = Σ_{j∈J_f^k} λ_j^k g_f^j + Σ_{j∈J_F^k} μ_j^k g_F^j,  (2.21d)

  d^k = −p^k,  (2.21e)

  v^k = −{ |p^k|² + Σ_{j∈J_f^k} λ_j^k [f(x^k) − f_j^k] − Σ_{j∈J_F^k} μ_j^k F_j^k },  (2.21f)

  f_j^k − f(x^k) + ⟨g_f^j, d^k⟩ ≤ v^k, j ∈ J_f^k,  (2.21g)

  F_j^k + ⟨g_F^j, d^k⟩ ≤ v^k, j ∈ J_F^k.  (2.21h)

(iii) Multipliers λ_j^k, j ∈ J_f^k, μ_j^k, j ∈ J_F^k, satisfy (2.21) if and only if they solve the following dual of (2.12)

  minimize over λ, μ:
    (1/2) | Σ_{j∈J_f^k} λ_j g_f^j + Σ_{j∈J_F^k} μ_j g_F^j |² + Σ_{j∈J_f^k} λ_j [f(x^k) − f_j^k] − Σ_{j∈J_F^k} μ_j F_j^k,  (2.22)
  subject to λ_j ≥ 0, j ∈ J_f^k, μ_j ≥ 0, j ∈ J_F^k, Σ_{j∈J_f^k} λ_j + Σ_{j∈J_F^k} μ_j = 1.

(iv) There exists a solution λ_j^k, j ∈ J_f^k, μ_j^k, j ∈ J_F^k, of subproblem (2.22) such that the sets

  Ĵ_f^k = {j ∈ J_f^k : λ_j^k > 0} and Ĵ_F^k = {j ∈ J_F^k : μ_j^k > 0}  (2.23a)

satisfy

  |Ĵ_f^k| + |Ĵ_F^k| ≤ N + 1.  (2.23b)

Such a solution can be obtained by solving the following linear programming problem by the simplex method:

  minimize over λ, μ:  Σ_{j∈J_f^k} λ_j [f(x^k) − f_j^k] − Σ_{j∈J_F^k} μ_j F_j^k,

  subject to Σ_{j∈J_f^k} λ_j + Σ_{j∈J_F^k} μ_j = 1,
                                                              (2.24)
             Σ_{j∈J_f^k} λ_j g_f^j + Σ_{j∈J_F^k} μ_j g_F^j = p^k,

             λ_j ≥ 0, j ∈ J_f^k, μ_j ≥ 0, j ∈ J_F^k,

where p^k = −d^k. Moreover, (d^k, v^k) solves the following reduced subproblem

  minimize (1/2)|d|² + v over (d,v) ∈ R^{N+1},
  subject to f_j^k − f(x^k) + ⟨g_f^j, d⟩ ≤ v, j ∈ Ĵ_f^k,  (2.25)
             F_j^k + ⟨g_F^j, d⟩ ≤ v, j ∈ Ĵ_F^k.

Proof. It suffices to observe that subproblem (2.12) is structurally similar to subproblem (2.2.11), and that the above lemma is an obvious reformulation of Lemma 2.2.1.

Lemma 2.3(iv) and the generalized cutting plane idea from Section 2.2 lead us to the following subgradient selection strategy. Subproblem (2.25) is a reduced, equivalent version of subproblem (2.12). Therefore the choice of J_f^{k+1} and J_F^{k+1} specified by (2.20) and (2.23) conforms with the generalized cutting plane concept, because it consists in appending to a reduced subproblem linear constraints generated by the latest subgradients. Thus only those past subgradients that contribute to the current search direction are retained, see (2.21a), (2.21d), (2.21e) and (2.23). Subgradient selection results in implementable algorithms that require storage of at most N+1 past subgradients.

In the subgradient aggregation strategy we shall construct an auxiliary reduced subproblem by forming surrogate constraints with the help of Lagrange multipliers of (2.12). As expounded in Chapter 2, subgradient aggregation consists in forming convex combinations of the past subgradients of a given function on the basis of the corresponding Lagrange multipliers. Here a slight complication arises from the fact that the Lagrange multipliers associated with each of the problem functions (λ^k with f, and μ^k with F) do not form separate convex combinations, see (2.21a). Yet the subgradients of f should be aggregated separately from those of F, since otherwise the mixing of subgradients would spoil crucial properties of subgradient aggregation. For separate subgradient aggregation we shall use scaled versions of the Lagrange multipliers of (2.12). A suitable scaling procedure, which yields separate convex combinations, is given below.
Let (λ^k, μ^k) denote any vectors of Lagrange multipliers of (2.12),
which do not necessarily satisfy (2.23), and let the numbers ν_f^k, λ̃_j^k,
j ∈ J_f^k, ν_F^k, μ̃_j^k, j ∈ J_F^k, satisfy

   ν_f^k = Σ_{j∈J_f^k} λ_j^k,   λ_j^k = ν_f^k λ̃_j^k,  j ∈ J_f^k,          (2.26a)

   ν_F^k = Σ_{j∈J_F^k} μ_j^k,   μ_j^k = ν_F^k μ̃_j^k,  j ∈ J_F^k,          (2.26b)

   λ̃_j^k ≥ 0, j ∈ J_f^k,   Σ_{j∈J_f^k} λ̃_j^k = 1,                        (2.26c)

   μ̃_j^k ≥ 0, j ∈ J_F^k,   Σ_{j∈J_F^k} μ̃_j^k = 1.                        (2.26d)

Such numbers exist and can be easily computed as follows. By (2.21a),
(2.26a) and (2.26b), we have

   ν_f^k ≥ 0,   ν_F^k ≥ 0,   ν_f^k + ν_F^k = 1.                           (2.27)

If ν_f^k ≠ 0 then the scaled multipliers λ̃_j^k = λ_j^k / ν_f^k satisfy (2.26a)
and (2.26c) in view of (2.21a). If ν_f^k = 0 then λ_j^k = 0
for all j ∈ J_f^k by (2.21a), hence (2.26a) is trivially fulfilled by any
numbers λ̃_j^k satisfying (2.26c). Similarly one may choose μ̃_j^k.
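The scaling procedure just described can be sketched in code; this is an illustrative helper (the function name and the uniform fallback weights are ours, not the book's), assuming the multipliers are given as plain lists:

```python
def scale_multipliers(lam, mu):
    """Split joint multipliers (lam, mu) of subproblem (2.12) into
    nu_f, nu_F of (2.27) and separate convex weights satisfying
    (2.26c) and (2.26d).  Illustrative sketch only."""
    nu_f = sum(lam)                      # (2.26a): nu_f = sum of lam_j
    nu_F = sum(mu)                       # (2.26b); nu_f + nu_F = 1 by (2.21a)
    if nu_f > 0:
        lam_t = [l / nu_f for l in lam]  # scaled multipliers lam_j / nu_f
    else:                                # all lam_j = 0: any convex weights do
        lam_t = [1.0 / len(lam)] * len(lam) if lam else []
    if nu_F > 0:
        mu_t = [m / nu_F for m in mu]
    else:
        mu_t = [1.0 / len(mu)] * len(mu) if mu else []
    return nu_f, nu_F, lam_t, mu_t
```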
The above scaled Lagrange multipliers (λ̃^k, μ̃^k) will be used for
subgradient aggregation as follows.

Lemma 2.4. Define the aggregate subgradients

   (p_f^k, f̃_p^k) = Σ_{j∈J_f^k} λ̃_j^k (g_f^j, f_j^k)   and
   (p_F^k, F̃_p^k) = Σ_{j∈J_F^k} μ̃_j^k (g_F^j, F_j^k).                    (2.28)

Then

   p^k = ν_f^k p_f^k + ν_F^k p_F^k,                                      (2.29)

   v^k = −{ |p^k|² + ν_f^k [f(x^k) − f̃_p^k] − ν_F^k F̃_p^k }.            (2.30)

Moreover, subproblem (2.12) is equivalent to the following reduced problem

   minimize   ½|d|² + v   over  (d,v) ∈ R^{N+1},

   subject to  f̃_p^k − f(x^k) + ⟨p_f^k, d⟩ ≤ v,                          (2.31)

               F̃_p^k + ⟨p_F^k, d⟩ ≤ v.

Proof. (2.29) and (2.30) follow easily from (2.21) and (2.26). The equi-
valence of (2.31) and (2.12) can be shown as in the proof of Lemma 2.2.2.
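The computations of Lemma 2.4 reduce to forming convex combinations of subgradient/value pairs. A minimal sketch, with vectors as plain lists and all names illustrative:

```python
def aggregate(lam_t, mu_t, G_f, f_vals, G_F, F_vals, nu_f, nu_F, f_xk):
    """Form the aggregate subgradients (2.28), then p^k of (2.29) and
    v^k of (2.30).  G_f, G_F are lists of subgradient vectors; f_vals,
    F_vals the linearization values.  Illustrative sketch only."""
    n = len(G_f[0])
    p_f = [sum(l * g[i] for l, g in zip(lam_t, G_f)) for i in range(n)]
    f_p = sum(l * fv for l, fv in zip(lam_t, f_vals))
    p_F = [sum(m * g[i] for m, g in zip(mu_t, G_F)) for i in range(n)]
    F_p = sum(m * Fv for m, Fv in zip(mu_t, F_vals))
    p = [nu_f * a + nu_F * b for a, b in zip(p_f, p_F)]              # (2.29)
    v = -(sum(c * c for c in p) + nu_f * (f_xk - f_p) - nu_F * F_p)  # (2.30)
    return p_f, f_p, p_F, F_p, p, v
```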

The constraints of the reduced subproblem (2.31) are generated by
the following aggregate linearizations

   f̃^k(x) = f̃_p^k + ⟨p_f^k, x − x^k⟩   and
   F̃^k(x) = F̃_p^k + ⟨p_F^k, x − x^k⟩.                                   (2.32)

They are convex combinations of the linearizations f_j and F_j, respec-
tively, because the aggregating scaled multipliers form convex combina-
tions, cf. (2.26c,d) and Lemma 2.2.3.

The rules for updating the linearizations can be taken from Chapter 2:

   f_j^{k+1} = f_j^k + ⟨g_f^j, x^{k+1} − x^k⟩   and
   F_j^{k+1} = F_j^k + ⟨g_F^j, x^{k+1} − x^k⟩,                           (2.33)

because for each x ∈ R^N and j = 1,...,k we have

   f_j(x) = f_j^k + ⟨g_f^j, x − x^k⟩   and
   F_j(x) = F_j^k + ⟨g_F^j, x − x^k⟩,                                    (2.34)

see (2.2.30) and (2.2.32). Denoting

   f_p^{k+1} = f̃^k(x^{k+1})   and   F_p^{k+1} = F̃^k(x^{k+1}),

we obtain from (2.32) similar rules for updating the aggregate lineari-
zations:

   f_p^{k+1} = f̃_p^k + ⟨p_f^k, x^{k+1} − x^k⟩   and
   F_p^{k+1} = F̃_p^k + ⟨p_F^k, x^{k+1} − x^k⟩.                          (2.35)
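Both (2.33) and (2.35) are instances of one transport rule, shifting a linearization value to a new center along its subgradient. A sketch (illustrative names):

```python
def shift_linearization(val, g, x_old, x_new):
    """Transport a linearization value to a new center, as in (2.33)
    and (2.35): val_new = val + <g, x_new - x_old>.  Illustrative."""
    return val + sum(gi * (b - a) for gi, a, b in zip(g, x_old, x_new))
```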

Also by convexity the linearizations satisfy

   f(x) ≥ f_j(x)   and   F(x) ≥ F_j(x)   for all x and j,                (2.36)

hence the aggregate linearizations, being their convex combinations, al-
so satisfy

   f(x) ≥ f̃^k(x)   and   F(x) ≥ F̃^k(x)   for all x.                    (2.37)

We also note that at the (k+1)-st iteration the aggregate linearizations

   f̃^k(x) = f_p^{k+1} + ⟨p_f^k, x − x^{k+1}⟩   and
   F̃^k(x) = F_p^{k+1} + ⟨p_F^k, x − x^{k+1}⟩                            (2.38)

are generated by the updated aggregate subgradients (p_f^k, f_p^{k+1}) and
(p_F^k, F_p^{k+1}).
In terms of aggregate linearizations, Lemma 2.4 states that an equi-
valent formulation of subproblem (2.14) is

   minimize   H̃^k(x^k + d) + ½|d|²   over all d,                        (2.39)

where

   H̃^k(x) = max{ f̃^k(x) − f(x^k), F̃^k(x) }   for all x.               (2.40)

Thus the use of separate aggregation enables one to construct aggregate
versions of polyhedral approximations to the improvement function
H(·;x^k).
Following the generalized cutting plane idea one may obtain the
next search direction finding subproblem by updating the reduced subprob-
lem (2.31) according to (2.35), and appending constraints generated by
the latest subgradients g_f^{k+1} and g_F^{k+1}. For efficiency one may also re-
tain a limited number of linear constraints generated by the past sub-
gradients.
In this way we arrive at the following description of consecutive
subproblems of the method with subgradient aggregation. Let (d^k, v^k) de-
note the solution to the following k-th search direction finding subprob-
lem

   minimize   ½|d|² + v   over  (d,v) ∈ R^{N+1},

   subject to  f_j^k − f(x^k) + ⟨g_f^j, d⟩ ≤ v,   j ∈ J_f^k,

               f_p^k − f(x^k) + ⟨p_f^{k−1}, d⟩ ≤ v,                      (2.41)

               F_j^k + ⟨g_F^j, d⟩ ≤ v,   j ∈ J_F^k,

               F_p^k + ⟨p_F^{k−1}, d⟩ ≤ v.

For k = 1 we shall initialize the method by choosing the starting point
x^1 ∈ S and setting y^1 = x^1 and

   p_f^0 = g_f^1 = g_f(x^1),   f_p^1 = f_1^1 = f(x^1),   J_f^1 = {1},
                                                                         (2.42)
   p_F^0 = g_F^1 = g_F(x^1),   F_p^1 = F_1^1 = F(x^1),   J_F^1 = {1}.

Subproblem (2.41) is of the form (2.12), hence we may rephrase the
preceding results as follows. Let λ_j^k, j ∈ J_f^k, λ_p^k, μ_j^k, j ∈ J_F^k, and μ_p^k
denote any Lagrange multipliers of (2.41). Then Lemma 2.3 yields

   λ_j^k ≥ 0, j ∈ J_f^k,   λ_p^k ≥ 0,   μ_j^k ≥ 0, j ∈ J_F^k,   μ_p^k ≥ 0,   (2.43a)

   Σ_{j∈J_f^k} λ_j^k + λ_p^k + Σ_{j∈J_F^k} μ_j^k + μ_p^k = 1,            (2.43b)

   p^k = Σ_{j∈J_f^k} λ_j^k g_f^j + λ_p^k p_f^{k−1} + Σ_{j∈J_F^k} μ_j^k g_F^j + μ_p^k p_F^{k−1},   (2.43c)

   d^k = −p^k,                                                           (2.43d)

hence we may calculate scaled multipliers satisfying

   ν_f^k = Σ_{j∈J_f^k} λ_j^k + λ_p^k,   λ_j^k = ν_f^k λ̃_j^k, j ∈ J_f^k,   λ_p^k = ν_f^k λ̃_p^k,   (2.44a)

   ν_F^k = Σ_{j∈J_F^k} μ_j^k + μ_p^k,   μ_j^k = ν_F^k μ̃_j^k, j ∈ J_F^k,   μ_p^k = ν_F^k μ̃_p^k,   (2.44b)

   λ̃_j^k ≥ 0, j ∈ J_f^k,   λ̃_p^k ≥ 0,   Σ_{j∈J_f^k} λ̃_j^k + λ̃_p^k = 1,   (2.44c)

   μ̃_j^k ≥ 0, j ∈ J_F^k,   μ̃_p^k ≥ 0,   Σ_{j∈J_F^k} μ̃_j^k + μ̃_p^k = 1,   (2.44d)

   ν_f^k ≥ 0,   ν_F^k ≥ 0,   ν_f^k + ν_F^k = 1,                          (2.44e)

and use them for computing the aggregate subgradients

   (p_f^k, f̃_p^k) = Σ_{j∈J_f^k} λ̃_j^k (g_f^j, f_j^k) + λ̃_p^k (p_f^{k−1}, f_p^k),
                                                                         (2.45)
   (p_F^k, F̃_p^k) = Σ_{j∈J_F^k} μ̃_j^k (g_F^j, F_j^k) + μ̃_p^k (p_F^{k−1}, F_p^k).

Moreover, relations (2.29) and (2.30) also hold for the method with sub-
gradient aggregation based on (2.43)-(2.45).

One may observe that for the method with subgradient aggregation
the last assertion of Lemma 2.4 can be reformulated as follows: subprob-
lem (2.31) is equivalent to subproblem (2.41). Both subproblems are
equivalent to the following

   minimize   H̃_a^k(x^k + d) + ½|d|²   over all d,                      (2.46)

where the aggregate polyhedral approximation to the improvement function

   H̃_a^k(x) = max{ f̃_a^k(x) − f(x^k), F̃_a^k(x) }                      (2.47)

is defined by the following aggregate polyhedral approximations to f
and F:

   f̃_a^k(x) = max{ f̃^{k−1}(x), f_j(x) : j ∈ J_f^k },
                                                                         (2.48)
   F̃_a^k(x) = max{ F̃^{k−1}(x), F_j(x) : j ∈ J_F^k }.

Remark 2.5. Suppose the problem functions are of the form

   f(x) = max{ f_j(x) : j ∈ J_f }   and   F(x) = max{ F_j(x) : j ∈ J_F }

and one can compute subgradients g_f^{j,k} ∈ ∂f_j(x^k), j ∈ J_f, and g_F^{j,k} ∈ ∂F_j(x^k),
j ∈ J_F. Then one may append the constraints

   f_j(x^k) − f(x^k) + ⟨g_f^{j,k}, d⟩ ≤ v,   j ∈ J_f,
                                                                         (2.49)
   F_j(x^k) + ⟨g_F^{j,k}, d⟩ ≤ v,   j ∈ J_F,

to the search direction subproblems (2.12) and (2.41), for all k. This
speeds up convergence, but at the cost of more work per iteration.
One may also replace the sets J_f and J_F in (2.49) with the sets

   J_f(ε) = { j ∈ J_f : f_j(x^k) ≥ f(x^k) − ε },
   J_F(ε) = { j ∈ J_F : F_j(x^k) ≥ F(x^k) − ε }

for some ε ≥ 0. Such augmentations should be used especially if there are
many constraints in the original formulation of problem (1.1), cf. Re-
mark 2.2. It is straightforward to extend the subgradient selection and
aggregation rules to the augmented subproblems. The subsequent results
hold also for such modifications.
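The ε-active index sets J_f(ε), J_F(ε) of this remark amount to a one-line filter; a sketch with illustrative names:

```python
def eps_active(values, best, eps):
    """Indices of nearly active pieces: {j : values[j] >= best - eps},
    as in the sets J_f(eps), J_F(eps) of Remark 2.5.  Illustrative."""
    return [j for j, v in enumerate(values) if v >= best - eps]
```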

We shall now remark on relations of the above methods with other


algorithms. The methods generalize the method of centers for inequality
constrained minimax problems (Kiwiel, 1981a), which in turn extends the
Pironneau and Polak (1972) method of centers for smooth nonlinear pro-
gramming problems. On the other hand, subproblems (2.12) are reduced
versions of the Mifflin (1982) subproblems, cf. (1.3.31).

Remark 2.6. It is worthwhile to compare subproblems (2.14) and (2.46)
with the "conceptual" search direction finding subproblem (1.2.82). The
latter problem always yields a descent direction, provided that the point
at which it is defined is nonstationary. However, this problem requires
full subdifferentials of the problem functions for defining the corres-
ponding approximations, whereas subproblems (2.14) and (2.46) are ba-
sed on the polyhedral approximations (2.11), (2.13), (2.47) and (2.48)
(see also Remark 2.2.6).

3. The Algorithm with Subgradient Aggregation

The results of the preceding section are formalized in the follow-
ing procedure for solving problem (1.1).

Algorithm 3.1
Step 0 (Initialization). Select a starting point x^1 ∈ S, a final accuracy
tolerance ε_s ≥ 0 and a line search parameter m ∈ (0,1). Set y^1 = x^1 and
initialize the algorithm according to (2.42). Set the counters k = 1,
l = 0 and k(0) = 1.
Step 1 (Direction finding). Compute multipliers λ_j^k, j ∈ J_f^k, λ_p^k, μ_j^k,
j ∈ J_F^k, and μ_p^k that solve the k-th dual search direction finding sub-
problem

   minimize over (λ,μ):
      ½| Σ_{j∈J_f^k} λ_j g_f^j + λ_p p_f^{k−1} + Σ_{j∈J_F^k} μ_j g_F^j + μ_p p_F^{k−1} |² +

      + Σ_{j∈J_f^k} λ_j [f(x^k) − f_j^k] + λ_p [f(x^k) − f_p^k] − Σ_{j∈J_F^k} μ_j F_j^k − μ_p F_p^k,
                                                                         (3.1)
   subject to  λ_j ≥ 0, j ∈ J_f^k,   λ_p ≥ 0,   μ_j ≥ 0, j ∈ J_F^k,   μ_p ≥ 0,

               Σ_{j∈J_f^k} λ_j + λ_p + Σ_{j∈J_F^k} μ_j + μ_p = 1.

Calculate multipliers ν_f^k, λ̃_j^k, j ∈ J_f^k, λ̃_p^k, ν_F^k, μ̃_j^k, j ∈ J_F^k, and μ̃_p^k satisfying
(2.44). Compute (p_f^k, f̃_p^k) and (p_F^k, F̃_p^k) by (2.45), and use (2.29) and
(2.30) for calculating p^k and v^k. Set d^k = −p^k.

Step 2 (Stopping criterion). Set

   w^k = ½|p^k|² + ν_f^k [f(x^k) − f̃_p^k] − ν_F^k F̃_p^k.                (3.2)

If w^k ≤ ε_s, terminate; otherwise, go to Step 3.

Step 3 (Line search). Set y^{k+1} = x^k + d^k. If

   H(x^k + d^k; x^k) ≤ H(x^k; x^k) + m v^k                               (3.3)

then set t_L^k = 1 (a serious step), set k(l+1) = k+1 and increase l by 1;
otherwise, i.e. if (3.3) is violated, set t_L^k = 0 (a null step).

Step 4 (Linearization updating). Set x^{k+1} = x^k + t_L^k d^k. Choose some sets
Ĵ_f^k ⊂ J_f^k and Ĵ_F^k ⊂ J_F^k, and calculate the linearization values f_j^{k+1},
j ∈ Ĵ_f^k, F_j^{k+1}, j ∈ Ĵ_F^k, f_p^{k+1} and F_p^{k+1} by (2.33) and (2.35). Set
g_f^{k+1} = g_f(y^{k+1}), g_F^{k+1} = g_F(y^{k+1}) and

   f_{k+1}^{k+1} = f(y^{k+1}) + ⟨g_f^{k+1}, x^{k+1} − y^{k+1}⟩   and
   F_{k+1}^{k+1} = F(y^{k+1}) + ⟨g_F^{k+1}, x^{k+1} − y^{k+1}⟩.

Set J_f^{k+1} = Ĵ_f^k ∪ {k+1} and J_F^{k+1} = Ĵ_F^k ∪ {k+1}. Increase k by 1 and go to
Step 1.
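The serious/null decision of Step 3 can be sketched as follows. This is a minimal acceptance test for the method of centers, not the full Algorithm 3.1; the function names and the default m = 0.1 are illustrative:

```python
def bundle_step(x, d, v, H, m=0.1):
    """Step 3 of Algorithm 3.1: accept y = x + d as the new center
    when the improvement function decreases by at least m*v (v < 0),
    where H(y, x) = max(f(y) - f(x), F(y)) and m in (0,1).
    Returns (new_center, t_L).  Illustrative sketch."""
    y = [xi + di for xi, di in zip(x, d)]
    if H(y, x) <= H(x, x) + m * v:
        return y, 1              # serious step: t_L^k = 1
    return x, 0                  # null step:   t_L^k = 0
```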

Remark 3.2. It follows from Lemma 2.3 that in Algorithm 3.1 (d^k, v^k)
solves subproblem (2.41), and that λ_j^k, j ∈ J_f^k, λ_p^k, μ_j^k, j ∈ J_F^k, μ_p^k are the
associated Lagrange multipliers. Thus one may equivalently solve subprob-
lem (2.41) in Step 1 of the above algorithm.

Remark 3.3. As noted in the previous section, the line search guarantees
that x^{k+1} ∈ S if x^k ∈ S. Since x^1 ∈ S by assumption, it follows
that Algorithm 3.1 is a feasible point method. In particular, H(x^k; x^k) =
0 for all k.

Remark 3.4. For convenience, the above version of the algorithm requires
the subgradient mappings g_f and g_F to be defined everywhere. In
fact, g_f need not be defined at x ∉ S. In this case, in Step 4 set
g_f^{k+1} = g_f(x^{k+1}) and f_{k+1}^{k+1} = f(x^{k+1}) if y^{k+1} ∉ S. If g_F is not defined
at feasible points, then the following modifications are necessary.
Set J_F^1 = ∅ in Step 0 and impose the additional constraint μ_p = 0 in sub-
problem (3.1) (or delete the last constraint of (2.41)) for all k such
that y^j ∈ S for j = 1,...,k−1. Also let J_F^{k+1} = Ĵ_F^k in Step 4 if y^{k+1} ∈ S.
This amounts to not using the constraint subgradients until the first
infeasible trial point is found. One can check that all the subsequent
proofs need only minor changes to cover this modification of Step 1
and Step 3. We will not treat this case explicitly, since this would
further complicate the notation.

4. Convergence

In this section we show that the sequence {x^k} generated by Al-
gorithm 3.1 is a minimizing sequence, i.e. {x^k} ⊂ S and f(x^k) → inf
{f(x) : x ∈ S} as k → ∞, and that {x^k} converges to a solution of
problem (1.1), provided that the solution set X is nonempty. Naturally,
the convergence results assume that the final accuracy tolerance ε_s is
set to zero.
   Since Algorithm 3.1 is an extension of Algorithm 2.3.1 to con-
strained problems, the analysis given below dwells on the results of
Section 2.4. Therefore we shall concentrate on modifications only, omit-
ting certain details that can be found in Chapter 2.
   We start by stating that the aggregate subgradients are convex
combinations of the past subgradients.

Lemma 4.1. Suppose that Algorithm 3.1 did not stop before the k-th ite-
ration. Then

   (p_f^k, f̃_p^k) ∈ conv{ (g_f^j, f_j^k) : j = 1,...,k },
                                                                         (4.1)
   (p_F^k, F̃_p^k) ∈ conv{ (g_F^j, F_j^k) : j = 1,...,k }.

If additionally k > 1, then

   (p_f^{k−1}, f_p^k) ∈ conv{ (g_f^j, f_j^k) : j = 1,...,k−1 },
                                                                         (4.2)
   (p_F^{k−1}, F_p^k) ∈ conv{ (g_F^j, F_j^k) : j = 1,...,k−1 }.

Proof. Since Algorithm 3.1 uses aggregation rules analogous to those of
Algorithm 2.3.1, the proof is similar to the proof of Lemma 2.4.1.

It follows from the above convex representation of the aggregate
subgradients that they can be interpreted as ε-subgradients of the fun-
ctions associated with problem (1.1), viz. the objective function f, the
constraint violation function F_+ defined by

   F_+(x) = F(x)_+ = max{F(x), 0}   for all x,

and the improvement function H. Note that the function F_+ is convex and
we always have F(x^k)_+ = 0, because {x^k} ⊂ S. In the following, suppose
that Algorithm 3.1 did not terminate before the k-th iteration. Define
the linearization errors

   α_{f,j}^k = f(x^k) − f_j^k,   α_{F,j}^k = −F_j^k,   j = 1,...,k,      (4.3a)

   α_{f,p}^k = f(x^k) − f_p^k,   α_{F,p}^k = −F_p^k,                     (4.3b)

   α̃_{f,p}^k = f(x^k) − f̃_p^k,   α̃_{F,p}^k = −F̃_p^k,                 (4.3c)

   α̃_p^k = ν_f^k [f(x^k) − f̃_p^k] − ν_F^k F̃_p^k.                      (4.3d)

Observe that F(x^k)_+ = 0 implies

   α_{F,j}^k = F(x^k)_+ − F_j^k,   α_{F,p}^k = F(x^k)_+ − F_p^k,
   α̃_{F,p}^k = F(x^k)_+ − F̃_p^k,

which justifies calling these variables the linearization errors of the
constraint violation function. The linearization errors characterize the
subgradients calculated by the algorithm as follows.

Lemma 4.2. At the k-th iteration of Algorithm 3.1, one has

   g_f^j ∈ ∂_ε f(x^k)   for  ε = α_{f,j}^k,   j = 1,...,k,               (4.4a)

   g_F^j ∈ ∂_ε F(x^k)_+   for  ε = α_{F,j}^k,   j = 1,...,k,             (4.4b)

   p_f^{k−1} ∈ ∂_ε f(x^k)   for  ε = α_{f,p}^k,                          (4.4c)

   p_F^{k−1} ∈ ∂_ε F(x^k)_+   for  ε = α_{F,p}^k,                        (4.4d)

   p_f^k ∈ ∂_ε f(x^k)   for  ε = α̃_{f,p}^k,                             (4.4e)

   p_F^k ∈ ∂_ε F(x^k)_+   for  ε = α̃_{F,p}^k,                           (4.4f)

   p^k ∈ ∂_ε H(x^k; x^k)   for  ε = α̃_p^k ≥ 0.                          (4.4g)

Proof. Using (2.34), (2.36), Lemma 4.1 and the fact that we always have
F(x)_+ ≥ F(x) and F(x^k)_+ = 0, one obtains (4.4a)-(4.4f) as in the proof
of Lemma 2.4.2. In particular, similarly to (2.4.5c) we get

   f(x) ≥ f(x^k) + ⟨p_f^k, x − x^k⟩ − [f(x^k) − f̃_p^k],

   F(x) ≥ ⟨p_F^k, x − x^k⟩ + F̃_p^k

for each x ∈ R^N. The above inequalities, (2.44e), (2.29) and (4.3d)
yield

   ν_f^k [f(x) − f(x^k)] + ν_F^k F(x) ≥ ⟨p^k, x − x^k⟩ − α̃_p^k

for each x ∈ R^N. Since ν_f^k ≥ 0 and ν_F^k ≥ 0 satisfy ν_f^k + ν_F^k = 1, we have

   H(x; x^k) = max{f(x) − f(x^k), F(x)} ≥ ν_f^k [f(x) − f(x^k)] + ν_F^k F(x),

hence

   H(x; x^k) ≥ H(x^k; x^k) + ⟨p^k, x − x^k⟩ − α̃_p^k   for all x,

because H(x^k; x^k) = 0. Setting x = x^k, we complete the proof of (4.4g).

Remark 4.3. In view of (4.4), the linearization errors (4.3) may also be
called subgradient locality measures, because they indicate the distance
from subgradients to the corresponding subdifferentials at the current
point x^k. For instance, the value of α̃_p^k ≥ 0 indicates how much p^k dif-
fers from being a member of ∂H(x^k; x^k); if α̃_p^k = 0 then p^k ∈ ∂H(x^k; x^k).
   The following result is useful for justifying the stopping criter-
ion of the algorithm.

Lemma 4.4. At the k-th iteration of Algorithm 3.1, one has

   w^k = ½|p^k|² + α̃_p^k,                                               (4.5)

   v^k = −{ |p^k|² + α̃_p^k },                                           (4.6)

   v^k ≤ −w^k ≤ 0.                                                       (4.7)

Proof. This follows easily from (3.2), (2.30), (4.3d) and the nonnega-
tivity of α̃_p^k.

From relations (4.4g) and (4.5) we deduce easily that

   p^k ∈ ∂_ε H(x^k; x^k)   and   |p^k| ≤ (2ε)^{1/2}   for  ε = w^k.      (4.8)

Thus w^k may be called a stationarity measure of the current point x^k,
because |p^k| indicates how much p^k differs from the null vector
and α̃_p^k measures the distance from p^k to ∂H(x^k; x^k), while stationary
points x̄ satisfy

   0 ∈ ∂H(x̄; x̄)

by Lemma 2.1. The estimate (4.8) shows that x^k is approximately opti-
mal when the value of w^k is small.
   In what follows we assume that the final accuracy tolerance ε_s is
set to zero. Since the algorithm stops if and only if 0 ≤ w^k ≤ ε_s = 0, (4.8)
and Lemma 2.1 yield

Lemma 4.5. If Algorithm 3.1 terminates at the k-th iteration, then
x^k ∈ X.
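The stopping test of Step 2 rests on (4.5) and (4.8): the single number w^k bounds both |p^k| and the locality measure α̃_p^k. A minimal sketch (illustrative names):

```python
def stationarity_measure(p, alpha_p):
    """w^k from (4.5): w = |p|^2/2 + alpha_p, with alpha_p >= 0.
    By (4.8), |p| <= sqrt(2*w), so a small w certifies approximate
    stationarity of the current point.  Illustrative sketch."""
    return 0.5 * sum(c * c for c in p) + alpha_p
```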

From now on we suppose that the algorithm does not terminate, i.e.
w^k > 0 for all k. Since the line search rules imply that we always have

   f(x^{k+1}) − f(x^k) ≤ m t_L^k v^k                                     (4.9)

with m > 0 and t_L^k ≥ 0, the fact that v^k ≤ −w^k < 0 (see (4.7)) yields
that the sequence {f(x^k)} is nonincreasing.
   We shall need the following properties of the improvement function
H.

Lemma 4.6. The mapping ∂·H(·;·) is locally bounded, i.e. if (y,x,ε)
remains bounded in R^N × R^N × R, then ∂_ε H(y;x) remains bounded in R^N.
Moreover, ∂·H(·;·) is upper semicontinuous, i.e. if the sequences
{y^i}, {x^i}, {ε_i}, {g^i}, with g^i ∈ ∂_{ε_i} H(y^i; x^i) for all i, tend to ȳ, x̄, ε and
g, respectively, then g ∈ ∂_ε H(ȳ; x̄).

Proof. Consider a bounded set B ⊂ R^N × R^N × R and let (y,x,ε) ∈ B, g_H ∈
∂_ε H(y;x) and τ = y + g_H/|g_H|. Then H(τ;x) ≥ H(y;x) + ⟨g_H, τ−y⟩ − ε yields

   ⟨g_H, τ−y⟩ = |g_H| ≤ H(τ;x) − H(y;x) + ε.

But τ is bounded, and so are H(τ;x) and H(y;x) (H is continuous as
a convex function), thus proving the local boundedness of ∂·H(·;·).
Next, g^i ∈ ∂_{ε_i} H(y^i; x^i) implies

   H(τ; x^i) ≥ H(y^i; x^i) + ⟨g^i, τ − y^i⟩ − ε_i   for all τ.

Passing to the limit, we obtain the desired conclusion.



The following property of the stationarity measures is crucial for
convergence.

Lemma 4.7. Suppose that there exist an infinite set K ⊂ {1,2,...} and
a point x̄ ∈ R^N satisfying x^k → x̄ and w^k → 0 as k → ∞, k ∈ K. Then
x̄ ∈ X.

Proof. By (4.5), w^k → 0 (k ∈ K) implies p^k → 0 and α̃_p^k → 0, hence
(4.4g) and Lemma 4.6 yield 0 ∈ ∂H(x̄; x̄). Thus x̄ ∈ X by Lemma 2.1.

The following result can be obtained similarly to Lemma 2.4.7.

Lemma 4.8. Suppose that the sequence {f(x^k)} is bounded from below. Then

   Σ_{k=1}^∞ { t_L^k |p^k|² + t_L^k α̃_p^k } < +∞.                        (4.10)

As in Section 2.4 (see (2.4.10)), we have

   x^k = x^{k(l)}   for  k = k(l), k(l)+1, ..., k(l+1)−1,                (4.11)

where k(l+1) = +∞ if the number l of serious steps stays bounded, i.e.
if x^k = x^{k(l)} for some fixed l and all k ≥ k(l).

First we deal with the case of infinitely many serious steps.
The following result can be proved similarly to Lemma 2.4.8, since it
is an immediate consequence of (4.5), Lemma 4.8 and the fact that

   t_L^{k(l+1)−1} = 1   for all l.

Lemma 4.9. Suppose that there exist an infinite set L ⊂ {1,2,...} and
a point x̄ ∈ R^N such that x^{k(l)} → x̄ as l → ∞, l ∈ L. Then w^{k(l+1)−1} → 0
as l → ∞, l ∈ L.

In the case of a finite number of serious steps, we have to show
that the stationarity measures {w^k} tend to zero. To this end we shall
analyze the dual search direction finding subproblems.

Lemma 4.10. At the k-th iteration of Algorithm 3.1, w^k is the optimal
value of the following problem

   minimize over (λ,μ):
      ½| Σ_{j∈J_f^k} λ_j g_f^j + λ_p p_f^{k−1} + Σ_{j∈J_F^k} μ_j g_F^j + μ_p p_F^{k−1} |² +

      + Σ_{j∈J_f^k} λ_j α_{f,j}^k + λ_p α_{f,p}^k + Σ_{j∈J_F^k} μ_j α_{F,j}^k + μ_p α_{F,p}^k,
                                                                         (4.12)
   subject to  λ_j ≥ 0, j ∈ J_f^k,   λ_p ≥ 0,   μ_j ≥ 0, j ∈ J_F^k,   μ_p ≥ 0,

               Σ_{j∈J_f^k} λ_j + λ_p + Σ_{j∈J_F^k} μ_j + μ_p = 1,

which is equivalent to subproblem (3.1).

Proof. As in the proof of Lemma 2.4.9, the assertion follows from (4.3),
(4.5), (2.45), (2.43c) and the fact that the k-th Lagrange multipliers
solve (3.1).

Let us now define the variables

   g^k = g_F^k  and  α^k = α_{F,k}^k   if  f(y^k) − f(x^{k−1}) < F(y^k),   (4.13a)

   g^k = g_f^k  and  α^k = α_{f,k}^k   if  f(y^k) − f(x^{k−1}) ≥ F(y^k),   (4.13b)

for all k > 1. They will be used in the following extension of Lemma
2.4.11.
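The selection rule (4.13) simply picks whichever pair corresponds to the larger term of the improvement function at the trial point. A sketch (argument names illustrative):

```python
def trial_subgradient(fy, fx, Fy, g_f, a_f, g_F, a_F):
    """Rule (4.13): return the constraint pair (g_F, a_F) when
    f(y) - f(x) < F(y), and the objective pair (g_f, a_f) otherwise.
    Illustrative sketch."""
    return (g_F, a_F) if fy - fx < Fy else (g_f, a_f)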

Lemma 4.11. Suppose that t_L^{k−1} = 0 for some k > 1. Then

   −α^k + ⟨g^k, d^{k−1}⟩ ≥ m v^{k−1},                                    (4.14)

   w^k ≤ φ_C(w^{k−1}),                                                   (4.15)

where φ_C is defined by (2.4.16) and C is any number satisfying

   C ≥ max{ |p^{k−1}|, |g^k|, α̃_p^{k−1}, 1 }.

Proof. (i) If t_L^{k−1} = 0 then the line search rules yield y^k = x^{k−1} + d^{k−1}
and x^k = x^{k−1}, i.e. y^k = x^k + d^{k−1}, and

   max{ f(y^k) − f(x^k), F(y^k) } > m v^{k−1}.                           (4.16)

First, suppose that F(y^k) > m v^{k−1}. Then (4.13a), (4.3a), the rules of
Step 4 and the fact that y^k = x^k + d^{k−1} yield

   −α^k + ⟨g^k, d^{k−1}⟩ = F_k^k + ⟨g_F^k, y^k − x^k⟩ = F(y^k) > m v^{k−1}.   (4.17a)

Next, suppose that F(y^k) ≤ m v^{k−1}. Then (4.16) implies f(y^k) − f(x^k) >
m v^{k−1}. Hence (4.13b), (4.3a) and the rules of Step 4 yield

   −α^k + ⟨g^k, d^{k−1}⟩ = −[f(x^k) − f(y^k) − ⟨g_f^k, x^k − y^k⟩] + ⟨g_f^k, d^{k−1}⟩

      = f(y^k) − f(x^k) > m v^{k−1}.                                     (4.17b)

This completes the proof of (4.14).

(ii) If (4.17a) holds, let ν ∈ [0,1] and define the multipliers

   λ_j(ν) = 0, j ∈ J_f^k,   λ_p(ν) = (1−ν) ν_f^{k−1},
                                                                         (4.18a)
   μ_k(ν) = ν,   μ_j(ν) = 0, j ∈ J_F^k \ {k},   μ_p(ν) = (1−ν) ν_F^{k−1}.

If (4.17b) is satisfied, let

   λ_k(ν) = ν,   λ_j(ν) = 0, j ∈ J_f^k \ {k},   λ_p(ν) = (1−ν) ν_f^{k−1},
                                                                         (4.18b)
   μ_j(ν) = 0, j ∈ J_F^k,   μ_p(ν) = (1−ν) ν_F^{k−1}.

Observe that the multipliers (4.18) are feasible for subproblem (4.12)
for each ν ∈ [0,1], because k ∈ J_f^k ∩ J_F^k and (2.44e) is satisfied. More-
over, for each ν ∈ [0,1],

   Σ_{j∈J_f^k} λ_j(ν) g_f^j + λ_p(ν) p_f^{k−1} + Σ_{j∈J_F^k} μ_j(ν) g_F^j + μ_p(ν) p_F^{k−1}
      = (1−ν) p^{k−1} + ν g^k.                                           (4.19a)

This follows from (4.18), (2.29) and (4.13). Next, x^k = x^{k−1} implies
f_p^k = f̃_p^{k−1} and F_p^k = F̃_p^{k−1} by (2.35), hence (4.18), (4.13) and (4.3) yield

   Σ_{j∈J_f^k} λ_j(ν) α_{f,j}^k + λ_p(ν) α_{f,p}^k + Σ_{j∈J_F^k} μ_j(ν) α_{F,j}^k + μ_p(ν) α_{F,p}^k
      = (1−ν) α̃_p^{k−1} + ν α^k                                         (4.19b)

for each ν ∈ [0,1]. (4.19) and Lemma 4.10 imply that w^k is not larger
than the optimal value of the problem

   minimize   ½|(1−ν) p^{k−1} + ν g^k|² + (1−ν) α̃_p^{k−1} + ν α^k,
                                                                         (4.20)
   subject to  ν ∈ [0,1].

Since we also have (4.14), one may complete the proof by using Lemma 2.
4.10 for bounding the optimal value of (4.20), as in the proof of Lem-
ma 2.4.11.

We may now complete the analysis of the case of a finite number of
serious steps of the algorithm. The following result can be established
similarly to Lemma 2.4.12, if one uses Lemma 4.11 and the local bounded-
ness of ∂·H(·;·) (see Lemma 4.6) together with the definition (4.13).

Lemma 4.12. Suppose that the number l of serious steps of Algorithm 3.1
stays bounded, i.e. x^k = x^{k(l)} for some fixed l and all k ≥ k(l). Then
w^k → 0 as k → ∞.

Combining Lemma 4.7 with Lemma 4.9 and Lemma 4.12, and using (4.11),
we obtain

Theorem 4.13. Every accumulation point of the sequence {x^k} generated
by Algorithm 3.1 is a solution to problem (1.1).

A sufficient condition for the sequence {x^k} to have accumula-
tion points is given below.

Lemma 4.14. Suppose that a point x̂ ∈ S satisfies f(x̂) ≤ f(x^k) for all
k. Then the sequence {x^k} is bounded and

   |x̂ − x^k|² ≤ |x̂ − x^n|² + Σ_{i=n}^{k−1} { |x^{i+1} − x^i|² + 2 t_L^i α̃_p^i }   for  k > n ≥ 1,

   Σ_{i=n}^∞ { |x^{i+1} − x^i|² + 2 t_L^i α̃_p^i } → 0   as  n → ∞.

Proof. Observe that f(x̂) ≤ f(x^k) and F(x̂) ≤ 0 imply H(x̂; x^k) ≤ 0 =
H(x^k; x^k) (x^k ∈ S), hence (4.4g) yields

   ⟨p^k, x̂ − x^k⟩ ≤ α̃_p^k   for all k.

Since the above inequality is of the form (2.4.28), one may complete
the proof by using Lemma 4.8 similarly to the proof of Lemma 2.4.14.

We may now state the principal result.

Theorem 4.15. If problem (1.1) admits of a solution, then Algorithm 3.1
calculates a sequence {x^k} converging to a solution of problem (1.1).

Proof. Let x̂ ∈ X. Then x̂ ∈ S and f(x̂) ≤ f(x^k) for all k, hence Lemma
4.14 shows that {x^k} is bounded. By Theorem 4.13, {x^k} has an accu-
mulation point x̄ ∈ X. For showing that x^k → x̄, use Lemma 4.14 and
the proof of Theorem 2.4.15.

The following results can be proved similarly to Theorem 2.4.16,
Lemma 2.4.17 and Corollary 2.4.18.

Theorem 4.16. Each sequence {x^k} constructed by Algorithm 3.1 is a
minimizing sequence: {x^k} ⊂ S and f(x^k) → inf{f(x) : x ∈ S}.

Lemma 4.17. Suppose that the sequence {f(x^k)} is bounded from below.
Then w^k → 0.

Corollary 4.18. Suppose that inf{f(x) : x ∈ S} > −∞. Then Algorithm 3.1
terminates if its final accuracy tolerance ε_s is positive.

5. The Method with Subgradient Selection

In this section we analyze the method with subgradient selection
from Section 2.
   Algorithm 5.1 is obtained from Algorithm 3.1 by replacing Step 1
with

Step 1' (Direction finding). Find multipliers λ_j^k, j ∈ J_f^k, and μ_j^k, j ∈ J_F^k,
that solve the k-th dual subproblem (2.22), and sets Ĵ_f^k and Ĵ_F^k satis-
fying (2.23). Calculate scaled Lagrange multipliers satisfying (2.26).
Compute (p_f^k, f̃_p^k) and (p_F^k, F̃_p^k) by (2.28), and use (2.29) and (2.30) for
calculating p^k and v^k. Set d^k = −p^k.

Clearly, Algorithm 5.1 is an extension of Algorithm 2.5.1. There-
fore we refer the reader to Remark 2.5.2 on the computation of Lagrange
multipliers in Step 1', and to Remark 2.5.3 on methods that use more
than N+3 past subgradients for search direction finding. In particu-
lar, the analysis given below applies also to the method of Mifflin
(1982) (see Section 1.3), which uses all the past subgradients for
search direction finding.
By the results of Section 2, in Algorithm 5.1 (d^k, v^k) solves the
k-th primal subproblem (2.12), for any k. Therefore one may equivalent-
ly use (2.12) for direction finding.
   Convergence of Algorithm 5.1 can be proved by modifying the re-
sults of Section 4. One may proceed as in Section 2.5, where the proper-
ties of the method with subgradient selection were derived from the re-
sults on the convergence of the method with subgradient aggregation
from Section 2.4. Therefore we shall outline significant modifications
only.
   We substitute Lemma 4.10 with the following result.

Lemma 5.2. At the k-th iteration of Algorithm 5.1, w^k is the optimal
value of the following problem, which is equivalent to subproblem (2.22):

   minimize over (λ,μ):
      ½| Σ_{j∈J_f^k} λ_j g_f^j + Σ_{j∈J_F^k} μ_j g_F^j |² + Σ_{j∈J_f^k} λ_j α_{f,j}^k + Σ_{j∈J_F^k} μ_j α_{F,j}^k,
                                                                         (5.1)
   subject to  λ_j ≥ 0, j ∈ J_f^k,   μ_j ≥ 0, j ∈ J_F^k,   Σ_{j∈J_f^k} λ_j + Σ_{j∈J_F^k} μ_j = 1.

In the proof of Lemma 4.11, the definition (4.18a) should be sub-
stituted by

   λ_j(ν) = (1−ν) λ_j^{k−1}, j ∈ Ĵ_f^{k−1},   λ_k(ν) = 0,
                                                                         (5.2a)
   μ_j(ν) = (1−ν) μ_j^{k−1}, j ∈ Ĵ_F^{k−1},   μ_k(ν) = ν,

and (4.18b) by

   λ_j(ν) = (1−ν) λ_j^{k−1}, j ∈ Ĵ_f^{k−1},   λ_k(ν) = ν,
                                                                         (5.2b)
   μ_j(ν) = (1−ν) μ_j^{k−1}, j ∈ Ĵ_F^{k−1},   μ_k(ν) = 0.

By (2.26) and (2.28), we have

   λ_j^{k−1} ≥ 0, j ∈ Ĵ_f^{k−1},   μ_j^{k−1} ≥ 0, j ∈ Ĵ_F^{k−1},
   Σ_{j∈Ĵ_f^{k−1}} λ_j^{k−1} + Σ_{j∈Ĵ_F^{k−1}} μ_j^{k−1} = 1,

   ν_f^{k−1} (p_f^{k−1}, f̃_p^{k−1}) = Σ_{j∈Ĵ_f^{k−1}} λ_j^{k−1} (g_f^j, f_j^{k−1}),     (5.3)

   ν_F^{k−1} (p_F^{k−1}, F̃_p^{k−1}) = Σ_{j∈Ĵ_F^{k−1}} μ_j^{k−1} (g_F^j, F_j^{k−1}).

Using (5.2) and (5.3), one may replace (4.19) by

   Σ_{j∈J_f^k} λ_j(ν) g_f^j + Σ_{j∈J_F^k} μ_j(ν) g_F^j = (1−ν) p^{k−1} + ν g^k,

   Σ_{j∈J_f^k} λ_j(ν) α_{f,j}^k + Σ_{j∈J_F^k} μ_j(ν) α_{F,j}^k = (1−ν) α̃_p^{k−1} + ν α^k,   (5.4)

   λ_j(ν) ≥ 0, j ∈ J_f^k,   μ_j(ν) ≥ 0, j ∈ J_F^k,   Σ_{j∈J_f^k} λ_j(ν) + Σ_{j∈J_F^k} μ_j(ν) = 1,

for all ν ∈ [0,1], if t_L^{k−1} = 0. In view of Lemma 5.2, (5.4) suffices for
completing the proof of Lemma 4.11 for Algorithm 5.1. The remaining pro-
ofs need not be modified.

We conclude that all the convergence results of Section 4 hold al-
so for Algorithm 5.1.

6. Line Search Modifications

In this section we discuss general line search rules that may be
used in efficient procedures for stepsize selection. We also derive a
new class of methods of feasible directions from the methods discussed
so far.
   The practical significance of rules that allow much freedom in
stepsize selection was discussed in Section 2.6 in the unconstrained
case. Most of that discussion applies to the constrained case, too.
   As noted in Section 2.6, the requirement t_L^k = 1 for a serious step
may result in too many null steps. For this reason, a lower threshold
t̄ ∈ (0,1] for a serious stepsize may be preferable. This leads to re-
placing Step 3 in Algorithm 3.1 and Algorithm 5.1 by the following more
general
Step 3' (Line search). Select an auxiliary stepsize t_R^k ∈ [t̄,1] and set
y^{k+1} = x^k + t_R^k d^k. If

   H(y^{k+1}; x^k) ≤ H(x^k; x^k) + m t_R^k v^k                           (6.1)

then set t_L^k = t_R^k (a serious step); otherwise set t_L^k = 0 (a null step).

The search for a suitable value of t_R^k ∈ [t̄,1] may use geometrical
contraction, as described in Section 2.6. Of course, many other proce-
dures can be constructed; see Remark 3.3.5.
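A geometrical contraction search for t_R^k satisfying (6.1) can be sketched as follows; the default values of m, the threshold t_min and the contraction factor are illustrative choices, not prescribed by the book:

```python
def contract_search(x, d, v, H, m=0.1, t_min=0.25, shrink=0.5):
    """Try t = 1, shrink * 1, shrink^2, ... down to t_min, accepting the
    first t with H(x + t*d, x) <= H(x, x) + m*t*v (test (6.1)).
    Returns (t_L, trial point); t_L = 0 signals a null step.
    Illustrative sketch."""
    t, y = 1.0, None
    while t >= t_min:
        y = [xi + t * di for xi, di in zip(x, d)]
        if H(y, x) <= H(x, x) + m * t * v:
            return t, y          # serious step: t_L = t_R = t
        t *= shrink
    return 0.0, y                # null step: keep the last trial point
```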
One may check that the above line search modification does not im-
pair the preceding convergence results. Only two proofs need chang-
es. In proving Lemma 4.9, take account of the remark in Section 2.6
on the proof of Lemma 2.4.8. To prove (4.14), use Lemma 2.6.1 and the
following generalization of a result of Mifflin (1982).

Lemma 6.1. Suppose that a point y = x^k + t d^k satisfies F(y) > m t v^k for
some t ∈ (0,1], where F(x^k) ≤ 0. Let g = g_F(y) ∈ ∂F(y) and α = −[F(y) +
⟨g, x^k − y⟩]. Then −α + ⟨g, d^k⟩ > m v^k.

Proof. By assumption,

   −α + ⟨g, d^k⟩ = F(y) − t ⟨g, d^k⟩ + ⟨g, d^k⟩ > t m v^k + (1−t) ⟨g, d^k⟩.

By the convexity of F, 0 ≥ F(x^k) ≥ F(y) − t ⟨g, d^k⟩, hence

   t ⟨g, d^k⟩ ≥ F(y) > m t v^k,

and, since t > 0, we have ⟨g, d^k⟩ > m v^k. It follows that

   −α + ⟨g, d^k⟩ > t m v^k + (1−t) m v^k = m v^k,

since t ∈ (0,1].

We shall now show how a modification of the line search rules turns
the previously discussed algorithms into new methods of feasible direc-
tions that extend the Pironneau and Polak (1973) method to the nondif-
ferentiable case. Again, let t̄ ∈ (0,1] be fixed and replace Step 3 in Al-
gorithm 3.1 and Algorithm 5.1 by the following

Step 3'' (Line search). Select an auxiliary stepsize t_R^k ∈ [t̄,1] and set
y^{k+1} = x^k + t_R^k d^k. If

   f(x^k + t_R^k d^k) − f(x^k) ≤ m t_R^k v^k   and   F(x^k + t_R^k d^k) ≤ 0,   (6.2)

then set t_L^k = t_R^k (a serious step); otherwise, i.e. if at least one of
the inequalities (6.2) is violated, set t_L^k = 0 (a null step).

Step 3'' guarantees that each x^k is feasible. One may implement
Step 3'', for instance, as in (Mifflin, 1982) (see Section 6.3).
   All the preceding convergence results remain valid for Algorithm
3.1 and Algorithm 5.1 with Step 3''. This follows essentially from
the fact that with respect to the objective value the criteria (6.1)
and (6.2) are equivalent, whereas if F(y^{k+1}) > 0 then necessarily
F(y^{k+1}) > m v^k, because m > 0 and v^k < 0 at line searches, i.e. (6.1)
is stronger than (6.2).
   Our computational experience suggests that the methods of feasible
directions (with Step 3'') converge faster than the methods of centers
(with Step 3'). The rule F(x^{k+1}) ≤ m t_L^k v^k < 0 for a serious step of the
methods of centers hinders progress of {x^k} towards the boundary of
the feasible set.
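The feasible-directions acceptance test (6.2) differs from (6.1) only in requiring plain feasibility of the trial point instead of F(y) ≤ m t v < 0. A sketch (names and defaults illustrative):

```python
def feasible_direction_test(x, d, t, v, f, F, m=0.1):
    """Serious-step test (6.2) of Step 3'': sufficient objective
    decrease and feasibility of the trial point y = x + t*d.
    Returns (accepted, y).  Illustrative sketch."""
    y = [xi + t * di for xi, di in zip(x, d)]
    ok = f(y) - f(x) <= m * t * v and F(y) <= 0.0
    return ok, y
```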
We may add that the results on convergence in Section 4 and Section
5 hold also for line search rules that allow for arbitrarily short seri-
ous stepsizes (as if t̄ = 0 in Step 3''). Such rules were introduced by
Mifflin (1982). The relevant analysis is presented in the next chapter.
However, we believe that the rules of Step 3'' are general enough to
allow for constructing efficient line search procedures in the convex
case.

7. Phase I - Phase II Methods

The algorithms described in the preceding sections require a fea-
sible starting point. Of course, by minimizing F each of the algorithms
can find a feasible point in a finite number of iterations, since they
generate minimizing sequences, while inf{F(x) : x ∈ R^N} < 0 by the Slater
constraint qualification. However, in certain cases one knows a point,
say x̃ ∈ R^N, which is close to a solution, but infeasible. Then it is
reasonable to search for a feasible point by moving from x̃ towards
the constraint boundary in a way that ensures as small an increase in
the objective as possible. This is the aim of phase I - phase II meth-
ods (Polak, Trahan and Mayne, 1979; Polak, Mayne and Wardi, 1983). At
each iteration of phase I such methods try to decrease the constraint
violation while not completely ignoring the objective function. Once a
feasible point is found, at phase II the methods proceed as feasible
direction algorithms.
   In this section we show that it is easy to turn the previously dis-
cussed algorithms into phase I - phase II methods. In fact, this re-
quires only minor line search modifications and a slightly more involved
convergence analysis. The resulting algorithms may be considered as
more advanced versions of the method of Polak, Mayne and Wardi (1983).

Throughout this section we suppose that the objective subgradient
mapping g_f is defined on the whole of R^N, i.e. g_f(x) ∈ ∂f(x) for all
x. Also for simplicity we assume that F and g_F can be evaluated every-
where; see Remark 3.4 for a discussion of how to relax this assump-
tion.
We shall first describe the modified algorithm with subgradient aggregation. Consider the following modification of the k-th primal subproblem (2.41): find (d^k, v^k) ∈ R^N × R to

minimize  ½|d|² + v,

subject to  f_j^k - f(x^k) - F(x^k)_+ + <g_f^j, d> ≤ v,  j ∈ J_f^k,

 f_p^k - f(x^k) - F(x^k)_+ + <p_f^{k-1}, d> ≤ v,
                                                          (7.1)
 F_j^k - F(x^k)_+ + <g_F^j, d> ≤ v,  j ∈ J_F^k,

 F_p^k - F(x^k)_+ + <p_F^{k-1}, d> ≤ v,
and its dual

minimize  ½ |Σ_{j∈J_f^k} λ_j g_f^j + λ_p p_f^{k-1} + Σ_{j∈J_F^k} μ_j g_F^j + μ_p p_F^{k-1}|² +

 + Σ_{j∈J_f^k} λ_j [f(x^k) - f_j^k + F(x^k)_+] + λ_p [f(x^k) - f_p^k + F(x^k)_+] +
                                                          (7.2)
 + Σ_{j∈J_F^k} μ_j [F(x^k)_+ - F_j^k] + μ_p [F(x^k)_+ - F_p^k],

subject to  λ_j ≥ 0, j ∈ J_f^k,  λ_p ≥ 0,  μ_j ≥ 0, j ∈ J_F^k,  μ_p ≥ 0,

 Σ_{j∈J_f^k} λ_j + λ_p + Σ_{j∈J_F^k} μ_j + μ_p = 1,

with a solution denoted by λ_j^k, j ∈ J_f^k, λ_p^k, μ_j^k, j ∈ J_F^k, μ_p^k. If we denote by (d^k, u^k) the solution of subproblem (2.41), then

v^k = u^k - F(x^k)_+,                                     (7.3)

so the results of Section 2 imply

v^k = Ĥ_a^k(x^k + d^k) - H(x^k; x^k) ≤ 0,                 (7.4)

cf. (2.17). Thus v^k may be interpreted as an approximate derivative of H(·; x^k) at x^k in the direction d^k. This is important for line searches.
It is easy to observe that if F(x^k) ≤ 0 then v^k = u^k and subproblems (7.1) and (7.2) reduce to subproblems (2.41) and (3.1), respectively. In fact, even for F(x^k) > 0 subproblems (7.1) and (2.41) are essentially equivalent in view of (7.3), and can be regarded as quadratic programming formulations of the problem

minimize  Ĥ_a^k(x^k + d) + ½|d|²  over all d,

which in turn is a local approximation to the problem of minimizing H(·; x^k). If H(x^k; x^k) = F(x^k)_+ > 0 then it is reasonable to search for a direction d^k that forms obtuse angles with the constraint subgradients, since then d^k points from x^k towards the feasible region. To see that this is the case, note that d^k = -p^k, where

p^k = Σ_{j∈J_f^k} λ_j^k g_f^j + λ_p^k p_f^{k-1} + Σ_{j∈J_F^k} μ_j^k g_F^j + μ_p^k p_F^{k-1},

λ_j^k ≥ 0, j ∈ J_f^k,  λ_p^k ≥ 0,  μ_j^k ≥ 0, j ∈ J_F^k,  μ_p^k ≥ 0,

Σ_{j∈J_f^k} λ_j^k + λ_p^k + Σ_{j∈J_F^k} μ_j^k + μ_p^k = 1,

cf. (2.43), and use the linearization errors

α_f,j^k = f(x^k) - f_j^k,  j = 1,...,k,

α_f,p^k = f(x^k) - f_p^k,
                                                          (7.5)
α_F,j^k = F(x^k)_+ - F_j^k,  j = 1,...,k,

α_F,p^k = F(x^k)_+ - F_p^k,

for rewriting subproblem (7.2) in the following form:

minimize  ½ |Σ_{j∈J_f^k} λ_j g_f^j + λ_p p_f^{k-1} + Σ_{j∈J_F^k} μ_j g_F^j + μ_p p_F^{k-1}|² +

 + Σ_{j∈J_f^k} λ_j [α_f,j^k + F(x^k)_+] + λ_p [α_f,p^k + F(x^k)_+] +
                                                          (7.6)
 + Σ_{j∈J_F^k} μ_j α_F,j^k + μ_p α_F,p^k,

subject to  λ_j ≥ 0, j ∈ J_f^k,  λ_p ≥ 0,  μ_j ≥ 0, j ∈ J_F^k,  μ_p ≥ 0,

 Σ_{j∈J_f^k} λ_j + λ_p + Σ_{j∈J_F^k} μ_j + μ_p = 1.

We conclude that if the linearization errors have comparable values then a positive term F(x^k)_+ in (7.6) tends to make the constraint subgradients influence d^k more actively than do the objective subgradients. On the other hand, if the constraint violation is not too large then the multipliers λ_j^k, j ∈ J_f^k, and λ_p^k are positive, and so the objective subgradients contribute significantly to d^k, deflecting it from directions of ascent of the objective function at x^k.
We may now state the first version of our phase I - phase II method in detail. To save space we shall use the notation of Algorithm 3.1. Algorithm 7.1 is obtained from Algorithm 3.1 by replacing Step 1 and Step 2 with the following steps.

Step 1' (Direction finding). Find a solution λ_j^k, j ∈ J_f^k, λ_p^k, μ_j^k, j ∈ J_F^k, and μ_p^k to the k-th dual subproblem (7.6). Calculate multipliers ν_f^k, λ̃_j^k, j ∈ J_f^k, λ̃_p^k, ν_F^k, μ̃_j^k, j ∈ J_F^k, and μ̃_p^k satisfying (2.44). Compute (p_f^k, f̃_p^k) and (p_F^k, F̃_p^k) by (2.45) and use (2.29) for calculating p^k. Set d^k = -p^k and

v^k = -{ |p^k|² + Σ_{j∈J_f^k} λ_j^k [f(x^k) - f_j^k + F(x^k)_+] + λ_p^k [f(x^k) - f_p^k + F(x^k)_+] +

 + Σ_{j∈J_F^k} μ_j^k [F(x^k)_+ - F_j^k] + μ_p^k [F(x^k)_+ - F_p^k] }.      (7.7a)

Step 2' (Stopping criterion). Set

w^k = ½|p^k|² + ν_f^k [f(x^k) - f̃_p^k + F(x^k)_+] + ν_F^k [F(x^k)_+ - F̃_p^k].      (7.7b)

If w^k ≤ ε_s, terminate; otherwise, go to Step 3.

In Step 1' of Algorithm 7.1 one may equivalently solve for (d^k, v^k) the k-th primal search direction finding subproblem (7.1), which has Lagrange multipliers λ_j^k, j ∈ J_f^k, λ_p^k, μ_j^k, j ∈ J_F^k, and μ_p^k. This observation, together with the representation (7.7) of v^k, follows from Lemma 2.3.
To establish relations between Algorithm 3.1 and Algorithm 7.1, suppose that for some k ≥ 1 one has F(x^k) ≤ 0. Then the k-th iterations of both algorithms are identical, so F(x^{k+1}) ≤ 0 by the properties of Algorithm 3.1. It follows by induction that at subsequent iterations Algorithm 7.1 also generates feasible points and reduces to Algorithm 3.1. We say that Algorithm 7.1 enters phase II at the k-th iteration if F(x^k) ≤ 0 and F(x^{k-1}) > 0. In this case iterations 1,...,k-1 form phase I. Thus we see that at phase II Algorithm 7.1 becomes equivalent to Algorithm 3.1. Hence the convergence results of Section 4 are valid for phase II of Algorithm 7.1. Moreover, to analyze convergence of Algorithm 7.1 we need to consider only the case when phase I is infinitely long, i.e. F(x^k) > 0 for all k.
We shall now establish global convergence of Algorithm 7.1. To this end we have to extend certain results of Section 4. First, we observe that Lemma 4.1 is valid also for Algorithm 7.1, since the aggregation rules do not depend on the feasibility of {x^k}. In the following extension of Lemma 4.2 we use the linearization errors defined by (7.5) and the relations

α̃_f,p^k = f(x^k) - f̃_p^k,

α̃_F,p^k = F(x^k)_+ - F̃_p^k,                                           (7.8)

α̃_p^k = ν_f^k [f(x^k) - f̃_p^k + F(x^k)_+] + ν_F^k [F(x^k)_+ - F̃_p^k].
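For convex data the errors of (7.5) are automatically nonnegative once the F(x^k)_+ shift is taken into account, which is what Lemma 7.2 below builds on. A one-dimensional sketch (our own illustration; function and variable names are ours, not the text's):

```python
# Sketch (1-D): linearization errors in the spirit of (7.5) for convex f, F.
# gf_y, gF_y are subgradients of f, F at the trial point y.

def lin_value(val_y, g_y, y, x):
    """Value at x of the linearization built at y: val(y) + g*(x - y)."""
    return val_y + g_y * (x - y)

def errors(f, gf_y, F, gF_y, y, xk):
    Fplus = max(F(xk), 0.0)
    alpha_f = f(xk) - lin_value(f(y), gf_y, y, xk)   # cf. (7.5)
    alpha_F = Fplus - lin_value(F(y), gF_y, y, xk)
    return alpha_f, alpha_F

f = abs                          # convex objective
F = lambda x: x * x - 1.0        # convex constraint, feasible set [-1, 1]
af, aF = errors(f, -1.0, F, -2.0, -1.0, 2.0)   # y = -1, infeasible xk = 2
```

Since a convex function dominates each of its linearizations and F(x^k)_+ ≥ F(x^k), both errors come out nonnegative.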

Lemma 7.2. At the k-th iteration of Algorithm 7.1, one has (4.4) and

g_f^j ∈ ∂_ε H(x^k; x^k)  for  ε = α_f,j^k + F(x^k)_+,  j = 1,...,k,      (7.9a)

g_F^j ∈ ∂_ε H(x^k; x^k)  for  ε = α_F,j^k,  j = 1,...,k,                 (7.9b)

p_f^{k-1} ∈ ∂_ε H(x^k; x^k)  for  ε = α_f,p^k + F(x^k)_+,                (7.9c)

p_F^{k-1} ∈ ∂_ε H(x^k; x^k)  for  ε = α_F,p^k,                           (7.9d)

p_f^k ∈ ∂_ε H(x^k; x^k)  for  ε = α̃_f,p^k + F(x^k)_+,                   (7.9e)

p_F^k ∈ ∂_ε H(x^k; x^k)  for  ε = α̃_F,p^k,                              (7.9f)

p^k ∈ ∂_ε H(x^k; x^k)  for  ε = α̃_p^k ≥ 0.                              (7.9g)

Proof. By (2.36), for any x and j ≤ k we have

f(x) ≥ <g_f^j, x - x^k> + f_j^k,

hence

H(x; x^k) = max{f(x) - f(x^k), F(x)} ≥ f(x) - f(x^k) ≥

 ≥ H(x^k; x^k) + <g_f^j, x - x^k> - [f(x^k) - f_j^k + F(x^k)_+].      (7.10a)

This yields (7.9a) by the definition of the ε-subdifferential. Similarly, from (2.36) we obtain

F(x) ≥ <g_F^j, x - x^k> + F_j^k,

and

H(x; x^k) ≥ F(x) ≥ <g_F^j, x - x^k> + F_j^k ≥

 ≥ H(x^k; x^k) + <g_F^j, x - x^k> - α_F,j^k,                          (7.10b)

which implies (7.9b). In view of Lemma 4.1, one may take convex combinations of (7.10) to obtain (7.9c)-(7.9f). In particular, we have

H(x; x^k) ≥ H(x^k; x^k) + <p_f^k, x - x^k> - [f(x^k) - f̃_p^k + F(x^k)_+],      (7.11a)

H(x; x^k) ≥ H(x^k; x^k) + <p_F^k, x - x^k> - [F(x^k)_+ - F̃_p^k]               (7.11b)

for any x. Multiplying (7.11a) by ν_f^k ≥ 0 and (7.11b) by ν_F^k ≥ 0, adding the results and using the fact that ν_f^k + ν_F^k = 1 by (2.44e), we obtain

H(x; x^k) ≥ H(x^k; x^k) + <ν_f^k p_f^k + ν_F^k p_F^k, x - x^k> - α̃_p^k =

 = H(x^k; x^k) + <p^k, x - x^k> - α̃_p^k

from (2.29). Setting x = x^k, we get α̃_p^k ≥ 0. This completes the proof of (7.9g). Relations (4.4a)-(4.4f) can be established as in the proof of Lemma 4.2.

From (7.7) and (7.8) we deduce that Lemma 4.4 holds for Algorithm 7.1. Then relation (4.8) follows from (7.9g) and (4.5), so Lemma 4.5 remains valid for Algorithm 7.1.

As observed above, we may assume that F(x^k) > 0 for all k. Since the line search rules imply that we always have

F(x^{k+1}) ≤ H(x^{k+1}; x^k) ≤ H(x^k; x^k) + m t_L^k v^k = F(x^k)_+ + m t_L^k v^k,

we obtain

F(x^{k+1}) ≤ F(x^k) + m t_L^k v^k  for all k.                          (7.12)

The above relation, which replaces (4.9), shows that {F(x^k)} is nonincreasing. Since F(x^k) > 0 for all k by assumption, one can obtain (4.10) from (7.12) as in the proof of Lemma 2.4.7.

Lemma 4.7, which is based on (4.4g), remains valid for Algorithm 7.1. Lemma 4.9 can be proved by using (7.12) and the arguments in the proof of Lemma 2.4.8, together with Lemma 4.7.

The following extension of Lemma 4.10 can be easily derived.

Lemma 7.3. At the k-th iteration of Algorithm 7.1, w^k is the optimal value of subproblem (7.6).

To prove Lemma 4.11 for Algorithm 7.1, start by substituting (4.13) by

g^k = g_F^k and α^k = α_F,k^k  if  f(y^k) - f(x^{k-1}) < F(y^k),           (7.13a)

g^k = g_f^k and α^k = α_f,k^k + F(x^k)_+  if  f(y^k) - f(x^{k-1}) ≥ F(y^k),  (7.13b)

and (4.16) by

max{f(y^k) - f(x^k), F(y^k)} > F(x^k)_+ + m v^{k-1}.                       (7.14)

Instead of (4.17a), we obtain

-α^k + <g^k, d^{k-1}> = -[F(x^k)_+ - F(y^k) - <g_F^k, x^k - y^k>] + <g_F^k, d^{k-1}> =

 = F(y^k) - F(x^k)_+ > m v^{k-1},                                          (7.15a)

while (4.17b) is replaced by

-α^k + <g^k, d^{k-1}> = -[f(x^k) - f(y^k) - <g_f^k, x^k - y^k> + F(x^k)_+] + <g_f^k, d^{k-1}> =

 = f(y^k) - f(x^k) - F(x^k)_+ > m v^{k-1}.                                 (7.15b)

Next, substitute (4.19b) by the following relation

Σ_{j∈J_f^k} λ_j(ν)[α_f,j^k + F(x^k)_+] + λ_p(ν)[α_f,p^k + F(x^k)_+] +

 + Σ_{j∈J_F^k} μ_j(ν)α_F,j^k + μ_p(ν)α_F,p^k = (1-ν)α̃_p^{k-1} + να^k,     (7.16)

and use it together with (4.19a) to deduce from Lemma 7.3 that w^k is majorized by the optimal value of (4.20), as before.
Since Lemma 4.12 is valid for Algorithm 7.1, we obtain the following result.

Theorem 7.4. Suppose Algorithm 7.1 generates an infinite sequence {x^k}. Then:

(i) If F(x^k) > 0 for all k, i.e. the algorithm stays at phase I, then every accumulation point of {x^k} is a solution to problem (1.1).

(ii) If F(x^k̄) ≤ 0 for some k̄ ≥ 1, then F(x^k) ≤ 0 for all k ≥ k̄ and f(x^k) ↓ inf{f(x) : F(x) ≤ 0}, i.e. {x^k} is a minimizing sequence for problem (1.1). If additionally problem (1.1) admits a solution, then {x^k} converges to a solution of problem (1.1).

We shall now discuss modifications of the line search rules for Algorithm 7.1. First, we note that Algorithm 7.1 can use, instead of Step 3, Step 3' described in Section 6, cf. (6.1). This will allow for implementing more efficient line search procedures without impairing the convergence results, since one can easily derive suitable extensions of (7.15) by using Lemma 2.6.1 and Lemma 6.1. Secondly, one may use the following modification of Step 3', in which t̄ ∈ (0,1] is a fixed parameter of the algorithm.

Step 3''' (Line search). Select an auxiliary stepsize t_R^k ∈ [t̄, 1] and set y^{k+1} = x^k + t_R^k d^k. If either of the following two conditions is satisfied

F(x^k) > 0  and  F(y^{k+1}) ≤ F(x^k) + m t_R^k v^k                        (7.17a)

or

F(x^k) ≤ 0  and  H(y^{k+1}; x^k) ≤ H(x^k; x^k) + m t_R^k v^k,             (7.17b)

then set t_L^k = t_R^k (a serious step); otherwise set t_L^k = 0 (a null step).

If Step 3''' is used then at phase I the algorithm will ignore the objective function values at line searches until a feasible point is found. A similar strategy can be employed in the context of the feasible directions methods based on Step 3'' described in Section 6, cf. (6.2). To this end consider the following extension of Step 3''.

Step 3'''' (Line search). Select an auxiliary stepsize t_R^k ∈ [t̄, 1] and set y^{k+1} = x^k + t_R^k d^k. If either of the following two conditions is satisfied

F(x^k) > 0  and  F(y^{k+1}) ≤ F(x^k) + m t_R^k v^k                        (7.18a)

or

f(y^{k+1}) ≤ f(x^k) + m t_R^k v^k  and  F(y^{k+1}) ≤ 0,                   (7.18b)

then set t_L^k = t_R^k; otherwise set t_L^k = 0.

If the current point x^k is infeasible then in both Step 3''' and Step 3'''' we search for a point that decreases the constraint violation. On the other hand, both of these steps maintain feasibility at phase II. In particular, at phase II the algorithm with Step 3'''' reduces to the feasible directions method discussed in Section 6.
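The phase-dependent acceptance tests can be sketched as a single predicate. The code below is our own illustration (scalar f and F, our naming; m, tR and vk stand for the quantities of the text), not the book's implementation:

```python
# Sketch of the serious/null decision of Step 3''' , cf. (7.17):
# during phase I (F(xk) > 0) only the constraint value matters; once a
# feasible point is reached, the improvement function H(.; xk) takes over.

def serious_step(f, F, xk, yk1, vk, tR, m=0.1):
    H = lambda y: max(f(y) - f(xk), F(y))       # improvement function
    if F(xk) > 0.0:                             # phase I: (7.17a)
        return F(yk1) <= F(xk) + m * tR * vk
    return H(yk1) <= H(xk) + m * tR * vk        # phase II: (7.17b)

f = lambda x: x * x            # objective
F = lambda x: x - 1.0          # constraint; feasible iff x <= 1
phase1_ok  = serious_step(f, F, 3.0, 2.0,  -1.0,  1.0)   # violation drops
phase1_bad = serious_step(f, F, 3.0, 3.5,  -1.0,  1.0)   # violation grows
phase2_ok  = serious_step(f, F, 0.5, 0.25, -0.25, 1.0)   # H(.;xk) drops
```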
We may add that Theorem 7.4 remains valid if Step 3''' or Step 3'''' is used. To see this, note that phase II is covered by the results of Section 6, while at phase I each null step yields

F(y^k) > F(x^k) + m t_R^{k-1} v^{k-1},

so that we have

-α_F,k^k + <g_F^k, d^{k-1}> > m v^{k-1}                                   (7.19)

from Lemma 2.6.1 and the fact that α_F,k^k = F(x^k)_+ - F_k^k = F(x^k) - F_k^k if F(x^k) > 0. Thus one can use (7.19) instead of (7.15) in the proof of Lemma 4.11.
We now pass to the phase I - phase II method with subgradient selection, which extends Algorithm 5.1 to the case of infeasible starting points.

Algorithm 7.5 is obtained from Algorithm 7.1 by replacing Step 1' with

Step 1'' (Direction finding). Find multipliers λ_j^k, j ∈ J_f^k, and μ_j^k, j ∈ J_F^k, that solve the following k-th dual subproblem

minimize  ½ |Σ_{j∈J_f^k} λ_j g_f^j + Σ_{j∈J_F^k} μ_j g_F^j|² +

 + Σ_{j∈J_f^k} λ_j [α_f,j^k + F(x^k)_+] + Σ_{j∈J_F^k} μ_j α_F,j^k,        (7.20)

subject to  λ_j ≥ 0, j ∈ J_f^k,  μ_j ≥ 0, j ∈ J_F^k,  Σ_{j∈J_f^k} λ_j + Σ_{j∈J_F^k} μ_j = 1,

and the corresponding sets Ĵ_f^k and Ĵ_F^k that satisfy (2.23). Calculate scaled multipliers satisfying (2.26), compute (p_f^k, f̃_p^k) and (p_F^k, F̃_p^k) by (2.28), and use (2.29) for calculating p^k. Set d^k = -p^k and

v^k = -{ |p^k|² + ν_f^k [α̃_f,p^k + F(x^k)_+] + ν_F^k α̃_F,p^k }.          (7.21)

Of course, in (7.20) and (7.21) we use the linearization errors defined by (7.5) and (7.8). Also it is readily seen that (7.20) is the dual of the following k-th (primal) search direction finding subproblem:

minimize  ½|d|² + v  over (d, v) ∈ R^{N+1},

subject to  f_j^k - f(x^k) - F(x^k)_+ + <g_f^j, d> ≤ v,  j ∈ J_f^k,
                                                          (7.22)
 F_j^k - F(x^k)_+ + <g_F^j, d> ≤ v,  j ∈ J_F^k,

with the solution (d^k, v^k) and the Lagrange multipliers λ_j^k, j ∈ J_f^k, and μ_j^k, j ∈ J_F^k. Therefore at phase II Algorithm 7.5 reduces to Algorithm 5.1.
We may add that one can use the modified line search rules discussed in this section also in Algorithm 7.5. Global convergence of the resulting methods can be expressed in the form of Theorem 7.4. To this end one may combine the preceding results of this section with the techniques of Section 5.
CHAPTER 6

Methods of Feasible Directions for Nonconvex Constrained Problems

1. Introduction

In this chapter we consider the following constrained minimization problem

minimize f(x), subject to F(x) ≤ 0,                                       (1.1)

where the functions f : R^N → R and F : R^N → R are locally Lipschitzian but not necessarily convex or differentiable. We assume that the feasible set

S = {x ∈ R^N : F(x) ≤ 0}

is nonempty.

We present several readily implementable algorithms for solving problem (1.1), which differ in complexity, storage and speed of convergence. The methods require only the evaluation of f or F and one subgradient of f or F at designated points. Storage requirements and work per iteration of the algorithms can be controlled by the user.
The algorithms are obtained by incorporating in the feasible point methods of Chapter 5 the techniques for dealing with nonconvexity that were developed in Chapter 3 and Chapter 4. Thus the algorithms generate search directions by using separate polyhedral approximations to f and F. To construct such approximations we use the rules for selecting and aggregating separately subgradients of f and F that were introduced in Chapter 5. The polyhedral approximations take nonconvexity into account by using either the subgradient locality measures of Chapter 3, or the subgradient deletion rules of Chapter 4. In the latter case we employ resetting strategies for localizing the past subgradient information on the basis of estimating the degree of stationarity of the current approximation to a solution.
The algorithms are feasible point methods of descent, i.e. they generate sequences of points {x^k} satisfying

x^k ∈ S and f(x^{k+1}) < f(x^k) if x^{k+1} ≠ x^k, for all k,

where x^1 ∈ S is the starting point. Under mild assumptions on F, such as nonemptiness of the interior of S, each of the algorithms can find a feasible starting point by minimizing F.
We shall also present phase I - phase II methods that can be employed when the user has a good, but infeasible, initial approximation to a solution. Starting from this point, phase I of such methods tries to find a feasible point without unduly increasing the objective value. At phase II the methods reduce to feasible point algorithms.
The algorithms of this chapter may be viewed as extensions of the Pironneau and Polak (1972; 1973) method of centers and method of feasible directions to the nondifferentiable case. One of the algorithms can be derived by applying our subgradient selection and aggregation rules to the Mifflin (1982) method. Also our extensions of the Polak, Mayne and Trahan (1979) phase I - phase II algorithm differ from those of Polak, Mayne and Wardi (1983).
We shall prove that each of our feasible point methods is globally convergent in the sense that it generates an infinite sequence of points {x^k} such that every accumulation point of {x^k} is stationary for f on S. If problem (1.1) is convex and satisfies the Slater constraint qualification (i.e. F(x̄) < 0 for some x̄ in R^N), then {x^k} is a minimizing sequence for f on S, which converges to a solution of problem (1.1) whenever f attains its infimum on S. Similar convergence results hold for our phase I - phase II methods.
In Section 2 we derive the methods. The algorithm with subgradient aggregation is described in detail in Section 3, and its convergence is established in Section 4. Section 5 is devoted to the algorithm with subgradient selection. In Section 6 we study various modifications of the methods with subgradient locality measures. Several versions of methods with subgradient deletion rules are analyzed in Section 7. In Section 8 we discuss methods that neglect linearization errors. Phase I - phase II methods are described in Section 9.

2. Derivation of the Methods

We start by recalling the necessary conditions of optimality for problem (1.1); see Section 1.2.

For any fixed x ∈ R^N, define the improvement function

H(y; x) = max{f(y) - f(x), F(y)} for all y ∈ R^N.                         (2.1)

If x̄ ∈ S is a local solution of (1.1) then H(·; x̄) attains a local minimum at x̄, so 0 ∈ ∂H(x̄; x̄), where ∂H(x̄; x̄) denotes the subdifferential of H(·; x̄) at x̄. Since ∂H(x̄; x̄) ⊂ M(x̄) for

M(x) = { ∂f(x)                  if F(x) < 0,
         conv{∂f(x) ∪ ∂F(x)}    if F(x) = 0,                              (2.2)
         ∂F(x)                  if F(x) > 0,

the necessary condition of optimality is 0 ∈ M(x̄). For this reason, a point x̄ ∈ S such that 0 ∈ M(x̄) is called stationary for f on S.
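The role of the improvement function can be checked numerically. The sketch below is our own illustration (not from the text): it evaluates H(y; x) for a simple convex pair and confirms that H(y; x) < H(x; x) = 0 forces both a strict objective decrease and strict feasibility of y.

```python
# Sketch: the improvement function (2.1) for scalar f, F.
def H(y, x, f, F):
    return max(f(y) - f(x), F(y))

f = lambda x: (x - 2.0) ** 2        # objective
F = lambda x: abs(x) - 1.0          # feasible set S = [-1, 1]

x = 0.5                             # feasible: F(x) = -0.5 <= 0
at_x = H(x, x, f, F)                # H(x; x) = max(0, F(x)) = 0 on S
at_y = H(0.9, x, f, F)              # candidate point y = 0.9
```

Here at_y is negative, so y = 0.9 is strictly better than x on both counts.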

Remark 2.1. There is no loss of generality in requiring that F be scalar-valued. If the original formulation of the problem involves a finite number of constraints F_j(x) ≤ 0, j ∈ J, with locally Lipschitzian functions F_j, then one can let

F(x) = max{F_j(x) : j ∈ J} for all x.                                     (2.3a)

Defining

∂̂F(x) = conv{∂F_j(x) : j ∈ J and F_j(x) = F(x)} for all x,               (2.3b)

we have (see (1.2.60))

∂F(x) ⊂ ∂̂F(x) for all x.                                                 (2.4)

Let

M̂(x) = { ∂f(x)                  if F(x) < 0,
         conv{∂f(x) ∪ ∂̂F(x)}    if F(x) = 0,                              (2.5)
         ∂̂F(x)                  if F(x) > 0,

for all x. By (2.2) and (2.5), M(·) ⊂ M̂(·), so, although we may have M(x̄) ≠ M̂(x̄), if x̄ solves (1.1) locally then 0 ∈ M̂(x̄). Therefore, we shall also say that a point x̄ ∈ S is stationary for f on S if 0 ∈ M̂(x̄).

In view of the above results, testing whether a point x ∈ S is stationary for f on S is in a sense equivalent to testing whether there exists a direction of descent for H(·; x) at x. At the same time, if we find a point y such that H(y; x) < H(x; x) = 0 then f(y) < f(x) and F(y) < 0, so y is better than x. Therefore, in theory, one could solve problem (1.1) by the Huard (1968) method of centers described in Section 5.2, which in the present case has stationary accumulation points, if any.

One can, in theory, find a descent direction for H(·; x) at x by finding the subgradient of minimum norm in ∂H(x; x) (see Lemma 1.2.18). This would require the knowledge of the full subdifferentials ∂f(x) and ∂F(x). However, we assume only that we have a finite process for calculating f(x) and a certain subgradient gf(x) ∈ ∂f(x) at each x ∈ S, and F(x) and an arbitrary subgradient gF(x) ∈ ∂F(x) at each x ∉ S. This assumption is realistic in many applications (Mifflin, 1982). Therefore, we shall compensate for the lack of ∂H(x; x) by using gf(y) and gF(y) evaluated at several points y close to x. For simplicity of exposition, we shall temporarily assume that gf and gF are defined on the whole of R^N.

Remark 2.2. In the case considered in Remark 2.1 it suffices to assume that gF(x) ∈ ∂̂F(x) at each x ∉ S. Then for each infeasible x one has to find an index j ∈ J satisfying F_j(x) = F(x) and an arbitrary subgradient gFj(x) ∈ ∂F_j(x), cf. (2.3b). This requirement, formulated directly in terms of the subdifferentials of the constraint functions F_j, is frequently more practical than the one in terms of ∂F, since ∂F(x) may not be available (because ∂F(x) is, in general, different from ∂̂F(x)).
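The reduction of Remark 2.1 and the subgradient choice of Remark 2.2 amount to a few lines. A sketch with our own names (scalar x, differentiable F_j for simplicity):

```python
# Sketch: scalarizing finitely many constraints as in (2.3a), and, at an
# infeasible point, picking an active index j (Fj(x) = F(x)) and a
# subgradient of that Fj, as in Remark 2.2.

def F(x, Fs):
    return max(Fj(x) for Fj in Fs)

def gF(x, Fs, gFs):
    vals = [Fj(x) for Fj in Fs]
    j = vals.index(max(vals))        # an active index
    return gFs[j](x)

Fs  = [lambda x: x - 2.0, lambda x: -x - 1.0]   # x <= 2 and x >= -1
gFs = [lambda x: 1.0,     lambda x: -1.0]       # their (sub)gradients
```

At x = 3 only the first constraint is active, so the returned subgradient is 1; at x = -2 only the second, so it is -1.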
We shall now derive the first generalization of the feasible direction method of Chapter 5. Our extension of that method to the nonconvex case will use polyhedral approximations based on the subgradient locality measures introduced in Chapter 3.

The algorithm will generate sequences of points {x^k} ⊂ S, search directions {d^k} ⊂ R^N and stepsizes {t_L^k} ⊂ R_+ related by

x^{k+1} = x^k + t_L^k d^k  for k = 1, 2, ...,

where x^1 ∈ S is a given starting point. At the k-th iteration d^k is intended to be a direction of descent for H(·; x^k) at x^k, and H(x^k; x^k) = 0 because x^k ∈ S. Therefore, we shall use a two-point line search for finding two stepsizes t_L^k and t_R^k, 0 ≤ t_L^k ≤ t_R^k: the next point

x^{k+1} = x^k + t_L^k d^k ∈ S

satisfying

f(x^{k+1}) < f(x^k)  if x^{k+1} ≠ x^k (t_L^k > 0),

and the trial point

y^{k+1} = x^k + t_R^k d^k

such that the subgradients gf(y^{k+1}) and gF(y^{k+1}) significantly modify the next polyhedral approximations to f and F that will be used for finding the next search direction.

Thus the algorithm calculates subgradients

g_f^j = gf(y^j) and g_F^j = gF(y^j)  for j = 1, 2, ...,

where y^1 = x^1. Each point y^j defines the linearizations

f_j(x) = f(y^j) + <g_f^j, x - y^j> for all x,
                                                          (2.6)
F_j(x) = F(y^j) + <g_F^j, x - y^j> for all x,

of f and F, respectively. At the k-th iteration the subgradient information collected at the j-th iteration (j ≤ k) is characterized by the linearization values

f_j^k = f_j(x^k),

F_j^k = F_j(x^k),

and the distance measure

s_j^k = |y^j - x^j| + Σ_{i=j}^{k-1} |x^{i+1} - x^i|.

The linearization values determine the current expression of the linearizations

f_j(x) = f_j^k + <g_f^j, x - x^k> for all x,
                                                          (2.7)
F_j(x) = F_j^k + <g_F^j, x - x^k> for all x,

while the distance measure estimates |y^j - x^k|:

|y^j - x^k| ≤ s_j^k.

These easily updated quantities enable us not to store the points y^j.
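These recursions are cheap: when x^k moves to x^{k+1}, each stored value f_j^k and distance s_j^k is updated from the step x^{k+1} - x^k alone, so y^j itself is never needed. A one-dimensional sketch (our own naming), checking the update against direct evaluation of (2.6) and the bound |y^j - x^k| ≤ s_j^k:

```python
# Sketch: carrying f_j^k = f_j(x^k) and s_j^k forward without storing y^j.

def make_record(fy, g, y, x1):
    """Start the record at the first point x1: linearization value and distance."""
    return {"g": g, "f": fy + g * (x1 - y), "s": abs(y - x1)}

def update(rec, x_old, x_new):
    """Move the record from x_old to x_new (one step of the recursion)."""
    rec["f"] += rec["g"] * (x_new - x_old)
    rec["s"] += abs(x_new - x_old)

f = abs
y, g = -1.0, -1.0                 # subgradient of |.| at y = -1
xs = [0.5, 2.0, 1.0]              # trajectory x^1, x^2, x^3
rec = make_record(f(y), g, y, xs[0])
for xo, xn in zip(xs, xs[1:]):
    update(rec, xo, xn)
```

After the loop, rec["f"] equals the direct linearization value f(y) + g·(x^3 - y), and rec["s"] overestimates |y - x^3| because the triangle inequality is applied at every step.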
At the k-th iteration we want to find a descent direction for H(·; x^k). Therefore, we need some measures, say α_f,j^k ≥ 0 and α_F,j^k ≥ 0, that indicate how much the subgradients g_f^j = gf(y^j) and g_F^j = gF(y^j) differ from being elements of ∂H(x^k; x^k). To this end, we shall use the following subgradient locality measures

α_f,j^k = max{|f(x^k) - f_j^k|, γ_f (s_j^k)²},                            (2.8a)

α_F,j^k = max{|F_j^k|, γ_F (s_j^k)²},                                     (2.8b)

where γ_f and γ_F are positive parameters. We shall also set γ_f = 0 if f is convex, and γ_F = 0 if F is convex. This construction can be motivated as follows. In the convex case we have

g_f^j ∈ ∂_ε H(x^k; x^k)  for  ε = α_f,j^k = f(x^k) - f_j^k ≥ 0,
                                                          (2.9)
g_F^j ∈ ∂_ε H(x^k; x^k)  for  ε = α_F,j^k = -F_j^k ≥ 0,

see Lemma 5.7.2. Next, suppose that F is nonconvex and the value of α_F,j^k is small. Then F_j^k ≈ 0 and s_j^k ≈ 0 (γ_F > 0), so |y^j - x^k| ≈ 0 and, by (2.6)-(2.7),

F(y^j) = F_j^k + <g_F^j, y^j - x^k> ≈ F_j^k ≈ 0,

so the subgradient g_F^j ∈ ∂F(y^j) is close to M(x^k) (see (2.2)), which approximates ∂H(x^k; x^k). Similarly, if the value of α_f,j^k is small then the subgradient g_f^j ∈ ∂f(y^j) is close to ∂f(x^k), and to M(x^k) (see (2.2) and note that F(x^k) ≤ 0).
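Numerically the measures (2.8) are trivial to form; a sketch (our own, with arbitrary γ values) showing that they vanish only when the linearization both matches the current value and was generated nearby:

```python
# Sketch: subgradient locality measures, cf. (2.8).
def alpha_f(f_xk, f_jk, s_jk, gamma_f):
    return max(abs(f_xk - f_jk), gamma_f * s_jk ** 2)

def alpha_F(F_jk, s_jk, gamma_F):
    return max(abs(F_jk), gamma_F * s_jk ** 2)

a_near = alpha_f(5.0, 5.0, 0.0, 0.5)   # exact value, generated locally
a_far  = alpha_f(5.0, 5.0, 2.0, 0.5)   # exact value, but generated far away
a_con  = alpha_F(-0.3, 1.0, 0.5)       # both terms contribute; max wins
```

Only a_near is zero: even an exact linearization is distrusted if its distance measure is large, which is the point of the γ(s)² term in the nonconvex case.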

Suppose that at the k-th iteration we have the subgradients (g_f^j, f_j^k, s_j^k) for j ∈ J_f^k, and (g_F^j, F_j^k, s_j^k) for j ∈ J_F^k, where J_f^k and J_F^k are some nonempty subsets of {1,...,k}. Let

H^k(x) = max{f(x) - f(x^k), F(x)} for all x.                              (2.10)

In the convex case, the methods of Chapter 5 would use the following search direction finding subproblem

minimize  Ĥ^k(x^k + d) + ½|d|²  over all d ∈ R^N,                         (2.11)

where

Ĥ^k(x) = max{f̂^k(x) - f(x^k), F̂^k(x)},

f̂^k(x) = max{f_j(x) : j ∈ J_f^k},                                         (2.12)

F̂^k(x) = max{F_j(x) : j ∈ J_F^k}

are polyhedral approximations to H^k, f and F, respectively. If f and F are convex then γ_f = γ_F = 0, (2.8) becomes

α_f,j^k = f(x^k) - f_j^k,

α_F,j^k = -F_j^k,

and from (2.7)

f_j(x) - f(x^k) = f_j^k + <g_f^j, x - x^k> - f(x^k) = -α_f,j^k + <g_f^j, x - x^k>,

F_j(x) = F_j^k + <g_F^j, x - x^k> = -α_F,j^k + <g_F^j, x - x^k>,

so

Ĥ^k(x) = max[ max{-α_f,j^k + <g_f^j, x - x^k> : j ∈ J_f^k},

 max{-α_F,j^k + <g_F^j, x - x^k> : j ∈ J_F^k} ]                           (2.13)

and a quadratic programming formulation of subproblem (2.11) is to find (d^k, v̂^k) to

minimize  ½|d|² + v̂  over (d, v̂) ∈ R^{N+1},

subject to  -α_f,j^k + <g_f^j, d> ≤ v̂,  j ∈ J_f^k,                        (2.14)

 -α_F,j^k + <g_F^j, d> ≤ v̂,  j ∈ J_F^k.
Also

v̂^k = Ĥ^k(x^k + d^k) = Ĥ^k(x^k + d^k) - H^k(x^k),

since H^k(x^k) = 0, may be regarded as an approximate directional derivative of H^k at x^k in the direction d^k. Therefore, a natural extension of the methods of Chapter 5 to the nonconvex case consists in finding d^k by solving (2.14), with α_f,j^k and α_F,j^k defined by (2.8).
We shall now present another reason for using the search direction finding subproblem (2.14) in the nonconvex case. To this end we recall from Chapter 5 that (d^k, v̂^k) can be found by solving the following dual of (2.14)

minimize  ½ |Σ_{j∈J_f^k} λ_j g_f^j + Σ_{j∈J_F^k} μ_j g_F^j|² + Σ_{j∈J_f^k} λ_j α_f,j^k + Σ_{j∈J_F^k} μ_j α_F,j^k,
                                                          (2.15)
subject to  λ_j ≥ 0, j ∈ J_f^k,  μ_j ≥ 0, j ∈ J_F^k,  Σ_{j∈J_f^k} λ_j + Σ_{j∈J_F^k} μ_j = 1,

since if λ_j^k, j ∈ J_f^k, and μ_j^k, j ∈ J_F^k, denote any solution of (2.15) then

d^k = -( Σ_{j∈J_f^k} λ_j^k g_f^j + Σ_{j∈J_F^k} μ_j^k g_F^j ),             (2.16a)

v̂^k = -{ |d^k|² + Σ_{j∈J_f^k} λ_j^k α_f,j^k + Σ_{j∈J_F^k} μ_j^k α_F,j^k },  (2.16b)

and

λ_j^k ≥ 0, j ∈ J_f^k,  μ_j^k ≥ 0, j ∈ J_F^k,  Σ_{j∈J_f^k} λ_j^k + Σ_{j∈J_F^k} μ_j^k = 1.

Thus the past subgradients g_f^j and g_F^j may contribute significantly to d^k (have relatively large values of λ_j^k ≥ 0 and μ_j^k ≥ 0) only if the values of α_f,j^k and α_F,j^k are relatively small, i.e. g_f^j and g_F^j are approximate subgradients of H^k at x^k.
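The dual (2.15) is a QP over the unit simplex, so on small instances even a plain projected-gradient loop recovers (d^k, v̂^k). The sketch below is our own illustration (not the book's QP method; the step size and iteration count are ad hoc), and it then checks that the d of (2.16a) satisfies the primal constraints of (2.14) at the value v̂ of (2.16b):

```python
# Sketch: solving a dual of form (2.15) by projected gradient on the simplex,
# then verifying the primal-dual relations (2.16a)-(2.16b) against (2.14).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def proj_simplex(w):
    """Euclidean projection onto {w_i >= 0, sum_i w_i = 1} (sort-based)."""
    u = sorted(w, reverse=True)
    cum, theta = 0.0, 0.0
    for i, ui in enumerate(u, 1):
        cum += ui
        t = (cum - 1.0) / i
        if ui - t > 0.0:
            theta = t
    return [max(0.0, wi - theta) for wi in w]

def solve_dual(gs, alphas, iters=5000):
    n, N = len(gs), len(gs[0])
    step = 1.0 / max(1.0, sum(dot(g, g) for g in gs))  # crude 1/L step
    w = [1.0 / n] * n
    for _ in range(iters):
        p = [sum(w[i] * gs[i][c] for i in range(n)) for c in range(N)]
        grad = [dot(gs[i], p) + alphas[i] for i in range(n)]
        w = proj_simplex([wi - step * gi for wi, gi in zip(w, grad)])
    p = [sum(w[i] * gs[i][c] for i in range(n)) for c in range(N)]
    d = [-c for c in p]                                           # (2.16a)
    v = -(dot(p, p) + sum(wi * ai for wi, ai in zip(w, alphas)))  # (2.16b)
    return w, d, v

gs = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]   # three stored subgradients
alphas = [0.2, 0.1, 0.0]                      # their locality measures
w, d, v = solve_dual(gs, alphas)
```

On this instance all three subgradients stay active, the optimal direction is d ≈ (0.1, 0), and each primal constraint -α_j + <g_j, d> ≤ v̂ of (2.14) holds with near equality.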

Up till now we have not specified how to choose the sets J_f^k and J_F^k involved in (2.13) and (2.15). Since subproblem (2.15) is of the form studied in Chapter 5 (see Lemma 5.2.3), we may use the subgradient selection rules developed in that chapter for choosing J_f^k and J_F^k recursively, so that at most N+3 past subgradients are used for each direction finding. Thus at the k-th iteration one can find Lagrange multipliers λ_j^k and μ_j^k of (2.15) and sets Ĵ_f^k ⊂ J_f^k and Ĵ_F^k ⊂ J_F^k such that

Ĵ_f^k = {j ∈ J_f^k : λ_j^k ≠ 0} and Ĵ_F^k = {j ∈ J_F^k : μ_j^k ≠ 0}.

Then, since the subgradients (g_f^j, f_j^k, s_j^k) for j ∈ Ĵ_f^k and (g_F^j, F_j^k, s_j^k) for j ∈ Ĵ_F^k embody, in the sense of Lemma 5.2.3(iv), all the past subgradient information that determined (d^k, v̂^k), one may discard the subgradients indexed by j ∉ Ĵ_f^k ∪ Ĵ_F^k that were inactive at the k-th search direction finding (had null Lagrange multipliers). At the same time, the algorithm should use the latest subgradients. This leads to the choice

J_f^{k+1} = Ĵ_f^k ∪ {k+1} and J_F^{k+1} = Ĵ_F^k ∪ {k+1}.

As in Chapter 3, we shall also use suitable rules for reducing J_f^{k+1} and J_F^{k+1} at some iterations. Such resetting strategies are employed only to ensure locally uniform boundedness of the subgradients stored by the algorithm; see Section 3.2.

The above-described method with subgradient selection requires storing N+3 past subgradients. Also much work may be required by the solution of subproblem (2.14) (or (2.15)) if N is large. Therefore, we shall now use the subgradient aggregation strategy of Chapter 5 to derive a method in which storage and work per iteration can be controlled by the user.

At the k-th iteration of the method with subgradient aggregation we have some past subgradients (g_f^j, f_j^k, s_j^k), j ∈ J_f^k, and (g_F^j, F_j^k, s_j^k), j ∈ J_F^k, and two aggregate subgradients

(p_f^{k-1}, f_p^k, s_f,p^k) ∈ conv{(g_f^j, f_j^k, s_j^k) : j = 1,...,k-1},
                                                          (2.17)
(p_F^{k-1}, F_p^k, s_F,p^k) ∈ conv{(g_F^j, F_j^k, s_j^k) : j = 1,...,k-1},

which were computed at the (k-1)-st iteration. The aggregate subgradients are characterized, similarly to (2.8), by the following aggregate subgradient locality measures

α_f,p^k = max{|f(x^k) - f_p^k|, γ_f (s_f,p^k)²},
                                                          (2.18)
α_F,p^k = max{|F_p^k|, γ_F (s_F,p^k)²}.

The value of α_f,p^k (α_F,p^k) indicates how far p_f^{k-1} (p_F^{k-1}) is from ∂H^k(x^k).

We recall from Chapter 5 that in the convex case such aggregate subgradients define the (k-1)-st aggregate linearizations

f̃^{k-1}(x) = f_p^k + <p_f^{k-1}, x - x^k> for all x,

F̃^{k-1}(x) = F_p^k + <p_F^{k-1}, x - x^k> for all x,

which are convex combinations of the linearizations f_j and F_j, j = 1,...,k-1, respectively. For this reason, in the convex case we defined the following aggregate polyhedral approximations

Ĥ_a^k(x) = max{f̂_a^k(x) - f(x^k), F̂_a^k(x)},

f̂_a^k(x) = max{f̃^{k-1}(x), f_j(x) : j ∈ J_f^k},

F̂_a^k(x) = max{F̃^{k-1}(x), F_j(x) : j ∈ J_F^k}

to H^k, f and F, respectively, and used the following search direction finding subproblem

minimize  Ĥ_a^k(x^k + d) + ½|d|²  over all d ∈ R^N.                       (2.19)

Reasoning as in the transition from (2.11) to (2.14), one can show that in the convex case Ĥ_a^k can be expressed in terms of subgradient locality measures as

Ĥ_a^k(x) = max[ max{-α_f,j^k + <g_f^j, x - x^k> : j ∈ J_f^k},

 -α_f,p^k + <p_f^{k-1}, x - x^k>,

 max{-α_F,j^k + <g_F^j, x - x^k> : j ∈ J_F^k},

 -α_F,p^k + <p_F^{k-1}, x - x^k> ],                                       (2.20)

while subproblem (2.19) may be solved by finding (d^k, v̂^k) to

minimize  ½|d|² + v̂  over (d, v̂) ∈ R^{N+1},

subject to  -α_f,j^k + <g_f^j, d> ≤ v̂,  j ∈ J_f^k,

 -α_f,p^k + <p_f^{k-1}, d> ≤ v̂,                                           (2.21)

 -α_F,j^k + <g_F^j, d> ≤ v̂,  j ∈ J_F^k,

 -α_F,p^k + <p_F^{k-1}, d> ≤ v̂.

Therefore we shall also use the search direction finding subproblem (2.21) in the nonconvex case. In fact, we shall use the following modified version of (2.21):

minimize  ½|d|² + v̂  over (d, v̂) ∈ R^{N+1},

subject to  -α_f,j^k + <g_f^j, d> ≤ v̂,  j ∈ J_f^k,

 -α_f,p^k + <p_f^{k-1}, d> ≤ v̂  if r_a^k = 0,                             (2.22)

 -α_F,j^k + <g_F^j, d> ≤ v̂,  j ∈ J_F^k,

 -α_F,p^k + <p_F^{k-1}, d> ≤ v̂  if r_a^k = 0,

where the value of r_a^k ∈ {0,1} indicates whether the (k-1)-st aggregate subgradients are dropped at the k-th iteration, when a so-called distance reset (r_a^k = 1) occurs (see Section 3.2). As in the method with subgradient selection, our resetting strategy will ensure locally uniform boundedness of the accumulated subgradients.

For updating the aggregate subgradients we may use the rules of Chapter 5, which are applicable to subproblems of the form (2.22) (see Lemma 5.2.4). To this end, let λ_j^k, j ∈ J_f^k, λ_p^k, μ_j^k, j ∈ J_F^k, and μ_p^k denote any Lagrange multipliers of (2.22), where we set λ_p^k = μ_p^k = 0 if r_a^k = 1.
Similarly to (2.43)-(2.45), we have

λ_j^k ≥ 0, j ∈ J_f^k,  λ_p^k ≥ 0,  μ_j^k ≥ 0, j ∈ J_F^k,  μ_p^k ≥ 0,

Σ_{j∈J_f^k} λ_j^k + λ_p^k + Σ_{j∈J_F^k} μ_j^k + μ_p^k = 1,

hence we may calculate scaled multipliers λ̃ and μ̃ satisfying

ν_f^k = Σ_{j∈J_f^k} λ_j^k + λ_p^k,  λ_j^k = ν_f^k λ̃_j^k, j ∈ J_f^k,  λ_p^k = ν_f^k λ̃_p^k,      (2.23a)

ν_F^k = Σ_{j∈J_F^k} μ_j^k + μ_p^k,  μ_j^k = ν_F^k μ̃_j^k, j ∈ J_F^k,  μ_p^k = ν_F^k μ̃_p^k,      (2.23b)

λ̃_j^k ≥ 0, j ∈ J_f^k,  λ̃_p^k ≥ 0,  Σ_{j∈J_f^k} λ̃_j^k + λ̃_p^k = 1,                             (2.23c)

μ̃_j^k ≥ 0, j ∈ J_F^k,  μ̃_p^k ≥ 0,  Σ_{j∈J_F^k} μ̃_j^k + μ̃_p^k = 1,                             (2.23d)

and use them for computing the current aggregate subgradients (cf. (3.3.4))

(p_f^k, f̃_p^k, s̃_f^k) = Σ_{j∈J_f^k} λ̃_j^k (g_f^j, f_j^k, s_j^k) + λ̃_p^k (p_f^{k-1}, f_p^k, s_f,p^k),
                                                          (2.24)
(p_F^k, F̃_p^k, s̃_F^k) = Σ_{j∈J_F^k} μ̃_j^k (g_F^j, F_j^k, s_j^k) + μ̃_p^k (p_F^{k-1}, F_p^k, s_F,p^k).

We recall from Section 5.2 that

ν_f^k ≥ 0,  ν_F^k ≥ 0,  ν_f^k + ν_F^k = 1,                                (2.25)

and that
240

~k k k k ~k k k
~j = ~j/VF' J JF' ~p = ~ p / V F

k k
if k ~ 0 an~ ~. If 9f=0 (~ 0) then one n~y p i c k any n u m b e r s satisfy-
ing (2.23c) ((2.23d)). We a l s o h a v e

dk _ pk

k k k k k
p = v f p f + ~FPF, (2.26)

and the $k$-th aggregate subgradients (2.24) embody, in the sense of Lemma 5.2.4, all that part of the past subgradient information that was active at the $k$-th search direction finding. Therefore, in the method with subgradient aggregation one has much freedom in the choice of $J_f^{k+1}$ and $J_F^{k+1}$, subject only to the requirement that $k+1\in J_f^{k+1}\cup J_F^{k+1}$. For instance, one may set $J_f^k=J_F^k=\{k\}$ for all $k$, although this will lead to slow convergence; convergence is enhanced if more subgradients are used for search direction finding.

Having computed $(p_f^k,\tilde f_p^k,\tilde s_f^k)$ and $(p_F^k,\tilde F_p^k,\tilde s_F^k)$ and the next point $x^{k+1}$, one can obtain $(f_p^{k+1},s_f^{k+1})$ and $(F_p^{k+1},s_F^{k+1})$ by the updating rules of Section 3.2. In particular, we may define the $k$-th aggregate linearizations

$$\tilde f_p^k(x)=\tilde f_p^k+\langle p_f^k,x-x^k\rangle\quad\text{for all }x,$$
$$\tilde F_p^k(x)=\tilde F_p^k+\langle p_F^k,x-x^k\rangle\quad\text{for all }x,$$

and calculate $f_p^{k+1}=\tilde f_p^k(x^{k+1})$ and $F_p^{k+1}=\tilde F_p^k(x^{k+1})$. This ends the $k$-th iteration.
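The evaluation of the aggregate linearizations at the next iterate can be sketched as follows (a minimal Python sketch; all names are illustrative and not from the text):

```python
import numpy as np

def evaluate_aggregate_linearizations(p_f, f_tilde, p_F, F_tilde, x_k, x_next):
    """Evaluate the k-th aggregate linearizations at the next iterate:
    f_p^{k+1} = f_tilde + <p_f, x^{k+1} - x^k>,
    F_p^{k+1} = F_tilde + <p_F, x^{k+1} - x^k>."""
    step = np.asarray(x_next, float) - np.asarray(x_k, float)
    return f_tilde + float(np.dot(p_f, step)), F_tilde + float(np.dot(p_F, step))
```

Only the aggregate subgradients and linearization values are needed, which is what allows the trial points themselves to be discarded.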

Remark 2.3. For convenience, in the two methods described above we have assumed that one calculates

$$g_f^{k+1}=g_f(y^{k+1}),\quad f_{k+1}^{k+1}=f(y^{k+1})+\langle g_f^{k+1},x^{k+1}-y^{k+1}\rangle,\eqno(2.27a)$$
$$g_F^{k+1}=g_F(y^{k+1}),\quad F_{k+1}^{k+1}=F(y^{k+1})+\langle g_F^{k+1},x^{k+1}-y^{k+1}\rangle,\eqno(2.27b)$$

and chooses sets of the form

$$J_f^{k+1}=\hat J_f^k\cup\{k+1\},\quad \hat J_f^k\subset J_f^k,\eqno(2.28a)$$
$$J_F^{k+1}=\hat J_F^k\cup\{k+1\},\quad \hat J_F^k\subset J_F^k,\eqno(2.28b)$$

for all $k\ge 1$, and that the methods are initialized by setting $y^1=x^1$ and

$$J_f^1=\{1\},\quad g_f^1=g_f(y^1),\quad f_1^1=f(y^1),\eqno(2.29a)$$
$$J_F^1=\{1\},\quad g_F^1=g_F(y^1),\quad F_1^1=F(y^1).\eqno(2.29b)$$

If $f$ and $g_f$, or $F$ and $g_F$, cannot be evaluated at each $y\in\mathbb R^N$, then the following modifications are necessary. We replace (2.28a) by the following requirement

$$J_f^{k+1}=\begin{cases}\hat J_f^k\cup\{k+1\} & \text{if }y^{k+1}\in S,\\ \hat J_f^k & \text{if }y^{k+1}\notin S,\end{cases}\eqno(2.30a)$$

where $\hat J_f^k\subset J_f^k$, for all $k$. Then there is no need for (2.27a) if $y^{k+1}$ is infeasible. (Another possibility is to use (2.28a) with (2.27a) if $y^{k+1}\in S$, and with $g_f^{k+1}=g_f(x^{k+1})$ and $f_{k+1}^{k+1}=f(x^{k+1})$ if $y^{k+1}\notin S$.) If $F$ and $g_F$ cannot be evaluated at feasible points then we set $J_F^1=\emptyset$ and replace (2.28b) by

$$J_F^{k+1}=\begin{cases}\hat J_F^k & \text{if }y^{k+1}\in S,\\ \hat J_F^k\cup\{k+1\} & \text{if }y^{k+1}\notin S,\end{cases}\eqno(2.30b)$$

where $\hat J_F^k\subset J_F^k$, for all $k$. Then (2.27b) need not be used if $y^{k+1}$ is feasible. In this case the last constraint of (2.22) should be dropped for all $k$ such that $y^j\in S$ for $j=1,\ldots,k-1$, i.e. we do not use the constraint subgradients until the first infeasible trial point is found. It will be seen that all the subsequent proofs need only minor changes to cover the rules (2.30), while the rules (2.28) require simpler notation. Specific techniques for dealing with the rules (2.30) will be described in Section 7.
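The index-set rules (2.30) can be sketched as follows (a hedged illustration; the list-valued representation and names are mine, not the book's):

```python
def update_index_sets(J_f_hat, J_F_hat, k, y_next_feasible):
    """Rules (2.30): the new index k+1 joins the objective set J_f^{k+1}
    only when the trial point y^{k+1} is feasible, and joins the
    constraint set J_F^{k+1} only when it is infeasible."""
    if y_next_feasible:
        return J_f_hat + [k + 1], list(J_F_hat)
    return list(J_f_hat), J_F_hat + [k + 1]
```

Under these rules each stored index carries data of exactly one of the two functions, which is what makes the comparison with the Mifflin (1982) algorithm below possible.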

We shall now consider versions of the above-described methods that are obtained if one uses subgradient locality measures different from (2.8). To this end, for any $x$ and $y$ in $\mathbb R^N$ define the linearizations

$$\bar f(x;y)=f(y)+\langle g_f(y),x-y\rangle,$$
$$\bar F(x;y)=F(y)+\langle g_F(y),x-y\rangle\eqno(2.31)$$

and the following subgradient locality measures

$$\alpha_f(x,y)=\max\{|f(x)-\bar f(x;y)|,\ \gamma_f|x-y|^2\},$$
$$\alpha_F(x,y)=\max\{|\bar F(x;y)|,\ \gamma_F|x-y|^2\},\eqno(2.32)$$

which indicate how far $g_f(y)$ and $g_F(y)$ are from $\partial H(x;x)$, respectively. Since

$$f_j^k=\bar f(x^k;y^j)\quad\text{and}\quad F_j^k=\bar F(x^k;y^j),$$
$$\alpha_f(x^k,y^j)=\max\{|f(x^k)-f_j^k|,\ \gamma_f|x^k-y^j|^2\},$$
$$\alpha_F(x^k,y^j)=\max\{|F_j^k|,\ \gamma_F|x^k-y^j|^2\},$$

we see that (2.8) differs from (2.32) by using the distance measures $s_j^k$ instead of $|x^k-y^j|$. This enables us not to store the points $y^j$.
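The linearizations (2.31) and locality measures (2.32) can be sketched in a few lines of Python (an illustrative sketch only; the callables `f`, `g_f`, `F`, `g_F` are assumptions standing in for user-supplied oracles):

```python
import numpy as np

def locality_measures(x, y, f, g_f, F, g_F, gamma_f, gamma_F):
    """Compute alpha_f(x, y) and alpha_F(x, y) of (2.32) from the
    linearizations (2.31).  g_f(y), g_F(y) return subgradients."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dist2 = float(np.dot(x - y, x - y))          # |x - y|^2
    f_lin = f(y) + float(np.dot(g_f(y), x - y))  # f-bar(x; y)
    F_lin = F(y) + float(np.dot(g_F(y), x - y))  # F-bar(x; y)
    alpha_f = max(abs(f(x) - f_lin), gamma_f * dist2)
    alpha_F = max(abs(F_lin), gamma_F * dist2)
    return alpha_f, alpha_F
```

With $\gamma_f=\gamma_F=0$ (the convex case) these reduce to the absolute linearization errors.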
In fact, one may use $\alpha_f(x^k,y^j)$ and $\alpha_F(x^k,y^j)$ instead of $\alpha_{f,j}^k$ and $\alpha_{F,j}^k$ in the search direction finding subproblems (2.14) and (2.22).
Then the method with subgradient selection has subproblems of the form

$$\begin{array}{ll}
\text{minimize} & \tfrac12|d|^2+\hat v\quad\text{over }(d,\hat v)\in\mathbb R^{N+1},\\[2pt]
\text{subject to} & -\alpha_f(x^k,y^j)+\langle g_f^j,d\rangle\le\hat v,\quad j\in J_f^k,\\[2pt]
& -\alpha_F(x^k,y^j)+\langle g_F^j,d\rangle\le\hat v,\quad j\in J_F^k,
\end{array}\eqno(2.33)$$

while the $k$-th iteration of the method with subgradient aggregation uses the subproblem

$$\begin{array}{ll}
\text{minimize} & \tfrac12|d|^2+\hat v\quad\text{over }(d,\hat v)\in\mathbb R^{N+1},\\[2pt]
\text{subject to} & -\alpha_f(x^k,y^j)+\langle g_f^j,d\rangle\le\hat v,\quad j\in J_f^k,\\[2pt]
& -\alpha_{f,p}^k+\langle p_f^{k-1},d\rangle\le\hat v\quad\text{if }r_a^k=0,\\[2pt]
& -\alpha_F(x^k,y^j)+\langle g_F^j,d\rangle\le\hat v,\quad j\in J_F^k,\\[2pt]
& -\alpha_{F,p}^k+\langle p_F^{k-1},d\rangle\le\hat v\quad\text{if }r_a^k=0.
\end{array}\eqno(2.34)$$

In this case the aggregate subgradient updating rules (2.24) should be replaced by the following

$$(p_f^k,\tilde f_p^k,\tilde s_f^k)=\sum_{j\in J_f^k}\tilde\lambda_j^k\,(g_f^j,f_j^k,|x^k-y^j|)+\tilde\lambda_p^k\,(p_f^{k-1},f_p^k,s_f^k),$$
$$(p_F^k,\tilde F_p^k,\tilde s_F^k)=\sum_{j\in J_F^k}\tilde\mu_j^k\,(g_F^j,F_j^k,|x^k-y^j|)+\tilde\mu_p^k\,(p_F^{k-1},F_p^k,s_F^k).\eqno(2.35)$$

Other versions of the methods are obtained if we replace (2.32) by the following definition of Mifflin (1982)

$$\alpha_f(x,y)=\max\{f(x)-\bar f(x;y),\ \gamma_f|x-y|^2\},$$
$$\alpha_F(x,y)=\max\{-\bar F(x;y),\ \gamma_F|x-y|^2\}\eqno(2.36)$$

and (2.8) and (2.18) by

$$\alpha_{f,j}^k=\max\{f(x^k)-f_j^k,\ \gamma_f(s_j^k)^2\},$$
$$\alpha_{F,j}^k=\max\{-F_j^k,\ \gamma_F(s_j^k)^2\},$$
$$\alpha_{f,p}^k=\max\{f(x^k)-f_p^k,\ \gamma_f(s_f^k)^2\},$$
$$\alpha_{F,p}^k=\max\{-F_p^k,\ \gamma_F(s_F^k)^2\}.\eqno(2.37)$$

We note that in the convex case ($\gamma_f=\gamma_F=0$) the values of the subgradient locality measures (2.32), (2.8) and (2.18) coincide with those given by (2.36)-(2.37), respectively.

To compare our method with subgradient selection with the Mifflin (1982) algorithm, we shall need the following notation. Define the subgradient mapping

$$g(x)=\begin{cases}g_f(x) & \text{if }x\in S,\\ g_F(x) & \text{if }x\notin S,\end{cases}\eqno(2.38)$$

and the corresponding subgradient locality measure

$$\alpha(x,y)=\begin{cases}\alpha_f(x,y) & \text{if }y\in S,\\ \alpha_F(x,y) & \text{if }y\notin S.\end{cases}\eqno(2.39)$$

Suppose that the rules for choosing $J_f^k$ and $J_F^k$ satisfy (2.30) for all $k$, with $J_f^1=\{1\}$ and $J_F^1=\emptyset$. Then we have

$$J_f^k\cap J_F^k=\emptyset$$

and

$$g_f^j=g(y^j)\ \text{and}\ \alpha_f(x^k,y^j)=\alpha(x^k,y^j)\quad\text{if }y^j\in S,$$
$$g_F^j=g(y^j)\ \text{and}\ \alpha_F(x^k,y^j)=\alpha(x^k,y^j)\quad\text{if }y^j\notin S,$$

for all $k$. Hence, letting

$$J^k=J_f^k\cup J_F^k\quad\text{for all }k,$$

we conclude that the search direction finding subproblem (2.33) can be formulated as follows

$$\begin{array}{ll}
\text{minimize} & \tfrac12|d|^2+\hat v\quad\text{over }(d,\hat v)\in\mathbb R^{N+1},\\[2pt]
\text{subject to} & -\alpha(x^k,y^j)+\langle g(y^j),d\rangle\le\hat v,\quad j\in J^k.
\end{array}\eqno(2.40)$$

Subproblem (2.40) is the search direction finding subproblem of the Mifflin (1982) algorithm if $J^k=\{1,\ldots,k\}$ and $\alpha(x^k,y^j)$ is defined via (2.39) and (2.36). Thus this algorithm uses total subgradient accumulation, corresponding to the choice $\hat J_f^k=J_f^k$ and $\hat J_F^k=J_F^k$ in (2.30), for all $k$. Moreover, this choice of $J_f^k$ and $J_F^k$ combined with the subgradient locality measures (2.37) and the search direction finding subproblems (2.14) leads to a version of the Mifflin (1982) algorithm that does not need storing the points $y^j$.
To sum up, we shall now comment on the relations of the above-described methods to other algorithms. If we neglect the variables corresponding to the constraint function $F$ then the methods reduce to the algorithms for unconstrained minimization from Chapter 3. In the convex case we automatically obtain the search direction finding subproblems studied in Chapter 5. Thus the methods generalize the method of centers for inequality constrained minimax problems (Kiwiel, 1981a), which in turn extends the Pironneau and Polak method of centers and method of feasible directions for smooth problems.

3. The Algorithm with Subgradient Aggregation

We now state an algorithmic procedure for solving problem (1.1). Its line searches are discussed below.

Algorithm 3.1.

Step 0 (Initialization). Select the starting point $x^1\in S$ and a final accuracy tolerance $\varepsilon_s\ge 0$. Choose fixed positive line search parameters $m_L$, $m_R$, $\bar a$ and $\bar t$, with $\bar t\le 1$ and $0<m_L<m_R<1$, and distance measure parameters $\gamma_f>0$ and $\gamma_F>0$ ($\gamma_f=0$ if $f$ is convex; $\gamma_F=0$ if $F$ is convex). Set $y^1=x^1$, $s_1^1=s_f^1=s_F^1=0$ and

$$J_f^1=\{1\},\quad g_f^1=p_f^0=g_f(y^1),\quad f_1^1=f(y^1),$$
$$J_F^1=\{1\},\quad g_F^1=p_F^0=g_F(y^1),\quad F_1^1=F(y^1),$$

and the reset indicator $r_a^1=1$. Set the counter $k=1$.

Step 1 (Direction finding). Find multipliers $\lambda_j^k$, $j\in J_f^k$, $\lambda_p^k$, $\mu_j^k$, $j\in J_F^k$, and $\mu_p^k$ that solve the following $k$-th dual search direction finding subproblem

$$\begin{array}{ll}
\underset{\lambda,\mu}{\text{minimize}} & \tfrac12\Big|\displaystyle\sum_{j\in J_f^k}\lambda_j g_f^j+\lambda_p p_f^{k-1}+\sum_{j\in J_F^k}\mu_j g_F^j+\mu_p p_F^{k-1}\Big|^2\\[4pt]
& \quad+\displaystyle\sum_{j\in J_f^k}\lambda_j\alpha_{f,j}^k+\lambda_p\alpha_{f,p}^k+\sum_{j\in J_F^k}\mu_j\alpha_{F,j}^k+\mu_p\alpha_{F,p}^k,\\[4pt]
\text{subject to} & \lambda_j\ge 0,\ j\in J_f^k,\quad \lambda_p\ge 0,\quad \mu_j\ge 0,\ j\in J_F^k,\quad \mu_p\ge 0,\\[2pt]
& \displaystyle\sum_{j\in J_f^k}\lambda_j+\lambda_p+\sum_{j\in J_F^k}\mu_j+\mu_p=1,\\[4pt]
& \lambda_p=\mu_p=0\quad\text{if }r_a^k=1,
\end{array}\eqno(3.1)$$

where

$$\alpha_{f,j}^k=\max\{|f(x^k)-f_j^k|,\ \gamma_f(s_j^k)^2\},\quad \alpha_{F,j}^k=\max\{|F_j^k|,\ \gamma_F(s_j^k)^2\},\eqno(3.2a)$$
$$\alpha_{f,p}^k=\max\{|f(x^k)-f_p^k|,\ \gamma_f(s_f^k)^2\},\quad \alpha_{F,p}^k=\max\{|F_p^k|,\ \gamma_F(s_F^k)^2\}.\eqno(3.2b)$$

Compute

$$\nu_f^k=\sum_{j\in J_f^k}\lambda_j^k+\lambda_p^k\quad\text{and}\quad \nu_F^k=\sum_{j\in J_F^k}\mu_j^k+\mu_p^k.\eqno(3.3)$$

Set

$$\tilde\lambda_j^k=\lambda_j^k/\nu_f^k\ \text{for }j\in J_f^k\ \text{and}\ \tilde\lambda_p^k=\lambda_p^k/\nu_f^k\quad\text{if }\nu_f^k\ne 0;\qquad \tilde\lambda_k^k=1,\ \tilde\lambda_j^k=0,\ j\in J_f^k\setminus\{k\},\ \tilde\lambda_p^k=0\quad\text{if }\nu_f^k=0,$$
$$\tilde\mu_j^k=\mu_j^k/\nu_F^k\ \text{for }j\in J_F^k\ \text{and}\ \tilde\mu_p^k=\mu_p^k/\nu_F^k\quad\text{if }\nu_F^k\ne 0;\qquad \tilde\mu_k^k=1,\ \tilde\mu_j^k=0,\ j\in J_F^k\setminus\{k\},\ \tilde\mu_p^k=0\quad\text{if }\nu_F^k=0.\eqno(3.4)$$

Calculate $a^k=\max\{s_j^k : j\in J_f^k\cup J_F^k\}$ if $\lambda_p^k=\mu_p^k=0$. Set

$$(p_f^k,\tilde f_p^k,\tilde s_f^k)=\sum_{j\in J_f^k}\tilde\lambda_j^k\,(g_f^j,f_j^k,s_j^k)+\tilde\lambda_p^k\,(p_f^{k-1},f_p^k,s_f^k),$$
$$(p_F^k,\tilde F_p^k,\tilde s_F^k)=\sum_{j\in J_F^k}\tilde\mu_j^k\,(g_F^j,F_j^k,s_j^k)+\tilde\mu_p^k\,(p_F^{k-1},F_p^k,s_F^k),\eqno(3.5)$$

$$p^k=\nu_f^k p_f^k+\nu_F^k p_F^k,\eqno(3.6)$$
$$d^k=-p^k,\eqno(3.7)$$
$$\tilde\alpha_{f,p}^k=\max\{|f(x^k)-\tilde f_p^k|,\ \gamma_f(\tilde s_f^k)^2\},\eqno(3.8a)$$
$$\tilde\alpha_{F,p}^k=\max\{|\tilde F_p^k|,\ \gamma_F(\tilde s_F^k)^2\},\eqno(3.8b)$$
$$v^k=-\{|p^k|^2+\nu_f^k\tilde\alpha_{f,p}^k+\nu_F^k\tilde\alpha_{F,p}^k\}.\eqno(3.9)$$
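The post-processing of the dual multipliers in Step 1 can be sketched as follows (an illustrative sketch; the unscaled form of the computation and all argument names are mine). It forms $p^k$, $d^k=-p^k$, the dual optimal value $\hat w^k$ and $\hat v^k$ of (3.15):

```python
import numpy as np

def primal_recovery(lam, lam_p, mu, mu_p, G_f, p_f_prev, G_F, p_F_prev,
                    a_f, a_f_p, a_F, a_F_p):
    """From dual multipliers of (3.1) recover the aggregate subgradient
    p^k, the direction d^k = -p^k, w-hat^k = 0.5|p^k|^2 + alpha-hat_p^k
    and v-hat^k = -(|p^k|^2 + alpha-hat_p^k).  Rows of G_f, G_F hold the
    stored subgradients; a_* are the corresponding locality measures."""
    lam, mu = np.asarray(lam, float), np.asarray(mu, float)
    p = lam @ G_f + lam_p * np.asarray(p_f_prev, float) \
        + mu @ G_F + mu_p * np.asarray(p_F_prev, float)
    alpha_hat = float(lam @ a_f) + lam_p * a_f_p \
        + float(mu @ a_F) + mu_p * a_F_p
    w_hat = 0.5 * float(p @ p) + alpha_hat
    v_hat = -(float(p @ p) + alpha_hat)
    return -p, w_hat, v_hat
```

Since the multipliers sum to one, the convex combination $p^k$ coincides with $\nu_f^k p_f^k+\nu_F^k p_F^k$ of (3.6), so the group-wise scaling by $\nu_f^k$, $\nu_F^k$ need not be made explicit here.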

Step 2 (Stopping criterion). Set

$$w^k=\tfrac12|p^k|^2+\nu_f^k\tilde\alpha_{f,p}^k+\nu_F^k\tilde\alpha_{F,p}^k.\eqno(3.10)$$

If $w^k\le\varepsilon_s$, terminate; otherwise, go to Step 3.

Step 3 (Line search). By a line search procedure as given below, find two stepsizes $t_L^k$ and $t_R^k$ such that $0\le t_L^k\le t_R^k$ and such that the two corresponding points defined by

$$x^{k+1}=x^k+t_L^k d^k\quad\text{and}\quad y^{k+1}=x^k+t_R^k d^k$$

satisfy $t_L^k\le 1$ and

$$f(x^{k+1})\le f(x^k)+m_L t_L^k v^k,\eqno(3.11a)$$
$$F(x^{k+1})\le 0,\eqno(3.11b)$$
$$t_R^k=t_L^k\quad\text{if }t_L^k\ge\bar t,\eqno(3.11c)$$
$$-\alpha(x^{k+1},y^{k+1})+\langle g(y^{k+1}),d^k\rangle\ge m_R v^k\quad\text{if }t_L^k<\bar t,\eqno(3.11d)$$
$$|y^{k+1}-x^{k+1}|\le\bar a/2,\eqno(3.11e)$$

where

$$g(y)=g_f(y)\ \text{and}\ \alpha(x,y)=\max\{|f(x)-\bar f(x;y)|,\gamma_f|x-y|^2\}\quad\text{if }F(y)\le 0,$$
$$g(y)=g_F(y)\ \text{and}\ \alpha(x,y)=\max\{|\bar F(x;y)|,\gamma_F|x-y|^2\}\quad\text{if }F(y)>0.\eqno(3.12)$$

Step 4 (Subgradient updating). Select sets $\hat J_f^k\subset J_f^k$ and $\hat J_F^k\subset J_F^k$, and set

$$J_f^{k+1}=\hat J_f^k\cup\{k+1\}\quad\text{and}\quad J_F^{k+1}=\hat J_F^k\cup\{k+1\}.\eqno(3.13)$$

Set $g_f^{k+1}=g_f(y^{k+1})$, $g_F^{k+1}=g_F(y^{k+1})$ and

$$f_{k+1}^{k+1}=f(y^{k+1})+\langle g_f^{k+1},x^{k+1}-y^{k+1}\rangle,$$
$$f_j^{k+1}=f_j^k+\langle g_f^j,x^{k+1}-x^k\rangle\quad\text{for }j\in\hat J_f^k,$$
$$f_p^{k+1}=\tilde f_p^k+\langle p_f^k,x^{k+1}-x^k\rangle,$$
$$F_{k+1}^{k+1}=F(y^{k+1})+\langle g_F^{k+1},x^{k+1}-y^{k+1}\rangle,$$
$$F_j^{k+1}=F_j^k+\langle g_F^j,x^{k+1}-x^k\rangle\quad\text{for }j\in\hat J_F^k,\eqno(3.14)$$
$$F_p^{k+1}=\tilde F_p^k+\langle p_F^k,x^{k+1}-x^k\rangle,$$
$$s_{k+1}^{k+1}=|y^{k+1}-x^{k+1}|,$$
$$s_j^{k+1}=s_j^k+|x^{k+1}-x^k|\quad\text{for }j\in\hat J_f^k\cup\hat J_F^k,$$
$$s_f^{k+1}=\tilde s_f^k+|x^{k+1}-x^k|,$$
$$s_F^{k+1}=\tilde s_F^k+|x^{k+1}-x^k|.$$

Step 5 (Distance resetting test). Set

$$a^{k+1}=\max\{a^k+|x^{k+1}-x^k|,\ s_{k+1}^{k+1}\}.$$

If $a^{k+1}<\bar a$ then set $r_a^{k+1}=0$ and go to Step 7. Otherwise, set $r_a^{k+1}=1$ and go to Step 6.

Step 6 (Distance resetting). Keep deleting from $J_f^{k+1}$ and $J_F^{k+1}$ the smallest indices until the reset value of $a^{k+1}$ satisfies

$$a^{k+1}=\max\{s_j^{k+1} : j\in J_f^{k+1}\cup J_F^{k+1}\}\le\bar a/2.$$

Step 7. Increase $k$ by 1 and go to Step 1.
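Steps 5-6 can be sketched as follows (a hedged Python sketch; the dictionary representation of the distances and all names are mine). The newest index always survives the deletion loop because $s_{k+1}^{k+1}\le\bar a/2$ by the line search rule (3.11e):

```python
def distance_resetting(J_f, J_F, s, step_len, s_new, a_prev, a_bar):
    """Step 5: update the locality radius a^{k+1}.  Step 6: when it is
    not below a_bar, delete the oldest (smallest) indices until the reset
    radius is at most a_bar / 2.  's' maps index j -> s_j^{k+1};
    step_len is |x^{k+1} - x^k| and s_new = s_{k+1}^{k+1}."""
    a_next = max(a_prev + step_len, s_new)
    if a_next < a_bar:
        return sorted(J_f), sorted(J_F), a_next, 0      # r_a^{k+1} = 0
    J_f, J_F = sorted(J_f), sorted(J_F)
    while max(s[j] for j in J_f + J_F) > a_bar / 2.0:
        if J_f and (not J_F or J_f[0] < J_F[0]):
            J_f.pop(0)                                  # drop oldest index
        else:
            J_F.pop(0)
    return J_f, J_F, max(s[j] for j in J_f + J_F), 1    # r_a^{k+1} = 1
```

Deleting the oldest indices is what removes the largest distances, since each surviving $s_j$ grows by $|x^{k+1}-x^k|$ at every iteration.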

A few remarks on the algorithm are in order.

By Lemma 5.2.3, subproblem (3.1) is the dual of the $k$-th primal search direction finding subproblem (2.22), and $\lambda_j^k$, $j\in J_f^k$, $\lambda_p^k$, $\mu_j^k$, $j\in J_F^k$, and $\mu_p^k$ are the corresponding Lagrange multipliers. Relations (3.3)-(3.4) satisfy (2.23), hence (see Section 5.2) one can calculate $(d^k,\hat v^k)$, the solution of (2.22), via (3.5)-(3.7) and

$$\hat v^k=-\Big\{|p^k|^2+\sum_{j\in J_f^k}\lambda_j^k\alpha_{f,j}^k+\lambda_p^k\alpha_{f,p}^k+\sum_{j\in J_F^k}\mu_j^k\alpha_{F,j}^k+\mu_p^k\alpha_{F,p}^k\Big\}.\eqno(3.15)$$

One may, of course, solve the $k$-th primal search direction finding subproblem (2.22) in Step 1 of the method.

The stopping criterion of Step 2 admits of the following interpretation. The values of the aggregate subgradient locality measures $\tilde\alpha_{f,p}^k$ and $\tilde\alpha_{F,p}^k$ given by (3.8) indicate how far $p_f^k$ and $p_F^k$, respectively, are from $M(x^k)$. At the same time, the value of the following subgradient locality measure

$$\tilde\alpha_p^k=\nu_f^k\tilde\alpha_{f,p}^k+\nu_F^k\tilde\alpha_{F,p}^k\eqno(3.16)$$

indicates how much the aggregate subgradient $p^k=\nu_f^k p_f^k+\nu_F^k p_F^k$ differs from being an element of $M(x^k)$, since $\nu_f^k\ge 0$, $\nu_F^k\ge 0$ and $\nu_f^k+\nu_F^k=1$. In particular, in the convex case we have $M(x^k)=\partial H(x^k;x^k)$ and

$$p^k\in\partial_\varepsilon H(x^k;x^k)\quad\text{for }\varepsilon=\tilde\alpha_p^k,$$

see Lemma 5.4.2. By (3.10) and (3.16),

$$w^k=\tfrac12|p^k|^2+\tilde\alpha_p^k.\eqno(3.17)$$

Therefore, a small value of $w^k$ indicates that both $|p^k|$ is small and that $p^k$ is close to $M(x^k)$, i.e. the null vector is close to $M(x^k)$, so that $x^k$ is approximately stationary (stationary points $\bar x$ satisfy $0\in M(\bar x)$). Thus $w^k$ may be called the stationarity measure of $x^k$. On the other hand, since $p^k$ is a convex combination of approximate subgradients $p_f^k$ and $p_F^k$ of $f$ and $F$ at $x^k$, respectively, we may regard $p^k$ as an approximate subgradient of some Lagrangian function of problem (1.1)

$$L(x,\nu)=\nu_f f(x)+\nu_F F(x),$$

i.e. $p^k$ is close to $\nu_f^k\partial f(x^k)+\nu_F^k\partial F(x^k)$ if the value of $\tilde\alpha_p^k$ is small. Thus our stopping criterion generalizes the usual criterion of a small value of the gradient of the Lagrangian, which is frequently employed in algorithms for smooth problems.
Our line search rules (3.11) extend the rules (3.3.8)-(3.3.11) to the constrained case. As in Algorithm 5.3.1, $v^k$ approximates the directional derivative of $H^k$ at $x^k$ in the direction $d^k$, and the line search is entered with

$$v^k=\hat H^k(x^k+d^k)-H^k(x^k)<0.$$

The criteria (3.11a)-(3.11b) ensure monotonicity in the objective value and feasibility, i.e. $f(x^{k+1})\le f(x^k)$ and $x^{k+1}\in S$ for all $k$. The rule (3.11c) means that we do not pose any demands on the new subgradients $g_f^{k+1}=g_f(y^{k+1})$ and $g_F^{k+1}=g_F(y^{k+1})$ if the algorithm makes sufficient progress, i.e. $f(x^{k+1})$ is significantly smaller than $f(x^k)$. On the other hand, the criterion (3.11d), yielding either

$$-\alpha_{f,k+1}^{k+1}+\langle g_f^{k+1},d^k\rangle\ge m_R v^k\quad\text{if }t_L^k<\bar t\ \text{and}\ y^{k+1}\in S$$

or

$$-\alpha_{F,k+1}^{k+1}+\langle g_F^{k+1},d^k\rangle\ge m_R v^k\quad\text{if }t_L^k<\bar t\ \text{and}\ y^{k+1}\notin S,$$

ensures that at least one of the two new subgradients will significantly modify the next polyhedral approximation to $H^{k+1}$ after a null step or a short serious step. This prevents the algorithm from jamming at nonstationary points. The criterion (3.11e) is connected with the distance resetting strategy discussed below.
The line search rules (3.11) are general enough to allow for constructing many efficient procedures for executing Step 3 (Mifflin, 1982 and 1983). For completeness, we give below a simple extension of Line Search Procedure 3.3.1 for finding stepsizes $t_L=t_L^k$ and $t_R=t_R^k$. In this procedure $\zeta$ is a fixed parameter satisfying $\zeta\in(0,0.5)$, $x=x^k$, $d=d^k$ and $v=v^k<0$.

Line Search Procedure 3.2.

(i) Set $t_L=0$ and $t=t_U=1$.

(ii) If $f(x+td)\le f(x)+m_L t v$ and $F(x+td)\le 0$, set $t_L=t$; otherwise set $t_U=t$.

(iii) If $t_L\ge\bar t$, set $t_R=t_L$ and return.

(iv) If $-\alpha(x+t_L d,x+td)+\langle g(x+td),d\rangle\ge m_R v$, $t_L<\bar t$ and $(t-t_L)|d|\le\bar a/2$, set $t_R=t$ and return.

(v) Choose $t\in[t_L+\zeta(t_U-t_L),\ t_U-\zeta(t_U-t_L)]$ by some interpolation procedure and go to (ii).
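The procedure above can be sketched in Python as follows (an illustrative sketch only: plain bisection replaces the interpolation of step (v), an iteration cap is added for safety, and the callables `f`, `F`, `g`, `alpha` are assumed to implement the problem functions and (3.12)):

```python
import numpy as np

def line_search(f, F, g, alpha, x, d, v, m_L, m_R, t_bar, a_bar,
                max_iter=100):
    """Sketch of Line Search Procedure 3.2; v = v^k < 0.  Termination in
    general relies on the semismoothness hypotheses (3.3.23) and (3.18)."""
    x, d = np.asarray(x, float), np.asarray(d, float)
    t_L, t, t_U = 0.0, 1.0, 1.0
    for _ in range(max_iter):
        y = x + t * d
        if f(y) <= f(x) + m_L * t * v and F(y) <= 0.0:    # step (ii)
            t_L = t
        else:
            t_U = t
        if t_L >= t_bar:                                  # step (iii)
            return t_L, t_L
        x_L = x + t_L * d
        if (-alpha(x_L, y) + float(g(y) @ d) >= m_R * v   # step (iv)
                and (t - t_L) * np.linalg.norm(d) <= a_bar / 2.0):
            return t_L, t
        t = 0.5 * (t_L + t_U)   # step (v): midpoint is valid for zeta < 0.5
    raise RuntimeError("line search did not terminate")
```

Returning `(t_L, t)` with `t_L == 0` corresponds to a null step, which still supplies a useful new subgradient via (3.11d).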

To study convergence of the above procedure, consider the following "semismoothness" hypothesis:

for any $x\in\mathbb R^N$, $d\in\mathbb R^N$ and sequences $\{g^i\}\subset\mathbb R^N$ and $\{t^i\}\subset\mathbb R_+$ satisfying $g^i\in\partial F(x+t^i d)$, $F(x+t^i d)>0$, $F(x)=0$ and $t^i\downarrow 0$, one has

$$\limsup_{i\to\infty}\ \langle g^i,d\rangle\ \ge\ \liminf_{i\to\infty}\ [F(x+t^i d)-F(x)]/t^i.\eqno(3.18)$$

Lemma 3.3. If $f$ and $F$ are semismooth in the sense of (3.3.23) and (3.18), then Line Search Procedure 3.2 terminates with $t_L^k=t_L$ and $t_R^k=t_R$ satisfying (3.11).

Proof. We shall use the following modification of the proof of Lemma 3.3.3. Let

$$T_L=\{t\ge 0 : f(x+td)\le f(x)+m_L t v\ \text{and}\ F(x+td)\le 0\}$$

and observe that we now have $\{t_L^i\}\subset T_L$, $t_L^i\uparrow\bar t$ and $\bar t\in T_L$, because both $f$ and $F$ are continuous, so we have $F(x+\bar t d)\le 0$ in addition to (3.3.24a). Using (2.31), (3.12), the continuity of $f$ and $F$, and the local boundedness of $g_f$ and $g_F$, we obtain

$$\alpha(x+t_L^i d,\ x+t^i d)\to 0,$$

hence the rules of the procedure yield, as before, that

$$\limsup_{i\to\infty}\ \langle g^i,d\rangle\le m_R v,\eqno(3.19)$$

where $g^i=g(x+t^i d)$ for all $i$. Since $t^i\downarrow\bar t$, $\bar t\in T_L$ and $t_U^i\notin T_L$ if $t_U^i=t^i$, there exists an infinite set $I\subset\{1,2,\ldots\}$ such that $t_U^i=t^i$ and either $f(x+t^i d)>f(x)+m_L t^i v$ or $F(x+t^i d)>0$ for all $i\in I$. If $I$ contained an infinite subset satisfying (3.3.24b), we would obtain a contradiction with (3.3.23) as before. Therefore we may suppose that

$$F(x+t^i d)>0\quad\text{for all }i\in I,\eqno(3.20a)$$
$$g^i=g_F(x+t^i d)\in\partial F(x+\bar t d+(t^i-\bar t)d)\quad\text{for all }i\in I.\eqno(3.20b)$$

Then, since $t^i\downarrow\bar t$, $F(x+\bar t d)\le 0$ and $F$ is continuous, (3.20a) yields

$$F(x+\bar t d)=0,\eqno(3.20c)$$
$$\liminf_{i\to\infty,\ i\in I}\ [F(x+\bar t d+(t^i-\bar t)d)-F(x+\bar t d)]/(t^i-\bar t)\ \ge\ 0\ >\ m_L v,\eqno(3.20d)$$

because $m_L v<0$. Since $m_R v<m_L v$, (3.19)-(3.20) contradict (3.18). Therefore the search terminates.

Remark 3.4. As in Remark 3.3.4, we observe that (3.18) holds if $F$ is weakly upper semismooth in the sense of Mifflin (1982), e.g. if $F$ is convex.

The subgradient deletion rules of Step 5 and Step 6 are taken from Algorithm 3.3.1. Therefore, similarly to (3.3.29), we have

$$k\in J_f^k\cup J_F^k\quad\text{for all }k,\eqno(3.21a)$$

while the use of (2.30) instead of (3.13) yields

$$k\in J_f^k\ \text{if }y^k\in S,\quad\text{and}\quad k\in J_F^k\ \text{if }y^k\notin S.\eqno(3.21b)$$

4. Convergence

In this section we shall establish global convergence of Algorithm 3.1. We suppose that each execution of Line Search Procedure 3.2 is finite, e.g. $f$ and $F$ have the additional semismoothness properties (3.3.23) and (3.18). Moreover, the convergence results assume that the final accuracy tolerance $\varepsilon_s$ is set to zero. In the absence of convexity, we will content ourselves with finding stationary points for $f$ on $S$. Our principal result states that the algorithm either terminates at a stationary point or generates an infinite sequence $\{x^k\}$ whose accumulation points are stationary. If problem (1.1) is convex and satisfies the Slater constraint qualification, then $\{x^k\}$ is a minimizing sequence for $f$ on $S$, which converges to a solution of (1.1) whenever $f$ attains its infimum on $S$.

Since Algorithm 3.1 is a combination of Algorithm 3.3.1 and Algorithm 5.3.1, our convergence analysis will consist in modifying the proofs of Section 3.4 with the help of the results of Section 5.4.

We start by observing that the rules of Algorithm 3.1 for aggregating subgradients of each of the problem functions are similar to those of Algorithm 3.3.1. Therefore, the reader can establish the following result on convex representations of the aggregate subgradients analogously to Lemma 3.4.1 through Lemma 3.4.3, and Lemma 5.4.1.

Lemma 4.1. Suppose $k\ge 1$ is such that Algorithm 3.1 did not stop before the $k$-th iteration, and let $M=N+3$. Then there exist numbers $\hat\lambda_i^k$ and $\hat\mu_i^k$, and vectors $(y_f^{k,i},f^{k,i},s_f^{k,i})\in\mathbb R^N\times\mathbb R\times\mathbb R$ and $(y_F^{k,i},F^{k,i},s_F^{k,i})\in\mathbb R^N\times\mathbb R\times\mathbb R$, $i=1,\ldots,M$, satisfying

$$(p_f^k,\tilde f_p^k,\tilde s_f^k)=\sum_{i=1}^M\hat\lambda_i^k\,(g_f(y_f^{k,i}),f^{k,i},s_f^{k,i}),\eqno(4.1a)$$
$$\hat\lambda_i^k\ge 0,\ i=1,\ldots,M,\quad \sum_{i=1}^M\hat\lambda_i^k=1,\eqno(4.1b)$$
$$(g_f(y_f^{k,i}),f^{k,i},s_f^{k,i})\in\{(g_f(y^j),f_j^k,s_j^k) : j=1,\ldots,k\},\ i=1,\ldots,M,\eqno(4.1c)$$
$$|y_f^{k,i}-x^k|\le s_f^{k,i},\ i=1,\ldots,M,\eqno(4.1d)$$
$$\max\{s_f^{k,i} : i=1,\ldots,M\}\le a^k<\bar a,\eqno(4.1e)$$

and

$$(p_F^k,\tilde F_p^k,\tilde s_F^k)=\sum_{i=1}^M\hat\mu_i^k\,(g_F(y_F^{k,i}),F^{k,i},s_F^{k,i}),$$
$$\hat\mu_i^k\ge 0,\ i=1,\ldots,M,\quad \sum_{i=1}^M\hat\mu_i^k=1,$$
$$(g_F(y_F^{k,i}),F^{k,i},s_F^{k,i})\in\{(g_F(y^j),F_j^k,s_j^k) : j=1,\ldots,k\},\ i=1,\ldots,M,\eqno(4.2)$$
$$|y_F^{k,i}-x^k|\le s_F^{k,i},\ i=1,\ldots,M,$$
$$\max\{s_F^{k,i} : i=1,\ldots,M\}\le a^k<\bar a.$$

If additionally $f$ is convex then

$$p_f^k\in\partial_\varepsilon f(x^k)\quad\text{for }\varepsilon=f(x^k)-\tilde f_p^k\le\tilde\alpha_{f,p}^k,\eqno(4.3a)$$

while if $F$ is convex then

$$p_F^k\in\partial_\varepsilon F(x^k)\quad\text{for }\varepsilon=F(x^k)-\tilde F_p^k.\eqno(4.3b)$$

The following lemma, which generalizes Lemma 3.4.4, is useful in deriving asymptotic results from the representations (4.1)-(4.2). We recall that $\gamma_f=0$ ($\gamma_F=0$) only if $f$ ($F$) is convex; otherwise $\gamma_f>0$ ($\gamma_F>0$).

Lemma 4.2. (i) Suppose that a point $\bar x\in\mathbb R^N$, $N$-vectors $\bar p_f$, $\bar y_f^i$ and $\bar g_f^i$, and numbers $\bar f_p$, $\bar s_f$, $\bar\lambda_i$ and $\bar s_f^i$, $i=1,\ldots,M=N+3$, satisfy

$$(\bar p_f,\bar f_p,\bar s_f)=\sum_{i=1}^M\bar\lambda_i\,(\bar g_f^i,\bar f^i,\bar s_f^i),$$
$$\bar\lambda_i\ge 0,\ i=1,\ldots,M,\quad \sum_{i=1}^M\bar\lambda_i=1,$$
$$\bar g_f^i\in\partial f(\bar y_f^i),\ i=1,\ldots,M,$$
$$\bar f^i=f(\bar y_f^i)+\langle\bar g_f^i,\bar x-\bar y_f^i\rangle,\ i=1,\ldots,M,\eqno(4.4)$$
$$|\bar y_f^i-\bar x|\le\bar s_f^i,\ i=1,\ldots,M,$$
$$f(\bar x)=\bar f_p,$$
$$\gamma_f\bar s_f=0.$$

Then $\bar p_f\in\partial f(\bar x)$.

(ii) Suppose that a point $\bar x\in\mathbb R^N$, $N$-vectors $\bar p_F$, $\bar y_F^i$ and $\bar g_F^i$, and numbers $\bar F_p$, $\bar s_F$, $\bar\mu_i$ and $\bar s_F^i$, $i=1,\ldots,M$, satisfy

$$(\bar p_F,\bar F_p,\bar s_F)=\sum_{i=1}^M\bar\mu_i\,(\bar g_F^i,\bar F^i,\bar s_F^i),\eqno(4.5a)$$
$$\bar\mu_i\ge 0,\ i=1,\ldots,M,\quad \sum_{i=1}^M\bar\mu_i=1,\eqno(4.5b)$$
$$\bar g_F^i\in\partial F(\bar y_F^i),\ i=1,\ldots,M,\eqno(4.5c)$$
$$\bar F^i=F(\bar y_F^i)+\langle\bar g_F^i,\bar x-\bar y_F^i\rangle,\ i=1,\ldots,M,\eqno(4.5d)$$
$$|\bar y_F^i-\bar x|\le\bar s_F^i,\ i=1,\ldots,M,\eqno(4.5e)$$
$$F(\bar x)_+=\bar F_p,\eqno(4.5f)$$
$$\gamma_F\bar s_F=0.\eqno(4.5g)$$

Then $\bar p_F\in\partial F(\bar x)$ and $F(\bar x)\ge 0$.

Proof. We shall only prove part (ii) of the lemma, since part (i) follows from Lemma 3.4.4.

(a) First, suppose that $\gamma_F>0$. Then $\bar s_F=0$ by (4.5g), so (4.5a,b,e) yields

$$\bar y_F^i=\bar x\quad\text{if }\bar\mu_i\ne 0,\eqno(4.6)$$

and we have $\bar p_F\in\partial F(\bar x)$ from (4.5a,b,c), (4.6) and the convexity of $\partial F(\bar x)$. By (4.5a,b,d,f) and (4.6),

$$0=F(\bar x)_+-\bar F_p=\sum_{i=1}^M\bar\mu_i\,[F(\bar x)_+-\bar F^i]=$$
$$=\sum_{i=1}^M\bar\mu_i\,[F(\bar x)_+-F(\bar y_F^i)-\langle\bar g_F^i,\bar x-\bar y_F^i\rangle]=$$
$$=\sum_{\bar\mu_i\ne 0}\bar\mu_i\,[F(\bar x)_+-F(\bar x)]=F(\bar x)_+-F(\bar x).$$

Thus $F(\bar x)=F(\bar x)_+\ge 0$.

(b) Next, suppose that $\gamma_F=0$. Then $F$ is convex and (4.5c,d) yield

$$F(\bar x)\ge F(\bar y_F^i)+\langle\bar g_F^i,\bar x-\bar y_F^i\rangle=\bar F^i\quad\text{for }i=1,\ldots,M.$$

Multiplying the above inequality by $\bar\mu_i$ and summing, we obtain $F(\bar x)\ge\bar F_p$ from (4.5a,b). Therefore, by (4.5f), $F(\bar x)\ge\bar F_p=F(\bar x)_+\ge 0$.

We shall now consider the case when the algorithm terminates.

Lemma 4.3. If Algorithm 3.1 terminates at the $k$-th iteration, then the point $\bar x=x^k$ is stationary for $f$ on $S$.

Proof. Suppose the algorithm terminates at Step 2 due to $w^k\le\varepsilon_s=0$, and let $\bar x=x^k$. We have

$$w^k=\tfrac12|p^k|^2+\tilde\alpha_p^k,$$
$$p^k=\nu_f^k p_f^k+\nu_F^k p_F^k,$$
$$\tilde\alpha_p^k=\nu_f^k\max\{|f(\bar x)-\tilde f_p^k|,\gamma_f(\tilde s_f^k)^2\}+\nu_F^k\max\{|\tilde F_p^k|,\gamma_F(\tilde s_F^k)^2\},$$
$$\nu_f^k\ge 0,\quad \nu_F^k\ge 0,\quad \nu_f^k+\nu_F^k=1\eqno(4.7a)$$

from (3.17), (3.6), (3.16), (3.8) and (2.25). Therefore $w^k=0$ and

$$\nu_f^k p_f^k+\nu_F^k p_F^k=0,\eqno(4.7b)$$
$$\nu_f^k[f(\bar x)-\tilde f_p^k]=0,\quad \nu_f^k\gamma_f\tilde s_f^k=0,\eqno(4.8a)$$
$$\nu_F^k[F(\bar x)_+-\tilde F_p^k]=0,\quad \nu_F^k\gamma_F\tilde s_F^k=0,\eqno(4.8b)$$

where $F(\bar x)_+=0$, because $F(\bar x)=F(x^k)\le 0$. Suppose that $\nu_f^k\ne 0$. Then (4.8a), Lemma 4.1 and Lemma 4.2 imply $p_f^k\in\partial f(\bar x)$, i.e.

$$p_f^k\in\partial f(\bar x)\quad\text{if }\nu_f^k\ne 0.\eqno(4.9a)$$

Next, if $\nu_F^k\ne 0$ then (4.8b), Lemma 4.1 and Lemma 4.2 yield $p_F^k\in\partial F(\bar x)$ and $F(\bar x)\ge 0$, so, because $F(\bar x)\le 0$, we have

$$p_F^k\in\partial F(\bar x)\ \text{and}\ F(\bar x)=0\quad\text{if }\nu_F^k\ne 0.\eqno(4.9b)$$

Since $F(\bar x)\le 0$, (4.7) and (4.9) imply $0\in M(\bar x)$ (see (2.2)) and $\bar x\in S$.

From now on we suppose that the algorithm generates an infinite sequence $\{x^k\}$, i.e. $w^k>0$ for all $k$.

The following lemma, which generalizes Lemma 3.4.6, states useful asymptotic properties of the aggregate subgradients.

Lemma 4.4. Suppose that there exist a point $\bar x\in\mathbb R^N$ and an infinite set $K\subset\{1,2,\ldots\}$ satisfying $x^k\xrightarrow{K}\bar x$. Then there exist an infinite set $\bar K\subset K$ and two $N$-vectors $\bar p_f$ and $\bar p_F$ such that

$$p_f^k\xrightarrow{\bar K}\bar p_f\quad\text{and}\quad p_F^k\xrightarrow{\bar K}\bar p_F.$$

If additionally $\tilde\alpha_{f,p}^k\xrightarrow{\bar K}0$ then $\bar p_f\in\partial f(\bar x)$, while if $\tilde\alpha_{F,p}^k\xrightarrow{\bar K}0$ then $\bar p_F\in\partial F(\bar x)$ and $F(\bar x)\ge 0$.

Proof. Use Lemma 4.1, Lemma 4.2 and proceed as in the proof of Lemma 3.4.6.

Our next result, which extends Lemma 3.4.7, establishes a crucial property of the stationarity measures $\{w^k\}$.

Lemma 4.5. Suppose that for some $\bar x\in\mathbb R^N$ we have

$$\liminf_{k\to\infty}\ \max\{w^k,|\bar x-x^k|\}=0,\eqno(4.10)$$

or equivalently

there exists an infinite set $K\subset\{1,2,\ldots\}$ such that $x^k\xrightarrow{K}\bar x$ and $w^k\xrightarrow{K}0$. (4.11)

Then $0\in M(\bar x)$ and $\bar x\in S$.

Proof. Suppose that (4.11) holds. Since $w^k=\tfrac12|p^k|^2+\nu_f^k\tilde\alpha_{f,p}^k+\nu_F^k\tilde\alpha_{F,p}^k$ with $\nu_f^k\tilde\alpha_{f,p}^k\ge 0$ and $\nu_F^k\tilde\alpha_{F,p}^k\ge 0$ for all $k$ by (2.25) and (3.8), we have

$$|p^k|\xrightarrow{K}0,\quad \nu_f^k\tilde\alpha_{f,p}^k\xrightarrow{K}0\quad\text{and}\quad \nu_F^k\tilde\alpha_{F,p}^k\xrightarrow{K}0.\eqno(4.12)$$

Since $x^k\xrightarrow{K}\bar x$ by assumption and $|p^k|\xrightarrow{K}0$, we may use (2.25), (3.6)



and Lemma 4.4 to deduce the existence of an infinite set $\bar K\subset K$, numbers $\bar\nu_f$ and $\bar\nu_F$, and $N$-vectors $\bar p_f$ and $\bar p_F$ such that

$$\nu_f^k\xrightarrow{\bar K}\bar\nu_f,\quad \nu_F^k\xrightarrow{\bar K}\bar\nu_F,$$
$$p_f^k\xrightarrow{\bar K}\bar p_f,\quad p_F^k\xrightarrow{\bar K}\bar p_F,$$
$$\bar\nu_f\ge 0,\quad \bar\nu_F\ge 0,\quad \bar\nu_f+\bar\nu_F=1,\eqno(4.13a)$$
$$\bar\nu_f\bar p_f+\bar\nu_F\bar p_F=0.\eqno(4.13b)$$

Suppose that $\bar\nu_f\ne 0$. Then (4.12) yields $\tilde\alpha_{f,p}^k\xrightarrow{\bar K}0$, so $\bar p_f\in\partial f(\bar x)$ by Lemma 4.4. Thus

$$\bar p_f\in\partial f(\bar x)\quad\text{if }\bar\nu_f\ne 0.\eqno(4.13c)$$

Similarly, (4.12) and Lemma 4.4 imply

$$\bar p_F\in\partial F(\bar x)\ \text{and}\ F(\bar x)\ge 0\quad\text{if }\bar\nu_F\ne 0.\eqno(4.13d)$$

Since $F(x^k)\le 0$ and $x^k\xrightarrow{K}\bar x$, the continuity of $F$ yields $F(\bar x)\le 0$. Combining this with (4.13) we obtain $0\in M(\bar x)$ and $\bar x\in S$. The equivalence of (4.10) and (4.11) follows from the nonnegativity of the $w^k$'s.

Proceeding as in Section 3.4, we shall now relate the stationarity measures $w^k$ with the dual search direction finding subproblems.

Let $\hat w^k$ denote the optimal value of the dual search direction finding subproblem (3.1), for all $k$. A useful relation between $w^k$ and $\hat w^k$ is established in the following lemma, which generalizes Lemma 3.4.8.

Lemma 4.6. At the $k$-th iteration of Algorithm 3.1, one has

$$\hat w^k=\tfrac12|p^k|^2+\hat\alpha_p^k,\eqno(4.14a)$$
$$\hat\alpha_p^k=\nu_f^k\hat\alpha_{f,p}^k+\nu_F^k\hat\alpha_{F,p}^k,\eqno(4.14b)$$
$$\hat\alpha_{f,p}^k=\sum_{j\in J_f^k}\tilde\lambda_j^k\alpha_{f,j}^k+\tilde\lambda_p^k\alpha_{f,p}^k,\eqno(4.14c)$$
$$\hat\alpha_{F,p}^k=\sum_{j\in J_F^k}\tilde\mu_j^k\alpha_{F,j}^k+\tilde\mu_p^k\alpha_{F,p}^k,\eqno(4.14d)$$
$$\tilde\alpha_{f,p}^k\le\hat\alpha_{f,p}^k\quad\text{and}\quad \tilde\alpha_{F,p}^k\le\hat\alpha_{F,p}^k,\eqno(4.15)$$

$$w^k\le\hat w^k,\eqno(4.16)$$
$$v^k=-\{|p^k|^2+\tilde\alpha_p^k\}\le -w^k\le 0,\eqno(4.17)$$
$$\hat v^k\le v^k.\eqno(4.18)$$

Moreover, if $f$ and $F$ are convex then $\tilde\alpha_p^k=\hat\alpha_p^k$, $w^k=\hat w^k$ and $v^k=\hat v^k$.

Proof. (i) By (3.6), (3.5) and (2.23a,b),

$$p^k=\sum_{j\in J_f^k}\lambda_j^k g_f^j+\lambda_p^k p_f^{k-1}+\sum_{j\in J_F^k}\mu_j^k g_F^j+\mu_p^k p_F^{k-1},$$

while (4.14b,c,d) and (2.23a,b) yield

$$\hat\alpha_p^k=\sum_{j\in J_f^k}\lambda_j^k\alpha_{f,j}^k+\lambda_p^k\alpha_{f,p}^k+\sum_{j\in J_F^k}\mu_j^k\alpha_{F,j}^k+\mu_p^k\alpha_{F,p}^k.\eqno(4.19)$$

The above two equalities imply (4.14a).

(ii) One can establish (4.15) by using (4.14c,d), (3.8), (3.5) and (2.23c,d) as in the proof of Lemma 3.4.8. Since $\nu_f^k$ and $\nu_F^k$ are nonnegative, we obtain

$$\tilde\alpha_p^k=\nu_f^k\tilde\alpha_{f,p}^k+\nu_F^k\tilde\alpha_{F,p}^k\le\nu_f^k\hat\alpha_{f,p}^k+\nu_F^k\hat\alpha_{F,p}^k=\hat\alpha_p^k$$

from (3.16), (4.15) and (4.14b). Hence

$$w^k=\tfrac12|p^k|^2+\tilde\alpha_p^k\le\tfrac12|p^k|^2+\hat\alpha_p^k=\hat w^k,$$
$$v^k=-\{|p^k|^2+\tilde\alpha_p^k\}\le -w^k\le 0,$$
$$\hat v^k=-\{|p^k|^2+\hat\alpha_p^k\}\le -\{|p^k|^2+\tilde\alpha_p^k\}=v^k$$

from (3.17), (4.14a), (3.9), (3.16), (3.15) and (4.19). This proves (4.16)-(4.18).

(iii) If $f$ and $F$ are convex then $f(x^k)-f_j^k\ge 0$, $j\in J_f^k$, $f(x^k)-f_p^k\ge 0$, $f(x^k)-\tilde f_p^k\ge 0$, $-F_j^k\ge 0$, $j\in J_F^k$, $-F_p^k\ge 0$ and $-\tilde F_p^k\ge 0$ (see Lemma 5.4.2), hence we obtain $\tilde\alpha_{f,p}^k=\hat\alpha_{f,p}^k$ and $\tilde\alpha_{F,p}^k=\hat\alpha_{F,p}^k$ as in the proof of Lemma 3.4.8. Then part (ii) above yields $\tilde\alpha_p^k=\hat\alpha_p^k$, $w^k=\hat w^k$ and $v^k=\hat v^k$.

The reader can easily establish analogues of Lemma 3.4.9 and Corollary 3.4.10 for Algorithm 3.1.

We shall now prove the following extension of Lemma 3.4.11. Let

$$g^k=g_f(y^k)\ \text{and}\ \alpha^k=\alpha_f(x^k,y^k)\quad\text{if }y^k\in S,$$
$$g^k=g_F(y^k)\ \text{and}\ \alpha^k=\alpha_F(x^k,y^k)\quad\text{if }y^k\notin S.\eqno(4.20)$$

Lemma 4.7. Suppose that $t_L^{k-1}<\bar t$ and $r_a^k=0$ for some $k>1$. Then

$$w^k\le\hat w^k\le\phi_C(w^{k-1})+|\alpha_p^k-\tilde\alpha_p^{k-1}|,\eqno(4.21)$$

where the function $\phi_C$ is defined by

$$\phi_C(t)=t-(1-m_R)^2 t^2/(8C^2),$$

$C$ is any number satisfying

$$\max\{|p^{k-1}|,\ |g^k|,\ \tilde\alpha_p^{k-1},\ 1\}\le C,\eqno(4.22)$$

and

$$\alpha_p^k=\nu_f^{k-1}\alpha_{f,p}^k+\nu_F^{k-1}\alpha_{F,p}^k.\eqno(4.23)$$

Proof. (i) Observe that $k>1$, $t_L^{k-1}<\bar t$ and the line search rule (3.11d) yield

$$-\alpha(x^k,y^k)+\langle g(y^k),d^{k-1}\rangle\ge m_R v^{k-1},$$

so we have

$$-\alpha^k+\langle g^k,d^{k-1}\rangle\ge m_R v^{k-1}\eqno(4.24)$$

from (4.20), (2.38) and (2.36).

(ii) Let $\nu\in[0,1]$ and define the multipliers

$$\lambda_k(\nu)=\nu,\quad \lambda_j(\nu)=0\ \text{for }j\in J_f^k\setminus\{k\},\quad \lambda_p(\nu)=(1-\nu)\nu_f^{k-1},$$
$$\mu_j(\nu)=0\ \text{for }j\in J_F^k,\quad \mu_p(\nu)=(1-\nu)\nu_F^{k-1}\eqno(4.25a)$$

if $y^k\in S$, and

$$\lambda_j(\nu)=0\ \text{for }j\in J_f^k,\quad \lambda_p(\nu)=(1-\nu)\nu_f^{k-1},$$


$$\mu_k(\nu)=\nu,\quad \mu_j(\nu)=0\ \text{for }j\in J_F^k\setminus\{k\},\quad \mu_p(\nu)=(1-\nu)\nu_F^{k-1}\eqno(4.25b)$$

if $y^k\notin S$. Since $p^{k-1}=\nu_f^{k-1}p_f^{k-1}+\nu_F^{k-1}p_F^{k-1}$ by (3.6), we obtain

$$\sum_{j\in J_f^k}\lambda_j(\nu)g_f^j+\lambda_p(\nu)p_f^{k-1}+\sum_{j\in J_F^k}\mu_j(\nu)g_F^j+\mu_p(\nu)p_F^{k-1}=(1-\nu)p^{k-1}+\nu g^k\eqno(4.26a)$$

from (4.25), (4.20) and the fact that $g_f^k=g_f(y^k)$ and $g_F^k=g_F(y^k)$. Similarly, (4.25), (4.20), (4.23) and the fact that $\alpha_{f,k}^k=\alpha_f(x^k,y^k)$ and $\alpha_{F,k}^k=\alpha_F(x^k,y^k)$ yield

$$\sum_{j\in J_f^k}\lambda_j(\nu)\alpha_{f,j}^k+\lambda_p(\nu)\alpha_{f,p}^k+\sum_{j\in J_F^k}\mu_j(\nu)\alpha_{F,j}^k+\mu_p(\nu)\alpha_{F,p}^k=(1-\nu)\alpha_p^k+\nu\alpha^k.\eqno(4.26b)$$

By (2.25), $\nu_f^{k-1}\ge 0$, $\nu_F^{k-1}\ge 0$ and $\nu_f^{k-1}+\nu_F^{k-1}=1$, hence (4.25) yields

$$\lambda_j(\nu)\ge 0,\ j\in J_f^k,\quad \lambda_p(\nu)\ge 0,\quad \mu_j(\nu)\ge 0,\ j\in J_F^k,\quad \mu_p(\nu)\ge 0,$$
$$\sum_{j\in J_f^k}\lambda_j(\nu)+\lambda_p(\nu)+\sum_{j\in J_F^k}\mu_j(\nu)+\mu_p(\nu)=\nu+(1-\nu)\nu_f^{k-1}+(1-\nu)\nu_F^{k-1}=1$$

for all $\nu\in[0,1]$. Combining this with our assumption that $r_a^k=0$ and with (3.21b), we deduce that the multipliers (4.25) are feasible for subproblem (3.1) for all $\nu\in[0,1]$. Therefore $\hat w^k$ (the optimal value of subproblem (3.1)) satisfies

$$\hat w^k\le\min\{\tfrac12|(1-\nu)p^{k-1}+\nu g^k|^2+(1-\nu)\alpha_p^k+\nu\alpha^k : \nu\in[0,1]\}$$
$$\le\min\{\tfrac12|(1-\nu)p^{k-1}+\nu g^k|^2+(1-\nu)\tilde\alpha_p^{k-1}+\nu\alpha^k : \nu\in[0,1]\}+|\alpha_p^k-\tilde\alpha_p^{k-1}|.\eqno(4.27)$$

Using Lemma 2.4.10 and relations (3.7), (3.17), (4.17), (4.24) and (4.22), we obtain

$$\min\{\tfrac12|(1-\nu)p^{k-1}+\nu g^k|^2+(1-\nu)\tilde\alpha_p^{k-1}+\nu\alpha^k : \nu\in[0,1]\}\le\phi_C(w^{k-1}),$$

hence (4.27) and (4.16) imply (4.21).
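The one-dimensional minimization over $\nu\in[0,1]$ appearing in (4.27) has a simple closed form, since the objective is a convex quadratic in $\nu$. A hedged Python sketch (names mine):

```python
import numpy as np

def aggregate_pair(p, alpha_p, g, alpha):
    """Minimize q(nu) = 0.5|(1-nu)p + nu*g|^2 + (1-nu)*alpha_p + nu*alpha
    over nu in [0, 1]; returns the minimizer nu* and the value q(nu*)."""
    p, g = np.asarray(p, float), np.asarray(g, float)
    diff = g - p
    denom = float(diff @ diff)
    if denom == 0.0:
        nu = 0.0 if alpha_p <= alpha else 1.0
    else:
        # q'(nu) = <p, diff> + nu*|diff|^2 + alpha - alpha_p, so clip the root
        nu = min(1.0, max(0.0, (alpha_p - alpha - float(p @ diff)) / denom))
    z = (1.0 - nu) * p + nu * g
    return nu, 0.5 * float(z @ z) + (1.0 - nu) * alpha_p + nu * alpha
```

Pairwise minimizations of this kind are also the basic building block of subgradient aggregation: the previous aggregate $(p^{k-1},\tilde\alpha_p^{k-1})$ is combined with one new pair $(g^k,\alpha^k)$.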



To obtain locally uniform bounds of the form (4.22), we shall need the following generalization of Lemma 3.4.13.

Lemma 4.8. (i) For each $k\ge 1$,

$$\max\{|p^k|,\tilde\alpha_p^k\}\le\max\{\tfrac12|g^k|^2+\alpha^k,\ (|g^k|^2+2\alpha^k)^{1/2}\}.\eqno(4.28)$$

(ii) Suppose that $\bar x\in\mathbb R^N$, $B=\{y\in\mathbb R^N : |\bar x-y|\le 2\bar a\}$, where $\bar a$ is the line search parameter involved in (3.11e), and let

$$C_g=\sup\{|g(y)| : y\in B\},\eqno(4.29a)$$
$$C_\alpha=\sup\{\alpha(x,y) : x\in B,\ y\in B\},\eqno(4.29b)$$
$$C=\max\{\tfrac12 C_g^2+C_\alpha,\ (C_g^2+2C_\alpha)^{1/2},\ 1\}.\eqno(4.29c)$$

Then $C$ is finite and

$$\max\{|p^k|,|g^k|,\tilde\alpha_p^k,1\}\le C\quad\text{if }|x^k-\bar x|\le\bar a.\eqno(4.30)$$

Proof. (i) Let $k\ge 1$ be fixed and define the multipliers

$$\lambda_k=1,\quad \lambda_j=0\ \text{for }j\in J_f^k\setminus\{k\},\quad \lambda_p=0,\quad \mu_j=0\ \text{for }j\in J_F^k,\quad \mu_p=0$$

if $y^k\in S$, and

$$\lambda_j=0\ \text{for }j\in J_f^k,\quad \lambda_p=0,\quad \mu_k=1,\quad \mu_j=0\ \text{for }j\in J_F^k\setminus\{k\},\quad \mu_p=0$$

if $y^k\notin S$. Since $k\in J_f^k\cup J_F^k$ by (3.21a), the above multipliers are feasible for the $k$-th dual subproblem (3.1). Therefore $\hat w^k$, the optimal value of (3.1), satisfies $\hat w^k\le\tfrac12|g^k|^2+\alpha^k$ from (4.20), and we have

$$\tfrac12|p^k|^2+\tilde\alpha_p^k=w^k\le\hat w^k\le\tfrac12|g^k|^2+\alpha^k$$

by (3.17) and (4.16). The above inequality and the fact that $\tilde\alpha_p^k\ge 0$ yield (4.28).

(ii) Use the local boundedness of $g_f$ and $g_F$ (Lemma 1.2.2) and the continuity of $f$ and $F$ to deduce from (2.31) and (3.12) that the mapping $g(\cdot)$ is bounded on the bounded set $B$, while $\alpha(\cdot,\cdot)$ is bounded on $B\times B$. Therefore the constants defined by (4.29) are finite. (4.30) follows from (4.28)-(4.29) and the fact that $|y^k-x^k|\le\bar a$ by (3.11e), while $g^k=g(y^k)$ and $\alpha^k=\alpha(x^k,y^k)$ by (4.20) and (3.12).

To verify that Lemma 3.4.15 holds for Algorithm 3.1, we shall need the following result.

Lemma 4.9. Suppose that there exist a point $\bar x\in\mathbb R^N$ and an infinite set $K\subset\{1,2,\ldots\}$ such that $x^k\xrightarrow{K}\bar x$ and $|x^{k+1}-x^k|\to 0$ as $k\to\infty$, $k\in K$. Then the sequences $\{p_f^k\}_{k\in K}$ and $\{p_F^k\}_{k\in K}$ are bounded, and

$$|\alpha_{f,p}^{k+1}-\tilde\alpha_{f,p}^k|\to 0\quad\text{as }k\to\infty,\ k\in K,\eqno(4.31a)$$
$$|\alpha_{F,p}^{k+1}-\tilde\alpha_{F,p}^k|\to 0\quad\text{as }k\to\infty,\ k\in K,\eqno(4.31b)$$
$$|\alpha_p^{k+1}-\tilde\alpha_p^k|\to 0\quad\text{as }k\to\infty,\ k\in K.\eqno(4.32)$$

Proof. Suppose $x^k\xrightarrow{K}\bar x$ and let $B=\{y\in\mathbb R^N : |\bar x-y|\le 2\bar a\}$. Since $\bar a>0$ and $x^k\xrightarrow{K}\bar x$, we deduce from (4.1d,e) the existence of a number $\bar k$ such that $y_f^{k,i}\in B$ for $i=1,\ldots,M$ and all $k\ge\bar k$, $k\in K$. Then (4.1a,b) and the boundedness of $g_f$ on $B$ imply that $\{p_f^k\}_{k\in K}$ is bounded. In a similar way we deduce the boundedness of $\{p_F^k\}_{k\in K}$ from (4.2). Next,

$$\big|\,|f(x^{k+1})-f_p^{k+1}|-|f(x^k)-\tilde f_p^k|\,\big|\le|f(x^{k+1})-f_p^{k+1}-f(x^k)+\tilde f_p^k|\le$$
$$\le|f(x^{k+1})-f(x^k)|+|f_p^{k+1}-\tilde f_p^k|=$$
$$=|f(x^{k+1})-f(x^k)|+|\langle p_f^k,x^{k+1}-x^k\rangle|\le$$
$$\le|f(x^{k+1})-f(x^k)|+|p_f^k|\,|x^{k+1}-x^k|\xrightarrow{K}0,\eqno(4.33a)$$

since $x^k\xrightarrow{K}\bar x$, $|x^{k+1}-x^k|\xrightarrow{K}0$, $f$ is continuous and $\{p_f^k\}_{k\in K}$ is bounded. A similar argument yields

$$|F_p^{k+1}-\tilde F_p^k|=|\langle p_F^k,x^{k+1}-x^k\rangle|\le|p_F^k|\,|x^{k+1}-x^k|\xrightarrow{K}0.\eqno(4.33b)$$

Analogously to (3.4.37b), we obtain for large k ~ K

IYf[Sf
kl ) 2-Yf(Sf)l
~k ~ I
xk+l_xk I(2(yfC) ~/2 +yf L kl_xk I),

, k+l, , k ixk+l k 1/2 xk+l_xkl


I~F~SF ~-~F~SF) I~ -x l(2(YFC~ +~FI )

Combining this with our assumption that ixk+l_xkl K 0 and with (4.33),
we obtain (4.31) from (3.2b) and (3.8). By (4.23), (3.15) and (2.45),

k+l_~k k. k+l ~k . k, k+l ~k


l~p ep = [~f~af,p-~f,p~+VF<~F,p-~F,p) I

k+l ~k I k+l ~k K
max{ l~f,p-af,~, [~F,p -~ F,p I} + 0

from (4.31).

Using the above lemma, one can easily m o d i f y the proof of Lemma 3.
4.15 for A l g o r i t h m 3.1. Then, since the proof of Lemma 3.4.16 requires
no modifications, we obtain

Lemma 4.10. If xk K ~ ~ then (4.10) holds.

Lemma 4.5 and Lemma 4.10 yield

Theorem 4.11. Every a c c u m u l a t i o n point of a sequence {x k} generated by


Algorithm 3.1 is stationary for f on S.

In the convex case, the above result can be strengthened as follows.

Theorem 4.12. If f and F are convex and $F(\bar x)<0$ for some $\bar x\in R^N$, then Algorithm 3.1 generates a minimizing sequence $\{x^k\}$:
$$\{x^k\}\subset S \quad\text{and}\quad f(x^k)\to\inf\{f(x) : x\in S\}.$$
If additionally problem (1.1) admits a solution, then $\{x^k\}$ converges to a solution of problem (1.1).

Proof. One can use the proofs of Lemma 5.4.14, Theorem 5.4.15 and Theorem 5.4.16 to obtain the desired conclusion.

The following result can be established similarly to Lemma 3.4.19.



Corollary 4.13. If the level set $\{x\in S : f(x)\le f(x^1)\}$ is bounded and the final accuracy tolerance $\varepsilon_s$ is positive, then Algorithm 3.1 terminates in a finite number of iterations.

Remark 4.14. The results of this section hold for the case of many constraints considered in Remark 2.1 and Remark 2.2. This follows from the fact that the corresponding subdifferential mapping has essentially the same properties as the subdifferential $\partial F$, i.e. it is locally bounded and upper semicontinuous.

5. The Algorithm with Subgradient Selection

In this section we state in detail and analyze the method with subgradient selection introduced in Section 2.

Algorithm 5.1.

Step 0 (Initialization). Select the starting point $x^1\in S$ and a final accuracy tolerance $\varepsilon_s\ge0$. Choose fixed positive line search parameters $m_L$, $m_R$, $\bar a$ and $\bar t$, with $\bar t\le1$ and $m_L<m_R<1$, and distance measure parameters $\gamma_f>0$ and $\gamma_F>0$ ($\gamma_f=0$ if f is convex; $\gamma_F=0$ if F is convex). Set $y^1=x^1$, $s_1^1=0$ and
$$J_f^1=\{1\},\quad g_f^1=g_f(y^1),\quad f_1^1=f(y^1),$$
$$J_F^1=\{1\},\quad g_F^1=g_F(y^1),\quad F_1^1=F(y^1).$$
Set the counter k=1.

Step 1 (Direction finding). Find multipliers $\lambda_j^k$, $j\in J_f^k$, and $\mu_j^k$, $j\in J_F^k$, that solve the following k-th dual search direction finding subproblem
$$\begin{aligned}\underset{\lambda,\mu}{\text{minimize}}\quad&\tfrac12\Bigl|\sum_{j\in J_f^k}\lambda_jg_f^j+\sum_{j\in J_F^k}\mu_jg_F^j\Bigr|^2+\sum_{j\in J_f^k}\lambda_j\alpha_{f,j}^k+\sum_{j\in J_F^k}\mu_j\alpha_{F,j}^k\\ \text{subject to}\quad&\lambda_j\ge0,\ j\in J_f^k,\quad\mu_j\ge0,\ j\in J_F^k,\quad\sum_{j\in J_f^k}\lambda_j+\sum_{j\in J_F^k}\mu_j=1,\end{aligned}\eqno(5.1)$$
where
$$\alpha_{f,j}^k=\max\{|f(x^k)-f_j^k|,\ \gamma_f(s_j^k)^2\},\qquad\alpha_{F,j}^k=\max\{|F_j^k|,\ \gamma_F(s_j^k)^2\},\eqno(5.2)$$
and sets $\hat J_f^k$ and $\hat J_F^k$ satisfying
$$\hat J_f^k=\{j\in J_f^k : \lambda_j^k\ne0\}\quad\text{and}\quad\hat J_F^k=\{j\in J_F^k : \mu_j^k\ne0\},\eqno(5.3a)$$
$$|\hat J_f^k\cup\hat J_F^k|\le N+1.\eqno(5.3b)$$
Compute
$$p^k=\sum_{j\in J_f^k}\lambda_j^kg_f^j+\sum_{j\in J_F^k}\mu_j^kg_F^j,\eqno(5.4)$$
$$d^k=-p^k,\eqno(5.5)$$
$$\hat\alpha_p^k=\sum_{j\in J_f^k}\lambda_j^k\alpha_{f,j}^k+\sum_{j\in J_F^k}\mu_j^k\alpha_{F,j}^k,\eqno(5.6)$$
$$\hat v^k=-\{|p^k|^2+\hat\alpha_p^k\}.\eqno(5.7)$$

Step 2 (Stopping criterion). Set
$$\hat w^k=\tfrac12|p^k|^2+\hat\alpha_p^k.\eqno(5.8)$$
If $\hat w^k\le\varepsilon_s$ then terminate; otherwise, go to Step 3.

Step 3 (Line search). By a line search procedure as discussed below, find two stepsizes $t_L^k$ and $t_R^k$ such that $0\le t_L^k\le t_R^k$ and such that the two corresponding points defined by
$$x^{k+1}=x^k+t_L^kd^k\quad\text{and}\quad y^{k+1}=x^k+t_R^kd^k$$
satisfy $t_R^k\le1$ and
$$f(x^{k+1})\le f(x^k)+m_Lt_L^k\hat v^k,\eqno(5.9a)$$
$$F(x^{k+1})\le0,\eqno(5.9b)$$
$$t_R^k=t_L^k\quad\text{if }t_L^k\ge\bar t,\eqno(5.9c)$$
$$-\alpha(x^{k+1},y^{k+1})+\langle g(y^{k+1}),d^k\rangle\ge m_R\hat v^k\quad\text{if }t_L^k<\bar t,\eqno(5.9d)$$
$$|y^{k+1}-x^{k+1}|\le\bar a/2,\eqno(5.9e)$$
where the mappings $g(\cdot)$ and $\alpha(\cdot,\cdot)$ are defined by (3.12).

Step 4 (Subgradient updating). Set
$$J_f^{k+1}=\hat J_f^k\cup\{k+1\}\quad\text{and}\quad J_F^{k+1}=\hat J_F^k\cup\{k+1\}.\eqno(5.10)$$
Set $g_f^{k+1}=g_f(y^{k+1})$, $g_F^{k+1}=g_F(y^{k+1})$ and
$$f_{k+1}^{k+1}=f(y^{k+1})+\langle g_f^{k+1},x^{k+1}-y^{k+1}\rangle,$$
$$f_j^{k+1}=f_j^k+\langle g_f^j,x^{k+1}-x^k\rangle\quad\text{for }j\in\hat J_f^k,$$
$$F_{k+1}^{k+1}=F(y^{k+1})+\langle g_F^{k+1},x^{k+1}-y^{k+1}\rangle,$$
$$F_j^{k+1}=F_j^k+\langle g_F^j,x^{k+1}-x^k\rangle\quad\text{for }j\in\hat J_F^k,$$
$$s_{k+1}^{k+1}=|x^{k+1}-y^{k+1}|,\qquad s_j^{k+1}=s_j^k+|x^{k+1}-x^k|\quad\text{for }j\in\hat J_f^k\cup\hat J_F^k.$$

Step 5 (Distance resetting test). Set
$$a^{k+1}=\max\{s_j^{k+1} : j\in J_f^{k+1}\cup J_F^{k+1}\}.\eqno(5.11)$$
If $a^{k+1}\le\bar a$ then set $r_a^{k+1}=0$ and go to Step 7; otherwise, set $r_a^{k+1}=1$ and go to Step 6.

Step 6 (Distance resetting). Keep deleting from $J_f^{k+1}\cup J_F^{k+1}$ the smallest (i.e. oldest) indices until the reset value of $a^{k+1}$ satisfies
$$a^{k+1}=\max\{s_j^{k+1} : j\in J_f^{k+1}\cup J_F^{k+1}\}\le\bar a/2.$$

Step 7. Increase k by 1 and go to Step 1.
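Step 4's updating has a simple computational meaning: each stored value $f_j$ is a linearization value, so moving the center from $x^k$ to $x^{k+1}$ shifts it by an inner product, while each distance $s_j$ grows by the step length. The following Python sketch illustrates this (all names are ours, not the book's):

```python
# Shift stored linearization values and distance measures to a new
# iterate, in the spirit of Step 4:
#   f_j <- f_j + <g_j, x_new - x_old>,   s_j <- s_j + |x_new - x_old|.

def update_bundle(fvals, G, s, x_old, x_new):
    dx = [a - b for a, b in zip(x_new, x_old)]
    shift = sum(d * d for d in dx) ** 0.5
    fvals = [fj + sum(gi * di for gi, di in zip(g, dx))
             for fj, g in zip(fvals, G)]
    s = [sj + shift for sj in s]
    return fvals, s
```

The update is exact: the shifted value still equals the linearization $f(y^j)+\langle g_f^j,\cdot-y^j\rangle$ evaluated at the new point, which is why the recursion can replace re-evaluation.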

We shall now comment on relations between the above method and Algorithm 3.1.

By Lemma 2.2.1, subproblem (5.1) is the dual of the k-th (primal) search direction finding subproblem (2.14), which has the unique solution $(d^k,\hat v^k)$ and (possibly nonunique) Lagrange multipliers $\lambda_j^k$, $j\in J_f^k$, $\mu_j^k$, $j\in J_F^k$. We refer the reader to Remark 2.5.2 for a discussion of possible ways of finding the k-th Lagrange multipliers satisfying the requirement (5.3).
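Computationally, (5.1) is a small convex quadratic program over the unit simplex. The following Python sketch solves a problem of this shape by projected gradient; it is an illustration only (in practice a dedicated QP code would be used), and all names are ours:

```python
# Illustrative solver for a dual subproblem of the form (5.1):
#   minimize 0.5*|sum_j l_j*g[j]|^2 + sum_j l_j*alpha[j]
#   subject to l_j >= 0, sum_j l_j = 1,
# where g[j] are stored subgradients and alpha[j] locality measures.

def project_simplex(v):
    """Euclidean projection of v onto {l : l_j >= 0, sum_j l_j = 1}."""
    u = sorted(v, reverse=True)
    cum, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cum += ui
        t = (1.0 - cum) / i
        if ui + t > 0.0:
            theta = t
    return [max(vi + theta, 0.0) for vi in v]

def solve_dual(g, alpha, iters=1000, step=0.05):
    m, n = len(g), len(g[0])
    lam = [1.0 / m] * m
    for _ in range(iters):
        p = [sum(lam[j] * g[j][i] for j in range(m)) for i in range(n)]
        # gradient of the dual objective in lam:  G^T p + alpha
        grad = [sum(g[j][i] * p[i] for i in range(n)) + alpha[j]
                for j in range(m)]
        lam = project_simplex([lam[j] - step * grad[j] for j in range(m)])
    p = [sum(lam[j] * g[j][i] for j in range(m)) for i in range(n)]
    d = [-pi for pi in p]                                  # d = -p, cf. (5.5)
    alpha_hat = sum(lam[j] * alpha[j] for j in range(m))   # cf. (5.6)
    w_hat = 0.5 * sum(pi * pi for pi in p) + alpha_hat     # cf. (5.8)
    return lam, p, d, w_hat
```

The multipliers returned with $\lambda_j\ne0$ would then determine the selected sets of the form (5.3a).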
The stopping criterion of Step 2 can be interpreted similarly to the
termination rule of A l g o r i t h m 3.1. To see this, let

$$\nu_f^k=\sum_{j\in J_f^k}\lambda_j^k\quad\text{and}\quad\nu_F^k=\sum_{j\in J_F^k}\mu_j^k,$$
define the scaled multipliers $\tilde\lambda_j^k$ and $\tilde\mu_j^k$ satisfying
$$\lambda_j^k=\nu_f^k\tilde\lambda_j^k\ \text{ for }j\in J_f^k,\qquad\tilde\lambda_j^k\ge0,\ j\in J_f^k,\qquad\sum_{j\in J_f^k}\tilde\lambda_j^k=1,$$
$$\mu_j^k=\nu_F^k\tilde\mu_j^k\ \text{ for }j\in J_F^k,\qquad\tilde\mu_j^k\ge0,\ j\in J_F^k,\qquad\sum_{j\in J_F^k}\tilde\mu_j^k=1,$$
and let
$$(p_f^k,\tilde f_p^k,\tilde s_f^k)=\sum_{j\in J_f^k}\tilde\lambda_j^k(g_f^j,f_j^k,s_j^k),$$
$$(p_F^k,\tilde F_p^k,\tilde s_F^k)=\sum_{j\in J_F^k}\tilde\mu_j^k(g_F^j,F_j^k,s_j^k),$$
$$\tilde\alpha_{f,p}^k=\max\{|f(x^k)-\tilde f_p^k|,\ \gamma_f(\tilde s_f^k)^2\},$$
$$\tilde\alpha_{F,p}^k=\max\{|\tilde F_p^k|,\ \gamma_F(\tilde s_F^k)^2\},$$
$$\tilde\alpha_p^k=\nu_f^k\tilde\alpha_{f,p}^k+\nu_F^k\tilde\alpha_{F,p}^k,$$
$$w^k=\tfrac12|p^k|^2+\tilde\alpha_p^k,$$
$$v^k=-\{|p^k|^2+\tilde\alpha_p^k\}.$$
Then one may simply set
$$\lambda_p^k=\mu_p^k=0$$
in the relevant relations of the preceding sections to see that Lemma 4.6 holds for Algorithm 5.1. In particular, we have $w^k\le\hat w^k$. Thus both $w^k$ and $\hat w^k$ can be regarded as stationarity measures of $x^k$; see Section 3.
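The splitting above is easy to reproduce numerically: the joint multipliers are separated into group weights $\nu_f^k$, $\nu_F^k$ and scaled multipliers summing to one within each group, and the identity $p^k=\nu_f^kp_f^k+\nu_F^kp_F^k$ then holds by construction. A hedged Python illustration (made-up data, names ours):

```python
# Split joint multipliers (lam for f, mu for F) into group weights and
# scaled convex combinations:  p = nu_f*p_f + nu_F*p_F.

def split_aggregate(lam, mu, gf, gF):
    nu_f, nu_F = sum(lam), sum(mu)
    n = len(gf[0]) if gf else len(gF[0])
    # joint aggregate subgradient, cf. (5.4)
    p = [sum(l * g[i] for l, g in zip(lam, gf)) +
         sum(m * g[i] for m, g in zip(mu, gF)) for i in range(n)]
    # scaled (tilde) multipliers l/nu_f and m/nu_F sum to one per group
    p_f = ([sum((l / nu_f) * g[i] for l, g in zip(lam, gf))
            for i in range(n)] if nu_f > 0 else [0.0] * n)
    p_F = ([sum((m / nu_F) * g[i] for m, g in zip(mu, gF))
            for i in range(n)] if nu_F > 0 else [0.0] * n)
    return nu_f, nu_F, p, p_f, p_F
```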

The line search rules (5.9) differ from the rules (3.11) only in that we now use $\hat v^k$ instead of $v^k$ for estimating the directional derivative of f at $x^k$ in the direction $d^k$. Note that we always have $\hat v^k<0$ at Step 3, since $\hat v^k\le v^k<0$ by Lemma 4.6. Hence to implement Step 3 one can use Line Search Procedure 3.2 with $v^k$ replaced by $\hat v^k$.
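The serious/null-step dichotomy behind such a procedure can be sketched in one dimension. The following Python fragment is a simplification, not Procedure 3.2 itself: it omits the locality term $\alpha(\cdot,\cdot)$, the constraint test and several safeguards, and all names are ours:

```python
# Simplified serious/null-step line search: returns (tL, tR).
# tL = tR > 0 : serious step;  tL = 0 < tR : null step, where the trial
# point y = x + tR*d supplies a new subgradient.  Assumes v < 0.

def line_search(f, fprime, x, d, v, mL=0.1, mR=0.5, t_bar=0.01, zeta=0.5):
    tL, tU, t = 0.0, 1.0, 1.0
    for _ in range(60):
        if f(x + t * d) <= f(x) + mL * t * v:
            tL = t                       # sufficient-descent test passed
        else:
            tU = t
        if tL >= t_bar:
            return tL, tL                # serious step
        if tL == 0.0 and fprime(x + t * d) * d >= mR * v:
            return 0.0, t                # null step at the trial point
        t = tL + zeta * (tU - tL)
    return tL, t
```

For $f(x)=|x|$ from $x=2$ with $d=-1$, $v=-1$ the unit step gives descent, so a serious step is taken; from $x=0.2$ the unit step overshoots the kink and the procedure returns a null step instead.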


We also note that in Algorithm 5.1 the locality radius a k+l is
calculated directly via (5.11), instead of using the recursive formulae
of Algorithm 3.1.
We refer the reader to Remark 3.5.2 on the possible use of more
than N+3 subgradients for each search direction finding.
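The distance resetting of Steps 5-6 also admits a compact sketch. Since each $s_j$ only accumulates over time, the oldest (smallest) indices carry the largest distances, so deleting them shrinks the locality radius. A hedged Python illustration of this pruning (names ours; the target $\bar a/2$ is applied directly):

```python
# Distance resetting in the spirit of Steps 5-6: if the locality radius
# a = max_j s_j is too large, drop the oldest indices until a <= a_bar/2.
# s maps an index j to its accumulated distance s_j (nondecreasing with age).

def reset_locality(J, s, a_bar):
    J = sorted(J)
    while J and max(s[j] for j in J) > a_bar / 2.0:
        J.pop(0)                    # delete the smallest (oldest) index
    a = max(s[j] for j in J) if J else 0.0
    return J, a
```

Deleting by smallest index and deleting by largest $s_j$ coincide here precisely because the $s_j$ are nondecreasing in age.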
Let us now pass to convergence analysis. Global convergence of Algorithm 5.1 can be established by modifying the results of Section 4 in the spirit of Section 3.5 and Section 5.5. One easily checks that only Lemma 4.7, Lemma 4.8 and Lemma 4.9 need to be modified.

Lemma 4.7 should be replaced by the following results.

Lemma 5.2. Suppose that $t_L^{k-1}<\bar t$ and $r_a^k=0$ for some $k>1$. Then
$$\hat w^k\le\phi_C(\hat w^{k-1})+|\tilde\alpha_p^k-\hat\alpha_p^{k-1}|,$$
where $\phi_C$ is defined by
$$\phi_C(t)=t-(1-m_R)^2t^2/(8C^2),$$
C is any number satisfying
$$\max\{|p^{k-1}|,\ |g^k|,\ \hat\alpha_p^{k-1},\ 1\}\le C,$$
and
$$\tilde\alpha_p^k=\sum_{j\in\hat J_f^{k-1}}\lambda_j^{k-1}\alpha_{f,j}^k+\sum_{j\in\hat J_F^{k-1}}\mu_j^{k-1}\alpha_{F,j}^k.$$

Proof. In the proof of Lemma 4.7, replace (4.25a) by
$$\lambda_k(\nu)=\nu,\quad\lambda_j(\nu)=(1-\nu)\lambda_j^{k-1},\ j\in\hat J_f^{k-1},$$
$$\mu_k(\nu)=0,\quad\mu_j(\nu)=(1-\nu)\mu_j^{k-1},\ j\in\hat J_F^{k-1},$$
and (4.25b) by
$$\lambda_k(\nu)=0,\quad\lambda_j(\nu)=(1-\nu)\lambda_j^{k-1},\ j\in\hat J_f^{k-1},$$
$$\mu_k(\nu)=\nu,\quad\mu_j(\nu)=(1-\nu)\mu_j^{k-1},\ j\in\hat J_F^{k-1},$$
and obtain, similarly to (5.5.4), that
$$\sum_{j\in J_f^k}\lambda_j(\nu)g_f^j+\sum_{j\in J_F^k}\mu_j(\nu)g_F^j=(1-\nu)p^{k-1}+\nu g^k,$$
$$\sum_{j\in J_f^k}\lambda_j(\nu)\alpha_{f,j}^k+\sum_{j\in J_F^k}\mu_j(\nu)\alpha_{F,j}^k=(1-\nu)\tilde\alpha_p^k+\nu\alpha^k,$$
$$\lambda_j(\nu)\ge0,\ j\in J_f^k,\quad\mu_j(\nu)\ge0,\ j\in J_F^k,\quad\sum_{j\in J_f^k}\lambda_j(\nu)+\sum_{j\in J_F^k}\mu_j(\nu)=1$$
for each $\nu\in[0,1]$, since $J_f^k=\hat J_f^{k-1}\cup\{k\}$ and $J_F^k=\hat J_F^{k-1}\cup\{k\}$ if $r_a^k=0$. Then use the fact that $\hat w^k$ is the optimal value of (5.1) to complete the proof, as before. (See also the proof of Lemma 3.5.3.)

In Lemma 4.8 replace (4.38) and (4.30) by
$$\max\{|p^k|,\hat\alpha_p^k\}\le\max\{\tfrac12|g^k|^2+\alpha^k,\ (|g^k|^2+2\alpha^k)^{1/2}\},$$
$$\max\{|p^k|,|g^k|,\hat\alpha_p^k,1\}\le C\quad\text{if }|x^k-\bar x|\le a,$$
without influencing the proof, since $\hat w^k=\tfrac12|p^k|^2+\hat\alpha_p^k$.

Lemma 4.9 is replaced by

Lemma 5.3. Suppose that there exist a point $\bar x\in R^N$ and an infinite set $K\subset\{1,2,\dots\}$ such that $x^k\xrightarrow{K}\bar x$ and $|x^{k+1}-x^k|\xrightarrow{K}0$ as $k\to\infty$, $k\in K$. Then
$$|\alpha_p^{k+1}-\hat\alpha_p^k|\to0\quad\text{as }k\to\infty,\ k\in K.$$

Proof. As in the proof of Lemma 3.5.4, observe that
$$|\alpha_p^{k+1}-\hat\alpha_p^k|\le\max\bigl\{\max\{|\alpha_{f,j}^{k+1}-\alpha_{f,j}^k| : j\in\hat J_f^k\},\ \max\{|\alpha_{F,j}^{k+1}-\alpha_{F,j}^k| : j\in\hat J_F^k\}\bigr\},$$
and use estimates of the form (5.37) and
$$|F_j^{k+1}-F_j^k|=|\langle g_F^j,x^{k+1}-x^k\rangle|\le|g_F^j|\,|x^{k+1}-x^k|$$
together with the boundedness of $\{g_F(y^j)\}_{j\in\hat J_F^k}$ for $k\in K$.

We conclude that Algorithm 5.1 is globally convergent in the sense of Theorem 4.11, Theorem 4.12 and Corollary 4.13.

6. Modifications of the Methods

In this section we describe several modifications of the methods



discussed so far and analyze their convergence within the framework established in the preceding two sections.

We start by demonstrating global convergence of versions of the methods for convex problems presented in Chapter 5 that use general line search criteria allowing for arbitrarily small serious stepsizes. To this end, suppose that problem (1.1) is convex and consider the following modified line search rules for Algorithm 5.3.1 and Algorithm 5.5.1.

Step 3 (Line search). Find two stepsizes $t_L^k$ and $t_R^k$ such that $0\le t_L^k\le t_R^k$ and such that the two corresponding points defined by
$$x^{k+1}=x^k+t_L^kd^k\quad\text{and}\quad y^{k+1}=x^k+t_R^kd^k$$
satisfy
$$f(x^{k+1})\le f(x^k)+m_Lt_L^kv^k,\eqno(6.1a)$$
$$F(x^{k+1})\le0,\eqno(6.1b)$$
$$-\alpha(x^{k+1},y^{k+1})+\langle g(y^{k+1}),d^k\rangle\ge m_Rv^k\quad\text{if }t_L^k<\bar t,\eqno(6.1c)$$
$$|y^{k+1}-x^{k+1}|\le\bar a,\eqno(6.1d)$$
$$t_R^k\le\bar t,\eqno(6.1e)$$
where $\bar a$, $m_L$, $m_R$, $\bar t$ and $\zeta$ are fixed, positive line search parameters satisfying $m_L<m_R<1$ and $\zeta<1<\bar t$, and
$$\alpha(x,y)=\begin{cases}f(x)-f(y)-\langle g_f(y),x-y\rangle&\text{if }y\in S,\\ -F(y)-\langle g_F(y),x-y\rangle&\text{if }y\notin S.\end{cases}$$
Recalling that in Algorithm 5.3.1 and Algorithm 5.5.1 line searches are performed with $v^k<0$, we conclude that Line Search Procedure 3.2 can be used for finding stepsizes $t_L^k$ and $t_R^k$ satisfying (6.1).

We observe that, except for Step 3, Algorithm 5.3.1 is obtained from Algorithm 3.1 by deleting in the latter method Step 5 and Step 6, and setting $\gamma_f=\gamma_F=0$ and $r_a^k=0$ for all k. In other words, the constructions involving distance measures $s_j^k$ and resetting strategies are not necessary in the convex case.

We shall now establish global convergence of the above-described modifications of the methods of Chapter 5.

Theorem 6.1. Suppose that problem (1.1) is convex and satisfies the Slater constraint qualification, i.e. f and F are convex and $F(\bar x)<0$ for some $\bar x\in R^N$. Then Algorithm 5.3.1 with the modified line search rules (6.1) generates a sequence $\{x^k\}\subset S$ satisfying $f(x^k)\to\inf\{f(x) : x\in S\}$. Moreover, if f attains its infimum on S then $\{x^k\}$ converges to a solution of problem (1.1).

Proof. In view of the results of Section 5.4, it suffices to establish Lemma 4.10 for the modified algorithm. This can be done by exploiting the fact that the algorithm is a simplified version of Algorithm 3.1. Therefore, one readily checks that Lemma 4.6, analogues of Lemma 3.4.9 and Corollary 3.4.10, and Lemma 4.7 remain valid, and that Lemma 4.8 holds in virtue of the line search requirement (6.1d). Thus we only need to replace Lemma 4.9 by the following result: if $x^k\xrightarrow{K}\bar x$ and $|x^{k+1}-x^k|\xrightarrow{K}0$ then $|\alpha_p^{k+1}-\tilde\alpha_p^k|\xrightarrow{K}0$. To this end, observe that
$$\tilde\alpha_p^k=\nu_f^k\tilde\alpha_{f,p}^k+\nu_F^k\tilde\alpha_{F,p}^k=\nu_f^k[f(x^k)-\tilde f_p^k]-\nu_F^k\tilde F_p^k,$$
$$\alpha_p^{k+1}=\nu_f^k\alpha_{f,p}^{k+1}+\nu_F^k\alpha_{F,p}^{k+1}=\nu_f^k[f(x^{k+1})-f_p^{k+1}]-\nu_F^kF_p^{k+1}$$
by (5.4.3), so
$$\alpha_p^{k+1}-\tilde\alpha_p^k=\nu_f^k[f(x^{k+1})-f(x^k)]-\nu_f^k(f_p^{k+1}-\tilde f_p^k)-\nu_F^k(F_p^{k+1}-\tilde F_p^k)=$$
$$=\nu_f^k[f(x^{k+1})-f(x^k)]-\nu_f^k\langle p_f^k,x^{k+1}-x^k\rangle-\nu_F^k\langle p_F^k,x^{k+1}-x^k\rangle=$$
$$=\nu_f^k[f(x^{k+1})-f(x^k)]-\langle p^k,x^{k+1}-x^k\rangle.$$
Hence
$$|\alpha_p^{k+1}-\tilde\alpha_p^k|\le|f(x^{k+1})-f(x^k)|+|p^k|\,|x^{k+1}-x^k|,$$
since $\nu_f^k\in[0,1]$. Then, since $x^k\xrightarrow{K}\bar x$, $|x^{k+1}-x^k|\xrightarrow{K}0$ and $\{p^k\}_{k\in K}$ is bounded in view of Lemma 4.8, we have $f(x^{k+1})-f(x^k)\xrightarrow{K}0$ and $|p^k|\,|x^{k+1}-x^k|\xrightarrow{K}0$, so $|\alpha_p^{k+1}-\tilde\alpha_p^k|\xrightarrow{K}0$, as required. This result enables us to establish Lemma 3.4.15 for the modified method, and then we may prove Lemma 4.10 by using parts (i)-(iii) of the proof of Lemma 3.4.16.

We conclude from the above proof and the results of Section 5 that Theorem 6.1 is valid for Algorithm 5.5.1 with the modified line search criteria (6.1).

To sum up, we have shown that one may use the general line search criteria (6.1) in the methods for convex problems from Chapter 5 without impairing the global convergence results of Section 5.4 and Section 5.5.
Let us now consider versions of the methods that use the subgradient locality measures $\alpha_f(x^k,y^j)$ and $\alpha_F(x^k,y^j)$ instead of $\alpha_{f,j}^k$ and $\alpha_{F,j}^k$. First, suppose that in Algorithm 3.1 we solve at Step 1 the k-th search direction finding subproblem (2.34) (or its dual, which is of the form (3.1) with $\alpha_{f,j}^k$ and $\alpha_{F,j}^k$ replaced by $\alpha_f(x^k,y^j)$ and $\alpha_F(x^k,y^j)$, respectively), replace (3.5) by (2.35), and calculate $a^k$ by
$$a^k=\max\{|x^k-y^j| : j\in J_f^k\cup J_F^k\}$$
if $\lambda_p^k=\mu_p^k=0$, and by
$$a^{k+1}=\max\{|x^{k+1}-y^j| : j\in J_f^{k+1}\cup J_F^{k+1}\}$$
at Step 6. In effect, this version is obtained by substituting $s_j^k$ with $|x^k-y^j|$ everywhere in Algorithm 3.1. Making use of this observation and of the fact that
$$|x^{k+1}-y^j|\le|x^k-y^j|+|x^{k+1}-x^k|,$$
one can verify that the convergence analysis of Section 4 covers this version of Algorithm 3.1.

Reasoning as above, we conclude that if we replace $s_j^k$ by $|x^k-y^j|$ everywhere in Algorithm 5.1 then the resulting method is globally convergent in the sense of Theorem 4.11, Theorem 4.12 and Corollary 4.13. Moreover, this method has (primal) search direction finding subproblems of the form (2.33), which, as was shown in Section 2, reduce to the Mifflin (1982) subproblem (2.40) if the rules for choosing $J_f^k$ and $J_F^k$ satisfy the requirements (2.30). Therefore, this method may be regarded as an implementable and globally convergent version of the Mifflin (1982) algorithm. Further comparisons with the algorithm of Mifflin (1982) are given below.
For the sake of completeness of the theory, let us now consider a method that uses all the past subgradients for search direction finding at each iteration. The method with subgradient accumulation is obtained from Algorithm 5.1 by deleting Step 5 and Step 6 and setting
$$J_f^k=J_F^k=\{1,\dots,k\}$$
(or using (2.30) with $\hat J_f^k=J_f^k$ and $\hat J_F^k=J_F^k$ for all k). Thus the method has no resets and there is no need for selecting Lagrange multipliers of (2.14) to meet the requirement (5.3).

As far as convergence is concerned, we recall from Section 3.6 that methods with total subgradient accumulation, which have no subgradient deletion rules, require additional boundedness assumptions. In this context, consider the following assumption on sequences $\{x^k\}$ and $\{y^k\}$ generated by the above-described method with subgradient accumulation:
$$\{x^k\}\ \text{and}\ \{y^k\}\ \text{are bounded.}\eqno(6.2)$$
The above assumption is satisfied if
$$\text{the set }\{x\in S : f(x)\le f(x^1)\}\ \text{is bounded,}\eqno(6.3)$$
since then $\{x^k\}$ is contained in this set in view of the monotonicity of $\{f(x^k)\}$ and the feasibility of $\{x^k\}$, so $\{x^k\}$ is bounded, while $|y^k-x^k|\le\bar a/2$ for all k owing to the line search requirement (5.9e).

The results of Section 4 and Section 5 imply that the above-described method with subgradient accumulation is convergent in the sense of Theorem 4.11, Theorem 4.12 and Corollary 4.13 under the additional assumption (6.2). The same result holds if we replace $s_j^k$ by $|x^k-y^j|$ everywhere in this version of Algorithm 5.1.
As observed in Section 3.6, it may be efficient to calculate subgradients not only at $\{y^k\}$ but also at $\{x^k\}$, and then use such additional subgradients for each search direction finding. This idea can be easily incorporated in all the methods discussed so far in this chapter. For instance, in Algorithm 3.1 we may let
$$y^{-j}=x^j\quad\text{for }j=1,2,\dots,$$
$$g_f^{-j}=g_f(y^{-j})\quad\text{and}\quad g_F^{-j}=g_F(y^{-j})\quad\text{for }j=1,2,\dots,$$
$$f_j(x)=f(y^j)+\langle g_f^j,x-y^j\rangle\quad\text{for }j=\pm1,\pm2,\dots,\eqno(6.4)$$
$$F_j(x)=F(y^j)+\langle g_F^j,x-y^j\rangle\quad\text{for }j=\pm1,\pm2,\dots,$$
$$s_j^k=\begin{cases}|y^j-x^{|j|}|+\sum_{i=|j|}^{k-1}|x^{i+1}-x^i|&\text{if }|j|<k,\\ |y^j-x^k|&\text{if }|j|=k,\end{cases}$$
and substitute (3.13) with the following
$$J_f^{k+1}=\hat J_f^k\cup\{k+1,-(k+1)\},\eqno(6.5a)$$
$$J_F^{k+1}=\hat J_F^k\cup\{k+1,-(k+1)\}.\eqno(6.5b)$$

One may also use the following modification of (2.30)
$$J_f^{k+1}=\begin{cases}\hat J_f^k\cup\{k+1,-(k+1)\}&\text{if }y^{k+1}\in S,\\ \hat J_f^k&\text{if }y^{k+1}\notin S,\end{cases}\eqno(6.6a)$$
$$J_F^{k+1}=\begin{cases}\hat J_F^k&\text{if }y^{k+1}\in S,\\ \hat J_F^k\cup\{k+1,-(k+1)\}&\text{if }y^{k+1}\notin S.\end{cases}\eqno(6.6b)$$
Of course, if $x^{k+1}=y^{k+1}$, i.e. $y^{k+1}=y^{-(k+1)}$, then, for instance, (6.5a) should be replaced by $J_f^{k+1}=\hat J_f^k\cup\{k+1\}$.

It should be clear that the above-described use of additional subgradients does not influence the preceding convergence results, although it may lead to faster convergence in practice.

We want to add that one may use the Mifflin (1982) definition (2.36) of subgradient locality measures in all the methods described so far in this chapter. This will, for instance, involve replacing in Algorithm 3.1 relations (3.2) by (2.37), (3.8) by
$$\tilde\alpha_{f,p}^k=\max\{f(x^k)-\tilde f_p^k,\ \gamma_f(\tilde s_f^k)^2\},$$
$$\tilde\alpha_{F,p}^k=\max\{-\tilde F_p^k,\ \gamma_F(\tilde s_F^k)^2\},$$
and (3.12) by
$$g(y)=g_f(y)\ \text{and}\ \alpha(x,y)=\max\{f(x)-f(x;y),\ \gamma_f|x-y|^2\}\quad\text{if }y\in S,$$
$$g(y)=g_F(y)\ \text{and}\ \alpha(x,y)=\max\{-F(x;y),\ \gamma_F|x-y|^2\}\quad\text{if }y\notin S.$$
Such modifications do not impair the preceding convergence results.

We shall now show how to strengthen the existing results on convergence of the Mifflin (1982) algorithm. At the k-th iteration this algorithm finds $(d^k,\hat v^k)$ by solving subproblem (2.40), where $J^k=\{1,\dots,k\}$ and $\alpha(x^k,y^j)$ is defined by (2.36) and (2.39). In effect, we see that the algorithm uses the same direction finding subproblems as does the above-described method with subgradient accumulation, if in the latter method $J_f^k$ and $J_F^k$ are chosen by (2.30) with $\hat J_f^k=J_f^k$ and $\hat J_F^k=J_F^k$ for all k. The Mifflin (1982) line search requirements are the following
$$f(x^{k+1})\le f(x^k)+m_Lt_L^k\hat v^k,$$
$$F(x^{k+1})\le0,$$
$$-\alpha(x^{k+1},y^{k+1})+\langle g(y^{k+1}),d^k\rangle\ge m_R\hat v^k,\eqno(6.7)$$
$$t_L^k=t_R^k=0\quad\text{if }\langle g(x^k),d^k\rangle\ge m_R\hat v^k,$$
$$|y^{k+1}-x^{k+1}|\le\bar a.$$
It is easy to observe that any stepsizes $t_L^k$ and $t_R^k$ satisfying (6.7) automatically satisfy (5.9) if $\bar t=+\infty$. Moreover, Line Search Procedure 3.2 can be used for finding stepsizes $t_L^k$ and $t_R^k$ satisfying (6.7) if $\langle g_f(x^k),d^k\rangle<m_R\hat v^k$; otherwise one can set $t_L^k=t_R^k=0$.

To sum up, we have shown that the Mifflin (1982) algorithm can be regarded as a version of the method with subgradient accumulation described above. Therefore, by the preceding results, the Mifflin algorithm is convergent in the sense of Theorem 4.11 and Theorem 4.12 under the additional assumption (6.2) (or the stronger assumption (6.3)). Our result subsumes a result of Mifflin (1982), who proved that his algorithm has at least one stationary accumulation point under the additional assumption (6.2).

7. Methods with Subgradient Deletion Rules

The algorithms described in the preceding section were obtained by incorporating in the methods for convex problems of Chapter 5 the techniques for dealing with nonconvexity through the use of subgradient locality measures introduced in Chapter 3. In the unconstrained case considered in Chapter 4, we showed that one can also take nonconvexity into account by using suitable subgradient deletion rules for localizing the past subgradient information that determines the current polyhedral approximation to the objective function. Therefore, in this section we shall consider the use of subgradient deletion strategies for obtaining extensions of the methods of Chapter 5 to the nonconvex case, which differ from the algorithms described so far.

We start by remarking that in practice the performance of methods with subgradient locality measures may be sensitive with respect to values of the distance measure parameters $\gamma_f$ and $\gamma_F$; see Remark 4.2.1. For this reason, in Chapter 4 we studied methods that result from setting $\gamma_f=0$ even in the nonconvex case. To ensure convergence of such methods, we had to employ subgradient deletion rules based on estimating the degree of stationarity of the current iterate. Proceeding in the same spirit, we may set $\gamma_f=\gamma_F=0$ in Algorithm 3.1 and use a simple resetting strategy of Section 4.2 to obtain the following method, which may be regarded as a combination of Algorithm 4.3.1 and Algorithm 5.3.1.

Algorithm 7.1.

Step 0 (Initialization). Select the starting point $x^1\in S$ and a final accuracy tolerance $\varepsilon_s\ge0$. Choose fixed positive line search parameters $m_L$, $m_R$, $\bar a$, $\bar t$ and $\zeta$ with $\zeta<1$, $m_L<m_R<1$ and $\bar t\le1$. Set $M_{g,f}$ and $M_{g,F}$ equal to the fixed maximum numbers of subgradients of f and F, respectively, that the algorithm may use for each search direction finding; $M_{g,f}\ge2$ and $M_{g,F}\ge2$. Choose a predicted shift in x at the first iteration $s^1>0$ and set $\theta^1=1$. Select a positive reset tolerance $m_a$ and set the reset indicators $r_a^1=r_f^1=r_F^1=1$. Set $y^1=x^1$, $s_1^1=0$ and
$$J_f^1=\{1\},\quad g_f^1=p_f^0=g_f(y^1),\quad f_1^1=f_p^1=f(y^1),$$
$$J_F^1=\emptyset,\quad p_F^0=0\in R^N,\quad F_p^1=0.$$
Set the counters k=1, l=0 and k(0)=1.

Step 1 (Direction finding). Find multipliers $\lambda_j^k$, $j\in J_f^k$, $\lambda_p^k$, $\mu_j^k$, $j\in J_F^k$, and $\mu_p^k$ that solve the following k-th dual search direction finding subproblem
$$\begin{aligned}\underset{\lambda,\mu}{\text{minimize}}\quad&\tfrac12\Bigl|\sum_{j\in J_f^k}\lambda_jg_f^j+\lambda_pp_f^{k-1}+\sum_{j\in J_F^k}\mu_jg_F^j+\mu_pp_F^{k-1}\Bigr|^2+{}\\&{}+\sum_{j\in J_f^k}\lambda_j\alpha_{f,j}^k+\lambda_p\alpha_{f,p}^k+\sum_{j\in J_F^k}\mu_j\alpha_{F,j}^k+\mu_p\alpha_{F,p}^k,\\ \text{subject to}\quad&\lambda_j\ge0,\ j\in J_f^k,\quad\lambda_p\ge0,\quad\mu_j\ge0,\ j\in J_F^k,\quad\mu_p\ge0,\\&\sum_{j\in J_f^k}\lambda_j+\lambda_p+\sum_{j\in J_F^k}\mu_j+\mu_p=1,\end{aligned}\eqno(7.1)$$
where
$$\alpha_{f,j}^k=|f(x^k)-f_j^k|,\qquad\alpha_{F,j}^k=|F_j^k|,\eqno(7.2)$$
$$\alpha_{f,p}^k=|f(x^k)-f_p^k|,\qquad\alpha_{F,p}^k=|F_p^k|.\eqno(7.3)$$
Compute $\nu_f^k$ and $\nu_F^k$ by (3.3), and $\tilde\lambda_j^k$, $j\in J_f^k$, $\tilde\lambda_p^k$, $\tilde\mu_j^k$, $j\in J_F^k$, and $\tilde\mu_p^k$ by (3.4). Calculate
$$(p_f^k,\tilde f_p^k)=\sum_{j\in J_f^k}\tilde\lambda_j^k(g_f^j,f_j^k)+\tilde\lambda_p^k(p_f^{k-1},f_p^k),\eqno(7.4a)$$
$$(p_F^k,\tilde F_p^k)=\sum_{j\in J_F^k}\tilde\mu_j^k(g_F^j,F_j^k)+\tilde\mu_p^k(p_F^{k-1},F_p^k),\eqno(7.4b)$$
$$p^k=\nu_f^kp_f^k+\nu_F^kp_F^k,\eqno(7.5)$$
$$d^k=-p^k,\eqno(7.6)$$
$$\tilde\alpha_{f,p}^k=|f(x^k)-\tilde f_p^k|,\eqno(7.7a)$$
$$\tilde\alpha_{F,p}^k=|\tilde F_p^k|,\eqno(7.7b)$$
$$v^k=-\{|p^k|^2+\nu_f^k\tilde\alpha_{f,p}^k+\nu_F^k\tilde\alpha_{F,p}^k\}.\eqno(7.8)$$
If $r_f^k=r_F^k=1$ set
$$a^k=\max\{s_j^k : j\in J_f^k\cup J_F^k\}.\eqno(7.9)$$

Step 2 (Stopping criterion). If $\max\{|p^k|,\ m_aa^k\}\le\varepsilon_s$ then terminate. Otherwise, go to Step 3.

Step 3 (Resetting test). If $|p^k|\le m_aa^k$ then go to Step 4; otherwise, go to Step 5.

Step 4 (Resetting). (i) If $r_a^k=0$ then set $r_a^k=r_f^k=r_F^k=1$, replace $J_f^k$ and $J_F^k$ by $\{j\in J_f^k : j\ge k-M_{g,f}+2\}$ and $\{j\in J_F^k : j\ge k-M_{g,F}+2\}$, respectively, and go to Step 1.
(ii) If $|J_f^k\cup J_F^k|>1$ then delete the smallest number from $J_f^k$ or $J_F^k$ and go to Step 1.
(iii) Set $y^k=x^k$, $g_f^k=g_f(y^k)$, $f_k^k=f(y^k)$, $J_f^k=\{k\}$, $s_k^k=0$ and go to Step 1.

Step 5 (Line search). By a line search procedure as discussed below, find two stepsizes $t_L^k$ and $t_R^k$ such that $0\le t_L^k\le t_R^k$ and such that the two corresponding points defined by
$$x^{k+1}=x^k+t_L^kd^k\quad\text{and}\quad y^{k+1}=x^k+t_R^kd^k$$
satisfy $t_L^k\le1$ and
$$f(x^{k+1})\le f(x^k)+m_Lt_L^kv^k,\eqno(7.10a)$$
$$F(x^{k+1})\le0,\eqno(7.10b)$$
$$t_R^k=t_L^k\quad\text{if }t_L^k\ge\bar t,\eqno(7.10c)$$
$$-\alpha(x^{k+1},y^{k+1})+\langle g(y^{k+1}),d^k\rangle\ge m_Rv^k\quad\text{if }t_L^k<\bar t,\eqno(7.10d)$$
$$|y^{k+1}-x^{k+1}|\le\bar a,\eqno(7.10e)$$
$$|y^{k+1}-x^{k+1}|\le\theta^ks^k\quad\text{if }t_L^k=0,\eqno(7.10f)$$
$$|y^{k+1}-x^{k+1}|\le|x^{k+1}-x^k|\quad\text{if }t_L^k>0,\eqno(7.10g)$$
where
$$g(y)=g_f(y)\ \text{and}\ \alpha(x,y)=\alpha_f(x,y)\quad\text{if }y\in S,\eqno(7.11a)$$
$$g(y)=g_F(y)\ \text{and}\ \alpha(x,y)=\alpha_F(x,y)\quad\text{if }y\notin S,\eqno(7.11b)$$
$$\alpha_f(x,y)=|f(x)-f(y)-\langle g_f(y),x-y\rangle|,\eqno(7.12a)$$
$$\alpha_F(x,y)=|F(y)-\langle g_F(y),x-y\rangle|.\eqno(7.12b)$$

Step 6. If $t_L^k=0$ set $s^{k+1}=s^k$ and $\theta^{k+1}=\zeta\theta^k$. Otherwise, i.e. if $t_L^k>0$, set $s^{k+1}=|x^{k+1}-x^k|$, $\theta^{k+1}=1$, $k(l+1)=k+1$ and increase l by 1.
Step 7 (Subgradient updating). Select sets $\hat J_f^k$ and $\hat J_F^k$ satisfying
$$\hat J_f^k\subset J_f^k\quad\text{and}\quad|\hat J_f^k|\le M_{g,f}-2,\eqno(7.13a)$$
$$\hat J_F^k\subset J_F^k\quad\text{and}\quad|\hat J_F^k|\le M_{g,F}-2,\eqno(7.13b)$$
and set
$$J_f^{k+1}=\hat J_f^k\cup\{k+1\}\ \text{and}\ J_F^{k+1}=\hat J_F^k\quad\text{if }y^{k+1}\in S,\eqno(7.14a)$$
$$J_f^{k+1}=\hat J_f^k\ \text{and}\ J_F^{k+1}=\hat J_F^k\cup\{k+1\}\quad\text{if }y^{k+1}\notin S.\eqno(7.14b)$$
Set $g_f^{k+1}=g_f(y^{k+1})$ if $y^{k+1}\in S$, $g_F^{k+1}=g_F(y^{k+1})$ if $y^{k+1}\notin S$. Compute $f_j^{k+1}$, $j\in J_f^{k+1}$, $f_p^{k+1}$, $F_j^{k+1}$, $j\in J_F^{k+1}$, $F_p^{k+1}$, $s_{k+1}^{k+1}$ and $s_j^{k+1}$, $j\in\hat J_f^k\cup\hat J_F^k$, by (3.14).
Calculate
$$a^{k+1}=\max\{a^k+|x^{k+1}-x^k|,\ s_{k+1}^{k+1}\}.\eqno(7.15)$$
Set $r_a^{k+1}=0$ and
$$r_f^{k+1}=\begin{cases}1&\text{if }r_f^k=1\ \text{and}\ \nu_f^k=0,\\ 0&\text{if }r_f^k=0\ \text{or}\ \nu_f^k\ne0,\end{cases}\eqno(7.16a)$$
$$r_F^{k+1}=\begin{cases}1&\text{if }r_F^k=1\ \text{and}\ \nu_F^k=0,\\ 0&\text{if }r_F^k=0\ \text{or}\ \nu_F^k\ne0.\end{cases}\eqno(7.16b)$$

Step 8. Increase k by 1 and go to Step 1.

A few remarks on the algorithm are in order.

By Lemma 5.2.3, subproblem (7.1) is the dual of the following k-th (primal) search direction finding subproblem
$$\begin{aligned}\underset{(d,v)\in R^{N+1}}{\text{minimize}}\quad&\tfrac12|d|^2+v,\\ \text{subject to}\quad&-\alpha_{f,j}^k+\langle g_f^j,d\rangle\le v,\ j\in J_f^k,\\&-\alpha_{f,p}^k+\langle p_f^{k-1},d\rangle\le v\quad\text{if }r_f^k=0,\\&-\alpha_{F,j}^k+\langle g_F^j,d\rangle\le v,\ j\in J_F^k,\\&-\alpha_{F,p}^k+\langle p_F^{k-1},d\rangle\le v\quad\text{if }r_F^k=0.\end{aligned}\eqno(7.17)$$

Subproblems (7.1) and (7.17) are equivalent to subproblems (3.1) and (2.22), respectively, provided that $r_a^k=r_f^k=r_F^k$. However, in contrast with Algorithm 3.1, in Algorithm 7.1 we use two reset indicators $r_f^k$ and $r_F^k$, which need not have values equal to that of $r_a^k$. The condition $r_f^k=1$ ($r_F^k=1$) implies that the (k-1)-st aggregate subgradient of f (of F) is ignored at the k-th search direction finding. First, this is the case if a reset occurs at the k-th iteration, since then $r_a^k=r_f^k=r_F^k=1$ by the rules of Step 4 and Step 7. Secondly, we have $r_f^k=1$ ($r_F^k=1$) if after the latest reset, say at the $k_r(k)$-th iteration, no subgradient of f (of F) contributed to the search directions
$$d^j=-(\nu_f^jp_f^j+\nu_F^jp_F^j)\quad\text{for }j=k_r(k),\dots,k-1$$
in the sense that $\nu_f^j=0$ ($\nu_F^j=0$) for $j=k_r(k),\dots,k-1$ (cf. (7.16)). This can occur, for instance, if $J_f^j=\emptyset$ ($J_F^j=\emptyset$) for $j=k_r(k),\dots,k-1$ (cf. (7.13)-(7.14)). Then there is nothing to aggregate at such iterations with $J_f^j=\emptyset$ and $r_f^j=1$ ($J_F^j=\emptyset$ and $r_F^j=1$), so we must ignore the (k-1)-st aggregate subgradient of f (of F) by setting $r_f^k=1$ ($r_F^k=1$).

For example, suppose that the constraint function is inactive, i.e. $S=R^N$. Then, since $J_F^1=\emptyset$ and each $y^k$ is feasible, (7.13) and (7.14) yield $J_F^k=\emptyset$ for all k, and so, because
$$\nu_F^k=\sum_{j\in J_F^k}\mu_j^k+\mu_p^k\quad\text{and}\quad\mu_p^k=0\ \text{ if }r_F^k=1,$$
and $r_F^1=1$, we have $\nu_F^k=0$ and $r_F^k=1$ for all k by the rules of Step 4(i) and Step 7. (Note that we have implicitly assumed that $\sum_{j\in J_F^k}\mu_j^k=0$ if $J_F^k=\emptyset$. The same convention of ignoring summations over empty sets is employed in Algorithm 7.1.) Moreover, we have $\nu_f^k=1-\nu_F^k=1$, so that $r_f^k=0$ if $r_a^k=0$, for all k. Then it is easy to deduce that Algorithm 7.1 reduces to Algorithm 4.3.1, except for the rule of updating $a^k$ via (7.9) if $r_a^k=1$, while Algorithm 4.3.1 uses (7.9) if $\lambda_p^k=0$.
Algorithm 7.1 is closely related to Algorithm 5.3.1 for convex constrained problems. First, both algorithms use the same subgradient aggregation rules, cf. (7.4) and (5.2.45). Secondly, their search direction finding subproblems coincide in the convex case if $r_f^k=r_F^k=0$. To see this, recall from Lemma 5.4.1 and Lemma 5.4.2 that in the convex case the subgradient aggregation rules ensure that $f(x^k)-f_j^k\ge0$, $f(x^k)-f_p^k\ge0$, $-F_j^k\ge0$ and $-F_p^k\ge0$ for all k, hence (7.2)-(7.3) become
$$\alpha_{f,j}^k=f(x^k)-f_j^k\quad\text{and}\quad\alpha_{f,p}^k=f(x^k)-f_p^k,$$
$$\alpha_{F,j}^k=-F_j^k\quad\text{and}\quad\alpha_{F,p}^k=-F_p^k,$$
corresponding to relations (5.4a,b). Therefore, if $r_f^k=r_F^k=0$, in this case subproblem (7.1) coincides with subproblem (5.4.12), which is equivalent to the k-th search direction finding subproblem (5.3.1) of Algorithm 5.3.1 by Lemma 5.4.10.
To discuss the resetting strategy of the algorithm, we shall need the following result on convex representations of aggregate subgradients, which is an analogue of Lemma 4.4.1 and Lemma 4.1. Let
$$J^k=J_f^k\cup J_F^k\quad\text{for all }k,\eqno(7.18)$$
and observe that by (7.13)-(7.14) we have
$$J_f^k\cap J_F^k=\emptyset\quad\text{for all }k.\eqno(7.19)$$

Lemma 7.2. Suppose $k\ge1$ is such that Algorithm 7.1 did not stop before the k-th iteration, and let
$$k_f(k)=\max\{j : j\le k\ \text{and}\ r_f^j=1\},\eqno(7.20a)$$
$$\hat J_{f,p}^k=J_f^{k_f(k)}\cup\{j : k_f(k)<j\le k\ \text{and}\ y^j\in S\},\eqno(7.20b)$$
$$k_F(k)=\max\{j : j\le k\ \text{and}\ r_F^j=1\},\eqno(7.20c)$$
$$\hat J_{F,p}^k=J_F^{k_F(k)}\cup\{j : k_F(k)<j\le k\ \text{and}\ y^j\notin S\},\eqno(7.20d)$$
$$k_p(k)=\max\{j : j\le k\ \text{and}\ r_f^j=r_F^j=1\},\eqno(7.21a)$$
$$\hat J_p^k=J^{k_p(k)}\cup\{j : k_p(k)<j\le k\},\eqno(7.21b)$$
$$k_r(k)=\max\{j : j\le k\ \text{and}\ r_a^j=1\},\eqno(7.21c)$$
$$\hat J_r^k=J^{k_r(k)}\cup\{j : k_r(k)<j\le k\},\eqno(7.21d)$$
$$\bar J^k=\{j : k_r(k)-M_g\le j\le k\},\eqno(7.22)$$
where $M_g=M_{g,f}+M_{g,F}$. Let M=N+2. Then
$$a^k=\max\{s_j^k : j\in\hat J_r^k\},\eqno(7.23)$$
$$\hat J_p^k=\hat J_r^k\subset\bar J^k,\eqno(7.24)$$
$$\hat J_{f,p}^k\ne\emptyset\quad\text{if }\nu_f^k\ne0,\eqno(7.25a)$$
$$\hat J_{F,p}^k\ne\emptyset\quad\text{if }\nu_F^k\ne0.\eqno(7.25b)$$

If $\hat J_{f,p}^k\ne\emptyset$ then there exist numbers $\hat\lambda_i^k$ and (N+2)-vectors $(y_f^{k,i},f^{k,i},s_f^{k,i})$, $i=1,\dots,M$, satisfying
$$(p_f^k,\tilde f_p^k)=\sum_{i=1}^M\hat\lambda_i^k\bigl(g_f(y_f^{k,i}),f^{k,i}\bigr),$$
$$\hat\lambda_i^k\ge0,\ i=1,\dots,M,\qquad\sum_{i=1}^M\hat\lambda_i^k=1,$$
$$\bigl(g_f(y_f^{k,i}),f^{k,i},s_f^{k,i}\bigr)\in\{(g_f^j,f_j^k,s_j^k) : j\in\hat J_{f,p}^k\},\ i=1,\dots,M,\eqno(7.26)$$
$$|y_f^{k,i}-x^k|\le s_f^{k,i},\ i=1,\dots,M,$$
$$\max\{s_f^{k,i} : i=1,\dots,M\}\le a^k.$$
If $\hat J_{F,p}^k\ne\emptyset$ then there exist numbers $\hat\mu_i^k$ and (N+2)-vectors $(y_F^{k,i},F^{k,i},s_F^{k,i})$, $i=1,\dots,M$, satisfying
$$(p_F^k,\tilde F_p^k)=\sum_{i=1}^M\hat\mu_i^k\bigl(g_F(y_F^{k,i}),F^{k,i}\bigr),$$
$$\hat\mu_i^k\ge0,\ i=1,\dots,M,\qquad\sum_{i=1}^M\hat\mu_i^k=1,$$
$$\bigl(g_F(y_F^{k,i}),F^{k,i},s_F^{k,i}\bigr)\in\{(g_F^j,F_j^k,s_j^k) : j\in\hat J_{F,p}^k\},\ i=1,\dots,M,\eqno(7.27)$$
$$|y_F^{k,i}-x^k|\le s_F^{k,i},\ i=1,\dots,M,$$
$$\max\{s_F^{k,i} : i=1,\dots,M\}\le a^k.$$
Moreover, we have (4.3a) if f is convex, and (4.3b) if F is convex and $y^j\notin S$ for some $j\le k$.

Proof. Since either $\nu_f^j$ or $\nu_F^j$ must be positive, because they form a convex combination, the algorithm's rules imply that $r_f^j=r_F^j=1$ only if $r_a^j=1$; otherwise either $r_f^j=0$ or $r_F^j=0$. Hence $k_p(k)=k_r(k)$ and $\hat J_p^k=\hat J_r^k$. Moreover, $k_f(k)\ge k_p(k)$ and $k_F(k)\ge k_p(k)$, so $\hat J_{f,p}^k\cup\hat J_{F,p}^k\subset\hat J_p^k$. Suppose that $\hat J_{f,p}^k=\emptyset$. Then $J_f^i=\emptyset$ and $r_f^i=1$ for $i=k_f(k),\dots,k$, hence $J_f^k=\emptyset$, $r_f^k=1$, $\lambda_p^k=0$ and
$$\nu_f^k=\sum_{j\in J_f^k}\lambda_j^k+\lambda_p^k=0.$$
This proves (7.25a). An analogous argument yields (7.25b). (7.23) can be established as in the proof of Lemma 3.4.1. (7.24) follows from the rules of Step 4(i), (7.13)-(7.14) and the above-derived relation $\hat J_p^k=\hat J_r^k$, as in the proof of Lemma 4.4.1. The representations (7.26) and (7.27) follow from the subgradient aggregation rules (see the proofs of Lemma 3.4.1 and Lemma 3.4.3), (7.23) and the fact that $\hat J_{f,p}^k\cup\hat J_{F,p}^k\subset\hat J_p^k$. The proof of the assertion concerning the convex case is similar to the proof of Lemma 5.4.2, since the aggregate subgradients are always convex combinations of the past subgradients, even if $\hat J_{f,p}^k=\hat J_{F,p}^k=\emptyset$.

The stopping criterion of Step 2 admits the following interpretation. For any $x\in R^N$ and $\varepsilon\ge0$, define the following outer approximation to $M(x)$:
$$M(x;\varepsilon)=\mathrm{conv}\{M(y) : |y-x|\le\varepsilon\}.\eqno(7.28)$$
In view of (2.2), $M(x;\varepsilon)$ may be regarded as a generalization of the Goldstein $\varepsilon$-subdifferential
$$\partial f(x;\varepsilon)=\mathrm{conv}\{\partial f(y) : |y-x|\le\varepsilon\}$$
to the constrained case. Observe that at each iteration only one subgradient of the form
$$g^k=g(y^k)=\begin{cases}g_f(y^k)\in M(y^k)&\text{if }y^k\in S,\\ g_F(y^k)\in M(y^k)&\text{if }y^k\notin S,\end{cases}\eqno(7.29)$$
is added to the set of subgradients that are aggregated at subsequent iterations. We deduce from Lemma 7.2 and (7.29) that
$$p_f^k\in\mathrm{conv}\{M(y^j) : |y^j-x^k|\le a^k\}\quad\text{if }\hat J_{f,p}^k\ne\emptyset,\eqno(7.30a)$$
$$p_F^k\in\mathrm{conv}\{M(y^j) : |y^j-x^k|\le a^k\}\quad\text{if }\hat J_{F,p}^k\ne\emptyset.\eqno(7.30b)$$
Therefore, since the algorithm's rules and (7.25) yield
$$p^k=\nu_f^kp_f^k+\nu_F^kp_F^k,\qquad\nu_f^k\ge0,\ \nu_F^k\ge0,\ \nu_f^k+\nu_F^k=1,\eqno(7.31a)$$
$$\nu_f^k=0\quad\text{if }\hat J_{f,p}^k=\emptyset,\eqno(7.31b)$$
$$\nu_F^k=0\quad\text{if }\hat J_{F,p}^k=\emptyset,\eqno(7.31c)$$
we obtain from (7.30) and (7.28) the following analogue of (4.2.30)
$$p^k\in M(x^k;a^k).\eqno(7.32)$$
By (7.32), the algorithm stops at Step 2 when
$$p^k\in M(x^k;\varepsilon_s/m_a),\qquad|p^k|\le\varepsilon_s\qquad\text{and}\qquad x^k\in S,\eqno(7.33)$$
i.e. when $x^k$ is approximately stationary for f on S.
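The test underlying (7.33) is inexpensive once $p^k$ and $a^k$ are at hand; a hedged Python sketch (names ours):

```python
# Approximate stationarity test in the spirit of Step 2 / (7.33):
# terminate when max(|p|, m_a * a) <= eps, i.e. p is a short vector
# belonging to M(x; a) with a small locality radius a.

def approx_stationary(p, a, m_a, eps):
    norm_p = sum(pi * pi for pi in p) ** 0.5
    return max(norm_p, m_a * a) <= eps
```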

The resetting strategy of the algorithm, which is related to (7.32), is a direct extension of the strategy of Algorithm 4.3.1. We may add that it is possible to use other, more efficient strategies similar to those of Section 4.6. We shall return to this subject later on.

The line search rules of Step 5 are a direct juxtaposition of the rules of Algorithm 4.3.1 and Algorithm 3.1, cf. (4.3.7)-(4.3.10), (3.11)-(3.12) and (7.10)-(7.12). Therefore we may refer the reader to Section 3 and Section 4.3 for the motivation of such rules, and of the rules of Step 6.

The following extension of Line Search Procedure 4.3.2 can be used for finding stepsizes $t_L=t_L^k$ and $t_R=t_R^k$ satisfying the requirements of Step 5.

Line Search Procedure 7.3.

(i) Set $t_L=0$ and $t=t_U=\min\{1,\bar a/|d^k|\}$.

(ii) If $f(x^k+td^k)\le f(x^k)+m_Ltv^k$ and $F(x^k+td^k)\le0$ set $t_L=t$; otherwise set $t_U=t$.

(iii) If $t_L\ge\bar t$ set $t_R=t_L$ and return.

(iv) If $-\alpha(x^k+t_Ld^k,x^k+td^k)+\langle g(x^k+td^k),d^k\rangle\ge m_Rv^k$ and either $t_L=0$ and $t|d^k|\le\theta^ks^k$ or $t-t_L\le t_L$, then set $t_R=t$ and return.

(v) Set $t=t_L+\zeta(t_U-t_L)$ and go to (ii).

The following result can be established similarly to Lemma 3.3.

Lemma 7.4. If f and F are semismooth in the sense of (3.3.23) and (3.18) then Line Search Procedure 7.3 terminates with $t_L=t_L^k$ and $t_R=t_R^k$ satisfying (7.10).

The requirement (7.13) may be substituted by the following
$$\hat J_f^k\subset J_f^k\quad\text{and}\quad\hat J_F^k\subset J_F^k,\eqno(7.34a)$$
$$|\hat J_f^k\cup\hat J_F^k|\le M_g-2,\eqno(7.34b)$$
where $M_g\ge2$ is a fixed, user-supplied upper bound on the number of stored subgradients. In view of (7.18) and (7.19), the simplest way of satisfying (7.34) is to delete some smallest numbers from $J^k=J_f^k\cup J_F^k$ so as to obtain $|\hat J^k|\le M_g-2$ with $\hat J_f^k\cup\hat J_F^k=\hat J^k$. In fact, as far as convergence is concerned, the requirement (7.34a) can be substituted by the following more general rule
$$\hat J_f^k\subset\hat J_{f,p}^k\quad\text{and}\quad\hat J_F^k\subset\hat J_{F,p}^k,$$
i.e. any subgradient used since the latest reset can be stored, cf. (7.17a,b,c).
Observe that (7.10e), (7.14) and the rules of Step 4 yield the following analogue of (4.3.23) and (3.21b)

k ∈ Ĵ_f^k,  g^k = g_f(y^k)  and  |y^k − x^k| ≤ a  if y^k ∈ S, (7.35a)

k ∈ Ĵ_F^k,  g^k = g_F(y^k)  and  |y^k − x^k| ≤ a  if y^k ∉ S. (7.35b)

Thus the latest subgradient is always used for the current search direction finding.
We shall now establish convergence of the algorithm. To save space we shall use suitable modifications of the results of Section 4.4 and Section 4.

We suppose that the final accuracy tolerance ε_s is set to zero and that each execution of Line Search Procedure 7.3 is finite (see Lemma 7.4, Remark 3.3.4 and Remark 3.4).

First, we observe that Lemma 7.2 can serve as a substitute for Lemma 4.1, Lemma 4.4.1, Lemma 4.4.2 and Lemma 4.4.4. Secondly, since Lemma 4.3 holds in view of (7.33), the assumption that ε_s = 0 and the definition of stationary points, we may assume that the method generates an infinite sequence of points. Then (4.4.9)-(4.4.11) are easily verified, and we conclude that {f(x^k)} is nonincreasing. Thirdly, we note that part (i) of Lemma 4.2 can be replaced by Lemma 4.4.5, and part (ii) by the following result.

Lemma 7.5. Suppose that a point x̄ ∈ R^N, N-vectors p̄_F, ȳ_F^i and ḡ_F^i, and numbers λ̄_i, F̄_p, F̄_i and s̄_F^i, i = 1,...,M, satisfy

(p̄_F, F̄_p) = Σ_{i=1}^M λ̄_i (ḡ_F^i, F̄_i),

λ̄_i ≥ 0, i = 1,...,M,  Σ_{i=1}^M λ̄_i = 1,

ḡ_F^i ∈ ∂F(ȳ_F^i), i = 1,...,M,

F̄_i = F(ȳ_F^i) + ⟨ḡ_F^i, x̄ − ȳ_F^i⟩, i = 1,...,M,

s̄_F^i ≥ |x̄ − ȳ_F^i|, i = 1,...,M,

max{s̄_F^i : λ̄_i ≠ 0} = 0,

F(ȳ_F^i) ≥ 0, i = 1,...,M.

Then p̄_F ∈ ∂F(x̄) and F̄_p = F(x̄) ≥ 0, so that p̄_F ∈ M(x̄).

Proof. Set ȳ_F = Σ_{i=1}^M λ̄_i ȳ_F^i and use part (i) of the proof of Lemma 4.2. □

Lemma 4.4.6 and Lemma 4.4 are replaced by the following

Lemma 7.6. Suppose that there exist a point x̄ ∈ R^N and an infinite set K ⊂ {1,2,...} such that x^k →_K x̄ and a^k →_K 0. Then there exist an infinite set K̄ ⊂ K, N-vectors p, p_f and p_F, and numbers ν̄_f and ν̄_F such that

p^k →_K̄ p,

p = ν̄_f p_f + ν̄_F p_F,

ν̄_f ≥ 0,  ν̄_F ≥ 0,  ν̄_f + ν̄_F = 1, (7.36)

p_f ∈ ∂f(x̄),  p_F ∈ ∂F(x̄),

F(x̄) ≥ 0  if ν̄_F ≠ 0.

Moreover, p ∈ M(x̄) and ν_f^k ᾱ_{f,p}^k + ν_F^k ᾱ_{F,p}^k →_K̄ 0.

Proof. By (7.31), Ĵ_{f,p}^k ∪ Ĵ_{F,p}^k ≠ ∅ for all k, so at least one of the following two sets

K_f = {k ∈ K : Ĵ_{f,p}^k ≠ ∅}  and  K_F = {k ∈ K : Ĵ_{F,p}^k ≠ ∅}

is infinite. Suppose that K_F is finite. Then we have ν_F^k = 0 and (7.26) for all large k ∈ K_f, hence we may use (7.31a), (7.6) and (7.7a) to deduce, as in the proof of Lemma 3.4.6, (7.36) with ν̄_f = 1, ν̄_F = 0 and ν_f^k ᾱ_{f,p}^k →_K̄ 0. A similar argument based on (7.27) and Lemma 7.5 yields (7.36) with ν̄_f = 0, ν̄_F = 1 and ν_F^k ᾱ_{F,p}^k →_K̄ 0 if K_f is finite. In view of the preceding two results, and the fact that K = K_f ∪ K_F, it remains to consider the case of an infinite set K̄ = K_f ∩ K_F. Then (7.26) and (7.27) hold for all k ∈ K̄, so the desired conclusion can be deduced from (7.31), (7.6)-(7.7), Lemma 4.4.5 and Lemma 7.5. (7.36) implies p ∈ M(x̄) in view of (2.2). □

Define the stationarity measure

w^k = ½|p^k|² + ᾱ_p^k, (7.37a)

where

ᾱ_p^k = ν_f^k ᾱ_{f,p}^k + ν_F^k ᾱ_{F,p}^k, (7.37b)

at the k-th iteration (at Step 5) of the algorithm, for all k. We have the following analogue of Lemma 4.4.7.

Lemma 7.7. (i) Suppose that for some point x̄ ∈ R^N we have

liminf_{k→∞} max{w^k, |x̄ − x^k|} = 0, (7.38)

or equivalently

there exists an infinite set K ⊂ {1,2,...} such that x^k →_K x̄ and w^k →_K 0. (7.39)

Then 0 ∈ M(x̄) and F(x̄) ≤ 0.

(ii) Relations (7.38) and (7.39) are equivalent to the following

liminf_{k→∞} max{|p^k|, |x̄ − x^k|} = 0.

Proof. Use the proof of Lemma 4.4.7, replacing the reference to Lemma 4.4.5 by the one to Lemma 7.6, and observe that any accumulation point of {x^k} ⊂ S must be feasible, because S = {x ∈ R^N : F(x) ≤ 0} is closed. □

Let w̃^k denote the optimal value of the k-th dual search direction finding subproblem (7.1), for all k. Then it is easy to verify that Lemma 4.6 holds for Algorithm 7.1. This result replaces Lemma 4.4.8. Also it is straightforward to check that Lemma 4.4.9 and Corollary 4.4.10 are true for Algorithm 7.1. Next, we may use (7.35) to establish Lemma 4.7 and Lemma 4.8, thus replacing Lemma 4.4.11 and Lemma 4.4.12.

One can prove Lemma 4.9 for Algorithm 7.1 as follows. If x^k →_K x̄, then {|p^k|}_{k∈K} is bounded in view of Lemma 4.8, so {a^k}_{k∈K} is bounded, because we have a^k ≤ |p^k|/m_a at Step 5, for all k. Then one can use the representations (7.26) and (7.27) as in the proof of Lemma 7.6 to obtain the desired conclusion from relations of the form (4.33). This result substitutes for Lemma 4.4.13.

It is easy to check that the proofs of Lemma 4.4.14 through Lemma 4.4.18 require no modifications. Thus we have obtained the following result.

Theorem 7.8. Algorithm 7.1 is globally convergent in the sense of Theorem 4.11, Theorem 4.12 and Corollary 4.13.

Let us pass to the method with subgradient selection. To save space, we give a shortened description.

Algorithm 7.9.

Step 0 (Initialization). Do Step 0 of Algorithm 7.1. Set M_g ≥ N + 2 equal to the fixed maximum number of subgradients of f and F that the algorithm may use for each search direction finding. Set J¹ = J_f¹ ∪ J_F¹.

Step 1 (Direction finding). Do Step 1 of Algorithm 5.1, setting γ_f = γ_F = 0 in (5.2). Set a^k = max{s_j^k : j ∈ Ĵ_f^k ∪ Ĵ_F^k}.

Step 2 (Stopping criterion). Do Step 2 of Algorithm 7.1.

Step 3 (Resetting test). Do Step 3 of Algorithm 7.1.

Step 4 (Resetting). (i) Replace J^k by {j ∈ J^k : j ≥ k − M_g + 1}, and then J_f^k and J_F^k by {j ∈ J^k : y^j ∈ S} and {j ∈ J^k : y^j ∉ S}, respectively. Set r_a^k = 1.

(ii) If |J^k| > 1, then delete the smallest number from J_f^k or J_F^k, set J^k = J_f^k ∪ J_F^k and go to Step 1.

(iii) Set y^k = x^k, g_f^k = g_f(y^k), f_k^k = f(y^k), s_k^k = 0, J^k = J_f^k = {k}, J_F^k = ∅ and go to Step 1.

Step 5 (Line search). Do Step 5 of Algorithm 7.1, replacing v^k by v̂^k in (7.10).

Step 6. Do Step 6 of Algorithm 7.1.

Step 7 (Subgradient updating). Do Step 7 of Algorithm 7.1, ignoring (7.13) and (7.15)-(7.16). Set J^{k+1} = J_f^{k+1} ∪ J_F^{k+1}.

Step 8. Increase k by 1 and go to Step 1.

The above method is a combination of Algorithm 4.5.1 and Algorithm 5.5.1. Note that the method's subgradient deletion rules are less complicated than those of Algorithm 7.1, since the method does not update the aggregate subgradients.

We may add that one can replace v^k by v̂^k in Line Search Procedure 7.3 for executing Step 5 of the method. Lemma 7.4 remains valid, since we have v̂^k < 0 at Step 5.
We have the following convergence result.

Theorem 7.10. Algorithm 7.9 is globally convergent in the sense of Theorem 4.11, Theorem 4.12 and Corollary 4.13.

Proof. Replacing (7.20) and (7.21a,b) by

Ĵ_{f,r}^k = Ĵ_f^k  and  Ĵ_{F,r}^k = Ĵ_F^k  for all k,

p̄_p^k = p̄_r^k  for all k,

we obtain an analogue of Lemma 7.2. Then it is easy, albeit tedious, to obtain the desired conclusion by modifying the preceding results of this section in the spirit of Section 5. This task is left to the reader. □

Let us discuss modified resetting strategies for methods with subgradient deletion rules. In Section 4.6 one can find detailed motivation behind the use of such strategies in the unconstrained case. Most of those remarks apply also to the constrained case. Thus we want to decrease the frequency of resettings, since too frequent discarding of the aggregate subgradients leads to a loss of the accumulated past subgradient information, which can result in slow convergence.

The resetting strategy of Algorithm 7.1 and Algorithm 7.9 is similar to that of Algorithm 4.3.1. A reset occurs at the k-th iteration if |p^k| ≤ m_a a^k, i.e. when the length |d^k| = |p^k| of the current search direction becomes much shorter than the value of the locality radius a^k, which estimates the radius of the ball around x^k from which the past subgradient information was aggregated to form d^k (see (7.32)). To reduce the number of resets, in Algorithm 4.6.1 we used aggregate distance measures s̃_p^k and resetting tests of the form |p^k| ≤ m_a s̃_p^k, instead of |p^k| ≤ m_a a^k. An extension of this strategy to the constrained case is given in the following method.

Algorithm 7.11.

Step 0 (Initialization). Do Step 0 of Algorithm 7.1. Set s_f¹ = s_F¹ = 0.

Step 1 (Direction finding). Do Step 1 of Algorithm 7.1, setting

(p_f^k, f̃_p^k, s̃_f^k) = Σ_{j∈Ĵ_f^k} λ_j^k (g_f^j, f_j^k, s_j^k) + λ_p^k (p_f^{k−1}, f_p^k, s_f^k),
(7.40)
(p_F^k, F̃_p^k, s̃_F^k) = Σ_{j∈Ĵ_F^k} μ_j^k (g_F^j, F_j^k, s_j^k) + μ_p^k (p_F^{k−1}, F_p^k, s_F^k)

instead of using (7.4). Set

s̃_p^k = ν_f^k s̃_f^k + ν_F^k s̃_F^k. (7.41)

Step 2 (Stopping criterion). If max{|p^k|, m_a s̃_p^k} ≤ ε_s then terminate. Otherwise, go to Step 3.

Step 3 (Resetting test). If |p^k| ≤ m_a s̃_p^k then go to Step 4; otherwise, go to Step 5.

Step 4 (Resetting). Do Step 4 of Algorithm 7.1.

Step 5 (Line search). Do Step 5 of Algorithm 7.1.

Step 6. Do Step 6 of Algorithm 7.1.

Step 7 (Subgradient updating). Do Step 7 of Algorithm 7.1. Set

s_f^{k+1} = s̃_f^k + |x^{k+1} − x^k|,
s_F^{k+1} = s̃_F^k + |x^{k+1} − x^k|.

Step 8 (Distance resetting test). If a^{k+1} ≤ ā then go to Step 10. Otherwise, set r_a^{k+1} = r_f^{k+1} = r_F^{k+1} = 1 and go to Step 9.

Step 9 (Distance resetting). Keep deleting from J_f^{k+1} and J_F^{k+1} indices with the smallest values until the reset value of a^{k+1} satisfies

a^{k+1} = max{s_j^{k+1} : j ∈ J_f^{k+1} ∪ J_F^{k+1}} ≤ ā/2.

Set J^{k+1} = J_f^{k+1} ∪ J_F^{k+1}.

Step 10. Increase k by 1 and go to Step 1.
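The scalar tests of Step 2 and Step 3 of Algorithm 7.11, together with the aggregate distance combination (7.41), can be sketched directly. This is an illustrative fragment only; the helper names are ours, and only the final scalar comparisons are shown.

```python
def aggregate_distance(nu_f, s_f, nu_F, s_F):
    # (7.41): convex combination of the two aggregate distance measures
    return nu_f * s_f + nu_F * s_F

def step2_step3(p_norm, s_p, m_a, eps_s):
    """Stopping criterion (Step 2) and resetting test (Step 3), sketched.

    p_norm -- |p^k|, s_p -- aggregate distance measure s~_p^k,
    m_a    -- resetting parameter, eps_s -- final accuracy tolerance.
    Returns (stop, reset); when both are False the method proceeds to
    the line search of Step 5.
    """
    stop = max(p_norm, m_a * s_p) <= eps_s
    reset = (not stop) and p_norm <= m_a * s_p
    return stop, reset
```

Note that at Step 5 one therefore always has |p^k| > m_a s̃_p^k, a fact used in the convergence proof below.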



Observe that the above method uses the subgradient aggregation rules of Algorithm 3.1, cf. (3.5) and (7.40). Also the distance resetting strategy of Step 8 and Step 9, which ensures locally uniform boundedness of accumulated subgradients, is similar to the corresponding strategy of Algorithm 3.1. At the same time, the method uses the aggregate distance measure s̃_p^k instead of the locality radius a^k in the stopping criterion and the resetting test of Algorithm 7.1. In effect, Algorithm 7.11 is related to Algorithm 7.1 and Algorithm 3.1 in the same way as was Algorithm 4.6.1 to Algorithm 4.3.1 and Algorithm 3.3.1. This observation facilitates establishing convergence of the method, since one may reason as in the proof of Theorem 4.6.2, where convergence of Algorithm 4.6.1 was deduced from the results concerning Algorithm 4.3.1 and Algorithm 3.3.1. We follow a similar path below.

To obtain a better insight into properties of the aggregate distance measures s̃_p^k, we present the following analogue of Lemma 4.1 and Lemma 7.2. Its simple proof is left to the reader.

Lemma 7.12. Lemma 7.2 holds for Algorithm 7.11 if we replace (7.23) by

max{s_j^k : j ∈ J^k} = a^k ≤ ā (7.42)

and augment (7.26) and (7.27) by

(p_f^k, f̃_p^k, s̃_f^k) = Σ_{i=1}^M λ̂_i^k (g(y_f^{k,i}), f^{k,i}, s_f^{k,i}) (7.43)

and

(p_F^k, F̃_p^k, s̃_F^k) = Σ_{i=1}^M μ̂_i^k (g(y_F^{k,i}), F^{k,i}, s_F^{k,i}), (7.44)

respectively.

We conclude from (7.41), (7.31) and Lemma 7.12 that s̃_p^k is always a convex combination of the aggregate distance measures s̃_f^k and s̃_F^k, which in turn indicate how far p_f^k and p_F^k are from M(x^k). Thus

(p^k, s̃_p^k) = ν_f^k (p_f^k, s̃_f^k) + ν_F^k (p_F^k, s̃_F^k),

ν_f^k ≥ 0,  ν_F^k ≥ 0,  ν_f^k + ν_F^k = 1,

hence the value of s̃_p^k indicates how far p^k is from M(x^k). This justifies the stopping criterion of Step 2. In fact, by using Lemma 7.12, Lemma 4.2 and the proof of Lemma 4.3 one can show that if ε_s = 0 then the algorithm stops only at stationary points.

Supposing the method does not terminate, we have the following result.

Theorem 7.13. Algorithm 7.11 is globally convergent in the sense of Theorem 4.11, Theorem 4.12 and Corollary 4.13.

Proof. We give only an outline of the proof, which is similar to the proof of Theorem 6.2. First, we observe that we always have |p^k| > m_a s̃_p^k at Step 5. Secondly, we deduce from (7.41), (7.31) and Lemma 7.12 that

s̃_p^k = s̃_f^k ≤ a^k  if Ĵ_{F,p}^k = ∅,

s̃_p^k = s̃_F^k ≤ a^k  if Ĵ_{f,p}^k = ∅, (7.45)

s̃_p^k ≤ max{s̃_f^k, s̃_F^k} ≤ a^k  if Ĵ_{f,p}^k ≠ ∅ and Ĵ_{F,p}^k ≠ ∅,

and that s̃_p^k →_K 0 if and only if ν_f^k s̃_f^k →_K 0 and ν_F^k s̃_F^k →_K 0. Using the latter property and the representations of the aggregate subgradients of Lemma 7.12 as in the proof of Lemma 7.6, we obtain from Lemma 4.2 that x^k →_K x̄ and s̃_p^k →_K 0 imply the existence of an infinite K̄ ⊂ K such that p^k →_K̄ p ∈ M(x̄). This result replaces Lemma 7.6, and enables us to establish both Lemma 4.7, by using the fact that s̃_p^k ≤ |p^k|/m_a for all k, and an analogue of Lemma 4.4.16, in which the relations b^k = max{|p^k|, m_a a^k} and 0 ∈ ∂f(x̄) are replaced by b^k = max{|p^k|, m_a s̃_p^k} and 0 ∈ M(x̄). From the analysis of Algorithm 7.1 we may derive analogues of Lemma 4.4.8 through Lemma 4.4.15. Then, since s̃_p^k ≤ a^k by (7.45), one may establish Lemma 4.4.18 as in the proof of Theorem 6.2. □

We may add that one can modify Algorithm 7.9 in the spirit of Algorithm 7.11 without impairing the preceding global convergence results. Namely, in Algorithm 7.9 we may use the stopping criterion

max{|p^k|, m_a s̃_p^k} ≤ ε_s

and the resetting test

|p^k| ≤ m_a s̃_p^k,

with s̃_p^k generated by (7.40) with λ_p^k = μ_p^k = 0, and replace Step 8 by Step 8, Step 9 and Step 10 of Algorithm 7.11. To establish Theorem 7.10 for the resulting method with subgradient selection, one may use the proof of Theorem 7.13.

The preceding algorithms of this section can be modified by using the resetting strategies of Wolfe and Mifflin described in Section 4.6.

We shall not dwell on this subject, since, as explained in Section 4.6, the resulting algorithms are convergent only in the sense that they have at least one stationary accumulation point under additional boundedness assumptions. More precisely, it is straightforward to establish analogues of Theorem 4.6.3 and Theorem 4.6.4 in the constrained case, formulating them with

S_f = {x ∈ S : f(x) ≤ f(x¹)}.

We shall now present a convergent modification of Algorithm 7.11 with subgradient deletion rules that do not require a repetition of search direction finding whenever a subgradient is dropped. As discussed in Section 4.6, such rules decrease the work involved in additional quadratic programming calculations.

Algorithm 7.14.

Step 0 (Initialization). Do Step 0 of Algorithm 7.1. Choose a positive δ¹. Set s_f¹ = s_F¹ = 0.

Step 1 (Direction finding). Do Step 1 of Algorithm 7.1. Set

s̃_f^k = Σ_{j∈Ĵ_f^k} λ_j^k s_j^k + λ_p^k s_f^k  and  s̃_F^k = Σ_{j∈Ĵ_F^k} μ_j^k s_j^k + μ_p^k s_F^k,

s̃_p^k = ν_f^k s̃_f^k + ν_F^k s̃_F^k.

Step 2 (Stopping criterion). If max{|p^k|, m_a s̃_p^k} ≤ ε_s then terminate. Otherwise, go to Step 3.

Step 3 (Resetting test). Set w̄^k = max{|p^k|, m_a s̃_p^k}. If w̄^k < θδ^k set δ^{k+1} = w̄^k; otherwise set δ^{k+1} = δ^k. If |p^k| ≤ m_a s̃_p^k then go to Step 4; otherwise, go to Step 5.

Step 4 (Resetting). Replace J_f^k by {j ∈ J_f^k : s_j^k < θδ^{k+1}/m_a} and J_F^k by {j ∈ J_F^k : s_j^k < θδ^{k+1}/m_a}. If |J_f^k ∪ J_F^k| < 1, set y^k = x^k, g_f^k = g_f(x^k), f_k^k = f(x^k), s_k^k = 0, J_f^k = {k} and J_F^k = ∅. Set r_a^k = r_f^k = r_F^k = 1 and go to Step 1.

Step 5 (Line search). Do Step 5 of Algorithm 7.1, replacing (7.10e-g) by

|y^{k+1} − x^{k+1}| ≤ θδ^{k+1}/m_a. (7.46)

Step 6 (Subgradient updating). Do Step 7 of Algorithm 7.1. Set

s_f^{k+1} = s̃_f^k + |x^{k+1} − x^k|  and  s_F^{k+1} = s̃_F^k + |x^{k+1} − x^k|.

Step 7 (Distance resetting test). If a^{k+1} ≤ ā and δ^{k+1} = δ^k, go to Step 9; otherwise go to Step 8.

Step 8 (Distance resetting). Replace J_f^{k+1} by {j ∈ J_f^{k+1} : s_j^{k+1} ≤ θδ^{k+1}/m_a} and J_F^{k+1} by {j ∈ J_F^{k+1} : s_j^{k+1} ≤ θδ^{k+1}/m_a}. If a^{k+1} > θδ^{k+1}/m_a then set r_a^{k+1} = r_f^{k+1} = r_F^{k+1} = 1 and a^{k+1} = max{s_j^{k+1} : j ∈ J_f^{k+1} ∪ J_F^{k+1}}.

Step 9. Increase k by 1 and go to Step 1.

We note that Line Search Procedure 3.2 (with ā = θδ^{k+1}/m_a) can be used for executing Step 5 of the above method.

Algorithm 7.14 is globally convergent in the sense of Theorem 4.11, Theorem 4.12 and Corollary 4.13. To verify this claim, the reader may use the preceding results of this section and the arguments in the proof of Theorem 4.6.5.

We may add that the algorithms described so far may use additional subgradients of f calculated at points {x^k}; see Remark 4.6.7. This may speed up convergence.

8. Methods That Neglect Linearization Errors

In this section we shall consider simplified versions of the methods discussed in Section 7 that are obtained by neglecting linearization errors at each search direction finding. The resulting dual search direction finding subproblems have a special structure, which enables one to use the efficient and numerically stable Wolfe (1976) quadratic programming algorithm. In effect, such methods may require less work per iteration, but may converge more slowly than the previously discussed algorithms (Lemarechal, 1982). We may add that similar methods were proposed in (Mifflin, 1977b; Polak, Mayne and Wardi, 1983).
Let us, therefore, consider the following modification of Algorithm 7.1. We set

α_{f,j}^k = 0, j ∈ J_f^k,  α_{f,p}^k = 0,  α_{F,j}^k = 0, j ∈ J_F^k,  α_{F,p}^k = 0

in the k-th search direction finding subproblem (7.1). Then the k-th line search is performed with

v^k = −|p^k|²,

α(x,y) = 0 for all x and y,

instead of using the previous definitions (7.8) and (7.12). Then v^k < 0 in Step 5, so that Line Search Procedure 7.3 can be used as before, with Lemma 7.4 remaining true.

It is easy to verify that the above modification does not impair Theorem 4.11 and Corollary 4.13. However, Theorem 4.12 may not hold, for the reasons discussed in Section 4.7.
To obtain stronger convergence properties in the convex case, one may use in the above-described version of Algorithm 7.1 the additional resetting test

|p^k| ≤ m_e ᾱ_p^k,

i.e. the algorithm should go to Step 4 if either |p^k| ≤ m_a a^k or |p^k| ≤ m_e ᾱ_p^k, where

ᾱ_p^k = ν_f^k ᾱ_{f,p}^k + ν_F^k ᾱ_{F,p}^k

and m_e > 0 is a scaling parameter. We refer the reader to Section 4.7 for the motivation of such additional resetting tests. Moreover, one may use the proof of Theorem 4.7.1 and the results of Section 7 to show that the resulting method is globally convergent in the sense of Theorems 4.11 and 4.12 and Corollary 4.13. All these results are easily extended to the corresponding version with subgradient selection.

9. Phase I - Phase II Methods

The algorithms described in the preceding sections require a feasible starting point. In this section we shall discuss phase I - phase II methods which may use infeasible starting points. These methods are extensions of the algorithms of Section 5.7 to the nonconvex case, hence we refer the reader to Section 5.7 for their motivation and derivation.

Throughout this section we suppose that one can calculate f(x) and g_f(x) ∈ ∂f(x) at each x ∈ R^N. Also for simplicity we assume that F and g_F can be evaluated everywhere; see Remark 2.3 and Section 7 for a discussion of how to relax this assumption.

We recall that the phase I - phase II algorithms of Section 5.7 were obtained by modifying the definitions of linearization errors and the line search rules of the feasible point methods of Sections 5.2-5.6. Introducing similar modifications in Algorithm 3.1, we obtain the following phase I - phase II method.

Algorithm 9.1.

Step 0 (Initialization). Select a starting point x¹ ∈ R^N and initialize the method according to the rules of Step 0 of Algorithm 3.1.

Step 1 (Direction finding). Find multipliers λ_j^k, j ∈ J_f^k, λ_p^k, μ_j^k, j ∈ J_F^k, and μ_p^k that

minimize  ½ | Σ_{j∈J_f^k} λ_j g^j + λ_p p_f^{k−1} + Σ_{j∈J_F^k} μ_j g^j + μ_p p_F^{k−1} |² +

+ Σ_{j∈J_f^k} λ_j [α_{f,j}^k + F(x^k)_+] + λ_p [α_{f,p}^k + F(x^k)_+] + Σ_{j∈J_F^k} μ_j α_{F,j}^k + μ_p α_{F,p}^k

subject to  λ_j ≥ 0, j ∈ J_f^k,  λ_p ≥ 0,  μ_j ≥ 0, j ∈ J_F^k,  μ_p ≥ 0, (9.1)

Σ_{j∈J_f^k} λ_j + λ_p + Σ_{j∈J_F^k} μ_j + μ_p = 1,

λ_p = μ_p = 0  if r_a^k = 1,

where

α_{f,j}^k = max{|f(x^k) − f_j^k|, γ_f (s_j^k)²},  α_{F,j}^k = max{|F(x^k)_+ − F_j^k|, γ_F (s_j^k)²},
(9.2)
α_{f,p}^k = max{|f(x^k) − f_p^k|, γ_f (s_f^k)²},  α_{F,p}^k = max{|F(x^k)_+ − F_p^k|, γ_F (s_F^k)²}.

Compute (λ^k, μ^k), a^k, (p_f^k, f̃_p^k, s̃_f^k), (p_F^k, F̃_p^k, s̃_F^k), p^k and d^k as in Step 1 of Algorithm 3.1. Set

ᾱ_{f,p}^k = max{|f(x^k) − f̃_p^k|, γ_f (s̃_f^k)²},  ᾱ_{F,p}^k = max{|F(x^k)_+ − F̃_p^k|, γ_F (s̃_F^k)²}, (9.3a)

v^k = −{|p^k|² + ᾱ_p^k}. (9.4)

Step 2 (Stopping criterion). Set

w^k = ½|p^k|² + ᾱ_p^k. (9.5)

If w^k ≤ ε_s terminate; otherwise, continue.

Step 3 (Line search). If F(x^k) ≤ 0, find two stepsizes t_L^k and t_R^k satisfying the requirements of Step 3 of Algorithm 3.1. Otherwise, i.e. if F(x^k) > 0, find two stepsizes t_L^k and t_R^k such that 0 ≤ t_L^k ≤ t_R^k and such that x^{k+1} = x^k + t_L^k d^k and y^{k+1} = x^k + t_R^k d^k satisfy

F(x^{k+1}) ≤ F(x^k) + m_L t_L^k v^k, (9.6a)

t_R^k = t_L^k  if t_L^k ≥ t̄, (9.6b)

−α_F(x^{k+1}, y^{k+1}) + ⟨g_F(y^{k+1}), d^k⟩ ≥ m_R v^k  if t_L^k < t̄, (9.6c)

|y^{k+1} − x^{k+1}| ≤ a/2, (9.6d)

where

α_F(x,y) = max{|F(x)_+ − F̄(x;y)|, γ_F |x−y|²}. (9.7)

Step 4 (Subgradient updating). Do Step 4 of Algorithm 3.1.

Step 5 (Distance resetting test). Do Step 5 of Algorithm 3.1.

Step 6 (Distance resetting). Do Step 6 of Algorithm 3.1.

Step 7. Increase k by 1 and go to Step 1.
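The locality measures (9.2) and the stationarity measure (9.5) are simple pointwise formulas, and can be sketched as follows; the function names are ours, introduced only for this illustration.

```python
def loc_measure_f(f_xk, f_j, s_j, gamma_f):
    # alpha_{f,j}^k = max{ |f(x^k) - f_j^k|, gamma_f (s_j^k)^2 }   cf. (9.2)
    return max(abs(f_xk - f_j), gamma_f * s_j ** 2)

def loc_measure_F(F_xk_plus, F_j, s_j, gamma_F):
    # alpha_{F,j}^k = max{ |F(x^k)_+ - F_j^k|, gamma_F (s_j^k)^2 } cf. (9.2)
    return max(abs(F_xk_plus - F_j), gamma_F * s_j ** 2)

def stationarity_measure(p, alpha_p):
    # w^k = (1/2)|p^k|^2 + aggregated locality measure   cf. (9.5)
    return 0.5 * sum(pi * pi for pi in p) + alpha_p
```

The max with the squared distance term γ(s_j)² is what forces subgradients collected far from x^k to carry a large penalty in the subproblem (9.1), even when their linearization error happens to be small, which matters in the nonconvex case.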

A few comments on the algorithm are in order.

The values of α_f(x,y) + F(x)_+ and α_F(x,y) indicate how much the subgradients g_f(y) ∈ ∂f(y) and g_F(y) ∈ ∂F(y) differ from being elements of M(x) (see (2.2)), respectively. In effect, the value of the stationarity measure w^k indicates how much x^k differs from being a stationary point for f on S. Note that the value of w^k may be small even if x^k is infeasible, since ν_f^k may be close to zero. In this case x^k is close to being a stationary point for F, since ν_F^k = 1 − ν_f^k ≈ 1, p^k = ν_f^k p_f^k + ν_F^k p_F^k ≈ p_F^k, and w^k ≈ ½|p_F^k|² + ᾱ_{F,p}^k, i.e. 0 is close to ∂F(x^k) and x^k ∉ S. To exclude such cases, phase I - phase II methods typically require the Cottle constraint qualification, i.e. 0 ∉ ∂F(x) for all x ∉ S.

It is easy to see that at phase II, i.e. when x^k ∈ S, the method reduces to Algorithm 3.1 and maintains feasibility of successive iterates. Thus only the case of an infinitely long phase I (x^k ∉ S for all k) is of interest here.

When F(x^k) > 0, one may apply Line Search Procedure 3.3.1 to F in order to find stepsizes t_L^k and t_R^k satisfying (9.6). By Lemma 3.3.3, this procedure will terminate in a finite number of iterations if F has the semismoothness property (3.3.23). In fact, this procedure may be stopped whenever it finds any feasible point, since then phase II will begin at the next iteration of the method.

We shall now establish convergence of the method, assuming that ε_s = 0.

Lemma 9.2. If Algorithm 9.1 terminates at the k-th iteration, then the point x̄ = x^k satisfies 0 ∈ M(x̄). If additionally F(x̄) ≤ 0 or

0 ∉ ∂F(x) for all x ∉ S, (9.8)

then x̄ ∈ S and x̄ is stationary for f on S.

Proof. Use (9.3) in the proof of Lemma 4.3 for replacing (4.9) by

p_f^k ∈ ∂f(x̄) and F(x̄)_+ = 0  if ν̄_f ≠ 0,
p_F^k ∈ ∂F(x̄) and F(x̄) ≥ 0  if ν̄_F ≠ 0

to deduce that 0 ∈ M(x̄). By (2.2), (9.8) implies x̄ ∈ S if 0 ∈ M(x̄). □

In view of the above result, we shall assume from now on that the method calculates an infinite sequence {x^k}. Of course, phase II of the method is covered by the results of Section 4. Therefore we need only consider the case when the method stays at phase I.

Theorem 9.3. Suppose that Algorithm 9.1 generates an infinite sequence {x^k} such that F(x^k) > 0 for all k. Then every accumulation point x̄ of {x^k} satisfies 0 ∈ M(x̄). Moreover, x̄ ∈ S and x̄ is stationary for f on S if (9.8) holds.

Proof. To save space, we shall only indicate how to modify the results of Section 4 for Algorithm 9.1.

(i) Proceeding as in the proof of Lemma 9.2, use (9.3) in the proof of Lemma 4.5 to obtain the desired conclusion if (4.11) holds.

(ii) In view of (9.3)-(9.5), one may express ᾱ_p^k in the formulation and the proof of Lemma 4.6 as follows

ᾱ_p^k = ν_f^k [ᾱ_{f,p}^k + F(x^k)_+] + ν_F^k ᾱ_{F,p}^k. (9.9)

(iii) By assumption, F(x^k) > 0 for all k, hence, by the algorithm's rules, we have (4.24) with g^k = g_F(y^k) and α^k = α_F(x^k, y^k) if t_L^k < t̄ (cf. (9.6c)). Therefore one may use (9.3) and (9.9) to establish Lemma 4.7 for Algorithm 9.1 with

ᾱ_p^k = ᾱ_{f,p}^k + F(x^k)_+ + ν_F^{k−1} ᾱ_{F,p}^k. (9.10)

(iv) It is easy to establish Lemma 4.8 for Algorithm 9.1 by defining α(x,y) = α_f(x,y) + F(x)_+ if y ∈ S, α(x,y) = α_F(x,y) if y ∉ S, and setting α^k = α(x^k, y^k) for all k.

(v) In the proof of Lemma 4.9 for Algorithm 9.1, replace (4.33b) by a relation similar to (4.33a) (with f substituted by F), and use (9.3) and (9.10) together with the assumption that F(x^k)_+ = F(x^k) > 0 for all k to show that s̃_p^{k+1} → 0.

(vi) Combining the above results as in Section 4, we see that Lemma 4.10 holds for Algorithm 9.1, so we have (4.10) and (4.11), and the desired conclusion follows from part (i) above. □

Reasoning as in Section 4, one may deduce from the above proof the following result.

Corollary 9.4. Suppose that F(x¹) > 0, the set {x ∈ R^N : F(x) ≤ F(x¹)} is bounded and ε_s > 0. Then Algorithm 9.1 will either terminate at phase I or switch to phase II at some iteration.

We conclude from the above results that if the method terminates at a significantly infeasible point x^k, then F is likely to have a stationary point x̄ with 0 ∈ ∂F(x̄) and F(x̄) > 0. This will happen if F has a positive minimum, i.e. no feasible point exists.

We end this section by remarking that if we neglected linearization errors, i.e. set α_{f,j}^k = α_{f,p}^k = α_{F,j}^k = α_{F,p}^k = α_p^k = 0 and γ_f = γ_F = 0, then the method would become similar to a conceptual algorithm proposed in (Polak, Mayne and Wardi, 1983).
CHAPTER 7

Bundle Methods

1. Introduction

The methods for nonsmooth minimization discussed in the preceding chapters belong to the class of algorithms proposed by Lemarechal (1978a) and extended by Mifflin (1982). In Chapters 4 and 6 we showed that by neglecting linearization errors one obtains simplified versions of these methods which are in the class of algorithms introduced by Lemarechal (1975) and Wolfe (1975), and extended by Mifflin (1977b) and Polak, Mayne and Wardi (1983). This chapter is devoted to bundle methods, which form the third remaining class of algorithms. These methods were proposed by Lemarechal (1976, 1978b) in the unconstrained convex case, and extended by Lemarechal, Strodiot and Bihain (1981) to nonconvex problems with linear constraints.

A computational advantage of bundle methods over the algorithms discussed so far is that their search direction finding subproblems may be solved by an efficient quadratic programming algorithm of Mifflin (1978) which exploits the structure of these subproblems, while up till now no special-purpose quadratic programming algorithm has been developed for the subproblems of Chapter 2. Moreover, preliminary computational experience (Lemarechal, 1982; Strodiot, Nguyen and Heukemes, 1983) indicates that bundle methods are promising. However, so far no global convergence of such methods seems to have been established, and the analysis of (Lemarechal, Strodiot and Bihain, 1981; Strodiot, Nguyen and Heukemes, 1983) only shows that bundle methods can find an approximate solution in a finite number of iterations. Also, extensions of bundle methods to nonlinearly constrained problems have not been considered in the literature.

In this chapter we shall present new versions of bundle methods for convex and nonconvex problems, both unconstrained and constrained ones. Owing to the use of subgradient selection and aggregation techniques, the methods have flexible storage requirements and work per iteration which can be controlled by the user. Our rules for regulating the approximation tolerances of the methods, which are different from those in (Lemarechal, Strodiot and Bihain, 1981), enable us to establish global convergence of the methods under no additional assumptions on the problem functions. We also give line search procedures that are finite procedures for functions having the semismoothness properties (3.3.23) and (6.3.18), which are weaker than those required in (Lemarechal, Strodiot and Bihain, 1981); see (Lemarechal, 1981). In effect, we establish theoretical results on these versions of bundle methods that are comparable to the ones obtained for other algorithms in the preceding chapters.

We start, in Section 2, by deriving bundle methods for convex unconstrained minimization. A method with subgradient aggregation is described in detail in Section 3, and its convergence is established in Section 4. Section 5 discusses a method with subgradient selection and its convergence. Useful modifications of the methods are described in Section 6. Then we extend the methods to the nonconvex unconstrained case in Section 7, to convex constrained problems in Section 8, and to the nonconvex constrained case in Section 9.

2. Derivation of the Methods

In this section we derive a bundle method for the unconstrained problem of minimizing a convex function f : R^N → R that is not necessarily differentiable. We suppose that we have a finite process for finding a subgradient g_f(x) ∈ ∂f(x) of f at each given x ∈ R^N.

The algorithm to be described will generate sequences of points {x^k} ⊂ R^N, search directions {d^k} ⊂ R^N and stepsizes {t_L^k} ⊂ R_+, related by x^{k+1} = x^k + t_L^k d^k for k = 1,2,..., where x¹ is a given starting point. The sequence {x^k} is intended to converge to the required solution. The method will also calculate trial points y^{k+1} = x^k + t_R^k d^k for k = 1,2,..., and subgradients g^k = g_f(y^k) for all k, where y¹ = x¹ and the auxiliary stepsizes t_R^k ≥ t_L^k satisfy t_R^k = t_L^k if t_L^k > 0, for all k.

Given a point y ∈ R^N, let

f̄(x;y) = f(y) + ⟨g_f(y), x − y⟩ for all x

denote the corresponding linearization of f, and let

α(x,y) = f(x) − f̄(x;y)

denote the linearization error at any x ∈ R^N. At the k-th iteration, we shall have a nonempty set J^k ⊂ {1,...,k} and the linearizations f_j(·) = f̄(·; y^j), j ∈ J^k, given by the (N+1)-vectors (g^j, f_j^k) in the form

f_j(x) = f_j^k + ⟨g^j, x − x^k⟩ for all x,

where f_j^k = f̄(x^k; y^j) for j ∈ J^k. Let α_j^k = α(x^k, y^j) for all
j ∈ J^k. By convexity, g^j ∈ ∂_{α_j^k} f(x^k), i.e.

    f(x) ≥ f(x^k) + <g^j, x−x^k> − α_j^k   for all x,

and hence for any e ≥ 0 the convex polyhedron

    G^k(e) = {g ∈ R^N : g = Σ_{j∈J^k} λ_j g^j, Σ_{j∈J^k} λ_j α_j^k ≤ e,
                                                                    (2.1)
              λ_j ≥ 0, j ∈ J^k, Σ_{j∈J^k} λ_j = 1}

is an inner approximation to the e-subdifferential of f at x^k:

    G^k(e) ⊂ ∂_e f(x^k)   for all e ≥ 0,

that is, if G^k(e) is nonempty, then

    f(x) ≥ f(x^k) + max{<g, x−x^k> : g ∈ G^k(e)} − e   for all x.
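The linearization f̄(·;y) and the error α(x,y) defined above are a few lines of code. The sketch below is illustrative only: it uses a made-up convex test function and checks numerically that α(x,y) ≥ 0, which is exactly the convexity inequality that places each g^j in the α_j^k-subdifferential.

```python
import numpy as np

def f(x):
    # a simple convex test function: f(x) = max_i |x_i| (illustrative choice)
    return float(np.max(np.abs(x)))

def subgrad(x):
    # one subgradient of f at x: a signed unit vector at a maximizing coordinate
    i = int(np.argmax(np.abs(x)))
    g = np.zeros_like(x)
    g[i] = 1.0 if x[i] >= 0 else -1.0
    return g

def linearization(x, y):
    # f_bar(x; y) = f(y) + <g_f(y), x - y>
    return f(y) + subgrad(y) @ (x - y)

def lin_error(x, y):
    # alpha(x, y) = f(x) - f_bar(x; y); nonnegative when f is convex
    return f(x) - linearization(x, y)

rng = np.random.default_rng(0)
errs = [lin_error(rng.standard_normal(3), rng.standard_normal(3))
        for _ in range(100)]
min_err = min(errs)   # should be >= 0 up to rounding
```

Any other convex f with a computable subgradient could be substituted; the nonnegativity of the error is what the polyhedron (2.1) exploits.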

Suppose that for some e^k > 0 the set G^k(e^k) is nonempty and we want
to find a direction d ∈ R^N such that f(x^k+d) < f(x^k) − e^k. Letting
x = x^k+d and e = e^k, we see that d must satisfy

    <g, d> < 0   for all g in G^k(e),

i.e. we must find a hyperplane separating G^k(e) from the origin. One
way of finding such a hyperplane is to compute the element p^k = Nr G^k(e)
of G^k(e) that is nearest to the origin, since (see Lemma 1.2.12)

    <g, p^k> ≥ |p^k|^2   for all g ∈ G^k(e),

and hence if d^k = −p^k is nonzero then

    <g, d^k> ≤ −|p^k|^2 < 0   for all g ∈ G^k(e).                   (2.2)

(We may add that, since <g, p^k/|p^k|> is the length of the projection
of g on the direction of p^k and |p^k| is the distance of the hyperplane

    H = {z ∈ R^N : <z, p^k> = |p^k|^2}

from the origin, among the hyperplanes separating G^k(e) and the null
vector H is the furthest one from the origin.) Of course, there is
no separation if p^k = 0, but then 0 = p^k ∈ G^k(e) ⊂ ∂_e f(x^k) and so
f(x) ≥ f(x^k) − e for all x. In this case x^k minimizes f up to the
accuracy of e^k, so the method may stop if the value of e^k is small
enough. Otherwise one may decrease the value of e, compute a new
p^k = Nr G^k(e), etc. This process will either drive e to zero,
indicating that x^k is optimal, or find a direction d^k = −p^k
satisfying (2.2).
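Computing Nr G^k(e) is a small quadratic program. For illustration only, the sketch below ignores the e-constraint (the case of e large, so G^k(e) is the whole convex hull of the g^j) and approximates the min-norm convex combination by the Frank-Wolfe method; this is a toy substitute for the specialized quadratic programming codes the text refers to.

```python
import numpy as np

def nearest_point_hull(G, iters=2000):
    """Approximate the min-norm point of conv{rows of G} by Frank-Wolfe.

    Solves min over the unit simplex of 0.5*|G^T lambda|^2, i.e. the
    e-unconstrained version of subproblem (2.5).
    """
    m = G.shape[0]
    lam = np.full(m, 1.0 / m)        # start at the barycenter
    for t in range(iters):
        p = G.T @ lam                # current point; the gradient in lambda is G @ p
        j = int(np.argmin(G @ p))    # vertex minimizing the linearized objective
        step = 2.0 / (t + 2.0)       # standard Frank-Wolfe stepsize
        e_j = np.zeros(m)
        e_j[j] = 1.0
        lam = (1 - step) * lam + step * e_j
    return G.T @ lam, lam

# two subgradients; the nearest point of their hull is the midpoint (0.5, 0.5)
G = np.array([[1.0, 0.0], [0.0, 1.0]])
p, lam = nearest_point_hull(G)
```

The resulting −p then plays the role of the direction d^k in (2.2); handling the e-constraint in addition would turn the linear minimization step into a small LP.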
We shall now give another motivation for the above construction.
In Section 1.2 (see Lemma 1.2.13) we considered search direction find-
ing subproblems of the form

    minimize f̂^k(x^k+d) + ½|d|^2   over all d ∈ R^N

with the approximation f̂^k to f around x^k given by

    f̂^k(x) = max{f(x^k) + <g, x−x^k> : g ∈ ∂f(x^k)}   for all x.

Since the use of f̂^k would require the knowledge of the full subdiffer-
ential ∂f(x^k), in Chapter 2 we replaced f̂^k by the polyhedral appro-
ximation

    f̂^k(x) = max{f_j(x) : j ∈ J^k} =
            = max{f(x^k) + <g^j, x−x^k> − α_j^k : j ∈ J^k}.

By neglecting the linearization errors α_j^k in the definition of f̂^k,
we obtain the simplified approximation

    f̂_LW^k(x) = max{f(x^k) + <g^j, x−x^k> : j ∈ J^k}   for all x

used in the methods of Lemarechal (1975) and Wolfe (1975); see Section
4.7. Let us now consider the following approximation to f at x^k:

    f̂_B,s^k(x) = max{f(x^k) + <g, x−x^k> : g ∈ G^k(e^k)}   for all x.   (2.3)

Observe that f̂_B,s^k reduces to f̂_LW^k whenever e^k is sufficiently
large, i.e. e^k ≥ max{α_j^k : j ∈ J^k}. On the other hand, if e^k is
small enough then we may hope that G^k(e^k), being a subset of
∂_{e^k} f(x^k), is a good approximation of ∂f(x^k). In this case
f̂_B,s^k is close to the "conceptual" approximation f̂^k. It is natural,
therefore, to consider the following search direction finding subproblem:

    minimize f̂_B,s^k(x^k+d) + ½|d|^2   over all d ∈ R^N.            (2.4)

Lemma 2.1. (i) Subproblem (2.4) has a unique solution d^k. (Recall that
G^k(e^k) is nonempty by assumption.)

(ii) Let λ_j^k, j ∈ J^k, denote a solution to the problem

    minimize  ½ |Σ_{j∈J^k} λ_j g^j|^2
    subject to  λ_j ≥ 0, j ∈ J^k,  Σ_{j∈J^k} λ_j = 1,               (2.5)
                Σ_{j∈J^k} λ_j α_j^k ≤ e^k,

and let

    p^k = Σ_{j∈J^k} λ_j^k g^j.                                      (2.6)

Then p^k = Nr G^k(e^k) and d^k = −p^k.

(iii) There exists a solution λ_j^k, j ∈ J^k, of subproblem (2.5) such
that the set

    Ĵ^k = {j ∈ J^k : λ_j^k ≠ 0}                                     (2.7a)

satisfies

    |Ĵ^k| ≤ N + 1.                                                  (2.7b)

Such a solution can be obtained by solving the following linear program-
ming problem by the simplex method:

    minimize  Σ_{j∈J^k} λ_j α_j^k,
    subject to  Σ_{j∈J^k} λ_j = 1,
                Σ_{j∈J^k} λ_j g^j = p^k,                            (2.8)
                λ_j ≥ 0, j ∈ J^k.

Moreover, d^k solves the following reduced subproblem

    minimize  f̂_B,r^k(x^k+d) + ½|d|^2                               (2.9)

where

    f̂_B,r^k(x) = max{f(x^k) + <g, x−x^k> : g ∈ Ĝ^k},

    Ĝ^k = {g ∈ R^N : g = Σ_{j∈Ĵ^k} λ_j g^j, Σ_{j∈Ĵ^k} λ_j α_j^k ≤ e^k,
           λ_j ≥ 0, j ∈ Ĵ^k, Σ_{j∈Ĵ^k} λ_j = 1}.

(iv) There exists a Lagrange multiplier s^k ≥ 0 for the last constra-
int of (2.5) such that (2.5) is equivalent to the problem

    minimize  ½ |Σ_{j∈J^k} λ_j g^j|^2 + s^k Σ_{j∈J^k} λ_j α_j^k,
                                                                    (2.10)
    subject to  λ_j ≥ 0, j ∈ J^k,  Σ_{j∈J^k} λ_j = 1.

Proof. (i) Reasoning as in the proof of Lemma 2.2.1, we deduce that the
strongly convex function φ^k(d) = f̂_B,s^k(x^k+d) + ½|d|^2 has a unique
minimizer d^k.

(ii) We have p^k = Nr G^k(e^k) by the definition of Nr G^k(e^k). Let
G = G^k(e^k). Making use of Caratheodory's theorem (Lemma 1.2.1), we
deduce that there exist N+1 not necessarily different elements ĝ^i of G
and numbers λ̂_i, i=1,...,N+1, such that

    p^k = Σ_{i=1}^{N+1} λ̂_i ĝ^i,                                    (2.11a)

    λ̂_i ≥ 0, i=1,...,N+1,  Σ_{i=1}^{N+1} λ̂_i = 1.                  (2.11b)

Of course, the λ̂_i solve the problem

    minimize  ½ |Σ_{i=1}^{N+1} λ_i ĝ^i|^2,
    subject to  λ_i ≥ 0, i=1,...,N+1,  Σ_{i=1}^{N+1} λ_i = 1,

hence Lemma 2.2.1 yields

    [<ĝ^i, d̂> − v̂] λ̂_i = 0   for i=1,...,N+1,                      (2.11c)

while the basic property of Nr G implies (see Lemma 1.2.12)

    <g, d̂> ≤ v̂   for all g ∈ G,                                    (2.11d)

where d̂ = −p^k and v̂ = −|p^k|^2. From (2.11) and Lemma 1.2.5 we obtain

    p^k ∈ ∂f̂_B,s^k(x^k + d̂),

hence 0 ∈ ∂f̂_B,s^k(x^k+d̂) + d̂ = ∂φ^k(d̂) by Corollary 1.2.6. Thus d̂
minimizes φ^k, so −p^k = d̂ = d^k from part (i) above.

(iii) The simplex method will find an optimal basic solution of (2.8)
with no more than N+1 positive components (Dantzig, 1963) such that
it solves (2.5), since ½|p^k|^2 is the optimal value of (2.5). Hence
−p^k = −Nr Ĝ^k solves (2.9), by part (ii) above.

(iv) The existence of s^k follows from duality theory; see (Lemare-
chal, 1978b). □

Remark 2.2. The Mifflin (1978) quadratic programming algorithm will au-
tomatically find multipliers λ_j^k satisfying (2.7), and the multiplier
s^k.

We conclude from the above lemma that subproblem (2.5) has many
properties similar to those of the subproblems studied in Chapter 2,
which were of the form (2.10) but with s^k = 1. In particular, (2.9) is
its reduced version. Therefore, according to the generalized cutting
plane idea of Section 2.2, we may construct the (k+1)-st approximation
to f by choosing J^{k+1} such that

    J^{k+1} = Ĵ^k ∪ {k+1},

where Ĵ^k satisfies (2.7). This will define the method with subgrad-
ient selection, which uses at most N+3 past subgradients for search
direction finding at any iteration.

We may now consider the method with subgradient aggregation. Sup-
pose that at the beginning of the k-th iteration we have the (k−1)-st
aggregate subgradient

    (p^{k−1}, f_p^k) ∈ conv{(g^j, f_j^k) : j=1,...,k−1}.

Define the corresponding aggregate linearization

    f̃^{k−1}(x) = f_p^k + <p^{k−1}, x−x^k>   for all x

and the linearization error at x^k

    α_p^k = f(x^k) − f_p^k.

Since for each x

    f(x) ≥ f(x^k) + <g^j, x−x^k> − α_j^k =
         = f(x^k) + <g^j, x−x^k> − [f(x^k) − f_j^k],

we have

    f(x) ≥ f(x^k) + <p^{k−1}, x−x^k> − α_p^k   for all x.

Therefore for any e ≥ 0 the convex polyhedron

    G_a^k(e) = {g ∈ R^N : g = Σ_{j∈J^k} λ_j g^j + λ_p p^{k−1},
                Σ_{j∈J^k} λ_j α_j^k + λ_p α_p^k ≤ e,                (2.12)
                λ_j ≥ 0, j ∈ J^k, λ_p ≥ 0, Σ_{j∈J^k} λ_j + λ_p = 1}

satisfies G_a^k(e) ⊂ ∂_e f(x^k). These relations are natural extensions
of those of the method with subgradient selection described above. Hence
if G_a^k(e^k) ≠ ∅ then we may find the direction d^k by computing
p^k = Nr G_a^k(e^k) and setting d^k = −p^k. This can be done by
calculating multipliers λ_j^k, j ∈ J^k, and λ_p^k that

    minimize  ½ |Σ_{j∈J^k} λ_j g^j + λ_p p^{k−1}|^2,
    subject to  λ_j ≥ 0, j ∈ J^k, λ_p ≥ 0, Σ_{j∈J^k} λ_j + λ_p = 1,  (2.13)
                Σ_{j∈J^k} λ_j α_j^k + λ_p α_p^k ≤ e^k,

and setting

    p^k = Σ_{j∈J^k} λ_j^k g^j + λ_p^k p^{k−1}.

The corresponding primal search direction finding subproblem is to

    minimize  f̂_B,a^k(x^k+d) + ½|d|^2   over all d ∈ R^N,

where f̂_B,a^k is the k-th aggregate approximation to f:

    f̂_B,a^k(x) = max{f(x^k) + <g, x−x^k> : g ∈ G_a^k(e^k)}   for all x.   (2.14)

Also one may calculate λ_j^k and λ_p^k by the Mifflin (1978) algorithm,
although in this case λ_j^k need not necessarily satisfy (2.7). Moreover,
we may use the subgradient aggregation rules of Chapter 2 to define the
next aggregate linearization f̃^k by computing

    (p^k, f̃_p^k) = Σ_{j∈J^k} λ_j^k (g^j, f_j^k) + λ_p^k (p^{k−1}, f_p^k)   (2.15)

and setting

    f̃^k(x) = f̃_p^k + <p^k, x−x^k>   for all x.

The next aggregate approximation f̂_B,a^{k+1} will use

    J^{k+1} = Ĵ^k ∪ {k+1},

where Ĵ^k is any subset of J^k such that G_a^{k+1}(e^{k+1}) is nonempty.
Observing that G_a^k(e) is nonempty for any e ≥ 0 if α_j^k = 0 for some
j ∈ J^k (set λ_j = 1, λ_p = 0 and λ_i = 0 for i ≠ j in (2.12)), we
conclude that it suffices to ensure that α_j^{k+1} = 0 for some j in
J^{k+1}.

Thus we have motivated the search direction finding subproblems
of the methods. These subproblems depend on the parameter e^k, which
controls the accuracy with which the sets G^k(e^k) ⊂ ∂_{e^k} f(x^k) and
G_a^k(e^k) ⊂ ∂_{e^k} f(x^k) approximate ∂f(x^k). Rules for choosing the
approximation tolerances e^k, as well as the associated line search
criteria, will be discussed in Sections 3 and 6.
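The aggregation step (2.15) is plain convex-combination arithmetic. The sketch below uses invented data (a toy convex f and two trial points) to form an aggregate pair (p, f_p) from two linearizations and confirms that the aggregate linearization still minorizes f, which is the inequality used above to place the aggregate subgradient in an ε-subdifferential.

```python
import numpy as np

def f(x):
    return float(np.abs(x).sum())        # convex test function f(x) = |x|_1

def subgrad(x):
    g = np.sign(x)
    g[g == 0] = 1.0                      # any sign is valid where x_i = 0
    return g

xk = np.array([0.4, -0.2])
ys = [np.array([1.0, 1.0]), np.array([-0.5, 0.3])]   # earlier trial points

# linearization values at xk: f_j^k = f(y^j) + <g^j, x^k - y^j>
pairs = [(subgrad(y), f(y) + subgrad(y) @ (xk - y)) for y in ys]

# aggregate with convex weights, as in (2.15)
lam = np.array([0.3, 0.7])
p = sum(l * g for l, (g, _) in zip(lam, pairs))
fp = sum(l * fj for l, (_, fj) in zip(lam, pairs))

# the aggregate linearization  x -> fp + <p, x - xk>  minorizes f
rng = np.random.default_rng(1)
gap = min(f(x) - (fp + p @ (x - xk)) for x in rng.standard_normal((200, 2)))
```

Since each linearization minorizes the convex f, so does any convex combination; `gap` should therefore be nonnegative up to rounding.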

3. The Algorithm with Subgradient Aggregation

We may now describe in detail the method with subgradient aggrega-
tion for minimizing a convex function f : R^N → R. Several comments on
its rules will be given below.

Algorithm 3.1.

Step 0 (Initialization). Select the starting point x^1 ∈ R^N, a stopping
parameter ε_s ≥ 0 and an approximation tolerance e_a > 0. Choose positive
line search parameters m_L, m_R, m_e, t̄ and t̂ satisfying m_L < m_R < 1,
m_e < 1 and t̄ < 1 < t̂. Set J^1 = {1}, y^1 = x^1, p^0 = g^1 = g_f(y^1),
f_1^1 = f_p^1 = f(y^1) and e^1 = e_a. Set the counters k=1, l=0 and
k(0)=1.

Step 1 (Direction finding). Find multipliers λ_j^k, j ∈ J^k, and λ_p^k
that solve the k-th dual subproblem (2.13). Calculate the aggregate
subgradient (p^k, f̃_p^k) by (2.15). Set d^k = −p^k and

    v^k = −|p^k|^2.

Step 2 (Stopping criterion). Set

    α̃_p^k = f(x^k) − f̃_p^k.                                        (3.1)

If max{|p^k|^2, α̃_p^k} ≤ ε_s, terminate; otherwise, continue.

Step 3 (Approximation tolerance decreasing). If |p^k|^2 > α̃_p^k then go
to Step 4. Otherwise, i.e. if |p^k|^2 ≤ α̃_p^k, replace e^k by m_e e^k
and go to Step 1.

Step 4 (Line search). By a line search procedure as given below, find
two stepsizes t_L^k and t_R^k such that 0 ≤ t_L^k ≤ t_R^k ≤ t̂ and
t_R^k = t_L^k if t_L^k > 0, and such that the two corresponding points
defined by x^{k+1} = x^k + t_L^k d^k and y^{k+1} = x^k + t_R^k d^k satisfy

    f(x^{k+1}) ≤ f(x^k) + m_L t_L^k v^k,                            (3.2a)

    t_L^k ≥ t̄  or  α(x^k, x^{k+1}) > m_e e^k,  if t_L^k > 0,        (3.2b)

    α(x^k, y^{k+1}) ≤ m_e e^k  if t_L^k = 0,                        (3.2c)

    <g_f(y^{k+1}), d^k> ≥ m_R v^k  if t_L^k = 0.                    (3.2d)

Step 5 (Approximation tolerance updating). If t_L^k = 0 (null step) set
e^{k+1} = e^k; otherwise, i.e. if t_L^k > 0 (serious step), set
e^{k+1} = e_a and k(l+1) = k+1, and increase l by 1.

Step 6 (Linearization updating). Choose a subset Ĵ^k of J^k containing
k(l) if k(l) < k+1, and set J^{k+1} = Ĵ^k ∪ {k+1}. Set
g^{k+1} = g_f(y^{k+1}) and compute

    f_{k+1}^{k+1} = f(y^{k+1}) + <g^{k+1}, x^{k+1} − y^{k+1}>,

    f_j^{k+1} = f_j^k + <g^j, x^{k+1} − x^k>   for j ∈ Ĵ^k,          (3.3)

    f_p^{k+1} = f̃_p^k + <p^k, x^{k+1} − x^k>.

Step 7. Increase k by 1 and go to Step 1.

A few remarks on the algorithm are in order.

The above subgradient aggregation rules are the same as those in
Algorithm 2.3.1, since we always have

    λ_j^k ≥ 0, j ∈ J^k,  λ_p^k ≥ 0,  Σ_{j∈J^k} λ_j^k + λ_p^k = 1.   (3.4)

Hence Lemmas 2.4.1 and 2.4.2 are valid for Algorithm 3.1. In particular,
we have

    p^k ∈ ∂_ε f(x^k)   for ε = α̃_p^k,                               (3.5)

    f(x) ≥ f(x^k) − |p^k| |x−x^k| − α̃_p^k   for all x,              (3.6)

cf. Remark 2.3.3. Therefore, if the algorithm terminates at the k-th
iteration, then

    f(x) ≥ f(x^k) − ε_s^{1/2} (|x−x^k| + ε_s^{1/2})   for all x.     (3.7)

This estimate justifies our stopping criterion and shows that x^k is
optimal if ε_s = 0.
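The bound (3.6) behind this stopping test can be checked numerically: any pair (p, α) with p ∈ ∂_α f(x^k) gives f(x) ≥ f(x^k) − |p||x − x^k| − α. The sketch below builds such a pair from a single trial point (illustrative data only) and verifies the inequality on random sample points.

```python
import numpy as np

def f(x):
    return float(np.max(np.abs(x)))      # a convex test function

def subgrad(x):
    i = int(np.argmax(np.abs(x)))
    g = np.zeros_like(x)
    g[i] = 1.0 if x[i] >= 0 else -1.0
    return g

xk = np.array([-0.3, -0.1, 0.2])
y = np.array([1.0, 0.5, -0.4])           # an earlier trial point
p = subgrad(y)
alpha = f(xk) - (f(y) + p @ (xk - y))    # linearization error, >= 0 by convexity

# check the estimate behind (3.6): f(x) >= f(xk) - |p| |x - xk| - alpha
rng = np.random.default_rng(2)
slack = min(
    f(x) - (f(xk) - np.linalg.norm(p) * np.linalg.norm(x - xk) - alpha)
    for x in rng.standard_normal((300, 3))
)
```

The inequality follows from the ε-subgradient inequality plus Cauchy-Schwarz, so `slack` should be nonnegative on every sample.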
Our rules for updating the approximation tolerances e^k stem from
the following considerations. In view of (3.6), we aim at obtaining small
values of both |p^k| and α̃_p^k at some iteration. This will occur if
both |p^k|^2 ≤ α̃_p^k and the value of α̃_p^k is small. Thus a mechanism
is needed for decreasing the value of α̃_p^k if |p^k|^2 ≤ α̃_p^k. Since

    α̃_p^k = f(x^k) − f̃_p^k = Σ_{j∈J^k} λ_j^k α_j^k + λ_p^k α_p^k ≤ e^k   (3.8)

because the λ_j^k and λ_p^k solve (2.13), we have the fundamental relation

    α̃_p^k ≤ e^k.                                                    (3.9)

Therefore, whenever |p^k|^2 ≤ α̃_p^k occurs the algorithm decreases e^k
(m_e < 1) and calculates new p^k and α̃_p^k. Thus the upper bound on
α̃_p^k is decreased, while the new |p^k| cannot be smaller than the old
one. Moreover, this reduction of e^k increases the accuracy of our
approximation f̂_B,a^k of f around x^k, which is based on the set
G_a^k(e^k) ⊂ ∂_{e^k} f(x^k). We may add that, for simplicity, the
algorithm uses the approximation tolerance e^k = e_a after each serious
step. Other, more efficient rules for updating e^k will be discussed in
Section 6.
Our line search criteria ensure two basic prerequisites for con-
vergence: a sufficient decrease of the objective value at a serious step,
and a significant modification of the next approximation to f after a
null step. We have, by (2.14) and (2.2),

    f̂_B,a^k(x^k+d^k) − f(x^k) = max{<g, d^k> : g ∈ G_a^k(e^k)} =
                               = −|p^k|^2 = v^k                      (3.10)

and −v^k = |p^k|^2 > 0 at line searches. Thus v^k < 0 may be regarded as
an approximate directional derivative of f at x^k in the direction
d^k ≠ 0. Whenever a serious step is taken, the criteria (3.2a,b) make
t_L^k sufficiently large so that x^{k+1} has a significantly smaller
objective value than does x^k, since t̄ > 0, m_e e^k > 0 and
−m_L v^k > 0. On the other hand, after a null step we have

    g^{k+1} ∈ G_a^{k+1}(e^{k+1}),                                    (3.11a)

    <g^{k+1}, d^k> ≥ −m_R |p^k|^2 > −|p^k|^2.                        (3.11b)

This follows from (3.2c,d) and the fact that
α_{k+1}^{k+1} = α(x^{k+1}, y^{k+1}) = α(x^k, y^{k+1}), k+1 ∈ J^{k+1},
m_e e^k ≤ e^k, v^k = −|p^k|^2 and m_R ∈ (0,1). Comparing (3.10)
with (3.11) we see that d^{k+1} must differ from d^k after a null step,
since then e^{k+1} = e^k. At the same time, (3.2c) implies

    g^{k+1} ∈ ∂_ε f(x^{k+1})   for ε = α(x^k, y^{k+1}) ≤ m_e e^k.

This shows that when e^k decreases during a series of null steps then
the algorithm collects only local subgradient information, i.e. g^{k+1}
is close to ∂f(x^{k+1}).
The following line search procedure may be used for executing
Step 4.

Line Search Procedure 3.2.

(a) Set t_L = 0 and t = t_U = 1. Choose m satisfying m_L < m < m_R, e.g.
m = (9 m_L + m_R)/10.

(b) If f(x^k + t d^k) ≤ f(x^k) + m t v^k set t_L = t; otherwise set
t_U = t.

(c) If f(x^k + t d^k) ≤ f(x^k) + m_L t v^k and either t ≥ t̄ or
α(x^k, x^k + t d^k) > m_e e^k, set t_L^k = t_R^k = t and return.

(d) If α(x^k, x^k + t d^k) ≤ m_e e^k and t < t̄ and
<g_f(x^k + t d^k), d^k> ≥ m_R v^k, set t_R^k = t, t_L^k = 0 and return.

(e) Choose t ∈ [t_L + 0.1(t_U − t_L), t_U − 0.1(t_U − t_L)] by some
interpolation procedure and go to (b).
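Procedure 3.2 can be transliterated directly. The code below is a simplified sketch (a bisection-style midpoint replaces the interpolation at step (e), and the parameter values and test function are invented for illustration) for a one-dimensional convex f; on this example the procedure accepts t = 1 immediately as a serious step.

```python
import numpy as np

def line_search(f, subgrad, x, d, v, ek,
                mL=0.1, mR=0.5, me=0.5, tbar=0.01, m=0.14):
    """Simplified sketch of Line Search Procedure 3.2, steps (a)-(e)."""
    tL, tU, t = 0.0, 1.0, 1.0
    # linearization error alpha(x, x + t d) from the subgradient at the trial point
    alpha = lambda t: f(x) - (f(x + t * d) +
                              subgrad(x + t * d) @ (x - (x + t * d)))
    while True:
        # (b) update the bracket using the intermediate parameter m
        if f(x + t * d) <= f(x) + m * t * v:
            tL = t
        else:
            tU = t
        # (c) serious step: sufficient descent and a "long enough" step
        if f(x + t * d) <= f(x) + mL * t * v and (t >= tbar or alpha(t) > me * ek):
            return t, t                   # t_L^k = t_R^k = t
        # (d) null step: trial point close to x, new subgradient resists d
        if alpha(t) <= me * ek and t < tbar and subgrad(x + t * d) @ d >= mR * v:
            return 0.0, t                 # t_L^k = 0, t_R^k = t
        # (e) new trial stepsize inside [tL, tU] (midpoint instead of interpolation)
        t = 0.5 * (tL + tU)

f = lambda x: float(np.abs(x).sum())
subgrad = lambda x: np.where(x >= 0, 1.0, -1.0)
x = np.array([1.0]); d = np.array([-1.0]); v = -1.0   # p = 1, so v = -|p|^2
tL, tR = line_search(f, subgrad, x, d, v, ek=0.1)
```

A production version would use the safeguarded interpolation of Remark 3.3.5 at step (e) and an upper stepsize bound t̂ instead of the fixed initial t = 1.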

We shall now establish convergence of the above procedure.

Lemma 3.3. Line Search Procedure 3.2 terminates in a finite number of
iterations, finding stepsizes t_L^k and t_R^k satisfying (3.2).

Proof. Assume, for contradiction purposes, that the search does not ter-
minate. Denote by t^i, t_L^i and t_U^i the values of t, t_L and t_U after
the i-th execution of step (b) of the procedure, so that
t^i ∈ {t_L^i, t_U^i} for all i. Since t_L^i ≤ t_L^{i+1} ≤ t_U^{i+1} ≤ t_U^i
and t_U^{i+1} − t_L^{i+1} ≤ 0.9(t_U^i − t_L^i) for all i, there exists
t̂ ≥ 0 such that t_L^i ↑ t̂ and t_U^i ↓ t̂. Let x = x^k, d = d^k, v = v^k
and

    T_L = {t ≥ 0 : f(x+td) ≤ f(x) + m t v}.

Since {t_L^i} ⊂ T_L, t_L^i ↑ t̂ and f is continuous, we have t̂ ∈ T_L, i.e.

    f(x+t̂d) ≤ f(x) + m t̂ v.                                         (3.12a)

Since t_U^i ↓ t̂, t̂ ∈ T_L and t^i ∉ T_L if t_U^{i+1} = t^i, the set
I = {i : t^i = t_U^{i+1}} is infinite and t^i > t̂ and

    f(x+t^i d) > f(x) + m t^i v   for all i ∈ I.                    (3.12b)

We shall consider the following two cases.

(i) Suppose that t̂ > 0. Since, by (3.12a),

    f(x+t̂d) ≤ f(x) + m_L t̂ v − ε̂

with ε̂ = −(m−m_L) t̂ v > 0 (m > m_L and v < 0), and t^i → t̂ as i→∞,
i ∈ I, we have

    f(x+t^i d) ≤ f(x) + m_L t^i v   for large i ∈ I

from the continuity of f. Therefore at step (c) we must have
α(x, x+t^i d) ≤ m_e e^k and t^i < t̄ for all large i ∈ I, and hence at
step (d)

    <g^i, d> < m_R v   for large i ∈ I,                             (3.12c)

where g^i = g_f(x+t^i d) for all i.

(ii) Suppose that t̂ = 0. Then t^i ↓ 0 and α(x, x+t^i d) ↓ 0, since

    α(x, x+t^i d) ≤ f(x) − f(x+t^i d) + t^i |g^i| |d|,

f is continuous and {g^i} is bounded (from the local boundedness of
∂f; see Lemma 1.2.2). Hence α(x, x+t^i d) ≤ m_e e^k at step (d) for
large i ∈ I, because m_e e^k > 0, so we have (3.12c) at step (d).

Making use of (3.12) and the fact that 0 < m < m_R, one may obtain the
desired conclusion as in the proof of Lemma 3.3.3, since f, being a convex
function, has the semismoothness property (3.3.23) (see Remark 3.3.4). □

We refer the reader to Remark 3.3.5 for a discussion of interpola-
tion formulae which may be used at step (e) of Line Search Procedure 3.2.

Practical rules for deciding which past subgradients should be
dropped at Step 6 should weigh speed of convergence against storage and
work per iteration; see Remark 2.2.4. We may add that one may use addi-
tional subgradients for search direction finding when the objective func-
tion is a max function; see Remark 2.2.5. We also note that the subgradi-
ent g_f(x^k) is always used for search direction finding at the k-th
iteration, i.e. we have

    k(l) ∈ J^k,  g^{k(l)} = g_f(x^k)  and  α_{k(l)}^k = 0
    if k(l) ≤ k < k(l+1).                                            (3.13)

This property, which will be established in the next section, ensures
that the k-th dual subproblem (2.13) is feasible, since α_j^k = 0 for
some j ∈ J^k.

4. Convergence.

In this section we shall show that each sequence {x k} generated


by A l g o r i t h m 3.1 m i n i m i z e s f, l.e. f(x k) + i n f { f ( x ) : x ~ RN}, and that { ~ }
converges to a m i n i m u m point of f w h e n e v e r f attains its infimum. Natu-
rally, convergence results assume that the s t o p p i n g parameter s is
set to zero. To save space, our analysis will rely on the results of
Chapters 2 and 3.
We start ny r e c a l l i n g the following result of the p r e c e d i n g sec-
tion.

Lemma 4.1. If Algorithm 3.1 terminates at the k-th iteration, then x^k
is a minimum point of f.

From now on we assume that the algorithm does not stop.

Note that Step 1 may be executed more than once at each iteration,
since each decrease of the approximation tolerance e^k involves a repe-
tition of the search direction finding. This is not reflected in our no-
tation, since it should be clear from the context that the various rela-
tions, such as (3.4)-(3.9), hold upon each completion of Step 1 at any
iteration.

We shall now analyze the case when the algorithm produces a finite
sequence {x^k}.

Lemma 4.2. Suppose that at the k-th iteration Algorithm 3.1 cycles infi-
nitely between Steps 1 and 3. Then 0 ∈ ∂f(x̄) for x̄ = x^k.

Proof. If at the k-th iteration there are infinitely many returns from
Step 3 to Step 1, then |p^k| and α̃_p^k tend to zero, since we have
|p^k|^2 ≤ α̃_p^k ≤ e^k at Step 3 and e^k is replaced by m_e e^k with
m_e ∈ (0,1). Using this in (3.6), we obtain f(x) ≥ f(x^k) for all x, so
0 ∈ ∂f(x^k). □

In view of the above lemma, we may assume in what follows that the
algorithm generates an infinite sequence {x^k}, executing Step 1 finite-
ly many times at each iteration.

The following result is crucial for convergence.

Lemma 4.3. Suppose that for some x̄ ∈ R^N we have

    liminf_{k→∞} max{|p^k|, α̃_p^k, |x̄−x^k|} = 0,                    (4.1)

or equivalently

    there exists an infinite set K ⊂ {1,2,...} such that x^k →_K x̄,
    |p^k| →_K 0 and α̃_p^k →_K 0.                                    (4.2)

Then 0 ∈ ∂f(x̄).

Proof. (4.1) and (4.2) are equivalent, since α̃_p^k ≥ 0 for all k. If
(4.2) holds, we may let k ∈ K tend to infinity in (3.6) and use the con-
tinuity of f (f is locally Lipschitzian as a convex function on R^N) to
obtain f(x) ≥ f(x̄) for all x, i.e. 0 ∈ ∂f(x̄). □

Consider the following condition for some fixed x̄ ∈ R^N:

    there exists an infinite set K ⊂ {1,2,...} such that x^k →_K x̄.  (4.3)

In view of the above lemma, our aim is to show that
max{|p^k|, α̃_p^k} →_K 0.

We start by collecting a few useful results. By the rules of Step
5, we have

    x^k = x^{k(l)}   for k = k(l), k(l)+1, ..., k(l+1)−1,

where we set k(l+1) = ∞ if the number l of serious steps stays bounded,
i.e. if x^k = x^{k(l)} for some fixed l and all k ≥ k(l). At Step 4 we
always have

    |p^k|^2 > α̃_p^k ≥ 0,                                            (4.4)

hence d^k = −p^k ≠ 0 and v^k = −|p^k|^2 < 0. Therefore the line search
requirement (3.2a) with m_L > 0 ensures that the sequence {f(x^k)} is
nonincreasing and

    f(x^{k+1}) < f(x^k)   if x^{k+1} ≠ x^k.

These line search properties yield the following auxiliary result.

Lemma 4.4. (i) Suppose that the sequence {f(x^k)} is bounded from be-
low. Then

    Σ_{k=1}^∞ {t_L^k |p^k|^2 + t_L^k α̃_p^k} < ∞.                     (4.5)

(ii) If (4.3) holds then (4.5) is satisfied and

    f(x^k) ↓ f(x̄)   as k→∞,                                         (4.6)

    t_L^k |p^k|^2 → 0   as k→∞.                                      (4.7)

Proof. (i) By the line search criterion (3.2a),

    f(x^1) − f(x^{k+1}) = f(x^1)−f(x^2)+...+f(x^k)−f(x^{k+1}) ≥
                        ≥ −m_L Σ_{i=1}^k t_L^i v^i = m_L Σ_{i=1}^k t_L^i |p^i|^2.

Since m_L > 0 and 0 ≤ α̃_p^k ≤ |p^k|^2 at line searches and {f(x^k)} is
nonincreasing, the above inequality yields (4.5) if {f(x^k)} is bounded
from below.

(ii) If (4.3) holds then (4.6) follows from the continuity of f and
the monotonicity of {f(x^k)}. Hence f(x^k) ≥ f(x̄) for all k, so (4.5)
holds and we have (4.7), as desired. □

We shall now show that the properties of the dual search direction
finding subproblems ensure locally uniform reductions of |p^k| after
null steps.

Lemma 4.5. Let x̄ ∈ R^N, a > 0 and B = {y ∈ R^N : |y−x̄| ≤ a} be given.
Then there exists C independent of k such that

    max{|p^k|, |g^{k+1}|, 1} ≤ C   if x^k ∈ B.                       (4.8)

Moreover, if x^k ∈ B, t_L^{k−1} = 0 and e^k = e^{k−1} for some k > 1,
then

    ½|p^k|^2 ≤ φ_C(½|p^{k−1}|^2),                                    (4.9)

where the function φ_C is defined (for the fixed value of the line
search parameter m_R ∈ (0,1)) by

    φ_C(t) = t − (1−m_R)^2 t^2 / (8C^2).                             (4.10)

Proof. (i) Observe that, by (2.15), ½|p^k|^2 is the optimal value of
the k-th dual subproblem (2.13), for any k.

(ii) Suppose that k(1) &k < k(l+l), so that xk=x k(1). Observe that

tk-i
R =t k-i
L , and hence yk =x k =x k(1) and gk=gf( x k(1)) if k=k(1). Com-
bining this with the fact that k(1)e jk by the rules of Steps 5 and 6,
and that ~k(ik)=~(xk,xk(1))=~(xk(1),xk(1))=0, we obtain

k(1)e jk, gk(1)=gf(xk ) and k


~k(1)=0.

Hence the multipliers

Ik(1)=l, lj=0 for j e jk {k(1)}, Ip=0

are feasible for the k-th subproblem (2.13), so its optimal value
llpkl2 1 gk(1) 2 k 1 2

(iii) Observing that [dk]=[p k] and lyk+l-x[ 5 [yk+l-xkl+[xk-xl

tkR [ d k l a <_~ I d k l + ~ and gk+l =gf( yk+l ) if xke B, we deduce from (4.11)

and the local boundedness of gf the existence of a constant C which is


larger than Ipk I, lyk+l-xl and [gk+ll whenever x k ~ B. This yields
(4.8).
(iv) Suppose that x k ~ B, tk-1
L =0 and e k =e k-I . Then x k-I =x k e B, so

max{ lpk-ll, Igk[ ,1} ~ c (4.12)

from (4.8). Since ek=e k-l, me< 1 and xk=x k-l, (3.9) and (3.2c) yield
k ,k-i ~ e k
ep=ep and k
~k=~(xk,y k) ~ meek ~ e k , while k e jk by the rules of
Step 6, so the multipliers

Ik(~)=v, lj(v)=0 for j ejk ~ {k}, Ip(v)=l-v

are feasible in (2.13) for each v e [0,i]. Therefore the optimal value
of (2.13) satisfies

1pki2_< min{ l(l_9)pk-l+vgk [2 : ~ ~ [0,i] }. (4.13)

Since tk-i
L =0, (3.2d) yields

< gk,dk-I > ~mRvk-I (4.14)

Using (4.12), (4.14) and the fact that m R ~ (0,I), dk-l=-p k-I and
v k-i =-I p k - 1 2] , we deduce from Lemma 2.4.10 that the right side of ine-
quality (4.13) is no larger then %C(Ipk-l12/2), so (4.9) holds, as re-
quired.
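The minimization over ν in (4.13) has a closed form: for φ(ν) = ½|(1−ν)p + νg|^2, the unconstrained minimizer is ν* = <p, p−g>/|p−g|^2, clipped to [0,1]. The sketch below (with toy vectors, not taken from the text) compares this with a brute-force grid and illustrates the contraction (4.9)-(4.10) on an example where the null-step condition <g, d> ≥ m_R v holds.

```python
import numpy as np

def two_point_min(p, g):
    """Closed-form solution of min over nu in [0,1] of 0.5*|(1-nu)p + nu g|^2."""
    diff = p - g
    denom = float(diff @ diff)
    nu = 0.5 if denom == 0 else float(np.clip((p @ diff) / denom, 0.0, 1.0))
    q = (1 - nu) * p + nu * g
    return nu, 0.5 * float(q @ q)

p = np.array([1.0, 0.0])                 # previous aggregate subgradient p^{k-1}
g = np.array([-0.2, 1.0])                # new subgradient g^k after a null step
nu, val = two_point_min(p, g)

# brute-force check of the closed form on a fine grid
grid = np.linspace(0.0, 1.0, 10001)
brute = min(0.5 * float(((1 - s) * p + s * g) @ ((1 - s) * p + s * g))
            for s in grid)

# contraction function (4.10) with C >= max(|p|, |g|, 1)
mR, C = 0.5, 1.1
phi = lambda t: t - (1 - mR) ** 2 * t ** 2 / (8 * C ** 2)
```

Here `val` is the optimal value of the two-point relaxation of (2.13), and on this example it falls below phi(½|p|^2), which is the uniform decrease that drives |p^k| to zero during a run of null steps.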

We are now ready to analyze the case of a finite number of serio-
us steps of the method.

Lemma 4.6. Suppose that x^k = x^{k(l)} = x̄ for some fixed l and all
k ≥ k(l). Then (4.1) holds and 0 ∈ ∂f(x̄).

Proof. Suppose that x^k = x̄ for all k ≥ k(l) and let
K = {k : |p^k|^2 ≤ α̃_p^k}. We shall consider two cases.

(i) Suppose that K is infinite. Then, since 0 ≤ e^{k+1} ≤ e^k for
k ≥ k(l), and e^{k+1} ≤ m_e e^k if k ∈ K with the fixed m_e ∈ (0,1), we
have e^k →_K 0. Since |p^k|^2 ≤ α̃_p^k ≤ e^k for all k ∈ K, we obtain
(4.2), and hence (4.1) and 0 ∈ ∂f(x̄) from Lemma 4.3.

(ii) Suppose that K is finite, i.e. |p^k|^2 > α̃_p^k for all large k.
Since x^k = x̄ and t_L^{k−1} = 0 for k > k(l), by the rules of Step 5 we
have e^k = ē > 0 for all large k. But then Lemma 4.5 yields |p^k| → 0,
so, since α̃_p^k < |p^k|^2 for large k, both |p^k| and α̃_p^k tend to
zero, and Lemma 4.3 yields 0 ∈ ∂f(x̄). □

It remains to consider the case of an infinite number of serious
steps. To this end, let K_l = {k : k(l) ≤ k < k(l+1)} and let b_l denote
the minimum value taken on by max{|p^k|, α̃_p^k} in Step 2 at iterations
k ∈ K_l, for all l. Note that b_l is well-defined if k(l+1) < ∞, since
then there can be only finitely many executions of Step 1 at any
iteration.

Lemma 4.7. Suppose that there exist a point x̄ ∈ R^N and an infinite set
L ⊂ {1,2,...} such that x^{k(l)} → x̄ as l→∞, l ∈ L. Then
liminf_{l∈L} b_l = 0 and 0 ∈ ∂f(x̄).

Proof. Suppose that x^{k(l)} →_L x̄. We shall consider two cases.

(i) Suppose that b_l →_L̄ 0 for some infinite set L̄ ⊂ L. Since
x^k = x^{k(l)} for all k ∈ K_l and x^{k(l)} → x̄ as l→∞, l ∈ L, from the
definition of b_l we deduce (4.2). Hence 0 ∈ ∂f(x̄) by Lemma 4.3.

(ii) Suppose that {b_l}_{l∈L} is bounded away from zero. Then there
exists ε̄ > 0 such that on each entrance to Step 3 we have

    max{|p^k|^2, α̃_p^k} ≥ ε̄   for all k ∈ K_l and large l ∈ L.      (4.15)

By the algorithm's rules, for any l and k such that k ∈ K_l and
k+1 ∈ K_l, we have e^{k(l)} = e_a > 0, e^{k+1} ≥ m_e e^k if
|p^k|^2 ≤ α̃_p^k ≤ e^k at Step 3, and e^{k+1} = e^k otherwise. Therefore,
if e^k approached zero for some k ∈ K_l and large l ∈ L, then so would
|p^k|^2 and α̃_p^k, which would contradict (4.15). Thus e^k ≥ ε_e > 0
for some ε_e and all k ∈ K_l and large l ∈ L. In particular,

    e^k ≥ ε_e > 0   for all large k ∈ K,                             (4.16)

where K = {k(l+1)−1 : l ∈ L}. Also, since |p^k|^2 > α̃_p^k at Step 4,
(4.15) yields

    |p^k|^2 ≥ ε̄   for all large k ∈ K.                              (4.17)

Since x̄ is an accumulation point of {x^k}, Lemma 4.4 yields
t_L^k |p^k|^2 → 0 as k→∞. Combining this with (4.17) and the fact that

    t_L^k |p^k|^2 = |t_L^k d^k| |p^k| = |x^{k+1}−x^k| |p^k|   for all k,

we obtain t_L^k →_K 0 and |x^{k+1}−x^k| →_K 0. But t_L^k > 0 for all
k ∈ K, so we deduce from (3.2b), the fact that t̄ > 0 is fixed, and
(4.16) that

    α(x^k, x^{k+1}) > m_e ε_e > 0   for all large k ∈ K.             (4.18)

Since x^k →_K x̄ and |x^{k+1}−x^k| →_K 0, we have

    α(x^k, x^{k+1}) = f(x^k) − f(x^{k+1}) − <g_f(x^{k+1}), x^k−x^{k+1}> →_K 0   (4.19)

from the continuity of f and the local boundedness of g_f. This contra-
dicts (4.18). Therefore, {b_l}_{l∈L} cannot be bounded away from zero,
and case (i) above yields the desired conclusion. □

Combining Lemmas 4.6 and 4.7 we obtain

Theorem 4.8. Every accumulation point x̄ of an infinite sequence {x^k}
generated by Algorithm 3.1 satisfies 0 ∈ ∂f(x̄).

Our next result states that the global convergence properties of
the method are the same as those of the algorithms considered in Chap-
ter 2.

Theorem 4.9. Every infinite sequence {x^k} calculated by Algorithm 3.1
minimizes f, i.e. f(x^k) ↓ inf{f(x) : x ∈ R^N} as k→∞. Moreover, {x^k}
converges to a minimum point of f whenever f attains its infimum.

Proof. In virtue of Theorem 4.8 and the fact that we have (3.5) and
x^{k+1} = x^k + t_L^k d^k for all k, and (4.5) if {f(x^k)} is bounded
from below, the proofs of Lemma 2.4.14, Theorem 2.4.15 and Theorem
2.4.16 are valid for Algorithm 3.1. □

The next results provide further substantiation of our stopping
criterion.

Corollary 4.10. If f has a minimum point and the stopping parameter ε_s
is positive, then Algorithm 3.1 terminates in a finite number of itera-
tions.

Proof. If the assertion were false then Lemma 2.4.14, which holds for
Algorithm 3.1 owing to (3.5) and Lemma 4.4(i), would imply that {x^k}
is bounded and has some accumulation point x̄ if {x^k} is infinite,
while the proof of Lemma 4.2 shows that the method must stop if {x^k}
is finite and ε_s > 0. Then Lemmas 4.6 and 4.7 would yield that
max{|p^k|, α̃_p^k} ≤ ε_s for some k, and hence the method would stop, a
contradiction. □

Corollary 4.11. If the level set S_f = {x ∈ R^N : f(x) ≤ f(x^1)} is
bounded and the stopping parameter ε_s is positive, then Algorithm 3.1
terminates in a finite number of iterations.

Proof. Since {x^k} ⊂ S_f is bounded and ε_s > 0, we may use either the
proof of Lemma 4.2 or Lemmas 4.6 and 4.7 to show that
max{|p^k|, α̃_p^k} ≤ ε_s for some k. □

5. The Algorithm with Subgradient Selection.

In this section we shall state and analyze in detail the method
with subgradient selection introduced in Section 2.

Algorithm 5.1.

Step 0 (Initialization). Do Step 0 of Algorithm 3.1.

Step 1 (Direction finding). Find multipliers λ_j^k, j ∈ J^k, which solve
the k-th dual subproblem (2.5) and are such that the corresponding set
Ĵ^k = {j ∈ J^k : λ_j^k ≠ 0} has at most N+1 elements. Set

    (p^k, f̃_p^k) = Σ_{j∈J^k} λ_j^k (g^j, f_j^k),                    (5.1)

    d^k = −p^k  and  v^k = −|p^k|^2.

Step 2 (Stopping criterion). Do Step 2 of Algorithm 3.1.

Step 3 (Approximation tolerance decreasing). Do Step 3 of Algorithm 3.1.

Step 4 (Line search). Do Step 4 of Algorithm 3.1.

Step 5 (Approximation tolerance updating). Do Step 5 of Algorithm 3.1.

Step 6 (Linearization updating). Set

    J^{k+1} = Ĵ^k ∪ {k+1, k(l)}.                                     (5.2)

Set g^{k+1} = g_f(y^{k+1}) and calculate f_j^{k+1}, j ∈ J^{k+1}, by (3.3).

Step 7. Increase k by 1 and go to Step 1.

A few comments on the method are in order.

The subgradient selection and aggregation rules of the method are
the same as those of Algorithm 2.5.1, since we always have

    λ_j^k ≥ 0, j ∈ J^k,  Σ_{j∈J^k} λ_j^k = 1                         (5.3)

from Lemma 2.1 and the construction of Ĵ^k. Hence, by the results of
Section 2.5, Lemmas 2.4.1 and 2.4.2 are true for Algorithm 5.1 and we
have (3.5). In view of (5.1) and (5.3), we may set λ_p^k = 0 in (3.8) to
obtain

    α̃_p^k ≤ e^k,

as in Algorithm 3.1.

The remarks in Section 3 on the line search criteria of Algorithm
3.1 apply to Algorithm 5.1 if one replaces f̂_B,a^k and G_a^k in
(3.8)-(3.9) by f̂_B,s^k and G^k, respectively; see (2.3) and (2.13). Of
course, Line Search Procedure 3.2 may be used for executing Step 4 of
Algorithm 5.1.

The requirement (5.2) results in relation (3.13), which ensures
that the k-th dual subproblem (2.5) is feasible, i.e. the algorithm is
well-defined. We refer the reader to Remarks 2.2.5 and 2.5.3 on the pos-
sible use of additional subgradients for search direction finding.

As far as convergence is concerned, it is easy to verify that all
the results of Section 4 hold for Algorithm 5.1. In fact, only part (iv)
of the proof of Lemma 4.5 needs a minor modification. To this end, de-
fine the multipliers

    λ_k(ν) = ν,  λ_j(ν) = (1−ν) λ_j^{k−1} for j ∈ Ĵ^{k−1},
    λ_j(ν) = 0 for j ∈ J^k \ (Ĵ^{k−1} ∪ {k}),

and use (2.5.4), which follows from (5.1)-(5.3), to deduce that (4.13)
holds, as before.

To sum up, Algorithm 5.1 is a globally convergent method in the
sense of Theorem 4.9 and Corollaries 4.10 and 4.11.

6. Modified Line Search Rules and Approximation Tolerance Updating Stra-
tegies.

In this section we shall describe modified rules for updating the
approximation tolerances {e^k} of Algorithm 3.1. These rules are more
efficient than the original ones. We shall also discuss modified line
search requirements that are similar to those proposed by Lemarechal,
Strodiot and Bihain (1981).

The motivation for our modified approximation tolerance updat-
ing strategies is practical and stems from the following observation.
Algorithm 3.1 resets e^k to the fixed value e_a > 0 after each se-
rious step, and then each decrease of e^k involves solving an addi-
tional "idle" quadratic programming subproblem. This strategy has two
drawbacks. First, too small a value of e^k may result in many function
and subgradient evaluations at initial iterations, when the line search
procedure may need many contractions for finding y^{k+1} sufficiently
close to x^k so that α(x^k, y^{k+1}) ≤ m_e e^k at a null step. Secondly, when
the algorithm is close to a solution then, in general, d^k is a descent
direction only if e^k is small enough, so that G_a^k(e^k) is close to
∂f(x^k). Hence later iterations may require many solutions of quadratic
programming subproblems only to reduce the values of e^k.

The first drawback may be eliminated by allowing the method to
choose after a serious step any value of e^k not smaller than e_a. For
instance, following Lemarechal, Strodiot and Bihain (1981), one may set

   e^{k+1} = max{e_a, -t_L^k v^k}                                        (6.1a)
or
   e^{k+1} = max{e_a, f(x^k) - f(x^{k+1})}                               (6.1b)

at Step 5 of Algorithm 3.1 if t_L^k > 0. This will enable the method to
use e^k larger than e_a at initial iterations. At the same time, one
may easily verify that this modification does not impair the convergen-
ce results of Section 4.
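The two resets (6.1a) and (6.1b) can be written down directly; the following Python sketch is only an illustration, with all parameter names chosen here rather than taken from the text.

```python
def reset_tolerance(e_a, t_L, v, f_old, f_new, rule="a"):
    """Sketch of the serious-step tolerance resets (6.1a)/(6.1b).

    e_a          : fixed lower threshold e_a > 0,
    t_L          : accepted stepsize t_L^k > 0,
    v            : v^k = -|p^k|^2 (negative at a serious step),
    f_old, f_new : f(x^k) and f(x^{k+1}).
    """
    if rule == "a":
        return max(e_a, -t_L * v)          # (6.1a)
    return max(e_a, f_old - f_new)         # (6.1b)
```

Both rules return at least `e_a`, so the fixed threshold is never undercut.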
The above modification does not eliminate the need for the fixed
threshold e_a > 0, hence it has the second drawback mentioned above. For
this reason, consider the following modification of Algorithm 3.1.
Algorithm 6.1.

Step 0 (Initialization). Do Step 0 of Algorithm 3.1. Set δ^1 = e_a and
e^1 = m_e δ^1.

Step 1 (Direction finding). Do Step 1 of Algorithm 3.1.

Step 2 (Stopping criterion). Do Step 2 of Algorithm 3.1.

Step 3 (Approximation tolerance decreasing). Set

   δ̃^k = max{|p^k|^2, α̃_p^k}.                                           (6.2)

If |p^k|^2 > α̃_p^k then go to Step 4. Otherwise, i.e. if |p^k|^2 ≤ α̃_p^k, set
δ^k = δ̃^k, e^k = m_e δ^k and go to Step 1.

Step 4 (Line search). Do Step 4 of Algorithm 3.1.

Step 5 (Approximation tolerance updating). Set δ^{k+1} = δ^k and e^{k+1} =
m_e δ^{k+1}.

Step 6 (Linearization updating). Do Step 6 of Algorithm 3.1.

Step 7. Increase k by 1 and go to Step 1.

Remark 6.2. It is easy to see that Algorithm 6.1 calculates monotonical-
ly nonincreasing sequences of positive numbers {δ^k} and {e^k} satisf-
ying e^k = m_e δ^k for all k, if it does not stop. To this end, observe that
since α̃_p^k ≤ e^k = m_e δ^k from (3.9), δ^k is set equal to δ̃^k only if
|p^k|^2 ≤ α̃_p^k and

   δ̃^k = max{|p^k|^2, α̃_p^k} = α̃_p^k ≤ e^k = m_e δ^k.

In this case δ^k is reduced at least by the factor m_e ∈ (0,1); otherwise, δ^k is
unchanged. It follows that either δ^k and e^k eventually stay constant
and there are no returns from Step 3 to Step 1, or they converge to zero
and the method returns from Step 3 to Step 1 infinitely often with
max{|p^k|^2, α̃_p^k} ≤ δ^k.

We conclude from the above remark that Algorithm 6.1 automatical-
ly reduces e^k when it approaches an optimal point, which is indicated
by a small value of max{|p^k|^2, α̃_p^k}.
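The branching in Step 3 of Algorithm 6.1 can be sketched as follows; the function and variable names are illustrative assumptions of this fragment, not notation of the book.

```python
def step3_tolerance(p_sq, alpha_p, delta, m_e):
    """Sketch of Step 3 of Algorithm 6.1 (tolerance decreasing).

    p_sq    : |p^k|^2,
    alpha_p : the aggregate error alpha-tilde_p^k,
    delta   : the current delta^k (with e^k = m_e * delta^k),
    m_e     : the reduction factor m_e in (0,1).

    Returns (go_to_step4, delta, e): either pass on to the line search
    unchanged, or shrink delta^k to delta-tilde^k = max{p_sq, alpha_p}
    and recompute e^k = m_e * delta^k before returning to Step 1.
    """
    if p_sq > alpha_p:
        return True, delta, m_e * delta    # proceed to Step 4
    delta = max(p_sq, alpha_p)             # (6.2); here it equals alpha_p
    return False, delta, m_e * delta       # back to Step 1 with smaller e^k
```

Because `alpha_p <= m_e * delta` holds on entry (Remark 6.2), the second branch shrinks `delta` at least by the factor `m_e`.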
Let us now analyze convergence of the method. Making use of Remark
6.2, one may easily establish Lemma 4.2 for Algorithm 6.1. Therefore, we
may suppose in what follows that the method generates an infinite se-
quence {x^k}. Next, we observe that the proofs of Lemmas 4.3-4.6 need not
be modified. Lemma 4.7 is replaced by the following result.

Lemma 6.3. Suppose that Algorithm 6.1 generates an infinite sequence
{x^k} such that for some x̃ ∈ R^N one has f(x̃) ≤ f(x^k) for all k. Then
there exists x̄ ∈ R^N such that 0 ∈ ∂f(x̄) and x^k → x̄ as k → ∞. Moreover,
liminf_{k→∞} max{|p^k|^2, α̃_p^k} = 0.

Proof. Suppose that f(x^k) ≥ f(x̃) for all k. We may use Lemma 4.4(i),
(3.9) and the fact that t_L^k ≤ t̃ for all k to deduce that Lemma 2.4.14
holds for Algorithm 6.1. Hence {x^k} is bounded and has an accumulation
point x̄ satisfying f(x̄) ≤ f(x^k) for all k by Lemma 4.4(ii). Using the
proof of Theorem 2.4.15, we deduce that in fact x^k → x̄ as k → ∞. In view
of Remark 6.2, we shall consider two cases.

(i) Suppose that e^k stays constant for all large k. Then the desired
conclusion follows from Lemma 4.6 and the proof of Lemma 4.7, which is
valid for constant e^k.

(ii) Suppose that e^k tends to zero. Then δ^k → 0 and for infinitely ma-
ny k-s we have max{|p^k|^2, α̃_p^k} ≤ δ^k, while x^k → x̄, so (4.2) holds and Lem-
ma 4.3 yields the desired conclusion.

We conclude from Lemmas 6.3 and 4.4(ii) that Theorem 4.8 holds
for Algorithm 6.1. Then it is straightforward to verify that Theorem 4.9
and Corollaries 4.10 and 4.11 are true for Algorithm 6.1, since their
proofs did not depend on the choice of {e^k}.

To sum up, we have proved that Algorithm 6.1 is globally convergent
in the sense of Theorem 4.9 and Corollaries 4.10 and 4.11.
We shall now consider modified line search rules that are similar
to those in (Lemarechal, Strodiot and Bihain, 1981). To this end, recall
from Lemma 2.1(iv) that the k-th dual subproblem (2.13) is equivalent to
the following

   minimize   (1/2)|Σ_{j∈J^k} λ_j g^j + λ_p p^{k-1}|^2 + Σ_{j∈J^k} λ_j s^k α_j^k + λ_p s^k α_p^k,
                                                                         (6.3)
   subject to λ_j ≥ 0, j ∈ J^k,  λ_p ≥ 0,  Σ_{j∈J^k} λ_j + λ_p = 1,

where s^k ≥ 0 is the Lagrange multiplier for the last constraint of (2.13),
satisfying

   s^k (Σ_{j∈J^k} λ_j^k α_j^k + λ_p^k α_p^k - e^k) = 0.
By (3.8),

   α̃_p^k = Σ_{j∈J^k} λ_j^k α_j^k + λ_p^k α_p^k,

hence we have s^k α̃_p^k = s^k e^k. Combining the preceding relations with (2.15)
and invoking Lemma 2.2.1, we see that (6.3) is the dual to the problem

   minimize    (1/2)|d|^2 + v  over  (d,v) ∈ R^{N+1},

   subject to  -s^k α_j^k + ⟨g^j, d⟩ ≤ v,  j ∈ J^k,                      (6.4)

               -s^k α_p^k + ⟨p^{k-1}, d⟩ ≤ v,

which has a unique solution (d^k, ṽ^k) with d^k = -p^k (p^k is given by
(2.15)) and

   ṽ^k = -{|p^k|^2 + s^k α̃_p^k} = -{|p^k|^2 + s^k e^k}.                  (6.5)

Note that for s^k = 1 (6.3)-(6.5) reduce to the corresponding relations
of Algorithm 2.3.1 (see also (3.2.30)-(3.2.31)), while if s^k = 0 then
ṽ^k reduces to the variable v^k = -|p^k|^2 used by Algorithm 3.1; see (3.10).
Therefore ṽ^k may be regarded as an approximate directional derivative
of f at x^k in the direction d^k.

Let us, therefore, consider the use of ṽ^k in place of v^k in
Algorithm 3.1. This amounts to replacing v^k by ṽ^k in the line sea-
rch criteria (3.2). Since s^k ≥ 0, e^k > 0 and v^k < 0 at line searches,
we have ṽ^k ≤ v^k < 0, i.e. ṽ^k < 0, so the modified criteria may be inter-
preted similarly to the original ones, and Line Search Procedure 3.2
will meet these requirements.
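The quantity ṽ^k of (6.5) is cheap to form once the QP multiplier s^k is available. The following Python sketch assumes `p` is the vector p^k and uses illustrative names throughout.

```python
import numpy as np

def approximate_derivative(p, s, e):
    """Sketch of (6.5): given p^k, the multiplier s^k >= 0 of the last
    constraint of the dual subproblem (2.13), and e^k > 0, return the
    pair (v^k, v-tilde^k) with v-tilde^k = -(|p^k|^2 + s^k e^k)."""
    v = -float(p @ p)            # v^k = -|p^k|^2 (the s^k = 0 case)
    v_tilde = v - s * e          # (6.5); v_tilde <= v < 0 whenever p != 0
    return v, v_tilde
```

Since `s >= 0` and `e > 0`, the sketch reproduces the ordering ṽ^k ≤ v^k < 0 used to justify the modified line search criteria.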
We shall now establish convergence of the resulting method. One
may easily verify that among the results of Section 4 only part (iv) of
the proof of Lemma 4.5 is invalidated if we have ṽ^k < v^k for some k.
But Lemma 4.5 was instrumental only in part (ii) of the proof of Lemma
4.6. Therefore, we need only prove the following result.

Lemma 6.4. Consider the version of Algorithm 3.1 that uses ṽ^k given
by (6.5) instead of v^k, for all k, and assume that m_e + m_R ≤ 1. Suppose
that x^k = x̄, t_L^k = 0 and e^k = e > 0 for some fixed x̄, e and all large k.
Then |p^k| → 0 as k → ∞.

Proof. (i) With no loss of generality, suppose that x^k = x̄, t_L^k = 0 and
e^k = e > 0 for all k. Then (3.2c,d) with v^k replaced by ṽ^k yield

   α_{k+1}^{k+1} ≤ m_e e  and  ⟨g^{k+1}, p^k⟩ ≤ m_R|p^k|^2 + m_R e s^k    (6.6a)

for all k. Moreover, since (d^{k+1}, ṽ^{k+1}) solves the (k+1)-st subproblem
(6.4), while k+1 ∈ J^{k+1}, we have -s^{k+1}α_{k+1}^{k+1} + ⟨g^{k+1}, d^{k+1}⟩ ≤ ṽ^{k+1}, so

   ⟨g^{k+1}, p^{k+1}⟩ ≥ |p^{k+1}|^2 + s^{k+1}e - s^{k+1}α_{k+1}^{k+1} ≥
                                                                         (6.6b)
                        |p^{k+1}|^2 + (1-m_e)e s^{k+1},

since α_{k+1}^{k+1} ≤ m_e e, for all k. Subtracting (6.6a) from (6.6b) and rearrang-
ing terms yields

   s^{k+1} ≤ s^k m_R/(1-m_e) - c^k/((1-m_e)e),

where
   c^k = |p^{k+1}|^2 - m_R|p^k|^2 - ⟨g^{k+1}, p^{k+1} - p^k⟩,             (6.7)

hence the fact that m_R ≤ 1-m_e implies

   s^{k+1} ≤ s^k - c^k/((1-m_e)e)  for all k.                            (6.8)

(ii) From the proof of Lemma 4.5 we deduce that {g^k} is bounded and
that p^{k+1} = Nr[g^{k+1}, p^k] for all k. Therefore {|p^k|} is monotonically
nonincreasing, and (6.6b) and the positivity of (1-m_e)e imply the ex-
istence of a constant s̄ such that s^k ≤ s̄ for all k.

(iii) Since {p^k} is bounded, there exists p̄ ∈ R^N and an infinite set
K ⊂ {1,2,...} such that p^k →_K p̄. Assume, for contradiction purposes,
that p̄ ≠ 0, since |p^k| → 0 otherwise. We shall show that p^{k+1} →_K p̄. To
this end, suppose that p^{k+1} →_K̄ p̃ for some p̃ and an infinite set
K̄ ⊂ K. Since p^{k+1} = Nr[g^{k+1}, p^k], Lemma 1.2.12 implies ⟨p^{k+1}, p^k⟩ ≥ |p^{k+1}|^2.
Passing to the limit with k ∈ K̄ and using the monotonicity of {|p^k|}
yields ⟨p̃, p̄⟩ ≥ |p̃||p̄| ≥ 0, so p̃ = p̄ ≠ 0 from elementary properties of
the inner product. This shows that p^k and p^{k+1} have a common limit
p̄ ≠ 0 as k → ∞, k ∈ K. By (6.7) and the boundedness of {g^k}, we have

   c^k →_K c̄  with positive c̄ = (1-m_R)|p̄|^2, since m_R < 1.

(iv) Let k_c satisfy k_c c̄/2 > s̄(1-m_e)e. From part (iii) above we de-
duce the existence of k_p such that c^k ≥ c̄/2 for k = k_p, k_p+1,..., k_p+k_c.
But s^{k_p} ≤ s̄, so (6.8) yields s^{k_p+k_c+1} ≤ s̄ - k_c c̄/(2(1-m_e)e) < 0, contra-
dicting the nonnegativity of {s^k}. Hence p̄ = 0 and the proof is comple-
te.

We conclude that the use of {ṽ^k} in Algorithm 3.1 does not im-
pair the global convergence results of Section 4, provided that m_e +
m_R ≤ 1.

We may add that all the above-described modifications may be used
in Algorithm 5.1, without impairing the results of Section 5.

7. Extension to Nonconvex Unconstrained Problems.

In this section we shall present bundle methods for the problem
of minimizing a locally Lipschitzian function f : R^N → R that is not
necessarily convex or differentiable. As before, we suppose that for
any x ∈ R^N one can find in finite time a subgradient g_f(x) ∈ ∂f(x) of
f at x.

Our extension of the bundle method from Section 3 to the noncon-
vex case will use the techniques for dealing with nonconvexity deve-
loped in Chapter 3. The relevant features of this approach are the follo-
wing.

First, we observe that for convex f the k-th dual search direc-
tion finding subproblem (2.5) uses automatic weighing of the past sub-
gradients g^j by their linearization errors α_j^k = f(x^k) - f_j^k ≥ 0 in the
sense that g^j contributes to d^k with a relatively small weight
λ_j^k ≥ 0 if the value of α_j^k is large. This important property is ensu-
red by the last constraint of (2.5) and the nonnegativity of α_j^k.

Secondly, we recall that in the convex case linearization errors
may serve as subgradient locality measures in the sense that

   g^j ∈ ∂_ε f(x^k)  for  ε = α_j^k ≥ 0,

so that g^j = g_f(y^j) ∈ ∂f(y^j) is close to ∂f(x^k) if α_j^k is small, even
if y^j is far from x^k. This property need not, of course, hold if f is
nonconvex. Therefore, in Chapter 3 we used subgradient locality mea-
sures of the form

   α(x,y) = max{|f(x) - f̄(x;y)|, γ|x-y|^2}                              (7.1)

for indicating how much the subgradient g_f(y) differs from being a
member of ∂f(x), where f(x) - f̄(x;y) is the error with which the lineariza-
tion of f calculated at y

   f̄(x;y) = f(y) + ⟨g_f(y), x-y⟩                                        (7.2)

approximates f at x, and γ is a positive parameter, which may be set
to zero when f is convex. To save storage, we approximated α(x^k, y^j) by

   α_j^k = max{|f(x^k) - f_j^k|, γ(s_j^k)^2},                            (7.3)

where f_j^k = f̄(x^k; y^j) and

   s_j^k = |y^j - x^j| + Σ_{i=j}^{k-1} |x^{i+1} - x^i|

is an overestimate of |x^k - y^j|.
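The locality measure (7.3) and the running distance estimate can be sketched as follows; the function names and argument layout are assumptions of this illustration.

```python
import numpy as np

def locality_measure(f_x, f_j, s_j, gamma):
    """Sketch of (7.3): alpha_j^k = max{|f(x^k) - f_j^k|, gamma*(s_j^k)^2},
    where f_x = f(x^k), f_j is the linearization value f_j^k, s_j = s_j^k."""
    return max(abs(f_x - f_j), gamma * s_j ** 2)

def update_distance(s_j, x_new, x_old):
    """Distance update s_j^{k+1} = s_j^k + |x^{k+1} - x^k| (cf. (7.10)),
    which keeps s_j^k an overestimate of |x^k - y^j| without storing y^j."""
    return s_j + float(np.linalg.norm(np.asarray(x_new) - np.asarray(x_old)))
```

With `gamma = 0` (the convex case) the measure collapses to the plain linearization error magnitude, as the text notes.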

Thirdly, in Chapter 3 we used a simple resetting strategy involv-
ing, at the k-th iteration, the locality radius a^k that estimated
the radius of the ball around x^k from which past subgradients g^j had
been collected to form the k-th aggregate subgradient. This strategy en-
sured locally uniform boundedness of such subgradients, which was impor-
tant for global convergence under no additional boundedness assumptions.

In view of the above remarks, it is natural to extend Algorithm
3.1 to the nonconvex case by using subgradient locality measures of the
form (7.2)-(7.3) and the resetting strategy of Chapter 3. The resulting
method is given below.

Algorithm 7.1.

Step 0 (Initialization). Select the starting point x^1 ∈ R^N, a stopping
parameter ε_s ≥ 0 and an approximation tolerance e_a > 0. Choose positi-
ve line search parameters m_L, m_R, m_e, t̄ and t̃ satisfying m_L < m_R < 1,
m_e < 1 and t̄ ≤ 1 ≤ t̃. Select a distance measure parameter γ > 0 (γ = 0 if
f is convex) and a resetting tolerance ā > 0. Set J^1 = {1}, y^1 = x^1, p^0 = g^1 =
g_f(y^1), f_p^1 = f_1^1 = f(y^1), s_p^1 = s_1^1 = 0 and e^1 = e_a. Set a^1 = 0 and the reset indi-
cator r_a^1 = 1.
Set the counters k = 1, l = 0 and k(0) = 1.

Step 1 (Direction finding). Find multipliers λ_j^k, j ∈ J^k, and λ_p^k that
solve the k-th dual subproblem

   minimize   (1/2)|Σ_{j∈J^k} λ_j g^j + λ_p p^{k-1}|^2,

   subject to λ_j ≥ 0, j ∈ J^k,  λ_p ≥ 0,  Σ_{j∈J^k} λ_j + λ_p = 1,  λ_p = 0 if r_a^k = 1,   (7.4)

              Σ_{j∈J^k} λ_j α_j^k + λ_p α_p^k ≤ e^k,

where α_j^k are given by (7.3) and

   α_p^k = max{|f(x^k) - f_p^k|, γ(s_p^k)^2}.                            (7.5)

Set

   (p^k, f̃_p^k, s̃_p^k) = Σ_{j∈J^k} λ_j^k (g^j, f_j^k, s_j^k) + λ_p^k (p^{k-1}, f_p^k, s_p^k),   (7.6)

d^k = -p^k and v^k = -|p^k|^2. If λ_p^k = 0, set a^k = max{s_j^k : j ∈ J^k}.

Step 2 (Stopping criterion). Set

   α̃_p^k = max{|f(x^k) - f̃_p^k|, γ(s̃_p^k)^2}.                          (7.7)

If max{|p^k|^2, α̃_p^k} ≤ ε_s, terminate; otherwise, continue.

Step 3 (Approximation tolerance decreasing). If |p^k|^2 > α̃_p^k then go to
Step 4; otherwise, i.e. if |p^k|^2 ≤ α̃_p^k, replace e^k by m_e e^k and go to
Step 1.

Step 4 (Line search). By a line search procedure (e.g. Line Search Pro-
cedure 3.2), find two stepsizes t_L^k and t_R^k such that 0 ≤ t_L^k ≤ t_R^k ≤ t̃
and t_R^k = t_L^k if t_L^k > 0, and such that the two corresponding points defin-
ed by x^{k+1} = x^k + t_L^k d^k and y^{k+1} = x^k + t_R^k d^k satisfy

   f(x^{k+1}) ≤ f(x^k) + m_L t_L^k v^k,                                  (7.8a)

   t_L^k ≥ t̄ or α(x^k, x^{k+1}) > m_e e^k  if t_L^k > 0,                 (7.8b)

   α(x^k, y^{k+1}) ≤ m_e e^k  if t_L^k = 0,                              (7.8c)

   ⟨g_f(y^{k+1}), d^k⟩ ≥ m_R v^k  if t_L^k = 0,                          (7.8d)

where α(x,y) is defined by (7.1).

Step 5 (Approximation tolerance updating). If t_L^k = 0 (null step), set
e^{k+1} = e^k; otherwise, i.e. if t_L^k > 0 (serious step), choose e^{k+1} ≥ e_a
(e.g. by (6.1)), set k(l+1) = k+1, and increase l by 1.

Step 6 (Linearization updating). Choose Ĵ^k ⊂ J^k such that the set J^{k+1} =
Ĵ^k ∪ {k+1} contains k(l), i.e. k(l) ∈ Ĵ^k if k(l) < k+1. Set g^{k+1} =
g_f(y^{k+1}) and compute

   f_{k+1}^{k+1} = f(y^{k+1}) + ⟨g^{k+1}, x^{k+1} - y^{k+1}⟩,

   f_j^{k+1} = f_j^k + ⟨g^j, x^{k+1} - x^k⟩  for j ∈ Ĵ^k,                (7.9)

   f_p^{k+1} = f̃_p^k + ⟨p^k, x^{k+1} - x^k⟩,

   s_{k+1}^{k+1} = |y^{k+1} - x^{k+1}|,

   s_j^{k+1} = s_j^k + |x^{k+1} - x^k|  for j ∈ Ĵ^k,                     (7.10)

   s_p^{k+1} = s̃_p^k + |x^{k+1} - x^k|.

Step 7 (Distance resetting test). Set a^{k+1} = max{a^k + |x^{k+1} - x^k|, s_{k+1}^{k+1}}. If
a^{k+1} ≤ ā or t_L^k = 0 then set r_a^{k+1} = 0 and go to Step 9; otherwise, set
r_a^{k+1} = 1 and go to Step 8.

Step 8 (Distance resetting). Delete from J^{k+1} all indices j with
s_j^{k+1} > ā/2, and set a^{k+1} = max{s_j^{k+1} : j ∈ J^{k+1}}.

Step 9. Increase k by 1 and go to Step 1.

A few comments on the method are in order.

The subgradient selection and aggregation rules of the method are
borrowed from Algorithm 3.3.1, hence the properties of the aggregate
subgradients (p^k, f̃_p^k, s̃_p^k) may be deduced from Lemmas 3.4.1-3.4.3. More-
over, since the stationarity measure

   w^k = (1/2)|p^k|^2 + α̃_p^k

satisfies

   w^k ≤ 2 max{|p^k|^2, α̃_p^k}  and  max{|p^k|^2, α̃_p^k} ≤ 2w^k,        (7.11)

the stopping criterion of the method may be interpreted similarly to
that of Algorithm 3.3.1; see Section 3.3. Also we have (3.5) and (3.7)
if f happens to be convex.

The method updates the approximation tolerances e^k according to
the modified rules of Section 6. Note that we still have α̃_p^k ≤ e^k for
all k, since Lemma 3.4.8 yields a suitable extension of relation (3.8).
We do not use the seemingly more efficient strategy of Algorithm 6.1,
because it may impair global convergence in the nonconvex case. We may
add that in practice one should set e_a equal to a small positive num-
ber, e.g. e_a = ε_s = 10^{-6}.

Line Search Procedure 3.2 may be used for executing Step 4 of the
method. It is easy to verify that this procedure will terminate in a fi-
nite number of iterations if f has the semismoothness property (3.3.23);
see the proof of Lemma 3.3.

The subgradient deletion rules of Steps 7 and 8 differ slightly
from those of Algorithm 3.3.1 in that no distance reset can occur af-
ter a null step. At the same time, we still have (3.13), since s_{k(l)}^k =
|x^k - y^{k(l)}| = |x^k - x^{k(l)}| = 0, while in Step 8 s_{k+1}^{k+1} = |y^{k+1} - x^{k+1}| = 0,
since t_R^k = t_L^k > 0, so k+1 cannot be deleted from J^{k+1}.

Let us now analyze convergence of the method. To this end, we shall
need the following analogue of Lemma 4.3.

Lemma 7.2. Suppose that Algorithm 7.1 generates an infinite sequence
{x^k} such that x^k →_K x̄, |p^k| →_K 0 and α̃_p^k →_K 0 for some x̄ and
an infinite set K ⊂ {1,2,...}. Then 0 ∈ ∂f(x̄).

Proof. Inspecting the proofs of Lemmas 3.4.1, 3.4.2, 3.4.6 and 3.4.7,
we see that 0 ∈ ∂f(x̄) if {a^k}_{k∈K} is bounded. Therefore we need only
show that {a^k}_{k∈K} is bounded. Let k(l) ≤ k < k(l+1), so that x^j = x^k
if k(l) ≤ j ≤ k. From the proof of Lemma 3.4.1 we deduce that the s_j^k-s
which form a^k = max{s_j^k : j ∈ J^k} may be divided into two disjoint groups.
The first group comprises s_j^k ≤ ā with j ≤ k(l), since the rules of
Steps 7 and 8 ensure a^{k(l)} ≤ ā after a serious step, while s_j^k stay
constant between serious steps. Since the second group contains s_j^k =
|y^j - x^j| = t_R^j|d^j| ≤ t̃|p^j| with x^j = x^k and k(l) < j ≤ k, while x^k →_K x̄,
we deduce from the proof of Lemma 4.5 that such s_j^k-s are uniformly
bounded for k ∈ K, because so are the corresponding p^j-s. Hence {a^k}_{k∈K}
is bounded, as desired.

Combining the above proof with the proof of Lemma 4.2, we deduce
that Lemma 4.2 holds for Algorithm 7.1. Similarly we obtain that if the
method stops at the k-th iteration and ε_s = 0, then 0 ∈ ∂f(x^k). Next,
one easily checks that the proofs of Lemmas 4.4-4.7 need not be modifi-
ed, so Theorem 4.8 and Corollary 4.11 are true for Algorithm 7.1. More-
over, if f is convex then f(x^k) - f̃_p^k ≤ α̃_p^k for all k (see Lemma 3.4.2),
hence in the convex case Theorem 4.9 and Corollary 4.10 hold for Algo-
rithm 7.1. We conclude that Algorithm 7.1 is a globally convergent
method.

We may add that one may modify the line search criteria of the
method by replacing v^k in (7.8) by ṽ^k defined by (6.5). This modifica-
tion will not impair the preceding global convergence results, since the
proof of Lemma 6.4 remains valid.

8. Bundle Methods for Convex Constrained Problems.

In this section we shall present bundle methods for solving the
following convex minimization problem

   minimize f(x),  subject to F(x) ≤ 0,                                  (8.1)

where the functions f : R^N → R and F : R^N → R are convex, but not nece-
ssarily differentiable. We assume that the Slater constraint qualifica-
tion is fulfilled, i.e. F(x̂) < 0 for some x̂ ∈ R^N, so that the feasible
set

   S = {x ∈ R^N : F(x) ≤ 0}

has a nonempty interior. Moreover, we suppose that we have a finite pro-
cess for calculating f(x) and a subgradient g_f(x) ∈ ∂f(x) of f at
each x ∈ S, and F(x) and a subgradient g_F(x) ∈ ∂F(x) of F at each
x ∈ R^N \ S. For simplicity of exposition, we shall initially assume that
one can compute f(x), g_f(x), F(x) and g_F(x) at any x.

Our extension of the bundle methods from Sections 3 and 5 to the
constrained case will use the approach of Chapter 5. To this end, we re-
call from Lemma 5.2.1 that in terms of the improvement function

   H(y;x) = max{f(y) - f(x), F(y)}                                       (8.2)

and its subdifferential ∂H(y;x) at y (for a fixed x)

              ∂f(y)                 if f(y) - f(x) > F(y),

   ∂H(y;x) =  conv{∂f(y) ∪ ∂F(y)}   if f(y) - f(x) = F(y),               (8.3)

              ∂F(y)                 if f(y) - f(x) < F(y),

a necessary and sufficient condition for a point x̄ ∈ R^N to minimize
f on S is given by each of the following equivalent relations

   min{H(y;x̄) : y ∈ R^N} = H(x̄;x̄) = 0,                                 (8.4)

   0 ∈ ∂H(x̄;x̄).

Thus testing if a point x ∈ S is optimal may be done by trying to
find a direction of descent for the convex function H(·;x) at x. If
such a direction exists, then moving from x along this direction will
yield a better point; otherwise, x is optimal.
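The improvement function (8.2) and one convenient selection from its subdifferential (8.3) can be sketched directly; the function names here are illustrative, and on the tie f(y) - f(x) = F(y) any convex combination of the two subgradients would also be valid.

```python
def improvement_value(y, x, f, F):
    """The improvement function (8.2): H(y;x) = max{f(y) - f(x), F(y)}."""
    return max(f(y) - f(x), F(y))

def improvement_subgradient(y, x, f, F, gf, gF):
    """One element of dH(y;x) as described by (8.3): the f-subgradient
    when f(y) - f(x) >= F(y) (a valid choice on the tie as well), the
    F-subgradient otherwise."""
    return gf(y) if f(y) - f(x) >= F(y) else gF(y)
```

For a feasible x with H(x;x) = 0, a negative value H(y;x) < 0 certifies that y both improves f and stays feasible, which is what the descent test exploits.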
The above remarks suggest the following extension of the bundle
methods. Suppose that at the k-th iteration we have a point x^k ∈ S.
We would like to find a direction of descent for the convex function

   H^k(x) = max{f(x) - f(x^k), F(x)}

at x^k. Treating H^k as our temporary objective function to be mini-
mized, we may proceed as in Section 2 by finding a direction d^k =
-Nr G^k(e), where G^k(e) ⊂ ∂_e H^k(x^k) is a certain approximation to
∂H^k(x^k) and e ≥ 0 is the approximation tolerance. For constructing
G^k(e) we may use past subgradients g_f^j = g_f(y^j) and g_F^j = g_F(y^j) calcu-
lated at trial points y^j, j = 1,...,k, and the corresponding lineari-
zations

   f_j(x) = f(y^j) + ⟨g_f^j, x - y^j⟩,

   F_j(x) = F(y^j) + ⟨g_F^j, x - y^j⟩

of f and F, respectively. We recall from Lemma 5.7.2 that g_f^j and g_F^j
may be regarded as ε-subgradients of H^k at x^k, since in terms of
the linearization errors

   α_{f,j}^k = f(x^k) - f_j^k  and  α_{F,j}^k = -F_j^k,                  (8.6)

where f_j^k = f_j(x^k) and F_j^k = F_j(x^k), we have (x^k ∈ S)

   g_f^j ∈ ∂_ε H^k(x^k)  for  ε = α_{f,j}^k,

   g_F^j ∈ ∂_ε H^k(x^k)  for  ε = α_{F,j}^k.

It follows, as in Section 2, that for any e ≥ 0 the convex polyhedron

   G^k(e) = {g ∈ R^N : g = Σ_{j∈J_f^k} λ_j g_f^j + Σ_{j∈J_F^k} μ_j g_F^j,

             Σ_{j∈J_f^k} λ_j α_{f,j}^k + Σ_{j∈J_F^k} μ_j α_{F,j}^k ≤ e,

             λ_j ≥ 0, j ∈ J_f^k,  μ_j ≥ 0, j ∈ J_F^k,  Σ_{j∈J_f^k} λ_j + Σ_{j∈J_F^k} μ_j = 1},   (8.7)

where J_f^k ∪ J_F^k ⊂ {1,...,k}, satisfies G^k(e) ⊂ ∂_e H^k(x^k). Hence finding d^k
by computing p^k = Nr G^k(e^k) and setting d^k = -p^k corresponds to the
construction used by the bundle method with subgradient selection from
Section 5.

To derive a bundle method with subgradient aggregation, suppose
that at the k-th iteration we have two (k-1)-st aggregate subgradients

   (p_f^{k-1}, f_p^k) ∈ conv{(g_f^j, f_j^k) : j = 1,...,k-1},

   (p_F^{k-1}, F_p^k) ∈ conv{(g_F^j, F_j^k) : j = 1,...,k-1}

of f and F, respectively. Once again Lemma 5.7.2 yields that

   p_f^{k-1} ∈ ∂_ε H^k(x^k)  for  ε = α_{f,p}^k,

   p_F^{k-1} ∈ ∂_ε H^k(x^k)  for  ε = α_{F,p}^k,

with the linearization errors

   α_{f,p}^k = f(x^k) - f_p^k  and  α_{F,p}^k = -F_p^k.                  (8.8)

Hence for e ≥ 0 the polyhedron

   G_a^k(e) = {g ∈ R^N : g = Σ_{j∈J_f^k} λ_j g_f^j + λ_p p_f^{k-1} + Σ_{j∈J_F^k} μ_j g_F^j + μ_p p_F^{k-1},

             Σ_{j∈J_f^k} λ_j α_{f,j}^k + λ_p α_{f,p}^k + Σ_{j∈J_F^k} μ_j α_{F,j}^k + μ_p α_{F,p}^k ≤ e,

             λ_j ≥ 0, j ∈ J_f^k,  λ_p ≥ 0,  μ_j ≥ 0, j ∈ J_F^k,  μ_p ≥ 0,

             Σ_{j∈J_f^k} λ_j + λ_p + Σ_{j∈J_F^k} μ_j + μ_p = 1}                                  (8.9)

satisfies G_a^k(e) ⊂ ∂_e H^k(x^k). Therefore finding d^k by calculating p^k =
Nr G_a^k(e^k) and setting d^k = -p^k corresponds to the construction used
by the bundle method with subgradient aggregation from Section 3.


Moreover, v^k = -|p^k|^2 may be regarded as an approximate directional de-
rivative of H^k at x^k in the direction d^k; cf. (3.10).

Having derived the search direction finding subproblems, we may
state the bundle method with subgradient aggregation for solving pro-
blem (8.1).

Algorithm 8.1.

Step 0 (Initialization). Select a starting point x^1 ∈ S, a stopping
parameter ε_s ≥ 0 and an approximation tolerance e_a > 0. Choose posi-
tive line search parameters m_L, m_R, m_e, t̄ and t̃ satisfying m_L <
m_R < 1, m_e < 1 and t̄ ≤ 1 ≤ t̃. Set J_f^1 = J_F^1 = {1}, y^1 = x^1, p_f^0 = g_f^1 = g_f(y^1),
p_F^0 = g_F^1 = g_F(y^1), f_p^1 = f_1^1 = f(y^1), F_p^1 = F_1^1 = F(y^1), and e^1 = e_a. Set the counters
k = 1, l = 0 and k(0) = 1.

Step 1 (Direction finding). Find multipliers λ_j^k, j ∈ J_f^k, λ_p^k, μ_j^k, j ∈
J_F^k, and μ_p^k that solve the k-th dual subproblem

   minimize   (1/2)|Σ_{j∈J_f^k} λ_j g_f^j + λ_p p_f^{k-1} + Σ_{j∈J_F^k} μ_j g_F^j + μ_p p_F^{k-1}|^2,

   subject to λ_j ≥ 0, j ∈ J_f^k,  λ_p ≥ 0,  μ_j ≥ 0, j ∈ J_F^k,  μ_p ≥ 0,
                                                                         (8.10)
              Σ_{j∈J_f^k} λ_j + λ_p + Σ_{j∈J_F^k} μ_j + μ_p = 1,

              Σ_{j∈J_f^k} λ_j α_{f,j}^k + λ_p α_{f,p}^k + Σ_{j∈J_F^k} μ_j α_{F,j}^k + μ_p α_{F,p}^k ≤ e^k.

Compute the scaled multipliers

   ν_f^k = Σ_{j∈J_f^k} λ_j^k + λ_p^k  and  ν_F^k = Σ_{j∈J_F^k} μ_j^k + μ_p^k,

   λ̃_j^k = λ_j^k/ν_f^k for j ∈ J_f^k  and  λ̃_p^k = λ_p^k/ν_f^k  if ν_f^k ≠ 0,

   λ̃_k^k = 1,  λ̃_j^k = 0 for j ∈ J_f^k \ {k}  and  λ̃_p^k = 0  if ν_f^k = 0,     (8.11)

   μ̃_j^k = μ_j^k/ν_F^k for j ∈ J_F^k  and  μ̃_p^k = μ_p^k/ν_F^k  if ν_F^k ≠ 0,

   μ̃_k^k = 1,  μ̃_j^k = 0 for j ∈ J_F^k \ {k}  and  μ̃_p^k = 0  if ν_F^k = 0.

Calculate the aggregate subgradients

   (p_f^k, f̃_p^k) = Σ_{j∈J_f^k} λ̃_j^k (g_f^j, f_j^k) + λ̃_p^k (p_f^{k-1}, f_p^k),
                                                                         (8.12)
   (p_F^k, F̃_p^k) = Σ_{j∈J_F^k} μ̃_j^k (g_F^j, F_j^k) + μ̃_p^k (p_F^{k-1}, F_p^k),

   p^k = ν_f^k p_f^k + ν_F^k p_F^k,                                      (8.13)

and set d^k = -p^k and v^k = -|p^k|^2.

Step 2 (Stopping criterion). Set

   α̃_{f,p}^k = f(x^k) - f̃_p^k  and  α̃_{F,p}^k = -F̃_p^k,               (8.14)

   α̃_p^k = ν_f^k α̃_{f,p}^k + ν_F^k α̃_{F,p}^k.                          (8.15)

If max{|p^k|^2, α̃_p^k} ≤ ε_s, terminate; otherwise, continue.

Step 3 (Approximation tolerance decreasing). If |p^k|^2 > α̃_p^k then go to
Step 4. Otherwise, i.e. if |p^k|^2 ≤ α̃_p^k, replace e^k by m_e e^k and go
to Step 1.

Step 4 (Line search). By a line search procedure as given below, find
three not necessarily different stepsizes t_L^k, t_R^k and t_B^k such that
0 ≤ t_L^k ≤ t_R^k ≤ t̃ and t_L^k ≤ t_B^k ≤ 10 t_L^k, and t_R^k = t_L^k if t_L^k > 0, and such that
the three corresponding points defined by x^{k+1} = x^k + t_L^k d^k, y^{k+1} = x^k + t_R^k d^k
and ȳ^{k+1} = x^k + t_B^k d^k satisfy

   f(x^{k+1}) ≤ f(x^k) + m_L t_L^k v^k,                                  (8.16a)

   F(x^{k+1}) ≤ 0,                                                       (8.16b)

   t_L^k ≥ t̄ or max{α(x^k, x^{k+1}), α(x^k, ȳ^{k+1})} > m_e e^k  if t_L^k > 0,   (8.16c)

   α(x^k, y^{k+1}) ≤ m_e e^k  if t_L^k = 0,                              (8.16d)

   ⟨g(y^{k+1}), d^k⟩ ≥ m_R v^k  if t_L^k = 0,                            (8.16e)

where

   g(y) = g_f(y) and α(x,y) = f(x) - f(y) - ⟨g_f(y), x-y⟩  if y ∈ S,
                                                                         (8.17)
   g(y) = g_F(y) and α(x,y) = -F(y) - ⟨g_F(y), x-y⟩  if y ∉ S.

Step 5 (Approximation tolerance updating). If t_L^k = 0 (null step), set
e^{k+1} = e^k; otherwise, i.e. if t_L^k > 0 (serious step), choose e^{k+1} ≥ e_a
(e.g. by (6.1)), set k(l+1) = k+1, and increase l by 1.

Step 6 (Linearization updating). Choose Ĵ_f^k ⊂ J_f^k and Ĵ_F^k ⊂ J_F^k such that
k(l) ∈ Ĵ_f^k if k(l) < k+1, and set J_f^{k+1} = Ĵ_f^k ∪ {k+1} and J_F^{k+1} = Ĵ_F^k ∪ {k+1}.
Set g_f^{k+1} = g_f(y^{k+1}), g_F^{k+1} = g_F(y^{k+1}) and calculate f_j^{k+1}, j ∈ J_f^{k+1}, f_p^{k+1},
F_j^{k+1}, j ∈ J_F^{k+1}, and F_p^{k+1} by (6.3.14).

Step 7. Increase k by 1 and go to Step 1.

A few remarks on the algorithm are in order.

The subgradient aggregation rules of the method are taken from Al-
gorithm 5.3.1. Hence the properties of the aggregate subgradients are
expressed by Lemmas 5.4.1, 5.4.2 and 5.7.2. In particular, we have

   p^k ∈ ∂_ε H(x^k; x^k)  for  ε = α̃_p^k.                               (8.18)

Therefore, if the algorithm terminates at the k-th iteration then
p^k ∈ ∂_{ε_s} H(x^k; x^k) and |p^k|^2 ≤ ε_s. This estimate justifies our stopping
criterion and shows that x^k is stationary for f on S if ε_s = 0, since
stationary points x̄ satisfy 0 ∈ ∂H(x̄;x̄) and are optimal for problem
(8.1).

The method updates the approximation tolerances e^k according to the
modified rules of Section 6. Note that we always have α̃_p^k ≤ e^k, since
the proof of Lemma 6.4.6 yields a suitable extension of relation (3.8).
It is worth adding that the method may also use the efficient strategy
of Algorithm 6.1 for regulating e^k, as will be shown below.

The line search criteria (8.16) extend (3.2) to the constrained
case in a way that ensures monotonicity (f(x^{k+1}) ≤ f(x^k)), feasibility
({x^k} ⊂ S), sufficiently large t_L^k at serious steps, and a significant
enrichment of G_a^k(e^k) after a null step (one may derive (3.11) from
(8.16d,e) and (8.17) by defining

   g^k = g_f(y^k)  and  α_k^k = α_{f,k}^k  if y^k ∈ S,
                                                                         (8.19)
   g^k = g_F(y^k)  and  α_k^k = α_{F,k}^k  if y^k ∉ S,

and noting that

   g^k = g(y^k)  and  α_k^k = α(x^{k-1}, y^k)  if t_L^{k-1} = 0,         (8.20)
for all k > 1). The nontrivial aspect of this extension consists in al-
lowing for a serious step when α(x^k, ȳ^{k+1}) > m_e e^k, which indicates, by
the properties of the function α and the fact that m_e e^k is positive,
that |ȳ^{k+1} - x^k|, and hence also |x^{k+1} - x^k|, are sufficiently large, so
that significant progress occurs. This follows from the fact that by
construction

   |x^{k+1} - x^k| ≤ |ȳ^{k+1} - x^k| ≤ 10|x^{k+1} - x^k|.                (8.21)

The reason for our introduction of the additional stepsize t_B^k will
become clear in the analysis of convergence of the following procedure
for finding stepsizes t_L = t_L^k, t_R = t_R^k and t_B = t_B^k to meet the require-
ments of Step 4 with x = x^k, d = d^k, e = e^k and v = v^k.
Line Search Procedure 8.2.

(a) Set t_L = 0 and t = t_U = 1.

(b) If F(x+td) ≤ 0 and f(x+td) ≤ f(x) + m_L t v, set t_L = t; otherwise set
t_U = t.

(c) If t_L > 0 and either t_L ≥ t̄ or max{α(x, x+t_L d), α(x, x+t_U d)} > m_e e,
set t_R = t_L and t_B = t_U, and return.

(d) If α(x, x+td) ≤ m_e e and ⟨g(x+td), d⟩ ≥ m_R v, set t_R = t and t_L =
t_B = 0, and return.

(e) Choose t ∈ [t_L + 0.1(t_U - t_L), t_U - 0.1(t_U - t_L)] by some interpolation
procedure and go to (b).
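Under the standing assumptions (v = -|p|^2 < 0 and m_e·e > 0), the procedure can be sketched in Python as follows. This is only an illustration: plain bisection stands in for the interpolation of step (e) (bisection always lies in the prescribed interval), the iteration cap is an assumption of this sketch, and all argument names are chosen here.

```python
import numpy as np

def line_search_82(x, d, v, e, f, F, gf, gF, mL, mR, me, t_bar, max_iter=100):
    """Sketch of Line Search Procedure 8.2 for the constrained bundle
    method; returns (t_L, t_R, t_B)."""
    def alpha(z):                       # linearization error, as in (8.17)
        if F(z) <= 0:                   # z in S: use f
            return f(x) - f(z) - gf(z) @ (x - z)
        return -F(z) - gF(z) @ (x - z)  # z outside S: use F

    def g(z):                           # subgradient choice of (8.17)
        return gf(z) if F(z) <= 0 else gF(z)

    tL, t, tU = 0.0, 1.0, 1.0           # step (a)
    for _ in range(max_iter):
        z = x + t * d
        if F(z) <= 0 and f(z) <= f(x) + mL * t * v:   # step (b)
            tL = t
        else:
            tU = t
        if tL > 0 and (tL >= t_bar or
                       max(alpha(x + tL * d), alpha(x + tU * d)) > me * e):
            return tL, tL, tU           # step (c): serious step, tR = tL, tB = tU
        if alpha(x + t * d) <= me * e and g(x + t * d) @ d >= mR * v:
            return 0.0, t, 0.0          # step (d): null step, tR = t
        t = tL + 0.5 * (tU - tL)        # step (e), bisection variant
    raise RuntimeError("no convergence within max_iter")
```

For instance, minimizing f(z) = |z|^2 over the ball F(z) = |z|^2 - 4 ≤ 0 from x = (1) with d = -∇f(x) accepts the step t = 1/2 as a serious step.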

Lemma 8.3. Line Search Procedure 8.2 terminates in a finite number of
iterations with stepsizes t_L = t_L^k, t_R = t_R^k and t_B = t_B^k satisfying the re-
quirements of Step 4 of Algorithm 8.1.

Proof. We shall use a combination of the proofs of Lemmas 3.3 and 6.3.3.
Assume, for contradiction purposes, that the search does not terminate.
Let

   T_L = {t ≥ 0 : f(x+td) ≤ f(x) + m_L t v and F(x+td) ≤ 0}.

Denote by t^i, t_L^i and t_U^i the values of t, t_L and t_U after the
i-th execution of step (b) of the procedure, and let I = {i : t^i = t_U^i},
I_f = {i : t^i = t_U^i and F(x+t^i d) ≤ 0} and I_F = {i : t^i = t_U^i and F(x+t^i d) > 0},
so that I = I_f ∪ I_F. We deduce, as in the proofs of Lemmas 3.3 and 6.3.3,
the existence of t̂ ≥ 0 such that t_L^i ↑ t̂, t_U^i ↓ t̂ and t̂ ∈ T_L, and that the
set I is infinite. We shall consider the following two cases.

(i) Suppose that t̂ > 0. Then, since t_L^i ↑ t̂, we have t_L^i > 0 for large i,
and, since t^i ∈ {t_L^i, t_U^i} for all i, the rules of step (c) imply that
step (d) is entered with α(x, x+t^i d) ≤ m_e e for large i. Therefore in
step (d)

   ⟨g(x+t^i d), d⟩ < m_R v  for all large i.                             (8.22)

(ii) Suppose that t̂ = 0. Then t_L^i = 0 and t^i = t_U^i for all i, so t^i ↓ 0
and, by (8.17) and the local boundedness of g_f and g_F, we have
α(x, x+t^i d) → 0. Hence α(x, x+t^i d) ≤ m_e e at step (d) for large i, because
m_e e > 0, so we again obtain (8.22).

Making use of the above results, one may proceed as in the proof of
Lemma 6.3.3 to derive a contradiction between (8.22) and the fact that
f and F, being convex functions, have the semismoothness properties
(3.3.23) and (6.3.18) (see Remarks 3.3.4 and 6.3.4). Therefore the se-
arch terminates. It is easy to show that (8.16) holds at termination.
Also the first positive t_L must satisfy t_L = t^i ≥ 0.1 t_U^i by the choice
of t at step (e), and then we have t_U^{i+1} ≤ t_U^i ≤ 10 t_L^{i+1}. This shows
that t_B ≤ 10 t_L at termination.

The rules for choosing J_f^{k+1} at Step 6 yield the following analogue of (3.13):

f_{k(l)}^k = f(x^k), s_{k(l)}^k = 0 and k(l) ∈ J_f^k if k(l) ≤ k < k(l+1),        (8.23)

which ensures that the constraints of the k-th subproblem (8.10) are consistent (G^k(ε^k) ≠ ∅). We may add that Remarks 5.3.4 and 6.2.3 and Section 6.7 indicate how to modify the choice of J_f^{k+1} and J_F^{k+1} and impose the additional constraint μ_p = 0 in subproblem (8.10) at certain iterations in order to treat the case when one cannot compute f(x) and g_f(x) at x ∉ S, and F(x) and g_F(x) at x ∈ S. Also the subsequent proofs may be easily modified to cover such modifications. We shall leave this task to the reader.
We shall now establish convergence of the method by modifying the analysis of Section 4 with the help of the results of Section 5.4. To this end, define the stationarity measure

w^k = ½|p^k|² + α̃_p^k                                                  (8.24a)

and observe that we always have

w^k ≤ 2 max{|p^k|², α̃_p^k} and max{|p^k|², α̃_p^k} ≤ 2w^k.              (8.24b)

We showed above that if the method terminates at the k-th iteration and ε_s = 0, then x^k solves problem (8.1). Therefore, we shall assume from now on that the method does not stop. We shall now show that Lemmas 4.2-4.7 hold for Algorithm 8.1 if one replaces in their formulations and proofs the relation 0 ∈ ∂f(x̄) by 0 ∈ ∂H(x̄; x̄).
First, we observe that Lemmas 4.2 and 4.3 may be established for Algorithm 8.1 by combining their original proofs with the proof of Lemma 5.4.7 and using (8.24). In view of the line search rules (8.16a,b), Lemma 4.4 is valid for Algorithm 8.1 with the additional assertion that F(x̄) ≤ 0, which follows from the continuity of F and the feasibility of {x^k}. In the proof of Lemma 4.5, use (8.23) and the proof of Lemma 6.4.8 in parts (ii)-(iii), and (8.19)-(8.20) for deriving (4.13) as in the proof of Lemma 5.4.11. The proof of Lemma 4.6 need not be modified. In the proof of Lemma 4.7, replace α(x^k, x^{k+1}) in (4.18)-(4.19) by max{α(x^k, x^{k+1}), α(x^k, y^{k+1})}, observing that this replacement is valid in virtue of (8.16c), (8.21) and the elementary property

α(x, y) → 0 if x, y → x̄ ∈ S,

which is a consequence of (8.17) and the local boundedness of g_f and g_F. In effect, we obtain the following analogue of Theorem 4.8.

Theorem 8.4. Every accumulation point of a sequence {x^k} generated by Algorithm 8.1 is stationary for f on S.

Combining the above theorem with the above-described extension of Lemma 4.4 and the fact that we have (8.18) and t_B^k ≤ 10 t_L^k for all k, one may use the corresponding proofs of Sections 4 and 5.4 to establish the following results.

Theorem 8.5. Every sequence {x^k} calculated by Algorithm 8.1 minimizes f on S: {x^k} ⊂ S and f(x^k) ↓ inf{f(x) : x ∈ S}. Moreover, {x^k} converges to a solution of problem (8.1) whenever problem (8.1) has any solutions.

Corollary 8.6. If problem (8.1) has a solution and the stopping parameter ε_s is positive, then Algorithm 8.1 terminates in a finite number of iterations.

Corollary 8.7. If the level set {x ∈ S : f(x) ≤ f(x¹)} is bounded and the stopping parameter ε_s is positive, then Algorithm 8.1 terminates in a finite number of iterations.

Let us now consider the method with subgradient selection. This method is obtained from Algorithm 8.1 by using the additional constraints λ_p = 0 and μ_p = 0 in subproblem (8.10) and demanding that the sets

Ĵ_f^k = {j ∈ J_f^k : λ_j^k ≠ 0} and Ĵ_F^k = {j ∈ J_F^k : μ_j^k ≠ 0}       (8.25a)

should satisfy

|Ĵ_f^k ∪ Ĵ_F^k| ≤ N+1.                                                 (8.25b)

The required multipliers λ_j^k and μ_j^k may be found by the Mifflin (1978) algorithm; see Remark 2.2.
One may easily verify that the method with subgradient selection is globally convergent in the sense of Theorem 8.5 and Corollaries 8.6-8.7. To this end, it suffices to modify the preceding convergence analysis of the method with subgradient aggregation by using (5.5.2)-(5.5.4) in the proof of an analogue of Lemma 4.5.

Algorithm 8.1 may be modified by incorporating the approximation tolerance updating strategy of Algorithm 6.1. This will not impair the preceding convergence results, since an analogue of Lemma 6.3 (in which one uses the additional assumption that x̄ ∈ S and then asserts that 0 ∈ ∂H(x̄; x̄) and x̄ ∈ S) may be established by combining the proof of Lemma 6.3 with the proofs of Lemma 5.4.14 and Theorem 5.4.15.
The line search rules of Algorithm 8.1 may be modified by replacing v^k with ṽ^k given by (6.5), where s^k is the Lagrange multiplier for the last constraint of (8.10) (see Lemma 2.1(iv)), and imposing the additional requirement m_α + m_R ≤ 1 on the choice of the line search parameters. Since subproblem (8.10) is of the form (2.5) and we have (8.19)-(8.20) with k ∈ J_f^k if y^k ∈ S and k ∈ J_F^k if y^k ∉ S, the proof of Lemma 6.4 remains valid for this modification of Algorithm 8.1. Therefore, this modification of Algorithm 8.1 retains all the convergence properties of the original method.

We may add that the two above-described modifications of Algorithm 8.1 may be incorporated in the method with subgradient selection with no essential changes in its convergence analysis.

9. Extensions to Nonconvex Constrained Problems

In this section we shall present extensions of the bundle methods of the preceding section for solving the following problem

minimize f(x), subject to F(x) ≤ 0,                                    (9.1)

where the functions f : R^N → R and F : R^N → R are locally Lipschitzian but not necessarily convex or differentiable.
Our extension of the bundle methods for convex constrained minimization to the nonconvex case will use the techniques for dealing with nonconvexity developed in Sections 6.2-6.5. Thus we only need to modify those constructions of Section 8 which depended on the convexity of the problem functions.
First, we recall that in the nonconvex case it is convenient to use the following outer approximation to the subdifferential ∂H(x;x) of the improvement function H(·;x) at x:

         ∂f(x)                   if F(x) < 0,
M(x) =   conv{∂f(x) ∪ ∂F(x)}     if F(x) = 0,                           (9.2)
         ∂F(x)                   if F(x) > 0.

Then a point x̄ ∈ R^N is stationary for f on S if it satisfies the necessary optimality condition 0 ∈ M(x̄) and x̄ ∈ S. Also for any x ∈ S and y ∈ R^N the subgradient locality measures

α_f(x,y) = max{|f(x) − f̄(x;y)|, γ_f|x−y|²},

α_F(x,y) = max{|F̄(x;y)|, γ_F|x−y|²},

defined in terms of the linearizations

f̄(x;y) = f(y) + ⟨g_f(y), x−y⟩ and F̄(x;y) = F(y) + ⟨g_F(y), x−y⟩,

indicate how much the subgradients g_f(y) ∈ ∂f(y) and g_F(y) ∈ ∂F(y) differ from being elements of M(x), respectively, where γ_f (γ_F) is a positive parameter which may be set to zero if f (F) is convex. More compactly, we may define the subgradient mapping g(·) and its locality measure α(x,·) at any x ∈ S by

g(y) = g_f(y) and α(x,y) = α_f(x,y) if y ∈ S,
                                                                        (9.3)
g(y) = g_F(y) and α(x,y) = α_F(x,y) if y ∉ S.
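For a concrete pair of functions, the mapping (9.3) is evaluated directly. The sketch below is our own example (the choices f(x) = |x₁| + x₂², F(x) = x₁² + x₂² − 1 and the parameter values are not from the text): it returns g(y) and α(x,y), using the objective data when y is feasible and the constraint data otherwise.

```python
import numpy as np

def f(x):  return abs(x[0]) + x[1] ** 2
def gf(x): return np.array([np.sign(x[0]) if x[0] != 0 else 1.0, 2 * x[1]])
def F(x):  return x[0] ** 2 + x[1] ** 2 - 1.0
def gF(x): return np.array([2 * x[0], 2 * x[1]])

def g_and_alpha(x, y, gamma_f=0.0, gamma_F=0.0):
    """Subgradient mapping g(.) and locality measure alpha(x,.) of (9.3)."""
    if F(y) <= 0:                       # y in S: linearize f at y
        lin = f(y) + gf(y) @ (x - y)    # f-bar(x;y)
        return gf(y), max(abs(f(x) - lin), gamma_f * np.linalg.norm(x - y) ** 2)
    lin = F(y) + gF(y) @ (x - y)        # y outside S: F-bar(x;y)
    return gF(y), max(abs(lin), gamma_F * np.linalg.norm(x - y) ** 2)
```

With x = (0,0) and the feasible trial y = (0.5, 0), the linearization of f passes through f(x), so α = 0; with the infeasible y = (2, 0) one gets F̄(x;y) = 3 + 4·(−2) = −5 and α = 5, i.e. g_F(y) is a poor approximation of an element of M(x).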

Using the above concepts and incorporating the subgradient aggregation techniques of Algorithm 6.3.1 into Algorithm 8.1, we obtain the following method.

Algorithm 9.1.

Step 0 (Initialization). Select a starting point x¹ ∈ S, a stopping parameter ε_s ≥ 0 and an approximation tolerance ε_a ≥ 0. Choose positive line search parameters m_L, m_R, m_α, ζ and t̄ satisfying m_L < m_R < 1, m_α < 1 and ζ ≤ 1 ≤ t̄. Select distance measure parameters γ_f ≥ 0 and γ_F ≥ 0 (γ_f = 0 if f is convex; γ_F = 0 if F is convex), and a positive resetting tolerance ā. Set y¹ = x¹, J_f¹ = J_F¹ = {1}, p_f⁰ = g_f¹ = g_f(y¹), p_F⁰ = g_F¹ = g_F(y¹), f_p¹ = f_1¹ = f(y¹), F_p¹ = F_1¹ = F(y¹), s_f¹ = s_F¹ = s_1¹ = 0 and ε¹ = ε_a. Set a¹ = 0 and r_a¹ = 1. Set the counters k = 1, l = 0 and k(0) = 1.

Step 1 (Direction finding). Find multipliers λ_j^k, j ∈ J_f^k, λ_p^k, μ_j^k, j ∈ J_F^k, and μ_p^k that

minimize  ½ | Σ_{j ∈ J_f^k} λ_j g_f^j + λ_p p_f^{k−1} + Σ_{j ∈ J_F^k} μ_j g_F^j + μ_p p_F^{k−1} |²  over (λ, μ),

subject to  λ_j ≥ 0, j ∈ J_f^k,  λ_p ≥ 0,  μ_j ≥ 0, j ∈ J_F^k,  μ_p ≥ 0,

Σ_{j ∈ J_f^k} λ_j + λ_p + Σ_{j ∈ J_F^k} μ_j + μ_p = 1,                  (9.4)

λ_p = μ_p = 0 if r_a^k = 1,

Σ_{j ∈ J_f^k} λ_j α_{f,j}^k + λ_p α_{f,p}^k + Σ_{j ∈ J_F^k} μ_j α_{F,j}^k + μ_p α_{F,p}^k ≤ ε^k,

where the subgradient locality measures are defined by

α_{f,j}^k = max{|f(x^k) − f_j^k|, γ_f(s_j^k)²} and α_{F,j}^k = max{|F_j^k|, γ_F(s_j^k)²},
                                                                        (9.5)
α_{f,p}^k = max{|f(x^k) − f_p^k|, γ_f(s_f^k)²} and α_{F,p}^k = max{|F_p^k|, γ_F(s_F^k)²}.

Calculate the scaled multipliers ν_f^k, λ̃_j^k, j ∈ J_f^k, λ̃_p^k, ν_F^k, μ̃_j^k, j ∈ J_F^k, and μ̃_p^k by (8.11). Compute the aggregate subgradients

(p_f^k, f̃_p^k, s̃_f^k) = Σ_{j ∈ J_f^k} λ̃_j^k (g_f^j, f_j^k, s_j^k) + λ̃_p^k (p_f^{k−1}, f_p^k, s_f^k),

(p_F^k, F̃_p^k, s̃_F^k) = Σ_{j ∈ J_F^k} μ̃_j^k (g_F^j, F_j^k, s_j^k) + μ̃_p^k (p_F^{k−1}, F_p^k, s_F^k),      (9.6)

p^k = ν_f^k p_f^k + ν_F^k p_F^k

and the corresponding locality measures

α̃_{f,p}^k = max{|f(x^k) − f̃_p^k|, γ_f(s̃_f^k)²} and α̃_{F,p}^k = max{|F̃_p^k|, γ_F(s̃_F^k)²},
                                                                        (9.7)
α̃_p^k = ν_f^k α̃_{f,p}^k + ν_F^k α̃_{F,p}^k.

Set d^k = −p^k and v^k = −|p^k|². If λ_p^k = μ_p^k = 0, set a^k = max{s_j^k : j ∈ J_f^k ∪ J_F^k}.
Step 2 (Stopping criterion). If max{|p^k|², α̃_p^k} ≤ ε_s, terminate; otherwise continue.

Step 3 (Approximation tolerance decreasing). If |p^k|² > α̃_p^k, go to Step 4; otherwise, replace ε^k by m_α ε^k and go to Step 1.

Step 4 (Line search). By a line search procedure (e.g. Line Search Procedure 8.2), find three not necessarily different stepsizes t_L^k, t_R^k and t_B^k satisfying the requirements of Step 4 of Algorithm 8.1, with g and α defined by (9.3).

Step 5 (Approximation tolerance updating). If t_L^k = 0, set ε^{k+1} = ε^k; otherwise, choose ε^{k+1} ≥ ε_a (e.g. by (6.1)), set k(l+1) = k+1, and increase l by 1.

Step 6 (Linearization updating). Choose Ĵ_f^k ⊂ J_f^k and Ĵ_F^k ⊂ J_F^k such that k(l) ∈ Ĵ_f^k if k(l) < k+1, and set J_f^{k+1} = Ĵ_f^k ∪ {k+1} and J_F^{k+1} = Ĵ_F^k ∪ {k+1}. Set g_f^{k+1} = g_f(y^{k+1}) and g_F^{k+1} = g_F(y^{k+1}). Calculate f_j^{k+1}, j ∈ J_f^{k+1}, f_p^{k+1}, F_j^{k+1}, j ∈ J_F^{k+1}, F_p^{k+1}, s_j^{k+1}, j ∈ J_f^{k+1} ∪ J_F^{k+1}, s_f^{k+1} and s_F^{k+1} by (6.3.14).

Step 7 (Distance resetting test). Set a^{k+1} = max{a^k + |x^{k+1} − x^k|, s_{k+1}^{k+1}}. If a^{k+1} ≤ ā or t_L^k = 0, then set r_a^{k+1} = 0 and go to Step 9; otherwise, set r_a^{k+1} = 1 and go to Step 8.

Step 8 (Distance resetting). Delete from J_f^{k+1} and J_F^{k+1} all indices j with s_j^{k+1} > ā/2, and set a^{k+1} = max{s_j^{k+1} : j ∈ J_f^{k+1} ∪ J_F^{k+1}}.

Step 9. Increase k by 1 and go to Step 1.
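The essential computation of Step 1 is finding the point of minimum norm in the convex hull of the stored subgradients. The sketch below is a deliberately simplified stand-in for the QP routine referenced in the text: it drops the aggregate terms and the locality-measure constraint of (9.4) and solves min ½|Gλ|² over the unit simplex by projected-gradient iterations (all names and the solver choice are ours).

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {lam >= 0, sum(lam) = 1} (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def nearest_point_in_hull(G, iters=2000):
    """Approximate p = argmin {|p|^2 : p in conv(columns of G)} by projected
    gradient on lam -> 0.5 |G lam|^2 over the unit simplex."""
    n = G.shape[1]
    lam = np.full(n, 1.0 / n)
    step = 1.0 / max(np.linalg.norm(G.T @ G, 2), 1e-12)  # 1 / Lipschitz constant
    for _ in range(iters):
        lam = project_simplex(lam - step * (G.T @ (G @ lam)))
    return G @ lam
```

For G with columns (1,0) and (−1,0) the projection is p = 0, so d = −p = 0 and the stationarity measure vanishes; for columns (2,0) and (1,0) the nearest point of the hull to the origin is (1,0).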

A few comments on the algorithm are in order.



Since the method uses the subgradient aggregation rules of Algorithm 6.3.1, the properties of the aggregate subgradients (9.6) may be deduced from Lemma 6.4.1. Moreover, the stationarity measure

w^k = ½|p^k|² + α̃_p^k                                                  (9.8a)

satisfies

w^k ≤ 2 max{|p^k|², α̃_p^k} and max{|p^k|², α̃_p^k} ≤ 2w^k,              (9.8b)

hence the stopping criterion of the method may be interpreted similarly to that of Algorithm 6.3.1; see Section 6.3.
The method updates the approximation tolerances ε^k according to the modified rules of Section 6. It is easy to see that we always have α̃_p^k ≤ ε^k as in Algorithm 3.1, since the proof of Lemma 6.4.6 provides a suitable extension of relation (3.8). We may add that, as in Section 7, we do not use the approximation tolerance updating strategy of Algorithm 6.1, because it may impair global convergence in the nonconvex case.

Line Search Procedure 8.2 may be used for executing Step 4 of the method. Using the proof of Lemma 8.3, one readily verifies that this procedure will terminate in a finite number of iterations if f and F have the semismoothness properties (3.3.23) and (6.3.18), respectively.

The subgradient deletion rules of Steps 7 and 8 are borrowed from Algorithm 7.1, i.e. they differ slightly from those of Algorithm 6.3.1 in that no distance reset can occur after a null step. Thus the latest subgradients can never be deleted, i.e. k+1 ∈ J_f^{k+1} ∩ J_F^{k+1} at Step 9, since s_{k+1}^{k+1} = |y^{k+1} − x^{k+1}| = 0 if t_L^k > 0. Moreover, as in Algorithm 7.1, we have relation (8.23), which ensures that the constraints of the k-th subproblem (9.4) are consistent (the set G^k(ε^k) defined by (8.9) is nonempty). We may add that Remarks 5.3.4 and 6.2.3 and Section 6.7 indicate how to modify the choice of J_f^{k+1} and J_F^{k+1} and impose the additional constraint μ_p = 0 in subproblem (9.4) at certain iterations in order to treat the case when one cannot, or does not want to, compute f(x) and g_f(x) at x ∉ S, and F(x) and g_F(x) at x ∈ S.

We shall now establish convergence of the method, assuming that ε_s = 0.

Theorem 9.2. Suppose that Algorithm 9.1 generates a sequence {x^k}. Then:

(i) If {x^k} is finite, then its last element x^k is stationary for f on S.

(ii) If {x^k} is infinite, then every accumulation point x̄ of {x^k} is stationary for f on S.

(iii) If f and F are convex and F(x) < 0 for some x, then {x^k} minimizes f on S, i.e. {x^k} ⊂ S and f(x^k) ↓ inf{f(x) : x ∈ S}. Moreover, {x^k} converges to a minimum point of f on S whenever f attains its infimum on S.

Proof. The proof may be constructed by modifying the analysis of Section 4 with the aid of the results of Sections 7, 8 and 6.4. Since a formal proof would involve lengthy repetitions of the preceding results, we only give an outline of the required analysis.

(i) Suppose that x = x^k is the last point generated by the method. If the algorithm stops at the k-th iteration, then max{|p^k|², α̃_p^k} ≤ ε_s = 0, so w^k = 0 by (9.8), and the proof of Lemma 6.4.3 yields the stationarity of x^k. The case when the method cycles infinitely between Steps 1 and 3 may be easily analyzed by using the proof of Lemma 4.2 and Lemma 6.4.5, which is valid for Algorithm 9.1, as will be shown below.

(ii) Suppose that {x^k} is infinite and has an accumulation point x̄. Reasoning as in the proof of Lemma 7.2, one may deduce from (8.23) and the proofs of Lemmas 6.4.1 and 6.4.5 that Lemma 6.4.5 is true for Algorithm 9.1. This yields, by (9.8), an analogue of Lemma 4.3. The proofs of analogues of Lemmas 4.4-4.7 are similar to the ones discussed in Section 8 in connection with Algorithm 8.1. Combining all these results, one shows that x̄ is stationary for f on S.

(iii) Suppose that problem (9.1) is convex and satisfies the Slater constraint qualification. Then the properties of Algorithm 9.1 are essentially those of Algorithm 8.1, e.g. we have (8.18) and t_B^k ≤ 10 t_L^k for all k, so the desired conclusion may be deduced from part (ii) of the theorem and the results of Section 8.

From the above proof we deduce easily that Corollary 8.7 is true for Algorithm 9.1.

We may add that Algorithm 9.1 reduces to a method with subgradient selection if one uses the additional constraint λ_p = μ_p = 0 in subproblem (9.4) and calculates multipliers satisfying (8.25). Also its line search rules may be modified by replacing v^k with ṽ^k as in Section 8. Following the analysis of Section 8, one may show that such modifications do not impair the preceding convergence results.
CHAPTER 8

Numerical Examples

1. Introduction

In this chapter we give numerical results for several optimization problems. Our intention is to give the reader some feeling for the speed of convergence he/she can expect when solving problems with several variables by the methods discussed in the preceding chapters. Also we think that it is too early to compare the efficiency of the various existing algorithms. For these reasons, we only describe results obtained with a very unsophisticated implementation of the algorithms with subgradient deletion rules from Chapters 4 and 6.

The algorithms were programmed in FORTRAN on a PDP 11/70 computer with the relative accuracy of 10⁻¹⁷ in double precision (seventeen-digit precision). Since the number N of variables was relatively small, the algorithms used subgradient selection and at most M_g = N+4 subgradients for each search direction finding (see Sections 4.5 and 6.7). The following standard values of line search and locality parameters were used: m_L = 0.1, m_R = 0.5, ā = 10³, t̄ = 0.01, ζ = 0.5, ε¹ = 1, m_a = 10⁻³. The stopping criterion

|d^k| ≤ m_a a^k and w^k ≤ ε_s

was employed with various values of the final accuracy tolerance ε_s.


In the next section k denotes the final (or current) iteration number, Lf is the total number of the objective function/subgradient evaluations, whereas LF is the number of the constraint function/subgradient evaluations. We also give the solution x̄ to the problem in question whenever it is known.

2. Numerical Results

2.1. Shor's Problem

f(x) = max{f_i(x) : i = 1,...,10},  x ∈ R⁵,

f_i(x) = b_i Σ_{j=1}^5 (x_j − a_ij)²,  i = 1,...,10,

(b_i) = (1, 5, 10, 2, 4, 3, 1.7, 2.5, 6, 3.5),

            0 2 1 1 3 0 1 1 0 1
            0 1 2 4 2 2 1 0 0 1
(a_ij)ᵀ =   0 1 1 1 1 1 1 1 2 2
            0 1 1 2 0 0 1 2 1 0
            0 3 2 2 1 1 1 1 0 0

x̄ = (1.12434, 0.97945, 1.47770, 0.92023, 1.12429),

f(x̄) = 22.60016,

x¹ = (0, 0, 0, 0, 1),  f(x¹) = 80.

The results for various values of ε_s are given in Table 2.1. We also have

x⁴⁹ = (1.12433, 0.97943, 1.47749, 0.92027, 1.12425).

Table 2.1

ε_s     k    f(x^k)     Lf
10⁻⁴   34   22.60021   64
10⁻⁵   41   22.60017   76
10⁻⁶   49   22.60016   90
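The data above translate directly into code. The sketch below is ours (not the book's FORTRAN implementation); it evaluates Shor's function and one subgradient, namely the gradient of a maximizing piece.

```python
import numpy as np

B = np.array([1, 5, 10, 2, 4, 3, 1.7, 2.5, 6, 3.5])
A = np.array([[0, 2, 1, 1, 3, 0, 1, 1, 0, 1],
              [0, 1, 2, 4, 2, 2, 1, 0, 0, 1],
              [0, 1, 1, 1, 1, 1, 1, 1, 2, 2],
              [0, 1, 1, 2, 0, 0, 1, 2, 1, 0],
              [0, 3, 2, 2, 1, 1, 1, 1, 0, 0]], dtype=float).T  # row i is a_i in R^5

def shor(x):
    """Return f(x) = max_i b_i |x - a_i|^2 and the subgradient 2 b_i (x - a_i)."""
    vals = B * np.sum((x - A) ** 2, axis=1)
    i = int(np.argmax(vals))
    return vals[i], 2 * B[i] * (x - A[i])
```

At the starting point x¹ = (0,0,0,0,1) this gives f(x¹) = 80, as reported above (the piece i = 3 is active).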

2.2. Lemarechal's Problem MAXQUAD

f(x) = max{⟨A_l x, x⟩ − ⟨b_l, x⟩ : l = 1,...,5},  x ∈ R¹⁰,

A_ij^l = A_ji^l = exp(i/j) cos(ij) sin(l),  i < j,

A_ii^l = i |sin(l)|/10 + Σ_{j≠i} |A_ij^l|,

b_i^l = exp(i/l) sin(il),

x̄ = (−0.1263, −0.0346, −0.0067, 0.2668, 0.0673,
      0.2786, 0.0744, 0.1387, 0.0839, 0.0385),

f(x̄) = −0.8414.

This is the first problem of Lemarechal (1982). The starting point x_i¹ = 1, i = 1,...,10, has f(x¹) = 5337. For ε_s = 10⁻⁴ we obtained

k = 51,  f(x^k) = −0.84136,  Lf = 102,

x⁵¹ = (−0.1263, −0.0342, −0.0062, 0.0269, 0.0671,
       −0.2783, 0.0744, 0.1385, 0.0836, 0.0383).
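The MAXQUAD data can be generated programmatically. The sketch below is our reconstruction of the garbled formulas above (assuming the standard readings cos(ij), sin(il) and the diagonally dominant diagonal); with these assumptions the generated problem matches the starting value f(x¹) = 5337 reported in the text.

```python
import numpy as np

def maxquad_data(n=10, m=5):
    """Build the MAXQUAD matrices A_l and vectors b_l, l = 1..m."""
    As, bs = [], []
    for l in range(1, m + 1):
        A = np.zeros((n, n))
        for i in range(1, n + 1):
            for j in range(i + 1, n + 1):
                A[i - 1, j - 1] = A[j - 1, i - 1] = np.exp(i / j) * np.cos(i * j) * np.sin(l)
        for i in range(1, n + 1):          # diagonal dominance => each A_l is PSD
            A[i - 1, i - 1] = i * abs(np.sin(l)) / 10 + np.abs(A[i - 1]).sum()
        As.append(A)
        bs.append(np.array([np.exp(i / l) * np.sin(i * l) for i in range(1, n + 1)]))
    return As, bs

def maxquad(x, As, bs):
    """f(x) = max_l <A_l x, x> - <b_l, x>."""
    return max(x @ A @ x - b @ x for A, b in zip(As, bs))
```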

2.3. Ill-conditioned Linear Programming

The linear programming problem

minimize ⟨c,x⟩ over all x ∈ R^N

satisfying Ax ≤ b, x ≥ 0,

where

A_ij = 1/(i+j),  b_i = Σ_{j=1}^N 1/(i+j),  i,j = 1,...,N,  N > 2,

c_i = −1/(i+1) − Σ_{j=1}^N 1/(i+j),

x̄ = (1, 1,...,1),

is ill-conditioned for N > 5, since A is essentially a section of the Hilbert matrix. The constraint function is

F(x) = max{max[(Ax)_i − b_i : i = 1,...,N], max[−x_i : i = 1,...,N]}

and F(x̄) = 0.

This problem can be solved by minimizing the exact penalty function

f(x) = ⟨c,x⟩ + ρ F(x)₊

over all x in R^N, where ρ = 2N is the penalty coefficient. Note that f is polyhedral. We use the feasible starting point x¹ = 0 (with f(x¹) = 0). Table 2.2 contains results for ε_s = 10⁻⁷ and N = 5, 10, 15, whereas Table 2.3 describes the case N = 15 for various ε_s.

Table 2.2

N     f(x̄)       k    f(x^k)      Lf
5    −6.26865   14   −6.26865    31
10   −13.1351   23   −13.1351    47
15   −20.0420   32   −20.0420    67

Table 2.3

ε_s     k    f(x^k)      Lf
10⁻⁴   16   −20.0411    26
10⁻⁵   21   −20.0420    41
10⁻⁶   25   −20.0420    51
10⁻⁷   32   −20.0420    67

The results for the problem

minimize f(x) = ⟨c,x⟩, subject to F(x) ≤ 0,

obtained by the feasible point method (see Section 6.7) are given in Table 2.4.

Table 2.4

N    ε_s     k    f(x^k)      Lf    LF
5    10⁻⁴   13   −6.26610    30    43
5    10⁻⁵   20   −6.26861    52    74
10   10⁻⁴   46   −13.1344   100   154
10   10⁻⁵   50   −13.1348   110   168
15   10⁻³   20   −19.9912    56    83
15   10⁻⁴   43   −20.0396   117   174
15   10⁻⁵   51   −20.0406   131   193
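The exact penalty construction above is easy to reproduce; this sketch (ours, not the book's code) builds the data for a given N and evaluates F and the penalty function f.

```python
import numpy as np

def lp_penalty(N):
    i = np.arange(1, N + 1)
    A = 1.0 / (i[:, None] + i[None, :])       # section of the Hilbert matrix
    b = A.sum(axis=1)                         # b_i = sum_j 1/(i+j), so A@1 = b
    c = -1.0 / (i + 1) - A.sum(axis=1)        # c_i = -1/(i+1) - sum_j 1/(i+j)
    def F(x):
        return max((A @ x - b).max(), (-x).max())
    def f(x):                                 # exact penalty with rho = 2N
        return c @ x + 2 * N * max(F(x), 0.0)
    return f, F
```

With N = 5, the feasible starting point x¹ = 0 gives f(x¹) = 0, and at x̄ = (1,...,1) one recovers the optimal value −6.26865 of Table 2.2.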

2.4. CRESCENT

f(x) = max{x₁² + (x₂−1)² + x₂ − 1, −x₁² − (x₂−1)² + x₂ + 1},

x̄ = (0,0),  f(x̄) = 0.

This objective function has narrow crescent-shaped level sets which force any descent algorithm to make very short steps. The results for x¹ = (−1.5, 2) (f(x¹) = 4.25) are given in Table 2.5.

Table 2.5

ε_s      k    f(x^k)     Lf
10⁻⁶    16   8·10⁻⁶      29
10⁻⁹    21   7·10⁻⁹      36
10⁻¹²   40   9·10⁻¹²     62
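CRESCENT is trivial to code; the sketch below (ours) returns the function value and a subgradient, namely the gradient of an active piece.

```python
def crescent(x):
    """f(x) = max{x1^2 + (x2-1)^2 + x2 - 1, -x1^2 - (x2-1)^2 + x2 + 1}."""
    f1 = x[0] ** 2 + (x[1] - 1) ** 2 + x[1] - 1
    f2 = -x[0] ** 2 - (x[1] - 1) ** 2 + x[1] + 1
    if f1 >= f2:
        return f1, (2 * x[0], 2 * (x[1] - 1) + 1)
    return f2, (-2 * x[0], -2 * (x[1] - 1) + 1)
```

At the starting point x¹ = (−1.5, 2) the first piece is active and f(x¹) = 4.25; at the minimizer x̄ = (0,0) both pieces vanish.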

2.5. SHELL DUAL

The problem is to

minimize 2 Σ_{j=1}^5 d_j y_j³ + ⟨Cy, y⟩ − ⟨b, z⟩ over all (y,z) ∈ R⁵ × R¹⁰

satisfying (Aᵀz)_j − 2(Cy)_j − 3 d_j y_j² − e_j ≤ 0 for j = 1,...,5,

−y_j ≤ 0 for j = 1,...,5,  −z_i ≤ 0 for i = 1,...,10,

where (y,z) = x ∈ R¹⁵. The problem data are given below.

The optimal point is

x̄ = (0.3, 0.3335, 0.4, 0.4283, 0.224,
     0, 0, 5.1741, 0, 3.0611,
     11.8396, 0, 0, 0.1039, 0)

with f₀(x̄) = 32.3488, whereas the starting point

x_i¹ = 10⁻⁴ for i ≠ 7,  x₇¹ = 60

has f₀(x¹) = 2400 and F(x¹) ≤ 0. Here f₀ denotes the problem objective, while F is the total constraint function (F = max{F_i : i = 1,...,20}, where F_i are the constraints).

matrix A                               b
 −16     2     0     1     0         −40
   0    −2     0     4     2          −2
  −3.5   0     2     0     0          −0.25
   0    −2     0    −4    −1          −4
   0    −9    −2     1    −2.8        −4
   2     0    −4     0     0          −1
  −1    −1    −1    −1    −1         −40
  −1    −2    −3    −2    −1         −60
   1     2     3     4     5           5
   1     1     1     1     1           1

symmetric matrix C                    d     e
  30   −20   −10    32   −10          4   −15
 −20    39    −6   −31    32          8   −27
 −10    −6    10    −6   −10         10   −36
  32   −31    −6    39   −20          6   −18
 −10    32   −10   −20    30          2   −12

This problem can be solved by minimizing its exact penalty function

f(x) = f₀(x) + 500 F(x)₊

over all x in R¹⁵. The problem is quite difficult to solve by general-purpose nonsmooth optimization methods (see Lemarechal (1982)). The algorithm stopped by reaching the iteration limit ITMAX = 300 with ε_s = 10⁻⁴. Table 2.6 illustrates its progress.

Table 2.6

k      f(x^k)    Lf
100    32.85     259
150    32.54     384
200    32.38     497
250    32.36     626
300    32.35     766
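The data tables above appear to reproduce the standard Colville shell-dual data; assuming that, and reading the nonlinear constraints as (Aᵀz)_j − 2(Cy)_j − 3d_j y_j² − e_j ≤ 0, the penalty function can be assembled as follows (a sketch, ours).

```python
import numpy as np

A = np.array([[-16, 2, 0, 1, 0], [0, -2, 0, 4, 2], [-3.5, 0, 2, 0, 0],
              [0, -2, 0, -4, -1], [0, -9, -2, 1, -2.8], [2, 0, -4, 0, 0],
              [-1, -1, -1, -1, -1], [-1, -2, -3, -2, -1],
              [1, 2, 3, 4, 5], [1, 1, 1, 1, 1]], dtype=float)
b = np.array([-40, -2, -0.25, -4, -4, -1, -40, -60, 5, 1], dtype=float)
C = np.array([[30, -20, -10, 32, -10], [-20, 39, -6, -31, 32],
              [-10, -6, 10, -6, -10], [32, -31, -6, 39, -20],
              [-10, 32, -10, -20, 30]], dtype=float)
d = np.array([4, 8, 10, 6, 2], dtype=float)
e = np.array([-15, -27, -36, -18, -12], dtype=float)

def f0(x):
    """Objective 2 <d, y^3> + <Cy, y> - <b, z> with x = (y, z)."""
    y, z = x[:5], x[5:]
    return 2 * d @ y ** 3 + y @ C @ y - b @ z

def Ftot(x):
    """Total constraint function: max of the 20 constraint values."""
    y, z = x[:5], x[5:]
    g = A.T @ z - 2 * (C @ y) - 3 * d * y ** 2 - e   # 5 nonlinear constraints
    return max(g.max(), (-y).max(), (-z).max())

def f(x):
    return f0(x) + 500 * max(Ftot(x), 0.0)           # exact penalty
```

Evaluating f₀ at the optimal point listed above reproduces the reported optimal value 32.3488, which is a useful check on the data.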

2.6. Electronic Filter Design

f(x) = max{|e(x, h_i)| : i = 1,...,41},

e(x,h) = H(x, πh) − S(h),

x = (a₁, b₁, c₁, d₁, a₂, b₂, c₂, d₂, A) ∈ R⁹,

H(x,g) = A Π_{i=1}^2 [ (1 + a_i² + b_i² + 2b_i(2cos²g − 1) + 2a_i(1+b_i)cos g) / (1 + c_i² + d_i² + 2d_i(2cos²g − 1) + 2c_i(1+d_i)cos g) ]^{1/2},

g = πh,

S(h) = |1 − 2h|,

h_i = (i−1)·0.01 for i = 1,...,6,  h_i = 0.07 + (i−7)·0.03 for i = 7,...,20,

h₂₁ = 0.5,  h₂₂ = 0.54,  h₂₃ = 0.57,  h₂₄ = 0.62,

h_i = 0.63 + (i−25)·0.03 for i = 25,...,35,

h_i = 0.95 + (i−36)·0.01 for i = 36,...,41,

x̄ = (0, 0.980039, 0, −0.165771, 0, −0.735078, 0, −0.767228, 0.3679),

x¹ = (0, 1, 0, −1.5, 0, −6.28, 0, −0.72, 0.37),

f(x̄) = 6.1853·10⁻³,  f(x¹) = 0.6914.

This problem originated in the optimal design of electronic filters (Charalambous, 1979). Table 2.7 gives results for various ε_s.

Table 2.7

ε_s     k     f(x^k)       Lf
10⁻³    23    17·10⁻³      43
10⁻⁴    54    14·10⁻³     105
10⁻⁵   201    6.45·10⁻³   400
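The quotient under the square root in H is |z² + az + b|² / |z² + cz + d|² evaluated on the unit circle z = e^{jg}, so a direct complex evaluation (our sketch below) avoids the expanded formula; the expanded numerator 1 + a² + b² + 2b(2cos²g − 1) + 2a(1+b)cos g is exactly |z² + az + b|².

```python
import cmath
import math

def H(x, g):
    """Two-section filter magnitude; x = (a1, b1, c1, d1, a2, b2, c2, d2, A)."""
    a1, b1, c1, d1, a2, b2, c2, d2, amp = x
    z = cmath.exp(1j * g)                       # point on the unit circle
    mag = 1.0
    for a, b, c, d in ((a1, b1, c1, d1), (a2, b2, c2, d2)):
        mag *= abs(z * z + a * z + b) / abs(z * z + c * z + d)
    return amp * mag
```

The identity |e^{2jg} + a e^{jg} + b|² = 1 + a² + b² + 2b cos 2g + 2a(1+b)cos g (with cos 2g = 2cos²g − 1) follows by expanding the product with its conjugate.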

2.7. Feedback Controller Design

The following problem arises in the design of robust feedback regulators for linear multivariable control systems (see Gustafson and Desoer (1983) and Kiwiel (1984c) for details). The plant transfer matrix is

                  1          [ s² + 8s + 10    3s² + 7s + 4 ]
P(s) = ─────────────────    [                               ]
         (s+2)²(s+3)         [ 2s + 2          3s² + 9s + 8 ]

The compensator transfer matrix C(s,x) depending on the design parameters x ∈ R² is found in terms of P(s) and a matrix Q(s,x) as

C(s,x) = Q(s,x)(I − P(s)Q(s,x))⁻¹,

where

             1     [ (3s² + 9s + 8)/d₁(s,x)    (−3s² − 7s − 4)/d₂(s,x) ]
Q(s,x) = ───────   [                                                    ]
          2s + 4   [ (−2s − 2)/d₁(s,x)         (s² + 8s + 10)/d₂(s,x)  ]

d_i(s,x) = (s x_i)² + √2 s x_i + 1,  i = 1, 2.

Here 1/x₁ and 1/x₂ can be interpreted as bandwidths. The size of the maximum singular value σ̄(w,x) of the matrix

G(w,x) = Q(jw, x)

gives an upper bound on the noise power per hertz in any channel at the plant input. The design requirement

σ̄(w,x) ≤ 2.5 for w ∈ Ω = {1, 1.2,..., 2.6, 2.8}

can be formulated in terms of the function

F(x) := max{⟨z, G(w,x)G(w,x)*z⟩ : ‖z‖ = 1, w ∈ Ω} − 6.25

as

F(x) ≤ 0.
(In view of Lemma 1.2.5, the value and a subgradient of F at x can easily be found by computing the eigenvector of G(w,x)G(w,x)* corresponding to its maximum eigenvalue.)

The design problem is to choose the compensator parameters x ∈ R² such that the bandwidths 1/x₁ and 1/x₂ are as large as possible, subject to F(x) ≤ 0. Thus we want to have small values of both x₁ and x₂, so our design problem is multiobjective (has two criteria). Of course, multicriteria optimization problems should be solved in an interactive mode. We give results for four typical auxiliary subproblems that the designer may wish to solve in order to explore the design possibilities. The subproblems are obtained by choosing different scalarizations of the two objectives and handling the constraint F(x) ≤ 0 via exact penalties. The subproblem objectives are

Problem 1: f(x) = 0.8x₁ + 0.2x₂ + 100 F(x)₊,

Problem 2: f(x) = max{x₁, x₂} + 10 F(x)₊,

Problem 3: f(x) = max{x₁, x₂ − 0.5} + 10 F(x)₊,

Problem 4: f(x) = max{x₁ − 0.5, x₂} + 10 F(x)₊.

The starting point x¹ = (1,1) has F(x¹) = 0.420997. Table 2.8 gives results for ε_s = 10⁻⁵.

Table 2.8

Problem    k    Lf   f(x^k)      x₁^k     x₂^k     F(x^k)₊
1         19    37   0.704194   0.5030   1.5088   0
2         15    45   1.032101   1.0321   1.0321   0
3         21    69   0.740497   0.7405   1.2405   5·10⁻⁷
4         31    78   0.923553   1.4236   0.9236   0
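With the reconstructed Q above (the √2 Butterworth-style denominators and the (2,2) numerator s² + 8s + 10 are our readings of the garbled formulas), σ̄(w,x) and F(x) can be evaluated numerically; this sketch is ours, not the authors' implementation.

```python
import numpy as np

def Q(s, x):
    d1 = (s * x[0]) ** 2 + np.sqrt(2) * s * x[0] + 1
    d2 = (s * x[1]) ** 2 + np.sqrt(2) * s * x[1] + 1
    return (1 / (2 * s + 4)) * np.array(
        [[(3 * s**2 + 9 * s + 8) / d1, (-3 * s**2 - 7 * s - 4) / d2],
         [(-2 * s - 2) / d1, (s**2 + 8 * s + 10) / d2]])

OMEGA = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8]

def sigma_max(w, x):
    """Largest singular value of G(w,x) = Q(jw, x)."""
    return np.linalg.svd(Q(1j * w, x), compute_uv=False)[0]

def F(x):
    """F(x) = max over the frequency grid of sigma_max^2, minus 6.25."""
    return max(sigma_max(w, x) ** 2 for w in OMEGA) - 6.25
```

Since ⟨z, GG*z⟩ maximized over ‖z‖ = 1 is the largest eigenvalue of GG*, i.e. σ̄², F(x) ≤ 0 is equivalent to the design requirement σ̄(w,x) ≤ 2.5 on the grid.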
References

Auslender A. (1978). Minimisation de fonctions localement lipschitziennes: applications à la programmation mi-convexe, mi-différentiable. Nonlinear Programming 3, O.L. Mangasarian, R.R. Meyer and S.M. Robinson, eds., Academic Press, New York, pp. 429-460.
Auslender A. (1982). On the differential properties of the support function of the ε-subdifferential of a convex function. Math. Programming, 24, 257-268.

Auslender A. (1985). Numerical methods for nondifferentiable convex minimization. Math. Programming Study (to appear).
Bazaraa M.S. and C.M. Shetty (1979). Nonlinear Programming. Theory and Algorithms. Wiley, New York.
Bertsekas D.P. and S.K. Mitter (1973). A descent numerical method for optimization problems with nondifferentiable cost functionals. SIAM J. Control, 11, 637-652.
Bihain A. (1984). Optimization of upper semidifferentiable functions. J. Optimiz. Theory Appl., 44, 545-568.
Charalambous C. (1979). Acceleration of the least p-th algorithm for minimax optimization with engineering applications. Math. Programming, 17, 270-297.
Cheney E.W. and A.A. Goldstein (1959). Newton's method for convex programming and Chebyshev approximation. Num. Math., 1, 253-268.
Clarke F.H. (1975). Generalized gradients and applications. Trans. Amer. Math. Soc., 205, 247-262.
Clarke F.H. (1976). A new approach to Lagrange multipliers. Math. Oper. Res., 1, 165-174.
Clarke F.H. (1983). Optimization and Nonsmooth Analysis. Wiley-Interscience, New York.
Dantzig G.B. (1963). Linear Programming and Extensions. Princeton University Press, Princeton, New Jersey.
Demyanov V.F. and V.N. Malozemov (1974). Introduction to Minimax. Wiley, New York.
Demyanov V.F. and L.V. Vasilev (1981). Nondifferentiable Optimization. Optimization Software Inc./Springer, New York (to appear, 1985). Russian edition: Nauka, Moscow (1981).
Demyanov V.F., C. Lemarechal and J. Zowe (1985). Trying to approximate a set-valued mapping. Nondifferentiable Optimization: Theory and Applications, V.F. Demyanov, ed., Lecture Notes in Control and Information Sciences, Springer, Berlin (to appear).
Dixon L.C.W. and M. Gaviano (1980). Reflections on nondifferentiable optimization, part 2, convergence. J. Optim. Theory Appl., 32, 259-275.
Eaves B.C. and W.I. Zangwill (1971). Generalized cutting plane algorithms. SIAM J. Control, 9, 529-542.
Fletcher R. (1981). Practical Methods of Optimization, Vol. 2, Constrained Optimization. Wiley, New York.

Fukushima M. (1984). A descent algorithm for nonsmooth convex programming. Math. Programming, 30, 163-175.
Gaudioso M. and M.F. Monaco (1982). A bundle type approach to the unconstrained minimization of convex nonsmooth functions. Math. Programming, 23, 216-226.
Goldstein A.A. (1977). Optimization of Lipschitz continuous functions. Math. Programming, 13, 14-22.
Gupal A.M. (1979). Stochastic Methods for Solving Nonsmooth Extremal Problems. Naukova Dumka, Kiev (in Russian).
Gustafson C.L. and C.A. Desoer (1983). Controller design for linear multivariable feedback systems with stable plants, using optimization with inequality constraints. Int. J. Control, 37, 881-907.
Gwinner J. (1981). Bibliography on nondifferentiable optimization and non-smooth analysis. J. Comput. Appl. Math., 7, 277-285.
Hiriart-Urruty J.B. (1983). The approximate first-order and second-order directional derivatives for a convex function. Proceedings of the Conference on Mathematical Theories of Optimization, Lecture Notes in Mathematics 979, Springer, Berlin.
Huard P. (1967). Resolution of mathematical programming with nonlinear constraints by the method of centers. Nonlinear Programming, J. Abadie, ed., Academic Press, New York.
Kelley J.E. (1960). The cutting plane method for solving convex programs. J. SIAM, 8, 703-712.
Kiwiel K.C. (1981a). A globally convergent quadratic approximation algorithm for inequality constrained minimax problems. CP-81-9, International Institute for Applied Systems Analysis, Laxenburg, Austria (revised version: A phase I - phase II method for inequality constrained minimax problems. Control Cyb., 12 (1983), 55-75).
Kiwiel K.C. (1981b). A variable metric method of centers for nonsmooth minimization. CP-81-27, International Institute for Applied Systems Analysis, Laxenburg, Austria.
Kiwiel K.C. (1983). An aggregate subgradient method for nonsmooth convex minimization. Math. Programming, 27, 320-341.
Kiwiel K.C. (1984a). A linearization algorithm for constrained nonsmooth minimization. System Modelling and Optimization, P. Thoft-Christensen, ed., Lecture Notes in Control and Information Sciences 59, Springer, Berlin, pp. 311-320.
Kiwiel K.C. (1984b). A quadratic approximation method for minimizing a class of quasidifferentiable functions. Numer. Mathematik, 45, 411-430.

Kiwiel K.C. (1984c). An algorithm for optimization problems with
singular values of control systems. Proc. IFAC 9th World Congress,
J. Gertler and L. Keviczky, eds., Pergamon Press, Oxford (to
appear).
Kiwiel K.C. (1985a). An exact penalty function algorithm for nonsmooth
constrained convex minimization problems. IMA J. Num. Anal. (to
appear).
Kiwiel K.C. (1985b). A method for minimizing the sum of a convex
function and a continuously differentiable function. J. Optim.
Theory Appl. (to appear).
Kiwiel K.C. (1985c). A linearization algorithm for nonsmooth
minimization. Math. Oper. Res. (to appear).
Kiwiel K.C. (1985d). An algorithm for linearly constrained convex
nondifferentiable minimization problems. J. Math. Anal. Appl.
(to appear).
Kiwiel K.C. (1985e). A descent method for nonsmooth convex multi-
objective minimization. Large Scale Systems (to appear).
Kiwiel K.C. (1985f). A decomposition method of descent for minimizing
a sum of convex nonsmooth functions. J. Optim. Theory Appl. (to
appear).
Kiwiel K.C. (1985g). An algorithm for nonsmooth convex minimization
with errors. Math. Comput. (to appear).
Kiwiel K.C. (1985h). A method of linearizations for minimizing
certain quasidifferentiable functions. Math. Programming
Study (to appear).
Kiwiel K.C. (1985i). Descent methods for nonsmooth convex con-
strained minimization. Nondifferentiable Optimization: Theory
and Applications, V.F.Demyanov, ed., Lecture Notes in Control
and Information Sciences (to appear).
Lasdon L.S. (1970). Optimization Theory for Large Systems. Macmillan,
London.
Lemarechal C. (1975). An extension of Davidon methods to nondiffe-
rentiable problems. Nondifferentiable Optimization, M.L.Balinski
and P.Wolfe, eds., Mathematical Programming Study 3,
North-Holland, Amsterdam, pp. 95-109.
Lemarechal C. (1976). Combining Kelley's and conjugate gradient
methods. Abstract, IX International Symposium on Mathematical
Programming, Budapest.
Lemarechal C. (1978). Nonsmooth optimization and descent methods.
RR-78-4, International Institute for Applied Systems Analysis,
Laxenburg, Austria.
Lemarechal C. (1978b). Bundle methods in nonsmooth optimization.
Nonsmooth Optimization, C.Lemarechal and R.Mifflin, eds.,
Pergamon Press, Oxford, pp. 79-102.
Lemarechal C. (1980). Extensions diverses des méthodes de gradient
et applications. Thèse d'Etat, Université de Paris IX.
Lemarechal C. (1981). A view of line searches. Optimization and
Optimal Control, W.Oettli and J.Stoer, eds., Lecture Notes in
Control and Information Sciences 30, Springer, Berlin, pp.
59-78.
Lemarechal C. (1982). Numerical experiments in nonsmooth optimi-
zation. Progress in Nondifferentiable Optimization, E.Nurminski, ed.,
CP-82-S8, International Institute for Applied Systems Analysis,
Laxenburg, Austria, pp. 61-84.
Lemarechal C. and R.Mifflin, eds. (1978). Nonsmooth Optimization.
Pergamon Press, Oxford.
Lemarechal C. and R.Mifflin (1982). Global and superlinear
convergence of an algorithm for one-dimensional minimization
of convex functions. Math. Programming, 24, 241-256.
Lemarechal C., J.-J.Strodiot and A.Bihain (1981). On a bundle
algorithm for nonsmooth optimization. Nonlinear Programming 4,
O.L.Mangasarian, R.R.Meyer and S.M.Robinson, eds., Academic
Press, New York, pp. 245-281.
Lemarechal C. and J.-J.Strodiot (1985). Bundle methods, cutting
plane algorithms and ε-Newton directions. Nondifferentiable
Optimization: Theory and Applications, V.F.Demyanov, ed.,
Lecture Notes in Control and Information Sciences (to appear).
Lemarechal C. and J.Zowe (1983). Some remarks on the construction
of higher order algorithms for convex optimization. Appl. Math.
Optim., 10, 51-68.
Madsen K. and H.Schjaer-Jacobsen (1978). Linearly constrained
minimax optimization. Math. Programming, 14, 208-223.
Mifflin R. (1977a). Semismooth and semiconvex functions in constrain-
ed optimization. SIAM J. Control Optim., 15, 959-972.
Mifflin R. (1977b). An algorithm for constrained optimization with
semismooth functions. Math. Oper. Res., 2, 191-207.
Mifflin R. (1978). A feasible descent algorithm for linearly con-
strained least squares problems. Nonsmooth Optimization, C.
Lemarechal and R.Mifflin, eds., Pergamon Press, Oxford, pp.
103-126.
Mifflin R. (1982). A modification and an extension of Lemarechal's
algorithm for nonsmooth minimization. Nondifferential and
Variational Techniques in Optimization, D.C.Sorensen and R.J.-B.
Wets, eds., Mathematical Programming Study 17, pp. 77-90.
Mifflin R. (1983). A superlinearly convergent algorithm for one-
dimensional constrained minimization with convex functions. Math.
Oper. Res., 8, 185-195.
Mifflin R. (1984). Better than linear convergence and safeguarding in
nonsmooth minimization. System Modelling and Optimization, P.
Thoft-Christensen, ed., Lecture Notes in Control and Information
Sciences 59, Springer, Berlin, pp. 321-330.
Mifflin R. (1985). A nested optimization application. Nondifferentiable
Optimization: Theory and Applications, V.F.Demyanov, ed., Lecture
Notes in Control and Information Sciences, Springer, Berlin
(to appear).
Nurminski E.A. (1979). Numerical Methods for Solving Deterministic
and Stochastic Minimax Problems. Naukova Dumka, Kiev (in
Russian).
Nurminski E.A. (1981). On a decomposition of structured problems.
WP-81-32, International Institute for Applied Systems Analysis,
Laxenburg, Austria.
Nurminski E.A. (1982). Bibliography on nondifferentiable optimization.
Progress in Nondifferentiable Optimization, E.Nurminski, ed.,
CP-82-S8, International Institute for Applied Systems Analysis,
Laxenburg, Austria.
Pironneau O. and E.Polak (1972). On the rate of convergence of certain
methods of centers. Math. Programming, 2, 230-257.
Pironneau O. and E.Polak (1973). Rate of convergence of a class of
methods of feasible directions. SIAM J. Num. Anal., 10, 161-174.
Polak E. (1970). Computational Methods in Optimization. A Unified
Approach. Academic Press, New York.
Polak E. and D.Q.Mayne (1981). A robust secant method for optimiza-
tion problems with inequality constraints. J. Optim. Theory Appl.,
33, 463-477.
Polak E., D.Q.Mayne and Y.Wardi (1983). On the extension of constrained
optimization algorithms from differentiable to nondifferentiable
problems. SIAM J. Control Optim., 21, 179-203.
Polak E., R.Trahan and D.Q.Mayne (1979). Combined phase I - phase II
methods of feasible directions. Math. Programming, 17, 61-73.
Powell M.J.D. (1978). Algorithms for nonlinear constraints that use
Lagrangian functions. Math. Programming, 14, 224-248.
Pshenichny B.N. (1980). Convex Analysis and Extremal Problems. Nauka,
Moscow (in Russian).
Pshenichny B.N. and Yu.M.Danilin (1975). Numerical Methods for Extre-
mal Problems. Nauka, Moscow (English translation, Mir, Moscow,
1978).
Rockafellar R.T. (1970). Convex Analysis. Princeton University Press,
Princeton, New Jersey.
Rockafellar R.T. (1978). The theory of subgradients and its applica-
tions to problems of optimization. Lecture Notes, University of
Montreal.
Rockafellar R.T. (1981). The theory of subgradients and its applica-
tions to problems of optimization. Convex and nonconvex functions.
Research Notes in Mathematics 1, K.H.Hoffman and R.Wille, eds.,
Heldermann, Berlin.
Rzewski S.V. (1981). ε-Subgradient method for solving the convex
programming problem. Zurn. Vyc. Mat. Mat. Fiz., 25, 1126-1132
(in Russian).
Shor N.Z. (1979). Methods for Minimizing Nondifferentiable Functions
and Their Applications. Naukova Dumka, Kiev (in Russian). (English
translation: Minimization Methods for Nondifferentiable Functions,
Springer-Verlag, Berlin, 1985).
Strodiot J.-J., V.H.Nguyen and N.Heukemes (1983). ε-Optimal solutions
in nondifferentiable convex programming and some related
questions. Math. Programming, 25, 307-328.
Topkis D.M. (1970a). Cutting-plane methods without nested constraint
sets. Oper. Res., 18, 404-413.
Topkis D.M. (1970b). A note on the cutting-plane method without nested
constraint sets. Oper. Res., 18, 1216-1220.
Topkis D.M. (1982). A cutting-plane algorithm with linear and geome-
tric rates of convergence. J. Optim. Theory Appl., 36, 1-22.
Wierzbicki A.P. (1978a). A quadratic approximation method based on
augmented Lagrangian functions for nonconvex nonlinear programming
problems. WP-78-61, International Institute for Applied Systems
Analysis, Laxenburg, Austria.
Wierzbicki A.P. (1978b). Lagrangian functions and nondifferentiable
optimization. WP-78-63, International Institute for Applied
Systems Analysis, Laxenburg, Austria.
Wierzbicki A.P. (1982). Lagrangian functions and nondifferentiable op-
timization. Progress in Nondifferentiable Optimization, E.Nur-
minski, ed., CP-82-S8, International Institute for Applied Systems
Analysis, Laxenburg, Austria, pp. 173-213.
Wolfe P. (1975). A method of conjugate subgradients for minimizing
nondifferentiable convex functions. Nondifferentiable Optimiza-
tion, M.L.Balinski and P.Wolfe, eds., Mathematical Programming
Study 3, North-Holland, Amsterdam, pp. 145-173.

Wolfe P. (1976). Finding the nearest point in a polytope. Math. Prog-

ramming, 11, 128-149.

Wolfe P. (1978). Sufficient minimization of piecewise-linear uni-

variate functions. Nondifferentiable Optimization, C.Lemarechal

and R.Mifflin, eds,, Pergamon Press, Oxford, pp. 127-130.

Zowe J. (1985). Nondifferentiable optimization - a motivation and

a short introduction into the subgradient- and the bundle concept.

ASI Proceedings on Computational Mathematical Programming (to

appear).
INDEX

Accumulation 133
accuracy tolerance 58
aggregate 34
- approximation 56
- distance measure 172
- linearization 53
- subgradient 34, 53
aggregation 51
applicability 41
approximation tolerance 307
bisection 106
bundle method 37, 299
Caratheodory's theorem 3
concave function 3
cone 11
constraint 1
- dropping 49
- function 1, 22
- qualification 17
- violation function 206
convergence 42
convex 2
- combination 2
- hull 2
- function
- outer approximation 11, 29
- problem 17
- set 2
continuous differentiability 1
contour 9
cutting plane 25
- idea 51
- method 25
deletion rules 140
descent 12
- direction 12
- method 22
differentiable function 1
direct search methods 1
directional derivative 4
- generalized 4
- upper 4
distance measure 92
- reset 92
efficiency 41
epigraph 3
ε-subdifferential 21
ε-subgradient 21
F. John condition 17
feasible 19
- direction 19
- point method 22
finite termination 76
generalized directional derivative 4
generalized gradient 6
global convergence 42
gradient 5
- type algorithm 1
graph 9
half-space 3
hyperplane 3
implementability 41
improvement function 191
inner product
interpolation 154
Kuhn-Tucker condition 17
Lagrange multipliers 49
level set 10
line search 29, 47
line segment 2
linearization 9, 10
- error 31, 61
- updating 58
local boundedness 6
local solution 15
locally Lipschitzian
locality 90
- measures 32, 90
- radius 97
lower approximation 10, 11
mathematical programming 1
max function 1
method 15
- of centers 192
- of feasible directions 228
- of linearizations 45
- of steepest descent 15
minimax 57
nondifferentiable optimization 1
nonlinear programming 1
nonsmooth optimization 1
norm 2
null step 28
objective function 1
optimality condition 15, 16, 17
phase I - phase II method 219, 294
polyhedral 26
positive dependence 75
Powell's function 77
projection 13
rank 82
reset 33
selection 51
selective approximation 53
semismoothness 103, 250
serious step 28
smooth problem 1
span 82
stationary point 15, 17
stationarity measure 62, 102
steepest descent 15
stepsize 22
stopping criterion 25
strictly convex 3
strict differentiability 5
subdifferential 6
- regularity 6
subgradient 6
- algorithm 24
supporting hyperplane 3
trial point 28
upper derivative 4
upper semicontinuity 6
weak semismoothness 105