
Part I

Unconstrained optimization
• Line search method

• Trust region method

Recall: Taylor’s theorem

1. f (y) = f (x) + f ′(ψ)(y − x), for some ψ ∈ (x, y)

2. f (y) = f (x) + ∫₀¹ f ′(x + t(y − x))(y − x) dt

1 Line search method

1.1 Determine a direction (descent)

Definition:
A direction p at x1 is called a descent direction if pT ∇f (x1 ) < 0.

1.1.1 Steepest descent (negative gradient)

Rmk:
D~u f (x) = lim_{t→0} [f (x + t~u) − f (x)] / t = ∇f (x) · ~u = |∇f (x)| cos θ,  when ~u is a unit vector.   (1)

Therefore, if

1. θ = 0, i.e., ~u points in the direction of the gradient of f ; then D~u f (x) = |∇f (x)| > 0 if ∇f (x) ≠ 0.

2. θ = π, i.e., ~u points in the direction opposite to the gradient of f ; then D~u f (x) = −|∇f (x)| < 0 if ∇f (x) ≠ 0.

Rmk:

1. We only need the first derivative (good)!


2. Sometimes the convergence is very very slow (bad)!

3. Any descent direction will work.
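The remarks above can be illustrated with a short numerical sketch (the quadratic test function, the fixed step size, and the stopping tolerance below are assumptions for this demo, not part of the notes):

```python
import numpy as np

def steepest_descent(grad, x0, alpha=0.09, tol=1e-8, max_iter=10000):
    """Gradient descent with a fixed step size: p_k = -grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # first derivative only (remark 1)
            break
        x = x - alpha * g             # step along the descent direction
    return x, k

# f(x, y) = x^2 + 10 y^2 is mildly ill-conditioned, so the iteration
# already needs many steps even in two dimensions (remark 2).
x_star, iters = steepest_descent(lambda x: np.array([2 * x[0], 20 * x[1]]),
                                 [1.0, 1.0])
```

Roughly a hundred iterations are needed here; with a worse-conditioned Hessian the count grows rapidly.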

1.1.2 Newton’s method

Starting at x1 , approximate f at x1 by its degree-2 Taylor polynomial:

f (x) ≈ f (x1 ) + ∇f (x1 )T (x − x1 ) + ½ (x − x1 )T ∇2 f (x1 )(x − x1 ) =: m(x)   (2)

A first-order condition for minimizing m(x) is ∇m(x) = 0 (sufficient when ∇2 f (x1 ) is positive definite), i.e.,

∇f (x1 ) + ∇2 f (x1 )(x − x1 ) = 0   (3)

Side remark:

• f (x) = bT x =⇒ ∇f (x) = b
• f (x) = ½ xT Qx =⇒ ∇f (x) = ½ (Q + QT )x
• NOTE: only if Q is symmetric does this simplify to ∇f (x) = Qx.

If ∇2 f (x1 ) is invertible, then x − x1 = −(∇2 f (x1 ))−1 · ∇f (x1 ).

This direction is called Newton’s direction.


Remarks:

1. Convergence is quadratic (fast)!


2. Need to compute Hessian.
3. Why is Newton’s direction a descent direction?
HOMEWORK: Show that if ∇2 f (x1 ) is positive definite, then Newton’s direction is a descent direction.
4. If ∇2 f (x1 ) is NOT positive definite (then Newton’s direction need not be a descent direction), we need to
“fix” the Hessian.
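A minimal sketch of the iteration (the test function and tolerances are assumptions for the demo); note how few iterations the quadratic convergence needs:

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    """Newton's method: solve Hessian * p = -gradient for the direction.
    Assumes the Hessian stays positive definite along the iterates, so p
    is a descent direction and the full step (alpha = 1) can be taken."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x + np.linalg.solve(hess(x), -g)   # Newton direction, alpha = 1
    return x, k

# f(x, y) = x^4 + y^4 + x^2 + y^2, minimized at the origin.
grad = lambda x: 4 * x**3 + 2 * x
hess = lambda x: np.diag(12 * x**2 + 2)
x_star, iters = newton(grad, hess, [1.0, 1.0])
```

Compare with the hundred-odd iterations steepest descent typically needs on comparable problems: here the gradient norm drops below 1e-10 in a handful of steps.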

Summary:
Let

• pk : direction at the k th stage

• αk : step size for the k th iteration.

Then

xk+1 = xk + αk pk , k = 0, 1, 2, . . . (4)

Note that
pk = −βk ∇f (xk ) (5)

and if

• βk = Id, then we are using the steepest descent method.
• βk = (∇2 f (xk ))−1 , then we are using Newton’s method.

1.2 How far do we go along a chosen direction?

The next question is how far to go along a given pk .


Let
Φ(α) = f (xk + αpk ), α > 0. (6)

Idea:
Minimize Φ(α), i.e., find the step which minimizes Φ for α > 0.
Good, but expensive!
How do we guarantee that we arrive at a global and not merely a local minimum?
What we need is:

1. Φ(α) < Φ(0)

2. α >> 0 (away from 0).

1.2.1 Criteria for acceptable stepsize

1. Wolfe’s conditions
(a) Φ(α) ≤ Φ(0) + c1 Φ′(0) α (Armijo condition)
(b) Φ′(α) ≥ c2 Φ′(0), where 0 < c1 < c2 < 1 (curvature condition)
2. Goldstein’s conditions
(a) Φ(0) + (1 − c)Φ′(0)α ≤ Φ(α) ≤ Φ(0) + cΦ′(0)α, c ∈ (0, 1/2)

Backtracking:
Pick α; then
(a) if α is acceptable: done;
(b) otherwise, go back to a smaller step.

1. For backtracking, do not use the curvature condition


2. For Newton direction, the natural initial step is α = 1.
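The backtracking rule above can be sketched in a few lines (the values of c1 and the shrink factor are conventional choices, not from the notes):

```python
import numpy as np

def backtracking(f, grad_f, x, p, c1=1e-4, rho=0.5, alpha0=1.0):
    """Backtracking line search: shrink alpha until the Armijo condition
    Phi(alpha) <= Phi(0) + c1 * alpha * Phi'(0) holds. Per remark 1,
    the curvature condition is not needed when backtracking."""
    alpha, fx, slope = alpha0, f(x), grad_f(x) @ p   # Phi(0) and Phi'(0)
    while f(x + alpha * p) > fx + c1 * alpha * slope:
        alpha *= rho                                  # go back to smaller steps
    return alpha

# On f(x) = ||x||^2 / 2 with the Newton direction p = -x, the natural
# initial step alpha = 1 is already acceptable (remark 2).
f = lambda x: 0.5 * x @ x
x = np.array([1.0, -2.0])
alpha = backtracking(f, lambda x: x, x, -x)   # accepts alpha = 1 immediately
```

With an overly aggressive direction such as p = −10x the loop halves α a few times before the Armijo condition is met.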

1.3 Convergence

Theorem (Zoutendijk):
Suppose f is continuously differentiable and ∇f is Lipschitz continuous (i.e., ∃L > 0 s.t. |∇f (y) −
∇f (x)| ≤ L|x − y| ∀x, y), the steps satisfy Wolfe’s conditions, and the directions are
descent directions in the iterative process

xk+1 = xk + αk pk .   (7)

Then

Σ_{k=1}^∞ cos2 θk ||∇f (xk )||2 < ∞.   (8)

Remarks:

1. Zoutendijk’s theorem implies global convergence


(a) Sum is finite implies the terms go to zero.
(b) If cos2 θk ≥ δ > 0 ∀k, then ||∇f (xk )||2 → 0.
(c) For steepest descent, cos2 θk = 1 > 0 =⇒ steepest descent is globally convergent.
(d) For Newton’s method, the same holds if the Hessians have uniformly bounded condition numbers (for
invertible A, κ(A) = ||A|| ||A−1 ||), i.e., ||∇2 f (xk )|| ||∇2 f (xk )−1 || ≤ M ∀ xk .

2. ||A|| = sup_{x ≠ 0} ||Ax|| / ||x||

3. g is Lipschitz continuous in a domain D if ∃L > 0 such that |g(y)−g(x)| ≤ L|x−y|, ∀x, y ∈ D


E.g. (HOMEWORK): if g ∈ C 1 on a compact domain, then g is Lipschitz.
4. cos θk = ∇f (xk ) · pk / (||∇f (xk )|| ||pk ||)

Recall that in Wolfe’s conditions Φ(α) = f (xk + αpk ), hence Φ′(α) = ∇f (xk + αpk ) · pk .
We pick αk such that Φ′(αk ) ≥ c2 Φ′(0), i.e.,
∇f (xk + αk pk ) · pk ≥ c2 ∇f (xk ) · pk .
Proof (Zoutendijk):

(∇f (xk+1 ) − ∇f (xk )) · pk (9)


= ∇f (xk+1 ) · pk − ∇f (xk ) · pk (10)
= ∇f (xk + αk pk ) · pk − ∇f (xk ) · pk (11)
≥ c2 ∇f (xk ) · pk − ∇f (xk ) · pk ,   Wolfe (ii)   (12)

At the same time

|(∇f (xk+1 ) − ∇f (xk )) · pk | (13)


≤ ||∇f (xk+1 ) − ∇f (xk )||||pk ||, Cauchy (14)
= L||xk+1 − xk ||||pk ||, Lipschitz (15)
= Lαk ||pk ||, xk+1 − xk = αk pk (16)

Combining both inequalities, we obtain

(c2 − 1)∇f (xk ) · pk ≤ Lαk ||pk ||2

=⇒ αk ≥ (c2 − 1)∇f (xk ) · pk / (L||pk ||2 ),

from the curvature condition; this shows that the steps stay bounded away from 0.
Now, note that

f (xk ) − f (xk+1 )
= f (xk ) − f (xk + αk pk )
≥ −c1 αk ∇f (xk ) · pk ,   Wolfe’s condition (i)
= c1 αk (−∇f (xk ) · pk )
≥ c1 · [(c2 − 1)∇f (xk ) · pk / (L||pk ||2 )] · (−∇f (xk ) · pk )
= (c1 (1 − c2 )/L) · (∇f (xk ) · pk )2 / ||pk ||2 .

Therefore we have

f (xk ) − f (xk+1 ) ≥ (c1 (1 − c2 )/L) ||∇f (xk )||2 cos2 θk , ∀k
f (xk−1 ) − f (xk ) ≥ (c1 (1 − c2 )/L) ||∇f (xk−1 )||2 cos2 θk−1
..
.
f (x0 ) − f (x1 ) ≥ (c1 (1 − c2 )/L) ||∇f (x0 )||2 cos2 θ0

Adding this telescopic sum, we obtain

f (x0 ) − f (xk+1 ) ≥ (c1 (1 − c2 )/L) Σ_{n=0}^{k} ||∇f (xn )||2 cos2 θn .   (17)

1.4 More on Convergence Rate

Theorem: Let f : Rn −→ R be twice continuously differentiable, and suppose the iterates generated by steepest descent with exact line search converge to x∗ , at which ∇2 f (x∗ ) > 0. Then for all k large enough we have

f (xk+1 ) − f (x∗ ) ≤ r2 (f (xk ) − f (x∗ )),   (18)

where r ∈ ((λn − λ1 )/(λn + λ1 ), 1) and λ1 ≤ λ2 ≤ . . . ≤ λn are the eigenvalues of the Hessian.

Sketch of proof: Applying the bound repeatedly,

f (xk+1 ) − f (x∗ ) ≤ r2 (f (xk ) − f (x∗ ))
≤ r4 (f (xk−1 ) − f (x∗ ))
...
≤ r2(k+1) (f (x0 ) − f (x∗ )).

Remark: Assume the condition number of ∇2 f (x∗ ) is

κ(∇2 f (x∗ )) = ||∇2 f (x∗ )|| ||∇2 f (x∗ )−1 || = λn /λ1 = 800, say.

Then
(λn − λ1 )/(λn + λ1 ) = 799/801,

so that
f (xk ) − f (x∗ ) ≤ (799/801)2k (f (x0 ) − f (x∗ )).

Even after k = 500 iterations, (799/801)2k ≈ 0.08: the error has only shrunk to about 8%, which is very slow! A disadvantage of linear convergence!
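The slow-down is easy to check numerically (a two-line sketch; the iteration count 500 is just the value used in the remark):

```python
# With condition number kappa = 800, the per-iteration contraction of the
# error in f is r^2 with r = 799/801, i.e. about 0.5% progress per step.
r = 799 / 801
after_500 = r ** (2 * 500)   # error factor remaining after 500 iterations
```

`after_500` comes out near 0.082, i.e. roughly 8% of the initial error still remains after 500 steps.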

2 Trust region method


In contrast to the line search method, the trust region method determines the direction and the step length simultaneously. Let xk be the current position within a region (the trust region), usually a disk B(xk , ∆k ) := {x : ||x − xk || ≤ ∆k }.
Goal:
min_x f (x), x ∈ Rn

The objective function f has a good approximation function m(p), called the model function.
Usually we pick
m(p) = f (xk ) + ∇f (xk )T p + ½ pT ∇2 f (xk )p,
the second order Taylor expansion of f at xk .
That is the ideal case, but sometimes the Hessian is hard to find, therefore, in practice we use
instead

m(p) = f (xk ) + gkT p + ½ pT Bk p.

We choose xk+1 to be the one that minimizes the model function m(p) within the trust region if
the function decreases enough at the new point.
Remark: If the value of the objective function at the new point is not small enough, then we
reject the model and shrink ∆k and redo the constraint optimization.

2.1 Pseudo-Algorithm
Given a maximal radius ∆̂, pick ∆0 ∈ (0, ∆̂), η ∈ [0, 1/4).

1. Define
ρk = (f (xk ) − f (xk+1 )) / (m(0) − m(pk )),   where xk+1 = xk + pk .

2. If ρk < 0, reject the step (since then f (xk + pk ) > f (xk )) and shrink ∆k .
3. If ρk ≈ 1, accept the step and expand ∆k .

4. If ρk > 0 but close to 0, reject the step and redo with a smaller ∆k .

Pseudo-code:

• Input: ∆̂, ∆0 ∈ (0, ∆̂), η ∈ [0, 1/4).
• Compute pk by solving min_p m(p), ||p|| ≤ ∆k .
• Calculate ρk = (f (xk ) − f (xk + pk )) / (m(0) − m(pk )).

• If ρk < 1/4, then ∆k+1 = (1/4)∆k ;
else if ρk > 3/4 and ||pk || = ∆k , then ∆k+1 = min{2∆k , ∆̂};
else ∆k+1 = ∆k .
• If ρk > η then xk+1 = xk + pk ;
else xk+1 = xk .
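The pseudo-code above can be sketched as a driver that takes the inner model solver as a parameter (the crude clipped-Newton solver and the quadratic test problem below are assumptions for the demo; the next sections give proper model solvers):

```python
import numpy as np

def trust_region(f, grad, hess, solve_model, x0, delta0=1.0,
                 delta_max=10.0, eta=0.15, tol=1e-8, max_iter=100):
    """Trust-region driver following the pseudo-code above.
    solve_model(g, B, delta) approximately minimizes the model
    m(p) = f + g.p + p.B.p/2 over ||p|| <= delta."""
    x, delta = np.asarray(x0, dtype=float), delta0
    for _ in range(max_iter):
        g, B = grad(x), hess(x)
        if np.linalg.norm(g) < tol:
            break
        p = solve_model(g, B, delta)
        pred = -(g @ p + 0.5 * p @ B @ p)            # m(0) - m(p)
        rho = (f(x) - f(x + p)) / pred               # actual vs predicted
        if rho < 0.25:
            delta *= 0.25                             # poor model: shrink
        elif rho > 0.75 and np.linalg.norm(p) >= delta - 1e-12:
            delta = min(2 * delta, delta_max)         # good model at boundary
        if rho > eta:
            x = x + p                                 # accept the step
    return x

def clipped_newton(g, B, delta):
    """A crude model solver (an assumption for this demo): the Newton
    step, scaled back to the trust-region boundary if it is too long."""
    p = np.linalg.solve(B, -g)
    n = np.linalg.norm(p)
    return p if n <= delta else (delta / n) * p

A = np.diag([1.0, 10.0])
x_star = trust_region(lambda x: 0.5 * x @ A @ x, lambda x: A @ x,
                      lambda x: A, clipped_newton, np.array([3.0, 4.0]))
```

On this quadratic the model is exact, so every step gets ρ = 1, the radius expands, and the iterate reaches the minimizer in a few steps.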


2.2 How to solve (or approximate) the model problem quickly?

The model problem is


min_p m(p) = fk + gkT p + ½ pT Bk p,   s.t. ||p|| ≤ ∆k ,   (19)

where fk = f (xk ), gk = ∇f (xk ), and Bk = ∇2 f (xk ) (or an approximation of it).


Remark: The optimal solution p∗ of (19) is characterized by

1. p∗ is feasible (i.e. |p∗ | ≤ ∆k )

2. λ ≥ 0
3. λ(∆k − |p∗ |) = 0
4. (Bk + λI)p∗ = −gk

5. Bk + λI ≥ 0.

2.3 Three approaches to approximating the solution


1. Cauchy Point: steepest descent
2. Dogleg method
3. Two dimensional minimization

2.4 Cauchy Point

m(p) = fk + gkT p + ½ pT Bk p.

The Cauchy point is the minimizer of m(p) in the steepest descent direction subject to ||p|| ≤ ∆.
Writing p = −αg, define

Φ(α) = fk − α||g||2 + ½ α2 g T Bk g,

and minimize Φ(α) s.t. 0 ≤ α ≤ ∆/||g||.

The Cauchy point pc is defined by


pc = −α∗ g, where α∗ is the minimum point of Φ.
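The minimizer α∗ has a closed form, which a short sketch makes concrete (the g, B, ∆ values are made up for the demo; the code assumes g ≠ 0):

```python
import numpy as np

def cauchy_point(g, B, delta):
    """Closed-form Cauchy point: minimize m(p) along p = -alpha g
    subject to ||p|| <= delta, i.e. compute alpha* as above."""
    g_norm = np.linalg.norm(g)
    gBg = g @ B @ g
    if gBg <= 0:
        tau = 1.0                                   # Phi keeps decreasing: go to the boundary
    else:
        tau = min(1.0, g_norm**3 / (delta * gBg))   # clip the unconstrained alpha*
    return -tau * (delta / g_norm) * g

g, B = np.array([1.0, 2.0]), np.diag([2.0, 1.0])
pc = cauchy_point(g, B, delta=10.0)   # interior minimum here: pc = -(5/6) g
```

When the unconstrained minimizer α∗ = ||g||² / (gᵀBg) puts the step inside the region, the clip is inactive, as in this example.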

2.5 Dogleg method


• If ∆ small, then m(p) ≈ f + g T p: Cauchy point is a good approx.

• If ∆ is not small then the quadratic term plays a role: m(p) = fk + gkT p + 21 pT Bk p

The unconstrained minimum of m(p) is given by ∇ m(p) = 0.


Let ∇ m(p) = g + Bp = 0 =⇒ p = −B −1 g (if B > 0).
See graph in book, page 74.
We follow a kinked path, called the dogleg path, which is parametrized by p(τ ) = τ pU for τ ∈ [0, 1] and p(τ ) = pU + (τ − 1)(pB − pU ) for τ ∈ [1, 2], where pU is the minimizer of m along −g and pB = −B −1 g.
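The dogleg step can be sketched directly (the matrices below are made up for the demo; the code assumes B is positive definite):

```python
import numpy as np

def dogleg(g, B, delta):
    """Dogleg step: take the Newton point pB if it lies inside the region;
    otherwise walk along the steepest-descent minimizer pU, then bend
    toward pB, stopping where the path crosses ||p|| = delta."""
    pB = np.linalg.solve(B, -g)                 # unconstrained minimizer
    if np.linalg.norm(pB) <= delta:
        return pB
    pU = -(g @ g) / (g @ B @ g) * g             # minimizer of m along -g
    if np.linalg.norm(pU) >= delta:
        return (delta / np.linalg.norm(pU)) * pU   # first leg hits boundary
    d = pB - pU                                  # second leg: pU + s d
    a, b, c = d @ d, 2.0 * (pU @ d), pU @ pU - delta**2
    s = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)   # ||pU + s d|| = delta
    return pU + s * d

B = np.diag([1.0, 10.0])
g = np.array([1.0, 1.0])
p = dogleg(g, B, delta=1.0)   # Newton point outside: step ends on the boundary
```

The three branches correspond to the three regimes: Newton point inside, boundary reached on the first leg, boundary reached on the bent second leg.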

2.6 Two dimensional problem

The model problem is solving

min_p fk + g T p + ½ pT Bp,   s.t. ||p|| ≤ ∆, p ∈ span{pU , pB }.

Remark: The 2-dimensional minimization improves on the minimum found by the dogleg method, which in turn improves on the Cauchy point:
2-dim min > dogleg > Cauchy point.

Part II

Constrained Optimization
3 Mathematical programming
If f : Rn −→ R, the general mathematical programming optimization problem is often denoted by
(MP), and it is the following problem:

min f (x) (20)


subject to m equality constraints

g1 (x) = 0, g2 (x) = 0, . . . , gm (x) = 0

and r inequality constraints:

h1 (x) ≤ 0, h2 (x) ≤ 0, . . . , hr (x) ≤ 0

If f, g1 , g2 , . . . , gm , h1 , . . . , hr are linear (affine) functions, then the problem is called a Linear Program and denoted (LP).
The (LP) problem can then be written in matrix form:

min cT x (21)
such that Aeq x + beq = 0 and Aineq x + bineq ≤ 0.

3.1 Standard Linear programming

The standard Linear programming can then be written as:

min cT x
s.t.
Ax = b
x≥0

3.2 Simplex Method

Consider the following problem, which is solved using the simplex method. The tables below are
called tableaux.

(1) Consider the following linear programming problem:

max 4x1 + 3x2

subject to
3x1 + x2 ≤ 9,   3x1 + 2x2 ≤ 10,   x1 + x2 ≤ 4,   x1 , x2 ≥ 0.
Transform this problem into the standard form. How many basic solutions does
the standard form problem have? What are the basic feasible solutions and
what are the extreme points of the feasible region? Solve the problem by sim-
plex method. NOTE: Basic solutions are the points which satisfy the equality
constraints but are not necessarily positive (if it’s positive, then it’s called basic
feasible).

By adding slack variables, this problem can be written in standard LP form:

max 4x1 + 3x2 (22)

subject to
3x1 + x2 +x3 = 9
3x1 + 2x2 +x4 = 10
x1 + x2 +x5 = 4
x1 , x2 , x3 , x4 , x5 ≥ 0.

The tableau for this linear programming problem is given by:



Basic var.   x1    x2    x3    x4    x5    RHS
Z            -4    -3     0     0     0      0
⇐ x3          3∗     1     1     0     0      9
x4            3     2     0     1     0     10
x5            1     1     0     0     1      4
We choose x1 as our entering variable since it has the most negative coefficient in the objective row. We perform the ratio test and obtain 3, 10/3, 4, respectively, so we select x3 as our leaving variable and 3∗ as our pivot. Therefore, we obtain:

Basic var.   x1    x2    x3    x4    x5    RHS
Z             0  -5/3   4/3     0     0     12
x1            1   1/3   1/3     0     0      3
⇐ x4          0    1∗    -1     1     0      1
x5            0   2/3  -1/3     0     1      1
The entering variable is x2 , the only remaining negative coefficient in the objective row; performing the ratio test we obtain 9, 1, 3/2, respectively. Therefore, the leaving variable is x4 and the pivot is 1∗ , which leads to tableau #3:

Basic var.   x1    x2    x3    x4    x5    RHS
Z             0     0  -1/3   5/3     0   41/3
x1            1     0   2/3  -1/3     0    8/3
x2            0     1    -1     1     0      1
⇐ x5          0     0  1/3∗  -2/3     1    1/3

Since x3 now has a negative coefficient in the objective row, we choose x3 as the entering variable. The ratio test on the positive entries in that column gives 4 and 1 respectively, so x5 is our leaving variable and 1/3∗ our pivot element, which leads to our last tableau:

Basic var.   x1    x2    x3    x4    x5    RHS
Z             0     0     0     1     1     14
x1            1     0     0     1    -2      2
x2            0     1     0    -1     3      2
x3            0     0     1    -2     3      1

Since all coefficients in the objective row are now nonnegative, we have found the optimal solution: the optimal value is 14, attained at x1 = 2, x2 = 2, x3 = 1 (x4 = x5 = 0).

Figure 2: A convex function lies above its tangent lines: f (y) ≥ f (x) + f ′(x)(y − x).

4 Convex Optimization
NOTE: See Convex Optimization book chapters 4 and 5.

A convex optimization problem is one of the form

min f0 (x)
subject to fi (x) ≤ 0, i = 1, ..., m   (23)
aTi x = bi , i = 1, ..., p,

where f0 , ..., fm are convex functions.


A fundamental property of convex optimization problems is that any locally optimal point is also (globally) optimal. The proof is by contradiction, using convexity on the segment between x and a supposedly better point y: for z = (1 − θ)x + θy,

f0 (z) ≤ (1 − θ)f0 (x) + θf0 (y) < f0 (x).

More notes can be found in Boyd and Vandenberghe, pp. 136–151.

4.1 Chebyshev problem

Assume X is discrete and finite


X ∈ {x1 , x2 , . . . , xn },   P [X = xi ] = pi .

Given that f0 (xi ) = a0i , f1 (xi ) = a1i , . . . , fm (xi ) = ami ,

and

αk ≤ E[fk (X)] = Σ_{i=1}^{n} pi aki ≤ βk ,

then the Chebyshev’s problem is to find
min E[f0 (x)]
s.t.
αk ≤ E[fk (x)] ≤ βk , k = 1, ..., m

4.1.1 Example 1



Let
f0 (x) = x,   f1 (x) = 1(ε,∞) (x);
then the problem becomes

min E[X] = Σ_i pi xi
s.t.
α1 ≤ E[1(ε,∞) (X)] = P [X > ε] = Σ_{xi >ε} pi ≤ β1 .

4.1.2 Example 2: Forward rate

Let X be the price of an asset.

What is the max/min of a call on X struck at K, given E[X]?

max or min_p E[(X − K)+ ] = max or min_p Σ_{i=1}^{n} (xi − K)+ pi

s.t.
E[X] = µ, i.e., Σ_{i=1}^{n} xi pi = µ,   p ≥ 0,   Σ_i pi = 1.

The problem is more interesting / challenging if the constraint is

E[(X − K1 )+ ] = C,   i.e.,   Σ_{i=1}^{n} (xi − K1 )+ pi = C.

What could be added to make the problem more realistic?

Adding a variance constraint turns the problem into a quadratic optimization problem.

4.2 Quadratic problem QP

Quadratic Objective function:


min ½ xT Qx + pT x + r
s.t.

Ax = b, Gx ≤ h (affine constraints)

4.2.1 Quadratic constraint quadratic problem QCQP

min ½ xT Qx + pT x + r
s.t.

Ax = b
and
½ xT Qi x + pTi x + ri ≤ 0, i = 1, 2, . . . , m

4.2.2 Markowitz portfolio optimization

Let
X1 , X2 , . . . , Xn
be the return of the risky assets with mean returns

µ1 , µ2 , . . . , µn

and covariance matrix


Σi,j = Cov(Xi , Xj ) = E[(Xi − µi )(Xj − µj )]
Let
P := (p1 , p2 , . . . , pn )
the portfolio of these assets, i.e., the amount invested in each asset 1, 2, . . . , n.
Then
µt p
is the expected return of the portfolio and

pt Σp

is the variance of the portfolio P .


Consider the problem

min pT Σp minimize risk

such that

µT p ≥ rmin , expected return no less than rmin


Σ_{i=1}^{n} pi ≤ 1,   budget constraint

P [pT X ≤ α] ≤ δ,   shortfall constraint.
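For the simplest variant — minimize pT Σp under the full-investment constraint Σ pi = 1 alone, with shorting allowed — the Lagrangian gives a closed form; a sketch (the 2-asset covariance matrix below is made up for the demo):

```python
import numpy as np

# min p' Sigma p  s.t.  1' p = 1  (no return floor, shorting allowed).
# Stationarity of the Lagrangian gives p* = Sigma^{-1} 1 / (1' Sigma^{-1} 1).
Sigma = np.array([[0.04, 0.01],
                  [0.01, 0.09]])      # hypothetical covariance of two assets
ones = np.ones(2)
w = np.linalg.solve(Sigma, ones)
p_star = w / (ones @ w)               # minimum-variance portfolio
var_star = p_star @ Sigma @ p_star
```

Adding the expected-return floor and the no-shorting bound turns this back into a general QP with inequality constraints, which no longer has a closed form.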

5 Duality Theory
Problem:
min f0 (x) (24)
such that
fi (x) ≤ 0, i = 1, . . . , m
hi (x) = 0, i = 1, . . . , p

Let x ∈ D, which is the intersection of the domains of all the functions.


Let
L(x, λ, ν) := f0 (x) + Σ_{i=1}^{m} λi fi (x) + Σ_{i=1}^{p} νi hi (x) = f0 (x) + ⟨λ, f ⟩ + ⟨ν, h⟩,

where
L : D × Rm × Rp −→ R
is the Lagrangian.
Define
g(λ, ν) := inf_x L(x, λ, ν) = inf_x {f0 (x) + Σ_{i=1}^{m} λi fi (x) + Σ_{i=1}^{p} νi hi (x)}.

This function g is the dual function of (24).


Remark:

1. g is concave, since it is the point-wise infimum of affine functions of (λ, ν).


2. Let p∗ be the optimal value of (24), then

p∗ ≥ g(λ, ν) ∀λ ≥ 0.

Proof: Let λ ≥ 0 and let x be feasible. Then

L(x, λ, ν) := f0 (x) + Σ_{i=1}^{m} λi fi (x) + Σ_{i=1}^{p} νi hi (x) ≤ f0 (x),

because λi ≥ 0, fi (x) ≤ 0 and hi (x) = 0 for feasible x. Hence,

inf_{y∈D} L(y, λ, ν) ≤ L(x, λ, ν) ≤ f0 (x)   ∀x feasible and λ ≥ 0,

and since inf_{y∈D} L(y, λ, ν) = g(λ, ν), it follows that g(λ, ν) ≤ p∗ .

5.0.3 Example: dual function for LP

min cT x
s.t. Ax = b, x ≥ 0 (rewrite x ≥ 0 as −x ≤ 0).
Then
L(x, λ, ν) := cT x + λT (−x) + ν T (Ax − b) = (cT − λT + ν T A)x − ν T b
and

g(λ, ν) = inf_x {(cT − λT + ν T A)x − ν T b} = −ν T b if AT ν = λ − c, and −∞ otherwise.

Indeed, (cT − λT + ν T A)x is linear in the unconstrained variable x, so its infimum is −∞ unless

cT − λT + ν T A = 0,

or equivalently AT ν = λ − c.

5.0.4 Example: trust-region problem

min xT Ax + bT x

s.t. xT x ≤ ∆2 , where ∆ is the radius of the trust region.

Find the dual function of the problem.
Solution:
L(x, λ) := xT Ax + bT x + λ(xT x − ∆2 )
and

g(λ) = inf_{x∈Rn} L(x, λ) = inf_x {xT (A + λI)x + bT x − λ∆2 }.

Then, since A is symmetric, we obtain

∇(xT (A + λI)x + bT x − λ∆2 ) = 2(A + λI)x + b = 0,

and hence
x = −½ (A + λI)−1 b

if A + λI is invertible.
Therefore,

g(λ) = −¼ bT (A + λI)−1 b − λ∆2   if A + λI is positive definite, and −∞ otherwise.

The value of g(λ) is obtained by plugging x = −½ (A + λI)−1 b into L: the quadratic and linear terms give

¼ bT (A + λI)−1 (A + λI)(A + λI)−1 b − ½ bT (A + λI)−1 b = −¼ bT (A + λI)−1 b,

and the constant term −λ∆2 carries over.

Remark:

• p∗ ≥ −¼ bT (A + λI)−1 b − λ∆2 whenever λ ≥ 0 and A + λI is positive definite.
• The dual function provides a nontrivial lower bound on p∗ if λ ≥ 0 and g(λ, ν) 6= −∞
• A pair (λ, ν) with λ ≥ 0 and g(λ, ν) ≠ −∞ is called dual feasible.

Actual question:

1. What is the best lower bound?


2. Is the best lower bound equal to p∗ ?

5.1 Dual Problem

Answer:

1. Dual problem
d∗ := max g(λ, ν) s.t. λ ≥ 0
which is always a convex problem.
2. In general, p∗ ≠ d∗ ; only the weak inequality p∗ ≥ d∗ is guaranteed.

5.1.1 Strong Duality

If p∗ = d∗ , then we say we have strong duality.

5.1.2 Weak Duality

Since
p∗ ≥ g(λ, ν), ∀λ ≥ 0
we have
p∗ ≥ max{g(λ, ν)|λ ≥ 0} := d∗

NOTE: Weak duality is always true, no matter if the primal is convex or not.

5.1.3 Examples

1. LP:
min cT x
s.t.
Ax = b
x≥0
Then
L(x, λ, ν) := cT x + λT (−x) + ν T (Ax − b) = (cT − λT + ν T A)x − ν T b
and
g(λ, ν) = −ν T b if cT − λT + ν T A = 0 (i.e., AT ν = λ − c), and −∞ otherwise.

The dual problem is
max g(λ, ν)
s.t.
λ≥0

Rewriting it in an equivalent form:

max_{λ,ν} −ν T b
s.t.
AT ν − λ + c = 0
λ≥0

or alternatively, since λ is playing the role of a slack variable:

max_ν −ν T b
s.t.
AT ν + c ≥ 0

Remark: The dual of the dual of the LP problem is the LP problem again!


2. Two way partition problem

min xT W x,   W symmetric, not necessarily positive definite,

s.t. x = (x1 , . . . , xn ),
x2i = 1, i = 1, . . . , n

What is the dual problem?

Solution:
L(x, ν) = xT W x + Σ_i νi (x2i − 1)
and
g(ν) = inf_x {xT W x + Σ_{i=1}^{n} νi x2i − Σ_{i=1}^{n} νi }
     = inf_x {xT (W + N )x} − Σ_{i=1}^{n} νi ,

which equals
−Σ_{i=1}^{n} νi if W + N ≥ 0 (positive semidefinite), and −∞ otherwise,

where N = diag(ν1 , . . . , νn ).

Dual Problem:

max −Σ_{i=1}^{n} νi   s.t.   W + N ≥ 0   (25)

Remarks: p∗ ≥ d∗ always holds, even if p∗ or d∗ = ±∞:

1. p∗ = −∞, i.e. primal is unbounded, then
d∗ = −∞, i.e., the dual is infeasible
2. d∗ = ∞, i.e., the dual is unbounded, then
p∗ = ∞, i.e., the primal is infeasible

5.2 Strong duality

p∗ = d∗

Remark: In general, we don’t have strong duality.
For convex problems, we “usually” have strong duality.

Theorem 5.1 (Slater’s condition)


For a convex problem, the strong duality holds if there is an x ∈ rel int(D), such that x is feasible,
i.e.,
fi (x) < 0, ∀i = 1, . . . , m
hi (x) = 0, ∀i = 1, . . . , p
Notice the strict inequality in the first condition.
Remark (weaker Slater’s condition):
If f1 , . . . , fm are affine, then Slater’s condition can be weakened: for the affine constraints, the nonstrict inequalities f1 (x) ≤ 0, . . . , fm (x) ≤ 0 suffice.

5.2.1 Examples

1. LP:
min cT x
s.t.
Ax = b
x≥0
Slater’s condition just means feasibility in this case. The dual in this case is

max_{λ,ν} −ν T b
s.t.
AT ν − λ + c = 0
λ≥0

(a) If the primal is feasible (then weak Slater condition holds) , then we have strong duality,
i.e. p∗ = d∗ .
(b) If the dual is feasible (then weak Slater condition holds), then we have strong duality,
i.e. d∗ = p∗ .
(c) If primal and dual are both infeasible, then it can happen that ∞ = p∗ ≠ d∗ = −∞.

2. QCQP (quadratically constrained quadratic programming):

min ½ xT P0 x + q0T x + r0
s.t.
½ xT Pi x + qiT x + ri ≤ 0,   i = 1, . . . , m
Ax = b

Then Slater’s condition is: ∃x with Ax = b and ½ xT Pi x + qiT x + ri < 0 for all i = 1, . . . , m.


Convexity:
P0 ≻ 0, Pi ⪰ 0, i = 1, . . . , m.
Slater’s condition + convexity give us strong duality.
3. Nonconvex problem with strong duality: strong duality holds for any optimization problem with a quadratic objective function and one quadratic inequality constraint, provided that Slater’s condition holds.
Remark: Trust-region problem
min xT Ax + bT x
s.t.
xT x ≤ ∆2
where A is symmetric but not necessarily positive definite!

Dual problem:
max −¼ bT (A + λI)−1 b − λ∆2
s.t.
λ ≥ 0, A + λI ≥ 0

Figure 3: The dual function g(λ, ν) increases toward d∗ from below while the primal objective f0 (x) decreases toward p∗ from above; solving both problems simultaneously gives faster convergence.

5.3 Optimality conditions

Assume that the objective function and the constraints are differentiable.

Complementary slackness:
Assume we have strong duality, i.e. p∗ = d∗ , and they are attained at x∗ for primal problem and
(λ∗ , ν ∗ ) for the dual problem, i.e.

f0 (x∗ ) = g(λ∗ , ν ∗ )   (strong duality)
        = inf_x L(x, λ∗ , ν ∗ )   (definition of g)
        ≤ L(x∗ , λ∗ , ν ∗ )
        = f0 (x∗ ) + Σ_{i=1}^{m} λ∗i fi (x∗ ) + Σ_{i=1}^{p} νi∗ hi (x∗ )   (definition of the Lagrangian)
        ≤ f0 (x∗ ).

Since the first and last terms are equal, every inequality in this chain is in fact an equality; and since

hi (x∗ ) = 0,   fi (x∗ ) ≤ 0,   λ∗i ≥ 0,

it follows that both sums have to be exactly 0.


Therefore,

1. x∗ minimizes L(x, λ∗ , ν ∗ ) over x.


2. Complementary slackness: λ∗i fi (x∗ ) = 0, ∀i = 1, . . . , m

5.4 Necessary optimality conditions (Karush-Kuhn-Tucker)


• Dual feasible:
λi ≥ 0, ∀i

• Primal feasible:

fi (x) ≤ 0, ∀i = 1, . . . , m,   hi (x) = 0, ∀i = 1, . . . , p.

• Complementary slackness:
λ∗i fi (x∗ ) = 0, ∀i = 1, . . . , m

• Stationarity of the Lagrangian:

∇x L(x, λ, ν) = ∇f0 (x) + Σ_{i=1}^{m} λi ∇fi (x) + Σ_{i=1}^{p} νi ∇hi (x) = 0

Contents

I Unconstrained optimization

1 Line search method
  1.1 Determine a direction (descent)
      1.1.1 Steepest descent (negative gradient)
      1.1.2 Newton’s method
  1.2 How far do we go along a chosen direction?
      1.2.1 Criteria for acceptable stepsize
  1.3 Convergence
  1.4 More on Convergence Rate

2 Trust region method
  2.1 Pseudo-Algorithm
  2.2 How to solve (or approximate) the model problem quickly?
  2.3 Three approaches to approximating the solution
  2.4 Cauchy Point
  2.5 Dogleg method
  2.6 Two dimensional problem

II Constrained Optimization

3 Mathematical programming
  3.1 Standard Linear programming
  3.2 Simplex Method

4 Convex Optimization
  4.1 Chebyshev problem
      4.1.1 Example 1
      4.1.2 Example 2: Forward rate
  4.2 Quadratic problem QP
      4.2.1 Quadratic constraint quadratic problem QCQP
      4.2.2 Markowitz portfolio optimization

5 Duality Theory
      5.0.3 Example: dual function for LP
      5.0.4 Example: trust-region problem
  5.1 Dual Problem
      5.1.1 Strong Duality
      5.1.2 Weak Duality
      5.1.3 Examples
  5.2 Strong duality
      5.2.1 Examples
  5.3 Optimality conditions
  5.4 Necessary optimality conditions (Karush-Kuhn-Tucker)
