Part I
Unconstrained optimization
1.1 Line search method
Definition:
A direction $p$ at $x_1$ is called a descent direction if $p^T \nabla f(x_1) < 0$.
Rmk:
For a unit vector $\vec u$,
$$D_{\vec u} f(x) = \lim_{t \to 0} \frac{f(x + t\vec u) - f(x)}{t} = \nabla f(x) \cdot \vec u = |\nabla f(x)| \cos\theta, \qquad (1)$$
where $\theta$ is the angle between $\vec u$ and $\nabla f(x)$.
Therefore, if
1. $\theta = 0$, i.e., $\vec u$ points in the direction of the gradient of $f$, then $D_{\vec u} f(x) = |\nabla f(x)| > 0$ if $\nabla f(x) \neq 0$.
2. $\theta = \pi$, i.e., $\vec u$ points in the direction opposite to the gradient of $f$, then $D_{\vec u} f(x) = -|\nabla f(x)| < 0$ if $\nabla f(x) \neq 0$.
1.1.2 Newton’s method
We approximate $f$ near $x_1$ by its second-order Taylor expansion:
$$f(x) \approx f(x_1) + \nabla f(x_1)^T (x - x_1) + \frac{1}{2} (x - x_1)^T \nabla^2 f(x_1) (x - x_1) =: m(x) \qquad (2)$$
A sufficient condition to minimize $m(x)$ (when $\nabla^2 f(x_1)$ is positive definite) is $\nabla m(x) = 0$, i.e.,
$$\nabla f(x_1) + \nabla^2 f(x_1)(x - x_1) = 0, \qquad (3)$$
whose solution is the Newton step $x = x_1 - [\nabla^2 f(x_1)]^{-1} \nabla f(x_1)$.
Side remark:
• $f(x) = b^T x \implies \nabla f(x) = b$
• $f(x) = \frac{1}{2} x^T Q x \implies \nabla f(x) = \frac{1}{2}(Q + Q^T)x$
• NOTE: Only if $Q$ is symmetric does this simplify to $\nabla f(x) = Qx$.
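The second identity is easy to verify numerically; a minimal sketch with a randomly generated non-symmetric $Q$ (illustrative data):

import numpy as np

# Central-difference check of grad(1/2 x^T Q x) = 1/2 (Q + Q^T) x.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 4))                   # deliberately non-symmetric
x = rng.standard_normal(4)

f = lambda x: 0.5 * x @ Q @ x
g_analytic = 0.5 * (Q + Q.T) @ x

eps = 1e-6
g_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(4)])
print(np.max(np.abs(g_fd - g_analytic)))          # ~1e-10: the formulas agree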
Summary:
Let $p_k$ be a descent direction at $x_k$ and $\alpha_k > 0$ a step length. Then
$$x_{k+1} = x_k + \alpha_k p_k, \quad k = 0, 1, 2, \ldots \qquad (4)$$
Note that
$$p_k = -\beta_k \nabla f(x_k) \qquad (5)$$
and if
• $\beta_k = I$, then we are using the steepest descent method;
• $\beta_k = [\nabla^2 f(x_k)]^{-1}$, then we are using Newton's method.
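A minimal sketch of the update (4)-(5) on an illustrative convex quadratic, comparing $\beta_k = I$ (steepest descent with a fixed small step) against $\beta_k = [\nabla^2 f(x_k)]^{-1}$ (Newton); the quadratic and the step sizes are assumptions made for the demo:

import numpy as np

# f(x) = 1/2 x^T Q x - b^T x, so grad f(x) = Qx - b, hess f = Q; minimizer (1/3, 1).
Q = np.array([[3.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])
grad = lambda x: Q @ x - b

def iterate(beta, alpha, steps=20):
    x = np.zeros(2)
    for _ in range(steps):
        x = x + alpha * (-beta @ grad(x))         # x_{k+1} = x_k + alpha_k p_k, p_k = -beta grad
    return x

print(iterate(np.eye(2), alpha=0.25))             # steepest descent: still approaching (1/3, 1)
print(iterate(np.linalg.inv(Q), alpha=1.0))       # Newton: exact after a single step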
Idea:
Minimize $\Phi(\alpha) := f(x_k + \alpha p_k)$, i.e., find the step which minimizes $\Phi$ for $\alpha > 0$.
Good, but expensive! And how do we guarantee that we arrive at a global and not merely a local minimum of $\Phi$?
What we need instead are acceptance conditions for the step:
1. Wolfe's conditions
(a) $\Phi(\alpha) \le \Phi(0) + c_1 \Phi'(0)\,\alpha$ (Armijo condition)
(b) $\Phi'(\alpha) \ge c_2 \Phi'(0)$, where $0 < c_1 < c_2 < 1$ (curvature condition)
2. Goldstein's conditions
(a) $\Phi(0) + (1 - c)\Phi'(0)\alpha \le \Phi(\alpha) \le \Phi(0) + c\Phi'(0)\alpha$, $c \in (0, 1/2)$
Backtracking:
Start from an initial $\alpha$ and:
(a) if $\alpha$ is acceptable (it satisfies the sufficient-decrease condition): done;
(b) otherwise go back to smaller steps and test again, as in the sketch below.
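A sketch of backtracking against the Armijo condition 1(a); the constants $c_1 = 10^{-4}$ and the shrink factor $\rho = 1/2$ are typical textbook choices, not values fixed in the notes:

import numpy as np

def backtracking(f, grad_fx, x, p, alpha0=1.0, rho=0.5, c1=1e-4):
    """Shrink alpha until Phi(alpha) <= Phi(0) + c1 * alpha * Phi'(0)."""
    phi0 = f(x)
    dphi0 = grad_fx @ p                           # Phi'(0) = grad f(x) . p, negative for descent p
    alpha = alpha0
    while f(x + alpha * p) > phi0 + c1 * alpha * dphi0:
        alpha *= rho                              # step (b): go back to smaller steps
    return alpha

# Illustrative use on f(x) = |x|^2 with the steepest descent direction:
f = lambda x: x @ x
x = np.array([2.0, -1.0])
g = 2 * x
print(backtracking(f, g, x, -g))                  # returns 0.5, which lands exactly at the origin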
1.3 Convergence
Theorem (Zoutendijk):
If f is continuously differentiable and ∇f is Lipschitz continuous (i.e, ∃L > 0, s.t. |∇f (y) −
∇f (x)| ≤ L|x − y| ∀x, y), and the steps satisfy Wolfe’s conditions and the directions are
3
decent directions and given our iterative process:
xk+1 = xk + αk pk (7)
then
∞
X
cos2 θk ||∇f (xk )||2 < ∞. (8)
k=1
Remarks:
Recall that in Wolfe's condition $\Phi(\alpha) = f(x_k + \alpha p_k)$, hence $\Phi'(\alpha) = \nabla f(x_k + \alpha p_k) \cdot p_k$.
We pick $\alpha_k$ such that $\Phi'(\alpha_k) \ge c_2 \Phi'(0)$, hence
$$\nabla f(x_k + \alpha_k p_k) \cdot p_k \ge c_2 \nabla f(x_k) \cdot p_k.$$
Proof (Zoutendijk):
Subtracting $\nabla f(x_k) \cdot p_k$ from both sides of the curvature condition and applying the Lipschitz bound,
$$(c_2 - 1)\nabla f(x_k) \cdot p_k \le (\nabla f(x_{k+1}) - \nabla f(x_k)) \cdot p_k \le L\alpha_k \|p_k\|^2;$$
this implies that the steps have to stay far enough away from 0.
Now, note that
$$f(x_k) - f(x_{k+1}) = f(x_k) - f(x_k + \alpha_k p_k) \ge -c_1 \alpha_k \nabla f(x_k) \cdot p_k \qquad \text{(Wolfe's condition (i))}$$
$$= c_1 \alpha_k (-\nabla f(x_k) \cdot p_k) \ge c_1 \frac{(c_2 - 1)\nabla f(x_k) \cdot p_k}{L\|p_k\|^2}\,(-\nabla f(x_k) \cdot p_k) = \frac{c_1(1 - c_2)}{L} \frac{(\nabla f(x_k) \cdot p_k)^2}{\|p_k\|^2}.$$
Therefore, writing $\cos\theta_k = \frac{-\nabla f(x_k) \cdot p_k}{\|\nabla f(x_k)\|\,\|p_k\|}$ for the angle between $p_k$ and the steepest descent direction, we have
$$f(x_k) - f(x_{k+1}) \ge \frac{c_1(1 - c_2)}{L} \|\nabla f(x_k)\|^2 \cos^2\theta_k, \quad \forall k$$
$$f(x_{k-1}) - f(x_k) \ge \frac{c_1(1 - c_2)}{L} \|\nabla f(x_{k-1})\|^2 \cos^2\theta_{k-1}$$
$$\vdots$$
$$f(x_0) - f(x_1) \ge \frac{c_1(1 - c_2)}{L} \|\nabla f(x_0)\|^2 \cos^2\theta_0$$
Adding this telescopic sum, we obtain
$$f(x_0) - f(x_{k+1}) \ge \frac{c_1(1 - c_2)}{L} \sum_{n=0}^{k} \|\nabla f(x_n)\|^2 \cos^2\theta_n. \qquad (17)$$
Since $f$ is bounded below, the left-hand side stays bounded as $k \to \infty$, so the series converges, which is (8). $\square$
For steepest descent with exact line search on a strongly convex quadratic one can say more: with $r = \frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1}$, where $\lambda_1 \le \cdots \le \lambda_n$ are the eigenvalues of the Hessian,
$$f(x_{k+1}) - f(x^*) \le r^2 \big(f(x_k) - f(x^*)\big).$$
Remark: Assume that the condition number of $\nabla^2 f(x^*)$ is
$$\kappa(\nabla^2 f(x^*)) = \|\nabla^2 f(x^*)\|\,\|\nabla^2 f(x^*)^{-1}\| = \frac{\lambda_n}{\lambda_1} = 800, \text{ say.}$$
Then
$$\frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1} = \frac{799}{801}.$$
Then
$$f(x_{k+1}) - f(x^*) \le \left(\frac{799}{801}\right)^2 \big(f(x_k) - f(x^*)\big),$$
so after $k$ steps the error has shrunk by the factor $\left(\frac{799}{801}\right)^{2k}$; even after $k = 500$ iterations,
$$\left(\frac{799}{801}\right)^{1000} \approx 0.08,$$
which is very slow! A disadvantage of linear convergence!
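This slow rate is easy to observe numerically; a sketch on an illustrative quadratic with $\kappa = 800$, exact line search, and a worst-case starting point:

import numpy as np

# f(x) = 1/2 x^T diag(1, kappa) x; steepest descent with exact line search.
kappa = 800.0
Q = np.diag([1.0, kappa])
f = lambda x: 0.5 * x @ Q @ x

x = np.array([kappa, 1.0])                        # worst-case starting point
rate = ((kappa - 1) / (kappa + 1))**2             # predicted ratio (799/801)^2
for _ in range(3):
    g = Q @ x
    alpha = (g @ g) / (g @ Q @ g)                 # exact line search on a quadratic
    x_new = x - alpha * g
    print(f(x_new) / f(x), rate)                  # observed ratio matches the prediction
    x = x_new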
2 Trust region methods
The objective function $f$ has a good local approximation $m(p)$, called the model function. Usually we pick
$$m(p) = f(x_k) + \nabla f(x_k)^T p + \frac{1}{2} p^T \nabla^2 f(x_k) p,$$
the second-order Taylor expansion of $f$ at $x_k$.
That is the ideal case, but sometimes the Hessian is hard to find; therefore, in practice we use instead
$$m(p) = f(x_k) + g_k^T p + \frac{1}{2} p^T B_k p,$$
where $g_k = \nabla f(x_k)$ and $B_k$ is a symmetric approximation of the Hessian.
2
We choose xk+1 to be the one that minimizes the model function m(p) within the trust region if
the function decreases enough at the new point.
Remark: If the value of the objective function at the new point is not small enough, then we
reject the model and shrink ∆k and redo the constraint optimization.
2.1 Pseudo-Algorithm
Given a maximal radius $\hat\Delta$, pick $\Delta_0 \in (0, \hat\Delta)$ and $\eta \in [0, \frac{1}{4})$.
1. Define
$$\rho_k = \frac{f(x_k) - f(x_{k+1})}{m(0) - m(p_k)}, \qquad x_{k+1} = x_k + p_k,$$
the ratio of the actual reduction to the reduction predicted by the model.
2. If $\rho_k < 0$, reject the step, because $f(x_{k+1}) > f(x_k)$, and shrink $\Delta_k$.
3. If $\rho_k \approx 1$, accept the step and expand $\Delta_k$.
Pseudo-code:
• Input: $\hat\Delta$, $\Delta_0 \in (0, \hat\Delta)$, $\eta \in [0, \frac{1}{4})$.
• Compute $p_k$ by solving $\min_p m(p)$ s.t. $|p| \le \Delta_k$.
• Calculate $\rho_k = \frac{f(x_k) - f(x_k + p_k)}{m(0) - m(p_k)}$.
• If $\rho_k < \frac{1}{4}$, then $\Delta_{k+1} = \frac{1}{4}\Delta_k$;
  else if $\rho_k > \frac{3}{4}$ and $|p_k| = \Delta_k$, then $\Delta_{k+1} = \min\{2\Delta_k, \hat\Delta\}$;
  else $\Delta_{k+1} = \Delta_k$.
• If $\rho_k > \eta$, then $x_{k+1} = x_k + p_k$; else $x_{k+1} = x_k$.
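A compact sketch of this loop, using the Cauchy point (Section 2.4) as an inexact subproblem solver; the test function and all tolerances are illustrative choices:

import numpy as np

def cauchy_point(g, B, delta):
    """Minimizer of m(p) = f + g.p + 1/2 p.B.p along -g subject to |p| <= delta."""
    gBg = g @ B @ g
    tau = 1.0 if gBg <= 0 else min(np.linalg.norm(g)**3 / (delta * gBg), 1.0)
    return -tau * delta / np.linalg.norm(g) * g

def trust_region(f, grad, hess, x, delta_hat=2.0, delta=0.5, eta=0.1, iters=100):
    for _ in range(iters):
        g, B = grad(x), hess(x)
        if np.linalg.norm(g) < 1e-6:
            break
        p = cauchy_point(g, B, delta)
        rho = (f(x) - f(x + p)) / -(g @ p + 0.5 * p @ B @ p)   # actual / predicted reduction
        if rho < 0.25:
            delta *= 0.25                                       # shrink the region
        elif rho > 0.75 and np.isclose(np.linalg.norm(p), delta):
            delta = min(2 * delta, delta_hat)                   # expand, capped at delta_hat
        if rho > eta:
            x = x + p                                           # accept the step
    return x

# Illustrative run on a convex quadratic with minimizer (1, 1):
Q = np.array([[2.0, 0.0], [0.0, 10.0]]); b = np.array([2.0, 10.0])
print(trust_region(lambda x: 0.5 * x @ Q @ x - b @ x,
                   lambda x: Q @ x - b, lambda x: Q, np.zeros(2)))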
Characterization of the exact subproblem solution: $p^*$ is a global solution of $\min_p m(p)$ s.t. $|p| \le \Delta_k$ if and only if there exists $\lambda$ such that
1. $|p^*| \le \Delta_k$ (feasibility)
2. $\lambda \ge 0$
3. $\lambda(\Delta_k - |p^*|) = 0$
4. $(B_k + \lambda I)p^* = -g$
5. $B_k + \lambda I \succeq 0$.
2.4 Cauchy Point
Given the model
$$m(p) = f_k + g_k^T p + \frac{1}{2} p^T B_k p,$$
the Cauchy point is the minimum point of $m(p)$ in the steepest descent direction such that $|p| \le \Delta$. Setting $p = -\alpha g_k$, define
$$\Phi(\alpha) = f_k - \alpha |g_k|^2 + \frac{1}{2} \alpha^2 g_k^T B_k g_k$$
and minimize $\Phi(\alpha)$ s.t. $0 \le \alpha \le \frac{\Delta}{|g_k|}$.
• If $\Delta$ is not small, then the quadratic term plays a role and we solve the full model problem
$$\min_p f_k + g_k^T p + \frac{1}{2} p^T B_k p.$$
Remark: The 2-dimensional subspace minimization improves on the minimum found by the dogleg method, which in turn improves on the Cauchy point; see the comparison sketch below:
2-dim min > dogleg > Cauchy (in quality of the model decrease).
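A sketch comparing the Cauchy point with a dogleg step on one illustrative model ($g$, $B$, and $\Delta$ below are made-up data; the dogleg formula assumes $B$ positive definite):

import numpy as np

def dogleg(g, B, delta):
    p_b = -np.linalg.solve(B, g)                  # unconstrained (Newton) minimizer of m
    if np.linalg.norm(p_b) <= delta:
        return p_b
    p_u = -(g @ g) / (g @ B @ g) * g              # minimizer along the steepest descent ray
    if np.linalg.norm(p_u) >= delta:
        return -delta / np.linalg.norm(g) * g
    d = p_b - p_u                                 # walk the dogleg path to the boundary
    a, b, c = d @ d, 2 * p_u @ d, p_u @ p_u - delta**2
    t = (-b + np.sqrt(b**2 - 4 * a * c)) / (2 * a)
    return p_u + t * d

g = np.array([1.0, 4.0]); B = np.diag([1.0, 9.0]); delta = 0.8
m = lambda p: g @ p + 0.5 * p @ B @ p             # model decrease, up to the constant f_k
p_c = -min((g @ g) / (g @ B @ g), delta / np.linalg.norm(g)) * g   # Cauchy point
print(m(p_c), m(dogleg(g, B, delta)))             # dogleg reaches the lower model value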
Part II
Constrained Optimization
3 Mathematical programming
If $f : R^n \to R$, the general mathematical programming optimization problem is often denoted by (MP), and it is the following problem:
$$\min f(x) \quad \text{s.t.} \quad g_i(x) \le 0, \ i = 1, \ldots, m, \qquad h_j(x) = 0, \ j = 1, \ldots, r.$$
If $f, g_1, g_2, \ldots, g_m, h_1, \ldots, h_r$ are linear (affine) functions, then the problem is called Linear Programming and denoted (LP).
The (LP) problem can then be written in matrix form:
$$\min c^T x \qquad (21)$$
such that $A_{eq} x + b_{eq} = 0$ and $A_{ineq} x + b_{ineq} \le 0$.
3.1 Standard Linear programming
The standard form is
$$\min c^T x \quad \text{s.t.} \quad Ax = b, \quad x \ge 0.$$
3.2 Simplex Method
Consider the following problem, which is solved using the simplex method. The tables below are called tableaux.
(1) Consider the following linear programming problem:
$$\max Z = 4x_1 + 3x_2$$
subject to
$$3x_1 + x_2 \le 9, \qquad 3x_1 + 2x_2 \le 10, \qquad x_1 + x_2 \le 4, \qquad x_1, x_2 \ge 0.$$
Transform this problem into the standard form. How many basic solutions does the standard form problem have? What are the basic feasible solutions and what are the extreme points of the feasible region? Solve the problem by the simplex method. NOTE: Basic solutions are the points which satisfy the equality constraints but are not necessarily nonnegative (if a basic solution is also nonnegative, it is called basic feasible).
subject to
$$3x_1 + x_2 + x_3 = 9$$
$$3x_1 + 2x_2 + x_4 = 10$$
$$x_1 + x_2 + x_5 = 4$$
$$x_1, x_2, x_3, x_4, x_5 \ge 0.$$
Since $x_3$ now has a negative coefficient in the objective row, we try to eliminate it by choosing $x_3$ as the entering variable. We perform the ratio test on the positive entries in that column and we obtain 4 and 1 respectively, so that $x_5$ is our leaving variable and $1/3^*$ our pivot element, which leads to our last tableau:

Basic var. | x1 | x2 | x3 | x4 | x5 | RHS
-----------+----+----+----+----+----+----
Z          |  0 |  0 |  0 |  1 |  1 | 14
x1         |  1 |  0 |  0 |  1 | -2 |  2
x2         |  0 |  1 |  0 | -1 |  3 |  2
x3         |  0 |  0 |  1 | -2 |  3 |  1
Since all entries in the objective row are nonnegative, we have found an optimal solution, namely $Z = 14$, attained at $x_1 = 2$, $x_2 = 2$, $x_3 = 1$.
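As a cross-check, the same LP can be handed to scipy.optimize.linprog; maximization is encoded by negating the objective, and the objective max $4x_1 + 3x_2$ is the one reconstructed above from the final tableau:

from scipy.optimize import linprog

res = linprog(c=[-4, -3],                         # maximize 4x1 + 3x2 => minimize the negative
              A_ub=[[3, 1], [3, 2], [1, 1]],
              b_ub=[9, 10, 4],
              bounds=[(0, None), (0, None)])
print(res.x, -res.fun)                            # -> [2. 2.] 14.0, matching the tableau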
[Figure: first-order condition for convexity — for convex $f$, $f(y) \ge f(x) + f'(x)(y - x)$.]
4 Convex Optimization
NOTE: See the Convex Optimization book, chapters 4 and 5.
$$\min f_0(x) \quad \text{subject to} \quad f_i(x) \le 0, \ i = 1, \ldots, m, \qquad a_i^T x = b_i, \ i = 1, \ldots, p, \qquad (23)$$
where $f_0, f_1, \ldots, f_m$ are convex.
4.1 Chebyshev problem
Let $x$ be a random variable taking $n$ fixed values with unknown probabilities $p_1, \ldots, p_n$, and write $a_{ki}$ for the value of $f_k$ at the $i$-th point. Suppose
$$\alpha_k \le E[f_k(x)] = \sum_{i=1}^{n} p_i a_{ki} \le \beta_k;$$
then the Chebyshev problem is to find
$$\min E[f_0(x)] \quad \text{s.t.} \quad \alpha_k \le E[f_k(x)] \le \beta_k, \ k = 1, \ldots, m.$$
4.1.1 Example 1
Let $X$ be the price of an asset, taking values $x_1, \ldots, x_n$ with probabilities $p_1, \ldots, p_n$. To bound the expected payoff of a call with strike $K$, we solve
$$\max_p \text{ or } \min_p \ E[(X - K)_+] = \sum_{i=1}^{n} (x_i - K)_+ \, p_i$$
s.t.
$$E[X] = \sum_{i=1}^{n} x_i p_i = \mu, \qquad p \ge 0, \qquad \sum_i p_i = 1.$$
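A sketch of this moment problem as an LP in $p$ via scipy.optimize.linprog; the price grid, the strike $K$, and the mean $\mu$ are illustrative data:

import numpy as np
from scipy.optimize import linprog

x = np.array([80.0, 90.0, 100.0, 110.0, 120.0])   # possible asset prices (illustrative)
K, mu = 100.0, 102.0
payoff = np.maximum(x - K, 0.0)                   # (x_i - K)+

A_eq = np.vstack([x, np.ones_like(x)])            # E[X] = mu and sum p_i = 1
b_eq = [mu, 1.0]
lo = linprog(payoff,  A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * len(x))
hi = linprog(-payoff, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * len(x))
print(lo.fun, -hi.fun)                            # lower and upper bounds on E[(X-K)+]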
4.2 Quadratic problem QP
$$\min \frac{1}{2} x^T Q x + p^T x + r \quad \text{s.t.} \quad Ax = b, \ Gx \le h \quad \text{(affine constraints)}.$$
4.2.1 Quadratic constraint quadratic problem QCQP
The inequality constraints may in addition be quadratic:
$$\frac{1}{2} x^T Q_i x + p_i^T x + r_i \le 0, \quad i = 1, 2, \ldots, m.$$
4.2.2 Markowitz portfolio optimization
Let $X_1, X_2, \ldots, X_n$ be the returns of the risky assets, with mean returns $\mu_1, \mu_2, \ldots, \mu_n$ and covariance matrix $\Sigma$. For portfolio weights $p$ we minimize the variance
$$p^T \Sigma p$$
such that the loss probability is controlled,
$$P[p^T X \le u] \le \delta,$$
for a given loss level $u$.
5 Duality Theory
Problem:
$$\min f_0(x) \qquad (24)$$
such that
$$f_i(x) \le 0, \ i = 1, \ldots, m, \qquad h_i(x) = 0, \ i = 1, \ldots, p,$$
where
$$L(x, \lambda, \nu) := f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x), \qquad L : D \times R^m \times R^p \to R,$$
is the Lagrangian.
Define
$$g(\lambda, \nu) := \inf_x L(x, \lambda, \nu) = \inf_x \Big\{ f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x) \Big\}.$$
Claim: $p^* \ge g(\lambda, \nu)$ for all $\lambda \ge 0$.
Proof: Let $\lambda \ge 0$ and let $x$ be feasible. Then $\lambda_i f_i(x) \le 0$ and $\nu_i h_i(x) = 0$, so
$$L(x, \lambda, \nu) := f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x) \le f_0(x),$$
but
$$g(\lambda, \nu) = \inf_y L(y, \lambda, \nu) \le L(x, \lambda, \nu),$$
so that $g(\lambda, \nu) \le f_0(x)$ for every feasible $x$, and hence $g(\lambda, \nu) \le p^*$. $\square$
5.0.3 Example: dual function for LP
$$\min c^T x \quad \text{s.t.} \quad Ax = b, \ x \ge 0.$$
Rewrite $x \ge 0$ instead as $-x \le 0$. Then
$$L(x, \lambda, \nu) := c^T x + \lambda^T(-x) + \nu^T(Ax - b)$$
and
$$g(\lambda, \nu) = \inf_x \{(c^T - \lambda^T + \nu^T A)x\} - \nu^T b = \begin{cases} -\nu^T b & \text{if } A^T\nu = \lambda - c \\ -\infty & \text{otherwise.} \end{cases}$$
Since $x$ is unconstrained in the infimum, the linear function $(c^T - \lambda^T + \nu^T A)x$ has infimum $-\infty$. To avoid that, we need to have
$$c^T - \lambda^T + \nu^T A = 0,$$
or equivalently $A^T\nu = \lambda - c$.
5.0.4 Example: trust-region problem
$$\min x^T A x + b^T x \quad \text{s.t.} \quad x^T x \le \Delta^2.$$
The Lagrangian is $L(x, \lambda) = x^T(A + \lambda I)x + b^T x - \lambda\Delta^2$, and $\nabla_x L = 0$ hence gives
$$x = -(A + \lambda I)^{-1}\frac{b}{2}$$
if $A + \lambda I$ is invertible.
Therefore,
$$g(\lambda) = \begin{cases} -\frac{1}{4} b^T (A + \lambda I)^{-1} b - \lambda\Delta^2 & \text{if } A + \lambda I \text{ is positive definite} \\ -\infty & \text{otherwise.} \end{cases}$$
The value $-\frac{1}{4} b^T (A + \lambda I)^{-1} b$ is obtained by plugging $x = -(A + \lambda I)^{-1}\frac{b}{2}$ into the quadratic part of $L$:
$$\frac{b^T}{2}(A + \lambda I)^{-1}(A + \lambda I)(A + \lambda I)^{-1}\frac{b}{2} - b^T(A + \lambda I)^{-1}\frac{b}{2} = -\frac{1}{4} b^T (A + \lambda I)^{-1} b.$$
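A numeric sanity check that $g(\lambda)$ lower-bounds $p^*$; $A$, $b$, $\Delta$ below are illustrative data, with $A$ indefinite so that the constraint matters, and $p^*$ is brute-forced on a grid:

import numpy as np

A = np.array([[1.0, 0.0], [0.0, -2.0]])           # indefinite quadratic
b = np.array([1.0, 1.0]); Delta = 1.0

def g(lam):
    M = A + lam * np.eye(2)
    if np.any(np.linalg.eigvalsh(M) <= 0):        # g = -inf unless A + lam I is pos. definite
        return -np.inf
    return -0.25 * b @ np.linalg.solve(M, b) - lam * Delta**2

# Brute-force p* = min { x^T A x + b^T x : |x| <= Delta } on a polar grid:
th, r = np.meshgrid(np.linspace(0, 2 * np.pi, 400), np.linspace(0, Delta, 200))
X, Y = r * np.cos(th), r * np.sin(th)
p_star = np.min(X**2 - 2 * Y**2 + X + Y)
print(p_star, max(g(lam) for lam in np.linspace(2.01, 6.0, 200)))   # g stays below p*, nearly tight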
Remark:
• $p^* \ge -\frac{1}{4} b^T (A + \lambda I)^{-1} b - \lambda\Delta^2$ whenever $\lambda \ge 0$ and $A + \lambda I$ is positive definite.
• The dual function provides a nontrivial lower bound on $p^*$ if $\lambda \ge 0$ and $g(\lambda, \nu) \neq -\infty$.
• A pair $(\lambda, \nu)$ with $\lambda \ge 0$ and $g(\lambda, \nu) \neq -\infty$ is called dual feasible.
Question: what is the best lower bound the dual function can give us?
Answer:
5.1 Dual Problem
1. The dual problem is
$$d^* := \max g(\lambda, \nu) \quad \text{s.t.} \quad \lambda \ge 0,$$
which is always a convex problem, since $g$ is concave (a pointwise infimum of affine functions of $(\lambda, \nu)$).
2. In general, $p^* \ge d^*$ (weak duality).
5.1.2 Weak Duality
Since
$$p^* \ge g(\lambda, \nu) \quad \forall \lambda \ge 0,$$
we have
$$p^* \ge \max\{g(\lambda, \nu) \mid \lambda \ge 0\} =: d^*.$$
NOTE: Weak duality is always true, no matter if the primal is convex or not.
5.1.3 Examples
1. LP:
$$\min c^T x \quad \text{s.t.} \quad Ax = b, \ x \ge 0.$$
Then
$$L(x, \lambda, \nu) := c^T x + \lambda^T(-x) + \nu^T(Ax - b) = (c^T - \lambda^T + \nu^T A)x - \nu^T b$$
and
$$g(\lambda, \nu) = \inf_x \{(c^T - \lambda^T + \nu^T A)x\} - \nu^T b = \begin{cases} -\nu^T b & \text{if } A^T\nu + c = \lambda \\ -\infty & \text{otherwise.} \end{cases}$$
The dual problem is
$$\max g(\lambda, \nu) \quad \text{s.t.} \quad \lambda \ge 0,$$
i.e.,
$$\max_{\lambda, \nu} -\nu^T b \quad \text{s.t.} \quad A^T\nu + c = \lambda, \ \lambda \ge 0,$$
or, eliminating $\lambda$,
$$\max_{\nu} -\nu^T b \quad \text{s.t.} \quad A^T\nu + c \ge 0.$$
2. Two-way partitioning: $\min x^T W x$ s.t. $x_i^2 = 1$, $i = 1, \ldots, n$. Its Lagrangian is $x^T(W + N)x - \sum_i \nu_i$, whose infimum over $x$ equals
$$g(\nu) = \begin{cases} -\sum_{i=1}^{n} \nu_i & \text{if } W + N \succeq 0 \\ -\infty & \text{otherwise,} \end{cases}$$
where
$$N = \begin{pmatrix} \nu_1 & & 0 \\ & \ddots & \\ 0 & & \nu_n \end{pmatrix}.$$
Dual Problem:
$$\max -\sum_{i=1}^{n} \nu_i \quad \text{s.t.} \quad W + N \succeq 0. \qquad (25)$$
Remarks: $p^* \ge d^*$ always holds, even if $p^*, d^* = \pm\infty$:
1. If $p^* = -\infty$, i.e. the primal is unbounded, then $d^* = -\infty$, i.e., the dual is infeasible.
2. If $d^* = +\infty$, i.e., the dual is unbounded, then $p^* = +\infty$, i.e., the primal is infeasible.
5.2 Strong duality
Strong duality means $p^* = d^*$. For convex problems it holds under Slater's condition (the existence of a strictly feasible point).
5.2.1 Examples
1. LP:
$$\min c^T x \quad \text{s.t.} \quad Ax = b, \ x \ge 0.$$
Slater's condition just means feasibility in this case. The dual in this case is
$$\max_{\lambda, \nu} -\nu^T b \quad \text{s.t.} \quad A^T\nu + c = \lambda, \ \lambda \ge 0.$$
(a) If the primal is feasible (then the weak Slater condition holds), then we have strong duality, i.e. $p^* = d^*$.
(b) If the dual is feasible (then the weak Slater condition holds), then we have strong duality, i.e. $d^* = p^*$.
(c) If the primal and the dual are both infeasible, then it can happen that $+\infty = p^* \neq d^* = -\infty$.
2. QCQP (quadratically constrained quadratic programming):
$$\min \frac{1}{2} x^T P_0 x + q_0^T x + r_0 \quad \text{s.t.} \quad \frac{1}{2} x^T P_i x + q_i^T x + r_i \le 0, \ i = 1, \ldots, m, \qquad Ax = b.$$
For the trust-region instance of 5.0.4 ($\min x^T A x + b^T x$ s.t. $x^T x \le \Delta^2$), the dual problem is
$$\max -\frac{1}{4} b^T (A + \lambda I)^{-1} b - \lambda\Delta^2 \quad \text{s.t.} \quad \lambda \ge 0, \ A + \lambda I \succeq 0.$$
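For the LP case, strong duality is easy to confirm numerically; a minimal sketch with illustrative data, solving the primal and the eliminated-$\lambda$ dual $\max -b^T\nu$ s.t. $A^T\nu + c \ge 0$ with scipy.optimize.linprog:

import numpy as np
from scipy.optimize import linprog

c = np.array([1.0, 2.0, 0.0])                     # illustrative LP data
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])

primal = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * 3)
# Dual in linprog form: minimize b.nu subject to -A^T nu <= c, nu free.
dual = linprog(b, A_ub=-A.T, b_ub=c, bounds=[(None, None)])
print(primal.fun, -dual.fun)                      # p* = d* = 0.0: strong duality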
[Figure: the dual function $g(\lambda, \nu)$, maximized at $d^*$, lies below the primal objective $f_0(x)$, minimized at $p^*$.]
5.3 Optimality conditions
Assume that the objective function and the constraints are differentiable.
Complementary slackness:
Assume we have strong duality, i.e. $p^* = d^*$, attained at $x^*$ for the primal problem and $(\lambda^*, \nu^*)$ for the dual problem, i.e.
$$f_0(x^*) = g(\lambda^*, \nu^*) = \inf_x L(x, \lambda^*, \nu^*) \le L(x^*, \lambda^*, \nu^*) = f_0(x^*) + \sum_{i=1}^{m} \lambda_i^* f_i(x^*) + \sum_{i=1}^{p} \nu_i^* h_i(x^*) \le f_0(x^*).$$
Since we actually achieve equality in the next-to-last inequality, and since
$$h_i(x^*) = 0, \quad f_i(x^*) \le 0, \quad \lambda^* \ge 0,$$
every term $\lambda_i^* f_i(x^*)$ must vanish. Summarizing:
• Primal feasibility:
$$f_i(x) \le 0, \ \forall i = 1, \ldots, m, \qquad h_i(x) = 0, \ \forall i = 1, \ldots, p.$$
• Dual feasibility: $\lambda \ge 0$.
• Complementary slackness:
$$\lambda_i^* f_i(x^*) = 0, \ \forall i = 1, \ldots, m.$$
• Stationarity: since $x^*$ minimizes $L(x, \lambda^*, \nu^*)$ over $x$, the gradient vanishes there:
$$\nabla_x L(x, \lambda, \nu) = \nabla f_0(x) + \sum_{i=1}^{m} \lambda_i \nabla f_i(x) + \sum_{i=1}^{p} \nu_i \nabla h_i(x) = 0.$$
Together these are the Karush-Kuhn-Tucker (KKT) conditions.
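A sketch verifying these conditions at a known primal-dual pair of the small LP used in the strong-duality check above ($x^*$ and $\nu^*$ are solved by hand; for the LP, $f_i(x) = -x_i$, so complementary slackness reads $\lambda_i^* x_i^* = 0$):

import numpy as np

c = np.array([1.0, 2.0, 0.0]); A = np.array([[1.0, 1.0, 1.0]]); b = np.array([1.0])
x_star, nu_star = np.array([0.0, 0.0, 1.0]), np.array([0.0])
lam_star = A.T @ nu_star + c                      # lambda* = A^T nu* + c, as derived above

print(np.allclose(A @ x_star, b), np.all(x_star >= 0))   # primal feasibility
print(np.all(lam_star >= 0))                             # dual feasibility
print(np.allclose(lam_star * x_star, 0.0))               # complementary slackness
print(np.allclose(c - lam_star + A.T @ nu_star, 0.0))    # stationarity of the Lagrangian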
Contents
I Unconstrained optimization
II Constrained Optimization
3 Mathematical programming
  3.1 Standard Linear programming
  3.2 Simplex Method
4 Convex Optimization
  4.1 Chebyshev problem
    4.1.1 Example 1
    4.1.2 Example 2: Forward rate
  4.2 Quadratic problem QP
    4.2.1 Quadratic constraint quadratic problem QCQP
    4.2.2 Markowitz portfolio optimization
5 Duality Theory
    5.0.3 Example: dual function for LP
    5.0.4 Example: trust-region problem
  5.1 Dual Problem
    5.1.1 Strong Duality
    5.1.2 Weak Duality
    5.1.3 Examples
  5.2 Strong duality
  5.3 Optimality conditions
  5.4 Necessary optimality conditions (Karush-Kuhn-Tucker)