Quasi-Newton Methods
Zico Kolter
(notes by Ryan Tibshirani, Javier Peña, Zico Kolter)
Convex Optimization 10-725
Last time: primal-dual interior-point methods
$$\min_x \; f(x) \quad \text{subject to} \quad h(x) \le 0, \;\; Ax = b$$

Primal-dual interior-point method: repeat the updates

$$(x^+, u^+, v^+) = (x, u, v) + \theta\,(\Delta x, \Delta u, \Delta v)$$

where the direction (Δx, Δu, Δv) is a Newton-like step for a perturbed version of the KKT conditions.
Outline
Today:
• Quasi-Newton motivation
• SR1, BFGS, DFP, Broyden class
• Convergence analysis
• Limited memory BFGS
• Stochastic quasi-Newton
Gradient descent and Newton revisited
Back to unconstrained, smooth convex optimization
$$\min_x \; f(x)$$

Gradient descent update: $x^+ = x - t\,\nabla f(x)$. Newton's method replaces the negative gradient with the Newton direction: $x^+ = x - t\,(\nabla^2 f(x))^{-1}\nabla f(x)$.
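As a quick illustration (a minimal sketch on a hypothetical toy problem, not from the notes), the two updates behave very differently even on a simple quadratic:

```python
# Gradient descent vs. Newton on the toy quadratic f(x) = 0.5 x^T Q x,
# whose minimizer is x* = 0. All names here are illustrative.
import numpy as np

Q = np.diag([10.0, 1.0])           # ill-conditioned Hessian
grad = lambda x: Q @ x             # ∇f(x)
hess = lambda x: Q                 # ∇²f(x), constant for a quadratic

x_gd = np.array([1.0, 1.0])
x_nt = np.array([1.0, 1.0])
t = 0.1                            # fixed step size, for illustration only
for _ in range(20):
    x_gd = x_gd - t * grad(x_gd)                            # gradient descent
    x_nt = x_nt - np.linalg.solve(hess(x_nt), grad(x_nt))   # pure Newton step

print(np.linalg.norm(x_gd), np.linalg.norm(x_nt))  # Newton is exact after one step
```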
Quasi-Newton methods
Two main steps in Newton iteration:
• Compute Hessian ∇2 f (x)
• Solve the system ∇2 f (x)∆x = −∇f (x)
Each of these two steps can be expensive: forming the Hessian, and solving the n × n linear system (O(n³) with a dense factorization). Quasi-Newton methods instead maintain a cheap-to-update approximation B ≈ ∇²f(x), solve BΔx = −∇f(x), and take the step

$$x^+ = x + t\,\Delta x$$
Some history

• DFP update: Davidon (1959), later developed by Fletcher and Powell (1963)
• BFGS update: proposed independently by Broyden, Fletcher, Goldfarb, and Shanno (1970)
Quasi-Newton template

Basic idea: as B^(k−1) already contains information about the Hessian, use a suitable matrix update to form B^(k). The generic template, starting from x^(0) ∈ ℝⁿ and B^(0) ≻ 0, repeats (see the sketch below):

1. Solve B^(k−1) Δx^(k−1) = −∇f(x^(k−1))
2. Update x^(k) = x^(k−1) + t_k Δx^(k−1)
3. Compute B^(k) from B^(k−1)
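A minimal sketch of this template in NumPy, assuming a generic `update(B, s, y)` rule to be plugged in (SR1, BFGS, and the Broyden class below all fit); the function and parameter names are illustrative:

```python
# Quasi-Newton template: solve with the current Hessian approximation B,
# step, then update B from the observed change in the gradient.
import numpy as np

def quasi_newton(grad, x0, update, t=1.0, iters=100, tol=1e-8):
    x = np.asarray(x0, dtype=float)
    B = np.eye(x.size)                      # B(0) ≻ 0; identity for simplicity
    g = grad(x)
    for _ in range(iters):
        dx = np.linalg.solve(B, -g)         # solve B(k-1) Δx = -∇f(x(k-1))
        x_new = x + t * dx                  # in practice t comes from a line search
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g         # inputs to the secant equation
        B = update(B, s, y)                 # form B(k) from B(k-1)
        x, g = x_new, g_new
        if np.linalg.norm(g) < tol:
            break
    return x
```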
Secant equation

We want the update B⁺ to satisfy the secant equation

$$B^+ s = y$$

where $s = x^+ - x$ is the step and $y = \nabla f(x^+) - \nabla f(x)$ is the change in the gradient.
Symmetric rank-one update
The symmetric rank-one (SR1) update has the form

$$B^+ = B + a\,uu^T$$

Plugging this into the secant equation $B^+ s = y$ gives

$$(a\,u^T s)\,u = y - Bs$$

so $u$ must be a multiple of $y - Bs$; solving for the scaling yields

$$B^+ = B + \frac{(y - Bs)(y - Bs)^T}{(y - Bs)^T s}$$
How can we solve B⁺Δx⁺ = −∇f(x⁺) in order to take the next step? In addition to propagating B to B⁺, let's propagate inverses, i.e., C = B⁻¹ to C⁺ = (B⁺)⁻¹.

Sherman-Morrison formula:

$$(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u}$$
Thus for the SR1 update the inverse is also easily updated:

$$C^+ = C + \frac{(s - Cy)(s - Cy)^T}{(s - Cy)^T y}$$
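A minimal NumPy sketch of these two rank-one updates (illustrative helper names); the guard that skips the update when the denominator is tiny is a standard SR1 safeguard, not something derived above:

```python
import numpy as np

def sr1_update(B, s, y):
    """SR1 update of the Hessian approximation B."""
    r = y - B @ s
    denom = r @ s                                  # (y - Bs)^T s
    if abs(denom) <= 1e-8 * np.linalg.norm(r) * np.linalg.norm(s):
        return B                                   # safeguard: skip the update
    return B + np.outer(r, r) / denom

def sr1_inverse_update(C, s, y):
    """Matching update of the inverse C = B^{-1}, via Sherman-Morrison."""
    r = s - C @ y
    denom = r @ y                                  # (s - Cy)^T y
    if abs(denom) <= 1e-8 * np.linalg.norm(r) * np.linalg.norm(y):
        return C
    return C + np.outer(r, r) / denom
```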
Broyden-Fletcher-Goldfarb-Shanno update
Let's now try a rank-two update:

$$B^+ = B + a\,uu^T + b\,vv^T$$

Imposing the secant equation $B^+ s = y$ (and a bit of algebra) leads to the Broyden-Fletcher-Goldfarb-Shanno (BFGS) update

$$B^+ = B - \frac{B s s^T B}{s^T B s} + \frac{y y^T}{y^T s}$$
Woodbury formula (generalization of Sherman-Morrison):

$$(A + UCV)^{-1} = A^{-1} - A^{-1} U \,(C^{-1} + V A^{-1} U)^{-1}\, V A^{-1}$$

Applying it to the BFGS update gives the inverse update

$$C^+ = \Big(I - \frac{s y^T}{y^T s}\Big)\, C \,\Big(I - \frac{y s^T}{y^T s}\Big) + \frac{s s^T}{y^T s}$$

The BFGS update is thus still quite cheap: O(n²) per update.
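A minimal NumPy sketch of both forms of the BFGS update (illustrative helper names):

```python
import numpy as np

def bfgs_update(B, s, y):
    """Rank-two BFGS update of the Hessian approximation B."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

def bfgs_inverse_update(C, s, y):
    """O(n^2) update of the inverse C = B^{-1}."""
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(s, y)      # I - s y^T / (y^T s)
    return V @ C @ V.T + rho * np.outer(s, s)      # V C V^T + s s^T / (y^T s)
```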
Positive definiteness of BFGS update

Careful inspection of the update shows that, provided the curvature condition yᵀs > 0 holds, B⁺ remains positive definite whenever B is (and likewise C⁺ ≻ 0). A line search satisfying the Wolfe conditions guarantees yᵀs > 0.
Davidon-Fletcher-Powell update
We could instead have started from the inverse, requiring a rank-two update

$$C^+ = C + a\,uu^T + b\,vv^T$$

to satisfy the secant equation in inverse form, $C^+ y = s$. The analogous steps yield the Davidon-Fletcher-Powell (DFP) update

$$C^+ = C - \frac{C y y^T C}{y^T C y} + \frac{s s^T}{y^T s}$$
Broyden class
$$B^+ = B - \frac{B s s^T B}{s^T B s} + \frac{y y^T}{y^T s} + \phi\,(s^T B s)\,v v^T, \qquad v = \frac{y}{y^T s} - \frac{Bs}{s^T B s}$$

Note:
• BFGS corresponds to φ = 0
• DFP corresponds to φ = 1
• SR1 corresponds to φ = yᵀs/(yᵀs − sᵀBs)
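A minimal NumPy sketch of the family (illustrative name); φ = 0 and φ = 1 recover the BFGS and DFP updates above:

```python
import numpy as np

def broyden_update(B, s, y, phi=0.0):
    """Broyden-class update: BFGS (phi=0), DFP (phi=1), SR1 for the phi above."""
    Bs = B @ s
    sBs = s @ Bs                                   # s^T B s
    v = y / (y @ s) - Bs / sBs
    return (B - np.outer(Bs, Bs) / sBs
              + np.outer(y, y) / (y @ s)
              + phi * sBs * np.outer(v, v))
```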
Convergence analysis
Assume that f is convex and twice differentiable, with dom(f) = ℝⁿ, and additionally:
• ∇f is Lipschitz with parameter L
• f is strongly convex with parameter m
• ∇2 f is Lipschitz with parameter M
(same conditions as in the analysis of Newton’s method)
Under these conditions, quasi-Newton methods (e.g., BFGS) converge superlinearly:

$$\|x^{(k)} - x^\star\|_2 \le c_k\, \|x^{(k-1)} - x^\star\|_2$$

for some sequence $c_k \to 0$.
Example: Newton versus BFGS

Example from Vandenberghe's lecture notes: Newton versus BFGS on an LP barrier problem, for n = 100, m = 500:

$$\min_x \;\; c^T x - \sum_{i=1}^{m} \log(b_i - a_i^T x)$$
[Figure: f(x^(k)) − f⋆ versus iteration k. Newton reaches accuracy below 10⁻¹² in about 10 iterations; BFGS takes roughly 150 iterations.]
Applying the inverse BFGS update, the product C⁺g for an arbitrary vector g can be computed recursively:

$$C^+ g = p + (\alpha - \beta)\,s, \quad \text{where} \quad \alpha = \frac{s^T g}{y^T s}, \;\; q = g - \alpha y, \;\; p = Cq, \;\; \beta = \frac{y^T p}{y^T s}$$
We see that C⁺g can be computed via two loops of length k (if C⁺ is the approximation to the inverse Hessian after k iterations):

1. Let q = −∇f(x^(k))
2. For i = k − 1, …, 0:
   (a) Compute α_i = (s^(i))ᵀq / ((y^(i))ᵀs^(i))
   (b) Update q = q − α_i y^(i)
3. Let p = C^(0) q
4. For i = 0, …, k − 1:
   (a) Compute β = (y^(i))ᵀp / ((y^(i))ᵀs^(i))
   (b) Update p = p + (α_i − β) s^(i)
5. Return p
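A minimal NumPy sketch of this recursion (illustrative names): it stores the pairs (s^(i), y^(i)) and returns the step direction without ever forming C explicitly:

```python
import numpy as np

def two_loop(grad_k, s_list, y_list):
    """Return p = -C(k) ∇f(x(k)) from stored pairs, taking C(0) = I."""
    q = -np.asarray(grad_k, dtype=float)
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # i = k-1, ..., 0
        alpha = (s @ q) / (y @ s)
        alphas.append(alpha)
        q = q - alpha * y
    p = q                                                  # p = C(0) q with C(0) = I
    for s, y, alpha in zip(s_list, y_list, reversed(alphas)):  # i = 0, ..., k-1
        beta = (y @ p) / (y @ s)
        p = p + (alpha - beta) * s
    return p
```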
Limited memory BFGS
Limited memory BFGS (LBFGS) simply limits each of these loops to be of length m:

1. Let q = −∇f(x^(k))
2. For i = k − 1, …, k − m:
   (a) Compute α_i = (s^(i))ᵀq / ((y^(i))ᵀs^(i))
   (b) Update q = q − α_i y^(i)
3. Let p = C̄^(k−m) q
4. For i = k − m, …, k − 1:
   (a) Compute β = (y^(i))ᵀp / ((y^(i))ᵀs^(i))
   (b) Update p = p + (α_i − β) s^(i)
5. Return p

In Step 3, C̄^(k−m) is our guess at C^(k−m) (which is not stored); a popular choice is C̄^(k−m) = I, though more sophisticated choices exist.
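In practice one rarely hand-rolls this; SciPy, for instance, ships an LBFGS variant as `method="L-BFGS-B"`, where the memory m corresponds to the `maxcor` option. A minimal usage sketch on a toy objective (assumed for illustration):

```python
import numpy as np
from scipy.optimize import minimize

f = lambda x: np.sum((x - 1.0) ** 2)      # toy smooth objective
grad = lambda x: 2.0 * (x - 1.0)

res = minimize(f, np.zeros(100), jac=grad, method="L-BFGS-B",
               options={"maxcor": 10})    # maxcor = memory size m
print(res.nit, np.linalg.norm(res.x - 1.0))
```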
Stochastic quasi-Newton methods
Consider now the stochastic optimization problem

$$\min_x \;\; \mathbb{E}_\xi\big[f(x;\xi)\big]$$

where we only have access to stochastic gradients ∇f(x; ξ) (e.g., minibatch gradients of a training loss).
The most straightforward adaptation of quasi-Newton methods is to use BFGS (or LBFGS) with

$$s^{(k)} = x^{(k)} - x^{(k-1)}, \qquad y^{(k)} = \nabla f(x^{(k)}; \xi_k) - \nabla f(x^{(k-1)}; \xi_k)$$

The key is to use the same noise variable ξ_k in the two stochastic gradients. This is due to Schraudolph et al. (2007).
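A minimal sketch of this construction (illustrative names), forming the pair (s, y) from two stochastic gradients evaluated on the same minibatch ξ_k:

```python
import numpy as np

def stochastic_pair(stoch_grad, x_new, x_old, xi):
    """stoch_grad(x, xi) returns ∇f(x; ξ) for a fixed minibatch ξ."""
    s = x_new - x_old
    y = stoch_grad(x_new, xi) - stoch_grad(x_old, xi)   # same ξk in both gradients
    return s, y    # then feed (s, y) into a BFGS / LBFGS update as usual
```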
Example from Byrd et al. (2015), comparing stochastic quasi-Newton (SQN) against SGD and a coordinate descent (CD) method on a synthetic dataset:

[Figure 1: objective value versus iteration for SGD (b = 50, β = 7), SQN (b = 50, β = 2, b_H = 300 and b_H = 600), and CD. SQN with b_H = 300 and 600 outperforms SGD, and obtains the same or better objective value than the coordinate descent method.]
References and further reading

• R. Byrd, S. Hansen, J. Nocedal, and Y. Singer (2015), "A stochastic quasi-Newton method for large-scale optimization"
• J. Nocedal and S. Wright (2006), "Numerical Optimization", Chapters 6 and 7
• N. Schraudolph, J. Yu, and S. Günter (2007), "A stochastic quasi-Newton method for online convex optimization"
• L. Vandenberghe, Lecture notes for EE 236C, UCLA