
Gradient and Optimization

Digression
DERIVATIVES AND GRADIENT
Derivatives
Some derivation rules:

(w^2)' = 2w            (f(w)^2)' = 2 f(w) f'(w)            (c)' = 0

(w^a)' = a w^(a-1)     (f(w)^a)' = a f(w)^(a-1) f'(w)      (c w)' = c

(f(w) + g(w))' = f'(w) + g'(w)

(e^w)' = e^w           (e^f(w))' = e^f(w) f'(w)

(ln w)' = 1/w          (ln f(w))' = f'(w) / f(w)

If we are supplied a value for w, say w = 5, then the above expressions become numbers.
We say that we obtain the derivative at the point w = 5.
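To make this concrete, here is a minimal sketch (an assumed example, not part of the slides) that checks the rule (w^2)' = 2w at w = 5 with a finite-difference approximation:

```python
# A minimal sketch (assumed example, not from the slides): check the rule
# (w^2)' = 2w at the point w = 5 with a finite-difference approximation.

def f(w):
    return w ** 2

def f_prime(w):
    return 2 * w  # analytic derivative

w = 5.0
h = 1e-6  # small step for the finite difference
numeric = (f(w + h) - f(w)) / h

print(f_prime(w))  # 10.0
print(numeric)     # approximately 10.000001
```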
Partial Derivatives
Suppose now we have a function of multiple variables, e.g.
f(w1, w2, w3) = (w1 w2 w3)^2

It can also be written as f(w), where w = [w1, w2, w3].


This function has three partial derivatives:
f'_w1, obtained by treating w1 as the variable and w2, w3 as constants
f'_w2, obtained by treating w2 as the variable and w1, w3 as constants
f'_w3, obtained by treating w3 as the variable and w1, w2 as constants

The derivation rules are the same as those for a single variable:

f_w1(w1, w2, w3) = 2 (w1 w2 w3) · (w2 w3) = 2 w1 w2^2 w3^2

f_w2(w1, w2, w3) = 2 (w1 w2 w3) · (w1 w3) = 2 w1^2 w2 w3^2

f_w3(w1, w2, w3) = 2 (w1 w2 w3) · (w1 w2) = 2 w1^2 w2^2 w3
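As a quick check (a sketch assuming SymPy is available; not part of the slides), the three partial derivatives can be verified symbolically:

```python
# A minimal sketch (assumes SymPy; not from the slides): verify the three
# partial derivatives of f(w1, w2, w3) = (w1 * w2 * w3)**2 symbolically.
import sympy as sp

w1, w2, w3 = sp.symbols("w1 w2 w3")
f = (w1 * w2 * w3) ** 2

print(sp.diff(f, w1))  # 2*w1*w2**2*w3**2
print(sp.diff(f, w2))  # 2*w1**2*w2*w3**2
print(sp.diff(f, w3))  # 2*w1**2*w2**2*w3
```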

Other Notation
f_w1 is also denoted by ∂f / ∂w1

f_w2 is also denoted by ∂f / ∂w2

f_w3 is also denoted by ∂f / ∂w3
Gradient
The gradient is the vector of partial derivatives.


∇f(w) = ( 2 w1 w2^2 w3^2 , 2 w1^2 w2 w3^2 , 2 w1^2 w2^2 w3 )
Now suppose we want to compute the gradient at a point, say
w=(1,2,3), i.e. w1=1, w2=2, w3=3


∇f(w) = ∇f([1, 2, 3]) = ( 2·1·2^2·3^2 , 2·1^2·2·3^2 , 2·1^2·2^2·3 ) = ( 72, 36, 24 )
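The same numbers can be obtained numerically; here is a minimal sketch (an assumed example, not from the slides) using finite differences:

```python
# A minimal sketch (assumed example, not from the slides): approximate the
# gradient of f(w) = (w1 * w2 * w3)**2 at w = (1, 2, 3) with finite differences.
import numpy as np

def f(w):
    return (w[0] * w[1] * w[2]) ** 2

def numeric_gradient(f, w, h=1e-6):
    grad = np.zeros_like(w)
    for i in range(len(w)):
        w_plus = w.copy()
        w_plus[i] += h
        grad[i] = (f(w_plus) - f(w)) / h
    return grad

w = np.array([1.0, 2.0, 3.0])
print(numeric_gradient(f, w))  # approximately [72. 36. 24.]
```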
OPTIMIZATION
Minimization Problem

min_w f(w)
Iterative Method
Start at some w0; take a step along the steepest slope.
Fixed step size η:

w ← w + η v

v is a unit vector in the direction of the steepest slope.

What's the steepest slope?
Steepest Slope
Suppose we have a function f(w) of m variables.

Given a unit vector v (of m dimensions), the v-directional derivative of f at a point w is:

D_{v,f}(w) = lim_{t→0} [ f(w + t v) − f(w) ] / t

and it gives the rate of increase of f in the direction v.

It has been shown that

D_{v,f}(w) = v1 f_w1(w) + ... + vm f_wm(w)
           = v · ( f_w1(w), ..., f_wm(w) )    (the gradient of f at w)
           = v · ∇f(w)
Digression: Steepest Slope
When is D_{v,f}(w) the biggest?

D_{v,f}(w) = v · ∇f(w) = ||v|| ||∇f(w)|| cos θ = ||∇f(w)|| cos θ

It is biggest when cos θ = 1, i.e. θ = 0.

v = ∇f(w) / ||∇f(w)||      direction of fastest increase of f.

v = −∇f(w) / ||∇f(w)||     direction of fastest decrease of f: same slope (steepest), but opposite direction.
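One can check this numerically; here is a minimal sketch (an assumed example, not from the slides) comparing the directional derivative along the normalized gradient with the one along an arbitrary unit vector:

```python
# A minimal sketch (assumed example, not from the slides): the directional
# derivative of f(w) = (w1 * w2 * w3)**2 at w = (1, 2, 3) is largest along
# the normalized gradient direction.
import numpy as np

def f(w):
    return (w[0] * w[1] * w[2]) ** 2

def directional_derivative(f, w, v, t=1e-6):
    return (f(w + t * v) - f(w)) / t

w = np.array([1.0, 2.0, 3.0])
grad = np.array([72.0, 36.0, 24.0])           # gradient computed earlier

v_grad = grad / np.linalg.norm(grad)          # steepest-ascent direction
v_other = np.array([1.0, 0.0, 0.0])           # some other unit vector

print(directional_derivative(f, w, v_grad))   # about ||grad|| = 84
print(directional_derivative(f, w, v_other))  # about 72, smaller
```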
Step Size
w ← w + η v

Figures taken from Y. A. Mostafa (California Institute of Technology)


Variable Step
Instead of using a constant η in

w ← w + η v,  with  v = −∇E(w) / ||∇E(w)||,

make η proportional to the magnitude of the gradient, i.e.

η ∝ ||∇E(w)||.

And we get

η v = −η ∇E(w),

so the update becomes w ← w − η ∇E(w). (Here E(w) denotes the function being minimized, playing the role of f(w) above.)
Gradient Descent Algorithm
Initialize w = 0
For t = 0, 1, 2, ... do
    Compute the gradient ∇f(w)
    Update the weights: w ← w − η ∇f(w)
    Iterate with the next step until w doesn't change too much
    (or for a fixed number of iterations)
Return the final w.
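A minimal sketch of the algorithm in Python (the convex objective, η, and the stopping tolerance are assumptions chosen for illustration, not taken from the slides):

```python
# A minimal sketch of the gradient descent algorithm (the objective, eta,
# and the stopping tolerance are assumed values for illustration).
import numpy as np

TARGET = np.array([1.0, 2.0, 3.0])

def f(w):
    # Example convex objective: squared distance from the point (1, 2, 3).
    return np.sum((w - TARGET) ** 2)

def gradient(w):
    # Gradient of the objective above.
    return 2 * (w - TARGET)

def gradient_descent(eta=0.1, max_iters=1000, tol=1e-8):
    w = np.zeros(3)                          # initialize w = 0
    for t in range(max_iters):               # for t = 0, 1, 2, ... do
        grad = gradient(w)                   # compute the gradient
        w_new = w - eta * grad               # update the weights
        if np.linalg.norm(w_new - w) < tol:  # stop when w doesn't change much
            return w_new
        w = w_new
    return w                                 # return the final w

print(gradient_descent())  # approximately [1. 2. 3.]
```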
HOW DO WE DETERMINE η?
f(w) during iterations

[Figure: f(w) plotted against the number of iterations (100, 200, 300, 400); the curve decreases steadily, with the value of f for the w obtained after 100 iterations marked.]

A picture like this tells us that the gradient descent is working fine.
f(w) during iterations

[Figure: f(w) plotted against the number of iterations (100, 200, 300, 400); the curve does not steadily decrease.]

A picture like this tells us that the gradient descent is NOT working fine. We should use a smaller η.

If η is small enough, a convex f(w) should decrease on every iteration.
However, if η is too small, it will take a long time to converge.
Practically
Try

η = 0.001,  η = 0.01,  η = 0.1,  η = 1

Plot or inspect f(w) for each one. If it is decreasing at a reasonable speed, choose that value for η.
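A sketch of that procedure (the objective, gradient, and iteration count are assumptions for illustration, not from the slides):

```python
# A minimal sketch (assumed example, not from the slides): run gradient
# descent with several values of eta and compare the f(w) curves.
import numpy as np
import matplotlib.pyplot as plt

target = np.array([1.0, 2.0, 3.0])

def f(w):
    return np.sum((w - target) ** 2)

def gradient(w):
    return 2 * (w - target)

for eta in [0.001, 0.01, 0.1, 1]:
    w = np.zeros(3)
    history = []
    for t in range(400):
        history.append(f(w))
        w = w - eta * gradient(w)
    plt.plot(history, label=f"eta = {eta}")

plt.xlabel("number of iterations")
plt.ylabel("f(w)")
plt.legend()
plt.show()
```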
