
Optimization and Search
Chapter 11
Going Downhill
The basic idea is that we want to minimize a function $f(x)$,
where $x$ is a vector $(x_1, x_2, x_3, \ldots, x_n)$ that has one
element for each feature value, starting from some
initial guess $x(0)$.
We try to find a sequence of new points $x(i)$ that move
downhill towards a solution.
To calculate the gradient in all dimensions, we
take the derivative of the function with respect to each of the
dimensions of $x$.
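When analytic derivatives are not available, the gradient can be approximated numerically, one dimension at a time. The following is a minimal sketch using central differences; the function name `numerical_gradient`, the step `h`, and the test function `f` are illustrative assumptions, not part of the slides.

```python
def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient of f at x with central differences."""
    grad = []
    for i in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[i] += h   # perturb only dimension i
        backward[i] -= h
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad


def f(x):
    # Hypothetical test function with its minimum at (1, -2).
    return (x[0] - 1) ** 2 + 2 * (x[1] + 2) ** 2


g = numerical_gradient(f, [0.0, 0.0])
```

At the origin the analytic gradient of this test function is (-2, 8), which the approximation should reproduce closely.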

Going Downhill (contd..)
How do we know when we have found a solution, i.e. when
to stop?
When $\nabla f = 0$, i.e. when we have reached the bottom of the hill.
In practice we usually stop when $\|\nabla f\| < \epsilon$, where $\epsilon$ is some
small number, maybe $10^{-5}$.


[Figure: level sets of the function and its gradient]

Now, from the current point $x_i$ we need to
decide two things:
I. In which direction to move, and
II. How far to move
Going Downhill (contd..)
How far to move?
We have two methods to decide this:
I. Line search: if we know which direction to look in, then
we move along it until we reach the minimum in this
direction.
Mathematically,
$$x_{k+1} = x_k + \alpha_k p_k$$
where $p_k$ is the direction we have chosen to move in and
$\alpha_k$ is the distance to travel in that direction, chosen by the
line search.

Going Downhill (contd..)
II. Trust region: this is more complex, since it
consists of building a local model of the
function as a quadratic form and finding
the minimum of that model.

Going Downhill (contd..)
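In which direction to move?
Make greedy choices and always go downhill as
fast as possible at each point (steepest descent):
$$p_k = -\nabla f(x_k) = -\left(\frac{\partial f}{\partial x_{k1}}, \frac{\partial f}{\partial x_{k2}}, \ldots, \frac{\partial f}{\partial x_{kn}}\right)$$
Thus, we iterate
$$x_{k+1} = x_k + \alpha_k p_k$$
until $\nabla f(x_k) = 0$, which in practice means until $\|\nabla f(x_k)\| < \epsilon$.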
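The steepest descent iteration can be sketched in a few lines. This is only an illustration under stated assumptions: the quadratic test function, the helper names, and the backtracking rule that halves the step until the function decreases sufficiently are all choices made here. Backtracking is a simple inexact line search, swapped in for the exact minimization along the direction.

```python
import math


def f(x):
    # Hypothetical quadratic bowl with its minimum at (1, -2).
    return (x[0] - 1) ** 2 + 2 * (x[1] + 2) ** 2


def grad_f(x):
    return [2 * (x[0] - 1), 4 * (x[1] + 2)]


def step(x, alpha, p):
    return [xi + alpha * pi for xi, pi in zip(x, p)]


def steepest_descent(f, grad_f, x0, eps=1e-5, max_iter=10000):
    x = list(x0)
    for _ in range(max_iter):
        g = grad_f(x)
        g_sq = sum(gi * gi for gi in g)
        if math.sqrt(g_sq) < eps:       # stop when ||grad f|| < eps
            break
        p = [-gi for gi in g]           # direction p_k = -grad f(x_k)
        alpha = 1.0
        # Backtracking line search: halve alpha until f decreases enough.
        while f(step(x, alpha, p)) > f(x) - 1e-4 * alpha * g_sq:
            alpha *= 0.5
        x = step(x, alpha, p)           # x_{k+1} = x_k + alpha_k p_k
    return x


x_min = steepest_descent(f, grad_f, [0.0, 0.0])
```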
Least-Squares Optimization
Newton Direction
Levenberg-Marquardt Algorithm
Newton Direction
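We use a Taylor expansion to expand the function $f(x)$:
$$f(x) \approx f(x_0) + J(x)\big|_{x_0}(x - x_0) + \frac{1}{2}(x - x_0)^T H(x)\big|_{x_0}(x - x_0) + \ldots$$
where $x_0$ is the initial guess and $|_{x_0}$ means that the function is
evaluated at that point.
Here $J(x)$ and $H(x)$ are the Jacobian and the Hessian
matrix.
The Jacobian is the matrix of first derivatives and the Hessian
is the matrix of second derivatives of the function
$f(x)$.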
Newton Direction (contd..)
$$J(x) = \nabla f(x) = \frac{\partial f(x)}{\partial x} = \left(\frac{\partial f(x)}{\partial x_1}, \ldots, \frac{\partial f(x)}{\partial x_n}\right)$$
Newton Direction (contd..)

For a vector-valued $f(x)$, $H(x)$ is three-dimensional and $J(x)$ is
a two-dimensional matrix.
If we minimize the Taylor expansion
written in terms of the Hessian and the Jacobian, then we find the
Newton direction at the $k$th iteration to be:
$$p_k = -H(x_k)^{-1}\,\nabla f(x_k)$$
In the update $x_{k+1} = x_k + \alpha_k p_k$, the step size is always $\alpha_k = 1$.
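As a small illustration of the Newton direction with unit step size, the sketch below minimizes a hypothetical convex function $f(x) = e^{x_0 + x_1} + x_0^2 + x_1^2$. The function, its hand-coded gradient and Hessian, and the 2x2 solver are assumptions chosen to keep the example self-contained.

```python
import math


def grad(x):
    e = math.exp(x[0] + x[1])
    return [e + 2 * x[0], e + 2 * x[1]]


def hessian(x):
    e = math.exp(x[0] + x[1])
    return [[e + 2, e], [e, e + 2]]


def solve_2x2(A, b):
    """Solve A z = b for a 2x2 matrix A by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]


def newton(x0, eps=1e-10, max_iter=50):
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        if math.hypot(g[0], g[1]) < eps:
            break
        # Newton direction: solve H(x_k) p_k = -grad f(x_k).
        p = solve_2x2(hessian(x), [-g[0], -g[1]])
        # Full step, i.e. alpha_k = 1.
        x = [x[0] + p[0], x[1] + p[1]]
    return x


x_min = newton([0.0, 0.0])
```

Solving the linear system rather than inverting the Hessian explicitly is the usual design choice, since it is cheaper and numerically safer.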


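Levenberg-Marquardt Algorithm
For a least-squares problem, the objective function that we are
optimizing is:
$$f(x) = \frac{1}{2}\sum_{j=1}^{m} r_j^2(x) = \frac{1}{2}\|r(x)\|_2^2$$
where $r(x) = (r_1(x), \ldots, r_m(x))^T$ is the vector of residuals.
With $J(x)$ now denoting the Jacobian of $r(x)$, the function gradients can be computed as:
$$\nabla f(x) = J(x)^T r(x)$$
$$\nabla^2 f(x) = J(x)^T J(x) + \sum_{j=1}^{m} r_j(x)\,\nabla^2 r_j(x)$$
Thus, knowing the Jacobian gives the first part of the Hessian without any
additional computational cost, and it is this that special algorithms
can exploit to solve least-squares problems efficiently.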


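Levenberg-Marquardt Algorithm (contd..)
When $r(x)$ is linear, the Jacobian $J$ is constant and the second-derivative terms vanish, so
$$f(x) = \frac{1}{2}\|Jx + r\|_2^2, \quad \nabla f(x) = J^T(Jx + r), \quad \nabla^2 f(x) = J^T J$$
where $r = r(0)$. Setting $\nabla f(x) = 0$ gives $J^T J x = -J^T r$.
We can use this along with some linear algebra to find $x$ using the
Singular Value Decomposition. The SVD is the decomposition of a
matrix $A$ of size $m \times n$ into
$$A = U S V^T$$
where $U$ and $V$ are orthogonal matrices and $S$ is a diagonal matrix.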
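Levenberg-Marquardt Algorithm (contd..)
Now we calculate the SVD of $J$ (with $r = r(0)$ and $S$ the $n \times n$ diagonal block of singular values $\sigma_i$):
$$\|Jx + r\|_2^2 = \|S V^T x + U_1^T r\|_2^2 + \|U_2^T r\|_2^2$$
where $U_1$ is the first $n$ columns of $U$ and $U_2$ is the
remaining columns. The optimal solution is found when the
first term on the right is equal to zero, which means that:
$$x^* = -V S^{-1} U_1^T r = -\sum_{i=1}^{n} \frac{u_i^T r}{\sigma_i}\, v_i$$
where $u_i$ is the $i$th column of $U$, and similarly for $V$.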
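For a concrete linear least-squares instance, the sketch below fits a straight line $y = c_0 + c_1 t$, so the residual is $r(x) = Jx - y$ with rows of $J$ equal to $[1, t_i]$. To stay within the standard library it solves the 2x2 normal equations $J^T J x = J^T y$ directly instead of using the SVD; the function name and the data are illustrative assumptions.

```python
def fit_line(ts, ys):
    """Least-squares fit of y = c0 + c1*t via the normal equations."""
    n = len(ts)
    # Entries of J^T J and J^T y for the design matrix J with rows [1, t_i].
    s_t = sum(ts)
    s_tt = sum(t * t for t in ts)
    s_y = sum(ys)
    s_ty = sum(t * y for t, y in zip(ts, ys))
    # Solve [[n, s_t], [s_t, s_tt]] [c0, c1]^T = [s_y, s_ty]^T by Cramer's rule.
    det = n * s_tt - s_t * s_t
    c0 = (s_tt * s_y - s_t * s_ty) / det
    c1 = (n * s_ty - s_t * s_y) / det
    return c0, c1


c0, c1 = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

The normal equations are fine for a small, well-conditioned problem like this one; the SVD route is generally preferred when $J$ is close to rank-deficient.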


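Levenberg-Marquardt Algorithm (contd..)
When $r(x)$ is non-linear, then
$$\nabla^2 f(x) = J(x)^T J(x) + \sum_{j=1}^{m} r_j(x)\,\nabla^2 r_j(x)$$
Here the principal approximation that the algorithm
makes is to ignore the residual terms, making
each iteration a linear least-squares problem, so
that
$$\nabla^2 f(x) \approx J(x)^T J(x)$$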

Levenberg-Marquardt Algorithm (contd..)
Then the problem to be solved is:
$$\min_p \frac{1}{2}\|J_k p + r_k\|_2^2 \quad \text{subject to } \|p\| \le \Delta_k$$
where $\Delta_k$ is the radius of the trust region, which is the region where
it is assumed that this approximation holds well. Here $\Delta_k$ is
used to control a parameter $\nu \ge 0$ that is added to the
diagonal elements of $J^T J$ and is known as
the damping factor. The minimum $p$ satisfies:
$$(J^T J + \nu I)\,p = -J^T r$$
This equation can be solved in the same way as in the
linear least-squares case. There are various solvers
available for this equation.
Levenberg-Marquardt Algorithm
Given start point $x_0$
While $\|J^T r(x)\|$ > tolerance and maximum number of iterations not exceeded:
    Repeat
        Solve $(J^T J + \nu I)\,dx = -J^T r$ for $dx$ using linear least squares
        Set $x_{new} = x + dx$
        Compute the ratio $\rho$ of the actual and predicted reductions
        If $0 < \rho < 0.25$: accept step: $x = x_{new}$
        Else if $\rho \ge 0.25$: accept step: $x = x_{new}$, increase trust region size (reduce $\nu$)
        Else: reject step and reduce trust region (increase $\nu$)
    Until $x$ is updated or maximum number of iterations is exceeded
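The loop above can be sketched for a two-parameter problem. The example fits a hypothetical model $y = a\,e^{bt}$ to noise-free data; the model, the data, the factor of 10 used to adjust the damping factor, and the way the predicted reduction is computed from the linearized residual are all assumptions made for illustration.

```python
import math


def residuals(x, ts, ys):
    a, b = x
    return [a * math.exp(b * t) - y for t, y in zip(ts, ys)]


def jacobian(x, ts):
    a, b = x
    return [[math.exp(b * t), a * t * math.exp(b * t)] for t in ts]


def levenberg_marquardt(x0, ts, ys, nu=1e-2, tol=1e-8, max_iter=200):
    x = list(x0)
    m = len(ts)
    for _ in range(max_iter):
        r = residuals(x, ts, ys)
        J = jacobian(x, ts)
        # Gradient J^T r and the 2x2 matrix J^T J.
        g = [sum(J[i][k] * r[i] for i in range(m)) for k in range(2)]
        if math.hypot(g[0], g[1]) < tol:
            break
        A = [[sum(J[i][k] * J[i][l] for i in range(m)) for l in range(2)]
             for k in range(2)]
        # Solve (J^T J + nu I) dx = -J^T r by Cramer's rule.
        a00, a01, a11 = A[0][0] + nu, A[0][1], A[1][1] + nu
        det = a00 * a11 - a01 * a01
        dx = [(-g[0] * a11 + a01 * g[1]) / det,
              (-a00 * g[1] + a01 * g[0]) / det]
        x_new = [x[0] + dx[0], x[1] + dx[1]]
        r_new = residuals(x_new, ts, ys)
        r_lin = [r[i] + J[i][0] * dx[0] + J[i][1] * dx[1] for i in range(m)]
        cost = sum(v * v for v in r)
        actual = cost - sum(v * v for v in r_new)
        predicted = cost - sum(v * v for v in r_lin)
        rho = actual / predicted if predicted > 0 else -1.0
        if rho > 0:
            x = x_new          # accept the step
            if rho >= 0.25:
                nu /= 10       # model is good: enlarge the trust region
        else:
            nu *= 10           # reject the step: shrink the trust region
    return x


ts = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * math.exp(0.5 * t) for t in ts]  # data generated with a=2, b=0.5
x_fit = levenberg_marquardt([1.0, 0.3], ts, ys)
```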

Conjugate Gradients
The conjugate gradient algorithm selects the successive direction
vectors as a conjugate version of the successive gradients obtained
as the method progresses.
The basis for a nonlinear conjugate gradient method is to effectively
apply the linear conjugate gradient method, where the residual is
replaced by the gradient.
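Conjugate Gradients (contd..)
Given start point $x_0$ and stopping parameter $\epsilon$, set $p_0 = -\nabla f(x_0)$
Set $p_{new} = p_0$
While $p_{new}^T p_{new} > \epsilon^2\, p_0^T p_0$:
    Compute $\alpha_k$ and $x_{new} = x + \alpha_k p$ using the Newton-Raphson iteration:
        While $\alpha^2\, dp > \epsilon^2$:
            $\alpha = -\dfrac{\nabla f(x)^T p}{p^T H(x)\, p}$
            $x = x + \alpha p$
            $dp = p^T p$
    Evaluate $\nabla f(x_{new})$
    Compute $\beta_{k+1}$ using one of the following equations:
        Fletcher-Reeves: $\beta_{k+1} = \dfrac{\nabla f(x_{new})^T \nabla f(x_{new})}{\nabla f(x)^T \nabla f(x)}$
        Polak-Ribière: $\beta_{k+1} = \dfrac{\nabla f(x_{new})^T \left(\nabla f(x_{new}) - \nabla f(x)\right)}{\nabla f(x)^T \nabla f(x)}$
    Update $p_{new} = -\nabla f(x_{new}) + \beta_{k+1}\, p$
    Check for restarts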
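A minimal sketch of the conjugate gradient loop, assuming the Fletcher-Reeves formula for $\beta$ and a hypothetical quadratic $f(x) = \frac{1}{2}x^T A x - b^T x$ with $A = [[4, 1], [1, 3]]$ and $b = [1, 2]$. For a quadratic, one Newton-Raphson step gives the exact line-search minimum, so the inner loop reduces to a single update; restarts are omitted.

```python
def grad(x):
    # Gradient A x - b of the hypothetical quadratic.
    return [4 * x[0] + x[1] - 1, x[0] + 3 * x[1] - 2]


def hess_times(x, p):
    # Hessian-vector product H(x) p; the Hessian is constant here.
    return [4 * p[0] + p[1], p[0] + 3 * p[1]]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def conjugate_gradients(x0, eps=1e-8, max_iter=100):
    x = list(x0)
    g = grad(x)
    p = [-gi for gi in g]              # p_0 = -grad f(x_0)
    dp0 = dot(p, p)
    for _ in range(max_iter):
        if dot(p, p) <= eps ** 2 * dp0:
            break
        # One Newton-Raphson step along p (exact for a quadratic).
        alpha = -dot(g, p) / dot(p, hess_times(x, p))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        g_new = grad(x)
        beta = dot(g_new, g_new) / dot(g, g)   # Fletcher-Reeves
        p = [-gn + beta * pi for gn, pi in zip(g_new, p)]
        g = g_new
    return x


x_min = conjugate_gradients([2.0, 1.0])
```

On this two-dimensional quadratic the method reaches the minimizer $(1/11, 7/11)$ after two direction updates, up to rounding.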


Traveling Salesman Problem
The traveling salesman problem is one of the classical
problems in computer science.
A traveling salesman wants to visit a number of cities and
then return to his starting point. Of course he wants to save
time and energy, so he wants to determine the shortest cycle
for his trip.
We can represent the cities and the distances between them
by a weighted, complete, undirected graph.
The problem then is to find the shortest cycle (the one of minimum
total weight) that visits each vertex exactly once.
Exhaustive Search
Try out every solution and pick the best one.
This method is guaranteed to find the global
optimum, because it checks every single solution.
It is impractical for any reasonably sized problem.
For the TSP it would involve testing every single
possible ordering of the cities and calculating
the distance for each ordering.
Its computational complexity is O(N!).
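Exhaustive search for the TSP can be sketched with `itertools.permutations`. The city coordinates and function names below are illustrative assumptions. Fixing the first city removes rotations of the same cycle, which leaves (N-1)! orderings, still factorial in N.

```python
import itertools
import math


def tour_length(tour, cities):
    """Length of the closed tour that visits cities in the given order."""
    n = len(tour)
    return sum(math.dist(cities[tour[i]], cities[tour[(i + 1) % n]])
               for i in range(n))


def exhaustive_tsp(cities):
    best_tour, best_len = None, float("inf")
    # Fix city 0 as the start and try every ordering of the rest.
    for perm in itertools.permutations(range(1, len(cities))):
        tour = (0,) + perm
        length = tour_length(tour, cities)
        if length < best_len:
            best_tour, best_len = tour, length
    return best_tour, best_len


cities = [(0, 0), (0, 1), (1, 1), (1, 0)]  # hypothetical unit square
best_tour, best_len = exhaustive_tsp(cities)
```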
Greedy Search
It makes just one pass through the system, making the
best local choice at each stage.
For the TSP, it chooses the first city arbitrarily and then
repeatedly picks the unvisited city that is closest to the
current one, until it runs out of
cities.
This is computationally cheap compared to other
searches.
It makes only N choices (O(N^2) distance comparisons if each choice
scans all unvisited cities), but it does not guarantee the
optimal solution.
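A nearest-neighbour version of the greedy search might look like the sketch below; the function name and the city coordinates are illustrative assumptions.

```python
import math


def greedy_tsp(cities, start=0):
    """Build a tour by always moving to the nearest unvisited city."""
    unvisited = set(range(len(cities)))
    unvisited.remove(start)
    tour = [start]
    while unvisited:
        here = cities[tour[-1]]
        # Best local choice: the closest city not yet visited.
        nearest = min(unvisited, key=lambda j: math.dist(here, cities[j]))
        tour.append(nearest)
        unvisited.remove(nearest)
    return tour


cities = [(0, 0), (3, 0), (1, 0), (2, 0)]  # hypothetical cities on a line
tour = greedy_tsp(cities)
```

Each pass of the loop scans all remaining cities, which is where the quadratic number of distance comparisons comes from.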
Hill Climbing
The basic idea is to perform a local search around
the current solution, choosing any option that
improves the result.
The choice of how to do the local search is called the
move-set.
The move-set defines how the current solution
can be changed to generate new solutions.
Hill Climbing (contd..)
Algorithm for TSP
I. Choose an initial tour randomly.
II. Then keep swapping pairs of cities, keeping a swap only if the total
length of the tour decreases, i.e.,
if new dist. traveled < before dist. traveled.
III. Stop after a predefined number of swaps or when
no swap has improved the solution for some time.
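The three steps can be sketched as follows. The fixed swap budget `n_swaps`, the seed, and the city coordinates are illustrative assumptions; stopping after a run of non-improving swaps would be an equally valid choice.

```python
import math
import random


def tour_length(tour, cities):
    n = len(tour)
    return sum(math.dist(cities[tour[i]], cities[tour[(i + 1) % n]])
               for i in range(n))


def hill_climb_tsp(cities, n_swaps=2000, seed=0):
    rng = random.Random(seed)
    tour = list(range(len(cities)))
    rng.shuffle(tour)                            # I. random initial tour
    current = tour_length(tour, cities)
    for _ in range(n_swaps):                     # III. fixed number of swaps
        i, j = rng.sample(range(len(tour)), 2)
        tour[i], tour[j] = tour[j], tour[i]      # II. swap a pair of cities
        new = tour_length(tour, cities)
        if new < current:
            current = new                        # keep the improving swap
        else:
            tour[i], tour[j] = tour[j], tour[i]  # undo the swap
    return tour, current


cities = [(0, 0), (0, 1), (1, 1), (1, 0)]  # hypothetical unit square
tour, length = hill_climb_tsp(cities)
```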
Exploitation and Exploration
Exploration of the search space is like exhaustive
search (always trying out new solutions)
Exploitation of the current best solution is like hill
climbing (trying local variants of the current best
solution)
Ideally we would like to have a combination of
those two.

Exploitation and Exploration
Example
Suppose you are playing a row of one-armed bandit machines.
At first, you have no information at all, so you choose
randomly.
However, as you explore, you pick up information about
which machines are good.
You could carry on using them (exploiting your knowledge)
or you could try out other machines in the hope of finding
one that pays out even more (exploring further).
The optimal approach is to trade off the two, always
making sure that you have enough money to explore
further by exploiting the best machines you know of, but
exploring when you can.
Simulated Annealing
This method can deal with huge amounts of data.
It is a stochastic method for finding approximate
solutions to problems that, while still expensive, does
not require the massive computational time that the
full solution would.
It works by analogy with a real physical system: the system is
heated, so that there is plenty of energy around and
each part of the system is effectively random. Then an
annealing schedule is applied that cools the material
down, allowing it to relax into a low-energy
configuration.

Simulated Annealing For TSP
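As in hill climbing, keep swapping pairs of cities. Accept the swap
if new dist. traveled < before dist. traveled,
or
if (before dist. traveled - new dist. traveled) >
T*log(rand), where rand is a uniform random number in (0, 1).
After each swap, set T = c*T, where 0 < c < 1 (usually 0.8 < c < 1).
Thus, we accept a worse solution if, for some random
number $p$,
$$p < \exp\left(\frac{E_{before} - E_{after}}{T}\right)$$
$$\log(p) < \frac{E_{before} - E_{after}}{T}$$
$$T \log(p) < E_{before} - E_{after}$$
where $E_{before}$ and $E_{after}$ are the tour lengths before and after the swap.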
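A sketch of simulated annealing for the TSP, using the standard Metropolis acceptance rule $p < \exp((E_{before} - E_{after})/T)$, written as `(current - new) > T * log(p)`. The starting temperature, the cooling rate `c`, the step budget, the seed, and the bookkeeping of the best tour seen so far are assumptions added for illustration.

```python
import math
import random


def tour_length(tour, cities):
    n = len(tour)
    return sum(math.dist(cities[tour[i]], cities[tour[(i + 1) % n]])
               for i in range(n))


def simulated_annealing_tsp(cities, T=1.0, c=0.999, n_steps=5000, seed=0):
    rng = random.Random(seed)
    tour = list(range(len(cities)))
    rng.shuffle(tour)
    current = tour_length(tour, cities)
    best_tour, best = list(tour), current
    for _ in range(n_steps):
        i, j = rng.sample(range(len(tour)), 2)
        tour[i], tour[j] = tour[j], tour[i]      # propose a swap
        new = tour_length(tour, cities)
        p = 1.0 - rng.random()                   # uniform in (0, 1], avoids log(0)
        if new < current or (current - new) > T * math.log(p):
            current = new                        # accept (possibly worse) tour
            if current < best:
                best_tour, best = list(tour), current
        else:
            tour[i], tour[j] = tour[j], tour[i]  # reject: undo the swap
        T *= c                                   # annealing schedule T = c*T
    return best_tour, best


cities = [(0, 0), (0, 1), (1, 1), (1, 0)]  # hypothetical unit square
best_tour, best = simulated_annealing_tsp(cities)
```

At high temperature almost any swap is accepted (exploration); as T shrinks the rule approaches plain hill climbing (exploitation).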