
IEOR 160: Optimization Problem Modeling

Nate Armstrong

Lecture 8/28/17

1 Examples of Modeling Optimization Problems


1.1 Generic Format for an Optimization Problem
Formulation: The formulation will contain the story explaining the problem,
and a brief description of the variables involved. It will also contain the quantity
or quantities to be maximized.
Variables:
• The variables section describes the quantities used to model the problem. These are
specific to each technique used to model the problem.

Decisions: The decisions are the quantities which we control, and optimize
over. These usually take the form of x’s.
Constraints:
• The constraints determine what is and is not possible in this problem.
They usually take the form of multiple equalities or inequalities which
must be true for any solution to the problem.
Objective: The objective function is what we try to minimize or maximize
over the decision variables.

1.2 Knapsack Problem


Formulation: Given a set of items, each with a weight and a value, determine
a number (zero or one) of each item to include such that the total weight is at most a
limit and the total value is maximized.
Variables:

• x_i : The decision variable for taking item i, 1 if included, 0 otherwise.

• v_i : The value of item i.
• w_i : The weight of item i.
• W : The weight limit of the bag.

Decisions: X = [x1 , x2 , . . . , xn ]
Constraints:

• Σ_{i=1}^n x_i w_i ≤ W

Objective: min_X − Σ_{i=1}^n x_i v_i
Remark. There are often infinitely many ways to model a problem; the question
is how to model it in a fashion that is easy to solve.
Remark. We will often use a vector, X = [x_1 , . . . , x_n ], to represent variables
more compactly. A single capital letter therefore denotes a collection of variables, not just one.
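To make the model concrete, here is a minimal brute-force sketch in Python (not part of the original notes; the item values, weights, and limit in the example are made up for illustration):

```python
from itertools import product

def knapsack_brute_force(values, weights, limit):
    """Try every 0/1 choice vector and keep the best feasible one.

    This directly mirrors the model: maximize sum(x_i * v_i) subject to
    sum(x_i * w_i) <= W with x_i in {0, 1}. It is exponential in n, so it
    is only a sanity check for tiny instances.
    """
    best_value, best_x = 0, None
    for x in product([0, 1], repeat=len(values)):
        weight = sum(xi * wi for xi, wi in zip(x, weights))
        value = sum(xi * vi for xi, vi in zip(x, values))
        if weight <= limit and value > best_value:
            best_value, best_x = value, x
    return best_value, best_x

print(knapsack_brute_force([60, 100, 120], [10, 20, 30], 50))  # (220, (0, 1, 1))
```

For real instances one would use dynamic programming or an integer-programming solver instead, since the search space grows as 2^n.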

1.3 Assignment Problem


Formulation: A taxi firm has n taxis available and n customers waiting to be
picked up as soon as possible. Each taxi has a distance to each customer, and
may only pick up one customer. We want to pick up all customers as quickly
as possible.
Variables:

• x_ij : Decision variable for taxi i to pick up customer j (1 if assigned, 0 otherwise)

• d_ij : Distance (cost) for taxi i to pick up customer j
Decisions:
X = [ x_11 x_12 ... x_1n
      x_21 x_22 ... x_2n
      ...
      x_n1 x_n2 ... x_nn ]
Constraints:
• Taxi Constraints (each taxi picks up exactly one customer):
Σ_{j=1}^n x_ij = 1 ∀i

• Customer Constraints (each customer is picked up by exactly one taxi):
Σ_{i=1}^n x_ij = 1 ∀j

Objective: min_X Σ_{i=1}^n Σ_{j=1}^n x_ij d_ij

Remark. Notice that this problem requires optimizing over n² variables to serve
n customers. How difficult it is can depend on the formulation, but it is a hard problem in general.
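Although the model above has n² binary variables, this particular structure can be solved efficiently in practice. As an illustrative sketch (not from the lecture; the distance matrix is made up), SciPy's linear_sum_assignment implements the Hungarian algorithm for exactly this formulation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical 3x3 distance matrix d[i][j]: taxi i to customer j.
d = np.array([[4.0, 1.0, 3.0],
              [2.0, 0.0, 5.0],
              [3.0, 2.0, 2.0]])

# Solves min sum x_ij d_ij with one taxi per customer, in O(n^3) time,
# rather than enumerating all n! possible assignments.
rows, cols = linear_sum_assignment(d)
print(list(zip(rows, cols)), d[rows, cols].sum())
```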

1.4 Transportation
Formulation: A company has m warehouses and n retail outlets. A single
product is to be shipped from the warehouses to the outlets. Each warehouse
has a given level of supply and each outlet has a given level of demand. There
are transportation costs. How can I match supply to demand most efficiently?
Variables:
• a_i : supply at warehouse i
• b_j : demand at outlet j

• c_ij : transportation cost on route (i, j)

• x_ij : number of units shipped from warehouse i to outlet j
Decisions:
X = [ x_11 x_12 ... x_1n
      x_21 x_22 ... x_2n
      ...
      x_m1 x_m2 ... x_mn ]
Constraints:

• Supply constraints:
Σ_{j=1}^n x_ij ≤ a_i ∀i

• Demand Constraints:
Σ_{i=1}^m x_ij = b_j ∀j

• Variable Constraints:
x_ij ∈ {0, 1, 2, . . . }

Objective: min_X Σ_{i=1}^m Σ_{j=1}^n x_ij c_ij

Remark. Most of these problems are very difficult. They can be modeled fairly
easily, but the choice of algorithm is very important.
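As a hedged sketch of how this model is solved (not from the notes; the supply, demand, and cost numbers are made up), we can relax x_ij to non-negative reals and hand the resulting linear program to SciPy. For transportation problems with integer supplies and demands an optimal vertex of the LP is integral anyway, so the relaxation is harmless:

```python
import numpy as np
from scipy.optimize import linprog

supply = [30, 40]            # a_i, one per warehouse
demand = [20, 25, 25]        # b_j, one per outlet
cost = np.array([[8.0, 6.0, 10.0],
                 [9.0, 12.0, 13.0]])  # c_ij
m, n = cost.shape

# Supply rows: sum_j x_ij <= a_i. Decision vector is x_ij flattened row-major.
A_ub = np.zeros((m, m * n))
for i in range(m):
    A_ub[i, i * n:(i + 1) * n] = 1

# Demand rows: sum_i x_ij = b_j.
A_eq = np.zeros((n, m * n))
for j in range(n):
    A_eq[j, j::n] = 1

res = linprog(cost.ravel(), A_ub=A_ub, b_ub=supply,
              A_eq=A_eq, b_eq=demand, bounds=(0, None))
print(res.x.reshape(m, n), res.fun)
```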

1.5 Network Flow


Formulation: Given a directed graph, where each edge (i, j) has a capacity
c_ij , and a start node s and end node t, we want to maximize the amount of
flow from s to t such that the flow over an edge is at most its capacity, and flow
is conserved at intermediate nodes.
Variables:
• c_ij : The capacity for flow on the edge from vertex i to vertex j.
• x_ij : The flow over the edge from vertex i to vertex j.

Decisions: X = the matrix of flow amounts over all edges.
Constraints:
• Variable Constraints:
0 ≤ x_ij ≤ c_ij

• Conservation of Flow (for every node i other than s and t):
Σ_{(j,i): edge} x_ji = Σ_{(i,k): edge} x_ik

Objective: max_X Σ_{(s,i): edge} x_si − Σ_{(j,s): edge} x_js

Remark. This problem is actually really easy, surprisingly enough. It forms the
basis for a variety of interesting problems. We will learn how to distinguish
between easy and hard problems.
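As an illustrative sketch (assuming the networkx library is available; the graph below is made up), the whole model maps directly onto a standard max-flow call:

```python
import networkx as nx

# Directed graph with capacities c_ij on the edges.
G = nx.DiGraph()
G.add_edge("s", "a", capacity=3)
G.add_edge("s", "b", capacity=2)
G.add_edge("a", "b", capacity=1)
G.add_edge("a", "t", capacity=2)
G.add_edge("b", "t", capacity=3)

# maximum_flow respects 0 <= x_ij <= c_ij and conservation at interior nodes.
flow_value, flow_dict = nx.maximum_flow(G, "s", "t")
print(flow_value)  # 5
print(flow_dict)   # the per-edge flows x_ij
```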

IEOR 160: Optimizations
Nate Armstrong

Lecture 8/30/17

1 Different Types of Optimization


1.1 Unconstrained Univariate Optimization
Definition. Unconstrained: There are no constraints.

min_{x∈R} x² − cos(x) + log(x)

This can include problems like the generic min f (x).


Remark. We need a well-behaved objective function.

Definition. Univariate: There is only one decision variable.

1.2 Constrained Univariate Optimization


max x³ + 5x² s.t. −2 ≤ x ≤ 5

1.3 Unconstrained Multivariate Optimization


Now, we may have multiple decision variables

min x₁² + log x₂ − x₃⁴

1.4 Constrained Multivariate Optimization


max x₁ + 5x₂ s.t. x₁² + x₂² = 5

1.5 Types of Functions


We want to have well-behaved functions for our objective function.

• Discontinuous
Discontinuous functions are bad for optimization, so we will not see them
in this course.

• Continuous but not differentiable
This could be a function that has a ‘cusp’ or sharp edge. They are easier
than discontinuous functions, but not easy. In most of this course, we will
not deal with these functions.
As an example of one of those functions, consider |x|. It is not differentiable
at x = 0.
• Continuous and differentiable
These are well-behaved functions.

1.6 Types of Minima and Maxima of Functions


• Local:
Say x∗ is a local minimum if ∃ ε > 0 such that for every x that satisfies
|x − x∗| < ε we have f (x) ≥ f (x∗ ). It is a local maximum if we have,
instead, that f (x) ≤ f (x∗ ).

• Global:
x∗ is a global minimum of f (x) if for any x we have f (x∗ ) ≤ f (x).
x∗ is a global maximum of f (x) if for any x we have f (x∗ ) ≥ f (x).

2 Methods for Solving Optimization Problems


2.1 Continuous and Differentiable Functions
If x is a local min or local max, then f′(x) = 0. However, the reverse is not true:
f′(x) = 0 only means that a point is interesting; it is possible that it is not
a local min or max.
Consider x³ at x = 0. The derivative is 0, yet the function increases in the
positive direction and decreases in the negative.

Definition. Stationary Point: If x∗ satisfies f′(x∗ ) = 0, then x∗ is a stationary
point.
A stationary point is either a local min, a local max, or a saddle point.

2.1.1 Steps for Solving Unconstrained Univariate


1. Take the derivative of f (x) and set it equal to zero.

2. Find the roots of f′(x) = 0.


3. Evaluate f (x) at each of those points. Find the best of those, and that is
the global solution.

min f (x) = e^x + (x − 1)²

First, find f′(x).
f′(x) = e^x + 2(x − 1)
Then we set this equal to zero. (There is no closed-form root here; a numerical
method, like those discussed later, would be needed to locate it.)

min f (x) = cos(x)

First, find f′(x).
f′(x) = − sin(x)
Then we set this equal to zero. We find that the roots are x = kπ.
Now, we need to evaluate the points.
By graphing, we find that points of the form x = (2k + 1)π are local minima.
Those are all equal to −1, and are global minima.

2.2 Steps for Solving Constrained Univariate


Now we are trying to solve general problems of a slightly different form.

min f (x) s.t. a ≤ x ≤ b

1. Take the derivative of f (x) and set it equal to 0.


2. Find the roots of f′(x) = 0.
3. Consider the roots that are between a and b, together with a and b.
4. Inspect these points and pick the best one.

Remark. This problem takes a similar form to the unconstrained univariate case.
The reason that we need to inspect the end points is because they can also be
the global solution. For example, consider minimizing f (x) = −x³ between −2
and 2. f′(x) = 0 at x = 0, which evaluates to f (0) = 0. But at x = 2, f (x) = −8,
which is smaller than 0.

IEOR 160: Optimizations
Nate Armstrong

Lecture 9/6/17

1 Types of Optimizations
The goal is to
min / max f (X)
such that X satisfies some arbitrary number of constraints, where f (X) is the
objective function.
• Constrained
– Univariate
– Multivariate

• Unconstrained
– Univariate
– Multivariate

1.1 Stationary Points


A stationary point is a point that is locally flat. This is equivalent to saying
that the first derivative is 0 at that point. There are three different types of
stationary points.

• Local Min: If the point is perturbed slightly in either direction, the value
of f (x) increases.

• Local Max: If the point is perturbed slightly in either direction, the


value of f (x) decreases.
• Saddle: If the point is perturbed slightly in either direction, the value of
f (x) decreases in one direction, and increases in the other.

1
1.2 Looking for Local Minima and Maxima
Finding local minima and maxima is important for optimization. The univariate
case is considered here, but the insights can be generalized to the multivariate
case. Assuming we have a function f (x) such that a ≤ x ≤ b, there are a few
cases.

x ∈ (a, b) → f′(x) = 0 → stationary
x = a → f′(x) > 0 → local min
x = a → f′(x) < 0 → local max
x = b → f′(x) > 0 → local max
x = b → f′(x) < 0 → local min
x = a, b → f′(x) = 0 → check second derivative

We need to classify the different stationary points by their second derivative
if the first derivative is 0 and the point is not an endpoint. This is because just
knowing that the first derivative is 0 does not tell us which way the function is
curved at that point. Given that f′(x) = 0:

f″(x) > 0 → local min
f″(x) < 0 → local max
f″(x) = 0 → need to check the third derivative

Notice that this pattern of needing to check the next derivative if the current
one is 0 continues. If the first derivative is 0, we need to look to the second for
more information. If the second derivative is 0, we need to look to the third for
more information, and so on. It turns out that this pattern is generalizable.

1.2.1 When the first k derivatives are zero


If we find that f′(x) = 0 and f″(x) = 0, we need to continue taking derivatives. We
take the derivative repeatedly at point x until we find a derivative which is not
zero. Say this is the kth derivative.
• k is odd: Then it is a saddle point.
• k is even:
– If f^(k)(x) > 0, then the point is a local min
– If f^(k)(x) < 0, then the point is a local max
Remark. An analytic function (one that equals its Taylor series) can have every
derivative equal to zero at a point only if the function is identically zero. This is
because such a function can be written, via its Taylor series expansion, as a sum
(with coefficients) of its derivatives.

It costs a monopolist $5/unit to produce a product. If he produces x units
of the product, then each can be sold for 10 − x (0 ≤ x ≤ 10). To maximize profit,
how much should the monopolist produce?

max x(10 − x) − 5x s.t. 0 ≤ x ≤ 10

We use the techniques discussed before to find the local minima and maxima.
We find that

x∗ = 5/2 : local max (f′(x) = 0, f″(x) < 0)
x = 0 : local min (endpoint, f′(x) > 0)
x = 10 : local min (endpoint, f′(x) < 0)
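The same recipe (differentiate, find the roots, then compare the interior roots with the endpoints) can be checked with a short sympy script; this is an illustrative sketch, not part of the original notes:

```python
import sympy as sp

x = sp.symbols("x", real=True)
f = x * (10 - x) - 5 * x          # profit: revenue minus cost

roots = sp.solve(sp.diff(f, x), x)              # stationary points: [5/2]
candidates = [r for r in roots if 0 <= r <= 10] + [0, 10]
best = max(candidates, key=lambda p: f.subs(x, p))
print(best, f.subs(x, best))                    # 5/2, profit 25/4
```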

2 Numerical Algorithms (Iterative Algorithms)


The underlying idea of numerical algorithms is to guess over and over, getting
closer to the right answer each time.

2.1 Golden Section Algorithm


Definition. Unimodal: f (x) is unimodal on [a, b] if for some point x∗, f (x)
is strictly increasing on [a, x∗] and strictly decreasing on [x∗, b].
The structure of this algorithm is an optimization algorithm communicating
with a black box solver of the function. This works for unimodal functions. The
algorithm samples two points x1 , x2 and compares their values in the function.

• f (x₁) > f (x₂): Then, we know x∗ < x₂ (assuming x₁ < x₂). We have reduced
the interval in which x∗ may lie to [a, x₂].
• f (x₁) < f (x₂): Then, we know x∗ > x₁. We have reduced the interval in
which x∗ may lie to [x₁, b].
• f (x₁) = f (x₂): Then, we know x₁ < x∗ < x₂. We have reduced the interval
in which x∗ may lie.

Because we are continually reducing the interval in which x∗ may lie, our
'guesses' will get more and more accurate. If we execute this algorithm an
arbitrary number of times, we will get arbitrarily close to x∗. With better
guessing patterns, we can solve this faster.

IEOR 160: Optimizations
Nate Armstrong

Lecture 9/6/17

1 Optimization Algorithms
1.1 Golden Section Algorithm
Definition. Unimodal: f (x) is unimodal on [a, b] if for some point x∗, f (x)
is strictly increasing on [a, x∗] and strictly decreasing on [x∗, b].
The structure of this algorithm is an optimization algorithm communicating
with a black box solver of the function. This works for unimodal functions. The
algorithm samples two points x1 , x2 and compares their values in the function.

• f (x₁) > f (x₂): Then, we know x∗ < x₂ (assuming x₁ < x₂). We have reduced
the interval in which x∗ may lie to [a, x₂].
• f (x₁) < f (x₂): Then, we know x∗ > x₁. We have reduced the interval in
which x∗ may lie to [x₁, b].
• f (x₁) = f (x₂): Then, we know x₁ < x∗ < x₂. We have reduced the interval
in which x∗ may lie.

Because we are continually reducing the interval in which x∗ may lie, our
'guesses' will get more and more accurate. If we execute this algorithm an
arbitrary number of times, we will get arbitrarily close to x∗. With better
guessing patterns, we can solve this faster.

1.1.1 Improving the Golden Section Algorithm


1. Choose points in a symmetric way.

2. When you slice the interval based on the ratio γ, one new point
should overlap one previous point.
This allows us to reuse calculations of f (x).
3. The optimal value for γ actually turns out to be the golden ratio: from
γ² + γ − 1 = 0 we get γ = (√5 − 1)/2 ≈ 0.618.

Suppose we want to know our optimal value to within a certain value ε. How
many iterations k do we need?

We end up finding that the size of our range after k iterations is γ^k (b − a).
We set this to be ≤ ε. We then find k = log(ε/(b − a)) / log γ. This is an
algorithm with complexity log(1/ε).
Remark. It doesn't actually matter what log base we use here, because we are
only looking at the order of the function. That means that there is a coefficient
in front of log(1/ε), which depends on the values of the parameters a, b.
Remark. Suppose that I tell you to minimize e^x − x over 1 ≤ x ≤ 5. Most likely,
the solution to this will be irrational. Then, that means that calculating all the
digits of the answer will take infinite time. Therefore, suppose I want you to
find a solution that is accurate within 10^-4. This is an example of when you
would use this algorithm, with ε = 10^-4.

1.2 Using the Golden Section Algorithm


Suppose our problem statement is

max_{x∈R} −x² − 1 s.t. −1 ≤ x ≤ 0.75

Find an interval with uncertainty less than 1/4.
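A sketch of the algorithm in Python (written for minimization, so the exercise's maximization is handled by negating the objective; illustrative code, not from the lecture):

```python
import math

def golden_section_min(f, a, b, eps):
    """Shrink [a, b] by a factor of gamma per iteration until b - a <= eps.

    Assumes f is unimodal on [a, b]. One of the two interior points is
    reused each iteration, so only one new evaluation of f is needed.
    """
    gamma = (math.sqrt(5) - 1) / 2        # ~0.618, root of g^2 + g - 1 = 0
    x1, x2 = b - gamma * (b - a), a + gamma * (b - a)
    f1, f2 = f(x1), f(x2)
    while b - a > eps:
        if f1 > f2:                       # minimum must lie in [x1, b]
            a, x1, f1 = x1, x2, f2
            x2 = a + gamma * (b - a)
            f2 = f(x2)
        else:                             # minimum must lie in [a, x2]
            b, x2, f2 = x2, x1, f1
            x1 = b - gamma * (b - a)
            f1 = f(x1)
    return a, b

# The exercise: maximize -x^2 - 1 on [-1, 0.75], i.e. minimize x^2 + 1.
print(golden_section_min(lambda x: x**2 + 1, -1.0, 0.75, 0.25))
```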

1.3 The Multivariate Case of the Golden Section Algorithm

min/max_{x∈R^n} f (x₁, x₂, . . . , xₙ)

We set x = [x₁ x₂ . . . xₙ]. How can we translate the same logic we used in the
univariate case to the multivariate case?

1.3.1 Overview of Multivariate Calculus

In R, the first derivative is f′(x). In R^n, its analogue is the gradient
∇f (x) = [∂f/∂x₁, ∂f/∂x₂, . . . , ∂f/∂xₙ].
In R, the second derivative is f″(x). In R^n, its analogue is the Hessian
∇²f (x), the n × n matrix whose (i, j) entry is ∂²f/∂x_i∂x_j.

1.3.2 Overview of Linear Algebra


Suppose we have an n × n matrix A. It has n eigenvalues, denoted λ. A value
λ is an eigenvalue iff det(A − λI) = 0.

The eigenvalues can be real or complex scalars. If a matrix is symmetric, all
of its eigenvalues will be real. A matrix is symmetric if A[i][j] = A[j][i]. There
are four important definitions for optimization:

A ≻ 0 : A is positive definite iff its eigenvalues are all positive.

A ≺ 0 : A is negative definite iff its eigenvalues are all negative.

A ⪰ 0 : A is positive semidefinite iff its eigenvalues are all non-negative.

A ⪯ 0 : A is negative semidefinite iff its eigenvalues are all non-positive.
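These four definitions can be checked numerically. A small illustrative sketch (not from the notes) using NumPy, with a tolerance to absorb floating-point error:

```python
import numpy as np

def classify(A, tol=1e-9):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    eigs = np.linalg.eigvalsh(A)  # real eigenvalues of a symmetric matrix
    if np.all(eigs > tol):
        return "positive definite"
    if np.all(eigs < -tol):
        return "negative definite"
    if np.all(eigs >= -tol):
        return "positive semidefinite"
    if np.all(eigs <= tol):
        return "negative semidefinite"
    return "indefinite"

print(classify(np.array([[1.0, 2.0], [2.0, 5.0]])))  # positive definite
print(classify(np.array([[0.0, 1.0], [1.0, 0.0]])))  # indefinite
```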

IEOR 160: Multidimensional Case
Nate Armstrong

Lecture 9/13/17

1 Multivariate Case
Suppose we have x ∈ R², i.e. x = [x₁; x₂], and we want to find

min/max_x f (x)

We want to find x₁∗ and x₂∗ such that the coordinates are local minima.
If x∗ is a min for f (x), then fixing X = [x₁∗; x₂] makes f (x₁∗ , x₂) a function
of x₂ alone, and x₂∗ is the minimum of that new function.
Definition. Stationary Point: The gradient at that point is zero.
But remember, a stationary point can be a min, max, or saddle. The 1-
dimensional equivalent of the gradient is just the derivative, so this is consistent.

Example

min_{x∈R²} (x₁ − 1)² + (x₂ − 2)²

We can then find the gradient:

∇f (x) = [∂f/∂x₁ ∂f/∂x₂]
= [2(x₁ − 1) 2(x₂ − 2)] = 0 → x₁∗ = 1, x₂∗ = 2

The notion of a stationary point generalizes through the gradient: x∗ is stationary if ∇f (x∗) = 0.


The second derivative equivalent in the multidimensional case is the Hessian,
which is a symmetric matrix of second partial derivatives.
Theorem 1.1. If ∇f (x∗ ) = 0, then:
• if ∇²f (x∗ ) ≻ 0, x∗ is a local minimum
• if ∇²f (x∗ ) ≺ 0, x∗ is a local maximum
• if all eigenvalues are nonzero, and one is less than zero and another is
greater than 0, then x∗ is a saddle point.
• otherwise, if any eigenvalues are 0, it cannot be said.
 
Suppose I have a function in R², and I compute ∇²f (x∗) = [a b; c d]. I must
then find the eigenvalues, check their signs, and compare against the possibilities above.
Remark. Note: Calculating the eigenvalues is not easy. But remember, we don't
need the values of the eigenvalues, we just need their signs. That problem is
much easier than computing the eigenvalues themselves.
Suppose I want to find if H ≻ 0, where

H = [ a_11 a_12 a_13 ...
      a_21 a_22 a_23 ...
      a_31 a_32 a_33 ...
      ...              ]

We can use the leading principal minors. The kth leading principal minor of an
n × n matrix is the determinant of the square submatrix consisting of the first k
rows and first k columns. A matrix is positive definite iff all of its leading
principal minors are positive.

Theorem 1.2. Consider the determinants

det(a_11), det([a_11 a_12; a_21 a_22]), det[3 × 3], . . . , det[n × n]

• All of these determinants are positive iff the Hessian is positive definite.

• If all of the determinants are nonzero but their signs match neither the positive
definite nor the negative definite pattern, then the Hessian is indefinite (no
eigenvalues are 0, one is > 0, one is < 0) and the point is a saddle point.

In order to apply the above when looking for a maximum, we look at −H in the
same manner as we would a positive definite candidate (as −H is guaranteed
to have all positive eigenvalues iff H is negative definite).

Example
Suppose we have a matrix

H = [1 2; 2 5] −→ det[1] = 1, det [1 2; 2 5] = 1

Because the determinants are > 0, we know H is positive definite, so the
stationary point is a local min.

Example
Suppose we have a different matrix

H = [−1 2; 2 −5] −→ det[−1] = −1

The first determinant is negative, so we can only possibly have a saddle
point or a local max. Look at −H:

−H = [1 −2; −2 5] −→ det[1] = 1, det [1 −2; −2 5] = 1

Therefore, −H is positive definite, so we know that H is negative definite,
and the point is a local max.

Remark. A positive eigenvalue means that there is a direction in which you can
increase the function, and a negative eigenvalue means that there is a direction
in which you can decrease the function. It is for this reason that having both
makes a point a saddle point.
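A sketch of the determinant test from this section (illustrative code, not from the notes). Note that a zero leading minor leaves the test inconclusive, matching the caveats above:

```python
import numpy as np

def leading_minors_positive(H):
    """Sylvester's test: H is positive definite iff every leading
    principal minor det(H[:k, :k]) is positive."""
    n = H.shape[0]
    return all(np.linalg.det(H[:k, :k]) > 0 for k in range(1, n + 1))

def classify_stationary_point(H):
    if leading_minors_positive(H):
        return "local min"           # H positive definite
    if leading_minors_positive(-H):
        return "local max"           # -H positive definite => H negative definite
    return "saddle or inconclusive"  # zero minors would need the eigenvalue signs

print(classify_stationary_point(np.array([[1.0, 2.0], [2.0, 5.0]])))    # local min
print(classify_stationary_point(np.array([[-1.0, 2.0], [2.0, -5.0]])))  # local max
```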

Example

min_{x∈R²} f (x₁ , x₂ ) = x₁²x₂ + x₂³x₁ − x₁x₂

We then want to find the gradient.

∇f (x) = [∂f/∂x₁ ∂f/∂x₂]

This example will be continued in the next lecture.

1 Approximation and Matrices
1.1 Taylor Series Approximation
The Taylor series approximation estimates a function by a series of monomials
whose coefficients involve the derivatives of f (x). It is arbitrarily accurate, as it
is an infinite series.

f (x∗ + δx) = f (x∗ ) + f′(x∗ )δx + (1/2) f″(x∗ )δx² + . . .

If our function is f (x) = e^x, our approximation is

e^{x∗+δx} = e^{x∗} · e^{δx} = e^{x∗} (1 + δx + (1/2) δx² + . . . )

Here we have factored an e^{x∗} out of every term in the infinite series.

1.1.1 Multidimensional Taylor Series Approximation


 
Suppose f (x) = e^{x₁+x₂}, and we have x∗ = [0; 0]. We want to use the Taylor
series approximation, but how do we do so for this multidimensional case?

We get a vector (the gradient) for our first derivative, and a matrix (the Hessian)
for the second derivative:

f (x∗ + δx) = f (x∗ ) + ∇f (x∗ )δx + (1/2) δxᵀ∇²f (x∗ )δx + . . .

The form xᵀAx turns a matrix into a scalar by multiplication with x.

1.1.2 Matrix Definitions


We have new definitions for definite and semidefinite matrices.
We have A ≻ 0 iff A has all positive eigenvalues iff xᵀAx > 0 ∀x ≠ 0.
Similarly, A ⪰ 0 iff A has all non-negative eigenvalues iff xᵀAx ≥ 0 ∀x.

Example
Consider the matrix [1 2; 2 5]. This matrix is positive definite.

Example
We will prove that A cannot be positive definite if an eigenvalue is non-
positive. Suppose there is a non-positive eigenvalue λ. This eigenvalue
has a corresponding eigenvector x, for which xᵀAx = λxᵀx = λ‖x‖² ≤ 0.
This contradicts our other definition of positive definite matrices.

2 Minimization
R: x∗ a local min ⟹ f′(x∗ ) = 0, f″(x∗ ) ≥ 0.
R^n: x∗ a local min ⟹ ∇f (x∗ ) = 0, ∇²f (x∗ ) ⪰ 0.
x∗ is a local min ⟺ there exists some region around x∗ such that there is no
better point in said region.

2.1 Using Taylor Series

Local min → ∇f (x∗ ) = 0, which is called the first-order optimality condition.

f (x∗ + δx) = f (x∗ ) + 0 + (1/2) δxᵀ∇²f (x∗ )δx + . . .

We can ignore the gradient term, because we have already said that it must be 0.
We also can choose arbitrarily small δx so that all higher order terms in the
expansion are much, much smaller than the Hessian term. Therefore, if
∇²f (x∗ ) ≻ 0 we can say that x∗ is a local min; this is a sufficient condition.
However, if ∇²f (x∗ ) ⪰ 0, x∗ may or may not be a local minimum; this is a
necessary condition.
Remark. We do not deal with the 3rd+ derivatives of the Taylor series in this
class. They are too hard to deal with, so they will not be covered. However,
a gap remains when ∇²f (x∗ ) ⪰ 0 but ∇²f (x∗ ) is not ≻ 0.

Definition. Second-Order Optimality Condition: ∇²f (x∗ ) ⪰ 0

3 Numerical Algorithms
Suppose we have some arbitrarily complicated function. This could be

min e^{x₁+x₂} + sin(x₁) + cos(x₂)

Numerical algorithms can solve extremely complicated problems in a very short
amount of time. You cannot find the solution with infinite precision, but you
can get very close to the optimal solution. Let's see how we can use numerical
algorithms.

min_{x∈R^n} f (x)

Start with a guess x^(0). Then, I do some checks, and revise my solution.

x^(0) → x^(1) → x^(2) → · · · → x∗ ± ε

We hope that the ideal case, described above, will happen. We will eventually
get to the best solution.

Suppose I want to know if x₁ is a better solution than x₂. I can determine
this by evaluating f (x₁) ? f (x₂), where the question mark indicates the relation
between f (x₁) and f (x₂). The basic algorithm to choose x^(k+1) from x^(k) is

x^(k+1) = x^(k) + t^(k) ∆x^(k)

where t is the step size, and ∆x is the direction.

Numerical Algorithms and Constrained
Optimization
Nate Armstrong

Lecture 9/25/17

1 Numerical Algorithms
If we need a numerical answer rather than closed-form optimality conditions, we
need algorithms. The high level idea of most numerical algorithms is to generate
a sequence of points
x^(0) → x^(1) → · · · → x^(k−1) → x^(k) → · · · → x∗
The algorithm determines the rule for going between the points.

Suppose I have some f (x1 , x2 ). This function is optimized by choosing some


arbitrary point x^(0). We follow an update rule which can be formulated as

x^(k) = x^(k−1) + t^(k−1) ∆x^(k−1)

where t is the step size, which is greater than or equal to 0, and ∆x is the
direction in R², or in R^n if that is the space you are optimizing within.
Remark. Your algorithm will need to be clever in its choice of t. You have to
design t such that you do not jump over the solution. A good algorithm will
scale t over time to minimize the number of iterations.
Remark. A good algorithm will make the iterates improve at every iteration.

f (x^(k)) = f (x^(k−1) + t^(k−1) ∆x^(k−1))

We then expand out the new f (x), to arrive at

f (x^(k)) = f (x^(k−1)) + t^(k−1) ∇f (x^(k−1))∆x^(k−1) + higher order terms

Theorem 1.1. If ∇f (x^(k−1))∆x^(k−1) < 0, there exists a small t such that new
f < old f .
Theorem 1.2. If ∇f (x^(k−1))∆x^(k−1) > 0, then for all small t^(k−1), new f >
old f .
Remark. We do not want the second case to be true, unless we are already at
our optimal (local) solution.

Definition: Descent
A direction ∆x is a descent direction at point x if ∇f (x)∆x < 0.

Example
Minimize f (x) = e^{x₁+x₂} − (x₁ − x₂)².

We choose to start at [0; 0].

∇f (x) = [e^{x₁+x₂} − 2(x₁ − x₂), e^{x₁+x₂} + 2(x₁ − x₂)]

1.1 Designing an algorithm


The principal two questions are "How to design the direction?" and "How to
design the step size?".

1. Descent direction:

(a) Gradient: Go along the negative gradient

(b) Newton's Method (next lecture)
(c) Steepest Descent (next lecture)

1.1.1 Gradient
 
Suppose we have some gradient [1 −1 2], and I want to make it negative
when multiplied by ∆x. What can I do? The obvious thing to do is to choose
our ∆x to be the opposite of ∇f (x). We thus get the answer ∆x = −∇f (x)ᵀ.

We get the formal definition for the algorithm as

x^(k) = x^(k−1) − t^(k−1) ∇f (x^(k−1))ᵀ

The hard part of the algorithm is choosing the step size.

Definition: Exact Line Search

Find the best step size t by solving

min_{t≥0} f (x^(k−1) − t ∇f (x^(k−1))ᵀ)

Remark. Nobody actually uses the Exact Line Search, in favor of backtracking.

IEOR 160: Numerical Optimization Algorithms
Nate Armstrong

Lecture 9/27/17

1 Algorithms
The two main challenges in algorithms are deciding the direction, and the step
size. Generally, you choose the direction first, and then the step size. There are
three main algorithms we will look at:

• Gradient: ∆x = −∇f (x)ᵀ
• Newton's Method: Use a Taylor approximation of the function, and solve the
quadratic to update:
∆x = −∇²f (x^(k−1))^{−1} ∇f (x^(k−1))ᵀ.
• Steepest Descent: This is not technically part of the course, but it is
basically using different norms to define the downward direction.

These are only algorithms for finding the direction in which to search. We can
use either exact line search or backtracking to find the best step size after finding
direction.

1.1 Step Size


We already talked about exact line search. Backtracking is the more
commonly used approach.
Remark. The line search will give you the best value possible, but most op-
timization solvers out there do not want the sub-problems to be optimization
problems, even though univariate optimization is easy.

1.1.1 Backtracking
We have a value α > 0, which is large, and a scaling factor 0 < β < 1. We want
t to be large to get to the solution quicker, but a large t risks jumping over the
solution. We want t to be small so that it is safer, but then progress is very slow.

Backtracking first tries a step size of α, and then multiplies it by β repeatedly
until the resulting point is a sufficient improvement.

1
1.2 Halting Conditions
We know that if ∇f (x^(k)) = 0, the point is a stationary point. However, this
has to be very exact. We can instead look to see if ‖∇f (x^(k))‖ ≤ ε. If this
condition is true, we halt our algorithm. The ε is a static parameter.
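Putting the pieces together — the gradient direction, backtracking for the step size, and the ‖∇f(x)‖ ≤ ε halting condition — here is a minimal illustrative sketch (practical implementations usually require an Armijo sufficient-decrease condition rather than any decrease at all):

```python
import numpy as np

def gradient_descent(f, grad, x0, alpha=1.0, beta=0.5, eps=1e-6, max_iter=10000):
    """Gradient method with simple backtracking."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:          # halting condition
            break
        t = alpha
        while f(x - t * g) >= f(x) and t > 1e-16:
            t *= beta                          # backtrack until f improves
        x = x - t * g
    return x

# Sanity check on min x1^2 - x1*x2 + x2^2, whose optimum is the origin.
f = lambda x: x[0]**2 - x[0]*x[1] + x[1]**2
grad = lambda x: np.array([2*x[0] - x[1], -x[0] + 2*x[1]])
print(gradient_descent(f, grad, [1.0, -2.0]))  # close to [0, 0]
```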

1.3 Complexity of Algorithms


Note: This material will not be on exams.

The complexity of an algorithm is the answer to the question(s):

• How long to solve this problem w.r.t. n

• How long to solve given ε.

In the case of this algorithm (backtracking and gradient descent), the complexity
is log(1/ε). This may still have a constant factor α, but in complexity analysis
we only look at the parts of the algorithm that scale with the input/parameters.

The basic form of the complexity proof will look like

‖x∗ − x^(k)‖ ≤ γ ‖x∗ − x^(k−1)‖

We want to bound the difference, and end up showing that there is exponential
improvement. Thus, we take log of both sides, and get the log term.

1.4 Newton’s Method


Example

min x₁² − x₁x₂ + x₂²


This is a surface, and finding the minimum should be pretty trivial. We do not
need to use the gradient method, as that would be slow.

∇f (x) = [2x₁ − x₂, −x₁ + 2x₂]


The above is a linear equation, and is easy to solve. Newton’s Method takes
advantage of this.

If a function is quadratic in n variables, we can rewrite it as

f (x) = xᵀP x + qᵀx + r

where P is an n × n matrix, q ∈ R^n, and r ∈ R. The P matrix generates all
the quadratic terms, the q all the linear terms, and the r is the constant term.
P can always be made symmetric, because you can split every k·x_i x_j into k/2
on both (i, j) and (j, i). We then get that

∇f (x) = 2xᵀP + qᵀ

We can set this gradient equal to 0, and find that

xᵀ = −(1/2) qᵀP^{−1}, i.e. x = −(1/2) P^{−1} q
Newton's method is derived from the Taylor series.

f (x^(k−1) + ∆x) = f (x^(k−1)) + ∇f (x^(k−1))∆x + (1/2) ∆xᵀ∇²f (x^(k−1))∆x + . . .

We consider only the first three terms, so that the above equation is quadratic.
Then, we want to find the best direction:

min_{∆x∈R^n} f (x^(k−1)) + ∇f (x^(k−1))∆x + (1/2) ∆xᵀ∇²f (x^(k−1))∆x

which means that we find that

∆x = −∇²f (x^(k−1))^{−1} ∇f (x^(k−1))ᵀ

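A minimal sketch of Newton's method (illustrative, not from the notes). On the quadratic example above it converges in a single step from any starting point, because the quadratic model is exact:

```python
import numpy as np

def newton(grad, hess, x0, eps=1e-8, max_iter=100):
    """Newton's method: dx = -hess(x)^{-1} grad(x); we solve the linear
    system rather than forming the inverse explicitly."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:
            break
        x = x + np.linalg.solve(hess(x), -g)
    return x

# min x1^2 - x1*x2 + x2^2
grad = lambda x: np.array([2*x[0] - x[1], -x[0] + 2*x[1]])
hess = lambda x: np.array([[2.0, -1.0], [-1.0, 2.0]])
print(newton(grad, hess, [3.0, 7.0]))  # [0, 0] after one Newton step
```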
Numerical Algorithms and Constrained
Optimization
Nate Armstrong

Lecture 10/2/17

1 Numerical Algorithm Review


Numerical algorithms begin with a guess, updating it and making it more accurate
with each iteration. We start with x^(0), then update it by some rule. The
complexity (the number of iterations) is roughly log(1/ε) for the gradient
algorithm, versus log(log(1/ε)) for Newton's Method.

1.1 Initialization
How do we pick an initial point x^(0)? If the function is unimodal, there is no
issue with picking an initial point. If you choose any point on the curve, you
can converge to the optimal solution.

However, consider the case where a saddle point is taken as the initial point.
In this case, the gradient algorithm would never move. Additionally, the algo-
rithm may get "stuck" on a saddle point because the 0 gradient will satisfy the
stopping condition.
Remark. For all these algorithms, the best you can converge to is a local min or
a saddle point.
For almost all problems, finding the global optimum is too difficult to do
efficiently. We need an exponential number of operations. Many problems cannot
be solved due to this exponential time constraint.

1.2 Convex Optimization


There is a class of problems for which there is no issue. In a convex problem,
there are no saddle points, and every local min is a global min. We will
discuss this more in detail later in the class.

2 Constrained Optimization
We started with univariate optimization, which is of the form

min_{x∈R} f (x) → f′(x) = 0, f″(x) > 0

There is also multivariate optimization, which is the same but in R^n:

min_{x∈R^n} f (x) → ∇f (x) = 0, ∇²f (x) ≻ 0

However, constrained optimization takes a slightly different form:

min_{x∈R^n} f (x) s.t. h₁(x) = 0, h₂(x) = 0, . . . , h_m(x) = 0

We rewrite this as

min_{x∈R^n} f (x) s.t. h(x) = 0

Many times, the actual constraints are inequalities (≤ 0), but the boundary of
the region is described by equalities.

Definition: Feasible set


The feasible set is the subset of the domain of f (x) on which all of the
constraints are satisfied.

Example
Suppose we have

h(x) = [x₁² + x₂² − 1; x₃ − 1]

Then, our feasible set is a circle of radius 1 in the x₁x₂ directions, at x₃ = 1.
You may notice that calculation of the feasible set is hard.

2.1 Local Behavior of Constraints


Let’s look at the local behavior of the functions. The tangent plane roughly
approximates the behavior at a point for very small deviations from that point.
Calculating the tangent plane is hard, unless a regularity (well-behavedness)
condition is satisfied.

Definition: Regular Points


A point x∗ is regular if ∇h₁(x∗ ), ∇h₂(x∗ ), . . . , ∇h_m(x∗ ) are linearly independent.

The definition of linear independence is:

Definition: Linear Independence
A set of vectors y₁, y₂, . . . , y_m ∈ R^n is independent iff there is no nonzero
vector (α₁, α₂, . . . , α_m) such that Σ_{i=1}^m α_i y_i = 0.

Example
Suppose we have the vectors below, and want to prove they are independent.

[1; 2], [3; 4]

We assume, for the sake of contradiction, that there is some nonzero pair of
α_i's such that

α₁ [1; 2] + α₂ [3; 4] = 0

We then get a system of linear equations, which we can solve to find that
α₁ = α₂ = 0, a contradiction.

Example
Let's go back to our earlier example. We can check if

Σ_{i=1}^m α_i ∇h_i (x∗ ) = 0

for h(x) = x₁² + x₂². We only have one α, so we just solve

α [2x₁∗ 2x₂∗] = 0

finding that the point is only irregular at x∗ = 0, as that is the only x∗ for
which the above equation holds with a nonzero α.

I will give you an example, but we will solve it next lecture.

Example
The feasible set is only the x₂ axis, where x₁ = 0. We can formulate
h(x) in two different ways. The first is h(x) = x₁. The second is h(x) = x₁².
Which is better?

IEOR 160: Constrained Optimization
Nate Armstrong

Lecture 10/4/17

1 Constrained Optimization
It is not easy to characterize local behavior unless x∗ is a regular point.
Example
There are two definitions of the feasible set for this problem. Either

S = {x₁ | x₁ = 0}

or
S = {x₁ | x₁² = 0}

If we take the gradient of each of these constraint definitions, they are
different. The first is 1 at x₁ = 0, and the second is 0 at x₁ = 0. The first
gradient is linearly independent (so every feasible point is regular), but with
the second formulation there are no regular points in the feasible region.

Remark. The notion of regularity depends on the set S and its algebraic rep-
resentation. Different representations of the same set can cause algorithms to
perform differently on them.

1.1 Tangent Planes


The tangent (hyper)plane resides in the same space as X. If x is a 10-dimensional
vector, the tangent plane also resides in 10D space. What is a tangent plane?
• Stay at point x∗ on S.
• Move slowly so you don’t fall off of S.
• Stretch it into a flat surface.
Remark. The mathematical definition of a tangent plane is much harder than
this, but this is the basic idea. A tangent plane is a set of directions in which
you can move your x.
Theorem 1.1. If x∗ is regular, then the tangent plane T at x∗ is
T = {∆x | ∇h₁(x∗ )∆x = 0, . . . , ∇h_m(x∗ )∆x = 0}

Example
Let's consider the case of a circle. Let's call points

A = [1; 0]
B = [0; 1]
C = [1/√2; 1/√2]

We can define our circle constraint as h(x) = x₁² + x₂² − 1. We want this to
be 0. Looking for the tangent planes at these points, it is what you expect:

T_A = {[∆x₁, ∆x₂] | ∆x₁ = 0}
T_B = {[∆x₁, ∆x₂] | ∆x₂ = 0}
T_C = {[∆x₁, ∆x₂] | ∆x₁ + ∆x₂ = 0}

1.2 Deriving the Tangent Plane


If I have a point such that

h(x∗ ) = 0, h(x∗ + ∆x) ≈ 0

I can use the Taylor series approximation at x∗, which is

h_i (x∗ + ∆x) = h_i (x∗ ) + ∇h_i (x∗ )∆x + higher order terms

If we have sufficiently small ∆x, this will be equal to 0, as ∆x is orthogonal
to all ∇h_i by definition.

1.3 Constrained Optimization


If we have

min_{x∈R^n} f (x) s.t. h(x) = 0,

then at a local minimum there exist λ₁, . . . , λ_m ∈ R such that

∇f (x) + λ₁∇h₁(x) + · · · + λ_m∇h_m(x) = 0

This is called the Lagrangian method.

Example
min x² ⟹ 2x = 0, x = 0.
min x₁² + x₂² ⟹ [2x₁ 2x₂] = 0 ⟹ x₁ = x₂ = 0

But suppose we have some problem where we have

min x₁² + x₂² s.t. x₁ + x₂ = 1

We solve this by using the Lagrange method.

[2x₁ 2x₂] + λ [1 1] = 0 → x₁ = x₂ = −λ/2, together with x₁ + x₂ = 1

There, we have 3 equations and 3 variables. We can solve for λ, and find
that λ = −1, x₁ = x₂ = 1/2.
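The same system of first-order conditions can be solved symbolically. An illustrative sympy sketch (not from the notes) confirming λ = −1 and x₁ = x₂ = 1/2:

```python
import sympy as sp

x1, x2, lam = sp.symbols("x1 x2 lambda", real=True)
f = x1**2 + x2**2
h = x1 + x2 - 1

# Lagrangian system: grad f + lambda * grad h = 0, together with h = 0.
eqs = [sp.diff(f, x1) + lam * sp.diff(h, x1),
       sp.diff(f, x2) + lam * sp.diff(h, x2),
       h]
print(sp.solve(eqs, [x1, x2, lam]))  # {x1: 1/2, x2: 1/2, lambda: -1}
```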

1.4 Deriving the Lagrangian


min_{x∈R^n} f (x) s.t. h(x) = 0

We know that, in general, a point x∗ is a local minimum if a perturbation in


a small neighborhood doesn’t improve the objective value. However, when you
have a constraint, some points of the function are invalid.

Assume x∗ is a local min for the given problem above. That means that
f (x∗ + ∆x) ≥ f (x∗ ) if ∆x is small. However, this ∆x is not arbitrary. The final
point must be in the feasible set. However, this is very hard to characterize, so
instead we say that ∆x must belong to the tangent plane. This works because
the final point will reside in the feasible set as the size of ∆x goes to 0.

1 Minimizing a multidimensional function subject to constraints

We are minimizing a function f (x) subject to constraints h(x) = 0, where h(x)
is a vector of constraints. We have first and second order optimality conditions.

First-Order: ∇f (x∗ ) + Σ_{i=1}^m λ_i ∇h_i (x∗ ) = 0

Second-Order: ∆xᵀ(∇²f (x∗ ) + Σ_{i=1}^m λ_i ∇²h_i (x∗ ))∆x ≥ 0

The sufficient version of the second-order condition has > in place of ≥; seeing
only ≥ does not necessarily imply that the point is a minimum.

We know that ∆x belongs to the tangent plane at the point x∗ .

1.1 Characterizing the Tangent Plane


x ∈ R^n → ∆x ∈ R^n ; if x∗ is regular, T = {∆x | ∇h₁(x∗ )∆x = 0, . . . , ∇h_m(x∗ )∆x = 0}

The tangent plane is of dimension n − m, where n is the dimension of the space
in which x lies, and m is the number of constraints. Thus, there are n − m
linearly independent basis vectors that lie within the tangent plane. Letting
T.P. denote the tangent plane,

∆x ∈ T.P. ⟺ ∆x = Σ_{i=1}^{n−m} α_i y_i , α_i ∈ R

Using this in an example,

Example
Consider a case where we have n = 3, m = 1. We have h(x) = x₁ + x₂ + x₃ − 1.
The gradient of our constraint is ∇h(x∗ ) = [1 1 1] ≠ 0. This implies that all
points are regular.

Next, we characterize the tangent plane. The tangent plane is

T.P. at x∗ = {∆x ∈ R³ | [1 1 1] ∆x = 0}

We can use [1; 0; −1] and [1; −1; 0] as our basis vectors. This implies that we
are looking for

∆x = α₁ [1; 0; −1] + α₂ [1; −1; 0]
We can let M = ∇²f (x∗ ) + Σ_{i=1}^m λ_i ∇²h_i (x∗ ). We want ∆xᵀM∆x ≥ 0 for
every ∆x in the tangent plane. We can write α = [α₁; α₂; . . . ; α_{n−m}], and let
E = [y₁ y₂ . . . y_{n−m}] be the matrix whose columns are the basis vectors. We
can then write the equivalence

∆xᵀM∆x = αᵀ(EᵀM E)α

where the ∆x's on the left side are in the tangent plane, and the α's on the
right are arbitrary.
If EᵀM E ⪰ 0, that is a necessary condition.
If EᵀM E ≻ 0, that is a sufficient condition.

Example
max_{x∈R²} x₁x₂ s.t. x₁ + x₂ = 2 can be rewritten as minimizing −x₁x₂. We get
∇h(x∗ ) = [1 1] ≠ 0. First-order conditions:

∇f (x∗ ) + λ₁∇h₁(x∗ ) = 0

[−x₂∗ −x₁∗] + λ₁ [1 1] = 0

x₁∗ + x₂∗ = 2

This all implies that x₁∗ = x₂∗ = λ₁ = 1.

We get M = ∇²f (x∗ ) + 0 (the constraint is linear, so its Hessian vanishes):

M = [0 −1; −1 0]

We also get E = [1; −1].

EᵀM E = [1 −1] [0 −1; −1 0] [1; −1] = 2

This is greater than 0, so the sufficient condition holds and x∗ is a local
minimum of −x₁x₂, i.e. a local maximum of the original problem.

We will now try a much harder example.

Example

max_{x∈R³} x₁x₂ + x₂x₃ + x₃x₁ s.t. x₁ + x₂ + x₃ = 3

We can instead minimize the negative of the objective. We get ∇h(x∗ ) =
[1 1 1] ≠ 0. This implies that all points are regular. Our first order
conditions lead us to finding

x∗ = [1; 1; 1]

We need to check second order conditions. With the negated objective we get

M = ∇²(−f )(x∗ ) = [0 −1 −1; −1 0 −1; −1 −1 0]

We get

E = [y₁ y₂] = [1 1; 0 −1; −1 0]

(columns [1; 0; −1] and [1; −1; 0], as before). We get that

EᵀM E = [2 1; 1 2]

We do the determinant test on that, find that the eigenvalues are greater
than 0, and conclude that our point is a local minimum of the negated problem,
i.e. a local maximum of the original.

Remark. The problems we just solved were fairly easy: the constraints were
always linear, which makes the second-order terms and the tangent plane easier
to compute.
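The second-order check from the last example can be verified numerically. An illustrative NumPy sketch (not from the notes):

```python
import numpy as np

# Negated objective g(x) = -(x1*x2 + x2*x3 + x3*x1); the constraint is
# linear, so M is just the Hessian of g.
M = -np.array([[0.0, 1.0, 1.0],
               [1.0, 0.0, 1.0],
               [1.0, 1.0, 0.0]])

# Tangent-plane basis vectors [1, 0, -1] and [1, -1, 0] as columns of E.
E = np.array([[1.0, 1.0],
              [0.0, -1.0],
              [-1.0, 0.0]])

reduced = E.T @ M @ E
print(reduced)                      # [[2, 1], [1, 2]]
print(np.linalg.eigvalsh(reduced))  # [1, 3]: positive, so a local min of g
```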

2 Numerical Algorithms with Constraints


Problem: I have a function f (x) that I want to minimize, and I have constraints.
How can I solve this with a numerical algorithm?

Possible Solution: Penalty. Remove the hard constraints and replace them with
soft ones: add a large constant times the sum of squared constraint violations.
We write this as

min f (x) + C(h₁(x)² + · · · + h_m(x)²)

Letting x̂ be the minimizer of this penalized problem, we can prove that

‖x∗ − x̂‖ ∝ 1/C
