IEOR 160: Optimizations
Nate Armstrong
Lecture 8/28/17
Decisions: The decisions are the quantities which we control, and optimize
over. These usually take the form of x’s.
Constraints:
• The constraints determine what is and is not possible in this problem.
They usually take the form of multiple equalities or inequalities which
must be true for any solution to the problem.
Objective: The objective function is what we try to minimize or maximize
over the decision variables.
Decisions: X = [x1 , x2 , . . . , xn ]
Constraints:
• ∑_{i=1}^{n} x_i w_i ≤ W

Objective: min_X −∑_{i=1}^{n} x_i v_i
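This is the knapsack-style model: item i has weight w_i and value v_i, and W is the capacity. As a minimal sketch (assuming the decisions are binary, x_i ∈ {0, 1}, which the notes do not state explicitly), the model can be checked by brute-force enumeration for small n; the weights, values, and capacity below are made up for illustration.

```python
from itertools import product

# Hypothetical data (not from the lecture): weights w, values v, capacity W.
w = [2, 3, 4, 5]
v = [3, 4, 5, 8]
W = 9

best_value, best_x = 0, None
# Enumerate every binary decision vector x in {0,1}^n (only viable for small n).
for x in product([0, 1], repeat=len(w)):
    if sum(xi * wi for xi, wi in zip(x, w)) <= W:       # weight constraint
        value = sum(xi * vi for xi, vi in zip(x, v))     # value; min of -value = max of value
        if value > best_value:
            best_value, best_x = value, x

print(best_x, best_value)
```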
Remark. There are often infinitely many ways to model a problem, the question
is deciding how to model it in a fashion that is easy to solve.
Remark. We will often use a vector, X = [x1 , . . . , xn ], to represent variables
more quickly. If you see a capital, single variable, it is not just one variable.
• Customer Constraints:
∑_{i=1}^{n} x_ij = 1 ∀j

Objective: min_X ∑_{i=1}^{n} ∑_{j=1}^{n} x_ij d_ij
Remark. Notice that this problem requires optimizing over n² variables to solve
for n customers. This can depend on the formulation, but it is a hard problem.
1.4 Transportation
Formulation: A company has m warehouses and n retail outlets. A single
product is to be shipped from the warehouses to the outlets. Each warehouse
has a given level of supply and each outlet has a given level of demand. There
are transportation costs. How can I match supply to demand most efficiently?
Variables:
• a_i: supply at warehouse i
• b_j: demand at outlet j
• x_ij: amount of product shipped from warehouse i to outlet j
• c_ij: cost of shipping one unit from warehouse i to outlet j
• Supply constraints:
∑_{j=1}^{n} x_ij ≤ a_i ∀i
• Demand Constraints:
∑_{i=1}^{m} x_ij = b_j ∀j
• Variable Constraints:
x_ij ∈ {0, 1, 2, . . . }

Objective: min_X ∑_{i=1}^{m} ∑_{j=1}^{n} x_ij c_ij
Remark. Most of these problems are very difficult. They can be modeled fairly
easily, but the choice of algorithm is very important.
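To illustrate how directly this formulation can be handed to a solver, here is a sketch using scipy.optimize.linprog on a small made-up instance (m = 2 warehouses, n = 3 outlets). The supplies, demands, and costs are invented for illustration, and the integrality requirement x_ij ∈ {0, 1, 2, . . .} is relaxed to x_ij ≥ 0.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical data (not from the lecture): m = 2 warehouses, n = 3 outlets.
a = [30, 40]            # supply at each warehouse
b = [20, 25, 25]        # demand at each outlet
c = np.array([[4, 6, 9],
              [5, 3, 2]])   # c[i][j]: cost to ship one unit from warehouse i to outlet j
m, n = c.shape

# Supply constraints: sum_j x_ij <= a_i  (one row per warehouse)
A_ub = np.zeros((m, m * n))
for i in range(m):
    A_ub[i, i * n:(i + 1) * n] = 1

# Demand constraints: sum_i x_ij = b_j  (one row per outlet)
A_eq = np.zeros((n, m * n))
for j in range(n):
    A_eq[j, j::n] = 1

res = linprog(c.flatten(), A_ub=A_ub, b_ub=a, A_eq=A_eq, b_eq=b,
              bounds=[(0, None)] * (m * n))
print(res.x.reshape(m, n))   # optimal shipment plan
print(res.fun)               # minimum total cost
```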
Decisions: X = the matrix of flow amounts over all edges.
Constraints:
• Variable Constraints
0 ≤ xij ≤ cij
• Conservation of Flow:
∑_{(j,i): edge} x_ji = ∑_{(i,k): edge} x_ik for every node i other than the source s and the sink t

Objective: max_X ∑_{(s,i): edge} x_si − ∑_{(j,s): edge} x_js
Remark. This problem is actually really easy, surprisingly enough. It forms the
basis for a variety of interesting problems. We will learn how to distinguish
between easy and hard problems.
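A minimal sketch of this max-flow model, assuming the networkx library is available; the graph, capacities, and the node names "s" and "t" are made up for illustration. The flows returned respect 0 ≤ x_ij ≤ c_ij and conservation of flow at every intermediate node.

```python
import networkx as nx

# Hypothetical capacitated network (not from the lecture).
G = nx.DiGraph()
G.add_edge("s", "a", capacity=4)
G.add_edge("s", "b", capacity=3)
G.add_edge("a", "t", capacity=2)
G.add_edge("a", "b", capacity=3)
G.add_edge("b", "t", capacity=5)

# maximum_flow returns the optimal objective value (flow out of s)
# and the flow on every edge.
flow_value, flow_dict = nx.maximum_flow(G, "s", "t")
print(flow_value, flow_dict)
```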
IEOR 160: Optimizations
Nate Armstrong
Lecture 8/30/17
• Discontinuous
Discontinuous functions are bad for optimization, so we will not see them
in this course.
• Continuous but not differentiable
This could be a function that has a ‘cusp’ or sharp edge. They are easier
than discontinuous functions, but not easy. In most of this course, we will
not deal with these functions.
As an example of one of those functions, consider |x|. It is not differen-
tiable at x = 0.
• Continuous and differentiable
These are well-behaved functions.
• Global:
x∗ is a global minimum of f(x) if for any x we have f(x∗) ≤ f(x).
x∗ is a global maximum of f(x) if for any x we have f(x∗) ≥ f(x).
f′(x) = 0 only means that a point is interesting, but it is possible that it is not
a local min or max.
Consider x3 at x = 0. The derivative is 0, yet the function increases in the
positive direction and decreases in the negative.
First, find f′(x):
f′(x) = e^x + 2(x − 1)
Then we set this equal to zero.
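The equation e^x + 2(x − 1) = 0 has no closed-form solution, so a numerical root finder is a natural next step. Below is a minimal bisection sketch; the bracket [0, 1] works because f′(0) = 1 − 2 < 0 and f′(1) = e > 0. (The derivative is consistent with, e.g., f(x) = e^x + (x − 1)², which is an assumption since the original f is not shown here.)

```python
import math

def fprime(x):
    # derivative given in the notes; consistent with f(x) = e^x + (x - 1)^2
    return math.exp(x) + 2 * (x - 1)

# Bisection: fprime(0) < 0 and fprime(1) > 0, so a root lies in (0, 1).
lo, hi = 0.0, 1.0
while hi - lo > 1e-10:
    mid = 0.5 * (lo + hi)
    if fprime(mid) < 0:
        lo = mid
    else:
        hi = mid

print(0.5 * (lo + hi))   # stationary point, roughly 0.315
```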
IEOR 160: Optimizations
Nate Armstrong
Lecture 9/6/17
1 Types of Optimizations
The goal is to
min / max f (X)
such that X satisfies some arbitrary number of constraints, where f (X) is the
objective function.
• Constrained
– Univariate
– Multivariate
• Unconstrained
– Univariate
– Multivariate
• Local Min: If the point is perturbed slightly in either direction, the value
of f (x) increases.
1.2 Looking for Local Minima and Maxima
Finding local minima and maxima is important for optimization. The univariate
case is considered here, but the insights can be generalized to the multivariate
case. Assuming we have a function f (x) such that a ≤ x ≤ b, there are a few
cases.
x ∈ (a, b) → f′(x) = 0 → stationary
x = a → f′(x) > 0 → local min
x = a → f′(x) < 0 → local max
x = b → f′(x) > 0 → local max
x = b → f′(x) < 0 → local min
x = a, b → f′(x) = 0 → check second derivative
Notice that this pattern of needing to check the next derivative if the current
one is 0 continues. If the first derivative is 0, we need to look to the second for
more information. If the second derivative is 0, we need to look to the third for
more information, and so on. It turns out that this pattern is generalizable.
It costs a monopolist $5/unit to produce a product. If he produces x units
of a product, then each unit can be sold for 10 − x (where 0 ≤ x ≤ 10). To maximize profit,
how much should the monopolist produce?
max_{0 ≤ x ≤ 10} x(10 − x) − 5x
We use the techniques discussed before to find the local minima and maxima.
We find that
x∗ = 5/2: local max, since f′(x∗) = 0 and f″(x∗) < 0
x∗ = 0: local min (boundary), since f′(0) > 0
x∗ = 10: local min (boundary), since f′(10) < 0
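A quick symbolic check of this example, assuming sympy is available: the profit is f(x) = x(10 − x) − 5x = 5x − x², so f′(x) = 5 − 2x vanishes at x = 5/2 and f″ = −2 < 0.

```python
import sympy as sp

x = sp.symbols('x')
profit = x * (10 - x) - 5 * x                 # revenue minus production cost
critical = sp.solve(sp.diff(profit, x), x)    # f'(x) = 5 - 2x = 0
print(critical, sp.diff(profit, x, 2))        # [5/2], second derivative -2 < 0 -> max
print(profit.subs(x, sp.Rational(5, 2)))      # maximum profit 25/4
```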
IEOR 160: Optimizations
Nate Armstrong
Lecture 9/6/17
1 Optimization Algorithms
1.1 Golden Section Algorithm
Definition. Unimodal: f (x) is unimodal on [a, b] if for some point x∗, f (x)
is strictly increasing on [a, x∗] and strictly decreasing on [x∗, b].
The structure of this algorithm is an optimization algorithm communicating
with a black box solver of the function. This works for unimodal functions. The
algorithm samples two points x1 , x2 and compares their values in the function.
• f(x_1) > f(x_2): then we know x∗ < x_2. We have reduced the interval in which x∗ may lie to [a, x_2].
• f(x_1) < f(x_2): then we know x∗ > x_1. We have reduced the interval in which x∗ may lie to [x_1, b].
• f(x_1) = f(x_2): then we know x_1 < x∗ < x_2. We have reduced the interval in which x∗ may lie to [x_1, x_2].
Because we are continually reducing the interval in which x∗ may lie, our
guesses get more and more accurate. If we execute this algorithm an
arbitrary number of times, we will get arbitrarily close to x∗. With better
guessing patterns, we can solve this faster.
2. When you slice the interval based on the ratio γ, then one new point
should overlap one previous point.
This allows us to reuse calculations of f (x).
3. The optimal value for γ actually turns out to be the golden number. We
get γ² + γ − 1 = 0 → γ = 0.618 . . .
Suppose we want to know our optimal value to within a certain value ε. How
many iterations k do we need?
We end up finding that the size of our range after k iterations is γ^k(b − a).
We set this to be ≤ ε, and then find k = log(ε/(b − a)) / log γ. This is an algorithm with
complexity O(log(1/ε)).
Remark. It doesn't actually matter what log base we use here, because we are
only looking at the order of the function. That means that there is a coefficient
in front of log(1/ε), which depends on the values of the parameters a, b.
Remark. Suppose that I tell you to minimize e^x − x over 1 ≤ x ≤ 5. Most likely,
the solution to this will be irrational. That means that calculating all the
digits of the answer would take infinite time. Therefore, suppose I want you to
find a solution that is accurate to within 10⁻⁴. This is an example of when you
would use this algorithm, with ε = 10⁻⁴.
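A minimal sketch of the golden-section idea, written for a unimodal function with an interior maximum (matching the definition given above); the test function and tolerance are illustrative. Each iteration shrinks the interval by the factor γ ≈ 0.618 and reuses one of the two previous function evaluations.

```python
def golden_section_max(f, a, b, eps=1e-6):
    """Locate the maximizer of a unimodal f on [a, b] to within eps."""
    gamma = (5 ** 0.5 - 1) / 2            # 0.618..., the root of g^2 + g - 1 = 0
    x1, x2 = b - gamma * (b - a), a + gamma * (b - a)
    f1, f2 = f(x1), f(x2)
    while b - a > eps:
        if f1 > f2:                        # maximizer lies in [a, x2]
            b, x2, f2 = x2, x1, f1         # old x1 becomes the new x2 (reused evaluation)
            x1 = b - gamma * (b - a)
            f1 = f(x1)
        else:                              # maximizer lies in [x1, b]
            a, x1, f1 = x1, x2, f2         # old x2 becomes the new x1 (reused evaluation)
            x2 = a + gamma * (b - a)
            f2 = f(x2)
    return 0.5 * (a + b)

# Illustrative test: f(x) = -(x - 2)^2 is unimodal on [0, 5] with maximizer x* = 2.
print(golden_section_max(lambda x: -(x - 2) ** 2, 0, 5))
```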
In R the first derivative is f′(x); in R^n it is the gradient ∇f(x) = [∂f/∂x_1, ∂f/∂x_2, . . . , ∂f/∂x_n].
In R the second derivative is f″(x); in R^n it is the Hessian ∇²f(x), the n × n matrix whose (i, j) entry is ∂²f/(∂x_i ∂x_j).
The eigenvalues can be real or complex scalars. If a matrix is symmetric, all
of its eigenvalues will be real. A matrix is symmetric if A[i][j] = A[j][i]. There
are four important definitions for optimization:
A ≻ 0: A is positive definite iff its eigenvalues are all positive.
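A small numpy sketch of this check: for a symmetric matrix the eigenvalues are real, so we can classify the matrix by their signs. The matrices below are illustrative (they reappear as Hessians later in these notes).

```python
import numpy as np

def classify(A, tol=1e-12):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    eig = np.linalg.eigvalsh(A)           # eigvalsh: eigenvalues of a symmetric matrix
    if np.all(eig > tol):
        return "positive definite"
    if np.all(eig < -tol):
        return "negative definite"
    if np.any(eig > tol) and np.any(eig < -tol):
        return "indefinite"
    return "semidefinite (some eigenvalues are zero)"

print(classify(np.array([[1.0, 2.0], [2.0, 5.0]])))    # positive definite
print(classify(np.array([[-1.0, 2.0], [2.0, -5.0]])))  # negative definite
```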
IEOR 160: Multidimensional Case
Nate Armstrong
Lecture 9/13/17
1 Multivariate Case
Suppose we have x ∈ R², x = [x_1, x_2]ᵀ, and we want to find
min/max_x f(x)
We want to find x∗_1 and x∗_2 such that each coordinate is a local minimizer:
if x∗ is a min for f(x), then fixing x_1 = x∗_1 makes f(x∗_1, x_2) a function
of x_2 alone, and x∗_2 is the minimum of that new function.
Definition. Stationary Point: The gradient at that point is zero.
But remember, a stationary point can be a min, max, or saddle. The 1-
dimensional equivalent of the gradient is just the derivative, so this is consistent.
• If all eigenvalues are nonzero, and one is less than zero, and another is
greater than 0, then it is a saddle point.
• Otherwise, if any eigenvalues are 0, it cannot be said.
Suppose I have a function in R², and I compute ∇²f(x∗) = [a b; c d]. I must
then find the eigenvalues and check their signs against the possibilities above.
Remark. Note: calculating the eigenvalues is not easy. Remember, though, that we don't
need the values of the eigenvalues, only their signs, and determining the signs is a
much easier problem.
Suppose I want to find if H ≻ 0, where
H = [ a_11 a_12 a_13 . . .
      a_21 a_22 a_23 . . .
      a_31 a_32 a_33 . . .
        .    .    .   . ]
We can use the leading principal minors. The k-th leading principal minor of an n × n
matrix is the determinant of the square submatrix formed by the first k rows and the
first k columns. A matrix is positive definite iff all of its leading principal minors are positive.
Theorem 1.2. Consider the determinants
det(a_11), det([a_11 a_12; a_21 a_22]), det[3 × 3], . . . , det[n × n]
• All of these determinants are positive iff the Hessian is positive definite.
Example
Suppose we have a matrix
H = [1 2; 2 5] → det[1] = 1, det([1 2; 2 5]) = 5 − 4 = 1
Because both determinants are > 0, the matrix is positive definite, so the stationary point is a local min.
2
Example
Suppose we have a different matrix
H = [−1 2; 2 −5] → det[−1] = −1
The first leading principal minor is negative, so H is not positive definite. In fact, since
det(H) = 5 − 4 = 1 > 0 while the first minor is negative, H is negative definite, and the
stationary point is a local max.
Remark. A positive eigenvalue means that there is a direction in which you can
increase the function, and a negative eigenvalue means that there is a direction
in which you can decrease the function. It is for this reason that having both
makes a point a saddle point.
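A numpy sketch of the determinant test from Theorem 1.2, applied to the two Hessians above: compute the determinants of the leading k × k submatrices for k = 1, . . . , n and check their signs.

```python
import numpy as np

def leading_principal_minors(H):
    """Determinants of the leading k x k submatrices, k = 1..n."""
    return [np.linalg.det(H[:k, :k]) for k in range(1, H.shape[0] + 1)]

H1 = np.array([[1.0, 2.0], [2.0, 5.0]])
H2 = np.array([[-1.0, 2.0], [2.0, -5.0]])
print(leading_principal_minors(H1))   # [1.0, 1.0]   -> all positive: positive definite
print(leading_principal_minors(H2))   # [-1.0, 1.0]  -> first is negative: not positive definite
```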
1 Approximation and Matrices
1.1 Taylor Series Approximation
The Taylor series approximation estimates a function by a series of monomials
involving the derivatives of f(x). It is arbitrarily accurate, as it is an infinite
series.
f(x∗ + δx) = f(x∗) + f′(x∗)δx + (1/2)f″(x∗)δx² + . . .
If our function is f(x) = e^x, our approximation is
e^{δx} = 1 + δx + (1/2)δx² + . . .
where we have factored an e^{x∗} out of every term in the infinite series.
We get a vector for our first derivative, and a matrix (the Hessian) for the
second derivative.
f(x∗ + δx) = f(x∗) + ∇f(x∗)δx + (1/2)δxᵀ∇²f(x∗)δx + . . .
The form xᵀAx turns a matrix into a scalar by multiplication with x.
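A small numeric illustration of the multivariate expansion, using a made-up function f(x) = e^{x_1 + x_2} whose gradient and Hessian are easy to write down; the term (1/2)δxᵀ∇²f(x∗)δx is exactly the quadratic form just described.

```python
import numpy as np

def f(x):
    return np.exp(x[0] + x[1])

x_star = np.array([0.0, 0.0])
dx = np.array([0.05, -0.02])

# For f(x) = exp(x1 + x2): gradient = exp(x1+x2) * [1, 1], Hessian = exp(x1+x2) * [[1,1],[1,1]].
grad = np.exp(x_star.sum()) * np.ones(2)
hess = np.exp(x_star.sum()) * np.ones((2, 2))

taylor2 = f(x_star) + grad @ dx + 0.5 * dx @ hess @ dx
print(f(x_star + dx), taylor2)   # the two values agree up to terms of order ||dx||^3
```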
Example
Consider the matrix [1 2; 2 5]. This matrix is positive definite.
Example
We will prove that A cannot be positive definite if it has a non-positive eigenvalue.
Suppose there is a non-positive eigenvalue λ with corresponding eigenvector x ≠ 0.
Then xᵀAx = λxᵀx = λ‖x‖² ≤ 0, which contradicts the quadratic-form definition of
positive definiteness (xᵀAx > 0 for all x ≠ 0).
2 Minimization
R: at a local min, f′(x∗) = 0 and f″(x∗) ≥ 0.
R^n: at a local min, ∇f(x∗) = 0 and ∇²f(x∗) ⪰ 0.
x∗ : Local min ← there exists some region around x∗ such that there is no better
point in said region.
3 Numerical Algorithms
Suppose we have some arbitrarily complicated function. Start with a guess x^(0).
Then we do some checks and revise the solution. We hope that the ideal case
will happen: we will eventually get to the best solution.
this by evaluating f (x1 ) ? f (x2 ). The question mark indicates the relation be-
tween f (x1 ) and f (x2 ). The basic algorithm to choose x(k+1) from x(k) is
Numerical Algorithms and Constrained Optimization
Nate Armstrong
Lecture 9/25/17
1 Numerical Algorithms
When we cannot get the answer directly from optimality conditions, we need algorithms. The high
level idea of most numerical algorithms is to generate a sequence of points
x^(0) → x^(1) → · · · → x^(k−1) → x^(k) → · · · → x∗
The algorithm determines the rule for going between the points.
Definition: Descent
A direction ∆x is a descent direction at point x if ∇f(x)∆x < 0.
Example
Minimize f(x) = e^{x_1+x_2} − (x_1 − x_2)². We choose to start at x^(0) = [0, 0]ᵀ.
∇f(x) = [e^{x_1+x_2} − 2(x_1 − x_2),  e^{x_1+x_2} + 2(x_1 − x_2)]
1. Descent direction:
1.1.1 Gradient
Suppose we have some gradient [1 −1 2], and I want to make this negative
when multiplied by ∆x. What can I do? The obvious thing to do is to choose
our ∆x to be the opposite of ∇f(x). We thus get ∆x = −∇f(x)ᵀ.
Remark. Nobody actually uses exact line search in practice; backtracking is preferred.
IEOR 160: Numerical Optimization Algorithms
Nate Armstrong
Lecture 9/27/17
1 Algorithms
The two main challenges in algorithms are deciding the direction, and the step
size. Generally, you choose the direction first, and then the step size. There are
three main algorithms we will look at:
• Gradient: ∆x = −∇f(x)ᵀ
• Newton’s Method : Use a Taylor approximation of the function, solve the
quadratic to update.
∆x = −∇²f(x^(k−1))^{−1} ∇f(x^(k−1))ᵀ.
• Steepest Descent: This is not technically part of the course, but it is
basically using different norms to define the downward direction.
These are only algorithms for finding the direction in which to search. We can
use either exact line search or backtracking to find the best step size after finding
direction.
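As a sketch of the first two direction rules, the snippet below computes the gradient direction and the Newton direction at x = (0, 0) for the example f(x) = e^{x_1+x_2} − (x_1 − x_2)² from the 9/25 notes. The Hessian used here is derived from that gradient (it is not stated in the notes), and the Newton direction is obtained by solving ∇²f(x)∆x = −∇f(x) rather than forming the inverse.

```python
import numpy as np

def grad(x):
    # gradient of f(x) = exp(x1 + x2) - (x1 - x2)^2 (from the 9/25 example)
    e = np.exp(x[0] + x[1])
    return np.array([e - 2 * (x[0] - x[1]), e + 2 * (x[0] - x[1])])

def hess(x):
    # Hessian of the same f, obtained by differentiating the gradient above
    e = np.exp(x[0] + x[1])
    return np.array([[e - 2.0, e + 2.0], [e + 2.0, e - 2.0]])

x = np.array([0.0, 0.0])
gradient_dir = -grad(x)                            # gradient direction: dx = -grad f(x)
newton_dir = np.linalg.solve(hess(x), -grad(x))    # Newton direction: dx = -H(x)^{-1} grad f(x)
print(gradient_dir, newton_dir)                    # [-1. -1.] and [-0.5 -0.5]
```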
1.1.1 Backtracking
We have a value α > 0, which is large, and a scaling factor 0 < β < 1. We want the
step size t to be large so that we get to the solution quicker, but then it is possible
to jump over the solution; we want t to be small so that it is safer, but then the
algorithm is very slow. Backtracking first tries the step size α, and then multiplies it
by β repeatedly until the resulting point is sufficiently good.
1.2 Halting Conditions
We know that if ∇f(x^(k)) = 0, the point is a stationary point. However, this
condition has to hold exactly. We can instead check whether ‖∇f(x^(k))‖ ≤ ε; if this
condition is true, we halt our algorithm. The ε is a static parameter.
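Putting the pieces together, here is a minimal sketch of gradient descent with backtracking and the ‖∇f(x^(k))‖ ≤ ε halting test. The test function and the values of α, β, and ε are illustrative, and the backtracking rule simply shrinks the step until the new point improves (many practical variants also require an Armijo-style sufficient decrease).

```python
import numpy as np

def gradient_descent(f, grad, x0, alpha=1.0, beta=0.5, eps=1e-6, max_iter=500):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:               # halting condition ||grad f(x^(k))|| <= eps
            break
        t = alpha                                   # start with a large step...
        while f(x - t * g) >= f(x) and t > 1e-12:   # ...and backtrack by the factor beta
            t *= beta
        x = x - t * g
    return x

# Illustrative test problem: minimum at (1, -3).
f = lambda x: (x[0] - 1) ** 2 + 2 * (x[1] + 3) ** 2
grad = lambda x: np.array([2 * (x[0] - 1), 4 * (x[1] + 3)])
print(gradient_descent(f, grad, [0.0, 0.0]))
```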
In the case of this algorithm (backtracking and gradient descent), the complexity is
O(log(1/ε)). There may still be a constant factor in front, but in complexity analysis
we only look at the parts of the algorithm that scale with the input/parameters.
We want to bound the difference, and end up showing that there is exponential
improvement. Thus, we take log of both sides, and get the log term.
The matrix P in a quadratic function f(x) = xᵀPx + qᵀx + r can always be taken to be
symmetric, because you can split every cross term k·x_i·x_j into k/2 in both the
(i, j) and (j, i) entries. We then get that
∇f(x) = 2xᵀP + qᵀ
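A quick numeric check of this gradient formula, assuming numpy: compare 2Px + q (the column form of 2xᵀP + qᵀ for symmetric P) against a finite-difference approximation. The particular P, q, and r are made up.

```python
import numpy as np

P = np.array([[2.0, 0.5], [0.5, 3.0]])    # symmetric, made-up
q = np.array([1.0, -2.0])
r = 4.0

f = lambda x: x @ P @ x + q @ x + r
x = np.array([0.7, -1.3])

analytic = 2 * P @ x + q                   # gradient of x^T P x + q^T x + r for symmetric P

# central finite differences as an independent check
h = 1e-6
numeric = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(2)])
print(analytic, numeric)                   # the two should agree to roughly 1e-8
```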
Numerical Algorithms and Constrained Optimization
Nate Armstrong
Lecture 10/2/17
1.1 Initialization
How do we pick an initial point x(0) ? If the function is unimodal, there is no
issue with picking an initial point. If you choose any point on the curve, you
can converge to the optimal solution.
However, consider the case where a saddle point is taken as the initial point.
In this case, the gradient algorithm would never move. Additionally, the algo-
rithm may get ”stuck” on a saddle point because the 0 gradient will satisfy the
stopping condition.
Remark. For all of these algorithms, the best you can converge to is a local min or
a saddle point.
For almost all problems, finding the global optimum is too difficult to do efficiently:
we would need an exponential number of operations. Many problems cannot be solved due
to this exponential time constraint.
2 Constrained Optimization
We started with univariate optimization, which is of the form of
Many times, the actual constraints are less than or equal to 0, but you bound
the region by using equalities.
Example
Suppose we have
h(x) = [x_1² + x_2² − 1, x_3 − 1]ᵀ
Then our feasible set is a circle of radius 1 parallel to the x_1 x_2 plane, at x_3 = 1.
You may notice that calculation of the feasible set is hard.
Definition: Linear Independence
A set of vectors y_1, y_2, . . . , y_m ∈ R^n is independent iff there is no nonzero
vector (α_1, α_2, . . . , α_m) such that ∑_{i=1}^{m} α_i y_i = 0.
Example
Suppose we have the vectors below, and want to prove they are independent.
[1, 2]ᵀ, [3, 4]ᵀ
We assume there is some sequence of αi s such that they are dependent, for
the sake of contradiction.
α_1 [1, 2]ᵀ + α_2 [3, 4]ᵀ = 0
We then get a system of linear equations, which we can solve to find that
α1 = α2 = 0.
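A quick numeric version of this check, assuming numpy: the vectors are independent iff the only solution of the homogeneous system is α = 0, which for a square matrix amounts to a nonzero determinant (or, in general, full column rank).

```python
import numpy as np

Y = np.array([[1.0, 3.0],
              [2.0, 4.0]])   # columns are the vectors [1, 2] and [3, 4]

# Full column rank (here, equivalently a nonzero determinant) means that
# alpha1*y1 + alpha2*y2 = 0 forces alpha1 = alpha2 = 0: the vectors are independent.
print(np.linalg.matrix_rank(Y) == Y.shape[1])   # True
print(np.linalg.det(Y))                          # about -2, nonzero
```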
Example
Let’s go back to our earlier example. We can check if
∑_{i=1}^{m} α_i ∇h_i(x∗) = 0
α [2x∗_1  2x∗_2] = 0
finding that the point is only irregular if x∗ = 0, as that is the only x∗ for
which the above equation is true.
Example
The feasible set is only on the x2 axis, where x1 = 0. We can formulate
h(x) in two different ways. The first is h(x) = x1 . The second is h(x) = x21 .
Which is better?
IEOR 160: Constrained Optimization
Nate Armstrong
Lecture 10/4/17
1 Constrained Optimization
It is not easy to characterize local behavior unless X ∗ is a regular point.
Example
There are two definitions of the feasible set for this problem. Either
S = {x1 |x1 = 0}
or
S = {x1 |x21 = 0}
If we take the gradient of each of these constraint definitions, they are
different. The first is 1 at x_1 = 0, and the second is 0 at x_1 = 0. The first
gradient is nonzero (linearly independent), but with the second representation there
are no regular points in the feasible region.
Remark. The notion of regularity depends on the set S and its algebraic rep-
resentation. Different representations of the same set can cause algorithms to
perform differently on them.
Example
Let’s consider the case of a circle. Let’s call points
A = [1, 0]ᵀ
B = [0, 1]ᵀ
C = [1/√2, 1/√2]ᵀ
We can define our circle constraint as h(x) = x_1² + x_2² − 1, and we want this to
be 0. Looking for the tangent planes at these points, they are what you would expect.
Example
min x² ⟹ 2x = 0, x = 0.
min x_1² + x_2² ⟹ [2x_1 2x_2] = 0 ⟹ x_1 = x_2 = 0
But suppose we have some problem where we have
Assume x∗ is a local min for the given problem above. That means that
f (x∗ + ∆x) ≥ f (x∗ ) if ∆x is small. However, this ∆x is not arbitrary. The final
point must be in the feasible set. However, this is very hard to characterize, so
instead we say that ∆x must belong to the tangent plane. This works because
the final point will reside in the feasible set as the size of ∆x goes to 0.
1 Minimizing Multidimensional Functions Subject to Constraints
Minimizing a function f (x) subject to constraints h(x) = 0. h(x) is a vector of
constraints. We have first and second order optimality conditions.
First-Order: ∇f(x∗) + ∑_{i=1}^{m} λ_i ∇h_i(x∗) = 0
Second-Order: ∆xᵀ(∇²f(x∗) + ∑_{i=1}^{m} λ_i ∇²h_i(x∗))∆x ≥ 0 for all ∆x in the tangent plane
The second-order condition may instead hold strictly (> 0 rather than ≥ 0), which is
sufficient. However, seeing only ≥ 0 does not necessarily imply that the point is a minimum.
Example
Consider a case where we have n = 3, m = 1. We have h(x) = x1 + x2 +
x3 − 1.
The gradient of our constraint is ∇h(x∗) = [1 1 1] ≠ 0. This implies
T.P. at x∗ = {∆x ∈ R³ | [1 1 1]∆x = 0}
We can use [1, 0, −1]ᵀ and [1, −1, 0]ᵀ as our basis vectors. This implies that we are
looking for
α_1 [1, 0, −1]ᵀ + α_2 [1, −1, 0]ᵀ
We can let M = ∇²f(x∗) + ∑_{i=1}^{m} λ_i ∇²h_i(x∗). We want ∆xᵀM∆x ≥ 0 for every
∆x in the tangent plane. We can write α = [α_1, α_2, . . . , α_{n−m}]ᵀ and
E = [y_1 . . . y_{n−m}], and then write the equivalence
∆xᵀM∆x = αᵀ(EᵀME)α
where the ∆x on the left side is in the tangent plane, and the α on the right is
arbitrary.
If EᵀME ⪰ 0 (positive semidefinite), that is a necessary condition.
If EᵀME ≻ 0 (positive definite), that is a sufficient condition.
Example
max_{x∈R²} x_1 x_2  s.t. x_1 + x_2 = 2
can be rewritten as minimizing −x_1 x_2. We get ∇h(x∗) = [1 1] ≠ 0. First-order conditions:
−x∗_2 + λ = 0,  −x∗_1 + λ = 0,  x∗_1 + x∗_2 = 2  ⟹  x∗_1 = x∗_2 = 1
We get M = ∇²f(x∗) + λ · 0:
M = [0 −1; −1 0]
We also get E = [1, −1]ᵀ.
EᵀME = [1 −1] [0 −1; −1 0] [1, −1]ᵀ = 2
Since 2 > 0, EᵀME ≻ 0, so x∗ = (1, 1) is a local min of −x_1x_2, i.e., a local max of
the original problem.
Example
max_{x∈R³} x_1 x_2 + x_2 x_3 + x_3 x_1  s.t. x_1 + x_2 + x_3 = 3
We can instead minimize the negative of the objective. We get ∇h(x∗) = [1 1 1] ≠ 0,
which implies that all points are regular. Our first-order conditions lead us to
x∗ = [1, 1, 1]ᵀ
We get
E = [y_1 y_2] = [1 1; −1 0; 0 −1]
(the columns y_1 = [1, −1, 0]ᵀ and y_2 = [1, 0, −1]ᵀ span the tangent plane).
We get that
EᵀME = [2 1; 1 2]
We do the determinant test on that, find that the eigenvalues are greater
than 0, and conclude that our point is a local minimum of the negated objective
(a local maximum of the original problem).
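A numeric sketch of the tangent-plane test for this example, assuming numpy: build M = ∇²(−(x_1x_2 + x_2x_3 + x_3x_1)), a basis E for {∆x : ∆x_1 + ∆x_2 + ∆x_3 = 0}, and check that EᵀME is positive definite.

```python
import numpy as np

# M = Hessian of the minimization objective -(x1*x2 + x2*x3 + x3*x1);
# the constraint h is linear, so there is no lambda * Hessian(h) term.
M = -np.array([[0.0, 1.0, 1.0],
               [1.0, 0.0, 1.0],
               [1.0, 1.0, 0.0]])

# Columns span the tangent plane {dx : dx1 + dx2 + dx3 = 0}.
E = np.array([[1.0, 1.0],
              [-1.0, 0.0],
              [0.0, -1.0]])

reduced = E.T @ M @ E
print(reduced)                        # [[2. 1.] [1. 2.]]
print(np.linalg.eigvalsh(reduced))    # both eigenvalues positive -> sufficient condition holds
```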
Remark. The problems we just solved were fairly easy: the constraints were
always linear, which makes them easier to solve.
Possible Solution: Penalty. Remove hard conditions and replace them with
soft conditions. Try adding a large constant multiplied by the sum of squared
constraints. We write this as
min f (x) + C(h1 (x)2 + · · · + hm (x)2 )
We can prove that ‖x∗ − x̂‖ ∝ 1/C, where x̂ is the minimizer of the penalized problem.
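A minimal sketch of the penalty idea, assuming scipy is available: minimize f(x) + C·h(x)² for increasing C and watch the unconstrained minimizer approach the constrained one. The example reuses the earlier problem min −x_1x_2 s.t. x_1 + x_2 = 2, whose constrained optimum is x∗ = (1, 1).

```python
import numpy as np
from scipy.optimize import minimize

def penalized(x, C):
    # f(x) = -x1*x2 (minimization form of max x1*x2), h(x) = x1 + x2 - 2
    return -x[0] * x[1] + C * (x[0] + x[1] - 2) ** 2

x_star = np.array([1.0, 1.0])   # constrained optimum from the earlier example
for C in [1, 10, 100, 1000]:
    x_hat = minimize(lambda x: penalized(x, C), [0.0, 0.0]).x
    print(C, x_hat, np.linalg.norm(x_star - x_hat))   # the error shrinks roughly like 1/C
```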