multivariate functions. The interpretation of the first derivative remains the same, but
there are now two kinds of second-order derivatives to consider.
First, there is the direct second-order derivative. In this case, the multivariate function
is differentiated once, with respect to an independent variable, holding all other
variables constant. Then the result is differentiated a second time, again with respect
to the same independent variable. In a function such as the following:
These second derivatives can be interpreted as the rates of change of the two slopes of
the function z.
Now the story gets a little more complicated. The cross-partials, $f_{xy}$ and $f_{yx}$, are
defined in the following way. First, take the partial derivative of z with respect to x.
Then take the derivative again, but this time, take it with respect to y, and hold the x
constant. Spatially, think of the cross partial as a measure of how the slope (change in
z with respect to x) changes, when the y variable changes. The following are
examples of notation for cross-partials:
We'll discuss economic meaning further in the next section, but for now, we'll just
show an example, and note that in a function where the cross-partials are continuous,
they will be identical. For the following function:
Take the first and second partial derivatives.
Now, starting with the first partials, find the cross partial derivatives:
Note that the cross partials are indeed identical, a fact that will be very useful to us in
future optimization sections.
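Since the function from the text is not reproduced here, the symmetry of the cross-partials can be checked numerically on a hypothetical example function instead. Both mixed partials are estimated with nested central differences and compared against the exact value:

```python
# Numerical check that the mixed partials f_xy and f_yx agree for a
# smooth function.  The example f(x, y) = x**3 * y**2 + y is a made-up
# illustration (not the function from the text); its exact mixed
# partial is 6*x**2*y, which equals 27 at the point (1.5, 2.0).

def f(x, y):
    return x**3 * y**2 + y

def cross_partial(f, x, y, first, h=1e-4):
    """Central-difference estimate of f_xy (first='x') or f_yx (first='y')."""
    if first == 'x':
        # difference the x-derivative in the y direction
        fx = lambda yy: (f(x + h, yy) - f(x - h, yy)) / (2 * h)
        return (fx(y + h) - fx(y - h)) / (2 * h)
    else:
        # difference the y-derivative in the x direction
        fy = lambda xx: (f(xx, y + h) - f(xx, y - h)) / (2 * h)
        return (fy(x + h) - fy(x - h)) / (2 * h)

fxy = cross_partial(f, 1.5, 2.0, 'x')
fyx = cross_partial(f, 1.5, 2.0, 'y')
print(fxy, fyx)  # both close to 27, and close to each other
```

Because the function is smooth, the two estimates agree to within discretization error, mirroring the equality of the cross-partials.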
Now that we have the brief discussion on limits out of the way, we can proceed to taking
derivatives of functions of more than one variable. Before we actually start taking derivatives of
functions of more than one variable, let's recall an important interpretation of derivatives of
functions of one variable.
We will need to develop ways, and notations, for dealing with all of these cases. In this section
we are going to concentrate exclusively on only changing one of the variables at a time, while
the remaining variable(s) are held fixed. We will deal with allowing multiple variables to change
in a later section.
Because we are going to only allow one of the variables to change, taking the derivative will now
become a fairly simple process. Let's start off this discussion with a fairly simple function.
We'll start by looking at the case of holding $y$ fixed and allowing $x$ to vary. Since we are
interested in the rate of change of the function at $(a,b)$ and are holding $y$ fixed, this
means that we are going to always have $y = b$ (if we didn't have this then
eventually $y$ would have to change in order to get to the point). Doing this will give us a
function involving only $x$'s and we can define a new function as follows,

$g(x) = f(x, b)$
Now, this is a function of a single variable and at this point all that we are asking is to determine
the rate of change of $g(x)$ at $x = a$, and since this is a function of a single variable we already know how
to do that. Here is the rate of change of the function at $(a,b)$ if we hold $y$ fixed and
allow $x$ to vary,

$f_x(a,b) = g'(a)$
Now, let's do it the other way. We will now hold $x$ fixed and allow $y$ to vary. We can do this in a
similar way. Since we are holding $x$ fixed it must be fixed at $x = a$, and so we can
define a new function of $y$ and then differentiate this as we've always done with functions of one
variable.
Note that these two partial derivatives are sometimes called the first-order partial derivatives.
Just as with functions of one variable we can have derivatives of all orders. We will be looking
at higher order derivatives in a later section.
Note that the notation for partial derivatives is different than that for derivatives of functions of a
single variable. With functions of a single variable we could denote the derivative with a single
prime. However, with partial derivatives we will always need to remember the variable that we
are differentiating with respect to and so we will subscript the variable that we differentiated with
respect to. We will shortly be seeing some alternate notation for partial derivatives as well.
Note as well that we usually don't use the $(a,b)$ notation for partial derivatives, since that implies we are
working at a specific point. The more standard notation is to just continue to use $(x, y)$. So, the partial derivatives
from above will more commonly be written as $f_x(x,y)$ and $f_y(x,y)$.
Now, as this quick example has shown, taking derivatives of functions of more than one variable
is done in pretty much the same manner as taking derivatives of a single variable. To
compute $f_x(x,y)$ all we need to do is treat all the $y$'s as constants (or
numbers) and then differentiate the $x$'s as we've always done. Likewise, to compute
$f_y(x,y)$ we will treat all the $x$'s as constants and then differentiate the $y$'s as we are used
to doing.
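The rule above can be sketched on a small example. The function below is a hypothetical choice: treating $y$ as a constant gives $f_x = 2xy$, and treating $x$ as a constant gives $f_y = x^2 + \cos(y)$; a finite-difference check confirms both:

```python
import math

# Hypothetical example function f(x, y) = x**2 * y + sin(y).
# Holding y fixed:  f_x = 2*x*y.
# Holding x fixed:  f_y = x**2 + cos(y).

def f(x, y):
    return x**2 * y + math.sin(y)

def f_x(x, y):
    return 2 * x * y           # y treated as a constant

def f_y(x, y):
    return x**2 + math.cos(y)  # x treated as a constant

# Central-difference check at an arbitrary point.
x0, y0, h = 1.2, 0.7, 1e-6
approx_fx = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)
approx_fy = (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h)
print(approx_fx, f_x(x0, y0))  # both close to 1.68
print(approx_fy, f_y(x0, y0))
```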
Before we work any examples, let's get the formal definition of the partial derivative out of the
way as well as some alternate notation.

Since we can think of the two partial derivatives above as derivatives of single variable functions
it shouldn't be too surprising that the definition of each is very similar to the definition of the
derivative for single variable functions. Here are the formal definitions of the two partial
derivatives we looked at above:

$f_x(x,y) = \lim_{h \to 0} \dfrac{f(x+h,\,y) - f(x,y)}{h} \qquad f_y(x,y) = \lim_{h \to 0} \dfrac{f(x,\,y+h) - f(x,y)}{h}$
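The limit definition can be watched in action numerically. For the hypothetical function $f(x,y) = x^2 y$, the exact value of $f_x(2,3)$ is $2xy = 12$, and the forward difference quotient approaches it as $h$ shrinks:

```python
# The limit definition of f_x at (a, b):
#   f_x(a, b) = lim_{h -> 0} [f(a + h, b) - f(a, b)] / h.
# Hypothetical example: f(x, y) = x**2 * y, so f_x(2, 3) = 2*2*3 = 12.

def f(x, y):
    return x**2 * y

a, b = 2.0, 3.0
quotients = [(f(a + h, b) - f(a, b)) / h for h in (1e-1, 1e-3, 1e-5)]
print(quotients)  # values approach 12; the error shrinks with h
```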
Now let's take a quick look at some of the possible alternate notations for partial derivatives.
Given the function $z = f(x, y)$, the following are all equivalent
notations,

$f_x(x,y) = f_x = \dfrac{\partial f}{\partial x} = \dfrac{\partial z}{\partial x} = z_x = \dfrac{\partial}{\partial x} f(x,y)$

$f_y(x,y) = f_y = \dfrac{\partial f}{\partial y} = \dfrac{\partial z}{\partial y} = z_y = \dfrac{\partial}{\partial y} f(x,y)$
For the fractional notation for the partial derivative notice the difference between the partial
derivative and the ordinary derivative from single variable calculus.
Okay, now let's work some examples. When working these examples always keep in mind that
we need to pay very close attention to which variable we are differentiating with respect to. This
is important because we are going to treat all other variables as constants and then proceed with
the derivative as if it was a function of a single variable. If you can remember this you'll find
that doing partial derivatives is not much more difficult than doing derivatives of functions of a
single variable as we did in Calculus I.
The rule for partial derivatives is that we differentiate with respect to one variable while keeping all
the other variables constant. As another example, find the partial derivatives of u with respect
to x and with respect to y for
Thus,
and
Taking the Partial Derivative of a Partial Derivative
So far we have defined and given examples for first-order partial derivatives. Second-order partial
derivatives are simply the partial derivative of a first-order partial derivative. We can have four
second-order partial derivatives:
$f_{xx} = \dfrac{\partial}{\partial x}\!\left(\dfrac{\partial f}{\partial x}\right), \quad f_{xy} = \dfrac{\partial}{\partial y}\!\left(\dfrac{\partial f}{\partial x}\right), \quad f_{yx} = \dfrac{\partial}{\partial x}\!\left(\dfrac{\partial f}{\partial y}\right), \quad f_{yy} = \dfrac{\partial}{\partial y}\!\left(\dfrac{\partial f}{\partial y}\right).$
Unconstrained optimization problems consider the problem of minimizing an objective function that depends on
real variables with no restrictions on their values. Mathematically, let $x \in \mathbb{R}^n$ be a real vector
with $n \ge 1$ components and let $f : \mathbb{R}^n \to \mathbb{R}$ be a smooth function. Then, the unconstrained optimization
problem is

$\min_x f(x).$
Unconstrained optimization problems arise directly in some applications but they also arise indirectly from
reformulations of constrained optimization problems. Often it is practical to replace the constraints of an optimization
problem with penalized terms in the objective function and to solve the problem as an unconstrained problem.
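The penalty idea can be sketched on a toy problem (made up for illustration): minimize $x^2$ subject to $x \ge 1$, whose constrained minimizer is $x = 1$. Replacing the constraint with a quadratic penalty term and increasing the penalty weight drives the unconstrained minimizers toward the constrained solution:

```python
# Minimal sketch of a quadratic penalty method on a toy problem:
#   minimize x**2  subject to  x >= 1   (constrained minimizer: x = 1).
# The constraint is replaced by a penalty term mu * max(0, 1 - x)**2.

def penalized(x, mu):
    violation = max(0.0, 1.0 - x)      # distance from feasibility
    return x**2 + mu * violation**2

def argmin_1d(f, lo=-10.0, hi=10.0, iters=200):
    """Ternary search; valid here because the penalized objective is convex."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

minimizers = [argmin_1d(lambda x: penalized(x, mu)) for mu in (1, 10, 100, 1000)]
print(minimizers)  # increases toward the constrained solution x = 1
```

Each unconstrained solve is cheap, and the sequence of minimizers ($\mu/(1+\mu)$ analytically) approaches feasibility as the penalty weight grows.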
Algorithms
An important aspect of continuous optimization (constrained and unconstrained) is whether the functions are smooth,
by which we mean that the second derivatives exist and are continuous. There has been extensive study and
development of algorithms for the unconstrained optimization of smooth functions. At a high level, algorithms for
unconstrained minimization follow this general structure:
Beginning at $x_0$, generate a sequence of iterates $\{x_k\}_{k=0}^{\infty}$ with non-increasing function ($f$)
value until a solution point with sufficient accuracy is found or until no further progress can be made.
To generate the next iterate $x_{k+1}$, the algorithm uses information about the function at $x_k$ and possibly earlier
iterates.
Newton's Method
Newton's Method gives rise to a wide and important class of algorithms that require computation of the gradient
vector

$\nabla f(x) = \left(\partial_1 f(x), \ldots, \partial_n f(x)\right)^T$

and of the Hessian matrix

$\nabla^2 f(x) = \left[\partial_i \partial_j f(x)\right].$
Although the computation or approximation of the Hessian can be a time-consuming operation, there are many
problems for which this computation is justified.
At each iteration, the method works with a quadratic model of the objective around the current iterate $x_k$,

$q_k(s) = f(x_k) + \nabla f(x_k)^T s + \tfrac{1}{2}\, s^T \nabla^2 f(x_k)\, s.$
In the basic Newton method, the next iterate is obtained from the minimizer of $q_k(s)$. When the Hessian
matrix $\nabla^2 f(x_k)$ is positive definite, the quadratic model has a unique minimizer $s_k$ that can be obtained by
solving the symmetric linear system

$\nabla^2 f(x_k)\, s = -\nabla f(x_k).$

The next iterate is then

$x_{k+1} = x_k + s_k.$
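The basic iteration can be sketched on a toy function (a hypothetical example, chosen so the Hessian is diagonal and the Newton system is trivial to solve): $f(x,y) = e^x - x + y^2$, whose minimizer is $(0,0)$.

```python
import math

# Minimal sketch of the basic Newton iteration on the toy function
#   f(x, y) = exp(x) - x + y**2,   minimizer (0, 0).
# Each step solves  H(x_k) s = -grad f(x_k)  and sets  x_{k+1} = x_k + s.

def grad(x, y):
    return (math.exp(x) - 1.0, 2.0 * y)

def hessian(x, y):
    # Diagonal for this example, so the linear solve is two divisions;
    # in general one factorizes the Hessian matrix instead.
    return (math.exp(x), 2.0)

xk, yk = 1.0, 0.5
for _ in range(8):
    gx, gy = grad(xk, yk)
    hxx, hyy = hessian(xk, yk)
    xk += -gx / hxx   # Newton step in x
    yk += -gy / hyy   # Newton step in y (exact after one step here)
print(xk, yk)  # both very close to the minimizer (0, 0)
```

The iterates illustrate the fast local convergence described next: once close to the solution, the error is roughly squared at every step.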
Near a minimizer $x^*$, the basic Newton iteration converges quadratically,

$\| x_{k+1} - x^* \| \le \beta\, \| x_k - x^* \|^2$

for some positive constant $\beta$. In most circumstances, however, the basic Newton
method has to be modified to achieve convergence. There are two fundamental
strategies for moving from $x_k$ to $x_{k+1}$: line search and trust region. Most
algorithms follow one of these two strategies.
The line-search method modifies the search direction to obtain another downhill, or descent, direction for $f$.
It then tries different step lengths along this direction until it finds a step that not only decreases $f$ but also
achieves at least a small fraction of this direction's potential.
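One common way to enforce "a small fraction of this direction's potential" is a backtracking line search with the Armijo sufficient-decrease condition. The sketch below uses illustrative constants and a made-up one-dimensional test problem:

```python
# Backtracking line search with the Armijo (sufficient decrease)
# condition.  Constants alpha, rho, c are conventional illustrative
# choices, not values from the text.

def backtracking(f, grad_f, x, direction, alpha=1.0, rho=0.5, c=1e-4):
    fx = f(x)
    slope = grad_f(x) * direction            # directional derivative (1-D case)
    assert slope < 0, "direction must be a descent direction"
    while f(x + alpha * direction) > fx + c * alpha * slope:
        alpha *= rho                         # shrink the step and retry
    return alpha

# Toy problem: f(x) = x**4 from x = 1, along the steepest-descent direction.
f = lambda x: x**4
grad_f = lambda x: 4 * x**3
step = backtracking(f, grad_f, 1.0, -grad_f(1.0))
print(step)  # -> 0.25
```

The full step (alpha = 1) overshoots badly here, so the search halves the step until the decrease condition holds.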
The trust-region methods use the original quadratic model function, but they constrain the new iterate to stay
in a local neighborhood of the current iterate. To find the step, it is necessary to minimize the quadratic function
subject to staying in this neighborhood, which is generally ellipsoidal in shape.
Wikipedia Link to Trust Region
Line-search and trust-region techniques are suitable if the number of variables $n$ is not too large, because the cost
per iteration is of order $n^3$. Codes for problems with a large number of variables tend to use truncated Newton
methods, which usually settle for an approximate minimizer of the quadratic model.
If computing the exact Hessian matrix is not practical, the same algorithms can be used with a reasonable
approximation of the Hessian matrix. Two types of methods use approximations to the Hessian in place of the exact
Hessian.
One approach is to use difference approximations to the exact Hessian. Difference approximations exploit
the fact that each column of the Hessian can be approximated by taking the difference between two instances
of the gradient vector evaluated at two nearby points. For sparse Hessians, it is often possible to approximate
many columns of the Hessian with a single gradient evaluation by choosing the evaluation points judiciously.
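The column-by-column idea can be sketched directly: differencing the gradient along the $i$-th coordinate direction recovers column $i$ of the Hessian. The example function is hypothetical, with known Hessian $\begin{bmatrix}2 & 1\\ 1 & 6y\end{bmatrix}$:

```python
# Approximating one Hessian column from two gradient evaluations:
#   column i of H  ~=  (grad f(x + h*e_i) - grad f(x)) / h.
# Hypothetical example: f(x, y) = x**2 + x*y + y**3, whose exact
# Hessian is [[2, 1], [1, 6*y]].

def grad(v):
    x, y = v
    return [2*x + y, x + 3*y**2]

def hessian_column(grad, v, i, h=1e-6):
    """Forward-difference estimate of column i of the Hessian at v."""
    vp = list(v)
    vp[i] += h
    g0, g1 = grad(v), grad(vp)
    return [(a - b) / h for a, b in zip(g1, g0)]

col0 = hessian_column(grad, [1.0, 2.0], 0)
print(col0)  # close to [2.0, 1.0], the first column of the exact Hessian
```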
Quasi-Newton Methods build up an approximation to the Hessian by keeping track of the gradient
differences along each step taken by the algorithm. Various conditions are imposed on the approximate
Hessian. For example, its behavior along the step just taken is forced to mimic the behavior of the exact
Hessian, and it is usually kept positive definite.
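As a concrete instance of this idea, the BFGS formula (one popular quasi-Newton update) is sketched below in two dimensions with plain lists. The key property it enforces is the secant condition $B_{k+1} s = y$: the updated approximation mimics the true Hessian's behavior along the step $s$ just taken (the vectors $s$ and $y$ here are arbitrary illustrative values):

```python
# Sketch of the BFGS update for a 2x2 Hessian approximation B:
#   B1 = B - (B s s^T B) / (s^T B s) + (y y^T) / (y^T s),
# where s is the step taken and y is the gradient difference along it.

def mat_vec(M, v):
    return [M[0][0]*v[0] + M[0][1]*v[1], M[1][0]*v[0] + M[1][1]*v[1]]

def dot(a, b):
    return a[0]*b[0] + a[1]*b[1]

def outer(a, b):
    return [[a[0]*b[0], a[0]*b[1]], [a[1]*b[0], a[1]*b[1]]]

def bfgs_update(B, s, y):
    Bs = mat_vec(B, s)
    t1, c1 = outer(Bs, Bs), dot(s, Bs)   # rank-one correction terms
    t2, c2 = outer(y, y), dot(y, s)
    return [[B[i][j] - t1[i][j]/c1 + t2[i][j]/c2 for j in range(2)]
            for i in range(2)]

B0 = [[1.0, 0.0], [0.0, 1.0]]   # start from the identity
s = [0.5, -0.2]                 # step taken (illustrative values)
y = [1.1, 0.3]                  # observed gradient difference
B1 = bfgs_update(B0, s, y)
print(mat_vec(B1, s), y)        # B1 @ s reproduces y: the secant condition
```

With $y^T s > 0$ (as here), the update also preserves positive definiteness, matching the conditions described above.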
There are two other approaches for unconstrained problems that are not so closely related to Newton's method.
Nonlinear conjugate gradient methods are motivated by the success of the linear conjugate gradient method
in minimizing quadratic functions with positive definite Hessians. They use search directions that combine the
negative gradient direction with another direction, chosen so that the search will take place along a direction not
previously explored by the algorithm. At least, this property holds for the quadratic case, for which the minimizer
is found exactly within just $n$ iterations. For nonlinear problems, performance is problematic, but these methods
do have the advantage that they require only gradient evaluations and do not use much storage.
The nonlinear Simplex method (not to be confused with the simplex method for linear programming) requires
neither gradient nor Hessian evaluations. Instead, it performs a pattern search based only on function values.
Because it makes little use of information about $f$, it typically requires a great many iterations to find a solution
that is even in the ballpark. It can be useful when $f$ is nonsmooth or when derivatives are impossible to find, but
it is unfortunately often used when one of the algorithms above would be more appropriate.
Related Problems
The Nonlinear Least-Squares Problem is a special case of unconstrained optimization. It arises in many
practical problems, especially in data-fitting applications. The objective function $f$ has the form

$f(x) = \tfrac{1}{2} \sum_{j=1}^{m} r_j^2(x),$

where each $r_j$ is a smooth function from $\mathbb{R}^n$ to $\mathbb{R}$. The special form of $f$ and its derivatives has been
exploited to develop efficient algorithms for minimizing $f$.
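One classic way this structure is exploited is the Gauss-Newton iteration, which solves $(J^T J)\, s = -J^T r$ at each step, where $J$ is the Jacobian of the residuals. A sketch on a hypothetical one-parameter model (noise-free data generated with $\theta = 0.5$, so the fit recovers it exactly):

```python
import math

# Gauss-Newton sketch for a one-parameter least-squares problem with
# residuals r_j(theta) = exp(theta * t_j) - b_j.  With one parameter,
# the step reduces to  s = -sum(J_j r_j) / sum(J_j**2).
# (Model and data are made up; data generated with theta = 0.5.)

ts = [0.0, 0.5, 1.0, 1.5, 2.0]
bs = [math.exp(0.5 * t) for t in ts]   # noise-free data for clarity

theta = 0.0
for _ in range(20):
    r = [math.exp(theta * t) - b for t, b in zip(ts, bs)]
    J = [t * math.exp(theta * t) for t in ts]   # dr_j / dtheta
    step = -sum(j * ri for j, ri in zip(J, r)) / sum(j * j for j in J)
    theta += step
print(theta)  # recovers the generating value 0.5
```

Note that only first derivatives of the residuals are needed; the product $J^T J$ serves as an inexpensive Hessian surrogate.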
The problem of solving a system of Nonlinear Equations is related to unconstrained optimization in that a
number of algorithms for nonlinear equations proceed by minimizing a sum of squares. It often arises in
problems involving physical systems. In nonlinear equations, there is no objective function to optimize but
instead the goal is to find values of the variables that satisfy a set of $n$ equality constraints.
General form

A general constrained minimization problem may be written as

$\min f(x)$ subject to $g_i(x) = c_i$ for $i = 1, \ldots, n$ and $h_j(x) \ge d_j$ for $j = 1, \ldots, m$,

where $g_i(x) = c_i$ and $h_j(x) \ge d_j$ are constraints that are required to be satisfied; these are called hard constraints.
In some problems, often called constraint optimization problems, the objective function is
actually the sum of cost functions, each of which penalizes the extent (if any) to which a soft
constraint (a constraint which is preferred but not required to be satisfied) is violated.
Solution methods
Many unconstrained optimization algorithms can be adapted to the constrained case, often via
the use of a penalty method. However, search steps taken by the unconstrained method may be
unacceptable for the constrained problem, leading to a lack of convergence. This is referred to
as the Maratos effect.[1]
Equality constraints
If the constrained problem has only equality constraints, the method of Lagrange multipliers can
be used to convert it into an unconstrained problem whose number of variables is the original
number of variables plus the original number of equality constraints. Alternatively, if the
constraints are all equality constraints and are all linear, they can be solved for some of the
variables in terms of the others, and the former can be substituted out of the objective function,
leaving an unconstrained problem in a smaller number of variables.
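The substitution route can be shown on a toy problem (not from the text): minimize $x^2 + y^2$ subject to $x + y = 1$. Solving the linear constraint for $y$ gives $y = 1 - x$, leaving an unconstrained problem in $x$ alone:

```python
# Substituting out a linear equality constraint:
#   minimize x**2 + y**2  subject to  x + y = 1.
# With y = 1 - x, the problem becomes unconstrained in x.

def g(x):
    # objective after the substitution y = 1 - x
    return x**2 + (1 - x)**2

# By hand: g'(x) = 2x - 2(1 - x) = 0  =>  x = 0.5, hence y = 0.5.
# A coarse grid search confirms the same minimizer.
xs = [i / 1000 for i in range(1001)]
x_best = min(xs, key=g)
print(x_best, 1 - x_best)  # -> 0.5 0.5
```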
Inequality constraints
With inequality constraints, the problem can be characterized in terms of the geometric
optimality conditions, the Fritz John conditions, and the Karush-Kuhn-Tucker conditions, under which
simple problems may be solvable.
Linear programming
If the objective function and all of the hard constraints are linear, then the problem is a linear
programming problem. This can be solved by the simplex method, which usually works
in polynomial time in the problem size but is not guaranteed to, or by interior point
methods which are guaranteed to work in polynomial time.
Quadratic programming
If all the hard constraints are linear but the objective function is quadratic, the problem is
a quadratic programming problem. It can still be solved in polynomial time by the ellipsoid
method if the objective function is convex; otherwise the problem is NP-hard.
Constraint optimization can be solved by branch and bound algorithms. These are backtracking
algorithms storing the cost of the best solution found during execution and using it to avoid part
of the search. More precisely, whenever the algorithm encounters a partial solution that cannot
be extended to form a solution of better cost than the stored best cost, the algorithm backtracks,
instead of trying to extend this solution.
Assuming that cost is to be maximized, the efficiency of these algorithms depends on how the
cost that can be obtained from extending a partial solution is evaluated. Indeed, if the algorithm
can backtrack from a partial solution, part of the search is skipped. The lower the estimated cost,
the better the algorithm, as a lower estimated cost is more likely to be lower than the best cost of
a solution found so far.
On the other hand, this estimated cost cannot be lower than the effective cost that can be
obtained by extending the solution, as otherwise the algorithm could backtrack while a solution
better than the best found so far exists. As a result, the algorithm requires an upper bound on
the cost that can be obtained from extending a partial solution, and this upper bound should be
as small as possible.
A variation of this approach called Hansen's method uses interval methods.[2] It inherently
implements rectangular constraints.
One way of evaluating this upper bound for a partial solution is to consider each soft constraint
separately. For each soft constraint, the maximal possible value for any assignment to the
unassigned variables is assumed. The sum of these values is an upper bound because the soft
constraints cannot assume a higher value. This bound is not necessarily exact, however, because the maximal
values of the soft constraints may derive from different evaluations: one soft constraint may be maximal at one
assignment of the unassigned variables while another constraint is maximal at a different one.
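A compact branch-and-bound sketch tying these pieces together: binary variables are assigned in order, and the bound for a partial assignment sums each soft constraint's own maximum over all completions, exactly the per-constraint bound just described. The problem instance is made up for illustration:

```python
from itertools import product

# Branch and bound for MAXIMIZING a sum of soft constraints over
# binary variables.  Each constraint is (variable indices, cost fn).
# (Instance is illustrative; the brute-force optimum is 7.)

constraints = [
    ((0, 1), lambda a, b: 3 if a != b else 0),
    ((1, 2), lambda b, c: 2 if b == c else 1),
    ((0, 2), lambda a, c: a + 2 * c),
]
n = 3

def constraint_bound(idx, fn, assignment):
    """Max of one constraint over all completions of a partial assignment."""
    free = [i for i in idx if i not in assignment]
    best = None
    for vals in product((0, 1), repeat=len(free)):
        local = dict(assignment)
        local.update(zip(free, vals))
        v = fn(*(local[i] for i in idx))
        best = v if best is None else max(best, v)
    return best

def solve():
    best = [float('-inf')]
    def recurse(assignment):
        if len(assignment) == n:
            best[0] = max(best[0], sum(fn(*(assignment[i] for i in idx))
                                       for idx, fn in constraints))
            return
        # optimistic upper bound: each constraint maximized independently
        bound = sum(constraint_bound(idx, fn, assignment)
                    for idx, fn in constraints)
        if bound <= best[0]:
            return          # prune: this branch cannot beat the incumbent
        var = len(assignment)
        for val in (0, 1):
            recurse({**assignment, var: val})
    recurse({})
    return best[0]

print(solve())  # -> 7
```

Because each constraint is maximized independently, the bound is valid but not necessarily tight, which is precisely the trade-off discussed above.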
In particular, the cost estimate of a solution that still has unassigned variables is added to the cost
that derives from the evaluated variables. In effect, this corresponds to ignoring the evaluated
variables and solving the problem on the unassigned ones, except that the latter problem has
already been solved. More precisely, the cost of soft constraints containing both assigned and
unassigned variables is estimated as above (or using an arbitrary other method); the cost of soft
constraints containing only unassigned variables is instead estimated using the optimal solution
of the corresponding problem, which is already known at this point.
There is a similarity between the Russian Doll Search method and dynamic programming. Like
dynamic programming, Russian Doll Search solves sub-problems in order to solve the whole
problem. But whereas dynamic programming directly combines the results obtained on sub-problems
to get the result of the whole problem, Russian Doll Search only uses them as bounds
during its search.
Bucket elimination
The bucket elimination algorithm can be adapted for constraint optimization. A given variable can
indeed be removed from the problem by replacing all soft constraints containing it with a new
soft constraint. The cost of this new constraint is computed assuming a maximal value for every
value of the removed variable. Formally, if $x$ is the variable to be removed, $C_1, \ldots, C_k$ are the soft
constraints containing it, and $y_1, \ldots, y_m$ are their variables except $x$, the new soft constraint is defined by:

$C(y_1 = a_1, \ldots, y_m = a_m) = \max_{a} \sum_i C_i(x = a,\, y_1 = a_1, \ldots, y_m = a_m).$
Bucket elimination works with an (arbitrary) ordering of the variables. Every variable is
associated with a bucket of constraints; the bucket of a variable contains all constraints in which
the variable is the highest in the ordering. Bucket elimination proceeds from the last variable to
the first. For each variable, all constraints of the bucket are replaced as above to remove the
variable. The resulting constraint is then placed in the appropriate bucket.
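One elimination step can be sketched as follows. The representation (constraints as a variable tuple plus a cost function over an assignment dict) and the tiny binary instance are both illustrative choices, not from the text:

```python
# One bucket-elimination step over binary variables: removing `var`
# replaces every soft constraint mentioning it with a single new
# constraint on the remaining variables, defined as the maximum over
# var's values of the summed originals.

def eliminate(var, constraints, domain=(0, 1)):
    """constraints: list of (vars_tuple, fn taking an assignment dict)."""
    touching = [c for c in constraints if var in c[0]]
    rest = [c for c in constraints if var not in c[0]]
    scope = tuple(sorted({v for vs, _ in touching for v in vs} - {var}))

    def new_fn(assignment):
        # maximize the summed touching constraints over var's values
        return max(
            sum(fn({**assignment, var: val}) for _, fn in touching)
            for val in domain
        )

    return rest + [(scope, new_fn)]

# Tiny instance: soft constraints C1(x, y) and C2(x, z).
c1 = (('x', 'y'), lambda a: 2 if a['x'] != a['y'] else 0)
c2 = (('x', 'z'), lambda a: a['x'] + a['z'])
scope, fn = eliminate('x', [c1, c2])[0]
# the new constraint over (y, z) already accounts for the best x
print(scope, fn({'y': 0, 'z': 1}))  # -> ('y', 'z') 4
```

Evaluating the new constraint at any assignment of the remaining variables implicitly optimizes over the eliminated variable, which is what lets the algorithm sweep from the last variable to the first.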