
M3R Project 2013

Convex Functions and Their Applications

Name: Muhammad Syafiq Johar


CID: 00608118
Supervisor: Professor Ari Laptev

This is my own unaided work unless stated otherwise.

....................................

Abstract
Convexity is a basic yet powerful part of analysis. First, we are going to define
convex functions and outline some of their interesting properties. We are also going
to introduce the concept of subderivatives, which generalises the idea of differen-
tiation. Then, we are going to go through an application of convex functions in
classical dynamics, namely the Legendre transform and the more general Legendre-
Fenchel transform. Finally, we are going to look at an elegant way of proving some
classical inequalities using the idea of convexity and, ultimately, extend Hölder's
inequality to higher dimensions.
Contents

I Convex Functions and Their Properties

1 Introduction

2 Convex Functions
  2.1 Basic Properties of Convex Functions

3 Subderivatives

II Legendre Transform

4 Legendre Transform

5 Legendre-Fenchel Transform

6 Applications of Legendre Transform
  6.1 Young's Inequality
  6.2 Lagrangian and Hamiltonian Mechanics

III Inequalities

7 Convexity and Jensen's Inequality

8 Concavity and Inequalities
  8.1 Cauchy's Inequality
  8.2 Hölder's Inequality
  8.3 Minkowski's Inequality
  8.4 Hölder's Inequality in Higher Dimensions

IV Summary

V Bibliography

Acknowledgements

Part I
Convex Functions and Their Properties
1 Introduction
Convex functions are one of the most basic types of functions. We have all seen numerous
examples of convex functions, best described as looking like a bowl. For instance,
everyone starts off algebra by learning about examples of functions, and the particularly
simple f(x) = x² is one of them. The shape of the graph of this function
looks like a bowl, hence it is said to be convex.
However, there are other convex functions that do not look like a bowl, for example
the exponential function g(x) = e^x and even the linear function h(x) = x. One common
trait between these functions is that their slope is non-decreasing. But how do we
properly define convex functions? And what do convex functions in higher dimensions
look like?
We are also concerned with the properties of convex functions. At first glance, they are
a simple class of functions. However, if we look further at their properties,
we can see that they are much richer than they seem. This enables us to exploit these
properties as a useful and sometimes crucial tool in different branches of
mathematics, such as dynamics and inequalities, as we will see in the later parts of this
project.
But first, before turning to the applications, let us define what convex functions
are and explore their basic properties.

2 Convex Functions
In this section, we consider functions f : I → R, where I ⊂ R is an interval which may
be open, closed or half-open.
Definition 2.1 (Convex and Concave Functions). [1, p.11] A function f : I → R is
called convex if for all t ∈ [0, 1] and all a, b ∈ I, we have the following inequality

f ((1 − t)a + tb) ≤ (1 − t)f (a) + tf (b) (1)

A function is called strictly convex if inequality (1) holds strictly whenever a and
b are distinct and t ∈ (0, 1). A function is called concave if the opposite inequality
holds in (1).
A geometrical definition of a convex function: if we draw a secant chord on the
graph of a convex function f(x), the secant chord does not go below the graph of f(x)
between the endpoints of the chord. Indeed, the function F(t) = (1 − t)f(a) + tf(b) for
t ∈ [0, 1] is a parametrisation of the straight line joining (a, f(a)) and (b, f(b)). So,
by Definition 2.1, we can see that F(t) is always greater than or equal to f((1 − t)a + tb),
thus the straight line F is always above or on the graph of f for x ∈ [a, b].

Figure 1: Secant line of a convex function

Based on the geometrical interpretation above, we can also view convex functions in
terms of convex sets. We define first what an epigraph is. [1, p.115]

Definition 2.2 (Epigraph). An epigraph of a function f : I → R, denoted epi(f), is the
set of points lying above or on the graph, i.e.

epi(f) = {(x, y) : x ∈ I, y ≥ f(x)} (2)

So, from the geometrical interpretation, it is clear that the secant line for a convex
function f always lies above or on the graph of f. Hence, the secant line always stays in
the epigraph. Furthermore, if we pick any two points in the epigraph and draw a line
segment between them, the segment will stay in the epigraph. Thus, the epigraph epi(f)
is a convex set if and only if f is a convex function.
Recall that in the introduction, we have listed three examples of convex functions,
namely f (x) = x2 , g(x) = ex and h(x) = x. Clearly, by drawing the graphs and any
secant line on the graphs, we can see that these lines always lie above or on the graph.
Also, the epigraphs are convex sets.
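The epigraph characterisation lends itself to a quick numerical sanity check. The Python sketch below (an illustrative check with our own helper names, using f(x) = x² from the introduction) samples random pairs of points in epi(f) and verifies that the segment joining them stays in the epigraph:

```python
import random

def f(x):
    # A convex function from the introduction: f(x) = x^2.
    return x * x

def in_epigraph(x, y):
    # (x, y) lies in epi(f) iff y >= f(x).
    return y >= f(x)

random.seed(0)
for _ in range(1000):
    # Pick two arbitrary points in the epigraph of f ...
    x1, x2 = random.uniform(-5, 5), random.uniform(-5, 5)
    y1 = f(x1) + random.uniform(0, 3)
    y2 = f(x2) + random.uniform(0, 3)
    # ... and check that a point on the segment joining them stays in epi(f).
    t = random.uniform(0, 1)
    assert in_epigraph((1 - t) * x1 + t * x2, (1 - t) * y1 + t * y2)
```

The assertion never fires because, by convexity, (1 − t)y₁ + ty₂ ≥ (1 − t)f(x₁) + tf(x₂) ≥ f((1 − t)x₁ + tx₂).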
Now that we have properly defined convex functions on an interval in the real line,
we shall look at some of their basic properties.

2.1 Basic Properties of Convex Functions


We first look at an important property of convex functions that is almost immediate
from the definition. We will be utilising this property in the next few propositions.

Proposition 2.1. [2, p.2] If f : I → R is a convex function, then for all a, b, c ∈ I such
that a < b < c, we have:

(f(b) − f(a))/(b − a) ≤ (f(c) − f(a))/(c − a) ≤ (f(c) − f(b))/(c − b) (3)

Figure 2: Sequential secants of a convex function

Proof. We will prove the first inequality as the second one is done in a similar manner.
Since f is convex, f((1 − t)a + tc) ≤ (1 − t)f(a) + tf(c) for t ∈ [0, 1]. Since a < b < c, there
exists t₀ ∈ (0, 1) such that b = (1 − t₀)a + t₀c. Substituting t₀ in the convexity inequality,
we will have f(b) ≤ (1 − t₀)f(a) + t₀f(c). Rearranging and substituting t₀ = (b − a)/(c − a), we
will get the first inequality of (3).
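The three-slopes inequality (3) is easy to test numerically on the convex examples from the introduction. A small Python sketch (the helper names are our own):

```python
import math

def slope(f, u, v):
    # Gradient of the secant chord of f between u and v.
    return (f(v) - f(u)) / (v - u)

def three_slopes_hold(f, a, b, c, tol=1e-12):
    # Inequality (3): slope(a,b) <= slope(a,c) <= slope(b,c) for a < b < c.
    assert a < b < c
    return (slope(f, a, b) <= slope(f, a, c) + tol and
            slope(f, a, c) <= slope(f, b, c) + tol)

# Convex examples from the introduction satisfy (3) ...
assert three_slopes_hold(lambda x: x * x, -1.0, 0.5, 2.0)
assert three_slopes_hold(math.exp, 0.0, 1.0, 3.0)
# ... while a concave function violates it.
assert not three_slopes_hold(lambda x: -x * x, -1.0, 0.5, 2.0)
```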

Now, an important consequence of Proposition 2.1 above is that any convex function
f : I → R is continuous on the interior of I, denoted int(I). Furthermore, we can also
prove that the left and right derivatives of f at any point in int(I) exist.
Proposition 2.2. [1, p.25] Any convex function f : I → R is continuous on int(I).
Proof. We want to show continuity of the function f at x ∈ int(I). Construct two
inequalities from (3) in the following way: let a = x − δ, c = x and b = y ∈ (x − δ, x).
Substituting in the second inequality of (3), we will have
(f(x) − f(x − δ))/δ ≤ (f(x) − f(y))/(x − y).
Similarly, substituting a = x, c = x + δ and b = y ∈ (x, x + δ) in the first inequality
of (3), we will have (f(x) − f(y))/(x − y) ≤ (f(x + δ) − f(x))/δ. Thus, for any
y ∈ (x − δ, x + δ) we will have |f(x) − f(y)|/|x − y| ≤ K for some K ∈ R⁺ as
(f(x) − f(y))/(x − y) is bounded above and below in this domain. So, fix ε > 0 and
choose δ = ε/K. Then, |x − y| < δ ⇒ |f(x) − f(y)| ≤ ε, proving continuity.

Remark 2.1. In fact, we can show that convex functions are absolutely continuous and
uniformly continuous for any compact interval within int(I). This will be shown later
when we define the Lipschitz condition.
Proposition 2.3. [1, p.25] If f : I → R is a convex function, then the left derivative
f′₋ and the right derivative f′₊ of f exist on int(I) and f′₋(x) ≤ f′₊(x) for all x ∈ int(I).
Moreover, these left and right derivatives are increasing on int(I).
Proof. Using (3), we can determine the left and right derivatives of the function f at
a point x = b for any b ∈ int(I). Looking at the inequality
(f(b) − f(a))/(b − a) ≤ (f(c) − f(b))/(c − b), consider
the limit as a approaches b. This limit is bounded above by (f(c) − f(b))/(c − b), so the
limit exists and hence f′₋(b) ≤ (f(c) − f(b))/(c − b). Similarly, we can show that the right
derivative of f at b exists and f′₊(b) ≥ (f(b) − f(a))/(b − a). This yields that
f′₋(b) ≤ f′₊(b) for any b ∈ int(I).
Furthermore, if a < b < c ∈ int(I), we have
f′₊(a) ≤ (f(b) − f(a))/(b − a) ≤ (f(c) − f(b))/(c − b) ≤ f′₋(c).
Hence, for any x < y ∈ int(I), we have f′₋(x) ≤ f′₊(x) ≤ f′₋(y) ≤ f′₊(y). Thus, f′₋ and
f′₊ are increasing on int(I).

Note that convexity does not imply differentiability: at any point in int(I) we have
only proved that the left and right derivatives exist (Proposition 2.3), but they are
not necessarily equal.

Example 2.1. Consider the convex function f (x) = |x| on R. At all points in R, the left
and right derivatives of f exist. However, note that the left and right derivatives are not
equal at x = 0 but they are equal everywhere else, thus f is differentiable everywhere
except at the origin.

Figure 3: Graph of f(x) = |x|

It can be proved that a convex function f fails to be differentiable at only countably
many points in the interval I. In order to show this, we need one more
important property of a convex function and a theorem by Rademacher. But first, we
define the Lipschitz condition on a function and formulate Rademacher's theorem.

Definition 2.3 (Lipschitz Condition). [3, p.1] A function f : X → Y where X, Y ⊂
R is called Lipschitz if there exists a constant K ∈ R⁺ such that for all x, y ∈ X,
|f(x) − f(y)| ≤ K|x − y|.

Remark 2.2. Note that from this definition, it can be deduced that all Lipschitz func-
tions are absolutely continuous and uniformly continuous.

Theorem 2.1 (Rademacher's Theorem). [3, p.18] If a function f : Rᵐ → Rⁿ is Lipschitz,
then f is differentiable almost everywhere, i.e. the set of points at which f is not
differentiable has Lebesgue measure zero.

The proof of this theorem is out of the scope of this project. However, keen readers
may refer to Muñoz’s paper on Rademacher’s Theorem for the proof of this theorem
[4]. Now, using Proposition 2.3, we can prove that any convex function f : I → R is
Lipschitz on any compact subinterval contained in int(I).

Proposition 2.4. [1, p.27] If f : I → R is a convex function, then f is Lipschitz on any
compact subinterval contained in int(I).

Proof. Let [a, b] be an arbitrary compact subinterval contained in int(I). Then, choose
arbitrary x, y ∈ [a, b] such that x < y. The left and right derivatives of f at these
four points exist as all of them are contained in int(I). By the monotonicity of the left
and right derivatives of a convex function, we can construct the following inequality:
f′₊(a) ≤ f′₊(x) ≤ (f(x) − f(y))/(x − y) ≤ f′₋(y) ≤ f′₋(b). Hence, for any x, y ∈ [a, b] we can
bound |(f(x) − f(y))/(x − y)| by the constant K = max{|f′₊(a)|, |f′₋(b)|}. Thus,
|f(x) − f(y)| ≤ K|x − y| for any x, y ∈ [a, b].
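The Lipschitz constant K = max{|f′₊(a)|, |f′₋(b)|} from the proof can be checked numerically, approximating the one-sided derivatives by difference quotients. A Python sketch (an illustration with our own names, using f(x) = x² on [a, b] = [−1, 2]):

```python
import random

def f(x):
    # The convex function f(x) = x^2 on I = R.
    return x * x

def one_sided_slope(f, x, h):
    # Difference quotient (f(x + h) - f(x)) / h; h < 0 gives the
    # left-hand quotient, h > 0 the right-hand one.
    return (f(x + h) - f(x)) / h

# Compact subinterval [a, b] inside int(I).
a, b = -1.0, 2.0
# K = max{|f'_+(a)|, |f'_-(b)|}, approximated by small one-sided steps.
K = max(abs(one_sided_slope(f, a, 1e-7)),
        abs(one_sided_slope(f, b, -1e-7)))

random.seed(1)
for _ in range(1000):
    x, y = random.uniform(a, b), random.uniform(a, b)
    # The Lipschitz bound of Proposition 2.4.
    assert abs(f(x) - f(y)) <= K * abs(x - y) + 1e-9
```

Here K comes out close to 4, matching |f′₋(2)| = 4 for f(x) = x².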

Proposition 2.5. A convex function f : I → R is differentiable almost everywhere.

Proof. This proposition is an immediate consequence of Rademacher's theorem.
Note that if f : I → R is convex, then, by Proposition 2.4, f is Lipschitz on any compact
subinterval contained in int(I). Thus, using Rademacher's theorem and the fact that
int(I) can be written as a countable union of compact sets, we conclude
that f is differentiable almost everywhere on I.

So, we have shown that any convex function f is differentiable almost everywhere.
What about the points where f is not differentiable? In Proposition 2.3 we have shown
that for any convex function f : I → R and all points x ∈ int(I), we have f′₋(x) ≤ f′₊(x).
At the points where f is not differentiable, the left and right derivatives exist but they
are not equal to each other. Looking back at Example 2.1 of the function f(x) = |x|,
we can still define a derivative-like object at x = 0 as an interval. We will look into this
further later. But for now, we consider the case where f is differentiable.

Theorem 2.2. [2, p.11] If a convex function f : I → R is differentiable everywhere in
int(I), then the derivative of f is an increasing function on int(I). Furthermore, if f is
a twice differentiable function, then f is convex if and only if f″(x) is non-negative.

Proof. If f is differentiable everywhere, i.e. for all points x ∈ int(I) we have f′₋(x) =
f′₊(x), note that by Proposition 2.3 the left and right derivatives
are increasing on int(I). So, for a differentiable convex function f, the derivative f′(x)
is an increasing function.
Moreover, if f is a twice differentiable convex function, it is easy to see that f″(x) ≥ 0
on int(I) as f′(x) is an increasing function.
Conversely, suppose that f is twice differentiable such that f″(x) ≥ 0 for all x ∈
int(I). Then, using Taylor's theorem [2, p.70], for any y ∈ int(I), f can be expressed
as f(x) = f(y) + f′(y)(x − y) + ½f″(ξ)(x − y)² for some ξ between x and y. Hence,
we have that f(x) − f(y) ≥ f′(y)(x − y). Now, consider x = a < y = b < c
such that b = (1 − t)a + tc for t ∈ [0, 1]. Substituting in the inequality, we will have
f(a) − f(b) ≥ tf′(b)(a − c). Similarly, if we let a < y = b < x = c such that b = (1 − t)a + tc
for t ∈ [0, 1], we would have f(c) − f(b) ≥ (1 − t)f′(b)(c − a). Then, by multiplying
the first inequality and the second inequality by (1 − t) and t respectively, and adding
them together, we would get f(b) ≤ (1 − t)f(a) + tf(c) where b = (1 − t)a + tc. Hence,
f is convex over int(I).

Now we will consider the case where the function f is not differentiable everywhere
by looking at the concept of subderivatives.

3 Subderivatives
Subderivatives are a generalisation of derivatives to general functions. Before we define
what a subderivative is, we look at the crucial definition of support lines, which will allow
us to construct subderivatives.

Definition 3.1 (Support Line). [2, p.12] A support line for a convex function f : I → R
at a point b ∈ I is a line p(x) through (b, f(b)) that is always less than or equal to
f(x), i.e. p(x) = f(b) + m(x − b) such that p(x) ≤ f(x) for all x ∈ I.

One thing to note is that, in the definition above, at the boundary of the interval
I the support line of a convex function f may not exist. As an example, consider the
convex function f(x) = ∛x for x ≤ 0. At x = 0, the support line does not exist as the
function has a vertical tangent at this point. However, in the interior of I, support lines
for the function f must exist, as we will see later.
Also, note that for a differentiable function f in the interval I, a support line at the
point b ∈ I is just the tangent line that goes through the point b with gradient f 0 (b).
Since the derivative is unique at any point of a differentiable function, the support line at
the point b is unique. However, this is not the case for a general convex function: we shall
see later that support lines at a point are not necessarily unique.

Theorem 3.1. [2, p.12] If f : I → R is convex then there is at least one support line
for f at each b ∈ int(I).

Proof. If f is convex, then for an arbitrary point b ∈ int(I), first consider all x ∈ I
such that x > b. From Proposition 2.3, we have the inequality
f′₊(b) ≤ (f(x) − f(b))/(x − b). So, if
we choose any m ≤ f′₊(b), we will get f(b) + m(x − b) ≤ f(x) for all x > b. Similarly,
for all x ∈ I such that x < b, we have the inequality (f(b) − f(x))/(b − x) ≤ f′₋(b). If we
choose any m ≥ f′₋(b), we will get f(b) + m(x − b) ≤ f(x). Therefore, if we choose any
m ∈ [f′₋(b), f′₊(b)], we will get p(x) = f(b) + m(x − b) ≤ f(x) for all x ∈ I. Note that
by Proposition 2.3, f′₋(b) ≤ f′₊(b) for all b ∈ int(I). So, the interval [f′₋(b), f′₊(b)] is
non-empty. Therefore, there is at least one value of m that we can choose, thus there is
at least one support line through (b, f(b)) for any b ∈ int(I).

Remark 3.1. Note that Theorem 3.1 coincides with the fact that if f is differentiable at
b ∈ int(I), then the support line is unique. Indeed, if f is differentiable at b, then f′₋(b) =
f′₊(b) = f′(b), so we have only one choice for m, namely f′(b), and f(b) + f′(b)(x − b) is
the tangent line for f at b.
However, if a function is not differentiable at b, there is an interval from which we can
choose m, namely [f′₋(b), f′₊(b)]. So, the value of m is not unique at b.

We return to our example of f (x) = |x| in Example 2.1. Here, f is not differentiable
at the origin as the left and right derivatives are not equal. The left derivative at x = 0 is
−1 and the right derivative is 1. So, any line of the form p(x) = mx such that m ∈ [−1, 1]
is a support line of f at the origin. Geometrically, we can see that these lines will always
lie below or on the graph of f for any x ∈ R.

Figure 4: Graph of f(x) = |x| with a support line p₀(x) at x = 0

Now, we can define subderivatives.

Definition 3.2 (Subderivatives). [2, p.32] The subderivative δf(b) of a convex function
f : I → R at a point b ∈ int(I) is the set of all possible slopes of support lines at b, i.e.
δf(b) = [f′₋(b), f′₊(b)].

Hence, from this, we have the following corollary:

Corollary 3.1. The subderivative of a convex function f : I → R at any point b ∈ int(I)
is a non-empty set.

This is immediate from Theorem 3.1. Also, note that the subderivative of the function
f may not exist at the boundary of I based on an earlier observation.

Example 3.1. Consider the convex function f (x) = |x − 2| defined on the interval (0, 4].
If we were to draw the graph of δf (x), we would have the following graph:

Figure 5: Graph of f(x) = |x − 2| and its subderivative δf(x)

The graph of δf(x) is multivalued at x = 2 because there the graph of f(x)
is not differentiable; the subderivative is given by the interval [−1, 1] as the left
and right derivatives of f(x) at x = 2 are −1 and 1 respectively. We can also clearly
see from the graph of f(x) that its support lines at x = 2 can have gradients ranging
from −1 to 1: if we draw any line with such a gradient through the point (2, 0), the line
will stay below the graph of f(x) for all x ∈ (0, 4].
Another interesting point to note is that at the boundary of the domain, i.e. x = 4,
the subderivative is given by the interval [1, ∞). This is because at this point, we can
draw lines of any gradient greater than or equal to 1 and the lines will stay below or on
the graph of f(x) for all x ∈ (0, 4]. In other words, a subderivative of f at x = 4 is any
number greater than or equal to 1. At other values of x, the graph of δf(x) is single-valued
as f(x) is differentiable everywhere else.
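The interval [−1, 1] at the kink, and the support-line property of each slope in it, can be confirmed numerically with one-sided difference quotients. A Python sketch (the helper names are illustrative):

```python
def f(x):
    return abs(x - 2)

def left_derivative(f, x, h=1e-8):
    return (f(x) - f(x - h)) / h

def right_derivative(f, x, h=1e-8):
    return (f(x + h) - f(x)) / h

# At the kink x = 2 the one-sided derivatives disagree, so the
# subderivative there is the whole interval [-1, 1].
lo, hi = left_derivative(f, 2), right_derivative(f, 2)
assert abs(lo + 1) < 1e-6 and abs(hi - 1) < 1e-6

# Every m in [lo, hi] gives a support line p(x) = f(2) + m*(x - 2)
# lying below or on the graph throughout (0, 4].
xs = [0.01 * k for k in range(1, 401)]
for m in (-1.0, -0.5, 0.0, 0.5, 1.0):
    assert all(f(2) + m * (x - 2) <= f(x) + 1e-12 for x in xs)
```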

There is also another corollary that we can make from the definition of subderivatives
which will be useful later on.

Corollary 3.2. If f : I → R is a convex function, then for all point b ∈ int(I), for any
m ∈ δf (b), for all x ∈ I, we have:

f (x) − f (b) ≥ m(x − b) (4)

Proof. We have that for any m ∈ δf(b), the line p_m(x) = f(b) + m(x − b) is a support
line for the function f. Also, note that a support line always lies below or on the graph of
f(x), so we have f(x) ≥ p_m(x) = f(b) + m(x − b) for all x ∈ I, hence the result.

Now, we shall look at the first application of convex functions.

Part II
Legendre Transform
4 Legendre Transform
We begin by defining what the Legendre transform is and laying out the conditions that
enable us to carry out the transformation. Suppose that we have a twice differentiable
convex function f over an open interval I ⊂ R such that f″(x) > 0 for all x ∈ I. The
graph of this function is the set G_f = {(x, f(x)) : x ∈ I}. The idea is to find the dual of
this space, which is some space {(m, g(m)) : m ∈ J} for some function g on a domain
J ⊂ R. So, how do we find this dual for any given twice differentiable convex function
f?
[6, p.1] We have supposed that f(x) is a twice differentiable convex function such
that f″(x) > 0 in I. Note that since f″(x) is strictly positive, we have that f′(x) is a
strictly increasing function. So, for any two x₁, x₂ ∈ I such that x₁ < x₂ we would have
f′(x₁) < f′(x₂). By defining m(x) := f′(x), we can see that the function m : x ↦ f′(x)
is an injective function.
By restricting the codomain to M = {m : m = f′(x), x ∈ I}, we can invert
the function m(x) to get x(m) as the function m : I → M is bijective.
Note that by Remark 3.1, there is a unique support line through each differentiable
point on the curve. Hence, to each of these points there corresponds a unique
y-intercept of this support line, which we call c(x). Therefore, the support line at
each x is given by the equation f(x) = m(x) · x + c(x). By inverting the dependency of
x and m and defining g(m) as the negative of the y-intercept, we get the equation
g(m) = x(m) · m − f(x(m)).
Now, since our assumed f is differentiable everywhere on I, for each point x ∈ I
there exists a unique pair (m, g(m)) corresponding to it. So, the set {(m, g(m)) : m ∈ J}
for some J ⊂ R is a dual for the set {(x, f(x)) : x ∈ I}. We write the Legendre transform
of the function f(x) as f*(m) := g(m).

In summary, we can find the Legendre transform of a twice differentiable convex
function by following the steps below:

1. Find the gradient of the function, m(x) := f′(x).

2. Find the tangent line at each x, given by f(x) = m(x) · x + c(x), where c(x) is the
y-intercept.

3. Invert the dependency of x and m and define the Legendre transform of f as
f*(m) := −c(x(m)).

4. Substitute in the equation of the tangent line to get f*(m) = x(m) · m − f(x(m)).
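The steps above can be carried out numerically: since f′ is strictly increasing, inverting m = f′(x) amounts to inverting a monotone function, which bisection handles. A Python sketch (assuming f and f′ are supplied in closed form; for f(x) = x² the transform should be f*(m) = m²/4):

```python
def legendre(f, df, m, lo=-50.0, hi=50.0, tol=1e-10):
    # Steps 1-3: m = f'(x) is strictly increasing (f'' > 0), so we can
    # invert it by bisection to recover x(m) on [lo, hi].
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if df(mid) < m:
            lo = mid
        else:
            hi = mid
    x = 0.5 * (lo + hi)          # x(m)
    # Step 4: f*(m) = x(m) * m - f(x(m)).
    return x * m - f(x)

# For f(x) = x^2: f'(x) = 2x, so x(m) = m/2 and f*(m) = m^2 / 4.
for m in (-3.0, -1.0, 0.0, 2.0, 5.0):
    assert abs(legendre(lambda x: x * x, lambda x: 2 * x, m) - m * m / 4) < 1e-6
```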

We will now have a look at a simple yet classical example of this transformation.
Example 4.1. Consider the function f(x) = e^x for x ∈ R. Note that f(x) is a convex
function and f″(x) = e^x > 0 for all x ∈ R. So we can find the Legendre transform of
this function. How do we do it?
First, we find the gradient of the function, m(x). This is given by m(x) = f′(x) = e^x.
We can invert this function to get x(m) = ln m where m > 0. Now we find the equation
of the tangent line at each x. The tangent lines are given by the equation y = mx + c.
Rearranging this, we will get c = y − mx. Substituting m = e^x and y = e^x, we will
have c = e^x(1 − x). Finally, substituting x(m) = ln m and c = −g(m), we will have
g(m) = m ln m − m. Thus, the Legendre transform of the exponential function is given
by f*(m) = m ln m − m for m > 0.
Now, we shall have a look at a different example.
Example 4.2. Consider the twice differentiable convex function f(x) = x² − x + 1 for
x ∈ (0, 2) = I. We can carry out the Legendre transform as f(x) is twice differentiable
and f″(x) = 2 > 0 for all x ∈ I.

Figure 6: Tangent lines p, q and r of the curve f(x) = x² − x + 1 for x ∈ (0, 2)

Firstly, we find the gradient of the function m(x) at each point by differentiating the
function. So, m(x) = f′(x) = 2x − 1 for x ∈ (0, 2). We can invert this into x(m) = (m + 1)/2
where m ∈ (−1, 3).
Substituting m = 2x − 1 and y = f(x) = x² − x + 1 into the equation of the tangent
c = y − mx, we will have c = (x² − x + 1) − x(2x − 1) = −x² + 1.
Finally, we complete our transform by substituting x = (m + 1)/2 and c = −g(m).
Therefore, the final equation is g(m) = (m² + 2m − 3)/4.
Thus, f*(m) = (m² + 2m − 3)/4 where m ∈ (−1, 3).

Note that in the example above, if we take the second derivative of f*(m) with
respect to m, we would get d²/dm² (f*(m)) = 1/2 > 0. Hence, this function f*(m) is
convex. So, we have a question here: is it necessarily true that the Legendre transform of
any twice differentiable function is convex?
The answer is yes and the proof is given in the theorem below.
Theorem 4.1. The Legendre transform f ∗ (m) of the function f (x) is a convex function.
Proof. Recall that the Legendre transform of a twice differentiable function f is given by
f*(m) = x(m) · m − f(x(m)). Now we differentiate this with respect to m:
d/dm f*(m) = (dx/dm) · m + x(m) − f′(x) · (dx/dm). But recall that f′(x) = m by definition.
Therefore, we will have d/dm f*(m) = x(m). Note that in order for us to take the Legendre
transform of the function f, we must have the condition that f″(x) = m′(x) > 0. Hence, the
function m(x) is strictly increasing and thus its inverse x(m) is strictly increasing as
well. Therefore, we have d²/dm² f*(m) = x′(m) > 0. Then, by Theorem 2.2, we conclude
that f*(m) is a convex function.
Note that from Theorem 4.1, we have that f*(m) is a convex function and
d²/dm² f*(m) > 0 for m ∈ J. Note that these are the conditions that one would require to
make a Legendre transform. So, what would happen if we take a Legendre transform of
f*(m)?

Figure 7: Tangent lines s, t and u of the curve f*(m) = (m² + 2m − 3)/4 for m ∈ (−1, 3)

We look back at our Example 4.2. The Legendre transform of the function f(x) =
x² − x + 1 for x ∈ (0, 2) is given by f*(m) = (m² + 2m − 3)/4 where m ∈ (−1, 3). Now, we
take the Legendre transform of f*(m). First, we differentiate f*(m) to get the gradient
n(m) of this function: n(m) := d/dm f*(m) = (m + 1)/2 with n ∈ (0, 2). Thus, the tangent
line of the function f*(m) is given by f*(m) = n(m) · m + c, i.e.
(m² + 2m − 3)/4 = ((m + 1)/2) · m + c. Noting
that we can invert the dependency of n and m, we have m = 2n − 1. Substituting this
into the tangent line and putting f**(n) = −c, we finally have f**(n) = n² − n + 1
for n ∈ (0, 2).
Notice that the second Legendre transform f**(n) = n² − n + 1 is identical to the
original function f(x) = x² − x + 1. This is an interesting observation.
Here, we have yet another question that can be asked: is it necessarily true that if
we take the Legendre transform of a function twice, we would get the original function
back?
The answer is yes.
Theorem 4.2. [7, p.2] The Legendre transform is self-inverse i.e. f ∗∗ (x) = f (x).
Proof. Suppose that f*(m) is the Legendre transform of f(x). Then, from Theorem 4.1,
we have d/dm f*(m) = x(m). Since f*(m) is a convex function and its second derivative
is positive, the function x : m ↦ d/dm f*(m) is injective. Now, using the method outlined
earlier in this section, we construct the Legendre transform of f*(m). We find the
equation of the tangent lines of f*(m) for any m ∈ J, given by f*(m) = x(m) · m + c. We
define f**(x) = −c, therefore we arrive at the equation f**(x) = x(m) · m − f*(m).
Recall that we have defined f*(m) as f*(m) := x(m) · m − f(x(m)). Substituting this into
the previous equation, we get f**(x) = x(m) · m − x(m) · m + f(x(m)) = f(x).
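Theorem 4.2 can be illustrated numerically on Example 4.2, using the fact from Theorem 4.1 that (f*)′(m) = x(m) = (m + 1)/2. A Python sketch (the bisection helper is our own construction, not from the cited sources):

```python
def legendre_on(f, df, m, lo, hi, tol=1e-10):
    # Invert m = f'(x) on (lo, hi) by bisection (f' is strictly
    # increasing), then return f*(m) = x(m) * m - f(x(m)).
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if df(mid) < m:
            lo = mid
        else:
            hi = mid
    x = 0.5 * (lo + hi)
    return x * m - f(x)

f = lambda x: x * x - x + 1          # Example 4.2, convex on (0, 2)
df = lambda x: 2 * x - 1

fstar = lambda m: legendre_on(f, df, m, 0.0, 2.0)    # m in (-1, 3)
dfstar = lambda m: 0.5 * (m + 1)     # (f*)'(m) = x(m), by Theorem 4.1
fstarstar = lambda n: legendre_on(fstar, dfstar, n, -1.0, 3.0)

# f* agrees with the closed form (m^2 + 2m - 3)/4 ...
assert abs(fstar(0.0) - (-0.75)) < 1e-6
# ... and transforming twice recovers f: f**(n) = n^2 - n + 1.
for n in (0.5, 1.0, 1.5):
    assert abs(fstarstar(n) - f(n)) < 1e-5
```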

5 Legendre-Fenchel Transform
Note that to be able to take the Legendre transform of a function f, we require f
to be a twice differentiable convex function. These are restrictive conditions. Therefore,
we need a more general construction, which is called the Legendre-Fenchel
transform.
[2, p.30] With this transform, we can take the transformation of any function f; it is
given by the formula:

f⋆(m) = sup_{x ∈ R} (x · m − f(x)) (5)

The Legendre-Fenchel transform is only defined for the values of m where the supre-
mum exists. Here, we do not require any special conditions on the function f.
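Because no differentiability is required, the supremum in (5) can be approximated directly on a grid. A Python sketch (grid bounds and tolerances are illustrative), checked against Example 4.1, where f(x) = e^x has transform m ln m − m:

```python
import math

def fenchel(f, m, xs):
    # f*(m) = sup over x of (x*m - f(x)), approximated on a finite grid.
    return max(x * m - f(x) for x in xs)

xs = [-10 + 0.001 * k for k in range(20001)]  # grid on [-10, 10]

# For f(x) = e^x the transform should be m*ln(m) - m (Example 4.1).
for m in (0.5, 1.0, 2.0, 5.0):
    assert abs(fenchel(math.exp, m, xs) - (m * math.log(m) - m)) < 1e-3
```

On a bounded grid the approximation is only reliable for those m whose true supremum is attained inside the grid.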
We will show later that the Legendre transform is a special case of this transform. But
first, we look at the properties of this transform. Firstly, like the Legendre transform,
f⋆(m) is a convex function.
Theorem 5.1. The Legendre-Fenchel transform f ? (m) of a function f is a convex
function.
Note that we do not need any conditions on the function f to find its Legendre-Fenchel
transform, so, in general, f may not be twice differentiable. Furthermore, it may not
even be differentiable. So, we cannot prove this theorem the way we proved Theorem 4.1.
However, we can go back to the very first definition of a convex function and prove
Theorem 5.1 from first principles.

Proof. The formula for the Legendre-Fenchel transform is given by:

f⋆(m) = sup_{x ∈ R} (x · m − f(x))

For any m₁ < m₂ ∈ J, any m ∈ [m₁, m₂] is given by m = (1 − t)m₁ + tm₂ for some
t ∈ [0, 1]. So, for this m, we have:

f⋆(m) = f⋆((1 − t)m₁ + tm₂) = sup_{x ∈ R} (x · ((1 − t)m₁ + tm₂) − f(x))
      = sup_{x ∈ R} ((1 − t)xm₁ + txm₂ − f(x))
      = sup_{x ∈ R} ((1 − t)(xm₁ − f(x)) + t(xm₂ − f(x)))
      ≤ sup_{x ∈ R} ((1 − t)(xm₁ − f(x))) + sup_{x ∈ R} (t(xm₂ − f(x)))
      = (1 − t) sup_{x ∈ R} (xm₁ − f(x)) + t sup_{x ∈ R} (xm₂ − f(x))
      = (1 − t)f⋆(m₁) + tf⋆(m₂)

Hence, we have f⋆((1 − t)m₁ + tm₂) ≤ (1 − t)f⋆(m₁) + tf⋆(m₂) for t ∈ [0, 1]. Thus,
f⋆(m) is a convex function.

Apart from the property above, the Legendre-Fenchel transform has an extra property.
First, we define what it means for a convex function to be closed.

Definition 5.1 (Closed convex function). A convex function f : I → R is said to be
closed if the set L_c = {x ∈ I : f(x) ≤ c} is a closed subset of R for all c ∈ R.

So, with this definition, we formulate the following theorem.

Theorem 5.2. [2, p.30] The Legendre-Fenchel transform f⋆(m) of a function f is a
closed function.

Proof. Fix c ∈ R. Suppose we have an arbitrary sequence {m_n} in L_c = {m ∈ J :
f⋆(m) ≤ c} that converges to a point m̄ as n → ∞. By the Legendre-Fenchel formula,
for every x we have x · m_n − f(x) ≤ f⋆(m_n) ≤ c. Taking the limit as
n → ∞, we have x · m̄ − f(x) ≤ c for every x, and taking the supremum over x gives
f⋆(m̄) ≤ c. Therefore, m̄ ∈ L_c as well. We conclude that the set L_c is closed as it
contains the limit of every convergent sequence in it.

From the previous section, we note that if we carry out a Legendre transformation
on a convex function twice, we would get the original function. Does this hold true for
the Legendre-Fenchel transform?

Note that we do not require convexity nor twice differentiability to carry out the
Legendre-Fenchel transform. This means that we can carry out the transformation on
any general function at all. However, if we carry out the Legendre-Fenchel transformation
on a general function twice, we do not necessarily get the original function back.
As an example, consider a non-convex function f(x) and take its Legendre-Fenchel
transform f⋆(m). From Theorem 5.1, f⋆(m) is a convex function. Carrying out the
transformation again, we will have f⋆⋆(x), which is again a convex function. But note
that our original function f(x) is not a convex function, so it cannot be equal to f⋆⋆(x).
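This failure can be observed numerically: the double transform returns a convex function lying below f rather than f itself. A Python sketch with the non-convex double well f(x) = x⁴ − x², for which f⋆⋆(0) equals min f = −1/4 while f(0) = 0 (grids and tolerances here are illustrative):

```python
def conjugate(f, grid):
    # Legendre-Fenchel transform with the supremum taken over a grid.
    return lambda m: max(x * m - f(x) for x in grid)

def f(x):
    # A non-convex double-well function.
    return x ** 4 - x ** 2

xs = [-2 + 0.01 * k for k in range(401)]   # x grid on [-2, 2]
ms = [-3 + 0.01 * k for k in range(601)]   # m grid on [-3, 3]

fstar = conjugate(f, xs)
fstarstar = conjugate(fstar, ms)

# The double transform gives the convex function below f: at x = 0 it
# equals min f = -1/4, not f(0) = 0, so f** is not equal to f here.
assert f(0.0) == 0.0
assert abs(fstarstar(0.0) - (-0.25)) < 1e-2
```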
But what if we take the Legendre-Fenchel transform of a convex and closed function
(not necessarily differentiable) twice? Would we recover the original function? The
surprising answer is yes.
Theorem 5.3. If f(x) is a convex and closed function, then the Legendre-Fenchel trans-
form of f(x) is self-inverse, i.e.

f⋆⋆(x) = sup_{m ∈ R} (x · m − f⋆(m)) = f(x)

Proof. Note that we can show f⋆⋆(x) ≤ f(x) easily:

f⋆(m) = sup_{x ∈ R} (x · m − f(x))
⇒ f⋆(m) ≥ x · m − f(x)
⇒ f(x) ≥ x · m − f⋆(m)
⇒ f(x) ≥ sup_{m ∈ R} (x · m − f⋆(m))
⇒ f⋆⋆(x) ≤ f(x)

To show the reverse inequality, we need to use the convexity of f(x). Note that from
Corollary 3.1, the subderivative of the function f at any point a ∈ int(I)
is non-empty, i.e. the set δf(a) is non-empty. Since this set is non-empty, we can choose
any p_a ∈ δf(a). By Corollary 3.2, we have

p_a(x − a) ≤ f(x) − f(a)
⇒ x · p_a − f(x) ≤ a · p_a − f(a)   for all x ∈ I   (†)

Also, by the definition of the Legendre-Fenchel transformation, we have:

f⋆(m) = sup_{x ∈ R} (x · m − f(x))

By choosing m = p_a and using the inequality in (†), we will get f⋆(p_a) ≤ a · p_a − f(a).
Rearranging this, we get:

f(a) ≤ a · p_a − f⋆(p_a)
     ≤ sup_{m ∈ R} (a · m − f⋆(m)) = f⋆⋆(a)
⇒ f(x) ≤ f⋆⋆(x)

Hence, by the two inequalities, we have f(x) = f⋆⋆(x).

Remark 5.1. Note that if f is a twice differentiable convex function such that f 00 (x) > 0,
the Legendre-Fenchel transform of f is just the same as the Legendre transform of f .
The supremum of m · x − f (x) can be obtained by taking the partial derivative of this
quantity with respect to x and equating it with 0, to get m = f 0 (x) (the second partial
derivative is negative, hence a maximum). Thus, this quantity is maximised at x such
that f 0 (x) = m. We can invert the dependency of m and x as f 0 (x) is a strictly increasing
function and this leads us back to the steps of carrying out the Legendre transform.
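As a quick numerical sanity check of Theorem 5.3 (a sketch of our own, not part of the original text: the grid sizes and the test function f (x) = x² are arbitrary choices), one can approximate the Legendre-Fenchel transform by a brute-force supremum over a grid and verify that conjugating twice recovers the original convex function:

```python
import numpy as np

def fenchel(vals, x_grid, m_grid):
    """Discrete Legendre-Fenchel transform: f*(m) = max_x (m*x - f(x)) over a grid."""
    return np.array([np.max(m * x_grid - vals) for m in m_grid])

x = np.linspace(-2.0, 2.0, 1601)   # domain grid for f
m = np.linspace(-6.0, 6.0, 2401)   # slope grid, wide enough to cover f'([-2, 2])

f = x ** 2                          # a smooth, strictly convex test function
f_star = fenchel(f, x, m)           # close to m^2/4 for |m| <= 4
f_bi = fenchel(f_star, m, x)        # the biconjugate f**, back on the x grid

err = np.max(np.abs(f_bi - f))
print(err)                          # only a small discretisation error remains
```

The near-zero error illustrates f ?? = f for a convex closed function, while repeating the experiment with a non-convex f would instead return its convex envelope.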

Example 5.1. For λ > 0 and x ≥ 0, consider the function f (x) := (λ − x)† defined as:

      f (x) := (λ − x)† =  λ − x   if 0 ≤ x < λ,
                           0       if x ≥ λ

This is clearly a convex function, as the epigraph of this function is a convex set. Also,
this function is a closed function. The function is differentiable everywhere except at
x = λ (it is right differentiable at the endpoint x = 0). Since it is not twice differentiable
everywhere, we cannot use the Legendre transform on this function. However, we can use
the Legendre-Fenchel transform.
The derivative at 0 < x < λ is −1 and at x > λ it is 0. The subderivative at x = 0
is the interval (−∞, −1] and at x = λ it is the interval [−1, 0]. To find the Legendre-Fenchel
transform of this function, we divide the problem into cases for different values of m and
determine the supremum of m · x − f (x). We define this quantity as k(x) := m · x − f (x).

Figure 8: Graph of f (x) = (λ − x)† and its Legendre-Fenchel transform f ? (m)

• m ≤ −1
For these values of m, k(x) is given by a piecewise function:

      k(x) =  (m + 1)x − λ   if 0 ≤ x < λ,
              mx             if x ≥ λ

Thus, by simple calculation, we find that the supremum of k(x) is −λ, attained at x = 0.

• −1 < m < 0
For these values of m, k(x) is also given as the piecewise function as in the case
above. Note that k(λ) = mλ. However, for x > λ, we have k(x) = mx < mλ.
Also, for x < λ, it can be easily shown that k(x) = (m + 1)x − λ < mλ. Hence,
the supremum of k(x) for −1 < m < 0 is mλ.

• m=0
For m = 0, k(x) = −f (x), hence the supremum is 0.

• m>0
For these values of m, the quantity k(x) is unbounded as k(x) → ∞ as x → ∞.
So, the Legendre-Fenchel transform is not defined for these values of m.

So, putting everything together, we get the Legendre-Fenchel transform of the function
(λ − x)†, which is defined only for m ≤ 0:

      f ? (m) =  −λ   if m ≤ −1,
                 mλ   if −1 < m ≤ 0

Now, by Theorem 5.3, since f (x) above is a convex function, if we take the Legendre-
Fenchel transform once more, we would recover the original function. Here we try and
do the transformation again.
Note that the function f ? (m) is differentiable everywhere except at m = −1 (it is left-
differentiable at the endpoint m = 0). The derivative of this function is 0 at m < −1 and
λ at −1 < m < 0. The subderivative at m = 0 is the interval [λ, ∞) and at m = −1 it is
the interval [0, λ]. Again, we split the real line into sections and work on them separately.
Define h(m) := x · m − f ? (m).

• x<0
For these values of x, h(m) is unbounded as h(m) → ∞ as m → −∞. So, the
Legendre-Fenchel transform is undefined for these values of x.

• x=0
For x = 0, h(m) = −f ? (m), hence the supremum is λ.

• 0<x<λ
For these values of x, h(m) is given by a piecewise function:

      h(m) =  mx + λ     if m ≤ −1,
              m(x − λ)   if −1 < m ≤ 0

Note that when m = −1, we have h(m) = λ − x. Since λ > 0 and 0 < x < λ,
then m < −1 ⇒ h(m) = mx + λ < −x + λ. Also, for −1 < m ≤ 0, it can be
easily shown that h(m) = m(x − λ) < λ − x. Hence, the supremum of h(m) for
0 < x < λ is λ − x.

• x≥λ
For these values of x, h(m) is also given by the piecewise function as in the case
above. For m ≤ −1, by a simple observation, we have h(m) = mx + λ ≤ 0. Also,
for −1 < m ≤ 0, we have x − λ ≥ 0 ⇒ h(m) = m(x − λ) ≤ 0. In fact, h(m) attains
the value 0 when m = 0 regardless of the value of x. Thus, the supremum of h(m)
here is 0.

Putting everything together, the Legendre-Fenchel transform of the function f ? (m),
which is defined only for x ≥ 0, is given by:

      f ?? (x) =  λ − x   if 0 ≤ x < λ,
                  0       if x ≥ λ

which is equal to f (x).
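The case analysis above is easy to confirm numerically. The sketch below (the grid, the value λ = 2, and the sample slopes are our own choices, not from the text) compares a brute-force supremum of m · x − f (x) with the closed form just derived:

```python
import numpy as np

lam = 2.0
x = np.linspace(0.0, 10.0, 4001)        # a truncated grid standing in for x >= 0
f = np.maximum(lam - x, 0.0)            # f(x) = (lam - x)^+

def f_star(m):
    """Brute-force sup of m*x - f(x) over the grid (finite only for m <= 0)."""
    return np.max(m * x - f)

# Closed form found above: f*(m) = -lam for m <= -1 and m*lam for -1 < m <= 0
for m_val in (-3.0, -1.0, -0.5, 0.0):
    exact = -lam if m_val <= -1 else m_val * lam
    print(m_val, f_star(m_val), exact)
```

For any m > 0 the maximand grows without bound on the grid's tail, matching the case where the transform is undefined.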

6 Applications of Legendre Transform


We have looked at the Legendre transform and its more general version, the Legendre-Fenchel
transform. These transformations enable us to convert a convex function from one system
of coordinates to another and give us the freedom of working in either coordinate system,
whichever is easier to work with, without losing any of the original information. But
how can these transformations be useful tools in mathematics?
One interesting corollary of the Legendre transform is Young’s inequality.

6.1 Young’s Inequality


Corollary 6.1 (Young’s Inequality). [7, p.2] If f (x) is a convex function and f ? (m) is
its Legendre-Fenchel transform, then we have the following inequality:

x · m ≤ f (x) + f ? (m) (6)

Proof. This is immediate from the definition of Legendre-Fenchel transform.

      f ? (m) = sup_{x∈R} (x · m − f (x)) ≥ x · m − f (x)

Rearranging this, we would get the desired result.

Young’s inequality is an important inequality used to prove the really useful Hölder’s
inequality. Hölder’s inequality for the discrete case is formulated as follows.
Theorem 6.1 (Hölder's Inequality). [2, p.190] Given two sets of positive real numbers
{a_i}_{i=1}^n and {b_i}_{i=1}^n, for real numbers 1 < p, q < ∞ such that 1/p + 1/q = 1, we have:

      ∑_{i=1}^n a_i b_i ≤ (∑_{i=1}^n a_i^p)^{1/p} (∑_{i=1}^n b_i^q)^{1/q}      (7)

Equality holds if and only if a_i^p / b_i^q = c for some constant c ∈ R for all i = 1, 2, . . . , n.

Consider f (x) = x^p/p where 1 < p < ∞, for x ≥ 0. Note that f is a convex function,
as the epigraph of f is a convex set, and it is twice differentiable with f ′′ (x) > 0 for all
x > 0. Hence, we can find the Legendre transform for this function, which is f ? (m) = m^q/q
with 1/p + 1/q = 1.
Thus, putting this into Young's inequality, we would get, for 1/p + 1/q = 1:

      x · m ≤ x^p/p + m^q/q      (8)
Claim that equality occurs if and only if x^p = m^q. Indeed, if x^p = m^q, by using
the fact that 1/p + 1/q = 1 and m = x^{p−1}, the RHS of the inequality will be
x^p/p + m^q/q = x^p (1/p + 1/q) = x · x^{p−1} = x · m, which is the LHS. Conversely,
if equality occurs, it is easy to show that x^p = m^q by using the same two facts as
used above.
Now, consider the sets {a_i/a}_{i=1}^n and {b_i/b}_{i=1}^n where a and b are arbitrary
positive constants. Then, putting x = a_i/a and m = b_i/b into the Young's inequality
we found above, we would get:

      (a_i/a) · (b_i/b) ≤ (1/p) · (a_i^p / a^p) + (1/q) · (b_i^q / b^q)      (9)
Then, summing up over i = 1, 2, . . . , n, we would get:

      (1/(a·b)) ∑_{i=1}^n a_i b_i ≤ (1/(p·a^p)) ∑_{i=1}^n a_i^p + (1/(q·b^q)) ∑_{i=1}^n b_i^q      (10)

By choosing suitable values for a and b, we aim to transform the above inequality into our
desired result of Hölder's inequality. Thus, we choose:

      a = (∑_{i=1}^n a_i^p)^{1/p}   and   b = (∑_{i=1}^n b_i^q)^{1/q}

With these choices, the RHS becomes 1/p + 1/q, which is equal to 1 by the condition on
the exponents. Hence, by multiplying both sides by a · b, we arrive at our desired result.
For equality, if for all i = 1, 2, . . . , n we have a_i^p / b_i^q = c, consider the RHS of (7).
Since a_i = c^{1/p} b_i^{q/p} and q − q/p = q(1 − 1/p) = 1, we get:

      (∑_{i=1}^n a_i^p)^{1/p} (∑_{i=1}^n b_i^q)^{1/q} = c^{1/p} (∑_{i=1}^n b_i^q)^{1/p + 1/q}
                                                      = ∑_{i=1}^n c^{1/p} b_i^q
                                                      = ∑_{i=1}^n a_i b_i^{q − q/p}
                                                      = ∑_{i=1}^n a_i b_i

which is the LHS. Hence, we have equality. Conversely, note that since the elements of
the sets {a_i/a}_{i=1}^n and {b_i/b}_{i=1}^n are all positive, both sides of the inequality (9)
are positive for all i = 1, 2, . . . , n. So, in order to get equality in (10), we must have
equality in (9) for all i = 1, 2, . . . , n. Thus, from the equality condition of Young's
inequality, for all i we must have (a_i/a)^p = (b_i/b)^q. Hence a_i^p / b_i^q = a^p / b^q,
which is a constant independent of i, giving the reverse implication.
Remark 6.1. We can also prove Hölder’s inequality for the continuous case using the
same method but instead of summation, we use integration.
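As a hedged numerical illustration (the random data, the seed, and the exponents p = 3, q = 3/2 are our own choices, not from the text), we can spot-check both Hölder's inequality and its equality case:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(0.1, 5.0, size=50)
b = rng.uniform(0.1, 5.0, size=50)
p, q = 3.0, 1.5                      # conjugate exponents: 1/p + 1/q = 1

lhs = np.sum(a * b)
rhs = np.sum(a ** p) ** (1 / p) * np.sum(b ** q) ** (1 / q)
print(lhs <= rhs)                    # Hölder's inequality (7)

# Equality case: a_i^p / b_i^q constant, e.g. a_i = b_i^(q/p) gives ratio 1
a_eq = b ** (q / p)
lhs_eq = np.sum(a_eq * b)
rhs_eq = np.sum(a_eq ** p) ** (1 / p) * np.sum(b ** q) ** (1 / q)
print(abs(lhs_eq - rhs_eq) < 1e-8)   # both sides coincide up to rounding
```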

6.2 Lagrangian and Hamiltonian Mechanics


Now we look at another use of Legendre transform, which is an application in dynamics,
namely Lagrangian and Hamiltonian mechanics.
The Lagrange’s equation of motion can be represented in many different ways. Here,
we only look at a simple case of classical dynamics as an example of the uses of Legendre
transform. For simplicity, we first consider the case for 1-dimensional mechanics.
[8, p.10] In classical dynamics, the Lagrangian L(x, ẋ), is given by the equation:

L(x, ẋ) = T (x, ẋ) − V (x) (11)

where T (x, ẋ) is the total kinetic energy and V (x) is the total potential energy. The
variables x and ẋ represent displacement and velocity respectively, and they may depend
on a variable t, which is time. Note that the potential energy does not depend on the
velocity.
Based on the principle of stationary action, one can express the information using
Euler-Lagrange’s equation of motion (or Lagrange’s equation) which is the following.
Theorem 6.2 (Lagrange’s Equation). [8, p.11] Given the Lagrangian of a physical
system, L(x, ẋ) = T (x, ẋ) − V (x), we have the following equation:
      (d/dt)(∂L/∂ẋ) = ∂L/∂x      (12)
Before we prove this equation, we have to define a physical quantity called action,
denoted as S, which will be used in the proof, along with the principle of stationary
action.
Definition 6.1 (Action). An action of a physical system is a functional which takes the
function path of motion, x(t), to a real number. The action is defined as an integral of
the Lagrangian:

      S[x(t)] = ∫_{t0}^{t1} L(x, ẋ) dt      (13)

Note that if we fix the initial point x(t0 ) and the final point x(t1 ) of the motion,
the action depends on the path taken by the system as there might be more than one
possible path for the system. Now, the principle of stationary action is given as follows:

Theorem 6.3 (Principle of Stationary Action). [9, p.124] The path taken by a physical
system is the path x(t) such that it is an extremum of the action i.e. δS[x(t)] = 0.

Now, we can prove Theorem 6.2. We begin by applying the principle of stationary
action to determine the path taken by the system. We first fix the initial point x(t0 ) and
the final point x(t1 ) of the path and consider continuously shifting the path slightly by
x(t) → x(t) + δx(t). Note that as we have fixed the initial and final points, we have
δx(t0 ) = δx(t1 ) = 0.
To find the path taken by the system, we find δS and equate it with 0 [8, p.11].

      δS = δ ∫_{t0}^{t1} L(x, ẋ) dt = ∫_{t0}^{t1} δL(x, ẋ) dt
         = ∫_{t0}^{t1} ( (∂L/∂x) δx + (∂L/∂ẋ) δẋ ) dt
         = ∫_{t0}^{t1} (∂L/∂x) δx dt + ∫_{t0}^{t1} (∂L/∂ẋ) δẋ dt
         = ∫_{t0}^{t1} (∂L/∂x) δx dt + [ (∂L/∂ẋ) δx ]_{t0}^{t1} − ∫_{t0}^{t1} (d/dt)(∂L/∂ẋ) δx dt

But since at t0 and t1 the value of δx is 0, we would get:

      δS = ∫_{t0}^{t1} ( ∂L/∂x − (d/dt)(∂L/∂ẋ) ) δx dt

Finally, by the principle of stationary action, we require the integral to be 0. Further-
more, since δx is arbitrary, we have that ∂L/∂x − (d/dt)(∂L/∂ẋ) = 0, which is what we
wanted to prove.
There is another way of writing (12) [7, p.3]. If we write p = ∂L/∂ẋ, we would have:

      p = ∂L/∂ẋ
      ṗ = ∂L/∂x
The quantity p is called the generalised momentum of the system. Recall that the
Lagrangian of a system is a function that depends on x and ẋ. We can do away with
the dependency on ẋ and represent the system in new coordinates x and p. We shall see
how this can be useful later.

How do we change the variable ẋ to p = ∂L/∂ẋ in our Lagrangian? Recall our useful
tool, the Legendre transformation. It transforms the function f (x) to a new function f ∗ (m)
where m is the derivative f ′ (x). Since our p is the partial derivative of L(x, ẋ) with
respect to ẋ, theoretically, we are able to transform the Lagrangian L(x, ẋ) to a new
function L∗ (x, p) via the Legendre transform.

Now, by treating x as a constant in the Lagrangian, we aim to carry out the Legendre
transform of L with ẋ as the only variable, i.e. our Lagrangian now only depends on
ẋ, denoted as Lx (ẋ). Then, the Legendre transform of the Lagrangian Lx (ẋ) is given
by L∗x (p) = ẋ · p − Lx (ẋ). Putting p = ∂L/∂ẋ = ∂T/∂ẋ, we can invert this relation to get
ẋ in terms of p. Substituting this into the transformation, we would get the Legendre
transform of the Lagrangian, which depends on x and p. We define H(p, x) := L∗x (p).
This quantity H(p, x) is called the Hamiltonian.
[8, p.83] Consider the variation of the Hamiltonian. Since H(p, x) = ẋ · p − L(x, ẋ),
we have the following:

      (∂H/∂p) δp + (∂H/∂x) δx = δẋ · p + ẋ · δp − (∂L/∂x) δx − (∂L/∂ẋ) δẋ
                               = δẋ · p + ẋ · δp − ṗ · δx − p · δẋ
                               = ẋ · δp − ṗ · δx

Equating like terms, we get the following equations:

      ẋ = ∂H/∂p
      ṗ = −∂H/∂x
These are called the Hamilton equations.
Recall that the Legendre transform is self-inverse from Theorem 4.2. Hence, by
theory, we can get from the Hamiltonian back to the Lagrangian by carrying out the
Legendre transform once more. Indeed, this is very easy to show.

We consider a really simple and accessible example to demonstrate how we use Leg-
endre transform to move from the Lagrangian to the Hamiltonian.

Example 6.1. Consider a system of mass attached to a horizontal spring. For simplicity,
assume that the surface is frictionless. Denote x as the displacement of the body of mass
m from the equilibrium point of the spring. Let k be the spring constant.

Figure 9: Mass attached to a horizontal spring

We know that the kinetic energy T and potential energy V are given by T = (1/2)mẋ²
and V = (1/2)kx² respectively. Hence, the Lagrangian is given by:

      L(x, ẋ) = (1/2)mẋ² − (1/2)kx²
By considering the quantity x as a constant, we can easily show that the function
Lx (ẋ) is a convex function and it is twice differentiable with positive second derivative.
Therefore, we can carry out the Legendre transform on this function.
We calculate the generalised momentum by differentiating Lx (ẋ) to get p = mẋ. By
inverting this and substituting it into H(p, x) := L∗x (p) = ẋ · p − Lx (ẋ), we would get
the Hamiltonian of the system:

      H(p, x) = p²/(2m) + (1/2)kx²
Of course, this is a very simple example done just for demonstration of how Legendre
transform is used in dynamics.

Remark 6.2. Note that the quantity x may not necessarily represent linear displace-
ment. We can replace the quantity x and ẋ with other types of quantity, say angular
displacement θ and angular velocity θ̇. In theory, the quantities x and ẋ are called gen-
eralised position and generalised velocity respectively. Also note that in the example
above, the generalised momentum p corresponds to the momentum of the system. How-
ever, this is not the case in coordinates other than Cartesian coordinates (for example,
radial coordinates) [8, p.21].

We shall look at another example below to illustrate the remark above.

Example 6.2. Consider a pendulum swinging under the influence of gravity. The rod
has length l and is attached to a body with mass m. For simplicity, we assume no air
resistance and the pivot is frictionless.

Figure 10: Pendulum swinging under influence of gravity; the bob is raised by a height h = l(1 − cos θ)

The kinetic energy T and potential energy V are given by T = (1/2)mẋ² = (1/2)ml²θ̇²
and V = mgh = mgl(1 − cos θ) respectively. Hence, the Lagrangian is given by:

      L(θ, θ̇) = (1/2)ml²θ̇² − mgl(1 − cos θ)

By treating θ as a constant, we can carry out the Legendre transform of the La-
grangian. We find the generalised momentum p = ml²θ̇. Note that this is not equal to
the momentum of the system. By substituting this into H(p, θ) := L∗θ (p) = θ̇ · p − Lθ (θ̇),
we would get the Hamiltonian:

      H(p, θ) = p²/(2ml²) + mgl(1 − cos θ)
So why is this concept useful in classical dynamics? Recall that the Lagrange and
Hamilton equations are given by:

      Lagrange Equations          Hamilton Equations
      p = ∂L/∂ẋ                  ẋ = ∂H/∂p
      ṗ = ∂L/∂x                  ṗ = −∂H/∂x

                Table 1: Lagrange and Hamilton equations

Note that the Lagrange equations form a second order differential equation in one variable
[8, p.80]. On the other hand, the Hamilton equations form a coupled system of first order
differential equations in two variables [8, p.84]. In theory, one is able to switch from
solving one second order differential equation to solving a coupled system of two first
order differential equations by using the Legendre transform to move from Lagrangian
mechanics to Hamiltonian mechanics.
Recall Example 6.1. The Lagrangian is given by:

      L(x, ẋ) = (1/2)mẋ² − (1/2)kx²

So, by Lagrange's equations, we have:

      (d/dt)(∂L/∂ẋ) = ∂L/∂x
      ⇒ (d/dt)(mẋ) = −kx
      ⇒ m d²x/dt² = −kx
which is a second order differential equation in the variable x. Now we look at the Hamil-
tonian:

      H(p, x) = p²/(2m) + (1/2)kx²

Hence, by the Hamilton equations, we would have the system of differential equations:

      dx/dt = p/m
      dp/dt = −kx
This is a system of two first order differential equations in variables x and p. The method
of shifting from Lagrange equations to Hamilton equations gives us more freedom in solv-
ing problems in classical dynamics.
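To see the Hamiltonian formulation in action, here is a small sketch (the parameter values, initial conditions, and the semi-implicit Euler scheme are our own choices, not from the text) that integrates the two first order Hamilton equations and compares the result with the analytic solution of the single second order Lagrange equation:

```python
import numpy as np

m, k = 1.0, 4.0                   # mass and spring constant, so omega = sqrt(k/m) = 2
dt, steps = 1e-3, 20_000          # integrate up to t = 20

x, p = 1.0, 0.0                   # released from rest at x = 1
for _ in range(steps):
    # Hamilton's equations: dx/dt = dH/dp = p/m, dp/dt = -dH/dx = -k*x
    p -= k * x * dt               # semi-implicit (symplectic) Euler step
    x += (p / m) * dt

t = steps * dt
exact = np.cos(np.sqrt(k / m) * t)   # solution of m x'' = -k x with x(0)=1, x'(0)=0
print(x, exact)                      # the two agree closely
```

The symplectic update is a deliberate design choice: it keeps the numerical energy H(p, x) bounded over long times, which a naive Euler step would not.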

What we worked through before was a simple case of 1-dimensional dynamics. In


fact, we can extend this up to k-dimensional dynamics.
Consider a k-dimensional generalised coordinate system (x1 , x2 , . . . xk ). Then, for
i = 1, 2, . . . k, the Lagrangian is given by:

L(xi , ẋi ) = T (xi , ẋi ) − V (xi )

Hence, by working through the whole process as before, we would get the Lagrange
equations for i = 1, 2, . . . , k:
      p_i = ∂L/∂ẋ_i
      ṗ_i = ∂L/∂x_i
This is a set of k second order differential equations in k variables x1 , x2 , . . . xk .
However, when we take the Legendre transform of the Lagrangians to find the Hamil-
tonians, we would get the Hamilton equations for i = 1, 2, . . . , k.
      ẋ_i = ∂H/∂p_i
      ṗ_i = −∂H/∂x_i
Thus, we would have a coupled system of 2k first order differential equations in 2k vari-
ables x1 , x2 , . . . , xk , p1 , p2 , . . . , pk .

So, in general, by carrying out the Legendre transform, we can switch from solving
a set of k second order differential equations in k variables to solving a coupled system
of 2k first order differential equations in 2k variables, and vice versa. Thus, we can
choose whichever way that might be easier to solve without losing any information, as
the Legendre transform is self-inverse.

Part III
Inequalities
Now, we turn from the really technical subject of Legendre transform and look at a really
neat and elegant result in inequalities using convexity. In Steele’s book The Cauchy-
Schwarz Master Class, there is a whole chapter dedicated to the idea of convexity in the
theory of inequalities [10, p.87]. In fact, convexity is regarded as the third pillar of the
theory of inequalities, after positivity and monotonicity.
We have seen Young’s inequality in the previous section and how it is used in proving
another important inequality, Hölder’s inequality. Let us begin here by proving Jensen’s
inequality, an inequality that revolves around convex functions.

7 Convexity and Jensen’s Inequality


Theorem 7.1 (Jensen's Inequality). [10, p.87] Suppose that f : I → R is a convex
function where I ⊂ R and the set {a_i}_{i=1}^n is a set of positive real numbers such that
∑_{i=1}^n a_i = 1. Then, for all x_i ∈ I for i = 1, 2, . . . , n the following inequality holds:

      f ( ∑_{i=1}^n a_i x_i ) ≤ ∑_{i=1}^n a_i f (x_i)      (14)

If f is strictly convex, equality holds if and only if x_1 = x_2 = . . . = x_n.

Proof. We prove this inequality by induction on n. For n = 1, the inequality is trivial


as it is just an equality. n = 2 is slightly less trivial, but this case clearly holds as f is a
convex function. Now, assume that the inequality holds true for n = k − 1, we want to
prove that it is true for n = k.
For n = k, assume that we have the set of positive numbers {a_i}_{i=1}^k such that
a_1 + a_2 + . . . + a_k = 1. Now, we can scale the set {a_i}_{i=1}^{k−1} by 1 − a_k so that the
sum of the elements is still 1. So, the set {a_i/(1 − a_k)}_{i=1}^{k−1} has this property. We
shall use this later as the inductive hypothesis. Now, using convexity of the function f
and the inductive hypothesis, we can show the following:
      f ( ∑_{i=1}^k a_i x_i ) ≤ a_k f (x_k) + (1 − a_k) f ( ∑_{i=1}^{k−1} (a_i/(1 − a_k)) x_i )
                              ≤ a_k f (x_k) + (1 − a_k) ∑_{i=1}^{k−1} (a_i/(1 − a_k)) f (x_i)
                              = ∑_{i=1}^k a_i f (x_i)

which is what we wished to prove. For equality, if x_1 = x_2 = . . . = x_n, it is obvious
that we would get equality. Conversely, if f is strictly convex, we prove that x_1 = x_2 =
. . . = x_n by induction. For the case n = 1, it is obviously true. Assuming that it holds
for n = k − 1, we prove the case n = k. Note that in the proof above, we have an
inequality in the first line. Since f is strictly convex, equality occurs there if and only if
x_k = ∑_{i=1}^{k−1} (a_i/(1 − a_k)) x_i. But by the inductive hypothesis, we have x_1 = x_2 =
. . . = x_{k−1}, so we have x_k = x_i for any i = 1, 2, . . . , k − 1, hence proving our statement.

Remark 7.1. We can only prove the case for equality for strictly convex functions, be-
cause for general convex functions, we might have equality in (1) at points other than
the endpoints. However, one thing that we can note is that if f is a linear function,
then equality in (14) holds everywhere.
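A quick numerical illustration of (14) (the random weights, the sample points, and the choice f = exp are our own, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3.0, 3.0, size=10)   # sample points in I = R
a = rng.uniform(0.1, 1.0, size=10)
a = a / a.sum()                       # positive weights summing to 1

f = np.exp                            # a strictly convex function on R
lhs = f(np.dot(a, x))
rhs = np.dot(a, f(x))
print(lhs <= rhs)                     # Jensen's inequality (14)

# Equality case for strictly convex f: all the x_i equal
x_eq = np.full(10, 0.7)
print(abs(f(np.dot(a, x_eq)) - np.dot(a, f(x_eq))) < 1e-12)
```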

There exists the continuous form of Jensen’s inequality but here we only look at the
discrete form. This inequality has applications in probability theory and statistics. Here
we look at an application of this inequality.
Recall the very well-known AM-GM inequality:

Theorem 7.2 (AM-GM Inequality). For a positive integer n and the set of non-negative
numbers {x_i}_{i=1}^n, we have the following inequality:

      ( ∏_{i=1}^n x_i )^{1/n} ≤ (1/n) ∑_{i=1}^n x_i      (15)

Equality holds if and only if x_1 = x_2 = . . . = x_n.

The AM-GM inequality is really well known, so we are not going to prove it here. The
case for n = 2 is really easy to see by considering the positivity of (√x_1 − √x_2)² and
simple algebraic manipulation. For a general value of n, readers may refer to [10, p.20] for
a proof. However, there is a more general AM-GM inequality that we are interested in.

Theorem 7.3 (Generalised AM-GM Inequality). [2, p.190] For a positive integer n,
given the set of non-negative numbers {x_i}_{i=1}^n and the set {a_i}_{i=1}^n of positive real
numbers such that ∑_{i=1}^n a_i = 1, we have the following inequality:

      ∏_{i=1}^n x_i^{a_i} ≤ ∑_{i=1}^n a_i x_i      (16)

Equality holds if and only if x_1 = x_2 = . . . = x_n.

Proof. Consider the function f (t) = e^t. The function f (t) is strictly convex for all t ∈ R.
If some x_i = 0, the LHS of (16) vanishes and the inequality is trivial, so we may assume
all x_i > 0. Now, let y_i = ln x_i. Hence, x_i^{a_i} = exp(a_i ln x_i) = exp(a_i y_i). Now we
consider the LHS of (16) and proceed using Jensen's inequality by the convexity of the
exponential function.

      ∏_{i=1}^n x_i^{a_i} = ∏_{i=1}^n exp(a_i y_i) = exp( ∑_{i=1}^n a_i y_i )
                          = f ( ∑_{i=1}^n a_i y_i )
                          ≤ ∑_{i=1}^n a_i f (y_i)
                          = ∑_{i=1}^n a_i x_i

which is what we wanted to show. Now, since f is strictly convex, by Jensen's inequality,
we have equality for (16) if and only if y1 = y2 = . . . = yn . Since we defined yi = ln xi and
the logarithm function is injective, we have equality if and only if x1 = x2 = . . . = xn .

Now, if we put a_i = 1/n for all i = 1, 2, . . . , n, we would get the original AM-GM
inequality. An interesting thing to note is that if n = 2 and we put a_1 = 1/p and a_2 = 1/q
such that 1/p + 1/q = 1, as well as using x_1 = x^p and x_2 = m^q for non-negative x and m,
we would get:

      x · m ≤ x^p/p + m^q/q
This is Young’s inequality which we have looked at in the previous part. By our
choices of x1 and x2 , we get equality if and only if xp = mq .
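These relations are easy to spot-check numerically; the sketch below (random data, seed, and the sample values for Young's inequality are our own choices) verifies the generalised AM-GM inequality and its two special cases:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.5, 4.0, size=6)
a = rng.uniform(0.1, 1.0, size=6)
a = a / a.sum()                     # positive weights summing to 1

geo = np.prod(x ** a)               # weighted geometric mean
ari = np.dot(a, x)                  # weighted arithmetic mean
print(geo <= ari)                   # generalised AM-GM (16)

# a_i = 1/n recovers the classical AM-GM inequality (15)
n = len(x)
print(np.prod(x) ** (1.0 / n) <= np.mean(x))

# n = 2, a_1 = 1/p, a_2 = 1/q with x_1 = u**p, x_2 = v**q gives Young's inequality
p, q, u, v = 3.0, 1.5, 1.3, 0.8
print(u * v <= u ** p / p + v ** q / q)
```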

Another use for Jensen's inequality is to prove the Berezin-Lieb inequality for convex
functions of compact self-adjoint operators. Recall that a self-adjoint operator A is an
operator such that A = A∗ where A∗ is its adjoint, and a compact operator is an operator
that maps bounded sets onto relatively compact sets.

Theorem 7.4 (Berezin-Lieb Inequality). [11, p.15] If ϕ is a convex function, A is a
compact self-adjoint operator in a Hilbert space (i.e. A : H → H) and P is a projection
in H, then we have

      Tr(ϕ(P AP )) ≤ Tr(P ϕ(A)P )
To prove this, we need some knowledge from spectral theory [12, p.62-63]. Say we
have a compact self-adjoint operator A with eigenvalues λ_k and corresponding orthonor-
mal eigenvectors u_k (i.e. Au_k = λ_k u_k for all k). We can write the operator A in
such a way:

      A = ∑_k λ_k ⟨ · , u_k⟩ u_k

Furthermore, if we have a function f acting on A, we can write f (A) as [13, p.268]:

      f (A) = ∑_k f (λ_k) ⟨ · , u_k⟩ u_k

Also, we define the trace of a compact self-adjoint operator to be the sum of all its
eigenvalues [14, p.27]. So, for the operator A above, we have its trace given by:

      Tr(A) = ∑_k λ_k = ∑_k ⟨Au_k, u_k⟩

With the basic notions defined, we can now prove Theorem 7.4.

Proof. [11, p.20] We define the compact self-adjoint operator A and its eigenvalues and
orthonormal eigenvectors as before. Let P be a projection operator (i.e. P is a self-
adjoint operator such that P 2 = P ). Then, it can be easily shown that P AP is a
compact self-adjoint operator as well. Suppose that its eigenvalues and orthonormal
eigenvectors are µj and vj respectively (i.e. P AP vj = µj vj for all j). Hence we have:
      P AP = ∑_j μ_j ⟨ · , v_j⟩ v_j

Note that since P is a projection, P (P AP v_j) = P AP v_j, so μ_j P v_j = P (μ_j v_j) = μ_j v_j,
which gives v_j = P v_j for all j (with μ_j ≠ 0). This will be useful later.
Suppose that ϕ is a convex function, then we have:

      ϕ(P AP ) = ∑_j ϕ(μ_j) ⟨ · , v_j⟩ v_j

Now, the trace of the operator ϕ(P AP ) is given by:

      Tr(ϕ(P AP )) = ∑_j ϕ(μ_j)

However, since we have chosen the eigenvectors of the operator P AP to be orthonormal,
we have μ_j = ⟨μ_j v_j, v_j⟩ = ⟨P AP v_j, v_j⟩. And hence, we can write the trace as:

      Tr(ϕ(P AP )) = ∑_j ϕ( ⟨P AP v_j, v_j⟩ )
                   = ∑_j ϕ( ⟨A P v_j, P v_j⟩ )
                   = ∑_j ϕ( ⟨ ∑_k λ_k ⟨P v_j, u_k⟩ u_k , P v_j ⟩ )
                   = ∑_j ϕ( ∑_k λ_k ⟨P v_j, u_k⟩⟨u_k, P v_j⟩ )
                   = ∑_j ϕ( ∑_k λ_k |⟨P v_j, u_k⟩|² )

Note that as {u_k}_k forms a basis for the Hilbert space H, we can write
P v_j = ∑_k ⟨P v_j, u_k⟩ u_k. Hence, for all j, using Parseval's identity, we have
∑_k |⟨P v_j, u_k⟩|² = ‖P v_j‖² = ‖v_j‖² = 1.
Since ϕ is a convex function and ∑_k |⟨P v_j, u_k⟩|² = 1, by Jensen's inequality, we have:
      Tr(ϕ(P AP )) = ∑_j ϕ( ∑_k λ_k |⟨P v_j, u_k⟩|² )
                   ≤ ∑_j ∑_k ϕ(λ_k) |⟨P v_j, u_k⟩|²
                   = ∑_j ∑_k ϕ(λ_k) ⟨P v_j, u_k⟩⟨u_k, P v_j⟩
                   = ∑_j ⟨ ∑_k ϕ(λ_k) ⟨P v_j, u_k⟩ u_k , P v_j ⟩
                   = ∑_j ⟨ϕ(A) P v_j, P v_j⟩
                   = ∑_j ⟨P ϕ(A) P v_j, v_j⟩ = Tr(P ϕ(A)P )

Jensen’s inequality is a neat and useful result but there is a more elegant result
using the idea of convexity (or more precisely, concavity) in proving some well-known
inequalities.

8 Concavity and Inequalities


Recall the famous Cauchy’s inequality for the discrete case.
Theorem 8.1 (Cauchy's Inequality). [15, p.16] For the sets of positive real numbers
{a_i}_{i=1}^n and {b_i}_{i=1}^n, we have the following inequality:

      ( ∑_{i=1}^n a_i b_i )² ≤ ( ∑_{i=1}^n a_i² ) · ( ∑_{i=1}^n b_i² )      (17)

Equality holds if and only if a_i/b_i = c for some constant c ∈ R for all i = 1, 2, . . . , n.
The usual way of proving Cauchy’s inequality is by using the idea of positivity.
Consider the non-negative quadratic equation [10, p.11]:
      g(z) := ∑_{i=1}^n (a_i z + b_i)²

Consider the discriminant. Since the polynomial is non-negative, it must have at most
one real root and hence the discriminant is non-positive. Thus, by some manipulations,
we would get Cauchy’s inequality.

To show equality, if a_i/b_i = c for all i = 1, 2, . . . , n, then g(z) has the single real root
z = −1/c, and thus the discriminant is 0. Hence, we have equality in Cauchy's inequality.
Conversely, suppose that we have equality in Cauchy's inequality. That implies the
discriminant of g(z) is equal to 0, meaning we have exactly one root for g(z). At this
root, say z = z_0, we have g(z_0) = ∑_{i=1}^n (a_i z_0 + b_i)² = 0. But since all the terms in
the expression are non-negative, they must all equal 0. This implies that a_i/b_i = −1/z_0
for all i = 1, 2, . . . , n, a constant, and this completes the proof.
In fact, this is not the only method of proving Cauchy’s inequality. We can also
use the powerful tool of positivity in a different way by considering the non-negative
quantity [16, p.204]:

      0 ≤ ∑_{1≤i<j≤n} (a_i b_j − a_j b_i)²

By expanding and rearranging the equation, we would get Cauchy’s inequality. For a dif-
ferent proof that involves positivity, refer to the book Inequalities by Hardy, Littlewood
and Pólya [15].
However, we can also use the idea of convexity (or, more precisely, concavity) to
prove Cauchy’s inequality.
We begin by recalling what concave function means. Recall from Definition 2.1 in the
beginning of the first part, we mentioned that f : I → R is called concave if for t ∈ [0, 1]
and for all a, b ∈ I we have the inequality f ((1 − t)a + tb) ≥ (1 − t)f (a) + tf (b). The
function is strictly concave if we have strict inequality. In other words, the function f (x)
is concave (or strictly concave) if and only if the negative of this function, i.e. −f (x), is
convex (or strictly convex). So, we can say that concavity is the dual for convexity.

Figure 11: Convex and concave functions. Notice the slope also switches sign.

Hence, the properties of a concave function are almost similar to those of convex

functions, and can be deduced by simply negating a convex function. As an example,
for any convex function, we have that the right
and left derivatives of the function are increasing on the domain it is defined on. On the
other hand, for any concave function, we have the opposite: the left and right derivatives
are decreasing on its domain.

Now, we introduce an expression that will help us prove Cauchy's inequality. [16,
p.203] Let g : R+ → R be a strictly concave function and let f : R²+ → R be the function:

      f (x, y) = y · g(x/y)      (18)
This is a seemingly random expression that we pulled out of thin air. However, this
function has a special property which can be very useful in the theory of inequalities.
Proposition 8.1. [16, p.204] For the sets of positive real numbers {x_i}_{i=1}^n and
{y_i}_{i=1}^n, we have the inequality

      ∑_{i=1}^n f (x_i, y_i) ≤ f ( ∑_{i=1}^n x_i , ∑_{i=1}^n y_i )      (19)

Equality holds if and only if x_i/y_i = c for some constant c ∈ R for all i = 1, 2, . . . , n.
Proof. We prove this proposition using induction. For n = 1, the inequality is trivial as
it is just an equality. Now, assume the inequality holds true for n = k − 1. We prove
the case for n = k. From our assumption, we have the following inequality:
      ∑_{i=1}^{k−1} f (x_i, y_i) ≤ f ( ∑_{i=1}^{k−1} x_i , ∑_{i=1}^{k−1} y_i )
Now, for n = k, we have:

      ∑_{i=1}^k f (x_i, y_i) = f (x_k, y_k) + ∑_{i=1}^{k−1} f (x_i, y_i)
          ≤ y_k · g(x_k/y_k) + f ( ∑_{i=1}^{k−1} x_i , ∑_{i=1}^{k−1} y_i )
          = y_k · g(x_k/y_k) + ( ∑_{i=1}^{k−1} y_i ) · g( (∑_{i=1}^{k−1} x_i) / (∑_{i=1}^{k−1} y_i) )
          = ( ∑_{i=1}^k y_i ) [ (y_k / ∑_{i=1}^k y_i) · g(x_k/y_k)
                + ( (∑_{i=1}^{k−1} y_i) / (∑_{i=1}^k y_i) ) · g( (∑_{i=1}^{k−1} x_i) / (∑_{i=1}^{k−1} y_i) ) ]
          ≤ ( ∑_{i=1}^k y_i ) · g( (∑_{i=1}^k x_i) / (∑_{i=1}^k y_i) )
          = f ( ∑_{i=1}^k x_i , ∑_{i=1}^k y_i )

Hence, the inequality holds true for n = k as well. Now we need to prove equality.
If x_i/y_i = c for some constant c ∈ R for all i = 1, 2, . . . , n, the LHS of the inequality is:

      ∑_{i=1}^n f (x_i, y_i) = ∑_{i=1}^n f (c · y_i, y_i) = ∑_{i=1}^n y_i · g(c)

On the other hand, the RHS of the inequality is:

      f ( ∑_{i=1}^n x_i , ∑_{i=1}^n y_i ) = f ( c ∑_{i=1}^n y_i , ∑_{i=1}^n y_i ) = ( ∑_{i=1}^n y_i ) · g(c)

Thus, we have equality.


We prove the converse by induction. For the case n = 1, it is an equality and
trivially the ratio x_1/y_1 is a constant. Now, assume that we have x_i/y_i = c for some
c ∈ R for all i = 1, 2, . . . , k − 1, and we prove the case n = k. Note that in the proof
above, the first inequality is an equality by the inductive hypothesis. We look at the
second inequality, which arises from the strict concavity of the function g. It is an
equality if and only if x_k/y_k is equal to (∑_{i=1}^{k−1} x_i) / (∑_{i=1}^{k−1} y_i), by
definition of strict concavity. Since we assumed that x_i/y_i = c for all i = 1, 2, . . . , k − 1,
we have that x_k/y_k also equals c, completing the induction.

Remark 8.1. Consider that g(x) is a concave function (not strictly concave). The
inequality also holds true because when proving the inequality, we did not use the strict-
ness property of the concavity. However, we would not be able to determine when it
becomes an equality as we use the property of strict concavity when proving equality.

8.1 Cauchy’s Inequality


So we have shown that the function f from (18) has an interesting property. But how do we use it to prove Cauchy's inequality? Recall that we must have $g : \mathbb{R}_+ \to \mathbb{R}$ a strictly concave function. [16, p.204] If we choose the function $g(x) = \sqrt{x}$ for x > 0, which is strictly concave, let us see what we get from the theory we went through earlier.

Firstly, we would get the function $f(x, y) = y \cdot \sqrt{\frac{x}{y}} = \sqrt{x}\sqrt{y}$. Hence, putting this in (19), we would get:
$$\sum_{i=1}^{n} \sqrt{x_i}\sqrt{y_i} \le \sqrt{\sum_{i=1}^{n} x_i} \, \sqrt{\sum_{i=1}^{n} y_i}$$
By putting $x_i = a_i^2$ and $y_i = b_i^2$ and then squaring both sides, we would get (17), Cauchy's inequality. Furthermore, from Proposition 8.1, equality holds if and only if the ratio $\frac{x_i}{y_i}$ is a constant independent of i for all $i = 1, 2, \dots, n$. Hence, we have proved Cauchy's inequality using the idea of convexity (or, more precisely, concavity).

This whole process may seem like a long-winded way of proving the simple Cauchy's inequality. The proof using positivity is much shorter than the whole process of defining the function f(x, y) and proving its special property. Furthermore, we needed to come up with the function $g(x) = \sqrt{x}$ to arrive at Cauchy's inequality.
However, this method is much more elegant: not only can we prove Cauchy's inequality, but we can prove other special inequalities as well.

8.2 Hölder’s Inequality


Recall that in the previous part we proved Hölder's inequality using the idea of the Legendre transform along with Young's inequality. We can also prove Hölder's inequality using the idea of concavity introduced here.
[16, p.205] We begin by choosing $g(x) = x^{\frac{1}{p}}$ for x > 0, where p > 1. By choosing p > 1 we have $\frac{1}{p} \in (0, 1)$, so g(x) is a strictly concave function. Hence, we would get $f(x, y) = y \cdot \left(\frac{x}{y}\right)^{\frac{1}{p}} = x^{\frac{1}{p}} \cdot y^{\frac{1}{q}}$, where $\frac{1}{p} + \frac{1}{q} = 1$. Putting this in (19), we would get:
$$\sum_{i=1}^{n} x_i^{\frac{1}{p}} \cdot y_i^{\frac{1}{q}} \le \left(\sum_{i=1}^{n} x_i\right)^{\frac{1}{p}} \left(\sum_{i=1}^{n} y_i\right)^{\frac{1}{q}}$$
Then, if we substitute $x_i = a_i^p$ and $y_i = b_i^q$, we would get (7).


Finally, we check the condition for equality of Hölder's inequality. Equality holds in (19) if and only if the ratio $\frac{x_i}{y_i}$ is a constant independent of i for all $i = 1, 2, \dots, n$. Hence, by our choices of $x_i$ and $y_i$, we have $\frac{a_i^p}{b_i^q} = c$ for all $i = 1, 2, \dots, n$. Thus we have proved Hölder's inequality for the discrete case using the idea of concavity.
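Again, a small numerical check of the resulting inequality, with the arbitrarily chosen conjugate pair p = 3, q = 3/2 and arbitrary data:

```python
# Discrete Hölder inequality (7) with conjugate exponents p = 3, q = 3/2.
p, q = 3.0, 1.5
assert abs(1 / p + 1 / q - 1) < 1e-12  # conjugacy condition

a = [1.0, 2.0, 0.5]
b = [3.0, 1.0, 2.0]

lhs = sum(ai * bi for ai, bi in zip(a, b))
rhs = sum(ai**p for ai in a) ** (1 / p) * sum(bi**q for bi in b) ** (1 / q)
assert lhs <= rhs
```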

8.3 Minkowski’s Inequality


[15, p.31] Not only can we prove Cauchy’s and Hölder’s inequalities, we can also prove
another non-trivial inequality called Minkowski’s inequality.

Theorem 8.2 (Minkowski's Inequality). For the sets of positive real numbers $\{a_i\}_{i=1}^n$ and $\{b_i\}_{i=1}^n$ and a number p > 1, we have the following inequality:
$$\left(\sum_{i=1}^{n} (a_i + b_i)^p\right)^{\frac{1}{p}} \le \left(\sum_{i=1}^{n} a_i^p\right)^{\frac{1}{p}} + \left(\sum_{i=1}^{n} b_i^p\right)^{\frac{1}{p}} \qquad (20)$$
Equality holds if and only if $\frac{a_i}{b_i} = c$ for some constant $c \in \mathbb{R}$ for all $i = 1, 2, \dots, n$.

The usual way of proving Minkowski's inequality is by considering the p-norm of the $\ell^p$ space and using Hölder's inequality to get the final result. However, using the idea of concavity, we can prove the inequality by considering the function $g(x) = (x^{\frac{1}{p}} + 1)^p$ [16, p.205]. This function is strictly concave. Hence, using this function for f(x, y), we would get $f(x, y) = y \cdot \left(\left(\frac{x}{y}\right)^{\frac{1}{p}} + 1\right)^p = (x^{\frac{1}{p}} + y^{\frac{1}{p}})^p$.
Using this f(x, y) in (19), we would get:
$$\sum_{i=1}^{n} \left(x_i^{\frac{1}{p}} + y_i^{\frac{1}{p}}\right)^p \le \left(\left(\sum_{i=1}^{n} x_i\right)^{\frac{1}{p}} + \left(\sum_{i=1}^{n} y_i\right)^{\frac{1}{p}}\right)^p$$
Then, by choosing $x_i = a_i^p$, $y_i = b_i^p$ and taking the p-th root on both sides, we would get (20). Equality occurs if and only if the ratio $\frac{x_i}{y_i}$ is a constant independent of i for all $i = 1, 2, \dots, n$. By our choices of $x_i$ and $y_i$, we have $\frac{a_i}{b_i} = c$ for all $i = 1, 2, \dots, n$. Thus, this completes the proof.
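As a sketch check (the exponent p = 2.5 and the data are arbitrary choices), both the superadditivity of this f and the resulting inequality (20) can be verified numerically:

```python
# Superadditivity of f(x, y) = (x**(1/p) + y**(1/p))**p, and Minkowski (20).
p = 2.5

def f(x, y):
    return (x ** (1 / p) + y ** (1 / p)) ** p

xs = [1.0, 4.0, 2.0]
ys = [3.0, 0.5, 5.0]
assert sum(f(x, y) for x, y in zip(xs, ys)) <= f(sum(xs), sum(ys))

# Minkowski (20) with x_i = a_i**p and y_i = b_i**p:
a = [1.0, 2.0, 3.0]
b = [2.0, 2.0, 1.0]
lhs = sum((ai + bi) ** p for ai, bi in zip(a, b)) ** (1 / p)
rhs = sum(ai**p for ai in a) ** (1 / p) + sum(bi**p for bi in b) ** (1 / p)
assert lhs <= rhs
```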

Now, one question arises. Is that all? We have proved Cauchy’s, Hölder’s and
Minkowski’s inequalities using the idea of concavity. Are there any other inequalities
that can be proved using the same idea? There is one lesser known inequality called
Milne’s inequality.

Theorem 8.3 (Milne's Inequality). For the sets of positive real numbers $\{a_i\}_{i=1}^n$ and $\{b_i\}_{i=1}^n$, we have the following inequality:
$$\left(\sum_{i=1}^{n} (a_i + b_i)\right)\left(\sum_{i=1}^{n} \frac{a_i b_i}{a_i + b_i}\right) \le \left(\sum_{i=1}^{n} a_i\right)\left(\sum_{i=1}^{n} b_i\right) \qquad (21)$$
Equality holds if and only if $\frac{a_i}{b_i} = c$ for some constant $c \in \mathbb{R}$ for all $i = 1, 2, \dots, n$.

[16, p.205] We can prove this inequality by using the strictly concave function $g(x) = \frac{x}{1+x}$ in (18) to get $f(x, y) = \frac{xy}{x+y}$, applying (19), and multiplying both sides by $\sum_{i=1}^{n}(a_i + b_i)$ to obtain the required result (21).
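Milne's inequality (21) can likewise be checked on sample data (the values below are arbitrary positive reals):

```python
# Check of Milne's inequality (21).
a = [1.0, 2.0, 4.0]
b = [3.0, 1.0, 2.0]

lhs = sum(ai + bi for ai, bi in zip(a, b)) * sum(
    ai * bi / (ai + bi) for ai, bi in zip(a, b)
)
rhs = sum(a) * sum(b)
assert lhs <= rhs
```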

With this method of concavity we have, so far, only proved a handful of non-trivial inequalities. However, this does not limit the power of the method: we can extend the idea to higher dimensions.
Naturally, we first have to define the notion of convexity (and concavity) in higher dimensions. We amend our definition of convex functions in Definition 2.1, and consider functions f : A → R, where $A \subset \mathbb{R}^d$ is a convex set.

Definition 8.1 (Convex and Concave Functions for Higher Dimensions). [2, p.89] For d ≥ 1, a function f : A → R is called convex if for $t \in [0, 1]$ and for all $a, b \in A$, we have the following inequality:
$$f((1-t)a + tb) \le (1-t)f(a) + tf(b) \qquad (22)$$
A function is called strictly convex if the inequality in (22) holds strictly whenever a and b are distinct and $t \in (0, 1)$. A function is called concave if the opposite inequality holds in (22).

Remark 8.2. We require the domain A ⊂ Rd of the function f to be a convex set so
that the function f will always be defined at (1 − t)a + tb for any a, b ∈ A and t ∈ [0, 1].

Recall that we did not need any special properties of convex (and concave) functions in this part to prove all the inequalities, apart from the definition. So now we go straight to the idea of proving inequalities using concavity.

Corollary 8.1. [16, p.206] Let d ≥ 2 be an integer and let the function $g : \mathbb{R}^{d-1}_+ \to \mathbb{R}$ be strictly concave. Define the function $f : \mathbb{R}^d_+ \to \mathbb{R}$ as:
$$f(x_1, x_2, \dots, x_d) = x_d \cdot g\left(\frac{x_1}{x_d}, \frac{x_2}{x_d}, \dots, \frac{x_{d-1}}{x_d}\right) \qquad (23)$$
Then, for the sets of positive real numbers $\{x_{i,j}\}_{i=1}^n$ for $j = 1, 2, \dots, d$, we have the inequality:
$$\sum_{i=1}^{n} f(x_{i,1}, x_{i,2}, \dots, x_{i,d}) \le f\left(\sum_{i=1}^{n} x_{i,1}, \sum_{i=1}^{n} x_{i,2}, \dots, \sum_{i=1}^{n} x_{i,d}\right) \qquad (24)$$
Equality holds if and only if the vectors $(x_{1,j}, x_{2,j}, \dots, x_{n,j})$ for $j = 1, 2, \dots, d$ are multiples of each other. In other words, consider the n × d matrix of positive entries $X = (x_{i,j})$; equality in (24) holds if and only if the rank of the matrix X is 1.

Proof. For d = 2, this is exactly the result of Proposition 8.1; for larger values of d, the proof is analogous to the proof of the case d = 2. The case of equality is likewise proved in a similar manner, by induction.

In this way, we can generalise many well-known inequalities to higher dimensions. In the paper by Woeginger [16, p.207], the author states, without proof, Hölder's inequality in three dimensions. Here, we look at Hölder's inequality of higher dimensions.

8.4 Hölder’s Inequality of Higher Dimensions


Theorem 8.4 (Hölder's Inequality of Higher Dimensions). Assume that we have d sets of positive real numbers $\{a_{i,j}\}_{i=1}^n$ for $j = 1, 2, \dots, d$. For the set of real numbers $\{p_j\}_{j=1}^d$ such that $p_j > 1$ for all $j = 1, 2, \dots, d$ and $\sum_{j=1}^{d} \frac{1}{p_j} = 1$, we have:
$$\sum_{i=1}^{n} \left(\prod_{j=1}^{d} a_{i,j}\right) \le \prod_{j=1}^{d} \left(\sum_{i=1}^{n} a_{i,j}^{p_j}\right)^{\frac{1}{p_j}} \qquad (25)$$
Equality holds if and only if the vectors $(a_{1,j}^{p_j}, a_{2,j}^{p_j}, \dots, a_{n,j}^{p_j})$ for $j = 1, 2, \dots, d$ are multiples of each other.

Proof. We shall prove this by induction on d. For d = 1, the statement is trivial, as it is just an equality. For the case d = 2, we have our original Hölder's inequality in (7).
Now, we assume that the inequality (25) holds for d = k − 1. We aim to prove the case d = k, i.e. for the set of real numbers $\{p_j\}_{j=1}^{k}$ such that $p_j > 1$ for all $j = 1, 2, \dots, k$ and $\sum_{j=1}^{k} \frac{1}{p_j} = 1$, we have:
$$\sum_{i=1}^{n} \left(\prod_{j=1}^{k} a_{i,j}\right) \le \prod_{j=1}^{k} \left(\sum_{i=1}^{n} a_{i,j}^{p_j}\right)^{\frac{1}{p_j}}$$

Consider the function $g : \mathbb{R}^{k-1}_+ \to \mathbb{R}$ given by:
$$g(x_1, x_2, \dots, x_{k-1}) = \prod_{i=1}^{k-1} x_i^{\frac{1}{p_i}}$$

We need to show that this function g is strictly concave. One way to do it is to consider the Hessian of the function and show that it is negative definite. However, working out the Hessian can be laborious for large k. Luckily, there is an alternative way to show strict concavity using the inductive hypothesis, which is what we are going to do here.

Since we have $\sum_{j=1}^{k} \frac{1}{p_j} = 1$, by moving the k-th term of the sum to the RHS we get $\sum_{j=1}^{k-1} \frac{1}{p_j} = \frac{p_k - 1}{p_k}$. We define $q_j := \frac{p_k - 1}{p_k} \, p_j$. Thus, we obtain the equation:
$$\sum_{j=1}^{k-1} \frac{1}{q_j} = 1 \qquad (26)$$
We claim that $q_j > 1$ for all $j = 1, 2, \dots, k-1$. Indeed, if there existed a j such that $q_j \le 1$, then $\frac{1}{q_j} \ge 1$ and, since all the remaining terms of the sum are positive, the summation $\sum_{j=1}^{k-1} \frac{1}{q_j}$ would be greater than 1, which is a contradiction.

Since we have the set of k − 1 numbers $\{q_j\}_{j=1}^{k-1}$ such that $\sum_{j=1}^{k-1} \frac{1}{q_j} = 1$ and $q_j > 1$ for all $j = 1, 2, \dots, k-1$, we can apply the inductive hypothesis to this set of numbers. We define the functions $h : \mathbb{R}^{k-1}_+ \to \mathbb{R}_+$ and $m : \mathbb{R}_+ \to \mathbb{R}$ by:
$$h(x_1, x_2, \dots, x_{k-1}) = \prod_{j=1}^{k-1} x_j^{\frac{1}{q_j}} \qquad (27)$$
$$m(x) = x^{\frac{p_k - 1}{p_k}} \qquad (28)$$
This way, the function g decomposes as g = m ∘ h. We do this because, by checking the concavity of the functions m and h, we can conclude the concavity of the function g.

First, we aim to show the concavity of the function h from the definition, using our inductive hypothesis. Consider arbitrary $a, b \in \mathbb{R}^{k-1}_+$. For $t \in [0, 1]$, applying the inductive hypothesis (inequality (25) for d = k − 1 with n = 2, exponents $q_j$, and entries $((1-t)a_j)^{\frac{1}{q_j}}$ and $(t b_j)^{\frac{1}{q_j}}$), we have:
\begin{align*}
h((1-t)a + tb) &= \prod_{j=1}^{k-1} ((1-t)a_j + t b_j)^{\frac{1}{q_j}} \\
&\ge \prod_{j=1}^{k-1} ((1-t)a_j)^{\frac{1}{q_j}} + \prod_{j=1}^{k-1} (t b_j)^{\frac{1}{q_j}} \\
&= (1-t) \prod_{j=1}^{k-1} a_j^{\frac{1}{q_j}} + t \prod_{j=1}^{k-1} b_j^{\frac{1}{q_j}} \\
&= (1-t)h(a) + t h(b)
\end{align*}
where the penultimate step uses $\sum_{j=1}^{k-1} \frac{1}{q_j} = 1$ to pull out the factors $(1-t)$ and $t$. Hence, the function h is a concave function.


Also, since $m(x) = x^{\frac{p_k - 1}{p_k}}$ has exponent in (0, 1), the function m is strictly concave and strictly increasing: for $t \in (0, 1)$ and $x \ne y$ we have $m((1-t)x + ty) > (1-t)m(x) + t m(y)$, and $x < y \Rightarrow m(x) < m(y)$. Therefore, for $t \in (0, 1)$ and distinct a, b with $h(a) \ne h(b)$, we have the following:
\begin{align*}
g((1-t)a + tb) &= m(h((1-t)a + tb)) \\
&\ge m((1-t)h(a) + t h(b)) \\
&> (1-t)m(h(a)) + t m(h(b)) \\
&= (1-t)g(a) + t g(b)
\end{align*}
If instead $h(a) = h(b)$ with $a \ne b$, then equality in the concavity inequality for h above would force, by the equality condition of the inductive hypothesis, a and b to be proportional, which together with $h(a) = h(b)$ would give $a = b$, a contradiction. Hence $h((1-t)a + tb) > h(a)$, and the strict monotonicity of m again gives $g((1-t)a + tb) > m(h(a)) = (1-t)g(a) + t g(b)$.

Thus, the function g is strictly concave. So, from (23), we define the function $f : \mathbb{R}^k_+ \to \mathbb{R}$ such that:
$$f(x_1, x_2, \dots, x_k) = x_k \cdot g\left(\frac{x_1}{x_k}, \frac{x_2}{x_k}, \dots, \frac{x_{k-1}}{x_k}\right) = \prod_{j=1}^{k} x_j^{\frac{1}{p_j}}$$

Then, from (24), by summing up over n we have:
$$\sum_{i=1}^{n} \prod_{j=1}^{k} x_{i,j}^{\frac{1}{p_j}} \le \prod_{j=1}^{k} \left(\sum_{i=1}^{n} x_{i,j}\right)^{\frac{1}{p_j}}$$
Putting $x_{i,j} = a_{i,j}^{p_j}$, we would get the desired result.
Also, we would get equality for (24) if and only if the vectors $(x_{1,j}, x_{2,j}, \dots, x_{n,j})$ for $j = 1, 2, \dots, k$ are multiples of each other. By our choice of $x_{i,j}$, we would get equality for this inequality if and only if the vectors $(a_{1,j}^{p_j}, a_{2,j}^{p_j}, \dots, a_{n,j}^{p_j})$ for $j = 1, 2, \dots, k$ are multiples of each other.
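A numerical check of (25) for d = 3 (the exponents $p_j = 3$, which satisfy $\sum 1/p_j = 1$, and the data below are arbitrary choices):

```python
# Higher-dimensional Hölder inequality (25) with d = 3, n = 4;
# a[i][j] plays the role of a_{i,j}.
p = [3.0, 3.0, 3.0]
a = [
    [1.0, 2.0, 0.5],
    [3.0, 1.0, 1.5],
    [0.5, 0.5, 2.0],
    [2.0, 3.0, 1.0],
]

lhs = sum(row[0] * row[1] * row[2] for row in a)
rhs = 1.0
for j, pj in enumerate(p):
    rhs *= sum(row[j] ** pj for row in a) ** (1 / pj)
assert lhs <= rhs
```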

Also, from the proof above, we get an interesting result regarding a class of functions
on a d-dimensional space.

Corollary 8.2. Let $f : \mathbb{R}^d_+ \to \mathbb{R}$ be a function defined as:
$$f(x_1, x_2, \dots, x_d) = \prod_{i=1}^{d} x_i^{\frac{1}{p_i}}$$
where the numbers $p_i > 1$ are such that $0 < \sum_{i=1}^{d} \frac{1}{p_i} < 1$. Then, the function f is a strictly concave function.

The proof of this is done in the proof of Theorem 8.4 above.

Remark 8.3. If the numbers $p_i > 1$ are such that $\sum_{i=1}^{d} \frac{1}{p_i} = 1$, the function f would still be a concave function, but we cannot guarantee the strictness of the concavity. For example, consider the function $f(x_1, x_2) = x_1^{\frac{1}{2}} x_2^{\frac{1}{2}}$. Along the line $x_1 = x_2$, the function f reduces to $f(x_1, x_2) = x_1 = x_2$. So, any secant line drawn for the function along this line will always coincide with the function f. Hence, the function f is not strictly concave.
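This failure of strictness can be observed numerically (the diagonal points a, b and the parameter t below are arbitrary choices):

```python
import math

# f(x1, x2) = sqrt(x1 * x2) restricted to the diagonal x1 = x2 is linear,
# so the chord between two diagonal points lies on the graph of f.
def f(x1, x2):
    return math.sqrt(x1 * x2)

a, b, t = 1.0, 4.0, 0.25
mid = (1 - t) * a + t * b
chord = (1 - t) * f(a, a) + t * f(b, b)
assert abs(f(mid, mid) - chord) < 1e-12  # equality: strictness fails
```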

Part IV
Summary
Over the course of this project, we have looked at the definition of convex functions and their simple yet remarkable properties. Concave functions can be defined as the opposite of convex functions, and their properties can be proven in a similar manner. These properties of convex and concave functions are useful in different areas of mathematics.
One thing to note: recall that we proved that a convex function over an interval I is differentiable at all but at most countably many points, using the ideas of Lipschitz continuity and Rademacher's theorem. During the course of my work on the project, I came across Froda's theorem, which states that any monotone function on a closed interval has at most countably many discontinuities. This theorem might be useful here, as we have shown that the left and right derivatives of a convex function are monotone functions. However, we have only shown that they exist on the interior of I, which is an open interval, so Froda's theorem might not apply directly. More work could be done in this direction if we could somehow extend the domain of the left and right derivatives to a closed interval, by investigating what happens to them at the boundary of the interval.
In Part II of the project, we went through applications of convex functions in applied mathematics, namely the Legendre and Legendre-Fenchel transforms. These concepts found their way into classical dynamics, where one can transform a second-order differential equation into a system of first-order differential equations without losing any information, as the transform is involutive. This transform can also be used to prove Young's inequality and, consequently, Hölder's inequality.
Finally, in Part III of the project, we applied the concept of convexity in pure mathematics to prove Jensen's inequality and to generalise the AM-GM inequality. Then, using the idea of concavity, we looked at elegant proofs of the Cauchy, Minkowski, Hölder and Milne inequalities. In these proofs, we defined a function f depending on a strictly concave function g, with a special property that enabled us to substitute particular choices of g to obtain classical inequalities. More work could be done here: one could try to find other functions f with similar properties in order to prove further classical inequalities. For example, I tried proving Carlson's inequality using the method outlined in the project, but it does not seem possible with this choice of f; perhaps with a differently defined f satisfying the same property, Carlson's inequality could be proved using the idea of concavity. Finally, by extending the ideas of concavity and convexity to higher dimensions, we proved Hölder's inequality of higher dimensions.
Indeed, there are more applications for convex functions in different areas of mathe-
matics such as statistics, finance and optimisation. This project just provides a general
and brief idea of how powerful the tool of convexity is in mathematics.

Part V
Bibliography
References
[1] Constantin Niculescu and Lars-Erik Persson, Convex Functions and Their Applica-
tions: A Contemporary Approach (London: Springer, 2005). ISBN: 0-387-24300-3

[2] A. Wayne Roberts and Dale E. Varberg, Convex Functions (London: Academic
Press, 1973). ISBN: 0-12-589740-5

[3] Juha Heinonen, Lectures on Lipschitz Analysis (2005). Accessed: January 12, 2013.
http://www.math.jyu.fi/research/reports/rep100.pdf

[4] Martin Muñoz, Rademacher’s Theorem (2010). Accessed: January 12, 2013.
http://wiki.math.toronto.edu/TorontoMathWiki/images/b/be/MAT1000_
Martin_Munoz.pdf

[5] Edward F. Redish, R. K. P. Zia and Susan R. McKay, Making Sense of Legendre
Transform (2009). Accessed: January 12, 2013. arXiv: 0806.1147

[6] Sam Kennerly, A Graphical Derivation of the Legendre Transform (2011). Ac-
cessed: January 12, 2013. http://www.physics.drexel.edu/~skennerly/maths/
Legendre.pdf

[7] David Glickenstein, The Legendre Transform (2000). Accessed: January 12, 2013.
http://math.arizona.edu/~glickenstein/tex/legendre.pdf

[8] David Tong, Classical Dynamics: University of Cambridge Part II Tripos (2004). Accessed: January 12, 2013. http://www.damtp.cam.ac.uk/user/tong/dynamics/clas.pdf

[9] Darryl D. Holm, Geometric Mechanics Part 1: Dynamics and Symmetry, 2nd Edi-
tion (London: Imperial College Press, 2011). ISBN: 1-84816-774

[10] J. Michael Steele, The Cauchy-Schwarz Master Class (New York: Cambridge Uni-
versity Press, 2004). ISBN: 0-52154-677-X

[11] N. Sookia and P. Nunkoo Gonpot, Berezin-Lieb Inequality: An Extension to Normal Operators (2011). Accessed: March 30, 2013. http://www.ajol.info/index.php/umrj/article/view/70729/59325

[12] J. R. Retherford, Hilbert Space: Compact Operators and the Trace Theorem (Cam-
bridge: Cambridge University Press, 1993). ISBN: 0-521-42933-1

[13] David H. Griffel, Applied Functional Analysis (Toronto: General Publishing Com-
pany, 1981). ISBN: 0-486-42258-5

[14] E. Kowalski, Spectral Theory in Hilbert Spaces (ETH Zürich, FS 09) (2009). Accessed: March 30, 2013. http://www.math.ethz.ch/~kowalski/spectral-theory.pdf

[15] G. H. Hardy, J. E. Littlewood and G. Pólya, Inequalities, 2nd Edition (London: Cambridge University Press, 1952). ISBN: 0-521-05260-8

[16] Gerhard J. Woeginger, “When Cauchy and Hölder Met Minkowski: A Tour through Well-Known Inequalities”, Mathematics Magazine, Vol. 82, No. 3 (2009): 202-207. Accessed: January 12, 2013. http://www.jstor.org/stable/27765902

Acknowledgements
First and foremost, I would like to thank my supervisor, Professor Ari Laptev, for helping me get started with the project, for providing direction to my project, and for giving me hints to solve some of my problems. Some of the problems and examples were suggested by him, as they relate to his areas of interest and past papers (e.g. Example 5.1 was a problem he worked on in one of his previous papers). I really enjoyed working under his supervision and hope that, if I continue with my graduate studies, I will be able to work with him again.
Secondly, I would like to thank my personal tutor, Professor Darryl Holm, for helping me with the Lagrangian and Hamiltonian dynamics section of the Legendre transform part. One of his books, referenced in the Bibliography, gave me insight into the motivation behind the Legendre transform. He also gave me an example to explain the idea and encouraged me to find other simple examples, which you have come across in the project.
I would like to express my thanks to my friends, especially Claire Rebello, who first read the proof of the higher-dimensional Hölder's inequality, and Saber King, who checked that I had not made any mistakes in the proof. Also, thank you to Cissy Chan for proofreading the project and helping me fix the spelling and grammatical errors. And thank you to Kee Pau Boon for listening to my presentation and pointing out some good tips for me.
Finally, I would like to thank you for reading this project. I hope that it has been
interesting and enlightening.

Muhammad Syafiq Johar

Forwarding address:
No. 32, Jalan BP1,
Bandar Bukit Puchong,
47100, Puchong,
Malaysia.
