1, JANUARY 2007 1
Abstract—Rényi divergence is related to Rényi entropy much like Kullback-Leibler divergence is related to Shannon's entropy, and comes up in many settings. It was introduced by Rényi as a measure of information that satisfies almost the same axioms as Kullback-Leibler divergence, and depends on a parameter that is called its order. In particular, the Rényi divergence of order 1 equals the Kullback-Leibler divergence.
We review and extend the most important properties of Rényi divergence and Kullback-Leibler divergence, including convexity,

Although the closely related Rényi entropy is well studied [14], [15], the properties of Rényi divergence are scattered throughout the literature and have often only been established for finite alphabets. This paper is intended as a reference document, which treats the most important properties of Rényi divergence in detail, including Kullback-Leibler divergence as a special case. Preliminary versions of the results presented
arXiv:1206.2459v2 [cs.IT] 24 Apr 2014
Fig. 1. Rényi divergence as a function of its order for fixed distributions

whenever this integral is defined. If $P$ has support in an interval $I$ of length $n$ then
\[ h_\alpha(P) = \ln n - D_\alpha(P\|U_I), \tag{3} \]
where $U_I$ denotes the uniform distribution on $I$, and $D_\alpha$ is the generalization of Rényi divergence to densities, which will be defined formally in Section II. Thus the properties of both the Rényi entropy and the differential Rényi entropy can be deduced from the properties of Rényi divergence as long as $P$ has compact support.
There is another way of relating Rényi entropy and Rényi divergence, in which entropy is considered as self-information. Let $X$ denote a discrete random variable with distribution $P$, and let $P_{\mathrm{diag}}$ be the distribution of $(X, X)$. Then
\[ H_\alpha(P) = D_{2-\alpha}(P_{\mathrm{diag}}\|P \times P). \tag{4} \]
For $\alpha$ tending to 1, the right-hand side tends to the mutual information between $X$ and itself, and again a well-known formula is recovered.

B. Special Orders
Although one can define the Rényi divergence of any order, certain values have wider application than others. Of particular interest are the values 0, 1/2, 1, 2, and $\infty$.
The values 0, 1, and $\infty$ are extended orders in the sense that Rényi divergence of these orders cannot be calculated by plugging into (1). Instead, their definitions are determined by continuity in $\alpha$ (see Figure 1). This leads to defining Rényi divergence of order 1 as the Kullback-Leibler divergence. For order 0 it becomes $-\ln Q(\{i \mid p_i > 0\})$, which is closely related to absolute continuity and contiguity of the distributions $P$ and $Q$ (see Section III-F). For order $\infty$, Rényi divergence is defined as $\ln \max_i \frac{p_i}{q_i}$. In the literature on the minimum description length principle in statistics, this is called the worst-case regret of coding with $Q$ rather than with $P$ [3]. The Rényi divergence of order $\infty$ is also related to the separation distance, used by Aldous and Diaconis [23] to bound the rate of convergence to the stationary distribution for certain Markov chains.
Only for $\alpha = 1/2$ is Rényi divergence symmetric in its arguments. Although not itself a metric, it is a function of the squared Hellinger distance $\mathrm{Hel}^2(P, Q) = \sum_{i=1}^n \bigl(p_i^{1/2} - q_i^{1/2}\bigr)^2$ [24]:
\[ D_{1/2}(P\|Q) = -2 \ln\Bigl(1 - \frac{\mathrm{Hel}^2(P, Q)}{2}\Bigr). \tag{5} \]
Similarly, for $\alpha = 2$ it satisfies
\[ D_2(P\|Q) = \ln\bigl(1 + \chi^2(P, Q)\bigr), \tag{6} \]
where $\chi^2(P, Q) = \sum_{i=1}^n \frac{(p_i - q_i)^2}{q_i}$ denotes the $\chi^2$-divergence [24]. It will be shown that Rényi divergence is nondecreasing in its order. Therefore, by $\ln t \le t - 1$, (5) and (6) imply that
\[ \mathrm{Hel}^2(P, Q) \le D_{1/2}(P\|Q) \le D_1(P\|Q) \le D_2(P\|Q) \le \chi^2(P, Q). \tag{7} \]
Finally, Gilardoni [25] shows that Rényi divergence is related to the total variation distance¹ $V(P, Q) = \sum_{i=1}^n |p_i - q_i|$ by a generalization of Pinsker's inequality:
\[ \frac{\alpha}{2} V^2(P, Q) \le D_\alpha(P\|Q) \quad \text{for } \alpha \in (0, 1]. \tag{8} \]
(See Theorem 31 below.) For $\alpha = 1$ this is the normal version of Pinsker's inequality, which bounds total variation distance in terms of the square root of the Kullback-Leibler divergence.

C. Outline
The rest of the paper is organized as follows. First, in Section II, we extend the definition of Rényi divergence from formula (1) to continuous spaces. One can either define Rényi divergence via an integral or via discretizations. We demonstrate that these definitions are equivalent. Then we show that Rényi divergence extends to the extended orders 0, 1 and $\infty$ in the same way as for finite spaces. Along the way, we also study its behaviour as a function of $\alpha$. By contrast, in Section III we study various convexity and continuity properties of Rényi divergence as a function of $P$ and $Q$, while $\alpha$ is kept fixed. We also generalize the Pythagorean inequality to any order $\alpha \in (0, \infty)$. Section IV contains several minimax results, and treats the connection to Chernoff information in hypothesis testing, to which many applications of Rényi divergence are related. We also discuss the equivalence of channel capacity and the minimax redundancy for all orders $\alpha$. Then, in Section V, we show how Rényi divergence extends to negative orders. These are related to the orders $\alpha > 1$ by a negative scaling factor and a reversal of the arguments $P$ and $Q$. Finally, Section VI contains a number of counterexamples, showing that properties that hold for certain other divergences are violated by Rényi divergence.
For fixed $\alpha$, Rényi divergence is related to various forms of power divergences, which are in the well-studied class of $f$-divergences [27]. Consequently, several of the results we are presenting for fixed $\alpha$ in Section III are equivalent to known results about power divergences. To make this presentation self-contained we avoid the use of such connections and only use general results from measure theory.

¹ N.B. It is also common to define the total variation distance as $\frac{1}{2} V(P, Q)$. See the discussion by Pollard [26, p. 60]. Our definition is consistent with the literature on Pinsker's inequality.
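The relations (5) through (8) are easy to check numerically on a small alphabet. The following is only an illustrative sketch, not part of the paper: the helper `renyi` and the two example distributions are assumptions chosen for the demonstration, and the code assumes that every $q_i$ is strictly positive.

```python
import math

def renyi(p, q, alpha):
    """D_alpha(P||Q) in nats for finite distributions with all q_i > 0 (assumed)."""
    if alpha == 1:  # extended order: Kullback-Leibler divergence
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    s = sum(pi**alpha * qi**(1 - alpha) for pi, qi in zip(p, q) if pi > 0)
    return math.log(s) / (alpha - 1)

# Hypothetical example distributions on three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

hel2 = sum((math.sqrt(pi) - math.sqrt(qi))**2 for pi, qi in zip(p, q))
chi2 = sum((pi - qi)**2 / qi for pi, qi in zip(p, q))
tv = sum(abs(pi - qi) for pi, qi in zip(p, q))  # V(P, Q), without a factor 1/2

# Chain (7): Hel^2 <= D_{1/2} <= D_1 <= D_2 <= chi^2
chain = [hel2, renyi(p, q, 0.5), renyi(p, q, 1), renyi(p, q, 2), chi2]
print(chain)

# Pinsker-type bound (8): D_alpha - (alpha/2) V^2 should be nonnegative on (0, 1]
pinsker_gaps = [renyi(p, q, a) - a / 2 * tv**2 for a in (0.25, 0.5, 0.75, 1)]
print(pinsker_gaps)
```

The printed chain should be in nondecreasing order, and each Pinsker gap should be nonnegative.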
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007 3
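Summary

Definition for the simple orders $\alpha \in (0, 1) \cup (1, \infty)$:
\[ D_\alpha(P\|Q) = \frac{1}{\alpha - 1} \ln \int p^\alpha q^{1-\alpha} \, d\mu. \]

For the extended orders (Thms 4–6):
\[ D_0(P\|Q) = -\ln Q(p > 0) \]
\[ D_1(P\|Q) = D(P\|Q) = \text{Kullback-Leibler divergence} \]
\[ D_\infty(P\|Q) = \ln \operatorname*{ess\,sup}_P \frac{p}{q} = \text{worst-case regret}. \]

Equivalent definition via discretization (Thm 10):
\[ D_\alpha(P\|Q) = \sup_{\mathcal{P} \in \text{finite partitions}} D_\alpha(P_{|\mathcal{P}}\|Q_{|\mathcal{P}}). \]

Relations to (differential) Rényi entropy (2), (3), (4): For $\alpha \in [0, \infty]$,
\[ H_\alpha(P) = \ln |\mathcal{X}| - D_\alpha(P\|U) = D_{2-\alpha}(P_{\mathrm{diag}}\|P \times P) \quad \text{for finite } \mathcal{X}, \]
\[ h_\alpha(P) = \ln n - D_\alpha(P\|U_I) \quad \text{if } \mathcal{X} \text{ is an interval } I \text{ of length } n. \]

Relations to other divergences (5)–(7), Remark 1 and Pinsker's inequality (Thm 31):
\[ \mathrm{Hel}^2 \le D_{1/2} \le D \le D_2 \le \chi^2 \]
\[ \frac{\alpha}{2} V^2 \le D_\alpha \quad \text{for } \alpha \in (0, 1]. \]

Relation to Fisher information (Section III-H): For a parametric statistical model $\{P_\theta \mid \theta \in \Theta \subseteq \mathbb{R}\}$ with "sufficiently regular" parametrisation,
\[ \lim_{\theta' \to \theta} \frac{1}{(\theta - \theta')^2} D_\alpha(P_\theta\|P_{\theta'}) = \frac{\alpha}{2} J(\theta) \quad \text{for } \alpha \in (0, \infty). \]

Varying the order (Thms 3, 7, Corollary 2):
• $D_\alpha$ is nondecreasing in $\alpha$, often strictly so.
• $D_\alpha$ is continuous in $\alpha$ on $[0, 1] \cup \{\alpha \in (1, \infty] \mid D_\alpha < \infty\}$.
• $(1 - \alpha) D_\alpha$ is concave in $\alpha$ on $[0, \infty]$.

Positivity (Thm 8) and skew symmetry (Proposition 2):
• $D_\alpha \ge 0$ for $\alpha \in [0, \infty]$, often strictly so.
• $D_\alpha(P\|Q) = \frac{\alpha}{1-\alpha} D_{1-\alpha}(Q\|P)$ for $0 < \alpha < 1$.

Convexity (Thms 11–13): $D_\alpha(P\|Q)$ is
• jointly convex in $(P, Q)$ for $\alpha \in [0, 1]$,
• convex in $Q$ for $\alpha \in [0, \infty]$,
• jointly quasi-convex in $(P, Q)$ for $\alpha \in [0, \infty]$.

Pythagorean inequality (Thm 14): For $\alpha \in (0, \infty)$, let $\mathcal{P}$ be an $\alpha$-convex set of distributions and let $Q$ be an arbitrary distribution. If the $\alpha$-information projection $P^* = \arg\min_{P \in \mathcal{P}} D_\alpha(P\|Q)$ exists, then
\[ D_\alpha(P\|Q) \ge D_\alpha(P\|P^*) + D_\alpha(P^*\|Q) \quad \text{for all } P \in \mathcal{P}. \]

Data processing (Thm 9, Example 2): If we fix the transition probabilities $A(Y|X)$ in a Markov chain $X \to Y$, then
\[ D_\alpha(P_Y\|Q_Y) \le D_\alpha(P_X\|Q_X) \quad \text{for } \alpha \in [0, \infty]. \]

The topology of setwise convergence (Thms 15, 18):
• $D_\alpha(P\|Q)$ is lower semi-continuous in the pair $(P, Q)$ for $\alpha \in (0, \infty]$.
• If $\mathcal{X}$ is finite, then $D_\alpha(P\|Q)$ is continuous in $Q$ for $\alpha \in [0, \infty]$.

The total variation topology (Thm 17, Corollary 1):
• $D_\alpha(P\|Q)$ is uniformly continuous in $(P, Q)$ for $\alpha \in (0, 1)$.
• $D_0(P\|Q)$ is upper semi-continuous in $(P, Q)$.

The weak topology (Thms 19, 20): Suppose $\mathcal{X}$ is a Polish space. Then
• $D_\alpha(P\|Q)$ is lower semi-continuous in the pair $(P, Q)$ for $\alpha \in (0, \infty]$;
• The sublevel set $\{P \mid D_\alpha(P\|Q) \le c\}$ is convex and compact for $c \in [0, \infty)$ and $\alpha \in [1, \infty]$.

Orders $\alpha \in (0, 1)$ are all equivalent (Thm 16):
\[ \frac{\alpha}{\beta} \frac{1-\beta}{1-\alpha} D_\beta \le D_\alpha \le D_\beta \quad \text{for } 0 < \alpha \le \beta < 1. \]

Additivity and other consistent sequences of distributions (Thms 27, 28):
• For arbitrary distributions $P_1, P_2, \ldots$ and $Q_1, Q_2, \ldots$, let $P^N = P_1 \times \cdots \times P_N$ and $Q^N = Q_1 \times \cdots \times Q_N$. Then
\[ \sum_{n=1}^N D_\alpha(P_n\|Q_n) = D_\alpha(P^N\|Q^N) \quad \begin{cases} \text{for } \alpha \in [0, \infty] & \text{if } N < \infty, \\ \text{for } \alpha \in (0, \infty] & \text{if } N = \infty. \end{cases} \]
• Let $P^1, P^2, \ldots$ and $Q^1, Q^2, \ldots$ be consistent sequences of distributions on $n = 1, 2, \ldots$ outcomes. Then
\[ D_\alpha(P^n\|Q^n) \to D_\alpha(P^\infty\|Q^\infty) \quad \text{for } \alpha \in (0, \infty]. \]

Limits of $\sigma$-algebras (Thms 21, 22):
• For $\sigma$-algebras $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots \subseteq \mathcal{F}$ and $\mathcal{F}_\infty = \sigma\bigl(\bigcup_{n=1}^\infty \mathcal{F}_n\bigr)$,
\[ \lim_{n \to \infty} D_\alpha(P_{|\mathcal{F}_n}\|Q_{|\mathcal{F}_n}) = D_\alpha(P_{|\mathcal{F}_\infty}\|Q_{|\mathcal{F}_\infty}) \quad \text{for } \alpha \in (0, \infty]. \]
• For $\sigma$-algebras $\mathcal{F} \supseteq \mathcal{F}_1 \supseteq \mathcal{F}_2 \supseteq \cdots$ and $\mathcal{F}_\infty = \bigcap_{n=1}^\infty \mathcal{F}_n$,
\[ \lim_{n \to \infty} D_\alpha(P_{|\mathcal{F}_n}\|Q_{|\mathcal{F}_n}) = D_\alpha(P_{|\mathcal{F}_\infty}\|Q_{|\mathcal{F}_\infty}) \quad \text{for } \alpha \in [0, 1) \]
and also for $\alpha \in [1, \infty)$ if $D_\alpha(P_{|\mathcal{F}_m}\|Q_{|\mathcal{F}_m}) < \infty$ for some $m$.

Absolute continuity and mutual singularity (Thms 23, 24, 25, 26):
• $P \ll Q$ if and only if $D_0(P\|Q) = 0$.
• $P \perp Q$ if and only if $D_\alpha(P\|Q) = \infty$ for some/all $\alpha \in [0, 1)$.
• These properties generalize to contiguity and entire separation.

Hypothesis testing and Chernoff information (Thms 30, 32): If $\alpha$ is a simple order, then
\[ (1 - \alpha) D_\alpha(P\|Q) = \inf_R \{\alpha D(R\|P) + (1 - \alpha) D(R\|Q)\}. \]
Suppose $D(P\|Q) < \infty$. Then the Chernoff information satisfies
\[ \sup_{\alpha \in (0, \infty)} \inf_R \{\alpha D(R\|P) + (1 - \alpha) D(R\|Q)\} = \inf_R \sup_{\alpha \in (0, \infty)} \{\alpha D(R\|P) + (1 - \alpha) D(R\|Q)\}, \]
and, under regularity conditions, both sides equal $D(P_{\alpha^*}\|P) = D(P_{\alpha^*}\|Q)$.

Channel capacity and minimax redundancy (Thms 34, 36, 37, 38, Lemma 9, Conjecture 1): Suppose $\mathcal{X}$ is finite. Then, for $\alpha \in [0, \infty]$,
• The channel capacity $C_\alpha$ equals the minimax redundancy $R_\alpha$;
• There exists $Q_{\mathrm{opt}}$ such that $\sup_\theta D(P_\theta\|Q_{\mathrm{opt}}) = R_\alpha$;
• If there exists a capacity achieving input distribution $\pi_{\mathrm{opt}}$, then $D(P_\theta\|Q_{\mathrm{opt}}) = R_\alpha$ almost surely for $\theta$ drawn from $\pi_{\mathrm{opt}}$;
• If $\alpha = \infty$ and the maximum likelihood is achieved by $\hat\theta(x)$, then $\pi_{\mathrm{opt}}(\theta) = Q_{\mathrm{opt}}(\{x \mid \hat\theta(x) = \theta\})$ is a capacity achieving input distribution;
Suppose $\mathcal{X}$ is countable and $R_\infty < \infty$. Then, for $\alpha = \infty$, $Q_{\mathrm{opt}}$ is the Shtarkov distribution defined in (66) and
\[ \sup_\theta D_\infty(P_\theta\|Q) = R_\infty + D_\infty(Q_{\mathrm{opt}}\|Q) \quad \text{for all } Q. \]
We conjecture that this generalizes to a one-sided inequality for any $\alpha > 0$.

Negative orders (Lemma 10, Thms 39, 40):
• Results for positive $\alpha$ carry over, but often with reversed properties.
• $D_\alpha$ is nondecreasing in $\alpha$ on $[-\infty, \infty]$.
• $D_\alpha$ is continuous in $\alpha$ on $[0, 1] \cup \{\alpha \mid -\infty < D_\alpha < \infty\}$.

Counterexamples (Section VI):
• $D_\alpha(P\|Q)$ is not convex in $P$ for $\alpha > 1$.
• For $\alpha \in (0, 1)$, $D_\alpha(P\|Q)$ is not continuous in $(P, Q)$ in the topology of setwise convergence.
• $D_\alpha$ is not (the square of) a metric.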
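Two of the summary entries, data processing and additivity over product distributions, can be illustrated on finite alphabets. The sketch below is an assumption-laden demonstration rather than anything taken from the paper: the helpers `renyi`, `push_forward` and `product`, the example channel, and all example distributions are hypothetical, and strictly positive $q_i$ are assumed.

```python
import math

def renyi(p, q, alpha):
    """D_alpha(P||Q) in nats for finite distributions with all q_i > 0 (assumed)."""
    if alpha == 1:  # extended order: Kullback-Leibler divergence
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    s = sum(pi**alpha * qi**(1 - alpha) for pi, qi in zip(p, q) if pi > 0)
    return math.log(s) / (alpha - 1)

def push_forward(p, channel):
    """Distribution of Y when X ~ p and channel[i][j] = A(Y = j | X = i)."""
    return [sum(p[i] * channel[i][j] for i in range(len(p)))
            for j in range(len(channel[0]))]

def product(p1, p2):
    """Product distribution P1 x P2, flattened to a list."""
    return [a * b for a in p1 for b in p2]

# Hypothetical inputs: two distributions on X and a fixed channel X -> Y.
p_x, q_x = [0.5, 0.3, 0.2], [0.2, 0.5, 0.3]
channel = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]
p_y, q_y = push_forward(p_x, channel), push_forward(q_x, channel)

# A second hypothetical pair, used for the additivity check.
p2, q2 = [0.7, 0.3], [0.4, 0.6]

for alpha in (0.5, 1, 2):
    before, after = renyi(p_x, q_x, alpha), renyi(p_y, q_y, alpha)
    joint = renyi(product(p_x, p2), product(q_x, q2), alpha)
    parts = renyi(p_x, q_x, alpha) + renyi(p2, q2, alpha)
    # Processing should not increase the divergence; products should add up.
    print(alpha, after <= before, abs(joint - parts) < 1e-12)
```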
Example 2. The name "data processing inequality" stems from the following application of Theorem 1. Let $X$ and $Y$ be two random variables that form a Markov chain

Proof of Theorem 2: By the data processing inequality
\[ \sup_{\mathcal{P}} D_\alpha(P_{|\mathcal{P}}\|Q_{|\mathcal{P}}) \le D_\alpha(P\|Q). \]
and the Rényi divergence of order $\infty$ is defined as
\[ D_\infty(P\|Q) = \lim_{\alpha \uparrow \infty} D_\alpha(P\|Q). \]
Our definition of $D_0$ follows Csiszár [4]. It differs from Rényi's original definition [1], which uses (9) with $\alpha = 0$ plugged in and is therefore always zero. As illustrated by Section III-F, the present definition is more interesting.
The limits in Definition 3 always exist, because Rényi divergence is nondecreasing in its order:

Theorem 3 (Increasing in the Order). For $\alpha \in [0, \infty]$ the Rényi divergence $D_\alpha(P\|Q)$ is nondecreasing in $\alpha$. On $\mathcal{A} = \{\alpha \in [0, \infty] \mid 0 \le \alpha \le 1 \text{ or } D_\alpha(P\|Q) < \infty\}$ it is constant if and only if $P$ is the conditional distribution $Q(\cdot \mid A)$ for some event $A \in \mathcal{F}$.

Proof: Let $\alpha < \beta$ be simple orders. Then for $x \ge 0$ the function $x \mapsto x^{\frac{\alpha-1}{\beta-1}}$ is strictly convex if $\alpha < 1$ and strictly concave if $\alpha > 1$. Therefore by Jensen's inequality
\[ \frac{1}{\alpha-1} \ln \int p^\alpha q^{1-\alpha} \, d\mu = \frac{1}{\alpha-1} \ln \int \Bigl(\frac{q}{p}\Bigr)^{(1-\beta)\frac{\alpha-1}{\beta-1}} dP \le \frac{1}{\beta-1} \ln \int \Bigl(\frac{q}{p}\Bigr)^{1-\beta} dP. \]
On $\mathcal{A}$, $\int (q/p)^{1-\beta} \, dP$ is finite. As a consequence, Jensen's inequality holds with equality if and only if $(q/p)^{1-\beta}$ is constant $P$-a.s., which is equivalent to $q/p$ being constant $P$-a.s., which in turn means that $P = Q(\cdot \mid A)$ for some event $A$.
From the simple orders, the result extends to the extended orders by the following observations:
\[ D_0(P\|Q) = \inf_{0 < \alpha < 1} D_\alpha(P\|Q), \]
\[ D_1(P\|Q) = \sup_{0 < \alpha < 1} D_\alpha(P\|Q) \le \inf_{\alpha > 1} D_\alpha(P\|Q), \]
\[ D_\infty(P\|Q) = \sup_{\alpha > 1} D_\alpha(P\|Q). \]

Let us verify that the limits in Definition 3 can be expressed in closed form, just like for finite alphabets. We require the following lemma:

Lemma 1. Let $\mathcal{A} = \{\alpha \text{ a simple order} \mid 0 < \alpha < 1 \text{ or } D_\alpha(P\|Q) < \infty\}$. Then, for any sequence $\alpha_1, \alpha_2, \ldots \in \mathcal{A}$ such that $\alpha_n \to \beta \in \mathcal{A} \cup \{0, 1\}$,
\[ \lim_{n \to \infty} \int p^{\alpha_n} q^{1-\alpha_n} \, d\mu = \int \lim_{n \to \infty} p^{\alpha_n} q^{1-\alpha_n} \, d\mu. \tag{16} \]
Our proof extends a proof by Shiryaev [28, pp. 366–367].

Proof: We will verify the conditions for the dominated convergence theorem [28], from which (16) follows. First suppose $0 \le \beta < 1$. Then $0 < \alpha_n < 1$ for all sufficiently large $n$. In this case $p^{\alpha_n} q^{1-\alpha_n}$, which is never negative, does not exceed $\alpha_n p + (1 - \alpha_n) q \le p + q$, and the dominated convergence theorem applies because $\int (p + q) \, d\mu = 2 < \infty$.
Secondly, suppose $\beta \ge 1$. Then there exists a $\gamma \ge \beta$ such that $\gamma \in \mathcal{A} \cup \{1\}$ and $\alpha_n \le \gamma$ for all sufficiently large $n$. If $\gamma = 1$, then $\alpha_n < 1$ and we are done by the same argument as above. So suppose $\gamma > 1$. Then convexity of $p^{\alpha_n} q^{1-\alpha_n}$ in $\alpha_n$ implies that for $\alpha_n \le \gamma$
\[ p^{\alpha_n} q^{1-\alpha_n} \le \Bigl(1 - \frac{\alpha_n}{\gamma}\Bigr) p^0 q^1 + \frac{\alpha_n}{\gamma} p^\gamma q^{1-\gamma} \le q + p^\gamma q^{1-\gamma}. \]
Since $\int q \, d\mu = 1$, it remains to show that $\int p^\gamma q^{1-\gamma} \, d\mu < \infty$, which is implied by $\gamma > 1$ and $D_\gamma(P\|Q) < \infty$.

The closed-form expression for $\alpha = 0$ follows immediately:

Theorem 4 ($\alpha = 0$).
\[ D_0(P\|Q) = -\ln Q(p > 0). \]

Proof of Theorem 4: By Lemma 1 and the fact that $\lim_{\alpha \downarrow 0} p^\alpha q^{1-\alpha} = \mathbf{1}_{\{p > 0\}} q$.

For $\alpha = 1$, the limit in Definition 3 equals the Kullback-Leibler divergence of $P$ from $Q$, which is defined as
\[ D(P\|Q) = \int p \ln \frac{p}{q} \, d\mu, \]
with the conventions that $0 \ln(0/q) = 0$ and $p \ln(p/0) = \infty$ if $p > 0$. Consequently, $D(P\|Q) = \infty$ if $P \not\ll Q$.

Theorem 5 ($\alpha = 1$).
\[ D_1(P\|Q) = D(P\|Q). \tag{17} \]
Moreover, if $D(P\|Q) = \infty$ or there exists a $\beta > 1$ such that $D_\beta(P\|Q) < \infty$, then also
\[ \lim_{\alpha \downarrow 1} D_\alpha(P\|Q) = D(P\|Q). \tag{18} \]

For example, by letting $\alpha \uparrow 1$ in (10) or by direct computation, it can be derived [20] that the Kullback-Leibler divergence between two normal distributions with positive variance is
\[ D_1\bigl(\mathcal{N}(\mu_0, \sigma_0^2)\|\mathcal{N}(\mu_1, \sigma_1^2)\bigr) = \frac{1}{2} \Bigl( \frac{(\mu_1 - \mu_0)^2}{\sigma_1^2} + \ln \frac{\sigma_1^2}{\sigma_0^2} + \frac{\sigma_0^2}{\sigma_1^2} - 1 \Bigr). \]
It is possible that $D_\alpha(P\|Q) = \infty$ for all $\alpha > 1$, but $D(P\|Q) < \infty$, such that (18) does not hold. This situation occurs, for example, if $P$ is doubly exponential on $\mathcal{X} = \mathbb{R}$ with density $p(x) = e^{-2|x|}$ and $Q$ is standard normal with density $q(x) = e^{-x^2/2}/\sqrt{2\pi}$. (Liese and Vajda [27] have previously used these distributions in a similar example.) In this case there is no way to make Rényi divergence continuous in $\alpha$ at $\alpha = 1$, and we opt to define $D_1$ as the limit from below, such that it always equals the Kullback-Leibler divergence.
The proof of Theorem 5 requires an intermediate lemma:

Lemma 2. For any $x > 1/2$
\[ (x - 1)\Bigl(1 + \frac{1 - x}{2}\Bigr) \le \ln x \le x - 1. \]

Proof: By Taylor's theorem with Cauchy's remainder term we have for any positive $x$ that
\[ \ln x = x - 1 - \frac{(x - \xi)(x - 1)}{2\xi^2} = (x - 1)\Bigl(1 + \frac{\xi - x}{2\xi^2}\Bigr) \]
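for some $\xi$ between $x$ and 1. As $\frac{\xi - x}{2\xi^2}$ is increasing in $\xi$ for $x > 1/2$, the lemma follows.

Proof of Theorem 5: Suppose $P \not\ll Q$. Then $D(P\|Q) = \infty = D_\beta(P\|Q)$ for all $\beta > 1$, so (18) holds. Let $x_\alpha = \int p^\alpha q^{1-\alpha} \, d\mu$. Then $\lim_{\alpha \uparrow 1} x_\alpha = P(q > 0)$ by Lemma 1, and hence (17) follows by
\[ \lim_{\alpha \uparrow 1} \frac{1}{\alpha - 1} \ln \int p^\alpha q^{1-\alpha} \, d\mu = \lim_{\alpha \uparrow 1} \frac{1}{\alpha - 1} \ln P(q > 0) = \infty = D(P\|Q). \]
Alternatively, suppose $P \ll Q$. Then $\lim_{\alpha \uparrow 1} x_\alpha = 1$ and therefore Lemma 2 implies that
\[ \lim_{\alpha \uparrow 1} D_\alpha(P\|Q) = \lim_{\alpha \uparrow 1} \frac{1}{\alpha - 1} \ln x_\alpha = \lim_{\alpha \uparrow 1} \frac{x_\alpha - 1}{\alpha - 1} = \lim_{\alpha \uparrow 1} \int_{p,q>0} \frac{p - p^\alpha q^{1-\alpha}}{1 - \alpha} \, d\mu, \tag{19} \]
where the restriction of the domain of integration is allowed because $q = 0$ implies $p = 0$ ($\mu$-a.s.) by $P \ll Q$. Convexity of $p^\alpha q^{1-\alpha}$ in $\alpha$ implies that its derivative, $p^\alpha q^{1-\alpha} \ln \frac{p}{q}$, is nondecreasing and therefore for $p, q > 0$
\[ \frac{p - p^\alpha q^{1-\alpha}}{1 - \alpha} = \frac{1}{1 - \alpha} \int_\alpha^1 p^z q^{1-z} \ln \frac{p}{q} \, dz \]
is nondecreasing in $\alpha$, and $\frac{p - p^\alpha q^{1-\alpha}}{1 - \alpha} \ge \frac{p - p^0 q^{1-0}}{1 - 0} = p - q$. As $\int_{p,q>0} (p - q) \, d\mu > -\infty$, it follows by the monotone convergence theorem that
\[ \lim_{\alpha \uparrow 1} \int_{p,q>0} \frac{p - p^\alpha q^{1-\alpha}}{1 - \alpha} \, d\mu = \int_{p,q>0} \lim_{\alpha \uparrow 1} \frac{p - p^\alpha q^{1-\alpha}}{1 - \alpha} \, d\mu = \int_{p,q>0} p \ln \frac{p}{q} \, d\mu = D(P\|Q), \]
which together with (19) proves (17). If $D(P\|Q) = \infty$, then $D_\beta(P\|Q) \ge D(P\|Q) = \infty$ for all $\beta > 1$ and (18) holds. It remains to prove (18) if there exists a $\beta > 1$ such that $D_\beta(P\|Q) < \infty$. In this case, arguments similar to the ones above imply that
\[ \lim_{\alpha \downarrow 1} D_\alpha(P\|Q) = \lim_{\alpha \downarrow 1} \int_{p,q>0} \frac{p^\alpha q^{1-\alpha} - p}{\alpha - 1} \, d\mu \tag{20} \]
and $\frac{p^\alpha q^{1-\alpha} - p}{\alpha - 1}$ is nondecreasing in $\alpha$. Therefore $\frac{p^\alpha q^{1-\alpha} - p}{\alpha - 1} \le \frac{p^\beta q^{1-\beta} - p}{\beta - 1} \le \frac{p^\beta q^{1-\beta}}{\beta - 1}$ and, as $\int_{p,q>0} \frac{p^\beta q^{1-\beta}}{\beta - 1} \, d\mu < \infty$ is implied by $D_\beta(P\|Q) < \infty$, it follows by the monotone convergence theorem that
\[ \lim_{\alpha \downarrow 1} \int_{p,q>0} \frac{p^\alpha q^{1-\alpha} - p}{\alpha - 1} \, d\mu = \int_{p,q>0} \lim_{\alpha \downarrow 1} \frac{p^\alpha q^{1-\alpha} - p}{\alpha - 1} \, d\mu = \int_{p,q>0} p \ln \frac{p}{q} \, d\mu = D(P\|Q), \]
which together with (20) completes the proof.

For any random variable $X$, the essential supremum of $X$ with respect to $P$ is $\operatorname*{ess\,sup}_P X = \sup\{c \mid P(X > c) > 0\}$.

Theorem 6 ($\alpha = \infty$).
\[ D_\infty(P\|Q) = \ln \sup_{A \in \mathcal{F}} \frac{P(A)}{Q(A)} = \ln \Bigl( \operatorname*{ess\,sup}_P \frac{p}{q} \Bigr), \]
with the conventions that $0/0 = 0$ and $x/0 = \infty$ if $x > 0$.

If the sample space $\mathcal{X}$ is countable, then with the notational conventions of this theorem the essential supremum reduces to an ordinary supremum, and we have $D_\infty(P\|Q) = \ln \sup_x \frac{P(x)}{Q(x)}$.

Proof: If $\mathcal{X}$ contains a finite number of elements $n$, then
\[ D_\infty(P\|Q) = \lim_{\alpha \uparrow \infty} \frac{1}{\alpha - 1} \ln \sum_{i=1}^n p_i^\alpha q_i^{1-\alpha} = \ln \max_i \frac{p_i}{q_i} = \ln \max_{A \subseteq \mathcal{X}} \frac{P(A)}{Q(A)}. \]
This extends to arbitrary measurable spaces $(\mathcal{X}, \mathcal{F})$ by Theorem 2:
\[ D_\infty(P\|Q) = \sup_{\alpha < \infty} \sup_{\mathcal{P}} D_\alpha(P_{|\mathcal{P}}\|Q_{|\mathcal{P}}) = \sup_{\mathcal{P}} \sup_{\alpha < \infty} D_\alpha(P_{|\mathcal{P}}\|Q_{|\mathcal{P}}) = \sup_{\mathcal{P}} \ln \max_{A \in \mathcal{P}} \frac{P(A)}{Q(A)} = \ln \sup_{A \in \mathcal{F}} \frac{P(A)}{Q(A)}, \]
where $\mathcal{P}$ ranges over all finite partitions in $\mathcal{F}$.
Now if $P \not\ll Q$, then there exists an event $B \in \mathcal{F}$ such that $P(B) > 0$ but $Q(B) = 0$, and
\[ P\Bigl(\frac{p}{q} = \infty\Bigr) = P(q = 0) \ge P(B) > 0 \]
implies that $\operatorname{ess\,sup} p/q = \infty = \sup_A \frac{P(A)}{Q(A)}$. Alternatively, suppose that $P \ll Q$. Then
\[ P(A) = \int_{A \cap \{q > 0\}} p \, d\mu \le \int_{A \cap \{q > 0\}} \operatorname{ess\,sup} \frac{p}{q} \cdot q \, d\mu = \operatorname{ess\,sup} \frac{p}{q} \cdot Q(A) \]
for all $A \in \mathcal{F}$ and it follows that
\[ \sup_{A \in \mathcal{F}} \frac{P(A)}{Q(A)} \le \operatorname{ess\,sup} \frac{p}{q}. \tag{21} \]
Let $a < \operatorname{ess\,sup} p/q$ be arbitrary. Then there exists a set $A \in \mathcal{F}$ with $P(A) > 0$ such that $p/q \ge a$ on $A$ and therefore
\[ P(A) = \int_A p \, d\mu \ge \int_A a \cdot q \, d\mu = a \cdot Q(A). \]
Thus $\sup_{A \in \mathcal{F}} \frac{P(A)}{Q(A)} \ge a$ for any $a < \operatorname{ess\,sup} p/q$, which implies that
\[ \sup_{A \in \mathcal{F}} \frac{P(A)}{Q(A)} \ge \operatorname{ess\,sup} \frac{p}{q}. \]
In combination with (21) this completes the proof.

Taken together, the previous results imply that Rényi divergence is a continuous function of its order $\alpha$ (under suitable conditions):

Theorem 7 (Continuity in the Order). The Rényi divergence $D_\alpha(P\|Q)$ is continuous in $\alpha$ on $\mathcal{A} = \{\alpha \in [0, \infty] \mid 0 \le \alpha \le 1 \text{ or } D_\alpha(P\|Q) < \infty\}$.

Proof: Continuity at any simple order $\beta$ follows by Lemma 1. It extends to the extended orders 0 and $\infty$ by the definition of Rényi divergence at these orders. And it extends to $\alpha = 1$ by Theorem 5.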
Equality holds if and only if
$\alpha = 0$: $D_0(P_0\|Q_0) = D_0(P_1\|Q_1)$, $p_0 = 0 \Rightarrow p_1 = 0$ ($Q_0$-a.s.) and $p_1 = 0 \Rightarrow p_0 = 0$ ($Q_1$-a.s.);
$0 < \alpha < 1$: $D_\alpha(P_0\|Q_0) = D_\alpha(P_1\|Q_1)$ and $p_0 q_1 = p_1 q_0$ ($\mu$-a.s.);
$\alpha = 1$: $p_0 q_1 = p_1 q_0$ ($\mu$-a.s.)

Proof: Suppose first that $\alpha = 0$, and let $P_\lambda = (1 - \lambda) P_0 + \lambda P_1$ and $Q_\lambda = (1 - \lambda) Q_0 + \lambda Q_1$. Then
\[ (1 - \lambda) \ln Q_0(p_0 > 0) + \lambda \ln Q_1(p_1 > 0) \le \ln\bigl((1 - \lambda) Q_0(p_0 > 0) + \lambda Q_1(p_1 > 0)\bigr) \le \ln Q_\lambda(p_0 > 0 \text{ or } p_1 > 0) = \ln Q_\lambda(p_\lambda > 0). \]
Equality holds if and only if, for the first inequality, $Q_0(p_0 > 0) = Q_1(p_1 > 0)$ and, for the second inequality, $p_1 > 0 \Rightarrow p_0 > 0$ ($Q_0$-a.s.) and $p_0 > 0 \Rightarrow p_1 > 0$ ($Q_1$-a.s.) These conditions are equivalent to the equality conditions of the theorem.
Alternatively, suppose $\alpha > 0$. We will show that point-wise
\[ \begin{aligned} (1 - \lambda) p_0^\alpha q_0^{1-\alpha} + \lambda p_1^\alpha q_1^{1-\alpha} &\le p_\lambda^\alpha q_\lambda^{1-\alpha} && (0 < \alpha < 1); \\ (1 - \lambda) p_0 \ln \frac{p_0}{q_0} + \lambda p_1 \ln \frac{p_1}{q_1} &\ge p_\lambda \ln \frac{p_\lambda}{q_\lambda} && (\alpha = 1), \end{aligned} \tag{24} \]
where $p_\lambda = (1 - \lambda) p_0 + \lambda p_1$ and $q_\lambda = (1 - \lambda) q_0 + \lambda q_1$. For $\alpha = 1$, (23) then follows directly; for $0 < \alpha < 1$, (23) follows from (24) by Jensen's inequality:
\[ (1 - \lambda) \ln \int p_0^\alpha q_0^{1-\alpha} \, d\mu + \lambda \ln \int p_1^\alpha q_1^{1-\alpha} \, d\mu \le \ln\Bigl((1 - \lambda) \int p_0^\alpha q_0^{1-\alpha} \, d\mu + \lambda \int p_1^\alpha q_1^{1-\alpha} \, d\mu\Bigr). \tag{25} \]
If one of $p_0$, $p_1$, $q_0$ and $q_1$ is zero, then (24) can be verified directly. So assume that they are all positive. Then for $0 < \alpha < 1$ let $f(x) = -x^\alpha$ and for $\alpha = 1$ let $f(x) = x \ln x$, such that (24) can be written as
\[ \frac{(1 - \lambda) q_0}{q_\lambda} f\Bigl(\frac{p_0}{q_0}\Bigr) + \frac{\lambda q_1}{q_\lambda} f\Bigl(\frac{p_1}{q_1}\Bigr) \ge f\Bigl(\frac{p_\lambda}{q_\lambda}\Bigr). \]
(24) is established by recognising this as an application of Jensen's inequality to the strictly convex function $f$. Regardless of whether any of $p_0$, $p_1$, $q_0$ and $q_1$ is zero, equality holds in (24) if and only if $p_0 q_1 = p_1 q_0$. Equality holds in (25) if and only if $\int p_0^\alpha q_0^{1-\alpha} \, d\mu = \int p_1^\alpha q_1^{1-\alpha} \, d\mu$, which is equivalent to $D_\alpha(P_0\|Q_0) = D_\alpha(P_1\|Q_1)$.

Joint convexity in $P$ and $Q$ breaks down for $\alpha > 1$ (see Section VI-A), but some partial convexity properties can still be salvaged. First, convexity in the second argument does hold for all $\alpha$ [4]:

Theorem 12. For any order $\alpha \in [0, \infty]$ Rényi divergence is convex in its second argument. That is, for any probability distributions $P$, $Q_0$ and $Q_1$
\[ D_\alpha(P\|(1 - \lambda) Q_0 + \lambda Q_1) \le (1 - \lambda) D_\alpha(P\|Q_0) + \lambda D_\alpha(P\|Q_1) \tag{26} \]
for any $0 < \lambda < 1$. For finite $\alpha$, equality holds if and only if
$\alpha = 0$: $D_0(P\|Q_0) = D_0(P\|Q_1)$;
$0 < \alpha < \infty$: $q_0 = q_1$ ($P$-a.s.)

Proof: For $\alpha \in [0, 1]$ this follows from the previous theorem. (For $P_0 = P_1$ the equality conditions reduce to the ones given here.) For $\alpha \in (1, \infty)$, let $Q_\lambda = (1 - \lambda) Q_0 + \lambda Q_1$ and define $f(x, Q_\lambda) = (p(x)/q_\lambda(x))^{\alpha-1}$. It is sufficient to show that
\[ \ln E_{X \sim P}[f(X, Q_\lambda)] \le (1 - \lambda) \ln E_{X \sim P}[f(X, Q_0)] + \lambda \ln E_{X \sim P}[f(X, Q_1)]. \]
Noting that, for every $x \in \mathcal{X}$, $f(x, Q)$ is log-convex in $Q$, this is a consequence of the general fact that an expectation over log-convex functions is itself log-convex, which can be shown using Hölder's inequality:
\[ E_P[f(X, Q_\lambda)] \le E_P[f(X, Q_0)^{1-\lambda} f(X, Q_1)^\lambda] \le E_P[f(X, Q_0)]^{1-\lambda} E_P[f(X, Q_1)]^\lambda. \]
Taking logarithms completes the proof of (26). Equality holds in the first inequality if and only if $q_0 = q_1$ ($P$-a.s.), which is also sufficient for equality in the second inequality. Finally, (26) extends to $\alpha = \infty$ by letting $\alpha$ tend to $\infty$.

And secondly, Rényi divergence is jointly quasi-convex in both arguments for all $\alpha$:

Theorem 13. For any order $\alpha \in [0, \infty]$ Rényi divergence is jointly quasi-convex in its arguments. That is, for any two pairs of probability distributions $(P_0, Q_0)$ and $(P_1, Q_1)$, and any $\lambda \in (0, 1)$
\[ D_\alpha\bigl((1 - \lambda) P_0 + \lambda P_1 \| (1 - \lambda) Q_0 + \lambda Q_1\bigr) \le \max\{D_\alpha(P_0\|Q_0), D_\alpha(P_1\|Q_1)\}. \tag{27} \]

Proof: For $\alpha \in [0, 1]$, quasi-convexity is implied by convexity. For $\alpha \in (1, \infty)$, strict monotonicity of $x \mapsto \frac{1}{\alpha - 1} \ln x$ implies that quasi-convexity is equivalent to quasi-convexity of the Hellinger integral $\int p^\alpha q^{1-\alpha} \, d\mu$. Since quasi-convexity is implied by ordinary convexity, it is sufficient to establish that the Hellinger integral is jointly convex in $P$ and $Q$. Let $p_\lambda = (1 - \lambda) p_0 + \lambda p_1$ and $q_\lambda = (1 - \lambda) q_0 + \lambda q_1$. Then joint convexity of the Hellinger integral is implied by the pointwise inequality
\[ (1 - \lambda) p_0^\alpha q_0^{1-\alpha} + \lambda p_1^\alpha q_1^{1-\alpha} \ge p_\lambda^\alpha q_\lambda^{1-\alpha}, \]
which holds by essentially the same argument as for (24) in the proof of Theorem 11, with the convex function $f(x) = x^\alpha$.
Finally, the case $\alpha = \infty$ follows by letting $\alpha$ tend to $\infty$:
\[ \begin{aligned} D_\infty\bigl((1 - \lambda) P_0 + \lambda P_1 \| (1 - \lambda) Q_0 + \lambda Q_1\bigr) &= \sup_{\alpha < \infty} D_\alpha\bigl((1 - \lambda) P_0 + \lambda P_1 \| (1 - \lambda) Q_0 + \lambda Q_1\bigr) \\ &\le \sup_{\alpha < \infty} \max\{D_\alpha(P_0\|Q_0), D_\alpha(P_1\|Q_1)\} \\ &= \max\Bigl\{\sup_{\alpha < \infty} D_\alpha(P_0\|Q_0), \sup_{\alpha < \infty} D_\alpha(P_1\|Q_1)\Bigr\} \\ &= \max\{D_\infty(P_0\|Q_0), D_\infty(P_1\|Q_1)\}. \end{aligned} \]
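Convexity in the second argument (26) and joint quasi-convexity (27) can be spot-checked on a grid of mixing weights. The following sketch is illustrative only and proves nothing: `renyi`, `mix`, the chosen orders and the four example distributions are all assumptions, and strictly positive probabilities are assumed throughout.

```python
import math

def renyi(p, q, alpha):
    """D_alpha(P||Q) in nats for finite distributions with all q_i > 0 (assumed)."""
    if alpha == 1:  # extended order: Kullback-Leibler divergence
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    s = sum(pi**alpha * qi**(1 - alpha) for pi, qi in zip(p, q) if pi > 0)
    return math.log(s) / (alpha - 1)

def mix(a, b, lam):
    """Ordinary mixture (1 - lam) * a + lam * b of two distributions."""
    return [(1 - lam) * x + lam * y for x, y in zip(a, b)]

# Hypothetical pairs (P0, Q0) and (P1, Q1).
p0, q0 = [0.5, 0.3, 0.2], [0.2, 0.5, 0.3]
p1, q1 = [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]
lams = [k / 10 for k in range(1, 10)]
tol = 1e-12  # slack for floating-point rounding

results = {}
for alpha in (0.5, 1, 2, 5):
    # (26): convexity in the second argument, with P fixed to P0.
    convex_in_q = all(
        renyi(p0, mix(q0, q1, l), alpha)
        <= (1 - l) * renyi(p0, q0, alpha) + l * renyi(p0, q1, alpha) + tol
        for l in lams)
    # (27): joint quasi-convexity in the pair (P, Q).
    quasi_convex = all(
        renyi(mix(p0, p1, l), mix(q0, q1, l), alpha)
        <= max(renyi(p0, q0, alpha), renyi(p1, q1, alpha)) + tol
        for l in lams)
    results[alpha] = (convex_in_q, quasi_convex)

print(results)
```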
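C. A Generalized Pythagorean Inequality
An important result in statistical applications of information theory is the Pythagorean inequality for Kullback-Leibler divergence [30]–[32]. It states that, if $\mathcal{P}$ is a convex set of distributions, $Q$ is any distribution not in $\mathcal{P}$, and $D_{\min} = \inf_{P \in \mathcal{P}} D(P\|Q)$, then there exists a distribution $P^*$ such that
\[ D(P\|Q) \ge D(P\|P^*) + D_{\min} \quad \text{for all } P \in \mathcal{P}. \tag{28} \]
The main use of the Pythagorean inequality lies in its implication that if $P_1, P_2, \ldots$ is a sequence of distributions in $\mathcal{P}$ such that $D(P_n\|Q) \to D_{\min}$, then $P_n$ converges to $P^*$ in the strong sense that $D(P_n\|P^*) \to 0$.
For $\alpha \ne 1$ Rényi divergence does not satisfy the ordinary Pythagorean inequality, but there does exist a generalization if we replace convexity of $\mathcal{P}$ by the following alternative notion of convexity:

Definition 4. For $\alpha \in (0, \infty)$, we will call a set of distributions $\mathcal{P}$ $\alpha$-convex if, for any probability distribution $\lambda = (\lambda_1, \lambda_2)$ and any two distributions $P_1, P_2 \in \mathcal{P}$, we also have $P_\lambda \in \mathcal{P}$, where $P_\lambda$ is the $(\alpha, \lambda)$-mixture of $P_1$ and $P_2$, which will be defined below.

For $\alpha = 1$, the $(\alpha, \lambda)$-mixture is simply the ordinary mixture $\lambda_1 P_1 + \lambda_2 P_2$, so that 1-convexity is equivalent to ordinary convexity. We generalize this to other $\alpha$ as follows:

Definition 5. Let $\alpha \in (0, \infty)$ and let $P_1, \ldots, P_m$ be any probability distributions. Then for any probability distribution $\lambda = (\lambda_1, \ldots, \lambda_m)$ we define the $(\alpha, \lambda)$-mixture $P_\lambda$ of $P_1, \ldots, P_m$ as the distribution with density
\[ p_\lambda = \frac{\bigl(\sum_{\theta=1}^m \lambda_\theta p_\theta^\alpha\bigr)^{1/\alpha}}{Z}, \quad \text{where } Z = \int \Bigl(\sum_{\theta=1}^m \lambda_\theta p_\theta^\alpha\Bigr)^{1/\alpha} d\mu \tag{29} \]
is a normalizing constant.

The normalizing constant $Z$ is always well defined:

Lemma 3. The normalizing constant $Z$ in (29) is bounded by
\[ Z \in \begin{cases} [m^{-(1-\alpha)/\alpha}, 1] & \text{for } \alpha \in (0, 1], \\ [1, m^{(\alpha-1)/\alpha}] & \text{for } \alpha \in [1, \infty). \end{cases} \tag{30} \]

Proof: For $\alpha = 1$, we have $Z = 1$, as required. So it remains to consider the simple orders. Let $f(y) = y^{1/\alpha}$ for $y \ge 0$, so that $Z = \int f\bigl(\sum_\theta \lambda_\theta p_\theta^\alpha\bigr) \, d\mu$. Suppose first that $\alpha \in (0, 1)$. Then $f$ is convex, which implies that $f(a + b) - f(a) \ge f(b) - f(0) = f(b)$ for any $a, b$, so that, by induction, $f\bigl(\sum_\theta a_\theta\bigr) \ge \sum_\theta f(a_\theta)$ for any $a_\theta$. Taking $a_\theta = \lambda_\theta p_\theta^\alpha$ and using Jensen's inequality, we find:
\[ \sum_\theta f\bigl(\lambda_\theta p_\theta^\alpha\bigr) \le f\Bigl(\sum_\theta \lambda_\theta p_\theta^\alpha\Bigr) \le \sum_\theta \lambda_\theta f(p_\theta^\alpha) \]
\[ \sum_\theta \lambda_\theta^{1/\alpha} p_\theta \le \Bigl(\sum_\theta \lambda_\theta p_\theta^\alpha\Bigr)^{1/\alpha} \le \sum_\theta \lambda_\theta p_\theta. \]
Since every $p_\theta$ integrates to 1, it follows that
\[ \sum_\theta \lambda_\theta^{1/\alpha} \le Z_\lambda \le 1. \]
The left-hand side is minimized at $\lambda = 1/m$, where it equals $m^{-(1-\alpha)/\alpha}$, which completes the proof for $\alpha \in (0, 1)$. The proof for $\alpha \in (1, \infty)$ goes the same way, except that all inequalities are reversed because $f$ is concave.

And, like for $\alpha = 1$, the set of $(\alpha, \lambda)$-mixtures is closed under taking further mixtures of its elements:

Lemma 4. Let $\alpha \in (0, \infty)$, let $P_1, \ldots, P_m$ be arbitrary probability distributions and let $P_{\lambda_1}$ and $P_{\lambda_2}$ be their $(\alpha, \lambda_1)$- and $(\alpha, \lambda_2)$-mixtures for some distributions $\lambda_1, \lambda_2$. Then, for any distribution $\gamma = (\gamma_1, \gamma_2)$, the $(\alpha, \gamma)$-mixture of $P_{\lambda_1}$ and $P_{\lambda_2}$ is an $(\alpha, \nu)$-mixture of $P_1, \ldots, P_m$ for the distribution $\nu$ such that
\[ \nu = \frac{\gamma_1}{Z_1^\alpha C} \lambda_1 + \frac{\gamma_2}{Z_2^\alpha C} \lambda_2, \tag{31} \]
where $C = \frac{\gamma_1}{Z_1^\alpha} + \frac{\gamma_2}{Z_2^\alpha}$, and $Z_1$ and $Z_2$ are the normalizing constants of $P_{\lambda_1}$ and $P_{\lambda_2}$ as defined in (29).

Proof: Let $M_\gamma$ be the $(\alpha, \gamma)$-mixture of $P_{\lambda_1}$ and $P_{\lambda_2}$, and take $\lambda_i = (\lambda_{i,1}, \ldots, \lambda_{i,m})$. Then
\[ \begin{aligned} m_\gamma &\propto \bigl(\gamma_1 p_{\lambda_1}^\alpha + \gamma_2 p_{\lambda_2}^\alpha\bigr)^{1/\alpha} \\ &= \Bigl(\frac{\gamma_1}{Z_1^\alpha} \sum_\theta \lambda_{1,\theta} p_\theta^\alpha + \frac{\gamma_2}{Z_2^\alpha} \sum_\theta \lambda_{2,\theta} p_\theta^\alpha\Bigr)^{1/\alpha} \\ &\propto \Bigl(\sum_\theta \frac{\frac{\gamma_1 \lambda_{1,\theta}}{Z_1^\alpha} + \frac{\gamma_2 \lambda_{2,\theta}}{Z_2^\alpha}}{C} \, p_\theta^\alpha\Bigr)^{1/\alpha}, \end{aligned} \]
from which the result follows.

We are now ready to generalize the Pythagorean inequality to any $\alpha \in (0, \infty)$:

Theorem 14 (Pythagorean Inequality). Let $\alpha \in (0, \infty)$. Suppose that $\mathcal{P}$ is an $\alpha$-convex set of distributions. Let $Q$ be an arbitrary distribution and suppose that the $\alpha$-information projection
\[ P^* = \operatorname*{arg\,min}_{P \in \mathcal{P}} D_\alpha(P\|Q) \tag{32} \]
exists. Then we have the Pythagorean inequality
\[ D_\alpha(P\|Q) \ge D_\alpha(P\|P^*) + D_\alpha(P^*\|Q) \quad \text{for all } P \in \mathcal{P}. \tag{33} \]
This result is new, although the work of Sundaresan on a generalization of Rényi divergence might be related [33], [34]. Our proof follows the same approach as the proof for $\alpha = 1$ by Cover and Thomas [30].

Proof: For $\alpha = 1$, this is just the standard Pythagorean inequality for Kullback-Leibler divergence. See, for example, the proof by Topsøe [32]. It remains to prove the theorem when $\alpha$ is a simple order.
Let $P \in \mathcal{P}$ be arbitrary, and let $P_\lambda$ be the $\bigl(\alpha, (1 - \lambda, \lambda)\bigr)$-mixture of $P^*$ and $P$. Since $\mathcal{P}$ is $\alpha$-convex and $P^*$ is the minimizer over $\mathcal{P}$, we have $\frac{d}{d\lambda} D_\alpha(P_\lambda\|Q)\big|_{\lambda=0} \ge 0$. This derivative evaluates to:
\[ \frac{d}{d\lambda} D_\alpha(P_\lambda\|Q) = \frac{1}{\alpha - 1} \frac{\int p^\alpha q^{1-\alpha} \, d\mu - \int (p^*)^\alpha q^{1-\alpha} \, d\mu}{Z_\lambda^\alpha \int p_\lambda^\alpha q^{1-\alpha} \, d\mu} - \frac{\alpha}{\alpha - 1} \frac{(1 - \lambda) \int (p^*)^\alpha q^{1-\alpha} \, d\mu + \lambda \int p^\alpha q^{1-\alpha} \, d\mu}{Z_\lambda^{\alpha+1} \int p_\lambda^\alpha q^{1-\alpha} \, d\mu} \frac{d}{d\lambda} Z_\lambda. \]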
1/α
Let Xλ = (1 − λ)(p∗ )α + λpα
R
, so that Zλ = Xλ dµ. convergence in the sense that convergence in total variation
If α ∈ (0, 1), then Xλ is convex in λ so that Xλ −X λ
0
is distance implies convergence on any A ∈ F. The two
nondecreasing in λ, and if α ∈ (0, ∞), then Xλ is concave in topologies coincide if the sample space X is countable.
λ so Rthat Xλ −X
λ
0
is nonincreasing. By Lemma 3, we also see In general, Rényi divergence is lower semi-continuous for
Xλ −X0 Zλ −1 positive orders:
that λ dµ = λ is bounded by 0 for λ > 0, from
above if α ∈ (0, 1) and from below if α ∈ (1, ∞). It therefore
Theorem 15. For any order α ∈ (0, ∞], Dα (P kQ) is a lower
follows from the monotone convergence theorem that
semi-continuous function of the pair (P, Q) in the topology of
Zλ − Z0 Xλ − X0
Z
d setwise convergence.
Zλ λ=0 = lim = lim dµ
dλ λ↓0 λ λ↓0 λ
Proof: Suppose X = {x1 , . . . , xk } is finite. Then for any
Xλ − X0
Z Z
d
= lim dµ = Xλ λ=0 dµ, simple order α
λ↓0 λ dλ
k
where 1 X
Dα (P kQ) = ln pα q 1−α ,
d 1 1/α−1 α − 1 i=1 i i
Xλ = (1 − λ)(p∗ )α + λpα (pα − (p∗ )α ).
dλ α 1−α
where pi = P (xi ) and qi = Q(xi ). If 0 < α < 1, then pα i qi
If Dα (P ∗ kQ) = ∞, then the theorem is trivially true, so we
is continuous in (P, Q). For 1 < α < ∞, it is only discontinu-
may assume without lossR of generality that Dα (P ∗ kQ) < ∞, 1−α 1−α
ous at pi = qi = 0, but there pα i qi = 0 = min(P,Q) pαi qi ,
which implies that $0 < \int (p^*)^\alpha q^{1-\alpha}\,d\mu < \infty$.

Putting everything together, we therefore find

$$
\begin{aligned}
0 &\le \frac{d}{d\lambda} D_\alpha(P_\lambda\|Q)\Big|_{\lambda=0} \\
&= \frac{1}{\alpha-1}\,\frac{\int p^\alpha q^{1-\alpha}\,d\mu - \int (p^*)^\alpha q^{1-\alpha}\,d\mu}{\int (p^*)^\alpha q^{1-\alpha}\,d\mu} - \frac{1}{\alpha-1}\int (p^*)^{1-\alpha}\bigl(p^\alpha - (p^*)^\alpha\bigr)\,d\mu \\
&= \frac{1}{\alpha-1}\left(\frac{\int p^\alpha q^{1-\alpha}\,d\mu}{\int (p^*)^\alpha q^{1-\alpha}\,d\mu} - \int (p^*)^{1-\alpha} p^\alpha\,d\mu\right).
\end{aligned}
$$

Hence, if $\alpha > 1$ we have

$$\int p^\alpha q^{1-\alpha}\,d\mu \ge \int (p^*)^\alpha q^{1-\alpha}\,d\mu \int (p^*)^{1-\alpha} p^\alpha\,d\mu,$$

and if $\alpha < 1$ we have the converse of this inequality. In both cases, the Pythagorean inequality (33) follows upon taking logarithms and dividing by $\alpha - 1$ (which flips the inequality sign for $\alpha < 1$).

D. Continuity

In this section we study continuity properties of the Rényi divergence $D_\alpha(P\|Q)$ of different orders in the pair of probability distributions $(P, Q)$. It turns out that continuity depends on the order $\alpha$ and the topology on the set of all probability distributions.

The set of probability distributions on $(\mathcal{X}, \mathcal{F})$ may be equipped with the topology of setwise convergence, which is the coarsest topology such that, for any event $A \in \mathcal{F}$, the function $P \mapsto P(A)$ that maps a distribution to its probability on $A$, is continuous. In this topology, convergence of a sequence of probability distributions $P_1, P_2, \ldots$ to a probability distribution $P$ means that $P_n(A) \to P(A)$ for any $A \in \mathcal{F}$.

Alternatively, one might consider the topology defined by the total variation distance

$$V(P, Q) = \int |p - q|\,d\mu = 2 \sup_{A \in \mathcal{F}} |P(A) - Q(A)|, \tag{34}$$

in which $P_n \to P$ means that $V(P_n, P) \to 0$. The total variation topology is stronger than the topology of setwise

so then $p_i^\alpha q_i^{1-\alpha}$ is still lower semi-continuous. These properties carry over to $\sum_{i=1}^k p_i^\alpha q_i^{1-\alpha}$ and thus $D_\alpha(P\|Q)$ is continuous for $0 < \alpha < 1$ and lower semi-continuous for $\alpha > 1$. A supremum over (lower semi-)continuous functions is itself lower semi-continuous. Therefore, for simple orders $\alpha$, Theorem 2 implies that $D_\alpha(P\|Q)$ is lower semi-continuous for arbitrary $\mathcal{X}$. This property extends to the extended orders 1 and $\infty$ by $D_\beta(P\|Q) = \sup_{\alpha<\beta} D_\alpha(P\|Q)$ for $\beta \in \{1, \infty\}$.

Moreover, if $\alpha \in (0, 1)$ and the total variation topology is assumed, then Theorem 17 below shows that Rényi divergence is uniformly continuous.

First we prove that the topologies induced by Rényi divergences of orders $\alpha \in (0, 1)$ are all equivalent:

Theorem 16. For any $0 < \alpha \le \beta < 1$

$$\frac{\alpha}{\beta}\,\frac{1-\beta}{1-\alpha}\, D_\beta(P\|Q) \le D_\alpha(P\|Q) \le D_\beta(P\|Q).$$

This follows from the following symmetry-like property, which may be verified directly.

Proposition 2 (Skew Symmetry). For any $0 < \alpha < 1$

$$D_\alpha(P\|Q) = \frac{\alpha}{1-\alpha}\, D_{1-\alpha}(Q\|P).$$

Note that, in particular, Rényi divergence is symmetric for $\alpha = 1/2$, but that skew symmetry does not hold for $\alpha = 0$ and $\alpha = 1$.

Proof of Theorem 16: We have already established the second inequality in Theorem 3, so it remains to prove the first one. Skew symmetry implies that

$$\frac{1-\alpha}{\alpha}\, D_\alpha(P\|Q) = D_{1-\alpha}(Q\|P) \ge D_{1-\beta}(Q\|P) = \frac{1-\beta}{\beta}\, D_\beta(P\|Q),$$

from which the result follows.

Remark 2. By (5), these results show that, for $\alpha \in (0, 1)$, $D_\alpha(P_n\|Q) \to 0$ is equivalent to convergence of $P_n$ to $Q$ in Hellinger distance, which is equivalent to convergence of $P_n$ to $Q$ in total variation [28, p. 364].
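Skew symmetry (Proposition 2) and the two-sided bound of Theorem 16 are easy to check numerically on a finite sample space. The following is only a minimal sketch: the function names and the two example distributions are illustrative assumptions, not part of the paper, and the code covers orders $0 < \alpha < 1$ only.

```python
import math


def renyi_divergence(p, q, alpha):
    """D_alpha(P||Q) in nats for finite distributions and 0 < alpha < 1."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("this sketch only covers orders 0 < alpha < 1")
    hellinger_sum = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
    return math.log(hellinger_sum) / (alpha - 1.0)


def skew_symmetry_gap(p, q, alpha):
    """|D_alpha(P||Q) - alpha/(1-alpha) * D_{1-alpha}(Q||P)|, zero up to rounding."""
    lhs = renyi_divergence(p, q, alpha)
    rhs = alpha / (1.0 - alpha) * renyi_divergence(q, p, 1.0 - alpha)
    return abs(lhs - rhs)


def theorem16_bounds(p, q, alpha, beta):
    """Return (lower bound, D_alpha, D_beta) for 0 < alpha <= beta < 1."""
    d_alpha = renyi_divergence(p, q, alpha)
    d_beta = renyi_divergence(p, q, beta)
    lower = (alpha / beta) * ((1.0 - beta) / (1.0 - alpha)) * d_beta
    return lower, d_alpha, d_beta


# Hypothetical example distributions on three outcomes.
P = [0.5, 0.3, 0.2]
Q = [0.2, 0.2, 0.6]

lower, d_alpha, d_beta = theorem16_bounds(P, Q, alpha=0.2, beta=0.7)
print("skew symmetry gap:", skew_symmetry_gap(P, Q, 0.3))
print("Theorem 16 chain holds:", lower <= d_alpha <= d_beta)
```

For $\alpha = 1/2$ the prefactor $\alpha/(1-\alpha)$ equals 1, so the same routine also exhibits the symmetry $D_{1/2}(P\|Q) = D_{1/2}(Q\|P)$.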
Next we shall prove a stronger result on the relation between Rényi divergence and total variation.

Theorem 17. For $\alpha \in (0, 1)$, the Rényi divergence $D_\alpha(P\|Q)$ is a uniformly continuous function of $(P, Q)$ in the total variation topology.

Lemma 5. Let $0 < \alpha < 1$. Then for all $x, y \ge 0$ and $\varepsilon > 0$

$$|x^\alpha - y^\alpha| \le \varepsilon^\alpha + \varepsilon^{\alpha-1} |x - y|.$$

Proof: If $x, y \le \varepsilon$ or $x = y$ the inequality $|x^\alpha - y^\alpha| \le \varepsilon^\alpha$ is obvious. So assume that $x > y$ and $x \ge \varepsilon$. Then

$$\frac{|x^\alpha - y^\alpha|}{|x - y|} \le \frac{|x^\alpha - 0^\alpha|}{|x - 0|} = x^{\alpha-1} \le \varepsilon^{\alpha-1}.$$

Proof of Theorem 17: First note that Rényi divergence is a function of the power divergence $d_\alpha(P, Q) = \int \bigl(1 - (dP/dQ)^\alpha\bigr)\,dQ$:

$$D_\alpha(P\|Q) = \frac{1}{\alpha-1} \ln\bigl(1 - d_\alpha(P, Q)\bigr).$$

Since $x \mapsto \frac{1}{\alpha-1}\ln(1-x)$ is continuous, it is sufficient to prove that $d_\alpha(P, Q)$ is a uniformly continuous function of $(P, Q)$. For any $\varepsilon > 0$ and distributions $P_1$, $P_2$ and $Q$, Lemma 5 implies that

$$
\begin{aligned}
|d_\alpha(P_1, Q) - d_\alpha(P_2, Q)| &\le \int \left| \Bigl(\frac{dP_1}{dQ}\Bigr)^{\alpha} - \Bigl(\frac{dP_2}{dQ}\Bigr)^{\alpha} \right| dQ \\
&\le \int \left( \varepsilon^\alpha + \varepsilon^{\alpha-1} \left| \frac{dP_1}{dQ} - \frac{dP_2}{dQ} \right| \right) dQ \\
&= \varepsilon^\alpha + \varepsilon^{\alpha-1} \int \left| \frac{dP_1}{dQ} - \frac{dP_2}{dQ} \right| dQ \\
&= \varepsilon^\alpha + \varepsilon^{\alpha-1} V(P_1, P_2).
\end{aligned}
$$

As $d_\alpha(P, Q) = d_{1-\alpha}(Q, P)$, it also follows that $|d_\alpha(P, Q_1) - d_\alpha(P, Q_2)| \le \varepsilon^{1-\alpha} + \varepsilon^{-\alpha} V(Q_1, Q_2)$ for any $Q_1$, $Q_2$ and $P$. Therefore

$$
\begin{aligned}
|d_\alpha(P_1, Q_1) - d_\alpha(P_2, Q_2)| &\le |d_\alpha(P_1, Q_1) - d_\alpha(P_2, Q_1)| + |d_\alpha(P_2, Q_1) - d_\alpha(P_2, Q_2)| \\
&\le \varepsilon^\alpha + \varepsilon^{\alpha-1} V(P_1, P_2) + \varepsilon^{1-\alpha} + \varepsilon^{-\alpha} V(Q_1, Q_2),
\end{aligned}
$$

from which the theorem follows.

A partial extension to $\alpha = 0$ follows:

Corollary 1. The Rényi divergence $D_0(P\|Q)$ is an upper semi-continuous function of $(P, Q)$ in the total variation topology.

Proof: This follows from Theorem 17 because $D_0(P\|Q)$ is the infimum of the continuous functions $(P, Q) \mapsto D_\alpha(P\|Q)$ for $\alpha \in (0, 1)$.

If we consider continuity in $Q$ only, then for any finite sample space we obtain:

Theorem 18. Suppose $\mathcal{X}$ is finite, and let $\alpha \in [0, \infty]$. Then for any $P$ the Rényi divergence $D_\alpha(P\|Q)$ is continuous in $Q$ in the topology of setwise convergence.

Proof: Directly from the closed-form expressions for Rényi divergence.

Finally, we will also consider the weak topology, which is weaker than the two topologies discussed above. In the weak topology, convergence of $P_1, P_2, \ldots$ to $P$ means that

$$\int f(x)\,dP_n(x) \to \int f(x)\,dP(x) \tag{35}$$

for any bounded, continuous function $f : \mathcal{X} \to \mathbb{R}$. Unlike for the previous two topologies, the reference to continuity of $f$ means that the weak topology depends on the topology of the sample space $\mathcal{X}$. We will therefore assume that $\mathcal{X}$ is a Polish space (that is, it should be a complete separable metric space), and we let $\mathcal{F}$ be the Borel σ-algebra. Then Prokhorov [35] shows that there exists a metric that makes the set of finite measures on $\mathcal{X}$ a Polish space as well, and which is such that convergence in the metric is equivalent to (35). The weak topology then, is the topology induced by this metric.

Theorem 19. Suppose that $\mathcal{X}$ is a Polish space. Then for any order $\alpha \in (0, \infty]$, $D_\alpha(P\|Q)$ is a lower semi-continuous function of the pair $(P, Q)$ in the weak topology.

The proof is essentially the same as the proof for $\alpha = 1$ by Posner [36].

Proof: Let $P_1, P_2, \ldots$ and $Q_1, Q_2, \ldots$ be sequences of distributions that weakly converge to $P$ and $Q$, respectively. We need to show that

$$\liminf_{n\to\infty} D_\alpha(P_n\|Q_n) \ge D_\alpha(P\|Q). \tag{36}$$

For any set $A \in \mathcal{F}$, let $\partial A$ denote its boundary, which is its closure minus its interior, and let $\mathcal{F}_0 \subseteq \mathcal{F}$ consist of the sets $A \in \mathcal{F}$ such that $P(\partial A) = Q(\partial A) = 0$. Then $\mathcal{F}_0$ is an algebra by Lemma 1.1 of Prokhorov [35], applied to the measure $P + Q$, and the Portmanteau theorem implies that $P_n(A) \to P(A)$ and $Q_n(A) \to Q(A)$ for any $A \in \mathcal{F}_0$ [37].

Posner [36, proof of Theorem 1] shows that $\mathcal{F}_0$ generates $\mathcal{F}$ (that is, $\sigma(\mathcal{F}_0) = \mathcal{F}$). By the translator's proof of Theorem 2.4.1 in Pinsker's book [38], this implies that, for any finite partition $\{A_1, \ldots, A_k\} \subseteq \mathcal{F}$ and any $\gamma > 0$, there exists a finite partition $\{A'_1, \ldots, A'_k\} \subseteq \mathcal{F}_0$ such that $P(A_i \triangle A'_i) \le \gamma$ and $Q(A_i \triangle A'_i) \le \gamma$ for all $i$, where $A_i \triangle A'_i = (A_i \setminus A'_i) \cup (A'_i \setminus A_i)$ denotes the symmetric set difference. By the data processing inequality and lower semi-continuity in the topology of setwise convergence, this implies that (15) still holds when the supremum is restricted to finite partitions $\mathcal{P}$ in $\mathcal{F}_0$ instead of $\mathcal{F}$.

Thus, for any $\varepsilon > 0$, we can find a finite partition $\mathcal{P} \subseteq \mathcal{F}_0$ such that

$$D_\alpha(P_{|\mathcal{P}}\|Q_{|\mathcal{P}}) \ge D_\alpha(P\|Q) - \varepsilon.$$

The data processing inequality and the fact that $P_n(A) \to P(A)$ and $Q_n(A) \to Q(A)$ for all $A \in \mathcal{P}$, together with lower semi-continuity in the topology of setwise convergence, then imply that

$$D_\alpha(P_n\|Q_n) \ge D_\alpha\bigl((P_n)_{|\mathcal{P}}\,\|\,(Q_n)_{|\mathcal{P}}\bigr) \ge D_\alpha(P_{|\mathcal{P}}\|Q_{|\mathcal{P}}) - \varepsilon \ge D_\alpha(P\|Q) - 2\varepsilon$$
for all sufficiently large $n$. Consequently,

$$\liminf_{n\to\infty} D_\alpha(P_n\|Q_n) \ge D_\alpha(P\|Q) - 2\varepsilon$$

for any $\varepsilon > 0$, and (36) follows by letting $\varepsilon$ tend to 0.

Theorem 20 (Compact Sublevel Sets). Suppose $\mathcal{X}$ is a Polish space, let $Q$ be arbitrary, and let $c \in [0, \infty)$ be a constant. Then the sublevel set

$$S = \{P \mid D_\alpha(P\|Q) \le c\} \tag{37}$$

is convex and compact in the topology of weak convergence for any order $\alpha \in [1, \infty]$.

Proof: Convexity follows from quasi-convexity of Rényi divergence in its first argument.

Suppose that $P_1, P_2, \ldots \in S$ converges to a finite measure $P$. Then (35), applied to the constant function $f(x) = 1$, implies that $P(\mathcal{X}) = 1$, so that $P$ is also a probability distribution. Hence by lower semi-continuity (Theorem 19) $S$ is closed. It is therefore sufficient to show that $S$ is relatively compact.

For any event $A \in \mathcal{F}$, let $A^c = \mathcal{X} \setminus A$ denote its complement. Prokhorov [35, Theorem 1.12] shows that $S$ is relatively compact if, for any $\varepsilon > 0$, there exists a compact set $A \subseteq \mathcal{X}$ such that $P(A^c) < \varepsilon$ for all $P \in S$.

Since $\mathcal{X}$ is a Polish space, for any $\delta > 0$ there exists a compact set $B_\delta \subseteq \mathcal{X}$ such that $Q(B_\delta) \ge 1 - \delta$ [37, Lemma 1.3.2]. For any distribution $P$, let $P_{|B_\delta}$ denote the restriction of $P$ to the binary partition $\{B_\delta, B_\delta^c\}$. Then, by monotonicity in $\alpha$ and the data processing inequality, we have, for any $P \in S$,

$$
\begin{aligned}
c \ge D_\alpha(P\|Q) \ge D_1(P\|Q) &\ge D_1(P_{|B_\delta}\|Q_{|B_\delta}) \\
&= P(B_\delta) \ln \frac{P(B_\delta)}{Q(B_\delta)} + P(B_\delta^c) \ln \frac{P(B_\delta^c)}{Q(B_\delta^c)} \\
&\ge P(B_\delta) \ln P(B_\delta) + P(B_\delta^c) \ln P(B_\delta^c) + P(B_\delta^c) \ln \frac{1}{Q(B_\delta^c)} \\
&\ge \frac{-2}{e} + P(B_\delta^c) \ln \frac{1}{Q(B_\delta^c)},
\end{aligned}
$$

where the last inequality follows from $x \ln x \ge -1/e$. Consequently,

$$P(B_\delta^c) \le \frac{c + 2/e}{\ln\bigl(1/Q(B_\delta^c)\bigr)},$$

and since $Q(B_\delta^c) \to 0$ as $\delta$ tends to 0 we can satisfy the condition of Prokhorov's theorem by taking $A$ equal to $B_\delta$ for any sufficiently small $\delta$ depending on $\varepsilon$.

E. Limits of σ-Algebras

As shown by Theorem 2, there exists a sequence of finite partitions $\mathcal{P}_1, \mathcal{P}_2, \ldots$ such that

$$D_\alpha(P_{|\mathcal{P}_n}\|Q_{|\mathcal{P}_n}) \uparrow D_\alpha(P\|Q). \tag{38}$$

Theorem 21 below elaborates on this result. It implies that (38) holds for any increasing sequence of partitions $\mathcal{P}_1 \subseteq \mathcal{P}_2 \subseteq \cdots$ that generate σ-algebras converging to $\mathcal{F}$, in the sense that $\mathcal{F} = \sigma\bigl(\bigcup_{n=1}^\infty \mathcal{P}_n\bigr)$. An analogous result holds for infinite sequences of increasingly coarse partitions, which is shown by Theorem 22. For the special case $\alpha = 1$, information-theoretic proofs of Theorems 21 and 22 are given by Barron [39] and Harremoës and Holst [40]. Theorem 21 may also be derived from general properties of $f$-divergences [27].

Theorem 21 (Increasing). Let $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots \subseteq \mathcal{F}$ be an increasing family of σ-algebras, and let $\mathcal{F}_\infty = \sigma\bigl(\bigcup_{n=1}^\infty \mathcal{F}_n\bigr)$ be the smallest σ-algebra containing them. Then for any order $\alpha \in (0, \infty]$

$$\lim_{n\to\infty} D_\alpha(P_{|\mathcal{F}_n}\|Q_{|\mathcal{F}_n}) = D_\alpha(P_{|\mathcal{F}_\infty}\|Q_{|\mathcal{F}_\infty}). \tag{39}$$

For $\alpha = 0$, (39) does not hold. A counterexample is given after Example 3 below.

Lemma 6. Let $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots \subseteq \mathcal{F}$ be an increasing family of σ-algebras, and suppose that $\mu$ is a probability distribution. Then the family of random variables $\{p_n\}_{n\ge1}$ with members $p_n = E[\,p \mid \mathcal{F}_n]$ is uniformly integrable (with respect to $\mu$).

The proof of this lemma is a special case of part of the proof of Lévy's upward convergence theorem in Shiryaev's textbook [28, p. 510]. We repeat it here for completeness.

Proof: For any constants $b, c > 0$

$$
\begin{aligned}
\int_{p_n > b} p_n\,d\mu &= \int_{p_n > b} p\,d\mu \\
&\le \int_{p_n > b,\; p \le c} p\,d\mu + \int_{p > c} p\,d\mu \\
&\le c \cdot \mu(p_n > b) + \int_{p > c} p\,d\mu \\
&\overset{(*)}{\le} \frac{c}{b} E[p_n] + \int_{p > c} p\,d\mu = \frac{c}{b} + \int_{p > c} p\,d\mu,
\end{aligned}
$$

in which the inequality marked by $(*)$ is Markov's. Consequently

$$
\begin{aligned}
\lim_{b\to\infty} \sup_n \int_{p_n > b} |p_n|\,d\mu &= \lim_{c\to\infty} \lim_{b\to\infty} \sup_n \int_{p_n > b} |p_n|\,d\mu \\
&\le \lim_{c\to\infty} \lim_{b\to\infty} \frac{c}{b} + \lim_{c\to\infty} \int_{p > c} p\,d\mu = 0,
\end{aligned}
$$

which proves the lemma.

Proof of Theorem 21: As by the data processing inequality $D_\alpha(P_{|\mathcal{F}_n}\|Q_{|\mathcal{F}_n}) \le D_\alpha(P\|Q)$ for all $n$, we only need to show that $\lim_{n\to\infty} D_\alpha(P_{|\mathcal{F}_n}\|Q_{|\mathcal{F}_n}) \ge D_\alpha(P_{|\mathcal{F}_\infty}\|Q_{|\mathcal{F}_\infty})$. To this end, assume without loss of generality that $\mathcal{F} = \mathcal{F}_\infty$ and that $\mu$ is a probability distribution (i.e. $\mu = (P + Q)/2$). Let $p_n = E[\,p \mid \mathcal{F}_n]$ and $q_n = E[\,q \mid \mathcal{F}_n]$, and define the distributions $\tilde{P}_n$ and $\tilde{Q}_n$ on $(\mathcal{X}, \mathcal{F})$ by

$$\tilde{P}_n(A) = \int_A p_n\,d\mu, \qquad \tilde{Q}_n(A) = \int_A q_n\,d\mu \qquad (A \in \mathcal{F}),$$

such that, by the Radon-Nikodým theorem and Proposition 1, $\frac{d\tilde{P}_n}{d\mu} = p_n = \frac{dP_{|\mathcal{F}_n}}{d\mu_{|\mathcal{F}_n}}$ and $\frac{d\tilde{Q}_n}{d\mu} = q_n = \frac{dQ_{|\mathcal{F}_n}}{d\mu_{|\mathcal{F}_n}}$ ($\mu$-a.s.) It follows that

$$D_\alpha(\tilde{P}_n\|\tilde{Q}_n) = D_\alpha(P_{|\mathcal{F}_n}\|Q_{|\mathcal{F}_n})$$

for $0 < \alpha < \infty$ and therefore by continuity also for $\alpha = \infty$.

We will proceed to show that $(\tilde{P}_n, \tilde{Q}_n) \to (P, Q)$ in the topology of setwise convergence. By lower semi-continuity
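of Rényi divergence this implies that $\lim_{n\to\infty} D_\alpha(\tilde{P}_n\|\tilde{Q}_n) \ge D_\alpha(P\|Q)$, from which the theorem follows. By Lévy's upward convergence theorem [28, p. 510], $\lim_{n\to\infty} p_n = p$ ($\mu$-a.s.) Hence uniform integrability of the family $\{p_n\}$ (by Lemma 6) implies that for any $A \in \mathcal{F}$

$$\lim_{n\to\infty} \tilde{P}_n(A) = \lim_{n\to\infty} \int_A p_n\,d\mu = \int_A p\,d\mu = P(A)$$

[28, Thm. 5, p. 189]. Similarly $\lim_{n\to\infty} \tilde{Q}_n(A) = Q(A)$, so we find that $(\tilde{P}_n, \tilde{Q}_n) \to (P, Q)$, which completes the proof.

Theorem 22 (Decreasing). Let $\mathcal{F} \supseteq \mathcal{F}_1 \supseteq \mathcal{F}_2 \supseteq \cdots$ be a decreasing family of σ-algebras, and let $\mathcal{F}_\infty = \bigcap_{n=1}^\infty \mathcal{F}_n$ be the largest σ-algebra contained in all of them. Let $\alpha \in [0, \infty)$. If $\alpha \in [0, 1)$ or there exists an $m$ such that $D_\alpha(P_{|\mathcal{F}_m}\|Q_{|\mathcal{F}_m}) < \infty$, then

$$\lim_{n\to\infty} D_\alpha(P_{|\mathcal{F}_n}\|Q_{|\mathcal{F}_n}) = D_\alpha(P_{|\mathcal{F}_\infty}\|Q_{|\mathcal{F}_\infty}).$$

The theorem cannot be extended to the case $\alpha = \infty$.

Lemma 7. Let $\mathcal{F} \supseteq \mathcal{F}_1 \supseteq \mathcal{F}_2 \supseteq \cdots$ be a decreasing family of σ-algebras. Let $\alpha \in (0, \infty)$, $p_n = \frac{dP_{|\mathcal{F}_n}}{d\mu_{|\mathcal{F}_n}}$, $q_n = \frac{dQ_{|\mathcal{F}_n}}{d\mu_{|\mathcal{F}_n}}$ and $X_n = f\bigl(\frac{p_n}{q_n}\bigr)$, where $f(x) = x^\alpha$ if $\alpha \ne 1$ and $f(x) = x \ln x + e^{-1}$ if $\alpha = 1$. If $\alpha \in (0, 1)$, or $E_Q[X_1] < \infty$ and $P \ll Q$, then the family $\{X_n\}_{n\ge1}$ is uniformly integrable (with respect to $Q$).

Proof: Suppose first that $\alpha \in (0, 1)$. Then for any $b > 0$

$$\int_{X_n > b} X_n\,dQ \le \int_{X_n > b} X_n \Bigl(\frac{X_n}{b}\Bigr)^{(1-\alpha)/\alpha} dQ \le b^{-(1-\alpha)/\alpha} \int X_n^{1/\alpha}\,dQ \le b^{-(1-\alpha)/\alpha},$$

and, as $X_n \ge 0$, $\lim_{b\to\infty} \sup_n \int_{|X_n| > b} |X_n|\,dQ = 0$, which was to be shown.

Alternatively, suppose that $\alpha \in [1, \infty)$. Then $\frac{p_n}{q_n} = \frac{dP_{|\mathcal{F}_n}}{dQ_{|\mathcal{F}_n}}$ ($Q$-a.s.) and hence by Proposition 1 and Jensen's inequality for conditional expectations

$$X_n = f\Bigl(E\Bigl[\frac{dP}{dQ} \Bigm| \mathcal{F}_n\Bigr]\Bigr) \le E\Bigl[f\Bigl(\frac{dP}{dQ}\Bigr) \Bigm| \mathcal{F}_n\Bigr] = E[\,X_1 \mid \mathcal{F}_n]$$

($Q$-a.s.) As $\min_x x \ln x = -e^{-1}$, it follows that $X_n \ge 0$ and for any $b, c > 0$

$$
\begin{aligned}
\int_{|X_n| > b} |X_n|\,dQ = \int_{X_n > b} X_n\,dQ &\le \int_{X_n > b} E[\,X_1 \mid \mathcal{F}_n]\,dQ = \int_{X_n > b} X_1\,dQ \\
&= \int_{X_n > b,\; X_1 \le c} X_1\,dQ + \int_{X_n > b,\; X_1 > c} X_1\,dQ \\
&\le c \cdot Q(X_n > b) + \int_{X_1 > c} X_1\,dQ \\
&\le \frac{c}{b} E_Q[X_n] + \int_{X_1 > c} X_1\,dQ \\
&\le \frac{c}{b} E_Q[X_1] + \int_{X_1 > c} X_1\,dQ,
\end{aligned}
$$

where $E_Q[X_n] \le E_Q[X_1]$ in the last inequality follows from the data processing inequality. Consequently,

$$
\begin{aligned}
\lim_{b\to\infty} \sup_n \int_{|X_n| > b} |X_n|\,dQ &= \lim_{c\to\infty} \lim_{b\to\infty} \sup_n \int_{|X_n| > b} |X_n|\,dQ \\
&\le \lim_{c\to\infty} \lim_{b\to\infty} \frac{c}{b} E_Q[X_1] + \lim_{c\to\infty} \int_{X_1 > c} X_1\,dQ = 0,
\end{aligned}
$$

and the lemma follows.

Proof of Theorem 22: First suppose that $\alpha > 0$ and, for $n = 1, 2, \ldots, \infty$, let $p_n = \frac{dP_{|\mathcal{F}_n}}{d\mu_{|\mathcal{F}_n}}$, $q_n = \frac{dQ_{|\mathcal{F}_n}}{d\mu_{|\mathcal{F}_n}}$ and $X_n = f\bigl(\frac{p_n}{q_n}\bigr)$ with $f(x) = x^\alpha$ if $\alpha \ne 1$ and $f(x) = x \ln x + e^{-1}$ if $\alpha = 1$, as in Lemma 7. If $\alpha \ge 1$, then assume without loss of generality that $\mathcal{F} = \mathcal{F}_1$ and $m = 1$, such that $D_\alpha(P_{|\mathcal{F}_m}\|Q_{|\mathcal{F}_m}) < \infty$ implies $P \ll Q$. Now, for any $\alpha > 0$, it is sufficient to show that

$$E_Q[X_n] \to E_Q[X_\infty]. \tag{40}$$

By Proposition 1, $p_n = E_\mu[\,p \mid \mathcal{F}_n]$ and $q_n = E_\mu[\,q \mid \mathcal{F}_n]$. Therefore by a version of Lévy's theorem for decreasing sequences of σ-algebras [41, Theorem 6.23],

$$
\begin{aligned}
p_n &= E_\mu[\,p \mid \mathcal{F}_n] \to E_\mu[\,p \mid \mathcal{F}_\infty] = p_\infty, \\
q_n &= E_\mu[\,q \mid \mathcal{F}_n] \to E_\mu[\,q \mid \mathcal{F}_\infty] = q_\infty,
\end{aligned}
\qquad (\mu\text{-a.s.})
$$

and hence $X_n \to X_\infty$ ($\mu$-a.s. and therefore $Q$-a.s.) If $0 < \alpha < 1$, then

$$E_Q[X_n] = E_\mu\bigl[p_n^\alpha q_n^{1-\alpha}\bigr] \le E_\mu[\alpha p_n + (1-\alpha) q_n] = 1 < \infty.$$

And if $\alpha \ge 1$, then by the data processing inequality $D_\alpha(P_{|\mathcal{F}_n}\|Q_{|\mathcal{F}_n}) < \infty$ for all $n$, which implies that also in this case $E_Q[X_n] < \infty$. Hence uniform integrability (by Lemma 7) of the family of nonnegative random variables $\{X_n\}$ implies (40) [28, Thm. 5, p. 189], and the theorem follows for $\alpha > 0$.

The remaining case, $\alpha = 0$, is proved by

$$
\begin{aligned}
\lim_{n\to\infty} D_0(P_{|\mathcal{F}_n}\|Q_{|\mathcal{F}_n}) &= \inf_n \inf_{\alpha>0} D_\alpha(P_{|\mathcal{F}_n}\|Q_{|\mathcal{F}_n}) = \inf_{\alpha>0} \inf_n D_\alpha(P_{|\mathcal{F}_n}\|Q_{|\mathcal{F}_n}) \\
&= \inf_{\alpha>0} D_\alpha(P_{|\mathcal{F}_\infty}\|Q_{|\mathcal{F}_\infty}) = D_0(P_{|\mathcal{F}_\infty}\|Q_{|\mathcal{F}_\infty}).
\end{aligned}
$$

F. Absolute Continuity and Mutual Singularity

Shiryaev [28, pp. 366, 370] relates Hellinger integrals to absolute continuity and mutual singularity of probability distributions. His results may more elegantly be expressed in terms of Rényi divergence. They then follow from the observations that $D_0(P\|Q) = 0$ if and only if $Q$ is absolutely continuous with respect to $P$ and that $D_0(P\|Q) = \infty$ if and only if $P$ and $Q$ are mutually singular, together with right-continuity of $D_\alpha(P\|Q)$ in $\alpha$ at $\alpha = 0$. As illustrated in the next section, these properties give a convenient mathematical tool to establish absolute continuity or mutual singularity of infinite product distributions.

Theorem 23 ([28, Theorem 2, p. 366]). The following conditions are equivalent:
(i) $Q \ll P$,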
(ii) $Q(p > 0) = 1$,
(iii) $D_0(P\|Q) = 0$,
(iv) $\lim_{\alpha\downarrow0} D_\alpha(P\|Q) = 0$.

Proof: Clearly (ii) is equivalent to $Q(p = 0) = 0$, which is equivalent to (i). The other cases follow by $\lim_{\alpha\downarrow0} D_\alpha(P\|Q) = D_0(P\|Q) = -\ln Q(p > 0)$.

Theorem 24 ([28, Theorem 3, p. 366]). The following conditions are equivalent:
(i) $P \perp Q$,
(ii) $Q(p > 0) = 0$,
(iii) $D_\alpha(P\|Q) = \infty$ for some $\alpha \in [0, 1)$,
(iv) $D_\alpha(P\|Q) = \infty$ for all $\alpha \in [0, \infty]$.

Proof: Equivalence of (i), (ii) and $D_0(P\|Q) = \infty$ follows from definitions. Equivalence of $D_0(P\|Q) = \infty$ and (iv) follows from the fact that Rényi divergence is continuous on $[0, 1]$ and nondecreasing in $\alpha$. Finally, (iii) for some $\alpha \in (0, 1)$ is equivalent to

$$\int p^\alpha q^{1-\alpha}\,d\mu = 0,$$

which holds if and only if $pq = 0$ ($\mu$-a.s.). It follows that in this case (iii) is equivalent to (i).

Contiguity and entire separation are asymptotic versions of absolute continuity and mutual singularity [42]. As might be expected, analogues of Theorems 23 and 24 also hold for these asymptotic concepts.

Let $(\mathcal{X}_n, \mathcal{F}_n)_{n=1,2,\ldots}$ be a sequence of measurable spaces, and let $(P_n)_{n=1,2,\ldots}$ and $(Q_n)_{n=1,2,\ldots}$ be sequences of distributions on these spaces. Then the sequence $(P_n)$ is contiguous with respect to the sequence $(Q_n)$, denoted $(P_n) \lhd (Q_n)$, if for all sequences of events $(A_n \in \mathcal{F}_n)_{n=1,2,\ldots}$ such that $Q_n(A_n) \to 0$ as $n \to \infty$, we also have $P_n(A_n) \to 0$. If both $(P_n) \lhd (Q_n)$ and $(Q_n) \lhd (P_n)$, then the sequences are called mutually contiguous and we write $(P_n) \lhd\rhd (Q_n)$. The sequences $(P_n)$ and $(Q_n)$ are entirely separated, denoted $(P_n) \vartriangle (Q_n)$, if there exist a sequence of events $(A_n \in \mathcal{F}_n)_{n=1,2,\ldots}$ and a subsequence $(n_k)_{k=1,2,\ldots}$ such that $P_{n_k}(A_{n_k}) \to 0$ and $Q_{n_k}(\mathcal{X}_{n_k} \setminus A_{n_k}) \to 0$ as $k \to \infty$.

Contiguity and entire separation are related to absolute continuity and mutual singularity in the following way [28, p. 369]: if $\mathcal{X}_n = \mathcal{X}$, $P_n = P$ and $Q_n = Q$ for all $n$, then

$$
\begin{aligned}
(P_n) \lhd (Q_n) &\;\Leftrightarrow\; P \ll Q, \\
(P_n) \lhd\rhd (Q_n) &\;\Leftrightarrow\; P \sim Q, \\
(P_n) \vartriangle (Q_n) &\;\Leftrightarrow\; P \perp Q.
\end{aligned}
\tag{41}
$$

Theorems 1 and 2 by Shiryaev [28, p. 370] imply the following two asymptotic analogues of Theorems 23 and 24:

Theorem 25. The following conditions are equivalent:
(i) $(Q_n) \lhd (P_n)$,
(ii) $\lim_{\alpha\downarrow0} \limsup_{n\to\infty} D_\alpha(P_n\|Q_n) = 0$.

Theorem 26. The following conditions are equivalent:
(iv) $\limsup_{n\to\infty} D_\alpha(P_n\|Q_n) = \infty$ for all $\alpha \in (0, \infty]$.

If $P_n$ and $Q_n$ are the restrictions of $P$ and $Q$ to an increasing sequence of sub-σ-algebras that generates $\mathcal{F}$, then the equivalences in (41) continue to hold, because we can relate Theorems 23 and 25 and Theorems 24 and 26 via Theorem 21.

G. Distributions on Sequences

Suppose $(\mathcal{X}^\infty, \mathcal{F}^\infty)$ is the direct product of an infinite sequence of measurable spaces $(\mathcal{X}_1, \mathcal{F}_1), (\mathcal{X}_2, \mathcal{F}_2), \ldots$ That is, $\mathcal{X}^\infty = \mathcal{X}_1 \times \mathcal{X}_2 \times \cdots$ and $\mathcal{F}^\infty$ is the smallest σ-algebra containing all the cylinder sets

$$S_n(A) = \{x^\infty \in \mathcal{X}^\infty \mid x_1, \ldots, x_n \in A\}, \qquad A \in \mathcal{F}^n,$$

for $n = 1, 2, \ldots$, where $\mathcal{F}^n = \mathcal{F}_1 \otimes \cdots \otimes \mathcal{F}_n$. Then a sequence of probability distributions $P^1, P^2, \ldots$, where $P^n$ is a distribution on $\mathcal{X}^n = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$, is called consistent if

$$P^{n+1}(A \times \mathcal{X}_{n+1}) = P^n(A), \qquad A \in \mathcal{F}^n.$$

For any such consistent sequence there exists a distribution $P^\infty$ on $(\mathcal{X}^\infty, \mathcal{F}^\infty)$ such that its marginal distribution on $\mathcal{X}^n$ is $P^n$, in the sense that

$$P^\infty(S_n(A)) = P^n(A), \qquad A \in \mathcal{F}^n.$$

If $P^1, P^2, \ldots$ and $Q^1, Q^2, \ldots$ are two consistent sequences of probability distributions, then it is natural to ask whether the Rényi divergence $D_\alpha(P^n\|Q^n)$ converges to $D_\alpha(P^\infty\|Q^\infty)$. The following theorem shows that it does for $\alpha > 0$.

Theorem 27 (Consistent Distributions). Let $P^1, P^2, \ldots$ and $Q^1, Q^2, \ldots$ be consistent sequences of probability distributions on $(\mathcal{X}^1, \mathcal{F}^1), (\mathcal{X}^2, \mathcal{F}^2), \ldots$, where, for $n = 1, \ldots, \infty$, $(\mathcal{X}^n, \mathcal{F}^n)$ is the direct product of the first $n$ measurable spaces in the infinite sequence $(\mathcal{X}_1, \mathcal{F}_1), (\mathcal{X}_2, \mathcal{F}_2), \ldots$ Then for any $\alpha \in (0, \infty]$

$$D_\alpha(P^n\|Q^n) \to D_\alpha(P^\infty\|Q^\infty)$$

as $n \to \infty$.

Proof: Let $\mathcal{G}^n = \{S_n(A) \mid A \in \mathcal{F}^n\}$. Then

$$D_\alpha(P^n\|Q^n) = D_\alpha(P^\infty_{|\mathcal{G}^n}\|Q^\infty_{|\mathcal{G}^n}) \to D_\alpha(P^\infty\|Q^\infty)$$

by Theorem 21.

As a special case, we find that finite additivity of Rényi divergence, which is easy to verify, extends to countable additivity:

Theorem 28 (Additivity). For $n = 1, 2, \ldots$, let $(P_n, Q_n)$ be pairs of probability distributions on measurable spaces $(\mathcal{X}_n, \mathcal{F}_n)$. Then for any $\alpha \in [0, \infty]$ and any $N \in \{1, 2, \ldots\}$

$$\sum_{n=1}^N D_\alpha(P_n\|Q_n) = D_\alpha(P_1 \times \cdots \times P_N\|Q_1 \times \cdots \times Q_N), \tag{42}$$
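Finite additivity (42) can be verified directly on small finite sample spaces. The sketch below is only an illustration under stated assumptions: the helper names and the example factors are hypothetical, all probabilities are taken strictly positive, and only simple orders $\alpha > 0$, $\alpha \ne 1$ are covered.

```python
import itertools
import math


def renyi_divergence(p, q, alpha):
    """D_alpha(P||Q) in nats for strictly positive finite distributions,
    simple orders alpha > 0 with alpha != 1."""
    if alpha <= 0.0 or alpha == 1.0:
        raise ValueError("this sketch only covers alpha > 0, alpha != 1")
    total = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
    return math.log(total) / (alpha - 1.0)


def product_distribution(factors):
    """Probability vector of P_1 x ... x P_N, enumerated over all joint outcomes."""
    return [math.prod(probs) for probs in itertools.product(*factors)]


def additivity_gap(ps, qs, alpha):
    """|sum_n D_alpha(P_n||Q_n) - D_alpha(P_1 x ... x P_N || Q_1 x ... x Q_N)|."""
    lhs = sum(renyi_divergence(p, q, alpha) for p, q in zip(ps, qs))
    rhs = renyi_divergence(product_distribution(ps), product_distribution(qs), alpha)
    return abs(lhs - rhs)


# Hypothetical factors: N = 3 pairs on sample spaces of sizes 2, 3 and 2.
PS = [[0.6, 0.4], [0.5, 0.3, 0.2], [0.9, 0.1]]
QS = [[0.3, 0.7], [0.2, 0.2, 0.6], [0.5, 0.5]]

for order in (0.5, 2.0, 3.5):
    print(order, additivity_gap(PS, QS, order))
```

The gap is zero up to floating-point rounding because the sum $\sum p^\alpha q^{1-\alpha}$ over the product space factorises into the product of the per-coordinate sums.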
Countable additivity as in (43) does not hold for $\alpha = 0$. A counterexample is given following Example 3 below.

Proof: For simple orders $\alpha$, (42) follows from independence of $P_n$ and $Q_n$ between different $n$, which implies that

$$\prod_{n=1}^N \int \Bigl(\frac{dQ_n}{dP_n}\Bigr)^{1-\alpha} dP_n = \int \left(\frac{d\prod_{n=1}^N Q_n}{d\prod_{n=1}^N P_n}\right)^{1-\alpha} d\prod_{n=1}^N P_n.$$

As $N$ is finite, this extends to the extended orders by continuity in $\alpha$. Finally, (43) follows from Theorem 27 by observing that the sequences $P^N = P_1 \times \cdots \times P_N$ and $Q^N = Q_1 \times \cdots \times Q_N$, for $N = 1, 2, \ldots$, are consistent.

Theorems 23 and 24 can be used to establish absolute continuity or mutual singularity of infinite product distributions, as illustrated by the following proof by Shiryaev [28] of the Gaussian dichotomy [43]–[45].

Example 3 (Gaussian Dichotomy). Let $P = P_1 \times P_2 \times \cdots$ and $Q = Q_1 \times Q_2 \times \cdots$, where $P_n$ and $Q_n$ are Gaussian distributions with densities

$$p_n(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x-\mu_n)^2}, \qquad q_n(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x-\nu_n)^2}.$$

Then

$$D_\alpha(P_n\|Q_n) = \frac{\alpha}{2} (\mu_n - \nu_n)^2,$$

and by additivity for $\alpha > 0$

$$D_\alpha(P\|Q) = \frac{\alpha}{2} \sum_{n=1}^\infty (\mu_n - \nu_n)^2. \tag{44}$$

Consequently, by Theorems 23 and 24 and symmetry in $P$ and $Q$:

$$Q \ll P \;\Leftrightarrow\; P \ll Q \;\Leftrightarrow\; \sum_{n=1}^\infty (\mu_n - \nu_n)^2 < \infty, \tag{45}$$

$$Q \perp P \;\Leftrightarrow\; \sum_{n=1}^\infty (\mu_n - \nu_n)^2 = \infty. \tag{46}$$

The observation that $P$ and $Q$ are either equivalent (both $P \ll Q$ and $Q \ll P$) or mutually singular is called the Gaussian dichotomy.

By letting $\alpha$ tend to 0, Example 3 shows that countable additivity does not hold for $\alpha = 0$: if $\sum_{n=1}^\infty (\mu_n - \nu_n)^2 = \infty$, then (44) implies that $D_0(P\|Q) = \infty$, while $\sum_{n=1}^N D_0(P_n\|Q_n) = 0$ for all $N$. In light of the proof of Theorem 28 this also provides a counterexample to (39) for $\alpha = 0$.

The Gaussian dichotomy raises the question of whether the same dichotomy holds for other product distributions. Let $P \sim Q$ denote that $P$ and $Q$ are equivalent (both $P \ll Q$ and $Q \ll P$). Suppose that $P = P_1 \times P_2 \times \cdots$ and $Q = Q_1 \times Q_2 \times \cdots$, where $P_n$ and $Q_n$ are arbitrary distributions on arbitrary measurable spaces. Then if $P_n \not\sim Q_n$ for some $n$, $P$ and $Q$ are not equivalent either. The question is therefore answered by the following theorem:

Theorem 29 (Kakutani's Dichotomy). Let $\alpha \in (0, 1)$ and let $P = P_1 \times P_2 \times \cdots$ and $Q = Q_1 \times Q_2 \times \cdots$, where $P_n$ and $Q_n$ are distributions on arbitrary measurable spaces such that $P_n \sim Q_n$. Then

$$Q \sim P \;\Leftrightarrow\; \sum_{n=1}^\infty D_\alpha(P_n\|Q_n) < \infty, \tag{47}$$

$$Q \perp P \;\Leftrightarrow\; \sum_{n=1}^\infty D_\alpha(P_n\|Q_n) = \infty. \tag{48}$$

Proof: If $\sum_{n=1}^\infty D_\alpha(P_n\|Q_n) = \infty$, then $D_\alpha(P\|Q) = \infty$ and $Q \perp P$ follows by Theorem 24.

On the other hand, if $\sum_{n=1}^\infty D_\alpha(P_n\|Q_n) < \infty$, then for every $\varepsilon > 0$ there exists an $N$ such that

$$\sum_{n=N+1}^\infty D_\alpha(P_n\|Q_n) \le \varepsilon,$$

and consequently by additivity and monotonicity in $\alpha$:

$$D_0(P\|Q) = \lim_{\alpha\downarrow0} D_\alpha(P\|Q) \le \lim_{\alpha\downarrow0} D_\alpha(P_1 \times \cdots \times P_N\|Q_1 \times \cdots \times Q_N) + \varepsilon = \varepsilon.$$

As this holds for any $\varepsilon > 0$, $D_0(P\|Q)$ must equal 0, and, by Theorem 23, $Q \ll P$. As $Q \ll P$ implies $Q \not\perp P$, Theorem 24 implies that $D_\alpha(Q\|P) < \infty$, and by repeating the argument with the roles of $P$ and $Q$ reversed we find that also $P \ll Q$, which completes the proof.

Theorem 29 (with $\alpha = 1/2$) is equivalent to a classical result by Kakutani [46], which was stated in terms of Hellinger integrals rather than Rényi divergence, and according to Gibbs and Su [24] might be responsible for popularising Hellinger integrals. As shown by Rényi [47], Kakutani's result is related to the amount of information that a sequence of observations contains about the parameter of a statistical model.

H. Taylor Approximation for Parametric Models

Suppose $\{P_\theta \mid \theta \in \Theta \subseteq \mathbb{R}\}$ is a parametric statistical model. Then it is well known that, for sufficiently regular parametrisations, a second order Taylor approximation of $D(P_\theta\|P_{\theta'})$ in $\theta'$ at $\theta$ in the interior of $\Theta$ yields

$$\lim_{\theta'\to\theta} \frac{1}{(\theta - \theta')^2}\, D(P_\theta\|P_{\theta'}) = \frac{1}{2} J(\theta), \tag{49}$$

where $J(\theta) = E\bigl[(\tfrac{d}{d\theta} \ln p_\theta)^2\bigr]$ denotes the Fisher information at $\theta$ (see e.g. [30, Problem 12.7] or [48]). Haussler and Opper [6] argue that this property generalizes to

$$\lim_{\theta'\to\theta} \frac{1}{(\theta - \theta')^2}\, D_\alpha(P_\theta\|P_{\theta'}) = \frac{\alpha}{2} J(\theta) \tag{50}$$

for any $\alpha \in (0, \infty)$, but we are not aware of a reference that spells out the exact technical conditions on the parametrisation that are needed.

IV. MINIMAX RESULTS

A. Hypothesis Testing and Chernoff Information

Rényi divergence appears in bounds on the error probabilities when testing a probabilistic hypothesis $Q$ against an alternative $P$ [4], [49], [50]. This can be explained by the
fact that $(1 - \alpha) D_\alpha(P\|Q)$ equals the cumulant generating function for the random variable $\ln(p/q)$ under the distribution $Q$ (provided $\alpha \in (0, 1)$ or $P \ll Q$) [4]. The following theorem relates this cumulant generating function to two Kullback-Leibler divergences that involve the distribution $P_\alpha$ with density

$$p_\alpha = \frac{q^{1-\alpha} p^\alpha}{\int q^{1-\alpha} p^\alpha\,d\mu}, \tag{51}$$

may verify that $D_\alpha(P_c\|Q) < \infty$ and $D(S\|P_c) < \infty$ for $s = p_c^\alpha q^{1-\alpha} / \int p_c^\alpha q^{1-\alpha}\,d\mu$, so that we have already proved that (52) holds if $P$ is replaced by $P_c$. Hence, observing that for all $R$

$$D(R\|P_c) = \begin{cases} \infty & \text{if } R \not\ll P_c, \\ D(R\|P) + \ln P(p \le cq) & \text{otherwise,} \end{cases}$$

we find that

$$
\begin{aligned}
\inf_R \bigl\{\alpha D(R\|P) + (1-\alpha) D(R\|Q)\bigr\} &\le \limsup_{c\to\infty} \Bigl\{ -\alpha \ln P(p \le cq) + \inf_R \bigl\{\alpha D(R\|P_c) + (1-\alpha) D(R\|Q)\bigr\} \Bigr\} \\
&\le \limsup_{c\to\infty}\, (1-\alpha) D_\alpha(P_c\|Q) \le (1-\alpha) D_\alpha(P\|Q),
\end{aligned}
$$

The minimum is achieved by $r = \alpha p + (1 - \alpha) q$, from which

$$D_\alpha(P\|Q) \ge 2\alpha (p - q)^2 = \frac{\alpha}{2} V^2(P, Q).$$

The general case of distributions $P$ and $Q$ on any sample space $\mathcal{X}$ reduces to the binary case by the data processing inequality: for any event $A$, let $P_{|A}$ and $Q_{|A}$ denote the restrictions of $P$
and $Q$ to the binary partition $\mathcal{P} = \{A, \mathcal{X} \setminus A\}$. Then

$$
\begin{aligned}
\frac{2}{\alpha} D_\alpha(P\|Q) &\ge \sup_A \frac{2}{\alpha} D_\alpha(P_{|A}\|Q_{|A}) \ge \sup_A V^2(P_{|A}, Q_{|A}) \\
&= \sup_A 4 \bigl(P(A) - Q(A)\bigr)^2 = V^2(P, Q),
\end{aligned}
$$

as required.

As one might expect from continuity of $D_\alpha(P\|Q)$, the terms on the right-hand side of (52) are continuous in $\alpha$, at least on $(0, 1)$:

Lemma 8. If $D(P\|Q) < \infty$ or $D(Q\|P) < \infty$, then both $D(P_\alpha\|Q)$ and $D(P_\alpha\|P)$ are finite and continuous in $\alpha$ on $(0, 1)$.

Proof: The lemma is symmetric in $P$ and $Q$, so suppose without loss of generality that $D(P\|Q) < \infty$. Then $D_\alpha(P\|Q) \le D(P\|Q) < \infty$ implies that $P_\alpha$ is well defined and finiteness of both $D(P_\alpha\|Q)$ and $D(P_\alpha\|P)$ follows from Theorem 30. Now observe that

$$D(P_\alpha\|Q) = \frac{1}{\int p^\alpha q^{1-\alpha}\,d\mu}\, E_Q\Bigl[\Bigl(\frac{p}{q}\Bigr)^{\alpha} \ln \Bigl(\frac{p}{q}\Bigr)^{\alpha}\Bigr] + (1 - \alpha) D_\alpha(P\|Q).$$

Then by continuity of $D_\alpha(P\|Q)$ and hence of $\int p^\alpha q^{1-\alpha}\,d\mu$ in $\alpha$, it is sufficient to verify continuity of $E_Q[(p/q)^\alpha \ln (p/q)^\alpha]$. To this end, observe that

$$\bigl|(p/q)^\alpha \ln (p/q)^\alpha\bigr| \le \begin{cases} 1/e & \text{if } p < q, \\ (p/q) \ln(p/q) & \text{if } p \ge q. \end{cases}$$

As $D(P\|Q) < \infty$ implies $E_Q[\mathbf{1}_{\{p \ge q\}} (p/q) \ln(p/q)] < \infty$, we may apply the dominated convergence theorem to obtain

$$\lim_{\alpha\to\alpha^*} E_Q\Bigl[\Bigl(\frac{p}{q}\Bigr)^{\alpha} \ln \Bigl(\frac{p}{q}\Bigr)^{\alpha}\Bigr] = E_Q\Bigl[\Bigl(\frac{p}{q}\Bigr)^{\alpha^*} \ln \Bigl(\frac{p}{q}\Bigr)^{\alpha^*}\Bigr]$$

for any $\alpha^* \in (0, 1)$, which proves continuity of $D(P_\alpha\|Q)$. Continuity of $D(P_\alpha\|P)$ now follows from Theorem 30 and continuity of $(1 - \alpha) D_\alpha(P\|Q)$.

Theorem 32. Suppose that $D(P\|Q) < \infty$. Then the following minimax identity holds:

$$\sup_{\alpha\in(0,\infty)} \inf_R \bigl\{\alpha D(R\|P) + (1-\alpha) D(R\|Q)\bigr\} = \inf_R \sup_{\alpha\in(0,\infty)} \bigl\{\alpha D(R\|P) + (1-\alpha) D(R\|Q)\bigr\}, \tag{55}$$

with the convention that $\alpha D(R\|P) + (1 - \alpha) D(R\|Q) = \infty$ if it would otherwise be undefined. Moreover, (55) still holds if $\alpha$ is restricted to $(0, 1)$ on its left-hand side; and if there exists an $\alpha^* \in (0, 1)$ such that $D(P_{\alpha^*}\|P) = D(P_{\alpha^*}\|Q)$, then $(\alpha^*, P_{\alpha^*})$ is a saddle-point for (55) and both sides of (55) are equal to

$$(1 - \alpha^*) D_{\alpha^*}(P\|Q) = \sup_{\alpha\in(0,1)} (1 - \alpha) D_\alpha(P\|Q) = D(P_{\alpha^*}\|P) = D(P_{\alpha^*}\|Q). \tag{56}$$

The minimax value defined in (55) is the Chernoff information, which gives an asymptotically tight bound on both the type 1 and the type 2 errors in tests of $P$ vs. $Q$. The same connection between Chernoff information and $D(P_{\alpha^*}\|P)$ is discussed by Cover and Thomas [30, Section 12.9], with a different proof.

Proof of Theorem 32: Let $f(\alpha, R) = \alpha D(R\|P) + (1 - \alpha) D(R\|Q)$. For $\alpha \in (0, 1)$, $D_\alpha(P\|Q) \le D(P\|Q) < \infty$ implies that $P_\alpha$ is well defined. Suppose there exists $\alpha^* \in (0, 1)$ such that $D(P_{\alpha^*}\|P) = D(P_{\alpha^*}\|Q)$. Then Theorem 30 implies that $(\alpha^*, P_{\alpha^*})$ is a saddle-point for $f(\alpha, R)$, so that (55) holds [53, Lemma 36.2], and Theorem 30 also implies that all quantities in (56) are equal to $f(\alpha^*, P_{\alpha^*})$.

Let $A$ be either $(0, 1)$ or $(0, \infty)$. As the sup inf is never bigger than the inf sup [53, Lemma 36.1], we have that

$$\sup_{\alpha\in A} \inf_R f(\alpha, R) \le \sup_{\alpha\in(0,\infty)} \inf_R f(\alpha, R) \le \inf_R \sup_{\alpha\in(0,\infty)} f(\alpha, R),$$

so it remains to prove the converse inequality.

By Lemma 8 we know that both $D(P_\alpha\|P)$ and $D(P_\alpha\|Q)$ are finite and continuous in $\alpha$ on $(0, 1)$. By the intermediate value theorem, there are therefore three possibilities: (1) there exists $\alpha^* \in (0, 1)$ such that $D(P_{\alpha^*}\|P) = D(P_{\alpha^*}\|Q)$, for which we have already proved (55); (2) $D(P_\alpha\|P) < D(P_\alpha\|Q)$ for all $\alpha \in (0, 1)$; and (3) $D(P_\alpha\|P) > D(P_\alpha\|Q)$ for all $\alpha \in (0, 1)$.

We proceed with case (2), observing that

$$
\begin{aligned}
\inf_R \sup_{\alpha\in(0,\infty)} f(\alpha, R) &= \inf_{R \,:\, D(R\|Q)<\infty}\; \sup_{\alpha\in(0,\infty)} f(\alpha, R) \\
&= \inf_{R \,:\, D(R\|Q)<\infty} \Bigl\{ D(R\|Q) + \sup_{\alpha\in(0,\infty)} \alpha \bigl(D(R\|P) - D(R\|Q)\bigr) \Bigr\} \\
&= \inf_{R \,:\, D(R\|P) \le D(R\|Q) < \infty} D(R\|Q) \\
&\le \inf_{0<\alpha<1} D(P_\alpha\|Q).
\end{aligned}
$$

Now by Theorem 30

$$
\begin{aligned}
\inf_{0<\alpha<1} D(P_\alpha\|Q) &\le \liminf_{\alpha\downarrow0} D(P_\alpha\|Q) \\
&= \liminf_{\alpha\downarrow0} \Bigl\{ D_\alpha(P\|Q) - \frac{\alpha}{1-\alpha} D(P_\alpha\|P) \Bigr\} \\
&\le \lim_{\alpha\downarrow0} D_\alpha(P\|Q) = \lim_{\alpha\downarrow0} (1 - \alpha) D_\alpha(P\|Q) \\
&= \lim_{\alpha\downarrow0} \inf_R f(\alpha, R) \le \sup_{\alpha\in A} \inf_R f(\alpha, R),
\end{aligned}
$$

as required. It remains to consider case (3), which turns out to be impossible by the following argument: two applications of Theorem 30 give

$$
\begin{aligned}
D_{1/2}(P\|Q) &= \inf_{0<\alpha<1} \bigl\{ D(P_\alpha\|P) + D(P_\alpha\|Q) \bigr\} \\
&\le 2 \inf_{0<\alpha<1} D(P_\alpha\|P) \le 2 \limsup_{\alpha\uparrow1} D(P_\alpha\|P) \\
&= 2 \limsup_{\alpha\uparrow1} \Bigl\{ \frac{1-\alpha}{\alpha} D_\alpha(P\|Q) - \frac{1-\alpha}{\alpha} D(P_\alpha\|Q) \Bigr\} \\
&\le 2 \limsup_{\alpha\uparrow1} \frac{1-\alpha}{\alpha} D_\alpha(P\|Q) = 0.
\end{aligned}
$$

It follows that $P = Q$, which contradicts the assumption that $D(P_\alpha\|P) > D(P_\alpha\|Q)$ for any $\alpha \in (0, 1)$.
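The saddle-point characterisation in Theorem 32 suggests a simple way to compute the Chernoff information for finite distributions: search for the order $\alpha^*$ at which the tilted distribution $P_{\alpha}$ of (51) is equidistant from $P$ and $Q$ in Kullback-Leibler divergence. The following is a sketch under stated assumptions rather than a reference implementation: it assumes strictly positive distributions with $P \ne Q$, uses plain bisection on the monotone quantity $D(P_\alpha\|Q) - D(P_\alpha\|P) = E_{P_\alpha}[\ln(p/q)]$, and all function names and example values are hypothetical.

```python
import math


def tilted(p, q, alpha):
    """Return (P_alpha, Z) with P_alpha proportional to p^alpha * q^(1-alpha)."""
    weights = [pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q)]
    z = sum(weights)
    return [w / z for w in weights], z


def kl(r, p):
    """Kullback-Leibler divergence D(R||P) in nats."""
    return sum(ri * math.log(ri / pi) for ri, pi in zip(r, p) if ri > 0.0)


def chernoff_information(p, q, iterations=100):
    """Return (alpha_star, P_alpha_star, value) for strictly positive p != q.

    The value equals (1 - alpha_star) * D_alpha_star(P||Q) = -ln Z.
    """
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        r, _ = tilted(p, q, mid)
        # slope = D(P_mid||Q) - D(P_mid||P); it increases with alpha.
        slope = sum(ri * math.log(pi / qi) for ri, pi, qi in zip(r, p, q))
        if slope < 0.0:
            lo = mid
        else:
            hi = mid
    alpha_star = 0.5 * (lo + hi)
    r, z = tilted(p, q, alpha_star)
    return alpha_star, r, -math.log(z)


# Hypothetical example distributions on three outcomes.
P = [0.5, 0.3, 0.2]
Q = [0.2, 0.2, 0.6]

alpha_star, p_star, value = chernoff_information(P, Q)
print(alpha_star, value, kl(p_star, P), kl(p_star, Q))
```

At the returned order the three printed quantities after `alpha_star` should agree up to rounding, which is the chain of equalities in (56).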
B. Channel Capacity and Minimax Redundancy

Consider a non-empty family $\{P_\theta \mid \theta \in \Theta\}$ of probability distributions on a sample space $\mathcal{X}$. We may think of $\theta$ as a parameter in a statistical model or as an input letter of an information channel. In the main results of this section we will only consider discrete sample spaces $\mathcal{X}$, which are either finite with $n$ elements or countably infinite. Whenever distributions on $\Theta$ are involved, we also implicitly assume that $\Theta$ is a topological space that is equipped with the Borel σ-algebra, that $\{\theta\}$ is a closed set for every $\theta$, and that the map $\theta \mapsto P_\theta$ is measurable.

For $\alpha = 1$, Haussler [55] has extended this result to infinite sample spaces $\mathcal{X}$. It seems plausible that his approach might extend to other orders $\alpha$ as well.

Equation (59) is equivalent to the minimax identity

$$\sup_\pi \inf_Q \psi_\alpha(\pi, Q) = \inf_Q \sup_\pi \psi_\alpha(\pi, Q), \tag{60}$$

where

$$\psi_\alpha(\pi, Q) = \int D_\alpha(P_\theta\|Q)\,d\pi(\theta). \tag{61}$$

We will prove this identity using Sion's minimax theorem [56], [57], which we state with its arguments exchanged to make them line up with the arguments of $\psi_\alpha$:

Theorem 35 (Sion's Minimax Theorem). Let $A$ be a convex subset of a linear topological space and $B$ a compact convex subset of a linear topological space. Let $f : A \times B \to \mathbb{R}$ be such that
(i) $f(\cdot, b)$ is upper semi-continuous and quasi-concave on $A$ for each $b \in B$;
(ii) $f(a, \cdot)$ is lower semi-continuous and quasi-convex on $B$ for each $a \in A$.
Then

$$\sup_{a\in A} \min_{b\in B} f(a, b) = \min_{b\in B} \sup_{a\in A} f(a, b).$$

A distribution $\pi_{\mathrm{opt}}$ on the parameter space $\Theta$ is a capacity achieving input distribution if

$$\inf_Q \int D_\alpha(P_\theta\|Q)\,d\pi_{\mathrm{opt}}(\theta) = C_\alpha. \tag{63}$$

A distribution $Q_{\mathrm{opt}}$ on $\mathcal{X}$ may be called a redundancy achieving distribution if

$$\sup_\theta D_\alpha(P_\theta\|Q_{\mathrm{opt}}) = R_\alpha. \tag{64}$$

If the sample space is finite, then a redundancy achieving distribution always exists:
Lemma 9. Suppose $\mathcal{X}$ is finite and let $\alpha \in [0, \infty]$. Then the function $Q \mapsto \sup_\theta D_\alpha(P_\theta\|Q)$ is continuous and convex, and has at least one minimum. Consequently, a redundancy achieving distribution $Q_{\mathrm{opt}}$ exists.

Proof: Denote the number of elements in $\mathcal{X}$ by $n$, let $\Delta_n = \{(p_1, \ldots, p_n) \mid \sum_{i=1}^n p_i = 1,\ p_i \ge 0\}$ denote the probability simplex on $n$ outcomes, and let $f(Q) = \sup_\theta D_\alpha(P_\theta\|Q)$. Since $f$ is the supremum over continuous, convex functions, it is lower semi-continuous and convex itself. As the domain of $f$ is $\Delta_n$, which is compact, this implies that it attains its minimum. Moreover, convexity on a simplex implies upper semi-continuity [53, Theorem 10.2], so that $f$ is both lower and upper semi-continuous, which means that it is continuous.

Theorem 36. Suppose $\mathcal{X}$ is finite and let $\alpha \in [0, \infty]$. If there exists a (possibly non-unique) capacity achieving input distribution $\pi_{\mathrm{opt}}$, then $\int D_\alpha(P_\theta\|Q)\,d\pi_{\mathrm{opt}}(\theta)$ is minimized by $Q = Q_{\mathrm{opt}}$ and $D_\alpha(P_\theta\|Q_{\mathrm{opt}}) = R_\alpha$ almost surely under $\pi_{\mathrm{opt}}$.

If $R_\alpha$ is regarded as the radius of $\{P_\theta \mid \theta \in \Theta\}$, then this theorem shows how $Q_{\mathrm{opt}}$ may be interpreted as its center.

Proof: Since $\pi_{\mathrm{opt}}$ is capacity achieving,

$$
\begin{aligned}
C_\alpha &= \inf_Q \int D_\alpha(P_\theta\|Q)\,d\pi_{\mathrm{opt}}(\theta) \\
&\le \int D_\alpha(P_\theta\|Q_{\mathrm{opt}})\,d\pi_{\mathrm{opt}}(\theta) \\
&\le \int R_\alpha\,d\pi_{\mathrm{opt}}(\theta) = R_\alpha = C_\alpha.
\end{aligned}
$$

The result follows because both inequalities must be equalities.

Three orders $\alpha$ for the channel capacity $C_\alpha$ and minimax redundancy $R_\alpha$ are of particular interest. The classical ones are $\alpha = 1$, because it corresponds to the original definition of channel capacity by Shannon, and $\alpha = 0$ because $C_0$ gives an upper bound on the zero error capacity, which also dates back to Shannon.

Now let us look at the case $\alpha = \infty$, assuming for simplicity that $\mathcal{X}$ is countable. We find that

$$\sup_\theta D_\infty(P_\theta\|Q) = \sup_\theta \ln \sup_x \frac{P_\theta(x)}{Q(x)} = \sup_x \ln \frac{\sup_\theta P_\theta(x)}{Q(x)} \tag{65}$$

is the worst-case regret of $Q$ relative to $\{P_\theta \mid \theta \in \Theta\}$ [3]. As is well known [3], [58], the distribution that minimizes the worst-case regret is uniquely given by the normalized maximum likelihood or Shtarkov distribution

$$S(x) = \frac{\sup_\theta P_\theta(x)}{\sum_x \sup_\theta P_\theta(x)}, \tag{66}$$

provided that the normalizing sum is finite, so that $S$ is well defined.

Theorem 37. Suppose that $\mathcal{X}$ is countable and that the minimax redundancy $R_\infty$ is finite. Then $S$ is well defined and the worst-case regret of any distribution $Q$ satisfies

$$\sup_\theta D_\infty(P_\theta\|Q) = R_\infty + D_\infty(S\|Q). \tag{67}$$

In particular, $Q_{\mathrm{opt}} = S$ is unique and

$$R_\infty = \ln \sum_x \sup_\theta P_\theta(x) < \infty. \tag{68}$$

Proof: Since $R_\infty < \infty$, for any finite $C > R_\infty$ there must exist a distribution $Q_C$ such that $\sup_x \ln \frac{\sup_\theta P_\theta(x)}{Q_C(x)} \le C$. Hence

$$\sum_x \sup_\theta P_\theta(x) \le \sum_x Q_C(x)\, e^C = e^C < \infty,$$

so that $S$ is well defined.

Now for any arbitrary distribution $Q$, we have

$$
\begin{aligned}
\sup_x \ln \frac{\sup_\theta P_\theta(x)}{Q(x)} &= \sup_x \Bigl\{ \ln \frac{\sup_\theta P_\theta(x)}{S(x)} + \ln \frac{S(x)}{Q(x)} \Bigr\} \\
&= \ln \sum_x \sup_\theta P_\theta(x) + \sup_x \ln \frac{S(x)}{Q(x)} \\
&= \sup_x \ln \frac{\sup_\theta P_\theta(x)}{S(x)} + \sup_x \ln \frac{S(x)}{Q(x)}.
\end{aligned}
$$

Since $\sup_x \ln \frac{S(x)}{Q(x)} = D_\infty(S\|Q) \ge 0$, with strict inequality unless $Q = S$, this establishes (67) and $Q_{\mathrm{opt}} = S$. Finally, (68) follows by evaluating $\sup_x \ln \frac{\sup_\theta P_\theta(x)}{S(x)}$.

We conjecture that the previous result generalizes to any positive order $\alpha$ as a one-sided inequality:

Conjecture 1. Let $\alpha \in (0, \infty]$ and suppose that $R_\alpha < \infty$. Then we conjecture that there exists a unique redundancy achieving distribution

$$Q_{\mathrm{opt}} = \arg\min_Q \sup_\theta D_\alpha(P_\theta\|Q), \tag{69}$$

and that for all $Q$

$$\sup_\theta D_\alpha(P_\theta\|Q) \ge R_\alpha + D_\alpha(Q_{\mathrm{opt}}\|Q). \tag{70}$$

This conjecture is reminiscent of Sibson's identity [4], [59]. It would imply that any distribution $Q$ that is close to achieving the minimax redundancy in the sense that

$$\sup_\theta D_\alpha(P_\theta\|Q) \le R_\alpha + \delta, \tag{71}$$

must be close to $Q_{\mathrm{opt}}$ in the sense that

$$D_\alpha(Q_{\mathrm{opt}}\|Q) \le \delta. \tag{72}$$

As shown in Example 4 below, Conjecture 1 does not hold for $\alpha = 0$. For $\alpha > 0$, it can be expressed as a minimax identity for the function

$$\phi_\alpha(R, Q) = \sup_{\theta\in\Theta} D_\alpha(P_\theta\|Q) - D_\alpha(R\|Q), \tag{73}$$

where we adopt the convention that $\phi_\alpha(R, Q) = \infty$ if both $\sup_{\theta\in\Theta} D_\alpha(P_\theta\|Q)$ and $D_\alpha(R\|Q)$ are infinite. However, we cannot use Sion's minimax theorem (Theorem 35) to prove the conjecture, because in general $\phi_\alpha$ is not quasi-convex in its second argument$^2$.

A distribution $\pi$ on the parameter space $\Theta$ is called a barycentric input distribution if

$$Q_{\mathrm{opt}} = \int P_\theta\,d\pi(\theta). \tag{74}$$
sup D∞ (Pθ kQ) = R∞ + D∞ (SkQ). (67) 2 We
θ mistakenly claimed this in an earlier draft of this paper.
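As a numerical illustration of Theorem 37, the following minimal sketch checks identities (67) and (68) on a finite alphabet. The three-distribution model `P`, the helper names `d_inf` and `worst_case_regret`, and the use of NumPy are assumptions made for this illustration only; they do not come from the paper.

```python
import numpy as np

def d_inf(p, q):
    """Renyi divergence of order infinity on a finite alphabet:
    ln max_x p(x)/q(x), taken over the support of p."""
    mask = p > 0
    with np.errstate(divide="ignore"):
        return float(np.log(np.max(p[mask] / q[mask])))

# Hypothetical model {P_theta}: three distributions on a four-letter alphabet.
P = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.2, 0.2, 0.1, 0.5],
])

max_lik = P.max(axis=0)               # sup_theta P_theta(x) for each x
S = max_lik / max_lik.sum()           # Shtarkov distribution, eq. (66)
R_inf = float(np.log(max_lik.sum()))  # minimax redundancy, eq. (68)

def worst_case_regret(Q):
    """sup_theta D_inf(P_theta || Q), the left-hand side of eq. (67)."""
    return max(d_inf(p, Q) for p in P)

# Check eq. (67) for a few randomly drawn distributions Q.
rng = np.random.default_rng(0)
for _ in range(5):
    Q = rng.dirichlet(np.ones(P.shape[1]))
    assert np.isclose(worst_case_regret(Q), R_inf + d_inf(S, Q))

# S itself attains the minimax redundancy, since D_inf(S || S) = 0.
assert np.isclose(worst_case_regret(S), R_inf)
```

In this toy model the maximum likelihoods sum to 2, so $R_\infty = \ln 2$, and every $Q \neq S$ pays the additional penalty $D_\infty(S \| Q) > 0$ on top of that.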
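Example 4. Take $\alpha \in (0, \infty]$ and consider the distributions
\[
P_1 = (1/2,\ 0,\ 1/2), \qquad P_2 = (0,\ 1/2,\ 1/2) \tag{75}
\]
on a three-element set. Then by symmetry and convexity of Rényi divergence in its second argument, there must exist a redundancy achieving distribution of the form

$\Theta_{\mathcal{X}} \subset \Theta$ of $\theta$ on which $\pi_{\mathrm{opt}}(\theta) > 0$. Hence, for any $Q$,
\[
\begin{aligned}
\int D_\infty(P_\theta \| Q) \, d\pi_{\mathrm{opt}}(\theta)
&= \sum_{\theta \in \Theta_{\mathcal{X}}} D_\infty(P_\theta \| Q) \, \pi_{\mathrm{opt}}(\theta) \\
&= \sum_{\theta \in \Theta_{\mathcal{X}}} D_\infty(P_\theta \| Q) \sum_x S(x) \, \mathbf{1}_{\{\hat\theta(x) = \theta\}} \\
&= \sum_x S(x) \, D_\infty(P_{\hat\theta(x)} \| Q)
\end{aligned}
\]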
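\[
= \ln \inf_{A \in \mathcal{F}} \frac{P(A)}{Q(A)} = \ln \operatorname*{ess\,inf}_{Q} \frac{p}{q}, \tag{83}
\]
with the conventions that $0/0 = 0$ and $x/0 = \infty$ for $x > 0$.

Proof: The identity (82) follows directly from definitions. It implies $D_{-\infty}(P \| Q) = -D_\infty(Q \| P)$, because $\frac{\alpha}{1-\alpha}$ tends to $-1$ as $\alpha \to -\infty$. The remaining identities follow from the closed-form expressions for $D_\infty(Q \| P)$ in Theorem 6.

Skew symmetry gives a kind of symmetry between the orders $1/2 + \alpha$ and $1/2 - \alpha$. In applications in physics this symmetry is related to the use of so-called escort probabilities [60].

Whereas the nonnegative orders generally satisfy the same or similar properties for different values of $\alpha$, the fact that $\frac{\alpha}{1-\alpha} < 0$ for $\alpha < 0$ implies that properties for negative orders are often inverted. For example, Rényi divergence for

Let $\varepsilon > 0$ be arbitrary, and let $0 < q < 1$ be small enough that
\[
\max_{i \in \{0,1\}} \frac{(1 - p_i)^\alpha (1 - q)^{1-\alpha}}{p_i^\alpha q^{1-\alpha}} \leq \varepsilon.
\]
Then convexity of $D_\alpha$ in its first argument would imply that
\[
\begin{aligned}
&\tfrac{1}{2} \ln\!\left( p_0^\alpha q^{1-\alpha} + (1 - p_0)^\alpha (1 - q)^{1-\alpha} \right)
+ \tfrac{1}{2} \ln\!\left( p_1^\alpha q^{1-\alpha} + (1 - p_1)^\alpha (1 - q)^{1-\alpha} \right) \\
&\qquad \geq \ln\!\left( p_{1/2}^\alpha q^{1-\alpha} + (1 - p_{1/2})^\alpha (1 - q)^{1-\alpha} \right),
\end{aligned}
\]
which implies
\[
\tfrac{1}{2} \ln\!\left( p_0^\alpha q^{1-\alpha} (1 + \varepsilon) \right)
+ \tfrac{1}{2} \ln\!\left( p_1^\alpha q^{1-\alpha} (1 + \varepsilon) \right)
\geq \ln\!\left( p_{1/2}^\alpha q^{1-\alpha} \right)
\]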
[43] J. Feldman, “Equivalence and perpendicularity of Gaussian processes,” Pacific Journal of Mathematics, vol. 8, no. 4, pp. 699–708, 1958.
[44] J. Hájek, “On a property of normal distributions of any stochastic process,” Czechoslovak Mathematical Journal, vol. 8, no. 4, pp. 610–618, 1958. In Russian with English summary.
[45] B. J. Thelen, “Fisher information and dichotomies in equivalence/contiguity,” The Annals of Probability, vol. 17, no. 4, pp. 1664–1690, 1989.
[46] S. Kakutani, “On equivalence of infinite product measures,” The Annals of Mathematics, vol. 49, no. 1, pp. 214–224, 1948.
[47] A. Rényi, “On some basic problems of statistics from the point of view of information theory,” in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1: Statistics, pp. 531–543, 1967.
[48] S. Kullback, Information theory and statistics. Wiley, 1959.
[49] T. Nemetz, “On the α-divergence rate for Markov-dependent hypotheses,” Problems of Control and Information Theory, vol. 3, no. 2, pp. 147–155, 1974.
[50] Z. Rached, F. Alajaji, and L. L. Campbell, “Rényi’s divergence and entropy rates for finite alphabet Markov sources,” IEEE Transactions on Information Theory, vol. 47, no. 4, pp. 1553–1561, 2001.
[51] I. Csiszár, “Information projections revisited,” IEEE Transactions on Information Theory, vol. 49, no. 6, pp. 1474–1490, 2003.
[52] A. A. Fedotov, P. Harremoës, and F. Topsøe, “Refinements of Pinsker’s inequality,” IEEE Transactions on Information Theory, vol. 49, no. 6, pp. 1491–1498, 2003.
[53] R. T. Rockafellar, Convex Analysis. Princeton University Press, 1970.
[54] B. Ryabko, “Comments on ‘A source matching approach to finding minimax codes’ by Davisson, L. D. and Leon-Garcia, A.,” IEEE Transactions on Information Theory, vol. 27, no. 6, pp. 780–781, 1981. Including also the ensuing Editor’s Note.
[55] D. Haussler, “A general minimax result for relative entropy,” IEEE Transactions on Information Theory, vol. 43, no. 4, pp. 1276–1280, 1997.
[56] M. Sion, “On general minimax theorems,” Pacific Journal of Mathematics, vol. 8, no. 1, pp. 171–176, 1958.
[57] H. Komiya, “Elementary proof for Sion’s minimax theorem,” Kodai Mathematical Journal, vol. 11, no. 1, pp. 5–7, 1988.
[58] Y. M. Shtar’kov, “Universal sequential coding of single messages,” Problems of Information Transmission, vol. 23, no. 3, pp. 175–186, 1987.
[59] R. Sibson, “Information radius,” Z. Wahrscheinlichkeitstheorie verw. Geb., vol. 14, pp. 149–160, 1969.
[60] J. Naudts, “Estimators, escort probabilities, and φ-exponential families in statistical physics,” Journal of Inequalities in Pure and Applied Mathematics, vol. 5, no. 4, 102, 2004.

Peter Harremoës (M’00) received the BSc degree in mathematics in 1984, the Exam. Art. degree in archaeology in 1985, and the MSc degree in mathematics in 1988, all from the University of Copenhagen, Denmark. In 1993 he received his PhD degree in the natural sciences from Roskilde University, Denmark.
From 1993 to 1998, he worked as a mountaineer. From 1998 to 2000, he held various teaching positions in mathematics. From 2001 to 2006, he was Postdoctoral Fellow with the University of Copenhagen, with an extended visit to the Zentrum für Interdisziplinäre Forschung, Bielefeld, Germany in 2003. From 2006 to 2009, he was affiliated with the Centrum Wiskunde & Informatica, Amsterdam, The Netherlands, under the European Pascal Network of Excellence. Since then he has been affiliated with Niels Brock, Copenhagen Business College, in Denmark.
From 2007 to 2011, Peter Harremoës was Editor-in-Chief of the journal Entropy. He is currently an editor for that journal.