JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007
Rényi Divergence and Kullback-Leibler Divergence

Tim van Erven Peter Harremoës, Member, IEEE
Tim van Erven (tim@timvanerven.nl) is with the Département de Mathématiques, Université Paris-Sud, France. Peter Harremoës (harremoes@ieee.org) is with the Copenhagen Business College, Denmark. Some of the results in this paper have previously been presented at the ISIT 2010 conference.

arXiv:1206.2459v2 [cs.IT] 24 Apr 2014

Abstract—Rényi divergence is related to Rényi entropy much like Kullback-Leibler divergence is related to Shannon's entropy, and comes up in many settings. It was introduced by Rényi as a measure of information that satisfies almost the same axioms as Kullback-Leibler divergence, and depends on a parameter that is called its order. In particular, the Rényi divergence of order 1 equals the Kullback-Leibler divergence.
We review and extend the most important properties of Rényi divergence and Kullback-Leibler divergence, including convexity, continuity, limits of $\sigma$-algebras and the relation of the special order 0 to the Gaussian dichotomy and contiguity. We also show how to generalize the Pythagorean inequality to orders different from 1, and we extend the known equivalence between channel capacity and minimax redundancy to continuous channel inputs (for all orders) and present several other minimax results.

Index Terms—$\alpha$-divergence, Bhattacharyya distance, information divergence, Kullback-Leibler divergence, Pythagorean inequality, Rényi divergence

I. INTRODUCTION

Shannon entropy and Kullback-Leibler divergence (also known as information divergence or relative entropy) are perhaps the two most fundamental quantities in information theory and its applications. Because of their success, there have been many attempts to generalize these concepts, and in the literature one will find numerous entropy and divergence measures. Most of these quantities have never found any applications, and almost none of them have found an interpretation in terms of coding. The most important exceptions are the Rényi entropy and Rényi divergence [1]. Harremoës [2] and Grünwald [3, p. 649] provide an operational characterization of Rényi divergence as the number of bits by which a mixture of two codes can be compressed; and Csiszár [4] gives an operational characterization of Rényi divergence as the cut-off rate in block coding and hypothesis testing.

Rényi divergence appears as a crucial tool in proofs of convergence of minimum description length and Bayesian estimators, both in parametric and nonparametric models [5], [6], [7, Chapter 5], and one may recognize it implicitly in many computations throughout information theory. It is also closely related to Hellinger distance, which is commonly used in the analysis of nonparametric density estimation [8]–[10]. Rényi himself used his divergence to prove the convergence of state probabilities in a stationary Markov chain to the stationary distribution [1], and still other applications of Rényi divergence can be found, for instance, in hypothesis testing [11], in multiple source adaptation [12] and in ranking of images [13].

Although the closely related Rényi entropy is well studied [14], [15], the properties of Rényi divergence are scattered throughout the literature and have often only been established for finite alphabets. This paper is intended as a reference document, which treats the most important properties of Rényi divergence in detail, including Kullback-Leibler divergence as a special case. Preliminary versions of the results presented here can be found in [16] and [7]. During the preparation of this paper, Shayevitz has independently published closely related work [17], [18].

A. Rényi's Information Measures

For finite alphabets, the Rényi divergence of positive order $\alpha \neq 1$ of a probability distribution $P = (p_1, \ldots, p_n)$ from another distribution $Q = (q_1, \ldots, q_n)$ is
\[ D_\alpha(P\|Q) = \frac{1}{\alpha-1} \ln \sum_{i=1}^n p_i^\alpha q_i^{1-\alpha}, \tag{1} \]
where, for $\alpha > 1$, we read $p_i^\alpha q_i^{1-\alpha}$ as $p_i^\alpha / q_i^{\alpha-1}$ and adopt the conventions that $0/0 = 0$ and $x/0 = \infty$ for $x > 0$. As described in Section II, this definition generalizes to continuous spaces by replacing the probabilities by densities and the sum by an integral. If $P$ and $Q$ are members of the same exponential family, then their Rényi divergence can be computed using a formula by Huzurbazar [19] and Liese and Vajda [20, p. 43], [11]. Gil provides a long list of examples [21], [22].

Example 1. Let $Q$ be a probability distribution and $A$ a set with positive probability. Let $P$ be the conditional distribution of $Q$ given $A$. Then
\[ D_\alpha(P\|Q) = -\ln Q(A). \]
We observe that in this important special case the factor $\frac{1}{\alpha-1}$ in the definition of Rényi divergence has the effect that the value of $D_\alpha(P\|Q)$ does not depend on $\alpha$.

The Rényi entropy
\[ H_\alpha(P) = \frac{1}{1-\alpha} \ln \sum_{i=1}^n p_i^\alpha \]
can be expressed in terms of the Rényi divergence of $P$ from the uniform distribution $U = (1/n, \ldots, 1/n)$:
\[ H_\alpha(P) = H_\alpha(U) - D_\alpha(P\|U) = \ln n - D_\alpha(P\|U). \tag{2} \]
As $\alpha$ tends to 1, the Rényi entropy tends to the Shannon entropy and the Rényi divergence tends to the Kullback-Leibler divergence, so we recover a well-known relation. The differential Rényi entropy of a distribution $P$ with density $p$ is given by
\[ h_\alpha(P) = \frac{1}{1-\alpha} \ln \int \bigl(p(x)\bigr)^\alpha \, dx \]
whenever this integral is defined. If $P$ has support in an interval $I$ of length $n$ then
\[ h_\alpha(P) = \ln n - D_\alpha(P\|U_I), \tag{3} \]
where $U_I$ denotes the uniform distribution on $I$, and $D_\alpha$ is the generalization of Rényi divergence to densities, which will be defined formally in Section II. Thus the properties of both the Rényi entropy and the differential Rényi entropy can be deduced from the properties of Rényi divergence as long as $P$ has compact support.

There is another way of relating Rényi entropy and Rényi divergence, in which entropy is considered as self-information. Let $X$ denote a discrete random variable with distribution $P$, and let $P_{\mathrm{diag}}$ be the distribution of $(X, X)$. Then
\[ H_\alpha(P) = D_{2-\alpha}(P_{\mathrm{diag}} \| P \times P). \tag{4} \]
For $\alpha$ tending to 1, the right-hand side tends to the mutual information between $X$ and itself, and again a well-known formula is recovered.

B. Special Orders

Although one can define the Rényi divergence of any order, certain values have wider application than others. Of particular interest are the values 0, 1/2, 1, 2, and $\infty$.

The values 0, 1, and $\infty$ are extended orders in the sense that Rényi divergence of these orders cannot be calculated by plugging into (1). Instead, their definitions are determined by continuity in $\alpha$ (see Figure 1). This leads to defining Rényi divergence of order 1 as the Kullback-Leibler divergence. For order 0 it becomes $-\ln Q(\{i \mid p_i > 0\})$, which is closely related to absolute continuity and contiguity of the distributions $P$ and $Q$ (see Section III-F). For order $\infty$, Rényi divergence is defined as $\ln \max_i \frac{p_i}{q_i}$. In the literature on the minimum description length principle in statistics, this is called the worst-case regret of coding with $Q$ rather than with $P$ [3]. The Rényi divergence of order $\infty$ is also related to the separation distance, used by Aldous and Diaconis [23] to bound the rate of convergence to the stationary distribution for certain Markov chains.

Fig. 1. Rényi divergence as a function of its order for fixed distributions

Only for $\alpha = 1/2$ is Rényi divergence symmetric in its arguments. Although not itself a metric, it is a function of the squared Hellinger distance $\mathrm{Hel}^2(P, Q) = \sum_{i=1}^n \bigl(p_i^{1/2} - q_i^{1/2}\bigr)^2$ [24]:
\[ D_{1/2}(P\|Q) = -2 \ln\left(1 - \frac{\mathrm{Hel}^2(P, Q)}{2}\right). \tag{5} \]
Similarly, for $\alpha = 2$ it satisfies
\[ D_2(P\|Q) = \ln\bigl(1 + \chi^2(P, Q)\bigr), \tag{6} \]
where $\chi^2(P, Q) = \sum_{i=1}^n \frac{(p_i - q_i)^2}{q_i}$ denotes the $\chi^2$-divergence [24]. It will be shown that Rényi divergence is nondecreasing in its order. Therefore, by $\ln t \le t - 1$, (5) and (6) imply that
\[ \mathrm{Hel}^2(P, Q) \le D_{1/2}(P\|Q) \le D_1(P\|Q) \le D_2(P\|Q) \le \chi^2(P, Q). \tag{7} \]
Finally, Gilardoni [25] shows that Rényi divergence is related to the total variation distance¹ $V(P, Q) = \sum_{i=1}^n |p_i - q_i|$ by a generalization of Pinsker's inequality:
\[ \frac{\alpha}{2} V^2(P, Q) \le D_\alpha(P\|Q) \quad \text{for } \alpha \in (0, 1]. \tag{8} \]
(See Theorem 31 below.) For $\alpha = 1$ this is the normal version of Pinsker's inequality, which bounds total variation distance in terms of the square root of the Kullback-Leibler divergence.

¹ N.B. It is also common to define the total variation distance as $\frac{1}{2} V(P, Q)$. See the discussion by Pollard [26, p. 60]. Our definition is consistent with the literature on Pinsker's inequality.

C. Outline

The rest of the paper is organized as follows. First, in Section II, we extend the definition of Rényi divergence from formula (1) to continuous spaces. One can either define Rényi divergence via an integral or via discretizations. We demonstrate that these definitions are equivalent. Then we show that Rényi divergence extends to the extended orders 0, 1 and $\infty$ in the same way as for finite spaces. Along the way, we also study its behaviour as a function of $\alpha$. By contrast, in Section III we study various convexity and continuity properties of Rényi divergence as a function of $P$ and $Q$, while $\alpha$ is kept fixed. We also generalize the Pythagorean inequality to any order $\alpha \in (0, \infty)$. Section IV contains several minimax results, and treats the connection to Chernoff information in hypothesis testing, to which many applications of Rényi divergence are related. We also discuss the equivalence of channel capacity and the minimax redundancy for all orders $\alpha$. Then, in Section V, we show how Rényi divergence extends to negative orders. These are related to the orders $\alpha > 1$ by a negative scaling factor and a reversal of the arguments $P$ and $Q$. Finally, Section VI contains a number of counterexamples, showing that properties that hold for certain other divergences are violated by Rényi divergence.

For fixed $\alpha$, Rényi divergence is related to various forms of power divergences, which are in the well-studied class of $f$-divergences [27]. Consequently, several of the results we are presenting for fixed $\alpha$ in Section III are equivalent to known results about power divergences. To make this presentation self-contained we avoid the use of such connections and only use general results from measure theory.
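As a numerical companion to formula (1), the special orders of Section I-B, Example 1 and relations (5) to (8), the following Python sketch may be helpful. It is only an illustration under stated assumptions: the function names (`renyi_divergence`, `hellinger_sq`, `chi_sq`, `total_variation`) and the example distributions are hypothetical choices made here, not part of the paper, and the code covers finite alphabets only.

```python
import math

def renyi_divergence(p, q, alpha):
    """D_alpha(P||Q) in nats for finite distributions given as lists.

    Illustrative helper: formula (1) with the conventions 0/0 = 0 and
    x/0 = inf, plus the extended orders 0, 1 and infinity.
    """
    # Terms with p_i = 0 contribute nothing for any order.
    pairs = [(pi, qi) for pi, qi in zip(p, q) if pi > 0]
    if alpha == 0:
        mass = sum(qi for _, qi in pairs)  # Q({i : p_i > 0})
        return -math.log(mass) if mass > 0 else math.inf
    if alpha >= 1 and any(qi == 0 for _, qi in pairs):
        return math.inf  # P is not absolutely continuous w.r.t. Q
    if alpha == 1:
        return sum(pi * math.log(pi / qi) for pi, qi in pairs)
    if alpha == math.inf:
        return math.log(max(pi / qi for pi, qi in pairs))
    # Simple orders; for alpha < 1 a term with q_i = 0 equals 0.
    total = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in pairs if qi > 0)
    return math.log(total) / (alpha - 1) if total > 0 else math.inf

def hellinger_sq(p, q):
    """Squared Hellinger distance as defined before (5)."""
    return sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

def chi_sq(p, q):
    """Chi-squared divergence as defined after (6); assumes q_i > 0."""
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

def total_variation(p, q):
    """Total variation distance V(P, Q) = sum |p_i - q_i| (no factor 1/2)."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

if __name__ == "__main__":
    P, Q = [0.5, 0.3, 0.2], [0.2, 0.5, 0.3]  # arbitrary example distributions
    chain = [hellinger_sq(P, Q), renyi_divergence(P, Q, 0.5),
             renyi_divergence(P, Q, 1), renyi_divergence(P, Q, 2), chi_sq(P, Q)]
    print("chain (7):", chain)
    for a in (0.25, 0.5, 1):
        print("Pinsker (8), alpha =", a,
              a / 2 * total_variation(P, Q) ** 2, "<=", renyi_divergence(P, Q, a))
```

Calling `renyi_divergence` with `alpha` set to `0`, `1` or `math.inf` exercises the extended orders; for a conditional distribution as in Example 1 every order should return the same value $-\ln Q(A)$ up to rounding.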
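Summary

Definition for the simple orders $\alpha \in (0, 1) \cup (1, \infty)$:
\[ D_\alpha(P\|Q) = \frac{1}{\alpha-1} \ln \int p^\alpha q^{1-\alpha} \, d\mu. \]

For the extended orders (Thms 4–6):
\[ D_0(P\|Q) = -\ln Q(p > 0) \]
\[ D_1(P\|Q) = D(P\|Q) = \text{Kullback-Leibler divergence} \]
\[ D_\infty(P\|Q) = \ln\left(\operatorname*{ess\,sup}_P \frac{p}{q}\right) = \text{worst-case regret}. \]

Equivalent definition via discretization (Thm 10):
\[ D_\alpha(P\|Q) = \sup_{\mathcal{P} \in \text{finite partitions}} D_\alpha(P_{|\mathcal{P}} \| Q_{|\mathcal{P}}). \]

Relations to (differential) Rényi entropy (2), (3), (4): For $\alpha \in [0, \infty]$,
\[ H_\alpha(P) = \ln |\mathcal{X}| - D_\alpha(P\|U) = D_{2-\alpha}(P_{\mathrm{diag}} \| P \times P) \quad \text{for finite } \mathcal{X}, \]
\[ h_\alpha(P) = \ln n - D_\alpha(P\|U_I) \quad \text{if } \mathcal{X} \text{ is an interval } I \text{ of length } n. \]

Relations to other divergences (5)–(7), Remark 1 and Pinsker's inequality (Thm 31):
\[ \mathrm{Hel}^2 \le D_{1/2} \le D \le D_2 \le \chi^2 \]
\[ \frac{\alpha}{2} V^2 \le D_\alpha \quad \text{for } \alpha \in (0, 1]. \]

Relation to Fisher information (Section III-H): For a parametric statistical model $\{P_\theta \mid \theta \in \Theta \subseteq \mathbb{R}\}$ with "sufficiently regular" parametrisation,
\[ \lim_{\theta' \to \theta} \frac{1}{(\theta - \theta')^2} D_\alpha(P_\theta \| P_{\theta'}) = \frac{\alpha}{2} J(\theta) \quad \text{for } \alpha \in (0, \infty). \]

Varying the order (Thms 3, 7, Corollary 2):
• $D_\alpha$ is nondecreasing in $\alpha$, often strictly so.
• $D_\alpha$ is continuous in $\alpha$ on $[0, 1] \cup \{\alpha \in (1, \infty] \mid D_\alpha < \infty\}$.
• $(1 - \alpha) D_\alpha$ is concave in $\alpha$ on $[0, \infty]$.

Positivity (Thm 8) and skew symmetry (Proposition 2):
• $D_\alpha \ge 0$ for $\alpha \in [0, \infty]$, often strictly so.
• $D_\alpha(P\|Q) = \frac{\alpha}{1-\alpha} D_{1-\alpha}(Q\|P)$ for $0 < \alpha < 1$.

Convexity (Thms 11–13): $D_\alpha(P\|Q)$ is
• jointly convex in $(P, Q)$ for $\alpha \in [0, 1]$,
• convex in $Q$ for $\alpha \in [0, \infty]$,
• jointly quasi-convex in $(P, Q)$ for $\alpha \in [0, \infty]$.

Pythagorean inequality (Thm 14): For $\alpha \in (0, \infty)$, let $\mathcal{P}$ be an $\alpha$-convex set of distributions and let $Q$ be an arbitrary distribution. If the $\alpha$-information projection $P^* = \arg\min_{P \in \mathcal{P}} D_\alpha(P\|Q)$ exists, then
\[ D_\alpha(P\|Q) \ge D_\alpha(P\|P^*) + D_\alpha(P^*\|Q) \quad \text{for all } P \in \mathcal{P}. \]

Data processing (Thm 9, Example 2): If we fix the transition probabilities $A(Y|X)$ in a Markov chain $X \to Y$, then
\[ D_\alpha(P_Y \| Q_Y) \le D_\alpha(P_X \| Q_X) \quad \text{for } \alpha \in [0, \infty]. \]

The topology of setwise convergence (Thms 15, 18):
• $D_\alpha(P\|Q)$ is lower semi-continuous in the pair $(P, Q)$ for $\alpha \in (0, \infty]$.
• If $\mathcal{X}$ is finite, then $D_\alpha(P\|Q)$ is continuous in $Q$ for $\alpha \in [0, \infty]$.

The total variation topology (Thm 17, Corollary 1):
• $D_\alpha(P\|Q)$ is uniformly continuous in $(P, Q)$ for $\alpha \in (0, 1)$.
• $D_0(P\|Q)$ is upper semi-continuous in $(P, Q)$.

The weak topology (Thms 19, 20): Suppose $\mathcal{X}$ is a Polish space. Then
• $D_\alpha(P\|Q)$ is lower semi-continuous in the pair $(P, Q)$ for $\alpha \in (0, \infty]$;
• The sublevel set $\{P \mid D_\alpha(P\|Q) \le c\}$ is convex and compact for $c \in [0, \infty)$ and $\alpha \in [1, \infty]$.

Orders $\alpha \in (0, 1)$ are all equivalent (Thm 16):
\[ \frac{\alpha}{\beta} \, \frac{1-\beta}{1-\alpha} \, D_\beta \le D_\alpha \le D_\beta \quad \text{for } 0 < \alpha \le \beta < 1. \]

Additivity and other consistent sequences of distributions (Thms 27, 28):
• For arbitrary distributions $P_1, P_2, \ldots$ and $Q_1, Q_2, \ldots$, let $P^N = P_1 \times \cdots \times P_N$ and $Q^N = Q_1 \times \cdots \times Q_N$. Then
\[ \sum_{n=1}^N D_\alpha(P_n \| Q_n) = D_\alpha(P^N \| Q^N) \quad \begin{cases} \text{for } \alpha \in [0, \infty] & \text{if } N < \infty, \\ \text{for } \alpha \in (0, \infty] & \text{if } N = \infty. \end{cases} \]
• Let $P^1, P^2, \ldots$ and $Q^1, Q^2, \ldots$ be consistent sequences of distributions on $n = 1, 2, \ldots$ outcomes. Then
\[ D_\alpha(P^n \| Q^n) \to D_\alpha(P^\infty \| Q^\infty) \quad \text{for } \alpha \in (0, \infty]. \]

Limits of $\sigma$-algebras (Thms 21, 22):
• For $\sigma$-algebras $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots \subseteq \mathcal{F}$ and $\mathcal{F}_\infty = \sigma\bigl(\bigcup_{n=1}^\infty \mathcal{F}_n\bigr)$,
\[ \lim_{n \to \infty} D_\alpha(P_{|\mathcal{F}_n} \| Q_{|\mathcal{F}_n}) = D_\alpha(P_{|\mathcal{F}_\infty} \| Q_{|\mathcal{F}_\infty}) \quad \text{for } \alpha \in (0, \infty]. \]
• For $\sigma$-algebras $\mathcal{F} \supseteq \mathcal{F}_1 \supseteq \mathcal{F}_2 \supseteq \cdots$ and $\mathcal{F}_\infty = \bigcap_{n=1}^\infty \mathcal{F}_n$,
\[ \lim_{n \to \infty} D_\alpha(P_{|\mathcal{F}_n} \| Q_{|\mathcal{F}_n}) = D_\alpha(P_{|\mathcal{F}_\infty} \| Q_{|\mathcal{F}_\infty}) \quad \text{for } \alpha \in [0, 1) \]
and also for $\alpha \in [1, \infty)$ if $D_\alpha(P_{|\mathcal{F}_m} \| Q_{|\mathcal{F}_m}) < \infty$ for some $m$.

Absolute continuity and mutual singularity (Thms 23, 24, 25, 26):
• $P \gg Q$ if and only if $D_0(P\|Q) = 0$.
• $P \perp Q$ if and only if $D_\alpha(P\|Q) = \infty$ for some/all $\alpha \in [0, 1)$.
• These properties generalize to contiguity and entire separation.

Hypothesis testing and Chernoff information (Thms 30, 32): If $\alpha$ is a simple order, then
\[ (1 - \alpha) D_\alpha(P\|Q) = \inf_R \{\alpha D(R\|P) + (1 - \alpha) D(R\|Q)\}. \]
Suppose $D(P\|Q) < \infty$. Then the Chernoff information satisfies
\[ \sup_{\alpha \in (0, \infty)} \inf_R \{\alpha D(R\|P) + (1 - \alpha) D(R\|Q)\} = \inf_R \sup_{\alpha \in (0, \infty)} \{\alpha D(R\|P) + (1 - \alpha) D(R\|Q)\}, \]
and, under regularity conditions, both sides equal $D(P_{\alpha^*}\|P) = D(P_{\alpha^*}\|Q)$.

Channel capacity and minimax redundancy (Thms 34, 36, 37, 38, Lemma 9, Conjecture 1): Suppose $\mathcal{X}$ is finite. Then, for $\alpha \in [0, \infty]$,
• The channel capacity $C_\alpha$ equals the minimax redundancy $R_\alpha$;
• There exists $Q_{\mathrm{opt}}$ such that $\sup_\theta D(P_\theta \| Q_{\mathrm{opt}}) = R_\alpha$;
• If there exists a capacity achieving input distribution $\pi_{\mathrm{opt}}$, then $D(P_\theta \| Q_{\mathrm{opt}}) = R_\alpha$ almost surely for $\theta$ drawn from $\pi_{\mathrm{opt}}$;
• If $\alpha = \infty$ and the maximum likelihood is achieved by $\hat\theta(x)$, then $\pi_{\mathrm{opt}}(\theta) = Q_{\mathrm{opt}}(\{x \mid \hat\theta(x) = \theta\})$ is a capacity achieving input distribution;
Suppose $\mathcal{X}$ is countable and $R_\infty < \infty$. Then, for $\alpha = \infty$, $Q_{\mathrm{opt}}$ is the Shtarkov distribution defined in (66) and
\[ \sup_\theta D_\infty(P_\theta \| Q) = R_\infty + D_\infty(Q_{\mathrm{opt}} \| Q) \quad \text{for all } Q. \]
We conjecture that this generalizes to a one-sided inequality for any $\alpha > 0$.

Negative orders (Lemma 10, Thms 39, 40):
• Results for positive $\alpha$ carry over, but often with reversed properties.
• $D_\alpha$ is nondecreasing in $\alpha$ on $[-\infty, \infty]$.
• $D_\alpha$ is continuous in $\alpha$ on $[0, 1] \cup \{\alpha \mid -\infty < D_\alpha < \infty\}$.

Counterexamples (Section VI):
• $D_\alpha(P\|Q)$ is not convex in $P$ for $\alpha > 1$.
• For $\alpha \in (0, 1)$, $D_\alpha(P\|Q)$ is not continuous in $(P, Q)$ in the topology of setwise convergence.
• $D_\alpha$ is not (the square of) a metric.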
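Three of the summarized properties are easy to probe numerically on finite alphabets: skew symmetry, additivity over product distributions, and monotonicity in the order. The sketch below is only a sanity check on arbitrarily chosen example distributions, not a proof; the helper names `renyi` and `product_dist` are hypothetical, and strictly positive probabilities are assumed so that formula (1) can be applied directly.

```python
import math
from itertools import product

def renyi(p, q, alpha):
    """D_alpha(P||Q) in nats for a simple order alpha (alpha > 0, alpha != 1).

    Assumes all entries of p and q are strictly positive, so no
    conventions for division by zero are needed.
    """
    s = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q))
    return math.log(s) / (alpha - 1)

def product_dist(p1, p2):
    """Probabilities of the product distribution P1 x P2 (row-major order)."""
    return [a * b for a, b in product(p1, p2)]

# Arbitrary example distributions (assumptions of this sketch).
P1, Q1 = [0.5, 0.5], [0.9, 0.1]
P2, Q2 = [0.2, 0.3, 0.5], [0.4, 0.4, 0.2]

# Skew symmetry for 0 < alpha < 1:
# D_alpha(P||Q) = alpha / (1 - alpha) * D_{1 - alpha}(Q||P).
alpha = 0.3
skew_lhs = renyi(P1, Q1, alpha)
skew_rhs = alpha / (1 - alpha) * renyi(Q1, P1, 1 - alpha)

# Additivity: D_alpha(P1 x P2 || Q1 x Q2) = D_alpha(P1||Q1) + D_alpha(P2||Q2).
add_lhs = renyi(product_dist(P1, P2), product_dist(Q1, Q2), 2.0)
add_rhs = renyi(P1, Q1, 2.0) + renyi(P2, Q2, 2.0)

# Monotonicity: D_alpha is nondecreasing in alpha.
orders = [0.1, 0.5, 0.9, 1.5, 2.0, 5.0]
values = [renyi(P1, Q1, a) for a in orders]

if __name__ == "__main__":
    print("skew symmetry:", skew_lhs, skew_rhs)
    print("additivity:   ", add_lhs, add_rhs)
    print("monotone:     ", values)
```

Both pairs of printed numbers should agree up to floating-point rounding, and the last list should be nondecreasing.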
II. DEFINITION OF RÉNYI DIVERGENCE

Let us fix the notation to be used throughout the paper. We consider (probability) measures on a measurable space $(\mathcal{X}, \mathcal{F})$. If $P$ is a measure on $(\mathcal{X}, \mathcal{F})$, then we write $P_{|\mathcal{G}}$ for its restriction to the sub-$\sigma$-algebra $\mathcal{G} \subseteq \mathcal{F}$, which may be interpreted as the marginal of $P$ on the subset of events $\mathcal{G}$. A measure $P$ is called absolutely continuous with respect to another measure $Q$ if $P(A) = 0$ whenever $Q(A) = 0$ for all events $A \in \mathcal{F}$. We will write $P \ll Q$ if $P$ is absolutely continuous with respect to $Q$ and $P \not\ll Q$ otherwise. Alternatively, $P$ and $Q$ may be mutually singular, denoted $P \perp Q$, which means that there exists an event $A \in \mathcal{F}$ such that $P(A) = 0$ and $Q(\mathcal{X} \setminus A) = 0$. We will assume that all (probability) measures are absolutely continuous with respect to a common $\sigma$-finite measure $\mu$, which is arbitrary in the sense that none of our definitions or results depend on the choice of $\mu$. As we only consider (mixtures of) a countable number of distributions, such a measure $\mu$ exists in all cases, so this is no restriction. For measures denoted by capital letters (e.g. $P$ or $Q$), we will use the corresponding lower-case letters (e.g. $p$, $q$) to refer to their densities with respect to $\mu$. This includes the setting with a finite alphabet from the introduction by taking $\mu$ to be the counting measure, so that $p$ and $q$ are probability mass functions. Using that densities are random variables, we write, for example, $\int p^\alpha q^{1-\alpha} \, d\mu$ instead of its lengthy equivalent $\int p(x)^\alpha q(x)^{1-\alpha} \, d\mu(x)$. For any event $A \in \mathcal{F}$, $\mathbf{1}_A$ denotes its indicator function, which is 1 on $A$ and 0 otherwise. Finally, we use the natural logarithm in our definitions, such that information is measured in nats (1 bit equals $\ln 2$ nats).

We will often need to distinguish between the orders for which Rényi divergence can be defined by a generalization of formula (1) to an integral over densities, and the other orders. This motivates the following definitions.

Definition 1. We call a (finite) real number $\alpha$ a simple order if $\alpha > 0$ and $\alpha \neq 1$. The values 0, 1, and $\infty$ are called extended orders.

A. Definition by Formula for Simple Orders

Let $P$ and $Q$ be two arbitrary distributions on $(\mathcal{X}, \mathcal{F})$. The formula in (1), which defines Rényi divergence for simple orders on finite sample spaces, generalizes to arbitrary spaces as follows:

Definition 2 (Simple Orders). For any simple order $\alpha$, the Rényi divergence of order $\alpha$ of $P$ from $Q$ is defined as
\[ D_\alpha(P\|Q) = \frac{1}{\alpha-1} \ln \int p^\alpha q^{1-\alpha} \, d\mu, \tag{9} \]
where, for $\alpha > 1$, we read $p^\alpha q^{1-\alpha}$ as $\frac{p^\alpha}{q^{\alpha-1}}$ and adopt the conventions that $0/0 = 0$ and $x/0 = \infty$ for $x > 0$.

For example, for any simple order $\alpha$, the Rényi divergence of a normal distribution (with mean $\mu_0$ and positive variance $\sigma_0^2$) from another normal distribution (with mean $\mu_1$ and positive variance $\sigma_1^2$) is
\[ D_\alpha\bigl(\mathcal{N}(\mu_0, \sigma_0^2) \,\|\, \mathcal{N}(\mu_1, \sigma_1^2)\bigr) = \frac{\alpha(\mu_1 - \mu_0)^2}{2\sigma_\alpha^2} + \frac{1}{1-\alpha} \ln \frac{\sigma_\alpha}{\sigma_0^{1-\alpha} \sigma_1^{\alpha}}, \tag{10} \]
provided that $\sigma_\alpha^2 = (1 - \alpha)\sigma_0^2 + \alpha\sigma_1^2 > 0$ [20, p. 45].

Remark 1. The interpretation of $p^\alpha q^{1-\alpha}$ in Definition 2 is such that the Hellinger integral $\int p^\alpha q^{1-\alpha} \, d\mu$ is an $f$-divergence [27], which ensures that the relations from the introduction to squared Hellinger distance (5) and $\chi^2$-distance (6) hold in general, not just for finite sample spaces.

For simple orders, we may always change to integration with respect to $P$:
\[ \int p^\alpha q^{1-\alpha} \, d\mu = \int \left(\frac{q}{p}\right)^{1-\alpha} dP, \]
which shows that our definition does not depend on the choice of dominating measure $\mu$. In most cases it is also equivalent to integrate with respect to $Q$:
\[ \int p^\alpha q^{1-\alpha} \, d\mu = \int \left(\frac{p}{q}\right)^{\alpha} dQ \quad (0 < \alpha < 1 \text{ or } P \ll Q). \]
However, if $\alpha > 1$ and $P \not\ll Q$, then $D_\alpha(P\|Q) = \infty$, whereas the integral with respect to $Q$ may be finite. This is a subtle consequence of our conventions. For example, if $P = (1/2, 1/2)$, $Q = (1, 0)$ and $\mu$ is the counting measure, then for $\alpha > 1$
\[ \int p^\alpha q^{1-\alpha} \, d\mu = \frac{(1/2)^\alpha}{1^{\alpha-1}} + \frac{(1/2)^\alpha}{0^{\alpha-1}} = \infty, \tag{11} \]
but
\[ \int \left(\frac{p}{q}\right)^{\alpha} dQ = \int_{q > 0} \left(\frac{p}{q}\right)^{\alpha} dQ = \frac{(1/2)^\alpha}{1^{\alpha-1}} = 2^{-\alpha}. \tag{12} \]

B. Definition via Discretization for Simple Orders

We shall repeatedly use the following result, which is a direct consequence of the Radon-Nikodým theorem [28]:

Proposition 1. Suppose $\lambda \ll \mu$ is a probability distribution, or any countably additive measure such that $\lambda(\mathcal{X}) \le 1$. Then for any sub-$\sigma$-algebra $\mathcal{G} \subseteq \mathcal{F}$
\[ \frac{d\lambda_{|\mathcal{G}}}{d\mu_{|\mathcal{G}}} = \mathbb{E}\left[ \frac{d\lambda}{d\mu} \,\middle|\, \mathcal{G} \right] \quad (\mu\text{-a.s.}) \]

It has been argued that grouping observations together (by considering a coarser $\sigma$-algebra) should not increase our ability to distinguish between $P$ and $Q$ under any measure of divergence [29]. This is expressed by the data processing inequality, which Rényi divergence satisfies:

Theorem 1 (Data Processing Inequality). For any simple order $\alpha$ and any sub-$\sigma$-algebra $\mathcal{G} \subseteq \mathcal{F}$
\[ D_\alpha(P_{|\mathcal{G}} \| Q_{|\mathcal{G}}) \le D_\alpha(P\|Q). \]

Theorem 9 below shows that the data processing inequality also holds for the extended orders.
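Two points of this section lend themselves to a small finite-alphabet check: the convention issue illustrated by (11) and (12), and the data processing inequality of Theorem 1 when outcomes are grouped into a coarser partition. The Python sketch below is illustrative only; the helper names (`hellinger_integral`, `integral_wrt_q`, `renyi`, `coarsen`) and the distributions used in the partition example are assumptions made here, and $\mu$ is taken to be the counting measure.

```python
import math

def hellinger_integral(p, q, alpha):
    """Sum of p^alpha q^(1-alpha) over a finite alphabet (counting measure),
    using the conventions 0/0 = 0 and x/0 = inf for x > 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue               # the term is 0 (including the case 0/0)
        if qi == 0:
            if alpha > 1:
                return math.inf    # p^alpha / 0^(alpha-1) = inf
            continue               # for alpha < 1, q^(1-alpha) = 0
        total += pi ** alpha * qi ** (1 - alpha)
    return total

def integral_wrt_q(p, q, alpha):
    """Integral of (p/q)^alpha with respect to Q; only points with q > 0 count."""
    return sum((pi / qi) ** alpha * qi for pi, qi in zip(p, q) if qi > 0)

def renyi(p, q, alpha):
    """D_alpha(P||Q) in nats for a simple order alpha, as in Definition 2."""
    s = hellinger_integral(p, q, alpha)
    if s == 0:
        return math.inf            # only possible for alpha < 1
    return math.log(s) / (alpha - 1)

def coarsen(p, blocks):
    """Restrict a distribution to the partition given by lists of indices."""
    return [sum(p[i] for i in block) for block in blocks]

if __name__ == "__main__":
    # The example behind (11) and (12): P = (1/2, 1/2), Q = (1, 0), alpha = 2.
    P, Q = [0.5, 0.5], [1.0, 0.0]
    print(hellinger_integral(P, Q, 2.0), integral_wrt_q(P, Q, 2.0))

    # Data processing: grouping outcomes cannot increase the divergence.
    P4, Q4 = [0.1, 0.4, 0.2, 0.3], [0.25, 0.25, 0.25, 0.25]
    blocks = [[0, 2], [1, 3]]      # a hypothetical two-block partition
    for a in (0.5, 2.0, 3.0):
        print(a, renyi(coarsen(P4, blocks), coarsen(Q4, blocks), a), renyi(P4, Q4, a))
```

For the first example the Hellinger integral is infinite while the integral with respect to $Q$ equals $2^{-\alpha}$; in the second, each coarsened divergence should not exceed the corresponding divergence on the full alphabet.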
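Example 2. The name "data processing inequality" stems from the following application of Theorem 1. Let $X$ and $Y$ be two random variables that form a Markov chain
\[ X \to Y, \]
where the conditional distribution of $Y$ given $X$ is $A(Y|X)$. Then if $Y = f(X)$ is a deterministic function of $X$, we may view $Y$ as the result of "processing" $X$ according to the function $f$. In general, we may also process $X$ using a nondeterministic function, such that $A(Y|X)$ is not a point-mass.

Suppose $P_X$ and $Q_X$ are distributions for $X$. Let $P_X \circ A$ and $Q_X \circ A$ denote the corresponding joint distributions, and let $P_Y$ and $Q_Y$ be the induced marginal distributions for $Y$. Then the reader may verify that $D_\alpha(P_X \circ A \| Q_X \circ A) = D_\alpha(P_X \| Q_X)$, and consequently the data processing inequality implies that processing $X$ to obtain $Y$ reduces Rényi divergence:
\[ D_\alpha(P_Y \| Q_Y) \le D_\alpha(P_X \circ A \| Q_X \circ A) = D_\alpha(P_X \| Q_X). \tag{13} \]

Proof of Theorem 1: Let $\tilde{P}$ denote the absolutely continuous component of $P$ with respect to $Q$. Then by Proposition 1 and Jensen's inequality for conditional expectations
\[
\begin{aligned}
\frac{1}{\alpha-1} \ln \int \left( \frac{d\tilde{P}_{|\mathcal{G}}}{dQ_{|\mathcal{G}}} \right)^{\alpha} dQ
&= \frac{1}{\alpha-1} \ln \int \left( \mathbb{E}\left[ \frac{d\tilde{P}}{dQ} \,\middle|\, \mathcal{G} \right] \right)^{\alpha} dQ \\
&\le \frac{1}{\alpha-1} \ln \int \mathbb{E}\left[ \left( \frac{d\tilde{P}}{dQ} \right)^{\alpha} \,\middle|\, \mathcal{G} \right] dQ \\
&= \frac{1}{\alpha-1} \ln \int \left( \frac{d\tilde{P}}{dQ} \right)^{\alpha} dQ.
\end{aligned}
\tag{14}
\]
If $0 < \alpha < 1$, then $p^\alpha q^{1-\alpha} = 0$ if $q = 0$, so the restriction of $P$ to $\tilde{P}$ does not change the Rényi divergence, and hence the theorem is proved. Alternatively, suppose $\alpha > 1$. If $P \ll Q$, then $\tilde{P} = P$ and the theorem again follows from (14). If $P \not\ll Q$, then $D_\alpha(P\|Q) = \infty$ and the theorem holds as well.

The next theorem shows that if $\mathcal{X}$ is a continuous space, then the Rényi divergence on $\mathcal{X}$ can be arbitrarily well approximated by the Rényi divergence on finite partitions of $\mathcal{X}$. For any finite or countable partition $\mathcal{P} = \{A_1, A_2, \ldots\}$ of $\mathcal{X}$, let $P_{|\mathcal{P}} \equiv P_{|\sigma(\mathcal{P})}$ and $Q_{|\mathcal{P}} \equiv Q_{|\sigma(\mathcal{P})}$ denote the restrictions of $P$ and $Q$ to the $\sigma$-algebra generated by $\mathcal{P}$.

Theorem 2. For any simple order $\alpha$
\[ D_\alpha(P\|Q) = \sup_{\mathcal{P}} D_\alpha(P_{|\mathcal{P}} \| Q_{|\mathcal{P}}), \tag{15} \]
where the supremum is over all finite partitions $\mathcal{P} \subseteq \mathcal{F}$.

It follows that it would be equivalent to first define Rényi divergence for finite sample spaces and then extend the definition to arbitrary sample spaces using (15).

The identity (15) also holds for the extended orders 1 and $\infty$. (See Theorem 10 below.)

Proof of Theorem 2: By the data processing inequality
\[ \sup_{\mathcal{P}} D_\alpha(P_{|\mathcal{P}} \| Q_{|\mathcal{P}}) \le D_\alpha(P\|Q). \]
To show the converse inequality, consider for any $\varepsilon > 0$ a discretization of the densities $p$ and $q$ into a countable number of bins
\[ B^\varepsilon_{m,n} = \{ x \in \mathcal{X} \mid e^{m\varepsilon} \le p(x) < e^{(m+1)\varepsilon},\ e^{n\varepsilon} \le q(x) < e^{(n+1)\varepsilon} \}, \]
where $n, m \in \{-\infty, \ldots, -1, 0, 1, \ldots\}$. Let $\mathcal{Q}^\varepsilon = \{B^\varepsilon_{m,n}\}$ and $\mathcal{F}^\varepsilon = \sigma(\mathcal{Q}^\varepsilon) \subseteq \mathcal{F}$ be the corresponding partition and $\sigma$-algebra, and let $p_\varepsilon = dP_{|\mathcal{Q}^\varepsilon}/d\mu$ and $q_\varepsilon = dQ_{|\mathcal{Q}^\varepsilon}/d\mu$ be the densities of $P$ and $Q$ restricted to $\mathcal{F}^\varepsilon$. Then by Proposition 1
\[ \frac{q_\varepsilon}{p_\varepsilon} = \frac{\mathbb{E}[q \mid \mathcal{F}^\varepsilon]}{\mathbb{E}[p \mid \mathcal{F}^\varepsilon]} \le e^{2\varepsilon} \frac{q}{p} \quad (P\text{-a.s.}) \]
It follows that
\[ \frac{1}{\alpha-1} \ln \int \left( \frac{q_\varepsilon}{p_\varepsilon} \right)^{1-\alpha} dP \ge \frac{1}{\alpha-1} \ln \int \left( \frac{q}{p} \right)^{1-\alpha} dP - 2\varepsilon, \]
and hence the supremum over all countable partitions is large enough:
\[ \sup_{\substack{\text{countable } \mathcal{Q} \\ \sigma(\mathcal{Q}) \subseteq \mathcal{F}}} D_\alpha(P_{|\mathcal{Q}} \| Q_{|\mathcal{Q}}) \ge \sup_{\varepsilon > 0} D_\alpha(P_{|\mathcal{Q}^\varepsilon} \| Q_{|\mathcal{Q}^\varepsilon}) \ge D_\alpha(P\|Q). \]
It remains to show that the supremum over finite partitions is at least as large. To this end, suppose $\mathcal{Q} = \{B_1, B_2, \ldots\}$ is any countable partition and let $\mathcal{P}_n = \{B_1, \ldots, B_{n-1}, \bigcup_{i \ge n} B_i\}$. Then by
\[ P\Bigl(\bigcup_{i \ge n} B_i\Bigr)^{\alpha} Q\Bigl(\bigcup_{i \ge n} B_i\Bigr)^{1-\alpha} \ge 0 \quad (\alpha > 1), \]
\[ \lim_{n \to \infty} P\Bigl(\bigcup_{i \ge n} B_i\Bigr)^{\alpha} Q\Bigl(\bigcup_{i \ge n} B_i\Bigr)^{1-\alpha} = 0 \quad (0 < \alpha < 1), \]
we find that
\[
\begin{aligned}
\lim_{n \to \infty} D_\alpha(P_{|\mathcal{P}_n} \| Q_{|\mathcal{P}_n})
&= \lim_{n \to \infty} \frac{1}{\alpha-1} \ln \sum_{B \in \mathcal{P}_n} P(B)^\alpha Q(B)^{1-\alpha} \\
&\ge \lim_{n \to \infty} \frac{1}{\alpha-1} \ln \sum_{i=1}^{n-1} P(B_i)^\alpha Q(B_i)^{1-\alpha} \\
&= D_\alpha(P_{|\mathcal{Q}} \| Q_{|\mathcal{Q}}),
\end{aligned}
\]
where the inequality holds with equality if $0 < \alpha < 1$.

C. Extended Orders: Varying the Order

As for finite alphabets, continuity considerations lead to the following extensions of Rényi divergence to orders for which it cannot be defined using the formula in (9).

Definition 3 (Extended Orders). The Rényi divergences of orders 0 and 1 are defined as
\[ D_0(P\|Q) = \lim_{\alpha \downarrow 0} D_\alpha(P\|Q), \]
\[ D_1(P\|Q) = \lim_{\alpha \uparrow 1} D_\alpha(P\|Q), \]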
and the Rényi divergence of order $\infty$ is defined as
\[ D_\infty(P\|Q) = \lim_{\alpha \uparrow \infty} D_\alpha(P\|Q). \]

Our definition of $D_0$ follows Csiszár [4]. It differs from Rényi's original definition [1], which uses (9) with $\alpha = 0$ plugged in and is therefore always zero. As illustrated by Section III-F, the present definition is more interesting.

The limits in Definition 3 always exist, because Rényi divergence is nondecreasing in its order:

Theorem 3 (Increasing in the Order). For $\alpha \in [0, \infty]$ the Rényi divergence $D_\alpha(P\|Q)$ is nondecreasing in $\alpha$. On $\mathcal{A} = \{\alpha \in [0, \infty] \mid 0 \le \alpha \le 1 \text{ or } D_\alpha(P\|Q) < \infty\}$ it is constant if and only if $P$ is the conditional distribution $Q(\cdot \mid A)$ for some event $A \in \mathcal{F}$.

Proof: Let $\alpha < \beta$ be simple orders. Then for $x \ge 0$ the function $x \mapsto x^{\frac{\alpha-1}{\beta-1}}$ is strictly convex if $\alpha < 1$ and strictly concave if $\alpha > 1$. Therefore by Jensen's inequality
\[
\begin{aligned}
\frac{1}{\alpha-1} \ln \int p^\alpha q^{1-\alpha} \, d\mu
&= \frac{1}{\alpha-1} \ln \int \left( \frac{q}{p} \right)^{(1-\beta)\frac{\alpha-1}{\beta-1}} dP \\
&\le \frac{1}{\beta-1} \ln \int \left( \frac{q}{p} \right)^{1-\beta} dP.
\end{aligned}
\]
On $\mathcal{A}$, $\int (q/p)^{1-\beta} \, dP$ is finite. As a consequence, Jensen's inequality holds with equality if and only if $(q/p)^{1-\beta}$ is constant $P$-a.s., which is equivalent to $q/p$ being constant $P$-a.s., which in turn means that $P = Q(\cdot \mid A)$ for some event $A$.

From the simple orders, the result extends to the extended orders by the following observations:
\[ D_0(P\|Q) = \inf_{0 < \alpha < 1} D_\alpha(P\|Q), \]
\[ D_1(P\|Q) = \sup_{0 < \alpha < 1} D_\alpha(P\|Q) \le \inf_{\alpha > 1} D_\alpha(P\|Q), \]
\[ D_\infty(P\|Q) = \sup_{\alpha > 1} D_\alpha(P\|Q). \]

Let us verify that the limits in Definition 3 can be expressed in closed form, just like for finite alphabets. We require the following lemma:

Lemma 1. Let $\mathcal{A} = \{\alpha \text{ a simple order} \mid 0 < \alpha < 1 \text{ or } D_\alpha(P\|Q) < \infty\}$. Then, for any sequence $\alpha_1, \alpha_2, \ldots \in \mathcal{A}$ such that $\alpha_n \to \beta \in \mathcal{A} \cup \{0, 1\}$,
\[ \lim_{n \to \infty} \int p^{\alpha_n} q^{1-\alpha_n} \, d\mu = \int \lim_{n \to \infty} p^{\alpha_n} q^{1-\alpha_n} \, d\mu. \tag{16} \]

Our proof extends a proof by Shiryaev [28, pp. 366–367].

Proof: We will verify the conditions for the dominated convergence theorem [28], from which (16) follows. First suppose $0 \le \beta < 1$. Then $0 < \alpha_n < 1$ for all sufficiently large $n$. In this case $p^{\alpha_n} q^{1-\alpha_n}$, which is never negative, does not exceed $\alpha_n p + (1 - \alpha_n) q \le p + q$, and the dominated convergence theorem applies because $\int (p + q) \, d\mu = 2 < \infty$. Secondly, suppose $\beta \ge 1$. Then there exists a $\gamma \ge \beta$ such that $\gamma \in \mathcal{A} \cup \{1\}$ and $\alpha_n \le \gamma$ for all sufficiently large $n$. If $\gamma = 1$, then $\alpha_n < 1$ and we are done by the same argument as above. So suppose $\gamma > 1$. Then convexity of $p^{\alpha_n} q^{1-\alpha_n}$ in $\alpha_n$ implies that for $\alpha_n \le \gamma$
\[ p^{\alpha_n} q^{1-\alpha_n} \le \left(1 - \frac{\alpha_n}{\gamma}\right) p^0 q^1 + \frac{\alpha_n}{\gamma} p^\gamma q^{1-\gamma} \le q + p^\gamma q^{1-\gamma}. \]
Since $\int q \, d\mu = 1$, it remains to show that $\int p^\gamma q^{1-\gamma} \, d\mu < \infty$, which is implied by $\gamma > 1$ and $D_\gamma(P\|Q) < \infty$.

The closed-form expression for $\alpha = 0$ follows immediately:

Theorem 4 ($\alpha = 0$).
\[ D_0(P\|Q) = -\ln Q(p > 0). \]

Proof of Theorem 4: By Lemma 1 and the fact that $\lim_{\alpha \downarrow 0} p^\alpha q^{1-\alpha} = \mathbf{1}_{\{p > 0\}} q$.

For $\alpha = 1$, the limit in Definition 3 equals the Kullback-Leibler divergence of $P$ from $Q$, which is defined as
\[ D(P\|Q) = \int p \ln \frac{p}{q} \, d\mu, \]
with the conventions that $0 \ln(0/q) = 0$ and $p \ln(p/0) = \infty$ if $p > 0$. Consequently, $D(P\|Q) = \infty$ if $P \not\ll Q$.

Theorem 5 ($\alpha = 1$).
\[ D_1(P\|Q) = D(P\|Q). \tag{17} \]
Moreover, if $D(P\|Q) = \infty$ or there exists a $\beta > 1$ such that $D_\beta(P\|Q) < \infty$, then also
\[ \lim_{\alpha \downarrow 1} D_\alpha(P\|Q) = D(P\|Q). \tag{18} \]

For example, by letting $\alpha \uparrow 1$ in (10) or by direct computation, it can be derived [20] that the Kullback-Leibler divergence between two normal distributions with positive variance is
\[ D_1\bigl(\mathcal{N}(\mu_0, \sigma_0^2) \,\|\, \mathcal{N}(\mu_1, \sigma_1^2)\bigr) = \frac{1}{2} \left( \frac{(\mu_1 - \mu_0)^2}{\sigma_1^2} + \ln \frac{\sigma_1^2}{\sigma_0^2} + \frac{\sigma_0^2}{\sigma_1^2} - 1 \right). \]

It is possible that $D_\alpha(P\|Q) = \infty$ for all $\alpha > 1$, but $D(P\|Q) < \infty$, such that (18) does not hold. This situation occurs, for example, if $P$ is doubly exponential on $\mathcal{X} = \mathbb{R}$ with density $p(x) = e^{-2|x|}$ and $Q$ is standard normal with density $q(x) = e^{-x^2/2}/\sqrt{2\pi}$. (Liese and Vajda [27] have previously used these distributions in a similar example.) In this case there is no way to make Rényi divergence continuous in $\alpha$ at $\alpha = 1$, and we opt to define $D_1$ as the limit from below, such that it always equals the Kullback-Leibler divergence.

The proof of Theorem 5 requires an intermediate lemma:

Lemma 2. For any $x > 1/2$
\[ (x - 1)\left(1 + \frac{1 - x}{2}\right) \le \ln x \le x - 1. \]

Proof: By Taylor's theorem with Cauchy's remainder term we have for any positive $x$ that
\[ \ln x = x - 1 - \frac{(x - \xi)(x - 1)}{2\xi^2} = (x - 1)\left(1 + \frac{\xi - x}{2\xi^2}\right) \]
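for some $\xi$ between $x$ and 1. As $\frac{\xi - x}{2\xi^2}$ is increasing in $\xi$ for $x > 1/2$, the lemma follows.

Proof of Theorem 5: Suppose $P \not\ll Q$. Then $D(P\|Q) = \infty = D_\beta(P\|Q)$ for all $\beta > 1$, so (18) holds. Let $x_\alpha = \int p^\alpha q^{1-\alpha} \, d\mu$. Then $\lim_{\alpha \uparrow 1} x_\alpha = P(q > 0)$ by Lemma 1, and hence (17) follows by
\[ \lim_{\alpha \uparrow 1} \frac{1}{\alpha-1} \ln \int p^\alpha q^{1-\alpha} \, d\mu = \lim_{\alpha \uparrow 1} \frac{1}{\alpha-1} \ln P(q > 0) = \infty = D(P\|Q). \]
Alternatively, suppose $P \ll Q$. Then $\lim_{\alpha \uparrow 1} x_\alpha = 1$ and therefore Lemma 2 implies that
\[ \lim_{\alpha \uparrow 1} D_\alpha(P\|Q) = \lim_{\alpha \uparrow 1} \frac{1}{\alpha-1} \ln x_\alpha = \lim_{\alpha \uparrow 1} \frac{x_\alpha - 1}{\alpha - 1} = \lim_{\alpha \uparrow 1} \int_{p,q>0} \frac{p - p^\alpha q^{1-\alpha}}{1 - \alpha} \, d\mu, \tag{19} \]
where the restriction of the domain of integration is allowed because $q = 0$ implies $p = 0$ ($\mu$-a.s.) by $P \ll Q$. Convexity of $p^\alpha q^{1-\alpha}$ in $\alpha$ implies that its derivative, $p^\alpha q^{1-\alpha} \ln \frac{p}{q}$, is nondecreasing and therefore for $p, q > 0$
\[ \frac{p - p^\alpha q^{1-\alpha}}{1 - \alpha} = \frac{1}{1 - \alpha} \int_\alpha^1 p^z q^{1-z} \ln \frac{p}{q} \, dz \]
is nondecreasing in $\alpha$, and $\frac{p - p^\alpha q^{1-\alpha}}{1 - \alpha} \ge \frac{p - p^0 q^{1-0}}{1 - 0} = p - q$. As $\int_{p,q>0} (p - q) \, d\mu > -\infty$, it follows by the monotone convergence theorem that
\[ \lim_{\alpha \uparrow 1} \int_{p,q>0} \frac{p - p^\alpha q^{1-\alpha}}{1 - \alpha} \, d\mu = \int_{p,q>0} \lim_{\alpha \uparrow 1} \frac{p - p^\alpha q^{1-\alpha}}{1 - \alpha} \, d\mu = \int_{p,q>0} p \ln \frac{p}{q} \, d\mu = D(P\|Q), \]
which together with (19) proves (17). If $D(P\|Q) = \infty$, then $D_\beta(P\|Q) \ge D(P\|Q) = \infty$ for all $\beta > 1$ and (18) holds. It remains to prove (18) if there exists a $\beta > 1$ such that $D_\beta(P\|Q) < \infty$. In this case, arguments similar to the ones above imply that
\[ \lim_{\alpha \downarrow 1} D_\alpha(P\|Q) = \lim_{\alpha \downarrow 1} \int_{p,q>0} \frac{p^\alpha q^{1-\alpha} - p}{\alpha - 1} \, d\mu \tag{20} \]
and $\frac{p^\alpha q^{1-\alpha} - p}{\alpha - 1}$ is nondecreasing in $\alpha$. Therefore $\frac{p^\alpha q^{1-\alpha} - p}{\alpha - 1} \le \frac{p^\beta q^{1-\beta} - p}{\beta - 1} \le \frac{p^\beta q^{1-\beta}}{\beta - 1}$ and, as $\int_{p,q>0} \frac{p^\beta q^{1-\beta}}{\beta - 1} \, d\mu < \infty$ is implied by $D_\beta(P\|Q) < \infty$, it follows by the monotone convergence theorem that
\[ \lim_{\alpha \downarrow 1} \int_{p,q>0} \frac{p^\alpha q^{1-\alpha} - p}{\alpha - 1} \, d\mu = \int_{p,q>0} \lim_{\alpha \downarrow 1} \frac{p^\alpha q^{1-\alpha} - p}{\alpha - 1} \, d\mu = \int_{p,q>0} p \ln \frac{p}{q} \, d\mu = D(P\|Q), \]
which together with (20) completes the proof.

For any random variable $X$, the essential supremum of $X$ with respect to $P$ is $\operatorname*{ess\,sup}_P X = \sup\{c \mid P(X > c) > 0\}$.

Theorem 6 ($\alpha = \infty$).
\[ D_\infty(P\|Q) = \ln \sup_{A \in \mathcal{F}} \frac{P(A)}{Q(A)} = \ln\left( \operatorname*{ess\,sup}_P \frac{p}{q} \right), \]
with the conventions that $0/0 = 0$ and $x/0 = \infty$ if $x > 0$.

If the sample space $\mathcal{X}$ is countable, then with the notational conventions of this theorem the essential supremum reduces to an ordinary supremum, and we have $D_\infty(P\|Q) = \ln \sup_x \frac{P(x)}{Q(x)}$.

Proof: If $\mathcal{X}$ contains a finite number of elements $n$, then
\[ D_\infty(P\|Q) = \lim_{\alpha \uparrow \infty} \frac{1}{\alpha-1} \ln \sum_{i=1}^n p_i^\alpha q_i^{1-\alpha} = \ln \max_i \frac{p_i}{q_i} = \ln \max_{A \subseteq \mathcal{X}} \frac{P(A)}{Q(A)}. \]
This extends to arbitrary measurable spaces $(\mathcal{X}, \mathcal{F})$ by Theorem 2:
\[
\begin{aligned}
D_\infty(P\|Q) &= \sup_{\alpha < \infty} \sup_{\mathcal{P}} D_\alpha(P_{|\mathcal{P}} \| Q_{|\mathcal{P}}) \\
&= \sup_{\mathcal{P}} \sup_{\alpha < \infty} D_\alpha(P_{|\mathcal{P}} \| Q_{|\mathcal{P}}) \\
&= \sup_{\mathcal{P}} \ln \max_{A \in \mathcal{P}} \frac{P(A)}{Q(A)} = \ln \sup_{A \in \mathcal{F}} \frac{P(A)}{Q(A)},
\end{aligned}
\]
where $\mathcal{P}$ ranges over all finite partitions in $\mathcal{F}$.

Now if $P \not\ll Q$, then there exists an event $B \in \mathcal{F}$ such that $P(B) > 0$ but $Q(B) = 0$, and
\[ P\left( \frac{p}{q} = \infty \right) = P(q = 0) \ge P(B) > 0 \]
implies that $\operatorname*{ess\,sup} p/q = \infty = \sup_A \frac{P(A)}{Q(A)}$. Alternatively, suppose that $P \ll Q$. Then
\[ P(A) = \int_{A \cap \{q > 0\}} p \, d\mu \le \int_{A \cap \{q > 0\}} \operatorname*{ess\,sup} \frac{p}{q} \cdot q \, d\mu = \operatorname*{ess\,sup} \frac{p}{q} \cdot Q(A) \]
for all $A \in \mathcal{F}$ and it follows that
\[ \sup_{A \in \mathcal{F}} \frac{P(A)}{Q(A)} \le \operatorname*{ess\,sup} \frac{p}{q}. \tag{21} \]
Let $a < \operatorname*{ess\,sup} p/q$ be arbitrary. Then there exists a set $A \in \mathcal{F}$ with $P(A) > 0$ such that $p/q \ge a$ on $A$ and therefore
\[ P(A) = \int_A p \, d\mu \ge \int_A a \cdot q \, d\mu = a \cdot Q(A). \]
Thus $\sup_{A \in \mathcal{F}} \frac{P(A)}{Q(A)} \ge a$ for any $a < \operatorname*{ess\,sup} p/q$, which implies that
\[ \sup_{A \in \mathcal{F}} \frac{P(A)}{Q(A)} \ge \operatorname*{ess\,sup} \frac{p}{q}. \]
In combination with (21) this completes the proof.

Taken together, the previous results imply that Rényi divergence is a continuous function of its order $\alpha$ (under suitable conditions):

Theorem 7 (Continuity in the Order). The Rényi divergence $D_\alpha(P\|Q)$ is continuous in $\alpha$ on $\mathcal{A} = \{\alpha \in [0, \infty] \mid 0 \le \alpha \le 1 \text{ or } D_\alpha(P\|Q) < \infty\}$.

Proof: Continuity at any simple order $\beta$ follows by Lemma 1. It extends to the extended orders 0 and $\infty$ by the definition of Rényi divergence at these orders. And it extends to $\alpha = 1$ by Theorem 5.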
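The behaviour of Rényi divergence around order 1 can be illustrated with the normal-distribution formulas of Section II: the closed form (10), a direct numerical evaluation of the integral in (9), and the Kullback-Leibler expression stated after Theorem 5. The Python sketch below is a hedged sanity check, not part of the paper; the function names, the chosen means and variances, and the integration range and step count are all assumptions made for illustration. It integrates in the log domain with a midpoint rule so that tail values of the densities do not cause division by zero.

```python
import math

def renyi_normal(mu0, var0, mu1, var1, alpha):
    """Closed form (10) for D_alpha(N(mu0, var0) || N(mu1, var1)), simple order alpha."""
    var_a = (1 - alpha) * var0 + alpha * var1
    if var_a <= 0:
        raise ValueError("formula (10) requires sigma_alpha^2 > 0")
    sd0, sd1, sd_a = math.sqrt(var0), math.sqrt(var1), math.sqrt(var_a)
    quad = alpha * (mu1 - mu0) ** 2 / (2 * var_a)
    return quad + math.log(sd_a / (sd0 ** (1 - alpha) * sd1 ** alpha)) / (1 - alpha)

def kl_normal(mu0, var0, mu1, var1):
    """Kullback-Leibler divergence between two normal distributions."""
    return 0.5 * ((mu1 - mu0) ** 2 / var1 + math.log(var1 / var0) + var0 / var1 - 1)

def log_normal_pdf(x, mu, var):
    """Logarithm of the normal density with mean mu and variance var."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def renyi_normal_numeric(mu0, var0, mu1, var1, alpha,
                         lo=-40.0, hi=40.0, steps=100_000):
    """Definition (9) via a midpoint rule on [lo, hi] (range and steps are assumptions)."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        # p^alpha q^(1-alpha) evaluated as exp(alpha ln p + (1 - alpha) ln q)
        total += math.exp(alpha * log_normal_pdf(x, mu0, var0)
                          + (1 - alpha) * log_normal_pdf(x, mu1, var1))
    return math.log(total * h) / (alpha - 1)

if __name__ == "__main__":
    args = (0.0, 1.0, 1.0, 2.25)   # hypothetical mu0, var0, mu1, var1
    for a in (0.5, 2.0):
        print(a, renyi_normal(*args, a), renyi_normal_numeric(*args, a))
    print("alpha -> 1:", renyi_normal(*args, 1 - 1e-6), "KL:", kl_normal(*args))
```

With these example parameters the closed form and the numerical integral should agree to many digits, and the value at an order just below 1 should be close to the Kullback-Leibler value, in line with Theorem 5.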
III. FIXED NONNEGATIVE ORDERS

In this section we fix the order $\alpha$ and study properties of Rényi divergence as $P$ and $Q$ are varied. First we prove nonnegativity and extend the data processing inequality and the relation to a supremum over finite partitions to the extended orders. Then we study convexity, we prove a generalization of the Pythagorean inequality to general orders, and finally we consider various types of continuity.

A. Positivity, Data Processing and Finite Partitions

Theorem 8 (Positivity). For any order $\alpha \in [0, \infty]$
\[
D_\alpha(P\|Q) \ge 0.
\]
For $\alpha > 0$, $D_\alpha(P\|Q) = 0$ if and only if $P = Q$. For $\alpha = 0$, $D_\alpha(P\|Q) = 0$ if and only if $Q \ll P$.

Proof: Suppose first that $\alpha$ is a simple order. Then by Jensen's inequality
\[
\frac{1}{\alpha-1} \ln \int p^\alpha q^{1-\alpha} \, d\mu
= \frac{1}{\alpha-1} \ln \int \Big(\frac{q}{p}\Big)^{1-\alpha} dP
\ge \frac{1-\alpha}{\alpha-1} \ln \int \frac{q}{p} \, dP \ge 0.
\]
Equality holds if and only if $q/p$ is constant $P$-a.s. (first inequality) and $Q \ll P$ (second inequality), which together is equivalent to $P = Q$.

The result extends to $\alpha \in \{1, \infty\}$ by $D_\alpha(P\|Q) = \sup_{\beta<\alpha} D_\beta(P\|Q)$. For $\alpha = 0$ it can be verified directly that $-\ln Q(p > 0) \ge 0$, with equality if and only if $Q \ll P$.

Theorem 9 (Data Processing Inequality). For any order $\alpha \in [0, \infty]$ and any sub-$\sigma$-algebra $\mathcal{G} \subseteq \mathcal{F}$
\[
D_\alpha(P_{|\mathcal{G}}\|Q_{|\mathcal{G}}) \le D_\alpha(P\|Q). \tag{22}
\]
Example 2 also applies to the extended orders without modification.

Proof: By Theorem 1, (22) holds for the simple orders. Let $\beta$ be any extended order and let $\alpha_n \to \beta$ be an arbitrary sequence of simple orders that converges to $\beta$, from above if $\beta = 0$ and from below if $\beta \in \{1, \infty\}$. Then
\[
D_\beta(P_{|\mathcal{G}}\|Q_{|\mathcal{G}}) = \lim_{n\to\infty} D_{\alpha_n}(P_{|\mathcal{G}}\|Q_{|\mathcal{G}})
\le \lim_{n\to\infty} D_{\alpha_n}(P\|Q) = D_\beta(P\|Q).
\]

Theorem 10. For any $\alpha \in [0, \infty]$
\[
D_\alpha(P\|Q) = \sup_{\mathcal{P}} D_\alpha(P_{|\mathcal{P}}\|Q_{|\mathcal{P}}),
\]
where the supremum is over all finite partitions $\mathcal{P} \subseteq \mathcal{F}$.

Proof: For simple orders $\alpha$, the result holds by Theorem 2. This extends to $\alpha \in \{1, \infty\}$ by monotonicity and left-continuity in $\alpha$:
\[
D_\alpha(P\|Q) = \sup_{\beta<\alpha} D_\beta(P\|Q) = \sup_{\beta<\alpha} \sup_{\mathcal{P}} D_\beta(P_{|\mathcal{P}}\|Q_{|\mathcal{P}})
= \sup_{\mathcal{P}} \sup_{\beta<\alpha} D_\beta(P_{|\mathcal{P}}\|Q_{|\mathcal{P}}) = \sup_{\mathcal{P}} D_\alpha(P_{|\mathcal{P}}\|Q_{|\mathcal{P}}).
\]
For $\alpha = 0$, the data processing inequality implies that
\[
D_\alpha(P\|Q) \ge \sup_{\mathcal{P}} D_\alpha(P_{|\mathcal{P}}\|Q_{|\mathcal{P}}),
\]
and equality is achieved for the partition $\mathcal{P} = \{p > 0, p = 0\}$.

Fig. 2. Rényi divergence as a function of $P = (p, 1-p)$ for $Q = (1/3, 2/3)$

Fig. 3. Level curves of $D_{1/2}(P\|Q)$ for fixed $Q$ as $P$ ranges over the simplex of distributions on a three-element set

B. Convexity

Consider Figures 2 and 3. They show $D_\alpha(P\|Q)$ as a function of $P$ for sample spaces containing two or three elements. These figures suggest that Rényi divergence is convex in its first argument for small $\alpha$, but not for large $\alpha$. This is in agreement with the well-known fact that it is jointly convex in the pair $(P, Q)$ for $\alpha = 1$. It turns out that joint convexity extends to $\alpha < 1$, but not to $\alpha > 1$, as noted by Csiszár [4]. Our proof generalizes the proof for $\alpha = 1$ by Cover and Thomas [30].

Theorem 11. For any order $\alpha \in [0, 1]$ Rényi divergence is jointly convex in its arguments. That is, for any two pairs of probability distributions $(P_0, Q_0)$ and $(P_1, Q_1)$, and any $0 < \lambda < 1$
\[
D_\alpha\big((1-\lambda)P_0 + \lambda P_1 \,\|\, (1-\lambda)Q_0 + \lambda Q_1\big)
\le (1-\lambda) D_\alpha(P_0\|Q_0) + \lambda D_\alpha(P_1\|Q_1). \tag{23}
\]
Equality holds if and only if
$\alpha = 0$: $D_0(P_0\|Q_0) = D_0(P_1\|Q_1)$, $p_0 = 0 \Rightarrow p_1 = 0$ ($Q_0$-a.s.) and $p_1 = 0 \Rightarrow p_0 = 0$ ($Q_1$-a.s.);
$0 < \alpha < 1$: $D_\alpha(P_0\|Q_0) = D_\alpha(P_1\|Q_1)$ and $p_0 q_1 = p_1 q_0$ ($\mu$-a.s.);
$\alpha = 1$: $p_0 q_1 = p_1 q_0$ ($\mu$-a.s.)

Proof: Suppose first that $\alpha = 0$, and let $P_\lambda = (1-\lambda)P_0 + \lambda P_1$ and $Q_\lambda = (1-\lambda)Q_0 + \lambda Q_1$. Then
\[
(1-\lambda) \ln Q_0(p_0 > 0) + \lambda \ln Q_1(p_1 > 0)
\le \ln\big((1-\lambda) Q_0(p_0 > 0) + \lambda Q_1(p_1 > 0)\big)
\le \ln Q_\lambda(p_0 > 0 \text{ or } p_1 > 0) = \ln Q_\lambda(p_\lambda > 0).
\]
Equality holds if and only if, for the first inequality, $Q_0(p_0 > 0) = Q_1(p_1 > 0)$ and, for the second inequality, $p_1 > 0 \Rightarrow p_0 > 0$ ($Q_0$-a.s.) and $p_0 > 0 \Rightarrow p_1 > 0$ ($Q_1$-a.s.) These conditions are equivalent to the equality conditions of the theorem.

Alternatively, suppose $\alpha > 0$. We will show that point-wise
\[
\begin{aligned}
(1-\lambda) p_0^\alpha q_0^{1-\alpha} + \lambda p_1^\alpha q_1^{1-\alpha} &\le p_\lambda^\alpha q_\lambda^{1-\alpha} && (0 < \alpha < 1); \\
(1-\lambda) p_0 \ln \frac{p_0}{q_0} + \lambda p_1 \ln \frac{p_1}{q_1} &\ge p_\lambda \ln \frac{p_\lambda}{q_\lambda} && (\alpha = 1),
\end{aligned} \tag{24}
\]
where $p_\lambda = (1-\lambda)p_0 + \lambda p_1$ and $q_\lambda = (1-\lambda)q_0 + \lambda q_1$. For $\alpha = 1$, (23) then follows directly; for $0 < \alpha < 1$, (23) follows from (24) by Jensen's inequality:
\[
(1-\lambda) \ln \int p_0^\alpha q_0^{1-\alpha} \, d\mu + \lambda \ln \int p_1^\alpha q_1^{1-\alpha} \, d\mu
\le \ln \Big( (1-\lambda) \int p_0^\alpha q_0^{1-\alpha} \, d\mu + \lambda \int p_1^\alpha q_1^{1-\alpha} \, d\mu \Big). \tag{25}
\]
If one of $p_0$, $p_1$, $q_0$ and $q_1$ is zero, then (24) can be verified directly. So assume that they are all positive. Then for $0 < \alpha < 1$ let $f(x) = -x^\alpha$ and for $\alpha = 1$ let $f(x) = x \ln x$, such that (24) can be written as
\[
\frac{(1-\lambda) q_0}{q_\lambda} f\Big(\frac{p_0}{q_0}\Big) + \frac{\lambda q_1}{q_\lambda} f\Big(\frac{p_1}{q_1}\Big) \ge f\Big(\frac{p_\lambda}{q_\lambda}\Big).
\]
(24) is established by recognising this as an application of Jensen's inequality to the strictly convex function $f$. Regardless of whether any of $p_0$, $p_1$, $q_0$ and $q_1$ is zero, equality holds in (24) if and only if $p_0 q_1 = p_1 q_0$. Equality holds in (25) if and only if $\int p_0^\alpha q_0^{1-\alpha} \, d\mu = \int p_1^\alpha q_1^{1-\alpha} \, d\mu$, which is equivalent to $D_\alpha(P_0\|Q_0) = D_\alpha(P_1\|Q_1)$.

Joint convexity in $P$ and $Q$ breaks down for $\alpha > 1$ (see Section VI-A), but some partial convexity properties can still be salvaged. First, convexity in the second argument does hold for all $\alpha$ [4]:

Theorem 12. For any order $\alpha \in [0, \infty]$ Rényi divergence is convex in its second argument. That is, for any probability distributions $P$, $Q_0$ and $Q_1$
\[
D_\alpha(P \| (1-\lambda)Q_0 + \lambda Q_1) \le (1-\lambda) D_\alpha(P\|Q_0) + \lambda D_\alpha(P\|Q_1) \tag{26}
\]
for any $0 < \lambda < 1$. For finite $\alpha$, equality holds if and only if
$\alpha = 0$: $D_0(P\|Q_0) = D_0(P\|Q_1)$;
$0 < \alpha < \infty$: $q_0 = q_1$ ($P$-a.s.)

Proof: For $\alpha \in [0, 1]$ this follows from the previous theorem. (For $P_0 = P_1$ the equality conditions reduce to the ones given here.) For $\alpha \in (1, \infty)$, let $Q_\lambda = (1-\lambda)Q_0 + \lambda Q_1$ and define $f(x, Q_\lambda) = (p(x)/q_\lambda(x))^{\alpha-1}$. It is sufficient to show that
\[
\ln E_{X \sim P}[f(X, Q_\lambda)] \le (1-\lambda) \ln E_{X \sim P}[f(X, Q_0)] + \lambda \ln E_{X \sim P}[f(X, Q_1)].
\]
Noting that, for every $x \in \mathcal{X}$, $f(x, Q)$ is log-convex in $Q$, this is a consequence of the general fact that an expectation over log-convex functions is itself log-convex, which can be shown using Hölder's inequality:
\[
E_P[f(X, Q_\lambda)] \le E_P[f(X, Q_0)^{1-\lambda} f(X, Q_1)^\lambda] \le E_P[f(X, Q_0)]^{1-\lambda} E_P[f(X, Q_1)]^\lambda.
\]
Taking logarithms completes the proof of (26). Equality holds in the first inequality if and only if $q_0 = q_1$ ($P$-a.s.), which is also sufficient for equality in the second inequality. Finally, (26) extends to $\alpha = \infty$ by letting $\alpha$ tend to $\infty$.

And secondly, Rényi divergence is jointly quasi-convex in both arguments for all $\alpha$:

Theorem 13. For any order $\alpha \in [0, \infty]$ Rényi divergence is jointly quasi-convex in its arguments. That is, for any two pairs of probability distributions $(P_0, Q_0)$ and $(P_1, Q_1)$, and any $\lambda \in (0, 1)$
\[
D_\alpha\big((1-\lambda)P_0 + \lambda P_1 \,\|\, (1-\lambda)Q_0 + \lambda Q_1\big) \le \max\{D_\alpha(P_0\|Q_0), D_\alpha(P_1\|Q_1)\}. \tag{27}
\]

Proof: For $\alpha \in [0, 1]$, quasi-convexity is implied by convexity. For $\alpha \in (1, \infty)$, strict monotonicity of $x \mapsto \frac{1}{\alpha-1} \ln x$ implies that quasi-convexity is equivalent to quasi-convexity of the Hellinger integral $\int p^\alpha q^{1-\alpha} \, d\mu$. Since quasi-convexity is implied by ordinary convexity, it is sufficient to establish that the Hellinger integral is jointly convex in $P$ and $Q$. Let $p_\lambda = (1-\lambda)p_0 + \lambda p_1$ and $q_\lambda = (1-\lambda)q_0 + \lambda q_1$. Then joint convexity of the Hellinger integral is implied by the pointwise inequality
\[
(1-\lambda) p_0^\alpha q_0^{1-\alpha} + \lambda p_1^\alpha q_1^{1-\alpha} \ge p_\lambda^\alpha q_\lambda^{1-\alpha},
\]
which holds by essentially the same argument as for (24) in the proof of Theorem 11, with the convex function $f(x) = x^\alpha$.

Finally, the case $\alpha = \infty$ follows by letting $\alpha$ tend to $\infty$:
\[
\begin{aligned}
D_\infty\big((1-\lambda)P_0 + \lambda P_1 \,\|\, (1-\lambda)Q_0 + \lambda Q_1\big)
&= \sup_{\alpha<\infty} D_\alpha\big((1-\lambda)P_0 + \lambda P_1 \,\|\, (1-\lambda)Q_0 + \lambda Q_1\big) \\
&\le \sup_{\alpha<\infty} \max\{D_\alpha(P_0\|Q_0), D_\alpha(P_1\|Q_1)\} \\
&= \max\{\sup_{\alpha<\infty} D_\alpha(P_0\|Q_0), \sup_{\alpha<\infty} D_\alpha(P_1\|Q_1)\} \\
&= \max\{D_\infty(P_0\|Q_0), D_\infty(P_1\|Q_1)\}.
\end{aligned}
\]
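The convexity statements of Theorems 11 to 13 lend themselves to a quick numerical sanity check on small alphabets. The sketch below is an illustration under stated assumptions rather than anything from the paper: the helper names (`renyi`, `worst_gaps`, `first_argument_counterexample`), the random sampling scheme and the hand-picked counterexample at $\alpha = 20$ are all hypothetical choices. A positive gap would indicate a violated inequality.

```python
import math
import random

def renyi(p, q, alpha):
    """D_alpha(P||Q) in nats for strictly positive finite distributions."""
    if alpha == 1:
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    s = sum(pi**alpha * qi**(1 - alpha) for pi, qi in zip(p, q))
    return math.log(s) / (alpha - 1)

def random_dist(rng, k):
    """Random strictly positive distribution on k points."""
    w = [rng.random() + 1e-3 for _ in range(k)]
    total = sum(w)
    return [x / total for x in w]

def mix(a, b, lam):
    """Ordinary mixture (1 - lam) * a + lam * b."""
    return [(1 - lam) * x + lam * y for x, y in zip(a, b)]

def worst_gaps(alpha, trials=2000, k=3, seed=0):
    """Largest observed value of (left-hand side - right-hand side) for
    joint convexity (23), convexity in Q (26) and quasi-convexity (27)."""
    rng = random.Random(seed)
    joint = second = quasi = -math.inf
    for _ in range(trials):
        p0, p1, q0, q1 = (random_dist(rng, k) for _ in range(4))
        lam = rng.random()
        d0, d1 = renyi(p0, q0, alpha), renyi(p1, q1, alpha)
        d_mix = renyi(mix(p0, p1, lam), mix(q0, q1, lam), alpha)
        joint = max(joint, d_mix - ((1 - lam) * d0 + lam * d1))
        quasi = max(quasi, d_mix - max(d0, d1))
        d_q = renyi(p0, mix(q0, q1, lam), alpha)
        second = max(second, d_q - ((1 - lam) * d0 + lam * renyi(p0, q1, alpha)))
    return joint, second, quasi

def first_argument_counterexample(alpha=20):
    """Hand-picked pair with Q0 = Q1 uniform on two points, for which the
    convexity gap is positive at a large order (illustrative choice)."""
    q = [0.5, 0.5]
    p0, p1 = [0.55, 0.45], [0.95, 0.05]
    avg = 0.5 * renyi(p0, q, alpha) + 0.5 * renyi(p1, q, alpha)
    return renyi(mix(p0, p1, 0.5), q, alpha) - avg

for a in (0.5, 1, 3):
    print(a, worst_gaps(a))
print("gap at alpha = 20:", first_argument_counterexample())
```

For $\alpha \in \{0.5, 1\}$ all three gaps should stay at or below zero up to rounding, while for $\alpha = 3$ only the second-argument and quasi-convexity gaps are guaranteed to do so.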
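C. A Generalized Pythagorean Inequality

An important result in statistical applications of information theory is the Pythagorean inequality for Kullback-Leibler divergence [30]–[32]. It states that, if $\mathcal{P}$ is a convex set of distributions, $Q$ is any distribution not in $\mathcal{P}$, and $D_{\min} = \inf_{P \in \mathcal{P}} D(P\|Q)$, then there exists a distribution $P^*$ such that
\[
D(P\|Q) \ge D(P\|P^*) + D_{\min} \quad \text{for all } P \in \mathcal{P}. \tag{28}
\]
The main use of the Pythagorean inequality lies in its implication that if $P_1, P_2, \ldots$ is a sequence of distributions in $\mathcal{P}$ such that $D(P_n\|Q) \to D_{\min}$, then $P_n$ converges to $P^*$ in the strong sense that $D(P_n\|P^*) \to 0$.

For $\alpha \ne 1$ Rényi divergence does not satisfy the ordinary Pythagorean inequality, but there does exist a generalization if we replace convexity of $\mathcal{P}$ by the following alternative notion of convexity:

Definition 4. For $\alpha \in (0, \infty)$, we will call a set of distributions $\mathcal{P}$ $\alpha$-convex if, for any probability distribution $\lambda = (\lambda_1, \lambda_2)$ and any two distributions $P_1, P_2 \in \mathcal{P}$, we also have $P_\lambda \in \mathcal{P}$, where $P_\lambda$ is the $(\alpha, \lambda)$-mixture of $P_1$ and $P_2$, which will be defined below.

For $\alpha = 1$, the $(\alpha, \lambda)$-mixture is simply the ordinary mixture $\lambda_1 P_1 + \lambda_2 P_2$, so that 1-convexity is equivalent to ordinary convexity. We generalize this to other $\alpha$ as follows:

Definition 5. Let $\alpha \in (0, \infty)$ and let $P_1, \ldots, P_m$ be any probability distributions. Then for any probability distribution $\lambda = (\lambda_1, \ldots, \lambda_m)$ we define the $(\alpha, \lambda)$-mixture $P_\lambda$ of $P_1, \ldots, P_m$ as the distribution with density
\[
p_\lambda = \frac{\big(\sum_{\theta=1}^m \lambda_\theta p_\theta^\alpha\big)^{1/\alpha}}{Z}, \quad \text{where } Z = \int \Big(\sum_{\theta=1}^m \lambda_\theta p_\theta^\alpha\Big)^{1/\alpha} d\mu \tag{29}
\]
is a normalizing constant.

The normalizing constant $Z$ is always well defined:

Lemma 3. The normalizing constant $Z$ in (29) is bounded by
\[
Z \in \begin{cases} [m^{-(1-\alpha)/\alpha}, 1] & \text{for } \alpha \in (0, 1], \\ [1, m^{(\alpha-1)/\alpha}] & \text{for } \alpha \in [1, \infty). \end{cases} \tag{30}
\]

Proof: For $\alpha = 1$, we have $Z = 1$, as required. So it remains to consider the simple orders. Let $f(y) = y^{1/\alpha}$ for $y \ge 0$, so that $Z = \int f\big(\sum_\theta \lambda_\theta p_\theta^\alpha\big) \, d\mu$. Suppose first that $\alpha \in (0, 1)$. Then $f$ is convex, which implies that $f(a + b) - f(a) \ge f(b) - f(0) = f(b)$ for any $a, b$, so that, by induction, $f(\sum_\theta a_\theta) \ge \sum_\theta f(a_\theta)$ for any $a_\theta$. Taking $a_\theta = \lambda_\theta p_\theta^\alpha$ and using Jensen's inequality, we find:
\[
\begin{aligned}
\sum_\theta f(\lambda_\theta p_\theta^\alpha) &\le f\Big(\sum_\theta \lambda_\theta p_\theta^\alpha\Big) \le \sum_\theta \lambda_\theta f(p_\theta^\alpha) \\
\sum_\theta \lambda_\theta^{1/\alpha} p_\theta &\le \Big(\sum_\theta \lambda_\theta p_\theta^\alpha\Big)^{1/\alpha} \le \sum_\theta \lambda_\theta p_\theta.
\end{aligned}
\]
Since every $p_\theta$ integrates to 1, it follows that
\[
\sum_\theta \lambda_\theta^{1/\alpha} \le Z \le 1.
\]
The left-hand side is minimized at $\lambda = 1/m$, where it equals $m^{-(1-\alpha)/\alpha}$, which completes the proof for $\alpha \in (0, 1)$. The proof for $\alpha \in (1, \infty)$ goes the same way, except that all inequalities are reversed because $f$ is concave.

And, like for $\alpha = 1$, the set of $(\alpha, \lambda)$-mixtures is closed under taking further mixtures of its elements:

Lemma 4. Let $\alpha \in (0, \infty)$, let $P_1, \ldots, P_m$ be arbitrary probability distributions and let $P_{\lambda_1}$ and $P_{\lambda_2}$ be their $(\alpha, \lambda_1)$- and $(\alpha, \lambda_2)$-mixtures for some distributions $\lambda_1, \lambda_2$. Then, for any distribution $\gamma = (\gamma_1, \gamma_2)$, the $(\alpha, \gamma)$-mixture of $P_{\lambda_1}$ and $P_{\lambda_2}$ is an $(\alpha, \nu)$-mixture of $P_1, \ldots, P_m$ for the distribution $\nu$ such that
\[
\nu = \frac{\gamma_1}{Z_1^\alpha C} \lambda_1 + \frac{\gamma_2}{Z_2^\alpha C} \lambda_2, \tag{31}
\]
where $C = \frac{\gamma_1}{Z_1^\alpha} + \frac{\gamma_2}{Z_2^\alpha}$, and $Z_1$ and $Z_2$ are the normalizing constants of $P_{\lambda_1}$ and $P_{\lambda_2}$ as defined in (29).

Proof: Let $M_\gamma$ be the $(\alpha, \gamma)$-mixture of $P_{\lambda_1}$ and $P_{\lambda_2}$, and take $\lambda_i = (\lambda_{i,1}, \ldots, \lambda_{i,m})$. Then
\[
\begin{aligned}
m_\gamma &\propto \big(\gamma_1 p_{\lambda_1}^\alpha + \gamma_2 p_{\lambda_2}^\alpha\big)^{1/\alpha} \\
&= \Big(\frac{\gamma_1}{Z_1^\alpha} \sum_\theta \lambda_{1,\theta} p_\theta^\alpha + \frac{\gamma_2}{Z_2^\alpha} \sum_\theta \lambda_{2,\theta} p_\theta^\alpha\Big)^{1/\alpha} \\
&\propto \Big(\sum_\theta \frac{\frac{\gamma_1 \lambda_{1,\theta}}{Z_1^\alpha} + \frac{\gamma_2 \lambda_{2,\theta}}{Z_2^\alpha}}{C} \, p_\theta^\alpha\Big)^{1/\alpha},
\end{aligned}
\]
from which the result follows.

We are now ready to generalize the Pythagorean inequality to any $\alpha \in (0, \infty)$:

Theorem 14 (Pythagorean Inequality). Let $\alpha \in (0, \infty)$. Suppose that $\mathcal{P}$ is an $\alpha$-convex set of distributions. Let $Q$ be an arbitrary distribution and suppose that the $\alpha$-information projection
\[
P^* = \arg\min_{P \in \mathcal{P}} D_\alpha(P\|Q) \tag{32}
\]
exists. Then we have the Pythagorean inequality
\[
D_\alpha(P\|Q) \ge D_\alpha(P\|P^*) + D_\alpha(P^*\|Q) \quad \text{for all } P \in \mathcal{P}. \tag{33}
\]
This result is new, although the work of Sundaresan on a generalization of Rényi divergence might be related [33], [34]. Our proof follows the same approach as the proof for $\alpha = 1$ by Cover and Thomas [30].

Proof: For $\alpha = 1$, this is just the standard Pythagorean inequality for Kullback-Leibler divergence. See, for example, the proof by Topsøe [32]. It remains to prove the theorem when $\alpha$ is a simple order.

Let $P \in \mathcal{P}$ be arbitrary, and let $P_\lambda$ be the $(\alpha, (1-\lambda, \lambda))$-mixture of $P^*$ and $P$. Since $\mathcal{P}$ is $\alpha$-convex and $P^*$ is the minimizer over $\mathcal{P}$, we have $\frac{d}{d\lambda} D_\alpha(P_\lambda\|Q) \big|_{\lambda=0} \ge 0$. This derivative evaluates to:
\[
\frac{d}{d\lambda} D_\alpha(P_\lambda\|Q) = \frac{1}{\alpha-1} \frac{\int p^\alpha q^{1-\alpha} \, d\mu - \int (p^*)^\alpha q^{1-\alpha} \, d\mu}{Z_\lambda^\alpha \int p_\lambda^\alpha q^{1-\alpha} \, d\mu}
- \frac{\alpha}{\alpha-1} \frac{(1-\lambda) \int (p^*)^\alpha q^{1-\alpha} \, d\mu + \lambda \int p^\alpha q^{1-\alpha} \, d\mu}{Z_\lambda^{\alpha+1} \int p_\lambda^\alpha q^{1-\alpha} \, d\mu} \frac{d}{d\lambda} Z_\lambda.
\]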
Let $X_\lambda = \big((1-\lambda)(p^*)^\alpha + \lambda p^\alpha\big)^{1/\alpha}$, so that $Z_\lambda = \int X_\lambda \, d\mu$. If $\alpha \in (0, 1)$, then $X_\lambda$ is convex in $\lambda$ so that $\frac{X_\lambda - X_0}{\lambda}$ is nondecreasing in $\lambda$, and if $\alpha \in (1, \infty)$, then $X_\lambda$ is concave in $\lambda$ so that $\frac{X_\lambda - X_0}{\lambda}$ is nonincreasing. By Lemma 3, we also see that $\int \frac{X_\lambda - X_0}{\lambda} \, d\mu = \frac{Z_\lambda - 1}{\lambda}$ is bounded by 0 for $\lambda > 0$, from above if $\alpha \in (0, 1)$ and from below if $\alpha \in (1, \infty)$. It therefore follows from the monotone convergence theorem that
\[
\frac{d}{d\lambda} Z_\lambda \Big|_{\lambda=0} = \lim_{\lambda \downarrow 0} \frac{Z_\lambda - Z_0}{\lambda} = \lim_{\lambda \downarrow 0} \int \frac{X_\lambda - X_0}{\lambda} \, d\mu
= \int \lim_{\lambda \downarrow 0} \frac{X_\lambda - X_0}{\lambda} \, d\mu = \int \frac{d}{d\lambda} X_\lambda \Big|_{\lambda=0} \, d\mu,
\]
where
\[
\frac{d}{d\lambda} X_\lambda = \frac{1}{\alpha} \big((1-\lambda)(p^*)^\alpha + \lambda p^\alpha\big)^{1/\alpha - 1} \big(p^\alpha - (p^*)^\alpha\big).
\]
If $D_\alpha(P^*\|Q) = \infty$, then the theorem is trivially true, so we may assume without loss of generality that $D_\alpha(P^*\|Q) < \infty$, which implies that $0 < \int (p^*)^\alpha q^{1-\alpha} \, d\mu < \infty$.

Putting everything together, we therefore find
\[
\begin{aligned}
0 &\le \frac{d}{d\lambda} D_\alpha(P_\lambda\|Q) \Big|_{\lambda=0} \\
&= \frac{1}{\alpha-1} \frac{\int p^\alpha q^{1-\alpha} \, d\mu - \int (p^*)^\alpha q^{1-\alpha} \, d\mu}{\int (p^*)^\alpha q^{1-\alpha} \, d\mu}
- \frac{1}{\alpha-1} \int (p^*)^{1-\alpha} \big(p^\alpha - (p^*)^\alpha\big) \, d\mu \\
&= \frac{1}{\alpha-1} \Big( \frac{\int p^\alpha q^{1-\alpha} \, d\mu}{\int (p^*)^\alpha q^{1-\alpha} \, d\mu} - \int (p^*)^{1-\alpha} p^\alpha \, d\mu \Big).
\end{aligned}
\]
Hence, if $\alpha > 1$ we have
\[
\int p^\alpha q^{1-\alpha} \, d\mu \ge \int (p^*)^\alpha q^{1-\alpha} \, d\mu \int (p^*)^{1-\alpha} p^\alpha \, d\mu,
\]
and if $\alpha < 1$ we have the converse of this inequality. In both cases, the Pythagorean inequality (33) follows upon taking logarithms and dividing by $\alpha - 1$ (which flips the inequality sign for $\alpha < 1$).

D. Continuity

In this section we study continuity properties of the Rényi divergence $D_\alpha(P\|Q)$ of different orders in the pair of probability distributions $(P, Q)$. It turns out that continuity depends on the order $\alpha$ and the topology on the set of all probability distributions.

The set of probability distributions on $(\mathcal{X}, \mathcal{F})$ may be equipped with the topology of setwise convergence, which is the coarsest topology such that, for any event $A \in \mathcal{F}$, the function $P \mapsto P(A)$ that maps a distribution to its probability on $A$, is continuous. In this topology, convergence of a sequence of probability distributions $P_1, P_2, \ldots$ to a probability distribution $P$ means that $P_n(A) \to P(A)$ for any $A \in \mathcal{F}$.

Alternatively, one might consider the topology defined by the total variation distance
\[
V(P, Q) = \int |p - q| \, d\mu = 2 \sup_{A \in \mathcal{F}} |P(A) - Q(A)|, \tag{34}
\]
in which $P_n \to P$ means that $V(P_n, P) \to 0$. The total variation topology is stronger than the topology of setwise convergence in the sense that convergence in total variation distance implies convergence on any $A \in \mathcal{F}$. The two topologies coincide if the sample space $\mathcal{X}$ is countable.

In general, Rényi divergence is lower semi-continuous for positive orders:

Theorem 15. For any order $\alpha \in (0, \infty]$, $D_\alpha(P\|Q)$ is a lower semi-continuous function of the pair $(P, Q)$ in the topology of setwise convergence.

Proof: Suppose $\mathcal{X} = \{x_1, \ldots, x_k\}$ is finite. Then for any simple order $\alpha$
\[
D_\alpha(P\|Q) = \frac{1}{\alpha-1} \ln \sum_{i=1}^k p_i^\alpha q_i^{1-\alpha},
\]
where $p_i = P(x_i)$ and $q_i = Q(x_i)$. If $0 < \alpha < 1$, then $p_i^\alpha q_i^{1-\alpha}$ is continuous in $(P, Q)$. For $1 < \alpha < \infty$, it is only discontinuous at $p_i = q_i = 0$, but there $p_i^\alpha q_i^{1-\alpha} = 0 = \min_{(P,Q)} p_i^\alpha q_i^{1-\alpha}$, so then $p_i^\alpha q_i^{1-\alpha}$ is still lower semi-continuous. These properties carry over to $\sum_{i=1}^k p_i^\alpha q_i^{1-\alpha}$ and thus $D_\alpha(P\|Q)$ is continuous for $0 < \alpha < 1$ and lower semi-continuous for $\alpha > 1$. A supremum over (lower semi-)continuous functions is itself lower semi-continuous. Therefore, for simple orders $\alpha$, Theorem 2 implies that $D_\alpha(P\|Q)$ is lower semi-continuous for arbitrary $\mathcal{X}$. This property extends to the extended orders 1 and $\infty$ by $D_\beta(P\|Q) = \sup_{\alpha<\beta} D_\alpha(P\|Q)$ for $\beta \in \{1, \infty\}$.

Moreover, if $\alpha \in (0, 1)$ and the total variation topology is assumed, then Theorem 17 below shows that Rényi divergence is uniformly continuous.

First we prove that the topologies induced by Rényi divergences of orders $\alpha \in (0, 1)$ are all equivalent:

Theorem 16. For any $0 < \alpha \le \beta < 1$
\[
\frac{\alpha}{\beta} \frac{1-\beta}{1-\alpha} D_\beta(P\|Q) \le D_\alpha(P\|Q) \le D_\beta(P\|Q).
\]
This follows from the following symmetry-like property, which may be verified directly.

Proposition 2 (Skew Symmetry). For any $0 < \alpha < 1$
\[
D_\alpha(P\|Q) = \frac{\alpha}{1-\alpha} D_{1-\alpha}(Q\|P).
\]
Note that, in particular, Rényi divergence is symmetric for $\alpha = 1/2$, but that skew symmetry does not hold for $\alpha = 0$ and $\alpha = 1$.

Proof of Theorem 16: We have already established the second inequality in Theorem 3, so it remains to prove the first one. Skew symmetry implies that
\[
\frac{1-\alpha}{\alpha} D_\alpha(P\|Q) = D_{1-\alpha}(Q\|P) \ge D_{1-\beta}(Q\|P) = \frac{1-\beta}{\beta} D_\beta(P\|Q),
\]
from which the result follows.

Remark 2. By (5), these results show that, for $\alpha \in (0, 1)$, $D_\alpha(P_n\|Q) \to 0$ is equivalent to convergence of $P_n$ to $Q$ in Hellinger distance, which is equivalent to convergence of $P_n$ to $Q$ in total variation [28, p. 364].
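Skew symmetry (Proposition 2) and the two-sided bound of Theorem 16 are easy to confirm numerically for orders in $(0, 1)$. The snippet below is a small sketch under stated assumptions: the helper names `renyi`, `skew_gap` and `sandwich`, and the two example distributions, are illustrative and do not come from the paper.

```python
import math

def renyi(p, q, alpha):
    """D_alpha(P||Q) in nats for a simple order alpha and finite p, q > 0."""
    s = sum(pi**alpha * qi**(1 - alpha) for pi, qi in zip(p, q))
    return math.log(s) / (alpha - 1)

def skew_gap(p, q, alpha):
    """D_alpha(P||Q) - alpha/(1-alpha) * D_{1-alpha}(Q||P); zero by Proposition 2."""
    return renyi(p, q, alpha) - alpha / (1 - alpha) * renyi(q, p, 1 - alpha)

def sandwich(p, q, alpha, beta):
    """Return (lower bound, D_alpha, D_beta) from Theorem 16 for alpha <= beta < 1."""
    lower = (alpha / beta) * ((1 - beta) / (1 - alpha)) * renyi(p, q, beta)
    return lower, renyi(p, q, alpha), renyi(p, q, beta)

# Illustrative distributions (an assumption of this sketch).
p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]

for a in (0.1, 0.3, 0.5, 0.9):
    print(f"alpha = {a}: skew symmetry gap = {skew_gap(p, q, a):.2e}")

lower, d_alpha, d_beta = sandwich(p, q, 0.2, 0.7)
print(f"{lower:.6f} <= {d_alpha:.6f} <= {d_beta:.6f}")
```

At $\alpha = 1/2$ the skew symmetry factor equals 1, so `renyi(p, q, 0.5)` and `renyi(q, p, 0.5)` should agree up to rounding.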
Next we shall prove a stronger result on the relation between Rényi divergence and total variation.

Theorem 17. For $\alpha \in (0, 1)$, the Rényi divergence $D_\alpha(P\|Q)$ is a uniformly continuous function of $(P, Q)$ in the total variation topology.

Lemma 5. Let $0 < \alpha < 1$. Then for all $x, y \ge 0$ and $\varepsilon > 0$
\[
|x^\alpha - y^\alpha| \le \varepsilon^\alpha + \varepsilon^{\alpha-1} |x - y|.
\]

Proof: If $x, y \le \varepsilon$ or $x = y$ the inequality $|x^\alpha - y^\alpha| \le \varepsilon^\alpha$ is obvious. So assume that $x > y$ and $x \ge \varepsilon$. Then
\[
\frac{|x^\alpha - y^\alpha|}{|x - y|} \le \frac{|x^\alpha - 0^\alpha|}{|x - 0|} = x^{\alpha-1} \le \varepsilon^{\alpha-1}.
\]

Proof of Theorem 17: First note that Rényi divergence is a function of the power divergence $d_\alpha(P, Q) = \int \Big(1 - \big(\frac{dP}{dQ}\big)^\alpha\Big) dQ$:
\[
D_\alpha(P\|Q) = \frac{1}{\alpha-1} \ln\big(1 - d_\alpha(P, Q)\big).
\]
Since $x \mapsto \frac{1}{\alpha-1} \ln(1-x)$ is continuous, it is sufficient to prove that $d_\alpha(P, Q)$ is a uniformly continuous function of $(P, Q)$. For any $\varepsilon > 0$ and distributions $P_1$, $P_2$ and $Q$, Lemma 5 implies that
\[
\begin{aligned}
|d_\alpha(P_1, Q) - d_\alpha(P_2, Q)| &\le \int \Big| \Big(\frac{dP_1}{dQ}\Big)^\alpha - \Big(\frac{dP_2}{dQ}\Big)^\alpha \Big| \, dQ \\
&\le \int \Big( \varepsilon^\alpha + \varepsilon^{\alpha-1} \Big| \frac{dP_1}{dQ} - \frac{dP_2}{dQ} \Big| \Big) dQ \\
&= \varepsilon^\alpha + \varepsilon^{\alpha-1} \int \Big| \frac{dP_1}{dQ} - \frac{dP_2}{dQ} \Big| \, dQ \\
&= \varepsilon^\alpha + \varepsilon^{\alpha-1} V(P_1, P_2).
\end{aligned}
\]
As $d_\alpha(P, Q) = d_{1-\alpha}(Q, P)$, it also follows that $|d_\alpha(P, Q_1) - d_\alpha(P, Q_2)| \le \varepsilon^{1-\alpha} + \varepsilon^{-\alpha} V(Q_1, Q_2)$ for any $Q_1$, $Q_2$ and $P$. Therefore
\[
\begin{aligned}
|d_\alpha(P_1, Q_1) - d_\alpha(P_2, Q_2)| &\le |d_\alpha(P_1, Q_1) - d_\alpha(P_2, Q_1)| + |d_\alpha(P_2, Q_1) - d_\alpha(P_2, Q_2)| \\
&\le \varepsilon^\alpha + \varepsilon^{\alpha-1} V(P_1, P_2) + \varepsilon^{1-\alpha} + \varepsilon^{-\alpha} V(Q_1, Q_2),
\end{aligned}
\]
from which the theorem follows.

A partial extension to $\alpha = 0$ follows:

Corollary 1. The Rényi divergence $D_0(P\|Q)$ is an upper semi-continuous function of $(P, Q)$ in the total variation topology.

Proof: This follows from Theorem 17 because $D_0(P\|Q)$ is the infimum of the continuous functions $(P, Q) \mapsto D_\alpha(P\|Q)$ for $\alpha \in (0, 1)$.

If we consider continuity in $Q$ only, then for any finite sample space we obtain:

Theorem 18. Suppose $\mathcal{X}$ is finite, and let $\alpha \in [0, \infty]$. Then for any $P$ the Rényi divergence $D_\alpha(P\|Q)$ is continuous in $Q$ in the topology of setwise convergence.

Proof: Directly from the closed-form expressions for Rényi divergence.

Finally, we will also consider the weak topology, which is weaker than the two topologies discussed above. In the weak topology, convergence of $P_1, P_2, \ldots$ to $P$ means that
\[
\int f(x) \, dP_n(x) \to \int f(x) \, dP(x) \tag{35}
\]
for any bounded, continuous function $f \colon \mathcal{X} \to \mathbb{R}$. Unlike for the previous two topologies, the reference to continuity of $f$ means that the weak topology depends on the topology of the sample space $\mathcal{X}$. We will therefore assume that $\mathcal{X}$ is a Polish space (that is, it should be a complete separable metric space), and we let $\mathcal{F}$ be the Borel $\sigma$-algebra. Then Prokhorov [35] shows that there exists a metric that makes the set of finite measures on $\mathcal{X}$ a Polish space as well, and which is such that convergence in the metric is equivalent to (35). The weak topology then, is the topology induced by this metric.

Theorem 19. Suppose that $\mathcal{X}$ is a Polish space. Then for any order $\alpha \in (0, \infty]$, $D_\alpha(P\|Q)$ is a lower semi-continuous function of the pair $(P, Q)$ in the weak topology.

The proof is essentially the same as the proof for $\alpha = 1$ by Posner [36].

Proof: Let $P_1, P_2, \ldots$ and $Q_1, Q_2, \ldots$ be sequences of distributions that weakly converge to $P$ and $Q$, respectively. We need to show that
\[
\liminf_{n\to\infty} D_\alpha(P_n\|Q_n) \ge D_\alpha(P\|Q). \tag{36}
\]
For any set $A \in \mathcal{F}$, let $\partial A$ denote its boundary, which is its closure minus its interior, and let $\mathcal{F}_0 \subseteq \mathcal{F}$ consist of the sets $A \in \mathcal{F}$ such that $P(\partial A) = Q(\partial A) = 0$. Then $\mathcal{F}_0$ is an algebra by Lemma 1.1 of Prokhorov [35], applied to the measure $P + Q$, and the Portmanteau theorem implies that $P_n(A) \to P(A)$ and $Q_n(A) \to Q(A)$ for any $A \in \mathcal{F}_0$ [37].

Posner [36, proof of Theorem 1] shows that $\mathcal{F}_0$ generates $\mathcal{F}$ (that is, $\sigma(\mathcal{F}_0) = \mathcal{F}$). By the translator's proof of Theorem 2.4.1 in Pinsker's book [38], this implies that, for any finite partition $\{A_1, \ldots, A_k\} \subseteq \mathcal{F}$ and any $\gamma > 0$, there exists a finite partition $\{A'_1, \ldots, A'_k\} \subseteq \mathcal{F}_0$ such that $P(A_i \triangle A'_i) \le \gamma$ and $Q(A_i \triangle A'_i) \le \gamma$ for all $i$, where $A_i \triangle A'_i = (A_i \setminus A'_i) \cup (A'_i \setminus A_i)$ denotes the symmetric set difference. By the data processing inequality and lower semi-continuity in the topology of setwise convergence, this implies that (15) still holds when the supremum is restricted to finite partitions $\mathcal{P}$ in $\mathcal{F}_0$ instead of $\mathcal{F}$.

Thus, for any $\varepsilon > 0$, we can find a finite partition $\mathcal{P} \subseteq \mathcal{F}_0$ such that
\[
D_\alpha(P_{|\mathcal{P}}\|Q_{|\mathcal{P}}) \ge D_\alpha(P\|Q) - \varepsilon.
\]
The data processing inequality and the fact that $P_n(A) \to P(A)$ and $Q_n(A) \to Q(A)$ for all $A \in \mathcal{P}$, together with lower semi-continuity in the topology of setwise convergence, then imply that
\[
D_\alpha(P_n\|Q_n) \ge D_\alpha\big((P_n)_{|\mathcal{P}} \| (Q_n)_{|\mathcal{P}}\big) \ge D_\alpha(P_{|\mathcal{P}}\|Q_{|\mathcal{P}}) - \varepsilon \ge D_\alpha(P\|Q) - 2\varepsilon
\]
for all sufficiently large $n$. Consequently,
\[
\liminf_{n\to\infty} D_\alpha(P_n\|Q_n) \ge D_\alpha(P\|Q) - 2\varepsilon
\]
for any $\varepsilon > 0$, and (36) follows by letting $\varepsilon$ tend to 0.

Theorem 20 (Compact Sublevel Sets). Suppose $\mathcal{X}$ is a Polish space, let $Q$ be arbitrary, and let $c \in [0, \infty)$ be a constant. Then the sublevel set
\[
S = \{P \mid D_\alpha(P\|Q) \le c\} \tag{37}
\]
is convex and compact in the topology of weak convergence for any order $\alpha \in [1, \infty]$.

Proof: Convexity follows from quasi-convexity of Rényi divergence in its first argument.

Suppose that $P_1, P_2, \ldots \in S$ converges to a finite measure $P$. Then (35), applied to the constant function $f(x) = 1$, implies that $P(\mathcal{X}) = 1$, so that $P$ is also a probability distribution. Hence by lower semi-continuity (Theorem 19) $S$ is closed. It is therefore sufficient to show that $S$ is relatively compact.

For any event $A \in \mathcal{F}$, let $A^c = \mathcal{X} \setminus A$ denote its complement. Prokhorov [35, Theorem 1.12] shows that $S$ is relatively compact if, for any $\varepsilon > 0$, there exists a compact set $A \subseteq \mathcal{X}$ such that $P(A^c) < \varepsilon$ for all $P \in S$.

Since $\mathcal{X}$ is a Polish space, for any $\delta > 0$ there exists a compact set $B_\delta \subseteq \mathcal{X}$ such that $Q(B_\delta) \ge 1 - \delta$ [37, Lemma 1.3.2]. For any distribution $P$, let $P_{|B_\delta}$ denote the restriction of $P$ to the binary partition $\{B_\delta, B_\delta^c\}$. Then, by monotonicity in $\alpha$ and the data processing inequality, we have, for any $P \in S$,
\[
\begin{aligned}
c \ge D_\alpha(P\|Q) \ge D_1(P\|Q) &\ge D_1(P_{|B_\delta}\|Q_{|B_\delta}) \\
&= P(B_\delta) \ln \frac{P(B_\delta)}{Q(B_\delta)} + P(B_\delta^c) \ln \frac{P(B_\delta^c)}{Q(B_\delta^c)} \\
&\ge P(B_\delta) \ln P(B_\delta) + P(B_\delta^c) \ln P(B_\delta^c) + P(B_\delta^c) \ln \frac{1}{Q(B_\delta^c)} \\
&\ge \frac{-2}{e} + P(B_\delta^c) \ln \frac{1}{Q(B_\delta^c)},
\end{aligned}
\]
where the last inequality follows from $x \ln x \ge -1/e$. Consequently,
\[
P(B_\delta^c) \le \frac{c + 2/e}{\ln\big(1/Q(B_\delta^c)\big)},
\]
and since $Q(B_\delta^c) \to 0$ as $\delta$ tends to 0 we can satisfy the condition of Prokhorov's theorem by taking $A$ equal to $B_\delta$ for any sufficiently small $\delta$ depending on $\varepsilon$.

E. Limits of σ-Algebras

As shown by Theorem 2, there exists a sequence of finite partitions $\mathcal{P}_1, \mathcal{P}_2, \ldots$ such that
\[
D_\alpha(P_{|\mathcal{P}_n}\|Q_{|\mathcal{P}_n}) \uparrow D_\alpha(P\|Q). \tag{38}
\]
Theorem 21 below elaborates on this result. It implies that (38) holds for any increasing sequence of partitions $\mathcal{P}_1 \subseteq \mathcal{P}_2 \subseteq \cdots$ that generate $\sigma$-algebras converging to $\mathcal{F}$, in the sense that $\mathcal{F} = \sigma\big(\bigcup_{n=1}^\infty \mathcal{P}_n\big)$. An analogous result holds for infinite sequences of increasingly coarse partitions, which is shown by Theorem 22. For the special case $\alpha = 1$, information-theoretic proofs of Theorems 21 and 22 are given by Barron [39] and Harremoës and Holst [40]. Theorem 21 may also be derived from general properties of $f$-divergences [27].

Theorem 21 (Increasing). Let $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots \subseteq \mathcal{F}$ be an increasing family of $\sigma$-algebras, and let $\mathcal{F}_\infty = \sigma\big(\bigcup_{n=1}^\infty \mathcal{F}_n\big)$ be the smallest $\sigma$-algebra containing them. Then for any order $\alpha \in (0, \infty]$
\[
\lim_{n\to\infty} D_\alpha(P_{|\mathcal{F}_n}\|Q_{|\mathcal{F}_n}) = D_\alpha(P_{|\mathcal{F}_\infty}\|Q_{|\mathcal{F}_\infty}). \tag{39}
\]
For $\alpha = 0$, (39) does not hold. A counterexample is given after Example 3 below.

Lemma 6. Let $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots \subseteq \mathcal{F}$ be an increasing family of $\sigma$-algebras, and suppose that $\mu$ is a probability distribution. Then the family of random variables $\{p_n\}_{n \ge 1}$ with members $p_n = E[p \mid \mathcal{F}_n]$ is uniformly integrable (with respect to $\mu$).

The proof of this lemma is a special case of part of the proof of Lévy's upward convergence theorem in Shiryaev's textbook [28, p. 510]. We repeat it here for completeness.

Proof: For any constants $b, c > 0$
\[
\begin{aligned}
\int_{p_n > b} p_n \, d\mu = \int_{p_n > b} p \, d\mu
&\le \int_{p_n > b,\, p \le c} p \, d\mu + \int_{p > c} p \, d\mu \\
&\le c \cdot \mu(p_n > b) + \int_{p > c} p \, d\mu \\
&\overset{(*)}{\le} \frac{c}{b} E[p_n] + \int_{p > c} p \, d\mu = \frac{c}{b} + \int_{p > c} p \, d\mu,
\end{aligned}
\]
in which the inequality marked by $(*)$ is Markov's. Consequently
\[
\lim_{b\to\infty} \sup_n \int_{p_n > b} |p_n| \, d\mu = \lim_{c\to\infty} \lim_{b\to\infty} \sup_n \int_{p_n > b} |p_n| \, d\mu
\le \lim_{c\to\infty} \lim_{b\to\infty} \frac{c}{b} + \lim_{c\to\infty} \int_{p > c} p \, d\mu = 0,
\]
which proves the lemma.

Proof of Theorem 21: As by the data processing inequality $D_\alpha(P_{|\mathcal{F}_n}\|Q_{|\mathcal{F}_n}) \le D_\alpha(P\|Q)$ for all $n$, we only need to show that $\lim_{n\to\infty} D_\alpha(P_{|\mathcal{F}_n}\|Q_{|\mathcal{F}_n}) \ge D_\alpha(P_{|\mathcal{F}_\infty}\|Q_{|\mathcal{F}_\infty})$. To this end, assume without loss of generality that $\mathcal{F} = \mathcal{F}_\infty$ and that $\mu$ is a probability distribution (i.e. $\mu = (P + Q)/2$). Let $p_n = E[p \mid \mathcal{F}_n]$ and $q_n = E[q \mid \mathcal{F}_n]$, and define the distributions $\tilde{P}_n$ and $\tilde{Q}_n$ on $(\mathcal{X}, \mathcal{F})$ by
\[
\tilde{P}_n(A) = \int_A p_n \, d\mu, \qquad \tilde{Q}_n(A) = \int_A q_n \, d\mu \qquad (A \in \mathcal{F}),
\]
such that, by the Radon-Nikodým theorem and Proposition 1, $\frac{d\tilde{P}_n}{d\mu} = p_n = \frac{dP_{|\mathcal{F}_n}}{d\mu_{|\mathcal{F}_n}}$ and $\frac{d\tilde{Q}_n}{d\mu} = q_n = \frac{dQ_{|\mathcal{F}_n}}{d\mu_{|\mathcal{F}_n}}$ ($\mu$-a.s.) It follows that
\[
D_\alpha(\tilde{P}_n\|\tilde{Q}_n) = D_\alpha(P_{|\mathcal{F}_n}\|Q_{|\mathcal{F}_n})
\]
for $0 < \alpha < \infty$ and therefore by continuity also for $\alpha = \infty$.

We will proceed to show that $(\tilde{P}_n, \tilde{Q}_n) \to (P, Q)$ in the topology of setwise convergence. By lower semi-continuity
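of Rényi divergence this implies that $\lim_{n\to\infty} D_\alpha(\tilde{P}_n\|\tilde{Q}_n) \ge D_\alpha(P\|Q)$, from which the theorem follows. By Lévy's upward convergence theorem [28, p. 510], $\lim_{n\to\infty} p_n = p$ ($\mu$-a.s.) Hence uniform integrability of the family $\{p_n\}$ (by Lemma 6) implies that for any $A \in \mathcal{F}$
\[
\lim_{n\to\infty} \tilde{P}_n(A) = \lim_{n\to\infty} \int_A p_n \, d\mu = \int_A p \, d\mu = P(A)
\]
[28, Thm. 5, p. 189]. Similarly $\lim_{n\to\infty} \tilde{Q}_n(A) = Q(A)$, so we find that $(\tilde{P}_n, \tilde{Q}_n) \to (P, Q)$, which completes the proof.

Theorem 22 (Decreasing). Let $\mathcal{F} \supseteq \mathcal{F}_1 \supseteq \mathcal{F}_2 \supseteq \cdots$ be a decreasing family of $\sigma$-algebras, and let $\mathcal{F}_\infty = \bigcap_{n=1}^\infty \mathcal{F}_n$ be the largest $\sigma$-algebra contained in all of them. Let $\alpha \in [0, \infty)$. If $\alpha \in [0, 1)$ or there exists an $m$ such that $D_\alpha(P_{|\mathcal{F}_m}\|Q_{|\mathcal{F}_m}) < \infty$, then
\[
\lim_{n\to\infty} D_\alpha(P_{|\mathcal{F}_n}\|Q_{|\mathcal{F}_n}) = D_\alpha(P_{|\mathcal{F}_\infty}\|Q_{|\mathcal{F}_\infty}).
\]
The theorem cannot be extended to the case $\alpha = \infty$.

Lemma 7. Let $\mathcal{F} \supseteq \mathcal{F}_1 \supseteq \mathcal{F}_2 \supseteq \cdots$ be a decreasing family of $\sigma$-algebras. Let $\alpha \in (0, \infty)$, $p_n = \frac{dP_{|\mathcal{F}_n}}{d\mu_{|\mathcal{F}_n}}$, $q_n = \frac{dQ_{|\mathcal{F}_n}}{d\mu_{|\mathcal{F}_n}}$ and $X_n = f\big(\frac{p_n}{q_n}\big)$, where $f(x) = x^\alpha$ if $\alpha \ne 1$ and $f(x) = x \ln x + e^{-1}$ if $\alpha = 1$. If $\alpha \in (0, 1)$, or $E_Q[X_1] < \infty$ and $P \ll Q$, then the family $\{X_n\}_{n \ge 1}$ is uniformly integrable (with respect to $Q$).

Proof: Suppose first that $\alpha \in (0, 1)$. Then for any $b > 0$
\[
\int_{X_n > b} X_n \, dQ \le \int_{X_n > b} X_n \Big(\frac{X_n}{b}\Big)^{(1-\alpha)/\alpha} dQ
\le b^{-(1-\alpha)/\alpha} \int X_n^{1/\alpha} \, dQ \le b^{-(1-\alpha)/\alpha},
\]
and, as $X_n \ge 0$, $\lim_{b\to\infty} \sup_n \int_{|X_n| > b} |X_n| \, dQ = 0$, which was to be shown.

Alternatively, suppose that $\alpha \in [1, \infty)$. Then $\frac{p_n}{q_n} = \frac{dP_{|\mathcal{F}_n}}{dQ_{|\mathcal{F}_n}}$ ($Q$-a.s.) and hence by Proposition 1 and Jensen's inequality for conditional expectations
\[
X_n = f\Big(E\Big[\frac{dP}{dQ} \,\Big|\, \mathcal{F}_n\Big]\Big) \le E\Big[f\Big(\frac{dP}{dQ}\Big) \,\Big|\, \mathcal{F}_n\Big] = E[X_1 \mid \mathcal{F}_n]
\]
($Q$-a.s.) As $\min_x x \ln x = -e^{-1}$, it follows that $X_n \ge 0$ and for any $b, c > 0$
\[
\begin{aligned}
\int_{|X_n| > b} |X_n| \, dQ = \int_{X_n > b} X_n \, dQ
&\le \int_{X_n > b} E[X_1 \mid \mathcal{F}_n] \, dQ = \int_{X_n > b} X_1 \, dQ \\
&= \int_{X_n > b,\, X_1 \le c} X_1 \, dQ + \int_{X_n > b,\, X_1 > c} X_1 \, dQ \\
&\le c \cdot Q(X_n > b) + \int_{X_1 > c} X_1 \, dQ \\
&\le \frac{c}{b} E_Q[X_n] + \int_{X_1 > c} X_1 \, dQ \\
&\le \frac{c}{b} E_Q[X_1] + \int_{X_1 > c} X_1 \, dQ,
\end{aligned}
\]
where $E_Q[X_n] \le E_Q[X_1]$ in the last inequality follows from the data processing inequality. Consequently,
\[
\lim_{b\to\infty} \sup_n \int_{|X_n| > b} |X_n| \, dQ = \lim_{c\to\infty} \lim_{b\to\infty} \sup_n \int_{|X_n| > b} |X_n| \, dQ
\le \lim_{c\to\infty} \lim_{b\to\infty} \frac{c}{b} E_Q[X_1] + \lim_{c\to\infty} \int_{X_1 > c} X_1 \, dQ = 0,
\]
and the lemma follows.

Proof of Theorem 22: First suppose that $\alpha > 0$ and, for $n = 1, 2, \ldots, \infty$, let $p_n = \frac{dP_{|\mathcal{F}_n}}{d\mu_{|\mathcal{F}_n}}$, $q_n = \frac{dQ_{|\mathcal{F}_n}}{d\mu_{|\mathcal{F}_n}}$ and $X_n = f\big(\frac{p_n}{q_n}\big)$ with $f(x) = x^\alpha$ if $\alpha \ne 1$ and $f(x) = x \ln x + e^{-1}$ if $\alpha = 1$, as in Lemma 7. If $\alpha \ge 1$, then assume without loss of generality that $\mathcal{F} = \mathcal{F}_1$ and $m = 1$, such that $D_\alpha(P_{|\mathcal{F}_m}\|Q_{|\mathcal{F}_m}) < \infty$ implies $P \ll Q$. Now, for any $\alpha > 0$, it is sufficient to show that
\[
E_Q[X_n] \to E_Q[X_\infty]. \tag{40}
\]
By Proposition 1, $p_n = E_\mu[p \mid \mathcal{F}_n]$ and $q_n = E_\mu[q \mid \mathcal{F}_n]$. Therefore by a version of Lévy's theorem for decreasing sequences of $\sigma$-algebras [41, Theorem 6.23],
\[
\begin{aligned}
p_n = E_\mu[p \mid \mathcal{F}_n] &\to E_\mu[p \mid \mathcal{F}_\infty] = p_\infty, \\
q_n = E_\mu[q \mid \mathcal{F}_n] &\to E_\mu[q \mid \mathcal{F}_\infty] = q_\infty,
\end{aligned} \qquad (\mu\text{-a.s.})
\]
and hence $X_n \to X_\infty$ ($\mu$-a.s. and therefore $Q$-a.s.) If $0 < \alpha < 1$, then
\[
E_Q[X_n] = E_\mu\big[p_n^\alpha q_n^{1-\alpha}\big] \le E_\mu[\alpha p_n + (1-\alpha) q_n] = 1 < \infty.
\]
And if $\alpha \ge 1$, then by the data processing inequality $D_\alpha(P_{|\mathcal{F}_n}\|Q_{|\mathcal{F}_n}) < \infty$ for all $n$, which implies that also in this case $E_Q[X_n] < \infty$. Hence uniform integrability (by Lemma 7) of the family of nonnegative random variables $\{X_n\}$ implies (40) [28, Thm. 5, p. 189], and the theorem follows for $\alpha > 0$.

The remaining case, $\alpha = 0$, is proved by
\[
\begin{aligned}
\lim_{n\to\infty} D_0(P_{|\mathcal{F}_n}\|Q_{|\mathcal{F}_n})
&= \inf_n \inf_{\alpha>0} D_\alpha(P_{|\mathcal{F}_n}\|Q_{|\mathcal{F}_n}) = \inf_{\alpha>0} \inf_n D_\alpha(P_{|\mathcal{F}_n}\|Q_{|\mathcal{F}_n}) \\
&= \inf_{\alpha>0} D_\alpha(P_{|\mathcal{F}_\infty}\|Q_{|\mathcal{F}_\infty}) = D_0(P_{|\mathcal{F}_\infty}\|Q_{|\mathcal{F}_\infty}).
\end{aligned}
\]

F. Absolute Continuity and Mutual Singularity

Shiryaev [28, pp. 366, 370] relates Hellinger integrals to absolute continuity and mutual singularity of probability distributions. His results may more elegantly be expressed in terms of Rényi divergence. They then follow from the observations that $D_0(P\|Q) = 0$ if and only if $Q$ is absolutely continuous with respect to $P$ and that $D_0(P\|Q) = \infty$ if and only if $P$ and $Q$ are mutually singular, together with right-continuity of $D_\alpha(P\|Q)$ in $\alpha$ at $\alpha = 0$. As illustrated in the next section, these properties give a convenient mathematical tool to establish absolute continuity or mutual singularity of infinite product distributions.

Theorem 23 ([28, Theorem 2, p. 366]). The following conditions are equivalent:
(i) $Q \ll P$,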
(ii) $Q(p > 0) = 1$,
(iii) $D_0(P\|Q) = 0$,
(iv) $\lim_{\alpha \downarrow 0} D_\alpha(P\|Q) = 0$.

Proof: Clearly (ii) is equivalent to $Q(p = 0) = 0$, which is equivalent to (i). The other cases follow by $\lim_{\alpha \downarrow 0} D_\alpha(P\|Q) = D_0(P\|Q) = -\ln Q(p > 0)$.

Theorem 24 ([28, Theorem 3, p. 366]). The following conditions are equivalent:
(i) $P \perp Q$,
(ii) $Q(p > 0) = 0$,
(iii) $D_\alpha(P\|Q) = \infty$ for some $\alpha \in [0, 1)$,
(iv) $D_\alpha(P\|Q) = \infty$ for all $\alpha \in [0, \infty]$.

Proof: Equivalence of (i), (ii) and $D_0(P\|Q) = \infty$ follows from definitions. Equivalence of $D_0(P\|Q) = \infty$ and (iv) follows from the fact that Rényi divergence is continuous on $[0, 1]$ and nondecreasing in $\alpha$. Finally, (iii) for some $\alpha \in (0, 1)$ is equivalent to
\[
\int p^\alpha q^{1-\alpha} \, d\mu = 0,
\]
which holds if and only if $pq = 0$ ($\mu$-a.s.). It follows that in this case (iii) is equivalent to (i).

Contiguity and entire separation are asymptotic versions of absolute continuity and mutual singularity [42]. As might be expected, analogues of Theorems 23 and 24 also hold for these asymptotic concepts.

Let $(\mathcal{X}_n, \mathcal{F}_n)_{n=1,2,\ldots}$ be a sequence of measurable spaces, and let $(P_n)_{n=1,2,\ldots}$ and $(Q_n)_{n=1,2,\ldots}$ be sequences of distributions on these spaces. Then the sequence $(P_n)$ is contiguous with respect to the sequence $(Q_n)$, denoted $(P_n) \triangleleft (Q_n)$, if for all sequences of events $(A_n \in \mathcal{F}_n)_{n=1,2,\ldots}$ such that $Q_n(A_n) \to 0$ as $n \to \infty$, we also have $P_n(A_n) \to 0$. If both $(P_n) \triangleleft (Q_n)$ and $(Q_n) \triangleleft (P_n)$, then the sequences are called mutually contiguous and we write $(P_n) \triangleleft\triangleright (Q_n)$. The sequences $(P_n)$ and $(Q_n)$ are entirely separated, denoted $(P_n) \triangle (Q_n)$, if there exist a sequence of events $(A_n \in \mathcal{F}_n)_{n=1,2,\ldots}$ and a subsequence $(n_k)_{k=1,2,\ldots}$ such that $P_{n_k}(A_{n_k}) \to 0$ and $Q_{n_k}(\mathcal{X}_{n_k} \setminus A_{n_k}) \to 0$ as $k \to \infty$.

Contiguity and entire separation are related to absolute continuity and mutual singularity in the following way [28, p. 369]: if $\mathcal{X}_n = \mathcal{X}$, $P_n = P$ and $Q_n = Q$ for all $n$, then
\[
\begin{aligned}
(P_n) \triangleleft (Q_n) &\Leftrightarrow P \ll Q, \\
(P_n) \triangleleft\triangleright (Q_n) &\Leftrightarrow P \sim Q, \\
(P_n) \triangle (Q_n) &\Leftrightarrow P \perp Q.
\end{aligned} \tag{41}
\]
Theorems 1 and 2 by Shiryaev [28, p. 370] imply the following two asymptotic analogues of Theorems 23 and 24:

Theorem 25. The following conditions are equivalent:
(i) $(Q_n) \triangleleft (P_n)$,
(ii) $\lim_{\alpha \downarrow 0} \limsup_{n\to\infty} D_\alpha(P_n\|Q_n) = 0$.

Theorem 26. The following conditions are equivalent:
(i) $(P_n) \triangle (Q_n)$,
(ii) $\lim_{\alpha \downarrow 0} \limsup_{n\to\infty} D_\alpha(P_n\|Q_n) = \infty$,
(iv) $\limsup_{n\to\infty} D_\alpha(P_n\|Q_n) = \infty$ for all $\alpha \in (0, \infty]$.

If $P_n$ and $Q_n$ are the restrictions of $P$ and $Q$ to an increasing sequence of sub-$\sigma$-algebras that generates $\mathcal{F}$, then the equivalences in (41) continue to hold, because we can relate Theorems 23 and 25 and Theorems 24 and 26 via Theorem 21.

G. Distributions on Sequences

Suppose $(\mathcal{X}^\infty, \mathcal{F}^\infty)$ is the direct product of an infinite sequence of measurable spaces $(\mathcal{X}_1, \mathcal{F}_1), (\mathcal{X}_2, \mathcal{F}_2), \ldots$ That is, $\mathcal{X}^\infty = \mathcal{X}_1 \times \mathcal{X}_2 \times \cdots$ and $\mathcal{F}^\infty$ is the smallest $\sigma$-algebra containing all the cylinder sets
\[
S_n(A) = \{x^\infty \in \mathcal{X}^\infty \mid (x_1, \ldots, x_n) \in A\}, \qquad A \in \mathcal{F}^n,
\]
for $n = 1, 2, \ldots$, where $\mathcal{F}^n = \mathcal{F}_1 \otimes \cdots \otimes \mathcal{F}_n$. Then a sequence of probability distributions $P^1, P^2, \ldots$, where $P^n$ is a distribution on $\mathcal{X}^n = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$, is called consistent if
\[
P^{n+1}(A \times \mathcal{X}_{n+1}) = P^n(A), \qquad A \in \mathcal{F}^n.
\]
For any such consistent sequence there exists a distribution $P^\infty$ on $(\mathcal{X}^\infty, \mathcal{F}^\infty)$ such that its marginal distribution on $\mathcal{X}^n$ is $P^n$, in the sense that
\[
P^\infty(S_n(A)) = P^n(A), \qquad A \in \mathcal{F}^n.
\]
If $P^1, P^2, \ldots$ and $Q^1, Q^2, \ldots$ are two consistent sequences of probability distributions, then it is natural to ask whether the Rényi divergence $D_\alpha(P^n\|Q^n)$ converges to $D_\alpha(P^\infty\|Q^\infty)$. The following theorem shows that it does for $\alpha > 0$.

Theorem 27 (Consistent Distributions). Let $P^1, P^2, \ldots$ and $Q^1, Q^2, \ldots$ be consistent sequences of probability distributions on $(\mathcal{X}^1, \mathcal{F}^1), (\mathcal{X}^2, \mathcal{F}^2), \ldots$, where, for $n = 1, \ldots, \infty$, $(\mathcal{X}^n, \mathcal{F}^n)$ is the direct product of the first $n$ measurable spaces in the infinite sequence $(\mathcal{X}_1, \mathcal{F}_1), (\mathcal{X}_2, \mathcal{F}_2), \ldots$ Then for any $\alpha \in (0, \infty]$
\[
D_\alpha(P^n\|Q^n) \to D_\alpha(P^\infty\|Q^\infty)
\]
as $n \to \infty$.

Proof: Let $\mathcal{G}^n = \{S_n(A) \mid A \in \mathcal{F}^n\}$. Then
\[
D_\alpha(P^n\|Q^n) = D_\alpha(P^\infty_{|\mathcal{G}^n}\|Q^\infty_{|\mathcal{G}^n}) \to D_\alpha(P^\infty\|Q^\infty)
\]
by Theorem 21.

As a special case, we find that finite additivity of Rényi divergence, which is easy to verify, extends to countable additivity:

Theorem 28 (Additivity). For $n = 1, 2, \ldots$, let $(P_n, Q_n)$ be pairs of probability distributions on measurable spaces $(\mathcal{X}_n, \mathcal{F}_n)$. Then for any $\alpha \in [0, \infty]$ and any $N \in \{1, 2, \ldots\}$
\[
\sum_{n=1}^N D_\alpha(P_n\|Q_n) = D_\alpha(P_1 \times \cdots \times P_N \| Q_1 \times \cdots \times Q_N), \tag{42}
\]
and, except for $\alpha = 0$, also
α↓0 n→∞ X
(iii) lim sup Dα (Pn kQn ) = ∞ for some α ∈ (0, 1). Dα (Pn kQn ) = Dα (P1 ×P2 ×· · · kQ1 ×Q2 ×· · · ). (43)
n→∞ n=1
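Finite additivity as in (42) is easy to check numerically on finite alphabets. The following is a minimal sketch, not part of the original paper: the helper names `renyi_divergence`, `product_distribution` and `additivity_gap` are hypothetical, distributions are represented as plain lists of probabilities, and the extended orders 0, 1 and ∞ are handled by their closed forms.

```python
import math
from itertools import product


def renyi_divergence(p, q, alpha):
    """Renyi divergence D_alpha(P||Q) in nats for finite distributions p and q."""
    pairs = list(zip(p, q))
    if alpha == 0:
        # D_0(P||Q) = -ln Q(p > 0)
        mass = sum(qi for pi, qi in pairs if pi > 0)
        return math.inf if mass == 0 else -math.log(mass)
    # True if P puts mass where Q has none (P is not absolutely continuous w.r.t. Q)
    escapes = any(pi > 0 and qi == 0 for pi, qi in pairs)
    if alpha == 1:
        # Kullback-Leibler divergence, with the convention 0 ln 0 = 0
        if escapes:
            return math.inf
        return sum(pi * math.log(pi / qi) for pi, qi in pairs if pi > 0)
    if alpha == math.inf:
        # D_inf(P||Q) = ln max_x p(x)/q(x)
        if escapes:
            return math.inf
        return math.log(max(pi / qi for pi, qi in pairs if pi > 0))
    if alpha > 1 and escapes:
        return math.inf
    # Simple orders: (1/(alpha-1)) ln sum p^alpha q^(1-alpha)
    integral = sum(pi ** alpha * qi ** (1 - alpha)
                   for pi, qi in pairs if pi > 0 and qi > 0)
    return math.inf if integral == 0 else math.log(integral) / (alpha - 1)


def product_distribution(factors):
    """Probabilities of the product distribution, in itertools.product order."""
    return [math.prod(combo) for combo in product(*factors)]


def additivity_gap(ps, qs, alpha):
    """Absolute difference between the two sides of (42)."""
    lhs = sum(renyi_divergence(p, q, alpha) for p, q in zip(ps, qs))
    rhs = renyi_divergence(product_distribution(ps),
                           product_distribution(qs), alpha)
    return abs(lhs - rhs)


# Three arbitrary pairs (P_n, Q_n); the last P_n has restricted support,
# so that D_0 is nonzero.
ps = [[0.5, 0.5], [0.2, 0.3, 0.5], [1.0, 0.0]]
qs = [[0.9, 0.1], [0.4, 0.4, 0.2], [0.6, 0.4]]
for alpha in (0, 0.5, 1, 2, math.inf):
    print(alpha, additivity_gap(ps, qs, alpha))
```

The gap should be zero up to floating-point rounding for every order, including α = 0, since only finitely many factors are involved.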
Countable additivity as in (43) does not hold for $\alpha = 0$. A counterexample is given following Example 3 below.

Proof: For simple orders $\alpha$, (42) follows from independence of $P_n$ and $Q_n$ between different $n$, which implies that
\[ \prod_{n=1}^N \int \Big(\frac{dQ_n}{dP_n}\Big)^{1-\alpha} dP_n = \int \Big(\frac{d\prod_{n=1}^N Q_n}{d\prod_{n=1}^N P_n}\Big)^{1-\alpha} d\prod_{n=1}^N P_n. \]
As $N$ is finite, this extends to the extended orders by continuity in $\alpha$. Finally, (43) follows from Theorem 27 by observing that the sequences $P^N = P_1 \times \cdots \times P_N$ and $Q^N = Q_1 \times \cdots \times Q_N$, for $N = 1, 2, \ldots$, are consistent.

Theorems 23 and 24 can be used to establish absolute continuity or mutual singularity of infinite product distributions, as illustrated by the following proof by Shiryaev [28] of the Gaussian dichotomy [43]–[45].

Example 3 (Gaussian Dichotomy). Let $P = P_1 \times P_2 \times \cdots$ and $Q = Q_1 \times Q_2 \times \cdots$, where $P_n$ and $Q_n$ are Gaussian distributions with densities
\[ p_n(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x-\mu_n)^2}, \qquad q_n(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x-\nu_n)^2}. \]
Then
\[ D_\alpha(P_n\|Q_n) = \frac{\alpha}{2}(\mu_n - \nu_n)^2, \]
and by additivity for $\alpha > 0$
\[ D_\alpha(P\|Q) = \frac{\alpha}{2} \sum_{n=1}^\infty (\mu_n - \nu_n)^2. \tag{44} \]
Consequently, by Theorems 23 and 24 and symmetry in $P$ and $Q$:
\[ Q \ll P \;\Leftrightarrow\; P \ll Q \;\Leftrightarrow\; \sum_{n=1}^\infty (\mu_n - \nu_n)^2 < \infty, \tag{45} \]
\[ Q \perp P \;\Leftrightarrow\; \sum_{n=1}^\infty (\mu_n - \nu_n)^2 = \infty. \tag{46} \]
The observation that $P$ and $Q$ are either equivalent (both $P \ll Q$ and $Q \ll P$) or mutually singular is called the Gaussian dichotomy.

By letting $\alpha$ tend to 0, Example 3 shows that countable additivity does not hold for $\alpha = 0$: if $\sum_{n=1}^\infty (\mu_n - \nu_n)^2 = \infty$, then (44) implies that $D_0(P\|Q) = \infty$, while $\sum_{n=1}^N D_0(P_n\|Q_n) = 0$ for all $N$. In light of the proof of Theorem 28 this also provides a counterexample to (39) for $\alpha = 0$.

The Gaussian dichotomy raises the question of whether the same dichotomy holds for other product distributions. Let $P \sim Q$ denote that $P$ and $Q$ are equivalent (both $P \ll Q$ and $Q \ll P$). Suppose that $P = P_1 \times P_2 \times \cdots$ and $Q = Q_1 \times Q_2 \times \cdots$, where $P_n$ and $Q_n$ are arbitrary distributions on arbitrary measurable spaces. Then if $P_n \not\sim Q_n$ for some $n$, $P$ and $Q$ are not equivalent either. The question is therefore answered by the following theorem:

Theorem 29 (Kakutani's Dichotomy). Let $\alpha \in (0, 1)$ and let $P = P_1 \times P_2 \times \cdots$ and $Q = Q_1 \times Q_2 \times \cdots$, where $P_n$ and $Q_n$ are distributions on arbitrary measurable spaces such that $P_n \sim Q_n$. Then
\[ Q \sim P \;\Leftrightarrow\; \sum_{n=1}^\infty D_\alpha(P_n\|Q_n) < \infty, \tag{47} \]
\[ Q \perp P \;\Leftrightarrow\; \sum_{n=1}^\infty D_\alpha(P_n\|Q_n) = \infty. \tag{48} \]

Proof: If $\sum_{n=1}^\infty D_\alpha(P_n\|Q_n) = \infty$, then $D_\alpha(P\|Q) = \infty$ and $Q \perp P$ follows by Theorem 24.

On the other hand, if $\sum_{n=1}^\infty D_\alpha(P_n\|Q_n) < \infty$, then for every $\varepsilon > 0$ there exists an $N$ such that
\[ \sum_{n=N+1}^\infty D_\alpha(P_n\|Q_n) \le \varepsilon, \]
and consequently by additivity and monotonicity in $\alpha$:
\[ D_0(P\|Q) = \lim_{\alpha \downarrow 0} D_\alpha(P\|Q) \le \lim_{\alpha \downarrow 0} D_\alpha(P_1 \times \cdots \times P_N \| Q_1 \times \cdots \times Q_N) + \varepsilon = \varepsilon. \]
As this holds for any $\varepsilon > 0$, $D_0(P\|Q)$ must equal 0, and, by Theorem 23, $Q \ll P$. As $Q \ll P$ implies $Q \not\perp P$, Theorem 24 implies that $D_\alpha(Q\|P) < \infty$, and by repeating the argument with the roles of $P$ and $Q$ reversed we find that also $P \ll Q$, which completes the proof.

Theorem 29 (with $\alpha = 1/2$) is equivalent to a classical result by Kakutani [46], which was stated in terms of Hellinger integrals rather than Rényi divergence, and according to Gibbs and Su [24] might be responsible for popularising Hellinger integrals. As shown by Rényi [47], Kakutani's result is related to the amount of information that a sequence of observations contains about the parameter of a statistical model.

H. Taylor Approximation for Parametric Models

Suppose $\{P_\theta \mid \theta \in \Theta \subseteq \mathbb{R}\}$ is a parametric statistical model. Then it is well known that, for sufficiently regular parametrisations, a second order Taylor approximation of $D(P_\theta\|P_{\theta'})$ in $\theta'$ at $\theta$ in the interior of $\Theta$ yields
\[ \lim_{\theta' \to \theta} \frac{1}{(\theta - \theta')^2} D(P_\theta\|P_{\theta'}) = \frac{1}{2} J(\theta), \tag{49} \]
where $J(\theta) = \mathbb{E}\big[\big(\tfrac{d}{d\theta} \ln p_\theta\big)^2\big]$ denotes the Fisher information at $\theta$ (see e.g. [30, Problem 12.7] or [48]). Haussler and Opper [6] argue that this property generalizes to
\[ \lim_{\theta' \to \theta} \frac{1}{(\theta - \theta')^2} D_\alpha(P_\theta\|P_{\theta'}) = \frac{\alpha}{2} J(\theta) \tag{50} \]
for any $\alpha \in (0, \infty)$, but we are not aware of a reference that spells out the exact technical conditions on the parametrisation that are needed.

IV. MINIMAX RESULTS

A. Hypothesis Testing and Chernoff Information

Rényi divergence appears in bounds on the error probabilities when testing a probabilistic hypothesis $Q$ against an alternative $P$ [4], [49], [50]. This can be explained by the
fact that $(1 - \alpha) D_\alpha(P\|Q)$ equals, up to sign, the cumulant generating function for the random variable $\ln(p/q)$ under the distribution $Q$, evaluated at $\alpha$ (provided $\alpha \in (0, 1)$ or $P \ll Q$) [4]. The following theorem relates this cumulant generating function to two Kullback-Leibler divergences that involve the distribution $P_\alpha$ with density
\[ p_\alpha = \frac{q^{1-\alpha} p^\alpha}{\int q^{1-\alpha} p^\alpha \, d\mu}, \tag{51} \]
which is well defined if and only if $0 < \int p^\alpha q^{1-\alpha} \, d\mu < \infty$.

Theorem 30. For any simple order $\alpha$
\[ (1 - \alpha) D_\alpha(P\|Q) = \inf_R \{\alpha D(R\|P) + (1 - \alpha) D(R\|Q)\}, \tag{52} \]
with the convention that $\alpha D(R\|P) + (1 - \alpha) D(R\|Q) = \infty$ if it would otherwise be undefined. Moreover, if the distribution $P_\alpha$ with density (51) is well defined and $\alpha \in (0, 1)$ or $D(P_\alpha\|P) < \infty$, then the infimum is uniquely achieved by $R = P_\alpha$.

This result gives an interpretation of Rényi divergence as a trade-off between two Kullback-Leibler divergences.

Remark 3. Theorem 30 was formulated and proved for distributions on finite sets by Shayevitz [17], but appeared in the above formulation already in [7]. Prior to either of these, the identity (53) below, which forms the heart of the proof, has been used by Csiszár [51].

Proof of Theorem 30: First suppose that $P_\alpha$ is well defined or, equivalently, that $D_\alpha(P\|Q) < \infty$. Then for $\alpha \in (0, 1)$ or $D(R\|P) < \infty$, we have
\[ \alpha D(R\|P) + (1 - \alpha) D(R\|Q) = D(R\|P_\alpha) - \ln \int p^\alpha q^{1-\alpha} \, d\mu. \tag{53} \]
Hence, if $0 < \alpha < 1$ or $D(P_\alpha\|P) < \infty$, the infimum over $R$ is uniquely achieved by $R = P_\alpha$, for which it equals $(1 - \alpha) D_\alpha(P\|Q)$ as required. If, on the other hand, $\alpha > 1$ and $D(P_\alpha\|P) = \infty$, then we still have
\[ \inf_R \{\alpha D(R\|P) + (1 - \alpha) D(R\|Q)\} \ge (1 - \alpha) D_\alpha(P\|Q). \tag{54} \]
Secondly, suppose $\alpha \in (0, 1)$ and $D_\alpha(P\|Q) = \infty$. Then $P \perp Q$, and consequently either $D(R\|P) = \infty$ or $D(R\|Q) = \infty$ for all $R$, which means that (52) holds.

Next, consider the case that $\alpha > 1$ and $P \not\ll Q$. Then $D_\alpha(P\|Q) = \infty$ and the infimum over $R$ is achieved by $R = P$, for which it equals $-\infty$, and again (52) holds.

Finally, we prove (52) for the remaining cases: $\alpha > 1$, $P \ll Q$ and either: (1) $D_\alpha(P\|Q) < \infty$, but $D(P_\alpha\|P) = \infty$; or (2) $D_\alpha(P\|Q) = \infty$. To this end, let $P_c = P(\cdot \mid p \le cq)$ for all $c$ that are sufficiently large that $P(p \le cq) > 0$. The reader may verify that $D_\alpha(P_c\|Q) < \infty$ and $D(S\|P_c) < \infty$ for $s = p_c^\alpha q^{1-\alpha} / \int p_c^\alpha q^{1-\alpha} \, d\mu$, so that we have already proved that (52) holds if $P$ is replaced by $P_c$. Hence, observing that for all $R$
\[ D(R\|P_c) = \begin{cases} \infty & \text{if } R \not\ll P_c, \\ D(R\|P) + \ln P(p \le cq) & \text{otherwise,} \end{cases} \]
we find that
\[ \begin{aligned}
\inf_R \{\alpha D(R\|P) + (1 - \alpha) D(R\|Q)\}
&\le \limsup_{c \to \infty} \Big\{ -\alpha \ln P(p \le cq) + \inf_R \{\alpha D(R\|P_c) + (1 - \alpha) D(R\|Q)\} \Big\} \\
&\le \limsup_{c \to \infty} \, (1 - \alpha) D_\alpha(P_c\|Q) \le (1 - \alpha) D_\alpha(P\|Q),
\end{aligned} \]
where the last inequality follows by lower semi-continuity of $D_\alpha$ (Theorem 15). In case 2, (52) follows immediately. In case 1, (52) follows by combining this inequality with its converse (54).

Theorem 30 shows that $(1 - \alpha) D_\alpha(P\|Q)$ is the infimum over a set of functions that are linear in $\alpha$, which implies the following corollary:

Corollary 2. The function $(1 - \alpha) D_\alpha(P\|Q)$ is concave in $\alpha$ on $[0, \infty]$, with the conventions that it is 0 at $\alpha = 1$ even if $D(P\|Q) = \infty$ and that it is 0 at $\alpha = \infty$ if $P = Q$.

Proof: Suppose first that $D(P\|Q) < \infty$. Then (52) also holds at $\alpha = 1$. Hence $(1 - \alpha) D_\alpha(P\|Q)$ is a point-wise infimum over linear functions on $(0, \infty)$, and thus concave. This extends to $\alpha \in \{0, \infty\}$ by continuity.

Alternatively, suppose that $D(P\|Q) = \infty$. Then $(1 - \alpha) D_\alpha(P\|Q)$ is still concave on $[0, 1)$, where it is also nonnegative. And by monotonicity of Rényi divergence, we have that $D_\alpha(P\|Q) = \infty$ for all $\alpha \ge 1$. Consequently, $(1 - \alpha) D_\alpha(P\|Q)$ is nonnegative and concave for $\alpha \in [0, 1)$, at $\alpha = 1$ it is 0 (by convention) and for $\alpha \in (1, \infty]$ it is $-\infty$. It then follows that $(1 - \alpha) D_\alpha(P\|Q)$ is concave on all of $[0, \infty]$, as required.

In addition, Theorem 30 can be used to prove Gilardoni's extension of Pinsker's inequality from the case $\alpha = 1$ to any $\alpha \in (0, 1]$ [25], which was mentioned in the introduction.

Theorem 31 (Pinsker's Inequality). Let $V(P, Q)$ be the total variation distance, as defined in (34). Then, for any $\alpha \in (0, 1]$,
\[ \frac{\alpha}{2} V^2(P, Q) \le D_\alpha(P\|Q). \]

Proof: We omit the proof for $\alpha = 1$, which is the standard version of Pinsker's inequality (see [52] for a survey of its history). For $\alpha \in (0, 1)$, consider first the case of two distributions $P = (p, 1 - p)$ and $Q = (q, 1 - q)$ on a binary alphabet. Then $V^2(P, Q) = 4(p - q)^2$ and by Theorem 30 and the result for $\alpha = 1$, we find
\[ \begin{aligned}
(1 - \alpha) D_\alpha(P\|Q) &= \inf_R \{\alpha D(R\|P) + (1 - \alpha) D(R\|Q)\} \\
&\ge \inf_r \big\{ 2\alpha (r - p)^2 + 2(1 - \alpha)(r - q)^2 \big\}.
\end{aligned} \]
The minimum is achieved by $r = \alpha p + (1 - \alpha) q$, from which
\[ D_\alpha(P\|Q) \ge 2\alpha (p - q)^2 = \frac{\alpha}{2} V^2(P, Q). \]
The general case of distributions $P$ and $Q$ on any sample space $\mathcal{X}$ reduces to the binary case by the data processing inequality: for any event $A$, let $P_{|A}$ and $Q_{|A}$ denote the restrictions of $P$
and $Q$ to the binary partition $\mathcal{P} = \{A, \mathcal{X} \setminus A\}$. Then
\[ \begin{aligned}
\frac{2}{\alpha} D_\alpha(P\|Q) &\ge \sup_A \frac{2}{\alpha} D_\alpha(P_{|A}\|Q_{|A}) \ge \sup_A V^2(P_{|A}, Q_{|A}) \\
&= \sup_A 4 \big(P(A) - Q(A)\big)^2 = V^2(P, Q),
\end{aligned} \]
as required.

As one might expect from continuity of $D_\alpha(P\|Q)$, the terms on the right-hand side of (52) are continuous in $\alpha$, at least on $(0, 1)$:

Lemma 8. If $D(P\|Q) < \infty$ or $D(Q\|P) < \infty$, then both $D(P_\alpha\|Q)$ and $D(P_\alpha\|P)$ are finite and continuous in $\alpha$ on $(0, 1)$.

Proof: The lemma is symmetric in $P$ and $Q$, so suppose without loss of generality that $D(P\|Q) < \infty$. Then $D_\alpha(P\|Q) \le D(P\|Q) < \infty$ implies that $P_\alpha$ is well defined and finiteness of both $D(P_\alpha\|Q)$ and $D(P_\alpha\|P)$ follows from Theorem 30. Now observe that
\[ D(P_\alpha\|Q) = \frac{1}{\int p^\alpha q^{1-\alpha} \, d\mu} \, \mathbb{E}_Q\Big[\Big(\frac{p}{q}\Big)^\alpha \ln \Big(\frac{p}{q}\Big)^\alpha\Big] + (1 - \alpha) D_\alpha(P\|Q). \]
Then by continuity of $D_\alpha(P\|Q)$ and hence of $\int p^\alpha q^{1-\alpha} \, d\mu$ in $\alpha$, it is sufficient to verify continuity of $\mathbb{E}_Q[(p/q)^\alpha \ln (p/q)^\alpha]$. To this end, observe that
\[ \big|(p/q)^\alpha \ln (p/q)^\alpha\big| \le \begin{cases} 1/e & \text{if } p < q, \\ (p/q) \ln (p/q) & \text{if } p \ge q. \end{cases} \]
As $D(P\|Q) < \infty$ implies $\mathbb{E}_Q[\mathbf{1}_{\{p \ge q\}} (p/q) \ln (p/q)] < \infty$, we may apply the dominated convergence theorem to obtain
\[ \lim_{\alpha \to \alpha^*} \mathbb{E}_Q\Big[\Big(\frac{p}{q}\Big)^\alpha \ln \Big(\frac{p}{q}\Big)^\alpha\Big] = \mathbb{E}_Q\Big[\Big(\frac{p}{q}\Big)^{\alpha^*} \ln \Big(\frac{p}{q}\Big)^{\alpha^*}\Big] \]
for any $\alpha^* \in (0, 1)$, which proves continuity of $D(P_\alpha\|Q)$. Continuity of $D(P_\alpha\|P)$ now follows from Theorem 30 and continuity of $(1 - \alpha) D_\alpha(P\|Q)$.

Theorem 32. Suppose that $D(P\|Q) < \infty$. Then the following minimax identity holds:
\[ \sup_{\alpha \in (0, \infty)} \inf_R \{\alpha D(R\|P) + (1 - \alpha) D(R\|Q)\} = \inf_R \sup_{\alpha \in (0, \infty)} \{\alpha D(R\|P) + (1 - \alpha) D(R\|Q)\}, \tag{55} \]
with the convention that $\alpha D(R\|P) + (1 - \alpha) D(R\|Q) = \infty$ if it would otherwise be undefined. Moreover, (55) still holds if $\alpha$ is restricted to $(0, 1)$ on its left-hand side; and if there exists an $\alpha^* \in (0, 1)$ such that $D(P_{\alpha^*}\|P) = D(P_{\alpha^*}\|Q)$, then $(\alpha^*, P_{\alpha^*})$ is a saddle-point for (55) and both sides of (55) are equal to
\[ (1 - \alpha^*) D_{\alpha^*}(P\|Q) = \sup_{\alpha \in (0, 1)} (1 - \alpha) D_\alpha(P\|Q) = D(P_{\alpha^*}\|P) = D(P_{\alpha^*}\|Q). \tag{56} \]

The minimax value defined in (55) is the Chernoff information, which gives an asymptotically tight bound on both the type 1 and the type 2 errors in tests of $P$ vs. $Q$. The same connection between Chernoff information and $D(P_{\alpha^*}\|P)$ is discussed by Cover and Thomas [30, Section 12.9], with a different proof.

Proof of Theorem 32: Let $f(\alpha, R) = \alpha D(R\|P) + (1 - \alpha) D(R\|Q)$. For $\alpha \in (0, 1)$, $D_\alpha(P\|Q) \le D(P\|Q) < \infty$ implies that $P_\alpha$ is well defined. Suppose there exists $\alpha^* \in (0, 1)$ such that $D(P_{\alpha^*}\|P) = D(P_{\alpha^*}\|Q)$. Then Theorem 30 implies that $(\alpha^*, P_{\alpha^*})$ is a saddle-point for $f(\alpha, R)$, so that (55) holds [53, Lemma 36.2], and Theorem 30 also implies that all quantities in (56) are equal to $f(\alpha^*, P_{\alpha^*})$.

Let $A$ be either $(0, 1)$ or $(0, \infty)$. As the sup inf is never bigger than the inf sup [53, Lemma 36.1], we have that
\[ \sup_{\alpha \in A} \inf_R f(\alpha, R) \le \sup_{\alpha \in (0, \infty)} \inf_R f(\alpha, R) \le \inf_R \sup_{\alpha \in (0, \infty)} f(\alpha, R), \]
so it remains to prove the converse inequality.

By Lemma 8 we know that both $D(P_\alpha\|P)$ and $D(P_\alpha\|Q)$ are finite and continuous in $\alpha$ on $(0, 1)$. By the intermediate value theorem, there are therefore three possibilities: (1) there exists $\alpha^* \in (0, 1)$ such that $D(P_{\alpha^*}\|P) = D(P_{\alpha^*}\|Q)$, for which we have already proved (55); (2) $D(P_\alpha\|P) < D(P_\alpha\|Q)$ for all $\alpha \in (0, 1)$; and (3) $D(P_\alpha\|P) > D(P_\alpha\|Q)$ for all $\alpha \in (0, 1)$.

We proceed with case (2), observing that
\[ \begin{aligned}
\inf_R \sup_{\alpha \in (0, \infty)} f(\alpha, R)
&= \inf_{R \,:\, D(R\|Q) < \infty} \; \sup_{\alpha \in (0, \infty)} f(\alpha, R) \\
&= \inf_{R \,:\, D(R\|Q) < \infty} \Big\{ D(R\|Q) + \sup_{\alpha \in (0, \infty)} \alpha \big( D(R\|P) - D(R\|Q) \big) \Big\} \\
&= \inf_{R \,:\, D(R\|P) \le D(R\|Q) < \infty} D(R\|Q) \\
&\le \inf_{0 < \alpha < 1} D(P_\alpha\|Q).
\end{aligned} \]
Now by Theorem 30
\[ \begin{aligned}
\inf_{0 < \alpha < 1} D(P_\alpha\|Q) &\le \liminf_{\alpha \downarrow 0} D(P_\alpha\|Q) \\
&= \liminf_{\alpha \downarrow 0} \Big\{ D_\alpha(P\|Q) - \frac{\alpha}{1 - \alpha} D(P_\alpha\|P) \Big\} \\
&\le \lim_{\alpha \downarrow 0} D_\alpha(P\|Q) = \lim_{\alpha \downarrow 0} (1 - \alpha) D_\alpha(P\|Q) \\
&= \lim_{\alpha \downarrow 0} \inf_R f(\alpha, R) \le \sup_{\alpha \in A} \inf_R f(\alpha, R),
\end{aligned} \]
as required. It remains to consider case (3), which turns out to be impossible by the following argument: two applications of Theorem 30 give
\[ \begin{aligned}
D_{1/2}(P\|Q) &= \inf_{0 < \alpha < 1} \big\{ D(P_\alpha\|P) + D(P_\alpha\|Q) \big\} \\
&\le 2 \inf_{0 < \alpha < 1} D(P_\alpha\|P) \le 2 \limsup_{\alpha \uparrow 1} D(P_\alpha\|P) \\
&= 2 \limsup_{\alpha \uparrow 1} \Big\{ \frac{1 - \alpha}{\alpha} D_\alpha(P\|Q) - \frac{1 - \alpha}{\alpha} D(P_\alpha\|Q) \Big\} \\
&\le 2 \limsup_{\alpha \uparrow 1} \frac{1 - \alpha}{\alpha} D_\alpha(P\|Q) = 0.
\end{aligned} \]
It follows that $P = Q$, which contradicts the assumption that $D(P_\alpha\|P) > D(P_\alpha\|Q)$ for any $\alpha \in (0, 1)$.
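For distributions on a finite alphabet, the variational formula (52) and the saddle-point characterization (56) of the Chernoff information can be checked numerically. The following is a minimal sketch under stated assumptions, not part of the original paper: `p` and `q` are assumed to be strictly positive probability vectors, all helper names are hypothetical, and the equalizing order α* is located by bisection, using the fact that D(Pα‖P) − D(Pα‖Q) is decreasing in α.

```python
import math


def kl(r, p):
    """Kullback-Leibler divergence D(R||P) in nats (p assumed strictly positive)."""
    return sum(ri * math.log(ri / pi) for ri, pi in zip(r, p) if ri > 0)


def tilted(p, q, alpha):
    """The distribution P_alpha of (51), proportional to p^alpha q^(1-alpha)."""
    weights = [pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q)]
    z = sum(weights)
    return [w / z for w in weights]


def scaled_renyi(p, q, alpha):
    """(1 - alpha) D_alpha(P||Q) = -ln sum_x p(x)^alpha q(x)^(1-alpha)."""
    return -math.log(sum(pi ** alpha * qi ** (1 - alpha)
                         for pi, qi in zip(p, q)))


def objective(r, p, q, alpha):
    """The trade-off alpha D(R||P) + (1 - alpha) D(R||Q) from (52)."""
    return alpha * kl(r, p) + (1 - alpha) * kl(r, q)


def chernoff(p, q, iterations=100):
    """Bisection for alpha* in (0, 1) with D(P_a||P) = D(P_a||Q); returns (alpha*, value)."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        r = tilted(p, q, mid)
        # Near alpha = 0, P_alpha is close to Q, so D(P_alpha||P) is the larger term.
        if kl(r, p) > kl(r, q):
            lo = mid
        else:
            hi = mid
    alpha_star = (lo + hi) / 2
    return alpha_star, scaled_renyi(p, q, alpha_star)


p = [0.5, 0.3, 0.2]
q = [0.1, 0.2, 0.7]
alpha_star, value = chernoff(p, q)
r_star = tilted(p, q, alpha_star)
print(alpha_star, value, kl(r_star, p), kl(r_star, q))
```

On this example the three printed quantities after α* should agree up to rounding, as (56) predicts, and evaluating `objective` at any other distribution R should give a value no smaller than `scaled_renyi` at the same order.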
B. Channel Capacity and Minimax Redundancy

Consider a non-empty family $\{P_\theta \mid \theta \in \Theta\}$ of probability distributions on a sample space $\mathcal{X}$. We may think of $\theta$ as a parameter in a statistical model or as an input letter of an information channel. In the main results of this section we will only consider discrete sample spaces $\mathcal{X}$, which are either finite with $n$ elements or countably infinite. Whenever distributions on $\Theta$ are involved, we also implicitly assume that $\Theta$ is a topological space that is equipped with the Borel σ-algebra, that $\{\theta\}$ is a closed set for every $\theta$, and that the map $\theta \mapsto P_\theta$ is measurable.

We will study
\[ C_\alpha = \sup_\pi \inf_Q \int D_\alpha(P_\theta\|Q) \, d\pi(\theta), \tag{57} \]
which has been proposed as the appropriate generalization of the channel capacity from $\alpha = 1$ to general $\alpha$ [4], [18].

If $\mathcal{X}$ is finite, then the channel capacity is also finite:

Theorem 33. If $\mathcal{X}$ has $n$ elements, then $C_\alpha \le \ln n$ for any $\alpha \in [0, \infty]$.

Proof: Let $U$ denote the uniform distribution on $\mathcal{X}$. Then
\[ \begin{aligned}
\sup_\pi \inf_Q \int D_\alpha(P_\theta\|Q) \, d\pi(\theta) &\le \sup_\pi \int D_\alpha(P_\theta\|U) \, d\pi(\theta) \\
&= \sup_\theta D_\alpha(P_\theta\|U) \le \sup_\theta D_\infty(P_\theta\|U) \\
&= \sup_\theta \ln \max_x \frac{P_\theta(x)}{1/n} \le \ln n.
\end{aligned} \]

For $\alpha = 1$, it is a classical result by Gallager and Ryabko [54] that the channel capacity equals the minimax redundancy:
\[ R_\alpha = \inf_Q \sup_{\theta \in \Theta} D_\alpha(P_\theta\|Q). \tag{58} \]
For finite $\Theta$, Csiszár [4] has shown that this result in fact extends to any $\alpha \in (0, \infty)$, noting that the minimax redundancy $R_\alpha$ (and therefore the channel capacity $C_\alpha$) may be geometrically interpreted as the "radius" of the family of distributions $\{P_\theta \mid \theta \in \Theta\}$ with respect to the Rényi divergence of order $\alpha$. It turns out that Csiszár's result extends to general $\Theta$ and all orders $\alpha$:

Theorem 34. Suppose $\mathcal{X}$ is finite. Then for any $\alpha \in [0, \infty]$ the channel capacity equals the minimax redundancy:
\[ C_\alpha = R_\alpha. \tag{59} \]

For $\alpha = 1$, Haussler [55] has extended this result to infinite sample spaces $\mathcal{X}$. It seems plausible that his approach might extend to other orders $\alpha$ as well.

Equation 59 is equivalent to the minimax identity
\[ \sup_\pi \inf_Q \psi_\alpha(\pi, Q) = \inf_Q \sup_\pi \psi_\alpha(\pi, Q), \tag{60} \]
where
\[ \psi_\alpha(\pi, Q) = \int D_\alpha(P_\theta\|Q) \, d\pi(\theta). \tag{61} \]
We will prove this identity using Sion's minimax theorem [56], [57], which we state with its arguments exchanged to make them line up with the arguments of $\psi_\alpha$:

Theorem 35 (Sion's Minimax Theorem). Let $A$ be a convex subset of a linear topological space and $B$ a compact convex subset of a linear topological space. Let $f : A \times B \to \mathbb{R}$ be such that
(i) $f(\cdot, b)$ is upper semi-continuous and quasi-concave on $A$ for each $b \in B$;
(ii) $f(a, \cdot)$ is lower semi-continuous and quasi-convex on $B$ for each $a \in A$.
Then
\[ \sup_{a \in A} \min_{b \in B} f(a, b) = \min_{b \in B} \sup_{a \in A} f(a, b). \]

Proof of Theorem 34: Sion's minimax theorem cannot be applied directly, because $\psi_\alpha$ may be infinite. For $\lambda \in (0, 1)$, we therefore introduce the auxiliary function
\[ \psi_\alpha^\lambda(\pi, Q) = \psi_\alpha\big(\pi, (1 - \lambda) U + \lambda Q\big), \]
where $U$ is the uniform distribution on $\mathcal{X}$. Finiteness of $\psi_\alpha^\lambda$ follows from
\[ \begin{aligned}
D_\alpha\big(P_\theta \,\|\, (1 - \lambda) U + \lambda Q\big) &\le D_\alpha(P_\theta\|U) - \ln(1 - \lambda) \\
&\le D_\infty(P_\theta\|U) - \ln(1 - \lambda) \le \ln n - \ln(1 - \lambda),
\end{aligned} \tag{62} \]
where $n$ denotes the number of elements in $\mathcal{X}$.

To verify the other conditions of Theorem 35, we observe that $\psi_\alpha^\lambda(\cdot, Q)$ is linear, and hence continuous and concave. Convexity of $\psi_\alpha^\lambda(\pi, \cdot)$ follows from convexity of $\psi_\alpha(\pi, \cdot)$, which holds because $\psi_\alpha(\pi, \cdot)$ is a linear combination of convex functions. Continuity of $\psi_\alpha^\lambda(\pi, \cdot)$ follows by the dominated convergence theorem (which applies by (62)) and continuity of $D_\alpha(P_\theta\|\cdot)$. Thus we may apply Sion's minimax theorem. By
\[ D_\alpha\big(P_\theta \,\|\, (1 - \lambda) U + \lambda Q\big) \le D_\alpha(P_\theta\|Q) - \ln \lambda, \]
we also have $\psi_\alpha^\lambda(\pi, Q) \le \psi_\alpha(\pi, Q) - \ln \lambda$, and hence we may reason as follows:
\[ \begin{aligned}
\sup_\pi \inf_Q \psi_\alpha(\pi, Q) - \ln \lambda &\ge \sup_\pi \inf_Q \psi_\alpha^\lambda(\pi, Q) \\
&= \inf_Q \sup_\pi \psi_\alpha^\lambda(\pi, Q) \ge \inf_Q \sup_\pi \psi_\alpha(\pi, Q).
\end{aligned} \]
By letting $\lambda$ tend to 1 we find
\[ \sup_\pi \inf_Q \psi_\alpha(\pi, Q) \ge \inf_Q \sup_\pi \psi_\alpha(\pi, Q). \]
As the sup inf never exceeds the inf sup [53, Lemma 36.1], the converse inequality also holds, and the proof is complete.

A distribution $\pi_{\mathrm{opt}}$ on the parameter space $\Theta$ is a capacity achieving input distribution if
\[ \inf_Q \int D_\alpha(P_\theta\|Q) \, d\pi_{\mathrm{opt}}(\theta) = C_\alpha. \tag{63} \]
A distribution $Q_{\mathrm{opt}}$ on $\mathcal{X}$ may be called a redundancy achieving distribution if
\[ \sup_\theta D_\alpha(P_\theta\|Q_{\mathrm{opt}}) = R_\alpha. \tag{64} \]
If the sample space is finite, then a redundancy achieving distribution always exists:
Lemma 9. Suppose $\mathcal{X}$ is finite and let $\alpha \in [0, \infty]$. Then the function $Q \mapsto \sup_\theta D_\alpha(P_\theta\|Q)$ is continuous and convex, and has at least one minimum. Consequently, a redundancy achieving distribution $Q_{\mathrm{opt}}$ exists.

Proof: Denote the number of elements in $\mathcal{X}$ by $n$, let $\Delta_n = \{(p_1, \ldots, p_n) \mid \sum_{i=1}^n p_i = 1, \; p_i \ge 0\}$ denote the probability simplex on $n$ outcomes, and let $f(Q) = \sup_\theta D_\alpha(P_\theta\|Q)$. Since $f$ is the supremum over continuous, convex functions, it is lower semi-continuous and convex itself. As the domain of $f$ is $\Delta_n$, which is compact, this implies that it attains its minimum. Moreover, convexity on a simplex implies upper semi-continuity [53, Theorem 10.2], so that $f$ is both lower and upper semi-continuous, which means that it is continuous.

Theorem 36. Suppose $\mathcal{X}$ is finite and let $\alpha \in [0, \infty]$. If there exists a (possibly non-unique) capacity achieving input distribution $\pi_{\mathrm{opt}}$, then $\int D_\alpha(P_\theta\|Q) \, d\pi_{\mathrm{opt}}(\theta)$ is minimized by $Q = Q_{\mathrm{opt}}$ and $D_\alpha(P_\theta\|Q_{\mathrm{opt}}) = R_\alpha$ almost surely under $\pi_{\mathrm{opt}}$.

If $R_\alpha$ is regarded as the radius of $\{P_\theta \mid \theta \in \Theta\}$, then this theorem shows how $Q_{\mathrm{opt}}$ may be interpreted as its center.

Proof: Since $\pi_{\mathrm{opt}}$ is capacity achieving,
\[ \begin{aligned}
C_\alpha &= \inf_Q \int D_\alpha(P_\theta\|Q) \, d\pi_{\mathrm{opt}}(\theta) \\
&\le \int D_\alpha(P_\theta\|Q_{\mathrm{opt}}) \, d\pi_{\mathrm{opt}}(\theta) \\
&\le \int R_\alpha \, d\pi_{\mathrm{opt}}(\theta) = R_\alpha = C_\alpha.
\end{aligned} \]
The result follows because both inequalities must be equalities.

Three orders $\alpha$ for the channel capacity $C_\alpha$ and minimax redundancy $R_\alpha$ are of particular interest. The classical ones are $\alpha = 1$, because it corresponds to the original definition of channel capacity by Shannon, and $\alpha = 0$ because $C_0$ gives an upper bound on the zero error capacity, which also dates back to Shannon.

Now let us look at the case $\alpha = \infty$, assuming for simplicity that $\mathcal{X}$ is countable. We find that
\[ \sup_\theta D_\infty(P_\theta\|Q) = \sup_\theta \ln \sup_x \frac{P_\theta(x)}{Q(x)} = \sup_x \ln \frac{\sup_\theta P_\theta(x)}{Q(x)} \tag{65} \]
is the worst-case regret of $Q$ relative to $\{P_\theta \mid \theta \in \Theta\}$ [3]. As is well known [3], [58], the distribution that minimizes the worst-case regret is uniquely given by the normalized maximum likelihood or Shtarkov distribution
\[ S(x) = \frac{\sup_\theta P_\theta(x)}{\sum_x \sup_\theta P_\theta(x)}, \tag{66} \]
provided that the normalizing sum is finite, so that $S$ is well defined.

Theorem 37. Suppose that $\mathcal{X}$ is countable and that the minimax redundancy $R_\infty$ is finite. Then $S$ is well defined and the worst-case regret of any distribution $Q$ satisfies
\[ \sup_\theta D_\infty(P_\theta\|Q) = R_\infty + D_\infty(S\|Q). \tag{67} \]
In particular, $Q_{\mathrm{opt}} = S$ is unique and
\[ R_\infty = \ln \sum_x \sup_\theta P_\theta(x) < \infty. \tag{68} \]

Proof: Since $R_\infty < \infty$, for any finite $C > R_\infty$ there must exist a distribution $Q_C$ such that $\sup_x \ln \frac{\sup_\theta P_\theta(x)}{Q_C(x)} \le C$. Hence
\[ \sum_x \sup_\theta P_\theta(x) \le \sum_x Q_C(x) e^C = e^C < \infty, \]
so that $S$ is well defined.

Now for any arbitrary distribution $Q$, we have
\[ \begin{aligned}
\sup_x \ln \frac{\sup_\theta P_\theta(x)}{Q(x)} &= \sup_x \Big\{ \ln \frac{\sup_\theta P_\theta(x)}{S(x)} + \ln \frac{S(x)}{Q(x)} \Big\} \\
&= \ln \sum_x \sup_\theta P_\theta(x) + \sup_x \ln \frac{S(x)}{Q(x)} \\
&= \sup_x \ln \frac{\sup_\theta P_\theta(x)}{S(x)} + \sup_x \ln \frac{S(x)}{Q(x)}.
\end{aligned} \]
Since $\sup_x \ln \frac{S(x)}{Q(x)} = D_\infty(S\|Q) \ge 0$, with strict inequality unless $Q = S$, this establishes (67) and $Q_{\mathrm{opt}} = S$. Finally, (68) follows by evaluating $\sup_x \ln \frac{\sup_\theta P_\theta(x)}{S(x)}$.

We conjecture that the previous result generalizes to any positive order $\alpha$ as a one-sided inequality:

Conjecture 1. Let $\alpha \in (0, \infty]$ and suppose that $R_\alpha < \infty$. Then we conjecture that there exists a unique redundancy achieving distribution
\[ Q_{\mathrm{opt}} = \arg\min_Q \sup_\theta D_\alpha(P_\theta\|Q), \tag{69} \]
and that for all $Q$
\[ \sup_\theta D_\alpha(P_\theta\|Q) \ge R_\alpha + D_\alpha(Q_{\mathrm{opt}}\|Q). \tag{70} \]

This conjecture is reminiscent of Sibson's identity [4], [59]. It would imply that any distribution $Q$ that is close to achieving the minimax redundancy in the sense that
\[ \sup_\theta D_\alpha(P_\theta\|Q) \le R_\alpha + \delta, \tag{71} \]
must be close to $Q_{\mathrm{opt}}$ in the sense that
\[ D_\alpha(Q_{\mathrm{opt}}\|Q) \le \delta. \tag{72} \]
As shown in Example 4 below, Conjecture 1 does not hold for $\alpha = 0$. For $\alpha > 0$, it can be expressed as a minimax identity for the function
\[ \phi_\alpha(R, Q) = \sup_{\theta \in \Theta} D_\alpha(P_\theta\|Q) - D_\alpha(R\|Q), \tag{73} \]
where we adopt the convention that $\phi_\alpha(R, Q) = \infty$ if both $\sup_{\theta \in \Theta} D_\alpha(P_\theta\|Q)$ and $D_\alpha(R\|Q)$ are infinite. However, we cannot use Sion's minimax theorem (Theorem 35) to prove the conjecture, because in general $\phi_\alpha$ is not quasi-convex in its second argument².

A distribution $\pi$ on the parameter space $\Theta$ is called a barycentric input distribution if
\[ Q_{\mathrm{opt}} = \int P_\theta \, d\pi(\theta). \tag{74} \]

² We mistakenly claimed this in an earlier draft of this paper.
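The regret identity (67) and the closed form (68) of Theorem 37 can be verified directly for a small finite family. The following is a minimal sketch, not part of the original paper: the three-distribution family and the comparison distribution `q` are arbitrary illustrative choices, the helper names are hypothetical, and `q` is assumed to be strictly positive.

```python
import math


def shtarkov(family):
    """Shtarkov distribution S of (66) and R_inf = ln sum_x sup_theta P_theta(x) from (68)."""
    max_likelihood = [max(column) for column in zip(*family)]  # sup over theta, per x
    total = sum(max_likelihood)
    return [m / total for m in max_likelihood], math.log(total)


def d_inf(p, q):
    """D_inf(P||Q) = ln max_x p(x)/q(x) for strictly positive q."""
    return math.log(max(pi / qi for pi, qi in zip(p, q) if pi > 0))


def worst_case_regret(family, q):
    """sup_theta D_inf(P_theta||Q), the left-hand side of (67)."""
    return max(d_inf(p, q) for p in family)


# Hypothetical family of three distributions on a three-element alphabet.
family = [[0.5, 0.3, 0.2],
          [0.1, 0.6, 0.3],
          [0.25, 0.25, 0.5]]
s, r_inf = shtarkov(family)
q = [0.2, 0.5, 0.3]
print(worst_case_regret(family, q), r_inf + d_inf(s, q))
```

Both printed numbers should coincide up to rounding, and choosing `q = s` should reduce the worst-case regret to `r_inf`, in line with the statement that S is the unique redundancy achieving distribution for α = ∞.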
Example 4. Take $\alpha \in (0, \infty]$ and consider the distributions
\[ P_1 = \big(1/2, 0, 1/2\big), \qquad P_2 = \big(0, 1/2, 1/2\big) \tag{75} \]
on a three-element set. Then by symmetry and convexity of Rényi divergence in its second argument, there must exist a redundancy achieving distribution of the form
\[ Q_{\mathrm{opt}(\alpha)} = (q, q, 1 - 2q). \tag{76} \]
If $\alpha$ is a simple order, then for $\theta \in \{1, 2\}$ the divergence is
\[ \begin{aligned}
D_\alpha(P_\theta\|Q_{\mathrm{opt}(\alpha)}) &= \frac{1}{\alpha - 1} \ln \Big( \big(1/2\big)^\alpha q^{1-\alpha} + \big(1/2\big)^\alpha (1 - 2q)^{1-\alpha} \Big) \\
&= \frac{\alpha \ln 2}{1 - \alpha} + \frac{1}{\alpha - 1} \ln \big( q^{1-\alpha} + (1 - 2q)^{1-\alpha} \big).
\end{aligned} \tag{77} \]
To find $q$, we therefore have to extremize
\[ f(q) = q^{1-\alpha} + (1 - 2q)^{1-\alpha}, \tag{78} \]
which leads to
\[ q = \frac{1}{2 + 2^{1/\alpha}}. \tag{79} \]
The reader may verify that (79) also holds for $\alpha = 1$, giving $Q_{\mathrm{opt}(1)} = (\tfrac14, \tfrac14, \tfrac12)$, and for $\alpha = \infty$, leading to $Q_{\mathrm{opt}(\infty)} = (\tfrac13, \tfrac13, \tfrac13)$. Note that only for $\alpha = 1$ is $Q_{\mathrm{opt}(\alpha)}$ a convex combination of $P_1$ and $P_2$, with unique barycentric input distribution $\pi = (1/2, 1/2)$.

Finally, consider $\alpha = 0$. In this case (79) still holds, giving $Q_{\mathrm{opt}(0)} = (0, 0, 1)$. Now let $Q = (1/2, 1/2, 0)$. Then, for $\theta \in \{1, 2\}$, we see that the first two terms in (70) are well behaved:
\[ \begin{aligned}
\lim_{\alpha \downarrow 0} \sup_\theta D_\alpha(P_\theta\|Q) &= \sup_\theta D_0(P_\theta\|Q) = \ln 2, \\
\lim_{\alpha \downarrow 0} \sup_\theta D_\alpha(P_\theta\|Q_{\mathrm{opt}(\alpha)}) &= 0 = \sup_\theta D_0(P_\theta\|Q_{\mathrm{opt}(0)}).
\end{aligned} \]
The last term, however, evaluates to $D_0(Q_{\mathrm{opt}(0)}\|Q) = \infty$, so we obtain a counterexample to (70). The difference in behaviour between $\alpha = 0$ and $\alpha > 0$ may be understood by observing that $\lim_{\alpha \downarrow 0} D_\alpha(Q_{\mathrm{opt}(\alpha)}\|Q) = \ln 2 \ne D_0(Q_{\mathrm{opt}(0)}\|Q)$.

Theorem 38. Suppose that $\mathcal{X}$ is finite and that there exists a maximum likelihood function $\hat\theta : \mathcal{X} \to \Theta$ (that is, $P_\theta(x) \le P_{\hat\theta(x)}(x)$ for all $x \in \mathcal{X}$). Then, for $\alpha = \infty$, the distribution
\[ \pi_{\mathrm{opt}}(\theta) = S(\{x \mid \hat\theta(x) = \theta\}) \tag{80} \]
is a capacity achieving input distribution, where $S$ is as defined in (66).

Proof: As $\mathcal{X}$ is finite, there can be at most a finite set $\Theta_{\mathcal{X}} \subset \Theta$ of $\theta$ on which $\pi_{\mathrm{opt}}(\theta) > 0$. Hence, for any $Q$,
\[ \begin{aligned}
\int D_\infty(P_\theta\|Q) \, d\pi_{\mathrm{opt}}(\theta) &= \sum_{\theta \in \Theta_{\mathcal{X}}} D_\infty(P_\theta\|Q) \, \pi_{\mathrm{opt}}(\theta) \\
&= \sum_{\theta \in \Theta_{\mathcal{X}}} D_\infty(P_\theta\|Q) \sum_x S(x) \mathbf{1}_{\{\hat\theta(x) = \theta\}} \\
&= \sum_x S(x) D_\infty(P_{\hat\theta(x)}\|Q) \\
&= \sum_x S(x) \max_y \ln \frac{P_{\hat\theta(x)}(y)}{Q(y)} \\
&\ge \sum_x S(x) \ln \frac{P_{\hat\theta(x)}(x)}{Q(x)} \\
&= D(S\|Q) + \sum_x S(x) \ln \frac{P_{\hat\theta(x)}(x)}{S(x)} \\
&= D(S\|Q) + \sum_x S(x) R_\infty \\
&= D(S\|Q) + R_\infty.
\end{aligned} \]
By taking the infimum over $Q$ on both sides we get
\[ \inf_Q \int D_\infty(P_\theta\|Q) \, d\pi_{\mathrm{opt}}(\theta) \ge R_\infty. \]
Since the reverse inequality is trivial and $R_\infty = C_\infty$, we find that $\pi_{\mathrm{opt}}$ is a capacity achieving input distribution, as required.

Example 5. Let $\theta \in [0, 1]$ denote the success probability of a binomial distribution $P_\theta = \mathrm{Bin}(2, \theta)$ on $\mathcal{X} = \{0, 1, 2\}$. Then for $\alpha = \infty$ the redundancy achieving distribution is $S = (\tfrac25, \tfrac15, \tfrac25)$ and the minimax redundancy is $R_\infty = \ln \tfrac52$. In this case there are many barycentric input distributions. For example, the distribution $\pi = \tfrac15 M_0 + \tfrac35 U + \tfrac15 M_1$ is a barycentric input distribution, where $M_\theta$ is a point-mass on $\theta$ and $U$ is the uniform distribution on $[0, 1]$. Another example is the distribution $\pi = (\tfrac{3}{10}, \tfrac25, \tfrac{3}{10})$ on the maximum likelihood parameters $\Psi = \{0, \tfrac12, 1\}$ for the elements of $\mathcal{X}$. By Theorem 38, there also exists a capacity achieving input distribution $\pi_{\mathrm{opt}}$, and it is supported on $\Psi$, with probabilities $\big(\pi_{\mathrm{opt}}(0), \pi_{\mathrm{opt}}(\tfrac12), \pi_{\mathrm{opt}}(1)\big) = \big(S(0), S(1), S(2)\big) = (\tfrac25, \tfrac15, \tfrac25)$.

V. NEGATIVE ORDERS

Until now we have only discussed Rényi divergence of nonnegative orders. However, using formula (9) for $\alpha \in (-\infty, 0)$ (reading $\frac{q^{1-\alpha}}{p^{-\alpha}}$ for $p^\alpha q^{1-\alpha}$), it may also be defined for these negative orders. This definition extends to $\alpha = -\infty$ by
\[ D_{-\infty}(P\|Q) = \lim_{\alpha \downarrow -\infty} D_\alpha(P\|Q). \tag{81} \]
According to Rényi [1], only positive orders can be regarded as measures of information, and negative orders indeed seem to be hardly used in applications. Nevertheless, for completeness we will also study Rényi divergence of negative orders. As will be seen below, our results for positive orders carry over to the negative orders, but most properties are reversed. People may have avoided negative orders because of these reversed
properties. Avoiding negative orders is always possible, because they are related to orders $\alpha > 1$ by an extension of skew symmetry:

Lemma 10 (Skew Symmetry). For any $\alpha \in (-\infty, \infty)$, $\alpha \notin \{0, 1\}$
\[ D_\alpha(P\|Q) = \frac{\alpha}{1 - \alpha} D_{1-\alpha}(Q\|P). \tag{82} \]
Furthermore
\[ D_{-\infty}(P\|Q) = -D_\infty(Q\|P) = \ln \inf_{A \in \mathcal{F}} \frac{P(A)}{Q(A)} = \ln \operatorname*{ess\,inf}_Q \frac{p}{q}, \tag{83} \]
with the conventions that $0/0 = 0$ and $x/0 = \infty$ for $x > 0$.

Proof: The identity (82) follows directly from definitions. It implies $D_{-\infty}(P\|Q) = -D_\infty(Q\|P)$, because $\frac{\alpha}{1 - \alpha}$ tends to $-1$ as $\alpha \to -\infty$. The remaining identities follow from the closed-form expressions for $D_\infty(Q\|P)$ in Theorem 6.

Skew symmetry gives a kind of symmetry between the orders $1/2 + \alpha$ and $1/2 - \alpha$. In applications in physics this symmetry is related to the use of so-called escort probabilities [60].

Whereas the nonnegative orders generally satisfy the same or similar properties for different values of $\alpha$, the fact that $\frac{\alpha}{1 - \alpha} < 0$ for $\alpha < 0$ implies that properties for negative orders are often inverted. For example, Rényi divergence for negative orders is nonpositive, concave in its first argument and upper semi-continuous in the topology of setwise convergence. In addition, the data processing inequality holds with its inequality reversed and for $\alpha \in (-\infty, 0)$ Theorem 2 applies with an infimum instead of a supremum.

Not all properties are inverted, however. Most notably, it does remain true that Rényi divergence is nondecreasing and continuous in $\alpha$ (see also Figure 1):

Theorem 39. For $\alpha \in [-\infty, \infty]$, the Rényi divergence $D_\alpha(P\|Q)$ is nondecreasing in $\alpha$.

Proof: For $\alpha < 0$, $D_\alpha(P\|Q) \le 0$ and for $\alpha \ge 0$, $D_\alpha(P\|Q) \ge 0$, so the divergence for negative orders never exceeds the divergence for nonnegative orders. The remainder of the proof follows from Theorem 3 and skew symmetry.

Theorem 40. The Rényi divergence $D_\alpha(P\|Q)$ is continuous in $\alpha$ on $A = \{\alpha \in [-\infty, \infty] \mid 0 \le \alpha \le 1 \text{ or } |D_\alpha(P\|Q)| < \infty\}$.

Proof: Rényi divergence is nondecreasing in $\alpha$, nonnegative for $\alpha \ge 0$ and nonpositive for $\alpha < 0$. Therefore the required continuity follows directly from Theorem 7 and skew symmetry, except for the case
\[ \lim_{\alpha \uparrow 0} D_\alpha(P\|Q) = D_0(P\|Q), \]
which is required to hold if there exists a value $\beta < 0$ such that $D_\beta(P\|Q) > -\infty$. In this case $D_{1-\beta}(Q\|P) = \frac{1 - \beta}{\beta} D_\beta(P\|Q) < \infty$, which implies: (a) that $Q \ll P$, so $D_0(P\|Q) = 0$; and (b) that $D(Q\|P) < \infty$ and by Theorem 5
\[ \lim_{\alpha \uparrow 0} D_\alpha(P\|Q) = \lim_{\alpha \uparrow 0} \frac{\alpha}{1 - \alpha} D_{1-\alpha}(Q\|P) = 0 \cdot D(Q\|P) = 0. \]

VI. COUNTEREXAMPLES

Some useful properties that are satisfied by other divergences are not satisfied by Rényi divergence. Here we give counterexamples for a few important ones.

A. Convexity in P does not hold for α > 1

Rényi divergence for $\alpha \in (1, \infty)$ is not convex in its first argument. Consider the following counterexample: let $0 < p_0 < p_1 < 1$ be any two numbers, and let $p_{1/2} = \frac{p_0 + p_1}{2}$. Let $\varepsilon > 0$ be arbitrary, and let $0 < q < 1$ be small enough that
\[ \max_{i \in \{0, 1\}} \frac{(1 - p_i)^\alpha (1 - q)^{1-\alpha}}{p_i^\alpha q^{1-\alpha}} \le \varepsilon. \]
Then convexity of $D_\alpha$ in its first argument would imply that
\[ \begin{aligned}
&\frac{1}{2} \ln \big( p_0^\alpha q^{1-\alpha} + (1 - p_0)^\alpha (1 - q)^{1-\alpha} \big) + \frac{1}{2} \ln \big( p_1^\alpha q^{1-\alpha} + (1 - p_1)^\alpha (1 - q)^{1-\alpha} \big) \\
&\qquad \ge \ln \big( p_{1/2}^\alpha q^{1-\alpha} + (1 - p_{1/2})^\alpha (1 - q)^{1-\alpha} \big),
\end{aligned} \]
which implies
\[ \begin{aligned}
\frac{1}{2} \ln \big( p_0^\alpha q^{1-\alpha} (1 + \varepsilon) \big) + \frac{1}{2} \ln \big( p_1^\alpha q^{1-\alpha} (1 + \varepsilon) \big) &\ge \ln \big( p_{1/2}^\alpha q^{1-\alpha} \big), \\
\frac{1}{2} \ln \big( p_0^\alpha (1 + \varepsilon) \big) + \frac{1}{2} \ln \big( p_1^\alpha (1 + \varepsilon) \big) &\ge \ln p_{1/2}^\alpha.
\end{aligned} \]
As this expression holds for all $\varepsilon > 0$, we get
\[ \begin{aligned}
\frac{1}{2} \ln p_0^\alpha + \frac{1}{2} \ln p_1^\alpha &\ge \ln p_{1/2}^\alpha, \\
\frac{1}{2} \ln p_0 + \frac{1}{2} \ln p_1 &\ge \ln \frac{p_0 + p_1}{2},
\end{aligned} \]
which is a contradiction, because the natural logarithm is strictly concave.

B. Rényi divergence is not continuous

In general the Rényi divergence of order $\alpha \in (0, 1)$ is not continuous in the topology of setwise convergence. To construct a counterexample, let $P_n$ denote the probability distribution on $[0, 2\pi]$ with density $\frac{1 + \sin(nx)}{2\pi}$ and let $Q_n$ denote the probability distribution on $[0, 2\pi]$ with density $\frac{1 - \sin(nx)}{2\pi}$ for $n = 1, 2, \ldots$ Then $D_\alpha(P_n\|Q_n) > 0$ does not depend on $n$, and both $P_n$ and $Q_n$ converge to the uniform distribution $U$ on $[0, 2\pi]$ in the topology of setwise convergence. Consequently, $\lim_{n \to \infty} D_\alpha(P_n\|Q_n) \ne 0 = D_\alpha(U\|U)$, so in general $D_\alpha$ is not continuous in the topology of setwise convergence.

C. Not a metric

Except for the order $\alpha = 1/2$, Rényi divergence is not symmetric and cannot be a metric. For $\alpha = 1/2$, Rényi divergence is symmetric and by (5) it locally behaves like the square of a metric. Therefore one may wonder whether it actually is the square of a metric itself. Consider the following three distributions on two points:
\[ P = (0, 1), \qquad Q = (1/2, 1/2), \qquad R = (1, 0). \]
Then
\[
D_{1/2}(P\|Q) = \ln 2, \qquad D_{1/2}(Q\|R) = \ln 2, \qquad D_{1/2}(P\|R) = \infty.
\]
As the square roots of these divergences violate the triangle inequality, $D_{1/2}$ cannot be the square of a metric.

VII. SUMMARY

We have reviewed and derived the most important properties of Rényi divergence and Kullback-Leibler divergence. These include convexity and continuity properties, a generalization of the Pythagorean inequality to general orders, limits of σ-algebras, additivity for product distributions on infinite sequences, and the relation of the special order 0 to absolute continuity and mutual singularity of such distributions.

We have also derived several key minimax identities. In particular, Theorems 30 and 32 illuminate the relation between Rényi divergence, Kullback-Leibler divergence and Chernoff information in hypothesis testing. And Theorem 34 extends the known equivalence of channel capacity and minimax redundancy to continuous channel inputs (for all orders).

ACKNOWLEDGMENTS

The authors would like to thank Peter Grünwald, Wouter Koolen and two anonymous referees for useful comments. Part of the research was done while both authors were with the Centrum Wiskunde & Informatica in Amsterdam, the Netherlands, and while Tim van Erven was with the VU University, also in Amsterdam. This work was supported in part by NWO Rubicon grant 680-50-1112.

REFERENCES

[1] A. Rényi, “On measures of entropy and information,” in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 547–561, 1961.
[2] P. Harremoës, “Interpretations of Rényi entropies and divergences,” Physica A: Statistical Mechanics and its Applications, vol. 365, no. 1, pp. 57–62, 2006.
[3] P. D. Grünwald, The Minimum Description Length Principle. The MIT Press, 2007.
[4] I. Csiszár, “Generalized cutoff rates and Rényi’s information measures,” IEEE Transactions on Information Theory, vol. 41, no. 1, pp. 26–34, 1995.
[5] T. Zhang, “From ε-entropy to KL-entropy: Analysis of minimum information complexity density estimation,” The Annals of Statistics, vol. 34, no. 5, pp. 2180–2210, 2006.
[6] D. Haussler and M. Opper, “Mutual information, metric entropy and cumulative relative entropy risk,” The Annals of Statistics, vol. 25, no. 6, pp. 2451–2492, 1997.
[7] T. van Erven, When Data Compression and Statistics Disagree: Two Frequentist Challenges for the Minimum Description Length Principle. PhD thesis, Leiden University, 2010.
[8] L. Le Cam, “Convergence of estimates under dimensionality restrictions,” The Annals of Statistics, vol. 1, no. 1, pp. 38–53, 1973.
[9] L. Birgé, “On estimating a density using Hellinger distance and some other strange facts,” Probability Theory and Related Fields, vol. 71, pp. 271–291, 1986.
[10] S. van de Geer, “Hellinger-consistency of certain nonparametric maximum likelihood estimators,” The Annals of Statistics, vol. 21, no. 1, pp. 14–44, 1993.
[11] D. Morales, L. Pardo, and I. Vajda, “Rényi statistics in directed families of exponential experiments,” Statistics, vol. 34, pp. 151–174, 2000.
[12] Y. Mansour, M. Mohri, and A. Rostamizadeh, “Multiple source adaptation and the Rényi divergence,” in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI), pp. 367–374, 2009.
[13] A. O. Hero, B. Ma, O. Michel, and J. D. Gorman, “Alpha-divergence for classification, indexing and retrieval (revised),” Tech. Rep. CSPL-334, Communications and Signal Processing Laboratory, The University of Michigan, 2003.
[14] J. Aczél and Z. Daróczy, On Measures of Information and Their Characterizations. Academic Press, 1975.
[15] M. Ben-Bassat and J. Raviv, “Renyi’s entropy and the probability of error,” IEEE Transactions on Information Theory, vol. 24, no. 3, pp. 324–330, 1978.
[16] T. van Erven and P. Harremoës, “Rényi divergence and majorization,” in Proceedings of the IEEE International Symposium on Information Theory (ISIT), 2010.
[17] O. Shayevitz, “A note on a characterization of Rényi measures and its relation to composite hypothesis testing.” arXiv:1012.4401v1, Dec. 2010.
[18] O. Shayevitz, “On Rényi measures and hypothesis testing,” in IEEE International Symposium on Information Theory Proceedings, pp. 800–804, 2011.
[19] V. S. Huzurbazar, “Exact forms of some invariants for distributions admitting sufficient statistics,” Biometrika, vol. 42, no. 3/4, pp. 533–537, 1955.
[20] F. Liese and I. Vajda, Convex Statistical Distances. Leipzig: Teubner, 1987.
[21] M. Gil, “On Rényi divergence measures for continuous alphabet sources,” Master’s thesis, Queen’s University, 2011.
[22] M. Gil, F. Alajaji, and T. Linder, “Rényi divergence measures for commonly used univariate continuous distributions,” Information Sciences, vol. 249, pp. 124–131, 2013.
[23] D. Aldous and P. Diaconis, “Strong uniform times and finite random walks,” Advances in Applied Mathematics, vol. 8, pp. 69–97, 1987.
[24] A. L. Gibbs and F. E. Su, “On choosing and bounding probability metrics,” International Statistical Review, vol. 70, pp. 419–435, 2002.
[25] G. L. Gilardoni, “On Pinsker’s and Vajda’s type inequalities for Csiszár’s f-divergences,” IEEE Transactions on Information Theory, vol. 56, no. 11, pp. 5377–5386, 2010.
[26] D. Pollard, A User’s Guide to Measure Theoretic Probability. Cambridge University Press, 2002.
[27] F. Liese and I. Vajda, “On divergences and informations in statistics and information theory,” IEEE Transactions on Information Theory, vol. 52, no. 10, pp. 4394–4412, 2006.
[28] A. N. Shiryaev, Probability. Springer-Verlag, 1996.
[29] S. M. Ali and S. D. Silvey, “A general class of coefficients of divergence of one distribution from another,” Journal of the Royal Statistical Society, series B, vol. 28, no. 1, pp. 131–142, 1966.
[30] T. M. Cover and J. A. Thomas, Elements of Information Theory. Wiley, 1991.
[31] I. Csiszár, “I-divergence geometry of probability distributions and minimization problems,” The Annals of Probability, vol. 3, no. 1, pp. 146–158, 1975.
[32] F. Topsøe, Entropy, Search, Complexity, vol. 16 of Bolyai Society Mathematical Studies, ch. 8, Information Theory at the Service of Science, pp. 179–207. Springer, 2007.
[33] R. Sundaresan, “A measure of discrimination and its geometric properties,” in Proceedings of the IEEE International Symposium on Information Theory (ISIT), 2002.
[34] R. Sundaresan, “Guessing under source uncertainty with side information,” in Proceedings of the IEEE International Symposium on Information Theory (ISIT), 2006.
[35] Y. V. Prokhorov, “Convergence of random processes and limit theorems in probability theory,” Theory of Probability and Its Applications, vol. I, no. 2, pp. 157–214, 1956.
[36] E. C. Posner, “Random coding strategies for minimum entropy,” IEEE Transactions on Information Theory, vol. 21, no. 4, pp. 388–391, 1975.
[37] A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, 1996. (Corrected second printing, 2000).
[38] M. S. Pinsker, Information and Information Stability of Random Variables and Processes. Holden-Day, 1964. Translated by A. Feinstein.
[39] A. R. Barron, “Limits of information, Markov chains and projections,” in Proceedings of the IEEE International Symposium on Information Theory (ISIT), p. 25, 2000.
[40] P. Harremoës and K. K. Holst, “Convergence of Markov chains in information divergence,” Journal of Theoretical Probability, vol. 22, pp. 186–202, 2009.
[41] O. Kallenberg, Foundations of Modern Probability. Springer, 1997.
[42] A. W. van der Vaart, Asymptotic Statistics. Cambridge University Press, 1998.
[43] J. Feldman, “Equivalence and perpendicularity of Gaussian processes,” Pacific Journal of Mathematics, vol. 8, no. 4, pp. 699–708, 1958.
[44] J. Hájek, “On a property of normal distributions of any stochastic process,” Czechoslovak Mathematical Journal, vol. 8, no. 4, pp. 610–618, 1958. In Russian with English summary.
[45] B. J. Thelen, “Fisher information and dichotomies in equivalence/contiguity,” The Annals of Probability, vol. 17, no. 4, pp. 1664–1690, 1989.
[46] S. Kakutani, “On equivalence of infinite product measures,” The Annals of Mathematics, vol. 49, no. 1, pp. 214–224, 1948.
[47] A. Rényi, “On some basic problems of statistics from the point of view of information theory,” in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1: Statistics, pp. 531–543, 1967.
[48] S. Kullback, Information Theory and Statistics. Wiley, 1959.
[49] T. Nemetz, “On the α-divergence rate for Markov-dependent hypotheses,” Problems of Control and Information Theory, vol. 3, no. 2, pp. 147–155, 1974.
[50] Z. Rached, F. Alajaji, and L. L. Campbell, “Rényi’s divergence and entropy rates for finite alphabet Markov sources,” IEEE Transactions on Information Theory, vol. 47, no. 4, pp. 1553–1561, 2001.
[51] I. Csiszár, “Information projections revisited,” IEEE Transactions on Information Theory, vol. 49, no. 6, pp. 1474–1490, 2003.
[52] A. A. Fedotov, P. Harremoës, and F. Topsøe, “Refinements of Pinsker’s inequality,” IEEE Transactions on Information Theory, vol. 49, no. 6, pp. 1491–1498, 2003.
[53] R. T. Rockafellar, Convex Analysis. Princeton University Press, 1970.
[54] B. Ryabko, “Comments on ‘a source matching approach to finding minimax codes’ by Davisson, L. D. and Leon-Garcia, A.,” IEEE Transactions on Information Theory, vol. 27, no. 6, pp. 780–781, 1981. Including also the ensuing Editor’s Note.
[55] D. Haussler, “A general minimax result for relative entropy,” IEEE Transactions on Information Theory, vol. 43, no. 4, pp. 1276–1280, 1997.
[56] M. Sion, “On general minimax theorems,” Pacific Journal of Mathematics, vol. 8, no. 1, pp. 171–176, 1958.
[57] H. Komiya, “Elementary proof for Sion’s minimax theorem,” Kodai Mathematical Journal, vol. 11, no. 1, pp. 5–7, 1988.
[58] Y. M. Shtar’kov, “Universal sequential coding of single messages,” Problems of Information Transmission, vol. 23, no. 3, pp. 175–186, 1987.
[59] R. Sibson, “Information radius,” Z. Wahrscheinlichkeitstheorie verw. Geb., vol. 14, pp. 149–160, 1969.
[60] J. Naudts, “Estimators, escort probabilities, and φ-exponential families in statistical physics,” Journal of Inequalities in Pure and Applied Mathematics, vol. 5, no. 4, 102, 2004.

Tim van Erven is originally from the Netherlands. He performed his PhD research at the Centrum Wiskunde & Informatica (CWI) in Amsterdam, and received his PhD degree from Leiden University in 2010. After postdoc positions at the CWI and the Vrije Universiteit in Amsterdam, he obtained a Rubicon grant from the Netherlands Organisation for Scientific Research (NWO) to do a two-year postdoc at the Université Paris-Sud in France.
His interests are in topics related to information theory and statistics, including minimum description length (MDL) learning, sequential prediction with individual sequences (online learning), and statistical learning theory. The present paper was partially motivated by the importance of Rényi divergence in stating sufficient conditions for convergence of the MDL estimator, which he studied in his PhD thesis [7, Chapter 5].

Peter Harremoës (M’00) received the BSc degree in mathematics in 1984, the Exam. Art. degree in archaeology in 1985, and the MSc degree in mathematics in 1988, all from the University of Copenhagen, Denmark. In 1993 he received his PhD degree in the natural sciences from Roskilde University, Denmark.
From 1993 to 1998, he worked as a mountaineer. From 1998 to 2000, he held various teaching positions in mathematics. From 2001 to 2006, he was Postdoctoral Fellow with the University of Copenhagen, with an extended visit to the Zentrum für Interdisziplinäre Forschung, Bielefeld, Germany in 2003. From 2006 to 2009, he was affiliated with the Centrum Wiskunde & Informatica, Amsterdam, The Netherlands, under the European Pascal Network of Excellence. Since then he has been affiliated with Niels Brock, Copenhagen Business College, in Denmark.
From 2007 to 2011 Peter Harremoës has been Editor-in-Chief of the journal Entropy. He is currently an editor for that journal.
