EnergyDistance10 1002-Wics 1375

Advanced Review
Energy distance
Maria L. Rizzo1* and Gábor J. Székely2,3
Energy distance is a metric that measures the distance between the distributions
of random vectors. Energy distance is zero if and only if the distributions
are identical, thus it characterizes equality of distributions and provides a theo-
retical foundation for statistical inference and analysis. Energy statistics are
functions of distances between observations in metric spaces. As a statistic,
energy distance can be applied to measure the difference between a sample and
a hypothesized distribution or the difference between two or more samples in
arbitrary, not necessarily equal dimensions. The name energy is inspired by the
close analogy with Newton’s gravitational potential energy. Applications include
testing independence by distance covariance, goodness-of-fit, nonparametric
tests for equality of distributions and extension of analysis of variance, generali-
zations of clustering algorithms, change point analysis, feature selection,
and more. © 2015 Wiley Periodicals, Inc.
How to cite this article:
WIREs Comput Stat 2016, 8:27–38. doi: 10.1002/wics.1375
Keywords: Multivariate, goodness-of-fit, distance correlation, DISCO,

independence
INTRODUCTION (length) of its argument, E denotes expected

value, and a primed random variable X0 denotes an
E nergy distance is a distance between probability
distributions. The name ‘energy’ is motivated by
analogy to the potential energy between objects in a
independent and identically distributed (iid) copy of
X; that is, X and X0 are iid. Similarly, Y and Y0 are
iid. The squared energy distance can be defined in
gravitational space. The potential energy is zero if terms of expected distances between the random
and only if the location (the gravitational center) of vectors
the two objects coincide, and increases as their dis-
tance in space increases. One can apply the notion of
potential energy to data as follows. Let X and Y be
D2 ðF,GÞ : = 2E k X −Y k −E k X −X0 k
independent random vectors in ℝd, with cumulative − E k Y − Y 0 k ≥ 0;
distribution function (CDF) F and G, respectively. In
what follows, k k denotes the Euclidean norm
and the energy distance between distributions F and
G is defined as the square root of D2(F,G).
It can be shown that energy distance D(F,G)
*Correspondence to: mrizzo@bgsu.edu satisfies all axioms of a metric, and in particular D
1
Department of Mathematics and Statistics, Bowling Green State (F,G) = 0 if and only if F = G. Therefore, energy dis-
University, Bowling Green, OH, USA tance provides a characterization of equality of distri-
2
National Science Foundation, Arlington, VA, USA butions, and a theoretical basis for the development
3
Rényi Institute of Mathematics, Hungarian Academy of Sciences, of statistical inference and multivariate analysis based
Hungary on Euclidean distances. In this review we discuss sev-
Conflict of interest: The authors have declared no conflicts of inter- eral of the important applications and illustrate their
est for this article. implementation.
Volume 8, January/February 2016 © 2015 Wiley Periodicals, Inc. 27

Advanced Review wires.wiley.com/compstats
BACKGROUND AND APPLICATION TESTING FOR EQUAL

OF ENERGY DISTANCE DISTRIBUTIONS
The notion of ‘energy statistics’1–3 was introduced by Consider the null hypothesis that two random vari-
Székely in 1984–1985 in a series of lectures given in ables, X and Y, have the same cumulative distribu-
Budapest, Hungary, and at MIT, Yale, and Columbia. tion functions: F = G. For samples x1, …, xn and
As mentioned above, Székely’s main idea and the y1, …, ym from X and Y, respectively, the E-statistic
name derive from the concept of Newton’s potential for testing this null hypothesis is
energy. Statistical observations can be considered
as objects in a metric space that are governed by a
E n, m ðX, Y Þ : = 2A− B −C;
statistical potential energy that is zero if and only if
an underlying statistical null hypothesis is true.
Energy statistics (E-statistics) are a class of functions where A, B, and C are simply averages of pairwise
of distances between statistical observations. distances:
Several examples of one-sample, two-sample, and
multi-sample energy statistics will be illustrated
below. 1 X n X m
1X n X n
Cramér’s distance4 is closely related, but only A= k xi −yj k , B = 2 k xi −xj k ,

nm i = 1 j = 1 n i=1 j=1
in the univariate (real valued) case. For two real-
valued random variables with CDFs F and G, the 1 X m X m
C= 2 k yi −yj k :
squared energy distance is exactly twice the distance m i=1 j=1
proposed by Harald Cramér:
ð∞ One can prove2,9 that E(X, Y) := D2(F, G) is zero if

D ðF,GÞ = 2
2
ðFðxÞ − GðxÞÞ2 dx: and only if X and Y have the same distribution
−∞ (F = G). It is also true that the statistic E n,m is always
non-negative. When the null hypothesis of equal dis-
However, the equivalence of energy distance tributions is true, the test statistic
with Cramer’s distance cannot extend to higher
dimensions, because while energy distance is rotation nm
invariant, Cramér’s distance does not have this T= E n, m ðX,Y Þ:
n+m
property.
A proof of the basic energy inequality,
D(F,G) ≥ 0 with equality if and only if F = G follows converges in distribution to a quadratic form of
from Ref 5 and also from Mattner’s result.6 An alter- independent standard normal random variables.
nate proof related to a result of Morgenstern7 Under an alternative hypothesis the statistic
appears in Refs 8,9. T tends to infinity stochastically as sample sizes
Application to testing for equality of two distri- tends to infinity, so the energy test for equal distribu-
butions appeared in Refs 8,10–12 as well as a multi- tions that rejects the null for large values of T is
sample test for equality of distributions ‘distance consistent.12
components’ (DISCO).13 Goodness-of-fit tests have Because the null distribution of T depends on
been developed for multivariate normality,8,9 stable the distributions of X and Y, the test is implemented
distribution,14 Pareto distribution,15 and multivariate as a permutation test in the energy package,24 which
Dirichlet distribution.16 Hierarchical clustering and a is available for R25 on the Comprehensive R Archive
generalization of k-means clustering based on energy Network (CRAN) under general public license. The
distance are developed in Refs 17,18. test is implemented in the function eqdist.etest.
Generalizations and interesting special The data argument can be the data matrix or dis-
cases of the energy distance have appeared in the tance matrix of the pooled sample, with the
recent literature; see Refs 19–22. A similar default being the data matrix. The second argument
idea related to energy distance and E-statistics were is a vector of sample sizes. Here we test whether two
considered as N-distances and N-statistics in Ref 23; species of iris data differ in the distribution of their
see also Ref 5. Measures of the energy distance type four dimensional measurements. In this example,
have also been studied in the machine learning n = m =50 and we use 999 permutation replicates
literature.21 for the test decision.
28 © 2015 Wiley Periodicals, Inc. Volume 8, January/February 2016

WIREs Computational Statistics Energy distance
To compute the energy statistic only: First, let us introduce a generalization of

the energy distance. The characterization of equality
of distributions by energy distance also holds if
we replace Euclidean distance by k X − Y k α,
where 0 < α < 2. The characterization does not
hold if α = 2 because 2E k X −Yk2 −E k X −X0 k2
−E k Y −Y 0 k2 = 0 whenever EX = EY. We denote the
The E-statistic is not standardized, so it is corresponding two-sample energy statistic by E ðnα,Þm .
reported only for reference. To interpret the value, Let A = fa1 ,…,an1 g, B = fb1 ,…,bn2 g be two
one should normalize the statistic. One way of doing samples, and define
this is to divide by an estimate of E k X − Y k. Note 1 X n1 X n2
that if gα ðA,BÞ : = k ai −bm kα ; ð1Þ
n1 n2 i = 1 m = 1
D2 ðFX ,FY Þ 2E k X − Y k −E k X −X0 k −E k Y − Y 0 k

H: = = ; for 0 < α ≤ 2. If A1, …, AK are samples of sizes
2E k X −Y k 2E k X − Y k XK
n1, n2, …, nK, respectively, and N = n , the
j=1 j
then 0 ≤ H ≤ 1 with H = 0 if and only if X and Y are within-sample dispersion statistic is
identically distributed.
For background on permutation tests, see Ref X
K
nj
26 or Ref 27. Wα = Wα ðA1 ,…,AK Þ = gα Aj , Aj ; ð2Þ
For more details, applications, and power com- j=1
2
parisons see Refs 11,12 and the documentation
included with the energy package. The same func- and the total dispersion of the observed response is
tions are generalized to handle multi-sample pro-
N
blems, discussed below. Tα = Tα ðA1 ,…,AK Þ = gα ðA, AÞ; ð3Þ
2
where A is the pooled sample. The between-sample

MULTI-SAMPLE ENERGY energy statistic is
STATISTICS
X nj + nk nj nk

ðαÞ
Distance Components: A Nonparametric Sn, α = E Aj ,Ak
1≤j<k≤K
2N nj + nk nj , nk
Extension of ANOVA X nnj nk o
Analogous to the ANOVA decomposition of vari- = 2gα Aj ,Ak − gα Aj , Aj − gα ðAk , Ak Þ :
1≤j<k≤K
2N
ance, we partition the total dispersion of the pooled
samples into between and within components, called ð4Þ
distance components (DISCO). For two samples, the
between-sample component is the two-sample energy Note that Sn,α weights each pairwise E-statistic based
statistic above. For several samples, the between on the proportion of data in the two samples. Analo-
component is a weighted combination of pairwise gous to the decomposition of between-sample vari-
two-sample energy statistics. ance and within-sample variance of ANOVA, we
To test the K-sample hypothesis H0 : F1 = … = have the decomposition of distances
FK, K ≥ 2, one can apply the between-sample test sta-
tistic, or alternately a ratio statistic similar to the Tα = Sα + Wα ;
familiar F statistic of ANOVA, both implemented as
a randomization test. where both Sα and Wα are nonnegative.

For every 0 < α < 2, the statistic Eq. (4) deter- E-clustering
mines a consistent test of the multi-sample hypothesis Energy distance has been applied in hierarchical clus-
of equal distributions.13 In the special case where all Fj ter analysis.17 It generalizes the well-known Ward’s
are univariate distributions and α = 2, Sn,2 is the minimum variance method in a similar way that
ANOVA between sample sum of squared error and DISCO generalizes ANOVA. The DISCO decomposi-
the decomposition T2 = S2 + W2 is the ANOVA tion has recently been applied to generalize k-means
decomposition. The ANOVA test statistic measures clustering.18
differences in means, not distributions. However, if we In an agglomerative hierarchical clustering
apply α = 1 (Euclidean distance) or any 0 < α < 2 as algorithm, starting with single observations, at each
the exponent on Euclidean distance, the corresponding step we merge clusters that have minimum cluster
energy test is consistent against all alternatives with distance. In the energy distance algorithm, the cluster
finite α moments. If any of the underlying distributions distance is the two sample energy statistic.
may have non-finite first moment, a suitable choice of There is a general class of hierarchical cluster-
α extends the energy test to this situation. ing algorithms uniquely determined by their respec-
Returning to the iris data example, we can eas- tive recursive formula for updating all cluster
ily apply the test for equality of the three species’ dis- distances following each merge of two clusters. One
tributions using a choice of methods, and the can show17 that the energy clustering algorithm
relevant options here are method="discoB" or is also a member of this class and its recursive for-
method="discoF". The first uses the between- mula shows that it it is formally similar to Ward’s
sample statistic, and the second uses an ‘F ’ ratio like method.
ANOVA. Both are implemented as permutation tests.
Sn, α =ðK − 1Þ
Fn, α = ;
Wn, α =ðN − KÞ
Suppose that the disjoint clusters Ci, Cj are to
and the decomposition details are displayed in a table be merged at the current step. If Ck is a disjoint clus-
similar to an ANOVA table. Although it has the ter, then the new cluster distances can be computed
same form as an F statistic, it does not have an by the following recursive formula:
F distribution and a permutation test is applied.
ni + nk
d Ci [ Cj ,Ck = dðCi ,Ck Þ
One can obtain a table of pairwise energy sta- ni + nj + nk
tistics for any number of samples in one step using nj + nk
+ d Cj ,Ck ð5Þ
the edist function in the energy package. It can be ni + nj + nk
nk
used, for example, to display E-statistics for the − d Ci , Cj ;
result of a cluster analysis. ni + nj + nk


where d Ci , Cj = E ni , nj Ci ,Cj , and ni, nj, nk are the Distance Covariance
sizes of clusters Ci, Cj, Ck, respectively. Let dij := d The simplest formula for the distance covariance sta-
(Ci, Cj). Then tistic is the square root of
1 X n
dðijÞk : = d Ci [ Cj , Ck V 2n ðX, Y Þ = Ab ij B
b ij ;
= αi dðCi ,Ck Þ + αj d Cj ,Ck + βd Ci , Cj n2 i, j = 1
= αi dik + αj djk + βdij + γjdik − djk j;
where Â and B b are the double-centered distance
where matrices of the X sample and the Y sample, respec-
tively, and the subscript ij denotes the entry in the
ni + nk − nk i-th row and j-th column. The double-centered dis-
αi = ; β= ; γ = 0:
ni + nj + nk ni + nj + nk tance matrices are computed as in classical multidi-
mensional scaling. Given a random sample
If we substitute squared Euclidean distances for (x, y) = {(xi, yi) : i = 1, …, n} from the joint distribu-
Euclidean distances in this recursive formula, keeping tion of random vectors X in ℝp and Y in ℝq, com-
the same parameters (αi, αj, β, γ), then we obtain the pute the Euclidean distance matrix (aij) = (kxi − xjk)
updating formula for Ward’s minimum variance for the X sample and (bij) = (kyi − yjk) for the
method. However, we know that Ward’s method Y sample. The ij-th entry of Â is
(with exponent α = 2 on distances) is a geometrical
method that separates clusters by their centers, not b ij = aij −ai: −a:j + a .. ,
by their distributions. E-clustering generalizes Ward A i,j = 1, …,n;
because for every 0 < α < 2, the energy clustering
algorithm separates clusters that differ in where
distribution.
Overall in simulations and real data 1X n
1X n
1 X n
examples17 the characterization property of E is a ai: = aij , a:j , = aij , a .. = 2 aij :

n j=1 n i=1 n i, j = 1
clear advantage for certain clustering problems, with-
out sacrificing the good properties of Ward’s mini-
mum variance method for separating spherical b is
Similarly, the ij-th entry of B
clusters. i: − b
:j + b
.. , i, j = 1, …, n:
b ij = bij − b
B
The hclust hierarchical clustering function
provided in R25 implements exactly the above recur-
sive formula Eq. (5), and therefore one can use The sample distance variance is
hclust to apply either the E-clustering solution or
Ward’s method by specifying method ‘ward.D’ 1 X n
b2 :
V 2n ðXÞ = V 2n ðX,XÞ = A
(energy) or ‘ward.D2’ (Ward). n i, j = 1 ij
2
The distance covariance statistic is always non-nega-

tive, and V 2n ðXÞ = 0 only if all of the sample observa-
TESTING INDEPENDENCE tions are identical (see Ref 28).
An important application for two samples applies to
testing independence of random vectors. In this case,
we test whether the joint distribution of X and Y is Distance Correlation
equal to the product of their marginal distributions. Distance correlation is the standardized distance
Interestingly, the statistics can be expressed in a covariance. We have defined the squared distance
product–moment expression involving the double- covariance statistic
centered distance matrices of the X and Y samples.
The statistics based on distances are analogous to,
1 X n
b ij B
b ij ;
but more general than, product–moment covariance V 2n ðX,Y Þ = A
and correlation. This suggests the names distance n2 i, j = 1
covariance (dCov) and distance correlation (dCor),
defined below. and the squared distance correlation is defined by

8 8
> V 2n ðX, Y Þ
< qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi V 2 ðX, Y Þ
ffi , V 2n ðXÞ V 2n ðY Þ > 0 ; >
< qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
ffi , V 2 ðX Þ V 2 ðY Þ > 0 ;
Rn ðX, Y Þ =
2
V n ðXÞV n ðY Þ
2 2
R ðX, Y Þ =
2
>
: > V ðXÞV ðY Þ
2 2
0, V 2n ðXÞ V 2n ðY Þ = 0: :
0, V 2 ðXÞ V 2 ðY Þ = 0:
Distance correlation satisfies The empirical coefficients V n ðX,Y Þ and Rn (X, Y)
converge almost surely to the population coefficients
1. 0 ≤ Rn (X, Y) ≤ 1. V ðX, Y Þ and R(X, Y), as n ! ∞. The distribution of
2. If Rn (X, Y) = 1 then there exists a vector a, a Qn = nV 2n ðX, Y Þ converges to a quadratic form
X∞
non-zero real number b and an orthogonal λ Z2 , where Zi are iid standard normal and λi
i=1 i i
matrix R such that Y = a + bXR, for the data are non-negative coefficients that depend on the dis-
matrices X and Y. tributions of the underlying random variables. Values
of Qn near zero are consistent with the null hypothe-
Remark 1. One could also define dCov as V 2n and sis of independence, while large Qn support the alter-
dCor as R2n rather than by their respective native. Thus, a consistent test of multivariate
square roots. There are reasons to prefer each defini- independence is based on the distance covariance
tion, but historically28,29 the above definitions were Qn = nV 2n , and it can be implemented as a nonpara-
used. When we deal with unbiased statistics, we no metric permutation test. The dCov test applies to
longer have the non-negativity property, so we can- random vectors in arbitrary, not necessarily equal
not take the square root and need to work with the dimensions, for any sample size n ≥ 4. For high
square. dimensional X and Y, there is also a distance correla-
tion t-test of independence introduced in Ref 30 is
applicable when dimension exceeds sample size.
For the permutation test in this case, the null
Population Coefficients distribution is sampled by permuting the indices of
Suppose that X 2 ℝp, Y 2 ℝq, E k X k < ∞, and one of the two variables each time a sample is drawn.
E k Y k < ∞. The squared population distance covari- The replicates then provide a reference distribution
ance coefficient can be written in terms of expected for estimating the tail probability to the right of the
distances: observed test statistic Qn.
V 2 ðX, Y Þ = E k X − X0 kk Y − Y 0 k + E k X − X0 k E Remark 2. Note that the permutation test is only

k Y − Y 0 k − 2E k X − X0 kk Y − Y 00 k ; applicable if the observations are exchangeable under
the null, which would not be true e.g., for time
where (X,Y), (X0 , Y0 ), and (X00 , Y00 ) are iid.29 Here series data.
V 2 ðX, Y Þ is an energy distance between the joint dis-
The statistics and tests are implemented in the
tribution of (X,Y) and the product of the marginal
energy package for R.24,25 Functions dcor, dcov,
distributions of X and Y.
dcov.test and dcor.ttest compute the statistics
We have a characterization of independence:
and tests of independence. The following example
V ðX, Y Þ ≥ 0 with equality to 0 if and only if X and
2
uses the crabs data in the MASS package. After
Y are independent. Population distance correlation
converting the binary factors to integers, we test if
R(X, Y) is the square root of the standardized
the two-dimensional categorical variable (species, sex)
coefficient:
is independent of the vector of body measurements.

The data arguments to dcov.test can be data As the first step in computing the unbiased sta-
matrices or distance objects returned by the R dist tistic, we replace the double centering operation with
function. Here the arguments are data matrices. U-centering, to obtain U-centered distance matrices
Ã and B.e Then
1 Xe e
eB
A e := Ai, j Bi, j ,
nðn−3Þ i6¼j
is an unbiased estimator of squared population dis-

tance covariance V 2 ðX, Y Þ. The inner product nota-
tion is due to the fact that this statistic is an inner
product in the Hilbert space of U-centered distance
matrices.31
The test is significant and we reject independ- A bias corrected R2n is defined by normalizing
ence. One may also want to compute the statistics the inner product statistic with the bias corrected
using dcor or dcov: dVar statistics. The bias-corrected dCor statistic is
implemented in the R energy package by the bcdcor
function. Returning to the example, here is a compar-
ison of biased and bias-corrected R2n :
To recover all of the statistics from dCor, a util-

ity function is provided:
The above unbiased inner product dCov statis-

tic is easy to compute, but since it can take negative
values, we cannot define the bias corrected statistic
An Unbiased Distance Covariance to be the square root of it. Thus we avoid the
Statistic ‘squared’ notation and use the inner product opera-
The population distance covariance is zero under tor or V *n and R*n . One could have defined ‘dCov’
independence, but the dCov statistic is non-negative, from the start to be the square of the energy distance.
hence its expected value is positive except in degener- Historically, the rationale for choosing the square
ate cases. Clearly the dCov statistic in its original for- root definition is that in this case distance covariance
mulation is biased for the population coefficient. The is the energy distance between the joint distribution
bias is in fact increasing with dimension. of the variables and the product of their marginals. A
An unbiased estimator of V 2 ðX,Y Þ was given in disadvantage is that the distance variance, rather
Ref. 30 and an equivalent unbiased statistic was than the distance standard deviation, is measured in
given in Ref 31. Although the latter one looks sim- the same units as the distances.
pler, the original may be faster to compute. The fol- It is clear that both the sample dCov and the
lowing is from Ref 31. sample dCor can be computed in O(n2) steps.
Let A = (aij) be a symmetric, real valued n × n Recently Huo and Székely32 proved that for real val-
matrix with zero diagonal (not necessarily Euclidean ued samples the unbiased estimator of the squared
distances). Instead of the classical method of double population distance covariance can be computed by
centering A, we introduce the U-centered matrix Ã. an O(n log n) algorithm. The supplementary files to
The (i,j)-th entry of Ã is Ref. 32 include an implementation in Matlab.
8
> X n 1 X n X n
<a − 1 ai, j − ai, j +
1
ai, j , i 6¼ j;
e
Ai, j =
i, j n −2 n−2 ðn −1Þðn −2Þ Distance Correlation for Dissimilarity
>
: i = 1 j = 1 i , j = 1
0, i = j: Matrices
It is important to notice that Ã does not change if we
Here ‘U-centered’ refers to the result that the corre- add the same constant to all off-diagonal entries and
sponding squared distance covariance statistic is an U-center the result. In Ref 31, it is shown that the inner
unbiased estimator of the population coefficient. product version of dCov can be applied to any U-

centered dissimilarity matrix (zero-diagonal symmetic stochastically, and therefore E n determines a consist-
matrix). The algorithm is outlined in Ref 31, and it is a ent goodness-of-fit test.
strong competitor of the Mantel test for association For most applications the exponent α = 1
between dissimilarity matrices based on comparisons (Euclidean distance) can be applied, but smaller
in the paper. This makes dCov tests and energy statis- exponents have been applied for testing distributions
tics ready to apply to problems in e.g., community with heavy tails including Pareto, Cauchy, and stable
ecology, where one must often work with data in the distributions.14,15 The important special case of test-
form of non-Euclidean dissimilarity matrices. ing multivariate normality8,9 is fully implemented in
the energy24 package for R.
Partial Distance Correlation A detailed introduction to the energy goodness-
Based on the inner product dCov statistic (the unbi- of-fit tests with several simple examples can be found
in Ref 3. Here we will focus on the important appli-
ased estimator of V 2 ) theory is developed to define
cations of testing for multivariate normality, starting
partial distance correlation analogous to (linear) par-
with the special case of univariate normality.
tial correlation. There is a simple computing formula
The energy statistic for testing whether a sam-
for the pdCor statistic and there is a test for the
ple X1, …, Xn is from a multivariate normal distribu-
hypothesis of zero pdCor based on the inner product.
tion N(μ, Σ) is developed by Székely and Rizzo.9 Let
Energy statistics are defined for random vectors, so
x1, …, xn denote an observed random sample.
pdCor(X, Y; Z) is a scalar coefficient defined for ran-
dom vectors X, Y, and Z in arbitrary dimension. The Univariate Normality
statistics and tests are described in detail in Ref. 31 For a test of univariate normality, we apply the sta-
and currently implemented in an R package pdcor,33 tistic Eq. (6). Suppose that the null hypothesis is that
which will become part of the energy package. the sample has a Normal distribution with mean μ
and variance σ 2. Then it can be derived that
GOODNESS-OF-FIT E k xi − X k = 2ðxi −μÞFðxi Þ + 2σ 2 f ðxi Þ− ðxi −μÞ;

2σ
Goodness-of-fit is a one-sample problem, but there E k X − X0 k = pffiffiffi ;
π
are two distributions to consider: one is the
hypothesized distribution and the other is the under-
where F and f denote the cdf and density of the
lying distribution from which the observed sample
hypothesized Normal(μ, σ 2) distribution. The last
has been drawn. Energy distance applies to compare
sum in the statistic E n can be linearized in terms of
these two distributions with a variation of the two
the ordered sample, which allows computation in
sample energy distance.
O(n log n) time.
The energy distance for this problem must be
Generally, the parameters μ and σ are unknown.
the same as E(X, Y) where one of the variables now
In that case we first standardize the sample using the
represents the unknown sampled distribution. Sup-
sample mean and the sample standard deviation, then
pose that a random sample x1, …, xn is observed
test the fit to the standard normal distribution. The
and the problem is to test whether the sampled distri-
estimated parameters change the critical values of the
bution F is equal to the hypothesized distribution FX.
distribution but not the general shape, and the rejec-
The energy goodness-of-fit statistic is
0 1 tion region is in the upper tail. See Ref 8 for a detailed
2 X
n
1 Xn X n proof that the test with estimated parameters is statis-
E n = n@ E k xi − Xkα − E k X −X0 kα − 2 k xi −xj kα A;
ni=1 n i = 1j = 1 tically consistent (also for the multivariate case)
against all alternatives with finite moments.
ð6Þ
where X and X0 are iid with distribution FX, and

0 < α < 2. The statistic is defined in arbitrary dimen- Multivariate Normality
sion and is not restricted by sample size. The only The statistic E n for testing multivariate normality is
required condition is that kXk has finite α moment considerably more difficult to derive. We first stand-
under the null hypothesis. Under the null hypothesis ardize the sample using the sample mean vector and
EE n = E k X −X0 kα , and the asymptotic distribution of the sample covariance matrix to estimate parameters.
E n is a quadratic form of centered Gaussian random Then we test the sample for fit to standard multivari-
variables. The rejection region is in the upper tail. ate normal. If Z and Z0 are iid standard normal in
Under an alternative hypothesis, E n tends to infinity dimension d, we have


pffiffiffi Γ d 2+ 1 n are generated, j = 1, …, M, each standardized
0
E k Z − Z kd = 2E k Zkd = 2 d ; using the j-th sample mean vector and sample covari-
Γ 2 ð jÞ
ance, and E n, d is computed for each of these samples
where Γ() is the complete gamma function. If to obtain a reference distribution. An estimated
y1, …, yn are the standardized sample elements, the p-value is obtained by finding the proportion of repli-
computing formula for the test of multivariate nor- ð jÞ
cates E n, d that exceed the observed E n,d statistic. An
mality in ℝd is
0 1 example illustrating the test of normality for the
X
n Γ d+1
1 X n four-dimensional iris setosa data (included in the R
B2 2 C
nE n, d = n@ E k yj − Zkd −2 − 2 k yj −yk kd A distribution) follows. In this example, the hypothesis
nj=1 Γ 2d n j, k = 1 of normality is rejected at significance level 0.05.
where Under the null hypothesis, n E n,d X converges

∞

in distribution to a quadratic form Qd = λ Z2 ,
i=1 i i
pffiffiffi d + 1
2Γ as n ! ∞, where Zi are iid standard normal random
2 variables and λi are non-negative constants that
E k a − Zkd =
d depend on the parameters of the null distribution.

Γ
2

Figure 1 displays replicates for testing the iris data
rffiffiffi ∞ d+1 3 with the observed E n,d identified by the large black
Γ Γ k +
2 X ð − 1 Þk k akd 2k + 2 2 2 dot. The density curve overlaid on the plot is an
+
:
π k = 0 k!2k ð2k + 1Þð2k + 2Þ d approximation (n = 50) of the density of the asymp-
Γ k+ +1 totic distribution Qd.
2
The expression for E k a − Z k d follows from the

fact that if Z is a d-variate standard normal random
4
vector, then k a − Z k2d has a noncentral chisquare dis-
tribution χ 2[ν; λ] with noncentrality parameter
λ = jaj2d =2, and degrees of freedom ν = d + 2ψ, where
3
ψ is a Poisson random variable with mean λ. Typi-
cally the sum in E k a − Z k d converges to within a
Density
small tolerance after 40–60 terms, except when a is a

true outlier of the standard multivariate normal dis- 2
tribution (when k a k is very large). However,
E k a −Z k converges to k a k as k a k ! ∞, so we
can evaluate E k a − Z k ≊ k a k in that case. See the 1
source code in ‘energy.c’ of the energy package24 for
an implementation.
The test is implemented for the multivariate 0
normal distribution as follows, in the energy pack-
0.8 1.0 1.2 1.4
age24 with the mvnorm.etest function. If the
observed sample size is n, it is standardized using the E
sample mean vector and sample covariance, and the F I G U R E 1 | Replicates generated under the null hypothesis in a
observed test statistic E n,d computed. A large number test of multivariate normality for the iris setosa data. The test statistic
M of standard multivariate normal samples of size E n,d of the observed iris sample is located by the black dot.

The energy goodness-of-fit test could alternately One can further generalize energy distance to
be implemented by evaluating the constants λi, but probability distributions on metric spaces. Consider a
that is a difficult problem except in special cases. Some metric space (M, d) which has Borel sigma algebra
analytical results for univariate normality are given in B(M), so (M, B(M)) is a measurable space. Let P ðMÞ
Ref 3 and a numerical approach investigated in Ref denote the set of probability measures on (M, B(M)).
b n, d
14. See also Ref 8 for tabulated critical values of nE Then for any pair of probability measures μ and ν in
for several (n, d) obtained by large scale simulation. P ðMÞ, we can define the energy distance in terms of
The energy test of multivariate normality is the associated random variables X and Y and the
practical to apply via parametric bootstrap as illus- metric d as the square root of
trated above for arbitrary dimension, and sample size
n < d is not a problem. Monte Carlo power compari- D2 ðμ, νÞ = 2E½dðX, Y Þ − E½dðX,X0 Þ− E½dðY, Y 0 Þ;
sons9 suggest that the energy test is a powerful com-
petitor to other tests of multivariate normality. provided that D2(μ, ν) ≥ 0. However, in general
Indeed, there are very few other tests in the literature D2(μ, ν) can be negative. In order that D is a metric,
for multivariate normality like energy that are con- it is necessary and sufficient that (M,d) is strongly
sistent, powerful, omnibus tests with practical imple- negative definite (see Lyons20). When (M,d) is
mentation; the BHEP tests,34,35 which also apply a strongly negative definite, then the energy distance
characterization of equality between distributions, D(μ, ν) equals zero if and only if the distributions are
share these properties, and have recently been imple- equal. A commonly applied metric that is negative
mented in an R package MVN. See Ref 9 for definite but not strongly negative definite is the taxi-
comparisons. cab metric in ℝ2. Lyons20 showed that all separable
Hilbert spaces (and in particular Euclidean spaces)
have strong negative type.
GENERALIZATIONS
One might wonder what makes the functions | |α,
0 < α < 2 special in the definition above. One can
CONCLUSION
show that the key property is that | |α, 0 < α < 2 is
strongly negative definite (see Lyons20). In this case, Energy distance is a powerful tool for multivariate
the generalized distance function remains a metric for analysis. It applies to random vectors in arbitrary
measuring the distance between probability distribu- dimensions, and the methodology requires only the
tions F and G, it is nonnegative and equals zero if mild assumption of finite first moments or at least
and only if F = G. What makes the distance function finite α > 0 moments for some positive α. Computing
special in the class of strongly negative definite func- formulas are simple and the tests have been imple-
tions is that the distance function is scale equivariant. mented by nonparametric methods using resampling
If we change the scale by replacing x by cx and y by or Monte Carlo methods. We have illustrated the use
cy, then the squared distance D2(F, G) is multiplied of the functions in the energy package24 for several
by c, and therefore the ratio of these functions does of the methods. The package is open source and dis-
not depend on the constant c. tributed under general public license. To scale up to
This property of invariance also holds for big data problems, for problems such as cluster anal-
0 < α < 2, because if we change the measurement ysis, one could apply a ‘divide and recombine’
units in D2(α)(F, G), replace X by cX and Y by cY, (D&R) analysis.
then D2(α)(F, G) is multiplied by cα and the ratio of Readers may also refer to several interesting
two statistics of this type is again invariant with applications in a variety of disciplines, under Further
respect to c. Hence the statistical decisions based on Reading. Our review of the background and applica-
these ratios do not depend on the choice of measure- tions of energy distance is not an exhaustive bibliog-
ment units. This invariance property is essential. raphy, but intended as a starting point.
FURTHER READING
Dueck J, Edelmann D, Gneiting T, Richards D. The affinely invariant distance correlation. Bernoulli 2014, 20:2305–2330.
Feuerverger A. A consistent test for bivariate dependence. Int Stat Rev 1993, 61:419–433.

Székely GJ, Rizzo ML. On the uniqueness of distance covariance. Stat Probab Lett 2012, 82:2278–2282. doi:10.1016/j.
spl.2012.08.007.
Wahba G. Positive definite functions, reproducing kernel Hilbert spaces and all that. The Fisher Lecture at JSM 2014,
2014. Available at: http://www.stat.wisc.edu/wahba/talks1/fisher.14/wahba.fisher.7.11.pdf.
Gneiting T, Raftery AE. Strictly proper scoring rules, prediction, and estimation. J Am Stat Assoc 2007, 102:359–378.
doi:10.1198/016214506000001437.
Gretton A. A simpler condition for consistency of a kernel independence test, 2015. Available at: http://arxiv.org/abs/
1501.06103v1.
Gretton A, Györfi L. Consistent nonparametric tests of independence. J Mach Learn Res 2010, 11:1391–1423.
Kong J, Wang S, Wahba G. Using distance covariance for improved variable selection with application to learning genetic
risk models. Stat Med 2015, 34/10:1097–1258. doi:10.1002/sim.6441.
Kong J, Klein BEK, Klein R, Lee KE, Wahba G. Using distance correlation and SS-ANOVA to assess associations of familial
relationships, lifestyle factors, diseases, and mortality. Proc Natl Acad Sci USA 2012, 109:20352–20357. doi:10.1073/
pnas.1217269109.
Li R, Zhong W, Zhu L. Feature screening via distance correlation learning. J Am Stat Assoc 2012, 107:1129–1139.
doi:10.1080/01621459.2012.695654.
Martinez-Gomez E, Richards MT, Richards DSP. Distance correlation methods for discovering associations in large astro-
physical databases. Astrophys J 2014, 781:39.
Kim AY, Marzban C, Percival DB, Stuetzle W. Using labeled data to evaluate change detectors in a multivariate streaming
environment. Signal Process 2009, 89/12:2529–2536. doi:10.1016/j.sigpro.2009.04.011.
Menshenin DD, Zubkov AM. Properties of the Szekely-Mori symmetry criterion statistics in the case of binary vectors.
Math Notes 2012, 91:62–72.
Székely GJ, Móri TF. A characteristic measure of asymmetry and its application for testing diagonal symmetry. Commun
Stat Theory Methods 2001, 30:1633–1639.
Székely GJ, Bakirov NK. Extremal probabilities for Gaussian quadratic forms. Probab Theory Relat Fields 2003,
126:184–202.
Varin T, Bureau R, Mueller C, Willett P. Clustering files of chemical structures using the Szekely-Rizzo generalization of
Ward’s method. J Mol Graphics Modell 2009, 28/2:187–195. doi:10.1016/j.jmgm.2009.06.006.
Zhou Z. Measuring nonlinear dependence in time-series, a distance correlation approach. J Time Ser Anal 2012,
33:438–457.
REFERENCES
1. Székely GJ. Potential and Kinetic Energy in Statistics, some statistics in connection with some probability
Lecture Notes, Budapest Institute of Technology metrics, In: Stability Problems for Stochastic
(Technical University), 1989. Models. Moscow, VNIISI, 1989, 47–55. (in Russian),
2. Székely GJ. E-statistics: The Energy of Statistical Sam- English Translation: A characterization of
ples, Technical Report, Bowling Green State Univer- distributions by mean values of statistics and certain
sity, Department of Mathematics and Statistics probabilistic metrics, Journal of Soviet Mathematics
No. 02–16 and Technical Reports by the same title (1992), 2012.
from 2000–2003, e.g. No.03-05 and NSA grant # 6. Mattner L. Strict negative definiteness of integrals via
MDA 904-02-1-s0091 (2000–2002), 2002. complete monotonicity of derivatives. Trans Am Math
3. Székely GJ, Rizzo ML. Energy statistics: statistics based Soc 1997, 349:3321–3342.
on distances. J Stat Plann Infer 2013, 143:1249–1272. 7. Morgenstern D. Proof of a conjecture by Walter Deub-
doi:10.1016/j.jspi.2013.03.018. ner concerning the distance between points of two
4. Cramér H. On the composition of elementary errors. types in Rd. Discrete Math 2001, 226:347–349.
Skand Aktuar 1928, 11:141–180. 8. Rizzo ML. A new rotation invariant goodness-of-fit
5. Zinger AA, Kakosyan AV, Klebanov LB. Characteriza- test, PhD dissertation, Bowling Green State Univer-
tion of distributions by means of mean values of sity, 2002.

9. Székely GJ, Rizzo ML. A new test for multivariate nor- 23. Klebanov LB. N-distances and Their
mality. J Multivar Anal 2005, 93:58–80. Applications. Charles University, Prague: Karolinum
10. Baringhaus L, Franz C. On a new multivariate two- Press; 2005.
sample test. J Multivar Anal 2004, 88:190–206. 24. Rizzo ML, Székely GJ. Energy: E-statistics (energy sta-
11. Rizzo ML. A test of homogeneity for two multivariate tistics). R package version 1.6.2, 2014. Available at:
populations. In: 2002 Proceedings of the American http://CRAN.R-project.org/package=energy.
Statistical Association, Physical and Engineering 25. R Core Team. R: A language and environment for sta-
Sciences Section. Alexandria, VA: American Statistical tistical computing. R Foundation for Statistical Com-
Association, 2003. puting, Vienna, Austria, 2015. Available at: http://
12. Szekely GJ, Rizzo ML. Testing for equal distributions www.R-project.org/.
in high dimension. InterStat 2004, Nov.
26. Efron B, Tibshirani RJ. An Introduction to the
13. Rizzo ML, Székely GJ. DISCO analysis: a nonparamet- Bootstrap. Boca Raton, FL: Chapman & Hall/
ric extension of analysis of variance. Ann Appl Stat CRC; 1993.
2010, 4:1034–1055.
27. Davison AC, Hinkley DV. Bootstrap Methods and
14. Yang G. The energy goodness-of-fit test for univariate
their Application. Oxford: Cambridge University
stable distributions. PhD Thesis, Bowling Green State
Press; 1997.
University, 2012.
15. Rizzo ML. New goodness-of-fit tests for Pareto distri- 28. Székely GJ, Rizzo ML, Bakirov NK. Measuring
butions. ASTIN Bull 2009, 39:691–715. and testing independence by correlation of
distances. Ann Stat 2007, 35:2769–2794. doi:10.1214/
16. Li Y. Goodness-of-fit tests for Dirichlet distributions
009053607000000505.
with applications. PhD Thesis, Bowling Green State
University, 2015. 29. Székely GJ, Rizzo ML. Brownian distance covariance.
17. Székely GJ, Rizzo ML. Hierarchical clustering Ann Appl Stat 2009, 3:1236–1265. doi:10.1214/09-
via joint between-within distances: extending ward’s AOAS312.
minimum variance method. J Classif 2005, 30. Székely GJ, Rizzo ML. The distance correlation t-
22:151–183. test of independence in high dimension. J
18. Li S. k-groups: a generalization of k-means by energy Multivar Anal 2013, 117:193–213. doi:10.1016/j.
distance. PhD Thesis, Bowling Green State Univer- jmva.2013.02.012.
sity, 2015. 31. Székely GJ, Rizzo ML. Partial distance correlation
19. Baringhaus L, Franz C. Rigid motion invariant two- with methods for dissimilarities. Ann Stat 2014,
sample tests. Stat Sin 2010, 20:1333–1361. 42:2382–2412.
20. Lyons R. Distance covariance in metric spaces. Ann 32. Huo X, Székely G. Fast computing for distance
Probab 2013, 41:3284–3305. covariance. Technometrics 2015. doi:10.1080/
21. Sejdinovic D, Gretton A, Sriperumbudur B, 00401706.2015.1054435.
Fukumizu K. Hypothesis testing using pairwise dis- 33. Rizzo ML, Székely GJ. pdcor: Partial distance correla-
tances and associated kernels. In: Proceedings of the tion. R package version 1.0.0, 2014.
29th International Conference on Machine Learning,
Edinburgh, Scotland, UK, 2012. Available at: http:// 34. Henze N, Zirkler B. A class of invariant consistent
arxiv.org/abs/1205.0411v2. tests for multivariate normality. Commun Stat Theory
Methods 1990, 19:3595–3618.
22. Sejdinovic D, Sriperumbudur B, Gretton A,
Fukumizu K. Equivalence of distance-based and 35. Henze N, Wagner T. A New Approach to the BHEP
RKHS-based statistics in hypothesis testing. Ann Stat tests for multivariate normality. J Multivar Anal 1997,
2013, 41:2263–2291. 62:1–23.

EnergyDistance10 1002-Wics 1375

Caricato da

Informazioni sul documento

Titolo originale

Copyright

Formati disponibili

Condividi questo documento

Condividi o incorpora il documento

Opzioni di condivisione

Hai trovato utile questo documento?

Questo contenuto è inappropriato?

Copyright:

Formati disponibili

EnergyDistance10 1002-Wics 1375

Caricato da

Copyright:

Formati disponibili

Advanced Review

Keywords: Multivariate, goodness-of-ﬁt, distance correlation, DISCO,

INTRODUCTION (length) of its argument, E denotes expected

Volume 8, January/February 2016 © 2015 Wiley Periodicals, Inc. 27

BACKGROUND AND APPLICATION TESTING FOR EQUAL

Cramér’s distance4 is closely related, but only A= k xi −yj k , B = 2 k xi −xj k ,

ð∞ One can prove2,9 that E(X, Y) := D2(F, G) is zero if

28 © 2015 Wiley Periodicals, Inc. Volume 8, January/February 2016

To compute the energy statistic only: First, let us introduce a generalization of

D2 ðFX ,FY Þ 2E k X − Y k −E k X −X0 k −E k Y − Y 0 k

where A is the pooled sample. The between-sample

Volume 8, January/February 2016 © 2015 Wiley Periodicals, Inc. 29

30 © 2015 Wiley Periodicals, Inc. Volume 8, January/February 2016

examples17 the characterization property of E is a ai: = aij , a:j , = aij , a .. = 2 aij :

The distance covariance statistic is always non-nega-

Volume 8, January/February 2016 © 2015 Wiley Periodicals, Inc. 31

V 2 ðX, Y Þ = E k X − X0 kk Y − Y 0 k + E k X − X0 k E Remark 2. Note that the permutation test is only

32 © 2015 Wiley Periodicals, Inc. Volume 8, January/February 2016

is an unbiased estimator of squared population dis-

To recover all of the statistics from dCor, a util-

The above unbiased inner product dCov statis-

Volume 8, January/February 2016 © 2015 Wiley Periodicals, Inc. 33

GOODNESS-OF-FIT E k xi − X k = 2ðxi −μÞFðxi Þ + 2σ 2 f ðxi Þ− ðxi −μÞ;

where X and X0 are iid with distribution FX, and

34 © 2015 Wiley Periodicals, Inc. Volume 8, January/February 2016

where Under the null hypothesis, n E n,d X converges

d depend on the parameters of the null distribution.

The expression for E k a − Z k d follows from the

small tolerance after 40–60 terms, except when a is a

Volume 8, January/February 2016 © 2015 Wiley Periodicals, Inc. 35

36 © 2015 Wiley Periodicals, Inc. Volume 8, January/February 2016

Volume 8, January/February 2016 © 2015 Wiley Periodicals, Inc. 37

38 © 2015 Wiley Periodicals, Inc. Volume 8, January/February 2016

Potrebbero piacerti anche