Franco Peracchi
University of Rome Tor Vergata and EIEF
Spring 2014
Contents
1 Nonparametric Density Estimators 2
1.1 Empirical densities . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 The kernel method . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.3 Statistical properties of the kernel method . . . . . . . . . . . . 22
1.4 Other methods for univariate density estimation . . . . . . . . . 40
1.5 Multivariate density estimators . . . . . . . . . . . . . . . . . . 44
1.6 Stata commands . . . . . . . . . . . . . . . . . . . . . . . . . . 50
1
1 Nonparametric Density Estimators
Let the data $Z_1, \ldots, Z_n$ be a sample from the distribution of a random vector $Z \in \mathcal{Z} \subseteq \mathbb{R}^m$. We are interested in the general problem of estimating the (probability) distribution of $Z$ nonparametrically, that is, without restricting it to belong to some known parametric family.
2
Use of nonparametric density estimates
Nonparametric density estimates may be used for:
3
Example 1 Given an estimate $\hat f$ of $f$, a natural estimate of the mode $\zeta = \arg\max_{z \in \mathcal{Z}} f(z)$ of $f$ is the mode $\hat\zeta = \arg\max_{z \in \mathcal{Z}} \hat f(z)$ of $\hat f$. $\Box$
Example 2 Let $F(z) = \int_{-\infty}^{z} f(u)\,du$ denote the df of $Z$ and consider the hazard function
$$\lambda(z) = \frac{f(z)}{1 - F(z)}.$$
The nonparametric estimate $\hat\lambda(z) = \hat f(z)/[1 - \hat F(z)]$ may then be compared with the benchmark parametric estimate based on the exponential distribution with constant hazard $\lambda(z) = \lambda > 0$. $\Box$
4
Example 4 Let $Z = (X, Y)$ be a random vector with two elements ($m = 2$) and consider the conditional mean function (CMF) or mean regression function of $Y$ given $X$, defined by
$$\mu(x) = E(Y \mid X = x) = \int y\, \frac{f(x, y)}{f_X(x)}\, dy,$$
where $f_X(x) = \int f(x, y)\,dy$ is the marginal density of $X$.
5
1.1 Empirical densities
We begin with the simpler problem of estimating univariate densities (m =
1), and discuss the problem of estimating multivariate densities (m > 1)
later in Section 1.5.
6
Discrete Z
If $Z$ is discrete with a distribution that assigns positive probability mass to the values $z_1, z_2, \ldots$, then the probability $f_j = \Pr\{Z = z_j\}$ may be estimated by the relative frequency
$$\hat f_j = n^{-1}\sum_{i=1}^{n} 1\{Z_i = z_j\}, \qquad j = 1, 2, \ldots$$
This is just the sample average of $n$ iid binary rvs $X_{1j}, \ldots, X_{nj}$, where
$$X_{ij} = 1\{Z_i = z_j\}, \qquad i = 1, \ldots, n,$$
each distributed as Bernoulli with mean $f_j$. Thus
$$\operatorname{Var} \hat f_j = \frac{f_j(1 - f_j)}{n}, \qquad j = 1, 2, \ldots,$$
which is typically estimated by $\widehat{\operatorname{Var}}\,\hat f_j = n^{-1}\hat f_j(1 - \hat f_j)$.
It is easy to verify that $\hat f_j$ is consistent for $f_j$, that is, $\hat f_j \xrightarrow{p} f_j$. It is also asymptotically normal,
$$\sqrt{n}\,(\hat f_j - f_j) \xrightarrow{d} \mathcal{N}\big(0, \mathrm{AV}(\hat f_j)\big),$$
where $\mathrm{AV}(\hat f_j) = f_j(1 - f_j)$ can be estimated consistently by $\widehat{\mathrm{AV}}(\hat f_j) = \hat f_j(1 - \hat f_j)$.
This justifies the use of a symmetric asymptotic confidence interval (CI) for $f_j$ of the form
$$\mathrm{CI}_{1-2\alpha}(f_j) = \hat f_j \pm z(\alpha)\sqrt{\frac{\hat f_j(1 - \hat f_j)}{n}},$$
where $z(\alpha)$ is the upper $\alpha$th percentile of the $\mathcal{N}(0, 1)$ distribution (e.g. $z(.025) = 1.96$). Notice that this CI is problematic because it may contain values less than zero or greater than one.
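As a quick numerical illustration, the sketch below (not part of the original notes; the data are simulated and all names are ours) computes relative frequencies and the normal-approximation confidence intervals just described.

```python
# Illustrative sketch: relative-frequency estimates and normal-approximation CIs
# for a discrete Z. Data and probabilities are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
z = rng.choice([0, 1, 2], size=500, p=[0.5, 0.3, 0.2])

n = len(z)
z_alpha = 1.96          # upper .025 percentile of N(0, 1)

for value in np.unique(z):
    f_hat = np.mean(z == value)                    # relative frequency
    se = np.sqrt(f_hat * (1 - f_hat) / n)          # estimated standard error
    print(f"P(Z={value}): {f_hat:.3f}  95% CI "
          f"[{f_hat - z_alpha*se:.3f}, {f_hat + z_alpha*se:.3f}]")
```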
7
Continuous Z
The previous approach breaks down when $Z$ is continuous because the relative frequency of a specific value $z$ in any sample is either zero or very small. However, if $a < z \le b$ and $b - a$ is small, then
$$f(z) \approx \frac{\Pr\{a < Z \le b\}}{b - a}.$$
Thus, a natural estimate of $f(z)$ is the fraction of observations falling in the (small) interval $(a, b]$ divided by the length $b - a$ of such interval,
$$\hat f(z) = \frac{1}{n(b - a)}\sum_{i=1}^{n} 1\{a < Z_i \le b\}, \qquad a < z \le b. \tag{2}$$
Notice that, if no sample value is repeated and the interval $(a, b]$ is small enough, then it contains at most one observation, so $\hat f(z)$ is equal either to zero or to $[n(b - a)]^{-1}$.
8
Sampling properties
The numerator of (2) is the sum of $n$ iid rvs $X_i = 1\{a < Z_i \le b\}$, each having a Bernoulli distribution with mean
$$\mathrm{E}\,X_i = F(b) - F(a)$$
and variance
$$\operatorname{Var} X_i = [F(b) - F(a)][1 - F(b) + F(a)].$$
Thus,
$$\mathrm{E}\,\hat f(z) = \frac{F(b) - F(a)}{b - a},$$
so $\hat f(z)$ is generally biased for $f(z)$, although its bias is negligible if $b - a$ is sufficiently small. Notice that $\hat f(z)$ is unbiased for $f(z)$ if the df $F$ is linear on an interval containing $(a, b]$ or, equivalently, the density $f$ is constant on such interval.
Further,
$$\operatorname{Var}\hat f(z) = \frac{[F(b) - F(a)][1 - F(b) + F(a)]}{n(b - a)^2} = \frac{1}{n(b - a)}\,\frac{F(b) - F(a)}{b - a} - \frac{1}{n}\left[\frac{F(b) - F(a)}{b - a}\right]^2,$$
which shows that letting $b - a \to 0$ reduces the bias of $\hat f(z)$, but also increases its sampling variance. Thus, letting $b - a \to 0$ as $n \to \infty$ is not enough for $\hat f(z)$ to be consistent for $f(z)$. This trade-off between bias (systematic error) and sampling variance (random error) is typical of nonparametric estimation problems with $n$ fixed.
For $\hat f(z) \xrightarrow{p} f(z)$ as $n \to \infty$, we need both $b - a \to 0$ and $n(b - a) \to \infty$, that is, the rate at which $b - a \to 0$ must be slower than the rate at which $n \to \infty$.
9
Histograms
Histograms are classical examples of empirical densities.
Constructing a histogram requires the choice of an interval $(a_0, b_0]$ that contains the range of the data and of a set of cut points
$$c_j = c_{j-1} + h = a_0 + jh, \qquad j = 1, \ldots, J - 1,$$
with $c_0 = a_0$ and $c_J = b_0$.
The $i$th sample point $Z_i$ falls in the $j$th bin if $c_{j-1} < Z_i \le c_j$. For any point $z$ in the $j$th bin, a histogram estimate of $f(z)$ is
$$\hat f(z) = \frac{1}{n h_j}\sum_{i=1}^{n} 1\{c_{j-1} < Z_i \le c_j\},$$
the fraction of observations falling in the $j$th bin divided by the bin width $h_j = c_j - c_{j-1}$. This is also the estimate of $f(z')$ for any other $z'$ falling in the same bin. It is easy to verify that the function $\hat f$ is a proper density, that is, $\hat f(z) \ge 0$ for all $z$ and $\int \hat f(z)\,dz = 1$.
In the case of equally spaced bins, where $h_j = h$ for all bins, the histogram estimate becomes
$$\hat f(z) = \frac{1}{nh}\sum_{i=1}^{n} 1\{c_{j-1} < Z_i \le c_j\}.$$
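A minimal sketch of this estimator follows (not from the notes; the mixture data mimic the design used in Figure 1, and the bin choice is arbitrary).

```python
# Illustrative sketch: a histogram density estimate with equally spaced bins,
# computed "by hand"; compare with np.histogram(z, bins=edges, density=True)[0].
import numpy as np

rng = np.random.default_rng(1)
n = 200
comp = rng.random(n) < 0.6
z = np.where(comp, rng.normal(0.0, 1.0, n), rng.normal(4.0, 2.0, n))

a0, b0, J = z.min(), z.max(), 25          # interval containing the data, number of bins
h = (b0 - a0) / J                          # common bin width
edges = a0 + h * np.arange(J + 1)          # cut points c_0, ..., c_J

def f_hat(x):
    """Histogram estimate of the density at a point x in (a0, b0]."""
    j = min(int((x - a0) // h), J - 1)     # index of the bin containing x
    count = np.sum((edges[j] < z) & (z <= edges[j + 1]))
    return count / (n * h)

print(f_hat(0.0), f_hat(4.0))
```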
10
Figure 1: Histogram estimates of density for different numbers of bins. The data are 200 observations from the Gaussian mixture .6 N(0, 1) + .4 N(4, 4).
[Figure: panels labelled "25 bins" and "50 bins", plotting the estimated density against z.]
11
Sampling properties of the histogram estimator
Let $Z_1, \ldots, Z_n$ be a sample from a continuous distribution on the interval $(0, 1]$, with df $F$ and density $f$. Put $a_0 = 0$, $b_0 = 1$, and partition the interval $(0, 1]$ into $J$ bins of constant width $h = 1/J$. If $X_j$ denotes the number of observations falling in the $j$th bin, then the histogram estimator at any point $z$ in the $j$th bin is $\hat f(z) = X_j/(nh)$.
Now let the number of bins $J = J_n$ increase with the sample size $n$ or, equivalently, let the bin width $h_n = 1/J_n$ decrease with $n$. Let $\{(c_{j_n - 1}, c_{j_n}]\}$ denote the sequence of bins that contain the point $z$ and let $\Delta_{j_n} = F(c_{j_n}) - F(c_{j_n - 1})$. Then
$$\mathrm{E}\,\hat f(z) = \frac{\Delta_{j_n}}{h_n} \approx \frac{F(z) - F(z - h_n)}{h_n} \to f(z),$$
provided that $h_n \to 0$ as $n \to \infty$. Further,
$$\operatorname{Var}\hat f(z) = \frac{1}{n h_n}\,\frac{\Delta_{j_n}}{h_n} - \frac{1}{n}\left(\frac{\Delta_{j_n}}{h_n}\right)^2.$$
Thus, for $\hat f(z) \xrightarrow{p} f(z)$, we need not only that $h_n \to 0$ as $n \to \infty$, but also that $n h_n \to \infty$ or, equivalently, that $J_n/n \to 0$ as $n \to \infty$. That is, $h_n$ must go to zero, or equivalently $J_n$ must increase with $n$, but not too fast.
12
Drawbacks of histograms
Histograms are useful tools for exploratory data analysis, but have several
undesirable features, such as:
Results depend on the partition of (a0 , b0 ] into bins, that is, on their
number and position.
The histogram is a step function with jumps at the end of each bin.
Thus, it is impossible to incorporate prior information on the degree of
smoothness of a density.
The method may also create difficulties when estimates of the derivatives of the density are needed as input to other statistical procedures.
The partition of (a0 , b0 ] depends only on the number J of bins or, equiv-
alently, on the bin width h = (b0 a0 )/J.
The fact that the bin width h is kept fixed over the range of the data may lead to loss of detail where the data are most concentrated. If h
is reduced to deal with this problem, then spurious noise may appear
in regions where data are sparse.
13
1.2 The kernel method
We now present a related method that tries to overcome some of the drawbacks
of histograms.
Given a bandwidth $h > 0$, consider
$$\hat f(z) = \frac{1}{2nh}\sum_{i=1}^{n} 1\{z - h < Z_i \le z + h\}.$$
This is sometimes called a naive density estimate or, for reasons that will be clear soon, a uniform kernel density estimate.
Notice that $\hat f(z)$ is just the fraction of sample points falling in the interval $(z - h, z + h]$ divided by the length $2h$ of this interval. Equivalently, $\hat f(z)$ is the sample average of indicators, each equal to $(2h)^{-1}$ if $Z_i$ is within $h$ distance from $z$ and equal to zero otherwise.
The estimate can also be constructed graphically:
1. Center a box of width $2h$ and height $(2nh)^{-1}$ at each observation $Z_i$.
2. Add up the heights of the boxes that contain the point $z$; the result is equal to $\hat f(z)$.
14
Figure 2: Uniform kernel density estimates for different bandwidths. The data are 200 observations from the Gaussian mixture .6 N(0, 1) + .4 N(4, 4).
[Figure: panels for bwidth = .5, 1.0, and 1.5, each showing the true and estimated density against z.]
15
Smooth kernel density estimates
Because the event $z - h < Z_i \le z + h$ is equivalent to the event $-h < Z_i - z \le h$, which in turn is equivalent to the event $-1 \le (z - Z_i)/h < 1$, the density estimate $\hat f(z)$ may also be written
$$\hat f(z) = \frac{1}{nh}\sum_{i=1}^{n} w\!\left(\frac{z - Z_i}{h}\right),$$
where
$$w(u) = \begin{cases} 1/2, & \text{if } -1 \le u < 1, \\ 0, & \text{otherwise}, \end{cases} \tag{3}$$
is a symmetric bounded non-negative function that integrates to one and corresponds to the density of a uniform distribution on the interval $[-1, 1]$. This estimate may be viewed as the average of $n$ uniform densities with common variance proportional to $h^2$, each centered about one of the sample observations.
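The following sketch (not from the notes; data, bandwidths and function names are ours) writes the estimate directly as an average of rescaled kernels, with either the uniform kernel in (3) or a Gaussian kernel.

```python
# Illustrative sketch: a kernel density estimate f_hat(z) = (nh)^{-1} sum_i K((z - Z_i)/h).
import numpy as np

def kde(z_eval, data, h, kernel="gaussian"):
    u = (np.asarray(z_eval)[:, None] - data[None, :]) / h
    if kernel == "uniform":
        k = 0.5 * ((u >= -1) & (u < 1))                   # w(u) in (3)
    else:
        k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)      # standard Gaussian density
    return k.mean(axis=1) / h

rng = np.random.default_rng(2)
comp = rng.random(200) < 0.6
data = np.where(comp, rng.normal(0, 1, 200), rng.normal(4, 2, 200))
grid = np.linspace(-4, 10, 200)
f_unif = kde(grid, data, h=0.5, kernel="uniform")
f_gauss = kde(grid, data, h=0.5)
```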
16
Remarks
Notice that
$$\int \hat f(z)\,dz = \frac{1}{nh}\sum_{i=1}^{n}\int K\!\left(\frac{z - Z_i}{h}\right)dz = \frac{1}{n}\sum_{i=1}^{n}\int K(u)\,du = 1,$$
so $\hat f$ is a proper density whenever the kernel $K$ integrates to one.
The fact that $h$ is independent of the point where the density is estimated is a nuisance, for it tends to produce spurious effects in regions where data are sparse. If a large enough $h$ is chosen to eliminate this phenomenon, we may end up losing important detail in regions where the data are more concentrated.
17
Figure 3: Structure of a smooth kernel density estimate.
18
Example 5 If $K = \phi$, where $\phi$ denotes the density of the $\mathcal{N}(0, 1)$ distribution, then the kernel estimate of $f$ is continuous and has continuous derivatives of every order.
Unlike the uniform kernel (3), which takes a constant positive value in the interval $[Z_i - h, Z_i + h)$ and vanishes outside this interval, the Gaussian kernel is always positive, assumes its maximum when $z = Z_i$ and tends to zero as $|z - Z_i| \to \infty$.
Hence, while the uniform kernel estimate of $f(z)$ is based only on the observations that are within $h$ distance from the evaluation point $z$ and assigns them a constant weight, the Gaussian kernel estimate is based on all the observations but assigns them a weight that declines exponentially as their distance from the evaluation point increases. $\Box$
19
Figure 4: Gaussian kernel density estimates for different bandwidths. The data are 200 observations from the Gaussian mixture .6 N(0, 1) + .4 N(4, 4).
[Figure: panels for bwidth = .25, .5, and 1.5, each showing the true and estimated density against z.]
20
Extensions
Let Z1 , . . . , Zn be a sample from the distribution of Z. If f is a kernel estimate
of the density f of Z, then one may easily estimate other aspects of the
distribution of Z.
For example, since a kernel estimate of the df is $\hat F(z) = n^{-1}\sum_{i=1}^{n} \bar K\!\left(\frac{z - Z_i}{h}\right)$, where
$$\bar K(u) = \int_{-\infty}^{u} K(v)\,dv,$$
a kernel estimate of the hazard function is
$$\hat\lambda(z) = \frac{\hat f(z)}{1 - \hat F(z)} = \frac{\sum_{i=1}^{n} K\!\left(\frac{z - Z_i}{h}\right)}{h\sum_{i=1}^{n}\left[1 - \bar K\!\left(\frac{z - Z_i}{h}\right)\right]}. \qquad \Box$$
21
1.3 Statistical properties of the kernel method
In evaluating the statistical accuracy of a kernel density estimator $\hat f$, it is important to distinguish between its local and global properties:
the first refer to the accuracy of $\hat f(z)$ as an estimator of $f(z)$, the value of $f$ at a given point $z$;
the second refer to the accuracy of $\hat f$ as an estimator of the whole function $f$.
22
Local properties
The kernel density estimator of $f$ at a point $z$ may be written as the sample average
$$\hat f_h(z) = n^{-1}\sum_{i=1}^{n} K_i(z),$$
where
$$K_i(z) = \frac{1}{h}\,K\!\left(\frac{z - Z_i}{h}\right)$$
is a nonlinear transformation of $Z_i$. A natural measure of local accuracy of $\hat f_h(z)$ is therefore its mean squared error (MSE)
$$\mathrm{MSE}[\hat f_h(z)] = \mathrm{E}[\hat f_h(z) - f(z)]^2 = \operatorname{Var}\hat f_h(z) + [\operatorname{Bias}\hat f_h(z)]^2.$$
If the data $Z_1, \ldots, Z_n$ are a sample from the distribution of a rv $Z$, then we have
$$\operatorname{Bias}\hat f_h(z) = \frac{1}{h}\,\mathrm{E}\,K\!\left(\frac{z - Z}{h}\right) - f(z), \qquad \operatorname{Var}\hat f_h(z) = \frac{1}{nh^2}\operatorname{Var} K\!\left(\frac{z - Z}{h}\right). \tag{4}$$
Thus, the estimator $\hat f_h(z)$ is biased for $f(z)$ in general. For $h$ fixed, its bias does not depend on the sample size $n$, whereas its variance tends to zero as $n$ increases.
23
Bias
After a change of variable from $x$ to $u = (z - x)/h$, we have
$$\operatorname{Bias}\hat f_h(z) = \int \frac{1}{h}\,K\!\left(\frac{z - x}{h}\right) f(x)\,dx - f(z) = \int K(u)\,[f(z - hu) - f(z)]\,du.$$
Thus, for $h$ sufficiently small, the bias of $\hat f_h(z)$ is approximately $\frac{1}{2} m_2 h^2 f''(z)$, so it is $O(h^2)$ and depends on the bandwidth $h$, the spread $m_2 = \int u^2 K(u)\,du$ of the kernel, and the curvature $f''(z)$ of the density at $z$.
In particular:
$\operatorname{Bias}\hat f_h(z) \approx 0$ when $f''(z) = 0$, that is, when the density is linear in a neighborhood of $z$.
$\operatorname{Bias}\hat f_h(z) \to 0$ as $h \to 0$.
If the bandwidth $h$ decreases with the sample size $n$, then the bias of $\hat f_h(z)$ vanishes as $n \to \infty$.
24
Variance
First notice that
$$\operatorname{Var}\hat f_h(z) = \frac{1}{nh^2}\operatorname{Var} K\!\left(\frac{z - Z}{h}\right) = \frac{1}{nh^2}\,\mathrm{E}\,K\!\left(\frac{z - Z}{h}\right)^{2} - \frac{1}{n}\left[\frac{1}{h}\,\mathrm{E}\,K\!\left(\frac{z - Z}{h}\right)\right]^{2}.$$
Using (4) and the fact that $\operatorname{Bias}\hat f_h(z) = O(h^2)$, we have
$$\mathrm{E}\,K\!\left(\frac{z - Z}{h}\right) = h\,[f(z) + \operatorname{Bias}\hat f_h(z)] = h\,[f(z) + O(h^2)].$$
Separating the term of order $O((nh)^{-1})$ from that of order $O(n^{-1})$ we have that, for $h$ sufficiently small,
$$\operatorname{Var}\hat f_h(z) \approx \frac{1}{nh}\,f(z)\int K(u)^2\,du + O(n^{-1}).$$
25
Remarks
If the sample size is fixed, increasing the bandwidth reduces the variance of $\hat f_h(z)$ but, from (11), it also increases its bias.
If the sample size increases and smaller bandwidths are chosen for larger $n$, then both the bias and the variance of $\hat f_h(z)$ may be reduced.
For $\hat f_h(z) \xrightarrow{p} f(z)$ we need both $h \to 0$ and $nh \to \infty$ as $n \to \infty$.
26
Global properties
Common measures of distance between two functions f and g are:
the $L_1$ distance
$$d_1(f, g) = \int |f(z) - g(z)|\,dz,$$
the $L_2$ distance
$$d_2(f, g) = \left[\int [f(z) - g(z)]^2\,dz\right]^{1/2}.$$
27
Optimal choice of bandwidth
For a fixed sample size $n$ and a fixed kernel function $K$, an optimal bandwidth may be chosen by minimizing MISE($\hat f_h$) with respect to $h$. As an approximation to this problem, consider minimizing the right-hand side of (7) with respect to $h$,
$$\min_{h > 0}\; \frac{1}{nh}\int K(u)^2\,du + \frac{1}{4}\,m_2^2\,h^4\,R(f''),$$
whose solution is
$$h^* = \left[\frac{\int K(u)^2\,du}{m_2^2\,R(f'')\,n}\right]^{1/5}.$$
Remarks:
$h^*$ converges to zero as $n \to \infty$, but at the rather slow rate of $n^{-1/5}$.
Smaller values of $h$ are appropriate for densities that are more wiggly, that is, such that $R(f'') = \int f''(z)^2\,dz$ is high, or kernels that are more spread out, that is, such that $m_2$ is high.
Example 8 If $f$ is a $\mathcal{N}(\mu, \sigma^2)$ density and $K$ is a standard Gaussian kernel, then
$$R(f'') = \int f''(z)^2\,dz = \frac{3}{8\sqrt{\pi}\,\sigma^5}$$
and
$$\int K(u)^2\,du = \frac{1}{2\sqrt{\pi}}.$$
The optimal bandwidth in this case is
$$h^* = \left[\frac{1}{2\sqrt{\pi}}\cdot\frac{8\sqrt{\pi}\,\sigma^5}{3}\right]^{1/5} n^{-1/5} = \left(\frac{4}{3}\right)^{1/5}\sigma\,n^{-1/5} \approx 1.059\,\sigma\,n^{-1/5}. \qquad \Box$$
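A minimal sketch of the rule-of-thumb bandwidth from Example 8 follows (not from the notes; the data and seed are arbitrary).

```python
# Illustrative sketch: the Gaussian rule-of-thumb bandwidth h* = 1.059 sigma n^(-1/5).
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(loc=1.0, scale=2.0, size=500)   # hypothetical sample

n = data.size
sigma_hat = data.std(ddof=1)                      # sample standard deviation
h_star = (4.0 / 3.0) ** 0.2 * sigma_hat * n ** (-0.2)
print(f"rule-of-thumb bandwidth: {h_star:.3f}")
```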
28
Optimal choice of kernel function
Substituting the optimal bandwidth $h^*$ into (7) gives
$$\mathrm{MISE}(\hat f_{h^*}) \approx \frac{5}{4}\,C(K)\,R(f'')^{1/5}\,n^{-4/5},$$
where
$$C(K) = m_2^{2/5}\left[\int K(u)^2\,du\right]^{4/5}.$$
Given the optimal bandwidth, the MISE depends on the choice of kernel func-
tion only through the term C(K). Thus, an optimal kernel may be obtained
by minimizing C(K) with respect to K.
29
The Epanechnikov kernel
If we confine attention to kernels that correspond to densities of distributions with mean zero and unit variance, then an optimal kernel $K^*$ may be obtained by minimizing
$$C(K) = \left[\int K(u)^2\,du\right]^{4/5}.$$
Solving this problem (see Pagan & Ullah 1999 for details), the optimal kernel is
$$K^*(u) = \frac{3}{4}\,(1 - u^2)\,1\{|u| \le 1\},$$
called the Epanechnikov kernel.
The efficiency loss from using suboptimal kernels, however, is modest. For example, using the uniform kernel $w$ only implies an efficiency loss of about 7% with respect to $K^*$. Thus, it is perfectly legitimate, and indeed desirable, to base the choice of kernels on other considerations, for example the degree of differentiability or the computational effort involved (Silverman 1986, p. 43).
It turns out that the crucial choice is not what kernel to use, but rather how much to smooth. This choice partly depends on the purpose of the analysis.
30
Figure 5: Uniform, Gaussian and Epanechnikov kernels.
[Figure: the three kernels plotted against u on [-2, 2].]
31
Automatic bandwidth selection
It is often convenient to be able to rely on some procedure for choosing the
bandwidth automatically rather than subjectively. This is especially impor-
tant when a density estimate is used as an input to other statistical proce-
dures.
A third approach starts from the observation that the MISE of a kernel density estimator $\hat f_h$, based on a given kernel $K$, may be decomposed as
$$\mathrm{MISE}(\hat f_h) = \mathrm{E}\int\left[\hat f_h(z)^2 + f(z)^2 - 2\hat f_h(z) f(z)\right]dz = M(h) + \int f(z)^2\,dz,$$
where
$$M(h) = \mathrm{E}\int \hat f_h(z)^2\,dz - 2\,\mathrm{E}\int \hat f_h(z) f(z)\,dz.$$
32
Cross-validation
To construct an unbiased estimator of $M(h)$, notice first that $\int \hat f_h(z)^2\,dz$ is unbiased for $\mathrm{E}\int \hat f_h(z)^2\,dz$.
Consider next the kernel estimator of $f$ obtained by excluding the $i$th observation,
$$\hat f_{(i)}(z) = \frac{1}{(n - 1)h}\sum_{j \ne i} K\!\left(\frac{z - Z_j}{h}\right), \qquad i = 1, \ldots, n. \tag{9}$$
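The sketch below (not from the notes) selects a bandwidth by least-squares cross-validation; the criterion used, $\widehat M(h) = \int \hat f_h(z)^2\,dz - \tfrac{2}{n}\sum_i \hat f_{(i)}(Z_i)$, is the standard unbiased estimate of $M(h)$ built from the leave-one-out estimates (9), and the brute-force grid search is our simplification.

```python
# Illustrative sketch: least-squares cross-validation for a Gaussian kernel estimator.
import numpy as np
from scipy.integrate import quad

def f_hat(z, data, h):
    u = (z - data) / h
    return np.mean(np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)) / h

def loo_f_hat(i, data, h):
    others = np.delete(data, i)
    u = (data[i] - others) / h
    return np.mean(np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)) / h

def cv_criterion(h, data):
    sq_int, _ = quad(lambda z: f_hat(z, data, h) ** 2,
                     data.min() - 5 * h, data.max() + 5 * h)
    loo = np.mean([loo_f_hat(i, data, h) for i in range(data.size)])
    return sq_int - 2 * loo

rng = np.random.default_rng(4)
data = rng.normal(size=100)
grid = np.linspace(0.1, 1.0, 19)
h_cv = grid[np.argmin([cv_criterion(h, data) for h in grid])]
print(f"cross-validated bandwidth: {h_cv:.2f}")
```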
33
Asymptotic justication for cross-validation
Let
$$I^*(Z_1, \ldots, Z_n) = \min_h \int [\hat f_h(z) - f(z)]^2\,dz$$
denote the integrated squared error of the density estimate obtained choosing $h$ optimally for the given sample, and let
$$I_{CV}(Z_1, \ldots, Z_n) = \int [\hat f_{\hat h}(z) - f(z)]^2\,dz$$
denote the integrated squared error obtained using the bandwidth $\hat h$ that minimizes the cross-validation criterion $\widehat M(h)$.
Stone (1984) showed that, if $f$ is bounded and some other mild regularity conditions hold, then
$$\Pr\left\{\lim_{n \to \infty}\frac{I_{CV}(Z_1, \ldots, Z_n)}{I^*(Z_1, \ldots, Z_n)} = 1\right\} = 1.$$
34
Asymptotic properties of kernel density estimators
Let $\hat f_n$ denote a kernel density estimate corresponding to a random sample $Z_1, \ldots, Z_n$ from a distribution with density $f$ and a data-dependent bandwidth $h_n$. We now provide sufficient conditions for the sequence $\{\hat f_n(z)\}$ to be consistent for $f(z)$ and asymptotically normal, where $z$ is any point in the support of $f$.
We again rely on the fact that, under our iid assumption, $\hat f_n(z)$ may be represented as the sample average of $n$ iid rvs,
$$\hat f_n(z) = n^{-1}\sum_{i=1}^{n} K_{in}(z), \qquad K_{in}(z) = \frac{1}{h_n}\,K\!\left(\frac{z - Z_i}{h_n}\right).$$
Results are easily generalized to the case when, instead of a single point of evaluation $z$, we are interested in a fixed set of points $z_1, \ldots, z_J$.
35
Consistency
Convergence in probability follows immediately from the fact that if the kernel $K$ is the density of a distribution with zero mean and finite positive variance $m_2 = \int u^2 K(u)\,du$, then, from our previous results,
$$\mathrm{E}\,K_{in}(z) = f(z) + \frac{1}{2}\,m_2\,h_n^2\,f''(z) + O(h_n^3) \tag{11}$$
and
$$\operatorname{Var} K_{in}(z) = \frac{1}{h_n}\,f(z)\int K(u)^2\,du + O(1). \tag{12}$$
Theorem 1 Suppose that the sequence $\{h_n\}$ of bandwidths and the kernel $K: \mathbb{R} \to \mathbb{R}$ satisfy:
(i) $h_n \to 0$;
(ii) $n h_n \to \infty$;
(iii) $\int K(u)\,du = 1$, $\int u K(u)\,du = 0$, and $m_2 = \int u^2 K(u)\,du$ is finite and positive.
Then $\hat f_n(z) \xrightarrow{p} f(z)$ for every $z$ in the support of $f$.
36
Asymptotic normality
The next theorem could in principle be used to construct approximate symmetric confidence intervals for $f(z)$ based on the normal distribution.
Theorem 2 In addition to the assumptions of Theorem 1, suppose that $(n h_n^5)^{1/2} \to \lambda$, where $0 \le \lambda < \infty$. Then
$$\sqrt{n h_n}\,[\hat f_n(z) - f(z)] \xrightarrow{d} \mathcal{N}\!\left(\frac{\lambda}{2}\,m_2\,f''(z),\; f(z)\int K(u)^2\,du\right)$$
for every $z$ in the support of $f$. Further, $\sqrt{n h_n}\,[\hat f_n(z) - f(z)]$ and $\sqrt{n h_n}\,[\hat f_n(z') - f(z')]$ are asymptotically independent for $z \ne z'$.
37
Remarks:
38
Figure 6: Rates of convergence.
39
1.4 Other methods for univariate density estimation
The nearest neighbor method
One of the problems with the kernel method is the fact that the bandwidth
is independent of the point at which the density is evaluated. This tends to
produce too much smoothing in some regions of the data and too little in
others.
For any point of evaluation $z$, let $d_1(z) \le d_2(z) \le \cdots \le d_n(z)$ be the distances (arranged in increasing order) between $z$ and each of the $n$ data points. The $k$th nearest neighbor estimate of $f(z)$ is defined as
$$\hat f(z) = \frac{k}{2n\,d_k(z)}, \qquad k < n. \tag{13}$$
The motivation for this estimate is the fact that, if $h$ is sufficiently small, then we would expect a fraction of the observations approximately equal to $2hf(z)$ to fall in a small interval $[z - h, z + h]$ around $z$. Since the interval $[z - d_k(z), z + d_k(z)]$ contains by definition exactly $k$ observations, we have
$$\frac{k}{n} \approx 2\,d_k(z)\,f(z).$$
Solving for $f(z)$ gives the density estimate (13).
Unlike the uniform kernel estimator, which is based on the number of observations falling in a box of fixed width centered at the point $z$, the nearest neighbor estimator is inversely proportional to the width of the box needed to contain exactly the $k$ observations nearest to $z$. The smaller is the density of the data around $z$, the larger is this width.
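A minimal sketch of (13) follows (not from the notes; the data and the choice of k are arbitrary).

```python
# Illustrative sketch: the k-th nearest neighbor density estimate f_hat(z) = k / (2 n d_k(z)).
import numpy as np

def knn_density(z_eval, data, k):
    z_eval = np.atleast_1d(z_eval)
    d = np.abs(z_eval[:, None] - data[None, :])    # distances to all observations
    d_k = np.sort(d, axis=1)[:, k - 1]             # distance to the k-th nearest point
    return k / (2 * data.size * d_k)

rng = np.random.default_rng(5)
data = rng.normal(size=300)
print(knn_density([0.0, 2.0], data, k=20))
```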
40
Problems with the nearest neighbor method
$\hat f$ does not integrate to one and so it is not a proper density.
41
Nonparametric ML
The log-likelihood of a sample $Z_1, \ldots, Z_n$ from a distribution with strictly positive density $f_0$ is defined as
$$L(f_0) = c + \sum_{i=1}^{n} \ln f_0(Z_i).$$
A nonparametric ML estimator maximizes $L$ over a class $\mathcal{F}$ of densities,
$$\hat f = \arg\max_{f \in \mathcal{F}} L(f),$$
whereas a parametric ML estimator is of the form $\hat f(z) = f(z, \hat\theta)$, where $\hat\theta = \arg\max_\theta L(\theta)$ and $L(\theta) = c + \sum_{i=1}^{n} \ln f(Z_i, \theta)$.
Without restrictions on $\mathcal{F}$, the nonparametric ML problem has no meaningful solution: the "maximizer" is $\hat f(z) = n^{-1}\sum_{i=1}^{n}\delta(z - Z_i)$, where
$$\delta(u) = \begin{cases} \infty, & \text{if } u = 0, \\ 0, & \text{otherwise}, \end{cases}$$
is the Dirac delta function. Thus, $\hat f$ is a function equal to infinity at each of the sample points and equal to zero otherwise (why is this unsurprising?).
42
Maximum penalized likelihood
Being infinitely irregular, $\hat f$ provides an unsatisfactory solution to the problem of estimating $f_0$. If one does not want to make parametric assumptions, an alternative is to introduce a penalty for lack of smoothness and then maximize the log likelihood function subject to this constraint. A natural penalty is the roughness measure $R(f) = \int f''(z)^2\,dz$, leading to the penalized log-likelihood
$$L_\alpha(f) = L(f) - \alpha R(f),$$
where $\alpha > 0$ represents the trade-off between smoothness and fidelity to the data. A maximum penalized likelihood density estimator $\hat f_\alpha$ maximizes $L_\alpha$ over the class of densities with continuous 1st derivative and square integrable 2nd derivative. The smaller is $\alpha$, the rougher in terms of $R(\hat f_\alpha)$ is the maximum penalized likelihood estimator.
A maximum penalized likelihood estimator therefore balances two conflicting goals:
maximizing fidelity to the data, represented here by the term $\sum_i \ln f(Z_i)$,
avoiding densities that are too wiggly, represented here by the term $R(f)$.
43
1.5 Multivariate density estimators
Let Z1 , . . . , Zn be a sample from the distribution of a random m-vector Z with
density function f (z) = f (z1 , . . . , zm ). How can we estimate f nonparamet-
rically?
A simple choice of multivariate kernel is the product kernel
$$K_m(u_1, \ldots, u_m) = \prod_{j=1}^{m} K(u_j),$$
where $K$ is a univariate kernel. This type of estimator is a special case of (14), obtained by using the same bandwidth $h$ for each component of $Z$. It may be appropriate when the data have been previously rescaled using some preliminary estimate of scale.
Example 10 In the case of a product kernel, (15) becomes
$$\hat f(z) = \frac{1}{n h^m}\sum_{i=1}^{n}\prod_{j=1}^{m} K\!\left(\frac{z_j - Z_{ij}}{h}\right). \qquad \Box$$
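The sketch below (not from the notes; data, bandwidth and function names are ours) evaluates the product-kernel estimator of Example 10 with a Gaussian univariate kernel.

```python
# Illustrative sketch: multivariate KDE with a Gaussian product kernel and common bandwidth h.
import numpy as np

def product_kernel_kde(z_eval, data, h):
    """f_hat(z) = (n h^m)^{-1} sum_i prod_j K((z_j - Z_ij)/h)."""
    n, m = data.shape
    u = (np.atleast_2d(z_eval)[:, None, :] - data[None, :, :]) / h   # (points, n, m)
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)                     # univariate kernels
    return k.prod(axis=2).sum(axis=1) / (n * h**m)

rng = np.random.default_rng(6)
data = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=400)
print(product_kernel_kde([[0.0, 0.0], [1.0, 1.0]], data, h=0.4))
```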
44
Asymptotic properties
Theorem 3 Let $\{Z_i\}$ be a sequence of iid continuous random m-vectors with a twice continuously differentiable density $f$, and suppose that the sequence $\{h_n\}$ of bandwidths and the multivariate kernel $K_m: \mathbb{R}^m \to \mathbb{R}$ satisfy:
(i) $h_n \to 0$;
(ii) $n h_n^m \to \infty$;
(iii) $\int K_m(u)\,du = 1$, $\int u K_m(u)\,du = 0$ and $\int u u^\top K_m(u)\,du = M_2$, a finite $m \times m$ matrix.
Then $\hat f_n(z) \xrightarrow{p} f(z)$ for every $z$ in the support of $f$.
Theorem 4 In addition to the assumptions of Theorem 3, suppose that $h_n^2\,(n h_n^m)^{1/2} \to \lambda$, where $0 \le \lambda < \infty$. Then
$$(n h_n^m)^{1/2}\,[\hat f_n(z) - f(z)] \xrightarrow{d} \mathcal{N}\!\left(\lambda\,b(z),\; f(z)\int K_m(u)^2\,du\right)$$
for every $z$ in the support of $f$, where $b(z) = \frac{1}{2}\operatorname{tr}[\nabla^2 f(z)\,M_2]$. Further, $(n h_n^m)^{1/2}[\hat f_n(z) - f(z)]$ and $(n h_n^m)^{1/2}[\hat f_n(z') - f(z')]$ are asymptotically independent for $z \ne z'$.
Remarks:
The speed of convergence of $\hat f_n(z)$ to its asymptotically normal distribution is inversely related to the dimension $m$ of $Z$ (a manifestation of the curse-of-dimensionality problem). When $m = 1$, we have the results in Theorem 2.
When $\lambda > 0$, the additional assumption in Theorem 4 implies that $h_n = O(n^{-1/(m+4)})$. In this case, putting $h_n = c\,n^{-1/(m+4)}$, with $c > 0$, gives $\lambda = c^{(m+4)/2}$ and $(n h_n^m)^{1/2} = c^{m/2}\,n^{2/(m+4)}$, and therefore
$$n^{2/(m+4)}\,[\hat f_n(z) - f(z)] \xrightarrow{d} \mathcal{N}\!\left(c^2\,b(z),\; \frac{f(z)}{c^m}\int K_m(u)^2\,du\right).$$
45
The curse-of-dimensionality problem
Although conceptually straightforward, extending univariate nonparametric
methods to multivariate settings is problematic for at least two reasons:
While contour plots are enough for two dimensions, how to represent
the results of a nonparametric analysis involving three or more vari-
ables is a neglected but practically important problem.
46
Example 11 Consider a multivariate histogram constructed for a sample
from the distribution of a random m-vector Z with a uniform distribution
on the m-dimensional unit hypercube, that is, whose components are iid as
U(0, 1).
The table below shows some calculations for m = 1, . . . , 5 and h = .10 and
h = .05.
Number m of variables in Z
1 2 3 4 5
h = .10
number of cells 10 100 1,000 10,000 100,000
n 300 3,000 30,000 300,000 3,000,000
h = .05
number of cells 20 400 8,000 160,000 3,200,000
n 600 12,000 240,000 4,800,000 96,000,000
47
Projection pursuit
One method for nonparametric estimation of multivariate densities is projec-
tion pursuit (PP), introduced by Friedman, Stuetzle and Schroeder (1984).
This method assumes that the density of a random m-vector $Z$ may well be approximated by a density of the form
$$f(z) = f_0(z)\prod_{j=1}^{J} f_j(\alpha_j^\top z), \tag{16}$$
where $f_0$ is a known m-variate density, $\alpha_j$ is a vector with unit norm, $\alpha_j^\top z = \sum_{h=1}^{m}\alpha_{jh} z_h$ is a linear combination or projection of the variables in $Z$, and $f_1, \ldots, f_J$ are smooth univariate functions, called ridge functions.
The choice of $f_0$ is left to the investigator and should reflect prior information available about the problem.
The estimation algorithm determines the number $J$ of terms and the vectors $\alpha_j$ in (16), and constructs nonparametric estimates of the ridge functions $f_1, \ldots, f_J$ by minimizing a suitably chosen index of goodness-of-fit, or projection index. The curse-of-dimensionality problem is bypassed by using linear projections and nonparametric estimates of univariate ridge functions.
48
Semi-nonparametric methods
Gallant and Nychka (1987) proposed to approximate the m-variate density $f(z_1, \ldots, z_m)$ by a Hermite polynomial expansion.
In the bivariate case ($m = 2$), their approximation is
$$f(z_1, z_2) = \frac{1}{\psi_K}\,\tau_K(z_1, z_2)^2\,\phi(z_1)\,\phi(z_2),$$
where
$$\tau_K(z_1, z_2) = \sum_{h,k=0}^{K}\theta_{hk}\,z_1^h\,z_2^k$$
is a polynomial of order $K$ in $z_1$ and $z_2$, and
$$\psi_K = \int \tau_K(z_1, z_2)^2\,\phi(z_1)\,\phi(z_2)\,dz_1\,dz_2$$
is a normalization constant.
49
1.6 Stata commands
We now briefly review the commands for histogram and kernel density estimation available in Stata, version 12.
These include the histogram and the kdensity commands. Both commands
estimate univariate densities. The package akdensity in van Kerm (2012)
allows estimating both the density and the distribution function by the kernel
method.
Recently, some articles in the Stata Journal have been devoted to the use
of nonparametric or semi-nonparametric methods for density estimation in a
variety of statistical problems. Examples include De Luca (2008) and De Luca
and Perotti (2011).
50
The histogram command
The basic syntax is:
histogram varname [if] [in] [weight] [, [continuous opts | discrete opts]
options]
where:
varname is the name of a continuous variable, unless the discrete option is specified,
continuous opts includes:
bin(#): sets the number of bins to #,
width(#): sets the width of bins to #,
start(#): sets the lower limit of the first bin to # (the default is the observed minimum value of varname).
bin() and width() are alternatives. If neither is specified, results are the same as if bin(k) had been specified, with
$$k = \min\{\sqrt{n},\; 10\,\ln n / \ln 10\},$$
where n is the (weighted) number of observations.
discrete opts includes:
discrete: specifies that the data are discrete and you want each unique value of varname to have its own bin (bar of histogram),
options includes:
density: draws as density (the default),
fraction: draws as fractions,
frequency: draws as frequencies,
percent: draws as percentages,
addlabels: adds heights label to bars,
normal: adds a normal density to the graph,
kdensity: adds a kernel density estimate to the graph.
51
The kdensity command
The basic syntax is:
kdensity varname [if] [in] [weight] [, options]
where options includes:
at(var x): estimates the density using the values specified by var x.
52
2 Nonparametric Regression Estimators
We now assume that the data $(X_1, Y_1), \ldots, (X_n, Y_n)$ are a sample from the joint distribution of $(X, Y)$, for which the conditional mean function (CMF)
$$\mu(x) = E(Y \mid X = x)$$
is well defined.
53
Linear nonparametric regression estimators
We restrict attention to the class of nonparametric estimators of $\mu(x)$ that are linear, that is, of the form
$$\hat\mu(x) = \sum_{j=1}^{n} S_j(x)\,Y_j, \tag{17}$$
where the weight $S_j(x)$ assigned to $Y_j$ depends only on the $X_i$'s and the evaluation point $x$, not on the $Y_i$'s.
54
Regression smoothers
It is useful to represent a linear nonparametric regression estimator as a linear regression smoother.
Let $\mathbf{Y}$ be the n-vector of observations on the outcome variable and let $\mathbf{X}$ be the matrix of $n$ observations on the $k$ covariates. A regression smoother is a way of using $\mathbf{Y}$ and $\mathbf{X}$ to produce a vector $\hat\mu = (\hat\mu_1, \ldots, \hat\mu_n)$ of fitted values that is less variable than $\mathbf{Y}$ itself.
A regression smoother is linear if it can be represented as
$$\hat\mu = S\,\mathbf{Y},$$
where $S = [s_{ij}]$ is an $n \times n$ smoother matrix that depends on $\mathbf{X}$ but not on $\mathbf{Y}$. Thus, the class of linear regression smoothers coincides with the class of linear predictors of $\mathbf{Y}$.
The $i$th element of a linear regression smoother is a weighted average
$$\hat\mu_i = \sum_{j=1}^{n} s_{ij}\,Y_j$$
of the elements of $\mathbf{Y}$, and $s_{ij}$ is the weight assigned to the $j$th element of $\mathbf{Y}$ in the construction of $\hat\mu_i$. Linear nonparametric regression estimates are linear smoothers with $\hat\mu_i = \hat\mu(X_i)$ and $s_{ij} = S_j(X_i)$.
Example 14 A parametric example of linear smoother is the vector of OLS fitted values $\hat\mu = \mathbf{X}\hat\beta = S\mathbf{Y}$, where $\hat\beta = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y}$ and the smoother matrix $S = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$ is symmetric and idempotent. $\Box$
A smoother matrix is said to preserve the constant if $S\,\iota_n = \iota_n$, where $\iota_n$ is the n-vector of ones; in this case constant vectors, such as $\bar Y\,\iota_n$, are left unchanged by the smoother, where $\bar Y$ is the sample mean of $Y$. For example, the OLS smoother matrix preserves the constant if the design matrix $\mathbf{X}$ contains a column of ones.
We now present other examples of linear regression smoothers. For ease of presentation, we assume that $X$ is a scalar rv.
55
2.1 Polynomial regressions
A polynomial regression represents a parsimonious and relatively flexible way of approximating an unspecified CMF.
is a (k + 1)-degree polynomial.
However, if the CMF is very irregular, even on small regions of the approxi-
mation range, then a polynomial approximation tends to be poor everywhere
(Figure 7).
56
Figure 7: Polynomial regression estimates. The sample consists of 200 observations from a Gaussian model with $\mu(x) = 1 - x + \exp[-50(x - .5)^2]$.
[Figure: panels "Data", "Linear", "Quadratic", "Cubic", plotting mu(x) against x.]
57
2.2 Regression splines
One way of avoiding global dependence on local properties of the function (x)
is to consider piecewise polynomial functions.
58
Linear splines
Select $J$ distinct points or knots, $c_1 < \cdots < c_J$, on the support of $X$. These points define a partition of $\mathbb{R}$ into $J + 1$ intervals. To simplify the notation, also define two boundary knots $c_0 < c_1$ and $c_{J+1} > c_J$.
A linear spline is a continuous piecewise linear function of the form
$$m(x) = \alpha_j + \beta_j x, \qquad c_{j-1} < x \le c_j, \quad j = 1, \ldots, J + 1.$$
For $m$ to be continuous at the first knot $c_1$, the model parameters must satisfy
$$\alpha_1 + \beta_1 c_1 = \alpha_2 + \beta_2 c_1,$$
which implies that $\alpha_2 = \alpha_1 + (\beta_1 - \beta_2)c_1$. On the interval $(c_0, c_2]$, we therefore have
$$m(x) = \begin{cases}\alpha_1 + \beta_1 x, & \text{if } c_0 < x \le c_1, \\ \alpha_1 + (\beta_1 - \beta_2)c_1 + \beta_2 x, & \text{if } c_1 < x \le c_2.\end{cases}$$
A more compact representation of $m$ on $(c_0, c_2]$ is
$$m(x) = \alpha + \beta x + \gamma_1 (x - c_1)_+, \qquad c_0 < x \le c_2,$$
where $\alpha = \alpha_1$, $\beta = \beta_1$, $\gamma_1 = \beta_2 - \beta_1$ and $(x - c_1)_+ = \max(0, x - c_1)$.
Repeating this argument for all other knots, a linear spline may be represented as
$$m(x) = \alpha + \beta x + \sum_{j=1}^{J}\gamma_j (x - c_j)_+,$$
where $\gamma_j = \beta_{j+1} - \beta_j$.
Remarks
The number of free parameters in the model is only $J + 2$, less than the number $2(J + 1)$ of parameters of an unrestricted piecewise linear function. The difference $2(J + 1) - (J + 2) = J$ is equal to the number of constraints that must be imposed to ensure continuity.
Although continuous, a linear spline is not differentiable, for its derivative is a step function with jumps at $c_1, \ldots, c_J$.
This problem may be avoided by considering smooth higher-order (quadratic, cubic, etc.) piecewise polynomial functions.
59
Cubic splines
A cubic spline is a twice continuously differentiable piecewise cubic function. Given $J$ distinct knots $c_1 < \cdots < c_J$ in the support of $X$, it possesses the parametric representation
$$m(x) = \alpha + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \sum_{j=1}^{J}\gamma_j (x - c_j)_+^3. \tag{18}$$
Remarks:
A natural cubic spline is a cubic spline that is forced to be linear outside the boundary knots $c_0$ and $c_{J+1}$. The number of free parameters in this case is just $J + 2$.
The representation (18) lends itself directly to estimation by OLS (see the sketch after these remarks).
In general, given the sequence of knots and the degree of the polynomial, a regression spline may be estimated by an OLS regression of $Y$ on an appropriate set of vectors that represent a basis for the selected family of piecewise polynomial functions evaluated at the sample values of $X$.
Regression splines estimated by OLS give linear regression smoothers defined by symmetric idempotent smoother matrices. The number and position of the knots control the flexibility and smoothness of the approximation.
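A minimal sketch (not from the notes; the knot locations and the simulated design, which mimics Figure 7, are our choices) fits a cubic regression spline by OLS using the truncated-power basis in (18).

```python
# Illustrative sketch: cubic regression spline estimated by OLS.
import numpy as np

def cubic_spline_basis(x, knots):
    """Design matrix [1, x, x^2, x^3, (x-c_1)_+^3, ..., (x-c_J)_+^3]."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - c, 0, None) ** 3 for c in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(7)
x = rng.uniform(0, 1, 200)
y = 1 - x + np.exp(-50 * (x - 0.5) ** 2) + rng.normal(0, 0.1, 200)

knots = [0.25, 0.5, 0.75]                       # hypothetical knot choice
B = cubic_spline_basis(x, knots)
beta, *_ = np.linalg.lstsq(B, y, rcond=None)    # OLS coefficients
fitted = B @ beta
```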
60
Figure 8: Splines (blue) vs. polynomial (black) regression estimates. The sample consists of 200 observations from a Gaussian model with $\mu(x) = 1 - x + \exp[-50(x - .5)^2]$.
[Figure: panels "Data", "Linear", "Quadratic", "Cubic", plotting mu(x) against x.]
61
2.3 The kernel method
Recall that the CMF may be written
$$\mu(x) = \int y\,\frac{f(x, y)}{f_X(x)}\,dy = \frac{c(x)}{f_X(x)},$$
where $f_X(x) = \int f(x, y)\,dy$ is the marginal density of $X$ and
$$c(x) = \int y\,f(x, y)\,dy.$$
Thus, given nonparametric estimates $\hat c(x)$ and $\hat f_X(x)$ of $c(x)$ and $f_X(x)$, a nonparametric estimate of $\mu(x)$ is
$$\hat\mu(x) = \frac{\hat c(x)}{\hat f_X(x)}. \tag{19}$$
62
The Nadaraya-Watson estimator
Consider the bivariate kernel density estimate
$$\hat f(x, y) = \frac{1}{n h_X h_Y}\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h_X}, \frac{y - Y_i}{h_Y}\right).$$
With a product kernel, integrating $y\,\hat f(x, y)$ and dividing by $\hat f_X(x)$ as in (19) leads to the Nadaraya-Watson estimator
$$\hat\mu(x) = \frac{\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h_X}\right) Y_i}{\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h_X}\right)}.$$
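The sketch below (not from the notes; data, bandwidth and function names are ours) evaluates the Nadaraya-Watson estimator with a Gaussian kernel.

```python
# Illustrative sketch: Nadaraya-Watson kernel regression,
# mu_hat(x) = sum_i K((x - X_i)/h) Y_i / sum_i K((x - X_i)/h).
import numpy as np

def nadaraya_watson(x_eval, x, y, h):
    u = (np.atleast_1d(x_eval)[:, None] - x[None, :]) / h
    k = np.exp(-0.5 * u**2)                 # Gaussian kernel (normalization cancels)
    return (k * y).sum(axis=1) / k.sum(axis=1)

rng = np.random.default_rng(8)
x = rng.uniform(0, 1, 200)
y = 1 - x + np.exp(-50 * (x - 0.5) ** 2) + rng.normal(0, 0.1, 200)
grid = np.linspace(0, 1, 50)
mu_hat = nadaraya_watson(grid, x, y, h=0.04)
```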
63
Example 15 A family of kernels with bounded support which contains many kernels used in practice is
$$K_\nu(u) = c_\nu\,(1 - u^2)^\nu\,1\{|u| \le 1\}, \qquad c_\nu = \frac{\Gamma(2\nu + 2)}{2^{2\nu + 1}\,\Gamma(\nu + 1)^2},$$
which gives the uniform kernel $w$ for $\nu = 0$ and the Epanechnikov kernel $K^*$ for $\nu = 1$.
64
Figure 9: Gaussian kernel regression estimates. The sample consists of 200 observations from a Gaussian model with $\mu(x) = 1 - x + \exp[-50(x - .5)^2]$.
[Figure: panels for h = .02, .04, .06, .08, each showing the true and estimated CMF against x.]
65
2.4 The nearest neighbor method
This method suggests considering the $X_i$ that are closest to the evaluation point $x$ and estimating $\mu(x)$ by averaging the corresponding values of $Y$. The resulting estimator is of the form (17) with weights
$$w_j(x) = \begin{cases} 1/k, & \text{if } j \in O_k(x), \\ 0, & \text{otherwise}, \end{cases}$$
where $O_k(x)$ denotes the set of indices of the $k$ nearest neighbors of $x$. Thus, $\hat\mu(x)$ is just a running mean or moving average of the $Y_i$.
Notice that both the NN estimator and the NW estimator solve the following weighted LS problem
$$\min_{c \in \mathbb{R}}\sum_{i=1}^{n} w_i(x)\,(Y_i - c)^2.$$
66
Running lines
Using a running mean may give severe biases near the boundaries of the
data, where neighborhoods tend to be highly asymmetric.
Computation of (21) for the $n$ sample points may be based on the recursive formulae for OLS and only requires a number of operations of order $O(n)$.
Notice that $\hat\alpha(x)$ and $\hat\beta(x)$ in (21) solve the locally weighted LS problem
$$\min_{(\alpha, \beta) \in \mathbb{R}^2}\sum_{i=1}^{n} w_i(x)\,(Y_i - \alpha - \beta X_i)^2,$$
where the weight $w_i(x)$ is equal to one if the $i$th point belongs to $O_k(x)$ and to zero otherwise.
Running lines tend to produce ragged curves, but the method can be improved by replacing OLS by locally WLS, with weights that decline as the distance of $X_i$ from $x$ increases.
67
Lowess
An example of locally weighted LS estimator is the lowess (LOcally WEighted
Scatterplot Smoother) estimator introduced by Cleveland (1979) and com-
puted as follows:
Algorithm 1
(1) Identify the set $O_k(x)$ of the $k$ nearest neighbors of $x$.
(2) Compute the distance $\Delta(x) = \max_{i \in O_k(x)} d(X_i, x)$ of the point in $O_k(x)$ that is farthest from $x$.
(3) Assign weights $w_i(x)$ to each point in $O_k(x)$ according to the tricube function
$$w_i(x) = W\!\left(\frac{d(X_i, x)}{\Delta(x)}\right),$$
where
$$W(u) = \begin{cases} (1 - u^3)^3, & \text{if } 0 \le u < 1, \\ 0, & \text{otherwise}. \end{cases}$$
68
Figure 10: Lowess estimates. The sample consists of 200 observations from a Gaussian model with $\mu(x) = 1 - x + \exp[-50(x - .5)^2]$.
[Figure: true and estimated CMF plotted against x.]
69
Local polynomial tting
Another straightforward generalization is local polynomial fitting (Fan & Gijbels 1996), which replaces the running line by a running polynomial of order $k \ge 1$,
$$\hat\mu(x) = \hat\alpha(x) + \hat\beta_1(x)\,x + \cdots + \hat\beta_k(x)\,x^k,$$
where the coefficients $\hat\alpha(x), \hat\beta_1(x), \ldots, \hat\beta_k(x)$ solve the problem
$$\min_{(\alpha, \beta_1, \ldots, \beta_k) \in \mathbb{R}^{k+1}}\sum_{i=1}^{n} K_i(x)\,(Y_i - \alpha - \beta_1 X_i - \cdots - \beta_k X_i^k)^2.$$
70
Figure 11: Locally linear tting with Gaussian kernel weights. The sample
consists of 200 observations from a Gaussian model with (x) = 1 x +
exp[50(x .5)2 )].
h=.02 h=.04
1.5
1
.5
0
h=.06 h=.08
1.5
1
.5
0
0 .5 1 0 .5 1
x
True Estimated
71
2.5 Cubic smoothing splines
The regression analogue of the maximum penalized likelihood problem is the problem of finding a smooth function that best interpolates the observations on $Y$ without fluctuating too wildly.
Let $[a, b]$ be a closed interval that contains the observed values of $X$. Then this problem may be formalized as follows
$$\min_{m \in C^2[a,b]}\; Q(m) = \sum_{i=1}^{n}[Y_i - m(X_i)]^2 + \lambda\int_a^b m''(u)^2\,du, \tag{22}$$
where $C^2[a, b]$ is the class of functions defined on $[a, b]$ that have continuous 1st derivative and integrable 2nd derivative, and $\lambda \ge 0$ is a fixed parameter. The residual sum of squares in the functional $Q$ measures fidelity to the data, whereas the second term penalizes for excessive fluctuations of the estimated CMF.
72
Smoothing splines and penalized LS
Knowledge of the form of the solution to (22) makes it possible to define its smoother matrix. If we consider the representation of $\hat\mu$ as a cubic spline, then the required number of basis functions is $J + 4 = n - 2 + 4 = n + 2$. The solution to problem (22) may therefore be written
$$\hat\mu(x) = \sum_{j=1}^{n+2}\hat\beta_j\,B_j(x),$$
where $B_1, \ldots, B_{n+2}$ are the basis functions. Thus, the original problem (22) is reduced to the much simpler problem of minimizing a quadratic form in the $(n + 2)$-dimensional vector $\beta$.

The penalized LS estimator
The solution to the penalized LS problem (23) exists and must satisfy the normal equation
$$0 = -2\,B^\top(\mathbf{Y} - B\hat\beta) + 2\lambda\,\Omega\,\hat\beta,$$
where $\Omega$ is the $(n+2)\times(n+2)$ matrix with elements $\Omega_{jk} = \int_a^b B_j''(u)\,B_k''(u)\,du$. If $B^\top B + \lambda\Omega$ is a nonsingular matrix, then
$$\hat\beta = (B^\top B + \lambda\Omega)^{-1}B^\top\mathbf{Y},$$
and the smoother matrix is
$$S_\lambda = B(B^\top B + \lambda\Omega)^{-1}B^\top.$$
See Murphy and Welch (1992) for an independent derivation of the same result.
74
2.6 Statistical properties of linear smoothers
The smoother matrix of a linear smoother typically depends on the sample values of $X$ and on a parameter (or set of parameters) which regulates the amount of smoothing of the data. To emphasize this dependence, we denote the smoother matrix by $S_\lambda = [s_{ij}^\lambda]$, where $\lambda$ is the smoothing parameter. We adopt the convention that larger values of $\lambda$ correspond to more smoothing.
Let the data be a sample from the distribution of $(X, Y)$, suppose that the CMF $\mu(x) = E(Y \mid X = x)$ and the CVF $\sigma^2(x) = \operatorname{Var}(Y \mid X = x)$ of $Y$ are both well defined, and let $\mu$ and $\Sigma$ denote, respectively, the n-vector with generic element $\mu_i = \mu(X_i)$ and the diagonal $n \times n$ matrix with generic element $\sigma_i^2 = \sigma^2(X_i)$. Then
$$\operatorname{Bias}(\hat\mu \mid \mathbf{X}) = \mathrm{E}(\hat\mu - \mu \mid \mathbf{X}) = (S_\lambda - I_n)\,\mu, \qquad \operatorname{Var}(\hat\mu \mid \mathbf{X}) = S_\lambda\,\Sigma\,S_\lambda^\top.$$
If the data are conditionally homoskedastic, that is, $\sigma^2(x) = \sigma^2$ for all $x$, then $\Sigma = \sigma^2 I_n$ may be estimated by $\hat\sigma^2 I_n$, where $\hat\sigma^2 = n^{-1}\sum_i (Y_i - \hat\mu_i)^2$. The sampling variance of $\hat\mu_i$ may then be estimated by $\hat\sigma^2\sum_{j=1}^{n}(s_{ij}^\lambda)^2$.
75
Choosing the amount of smoothing
One way of choosing the smoothing parameter optimally is to maximize a global measure of accuracy, conditionally on $\mathbf{X}$. One such measure is the average conditional MSE,
$$\mathrm{AMSE}(\lambda) = n^{-1}\sum_{i=1}^{n}\mathrm{E}(\hat\mu_i - \mu_i)^2 = n^{-1}\left[\operatorname{tr}(S_\lambda\Sigma S_\lambda^\top) + \mu^\top(S_\lambda - I_n)^\top(S_\lambda - I_n)\,\mu\right].$$
The following relationship links the AMSE to the average conditional mean squared error of prediction or average predictive risk (APR)
$$n^{-1}\sum_{i=1}^{n}\mathrm{E}(Y_i - \hat\mu_i)^2 = n^{-1}\sum_{i=1}^{n}\sigma_i^2 + \mathrm{AMSE}(\lambda).$$
Since the APR depends on unknown parameters, one way of choosing the smoothing parameter optimally is to minimize an unbiased estimate of the APR.
76
Cross-validation
An approximately unbiased estimator of the APR is the cross-validation criterion
$$\mathrm{CV}(\lambda) = \frac{1}{n}\sum_{i=1}^{n}[Y_i - \hat\mu_{(i)}(X_i)]^2,$$
where $\hat\mu_{(i)}(X_i)$ is the value of the smoother at the point $X_i$, computed by excluding the $i$th observation, and $Y_i - \hat\mu_{(i)}(X_i)$ is the $i$th predicted residual. Minimizing $\mathrm{CV}(\lambda)$ represents an automatic method for choosing the smoothing parameter $\lambda$.
Example 16 In the OLS case, $\hat\mu_{(i)}(X_i) = \hat\beta_{(i)}^\top X_i$, where
$$\hat\beta_{(i)} = (\mathbf{X}^\top\mathbf{X} - X_i X_i^\top)^{-1}(\mathbf{X}^\top\mathbf{Y} - X_i Y_i) = \hat\beta_n - \frac{(\mathbf{X}^\top\mathbf{X})^{-1}X_i\,\hat U_i}{1 - h_{ii}}$$
is the OLS coefficient computed by excluding the $i$th observation, $\hat\beta_n$ is the OLS coefficient computed for the full sample, $\hat U_i = Y_i - \hat\beta_n^\top X_i$ is the OLS residual and $h_{ii}$ is the $i$th diagonal element of the matrix $H = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$.
Since $Y_i - \hat\beta_{(i)}^\top X_i = \hat U_i/(1 - h_{ii})$, the cross-validation criterion is
$$\mathrm{CV}(\lambda) = \frac{1}{n}\sum_{i=1}^{n}[Y_i - \hat\beta_{(i)}^\top X_i]^2 = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{\hat U_i}{1 - h_{ii}}\right)^2. \qquad \Box$$
Cross-validation requires computing the sequence $\hat\mu_{(1)}(X_1), \ldots, \hat\mu_{(n)}(X_n)$. This is very simple if the smoother matrix preserves the constant, that is, $S_\lambda\,\iota_n = \iota_n$. Because $\sum_{j=1}^{n}s_{ij}^\lambda = 1$ in this case, $\hat\mu_{(i)}(X_i)$ may be computed by setting the weight of the $i$th observation equal to zero and then dividing all the other weights by $1 - s_{ii}^\lambda$ in order for them to add up to one. This gives
$$\hat\mu_{(i)}(X_i) = \sum_{j \ne i}\frac{s_{ij}^\lambda}{1 - s_{ii}^\lambda}\,Y_j = \frac{\sum_{j=1}^{n}s_{ij}^\lambda Y_j - s_{ii}^\lambda Y_i}{1 - s_{ii}^\lambda} = \frac{\hat\mu_i - s_{ii}^\lambda Y_i}{1 - s_{ii}^\lambda}.$$
Hence, the $i$th predicted residual is $Y_i - \hat\mu_{(i)}(X_i) = \hat U_i/(1 - s_{ii}^\lambda)$, where $\hat U_i = Y_i - \hat\mu_i$, and the cross-validation criterion becomes
$$\mathrm{CV}(\lambda) = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{\hat U_i}{1 - s_{ii}^\lambda}\right)^2.$$
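The sketch below (not from the notes; the smoother, data and bandwidth grid are our choices) applies this shortcut to choose the bandwidth of a Nadaraya-Watson smoother, whose rows sum to one and which therefore preserves the constant.

```python
# Illustrative sketch: bandwidth choice by CV = n^{-1} sum_i [U_i / (1 - s_ii)]^2.
import numpy as np

def nw_smoother_matrix(x, h):
    u = (x[:, None] - x[None, :]) / h
    k = np.exp(-0.5 * u**2)
    return k / k.sum(axis=1, keepdims=True)    # rows sum to one

def cv_criterion(x, y, h):
    S = nw_smoother_matrix(x, h)
    resid = y - S @ y
    return np.mean((resid / (1 - np.diag(S))) ** 2)

rng = np.random.default_rng(9)
x = rng.uniform(0, 1, 200)
y = 1 - x + np.exp(-50 * (x - 0.5) ** 2) + rng.normal(0, 0.1, 200)
bandwidths = np.linspace(0.01, 0.2, 20)
h_cv = bandwidths[np.argmin([cv_criterion(x, y, h) for h in bandwidths])]
print(f"cross-validated bandwidth: {h_cv:.3f}")
```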
77
Equivalent kernels and equivalent degrees of freedom
Using the representation (17), it is easy to compare linear smoothers obtained from different methods.
The set of weights $S_1(x), \ldots, S_n(x)$ assigned to the sample values of $Y$ in the construction of $\hat\mu(x)$ is called the equivalent kernel evaluated at the point $x$. The $i$th row of the smoother matrix $S_\lambda$ gives the equivalent kernel evaluated at the point $X_i$. A comparison of equivalent kernels is therefore a simple and effective way of comparing different linear smoothers.
Example 17 Given two different linear smoothers, one may compare the different weights,
$$\{s_{i1}^{(1)}, \ldots, s_{in}^{(1)}\} \quad\text{and}\quad \{s_{i1}^{(2)}, \ldots, s_{in}^{(2)}\},$$
that they give to the sample values of $Y$ in constructing the estimate of the value $\mu(X_i)$ of $\mu(x)$ at a given point $X_i$. Alternatively, given a smoother and two different points $X_i$ and $X_j$, one may compare the different weights that the smoother gives to the sample values of $Y$ in constructing the estimates of $\mu(X_i)$ and $\mu(X_j)$. $\Box$
The effective number of parameters of a linear smoother may be defined as
$$k_\lambda = \operatorname{tr} S_\lambda,$$
which is equal to the sum of the eigenvalues of the matrix $S_\lambda$. If $S_\lambda$ is symmetric and idempotent (as in the case of regression splines), then
$$k_\lambda = \operatorname{rank} S_\lambda.$$
The effective number of parameters depends mainly on the value of the smoothing parameter $\lambda$, while the configuration of the predictors tends to have a much smaller effect. As $\lambda$ increases, we usually observe a decrease in $k_\lambda$ and a corresponding increase in the equivalent degrees of freedom $\mathrm{df}_\lambda = n - k_\lambda$.
78
Asymptotic properties
We now present some asymptotic properties of regression smoothers. For simplicity, we confine our attention to estimators based on the kernel method.
Theorem 5 Suppose that the sequence $\{h_n\}$ of bandwidths and the kernel $K: \mathbb{R}^k \to \mathbb{R}$ satisfy:
(i) $h_n \to 0$;
(ii) $n h_n^k \to \infty$;
(iii) $\int K(u)\,du = 1$, $\int u K(u)\,du = 0$, and $\int u u^\top K(u)\,du$ is a finite $k \times k$ matrix.
Then $\hat\mu_n(x) \xrightarrow{p} \mu(x)$ for every $x$ in the support of $X$.
Suppose in addition that $h_n^2\,(n h_n^k)^{1/2} \to \lambda$, where $0 \le \lambda < \infty$. If the CMF $\mu(x)$ and the CVF $\sigma^2(x)$ are smooth and $X$ has a twice continuously differentiable density $f_X(x)$, then
$$(n h_n^k)^{1/2}\,[\hat\mu_n(x) - \mu(x)] \xrightarrow{d} \mathcal{N}\!\left(\frac{\lambda\,b(x)}{f_X(x)},\; \frac{\sigma^2(x)}{f_X(x)}\int K(u)^2\,du\right)$$
for every $x$ in the support of $X$, where
$$b(x) = \lim_{n \to \infty} h_n^{-(k+2)}\,\mathrm{E}\left\{[\mu(X_i) - \mu(x)]\,K\!\left(\frac{x - X_i}{h_n}\right)\right\}.$$
Further, $(n h_n^k)^{1/2}[\hat\mu_n(x) - \mu(x)]$ and $(n h_n^k)^{1/2}[\hat\mu_n(x') - \mu(x')]$ are asymptotically independent for $x \ne x'$.
79
Remarks
$\hat\mu_n(x)$ is asymptotically biased for $\mu(x)$, unless $h_n$ is chosen such that $\lambda = 0$. In this case, however, the convergence of $\hat\mu_n(x)$ to its limiting distribution is slower than in the case when $\lambda > 0$.
80
Tests of parametric models
Nonparametric regression may be used to test the validity of a parametric
model. For this purpose one may employ both informal methods, especially
graphical ones, and more formal tests.
Let $\hat U = (I_n - H)\mathbf{Y}$, with $H = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$, be the OLS residual vector and let $\tilde U = (I_n - S_\lambda)\mathbf{Y}$ be the residual vector associated with some other linear smoother $\tilde\mu = S_\lambda\mathbf{Y}$. Because $H_0$ is nested within the alternative, an F-type test rejects $H_0$ for large values of the statistic
$$F = \frac{\hat U^\top\hat U - \tilde U^\top\tilde U}{\tilde U^\top\tilde U} = \frac{\mathbf{Y}^\top M\mathbf{Y} - \mathbf{Y}^\top\tilde M\mathbf{Y}}{\mathbf{Y}^\top\tilde M\mathbf{Y}},$$
where $M = I_n - H$ and $\tilde M = (I_n - S_\lambda)^\top(I_n - S_\lambda)$.
81
Connections with the Durbin-Watson test
Azzalini and Bowman (1993) suggest replacing $H_0$ by the hypothesis that $\mathrm{E}\,\hat U = 0$, conditional on $\mathbf{X}$. The alternative is now that $\mathrm{E}\,\hat U$ is a smooth function of the covariates. This leads to a test that rejects $H_0$ for large values of the statistic
$$F = \frac{\hat U^\top\hat U - \hat U^\top\tilde M\hat U}{\hat U^\top\tilde M\hat U}.$$
It can be shown that $F$ depends only on the data and the value of the smoothing parameter $\lambda$. Rejecting $H_0$ for large values of $F$ is equivalent to rejecting for small values of the statistic
$$\frac{1}{F + 1} = \frac{\hat U^\top\tilde M\hat U}{\hat U^\top\hat U},$$
which has the same form as the Durbin-Watson test statistic.
82
2.7 Average derivatives
Suppose that the CMF is sufficiently smooth and write it as $\mu(x) = c(x)/f_X(x)$, where $c(x) = \int y\,f(x, y)\,dy$. Then its slope is $\mu'(x) = \partial\mu(x)/\partial x$, and the average derivative of the CMF is
$$\delta = \mathrm{E}\,\mu'(X). \tag{24}$$
Integration by parts gives a second representation, $\delta = -\mathrm{E}[s_X(X)\,\mu(X)]$, where $s_X(x) = f_X'(x)/f_X(x)$ is the score of the marginal density of $X$.
A third representation follows from the fact that, by the Law of Iterated Expectations, $\delta = -\mathrm{E}[s_X(X)\,Y]$, where the last expectation is with respect to the joint distribution of $(X, Y)$.

Estimation of the average derivative
If the kernel $K$ is differentiable, an analog estimator of the average derivative based on (24) is
$$\hat\delta_1 = n^{-1}\sum_{i=1}^{n} D_i\,\hat\mu'(X_i),$$
where $D_i = 1\{\hat f_X(X_i) > c_n\}$ for some trimming constant $c_n$ that goes to zero as $n \to \infty$. The lower bound on $\hat f_X$ is introduced to avoid erratic behavior where $X$ is sparse, so the value of $\hat f_X(x)$ is very small.
Analog estimators based on the other two representations are
$$\hat\delta_2 = -n^{-1}\sum_{i=1}^{n} D_i\,\hat s_X(X_i)\,\hat\mu(X_i), \qquad \hat\delta_3 = -n^{-1}\sum_{i=1}^{n} D_i\,\hat s_X(X_i)\,Y_i.$$
84
2.8 Methods for high-dimensional data
We now review a few nonparametric approaches that try to overcome some of
the problems encountered in trying to extend univariate nonparametric meth-
ods to multivariate situations.
Remarks:
85
The PP algorithm
Algorithm 2
(1) Given estimates $\hat\alpha_j$ of the projection directions and estimates $\hat m_j$ of the first $h - 1$ ridge functions in (27), compute the approximation errors
$$r_i = \tilde Y_i - \sum_{j=1}^{h-1}\hat m_j(\hat\alpha_j^\top X_i), \qquad i = 1, \ldots, n,$$
where $\tilde Y_i = Y_i - \bar Y$.
(2) Given a vector $b \in \mathbb{R}^k$ such that $\|b\| = 1$, construct a linear smoother $\hat m(b^\top X_i)$ based on the errors $r_i$ and compute the RSS
$$S(b) = \sum_{i=1}^{n}\left[r_i - \hat m(b^\top X_i)\right]^2.$$
86
Additive regression
An important property of the linear regression model
$$m(x) = \alpha + \sum_{j=1}^{k}\beta_j x_j$$
is that each covariate enters the CMF additively.
Remarks:
87
Characterizing additive regression models
An additive regression model may be characterized as the solution to an ap-
proximation problem. In turn, this leads to a general iterative proce-
dure for estimating this class of models.
This way of proceeding is very similar in spirit to obtaining OLS as the sample
analog of the normal equations that characterize the best linear approximation
to an arbitrary CMF.
Before stating the main result, it is worth recalling a few definitions and results. We refer to Luenberger (1969) for details.
88
Banach and Hilbert spaces
A normed vector space $X$ is a vector space on which is defined a function $\|\cdot\|: X \to \mathbb{R}_+$, called the norm, satisfying the usual norm axioms. An inner product on $X$ is a function $\langle\,\cdot \mid \cdot\,\rangle: X \times X \to \mathbb{R}$ that satisfies, among other properties:
(i) $\langle x \mid y\rangle = \langle y \mid x\rangle$;
(ii) $\langle x + y \mid z\rangle = \langle x \mid z\rangle + \langle y \mid z\rangle$;
89
Example 18 An important example of Hilbert space is the space $H$ of all random variables with finite variance defined on a probability space $(\Omega, \mathcal{A}, P)$. In this case, given two random variables $Y, X \in H$, we define the inner product
$$\langle X \mid Y\rangle = \operatorname{Cov}(X, Y).$$
Viewed as elements of $H$, two uncorrelated random variables are therefore orthogonal. The inner product thus defined induces the norm $\|X\| = [\operatorname{Var} X]^{1/2}$.
90
The Classical Projection Theorem
Theorem 7 Let $M$ be a closed subspace of a Hilbert space $H$. For any vector $x \in H$, there exists a unique vector $m^* \in M$ such that
$$\|x - m^*\| \le \|x - m\|$$
for all $m \in M$. Further, a necessary and sufficient condition for $m^* \in M$ to be such a vector is that $x - m^* \perp M$.
91
The main result
Theorem 9 Let $Z = (X_1, \ldots, X_k, Y)$ be a random $(k + 1)$-vector with finite second moments, let $H$ be the Hilbert space of zero-mean functions of $Z$, with the inner product defined as $\langle\phi \mid \psi\rangle = \mathrm{E}\,\phi(Z)\psi(Z)$, and let $\mu(x) = \mu(x_1, \ldots, x_k) = E(Y \mid X_1 = x_1, \ldots, X_k = x_k)$. For any $j = 1, \ldots, k$, let $M_j$ be the subspace of zero-mean square integrable functions of $X_j$ and let $M_a = M_1 + \cdots + M_k$ be the subspace of zero-mean square integrable additive functions of $X = (X_1, \ldots, X_k)$. Then a solution to the best additive approximation problem
$$\min_{m \in M_a}\;\mathrm{E}\,[\mu(X) - m(X)]^2 \tag{29}$$
exists and has the form
$$m^*(X) = \sum_{j=1}^{k} m_j^*(X_j).$$
92
Proof
We only sketch the proof. The details can be found in Stone (1985). If we endow the Hilbert space $H$ with the norm $\|\phi\| = [\mathrm{E}\,\phi(Z)^2]^{1/2}$, then problem (29) is equivalent to the minimum norm problem
$$\min_{m \in M_a}\|Y - m\|^2.$$
By the Projection Theorem (Theorem 7), the solution $m^*$ must satisfy
$$Y - m^* \perp M_j, \qquad j = 1, \ldots, k.$$
Because $m^* \in M_a$, it must be of the form $m^*(X) = \sum_{j=1}^{k} m_j^*(X_j)$. Further, because the conditional expectation $E(\cdot \mid X_j)$ is an orthogonal projection onto $M_j$, we must have
$$\mathrm{E}\left[Y - \sum_{j=1}^{k} m_j^*(X_j)\,\Big|\,X_j\right] = 0, \qquad j = 1, \ldots, k,$$
that is,
$$m_j^*(X_j) = \mathrm{E}\left[Y - \sum_{h \ne j} m_h^*(X_h)\,\Big|\,X_j\right], \qquad j = 1, \ldots, k. \tag{30}$$
93
Estimating additive regression models
Let $S_j$ denote a smoother matrix for univariate smoothing on the $j$th covariate. Because the sample analogue of the conditional mean $\mu_j(x) = E(Y \mid X_j = x)$ is the univariate smoother $S_j\mathbf{Y}$, the analogy principle suggests the following equation system as the sample counterpart of (30)
$$\mathbf{m}_j = S_j\left(\tilde{\mathbf{Y}} - \sum_{h \ne j}\mathbf{m}_h\right), \qquad j = 1, \ldots, k, \tag{31}$$
where $\mathbf{m}_j$ is the n-vector with generic element $\hat m_j(X_{ij})$ and $\tilde{\mathbf{Y}}$ is the n-vector with generic element $Y_i - \bar Y$. The estimate of the CMF at the sample points is then
$$\hat\mu = \bar Y\,\iota_n + \sum_{j=1}^{k}\mathbf{m}_j.$$
94
The backtting algorithm
One way of solving system (31) is the Gauss-Seidel method. This method solves iteratively for each vector $\mathbf{m}_j$ from the relationship
$$\mathbf{m}_j = S_j\left(\tilde{\mathbf{Y}} - \sum_{h \ne j}\mathbf{m}_h\right),$$
using the latest values of $\{\mathbf{m}_h,\,h \ne j\}$ at each step. The process is repeated for $j = 1, \ldots, k, 1, \ldots, k, \ldots$, until convergence. The result is the backfitting algorithm:
Algorithm 3
(1) Compute the univariate smoother matrices $S_1, \ldots, S_k$.
(2) Initialize the algorithm by setting $\mathbf{m}_j = \mathbf{m}_j^{(0)}$, $j = 1, \ldots, k$.
(3) Cycle for $j = 1, \ldots, k$: $\mathbf{m}_j = S_j(\tilde{\mathbf{Y}} - \sum_{h \ne j}\mathbf{m}_h)$, where the smoother matrix $S_j$ for univariate smoothing on the $j$th covariate is applied to the residual vector $\tilde{\mathbf{Y}} - \sum_{h \ne j}\mathbf{m}_h$ obtained from the previous step.
(4) Iterate step (3) until the changes in the individual functions become negligible.
The smoother matrices $S_j$ can be computed using any univariate linear smoother and do not change through the backfitting algorithm.
To ensure that $\tilde{\mathbf{Y}} - \sum_{h \ne j}\mathbf{m}_h$ has mean zero at every stage, $S_j$ may be replaced by the centered smoother matrix $S_j(I_n - n^{-1}\iota_n\iota_n^\top)$.
A possible starting point for the algorithm is the fitted value of $Y$ from an OLS regression of $Y$ on $\mathbf{X}$. In this case $\mathbf{m}_j^{(0)} = X_j\hat\beta_j$, where $X_j$ and $\hat\beta_j$ denote respectively the $j$th column of $\mathbf{X}$ and the $j$th element of the OLS coefficient vector.
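A minimal sketch of Algorithm 3 follows (not from the notes; the smoothers, bandwidth, simulated data, zero starting values and fixed number of cycles are our simplifications).

```python
# Illustrative sketch: backfitting for an additive model with two covariates,
# using Nadaraya-Watson smoother matrices S_1, S_2.
import numpy as np

def nw_smoother_matrix(x, h):
    u = (x[:, None] - x[None, :]) / h
    k = np.exp(-0.5 * u**2)
    return k / k.sum(axis=1, keepdims=True)

rng = np.random.default_rng(10)
n = 300
X = rng.uniform(-1, 1, size=(n, 2))
Y = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.2, n)   # additive truth

Y_tilde = Y - Y.mean()
S = [nw_smoother_matrix(X[:, j], h=0.15) for j in range(2)]          # step (1)
m = [np.zeros(n), np.zeros(n)]                                       # step (2): zero start

for _ in range(50):                                                  # steps (3)-(4), fixed cycles
    for j in range(2):
        partial = Y_tilde - sum(m[h] for h in range(2) if h != j)
        m[j] = S[j] @ partial
        m[j] -= m[j].mean()                                          # keep components centered

mu_hat = Y.mean() + m[0] + m[1]
```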
95
Partially linear models
A partially linear model is of the form
$$E(Y \mid X, W) = m_1(X) + m_2(W),$$
where $m_1$ is linear and $m_2$ is an arbitrary smooth function. For simplicity (and without loss of generality), let $X$ be a scalar rv and write the model as
$$Y = \beta X + m_2(W) + U, \tag{32}$$
where $E(U \mid X, W) = 0$. Taking expectations conditional on $W$ and differencing gives
$$Y - E(Y \mid W) = \beta\,[X - E(X \mid W)] + U.$$
96
The backtting algorithm for partially linear models
Now consider applying the backfitting algorithm. Because in this case $m_1$ is linear, two different smoothers may be employed:
An OLS smoother, with smoother matrix $S_1 = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$, which produces estimates of the form $\mathbf{m}_1 = \mathbf{X}\hat\beta$, where $\hat\beta$ is the desired estimate of $\beta$.
A nonparametric smoother with smoother matrix $S_2$. Applied to the vector $W = (W_1, \ldots, W_n)$, this smoother produces estimates of the vector $\mathbf{m}_2 = (m_2(W_1), \ldots, m_2(W_n))$ of the form $\mathbf{m}_2 = S_2\mathbf{Y}$.
The method proposed by Robinson (1988) coincides with the backfitting algorithm if $S_2$ is a projection matrix (i.e., symmetric and idempotent).
97
2.9 Stata commands
We now briefly review the commands for linear nonparametric regression estimation available in Stata, version 12.
These include the lowess command for lowess, the lpoly command for kernel-
weighted local polynomial smoothing, and the mkspline command for construct-
ing linear or natural cubic splines. All three commands consider the case of a
single regressor.
The package bspline in Newson (2012) allows one to estimate univariate and
multivariate splines.
98
The lowess command
This is a computationally intensive command. For example, running lowess on 1,000 observations requires performing 1,000 regressions, each involving a number of observations equal to n times the span. Since Stata does not take advantage of the recursive formulae for OLS or WLS, this may take a long time on a slow computer.
nograph: suppresses graph. This option is often used with the generate()
option.
99
The lpoly command
The basic syntax is:
lpoly yvar xvar [if] [in] [weight] [, options]
where options includes:
nograph: suppresses graph. This option is often used with the generate()
option.
100
The mkspline command
In the first syntax, it creates a set of k variables containing a linear spline of oldvar with knots at the k - 1 specified values:
mkspline newvar_1 #_1 [newvar_2 #_2 [...]] newvar_k = oldvar [if] [in] [, marginal displayknots]
where
displayknots displays the values of the knots that were used in creating
the linear or restricted cubic spline.
101
3 Distribution function and quantile function
estimators
The distribution function and the quantile function are equivalent ways
of characterizing the probability distribution of a univariate rv Z.
Any function $f$ such that $F(z) = \int_{-\infty}^{z} f(t)\,dt$ is called a density for $F$. Any such density must agree with $F'$ except possibly on a set with measure zero. If $f$ is continuous at $z_0$, then $f(z_0) = F'(z_0)$.
Sometimes the df is defined in terms of its density. For example, the density of the $\mathcal{N}(0, 1)$ distribution is
$$\phi(z) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{z^2}{2}\right).$$
The associated df is defined by the integral
$$\Phi(z) = \int_{-\infty}^{z}\phi(u)\,du.$$
102
Properties of the distribution function
$F$ is non-decreasing, right-continuous, and satisfies $F(-\infty) = 0$ and $F(\infty) = 1$.
For any $z$,
$$\Pr\{Z = z\} = F(z+) - F(z-),$$
where $F(z+) = \lim_{h \downarrow 0} F(z + h)$ and $F(z-) = \lim_{h \downarrow 0} F(z - h)$.
If $F$ is continuous at $z$, then $F(z+) = F(z-) = F(z)$ and so $\Pr\{Z = z\} = 0$.
103
Quantiles
For any $p \in (0, 1)$, a pth quantile of a rv $Z$ is a number $Q_p \in \mathbb{R}$ such that
$$\Pr\{Z < Q_p\} \le p \quad\text{and}\quad \Pr\{Z > Q_p\} \le 1 - p, \tag{33}$$
or equivalently
$$\Pr\{Z < Q_p\} \le p \le \Pr\{Z \le Q_p\},$$
or equivalently
$$\Pr\{Z > Q_p\} \le 1 - p \le \Pr\{Z \ge Q_p\}.$$
The set of pth quantiles of $Z$ is a closed interval of the real line. A pth quantile is unique if the df of $Z$ is strictly increasing at $Q_p$, in which case $F(Q_p) = p$.
104
Quantiles as solutions to a minimization problem
Recall that, if a rv $Z$ has finite mean, then its mean solves the problem $\min_{c \in \mathbb{R}}\mathrm{E}(Z - c)^2$. A pth quantile of $Z$ may also be defined implicitly as a solution to a particular minimization problem,
$$\min_{c \in \mathbb{R}}\; L(c) = \mathrm{E}\,\rho_p(Z - c), \tag{34}$$
where $\rho_p(v) = [p - 1\{v < 0\}]\,v$ is the asymmetric absolute loss function.
Proof. Consider for simplicity the case when $Z$ is a continuous rv with density $f$ and finite mean $\mu$. Since $L(c)$ is convex, it is enough to show that $Q_p$ is a root of the equation $L'(c) = 0$. First notice that
$$L(c) = p\int_{-\infty}^{\infty}(z - c)f(z)\,dz - \int_{-\infty}^{c}(z - c)f(z)\,dz = p(\mu - c) - \int_{-\infty}^{c} z f(z)\,dz + c F(c).$$
Hence,
$$L'(c) = -p - c f(c) + F(c) + c f(c) = -p + F(c).$$
Since $Q_p$ satisfies $F(Q_p) = p$, it is a root of $L'(c) = 0$. $\Box$
105
Figure 12: The asymmetric absolute loss function $\rho_p(v)$.
[Figure: the loss plotted against v on [-2, 2].]
106
The quantile function
The quantile function (qf) of a rv $Z$ is a function $Q: (0, 1) \to \mathbb{R}$ defined by
$$Q(p) = \inf\{z:\, F(z) \ge p\}.$$
If $Q_p$ is unique, then $Q(p) = Q_p$. While the df $F$ maps $\mathbb{R}$ into $[0, 1]$, the qf $Q$ maps $(0, 1)$ into $\mathbb{R}$. For any $z$ and $p$,
$$F(z) \ge p \iff z \ge Q(p).$$
This justifies calling $Q$ the inverse of $F$ and using the notation $Q(p) = F^{-1}(p)$. Also,
$$F(z) = \sup\{p \in [0, 1]:\, Q(p) \le z\}. \tag{36}$$
107
Derivation of the qf from the df
If the df $F$ is continuous and can be expressed in closed form, then the qf is easily obtained by solving the equation $F(z) = p$. For example, for the exponential distribution with parameter $\lambda$,
$$F(z) = 1 - \exp(-\lambda z),$$
so
$$Q(p) = -\frac{\ln(1 - p)}{\lambda}. \qquad \Box$$
108
Table 1: Distribution function F(z) and quantile function Q(p) of selected distributions.

Distribution     F(z)                               Q(p)
Cauchy           1/2 + (1/pi) arctan z              tan[pi (p - 1/2)]
Chi-square(1)    2 Phi(sqrt(z)) - 1                 [Phi^{-1}((p + 1)/2)]^2
Exponential      1 - e^{-z}                         -ln(1 - p)
Laplace          (1/2) e^{z}  for z < 0             ln(2p)  for p < 1/2
                 1 - (1/2) e^{-z}  for z >= 0       -ln[2(1 - p)]  for p >= 1/2
Logistic         e^{z}/(1 + e^{z})                  ln[p/(1 - p)]
Uniform          z                                  p
109
Figure 13: Quantile functions of selected distributions.
[Figure: panels for the Gumbel, Gaussian, Laplace, Logistic, Lognormal and Weibull(2,1) distributions, plotting Q(p) against p.]
110
Other properties of the quantile function
If the first two moments of $Z$ exist, then
$$\mathrm{E}\,Z = \int_0^1 Q(p)\,dp, \qquad \operatorname{Var} Z = \int_0^1 Q(p)^2\,dp - \left[\int_0^1 Q(p)\,dp\right]^2.$$
111
Moment-based summaries of a distribution
Center, spread and symmetry of a distribution are intuitive but somewhat
vague concepts. Kurtosis is not even intuitive.
they may not exist (e.g., for all four measures to exist, a t distribution
needs at least 5 degrees of freedom);
112
Quantile-based summaries of a distribution
An alternative is to use quantile-based summaries of a distribution. Thus,
we may alternatively dene:
The IQR (IDR) is the interval of values which contains the central 50%
(80%) of the probability mass of Z.
or the ratio
$$\frac{Q(.90) - Q(.50)}{Q(.50) - Q(.10)},$$
or
$$\frac{[Q(.90) - Q(.50)] - [Q(.50) - Q(.10)]}{Q(.90) - Q(.10)},$$
also known as Kelley's measure. One may also consider analogous measures with $Q(.90)$ and $Q(.10)$ replaced by $Q(.75)$ and $Q(.25)$.
113
Functionals of the distribution function or the quantile function
Sometimes, the parameter of interest is a particular functional of either the df
or the qf. We illustrate this with two examples.
The tail conditional mean $\phi(\alpha)$, also called the $\alpha$-level expected shortfall, represents the expected value of a loss (negative return) that exceeds the VaR.
114
Example 22 Let $Z$ be a non-negative rv with finite nonzero mean $\mu$. The Lorenz curve of $Z$ is formally defined as
$$L(\alpha) = \frac{1}{\mu}\int_0^{\alpha} Q(p)\,dp, \qquad 0 < \alpha < 1.$$
The Lorenz curve is commonly used in economics to describe the distribution of income and is associated with the Gini inequality index.
From (37), the following relationship links the Lorenz curve and the tail conditional mean of $Z$,
$$\mu\,L(\alpha) = \alpha\,\phi(\alpha).$$
The generalized Lorenz curve (Shorrocks 1983) is the Lorenz curve scaled up by the mean, and is equal to
$$GL(\alpha) = \int_0^{\alpha} Q(p)\,dp = \alpha\,\phi(\alpha), \qquad 0 < \alpha < 1.$$
115
3.2 The empirical distribution function
Let $Z_1, \ldots, Z_n$ be a sample from the distribution of a rv $Z$, and consider the problem of estimating the df $F$ of $Z$ when $F$ is not restricted to a known parametric family. For any given $z$,
$$F(z) = \Pr\{Z \le z\} = \mathrm{E}\,1\{Z \le z\},$$
that is, $F(z)$ is just the mean of the Bernoulli rv $1\{Z \le z\}$. This suggests estimating $F(z)$ by its sample counterpart
$$\hat F(z) = n^{-1}\sum_{i=1}^{n} 1\{Z_i \le z\}, \tag{38}$$
called the empirical distribution function (edf). Let $Z_{[1]} \le \cdots \le Z_{[n]}$ be the sample order statistics. Then the edf may also be written
$$\hat F(z) = \begin{cases} 0, & \text{if } z < Z_{[1]}, \\ i/n, & \text{if } Z_{[i]} \le z < Z_{[i+1]},\ i = 1, \ldots, n - 1, \\ 1, & \text{if } z \ge Z_{[n]}. \end{cases}$$
This shows that the edf contains all the information carried by the sample, except the order in which the observations have been drawn or are arranged.
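A minimal sketch of (38) follows (not from the notes; the data and evaluation grid are arbitrary).

```python
# Illustrative sketch: the empirical distribution function F_hat(z) = n^{-1} sum_i 1{Z_i <= z}.
import numpy as np

def edf(z_eval, data):
    return np.mean(data[None, :] <= np.atleast_1d(z_eval)[:, None], axis=1)

rng = np.random.default_rng(11)
data = rng.normal(size=100)
grid = np.array([-1.0, 0.0, 1.0])
print(edf(grid, data))
# via the order statistics: np.searchsorted(np.sort(data), grid, side="right") / data.size
```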
116
Figure 14: Empirical distribution function $\hat F$ for a sample of 100 observations from a N(0, 1) distribution.
[Figure: population df and empirical df plotted against z.]
117
The Riemann-Stieltjes integral
In what follows we shall sometimes use the notation of the Riemann-Stieltjes
integral (see e.g. Apostol 1974, Chpt. 7).
Because the edf $\hat F$ is discrete, with probability mass function that assigns probability mass $1/n$ to each distinct sample point, we have
$$\mathrm{E}_{\hat F}\,g(Z) = \int g(z)\,d\hat F(z) = n^{-1}\sum_{i=1}^{n} g(Z_i).$$
118
The empirical process
Let Z1, . . . , Zn be a sample from the distribution of a rv Z with df F, and let
θ = T(F) = ∫ g(z) dF(z)
be the parameter of interest, with sample counterpart θ̂ = T(F̂).
Thus, the estimation error ultimately depends on the difference between the
edf F̂ and the population df F. Consider the normalized difference
pn(z) = √n [F̂(z) − F(z)].
Notice that pn(z) is a function defined on R for any given sample, but is a rv
for any given z. The collection
pn = {pn(z), z ∈ R}
of all such rvs is therefore a stochastic process with index set R. This
process is called the empirical process.
119
Figure 15: Realization of the empirical process pn for a sample of n = 100
observations from a N (0, 1) distribution.
[Figure: p_n(z) plotted against z.]
120
Finite sample properties
The key is again the fact that F̂(z) is the average of n iid rvs 1{Zi ≤ z} that
have a common Bernoulli distribution with parameter F(z).
Remarks:
Being the average of n iid rvs with nite variance, F (z) also satises the
standard CLT and is therefore asymptotically normal.
121
Asymptotic properties
We now consider the properties of a sequence {F̂n(z)} of edfs indexed by the
sample size n. As in the case of density estimation, we distinguish between
local properties (valid at a finite set of z values) and global properties.
As n → ∞ we have:
F̂n(z) → F(z) almost surely for every z;
for any finite set of points z1, . . . , zJ, the vector with elements √n [F̂n(zj) − F(zj)]
is asymptotically normal with mean zero and covariances
σjk = min(Fj, Fk) − Fj Fk,   j, k = 1, . . . , J,
where Fj = F(zj).
These two properties may be viewed as special cases of the following two
results.
The empirical process {pn(Q(u)), u ∈ (0, 1)} converges weakly to the
Brownian bridge or tied-down Brownian motion, that is, the unique
Gaussian process U with continuous sample paths on [0, 1] such that
E U(t) = 0 and Cov[U(s), U(t)] = min(s, t) − st, with 0 < s, t < 1.
122
The Glivenko-Cantelli theorem
As a measure of distance between the edf F̂n and the population df F consider
the Kolmogorov–Smirnov statistic
Dn = sup_z |F̂n(z) − F(z)|.
The Glivenko–Cantelli theorem states that
Pr{ lim_{n→∞} Dn = 0} = 1
or, equivalently, that the event that Dn does not converge to zero as n → ∞ occurs
with zero probability under repeated sampling.
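A Python sketch (illustrative only) of the Glivenko–Cantelli phenomenon: the Kolmogorov–Smirnov distance between the edf of a Gaussian sample and the N(0, 1) df shrinks as n grows.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    for n in (100, 1_000, 10_000):
        z = np.sort(rng.normal(size=n))
        ecdf = np.arange(1, n + 1) / n
        # the sup over z is attained at sample points; check both edges of each step
        dn = max(np.max(np.abs(ecdf - norm.cdf(z))),
                 np.max(np.abs(ecdf - 1 / n - norm.cdf(z))))
        print(n, dn)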
123
Figure 16: Empirical distribution function F̂ for a sample of 1,000 observations
from a N(0, 1) distribution.
[Figure: the edf plotted against z together with the population df; legend: Population df, Empirical df.]
124
Multivariate and conditional edf
The denition of edf and its sampling properties are easily generalized to the
multivariate case.
125
3.3 The empirical quantile function
We now consider the problem of estimating the qf of a rv Z given a sample
Z1 , . . . , Zn from its distribution.
Sample quantiles
A sample analogue of (33) is called a sample quantile. Thus, for any p ∈
(0, 1), a pth sample quantile is a number Q̂p ∈ R such that
n⁻¹ Σᵢ 1{Zi < Q̂p} ≤ p   and   n⁻¹ Σᵢ 1{Zi > Q̂p} ≤ 1 − p,
that is, about np of the observations are smaller than Q̂p and about n(1 − p)
are greater. Equivalently, we have the condition
Σᵢ 1{Zi < Q̂p} ≤ np ≤ Σᵢ 1{Zi ≤ Q̂p},   (40)
that is, np cannot be smaller than the number of observations strictly less than
Q̂p and cannot be greater than the number of observations less than or equal
to Q̂p.
126
Asymmetric LAD
It can be shown that a pth sample quantile is also a solution to the sample
analogue of (34), namely
min_{c∈R} E_F̂ ρp(Z − c) = min_{c∈R} n⁻¹ Σᵢ ρp(Zi − c),   (41)
where ρp(v) = [p − 1{v < 0}] v is the asymmetric absolute loss function.
Problem (41) is called an asymmetric least absolute deviations (ALAD)
problem. Notice that solving problem (41) avoids the need of sorting.
The one-sided directional derivatives of the sample objective function L(c) = n⁻¹ Σᵢ ρp(Zi − c) are
L′(c−) = lim_{h↓0} [L(c − h) − L(c)] / h = p − n⁻¹ Σᵢ 1{Zi < c}
and
L′(c+) = lim_{h↓0} [L(c + h) − L(c)] / h = n⁻¹ Σᵢ 1{Zi ≤ c} − p.
At a minimum both must be nonnegative, so a pth sample quantile Q̂p satisfies
n⁻¹ Σᵢ 1{Zi < Q̂p} ≤ p ≤ n⁻¹ Σᵢ 1{Zi ≤ Q̂p},
which is just condition (40).
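A Python sketch (illustrative only) of the ALAD characterization: minimizing the sample check-function loss over c reproduces the usual sample quantile obtained by sorting.

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(0)
    z, p = rng.normal(size=501), 0.25

    def rho(v, p):
        # asymmetric absolute loss: rho_p(v) = [p - 1{v < 0}] v
        return (p - (v < 0)) * v

    loss = lambda c: np.mean(rho(z - c, p))
    alad = minimize_scalar(loss, bounds=(z.min(), z.max()), method="bounded").x
    print(alad, np.quantile(z, p))             # the two should essentially coincide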
127
Example 25 A sample median is a solution to problem (41) for p = 1/2
and minimizes the least absolute deviations (LAD) objective function
L(c) = (2n)⁻¹ Σᵢ |Zi − c|,
or, equivalently, solves
min_{c∈R} Σᵢ |Zi − c|.
In this case
L′(c−) = lim_{h↓0} [L(c − h) − L(c)] / h = 1/2 − n⁻¹ Σᵢ 1{Zi < c}
and
L′(c+) = lim_{h↓0} [L(c + h) − L(c)] / h = n⁻¹ Σᵢ 1{Zi ≤ c} − 1/2,
so the following condition must hold at a sample median μ̂:
n⁻¹ Σᵢ 1{Zi < μ̂} ≤ 1/2 ≤ n⁻¹ Σᵢ 1{Zi ≤ μ̂},
that is, at most half of the observations are less than μ̂ and at least half are
greater than or equal to μ̂. □
128
Figure 17: LAD objective function for different sample sizes.
[Figure: L(c) plotted against c for n = 4 and n = 5.]
129
Figure 18: Subgradients L (c) and L (c+) of the LAD objective function for
n = 4.
[Figure: L′(c−) and L′(c+) plotted against c.]
130
Figure 19: Subgradients L (c) and L (c+) of the LAD objective function for
n = 5.
[Figure: L′(c−) and L′(c+) plotted against c.]
131
The empirical quantile function
The empirical quantile function (eqf) is a real-valued function defined on
(0, 1) by
Q̂(p) = inf {z ∈ R: F̂(z) ≥ p}.
Thus, the eqf is just the sample analogue of problem (35) and satisfies all
properties of a qf. In particular, it is the left-continuous inverse of the
right-continuous edf F̂.
The normalized difference
qn(p) = √n [Q̂n(p) − Q(p)]
is a function on (0, 1) for any given sample, and is a rv for any given p. The
collection
qn = {qn(p), 0 < p < 1}
of all such rvs is therefore a stochastic process with index set (0, 1). This
process is called the sample quantile process.
132
Figure 20: Empirical quantile function Q for a sample of 100 observations from
a N (0, 1) distribution.
[Figure: the eqf plotted against p together with the population qf; legend: Population qf, Empirical qf.]
133
Figure 21: Empirical quantile function Q for a sample of 1000 observations
from a N (0, 1) distribution.
[Figure: the eqf plotted against p together with the population qf; legend: Population qf, Empirical qf.]
134
Figure 22: Realization of the sample quantile process qn for a sample of n = 100
observations from a N (0, 1) distribution.
[Figure: q_n(p) plotted against p.]
135
Finite sample properties
Because sample quantiles essentially coincide with sample order statistics, we
present a result on the df and the density of the sample order statistic Z[i] ,
corresponding to the sample quantile Q(i/n).
Interest often centers not just on a single quantile, but on a finite number of
them. This is the case when we seek a detailed description of the shape of a
probability distribution, or we are interested in constructing some L-estimate,
that is, a linear combination of sample quantiles, such as the sample IQR, the
sample IDR, or a trimmed mean.
For i < j, the joint density of the pair of order statistics (Z[i], Z[j]) is
g(u, v) = n! / [(i − 1)! (j − 1 − i)! (n − j)!] · f(u) f(v) F(u)^(i−1) [F(v) − F(u)]^(j−1−i) [1 − F(v)]^(n−j),   u < v.
136
Asymptotic properties
Since the exact sampling properties of a finite set of sample quantiles are
somewhat complicated, we now consider their asymptotic properties.
Let Q̂n be a solution to (41) (to simplify notation we henceforth drop the ref-
erence to p) and again assume that Z1, . . . , Zn is a sample from a continuous
distribution with df F and finite strictly positive density f. Under this
assumption, Q = F⁻¹(p) is the unique solution to (34) for any 0 < p < 1.
Further, by the LLN the ALAD objective function
Ln(c) = n⁻¹ Σᵢ ρp(Zi − c)
converges a.s., for each c, to its population counterpart
L(c) = E ρp(Z − c).
Thus, as n → ∞, Q̂n → Q almost surely for any 0 < p < 1.
137
Asymptotic normality of a sample quantile
Because the ALAD objective function Ln(c) = n⁻¹ Σᵢ ρp(Zi − c) is not
differentiable, we cannot derive asymptotic normality of a sample quantile
Q̂n (the index p has been dropped for simplicity) by taking a 2nd-order Taylor
expansion of Ln.
138
Asymptotic joint normality of sample quantiles
Given J distinct values 0 < p1 < · · · < pJ < 1, let qn be the J-vector with elements
√n [Q̂n(pr) − Q(pr)], r = 1, . . . , J.
Extending the previous argument one can show that, if F possesses a contin-
uous density f which is positive and finite at Q1, . . . , QJ, then
qn → NJ(0, Σ) in distribution,
where Σ has generic element
σrs = [min(pr, ps) − pr ps] / [f(Qr) f(Qs)],   r, s = 1, . . . , J.   (42)
Csorgo (1983) and Shorack and Wellner (1986) discuss an analogue of the
DKW inequality and weak convergence of the sample quantile process to the
Brownian bridge with finite-dimensional distributions equal to the asymp-
totic distribution of qn.
139
Bahadur representation
Bahadur (1966) and Kiefer (1967) established the close link between the
sample quantile process qn(p) = √n [Q̂n(p) − Q(p)] and the empirical process
pn(z) = √n [F̂n(z) − F(z)].
They showed that
Q̂n(p) − Q(p) = n⁻¹ Σᵢ [p − 1{Zi ≤ Q(p)}] / f(Q(p)) + Rn,
where the remainder Rn is asymptotically negligible, so that
qn(p) = −pn(Q(p)) / f(Q(p)) + op(1).
140
Estimating the asymptotic variance of sample quantiles
From (42), a pth sample quantile has asymptotic variance
AV(Q̂n(p)) = p(1 − p) / f(Q(p))².
In particular, for the sample median (p = .50) the asymptotic variance is 1/[4 f(μ)²],
where μ = Q(.50).
Given a consistent estimate f̂ of f, the asymptotic variance may be estimated by
AV̂(Q̂n(p)) = p(1 − p) / f̂(Q̂n(p))².
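A Python sketch (illustrative only) of the plug-in standard error: the unknown density is replaced by a kernel estimate evaluated at the sample quantile.

    import numpy as np
    from scipy.stats import gaussian_kde

    rng = np.random.default_rng(0)
    z, p = rng.normal(size=2_000), 0.90

    qhat = np.quantile(z, p)                   # pth sample quantile
    fhat = gaussian_kde(z)(qhat)[0]            # kernel density estimate at qhat
    se = np.sqrt(p * (1 - p) / fhat ** 2 / z.size)
    print(qhat, se)                            # point estimate and standard error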
141
Asymptotic relative efficiency of the median to the mean
We may use result (42) to compare the asymptotic properties of the sample
mean Z̄n and the sample median μ̂n for a random sample from a rv Z with a
unimodal distribution that is symmetric about μ, with variance 0 < σ² < ∞
and density f that is strictly positive at μ.
Under these assumptions, as n → ∞,
√n (Z̄n − μ) → N(0, σ²),
√n (μ̂n − μ) → N(0, 1/[4 f(μ)²]).
Because the two estimators are asymptotically normal with the same asymp-
totic mean, their comparison may be based on the ratio of their asymptotic
variances
ARE(μ̂n, Z̄n) = AV(Z̄n) / AV(μ̂n) = 4 σ² f(μ)²,
called the asymptotic relative efficiency (ARE) of the sample median to
the sample mean.
Because Var Z̄n ≈ n⁻¹ AV(Z̄n) and Var μ̂n ≈ n⁻¹ AV(μ̂n), the ARE is equal
to the ratio of the sample sizes that the sample mean and the sample median
respectively need to attain the same level of precision (inverse of the sampling
variance). Thus, if 100 observations are needed for the sample median to attain
a given level of precision, then the sample mean would need about 100 · ARE
observations.
The ARE of the sample median increases with the peakedness f(μ) of the
density f at μ. It is easy to verify that
ARE(μ̂n, Z̄n) = 2/π ≈ .64 if Z is Gaussian, π²/12 ≈ .82 if Z is logistic, and 2 if Z is double exponential.
The table below shows the ARE of the sample median for t distributions with
m ≥ 3 dof.
m      3      4      5      8      ∞
ARE    1.62   1.12   .96    .80    .64
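A small Monte Carlo sketch in Python (illustrative only): for Gaussian data the ratio of the sampling variances of mean and median should be close to 2/π ≈ .64.

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 200, 5_000
    samples = rng.normal(size=(reps, n))

    var_mean = samples.mean(axis=1).var()          # sampling variance of the mean
    var_median = np.median(samples, axis=1).var()  # sampling variance of the median
    print(var_mean / var_median, 2 / np.pi)        # simulated ARE vs theoretical value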
142
3.4 Conditional distribution and quantile functions
Consider a random vector (X, Y), where X is Rᵏ-valued and Y is real-valued.
The conditional distribution of Y given X = x may equivalently be
characterized through either the conditional df
F(y | x) = Pr{Y ≤ y | X = x}
or the conditional qf
Q(p | x) = inf{y: F(y | x) ≥ p}.
For any fixed x, F(y | x) and Q(p | x) satisfy all properties of a df and a qf,
respectively.
To stress the relationship between the conditional qf and the conditional df,
the notations Q(p | x) = F⁻¹(p | x) and F(y | x) = Q⁻¹(y | x) will also be used.
143
Conditional and marginal dfs and qfs
The Law of Iterated Expectations implies that
E Y = E μ(X) = ∫ μ(x) dH(x),
where H denotes the marginal df of X. When comparing two subpopulations, the
difference between their unconditional means can be decomposed into two terms,
where the 1st term on the rhs reflects differences in the CMF of the two
subpopulations, while the 2nd term reflects differences in their composition.
When the two CMFs are linear, that is, μj(x) = βj′x, j = 1, 2, we have the
so-called Blinder–Oaxaca decomposition
μ1 − μ2 = (β1 − β2)′ μX,1 + β2′ (μX,1 − μX,2),
where μX,j denotes the mean of X in group j = 1, 2.
A similar decomposition holds for the difference between the two unconditional dfs,
where the 1st term on the rhs reflects differences in the cdf of the two sub-
populations, while the 2nd term reflects differences in their composition.
Simple decompositions of this kind are not available for quantiles. Machado
and Mata (2005) and Melly (2005) propose methods essentially based on in-
version of (43). Recently, Firpo, Fortin and Lemieux (2009) have proposed
an approximate method based on the recentered influence function. See
also Fortin, Lemieux and Firpo (2011).
144
Estimation when X is discrete
The methods reviewed so far can easily be extended to estimation of the con-
ditional df and qf when the covariate vector X has a discrete distribution.
In this case
F(y | x) = Pr{Y ≤ y | X = x} = Pr{Y ≤ y, X = x} / Pr{X = x}.
If O(x) = {i: Xi = x} is the set of sample points such that Xi = x and n(x)
is their number, then the sample counterpart of Pr{X = x} is the fraction
n(x)/n of sample points such that Xi = x. Hence, if n(x) > 0, a reasonable
estimate of F(y | x) is the fraction of sample points in O(x) such that Yi ≤ y,
F̂(y | x) = n(x)⁻¹ Σᵢ 1{Yi ≤ y, Xi = x} = n(x)⁻¹ Σ_{i∈O(x)} 1{Yi ≤ y}.   (44)
For these estimates to make sense, the number of data points corresponding to
each of the possible values of X ought to be sufficiently large. This approach
breaks down, therefore, when either X is continuous or X is discrete but its
support contains too many points relative to the available sample size.
145
3.5 Estimating the conditional quantile function
Let (X1 , Y1 ), . . . , (Xn , Yn ) be a sample from the joint distribution of (X, Y ).
The conditional qf Q(p | x), viewed as a function of x for a given p, may be
obtained by solving the problem
For a given p ∈ (0, 1), this suggests estimating Q(p | x) by choosing a function
of x, out of a suitable family of functions C, so as to solve the sample analogue of (45)
min_{c∈C} n⁻¹ Σᵢ ρp(Yi − c(Xi)),   0 < p < 1.
146
Linear quantile regression
In the original approach of Koenker and Bassett (1978), C is the class of
linear functions of x, that is, c(x) = b′x (unless stated otherwise, the model
always includes a constant).
In this case, the sample counterpart of problem (45) is the ALAD problem
min_{b∈Rᵏ} n⁻¹ Σᵢ ρp(Yi − b′Xi),   0 < p < 1.   (46)
If β̂(p) denotes a solution to (46), the implied estimate of the conditional qf is
Q̂(p | x) = β̂(p)′x.
By suitably redefining the elements of the covariate vector Xi, one may easily
generalize problem (46) to cases in which C is a class of functions that depend
linearly on a finite-dimensional parameter vector, such as the class of
polynomial functions of X of a given degree.
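A Python sketch (illustrative only; simulated data) fitting linear quantile regressions with statsmodels, whose QuantReg routine solves the ALAD problem (46) by iteratively reweighted least squares:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.normal(size=500)
    y = 1 + x + (1 + 0.5 * np.abs(x)) * rng.normal(size=500)   # heteroskedastic errors

    X = sm.add_constant(x)                     # include a constant, as in the notes
    for p in (0.25, 0.50, 0.75):
        beta = sm.QuantReg(y, X).fit(q=p).params
        print(p, beta)                         # one coefficient vector per quantile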
147
Computational aspects of linear quantile regression
The lack of smoothness of the ALAD objective function implies that gradi-
ent methods cannot be employed to solve (46).
However, (46) can be written as a linear program,
min_{β, U⁺, U⁻}  p ι′U⁺ + (1 − p) ι′U⁻
subject to  Y = Xβ + U⁺ − U⁻,   U⁺ ≥ 0,   U⁻ ≥ 0,
where ι is the n-vector of ones, and U⁺ and U⁻ are n-vectors with generic
elements equal to Uᵢ⁺ = max(0, Yi − β′Xi) and Uᵢ⁻ = −min(0, Yi − β′Xi)
respectively.
Simpler but cruder algorithms based on iterative WLS may also be em-
ployed.
The associated dual problem is
max_{γ ∈ [p−1, p]ⁿ}  γ′Y   subject to   X′γ = 0,
or, setting δ = γ + (1 − p) ι,
max_{δ ∈ [0, 1]ⁿ}  δ′Y   subject to   X′δ = (1 − p) X′ι.
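The primal LP above can be passed directly to a generic solver. A Python sketch (illustrative only) using scipy's linprog for the median regression case:

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    n, p = 200, 0.5
    x = rng.normal(size=n)
    y = 1 + x + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])       # design matrix with a constant
    k = X.shape[1]

    # variables (beta, U+, U-); objective p*sum(U+) + (1-p)*sum(U-)
    c = np.concatenate([np.zeros(k), p * np.ones(n), (1 - p) * np.ones(n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])          # Y = X beta + U+ - U-
    bounds = [(None, None)] * k + [(0, None)] * (2 * n)   # beta free, U+, U- >= 0
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    print(res.x[:k])                           # estimated median regression coefficients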
148
Asymptotics for linear quantile regression estimators
Let β̂n = (β̂n(p1)′, . . . , β̂n(pJ)′)′ be a kJ-vector of linear quantile regression
estimators, and consider the following assumptions:
Pr{Y ≤ y | X = x} = F(y − β′x),
where F has a continuous and strictly positive density f and a continuous and
strictly increasing qf Q such that Q(p) = 0 for some 0 < p < 1. Equivalently,
Yi = β′Xi + Ui,   i = 1, . . . , n,
with the Ui iid with df F. Under this model Q(p | x) = β(p)′x,
where the first element of β(p) is equal to β1 + Q(p), with β1 the first element
of β. If p is such that Q(p) = 0, then β(p) = β.
149
Consistency and asymptotic normality
Under Assumptions 2–4,
β̂n → β* almost surely as n → ∞,
where β* = (β(p1)′, . . . , β(pJ)′)′. Further,
√n (β̂n − β*) → N_kJ(0, Σ ⊗ P⁻¹),   (47)
where P = E(XX′) and Σ has generic element
σrs = [min(pr, ps) − pr ps] / [f(Q(pr)) f(Q(ps))],   r, s = 1, . . . , J.
It follows from this result that, if X only contains a constant term, then P = 1,
β̂n = Q̂n, and the asymptotic variance of β̂n is equal to Σ.
Suppose, in addition, that F has finite variance σ². Then, under the above two
assumptions, result (47) implies that the ARE of a linear quantile regression
estimator β̂n(p) relative to the OLS estimator β̃n is equal to
ARE(β̂n(p), β̃n) = σ² f(Q(p))² / [p(1 − p)].
150
Extensions
Koenker and Portnoy (1987) provide a uniform Bahadur representation
for linear quantile regression estimators.
They show that, under Assumptions 2–4, with probability tending to one as
n → ∞,
√n [β̂n(p) − β(p)] = P⁻¹ (1/√n) Σᵢ Xi [p − 1{Ui ≤ Q(p)}] / f(Q(p)) + op(1).   (48)
Among other things, the asymptotically linear representation (48) helps ex-
plain both the asymptotic normality and the good robustness proper-
ties of linear quantile regression estimators with respect to outliers in the
Y-space (although not to outliers in the X-space).
Both properties follow from the fact that the influence function of β̂(p) is
equal to
P⁻¹ Xi [p − 1{Ui ≤ Q(p)}] / f(Q(p)),
which is a bounded function of Ui (although not of Xi).
151
Drawbacks of linear quantile regression estimators
Although increasingly used in empirical work to describe the conditional dis-
tribution of an outcome of interest, linear quantile regression estimators have
several drawbacks.
152
Behavior under heteroskedasticity
To illustrate the problem, let X be a scalar rv and suppose first that Y =
μ(X) + U, where the regression error U is independent (not just mean in-
dependent) of X, with continuous strictly increasing df F. Then F(y | x) =
F(y − μ(x)). By definition, the pth conditional quantile of Y satisfies
F(Q(p | x) | x) = p.
Hence
Q(p | x) = Q(p) + μ(x),
where Q(p) = F⁻¹(p) is the pth quantile of F. Thus, for any p ≠ p′,
Q(p′ | x) − Q(p | x) = Q(p′) − Q(p)
for all x, that is, the distance between any pair of conditional quantiles is
independent of x. If μ(x) = α + βx, then Q(p | x) = [α + Q(p)] + βx, that is,
the conditional quantiles of Y are a family of parallel lines with common
slope β.
Now suppose that μ(x) = α + βx but Y is conditionally heteroskedastic,
that is, Y = α + βX + σ(X) U, where the function σ(x) is strictly positive.
The homoskedastic model is a special case where σ(x) = 1 for all x. Now
F(y | x) = F((y − α − βx) / σ(x)),
so
Q(p | x) = α + βx + σ(x) Q(p).
In this case:
although μ(x) is linear in x, conditional quantiles need not be;
the distance between any pair of conditional quantiles depends on x,
Q(p′ | x) − Q(p | x) = σ(x) [Q(p′) − Q(p)].
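A short Python sketch (illustrative only) of the heteroskedastic case used in the figures below, with μ(x) = 1 + x and σ²(x) = 1 + (2x + .5)²: the population quantiles Q(p | x) = 1 + x + σ(x)Q(p) are clearly not parallel in x.

    import numpy as np
    from scipy.stats import norm

    x = np.linspace(-2, 2, 5)
    sigma = np.sqrt(1 + (2 * x + 0.5) ** 2)    # conditional standard deviation
    for p in (0.1, 0.5, 0.9):
        print(p, 1 + x + sigma * norm.ppf(p))  # Q(p | x) on the grid of x values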
153
Figure 23: Quantiles of Y | X = x ~ N(μ(x), σ²(x)) when μ(x) = 1 + x and
either σ²(x) = 1 (homoskedasticity) or σ²(x) = 1 + (2x + .5)² (heteroskedas-
ticity).
[Figure: Q(p | x) plotted against x; left panel homoskedasticity, right panel heteroskedasticity.]
154
Implications
Linear quantile regressions may be a poor approximation to popula-
tion quantiles when data are conditionally heteroskedastic and the
square root of the conditional variance function is far from being linear
in x.
where α(p) = α0 + σ0² Q(p), β(p) = β1 + 2σ0σ1 Q(p) and γ(p) = σ1² Q(p) all depend
on p. □
155
Interpreting linear quantile regressions
Consider the ordinary (mean) regression model Y = β′X + U, where U and
X are uncorrelated. In this model, β′X may equivalently be interpreted
as the best linear predictor of Y given X, or as the best linear approxi-
mation to the conditional mean μ(X) = E(Y | X),
where the first interpretation involves an expectation with respect to the joint distribution of X and
Y, while the second involves the marginal distribution of X.
A linear quantile regression admits a similar interpretation as a weighted approximation
to the conditional qf, with weights
w(x, b) = ∫₀¹ (1 − u) f(Q(p | x) + u Δ(x, b) | x) du,
where Δ(x, b) = b′x − Q(p | x).
For an alternative proof of this result, see Theorem 2.7 in Koenker (2005).
156
Inference under heteroskedasticity
From the practical point of view, heteroskedasticity implies that estimated lin-
ear quantile regressions may cross each other, thus violating a fundamental
property of quantiles and complicating the interpretation of the results of
a statistical analysis.
where
Bj = E[f(X′βj | X) X X′]
and f(y | x) denotes the conditional density of Y given X = x.
157
Figure 24: Scatterplot of the data and estimated quantiles for a random
sample of 200 observations from the model Y | X ~ N(μ(X), σ²(X)), with
X ~ N(0, 1), μ(X) = 1 + X and either σ²(X) = 1 (homoskedasticity) or
σ²(X) = 1 + (2X + .5)² (heteroskedasticity).
[Figure: top panels, scatterplots of Y against x; bottom panels, estimated quantiles against x; left column homoskedasticity, right column heteroskedasticity.]
158
Nonparametric quantile regression estimators
To overcome the problems arising with the linearity assumption, several non-
parametric quantile regression estimators have been proposed. These include:
locally linear fitting (Yu & Jones 1998), that is, Q̂(p | x) = β̂(p; x)′x,
with
β̂(p; x) = argmin_{b∈Rᵏ} n⁻¹ Σᵢ ρp(Yi − b′Xi) Wi(x),   0 < p < 1,
where the Wi(x) are nonnegative weights that add up to one and, typi-
cally, give more weight to Y-values for which Xi is closer to x.
Drawbacks
Curse-of-dimensionality problem: It is not clear how to general-
ize these estimators to cases when there are more than two or three
covariates.
159
3.6 Estimating the conditional distribution function
If the interest is not merely in a few quantiles but in the whole conditional dis-
tribution of Y given a random k-vector X, why not estimate the conditional
df F(y | x) directly?
A natural class of estimators consists of weighted averages of the indicators 1{Yi ≤ y},
F̂(y | x) = Σᵢ Wi(x) 1{Yi ≤ y},
with weights satisfying
Wi(x) ≥ 0, i = 1, . . . , n,   Σᵢ Wi(x) = 1.
Estimators of this kind tend to do rather poorly when data are sparse,
which is typically the case when k is greater than 2 or 3.
160
Avoiding the curse of dimensionality
We now describe a simple semi-nonparametric method that appears to
perform well even when k is large relative to the available data. The basic
idea is to partition the range of Y into J + 1 intervals defined by a set of
knots
−∞ < y1 < . . . < yJ < ∞,
and then estimate the J functions F1(x), . . . , FJ(x), where Fj(x) = F(yj | x) =
Pr{Y ≤ yj | X = x}.
Each Fj(x) is modeled through its conditional log odds
λj(x) = ln [Fj(x) / (1 − Fj(x))],
or, equivalently,
Fj(x) = exp λj(x) / [1 + exp λj(x)].
161
Estimation
By the analogy principle, λj may be estimated by maximizing the sample log
likelihood
L(λ) = Σᵢ [1{Yi ≤ yj} λ(Xi) − ln(1 + exp λ(Xi))].
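A Python sketch (illustrative only; simulated data) of the estimation step: for each knot y_j, the indicator 1{Y ≤ y_j} is regressed on X by a binary logit, which maximizes exactly the log likelihood above when λj(x) = θj′x as in Example 28 below.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.normal(size=1_000)
    y = 1 + x + rng.normal(size=1_000)
    X = sm.add_constant(x)

    knots = np.quantile(y, [0.25, 0.50, 0.75])           # J = 3 knots
    for yj in knots:
        theta = sm.Logit((y <= yj).astype(float), X).fit(disp=0).params
        Fj_at_0 = 1 / (1 + np.exp(-theta @ [1.0, 0.0]))  # estimated F(y_j | x = 0)
        print(yj, Fj_at_0)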
162
Example 28 The simplest case is when λ(y | x) = θ(y)′x. Let θ = (θ1′, . . . , θJ′)′
and λ(x) = (λ1(x), . . . , λJ(x))′ = (IJ ⊗ x′) θ,
where IJ is the J × J unit matrix. Finally, let F̂n(x), θ̂n and λ̂n(x) = (IJ ⊗ x′) θ̂n
be the estimators of F(x), θ and λ(x) respectively.
The asymptotic variance of √n (θ̂n − θ) is I⁻¹, where
I = E[V(X) ⊗ XX′]
and V(x) is a J × J matrix with generic element vjk(x).
Hence, as n → ∞,
√n [λ̂n(x) − λ(x)] → NJ(0, A(x)),
where
A(x) = (IJ ⊗ x′) I⁻¹ (IJ ⊗ x).
Therefore, as n → ∞,
√n [F̂n(x) − F(x)] → NJ(0, Ω(x)),
where
Ω(x) = V(x) A(x) V(x)′. □
163
Figure 25: Estimated conditional dfs of log monthly earnings (Peracchi 2002).
[Figure: one panel per country (France, Germany, Greece, Ireland, Italy, Luxembourg, Netherlands, Portugal, Spain, UK); horizontal axis experience, vertical axis estimated probability.]
164
Imposing monotonicity
While our approach automatically imposes the nonnegativity condition (49),
it does not guarantee that the monotonicity condition (50) also holds.
Since λj(x) is strictly increasing in Fj(x), monotonicity is equivalent to the
condition that
−∞ < λ1(x) < . . . < λJ(x) < ∞
for all x in the support of X.
If λj(x) = αj + βj(x), then monotonicity holds if
αj > αj−1,   βj(x) ≥ βj−1(x).
One case where these two conditions are satisfied is the ordered logit model.
This model is restrictive, however, for it implies that changes in the covariate
vector X affect the conditional distribution of Y only through a location
shift.
An alternative is to model F1(x) and the conditional probabilities or discrete
hazards
πj(x) = Pr{Y ≤ yj | Y > yj−1, X = x} = [Sj−1(x) − Sj(x)] / Sj−1(x),   j = 2, . . . , J,
where Sj(x) = 1 − Fj(x) is the survivor function evaluated at yj.
Using the recursion
Sj(x) = [1 − πj(x)] Sj−1(x),   j = 2, . . . , J,
we get
Sk(x) = S1(x) ∏_{j=2}^k [1 − πj(x)],   k = 2, . . . , J,
that is,
Fk(x) = 1 − [1 − F1(x)] ∏_{j=2}^k [1 − πj(x)],   k = 2, . . . , J.
If F1(x) and the πj(x) are modeled to guarantee that 0 < F1(x) < 1 and 0 <
πj(x) < 1, then both monotonicity and the constraint (49) are automatically
satisfied.
165
Figure 26: Estimated conditional dfs of log monthly earnings imposing mono-
tonicity (Peracchi 2002).
[Figure: one panel per country (France, Germany, Greece, Ireland, Italy, Luxembourg, Netherlands, Portugal, Spain, UK); horizontal axis experience, vertical axis estimated probability.]
166
Extensions: autoregressive models
Consider the discrete-time univariate AR(1) process
Yt = ρ Yt−1 + σ Ut,
where |ρ| < 1, σ > 0 and the {Ut} are iid with zero mean and marginal df G.
The conditional df of Yt given Yt−1 = x is
F(y | x) = Pr{Yt ≤ y | Yt−1 = x} = G((y − ρx) / σ).   (51)
The assumptions implicit in (51) are strong. As an alternative, one may retain
the assumption that F(y | x) is time-invariant and apply the results of the
previous section by letting Xt = Yt−1, that is,
F(y | x) = Pr{Yt ≤ y | Xt = x}.
167
3.7 Relationships between the two approaches
Koenker, Leorato and Peracchi (2013) ask the general question: how does the
distributional regression (DR) approach outlined in the previous section
relate to the quantile regression (QR) approach?
168
From log odds to quantiles
If F(y | x) is continuous, the fact that F(Q(p | x) | x) = p implies
λ(Q(p | x) | x) = ln [p / (1 − p)],   p ∈ (0, 1).
Next notice that conditional quantiles are linear in x if and only if the partial derivative
Qx(p | x) does not depend on x. If the conditional log odds are linear in x,
that is, λ(y | x) = α(y) + β(y) x, then a quantile regression is linear in x if and only if
Qx(p | x) = −β(y) / [α′(y) + β′(y) x] evaluated at y = Q(p | x)
does not depend on x. Sufficient conditions are: (i) α(y) is linear in y, and
(ii) β(y) does not depend on y (as with the ordered logit model).
169
From quantiles to log odds
If Q(p | x) is a known continuously differentiable function of (p, x) such that
Q(p | x) = y, where y is a fixed number, then
Fx(y | x) = −Qx(p | x) f(y | x), evaluated at p = F(y | x),
or, equivalently,
λx(y | x) = −Qx(p | x) λy(y | x), evaluated at p = F(y | x).
If conditional quantiles are linear in x, that is, Q(p | x) = α(p) + β(p) x, then the
conditional log odds are linear in x if and only if
λx(y | x) = −β(p) λy(y | x), evaluated at p = F(y | x),
does not depend on x. Sufficient conditions are: (i) β(p) does not depend on
p, and (ii) λy(y | x) does not depend on x.
170
3.8 Stata commands
We now briefly review the commands available in Stata, version 12.
These include the qreg command for linear conditional quantile estimation,
the associated iqreg, sqreg and bsqreg commands, and the post-estimation
tools in qreg postestimation.
At the moment, Stata only offers the cumul command for estimating univariate distribu-
tion functions and has no command for estimating conditional dfs.
171
The qreg command
The basic syntax is:
qreg depvar indepvars [if] [in] [weight] [, qreg options]
where qreg options includes:
level(#): sets the confidence level. The default is level(95) (95 percent).
Notice that the standard errors produced by qreg are based on the ho-
moskedasticity assumption and should not be trusted.
172
Other commands
The iqreg, sqreg and bsqreg commands all assume linearity of conditional
quantiles but estimate the variance matrix of the estimators (VCE) via the
bootstrap. Their syntax is similar to that of qreg. For example,
iqreg depvar indepvars [if] [in] [weight] [, iqreg options]
Remarks:
173
Post-estimation tools
The following postestimation commands are available for qreg, iqreg, sqreg
and bsqreg:
lincom: point estimates, standard errors, testing, and inference for linear
combinations of the coefficients;
nlcom: point estimates, standard errors, testing, and inference for non-
linear combinations of the coefficients.
174
References
Angrist J., Chernozhukov V. and Fernandez-Val I. (2006) Quantile Regression un-
der Misspecication, with an Application to the U.S. Wage Structure. Econo-
metrica, 74: 539563.
Bassett G. and Koenker R. (1978) The Asymptotic Distribution of the Least Ab-
solute Error Estimator. Journal of the American Statistical Association, 73:
618622.
Cleveland W.S. (1979) Robust Locally Weighted Regression and Smoothing Scat-
terplots. Journal of the American Statistical Association, 74: 829836.
Cleveland W.S. and Devlin S.J. (1988) Locally Weighted Regression: An Approach
to Regression Analysis by Local Fitting. Journal of the American Statistical
Association, 93: 596610.
175
De Angelis D., Hall P. and Young G.A. (1993) Analytical and Bootstrap Approxi-
mations to Estimator Distributions in L1 Regression. Journal of the American
Statistical Association, 88: 13101316.
de Boor C. (1978) A Practical Guide to Splines, Springer, New York.
De Luca G. (2008) SNP and SML Estimation of Univariate and Bivariate Binary-
Choice Models. Stata Journal, 8: 190220.
De Luca G., and Perotti V. (2011) Estimation of Ordered Response Models with
Sample Selection. Stata Journal, 11: 213239.
Engle R.F., Granger C.W.J., Rice J.A. and Weiss A. (1986) Semiparametric Esti-
mates of the Relationship Between Weather and Electricity Sales. Journal of
the American Statistical Association, 81: 310320.
Eubank R.L. (1988) Spline Smoothing and Nonparametric Regression, Dekker, New
York.
Eubank R.L. and Spiegelman C.H. (1990) Testing the Goodness of Fit of a Linear
Model Via Nonparametric Regression Techniques. Journal of the American
Statistical Association, 85: 387–392.
Fan J. (1992) Design-adaptive Nonparametric Regression. Journal of the American
Statistical Association, 87: 9981004.
Fan J. and Gijbels I. (1996) Local Polynomial Modelling and Its Applications, Chap-
man and Hall, London.
Fan J., Heckman N.E. and Wand M.P. (1995) Local Polynomial Regression for
Generalized Linear Models and Quasi-likelihood Functions. Journal of the
American Statistical Association, 90: 141150.
Firpo S., Fortin N. and Lemieux T. (2009) Unconditional Quantile Regressions.
Econometrica, 77: 953973.
Fortin N., Lemieux T. and Firpo S. (2011) Decomposition Methods in Economics.
In Ashenfelter O. and Card D. (eds.) Handbook of Labor Economics, Vol. 4a,
pp. 2102, Elsevier, Amsterdam.
Friedman J.H. and Stuetzle W. (1981) Projection Pursuit Regression. Journal of
the American Statistical Association, 76: 817823.
176
Gallant A.R. and Nychka D.W. (1987) Semi-Nonparametric Maximum Likelihood
Estimation. Econometrica, 55: 363390.
Good I.J. and R.A. Gaskins (1971) Nonparametric Roughness Penalties for Prob-
ability Densities. Biometrika, 58: 255-277.
Green P.J. and Silverman B.W. (1994) Nonparametric Regression and Generalized
Linear Models, Chapman and Hall, London.
Hall P., Wolff R.C.L. and Yao Q. (1999) Methods for Estimating a Conditional
Distribution Function. Journal of the American Statistical Association, 94:
154–163.
Hastie T.J. and Loader C.L. (1993) Local Regression: Automatic Kernel Carpentry
(with discussion). Statistical Science, 8: 120143.
Hastie T.J. and Tibshirani R.J. (1990) Generalized Additive Models, Chapman and
Hall, London.
177
Koenker R. and Bassett G. (1982) Robust Tests for Heteroskedasticity Based on
Regression Quantiles. Econometrica, 50: 4361.
Koenker R., Leorato S., and Peracchi F. (2013) Distributional vs. Quantile Regres-
sion. Mimeo.
Koenker R. and Machado J.A.F. (1999) Goodness of Fit and Related Inference
Processes for Quantile Regression. Journal of the American Statistical Asso-
ciation, 94: 12961310.
Marron J.S. and Nolan D. (1988) Canonical Kernels for Density Estimation. Statis-
tics and Probability Letters, 7: 195199.
Newson R.B. (2012) Sensible Parameters for Univariate and Multivariate Splines.
Stata Journal, 12: 479504.
178
Portnoy S. and Koenker R. (1997) The Gaussian Hare and the Laplacian Tortoise:
Computability for Squared-Error versus Absolute-Error Estimators (with dis-
cussion). Statistical Science, 12: 279300.
Ruppert D., Wand M.P. and Carroll R.J. (2003) Semiparametric Regression, Cam-
bridge University Press, New York.
Sheather S.J. and Jones M.C. (1991) A Reliable Data-Based Bandwidth Selection
Method for Kernel Density Estimation. Journal of the Royal Statistical Soci-
ety, Series B, 53: 683–690.
Schucany W.R. and Sommers J.P. (1977) Improvement of Kernel Type Density
Estimators. Journal of the American Statistical Association, 72: 420423.
Silverman B.W. (1986) Density Estimation for Statistics and Data Analysis, Chap-
man and Hall, New York.
Stone C.J. (1982) Optimal Global Rates of Convergence for Nonparametric Re-
gression. Annals of Statistics, 10: 10401053.
Stone C.J. (1984) An Asymptotically Optimal Window Selection Rule for Kernel
Density Estimates. Annals of Statistics, 12: 12851297.
Stone, C.J. (1985) Additive regression and other nonparametric models. Annals of
Statistics, 13: 689705.
179
Vapnik V.N. (1995) The Nature of Statistical Learning Theory, Springer, New York.
Wahba G. (1990) Spline Models for Observational Data, SIAM, Philadelphia, PA.
Yu K. and Jones M.C. (1998) Local Linear Quantile Regression. Journal of the
American Statistical Association, 93: 228237.
180