
Applied Nonparametric Methods

Franco Peracchi
University of Rome Tor Vergata and EIEF
Spring 2014
Contents
1 Nonparametric Density Estimators 2
1.1 Empirical densities . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 The kernel method . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.3 Statistical properties of the kernel method . . . . . . . . . . . . 22
1.4 Other methods for univariate density estimation . . . . . . . . . 40
1.5 Multivariate density estimators . . . . . . . . . . . . . . . . . . 44
1.6 Stata commands . . . . . . . . . . . . . . . . . . . . . . . . . . 50

2 Nonparametric Regression Estimators 53


2.1 Polynomial regressions . . . . . . . . . . . . . . . . . . . . . . . 56
2.2 Regression splines . . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.3 The kernel method . . . . . . . . . . . . . . . . . . . . . . . . . 62
2.4 The nearest neighbor method . . . . . . . . . . . . . . . . . . . 66
2.5 Cubic smoothing splines . . . . . . . . . . . . . . . . . . . . . . 72
2.6 Statistical properties of linear smoothers . . . . . . . . . . . . . 75
2.7 Average derivatives . . . . . . . . . . . . . . . . . . . . . . . . . 83
2.8 Methods for high-dimensional data . . . . . . . . . . . . . . . . 85
2.9 Stata commands . . . . . . . . . . . . . . . . . . . . . . . . . . 98

3 Distribution function and quantile function estimators 102


3.1 Distribution functions and quantile functions . . . . . . . . . . . 102
3.2 The empirical distribution function . . . . . . . . . . . . . . . . 116
3.3 The empirical quantile function . . . . . . . . . . . . . . . . . . 126
3.4 Conditional distribution and quantile functions . . . . . . . . . 143
3.5 Estimating the conditional quantile function . . . . . . . . . . . 146
3.6 Estimating the conditional distribution function . . . . . . . . . 160
3.7 Relationships between the two approaches . . . . . . . . . . . . 168
3.8 Stata commands . . . . . . . . . . . . . . . . . . . . . . . . . . 171

1 Nonparametric Density Estimators
Let the data Z1, . . . , Zn be a sample from the distribution of a random vector Z ∈ 𝒵 ⊆ R^m. We are interested in the general problem of estimating the (probability) distribution of Z nonparametrically, that is, without restricting it to belong to some known parametric family.

We begin with the problem of how to estimate nonparametrically the density function of Z.

Although there are several other characterizations of a distribution (e.g., the distribution function, the quantile function, the characteristic function, the hazard function), there may be advantages in focusing on the density:

- The graph of a density may be easier to interpret if one is interested in aspects such as symmetry or multimodality.

- Estimates of certain population parameters, such as the mode, are more easily obtained from an estimate of the density.

Use of nonparametric density estimates
Nonparametric density estimates may be used for:

- Exploratory data analysis: What is the shape of the distribution of Z?

- Estimating qualitative features of a distribution, such as unimodality, skewness, etc.

- Specification and testing of parametric models (e.g., do the data provide support for a Gaussian model?).

- Estimation of statistical functionals: Many statistical problems involve estimation of a population parameter that can be represented as a statistical functional θ = T(f) of the population density f. In these cases, the analogy principle suggests estimating θ by θ̂ = T(f̂), where f̂ is a reasonable estimate of f, possibly a nonparametric one.

Example 1 Given an estimate f̂ of f, a natural estimate of the mode ζ = argmax_{z∈𝒵} f(z) of f is the mode ζ̂ = argmax_{z∈𝒵} f̂(z) of f̂. □

Example 2 Let

   F(z) = ∫_{−∞}^z f(u) du

be the value of the distribution function (df) of a random variable (rv) Z at z. Given an estimate f̂ of f, a natural estimate of F(z) is

   F̂(z) = ∫_{−∞}^z f̂(u) du.

Example 3 A characterization of the distribution of a continuous non-negative rv Z (e.g. the length of an unemployment spell) with df F and density function f is the hazard function

   λ(z) = lim_{δ→0+} Pr{z < Z ≤ z + δ | Z > z} / δ
        = lim_{δ→0+} [F(z + δ) − F(z)] / [δ (1 − F(z))]
        = [1/(1 − F(z))] lim_{δ→0+} [F(z + δ) − F(z)] / δ
        = f(z) / (1 − F(z)),   z ≥ 0.

If f̂ is a nonparametric estimate of f and F̂(z) = ∫_0^z f̂(u) du is the derived nonparametric estimate of F, then a nonparametric estimate of λ(z) is

   λ̂(z) = f̂(z) / (1 − F̂(z)).

The nonparametric estimate λ̂(z) may then be compared with the benchmark parametric estimate based on the exponential distribution, which has constant hazard λ(z) = λ > 0. □

Example 4 Let Z = (X, Y) be a random vector with two elements (m = 2) and consider the conditional mean function (CMF) or mean regression function of Y given X, defined by

   μ(x) = E(Y | X = x) = ∫ y [f(x, y) / f_X(x)] dy,

where f(x, y) is the joint density of X and Y and

   f_X(x) = ∫ f(x, y) dy

is the marginal density of X. Given a nonparametric estimate f̂(x, y) of f(x, y), a nonparametric estimate of μ(x) is

   μ̂(x) = ∫ y [f̂(x, y) / f̂_X(x)] dy,   (1)

where f̂_X(x) = ∫ f̂(x, y) dy is a nonparametric estimate of f_X. □

1.1 Empirical densities
We begin with the simpler problem of estimating univariate densities (m = 1), and discuss the problem of estimating multivariate densities (m > 1) later in Section 1.5.

Thus, let Z1, . . . , Zn be a sample from the distribution of a univariate rv Z with df F and density function f. The case when the data are a finite segment of a strictly stationary process is discussed in Pagan & Ullah (1999).

The choice of an appropriate method for estimating f depends on the nature, discrete or continuous, of the rv Z.

Discrete Z
If Z is discrete with a distribution that assigns positive probability mass to the values z1, z2, . . ., then the probability fj = Pr{Z = zj} may be estimated by the relative frequency

   f̂j = n⁻¹ Σ_{i=1}^n 1{Zi = zj},   j = 1, 2, . . . ,

called the empirical probability that Z = zj.

This is just the sample average of n iid binary rvs X1j, . . . , Xnj, where

   Xij = 1{Zi = zj},   i = 1, . . . , n,

has a Bernoulli distribution with mean fj and variance fj(1 − fj). Thus, f̂j is unbiased for fj, that is, E f̂j = fj, and its sampling variance is

   Var f̂j = fj(1 − fj) / n,   j = 1, 2, . . . ,

which is typically estimated by n⁻¹ f̂j(1 − f̂j).

It is easy to verify that f̂j is consistent for fj, that is, f̂j →p fj. It is also √n-consistent and asymptotically normal, that is,

   √n (f̂j − fj) ⇒ N(0, AV(f̂j)),

where AV(f̂j) = fj(1 − fj) can be estimated consistently by ÂV(f̂j) = f̂j(1 − f̂j). This justifies the use of a symmetric asymptotic confidence interval (CI) for fj of the form

   CI_{1−2α}(fj) = f̂j ± z(α) [f̂j(1 − f̂j) / n]^{1/2},

where z(α) is the upper αth percentile of the N(0, 1) distribution (e.g. z(.025) = 1.96). Notice that this CI is problematic because it may contain values less than zero or greater than one.
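As a small illustration of these formulas, the following sketch (written in Python for exposition only, not in the course's Stata; the function and variable names are hypothetical) computes the empirical probabilities and the symmetric asymptotic confidence intervals for a discrete sample.

    import numpy as np
    from scipy.stats import norm

    def empirical_probabilities(z, alpha=0.025):
        """Empirical probabilities f_hat_j with symmetric asymptotic CIs."""
        z = np.asarray(z)
        n = z.size
        values, counts = np.unique(z, return_counts=True)
        f_hat = counts / n                        # relative frequencies
        se = np.sqrt(f_hat * (1 - f_hat) / n)     # estimated standard errors
        z_alpha = norm.ppf(1 - alpha)             # upper alpha-th percentile of N(0, 1)
        return values, f_hat, f_hat - z_alpha * se, f_hat + z_alpha * se

As noted above, the resulting intervals may fall outside [0, 1], which is one reason to prefer intervals built on a transformed scale or by the bootstrap.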

Continuous Z
The previous approach breaks down when Z is continuous because the relative frequency of a specific value z in any sample is either zero or very small.

Notice however that, if z is any value in a sufficiently small interval (a, b] then, by the definition of probability density,

   f(z) ≈ Pr{a < Z ≤ b} / (b − a).

Thus, a natural estimate of f(z) is the fraction of observations falling in the (small) interval (a, b] divided by the length b − a of such interval,

   f̂(z) = [1 / (n(b − a))] Σ_{i=1}^n 1{a < Zi ≤ b},   a < z ≤ b,   (2)

called the empirical density at z.

Notice that, if no sample value is repeated and the interval (a, b] is small enough, then it contains at most one observation, so f̂(z) is equal either to zero or to [n(b − a)]⁻¹.

Sampling properties
The numerator of (2) is the sum of n iid rvs Xi = 1{a < Zi ≤ b}, each having a Bernoulli distribution with mean

   E Xi = Pr{a < Z ≤ b} = F(b) − F(a)

and variance

   Var Xi = [F(b) − F(a)][1 − F(b) + F(a)].

Thus,

   E f̂(z) = [F(b) − F(a)] / (b − a),

so f̂(z) is generally biased for f(z), although its bias is negligible if b − a is sufficiently small. Notice that f̂(z) is unbiased for f(z) if the df F is linear on an interval containing (a, b] or, equivalently, the density f is constant on such interval.

Further,

   Var f̂(z) = [F(b) − F(a)][1 − F(b) + F(a)] / [n(b − a)²]
             = [1/(n(b − a))] [F(b) − F(a)]/(b − a) − (1/n) {[F(b) − F(a)]/(b − a)}².

If b − a is sufficiently small, then

   Var f̂(z) ≈ [1/(n(b − a))] f(z) − (1/n) f(z)²,

which shows that letting b − a → 0 reduces the bias of f̂(z), but also increases its sampling variance. Thus, letting b − a → 0 as n → ∞ is not enough for f̂(z) to be consistent for f(z). This trade-off between bias (systematic error) and sampling variance (random error) is typical of nonparametric estimation problems with n fixed.

For f̂(z) →p f(z) as n → ∞, we need both b − a → 0 and n(b − a) → ∞, that is, the rate at which b − a → 0 must be slower than the rate at which n → ∞.

Histograms
Histograms are classical examples of empirical densities.

Construction of a histogram requires:

- The choice of an interval (a0, b0] that contains the range of the data.

- A partition of (a0, b0] into J bins (c_{j−1}, cj]. The simplest case is when the bins are of constant width h = (b0 − a0)/J, so

   cj = c_{j−1} + h = a0 + jh,   j = 1, . . . , J − 1,

with c0 = a0 and cJ = b0.

The ith sample point Zi falls in the jth bin if c_{j−1} < Zi ≤ cj. For any point z in the jth bin, a histogram estimate of f(z) is

   f̂(z) = [1/(n hj)] Σ_{i=1}^n 1{c_{j−1} < Zi ≤ cj},

the fraction of observations falling in the jth bin divided by the bin width hj = cj − c_{j−1}. This is also the estimate of f(z′) for any other z′ falling in the same bin. It is easy to verify that the function f̂ is a proper density, that is, f̂(z) ≥ 0 for all z and ∫ f̂(z) dz = 1.

In the case of equally spaced bins, where hj = h for all bins, the histogram estimate becomes

   f̂(z) = [1/(nh)] Σ_{i=1}^n 1{c_{j−1} < Zi ≤ cj}.
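A minimal sketch of the equally spaced histogram estimator, in Python and purely for illustration (names are hypothetical; Stata's histogram command, discussed in Section 1.6, does this automatically):

    import numpy as np

    def histogram_density(z, n_bins=10):
        """Histogram density estimate with equally spaced bins."""
        z = np.asarray(z)
        counts, edges = np.histogram(z, bins=n_bins)   # bin counts and bin edges
        h = edges[1] - edges[0]                        # constant bin width
        f_hat = counts / (z.size * h)                  # fraction in bin / bin width
        return f_hat, edges

    # f_hat[j] estimates f(z) for any z in the jth bin (edges[j], edges[j+1]].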

Figure 1: Histogram estimates of density for different numbers of bins (panels: population density, 5 bins, 25 bins, 50 bins). The data are 200 observations from the Gaussian mixture .6 N(0, 1) + .4 N(4, 4).

Sampling properties of the histogram estimator
Let Z1, . . . , Zn be a sample from a continuous distribution on the interval (0, 1], with df F and density f. Put a0 = 0, b0 = 1, and partition the interval (0, 1] into J bins of constant width h = 1/J. If Xj denotes the number of observations falling in the jth bin, then the histogram estimator at any point z in the jth bin is f̂(z) = Xj/(nh).

The random J-vector (X1, . . . , XJ) has a multinomial distribution with index n and parameter π = (π1, . . . , πJ), where

   πj = F(cj) − F(c_{j−1}) = F(cj) − F(cj − h),   j = 1, . . . , J.

Hence

   Pr{X1 = x1, . . . , XJ = xJ} = [n! / (x1! · · · xJ!)] Π_{j=1}^J πj^{xj},

where xj = 0, 1, . . . , n. By the properties of the multinomial distribution, Xj has mean nπj and variance nπj(1 − πj). Hence

   E f̂(z) = πj / h,   Var f̂(z) = πj(1 − πj) / (nh²).

Thus, the histogram estimator is biased for f(z) in general, and its bias is just the error made in approximating f(z) by πj/h.

Now let the number of bins J = Jn increase with the sample size n or, equivalently, let the bin width hn = 1/Jn decrease with n. Let {(c_{jn−1}, c_{jn}]} denote the sequence of bins that contain the point z and let π_{jn} = F(c_{jn}) − F(c_{jn} − hn). Then

   E f̂(z) = π_{jn} / hn = [F(c_{jn}) − F(c_{jn} − hn)] / hn → f(z),

provided that hn → 0 as n → ∞. Further

   Var f̂(z) = [1/(nhn)] (π_{jn}/hn) − (1/n) (π_{jn}/hn)².

Thus, for f̂(z) →p f(z), we need not only that hn → 0 as n → ∞, but also that nhn → ∞ or, equivalently, that Jn/n → 0 as n → ∞. That is, hn must go to zero, or equivalently J must increase with n, but not too fast.

Drawbacks of histograms
Histograms are useful tools for exploratory data analysis, but have several undesirable features, such as:

- Results depend on the partition of (a0, b0] into bins, that is, on their number and position.

- They also depend on the choice of the range (a0, b0].

- The histogram is a step function with jumps at the end of each bin. Thus, it is impossible to incorporate prior information on the degree of smoothness of a density.

- The method may also create difficulties when estimates of the derivatives of the density are needed as input to other statistical procedures.

In the case of equally spaced bins:

- The partition of (a0, b0] depends only on the number J of bins or, equivalently, on the bin width h = (b0 − a0)/J.

- Given the data, increasing J (reducing h) eventually produces a histogram that is only informative about the location of the distinct sample points.

- By contrast, reducing J (increasing h) eventually produces a completely uninformative rectangle. It is intuitively clear, however, that J may safely be increased if the sample size n also increases.

- The fact that the bin width h is kept fixed over the range of the data may lead to loss of detail where the data are most concentrated. If h is reduced to deal with this problem, then spurious noise may appear in regions where data are sparse.

1.2 The kernel method
We now present a related method that tries to overcome some of the drawbacks of histograms.

Consider the empirical density (2). Putting a = z − h and b = z + h, where h is a small positive number, gives

   f̂(z) = [1/(2nh)] Σ_{i=1}^n 1{z − h < Zi ≤ z + h}.

This is sometimes called a naive density estimate or, for reasons that will be clear soon, a uniform kernel density estimate.

Notice that f̂(z) is just the fraction of sample points falling in the interval (z − h, z + h] divided by the length 2h of this interval. Equivalently, f̂(z) is the sample average of a binary indicator which is equal to (2h)⁻¹ if Zi is within h distance from z and is equal to zero otherwise.

Because z − h < Zi ≤ z + h if and only if Zi − h ≤ z < Zi + h, the estimate f̂(z) may also be constructed by the following two-step procedure:

1. Place a box of width equal to 2h and height equal to (2nh)⁻¹ around each sample observation.

2. Add up the heights of the boxes that contain the point z; the result is equal to f̂(z).

If we constructed a histogram of constant bin width 2h having z at the center of one of its bins, then f̂(z) would coincide with the histogram estimate. An advantage over the histogram method is that there is no need to choose an interval (a0, b0] and to partition this interval into bins. However, f̂(z) still has the following drawbacks:

- it is a step function with jumps at the points z = Zi ± h, so it is not smooth;

- it depends on the choice of the constant h.

Figure 2: Uniform kernel density estimates for different bandwidths (bwidth = .5, 1.0, 1.5; true vs. estimated). The data are 200 observations from the Gaussian mixture .6 N(0, 1) + .4 N(4, 4).

Smooth kernel density estimates
Because the event z − h < Zi ≤ z + h is equivalent to the event −h < Zi − z ≤ h, which in turn is equivalent to the event −1 ≤ (z − Zi)/h < 1, the density estimate f̂(z) may also be written

   f̂(z) = [1/(nh)] Σ_{i=1}^n w((z − Zi)/h),

where

   w(u) = 1/2 if −1 ≤ u < 1, and w(u) = 0 otherwise,   (3)

is a symmetric bounded non-negative function that integrates to one and corresponds to the density of a uniform distribution on the interval [−1, 1]. This estimate may be viewed as the average of n uniform densities with common variance equal to h²/3, each centered about one of the sample observations.

It is now clear that f̂ is not smooth because it is a (weighted) sum of step functions. Since the sum of smooth functions is itself smooth, replacing w by a smooth function K gives an estimate of f which inherits all the continuity and differentiability properties of K.

This leads to the class of estimates of f(z) of the form

   f̂(z) = [1/(nh)] Σ_{i=1}^n K((z − Zi)/h),

where K is a bounded function called the kernel function and h is a positive constant called the bandwidth. Notice that f̂(z) is simply an average of transformed observations, the type of transformation depending on the kernel K and the bandwidth h. If K is continuous, then so is f̂(z), and if K is differentiable up to order r, then so is f̂(z).

An estimate in this class is called a Rosenblatt-Parzen density estimate (Rosenblatt 1956, Parzen 1962) or, simply, a kernel density estimate.
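A direct transcription of this formula into code may help fix ideas. The sketch below is in Python, purely as an illustration under the stated assumptions (Gaussian, uniform or Epanechnikov kernel; hypothetical function names); it is not the implementation used to produce the figures in these notes.

    import numpy as np

    def kernel_density(z_grid, data, h, kernel="gaussian"):
        """Kernel estimate f_hat(z) = (1/(n h)) sum_i K((z - Z_i)/h) on a grid."""
        data = np.asarray(data)
        u = (np.asarray(z_grid)[:, None] - data[None, :]) / h   # (z - Z_i)/h
        if kernel == "gaussian":
            k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
        elif kernel == "uniform":
            k = 0.5 * ((u >= -1) & (u < 1))
        else:  # Epanechnikov
            k = 0.75 * np.maximum(1 - u**2, 0)
        return k.mean(axis=1) / h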

Remarks
- Notice that

   ∫ f̂(z) dz = [1/(nh)] Σ_{i=1}^n ∫ K((z − Zi)/h) dz.

After a change of variables from (z − Zi)/h to u, we have

   ∫ K((z − Zi)/h) dz = h ∫ K(u) du.

Thus, if K integrates to one, then so does f̂. If K is a proper density (i.e., is nonnegative and integrates to one), then so is f̂.

- The bandwidth h controls the degree of smoothness or regularity of a density estimate. Small values of h tend to produce estimates that are irregular, while large values of h correspond to very smooth estimates.

- If K corresponds to the density of a zero-mean rv U, then h⁻¹ K((v − Zi)/h) is the density of the rv Vi = Zi + hU, generated from U through a location-scale transformation. Thus, f̂(z) may be viewed as the average height at the point z of n densities with the same spread and the same shape, each centered around one of the sample observations (Figure 3). The smaller is h, the more concentrated is each density, and therefore the smaller is the number of observations that contribute in an appreciable way to form f̂(z) and the more irregular is the resulting estimate of f(z).

- The fact that h is independent of the point where the density is estimated is a nuisance, for it tends to produce spurious effects in regions where data are sparse. If a large enough h is chosen to eliminate this phenomenon, we may end up losing important detail in regions where the data are more concentrated.

Figure 3: Structure of a smooth kernel density estimate.

Example 5 If K = φ, where φ denotes the density of the N(0, 1) distribution, then the kernel estimate of f is continuous and has continuous derivatives of every order.

Such an estimate may be viewed as the average of n Gaussian densities with common variance equal to the squared bandwidth h², each centered about one of the observations.

Unlike the uniform kernel (3), which takes a constant positive value in the interval [Zi − h, Zi + h) and vanishes outside this interval, the Gaussian kernel is always positive, assumes its maximum when z = Zi and tends to zero as |z − Zi| → ∞.

Hence, while the uniform kernel estimate of f(z) is based only on the observations that are within h distance from the evaluation point z and assigns them a constant weight, the Gaussian kernel estimate is based on all the observations but assigns them a weight that declines exponentially as their distance from the evaluation point increases. □

Figure 4: Gaussian kernel density estimates for different bandwidths (bwidth = .25, .5, 1.5; true vs. estimated). The data are 200 observations from the Gaussian mixture .6 N(0, 1) + .4 N(4, 4).

Extensions
Let Z1, . . . , Zn be a sample from the distribution of Z. If f̂ is a kernel estimate of the density f of Z, then one may easily estimate other aspects of the distribution of Z.

Example 6 A nonparametric estimate of its df is

   F̂(z) = ∫_{−∞}^z f̂(u) du = n⁻¹ Σ_{i=1}^n K̄((z − Zi)/h),

where

   K̄(u) = ∫_{−∞}^u K(v) dv

is the integrated kernel. If K corresponds to a proper density, then K̄ is the associated df, and so F̂ is itself a proper df. □

Example 7 If Z is a continuous non-negative rv, then a nonparametric estimate of its hazard function is

   λ̂(z) = f̂(z) / (1 − F̂(z)) = Σ_{i=1}^n K((z − Zi)/h) / { h Σ_{i=1}^n [1 − K̄((z − Zi)/h)] }.   □
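The two extensions above are easy to code once a kernel density estimate is available. The sketch below (Python, Gaussian kernel, hypothetical names; an illustration only) uses the fact that for the Gaussian kernel the integrated kernel K̄ is the standard normal df.

    import numpy as np
    from scipy.stats import norm

    def kernel_df_and_hazard(z_grid, data, h):
        """Kernel estimates of the df F(z) and the hazard f(z)/(1 - F(z))."""
        data = np.asarray(data)
        u = (np.asarray(z_grid)[:, None] - data[None, :]) / h
        f_hat = norm.pdf(u).mean(axis=1) / h      # kernel density estimate
        F_hat = norm.cdf(u).mean(axis=1)          # integrated-kernel df estimate
        hazard_hat = f_hat / (1 - F_hat)          # hazard estimate
        return F_hat, hazard_hat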

1.3 Statistical properties of the kernel method
In evaluating the statistical accuracy of a kernel density estimator f̂, it is important to distinguish between its local and global properties:

- the first refer to the accuracy of f̂(z) as an estimator of f(z), the value of f at a given point z;

- the second refer to the degree of statistical closeness between the two functions f̂ and f.

To stress its dependence on the bandwidth h, a kernel density estimator will henceforth be denoted by f̂h.

Local properties
The kernel density estimator of f at a point z may be written as the sample average

   f̂h(z) = n⁻¹ Σ_{i=1}^n Ki(z),

where

   Ki(z) = (1/h) K((z − Zi)/h)

is a nonlinear transformation of Zi. A natural measure of local accuracy of f̂h(z) is therefore its mean squared error (MSE)

   MSE[f̂h(z)] = E[f̂h(z) − f(z)]² = Var f̂h(z) + [Bias f̂h(z)]².

If the data Z1, . . . , Zn are a sample from the distribution of a rv Z, then we have

   Bias f̂h(z) = (1/h) E K((z − Z)/h) − f(z),
   Var f̂h(z) = [1/(nh²)] Var K((z − Z)/h).    (4)

Thus, the estimator f̂h(z) is biased for f(z) in general. For h fixed, its bias does not depend on the sample size n, whereas its variance tends to zero as n increases.

By imposing the following additional assumptions on the density f and the kernel K we can study in more detail the sampling properties of f̂h(z).

Assumption 1
(i) The density f is twice continuously differentiable.
(ii) The kernel K satisfies

   ∫ K(u) du = 1,   ∫ u K(u) du = 0,   0 < ∫ u² K(u) du < ∞.

If K is a nonnegative function, then Assumption 1 requires K to be the density of some probability distribution with zero mean and finite positive variance, such as the uniform and the Gaussian.

Bias
After a change of variable from x to u = (z − x)/h, we have

   Bias f̂h(z) = (1/h) ∫ K((z − x)/h) f(x) dx − f(z)
              = ∫ K(u) [f(z − hu) − f(z)] du.

Because f is twice differentiable (Assumption 1), a 2nd order Taylor expansion of f(z − hu) about h = 0 gives

   f(z − hu) − f(z) = −hu f′(z) + (1/2) h²u² f″(z) + O(h³).

If h is sufficiently small, then

   Bias f̂h(z) ≈ −h f′(z) ∫ u K(u) du + (1/2) h² f″(z) ∫ u² K(u) du
              = (1/2) m2 h² f″(z),    (5)

where m2 = ∫ u² K(u) du and we used the fact that K has mean zero.

Thus, for h sufficiently small, the bias of f̂h(z) is O(h²) and depends on:

- the degree of curvature f″(z) of the density at the point z,

- the amount of smoothing of the data through the bandwidth h, and

- the spread m2 of the kernel.

In particular:

- Bias f̂h(z) ≈ 0 when f″(z) = 0, that is, when the density is linear in a neighborhood of z.

- Bias f̂h(z) → 0 as h → 0.

- If the bandwidth h decreases with the sample size n, then the bias of f̂h(z) vanishes as n → ∞.

Variance
First notice that

   Var f̂h(z) = [1/(nh²)] Var K((z − Z)/h)
             = [1/(nh²)] E[K((z − Z)/h)²] − (1/n) [(1/h) E K((z − Z)/h)]².

Using (4) and the fact that Bias f̂h(z) = O(h²), we have

   E K((z − Z)/h) = h [f(z) + Bias f̂h(z)] = h [f(z) + O(h²)].

Hence, after a change of variable from x to u = (z − x)/h,

   Var f̂h(z) = [1/(nh²)] ∫ K((z − x)/h)² f(x) dx − (1/n) [f(z) + O(h²)]²
             = [1/(nh)] ∫ K(u)² f(z − hu) du − (1/n) [f(z) + O(h²)]².

Taking a 1st order Taylor expansion of f(z − hu) around h = 0 gives

   Var f̂h(z) ≈ [1/(nh)] ∫ K(u)² [f(z) − hu f′(z)] du + O(n⁻¹)
             = [f(z)/(nh)] ∫ K(u)² du − [f′(z)/n] ∫ u K(u)² du + O(n⁻¹).

Separating the term of order O((nh)⁻¹) from that of order O(n⁻¹) we have that, for h sufficiently small,

   Var f̂h(z) ≈ [1/(nh)] f(z) ∫ K(u)² du + O(n⁻¹).

Remarks
- If the sample size is fixed, increasing the bandwidth reduces the variance of f̂h(z) but, from (5), it also increases its bias.

- If the sample size increases and smaller bandwidths are chosen for larger n, then both the bias and the variance of f̂h(z) may be reduced.

- For f̂h(z) →p f(z) we need both h → 0 and nh → ∞ as n → ∞.

To conclude, in large samples and for h sufficiently small,

   MSE f̂h(z) ≈ [f(z)/(nh)] ∫ K(u)² du + (1/4) m2² h⁴ f″(z)².    (6)

Global properties
Common measures of distance between two functions f and g are:

- the L1 distance

   d1(f, g) = ∫ |f(z) − g(z)| dz,

- the L2 distance

   d2(f, g) = [ ∫ (f(z) − g(z))² dz ]^{1/2},

- the L∞ or uniform distance

   d∞(f, g) = sup_{−∞<z<∞} |f(z) − g(z)|.

In the case of the L2 distance, a global measure of performance of f̂h as an estimator of f is the mean integrated squared error (MISE)

   MISE(f̂h) = E d2(f̂h, f)² = E ∫ [f̂h(z) − f(z)]² dz,

where the expectation is with respect to the joint distribution of Z1, . . . , Zn. Under appropriate regularity conditions

   MISE(f̂h) = ∫ E[f̂h(z) − f(z)]² dz = ∫ MSE[f̂h(z)] dz,

that is, the MISE is equal to the integrated MSE. Hence, in large samples and for h sufficiently small, one may approximate the MISE by integrating (6) under the further assumption that the second derivative of f satisfies

   R(f) = ∫ f″(z)² dz < ∞.

Our global measure of performance is therefore

   MISE(f̂h) ≈ [1/(nh)] ∫ K(u)² du + (1/4) m2² h⁴ R(f).    (7)

Optimal choice of bandwidth
For a fixed sample size n and a fixed kernel function K, an optimal bandwidth may be chosen by minimizing the MISE(f̂h) with respect to h. As an approximation to this problem, consider minimizing the right-hand side of (7) with respect to h,

   min_{h>0} [1/(nh)] ∫ K(u)² du + (1/4) m2² h⁴ R(f).

The first-order condition for this problem is

   0 = −[1/(nh²)] ∫ K(u)² du + m2² h³ R(f).

Assuming that 0 < R(f) < ∞ and solving for h gives the optimal bandwidth

   h* = [ ∫ K(u)² du / (m2² R(f)) ]^{1/5} n^{−1/5} = O(n^{−1/5}).    (8)

Remarks:

- h* converges to zero as n → ∞ but at the rather slow rate of n^{−1/5}.

- Smaller values of h* are appropriate for densities that are more wiggly, that is, such that R(f) is high, or for kernels that are more spread out, that is, such that m2 is high.

Example 8 If f is a N(μ, σ²) density and K is a standard Gaussian kernel, then

   R(f) = ∫ f″(z)² dz = 3 / (8 √π σ⁵)

and

   ∫ K(u)² du = 1 / (2√π).

The optimal bandwidth in this case is

   h* = [ (1/(2√π)) (8√π σ⁵ / 3) ]^{1/5} n^{−1/5}
      = (4/3)^{1/5} σ n^{−1/5} ≈ 1.059 σ n^{−1/5}.   □
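The Gaussian reference rule of Example 8 is trivial to compute. A minimal Python sketch (hypothetical names, for illustration only; in practice σ is replaced by a robust scale estimate):

    import numpy as np

    def gaussian_reference_bandwidth(data):
        """Rule-of-thumb bandwidth h = 1.059 * sigma_hat * n^(-1/5)."""
        data = np.asarray(data)
        n = data.size
        sigma_hat = data.std(ddof=1)        # sample standard deviation as scale estimate
        return 1.059 * sigma_hat * n ** (-1 / 5)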

Optimal choice of kernel function
Substituting the optimal bandwidth h* into (7) gives

   MISE(f̂_{h*}) ≈ (5/4) C(K) R(f)^{1/5} n^{−4/5},

where

   C(K) = m2^{2/5} [ ∫ K(u)² du ]^{4/5}.

Given the optimal bandwidth, the MISE depends on the choice of kernel function only through the term C(K). Thus, an optimal kernel may be obtained by minimizing C(K) with respect to K.

The Epanechnikov kernel
If we confine attention to kernels that correspond to densities of distributions with mean zero and unit variance, then an optimal kernel K* may be obtained by minimizing

   C(K) = [ ∫ K(u)² du ]^{4/5}

under the side conditions

   K(u) ≥ 0 for all u,   ∫ K(u) du = 1,
   ∫ u K(u) du = 0,   ∫ u² K(u) du = 1.

Solving this problem (see Pagan & Ullah 1999 for details), the optimal kernel is

   K*(u) = (3/4) (1 − u²) 1{|u| ≤ 1},

called the Epanechnikov kernel.

The efficiency loss from using suboptimal kernels, however, is modest. For example, using the uniform kernel w only implies an efficiency loss of about 7% with respect to K*. Thus, it is perfectly legitimate, and indeed desirable, to base the choice of kernel on other considerations, for example the degree of differentiability or the computational effort involved (Silverman 1986, p. 43).

It turns out that the crucial choice is not what kernel to use, but rather how much to smooth. This choice partly depends on the purpose of the analysis.

Figure 5: Uniform, Gaussian and Epanechnikov kernels.

Automatic bandwidth selection
It is often convenient to be able to rely on some procedure for choosing the bandwidth automatically rather than subjectively. This is especially important when a density estimate is used as an input to other statistical procedures.

- The simplest approach consists of choosing h = 1.059 σ̂ n^{−1/5}, where σ̂ is some estimate of the standard deviation of the data. This approach works reasonably well for Gaussian kernels and data that are not too far from Gaussian.

- A second approach is based on formula (8) and chooses

   ĥ = [ ∫ K(u)² du / (m2² R(f̂)) ]^{1/5} n^{−1/5},

where f̂ is a preliminary kernel estimate of f based on some initial bandwidth choice.

- A third approach starts from the observation that the MISE of a kernel density estimator f̂h, based on a given kernel K, may be decomposed as

   MISE(f̂h) = E ∫ [f̂h(z)² + f(z)² − 2 f̂h(z) f(z)] dz
             = M(h) + ∫ f(z)² dz,

where

   M(h) = E ∫ f̂h(z)² dz − 2 E ∫ f̂h(z) f(z) dz.

Hence, the MISE of f̂h depends on the choice of the bandwidth only through the term M(h). Thus, minimizing the MISE with respect to h is equivalent to minimizing the function M with respect to h. Because such a function is unknown, this approach suggests minimizing with respect to h an unbiased estimator of the function M.

Cross-validation
To construct an unbiased estimator of M(h), notice first that ∫ f̂h(z)² dz is unbiased for E ∫ f̂h(z)² dz.

Consider next the kernel estimator of f obtained by excluding the ith observation,

   f̂_(i)(z) = [1/((n − 1)h)] Σ_{j≠i} K((z − Zj)/h),   i = 1, . . . , n.    (9)

If the data are a sample from the distribution of Z, then

   E_n [ n⁻¹ Σ_{i=1}^n f̂_(i)(Zi) ] = E_n f̂_(i)(Zi)
                                  = E_(i) ∫ f̂_(i)(z) f(z) dz
                                  = E_n ∫ f̂h(z) f(z) dz,

where E_n denotes expectations with respect to the joint distribution of Z1, . . . , Zn, E_(i) denotes expectations with respect to the joint distribution of Z1, . . . , Z_{i−1}, Z_{i+1}, . . . , Zn, and we used the fact that E_n f̂h(z) = h⁻¹ E K((z − Z)/h) does not depend on n and so it is equal to E_(i) f̂_(i)(z).

Thus, an unbiased estimator of M(h) is the cross-validation criterion

   M̂(h) = ∫ f̂h(z)² dz − (2/n) Σ_{i=1}^n f̂_(i)(Zi).    (10)

The cross-validation procedure minimizes M̂(h) with respect to h.
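The criterion (10) is straightforward to evaluate numerically. The sketch below (Python, Gaussian kernel, hypothetical names) approximates ∫ f̂h(z)² dz on a grid and computes the leave-one-out terms directly; it is an illustration of the idea, not an optimized implementation.

    import numpy as np
    from scipy.stats import norm

    def cv_criterion(h, data, grid_size=512):
        """Least-squares cross-validation criterion M_hat(h), Gaussian kernel."""
        data = np.asarray(data)
        n = data.size
        # numerical approximation of the integral of f_hat(z)^2
        grid = np.linspace(data.min() - 4 * h, data.max() + 4 * h, grid_size)
        f_hat = norm.pdf((grid[:, None] - data[None, :]) / h).mean(axis=1) / h
        int_f2 = np.trapz(f_hat ** 2, grid)
        # leave-one-out estimates f_(i)(Z_i)
        k = norm.pdf((data[:, None] - data[None, :]) / h) / h
        loo = (k.sum(axis=1) - k.diagonal()) / (n - 1)
        return int_f2 - 2 * loo.mean()

    # The cross-validated bandwidth minimizes cv_criterion over a grid of h values.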

Asymptotic justification for cross-validation
Let

   I*(Z1, . . . , Zn) = min_h ∫ [f̂h(z) − f(z)]² dz

denote the integrated squared error of the density estimate obtained choosing h optimally for the given sample, and let

   I_CV(Z1, . . . , Zn) = ∫ [f̂_ĥ(z) − f(z)]² dz

denote the integrated squared error obtained using the bandwidth ĥ that minimizes the cross-validation criterion M̂(h).

Stone (1984) showed that, if f is bounded and some other mild regularity conditions hold, then

   Pr{ lim_{n→∞} I_CV(Z1, . . . , Zn) / I*(Z1, . . . , Zn) = 1 } = 1.

Thus, cross-validation achieves the best possible choice of smoothing parameter, in the sense of minimizing the integrated squared error for a given sample.

The main drawback of cross-validation is that the resulting kernel density estimates tend to be highly variable and to undersmooth the data.

Asymptotic properties of kernel density estimators
Let f̂n denote a kernel density estimate corresponding to a random sample Z1, . . . , Zn from a distribution with density f and a data-dependent bandwidth hn. We now provide sufficient conditions for the sequence {f̂n(z)} to be consistent for f(z) and asymptotically normal, where z is any point in the support of f.

We again rely on the fact that, under our iid assumption, f̂n(z) may be represented as the sample average of n iid rvs,

   f̂n(z) = n⁻¹ Σ_{i=1}^n Kin(z),

where

   Kin(z) = (1/hn) K((z − Zi)/hn).

Results are easily generalized to the case when, instead of a single point of evaluation z, we are interested in a fixed set of points z1, . . . , zJ.

Consistency
Convergence in probability follows immediately from the fact that if the kernel K is the density of a distribution with zero mean and finite positive variance m2 = ∫ u² K(u) du then, from our previous results,

   E Kin(z) = f(z) + (1/2) m2 hn² f″(z) + O(hn³)    (11)

and

   Var Kin(z) = (1/hn) f(z) ∫ K(u)² du − f′(z) ∫ u K(u)² du + O(1).    (12)

The only technical problem is the behavior of hn as a function of n.

Theorem 1 Let {Zi} be a sequence of iid continuous rvs with twice continuously differentiable density f, and suppose that the sequence {hn} of bandwidths and the kernel function K satisfy:

(i) hn → 0;

(ii) nhn → ∞;

(iii) ∫ K(u) du = 1, ∫ u K(u) du = 0, and m2 = ∫ u² K(u) du is finite and positive.

Then f̂n(z) →p f(z) for every z in the support of f.

Asymptotic normality
The next theorem could in principle be used to construct approximate symmetric confidence intervals for f(z) based on the normal distribution.

Theorem 2 In addition to the assumptions of Theorem 1, suppose that nhn⁵ → λ, where 0 ≤ λ < ∞. Then

   √(nhn) [f̂n(z) − f(z)] ⇒ N( (1/2) λ^{1/2} m2 f″(z), f(z) ∫ K(u)² du )

for every z in the support of f. Further, √(nhn) [f̂n(z) − f(z)] and √(nhn) [f̂n(z′) − f(z′)] are asymptotically independent for z ≠ z′.

Remarks:

- Although consistent, f̂n(z) is generally asymptotically biased for f(z).

- There are three cases when no asymptotic bias arises:
  - hn is chosen such that λ = 0;
  - m2 = 0, in which case the kernel K may assume negative values (higher-order kernel), so the estimate f̂n may fail to be a proper density;
  - f″(z) = 0, that is, the density f is linear at the point z.

- When λ > 0, the additional assumption in Theorem 2 implies that hn = O(n^{−1/5}), as for the optimal bandwidth. In this case, letting hn = c n^{−1/5}, with c > 0, gives nhn⁵ = c⁵ = λ and √(nhn) = c^{1/2} n^{2/5}. Therefore

   n^{2/5} [f̂n(z) − f(z)] ⇒ N( (1/2) c² m2 f″(z), [f(z)/c] ∫ K(u)² du ).

Thus, f̂n(z) converges to its asymptotic distribution at the rate of n^{2/5}, which is slower than the rate n^{1/2} achieved by standard parametric estimators (Figure 6).

- When λ = 0 (no asymptotic bias), the bandwidth tends to zero at a faster rate than O(n^{−1/5}) but, in this case, the convergence of f̂n(z) to its asymptotic normal distribution is slower than n^{2/5}.

- An estimate of the asymptotic variance of f̂n(z) is f̂n(z) ∫ K(u)² du, so a confidence interval for f(z) based on the asymptotically normal distribution of f̂n is

   CI_{1−2α}(f(z)) = f̂n(z) ± z(α) [ f̂n(z) ∫ K(u)² du / (nhn) ]^{1/2}.

- This symmetric confidence interval may contain negative values, so its use is problematic and using a bootstrap confidence interval may be better.
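A sketch of the pointwise confidence interval in the remarks above, in Python with a Gaussian kernel (for which ∫ K(u)² du = 1/(2√π)); names are hypothetical and the code is illustrative only.

    import numpy as np
    from scipy.stats import norm

    def kde_with_ci(z_grid, data, h, alpha=0.025):
        """Gaussian-kernel density estimate with pointwise asymptotic CIs."""
        data = np.asarray(data)
        n = data.size
        u = (np.asarray(z_grid)[:, None] - data[None, :]) / h
        f_hat = norm.pdf(u).mean(axis=1) / h
        roughness = 1 / (2 * np.sqrt(np.pi))       # integral of K(u)^2 for the Gaussian kernel
        se = np.sqrt(f_hat * roughness / (n * h))  # estimated asymptotic standard error
        z_a = norm.ppf(1 - alpha)
        return f_hat, f_hat - z_a * se, f_hat + z_a * se

As noted above, the lower bound may be negative, which is one argument for bootstrap intervals instead.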

Figure 6: Rates of convergence (n^{−1/2}, n^{−2/5}, n^{−1/5}, n^{−1/6}) as functions of n.

1.4 Other methods for univariate density estimation

The nearest neighbor method
One of the problems with the kernel method is the fact that the bandwidth is independent of the point at which the density is evaluated. This tends to produce too much smoothing in some regions of the data and too little in others.

For any point of evaluation z, let d1(z) ≤ d2(z) ≤ · · · ≤ dn(z) be the distances (arranged in increasing order) between z and each of the n data points. The kth nearest neighbor estimate of f(z) is defined as

   f̂(z) = k / (2n dk(z)),   k < n.    (13)

The motivation for this estimate is the fact that, if h is sufficiently small, then we would expect a fraction of the observations approximately equal to 2h f(z) to fall in a small interval [z − h, z + h] around z. Since the interval [z − dk(z), z + dk(z)] contains by definition exactly k observations, we have

   k/n ≈ 2 dk(z) f(z).

Solving for f(z) gives the density estimate (13).

Unlike the uniform kernel estimator, which is based on the number of observations falling in a box of fixed width centered at the point z, the nearest neighbor estimator is inversely proportional to the width of the box needed to contain exactly the k observations nearest to z. The smaller is the density of the data around z, the larger is this width.

The number k regulates the degree of smoothness of the estimator, with larger values of k corresponding to smoother estimates. The fraction λ = k/n of sample points in each neighborhood is called the span.
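Formula (13) translates directly into code. A minimal Python sketch (hypothetical names, illustration only):

    import numpy as np

    def knn_density(z_grid, data, k):
        """kth nearest neighbor density estimate f_hat(z) = k / (2 n d_k(z))."""
        data = np.asarray(data)
        n = data.size
        # distance from each evaluation point to its kth nearest observation
        d = np.abs(np.asarray(z_grid)[:, None] - data[None, :])
        d_k = np.sort(d, axis=1)[:, k - 1]
        return k / (2 * n * d_k)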

Problems with the nearest neighbor method
- f̂ does not integrate to one and so it is not a proper density.

- Although continuous, f̂ is not smooth because its derivatives are discontinuous at all points of the form (Z[i] + Z[i+k])/2.

One way of overcoming the second problem is to notice that

   k/2 = Σ_{i=1}^n w((z − Zi)/dk(z)),

where w is the uniform kernel, so w((z − Zi)/dk(z)) is equal to 1/2 if Zi is within distance dk(z) from z and is equal to zero otherwise. So f̂(z) may alternatively be represented as

   f̂(z) = [1/(n dk(z))] Σ_{i=1}^n w((z − Zi)/dk(z)).

Thus, the nearest neighbor estimator may be regarded as a uniform kernel density estimator with a varying bandwidth dk(z).

The lack of smoothness of the nearest neighbor estimator may therefore be overcome by considering a generalized nearest neighbor estimate of the form

   f̃(z) = [1/(n dk(z))] Σ_{i=1}^n K((z − Zi)/dk(z)).

If K is a smooth kernel and the bandwidth dk(z) is continuously differentiable, then f̃ is a smooth estimator of f.

Nonparametric ML
The log-likelihood of a sample Z1, . . . , Zn from a distribution with strictly positive density f0 is defined as

   L(f0) = c + Σ_{i=1}^n ln f0(Zi),

where c is an arbitrary constant. One may then think of estimating f0 by

   f̂ = argmax_{f∈F} L(f),

where F is some family of densities.

- When F = {f(z; θ), θ ∈ Θ} is a known parametric family of densities, one obtains the classical ML estimator

   f̂(z) = f(z, θ̂),

where θ̂ = argmax_θ L(θ) and L(θ) = c + Σ_{i=1}^n ln f(Zi, θ).

- When F includes all strictly positive densities on the real line, one can show that the nonparametric ML estimator is

   f̂(z) = n⁻¹ Σ_{i=1}^n δ(z − Zi),

where δ(u) = ∞ if u = 0 and δ(u) = 0 otherwise is the Dirac delta function. Thus, f̂ is a function equal to infinity at each of the sample points and equal to zero otherwise (why is this unsurprising?).

Maximum penalized likelihood
Being infinitely irregular, f̂ provides an unsatisfactory solution to the problem of estimating f0. If one does not want to make parametric assumptions, an alternative is to introduce a penalty for lack of smoothness and then maximize the log-likelihood function subject to this constraint.

To quantify the degree of smoothness of a density f consider again the functional

   R(f) = ∫ f″(z)² dz.

If f is wiggly, then R(f) is large. If the support of the distribution is an interval and f is linear or piecewise linear, then R(f) = 0.

One may then consider maximizing the penalized sample log-likelihood

   Lα(f) = Σ_{i=1}^n ln f(Zi) − α R(f),

where α > 0 represents the trade-off between smoothness and fidelity to the data. A maximum penalized likelihood density estimator f̂α maximizes Lα over the class of densities with continuous 1st derivative and square integrable 2nd derivative. The smaller is α, the rougher in terms of R(f̂α) is the maximum penalized likelihood estimator.

Advantages of this method:

- It makes explicit two conflicting goals in density estimation:
  - maximizing fidelity to the data, represented here by the term Σ_i ln f(Zi),
  - avoiding densities that are too wiggly, represented here by the term R(f).

- The method places density estimation within the context of a unified approach to curve estimation.

- The method can be given a nice Bayesian interpretation.

Its main disadvantage is that the resulting estimate f̂α is defined only implicitly, as the solution to a maximization problem.

1.5 Multivariate density estimators
Let Z1, . . . , Zn be a sample from the distribution of a random m-vector Z with density function f(z) = f(z1, . . . , zm). How can we estimate f nonparametrically?

Multivariate kernel density estimators

The natural starting point is the following multivariate generalization of the univariate kernel density estimator

   f̂(z) = f̂(z1, . . . , zm) = [1/(n h1 · · · hm)] Σ_{i=1}^n Km((z1 − Zi1)/h1, . . . , (zm − Zim)/hm),    (14)

where Km: R^m → R is a multivariate kernel.

Example 9 An example of a multivariate kernel is the product kernel

   Km(u1, . . . , um) = Π_{j=1}^m K(uj),

where K is a univariate kernel. □

We now consider consistency and asymptotic normality of a special class of multivariate kernel density estimators of the form

   f̂(z) = [1/(n h^m)] Σ_{i=1}^n Km((z1 − Zi1)/h, . . . , (zm − Zim)/h).    (15)

This type of estimator is a special case of (14) obtained by using the same bandwidth h for each component of Z. It may be appropriate when the data have been previously rescaled using some preliminary estimate of scale.

Example 10 In the case of a product kernel, (15) becomes

   f̂(z) = [1/(n h^m)] Σ_{i=1}^n Π_{j=1}^m K((zj − Zij)/h),

where K is a univariate kernel. □
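A minimal Python sketch of the product-kernel estimator in Example 10, with a Gaussian univariate kernel and a common bandwidth (hypothetical names, illustration only):

    import numpy as np
    from scipy.stats import norm

    def product_kernel_density(z_grid, data, h):
        """Multivariate kernel density estimate with a Gaussian product kernel.
        data has shape (n, m); z_grid has shape (g, m)."""
        data = np.asarray(data)
        z_grid = np.asarray(z_grid)
        n, m = data.shape
        u = (z_grid[:, None, :] - data[None, :, :]) / h   # standardized differences, shape (g, n, m)
        k = norm.pdf(u).prod(axis=2)                      # product kernel over the m components
        return k.mean(axis=1) / h**m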

Asymptotic properties
Theorem 3 Let {Zi} be a sequence of iid continuous random m-vectors with a twice continuously differentiable density f, and suppose that the sequence {hn} of bandwidths and the multivariate kernel Km: R^m → R satisfy:

(i) hn → 0;

(ii) n hn^m → ∞;

(iii) ∫ Km(u) du = 1, ∫ u Km(u) du = 0 and ∫ u u′ Km(u) du = M2, a finite m × m matrix.

Then f̂n(z) →p f(z) for every z in the support of f.

Theorem 4 In addition to the assumptions of Theorem 3, suppose that hn² (n hn^m)^{1/2} → λ, where 0 ≤ λ < ∞. Then

   (n hn^m)^{1/2} [f̂n(z) − f(z)] ⇒ N( (λ/2) b(z), f(z) ∫ Km(u)² du )

for every z in the support of f, where b(z) = tr[f″(z) M2]. Further, (n hn^m)^{1/2} [f̂n(z) − f(z)] and (n hn^m)^{1/2} [f̂n(z′) − f(z′)] are asymptotically independent for z ≠ z′.

Remarks:

- The speed of convergence of f̂n(z) to its asymptotically normal distribution is inversely related to the dimension m of Z (a manifestation of the curse-of-dimensionality problem). When m = 1, we obtain the results in Theorem 2.

- When λ > 0, the additional assumption in Theorem 4 implies that hn = O(n^{−1/(m+4)}). In this case, putting hn = c n^{−1/(m+4)}, with c > 0, gives λ = c^{(m+4)/2} and (n hn^m)^{1/2} = c^{m/2} n^{2/(m+4)}, and therefore

   n^{2/(m+4)} [f̂n(z) − f(z)] ⇒ N( (1/2) c² b(z), [f(z)/c^m] ∫ Km(u)² du ).

- When λ = 0, the bandwidth tends to zero at a faster rate than O(n^{−1/(m+4)}), but the convergence of f̂n(z) to its asymptotically normal distribution is slower than the optimal rate.

The curse-of-dimensionality problem
Although conceptually straightforward, extending univariate nonparametric methods to multivariate settings is problematic for at least two reasons:

- While contour plots are enough for two dimensions, how to represent the results of a nonparametric analysis involving three or more variables is a neglected but practically important problem.

- When one-dimensional nonparametric methods are generalized to higher dimensions, their statistical properties deteriorate very rapidly because of the so-called curse-of-dimensionality problem, namely the fact that the volume of data required to maintain a tolerable degree of statistical precision grows much faster than the number of variables under examination.

For these reasons, simple generalizations of one-dimensional methods to the case of more than two or three variables tend to produce results that are difficult to represent and are too irregular, unless the size of the available data is very large.

Example 11 Consider a multivariate histogram constructed for a sample from the distribution of a random m-vector Z with a uniform distribution on the m-dimensional unit hypercube, that is, whose components are iid as U(0, 1).

If we partition the unit hypercube into hypercubical cells of side equal to h, then each cell contains on average only a fraction h^m of the data. Assume that at least 30 observations per cell are needed for a tolerably accurate histogram estimate. Then an adequate sample should have a number of observations at least equal to n = 30 h^{−m}.

The table below shows some calculations for m = 1, . . . , 5 and for h = .10 and h = .05.

                          Number m of variables in Z
                      1        2         3          4           5
   h = .10
   number of cells    10       100       1,000      10,000      100,000
   n                  300      3,000     30,000     300,000     3,000,000
   h = .05
   number of cells    20       400       8,000      160,000     3,200,000
   n                  600      12,000    240,000    4,800,000   96,000,000

Leaving aside the nontrivial problem of how it could be represented, a 5-dimensional histogram is likely to be estimated too imprecisely to be of practical use, unless the sample size is truly gigantic. □

Projection pursuit
One method for nonparametric estimation of multivariate densities is projection pursuit (PP), introduced by Friedman, Stuetzle and Schroeder (1984).

This method assumes that the density of a random m-vector Z may well be approximated by a density of the form

   f(z) = f0(z) Π_{j=1}^J fj(θj′ z),    (16)

where f0 is a known m-variate density, θj is a vector with unit norm, θj′ z = Σ_{h=1}^m θjh zh is a linear combination or projection of the variables in Z, and f1, . . . , fJ are smooth univariate functions, called ridge functions.

The choice of f0 is left to the investigator and should reflect prior information available about the problem.

The estimation algorithm determines the number J of terms and the vectors θj in (16), and constructs nonparametric estimates of the ridge functions f1, . . . , fJ by minimizing a suitably chosen index of goodness-of-fit, or projection index. The curse-of-dimensionality problem is bypassed by using linear projections and nonparametric estimates of univariate ridge functions.

As a by-product of this method, the graphical information provided by the shape of the estimated ridge functions may be useful for exploring and interpreting the multivariate distribution of the data.

The PP method may be regarded as a generalization of the principal components method, where multidimensional data are projected linearly onto subspaces of much smaller dimension, chosen to maximize the variance of the projections.

Semi-nonparametric methods
Gallant and Nychka (1987) proposed to approximate the m-variate density f(z1, . . . , zm) by a Hermite polynomial expansion.

In the bivariate case (m = 2), their approximation is

   f(z1, z2) = (1/ψK) τK(z1, z2)² φ(z1) φ(z2),

where

   τK(z1, z2) = Σ_{h,k=0}^K γhk z1^h z2^k

is a polynomial of order K in z1 and z2, φ denotes the N(0, 1) density, and

   ψK = ∫∫ τK(z1, z2)² φ(z1) φ(z2) dz1 dz2

is a normalization factor to ensure that f integrates to one. The order K is typically chosen by minimizing some information criterion such as AIC or BIC. Because, after the choice of K, the model for f is fully parametric, this is an example of a semi-nonparametric (SNP) approach.

The class of densities that can be approximated in this way includes densities with arbitrary skewness and kurtosis, but excludes violently oscillatory densities or densities with too fat or too thin tails.

Since the polynomial expansion above is invariant to multiplication of γ = (γ00, γ01, . . . , γKK) by a scalar, some normalization is needed. A convenient normalization is γ00 = 1. Under this normalization, expanding the square of the polynomial and rearranging terms gives

   f(z1, z2) = (1/ψK) [ Σ_{h,k=0}^{2K} δhk z1^h z2^k ] φ(z1) φ(z2),

where

   δhk = Σ_{r=ah}^{bh} Σ_{s=ak}^{bk} γrs γ_{h−r,k−s},

with

   ah = max(0, h − K),   bh = min(h, K),
   ak = max(0, k − K),   bk = min(k, K).

1.6 Stata commands
We now briefly review the commands for histogram and kernel density estimation available in Stata, version 12.

These include the histogram and the kdensity commands. Both commands estimate univariate densities. The package akdensity in van Kerm (2012) allows estimating both the density and the distribution function by the kernel method.

Recently, some articles in the Stata Journal have been devoted to the use of nonparametric or semi-nonparametric methods for density estimation in a variety of statistical problems. Examples include De Luca (2008) and De Luca and Perotti (2011).

The histogram command
The basic syntax is:

   histogram varname [if] [in] [weight] [, [continuous_opts | discrete_opts] options]

where:

- varname is the name of a continuous variable, unless the discrete option is specified.

- continuous_opts includes:
  - bin(#): sets the number of bins to #,
  - width(#): sets the width of bins to #,
  - start(#): sets the lower limit of the first bin to # (the default is the observed minimum value of varname).

  bin() and width() are alternatives. If neither is specified, results are the same as if bin(k) had been specified, with

     k = min{√n, 10 ln n / ln 10},

  where n is the (weighted) number of observations.

- discrete_opts includes:
  - discrete: specifies that the data are discrete and you want each unique value of varname to have its own bin (bar of the histogram).

- options includes:
  - density: draws as density (the default),
  - fraction: draws as fractions,
  - frequency: draws as frequencies,
  - percent: draws as percentages,
  - addlabels: adds height labels to bars,
  - normal: adds a normal density to the graph,
  - kdensity: adds a kernel density estimate to the graph.

The kdensity command
The basic syntax is:

   kdensity varname [if] [in] [weight] [, options]

where options includes:

- kernel(kernel): specifies the kernel function. The available kernels include epanechnikov (default), biweight, cosine, gaussian, parzen, rectangle, and triangle.

- bwidth(#): specifies the half-width of the kernel. If not specified, the half-width calculated and used corresponds to h = 1.059 σ̂ n^{−1/5}, which would minimize the MISE if the data were Gaussian and a Gaussian kernel were used (there is a little inconsistency here, as Epanechnikov is the default kernel function).

- generate(newvar_x newvar_d): stores the estimation points in newvar_x and the density estimate in newvar_d.

- n(#): estimates the density using # points. The default is min(n, 50), where n is the number of observations in memory.

- at(var_x): estimates the density using the values specified by var_x.

- nograph: suppresses the graph.

- normal: adds a normal density to the graph.

- student(#): adds a Student's t density with # degrees of freedom to the graph.

2 Nonparametric Regression Estimators
We now assume that the data (X1, Y1), . . . , (Xn, Yn) are a sample from the joint distribution of (X, Y), for which the conditional mean function (CMF)

   μ(x) = E(Y | X = x)

is well defined, and consider the problem of estimating μ(x) nonparametrically. We focus on the case when X is continuous, because when X is discrete with mass points at x1, . . . , xJ, the CMF may simply be estimated by

   Ȳj = Σ_{i=1}^n 1{Xi = xj} Yi / Σ_{i=1}^n 1{Xi = xj},   j = 1, . . . , J,

the average of the sample values of Y for the cases when X = xj.

This is a more general problem than it may look at first.

Example 12 The conditional probability function of a 0-1 rv Y satisfies

   π(x) = Pr{Y = 1 | X = x} = E(Y | X = x).

Thus, the problem of estimating π(x) nonparametrically reduces to the problem of estimating the CMF of Y nonparametrically. □

Example 13 The conditional variance function (CVF) of a rv Y satisfies

   σ(x)² = Var(Y | X = x) = E(Y² | X = x) − [E(Y | X = x)]².

Thus, the problem of estimating σ(x)² nonparametrically reduces to the problem of estimating the CMFs of Y and Y² nonparametrically. □

Linear nonparametric regression estimators
We restrict attention to the class of nonparametric estimators of μ(x) that are linear, that is, of the form

   μ̂(x) = Σ_{j=1}^n Sj(x) Yj,    (17)

where the weight Sj(x) assigned to Yj depends only on the Xi's and the evaluation point x, not on the Yi's.

Linearity of this class of estimators is important and useful because:

- It lowers the computational burden relative to nonlinear estimators.

- It reduces the task of understanding the difference between alternative estimates of μ(x) to the task of understanding the differences in the weights Sj(x).

- It simplifies considerably the task of evaluating the statistical properties of these estimators.

Regression smoothers
It is useful to represent a linear nonparametric regression estimator as a linear regression smoother.

Let Y be the n-vector of observations on the outcome variable and let X be the matrix of n observations on the k covariates. A regression smoother is a way of using Y and X to produce a vector μ̂ = (μ̂1, . . . , μ̂n) of fitted values that is less variable than Y itself.

A regression smoother is linear if it can be represented as

   μ̂ = S Y,

where S = [sij] is an n × n smoother matrix that depends on X but not on Y. Thus, the class of linear regression smoothers coincides with the class of linear predictors of Y.

The ith element of a linear regression smoother is a weighted average

   μ̂i = Σ_{j=1}^n sij Yj

of the elements of Y, and sij is the weight assigned to the jth element of Y in the construction of μ̂i. Linear nonparametric regression estimates are linear smoothers with μ̂i = μ̂(Xi) and sij = Sj(Xi).

Example 14 A parametric example of a linear smoother is the vector of OLS fitted values μ̂ = X β̂ = S Y, where β̂ = (X′X)⁻¹X′Y and the smoother matrix S = X(X′X)⁻¹X′ is symmetric and idempotent. □

A smoother matrix is not necessarily symmetric or idempotent. A matrix S is said to preserve the constant if S ι = ι, where ι is a vector of ones. If a smoother matrix S preserves the constant, then

   n⁻¹ Σ_{i=1}^n μ̂i = n⁻¹ ι′ S Y = n⁻¹ ι′ Y = Ȳ,

where Ȳ is the sample mean of Y. For example, the OLS smoother matrix preserves the constant if the design matrix X contains a column of ones.

We now present other examples of linear regression smoothers. For ease of presentation, we assume that X is a scalar rv.

2.1 Polynomial regressions
A polynomial regression represents a parsimonious and relatively flexible way of approximating an unspecified CMF.

A k-degree polynomial regression approximates μ(x) by a function of the form

   m(x) = α + β1 x + · · · + βk x^k,   βk ≠ 0.

A polynomial regression estimated by OLS corresponds to a linear regression smoother defined by a symmetric idempotent smoother matrix.

Polynomials are frequently used because they can be evaluated, integrated, differentiated, etc., very easily. For example, if m(x) is a k-degree polynomial, then

   m′(x) = β1 + 2β2 x + · · · + k βk x^{k−1}

is a (k − 1)-degree polynomial, and

   ∫ m(x) dx = const + α x + (β1/2) x² + · · · + (βk/(k + 1)) x^{k+1}

is a (k + 1)-degree polynomial.

However, if the CMF is very irregular, even on small regions of the approximation range, then a polynomial approximation tends to be poor everywhere (Figure 7).
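A k-degree polynomial regression is just OLS on the powers of x, and the corresponding smoother matrix is the usual hat matrix. A minimal Python sketch (hypothetical names, illustration only):

    import numpy as np

    def polynomial_regression(x, y, degree):
        """OLS fit of a k-degree polynomial; returns fitted values and the
        symmetric, idempotent smoother (hat) matrix S = X (X'X)^(-1) X'."""
        x = np.asarray(x, dtype=float)
        X = np.vander(x, degree + 1)            # design matrix [x^k, ..., x, 1]
        S = X @ np.linalg.pinv(X)               # smoother matrix
        return S @ np.asarray(y, dtype=float), S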

Figure 7: Polynomial regression estimates (panels: data, linear, quadratic, cubic). The sample consists of 200 observations from a Gaussian model with μ(x) = 1 − x + exp[−50(x − .5)²].
2.2 Regression splines
One way of avoiding global dependence on local properties of the function μ(x) is to consider piecewise polynomial functions.

A regression spline is a smooth piecewise polynomial function and therefore represents a very flexible way of approximating μ(x). The simplest example is linear splines.

Linear splines
Select J distinct points or knots, c1 < · · · < cJ, on the support of X. These points define a partition of R into J + 1 intervals. To simplify the notation, also define two boundary knots c0 < c1 and c_{J+1} > cJ.

A linear spline is a continuous piecewise linear function of the form

   m(x) = αj + βj x,   c_{j−1} < x ≤ cj,   j = 1, . . . , J + 1.

For m to be continuous at the first knot c1, the model parameters must satisfy

   α1 + β1 c1 = α2 + β2 c1,

which implies that α2 = α1 + (β1 − β2) c1. On the interval (c0, c2], we therefore have

   m(x) = α1 + β1 x,                    if c0 < x ≤ c1,
   m(x) = α1 + (β1 − β2) c1 + β2 x,     if c1 < x ≤ c2.

A more compact representation of m on (c0, c2] is

   m(x) = α + β x + γ1 (x − c1)+,   c0 < x ≤ c2,

where α = α1, β = β1, γ1 = β2 − β1 and (x − c1)+ = max(0, x − c1).

Repeating this argument for all other knots, a linear spline may be represented as

   m(x) = α + β x + Σ_{j=1}^J γj (x − cj)+,

where γj = β_{j+1} − βj. A code sketch of this representation is given after the remarks below.

Remarks

- The number of free parameters in the model is only J + 2, less than the number 2(J + 1) of parameters of an unrestricted piecewise linear function. The difference 2(J + 1) − (J + 2) = J is equal to the number of constraints that must be imposed to ensure continuity.

- Although continuous, a linear spline is not differentiable, for its derivative is a step function with jumps at c1, . . . , cJ.

- This problem may be avoided by considering smooth higher-order (quadratic, cubic, etc.) piecewise polynomial functions.
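As referenced above, the truncated-power representation of a linear spline lends itself directly to OLS estimation. A minimal Python sketch (hypothetical names, illustration only):

    import numpy as np

    def linear_spline_fit(x, y, knots):
        """OLS fit of a linear spline on the basis [1, x, (x - c_1)_+, ..., (x - c_J)_+]."""
        x = np.asarray(x, dtype=float)
        basis = [np.ones_like(x), x] + [np.maximum(x - c, 0) for c in knots]
        X = np.column_stack(basis)
        coef, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
        return coef, X @ coef               # spline coefficients and fitted values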

Cubic splines
A cubic spline is a twice continuously differentiable piecewise cubic function. Given J distinct knots c1 < · · · < cJ in the support of X, it possesses the parametric representation

   m(x) = α + β1 x + β2 x² + β3 x³ + Σ_{j=1}^J γj (x − cj)³₊,    (18)

which contains J + 4 free parameters. It is easy to verify that m satisfies all properties of a cubic spline, namely:

- it is a cubic polynomial in each subinterval [cj, c_{j+1}];
- it is twice continuously differentiable;
- its third derivative is a step function with jumps at c1, . . . , cJ.

Remarks:

- A natural cubic spline is a cubic spline that is forced to be linear outside the boundary knots c0 and c_{J+1}. The number of free parameters in this case is just J + 2.

- The representation (18) lends itself directly to estimation by OLS.

- In general, given the sequence of knots and the degree of the polynomial, a regression spline may be estimated by an OLS regression of Y on an appropriate set of vectors that represent a basis for the selected family of piecewise polynomial functions evaluated at the sample values of X.

- Regression splines estimated by OLS give linear regression smoothers defined by symmetric idempotent smoother matrices. The number and position of the knots control the flexibility and smoothness of the approximation.

Problems with regression splines:

- How to select the number and the position of the knots.

- Considerable increase in the degree of complexity of the problem when X is multivariate (Wahba 1990).

Figure 8: Splines (blue) vs. polynomial (black) regression estimates (panels: data, linear, quadratic, cubic). The sample consists of 200 observations from a Gaussian model with μ(x) = 1 − x + exp[−50(x − .5)²].
2.3 The kernel method
Recall that the CMF may be written

   μ(x) = ∫ y [f(x, y)/fX(x)] dy = c(x)/fX(x),

where fX(x) = ∫ f(x, y) dy is the marginal density of X and

   c(x) = ∫ y f(x, y) dy.

Thus, given nonparametric estimates ĉ(x) and f̂X(x) of c(x) and fX(x), a nonparametric estimate of μ(x) is

   μ̂(x) = ĉ(x) / f̂X(x).    (19)

An advantage of the kernel method is that, under certain conditions, such an estimate may be computed directly without the need of estimating the joint density f(x, y).

62
The Nadaraya-Watson estimator
Consider the bivariate kernel density estimate

f̂(x, y) = (n hX hY)^{-1} Σ_{i=1}^{n} K*((x − Xi)/hX, (y − Yi)/hY),

where K*: R² → R. If hX = hY = h and f̂X(x) is the estimate of fX(x) based on the kernel K(x) = ∫ K*(x, y) dy, then

∫ f̂(x, y) dy = f̂X(x) = (nh)^{-1} Σ_{i=1}^{n} K((x − Xi)/h).

If, in addition, K*(x, y) is such that ∫ y K*(x, y) dy = 0, then

ĉ(x) = ∫ y f̂(x, y) dy = (nh)^{-1} Σ_{i=1}^{n} K((x − Xi)/h) Yi.

Substituting these two expressions into (19) gives the Nadaraya–Watson (NW) or kernel regression estimator

μ̂(x) = [Σ_{i=1}^{n} K((x − Xi)/h) Yi] / [Σ_{i=1}^{n} K((x − Xi)/h)].    (20)

Thus, μ̂(x) is a weighted average of the Yj with nonnegative weights

wj(x) = K((x − Xj)/h) / Σ_{i=1}^{n} K((x − Xi)/h),    j = 1, . . . , n,

which add up to one. The NW estimator is therefore a linear smoother whose smoother matrix S = [wj(Xi)] depends on the sample values Xi, the selected kernel K, and the bandwidth h.
If the kernel corresponds to a unimodal density with mode at zero, then the closer Xj is to x, the bigger is the weight assigned to Yj in forming μ̂(x).
If the kernel vanishes outside the interval [−1, 1], then only values of Y for which |x − Xj| < h enter the summation. This may be exploited to speed up the computations.
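A minimal Python sketch of the NW estimator in (20), using a Gaussian kernel. The kernel, the bandwidth and the simulated data are illustrative assumptions, not prescriptions of these notes.

    import numpy as np

    def nadaraya_watson(x0, X, Y, h):
        """Nadaraya-Watson estimate (20) of E(Y | X = x0) with bandwidth h."""
        u = (x0 - np.asarray(X, dtype=float)) / h
        w = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian kernel weights
        return np.sum(w * Y) / np.sum(w)

    # Example: evaluate the smoother on a grid
    rng = np.random.default_rng(1)
    X = rng.uniform(0, 1, 200)
    Y = 1 - X + np.exp(-50 * (X - 0.5)**2) + 0.1 * rng.standard_normal(200)
    grid = np.linspace(0.05, 0.95, 50)
    mu_hat = np.array([nadaraya_watson(x, X, Y, h=0.04) for x in grid])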

63
Example 15 A family of kernels with bounded support which contains many kernels used in practice is

Kν(u) = cν (1 − u²)^ν 1{0 ≤ |u| ≤ 1},    ν = 0, 1, 2, . . . ,

where the constant cν, chosen such that Kν is a proper density, is

cν = 2^{−2ν−1} Γ(2ν + 2) Γ(ν + 1)^{−2}

and Γ(z) is the gamma function.

If ν = 0, we obtain the uniform kernel

K(u) = 1/2 if −1 ≤ u < 1, and 0 otherwise.

If ν = 1, we obtain the Epanechnikov kernel

K(u) = (3/4)(1 − u²) 1{|u| ≤ 1}.

If ν = 2, we obtain the quartic kernel

K(u) = (15/16)(1 − u²)² 1{|u| ≤ 1}.

If ν → ∞, then Kν converges to the Gaussian kernel.

64
Figure 9: Gaussian kernel regression estimates. The sample consists of 200 observations from a Gaussian model with μ(x) = 1 − x + exp[−50(x − .5)²].
[Four panels for bandwidths h = .02, .04, .06, .08, each showing the true and estimated CMF against x.]
65
2.4 The nearest neighbor method
This method suggests considering the Xi that are closest to the evaluation point x and estimating μ(x) by averaging the corresponding values of Y.

A kth nearest neighborhood of x, denoted by Ok(x), consists of the k points that are closest to x. The corresponding nearest neighbor (NN) estimate of μ(x) is

μ̂(x) = k^{-1} Σ_{j∈Ok(x)} Yj = Σ_{j=1}^{n} wj(x) Yj,

where wj(x) = 1/k if j ∈ Ok(x), and 0 otherwise.
Thus, μ̂(x) is just a running mean or moving average of the Yi.

The NN estimate is clearly a linear smoother, and its degree of smoothness depends on the value of k or, equivalently, on the span k/n of the neighborhood.

If X is a vector, then the choice of the metric is of some importance. Two possibilities are:

the Euclidean distance d(x, x′) = [(x − x′)′(x − x′)]^{1/2},

the Mahalanobis distance d(x, x′) = [(x − x′)′ Σ̂^{-1} (x − x′)]^{1/2},

where Σ̂ is a pd estimate of the covariance matrix of the covariates.

Notice that both the NN estimator and the NW estimator solve the following weighted LS problem

min_{c∈R} Σ_{i=1}^{n} wi(x) (Yi − c)².

The two estimators correspond to different choices of weights.
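A short Python sketch of the NN estimate at a single point, using the Euclidean distance; the running-mean form and the role of k are exactly as described above, while the function name and interface are my own.

    import numpy as np

    def knn_regression(x0, X, Y, k):
        """k-nearest-neighbor estimate of E(Y | X = x0): the running mean of the
        Y values attached to the k sample points closest to x0 (Euclidean distance)."""
        X = np.asarray(X, dtype=float)
        Y = np.asarray(Y, dtype=float)
        if X.ndim == 1:
            d = np.abs(X - x0)
        else:
            d = np.linalg.norm(X - np.asarray(x0, dtype=float), axis=1)
        neighbors = np.argsort(d)[:k]          # indices of the k nearest points
        return Y[neighbors].mean()

    # The span k/n controls smoothness; e.g. k = 40 gives a 20% span when n = 200.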

66
Running lines
Using a running mean may give severe biases near the boundaries of the data, where neighborhoods tend to be highly asymmetric.

A way of reducing this bias is to replace the running mean by a running line

μ̂(x) = α̂(x) + β̂(x) x,    (21)

where α̂(x) and β̂(x) are the intercept and the slope of the OLS line computed using only the k points (k > 2) in the neighborhood Ok(x).

Computation of (21) for the n sample points may be based on the recursive formulae for OLS and only requires a number of operations of order O(n).

Notice that α̂(x) and β̂(x) in (21) solve the locally weighted LS problem

min_{(α,β)∈R²} Σ_{i=1}^{n} wi(x) (Yi − α − β Xi)²,

where the weight wi(x) is equal to one if the ith point belongs to Ok(x) and to zero otherwise.

Running lines tend to produce ragged curves but the method can be improved replacing OLS by locally WLS, with weights that decline as the distance of Xi from x increases.

67
Lowess
An example of locally weighted LS estimator is the lowess (LOcally WEighted
Scatterplot Smoother) estimator introduced by Cleveland (1979) and com-
puted as follows:

Algorithm 1

(1) Identify the kth nearest neighborhood Ok (x) of x.

(2) Compute the distance Δ(x) = max_{i∈Ok(x)} d(Xi, x) of the point in Ok(x) that is farthest from x.

(3) Assign weights wi(x) to each point in Ok(x) according to the tricube function

wi(x) = W(d(Xi, x)/Δ(x)),

where W(u) = (1 − u³)³ if 0 ≤ u < 1, and 0 otherwise.

(4) Compute (x) as the predicted value of Y corresponding to x from a WLS


regression that uses the observations in Ok (x) and the weights dened in
step (3).

Notice that the weight wi (x) is maximum when Xi = x, decreases as the


distance of Xi from x increases, and becomes zero if Xi is the kth nearest
neighbor of x.
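A compact Python sketch of one lowess evaluation following Algorithm 1 (nearest neighborhood, tricube weights, locally weighted least squares). It omits Cleveland's robustness iterations and is only meant to make the four steps concrete; the span value and interface are assumptions of the sketch.

    import numpy as np

    def lowess_at(x0, X, Y, span=0.4):
        """One lowess evaluation at x0: WLS line fit on the k nearest points
        with tricube weights, where k = ceil(span * n)."""
        X, Y = np.asarray(X, float), np.asarray(Y, float)
        n = len(X)
        k = max(2, int(np.ceil(span * n)))
        d = np.abs(X - x0)
        idx = np.argsort(d)[:k]                      # step (1): k nearest neighbors
        delta = d[idx].max()                         # step (2): farthest neighbor
        u = d[idx] / delta if delta > 0 else np.zeros(k)
        w = np.where(u < 1, (1 - u**3)**3, 0.0)      # step (3): tricube weights
        # step (4): weighted least squares line through the neighborhood
        A = np.column_stack([np.ones(k), X[idx]])
        W = np.diag(w)
        coef = np.linalg.solve(A.T @ W @ A, A.T @ W @ Y[idx])
        return coef[0] + coef[1] * x0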

68
Figure 10: Lowess estimates. The sample consists of 200 observations from a Gaussian model with μ(x) = 1 − x + exp[−50(x − .5)²].
[Four panels for spans of 20%, 40%, 60% and 80%, each showing the true and estimated CMF against x.]
69
Local polynomial fitting
Another straightforward generalization is local polynomial fitting (Fan & Gijbels 1996), which replaces the running line by a running polynomial of order k ≥ 1,

μ̂(x) = α̂(x) + β̂1(x) x + · · · + β̂k(x) x^k,

where the coefficients α̂(x), β̂1(x), . . . , β̂k(x) solve the problem

min_{(α,β1,...,βk)∈R^{k+1}} Σ_{i=1}^{n} Ki(x) (Yi − α − β1 Xi − · · · − βk Xi^k)²,

where the Ki(x) are general kernel weights (Figure 11).
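A sketch of the locally linear special case (the estimator shown in Figure 11) with Gaussian kernel weights. For numerical convenience it centers the regressor at the evaluation point, an equivalent parameterization whose intercept is the fitted value at x0; the bandwidth is an arbitrary choice.

    import numpy as np

    def local_linear(x0, X, Y, h):
        """Locally linear estimate of the CMF at x0 with Gaussian kernel weights K_i(x0)."""
        X, Y = np.asarray(X, float), np.asarray(Y, float)
        w = np.exp(-0.5 * ((X - x0) / h)**2)        # K_i(x0)
        A = np.column_stack([np.ones_like(X), X - x0])
        W = np.diag(w)
        coef = np.linalg.solve(A.T @ W @ A, A.T @ W @ Y)
        return coef[0]                              # intercept = fit at x0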

70
Figure 11: Locally linear fitting with Gaussian kernel weights. The sample consists of 200 observations from a Gaussian model with μ(x) = 1 − x + exp[−50(x − .5)²].
[Four panels for bandwidths h = .02, .04, .06, .08, each showing the true and estimated CMF against x.]
71
2.5 Cubic smoothing splines
The regression analogue of the maximum penalized likelihood problem is the problem of finding a smooth function that best interpolates the observations on Y without fluctuating too wildly.

Let [a, b] be a closed interval that contains the observed values of X. Then this problem may be formalized as follows

min_{m∈C²[a,b]} Q(m) = Σ_{i=1}^{n} [Yi − m(Xi)]² + λ ∫_a^b m″(u)² du,    (22)

where C²[a, b] is the class of functions defined on [a, b] that have continuous 1st derivative and integrable 2nd derivative, and λ ≥ 0 is a fixed parameter. The residual sum of squares in the functional Q measures fidelity to the data, whereas the second term penalizes for excessive fluctuations of the estimated CMF.

A solution to problem (22) exists, is unique, and can be represented as a natural cubic spline with at most J = n − 2 interior knots corresponding to distinct sample values of X (Reinsch 1967). Such a solution is called a cubic smoothing spline.

Larger values of λ tend to produce solutions that are smoother, whereas smaller values tend to produce solutions that are more wiggly. In particular:

If λ → ∞, problem (22) reduces to the OLS problem.

If λ = 0, there is no penalty for curvature and the solution is a twice differentiable function that exactly interpolates the data.

72
Smoothing splines and penalized LS
Knowledge of the form of the solution to (22) makes it possible to define its smoother matrix. If we consider the representation of μ̂ as a cubic spline, then the required number of basis functions is J + 4 = n − 2 + 4 = n + 2. The solution to problem (22) may therefore be written

μ̂(x) = Σ_{j=1}^{n+2} βj Bj(x),

where β1, . . . , βn+2 are coefficients to be determined and B1(x), . . . , Bn+2(x) is a set of twice differentiable basis functions.

Define the n × (n + 2) matrix B = [Bij], where Bij = Bj(Xi) is the value of the jth basis function at the point Xi. Notice that the matrix B has at most rank n, so it does not have full column rank. The residual sum of squares in (22) may then be written

Σ_{i=1}^{n} [Yi − μ̂(Xi)]² = Σ_{i=1}^{n} [Yi − Σ_{j=1}^{n+2} βj Bj(Xi)]² = (Y − Bβ)′(Y − Bβ),

where β = (β1, . . . , βn+2)′, while the penalty term in (22) may be written

∫_a^b μ̂″(u)² du = ∫_a^b [Σ_{j=1}^{n+2} βj Bj″(u)]² du
    = Σ_{r=1}^{n+2} Σ_{s=1}^{n+2} βr βs ∫_a^b Br″(u) Bs″(u) du
    = Σ_{r=1}^{n+2} Σ_{s=1}^{n+2} βr βs ωrs = β′Ωβ,

where Ω is the (n + 2) × (n + 2) matrix with generic element ωrs = ∫_a^b Br″(u) Bs″(u) du.
Problem (22) is therefore equivalent to the penalized LS problem

min_β (Y − Bβ)′(Y − Bβ) + λ β′Ωβ.    (23)

Thus, the original problem (22) is reduced to the much simpler problem of minimizing a quadratic form in the (n + 2)-dimensional vector β.

73
The penalized LS estimator
The solution β̂ to the penalized LS problem (23) exists and must satisfy the normal equation

0 = −2B′(Y − Bβ̂) + 2λΩβ̂.

If B′B + λΩ is a nonsingular matrix, then

β̂ = (B′B + λΩ)^{-1} B′Y.

Because β̂ formally coincides with a ridge-regression estimate, it also has a nice Bayesian interpretation.

Since μ̂ = Bβ̂, the smoother matrix of a cubic smoothing spline is

Sλ = B(B′B + λΩ)^{-1} B′,

and is symmetric but not idempotent.

See Murphy and Welch (1992) for an independent derivation of the same result.
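A small numerical sketch of the penalized LS solution and of Sλ. For simplicity it uses the truncated-power cubic basis from the earlier spline sketch rather than a B-spline or natural-spline basis, and builds Ω by numerical quadrature; both are simplifications of my own, under the formulas above.

    import numpy as np

    def penalized_ls(X, Y, knots, lam, a=None, b=None, ngrid=2001):
        """Return beta_hat = (B'B + lam*Omega)^{-1} B'Y and the smoother matrix S_lam."""
        X, Y = np.asarray(X, float), np.asarray(Y, float)
        a = X.min() if a is None else a
        b = X.max() if b is None else b

        def basis(x):
            x = np.asarray(x, float)
            cols = [np.ones_like(x), x, x**2, x**3]
            cols += [np.clip(x - c, 0.0, None)**3 for c in knots]
            return np.column_stack(cols)

        def basis_dd(x):                     # second derivatives of the basis functions
            x = np.asarray(x, float)
            cols = [np.zeros_like(x), np.zeros_like(x), 2 * np.ones_like(x), 6 * x]
            cols += [6 * np.clip(x - c, 0.0, None) for c in knots]
            return np.column_stack(cols)

        B = basis(X)
        grid = np.linspace(a, b, ngrid)
        D = basis_dd(grid)
        Omega = np.trapz(D[:, :, None] * D[:, None, :], grid, axis=0)  # int B_r'' B_s''
        beta = np.linalg.solve(B.T @ B + lam * Omega, B.T @ Y)
        S = B @ np.linalg.solve(B.T @ B + lam * Omega, B.T)
        return beta, S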

74
2.6 Statistical properties of linear smoothers
The smoother matrix of a linear smoother typically depends on the sample values of X and a parameter (or set of parameters) which regulates the amount of smoothing of the data. To emphasize this dependence, we denote the smoother matrix by Sλ = [sij], where λ is the smoothing parameter. We adopt the convention that larger values of λ correspond to more smoothing.

Let the data be a sample from the distribution of (X, Y), suppose that the CMF μ(x) = E(Y | X = x) and the CVF σ²(x) = Var(Y | X = x) of Y are both well defined, and let μ and Σ denote, respectively, the n-vector with generic element μi = μ(Xi) and the diagonal n × n matrix with generic element σi² = σ²(Xi).

For a linear smoother μ̂ = SλY, with generic element μ̂i = μ̂(Xi), we have E(μ̂ | X) = Sλμ. The bias of μ̂ is therefore

Bias(μ̂ | X) = E(μ̂ | X) − μ = (Sλ − In)μ.

A case when μ̂ is unbiased is when μ = Xβ and Sλ is such that SλX = X.

The sampling variance of μ̂ is

Var(μ̂ | X) = Sλ Σ Sλ′.

Given an estimate Σ̂ of the matrix Σ, Var(μ̂ | X) may be estimated by Sλ Σ̂ Sλ′. The result may then be used to construct pointwise standard error bands for the regression estimates. If the smoother is approximately unbiased, then these standard error bands also represent pointwise confidence intervals. An alternative way of constructing confidence intervals is the bootstrap.

If the data are conditionally homoskedastic, that is, σ²(x) = σ² for all x, then Σ = σ²In may be estimated by σ̂²In, where σ̂² = n^{-1} Σ_i (Yi − μ̂i)². The sampling variance of μ̂i may then be estimated by σ̂² (Σ_{j=1}^{n} sij²).

75
Choosing the amount of smoothing
One way of choosing the smoothing parameter λ optimally is to maximize a global measure of accuracy, conditionally on X.

As a global measure of accuracy, consider the average conditional MSE (AMSE)

AMSE(λ) = n^{-1} Σ_{i=1}^{n} E(μ̂i − μi)² = n^{-1} [tr(Sλ Σ Sλ′) + μ′(Sλ − In)′(Sλ − In)μ].

When the data are conditionally homoskedastic, this reduces to

AMSE(λ) = n^{-1} [σ² tr(Sλ Sλ′) + μ′(Sλ − In)′(Sλ − In)μ].

In general, increasing the amount of smoothing tends to increase the bias and to reduce the sampling variance of a smoother. The AMSE provides a summary of this trade-off.

The following relationship links the AMSE to the average conditional mean squared error of prediction or average predictive risk (APR)

n^{-1} Σ_{i=1}^{n} E(Yi − μ̂i)² = n^{-1} Σ_{i=1}^{n} σi² + AMSE(λ).

Thus, minimizing the AMSE is equivalent to minimizing the APR.

Since the APR depends on unknown parameters, one way of choosing the smoothing parameter optimally is to minimize an unbiased estimate of the APR.

76
Cross-validation
An approximately unbiased estimator of the APR is the cross-validation criterion

CV(λ) = n^{-1} Σ_{i=1}^{n} [Yi − μ̂(i)(Xi)]²,

where μ̂(i)(Xi) is the value of the smoother at the point Xi, computed by excluding the ith observation, and Yi − μ̂(i)(Xi) is the ith predicted residual. Minimizing CV(λ) represents an automatic method for choosing the smoothing parameter λ.

Example 16 In the OLS case, μ̂(i)(Xi) = β̂(i)′Xi, where

β̂(i) = (X′X − Xi Xi′)^{-1}(X′Y − Xi Yi) = β̂n − (X′X)^{-1} Xi Ûi/(1 − hii)

is the OLS coefficient computed by excluding the ith observation, β̂n is the OLS coefficient computed for the full sample, Ûi = Yi − β̂n′Xi is the OLS residual and hii is the ith diagonal element of the matrix H = X(X′X)^{-1}X′.

Since Yi − β̂(i)′Xi = Ûi/(1 − hii), the cross-validation criterion is

CV(λ) = n^{-1} Σ_{i=1}^{n} [Yi − β̂(i)′Xi]² = n^{-1} Σ_{i=1}^{n} [Ûi/(1 − hii)]². □

Cross-validation requires computing the sequence μ̂(1)(X1), . . . , μ̂(n)(Xn). This is very simple if the smoother matrix preserves the constant, that is, Sλ ι = ι. Because Σ_{j=1}^{n} sij = 1 in this case, μ̂(i)(Xi) may be computed by setting the weight of the ith observation equal to zero and then dividing all the other weights by 1 − sii in order for them to add up to one. This gives

μ̂(i)(Xi) = Σ_{j≠i} [sij/(1 − sii)] Yj = (Σ_{j=1}^{n} sij Yj − sii Yi)/(1 − sii) = (μ̂i − sii Yi)/(1 − sii).

Hence, the ith predicted residual is Yi − μ̂(i)(Xi) = Ûi/(1 − sii), where Ûi = Yi − μ̂i, and the cross-validation criterion becomes

CV(λ) = n^{-1} Σ_{i=1}^{n} [Ûi/(1 − sii)]².
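A short sketch of the leave-one-out shortcut just derived: once a smoother matrix is available, CV can be computed from the fitted values and the diagonal of the smoother matrix without refitting n times. The NW smoother matrix used here, and the grid search, are illustrative assumptions.

    import numpy as np

    def nw_smoother_matrix(X, h):
        """Smoother matrix of the NW estimator with a Gaussian kernel (rows sum to one)."""
        X = np.asarray(X, float)
        K = np.exp(-0.5 * ((X[:, None] - X[None, :]) / h)**2)
        return K / K.sum(axis=1, keepdims=True)

    def cv_criterion(X, Y, h):
        """CV = n^{-1} sum_i [ (Y_i - muhat_i) / (1 - s_ii) ]^2."""
        S = nw_smoother_matrix(X, h)
        resid = Y - S @ Y
        return np.mean((resid / (1.0 - np.diag(S)))**2)

    # Choose the bandwidth by minimizing CV over a grid (illustrative)
    # h_star = min(np.linspace(0.01, 0.2, 40), key=lambda h: cv_criterion(X, Y, h))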

77
Equivalent kernels and equivalent degrees of freedom
Using the representation (17), it is easy to compare linear smoothers obtained from different methods.

The set of weights s1(x), . . . , sn(x) assigned to the sample values of Y in the construction of μ̂(x) is called the equivalent kernel evaluated at the point x. The ith row of the smoother matrix Sλ gives the equivalent kernel evaluated at the point Xi. A comparison of equivalent kernels is therefore a simple and effective way of comparing different linear smoothers.

Example 17 Given two different linear smoothers, one may compare the different weights,

{s(1)i1, . . . , s(1)in},  {s(2)i1, . . . , s(2)in},

that they give to the sample values of Y in constructing the estimate of the value μ(Xi) of μ(x) at a given point Xi. Alternatively, given a smoother and two different points Xi and Xj, one may compare the different weights,

{si1, . . . , sin},  {sj1, . . . , sjn},

that the smoother gives to the sample values of Y in constructing the estimates of μ(Xi) and μ(Xj). □

A synthetic index of the amount of smoothing of the data is the effective number of parameters defined, by analogy with OLS, as

kλ = tr Sλ.

The number dfλ = n − kλ is called equivalent degrees of freedom.

The effective number of parameters is equal to the sum of the eigenvalues of the matrix Sλ. If Sλ is symmetric and idempotent (as in the case of regression splines), then

kλ = rank Sλ.

The effective number of parameters depends mainly on the value of the smoothing parameter λ, while the configuration of the predictors tends to have a much smaller effect. As λ increases, we usually observe a decrease in kλ and a corresponding increase in dfλ.

78
Asymptotic properties
We now present some asymptotic properties of regression smoothers. For simplicity, we confine our attention to estimators based on the kernel method.

Given a sample of size n from the distribution of (X, Y), where X is a random k-vector and Y is a scalar rv, a NW estimator of the CMF of Y is

μ̂n(x) = [Σ_{i=1}^{n} K((x − Xi)/hn)]^{-1} Σ_{i=1}^{n} K((x − Xi)/hn) Yi.

Theorem 5 Suppose that the sequence {hn} of bandwidths and the kernel K: R^k → R satisfy:
(i) hn → 0;
(ii) n hn^k → ∞;
(iii) ∫ K(u) du = 1, ∫ u K(u) du = 0, and ∫ u u′ K(u) du is a finite k × k matrix.
Then μ̂n(x) →p μ(x) for every x in the support of X.

Theorem 6 In addition to the assumptions of Theorem 5, suppose that

hn² (n hn^k)^{1/2} → δ,

where 0 ≤ δ < ∞. If the CMF μ(x) and the CVF σ²(x) are smooth and X has a twice continuously differentiable density fX(x) then

(n hn^k)^{1/2} [μ̂n(x) − μ(x)] → N( δ b(x)/fX(x), [σ²(x)/fX(x)] ∫ K(u)² du )

for every x in the support of X, where

b(x) = lim_{n→∞} hn^{−(k+2)} E{ [μ(Xi) − μ(x)] K((x − Xi)/hn) }.

Further, (n hn^k)^{1/2} [μ̂n(x) − μ(x)] and (n hn^k)^{1/2} [μ̂n(x′) − μ(x′)] are asymptotically independent for x ≠ x′.
79
Remarks
μ̂n(x) is asymptotically biased for μ(x), unless hn is chosen such that δ = 0. In this case, however, the convergence of μ̂n(x) to its limiting distribution is slower than in the case when δ > 0.

The convergence rate of μ̂n(x) is inversely related to the number k of covariates, reflecting the curse-of-dimensionality problem. For k fixed, the maximal rate is achieved when hn = c n^{−1/(k+4)}, for some c > 0. In this case, δ = c^{(k+4)/2} and therefore

n^{2/(k+4)} [μ̂n(x) − μ(x)] → N( c² b(x)/f(x), [σ²(x)/(c^k f(x))] ∫ K(u)² du ).

Even when there is a single covariate (k = 1), the fastest convergence rate is only equal to n^{2/5} and is lower than the rate of n^{1/2} achieved by a regular estimator of μ(x) in a parametric context.

These problems have generated two active lines of research:

One seeks ways of eliminating the asymptotic bias of μ̂n(x) while retaining the maximal convergence rate of n^{2/(k+4)}. One possibility (Schucany & Sommers 1977, Bierens 1987) is to consider a weighted average of two estimators with different bandwidths chosen such that the resulting estimator is asymptotically centered at μ(x).

The other seeks ways of improving the speed of convergence of μ̂n(x). For example, if the kernel K is chosen such that ∫ u u′ K(u) du = 0 (allowing therefore for negative values of K), then the optimal convergence rate can be shown to be n^{3/(k+6)}. More generally, if one chooses a kernel with zero moments up to order m, then the optimal bandwidth becomes hn = c n^{−1/(k+2m)}, and the maximal convergence rate becomes n^{m/(k+2m)} which, as m → ∞, tends to the rate n^{1/2} typical of a parametric estimator.

80
Tests of parametric models
Nonparametric regression may be used to test the validity of a parametric model. For this purpose one may employ both informal methods, especially graphical ones, and more formal tests.

To illustrate, let [X, Y] be a sample of size n from the distribution of (X, Y), let μ(x) denote the CMF of Y, and consider testing for the goodness-of-fit of a linear regression model. The null hypothesis H0: μ(x) = α + β′x specifies μ(x) as linear in x, whereas the alternative hypothesis specifies μ(x) as a smooth nonlinear function of x.

Let Û = (In − H)Y, with H = X(X′X)^{-1}X′, be the OLS residual vector and let Ũ = (In − Sλ)Y be the residual vector associated with some other linear smoother μ̂ = SλY. Because H0 is nested within the alternative, an F-type test rejects H0 for large values of the statistic

F = (Û′Û − Ũ′Ũ)/(Ũ′Ũ) = (Y′MY − Y′MλY)/(Y′MλY),

where M = In − H and Mλ = (In − Sλ)′(In − Sλ).

For general types of linear smoother, the distribution of F under H0 is difficult to derive for it depends on the unknown parameters α and β.

81
Connections with the Durbin-Watson test
Azzalini and Bowman (1993) suggest replacing H0 by the hypothesis that E(Û) = 0, conditional on X. The alternative is now that E(Û) is a smooth function of the covariates. This leads to a test that rejects H0 for large values of the statistic

F = (Û′Û − Û′MλÛ)/(Û′MλÛ).

It can be shown that F depends only on the data and the value of the smoothing parameter λ. Rejecting H0 for large values of F is equivalent to rejecting for small values of the statistic

1/(F + 1) = Û′MλÛ/Û′Û,

which has the same form as the Durbin–Watson test statistic.

In order to determine the critical value of a test based on F, one needs to compute probabilities such as

Pr{F > c} = Pr{Û′Û − Û′MλÛ > c Û′MλÛ} = Pr{Û′AÛ > 0},

where A = In − (1 + c)Mλ. Hence, the problem reduces to calculating the distribution of the quadratic form Û′AÛ under the null hypothesis. Azzalini and Bowman (1993) offer some suggestions.

82
2.7 Average derivatives
Suppose that the CMF is sufficiently smooth and write it as μ(x) = c(x)/fX(x), where c(x) = ∫ y f(x, y) dy. Then its slope is

μ′(x) = [c′(x) fX(x) − c(x) fX′(x)]/fX(x)² = [c′(x) − μ(x) fX′(x)]/fX(x),

and its average slope or average derivative is

δ = E μ′(X) = ∫ μ′(x) fX(x) dx,    (24)

where the expectation is with respect to the marginal distribution of X. The average derivative provides a convenient summary of the CMF through a finite dimensional parameter (a scalar parameter if X is unidimensional).

Notice that, if μ(x) is linear, that is, μ(x) = α + β′x, then

μ′(x) = β = E μ′(X).

This is no longer true if μ(x) is nonlinear.

Under regularity conditions, integrating by parts gives the following equivalent representation of the average derivative

δ = −E sX(X) μ(X),    (25)

where sX(x) = fX′(x)/fX(x) is the score of the marginal density of X.

A third representation follows from the fact that, by the Law of Iterated Expectations,

δ = −E sX(X) μ(X) = −E[sX(X) E(Y | X)] = −E sX(X) Y,    (26)

where the last expectation is with respect to the joint distribution of (X, Y).

These three equivalent representations provide alternative approaches to estimating the average derivative.

83
Estimation of the average derivative
If the kernel K is differentiable, an analog estimator of the average derivative based on (24) is

δ̂1 = n^{-1} Σ_{i=1}^{n} Di μ̂′(Xi),

where Di = 1{f̂X(Xi) > cn} for some trimming constant cn that goes to zero as n → ∞. The lower bound on f̂X is introduced to avoid erratic behavior where X is sparse, so the value of f̂X(x) is very small.

An analog estimator based on (25) is

δ̂2 = −n^{-1} Σ_{i=1}^{n} Di ŝX(Xi) μ̂(Xi),

where ŝX(x) = f̂X′(x)/f̂X(x) is the estimated score.

Another analog estimator, based on (26), is

δ̂3 = −n^{-1} Σ_{i=1}^{n} Di ŝX(Xi) Yi.
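A compact sketch of the third estimator for a scalar covariate, with a Gaussian kernel density estimate and its derivative used to form the estimated score. The bandwidth and trimming constant are arbitrary illustrative choices.

    import numpy as np

    def average_derivative(X, Y, h, trim=1e-3):
        """delta_hat_3 = -n^{-1} sum_i D_i shat_X(X_i) Y_i for scalar X."""
        X, Y = np.asarray(X, float), np.asarray(Y, float)
        U = (X[:, None] - X[None, :]) / h              # (X_i - X_j)/h
        phi = np.exp(-0.5 * U**2) / np.sqrt(2 * np.pi)
        f_hat = phi.mean(axis=1) / h                   # kernel density estimate at X_i
        fprime_hat = (-U * phi).mean(axis=1) / h**2    # its derivative at X_i
        score_hat = fprime_hat / f_hat                 # shat_X(X_i)
        D = (f_hat > trim).astype(float)               # trimming indicator
        return -np.mean(D * score_hat * Y)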

84
2.8 Methods for high-dimensional data
We now review a few nonparametric approaches that try to overcome some of
the problems encountered in trying to extend univariate nonparametric meth-
ods to multivariate situations.

Projection pursuit regression estimation

The projection pursuit (PP) method was originally introduced in the regression context by Friedman and Stuetzle (1981). Specifically, given a rv Y and a random k-vector X, they proposed to approximate the CMF of Y by a function of the form

m(x) = α + Σ_{j=1}^{p} mj(βj′x),    (27)

where βj = (β1j, . . . , βkj)′ is a vector with unit norm that specifies a particular direction in R^k, βj′x = Σ_{i=1}^{k} βij xi is a linear combination or projection of the variables in X, and m1, . . . , mp are unknown smooth univariate ridge functions to be determined empirically.

Remarks:

By representing the CMF as a sum of arbitrary functions of lin-


ear combinations of the elements of x, PP regression allows for quite
complex patterns of interaction among the covariates.

Linear regression models may be viewed as special cases correspond-


ing to p = 1 and m1 (u) = u, single-index models as special cases
corresponding to p = 1, and sample selection models as special cases
corresponding to p = 2.

The curse-of-dimensionality problem is bypassed by using linear pro-


jections and nonparametric estimates of univariate ridge functions.

85
The PP algorithm
Algorithm 2
(1) Given estimates β̂j of the projection directions and estimates m̂j of the first h − 1 ridge functions in (27), compute the approximation errors

ri = Ỹi − Σ_{j=1}^{h−1} m̂j(β̂j′Xi),    i = 1, . . . , n,

where Ỹi = Yi − Ȳ.
(2) Given a vector b ∈ R^k such that ‖b‖ = 1, construct a linear smoother m̂(b′Xi) based on the errors ri and compute the RSS

S(b) = Σ_{i=1}^{n} [ri − m̂(b′Xi)]².

(3) Determine the b and the function m̂ for which S(b) is minimized.
(4) Insert b and the function m̂ as the hth terms into (27).
(5) Iterate (1)–(4) until the decrease in the RSS becomes negligible.

As stressed by Huber (1985), PP regression emerges as the most powerful


method yet invented to lift one-dimensional statistical techniques to higher
dimensions. However, it is not without problems:
Interpretability of the individual terms of (27) is difficult when p > 1.
PP regression has problems in dealing with highly nonlinear structures. In fact, there exist functions that cannot be represented as a finite sum of ridge functions (e.g. μ(x) = exp(x1 x2)).
The sampling theory of PP regression is still lacking.
The choice of the amount of smoothing in constructing the non-
parametric estimates of the ridge functions is delicate.
PP regression tends to be computationally expensive.

We now discuss an approach that may be regarded as a simplification of PP regression.

86
Additive regression
An important property of the linear regression model

m(x) = α + Σ_{j=1}^{k} βj xj

is additivity. If there is no functional relation between the elements of x, this property makes it possible to separate the effect of the different covariates and to interpret βj as the (constant) partial derivative of m(x) with respect to the jth covariate.

An additive regression model approximates the CMF of Y by a function of the form

m(x) = α + Σ_{j=1}^{k} mj(xj),    (28)

where m1, . . . , mk are univariate smooth functions, one for each covariate. The linear regression model corresponds to mj(u) = βj u for all j or, equivalently, mj′(u) = βj.

Remarks:

The additive model (28) is a special case of the PP regression model (27) corresponding to p = k and βj = ej, the jth unit vector.

Because mj′(xj) = ∂m(x)/∂xj if there is no functional relationship between the elements of X, additive regression retains the interpretability of the effect of the individual covariates.

Interactions between categorical and continuous covariates may be modeled by estimating separate versions of the model for each value of the categorical variable.

Interactions between continuous covariates may be modelled by creat-


ing compound variables, such as products of pairs.

87
Characterizing additive regression models
An additive regression model may be characterized as the solution to an ap-
proximation problem. In turn, this leads to a general iterative proce-
dure for estimating this class of models.

This way of proceeding is very similar in spirit to obtaining OLS as the sample
analog of the normal equations that characterize the best linear approximation
to an arbitrary CMF.

Before stating the main result, it is worth recalling a few definitions and results. We refer to Luenberger (1969) for details.

88
Banach and Hilbert spaces
A normed vector space X is a vector space on which is defined a function ‖·‖: X → R+, called the norm, that satisfies the following properties:

(i) ‖x‖ > 0 for all x ≠ 0;

(ii) (triangle inequality) ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all y, x ∈ X;

(iii) ‖αx‖ = |α| ‖x‖ for all real numbers α.

In a normed vector space, an infinite sequence of vectors {xn} is said to converge to a vector x, written xn → x, if the sequence {‖x − xn‖} of real numbers converges to zero. An infinite sequence {xn} is called a Cauchy sequence if ‖xn − xm‖ → 0 for n, m → ∞. A normed vector space X is said to be complete if every Cauchy sequence in X has a limit in X. A complete normed vector space is also called a Banach space.

Given a vector space X, an inner product is a function ⟨x | y⟩: X × X → R that satisfies the following properties:

(i) ⟨x | y⟩ = ⟨y | x⟩;

(ii) ⟨x + y | z⟩ = ⟨x | z⟩ + ⟨y | z⟩;

(iii) ⟨λx | z⟩ = λ ⟨x | z⟩ for all scalars λ;

(iv) ⟨x | x⟩ ≥ 0, and ⟨x | x⟩ = 0 if and only if x = 0.

Two vectors x, y in a vector space X with inner product ⟨· | ·⟩ are said to be orthogonal, written x ⊥ y, if ⟨x | y⟩ = 0. A vector x ∈ X is said to be orthogonal to a subset M of X, written x ⊥ M, if x ⊥ m for all m ∈ M.

On a vector space with inner product ⟨· | ·⟩, the function ‖x‖ = ⟨x | x⟩^{1/2} is a norm. A complete vector space on which an inner product is defined is called a Hilbert space. Hence, a Hilbert space is a Banach space equipped with an inner product which induces a norm.

89
Example 18 An important example of Hilbert space is the space H of all random variables with finite variance defined on a probability space (Ω, A, P). In this case, given two random variables Y, X ∈ H, we define the inner product

⟨X | Y⟩ = Cov(X, Y).

Viewed as elements of H, two uncorrelated random variables are therefore orthogonal. The inner product thus defined induces the norm

‖X‖ = [Cov(X, X)]^{1/2} = [Var(X)]^{1/2}.

90
The Classical Projection Theorem
Theorem 7 Let M be a closed subspace of a Hilbert space H. For any vector x ∈ H, there exists a unique vector m* ∈ M such that ‖x − m*‖ ≤ ‖x − m‖ for all m ∈ M. Further, a necessary and sufficient condition for m* ∈ M to be such a vector is that x − m* ⊥ M.

The vector m* ∈ M such that x − m* ⊥ M is called the orthogonal projection of x onto M.

The next result is just a restatement of the Classical Projection Theorem.

Theorem 8 Let M be a closed subspace of a Hilbert space H. Let x ∈ H and let V be the affine subspace x + M. Then, there exists a unique vector x* in V with minimum norm. Further, x* ⊥ M.

91
The main result
Theorem 9 Let Z = (X1, . . . , Xk, Y) be a random (k + 1)-vector with finite second moments, let H be the Hilbert space of zero-mean functions of Z, with the inner product defined as ⟨φ | ψ⟩ = E φ(Z)ψ(Z), and let μ(x) = μ(x1, . . . , xk) = E(Y | X1 = x1, . . . , Xk = xk). For any j = 1, . . . , k, let Mj be the subspace of zero-mean square integrable functions of Xj and let Ma = M1 + · · · + Mk be the subspace of zero-mean square integrable additive functions of X = (X1, . . . , Xk). Then a solution to the following problem

min_{m∈Ma} E [μ(X) − m(X)]²    (29)

exists, is unique, and has the form

m*(X) = Σ_{j=1}^{k} mj*(Xj),

where m1*, . . . , mk* are univariate functions such that

mj*(Xj) = E[ Y − Σ_{h≠j} mh*(Xh) | Xj ],    j = 1, . . . , k.    (30)

92
Proof
We only sketch the proof. The details can be found in Stone (1985). If we endow the Hilbert space H with the norm ‖φ‖ = [E φ(Z)²]^{1/2}, then problem (29) is equivalent to the minimum norm problem

min_{m∈Ma} ‖Y − m‖².

Under some technical conditions, Ma and M1, . . . , Mk are closed subspaces of H. Thus, by the Classical Projection Theorem, there exists a unique vector m* ∈ Ma that solves (29). Moreover, m* is characterized by the orthogonality condition Y − m* ⊥ Ma or equivalently, since Ma = M1 + · · · + Mk, by the set of orthogonality conditions

Y − m* ⊥ Mj,    j = 1, . . . , k.

Because m* ∈ Ma, it must be of the form m*(X) = Σ_{j=1}^{k} mj*(Xj). Further, because the conditional expectation E(· | Xj) is an orthogonal projection onto Mj, we must have

E[ Y − Σ_{j=1}^{k} mj*(Xj) | Xj ] = 0,    j = 1, . . . , k,

that is,

mj*(Xj) = E[ Y − Σ_{h≠j} mh*(Xh) | Xj ],    j = 1, . . . , k.
93
Estimating additive regression models
Let Sj denote a smoother matrix for univariate smoothing on the jth covariate. Because the sample analogue of the conditional mean μj(x) = E(Y | Xj = x) is the univariate smoother SjY, the analogy principle suggests the following equation system as the sample counterpart of (30)

m̃j = Sj( Ỹ − Σ_{h≠j} m̃h ),    j = 1, . . . , k,

where m̃j is the n-vector with generic element m̃j(Xij) and Ỹ is the n-vector with generic element Yi − Ȳ.

This results in the following system of nk × nk linear equations

[ In   S1   · · ·  S1 ] [ m̃1 ]   [ S1Ỹ ]
[ S2   In   · · ·  S2 ] [ m̃2 ]   [ S2Ỹ ]
[ ...  ...  ...  ... ]  [ ... ] = [ ... ]    (31)
[ Sk   Sk   · · ·  In ] [ m̃k ]   [ SkỸ ]

If (m̃1, . . . , m̃k) is a solution to the above system and we put α̂ = Ȳ, then an estimate of μ = (μ(X1), . . . , μ(Xn))′ is

μ̂ = α̂ ιn + Σ_{j=1}^{k} m̃j.

94
The backfitting algorithm
One way of solving system (31) is the Gauss–Seidel method. This method solves iteratively for each vector m̃j from the relationship

m̃j = Sj( Ỹ − Σ_{h≠j} m̃h ),

using the latest values of {m̃h, h ≠ j} at each step. The process is repeated for j = 1, . . . , k, 1, . . . , k, . . ., until convergence. The result is the backfitting algorithm:

Algorithm 3
(1) Compute the univariate smoother matrices S1, . . . , Sk.
(2) Initialize the algorithm by setting m̃j = m̃j(0), j = 1, . . . , k.
(3) Cycle for j = 1, . . . , k: m̃j = Sj( Ỹ − Σ_{h≠j} m̃h ), where the smoother matrix Sj for univariate smoothing on the jth covariate is applied to the residual vector Ỹ − Σ_{h≠j} m̃h obtained from the previous step.
(4) Iterate step (3) until the changes in the individual functions become negligible.

The smoother matrices Sj can be computed using any univariate linear smoother and do not change through the backfitting algorithm.

To ensure that Ỹ − Σ_{h≠j} m̃h has mean zero at every stage, Sj may be replaced by the centered smoother matrix Sj(In − n^{-1} ιn ιn′).

A possible starting point for the algorithm is the fitted value of Y from an OLS regression of Y on X. In this case m̃j(0) = Xj β̂j, where Xj and β̂j denote respectively the jth column of X and the jth element of the OLS coefficient vector.

It can be shown that convergence of the backfitting algorithm is guaranteed in the case of smoothing splines or when the chosen linear smoother is a projection operator, that is, defined by a symmetric idempotent smoother matrix. (A small numerical sketch of the algorithm is given below.)
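The sketch below implements Algorithm 3 for k covariates, taking the per-covariate smoother matrices as given (any linear smoother would do, e.g. the NW matrices built earlier). The zero initialization, the centering of each component and the convergence tolerance are choices of the sketch.

    import numpy as np

    def backfitting(smoother_mats, Y, tol=1e-8, max_iter=200):
        """Gauss-Seidel backfitting: m_j = S_j (Ytilde - sum_{h != j} m_h)."""
        Y = np.asarray(Y, float)
        n = len(Y)
        Ytilde = Y - Y.mean()
        k = len(smoother_mats)
        m = [np.zeros(n) for _ in range(k)]          # initialize all components at zero
        for _ in range(max_iter):
            change = 0.0
            for j, Sj in enumerate(smoother_mats):
                partial = Ytilde - sum(m[h] for h in range(k) if h != j)
                new_mj = Sj @ partial
                new_mj -= new_mj.mean()              # keep each component centered
                change = max(change, np.max(np.abs(new_mj - m[j])))
                m[j] = new_mj
            if change < tol:
                break
        alpha_hat = Y.mean()
        return alpha_hat, m, alpha_hat + sum(m)      # intercept, components, fitted values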

95
Partially linear models
A partially linear model is of the form

E(Y | X, W) = m1(X) + m2(W),

where m1(x) = β′x is an unknown linear function and m2(w) is an unknown univariate smooth function.

For simplicity (and without loss of generality), let X be a scalar rv and write the model as

Y = βX + m2(W) + U,    (32)

where E(U | X, W) = 0.

If E[X m2(W)] = 0, then β may be estimated consistently and efficiently by an OLS regression of Y on X. This is typically the case when X has mean zero and is independent of W. In general, however, X and m2(W) are correlated, so a regression of Y on X gives inconsistent estimates of β.

Notice however that

E(Y | W) = E[E(Y | X, W) | W] = E[βX + m2(W) | W] = β E(X | W) + m2(W).

Subtracting this expression from (32) gives

Y − E(Y | W) = β [X − E(X | W)] + U.

If Ŷ = PY is a linear regression smoother of Y on W and X̂ = SX is a linear regression smoother of X on W, then β may simply be estimated by an OLS regression of Y − Ŷ = (In − P)Y on X − X̂ = (In − S)X. The resulting estimate of β is

β̂ = [X′(In − S)′(In − S)X]^{-1} X′(In − S)′(In − P)Y.

This method, proposed by Robinson (1988), may be regarded as a generalization of the double residual regression (or Frisch-Waugh-Lovell) method for partitioned linear regressions.
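A sketch of the double-residual idea: smooth both Y and X on W with a linear smoother (here the same NW matrix for both, an arbitrary choice), then regress the residualized Y on the residualized X.

    import numpy as np

    def nw_matrix(W, h):
        """NW smoother matrix for regressions on W with a Gaussian kernel."""
        W = np.asarray(W, float)
        K = np.exp(-0.5 * ((W[:, None] - W[None, :]) / h)**2)
        return K / K.sum(axis=1, keepdims=True)

    def robinson_beta(Y, X, W, h):
        """Partially linear model Y = beta*X + m2(W) + U: double-residual estimate of beta."""
        S = nw_matrix(W, h)
        y_res = Y - S @ Y          # Y - Ehat(Y | W)
        x_res = X - S @ X          # X - Ehat(X | W)
        return np.sum(x_res * y_res) / np.sum(x_res**2)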

96
The backfitting algorithm for partially linear models
Now consider applying the backfitting algorithm. Because in this case m1 is linear, two different smoothers may be employed:
An OLS smoother, with smoother matrix S1 = X(X′X)^{-1}X′, which produces estimates of the form m̃1 = Xβ̂, where β̂ is the desired estimate of β.
A nonparametric smoother with smoother matrix S2. Applied to the vector W = (W1, . . . , Wn), this smoother produces estimates of the vector m2 = (m2(W1), . . . , m2(Wn)) of the form m̃2 = S2Y.

In this case, the steps of the backfitting algorithm are

m̃1 = S1(Y − m̃2),
m̃2 = S2(Y − m̃1).

Premultiplying the first of these two equations by X′ and substituting the expression for m̃2 from the second equation gives

X′m̃1 = X′(Y − m̃2) = X′(In − S2)Y + X′S2m̃1,

where we used the fact that X′S1 = X′. Because m̃1 = Xβ̂, we have

X′Xβ̂ = X′(In − S2)Y + X′S2Xβ̂

or, equivalently,

X′(In − S2)Y = X′(In − S2)Xβ̂.

Solving for β̂ we then get

β̂ = [X′(In − S2)X]^{-1} X′(In − S2)Y.

In this case, no iteration is necessary.

This method may be regarded as another way of generalizing the double-residual regression method for partitioned linear regressions. In fact, if S2 = W(W′W)^{-1}W′, then the two methods coincide.

The method proposed by Robinson (1988) coincides with the backfitting algorithm if S2 = P = S is a projection matrix (i.e., symmetric and idempotent).

97
2.9 Stata commands
We now briefly review the commands for linear nonparametric regression estimation available in Stata, version 12.

These include the lowess command for lowess, the lpoly command for kernel-
weighted local polynomial smoothing, and the mkspline command for construct-
ing linear or natural cubic splines. All three commands consider the case of a
single regressor.

The package bspline in Newson (2012) allows one to estimate univariate and
multivariate splines.

98
The lowess command
This is a computationally intensive command. For example, running lowess on 1,000 observations requires performing 1,000 regressions, each involving a number of observations equal to n times the span. Since Stata does not take advantage of the recursive formulae for OLS or WLS, this may take a long time on a slow computer.

The basic syntax is:


lowess yvar xvar [if] [in] [weight] [, options]
where options includes:

mean: running-mean smooth (default is running-line least squares).

noweight: suppresses weighted regressions (default is tricube weighting


function).

bwidth(#): span of the data (default is 0.80). Centered subsets con-


taining a fraction bwidth() of the observations are used for calculating
smoothed values for each point in the data except for end points, where
smaller uncentered subsets are used.

logit: transforms dependent variable to logits and adjust smoothed


mean to equal mean of yvar.

adjust: adjusts smoothed mean to equal mean of yvar.

generate(newvar): creates newvar containing smoothed values of yvar.

nograph: suppresses graph. This option is often used with the generate()
option.

99
The lpoly command
The basic syntax is:
lpoly yvar xvar [if] [in] [weight] [, options]
where options includes:

kernel(kernel): specifies the kernel function. The available kernels include epanechnikov (default), biweight, cosine, gaussian, parzen, rectangle, and triangle.

bwidth(#): specifies the half-width of the kernel. If bwidth() is not specified, a rule-of-thumb bandwidth is calculated and used. A local variable bandwidth may be specified in varname, in conjunction with an explicit smoothing grid using the at() option.

degree(#): specifies the degree of the polynomial smooth. The default is degree(0), corresponding to a running mean. This is equivalent to a NW estimator.

generate([newvar x] newvar s): stores smoothing grid in newvar x


and smoothed points in newvar s. If at() is not specied, then both
newvar x and newvar s must be specied. Otherwise, only newvar s is
to be specied.

n(#): obtain the smooth at # points. The default is min(n, 50).

at(varname): specifies a variable that contains the values at which the smooth should be calculated. By default, the smoothing is done on an equally spaced grid, but one can use at() to instead perform the smoothing at the observed X, for example.

nograph: suppresses graph. This option is often used with the generate()
option.

noscatter: suppresses scatterplot only. This option is useful when the


number of resulting points would be so large as to clutter the graph.

pwidth(#): species pilot bandwidth for standard error calculation.

100
The mkspline command
In the first syntax, it creates a set of k variables containing a linear spline of oldvar with knots at the specified k − 1 knots:
mkspline newvar1 #1 newvar2 #2 [...] newvark = oldvar [if] [in] [, marginal displayknots]
where

marginal specifies that the new variables be constructed so that, when used in estimation, the coefficients represent the change in the slope from the preceding interval. The default is to construct the variables so that, when used in estimation, the coefficients measure the slopes for the interval.

displayknots displays the values of the knots that were used in creating the linear or restricted cubic spline.

In the second syntax, mkspline creates # variables named stubname1, . . . ,


stubname# containing a linear spline of oldvar:
mkspline stubname # = oldvar [if] [in] [weight] [, marginal pctile displayknots]
where pctile specifies that the knots be placed at percentiles of oldvar. The default is equally spaced knots over the range of oldvar.

In the third syntax, mkspline creates variables containing a natural cubic


spline of oldvar:
mkspline stubname = oldvar [if] [in] [weight], cubic [nknots(#) knots(numlist) displayknots]
where nknots(#) specifies the number of knots that are to be used for the natural cubic spline. This number must be between 3 and 7 unless the knot locations are specified using knots(). The default number of knots is 5.

101
3 Distribution function and quantile function estimators
The distribution function and the quantile function are equivalent ways of characterizing the probability distribution of a univariate rv Z.

3.1 Distribution functions and quantile functions

The distribution function
The distribution function (df) of Z is a function F: R → [0, 1] defined by

F(z) = Pr{Z ≤ z},    z ∈ R.

Sometimes, instead of working with the df, it is convenient to work with the function

S(z) = 1 − F(z) = Pr{Z > z},    z ∈ R,

called the survivor function.

A df F is called absolutely continuous if it is differentiable almost everywhere and satisfies

F(z) = ∫_{−∞}^{z} F′(t) dt,    −∞ < z < ∞.

Any function f such that F(z) = ∫_{−∞}^{z} f(t) dt is called a density for F. Any such density must agree with F′ except possibly on a set with measure zero. If f is continuous at z0, then f(z0) = F′(z0).

Sometimes the df is defined in terms of its density. For example, the density of the N(0, 1) distribution is

φ(z) = (2π)^{-1/2} exp(−z²/2).

The associated df is defined by the integral

Φ(z) = ∫_{−∞}^{z} φ(u) du,

but this integral does not have a closed form expression.

102
Properties of the distribution function
F is non decreasing, right-continuous, and satisfies F(−∞) = 0 and F(∞) = 1.

For any z,
Pr{Z = z} = F(z+) − F(z−),
where F(z+) = lim_{h↓0} F(z + h) and F(z−) = lim_{h↓0} F(z − h).

If F is continuous at z, then
F(z+) = F(z−) = F(z),
and so Pr{Z = z} = 0.

If F is continuous, the rv F(Z) is distributed as U(0, 1).

Given c < d, the rv X = min(d, max(c, Z)) is called a censored version of Z. The dfs of Z and X agree over the interval (c, d).

103
Quantiles
For any p ∈ (0, 1), a pth quantile of a rv Z is a number Qp ∈ R such that

Pr{Z < Qp} ≤ p  and  Pr{Z > Qp} ≤ 1 − p,    (33)

or equivalently

Pr{Z ≤ Qp} ≥ p  and  Pr{Z ≥ Qp} ≥ 1 − p.

Thus, a pth quantile satisfies

Pr{Z < Qp} ≤ p ≤ Pr{Z ≤ Qp},

or equivalently

Pr{Z > Qp} ≤ 1 − p ≤ Pr{Z ≥ Qp}.

The set of pth quantiles of Z is a closed interval of the real line. A pth quantile is unique if the df of Z is strictly increasing at Qp, in which case F(Qp) = p.

A quantile corresponding to p = .50 is called a median of Z. Quantiles corresponding to p = .10, .25, .75 and .90 are called, respectively, lower deciles, lower quartiles, upper quartiles and upper deciles.

104
Quantiles as solutions to a minimization problem
Recall that, if a rv Z has finite mean, then its mean solves the problem min_{c∈R} E(Z − c)². A pth quantile of Z may also be defined implicitly as a solution to a particular minimization problem.

Define the convex function

ρp(v) = [p − 1{v < 0}] v = (1 − p)|v| if v < 0, and pv if v ≥ 0,

called the asymmetric absolute loss function. Also define the expected loss L(c) = E ρp(Z − c), where

ρp(Z − c) = [p − 1{Z < c}] (Z − c) = (1 − p)(c − Z) if Z < c, and p(Z − c) if Z ≥ c.

Then we have the following:

Theorem 10 A pth quantile of Z solves the problem

min_{c∈R} L(c).    (34)

Proof. Consider for simplicity the case when Z is a continuous rv with density f and finite mean μ. Since L(c) is convex, it is enough to show that Qp is a root of the equation L′(c) = 0. First notice that

L(c) = p ∫ (z − c) f(z) dz − ∫_{−∞}^{c} (z − c) f(z) dz
     = p(μ − c) − ∫_{−∞}^{c} z f(z) dz + c F(c).

Hence,

L′(c) = −p − c f(c) + F(c) + c f(c) = −p + F(c).

Since Qp satisfies F(Qp) = p, it is a root of L′(c) = 0. □

If p = .50, then ρp(v) = |v|/2 is symmetric and is proportional to the familiar symmetric absolute loss function. Further, the corresponding minimum expected loss is one half the mean absolute deviation of Z from its median.
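A quick numerical check of Theorem 10: minimizing the sample analogue of L(c) over c recovers the usual sample quantile. The grid minimization below is only for illustration; Section 3.3 treats the sample problem (ALAD) directly.

    import numpy as np

    def check_loss(v, p):
        """Asymmetric absolute loss rho_p(v) = [p - 1{v < 0}] v."""
        return (p - (v < 0)) * v

    def quantile_by_minimization(sample, p, grid_size=2001):
        """Minimize n^{-1} sum_i rho_p(Z_i - c) over a grid of candidate values c."""
        z = np.asarray(sample, dtype=float)
        grid = np.linspace(z.min(), z.max(), grid_size)
        losses = [np.mean(check_loss(z - c, p)) for c in grid]
        return grid[int(np.argmin(losses))]

    # Compare with the standard sample quantile
    rng = np.random.default_rng(0)
    z = rng.standard_normal(500)
    quantile_by_minimization(z, 0.75), np.quantile(z, 0.75)   # should be close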

105
Figure 12: The asymmetric absolute loss function ρp(v), plotted against v for p = .25, .50, .75.
106
The quantile function
The quantile function (qf) of a rv Z is a function Q: (0, 1) → R defined by

Q(p) = inf {z ∈ R: F(z) ≥ p}.    (35)

If Qp is unique, then Q(p) = Qp. While the df F maps R into [0, 1], the qf Q maps (0, 1) into R.

Properties of the quantile function

Q is non decreasing and left-continuous.

Q(F(z)) ≤ z and F(Q(p)) ≥ p. Thus,

F(z) ≥ p ⟺ z ≥ Q(p).

This justifies calling Q the inverse of F and using the notation Q(p) = F^{-1}(p). Also,

F(z) = sup {p ∈ [0, 1]: Q(p) ≤ z}.    (36)

Q is continuous iff F is strictly increasing.

Q is strictly increasing iff F is continuous, in which case

Q(p) = inf {z ∈ R: F(z) = p},    F(Q(p)) = p.

Further, if F is continuous, then Z has the same distribution as Q(U), where U ∼ U(0, 1) (quantile transformation).

107
Derivation of the qf from the df
If the df F is continuous and can be expressed in closed form, then the qf is easily obtained by solving the equation F(z) = p.

Example 19 If Z ∼ U(0, 1), then F(z) = z, so Q(p) = p. □

Example 20 If Z ∼ E(λ), λ > 0, then

F(z) = 1 − exp(−λz),

so

Q(p) = −ln(1 − p)/λ. □

108
Table 1: Distribution function F(z) and quantile function Q(p) of selected distributions.

Distribution     F(z)                              Q(p)
Cauchy           1/2 + (1/π) arctan z              tan[π(p − 1/2)]
Chi-square(1)    2Φ(√z) − 1                        [Φ^{-1}((p + 1)/2)]²
Exponential      1 − e^{−λz}                       −λ^{-1} ln(1 − p)
Gaussian         Φ(z)                              Φ^{-1}(p)
Gumbel           1 − exp(−e^{z})                   ln[−ln(1 − p)]
Laplace          (1/2) e^{z},  z < 0;              ln(2p),  p < 1/2;
                 1 − (1/2) e^{−z},  z ≥ 0          −ln[2(1 − p)],  p ≥ 1/2
Logistic         e^{z}/(1 + e^{z})                 ln[p/(1 − p)]
Log-normal       Φ(ln z)                           exp[Φ^{-1}(p)]
Pareto           1 − (z0/z)^{α}                    z0 (1 − p)^{−1/α}
Uniform          z                                 p
Weibull          1 − exp(−λ z^{α})                 [−λ^{-1} ln(1 − p)]^{1/α}
109
Figure 13: Quantile functions of selected distributions.
[Nine panels plotting Q(p) against p for the Uniform, Exponential, Chi-square(1), Gumbel, Gaussian, Laplace, Logistic, Lognormal and Weibull(2,1) distributions.]
110
Other properties of the quantile function
If the first two moments of Z exist, then

E Z = ∫_0^1 Q(p) dp,    Var Z = ∫_0^1 Q(p)² dp − [∫_0^1 Q(p) dp]².

If Z has a continuous and strictly positive density f in a neighborhood of Q(p), then differentiating the identity F(Q(p)) = p gives f(Q(p)) Q′(p) = 1. Hence,

Q′(p) = 1/f(Q(p)).

The slope Q′ is known as the sparsity function or quantile-density function, whereas the composition f(Q(·)) is known as the density-quantile function. The sparsity function is well defined whenever the density is strictly positive, and is strictly positive whenever the density is bounded.
Differentiating the identity ln Q′(p) = −ln f(Q(p)) gives

Q″(p)/Q′(p) = −[f′(Q(p))/f(Q(p))] Q′(p).

Hence, the score function s = f′/f satisfies

s(Q(p)) = −Q″(p)/[Q′(p)]² = (d/dp)[1/Q′(p)] = (d/dp) f(Q(p)).

If the function g is monotonically increasing and left-continuous, then the qf of the rv g(Z) is equal to g(Q(·)). In particular, the qf of g(Z) = μ + σZ, with σ ≥ 0, is equal to g(Q(p)) = μ + σQ(p).
Given c < d, let

X = min(d, max(c, Z))

be a censored version of Z. It then follows from the previous result that the qfs of Z and X agree at all p such that F(c) < p < F(d). In particular, Z and X have the same median provided that F(c) < 1/2 < F(d). This is generally not true for the mean.

111
Moment-based summaries of a distribution
Center, spread and symmetry of a distribution are intuitive but somewhat vague concepts. Kurtosis is not even intuitive.

If Z is a rv with finite 4th moment, then it is customary to define:

Center in terms of the mean μ = E Z.

Spread in terms of the variance σ² = E(Z − μ)² or the standard deviation σ = [E(Z − μ)²]^{1/2}.

Symmetry in terms of the 3rd moment E(Z − μ)³.

Kurtosis in terms of the 4th moment E(Z − μ)⁴.

If Z ∼ N(μ, σ²), then E(Z − μ)³ = 0 and E(Z − μ)⁴ = 3σ⁴.

The drawbacks with these measures are:

they may not exist (e.g., for all four measures to exist, a t distribution needs at least 5 degrees of freedom);

their interpretation is not easy (especially for kurtosis);

they lack robustness.

112
Quantile-based summaries of a distribution
An alternative is to use quantile-based summaries of a distribution. Thus, we may alternatively define:

Center in terms of the median Q(.50).

Spread in terms of either the interquartile range

IQR = Q(.75) − Q(.25),

or the interdecile range

IDR = Q(.90) − Q(.10).

The IQR (IDR) is the interval of values which contains the central 50% (80%) of the probability mass of Z.

Symmetry in terms of the difference

[Q(.90) − Q(.50)] − [Q(.50) − Q(.10)],

or the ratios

[Q(.90) − Q(.50)] / [Q(.50) − Q(.10)]

or

{[Q(.90) − Q(.50)] − [Q(.50) − Q(.10)]} / [Q(.90) − Q(.10)],

also known as Kelley's measure. One may also consider analogous measures with Q(.90) and Q(.10) replaced by Q(.75) and Q(.25).

Kurtosis, or more precisely tail weight, in terms of either the difference IDR − IQR or the ratio IDR/IQR.

113
Functionals of the distribution function or the quantile function
Sometimes, the parameter of interest is a particular functional of either the df or the qf. We illustrate this with two examples.

Example 21 Given a rv Z with finite mean, the α-level tail conditional mean of Z is formally defined as

λ(α) = E(Z | Z ≤ Q(α)) = (1/α) ∫_{−∞}^{Q(α)} z dF(z),    0 < α < 1.

In financial applications, Z represents the return on a given asset during a period, while Q(α) and λ(α) are measures of financial risk:

The quantile Q(α), also called the Value-at-Risk (VaR) at level α, gives a lower bound on the loss made in the worst 100α percent of the periods.

The tail conditional mean λ(α), also called the α-level expected shortfall, represents the expected value of a loss (negative return) that exceeds the VaR.

Typical values of α are .01, .05 or .10.

If F is continuous and strictly increasing, a change of variable from F(z) to p gives the equivalent representation

λ(α) = (1/α) ∫_0^{F(Q(α))} F^{-1}(p) dp = (1/α) ∫_0^{α} Q(p) dp.    (37)

This second representation is particularly convenient when the quantiles of Z have a closed form expression. □

114
Example 22 Let Z be a non-negative rv with finite nonzero mean μ. The Lorenz curve of Z is formally defined as

L(α) = (1/μ) ∫_0^{α} Q(p) dp,    0 < α < 1.

The Lorenz curve is commonly used in economics to describe the distribution of income and is associated with the Gini inequality index.

From (37), the following relationship links the Lorenz curve and the tail conditional mean of Z

μ L(α) = α λ(α).

The generalized Lorenz curve (Shorrocks 1983) is the Lorenz curve scaled up by the mean, and is equal to

GL(α) = ∫_0^{α} Q(p) dp = α λ(α),    0 < α < 1.

If the non-negative rv Z represents individual income, then GL(α) simply cumulates individual incomes up to the αth quantile. □

115
3.2 The empirical distribution function
Let Z1, . . . , Zn be a sample from the distribution of a rv Z, and consider the problem of estimating the df F of Z when F is not restricted to a known parametric family.

The key is the fact that, for any z,

F(z) = Pr{Z ≤ z} = E 1{Z ≤ z},

that is, F(z) is just the mean of the Bernoulli rv 1{Z ≤ z}. This suggests estimating F(z) by its sample counterpart

F̂(z) = n^{-1} Σ_{i=1}^{n} 1{Zi ≤ z},    (38)

namely the fraction of sample points such that Zi ≤ z. Viewed as a function defined on R, F̂ is called the empirical distribution function (edf).

In fact, F̂ is the df of a discrete probability measure, called the empirical measure, which assigns to a set A a probability equal to the fraction of sample points contained in A. In particular, each distinct sample point receives probability equal to 1/n, whereas each sample point repeated m ≤ n times receives probability equal to m/n.

Suppose that all observations are distinct, and let

Z[1] < Z[2] < · · · < Z[n]

be the sample order statistics. Then the edf may also be written

F̂(z) = 0 if z < Z[1];  F̂(z) = i/n if Z[i] ≤ z < Z[i+1], i = 1, . . . , n − 1;  F̂(z) = 1 if z ≥ Z[n].

This shows that the edf contains all the information carried by the sample, except the order in which the observations have been drawn or are arranged.
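A tiny sketch of the edf in (38) evaluated at arbitrary points; np.searchsorted on the sorted sample returns the number of observations less than or equal to z.

    import numpy as np

    def edf(sample):
        """Return a function z -> F_hat(z) = n^{-1} * #{i : Z_i <= z}."""
        z_sorted = np.sort(np.asarray(sample, dtype=float))
        n = len(z_sorted)
        def F_hat(z):
            return np.searchsorted(z_sorted, z, side="right") / n
        return F_hat

    # Example
    rng = np.random.default_rng(0)
    F_hat = edf(rng.standard_normal(100))
    F_hat(0.0)   # fraction of observations <= 0, close to 0.5 in this sample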

116
Figure 14: Empirical distribution function F̂ for a sample of 100 observations from a N(0, 1) distribution (population df shown for comparison).
117
The Riemann-Stieltjes integral
In what follows we shall sometimes use the notation of the Riemann-Stieltjes integral (see e.g. Apostol 1974, Chpt. 7).

Given a df F and an integrable function g, define

EF g(Z) = ∫ g(z) dF(z) = ∫ g(z) f(z) dz

if F is continuous with probability density function f(z), and

EF g(Z) = ∫ g(z) dF(z) = Σ_j g(zj) f(zj)

if F is discrete with probability mass function f(zj).

Because the edf F̂ is discrete, with probability mass function that assigns probability mass 1/n to each distinct sample point, we have

EF̂ g(Z) = ∫ g(z) dF̂(z) = n^{-1} Σ_{i=1}^{n} g(Zi).

118
The empirical process
Let Z1, . . . , Zn be a sample from the distribution of a rv Z with df F, let

θ = T(F) = ∫ g(z) dF(z)

be the population parameter of interest, let F̂ be the edf, and let

θ̂ = T(F̂) = ∫ g(z) dF̂(z) = n^{-1} Σ_{i=1}^{n} g(Zi)

be a plug-in or bootstrap estimator of θ.

The estimation error associated with θ̂ is

θ̂ − θ = ∫ g(z) dF̂(z) − ∫ g(z) dF(z) = ∫ g(z) d[F̂(z) − F(z)].

Thus, the estimation error ultimately depends on the difference between the edf F̂ and the population df F.

In asymptotic analysis, attention typically focuses on the rescaled estimation error

√n (θ̂ − θ) = ∫ g(z) dpn(z),

where

pn(z) = √n [F̂(z) − F(z)].

Notice that pn(z) is a function defined on R for any given sample, but is a rv for any given z. The collection

pn = {pn(z), z ∈ R}

of all such rvs is therefore a stochastic process with index set R. This process is called the empirical process.

119
Figure 15: Realization of the empirical process pn for a sample of n = 100 observations from a N(0, 1) distribution (pn(z) plotted against z).
120
Finite sample properties
The key is again the fact that F̂(z) is the average of n iid rvs 1{Zi ≤ z} that have a common Bernoulli distribution with parameter F(z).

Theorem 11 If Z1, . . . , Zn is a sample from a distribution with df F, then for any z ∈ R:

(i) nF̂(z) ∼ Bi(n, F(z));

(ii) E F̂(z) = F(z);

(iii) Var F̂(z) = n^{-1} F(z)[1 − F(z)];

(iv) Cov[F̂(z), F̂(z′)] = n^{-1} F(z)[1 − F(z′)] for any z ≤ z′.

Remarks:

For any z, F̂(z) is an unbiased estimator of F(z).

Because of the correlation between F̂(z) and F̂(z′), care is needed in drawing inference about the shape of F from small samples.

As n → ∞, the sampling variance of F̂(z) vanishes, so F̂(z) is consistent for F(z). Further, as n → ∞, the covariance between F̂(z) and F̂(z′) also vanishes.

Being the average of n iid rvs with finite variance, F̂(z) also satisfies the standard CLT and is therefore asymptotically normal.

121
Asymptotic properties
We now consider the properties of a sequence {F̂n(z)} of edfs indexed by the sample size n. As in the case of density estimation, we distinguish between local properties (valid at a finite set of z values) and global properties.

Given J distinct points

−∞ < z1 < · · · < zJ < ∞,

let F be the J-vector consisting of F1, . . . , FJ, with Fj = F(zj), and let F̂n = (F̂n1, . . . , F̂nJ) be the sample analogue of F, with F̂nj = F̂n(zj). The sampling distribution of the J-vector

pn = √n (F̂n − F)

corresponds to the J-dimensional distribution of the empirical process pn.

As n → ∞ we have:

F̂n → F almost surely;

pn →d NJ(0, Ω), where Ω is the J × J matrix with generic element

ωjk = min(Fj, Fk) − Fj Fk,    j, k = 1, . . . , J.

These two properties may be viewed as special cases of the following two results.

Dvoretzky-Kiefer-Wolfowitz (DKW) inequality: For all n and any ε > 0,

Pr{ sup_{z∈R} |F̂(z) − F(z)| ≥ ε } ≤ 2e^{−2nε²}.    (39)

The empirical process {pn(Q(u)), u ∈ (0, 1)} converges weakly to the Brownian bridge or tied-down Brownian motion, that is, the unique Gaussian process U with continuous sample paths on [0, 1] such that E U(t) = 0 and Cov[U(s), U(t)] = min(s, t) − st, with 0 < s, t < 1.
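The DKW inequality (39) yields a distribution-free simultaneous confidence band for F: setting 2 exp(−2nε²) equal to the desired level a and solving for ε gives the half-width used in the minimal sketch below (the names eps and a are mine).

    import numpy as np

    def dkw_band(sample, a=0.05):
        """Simultaneous (1 - a) confidence band for F from the DKW inequality:
        F_hat(z) +/- eps, with eps = sqrt(log(2/a) / (2n)), clipped to [0, 1]."""
        z = np.sort(np.asarray(sample, dtype=float))
        n = len(z)
        F_hat = np.arange(1, n + 1) / n          # edf evaluated at the order statistics
        eps = np.sqrt(np.log(2.0 / a) / (2.0 * n))
        lower = np.clip(F_hat - eps, 0.0, 1.0)
        upper = np.clip(F_hat + eps, 0.0, 1.0)
        return z, lower, upper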

122
The Glivenko-Cantelli theorem
As a measure of distance between the edf F̂n and the population df F consider the Kolmogorov-Smirnov statistic

Dn = sup_{−∞<z<∞} |F̂n(z) − F(z)|.

Since F̂n depends on the data, Dn is itself a rv.

It would be nice if, as n → ∞, Dn → 0 in some appropriate sense. In fact, one can show that Dn → 0 almost surely as n → ∞, that is,

Pr{ lim_{n→∞} Dn = 0 } = 1

or, equivalently, the event that Dn does not converge to zero as n → ∞ occurs with zero probability under repeated sampling.

This fundamental result, known as the Glivenko-Cantelli theorem, implies that the entire probabilistic structure of Z can almost certainly be uncovered from the sample data provided that n is large enough. Vapnik (1995) calls this the most important result in the foundation of statistics.

123
Figure 16: Empirical distribution function F̂ for a sample of 1,000 observations from a N(0, 1) distribution (population df shown for comparison).
124
Multivariate and conditional edf
The definition of edf and its sampling properties are easily generalized to the multivariate case.

If Z1, . . . , Zn is a sample from the distribution of a random m-vector Z, then the m-variate edf is defined as

F̂(z1, . . . , zm) = n^{-1} Σ_{i=1}^{n} 1{Zi1 ≤ z1, . . . , Zim ≤ zm} = n^{-1} Σ_{i=1}^{n} 1{Zi1 ≤ z1} · · · 1{Zim ≤ zm}.

Example 23 If m = 2 and Zi = (Xi, Yi), then the bivariate edf is defined as

F̂(x, y) = n^{-1} Σ_{i=1}^{n} 1{Xi ≤ x, Yi ≤ y} = n^{-1} Σ_{i=1}^{n} 1{Xi ≤ x} 1{Yi ≤ y}. □

Although conceptually straightforward, multivariate df estimates suffer from the curse of dimensionality problem and are difficult to represent graphically unless m is small, say equal to 2 or 3.

125
3.3 The empirical quantile function
We now consider the problem of estimating the qf of a rv Z given a sample Z1, . . . , Zn from its distribution.

Sample quantiles
A sample analogue of (33) is called a sample quantile. Thus, for any p ∈ (0, 1), a pth sample quantile is a number Q̂p ∈ R such that

n^{-1} Σ_{i=1}^{n} 1{Zi < Q̂p} ≤ p  and  n^{-1} Σ_{i=1}^{n} 1{Zi > Q̂p} ≤ 1 − p,

that is, about np of the observations are smaller than Q̂p and about n(1 − p) are greater. Equivalently, we have the condition

Σ_{i=1}^{n} 1{Zi < Q̂p} ≤ np ≤ Σ_{i=1}^{n} 1{Zi ≤ Q̂p},    (40)

that is, np cannot be smaller than the number of observations strictly less than Q̂p and cannot be greater than the number of observations less than or equal to Q̂p.

A pth sample quantile is unique if np is not an integer, and lies in a closed interval defined by two adjacent order statistics if np is an integer.

Example 24 A sample quantile corresponding to p = 1/2 is called a sample median. In fact, if Z[1] ≤ · · · ≤ Z[n] are the sample order statistics, then:

when n is odd (n/2 is not an integer), the sample median is Z[k], where k = (n + 1)/2;

when n is even (n/2 is an integer), the sample median is any point in the closed interval [Z[k], Z[k+1]], where k = n/2.

126
Asymmetric LAD
It can be shown that a pth sample quantile is also a solution to the sample
analogue of (34), namely
$$\min_{c \in \mathbb{R}}\; E_{\hat F}\, \rho_p(Z - c) = n^{-1} \sum_{i=1}^n \rho_p(Z_i - c), \qquad (41)$$
where $\rho_p(v) = [p - 1\{v < 0\}]\, v$ is the asymmetric absolute loss function.
Problem (41) is called an asymmetric least absolute deviations (ALAD)
problem. Notice that solving problem (41) avoids the need for sorting.

Since the function $\rho_p(v)$ is convex but not differentiable at v = 0, the
ALAD objective function $L(c) = n^{-1} \sum_{i=1}^n \rho_p(Z_i - c)$ is itself convex but not
differentiable at all c that correspond to a sample value. However, we can
always define its subgradients
$$L'(c-) = \lim_{h \downarrow 0} \frac{L(c - h) - L(c)}{h} = p - \frac{1}{n} \sum_{i=1}^n 1\{Z_i < c\}$$
(actually its gradient when c is not equal to a sample value) and
$$L'(c+) = \lim_{h \downarrow 0} \frac{L(c + h) - L(c)}{h} = \frac{1}{n} \sum_{i=1}^n 1\{Z_i \le c\} - p,$$
with $L'(c-) + L'(c+) = n^{-1} \sum_{i=1}^n 1\{Z_i = c\}$, so that $L'(c-) \approx -L'(c+)$ in large
samples. We can then characterize $\hat Q_p$ as a
solution to (41) by the fact that both $L'(\hat Q_p-) \ge 0$ and $L'(\hat Q_p+) \ge 0$ must
hold, that is,
$$\frac{1}{n} \sum_{i=1}^n 1\{Z_i < \hat Q_p\} \le p \le \frac{1}{n} \sum_{i=1}^n 1\{Z_i \le \hat Q_p\},$$
which is just a restatement of condition (40).
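
The equivalence with the ALAD problem can be checked numerically. The sketch below (an illustration, not the notes' own code) evaluates the ALAD objective over the sample values, where the minimum of a piecewise linear convex function must occur, and compares the result with the order-statistic rule; the comparison with np.quantile(..., method="inverted_cdf") assumes a recent NumPy.

# Sketch: the ALAD objective L(c) = n^{-1} sum rho_p(Z_i - c) is piecewise
# linear and convex, so it is minimized at a sample value.
import numpy as np

def rho(v, p):
    return (p - (v < 0)) * v

def alad_quantile(z, p):
    z = np.asarray(z, dtype=float)
    objective = lambda c: np.mean(rho(z - c, p))
    candidates = np.sort(z)                       # a minimizer lies at some Z_i
    return candidates[np.argmin([objective(c) for c in candidates])]

rng = np.random.default_rng(1)
z = rng.standard_normal(201)
for p in (0.1, 0.5, 0.9):
    print(p, alad_quantile(z, p), np.quantile(z, p, method="inverted_cdf"))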

127
Example 25 A sample median is a solution to problem (41) for p = 1/2
and minimizes the least absolute deviations (LAD) objective function
$$L(c) = \frac{1}{2n} \sum_{i=1}^n |Z_i - c|.$$
Notice that minimizing L(c) over $\mathbb{R}$ is equivalent to solving
$$\min_{c \in \mathbb{R}} \sum_{i=1}^n |Z_i - c|.$$
In this case, the subgradients are
$$L'(c-) = \lim_{h \downarrow 0} \frac{L(c - h) - L(c)}{h} = \frac{1}{2} - \frac{1}{n} \sum_{i=1}^n 1\{Z_i < c\}$$
and
$$L'(c+) = \lim_{h \downarrow 0} \frac{L(c + h) - L(c)}{h} = \frac{1}{n} \sum_{i=1}^n 1\{Z_i \le c\} - \frac{1}{2},$$
so the following condition must hold
$$\frac{1}{n} \sum_{i=1}^n 1\{Z_i < \hat\mu\} \le \frac{1}{2} \le \frac{1}{n} \sum_{i=1}^n 1\{Z_i \le \hat\mu\},$$
that is, at most half of the observations are strictly smaller than $\hat\mu$ and at most
half are strictly greater than $\hat\mu$. $\Box$

128
Figure 17: LAD objective function for different sample sizes.

[Figure: L(c) plotted against c for n = 4 and n = 5.]

129
Figure 18: Subgradients $L'(c-)$ and $L'(c+)$ of the LAD objective function for
n = 4.

[Figure: $L'(c-)$ and $L'(c+)$ plotted against c.]

130
Figure 19: Subgradients $L'(c-)$ and $L'(c+)$ of the LAD objective function for
n = 5.

[Figure: $L'(c-)$ and $L'(c+)$ plotted against c.]

131
The empirical quantile function
The empirical quantile function (eqf) is a real-valued function defined on
(0, 1) by
$$\hat Q(p) = \inf\{z \in \mathbb{R}\colon \hat F(z) \ge p\}.$$
Thus, the eqf is just the sample analogue of problem (35) and satisfies all
properties of a qf. In particular, it is the left-continuous inverse of the
right-continuous edf $\hat F$.

Notice that $\hat Q(p) = \hat Q_p$ when np is not an integer. Further, there is a close
association between the values of the eqf and the sample order statistics, for
$$\hat Q(p) = Z_{[i]}, \qquad \text{for } \frac{i-1}{n} < p \le \frac{i}{n}, \quad i = 1, \dots, n.$$

The rescaled difference
$$q_n(p) = \sqrt{n}\,[\hat Q(p) - Q(p)]$$
is a function on (0, 1) for any given sample, and is a rv for any given p. The
collection
$$q_n = \{q_n(p),\, 0 < p < 1\}$$
of all such rvs is therefore a stochastic process with index set (0, 1). This
process is called the sample quantile process.

132
Figure 20: Empirical quantile function $\hat Q$ for a sample of 100 observations from
a N(0,1) distribution.

[Figure: population qf and empirical qf plotted against p.]

133
Figure 21: Empirical quantile function $\hat Q$ for a sample of 1000 observations
from a N(0,1) distribution.

[Figure: population qf and empirical qf plotted against p.]

134
Figure 22: Realization of the sample quantile process $q_n$ for a sample of n = 100
observations from a N(0,1) distribution.

[Figure: $q_n(p)$ plotted against p.]

135
Finite sample properties
Because sample quantiles essentially coincide with sample order statistics, we
present a result on the df and the density of the sample order statistic $Z_{[i]}$,
corresponding to the sample quantile $\hat Q(i/n)$.

Theorem 12 Let $Z_1, \dots, Z_n$ be a sample from a distribution with df F. Then
the df of $Z_{[i]}$, $1 \le i \le n$, is
$$G(z) = \Pr\{Z_{[i]} \le z\} = \sum_{k=i}^n \binom{n}{k} F(z)^k\,[1 - F(z)]^{n-k}.$$
If F has a density f = F', then the density of $Z_{[i]}$ is
$$g(z) = G'(z) = n \binom{n-1}{i-1} f(z)\, F(z)^{i-1}\,[1 - F(z)]^{n-i}.$$
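
A small simulation check of Theorem 12 (a sketch with arbitrary choices of n, i and the evaluation point): the df of $Z_{[i]}$ equals the binomial tail sum in the theorem.

# Sketch checking Theorem 12 by simulation for an N(0,1) sample.
import numpy as np
from scipy.stats import norm, binom

n, i, z0 = 20, 5, -0.3
G_exact = binom.sf(i - 1, n, norm.cdf(z0))      # sum_{k=i}^n C(n,k) F^k (1-F)^(n-k)
rng = np.random.default_rng(2)
draws = np.sort(rng.standard_normal((50_000, n)), axis=1)[:, i - 1]
print(G_exact, np.mean(draws <= z0))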

Interest often centers not just on a single quantile, but on a finite number of
them. This is the case when we seek a detailed description of the shape of a
probability distribution, or we are interested in constructing some L-estimate,
that is, a linear combination of sample quantiles, such as the sample IQR, the
sample IDR, or a trimmed mean.

Sample quantiles corresponding to different values of p are dependent. Hence,
from a practical point of view, what matters is their joint distribution. The
next result presents the joint density of two sample order statistics, $Z_{[i]}$ and
$Z_{[j]}$, corresponding respectively to the sample quantiles $\hat Q(i/n)$ and $\hat Q(j/n)$.

Theorem 13 Let $Z_1, \dots, Z_n$ be a sample from a continuous distribution with
df F and density f. Then the joint density of $Z_{[i]}$ and $Z_{[j]}$, $1 \le i < j \le n$, is
$$g(u, v) = \frac{n!}{(i-1)!\,(j-1-i)!\,(n-j)!}\, f(u) f(v)\, F(u)^{i-1}
[F(v) - F(u)]^{j-1-i}\,[1 - F(v)]^{n-j}, \qquad u < v.$$

136
Asymptotic properties
Since the exact sampling properties of a finite set of sample quantiles are
somewhat complicated, we now consider their asymptotic properties.

Let $\hat Q_n$ be a solution to (41) (to simplify notation we henceforth drop the ref-
erence to p) and again assume that $Z_1, \dots, Z_n$ is a sample from a continuous
distribution with df F and finite strictly positive density f. Under this
assumption, $Q = F^{-1}(p)$ is the unique solution to (34) for any 0 < p < 1.
Further, the ALAD objective function
$$L_n(c) = n^{-1} \sum_{i=1}^n \rho_p(Z_i - c)$$
can be shown to converge almost surely as $n \to \infty$, uniformly on $\mathbb{R}$, to
the population objective function
$$L(c) = E\, \rho_p(Z - c).$$
Thus, as $n \to \infty$, $\hat Q_n \xrightarrow{as} Q$ for any 0 < p < 1.

137
Asymptotic normality of a sample quantile

Because the ALAD objective function $L_n(c) = n^{-1} \sum_{i=1}^n \rho_p(Z_i - c)$ is not
differentiable, we cannot derive asymptotic normality of a sample quantile
$\hat Q_n$ (the index p has been dropped for simplicity) by taking a 2nd-order Taylor
expansion of $L_n$.

There are various approaches to the problem. A simple approach is based on
the fact that the subgradient of $L_n$,
$$L_n'(c) = n^{-1} \sum_{i=1}^n 1\{Z_i \le c\} - p,$$
is increasing in c (Figures 18-19). Since $L_n'(\hat Q_n) \approx 0$, this implies that, ignoring
ties, $\hat Q_n \le c$ iff $L_n'(c) > 0$.

Next notice that, for any $t \in \mathbb{R}$, $T_n = \sqrt{n}\,(\hat Q_n - Q) \le t$ iff $\hat Q_n \le Q + t/\sqrt{n}$.
Thus, putting $c = Q + t/\sqrt{n}$, the df of the rescaled and recentered difference
$T_n$ is
$$\Pr\{T_n \le t\} = \Pr\Big\{L_n'\Big(Q + \frac{t}{\sqrt{n}}\Big) > 0\Big\} = \Pr\{\bar W_n > 0\},$$
with $\bar W_n = n^{-1} \sum_{i=1}^n W_i$, where $W_i = 1\{Z_i \le Q + t/\sqrt{n}\} - p$ is a binary rv that
takes values 1 - p and -p with probabilities $F(Q + t/\sqrt{n})$ and $1 - F(Q + t/\sqrt{n})$
respectively. Thus, $\bar W_n$ is an average of iid rvs with finite mean and variance.
As $n \to \infty$,
$$E\,W_i = F\Big(Q + \frac{t}{\sqrt{n}}\Big) - p = F\Big(Q + \frac{t}{\sqrt{n}}\Big) - F(Q) \approx \frac{f(Q)\, t}{\sqrt{n}},$$
while
$$\mathrm{Var}\, W_i = F\Big(Q + \frac{t}{\sqrt{n}}\Big)\Big[1 - F\Big(Q + \frac{t}{\sqrt{n}}\Big)\Big] \to p(1 - p).$$
If f(Q) > 0 then, by the De Moivre-Laplace Central Limit Theorem,
$$\Pr\{\bar W_n > 0\} = \Pr\bigg\{\frac{\bar W_n - f(Q)\, t/\sqrt{n}}{\sqrt{p(1-p)/n}} > -\frac{t}{\sigma}\bigg\} \to \Phi\Big(\frac{t}{\sigma}\Big)$$
as $n \to \infty$, where $\sigma = \sqrt{p(1-p)}/f(Q)$. Hence, $T_n \Rightarrow N(0, \sigma^2)$ as $n \to \infty$.
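
A Monte Carlo sketch of this result (illustrative sample sizes and seed): for N(0,1) data, the variance of $\sqrt{n}\,(\hat Q_n - Q)$ should be close to $p(1-p)/f(Q)^2$.

# Sketch: simulated vs asymptotic variance of a sample quantile for N(0,1) data.
import numpy as np
from scipy.stats import norm

p, n, reps = 0.25, 2_000, 2_000
Q = norm.ppf(p)
sigma2 = p * (1 - p) / norm.pdf(Q) ** 2
rng = np.random.default_rng(3)
Qn = np.quantile(rng.standard_normal((reps, n)), p, axis=1)
print("simulated variance:", n * Qn.var(), "  asymptotic variance:", sigma2)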

138
Asymptotic joint normality of sample quantiles
Given J distinct values
$$0 < p_1 < \dots < p_J < 1,$$
let Q be the $J \times 1$ vector consisting of $Q_1, \dots, Q_J$, with $Q_j = Q(p_j)$, and let
$\hat Q_n = (\hat Q_{n1}, \dots, \hat Q_{nJ})$, with $\hat Q_{nj} = \hat Q_n(p_j)$, be the sample analogue of Q. The
sampling distribution of the vector
$$q_n = \sqrt{n}\,(\hat Q_n - Q)$$
corresponds to the J-dimensional distribution of the sample quantile pro-
cess $q_n$.

Extending the previous argument one can show that, if F possesses a contin-
uous density f which is positive and finite at $Q_1, \dots, Q_J$, then
$$q_n \Rightarrow N_J(0, \Omega)$$
as $n \to \infty$, where $\Omega$ is the $J \times J$ matrix with generic element
$$\omega_{rs} = \frac{\min(p_r, p_s) - p_r p_s}{f(Q_r)\, f(Q_s)}, \qquad r, s = 1, \dots, J. \qquad (42)$$

Csorgo (1983) and Shorack and Wellner (1986) discuss an analogue of the
DKW inequality and weak convergence of the sample quantile process to the
Brownian bridge, with finite dimensional distributions equal to the asymp-
totic distribution of $q_n$.

139
Bahadur representation
Bahadur (1966) and Kiefer (1967) established the close link between the
sample quantile process $q_n(p) = \sqrt{n}\,[\hat Q_n(p) - Q(p)]$ and the empirical process
$p_n(z) = \sqrt{n}\,[\hat F_n(z) - F(z)]$.

They showed that, under regularity conditions, with probability tending to
one as $n \to \infty$,
$$\hat Q_n(p) - Q(p) = \frac{1}{n} \sum_{i=1}^n \frac{p - 1\{Z_i \le Q(p)\}}{f(Q(p))} + R_n,$$
uniformly for $p \in [\epsilon, 1 - \epsilon]$ and some $\epsilon > 0$, where the remainder $R_n$ is
$O_p(n^{-3/4} (\ln\ln n)^{3/4})$. Multiplying both sides by $\sqrt{n}$ gives
$$\sqrt{n}\,[\hat Q_n(p) - Q(p)] = \frac{\sqrt{n}\,[F(Q(p)) - \hat F_n(Q(p))]}{f(Q(p))} + R_n',$$
where the remainder $R_n'$ is $O_p(n^{-1/4} (\ln\ln n)^{3/4})$. Thus,
$$q_n(p) = -\frac{p_n(Q(p))}{f(Q(p))} + o_p(1).$$

Multivariate normality of sample quantiles follows immediately from this rep-
resentation.

140
Estimating the asymptotic variance of sample quantiles
From (42), a pth sample quantile has asymptotic variance
$$AV(\hat Q_n(p)) = \frac{p(1-p)}{f(Q(p))^2}.$$
In particular, a sample median $\hat\mu_n$ has asymptotic variance
$$AV(\hat\mu_n) = \frac{1}{4 f(\mu)^2},$$
where $\mu = Q(.50)$.

In practice, the asymptotic variance of sample quantiles has to be estimated
from the data. If $\hat f$ is a nonparametric estimate of f, an estimate of the
asymptotic variance of $\hat Q_n(p)$ is
$$\widehat{AV}(\hat Q_n(p)) = \frac{p(1-p)}{\hat f(\hat Q_n(p))^2}.$$

As a simple consistent estimator of $f(Q(p))^{-1}$, Cox and Hinkley (1974) sug-
gest
$$\frac{Z_{([np]+h_n)} - Z_{([np]-h_n)}}{2 h_n/n},$$
where [x] denotes the integer part of x and $h_n$ is an integer bandwidth such
that $h_n \to \infty$ and $h_n/n \to 0$ as $n \to \infty$.
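
A sketch of the Cox-Hinkley difference-quotient estimator of $1/f(Q(p))$ and the implied standard error of a sample quantile; the bandwidth rule $h_n = \sqrt{n}$ used below is an ad hoc choice for illustration, not a recommendation from the notes.

# Sketch: Cox-Hinkley sparsity estimate and standard error of a sample quantile.
import numpy as np
from scipy.stats import norm

def quantile_se(z, p):
    z = np.sort(np.asarray(z, dtype=float))
    n = len(z)
    h = max(1, int(round(n ** 0.5)))            # ad hoc integer bandwidth
    k = int(np.floor(n * p))
    lo, hi = max(k - h, 1), min(k + h, n)
    sparsity = (z[hi - 1] - z[lo - 1]) / ((hi - lo) / n)   # estimate of 1/f(Q(p))
    return np.sqrt(p * (1 - p) / n) * sparsity

rng = np.random.default_rng(4)
z = rng.standard_normal(1_000)
p = 0.5
print("estimated se:", quantile_se(z, p))
print("asymptotic se:", np.sqrt(p * (1 - p) / len(z)) / norm.pdf(norm.ppf(p)))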

141
Asymptotic relative efficiency of the median to the mean
We may use result (42) to compare the asymptotic properties of the sample
mean $\bar Z_n$ and the sample median $\hat\mu_n$ for a random sample from a rv Z with a
unimodal distribution that is symmetric about $\mu$, with variance $0 < \sigma^2 < \infty$
and density f that is strictly positive at $\mu$.

Under these assumptions, as $n \to \infty$,
$$\sqrt{n}\,(\bar Z_n - \mu) \Rightarrow N(0, \sigma^2),$$
$$\sqrt{n}\,(\hat\mu_n - \mu) \Rightarrow N\Big(0, \frac{1}{4 f(\mu)^2}\Big).$$

Because the two estimators are asymptotically normal with the same asymp-
totic mean, their comparison may be based on the ratio of their asymptotic
variances
$$ARE(\hat\mu_n, \bar Z_n) = \frac{AV(\bar Z_n)}{AV(\hat\mu_n)} = 4 \sigma^2 f(\mu)^2,$$
called the asymptotic relative efficiency (ARE) of the sample median to
the sample mean.

Because $\mathrm{Var}\,\bar Z_n \approx n^{-1} AV(\bar Z_n)$ and $\mathrm{Var}\,\hat\mu_n \approx n^{-1} AV(\hat\mu_n)$, the ARE is equal
to the ratio of the sample sizes that the sample mean and the sample median
respectively need to attain the same level of precision (inverse of the sampling
variance). Thus, if 100 observations are needed for the sample median to attain
a given level of precision, then the sample mean would need about $100 \times$ ARE
observations.

The ARE of the sample median increases with the peakedness f($\mu$) of the
density f at $\mu$. It is easy to verify that
$$ARE(\hat\mu_n, \bar Z_n) = \begin{cases} 2/\pi \approx .64, & \text{if Z is Gaussian,} \\ \pi^2/12 \approx .82, & \text{if Z is logistic,} \\ 2, & \text{if Z is double exponential.} \end{cases}$$
The table below shows the ARE of the sample median for t distributions with
$m \ge 3$ dof.

    m      3     4     5     8     ∞
    ARE   1.62  1.12  .96   .80   .64
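
The ARE values above are easy to reproduce numerically; the sketch below evaluates $4\sigma^2 f(\mu)^2$ for the distributions just mentioned using scipy.stats.

# Sketch: numerical check of ARE = 4 sigma^2 f(mu)^2 for several distributions.
import numpy as np
from scipy.stats import norm, logistic, laplace, t

def are_median_vs_mean(dist):
    return 4 * dist.var() * dist.pdf(dist.mean()) ** 2

print("Gaussian:", are_median_vs_mean(norm()))              # 2/pi ~ .64
print("logistic:", are_median_vs_mean(logistic()))          # pi^2/12 ~ .82
print("double exponential:", are_median_vs_mean(laplace())) # 2
for m in (3, 4, 5, 8):
    print(f"t({m}):", round(are_median_vs_mean(t(m)), 2))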

142
3.4 Conditional distribution and quantile functions
Consider a random vector (X, Y), where X is $\mathbb{R}^k$-valued and Y is real val-
ued. The conditional distribution of Y given X = x may equivalently be
characterized through either:

The conditional df (or distributional regression function), a real-
valued function defined on $\mathbb{R} \times \mathbb{R}^k$ by
$$F(y\,|\,x) = \Pr\{Y \le y\,|\,X = x\}.$$

The conditional qf (or quantile regression function), a real-valued
function defined on $(0, 1) \times \mathbb{R}^k$ by
$$Q(p\,|\,x) = \inf\{y \in \mathbb{R}\colon F(y\,|\,x) \ge p\}.$$

For any fixed x, $F(y\,|\,x)$ and $Q(p\,|\,x)$ satisfy all properties of a df and a qf,
respectively. In particular,
$$F(y\,|\,x) = \sup\{p \in (0, 1)\colon Q(p\,|\,x) \le y\}.$$

To stress the relationship between the conditional qf and the conditional df,
the notations $Q(p\,|\,x) = F^{-1}(p\,|\,x)$ and $F(y\,|\,x) = Q^{-1}(y\,|\,x)$ will also be used.

Further, for any fixed x, the following monotonicity or no-crossing proper-
ties must be satisfied:

if $y' > y$, then $F(y'\,|\,x) \ge F(y\,|\,x)$,

if $p' > p$, then $Q(p'\,|\,x) \ge Q(p\,|\,x)$.

143
Conditional and marginal dfs and qfs
The Law of Iterated Expectations implies that
$$E\,Y = E\,\mu(X) = \int \mu(x)\, dH(x),$$
where H is the df of X. Thus, given two sub-populations with mean $\mu_1$ and
$\mu_2$ respectively, the difference in their means may be decomposed as
$$\mu_1 - \mu_2 = \int [\mu_1(x) - \mu_2(x)]\, dH_1(x) + \int \mu_2(x)\, [dH_1(x) - dH_2(x)],$$
where the 1st term on the rhs reflects differences in the CMF of the two
subpopulations, while the 2nd term reflects differences in their composition.

When the two CMFs are linear, that is, $\mu_j(x) = \beta_j' x$, j = 1, 2, we have the
so-called Blinder-Oaxaca decomposition
$$\mu_1 - \mu_2 = (\beta_1 - \beta_2)' \mu_{X,1} + \beta_2' (\mu_{X,1} - \mu_{X,2}),$$
where $\mu_{X,j}$ denotes the mean of X in group j = 1, 2.

In the case of a df we have
$$F(y) = \int F(y\,|\,x)\, dH(x),$$
so differences in the marginal distribution of Y for two sub-populations may
be decomposed as follows
$$F_1(y) - F_2(y) = \int [F_1(y\,|\,x) - F_2(y\,|\,x)]\, dH_1(x)
+ \int F_2(y\,|\,x)\, [dH_1(x) - dH_2(x)], \qquad (43)$$
where the 1st term on the rhs reflects differences in the conditional df of the two sub-
populations, while the 2nd term reflects differences in their composition.

Simple decompositions of this kind are not available for quantiles. Machado
and Mata (2005) and Melly (2005) propose methods essentially based on in-
version of (43). More recently, Firpo, Fortin and Lemieux (2009) have proposed
an approximate method based on the recentered influence function. See
also Fortin, Lemieux and Firpo (2011).

144
Estimation when X is discrete
The methods reviewed so far can easily be extended to estimation of the con-
ditional df and qf when the covariate vector X has a discrete distribution.

All that is needed is partitioning the data according to the values of X. The
conditional df and the regression qf corresponding to each of these values may
then be estimated by treating the relevant subset of observations as if they
were a separate sample.

Example 26 When X is a discrete rv and x is one of its possible values, the
conditional df of Y given X = x is defined as
$$F(y\,|\,x) = \Pr\{Y \le y\,|\,X = x\} = \frac{\Pr\{Y \le y, X = x\}}{\Pr\{X = x\}}.$$
If $O(x) = \{i\colon X_i = x\}$ is the set of sample points such that $X_i = x$ and n(x)
is their number, then the sample counterpart of $\Pr\{X = x\}$ is the fraction
n(x)/n of sample points such that $X_i = x$. Hence, if n(x) > 0, a reasonable
estimate of $F(y\,|\,x)$ is the fraction of sample points in O(x) such that $Y_i \le y$
$$\hat F(y\,|\,x) = n(x)^{-1} \sum_{i=1}^n 1\{Y_i \le y, X_i = x\}
= n(x)^{-1} \sum_{i \in O(x)} 1\{Y_i \le y\}. \qquad (44)$$
Viewed as a function of y for x fixed, $\hat F(y\,|\,x)$ is called the conditional edf
of $Y_i$ given $X_i = x$. $\Box$
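
A direct Python transcription of the conditional edf (44) for a discrete covariate (toy data and names are illustrative only):

# Sketch of the conditional edf (44): average 1{Y_i <= y} over obs with X_i = x.
import numpy as np

def conditional_edf(y_data, x_data, y, x):
    mask = (x_data == x)
    if not mask.any():
        raise ValueError("no observations with X_i = x")
    return np.mean(y_data[mask] <= y)

rng = np.random.default_rng(5)
x_data = rng.integers(0, 3, size=500)                  # X takes values 0, 1, 2
y_data = x_data + rng.standard_normal(500)             # Y | X = x ~ N(x, 1)
print(conditional_edf(y_data, x_data, y=1.0, x=1))     # should be near 0.5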

For these estimates to make sense, the number of data points corresponding to
each of the possible values of X ought to be sufficiently large. This approach
breaks down, therefore, when either X is continuous or X is discrete but its
support contains too many points relative to the available sample size.

In what follows, we begin with the problem of estimating the conditional qf
$Q(p\,|\,x)$ when X is a continuous random k-vector. In the last two decades,
this problem has received considerable attention in econometrics.

145
3.5 Estimating the conditional quantile function
Let $(X_1, Y_1), \dots, (X_n, Y_n)$ be a sample from the joint distribution of (X, Y).
The conditional qf $Q(p\,|\,x)$, viewed as a function of x for a given p, may be
obtained by solving the problem
$$\min_{c \in \mathcal{C}} E\, \rho_p(Y - c(X)), \qquad 0 < p < 1, \qquad (45)$$
where $\mathcal{C}$ is the class of real valued functions defined on $\mathbb{R}^k$.

For a given $p \in (0, 1)$, this suggests estimating $Q(p\,|\,x)$ by choosing a function
of x, out of a suitable family $\hat{\mathcal{C}} \subseteq \mathcal{C}$, so as to solve the sample analogue of (45)
$$\min_{c \in \hat{\mathcal{C}}}\; n^{-1} \sum_{i=1}^n \rho_p(Y_i - c(X_i)), \qquad 0 < p < 1.$$
This is called the quantile regression approach.

To recover $Q(p\,|\,x)$, now viewed as a function of both p and x, it is customary
to select J distinct values
$$0 < p_1 < \dots < p_J < 1,$$
and then estimate J distinct functions $Q_1, \dots, Q_J$, each defined on $\mathbb{R}^k$, with
$Q_j(x) = Q(p_j\,|\,x)$. There may be as many such values as one wishes. By
suitably choosing their number and position, one may get a reasonably accurate
description of $Q(p\,|\,x)$.

Given an estimate $\hat Q(p\,|\,x)$ of $Q(p\,|\,x)$, $F(y\,|\,x)$ may be estimated by inversion
$$\hat F(y\,|\,x) = \sup\{p \in (0, 1)\colon \hat Q(p\,|\,x) \le y\}.$$
This is a proper df iff $\hat Q(p\,|\,x)$ is a proper qf.

146
Linear quantile regression
In the original approach of Koenker and Bassett (1978), $\hat{\mathcal{C}}$ is the class of
linear functions of x, that is, $c(x) = b'x$ (unless stated otherwise, the model
always includes a constant).

In this case, the sample counterpart of problem (45) is the ALAD problem
$$\min_{b \in \mathbb{R}^k}\; n^{-1} \sum_{i=1}^n \rho_p(Y_i - b'X_i), \qquad 0 < p < 1. \qquad (46)$$
A solution $\hat\beta(p)$ to (46) is called a linear quantile regression estimate.

Given a solution $\hat\beta(p)$ to (46), an estimate of $Q(p\,|\,x)$ is
$$\hat Q(p\,|\,x) = \hat\beta(p)'x.$$

If $X_i$ only contains a constant term, that is, $X_i = 1$, $i = 1, \dots, n$, then problem
(46) is equivalent to problem (41), so $\hat\beta(p) = \hat Q(p)$ in this case.

By suitably redefining the elements of the covariate vector $X_i$, one may easily
generalize problem (46) to cases in which $\hat{\mathcal{C}}$ is a class of functions that de-
pend linearly on a finite dimensional parameter vector, such as the class of
polynomial functions of X of a given degree.
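
In Python, one readily available implementation of the Koenker-Bassett estimator is statsmodels' QuantReg (the notes themselves use Stata's qreg, reviewed in Section 3.8). A minimal sketch on simulated data from a location-shift model:

# Sketch: linear quantile regression with statsmodels on a location-shift design.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 500
x = rng.standard_normal(n)
y = 1 + x + rng.standard_normal(n)
X = sm.add_constant(x)

for p in (0.25, 0.5, 0.75):
    fit = sm.QuantReg(y, X).fit(q=p)
    print(p, fit.params)    # estimates of beta(p); only the intercept shifts with p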

147
Computational aspects of linear quantile regression
The lack of smoothness of the ALAD objective function implies that gradi-
ent methods cannot be employed to solve (46).

An ALAD estimate may however be computed efficiently using linear pro-
gramming methods (Barrodale & Roberts 1973, Koenker & d'Orey 1987).
This is because the ALAD problem (46) may be reformulated as the following
linear program
$$\min_{\beta \in \mathbb{R}^k,\, U^+,\, U^-}\; p\,\iota'U^+ + (1-p)\,\iota'U^-$$
$$\text{subject to} \quad Y = X\beta + U^+ - U^-, \quad U^+ \ge 0, \quad U^- \ge 0,$$
where $\iota$ is the n-vector of ones, and $U^+$ and $U^-$ are n-vectors with generic
elements equal to $U_i^+ = \max(0, Y_i - \beta'X_i)$ and $U_i^- = -\min(0, Y_i - \beta'X_i)$
respectively.

Simpler but cruder algorithms based on iterative WLS may also be em-
ployed.
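
The primal linear program above can be handed directly to a generic LP solver. The sketch below uses scipy.optimize.linprog, stacking the variables as (beta, U+, U-); it is meant only to make the reformulation concrete, not as an efficient implementation.

# Sketch: the ALAD primal LP solved with scipy's linprog.
import numpy as np
from scipy.optimize import linprog

def quantile_reg_lp(y, X, p):
    n, k = X.shape
    c = np.concatenate([np.zeros(k), p * np.ones(n), (1 - p) * np.ones(n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])      # X beta + u+ - u- = y
    bounds = [(None, None)] * k + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:k]

rng = np.random.default_rng(7)
n = 300
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = 1 + X[:, 1] + rng.standard_normal(n)
print(quantile_reg_lp(y, X, p=0.5))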

The dual of the above linear program is
$$\max_{d \in [p-1,\,p]^n} Y'd \quad \text{subject to} \quad X'd = 0$$
or, setting $a = d + (1-p)\,\iota_n$,
$$\max_{a \in [0,1]^n} Y'a \quad \text{subject to} \quad X'a = (1-p)\,X'\iota_n.$$
The solution $\hat a(p)$ to the dual problem connects quantile regression to regression
rank scores, the regression generalization of classical rank scores proposed by
Gutenbrunner and Jureckova (1992).

148
Asymptotics for linear quantile regression estimators
Let $\hat\beta_n = (\hat\beta_n(p_1), \dots, \hat\beta_n(p_J))$ be a kJ-vector of linear quantile regression
estimators, and consider the following assumptions:

Assumption 2 $(X_1, Y_1), \dots, (X_n, Y_n)$ is a random sample from the joint dis-
tribution of (X, Y), where the first element of X is the constant term.

Assumption 3 $E\,XX' = P$, a finite pd $k \times k$ matrix.

Assumption 4 The conditional df of Y given X = x satisfies
$$\Pr\{Y \le y\,|\,X = x\} = F(y - \beta'x),$$
where F has a continuous and strictly positive density f and a continuous and
strictly increasing qf Q such that Q(p) = 0 for some 0 < p < 1.

Assumption 4 is equivalent to the assumption that
$$Y_i = \beta'X_i + U_i, \qquad i = 1, \dots, n,$$
where the $U_i$ are distributed independently of the $X_i$ with df F (pure lo-
cation shift model). This assumption implies that
$$Q(p\,|\,x) = \beta'x + Q(p) = \beta(p)'x,$$
where the first element of $\beta(p)$ is equal to $\beta_1 + Q(p)$, with $\beta_1$ the first element
of $\beta$. If p is such that Q(p) = 0, then $\beta(p) = \beta$.

149
Consistency and asymptotic normality
Under Assumptions 2-4, as $n \to \infty$,
$$\hat\beta_n \xrightarrow{as} \beta^*,$$
where $\beta^* = (\beta(p_1), \dots, \beta(p_J))$. Further,
$$\sqrt{n}\,(\hat\beta_n - \beta^*) \Rightarrow N_{kJ}(0, \Omega \otimes P^{-1}), \qquad (47)$$
where $\Omega$ is the $J \times J$ matrix with generic element
$$\omega_{rs} = \frac{\min(p_r, p_s) - p_r p_s}{f(Q(p_r))\, f(Q(p_s))}, \qquad r, s = 1, \dots, J.$$

It follows from this result that, if X only contains a constant term, then P = 1,
$\hat\beta_n = \hat Q_n$, and the asymptotic variance of $\hat\beta_n$ is equal to $\Omega$.

Suppose, in addition, that F has finite variance $\sigma^2$. Then, under the above two
assumptions, result (47) implies that the ARE of a linear quantile regression
estimator $\hat\beta_n(p)$ relative to the OLS estimator $\hat\beta_n^{LS}$ is equal to
$$ARE(\hat\beta_n(p), \hat\beta_n^{LS}) = \frac{\sigma^2 f(Q(p))^2}{p(1-p)}.$$

In practice, to construct estimates of the asymptotic variance of linear quan-
tile regression estimators, one needs estimates of the density of the quantile
regression errors $U_i = Y_i - \beta'X_i$.
150
Extensions
Koenker and Portnoy (1987) provide a uniform Bahadur representation
for linear quantile regression estimators.

They show that, under Assumptions 2-4, with probability tending to one as
$n \to \infty$,
$$\sqrt{n}\,[\hat\beta_n(p) - \beta(p)] = P^{-1} \frac{1}{\sqrt{n}} \sum_{i=1}^n X_i\, \frac{p - 1\{U_i \le Q(p)\}}{f(Q(p))} + o_p(1), \qquad (48)$$
uniformly for $p \in [\epsilon, 1 - \epsilon]$ and some $\epsilon > 0$.

Among other things, the asymptotically linear representation (48) helps ex-
plain both the asymptotic normality and the good robustness proper-
ties of linear quantile regression estimators with respect to outliers in the
Y-space (although not to outliers in the X-space).

Both properties follow from the fact that the influence function of $\hat\beta(p)$ is
equal to
$$P^{-1} X_i\, \frac{p - 1\{U_i \le Q(p)\}}{f(Q(p))},$$
which is a bounded function of $U_i$ (although not of $X_i$).

151
Drawbacks of linear quantile regression estimators
Although increasingly used in empirical work to describe the conditional dis-
tribution of an outcome of interest, linear quantile regression estimators have
several drawbacks.

Some have to do with general properties of quantiles, especially:

the difficulty of imposing the no-crossing condition;

the complicated relationship between marginal and conditional quan-
tiles;

the difficulty of generalizing quantiles to the case when Y is vector
valued.

Others have to do with their specific properties, especially their behavior
under heteroskedasticity. The relevant issues are:

the validity of the linearity assumption;

the form of the asymptotic variance of linear quantile regression
estimators when the linearity assumption does not hold;

how to consistently estimate this asymptotic variance.

152
Behavior under heteroskedasticity
To illustrate the problem, let X be a scalar rv and suppose first that $Y =
\mu(X) + U$, where the regression error U is independent (not just mean in-
dependent) of X, with continuous strictly increasing df F. Then $F(y\,|\,x) =
F(y - \mu(x))$. By definition, the pth conditional quantile of Y satisfies
$F(Q(p\,|\,x)\,|\,x) = p$. Hence
$$Q(p\,|\,x) = Q(p) + \mu(x),$$
where $Q(p) = F^{-1}(p)$ is the pth quantile of F. Thus, for any $p' \ne p$,
$$Q(p'\,|\,x) - Q(p\,|\,x) = Q(p') - Q(p)$$
for all x, that is, the distance between any pair of conditional quantiles is
independent of x. If $\mu(x) = \alpha + \beta x$, then $Q(p\,|\,x) = [\alpha + Q(p)] + \beta x$, that is,
the conditional quantiles of Y are a family of parallel lines with common
slope $\beta$.

Now suppose that $\mu(x) = \alpha + \beta x$ but Y is conditionally heteroskedastic,
that is, $Y = \alpha + \beta X + \sigma(X)\,U$, where the function $\sigma(x)$ is strictly positive.
The homoskedastic model is a special case where $\sigma(x) = 1$ for all x. Now
$$F(y\,|\,x) = F\Big(\frac{y - \alpha - \beta x}{\sigma(x)}\Big),$$
so
$$Q(p\,|\,x) = \alpha + \beta x + \sigma(x)\, Q(p).$$
In this case:

although $\mu(x)$ is linear in x, conditional quantiles need not be;

the distance between any pair of conditional quantiles depends on x,
$$Q(p'\,|\,x) - Q(p\,|\,x) = \sigma(x)\,[Q(p') - Q(p)].$$

Partial exceptions are the cases when:

$\mu(x)$ is linear and F is symmetric about zero, implying that the
conditional median $Q(.5\,|\,x)$ (but not other quantiles) is linear in x;

$\sigma(x)$ is linear in x, implying that conditional quantiles are linear in x,
although no longer with a common slope.

153
Figure 23: Quantiles of $Y\,|\,X = x \sim N(\mu(x), \sigma^2(x))$ when $\mu(x) = 1 + x$ and
either $\sigma^2(x) = 1$ (homoskedasticity) or $\sigma^2(x) = 1 + (2x + .5)^2$ (heteroskedas-
ticity).

[Figure: two panels (homoskedasticity, heteroskedasticity) showing $Q(p\,|\,x)$ plotted against x.]

154
Implications
Linear quantile regressions may be a poor approximation to popula-
tion quantiles when data are conditionally heteroskedastic and the
square root of the conditional variance function is far from being linear
in x.

In the general case when $\mu(x)$ and $\sigma(x)$ are arbitrary functions, the conditional
quantiles of Y are of the form
$$Q(p\,|\,x) = \mu(x) + \sigma(x)\, Q(p).$$

In the absence of prior information, it is impossible to determine whether
nonlinearity of $Q(p\,|\,x)$ reflects nonlinearity of $\mu(x)$, heteroskedastic-
ity, or both.

Example 27 If $\mu(x) = \alpha_0 + \alpha_1 x$ and $\sigma(x) = \gamma_0 + \gamma_1 x$, then
$$Q(p\,|\,x) = a(p) + b(p)\,x,$$
where $a(p) = \alpha_0 + \gamma_0 Q(p)$ and $b(p) = \alpha_1 + \gamma_1 Q(p)$.

If instead $\mu(x) = \alpha_0 + \alpha_1 x + \alpha_2 x^2$ and $\sigma(x) = \gamma_0 + \gamma_1 x$, then
$$Q(p\,|\,x) = a(p) + b(p)\,x + c\,x^2,$$
where $a(p) = \alpha_0 + \gamma_0 Q(p)$ and $b(p) = \alpha_1 + \gamma_1 Q(p)$ depend on p, but $c = \alpha_2$
does not.

Finally, if $\mu(x) = \alpha_0 + \alpha_1 x$ and $\sigma(x) = (\gamma_0 + \gamma_1 x)^2$, then
$$Q(p\,|\,x) = a(p) + b(p)\,x + c(p)\,x^2,$$
where $a(p) = \alpha_0 + \gamma_0^2 Q(p)$, $b(p) = \alpha_1 + 2\gamma_0\gamma_1 Q(p)$ and $c(p) = \gamma_1^2 Q(p)$ all depend
on p. $\Box$

155
Interpreting linear quantile regressions
Consider the ordinary (mean) regression model $Y = \beta'X + U$, where U and
X are uncorrelated. In this model, $\beta'X$ may equivalently be interpreted
as the best linear predictor of Y given X, or as the best linear approxi-
mation to the conditional mean $\mu(X) = E(Y\,|\,X)$, that is,
$$\beta = \arg\min_{b \in \mathbb{R}^k} E(Y - b'X)^2 = \arg\min_{b \in \mathbb{R}^k} E_X\,[\mu(X) - b'X]^2,$$
where the first expectation is with respect to the joint distribution of X and
Y, while the second is with respect to the marginal distribution of X.

Angrist, Chernozhukov and Fernandez-Val (2006) show that a similar inter-
pretation is available for linear quantile regression. More precisely, they show
that if $\beta(p)$ is the slope of a linear quantile regression, then
$$\beta(p) = \arg\min_{b \in \mathbb{R}^k} E\,\rho_p(Y - b'X) = \arg\min_{b \in \mathbb{R}^k} E_X\, w(X, b)\,[Q(p\,|\,X) - b'X]^2,$$
with
$$w(x, b) = \int_0^1 (1 - u)\, f\big(Q(p\,|\,x) + u\,\Delta(x, b)\,\big|\,x\big)\, du,$$
where $\Delta(x, b) = b'x - Q(p\,|\,x)$ and $f(y\,|\,x)$ is the conditional density of Y given
X = x. Thus, linear quantile regression minimizes a weighted quadratic mea-
sure of discrepancy between the population conditional qf and its best linear
approximation. The weighting function, however, is not easily interpretable.

For an alternative proof of this result, see Theorem 2.7 in Koenker (2005).

156
Inference under heteroskedasticity
From the practical point of view, heteroskedasticity implies that estimated lin-
ear quantile regressions may cross each other, thus violating a fundamental
property of quantiles and complicating the interpretation of the results of
a statistical analysis.

It also implies that the asymptotic variance matrix of $\hat\beta$ is no longer given by
(47). The block corresponding to the asymptotic covariance between $\hat\beta_j$ and
$\hat\beta_k$ is instead equal to
$$[\min(p_j, p_k) - p_j p_k]\; B_j^{-1} P B_k^{-1},$$
where
$$B_j = E_X\big[f(X'\beta_j\,|\,X)\, XX'\big]$$
and $f(y\,|\,x)$ denotes the conditional density of Y given X = x.

An analog estimator of $B_j$ is not easy to obtain, as it requires estimating
the conditional density $f(y\,|\,x)$, so bootstrap methods are typically used to
carry out inference in this case.
157
Figure 24: Scatterplot of the data and estimated quantiles for a random
sample of 200 observations from the model $Y\,|\,X \sim N(\mu(X), \sigma^2(X))$, with
$X \sim N(0, 1)$, $\mu(X) = 1 + X$ and either $\sigma^2(X) = 1$ (homoskedasticity) or
$\sigma^2(X) = 1 + (2X + .5)^2$ (heteroskedasticity).

[Figure: top panels show Y against x, bottom panels show the estimated quantiles against x, for the homoskedastic and heteroskedastic designs.]

158
Nonparametric quantile regression estimators
To overcome the problems arising with the linearity assumption, several non-
parametric quantile regression estimators have been proposed. These include:

kernel and nearest neighbor methods (Chauduri 1991),

regression splines with a fixed number of knots (Hendricks & Koenker
1992),

smoothing splines and penalized likelihood (Koenker, Ng & Port-
noy 1994),

locally linear fitting (Yu & Jones 1998), that is, $\hat Q(p\,|\,x) = \hat\beta(p; x)'x$,
with
$$\hat\beta(p; x) = \arg\min_{b \in \mathbb{R}^k}\; n^{-1} \sum_{i=1}^n \rho_p(Y_i - b'X_i)\, W_i(x), \qquad 0 < p < 1,$$
where the $W_i(x)$ are nonnegative weights that add up to one and, typi-
cally, give more weight to Y-values for which $X_i$ is closer to x.

In all these cases, the family of functions $\hat{\mathcal{C}}$ is left essentially unrestricted,
except for smoothness.

Drawbacks

Curse-of-dimensionality problem: It is not clear how to general-
ize these estimators to cases when there are more than two or three
covariates.

Because all these estimators are nonlinear, it is hard to represent them
in ways that facilitate comparisons. For example, it is not clear how
to generalize the concepts of equivalent kernel and equivalent degrees of
freedom that prove so useful for linear smoothers.
159
3.6 Estimating the conditional distribution function
If the interest is not merely in a few quantiles but in the whole conditional dis-
tribution of Y given a random k-vector X, why not estimate the conditional
df $F(y\,|\,x)$ directly?

Given a sample $(X_1, Y_1), \dots, (X_n, Y_n)$ from the joint distribution of (X, Y),
Stone (1977) suggested estimating $F(y\,|\,x)$ nonparametrically by a nearest-
neighbor estimator of the form
$$\hat F(y\,|\,x) = n^{-1} \sum_{i=1}^n W_i(x)\, 1\{Y_i \le y\}, \qquad -\infty < y < \infty,$$
where $W_i(x) = W_i(x; X_1, \dots, X_n)$ gives more weight to Y-values for which $X_i$
is closer to x.

Viewed as a function of y for x given, $\hat F(y\,|\,x)$ is a proper df if:

$W_i(x) \ge 0$, $i = 1, \dots, n$,

$n^{-1} \sum_{i=1}^n W_i(x) = 1$.

Stute (1986) studied the properties of two estimators of this type:

The first is based on weights of the form
$$W_i(x) = \frac{1}{h_n}\, K\Big(\frac{\hat H(x) - \hat H(X_i)}{h_n}\Big),$$
where $\hat H(x)$ is the edf of the X, K is a smooth kernel with bounded
support and $h_n$ is the bandwidth. For any x, $\hat F(y\,|\,x)$ is not a proper df,
as the kernel weights do not satisfy the normalization above.

The second, denoted by $\tilde F(y\,|\,x)$, is a proper df, for it replaces the factor
$n^{-1}$ with normalization by the sum of the weights, that is, it uses
$\tilde W_i(x) = W_i(x) / \sum_j W_j(x)$.

Estimators of this kind tend to do rather poorly when data are sparse,
which is typically the case when k is greater than 2 or 3.
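
A minimal sketch of a weighted conditional-df estimator of this type, using Gaussian kernel weights in the raw covariate (a simpler variant of Stute's construction, which uses the edf of X) and normalizing the weights so that the estimate is a proper df in y:

# Sketch of a kernel-weighted conditional df estimator with normalized weights.
import numpy as np

def cond_df(y_data, x_data, y, x, h):
    u = (x_data - x) / h
    w = np.exp(-0.5 * u ** 2)            # Gaussian kernel weights
    w = w / w.sum()                      # normalized so the estimate is a proper df
    return np.sum(w * (y_data <= y))

rng = np.random.default_rng(8)
n = 1_000
x_data = rng.standard_normal(n)
y_data = 1 + x_data + rng.standard_normal(n)
# In this design, F(y | x) at y = 1, x = 0 is Phi(0) = 0.5
print(cond_df(y_data, x_data, y=1.0, x=0.0, h=0.3))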

160
Avoiding the curse of dimensionality
We now describe a simple semi-nonparametric method that appears to
perform well even when k is large relative to the available data. The basic
idea is to partition the range of Y into J + 1 intervals defined by a set of
knots
$$-\infty < y_1 < \dots < y_J < \infty,$$
and then estimate the J functions $F_1(x), \dots, F_J(x)$, where $F_j(x) = F(y_j\,|\,x) =
\Pr\{Y \le y_j\,|\,X = x\}$.

If the conditional distribution of Y is continuous with support on the whole
real line then, at any x in the support of X, the sequence of functions $\{F_j(x)\}$
must satisfy the following conditions:
$$0 < F_j(x) < 1, \qquad j = 1, \dots, J, \qquad (49)$$
$$0 < F_1(x) < \dots < F_J(x) < 1. \qquad (50)$$

One way of automatically imposing the nonnegativity condition (49) is to
model not $F_j(x)$ directly, but rather the conditional log odds
$$\eta_j(x) = \ln \frac{F_j(x)}{1 - F_j(x)}.$$
Given an estimate $\hat\eta_j(x)$ of $\eta_j(x)$, one may then estimate $F_j(x)$ by
$$\hat F_j(x) = \frac{\exp \hat\eta_j(x)}{1 + \exp \hat\eta_j(x)}.$$

161
Estimation
By the analogy principle, $\eta_j$ may be estimated by maximizing the sample log
likelihood
$$L(\eta) = \sum_{i=1}^n \big[\,1\{Y_i \le y_j\}\, \eta(X_i) - \ln(1 + \exp \eta(X_i))\,\big]$$
over a suitable family $\hat{\mathcal{H}}$ of functions of x.

This approach, which may be called the distributional regression ap-
proach, entails fitting J separate logistic regressions, one for each threshold
value $y_j$. Boundedness of $1\{Y_i \le y_j\}$ ensures good robustness properties
with respect to outliers in the Y-space.

Alternative specifications of $\hat{\mathcal{H}}$ correspond to alternative estimation methods.
As for quantile regressions, one may distinguish between:

parametric methods (e.g. logit, probit);

nonparametric methods with $\hat{\mathcal{H}}$ unrestricted, except for smooth-
ness;

nonparametric methods with $\hat{\mathcal{H}}$ restricted (such as projection pur-
suit, additive or locally-linear modeling).
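
A sketch of the distributional regression approach with a linear index $\eta_j(x) = \theta_j'x$, fitting one logit per threshold with statsmodels; the thresholds, the data generating process and the logit link are illustrative assumptions (the true conditional df in this toy design is Gaussian, so the fitted logits are only an approximation).

# Sketch: distributional regression as J separate logits of 1{Y <= y_j} on X.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 2_000
x = rng.standard_normal(n)
y = 1 + x + rng.standard_normal(n)
X = sm.add_constant(x)

knots = np.quantile(y, [0.2, 0.4, 0.6, 0.8])          # thresholds y_1 < ... < y_J
theta_hat = [sm.Logit((y <= yj).astype(float), X).fit(disp=0).params
             for yj in knots]
# Estimated F_j(x) at x = 0; the true values are Phi(y_j - 1)
x0 = np.array([1.0, 0.0])
print([float(1 / (1 + np.exp(-th @ x0))) for th in theta_hat])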

162
Example 28 The simplest case is when $\eta(y\,|\,x) = \theta(y)'x$.

For $j = 1, \dots, J$, let $\hat\theta_{jn}$ be the logit estimator of $\theta_j = \theta(y_j)$ and let $\hat\eta_{jn}(x) =
\hat\theta_{jn}'x$ be the implied estimator of $\eta_j(x)$. Also let
$$F(x) = \begin{bmatrix} F_1(x) \\ \vdots \\ F_J(x) \end{bmatrix}, \qquad
\theta = \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_J \end{bmatrix}, \qquad
\eta(x) = \begin{bmatrix} \eta_1(x) \\ \vdots \\ \eta_J(x) \end{bmatrix} = (I_J \otimes x')\,\theta,$$
where $I_J$ is the $J \times J$ identity matrix. Finally, let $\hat F_n(x)$, $\hat\theta_n$ and $\hat\eta_n(x) = (I_J \otimes x')\,\hat\theta_n$
be the estimators of F(x), $\theta$ and $\eta(x)$ respectively.

Under mild regularity conditions, as $n \to \infty$,
$$\sqrt{n}\,(\hat\theta_n - \theta) \Rightarrow N_{kJ}(0, \mathcal{I}^{-1}),$$
where
$$\mathcal{I} = E[V(X) \otimes XX']$$
and V(x) is the $J \times J$ matrix with generic element
$$V_{jk}(x) = \min(F_j(x), F_k(x)) - F_j(x)\, F_k(x).$$
Hence, as $n \to \infty$,
$$\sqrt{n}\,[\hat\eta_n(x) - \eta(x)] \Rightarrow N_J(0, A(x)),$$
where
$$A(x) = (I_J \otimes x')\,\mathcal{I}^{-1}\,(I_J \otimes x).$$
Therefore, as $n \to \infty$,
$$\sqrt{n}\,[\hat F_n(x) - F(x)] \Rightarrow N_J(0, \Sigma(x)),$$
where
$$\Sigma(x) = V(x)\, A(x)\, V(x).$$
$\Box$

163
Figure 25: Estimated conditional dfs of log monthly earnings (Peracchi 2002).

[Figure: one panel per country (Austria, Belgium, Denmark, Finland, France, Germany, Greece, Ireland, Italy, Luxembourg, Netherlands, Portugal, Spain, UK), plotting the estimated probabilities against experience.]

164
Imposing monotonicity
While our approach automatically imposes the nonnegativity condition (49),
it does not guarantee that the monotonicity condition (50) also holds.

Since $\eta_j(x)$ is strictly increasing in $F_j(x)$, monotonicity is equivalent to the
condition that
$$-\infty < \eta_1(x) < \dots < \eta_J(x) < \infty$$
for all x in the support of X.

If $\eta_j(x) = \alpha_j + \gamma_j(x)$, then monotonicity holds if
$$\alpha_j > \alpha_{j-1}, \qquad \gamma_j(x) \ge \gamma_{j-1}(x).$$
One case where these two conditions are satisfied is the ordered logit model.
This model is restrictive, however, for it implies that changes in the covariate
vector X affect the conditional distribution of Y only through a location
shift.

An alternative is to model $F_1(x)$ and the conditional probabilities or discrete
hazards
$$\lambda_j(x) = \Pr\{Y \le y_j\,|\,Y > y_{j-1}, X = x\}
= \frac{S_{j-1}(x) - S_j(x)}{S_{j-1}(x)}, \qquad j = 2, \dots, J,$$
where $S_j(x) = 1 - F_j(x)$ is the survivor function evaluated at $y_j$.

Using the recursion
$$S_j(x) = [1 - \lambda_j(x)]\, S_{j-1}(x), \qquad j = 2, \dots, J,$$
we get
$$S_k(x) = S_1(x) \prod_{j=2}^k [1 - \lambda_j(x)], \qquad k = 2, \dots, J,$$
that is,
$$F_k(x) = 1 - [1 - F_1(x)] \prod_{j=2}^k [1 - \lambda_j(x)], \qquad k = 2, \dots, J.$$

If $F_1(x)$ and the $\lambda_j(x)$ are modeled so as to guarantee that $0 < F_1(x) < 1$ and
$0 < \lambda_j(x) < 1$, then both monotonicity and the constraint (49) are automatically
satisfied.
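
A small sketch of the hazard-based construction: whatever models produce $F_1(x)$ and the $\lambda_j(x)$, as long as these values lie in (0,1) the recursion delivers a strictly increasing sequence of conditional df values.

# Sketch: building F_1(x) < F_2(x) < ... < F_J(x) from F_1 and discrete hazards.
import numpy as np

def df_from_hazards(F1, hazards):
    """F1: P(Y <= y_1 | x); hazards: [lambda_2, ..., lambda_J] at the same x."""
    S = 1.0 - F1
    F = [F1]
    for lam in hazards:
        S = (1.0 - lam) * S          # S_j = (1 - lambda_j) S_{j-1}
        F.append(1.0 - S)
    return np.array(F)

print(df_from_hazards(0.15, [0.2, 0.3, 0.25]))   # increasing and in (0, 1)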

165
Figure 26: Estimated conditional dfs of log monthly earnings imposing mono-
tonicity (Peracchi 2002).

[Figure: one panel per country (same layout as Figure 25), plotting the estimated probabilities against experience.]

166
Extensions: autoregressive models
Consider the discrete-time univariate AR(1) process
$$Y_t = \gamma Y_{t-1} + \sigma U_t,$$
where $|\gamma| < 1$, $\sigma > 0$ and the $\{U_t\}$ are iid with zero mean and marginal df G.
The conditional df of $Y_t$ given $Y_{t-1} = x$ is
$$F(y\,|\,x) = \Pr\{Y_t \le y\,|\,Y_{t-1} = x\} = G\Big(\frac{y - \gamma x}{\sigma}\Big). \qquad (51)$$
By the stationarity assumption, this is also the conditional df of $Y_{t+h}$ given
$Y_{t+h-1} = x$ for any $h = 0, \pm 1, \pm 2, \dots$.

The assumptions implicit in (51) are strong. As an alternative, one may retain
the assumption that $F(y\,|\,x)$ is time-invariant and apply the results of the
previous section by letting $X_t = Y_{t-1}$.

The extension to a univariate pth-order autoregression is straightforward. In this
case, what is assumed to be time-invariant is
$$F(y\,|\,x) = \Pr\{Y_t \le y\,|\,X_t = x\},$$
where $X_t = (Y_{t-1}, \dots, Y_{t-p})$ and $x = (x_1, \dots, x_p)$.

167
3.7 Relationships between the two approaches
Koenker, Leorato and Peracchi (2013) ask the general question: How does the
distributional regression (DR) approach outlined in the previous section
relate to the quantile regression (QR) approach?

Here we focus on two specific questions:

What restrictions on the family of conditional quantiles are implied by
specific assumptions on the family of conditional log odds?

What restrictions on the family of conditional log odds are implied by
specific assumptions on the family of conditional quantiles?

168
From log odds to quantiles
If $F(y\,|\,x)$ is continuous, the fact that $F(Q(p\,|\,x)\,|\,x) = p$ implies
$$\eta(Q(p\,|\,x)\,|\,x) = \ln \frac{p}{1-p}, \qquad p \in (0, 1).$$

Suppose that $\eta(y\,|\,x)$ is some known continuously differentiable function of
(x, y) and $\eta_y(y\,|\,x)$ is strictly positive for every (x, y), with subscripts denoting
partial derivatives. By the implicit function theorem applied to
$$\eta(y\,|\,x) = \ln \frac{p}{1-p},$$
the conditional qf is unique and continuously differentiable in x with
derivative
$$Q_x(p\,|\,x) = -\frac{\eta_x(y\,|\,x)}{\eta_y(y\,|\,x)}\bigg|_{y = Q(p\,|\,x)} = -\frac{F_x(y\,|\,x)}{f(y\,|\,x)}\bigg|_{y = Q(p\,|\,x)}.$$
Thus, $Q_x(p\,|\,x)$ and $\eta_x(Q(p\,|\,x)\,|\,x)$ have opposite sign.

Next notice that conditional quantiles are linear in x iff the partial derivative
$Q_x(p\,|\,x)$ does not depend on x. If the conditional log odds are linear in x,
that is, $\eta(y\,|\,x) = \alpha(y) + \beta(y)\,x$, then a quantile regression is linear in x iff
$$Q_x(p\,|\,x) = -\frac{\beta(y)}{\alpha'(y) + \beta'(y)\,x}\bigg|_{y = Q(p\,|\,x)}$$
does not depend on x. Sufficient conditions are: (i) $\alpha(y)$ is linear in y, and
(ii) $\beta(y)$ does not depend on y (as with the ordered logit model).

169
From quantiles to log odds
If $Q(p\,|\,x)$ is a known continuously differentiable function of (p, x) such that
$Q(p\,|\,x) = y$, where y is a fixed number, then
$$F(Q(p\,|\,x)\,|\,x) = F(y\,|\,x).$$
By the chain rule,
$$F_x(y\,|\,x) = -Q_x(p\,|\,x)\, f(y\,|\,x)\big|_{p = F(y\,|\,x)}$$
or, equivalently,
$$\eta_x(y\,|\,x) = -Q_x(p\,|\,x)\, \eta_y(y\,|\,x)\big|_{p = F(y\,|\,x)}.$$

Clearly, conditional log odds are linear iff $\eta_x(y\,|\,x)$ does not depend on x. If
conditional quantiles are linear, that is $Q(p\,|\,x) = \alpha(p) + x\,\beta(p)$, then condi-
tional log odds are linear iff
$$\eta_x(y\,|\,x) = -\beta(p)\, \eta_y(y\,|\,x)\big|_{p = F(y\,|\,x)}$$
does not depend on x. Sufficient conditions are: (i) $\beta(p)$ does not depend on
p, and (ii) $\eta_y(y\,|\,x)$ does not depend on x.

Example 29 Let $Y = \alpha + \beta X + U$, where U has a logistic distribution with
mean zero and qf Q. Then $Q(p\,|\,x) = \alpha(p) + \beta x$, where $\alpha(p) = \alpha + Q(p)$,
whereas
$$F(y\,|\,x) = \frac{\exp(y - \alpha - \beta x)}{1 + \exp(y - \alpha - \beta x)}.$$
Hence $\eta(y\,|\,x) = \alpha(y) + \beta(y)\,x$, where $\alpha(y) = y - \alpha$ and $\beta(y) = -\beta$, is also linear
in x. Notice that, except for the sign, $Q(p\,|\,x)$ and $\eta(y\,|\,x)$, viewed as functions
of x, have the same slope. $\Box$

170
3.8 Stata commands
We now briefly review the commands available in Stata, version 12.

These include the qreg command for linear conditional quantile estimation,
the associated iqreg, sqreg and bsqreg commands, and the post-estimation
tools in qreg postestimation.

At the moment, Stata only offers the cumul command for estimating univariate
distribution functions and has no command for estimating conditional dfs.
171
The qreg command
The basic syntax is:
qreg depvar indepvars [if] [in] [weight] [, qreg options]
where qreg options includes:

quantile(#): specifies the quantile to be estimated and should be a
number between 0 and 1, exclusive. Numbers larger than 1 are in-
terpreted as percentages. The default value of 0.5 corresponds to the
median.

level(#): sets the confidence level. The default is level(95) (95 per-
cent).

wlsiter(#): specifies the number of WLS iterations that will be at-
tempted before the linear programming iterations are started. The de-
fault value is 1. If there are convergence problems, increasing this number
should help.

Notice that the standard errors produced by qreg are based on the ho-
moskedasticity assumption and should not be trusted.

172
Other commands
The iqreg, sqreg and bsqreg commands all assume linearity of conditional
quantiles but estimate the variance matrix of the estimators (VCE) via the
bootstrap. Their syntax is similar to that of qreg. For example,
iqreg depvar indepvars [if] [in] [weight] [, iqreg options]

Remarks:

iqreg estimates IQR regressions, i.e. regressions of the difference in quan-
tiles. The available options include:

quantiles(# #): specifies the quantiles to be compared. The first
number must be less than the second, and both should be be-
tween 0 and 1, exclusive. Numbers larger than 1 are interpreted
as percentages. Not specifying this option is equivalent to specify-
ing quantiles(.25 .75), meaning the IQR.
reps(#): specifies the number of bootstrap replications to be used
to obtain an estimate of the VCE. The default is reps(20), arguably
a little small.

sqreg estimates simultaneous-quantile regression essentially using the al-
gorithm in Koenker and d'Orey (1987). It produces the same coefficients
as qreg for each quantile. Bootstrap estimation of the VCE includes
between-quantile blocks. Thus, one can test and construct confidence
intervals comparing coefficients describing different quantiles. The avail-
able options include:

quantiles(# [# [# ...]]) specifies the quantiles to be estimated
and should contain numbers between 0 and 1, exclusive. Numbers
larger than 1 are interpreted as percentages. The default value of
0.5 corresponds to the median.
reps(#): same as above.

bsqreg is equivalent to sqreg with one quantile.

173
Post-estimation tools
The following postestimation commands are available for qreg, iqreg, sqreg
and bsqreg:

estat: variance matrix and estimation sample summary,

estimates: cataloging estimation results,

lincom: point estimates, standard errors, testing, and inference for linear
combinations of the coefficients,

linktest: link test for model specification,

margins: marginal means, predictive margins, marginal effects, and av-
erage marginal effects,

nlcom: point estimates, standard errors, testing, and inference for non-
linear combinations of coefficients,

predict: predictions, residuals, influence statistics, and other diagnostic
measures,

predictnl: point estimates, standard errors, testing, and inference for
generalized predictions,

test: Wald tests of simple and composite linear hypotheses,

testnl: Wald tests of nonlinear hypotheses.

174
References
Angrist J., Chernozhukov V. and Fernandez-Val I. (2006) Quantile Regression under
Misspecification, with an Application to the U.S. Wage Structure. Econometrica,
74: 539-563.

Apostol T.M. (1974) Mathematical Analysis, 2nd Ed., Addison-Wesley, Reading,
MA.

Barrodale I. and Roberts F. (1973) An Improved Algorithm for Discrete l1 Linear
Approximation. SIAM Journal of Numerical Analysis, 10: 839-848.

Bassett G. and Koenker R. (1978) The Asymptotic Distribution of the Least Ab-
solute Error Estimator. Journal of the American Statistical Association, 73:
618-622.

Bassett G. and Koenker R. (1982) An Empirical Quantile Function for Linear
Models with IID Errors. Journal of the American Statistical Association, 77:
407-415.

Bickel P. (1982) On Adaptive Estimation. Annals of Statistics, 10: 647-671.

Bierens H.J. (1987) Kernel Estimators of Regression Functions. In Bewley T.F.
(ed.) Advances in Econometrics, Fifth World Congress, Vol. 1, pp. 99-144,
Cambridge University Press, New York.

Bloomfield P. and Steiger W. (1983) Least Absolute Deviations: Theory, Applica-
tions, Algorithms, Birkhauser, Boston, MA.

Blundell R. and Duncan A. (1998) Kernel Regression in Empirical Microeconomics.
Journal of Human Resources, 33: 62-87.

Buchinsky M. (1998) Recent Advances in Quantile Regression Models: A Practical
Guideline for Empirical Research. Journal of Human Resources, 33: 88-126.

Chauduri P. (1991) Nonparametric Estimates of Regression Quantiles and Their
Local Bahadur Representation. Annals of Statistics, 2: 760-777.

Cleveland W.S. (1979) Robust Locally Weighted Regression and Smoothing Scat-
terplots. Journal of the American Statistical Association, 74: 829-836.

Cleveland W.S. and Devlin S.J. (1988) Locally Weighted Regression: An Approach
to Regression Analysis by Local Fitting. Journal of the American Statistical
Association, 93: 596-610.

175
De Angelis D., Hall P. and Young G.A. (1993) Analytical and Bootstrap Approxi-
mations to Estimator Distributions in L1 Regression. Journal of the American
Statistical Association, 88: 1310-1316.

de Boor C. (1978) A Practical Guide to Splines, Springer, New York.

De Luca G. (2008) SNP and SML Estimation of Univariate and Bivariate Binary-
Choice Models. Stata Journal, 8: 190-220.

De Luca G. and Perotti V. (2011) Estimation of Ordered Response Models with
Sample Selection. Stata Journal, 11: 213-239.

Engle R.F., Granger C.W.J., Rice J.A. and Weiss A. (1986) Semiparametric Esti-
mates of the Relationship Between Weather and Electricity Sales. Journal of
the American Statistical Association, 81: 310-320.

Eubank R.L. (1988) Spline Smoothing and Nonparametric Regression, Dekker, New
York.

Eubank R.L. and Spiegelman C.H. (1990) Testing the Goodness of Fit of a Linear
Model Via Nonparametric Regression Techniques. Journal of the American
Statistical Association, 85: 387-392.

Fan J. (1992) Design-adaptive Nonparametric Regression. Journal of the American
Statistical Association, 87: 998-1004.

Fan J. and Gijbels I. (1996) Local Polynomial Modelling and Its Applications, Chap-
man and Hall, London.

Fan J., Heckman N.E. and Wand M.P. (1995) Local Polynomial Regression for
Generalized Linear Models and Quasi-likelihood Functions. Journal of the
American Statistical Association, 90: 141-150.

Firpo S., Fortin N. and Lemieux T. (2009) Unconditional Quantile Regressions.
Econometrica, 77: 953-973.

Fortin N., Lemieux T. and Firpo S. (2011) Decomposition Methods in Economics.
In Ashenfelter O. and Card D. (eds.) Handbook of Labor Economics, Vol. 4a,
pp. 2-102, Elsevier, Amsterdam.

Friedman J.H. and Stuetzle W. (1981) Projection Pursuit Regression. Journal of
the American Statistical Association, 76: 817-823.

Friedman J.H., Stuetzle W. and Schroeder A. (1984) Projection Pursuit Density
Estimation. Journal of the American Statistical Association, 79: 599-608.

176
Gallant A.R. and Nychka D.W. (1987) Semi-Nonparametric Maximum Likelihood
Estimation. Econometrica, 55: 363-390.

Good I.J. and R.A. Gaskins (1971) Nonparametric Roughness Penalties for Prob-
ability Densities. Biometrika, 58: 255-277.

Green P.J. and Silverman B.W. (1994) Nonparametric Regression and Generalized
Linear Models, Chapman and Hall, London.

Gutenbrunner C. and Jureckova J. (1992) Regression Rank Scores and Regression
Quantiles. Annals of Statistics, 20: 305-330.

Hall P., Wolff R.C.L. and Yao Q. (1999) Methods for Estimating a Conditional
Distribution Function. Journal of the American Statistical Association, 94:
154-163.

Hardle W. (1990) Applied Nonparametric Regression, Cambridge University Press,
New York.

Hardle W. and Linton O. (1994) Applied Nonparametric Methods. In Engle R.F.
and McFadden D.L. (eds.) Handbook of Econometrics, Vol. 4, pp. 2297-2339,
North-Holland, Amsterdam.

Hardle W. and Stoker T. (1989) Investigating Smooth Multiple Regression by the
Method of Average Derivatives. Journal of the American Statistical Associa-
tion, 84: 986-995.

Hastie T.J. and Loader C.L. (1993) Local Regression: Automatic Kernel Carpentry
(with discussion). Statistical Science, 8: 120-143.

Hastie T.J. and Tibshirani R.J. (1990) Generalized Additive Models, Chapman and
Hall, London.

Hendricks W. and Koenker R. (1992) Hierarchical Spline Models for Conditional
Quantiles and the Demand for Electricity. Journal of the American Statistical
Association, 87: 58-68.

Horowitz J.L. (1998) Semiparametric Methods in Econometrics, Springer, New
York.

Koenker R. (2005) Quantile Regression, Cambridge University Press, New York.

Koenker R. and Bassett G. (1978) Regression Quantiles. Econometrica, 46: 33-50.

177
Koenker R. and Bassett G. (1982) Robust Tests for Heteroskedasticity Based on
Regression Quantiles. Econometrica, 50: 43-61.

Koenker R. and d'Orey V. (1987) Computing Regression Quantiles. Applied Statis-
tics, 36: 383-393.

Koenker R., Leorato S. and Peracchi F. (2013) Distributional vs. Quantile Regres-
sion. Mimeo.

Koenker R. and Machado J.A.F. (1999) Goodness of Fit and Related Inference
Processes for Quantile Regression. Journal of the American Statistical Asso-
ciation, 94: 1296-1310.

Koenker R., Ng P. and Portnoy S. (1994) Quantile Smoothing Splines. Biometrika,
81: 673-680.

Li Q. and Racine J.S. (2007) Nonparametric Econometrics, Princeton University
Press, Princeton.

Luenberger D. (1969) Optimization by Vector Space Methods, Wiley, New York.

Machado J.A.F. and Mata J. (2005) Counterfactual Decomposition of Changes in
Wage Distributions Using Quantile Regression. Journal of Applied Economet-
rics, 20: 445-465.

Marron J.S. and Nolan D. (1988) Canonical Kernels for Density Estimation. Statis-
tics and Probability Letters, 7: 195-199.

Melly B. (2005) Decomposition of Differences in Distribution Using Quantile Re-
gression. Labour Economics, 12: 577-590.

Newson R.B. (2012) Sensible Parameters for Univariate and Multivariate Splines.
Stata Journal, 12: 479-504.

Pagan A.R. and Ullah A. (1999) Nonparametric Econometrics, Cambridge Univer-
sity Press, New York.

Peracchi F. (2002) On Estimating Conditional Quantiles and Distribution Func-
tions. Computational Statistics and Data Analysis, 38: 433-447.

Pollard D. (1991) Asymptotics for Least Absolute Deviation Regression Estimators.
Econometric Theory, 7: 186-199.

178
Portnoy S. and Koenker R. (1997) The Gaussian Hare and the Laplacian Tortoise:
Computability for Squared-Error versus Absolute-Error Estimators (with dis-
cussion). Statistical Science, 12: 279-300.

Robinson P. (1988) Root-N Consistent Semiparametric Regression. Econometrica,
56: 931-954.

Rosenblatt M. (1956) Remarks on Some Nonparametric Estimates of a Density
Function. Annals of Mathematical Statistics, 27: 832-837.

Ruppert D., Wand M.P. and Carroll R.J. (2003) Semiparametric Regression, Cam-
bridge University Press, New York.

Sheather S.J. and Jones M.C. (1991) A Reliable Data-Based Bandwidth Selection
Method for Kernel Density Estimation. Journal of the Royal Statistical Soci-
ety, Series B, 53: 683-690.

Schlossmacher E.G. (1973) An Iterative Technique for Absolute Deviation Curve
Fitting. Journal of the American Statistical Association, 68: 857-865.

Schucany W.R. and Sommers J.P. (1977) Improvement of Kernel Type Density
Estimators. Journal of the American Statistical Association, 72: 420-423.

Shorrocks A.F. (1982) Ranking Income Distributions. Econometrica, 50: 3-17.

Silverman B.W. (1986) Density Estimation for Statistics and Data Analysis, Chap-
man and Hall, New York.

Stoker T.M. (1986) Consistent Estimation of Scaled Coefficients. Econometrica,
54: 1461-1481.

Stone C.J. (1977) Consistent Nonparametric Regression (with discussion). Annals
of Statistics, 5: 595-645.

Stone C.J. (1982) Optimal Global Rates of Convergence for Nonparametric Re-
gression. Annals of Statistics, 10: 1040-1053.

Stone C.J. (1984) An Asymptotically Optimal Window Selection Rule for Kernel
Density Estimates. Annals of Statistics, 12: 1285-1297.

Stone C.J. (1985) Additive Regression and Other Nonparametric Models. Annals of
Statistics, 13: 689-705.

Tibshirani R. and Hastie T. (1987) Local Likelihood Estimation. Journal of the
American Statistical Association, 82: 559-567.

179
Vapnik V.N. (1995) The Nature of Statistical Learning Theory, Springer, New York.

Wahba G. (1990) Spline Models for Observational Data, SIAM, Philadelphia, PA.

Yatchew A. (1998) Nonparametric Regression Techniques in Economics. Journal
of Economic Literature, 36: 669-721.

Yatchew A. (2003) Semiparametric Regression for the Applied Econometrician,
Cambridge University Press, New York.

Yu K. and Jones M.C. (1998) Local Linear Quantile Regression. Journal of the
American Statistical Association, 93: 228-237.

180
