
Nonparametric Density and Regression Estimation

February 2, 2007

1 Density Estimation

1.1 Histogram Estimator

The simplest nonparametric density estimator is the histogram estimator. Suppose we estimate the density by a histogram with bin width h.

[Figure: histogram of the data with bin width h.]

This suggests the estimator

\hat f(x_0) = \frac{1}{nh}\sum_{i=1}^{n} 1(x_i \in [x_0, x_0 + h]),

since the proportion of the data falling in the bin satisfies

\frac{1}{n}\sum_{i=1}^{n} 1(x_i \in [x_0, x_0 + h]) \approx f(x_0)\, h.
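A minimal Python sketch of this histogram estimator (the simulated standard-normal data and the evaluation point are illustrative assumptions, not part of the notes' derivation):

```python
import numpy as np

def histogram_density(x0, data, h):
    """Histogram density estimate at x0 using the bin [x0, x0 + h):
    fhat(x0) = (1/(n*h)) * #{x_i in [x0, x0 + h)}."""
    data = np.asarray(data)
    n = data.size
    in_bin = (data >= x0) & (data < x0 + h)
    return in_bin.sum() / (n * h)

# Example on simulated standard-normal data:
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
print(histogram_density(0.0, x, h=0.2))   # rough estimate of phi(0) ~ 0.399
```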

1.1.1 Bias

What is the variance and bias of this estimator?

E\hat f(x_0) = \frac{1}{h} E\bigl(1(x_i \in [x_0, x_0 + h])\bigr)
= \frac{1}{h}\int 1(x_i \in [x_0, x_0 + h])\, f(x_i)\, dx_i
= \frac{1}{h}\int_{x_0}^{x_0 + h} \{ f(x_0) + f'(\bar x)(x_i - x_0) \}\, dx_i,

where \bar x \in [x_0, x_i) by the mean value theorem. Splitting the integral,

E\hat f(x_0) = f(x_0)\,\underbrace{\frac{1}{h}\int_{x_0}^{x_0 + h} dx_i}_{=1} \; + \; \underbrace{\frac{1}{h}\int_{x_0}^{x_0 + h} f'(\bar x)(x_i - x_0)\, dx_i}_{R(x_0, h)}.

Assume |f'(x)| \le C < \infty. Then

|R(x_0, h)| \le \frac{1}{h}\int_{x_0}^{x_0 + h} |f'(\bar x)|\, |x_i - x_0|\, dx_i \le \frac{1}{h}\, C\, h^2 = O_p(h).

Thus,

|E\hat f(x_0) - f(x_0)| \le C h.

Consistency of the histogram therefore requires that h \to 0 as n \to \infty. Note also that we assumed f is differentiable with a bounded derivative.

Variance

Note: consistency also requires that the variance go to 0. Assuming iid observations,

var\,\hat f(x_0) = \frac{1}{n^2 h^2}\sum_{i=1}^{n} var\{1(x_i \in [x_0, x_0 + h])\},

and

var\{1(x_i \in [x_0, x_0 + h])\} = E\bigl(1(x_i \in [x_0, x_0 + h])^2\bigr) - E\bigl(1(\cdot)\bigr)^2 \le E\bigl(1(x_i \in [x_0, x_0 + h])^2\bigr).

Assume that f(x_0) \le C'. Then

var\,\hat f(x_0) \le \frac{n\, C'\, h}{n^2 h^2} = \frac{C'}{nh} = O_p\Bigl(\frac{1}{nh}\Bigr),

so consistency requires nh \to \infty.

Remarks. The estimator is not root-n consistent, because the variance blows up as h \to 0. We require h \to 0 and nh \to \infty.

1.1.2 Selecting the bin size

How might you choose h? One way is to choose h to minimize the MSE,

MSE = C h^2 + \frac{C'}{nh}.

Setting the derivative with respect to h to zero,

2 C h - \frac{C'}{n h^2} = 0 \;\Longrightarrow\; 2 C h^3 = \frac{C'}{n} \;\Longrightarrow\; h = \Bigl[\frac{C'}{2C}\Bigr]^{1/3} n^{-1/3},

the optimal choice of bin width.
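The bias-variance tradeoff behind this choice is easy to see numerically. The sketch below (the simulation settings are assumptions for illustration) computes the Monte Carlo MSE of the histogram estimator at x_0 = 0 for standard normal data: the MSE is large for very small and very large h and is smallest at an intermediate value, and the derivation above says the minimizer shrinks like n^{-1/3}.

```python
import numpy as np

# Monte Carlo MSE of the histogram estimator at x0 = 0, standard normal data.
rng = np.random.default_rng(10)
n, reps, x0 = 500, 400, 0.0
f0 = 1 / np.sqrt(2 * np.pi)          # true density at 0
for h in [0.02, 0.1, 0.3, 1.0]:
    sq_errs = []
    for _ in range(reps):
        x = rng.normal(size=n)
        fhat = ((x >= x0) & (x < x0 + h)).mean() / h
        sq_errs.append((fhat - f0) ** 2)
    print(h, np.mean(sq_errs))       # smallest MSE at an intermediate h
```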

1.2 Alternative Histogram Estimator

\hat f(x_0) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{2h}\, 1(|x_0 - x_i| < h)

[Figure: a bin of width 2h constructed symmetrically around the point of evaluation, covering (x_0 - h, x_0 + h).]

Here the bin is constructed symmetrically around the point of evaluation x_0. We can rewrite the estimator as

\hat f(x_0) = \frac{1}{nh}\sum_{i=1}^{n} \underbrace{\tfrac{1}{2}\, 1\Bigl(\Bigl|\frac{x_0 - x_i}{h}\Bigr| < 1\Bigr)}_{\text{uniform kernel}}.

[Figure: the uniform kernel, equal to 1/2 on (-1, 1) and 0 elsewhere.]

Note that the uniform kernel integrates to 1 and is symmetric.

1.3 Standard Kernel Density Estimator

Instead of the uniform kernel, use a kernel whose weights go smoothly to zero:

[Figure: a smooth kernel k(s), with s = (x_0 - x_i)/h, centered at x_0 and supported on (x_0 - h, x_0 + h); the quartic kernel function is shown.]

Such a kernel also integrates to 1 and is symmetric, but it gives more weight to closer observations. Examples:

k(s) = \frac{15}{16}(1 - s^2)^2 \ \text{if } |s| \le 1, \quad k(s) = 0 \ \text{otherwise} \qquad \text{(biweight or "quartic" kernel)}

k(s) = \frac{1}{\sqrt{2\pi}}\, e^{-s^2/2} \qquad \text{(normal kernel)}

1.4 Nadaraya-Watson Kernel Estimator

\hat f(x_0) = \frac{1}{n h_n}\sum_{i=1}^{n} k\Bigl(\frac{x_0 - x_i}{h_n}\Bigr)
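A minimal sketch of this estimator in Python, with the quartic and normal kernels from the previous subsection (the simulated data, the bandwidth, and the evaluation grid are illustrative assumptions):

```python
import numpy as np

def gaussian_kernel(s):
    return np.exp(-0.5 * s**2) / np.sqrt(2 * np.pi)

def quartic_kernel(s):
    return np.where(np.abs(s) <= 1, 15 / 16 * (1 - s**2)**2, 0.0)

def kde(x0, data, h, kernel=gaussian_kernel):
    """Kernel density estimate fhat(x0) = (1/(n*h)) * sum_i k((x0 - x_i)/h);
    x0 may be a scalar or an array of evaluation points."""
    data = np.asarray(data)
    x0 = np.atleast_1d(np.asarray(x0, dtype=float))
    s = (x0[:, None] - data[None, :]) / h           # shape (m, n)
    return kernel(s).sum(axis=1) / (data.size * h)

rng = np.random.default_rng(1)
x = rng.normal(size=2000)
grid = np.linspace(-3, 3, 7)
print(kde(grid, x, h=0.3, kernel=quartic_kernel))   # estimates of the N(0,1) density
```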

1.4.1 Bias

E\hat f(x_0) = \int \frac{1}{h_n}\, k\Bigl(\frac{x_0 - x_i}{h_n}\Bigr) f(x_i)\, dx_i.

Let s = \frac{x_0 - x_i}{h_n}, so that x_i = x_0 - s h_n and dx_i = -h_n\, ds. Then

E\hat f(x_0) = \int k(s)\, f(x_0 - s h_n)\, ds.

Assume f is r times differentiable with a bounded rth derivative, and Taylor expand:

f(x_0 - s h_n) = f(x_0) + f'(x_0)(-h_n s) + \frac{1}{2} f''(x_0)(h_n s)^2 + \cdots + \frac{1}{r!} f^{(r)}(x_0)(-h_n s)^r + \underbrace{\frac{1}{r!}\bigl[f^{(r)}(\bar x) - f^{(r)}(x_0)\bigr](-h_n s)^r}_{\text{remainder } R(x_0, s)},

where \bar x lies between x_0 and x_0 - s h_n. If the kernel moments satisfy

\int k(s)\, s^{\rho}\, ds = 0, \qquad \rho = 1, \ldots, r,

then the bias is

\bigl|E\hat f(x_0) - f(x_0)\bigr| = \Bigl|\int k(s)\, R(x_0, s)\, ds\Bigr| \le h_n^r\, C \int |s|^r k(s)\, ds

(since we assumed a bounded rth derivative). Thus this estimator improves on the histogram if some kernel moments equal 0. Typically, k satisfies

\int k(s)\, ds = 1, \qquad \int k(s)\, s\, ds = 0.

If we additionally require \int k(s)\, s^2\, ds = 0, then k(s) must be negative in places.

Remark. If f^{(r)} satisfies a Lipschitz condition, i.e. |f^{(r)}(\bar x) - f^{(r)}(x_0)| \le m\, |\bar x - x_0|, then

\text{Bias} \le C'\, h_n^{r+1}\int |s|^r k(s)\, ds = O_p(h_n^{r+1}),

one order higher.

[Figure 1: kernel function k(s).]

1.4.2 Variance


Var\,\hat f(x_0) = \frac{1}{n^2 h_n^2}\sum_{i=1}^{n} Var\, k\Bigl(\frac{x_0 - x_i}{h_n}\Bigr)
= \frac{1}{n h_n^2}\Bigl[E\, k^2\Bigl(\frac{x_0 - x_i}{h_n}\Bigr) - \Bigl(E\, k\Bigl(\frac{x_0 - x_i}{h_n}\Bigr)\Bigr)^2\Bigr]
\le \frac{C_1}{n h_n} = O_p\Bigl(\frac{1}{n h_n}\Bigr),

which can be shown using the same change-of-variables techniques. The variance does not improve on the histogram.

1.5 Choice of Bandwidth

Choose h_n to minimize the asymptotic mean squared error:

\min_{h_n}\ MSE(h_n) = C_0\, h_n^{2r} + \frac{C_1}{n h_n}.

The first-order condition is

2 r\, C_0\, h_n^{2r-1} - \frac{C_1}{n h_n^2} = 0 \;\Longrightarrow\; h_n^{2r+1} = \frac{C_1}{2 r C_0\, n} \;\Longrightarrow\; h_n = \Bigl(\frac{C_1}{2 r C_0}\Bigr)^{\frac{1}{2r+1}} n^{-\frac{1}{2r+1}},

which shrinks to zero as n grows large.

Now let's look at the order of the MSE. Plug in the optimal value of h_n obtained above, and let C_2 = \bigl(\frac{C_1}{2 r C_0}\bigr)^{\frac{1}{2r+1}}.

MSE(h_n^*) = C_0\, C_2^{2r}\, n^{-\frac{2r}{2r+1}} + \frac{C_1}{n\, C_2\, n^{-\frac{1}{2r+1}}}
= C_0\, C_2^{2r}\, n^{-\frac{2r}{2r+1}} + C_1 C_2^{-1}\, n^{-\frac{2r}{2r+1}}
= O_p\bigl(n^{-\frac{2r}{2r+1}}\bigr),

so

RMSE = O_p\bigl(n^{-\frac{r}{2r+1}}\bigr), \qquad \frac{r}{2r+1} < \frac12.

(The parametric rate is n^{-1/2}: for example, for OLS \sqrt N(\hat\beta_{ols} - \beta_0) \to_d N\bigl(0,\ \sigma^2 (\tfrac{X'X}{N})^{-1}\bigr), so the RMSE is O(N^{-1/2}).)

1.5.1 How do you generalize the kernel density estimator to more dimensions?

Use a product kernel; for example, in two dimensions,

\hat f(x_0, z_0) = \frac{1}{n\, h_{n,x}\, h_{n,z}}\sum_{i=1}^{n} k_1\Bigl(\frac{x_0 - x_i}{h_{n,x}}\Bigr)\, k_2\Bigl(\frac{z_0 - z_i}{h_{n,z}}\Bigr).
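A sketch of the two-dimensional product-kernel estimator above, using the same Gaussian kernel for k_1 and k_2 (the data and bandwidths are illustrative assumptions):

```python
import numpy as np

def kde_2d(x0, z0, x, z, hx, hz,
           kernel=lambda s: np.exp(-0.5 * s**2) / np.sqrt(2 * np.pi)):
    """Bivariate product-kernel density estimate:
    fhat(x0, z0) = (1/(n*hx*hz)) * sum_i k1((x0 - x_i)/hx) * k2((z0 - z_i)/hz)."""
    x, z = np.asarray(x), np.asarray(z)
    n = x.size
    kx = kernel((x0 - x) / hx)
    kz = kernel((z0 - z) / hz)
    return (kx * kz).sum() / (n * hx * hz)

rng = np.random.default_rng(7)
x = rng.normal(size=1000)
z = rng.normal(size=1000)
print(kde_2d(0.0, 0.0, x, z, hx=0.3, hz=0.3))   # true value is 1/(2*pi) ~ 0.159
```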

1.6 Alternative Estimator: k Nearest Neighbor

This estimator has a variable bandwidth. Let r_k denote the distance from x_0 to the kth nearest neighbor.

[Figure: an interval around x_0 of half-width r_k, reaching out to the kth nearest data point x_k.]

Since the bin of width 2 r_k contains the k_1 nearest observations, \hat f(x_0)\cdot 2 r_k \approx k_1/n, so

\hat f(x_0) = \frac{k_1}{2 r_k\, n}.

For m-dimensional data,

\hat f(x_0) = \frac{k_1}{(2 r_k)^m\, n}.
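A sketch of the k-nearest-neighbor density estimator for scalar data (the choice k = 50 and the simulated data are illustrative assumptions; the parameter k plays the role of k_1 in the notes):

```python
import numpy as np

def knn_density(x0, data, k):
    """k-nearest-neighbor density estimate fhat(x0) = k / (2 * r_k * n),
    where r_k is the distance from x0 to the kth nearest observation."""
    data = np.asarray(data)
    n = data.size
    r_k = np.sort(np.abs(data - x0))[k - 1]   # distance to the kth nearest neighbor
    return k / (2 * r_k * n)

rng = np.random.default_rng(8)
x = rng.normal(size=1000)
print(knn_density(0.0, x, k=50))              # rough estimate of phi(0) ~ 0.399
```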

1.7 Asymptotic Distribution of Kernel Density Estimator

We have already established consistency (we showed convergence in mean square): for

\hat f(x_0) = \frac{1}{n h_n}\sum_{i=1}^{n} k\Bigl(\frac{x_0 - x_i}{h_n}\Bigr),

we have E\hat f(x_0) \to f(x_0) as h_n \to 0 and Var\,\hat f(x_0) \to 0 as n h_n \to \infty, so \hat f(x_0) \overset{p}{\to} f(x_0) (a law of large numbers argument).

Now we will show asymptotic normality. Decompose

\sqrt{n h_n}\Bigl(\frac{1}{n h_n}\sum_{i=1}^{n} k\bigl(\tfrac{x_0 - x_i}{h_n}\bigr) - f(x_0)\Bigr)
= \underbrace{\sqrt{n h_n}\Bigl(\frac{1}{n h_n}\sum_{i=1}^{n} k\bigl(\tfrac{x_0 - x_i}{h_n}\bigr) - E\,\frac{1}{n h_n}\sum_{i=1}^{n} k\bigl(\tfrac{x_0 - x_i}{h_n}\bigr)\Bigr)}_{(1)}
+ \underbrace{\sqrt{n h_n}\Bigl(E\,\frac{1}{n h_n}\sum_{i=1}^{n} k\bigl(\tfrac{x_0 - x_i}{h_n}\bigr) - f(x_0)\Bigr)}_{(2)}.

Analyze term (1). It is \sqrt{n h_n} times the sample average of

\frac{1}{h_n} k\Bigl(\frac{x_0 - x_i}{h_n}\Bigr) - E\Bigl(\frac{1}{h_n} k\bigl(\tfrac{x_0 - x_i}{h_n}\bigr)\Bigr),

so we can find a CLT for this part:

\sqrt{n h_n}\bigl(\hat f(x_0) - E\hat f(x_0)\bigr) \ \to_d\ N\Bigl(0,\ \lim_n h_n\, Var\Bigl(\frac{1}{h_n} k\bigl(\tfrac{x_0 - x_i}{h_n}\bigr)\Bigr)\Bigr),

where

h_n\, Var\Bigl(\frac{1}{h_n} k\bigl(\tfrac{x_0 - x_i}{h_n}\bigr)\Bigr)
= \frac{1}{h_n} E\Bigl(k^2\bigl(\tfrac{x_0 - x_i}{h_n}\bigr)\Bigr) - \frac{1}{h_n}\Bigl(E\, k\bigl(\tfrac{x_0 - x_i}{h_n}\bigr)\Bigr)^2
\to f(x_0)\int k^2(s)\, ds,

since, assuming \int k(s)\, ds = 1 and \int k(s)\, s\, ds = 0, the first piece converges to f(x_0)\int k^2(s)\, ds and the second is \frac{1}{h_n}\, O(h_n^2) \to 0 as h_n \to 0. Thus

\text{Term (1)} \ \to_d\ N\Bigl(0,\ f(x_0)\int k^2(s)\, ds\Bigr).

Now we show that Term (2) \to 0; this will require more stringent conditions on the bandwidth. We have

\sqrt{n h_n}\Bigl(E\,\frac{1}{n h_n}\sum_{i=1}^{n} k\bigl(\tfrac{x_0 - x_i}{h_n}\bigr) - f(x_0)\Bigr) = \sqrt{n h_n}\Bigl\{\frac{1}{h_n} E\, k\bigl(\tfrac{x_0 - x_i}{h_n}\bigr) - f(x_0)\Bigr\},

with

\frac{1}{h_n} E\, k\Bigl(\frac{x_0 - x_i}{h_n}\Bigr) = \frac{1}{h_n}\int k\Bigl(\frac{x_0 - x_i}{h_n}\Bigr) f(x_i)\, dx_i
= \int k(s)\, f(x_0 - s h_n)\, ds \qquad \Bigl(s = \frac{x_0 - x_i}{h_n},\ dx_i = -h_n\, ds\Bigr)
= \int k(s)\Bigl[f(x_0) + f'(x_0)(-s h_n) + \frac{f''(\bar x)}{2}\, s^2 h_n^2\Bigr] ds
= f(x_0)\int k(s)\, ds - f'(x_0)\, h_n\int k(s)\, s\, ds + \frac{h_n^2}{2}\int f''(\bar x)\, k(s)\, s^2\, ds,

where \bar x lies between x_0 and x_0 - s h_n. Assume |f''(\bar x)| \le M and \int k(s)\, s\, ds = 0 (satisfied if k is symmetric). Then E\hat f(x_0) - f(x_0) = \frac{1}{h_n} E\, k\bigl(\tfrac{x_0 - x_i}{h_n}\bigr) - f(x_0) is O_p(h_n^2), which implies that

\sqrt{n h_n}\Bigl(\frac{1}{h_n} E\, k\bigl(\tfrac{x_0 - x_i}{h_n}\bigr) - f(x_0)\Bigr) = O_p\bigl(\sqrt{n h_n}\, h_n^2\bigr) = O_p\bigl(\sqrt{n h_n^5}\bigr).

Hence we require n h_n^5 \to 0 in addition to h_n \to 0 and n h_n \to \infty; the condition required for asymptotic normality is stronger than the one required for consistency.
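A small Monte Carlo sketch of the result above: with an undersmoothing bandwidth (so that n h_n^5 \to 0), the variance of \sqrt{n h_n}(\hat f(x_0) - f(x_0)) should be close to f(x_0)\int k^2(s)\, ds. The simulation settings below are illustrative assumptions; for the Gaussian kernel \int k^2(s)\, ds = 1/(2\sqrt{\pi}).

```python
import numpy as np

rng = np.random.default_rng(9)
n, x0, reps = 2000, 0.0, 500
h = n ** (-0.3)                         # undersmoothing: n*h**5 -> 0
f0 = 1 / np.sqrt(2 * np.pi)             # true N(0,1) density at 0
draws = []
for _ in range(reps):
    x = rng.normal(size=n)
    fhat = np.exp(-0.5 * ((x0 - x) / h) ** 2).sum() / (n * h * np.sqrt(2 * np.pi))
    draws.append(np.sqrt(n * h) * (fhat - f0))
# Sample variance vs. theoretical asymptotic variance f(0)/(2*sqrt(pi)) ~ 0.11:
print(np.var(draws), f0 / (2 * np.sqrt(np.pi)))
```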

2 Nonparametric Regression

Model:

y_i = g(x_i) + \varepsilon_i, \qquad E(\varepsilon_i|x_i) = 0, \qquad E(\varepsilon_i^2|x_i) = c < \infty.

There are two basic approaches.

Global approach: series expansion methods, orthonormal expansion methods, splines; for example, g(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k (using a polynomial basis). The main choice is which basis to use. Drawback: the estimate at each point depends on all the other data and is potentially sensitive to outliers.

Local approach: examples are kernel regression, local polynomial regression, and smoothing. Early smoothing methods date back to the 1860s in the actuarial literature (see Cleveland (1979, JASA)).

We will emphasize local estimation methods here.

[Figure 2: Repeated data regression estimator, plotting g(x) through repeated observations.]

2.1 Local nonparametric regression

One possibility is to estimate the conditional mean,

E(y|x) = \int y\, f(y|x)\, dy = \int y\, \frac{f(x, y)}{f(x)}\, dy = \frac{\int y\, f(x, y)\, dy}{f(x)}.

Remarks

With repeated data at each point, we could estimate g(x_0) by

\hat g(x_0) = \frac{\sum_i y_i\, 1(x_i = x_0)}{\sum_i 1(x_i = x_0)}.

We can only estimate this way at points where we have data; we need to interpolate or extrapolate between and outside the data observations.

We could instead create cells, as in the histogram estimator:

\hat g(x_0) = \frac{\sum_{i=1}^{n} y_i\, 1\bigl(\bigl|\frac{x_0 - x_i}{h_n}\bigr| < 1\bigr)}{\sum_{i=1}^{n} 1\bigl(\bigl|\frac{x_0 - x_i}{h_n}\bigr| < 1\bigr)}
= \frac{\sum_{i=1}^{n} y_i\, 1(x_i \in [x_0 - h_n, x_0 + h_n])}{\sum_{i=1}^{n} 1(x_i \in [x_0 - h_n, x_0 + h_n])}.

If we want \hat g(x) to be smooth, we need to choose a kernel function that goes to 0 smoothly:

\hat g(x_0) = \frac{\sum_{i=1}^{n} y_i\, k\bigl(\frac{x_0 - x_i}{h_n}\bigr)}{\sum_{i=1}^{n} k\bigl(\frac{x_0 - x_i}{h_n}\bigr)}
= \frac{\frac{1}{n h_n}\sum_{i=1}^{n} y_i\, k\bigl(\frac{x_0 - x_i}{h_n}\bigr)}{\frac{1}{n h_n}\sum_{i=1}^{n} k\bigl(\frac{x_0 - x_i}{h_n}\bigr)}
\qquad \bigl(\text{numerator} \to g(x_0) f(x_0),\ \text{denominator} \to f(x_0)\bigr)
= \sum_{i=1}^{n} y_i\, w_i(x_0),

where

w_i(x_0) = \frac{k\bigl(\frac{x_0 - x_i}{h_n}\bigr)}{\sum_{j=1}^{n} k\bigl(\frac{x_0 - x_j}{h_n}\bigr)}, \qquad \sum_{i=1}^{n} w_i(x_0) = 1.

Note: we do not need \int k(s)\, ds = 1, but we do require that the kernel used in the numerator and the denominator integrate to the same value.
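A minimal sketch of this kernel regression estimator, \hat g(x_0) = \sum_i y_i w_i(x_0), with a Gaussian kernel (the simulated regression function and settings are illustrative assumptions):

```python
import numpy as np

def nw_regression(x0, x, y, h, kernel=lambda s: np.exp(-0.5 * s**2)):
    """Kernel (Nadaraya-Watson) regression estimate
    ghat(x0) = sum_i y_i k((x0 - x_i)/h) / sum_j k((x0 - x_j)/h);
    the kernel normalization cancels in the ratio."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x0 = np.atleast_1d(np.asarray(x0, dtype=float))
    k = kernel((x0[:, None] - x[None, :]) / h)   # weights k((x0 - x_i)/h)
    w = k / k.sum(axis=1, keepdims=True)         # w_i(x0), summing to 1
    return w @ y

rng = np.random.default_rng(6)
x = rng.uniform(-2, 2, size=400)
y = x**2 + 0.3 * rng.normal(size=400)
print(nw_regression(np.array([-1.0, 0.0, 1.0]), x, y, h=0.2))   # roughly [1, 0, 1]
```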

2.1.1 Two types of problems

(i) When the density at x_0 is low, the denominator tends to be close to 0 (a problem common to all nonparametric methods).

(ii) At boundary points, the estimator has higher-order bias, even with an appropriately chosen k(\cdot). This problem is known as "boundary bias."

2.1.2 Interpretation of Local Polynomial Estimators as a Local Regression

Solve this problem at each point of evaluation, x_0 (which may or may not correspond to points in the data):

(\hat a, \hat b_1, \ldots, \hat b_k) = \arg\min_{a, b_1, \ldots, b_k}\ \sum_{i=1}^{n}\bigl(y_i - a - b_1(x_i - x_0) - b_2(x_i - x_0)^2 - \cdots - b_k(x_i - x_0)^k\bigr)^2\, k_i, \qquad k_i = k\Bigl(\frac{x_0 - x_i}{h_n}\Bigr).

The choices are the order k of the polynomial and the bandwidth h_n.

Note: \hat b_1 provides an estimator for g'(x_0), and \hat b_2 provides an estimator for g''(x_0)/2.

If k = 0,

\hat a = \frac{\sum_{i=1}^{n} y_i\, k\bigl(\frac{x_0 - x_i}{h_n}\bigr)}{\sum_{i=1}^{n} k\bigl(\frac{x_0 - x_i}{h_n}\bigr)},

the standard Nadaraya-Watson kernel regression estimator, which can again be written as \sum_i y_i\, w_i(x_0) with \sum_i w_i(x_0) = 1.

If k = 1,

\hat a = \frac{\sum_i y_i k_i\, \sum_j k_j (x_j - x_0)^2 \; - \; \sum_i y_i k_i (x_i - x_0)\, \sum_j k_j (x_j - x_0)}{\sum_i k_i\, \sum_j k_j (x_j - x_0)^2 \; - \; \bigl\{\sum_j k_j (x_j - x_0)\bigr\}^2}.

This can again be written as \hat g(x_0) = \sum_i y_i\, w_i(x_0), where now

\sum_{i=1}^{n} w_i(x_0) = 1 \qquad \text{and} \qquad \sum_{i=1}^{n} w_i(x_0)(x_i - x_0) = 0.

The k = 0 weights of the standard Nadaraya-Watson kernel estimator do not satisfy this second property.

The LLR estimator (k = 1) improves on the kernel estimator in two ways: (a) its asymptotic bias does not depend on the design density of the data (i.e., f(x) does not appear in the bias expression), and (b) its bias has a higher order of convergence at boundary points. Intuitively, the standard kernel estimator fits a local constant, so when there is no data on the other side of the evaluation point the estimate will tend to be too high or too low; it does not capture the slope. LLR gets the slope.
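A minimal sketch of the local linear estimator as the intercept of a kernel-weighted least squares fit, which is the construction described above (Gaussian kernel; the simulated data and bandwidth are illustrative assumptions):

```python
import numpy as np

def llr(x0, x, y, h):
    """Local linear regression estimate ghat_LLR(x0): the intercept from a
    weighted least squares fit of y on (1, x - x0) with kernel weights."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)           # kernel weights k_i
    X = np.column_stack([np.ones_like(x), x - x0])   # design: [1, x_i - x0]
    XtW = X.T * w
    beta = np.linalg.solve(XtW @ X, XtW @ y)         # (X'WX)^{-1} X'Wy
    return beta[0]                                   # intercept; beta[1] estimates g'(x0)

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, size=300)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=300)
print(llr(0.0, x, y, h=0.1))   # evaluation at the boundary, where LLR helps most
```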

[Figure 3: Boundary performance of the LLR estimator. Near the boundary, the local linear fit over a window of width h_n gets the slope right.]

\hat g_{LLR}(x_0) - g(x_0) = \sum_{i=1}^{n} w_i(x_0)\, y_i - g(x_0), \qquad y_i = g(x_i) + \varepsilon_i,

= \sum_{i=1}^{n} w_i(x_0)\,\varepsilon_i + \sum_{i=1}^{n} w_i(x_0)\bigl(g(x_i) - g(x_0)\bigr)

= \underbrace{\sum_{i=1}^{n} w_i(x_0)\,\varepsilon_i}_{\text{variance part}} + g'(x_0)\underbrace{\sum_{i=1}^{n} w_i(x_0)(x_i - x_0)}_{=0 \text{ with LLR}} + \sum_{i=1}^{n} w_i(x_0)\bigl[g'(\bar x_i) - g'(x_0)\bigr](x_i - x_0).

If g' satisfies a Lipschitz condition, the last term can be bounded.

2.1.3 Properties of Nonparametric Regression Estimators

1. Consistency (we will show convergence in mean square)

2. Asymptotic normality

Let's work with the standard kernel regression estimator (simpler than LLR):

\hat g(x_0) = \frac{\frac{1}{n h_n}\sum_i y_i\, k\bigl(\frac{x_0 - x_i}{h_n}\bigr)}{\frac{1}{n h_n}\sum_i k\bigl(\frac{x_0 - x_i}{h_n}\bigr)}.

Assume (i) h_n \to 0, n h_n \to \infty; (ii) \int k(s)\, ds \ne 0, \int s\, k(s)\, ds = 0, \int s^2 k(s)\, ds < C; (iii) f(x) and g(x) are C^2 with bounded second derivatives.

Consistency. We already showed that

\frac{1}{n h_n}\sum_i k\Bigl(\frac{x_0 - x_i}{h_n}\Bigr) \ \overset{MS}{\to}\ f(x_0)\int k(s)\, ds \qquad \text{(denominator)}.

We will now show that

\frac{1}{n h_n}\sum_i y_i\, k\Bigl(\frac{x_0 - x_i}{h_n}\Bigr) \ \overset{MS}{\to}\ f(x_0)\, g(x_0)\int k(s)\, ds \qquad \text{(numerator)},

and then apply the Mann-Wald theorem (the plim of a continuous function is the continuous function of the plim) to conclude that the ratio \to g(x_0).

For the denominator,

Var\Bigl(\frac{1}{n h_n}\sum_{i=1}^{n} k\bigl(\tfrac{x_0 - x_i}{h_n}\bigr)\Bigr) \le \frac{1}{n h_n^2} E\Bigl(k^2\bigl(\tfrac{x_0 - x_i}{h_n}\bigr)\Bigr) = \frac{1}{n h_n^2}\, O_p(h_n) = O_p\Bigl(\frac{1}{n h_n}\Bigr),

so

\frac{1}{n h_n}\sum_i k\Bigl(\frac{x_0 - x_i}{h_n}\Bigr) \ \overset{MS}{\to}\ E\Bigl(\frac{1}{n h_n}\sum_i k\bigl(\tfrac{x_0 - x_i}{h_n}\bigr)\Bigr) = f(x_0)\int k(s)\, ds + O_p(h_n).

Numerator:

E\Bigl(\frac{1}{n h_n}\sum_{i=1}^{n} y_i\, k\bigl(\tfrac{x_0 - x_i}{h_n}\bigr)\Bigr) = E\Bigl(E(y_i|x_i)\,\frac{1}{h_n}\, k\bigl(\tfrac{x_0 - x_i}{h_n}\bigr)\Bigr) = E\Bigl(g(x_i)\,\frac{1}{h_n}\, k\bigl(\tfrac{x_0 - x_i}{h_n}\bigr)\Bigr) = \int g(x_i)\,\frac{1}{h_n}\, k\Bigl(\frac{x_0 - x_i}{h_n}\Bigr) f(x_i)\, dx_i.

Let h(x_i) = f(x_i)\, g(x_i) and Taylor expand (as before) to get

= h(x_0)\int k(s)\, ds + O_p(h_n), \qquad h(x_0) = g(x_0)\, f(x_0).

Also, one can show that

Var\Bigl(\frac{1}{n h_n}\sum_{i=1}^{n} y_i\, k\bigl(\tfrac{x_0 - x_i}{h_n}\bigr)\Bigr) \le \frac{c}{n h_n} \to 0 \quad \text{as } n h_n \to \infty.

Hence the ratio

\to \frac{g(x_0)\, f(x_0)\int k(s)\, ds}{f(x_0)\int k(s)\, ds} = g(x_0), \qquad \text{provided that } h_n \to 0,\ n h_n \to \infty.

Asymptotic normality. We want to show that

\sqrt{n h_n}\bigl(\hat g(x_0) - g(x_0)\bigr) \ \to_d\ N\Bigl(\Bigl[\tfrac12 g''(x_0) + \tfrac{g'(x_0) f'(x_0)}{f(x_0)}\Bigr] h_n^2 \sqrt{n h_n}\int u^2 k(u)\, du,\ \ \sigma^2(x_0)\int k^2(u)\, du\Bigr),

where \sigma^2(x_0) = E(\varepsilon_i^2|x_i = x_0) is the conditional variance. Write

\hat g(x_0) = \frac{\sum_i y_i k_i}{\sum_i k_i}, \qquad y_i = g(x_i) + \varepsilon_i, \qquad E(\varepsilon_i|x_i) = 0,

so that

\sqrt{n h_n}\bigl(\hat g(x_0) - g(x_0)\bigr) = \underbrace{\sqrt{n h_n}\,\frac{\frac{1}{n h_n}\sum_i \varepsilon_i k_i}{\frac{1}{n h_n}\sum_i k_i}}_{\text{variance part}} + \underbrace{\sqrt{n h_n}\,\frac{\frac{1}{n h_n}\sum_i (g(x_i) - g(x_0)) k_i}{\frac{1}{n h_n}\sum_i k_i}}_{\text{bias part}}.

Assuming \int k(s)\, ds = 1, we showed before that

\frac{1}{n h_n}\sum_i k_i \ \to\ f(x_0) + O(h_n).

For the variance part, note that

\sqrt{n h_n}\,\frac{1}{n h_n}\sum_i \varepsilon_i k_i = \frac{1}{\sqrt n}\sum_i \frac{1}{\sqrt{h_n}}\,\varepsilon_i\, k\Bigl(\frac{x_0 - x_i}{h_n}\Bigr),

and apply a CLT to get

\frac{1}{\sqrt n}\sum_i \frac{1}{\sqrt{h_n}}\,\varepsilon_i\, k\Bigl(\frac{x_0 - x_i}{h_n}\Bigr) \ \to_d\ N\Bigl(0,\ Var\Bigl(\frac{1}{\sqrt{h_n}}\,\varepsilon_i\, k\bigl(\tfrac{x_0 - x_i}{h_n}\bigr)\Bigr)\Bigr),

where

Var\Bigl(\frac{1}{\sqrt{h_n}}\,\varepsilon_i\, k\bigl(\tfrac{x_0 - x_i}{h_n}\bigr)\Bigr)
= \frac{1}{h_n} E\Bigl(\varepsilon_i^2\, k^2\bigl(\tfrac{x_0 - x_i}{h_n}\bigr)\Bigr) + \text{higher order terms}
= \frac{1}{h_n} E\Bigl(\underbrace{E(\varepsilon_i^2|x_i)}_{=\sigma^2(x_i)}\, k^2\bigl(\tfrac{x_0 - x_i}{h_n}\bigr)\Bigr)
= \frac{1}{h_n}\int \sigma^2(x_i)\, k^2\Bigl(\frac{x_0 - x_i}{h_n}\Bigr) f(x_i)\, dx_i
= \sigma^2(x_0)\, f(x_0)\int k^2(s)\, ds + O(h_n) \quad \text{by change of variables.}

For the bias part,

\sqrt{n h_n}\,\frac{1}{n h_n}\sum_i \bigl(g(x_i) - g(x_0)\bigr) k_i
\approx \sqrt{n h_n}\int \bigl(g(x_i) - g(x_0)\bigr)\,\frac{1}{h_n}\, k\Bigl(\frac{x_0 - x_i}{h_n}\Bigr) f(x_i)\, dx_i
= \sqrt{n h_n}\int \bigl(g(x_0 - s h_n) - g(x_0)\bigr)\, k(s)\, f(x_0 - s h_n)\, ds
= \sqrt{n h_n}\int \Bigl\{-g'(x_0)(s h_n) + \tfrac12 g''(x_0)\, s^2 h_n^2 + \tfrac12\bigl[g''(\bar x) - g''(x_0)\bigr] s^2 h_n^2\Bigr\}\Bigl\{f(x_0) - s h_n f'(x_0) + \tfrac{s^2 h_n^2}{2} f''(\tilde x)\Bigr\} k(s)\, ds.

If g'' satisfies a Lipschitz condition, |g''(\bar x) - g''(x_0)| \le C\, |\bar x - x_0|, then the term involving [g''(\bar x) - g''(x_0)] is of higher order. Multiplying out the remaining terms and ignoring higher-order pieces (h_n^3\int s^3 k(s)\, ds and beyond),

= -g'(x_0)\, f(x_0)\, h_n\underbrace{\int s\, k(s)\, ds}_{=0} + \Bigl[g'(x_0) f'(x_0) + \tfrac12 g''(x_0) f(x_0)\Bigr] h_n^2\int s^2 k(s)\, ds + \text{higher order terms},

so that

\sqrt{n h_n}\,\frac{1}{n h_n}\sum_i \bigl(g(x_i) - g(x_0)\bigr) k_i \ \approx\ \sqrt{n h_n}\,\Bigl[g'(x_0) f'(x_0) + \tfrac12 g''(x_0) f(x_0)\Bigr] h_n^2\int k(s)\, s^2\, ds.

We require \sqrt{n h_n}\, h_n^2 \to 0, i.e. n h_n^5 \to 0. Putting these results together,

\sqrt{n h_n}\bigl(\hat g(x_0) - g(x_0)\bigr) \ \to_d\ N\Bigl(\Bigl[\frac{g'(x_0) f'(x_0)}{f(x_0)} + \tfrac12 g''(x_0)\Bigr]\sqrt{n h_n}\, h_n^2\int k(s)\, s^2\, ds,\ \ \sigma^2(x_0)\int k(s)^2\, ds\Bigr),

which is what we set out to show.

Remarks. Obtaining the asymptotic distribution (for a plug-in estimator) requires estimating all the ingredients of the bias and variance: g', f', f, g'', and \sigma^2(x_0).

g': obtain by local linear or local quadratic regression.

f': the derivative of the estimator for f gives an estimator of f',

\hat f'(x_0) = \frac{1}{n h_n^2}\sum_{i=1}^{n} k'\Bigl(\frac{x_0 - x_i}{h_n}\Bigr).

f: obtain by the standard kernel density estimator.

\sigma^2(x_0) = E(\varepsilon_i^2|x_0), the conditional variance: estimate it by

\hat\sigma^2(x_0) = \frac{\sum_i w_i(x_0)\,\hat\varepsilon_i^2}{\sum_i w_i(x_0)}, \qquad \hat\varepsilon_i = y_i - \hat g(x_i) \ \text{the fitted residuals.}

\int k^2(s)\, ds and \int k(s)\, s^2\, ds are constants that depend on the kernel function k(s).

We showed the distribution of the kernel regression estimator. What about the distribution of LLR? Fan (JASA, 1992) showed that

\sqrt{n h_n}\bigl(\hat g_{LLR}(x_0) - g(x_0)\bigr) \ \to_d\ N\Bigl(\underbrace{\tfrac12 g''(x_0)\int k(s)\, s^2\, ds\ \sqrt{n h_n}\, h_n^2}_{\text{one less term in the bias expression}},\ \ \underbrace{\sigma^2(x_0)\int k^2(s)\, ds}_{\text{same variance}}\Bigr).

Remarks

The bias of LLR does not depend on f(x_0) (the design density of the data). Because of this feature, Fan refers to the estimator as being "design-adaptive."

Fan showed that in general it is better to use an odd-order local polynomial: there is no cost in variance, and the bias of an odd-order fit does not depend on f(x).

The LLR estimator does not suffer from the boundary bias problem.

[Figure 4: two functions of different smoothness.]

2.1.4 How might we optimally choose the bandwidth?

We could choose the bandwidth to minimize the pointwise asymptotic MSE (AMSE):

AMSE_{LLR} = \Bigl[h_n^2\,\tfrac12 g''(x_0)\int k(s)\, s^2\, ds\Bigr]^2 + \frac{\sigma^2(x_0)}{n h_n}\int k^2(s)\, ds.

Setting the derivative with respect to h_n to zero,

4 h_n^3\,\tfrac14 g''(x_0)^2\Bigl[\int k(s)\, s^2\, ds\Bigr]^2 - \frac{\sigma^2(x_0)}{n h_n^2}\int k^2(s)\, ds = 0,

which gives

h_n = \Bigl[\frac{\sigma^2(x_0)}{g''(x_0)^2}\Bigr]^{1/5}\underbrace{\Bigl[\frac{\int k^2(s)\, ds}{\bigl[\int k(s)\, s^2\, ds\bigr]^2}\Bigr]^{1/5}}_{\text{constant}}\, n^{-1/5}.

Remarks:

higher variance \sigma^2(x_0) \Rightarrow wider bandwidth

more variability in the function (larger g''(x_0)) \Rightarrow smaller bandwidth

more data \Rightarrow narrower bandwidth

(With kernel regression, i.e. a local polynomial of degree 0, the bias term also depends on f(x), and therefore the optimal bandwidth choice will depend on f(x).)

Difficulty: obtaining the optimal bandwidth requires estimates of \sigma^2(x_0) and g''(x_0) (and, in the case of the usual kernel regression estimator, f(x_0)). This is the problem of how to choose a pilot bandwidth. One could assume normality in choosing a pilot for f(x_0) (this is Silverman's rule-of-thumb method). One could also use a fixed bandwidth or nearest-neighbor estimates for \sigma^2(x_0) and g''(x_0). Alternatively, one could minimize AMISE = E\int(\hat g(x_0) - g(x_0))^2\, dx_0 and pick a global bandwidth. For further information on choosing the bandwidth, see below and/or the book by Fan and Gijbels (1996).
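As a small illustration of the plug-in idea, the sketch below turns pilot estimates of \sigma^2(x_0) and g''(x_0) into the pointwise bandwidth formula derived above. It assumes a Gaussian kernel (\int k^2 = 1/(2\sqrt{\pi}), \int s^2 k = 1) and treats the pilot estimates as given; the numerical inputs are made up for the example.

```python
import numpy as np

def llr_plugin_bandwidth(sigma2_hat, g2_hat, n):
    """Pointwise AMSE-optimal LLR bandwidth from the formula above:
    h = [ sigma^2(x0) * int k^2 / (g''(x0)^2 * (int s^2 k)^2) ]**(1/5) * n**(-1/5)."""
    Rk = 1 / (2 * np.sqrt(np.pi))    # int k(s)^2 ds for the Gaussian kernel
    mu2 = 1.0                        # int s^2 k(s) ds for the Gaussian kernel
    return (sigma2_hat * Rk / (g2_hat**2 * mu2**2)) ** 0.2 * n ** (-0.2)

# Hypothetical pilot estimates sigma2_hat = 0.25 and g''(x0) = 2 with n = 1000:
print(llr_plugin_bandwidth(sigma2_hat=0.25, g2_hat=2.0, n=1000))
```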

3 Bandwidth Selection for Density Estimation

3.1 First Generation Methods

3.1.1 Plug-in method (local version)

One could derive the optimal bandwidth at each point of evaluation, x_0, which leads to a pointwise (localized) bandwidth estimator:

h(x_0) = \arg\min_h\ AMSE\,\hat f(x_0) = E\bigl(\hat f(x_0) - f(x_0)\bigr)^2 = \frac{h^4}{4}\, f''(x_0)^2\Bigl(\int k(s)\, s^2\, ds\Bigr)^2 + \frac{1}{nh}\, f(x_0)\int k^2(s)\, ds,

which gives

h(x_0) = \Bigl[\frac{f(x_0)\int k^2(s)\, ds}{f''(x_0)^2\bigl(\int k(s)\, s^2\, ds\bigr)^2}\Bigr]^{1/5} n^{-1/5}.

We need to estimate f(x_0) and f''(x_0) to obtain the optimal plug-in bandwidth.

3.1.2 Global plug-in method

Because the above method is computationally intensive, one could instead derive a single optimal bandwidth for all points of evaluation by minimizing the mean integrated squared error,

MISE\,\hat f = E\int\bigl(\hat f(x_0) - f(x_0)\bigr)^2\, dx_0,

which removes x_0 by integration:

MISE \approx \frac{h^4}{4}\,\underbrace{\Bigl(\int k(s)\, s^2\, ds\Bigr)^2}_{=1}\,\underbrace{\int f''(x_0)^2\, dx_0}_{B} + \frac{1}{nh}\,\underbrace{\int k^2(s)\, ds}_{A},

h_{MISE} = \Bigl(\frac{A}{B}\Bigr)^{1/5}\Bigl(\frac{1}{n}\Bigr)^{1/5}.

Remarks: \int f''(x_0)^2\, dx_0 is a measure of the variation in f(x_0); if f is highly variable, then h will be small. Here we still need an estimate of f''(x_0), but we no longer require an estimate of f(x_0) (it was eliminated by integrating).

3.2 Rule-of-thumb Method (Silverman)

This approach makes a normality assumption for the purpose of selecting "pilot" bandwidths. That is, assume x is Normal; then

h_{MISE} = \Bigl(\frac{4}{3n}\Bigr)^{1/5} SD(x) \approx 1.06\, SD(x)\, n^{-1/5}, \qquad SD = \text{standard deviation.}

This method is often applied to nonnormal data. It has a tendency to oversmooth, particularly when the data density is multimodal. Sometimes the standard deviation is replaced in the formula by the 75%-25% interquartile range. This is an older method, available in some software packages, but its performance is not so good; other methods are better.
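A sketch of the rule-of-thumb bandwidth, including the interquartile-range variant mentioned above (the 0.9 and 1.34 constants are the usual Gaussian-kernel values; the simulated data are an illustrative assumption):

```python
import numpy as np

def silverman_bandwidth(data):
    """Rule-of-thumb bandwidth for a Gaussian-kernel KDE, using the robust
    min(SD, IQR/1.34) spread; the plain version is 1.06 * SD * n**(-1/5)."""
    data = np.asarray(data)
    n = data.size
    sd = data.std(ddof=1)
    iqr = np.subtract(*np.percentile(data, [75, 25]))
    spread = min(sd, iqr / 1.34)
    return 0.9 * spread * n ** (-1 / 5)

rng = np.random.default_rng(2)
x = rng.normal(size=500)
print(silverman_bandwidth(x))
```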

3.3 Global Method #2: Least Squares Cross-validation

The idea is to minimize an estimate of the integrated squared error (ISE):

\min_h\ ISE = \int\bigl\{\hat f(x_0) - f(x_0)\bigr\}^2\, dx_0 = \underbrace{\int \hat f(x_0)^2\, dx_0}_{\text{term \#1}} - \underbrace{2\int \hat f(x_0)\, f(x_0)\, dx_0}_{\text{term \#2}} + \underbrace{\int f(x_0)^2\, dx_0}_{\text{does not depend on } h}.

Term #1:

\int \hat f(x_0)^2\, dx_0 = \frac{1}{n^2 h_n^2}\sum_i\sum_j \int k\Bigl(\frac{x_i - x_0}{h_n}\Bigr)\, k\Bigl(\frac{x_j - x_0}{h_n}\Bigr)\, dx_0.

Let s = \frac{x_0 - x_j}{h_n}, so that x_0 = x_j + s h_n, dx_0 = h_n\, ds, and \frac{x_i - x_0}{h_n} = \frac{x_i - x_j}{h_n} - s. Using the symmetry of k,

\int \hat f(x_0)^2\, dx_0 = \frac{1}{n^2 h_n}\sum_i\sum_j \int k\Bigl(\frac{x_i - x_j}{h_n} - s\Bigr)\, k(s)\, ds = \frac{1}{n^2 h_n}\sum_i\sum_j \bar k\Bigl(\frac{x_i - x_j}{h_n}\Bigr),

where \bar k is the convolution of the kernel with itself. Note: if a and s are independent random variables with densities k_a and k_s and y = a + s, then

\Pr(y \le y_0) = \Pr(a + s \le y_0) = \Pr(a \le y_0 - s) = \int F_a(y_0 - s)\, k_s(s)\, ds,

so \int k_a(y_0 - s)\, k_s(s)\, ds is the density of the sum of the two random variables (the convolution).

Now examine term #2, 2\int \hat f(x_0)\, f(x_0)\, dx_0. Note that

\int \hat f(x_0)\, f(x_0)\, dx_0 = \frac{1}{nh}\sum_i \int k\Bigl(\frac{x_i - x_0}{h}\Bigr)\, f(x_0)\, dx_0,

and consider

E\Bigl(\frac{1}{n(n-1)h}\sum_i\sum_{j\ne i} k\Bigl(\frac{x_i - x_j}{h_n}\Bigr)\Bigr) = \frac{1}{h}\int\int k\Bigl(\frac{y - x_0}{h_n}\Bigr)\, f(y)\, f(x_0)\, dx_0\, dy = E\int \hat f(x_0)\, f(x_0)\, dx_0.

Note: we need x_i and x_j to be independent, so we need to use the leave-one-out estimator. An unbiased estimator for term #2, 2\int \hat f(x_0)\, f(x_0)\, dx_0, is therefore

\frac{2}{n(n-1)h}\sum_i\sum_{j\ne i} k\Bigl(\frac{x_i - x_j}{h}\Bigr).

Putting the two pieces together, the least squares cross-validation bandwidth is

h_{CV} = \arg\min_h\ \underbrace{\frac{1}{n^2 h}\sum_i\sum_j \int k\Bigl(\frac{x_i - x_j}{h} - s\Bigr)\, k(s)\, ds}_{\text{need to calculate the convolution for this part}} \; - \; \underbrace{\frac{2}{n(n-1)h}\sum_i\sum_{j\ne i} k\Bigl(\frac{x_i - x_j}{h}\Bigr)}_{\text{leave-one-out estimator: } \frac{2}{n}\sum_i \hat f_{-i}(x_i)},

where \hat f_{-i}(x_i) is the density estimate at x_i with the ith data point not used. (The full estimate satisfies \hat f(x_i) = \frac{n-1}{n}\hat f_{-i}(x_i) + \frac{1}{nh}k(0): the i = j term contributes \frac{1}{nh}k(0), which is why the leave-one-out version is used in the second term.)

Remarks:

It turns out that h_{CV} can only be estimated at rate n^{-1/10}, which is very slow (a result due to Stone). However, in Monte Carlo studies LSCV often does better than the rule-of-thumb method.

Marron (1990) found a tendency for local minima of the cross-validation function near small values of h, which leads to a tendency to undersmooth.

LSCV is usually implemented by searching over a grid of bandwidth values and choosing the one that minimizes the cross-validation function.
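A sketch of least squares cross-validation for a Gaussian-kernel density estimator: it evaluates the criterion derived above on a grid of bandwidths, using the fact that the convolution of the standard normal kernel with itself is the N(0, 2) density (the grid and the simulated data are illustrative assumptions).

```python
import numpy as np

def lscv_score(h, data):
    """LSCV criterion for a Gaussian-kernel KDE:
    (1/(n^2 h)) sum_ij kbar((x_i-x_j)/h) - 2/(n(n-1)h) sum_{i != j} k((x_i-x_j)/h)."""
    data = np.asarray(data)
    n = data.size
    diff = (data[:, None] - data[None, :]) / h          # pairwise (x_i - x_j)/h
    kbar = np.exp(-diff**2 / 4) / np.sqrt(4 * np.pi)    # k*k, the convolution (N(0,2) density)
    k = np.exp(-diff**2 / 2) / np.sqrt(2 * np.pi)
    term1 = kbar.sum() / (n**2 * h)                     # estimate of int fhat^2
    np.fill_diagonal(k, 0.0)                            # leave-one-out: drop i = j
    term2 = 2 * k.sum() / (n * (n - 1) * h)             # (2/n) sum_i fhat_{-i}(x_i)
    return term1 - term2

rng = np.random.default_rng(3)
x = rng.normal(size=300)
grid = np.linspace(0.05, 1.0, 40)
print(grid[np.argmin([lscv_score(h, x) for h in grid])])   # h_CV on the grid
```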

3.4 Biased Cross-validation

AMISE = \frac{1}{nh}\int k^2(s)\, ds + \frac{h^4}{4}\Bigl(\int s^2 k(s)\, ds\Bigr)^2\int f''(x)^2\, dx + o\Bigl(\frac{1}{nh} + h^4\Bigr).

What if we plug in an estimate of f''(x)? Note that

\hat f(x) = \frac{1}{nh}\sum_i k\Bigl(\frac{x_i - x}{h}\Bigr), \qquad
\hat f'(x) = \frac{1}{nh^2}\sum_i k'\Bigl(\frac{x_i - x}{h}\Bigr), \qquad
\hat f''(x) = \frac{1}{nh^3}\sum_i k''\Bigl(\frac{x_i - x}{h}\Bigr),

so that

\hat f''(x)^2 = \frac{1}{n^2 h^6}\sum_{i=1}^{n}\sum_{j=1}^{n} k''\Bigl(\frac{x_i - x}{h}\Bigr)\, k''\Bigl(\frac{x_j - x}{h}\Bigr).

What is the bias in estimating f''(x)^2? The i = j terms contribute

E\Bigl(\frac{1}{n^2 h^6}\sum_i \int k''\Bigl(\frac{x_i - x}{h}\Bigr)^2 dx\Bigr) = \frac{1}{n h^5}\int k''(s)^2\, ds + \text{higher order},

and one can show that

E\Bigl\{\int \hat f''(x)^2\, dx\Bigr\} = \int f''(x)^2\, dx + \frac{1}{n h^5}\int k''(s)^2\, ds + \text{higher order}.

Biased cross-validation therefore chooses

h_{BCV} = \arg\min_h\ \frac{1}{nh}\int k^2(s)\, ds + \frac{h^4}{4}\Bigl(\int s^2 k(s)\, ds\Bigr)^2\Bigl[\int \hat f''(x)^2\, dx - \underbrace{\frac{1}{n h^5}\int k''(s)^2\, ds}_{\text{subtract this bias here}}\Bigr].

3.5 Second Generation Methods

Sheather and Jones (1991): "solve-the-equation plug-in" method. Find the h that solves

h_{SJPI} = \Bigl[\frac{\int k^2(s)\, ds}{\int \hat f''_{g(h)}(x)^2\, dx\ \bigl(\int s^2 k(s)\, ds\bigr)^2}\Bigr]^{1/5} n^{-1/5}.

The idea is to plug in an estimate of f''(x), where the pilot bandwidth used to estimate f'' is written as a function of h: g = g(h). The bandwidth that is optimal for estimating f(x) is not the same as the optimal bandwidth for estimating f''. Therefore, find the analogue of the AMSE for R(f'') = \int f''(x)^2\, dx. One gets

g = C_1(k)\,\{R(f''')\}^{-1/7}\, C_2(k)\, n^{-1/7},

where C_1(k) and C_2(k) are constants that depend on the kernel. Now solve for g as a function of the optimal bandwidth h:

g(h) = C_3(k)\Bigl\{\frac{R(f'')}{R(f''')}\Bigr\}^{1/7} h^{5/7}.

Now we need an estimate of f'''; at this point, SJ suggest using a rule-of-thumb choice based on a normal density assumption.

What is the main advantage of the SJPI method? For most methods,

\frac{\hat h - h_{MISE}}{h_{MISE}} = O_p(n^{-p}) \quad \text{with } p = \tfrac{1}{10},

while for SJPI p = \tfrac{5}{14}, which is close to \tfrac12.

4 Bandwidth Choice in Nonparametric Regression

Model: y_i = m(x_i) + \varepsilon_i. The local polynomial estimator of order p solves

\min_{\beta_0, \ldots, \beta_p}\ \sum_{i=1}^{n}\bigl(y_i - \beta_0 - \beta_1(x_i - x) - \cdots - \beta_p(x_i - x)^p\bigr)^2\, k\Bigl(\frac{x_i - x}{h_n}\Bigr).

How should we choose h_n, given p?

Before, we found that the distribution theory for the local linear regression (LLR) estimator was

\sqrt{n h_n}\bigl(\hat g_{LLR}(x_0) - g(x_0)\bigr) \ \to_d\ N\Bigl(\tfrac12 g''(x_0)\int k(s)\, s^2\, ds\ \sqrt{n h_n}\, h_n^2,\ \ \sigma^2(x_0)\int k^2(s)\, ds\Bigr).

4.1 Plug-in Estimator

Choose the bandwidth to minimize the (data-dependent) pointwise MSE:

\min_{h_n}\ \underbrace{\Bigl[\tfrac12 g''(x_0)\int k(s)\, s^2\, ds\Bigr]^2 h_n^4}_{\text{Bias}^2 = C_0 h_n^4} + \underbrace{\frac{\sigma^2(x_0)}{n h_n}\int k^2(s)\, ds}_{\text{Variance} = C_1 n^{-1} h_n^{-1}}.

Minimizing C_0 h_n^4 + C_1 n^{-1} h_n^{-1} over h_n:

4 C_0 h_n^3 - C_1 n^{-1} h_n^{-2} = 0 \;\Longrightarrow\; h_n^5 = \frac{C_1}{4 C_0\, n} \;\Longrightarrow\; h_{opt} = \Bigl(\frac{C_1}{4 C_0}\Bigr)^{1/5} n^{-1/5} = \Bigl[\frac{\sigma^2(x_0)\int k^2(s)\, ds}{g''(x_0)^2\bigl(\int k(s)\, s^2\, ds\bigr)^2}\Bigr]^{1/5} n^{-1/5}.

Remarks:

h_{opt} is a variable (pointwise) bandwidth.

g'' high \Rightarrow smaller bandwidth: we want a smaller bandwidth in regions where the function is highly variable.

\sigma^2(x_0) high \Rightarrow use a larger bandwidth.

4.2 Global Plug-in Estimator

4.3 Global MISE Criterion

Minimize a global criterion such as the mean integrated squared error (MISE); see Fan & Gijbels (1996) on local polynomial regression.

For LLR, p = 1 and the rate is n^{-1/5}; for general odd p,

h_{MISE} = \Bigl[\frac{\bigl((p+1)!\bigr)^2\, R(k_p)\int \overbrace{\sigma^2(x)}^{Var(y|x)}\, dx}{2(p+1)\,\mu_{p+1}(k_p)^2\int m^{(p+1)}(x)^2 f(x)\, dx}\Bigr]^{\frac{1}{2p+3}}\, n^{-\frac{1}{2p+3}},

where k_p is a function of the kernel, R(k_p) = \int k_p(s)^2\, ds, and \mu_l(k_p) = \int s^l k_p(s)\, ds. For p = 1 this reduces to

h_{MISE} = \Bigl[\frac{R(k)\int \sigma^2(x)\, dx}{\mu_2(k)^2\int m''(x)^2 f(x)\, dx}\Bigr]^{1/5} n^{-1/5}.

The unknowns are \sigma^2(x), m^{(p+1)}(x), and f(x). Apply the same plug-in method to estimate m''(x), then take the average. We still need to evaluate \int \sigma^2(x)\, dx; this can be done with a blocking method: (1) divide the support of x into N blocks X_1, \ldots, X_N; (2) estimate y = m(x) + \varepsilon by a degree-Q polynomial within each block; then

\hat\sigma^2(N) = \frac{1}{n - s_N}\sum_i\sum_j\bigl\{y_i - \hat m_j(x_i)\bigr\}^2\, 1(x_i \in X_j).

Bandwidth Selection for Nonparametric Regression: Cross-Validation

Recall that the LLR estimator is defined as

(\hat a, \hat b) = \arg\min_{a, b}\ \sum_{i=1}^{N}\bigl(y_i - a - b(x_i - x_0)\bigr)^2\, k\Bigl(\frac{x_i - x_0}{h_n}\Bigr).

Let

y = \begin{pmatrix} y_1\\ \vdots\\ y_N \end{pmatrix}, \qquad
W = \mathrm{diag}\Bigl(k\bigl(\tfrac{x_1 - x_0}{h_n}\bigr), \ldots, k\bigl(\tfrac{x_N - x_0}{h_n}\bigr)\Bigr)\ (N\times N), \qquad
X = \begin{pmatrix} 1 & (x_1 - x_0)\\ \vdots & \vdots\\ 1 & (x_N - x_0) \end{pmatrix}.

Then

\hat g_{LLR}(x_0) = [1\ \ 0]\,(X'WX)^{-1}(X'Wy),

the intercept coefficient from a weighted least squares problem whose weights depend on x_0 and h. Let

H(h, x_0) = [1\ \ 0]\,(X'WX)^{-1}X'W, \qquad \hat g_{LLR}(x_0) = H(h, x_0)\, y.

Stacking these 1\times N rows over the evaluation points x_1, \ldots, x_N gives an N\times N smoother matrix H(h) with fitted values \hat g = H(h)y.

The cross-validation bandwidth selector chooses h to minimize the conditional mean squared error E\bigl((\hat g - g)'(\hat g - g)\,\big|\,x\bigr), summed over the evaluation points. With y = g + \varepsilon and E(\varepsilon\varepsilon'|x) = \sigma^2 I,

E\bigl((\hat g - g)'(\hat g - g)|x\bigr)
= E\bigl((H(h)y - g)'(H(h)y - g)|x\bigr)
= E\bigl(((H(h) - I)g + H(h)\varepsilon)'((H(h) - I)g + H(h)\varepsilon)|x\bigr)
= g'(H(h) - I)'(H(h) - I)g + E(\varepsilon' H(h)'H(h)\varepsilon|x)
= g'(H(h) - I)'(H(h) - I)g + \mathrm{tr}\bigl(E(\varepsilon\varepsilon'|x)\, H(h)'H(h)\bigr)
= g'(H(h) - I)'(H(h) - I)g + \sigma^2\,\mathrm{tr}\bigl(H(h)'H(h)\bigr),

using \mathrm{tr}(ABCD) = \mathrm{tr}(BCDA). We need an estimator for the first term. Consider what is estimated by the sum of squared fitted residuals:

(y - H(h)y)'(y - H(h)y) = y'(I - H(h))'(I - H(h))y
= (g + \varepsilon)'(I - H(h))'(I - H(h))(g + \varepsilon)
= g'(I - H(h))'(I - H(h))g + g'(I - H(h))'(I - H(h))\varepsilon + \varepsilon'(I - H(h))'(I - H(h))g + \varepsilon'(I - H(h))'(I - H(h))\varepsilon.

Taking conditional expectations term by term,

E\bigl(g'(I - H(h))'(I - H(h))g|x\bigr) = g'(I - H(h))'(I - H(h))g,
E\bigl(g'(I - H(h))'(I - H(h))\varepsilon|x\bigr) = 0,
E\bigl(\varepsilon'(I - H(h))'(I - H(h))\varepsilon|x\bigr) = \mathrm{tr}\bigl(E(\varepsilon\varepsilon'|x)(I - H(h))'(I - H(h))\bigr)
= \sigma^2\,\mathrm{tr}\bigl[(I - H(h))'(I - H(h))\bigr]
= \sigma^2\,\mathrm{tr}\bigl[I - H(h)' - H(h) + H(h)'H(h)\bigr]
= \sigma^2\,\mathrm{tr}\, I - 2\sigma^2\,\mathrm{tr}\bigl(H(h)\bigr) + \sigma^2\,\mathrm{tr}\bigl(H(h)'H(h)\bigr).

Altogether,

E\bigl[(y - H(h)y)'(y - H(h)y)\,\big|\,x\bigr] = g'(I - H(h))'(I - H(h))g + \underbrace{\sigma^2\,\mathrm{tr}\, I}_{\text{does not depend on } h} - \underbrace{2\sigma^2\,\mathrm{tr}\, H(h)}_{\text{don't want this term}} + \sigma^2\,\mathrm{tr}\bigl(H(h)'H(h)\bigr).

If we use the leave-one-out estimator \hat g_{-i}(x_i) instead of \hat g(x_i), the diagonal of the smoother matrix is zero, so \mathrm{tr}\, H(h) = 0 and the unwanted term disappears. This gives the cross-validation bandwidth

h_{CV} = \arg\min_h\ \sum_{i=1}^{n}\bigl(y_i - \hat g_{-i}(x_i)\bigr)^2.
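A sketch of this leave-one-out cross-validation criterion. The derivation above is for the LLR smoother matrix H(h); for brevity the sketch applies the same leave-one-out idea to the Nadaraya-Watson smoother, which is also a linear smoother (the data, kernel, and bandwidth grid are illustrative assumptions).

```python
import numpy as np

def nw_loo_cv(h, x, y):
    """Leave-one-out CV criterion sum_i (y_i - ghat_{-i}(x_i))^2 for a
    Gaussian-kernel Nadaraya-Watson estimator."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    d = (x[:, None] - x[None, :]) / h
    k = np.exp(-0.5 * d**2)                  # kernel weights k((x_i - x_j)/h)
    np.fill_diagonal(k, 0.0)                 # leave observation i out
    ghat_loo = (k @ y) / k.sum(axis=1)       # ghat_{-i}(x_i)
    return np.sum((y - ghat_loo) ** 2)

rng = np.random.default_rng(4)
x = rng.uniform(-2, 2, size=200)
y = np.sin(2 * x) + 0.3 * rng.normal(size=200)
grid = np.linspace(0.05, 1.0, 40)
print(grid[np.argmin([nw_loo_cv(h, x, y) for h in grid])])   # h_CV on the grid
```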

