February 2, 2007
1 Density Estimation

1.1 Histogram Estimator
The simplest nonparametric density estimator is the histogram estimator. Suppose we estimate the density by a histogram with bin width h:

\hat f(x_0) = \frac{1}{nh} \sum_{i=1}^n 1(x_i \in [x_0, x_0 + h))
1.1.1 Bias
E \hat f(x_0) = \frac{1}{h} \Pr(x_i \in [x_0, x_0 + h)) = \frac{1}{h} \int_{x_0}^{x_0+h} f(x_i)\, dx_i

Taylor expand f around x_0: f(x_i) = f(x_0) + f'(\bar x)(x_i - x_0) for some \bar x between x_0 and x_i, and assume |f'| \le C. Then

E \hat f(x_0) = \frac{1}{h} \int_{x_0}^{x_0+h} f(x_0)\, dx_i + \frac{1}{h} \int_{x_0}^{x_0+h} f'(\bar x)(x_i - x_0)\, dx_i = f(x_0) + R(x_0, h)

where

|R(x_0, h)| \le \frac{1}{h} \int_{x_0}^{x_0+h} C (x_i - x_0)\, dx_i \le C h^2 \cdot \frac{1}{h} = O(h)

Thus,

|E \hat f(x_0) - f(x_0)| \le C h
Consistency of the histogram therefore requires that h \to 0 as n \to \infty. Note also that we assumed f is differentiable with a bounded derivative.
Variance

Note: consistency also requires that the variance \to 0 as n \to \infty.

\operatorname{var} \hat f(x_0) = \frac{1}{n^2 h^2} \sum_{i=1}^n \operatorname{var}\{1(x_i \in [x_0, x_0 + h])\}

\operatorname{var}\{1(x_i \in [x_0, x_0 + h])\} = E\big(1(x_i \in [x_0, x_0 + h])^2\big) - \big(E\, 1(\cdot)\big)^2 \le E\big(1(x_i \in [x_0, x_0 + h])^2\big) \le C_0 h

assuming f(x_0) \le C_0. So

\operatorname{var} \hat f(x_0) \le \frac{n C_0 h}{n^2 h^2} = \frac{C_0}{nh} = O_p\Big(\frac{1}{nh}\Big)

Consistency requires nh \to \infty.

Remarks: The estimator is not root-n consistent, because the variance blows up as h \to 0. We require h \to 0 and nh \to \infty.

1.1.2 Selecting the bin size
MSE = C h^2 + \frac{C_0}{nh}

Balancing the squared-bias and variance terms (set the derivative in h to zero) gives an optimal bin width of order h \propto n^{-1/3}.
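As a concrete illustration (not part of the original notes), here is a minimal Python sketch of the bin-counting estimator above; the function name, the simulated data, and the constant in the n^{-1/3} bandwidth are all illustrative choices.

import numpy as np

def histogram_density(x0, data, h):
    # f_hat(x0) = (1/(n*h)) * #{ x_i in [x0, x0 + h) }
    data = np.asarray(data)
    return np.sum((data >= x0) & (data < x0 + h)) / (data.size * h)

# Example: standard normal sample, bin width shrinking at the MSE rate n^(-1/3)
rng = np.random.default_rng(0)
sample = rng.standard_normal(1000)
h = sample.size ** (-1.0 / 3.0)
print(histogram_density(0.0, sample, h))   # should be close to f(0) = 0.3989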
1.2 Symmetric Bins and the Uniform Kernel

Instead of the one-sided bin [x_0, x_0 + h), use a symmetric bin [x_0 - h, x_0 + h] of width 2h:

\hat f(x_0) = \frac{1}{2nh} \sum_{i=1}^n 1(x_i \in [x_0 - h, x_0 + h])

This can be rewritten as

\hat f(x_0) = \frac{1}{nh} \sum_{i=1}^n k\Big(\frac{x_0 - x_i}{h}\Big), \qquad k(s) = \tfrac{1}{2}\, 1(|s| \le 1)

i.e. a kernel estimator with the uniform kernel, which equals 1/2 on [-1, 1] and 0 elsewhere.
1.3 Smooth Kernels

Instead of the uniform kernel, use a kernel with weights that go smoothly to zero:

\hat f(x_0) = \frac{1}{nh} \sum_{i=1}^n k(s_i), \qquad s_i = \frac{x_0 - x_i}{h}

The kernel again integrates to 1 and is symmetric, but gives more weight to closer observations. Examples:

biweight kernel: k(s) = \frac{15}{16}(1 - s^2)^2 for |s| \le 1, and k(s) = 0 otherwise

normal kernel: k(s) = \frac{1}{\sqrt{2\pi}} e^{-s^2/2}
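A minimal Python sketch of this estimator with the normal kernel (the function names and simulated data are illustrative):

import numpy as np

def normal_kernel(s):
    # k(s) = (1/sqrt(2*pi)) * exp(-s^2/2)
    return np.exp(-0.5 * s ** 2) / np.sqrt(2 * np.pi)

def kde(x0, data, h, kernel=normal_kernel):
    # f_hat(x0) = (1/(n*h)) * sum_i k((x0 - x_i)/h)
    data = np.asarray(data)
    return kernel((x0 - data) / h).sum() / (data.size * h)

rng = np.random.default_rng(0)
sample = rng.standard_normal(500)
print(kde(0.0, sample, h=0.3))   # should be close to 1/sqrt(2*pi) = 0.3989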
1.4 Properties of the Kernel Density Estimator

1.4.1 Bias
Write

E \hat f(x_0) = \frac{1}{h_n} \int k\Big(\frac{x_0 - x_i}{h_n}\Big) f(x_i)\, dx_i

and let s = \frac{x_0 - x_i}{h_n}, so that ds = -\frac{1}{h_n}\, dx_i. If f is r times differentiable,

f(x_0 - s h_n) = f(x_0) + f'(x_0)(-h_n s) + \frac{1}{2} f''(x_0) h_n^2 s^2 + \dots + \frac{1}{r!} f^{(r)}(x_0)(-h_n s)^r + \underbrace{\frac{1}{r!}\big[f^{(r)}(\bar x) - f^{(r)}(x_0)\big](-h_n s)^r}_{\text{remainder term } R(x_0, s)}

Then, if the kernel moments satisfy \int k(s) s^j\, ds = 0 for j = 1, \dots, r, the bias is

E \hat f(x_0) - f(x_0) = \int k(s)\, R(x_0, s)\, ds \le h_n^r\, C \int |s|^r k(s)\, ds

(since we assumed a bounded rth derivative). Thus this estimator improves on the histogram if some moments of the kernel equal 0. Typically k satisfies

\int k(s)\, ds = 1, \qquad \int k(s)\, s\, ds = 0.

If we also require \int k(s)\, s^2\, ds = 0, then k(s) must be negative in places.

Remark: If f^{(r)} satisfies a Lipschitz condition, i.e. |f^{(r)}(\bar x) - f^{(r)}(x_0)| \le m |\bar x - x_0|, then

\text{Bias} \le C_0\, h_n^{r+1} \int |s|^r k(s)\, ds = O_p(h_n^{r+1}),

which is of higher order.
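To make the moment conditions concrete, the short Python check below numerically integrates the first few moments of a second-order (biweight) kernel and of a standard fourth-order Gaussian-based kernel, k_4(s) = (3/2 - s^2/2)\phi(s). The fourth-order example is an illustration added here, not a kernel named in the notes; it shows that forcing the s^2 moment to zero makes the kernel negative somewhere.

import numpy as np
from scipy.integrate import quad

def biweight(s):
    # second-order kernel: nonnegative, integrates to 1, zero odd moments
    return (15.0 / 16.0) * (1 - s ** 2) ** 2 if abs(s) <= 1 else 0.0

def k4(s):
    # fourth-order Gaussian-based kernel: zero second moment, hence negative for large |s|
    return (1.5 - 0.5 * s ** 2) * np.exp(-0.5 * s ** 2) / np.sqrt(2 * np.pi)

for name, k, lim in [("biweight", biweight, 1.0), ("4th-order", k4, 8.0)]:
    moments = [quad(lambda s, r=r: k(s) * s ** r, -lim, lim)[0] for r in range(3)]
    print(name, [round(m, 6) for m in moments])   # [integral, 1st moment, 2nd moment]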
Variance: by the same change of variables,

\operatorname{Var} \hat f(x_0) = \frac{1}{nh_n}\, f(x_0) \int k^2(s)\, ds + \text{higher order terms}
1.5 Choice of Bandwidth
Choose h_n to minimize MSE(h_n) = C_0 h_n^{2r} + \frac{C_1}{n h_n}. The first-order condition is

2r\, C_0\, h_n^{2r-1} - \frac{C_1}{n h_n^2} = 0

so

h_n^{2r+1} = \frac{C_1}{2r\, C_0\, n}, \qquad h_n^* = \Big(\frac{C_1}{2r\, C_0}\Big)^{\frac{1}{2r+1}} n^{-\frac{1}{2r+1}}

and h_n shrinks to zero as n grows large. Now let's look at the order of the MSE. Plug in the optimal value for h_n obtained above, and let C_2 = \Big(\frac{C_1}{2r\, C_0}\Big)^{\frac{1}{2r+1}}.
MSE(h_n^*) = C_0\, C_2^{2r}\, n^{-\frac{2r}{2r+1}} + \frac{C_1}{n\, C_2\, n^{-\frac{1}{2r+1}}} = C_0\, C_2^{2r}\, n^{-\frac{2r}{2r+1}} + C_1\, C_2^{-1}\, n^{-\frac{2r}{2r+1}} = O_p\big(n^{-\frac{2r}{2r+1}}\big)

so

\sqrt{MSE} = O_p\big(n^{-\frac{r}{2r+1}}\big)

Note that \frac{r}{2r+1} < \frac{1}{2}, so this is slower than the parametric rate n^{-1/2}. For comparison, for OLS

\sqrt{N}(\hat\beta_{ols} - \beta_0) \to N\Big(0,\ \sigma^2 \underbrace{\big(\tfrac{X'X}{N}\big)^{-1}}_{\text{Var}}\Big)

and the RMSE is O(N^{-1/2}).

1.5.1 How do you generalize the kernel density estimator to more dimensions?

Use a product kernel, with one bandwidth per dimension:

\hat f(x_0, z_0) = \frac{1}{n\, h_{n,x}\, h_{n,z}} \sum_{i=1}^n k_1\Big(\frac{x_0 - x_i}{h_{n,x}}\Big)\, k_2\Big(\frac{z_0 - z_i}{h_{n,z}}\Big)
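A minimal Python sketch of this product-kernel estimator, using the normal kernel in both dimensions (names and data are illustrative):

import numpy as np

def kde_2d(x0, z0, x, z, hx, hz):
    # f_hat(x0, z0) = (1/(n*hx*hz)) * sum_i k1((x0 - x_i)/hx) * k2((z0 - z_i)/hz)
    k = lambda s: np.exp(-0.5 * s ** 2) / np.sqrt(2 * np.pi)
    x, z = np.asarray(x), np.asarray(z)
    return np.sum(k((x0 - x) / hx) * k((z0 - z) / hz)) / (x.size * hx * hz)

rng = np.random.default_rng(0)
x, z = rng.standard_normal(2000), rng.standard_normal(2000)
print(kde_2d(0.0, 0.0, x, z, hx=0.4, hz=0.4))   # roughly 1/(2*pi) = 0.159 for independent N(0,1) data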
1.6 The Curse of Dimensionality

With an m-dimensional x_0 = (x_1, \dots, x_m) and kernels of order r_k in each dimension, the same bias-variance calculation gives

MSE\, \hat f(x_0) = O_p\big(n^{-\frac{2 r_k}{2 r_k + m}}\big),

so the rate of convergence deteriorates as the dimension m grows.
1.7 Consistency and Asymptotic Normality

Decompose the estimation error as

\hat f(x_0) - f(x_0) = \underbrace{\hat f(x_0) - E \hat f(x_0)}_{(1)} + \underbrace{E \hat f(x_0) - f(x_0)}_{(2)}

Assume \int k(s)\, ds = 1 and \int k(s)\, s\, ds = 0.

For consistency, Term (1) \to 0 as nh_n \to \infty, since

\operatorname{Var} \hat f(x_0) = \frac{1}{nh_n}\Big[f(x_0) \int k^2(s)\, ds + O(h_n^2)\Big] \to 0,

and Term (2) \to 0 as h_n \to 0. For asymptotic normality we scale by \sqrt{nh_n}, so we must now show how fast Term (2) \to 0 (this will require more stringent conditions on the bandwidth).
E \hat f(x_0) = \frac{1}{h_n} \int k\Big(\frac{x_0 - x_i}{h_n}\Big) f(x_i)\, dx_i

With the change of variables s = \frac{x_0 - x_i}{h_n} (so h_n\, ds = -dx_i),

E \hat f(x_0) = f(x_0) \int k(s)\, ds + f'(x_0)\, h_n \int k(s)\, s\, ds + \frac{h_n^2}{2} \int f''(\bar x)\, k(s)\, s^2\, ds

Assume |f''(\bar x)| \le M and \int k(s)\, s\, ds = 0 (satisfied if k is symmetric). Then

E \hat f(x_0) - f(x_0) = O_p(h_n^2),

which implies that

\sqrt{nh_n}\big(E \hat f(x_0) - f(x_0)\big) = O_p\big(\sqrt{nh_n}\, h_n^2\big) = O_p\big(\sqrt{n h_n^5}\big).

Hence we require n h_n^5 \to 0, in addition to h_n \to 0 and n h_n \to \infty. The required condition for asymptotic normality is stronger than for consistency.
2 Nonparametric Regression
Model: y_i = g(x_i) + \varepsilon_i, \quad E(\varepsilon_i | x_i) = 0.
There are two basic approaches.

Global approach: series expansion methods, orthonormal expansion methods, splines. Example (using a polynomial basis): g(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_k x^k. The main choice is which basis to use. Drawback: the estimate at each point depends on all of the data and is potentially sensitive to outliers.

Local approach: examples are kernel regression, local polynomial regression, and smoothing. Early smoothing methods date back to the 1860s in the actuarial literature (see Cleveland (1979, JASA)).

We will emphasize local estimation methods here.
2.1 Kernel Regression

The object of interest is the conditional mean,

g(x) = E(y|x) = \int y\, \frac{f(x, y)}{f(x)}\, dy
Remarks:

With repeated data at each point, we could estimate g(x_0) by

\hat g(x_0) = \frac{\sum_i y_i\, 1(x_i = x_0)}{\sum_i 1(x_i = x_0)}

We can only estimate this way at points where we have data; we need to interpolate or extrapolate between and outside the data observations.
If we want \hat g(x) to be smooth, we need to choose a kernel function that goes to 0 smoothly:

\hat g(x_0) = \frac{\sum_{i=1}^n y_i\, k\big(\frac{x_0 - x_i}{h_n}\big)}{\sum_{i=1}^n k\big(\frac{x_0 - x_i}{h_n}\big)} = \frac{\frac{1}{nh_n} \sum_{i=1}^n y_i\, k\big(\frac{x_0 - x_i}{h_n}\big)}{\frac{1}{nh_n} \sum_{i=1}^n k\big(\frac{x_0 - x_i}{h_n}\big)} = \sum_{i=1}^n y_i\, w_i(x_0), \qquad \text{note: } \sum_{i=1}^n w_i(x_0) = 1.
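A minimal Python sketch of this Nadaraya-Watson estimator with a normal kernel (names and data are illustrative; the kernel's normalizing constant cancels in the ratio):

import numpy as np

def nadaraya_watson(x0, x, y, h):
    # g_hat(x0) = sum_i y_i * w_i(x0), with w_i proportional to k((x0 - x_i)/h)
    k = np.exp(-0.5 * ((x0 - np.asarray(x)) / h) ** 2)
    # if the density near x0 is low, k.sum() is close to 0 and the estimate is unstable
    w = k / k.sum()
    return np.dot(w, y)

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 300)
y = np.sin(x) + 0.1 * rng.standard_normal(300)
print(nadaraya_watson(0.5, x, y, h=0.2))   # should be near sin(0.5) = 0.479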
2.1.1 Problems with the kernel regression estimator

(i) When the density at x_0 is low, the denominator tends to be close to 0 (a problem common to all nonparametric methods).

(ii) At boundary points, the estimator has higher-order bias, even with an appropriately chosen k(\cdot). This problem is known as "boundary bias."

2.1.2 Interpretation of Local Polynomial Estimators as a local regression
Solve this problem at each point of evaluation x_0 (which may or may not correspond to a point in the data):

(\hat a, \hat b_1, \hat b_2, \dots) = \arg\min_{a, b_1, b_2, \dots} \sum_{i=1}^n \Big(y_i - a - b_1 (x_i - x_0) - b_2 (x_i - x_0)^2 - \dots\Big)^2 k\Big(\frac{x_0 - x_i}{h_n}\Big)

Choices: the kernel k, the bandwidth h_n, and the order of the polynomial. Note that \hat b_1 provides an estimator for g'(x_0) and \hat b_2 provides an estimator for \frac{g''(x_0)}{2}.

If the polynomial order is k = 0,

\hat a = \frac{\sum_{i=1}^n y_i\, k\big(\frac{x_0 - x_i}{h_n}\big)}{\sum_{i=1}^n k\big(\frac{x_0 - x_i}{h_n}\big)} = \sum_{i=1}^n y_i\, w_i(x_0), \qquad \sum_{i=1}^n w_i(x_0) = 1,

which is just the standard kernel estimator.
If k = 1 (local linear regression, LLR),

\hat a = \frac{\sum_{i=1}^n y_i k_i \sum_{j=1}^n k_j (x_j - x_0)^2 - \sum_{k=1}^n y_k k_k (x_k - x_0) \sum_{l=1}^n k_l (x_l - x_0)}{\sum_{i=1}^n k_i \sum_{j=1}^n k_j (x_j - x_0)^2 - \big\{\sum_{k=1}^n k_k (x_k - x_0)\big\}^2} = \sum_{i=1}^n y_i\, w_i(x_0),

where k_i = k\big(\frac{x_0 - x_i}{h_n}\big), and

\sum_{i=1}^n w_i(x_0) = 1, \qquad \sum_{i=1}^n w_i(x_0)(x_i - x_0) = 0.

The k = 0 weights (those of the standard Nadaraya-Watson kernel estimator) do not satisfy the second property.

The LLR estimator (k = 1) improves on the kernel estimator in two ways:
(a) its asymptotic bias does not depend on the design density of the data (i.e. f(x) does not appear in the bias expression);
(b) its bias has a higher order of convergence at boundary points.
Intuitively, the standard kernel estimator fits a local constant, so when there is no data on the other side of x_0, the estimate will tend to be too high or too low: it does not capture the slope. LLR gets the slope. A small sketch of LLR as a weighted least squares fit follows below.
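A minimal Python sketch of local linear regression as a weighted least squares problem (names and data are illustrative, and a normal kernel is assumed); \hat a estimates g(x_0) and \hat b estimates g'(x_0):

import numpy as np

def local_linear(x0, x, y, h):
    # minimize sum_i (y_i - a - b*(x_i - x0))^2 * k((x0 - x_i)/h) over (a, b)
    x = np.asarray(x, dtype=float)
    w = np.exp(-0.5 * ((x0 - x) / h) ** 2)           # kernel weights k_i
    X = np.column_stack([np.ones_like(x), x - x0])   # local design: [1, (x_i - x0)]
    WX = X * w[:, None]
    a_hat, b_hat = np.linalg.solve(X.T @ WX, WX.T @ y)
    return a_hat, b_hat

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 400)
y = x ** 2 + 0.05 * rng.standard_normal(400)
print(local_linear(0.9, x, y, h=0.1))   # a_hat close to 0.81, b_hat close to 1.8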
With y_i = g(x_i) + \varepsilon_i,

\hat g(x_0) - g(x_0) = \underbrace{\sum_{i=1}^n w_i(x_0)\, \varepsilon_i}_{\text{variance part}} + \underbrace{\sum_{i=1}^n w_i(x_0)\big(g(x_i) - g(x_0)\big)}_{\text{bias part}}

If g satisfies a Lipschitz condition, the last term can be bounded.

2.1.3 Properties of Nonparametric Regression Estimators
2. Asymptotic Normality. Let's work with the standard kernel regression estimator (simpler than LLR):

\hat g(x_0) = \frac{\frac{1}{nh_n} \sum_i y_i\, k\big(\frac{x_0 - x_i}{h_n}\big)}{\frac{1}{nh_n} \sum_i k\big(\frac{x_0 - x_i}{h_n}\big)}

Assume
(i) \int k(s)\, ds \neq 0 and \int s\, k(s)\, ds = 0,
(ii) \int k(s)\, s^2\, ds < C,
(iii) f(x) and g(x) are C^2 with bounded second derivatives.

Consistency. We already showed that the denominator satisfies

\frac{1}{nh_n} \sum_i k\Big(\frac{x_0 - x_i}{h_n}\Big) \xrightarrow{MS} f(x_0) \int k(s)\, ds. \qquad \text{(denominator)}

For the numerator,
E\Big(\frac{1}{nh_n} \sum_{i=1}^n y_i\, k\big(\tfrac{x_0 - x_i}{h_n}\big)\Big) = E\Big(E(y_i|x_i)\, \frac{1}{h_n} k\big(\tfrac{x_0 - x_i}{h_n}\big)\Big) = E\Big(g(x_i)\, \frac{1}{h_n} k\big(\tfrac{x_0 - x_i}{h_n}\big)\Big) = \int g(x_i)\, \frac{1}{h_n} k\Big(\frac{x_0 - x_i}{h_n}\Big) f(x_i)\, dx_i

Let h(x_i) = f(x_i)\, g(x_i) and Taylor expand (as before):

= h(x_0) \int k(s)\, ds + O_p(h_n)

Also,

\operatorname{VAR}\Big(\frac{1}{nh_n} \sum_{i=1}^n y_i\, k\big(\tfrac{x_0 - x_i}{h_n}\big)\Big) \le \frac{c}{nh_n} \to 0 \quad \text{as } nh_n \to \infty.

Then apply the Mann-Wald theorem (the plim of a continuous function is the continuous function of the plim) to conclude that the ratio converges to g(x_0).
Asymptotic Normality. Show that

\sqrt{nh_n}\big(\hat g(x_0) - g(x_0)\big) \xrightarrow{d} N\Big(\Big[\tfrac{1}{2} g''(x_0) + \tfrac{g'(x_0) f'(x_0)}{f(x_0)}\Big] \sqrt{nh_n}\, h_n^2 \int u^2 k(u)\, du,\ \ \frac{\sigma^2(x_0)}{f(x_0)} \int k^2(u)\, du\Big)

where \sigma^2(x_0) = E(\varepsilon^2 | x_i = x_0) is the conditional variance. Writing k_i = k\big(\frac{x_0 - x_i}{h_n}\big) and using y_i = g(x_i) + \varepsilon_i with E(\varepsilon_i | x_i) = 0,

\sqrt{nh_n}\big(\hat g(x_0) - g(x_0)\big) = \sqrt{nh_n}\, \frac{\sum_i (y_i - g(x_0))\, k_i}{\sum_i k_i} = \sqrt{nh_n}\Bigg(\frac{\frac{1}{nh_n} \sum_i \varepsilon_i k_i}{\frac{1}{nh_n} \sum_i k_i}\Bigg) + \sqrt{nh_n}\Bigg(\frac{\frac{1}{nh_n} \sum_i (g(x_i) - g(x_0))\, k_i}{\frac{1}{nh_n} \sum_i k_i}\Bigg)

We showed before that \frac{1}{nh_n} \sum_i k_i \xrightarrow{MS} f(x_0) + O(h_n).
For the variance of the first (stochastic) term, note that

\operatorname{VAR}\Big(\frac{1}{h_n}\, \varepsilon_i\, k\big(\tfrac{x_0 - x_i}{h_n}\big)\Big) = \frac{1}{h_n}\, E\Big(\varepsilon_i^2\, k^2\big(\tfrac{x_0 - x_i}{h_n}\big)\, \frac{1}{h_n}\Big) + \text{higher order terms}

and

E\Big(\varepsilon_i^2\, k^2\big(\tfrac{x_0 - x_i}{h_n}\big)\, \frac{1}{h_n}\Big) = E\Big(\underbrace{E(\varepsilon_i^2 | x_i)}_{=\sigma^2(x_i)}\, k^2\big(\tfrac{x_0 - x_i}{h_n}\big)\, \frac{1}{h_n}\Big) = \frac{1}{h_n} \int \sigma^2(x_i)\, k^2\Big(\frac{x_0 - x_i}{h_n}\Big) f(x_i)\, dx_i = \sigma^2(x_0)\, f(x_0) \int k^2(s)\, ds + O(h_n)

by change of variables.
For the second (bias) term, Taylor expanding g and f and changing variables gives

E\Big(\frac{1}{nh_n} \sum_i (g(x_i) - g(x_0))\, k_i\Big) = \underbrace{g'(x_0)\, f(x_0)\, h_n \int s\, k(s)\, ds}_{=0} + \Big[g'(x_0)\, f'(x_0) + \tfrac{1}{2} g''(x_0)\, f(x_0)\Big] h_n^2 \int s^2 k(s)\, ds + \text{higher order terms (of order } h_n^3 \int k(s)\, s^3\, ds)

so we get

\sqrt{nh_n}\, \frac{1}{nh_n} \sum_i (g(x_i) - g(x_0))\, k_i \approx \sqrt{nh_n}\, \Big[\tfrac{1}{2} g''(x_0)\, f(x_0) + g'(x_0)\, f'(x_0)\Big] h_n^2 \int k(s)\, s^2\, ds.

If we require \sqrt{nh_n}\, h_n^2 \to 0, i.e. n h_n^5 \to 0, this term is asymptotically negligible. Putting these results together, we get
\sqrt{nh_n}\big(\hat g(x_0) - g(x_0)\big) \xrightarrow{d} N\Big(\Big[\frac{g'(x_0) f'(x_0)}{f(x_0)} + \tfrac{1}{2} g''(x_0)\Big] \sqrt{nh_n}\, h_n^2 \int k(s)\, s^2\, ds,\ \ \frac{\sigma^2(x_0)}{f(x_0)} \int k(s)^2\, ds\Big)

(which is what we set out to show).
Remarks. Getting the asymptotic distribution requires plug-in estimators of all the ingredients of the bias and variance: g', f', f, g'', and \sigma^2(x_0).
- g': obtain by local linear or local quadratic regression.
- f': the derivative of the estimator for f gives an estimator of f',

\hat f'(x_0) = \frac{1}{n h_n^2} \sum_{i=1}^n k'\Big(\frac{x_0 - x_i}{h_n}\Big)

- \sigma^2(x_0) = E(\varepsilon^2 | x_0): the conditional variance.
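A minimal Python sketch of two of these plug-in ingredients (f and f') using a normal kernel; the function name and data are illustrative, and estimators for g'', \sigma^2(x_0), etc. would be obtained analogously from local polynomial fits:

import numpy as np

def density_and_derivative(x0, data, h):
    # f_hat(x0)  = (1/(n*h))   * sum_i k((x0 - x_i)/h)
    # f'_hat(x0) = (1/(n*h^2)) * sum_i k'((x0 - x_i)/h), with k'(s) = -s*k(s) for the normal kernel
    data = np.asarray(data)
    s = (x0 - data) / h
    k = np.exp(-0.5 * s ** 2) / np.sqrt(2 * np.pi)
    f_hat = k.sum() / (data.size * h)
    fprime_hat = (-s * k).sum() / (data.size * h ** 2)
    return f_hat, fprime_hat

rng = np.random.default_rng(0)
sample = rng.standard_normal(2000)
print(density_and_derivative(1.0, sample, h=0.3))   # roughly (0.24, -0.24) for the N(0,1) density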
We showed the distribution of the kernel regression estimator. How about the distribution of LLR? Fan (JASA, 1992) showed (in the notation above) that

\sqrt{nh_n}\big(\hat g_{LLR}(x_0) - g(x_0)\big) \xrightarrow{d} N\Big(\tfrac{1}{2} g''(x_0)\, \sqrt{nh_n}\, h_n^2 \int s^2 k(s)\, ds,\ \ \frac{\sigma^2(x_0)}{f(x_0)} \int k^2(s)\, ds\Big)
Remarks:
- The bias of LLR does not depend on f(x_0) (the design density of the data). Because of this feature, Fan refers to the estimator as being "design-adaptive."
- Fan showed that, in general, it is better to use an odd-order local polynomial: there is no cost in variance, and the bias of an odd-order fit does not depend on f(x).
- The LLR estimator does not suffer from the boundary bias problem.
[Figure 4: functions of different smoothness]

2.1.4 How might we optimally choose the bandwidth?
We could choose the bandwidth to minimize the pointwise asymptotic MSE (AMSE):

AMSE_{LLR} = \Big[h_n^2\, \tfrac{1}{2} g''(x_0) \int k(s)\, s^2\, ds\Big]^2 + \frac{\sigma^2(x_0)}{n h_n} \int k^2(s)\, ds

Minimizing over h_n gives an optimal bandwidth of the form

h_n = C(k)\, \Big[\frac{\sigma^2(x_0)}{g''(x_0)^2}\Big]^{1/5} n^{-1/5},

where C(k) depends only on the kernel.

Remarks:
- higher variance \sigma^2(x_0) \Rightarrow wider bandwidth
- more variability in the function (larger g''(x_0)) \Rightarrow smaller bandwidth
- more data \Rightarrow narrower bandwidth
- with kernel regression (i.e. a local polynomial of degree 0), the bias term also depends on f(x), and therefore the optimal bandwidth choice will also depend on f(x).
Difficulty: obtaining the optimal bandwidth requires estimates of \sigma^2(x_0) and g''(x_0) (and, in the case of the usual kernel regression estimator, of f(x_0)). This is the problem of how to choose a pilot bandwidth. One could assume normality in choosing a pilot for f(x_0) (this is Silverman's rule-of-thumb method). One could also use a fixed bandwidth or nearest neighbors for \sigma^2(x_0) and g''(x_0).

Alternatively, one could minimize AMISE = E \int \big(\hat g(x_0) - g(x_0)\big)^2 dx_0 and pick a global bandwidth.

For further information on choosing the bandwidth, see below and/or the book by Fan and Gijbels (1996).
3 Bandwidth Selection for Density Estimation

3.1 Plug-in Methods

3.1.1 Pointwise plug-in method

One could derive the optimal bandwidth at each point of evaluation x_0, which leads to a pointwise, localized bandwidth estimator:
h(x_0) = \arg\min_h \text{AMSE}\, \hat f(x_0), where

\text{AMSE}\, \hat f(x_0) = E\big(\hat f(x_0) - f(x_0)\big)^2 = \frac{h^4}{4}\, f''(x_0)^2 \Big(\int k(s)\, s^2\, ds\Big)^2 + \frac{1}{nh}\, f(x_0) \int k^2(s)\, ds

which gives

h(x_0) = \Bigg[\frac{f(x_0) \int k^2(s)\, ds}{f''(x_0)^2 \big(\int k(s)\, s^2\, ds\big)^2}\Bigg]^{1/5} n^{-1/5}

We need to estimate f(x_0) and f''(x_0) to get the optimal plug-in bandwidth.

3.1.2 Global plug-in method
Because the above method is computationally intensive, one could instead derive a single optimal bandwidth for all points of evaluation by minimizing the mean integrated squared error,

\text{MISE}\, \hat f = E \int \big(\hat f(x_0) - f(x_0)\big)^2 dx_0,

which removes x_0 by integration. Minimizing the asymptotic MISE gives

h_{MISE} = \Big(\frac{A}{B}\Big)^{1/5} n^{-1/5}, \qquad A = \int k^2(s)\, ds, \quad B = \Big(\int k(s)\, s^2\, ds\Big)^2 \int f''(x)^2\, dx.

Here we still need an estimate of f''(x), but we no longer require an estimate of f(x_0) (it is eliminated by integrating).
3.2 Rule-of-Thumb (Silverman) Method

This approach makes a normality assumption for the purpose of selecting "pilot" bandwidths. That is, assume x is Normal; then (for the normal kernel)

h_{MISE} \approx 1.06\, \text{SD}(x)\, n^{-1/5}

where SD denotes the standard deviation. This method is often applied to nonnormal data. It has a tendency to oversmooth, particularly when the data density is multimodal. Sometimes the standard deviation in the formula is replaced by (a multiple of) the 75%-25% interquartile range. This is an older method, still available in some software packages, but its performance is not very good; other methods are better.
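A minimal Python sketch of this rule (function name illustrative); the optional robust-scale variant uses the interquartile range mentioned above, divided by 1.349 so that it matches the standard deviation for normal data:

import numpy as np

def silverman_bandwidth(data, use_iqr=False):
    # h ~ 1.06 * scale * n^(-1/5), the normal-reference rule for a normal kernel
    data = np.asarray(data)
    if use_iqr:
        q75, q25 = np.percentile(data, [75, 25])
        scale = (q75 - q25) / 1.349
    else:
        scale = data.std(ddof=1)
    return 1.06 * scale * data.size ** (-1.0 / 5.0)

rng = np.random.default_rng(0)
print(silverman_bandwidth(rng.standard_normal(1000)))   # about 1.06 * 1 * 1000^(-0.2) = 0.27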
3.3 Least-Squares Cross-Validation
The idea is to choose h to minimize the integrated squared error (ISE):

\min_h \text{ISE} = \int \big(\hat f(x_0) - f(x_0)\big)^2 dx_0 = \underbrace{\int \hat f(x_0)^2\, dx_0}_{\text{term } \#1} - 2 \underbrace{\int \hat f(x_0)\, f(x_0)\, dx_0}_{\text{term } \#2} + \int f(x_0)^2\, dx_0,

where the last term does not depend on h.

\text{Term } \#1 = \int \hat f(x_0)^2\, dx_0 = \frac{1}{n^2 h_n^2} \sum_i \sum_j \int k\Big(\frac{x_i - x_0}{h_n}\Big)\, k\Big(\frac{x_j - x_0}{h_n}\Big)\, dx_0

Let s = \frac{x_0 - x_j}{h_n}, i.e. x_0 = x_j + s h_n, so that

\text{Term } \#1 = \frac{1}{n^2 h_n} \sum_i \sum_j \int k\Big(\frac{x_i - x_j}{h_n} - s\Big)\, k(s)\, ds.

Note (convolution): for independent random variables a and s with densities k_a and k_s, the density of y = a + s is

f_y(y_0) = \int k_a(y_0 - s)\, k_s(s)\, ds,

since \Pr(y \le y_0) = \Pr(a + s \le y_0) = \Pr(a \le y_0 - s) = \int F_a(y_0 - s)\, k_s(s)\, ds. So the inner integral in Term \#1 is the convolution \bar k = k * k evaluated at \frac{x_i - x_j}{h_n}.

\text{Term } \#2 = \int \hat f(x_0)\, f(x_0)\, dx_0 = \frac{1}{nh} \sum_i \int k\Big(\frac{x_i - x_0}{h}\Big) f(x_0)\, dx_0.
Now consider

E\Bigg(\frac{1}{n(n-1)h} \sum_i \sum_{j \neq i} k\Big(\frac{x_i - x_j}{h}\Big)\Bigg) = \int\!\!\int \frac{1}{h}\, k\Big(\frac{y - x_0}{h}\Big) f(y)\, f(x_0)\, dx_0\, dy.

Note that we need x_i and x_j to be independent; therefore we need to use the leave-one-out form (j \neq i). Hence an unbiased estimator of term (2), 2 \int \hat f(x_0)\, f(x_0)\, dx_0, is

\frac{2}{n(n-1)h} \sum_i \sum_{j \neq i} k\Big(\frac{x_i - x_j}{h}\Big).
Putting the two pieces together, the least-squares cross-validation bandwidth is

h_{CV} = \arg\min_h\ \underbrace{\frac{1}{n^2 h} \sum_i \sum_j \int k\Big(\frac{x_i - x_j}{h} - s\Big)\, k(s)\, ds}_{\text{estimates } \int \hat f^2}\ -\ \underbrace{\frac{2}{n(n-1)h} \sum_i \sum_{j \neq i} k\Big(\frac{x_i - x_j}{h}\Big)}_{\text{estimates } 2 \int \hat f f}.
Remarks:
- It turns out that h_{CV} can only be estimated at rate n^{-1/10}, which is very slow (a result due to Stone). However, in Monte Carlo studies LSCV often does better than the rule-of-thumb method.
- Marron (1990) found a tendency for local minima of the criterion near small values of h, which leads to a tendency to undersmooth.
- LSCV is usually implemented by searching over a grid of bandwidth values and choosing the one that minimizes the cross-validation function, as in the sketch below.
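A minimal Python sketch of grid-search LSCV for a normal-kernel density estimate (names, data, and the grid are illustrative); for the normal kernel the convolution \bar k = k * k is the N(0, 2) density:

import numpy as np

def lscv_score(h, data):
    # (1/(n^2 h)) sum_{i,j} kbar((xi - xj)/h) - (2/(n(n-1)h)) sum_{i != j} k((xi - xj)/h)
    data = np.asarray(data)
    n = data.size
    d = (data[:, None] - data[None, :]) / h              # all pairwise (x_i - x_j)/h
    k = np.exp(-0.5 * d ** 2) / np.sqrt(2 * np.pi)       # normal kernel
    kbar = np.exp(-0.25 * d ** 2) / np.sqrt(4 * np.pi)   # k * k = N(0, 2) density
    off_diag = k.sum() - np.trace(k)                     # leave-one-out: drop the i = j terms
    return kbar.sum() / (n ** 2 * h) - 2.0 * off_diag / (n * (n - 1) * h)

rng = np.random.default_rng(0)
sample = rng.standard_normal(200)
grid = np.linspace(0.05, 1.0, 40)
h_cv = grid[np.argmin([lscv_score(h, sample) for h in grid])]
print(h_cv)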
3.4 Biased Cross-Validation
Recall that

\text{AMISE} = \frac{1}{nh} \int k^2(s)\, ds + \frac{h^4}{4} \Big(\int s^2 k(s)\, ds\Big)^2 \int f''(x)^2\, dx + o\Big(\frac{1}{nh} + h^4\Big)

What if we plug in \hat f''(x)? Note:

\hat f(x) = \frac{1}{nh} \sum_i k\Big(\frac{x - x_i}{h}\Big), \qquad \hat f'(x) = \frac{1}{nh^2} \sum_i k'\Big(\frac{x - x_i}{h}\Big), \qquad \hat f''(x) = \frac{1}{nh^3} \sum_i k''\Big(\frac{x - x_i}{h}\Big),

\hat f''(x)^2 = \frac{1}{n^2 h^6} \sum_{i=1}^n \sum_{j=1}^n k''\Big(\frac{x - x_i}{h}\Big)\, k''\Big(\frac{x - x_j}{h}\Big).
The plug-in \int \hat f''(x)^2\, dx is a biased estimator of \int f''(x)^2\, dx:

E \int \hat f''(x)^2\, dx = E \int \frac{1}{n^2 h^6} \sum_i \sum_j k''\Big(\frac{x - x_i}{h}\Big)\, k''\Big(\frac{x - x_j}{h}\Big)\, dx

and the i = j terms contribute \frac{1}{nh^6}\, E \int k''\big(\frac{x - x_i}{h}\big)^2 dx. One can show that

E \int \hat f''(x)^2\, dx = \int f''(x)^2\, dx + \frac{1}{nh^5} \int k''(s)^2\, ds + \text{higher order}.

Biased cross-validation subtracts this bias term:

h_{BCV} = \arg\min_h\ \frac{1}{nh} \int k^2(s)\, ds + \frac{h^4}{4} \Big(\int s^2 k(s)\, ds\Big)^2 \Big[\int \hat f''(x)^2\, dx - \frac{1}{nh^5} \int k''(s)^2\, ds\Big].
3.5 Sheather-Jones Plug-In Method

Here h_{SJPI} \propto n^{-1/5}. The idea is to plug in an estimate of f''(x), where the pilot bandwidth used to estimate f'' is g = g(h). The bandwidth that is optimal for estimating f(x) is not the same as the optimal bandwidth for estimating f''. Therefore, find the analogue of the AMSE for R(f'') = \int f''(x)^2\, dx; this yields an expression whose constants C_1(k) and C_2(k) are functions of the kernel. Now solve for g as a function of the optimal bandwidth h:

g(h) = C_3(k)\, \Big\{\frac{R(f'')}{R(f''')}\Big\}^{1/7} h^{5/7}

Now we need an estimate of f'''. At this point, Sheather and Jones suggest using a rule-of-thumb method based on a normal density assumption.

What is the main advantage of the SJPI method? For most methods the relative error \frac{\hat h - h_{MISE}}{h_{MISE}} converges very slowly (e.g. at rate n^{-1/10} for LSCV, as noted above); for SJPI the rate is n^{-5/14}, which is close to n^{-1/2}.
Earlier we found that the distribution theory for the local linear regression estimator depends on the bandwidth; we now turn to bandwidth selection for nonparametric regression.

4 Bandwidth Selection for Nonparametric Regression
4.1 Plug-in Estimator
Choose the bandwidth to minimize the (data-dependent) pointwise MSE:

\min_{h_n}\ \underbrace{\Big[\tfrac{1}{2} g''(x_0)\, h_n^2 \int k(s)\, s^2\, ds\Big]^2}_{\text{Bias}^2} + \underbrace{\frac{\sigma^2(x_0)}{n h_n} \int k^2(s)\, ds}_{\text{Variance} = C_1 n^{-1} h_n^{-1}}

which gives

h_{opt} = \Bigg[\frac{\sigma^2(x_0) \int k^2(s)\, ds}{g''(x_0)^2 \big(\int k(s)\, s^2\, ds\big)^2}\Bigg]^{1/5} n^{-1/5}

Remarks:
- g'' high \Rightarrow smaller bandwidth: we want a smaller bandwidth in regions where the function is highly variable.
- \sigma^2(x_0) high \Rightarrow use a larger bandwidth.
4.2 Global Bandwidth

Alternatively, minimize a global criterion such as the mean integrated squared error (MISE); see Fan & Gijbels (1996).

4.3 Local Polynomial Regression

For a local polynomial of degree p, the MISE-optimal bandwidth is of order

h_{MISE} \propto n^{-\frac{1}{2p+3}},

with a constant that depends on \operatorname{Var}(y|x) and on higher derivatives of the regression function. Apply the same plug-in idea: estimate m''(x) and then take an average. We still need to evaluate \int \sigma^2(x)\, dx; one blocking method is to (1) divide the range of x into N blocks and (2) estimate y = m(x) + \varepsilon by a Q-degree polynomial within each block, forming \hat\sigma^2(N) from the residuals.
i=1
n P
xi x0 ) hn
y1 ... w = ... y= N N yN 0 k xNhx0 n 1 (x1 x0 ) N Z ... x= 1 (xN x0 ) gLLR (x0 ) = [1 0](x0 wx)1 (x0 wy) coecient from a weighted least squares problem. weights depend on x0, h Let H(h) = [1 0](x0 wx)1 x0 w
12 22 2N NN
1N
so that \hat g_{LLR}(x_0) = H(h)\, y. The cross-validation bandwidth selector chooses h to minimize (an estimate of) the conditional MSE, E\big((\hat g_{LLR}(x_0) - g(x_0))^2 \mid x\big), at each point of evaluation. Stacking the fitted values at the sample points as H(h) y (so that H(h) is now N \times N),

E\big((H(h) y - g)'(H(h) y - g) \mid x\big) = E\big(((H(h) - I) g + H(h) \varepsilon)'((H(h) - I) g + H(h) \varepsilon) \mid x\big)
= g'(H(h) - I)'(H(h) - I) g + E(\varepsilon' H(h)' H(h) \varepsilon \mid x)
= g'(H(h) - I)'(H(h) - I) g + \sigma^2\, \operatorname{tr}\big(H(h)' H(h)\big).

Now consider what the in-sample residual sum of squares estimates:

(y - H(h) y)'(y - H(h) y) = (g + \varepsilon)'(I - H(h))'(I - H(h))(g + \varepsilon),

and taking conditional expectations term by term,

E\big(g'(I - H(h))'(I - H(h)) g \mid x\big) = g'(I - H(h))'(I - H(h)) g,
E\big(g'(I - H(h))'(I - H(h)) \varepsilon \mid x\big) = 0,
E\big(\varepsilon'(I - H(h))'(I - H(h)) \varepsilon \mid x\big) = \sigma^2 \operatorname{tr}\big[(I - H(h))'(I - H(h))\big] = \sigma^2\big[\operatorname{tr} I - 2 \operatorname{tr}(H(h)) + \operatorname{tr}(H(h)' H(h))\big].

Putting these together, the expected residual sum of squares equals the conditional MSE criterion plus \sigma^2 \operatorname{tr} I - 2 \sigma^2 \operatorname{tr}(H(h)). The first extra term does not depend on h, but the second does, which is why cross-validation uses leave-one-out residuals rather than in-sample residuals.
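A minimal Python sketch tying this together (names and data are illustrative, and a normal kernel is assumed): it builds the row of the smoother matrix H(h) at a point and evaluates a leave-one-out cross-validation score using the standard shortcut for linear smoothers, (y_i - \hat g(x_i)) / (1 - H_{ii}):

import numpy as np

def llr_hat_row(x0, x, h):
    # H(h) = [1 0] (x' w x)^(-1) x' w evaluated at x0; returns the 1 x n row
    x = np.asarray(x, dtype=float)
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    X = np.column_stack([np.ones_like(x), x - x0])
    XtW = (X * w[:, None]).T
    return np.linalg.solve(XtW @ X, XtW)[0]

def loo_cv(h, x, y):
    # sum of squared leave-one-out residuals for the local linear smoother
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    score = 0.0
    for i in range(x.size):
        row = llr_hat_row(x[i], x, h)
        score += ((y[i] - row @ y) / (1.0 - row[i])) ** 2
    return score

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = np.sin(4 * x) + 0.2 * rng.standard_normal(200)
grid = np.linspace(0.02, 0.3, 15)
print(grid[np.argmin([loo_cv(h, x, y) for h in grid])])   # CV-chosen bandwidth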