M. Gerolimetto
Dip. di Statistica
Università Ca' Foscari Venezia
margherita.gerolimetto@unive.it
www.dst.unive.it/margherita
Definition
Local regression is an approach to fitting curves
and surfaces to data by smoothing. It is called
LOCAL since the fit at a generic point x0 is the
value of a parametric function fitted only to those
observations that are close to x0.
In this sense it can be thought of as a natural extension of parametric fitting. So far we have considered models like:

$$y_i = \alpha + \beta x_i + \varepsilon_i, \qquad i = 1, \dots, N$$

that is, models in which the regression function m is linear.
When we assume that m(x) is an element of a
specific parametric class of functions (for example
linear) we are forcing the relationship to have a
certain shape.
Parametric localization
The underlying model for local regression is:

$$y_i = m(x_i) + u_i, \qquad i = 1, \dots, N$$

The estimate of m that comes from the above definition is obtained with the following steps:
1. fix a point x0 and define a neighborhood of x0, determined by the weights wi and the bandwidth h;
2. within this neighborhood, assume that m is approximated by some member of the chosen parametric family;
3. estimate the parameters of this local approximation by locally weighted least squares, i.e. by minimizing

$$\sum_{i=1}^{N} w_i \left( y_i - m_{\theta}(x_i) \right)^2$$

where $\theta$ collects the parameters of the local approximation, and take the resulting fitted value at x0 as $\hat{m}(x_0)$.
Once the weights wi and the bandwidth h have been chosen, one is typically not interested in calculating $\hat{m}$ only at a single point x0, but on a set of values (usually uniformly spaced along the interval between x1 and xN).
Practically, one creates a grid between x1 and xN consisting of M points (uniformly spaced) and then computes the minimization over all points of the grid.
This corresponds to solving M locally weighted least squares problems, one for each of the M points of the grid, which in turn becomes the center of the neighborhood.
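This procedure is easy to sketch in code. The following is a minimal illustration, not taken from the slides, assuming a Gaussian kernel for the weights wi and a local polynomial of degree p; the function name local_fit and the simulated data are purely illustrative.

```python
import numpy as np

def local_fit(x0, x, y, h, p=1):
    """Locally weighted least squares fit of a degree-p polynomial at x0."""
    # Kernel weights: observations close to x0 get larger weight
    # (Gaussian kernel assumed; the slides leave the kernel generic).
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    # Design matrix of the local polynomial, centered at x0
    X = np.vander(x - x0, N=p + 1, increasing=True)
    # Weighted least squares: minimize sum_i w_i (y_i - X_i beta)^2
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    # The fitted value at x0 is the intercept of the centered polynomial
    return beta[0]

# Evaluate the estimate on a uniform grid between x_1 and x_N
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
grid = np.linspace(x.min(), x.max(), 101)
m_hat = np.array([local_fit(g, x, y, h=0.8, p=1) for g in grid])
```

Each evaluation of local_fit solves one locally weighted least squares problem, so the loop over the grid corresponds exactly to the M problems described above.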
Trade-off... again!
Modeling m nonparametrically requires a trade-off between bias and variance, starting from the choice of the bandwidth (but not only!).
In some applications there is a strong preference for rougher estimates (smaller bias); in others there is a preference for smoother estimates (larger bias).
Using model selection criteria, like cross-validation, has the advantage of an automatic choice (less subjectivity), but at the same time the disadvantage of possibly giving a poor answer in a particular application.
Using graphical criteria, the advantage is great power, but the disadvantage is that they are labor-intensive. They are good for picking a small number of parameters, but in the case of adaptive fitting the process becomes extremely long.
Polynomial degree
The choice of the polynomial degree is also a bias-variance trade-off: a higher degree will produce a less biased, but more variable, estimate.
In case the degree is 0, the local regression estimate is:

$$\hat{m}(x) = \frac{\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) y_i}{\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)}$$
This choice p = 0 is quite well known in the nonparametric literature (it is called local constant regression), because it is the one for which the asymptotic theory has been derived. However, this case is, at the same time, the one that in practice has less frequently shown good performance.
The problem with local constant regression is that it cannot reproduce a line even in the very special case of equally spaced data away from boundaries.
Reducing the lack of fit to a tolerable level requires very small bandwidths, which result in a very rough estimate.
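To make the p = 0 case concrete, here is a minimal sketch (not from the slides) of the local constant estimator, again assuming a Gaussian kernel; it is simply a kernel-weighted average of the yi.

```python
import numpy as np

def nadaraya_watson(x0, x, y, h):
    """Local constant (p = 0) estimate: a kernel-weighted average of the y_i."""
    k = np.exp(-0.5 * ((x0 - x) / h) ** 2)   # kernel weights K((x0 - x_i)/h)
    return np.sum(k * y) / np.sum(k)
```

With the simulated data of the earlier sketch, nadaraya_watson(g, x, y, h=0.8) coincides with local_fit(g, x, y, h=0.8, p=0), since with p = 0 the locally weighted fit reduces to a weighted mean.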
In case the degree is 1 (local linear regression), the estimate can be written as:

$$\hat{m}(x) = \frac{\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) y_i}{\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)} + (x - \bar{X}_w)\, \frac{\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) (x_i - \bar{X}_w)\, y_i}{\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) (x_i - \bar{X}_w)^2}$$

where

$$\bar{X}_w = \frac{\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) x_i}{\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)}$$
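The formula above translates directly into code. A minimal sketch (not from the slides, Gaussian kernel assumed; the name local_linear is illustrative):

```python
import numpy as np

def local_linear(x0, x, y, h):
    """Local linear (p = 1) estimate written via the weighted mean X_w bar."""
    k = np.exp(-0.5 * ((x0 - x) / h) ** 2)              # kernel weights
    xw = np.sum(k * x) / np.sum(k)                      # weighted mean of the x_i
    level = np.sum(k * y) / np.sum(k)                   # local constant part
    slope = np.sum(k * (x - xw) * y) / np.sum(k * (x - xw) ** 2)
    return level + (x0 - xw) * slope
```

The first term is the local constant estimate; the second term adds the locally fitted slope, which is what allows the estimator to reproduce a straight line exactly.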
Notable cases
BIAS
Similarly to kernel density estimators, the kernel regression estimator is biased, with bias of order O(h2):

$$b(x_0) = h^2 \left( \frac{1}{2}\, m''(x_0) + m'(x_0)\, \frac{f'(x_0)}{f(x_0)} \right) \int z^2 k(z)\, dz$$
LIMIT DISTRIBUTION
The kernel regression estimator has a limit distribution which is normal:

$$\sqrt{N h}\,\left(\hat{m}(x_0) - m(x_0) - b(x_0)\right) \;\xrightarrow{d}\; N\!\left(0,\; \frac{\sigma^2}{f(x_0)} \int k(z)^2\, dz\right)$$
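The two kernel constants appearing above are $\int z^2 k(z)\,dz$ (in the bias) and $\int k(z)^2\,dz$ (in the variance). As a quick numerical check, not part of the slides and assuming the standard Gaussian kernel, for which the first constant equals 1 and the second equals $1/(2\sqrt{\pi}) \approx 0.2821$:

```python
import numpy as np

z = np.linspace(-8, 8, 20001)
dz = z[1] - z[0]
k = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)   # Gaussian kernel density
second_moment = np.sum(z**2 * k) * dz          # ~ 1.0      = int z^2 k(z) dz
roughness = np.sum(k**2) * dz                  # ~ 0.2821   = int k(z)^2 dz
```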
BANDWIDTH
The choice of the bandwidth is once more connected to the bias-variance trade-off.
As in kernel density estimation, the bandwidth can be determined using different methods; we will see them in the next slides.
The mean squared error of the estimator at a point x0 is:

$$MSE[\hat{m}(x_0)] = E\left[\left(\hat{m}(x_0) - m(x_0)\right)^2\right]$$

and, integrating over x0, the mean integrated squared error is:

$$MISE(h) = \int MSE[\hat{m}(x_0)]\, f(x_0)\, dx_0$$

Cross-validation
Cross-validation chooses h by minimizing the leave-one-out criterion:

$$CV(h) = \sum_{i=1}^{N} \left(y_i - \hat{m}_{-i}(x_i)\right)^2$$

where $\hat{m}_{-i}$ is the estimate computed leaving out the i-th observation.
The optimality properties derive from the asymptotic equivalence between minimizing CV(h) and minimizing MISE(h) or ISE(h), recalling that, similarly to what was presented in the previous section:

$$ISE(h) = \int \left(\hat{m}(x_0) - m(x_0)\right)^2 f(x_0)\, dx_0$$
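A minimal sketch of the leave-one-out criterion, reusing the nadaraya_watson function and the simulated data from the earlier sketches (the candidate grid of bandwidths is illustrative):

```python
import numpy as np

def cv_score(h, x, y, fit=nadaraya_watson):
    """Leave-one-out cross-validation criterion for bandwidth h."""
    n = len(x)
    errors = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i                       # drop the i-th observation
        errors[i] = y[i] - fit(x[i], x[mask], y[mask], h)
    # Dividing by n does not change the minimizing h
    return np.mean(errors ** 2)

# Pick the bandwidth minimizing CV(h) over a grid of candidate values
candidates = np.linspace(0.2, 2.0, 19)
h_cv = candidates[np.argmin([cv_score(h, x, y) for h in candidates])]
```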
Plug-in
Usually it is not used in the kernel regression context; cross-validation is preferred.
LOESS estimator
The LOESS estimator is a local regression estimator where:
1. the weight function is the tricube weight function, $w(u) = (1 - |u|^3)^3$ for $|u| < 1$ and 0 otherwise;
2. the local polynomials have low degree p, usually 1 or 2;
3. the neighborhood of each point contains a fixed fraction q of the data, where q is the smoothing parameter.
About the third characteristic: usually the smoothing parameter q is a number between (p + 1)/N and 1, with p denoting the degree of the local polynomial.
Large values of q produce smooth functions that do not react much to fluctuations in the data. Smaller values of q make the regression function follow the data more closely.
Note, however, that using too small a value of the smoothing parameter is not desirable, since the regression function will eventually start to capture the random error in the data (too rough!). Good values of the smoothing parameter typically lie in the range 0.25 to 0.5 for most LOESS applications.
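For a practical illustration (not from the slides), the statsmodels package provides a LOESS routine; to my knowledge it uses tricube weights and local polynomials of degree 1, and its frac argument plays the role of the smoothing parameter q. The data below are simulated for illustration only.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 300)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# frac is the fraction q of the data entering each local, tricube-weighted fit
fitted = lowess(y, x, frac=0.35)     # returns an array with columns (sorted x, fitted value)
x_s, m_hat = fitted[:, 0], fitted[:, 1]
```

Trying a few values of frac in the 0.25 to 0.5 range and inspecting the fitted curve is a simple way to see the smoothness trade-off described above.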