
Advanced data analysis

M.Gerolimetto
Dip. di Statistica
Università Ca' Foscari Venezia,
margherita.gerolimetto@unive.it
www.dst.unive.it/margherita

PART 4: LOCAL REGRESSION

Definition
Local regression is an approach to fitting curves
and surfaces to data by smoothing. It is called
LOCAL since the fit at a generic point x0 is the
value of a parametric function fitted only to those
observations that are close to x0.
In this sense it can be thought of as a natural extension of parametric fitting. So far we have considered models like

y_i = \alpha + \beta x_i + \varepsilon_i,   i = 1, \dots, N

that can be seen as

y_i = m(x_i) + \varepsilon_i,   i = 1, \dots, N

where m is linear.
When we assume that m(x) is an element of a
specific parametric class of functions (for example
linear) we are forcing the relationship to have a
certain shape.

However, it is possible that these models cannot be applied because of nonlinearity (especially of unknown form) in the data.
In this sense nonparametric modelling is a good response, because it is like placing a flexible curve on the (x, y) scatterplot with no parametric restrictions on the form of the curve.
Moreover, nonparametric methods can help to see in the scatterplot the underlying structures of the data (smoothing).

Parametric localization
The underlying model for local regression is:

y_i = m(x_i) + u_i,   i = 1, \dots, N

The distributions of the yis are unknown. The means m(xi) are unknown.
In practice we must model the data, which means making certain assumptions on m and on other aspects of the distribution of the yis.
One common assumption is that the yis are homoskedastic.
As for m, it is supposed that the function can be locally approximated by a member of a parametric class, usually chosen to be a polynomial of a certain degree.
This is the parametric localization: in carrying out the local regression we use a parametric family as in global parametric fitting, but we ask only that the family fit locally and not globally.

Suppose x0 is a generic point in the support of the x variable. Suppose we do not know the function m(x), but we can assume it is differentiable.
To estimate m(x) at x0, we can think of using the Taylor expansion

m(x) = m(x_0) + m'(x_0)(x - x_0) + r

where r is a quantity of order smaller than (x - x_0).
Any function (under certain regularity conditions) can be locally approximated by a line.
It is possible to estimate m(x) in a neighborhood of x0 by minimizing the squared errors over the pairs (xi, yi), i = 1, ..., N:

\min_{\alpha, \beta} \sum_{i=1}^{N} \{ y_i - \alpha - \beta (x_i - x_0) \}^2 \, w_i

The weights wi in the previous formula are often chosen so that they are bigger when (xi - x0) is smaller. This means that the closer xi is to the point x0, the bigger the weight.
This minimization can be thought of as a local view around x0: we can think of weighted least squares.
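As a small sketch of this idea in Python (assuming a Gaussian weight function and a fixed bandwidth h, both of which are only introduced formally later; the function name and the simulated data are illustrative, not part of the notes), the locally weighted least squares fit at a single point x0 is:

```python
import numpy as np

def local_linear_fit(x0, x, y, h):
    """Locally weighted least squares fit of y on (x - x0), evaluated at x0.

    Minimizes sum_i w_i * (y_i - a - b*(x_i - x0))^2 with Gaussian weights
    w_i = exp(-0.5 * ((x_i - x0)/h)^2); the local fit at x0 is the intercept a.
    """
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)          # bigger weight when x_i is close to x0
    X = np.column_stack([np.ones_like(x), x - x0])  # local design matrix [1, x_i - x0]
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # weighted least squares solution
    return beta[0]                                    # intercept = fitted value at x0

# Illustrative data
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
print(local_linear_fit(x0=5.0, x=x, y=y, h=0.8))
```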
BIG ISSUES:

1. How can the weights be chosen?

2. How large should the neighborhood be?

The estimation of m that comes from the above definition is obtained with the following steps:

1. for each fitting point x0 define a neighborhood based on some metric in the space of the x variable;

2. within this neighborhood assume that m is approximated by some member of the chosen parametric family;

3. estimate the parameters from the observations in the neighborhood; the local fit at x0 is the fitted function evaluated at x0.

Very often a weight function w(u) is incorporated that gives greater weight to the xis that are closer to x0 and smaller weight to the xis that are further from x0.
The estimation method used depends on the assumptions made on the yis. If the yis are assumed to be Gaussian with constant variance, then it makes sense to base estimation on least squares.

Once wi and h have been chosen, one is not interested in calculating m only at a single point x0, but typically over a set of values (usually uniformly spaced along the interval between x1 and xN).
Practically, one creates a grid between x1 and xN consisting of M points (uniformly spaced) and then computes the minimization over all points of the grid.
This corresponds to M locally weighted least squares fits, one for each of the M grid points, each of which becomes the center of a neighborhood.
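Continuing the earlier sketch (the grid size and the reuse of the illustrative local_linear_fit helper are my assumptions, not part of the notes), the whole curve is obtained by repeating the weighted fit over a uniform grid:

```python
# Evaluate the local fit over a uniform grid between x_1 and x_N
M = 100                                   # number of uniformly spaced grid points
grid = np.linspace(x.min(), x.max(), M)
m_hat = np.array([local_linear_fit(x0, x, y, h=0.8) for x0 in grid])
# m_hat[j] is the locally weighted least squares fit centered at grid[j]
```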

Modeling the data

When using local regression the following are the choices to be made:

1. Assumptions about the behaviour of m
   - Weight function
   - Bandwidth
   - Parametric family

2. Assumptions about the yis
   - Fitting criterion

Unlike in parametric fitting, we do not rely on a priori knowledge.
To make the choices listed above we use either i) the data, with graphical analysis, or ii) some automatic methods to carry out model selection.

Trade-off... again!
Modeling m nonparametrically requires a trade-off between bias and variance, starting from the choice of the bandwidth (but not only!).
In some applications there is a strong preference toward rough estimates (smaller bias); in some others there is a preference toward smoother estimates (bigger bias).
Using model selection criteria, like cross-validation, has the advantage of an automatic choice (less subjectivity), but at the same time the disadvantage of possibly giving a poor answer in a particular application.
Using graphical criteria, the advantage is great power, but the disadvantage is that they are labor-intensive. They are good for picking a small number of parameters, but in the case of adaptive fitting the procedure becomes extremely long.

Selecting the weight function

Supposing that m is continuous, we will use weight functions that are peaked around 0 and decay smoothly as the distance from x0 (let us call the distance u) increases.
A smooth weight function results in a smoother estimate than, for example, a rectangular weight function.
A natural choice is to use Gaussian kernels. The tricube kernel is also often used, because of the computational speed of a weight function that beyond a certain point (but smoothly) gives exactly zero weight, compared to one that only approaches zero as u gets larger:

w(u) = \begin{cases} (1 - |u|^3)^3 & |u| < 1 \\ 0 & |u| \geq 1 \end{cases}
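A direct transcription of this weight function (as a hypothetical helper, not code from the notes):

```python
import numpy as np

def tricube(u):
    """Tricube weight: (1 - |u|^3)^3 for |u| < 1, and exactly 0 otherwise."""
    u = np.abs(np.asarray(u, dtype=float))
    return np.where(u < 1, (1 - u ** 3) ** 3, 0.0)
```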

In case a Gaussian kernel is used, local regression takes the name of kernel regression. In case a tricube kernel is used (plus a nearest neighbours bandwidth), local regression takes the name of LOESS estimator, as we will see later on.

Selecting the fitting criterion

Virtually any global fitting procedure can be localized, so local regression can work with the same range of distributions as global parametric fitting.
The simplest case is that of Gaussian yis: least squares approaches can be used. An objection to least squares is that those estimators are not robust to heavy-tailed residual distributions. Under these circumstances, ad hoc robustified fitting procedures are available (LOWESS).
In case other distributions are hypothesized for the yis, the locally weighted likelihood can be used. For example, in the case of binary data the nonparametric estimate is obtained by local likelihood.
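As a rough illustration of local likelihood for binary data (a sketch under my own assumptions: Gaussian kernel weights, a local linear predictor, and a plain Newton-Raphson solver; none of this comes from the notes), each local fit is a kernel-weighted logistic regression, and the fitted probability at x0 is read off from the local intercept:

```python
import numpy as np

def local_logit_fit(x0, x, y, h, n_iter=25):
    """Locally weighted logistic fit: logit P(y=1|x) ~ a + b*(x - x0) near x0.

    Maximizes the kernel-weighted log-likelihood by Newton-Raphson and
    returns the estimated P(y = 1 | x = x0); y must be coded 0/1.
    """
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)          # kernel weights
    X = np.column_stack([np.ones_like(x), x - x0])  # local design [1, x_i - x0]
    beta = np.zeros(2)
    for _ in range(n_iter):                         # no step control: sketch only
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (w * (y - p))                  # weighted score
        hess = X.T @ (X * (w * p * (1 - p))[:, None])  # weighted information matrix
        beta += np.linalg.solve(hess, grad)
    return 1.0 / (1.0 + np.exp(-beta[0]))           # at x = x0 only the intercept remains
```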

Selecting the bandwidth and local family

These issues will be discussed together, since they are strongly connected.
Both the choice of the bandwidth parameter and that of the parametric family are related to the goal of producing an estimate that is as smooth as possible without distorting the underlying pattern of dependence of the response on the independent variables.
As for kernel estimates of density functions, a balance between bias and variance must be found.
As for bandwidth selection, fixed and nearest neighbors bandwidths will be considered. As for the parametric family, the choice will be made among polynomial forms with degree ranging from 0 to 3.

Nearest neighbor bandwidths vs fixed bandwidth

The problem with a fixed bandwidth is that it provokes strong swings in variance in case of large changes in the density of the data.
The boundary issue plays a major role in the bandwidth choice. The issue is that using the same bandwidth at the boundary (where observations can be sparser) as in the interior can produce estimates with large variability. Think of Gaussian data!
The variable bandwidth (such as nearest neighbors) appears to perform better overall in applications because of this variance issue.
Of course nearest neighbors can fail for some specific examples, but the remedy is not the fixed bandwidth; rather, it is adaptive methods.

Polynomial degree
The choice of the polynomial degree is also a bias-variance trade-off: a higher degree will produce a less biased, but more variable, estimate.
In case the degree is 0 the local regression estimate is:

\hat{m}(x) = \frac{\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) y_i}{\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)}

This choice p = 0 is quite well known in the nonparametric literature (it is called local constant regression), because it is the one for which the asymptotic theory has been derived. However, this case is, at the same time, the one that in practice has less frequently shown good performance.
The problem with local constant regression is that it cannot reproduce a line, even in the very special case of equally spaced data away from the boundaries.
Reducing the lack of fit to a tolerable level requires very small bandwidths that end up in a very rough estimate.
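A minimal sketch of this p = 0 estimator with a Gaussian kernel (the function name and the choice of kernel are illustrative):

```python
import numpy as np

def nadaraya_watson(x0, x, y, h):
    """Local constant (p = 0) estimate: kernel-weighted average of the y_i."""
    k = np.exp(-0.5 * ((x0 - x) / h) ** 2)   # Gaussian kernel K((x0 - x_i)/h)
    return np.sum(k * y) / np.sum(k)
```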

So, by using a polynomial degree greater than zero it is possible to increase the bandwidth (thus reducing the roughness) without introducing an intolerable bias.
In case the degree is 1 the local regression estimate is:

\hat{m}(x) = \frac{\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) y_i}{\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)} + (x - \bar{X}_w) \, \frac{\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) (x_i - \bar{X}_w) \, y_i}{\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) (x_i - \bar{X}_w)^2}

where

\bar{X}_w = \frac{\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) x_i}{\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)}

This choice p = 1 is called local linear regression.
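For reference, a direct implementation of this closed-form expression (equivalent to the weighted least squares sketch given earlier; names and kernel choice are illustrative):

```python
import numpy as np

def local_linear_closed_form(x0, x, y, h):
    """Local linear (p = 1) estimate of m(x0) via the closed-form expression."""
    k = np.exp(-0.5 * ((x0 - x) / h) ** 2)   # kernel weights K((x0 - x_i)/h)
    xw = np.sum(k * x) / np.sum(k)           # kernel-weighted mean X_w of the x_i
    level = np.sum(k * y) / np.sum(k)        # local constant term
    slope = np.sum(k * (x - xw) * y) / np.sum(k * (x - xw) ** 2)
    return level + (x0 - xw) * slope
```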

Notable cases

1. Kernel regression is a local constant regression (p = 0) where the weighting mechanism is based on typical kernel functions (in particular the Gaussian). It is also called Nadaraya-Watson regression.

2. The LOESS estimator for local regression is characterized by a tricube weighting mechanism and a nearest neighbours bandwidth.

Kernel regression theory

For kernel regression much theory has been developed, even though it is not the best option in practice.
The model is

y = m(x) + u

For a given choice of K and h (fixed), we suppose that the data are i.i.d. and that the x are not stochastic.

BIAS
Similarly to kernel density estimators, the kernel regression estimator has a bias of size O(h^2):

b(x_0) = h^2 \left( \frac{1}{2} m''(x_0) + m'(x_0) \frac{f'(x_0)}{f(x_0)} \right) \int z^2 k(z) \, dz

Given a value for h, the bias varies with the kernel function that we use, but most of all it depends on the slope and the curvature of the function m at x0 and on the slope of f(x0), the density of the regressors. In kernel density estimation, instead, the bias depends only on f''(x).

LIMIT DISTRIBUTION
The kernel regression estimator has a limit distribution which is normal:

\sqrt{N h} \, \left( \hat{m}(x_0) - m(x_0) - b(x_0) \right) \;\rightarrow\; N\left( 0, \; \frac{\sigma^2}{f(x_0)} \int k(z)^2 \, dz \right)

Note that the variance of the estimator \hat{m}(x_0) is inversely related to f(x_0), which means that the variance of \hat{m}(x_0) is bigger in regions where x is sparse.

BANDWIDTH
The choice of the bandwidth is once more connected to the bias-variance trade-off.
As for the kernel density estimator, the bandwidth can be determined using different methods; we will see them in the next slides.

Choosing the bandwidth: Optimal rule

A value of h that minimizes the MISE in an asymptotic sense would be an optimal bandwidth.
Remember that the MSE (mean squared error) measures the local performance of \hat{m} at x0; in this case it takes the form:

MSE[\hat{m}(x_0)] = E\left[ \left( \hat{m}(x_0) - m(x_0) \right)^2 \right]

The MISE (mean integrated squared error) is a global measure of performance:

MISE(h) = \int MSE[\hat{m}(x_0)] \, f(x_0) \, dx_0

where f is the density of the regressors.
The optimal bandwidth is obtained by minimizing the MISE, and this yields h = O\left( N^{-1/5} \right).
It has been shown that the kernel estimate converges at a rate that is slower than that of the parametric estimate.

Choosing the bandwidth: Cross-validation

An empirical estimate of the optimal h can be obtained using the leave-one-out cross-validation procedure, thus minimizing:

CV(h) = \sum_{i=1}^{N} \left( y_i - \hat{m}_{-i}(x_i) \right)^2

where \hat{m}_{-i}(x_i) is the estimate at x_i computed leaving out the i-th observation.
The optimality properties derive from the asymptotic equivalence between minimizing CV(h) and minimizing MISE(h) or ISE(h), recalling that, similarly to what was presented in the previous section:

ISE(h) = \int \left( \hat{m}(x_0) - m(x_0) \right)^2 f(x_0) \, dx_0
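A small sketch of this bandwidth search for the Nadaraya-Watson estimator defined earlier (the candidate grid of h values, and the reuse of the illustrative nadaraya_watson helper and the simulated x, y, are my assumptions):

```python
import numpy as np

def cv_score(h, x, y):
    """Leave-one-out CV criterion for the Nadaraya-Watson estimator with bandwidth h."""
    errors = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i                  # drop the i-th observation
        m_minus_i = nadaraya_watson(x[i], x[mask], y[mask], h)
        errors.append((y[i] - m_minus_i) ** 2)
    return np.sum(errors)

# Pick the bandwidth minimizing CV(h) over an illustrative grid of candidates
candidates = np.linspace(0.1, 2.0, 20)
h_cv = candidates[np.argmin([cv_score(h, x, y) for h in candidates])]
```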

Plug-in
The plug-in approach is usually not used in the kernel regression context; CV is preferred.

LOESS estimator
The LOESS estimator is a local regression estimator where:

1. the weight function used is the tricube weight function;

2. the local polynomials are almost always of first or second degree (that is, either locally linear or locally quadratic);

3. the subsets of data used for each weighted least squares fit are determined by a nearest neighbors algorithm.

About the third characteristic: usually the smoothing parameter, q, is a number between (p+1)/N and 1, with p denoting the degree of the local polynomial.
Large values of q produce the smoothest functions, which do not react much to fluctuations in the data. Smaller values of q make the regression function follow the data more closely.
Note, however, that using too small a value of the smoothing parameter is not desirable, since the regression function will eventually start to capture the random error in the data (too rough!). Good values of the smoothing parameter typically lie in the range 0.25 to 0.5 for most LOESS applications.
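In practice, a LOWESS/LOESS-type fit of this kind is available in statsmodels; a minimal usage sketch (assuming statsmodels is installed, and with an illustrative smoothing fraction and simulated data):

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# frac plays the role of the smoothing parameter q (share of nearest neighbors used)
fitted = lowess(y, x, frac=0.35)   # returns an array with columns (x, fitted value)
```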
