
Geographically Weighted Regression Technique for Spatial Data Analysis

Chang-Lin Mei

School of Science, Xi'an Jiaotong University

(E-mail: clmei@mail.xjtu.edu.cn)

1. Introduction

In many practical fields such as geography, economics, environmental science and epidemiology, the data are generally related to the geographical locations where they are observed. This type of data is called spatial data. How to analyze spatial data has long been an important problem in statistics. In today's talk, I will first introduce a recently developed spatial data analysis method called the geographically weighted regression (GWR) technique, which was originally proposed by Brunsdon et al (1996; 1998). Then some of our work on this technique is summarized. Finally, several problems that need further study are discussed.

2. Geographically weighted regression model and its fitting method

2.1 The geographically weighted regression model


Motivated by the idea of nonparametric regression methods, Brunsdon et al (1996; 1998) proposed the geographically weighted regression (GWR) technique for exploring spatial non-stationarity of a regression relationship for spatial data by locally fitting a spatially varying coefficient regression model of the form

$$y_i = \sum_{j=1}^{p} \beta_j(u_i, v_i)\, x_{ij} + \varepsilon_i, \qquad i = 1, 2, \ldots, n, \tag{1}$$

where $(y_i; x_{i1}, \ldots, x_{ip})$ are observations of the response $y$ and explanatory variables $x_1, x_2, \ldots, x_p$ at location $(u_i, v_i)$ in the studied geographical region, $\beta_j(u_i, v_i)$ $(j = 1, 2, \ldots, p)$ are $p$ unknown functions of geographical location, and $\varepsilon_i$ $(i = 1, 2, \ldots, n)$ are error terms with mean zero and common variance $\sigma^2$. The coefficients $\beta_j(u_i, v_i)$ $(j = 1, 2, \ldots, p)$ are locally estimated at each location $(u_i, v_i)$ by the weighted least-squares procedure, in which distance-decay weights are used. Each set of estimated coefficients at the $n$ locations can produce a map of variation, which may give useful information on the non-stationarity of the regression relationship. The GWR technique has great appeal in the analysis of spatial data and has been successfully applied to many practical problems. The main results on this topic are summarized in Fotheringham et al (2002).
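2.2 The fitting method: geographically weighted regression technique
The parameters in the GWR model are locally estimated by the weighted least squares approach. The weights at each location $(u_i, v_i)$ are taken as a function of the distance from $(u_i, v_i)$ to the other locations where observations are collected. Suppose that the weights at location $(u_i, v_i)$ are $w_j(u_i, v_i)$, $j = 1, 2, \ldots, n$. Then the parameters at location $(u_i, v_i)$ are estimated by minimizing

$$\sum_{j=1}^{n} w_j(u_i, v_i)\,\bigl[y_j - \beta_1(u_i, v_i)x_{j1} - \beta_2(u_i, v_i)x_{j2} - \cdots - \beta_p(u_i, v_i)x_{jp}\bigr]^2.$$

Let

$$\mathbf{X} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}, \qquad \mathbf{Y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$$

and

$$\mathbf{W}(u_i, v_i) = \mathrm{Diag}[w_1(u_i, v_i), w_2(u_i, v_i), \ldots, w_n(u_i, v_i)].$$

Then, according to the theory of weighted least squares, the estimated parameters at $(u_i, v_i)$ are

$$\hat{\boldsymbol\beta}(u_i, v_i) = [\mathbf{X}^T \mathbf{W}(u_i, v_i) \mathbf{X}]^{-1} \mathbf{X}^T \mathbf{W}(u_i, v_i) \mathbf{Y}.$$

Let $\mathbf{x}_i^T = (x_{i1}, x_{i2}, \ldots, x_{ip})$ be the $i$th row of $\mathbf{X}$. Then the fitted value of $y$ at $(u_i, v_i)$ is obtained by

$$\hat{y}_i = \mathbf{x}_i^T \hat{\boldsymbol\beta}(u_i, v_i) = \mathbf{x}_i^T [\mathbf{X}^T \mathbf{W}(u_i, v_i) \mathbf{X}]^{-1} \mathbf{X}^T \mathbf{W}(u_i, v_i) \mathbf{Y}.$$

Denote by $\hat{\mathbf{Y}} = (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n)^T$ and $\hat{\boldsymbol\varepsilon} = (\hat\varepsilon_1, \hat\varepsilon_2, \ldots, \hat\varepsilon_n)^T$ respectively the vector of fitted values of $y$ and the vector of residuals at the $n$ locations $(u_i, v_i)$, $i = 1, 2, \ldots, n$.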

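Then

$$\hat{\mathbf{Y}} = \mathbf{L}\mathbf{Y}; \qquad \hat{\boldsymbol\varepsilon} = \mathbf{Y} - \hat{\mathbf{Y}} = (\mathbf{I} - \mathbf{L})\mathbf{Y}, \tag{2}$$

where

$$\mathbf{L} = \begin{pmatrix} \mathbf{x}_1^T [\mathbf{X}^T \mathbf{W}(u_1, v_1) \mathbf{X}]^{-1} \mathbf{X}^T \mathbf{W}(u_1, v_1) \\ \mathbf{x}_2^T [\mathbf{X}^T \mathbf{W}(u_2, v_2) \mathbf{X}]^{-1} \mathbf{X}^T \mathbf{W}(u_2, v_2) \\ \vdots \\ \mathbf{x}_n^T [\mathbf{X}^T \mathbf{W}(u_n, v_n) \mathbf{X}]^{-1} \mathbf{X}^T \mathbf{W}(u_n, v_n) \end{pmatrix}$$

is an $n \times n$ matrix and $\mathbf{I}$ is the identity matrix of order $n$.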
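The local fitting procedure of this section can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function name `gwr_fit`, the simulated data, the bandwidth value 0.3 and the Gaussian distance-decay weights $\exp[-(d_{ij}/h)^2]$ are all choices made for the example.

```python
import numpy as np

def gwr_fit(X, y, coords, h):
    """Local weighted least squares at every observation location.

    Returns the n x p array of local coefficient estimates and the
    n x n hat matrix L of equation (2). Gaussian weights are assumed.
    """
    n, p = X.shape
    betas = np.empty((n, p))
    L = np.empty((n, n))
    for i in range(n):
        d = np.linalg.norm(coords - coords[i], axis=1)  # distances d_ij
        w = np.exp(-(d / h) ** 2)                       # weights w_j(u_i, v_i)
        XtW = X.T * w                                   # X^T W(u_i, v_i)
        A = np.linalg.solve(XtW @ X, XtW)               # [X^T W X]^{-1} X^T W
        betas[i] = A @ y                                # local estimate
        L[i] = X[i] @ A                                 # ith row of L
    return betas, L

# Simulated data (hypothetical): constant intercept, slope varying with u.
rng = np.random.default_rng(0)
n = 60
coords = rng.uniform(0, 1, size=(n, 2))
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 2.0 + (1.0 + coords[:, 0]) * X[:, 1] + 0.1 * rng.normal(size=n)

betas, L = gwr_fit(X, y, coords, h=0.3)
y_hat = L @ y          # fitted values, Y_hat = L Y
resid = y - y_hat      # residuals, (I - L) Y
```

Each column of `betas` can then be mapped over the locations to inspect how the corresponding coefficient varies across the region.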


2.3 Choices of the weights
The role of the weights is to place different emphasis on different observations in generating the estimated parameters. In spatial analysis, observations close to a location $(u_i, v_i)$ generally exert more influence on the parameter estimates at that location than those farther away. Hence, when the parameters at location $(u_i, v_i)$ are estimated, more emphasis should be placed on the observations close to $(u_i, v_i)$. As with the weights in nonparametric regression, one obvious choice is the Gaussian kernel

$$w_j(u_i, v_i) = \exp[-(d_{ij}/h)^2], \qquad j = 1, 2, \ldots, n,$$

where $d_{ij}$ is the distance from the location $(u_i, v_i)$ to $(u_j, v_j)$ and $h$ is called the bandwidth.
Another choice of the weights is the bi-square kernel

$$w_j(u_i, v_i) = \begin{cases} [1 - (d_{ij}/h)^2]^2, & \text{if } d_{ij} \le h, \\ 0, & \text{if } d_{ij} > h, \end{cases} \qquad j = 1, 2, \ldots, n.$$

The bandwidth $h$ can be determined by the cross-validation procedure. Let

$$\mathrm{CV}(h) = \sum_{i=1}^{n} \bigl(y_i - \hat{y}_{(i)}(h)\bigr)^2,$$

where $\hat{y}_{(i)}(h)$ is the fitted value of $y_i$ with the observation at location $(u_i, v_i)$ omitted from the fitting process. Choose $h_0$ as the desirable value of the bandwidth such that

$$\mathrm{CV}(h_0) = \min_{h > 0} \mathrm{CV}(h).$$
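The two weighting schemes and the cross-validation criterion can be illustrated as follows. This is a rough sketch: the helper names, the simulated data and the candidate bandwidth grid are assumptions for the example, and a simple grid search stands in for whatever optimizer one would use in practice.

```python
import numpy as np

def gaussian_kernel(d, h):
    """Gaussian distance-decay weights exp[-(d/h)^2]."""
    return np.exp(-(d / h) ** 2)

def bisquare_kernel(d, h):
    """Bi-square weights [1 - (d/h)^2]^2 for d <= h, and 0 otherwise."""
    return np.where(d <= h, (1.0 - (d / h) ** 2) ** 2, 0.0)

def cv_score(X, y, coords, h, kernel=gaussian_kernel):
    """Sum of squared leave-one-out prediction errors for bandwidth h."""
    score = 0.0
    for i in range(len(y)):
        d = np.linalg.norm(coords - coords[i], axis=1)
        w = kernel(d, h)
        w[i] = 0.0                        # omit the observation at location i
        XtW = X.T * w
        beta_i = np.linalg.solve(XtW @ X, XtW @ y)
        score += (y[i] - X[i] @ beta_i) ** 2
    return score

# Simulated data (hypothetical) and a hypothetical grid of candidate bandwidths.
rng = np.random.default_rng(0)
n = 60
coords = rng.uniform(0, 1, size=(n, 2))
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 2.0 + (1.0 + coords[:, 0]) * X[:, 1] + 0.1 * rng.normal(size=n)

grid = np.linspace(0.2, 1.5, 14)
scores = [cv_score(X, y, coords, h) for h in grid]
h0 = grid[int(np.argmin(scores))]         # bandwidth minimizing the CV score
```

Note that with the bi-square kernel a small bandwidth can leave too few observations with nonzero weight, making the local least squares problem singular; the grid would have to be chosen accordingly.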

3. Testing for global linear regression based on the geographically weighted regression technique (Leung, Mei and Zhang, 2000a)

The geographically weighted regression technique provides a feasible way of testing for a global linear regression relationship for spatial data. This amounts to testing the following hypotheses:

$$H_0: \beta_j(u_i, v_i) = \beta_j, \quad j = 1, 2, \ldots, p \qquad \text{versus} \qquad H_1: \text{at least one of the } \beta_j(u_i, v_i) \text{ varies with location.}$$
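3.1 Construction of the test statistic
We construct the test statistic from the residual sums of squares obtained under $H_0$ and $H_1$ respectively.
Under $H_0$, we fit the corresponding linear regression model by ordinary least squares and obtain the residual sum of squares

$$\mathrm{RSS}(H_0) = \mathbf{Y}^T (\mathbf{I} - \mathbf{H}) \mathbf{Y}, \tag{3}$$

where $\mathbf{H} = \mathbf{X}(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T$.

Under $H_1$, the spatially varying coefficient regression model (1) is fitted by the geographically weighted regression technique and we obtain the residual sum of squares

$$\mathrm{RSS}(H_1) = \hat{\boldsymbol\varepsilon}^T \hat{\boldsymbol\varepsilon} = \mathbf{Y}^T (\mathbf{I} - \mathbf{L})^T (\mathbf{I} - \mathbf{L}) \mathbf{Y}. \tag{4}$$

The test statistic is then constructed as

$$F = \frac{\mathrm{RSS}(H_0) - \mathrm{RSS}(H_1)}{\mathrm{RSS}(H_1)} = \frac{\mathbf{Y}^T (\mathbf{M}_0 - \mathbf{M}_1) \mathbf{Y}}{\mathbf{Y}^T \mathbf{M}_1 \mathbf{Y}}, \tag{5}$$

where $\mathbf{M}_0 = \mathbf{I} - \mathbf{H}$ and $\mathbf{M}_1 = (\mathbf{I} - \mathbf{L})^T (\mathbf{I} - \mathbf{L})$.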
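As a rough numerical illustration of equations (3) to (5), the statistic $F$ and the matrices $\mathbf{M}_0$ and $\mathbf{M}_1$ can be computed directly. The helper names, the Gaussian weights, the bandwidth and the simulated data below are assumptions made for the sketch.

```python
import numpy as np

def gwr_hat_matrix(X, coords, h):
    """Hat matrix L of equation (2), assuming Gaussian weights exp[-(d/h)^2]."""
    n = X.shape[0]
    L = np.empty((n, n))
    for i in range(n):
        d = np.linalg.norm(coords - coords[i], axis=1)
        XtW = X.T * np.exp(-(d / h) ** 2)
        L[i] = X[i] @ np.linalg.solve(XtW @ X, XtW)
    return L

def global_f_statistic(X, y, L):
    """Test statistic F of equation (5), together with M0 and M1."""
    I = np.eye(len(y))
    H = X @ np.linalg.solve(X.T @ X, X.T)   # OLS hat matrix
    M0 = I - H
    M1 = (I - L).T @ (I - L)
    rss0 = y @ M0 @ y                       # RSS under H0, equation (3)
    rss1 = y @ M1 @ y                       # RSS under H1, equation (4)
    return (rss0 - rss1) / rss1, M0, M1

# Simulated data (hypothetical) with a clearly non-stationary slope.
rng = np.random.default_rng(1)
n = 60
coords = rng.uniform(0, 1, size=(n, 2))
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 2.0 + (1.0 + 3.0 * coords[:, 0]) * X[:, 1] + 0.1 * rng.normal(size=n)

L = gwr_hat_matrix(X, coords, h=0.3)
f, M0, M1 = global_f_statistic(X, y, L)
A = M0 - (1.0 + f) * M1   # matrix of the quadratic form used for the p-value
```

The matrix `A` is the one whose quadratic form determines the p-value of the test.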


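3.2 Calculation of p-value of the test
Since a larger value of $F$ tends to support $H_1$, the p-value of the test is

$$p_0 = P_{H_0}(F > f), \tag{6}$$

where $f$ is the observed value of the test statistic $F$. Note that $\mathbf{L}\mathbf{X} = \mathbf{X}$ and that $E(\mathbf{Y}) = \mathbf{X}\boldsymbol\beta$ when $H_0$ is true, where $\boldsymbol\beta = (\beta_1, \beta_2, \ldots, \beta_p)^T$. Hence $(\mathbf{I} - \mathbf{L})\mathbf{Y} = (\mathbf{I} - \mathbf{L})\boldsymbol\varepsilon$ and $(\mathbf{I} - \mathbf{H})\mathbf{Y} = (\mathbf{I} - \mathbf{H})\boldsymbol\varepsilon$, so that $\mathrm{RSS}(H_1) = \boldsymbol\varepsilon^T \mathbf{M}_1 \boldsymbol\varepsilon$ and $\mathrm{RSS}(H_0) = \boldsymbol\varepsilon^T \mathbf{M}_0 \boldsymbol\varepsilon$. Therefore, under the null hypothesis $H_0$, $F$ can be expressed as

$$F = \frac{\boldsymbol\varepsilon^T (\mathbf{M}_0 - \mathbf{M}_1) \boldsymbol\varepsilon}{\boldsymbol\varepsilon^T \mathbf{M}_1 \boldsymbol\varepsilon},$$

and $p_0$ can be calculated by

$$p_0 = P\!\left( \frac{\boldsymbol\varepsilon^T (\mathbf{M}_0 - \mathbf{M}_1) \boldsymbol\varepsilon}{\boldsymbol\varepsilon^T \mathbf{M}_1 \boldsymbol\varepsilon} > f \right) = P\{\boldsymbol\varepsilon^T [\mathbf{M}_0 - (1 + f)\mathbf{M}_1] \boldsymbol\varepsilon > 0\}. \tag{7}$$

That is, the p-value has been expressed as the probability that a quadratic form in $\boldsymbol\varepsilon$ takes a positive value. If we assume that the error vector $\boldsymbol\varepsilon \sim N(\mathbf{0}, \sigma^2 \mathbf{I})$, we can obtain both exact and approximate formulae for calculating $p_0$.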
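3.2.1 The exact formula
Theorem 3.1 Suppose that the error terms $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are independent and identically distributed random variables with common distribution $N(0, \sigma^2)$. Then

$$p_0 = P_{H_0}(F > f) = \frac{1}{2} + \frac{1}{\pi} \int_0^{\infty} \frac{\sin[\theta(t)]}{t\,\rho(t)}\, dt, \tag{8}$$

where

$$\theta(t) = \frac{1}{2} \sum_{k=1}^{m} [h_k \tan^{-1}(\lambda_k t)], \qquad \rho(t) = \prod_{k=1}^{m} (1 + \lambda_k^2 t^2)^{h_k/4},$$

$\lambda_1, \lambda_2, \ldots, \lambda_m$ are the distinct nonzero eigenvalues of the matrix $\mathbf{M}_0 - (1 + f)\mathbf{M}_1$ and $h_1, h_2, \ldots, h_m$ are their respective orders of multiplicity.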
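Formula (8) can be evaluated numerically. The following is a minimal sketch under some assumptions of its own: the function name `imhof_pvalue` is hypothetical, repeated eigenvalues are simply summed one by one (which is equivalent to grouping them with multiplicities $h_k$), and the infinite integral is mapped to $(0, 1)$ by the substitution $t = s/(1 - s)$ and approximated with the midpoint rule, which assumes at least two nonzero eigenvalues so that the transformed integrand stays bounded.

```python
import numpy as np

def imhof_pvalue(A, n_grid=20_000):
    """P(eps^T A eps > 0) for eps ~ N(0, sigma^2 I) and symmetric A, via (8)."""
    lam = np.linalg.eigvalsh((A + A.T) / 2.0)
    lam = lam[np.abs(lam) > 1e-10 * np.max(np.abs(lam))]  # nonzero eigenvalues
    s = (np.arange(n_grid) + 0.5) / n_grid                # midpoints of (0, 1)
    t = s / (1.0 - s)                                     # maps (0, 1) to (0, inf)
    lt = np.outer(t, lam)
    theta = 0.5 * np.arctan(lt).sum(axis=1)               # theta(t)
    log_rho = 0.25 * np.log1p(lt ** 2).sum(axis=1)        # log rho(t)
    # dt = ds / (1 - s)^2 accounts for the change of variable.
    integrand = np.sin(theta) * np.exp(-log_rho) / (t * (1.0 - s) ** 2)
    return 0.5 + integrand.mean() / np.pi

# Check on a case with a known answer: for A = diag(2, -1),
# P(2 z1^2 - z2^2 > 0) = P(|z2 / z1| < sqrt(2)) = (2 / pi) * arctan(sqrt(2)).
p_demo = imhof_pvalue(np.diag([2.0, -1.0]))
p_exact = 2.0 / np.pi * np.arctan(np.sqrt(2.0))
```

For the test of Section 3, `A` would be the matrix $\mathbf{M}_0 - (1 + f)\mathbf{M}_1$; the grid size trades accuracy against the memory needed for the `n_grid` by `n` intermediate array.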
3.2.2 Three-moment $\chi^2$ approximation
Computing numerically the eigenvalues of an $n \times n$ matrix and an integral over an infinite interval is computationally expensive. Approximate methods are available in this case. Here, we introduce the so-called three-moment $\chi^2$ approximation to compute the p-value of the test. This approximation can significantly reduce the computational overhead.
The main idea is to approximate the distribution of a quadratic form in normal variables by that of a linear function of a $\chi^2$ variable with appropriate degrees of freedom, say $a + b\chi^2_d$. The coefficients $a$ and $b$ of the linear function and the degrees of freedom $d$ are chosen in such a way that the first three moments of $a + b\chi^2_d$ match those of the quadratic form.
Theorem 3.2 Suppose that the error terms $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are independently and identically distributed as $N(0, \sigma^2)$. If the three-moment $\chi^2$ approximation is used to approximate the p-value $p_0$, then we have

$$p_0 \approx \begin{cases} P(\chi^2_d > d - h), & \text{if } \mathrm{tr}[\mathbf{M}_0 - (1 + f)\mathbf{M}_1]^3 > 0; \\[4pt] \Phi\!\left( \dfrac{\mathrm{tr}[\mathbf{M}_0 - (1 + f)\mathbf{M}_1]}{\sqrt{2\,\mathrm{tr}[\mathbf{M}_0 - (1 + f)\mathbf{M}_1]^2}} \right), & \text{if } \mathrm{tr}[\mathbf{M}_0 - (1 + f)\mathbf{M}_1]^3 = 0; \\[4pt] P(\chi^2_d < d - h), & \text{if } \mathrm{tr}[\mathbf{M}_0 - (1 + f)\mathbf{M}_1]^3 < 0, \end{cases} \tag{9}$$

where $\Phi$ is the standard normal distribution function and

$$d = \frac{\{\mathrm{tr}[\mathbf{M}_0 - (1 + f)\mathbf{M}_1]^2\}^3}{\{\mathrm{tr}[\mathbf{M}_0 - (1 + f)\mathbf{M}_1]^3\}^2}, \qquad h = \frac{\mathrm{tr}[\mathbf{M}_0 - (1 + f)\mathbf{M}_1]^2 \; \mathrm{tr}[\mathbf{M}_0 - (1 + f)\mathbf{M}_1]}{\mathrm{tr}[\mathbf{M}_0 - (1 + f)\mathbf{M}_1]^3}.$$
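The approximation in (9) needs only three traces. A minimal sketch follows; the function names are hypothetical, and the $\chi^2$ distribution function is computed with a plain power series for the regularized incomplete gamma function so that the example depends only on NumPy and the standard library (a statistics library routine would normally be used instead).

```python
import math
import numpy as np

def chi2_cdf(x, d, max_iter=10_000):
    """P(chi2_d <= x) via the series for the regularized lower incomplete gamma."""
    if x <= 0:
        return 0.0
    a, z = d / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    for k in range(1, max_iter):
        term *= z / (a + k)
        total += term
        if term < 1e-16 * total:
            break
    return min(1.0, total * math.exp(a * math.log(z) - z - math.lgamma(a)))

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def three_moment_pvalue(A):
    """Approximate P(eps^T A eps > 0) following the three cases of (9)."""
    A2 = A @ A
    t1, t2, t3 = np.trace(A), np.trace(A2), np.trace(A2 @ A)
    if abs(t3) < 1e-12 * t2 ** 1.5:      # third trace numerically zero
        return normal_cdf(t1 / math.sqrt(2.0 * t2))
    d = t2 ** 3 / t3 ** 2                # degrees of freedom
    h = t2 * t1 / t3
    cdf = chi2_cdf(d - h, d)
    return 1.0 - cdf if t3 > 0 else cdf

p_demo = three_moment_pvalue(np.diag([2.0, -1.0]))
```

Since only matrix products and traces are required, no eigenvalue decomposition or numerical integration is needed, which is the source of the computational saving described above.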

4. Testing for spatial autocorrelation among the residuals of the geographically weighted regression (Leung, Mei and Zhang, 2000b)

As in ordinary linear regression, spatial autocorrelation in the error terms can invalidate the standard assumption of homoscedasticity of the error terms and mislead the results of statistical inference. Therefore, developing statistical methods to test for spatial autocorrelation in the error terms of model (1) is a very important issue. Here, the two well-known statistics in regional science, Moran's $I$ and Geary's $C$, are used to explore spatial autocorrelation among the residuals of the geographically weighted regression.
Let

$$\hat{\boldsymbol\varepsilon} = (\hat\varepsilon_1, \hat\varepsilon_2, \ldots, \hat\varepsilon_n)^T = (\mathbf{I} - \mathbf{L})\mathbf{Y}$$

be the residual vector obtained by fitting the spatially varying coefficient model with the GWR technique, and let $\mathbf{W} = (w_{ij})_{n \times n}$ be a specific spatial weight matrix defined by the underlying spatial structure, such as the spatial contiguity or adjacency between the geographical units where the observations are made. After neglecting a constant coefficient, Moran's $I$ and Geary's $C$ of the residuals with respect to $\mathbf{W} = (w_{ij})_{n \times n}$ are respectively

$$I = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \hat\varepsilon_i \hat\varepsilon_j}{\sum_{i=1}^{n} \hat\varepsilon_i^2} = \frac{\hat{\boldsymbol\varepsilon}^T \mathbf{W} \hat{\boldsymbol\varepsilon}}{\hat{\boldsymbol\varepsilon}^T \hat{\boldsymbol\varepsilon}}, \tag{10}$$

and

$$C = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (\hat\varepsilon_i - \hat\varepsilon_j)^2}{\sum_{i=1}^{n} \hat\varepsilon_i^2} = \frac{\hat{\boldsymbol\varepsilon}^T (\mathbf{D} - 2\mathbf{W}) \hat{\boldsymbol\varepsilon}}{\hat{\boldsymbol\varepsilon}^T \hat{\boldsymbol\varepsilon}}, \tag{11}$$

where

$$\mathbf{D} = \mathrm{Diag}(w_{1\cdot} + w_{\cdot 1},\; w_{2\cdot} + w_{\cdot 2},\; \ldots,\; w_{n\cdot} + w_{\cdot n})$$

and $w_{i\cdot} = \sum_{j=1}^{n} w_{ij}$, $w_{\cdot i} = \sum_{j=1}^{n} w_{ji}$.
The p-values for testing spatial autocorrelation are respectively

$$p_I = P_{H_0}(I > I_0) \quad \text{and} \quad p_C = P_{H_0}(C > C_0), \tag{12}$$

where $I_0$ and $C_0$ are respectively the observed values of $I$ and $C$.
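When the bias of the fitted value of $y$ at each location is negligible, Moran's $I$ and Geary's $C$ can be respectively expressed as

$$I = \frac{\boldsymbol\varepsilon^T \mathbf{N}^T \mathbf{W} \mathbf{N} \boldsymbol\varepsilon}{\boldsymbol\varepsilon^T \mathbf{N}^T \mathbf{N} \boldsymbol\varepsilon}, \qquad C = \frac{\boldsymbol\varepsilon^T \mathbf{N}^T (\mathbf{D} - 2\mathbf{W}) \mathbf{N} \boldsymbol\varepsilon}{\boldsymbol\varepsilon^T \mathbf{N}^T \mathbf{N} \boldsymbol\varepsilon}, \tag{13}$$

where $\mathbf{N} = \mathbf{I} - \mathbf{L}$. If we assume that the error vector $\boldsymbol\varepsilon \sim N(\mathbf{0}, \sigma^2 \mathbf{I})$, we can calculate the p-values with the same methods introduced in Sections 3.2.1 and 3.2.2.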
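The matrix forms of Moran's $I$ and Geary's $C$ in (10) and (11) are straightforward to compute. The sketch below is illustrative only: the function name, the distance-threshold contiguity rule used to build $\mathbf{W}$ and the random stand-in for the GWR residual vector are assumptions.

```python
import numpy as np

def moran_geary(resid, W):
    """Moran's I and Geary's C of a residual vector, equations (10) and (11)."""
    D = np.diag(W.sum(axis=1) + W.sum(axis=0))   # Diag(w_i. + w_.i)
    ss = resid @ resid
    moran_i = (resid @ W @ resid) / ss
    geary_c = (resid @ (D - 2.0 * W) @ resid) / ss
    return moran_i, geary_c

# Hypothetical spatial structure: units closer than 0.25 are neighbours.
rng = np.random.default_rng(2)
n = 30
coords = rng.uniform(0, 1, size=(n, 2))
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
W = ((dist > 0) & (dist < 0.25)).astype(float)

resid = rng.normal(size=n)        # stand-in for the GWR residuals (I - L) Y
I0, C0 = moran_geary(resid, W)    # observed values of the two statistics
```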
Following the same derivation as above, Leung, Mei and Zhang (2003) have proposed an approach for testing local patterns of spatial association based on the local statistics: local Moran's $I_i$, local Geary's $C_i$ and Anselin's LISA.

5. Mixed geographically weighted regression model

To account for situations where certain explanatory variables influencing the response may be global in nature whilst others are local, Brunsdon et al (1999) have proposed a mixed geographically weighted regression (MGWR) model in which some coefficients in the model (1) are assumed to be constant and the others are allowed to vary across the studied region. After re-ordering the explanatory variables, an MGWR model is specified as

$$y_i = \sum_{j=1}^{q} \beta_j x_{ij} + \sum_{j=q+1}^{p} \beta_j(u_i, v_i)\, x_{ij} + \varepsilon_i, \qquad i = 1, 2, \ldots, n. \tag{14}$$

By taking $x_{i1} = 1$ or $x_{i,q+1} = 1$ for all $i$, the model can include a constant or a spatially varying intercept.
5.1 Identification of constant coefficients in a MGWR model (Mei, He and Fang, 2004)
When an MGWR model is applied to analyze a real-world data set, one should first determine which coefficients can be kept fixed and which ones cannot.
For a given $k$ $(1 \le k \le p)$, testing whether or not the coefficient $\beta_k(u_i, v_i)$ of the $k$th explanatory variable $x_k$ is constant across the geographical region amounts to testing the following hypotheses:

$$\begin{cases} H_0: \beta_k(u_1, v_1) = \beta_k(u_2, v_2) = \cdots = \beta_k(u_n, v_n), \\ H_1: \text{not all } \beta_k(u_i, v_i)\ (1 \le i \le n) \text{ are equal.} \end{cases}$$
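Firstly, fit the data to the spatially varying coefficient model (1) and let

$$\hat{\boldsymbol\beta}(u_i, v_i) = (\hat\beta_1(u_i, v_i), \hat\beta_2(u_i, v_i), \ldots, \hat\beta_p(u_i, v_i))^T = [\mathbf{X}^T \mathbf{W}(u_i, v_i) \mathbf{X}]^{-1} \mathbf{X}^T \mathbf{W}(u_i, v_i) \mathbf{Y}$$

be the estimated coefficient vector at location $(u_i, v_i)$. The $n$ estimated values of the $k$th coefficient $\beta_k(u_j, v_j)$ at the $n$ locations where the data are observed are

$$\hat\beta_k(u_j, v_j) = \mathbf{e}_k^T [\mathbf{X}^T \mathbf{W}(u_j, v_j) \mathbf{X}]^{-1} \mathbf{X}^T \mathbf{W}(u_j, v_j) \mathbf{Y}, \qquad j = 1, 2, \ldots, n, \tag{15}$$

where $\mathbf{e}_k$ is a column vector of $p$ dimensions with unity for the $k$th element and zero for the others. Let

$$\hat{\boldsymbol\beta}_k = (\hat\beta_k(u_1, v_1), \hat\beta_k(u_2, v_2), \ldots, \hat\beta_k(u_n, v_n))^T. \tag{16}$$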

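Neglecting the constant $1/n$, the sample variance of $\hat\beta_k(u_j, v_j)$, $j = 1, 2, \ldots, n$, can be expressed as

$$V(k) = \hat{\boldsymbol\beta}_k^T \left( \mathbf{I} - \frac{1}{n}\mathbf{J} \right) \hat{\boldsymbol\beta}_k = \mathbf{Y}^T \mathbf{B}^T \left( \mathbf{I} - \frac{1}{n}\mathbf{J} \right) \mathbf{B}\mathbf{Y}, \tag{17}$$

where

$$\mathbf{B} = \begin{pmatrix} \mathbf{e}_k^T [\mathbf{X}^T \mathbf{W}(u_1, v_1) \mathbf{X}]^{-1} \mathbf{X}^T \mathbf{W}(u_1, v_1) \\ \mathbf{e}_k^T [\mathbf{X}^T \mathbf{W}(u_2, v_2) \mathbf{X}]^{-1} \mathbf{X}^T \mathbf{W}(u_2, v_2) \\ \vdots \\ \mathbf{e}_k^T [\mathbf{X}^T \mathbf{W}(u_n, v_n) \mathbf{X}]^{-1} \mathbf{X}^T \mathbf{W}(u_n, v_n) \end{pmatrix}$$

and $\mathbf{J}$ is an $n \times n$ matrix with unity for each of its elements. The test statistic is constructed as

$$F(k) = \frac{\hat{\boldsymbol\beta}_k^T (\mathbf{I} - \frac{1}{n}\mathbf{J}) \hat{\boldsymbol\beta}_k}{\hat{\boldsymbol\varepsilon}^T \hat{\boldsymbol\varepsilon}} = \frac{\mathbf{Y}^T \mathbf{B}^T (\mathbf{I} - \frac{1}{n}\mathbf{J}) \mathbf{B}\mathbf{Y}}{\mathbf{Y}^T (\mathbf{I} - \mathbf{L})^T (\mathbf{I} - \mathbf{L}) \mathbf{Y}}. \tag{18}$$

The p-value of $F(k)$ is

$$p(k) = P_{H_0}[F(k) > f(k)], \tag{19}$$

where $f(k)$ is the observed value of $F(k)$. Under the null hypothesis and some conditions, we have

$$F(k) = \frac{\boldsymbol\varepsilon^T \mathbf{B}^T (\mathbf{I} - \frac{1}{n}\mathbf{J}) \mathbf{B} \boldsymbol\varepsilon}{\boldsymbol\varepsilon^T (\mathbf{I} - \mathbf{L})^T (\mathbf{I} - \mathbf{L}) \boldsymbol\varepsilon} = \frac{\boldsymbol\varepsilon^T \mathbf{M}_1 \boldsymbol\varepsilon}{\boldsymbol\varepsilon^T \mathbf{M}_2 \boldsymbol\varepsilon},$$

where

$$\mathbf{M}_1 = \mathbf{B}^T \left( \mathbf{I} - \frac{1}{n}\mathbf{J} \right) \mathbf{B}, \qquad \mathbf{M}_2 = (\mathbf{I} - \mathbf{L})^T (\mathbf{I} - \mathbf{L}).$$

Therefore $p(k) = P\{\boldsymbol\varepsilon^T [\mathbf{M}_1 - f(k)\mathbf{M}_2] \boldsymbol\varepsilon > 0\}$, and the p-value can be calculated by the same methods introduced in Sections 3.2.1 and 3.2.2.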
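The statistic $F(k)$ in (18) can be computed alongside the GWR fit, since $\mathbf{B}$ and $\mathbf{L}$ share the same local matrices. The sketch below is an illustration under assumptions: the function name, the Gaussian weights, the bandwidth and the simulated data are hypothetical, and the index `k` is zero-based as usual in Python.

```python
import numpy as np

def constancy_statistic(X, y, coords, h, k):
    """F(k) of equation (18) for the kth (zero-based) coefficient."""
    n = X.shape[0]
    B = np.empty((n, n))
    L = np.empty((n, n))
    for i in range(n):
        d = np.linalg.norm(coords - coords[i], axis=1)
        XtW = X.T * np.exp(-(d / h) ** 2)
        A = np.linalg.solve(XtW @ X, XtW)   # [X^T W X]^{-1} X^T W at location i
        B[i] = A[k]                         # e_k^T times that matrix
        L[i] = X[i] @ A
    C = np.eye(n) - np.ones((n, n)) / n     # centering matrix I - J/n
    N = np.eye(n) - L
    M1 = B.T @ C @ B
    M2 = N.T @ N
    return (y @ M1 @ y) / (y @ M2 @ y), B, M1, M2

# Simulated data (hypothetical): constant intercept, spatially varying slope.
rng = np.random.default_rng(3)
n = 60
coords = rng.uniform(0, 1, size=(n, 2))
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 2.0 + (1.0 + 3.0 * coords[:, 0]) * X[:, 1] + 0.1 * rng.normal(size=n)

F_slope, B, M1, M2 = constancy_statistic(X, y, coords, h=0.3, k=1)
F_intercept = constancy_statistic(X, y, coords, h=0.3, k=0)[0]
```

In this simulated setting the statistic for the slope is expected to be much larger than that for the intercept, matching how the data were generated.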
5.2 Estimation and inference on the MGWR model (Mei, Wang and Zhang, 2004)
5.2.1 Estimation of the MGWR model
After the constant coefficients in an MGWR model are identified, we can estimate both the constant coefficients and the spatially varying coefficients, the latter being important for reflecting the spatial nonstationarity of the regression relationship. Brunsdon et al (1999) have proposed an iterative estimation method based on the back-fitting procedure. However, this method is computationally expensive. Motivated by the approach in Speckman (1988) for fitting a partially linear model, we propose the following estimation method, which can significantly reduce the computational overhead.
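Let

$$\mathbf{X}_c = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1q} \\ x_{21} & x_{22} & \cdots & x_{2q} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nq} \end{pmatrix}, \quad \mathbf{X}_v = \begin{pmatrix} x_{1,q+1} & x_{1,q+2} & \cdots & x_{1p} \\ x_{2,q+1} & x_{2,q+2} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,q+1} & x_{n,q+2} & \cdots & x_{np} \end{pmatrix}, \quad \mathbf{Y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix},$$

and

$$\boldsymbol\beta_c = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_q \end{pmatrix}, \quad \boldsymbol\beta_v(u_i, v_i) = \begin{pmatrix} \beta_{q+1}(u_i, v_i) \\ \beta_{q+2}(u_i, v_i) \\ \vdots \\ \beta_p(u_i, v_i) \end{pmatrix}, \qquad i = 1, 2, \ldots, n.$$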
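Firstly, we rewrite the MGWR model (14) as

$$\tilde{y}_i = y_i - \sum_{j=1}^{q} \beta_j x_{ij} = \sum_{j=q+1}^{p} \beta_j(u_i, v_i)\, x_{ij} + \varepsilon_i, \qquad i = 1, 2, \ldots, n.$$

Using the GWR technique, we obtain the estimated spatially varying coefficients at location $(u_i, v_i)$ as

$$\hat{\boldsymbol\beta}_v(u_i, v_i) = (\hat\beta_{q+1}(u_i, v_i), \hat\beta_{q+2}(u_i, v_i), \ldots, \hat\beta_p(u_i, v_i))^T = (\mathbf{X}_v^T \mathbf{W}(u_i, v_i) \mathbf{X}_v)^{-1} \mathbf{X}_v^T \mathbf{W}(u_i, v_i) \tilde{\mathbf{Y}}, \tag{20}$$

where

$$\tilde{\mathbf{Y}} = (\tilde{y}_1, \tilde{y}_2, \ldots, \tilde{y}_n)^T = \mathbf{Y} - \mathbf{X}_c \boldsymbol\beta_c.$$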
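Then, substituting the elements of $\hat{\boldsymbol\beta}_v(u_i, v_i)$ into the original MGWR model (14), we rewrite it as

$$y_i - \sum_{j=q+1}^{p} \hat\beta_j(u_i, v_i)\, x_{ij} = \sum_{j=1}^{q} \beta_j x_{ij} + \varepsilon_i, \qquad i = 1, 2, \ldots, n. \tag{21}$$

Because

$$\hat{\mathbf{f}}_v = \begin{pmatrix} \sum_{j=q+1}^{p} \hat\beta_j(u_1, v_1) x_{1j} \\ \sum_{j=q+1}^{p} \hat\beta_j(u_2, v_2) x_{2j} \\ \vdots \\ \sum_{j=q+1}^{p} \hat\beta_j(u_n, v_n) x_{nj} \end{pmatrix} = \begin{pmatrix} \mathbf{x}_{v1}^T \hat{\boldsymbol\beta}_v(u_1, v_1) \\ \mathbf{x}_{v2}^T \hat{\boldsymbol\beta}_v(u_2, v_2) \\ \vdots \\ \mathbf{x}_{vn}^T \hat{\boldsymbol\beta}_v(u_n, v_n) \end{pmatrix} = \mathbf{S}_v \tilde{\mathbf{Y}} = \mathbf{S}_v (\mathbf{Y} - \mathbf{X}_c \boldsymbol\beta_c), \tag{22}$$

where $\mathbf{x}_{vi}^T$ is the $i$th row of $\mathbf{X}_v$ and, by (20), $\mathbf{S}_v$ is the $n \times n$ matrix whose $i$th row is $\mathbf{x}_{vi}^T [\mathbf{X}_v^T \mathbf{W}(u_i, v_i) \mathbf{X}_v]^{-1} \mathbf{X}_v^T \mathbf{W}(u_i, v_i)$, equation (21) can be expressed in matrix notation as

$$\mathbf{Y} - \mathbf{S}_v (\mathbf{Y} - \mathbf{X}_c \boldsymbol\beta_c) = \mathbf{X}_c \boldsymbol\beta_c + \boldsymbol\varepsilon$$

or

$$(\mathbf{I} - \mathbf{S}_v)\mathbf{Y} = (\mathbf{I} - \mathbf{S}_v)\mathbf{X}_c \boldsymbol\beta_c + \boldsymbol\varepsilon.$$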

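According to the ordinary least-squares method, we obtain the estimates of the constant coefficients as

$$\hat{\boldsymbol\beta}_c = (\hat\beta_1, \hat\beta_2, \ldots, \hat\beta_q)^T = [\mathbf{X}_c^T (\mathbf{I} - \mathbf{S}_v)^T (\mathbf{I} - \mathbf{S}_v) \mathbf{X}_c]^{-1} \mathbf{X}_c^T (\mathbf{I} - \mathbf{S}_v)^T (\mathbf{I} - \mathbf{S}_v) \mathbf{Y}. \tag{23}$$

Substituting $\hat{\boldsymbol\beta}_c$ into (20), we finally obtain the estimated spatially varying coefficients at location $(u_i, v_i)$ as

$$\hat{\boldsymbol\beta}_v(u_i, v_i) = [\mathbf{X}_v^T \mathbf{W}(u_i, v_i) \mathbf{X}_v]^{-1} \mathbf{X}_v^T \mathbf{W}(u_i, v_i) (\mathbf{Y} - \mathbf{X}_c \hat{\boldsymbol\beta}_c), \qquad i = 1, 2, \ldots, n. \tag{24}$$

Then, according to (22), the fitted values of the spatially varying coefficient part at the $n$ locations are

$$\hat{\mathbf{f}}_v = \mathbf{S}_v (\mathbf{Y} - \mathbf{X}_c \hat{\boldsymbol\beta}_c). \tag{25}$$

Therefore, the fitted values of the response at the $n$ locations are

$$\hat{\mathbf{Y}} = (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n)^T = \hat{\mathbf{f}}_v + \mathbf{X}_c \hat{\boldsymbol\beta}_c = \mathbf{S}_v (\mathbf{Y} - \mathbf{X}_c \hat{\boldsymbol\beta}_c) + \mathbf{X}_c \hat{\boldsymbol\beta}_c = \mathbf{S}_v \mathbf{Y} + (\mathbf{I} - \mathbf{S}_v) \mathbf{X}_c \hat{\boldsymbol\beta}_c = \mathbf{S}\mathbf{Y}, \tag{26}$$

where

$$\mathbf{S} = \mathbf{S}_v + (\mathbf{I} - \mathbf{S}_v) \mathbf{X}_c [\mathbf{X}_c^T (\mathbf{I} - \mathbf{S}_v)^T (\mathbf{I} - \mathbf{S}_v) \mathbf{X}_c]^{-1} \mathbf{X}_c^T (\mathbf{I} - \mathbf{S}_v)^T (\mathbf{I} - \mathbf{S}_v). \tag{27}$$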
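The two-step estimation in (20) to (27) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the Gaussian weights, the bandwidth and the simulated data (one constant coefficient, a constant intercept and a varying slope placed in the varying part) are assumptions.

```python
import numpy as np

def gwr_weights(coords, i, h):
    """Assumed Gaussian weights exp[-(d/h)^2] at location i."""
    d = np.linalg.norm(coords - coords[i], axis=1)
    return np.exp(-(d / h) ** 2)

def mgwr_fit(Xc, Xv, y, coords, h):
    """Two-step MGWR estimates: beta_c (23), beta_v (24) and hat matrix S (27)."""
    n = len(y)
    Sv = np.empty((n, n))                      # GWR hat matrix of the varying part
    for i in range(n):
        XtW = Xv.T * gwr_weights(coords, i, h)
        Sv[i] = Xv[i] @ np.linalg.solve(XtW @ Xv, XtW)
    R = np.eye(n) - Sv
    Zc = R @ Xc                                # (I - S_v) X_c
    P = np.linalg.solve(Zc.T @ Zc, Zc.T @ R)   # maps Y to beta_c, equation (23)
    beta_c = P @ y
    y_tilde = y - Xc @ beta_c
    beta_v = np.empty((n, Xv.shape[1]))
    for i in range(n):
        XtW = Xv.T * gwr_weights(coords, i, h)
        beta_v[i] = np.linalg.solve(XtW @ Xv, XtW @ y_tilde)   # equation (24)
    S = Sv + R @ Xc @ P                        # equation (27)
    return beta_c, beta_v, S

# Simulated data (hypothetical).
rng = np.random.default_rng(4)
n = 80
coords = rng.uniform(0, 1, size=(n, 2))
Xc = rng.normal(size=(n, 1))
Xv = np.column_stack([np.ones(n), rng.normal(size=n)])
y = (1.5 * Xc[:, 0] + 2.0 + (1.0 + coords[:, 0]) * Xv[:, 1]
     + 0.1 * rng.normal(size=n))

beta_c, beta_v, S = mgwr_fit(Xc, Xv, y, coords, h=0.3)
y_hat = S @ y                                  # fitted values, equation (26)
```

No iteration is involved: one pass builds $\mathbf{S}_v$, one linear solve gives the constant coefficients, and a second pass gives the varying coefficients.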

Here, we suggest a generalized cross-validation method for selecting the value of the bandwidth, which can reduce the computational overhead considerably.
To show clearly the dependence of the fitted values of the response on the bandwidth $h$, we write (26) as

$$\hat{\mathbf{Y}}(h) = (\hat{y}_1(h), \hat{y}_2(h), \ldots, \hat{y}_n(h))^T = \mathbf{S}(h)\mathbf{Y},$$

where $\mathbf{S}(h)$ is the hat matrix given in (27). Let

$$\mathrm{GCV}(h) = \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i(h)}{1 - s_{ii}(h)} \right)^2,$$

where $s_{ii}(h)$ is the $i$th diagonal element of $\mathbf{S}(h)$ and $\hat{y}_i(h)$ is the $i$th fitted value of $y$. Select $h_0$ as the desirable value of $h$ such that

$$\mathrm{GCV}(h_0) = \min_{h > 0} \mathrm{GCV}(h). \tag{28}$$
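The criterion (28) only requires the hat matrix and the observed responses, so no refitting with observations left out is needed. A rough sketch follows, in which the function names, the Gaussian weights, the simulated data and the grid of candidate bandwidths are all assumptions for illustration.

```python
import numpy as np

def mgwr_hat_matrix(Xc, Xv, coords, h):
    """Hat matrix S(h) of equation (27), assuming Gaussian weights."""
    n = Xv.shape[0]
    Sv = np.empty((n, n))
    for i in range(n):
        d = np.linalg.norm(coords - coords[i], axis=1)
        XtW = Xv.T * np.exp(-(d / h) ** 2)
        Sv[i] = Xv[i] @ np.linalg.solve(XtW @ Xv, XtW)
    R = np.eye(n) - Sv
    Zc = R @ Xc
    return Sv + R @ Xc @ np.linalg.solve(Zc.T @ Zc, Zc.T @ R)

def gcv_score(S, y):
    """GCV criterion: squared residuals rescaled by 1 - s_ii."""
    resid = y - S @ y
    return np.sum((resid / (1.0 - np.diag(S))) ** 2)

# Simulated data (hypothetical) and a hypothetical bandwidth grid.
rng = np.random.default_rng(5)
n = 80
coords = rng.uniform(0, 1, size=(n, 2))
Xc = rng.normal(size=(n, 1))
Xv = np.column_stack([np.ones(n), rng.normal(size=n)])
y = (1.5 * Xc[:, 0] + 2.0 + (1.0 + coords[:, 0]) * Xv[:, 1]
     + 0.1 * rng.normal(size=n))

grid = np.linspace(0.2, 1.5, 14)
scores = [gcv_score(mgwr_hat_matrix(Xc, Xv, coords, h), y) for h in grid]
h0 = grid[int(np.argmin(scores))]   # selected bandwidth
```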

5.2.2 An inference framework of the MGWR model

We describe in this section a framework of statistical inference on the MGWR model. As in a partially linear model, the constant coefficients in an MGWR model are frequently of primary interest because of their explanatory power, so we henceforth focus the inference on the constant coefficients to illustrate the framework. One important inference problem is whether or not a certain variable in the constant coefficient part is statistically significant. This amounts to testing the following hypotheses for a given $k$ $(1 \le k \le q)$:

$$H_0: \beta_k = 0 \quad \text{vs} \quad H_1: \beta_k \ne 0.$$

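We first fit the full MGWR model (14) (that is, under $H_1$) by the method proposed above and denote by $\mathbf{S}_1$ the hat matrix in (27). Then the residual sum of squares under $H_1$ is

$$\mathrm{RSS}(H_1) = (\mathbf{Y} - \mathbf{S}_1 \mathbf{Y})^T (\mathbf{Y} - \mathbf{S}_1 \mathbf{Y}) = \mathbf{Y}^T (\mathbf{I} - \mathbf{S}_1)^T (\mathbf{I} - \mathbf{S}_1) \mathbf{Y} = \mathbf{Y}^T \mathbf{R}_1 \mathbf{Y}, \tag{29}$$

where $\mathbf{R}_1 = (\mathbf{I} - \mathbf{S}_1)^T (\mathbf{I} - \mathbf{S}_1)$.
We then fit the reduced MGWR model under $H_0$ (that is, let $\beta_k = 0$ in the model (14)) with the same method and the same value of the bandwidth as those under $H_1$. Denote by $\mathbf{S}_0$ the resulting hat matrix. Then the residual sum of squares under $H_0$ is

$$\mathrm{RSS}(H_0) = (\mathbf{Y} - \mathbf{S}_0 \mathbf{Y})^T (\mathbf{Y} - \mathbf{S}_0 \mathbf{Y}) = \mathbf{Y}^T (\mathbf{I} - \mathbf{S}_0)^T (\mathbf{I} - \mathbf{S}_0) \mathbf{Y} = \mathbf{Y}^T \mathbf{R}_0 \mathbf{Y}, \tag{30}$$

where $\mathbf{R}_0 = (\mathbf{I} - \mathbf{S}_0)^T (\mathbf{I} - \mathbf{S}_0)$.
If $H_0$ is indeed true, there should not be a significant difference between $\mathrm{RSS}(H_0)$ and $\mathrm{RSS}(H_1)$. Otherwise, $\mathrm{RSS}(H_0) - \mathrm{RSS}(H_1)$ will tend to be larger. Therefore it is natural to propose the test statistic

$$T = \frac{\mathrm{RSS}(H_0) - \mathrm{RSS}(H_1)}{\mathrm{RSS}(H_1)} = \frac{\mathbf{Y}^T (\mathbf{R}_0 - \mathbf{R}_1) \mathbf{Y}}{\mathbf{Y}^T \mathbf{R}_1 \mathbf{Y}}. \tag{31}$$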
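Here, we introduce a bootstrap procedure to derive the p-value of the test as follows.
Step 1. Fix the bandwidth at some properly given value, say $h$, and fit the MGWR model under $H_0$ and under $H_1$ by the aforementioned estimation method. Then calculate the residual sums of squares $\mathrm{RSS}(H_1)$ and $\mathrm{RSS}(H_0)$ in (29) and (30) as well as the observed value $t$ of the statistic $T$ in (31). Furthermore, obtain the residual vector $\hat{\boldsymbol\varepsilon} = (\hat\varepsilon_1, \hat\varepsilon_2, \ldots, \hat\varepsilon_n)^T = \mathbf{Y} - \mathbf{S}_1 \mathbf{Y}$ under $H_1$ and compute $\hat\varepsilon_{ic} = \hat\varepsilon_i - \frac{1}{n}\sum_{j=1}^{n} \hat\varepsilon_j$ for $i = 1, 2, \ldots, n$ to form the centered residual vector $\hat{\boldsymbol\varepsilon}_c = (\hat\varepsilon_{1c}, \hat\varepsilon_{2c}, \ldots, \hat\varepsilon_{nc})^T$.
Step 2. Draw a bootstrap sample $\hat{\boldsymbol\varepsilon}^* = (\hat\varepsilon^*_{1c}, \hat\varepsilon^*_{2c}, \ldots, \hat\varepsilon^*_{nc})^T$ with replacement from $\hat{\boldsymbol\varepsilon}_c = (\hat\varepsilon_{1c}, \hat\varepsilon_{2c}, \ldots, \hat\varepsilon_{nc})^T$. Let

$$\mathbf{Y}^* = \mathbf{S}_0 \mathbf{Y} + \hat{\boldsymbol\varepsilon}^*$$

and

$$T^* = (\mathbf{Y}^*)^T (\mathbf{R}_0 - \mathbf{R}_1) \mathbf{Y}^* / (\mathbf{Y}^*)^T \mathbf{R}_1 \mathbf{Y}^*. \tag{32}$$

Step 3. Repeat Step 2 $B$ times and obtain a bootstrap sample of the statistic $T$ as $T^*_1, T^*_2, \ldots, T^*_B$. The bootstrap p-value of the test is

$$p = \#\{T^*_i;\ T^*_i \ge t\}/B, \tag{33}$$

where $t$ is the observed value of $T$ obtained from Step 1 and $\#A$ represents the number of elements in the set $A$.
Extensive simulations demonstrate that the proposed test with the bootstrap procedure for deriving the p-value is quite accurate and powerful.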
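Steps 1 to 3 can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: the function names, the Gaussian weights, the bandwidth, the number of replicates `B = 199` and the simulated data (two constant coefficients, of which the second is tested) are all hypothetical choices.

```python
import numpy as np

def mgwr_hat_matrix(Xc, Xv, coords, h):
    """Hat matrix of equation (27), assuming Gaussian weights."""
    n = Xv.shape[0]
    Sv = np.empty((n, n))
    for i in range(n):
        d = np.linalg.norm(coords - coords[i], axis=1)
        XtW = Xv.T * np.exp(-(d / h) ** 2)
        Sv[i] = Xv[i] @ np.linalg.solve(XtW @ Xv, XtW)
    R = np.eye(n) - Sv
    Zc = R @ Xc
    return Sv + R @ Xc @ np.linalg.solve(Zc.T @ Zc, Zc.T @ R)

def bootstrap_pvalue(y, S0, S1, B=199, seed=0):
    """Bootstrap p-value (33) for the statistic T of equation (31)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    I = np.eye(n)
    R0 = (I - S0).T @ (I - S0)
    R1 = (I - S1).T @ (I - S1)
    t_obs = (y @ (R0 - R1) @ y) / (y @ R1 @ y)     # Step 1: observed T
    resid = y - S1 @ y                             # residuals under H1
    resid_c = resid - resid.mean()                 # centered residuals
    fitted0 = S0 @ y                               # fitted values under H0
    t_star = np.empty(B)
    for b in range(B):                             # Steps 2 and 3
        y_star = fitted0 + rng.choice(resid_c, size=n, replace=True)
        t_star[b] = (y_star @ (R0 - R1) @ y_star) / (y_star @ R1 @ y_star)
    return np.mean(t_star >= t_obs), t_obs

# Simulated data (hypothetical): both constant coefficients are nonzero.
rng = np.random.default_rng(6)
n = 80
coords = rng.uniform(0, 1, size=(n, 2))
Xc = rng.normal(size=(n, 2))
Xv = np.column_stack([np.ones(n), rng.normal(size=n)])
y = (1.5 * Xc[:, 0] + 1.0 * Xc[:, 1] + 2.0
     + (1.0 + coords[:, 0]) * Xv[:, 1] + 0.1 * rng.normal(size=n))

h = 0.3
S1 = mgwr_hat_matrix(Xc, Xv, coords, h)            # full model (H1)
S0 = mgwr_hat_matrix(Xc[:, :1], Xv, coords, h)     # second coefficient set to 0 (H0)
p, t = bootstrap_pvalue(y, S0, S1)
```

Because the hat matrices are fixed once the bandwidth is fixed, each bootstrap replicate costs only a few matrix-vector products.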

6. Some problems in future research

1. Up to now, studies on the GWR technique have all assumed that the error terms in the model (1) are independent and identically distributed random variables. In practice, the error terms are generally spatially correlated. The influence of spatially correlated error terms on the GWR technique needs to be further investigated. Furthermore, when the error terms follow some specific form of spatial correlation, such as a spatial ARMA process, how to apply the GWR technique to explore spatial non-stationarity of the data remains to be studied.
2. Spatial-temporal data analysis is more useful in practice, because most data sets in, for example, economics, environmental science and epidemiology are related not only to geographical location but also to time. Applying the GWR technique to this kind of data is an interesting topic. One possible approach is to assume that the coefficients in the model (1) are functions of both spatial location and time. How to deal efficiently with the curse of dimensionality may then be an important issue.

References

Brunsdon C, Fotheringham A S, Charlton M, 1996, "Geographically weighted regression: a method for exploring spatial nonstationarity" Geographical Analysis 28 281-298

Brunsdon C, Fotheringham A S, Charlton M, 1998, "Geographically weighted regression: modelling spatial nonstationarity" The Statistician 47 431-443

Brunsdon C, Fotheringham A S, Charlton M, 1999, "Some notes on parametric significance tests for geographically weighted regression" Journal of Regional Science 39 497-524

Fotheringham A S, Brunsdon C, Charlton M, 2002, Geographically Weighted Regression: the Analysis of Spatially Varying Relationships, Wiley, Chichester

Leung Y, Mei C L, Zhang W X, 2000a, "Statistical tests for spatial nonstationarity based on the geographically weighted regression model" Environment and Planning A 32 9-32

Leung Y, Mei C L, Zhang W X, 2000b, "Testing for spatial autocorrelation among the residuals of the geographically weighted regression" Environment and Planning A 32 871-890

Leung Y, Mei C L, Zhang W X, 2003, "Statistical test for local patterns of spatial association" Environment and Planning A 35 725-744

Mei C L, He S Y, Fang K T, 2004, "A note on the mixed geographically weighted regression model" Journal of Regional Science 44 143-157

Mei C L, Wang N, Zhang W X, 2004, "Estimation and inference on mixed geographically weighted regression model", to appear in Environment and Planning A

Speckman P, 1988, "Kernel smoothing in partial linear model" Journal of the Royal Statistical Society, Series B 50 413-436
