Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
Technical Report
TR 02-033
Discovering Co-location Patterns from Spatial Datasets: A General
Approach
Yan Huang, Shashi Shekhar, and Hui Xiong
Hui Xiong
Abstra
t
Given a
olle
tion of boolean spatial features, the
o-lo
ation pattern dis
overy
pro
ess nds the subsets of features frequently lo
ated together. For example, the
analysis of an e
ology dataset may reveal the frequent
o-lo
ation of a re ignition sour
e feature with a needle vegetation type feature and a drought feature.
The spatial
o-lo
ation rule problem is dierent from the asso
iation rule problem. Even though boolean spatial feature types (also
alled spatial events) may
orrespond to items in asso
iation rules over market-basket datasets, there is no
natural notion of transa
tions. This
reates di
ulty in using traditional measures
(e.g. support,
onden
e) and applying asso
iation rule mining algorithms whi
h
use support-based pruning. We propose a notion of user-spe
ied neighborhoods in
pla
e of transa
tions to spe
ify groups of items. New interest measures for spatial
o-lo
ation patterns are proposed whi
h are robust in the fa
e of potentially innite overlapping neighborhoods. We also propose a family of algorithms to mine
frequent spatial
o-lo
ation patterns. Experimental results are provided to show
the strength of ea
h algorithm and design de
isions related to performan
e tuning.
Keywords:
Conta
t author. This work was supported by the Army High Performan
e Computing Resear
h
Center under the auspi
es of the Department of the Army, Army Resear
h Laboratory
ooperative
agreement number DAAD19-01-2-0014, the
ontent of whi
h does not ne
essarily re
e
t the position or
the poli
y of the government, and no o
ial endorsement should be inferred
1 Introdu
tion
Widespread use of spatial databases [10, 24, 25, 36 is leading to an in
reasing interest in
mining interesting and useful but impli
it spatial patterns [9, 16, 20, 23, 32, 26, 28, 5, 29,
35. For example, E-servi
es are growing along with mobile
omputing infrastru
tures
su
h as PDAs and
elluar phones. Finding E-servi
es frequently lo
ated together is of
interest to businesses that want to
ondu
t lo
ation sensitive market promotions su
h
as promoting a taxi servi
e for
ustomers who reserve an E-ti
ket in some lo
ations. In
e
ology, s
ientists are interested in nding frequent
o-o
urren
es among boolean spatial
features, e.g., drought, El Nino, substantial in
rease in vegetation, substantial drop in
vegetation, extremely high pre
ipitation, et
. E
ient tools for extra
ting information
from geo-spatial data, the fo
us of this work, are
ru
ial to organizations whi
h make
de
isions based on large spatial datasets. These organizations are spread a
ross many
domains in
luding e
ology and environmental management, publi
safety, transportation,
publi
health, business, and tourism [3, 14, 18, 11, 32, 37.
Asso
iation rule nding [13, 1, 2, 13, 22, 30, 31, 33 is an important data mining
te
hnique whi
h has helped retailers interested in nding items frequently bought together
to make store arrangements, plan
atalogs, and promote produ
ts together. Spatial
asso
iation rules [17 are spatial
ases of general asso
iation rules where at least one
of the predi
ates is spatial. Asso
iation rule mining algorithms [1, 2, 12 assume that
a nite set of disjoint transa
tions are given as input to the algorithms. In market
basket data, a transa
tion
onsists of a
olle
tion of item types pur
hased together by a
ustomer. Algorithms like apriori [2
an e
iently nd the frequent itemsets from all
the transa
tions and asso
iation rules
an be found from these frequent itemsets.
Many spatial datasets
onsist of instan
es of a
olle
tion of instan
es of boolean
spatial features (e.g., drought, needle leaf vegetation). Figure 1 a) shows the frequent
o-o
urren
es of some point spatial feature types represented by dierent shapes. As
an
be seen, instan
es of spatial features in sets f`+', `'g and f`o', `*'g tend to be lo
ated
together. Figure 1 b) shows an instan
e of
o-lo
ation patterns among extended spatial
features, namely road-types, on an urban road map. Highways often have frontage roads
nearby in large metropolitan areas, e.g. Minneapolis. Identi
ation of su
h
o-lo
ations
is useful in sele
ting test-sites for evaluating in-vehi
le navigation te
hnology [38. While
boolean spatial features
an be thought of as item types, there may not be an expli
it
nite set of transa
tions due to the
ontinuity of the underlying spa
e.
We formalize the
o-lo
ation rule mining problem as follows: Given 1) a set T of K
1
70
60
50
40
30
20
10
10
20
30
40
X
50
60
70
80
(a)
(b)
Figure 1: a) Point Spatial Co-lo
ation Patterns Illustration. Shapes represent dierent
spatial feature types. Spatial features in sets f`+', `'g and f`o', `*'g tend to be lo
ated
together. b) Line String Co-lo
ation Patterns Illustration. Highway 100 and Normandale
Road are
o-lo
ated for several hundred meters. Highways are often
o-lo
ated with
frontage roads.
spatial feature types T = ff1 ; f2 ; : : : ; fK g and their instan
es P = fp1 ; p2 ; : : : ; pN g, ea
h
pi 2 P is a ve
tor < instan
e-id, spatial feature type, lo
ation > where lo
ation 2 spatial
framework S, 2) A symmetri
and re
exive neighbor relation R over lo
ations in S, 3) Min
prevalen
e threshold (min prevalen
e) and min
onditional probability (min
ond prob);
e
iently nd
orre
t and
omplete set of
o-lo
ation rules with parti
ipation index
> min prevalen
e and
onditional probability > min
ond prob.
Approa
hes to dis
overing
o-lo
ation rules in the literature
an be
ategorized into two
lasses, namely spatial statisti
s and asso
iation rules. Spatial
statisti
s-based approa
hes use measures of spatial
orrelation to
hara
terize the relationship between dierent types of spatial features. Measures of spatial
orrelation
in
lude
ross-k fun
tion with Monte Carlo simulation [7,
hi-square tests,
orrelation
oe
ients, and regression models [6 as well as their generalizations using spatial neighborhood relationships. Computing spatial
orrelation measures for all possible
o-lo
ation
patterns
an be
omputationally expensive due to the exponential number of
andidate
Related Work:
(A; B ) will not be found sin
e it does not involve the referen
e feature. Generalizing
this paradigm to the
ase where no referen
e feature is spe
ied is non-trivial. Dening
transa
tions around lo
ations of instan
es of all features may yield dupli
ate
ounts for
many
andidate asso
iations. Dening transa
tions by an ad-ho
data-partition approa
h
[21 attempts to measure the frequen
y of a
o-lo
ation pattern by grouping the spatial
instan
es into disjoint partitions. However, imposing arti
ial disjoint transa
tions via
spa
e partitioning often under
ounts instan
es of tuples interse
ting the boundaries of
arti
ial transa
tions or double-
ounts instan
es of tuples
o-lo
ated together. In addition, there may be multiple partitions yielding distin
t sets of transa
tions, whi
h in
turn yields dierent values of support of the
o-lo
ation. Figure 2
) shows two possible partitions for the dataset of Figure 2 a), along with the support for
o-lo
ation
(A; B ). The ad-ho
approa
h is partitioning sensitive and thus the prevalen
e measure
is ill-dened. A dierent grouping may result in dierent values of the support measure
and thus dierent
o-lo
ation patterns.
In re
ent work, we developed an event
entri
model. For point spatial feature whi
h
provides a transa
tion-free approa
h by using the
on
ept of neighborhood. The event
entri
model is relevant to appli
ations like e
ology where there are many types of
boolean spatial features. E
ologists are interested in nding subsets of spatial features
likely to o
ur in a neighborhood around instan
es of given subsets of event types. For
example, let us determine the probability of nding at least one instan
e of feature type
B in the neighborhood of an instan
e of feature type A in Figure 2 a). There are two
instan
es of type A and both have some instan
e(s) of type B in their neighborhoods.
The
onditional probability for the
o-lo
ation rule is: spatial feature A at lo
ation l !
spatial feature type B in neighborhood is 100%. This yields a well-dened prevalen
e
measure(i.e. support) without the need for transa
tions. Figure 2 d) illustrates that our
approa
h will identify both (A; B ) and (B; C ) as frequent patterns.
This paper extends our re
ent work [27 on the event
entri
model and makes the following new
ontributions. First, it presents a generalized algorithm to dis
over
o-lo
ation patterns from point spatial datasets. The generalized
algorithm in
ludes a novel multi-resolution lter step. Se
ond, it also provides proofs of
orre
tness and
ompleteness for the generalized algorithm. Finally the paper provides
an experimental performan
e evaluation to
ompare alternative
hoi
es for key design
de
isions, su
h as the use of a multi-resolution lter.
Our Contributions:
C1
A1
B1
C1
A1
A2
B2
C2
Example dataset, neighboring
instances of different features
are connected
(a)
C2
B2
Reference feature = C
Transactions{{B1},{B2}}
support(A,B) =
(b)
C2
C1
A1
B1
A2
A2
C1
A1
B1
B1
B2
A2
C2
Support(A,B) = 2
B2
Support(A,B) = 1
B1
C2
B2
Support(A,B) = min(2/2,2/2) = 1
Support(B,C) = min(2/2,2/2) = 1
(d)
Figure 2: Example to Illustrate Dierent Approa
hes to Dis
overing Co-lo
ation Patterns
a) Example dataset. Grid is imposed to illustrate window
enter model b) Ad ho
data
partition approa
h. Support measure is ill-dened and order sensitive
) Referen
e feature
entri
model d) Event
entri
model
Se
tion 2 des
ribes our approa
h for modeling
o-lo
ation patterns and the
asso
iated measures of prevalen
e and
onditional probability. Se
tion 3 proposes a
family of algorithms to mine
o-lo
ation patterns; an analysis of the algorithms in the
areas of
orre
tness,
ompleteness, and
omputational e
ien
y is presented in se
tion
4. We present the experimental setup and results in se
tion 5. Finally, se
tion 6 presents
the
on
lusion and future work.
Outline:
Legend:
T.i represents instance i with feature type T
Lines between instances represents neighbor relationships
e)
t7
B.5
C.3
A.4
k=3
A.3
B.3
d)
A.1
C.1
B.4
a)
t2
t3
1
2
3
4
5
1
2
3
1
t4
candidate co-locations of size 2
A
t1
t2
A
B
C
C
t1
t2
t3
t3
t4
t5
t5
table Id
t6
co-location
1
2
1
4
1
3
2
1
1
1
3
row instance
2
4
5
.5
.6
.4
will be pruned if min_prevalence
set to .5 and algorithm stops
1
k=1
c)
b)
t1
B.2
B.1
1
2
3
4
.2
A.2
C.2
table instance
participation index
k=2
The parti
ipation ratio P r(C; fi ) for feature type fi in a
o-lo
ation C = ff1 ; : : : ; fk g
is the fra
tion of instan
es of fi whi
h parti
ipate in any row instan
e of
o-lo
ation
jdistin
t(fi (table instan
e of C ))j
C . The ratio
an be
omputed as
, where is a relational
jinstan
e of ffi gj
proje
tion operation. For example, in Figure 3, row instan
es of
o-lo
ation fA; B g are
f(A:1; B:1); (A:2; B:4); (A:3; B:4)g. Only two out of ve instan
es, B:1 and B:4 of spatial
feature B , parti
ipate in
o-lo
ation fA; B g. So P r((A; B ); B ) = 2=5 = 0:4.
The parti
ipation index of a
o-lo
ation C = ff1 ; : : : ; fk g is minki=1 fP r(C; fi )g.
In Figure 3, the parti
ipation ratio P r(fA; B g; B ) of feature B in
o-lo
ation fA; B g is
0.4 as
al
ulated above. Similarly, P r(fA; B g; A) is 0.75. The parti
ipation index(A; B )
= min(0.75, 0.4) = 0.4. The
onditional probability of a
o-lo
ation rule C1 ! C2 is
the probability of nding an instan
e of C2 in a neighborhood of instan
e of C1 . It
an
jdistin
t(C1 (all row instan
e of C1 [C2 ))j
be
omputed as
where is a proje
tion operation.
jinstan
e of C1 j
Input:
(a) E = {Event-ID, Event-Type, Location in Space} representing a set of events;
ET = {Set of boolean spatial event types};
(b) Neighborhood relationship function; pair of spatial points;
(c) Interest measure function (e.g. prevalence, conditional probability);
(d) Threshold on prevalence measure and conditional probability;
Output:
A set of co-locations with values of interest measures (i.e. prevalence, conditional
probability) satisfying threshold.
Data Structure:
k = Co-location size
Ck = set of candidate size k co-locations in iteration k = 1, 2, ..., P
Tk = set of table instances of co-locations in C k for k = 1, 2, ..., P
Pk = set of prevalent size k co-locations for k = 1, 2, ..., P
Rk = set of co-location rules of size k for k = 1, 2, ..., P
T_C k = set of coarse-level table instances of size k co-location in C k for k= 1, 2, ..., P
Steps:
Co-location-size k = 1
C1 = ET;
P1 = ET;
T1 = generate_table_instance(C 1, E);
Initialize data structure C k, Tk, Pk, Rk, T_Ck to be empty for k >1
while(not empty Pk) do{
Ck+1 = generate_candidate_colocation (C k, k);
if (fmul = true) then {
C k+1 = multi_resolution_pruning (C k+1);
}
Tk+1 = generate_table_instance ( q,C k+1, Tk);
Pk+1 = select_prevalent_colocation( q,C k+1, Tk+1);
Rk+1 = generate_colocation_rule (P k+1, Tk+1);
}
return union (R 2, ..., Rk+1);
Ck
id
1)-subset
Note that Lk 1 , Lk , and Ck are nested tables [34 where
olumns table instan
e id
refer to table instan
e of appropriate
o-lo
ations.
.instan ek )
It takes the size k + 1
andidate
o-lo
ation set Ck+1 and table instan
es of the
size k prevalent
o-lo
ations as arguments and works as follows:
.table instan
e id1
and
.table instan
e id2 spe
ify the table instan
es of the two
o-lo
ations joined in
apriori gen to produ
e
. Here, a sort-merge join is preferred be
ause the table instan
es of ea
h iteration
an be kept sorted for the next iteration. This follows from a
similar property of apriori-gen [2. Sort order is based on an ordering of the set of itemtypes to order item-types in a
o-lo
ation to form the sort-eld. Finally, all
o-lo
ations
with empty table instan
e will be eliminated from Ck+1 .
The join
omputation for generating table instan
es has two
onstraints, namely a spatial neighbor relationship
onstraint (p.instan
ek , q .instan
ek ) and a
ombinatorial distin
t event-type
onstraint (p.instan
e1 =q .instan
e1, : : : , p.instan
ek 1 =q .instan
ek 1).
9
We examine three strategies for
omputing this join, a geometri
strategy, a
ombinatorial strategy and a hybrid strategy. These are des
ribed in forth
oming subse
tions.
Exploration of other join strategies is beyond the s
ope of this paper but we may explore
su
h strategies in future work.
This approa
h
an be implemented by neighborhood
relationship-based spatial joins of table instan
es of prevalent
o-lo
ations of size k with
table instan
e sets of prevalent
o-lo
ations of size 1. In pra
ti
e, spatial join operations
are divided into a lter step and a renement step [24 to e
iently pro
ess
omplex
spatial data types su
h as point
olle
tions in a row instan
e. In the lter step, the
spatial obje
ts are represented by simpler approximations su
h as the MBR - Minimum
Bounding Re
tangle. There are several well-known algorithms, su
h as plane sweep [4,
spa
e partition [15 and tree mat
hing [19, whi
h
an then be used for
omputing the
spatial join of MBRs using the overlap relationship; the answers from this test form
the
andidate solution set. In the renement step, the exa
t geometry of ea
h element
from the
andidate set and the exa
t spatial predi
ates are examined along with the
ombinatorial predi
ate to obtain the nal result.
Geometri
Approa h:
In Figure 3, table 4 of
o-lo
ation fA; B g and table 5 of
o-lo
ation fA; C g
are joined to produ
e the table instan
e of
o-lo
ation fA; B; C g be
ause
o-lo
ation
fA; B g and
o-lo
ation fA; C g were joined in apriori gen to produ
e
o-lo
ation fA; B; C g
in the previous step. In the example, row instan
e f3; 4g of table 4 and row instan
e f3; 1g
of table 5 are joined to generate row instan
e f3; 4; 1g of
o-lo
ation fA; B; C g (Table 7).
Row instan
e f1; 1g of table 4 and row instan
e f1; 2g of table 5 fail to generate row
instan
e f1; 1; 2g of
o-lo
ation fA; B; C g be
ause instan
e 1 of B and instan
e 2 of C
are not neighbors.
Example
This approa h hooses the more promising of the spatial and ombinatorial approa hes in ea h iteration. In our experiment, it pi ks the spatial approa h
Hybrid approa h:
10
to generate table instan
es for
o-lo
ations of size 2 and the
ombinatorial approa
h for
generating table instan
es for
o-lo
ations greater than size 2.
3.3 Pruning
Candidate
o-lo
ations
an be pruned using the given threshold on the prevalen
e
measure. In addition, multi-resolution pruning
an be used for spatial dataset with
strong auto-
orrelation [7, i.e., where instan
es of ea
h spatial feature types tend to be
lo
ated near ea
h other.
We rst
al
ulate the parti
ipation indexes for all
andidate
o-lo
ations in Tk+1 . Computation of the parti
ipation indexes
an be a
omplished
by keeping a bitmap of size
ardinality(fi) for ea
h feature fi of
o-lo
ation C . One
s
an of the table instan
e of C will be enough to put 1s in the
orresponding bits in
ea
h bitmap. By summarizing the total number of 1s (pfi ) in ea
h bitmap, we obtain
the parti
ipation ratio of ea
h feature fi (divide pfi by jinstan
e of fi j). In Figure 3
), to
al
ulate the parti
ipation index for
o-lo
ation fA; B g, we need to
al
ulate the
parti
ipation ratios for A and B in
o-lo
ation fA; B g. Bitmap bA = (0,0,0,0) of size four
for A and bitmap bB = (0,0,0,0,0) of size 5 for B are initialized to zeros. S
anning of
table 4 will result in bA = (1,1,1,0) and bB = (1,0,0,1,0). Three out of four instan
es of
A (i.e., 1, 2, and 3) parti
ipate in
o-lo
ation fA; B g. Thus the parti
ipation ratio for A
is .75. Similarly, the parti
ipation ratio for B is .4. The parti
ipation index is minf.75,
.4g = .4.
After the parti
ipation indexes are determined, prevalen
e-based pruning is
arried
out and non-prevalent
o-lo
ations and their table instan
es are deleted from the
andidate prevalent
o-lo
ation sets. For ea
h remaining prevalent
o-lo
ation C after
prevalen
e-based pruning, we keep a
ounter to spe
ify the
ardinality of the table instan
e of C . All the table instan
es of the prevalent
o-lo
ations in this iteration will be
kept for generation of the prevalent
o-lo
ations of size k + 2 and dis
arded after the next
iteration.
Prevalen
e-Based Pruning:
Multi-resolution pruning is learned on a summary of spatial data at a
oarse resolution using a simple re
ti-linear grid. We
ombine all instan
es
of a spatial feature f in ea
h
ell (x; y ) in the grid as a new
oarse instan
e < (x; y ); f; m >
in the
oarse spa
e where m is the number of instan
es of spatial point feature f in
ell
(x; y ). For ea
h
andidate
o-lo
ation generated by apriori gen, we generate its
oarse
Multi-resolution Pruning:
11
table instan
e using new
oarse instan
es and its
oarse parti
ipation index based on
the
oarse table instan
e. Multi-resolution pruning eliminates a
o-lo
ation if its
oarse
parti
ipation indexes fall below the user given threshold. We illustrate the idea with one
example now.
d
Co-location of size 1
and fine level table instances
Co-location of size 1
and coarse level table instances
c1
c2
c2
5
10
(0, 4)
(1, 3)
(5, 0)
(5, 3)
1
3 6
7
9
8 5
5
6
4
3 1
a)
(0, 0) (0, 4)
(0, 3) (1, 3)
(0, 4) 1
(1, 3)
(5, 4)
1
f1
f2
f3
1
2
3
4
5
6
7
1
2
3
4
5
6
7
8
9
10
1
1
2
3
4
1
1
Co-location of size 2
and fine level table instances
3
2
f4
Co-location of size 3
and fine level table instances
f7
3
3
3
3
4
4
6
6
4/7
5
5
6
6
5
5
......
9
9
1
2
1
2
1
2
5/10
4/4
3
4
f5
Co-location of size 2
and coarse level table instances
c4
c5
c6
f6
Co-location of size 3
and coarse level table instances
c7
(0, 4)
(0, 4)
(0, 4)
(0, 4)
(0, 4)
(0, 4)
(1, 3)
(1, 3)
(1, 3)
(1, 3)
(0, 3)
(0, 3)
(0, 4)
(0, 4)
(1, 3)
(1, 3)
(0, 4)
(0, 4)
(1, 3)
(1, 3)
(0, 4)
(1, 3)
(0, 4)
(1, 3)
(0, 4)
(1,3)
(0, 4)
(1, 3)
(0, 4)
(1, 3)
4/7
6/10
4/4
3
3
3
4
4
5
5
5
5
5
6
6
6
6
4/7
4
5
6
5
6
5
6
7
8
9
5
7
8
9
6/10
3
3
4
4
5
5
6
6
1
2
1
2
3
4
3
4
4/7 4/4
5
5
6
6
7
7
8
8
9
9
1
2
1
2
3
4
3
4
3
4
(0, 4) (0, 3)
(0, 4) (0, 4)
(0, 4) (1, 3)
(1, 3) (0, 3)
(1, 3) (0, 4)
(1, 3) (1, 3)
(5, 3) (5, 4)
5/7
(0, 4) (0, 4)
(0, 4) (1, 3)
(1, 3) (0, 4)
(1, 3) (1, 3)
4/7 4/4
(0,3 ) (0, 4)
(0, 3) (1, 3)
(0, 4) (0, 4)
(0, 4) (1, 3)
(1, 3) (0, 4)
(1, 3) (0, 4)
6/10 4/4
7/10
5/10 4/4
will be pruned if min_pre
if set to 0.6,
and
will be further evaluated
and algorithm stops
b)
the
oarse table instan
es. For ea
h spatial feature fi , we add up all the
ounts of point
instan
es in ea
h
oarse instan
e with 1s in its
orresponding bitmap (pfi ) and divide
this by jinstan
e of fi j to get the
oarse-parti
ipation ratio of feature fi . For example,
oarse P r((?; ); ?) = 4=7 sin
e there are 4
oarse row instan
es of (?; ) and 7 ne-grain
instan
es of ?. Similarly,
oarse P r((?; ); ) = 4=4 = 1, yielding
oarse parti
ipation
index(?; ) = min(4=7; 4=4) = 4/7. Figure 5 b) shows
oarse table instan
es of
olo
ations (?; ); (?; ) and (?; ). If the threshold for prevalen
e is set to 0.6, then
o-lo
ation
5
an be pruned by multi-resolution pruning. We also note that the sizes
of
oarse table instan
es are smaller than the sizes of table instan
es at ne resolution.
This shows the possibility of
omputation
ost saving via multi-resolution pruning for
lustered datasets. Finally, the examples in Figure 5 b) show that the
oarse parti
ipation
ratios and parti
ipation indexes never underestimate the true parti
ipation indexes of the
original dataset.
Proof:
13
where
0
. The parti
ipation index is also monotoni
be
ause 1) the parti
ipation ra+1
tio is monotoni
and 2)pi(
[ fk+1) = minki=1
fpr(
[ fk+1; fi)g minki=1fpr(
[ fk+1; fi)g
minki=1 fpr(
; fi )g = pi(
). Given this property, spatial feature level pruning
an be ee
tive. We
an also rely on a
ombinatorial approa
h and use apriori gen [2 to generate
size k + 1
andidate
o-lo
ations from size k prevalent
o-lo
ations.
0
Lemma 2
: When
o-lo
ation size = 1, the value of the
oarse parti
ipation index and
the true parti
ipation index is 1, so Lemma 2 is trivially true. Suppose Lemma 2 is
true for
o-lo
ations size=k. Let us
onsider the
ase that
o-lo
ation size is equal to
k+1. For ea
h
andidate
o-lo
ation C of size k + 1 generated from the apriori gen by
joining C1 and C2 of size k , we generate its
oarse instan
e table by joining the
oarse
instan
e tables of C1 and C2 . Be
ause Lemma 2 is true for
o-lo
ations of size k , the
andidate
o-lo
ation set of size k found is a superset of the prevalent
o-lo
ation set
on the original dataset. Thus C1 and C2 are in the
andidate
o-lo
ation set in the
previous iteration and their
oarse level table instan
es are available to be joined to
produ
e the
oarse level table instan
e of C . The table join to produ
e the
oarse table
instan
e of C has the following property: if N eighborR (p1 ; p2 ) is in the original dataset,
then
oarse N eighbor(
ell
1 ;
ell
2 ) will be in the
oarse-level dataset given p1 2
1 and
p2 2
2 . When we
al
ulate the
oarse parti
ipation index, any spatial feature instan
e
whi
h parti
ipates in the
o-lo
ation in the original dataset will
ontribute to the
ounts
during the
oarse parti
ipation ratio
al
ulation. So the
oarse parti
ipation ratios never
underestimates the true parti
ipation ratios, implying that the
oarse parti
ipation index
never underestimates the true parti
ipation index and that the pruning will not eliminate
any truly prevalent
o-lo
ation. Thus the
andidate
o-lo
ation set after multi-resolution
pruning is a superset of the prevalent
o-lo
ation set on the original dataset.
Proof
Lemma 3
: The spatial join produ
es all pairs (p0 ; p00 ) of instan
es where p0 .feature =
6
00
0
00
p .feature and p and p are neighbors. Any row instan
e of any size 2
o-lo
ation
satisfying these two
onditions in the join predi
ate will be generated. The s
hema
Proof
14
level pruning using apriori gen is
omplete due to the monotoni
ity of the parti
ipation
index as proved in Lemma 1. Then we prove that the join of the table instan
es of C1
and C2 to produ
e the table instan
e of C is
omplete. A
ording to the neighborhood
denition, any subset of a neighborhood is a neighborhood too. For any instan
e I =
fi1; : : : ; ik+1g of
o-lo
ation C , subsets I1 = fi1; : : : ; ik g and I2 = fi1; : : : ; ik 1; ik+1g
are neighborhoods, ik and ik+1 are neighbors, and I1 and I2 are row instan
es of C1
and C2 respe
tively. Joining I1 and I2 will produ
e I . Enumeration of the subsets of
ea
h of the prevalent
o-lo
ations ensures that no spatial
o-lo
ation rules with both
high prevalen
e and high
onditional probabilities are missed. We then prove that multiresolution pruning does not ae
t
ompleteness. By Lemma 2, the
o-lo
ation set found is
a superset of the prevalent
o-lo
ation set on the original dataset. Thus multi-resolution
pruning does not falsely eliminate any prevalent
o-lo
ation.
Lemma 4
Proof:
We will only show that the row instan
e of ea
h
o-lo
ation is
orre
t, as that will
imply the
orre
tness of the parti
ipation index values and that of ea
h
o-lo
ation meeting the user spe
ied threshold. An instan
e I1 = fi1;1; : : : ; i1;k g of C1 = ff1 ; : : : ; fk+1 g
and an instan
e I2 = fi2;1 ; : : : ; i2;k g of C2 = ff1 ; : : : ; fk 1 ; fk+1 g is joined to produ
e an
instan
e Inew = fi1;1 ; : : : ; i1;k ; i2;k g of C = ff1 ; : : : ; fk+1 g if: 1) all elements of I1 and I2
are the same ex
ept i1;k and i2;k ; 2) i1;k and i2;k are neighbors. The s
hema of Inew is
apparently C , and elements in Inew are in a neighborhood be
ause I1 is a neighborhood
and i2;k is a neighbor of every element of I1 .
15
ombinatorial approa
h be
omes
heaper than the geometri
approa
h. This is due to
its exploitation of the sort-merge join strategy while keeping ea
h table instan
e sorted,
resulting in the
omputation
omplexity of O(N ) where N is the size of table instan
es.
In a hybrid approa
h, we pi
k the
heaper of the two basi
strategies in ea
h iteration to
a
hieve the best overall
ost.
Se
ond, let us
ompare the
ost of the Co-lo
ation Miner algorithm without and with
the multi-resolution lter step. Let Tm
m (k ) and T
m (k ) represent the
osts of iteration
k of the Co-lo
ation Miner algorithm with and without the Multi-resolution lter.
Tm
m (k )
Tprune C
and;k
( (
+1);grid data)
(2)
Tprune(C ( and;k+1);data)
Furthermore, we assume that the average time to generate a table instan
e in the
original dataset is Torig (k ) for iteration k and the average time to generate a table instan
e
in the grid dataset is Tgrid (k ) for iteration k. The number of
andidate
o-lo
ations
generated by the apriori gen is jCk+1 j and the number of
andidate
o-lo
ations after the
oarse instan
e level pruning is jCk0 +1 j, Equation 2
an be written as:
Tm
m (k )
T
m (k )
+1
+1
+1
+1
(3)
The rst term of the ratio is
ontrolled by the \
lumpiness" (the average number of
instan
es of the spatial features per grid
ell) of the lo
ations of spatial features. The
se
ond term is
ontrolled by the ltering e
ien
y of the
oarse instan
e level pruning.
16
When the lo
ations of spatial features are
lustered, the sizes of the ne level table
instan
es are mu
h greater than the sizes of the
oarse level table instan
es and the time
needed to generate ne level table instan
es is greater than the time needed to generate
oarse level table instan
es. In our experiments, as des
ribed in the next se
tion, we
use the parameter m
lump , whi
h
ontrols the number of instan
es
lumping together
for ea
h spatial feature, to evaluate the rst term, and we use the parameter moverlap ,
whi
h represents the possible false
andidate ratios to evaluate the se
ond term. From
the formula, we
an see that the Multi-resolution Co-lo
ation Miner is likely to be more
e
ient than the Co-lo
ation Miner when the lo
ations of spatial features are
lustered
and the false
andidate ratio is high.
Parameter Denition
C1
C2
2
50
50
N
o lo
1
D1 D2
d
rnoise f
rnoise n
moverlap
m
lump
5
5
106
106
The size of the square to dene a
o-lo
ation
10
The ratio the of number of noise features over the number .5
of features involved in generating the maximal
o-lo
ations
The number of noise instan
es
50,000
The number of
o-lo
ation generated by appending one more 1
spatial feature for ea
h
ore
o-lo
ation
The number of instan
es generated for ea
h spatial feature 1
in a neighborhood for a
o-lo
ation
4
5
250
1,000
10
.5
1,000
1
1
by generating a set of instan
es of features from a set of noise features disjoint with the
features involving generation of
ore
o-lo
ations and pla
ing those at random lo
ations
in the global spatial framework.
Nco_loc, 1, 2, moverlap
Generate
Co-locations
D, d, mclump
Generate
Neighborhoods
Noise Model
Add
Noise
Spatial Datasets
Analysis
Measurements
Co-location
Algorithms
Candidate
Algorithms
Metrics
140
120
100
80
60
40
20
0
0
100
200
300
400
500
600
700
800
(a)
(b)
19
Clumpiness Degree = 5
Clumpiness Degree = 10
Clumpiness Degree = 15
Clumpiness Degree = 20
4.5
4
Performance Ratio
4
Performance Ratio
Overlap Degree = 2
Overlap Degree = 4
Overlap Degree = 6
Overlap Degree = 8
4.5
3.5
3
2.5
3.5
3
2.5
1.5
1.5
1
2
Overlap Degree
10
12
14
16
18
20
Clumpiness
(a)
(b)
0.5
Overlap Degree = 2
Overlap Degree = 4
Overlap Degree = 6
Overlap Degree = 8
0.45
0.4
0.4
0.35
0.35
0.5
Clumpiness Degree = 5
Clumpiness Degree = 10
Clumpiness Degree = 15
Clumpiness Degree = 20
0.45
0.3
0.25
0.2
0.15
0.3
0.25
0.2
0.15
0.1
0.1
0.05
0.05
2
Overlap Degree
10
12
14
16
18
20
Clumpiness
(a)
(b)
Figure 9: Filter Time Ratio (a) By Overlap Degree (b) By Clumpiness Degree
or the \
lumpiness" of the lo
ations of ea
h spatial feature type. The overhead of the
multi-resolution lter as a fra
tion of prevalen
e-based pruning de
reases when the degree
of overlap or
lumpiness in
reases. Clumpiness strongly ae
ts the overhead, redu
ing it
from 0.45 to 0.1.
5.4 Ee
t of Noise
The base dataset, generated using parameter values in
olumn C1 of Table 1, used a
re
tangle spatial framework of size 106 106, a square neighborhood of size 10 10,
an average
o-lo
ation size of 5, an average table instan
e size of 50 when m
lump = 1,
a noise feature ratio of 0.5, a noise number of 50,000, and an overlapping degree of 1.
Then we in
reased the noise instan
es up to 800,000 and measured the performan
e, as
shown in Figure 7 (b). The exe
ution time for dis
overing
o-lo
ations of size 2 is and 3+
are shown in the gure. We note that noise-level ae
ts the exe
ution time to dis
over
o-lo
ations of size 2 but does not ae
t the exe
ution time to dis
over larger
o-lo
ations
given
o-lo
ations of size 2. In other words, noise is ltered out during the determination
of
o-lo
ations of size 2.
lem as well as the di
ulties in using traditional measures (e.g. support,
onden
e)
reated by impli
it, overlapping and potentially innite transa
tions in spatial data sets.
We proposed the notion of user-spe
ied neighborhoods in pla
e of transa
tions to spe
ify groups of items and dened interest measures that are robust in fa
e of potentially
innite overlapping neighborhoods. We dened a new spatial measure of
onditional
probability as well as a new monotoni
measure of prevalen
e to allow iterative pruning. The Co-lo
ation Miner, a generalized algorithm for mining
o-lo
ation patterns
was presented and analyzed for
orre
tness,
ompleteness and
omputation
ost. Design
de
ision in the proposed algorithm were evaluated using theoreti
al and experimental
methods. Empiri
al evaluation shows that the geometri
strategy performs mu
h better
than the
ombinatorial strategy when generating size-2
o-lo
ation; however, it be
omes
slower when generating
o-lo
ations with more than 2 features. The hybrid strategy
integrates the best features of the above two approa
hes. Experimental results show
that the Co-lo
ation Miner is tolerant of noise and provides the best overall performan
e.
Furthermore, when the lo
ations of the features tend to be spatially
lustered, whi
h is
often true for spatial data due to spatial-auto
orrelation, the Multi-resolution lter will
signi
antly redu
e
omputation
ost of proposed algorithm.
Several questions remain open. The
o-lo
ation mining problem should be investigated to a
ount for extended spatial data types, su
h as line segments, polygons and
ir
les. We brie
y dis
uss the potential approa
h to deal with extended spatial obje
ts in
Appendix A. We
onsidered only boolean spatial features here. In the real world, the features
an be
ategori
al and
ontinuous. There is a need to extend the
o-lo
ation mining
framework to handle
ontinuous features. Finally, if the lo
ations of features
hange over
time, it should be possible for us to identify some temporal-spatial
o-lo
ation patterns.
A
knowledgments
We thank Raymod T. Ng.(University of British Columbia,Canada), Jiawei Han (Simon
Fraser University, Canada), James Lesage, and Su
harita Gopal(Boston University) for
valuable insights. We also thank C.T. Lu, Xiaobin Ma, and Pusheng Zhang (University
of Minnesota) for their valuable feedba
k on early versions of this paper. We would also
like to express our thanks to Kim Koolt for her timely and detailed feedba
k to help
improve the readability of this paper.
22
Referen
es
[1 R. Agarwal, T. Imielinski, and A. Swami. Mining asso
iation rules between sets of items in large
databases. In Pro
. of the ACM SIGMOD Conferen
e on Management of Data, pages 207-216,
may 1993.
[2 R. Agarwal and R. Srikant. Fast algorithms for Mining asso
iation rules. In Pro
. of the 20th
VLDB, 1994.
[3 P.S. Albert and L.M. M
Shane. A Generalized Estimating Equations Approa
h for Spatially Correlated Binary Data: Appli
ations to the Analysis of Neuroimaging Data. Biometri
s(Publisher:
Washington, Biometri
So
iety, Et
.), 1:627-638, 1995.
[4 L. Arge, O. Pro
opiu
, S. Ramaswamy, T. Suel, and J. Vitter. S
alable Sweeping-Based Spatial
Join. In Pro
. of the Int'l Conferen
e on Very Large Databases, 1998.
[5 S. Chawla, S. Shekhar, W. Wu, and U. Ozesmi. Extending Data Mining for Spatial Appli
ations:
A Case Study in Predi
ting Nest Lo
ations. In Pro
. Intl. Conf. on ACM SIGMOD Workshop on
Resear
h Issues in Data Mining and Knowledge Dis
overy, 2000.
[6 Y. Chou. Exploring spatial analysis in geographi
information system. Onward Press (ISBN:
1566901197), 1997.
[7 N.A.C. Cressie. Statisti
s for spatial data. John Wiley and Sons, (ISBN:0471843369), 1991.
[8 G. Graefe. Sort-merge-join: An idea whose time has(h) passed? In Pro
. of IEEE Conf. on Data
Engineering, 1994.
[9 G. Greenman. Turning a map into a
ake layer of information. New York Times, Feb 12 2000.
[10 R.H. Guting. An Introdu
tion to Spatial Database Systems. In Very Large Data Bases Journal(Publisher: Springer Verlag), O
tober 1994.
23
[11 R.J. Haining. Spatial Data Analysis in the So
ial and Environmental S
ien
es. In Cambridge
University Press, Cambdge, U.K, 1989.
[12 J. Han and Y. Fu. Dis
overy of multiple-level asso
iation rules from large databases. In In Pro
.
1995 Int. Conf. Very Large Data Bases, pages 420-431, Zuri
h, Switzerland, September 1995.
[13 J. Hipp, U. Guntzer, and G. Nakaeizadeh. Algorithms for Asso
iation Rule Mining - A General
Survey and Comparison. In Pro
. ACM SIGKDD International Conferen
e on Knowledge Dis
overy
and Data Mining, 2000.
[14 Issaks, Edward, and M. Svivastava. Applied Geostatisti
s. Oxford University Press, Oxford, 1989.
[15 D. J. DeWitt J. M. Patel. Partition Based Spatial-Merge Join. In Pro
. of the ACM SIGMOD
Conferen
e on Management of Data, pp.259-270, Montreal, Canada, June 1996.
[16 K. Koperski, J. Adhikary, and J. Han. Spatial Data Mining: Progress and Challenges. In Workshop
on Resear
h Issues on Data Mining and Knowledge Dis
overy (DMKD'96), 1996.
[17 K. Koperski and J. Han. Dis
overy of Spatial Asso
iation Rules in Geographi
Informa020111366X
tion Databases. In Pro
. Fourth International Symposium on Large Spatial Data bases, Maine. 4766, 1995.
[18 P. Krugman. Development, Geography, and E
onomi
theory. MIT Press, Camgridge, MA, 1995.
[19 S. T. Leutenegger and M. A. Lopez. The Ee
t of Buering on the Performan
e of R-Trees. In
Pro
. of the ICDE Conf., pp 164-171, 1998.
[20 D. Mark. Geographi
al Information S
ien
e: Criti
al Issues in an Emerging Cross-dis
iplinary
Resear
h Domain. In NSF Workshop, February 1999.
[21 Y. Morimoto. Mining Frequent Neighboring Class Sets in Spatial Databases. In Pro
. ACM
SIGKDD International Conferen
e on Knowledge Dis
overy and Data Mining, 2001.
24
[22 J.S. Park, M. Chen, and P.S. Yu. Using a Hash-Based Method with Transa
tion Trimming for
Mining Asso
iation Rules. In IEEE Transa
tions on Knowledge and Data Engineering, vol. 9, no.
5, pp. 813-825, Sep-O
t 1997.
[23 J.F. Roddi
k and M. Spiliopoulou. A Bibliography of Temporal, Spatial and Spatio-temporal Data
Mining Resear
h. ACM Spe
ial Interest Group on Knowledge Dis
overy in Data Mining(SIGKDD)
Explorations, 1999.
[24 S. Shekhar and S. Chawla. Spatial Databases: A Tour. Prenti
e Hall (ISBN: 0130174807), 2003.
[25 S. Shekhar, S. Chawla, S. Ravada, A. Fetterer, X. Liu, and C.T. Lu. Spatial Databases: A
omplishments and Resear
h Needs. IEEE Transa
tions on Knowledge and Data Engineering, 11(1),
Jan-Feb 1999.
[26 S. Shekhar and Y. Huang. Categorization of Spatial Data Mining Te
hniques. S
ienti
Data
Mining, working
hapter, 2001.
[27 S. Shekhar and Y. Huang. Co-lo ation Rules Mining: A Summary of Results. In Prof. Spatiotemporal Symposium on Database, 2001.
[28 S. Shekhar, C.T. Lu, and P. Zhang. Dete
ting Graph-based Spatial Outliers: Algorithms and
Appli
ations. The Seventh ACM SIGKDD Intl. Conf. on Knowledge Dis
overy and Data Mining,
2001.
[29 S. Shekhar, P. S
hrater, W.R. Raju, and W. Wu. Spatial Contextual Classi
ation and Predi
tion
Models for Mining Geospatial Data. IEEE Transa
tions on Multimedia (spe
ial issue on Multimedia
Databases), 2002.
[30 R. Srikant and R. Agrawal. Mining Generalized Asso
iation Rules. In Pro
. of the 21st Int'l
Conferen
e on Very Large Databases, Zuri
h, Switzerland, 1997.
25
[31 R. Srikant, Q. Vu, and R. Agrawal. Mining Asso
iation Rules with Item Constraints. In Pro
. of
the 3rd Int'l Conferen
e on Knowledge Dis
overy in Databases and Data Mining, Newport Bea
h,
California, Aug 1997.
[32 P. Stolorz, H. Nakamura, E. Mesrobian, R.R. Muntz, E.C. Shek, J.R. Santos, J. Yi, K. Ng, S.Y.
Chien, R. Me
hoso, and J.D. Farrara. Fast Spatio-Temporal Data Mining of Large Geophysi
al
Datasets. In Pro
eedings of the First International Conferen
e on Knowledge Dis
overy and Data
Mining, AAAI Press, 300-305, 1995.
[33 C. Tsur, J. Ullman, C. Clifton, S. Abiteboul, R. Motwani, S. Nestorov, and A. Rosenthal. Query
Flo
ks: a Generalization of Asso
iation-Rule Mining. In Pro
eedings of 1998 ACM SIGMOD,
Seattle, 1998.
[34 J. Ullman. Prin
iples of Database and Knowledge-Base Systems. Vol.1. Computer S
ien
e Press,
1988.
[35 W. Wang, J. Yang, and R. Muntz. STING: A statisti
al information grid approa
h to spatial data
mining. In Matthias Jarke, Mi
hael J. Carey, Klaus R. Dittri
h, Frederi
k H. Lo
hovsky, Peri
les
Lou
opoulos, and Manfred A. Jeusfeld, editors, Twenty-Thrid International Conferen
e on Very
Large Data Bases, pages 186-195, Athens, Gree
e, 1997.
[36 M.F. Worboys. GIS: A Computing Perspe
tive. In Taylor and Fran
is, 1995.
[37 Y. Yasui and S.R. Lele. A Regression Method for Spatial Disease Rates: An Estimating Fun
tion
Approa
h. Journal of the Ameri
an Statisti
al Asso
iation, 94:21-32, 1997.
[38 Yilin Zhao.
26
A Point Object O
A LineString Object O
A Polygon Object O
Neighborhood of O
Neighborhood of O
Neighborhood of O
.
.
Row Instance r
. .
Neighborhood of r
. .
Model
Items
lo al
tures
boolean feature types
event
entri
Transa
tions
dened by
28