Issues Related To Spatial Data

In: J. McKenzie (ed) Proceedings of the epidemiology and state veterinary programmes.
New Zealand
Veterinary Association / Australian Veterinary Association Second Pan Pacific Veterinary Conference,
Christchurch, 23-28 June 1996; 83-105.
Issues related to handling of spatial data

D.U. PFEIFFER
Department of Veterinary Clinical Sciences
Massey University, Palmerston North, New Zealand
Introduction
Epidemiological analyses are mainly conducted using data which does not include or take
account of spatial relationships between the observations studied. More recently, the need for
spatial data analysis has been pointed out as it may provide additional insight when attempting
to reveal epidemiological cause-effect relationships (Rothman 1990). While the theory of
spatial analysis has been an area of research interest for many years, only the advent of
personal computers made the techniques more easily accessible to epidemiologists. Still, even
recent text books on medical or veterinary epidemiology do not provide more than basic
introductions to the subject area of spatial data analysis. This is could seem surprising, as place
has always been seen as part of classic epidemiological triad of time, person, place.
Spatial Data
The distinction between spatial and non-spatial data can easily become the subject of extensive
discussions. In general, observations for which absolute location and/or relative positioning
(spatial arrangement) is taken into account can be referred to as spatial data (Anselin 1992). It
can be subdivided into two major categories representing discrete and continuous phenomena.
Based on the former classification, which has also been called entity view, spatial phenomena
are described using zero dimensional objects such as points, one dimensional objects such as
lines or two dimensional objects such as areas. If space is described using continuous
phenomena, such as in the case of temperature or topography, this has also been described as
field view. In practice, the latter is usually measured based on sampling discrete entities such as
locations in space.
The entity view allows spatial objects to have attributes. Spatial analysis is typically aimed
at the spatial arrangement of the observational units, but can also take into account attribute
information. An analysis conducted only on the basis of the attributes of the observational units
ignoring the spatial relationships is not considered a spatial data analysis.
Spatial Data Analysis
The methods used in spatial data analysis can be broadly categorized in those concerned with
visualizing data, those for exploratory data analysis and methods for development of statistical
models (Bailey and Gatrell 1995). During most analyses, a combination of techniques will be
used with the data first being displayed visually, followed by exploration of possible patterns
and possibly modeling.
Data visualization
One of the first steps in any data analysis should be an inspection of the data. Visual displays of
information using plots or maps will provide the epidemiologist with the basis for generating
hypotheses and, if required, an assessment of the fit or predictive ability of models. Over the
last couple of years interactive computer packages have been developed which allow dynamic
displays of the data. Geographic information systems can be used to produce maps and they
allow the exploration of spatial patterns in an interactive fashion.
Exploratory data analysis
Data exploration is aimed at developing hypotheses and makes extensive use of graphical
views of the data such as maps or scatter plots. Exploratory data analysis makes few
assumptions about the data and should be robust to extreme data values. Simple analytical
models can also be used in this analysis phase.
Models of spatial data
For this type of spatial data analysis specific hypotheses are formally tested or predictions are
made using statistical models of the data. Modeling of spatial phenomena has to incorporate
the possibility of spatial dependence in order to provide a true representation of the existing
effects. Such spatial effects can be either large scale trends or local effects. The first is also
called a first order effect and it describes overall variation in the mean value of a parameter
such as rainfall. The second which is named a second order effect is produced by spatial
dependence and represents the tendency of neighboring values to follow each other in terms of
their deviation from the mean. This can for example be the case with the incidence of an
infectious animal disease affecting animals on farm properties. First order effects can be readily
modeled by standard regression models. The presence of second order effects violates the
independence assumption of standard statistical analysis techniques, and appropriate analysis
techniques will have to take account of the covariance structure in the data giving rise to these
local effects.
Often spatial data are modeled as stationary spatial processes which assumes that while
there may be dependence between neighboring observations, it is independent from absolute
location. A spatial process is isotropic, if in a stationary process covariance between
observations at different locations depends only on the distance but not on direction. Nonstationary data is almost impossible to model as most locations will require different parameter
sets. Therefore, most spatial modeling procedures begin with first identifying a trend in mean
value and then modeling the residuals from this trend as a stationary process.
With any of these models it has to be kept in mind that they are abstractions of reality, and
first or second order effects are artifacts of the modeler. Bailey and Gatrell (1995) conclude
that models can be at best 'not wrong', rather than 'right'. They add that the analyst should
always involve judgment and intuition in statistical modeling.
Problems in Spatial Data Analysis
A major factor influencing spatial data analysis is the geographical scale at which the data is
being analyzed. It may be possible to identify specific non-random patterns at a local level
which when looked at from a national level turn into random variations. Another problem can
be that many spatial data sets are based on irregularly shaped area units or there may be
directional effects. Proximity or neighborhood also may be more difficult to clearly define than
for example in time-series analysis. Any type of spatial analysis will be subject to some degree
of edge effect where area units on the map boundary do have neighbors only in one direction.
Many data analyses have to be conducted with observations based on information summarized
at a particular spatial aggregation level such as at the veterinary district. Inferences from such
analyses may only be correct if used at the same level of aggregation. This situation has also
been called the modifiable areal unit problem.
84
Methods of Spatial Data Analysis

Methods used in spatial data analysis can be divided according to the three main categories of
data to be analyzed. They are point patterns, spatially continuous and area data.
Point patterns
Spatial point patterns are based on the coordinates of events such as the locations of outbreaks
of a disease. It is also possible that they include attribute information such as the time of
outbreak occurrence. Data on point patterns can be based on a complete map of all point
events or a sampled point pattern. The basic interest of a spatial point pattern analysis will be
to detect whether it is distributed at random or represents a clustered or regular pattern. It is
important to recognize that the stochastic process studied relates to the locations where events
are occurring. A spatial point pattern can be quantified in terms of the intensity of the process
using its first order properties, measured as the mean number of events per unit area. Second
order properties or spatial dependency are analyzed on the basis of the relationship between
pairs of points or areas. The latter is typically interpreted as analysis for clustering.
Visualization of spatial point patterns
The method used most frequently to present spatial point patterns is a dot map. It is generally
difficult to assess randomness of a pattern from visual inspection of such a map. It becomes
important to take account of the population at risk when for example inspecting a dot map of
disease outbreaks. One method for representing this difference in population at risk is to use a
cartogram, where the size of the areas is geometrically transformed proportional to the
corresponding population value.
In a case-control study of tuberculosis breakdown in cattle herds from the Waikato region
of New Zealand all cattle herds which had broken down with tuberculosis infection were
compared with a random sample of cattle herds free from infection. Figure 1 presents a series
of dot maps showing the locations of cases and random controls. Inspection of the map with
locations of the case herds could give the observer the impression that they are clustered.
Without inspection of the distribution of random control herds it becomes difficult to
differentiate whether clustering only occurred in case herds or if the distribution of all cattle
herds in this area is inherently non-random. In this situation it clearly is the case that cattle
herds are not randomly distributed throughout the study region. But there is also some
clustering of tuberculosis breakdowns.
Random Control Herds
Case Herds
Case and Control Herds
Figure 1: Dot maps of the locations of herds from a case-control study of tuberculosis breakdown in New Zealand cattle
herds
85
Exploratory analysis of spatial point patterns

Techniques for exploratory spatial analysis of point patterns are aimed at deriving summary
statistics or plots of the observed distribution to investigate specific hypotheses. The methods
used are examining first or second order effects.
First order effects for point patterns can be examined with two techniques - quadrat
counts and kernel estimation. The quadrat methods involve dividing the area into sub-regions
of equal size - quadrats and produce a summary statistic on the basis of the number of counts
per quadrat. The counts are then divided by the size of the area. These techniques give an
indication of the variation of the intensity of the underlying process in space. The disadvantage
of the techniques is that they aggregate the information into area type data which can result in
loss of information. Kernel estimation is a technique which uses the original point locations to
produce a smooth bivariate histogram of intensity. It has been used for example for home
range estimation in wildlife ecology (Izenman 1991).
Second order properties of point patterns can be investigated using the distances between
the points - particularly nearest-neighbor distances. The latter can be estimated using two
techniques - either the distance between a randomly selected event and the nearest neighboring
event or between a randomly selected location in space and the nearest event. Spatial
dependence can be investigated by visual examination of the probability distributions of the
observed nearest -neighbor distances. Clustered events would show a steep part of the
distribution function with lower values, whereas regularity would be indicated by steepness of
the curve with higher values. The k - function will allow taking into account not just the
nearest events. It depends on the assumption of an underlying isotropic process and is
problematic to use in the presence of significant first order effects.
Modeling of spatial point patterns
Spatial point modeling techniques are aimed at explaining an observed point pattern, and
typically involve comparison with the model of complete spatial randomness (CSR). A point
pattern generated by a random spatial process should follow a homogeneous Poisson process.
This implies that every event has an equal probability of occurring at any position in the study
area and occurrence is independent of the location of any other event, hence the absence of
first order and second order effects. It is against this basic model that the analysis will assess
whether the point process is regular, clustered or random. There are a range of methods
available to test for CSR. Some are based on quadrat counts such as the index of dispersion
tests, others use nearest-neighbor distances such as the Clark-Evans test or the K function.
Comparison of an observed pattern with CSR has its limitations in epidemiology as it does not
allow definition of the type of point process other than whether it is completely random in
space or not. It also cannot take account of issues such as a clustered underlying population at
risk. Alternative models which could be used include the heterogeneous Poisson process, the
Cox process, the Poisson cluster process or Markov point process (Bailey and Gatrell 1995).
Getis and Ord (1992) describe the use of a distance statistic G which can be used to assess
spatial autocorrelation for point patterns as well as for area data. It can be used to detect local
pockets of dependence which might not show up when using a global statistic.
The analysis of point patterns is important in veterinary epidemiology as it allows
inferences on the occurrence of spatial clustering. The presence of clustering would suggest
infectiousness or the presence of specific environmental risk factors. Second order effects in a
spatial process can be the result of disease clustering. Disease clustering can be assessed using
a number of methods and they can be categorized into general and focused tests (Waller and
86
Lawson 1995). The latter tests relate to the clustering of events around fixed point locations
such for example a nuclear power plant. Wartenberg and Greenberg 1990 describe techniques
for detection of hot spot clusters and clinal clusters.
A tool which can be effectively used for the analysis of clustering effects is the K function
(Kingham, Gatrell, and Rowlingson 1995). In this context, two classes of point processes such
as cases of disease and random controls without the disease are compared. The principle is that
both point processes are pooled and then the point process describing the cases is compared
with the pooled process. Cuzick and Edwards (1990) developed a method which is also based
on nearest-neighbor distances. The test statistic simply compares the number of case-case pairs
for a given number of nearest neighbors. Applying this technique to the case-control data
mentioned above it appears that there is significant clustering of cases compared with the
control population (see Figure 2). The software Stat! (BioMedware, Ann Arbor, Michigan,
U.S.A.) was used to run the analysis and produce the graph.
Figure 2: Cuzick and Edwards method applied to tuberculosis breakdown case control study data (+ = cases, =
controls, arrows identify nearest neighbors)
Kingham, Gatrell, and Rowlingson (1995) describe a method combining the Diggle and
Chetwynd method based on bivariate K functions with results of a logistic regression analysis
allowing them to make use of information about additional covariates to test for clustering.
Methods aimed at the detection of hot spots or small areas which might represent clusters
of disease include the Geographical Analysis Machine developed by Openshaw (1990). This
technique is based on comparing the observed intensity of cases in circles of varying radius.
The result of the analysis is a map with circles indicating the areas where case incidence was
higher than expected under the assumption of spatial randomness. Alexander and Cuzick
(1992) reviewed methods for assessment of disease clusters. Wartenberg and Greenberg
(1993) discuss problems associated with detection of disease clusters.
87
Other analyses using point pattern information

Point events in space may have time of occurrence as one particular attribute. Specific methods
are available to investigate clustering in space and time. Interaction is present if pairs of cases
are near in space as well as in time. Contagious diseases requiring direct contact will produce
space-time clustering between cases. The techniques described below are not considered useful
for non-contagious diseases. There are three main methods available which allow assessment
of a space-time relationship between cases: Knoxs method, Mantels method and the Knearest neighbor method. All three techniques require production of distance matrices of the
spatial as well as the temporal relationship between cases.
Data from a longitudinal study of Mycobacterium bovis infection in a wild possum
population in New Zealand will be used to demonstrate the usage of these techniques (Pfeiffer
1994). As a first step a histogram of the geographical nearest neighbor distances has to be
produced as shown in Figure 3. The distribution of nearest neighbor distances expected under
spatial randomness is shown in the histogram. It assumes uniform population density across the
study area. The excess of shorter distances compared with the Poisson probability density
function suggests that there may be spatial clustering in this data set. The map in the same
figure indicates the locations of the 26 cases used in the analysis and their nearest neighbors are
indicated by the arrows. Figure 4 presents the temporal distance distribution and a map
connecting the cases according to their sequence of occurrence. The map does suggest the
disease has shifted its spatial focus over time and that there is some degree of temporal
clustering.
Figure 3: Map and histogram of geographical distances between cases of tuberculosis infection in wild possums
88
Figure 4: Temporal distance map and histogram for cases of tuberculosis infection in wild possums
As a next step a formal statistical test has to be conducted to assess the statistical significance
of a potential space-time interaction process. When using Knoxs method, a critical distance in
time as well as in space defining closeness has to be set and pairs of cases are tabulated into a
2*2 contingency table with spatial and temporal closeness/farness defining the rows and
columns (Knox 1964). Knox saw the critical distance as defining latency period. In most
situations determining the critical distance requires a subjective decision. Approximate
randomization permutation techniques are used to construct a Null distribution for Knoxs test
statistic. Figure 5 shows the results of Knoxs test applied to the tuberculous possum data
using a critical distance of 100m in space and 3 months in time. The result of 30 for Knoxs
test statistic X which is significant at a p-value of 0.02 suggests that given the selected critical
distances time-space interaction is present in this data set.
89
Figure 5: Results of Knoxs method applied to cases of tuberculosis infection in wild possums
Another approach to investigate time-space interaction could be the use of the Mantel method
(Mantel 1967). Mantels approach does not require selection of critical distances. It uses both,
time and space distance matrices between all cases. But it should be kept in mind that the
Mantel test can be insensitive to non-linear associations between time and space distances.
Distance measures can be transformed in a number of ways, such as the reciprocal
transformation which reduces the effect of large time and space distances. The Null hypothesis
is that the time distances are independent of the space distances. Randomization permutation
techniques can be used to generate a test statistic for the Mantel test. Figure 6 presents the
results of this analysis when applied to the data on tuberculous possums. The scatter plot of
space distances against temporal distances seems to suggest while the points are scattered
throughout the plot that there some denser accumulations of cases present. The frequency
distribution of the test statistic under the Null hypothesis on the basis of 500 random
permutations is presented in the left window of Figure 6. It can be concluded that there is
significant space-time interaction.
90
Figure 6: Results from applying the Mantel method to test for time-space interaction between cases of tuberculosis
infection in wild possums
A third approach available is the K-nearest neighbor test of space-time interaction in point
data. The test statistic indicates the number of case pairs which are K nearest neighbors in time
and space. The statistic is based on an approximate randomization of the Mantel product
statistic. Figure 7 presents the results from applying the K-nearest neighbor method to the
possum tuberculosis data. The map shows the locations of the cases and the arrows indicate
k=2 nearest neighbors. The test statistic produced on the basis of 1000 random permutations
suggests that only the cumulative statistic Jk is statistically significant, whereas Jk is not. The
latter parameter measures the statistical significance from increasing K by 1. The test statistic
supports the presence of space-time interaction, and suggests that the first 5 nearest neighbors
are involved in space-time interaction.
91
Figure 7: Results from applying the K-nearest neighbor method to test for time-space interaction between cases of
tuberculosis infection in wild possums
Spatially continuous data

The point pattern analyses assessed characteristics of the spatial distributions of points, but
made only limited use of attribute information. With spatially continuous and also area data the
analysis focus shifts towards use of the attribute information, in order to describe their pattern
in space. Spatially continuous data is also often referred to as geostatistical data. The data is
usually collected by sampling at fixed points in space. The main objective of the analysis will be
to describe the spatial variation in an attribute value, using the data collected at the sampled
points. The spatial variation can be modeled as first and second order spatial processes.
Visualization of spatially continuous data
The data values obtained from the sampled locations can be mapped using proportionally
scaled symbols or columns for each sampling point. Column charts will in fact allow
presentation of multiple attribute values at the same point. Overlapping symbols can cause a
problem with interpretation of such maps. But none of these approaches will be able to
represent the underlying continuity of the process studied.
Exploratory analysis of spatially continuous data
The techniques used in this area can be categorized into methods for describing first order
effects and those for second order effects.
92
Exploratory analysis of first order effects in spatially continuous data

The main techniques in this area are spatial moving averages, tessellation methods and kernel
estimation techniques. A spatial moving average interpolates values between a given number
of neighboring sampling points. The more points are used the smoother the created surface will
be. It will be possible to describe global trends. A weighting mechanism can be introduced to
account for varying distances between sampling points. Alternatively a tessellation of the
observed sample points can be used. This is most commonly done using Delauney
triangulation, also referred to as a triangulated irregular network (TIN). This method assigns
to each sampling point a territory in which each point is closer to this sampling point than to
any other. The resulting polygon map is called a Dirichlet tessellation and the tiles are known
as Voronoi or Thiessen polygons. Such a TIN can be used to construct a contour map or a
digital terrain model (DTM). Figure 8 shows a triangulated irregular network based on height
information from a 50 m grid plus break lines. The TIN was then used to create a contour map
and a digital terrain model of possum tuberculosis longitudinal study area. Pfeiffer (1994) used
this technique to generate polygons from point locations describing the areas where particular
Mycobacterium bovis strains occurred. The technique can be appropriate for presence/absence
type information, where a smooth surface is not the objective of the interpolation.
TIN
Contour map
DTM
Figure 8: TIN and derived contour maps and digital terrain model for the possum tuberculosis longitudinal study site
As with point patterns it is also possible to use kernel estimation to convert the attribute data
from the sampling points into a surface. This time not using the number of events per unit area
but rather the value of the attribute. This technique has been used in geographical
epidemiology to model the relative risk function, measuring local risk relative to the regional
mean (Bithell 1990).
Exploratory analysis of second order effects in spatially continuous data
Spatial dependence between attribute values measured at sampled locations is described using
the covariance function or covariogram. The presence of second order effects would result in
positive covariance between observations a small distance apart and lower covariance or
correlation if they are further apart. The covariogram describes the function of the covariance
for varying distances h between sample points and the correlogram the corresponding
correlation. The semi-variogram is a graphical representation of the variation between
sampling points separated by a given distance and direction. For a stationary spatial process all
three describe similar information. Estimates of the semi-variogram are considered to be more
robust to departures from stationarity represented as a general trend in the spatial process. A
93
continuous process without spatial dependence will result in a horizontal line. A stationary
process will reach an upper bound, referred to as the sill at a distance h called the range.
Theoretically, the intercept with the y-axis should be at a value of 0 variation. In reality,
sampling error and small scale variation will result in variability at small distances and the
variogram will meet the y-axis not in the origin. This intercept with the y-axis is called the
nugget effect. Variograms which do not reach an upper bound suggest non-stationarity. Figure
9 shows an isotropic sample semi-variogram for the proportion of tuberculous possums
captured at trap sites during the longitudinal study. The shape of the variogram suggests that
the process is non-stationary, but given the relatively small nugget value there is also likely to
be spatial dependence.
Parameters defining a variogram
Example of a variogram
Figure 9: Isotropic semi-variogram for the proportion of tuberculous possums at individual trap sites in the longitudinal
study
Modeling of spatially continuous data

A number of approaches can be used to model or predict spatially continuous data. For the
first-order processes trend surface analysis can be used based on ordinary polynomial least
squares regression. Results have to be treated with caution, because the standard regression
assumptions of independent random errors and heteroscedasticity are likely to be violated.
Lessard et al. (1990) used an inverse distance-weighted mathematical algorithm to interpolate
climatic measurements between sample points. Most trend surface models may be able to
describe an overall trend, but are not useful for local prediction. In the presence of weak first
order, but strong second order effects it is more appropriate to use models fitted to
variograms. Such models can be defined by eye and are most commonly based on the
spherical, exponential or Gaussian model. The fit of a particular model can be assessed
through cross-validation. Figure 10 shows an omni-directional exponential variogram model
for the possum tuberculosis prevalence data. For the model to be valid it would be necessary to
remove the non-stationarity through trend regression.
94
Figure 10: Omnidirectional exponential variogram model for possum tuberculosis prevalence data
The variogram model itself does not allow prediction of values. This can be achieved with
Kriging. This is a weighted moving average technique for estimating the value of a spatially
distributed variable from adjacent values while considering interdependence expressed in a
variogram. It allows the interpolation error to be mapped and from a statistical viewpoint is
considered to be the most satisfactory method for interpolation (Oliver and Webster 1990).
Pfeiffer (1994) used ordinary Kriging to produce a surface of possum population density
based on possum capture data at sample points (see Figure 11). The omnidirectional
variogram suggests that this data is more stationary than the tuberculosis prevalence data, but
it also shows strong spatial dependence. An exponential model was fitted and used as the basis
for Kriging. The distribution of Kriging errors shows that there are some reasonably high
errors and according to the map they are located in one particular area of the study.
95
Omnidirectional variogram model
Histogram of Kriging errors
Map of Kriging errors

Contour map based on Kriging estimates
Figure 11: Variogram model, Kriging errors and estimates for possum density in the longitudinal study on possum
tuberculosis epidemiology
A number of multivariate methods can be used for modeling of spatially continuous data.
Principal components can be used to combine the information from multiple variables into a
small number of components, each of them representing a particular combination of variables
and explaining a particular proportion of the variation in the data. Eastman and Fulk (1993)
used the technique to analyze the information contained in a time series of NDVI maps for
Africa, thereby conducting a space-time analysis. Cliff et al. (1995) discuss the application of
multidimensional scaling (MDS) to spatial epidemiological data. They use the technique to
map geographical information about measles mortality in Australia and New Zealand as
disease space where points with similar disease risks are closer to each other on the MDS map
even though they are far removed geographically. Bailey and Gatrell (1995) discuss a range of
other multivariate analysis techniques for spatially continuous data.
Area data
Attribute data which does have values within fixed polygonal zones within a study area is
referred to as area data or lattice data. The areal units can constitute a regular lattice or grid
or consist of irregular units. It is usually not required to estimate values as they should be
present for all areas. The main emphasis with area data is on detection and explanation of
spatial patterns or trends possibly extended to take account of covariates.
96
Visualization of area data

Area data can be visualized using a wide range of techniques. The Choropleth map is probably
the most commonly used tool. Appropriate use of class intervals and colors to represent values
in a choropleth map is essential. Cartograms or density equalized maps can be used to express
the importance of particular areas. The analyst has to be aware of the problems which can be
caused by the modifiable areal units problem. It is also possible to display several attributes at
the same time, by adding scaled columns or symbols to a choropleth map. Area information
can be presented together with spatially continuous data for example by draping a choropleth
map over a DTM.
N
0
20
40
60
80
100 Kilometers
Cattle per Hectare

0 - 0.4
0.4 - 0.6
0.6 - 0.8
0.8 - 1.0
1.0 - 1.2
Lakes
Canton Boundaries
Choropleth map of cattle density in Switzerland
Thiessen polygons representing areas used by possums

infected with four different strains of Mycobacterium
bovis draped over a DTM
Figure 12: Examples of choropleth maps
Exploration of area data

Informal investigations of hypotheses can be aimed at first order or second order spatial
processes. Most of these techniques require a methodology for measuring proximity. Possible
approaches include various distance measures between polygon centroids as well as presence
or length of a shared boundary. For further analyses the proximity information can be described
through generation of contiguity or spatial weights matrices. This is quite difficult to achieve in
currently available GIS software and there are only few specialized spatial statistics software
packages (e.g. SpaceStat, Regional Research Institute, West Virginia University, Morgantown,
West Virginia, U.S.A.) which can perform such operations.
For investigation of first order effects, a spatial weights matrix can be used to estimate
spatial moving averages. If the data is available on the basis of a regular grid, the median
polish may be appropriate. Kernel estimation can also be applied for investigations of first
order effects in area data.
In the case of second order spatial processes the objective is to explore spatial dependence
of deviations in attribute values from their mean. In the context of area data, this effect is
referred to as spatial autocorrelation and it quantifies the correlation between values of the
same attribute between different locations. The most commonly used techniques for spatial
autocorrelation are Moran's I and Geary's C. The first is closely related to the covariogram
and the second to the variogram used for spatially continuous data. Power analyses of these
statistics have been conducted by Walter (1993) who concluded that the power of Morans I
was highest. A correlogram can be used to graphically display the correlation between values
97
at different spatial lags. If the autocorrelation does not decline after a number of lags, it
indicates the presence of non-stationarity. The correlogram has similar applications in spatial
analysis as it has in time-series analysis for describing patterns. Hungerford (1991) analyzed the
spatial distribution of cattle anaplasmosis between counties within the state of Illinois using
second-order analysis and detected significant spatial clustering within the state.
The above mentioned methods do not provide local indicators of spatial association which
would be useful for identifying so-called hot spots. The Moran scatterplot and spatial lag pies
described in Anselin (1994) can be used to describe local patterns of variation visually.
Quantitative estimates can be obtained using the G statistic by Getis and Ord (1992) or the
local indicators of spatial association by Anselin (1995). The latter can be used as an indicator
of local pockets of non-stationarity (hot spots), similar to the G statistic, and also to assess the
influence of individual data points on the global statistic and to identify outliers. Anselin,
Dodson, and Hudak (1993) describe how these different techniques can be combined to form a
exploratory spatial analysis system. Figure 13 shows a number of examples used by these
authors to display local variation (from Anselins world wide web site
http://lambik2.rri.wvu.edu/esda.htm). The spatial lag pie map superimposes a pie on each area
with the top half of the pie representing the local value and the bottom part the neighboring
values for this particular variable. It gives the observer an appreciation of the ratio between the
local value and the surrounding spatial units. The Moran scatterplot shows the original value
of each observation on the x-axis and the value of its spatial lag on the y-axis. The plot can be
used to identify outliers or even to conduct local regression to further describe the spatial
association. These outliers can then be mapped as shown in Figure 13. The map of the areas
with significant LISA statistic indicates the area were there appears to be spatial
autocorrelation.
98
Spatial Lag Pie Map
Moran scatterplot
Moran scatterplot outliers
Map of areas with significant LISA or G statistic
Figure 13: Example of an exploratory spatial data analysis approach
In landscape ecology, approaches have been developed to describe the interactions among
patches within a landscape mosaic referred to as landscape pattern. Most biological processes
and that includes of course diseases are influenced by a multitude of factors which together
may form a particular pattern. Spatial patterns are particularly difficult to quantify. Ecologists
use the term landscape structure which describes the spatial relationships between habitat
patches within a landscape (Dunning, Danielson, and Leck 1992). The software FRAGSTATS
(McGarigal and Marks, Oregon State University, Corvallis, Oregon, U.S.A.) allows calculation
of a wide range of indices and parameters describing landscape structure which could be used
for further analyses.
Modeling of area data
Modeling techniques are aimed at establishing explanatory relationships between attribute
values of a dependent variables, taking account of the relative spatial arrangement of the areas
and other values associated with each area unit. Again, it is possible to focus the analysis on
first order or second order effects. Multiple ordinary least squares regression can only be
used for preliminary exploratory analyses, but suffers from the problem that in the presence of
spatial dependence the errors are not independent and that the variance is unlikely to be
constant. The presence of spatial dependence can be assessed readily using a spatial
correlogram. A range of spatial regression models have been described by Haining (1990) and
they can be implemented using the SpaceStat software mentioned above. Hungerford (1991)
analyzed the relationship between cattle density and anaplasmosis prevalence on a county basis
99
in Illinois using measures of spatial correlation. Perry et al. (1991) used a GIS to investigate
the occurrence of Rhipicephalus appendiculatus in Africa to identify the factors controlling the
distribution of the vector tick which transmits the parasite Theileria parva causing East Coast
fever, Corridor disease and January disease in cattle. A number of authors have included spatial
data into multivariate analysis as independent variables. Clifton-Hadley (1993) used spatial
descriptive measures, spatial autocorrelation and distance to particular features of interest to
analyse patterns of occurrence of badger-related tuberculosis breakdowns of cattle herds in
south-west England. Pfeiffer (1994) used a GIS to provide for point locations (cases of
disease) specific geographical variables such as height above sea level, aspect, slope and
distance to features of interest which were then used as explanatory variables in multivariate
statistical analysis.
In the field of epidemiology, parameters of interest are very often counts or proportions
which can be modeled using generalised linear modeling techniques rather than ordinary
least-squares regression. It should be noted though that spatial forms of these models are not
well developed yet. Bailey and Gatrell (1995) suggest introducing covariates into the
regression model such as the spatial coordinates or a variable representing regions categorized
broadly by location to remove the effect of spatial dependence. Glass et al. (1995) developed a
risk density map for Lyme disease based on a multiple logistic regression model, but they did
not attempt to remove spatial dependence from the data. A number of different predictive
modeling approaches for spatial data was compared by Williams et al. (1994). They used linear
and non-linear discriminant analysis, tree-based induction and neural networks to map tsetse
distributions in Zimbabwe and concluded that while the simpler methods (linear discriminant
analysis and tree-based induction) were less precise, they were easier to interpret. Figure 14
presents some preliminary results of a logistic regression analysis for prediction of Theileria
parva presence in an African country (this analysis was conducted by Perry,B.D., Kruska,R.L.,
Pfeiffer,D.U. and others at ILRI, Nairobi, Kenya). The regression model includes eight
different environmental and land use variables and is based on information collected at random
sample locations throughout the country. The model was used to generate a risk map
representing the probability of T.parva presence at a particular location given a number of risk
factors included in the model. This map is presented as a DTM and as a raster map. In
addition, two additional raster maps are shown which display the lower and upper 95%
confidence limits of T.parva presence as predicted by the regression model. The receiver
operating characteristic curve (ROC) characterizing the predictive accuracy of the model could
be used to adjust the decision making cut-off for the prediction probability balancing sensitivity
and specificity as required. In this analysis the possible presence of spatial dependence was not
taken into account.
100
Sensitivity
100
90
80
70
60
50
40
30
20
10
0
0.08
0.16
0.32
10
20
30
40
50
60
70
80
90
100
Percent of False Positives
Sampling location
DTM of predicted probability of

Theileria parva presence
Raster map of lower 95% confidence limit of

probability of T.parva presence
ROC curve for logistic regression model
Raster map of predicted probability

of T.parva presence
Raster map of upper 95% confidence limit of

probability of T.parva presence
Figure 14: Results of a multiple logistic regression analysis for prediction of Theileria parva presence
101
The appropriate technique for mapping probabilities or rates is as a measure of relative

risk which could be estimated by dividing the observed risk by an estimate of expected risk.
The standardized mortality ratio has been used widely to represent spatial variation of disease
risk (Elliott, Martuzzi, and Shaddick 1995). Small numbers of observations may result in
extreme values. More recently this problem has become less important through the adoption of
empirical Bayes estimation. These techniques require estimates of a prior probability
distribution which can for example be based on the overall probabilities across all areas.
Bayesian techniques are then used to convert these estimates into posterior probability
estimates. These can be made spatial by using neighborhood probabilities to derive a prior
probability distribution.
Decision making and spatial data
Spatial data together with other non-spatial information is used for decision making purposes.
This has become more difficult because the amount of information available has increased
substantially. Specific decision making tools have been developed attempting to simplify the
process of making the right choice. A recent area of activity has been the adaptation of multicriteria and multi-objective evaluation techniques to spatial problems. Such systems can take
account of uncertainty in the data as well as of the risk of making the wrong decision (Eastman
et al. 1995).
Spatial data has also become an essential component of disease information systems which
are beginning to replace largely manual systems which have been used by decision makers for
the control of endemic and epidemic diseases. The large amounts of data which can be
processed easily, their objectivity and the quickness of response are some of the advantages of
computerized animal disease information systems. GIS provides an essential component of
such systems. An example of an animal disease information system is EpiMAN which was
developed in New Zealand for the management of an outbreak of foot-and-mouth disease
(Morris et al. 1992).
Epidemiological simulation
GIS can provide geographical data which allows computer simulations of the dynamics of
infectious diseases for specific geographical locations. Spatial heterogeneity can be represented
in simulation models resulting in more realistic representations of reality. There are only few
examples where this approach has been used in veterinary epidemiology. Sanson (1993)
described a model of foot-and-mouth disease which represents inter-farm spread of the disease
on a true geographical area, using various transmission mechanisms. Pfeiffer (1994) developed
a geographic simulation model of the dynamics of bovine tuberculosis infection in wild possum
populations. The geographical component is a major feature of this model. The model uses
vegetation maps to represent the ecological conditions of particular environments.
Conclusion
Spatial data has become an important component of disease investigations. The availability of
geographic information systems in combination with fast and relatively inexpensive computer
hardware leaves the epidemiologist with the responsibility of making effective use of the
information. While the descriptive techniques for spatial data have been available for a long
time, exploratory and modeling techniques are still a very active area of development and they
are not as accessible to the analyst so that they could become a routine component of
epidemiological analysis.
102
References
Alexander,F.E. and J. Cuzick. 1992. Methods for the assessment of disease clusters.
Geographical and environmental Epidemiology: Methods for small-area Studies. Editors
P. Elliott, J. Cuzick, D. English, and R. Stern, 238-50. 382 . Oxford: Oxford University
Press.
Anselin,L. 1994. Exploratory spatial data analysis and geographic information systems. New
Tools for spatial Analysis. Editor M. Painho, 45-54. Luxembourg: Eurostat.
Anselin,L. 1995. Local indicators of spatial association - LISA. Geographical Analysis 27, no.
2: 93-115.
Anselin,L. 1992. Spatial data analysis with GIS: An introduction to application in the social
Sciences. 75 . Technical Report Series. Santa Barbara, California: National Center for
Geographic Information and Analysis.
Anselin,L., R. F. Dodson, and S. Hudak. 1993. Linking GIS and spatial data analysis in
practice. Geographical Analysis 1: 3-23.
Bailey,T.C. and A. C. Gatrell 1995. Interactive spatial data analysis. Harlow, Essex, England:
Longman Group. 413pp
Bithell, J. F. 1990. An application of density estimation to geographical epidemiology.
Statistics in Medicine 9: 691-701.
Cliff, A. D., P. Haggett, M. R. Smallman-Raynor, D. F. Stroup, and G. D. Williamson. 1995.
The application of multidimensional scaling methods to epidemiological data. Statistical
Methods in Medical Research 4: 102-23.
Clifton-Hadley, R. S. 1993. The use of a geographical information system (GIS) in the control
and epidemiology of bovine tuberculosis in south-west England. Proceedings of the Society
for Veterinary Epidemiology and Preventive Medicine, editor M. V. Thrusfield, 166-79.
Society for Veterinary Epidemiology and Preventive Medicine.
Cuzick, J., and R. Edwards. 1990. Spatial clustering for inhomogeneous populations. Journal
of the Royal Statistical Society B 52, no. 1: 73-104.
Dunning, J. B., B. J. Danielson, and C. F. Leck. 1992. Ecological processes that affect
populations in complex landscapes. Oikos 65: 169-75.
Eastman, J. R., and M. Fulk. 1993. Long sequence time series evaluation using standardized
principal components. Photogrammetric Engineering and Remote Sensing 59, no. 6: 99196.
Eastman, J. R., W. Jin, P. A. K. Kyem, and J. Toledano. 1995. Raster procedures for multicriteria/multi-objective decisions. Photogrammetric Engineering and Remote Sensing 61,
no. 5: 539-47.
Elliott, P., M. Martuzzi, and G. Shaddick. 1995. Spatial statistical methods in environmental
epidemiology: a critique. Statistical Methods in Medical Research 4: 137-59.
Getis, A., and J. K. Ord. 1992. The analysis of spatial association by use of distance statistics.
Geographical Analysis 24 (3): 189-206.
Glass, G. E., B. S. Schwartz, J. M. Morgan, D. T. Johnson, P. M. Noy, and E. Israel. 1995.
Environmental risk factors for Lyme disease identified with geographic information systems.
American Journal of Public Health 85, no. 7: 944-48.
103
Haining,R. 1990. Spatial Data Analysis in the social and environmental Sciences. Cambridge:
Cambridge University Press.
Hungerford, L. L. 1991. Use of spatial statistics to identify and test significance in geographic
disease patterns. Preventive Veterinary Medicine 11: 237-42.
Izenman, A. J. 1991. Recent developments in nonparametric density estimation. Journal of the
American Statistical Association 86 (413): 205-24.
Kingham, S. P., A. C. Gatrell, and B. Rowlingson. 1995. Testing for clustering of health events
within a geographical information system framework. Environment and Planning A 27:
809-21.
Knox, E. G. 1964. The detection of space-time interaction. Applied Statistics. 13: 25-29.
Lessard, P., R. L`Eplattenier, R. A. I. Norval, K. Kundert, T. T. Dolan, H. Croze, and others.
1990. Geographical information systems for studying the epidemiology of cattle diseases
caused by Theileria parva. Veterinary Record 126: 255-62.
Mantel, N. 1967. The detection of disease clustering and a generalized regression approach.
Cancer Research. 27 (2): 209-20.
Morris,R.S., Sanson,R.L and Stern,M.W. 1992: EPIMAN - A Decision Support System for
Managing a Foot-and-Mouth Disease Epidemic. Proceedings Fifth Annual Meeting of the
Dutch Society for Veterinary Epidemiology and Economy, Wageningen, 1-35.
Oliver, M. A., and R. Webster. 1990. Kriging: a method of interpolation for geographical
information systems. International Journal of Geographical Information Systems 4 (3):
313-32.
Openshaw, S. 1990. Automating the search for cancer clusters: a review of problems,
progress, and opportunities. Spatial epidemiology. Editor R.W. Thomas, 48-78. Pion
Publications.
Perry, B. D., R. Kruska, P. Lessard, R. A. I. Norval, and K. Kundert. 1991. Estimating the
distribution and abundance of Rhipicephalus appendiculatus in Africa. Preventive
Veterinary Medicine 11: 261-68.
Pfeiffer, D. U. 1994. The role of a wildlife reservoir in the epidemiology of bovine
tuberculosis. Unpublished PhD Thesis, Massey University, Palmerston North, New Zealand.
Rothman, K. J. 1990. A sobering start for the cluster busters' conference. American Journal of
Epidemiology 132 Sup 1: S6-S13.
Sanson, R.L. 1993. The development of a decision support system for an animal disease
emergency. Unpublished PhD Thesis. Massey University, Palmerston North, New Zealand.
Waller, L. A., and A. B. Lawson. 1995. The power of focused tests to detect disease
clustering. Statistics in Medicine 14: 2291-308.
Walter, S. D. 1993. Assessing spatial patterns in disease rates. Statistics in Medicine 12: 188594.
Wartenberg, D., and M. Greenberg. 1990. Detecting disease clusters: the importance of
statistical power. American Journal of Epidemiology 132 Sup 1: S156-S166.
Wartenberg, D., and M. Greenberg. 1993. Solving the cluster puzzle: Clues to follow and
pitfalls to avoid. Statistics in Medicine 12: 1763-70.
104
Williams, B., D. Rogers, G. Staton, B. Ripley, and T. Booth. 1994. Statistical modelling of
georeferenced data: Mapping tsetse distributions in Zimbabwe using climate and vegetation
data. Modelling vector-borne and other parasitic Diseases. Editors B. D. Perry, and J. W.
Hansen, 267-80. 369 . Nairobi, Kenya: The International Laboratory for Research on
Animal Diseases.
105

Issues Related To Spatial Data

Caricato da

Informazioni sul documento

Titolo originale

Copyright

Formati disponibili

Condividi questo documento

Condividi o incorpora il documento

Opzioni di condivisione

Hai trovato utile questo documento?

Questo contenuto è inappropriato?

Copyright:

Formati disponibili

Issues Related To Spatial Data

Caricato da

Copyright:

Formati disponibili

In: J. McKenzie (ed) Proceedings of the epidemiology and state veterinary programmes.

Issues related to handling of spatial data

Methods of Spatial Data Analysis

Random Control Herds

Case and Control Herds

Exploratory analysis of spatial point patterns

Other analyses using point pattern information

Spatially continuous data

Exploratory analysis of first order effects in spatially continuous data

Parameters defining a variogram

Modeling of spatially continuous data

Omnidirectional variogram model

Histogram of Kriging errors

Map of Kriging errors

Visualization of area data

Cattle per Hectare

Choropleth map of cattle density in Switzerland

Thiessen polygons representing areas used by possums

Figure 12: Examples of choropleth maps

Exploration of area data

Spatial Lag Pie Map

Moran scatterplot outliers

Map of areas with significant LISA or G statistic

Figure 13: Example of an exploratory spatial data analysis approach

Percent of False Positives

DTM of predicted probability of

Raster map of lower 95% confidence limit of

ROC curve for logistic regression model

Raster map of predicted probability

Raster map of upper 95% confidence limit of

The appropriate technique for mapping probabilities or rates is as a measure of relative

Potrebbero piacerti anche