CHAPTER 1
INTRODUCTION
This chapter provides an introduction to the area of the work. A brief present-day
scenario of the area and the motivation for undertaking the project are discussed, both
the main and secondary objectives of the work are specified, and the project work
schedule is given.
Spatial data, also known as geospatial data or geographic information, is the data or
information that identifies the geographic location of features and boundaries on earth,
such as natural or constructed features, oceans, and more. Spatial data is usually stored as
coordinates and topology and is data that can be mapped.
Cluster analysis is the process of partitioning a set of data objects (or observations) into
subsets. Each subset is a cluster, such that objects in a cluster are similar to one another,
yet dissimilar to objects in other clusters. The set of clusters resulting from a cluster
analysis can be referred to as a clustering. In this context, different clustering methods
may generate different clusterings on the same data set. The partitioning is done by the
clustering algorithms. Hence, clustering is useful in the discovery of previously unknown
groups within the data. It is an important part of spatial data mining since it provides
certain insights into the distribution of data and characteristics of spatial clusters.
The local indicator of spatial association (LISA) is one of the most widely used
techniques in spatial clustering.
In this project we apply various machine learning clustering techniques to spatial data
and compare the results with those of the LISA technique.
Comparing the various cluster detection techniques allows us to provide insight into
clustering quality and execution time, so that we can decide which clustering technique
to use depending on the kind of data we have.
We then analyse the results of the different clustering methods and determine which
method is best suited to the data available to us.
Table 1.1:
CHAPTER 2
BACKGROUND THEORY
2.1 Introduction:
In this chapter we discuss the title of the project, the literature review, the summarized
outcome of the literature review, a general analysis, mathematical derivations and
conclusions.
K Means, DBSCAN and LISA are used to detect clustering in spatial data, and the
results are compared to find the best algorithm for a given type of data.
K Means algorithm:
1. Randomly assign a number, from 1 to K, to each of the observations; these serve as
initial cluster assignments.
2. Iterate until the cluster assignments stop changing:
(a) For each of the K clusters, compute the cluster centroid. The kth cluster centroid is
the vector of the p feature means for the observations in the kth cluster.
(b) Assign each observation to the cluster whose centroid is closest, where closest is
defined using Euclidean distance.
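The K-Means procedure of repeatedly computing centroids and reassigning each observation to its nearest centroid can be sketched in Python. This is an illustrative sketch only, not the project code (the project used the kmeans function in R), and the data points and K value below are made up for the example:

```python
import math

def kmeans(points, k, iters=100):
    """Minimal K-Means sketch: alternate centroid update and reassignment."""
    # Give each observation an initial cluster (round-robin here for
    # determinism; the textbook algorithm assigns initial clusters at random).
    labels = [i % k for i in range(len(points))]
    for _ in range(iters):
        # Step (a): the kth centroid is the vector of feature means of cluster k.
        centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if not members:            # keep an empty cluster seeded
                members = [points[c]]
        	
            centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
        # Step (b): reassign each observation to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:       # assignments stopped changing: converged
            break
        labels = new_labels
    return labels, centroids

# Two well-separated blobs; K-Means should place them in different clusters.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(pts, k=2)
```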
K Means Flowchart:
Important terms in the DBSCAN algorithm are:
1. Epsilon (Eps)
2. MinPts
3. Core points
4. Border points
5. Noise points
Directly density-reachable:
A point p is directly density-reachable from a point q w.r.t. Eps and MinPts if
1. p belongs to N_Eps(q), and
2. q satisfies the core point condition: |N_Eps(q)| >= MinPts.
Density-reachable:
A point p is density-reachable from a point q w.r.t. Eps and MinPts if there is a chain of
points p1, p2, ..., pn with p1 = q and pn = p, such that p(i+1) is directly density-reachable
from p(i).
Density-connected:
A point p is density-connected to a point q w.r.t. Eps and MinPts if there is a point o such
that both p and q are density-reachable from o w.r.t. Eps and MinPts.
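These neighborhood-based definitions can be checked directly in code. The following is a minimal Python sketch (the points and parameter values are illustrative, not taken from the project data):

```python
import math

def eps_neighborhood(points, q, eps):
    """N_Eps(q): all points within distance Eps of q (including q itself)."""
    return [p for p in points if math.dist(p, q) <= eps]

def is_core(points, q, eps, min_pts):
    """Core point condition: |N_Eps(q)| >= MinPts."""
    return len(eps_neighborhood(points, q, eps)) >= min_pts

def directly_density_reachable(points, p, q, eps, min_pts):
    """p is directly density-reachable from q iff p is in N_Eps(q)
    and q satisfies the core point condition."""
    return p in eps_neighborhood(points, q, eps) and is_core(points, q, eps, min_pts)

# Three close points and one isolated point (a noise candidate).
pts = [(0, 0), (0.5, 0), (1, 0), (5, 5)]
```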
DBSCAN Algorithm:
DBSCAN Flowchart:
Fig 2.6 DBSCAN Flowchart
Spatial autocorrelation:
The concept of spatial autocorrelation is one of the most important in spatial statistics
in that it implies a lack of spatial independence. Classical statistics assumes that
observations are independently chosen and are spatially unrelated to each other. The
intuitive concept is that the location of an incident (e.g., a street robbery, a burglary)
is unrelated to the location of any other incident. The opposite condition, spatial
autocorrelation, is a spatial arrangement of incidents such that the locations where
incidents occur are related to each other; that is, they are not statistically independent
of one another. In other words, spatial autocorrelation is a spatial arrangement in which
spatial independence has been violated.
When events or people or facilities are clustered together, we refer to this arrangement
as positive spatial autocorrelation. Conversely, an arrangement where people, events
or facilities are extremely dispersed is referred to as negative spatial autocorrelation; it
is a rarer arrangement, but does exist.
If a user has information on the location of individual events (e.g., robberies), then it
is better to utilize that information with the point statistics. The individual-level
information will contain all the uniqueness of the events.
However, sometimes it is not possible to analyze data at the individual level. The
user may need to aggregate the individual data points to spatial areas (zones) in order
to compare the events to data that are only obtained for zones, such as census data, or
to model environmental correlates of the data points or may find that individual data
are not available (e.g., when a police department releases information by police beats
but not individual streets). In this case, the individual data points are allocated to
zones by, first, spatially assigning them to the zones in which they fall and, second,
counting the number of points assigned to each zone. A user can do this with a GIS
program or with the “Assign Primary points to Secondary Points” routine.
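The aggregation step described above, assigning each point to the zone it falls in and then counting points per zone, can be sketched as follows. The zones here are hypothetical rectangles standing in for real zone polygons, which a GIS routine would use:

```python
def assign_points_to_zones(points, zones):
    """Count how many incident points fall inside each zone.

    zones: {name: (xmin, ymin, xmax, ymax)} -- axis-aligned rectangles
    standing in for real zone boundaries in this toy example.
    """
    counts = {name: 0 for name in zones}
    for x, y in points:
        for name, (xmin, ymin, xmax, ymax) in zones.items():
            if xmin <= x <= xmax and ymin <= y <= ymax:
                counts[name] += 1  # the point is allocated to this zone
                break              # each point is assigned to exactly one zone
    return counts

# Hypothetical police beats and incident locations.
beats = {"beat_A": (0, 0, 5, 5), "beat_B": (5, 0, 10, 5)}
incidents = [(1, 1), (2, 3), (6, 2), (7, 7)]  # (7, 7) falls outside both beats
counts = assign_points_to_zones(incidents, beats)
```

The per-zone count then becomes an attribute of the zone, as described in the text.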
In this case, the zone becomes the unit of analysis instead of the individual data
points. All the incidents are assigned to a single geographical coordinate, typically
the centroid of the zone, and the number of incidents in the zone (the count) becomes
an attribute of the zone (e.g., number of robberies per zone; number of motor vehicle
crashes per zone).
Thus, the distance between zones is a singular value for all the points in those zones
whereas there is much greater variability with the distances between individual
events.
Further, zones have attributes which are properties of the zone, not of the individual
events. The attribute can be a count or a continuous variable for a distributional
property of the zone (e.g., median household income; percentage of households below
poverty level).
Moran's "I" statistic (Moran, 1950) is one of the oldest indicators of spatial
autocorrelation. It is applied to zones or points that have attribute variables associated
with them (intensities). For any continuous variable, Xi, a mean, X̄, can be calculated,
and the deviation of any one observation from that mean, Xi − X̄, can also be calculated.
The statistic then compares the value of the variable at any one location with the value
at all other locations (Ebdon, 1988; Griffith, 1987; Anselin, 1992). Formally, it is
defined as:

I = [N / (Σi Σj Wij)] × [Σi Σj Wij (Xi − X̄)(Xj − X̄)] / [Σi (Xi − X̄)²]

where N is the number of observations and Wij is the weight of the interaction between
locations i and j.
In Moran's initial formulation, the weight variable, Wij, was a contiguity matrix. If
zone j is adjacent to zone i, the interaction receives a weight of 1; otherwise, the
interaction receives a weight of 0. Cliff and Ord (1973) generalized these definitions
to include any type of weight. In more current use, Wij is a distance-based weight:
the inverse distance between locations i and j (1/dij). CrimeStat uses this
interpretation. Essentially, it is a weighted Moran's I where the weight is an inverse
distance.
Unlike a correlation coefficient, the theoretical value of the index does not equal 0 for
lack of spatial dependence, but instead is negative and very close to 0:

E(I) = −1 / (N − 1)

Values of "I" above the theoretical mean, E(I), indicate positive spatial
autocorrelation, while values of "I" below the theoretical mean indicate negative
spatial autocorrelation.
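A minimal sketch of the computation, using a simple contiguity weight matrix and made-up attribute values; the alternating pattern of values on adjacent zones yields an I below E(I), i.e. negative spatial autocorrelation:

```python
def morans_i(values, weights):
    """Moran's I for attribute values x_i and a weight matrix W_ij (W_ii = 0)."""
    n = len(values)
    mean = sum(values) / n
    dev = [x - mean for x in values]                      # deviations Xi - X-bar
    w_sum = sum(weights[i][j] for i in range(n) for j in range(n))
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))        # cross-products
    den = sum(d * d for d in dev)                         # sum of squared deviations
    return (n / w_sum) * (num / den)

# Contiguity weights on a line of 4 zones; alternating high/low values.
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
x = [1, 5, 1, 5]
i_stat = morans_i(x, W)
expected_mean = -1 / (len(x) - 1)   # E(I) = -1/(N-1)
```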
As dij becomes small, then Wij becomes very large, approaching infinity as the
distance between the points approaches 0. If the two zones were next to each other,
which would be true for two adjacent blocks for example, then the pair of
observations would have a very high weight, sufficient to distort the “I” value for the
entire sample. Further, there is a scale problem that alters the value of the weight. If
the zones are police precincts, for example, then the minimum distance between
precincts will be a lot larger than the minimum distance between a smaller
geographical unit, such as a block. We need to take these scales into account.
CrimeStat includes an adjustment for small distances so that the maximum weight can
never be greater than 1.0. The adjustment scales distances to one mile, which is a
typical distance unit in the measurement of crime incidents. When the small distance
adjustment is turned on, the minimal distance is automatically scaled to be one mile.
The formula used is:

Z(I) = [I − E(I)] / SE(I)

where 'I' is the empirical value calculated from a sample, E(I) is the theoretical mean
of a random distribution, and SE(I) is the theoretical standard deviation of E(I).
CHAPTER 3
METHODOLOGY
3.1 Introduction:
3.2 Methodology:
Step I: Data Preprocessing
Aggregate data:
We consider the crime data on women from the year 2013 to perform the clustering
analysis.
The data have the following attributes:
Rape
Kidnapping
Dowry_Deaths
Assault_on_women
Insult_on_Women
Cruelty_by_Husband
Importation_Girls
We treat each district as an object, but the raw data we are dealing with must have the
same number of objects, with the same object names, in order to map onto the shapefile.
This involves some manual cross-checking of the object names (district names).
For some districts the data was split into sub-district records; these sub-district records
had to be merged back into a single district record.
The shapefile is a geospatial vector data format for geographic information system
(GIS) software.
We use two component files of the shapefile format in this project:
.shp – holds the geometry used for geospatial visualization
.dbf – holds the attribute data of each object in tabular (spreadsheet) form
The data from the spreadsheet is mapped onto the Indian districts shapefile.
We use software called ArcGIS to map the data into the shapefile.
Mapping the data into a shapefile is an important step; the result can later be used to
perform the clustering analysis.
Point data:
The data have the following attributes:
In the Greater Manchester crime data there are 31,000 instances; we filtered the data by
category (the violence category), which consists of 7,200 instances. We filtered the data
because our machine could not handle that amount of data.
For point data we used the QGIS software to create the shapefile. We were unable to get
the map in the background because our shapefile consists of a single layer; displaying the
background map requires multi-layer files.
Performing clustering analysis:
LISA:
K Means:
We find the K value by trial and error and choose K according to our data and
the requirements of the project. We use the kmeans function to perform the clustering.
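Trial and error over K is usually guided by how the total within-cluster sum of squares (WSS) falls as K increases (the "elbow" criterion). A small Python sketch with made-up points, not the project's R code:

```python
import math

def wss(points, labels, k):
    """Total within-cluster sum of squares for a given clustering."""
    total = 0.0
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        if not members:
            continue
        # Cluster centroid: the vector of coordinate means of its members.
        centroid = tuple(sum(dim) / len(members) for dim in zip(*members))
        total += sum(math.dist(p, centroid) ** 2 for p in members)
    return total

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
# Splitting the two blobs (k=2) gives a much lower WSS than one cluster (k=1);
# the K at which the drop flattens out is a reasonable choice.
wss_k1 = wss(pts, [0, 0, 0, 0], 1)
wss_k2 = wss(pts, [0, 0, 1, 1], 2)
```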
DBSCAN:
CHAPTER 4
RESULT ANALYSIS
4.1 Introduction:
In this chapter results are analysed and Significance of the result obtained are
discussed .
4.2 Results:
LISA:
Resulting map is shown below:
Fig 4.2 Output obtained using Cluster and Outlier Analysis (Anselin Local Moran's I)
Results obtained using GEODA software:
Fig 4.4 Significance map in GEODA
Results obtained in R:
K Means:
DBSCAN:
AGNES:
Fuzzy clustering:
In fuzzy clustering we performed soft clustering (each point is shown as a mix of red,
green and blue memberships), so it is difficult to delineate the clusters; we are trying to
visualize them in a different way.
K-Medoid:
Results obtained using point data:
K Means :
DBSCAN:
Fig 4.15 KNN-distance plot
AGNES:
FUZZY:
K-Medoid:
4.3 Comparison of results obtained:
The Jaccard similarity index (sometimes called the Jaccard similarity coefficient)
compares members for two sets to see which members are shared and which are distinct. It’s
a measure of similarity for the two sets of data, with a range from 0% to 100%. The higher
the percentage, the more similar the two populations.
This percentage tells you how similar the two sets are. Two sets that share all members
would be 100% similar; the closer to 100%, the more similar the sets are.
Jaccard Distance:
A similar statistic, the Jaccard distance, is a measure of how dissimilar two sets are. It is the
complement of the Jaccard index and can be found by subtracting the Jaccard Index from
100%.
D(X,Y) = 1 – J(X,Y)
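A minimal sketch of both measures, with the percentages used above (the example sets are made up for illustration):

```python
def jaccard_index(x, y):
    """J(X, Y) = |X ∩ Y| / |X ∪ Y|, expressed as a percentage."""
    x, y = set(x), set(y)
    if not (x | y):          # both sets empty: define similarity as 100%
        return 100.0
    return 100.0 * len(x & y) / len(x | y)

def jaccard_distance(x, y):
    """D(X, Y) = 100% - J(X, Y): how dissimilar the two sets are."""
    return 100.0 - jaccard_index(x, y)

# Cluster memberships produced by two hypothetical algorithms.
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}
similarity = jaccard_index(a, b)   # 2 shared members out of 6 distinct members
```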
Rand index:
The Rand index or Rand measure, in statistics and in particular in data clustering, is a
measure of the similarity between two data clusterings. A form of the Rand index may be
defined that is adjusted for the chance grouping of elements; this is the adjusted Rand
index. From a mathematical standpoint, the Rand index is related to accuracy, but it is
applicable even when class labels are not used.
Given a set of elements S and two clusterings X and Y of S, define:
a – the number of pairs of elements in S that are in the same subset in X and in the same
subset in Y;
b – the number of pairs of elements in S that are in different subsets in X and in
different subsets in Y;
c – the number of pairs of elements in S that are in the same subset in X and in
different subsets in Y;
d – the number of pairs of elements in S that are in different subsets in X and in the
same subset in Y.
a + b can be considered the number of agreements between X and Y, and c + d the
number of disagreements between X and Y. The Rand index is

R = (a + b) / (a + b + c + d)
Since the denominator is the total number of pairs, the Rand index represents the frequency
of occurrence of agreements over the total pairs, or the probability that X and Y will agree on
a randomly chosen pair.
Similarly, one can also view the Rand index as a measure of the percentage of correct
decisions made by the algorithm. It can be computed using the following formula:

RAND = (TP + TN) / (TP + TN + FP + FN)

where TP is the number of true positives, TN is the number of true negatives, FP is the
number of false positives, and FN is the number of false negatives.
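Equivalently, the Rand index can be computed by counting agreeing pairs directly; a small sketch with illustrative cluster labels:

```python
from itertools import combinations

def rand_index(labels_x, labels_y):
    """Fraction of element pairs on which two clusterings agree: (a + b) / C(n, 2)."""
    agree = 0
    pairs = list(combinations(range(len(labels_x)), 2))
    for i, j in pairs:
        same_x = labels_x[i] == labels_x[j]
        same_y = labels_y[i] == labels_y[j]
        if same_x == same_y:   # counts both a (same/same) and b (different/different)
            agree += 1
    return agree / len(pairs)

# Identical partitions (cluster names swapped) agree on every pair.
r = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```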
Comparison of results for aggregate data taking LISA as standard:
Table 4.1:
SL No: Algorithm Jaccard Index Rand Index
1 K means 26.5 68.5
2 K medoid 23 68.3
3 AGNES 29 69.8
4 FUZZY 28.12 68.7
5 DBSCAN 30.4 71.3
Table 4.2:
SL No: Algorithm Jaccard Index Rand Index
1 K means 35.6 73
2 K medoid 34.6 72.8
3 AGNES 31.4 65.3
4 DENCLUE 30.8 64955
5 DBSCAN 29.5 61.5
Table 4.3:
CHAPTER 5
CONCLUSION AND FUTURE SCOPE OF WORK
5.1 Conclusion:
We have presented an overview of clustering algorithms that are useful for spatial
clustering analysis. We categorize them into four categories:
1. Partitioning-based
2. Hierarchical-based
3. Density-based
4. Grid-based
Partitioning methods like k-means and k-medoids make use of a technique called
iterative reallocation to improve clustering quality from an initial solution. As these
methods find clusters that are of spherical shape and similar in size, they are more useful
for applications like facility allocation, where the objective is not to find natural clusters
but to minimize the sum of distances from the data objects to their cluster centers.
Unlike the partitioning-based clustering algorithms, which reallocate data objects from
one cluster to another in order to improve the clustering quality, hierarchical clustering
algorithms like AGNES fix the membership of a data object once it has been allocated to
a cluster.
To increase the efficiency of clustering, grid-based clustering methods approximate the
dense regions of the clustering space by quantizing it into a finite number of cells and
identifying cells that contain more than a threshold number of points as dense. A grid-
based approach is usually more efficient than a density-based approach.
To conclude, the hierarchical clustering methods are similar in performance but take
more time compared to the others. Partition-based clustering methods like the k-means
and k-medoid algorithms do not handle irregularly shaped clusters well. The density-
based and grid-based methods are more suitable for handling spatial data, but when time
complexity is considered, grid-based methods are preferable.
The problem with LISA is that it requires a frequency of events associated with each
data point, so it is not suitable for point data where each crime is reported individually,
which makes the count at each data point one. From the research papers we concluded
that fuzzy clustering is the best when dealing with point data; fuzzy clustering also
shows decent results on aggregate data.
Partitional methods like k-means and k-medoid show decent values for aggregate
data and are highly efficient for point data.
REFERENCES
[1]. Neethu C V and Subu Surendra, "Review of Spatial Clustering Methods", SCT
College of Engineering, Trivandrum, India, 2013, 24.
[2]. S. Sivaranjani, S. Sivakumari and Aasha M, "Crime Prediction and Forecasting in
Tamilnadu using Clustering Approaches", Avinashilingam University, Coimbatore,
India, 2016, 6.
[3]. Nilima, A. Puranik, S. M. Shreenidhi and S. N. Rai, "Spatial evaluation of prevalence,
pattern and predictors of cervical cancer screening in India", 2019, 13.
[4]. Ahamed Shafeeq B M and Binu V S, "Spatial Patterns of Crimes in India using Data
Mining Techniques", 2014, 5.
PROJECT DETAILS
Student Details
Student Name Kapugarla Manmadha
Register Number 160907470 Section / Roll No C/47
Email Address Kmanmadha133@gmail.com Phone No (M) 8296538645
Project Details
Project Title A comparative study of various algorithms to detect clustering in
spatial data.