
CHAPTER 1

INTRODUCTION

This chapter provides an introduction to the area of the work. The present day scenario
with regard to the area of the work and the motivation for doing the project are briefly
discussed. Both the main and secondary objectives of the work are specified, and the
project work schedule is given.

1.1 Introduction to the area of the work:

Spatial data, also known as geospatial data or geographic information, is the data or
information that identifies the geographic location of features and boundaries on earth,
such as natural or constructed features, oceans, and more. Spatial data is usually stored as
coordinates and topology and is data that can be mapped.

Cluster analysis is the process of partitioning a set of data objects (or observations) into
subsets. Each subset is a cluster, such that objects in a cluster are similar to one another,
yet dissimilar to objects in other clusters. The set of clusters resulting from a cluster
analysis can be referred to as a clustering; different clustering methods may generate
different clusterings on the same data set. The partitioning is done by the clustering
algorithms, so clustering is useful in discovering previously unknown groups within the
data. It is an important part of spatial data mining since it provides insights into the
distribution of data and the characteristics of spatial clusters.

1.2 Present day scenario:

The local indicator of spatial association (LISA) is one of the most widely used techniques
in spatial clustering.
In this project we apply various machine learning clustering techniques to spatial data
and compare the results with those of the LISA technique.

1.3 Motivation to do the work:

The detection of spatial clusters is important in public health decision making, to
allocate resources for health prevention and to make environmental control decisions.
Comparison of the various cluster detection techniques provides insight into clustering
quality and execution time, so that we can decide which clustering technique to use
depending on the kind of data we have.

1.4 Objective of the work:

To compare the K-means, LISA (local indicators of spatial association) and DBSCAN
clustering algorithms in identifying hotspots, in terms of four factors: time complexity,
inputs required, handling of higher dimensions and handling of irregularly shaped
clusters.

1.5 Target Specifications:

Analysing the results of the different clustering methods and finding which method is
best for the data available to us.

1.6 Project schedule:

Table 1.1: Project schedule

January 2020:
o Framing of research problem
o In-depth learning of the clustering algorithms

February 2020:
o Execution of the clustering algorithms on real data and obtaining results

March 2020:
o Execution of the clustering algorithms on real data and obtaining results

April 2020:
o Report writing and submission
o Preparation of manuscript based on research findings for publication

CHAPTER 2
BACKGROUND THEORY

2.1 Introduction:

In this chapter we discuss the title of the project, the literature review, the summarized
outcome of the literature review, general analysis, mathematical derivations and
conclusions.

2.2 Introduction to project title:

K-means, DBSCAN and LISA are used to detect clustering in spatial data, and the
results are compared to find the best algorithm for a given type of data.

2.3 Literature Review:


Table 2.1: Literature review

Title: Crime Prediction and Forecasting in Tamilnadu using Clustering Approaches
Authors: S. Sivaranjani, Dr. S. Sivakumari, Aasha M
Research findings: Implementation of K-means, DBSCAN and KNN on crime data and comparison of their performances.
Reference: [2]

Title: Review of Spatial Clustering Methods
Authors: Neethu C V, Mr. Subu Surendran
Research findings: Theoretical comparison of spatial clustering methods.
Reference: [1]

Title: Spatial Patterns of Crimes in India using Data Mining Techniques
Authors: Ahamed B M, Dr. Binu V S
Research findings: Found cluster maps for parameters such as population density, riot rate, employment rate, etc.
Reference: [4]

Title: Spatial evaluation of prevalence, pattern and predictors of cervical cancer screening in India
Authors: Nilima, A. Puranik, S. M. Shreenidhi, S. N. Rai
Research findings: The study identified key existing gaps in cervical cancer screening. The maps presented in the study offer a composite representation of the utilization of available screening services across India.
Reference: [3]

2.4 Background theory:

K-means algorithm:

1. Randomly assign a number, from 1 to K, to each of the observations as initial cluster
assignments.

2. Iterate until the cluster assignments stop changing:

(a) For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the
vector of the p feature means for the observations in the kth cluster.

(b) Assign each observation to the cluster whose centroid is closest, where closest is
defined using Euclidean distance.
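As an illustration, a minimal K-means run in R might look as follows. This is only a sketch: the choice of K = 3, nstart = 25 and the use of the built-in iris measurements as stand-in data are assumptions, not the project data.

set.seed(42)                                  # step 1 uses random initial assignments, so fix the seed
X <- scale(iris[, 1:4])                       # stand-in numeric matrix; scaling puts features on one scale
km <- kmeans(X, centers = 3, nstart = 25)     # nstart = 25 repeats the random start and keeps the best run
km$cluster                                    # final cluster assignment of each observation
km$centers                                    # the K centroids: vectors of the p feature means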

K Means Flowchart:

Fig 2.1 K means Flowchart


DBSCAN (Density-Based Spatial Clustering of Applications with Noise):

Important terms in this algorithm are:
1. Epsilon (Eps)
2. MinPoints (MinPts)
3. Core points
4. Border points
5. Noise points

 Take a point and draw a circle with epsilon as the radius.
 If the number of points inside the circle is greater than or equal to MinPoints, the
point is considered a core point.
 If a point does not satisfy the MinPoints condition but has at least one core point
inside its circle, it becomes a border point.
 If both the above conditions fail, the point becomes a noise point.
 Only core and border points are considered to form a cluster; noise points are never
taken into consideration.

Fig 2.2 Noise, core and border points

Directly density-reachable:

A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if:
1. p belongs to NEps(q)
2. Core point condition: |NEps(q)| >= MinPts

Fig 2.3 Directly density-reachable points

Density-reachable:

A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points
p1, p2, …, pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi.

Fig 2.4 Density reachable points


Density-connected:

A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that
both p and q are density-reachable from o w.r.t. Eps and MinPts.

Fig 2.5 Density-connected points

DBSCAN Algorithm:

• Arbitrarily select a point p.
• Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
• If p is a core point, a cluster is formed.
• If p is a border point, no points are density-reachable from p, and DBSCAN visits the
next point of the database.
• Continue the process until all of the points have been processed.
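A minimal DBSCAN run in R with the dbscan package might look as follows; the eps and MinPts values and the stand-in data are assumptions chosen only for illustration.

library(dbscan)                               # provides the dbscan() implementation
pts <- as.matrix(iris[, 3:4])                 # stand-in 2-D point data
db <- dbscan(pts, eps = 0.4, minPts = 5)      # eps = neighbourhood radius, minPts = density threshold
db$cluster                                    # labels per point; 0 marks noise points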

DBSCAN Flowchart:

Fig 2.6 DBSCAN Flowchart

LISA (local indicators of spatial association):

Spatial autocorrelation:

The concept of spatial autocorrelation is one of the most important in spatial statistics
in that it implies a lack of spatial independence. Classical statistics assumes that
observations are independently chosen and are spatially unrelated to each other. The
intuitive concept is that the location of an incident (e.g., a street robbery, a burglary)
is unrelated to the location of any other incident. The opposite condition, spatial
autocorrelation, is a spatial arrangement of incidents such that the locations where
incidents occur are related to each other; that is, they are not statistically independent
of one another. In other words, spatial autocorrelation is a spatial arrangement where
spatial independence has been violated.

When events or people or facilities are clustered together, we refer to this arrangement
as positive spatial autocorrelation. Conversely, an arrangement where people, events
or facilities are extremely dispersed is referred to as negative spatial autocorrelation; it
is a rarer arrangement, but does exist.

Assigning points to data:

If a user has information on the location of individual events (e.g., robberies), then it
is better to utilize that information with the point statistics. The individual-level
information will contain all the uniqueness of the events.

However, sometimes it is not possible to analyze data at the individual level. The
user may need to aggregate the individual data points to spatial areas (zones) in order
to compare the events to data that are only obtained for zones, such as census data, or
to model environmental correlates of the data points or may find that individual data
are not available (e.g., when a police department releases information by police beats
but not individual streets). In this case, the individual data points are allocated to
zones by, first, spatially assigning them to the zones in which they fall and, second,
counting the number of points assigned to each zone. A user can do this with a GIS
program or with the “Assign Primary points to Secondary Points” routine.

In this case, the zone becomes the unit of analysis instead of the individual data
points. All the incidents are assigned to a single geographical coordinate, typically
the centroid of the zone, and the number of incidents in the zone (the count) becomes
an attribute of the zone (e.g., number of robberies per zone; number of motor vehicle
crashes per zone).

Thus, the distance between zones is a singular value for all the points in those zones
whereas there is much greater variability with the distances between individual
events.
Further, zones have attributes which are properties of the zone, not of the individual
events. The attribute can be a count or a continuous variable for a distributional
property of the zone (e.g., median household income; percentage of households below
poverty level).
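The point-to-zone assignment described above can be sketched in R with the sf package; the shapefile name, the coordinate columns and the toy incident coordinates below are hypothetical.

library(sf)                                   # simple-features GIS operations in R
zones <- st_read("zones.shp")                 # hypothetical polygon layer (e.g., districts or beats)
events <- data.frame(lon = c(77.59, 77.61), lat = c(12.97, 12.99))   # toy incident locations
events <- st_as_sf(events, coords = c("lon", "lat"), crs = st_crs(zones))
zones$count <- lengths(st_intersects(zones, events))   # number of incidents falling in each zone

The count column then becomes the zone attribute (e.g., robberies per zone) used in the analysis.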

Moran’s “I” Statistic:

Moran's "I" statistic (Moran, 1950) is one of the oldest indicators of spatial
autocorrelation. It is applied to zones or points that have attribute variables associated
with them (intensities). For any continuous variable, Xi, a mean, X̄, can be calculated,
and the deviation of any one observation from that mean, (Xi − X̄), can also be calculated.
The statistic then compares the value of the variable at any one location with the value
at all other locations (Ebdon, 1988; Griffith, 1987; Anselin, 1992). Formally, it is
defined as:

I = [N / (Σi Σj Wij)] × [Σi Σj Wij (Xi − X̄)(Xj − X̄) / Σi (Xi − X̄)²]

where N is the number of cases, Xi is the value of a variable at a particular location i,
Xj is the value of the same variable at another location (where i ≠ j), X̄ is the mean
of the variable and Wij is a weight applied to the comparison between location i and
location j.

In Moran's initial formulation, the weight variable, Wij, was a contiguity matrix. If
zone j is adjacent to zone i, the interaction receives a weight of 1. Otherwise, the
interaction receives a weight of 0. Cliff and Ord (1973) generalized these definitions
to include any type of weight. In more current use, Wij is a distance-based weight,
the inverse of the distance between locations i and j (1/dij). CrimeStat uses this
interpretation. Essentially, it is a weighted Moran's I where the weight is an inverse
distance.

Unlike a correlation coefficient, the theoretical value of the index does not equal 0 for
lack of spatial dependence, but instead is negative and very close to 0:

E(I) = −1 / (N − 1)

Values of "I" above the theoretical mean, E(I), indicate positive spatial
autocorrelation, while values of "I" below the theoretical mean indicate negative
spatial autocorrelation.

Adjustment for small distances:

CrimeStat calculates the weighted Moran's I using the formula above. However, there
is one problem with this formula that can lead to unreliable results. The distance
weight between two locations, Wij, is defined as the reciprocal of the distance between
the two points, consistent with Moran's original formulation:

Wij = 1 / dij
As dij becomes small, then Wij becomes very large, approaching infinity as the
distance between the points approaches 0. If the two zones were next to each other,
which would be true for two adjacent blocks for example, then the pair of
observations would have a very high weight, sufficient to distort the “I” value for the
entire sample. Further, there is a scale problem that alters the value of the weight. If
the zones are police precincts, for example, then the minimum distance between
precincts will be a lot larger than the minimum distance between a smaller
geographical unit, such as a block. We need to take these scales into account.

CrimeStat includes an adjustment for small distances so that the maximum weight can
never be greater than 1.0. The adjustment scales distances to one mile, which is a
typical distance unit in the measurement of crime incidents. When the small distance
adjustment is turned on, the minimum distance is automatically scaled to be one mile,
so the formula used is effectively:

Wij = 1 / max(dij, 1 mile)

which means the weight equals 1.0 for any pair of points closer than one mile.

Testing the Significance of Moran’s “I” :

The empirical distribution can be compared with the theoretical distribution by
dividing the difference by an estimate of the theoretical standard deviation:

Z(I) = (I − E(I)) / SE(I)

where 'I' is the empirical value calculated from a sample, E(I) is the theoretical mean
of a random distribution and SE(I) is the theoretical standard deviation of E(I).
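The pieces above can be put together in a short hand-rolled R sketch; the toy coordinates and attribute values are assumptions, and distances are treated as already expressed in miles so that the one-unit floor plays the role of the small distance adjustment.

set.seed(1)
coords <- cbind(runif(50, 0, 10), runif(50, 0, 10))   # toy zone centroids
x <- rnorm(50)                                        # attribute value per zone (the intensity)
d <- as.matrix(dist(coords))                          # pairwise distances d_ij
W <- 1 / pmax(d, 1)                                   # inverse-distance weights, capped at 1.0
diag(W) <- 0                                          # no self-comparisons (i = j excluded)
z <- x - mean(x)                                      # deviations from the mean
n <- length(x)
I <- (n / sum(W)) * sum(W * outer(z, z)) / sum(z^2)   # Moran's I as defined above
EI <- -1 / (n - 1)                                    # theoretical mean E(I)
# (I - EI) / SE(I) would then give the z-score; SE(I) is omitted here for brevity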

CHAPTER 3
METHODOLOGY

3.1 Introduction:

In this chapter the detailed methodology and the tools used are discussed.

3.2 Methodology:

Step I: Data preprocessing

Step II: Mapping the data into a shapefile

Step III: Performing clustering analysis

Step IV: Comparing the results based on the clustering parameters

Data Preprocessing:
Aggregate data:

We consider the district-wise data on crimes against women from the year 2013 to
perform the clustering analysis.
The data have the following attributes:
Rape
Kidnapping
Dowry_Deaths
Assault_on_women
Insult_on_Women
Cruelty_by_Husband
Importation_Girls

We consider each district as an object, but the raw data we are dealing with must
have the same number of objects and the same object names in order to map onto the
shapefile. This involves some manual cross-checking of the object names (district
names). For some districts the data was divided into sub-district data; all such
sub-district data had to be rejoined into a single district record.

Mapping the data into shapefile:

The shape file is a geospatial vector data format for geographic information system
(GIS) software.
We use two different formats of shapefile for this project.
shp – Has Geospatial visualization
dbf – Has data of each object in an excel sheet
The data from the excel sheet is mapped into the Indian districts shapefile.
We use a software called ArcGIS for mapping data into shapefile.
Mapping the date into a shape file is an important step, which can be later used to
perform clustering analysis.
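Although we performed this step in ArcGIS, the same attribute join can be sketched in R with the sf package; the file names and the join key "DISTRICT" below are assumptions for illustration.

library(sf)
districts <- st_read("districts.shp")                  # reads the .shp geometry and .dbf attributes
crime <- read.csv("crime2013.csv")                     # hypothetical per-district attribute sheet
districts <- merge(districts, crime, by = "DISTRICT")  # names must match exactly, hence the manual
                                                       # cross-checking of district names
plot(districts["Rape"])                                # quick choropleth of one attribute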

Point data:

The Greater Manchester crime data contains 31,000 instances. We filtered the data by
category (the violence category), which consists of 7,200 instances. The data was
filtered because our hardware could not handle that amount of data.

For the point data we used the QGIS software to create the shapefile. We were unable
to get a base map in the background because the shapefile consists of only one layer;
to display the map we would need multilayer files.
Performing clustering analysis:

LISA:

Select the Geoprocessing option and go to ArcToolbox.
From ArcToolbox select Mapping Clusters; there you see different options.
In this analysis we use Cluster and Outlier Analysis (Anselin Local Moran's I) and
Hot Spot Analysis (Getis-Ord Gi*); click on either of these options to obtain the
desired results.

K-means:

We find the K value by a trial and error method and choose K according to our data and
the requirements of the project, as sketched below. We use the kmeans function to
perform the clustering.
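A common form of this trial and error is the elbow plot: run kmeans for a range of K and watch the total within-cluster sum of squares. The stand-in data and the range 1 to 10 are assumptions.

X <- scale(iris[, 1:4])                        # stand-in numeric data
set.seed(42)
wss <- sapply(1:10, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "K", ylab = "Total within-cluster sum of squares")
# choose K near the elbow, where extra clusters stop reducing the variance much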

DBSCAN:

We use the dbscan function to perform DBSCAN clustering. Two important parameters
are required for DBSCAN: epsilon ("eps") and minimum points ("MinPts"). The
parameter eps defines the radius of the neighbourhood of a point x, and MinPts is the
minimum number of neighbours within the "eps" radius.

MinPts is best set by a domain expert who understands the data well. When domain
knowledge is not available, one approach is to use ln(n), where n is the total number of
points to be clustered.

Epsilon:

There are several ways to determine epsilon; one of the best is the k-distance plot. In a
clustering with MinPts = k, we expect the k-distances of core points and border points to
be within a certain range, while noise points have much greater k-distances, so we can
observe a knee point in the k-distance plot, as sketched below.
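Both heuristics can be sketched in R with the dbscan package; the stand-in data and the eps value read off the plot are assumptions.

library(dbscan)
pts <- as.matrix(iris[, 3:4])                  # stand-in point data
minPts <- round(log(nrow(pts)))                # ln(n) fallback when no domain knowledge is available
kNNdistplot(pts, k = minPts)                   # sorted k-distances; look for the knee point
abline(h = 0.4, lty = 2)                       # eps is read off at the knee (0.4 is illustrative)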

CHAPTER 4
RESULT ANALYSIS

4.1 Introduction:

In this chapter the results are analysed and the significance of the results obtained is
discussed.

4.2 Results:

LISA:

Results obtained in ArcGIS software are as follows:

Using Cluster and Outlier Analysis (Anselin Local Moran's I)

Inputs given during analysis are shown in the picture below

Fig 4.1 Inputs given during analysis in ArcGIS

Resulting map is shown below:

Fig 4.2 Output obtained using Cluster and Outlier Analysis (Anselin Local Moran's I)

Results obtained using GEODA software:

Fig 4.3 LISA cluster map in GEODA

Fig 4.4 Significance map in GEODA

Fig 4.5 Moran’s I plot in GEODA

Results obtained in R:

Fig 4.6 LISA map in R

K Means:

Fig 4.7 K Means result in R

DBSCAN:

Fig 4.8 DBSCAN results in R

Fig 4.9 KNN distance plot

AGNES:

Fig 4.10 AGNES result in R

Fuzzy clustering:

In fuzzy clustering we performed soft clustering (each point is shown as a mix of red,
green and blue), so it is difficult to delineate the clusters; we are trying a different
visualization method.

Fig 4.11 Fuzzy result in R

K-Medoid:

Fig 4.12 K-Medoid result in R

Results obtained using point data:

K Means :

Fig 4.13 K means result for point data

DBSCAN:

Fig 4.14 DBSCAN result in R

Fig 4.15 KNN-distance plot
AGNES:

Fig 4.16 AGNES Result in R

FUZZY:

Fig 4.17 FUZZY result in R


DENCLUE:

Fig 4.18 DENCLUE result in R

K-Medoid:

Fig 4.19 K-Medoid result in R

4.3 Comparison of results obtained:

Jaccard Index / Similarity Coefficient:

The Jaccard similarity index (sometimes called the Jaccard similarity coefficient)
compares the members of two sets to see which members are shared and which are
distinct. It is a measure of similarity for the two sets of data, with a range from 0% to
100%; the higher the percentage, the more similar the two populations.

The formula for the index is:

J(X,Y) = |X ∩ Y| / |X ∪ Y| × 100%

This percentage tells you how similar the two sets are. Two sets that share all members
would be 100% similar; the closer to 100%, the greater the similarity.

Jaccard Distance:

A similar statistic, the Jaccard distance, is a measure of how dissimilar two sets are. It is the
complement of the Jaccard index and can be found by subtracting the Jaccard Index from
100%.

D(X,Y) = 1 – J(X,Y)
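Both quantities are straightforward to compute in R; the two toy hotspot sets below are assumptions for illustration.

jaccard <- function(A, B) length(intersect(A, B)) / length(union(A, B))
A <- c("d1", "d2", "d3", "d4")                 # e.g., districts flagged as hotspots by one method
B <- c("d2", "d3", "d4", "d5")                 # districts flagged by another method
jaccard(A, B) * 100                            # Jaccard index: 60% here
(1 - jaccard(A, B)) * 100                      # Jaccard distance: 40% here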

Rand index:

The Rand index or Rand measure, in statistics and in particular in data clustering, is a
measure of the similarity between two data clusterings. A form of the Rand index that is
adjusted for the chance grouping of elements is the adjusted Rand index. From a
mathematical standpoint, the Rand index is related to accuracy, but it is applicable even
when class labels are not used.

Given a set of n elements S = {o1, …, on} and two partitions of S to compare,
X = {X1, …, Xr}, a partition of S into r subsets, and Y = {Y1, …, Ys}, a partition of S
into s subsets, define the following:

 a, the number of pairs of elements in S that are in the same subset in X and in the same
subset in Y.

 b, the number of pairs of elements in S that are in different subsets in X and in
different subsets in Y.

 c, the number of pairs of elements in S that are in the same subset in X and in
different subsets in Y.

 d, the number of pairs of elements in S that are in different subsets in X and in the
same subset in Y.

The Rand index R is:

R = (a + b) / (a + b + c + d) = (a + b) / C(n, 2)

a + b can be considered as the number of agreements between X and Y, and c + d as the
number of disagreements between X and Y.

Since the denominator is the total number of pairs, the Rand index represents the frequency
of occurrence of agreements over the total pairs, or the probability that X and Y will agree on
a randomly chosen pair.

Similarly, one can also view the Rand index as a measure of the percentage of correct
decisions made by the algorithm. It can be computed using the following formula:

R = (TP + TN) / (TP + TN + FP + FN)

where TP is the number of true positives, TN is the number of true negatives, FP is the
number of false positives, and FN is the number of false negatives.
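The pair counts a, b, c and d can be obtained from a contingency table of the two label vectors, as in the R sketch below (the example labels are toy values).

rand_index <- function(x, y) {
  ct <- table(x, y)                            # contingency table of the two clusterings
  a <- sum(choose(ct, 2))                      # pairs together in both X and Y
  same_x <- sum(choose(rowSums(ct), 2))        # pairs together in X
  same_y <- sum(choose(colSums(ct), 2))        # pairs together in Y
  total <- choose(length(x), 2)                # all pairs of elements
  b <- total - same_x - same_y + a             # pairs apart in both X and Y
  (a + b) / total                              # agreements over all pairs
}
rand_index(c(1, 1, 2, 2, 3), c(1, 1, 2, 3, 3)) # returns 0.8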

Comparison of results for aggregate data taking LISA as standard:

Table 4.1:
Sl. No.   Algorithm   Jaccard Index (%)   Rand Index (%)
1         K-means     26.5                68.5
2         K-medoid    23                  68.3
3         AGNES       29                  69.8
4         FUZZY       28.12               68.7
5         DBSCAN      30.4                71.3

Comparison of results for point data taking FUZZY as standard:

Table 4.2:
Sl. No.   Algorithm   Jaccard Index (%)   Rand Index (%)
1         K-means     35.6                73
2         K-medoid    34.6                72.8
3         AGNES       31.4                65.3
4         DENCLUE     30.8                64.955
5         DBSCAN      29.5                61.5

Table 4.3:

Algorithm   Time complexity   Inputs required    Handling of higher dimensions   Handling of irregularly shaped clusters
AGNES       O(n^2 log n)      Data matrix        No                              Arbitrary shapes
K-means     O(tknd)           No. of clusters    Not well                        Not completely
K-medoid    O(k(n-k)^2)       No. of clusters    Not well                        Not completely
Fuzzy       O(knd t^2)        No. of clusters    Yes                             Partially
DBSCAN      O(n log n)        Two parameters     No                              Yes
DENCLUE     O(D log D)        Two parameters     Yes                             Yes
LISA        O(D)              3 parameters       Yes                             Partially
CHAPTER 5
CONCLUSION AND FUTURE SCOPE OF WORK

5.1 Conclusion:

We have presented an overview of clustering algorithms that are useful for spatial
clustering analysis. We categorize them into four categories:
1. Partitioning-based
2. Hierarchical-based
3. Density-based
4. Grid-based

Partitioning methods like k-means and k-medoids make use of a technique called iterative
reallocation to improve the clustering quality from an initial solution. As these methods
find clusters that are of spherical shape and similar in size, they are more useful for
applications like facility allocation, where the objective is not to find natural clusters but
to minimize the sum of distances from the data objects to their cluster centers.

Unlike the partitioning-based clustering algorithms, which reallocate data objects from one
cluster to another in order to improve the clustering quality, hierarchical clustering
algorithms like AGNES fix the membership of a data object once it has been allocated to a
cluster.

Instead of using distance to judge the membership of a data object, density-based
clustering algorithms like DBSCAN make use of the density of data points within a region
to discover clusters. DBSCAN suffers a loss of efficiency for high-dimensional clustering.
This problem is addressed by DENCLUE, which models the overall density of a point so
that the computation can be handled efficiently.

To increase the efficiency of clustering, grid-based clustering methods approximate the
dense regions of the clustering space by quantizing it into a finite number of cells and
identifying as dense those cells that contain more than a threshold number of points. The
grid-based approach is usually more efficient than a density-based approach.

To conclude, the hierarchical clustering methods are similar in performance but take more
time compared to the others. Partition-based clustering methods like the k-means and
k-medoid algorithms do not perform well in handling irregularly shaped clusters. The
density-based and grid-based methods are more suitable for handling spatial data, but
when considering time complexity the grid-based methods are preferable.

The problem with LISA is that it requires the frequency of events associated with each
data point, so it is not suitable for point data where each crime is reported individually,
which makes the count of each data point one. From the research papers we concluded
that fuzzy clustering is the best when dealing with point data; fuzzy also shows decent
results on aggregate data.

Partitional methods like k-means and k-medoid show decent values for aggregate data
and are highly efficient for point data.

REFERENCES

[1]. Neethu C V and Subu Surendran, "Review of Spatial Clustering Methods", SCT College of Engineering, Trivandrum, India, 2013.
[2]. S. Sivaranjani, S. Sivakumari and M. Aasha, "Crime Prediction and Forecasting in Tamilnadu using Clustering Approaches", Avinashilingam University, Coimbatore, India, 2016.
[3]. Nilima, A. Puranik, S. M. Shreenidhi and S. N. Rai, "Spatial evaluation of prevalence, pattern and predictors of cervical cancer screening in India", 2019.
[4]. Ahamed Shafeeq B M and Binu V S, "Spatial Patterns of Crimes in India using Data Mining Techniques", 2014.

PROJECT DETAILS

Student Details
Student Name: Kapugarla Manmadha
Register Number: 160907470    Section / Roll No: C/47
Email Address: Kmanmadha133@gmail.com    Phone No (M): 8296538645

Project Details
Project Title: A comparative study of various algorithms to detect clustering in spatial data
Project Duration: 4 months    Date of reporting: 3rd Jan 2020
Expected date of completion of project: 2nd May 2020

Internal Guide Details
Faculty Name: Dr. Anu Shaju Areeckal
Full contact address with pin code: Assistant Professor, Dept. of E&C Engg., Manipal Institute of Technology, Manipal – 576 104 (Karnataka State), INDIA
Email address: anu.areeckal@manipal.edu

External Guide Details
Faculty Name: Ms. Amitha Puranik
Full contact address with pin code: Assistant Professor, Department of Data Science, Prasanna School of Public Health, MAHE, Manipal – 576104
Email address: amitha.puranik@manipal.edu

