
Clustering

Contents
Clustering

Types of Clustering Algorithms

K Means Clustering

Hierarchical Clustering

DBSCAN
Clustering
Clustering is a well-known task in the area of data mining, used for
information retrieval, market research, pattern recognition, data
analysis, image processing, etc.
It is an unsupervised learning technique.
o Forms clusters of similar data automatically.
o Segments the data so that each training example is assigned to a cluster.

It is different from the classification task – rather than predicting a class,
clustering tries to assign the data to a cluster.
Clustering
Commonly, to measure similarity or dissimilarity between data
points, a distance measure is used.
Common distance functions are:
Euclidean
Manhattan
Minkowski

A distance function returns a lower value for pairs of objects that are
more similar to one another.
Distance functions
 Three commonly used distance functions between points x = (x1, ..., xn) and y = (y1, ..., yn) are:
 Euclidean: d(x, y) = sqrt(Σ (xi − yi)²)
 Manhattan: d(x, y) = Σ |xi − yi|
 Minkowski: d(x, y) = (Σ |xi − yi|^p)^(1/p), which reduces to Manhattan for p = 1 and Euclidean for p = 2
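As a rough illustration (NumPy and the sample points below are assumptions, not part of the slides), the three distances can be computed as:

```python
import numpy as np

def euclidean(a, b):
    # square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # sum of absolute coordinate differences
    return np.sum(np.abs(a - b))

def minkowski(a, b, p=3):
    # generalizes the other two: p = 1 gives Manhattan, p = 2 gives Euclidean
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])
print(euclidean(a, b))   # 5.0
print(manhattan(a, b))   # 7.0
print(minkowski(a, b))   # ~4.50 for p = 3
```

Whichever of the three functions is used, a more similar pair of points always receives the lower value.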
Types of Clustering
 Partition-based clustering
 K-means
 Hierarchical clustering
 Agglomerative
 Divisive
 Density-based clustering
 DBSCAN
K-Means
K-Means clustering intends to partition n objects into k clusters
in which each object belongs to the cluster with the nearest
mean.
This method produces exactly k different clusters of greatest
possible distinction.
The best number of clusters k leading to the greatest separation
(distance) is not known a priori and must be computed from
the data.
K-Means
The objective of K-Means clustering is to minimize the total intra-
cluster variance, i.e. the squared error function
WSS = Σ_{j=1..k} Σ_{x in C_j} ||x − μ_j||², where μ_j is the centroid (mean) of cluster C_j.
This quantity is also called the within-cluster sum of squares (WSS).
K-Means
Steps involved in K-Means:
1. Decide the number of groups k into which the data will be clustered (k is predefined).
2. Select k points at random as cluster centers.
3. Assign objects to their closest cluster center according to the Euclidean
distance function.
4. Calculate the centroid, or mean, of all objects in each cluster and use it as the new cluster center.
5. Repeat steps 3 and 4 until the same points are assigned to each
cluster in consecutive rounds.
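As a hedged sketch (scikit-learn and the toy data below are assumptions, not part of the slides), the same procedure can be run with the library's KMeans estimator:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data with two visible groups (illustrative only)
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# k is predefined; init='random' mirrors the random-center step, and the
# algorithm alternates assignment and centroid updates until the labels stabilize
kmeans = KMeans(n_clusters=2, init='random', n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster index assigned to each point
print(kmeans.cluster_centers_)  # final centroids (cluster means)
print(kmeans.inertia_)          # WSS: total within-cluster sum of squares
```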
Choosing K
In general, there is no method for determining the exact value of K.
One metric commonly used to compare results across different values of K
is the mean distance between data points and their cluster centroid.
However, increasing the number of clusters will always reduce this distance,
reaching zero in the extreme case where every point is its own cluster, so
the metric by itself cannot identify the best K.
Choosing K
The mean distance to the centroid as a function of K is plotted, and the
"elbow point," where the rate of decrease sharply shifts, can be used to
roughly determine K.
 X-axis – number of clusters (K)
 Y-axis – WSS
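A minimal sketch of the elbow method (scikit-learn, matplotlib, and the synthetic blobs below are assumptions used only for illustration):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 underlying groups (illustrative assumption)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Fit K-Means for a range of K and record the WSS (inertia_)
ks = range(1, 10)
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# The "elbow" in this curve suggests a reasonable choice of K
plt.plot(list(ks), wss, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('WSS')
plt.show()
```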
Intuition behind the Elbow curve
Hierarchical Clustering
Hierarchical clustering involves creating clusters that have a
predetermined ordering from top to bottom.
There are two types of hierarchical clustering –
Agglomerative
Divisive
Hierarchical clustering is mostly used when the application
requires a hierarchy, e.g. the creation of a taxonomy.
However, hierarchical methods are expensive in terms of their computational
and storage requirements.
Hierarchical Clustering
The results of hierarchical
clustering can be shown using a
dendrogram.
Hierarchical Clustering
Single Linkage
In single linkage hierarchical clustering,
the distance between two clusters is
defined as the shortest distance between
any two points, one from each cluster.
For example, the distance between
clusters "r" and "s" is the distance
between their two closest points.
Hierarchical Clustering
Single Linkage
For the one-dimensional data set
{7, 10, 20, 28, 35}, perform
hierarchical clustering and plot the
dendrogram to visualize it.
With single linkage, cutting the tree into two clusters gives:
• Cluster 1 : (7, 10)
• Cluster 2 : (20, 28, 35)
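The same exercise can be reproduced with SciPy (the library calls below are an assumed illustration, not part of the original slides):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# One-dimensional data from the slide, reshaped to (n_samples, n_features)
X = np.array([7, 10, 20, 28, 35], dtype=float).reshape(-1, 1)

# Single-linkage agglomerative clustering
Z = linkage(X, method='single')

# Cutting the tree into 2 clusters reproduces (7, 10) and (20, 28, 35)
print(fcluster(Z, t=2, criterion='maxclust'))

# Plot the dendrogram; the merge heights are the single-link distances
dendrogram(Z, labels=[7, 10, 20, 28, 35])
plt.show()
```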
Hierarchical Clustering
Complete Linkage
In complete linkage hierarchical
clustering, the distance between two
clusters is defined as the longest distance
between any two points, one from each cluster.
For example, the distance between
clusters "r" and "s" is the distance
between their two furthest points.
Hierarchical Clustering
Complete Linkage
For the same data set {7, 10, 20, 28, 35}, complete linkage gives:
• Cluster 1 : (7, 10, 20)
• Cluster 2 : (28, 35)
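Switching the linkage criterion is a one-line change; a hedged sketch with scikit-learn's AgglomerativeClustering (an assumed alternative to the SciPy call above):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([7, 10, 20, 28, 35], dtype=float).reshape(-1, 1)

# linkage='complete' merges by the longest inter-point distance between clusters;
# linkage='single' would reproduce the earlier single-linkage result
model = AgglomerativeClustering(n_clusters=2, linkage='complete')
print(model.fit_predict(X))  # groups (7, 10, 20) separately from (28, 35)
```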
DBSCAN
 DBSCAN stands for Density-Based Spatial Clustering
of Applications with Noise.
It is used to separate clusters of high density from clusters of low
density.
The DBSCAN algorithm basically requires 2 parameters:
1. eps: specifies how close points should be to each other to be
considered a part of a cluster.
2. minPoints: the minimum number of points to form a dense
region.
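A minimal DBSCAN sketch (scikit-learn, the half-moon data, and the parameter values are assumptions for illustration; in scikit-learn minPoints is called min_samples):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: a shape that K-Means handles poorly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighborhood radius; min_samples corresponds to minPoints
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

# Points labeled -1 are treated as noise/outliers rather than forced into a cluster
print(np.unique(labels))
```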
DBSCAN
DBSCAN does not necessarily assign every data point to a cluster, which
makes it very good at handling outliers in the dataset.
It does not work well when dealing with clusters of varying
densities.
DBSCAN also struggles when the data has too many dimensions.
