
Density-Based Clustering
blog.dominodatalab.com/topology-and-density-based-clustering/


by Manojit Nandi on September 9th, 2015


Cluster Analysis is an important problem in data analysis. Data scientists use clustering to identify malfunctioning servers, group genes with similar expression patterns, and solve various other problems.
There are many families of data clustering algorithms, and you may be familiar with the most popular one: K-Means. As a quick refresher, K-Means determines k centroids in the data and clusters points by assigning them to the nearest centroid.
While K-Means is easy to understand and implement in practice, the algorithm has no notion of outliers, so all
points are assigned to a cluster even if they do not belong in any. In the domain of anomaly detection, this causes
problems as anomalous points will be assigned to the same cluster as normal data points. The anomalous points
pull the cluster centroid towards them, making it harder to classify them as anomalous points.
In this blog post, I will cover a family of techniques known as density-based clustering. Compared to centroid-based clustering like K-Means, density-based clustering works by identifying dense clusters of points, allowing it to learn clusters of arbitrary shape and identify outliers in the data. In particular, I will:
Discuss the highly popular DBSCAN algorithm.
Use the Python library DeBaCl to demonstrate the Level Set Tree clustering algorithm.
As always, the code can be found on the Domino platform. The Clustering.ipynb file is a good place to start.

Preliminary: ε-Balls and neighborhood density


Before we can discuss density-based clustering, we first need to cover a topic that you may have seen in a topology course: ε-neighborhoods.
The general idea behind ε-neighborhoods is that, given a data point, we want to be able to reason about the data points in the space around it. Formally, for some real-valued ε > 0 and some point p, the ε-neighborhood of p is defined as the set of points that are at most distance ε away from p.
If you think back to geometry, the shape in which all points are equidistant from the center is the circle. In 2D space, the ε-neighborhood of a point p is the set of points contained in a circle of radius ε, centered at p. In 3D space, the ε-neighborhood is a sphere of radius ε, centered at p, and in higher-dimensional space, the ε-neighborhood is just the N-sphere of radius ε, centered at p.
Let's consider an example to make this idea more concrete.
I have scattered 100 data points in the interval [1,3] × [2,4]. Let's pick the point (3,2) to be our point p.


First, let's consider the neighborhood of p with radius 0.5 (ε = 0.5), the set of points that are at most distance 0.5 away from p.

The opaque green oval represents our neighborhood, and there are 31 data points in this neighborhood. Since I
scattered 100 data points and 31 are in the neighborhood, this means that a little under one-third of the data points
are contained within the neighborhood of p with radius 0.5.
Now, let's change our radius to 0.15 (ε = 0.15) and consider the resulting smaller neighborhood.


We have shrunk the neighborhood, so now only 3 data points are contained within it. By decreasing ε from 0.5 to 0.15 (a 70% reduction), we have decreased the number of points in our neighborhood from 31 to 3 (a 90% reduction).
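To make the membership test concrete, here is a minimal NumPy sketch; the scattered points are generated here purely for illustration and are not the exact data used in the plots above.

import numpy as np

rng = np.random.default_rng(0)
points = np.column_stack((rng.uniform(1, 3, 100), rng.uniform(2, 4, 100)))  # 100 points in [1,3] x [2,4]
p = np.array([3.0, 2.0])

def epsilon_neighborhood(points, p, eps):
    # points whose Euclidean distance from p is at most eps
    return points[np.linalg.norm(points - p, axis=1) <= eps]

print(len(epsilon_neighborhood(points, p, 0.5)))   # larger neighborhood, more points
print(len(epsilon_neighborhood(points, p, 0.15)))  # smaller neighborhood, fewer points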
Now that we have defined what we mean by a neighborhood, we'll introduce the next important concept: the notion of a density for a neighborhood (we're building up to describing density-based clustering, after all).
In a grade-school science class, children are taught that density = mass/volume. Let's use this idea of mass divided by volume to define density at some point p. If we consider some point p and its neighborhood of radius ε, we can define the mass of the neighborhood as the number of data points (or alternatively, the fraction of data points) contained within the neighborhood, and the volume of the neighborhood is the volume of the resulting shape of the neighborhood. In the 2D case, the neighborhood is a circle, so the volume of the neighborhood is just the area of the resulting circle. In the 3D and higher-dimensional case, the neighborhood is a sphere or n-sphere, so we can calculate the volume of this shape.
For example, let's consider our neighborhood of p = (3,2) of radius 0.5 again.


The mass is the number of data points in the neighborhood, so mass = 31. The volume is the area of the circle, so volume = π(0.5)² = π/4. Therefore, our local density approximation at p = (3,2) is calculated as density = mass/volume = 31/(π/4) = 124/π ≈ 39.5.
This value is meaningless by itself, but if we calculate the local density approximation for all points in our dataset, we could cluster our points by saying that points that are nearby (contained in the same neighborhood) and have similar local density approximations belong in the same cluster. If we decrease the value of ε, we can construct smaller neighborhoods (less volume) that would also contain fewer data points. Ideally, we want to identify highly dense neighborhoods where most of the data points are contained in these neighborhoods, but the volume of each of these neighborhoods is relatively small.
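As a rough sketch of that intuition (reusing the hypothetical scattered points from the snippet above), the local density approximation can be computed at every point like so:

def local_density(points, p, eps):
    # mass = number of points within eps of p; volume = area of the 2D eps-ball
    mass = np.sum(np.linalg.norm(points - p, axis=1) <= eps)
    return mass / (np.pi * eps ** 2)

densities = np.array([local_density(points, q, 0.5) for q in points])  # one estimate per point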
While this is not exactly what either DBSCAN or the Level Set Tree algorithm does, it forms the general intuition
behind density-based clustering.
To recap, we discussed ε-neighborhoods and how they allow us to reason about the space around a particular point. Then we defined a notion of density at a particular point for a particular neighborhood. In the next section, I'll discuss the DBSCAN algorithm, where the ε-ball is a fundamental tool for defining clusters.

DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is the most well-known density-based clustering algorithm, first introduced in 1996 by Ester et al. Due to its importance in both theory and applications, this algorithm is one of three algorithms awarded the Test of Time Award at SIGKDD 2014.
Unlike K-Means, DBSCAN does not require the number of clusters as a parameter. Rather, it infers the number of clusters based on the data, and it can discover clusters of arbitrary shape (for comparison, K-Means usually discovers spherical clusters). As I said earlier, the ε-neighborhood is fundamental to DBSCAN for approximating local density, so the algorithm has two parameters:

ε: The radius of our neighborhoods around a data point p.


minPts: The minimum number of data points we want in a neighborhood to define a cluster.
Using these two parameters, DBSCAN categorizes the data points into three categories:

Core Points: A data point p is a core point if Nbhd(p, ε) [the ε-neighborhood of p] contains at least minPts data points; that is, |Nbhd(p, ε)| >= minPts.
Border Points: A data point q is a border point if Nbhd(q, ε) contains fewer than minPts data points, but q is density-reachable from some core point p.
Outlier: A data point o is an outlier if it is neither a core point nor a border point. Essentially, this is the "other" class.
These definitions may seem abstract, so let's cover what each one means in more detail.

Core Points
Core points are the foundations of our clusters, and they are based on the density approximation I discussed in the previous section. We use the same ε to compute the neighborhood for each point, so the volume of all the neighborhoods is the same. However, the number of other points in each neighborhood is what differs. Recall that I said we can think of the number of data points in a neighborhood as its mass. The volume of each neighborhood is constant and the mass of each neighborhood is variable, so by putting a threshold on the minimum amount of mass needed to be a core point, we are essentially setting a minimum density threshold. Therefore, core points are data points that satisfy a minimum density requirement. Our clusters are built around our core points (hence the "core" part), so by adjusting our minPts parameter, we can fine-tune how dense our clusters' cores must be.
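As a small illustration of this thresholding (a sketch, not how scikit-learn implements it; X stands for any 2-D NumPy array of data points):

import numpy as np
from scipy.spatial.distance import cdist

def core_point_mask(X, eps, min_pts):
    # a point is core if its eps-neighborhood (counting the point itself) holds at least min_pts points
    neighbor_counts = (cdist(X, X) <= eps).sum(axis=1)
    return neighbor_counts >= min_pts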

Border Points
Border points are the points in our clusters that are not core points. In the definition above for border points, I used the term density-reachable. I have not defined this term yet, but the concept is simple. To explain this concept, let's revisit our neighborhood example with ε = 0.15. Consider the point r (the black dot) that is outside of the point p's neighborhood.

All the points inside the point p's neighborhood are said to be directly reachable from p. Now, let's explore the neighborhood of point q, a point directly reachable from p. The yellow circle represents q's neighborhood.


Now, while our target point r is not in our starting point p's neighborhood, it is contained in the point q's neighborhood. This is the idea behind density-reachable: if I can get to the point r by jumping from neighborhood to neighborhood, starting at a point p, then the point r is density-reachable from the point p.

As an analogy, we can think of density-reachable points as being the "friends of a friend". If the directly-reachable points of a core point p are its friends, then the density-reachable points, the points in the neighborhoods of the friends of p, are the friends of its friends. One thing that may not be clear is that density-reachability is not limited to just two adjacent neighborhood jumps. As long as you can reach the point by doing neighborhood jumps, starting at a core point p, that point is density-reachable from p, so friends of a friend of a friend of a friend are included as well.


It is important to keep in mind that this idea of density-reachability is dependent on our value of ε. By picking larger values of ε, more points become density-reachable, and by choosing smaller values of ε, fewer points become density-reachable.

Outliers
Finally, we get to our other class. Outliers are points that are neither core points nor are they close enough to a
cluster to be density-reachable from a core point. Outliers are not assigned to any cluster and, depending on the
context, may be considered anomalous points.
Now that I have covered all the preliminaries, we can finally talk about how the algorithm works in practice.

Application
DBSCAN is implemented in the popular Python machine learning library Scikit-Learn, and because this
implementation is scalable and well-tested, I will be using it to demonstrate how DBSCAN works in practice.
The steps to the DBSCAN algorithm are:
Pick a point at random that has not been assigned to a cluster or been designated as an outlier. Compute its neighborhood to determine if it's a core point. If yes, start a cluster around this point. If no, label the point as an outlier.
Once we find a core point and thus a cluster, expand the cluster by adding all directly-reachable points to the cluster. Perform neighborhood jumps to find all density-reachable points and add them to the cluster. If an outlier is added, change that point's status from outlier to border point.
Repeat these two steps until all points are either assigned to a cluster or designated as an outlier.
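To make the steps above concrete, here is a compact from-scratch sketch of the procedure. It is purely illustrative (and O(n²)); it is not the scikit-learn implementation used below.

import numpy as np
from scipy.spatial.distance import cdist

def dbscan_sketch(X, eps, min_pts):
    # returns a cluster id (0, 1, ...) per point, or -1 for outliers
    dists = cdist(X, X)
    neighbors = [np.flatnonzero(row <= eps) for row in dists]
    is_core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.full(len(X), -1)
    cluster = 0
    for i in range(len(X)):
        if labels[i] != -1 or not is_core[i]:
            continue                       # already clustered, or not a core point
        labels[i] = cluster                # start a new cluster at this core point
        queue = list(neighbors[i])
        while queue:                       # expand via neighborhood jumps
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # border point (or previously-flagged outlier) joins the cluster
                if is_core[j]:
                    queue.extend(neighbors[j])
        cluster += 1
    return labels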
Thanks to Scikit-Learn's easy-to-use API, we can use DBSCAN in only a couple of lines of code.
from sklearn.cluster import DBSCAN

To test out DBSCAN, I'm going to use a dataset consisting of annual customer data for a wholesale distributor. The dataset consists of 440 customers and has 8 attributes for each of these customers. I will use the Pandas library to load the .csv file into a DataFrame object.
import pandas as pd

data = pd.read_csv("data/wholesale.csv")
data.drop(["Channel", "Region"], axis=1, inplace=True)

After dropping two fields that identify the customer, we can examine the first few rows of this dataset.


So we can visualize the data, I'm going to use only two of these attributes:
Grocery: The customer's annual spending (in some monetary unit) on grocery products.
Milk: The customer's annual spending (in some monetary unit) on milk products.
data = data[["Grocery", "Milk"]]
data = data.to_numpy().astype("float32", copy=False)  # .as_matrix() in older pandas versions

Because the values of the data are in the thousands, we are going to normalize each attribute by scaling it to 0
mean and unit variance.
from sklearn.preprocessing import StandardScaler

stscaler = StandardScaler().fit(data)
data = stscaler.transform(data)

Now, let's visualize the normalized dataset.


As you can see, there is positive correlation between grocery purchases and milk product purchases. There is a
cluster centered about the mean milk purchase (milk = 0) and the mean groceries purchase (groceries = 0). In
addition, there are a handful of outliers pertaining to customers who buy more groceries or milk products compared
to other customers.
With DBSCAN, we want to identify this main cluster of customers, but we also want to flag customers with more
unusual annual purchasing habits as outliers.
We will construct a DBSCAN object that requires a minimum of 15 data points in a neighborhood of radius 0.5 to
be considered a core point.
dbsc = DBSCAN(eps=0.5, min_samples=15).fit(data)

Next, we can extract our cluster labels and outliers to plot our results.
import numpy as np

labels = dbsc.labels_
core_samples = np.zeros_like(labels, dtype=bool)
core_samples[dbsc.core_sample_indices_] = True
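The plotting code itself is not reproduced in this capture of the post; a minimal matplotlib sketch (colour and axis choices assumed) that continues from the variables above might look like:

import matplotlib.pyplot as plt

outliers = data[labels == -1]        # points DBSCAN left unassigned
clustered = data[labels != -1]
plt.scatter(clustered[:, 0], clustered[:, 1], c="black", s=20, label="cluster")
plt.scatter(outliers[:, 0], outliers[:, 1], c="lightgray", s=40, label="outlier")
plt.xlabel("Grocery (standardized)")
plt.ylabel("Milk (standardized)")
plt.legend()
plt.show()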

Lining up with our intuition, the DBSCAN algorithm was able to identify one cluster of customers centered about the mean grocery and mean milk product purchases. In addition, it was able to flag customers whose annual purchasing behavior deviated too heavily from other customers.
Because the outliers corresponded to customers with more extreme purchasing behavior, the wholesale distributor
could specifically target these customers with exclusive discounts to encourage larger purchases.
As a baseline, let's run K-Means with two clusters on this dataset. The big blue dot represents the centroid for the black cluster, and the big gold dot represents the centroid for the white cluster.

While the white cluster appears to capture most of the outliers, it basically captures customers who purchase relatively more goods. If we designate the white cluster as the anomalous cluster, then we basically flag any customer who purchases a lot of milk or groceries. If you were the wholesale distributor, you would be calling your better customers, the ones you make more money from, anomalies.
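For reference, a baseline like the one above can be produced with scikit-learn's KMeans; the two-cluster choice mirrors the text, the random_state is fixed only for reproducibility, and the plotting is analogous to the DBSCAN plot, so it is omitted here.

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, random_state=0).fit(data)
kmeans_labels = kmeans.labels_
centroids = kmeans.cluster_centers_   # the two "big dots" in the figure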

DBSCAN Toy Example


For a more contrived but impressive illustration of DBSCAN's capabilities, let's consider a toy example informally known as the "half-moons" dataset, where each data point belongs to one of two half-moons.
from sklearn.datasets import make_moons

moons_X, moon_y = make_moons(n_samples=2000)


This dataset is aesthetically pleasing, but it has no outliers. To rectify this problem, I am going to add random noise
to 1% of the data using the following code:
import numpy as np

def add_noise(X, y, noise_level=0.01):
    # displace a random noise_level fraction of the points by a uniform offset in [-0.5, 0.5)
    amt_noise = int(noise_level * len(y))
    idx = np.random.choice(len(X), size=amt_noise)
    noise = np.random.random((amt_noise, 2)) - 0.5
    X[idx, :] += noise
    return X

Now the dataset has noticeable outliers.


We can run DBSCAN on the data to get the following results. The algorithm successfully discovers both half-moons and identifies almost every noisy data point as an outlier.

In contrast, K-Means performs poorly on this dataset. The algorithm cannot successfully discover the half-moon structure, nor can it distinguish the noisy data points from the original data points.
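The two runs behind these figures can be sketched as follows; the DBSCAN parameters shown here are plausible values for this dataset rather than the exact ones used for the original figures, and the noisy data is assumed to have been written back into moons_X by add_noise.

import numpy as np
from sklearn.cluster import DBSCAN, KMeans

dbsc_moons = DBSCAN(eps=0.1, min_samples=10).fit(moons_X)
kmeans_moons = KMeans(n_clusters=2, random_state=0).fit(moons_X)

print(np.unique(dbsc_moons.labels_))    # typically [-1  0  1]: the two moons plus an outlier label
print(np.unique(kmeans_moons.labels_))  # [0 1]: every point is forced into one of two clusters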


DeBaCl
The Level Set Tree clustering algorithm was developed in 2010 by Chaudhuri and Dasgupta, two computer scientists at UC San Diego. For this section, I will be using the DeBaCl (Density-Based Clustering) library, developed in 2013 by Kent, Rinaldo, and Verstynen, three statisticians at Carnegie Mellon University. This library contains an efficient implementation of the Level Set Tree clustering algorithm and provides interactive tools for selecting clusters.

Level Sets
While DBSCAN is built on ε-balls as its foundation, Level Set Tree clustering is built on, as the name suggests, level sets.
If we have some mathematical function f and some constant c, then the level set L_c is all the values of x such that f(x) = c. Formally, in mathematical set-builder notation:

L_c(f) = {x in X | f(x) = c}

For example, if f(x) = x² and c = 4, then L_c(f) = {−2, 2} because f(−2) = (−2)² = 4 and f(2) = 2² = 4.


To build the Level Set Trees, we are going to use a special class of level sets called λ-upper level sets. For some constant λ and some function f, the λ-upper level set of f, L_λ(f), is defined as:

L_λ(f) = {x in X | f(x) >= λ}

Essentially, this is where the function is greater than or equal to some constant λ. Returning to our previous example of f(x) = x² and λ = 4, the λ-upper level set of f would be (−∞, −2] ∪ [2, ∞).

Now, I'm sure you're wondering how we can use this mathematical object to perform clustering. The basic idea is that we assume our data is generated according to some probability density function f, so if we pick λ to be a constant in the interval [0,1], our λ-upper level sets correspond to regions of the data that are denser than our threshold λ. Our goal is to use the threshold parameter λ to identify high-density clusters that correspond to connected regions of the data with similar densities.
Now, let's talk about the Tree part of the algorithm. Let's say we start with a high initial value of λ, say λ = 0.7, and compute the resulting λ-upper level set of f. We end up with the set of points such that f(x) >= 0.7 for each point in the upper level set. Now, let's decrease the value of λ to 0.5 and compute the resulting λ-upper level set. By the transitive property, all the points in the λ = 0.7 upper level set are also members of the λ = 0.5 upper level set, because if f(x) >= 0.7 and 0.7 >= 0.5, then f(x) >= 0.5.
Generalizing this pattern, if λ1 < λ2, then L_λ2(f) is a subset of L_λ1(f). Essentially, if we were to iterate through an increasing sequence of λ values, each λ-upper level set would be a subset of the previous upper level sets. This is the tree structure: the root represents λ = 0 (all the points), and each branch represents a higher value of λ and thus a subset of the points from the previous level.
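The nesting property is easy to check numerically; here is a tiny sketch with made-up density values:

import numpy as np

density = np.array([0.9, 0.75, 0.6, 0.55, 0.4, 0.2])  # hypothetical f(x) values at six points
upper_07 = set(np.flatnonzero(density >= 0.7))         # indices in the lambda = 0.7 upper level set
upper_05 = set(np.flatnonzero(density >= 0.5))         # indices in the lambda = 0.5 upper level set
print(upper_07.issubset(upper_05))                     # True: higher-lambda sets nest inside lower-lambda sets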

kNN Density Estimation


Since we don't know the true probability density function f that generates the data, we are going to need to estimate it. Previously, we used our ε-balls to estimate density by computing the mass-to-volume ratio of our neighborhoods.
DeBaCl uses a similar method, called k-Nearest Neighbor density estimation, to estimate the underlying probability density function f. With the ε-balls method, we first fixed the volume of the neighborhood and then counted the number of points within the neighborhood to determine the mass. With the k-Nearest Neighbor density estimation method, we're going to do the opposite; first we fix the mass to be some number k, then we compute the volume necessary to get k points within our ball.
In machine learning, one of the most basic classification algorithms is k-Nearest Neighbors (k-NN) classification. A
quick refresher on k-NN classification:
Pick some integer k and some test data point x_i.
Retrieve the k points in your training dataset that are closest to x_i.
Get the majority label of these k points and assign that label to x_i.
With k-NN density estimation, we are going to do something similar. Like I said above, we want k points in our neighborhood, so we are going to find the volume of the smallest neighborhood that contains exactly k points.
To estimate the density at a given point using k-NN density estimation, we find the distance to the kth nearest point, r_k(x_i), and use this as the radius of our neighborhood. By doing this, we get a neighborhood around our point with exactly k other points in it.
The k-Nearest Neighbor density estimator is:

estimate(x_i) = k / (n · v_d · r_k(x_i)^d)

In this equation, k is the number of points we want in our neighborhood, x_i is our given point, n is the number of points in the dataset, v_d is the volume of the unit d-dimensional Euclidean ball, and r_k(x_i) is the distance to the kth nearest point.
For 2-dimensional data, the volume of the 2-dimensional Euclidean unit ball is πR² with R = 1, so v_d = π in this case.
The following code computes the k-NN density estimate for 2-dimensional data:
import numpy as np
from scipy.spatial.distance import cdist

def knn_estimate(data, k, point):
    # k-NN density estimate for 2-D data: k / (n * v_d * r_k^d), with v_d = pi when d = 2
    n, d = data.shape
    dists = np.sort(cdist(data, point.reshape(1, -1)).ravel())
    r_k = dists[k]   # distance to the kth nearest point (dists[0] is the point itself when it belongs to data)
    return float(k) / (n * np.pi * np.power(r_k, d))

To demonstrate how k-NN density estimation works, let's consider a toy dataset informally called the "crater" dataset. The dataset consists of a very dense core (the impact of the crater) with a less dense ring surrounding the core.

If you look at the image above, you'll notice a tightly bundled set of data points in the center, and the points appear to get less bundled as we move away from the center. There are 2,000 red data points and 1,000 blue data points, for reference.
The code I used to create this toy dataset is provided below:

import numpy as np

def makeCraters(inner_rad=4, outer_rad=4.5, donut_len=2, inner_pts=1000, outer_pts=500):
    # dense core: points at a uniformly random radius (up to inner_rad) and direction
    radius_core = inner_rad * np.random.random(inner_pts)
    direction_core = 2 * np.pi * np.random.random(size=inner_pts)
    core_x = radius_core * np.cos(direction_core)
    core_y = radius_core * np.sin(direction_core)
    crater_core = np.column_stack((core_x, core_y))

    # sparser ring: points at radii between outer_rad and outer_rad + donut_len
    radius_ring = outer_rad + donut_len * np.random.random(outer_pts)
    direction_ring = 2 * np.pi * np.random.random(size=outer_pts)
    ring_x = radius_ring * np.cos(direction_ring)
    ring_y = radius_ring * np.sin(direction_ring)
    crater_ring = np.column_stack((ring_x, ring_y))

    return crater_core, crater_ring

From the above image, we should expect the points near the center to have higher k-NN density estimates and the points in the outer ring to have lower k-NN density estimates.
We run the k-NN density estimate code I included above on this crater dataset and plot the points, with larger and darker points corresponding to higher density values.
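The plotting step is not reproduced in this capture; a sketch of it could look like the following, assuming the core and ring points have been stacked into a single array called data and using an arbitrary illustrative choice of k = 5:

import numpy as np
import matplotlib.pyplot as plt

estimates = np.array([knn_estimate(data, 5, pt) for pt in data])         # k-NN density estimate at every point
sizes = 300 * estimates / estimates.max()                                # larger marker = higher estimated density
plt.scatter(data[:, 0], data[:, 1], s=sizes, c=estimates, cmap="Greys")  # darker marker = higher estimated density
plt.show()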

As expected, the larger and darker points are bundled in the center, and as we move farther away from the center, the points become lighter and smaller. By choosing appropriate values of λ, we can create an upper level set consisting of only the darker points near the center, corresponding to a highly dense cluster.

Application
The DeBaCl library is available through pip, so we can install it with

pip install debacl

or include debacl in our project's requirements.txt. Now, we can import the geom_tree module to build Level Set Trees. We will be applying the Level Set Tree algorithm to the crater dataset from the previous section. Rather than trying to cluster the data into the core and the outer ring, I want to showcase the hierarchical nature of the Level Set Tree algorithm. Because of this, I will choose parameters that keep the inner core as a single cluster, but break the outer ring into a set of denser sub-clusters.
from debacl import geom_tree as gtree
from debacl import utils as utl

To make the algorithm run faster, we will sample 1000 data points (1/3 of the data).
n, p = data.shape
n_samples = 1000
ix = np.random.choice(range(n), size=n_samples, replace=False)
data = data[ix, :]
n, p = data.shape

Now, we initialize our two parameters: k, the number of neighbors, and gamma, a tree pruning parameter. In this case, I chose p_k to be 0.5%, so that we look at the 5th closest point when we perform the k-NN density estimate. In addition, p_gamma is 5%, so that when pruning the tree, we merge branches that contain fewer than 5% of the data points.
p_k = 0.005
p_gamma = 0.05
k = int(n * p_k)
gamma = int(n * p_gamma)

Now, we set some plotting parameters to properly display the tree dendrogram and then build our Level Set Tree from the data.

tree = gtree.geomTree(data, k, gamma, n_grid=None, verbose=False)
print(tree)

We can print the tree object to see a summary of the tree construction. In this summary, we see the algorithm first
broke the data into two branches (children 3,4) and then broke branch 3 into two smaller branches, branches 15
and 16.


We can call tree.plot() to display the Level Set Tree dendrogram.

fig = tree.plot()

From the dendrogram, we see that the dataset was clustered into three different groups. The first partition created the red cluster, and the second partition created the blue and green clusters.
We can now plot the data with these cluster labels.
uc, nodes = tree.getClusterLabels(method="all-mode")
fig, ax = utl.plotForeground(data, uc, fg_alpha=0.95, bg_alpha=0.1, edge_alpha=0.6, s=60)


The Level Set Tree was able to successfully cluster most of the inner core as a single cluster and then identified
two dense sub-clusters in the outer ring.
With Level Set Trees, we can find clusters in datasets where the data exhibits large differences in density. Whereas DBSCAN just flags outliers, Level Set Trees attempt to discover some cluster-based substructure in these outliers. In market segmentation, this could be useful in detecting an emerging market. In data security, researchers could identify new malware in which the small sample of recently infected computers behaves differently from the rest of the data, but similarly to each other.

Distance Functions
In the preliminary section about ε-neighborhoods, I said these neighborhoods exhibit circular or spherical shapes. When we use the standard Euclidean distance metric, sometimes called the l2 metric, our neighborhoods display this spherical shape. Why does this happen?
Let's consider two points p1 = (x1, y1) and p2 = (x2, y2) in 2D space. I am using points in 2D space for simplicity, but these distance metrics can be generalized to higher-dimensional spaces.
The Euclidean distance for these two points is d(p1, p2) = [(x1 − x2)² + (y1 − y2)²]^(1/2).
If we fix the distance to be less than ε, then we get the following inequality:

d(p1, p2) = [(x1 − x2)² + (y1 − y2)²]^(1/2) < ε

Squaring both sides yields:

(x1 − x2)² + (y1 − y2)² < ε²

The above inequality should look familiar: it resembles the general equation of a circle of radius r centered at the point (c1, c2), namely (x − c1)² + (y − c2)² = r².


Thanks to the Euclidean distance metric, when we compute Nbhd(p, ε), we end up with a spherical neighborhood centered at p with radius ε.
However, if we were to use a different distance metric, we would end up with neighborhoods of different shapes. For example, if we used the Manhattan distance, or l1 metric, where the distance between two points is d(p1, p2) = |x1 − x2| + |y1 − y2| (|.| is the absolute value), then neighborhoods would display a diamond shape (a square rotated 45 degrees).
By changing the distance metric used, we can change the shape of neighborhoods. By changing the shape of
neighborhoods, we can discover different sets of clusters. Depending on the problem, it may be beneficial to use a
distance metric other than the Euclidean distance metric to discover different types of clusters.
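In scikit-learn's DBSCAN this is a one-line change through its metric parameter. For example, on the wholesale data from earlier (the eps value would likely need re-tuning for the new metric):

from sklearn.cluster import DBSCAN

# Manhattan (l1) neighborhoods instead of the default Euclidean ones
dbsc_l1 = DBSCAN(eps=0.5, min_samples=15, metric="manhattan").fit(data)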

Conclusion
Density-based clustering methods are great because they do not require the number of clusters to be specified beforehand. Unlike other clustering methods, they incorporate a notion of outliers and are able to filter these out. Finally, these methods can learn clusters of arbitrary shape, and with the Level Set Tree algorithm, one can learn clusters in datasets that exhibit wide differences in density.
However, I should point out that these methods are somewhat harder to tune compared to parametric clustering methods like K-Means. Things like the epsilon parameter for DBSCAN or the parameters for the Level Set Tree are less intuitive to reason about compared to the number-of-clusters parameter for K-Means, so it's more difficult to pick good initial parameter values for these algorithms.
This blog post is inspired by my experience in Carnegie Mellon's Statistical Machine Learning course, and for readers interested in a more mathematical analysis of clustering, the class notes for the clustering lecture are publicly available.
Finally, for readers who are interested in general clustering methodology for different types of problems, I have
found this book on data clustering to be a handy reference.
