
DT-DBSCAN: Density Based Spatial Clustering in Linear Expected Time Using Delaunay Triangulation


Jian Park
Sage Hill High School, 20402 Newport Coast Drive, Newport Coast, CA 92657.

Density-based spatial clustering of applications with noise (DBSCAN) is a popular clustering algorithm that connects regions with spatially dense data points to determine clusters. DBSCAN has many advantages, including the ability to find clusters of arbitrary shape, but one of its disadvantages is its O(n log n) time complexity. Several approximate DBSCAN algorithms have been developed in the past, and although they run in linear time, they do not produce the same results as DBSCAN. This paper outlines the DT-DBSCAN algorithm, a non-approximate density-based clustering algorithm that runs in linear expected time. This time complexity is achieved by conducting a tree-search through Delaunay Triangulation edges to distinguish core and noise points instead of spatial indexing. Experiments conducted show that DT-DBSCAN outperforms current DBSCAN iterations for larger datasets.

1) Introduction
Clustering is the process of grouping a set of objects into “clusters” such that any two objects within the same cluster are “similar” to one another but are distinctly less “similar” to objects in other clusters. Spatial clustering is a type of clustering that uses spatial distance as the metric for similarity [1]. K-means, hierarchical clustering, and expectation-maximization are all different types of clustering algorithms.

The diagram depicts the results of several different spatial clustering algorithms. DBSCAN is depicted in the row that is second from the right.

The DBSCAN algorithm (density-based spatial clustering of applications with noise) clusters data points with sufficiently large local density and offers significant benefits over other clustering methods. The DBSCAN algorithm can determine clusters of arbitrary shape. In addition, noise points, or outliers, can be removed from clusters automatically [2]. Due to these advantages, DBSCAN is used in data mining procedures for medical, business, and engineering purposes [12,13,3].
Most recent DBSCAN algorithms use spatial indexing methods like R-trees to compute clusters in O(n log n) time [15]. However, this time complexity still makes DBSCAN unfavorable and inefficient compared to quicker algorithms like K-means, which runs in O(n) time per iteration [6].
In order to solve this time complexity issue, many recent iterations of DBSCAN involve approximate methods, including grid-based, epsilon-approximate [6], and hashing-based procedures [16]. Although most of these algorithms are able to achieve linear time complexity, they sacrifice accuracy and, in general, produce less predictable results. This also makes it more difficult to choose the numerical parameters for the algorithm, because the parameters no longer hold a definitive and intuitive meaning (explained in 2.1).
A majority of the recent studies on density-based spatial clustering aim to develop more efficient algorithms, with the main goal being an accurate algorithm that is able to run in linear expected time [10].
Within this paper, the first section will define concepts and procedures, the second section will define
several basic properties of the Delaunay Triangulation, the third section will prove a new property about the
degree distribution of vertices within Delaunay Triangulation graphs, the fourth section will define a new
DBSCAN algorithm with linear expected run-time, and the fifth section will analyze experimental results for
this new algorithm. Finally, the last section discusses several possible implications of this paper.

2) Preliminaries
In this section, basic concepts and definitions for DBSCAN and Delaunay Triangulations are given
[2,8].

2.1) DBSCAN Algorithm

Density-based spatial clustering of applications with noise, commonly known as DBSCAN, is a density-based clustering algorithm that clusters data points with sufficiently large local density.

Several terms used within the DBSCAN algorithm are as follows:

Definition 1: DBSCAN is a spatial clustering algorithm. It has three inputs: a dataset of k-tuples (each tuple represents a vertex in k dimensions), a positive numerical value called epsilon, and a positive integer value called minPts. Epsilon and minPts are chosen by the user of the algorithm based on their knowledge of the properties of the dataset. Given these inputs, DBSCAN generates a set of clusters.
Definition 2: The neighborhood of a vertex s is the set of all vertices that lie within epsilon distance of s.
Definition 3: A given vertex is labeled a core point if the number of points within its neighborhood is greater than or equal to the value of minPts.
Definition 4: A point is a boundary point if the number of points within its neighborhood is less than the value of minPts and its neighborhood contains at least one core point.
Definition 5: A point is a noise point if it is neither a core point nor a boundary point.
Definition 6: A point p is directly density-reachable from a point q if and only if p is in the neighborhood of q and at least one of p or q is a core point.
Definition 7: Point p is density-reachable from point q if and only if there is a path p1, p2, …, pn such that p1 = q, pn = p, and pi+1 is directly density-reachable from pi for each i ∈ {1, 2, 3, …, n − 1}.
Definition 8: Any two vertices are in the same cluster if and only if one vertex is density-reachable from the other.

These definitions will be used later in section 5 during the derivation of the algorithm.

The red points on the diagram denote core points, the yellow points denote boundary points, and the blue point represents a noise point. Every point other than N is within the same cluster because they are all density-reachable.

In addition, the standard DBSCAN algorithm is defined as follows:

Algorithm 1.1: Brute Force DBSCAN

Input: vertices {v}, minPts, Epsilon
Output: Set of density-based clusters

ClusterNumber = 0
for each point P in {v} {
    if label(P) is defined then continue
    Set Neighborhood = RangeQuery({v}, P, Epsilon)
    if |Neighborhood| < minPts then {
        label(P) = Noise
        continue
    }
    ClusterNumber = ClusterNumber + 1
    label(P) = ClusterNumber
    Set Seed = Neighborhood \ {P}
    while Seed is not empty {
        Q = some point from Seed (remove Q from Seed in the process)
        if label(Q) = Noise then label(Q) = ClusterNumber
        if label(Q) is defined then continue
        label(Q) = ClusterNumber
        Set Neighborhood = RangeQuery({v}, Q, Epsilon)
        if |Neighborhood| ≥ minPts then {
            Add vertices in Neighborhood to Seed
        }
    }
}

In addition, the auxiliary algorithm RangeQuery is defined as follows:

Algorithm 1.2: RangeQuery

Input: vertices {v}, vertex P, Epsilon
Output: Neighborhood of P

Set Neighborhood = empty set
for each point Q in {v} {
    if dist(P, Q) ≤ Epsilon then {
        Add Q to Neighborhood
    }
}
return Neighborhood

The brute force algorithm presented here follows the original DBSCAN of Ester, Kriegel, Sander, and Xu [17], and it is the first iteration of DBSCAN. This algorithm has a runtime of O(n²), but more recent algorithms have made this more efficient by reducing the time complexity associated with the RangeQuery process.
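As a concrete reference for the two procedures above, the pseudocode can be transcribed into a short, runnable Python sketch. This is a minimal illustration rather than the original implementation from [17]; any point coordinates and parameter values used with it are hypothetical.

```python
import math

UNDEFINED, NOISE = None, -1

def range_query(points, p, eps):
    # Algorithm 1.2: the neighborhood of points[p] (Definition 2).
    return [q for q in range(len(points))
            if math.dist(points[p], points[q]) <= eps]

def dbscan(points, eps, min_pts):
    # Algorithm 1.1: brute-force DBSCAN. Labels are cluster numbers
    # starting at 1, or NOISE (-1) for noise points.
    labels = [UNDEFINED] * len(points)
    cluster = 0
    for p in range(len(points)):
        if labels[p] is not UNDEFINED:
            continue                          # already processed
        neighborhood = range_query(points, p, eps)
        if len(neighborhood) < min_pts:
            labels[p] = NOISE                 # provisional noise label
            continue
        cluster += 1                          # p is a core point
        labels[p] = cluster
        seeds = set(neighborhood) - {p}
        while seeds:
            q = seeds.pop()
            if labels[q] == NOISE:
                labels[q] = cluster           # noise becomes a boundary point
            if labels[q] is not UNDEFINED:
                continue
            labels[q] = cluster
            q_neighborhood = range_query(points, q, eps)
            if len(q_neighborhood) >= min_pts:
                seeds.update(q_neighborhood)  # q is a core point: expand
    return labels
```

For example, `dbscan([(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1), (20, 20)], 0.5, 3)` groups the first three and the next three points into two clusters and marks the final point as noise.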

2.2) Delaunay Triangulation

Triangulation is the process of subdividing planar objects into non-intersecting triangles. Delaunay Triangulation is a special type of triangulation in which the circumcircle of any triangle within the triangulation does not contain any other point. Delaunay Triangulation has a worst-case time complexity of O(n log n), but recent iterations allow the process to complete in O(n) expected time. One such algorithm is presented within [5].

The diagram depicts the Delaunay Triangulation of a set of vertices. Note that the circumcircles of the Delaunay triangles do not contain any vertices.

Here are some definitions related to Delaunay Triangulations that will be used within this paper.
Definition 9: deg(v) denotes the degree of point v within the Delaunay triangulation of {v}. The degree of a point is the number of edges connected to that point.
Definition 10: P_{i,j} denotes a Delaunay path between point i and point j. A Delaunay path is defined by a set of points p1, p2, p3, …, pn, where p1 = i, pn = j, and pk and pk+1 are connected by an edge of the Delaunay triangulation graph.
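As an illustration of Definition 9 and of the empty-circumcircle property, the following Python sketch computes Delaunay edges by brute force, testing every triple of points. This is an O(n⁴) illustration for small inputs, not one of the efficient algorithms cited above, and the random point set is an arbitrary choice of this sketch.

```python
import itertools, math, random

def circumcircle(a, b, c):
    # Circumcenter and circumradius of triangle abc (None if collinear).
    ax, ay = a; bx, by = b; cx, cy = c
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    ux = ((ax * ax + ay * ay) * (by - cy) + (bx * bx + by * by) * (cy - ay)
          + (cx * cx + cy * cy) * (ay - by)) / d
    uy = ((ax * ax + ay * ay) * (cx - bx) + (bx * bx + by * by) * (ax - cx)
          + (cx * cx + cy * cy) * (bx - ax)) / d
    return (ux, uy), math.dist((ux, uy), a)

def delaunay_edges(points):
    # A triple forms a Delaunay triangle iff its circumcircle is empty.
    edges = set()
    for i, j, k in itertools.combinations(range(len(points)), 3):
        cc = circumcircle(points[i], points[j], points[k])
        if cc is None:
            continue
        center, r = cc
        if all(math.dist(center, points[h]) >= r - 1e-9
               for h in range(len(points)) if h not in (i, j, k)):
            edges |= {(i, j), (j, k), (i, k)}
    return edges

random.seed(1)
pts = [(random.random(), random.random()) for _ in range(20)]
E = delaunay_edges(pts)
deg = {v: sum(v in e for e in E) for v in range(len(pts))}  # Definition 9
```

The dictionary `deg` then gives the degree of each vertex in the sense of Definition 9, and the edge count can be checked against the combinatorial bounds derived later in the paper.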

Let B be a closed, continuous region containing a set of vertices, and let s be some vertex that lies within B. If T is some larger region that contains all of region B and its vertices, then s is considered to be complete relative to T if the Delaunay Triangulation of the points within B and the Delaunay Triangulation of the points within T give s identical adjacent vertices.
The Delaunay Triangulation of point sets generated by a uniform density function has several statistical properties. One important property, derived in [9], concerns the expected degree of a randomly selected vertex. Hilhorst proves that the degree distribution has asymptotic behavior identical to the following equation

p(i) = (0.344… / 4π²) · (8π²)^i / (2i)! · (1 + O(i⁻²))    for i > 2

as the number of vertices within the Delaunay triangulation approaches infinity.

The graph depicts empirical data on the degree distribution of Delaunay triangulation vertices. Note that the distribution converges rapidly as the degree increases.
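To see how quickly this tail falls off, the asymptotic expression can be evaluated numerically. The following sketch assumes the formula exactly as transcribed above (including the truncated leading constant 0.344…), so the values are illustrative only:

```python
import math

C = 0.344  # leading constant from Hilhorst's result, truncated as in the text

def p_tail(i):
    # Asymptotic tail of the Delaunay degree distribution, as transcribed
    # above: C / (4*pi^2) * (8*pi^2)^i / (2i)!.
    return C / (4 * math.pi ** 2) * (8 * math.pi ** 2) ** i / math.factorial(2 * i)
```

The factorial in the denominator dominates for large i, so the probability of high-degree vertices decays faster than exponentially, matching the rapid convergence seen in the empirical data.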

In addition to this, [14] proves that the expected length of the longest edge within a Delaunay Triangulation graph with a uniform vertex distribution is on the order of O(log n / n).

2.3) Assumption

DBSCAN is known to have a provably non-linear worst-case time of at least O(n^{4/3}) (this was proven within [10]). This implies that in order to compute DBSCAN in linear time, it is necessary to establish some assumption regarding the dataset or the values of Epsilon and minPts. For this project, the assumption will be that the point set {v} is generated within some finite region T by a density function that is positive and semi-uniform. Semi-uniform means that the region T can be subdivided into a set of regions with positive area such that the density function is continuous and finite within each of these regions. In addition, we assume that minPts is some reasonably small constant (which is usually the case in most DBSCAN applications).

3) Lemmas
This section establishes two basic lemmas that will be used later within the derivation of the DT-
DBSCAN algorithm.

Lemma 1: The Delaunay Triangulation of a set of n points will have less than or equal to 3n – 6 edges.

Proof:
Let n be the number of vertices in {v} and let b be the number of vertices on the boundary of the
convex hull of {v}. Let t be the number of triangles within the Delaunay triangulation of {v}, and let
s be the sum of all the interior angles of all triangles within the Delaunay triangulation of {v}.

First, find the value of s for {v}. Given that any interior point within a Delaunay Triangulation is always fully surrounded by triangles, the sum of the angles around an interior point is always 2π. Next, according to the polygon interior angle sum theorem, the interior angles of a convex polygon with b vertices sum to π(b − 2) = πb − 2π, so the triangle angles around all points on the convex hull sum to πb − 2π. There are n − b interior points, meaning that the angles around all interior points sum to 2π(n − b). By adding these two together,

s = 2πn − πb − 2π

Given that the sum of the interior angles of a triangle is π, another way to compute s is to multiply π by the number of triangles within the Delaunay triangulation of {v}. Substituting s = πt into the equation above and simplifying gives

t = 2n − b − 2

The convex hull has at least 3 vertices, so the maximum number of triangles is

t ≤ 2n − 5

Counting 3 edges per triangle double counts each interior edge, because two triangles always border such an edge. However, each edge on the convex hull is not double-counted, because only one triangle borders these edges. Therefore,

Number of edges = (3t + b) / 2 ≤ (6n − 3b − 6 + b) / 2 = 3n − b − 3

As before, the minimum value for b is 3, so the maximum number of edges is 3n − 6 [Q.E.D.].

Corollary 1: A randomly chosen vertex on a Delaunay Triangulation has an average degree that is less
than 6.

Proof:

The handshaking lemma states the following:

∑_{v ∈ {v}} deg(v) = 2|E|

Lemma 1 shows that |E| ≤ 3n − 6. Therefore,

∑_{v ∈ {v}} deg(v) ≤ 6n − 12

(1/n) ∑_{v ∈ {v}} deg(v) ≤ 6 − 12/n < 6

where (1/n) ∑_{v ∈ {v}} deg(v) denotes the average degree of a vertex within the Delaunay Triangulation [Q.E.D.].

Lemma 2: For any point set {v} and any two points v_i and v_j, there exists a Delaunay path P_{i,j} such that for each point v_k ∈ P_{i,j}, if v_k ≠ v_j, the distance between v_k and v_i is less than the distance between v_j and v_i.

Proof:
Let C be a circle such that the line segment constructed by connecting v_i and v_j is its diameter.

Case 1: If there is no point from {v} within C, then there exists an edge that directly connects v_i to v_j.

Let D denote a circle that is fixed to v_i and v_j. The center of D lies somewhere on the perpendicular bisector of the line segment v_i v_j, and the radius of D is minimized when D = C. As the center of circle D moves away from the center of circle C, the radius of circle D strictly increases.

For each point v_h from {v}, construct the circumcircle of triangle v_i v_j v_h. Let v_min denote the point v_h that minimizes the radius of the circumcircle of v_i v_j v_h.
Let the center of circle D start from the center of circle C and move towards v_min. As it is moving, D will never intersect another point within {v} until it intersects v_min, because that would imply the existence of another triangle with a smaller circumcircle than v_i v_j v_min (fig. 2). This means that there is still no point contained within D, which means that v_i v_j v_min is a valid Delaunay triangle. Therefore, there is an edge that directly connects v_i and v_j, which is a path that meets the conditions of lemma 2.

Fig. 2: (left) Moving circle D while it is fixed to vertices v_i and v_j. (right) Under case 1 of lemma 2, region 1 (depicted in the diagram) cannot contain a point due to the definition of case 1, and region 2 cannot contain a point because that would imply the existence of some point with a smaller value of r_D(v) than r_D(v_min).

Case 2: If there are points in the dataset that lie inside of C, then there still exists a path between v_i and v_j that adheres to the limitations outlined in the statement of Lemma 2.

Let E be a circle that is always tangent to the inside of circle C at v_j. As the radius r_E of circle E increases from 0 (fig. 3) towards r_C (the radius of circle C), E will intersect each data point that lies within C exactly once.

For each point v inside C, let r_E(v) denote the radius of circle E when E intersects v, and let v_min denote the point within C that minimizes the value r_E(v). By definition, when circle E intersects v_min, circle E will contain no other points, because if it did, that would imply the existence of some point with a smaller value of r_E(v) than v_min.
Next, similar to the procedure used in case 1, let F be a circle that is fixed to points v_min and v_j. As the center of F moves away from the center of circle E (either to the right or to the left), let v_first denote the first point circle F intersects. Again, similar to case 1, v_j v_min v_first must form a triangle within the Delaunay triangulation of {v} because no other point can be contained within the circumcircle of v_j v_min v_first (fig. 2). This means that v_j and v_min are connected by an edge. (Note that for any point v_k in the interior of C, the distance between v_k and v_i must be smaller than the distance between v_j and v_i.)

Given this information, it is possible to prove case 2 by recursion. Repeat lemma 2 for points v_min and v_i (instead of v_j and v_i). If case 2 is called again, then the point derived from the second iteration of case 2, v_min2, will be connected to v_min, which is connected to v_j. By continually repeating this process, and because {v} has a finite number of points, case 1 must eventually be called. By that point, the generated path {v_j, v_min, v_min2, v_min3, …, v_i} will adhere to the required conditions, because for the n-th element v_n within this path, the distance between v_n and v_i will always be greater than the distance between v_{n+1} and v_i. This completes the proof [Q.E.D.].

Fig. 3: Diagram of circle E as r_E increases. The solid circle represents circle C, and the red point denotes v_j.
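For small point sets, Lemma 2 can be checked numerically: build the Delaunay graph with a brute-force empty-circumcircle test, then search for a path from v_j to v_i whose vertices (other than v_j itself) are all strictly closer to v_i than dist(v_i, v_j). This is an illustrative sketch with an arbitrary random input, not part of the paper's algorithm:

```python
import itertools, math, random
from collections import deque

def circumcircle(a, b, c):
    # Circumcenter and circumradius of triangle abc (None if collinear).
    ax, ay = a; bx, by = b; cx, cy = c
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    ux = ((ax * ax + ay * ay) * (by - cy) + (bx * bx + by * by) * (cy - ay)
          + (cx * cx + cy * cy) * (ay - by)) / d
    uy = ((ax * ax + ay * ay) * (cx - bx) + (bx * bx + by * by) * (ax - cx)
          + (cx * cx + cy * cy) * (bx - ax)) / d
    return (ux, uy), math.dist((ux, uy), a)

def delaunay_edges(points):
    # Brute-force empty-circumcircle test over all triples of points.
    edges = set()
    for i, j, k in itertools.combinations(range(len(points)), 3):
        cc = circumcircle(points[i], points[j], points[k])
        if cc is None:
            continue
        center, r = cc
        if all(math.dist(center, points[h]) >= r - 1e-9
               for h in range(len(points)) if h not in (i, j, k)):
            edges |= {(i, j), (j, k), (i, k)}
    return edges

def lemma2_path_exists(points, i, j, edges):
    # BFS from v_j to v_i using only vertices strictly closer to v_i
    # than dist(v_i, v_j), per the statement of Lemma 2.
    adj = {v: set() for v in range(len(points))}
    for a, b in edges:
        adj[a].add(b); adj[b].add(a)
    d_ij = math.dist(points[i], points[j])
    allowed = {v for v in range(len(points))
               if math.dist(points[v], points[i]) < d_ij} | {j}
    seen, queue = {j}, deque([j])
    while queue:
        v = queue.popleft()
        if v == i:
            return True
        for w in adj[v]:
            if w in allowed and w not in seen:
                seen.add(w)
                queue.append(w)
    return False

random.seed(2)
pts = [(random.random(), random.random()) for _ in range(15)]
E = delaunay_edges(pts)
ok = all(lemma2_path_exists(pts, i, j, E)
         for i, j in itertools.permutations(range(len(pts)), 2))
```

Running this over every ordered pair of a random 15-point set should find such a path in every case.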

4) Degree Distribution of Semi-Uniform Delaunay Triangulations


[9] proves that a Delaunay triangulation with uniformly distributed vertices follows a distribution with a specific asymptotic behavior (stated in section 2.2). This section proves that Hilhorst's results also apply to the more general class of Delaunay Triangulations formed from a set of vertices with a semi-uniform distribution.

First, let us define “grid point matrices”.

Let λ(x, y) be some distribution function that generates points within a bounded region T, and let {v} be a set of n random points generated by this distribution function. Let S be the smallest square that contains the entirety of T with sides parallel to the x and y axes (S may not be unique; in that case, choose the square that minimizes the x and y coordinates of its center).
Create an m-by-m grid within square S, where every grid space is an identically sized square. Next, let V be a square matrix with the same dimensions. V is called an m-grid point matrix of {v} if V_{a,b} contains the number of points of {v} that lie within the grid space (in S) that is a spaces from the left and b spaces from the top.

(left) a density function that is defined only within some region T
(center) a set of points {v} generated by λ(x, y), overlaid on a 6-by-6 square grid that encompasses the entirety of region T
(right) the 6-grid point matrix for {v}
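The definition above can be transcribed directly into code. In the following Python sketch the bounding square S is passed in explicitly as (x0, y0, side), which is a simplifying assumption of this illustration:

```python
def grid_point_matrix(points, m, square):
    # V[a][b] counts the points in the grid space that is a spaces from
    # the left and b spaces from the top of the bounding square S.
    x0, y0, side = square
    cell = side / m
    V = [[0] * m for _ in range(m)]
    for x, y in points:
        a = min(int((x - x0) / cell), m - 1)          # spaces from the left
        b = min(int((y0 + side - y) / cell), m - 1)   # spaces from the top
        V[a][b] += 1
    return V
```

For instance, with S the unit square and m = 2, the points (0.1, 0.9) and (0.2, 0.8) both fall in the top-left grid space, while (0.9, 0.1) falls in the bottom-right one.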

Next, let us define “Delaunay unique matrices”.

Let V be some m-by-m matrix with non-negative integer values, and let {v1} and {v2} be two sets of vertices that both have V as their m-grid point matrix. If there exists such a pair {v1} and {v2} for which the Delaunay triangulations of {v1} and {v2} are not identical, then V is called a Delaunay non-unique matrix. If no such pair exists, then V is called a Delaunay unique matrix.

The diagrams depict two sets of points with identical 6-grid point matrices. It is easily observable that the given 6-grid point matrix will always define an identical Delaunay triangulation graph, which shows that it is a Delaunay unique matrix.

The diagrams depict two sets of points with identical 6-grid point matrices. The Delaunay triangulations of the point sets do not yield identical graphs, which shows that the matrix is a Delaunay non-unique matrix.

The following lemma shows that it is rare for an m-grid point matrix to be Delaunay non-unique for
large values of m.

Lemma 3: For a randomly created m-grid point matrix, the probability that it is Delaunay non-unique approaches 0 as the grid becomes arbitrarily large.

Proof:
We prove this lemma with a recursive approach.

Let V be the m-grid point matrix for some set of points {v}, and let V be Delaunay unique. Let us add another point v_new to {v}, and let P be the probability that the new grid point matrix V_new is Delaunay non-unique.
The first thing to note is that if v_new lies within a grid space that intersects the circumcircle of a triangle formed by some three points within {v}, there is a chance that V_new is non-unique. On the other hand, if v_new lies within a grid space that intersects no such circumcircle, V_new must remain Delaunay unique. This follows from the definition of the Delaunay triangulation: the graph of a Delaunay triangulation changes based on whether or not a given point lies within a circumcircle of a Delaunay triangle, and nothing else.
Therefore, if v_new does not lie within a grid space that is intersected by a circumcircle, then we know that V_new will be Delaunay unique. The number of grid spaces intersected by a single circle is O(m), and the number of triangles is (n choose 3), which is O(n³), so the proportion of the m² grid spaces intersected by some circumcircle is O(n³/m). We can clearly see that as m becomes infinitely large, this proportion becomes 0 [Q.E.D.].
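The counting argument above can be illustrated numerically: for a fixed circle, the fraction of grid spaces that meet the circle's boundary shrinks roughly like 1/m as the grid is refined. The circle parameters below are arbitrary, and the corner-sign test is an approximation chosen for this sketch:

```python
import math

def fraction_intersected(m, center=(0.5, 0.5), r=0.3):
    # Fraction of cells in an m-by-m grid over the unit square whose
    # corners straddle the circle boundary (i.e. the cell meets the circle).
    hit = 0
    for a in range(m):
        for b in range(m):
            corners = [((a + dx) / m, (b + dy) / m)
                       for dx in (0, 1) for dy in (0, 1)]
            inside = [math.dist(p, center) < r for p in corners]
            if any(inside) and not all(inside):
                hit += 1
    return hit / m ** 2
```

Refining the grid from m = 10 to m = 40 visibly shrinks this fraction, matching the O(1/m) behavior of a single circle in the argument above.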

Now, the following series of lemmas allows us to reach the desired result for this section.

Lemma 4: Let {v} be a very large set of points generated by density function λ(x, y) where

• every point in {v} lies within a finite, bounded region T where T ⊂ ℝ²
• c − ∆ ≤ λ(x, y) ≤ c + ∆ (for some constants c and ∆)

Then, as ∆ approaches 0, the Delaunay Triangulation of {v} has an expected vertex degree distribution that approaches that of a uniform Delaunay Triangulation.

Proof:
Let T be a finite planar region, and let {v1} denote a set of n points within T generated by the uniform density function λ₁(x, y) = c. Then, let {v2} denote a set of n points within T generated by a density function λ₂(x, y) such that

c − ∆c ≤ λ₂(x, y) ≤ c + ∆c

where ∆c is a small positive number.

Let V be some m-by-m matrix with non-negative integer values where the sum of the elements is equal to some integer n. Let p₁(V) denote the probability that V is the grid point matrix of a set of n points arbitrarily generated by density function λ₁(x, y). For example, if V had the value 0 for every space except for the center space, which had n, p₁(V) would denote the probability that all vertices generate within the small central grid space (which is very unlikely for large n). On the other hand, let p₂(V) denote the probability that V is the grid point matrix of a set of n points arbitrarily generated by density function λ₂(x, y). p₁(V) and p₂(V) are defined as follows:

p₁(V) = n! / (∏_{i=1}^{m} ∏_{j=1}^{m} |b_{i,j}|!) · ∏_{i=1}^{m} ∏_{j=1}^{m} ( ∬_{b_{i,j}} c dx dy )^{|b_{i,j}|}

p₂(V) = n! / (∏_{i=1}^{m} ∏_{j=1}^{m} |b_{i,j}|!) · ∏_{i=1}^{m} ∏_{j=1}^{m} ( ∬_{b_{i,j}} λ₂(x, y) dx dy )^{|b_{i,j}|}

Here, b_{i,j} represents the grid space that is i spaces from the left and j spaces from the top, and |b_{i,j}| represents the number of vertices within grid space b_{i,j}.
Let v(k, V) denote the probability that a randomly chosen vertex within the Delaunay triangulation of the vertex set denoted by V will have a vertex degree of k. Note that this function is only defined when V is a Delaunay unique matrix, because non-unique matrices have multiple possible Delaunay triangulations.
Let P₁(k) denote the probability that a vertex from a set of n points arbitrarily generated by density function λ₁(x, y) will have a degree of k. Similarly, let P₂(k) denote the probability that a vertex from a set of n points arbitrarily generated by density function λ₂(x, y) will have a degree of k. P₁(k) and P₂(k) are defined as follows:

P₁(k) = ∑_{V ∈ M_{m,n}} v(k, V) p₁(V)

P₂(k) = ∑_{V ∈ M_{m,n}} v(k, V) p₂(V)

Here, M_{m,n} represents the set of m-by-m non-negative integer matrices whose elements sum to n.

First, note the following:

∬_{b_{i,j}} c dx dy − ∬_{b_{i,j}} ∆c dx dy ≤ ∬_{b_{i,j}} λ₂(x, y) dx dy ≤ ∬_{b_{i,j}} c dx dy + ∬_{b_{i,j}} ∆c dx dy

where b_{i,j} is some arbitrary grid space.

Given this, we see that the following inequality holds:

n! / (∏_{i=1}^{m} ∏_{j=1}^{m} |b_{i,j}|!) · ∏_{i=1}^{m} ∏_{j=1}^{m} ( ∬_{b_{i,j}} c dx dy − ∬_{b_{i,j}} ∆c dx dy )^{|b_{i,j}|}

≤ n! / (∏_{i=1}^{m} ∏_{j=1}^{m} |b_{i,j}|!) · ∏_{i=1}^{m} ∏_{j=1}^{m} ( ∬_{b_{i,j}} λ₂(x, y) dx dy )^{|b_{i,j}|}

≤ n! / (∏_{i=1}^{m} ∏_{j=1}^{m} |b_{i,j}|!) · ∏_{i=1}^{m} ∏_{j=1}^{m} ( ∬_{b_{i,j}} c dx dy + ∬_{b_{i,j}} ∆c dx dy )^{|b_{i,j}|}

As ∆c approaches 0, we see that p₂(V) gets sandwiched into equaling

n! / (∏_{i=1}^{m} ∏_{j=1}^{m} |b_{i,j}|!) · ∏_{i=1}^{m} ∏_{j=1}^{m} ( ∬_{b_{i,j}} c dx dy )^{|b_{i,j}|}

which is the expression for p₁(V).

Since lim_{∆c→0} p₂(V) = p₁(V) for any matrix V, we see that

lim_{∆c→0} ∑_{V ∈ M_{m,n}} v(k, V) p₂(V) = ∑_{V ∈ M_{m,n}} v(k, V) p₁(V)

which completes the proof [Q.E.D.].
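The sandwich argument can be illustrated on a tiny hypothetical grid. Here a 2-by-2 matrix of cell probabilities stands in for the integrals ∬_{b_{i,j}} λ dx dy, and the multinomial probability of a fixed grid point matrix under a perturbed density is compared with its value under the uniform density:

```python
from math import factorial

def multinomial_prob(V, Q):
    # p(V) = n! / (prod |b_ij|!) * prod q_ij^{|b_ij|}: the probability that
    # n generated points produce grid point matrix V, when a point lands in
    # cell (i, j) with probability q_ij.
    cells = [(V[a][b], Q[a][b]) for a in range(len(V)) for b in range(len(V))]
    n = sum(count for count, _ in cells)
    p = float(factorial(n))
    for count, q in cells:
        p *= q ** count / factorial(count)
    return p

uniform = [[0.25, 0.25], [0.25, 0.25]]   # stands in for lambda_1 = c

def perturbed(d):
    # stands in for lambda_2, within d of the uniform cell value
    return [[0.25 + d, 0.25 - d], [0.25 - d, 0.25 + d]]

V = [[2, 1], [1, 2]]                     # a fixed 2-by-2 grid point matrix (n = 6)
p1 = multinomial_prob(V, uniform)
```

As the perturbation d shrinks, multinomial_prob(V, perturbed(d)) approaches p1, mirroring how p₂(V) is sandwiched into p₁(V) as ∆c approaches 0.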

Lemma 5: Let {v} be a set of n points generated by a positive density function λ(x, y). Then, the longest edge within the Delaunay Triangulation of {v} will have an edge length on the order of O(log n / n).

Proof:
[14] proves that for a Delaunay Triangulation of a uniformly distributed set of points, the expected length of the longest edge within the entire graph is O(log n / n).
Let us assume that {v} is a very large set of vertices generated by a continuous and positive density function λ(x, y) that is only defined within a finite region T. Because λ(x, y) is assumed to be positive, min(λ(x, y)) > 0.
The next step is to split the density function as follows:

λ(x, y) = min(λ(x, y)) + (λ(x, y) − min(λ(x, y)))

Note that generating a set of points with density min(λ(x, y)) and combining it with a set of points generated with density λ(x, y) − min(λ(x, y)) is an equivalent procedure to generating a set of points with density λ(x, y).
Note that min(λ(x, y)) is a positive constant, so the Delaunay triangulation of the vertex set generated by min(λ(x, y)) has a maximum edge length on the order of O(log n / n).
Next, in the process of combining this vertex set with the vertex set generated by the density function λ(x, y) − min(λ(x, y)), the resulting Delaunay triangulation will have a maximum edge length that is not longer than the maximum edge length of the Delaunay triangulation of the point set generated by min(λ(x, y)). This follows from the Voronoi diagram definition of the Delaunay triangulation; adding a new point to the Voronoi diagram will never increase edge length, because that would require Voronoi polygons to expand and become adjacent to other polygons that they were originally disconnected from (which is impossible).
Therefore, we see that the Delaunay triangulation of a vertex set with a positive density curve will have a maximum edge length on the order of O(log n / n) [Q.E.D.].

Lemma 6: Let {v} be a very large set of points generated by density function λ(x, y) where

• every point in {v} lies within a finite, bounded region T where T ⊂ ℝ²
• λ(x, y) is continuous and positive

Then, the Delaunay Triangulation of {v} has an expected vertex degree distribution that approaches that of a uniform Delaunay Triangulation.

Proof:
Let us assume that {v} is a very large set of n vertices generated by a continuous and positive density function λ(x, y). Lemma 5 proves that the longest edge of the Delaunay triangulation of {v} will be on the order of O(log n / n). This directly implies that for any vertex s ∈ {v}, there will exist some circle centered at s with radius O(log n / n) such that all of the neighbors of s lie within this circle.
Let s be some point from {v}, and consider a circle C centered at s with a radius of O(log n / √n) (a somewhat arbitrary choice). As n approaches infinity, the size of this circle approaches 0. However, the number of points within the circle continues to increase indefinitely. This is true because all neighbors of s must be contained in a circle of radius O(log n / n), all neighbors of the neighbors of s must be contained within a circle of radius O(2 · log n / n), and so on. The circle C therefore contains all points that are connected to s by fewer than O(√n) edges, a number which increases to infinity as n increases.
Because λ(x, y) is assumed to be continuous,

lim_{(∆x, ∆y) → (0, 0)} λ(x + ∆x, y + ∆y) = λ(x, y)

Given that the radius of circle C approaches 0 as n approaches infinity, we see that the maximum and minimum values of the density function within the bounds of C converge due to this equation.
We then recognize that the points within C directly satisfy the conditions of lemma 4: every point within the set lies within a finite, bounded region (circle C), and the minimum and maximum values of the density function within this region converge to a single value. Therefore, after Delaunay triangulating {v}, a randomly chosen point within C will have an expected degree distribution that is identical to that of a Delaunay Triangulation with uniformly distributed vertices.
By repeating this process for every point within {v} (assuming that n is arbitrarily large), we see that every point within {v} has an expected Delaunay Triangulation degree distribution that is identical to that of a uniform Delaunay Triangulation [Q.E.D.].

Lemma 7: Let {v} be a set of n points within bounded region T where T ⊂ ℝ². If {v} is generated by an inhomogeneous Poisson point process and B is a region such that B ⊂ T, then the probability that an arbitrary point s within B is incomplete relative to T approaches zero as n approaches infinity.

Proof:
Let s lie within closed region B, and let t denote the distance between s and the nearest boundary of B. Assume that λ(x, y) > 0 for all points (x, y) within B. Let C1 denote the condition that s is incomplete relative to T, and let C2 denote the condition that s is connected to an edge with a length greater than t within the Delaunay Triangulation of region T. C2 will always be true if C1 is true, because any edge with a length less than t will lie within B by definition.
Next, consider a circle with center s and radius t/√2, and divide it into 16 open sectors A1, A2, …, A16 with angles π/8. Then, construct a square with a diagonal along the line formed by s and q, where q is some point outside of B. The diagonal divides the square into two triangles. As seen within lemma 2, there exists some vertex-free circle passing through s and q if s and q are connected by an edge. By an extension of Thales's theorem, this circle will always contain at least one of the two triangles, which means that at least one of the sectors is covered. Let C3 denote the condition that one of these sectors is empty; C3 will always be true if C2 is true.
Given that C3 will always be true if C1 is true,

P(C₁) ≤ P(C₃) ≤ ∑_{i=1}^{16} P(A_i contains no points)

This can be expanded to

∑_{i=1}^{16} ( 1 − ∫_{A_i} λ(x, y) )^{n−1}

where ∫_T λ(x, y) = 1.

Because λ(x, y) > 0, we have ∫_{A_i} λ(x, y) > 0. This means that the summation above is less than

16 · ( max_{1≤i≤16} ( 1 − ∫_{A_i} λ(x, y) ) )^{n−1}

where max_{1≤i≤16} ( 1 − ∫_{A_i} λ(x, y) ) < 1. Therefore,

P(C₁) ≤ 16(1 − a)^{n−1}

where a is some positive number less than 1 [Q.E.D.].

Lemma 8: Let {v} be a very large set of points generated by density function λ(x, y) where

• every point in {v} lies within a finite, bounded region T where T ⊂ ℝ²
• λ(x, y) is semi-uniform and positive

Then, the Delaunay Triangulation of {v} has an expected vertex degree distribution that approaches that of a uniform Delaunay Triangulation.

Proof:
Let {v} be a set of points within region T. If {v} is generated by a semi-uniform and positive density function λ(x, y), then the definition of semi-uniform shows that it is possible to partition region T into a set of regions {r₁, r₂, …} such that each r ∈ {r₁, r₂, …} has positive area. Given lemma 7, the probability that any random point within a region r ∈ {r₁, r₂, …} is incomplete is at most 16(1 − a)^{n−1}. As n approaches infinity, this probability approaches 0, which means that almost every point will be complete relative to its respective region. In other words, the contribution of inter-region edges is minuscule for very large sets of points in the context of calculating degree distributions. This shows that intra-region degree distributions are the only deciding factor in determining the overall degree distribution of the Delaunay Triangulation of a semi-uniform dataset. Lemma 6 proves that within a finite region with a continuous density function, the degree distribution of a Delaunay triangulation is equivalent to that of a uniform Delaunay Triangulation. Therefore, if n is sufficiently large, the expected vertex degree distribution of the Delaunay Triangulation of a semi-uniform dataset is essentially equivalent to the vertex degree distribution of a uniform Delaunay Triangulation.

Corollary 2: The properties of [9] also apply to Delaunay Triangulations of semi-uniformly distributed
vertices.

Proof:
Lemma 8 proves that the degree distribution of the Delaunay Triangulation of a set of semi-
uniform vertices is equal to the degree distribution of a uniform Delaunay Triangulation. This directly
implies that these two distribution functions also have identical asymptotic behaviors. Therefore, the
asymptotic function outlined in [9] also applies to Delaunay Triangulations formed by semi-uniformly
distributed vertices.

5 DT-DBSCAN Algorithm Derivation

5.1 Algorithm Overview

DT-DBSCAN is partitioned into four distinct segments. The first segment triangulates the data using
the Delaunay triangulation process established in Katajainen & Koppinen [5]. The second segment separates the
noise points from the core points, and this is done by conducting a tree-search along the edges of the
triangulation. The next segment connects all density-reachable core points by re-triangulating the core point set
from the previous segment and removing all triangulation edges exceeding the length of epsilon. All connected
components are density-reachable, and they are categorized into the same cluster. Finally, boundary points are
assigned from the noise point set by checking the core points within epsilon distance of each point. Each
section has an expected linear run-time, which means that the overall DT-DBSCAN algorithm also runs in
linear expected time.

Fig. 4: The steps of DT-DBSCAN

5.2 Delaunay Triangulation

There are several Delaunay Triangulation algorithms that run in O(n) expected time. DT-
DBSCAN uses the algorithm presented in [5] (Katajainen & Koppinen, 1988), which is defined as such:

Algorithm 2.1: Delaunay Triangulation

Input: points {v}

K = 4^⌊log₄ |{v}|⌋
Determine the smallest square that contains the points in {v}
Partition the square into K square cells C(i, j) (each cell is initially empty)

for each point P in {v}
    Insert P into the proper cell C(i, j)
end for loop
for each cell C(i, j)
    DT(i, j) = Guibas&Stolfi_DT(C(i, j))
end for loop
for h = log₄ K down to 1
    for i = 0 to 2^(h−1) − 1
        for j = 0 to 2^(h−1) − 1
            T1 = Merge(DT(2i, 2j), DT(2i+1, 2j))
            T2 = Merge(DT(2i, 2j+1), DT(2i+1, 2j+1))
            DT(i, j) = Merge(T1, T2)
        end for loop
    end for loop
end for loop

where Guibas&Stolfi_DT is an O(n log n) triangulation algorithm and Merge is a triangulation merge
algorithm; both are given in [4].
This triangulation algorithm combines cell division techniques and divide-and-conquer techniques to
determine the Delaunay triangulation. Katajainen and Koppinen prove that this algorithm runs in O(n) expected
time for bounded non-homogeneous distributions. In addition, Katajainen and Koppinen also prove in their
paper that this algorithm will always yield a valid Delaunay Triangulation.
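The bucketing phase of this algorithm can be sketched on its own in Java (the paper's implementation language). The class and method names below are illustrative assumptions, not the paper's code, and the Guibas & Stolfi triangulation and quadtree-order merging that follow this step are omitted (they are given in [4] and [5]):

```java
import java.util.ArrayList;
import java.util.List;

public class CellPartition {
    // Bucket points into the K = 4^floor(log4 |{v}|) square cells of the
    // smallest enclosing square, as in the first phase of Algorithm 2.1.
    static List<List<double[]>> partition(double[][] pts) {
        int m = 0;                               // m = floor(log4 n), computed exactly
        for (int k = pts.length; k >= 4; k /= 4) m++;
        int side = 1 << m;                       // grid is 2^m x 2^m, so K = 4^m cells
        double minX = Double.MAX_VALUE, minY = Double.MAX_VALUE;
        double maxX = -Double.MAX_VALUE, maxY = -Double.MAX_VALUE;
        for (double[] p : pts) {
            minX = Math.min(minX, p[0]); maxX = Math.max(maxX, p[0]);
            minY = Math.min(minY, p[1]); maxY = Math.max(maxY, p[1]);
        }
        double len = Math.max(maxX - minX, maxY - minY);  // side of enclosing square
        if (len == 0) len = 1;                   // degenerate case: all points coincide
        List<List<double[]>> cells = new ArrayList<>();
        for (int i = 0; i < side * side; i++) cells.add(new ArrayList<>());
        for (double[] p : pts) {
            int i = Math.min(side - 1, (int) ((p[0] - minX) / len * side));
            int j = Math.min(side - 1, (int) ((p[1] - minY) / len * side));
            cells.get(j * side + i).add(p);      // row-major cell index
        }
        return cells;
    }
}
```

Each resulting cell would then be triangulated independently and the triangulations merged pairwise in quadtree order, which is where the linear expected time arises.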

5.3 Core Point Classification

In every iteration of DBSCAN, each point v within {v} is at some point classified as either a noise
point or a core point. In DT-DBSCAN, this is done by tree-searching Delaunay edges to find the points of {v}
within epsilon distance of v. If there are more than minPts points within epsilon distance of point v, v is
classified as a core point; otherwise, it is classified as a noise point. The algorithm is defined as such:

Algorithm 2.2: Core Point Classification

Input: points {v}, Adjacency List, minPts, epsilon

Output: set of core points, set of noise points, neighborhood array set (only for noise points)

Neighborhood = empty 2D Array

for each point P in {v} {
    Query = empty list
    Finished = empty list
    Add P to Query
    while | Query | > 0 {
        P1 = last element within Query
        Remove P1 from Query
        for each point P2 in Finished
            if P2 = P1, then continue the while loop
        if the distance between P and P1 exceeds epsilon, then continue the while loop
        for each point P3 adjacent to P1
            Add P3 to Query
        Add P1 to Finished
        if | Finished | > minPts {
            label(P) = core
            increment the for-loop (move on to the next point P)
        }
    } while end
    insert Finished into Neighborhood
    label(P) = noise
} for end

In the worst case, this segment runs in O(n²) time: there are n points in {v}, and each Query
search can visit at most n points, because that is the total number of points in {v}.

However, for a data set {v} that adheres to the conditions outlined in the assumption section, the
expected run time is bound by O(n).
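The bounded edge-search above can be sketched in Java. The adjacency-list representation and all names here are illustrative assumptions rather than the paper's implementation; the search expands only through points within epsilon of the point being classified and stops as soon as more than minPts such points are found:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CorePointClassifier {
    // Classify each point as core (true) or noise (false) by a bounded
    // search along triangulation edges, following Algorithm 2.2.
    static boolean[] classify(double[][] pts, List<List<Integer>> adj,
                              double eps, int minPts) {
        boolean[] core = new boolean[pts.length];
        for (int p = 0; p < pts.length; p++) {
            Deque<Integer> query = new ArrayDeque<>();
            Set<Integer> finished = new HashSet<>();
            query.push(p);
            while (!query.isEmpty() && finished.size() <= minPts) {
                int p1 = query.pop();
                if (finished.contains(p1)) continue;        // already visited
                if (dist(pts[p], pts[p1]) > eps) continue;  // outside the eps-ball of p
                finished.add(p1);
                for (int p2 : adj.get(p1)) query.push(p2);  // follow Delaunay edges
            }
            core[p] = finished.size() > minPts;             // count includes p itself
        }
        return core;
    }

    static double dist(double[] a, double[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }
}
```

For noise points, the paper also retains Finished as that point's Neighborhood array for the later boundary classification step; returning it alongside the labels is a straightforward extension of this sketch.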

Theorem 1: The expected runtime of the Core Point Classification segment is O(n)

Proof:
There are two for-loops within the while-loop in the algorithm outlined above. The first for-
loop is a constant operation because the length of Finished is bounded by minPts, which is a constant
independent of n. The second for-loop operates in deg(P1) worst-case time, because that is the number
of points adjacent to P1 (this for-loop may not be reached in some cases. In these cases, the run time
is constant). Every other operation within the while-loop is a constant operation, so one iteration of
the while loop runs in O(deg(P1)) worst-case time. Given that every operation within the encompassing
for-loop is a constant operation except for the while-loop, the total run time of the Core Point
Classification step, T, can be represented as

T = Σ_{v ∈ {v}} while(v)

where while(v) denotes the run-time of the while-loop for point v.


Let Ci,j represent a function that yields either 1 or 0. If vi is within the Query of vj, Ci,j = 1.
Else, it is 0. Given this function, T can also be represented as

T = Σ_{vi ∈ {v}} Σ_{vj ∈ {v}/vi} Ci,j · deg(vi)

Let vi and vj be two points such that there does not exist a path Pi,j with fewer than minPts
points. Then, it is impossible for vi to ever be within the Query of vj. If vj, v2, v3, …, vk−1, vi denotes the
elements of a path between vj and vi, this is true because if vi were to be in the Query during the
classification of vj, points vj, v2, v3, …, vk−1 would already be within the Finished array. Given the initial
conditions for vi and vj, the length of the Finished array must exceed minPts by this point. However,
the algorithm increments to another point P as soon as Finished exceeds minPts in length, so the
search would have moved on before the path could reach vi. Therefore, vi will never appear during the
core point classification of vj, and vice versa.
First, let vmax(n) denote the point with the n-th largest degree within the Delaunay
Triangulation of {v}. Let {Nmax(1)} = {v1, v2, …} denote the set of points such that for each point vi,
there exists a path Pmax(1),i with fewer than minPts points. Every point within {Nmax(1)} must have a
degree less than or equal to deg(vmax(1)). For the sake of overcounting, let us assume that every point
has the same degree as vmax(1). Given this, the number of points 1 edge away from vmax(1) is
deg(vmax(1)). The number of points 2 edges away from vmax(1) is at most deg(vmax(1))². For k edges
away, there are at most deg(vmax(1))^k points. By adding these up, the size of {Nmax(1)} must be less
than Σ_{k=1}^{minPts} deg(vmax(1))^k. The size of {Nmax(1)} is greater than or equal to the number of
times Cmax(1),i = 1 for some vi ∈ {v}, because {Nmax(1)} represents the maximum set of points that
can contain vmax(1) within their Query.
Then, to determine the number of times Ci,max(1) = 1, note that this value is also bound by the
size of {Nmax(1)}, because the number of points reachable from vmax(1) is equal to the number of
points that can reach vmax(1).
Rewrite the summation from above as such:

T ≤ Σ_{vi ∈ {v}/vmax(1)} Σ_{vj ∈ {v}/{vi, vmax(1)}} Ci,j · deg(vi) + Σ_{vj ∈ {v}} Cmax(1),j · deg(vmax(1)) + Σ_{vi ∈ {v}} Ci,max(1) · deg(vi)

First, for the sake of overcounting, replace the deg(vi) in the last summation with
deg(vmax(1)), because deg(vmax(1)) ≥ deg(vi). Next, given that

Σ_{vj ∈ {v}} Cmax(1),j ≤ Σ_{k=1}^{minPts} deg(vmax(1))^k  and  Σ_{vi ∈ {v}} Ci,max(1) ≤ Σ_{k=1}^{minPts} deg(vmax(1))^k

the summation above can be simplified to

T ≤ Σ_{vi ∈ {v}/vmax(1)} Σ_{vj ∈ {v}/{vi, vmax(1)}} Ci,j · deg(vi) + Σ_{k=1}^{minPts} deg(vmax(1))^{k+1}

The inequality above establishes a recursive step: the double summation on the left can be
expanded by replacing vmax(1) and {v} with vmax(2) and {v}/vmax(1). By continually repeating this
recursive step, the double summation on the left eventually becomes Σ_{vi ∈ {v}/{v}} Σ_{vj ∈ {v}/{v}} Ci,j · deg(vi).
Because {v}/{v} = ∅, this summation equals 0. Therefore, when this condition is reached,

T ≤ Σ_{v ∈ {v}} Σ_{k=1}^{minPts} deg(v)^{k+1}

The number of points with degree i is less than n · (0.344…/4π) · ((8π²)^i / (2i)!) · C, where C is some
constant such that (1 + O(i^{−2})) < C. Therefore, the inequality above can be re-written as

T ≤ C · n · Σ_{i=1}^{∞} Σ_{k=1}^{minPts} (0.344…/4π) · ((8π²)^i / (2i)!) · i^{k+1}

Let Mc = Σ_{x=1}^{∞} ((8π²)^x / (2x)!) · x^c for some constant c. The ratio test for series convergence
states that if lim_{x→∞} |f(x+1)/f(x)| < 1, then Σ_{x=1}^{∞} f(x) converges to some constant. When this is
done for Mc,

lim_{x→∞} | ((8π²)^{x+1} · (2x)! · (x+1)^c) / ((2x+2)! · (8π²)^x · x^c) | = lim_{x→∞} | (8π² · (x+1)^c) / ((2x+1) · (2x+2) · x^c) |

which simplifies into

lim_{x→∞} | 8π² / ((2x+1) · (2x+2)) | · lim_{x→∞} | (x+1)^c / x^c |

using the product rule of limits. lim_{x→∞} |8π² / ((2x+1)(2x+2))| = 0, and lim_{x→∞} |(x+1)^c / x^c| = 1 for
any finite constant c. Therefore, because 0 · 1 < 1, Mc converges for finite c.
The inequality derived above can be represented as

T ≤ C · n · (0.344…/4π) · Σ_{k=1}^{minPts} M_{k+1}

Given that minPts is a constant independent of n, Σ_{k=1}^{minPts} M_{k+1} converges to some constant
C1. Therefore,

T ≤ C · n · (0.344…/4π) · C1

This shows that T = O(n), which completes the proof [Q.E.D.].
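The convergence of Mc can also be checked numerically. This illustrative Java sketch (names are assumptions, not from the paper) sums the series with iteratively updated terms until they fall below a small threshold:

```java
public class SeriesCheck {
    // Numerically sums Mc = sum over x >= 1 of (8*pi^2)^x / (2x)! * x^c,
    // the series whose convergence the ratio test above establishes.
    // Terms are updated via t(x+1) = t(x) * 8*pi^2 / ((2x+1)(2x+2)) * ((x+1)/x)^c.
    static double mc(int c) {
        double eightPiSq = 8 * Math.PI * Math.PI;
        double term = eightPiSq / 2.0;   // x = 1 term: (8*pi^2)^1 / 2! * 1^c
        double sum = term;
        for (int x = 1; term > 1e-15; x++) {
            term *= eightPiSq / ((2.0 * x + 1) * (2.0 * x + 2))
                    * Math.pow((x + 1.0) / x, c);
            sum += term;
        }
        return sum;
    }
}
```

The partial sums stabilize once the factorial in the denominator dominates, consistent with the ratio-test limit of 0.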

Theorem 2: The Core Point Classification step will always correctly classify core and noise points.

Proof:
For the sake of contradiction, let us assume that the Core Point Classification step incorrectly
classifies a point. This occurs only when a point more than epsilon distance away is counted in Finished,
or when a point less than epsilon distance away is never counted.
The first condition can be easily discounted, because the algorithm only adds a point to
Finished if it is within epsilon distance of the point being classified.
The second condition occurs only if the algorithm fails to find a valid path between the point
being classified and some point within epsilon distance of it. However, lemma 2 proves the existence
of a path between these two points such that every intermediate vertex is within epsilon distance of
the point being classified. Therefore, every point within epsilon distance is classified correctly [Q.E.D.].

5.4 Cluster Construction

The next step in computing DBSCAN is to categorize core points into clusters. Every density-
reachable core point must be placed into the same cluster. This task is done by re-triangulating the core points,
and then by removing all edges that exceed epsilon in length. All connected components after this removal
stage are density-reachable, and they are placed in the same cluster. This part is defined as such:

Algorithm 2.3: Cluster Construction

Input: core points {vcore}, epsilon


Output: Index array (only defined for core points)

Core_Adjacency = Delaunay Triangulation({vcore})

Removed_Core_Adjacency = empty adjacency list
for each point P in {vcore} {
    for each point Q adjacent to P in Core_Adjacency {
        if the distance between Q and P is less than or equal to epsilon
            Add Q to Removed_Core_Adjacency for point P
    } for end
} for end
Index = empty array
clusterIndex = 0
for each point P in {vcore} {
    if Index of P is defined, then continue
    clusterIndex = clusterIndex + 1
    query = empty list
    add P to query
    while | query | > 0 {
        Q = last element of query
        Remove Q from query
        if Index of Q is defined, then continue
        Index of Q = clusterIndex
        for each point R adjacent to Q in Removed_Core_Adjacency
            add R to query
    } while end
} for end
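A minimal Java sketch of the epsilon-filtered connected-component labeling follows. All names and the 0-means-unassigned convention are illustrative assumptions; the adjacency list would come from the re-triangulation of {vcore}, and rather than building Removed_Core_Adjacency explicitly, long edges are simply skipped during traversal:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

public class ClusterConstruction {
    // Label the connected components of the epsilon-filtered adjacency:
    // triangulation edges longer than eps are ignored, and each remaining
    // connected component receives one cluster index (0 = unassigned).
    static int[] label(double[][] pts, List<List<Integer>> adj, double eps) {
        int[] index = new int[pts.length];
        int clusterIndex = 0;
        for (int p = 0; p < pts.length; p++) {
            if (index[p] != 0) continue;
            clusterIndex++;
            Deque<Integer> query = new ArrayDeque<>();
            query.push(p);
            while (!query.isEmpty()) {
                int q = query.pop();
                if (index[q] != 0) continue;
                index[q] = clusterIndex;
                for (int r : adj.get(q))
                    if (index[r] == 0 && dist(pts[q], pts[r]) <= eps)
                        query.push(r);          // traverse only short edges
            }
        }
        return index;
    }

    static double dist(double[] a, double[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }
}
```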

Theorem 3: Assuming {v} is triangulated in linear time, the worst-case runtime of the Cluster
Construction segment is O(n)

Proof:

The run-time of the first section is determined by the run-time of the triangulation algorithm
for {vcore}. Assuming that the triangulation algorithm runs in linear time for {v}, the triangulation
of {vcore} will take less than or equal time, because {vcore} is a subset of {v}.
The run-time of the second section can be represented as

Σ_{v ∈ {vcore}} deg(v)

because this is exactly the number of times the if-statement is called (note that the if-statement is a
constant operation). Due to the handshake lemma, this value is equal to double the number of edges,
and Lemma 1 proves that the Delaunay Triangulation of k points has less than or equal to 3k – 6 edges.
{vcore} is a subset of {v}, and therefore contains at most n points. This means that the
run-time of the second section has a linear bound.
For the third section, the run-time is represented by the number of times the first if-statement
is called, added to the number of times the while loop iterates throughout the entire procedure. First,
the if-statement is called at most n times, because {vcore} is a subset of {v}. For the while-loop, each
point v will be contained within query at most deg(v) times, because a point is added to query only
when it is adjacent to the point Q being processed (a given point can only be assigned to Q once,
because Q is assigned to a cluster before the while loop iterates again). Again, this means that the run
time of this section is less than or equal to the sum of the degrees of the points within {vcore}. The
number of edges has a linear bound, meaning that the third section runs in linear time.
Because all three sections run in linear-time, the worst-case runtime of the Cluster
Construction segment is O(n) [Q.E.D.].

Theorem 4: The Cluster Construction process will always cluster density-reachable core points, but it
will not cluster non-reachable core points

Proof:
Lemma 2 proves that there exists a path between vi and vj such that each point within the path
is closer to vi than vj. When looking closer at Lemma 2, note that each edge vk vk+1 is contained within
the circle with vk vj as the diameter. The largest possible diameter for this circle is achieved when vk =
vi. If the distance between vi and vj is less than epsilon, then the maximum diameter of the circle is less
than epsilon. Because all line segments contained within a circle of diameter d have a length of less than
or equal to d, all edges within the path vi, … vj are less than or equal to epsilon in length. This means
that it is possible to traverse between vi and vj only using edges defined within
Removed_Core_Adjacency.
Let vi and vt be two density-reachable points. This means that there exists some path vi, v2,
v3, … vt where the distance between two consecutive points is less than or equal to epsilon. Given that
there exists a path on the Removed_Core_Adjacency adjacency list between any two points within
epsilon distance of each other, there exists a path between vi and v2, a path between v2 and v3, and a
path between every pair of consecutive points until vt. This means there exists a path between any two
density-reachable points vi and vt, which means that two density-reachable points will be put into the
same cluster.

Let vj and vu be two non-density-reachable points. This means that there does not exist a path
vj, v2, v3, … vu where the distance between two consecutive points is less than or equal to epsilon.
Given this, it is impossible to find a path between vj and vu using edges defined within
Removed_Core_Adjacency, because every edge within this adjacency list is no longer than epsilon, so
any such path would itself be a path whose consecutive points are within epsilon of each other. This
means that non-reachable core points will not be clustered together [Q.E.D.].

5.5 Boundary Point Classification

The last step is to distinguish boundary points from the set of noise points. This is simply done by
checking the neighborhood arrays for each noise point, and then determining whether or not there exists a core
point within this set. If a core point does lie within this range, then the index of that noise point becomes the
index of the core point. This segment is defined as such:

Algorithm 2.4: Boundary Point Classification

Input: noise points {vnoise}, Neighborhood array set, index array (defined only for core points)
Output: completed index array

noiseIndex = empty Boolean array of length n

for each point P in {vnoise} {
    noiseIndex of P = true
    Neighborhood = Neighborhood array for point P
        (from the core classification section)
    for each point Q in Neighborhood {
        if (noiseIndex of Q = false and index of Q is defined) {
            index of P = index of Q
            break
        }
    } for end
} for end
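A minimal Java sketch of this step; it substitutes an explicit isCore array for the noiseIndex bookkeeping above, and all names are illustrative assumptions rather than the paper's code:

```java
import java.util.List;
import java.util.Map;

public class BoundaryClassification {
    // Promote noise points to boundary points: a noise point whose recorded
    // epsilon-neighborhood contains a core point inherits that core point's
    // cluster index (index 0 marks "unassigned").
    static void assignBoundary(List<Integer> noise,
                               Map<Integer, List<Integer>> neighborhood,
                               boolean[] isCore, int[] index) {
        for (int p : noise) {
            for (int q : neighborhood.get(p)) {
                if (isCore[q] && index[q] != 0) {
                    index[p] = index[q];   // p becomes a boundary point of q's cluster
                    break;
                }
            }
        }
    }
}
```

Checking isCore directly ensures a noise point never inherits an index from another noise point that was promoted earlier in the same pass.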

Because the length of a Neighborhood array cannot exceed minPts, the nested for-loop is a
constant operation, meaning that the boundary classification step is proportional to the size of {vnoise}. {vnoise}
is a subset of {v}, which means it has at most n points. Therefore, the boundary point classification stage will
always run in O(n) time.
For accuracy, theorem 2 proves that the Finished array will always contain all points within epsilon range
of point v whenever there are at most minPts points within this range. Theorem 4 proves that the index
of each density-reachable core point will be the same. Therefore, searching the neighborhood array of point v
for a core point will always correctly categorize a boundary point, and it will leave v as a noise point if there
are no core points within epsilon range of v.

6 Experimentation

6.1 Procedure

The accuracy and the expected time complexity of DT-DBSCAN are proven above, but these results
must be verified through experimentation. Therefore, DT-DBSCAN was implemented in Java, and two tests
were conducted: one test to check for accuracy, and one test to determine the run-time of the algorithm.
Both tests were run using five different types of cluster data sets: uniformly distributed points, Gaussian
clusters, Gaussian clusters with noise, uniform density clusters, and uniform density clusters with noise (fig. 5
- 9). Each data set was randomly generated by a computer algorithm within a given set of parameters. This
was done to simulate the variety of data sets encountered in different practical applications.

Fig. 5 - 9: Test data types, from left to right and then top to bottom: Uniformly Distributed, Gaussian
Clusters, Gaussian Clusters with Noise, Uniformly Distributed Clusters, Uniformly Distributed Clusters
with Noise.
Accuracy test:

For the accuracy test, cluster results derived from DT-DBSCAN were compared to results derived
from the standard DBSCAN algorithm. To evaluate "accuracy" as a numerical value, C({v}), or the correctness
coefficient, was defined. To define C({v}), let DT(v) represent the index of point v after running DT-DBSCAN,
and let DB(v) represent the index of point v after running DBSCAN. If d(x, y) denotes the discrete metric (d(x,
y) = 0 if x = y, d(x, y) = 1 otherwise),

C({v}) = (1/n²) Σ_{v1 ∈ {v}} Σ_{v2 ∈ {v}} d( d(DB(v1), DB(v2)), d(DT(v1), DT(v2)) )

C({v}) is a number between 0 and 1, and the DT-DBSCAN clusters are equivalent to the DBSCAN
clusters if and only if C({v}) = 0.
C({v}) is defined in this way because DBSCAN and DT-DBSCAN often assign the same cluster
different index numbers, which makes direct comparison harder to conduct (the fact that clusters are assigned
different index numbers does not mean DBSCAN and DT-DBSCAN yield different results).
For each of the five data types, 300 test data sets were randomly generated with data sizes between
1000 – 30000 points (10 per each increment of 1000 points). For each test set {v}, C({v}) was computed.
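The correctness coefficient itself is straightforward to compute. This illustrative Java sketch (names are assumptions) applies the pairwise discrete-metric comparison directly:

```java
public class CorrectnessCoefficient {
    // C({v}): the fraction of ordered point pairs on which the two labelings
    // disagree about whether the pair shares a cluster (the pairwise discrete
    // metric). 0 means the partitions are identical up to index renaming.
    static double c(int[] db, int[] dt) {
        int n = db.length;
        long disagreements = 0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                boolean sameDb = db[i] == db[j];
                boolean sameDt = dt[i] == dt[j];
                if (sameDb != sameDt) disagreements++;
            }
        return (double) disagreements / ((double) n * n);
    }
}
```

Note that two labelings that differ only in which integer names each cluster, such as {1, 1, 2, 2} and {7, 7, 3, 3}, yield C({v}) = 0.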

Run-Time Test:

DT-DBSCAN was compared to the standard DBSCAN, as well as an O(n log n) DBSCAN algorithm
provided in [12]. This DBSCAN algorithm utilizes spatial indexing to improve the time-complexity, which is a
common strategy used by many DBSCAN iterations. For each of the five data types, 750 test data sets were
randomly generated with data sizes between 1000 – 30000 points (25 per each increment of 1000 points). Each
of the three algorithms used these test sets as inputs to generate clusters, and the run-time for each algorithm
was measured in milliseconds.
In order to control outside variables, all three algorithms were implemented in Java, and all three
algorithms were run in the same IDE on the same computer.

6.2 Accuracy Test Results

Fig. 10: Table showing the values of C({v}) for each data type.
The fact that C({v}) is always 0 shows that there was no
difference in clustering results.

Fig. 11 & 12: The clusters on the left were generated by DBSCAN and the clusters on the right were
generated by DT-DBSCAN (color differences do not indicate a difference in clustering).

6.3 Performance Test Results

Fig. 13 - 17: Run-time comparison graphs for each of the five data types.

7 Discussion
The DT-DBSCAN algorithm solves the time complexity issue of DBSCAN without deviating from
the clustering results of DBSCAN. First, theorems 1 and 3 prove that DT-DBSCAN will run
in linear expected time. The linearity of the run-time is then experimentally shown in the run-time test. For
each of the five different types of data sets, DT-DBSCAN consistently outperformed the standard DBSCAN
algorithm and the O(n log n) algorithm for point sets that contain more than 15,000 points. In addition, the
run-time graphs show that the execution time of DT-DBSCAN grows in a strictly linear trend as the data sets
increase in size. This is a notable improvement compared to currently used DBSCAN clustering algorithms.
Then, theorems 2 and 4 prove that DT-DBSCAN will always yield the exact same clustering results as
the standard DBSCAN. The non-approximate nature of DT-DBSCAN is also shown by the accuracy
tests, where zero clustering errors occurred after testing a total of 23,250,000 points.
DT-DBSCAN is a novel density-based clustering algorithm that drastically improves performance for
larger data sets with virtually no changes to clustering accuracy, and DT-DBSCAN also retains the same input
parameters as DBSCAN.
Next, we consider the implications of section 4, theorem 1, and theorem 2, which collectively suggest
that it is possible to compute range queries in constant time after an O(n) pre-computation. This
is notably faster than the indexing methods that exist today, and it could prove to be a major benefit for many
other spatial data-based algorithms such as KNN and hierarchical clustering.
Lastly, there is potential for future research extending the DT-DBSCAN concept to higher
dimensions. This would be possible given the existence of a linear expected time Delaunay Triangulation
algorithm for higher-dimensional datasets.
8 References
[1] Nerurkar P. Empirical Analysis of Data Clustering Algorithms. 6th International Conference
on Smart Computing and Communications (ICSCC 2017), 2017.

[2] Ester M, Kriegel HP, Sander J, Xu X. A Density-Based Algorithm for Discovering Clusters in
Large Spatial Databases with Noise. Proc. 2nd Int. Conf. on Knowledge Discovery and
Data Mining, 1996, pp. 226 – 231.

[3] Kou G, Peng Y. Evaluation of Clustering Algorithms for Financial Risk Analysis using
MCDM methods. Information Sciences, 2014, 275(11): 1 – 12.

[4] Guibas L, Stolfi J. Primitives for the manipulation of general subdivisions and the
computation of Voronoi diagrams. ACM Transactions on Graphics, 1985, 4(3): 74 – 123.

[5] Katajainen J, Koppinen M. Constructing Delaunay Triangulations by Merging Buckets in
Quadtree Order. Fundamenta Informaticae, 1988, 11(3): 275 – 288.

[6] Zhao Y, Li X. An improved DBSCAN algorithm based on cell-like P systems with promoters
and inhibitors. PLoS ONE, 2017, 13(12): 1 – 17.

[7] Kou G, Peng Y. Evaluation of Clustering Algorithms for Financial Risk Analysis using
MCDM methods. Information Sciences, 2014, 275(11): 1 – 12.

[8] Lewis B, Robinson J. Triangulation of planar regions with applications, Computer J. 1978.
21 (4):324–332.

[9] Hilhorst H. J. The perimeter of large planar Voronoi cells: a double-stranded random walk.
Journal of Statistical Mechanics: Theory and Experiment, 2005, 2005(7): 72 – 82.

[10] Gan J, Tao Y. DBSCAN Revisited: Mis-Claims, Un-Fixability, and Approximation. 2015.
Proceedings of the 2015 Sigmod International Conference on Management of Data : 519
– 530.

[11] Frantz C. Java implementation of density-based clustering algorithm DBSCAN. GitHub, 2016.

[12] Ahmed M, Mahmood AN. Novel Approach for network Traffic Pattern Analysis using
cluster- based collective anomaly detection. Annals of Data Science, 2015, 2(1): 1 – 20.

[13] Li Z, Qiao Z. Network cluster analysis of protein-protein interaction network-identified
biomarker for type 2 diabetes. Diabetes Technology and Therapeutics, 2015, 17(7): 475 – 481.

[14] Bern M., Eppstein D., Yao F. The expected extremes in a delaunay triangulation. Albert J.L.,
Monien B., Artalejo M.R. (eds) Automata, Languages and Programming. ICALP 1991.
Lecture Notes in Computer Science, vol 510. Springer, Berlin, Heidelberg, 1991

[15] Kryszkiewicz M, Skonieczny L. Faster Clustering with DBSCAN. Institute of Computer Science, Warsaw
University of Technology, Nowowiejska 19/19, 00-665 Warsaw, Poland.

[16] Li T, Heinis T, Luk W. Hashing-Based Approximate DBSCAN. Dynamic Ontology-Based Sensor
Binding, pp. 31 – 45, Aug. 2016.

[17] Ester M, Kriegel HP, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial
databases with noise. Proceedings of 2nd International Conference on Knowledge Discovery and Data
Mining, 1996.

