Sei sulla pagina 1di 12

CSE4014 - High Performance Computing

(EPJ)

Submitted by Project Guide


Raj Prakash Shrivastav- 18BCE2463 MANJULA V
Ashutosh Devkota- 18BCE2465 Associate Professor
Ashish Paudel- 18BCE2494 School of Information Technology and
Engineering
ABSTRACT

• The Density Peak Clustering (DPC) algorithm is a new density-based clustering method.
• It spends most of its execution time on calculating the local density and the separation
distance for each data point in a dataset.
• The purpose of this study is to accelerate its computation.
• On average, the DPC algorithm scans half of the dataset to calculate the separation distance of
each data point.
MODIFICATIONS…!!

• We propose an approach to calculate the separation distance of a data point


by scanning only the neighbors of the data point.
• Additionally, the purpose of the separation distance is to assist in choosing the
density peaks, which are the data points with both high local density and high
separation distance.
• We propose an approach to identify non-peak data points at an early stage to
avoid calculating their separation distances.
• Our experimental results show that most of the data points in a dataset can
benefit from the proposed approaches to accelerate the DPC algorithm.
PROBLEM STATEMENT AND OBJECTIVES

Accelerating DPC by Scanning Neighbors Only

The objectives of the project is:


 To speed up the density clustering algorithm for large data sets.
 To accelerate the calculation of separation distances and yield the same clustering
results as that of the DPC algorithm.
 To accelerate the DPC algorithm by identifying a significant portion of the non-
peak data points and avoiding calculating their separation distances.
 Input: the set of data points X∈ℝNXM and the parameters 𝑑C for defining the
neighborhood, and 𝑑r for selecting density peaks
 Output: the label vector of cluster index y∈ℝNx1
 Algorithm:
 1. Calculate ρ(𝑥i) for each 𝑥i ∈ X using either (1) or (3).
 2. Sort all data points in X by their local densities descendingly.
 3. Calculate δ(𝒙i) and σ(𝒙i) for each 𝒙i ∈ X using (4) and (5), respectively.
 4. Select data points with ρ(𝒙i)δ(𝒙i) > 𝑑r as density peaks.
 5. For each density peak 𝒙i, set 𝑦i = 𝑖. // starting point of each cluster
 6. For each non-peak data point 𝒙i, set 𝑦i = 𝑦δ(𝒙i). // cluster assignment
 7. Return y.
Preliminary

– Consider a set of points in some space to be clustered. Let ε be a parameter specifying the radius of a neighborhood
with respect to some point.
– For the purpose of DPC clustering, the points are classified as core points, (density-)reachable points and outliers, as
follows:
– A point p is a core point if at least minPts points are within distance ε of it (including p).
– A point q is directly reachable from p if point q is within distance ε from core point p.
– Points are only said to be directly reachable from core point.
– A point q is reachable from p if there is a path p1, ..., pn with p1 = p and pn = q, where each pi+1 is directly reachable
from pi.
– All points not reachable from any other point are outliers or noise points.

– Now if p is a core point, then it forms a cluster together with all points (core or non-core) that are reachable from it.
Each cluster contains at least one core point; non-core points can be part of a cluster, but they form its "edge", since
they cannot be used to reach more points.
Contd..

■ Reachability is not a symmetric relation since, by definition, no point may be reachable


from a non-core point, regardless of distance (so a non-core point may be reachable, but
nothing can be reached from it).
■ Therefore, a further notion of connectedness is needed to formally define the extent of
the clusters found by DBSCAN. Two points p and q are density-connected if there is a
point o such that both p and q are reachable from o. Density-
connectedness is symmetric.
■ A cluster then satisfies two properties:
■ All points within the cluster are mutually density-connected.
■ If a point is density-reachable from any point of the cluster, it is part of the cluster as
well.
 The quality of DPC depends on the distance measure used in the function region
Query.
 The most common distance metric used is Euclidean distance. Especially for high-
dimensional data, this metric can be rendered almost useless due to the so-called
"Curse of dimensionality", making it difficult to find an appropriate value.
 This effect, however, is also present in any other algorithm based on Euclidean
distance.
 DPC cannot cluster data sets well with large differences in densities.
Conclusion

• The proposed methods focus on accelerating the calculation of the


separation distance.
• However, it is also possible to improve the DPC algorithm by
accelerating the calculation of the local density .
• Conceptually, the DPC algorithm builds a directed acyclic graph of
all data points with an out-degree ≤ 1. Then, it selects several
data points from the graph as the density peaks.
• Finally, it removes the outgoing links of the density peaks and
breaks the graph into several subgraphs, each of which represents
a cluster.

Potrebbero piacerti anche