
# Worksheet on Clustering

## Clustering
It is often useful to partition data without having a training sample; this is also known as
unsupervised learning. For example, in business, it may be important to determine groups
of customers who have similar buying patterns, or in medicine, it may be important to
determine groups of patients who show similar reactions to prescribed drugs. The goal of
clustering is to place records into groups, such that records in a group are similar to each
other and dissimilar to records in other groups. The groups are usually disjoint.

An important aspect of clustering is the similarity function that is used. When the data is
numeric, a similarity function based on distance is typically used. For example, the
Euclidean distance can be used to measure similarity. Consider two n-dimensional data
records as points x and y in n-dimensional space. We can consider the value for the ith
dimension as xi and yi for the two records. The Euclidean distance between points
x=(x1,…,xn) and y=(y1,…,yn) in n-dimensional space is

$$
\rho(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
$$
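
The distance formula can be sketched in Python as follows. This is a minimal illustration, not part of the original worksheet; the function name `euclidean_distance` is an assumption chosen for this example.

```python
import math

def euclidean_distance(x, y):
    """Return the Euclidean distance between two equal-length numeric points."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    # Sum the squared differences per dimension, then take the square root.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```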

The smaller the distance between two points, the greater their similarity. A classic
clustering algorithm is the following k-means algorithm:

## K-means clustering algorithm

This algorithm begins by randomly choosing k records to represent the centroids (means),
m1, ..., mk, of the clusters C1, ..., Ck. Each record is then placed in a cluster based on
the distance between the record and the cluster means: if the distance between mi and
record rj is the smallest among all cluster means, then record rj is placed in cluster Ci.
Once all records have been initially placed in a cluster, the mean for each cluster is
recomputed. Then the process repeats, examining each record again and placing it in
the cluster whose mean is closest. Several iterations may be needed, but the algorithm
will converge: it stops once no record changes cluster, although the final clusters can
depend on the initial centroids. Consider the following database.

| RECORD | Age | Years of Service |
|--------|-----|------------------|
| 1      | 30  | 5                |
| 2      | 50  | 25               |
| 3      | 50  | 15               |
| 4      | 25  | 5                |
| 5      | 30  | 10               |
| 6      | 55  | 25               |
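
The k-means procedure described above can be sketched in Python on this database. This is one possible reading of the description, not a definitive implementation: the function name `k_means`, the `max_iterations` cap, the choice to take the starting centroids as an argument instead of drawing them at random, and the rule that an empty cluster keeps its previous mean are all assumptions made for this example. Ties go to the lower-numbered cluster.

```python
import math

def k_means(records, initial_centroids, max_iterations=100):
    """Cluster `records` (tuples of numbers) around the given starting centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each record joins the cluster with the nearest mean.
        new_assignment = [
            min(range(len(centroids)), key=lambda i: math.dist(r, centroids[i]))
            for r in records
        ]
        if new_assignment == assignment:
            break  # no record changed cluster, so the means would not change
        assignment = new_assignment
        # Update step: recompute each mean dimension by dimension.
        for i in range(len(centroids)):
            members = [r for r, a in zip(records, assignment) if a == i]
            if members:  # assumption: an empty cluster keeps its previous mean
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids

# (Age, Years of Service) for RECORD 1..6, in table order.
records = [(30, 5), (50, 25), (50, 15), (25, 5), (30, 10), (55, 25)]
# Example start: RECORD 3 and RECORD 6 as the two initial centroids (k = 2).
assignment, centroids = k_means(records, [records[2], records[5]])
print(assignment)
print(centroids)
```

Here `assignment[j]` is the zero-based cluster index of the record at position `j`, and `centroids` holds the final cluster means.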

Assume that the number of desired clusters k is 2. Let the algorithm choose RECORD 3 as
the initial centroid of cluster C1 and RECORD 6 as the initial centroid of cluster C2.
The remaining records will be assigned to one of those clusters during the first iteration of
the repeat loop.

RECORD 1 has a distance from C1 of $\sqrt{20^2 + 10^2} \approx 22.4$ and a distance from C2 of 32.0,
so it joins cluster C1. RECORD 2 has a distance from C1 of 10.0 and a distance from C2
of 5.0, so it joins cluster C2. RECORD 4 has a distance from C1 of 26.9 and a distance
from C2 of 36.1, so it joins cluster C1. RECORD 5 has a distance from C1 of 20.6 and a
distance from C2 of 29.2, so it joins cluster C1.

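| RECORD | Age | Years of Service | Distance from RECORD 3 | Distance from RECORD 6 |
|--------|-----|------------------|------------------------|------------------------|
| 1      | 30  | 5                | 22.4                   | 32.0                   |
| 2      | 50  | 25               | 10.0                   | 5.0                    |
| 3      | 50  | 15               | 0                      | -                      |
| 4      | 25  | 5                | 26.9                   | 36.1                   |
| 5      | 30  | 10               | 20.6                   | 29.2                   |
| 6      | 55  | 25               | -                      | 0                      |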
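
The distances for this first iteration can be recomputed with a short Python sketch. The helper name `first_iteration` and the tie rule (a record equally close to both centroids goes to C1) are assumptions made for this example. It also computes the distance from each centroid record to the other centroid, which the table leaves out.

```python
import math

# RECORD number -> (Age, Years of Service)
records = {1: (30, 5), 2: (50, 25), 3: (50, 15), 4: (25, 5), 5: (30, 10), 6: (55, 25)}
c1, c2 = records[3], records[6]  # initial centroids: RECORD 3 and RECORD 6

def first_iteration(records, c1, c2):
    """Return {record: (dist to c1, dist to c2, chosen cluster)}, rounded to 1 decimal."""
    rows = {}
    for rid, point in records.items():
        d1, d2 = math.dist(point, c1), math.dist(point, c2)
        # Assumption: a tie is resolved in favour of C1.
        rows[rid] = (round(d1, 1), round(d2, 1), "C1" if d1 <= d2 else "C2")
    return rows

for rid, (d1, d2, cluster) in first_iteration(records, c1, c2).items():
    print(rid, d1, d2, cluster)
```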

Thus we have

C1 = {RECORD 1, RECORD 3, RECORD 4, RECORD 5}

C2 = {RECORD 2, RECORD 6}.

Now, the new means (centroids) for the two clusters are computed. The mean for a cluster,
Ci, is a vector consisting of the mean of the individual dimensions within the cluster.
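
As an illustration of this step, the dimension-wise mean can be computed in Python for the two clusters found in the first iteration. The function name `centroid` is an assumption chosen for this sketch.

```python
def centroid(points):
    """Return the mean of each dimension over a non-empty list of points."""
    n = len(points)
    # zip(*points) groups the values by dimension: all ages, then all years of service.
    return tuple(sum(dim) / n for dim in zip(*points))

c1_members = [(30, 5), (50, 15), (25, 5), (30, 10)]  # RECORDs 1, 3, 4, 5
c2_members = [(50, 25), (55, 25)]                    # RECORDs 2, 6

print(centroid(c1_members))  # (33.75, 8.75)
print(centroid(c2_members))  # (52.5, 25.0)
```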