

Cluster Analysis
The idea of cluster analysis is that we have a set of observations on which we have available
several measurements. Using these measurements, we want to find out if the observations
naturally group together.
A good clustering is one for which the within-cluster variation is as small as possible. There are
two very important decisions that need to be made whenever you are carrying out a cluster analysis:
Standardization
Cluster analysis algorithms all depend on the concept of measuring the distance (or some other
measure of similarity) between the different observations we're trying to cluster. If one of the
variables is measured on a much larger scale than the other variables, then whatever measure we
use will be overly influenced by that variable => we need some sort of standardization.
The standard way of doing this is to use z-scores: for each variable, subtract the mean and divide by the standard deviation:

Z = (x_i - x̄) / s

where x_i is an observation of the variable, x̄ is the variable's mean, and s is its standard deviation.
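A minimal sketch of this step in R (the data frame name `data` is just a placeholder for whatever observations you are clustering): scale() applies exactly this z-score transformation to every column.

# Standardize each variable: subtract its mean and divide by its standard deviation
# 'data' is assumed to be a data frame of numeric measurements
scaledData = scale(data)
# Each column of scaledData now has mean 0 and standard deviation 1
colMeans(scaledData)
apply(scaledData, 2, sd)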
Distance Measure
If clustering organizes things that are close into groups, how do we define close?
Euclidean distance - most often used and the default in stats packages:

d = sqrt( (A1 - A2)^2 + (B1 - B2)^2 + ... + (Z1 - Z2)^2 )

where A, B, ..., Z are the variables and the subscripts 1 and 2 refer to the two observations being compared.
Manhattan distance - distance measured along the grid:

|A1 - A2| + |B1 - B2| + ... + |Z1 - Z2|

Canberra distance - performs its own standardization: the absolute values of the differences are divided by the absolute value of the sum of the corresponding variables in the two observations.
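For illustration, all three measures are available through dist() in R; the object `scaledData` below is assumed to be the standardized data from the sketch above.

# Pairwise distances between observations under different metrics
d.euclid = dist(scaledData, method = "euclidean")
d.manhat = dist(scaledData, method = "manhattan")
d.canber = dist(scaledData, method = "canberra")
# Canberra performs its own standardization, so it would typically be applied to the raw data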
Clustering Techniques
K-Means Clustering (kmeans)
The first step is to specify the number of clusters (k) that will be formed in the final solution.
Process begins by randomly assigning each observation to a cluster.
Compute cluster centroids
Re-assign each point to the closest cluster centroid.
Re-compute cluster centroids.
The process continues until no observations switch clusters.
Good:
k-means technique is fast, and doesn't require calculating all of the distances between each
observation and every other observation.
It can efficiently deal with very large data sets, so it may be useful in cases where other
methods fail.
Bad:
If you rearrange your data, it's very possible that you'll get a different solution every time
(see the sketch below for one way to stabilize the results with multiple random starts).
Another criticism of this technique is that you may try, for example, a 3-cluster solution that
seems to work pretty well, but when you look for the 4-cluster solution, all of the structure
that the 3-cluster solution revealed is gone. This makes the procedure somewhat unattractive
if you don't know exactly how many clusters you should have in the first place.
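As a rough illustration of that instability, and one way to mitigate it, the sketch below runs kmeans() with a single random start and then with many; `scaledData` and the choice of 3 clusters are just placeholder assumptions.

# One random start: the result can change from run to run
set.seed(1)
km1 = kmeans(scaledData, centers = 3, nstart = 1)
# Many random starts: kmeans keeps the best of 25 solutions
# (lowest total within-cluster sum of squares), which is much more stable
set.seed(1)
km25 = kmeans(scaledData, centers = 3, nstart = 25)
# Compare the quality of the two solutions
km1$tot.withinss
km25$tot.withinss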
Partitioning Around Medoids (pam)
Modern alternative to k-means clustering (library cluster: pam)
The term medoid refers to an observation within a cluster for which the sum of the distances
between it and all the other members of the cluster is a minimum.
pam requires that you know the number of clusters that you want (like k-means clustering), but it
does more computation than k-means in order to ensure that the medoids it finds are truly
representative of the observations within a given cluster.
In the k-means method the centers of the clusters are only recalculated after all of the
observations have had a chance to move from one cluster to another. With pam, the sums of the
distances between objects within a cluster are constantly recalculated as observations move
around, which will hopefully provide a more reliable solution.
As a by-product of the clustering operation it identifies the observations that represent the
medoids, and these observations (one per cluster) can be considered a representative example of
the members of that cluster.
pam requires that the entire distance matrix is calculated to facilitate the recalculation of the
medoids, and it involves considerably more computation than k-means.
As with k-means, there's no guarantee that the structure that's revealed with a small number of
clusters will be retained when you increase the number of clusters.
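A minimal sketch of pam from the cluster library, again assuming a standardized data set `scaledData` and an arbitrary choice of 3 clusters:

library(cluster)
# Partitioning around medoids with k = 3
pamFit = pam(scaledData, k = 3)
# The medoids: one actual observation chosen to represent each cluster
pamFit$medoids
# Cluster membership for every observation
pamFit$clustering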
Hierarchical Agglomerative Clustering (hclust)
Starts out by putting each observation into its own separate cluster.
It then examines all the distances between all the observations and pairs together the two closest
ones to form a new cluster.
Now there is one less cluster than there are observations. To determine which observations will
form the next cluster, we need to come up with a method for finding the distance between an
existing cluster and individual observations, since once a cluster has been formed, we'll determine
which observation will join it based on the distance between the cluster and the observation.
Distance Between Clusters
Some of the methods (each will reveal certain types of structure within the data):
Single linkage: take the minimum distance between an observation and any member of the
cluster (tends to find clusters that are drawn out and snake-like).
Complete linkage: maximum distance (tends to find compact clusters).
Average linkage: average distance (compute all pairwise distances between observations in
two clusters, and take the average)
Centroid linkage: find centroid of each cluster, and take distance between centroids.
Ward's method attempts to form clusters by keeping the distances within the clusters as small
as possible. This method tends to find compact and spherical clusters. We can think about it
as trying to minimize the variance within each cluster and the distance among clusters (often
useful when the other methods find clusters with only a few observations). The linkage choice
is specified in R as shown in the sketch below.
We don't need to tell these procedures how many clusters we want (we get a complete set of
solutions starting from the trivial case of each observation in a separate cluster all the way to the
other trivial case where we say all the observations are in a single cluster)
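A short sketch of how the linkage choice enters in R: the method argument to hclust() selects it. Here `distances` is assumed to be a distance matrix produced by dist(), as in the code at the end of these notes, and the cut at 4 clusters is arbitrary.

# The same distance matrix, clustered under different linkage rules
hcSingle   = hclust(distances, method = "single")    # long, snake-like clusters
hcComplete = hclust(distances, method = "complete")  # compact clusters
hcAverage  = hclust(distances, method = "average")
hcWard     = hclust(distances, method = "ward.D2")   # compact, spherical clusters
# Cutting each tree at the same number of clusters can give quite different groupings
table(cutree(hcComplete, k = 4), cutree(hcWard, k = 4))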
Traditionally, hierarchical cluster analysis has taken computational shortcuts when updating the
distance matrix to reflect new clusters. In particular, when a new cluster is formed and the distance
matrix is updated, all the information about the individual members of the cluster is discarded in
order to make the computations faster. The cluster library provides the agnes function which uses
essentially the same technique as hclust, but which uses fewer shortcuts when updating the
distance matrix.
For example, when the mean method of calculating the distance between observations and clusters
is used, hclust only uses the two observations and/or clusters which were recently merged when
updating the distance matrix, while agnes calculates those distances as the average of all the
distances between all the observations in the two clusters.
While the two techniques will usually agree quite closely when minimum or maximum updating
methods are used, there may be noticeable differences when updating using the average distance
or Ward's method.
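For comparison, a sketch running agnes alongside hclust with average linkage, the case where the two are most likely to differ; `data` and `distances` are the same placeholders as elsewhere in these notes.

library(cluster)
# hclust with average linkage (uses the shortcut updating described above)
hcAvg = hclust(distances, method = "average")
# agnes with average linkage (recomputes averages over all pairs of observations)
agAvg = agnes(data, method = "average")
# agnes objects can be converted and cut just like hclust trees
agTree = as.hclust(agAvg)
table(cutree(hcAvg, k = 4), cutree(agTree, k = 4))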
Heatmap (heatmap)
Orders the rows and columns of a data matrix according to hierarchical clustering.
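A minimal sketch, assuming a numeric matrix `dataMatrix` (as built in the k-means code below): heatmap() clusters both the rows and the columns and reorders them accordingly.

# heatmap() runs hclust on the rows and on the columns and reorders both;
# distfun and hclustfun control how the distances and trees are computed
heatmap(dataMatrix, distfun = dist, hclustfun = hclust, scale = "column")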
In R:
Hierarchical Clustering
# Compute distances
distances = dist(data, method = "euclidean")
# Hierarchical clustering
hc = hclust(distances, method = "complete")
# Plot the dendrogram
plot(hc)
# Assign points to clusters
clusterGroups = cutree(hc, k = 10)
# Compute the average value of each variable within each cluster
# (for 0/1 variables this is the proportion of observations in the cluster with that attribute)
spl = split(data, clusterGroups)
lapply(spl, colMeans)
k-Means Clustering
# Change the data type to matrix
dataMatrix = as.matrix(data)
# Turn matrix into a vector
dataVector = as.vector(dataMatrix)
# Note: k-means does not need a precomputed distance matrix (unlike hierarchical
# clustering); kmeans() works directly on the data
# Specify number of clusters
k = 5
# Run K-Means
# nstart - attempts multiple initial configurations and reports on the best one
KMC = kmeans(dataVector, centers = k, iter.max = 1000, nstart=15)
# Extract clusters
dataClusters = KMC$cluster
Cluster Analysis
Gabriela Hromis
Notes are based on different books and class notes from different universities, especially
https://www.stat.berkeley.edu/classes/s133/all2011.pdf
