
## 9. Hierarchical Clustering

• Many times, clusters are not disjoint, but may have subclusters, which in turn have sub-subclusters, etc.
• Consider a sequence of partitions of the n samples into c clusters
• The first is a partition into n clusters, each one containing exactly one sample
• The second is a partition into n-1 clusters, the third into n-2, and so on, until the n-th, in which there is only one cluster containing all of the samples
• At level k in the sequence, c = n-k+1 (see the quick check below)
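A quick numeric check of the level-to-cluster-count relation; a minimal Python sketch (the value n = 5 is just illustrative):

```python
# Level k of the hierarchy on n samples contains c = n - k + 1 clusters.
n = 5
for k in range(1, n + 1):
    c = n - k + 1
    print(f"level {k}: {c} cluster(s)")
# level 1 -> 5 singleton clusters; level 5 -> 1 cluster with all samples
```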


• Given any two samples x and x′, they will be grouped together at some level, and once they are grouped at level k, they remain grouped for all higher levels
• Hierarchical clustering is naturally represented by a tree, called a dendrogram


• The similarity values may help to determine whether the groupings are natural or forced, but if they are evenly distributed, no information can be gained
• Another representation is based on Venn diagrams


• Hierarchical clustering can be divided into agglomerative and divisive procedures
• Agglomerative (bottom-up): start with n singleton clusters and form the sequence by successively merging clusters
• Divisive (top-down): start with all the samples in one cluster and form the sequence by successively splitting clusters


• The procedure terminates when the specified number of clusters has been obtained, and returns the clusters as sets of points, rather than the mean or a representative vector for each cluster. Terminating at c = 1 yields a dendrogram (a minimal sketch of the agglomerative variant follows below)
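A minimal sketch of the agglomerative procedure under these assumptions: Euclidean samples in a NumPy array and a pluggable cluster-distance function. The names `agglomerate` and `cluster_distance` are illustrative, not from the source:

```python
import numpy as np

def agglomerate(X, c, cluster_distance):
    """Merge the two nearest clusters until exactly c remain.

    X: (n, d) array of samples. cluster_distance: function taking two
    (m, d) arrays and returning a dissimilarity. Returns the clusters
    as lists of sample indices (sets of points, not mean vectors).
    """
    clusters = [[i] for i in range(len(X))]          # start: n singleton clusters
    while len(clusters) > c:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        # pick the pair of clusters at the smallest distance
        i, j = min(pairs, key=lambda p: cluster_distance(X[clusters[p[0]]],
                                                         X[clusters[p[1]]]))
        clusters[i] += clusters[j]                    # merge cluster j into i
        del clusters[j]
    return clusters
```

Passing the minimum pairwise distance as `cluster_distance` turns this into the nearest-neighbor algorithm discussed below.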

• At any level, the distance between the nearest clusters can provide the dissimilarity value for that level
• To find the nearest clusters, one can use one of the following measures:
$$d_{\min}(D_i, D_j) = \min_{x \in D_i,\, x' \in D_j} \lVert x - x' \rVert \quad \text{(nearest neighbor)}$$

$$d_{\max}(D_i, D_j) = \max_{x \in D_i,\, x' \in D_j} \lVert x - x' \rVert \quad \text{(farthest neighbor)}$$

$$d_{\text{avg}}(D_i, D_j) = \frac{1}{n_i n_j} \sum_{x \in D_i} \sum_{x' \in D_j} \lVert x - x' \rVert$$

$$d_{\text{mean}}(D_i, D_j) = \lVert m_i - m_j \rVert$$
• All four measures behave quite similarly if the clusters are hyperspherical and well separated; each translates directly into code, as sketched below
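A NumPy sketch of the four measures, assuming `Di` and `Dj` are (n_i, d) and (n_j, d) arrays of samples:

```python
import numpy as np

def pairwise(Di, Dj):
    # all Euclidean distances ||x - x'|| with x in Di, x' in Dj
    return np.linalg.norm(Di[:, None, :] - Dj[None, :, :], axis=-1)

def d_min(Di, Dj):  return pairwise(Di, Dj).min()   # nearest neighbor
def d_max(Di, Dj):  return pairwise(Di, Dj).max()   # farthest neighbor
def d_avg(Di, Dj):  return pairwise(Di, Dj).mean()  # average over all n_i * n_j pairs
def d_mean(Di, Dj): return np.linalg.norm(Di.mean(axis=0) - Dj.mean(axis=0))
```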



## Nearest-neighbor algorithm
• When d_min is used, the algorithm is called the nearest-neighbor algorithm
• If it is terminated when the distance between the nearest clusters exceeds an arbitrary threshold, it is called the single-linkage algorithm
• If the data points are considered nodes of a graph, with edges forming a path between the nodes in the same subset Di, then merging Di and Dj corresponds to adding an edge between the nearest pair of nodes in Di and Dj
• The resulting graph never has any closed loops, so it is a tree; if all subsets are linked, we have a spanning tree

• The use of d_min as the distance measure together with agglomerative clustering generates a minimal spanning tree (verified numerically in the sketch below)
• Chaining effect: a defect of this distance measure (illustrated in the figure on the next slide)
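The spanning-tree claim can be checked numerically: the single-linkage merge distances coincide with the sorted edge weights of the minimal spanning tree over the same points. A sketch using SciPy (random data, seed arbitrary):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

X = np.random.default_rng(0).normal(size=(10, 2))

# single linkage = nearest-neighbor (d_min) agglomerative clustering;
# column 2 of the linkage matrix holds the merge distances
merge_dists = np.sort(linkage(X, method='single')[:, 2])

# edge weights of the minimal spanning tree over the same points
mst = minimum_spanning_tree(squareform(pdist(X))).toarray()
mst_weights = np.sort(mst[mst > 0])

print(np.allclose(merge_dists, mst_weights))     # True
```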


## Problem!

*(Figure omitted: example of the chaining effect produced by d_min.)*

## The farthest-neighbor algorithm

• When d_max is used, the algorithm is called the farthest-neighbor algorithm
• If it is terminated when the distance between the nearest clusters exceeds an arbitrary threshold, it is called the complete-linkage algorithm
• This method discourages the growth of elongated clusters
• In the terminology of graph theory, every cluster constitutes a complete subgraph, and the distance between two clusters is determined by the most distant nodes in the two clusters (see the sketch below)
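In SciPy's hierarchy module, the farthest-neighbor rule is the 'complete' method; a brief sketch (two well-separated synthetic blobs, an illustrative cut at two clusters):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)),    # one compact blob
               rng.normal(4.0, 0.5, (20, 2))])   # a second, well-separated blob

Z = linkage(X, method='complete')                # farthest-neighbor (d_max) merging
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree at two clusters
print(labels)                                    # the two blobs get distinct labels
```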

• When two clusters are merged, the graph is changed by adding edges between every pair of nodes in the two clusters
• All the procedures involving minima or maxima are sensitive to outliers; the use of d_mean or d_avg is a natural compromise (both are available off the shelf, as sketched below)
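Both compromise measures are also standard library options; a sketch mapping them onto SciPy's method names (note that 'centroid', corresponding to d_mean, assumes Euclidean coordinates):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(2).normal(size=(30, 2))

Z_avg  = linkage(X, method='average')   # d_avg: mean of all pairwise distances
Z_mean = linkage(X, method='centroid')  # d_mean: distance between cluster means
```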


• Typically, the number of clusters is known
• When it is not, there are several ways to proceed
• When clustering is done by extremizing a criterion function, a common approach is to repeat the clustering with c = 1, c = 2, c = 3, etc., and examine how the criterion value changes (see the sketch after this list)
• Another approach is to set a threshold for the creation of a new cluster
• These approaches are similar to model-selection procedures, typically used to determine the topology and number of states (e.g., clusters, parameters) of a model, given a specific application
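A common concrete instance of the repeat-with-growing-c approach is the "elbow" inspection of a sum-of-squared-errors criterion; a sketch with scikit-learn (three synthetic blobs, so the knee should appear at c = 3):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(mu, 0.4, (50, 2)) for mu in (0.0, 3.0, 6.0)])

# criterion value (within-cluster sum of squared errors) for growing c
for c in range(1, 7):
    sse = KMeans(n_clusters=c, n_init=10, random_state=0).fit(X).inertia_
    print(c, round(sse, 1))
# the drop in SSE flattens sharply after c = 3, suggesting three clusters
```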


• Combine features to reduce the dimension of the feature space
• Linear combinations are simple to compute and analytically tractable
• Project high-dimensional data onto a lower-dimensional space
• There are two classical approaches for finding an "optimal" linear transformation (both reduce to the projection sketched below):
• PCA (Principal Component Analysis): the projection that best represents the data in a least-squares sense
• MDA (Multiple Discriminant Analysis): the projection that best separates the data in a least-squares sense (a generalization of Fisher's Linear Discriminant for two classes)
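Whichever criterion is used, the transformation itself is the same linear map y = Wᵀx from d to d' dimensions; a minimal shape check (the matrix `W` here is a random stand-in for a learned projection):

```python
import numpy as np

n, d, d_prime = 100, 10, 2
X = np.random.default_rng(0).normal(size=(n, d))
W = np.random.default_rng(1).normal(size=(d, d_prime))  # stand-in projection matrix

Y = X @ W          # apply y = W^T x to every sample at once
print(Y.shape)     # (100, 2): each 10-dim sample mapped to 2 dims
```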


• PCA (Principal Component Analysis): the projection that best represents the data in a least-squares sense
• The scatter matrix of the cloud of samples is, up to a scale factor, the maximum-likelihood estimate of the covariance matrix
• Unlike a per-class covariance matrix, however, the scatter matrix includes samples from all classes!
• The least-squares projection solution (maximum scatter) is simply the subspace defined by the d' < d eigenvectors of the scatter matrix that correspond to its d' largest eigenvalues
• Because the scatter matrix is real and symmetric, the eigenvectors are orthogonal (see the sketch below)
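The eigenvector recipe, as a NumPy sketch (the choice d' = 2 and the function name are illustrative):

```python
import numpy as np

def pca_project(X, d_prime):
    Xc = X - X.mean(axis=0)                # center the cloud of samples
    S = Xc.T @ Xc                          # scatter matrix (real, symmetric)
    eigvals, eigvecs = np.linalg.eigh(S)   # ascending eigenvalues, orthogonal vectors
    W = eigvecs[:, -d_prime:]              # eigenvectors of the d' largest eigenvalues
    return Xc @ W                          # best d'-dim representation (least squares)

X = np.random.default_rng(0).normal(size=(200, 5))
print(pca_project(X, 2).shape)             # (200, 2)
```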


## Fisher Linear Discriminant

• While PCA seeks directions that are efficient for representation, discriminant analysis seeks directions that are efficient for discrimination (a two-class sketch follows below)
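For two classes, the discriminating direction has the classical closed form w ∝ S_W⁻¹(m₁ − m₂); a minimal sketch (synthetic Gaussian classes, function name illustrative):

```python
import numpy as np

def fisher_direction(X1, X2):
    # within-class scatter S_W = S_1 + S_2
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    # direction maximizing between-class separation over within-class scatter
    w = np.linalg.solve(S_W, m1 - m2)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X1 = rng.normal(0.0, 1.0, (50, 3))
X2 = rng.normal(2.0, 1.0, (50, 3))
w = fisher_direction(X1, X2)
print((X1 @ w).mean(), (X2 @ w).mean())    # well-separated 1-D projections
```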


## Multiple Discriminant Analysis

• Generalization of Fisher's Linear Discriminant for more than two classes (a sketch follows below)
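MDA yields up to c − 1 projection directions from the generalized eigenproblem S_B w = λ S_W w; a sketch using SciPy's symmetric generalized solver (class data synthetic, function name illustrative):

```python
import numpy as np
from scipy.linalg import eigh

def mda(Xs, d_prime):
    """Xs: list of (n_i, d) per-class arrays; returns a (d, d_prime) projection."""
    m = np.vstack(Xs).mean(axis=0)                  # overall mean
    d = Xs[0].shape[1]
    S_W = np.zeros((d, d))                          # within-class scatter
    S_B = np.zeros((d, d))                          # between-class scatter
    for Xi in Xs:
        mi = Xi.mean(axis=0)
        S_W += (Xi - mi).T @ (Xi - mi)
        S_B += len(Xi) * np.outer(mi - m, mi - m)
    # generalized symmetric eigenproblem S_B w = lambda S_W w
    eigvals, eigvecs = eigh(S_B, S_W)
    return eigvecs[:, -d_prime:]                    # top d' discriminant directions

rng = np.random.default_rng(0)
Xs = [rng.normal(mu, 1.0, (40, 4)) for mu in (0.0, 2.0, 4.0)]
print(mda(Xs, 2).shape)                             # (4, 2)
```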