
Computing Clusters

Unsupervised Learning

Luc Anselin

http://spatial.uchicago.edu

Copyright © 2016 by Luc Anselin, All Rights Reserved


• dimension reduction
• classical clustering methods
• spatially-constrained clustering



Dimension Reduction



Principles



• Curse of dimensionality
• in a nutshell

• low-dimensional techniques break down in high dimensions

• complexity of some functions increases exponentially with the variable dimension



Example 1

change with p (the variable dimension) of the distance in the unit cube required to reach a given fraction of the total data volume

Source: Hastie, Tibshirani, Friedman (2009)



Example 2

nearest neighbor distance in one vs. two dimensions

Source: Hastie, Tibshirani, Friedman (2009)



• Dimension reduction
• reduce multiple variables into a smaller number of functions of the original variables

• principal component analysis (PCA)

• visualize the multivariate similarity (distance) between observations in a lower dimension

• multidimensional scaling (MDS)



Principal Components Analysis (PCA)



• Principle
• capture the variance in the p by p covariance matrix X’X through a set of k principal components, with k << p

• principal components capture most of the variance

• principal components are orthogonal

• principal component coefficients (loadings) are scaled



• More formally
• ci = ai1x1 + ai2x2 + … + aipxp
• each principal component is a weighted sum of the original variables

• ci’cj = 0
• components are orthogonal to each other

• Σk aik² = 1
• the sum of the squared loadings equals one

• computation
• matrix decomposition (eigenvalue decomposition of X’X, or singular value decomposition of X)



• Typical results of interest (each illustrated in the sketch below)
• loadings for each principal component
• the contribution of each of the original variables to that component

• principal component score
• the value of the principal component for each observation

• variance proportion explained
• the proportion of the overall variance each principal component explains

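A minimal sketch of these quantities in Python with scikit-learn (a software assumption; the slides do not prescribe a tool), using random placeholder data in place of the actual 77 x 12 indicator matrix:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(77, 12))            # placeholder: 77 areas x 12 indicators

    Xs = StandardScaler().fit_transform(X)   # PCA is scale-sensitive: standardize first

    pca = PCA(n_components=4)
    scores = pca.fit_transform(Xs)           # principal component score per observation
    loadings = pca.components_               # one row of loadings per component
    print(pca.explained_variance_ratio_)     # variance proportion explained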


• Example
• 77 Community Areas in Chicago (2014)

• 12 health indicator variables (all % or rates)

• teen birth rate, pre-term births, infant mortality rate, gonorrhea, breast cancer, lung cancer, colorectal cancer, prostate cancer, lead, diabetes, stroke, tuberculosis



first four principal components with their loadings



biplot

scatter plot of the first two principal components; each vector (arrow) shows the relative loadings for that variable
variance explained by each component



how many components?
look for the elbow in the scree plot (see the sketch below)

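Continuing the sketch above (Xs is the standardized data matrix), the scree plot is simply the explained variance ratio of a full decomposition plotted against the component number; matplotlib is assumed:

    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    pca_full = PCA().fit(Xs)                 # keep all 12 components
    plt.plot(range(1, 13), pca_full.explained_variance_ratio_, "o-")
    plt.xlabel("component")
    plt.ylabel("proportion of variance explained")
    plt.show()                               # look for the elbow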


mapping principal components (pc1)

neighbors in multivariate space are not necessarily neighbors in geographical space (Hyde Park and Near North Side)


Multidimensional Scaling (MDS)



• Principle
• n observations are points in a p-dimensional data hypercube

• p-variate distance or dissimilarity between all pairs of points (e.g., Euclidean distance in p dimensions)

• represent the n observations in a lower-dimensional space (at most p − 1) while respecting the pairwise distances



• More formally
• n by n distance or dissimilarity matrix D

• dij = ||xi − xj||, the Euclidean distance in p dimensions

• find values z1, z2, …, zn in k-dimensional space (with k << p) that minimize the stress function (see the sketch below)

• S(z) = Σi,j (dij − ||zi − zj||)²

• least squares or Kruskal-Shepard scaling

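A minimal sketch with scikit-learn's MDS, reusing the standardized matrix Xs from the PCA sketch; note that scikit-learn's SMACOF solver minimizes a stress criterion of this least-squares form rather than implementing Kruskal-Shepard scaling verbatim:

    from scipy.spatial.distance import pdist, squareform
    from sklearn.manifold import MDS

    D = squareform(pdist(Xs))                # n x n Euclidean distances in p dimensions
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    Z = mds.fit_transform(D)                 # n points represented in 2 dimensions
    print(mds.stress_)                       # value of the stress function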


MDS representation of 12 dimensions into 2



neighbors in multivariate space vs neighbors in geographical space



Classical Clustering
Methods



• Principle
• grouping of similar observations

• maximize within-group similarity

• minimize between-group similarity

• or, maximize between-group dissimilarity

• each observation belongs to one and only one group



• Issues
• similarity criterion
• Euclidean distance, correlation
• how many groups
• many “rules of thumb”
• computational challenges
• combinatorial problem, NP-hard
• n observations in k groups
• kⁿ possible partitions
• k = 4 with n = 77: kⁿ = 2.3 × 10⁴⁶

• no guarantee of a global optimum



• Two main approaches
• hierarchical clustering

• start from the bottom (each observation is its own cluster)

• determine the number of clusters later

• partitioning clustering (k-means)

• start with a random assignment to k groups

• number of clusters pre-determined

• many clustering algorithms

Hierarchical Clustering



• Algorithm
• find the two observations that are closest (most similar)
• they form a cluster

• determine the next closest pair
• include the existing clusters in the comparisons

• continue grouping until all observations have been included
• result is a dendrogram
• a hierarchical tree structure (see the sketch below)

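A minimal sketch with scipy (a software assumption), clustering the component scores from the PCA sketch; the method argument selects the linkage discussed below:

    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    Z = linkage(scores, method="complete")   # build the tree bottom-up
    dendrogram(Z)                            # draw the tree (matplotlib assumed)

    labels4 = fcluster(Z, t=4, criterion="maxclust")   # "cut" the tree into 4 clusters
    labels6 = fcluster(Z, t=6, criterion="maxclust")   # ... or into 6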


[figure: step-by-step illustration of the hierarchical clustering algorithm on eight points]

• 2 and 3 are the closest points; they become a cluster
• 5 and 7 are the closest points; they become a cluster
• 1 and the cluster of 2 and 3 are the closest
• 4 and the cluster of 1, 2, and 3 are the closest
• 8 and the cluster of 5 and 7 are the closest
• the two remaining clusters merge; the algorithm has finished
• rewind (“cut”) the tree to reveal the desired number of clusters, e.g., two or four

hierarchical clustering algorithm

Source: Grolemund and Wickham (2016)



• Practical issues
• measure of similarity (dissimilarity) between clusters = linkage
• complete
• compact clusters
• single
• elongated clusters, singletons
• average
• centroid
• others …
• how many clusters
• where to cut the tree



[figure: types of linkage (complete, single, average, centroid), each defining the inter-cluster distance δ differently]

Source: Grolemund and Wickham (2016)



complete linkage dendrogram
(using first four principal components)



Complete linkage hierarchical cluster maps

k=4

k=6



single linkage dendrogram
(using first four principal components)



Single linkage hierarchical cluster maps

k=4

k=6



k-Means Clustering



• Algorithm
• randomly assign the n observations to k groups

• compute each group centroid (or other representative point)

• assign each observation to the closest centroid

• iterate until convergence



[figure: step-by-step illustration of the k-means algorithm, k = 3]

• randomly assign the points to k groups (here k = 3)
• compute the centroid of each group
• reassign each point to the group of the closest centroid
• re-compute the centroids and reassign, repeating until group membership ceases to change

k-means clustering algorithm

Source: Grolemund and Wickham (2016)



• Practical issues
• which k to select?
• compare solutions on within-group and between-group similarities

• sensitivity to the starting point
• use several random assignments and pick the best

• avoid local optima
• sensitivity analysis

• replicability
• set the random seed (see the sketch below)

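A minimal sketch with scikit-learn, again on the component scores; n_init restarts the algorithm from several random assignments and keeps the best solution, and random_state fixes the seed for replicability:

    from sklearn.cluster import KMeans

    km = KMeans(n_clusters=4, n_init=25, random_state=123).fit(scores)
    labels = km.labels_                      # group membership (labels are arbitrary)

    # compare candidate k on the total within-group sum of squares
    for k in range(2, 11):
        wss = KMeans(n_clusters=k, n_init=25, random_state=123).fit(scores).inertia_
        print(k, wss)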


k-means clustering on the first four principal components
note: the cluster labels themselves are arbitrary



K-means cluster maps

k=4

k=6



Spatially-Constrained
Clustering



• Regionalization
• grouping contiguous objects that are similar into
new aggregate areal units
• multiple objectives
• classical clustering
• within-group similarity, between-group dissimilarity

• spatial similarity
• only contiguous objects in same group

• shape
• compactness



• Solution strategies
• classical clustering with updates

• multi-objective approach

• automatic zoning

• graph-based approaches

• explicit optimization



• Classical clustering with updates
• start with hierarchical clustering or k-means
solution

• split/combine clusters that are not contiguous

• inefficient approach

• number of clusters is indeterminate



• Multi-objective approach
• introduce location (x, y) as variables within the clustering routine

• assign weights to the similarity objective vs the spatial objective (see the sketch below)

• difficult to set the weights

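A hedged sketch of the weighting idea, continuing the earlier sketches (the coordinates and the weight w are illustrative placeholders, not from the slides): standardize the centroid coordinates, scale the two blocks, and hand the stacked matrix to an ordinary clustering routine:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    coords = rng.uniform(size=(77, 2))       # placeholder for the area centroids (x, y)
    w = 0.5                                  # hypothetical weight; w = 0 ignores space entirely
    xy = StandardScaler().fit_transform(coords)
    Xw = np.hstack([(1 - w) * Xs, w * xy])   # attribute similarity vs spatial objective
    labels = KMeans(n_clusters=6, n_init=25, random_state=0).fit_predict(Xw)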


• Automatic Zoning
• AZP

• automatic zoning procedure (Openshaw and Rao)

• heuristic

• starts from random initial feasible solutions

• optimization (NP-hard problem)



• Graph-based approaches
• represent the contiguity structure of the objects
as a graph

• graph pruning

• e.g., using minimum spanning tree (SKATER)

• maximize internal similarity objective



• Explicit optimization
• formulate as an integer programming problem

• decision variables to allocate object i to region j

• formalize adjacency constraints

• typically as a graph representation

• several heuristics



• Example: SKATER
• Spatial ’K’luster Analysis by Tree Edge Removal

• Assunção et al. (2006)

• algorithm (first step sketched below)

• construct a minimum spanning tree from the adjacency graph

• prune the tree (cut edges) to achieve maximum internal homogeneity

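A hedged sketch of the first step only, with a random placeholder standing in for a real contiguity matrix (e.g., queen contiguity; rng and Xs as in the earlier sketches): weight each contiguity edge by the attribute dissimilarity of the two areas and extract the minimum spanning tree with scipy:

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import minimum_spanning_tree
    from scipy.spatial.distance import cdist

    W = rng.random((77, 77)) < 0.1           # placeholder contiguity matrix
    W = np.triu(W, 1)
    W = W | W.T                              # symmetric, zero diagonal

    cost = cdist(Xs, Xs) * W                 # dissimilarities on contiguity edges only
    mst = minimum_spanning_tree(csr_matrix(cost))   # step 1 of SKATER
    # step 2 (not shown): prune MST edges to maximize internal homogeneity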


Chicago Community Areas - neighbor structure (as a tree)



Chicago Community Areas - minimum spanning tree (MST)



Chicago Community Areas - pruned tree



Contiguity Constrained Clusters

k=4

k=6



• Issues
• many algorithms/heuristics
• different search spaces
• different conceptualizations of attribute similarity
• different considerations of spatial contiguity

• additional constraints
• districting: target population size

• number of clusters
• exogenous
• endogenous: the max-p regions problem
