
Cluster Analysis

The term cluster analysis encompasses a number of different algorithms
and methods for grouping objects of similar kind into respective categories.

A general question facing researchers in many areas of inquiry is how
to organize observed data into meaningful structures, that is, to
develop taxonomies.

In other words, cluster analysis is an exploratory data analysis tool
which aims at sorting different objects into groups in such a way that
the degree of association between two objects is maximal if they belong
to the same group and minimal otherwise.
Wednesday 11 July 2018 12:05 PM
In other words, cluster analysis simply discovers
structures in data without explaining why they exist.

For an overview see

Hierarchical Cluster Analysis: Comparison of Three Linkage Measures and
Application to Psychological Data
Odilia Yim and Kylee T. Ramdeen
The Quantitative Methods for Psychology (TQMP) 2015 11(1) 8-21

Cluster Analysis – A Standard Setting Technique In Measurement And Testing
Muhammad Naveed Khalid
Journal of Applied Quantitative Methods 2011 6(2) 46-58.
The data (set a) are correlations between variables
relating to home and school circumstances of children.
The file contains the full matrix of correlations, which
we use as similarities.
X1 Parental circumstances in 1964
X2 Details of class teacher in 1964
X3 School-parent interaction in 1964
X4 Girl's attitude in 1964
X5 Test score in 1964
X6 Type of school in 1968
X7 Parental circumstances in 1968
X8 School-parent interaction in 1968
X9 Test score in 1968

Better to use more sensible names!
Rather than clustering individuals (as is usual), the aim is to examine
how five measurements made on a group of schoolgirls in 1964 relate to
four measurements made on the same girls in 1968.

About a quarter of the children could not be traced, which may bias the
results.

The data are for 398 girls in their final year of primary
school in 1964 and fourth year of secondary in 1968.
The nine variables are composite measures.

Source

The Analysis and Interpretation of Multivariate Data for Social Scientists
David J. Bartholomew, Fiona Steele, Irini Moustaki and J.I. Galbraith
2002 by Chapman and Hall/CRC
Table 2.17

The Plowden children four years later
Peaker G.F.
1971 National Foundation for Educational Research in England and Wales
Table 7
In order to have the matrix of proximities recognized as such by the
cluster procedure, we must add two variables to the matrix file and we
must run the procedure as a syntax command (it is not available via the
drop-down menus). The two variables are ROWTYPE_ and VARNAME_. Both are
string variables with a width of 8 characters.
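
The layout of such a matrix file can be sketched in Python (not SPSS) as below. This is only an illustration under stated assumptions: the value "PROX" for ROWTYPE_ is what I understand the SPSS CLUSTER matrix input to expect and should be checked against the SPSS documentation; the helper name matrix_rows is hypothetical; and the two-variable example uses only the x1-x7 similarity of .770 reported at stage 1 of the agglomeration schedule for set a.

```python
# Sketch: lay out a similarity matrix as the rows of an SPSS-style matrix file.
# ASSUMPTION: ROWTYPE_ holds "PROX" on every row (verify in the SPSS docs).
# matrix_rows is a hypothetical helper, not part of SPSS or any library.

def matrix_rows(names, sims):
    """Return one dict per matrix row, with ROWTYPE_ and VARNAME_ added."""
    rows = []
    for name, values in zip(names, sims):
        row = {"ROWTYPE_": "PROX", "VARNAME_": name}  # the two extra string variables
        row.update(zip(names, values))                # one column per variable
        rows.append(row)
    return rows

names = ["x1", "x7"]
sims = [[1.0, 0.770],   # 0.770: x1-x7 coefficient from stage 1 of the schedule
        [0.770, 1.0]]

rows = matrix_rows(names, sims)
for row in rows:
    print(row)
```

Each row carries the variable name it belongs to in VARNAME_, so the procedure can match rows to columns; both string values fit within the 8-character width mentioned above.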
First load the data matrix (set a), either by activating the button on
the web site or by saving the data to your local machine and using the
following instructions.

File > Open > Data

Navigate to the location of the required file, then select “Open”.

Now open the syntax window.

File > New > Syntax

Enter the following into the syntax window.

CLUSTER x1 x2 x3 x4 x5 x6 x7 x8 x9
/MATRIX IN (*)
/METHOD COMPLETE
/PRINT SCHEDULE
/PLOT DENDROGRAM.

Simply cut and paste.

Now run the syntax (Run > All).

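The agglomeration schedule describes the successive formation of the
clusters: 1 links to 7, then 5 to 9, then 5 to 6, and so on. Recall that
1 refers to x1, and so on. Because the proximities are correlations used
as similarities, the coefficients fall as less similar clusters are joined.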
Agglomeration Schedule

        Cluster Combined                      Stage Cluster First Appears
Stage   Cluster 1   Cluster 2   Coefficients  Cluster 1   Cluster 2   Next Stage
1       1           7            .770         0           0           4
2       5           9            .758         0           0           3
3       5           6            .572         2           0           4
4       1           5            .388         1           3           5
5       1           3            .305         4           0           6
6       1           4            .193         5           0           7
7       1           8            .128         6           0           8
8       1           2           -.050         7           0           0
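
To read the schedule it can help to replay it. The Python sketch below (not SPSS) copies the stage, cluster and coefficient columns from the table and rebuilds the cluster memberships after each stage. It assumes, as the table suggests, that a merged cluster keeps the number shown under "Cluster 1"; the function names replay and first_joint_stage are hypothetical.

```python
# Agglomeration schedule copied from the table:
# (stage, cluster 1, cluster 2, coefficient)
SCHEDULE = [
    (1, 1, 7, 0.770),
    (2, 5, 9, 0.758),
    (3, 5, 6, 0.572),
    (4, 1, 5, 0.388),
    (5, 1, 3, 0.305),
    (6, 1, 4, 0.193),
    (7, 1, 8, 0.128),
    (8, 1, 2, -0.050),
]

def replay(schedule, n_items):
    """Yield (stage, clusters) after each merge; clusters maps label -> members."""
    clusters = {i: [i] for i in range(1, n_items + 1)}
    for stage, keep, absorb, _coef in schedule:
        # ASSUMPTION: the merged cluster keeps the "Cluster 1" label.
        clusters[keep] = sorted(clusters[keep] + clusters.pop(absorb))
        yield stage, {label: list(members) for label, members in clusters.items()}

def first_joint_stage(schedule, n_items, a, b):
    """Return the first stage at which items a and b share a cluster."""
    for stage, clusters in replay(schedule, n_items):
        if any(a in members and b in members for members in clusters.values()):
            return stage
    return None

for stage, clusters in replay(SCHEDULE, 9):
    print(stage, sorted(clusters.values()))
```

Calling first_joint_stage(SCHEDULE, 9, a, b) reports how late in the process two variables end up in the same cluster, which is the information the dendrogram shows graphically.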

The dendrogram summarises the agglomeration schedule; it is all you
really need.

The second column lists the variables by their sequence number:
(x1) 1, (x7) 7 … (x2) 2. This is useful when better names are used.

The dendrogram shows that two pairs of variables, parental circumstances
in 1964 (x1) and 1968 (x7), and total test scores in 1964 (x5) and 1968
(x9), are each closely linked.


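The school-parent interaction variables (x3 and x8), by contrast, are
not closely linked: according to the agglomeration schedule they only
share a cluster from the seventh of the eight stages, when x8 joins the
cluster that already contains x3.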

We might conclude that the teacher’s characteristics (x2), the girl’s
attitude in 1964 (x4), and the school-parent interaction in 1968 (x8)
are only weakly associated with the test scores (x5 and x9), whereas the
other four variables (x1, x3, x6 and x7) have stronger associations with
the test scores (x5 and x9).
The above conclusions can be confirmed by examining the
correlation matrix.

In SPSS numerous methods and measures are available.
The three methods are:

K-means cluster is a method for quickly clustering large data sets,
which typically take a while to compute with the preferred hierarchical
cluster analysis. The researcher must define the number of clusters in
advance. This is useful for testing different models, each with a
different assumed number of clusters (for example, in customer
segmentation).
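
A minimal k-means sketch in Python (not SPSS) shows why the number of clusters has to be fixed in advance: the algorithm starts from a given set of centres and only ever reassigns cases among them. The data points and the choice of starting centres are hypothetical, and SPSS's own procedure may pick initial centres differently.

```python
def kmeans(points, centres, max_iter=100):
    """Plain k-means; the number of clusters is len(centres), fixed up front."""
    centres = [tuple(c) for c in centres]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centre
        # (squared Euclidean distance).
        labels = [
            min(range(len(centres)),
                key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centres[j])))
            for point in points
        ]
        # Update step: move each centre to the mean of its assigned points.
        new_centres = []
        for j, old in enumerate(centres):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:
                new_centres.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centres.append(old)  # an empty cluster keeps its old centre
        if new_centres == centres:       # nothing moved: converged
            break
        centres = new_centres
    return labels, centres

# Hypothetical data: two well-separated groups of three cases each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

# Two clusters, started (by assumption) from one case in each group.
labels, centres = kmeans(points, centres=[points[0], points[3]])
print(labels)
print(centres)
```

Testing a different assumed number of clusters means rerunning the function with a different number of starting centres and comparing the solutions.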

Hierarchical cluster is the most common method. We will discuss this
method shortly. It takes time to calculate, but it generates a series of
models with cluster solutions from 1 (all cases in one cluster) to n
(each case is an individual cluster). Hierarchical cluster also works
with variables as opposed to cases; it can cluster variables together in
a manner somewhat similar to factor analysis. In addition, hierarchical
cluster analysis can handle nominal, ordinal, and scale data; however,
it is not recommended to mix different levels of measurement.
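
The series of solutions from n clusters down to 1 can be illustrated with a small agglomerative sketch in Python (not SPSS). It is a naive implementation on a hypothetical distance matrix, meant only to show the idea; the function name agglomerate is hypothetical, and passing min or max as the linkage gives single or complete linkage respectively.

```python
def agglomerate(dist, linkage=min):
    """Repeatedly merge the two closest clusters.

    Returns the list of partitions, from n singleton clusters down to one
    cluster. linkage=min is single linkage, linkage=max complete linkage.
    """
    clusters = [[i] for i in range(len(dist))]
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Distance between two clusters under the chosen linkage.
                d = linkage(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
        history.append([list(c) for c in clusters])
    return history

# Hypothetical cases placed on a line; distances are absolute differences.
positions = [0.0, 1.0, 2.5, 5.0, 5.5]
dist = [[abs(p - q) for q in positions] for p in positions]

for partition in agglomerate(dist, linkage=min):
    print(len(partition), partition)
```

Every intermediate partition is one "model"; the researcher chooses which level of the hierarchy to interpret, usually with the help of the dendrogram.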

Two-step cluster analysis is more of a tool than a single analysis. It
identifies the groupings by running pre-clustering first and then
hierarchical methods. Because it uses a quick cluster algorithm upfront,
it can handle large data sets that would take a long time to compute
with hierarchical cluster methods. In this respect, it combines the best
of both approaches. Two-step clustering can also handle scale and
ordinal data in the same model, and it automatically selects the number
of clusters, a task normally assigned to the researcher in the two other
methods.
For a second example (set b) the data are the percentages employed in
different industries in European countries during 1979.

The job categories are agriculture, mining, manufacturing, power
supplies, construction, service industries, finance, social and personal
services, and transport and communications.

Agr agriculture
Min mining
Man manufacturing
PS power supplies
Con construction
SI service industries
Fin finance
SPS social and personal services
TC transport and communications

It is important to note that these data were collected during the Cold
War (source).

The data may be loaded by utilising the link on the module web site, or
saving locally and using the approach previously described.

The raw data

Select

Analyze > Classify > Hierarchical Cluster


Select all the variables and case labels


Select the desired plots

Select the desired method and scaling


Can you see any “structure”?


A single-linkage cluster analysis shows that the countries cluster
together into three main groups along political lines.


Group 1, at the top of the plot, contains countries of capitalist
Western Europe.


Group 2 contains countries of the communist Eastern Bloc. I suspect
Spain was affected by its Fascist past.


Group 3 contains Yugoslavia, which was unaligned and shared some
characteristics of both other groups, and Turkey, which is probably more
properly classified as an Asian nation since only a small percentage of
its land area lies on the European continent.

Alternatively, enter the following into the syntax window.

DATASET DECLARE D0.08685272375079045.
PROXIMITIES Agr Min Man PS Con SI Fin SPS TC
/MATRIX OUT(D0.08685272375079045)
/VIEW=CASE
/MEASURE=EUCLID
/PRINT NONE
/ID=Country
/STANDARDIZE=VARIABLE SD.
CLUSTER
/MATRIX IN(D0.08685272375079045)
/METHOD SINGLE
/ID=Country
/PRINT SCHEDULE
/PLOT DENDROGRAM.
DATASET CLOSE D0.08685272375079045.

The long digit string D0.08685272375079045 is a user-assigned name for
the data set.

This syntax is particularly complex since it scales the data. Why is
this important?
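
One way to see why scaling matters is a small Python sketch (not SPSS) on hypothetical data. It assumes that /STANDARDIZE=VARIABLE SD rescales each variable to a standard deviation of 1, and it imitates that by dividing each column by its sample standard deviation. With Euclidean distance, a variable measured on a larger scale dominates the unscaled distances, so the nearest neighbour of a case can change once the variables are put on a common scale.

```python
from math import sqrt
from statistics import stdev

def euclid(a, b):
    """Euclidean distance between two cases."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def scale_to_unit_sd(rows):
    """Divide each column by its sample standard deviation."""
    sds = [stdev(col) for col in zip(*rows)]
    return [tuple(v / s for v, s in zip(row, sds)) for row in rows]

def nearest(rows, i):
    """Index of the case closest to case i."""
    return min((j for j in range(len(rows)) if j != i),
               key=lambda j: euclid(rows[i], rows[j]))

# Hypothetical cases: (variable with a wide spread, variable with a narrow spread)
rows = [(40.0, 1.0), (48.0, 1.1), (41.0, 3.0), (60.0, 2.0)]

print("unscaled nearest neighbour of case 0:", nearest(rows, 0))
print("scaled nearest neighbour of case 0:  ", nearest(scale_to_unit_sd(rows), 0))
```

In the unscaled data the wide-spread variable decides which case is closest to case 0; after scaling, the second variable carries equal weight and a different case becomes the nearest neighbour. Single linkage is built entirely on such nearest-neighbour distances, so the dendrogram can change accordingly.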
You could enter the following into the syntax window,
with no scaling.
CLUSTER Agr Min Man PS Con SI Fin SPS TC
/METHOD SINGLE
/MEASURE=EUCLID
/ID=Country
/PRINT NONE
/PLOT DENDROGRAM.

Leading to a very different view (structure).

Clearly Turkey is still “unusual”.

For those who wish to investigate far beyond the scope of this course, see

Comparing the performance of biomedical clustering methods.
Christian Wiwie, Jan Baumbach, Richard Röttger.
Nature Methods, 2015; DOI: 10.1038/nmeth.3583
Identifying groups of similar objects is a popular first step in biomedical data analysis, but it is
error-prone and impossible to perform manually. Many computational methods have been developed
to tackle this problem. Here we assessed 13 well-known methods using 24 data sets ranging from
gene expression to protein domains. Performance was judged on the basis of 13 common cluster
validity indices. We developed a clustering analysis platform, ClustEval, to promote streamlined
evaluation, comparison and reproducibility of clustering results in the future. This allowed us to
objectively evaluate the performance of all tools on all data sets with up to 1,000 different
parameter sets each, resulting in a total of more than 4 million calculated cluster validity indices. We
observed that there was no universal best performer, but on the basis of this wide-ranging
comparison we were able to develop a short guideline for biomedical clustering tasks. ClustEval allows
biomedical researchers to pick the appropriate tool for their data type and allows method
developers to compare their tool to the state of the art.