The data are for 398 girls in their final year of primary
school in 1964 and fourth year of secondary in 1968.
The nine variables are composite measures.
Cluster Analysis
Source
File > Open > Data
First load the data matrix (set a), either by activating
the button on the web site or by saving the data to
your local machine and opening it from the menus (File > Open > Data).
Now open the syntax window (File > New > Syntax).
Enter the following into the syntax window.
CLUSTER x1 x2 x3 x4 x5 x6 x7 x8 x9
/MATRIX IN (*)
/METHOD COMPLETE
/PRINT SCHEDULE
/PLOT DENDROGRAM.
Now run the syntax (Run > All).
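The agglomeration schedule describes the successive
formation of the clusters: 1 links to 7, then 5 to 9, then
5 to 6, and so on. Recall that 1 refers to x1, and so on.
The coefficients fall from .770 to -.050 as the stages proceed,
so the most closely linked variables join first.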
Agglomeration Schedule

        Cluster Combined                       Stage Cluster First Appears
Stage   Cluster 1   Cluster 2   Coefficients   Cluster 1   Cluster 2   Next Stage
1       1           7            .770          0           0           4
2       5           9            .758          0           0           3
3       5           6            .572          2           0           4
4       1           5            .388          1           3           5
5       1           3            .305          4           0           6
6       1           4            .193          5           0           7
7       1           8            .128          6           0           8
8       1           2           -.050          7           0           0
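To see how the schedule builds the clusters, the merges can be replayed stage by stage. The short Python sketch below is an illustration only and is not part of the SPSS output. The function name `replay` is made up for this example, and it assumes (as the table suggests) that a merged cluster keeps the number shown in the Cluster 1 column.

```python
# Agglomeration schedule copied from the table above:
# (stage, cluster 1, cluster 2, coefficient).
SCHEDULE = [
    (1, 1, 7, 0.770),
    (2, 5, 9, 0.758),
    (3, 5, 6, 0.572),
    (4, 1, 5, 0.388),
    (5, 1, 3, 0.305),
    (6, 1, 4, 0.193),
    (7, 1, 8, 0.128),
    (8, 1, 2, -0.050),
]


def replay(schedule, n_items):
    """Return the partition (sorted list of member lists) after each stage."""
    # Every variable starts as its own cluster, labelled by its own number.
    clusters = {i: [i] for i in range(1, n_items + 1)}
    history = []
    for _stage, keep, absorb, _coef in schedule:
        # Assumption: the merged cluster keeps the "Cluster 1" label.
        clusters[keep] = sorted(clusters[keep] + clusters.pop(absorb))
        history.append(sorted(clusters.values()))
    return history


if __name__ == "__main__":
    for (stage, _a, _b, coef), partition in zip(SCHEDULE, replay(SCHEDULE, 9)):
        groups = " ".join(
            "{" + ",".join(f"x{i}" for i in group) + "}" for group in partition
        )
        print(f"stage {stage} (coefficient {coef:+.3f}): {groups}")
```

After stage 4, for instance, x1, x5, x6, x7 and x9 form one cluster while x2, x3, x4 and x8 are still on their own.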
The dendrogram summarises the data from the previous slide; it's all you really need.
The second column gives the variables by their sequence number: (x1) 1, (x7) 7 … (x2) 2.
This is more useful when better names are used.
The dendrogram shows that two pairs of variables, parental circumstances in
1964 (x1) and 1968 (x7), and total test scores in 1964 (x5) and 1968 (x9), are
each closely linked.
We might conclude that the teacher’s characteristics (x2), the girl’s attitude in
1964 (x4), and the school-parent interaction in 1968 (x8) are only weakly
associated with the test scores (x5 and x9).
In SPSS, numerous methods and measures are available.
The three methods are k-means cluster, hierarchical cluster, and two-step cluster.
Hierarchical cluster is the most common method. We
will discuss this method shortly. It takes time to
calculate, but it generates a series of models with
cluster solutions from 1 (all cases in one cluster) to n (each
case is its own cluster). Hierarchical cluster also
works with variables as opposed to cases; it can cluster
variables together in a manner somewhat similar to
factor analysis. In addition, hierarchical cluster analysis
can handle nominal, ordinal, and scale data; however, it is
not recommended to mix different levels of
measurement.
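The idea of a series of solutions running from n clusters down to 1 can be sketched in a few lines of Python. This is a minimal illustration, not SPSS's implementation: the function names and the four-point example are invented here, and complete linkage is used simply because it is the method requested in the earlier syntax.

```python
def complete_linkage(dist, a, b):
    """Complete linkage: the distance between the two furthest members."""
    return max(dist[i][j] for i in a for j in b)


def agglomerate(dist):
    """Return every cluster solution, from n singletons down to one cluster.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    """
    clusters = [[i] for i in range(len(dist))]
    solutions = [[c[:] for c in clusters]]
    while len(clusters) > 1:
        # Examine every pair of clusters and pick the closest pair.
        pairs = [
            (p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        ]
        p, q = min(
            pairs,
            key=lambda pq: complete_linkage(dist, clusters[pq[0]], clusters[pq[1]]),
        )
        merged = sorted(clusters[p] + clusters[q])
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
        clusters.sort()
        solutions.append([c[:] for c in clusters])
    return solutions


# Hypothetical example: four cases measured on a single variable.
VALUES = [0.0, 1.0, 5.0, 7.0]
DIST = [[abs(a - b) for b in VALUES] for a in VALUES]

if __name__ == "__main__":
    for solution in agglomerate(DIST):
        print(len(solution), "clusters:", solution)
```

Every pair of clusters is compared at every step, which is one reason the method becomes slow as the number of cases grows.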
Two-step cluster analysis is more of a tool than a single
analysis. It identifies the groupings by running
pre-clustering first and then hierarchical methods.
Because it uses a quick cluster algorithm up front, it can
handle large data sets that would take a long time to
compute with hierarchical cluster methods. In this
respect, it combines the best of both approaches.
Two-step clustering can also handle scale and ordinal data in
the same model, and it automatically selects the number of
clusters, a task normally assigned to the researcher in the
two other methods.
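The two-stage idea can be illustrated with a rough Python sketch. It is not the SPSS procedure, whose pre-clustering and distance measures are more elaborate and are not reproduced here. As a stand-in, a single sequential pass groups nearby cases, and the resulting pre-clusters are then merged hierarchically by centroid distance. The function names, the threshold and the data points are all hypothetical, and the automatic choice of the number of clusters is not attempted.

```python
import math


def centroid(group):
    """Mean point of a group of coordinate tuples."""
    return tuple(sum(coord) / len(group) for coord in zip(*group))


def precluster(points, threshold):
    """Stage 1: one quick pass over the cases.

    A case joins the nearest existing pre-cluster if its centroid is
    within `threshold`; otherwise it starts a new pre-cluster.
    """
    groups = []
    for pt in points:
        best = None
        for g in groups:
            d = math.dist(pt, centroid(g))
            if d <= threshold and (best is None or d < best[0]):
                best = (d, g)
        if best is None:
            groups.append([pt])
        else:
            best[1].append(pt)
    return groups


def merge_closest(groups, k):
    """Stage 2: merge the pre-clusters hierarchically until k remain."""
    groups = [g[:] for g in groups]
    while len(groups) > k:
        pairs = [
            (i, j) for i in range(len(groups)) for j in range(i + 1, len(groups))
        ]
        i, j = min(
            pairs,
            key=lambda ij: math.dist(centroid(groups[ij[0]]), centroid(groups[ij[1]])),
        )
        groups[i] = groups[i] + groups.pop(j)  # i < j, so index i stays valid
    return groups


# Hypothetical cases measured on two variables.
POINTS = [
    (0.0, 0.0), (0.5, 0.0), (0.0, 0.4),
    (5.0, 5.0), (5.2, 5.1),
    (12.0, 0.0), (12.3, 0.2),
    (5.1, 4.8),
]

if __name__ == "__main__":
    pre = precluster(POINTS, threshold=1.0)
    print(len(pre), "pre-clusters")
    print(merge_closest(pre, k=2))
```

The expensive pairwise comparison in the second stage runs over a handful of pre-clusters rather than over every case, which is where the saving on large data sets comes from.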
For a second example (set b), the data are the percentage employed in different
sectors of the economy for a set of countries. The variables include Agr
(agriculture) and Min (mining).
It is important to note that these data were collected during the Cold
War (source).
The data may be loaded by utilising the link on the module web site, or
saving locally and using the approach previously described.
The raw data
Select Analyze > Classify > Hierarchical Cluster
Select the desired method and scaling
A single-linkage cluster analysis shows that the countries cluster
together into three main groups along political lines.
Group 2 contains countries of the communist Eastern Bloc. I suspect
Spain was affected by its Fascist past.
Group 3 contains Yugoslavia, which was non-aligned and shared some
characteristics of both other groups, and Turkey, which is probably more
properly classified as an Asian nation since only a small percentage of its
land area lies on the European continent.
Alternatively, enter the following into the syntax window.
DATASET DECLARE D0.08685272375079045.
PROXIMITIES Agr Min Man PS Con SI Fin SPS TC
/MATRIX OUT(D0.08685272375079045)
/VIEW=CASE
/MEASURE=EUCLID
/PRINT NONE
/ID=Country
/STANDARDIZE=VARIABLE SD.
CLUSTER
/MATRIX IN(D0.08685272375079045)
/METHOD SINGLE
/ID=Country
/PRINT SCHEDULE
/PLOT DENDROGRAM.
DATASET CLOSE D0.08685272375079045.
The long number D0.08685272375079045 is a user-assigned name for the data set.
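What the PROXIMITIES step hands on to CLUSTER can be sketched in Python. This is an illustration under stated assumptions rather than a reproduction of SPSS: STANDARDIZE=VARIABLE SD is read here as rescaling each variable to a standard deviation of 1 (the sample standard deviation is assumed), the function names are invented, and the four rows of data are made up, not taken from set b.

```python
import math
import statistics


def scale_by_sd(rows):
    """Divide each variable (column) by its sample standard deviation."""
    sds = [statistics.stdev(col) for col in zip(*rows)]
    return [[value / sd for value, sd in zip(row, sds)] for row in rows]


def euclidean_matrix(rows):
    """Symmetric matrix of Euclidean distances between cases (rows)."""
    return [[math.dist(a, b) for b in rows] for a in rows]


def closest_pair(dist):
    """Indices of the two nearest cases: the first single-linkage merge."""
    n = len(dist)
    return min(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda ij: dist[ij[0]][ij[1]],
    )


# Made-up percentages for four hypothetical countries on three variables.
ROWS = [
    [10.0, 1.0, 30.0],
    [12.0, 1.2, 28.0],
    [40.0, 0.5, 10.0],
    [38.0, 3.0, 12.0],
]

if __name__ == "__main__":
    scaled = scale_by_sd(ROWS)
    dist = euclidean_matrix(scaled)
    for row in dist:
        print(" ".join(f"{d:6.3f}" for d in row))
    print("first merge:", closest_pair(dist))
```

Scaling first stops a variable with a large spread from dominating the distances. Single linkage then starts from the nearest pair of cases and afterwards measures the distance between two clusters by their nearest members.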
Leading to a very different view (structure).
For those who wish to investigate far beyond the scope of
this course, see:
Christian Wiwie, Jan Baumbach, Richard Röttger. Comparing the performance of
biomedical clustering methods. Nature Methods, 2015; DOI: 10.1038/nmeth.3583
Identifying groups of similar objects is a popular first step in biomedical data analysis, but it is
error-prone and impossible to perform manually. Many computational methods have been developed
to tackle this problem. Here we assessed 13 well-known methods using 24 data sets ranging from
gene expression to protein domains. Performance was judged on the basis of 13 common cluster
validity indices. We developed a clustering analysis platform, ClustEval, to promote streamlined
evaluation, comparison and reproducibility of clustering results in the future. This allowed us to
objectively evaluate the performance of all tools on all data sets with up to 1,000 different
parameter sets each, resulting in a total of more than 4 million calculated cluster validity indices. We
observed that there was no universal best performer, but on the basis of this wide-ranging
comparison we were able to develop a short guideline for biomedical clustering tasks. ClustEval allows
biomedical researchers to pick the appropriate tool for their data type and allows method
developers to compare their tool to the state of the art.