
FULL-PATTERN CLUSTER ANALYSIS OF MULTIPLE X-RAY POWDER DIFFRACTION DATA


Thomas Degen, PANalytical B.V., Lelyweg 1, Almelo, The Netherlands

Introduction:
Modern X-ray diffraction equipment, such as X'Pert PRO systems with an X'Celerator detector, allows the rapid collection of hundreds of scans. This is useful in application areas such as polymorph screening, high-throughput screening and non-ambient experiments.
Here we present a method that greatly simplifies the analysis of large amounts of data by automatically sorting all scans of an experiment into closely related clusters and identifying both the most representative scan of each cluster and any outlying patterns.
Full-pattern cluster analysis is a new feature added to the FDA Part 11 compliant X-ray powder diffraction analysis software packages X'Pert HighScore/HighScore Plus.
This software comprises several full-pattern analysis methods, such as search/match phase identification, quantitative Rietveld analysis and crystallinity determination, plus an exhaustive range of pattern treatment methods and complete report generation in RTF format or as MS Word documents. All methods can be used in an automated way (pushbutton or command line), in any sequence and with user-definable parameters.
The implemented cluster analysis method is essentially an automatic 3-step process. Additional visualization tools (dendrograms, histograms and score plots derived from principal component analysis) are available to judge and influence the clustering.

The 3-step process:

Step 1: Generation of the correlation and distance matrices
Comparing the full profile of every powder diffraction pattern in a set of n patterns with all the other patterns yields an n x n correlation matrix p (figure 1). This matrix can be converted to a Euclidean distance matrix, d, of the same dimension by:

d = (1 - p).

The comparison is performed by the proprietary match algorithm used with great success for phase identification. Here it compares every scan with all other scans.
First all measured data sets (which could also include peaks information) are reduced to probability curves ui(x). The match is carried out by a direct comparison between ui(x) and uj(x). Various similarity indicators (FOM's) are calculated to indicate the similarity between the data sets. All indicators are calculated for the overlapping range of the compared data sets and are only computed for those regions where one of the curves u(x) exceeds a threshold T.

Step 2: Agglomerative hierarchical cluster analysis
The correlation matrix p generated in step 1 is the input for a hierarchical agglomerative cluster analysis, which puts the patterns into classes defined by their similarity. Merging two clusters (Ci and Cj) poses the problem to define the distance between the newly formed cluster Ci + Cj and any other cluster Ck. There are various linkage methods available (single linkage, complete linkage, average linkage, Ward's method…) to calculate these distances. The average linkage method is often the most useful and effective.
The result of this analysis is usually displayed as a dendrogram (figure 2). Each pattern starts at the left side as a separate cluster, and these clusters amalgamate in a stepwise fashion, linked by vertical tie bars. The horizontal position of the tie bar represents a dissimilarity measure.

Step 3: Estimation of the number of clusters
A well known and in principle unsolved problem is to find the "right" number of clusters. This means cutting the dendrogram at a given dissimilarity and retaining a meaningful set of clusters.
We offer two solutions:
1) The cut-off is automatically put at the position of the biggest relative step on the dissimilarity scale between the clusters. This mimics what people are doing when they judge a dendrogram.
2) The KGS test (ref. 1, figure 3) known from protein structure analysis by NMR spectra. The minimum KGS value represents the state, where the clusters are as highly populated as possible, whilst simultaneously maintaining the smallest spread.

[Figure 3 plot "Penalty Function": x axis Linkage Number (2 to 24), y axis Penalty Value (10 to 27), with cut-off labels ranging from 6.22 to 194.28.]

For a cluster J containing m patterns, the most representative data set is defined as the data set that has the minimum mean distance from all other data sets in the cluster.
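The agglomerative clustering of step 2 and the first cut-off rule of step 3 can be sketched in plain Python. This is only an illustration under stated assumptions, not the HighScore implementation: the function names and the 5 x 5 distance matrix are hypothetical, the clustering is run on the distance matrix d, and the "relative step" is assumed to be the increase between two consecutive merge heights divided by the lower one, with the cut placed midway between them.

```python
def average_linkage(d):
    """Agglomerative clustering with average linkage on a distance matrix d.

    Returns the merge history as (left_members, right_members, height)
    tuples, ordered from the first merge to the last.
    """
    clusters = [[i] for i in range(len(d))]
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean distance over all cross-cluster pairs.
                pairs = [d[i][j] for i in clusters[a] for j in clusters[b]]
                height = sum(pairs) / len(pairs)
                if best is None or height < best[0]:
                    best = (height, a, b)
        height, a, b = best
        history.append((clusters[a], clusters[b], height))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return history


def largest_relative_step_cutoff(history):
    """Place the cut-off inside the largest relative jump in merge height.

    Assumption: relative step = (next - current) / current.
    Returns None if no step can be evaluated.
    """
    heights = [height for _, _, height in history]
    best_ratio, cutoff = -1.0, None
    for low, high in zip(heights, heights[1:]):
        if low <= 0:
            continue  # a relative step is undefined for a zero height
        ratio = (high - low) / low
        if ratio > best_ratio:
            best_ratio, cutoff = ratio, (low + high) / 2.0
    return cutoff


def cut_dendrogram(history, n, cutoff):
    """Replay every merge at or below the cut-off and return the clusters."""
    clusters = [[i] for i in range(n)]
    for left, right, height in history:
        if height > cutoff:
            break
        clusters.remove(left)
        clusters.remove(right)
        clusters.append(left + right)
    return sorted(clusters)


# Hypothetical distances for five scans: two tight pairs and one outlier.
d = [
    [0.00, 0.02, 0.30, 0.34, 0.90],
    [0.02, 0.00, 0.28, 0.32, 0.88],
    [0.30, 0.28, 0.00, 0.04, 0.80],
    [0.34, 0.32, 0.04, 0.00, 0.78],
    [0.90, 0.88, 0.80, 0.78, 0.00],
]
history = average_linkage(d)
cutoff = largest_relative_step_cutoff(history)
clusters = cut_dendrogram(history, len(d), cutoff)
```

In this toy example the merge heights are roughly 0.02, 0.04, 0.31 and 0.84, so the largest relative step lies between the second and third merge. The cut leaves two clusters of two scans each plus one un-clustered scan. The naive pair search is chosen for readability rather than speed.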


Example for step 1: Overall figure of merit F1, computed over the regions where ui(x) or uj(x) > T.
F1(i, j) gives an overall probability that the two data sets are the same. A value of unity indicates a perfect match. Interestingly, F1(i, i) is not exactly unity, reflecting the fact that ui(x) itself may include probabilities differing from 0 and 1.

Optional step 4: Principal Component Analysis
Principal Component Analysis (PCA) is a separate and independent method to visualize and to judge the quality of the clustering. The correlation matrix p from step 1 is used as input.
To display the results we use a three-dimensional score plot (figure 4) with the three axes corresponding to the first three principal components and where the data sets are displayed as balls. The color of the balls indicates the cluster they belong to (red color for un-clustered scans). Three stars mark the most representative data set of a cluster.

Figure 2: The dendrogram is a graphical display of the result of an agglomerative hierarchical cluster analysis (actual cut-off at a dissimilarity of about 90).

Figure 3: The minimum of the KGS test penalty function shows the "right" number of clusters (seven), the cut-off value (about 96) for the dendrogram is also indicated.

Figure 4: The PCA score plot (left) shows the clear separation of the data sets into 6 clusters plus one outlier; however the first 3 principal components shown in the plot cover only 78 percent of the variation in the data (right: Eigenvalues plot).
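The share of variation covered by the first principal components, as quoted in the Figure 4 caption, is a cumulative sum of eigenvalues divided by their total. The short sketch below shows that arithmetic only. The function name and the eigenvalues are hypothetical, they are not the values behind the poster's Eigenvalues plot, and the eigen-decomposition itself is not shown.

```python
def cumulative_explained_variance(eigenvalues):
    """Return the cumulative fraction of total variance per component.

    The eigenvalues are sorted in descending order first, so entry k is the
    share of variation covered by the first k + 1 principal components.
    """
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    cumulative, running = [], 0.0
    for value in ordered:
        running += value
        cumulative.append(running / total)
    return cumulative


# Hypothetical eigenvalues (illustrative only).
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]
coverage = cumulative_explained_variance(eigenvalues)
first_three = coverage[2]  # share covered by a 3-D score plot
```

A low value for the first three components means that a three-dimensional score plot hides part of the variation, which is the caveat raised in the Figure 4 caption.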

Figure 1: The correlation matrix for F1, generated by comparing each data set with all other data sets using the proven "Matching" algorithm.
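The conversion of the correlation matrix p into the distance matrix d from step 1 can be sketched as follows. This minimal example applies d = (1 - p) element-wise exactly as the formula is printed on the poster. The function name and the 3 x 3 matrix are hypothetical, and the proprietary match algorithm that produces p is not reproduced here.

```python
def correlation_to_distance(p):
    """Convert an n x n correlation matrix into a distance matrix, d = 1 - p."""
    n = len(p)
    if any(len(row) != n for row in p):
        raise ValueError("correlation matrix must be square")
    return [[1.0 - p[i][j] for j in range(n)] for i in range(n)]


# Hypothetical correlations between three scans (illustrative values only).
p = [
    [1.00, 0.90, 0.20],
    [0.90, 1.00, 0.25],
    [0.20, 0.25, 1.00],
]
d = correlation_to_distance(p)
```

Highly correlated scans end up close together (small d) and dissimilar scans far apart. Because F1(i, i) need not be exactly unity, the diagonal of d computed from real data is not guaranteed to be exactly zero.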

Figure 5: The 6 scans forming the third cluster (in pink color) of the dendrogram.
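For a cluster such as the one in Figure 5, the most representative data set is the member with the minimum mean distance to the other members. A minimal sketch of that selection is given below. The function name, the example distances and the normalization by m - 1 are assumptions, since the poster's printed formula is not preserved in this text. The normalization does not change which member is selected.

```python
def most_representative(d, members):
    """Return the cluster member with the smallest mean distance to the others.

    d is the full distance matrix; members lists the indices of the
    patterns that belong to the cluster.
    """
    if len(members) == 1:
        return members[0]

    def mean_distance(i):
        others = [j for j in members if j != i]
        return sum(d[i][j] for j in others) / len(others)

    return min(members, key=mean_distance)


# Hypothetical distances for a cluster of three scans (illustrative only).
d = [
    [0.00, 0.10, 0.30],
    [0.10, 0.00, 0.15],
    [0.30, 0.15, 0.00],
]
representative = most_representative(d, [0, 1, 2])
```

Here scan 1 is selected, because its mean distance to the other two scans (0.125) is smaller than that of scan 0 (0.2) or scan 2 (0.225).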

References:
1) Kelley, L.A., Gardner, S.P., Sutcliffe, M.J. (1996) An automated approach
for clustering an ensemble of NMR-derived protein structures into
conformationally-related subfamilies, Protein Engineering, 9, 1063-1065.

The Analytical X-ray Company
