
Clustering with Confidence: A Binning Approach

Rebecca Nugent
Department of Statistics
Carnegie Mellon University
Pittsburgh, PA 15213

Werner Stuetzle
Department of Statistics
University of Washington
Box 354322, Seattle, WA 98195

Abstract

We present a plug-in method for estimating the cluster tree of a density. The method takes advantage of the ability to exactly compute the level sets of a piecewise constant density estimate. We then introduce clustering with confidence, an automatic pruning procedure that assesses significance of splits (and thereby clusters) in the cluster tree; the only user input required is the desired confidence level.

1 Introduction

The goal of clustering is to identify distinct groups in a data set and assign a group label to each observation. Ideally, we would be able to find the number of groups as well as where each group lies in the feature space with minimal input from the user. For example, Figure 1 below shows two curvilinear groups, 200 observations each, and two spherical groups, 100 observations each; we would like a four-cluster solution, one per group.

Figure 1: Data set with four well-separated groups

To cast clustering as a statistical problem, we regard the data, X = {x1, ..., xn} ⊂ ℜ^p, as a sample from some unknown population density p(x). There are two statistical approaches. Parametric (model-based) clustering assumes the data have been generated by a finite mixture of g underlying parametric probability distributions p_g(x) (often multivariate Gaussian), one component per group (Fraley, Raftery 1998; McLachlan, Basford 1988). This procedure selects a number of mixture components (clusters) and estimates their parameters. However, it is susceptible to violation of distributional assumptions. Skewed or non-Gaussian groups may be modeled incorrectly by a mixture of several Gaussians.

In contrast, the nonparametric approach assumes a correspondence between groups in the data and modes of the density p(x). Wishart (1969) first advocated searching for modes as manifestations of the presence of groups; nonparametric clustering algorithms should be able to resolve distinct data modes, independently of their shape and variance.

Hartigan expanded this idea and made it more precise (1975, 1981, 1985). Define a level set L(λ; p) of a density p at level λ as the subset of the feature space for which the density exceeds λ:

    L(λ; p) = {x | p(x) > λ}.

Its connected components are the maximally connected subsets of a level set. For any two connected components A and B, possibly at different levels, either A ⊂ B, B ⊂ A, or A ∩ B = ∅. This hierarchical structure of the level sets is summarized by the cluster tree of p.
The cluster tree is easiest to define recursively (Stuetzle 2003). Each node N of the tree represents a subset D(N) of the support L(0; p) of p and is associated with a density level λ(N). The root node represents the entire support of p and is associated with density level λ(N) = 0. To determine the daughters of a node, we find the lowest level λd for which L(λd; p) ∩ D(N) has two or more connected components. If no such λd exists, then D(N) is a mode of the density, and N is a leaf of the tree. Otherwise, let C1, C2, ..., Cn be the connected components of L(λd; p) ∩ D(N). If n = 2, we create two daughter nodes at level λd, one for each connected component; we then apply the procedure recursively to each daughter node. If n > 2, we create two connected components C1 and C2 ∪ C3 ∪ ... ∪ Cn and their respective daughter nodes and then recurse. We call the regions D(N) the high density clusters of p.

Note that we have defined a binary tree; other possible definitions with non-binary splits could be used. However, the recursive binary tree easily accommodates level sets with more than two connected components. For example, if the level set L(λd; p) ∩ D(N) has three connected components, C1, C2, C3, the binary tree first splits, say, C1 from C2 ∪ C3 at height λd and, after finding any subsequent splits and leaves on the corresponding branch for C1, returns to the connected component C2 ∪ C3 at height λd, immediately splits C2 and C3, and creates two more daughter nodes at height λd. The final cluster tree structure will contain a node for each of the three connected components, all at height λd. Figure 2 shows a univariate density with four modes and the corresponding cluster tree with an initial split at λ = 0.0044 and subsequent splits at λ = 0.0288, 0.0434. Estimating the cluster tree is a fundamental goal of nonparametric cluster analysis.

Figure 2: (a) Density with four modes; (b) corresponding cluster tree with three splits (four leaves)
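As a concrete illustration, the recursion above can be written down directly when the density is piecewise constant on a grid, the setting developed in Section 2. The sketch below is only illustrative: the numpy array representation, the dict-based tree, and the use of scipy.ndimage.label for connected components are our assumptions, not the authors' implementation.

    import numpy as np
    from scipy.ndimage import label

    def cluster_tree(density, node_mask=None):
        # Root node: the support of the estimate, at level 0.
        if node_mask is None:
            node_mask = density > 0
        # Candidate split levels are the unique estimated density values in D(N).
        for lam in np.unique(density[node_mask]):
            comps, n = label((density > lam) & node_mask)   # face-adjacent components
            if n >= 2:                                      # lowest level with a split
                left, right = comps == 1, comps >= 2        # C1 versus C2 u ... u Cn
                return {"level": lam,
                        "daughters": [cluster_tree(density, left),
                                      cluster_tree(density, right)]}
        return {"level": None, "daughters": []}             # leaf: a mode of the estimate

Called on the binned density estimate of Section 2, cluster_tree returns the nested split structure; the pruning of Section 3 would then decide which of these splits to keep.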
There are several previously suggested clustering methods based on level sets. Wishart's one level mode analysis is probably the first such method (1969); its goal is to find the connected components of L(λ; p̂) for a given λ. After computing a density estimate p̂, all observations xi where p̂(xi) ≤ λ are set aside. Well-separated groups of the remaining high density observations correspond to connected components of L(λ; p̂). One drawback is that not all groups or modes may be identified by examining a single level set. Wishart's hierarchical mode analysis addresses this weakness; this complex algorithm constructs a cluster tree (although not so named) through iterative merging.

Several other level set estimation procedures have been proposed. Walther's level set estimator consists of two steps (1997). First, compute a density estimate p̂ of the underlying density p. Second, approximate L(λ; p) by a union of balls Bi of radius r centered around those xi with p̂(xi) > λ that do not contain any xi with p̂(xi) ≤ λ. While this structure is computationally convenient for finding connected components, its accuracy depends on the choice of r. Walther provides a formula for r based on the behavior of p on the boundary of the level set L(λ; p). The Cuevas, Febrero, and Fraiman level set estimator also is based on a union of balls of radius r (2000, 2001). However, their estimate is the union of all balls Bi centered around xi where p̂(xi) > λ; no balls are excluded. They propose several heuristic methods for choosing r based on interpoint distances.

Stuetzle's runt pruning method (2003) estimates the cluster tree of p by computing the cluster tree of the nearest neighbor density estimate and then pruning branches corresponding to supposed spurious modes. The method is sensitive to the pruning amount parameter chosen by the user. Klemelä developed visualization tools for multivariate density estimates that are piecewise constant over (hyper-)rectangles (2004). He defines a level set tree, which has nodes at every level corresponding to a unique value of p̂. Note that the cluster tree only has nodes at split levels.

2 Constructing the cluster tree for a piecewise constant density estimate

We can estimate the cluster tree of a density p by the cluster tree of a density estimate p̂. However, for most density estimates, computing the cluster tree is a difficult problem; there is no obvious method for computing and representing the level sets. Exceptions are density estimates that are piecewise constant over (hyper-)rectangles. Let B1, B2, ..., BN be the rectangles, and let p̂i be the estimated density for Bi. Then

    L(λ; p̂) = ∪ {Bi : p̂i > λ}.

If the dimension is low enough, any density estimate can be reasonably binned. To illustrate, we choose the Binned Kernel Density Estimate (BKDE), a binned approximation to an ordinary kernel density estimate, on the finest grid we can computationally afford (Hall, Wand 1996; Wand, Jones 1995). We use ten-fold cross-validation for bandwidth estimation. Figure 3a shows a heat map of the BKDE for the four well-separated groups on a 20 x 20 grid with a cross-validation bandwidth of 0.0244. Several high frequency areas are indicated.
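A minimal sketch of such a binned estimate for a two-dimensional sample on the unit square, assuming numpy and scipy are available: bin the data on a regular grid, smooth the counts with a Gaussian kernel, and renormalize. The grid size, the bandwidth handling, and the use of scipy.ndimage.gaussian_filter are illustrative simplifications rather than the exact BKDE of Hall and Wand (1996).

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def binned_density(X, grid_size=20, bandwidth=0.05, lims=(0.0, 1.0)):
        """Piecewise constant density estimate of a 2-D sample on a regular grid."""
        lo, hi = lims
        edges = np.linspace(lo, hi, grid_size + 1)
        counts, _, _ = np.histogram2d(X[:, 0], X[:, 1], bins=[edges, edges])
        width = (hi - lo) / grid_size
        smoothed = gaussian_filter(counts, sigma=bandwidth / width)  # bandwidth in grid units
        return smoothed / (smoothed.sum() * width ** 2)              # integrates to ~1

    # The level set L(lambda; p_hat) is then simply the boolean mask of bins whose
    # estimated density exceeds lambda:
    #     level_set = binned_density(X) > lam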
Note that during cluster tree construction, L(λd; p̂) ∩ D(N) changes structure only when the level λd is equal to the next higher value of p̂(Bi) for one or more bins Bi in D(N). After identifying and sorting the bins' unique density estimate values, we can compute the cluster tree of p̂ by stepping through only this subset of all possible levels λ. Every increase in level then corresponds to the removal of one or more bins from the level set L(λd; p̂).

Finding the connected components of L(λd; p̂) ∩ D(N) can be cast as a graph problem. Let G be an adjacency graph over the bins Bi ∈ L(λd; p̂) ∩ D(N). The vertices of the graph are the bins Bi. Two bins Bi and Bj are connected by an edge if adjacent (i.e. they share a lower-dimensional face). The connected components of G correspond to the connected components of the level set L(λd; p̂) ∩ D(N), and finding them is a standard graph problem (Sedgewick 2002).
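One standard solution to this graph problem is union-find over the face-adjacent bins. The sketch below is one possible implementation for a two-dimensional grid; the function name and the 4-neighbour adjacency are our illustrative choices (scipy.ndimage.label, used in the earlier sketch, solves the same problem in compiled code).

    import numpy as np

    def level_set_components(level_set_mask):
        """Union-find labelling of face-adjacent True bins in a 2-D boolean array."""
        bins = [(int(r), int(c)) for r, c in np.argwhere(level_set_mask)]
        idx = {b: i for i, b in enumerate(bins)}
        parent = list(range(len(bins)))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]        # path halving
                i = parent[i]
            return i

        for (r, c), i in idx.items():
            for nb in ((r + 1, c), (r, c + 1)):      # right and lower face neighbours
                if nb in idx:
                    parent[find(i)] = find(idx[nb])  # union the two components
        labels = {b: find(i) for b, i in idx.items()}
        return labels, len(set(labels.values()))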

Figure 3: (a) Example data BKDE on a 20 x 20 grid; (b) level set at λ = 0.00016; cluster cores in orange and white; (c) corresponding partitioned feature space

When a node N of the cluster tree has been split into daughters Nl, Nr, the high density clusters D(Nl), D(Nr), also referred to as the cluster cores, do not form a partition of D(N). We refer to the bins in D(N) \ (D(Nl) ∪ D(Nr)) (and the observations in these bins) as the fluff. We assign each bin B in the fluff to Nr if the Manhattan distance dM(B, D(Nr)) is smaller than dM(B, D(Nl)). If dM(B, D(Nr)) > dM(B, D(Nl)), then B is assigned to Nl. In case of ties, the algorithm arbitrarily chooses an assignment. The cluster cores and fluff represented by the leaves of the cluster tree form a partition of the support of p̂ and a corresponding partition of the observations. The same is true for every subtree of the cluster tree. Figure 3b shows L(0.00016; p̂) with removed bins in red. Two daughter nodes are created, one for each connected component/cluster core. Figure 3c illustrates the subsequent partitioning of the feature space after assignment of fluff bins to the closest core.
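A rough sketch of the fluff assignment, with bins represented by their grid indices and the distance from a bin to a core taken as the Manhattan distance to the nearest core bin; the names and the tie handling (ties go to the left daughter here) are illustrative assumptions.

    import numpy as np

    def assign_fluff(fluff_bins, core_left, core_right):
        """fluff_bins, core_left, core_right: arrays of (row, col) bin indices."""
        def dist(b, core):
            return np.abs(core - b).sum(axis=1).min()   # Manhattan distance to nearest core bin
        assignment = {}
        for b in fluff_bins:
            dl, dr = dist(b, core_left), dist(b, core_right)
            assignment[tuple(b)] = "right" if dr < dl else "left"
        return assignment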
Figure 4a shows the cluster tree generated from the example BKDE; the corresponding cluster assignments and partitioned feature space (labeled by color) are in Figures 4b,c. The cluster tree indicates the BKDE has nine modes. The first split at λ = 1 × 10^-16 and the mode of p̂ around (0.5, 0) (in yellow in Figure 4c) are artifacts of the density estimate; no observations are assigned to the resulting daughter node. The remaining eight leaves each correspond to (a subset of) one of the groups.

Figure 4: (a) Cluster tree for the BKDE; (b) cluster assignments by color; (c) partitioned feature space

We compare these results to common clustering approaches (briefly described here). K-means is an iterative descent approach that minimizes the within-cluster sum of squares; it tends to partition the feature space into equal-sized (hyper-)spheres (Hartigan, Wong 1979). Although we know there are four groups, we range the number of clusters from two to twelve and plot the total within-cluster sum of squares against the number of clusters. The elbow in the curve occurs around five or six clusters. Solutions with more clusters have negligible decreases in the criterion. Model-based clustering is described in Section 1. We allow the procedure to search over one to twelve clusters with no constraints on the covariance matrix; a ten-cluster solution is chosen. Finally, we include results
from three hierarchical linkage methods (Mardia et al. 1979). Hierarchical agglomerative linkage methods iteratively join clusters in an order determined by a chosen between-cluster distance; we use the minimum, maximum, or average Euclidean distance between observations in different clusters (single, complete, and average linkage respectively). The hierarchical cluster structure is graphically represented by a tree-based dendrogram whose branches merge at heights corresponding to the clusters' dissimilarity. We prune each dendrogram for the known number of groups (four). In all cases, we compare the generated clustering partitions to the true groups using the Adjusted Rand Index (Table 1) (Hubert, Arabie 1985). The ARI is a common measure of agreement between two partitions. Its expected value is zero under random partitioning, with a maximum value of one; larger values indicate better agreement.

Table 1: Comparison of Clustering Methods

    Method                                  ARI
    Cluster Tree: 20 x 20 (k = 8)           0.781
    Cluster Tree: 15 x 15 (k = 6)           0.865
    K-means (k = 4)                         0.924
    K-means (k = 5)                         0.803
    K-means (k = 6)                         0.673
    MBC (k = 10)                            0.534
    Linkage (single, complete, average)     1
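For reference, the ARI used in Table 1 is available in standard libraries; a toy sketch assuming scikit-learn is installed:

    from sklearn.metrics import adjusted_rand_score

    # The ARI ignores label names, so a permuted but otherwise identical
    # partition scores 1; unrelated partitions score near 0.
    true_labels    = [0, 0, 1, 1, 2, 2]
    cluster_labels = [1, 1, 0, 0, 2, 2]
    print(adjusted_rand_score(true_labels, cluster_labels))   # 1.0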
The groups in the simulated data are well-separated; however, the two curvilinear groups lead to an expected increase in the number of clusters. While the four-cluster k-means solution had high agreement with the true partition, the total within-cluster sum of squares criterion was lower for solutions with larger numbers of clusters (5, 6), which had lower ARIs comparable to the cluster tree partition generated for the BKDE on a 20 x 20 grid. Model-based clustering overestimated the number of groups in the population; the ARI correspondingly drops. All three linkage methods gave perfect agreement; this performance is not unexpected given the well-separated groups.

Although the cluster tree method has an ARI of 0.781, the algorithm generates the correct partition for the density estimate. Its performance is dependent on the density estimate. A similarly found density estimate on a 15 x 15 grid with a cross-validation selected bandwidth of 0.015 yielded a cluster tree with an ARI of 0.865. In both cases, the number of groups is overestimated (8 and 6). Figure 4 illustrates this problem in the approach. While the cluster tree is accurate for the given density estimate, a density estimate is inherently noisy, which results in spurious modes not corresponding to groups in the underlying population.

In our example, the procedure identified the four original groups (the three splits after the artifact split in the cluster tree) but erroneously continued splitting the clusters. The corresponding branches of the cluster tree need to be pruned.

3 Clustering with Confidence

We propose a bootstrap-based automatic pruning procedure. The idea is to find (1 − α) simultaneous upper and lower confidence sets for each level set. During cluster tree construction, only splits indicated as significant by the bootstrap confidence sets are taken to signal multi-modality. Spurious modes are discarded during estimation; the only decision required from the user is the confidence level.

3.1 Bootstrap confidence sets for level sets

We choose upper confidence sets to be of the form Lu(λ; p̂) = L(λ − δu; p̂) and lower confidence sets of the form Ll(λ; p̂) = L(λ + δl; p̂) with δu, δl > 0. By construction,

    Lower = Ll(λ; p̂) ⊆ L(λ; p̂) ⊆ Lu(λ; p̂) = Upper.

Let p̂*1, p̂*2, ..., p̂*m be the density estimates for m bootstrap samples of size n drawn with replacement from the original sample. We call a pair (L(λ − δu; p̂), L(λ + δl; p̂)) a non-simultaneous (1 − α) confidence band for L(λ; p̂) if, for (1 − α) of the bootstrap density estimates p̂*i, the upper confidence set L(λ − δu; p̂) contains L(λ; p̂*i) and the lower confidence set L(λ + δl; p̂) is contained in L(λ; p̂*i):

    Pboot{ L(λ + δl; p̂) ⊆ L(λ; p̂*i) ⊆ L(λ − δu; p̂) } ≥ 1 − α.

Here is one method to determine δu, δl (Buja 2002). For each bootstrap sample p̂*i and each of the finitely many levels λ of p̂, find the smallest δu(i) such that

    L(λ; p̂*i) ⊆ L(λ − δu(i); p̂)

and the smallest δl(i) such that

    L(λ + δl(i); p̂) ⊆ L(λ; p̂*i).

Choose δu as the (1 − α/2) quantile of the δu(i) and δl as the (1 − α/2) quantile of the δl(i). By construction, the pair (L(λ − δu; p̂), L(λ + δl; p̂)) is a (1 − α) non-simultaneous confidence band for L(λ; p̂). To get confidence bands for all λ occurring as values of p̂ with simultaneous coverage probability 1 − α, we simply increase the coverage level of the individual bands until

    Pboot{ L(λ + δl; p̂) ⊆ L(λ; p̂*i) ⊆ L(λ − δu; p̂) for all λ } ≥ 1 − α.
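A rough sketch of this calibration for a single level λ on the binned estimate, where the candidate shifts are taken from the finitely many values of p̂; the function names, looping strategy, and fallbacks are our illustrative assumptions, not the authors' code.

    import numpy as np

    def smallest_delta_u(density, boot, lam, levels):
        """Smallest shift d with L(lam; boot) contained in L(lam - d; density)."""
        boot_set = boot > lam
        for d in np.sort(lam - levels[levels <= lam]):          # ascending shifts
            if not (boot_set & ~(density > lam - d)).any():
                return d
        return lam                                               # fall back to L(0; density)

    def smallest_delta_l(density, boot, lam, levels):
        """Smallest shift d with L(lam + d; density) contained in L(lam; boot)."""
        boot_set = boot > lam
        for d in np.sort(levels[levels >= lam] - lam):
            if not ((density > lam + d) & ~boot_set).any():
                return d
        return levels.max() - lam                                # level set empty here

    # (1 - alpha) non-simultaneous band at level lam: take quantiles over the
    # bootstrap replicates (boots is a list of bootstrap densities on the same grid).
    # delta_u = np.quantile([smallest_delta_u(p_hat, b, lam, levels) for b in boots], 1 - alpha / 2)
    # delta_l = np.quantile([smallest_delta_l(p_hat, b, lam, levels) for b in boots], 1 - alpha / 2)
    # Simultaneous bands follow by raising this per-level coverage until the joint
    # bootstrap coverage over all levels reaches 1 - alpha.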
Note that the actual upper and lower confidence sets for L(λ; p̂) are the level sets (L(λ − δu; p̂), L(λ + δl; p̂)) of p̂; the bootstrap is used only to find δu, δl.

3.2 Constructing the cluster tree

After finding δu, δl for all λ, we incorporate the bootstrap confidence sets into the cluster tree construction by only allowing splits at heights for which the corresponding bootstrap confidence set (Ll(λ; p̂), Lu(λ; p̂)) gives strong evidence of a split. We use a recursive procedure similar to the one described in Section 2. The root node represents the entire support of p̂ and is associated with density level λ(N) = 0. To determine the daughters of a node, we find the lowest level λd for which (a) Ll(λd; p̂) ∩ D(N) has two or more connected components that (b) are disconnected in Lu(λd; p̂) ∩ D(N). Condition (a) indicates that the underlying density p has two peaks above height λ; condition (b) indicates that the two peaks are separated by a valley dipping below height λ. Satisfying both conditions indicates a split at height λ. If no such λd exists, N is a leaf of the tree. Otherwise, let C1l, C2l be two connected components of Ll(λd; p̂) ∩ D(N) that are disconnected in Lu(λd; p̂) ∩ D(N). Let C1u and C2u be the connected components of Lu(λd; p̂) ∩ D(N) that contain C1l and C2l respectively. If n = 2, we create two daughter nodes at level λd for C1u and C2u and, to each, apply the procedure recursively. If n > 2, we create two connected components C1u and C2u ∪ C3u ∪ ... ∪ Cnu and daughter nodes and recurse.
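A compact sketch of this split test for the binned estimate, reusing the bootstrap shifts δu, δl from the calibration sketched above; variable names are illustrative, and the test is written for a single candidate level.

    import numpy as np
    from scipy.ndimage import label

    def significant_split(density, node_mask, lam, delta_u, delta_l):
        lower = (density > lam + delta_l) & node_mask      # L(lam + delta_l; p_hat) in D(N)
        upper = (density > lam - delta_u) & node_mask      # L(lam - delta_u; p_hat) in D(N)
        low_labels, n_low = label(lower)
        if n_low < 2:
            return False                                   # condition (a) fails
        up_labels, _ = label(upper)
        # Lower-set components are separated in the upper set if they fall into
        # different upper-set components (each lower component lies in exactly one).
        comp_ids = {up_labels[low_labels == k][0] for k in range(1, n_low + 1)}
        return len(comp_ids) >= 2                          # condition (b)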
Returning to the example cluster tree (Figure 4a), we examine the split at λ = 0.0036, the first split that breaks the lower left curvilinear group into two clusters. Figure 5 shows the bootstrap confidence set (α = 0.05) for this level set. The original level set L(0.0036; p̂) is in Figure 5a, color-coded by subsequent final leaf. The lower confidence set (Figure 5b) is found to be δl = 0.0037 higher, i.e. L(0.0073; p̂). The upper confidence set (Figure 5c) is found to be δu = 0.0026 lower, i.e. L(0.001; p̂). At λ = 0.0036, the upper confidence set does not have two connected components (no valley). Moreover, even though the lower confidence set does have two connected components, they do not correspond to the two connected components in L(0.0036; p̂). We do not have evidence of a significant split and so do not create daughter nodes at this level.

Figure 5: Bootstrap confidence set for λ = 0.0036: (a) L(0.0036; p̂); (b) lower confidence set, δl = 0.0037; (c) upper confidence set, δu = 0.0026

Clustering with Confidence (CWC) using α = 0.05 generates the cluster tree and data/feature space partitions seen in Figure 6. The cluster tree's significant splits have identified the four original groups (ARI = 1). No other smaller clusters (or modal artifacts) are found. Note that the split heights are higher than the corresponding split heights in the cluster tree in Figure 4. The CWC procedure required stronger evidence for a split than was available at the lower levels. It performed similarly to the hierarchical linkage methods and more favorably than k-means or model-based clustering (Table 1).

Figure 6: (a) 95% confidence cluster tree; (b) cluster assignments; (c) partitioned feature space
4 Examples

The algorithms presented could be used for any number of dimensions, but are more tractable for lower dimensions. For easy visualization of the results, we present two examples using real data in two dimensions. We comment on higher dimensionality in the summary and future work section.

4.1 Identifying different skill sets in a student population

In educational research, a fundamental goal is identifying whether or not students have mastered the skills taught. The data available usually consist of whether or not a student answered questions correctly and which skills the questions require. Cognitive diagnosis models and item response theory models, often used to estimate students' abilities, can be lengthy in run time; estimation instability grows as the number of students, questions, and skills increases. Clustering capability estimates easily derived from students' responses and skill-labeled questions to partition students into groups with different skill sets has been proposed as a quick alternative (Ayers, Nugent, Dean 2008). A capability vector Bi = {Bi1, Bi2, ..., BiK} ∈ [0, 1]^K for K skills is found for each student. For skill k, a Bik near 0 indicates no mastery, near 1 indicates mastery, and near 0.5 indicates partial mastery or uncertainty. The cluster center is viewed as a skill set capability profile for a group of students.

Figure 7b shows the (jittered) estimated capabilities for 551 students on two skills, Evaluate Functions and Multiplication, from an online mathematics tutor (Assistment system; Heffernan, Koedinger, Junker 2001). The BKDE cross-validated bandwidth was 0.032. The corresponding cluster tree (Figure 7a) has 10 different skill set profiles (clusters). Figure 7b shows the subsequent partition. Note that the two highest splits in the tree separate the clusters in the top half section of the feature space; these students have partial to complete mastery of Multiplication and are separated by their Evaluate Function mastery level. The more well-separated skill set profiles are identified earlier.

Figure 7: (a) Cluster tree; (b) cluster assignments

Figure 8 shows selected results from other clustering procedures. The total within-cluster sum of squares criterion suggested a k-means solution with 14 or 15 clusters. Figure 8a illustrates its propensity to partition into small tight spheres (perhaps overly so). Given a search range of two to 18 clusters with no constraints, model-based clustering chose an eleven-cluster solution (Figure 8b). The strips are divided into cohesive sections of the feature space; the uncertain mastery students are clustered together. Note that model-based clustering identified a very loose cluster of students with poor Multiplication mastery and Evaluate Function mastery ranging from 0.2 to 1 that overlaps with a tighter cluster of students with an Evaluate Function value of 0.5. This skill profile is a casualty of the assumption that each mixture component in the density estimate corresponds to a group in the population. However, overall the results seem to give more natural skill profile groupings than k-means. All hierarchical linkage methods yielded dendrograms without obvious cluster solutions; Figures 8c,d show an example of a four-cluster complete linkage solution. Students with different Evaluate Function mastery levels are combined; in this respect, the results are unsatisfying.

Figure 8: (a) K-means assignment (k = 14); (b) model-based clustering assignment (k = 11); (c) complete linkage dendrogram; (d) complete linkage cluster assignment (k = 4)

Developing 10 different instruction methods for a group of students is unreasonable in practice. We use CWC to construct a cluster tree for α = 0.90 to
identify skill set profiles in which we are at least 10% confident. Figure 9 shows the subsequent cluster tree and assignments. At 10% confidence, we have only three skill set profiles, each identified only by its mastery level of Evaluate Functions. This result may be a more reasonable partition given the less clear separation along Multiplication mastery. Note that there are instances in which close fluff observations are assigned to different cores, a result of the tie-breaker.

Figure 9: (a) 10% confidence cluster tree; (b) cluster assignments

4.2 Automatic Gating in Flow Cytometry

Flow cytometry is a technique for examining and sorting microscopic particles. Fluorescent tags are attached to mRNA molecules in a population of cells and passed in front of a single wavelength laser; the level of fluorescence in each cell (corresponding, for example, to level of gene expression) is recorded. We might be interested in discovering groups of cells that have high fluorescence levels for multiple channels or groups of cells that have different levels across channels. A common identification method is gating, or subgroup extraction from two-dimensional plots of measurements on two channels. Most commonly, these subgroups are identified subjectively by eyeballing the graphs. Clustering techniques would allow for more statistically motivated subgroup identification (see Lo, Brinkman, Gottardo 2008 for one proposed method).

Figure 10a shows 1545 flow cytometry measurements on two fluorescence markers applied to Rituximab, a therapeutic monoclonal antibody, in a drug-screening project designed to identify agents to enhance its anti-lymphoma activity (Gottardo, Lo 2008). Cells were stained, following culture, with the agent anti-BrdU and the DNA binding dye 7-AAD.

We construct a cluster tree (cross-validated BKDE bandwidth of 21.834) and plot the cluster assignments indicating whether the observations are fluff or part of a cluster core (Figures 10b,c). The cluster tree has 13 leaves (8 clusters, 5 modal artifacts). The cores are located in the high frequency areas; the fluff is appropriately assigned. The sizes of the cores give some evidence as to their eventual significance. For example, the core near (400, 700) consists of one observation; we would not expect the corresponding cluster to remain in the confidence cluster tree for any reasonable α.

Figure 10: (a) Flow cytometry measurements on the two fluorescent markers anti-BrdU and 7-AAD; (b) cluster tree with 13 leaves (8 clusters, 5 artifacts); (c) cluster assignments: fluff = x, cores = solid points

Figure 11a shows a ten-cluster k-means solution, similarly chosen as before. While many of the smaller clusters are found by both k-means and the cluster tree, k-means splits the large cluster in Figure 10c into several smaller clusters. Model-based clustering chooses a seven-cluster solution (Figure 11b). We again see overlapping clusters in the lower left quadrant; this solution would not be viable in this application. Linkage methods again provided unclear solutions.

Figure 11: (a) K-means assignment (k = 10); (b) model-based clustering assignment (k = 7)
We use CWC to construct a confidence cluster tree for α = 0.10; we are at least 90% confident in the generated clusters (Figure 12). All modal artifacts have been removed, and the smaller clusters have merged into two larger clusters with cores at (200, 200) and (700, 300). Note that the right cluster is a combination of the two high 7-AAD/low anti-BrdU clusters from Figure 10c. CWC did not find enough evidence to warrant splitting this larger cluster into subgroups.

Figure 12: (b) 90% confidence cluster tree with two leaves; (c) cluster assignments: fluff = x, cores = solid points

5 Summary and future work

We have presented a plug-in method for estimating the cluster tree of a density that takes advantage of the ability to exactly compute the level sets of a piecewise constant density estimate. The approach shows flexibility in finding clusters of unequal sizes and shapes. However, the cluster tree is dependent on the (inherently noisy) density estimate. We introduced clustering with confidence, an automatic pruning procedure that assesses the significance of splits in the cluster tree; the only input needed is the desired confidence level.

These procedures may become computationally intractable as the number of adjacent bins grows with the dimension and are realistic for use only in lower dimensions. One high-dimensional approach would be to employ projection or dimension reduction techniques prior to cluster tree estimation. We also have developed a graph-based approach that approximates the cluster tree in high dimensions. Clustering with Confidence could then be applied to the resulting graph to identify significant clusters.

References

Ayers, E., Nugent, R., and Dean, N. (2008) Skill Set Profile Clustering Based on Student Capability Vectors Computed From Online Tutoring Data. Proc. of the Int'l Conference on Educational Data Mining (peer-reviewed). To appear.

Buja, A. (2002) Personal Communication. Also Buja, A. and Rolke, W. Calibration for Simultaneity: (Re)Sampling Methods for Simultaneous Inference with Applications to Function Estimation and Functional Data. In revision.

Cuevas, A., Febrero, M., and Fraiman, R. (2000) Estimating the Number of Clusters. The Canadian Journal of Statistics, 28:367-382.

Cuevas, A., Febrero, M., and Fraiman, R. (2001) Cluster Analysis: A Further Approach Based on Density Estimation. Computational Statistics & Data Analysis, 36:441-459.

Fraley, C. and Raftery, A. (1998) How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis. The Computer Journal, 41:578-588.

Gottardo, R. and Lo, K. (2008) flowClust Bioconductor package.

Hall, P. and Wand, M. P. (1996) On the Accuracy of Binned Kernel Density Estimators. Journal of Multivariate Analysis, 56:165-184.

Hartigan, J. A. (1975) Clustering Algorithms. Wiley.

Hartigan, J. A. (1981) Consistency of Single Linkage for High-Density Clusters. Journal of the American Statistical Association, 76:388-394.

Hartigan, J. A. (1985) Statistical Theory in Clustering. Journal of Classification, 2:63-76.

Hartigan, J. A. and Wong, M. A. (1979) A K-means Clustering Algorithm. Applied Statistics, 28:100-108.

Heffernan, N. T., Koedinger, K. R., and Junker, B. W. (2001) Using Web-Based Cognitive Assessment Systems for Predicting Student Performance on State Exams. Research proposal to the Institute of Educational Statistics, US Department of Education. Department of Computer Science at Worcester Polytechnic Institute, Massachusetts.

Hubert, L. and Arabie, P. (1985) Comparing Partitions. Journal of Classification, 2:193-218.

Klemelä, J. (2004) Visualization of Multivariate Density Estimates with Level Set Trees. Journal of Computational and Graphical Statistics, 13:599-620.

Lo, K., Brinkman, R., and Gottardo, R. (2008) Automated Gating of Flow Cytometry Data via Robust Model-Based Clustering. Cytometry, Part A, 73A:321-332.

Mardia, K., Kent, J. T., and Bibby, J. M. (1979) Multivariate Analysis. Academic Press.

McLachlan, G. J. and Basford, K. E. (1988) Mixture Models: Inference and Applications to Clustering. Marcel Dekker.

Sedgewick, R. (2002) Algorithms in C, Part 5: Graph Algorithms, 3rd Ed. Addison-Wesley.

Stuetzle, W. (2003) Estimating the Cluster Tree of a Density by Analyzing the Minimal Spanning Tree of a Sample. Journal of Classification, 20:25-47.

Walther, G. (1997) Granulometric Smoothing. The Annals of Statistics, 25:2273-2299.

Wand, M. P. and Jones, M. C. (1995) Kernel Smoothing. Chapman & Hall.

Wishart, D. (1969) Mode Analysis: A Generalization of Nearest Neighbor which Reduces Chaining Effects. Numerical Taxonomy, ed. A. J. Cole, Academic Press, 282-311.
