Figure 3: (a) example data BKDE on a 20 x 20 grid; (b) level set at λ = 0.00016; cluster cores in orange, fluff in white; (c) corresponding partitioned feature space

Figure 4: (a) cluster tree for BKDE; (b) cluster assignments by color; (c) partitioned feature space

When a node N of the cluster tree has been split into daughters Nl, Nr, the high density clusters D(Nl), D(Nr), also referred to as the cluster cores, do not form a partition of D(N). We refer to the bins in D(N) \ (D(Nl) ∪ D(Nr)) (and observations in these bins) as the fluff. We assign each bin B in the fluff to Nr if the Manhattan distance dM(B, D(Nr)) is smaller than dM(B, D(Nl)). If dM(B, D(Nr)) > dM(B, D(Nl)), then B is assigned to Nl. In case of ties, the algorithm arbitrarily chooses an assignment. The cluster cores and fluff represented by the leaves of the cluster tree form a partition of the support of p̂ and a corresponding partition of the observations.

We compare these results to common clustering approaches (briefly described here). K-means is an iterative descent approach that minimizes the within-cluster sum of squares; it tends to partition the feature space into equal-sized (hyper-)spheres (Hartigan, Wong 1979). Although we know there are four groups, we range the number of clusters from two to twelve and plot the total within-cluster sum of squares against the number of clusters. The elbow in the curve occurs around five or six clusters. Solutions with more clusters have negligible decreases in the criterion. Model-based clustering is described in Section 1. We allow the procedure to search over one to twelve clusters with no constraints on the covariance matrix; a ten-cluster solution is chosen. Finally, we include results from three hierarchical linkage methods (Mardia, et al 1979). Hierarchical agglomerative linkage methods iteratively join clusters in an order determined by a chosen between-cluster distance; we use the minimum, maximum, or average Euclidean distance between observations in different clusters (single, complete, and average linkage respectively). The hierarchical cluster structure is graphically represented by a tree-based dendrogram whose branches merge at heights corresponding to the clusters' dissimilarity. We prune each dendrogram for the known number of groups (four). In all cases, we compare the generated clustering partitions to the true groups using the Adjusted Rand Index (Table 1) (Hubert, Arabie 1985). The ARI is a common measure of agreement between two partitions. Its expected value is zero under random partitioning with a maximum value of one; larger values indicate better agreement.

Table 1: Comparison of Clustering Methods

    Method                                 ARI
    Cluster Tree: 20 x 20 (k = 8)          0.781
    Cluster Tree: 15 x 15 (k = 6)          0.865
    K-means (k = 4)                        0.924
    K-means (k = 5)                        0.803
    K-means (k = 6)                        0.673
    MBC (k = 10)                           0.534
    Linkage (single, complete, average)    1

The groups in the simulated data are well-separated; however, the two curvilinear groups lead to an expected increase in the number of clusters. While the four-cluster k-means solution had high agreement with the true partition, the total within-cluster sum of squares criterion was lower for solutions with larger numbers of clusters (5, 6), which had lower ARIs comparable to the cluster tree partition generated for the BKDE on a 20 x 20 grid. Model-based clustering overestimated the number of groups in the population; the ARI correspondingly drops. All three linkage methods gave perfect agreement; this performance is not unexpected given the well-separated groups.

Although the cluster tree method has an ARI of 0.781, the algorithm generates the correct partition for the density estimate. Its performance is dependent on the density estimate. A similarly found density estimate on a 15 x 15 grid with a cross-validation selected bandwidth of 0.015 yielded a cluster tree with an ARI of 0.865. In both cases, the number of groups is overestimated (8 and 6). Figure 4 illustrates this problem in the approach. While the cluster tree is accurate for the given density estimate, a density estimate is inherently noisy, which results in spurious modes not corresponding to groups in the underlying population. In our example, the procedure identified the four original groups (the three splits after the artifact split in the cluster tree) but erroneously continued splitting the clusters. The corresponding branches of the cluster tree need to be pruned.

3 Clustering with Confidence

We propose a bootstrap-based automatic pruning procedure. The idea is to find (1 - α) simultaneous upper and lower confidence sets for each level set. During cluster tree construction, only splits indicated as significant by the bootstrap confidence sets are taken to signal multi-modality. Spurious modes are discarded during estimation; the only decision required from the user is the confidence level.

3.1 Bootstrap confidence sets for level sets

We choose upper confidence sets of the form Lu(λ; p̂) = L(λ - δu; p̂) and lower confidence sets of the form Ll(λ; p̂) = L(λ + δl; p̂) with δu, δl > 0. By construction,

    Lower = Ll(λ; p̂) ⊆ L(λ; p̂) ⊆ Lu(λ; p̂) = Upper.

Let p̂1, p̂2, ..., p̂m be the density estimates for m bootstrap samples of size n drawn with replacement from the original sample. We call a pair (L(λ - δu; p̂), L(λ + δl; p̂)) a non-simultaneous (1 - α) confidence band for L(λ; p̂) if, for (1 - α) of the bootstrap density estimates p̂i, the upper confidence set L(λ - δu; p̂) contains L(λ; p̂i) and the lower confidence set L(λ + δl; p̂) is contained in L(λ; p̂i):

    Pboot{L(λ + δl; p̂) ⊆ L(λ; p̂i) ⊆ L(λ - δu; p̂)} ≥ 1 - α.

Here is one method to determine δu, δl (Buja 2002). For each bootstrap sample p̂i and each of the finitely many levels λ of p̂, find the smallest δu(i) such that

    L(λ; p̂i) ⊆ L(λ - δu(i); p̂)

and the smallest δl(i) such that

    L(λ + δl(i); p̂) ⊆ L(λ; p̂i).

Choose δu to be the (1 - α/2) quantile of the δu(i) and δl the (1 - α/2) quantile of the δl(i). By construction, the pair (L(λ - δu; p̂), L(λ + δl; p̂)) is a (1 - α) non-simultaneous confidence band for L(λ; p̂). To get confidence bands for all λ occurring as values of p̂ with simultaneous coverage probability 1 - α, we simply increase the coverage level of the individual bands until

    Pboot{L(λ + δl; p̂) ⊆ L(λ; p̂i) ⊆ L(λ - δu; p̂) for all λ} ≥ 1 - α.

Note that the actual upper and lower confidence sets for L(λ; p̂) are the level sets L(λ - δu; p̂) and L(λ + δl; p̂) of p̂; the bootstrap is used only to find δu, δl.
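Because p̂ is piecewise constant, each level set is a finite union of bins, and the smallest δu(i), δl(i) can be found by direct comparison of bin heights. The following is a minimal Python sketch of this search (illustrative names, operating on flat sequences of bin heights; not the authors' implementation):

```python
def smallest_deltas(p_hat, p_boot, lam):
    """Smallest (delta_u, delta_l), for one bootstrap estimate p_boot and one
    level lam, such that L(lam; p_boot) is contained in L(lam - delta_u; p_hat)
    and L(lam + delta_l; p_hat) is contained in L(lam; p_boot).  Level sets are
    unions of bins, L(c; f) = {bins b : f[b] >= c}."""
    inside = [b for b, v in enumerate(p_boot) if v >= lam]
    outside = [b for b, v in enumerate(p_boot) if v < lam]
    # (i) every bin of L(lam; p_boot) must survive in L(lam - delta_u; p_hat)
    delta_u = max(0.0, lam - min(p_hat[b] for b in inside)) if inside else 0.0
    # (ii) bins outside L(lam; p_boot) must drop out of L(lam + delta_l; p_hat);
    # any delta_l above this bound works since p_hat takes finitely many values
    delta_l = max(0.0, max(p_hat[b] for b in outside) - lam) if outside else 0.0
    return delta_u, delta_l

def band_deltas(p_hat, boot_estimates, lam, alpha=0.05):
    """Empirical (1 - alpha/2) quantiles of the per-bootstrap deltas, giving a
    non-simultaneous (1 - alpha) band for the level set at lam."""
    pairs = [smallest_deltas(p_hat, pb, lam) for pb in boot_estimates]
    du = sorted(d for d, _ in pairs)
    dl = sorted(d for _, d in pairs)
    idx = min(len(pairs) - 1, int((1 - alpha / 2) * len(pairs)))
    return du[idx], dl[idx]
```

In practice this would be run over every level of p̂ and the coverage level inflated until the simultaneous condition above holds.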
3.2 Constructing the cluster tree

After finding δu, δl for all λ, we incorporate the bootstrap confidence sets into the cluster tree construction by only allowing splits at heights for which the corresponding bootstrap confidence set (Ll(λ; p̂), Lu(λ; p̂)) gives strong evidence of a split. We use a similar recursive procedure as described in Section 2. The root node represents the entire support of p̂ and is associated with density level λ(N) = 0. To determine the daughters of a node, we find the lowest level λd for which (a) Ll(λd; p̂) ∩ D(N) has two or more connected components that (b) are disconnected in Lu(λd; p̂) ∩ D(N). Condition (a) indicates …

Figure 6: (a) 95% confidence cluster tree; (b) cluster assignments; (c) partitioned feature space

Clustering with Confidence (CWC) using α = 0.05 generates the cluster tree and data/feature space partitions seen in Figure 6. The cluster tree's significant splits have identified the four original groups (ARI = 1). No other smaller clusters (or modal artifacts) are found. Note that the split heights are higher than the …
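The split test in this recursive procedure reduces to a connected-components computation over bins. A minimal sketch, assuming p̂ is stored as a dict from 2-D bin coordinates to heights and using 4-neighbor adjacency (hypothetical helper names, not the authors' code):

```python
from collections import deque

def components(bins):
    """Connected components of a set of (row, col) bins, 4-neighbor adjacency."""
    bins = set(bins)
    comps = []
    while bins:
        seed = bins.pop()
        comp, frontier = {seed}, deque([seed])
        while frontier:
            r, c = frontier.popleft()
            for nb in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if nb in bins:
                    bins.remove(nb)
                    comp.add(nb)
                    frontier.append(nb)
        comps.append(comp)
    return comps

def significant_split(node_bins, p_hat, lam, delta_u, delta_l):
    """Strong evidence of a split of node core D(N) = node_bins at level lam:
    (a) the lower confidence set restricted to the node has two or more
    connected components that (b) lie in distinct components of the upper
    confidence set."""
    lower = {b for b in node_bins if p_hat[b] >= lam + delta_l}
    upper = {b for b in node_bins if p_hat[b] >= lam - delta_u}
    lower_comps = components(lower)
    if len(lower_comps) < 2:
        return False
    upper_comps = components(upper)
    # map each lower component to the upper component containing it
    homes = {next(i for i, uc in enumerate(upper_comps) if next(iter(comp)) in uc)
             for comp in lower_comps}
    return len(homes) >= 2
```

Two modes separated in the lower set but bridged in the upper set map to a single upper component, so no split is declared.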
4 Examples

The algorithms presented could be used for any number of dimensions, but are more tractable for lower dimensions.

In educational research, a fundamental goal is identifying whether or not students have mastered the skills …
Figure 9: (a) 10% confidence cluster tree; (b) cluster assignments

… identify skill set profiles in which we are at least 10% confident. Figure 9 shows the subsequent cluster tree and assignments. At 10% confidence, we have only three skill set profiles, each identified only by their Evaluate Function mastery level. The more well-separated skill set profiles are identified earlier.

… that have different levels across channels. A common … motivated subgroup identification (see Lo, Brinkman, Gottardo 2008 for one proposed method).

Figure 10a shows 1545 flow cytometry measurements on two fluorescence markers applied to Rituximab, a therapeutic monoclonal antibody, in a drug-screening project designed to identify agents to enhance its antilymphoma activity (Gottardo, Lo 2008). Cells were stained, following culture, with the agents anti-BrdU and the DNA binding dye 7-AAD.

We construct a cluster tree (cross-validated BKDE bandwidth of 21.834) and plot the cluster assignments indicating whether the observations are fluff or part of a cluster core (Figures 10b,c). The cluster tree has 13 leaves (8 clusters, 5 modal artifacts). The cores …

Figure 11: (a) K-means assignment (k = 10); (b) Model-based clustering assignment (k = 7)

Figure 11a shows a ten-cluster k-means solution similarly chosen as before. While many of the smaller clusters are found by both k-means and the cluster tree, k-means splits the large cluster in Figure 10c into several smaller clusters. Model-based clustering chooses a seven-cluster solution (Figure 11b). We again see overlapping clusters in the lower left quadrant; this solution would not be viable in this application. Linkage methods again provided unclear solutions.
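The Adjusted Rand Index used throughout these comparisons (Hubert, Arabie 1985) can be computed from the pair-counting contingency table. A minimal sketch, without guards for degenerate partitions (e.g., a single cluster):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """ARI via the pair-counting formula: expected value 0 under random
    partitioning, maximum value 1 for identical partitions."""
    n = len(labels_a)
    cont = Counter(zip(labels_a, labels_b))            # contingency counts
    sum_ij = sum(comb(c, 2) for c in cont.values())    # co-clustered pairs
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Because the index is adjusted for chance, relabeling the clusters leaves it unchanged; identical partitions under different label names still score 1.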
We use CWC to construct a confidence cluster tree for α = 0.10; we are at least 90% confident in the generated clusters (Figure 12). All modal artifacts have been removed, and the smaller clusters have merged into two larger clusters with cores at (200, 200) and (700, 300). Note that the right cluster is a combination of the two high 7-AAD/low anti-BrdU clusters from Figure 10c. CWC did not find enough evidence to warrant splitting this larger cluster into subgroups.

Figure 12: (b) 90% cluster tree with two leaves; (c) cluster assignments: fluff = x; core = …

5 Summary and future work

We have presented a plug-in method for estimating the cluster tree of a density that takes advantage of the ability to exactly compute the level sets of a piecewise constant density estimate. The approach shows flexibility in finding clusters of unequal sizes and shapes. However, the cluster tree is dependent on the (inherently noisy) density estimate. We introduced clustering with confidence, an automatic pruning procedure that assesses the significance of splits in the cluster tree; the only input needed is the desired confidence level.

These procedures may become computationally intractable as the number of adjacent bins grows with the dimension and are realistically usable only in lower dimensions. One high-dimensional approach would be to employ projection or dimension reduction techniques prior to cluster tree estimation. We also have developed a graph-based approach that approximates the cluster tree in high dimensions. Clustering with Confidence could then be applied to the resulting graph to identify significant clusters.

References

Ayers, E., Nugent, R., and Dean, N. (2008) Skill Set Profile Clustering Based on Student Capability Vectors Computed From Online Tutoring Data. Proc. of the International Conference on Educational Data Mining (peer-reviewed). To appear.

Buja, A. (2002) Personal Communication.

Buja, A. and Rolke, W. Calibration for Simultaneity: (Re)Sampling Methods for Simultaneous Inference with Applications to Function Estimation and Functional Data. In revision.

Cuevas, A., Febrero, M., and Fraiman, R. (2000) Estimating the Number of Clusters. The Canadian Journal of Statistics, 28:367-382.

Cuevas, A., Febrero, M., and Fraiman, R. (2001) Cluster Analysis: A Further Approach Based on Density Estimation. Computational Statistics & Data Analysis, 36:441-459.

Fraley, C. and Raftery, A. (1998) How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis. The Computer Journal, 41:578-588.

Gottardo, R. and Lo, K. (2008) flowClust Bioconductor package.

Hall, P. and Wand, M. P. (1996) On the Accuracy of Binned Kernel Density Estimators. Journal of Multivariate Analysis.

Hartigan, J. A. (1981) Consistency of Single Linkage for High-Density Clusters. Journal of the American Statistical Association, 76:388-394.

Hartigan, J. A. (1985) Statistical Theory in Clustering. Journal of Classification, 2:63-76.

Hartigan, J. A. and Wong, M. A. (1979) A K-means Clustering Algorithm. Applied Statistics, 28:100-108.

Heffernan, N. T., Koedinger, K. R., and Junker, B. W. (2001) Using Web-Based Cognitive Assessment Systems for Predicting Student Performance on State Exams. Research proposal to the Institute of Educational Statistics, US Department of Education. Department of Computer Science at Worcester Polytechnic Institute, Massachusetts.

Hubert, L. and Arabie, P. (1985) Comparing Partitions. Journal of Classification, 2:193-218.

Klemela, J. (2004) Visualization of Multivariate Density Estimates with Level Set Trees. Journal of Computational and Graphical Statistics, 13:599-620.

Lo, K., Brinkman, R., and Gottardo, R. (2008) Automated Gating of Flow Cytometry Data via Robust Model-Based Clustering. Cytometry, Part A, 73A:321-332.

Mardia, K., Kent, J. T., and Bibby, J. M. (1979) Multivariate Analysis. Academic Press.

McLachlan, G. J. and Basford, K. E. (1988) Mixture Models: Inference and Applications to Clustering. Marcel Dekker.

Sedgewick, R. (2002) Algorithms in C, Part 5: Graph Algorithms, 3rd Ed. Addison-Wesley.

Stuetzle, W. (2003) Estimating the Cluster Tree of a Density by Analyzing the Minimal Spanning Tree of a Sample. Journal of Classification, 20:25-47.

Walther, G. (1997) Granulometric Smoothing. The Annals of Statistics, 25:2273-2299.

Wand, M. P. and Jones, M. C. (1995) Kernel Smoothing. Chapman & Hall.

Wishart, D. (1969) Mode Analysis: A Generalization of Nearest Neighbor which Reduces Chaining Effect. Numerical Taxonomy, ed. A. J. Cole, Academic Press, 282-311.