
Clustering methods for the analysis of DNA microarray data

Robert Tibshirani, Trevor Hastie, Mike Eisen,
Doug Ross, David Botstein and Pat Brown

Department of Health Research and Policy, Department of Statistics,
Department of Genetics and Department of Biochemistry,
Stanford University
October 15, 1999

Abstract
It is now possible to simultaneously measure the expression of thousands
of genes during cellular differentiation and response, through the use of
DNA microarrays. A major statistical task is to understand the structure
in the data that arise from this technology. In this paper we review
various methods of clustering, and illustrate how they can be used to
arrange both the genes and cell lines from a set of DNA microarray
experiments. The methods discussed are global clustering techniques
including hierarchical, K-means, and block clustering, and tree-structured
vector quantization. Finally, we propose a new method for identifying
structure in subsets of both genes and cell lines that are potentially
obscured by the global clustering approaches.

1 Introduction
DNA microarrays and other high-throughput methods for analyzing complex
nucleic acid samples now make it possible to measure rapidly, efficiently and
accurately the levels of virtually all genes expressed in a biological sample.
The application of such methods in diverse experimental settings generates
results rich in information. However, the process of transforming this
information into meaningful biological insights is impeded by the complexity
and vastness of the data. One way to overcome this obstacle is exemplified
in recent analyses of genome-scale expression time series (Eisen, Spellman,
Brown & Botstein (1998), Tamayo, Slonim, Mesirov, Zhu, Kitareewan &
Dmitrovsky (1999), Iyer, Eisen, Ross, Schuler, Moore, Lee, Trent, Hudson,
Boguski, Lashkari, Botstein & Brown (1999), Chu, Eisen, Mulholland,
Botstein, Brown & Herskowitz (1998), Spellman, Sherlock, Iyer, Zhang, Anders,
Eisen, Brown & Botstein (1998), Roth, Estep, & Church (1998)), where
statistical clustering methods were used to organize the data by identifying
groups of genes with similar behavior across time. Such organizational
frameworks greatly facilitate the process of exploring these complex sets of
biological data (Botstein & Brown (1999)). In this paper we discuss the
logical extension of these methods to expression data from collections of
discrete samples, where it is useful to uncover relationships among samples
as well as genes, and illustrate the properties of various methods using gene
expression data from sixty human tumor cell lines [Ross, 1999 to be added].
We first describe the application of one-dimensional clustering methods to
both the gene and sample dimensions. We then describe a new implementation
of two-way clustering. Finally, we propose methods for identifying structure
in subsets of both axes that are potentially obscured by global clustering
approaches.

2 Clustering techniques
The data from a microarray experiment form a matrix, where the rows are
different genes and the columns are different cell lines. In some experiments
the samples are different cell lines from different people, and we assume that
here. In other experiments the samples are a time series of measurements
during different phases of cell development.
Recently some authors have explored the use of clustering methods to
arrange the genes in some natural order, with similar genes placed close
together. Good general references on clustering are Everitt (1980), Kaufman
& Rousseeuw (1990) and Gordon (1999). There are two major approaches to
clustering: bottom-up and top-down. Hierarchical clustering (e.g. Sokal &
Mitchener (1958)) is a bottom-up clustering method that starts with each
observation (gene) in its own cluster. It works by agglomerating the closest
pair of clusters at each stage, successively combining clusters until all of
the data are in one cluster. The clustering sequence is represented by a
hierarchical tree (the "dendrogram"), which can be cut at any level to yield
a specified number of clusters. Eisen et al. (1998) apply this kind of
clustering to DNA microarray data.
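
The bottom-up procedure just described can be sketched in a few lines. This is a minimal illustration, not the implementation used by Eisen et al. (1998): it assumes Euclidean distance with average linkage, uses a naive search over all cluster pairs, and the function names (`average_linkage`, `cut`, `leaf_order`) are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(rows):
    """Agglomerate rows bottom-up and return the list of merges.

    Each merge is (members_a, members_b, distance): two tuples of original
    row indices and the average pairwise distance between them when joined.
    """
    clusters = [(i,) for i in range(len(rows))]
    merges = []
    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                a, b = clusters[p], clusters[q]
                d = sum(euclidean(rows[i], rows[j]) for i in a for j in b)
                d /= len(a) * len(b)
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        a, b = clusters[p], clusters[q]
        merges.append((a, b, d))
        del clusters[q]          # q > p, so index p stays valid
        clusters[p] = a + b      # concatenation keeps branches uncrossed
    return merges


def cut(merges, n_rows, k):
    """Apply only the first n_rows - k merges, leaving k clusters."""
    clusters = {(i,) for i in range(n_rows)}
    for a, b, _ in merges[: n_rows - k]:
        clusters.discard(a)
        clusters.discard(b)
        clusters.add(a + b)
    return sorted(sorted(c) for c in clusters)


def leaf_order(merges):
    """One (non-unique) ordering of the rows along the leaves of the tree."""
    a, b, _ = merges[-1]
    return list(a + b)
```

Cutting the tree corresponds to stopping the merge sequence early, and the leaf order is simply the order in which members were concatenated at the final merge.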
Top-down clustering starts with a specified number of clusters and initial
positions for the cluster centers. The K-means (or Lloyd's) algorithm (Lloyd
(1957), MacQueen (1967)) is used to reposition the cluster centers through
the following steps: (a) observations are assigned to the closest cluster
center to form a partition of the data, (b) the observations in each cluster
are averaged to produce new values for the center vector of that cluster.
Steps (a) and (b) are iterated, and the process converges to a local minimum
of the total within cluster variance. Typically the K-means procedure is
repeated with a number of initial values for the cluster centers, and the
best solution (in terms of total within cluster variance) is chosen.
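
Steps (a) and (b), together with the repeated random starts, might look as follows. This is a sketch under stated assumptions: initial centers are drawn at random from the observations, an empty cluster keeps its previous center, and the names `kmeans_once` and `kmeans` are illustrative only.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_once(rows, k, rng, max_iter=100):
    """One run of Lloyd's iteration from randomly chosen starting centers."""
    centers = [list(r) for r in rng.sample(rows, k)]
    labels = None
    for _ in range(max_iter):
        # (a) assign each observation to its closest center
        new_labels = [min(range(k), key=lambda c: sq_dist(r, centers[c]))
                      for r in rows]
        if new_labels == labels:      # partition unchanged: converged
            break
        labels = new_labels
        # (b) replace each center by the mean of its cluster
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:               # assumption: empty cluster keeps its center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    wss = sum(sq_dist(r, centers[lab]) for r, lab in zip(rows, labels))
    return wss, labels, centers


def kmeans(rows, k, n_starts=10, seed=0):
    """Repeat K-means from several starts; keep the smallest within-cluster SS."""
    rng = random.Random(seed)
    runs = (kmeans_once(rows, k, rng) for _ in range(n_starts))
    return min(runs, key=lambda run: run[0])
```

Each run only reaches a local minimum, which is why the best of several starts is retained.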
Tree-structured vector quantization (TSVQ) carries out K-means clustering
in a top-down, binary manner (Gersho & Gray (1992), Perlmutter,
Cosman, Olshen, Gray, Li & Bergin (1998)). It is commonly used in image
and signal compression.
Principal components analysis (e.g. Mardia, Kent & Bibby (1979)), when
applied to the genes, finds the linear combinations of genes having the
highest variance. Similarly, when applied to cell lines, it finds the highest
variance linear combination of the cell lines. The correlation of each gene
with the leading principal component provides a way of sorting (clustering)
the genes, and similarly for the cell lines.
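
As a rough sketch of sorting genes by the leading principal component, the following computes the leading right singular vector of the row-centered matrix by power iteration and orders the rows by their inner product with it. The starting vector (the row of largest norm), the iteration count, and the function names are assumptions made for illustration; the sign of the component is arbitrary, so the ordering may come out reversed.

```python
import math


def center_rows(X):
    """Subtract each row's mean from that row."""
    out = []
    for r in X:
        m = sum(r) / len(r)
        out.append([x - m for x in r])
    return out


def leading_component(X, n_iter=500):
    """Leading right singular vector of X (rows = genes) by power iteration."""
    # Assumed start: the row with the largest norm, which lies in the row space.
    v = list(max(X, key=lambda r: sum(x * x for x in r)))
    for _ in range(n_iter):
        scores = [sum(x * y for x, y in zip(r, v)) for r in X]        # X v
        w = [sum(s * r[j] for s, r in zip(scores, X))                 # X^T (X v)
             for j in range(len(v))]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:               # X is identically zero after centering
            break
        v = [x / norm for x in w]
    return v


def order_by_component(X):
    """Row indices sorted from smallest to largest score on the component."""
    Xc = center_rows(X)
    v = leading_component(Xc)
    scores = [sum(x * y for x, y in zip(r, v)) for r in Xc]
    return sorted(range(len(X)), key=lambda i: scores[i]), v
```

Applying the same function to the transposed matrix gives the corresponding ordering of the cell lines.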
The self-organizing map (SOM) (Kohonen (1989)) is similar to K-means
clustering, with the additional constraint that the cluster centers are
restricted to lie in a one- or two-dimensional manifold. An online procedure
is used to readjust the positions of the centers. There is a similarity
between SOMs, multi-dimensional scaling and nonlinear principal components.
See Ripley (1995) and Cherkassky & Mulier (1998) for more details. This
method was used successfully for DNA microarray data by Tamayo et al. (1999).
We have found that K-means clustering produces tighter clusters than
hierarchical clustering, but the latter tends to produce a greater number of
smaller clusters, which can be a valuable feature for discovery. Unlike
K-means, hierarchical clustering also produces an ordering of the objects
(see below) which can be informative for data display. SOMs allow
interpretation of the clusters, but should be checked against K-means
clustering to see if the low-dimensional representation for the cluster
centers is a reasonable assumption for the data.
All of these methods are one-way clustering techniques. In this paper
we investigate the use of two-way clustering, to simultaneously cluster both
genes and cell lines. One simple approach to this problem is to apply a
one-way clustering method to the genes and cell lines separately, and we do
this below. Block clustering, in contrast, uses both gene and cell line
information to simultaneously cluster both.
The two-way clustering procedures seek a global organization of genes and
cell lines. We find that they are able to discover gross global structure but
may not be effective for discovering fine detail. In response to this finding,
we propose a new method called "gene shaving" which searches for sets of genes
that optimally separate the cell lines.

3 Materials and methods


Data and preprocessing. Our data take the form of an $m \times n$ matrix of
real-valued expression levels $Y = (y_{ij})$, where genes are the rows and
samples are the columns.
Two way clustering. We investigate five different methods for two-way
clustering. The first three methods cluster and reorder the rows and columns
of the data matrix separately from one another.

a) Two-way hierarchical clustering. We use average linkage, Euclidean
   distance-based hierarchical clustering on the rows and columns
   separately (see e.g. Hartigan (1973)). This also produces a (non-unique)
   ordering of the objects, one that ensures that the branches of the
   corresponding dendrogram do not cross. We reorder the rows and columns
   according to these orderings, and display the resulting data matrix.

b) Two-way K-means clustering. As in (a), we cluster the rows and
   columns separately. We use 200 clusters for the genes and 20 for the cell
   lines, and then display the rows and columns ordered within cluster by
   multi-dimensional scaling, and between clusters by multi-dimensional
   scaling of the cluster centers.

c) Two-way tree-structured vector quantization (TSVQ). This procedure
   is K-means clustering, performed in a top-down, binary tree
   fashion (Gersho & Gray (1992), Perlmutter et al. (1998)). Two-means
   clustering is performed at each tree node, and the best node is
   successively split until the specified number of clusters is obtained. An
   advantage over simple K-means is that an ordering of the objects can
   be obtained from the leaves of the tree.

d) Principal components / singular value decomposition. Here we
   compute a singular value decomposition of the data matrix. The leading
   left and right singular vectors are the first principal components
   of the genes and cell lines respectively. We then sort the genes from
   smallest to largest inner product with the first principal component of
   the genes, and similarly for the cell lines.

e) Block clustering. This is a top-down, row and column clustering of
   a data matrix. It reorders the rows and columns to produce a matrix
   with homogeneous blocks of the outcome (here gene expression).
   Block clustering also produces hierarchical clustering trees for the rows
   and columns. The basic algorithm for forward block splitting is due to
   Hartigan (1972); we have added a backward pruning procedure and
   devised a permutation-based method for deciding on the optimal number
   of blocks. Hartigan (1972) reviews earlier work on two-way clustering,
   citing Good (1965) and Tryon & Bailey (1970). Hartigan called his
   approach "direct clustering", but it has become known as block clustering
   (e.g. Duffy & Quiroz (1991)).

Here is an outline of the block clustering procedure:

• Begin with the entire data in one block.

• At each stage find the row or column split of all existing blocks into
  two pieces, choosing the one that produces the largest reduction in the
  total within block variance.

• Allowable splits: if there are existing row splits that intersect the
  block, one of these must be used for the rows, called a "fixed split". The
  same is done for columns. Otherwise all split points are tried.

• The splitting is continued until a large number of blocks are obtained,
  and then some blocks are recombined until the optimal number of blocks
  is obtained (see discussion of this point below).

To find the best split into two groups, one can show that it is sufficient
to sort the rows (or columns) by row (resp. column) mean, and then seek a
split in that order. A drawback of block clustering when applied to median
centered data (the case here) is that at the start, all row and column means
are approximately zero. Hence the procedure has difficulty getting started.
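
The sorting argument can be illustrated with a small sketch for row splits of a single block. It assumes a complete block (every row has the same number of columns), in which case the reduction in within-block sum of squares from a two-group row split equals the between-group sum of squares of the row means, weighted by the number of columns. The function name `best_row_split` is hypothetical.

```python
def best_row_split(block):
    """Best two-group row split of a block, searched in row-mean order.

    Returns (reduction, group_a, group_b), where reduction is the drop in
    within-block sum of squares and the groups are lists of row indices.
    """
    n_cols = len(block[0])
    means = [sum(r) / n_cols for r in block]
    order = sorted(range(len(block)), key=lambda i: means[i])
    total = sum(means)
    grand = total / len(block)
    best = (0.0, order, [])           # no split: zero reduction
    running = 0.0
    for cut in range(1, len(order)):
        running += means[order[cut - 1]]
        n_a, n_b = cut, len(order) - cut
        mean_a = running / n_a
        mean_b = (total - running) / n_b
        # Between-group sum of squares, one term per cell in each group.
        reduction = n_cols * (n_a * (mean_a - grand) ** 2
                              + n_b * (mean_b - grand) ** 2)
        if reduction > best[0]:
            best = (reduction, order[:cut], order[cut:])
    return best
```

Only the `len(block) - 1` cut points in sorted order are examined, rather than every possible subset of rows. When all row means are nearly equal, as with median-centered data, every candidate reduction is close to zero, which is the difficulty noted above.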
Restricting the splits to fixed splits ensures that (a) the overall
partition can be displayed as a contiguous representation, with a common
re-ordering for the rows and columns, and (b) the partitions of each of the
rows and columns can be described by hierarchical trees.
Figure 1 shows a simple example for illustration. There are 5 genes and
3 cell lines, labelled 1-5 and 1-3 respectively. The first (vertical) split
separates cell line 3 from 1 and 2. The second (horizontal) split separates
genes 2 and 3 from 1,4,5. Now consider splitting the rightmost box. The split
that separates genes 1 and 2 from 3,4,5 in the right box would not allow a
single contiguous representation of the entire data matrix, and hence is not
permitted. The split that separates gene 2 from 1,3,4,5 violates property (b)
above and is also not permitted. The only permissible horizontal split of the
rightmost box is the one that separates genes 2 and 3 from 1,4,5, continuing
the horizontal line segment in the left box all the way to the right.
The contiguity property (a) is most important to preserve. It is however
possible to relax (b), allowing splits such as 2 vs 3,5,4,1 in the right box.

Figure 1: Simple example to illustrate the block clustering rules. The first
(vertical) split separates cell line 3 from 1 and 2. The second (horizontal)
split separates genes 2 and 3 from 1,4,5. If the rightmost box is split
horizontally, it must be split between genes 2,3 and 1,4,5. (Axes: genes,
cell lines.)

Stopping rule for splitting blocks. For all clustering techniques, estimation
of the appropriate number of clusters is an important but difficult problem.
Clustering algorithms will find clusters even when applied to independent
(unclustered) data, so it is important to calibrate them. Milligan & Cooper
(1985) compare many of the suggested approaches to the problem, for one-way
clustering. For block clustering, Duffy & Quiroz (1991) suggest the use
of permutation tests to determine when a given block split is not
significant. However this can lead to early stopping of the splitting
process, which can miss good block splits later.
Instead, our strategy is to split into some large number of blocks $M$, and
then apply weakest link pruning (recombining) of the blocks to produce a
series of partitions having different numbers of blocks (between 1 and $M$).
Then we apply the algorithm to permuted versions of the data, to estimate
the best number of blocks $k \le M$. Here is a summary of what we call the
"maximum gap" approach:

1. Let $\mathrm{rss}_k$ be the total within block sum of squares, when $k$
   clusters are used.

2. Create a new dataset by separately permuting the elements within each
   row of the data matrix, thereby forming a new data matrix. Apply
   block clustering to the permuted data, and let $\mathrm{rss}^0_k$ be the
   resulting within cluster sum of squares. Do this for a number of permuted
   datasets (say 10) and compute the average $\mathrm{ave}(\mathrm{rss}^0_k)$.

3. Compute the gap function

   $$\mathrm{gap}(k) = \mathrm{ave}(\mathrm{rss}^0_k) - \mathrm{rss}_k \tag{1}$$

   and finally choose the value of $k$ that maximizes $\mathrm{gap}(k)$.

The idea is that the optimal number of blocks is the value for which the
drop in residual sum of squares, relative to what we expect from permuted
data, is largest. The same idea can be used to estimate the optimal number
of clusters in hierarchical or K-means clustering. In that case, if rows were
being clustered, we would permute the elements within each row of the data
matrix (and similarly for clustering columns).
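
The maximum gap computation itself is independent of the clustering method, and can be sketched as follows. Here `rss` is a caller-supplied function (a hypothetical interface, not part of the original method description) that clusters a data matrix into k groups and returns the within sum of squares; `n_perm`, `seed`, and the function names are likewise illustrative.

```python
import random


def permute_within_rows(data, rng):
    """Copy of data with the entries of each row shuffled independently."""
    out = []
    for row in data:
        row = list(row)
        rng.shuffle(row)
        out.append(row)
    return out


def maximum_gap(data, rss, ks, n_perm=10, seed=0):
    """Return (best_k, gaps), where gaps[k] = ave(rss0_k) - rss_k.

    rss(data, k) must return the within sum of squares obtained when the
    chosen clustering method is asked for k clusters (assumed interface).
    """
    rng = random.Random(seed)
    permuted = [permute_within_rows(data, rng) for _ in range(n_perm)]
    gaps = {}
    for k in ks:
        null_average = sum(rss(p, k) for p in permuted) / n_perm
        gaps[k] = null_average - rss(data, k)
    best_k = max(ks, key=lambda k: gaps[k])
    return best_k, gaps
```

The same permuted datasets are reused for every k, so that the gap values are compared against a common reference.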

f) Gene shaving
The two-way clustering methods seek a single re-ordering of the cell lines
for all genes. However a more complex pattern may exist. In particular, one
set of genes might cluster the cell lines in one fashion, and another set of
genes might produce a very different clustering.
Here we describe a method which first finds the linear combination of
genes having maximal variation among the cell lines. We think of this linear
combination as a "super gene". The genes having lowest correlation with
the super gene are then removed ("shaved") from the data, and the process is
continued until the subset of genes contains only one gene. This process
produces a sequence of gene blocks, each containing genes that are similar to
one another and displaying large variance across the cell lines.
The details of the gene shaving procedure are as follows:

1. Start with all of the data. Find the first principal component of the
   genes.

2. For each gene i compute the absolute value of its correlation with the
   first principal component.

3. Remove the fraction of genes having the smallest absolute correlation.

4. Repeat steps 2 and 3 until only one gene remains.

The proportion of genes shaved off at each stage is taken to be 10%.
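
The shaving loop above could be sketched as follows. This is an illustration under several assumptions: the principal component is recomputed on the remaining genes at each pass (by power iteration started from the row of largest norm), at least one gene is removed per pass, and the fraction is rounded down. The function names are hypothetical.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _center(row):
    m = sum(row) / len(row)
    return [x - m for x in row]


def super_gene(rows, n_iter=200):
    """Leading principal component of centered gene rows (power iteration)."""
    v = list(max(rows, key=lambda r: _dot(r, r)))   # assumed start vector
    for _ in range(n_iter):
        scores = [_dot(r, v) for r in rows]                      # X v
        w = [sum(s * r[j] for s, r in zip(scores, rows))         # X^T (X v)
             for j in range(len(v))]
        norm = math.sqrt(_dot(w, w))
        if norm == 0.0:          # all remaining rows are constant
            break
        v = [x / norm for x in w]
    return v


def abs_correlation(row, v):
    """|Pearson correlation| of two centered vectors (0 if either is flat)."""
    denom = math.sqrt(_dot(row, row) * _dot(v, v))
    return abs(_dot(row, v)) / denom if denom > 0 else 0.0


def shave(X, fraction=0.10):
    """Nested lists of gene indices; the last list holds a single gene."""
    rows = [_center(r) for r in X]
    keep = list(range(len(rows)))
    sequence = [keep]
    while len(keep) > 1:
        v = super_gene([rows[i] for i in keep])
        ranked = sorted(keep, key=lambda i: abs_correlation(rows[i], v))
        n_drop = max(1, int(fraction * len(keep)))   # always drop at least one
        n_drop = min(n_drop, len(keep) - 1)          # but keep at least one
        keep = sorted(ranked[n_drop:])
        sequence.append(keep)
    return sequence
```

Each element of the returned sequence is a candidate gene block; choosing among them is a separate step.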


Denote the full set of genes by $G$. If $B$ shaving steps are needed to leave
a single gene, this procedure produces a sequence of nested gene groups
$G \supset G_1 \supset G_2 \supset \cdots \supset G_B$. In order to estimate
the optimal shave size, we can compare the columnwise variance for each group
to that obtained by applying the procedure to permuted data (the "maximum
gap" method), as described above for block clustering, to obtain an optimal
gene group $G_{\hat{b}}$. For illustration here, we have instead chosen a
constant shave size of 10 genes, which is fairly close to the optimal number
found from the gap method.
After isolating this optimal gene group, we compute its vector of column
averages, and then for each gene we remove the component that is correlated
with this average. With this modified data matrix we repeat the above
procedure, obtaining a new gene shave. This is done repeatedly until no
interesting gene shaves can be found.
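
One way to read the step of removing the component correlated with the column average is as a least squares projection: each gene's profile is regressed on the centered column-average vector of the selected group and the fitted part is subtracted. That reading is an assumption, as are the function name and the handling of a constant average vector.

```python
def remove_component(X, gene_group):
    """Remove from every row of X the part explained by the group average.

    gene_group is a list of row indices; the column averages over those
    rows define the direction that is projected out of each gene.
    """
    n = len(X[0])
    avg = [sum(X[i][j] for i in gene_group) / len(gene_group) for j in range(n)]
    mean_avg = sum(avg) / n
    a = [x - mean_avg for x in avg]          # centered column-average vector
    denom = sum(x * x for x in a)
    out = []
    for row in X:
        if denom == 0.0:                     # constant average: nothing to remove
            out.append(list(row))
            continue
        m = sum(row) / n
        beta = sum((x - m) * y for x, y in zip(row, a)) / denom
        out.append([x - beta * y for x, y in zip(row, a)])
    return out
```

After this step every gene has zero covariance with the group average, so a further round of shaving has to find a different pattern across the cell lines.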

The Dataset.
The dataset used in our study has expression measurements on 6830 genes
for a set of 64 human cancer tumors. A full description of these data appears
in [Ross et al 1999]. The row and column medians were set to zero by
alternately subtracting off the median of each column and of each row.
Finally, missing values were set to the value zero.
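
The preprocessing just described might be sketched as below. It assumes missing values are represented by `None` and are skipped when medians are computed, and it uses a fixed number of alternating sweeps; the number of sweeps and the function name are illustrative choices, not taken from the study.

```python
from statistics import median


def median_center(X, n_sweeps=10):
    """Alternately subtract column and row medians, then zero the missing cells."""
    X = [list(r) for r in X]
    for _ in range(n_sweeps):
        # Column step: subtract each column's median over observed entries.
        for j in range(len(X[0])):
            col = [r[j] for r in X if r[j] is not None]
            if col:
                med = median(col)
                for r in X:
                    if r[j] is not None:
                        r[j] -= med
        # Row step: subtract each row's median over observed entries.
        for r in X:
            vals = [x for x in r if x is not None]
            if vals:
                med = median(vals)
                for j, x in enumerate(r):
                    if x is not None:
                        r[j] = x - med
    # Finally, missing values are set to zero.
    return [[0.0 if x is None else x for x in r] for r in X]
```

Because the last operation in each sweep is the row step, the row medians are zero on exit while the column medians are only approximately zero in general.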

4 Results
Two-way clustering
Figures 2-7 show the clustering results for the human tumor data.
K-means clustering performs poorly, probably because it does not give an
ordering of the clustered objects. In the figure we have used
multidimensional scaling to order the objects within each cluster, and to
order the centroids of each cluster. TSVQ fixes this problem, and gives a
similar picture to hierarchical clustering. Both TSVQ and hierarchical
clustering have successfully organized the genes and cell lines to produce
some visible structure. Block clustering probably does the best job of
discovering contiguous blocks of gene expression.
Two of the cell lines have two replicates in the dataset, indicated by
the suffix "repro". An effective clustering technique should place replicates
near one another. The figures show that this occurs for hierarchical,
principal component, TSVQ and partly for block clustering. K-means clustering
fails in this regard.
Block clustering also gives a one-way clustering of the cell lines, and a
one-way clustering of the genes. In Figure 7, the cell lines are partitioned
into 9 groups, by the vertical lines in the diagram. This partition is
hierarchical, meaning that for any two subpartitions the first is contained
in the second, or vice versa. Examination of the genes in Figure 7
corresponding to the green block in the bottom left, and the red block in
the middle left, reveals a number that are known to be characteristically up
or down regulated in leukemia. Also included are unregulated ring 3 proteins,
and cytoskeletal proteins. The presence of a breast cell line clustered with
the leukemias is somewhat surprising, and is also seen with some of the other
clustering techniques. However, it is difficult to extract fine gene-cell
line interactions from block clustering or any of the other global
clustering schemes.

Figure 2: Human tumor data, with genes in the rows and cell lines in the
columns. The order of the rows and columns was randomly chosen. (Panel
title: raw data.)

Figure 3: Clustering for human tumor data. Shown is the result of reordering
rows and columns, from hierarchical clustering applied separately to each.
(Panel title: two way hierarchical clustering.)

Figure 4: Clustering for human tumor data. Shown is the result of K-means
clustering, applied separately to rows and columns. (Panel title: Two-way
k-means.)

Figure 5: Clustering for human tumor data. Shown is the result of
tree-structured vector quantization, applied separately to rows and columns.
(Panel title: two-way TSVQ.)

Figure 6: Clustering for human tumor data. Here the rows and columns are
ordered with respect to their inner product with the first principal
component. (Panel title: Largest principal component.)

Figure 7: Clustering for human tumor data. Result from block clustering:
rows and columns have been rearranged, and some contiguous blocks are
visible. (Panel title: block clustering.)

Gene shaving
The first three blocks of genes from the gene shaving process are shown
in Figures 8-10. The variance of the column means of gene expression is
indicated in the heading. Some clear separation of the cell lines is visible.
Although the cancer classes were not used in the shaving process, the
resulting orderings are quite successful at grouping together some of the
classes.
The gene names shown at the left of each rectangle are internal codes.
Most of the genes are uncharacterized, illustrating the potential for this
technique to discover new patterns of expression.
The full gene names are:

Block 1:

1. "357775" "SIDW357775, Human nuclear orphan receptor LXR-alpha mRNA, complete cds [5':W95560, 3':W95433]"

2. "512287" "SID512287, Human neuronal pentraxin 1 (NPTX1) mRNA, complete cds [5':AA057692, 3':AA057694]"

3. "359412" "SIDW359412, Cyclin D2 [5':AA011227, 3':AA010487]"

4. "376178" "SIDW376178, Human 5'-AMP-activated protein kinase, gamma-1 subunit mRNA, complete cd [5':AA040683, 3':AA040600]"

5. "136798" "FN1 Fibronectin 1 Chr.2 [136798, (IEW), 5':R36450, 3':R36451]"

6. "359396" "SIDW359396, Human cGMP-stimulated 3',5'-cyclic nucleotide phosphodiesterase PDE2A3 (PDE2A) mRNA, complete cd [5':AA010496, 3'

7. "376052" "SIDW376052, Human nucleotide-binding protein mRNA, complete cds [5':AA039305, 3':AA039353]"

8. "151144" "FN1 Fibronectin 1 Chr.2 [151144, (EW), 5':H03906, 3':H03907]"

9. "324037" "SIDW324037, Homo sapiens clone 24590 mRNA sequence [5':W46518, 3':W46450]"

Block 2:

1. "50250" "ESTs Chr.9 [50250, (R), 5':H17799, 3':H17800]"

2. "512355" "SID512355, ESTs, Highly similar to SRC SUBSTRATE P80/85 PROTEINS [Gallus gallus] [5':AA059424, 3':AA057835]"

Block 3:

1. "241935" "SPP1 Secreted phosphoprotein 1 (osteopontin, bone sialoprotein I) Chr.4 [241935, (EW), 5':H93913, 3':H93048]"

2. "363981" "SPP1 Secreted phosphoprotein 1 (osteopontin, bone sialoprotein I) Chr.4 [363981, (EW), 5':AA021511, 3':AA021512]"

The genes in the first block are related to stromal cells, and tend to
separate the tissue tumors from blood cancers. The second block of genes is
uncharacterized. The third block consists of secreted phosphoprotein genes,
and produces a different separation of the stromal cancers than the first
block of genes. This illustrates the potential for this technique to discover
new patterns of expression.

5 Discussion
We have investigated the use of two-way clustering methods for DNA microarray
data. Some of the methods are successful for discovering contiguous areas
of high or low gene expression, including hierarchical clustering, TSVQ, and
block clustering. We have introduced the "maximum gap" diagnostic for
protection against finding spurious structure.
There are close connections between block clustering and the classification
and regression tree algorithm (CART) of Breiman, Friedman, Olshen & Stone
(1984). Block clustering is very similar to CART with splits on 2 categorical
predictors (genes and cell lines), and the pruning algorithm is the same as
that in CART. What is different is the restriction to fixed splits and the
use of permutations to estimate the optimal number of splits.

Figure 8: First gene block from gene shaving process. (Panel heading:
variance = 4.37. Row labels: 1071X, 3414X, 3397X, 4751X, 2808X, 2492X, 3281X,
5037X, 200X.)

Figure 9: Second gene block from gene shaving process. (Panel heading:
variance = 3.007. Row labels: 6293X, 3004X, 4344X, 2453X, 1082X, 2016X, 802X,
118X, 502X.)

Figure 10: Third gene block from gene shaving process. (Panel heading:
variance = 11.344. Row labels: 4325X, 263X.)

By seeking a single global organization of the data, the two-way clustering
procedures are limited in their ability to discover fine structure. The gene
shaving method, introduced here, looks for blocks of genes that produce
different separations of the cell lines, and the initial results look very
promising. There are many interesting modifications of this procedure. For
example any aspect of the data can be used to direct the shaving process. If
class labels are available for the cell lines (tumor types in our example),
the shaving can be supervised by these labels. The procedure then tries to
find subsets of genes that separate the classes as well as possible. Details
will be given in a forthcoming paper.

Acknowledgments: We would like to thank Andreas Buja for pointing
us to the work of Hartigan, and Duffy and Quiroz on block clustering.

References
Botstein, D. & Brown, P. (1999), `Exploring the new world of the genome
with DNA microarrays', Nature Genetics (Supp.) 21, 33-37.

Breiman, L., Friedman, J., Olshen, R. & Stone, C. (1984), Classification and
Regression Trees, Wadsworth.

Cherkassky, V. & Mulier, F. (1998), Learning from data, Wiley.

Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P. O.
& Herskowitz, I. (1998), `The transcriptional program of sporulation in
budding yeast', Science 282, 699-705.

Duffy, D. & Quiroz, A. (1991), `A permutation-based algorithm for block
clustering', J. of Classification 8, 65-91.

Eisen, M., Spellman, P., Brown, P. & Botstein, D. (1998), `Cluster analysis
and display of genome-wide expression patterns', Proc. Nat. Acad. Sci.
95, 14863-14868.

Everitt, B. (1980), Cluster analysis, Halstead, New York.

Gersho, A. & Gray, R. M. (1992), Vector Quantization and Signal
Compression, Kluwer Academic Publishers.

Good, I. (1965), Categorization of classification, in `Mathematics and
Computer Science in Biology and Medicine', Her Majesty's Stationery Office,
London.

Gordon, A. (1999), Classification (2nd edition), Chapman and Hall/CRC
Press, London.

Hartigan, J. (1972), `Direct clustering of a data matrix', J. Amer. Statis.
Assoc. 67, 123-129.

Hartigan, J. (1973), Clustering algorithms, Wiley, New York.

Iyer, V. R., Eisen, M. B., Ross, D. R., Schuler, G., Moore, T., Lee, J. C. F.,
Trent, J. M., Hudson, J., Boguski, M., Lashkari, D., Shalon, D.,
Botstein, D. & Brown, P. (1999), `The transcriptional program in the
response of human fibroblasts to serum', Science 283, 83-87.

Kaufman, L. & Rousseeuw, P. (1990), Finding groups in data: an introduction
to cluster analysis, Wiley, New York.

Kohonen, T. (1989), Self-Organization and Associative Memory (3rd edition),
Springer-Verlag, Berlin.

Lloyd, S. (1957), Least squares quantization in PCM, Technical report, Bell
Laboratories. Published in 1982 in IEEE Trans. Inf. Theory 28, 128-137.

MacQueen, J. (1967), Some methods for classification and analysis of
multivariate observations, in `Proceedings of the Fifth Berkeley Symposium
on Mathematical Statistics and Probability, eds L.M. LeCam and J.
Neyman', Univ. of Cal. Press, pp. 281-297.

Mardia, K., Kent, J. & Bibby, J. (1979), Multivariate Analysis, Academic
Press.

Milligan, G. W. & Cooper, M. C. (1985), `An examination of procedures
for determining the number of clusters in a data set', Psychometrika
50, 159-179.

Perlmutter, S., Cosman, P. C., Tseng, C.-W., Olshen, R., Gray, R., Li, K.
& Bergin, C. (1998), `Medical image compression and vector quantization',
Statistical Science 13, 30-53.

Ripley, B. D. (1995), Pattern Recognition and Neural Networks: a Statistical
Approach, Cambridge University Press.

Roth, F. P., Hughes, J., Estep, P. & Church, G. (1998), `Finding DNA
regulatory motifs within unaligned noncoding sequences clustered by
whole genome mRNA quantitation', Nat. Biotechnol. 16, 939-45.

Sokal, R. & Mitchener, C. (1958), `A statistical method for evaluating
systematic relationships', Univ. Kansas Sci. Bull. 38, 1409-1438.

Spellman, P. T., Sherlock, G., Iyer, V. R., Zhang, M., Anders, K., Eisen,
M. B., Brown, P. O., Botstein, D. & Futcher, B. (1998), `Comprehensive
identification of cell cycle-regulated genes of the yeast Saccharomyces
cerevisiae by microarray hybridization', Mol. Biol. Cell 9(12), 3273-3297.

Tamayo, P., Slonim, T., Mesirov, J., Zhu, Q., Kitareewan, S. & Dmitrovsky,
E. (1999), `Interpreting patterns of gene expression with self-organizing
maps: Methods and applications to hematopoietic differentiation', Proc.
Nat. Acad. Sci. 96, 2907-2912.

Tryon, R. & Bailey, D. (1970), Cluster Analysis, McGraw-Hill, New York.
