Preface
Clinical laboratory data are among the most detailed, objective, reliable,
and useful measures of patient characteristics contained in the medical
record. Numerous studies over the past 30 years based on laboratory data
alone and in aggregate with other clinical and experimental data have
revealed correlative and predictive patterns in laboratory data that have
improved our understanding of disease, therapeutic response, and health care
delivery processes. Additional useful patterns undoubtedly remain hidden in
the data, awaiting discovery by creative, prepared minds using effective
analysis techniques.
Some pathologists have recognized this opportunity; over the past 10
years there have been periodic reports in the literature that have used automated pattern recognition and modeling techniques collectively termed
data mining to identify patterns in laboratory data for various purposes.
Unfortunately, these efforts have been relatively few, whereas the use of data
mining techniques in other medical domains has increased dramatically (see
article by Dr. Harrison). There are several reasons that the use of data mining techniques has been inhibited in the laboratory. Data mining is a set of
statistical approaches to data analysis that are relatively technical and that
need to be correctly matched to an analysis task. This specialized knowledge
is generally outside the scope of laboratorians' training. Software tools for
data mining by non-experts have been very expensive and were often a poor
fit for laboratory databases. Most were designed to discover associations
a guide to the types of clinical and research opportunities that will become
available over the next several years.
James H. Harrison, Jr, MD, PhD
Departments of Public Health Sciences and Pathology
University of Virginia
Hospital West Complex 3181
PO Box 800717
Charlottesville, VA 22908-0717, USA
E-mail address: james.harrison@virginia.edu
The progressive increase in the amount of clinical data stored in electronic form is, for the first time in history, making it possible to carry
out large-scale studies that focus on the interaction between genotype, phenotype, and disease at a population level. Such studies have extraordinary
potential to determine the effectiveness of treatment and monitoring strategies, identify subpopulations at risk for disease, define the real variability in
the natural history of disease and comorbidities, discover rational bases
for targeting therapies to particular patients, and determine the incidence
and contexts of unwanted health care outcomes. Matching patient responses (phenotype) with gene expression and known metabolic pathway
relationships across large numbers of individuals may be the best hope for
understanding the complex interplay between multiple genes and the environment that underlies some of the most common and debilitating health
problems [1]. Although serious issues remain to be resolved before the
large-scale secondary use of health data for research can become routine,
this topic has been recognized and identified as a national priority in Canada
[2] and the United States [3]. Clinical laboratory databases contain perhaps
the largest available collection of structured medical data representing human phenotypes of disease progression and response to therapy. Alone
and especially in combination with other clinical and environmental data,
laboratory databases have substantial value for translational research, including correlative studies linking gene expression with phenotype, and
for identifying groups of patients with similar characteristics for follow-up
analysis or inclusion in clinical studies.
Large-scale clinical databases permit targeted observational and correlative studies that complement randomized clinical trials [4,5]. These databases also hold the promise of more comprehensive analyses to reveal
unknown, useful real-world relationships among clinical data. The data volume and comprehensiveness that make these data sets useful, however, also
make them difficult or impossible to analyze by manual or traditional statistical methods. Analogous challenges have occurred previously in other domains, including the need to identify purchasing associations among billions
of retail transactions [6], the need to identify similarities in patterns among
terabytes of geologic data for oil exploration [7], and the need to identify
patterns in planetary mapping data [8], among many other examples. These
needs have been addressed using a set of techniques from the machine learning and pattern recognition fields collectively referred to as data mining.
In recent years, biomedical science has also begun to apply these techniques
to large-scale data analysis, as evidenced by the dramatic increase in biomedical publications referring to data mining over the past 10 years (Fig. 1).
Brief overview of data mining
Data mining has been described as the extraction of implicit, previously
unknown and potentially useful information [9], such as associations and
correlations between data elements, from large repositories of data. It is
the technical and statistical component of the process of knowledge discovery in databases (KDD, Refs. [10,11]), which has a primary goal of identifying useful new information and is sometimes used synonymously with
KDD. Although the data mining label is sometimes also applied to techniques designed to determine whether and to what extent prespecified patterns exist in data sets, those primarily data-querying methods are distinct
Fig. 1. Dramatic increase in articles and reviews in biomedical science mentioning data mining since 1981. The first article appeared in 1984, and single articles also appeared in 1995 and 1996. The first review appeared in 1997. Articles and reviews include clinical research topics and studies using bioinformatics/high-throughput analytic techniques. In 2006 there were 304 articles and 44 reviews mentioning data mining.
genomics data sets. Some foundational topics are also addressed in the later
articles in the setting of specific applications, for example pattern pruning
and false discovery rate evaluation in the articles by Siadaty and Harrison,
and Lee and colleagues, elsewhere in this issue.
Although topics in data mining can often be approached in an intuitive
manner, data mining methodology is based on mathematic and statistical
principles. Effective application of data mining techniques, including effective use of data mining software, requires a reasonable understanding of
these principles. A full introduction to the mathematics of data mining is beyond the scope of this brief review issue, but several references are generally
available for interested readers. Those who find the mathematic discussions
in the articles by Brown, Klee, and Lee and colleagues, elsewhere in this
issue too advanced may wish to start their study of data mining by reviewing
Tan's Introduction to Data Mining [18], which presents data mining topics
intuitively using visualizations, or the initial sections of Dunham's Data
Mining: Introductory and Advanced Topics [19], which takes a bit more
mathematic approach. General reviews introducing data mining topics are
also available from Fayyad and colleagues [10], Hand and colleagues [20],
and Cios and Moore [21].
Special characteristics of medical data
Medical data have characteristics that make them uniquely difficult to analyze in an automated fashion by traditional techniques or by data mining
[21]. Some of these characteristics appear in data sets from other domains,
but medical data seem to combine more problematic and challenging features at once than almost any other type of data.
High dimensionality
High dimensionality means that many different data elements, each representing a dimension that can vary in value, characterize an item of interest,
such as a patient, disease, or specimen. It is not unusual for a patient's medical record to contain 50 or even 100 different types of data elements. With
so many variables across a limited number of comparisons, the likelihood of
patients sharing coincidental patterns is high, and appropriate techniques
must be used to minimize identification of spurious patterns.
Heterogeneity
Medical data may include textual descriptions, various types of images,
and discrete values using multiple scales. Values may be obtained from multiple methods, some of which may produce incompatible results for the same
observation. Data mining requires consistent data, which means that large
volumes of clinical data may need to be transformed to compatible representations. Some of these challenges are further addressed in the articles by Siadaty and Harrison, and Harrison and Aller, elsewhere in this issue.
Imprecision
Unlike retail transactions, which directly reflect a particular purchase act,
medical observations commonly indicate a probability that a condition exists based on their sensitivity and specificity as indicators for that condition.
A given feature thus may be consistent with more than one condition, or the
condition may exist without the feature. Furthermore, sensitivities and specificities for most features are not known precisely for all data sets and at best
may be estimated based on values obtained in other data sets. For these reasons, linking data elements to true characteristics of patients is not
straightforward.
Interpretations
Diagnoses and other summary data in medical records are generally human interpretations of aggregates of observations and objective data values.
Interpretations by different individuals may differ or even conflict, and are
often expressed as text that must be further interpreted to a form that
may be processed during data mining.
No canonical form
Although substantial progress continues to be made in developing standard medical terminologies, in the absence of a generally accepted representation for important medical concepts many clinical data are still expressed
in idiosyncratic ways.
Incomplete and inconsistent data
Patients who have the same conditions may have substantially different
types and timing of observations, unlike retail transactions, surveys, or technical data, which generally have comparable complements of data elements
obtained at similar times. Clinical data may be inconsistent or conflicting for
various reasons. These qualities add noise and spurious patterns that increase the difficulty of identifying real patterns of interest.
Difficult mathematic characterization
Medicine deals with concepts, such as inflammation, comorbidities, and
disease severity, that strongly influence clinical outcome but are difficult
to quantitate and incorporate into mathematic relationships with diagnoses
and disease progression models.
Temporal patterns
Data elements in clinical records may not be meaningful outside of a particular temporal context. This phenomenon is particularly true for laboratory databases, which are largely composed of time sequences. For
Summary
Mining medical data, including laboratory data, is currently a challenging
exercise in data acquisition, aggregation, and reconciliation. Significant political and legal challenges may also exist for medical data mining projects.
Partly for these reasons, data mining techniques have not been widely used
in laboratory medicine (with a few notable exceptions). Substantial potential
exists, however. The last five articles in this issue show that data mining can
be applied successfully to clinical care, public health, and research problems.
As the volume of data online increases, standard data representations become more widespread, and issues related to secondary analysis of health
data are resolved, the cost and effort barriers to data mining projects will
decrease. Laboratory data represent a substantial volume of objective,
References
[1] Rees J. Complex disease and the new clinical sciences. Science 2002;296:698-700.
[2] Canadian Institute for Health Research. Secondary use of personal information in health research: case studies. 2002. Available at: http://www.cihr-irsc.gc.ca/e/1475.html. Accessed August 26, 2007.
[3] Safran C, Bloomrosen M, Hammond WE, et al. Toward a national framework for the secondary use of health data: an American Medical Informatics Association white paper. J Am Med Inform Assoc 2007;14(1):1-9.
[4] Grossman J, Mackenzie FJ. The randomized controlled trial: gold standard, or merely standard? Perspect Biol Med 2005;48(4):516-34.
[5] Jager K, Stel V, Wanner C, et al. The valuable contribution of observational studies to nephrology. Kidney Int 2007;72(5):539-42.
[6] Babcock C. Parallel processing mines retail data. Computerworld 1994;28(39):6.
[7] Harrison D. Backing up 100 terabytes. Network Computing 1993;4(13):98-104.
[8] Fayyad UM, Piatetsky-Shapiro G, Smyth P. From data mining to knowledge discovery: an overview. In: Fayyad UM, Piatetsky-Shapiro G, Smyth P, editors. Advances in knowledge discovery and data mining. Menlo Park (CA): AAAI Press; 1996. p. 1-34.
[9] Lee S, Siau K. A review of data mining techniques. Industrial Management & Data Systems 2001;100(1):41-6.
[10] Fayyad UM, Piatetsky-Shapiro G, Smyth P. From data mining to knowledge discovery in databases. AI Magazine 1996;17(3):37-54.
[11] Hipp J, Guntzer U, Nakhaeizadeh G. Data mining of association rules and the process of knowledge discovery in databases. In: Perner P, editor. Advances in data mining: applications in e-commerce, medicine, and knowledge management. Berlin (Germany): Springer; 2002. p. 207-26.
[12] Hand DJ. Principles of data mining. Drug Saf 2007;30(7):621-2.
[13] SAS Institute Inc. SAS Enterprise Miner. Available at: http://www.sas.com/technologies/analytics/datamining/miner/index.html. Accessed August 26, 2007.
[14] SPSS Inc. Clementine. Available at: http://www.spss.com/clementine/. Accessed August 26, 2007.
[15] Cognos Inc. Data mining. Available at: http://www.cognos.com/data-mining.html. Accessed August 26, 2007.
[16] Insightful Corp. Insightful Miner. Available at: http://www.insightful.com/products/iminer/default.asp. Accessed August 26, 2007.
[17] Oracle Corp. Oracle data mining. Available at: http://www.oracle.com/technology/products/bi/odm/index.html. Accessed August 26, 2007.
[18] Tan P, Steinbach M, Kumar V. Introduction to data mining. Boston: Addison-Wesley Longman Publishing Co., Inc.; 2005.
[19] Dunham MH. Data mining: introductory and advanced topics. Upper Saddle River (NJ): Prentice Hall; 2002.
[20] Hand D, Blunt G, Kelly M, et al. Data mining for fun and profit. Stat Sci 2000;15(2):111-26.
[21] Cios KJ, Moore GW. Uniqueness of medical data mining. Artif Intell Med 2002;26(1-2):1-24.
Although data mining is a new field of study of interest to medical informatics, the application of analytic techniques to the discovery of patterns has
a rich history. Perhaps one of the most successful early uses of data analysis
for discovery and understanding was in medicine, specifically infectious
diseases.
In the middle of the nineteenth century, London was hit with a pandemic
of infectious disease that killed large numbers of its citizenry. At that time
medicine knew little about the causes of infectious diseases, but two theories
competed for consideration. The first and more popular theory, called the
miasma theory, suggested that bad air propagated disease. The second theory postulated infectious agents or germs as the source of infection.
A leading supporter of the miasma theory was Dr. William Farr, a civil
servant in the General Register Office. According to Farr, decaying organic
matter provided a mechanism for the transfer of disease. In areas closer to
the Thames River the air was particularly unhealthy from decaying matter,
whereas locations away from the Thames had more healthy air. His careful
analysis of available mortality data showed a strong negative correlation
with elevation above the Thames.
Dr. John Snow was a leading proponent of the germ theory. Snow was
a pioneer in anesthesia and had administered anesthesia to Queen Victoria during childbirth.
During the pandemic of this period, he meticulously collected observational
data on those infected with the disease and carefully looked for patterns that
could provide causal understanding. He quickly narrowed his search to understanding the contribution of water to the disease. During the outbreak in
1853 and 1854 he was particularly interested in assessing the association of
the disease with specific water companies.
When a new outbreak of the disease occurred in 1852, Snow carefully collected data from higher and lower elevations that had different sources of water.
His analysis showed clearly that once the source of water was taken into account
the elevation above the Thames was not predictive of infection. Despite the apparent relationship shown in Farr's findings, Snow showed strong evidence that
germs within a polluted water supply were the mode of disease transmission.
This example illustrates the advantages and disadvantages of data mining. On the positive side, data mining can discover important patterns. We
can frequently identify patterns even when we do not fully understand the
causal mechanisms behind those patterns. In this sense, data mining can
open pathways to research and discovery that may not have been evident.
On the negative side, using data mining it is possible to discover irrelevant or cursory patterns
that confuse more than they enlighten. Data mining thus should never replace careful analysis and directed reasoning about important problems.
The term data mining itself is somewhat unfortunate because we are generally not interested in mining data per se, but rather in mining information
from data. We live in data-rich times and as each day passes more data are collected and stored in databases. The desire to process and use these data to help
answer or understand important questions has driven the development of data
mining techniques. The goal of these techniques is to find information within
the large stores of data. By information we mean patterns that are persistent
and meaningful. Data mining is sometimes referred to as knowledge discovery, which imputes the deeper goal of knowledge extraction from data.
Most techniques for mining large data sets have emerged in the last 20 years
with the widespread use of databases, particularly relational databases. These
databases have provided outstanding capabilities for transaction processing,
meaning that individual records for many millions of individuals can be
quickly and securely updated based on real-time processing of a transaction.
These same characteristics that enable efficient operation of individual
transactions in large databases frustrate the use of the data for analysis, leading to embarrassing shortcomings, such as the following: Although I can tell
you the date, time, and results from your last TB test, I cannot tell you how
many TB tests were performed by a specic laboratory over a specic period
of time. Because of these shortcomings most database developers have moved
to add functionality to provide some limited data summaries to their customers. The more intense search for patterns or information in these collections of data has only recently been available through data mining, however.
Although extracting information from data is the major motivation for
data mining, there is also a second, less discussed motivation. Much research
has shown that unaided human analysis of data for decision making is unintentionally flawed (see Ref. [1]). Even with small databases, data mining can
provide protection against unaided human inference about patterns. In this
use, data mining is an aid to human judgment and for this reason data mining
techniques should attempt to provide quantifiable measures behind the discovered patterns.
Discovery techniques
Discovery techniques look for the interdependence or association between observations or between variables in a data set. Unlike the methods
described in the section on predictive techniques, the data have not been segmented by the analyst into sets of particular interest. Specifically, there are no
designated response variables, such as coronary heart disease and type II
diabetes. Instead the concern is with finding patterns of association among
the observations (eg, patients) or variables (eg, demographics). Many techniques can be used for finding associations among observations and variables. Nonetheless, to maintain a user perspective on the techniques, this
article purposely separates techniques by the analyst's goal.
If the analyst is interested in finding associations among variables or features in a database, then the section on discovery methods for variables provides an introduction. These methods are particularly relevant to new
problems in which databases have many more variables than observations.
For example, in gene expression databases it is not uncommon to find tens
of thousands of variables and only a few hundred observations.
The section on discovery methods for observations describes introductory methods for finding associations among observations. When used
well these methods provide analysts with improved understanding of their
data and sometimes serve as a preliminary step to the methods described
in the predictive techniques section. In other words, these methods frequently enable the segmentation of the data into sets of response and predictor variables needed by the predictive techniques.
$$e_{ik} = \sum_{j=1}^{K} a_{jk}\, x_{ij}$$

In vector notation,

$$e_{ik} = \mathbf{a}_k^{T} \mathbf{x}_i$$

where $T$ indicates transpose, $\mathbf{e}_k^{T} = (e_{1k}, e_{2k}, \ldots, e_{nk})$, $\mathbf{a}_k^{T} = (a_{1k}, a_{2k}, \ldots, a_{Kk})$, and $\mathbf{x}_i^{T} = (x_{i1}, x_{i2}, \ldots, x_{iK})$, for $k = 1, 2, \ldots, K$.
Consider the first principal component, $k = 1$. Because the goal is to maximize the variance, this is equivalent to finding the $\mathbf{a}_1^{T} = (a_{11}, a_{21}, \ldots, a_{K1})$ that
maximizes the variance of the first principal component, $\lambda_1$. This variance is
given by

$$\lambda_1 = \mathbf{a}_1^{T} S \mathbf{a}_1$$

where $S$ is the sample covariance matrix. As noted, the normalizing property
provides a bound for this solution. The optimization thus requires the constraint that

$$\mathbf{a}_1^{T} \mathbf{a}_1 = 1.$$

With this normalizing constraint, the solution for the vector that maximizes the variance of the first principal component is the first eigenvector of $S$.
So $\mathbf{a}_1$ is the first eigenvector of $S$, and the variance of the first principal component, $\lambda_1$, is the first eigenvalue of $S$.
The remaining principal components are found through similar procedures. Specifically, the second principal component is found from the
$\mathbf{a}_2^{T} = (a_{12}, a_{22}, \ldots, a_{K2})$ that maximizes the variance of the second principal
component, $\lambda_2$. This variance is given by

$$\lambda_2 = \mathbf{a}_2^{T} S \mathbf{a}_2$$

and the normalizing constraint,

$$\mathbf{a}_2^{T} \mathbf{a}_2 = 1.$$

In this case there is also the second constraint given in the goal statement:
the principal components should be orthogonal. This implies that

$$\mathbf{a}_1^{T} \mathbf{a}_2 = 0.$$

Under these constraints, the solution for $\mathbf{a}_2$ is the second eigenvector of
$S$ and $\lambda_2$ is the second eigenvalue. The procedure continues in this fashion to
obtain the remaining $K - 2$ principal components. Each additional component found must be orthogonal to the ones preceding it. The solutions again
are the eigenvectors for the principal components and the eigenvalues for
the variances. In data mining we typically look for solutions in which the
number of principal components is less than the number of variables in
the database, indicating that the procedure has identified associated variables and placed them in the same principal components.
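The eigen decomposition just described is available directly in R. The sketch below is a minimal illustration, assuming a data frame `pima` whose columns carry the eight variables discussed in this article; the object and column names are assumptions, not the author's code.

```r
# A minimal sketch, assuming a data frame `pima` with these (assumed) column names.
vars <- c("Pregnant", "Glucose", "BloodPress", "SkinThick",
          "Insulin", "BodyMass", "Pedigree", "Age")

# princomp() performs the eigen decomposition described above; cor = TRUE works
# with the correlation rather than the covariance matrix S, a common choice when
# the variables are on very different scales (set cor = FALSE to use S directly).
pc <- princomp(pima[, vars], cor = TRUE)

unclass(pc$loadings)   # the vectors a_k (one column per principal component)
pc$sdev^2              # the eigenvalues lambda_k (the component variances)
summary(pc)            # proportion of variance explained by each component
biplot(pc)             # plot of the first two components, as in Fig. 1
```

Equivalently, eigen(cov(pima[, vars])) returns the same eigenvectors and eigenvalues without the princomp() wrapper.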
The proportion of variance explained by the principal components is easily found. For example, the variance explained by the first two principal
Fig. 1. Plot of the first two principal components for the Pima Indian data set.
[Table: loadings of the eight variables (pregnant, glucose, blood pressure, skin thickness, insulin, body mass, pedigree, and age) on principal components 1 through 8 for the Pima Indian data set; the numeric layout is not reproducible from the extracted text.]
In addition to principal components, many other methods exist for associating variables. Some representative methods include partial least squares
[3], ridge regression [4], and independent components [5]. A discussion and
comparison of methods can be found in the article by Copas [6].
Discovery methods for observations
Many people outside the field of data mining believe that associating observations is the only purpose for data mining. It is unquestionably the area
of data mining that has received the most attention in the popular press. It is
also the area of data mining that arguably contains the largest number of
intractable problems. The goal is similar to that of associating variables,
but when analysts associate observations they often want more than they require when associating variables. In particular, they seek strength of association and any causal implications. These requirements create inferential and
computational burdens on the proposed techniques.
This section provides an introduction to the techniques in this challenging
area. It starts with a description of the market basket problem and the Apriori algorithm often used for its solution. The section ends with an overview
of clustering methods and the commonly used hierarchical approaches.
Again the goal is to introduce representative techniques.
Mining data on customer purchase behavior is the aim of market basket analysis. Consider, for example, data on the purchase of items by customers at
a store over a recent period of time. Do these customers frequently buy the
same groups of items? So, for example, when they purchase peanut butter,
do they also purchase jelly? Understanding these associations may help store
managers to better inventory, display, and manage their marketable items. In
health care, market basket analysis can provide an understanding of associations among patients with demands for similar services and treatments.
Consider the set of all possible items that can be placed in a customer's
market basket. Then each item has a value associated with it, which represents
the quantity purchased by that customer. The goal of market basket analysis
is to find those values of items for which their joint probability of occurrence
is high. Unfortunately, for even modest-sized stores this problem is
intractable.
Instead, analysts typically simplify the problem to allow only binary
values for the items. These values reflect a yes or no decision for that item
and not the quantity. Each basket then is represented as a vector of binary
valued variables. These vectors show the associations among the items. The
results are typically formed into association rules. For example, customers
who buy peanut butter (pb) also buy jelly (j) is converted to the rule,

pb ⇒ j
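As a small illustration of the bookkeeping behind such rules (not the Apriori algorithm itself), the sketch below encodes a handful of invented baskets as binary vectors and computes the two quantities introduced in the next paragraph, support and confidence, for the rule pb ⇒ j. The items and data are hypothetical.

```r
# Hypothetical binary basket data: one row per customer, one column per item,
# 1 meaning the item was purchased (quantities are deliberately ignored).
baskets <- matrix(c(1, 1, 0,
                    1, 1, 1,
                    0, 0, 1,
                    1, 0, 0,
                    1, 1, 0),
                  ncol = 3, byrow = TRUE,
                  dimnames = list(NULL, c("pb", "j", "bread")))

# Support of pb => j: proportion of all baskets containing both pb and j.
support <- mean(baskets[, "pb"] == 1 & baskets[, "j"] == 1)

# Confidence of pb => j: among baskets containing pb, the proportion that also contain j.
confidence <- support / mean(baskets[, "pb"] == 1)

c(support = support, confidence = confidence)   # 0.60 and 0.75 for these five baskets
```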
These rules are augmented by the data to show the support and the
confidence in the rule. Support for a rule means the proportion of
Fig. 2. Dendrogram of single link clustering for the Pima Indian data set.
Euclidean:

$$d_{ij} = \left( \sum_{k=1}^{p} \left| x_{ik} - x_{jk} \right|^{2} \right)^{1/2}$$

City block:

$$d_{ij} = \sum_{k=1}^{p} \left| x_{ik} - x_{jk} \right|$$

Maximum:

$$d_{ij} = \max_{k = 1, 2, \ldots, p} \left| x_{ik} - x_{jk} \right|$$
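These dissimilarities are available directly in base R through dist(). A minimal sketch, assuming a numeric data frame `pima` of the Pima Indian variables (the name and the choice to standardize first are assumptions):

```r
# Pairwise dissimilarity matrices under the three measures given above.
# Standardizing first is optional but often sensible when variables are on
# very different scales.
x <- scale(pima)

d_euclidean <- dist(x, method = "euclidean")
d_cityblock <- dist(x, method = "manhattan")   # sum of absolute differences
d_maximum   <- dist(x, method = "maximum")     # largest single-coordinate difference
```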
Fig. 3. Dendrogram of complete link clustering for the Pima Indian data set.
Fig. 4. Dendrogram of average link clustering for the Pima Indian data set.
Table 2
Complete and average link clusters

                          Average link cluster
Complete link cluster       1     2     3     4
1                         314     0     0     2
2                           0    15     0     0
3                           0     0     3     0
4                           9     5     0    46
Table 3
Medians for complete and average link cluster 4

Method      Glucose    Blood pressure    Skin thickness    Insulin
Complete    141.5      74                32.5              274.5
Average     146.5      76                34.5              276.5
Data        119.0      70                29.0              125.0
four clusters. A more complete analysis would consider other partitions and
possibly other clustering methods.
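A sketch of the kind of analysis summarized in Tables 2 through 4, using base R's hclust() and cutree(); the data frame name, the column names, and the decision to standardize before computing distances are assumptions.

```r
d <- dist(scale(pima))                         # pairwise distances between patients

hc_complete <- hclust(d, method = "complete")  # complete link clustering
hc_average  <- hclust(d, method = "average")   # average link clustering
plot(hc_complete)                              # dendrogram, as in Fig. 3

# Cut each dendrogram into four clusters and cross-tabulate the memberships (cf. Table 2).
cl_complete <- cutree(hc_complete, k = 4)
cl_average  <- cutree(hc_average,  k = 4)
table(cl_complete, cl_average)

# Medians of selected variables within one cluster (cf. Tables 3 and 4).
apply(pima[cl_complete == 4, c("Glucose", "BloodPress", "SkinThick", "Insulin")],
      2, median)
```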
Predictive techniques
Many data mining techniques go beyond discovering relationships between variables and observations. A major set of techniques focuses on
the prediction of variable values given the values of other variables. In
this sense, these techniques look for strong associations between sets of
variables.
Prediction requires a priori identification of the set of variables to consider as predictors and the set of variables to predict (the response variables). Although many methods attempt to identify the more important
predictor variables, no methods can be expected to find predictors that
are not present in the data. The type of response variable provides a strong
constraint on the data mining technique.
This section introduces representative data mining techniques for prediction. To keep the notation manageable, only the single-variable response is
described. The extension to the multivariate case is conceptually straightforward once the univariate case is understood. The section begins with numeric response, because this builds directly on commonly used regression
or least squares techniques. From there the discussion moves to categorical response variables. Most data mining methods can handle both types of
response, although the actual mechanics of the methods change with changing response type. Part of the discussion of these methods involves variable selection and interpretation. In many applications, understanding the contribution of the variables to the prediction is important. Not all methods
provide an interpretation, and, hence, this difference is noted where relevant.
Table 4
Medians for complete and average link cluster 3

Method      Glucose    Blood pressure    Skin thickness    Insulin
Complete    189.0      70                33.0              744.0
Data        119.0      70                29.0              125.0
Fig. 5. Histogram of insulin in the Pima Indian data set.
The least squares approach also identifies useful predictor variables. This
identification is accomplished by hypothesis testing on the values of the parameters for each variable in the model. The hypothesis tested is whether
a variable's parameter is significantly different from zero. The level of significance is chosen by the analyst.
Fig. 6. Predicted versus actual values in a linear model of the Pima Indian data.
For the Pima Indian data set, two variables show significance in predicting insulin: glucose and body mass index. Glucose actually has a nonlinear
relationship with insulin and this nonlinearity can also be captured in the
model.
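A sketch of such a least squares fit and the associated hypothesis tests in R; the data frame, the column names, the log transformation of insulin (suggested by Fig. 6), and the particular set of candidate predictors are assumptions.

```r
# Ordinary least squares fit of log(insulin) on several candidate predictors.
fit <- lm(log(Insulin) ~ Glucose + BodyMass + Age + Pedigree, data = pima)
summary(fit)   # t tests of whether each variable's parameter differs from zero

# The nonlinear relationship between glucose and insulin can be captured,
# for example, with a quadratic term.
fit2 <- lm(log(Insulin) ~ Glucose + I(Glucose^2) + BodyMass, data = pima)
summary(fit2)
```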
Logistic regression
Least squares regression models data mining problems with numeric response variables. To mine data with categorical response variables requires
a different approach to regression. Consider the simplest case in which the
categorical variable is binary (eg, diabetes or no diabetes). Least squares regression would not be appropriate for this problem because it would provide
predictions that would lie outside the binary response values.
An extension to the regression approach is accomplished by modeling the
probability of a binary response. With n independent observations, the
probability of k occurrences of an event is given a binomial distribution. Let
$\mu$ be the parameter for this binomial distribution, which is simply the probability of an event in any observation. A convenient, but by no means
unique, model assumes this probability $\mu$ is a logistic function of the predictors with parametric vector $\theta$. This yields the following:

$$\log\!\left(\frac{\mu}{1-\mu}\right) = \theta^{T} x$$

where $x^{T} = (x_0, x_1, \ldots, x_k)$ and $\theta^{T} = (\theta_0, \theta_1, \ldots, \theta_k)$.
This model can be applied to the Pima Indian data to classify the patients
as diabetic or not based on the values of the predictor variables. Fig. 7
Fig. 7. Actual versus predicted values in a logistic regression model of the Pima Indian data.
shows a plot of the actual versus predicted values for these data. The model
does reasonably well. Using a test set of patients not used to construct the
model, it achieved an error rate of 22%. This finding compares well to the
base rate of diabetes in this data set of 33%. The plot shows that there are patients who are not diabetic, however, and yet are given a high probability
(>0.9) of having diabetes, whereas there are other patients who have diabetes and yet are given a low probability (<0.1) of this event by the model.
As with linear regression, logistic regression provides insight into the influence of the predictors on the response. Using likelihood ratio tests, two variables, glucose and body mass index, show significance (p<.05) for
predicting diabetes in this population. A third variable, age, is significant
at p<.1.
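A sketch of the corresponding logistic regression in R, with a held-out test set for estimating the error rate; the data frame `pima`, a 0/1 outcome column `Diabetes`, and the predictor names are assumptions.

```r
set.seed(1)
test_idx <- sample(nrow(pima), size = round(nrow(pima) / 3))   # hold out one third
train <- pima[-test_idx, ]
test  <- pima[ test_idx, ]

fit <- glm(Diabetes ~ Glucose + BodyMass + Age, family = binomial, data = train)
summary(fit)                     # Wald tests for each coefficient
drop1(fit, test = "LRT")         # likelihood ratio tests, as used in the text

pred <- predict(fit, newdata = test, type = "response")   # predicted probabilities
mean((pred > 0.5) != test$Diabetes)                        # test-set error rate
```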
Classification trees
Classification trees provide an easily understood and interpretable approach to predictive data mining with a categoric response. The central
idea is to partition the data set into regions for which a particular categorical response is prominent. The partitioning is accomplished through a series
of questions. For example, is body mass index less than 40? Observations
with affirmative answers to this question are separated from those with negative answers. Additional questions continue the partitioning until regions
are found that primarily contain a single response value.
This partitioning can be viewed as a tree. Each node represents a question
that partitions the data, and the combination of all questions in the nodes
provides the final partitioning. Fig. 8 shows a classification tree obtained
[Fig. 8. Classification tree for the Pima Indian data set, with splits on glucose < 127.5, insulin < 143.5, glucose < 165.5, and age < 23.5, and terminal nodes labeled Diabetes or No Diabetes.]
for the Pima Indian data set. The top or root node partitions the data based
on values of glucose less than 127.5. All observations with values less than
127.5 go left in the tree and those greater than 127.5 go right. The observations on each path are further partitioned. For example, those that went
right are again partitioned by values of glucose, but this time they are compared with the value 165.5. Those that went left are now partitioned based on the
value of insulin. The nodes at the base of the tree provide the classification
labels. Again looking at the tree in Fig. 8, patients who have a value of glucose greater than 165.5 are classified as diabetic, whereas those who have
a glucose measurement less than 127.5 and insulin less than 143.5 are classified as nondiabetic.
This example shows that tree classifiers can provide easily understood results with excellent interpretability. Obviously a larger tree with more variables becomes less easily understood, but even in these cases it is possible to
view the tree in segments, which aids understanding of even very large
databases. This ease of interpretability and understanding is one of the major
reasons for the use of classification trees in data mining.
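A sketch of fitting a tree like the one in Fig. 8 with the rpart package, which implements the recursive partitioning algorithm discussed next; the data frame and column names are assumptions.

```r
library(rpart)

tree <- rpart(factor(Diabetes) ~ Glucose + Insulin + Age + BodyMass + BloodPress,
              data = pima, method = "class")

print(tree)                            # text listing of the splits (eg, Glucose < 127.5)
plot(tree); text(tree, use.n = TRUE)   # draw the tree with class counts at the leaves
predict(tree, type = "class")[1:5]     # predicted labels for the first five patients
```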
The construction of classification trees is made possible through various
algorithms. One of the most effective of these, known as recursive partitioning, was developed by Breiman and colleagues [8]. This algorithm constructs
trees by providing effective answers to three major tree construction questions: (1) When to stop growing the tree, (2) what label to put on a terminal
node, and (3) how to choose a question at a node.
The first question they answered in an unusual but important way. They
do not stop growing the tree. Instead the algorithm grows the tree out to its
maximum size (eg, each observation in its own terminal node). It then
prunes the tree back to a size that best predicts a set of holdout samples
(the actual approach used is discussed in the evaluation section). This pruning approach avoids generating trees that are not effective because they did
not consider a sufficiently large and cooperative set of nodes.
The second question is fairly easily answered by simply counting the
number of members of each category that appear in a terminal node and
choosing the winner. Ties are simply reported. This approach means that
the algorithm provides a quick estimate of the probabilities for each label
in the terminal node. For example, looking again at the tree in Fig. 8, patients who have a glucose reading greater than 165.5 are classified as diabetic. This classification has an estimated probability of 0.85 because 85%
of the patients in the Pima Indian database who had glucose levels this
high were diabetic.
The answer they provided to the third question was more involved. To
develop a question for a node, their algorithm begins with the data that
have arrived at the node. Each variable in the database is considered and
each value or set of values (for categorical variables) is considered. The algorithm chooses from this large set the question that best partitions the
data. Best is measured by purity of the results. So a question that partitions
the data into nodes with dominant class labels is preferred to one that has
the labels in roughly equal proportions.
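One widely used purity measure is the Gini index, which is zero when a node contains a single class and largest when the classes are evenly mixed. The sketch below scores a candidate question by the weighted impurity of the two nodes it would create; the data frame and column names are assumptions.

```r
# Gini impurity of a set of class labels.
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted impurity of the two child nodes produced by a yes/no question.
split_impurity <- function(labels, goes_left) {
  w_left <- mean(goes_left)
  w_left * gini(labels[goes_left]) + (1 - w_left) * gini(labels[!goes_left])
}

# Example: score the root question "glucose < 127.5"; smaller is purer.
split_impurity(pima$Diabetes, pima$Glucose < 127.5)
```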
Other approaches exist for building classification trees and use different
answers to the questions on tree construction (eg, Ref. [9]). For example,
it is possible to build trees with more than pairwise partitions at the nodes
and to consider trees that ask more complicated questions involving more
than one variable [10].
The interpretability and ease of understanding of the results make classification trees an important and useful technique in data mining. Their importance is evident in continuing work to improve their accuracy and
applicability. Two of the more important recent extensions are boosting
[11] and random forests [12]. Boosting provides a method for trees to improve in accuracy by adapting to the errors they make in classification. Random forests provides a mechanism for combining results from multiple
classification trees to produce more accurate predictions.
Neural networks
Neural networks provide data mining techniques meant to mimic the pattern recognition properties of biologic systems. The most commonly used of
these techniques, multilayer perceptrons or backpropagation neural networks, begins with a simplified model of neural processing known as a perceptron. A perceptron typically applies a nonlinear transfer function to
a weighted sum of the inputs. The inputs are the values for each predictor
variable for an observation. The better performing transfer functions are
smooth and continuous.
As the name implies, multilayer perceptrons use several perceptrons and
organize them into different layers for processing the data. In most data
mining applications three layers of perceptrons are used. The first layer is
the input layer and provides an input node for every variable in the database. Another node, known as the bias node, is often used to provide the
model with greater flexibility in modeling. This node always inputs the
same value (eg, 1).
The next layer is known as the hidden layer. In fully connected
networks every input node is connected to every node of the hidden layer.
The number of nodes in this hidden layer is undetermined and can greatly
impact the results. In most applications trials are made with different numbers of hidden nodes to find the number that works well for the specific
application.
The final layer is the output. This layer depends on the values (scalar or
vector) sought in the output; so, for a simple binary classification problem
a single output node is sufficient.
Fig. 9 shows an example of a multilayer perceptron neural network for
the Pima Indian data. The input layer contains nodes for each of the variables in this data set and for the bias term. This hidden layer contains
[Fig. 9. Multilayer perceptron for the Pima Indian data, with an input node for each variable plus a bias node, k hidden nodes, and a single output classification node.]
some chosen number of nodes, say k. Finally, for this problem there is a single output node to report the classication of the patient.
A multilayer perceptron requires values for the weights represented by
the arcs or connections in the neural network. In the example in Fig. 9 we
would need weights for each of the arcs connecting the nodes in the input
layer to each member of the hidden layer. Similarly the connections between
each hidden node and the output node require weights.
The basic algorithm used to calculate these weights is the backpropagation algorithm of Werbos. The name of this algorithm is what gives multilayer perceptrons their other commonly used name: backpropagation neural
networks. This algorithm initializes by randomly assigning weights to the
connections in the network. As the algorithm's first step, an observation
from the database is presented to and processed by the neural net. The error
is computed at the terminal node and this error is then propagated back
through the network. The algorithm changes most those weights that
contributed most to the error. The algorithm proceeds in this fashion until
the weights change little or not at all, which can take many presentations of
the data in the database to the neural network. Werbos' basic algorithm has
been modified by many researchers and faster algorithms now exist that do
not require single observation processing.
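A sketch of a comparable single-hidden-layer network using the nnet package, which fits the same multilayer perceptron but by a quasi-Newton optimizer rather than classic backpropagation; the train/test split from the logistic regression sketch and the column names are assumptions.

```r
library(nnet)

set.seed(1)
net <- nnet(factor(Diabetes) ~ Glucose + BloodPress + BodyMass + Age + Insulin,
            data = train,
            size = 5,                   # five hidden nodes
            decay = 0.01, maxit = 500)  # weight decay and iteration limit

pred_class <- predict(net, newdata = test, type = "class")
mean(pred_class != test$Diabetes)       # test-set error rate
```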
Fig. 10 shows the actual classification values for the Pima Indian data plotted
against the predictions from a five hidden node backpropagation neural network; this is the same plot as shown in Fig. 7 for logistic regression. As in
that case the neural network does well on many cases but still makes some
significant errors. One important difference between neural networks and
both logistic regression and classification trees is that they do not provide
an understandable interpretation of the results. By themselves, it is not possible to know which features or combination of features most influenced
a prediction.
Support vector machines
The final technique discussed in this article is one of the more recent additions to data mining. Support vector machines (SVMs) were developed by
Vapnik [13], and the technique seeks to predict class labels by separating the
database into mutually exclusive regions. There are several important innovations in the approach taken by support vector machines to this problem.
First, SVMs perform the separation based on the few points, the support
vectors, near the boundary between the classes. In this respect they differ from all the previous approaches described in this article, which form decision boundaries using all of the data points. Second, they transform the
data into a space where separability between the classes is improved. Finally, rather than explicitly performing the transformation, they use kernel
functions to provide computational tractability. This section provides an
overview of support vector machines by discussing these innovations.
Fig. 10. Prediction results from a five hidden node multilayer perceptron for the Pima Indian data.
[Fig. 11. A linear decision boundary, its margin, and the support vectors that define it.]
the surface. Support vector classifiers handle this situation by relaxing the
requirement that all points must be on one side of the surface. Formally
they do this by adding slack variables to the previous formulation. Let
$\xi_i, i = 1, \ldots, n$ be the slack variables, which have nonnegative values. Then
replace the constraint in the previous formulation with

$$y_i \left( w^{T} x_i + w_0 \right) \geq 1 - \xi_i$$

Solving this constrained optimization problem allows some points to be
misclassified.
The next major component of the support vector machines is the transformation of the original problem into a new space. This transformation
enables the creation of nonlinear decision boundaries. This capability is important because many applications do not have the simple linear boundaries
shown in Fig. 11. The SVM solution is to find a transformation of the original variables that enables linear separability. Even though the problem is
not linearly separable in the original space defined by the variables in the
database, SVM finds a transformation of the variables in which linear separability is possible.
The search for variable transformation and the maximal margin classifier
in this new space can be computationally expensive. Fortunately, the third
major contribution from SVM helps resolve these issues. Rather than actually perform the transformation and develop the maximal margin classifier
in a new space, SVM uses kernel functions in the original variables. These
kernel functions mean that there is no need to actually find the variable
transformation for nonlinear decision surfaces. Kernel functions enable
this because they allow for computation of the similarity (formally, the
dot product) between two observations in the original space rather than
in the transformed space. In so doing the kernels make it possible to solve
the optimization for linear separability in reasonable time.
Several possible choices exist for kernel functions that enable the use of
SVM for data mining. Two common choices are polynomial kernels and
radial basis functions. Fig. 12 shows Pima Indian data classified with a polynomial kernel in two dimensions, glucose (X1) and body mass index (X2).
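A sketch of an SVM with a polynomial kernel on the same two variables, using the e1071 package (an R interface to libsvm); the data frame and column names are assumptions.

```r
library(e1071)

fit <- svm(factor(Diabetes) ~ Glucose + BodyMass, data = pima,
           kernel = "polynomial", degree = 3, cost = 1)

plot(fit, pima, Glucose ~ BodyMass)          # decision regions, support vectors marked
mean(predict(fit, pima) != pima$Diabetes)    # apparent (training-set) error rate
```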
Evaluation
The final section of this article examines methods to evaluate the results
of data mining. This discussion focuses on evaluating predictive techniques.
Discovery techniques are difficult to evaluate because, unlike predictive techniques, there is no available response value. The few available evaluation
methods for discovery techniques build on those for the predictive techniques that are discussed later.
Evaluation techniques are vital to the use of predictive data mining. It is
clearly not sufficient to simply apply data mining techniques to a database.
Fig. 12. SVM classifier for the Pima Indian data with a polynomial kernel.
The results from these data mining techniques must be objectively assessed
before they are used to inform decision making. Evaluation requires testing
procedures and metrics.
Testing procedures are normally defined by the application area. For
some applications it is possible to conduct formal experiments designed using the variables in the database. For many applications experiments are not
possible, however. In these situations, the analyst normally uses the observations in the database to evaluate the data mining results.
The goal of testing procedures is to provide the analyst with an objective
view of the performance of the data mining technique on future observations. For many reasons it is best not to rely on the observations in the database that were used to parameterize a technique to assess its performance
on future values. The major reason for this caveat is because each technique
can be made to perform perfectly on a set of observations. This perfect performance on a known database would not translate into perfect performance on newly obtained observations, however. In fact, the performance
on these would be poor because the technique was overfit to the existing
data.
Testing procedures provide a way to avoid overfitting. The simplest
testing procedure is to divide the database into two parts. One part, the
training set, is used to build and parameterize the data mining technique.
The second part is used to test the technique. For reasonably sized
databases the division is normally two thirds for training and one third for
testing. In addition, the choice of observations for each set is randomly
made. It may be useful to use stratified sampling for either or both of
the training and test sets if the distributions of groups within a target population are known.
Cross-validation is another testing procedure that is used when the database is small or when concerns exist about the representativeness of
a test set. Cross-validation begins by dividing the data into M roughly
equal-sized parts. For each part, i = 1, ..., M, the model is fit using the
data in the other M - 1 parts. The metric is then computed using the
data in the remaining part; this is done M times, giving M separate estimates of the metric. The final estimate for the metric is simply the average
over all M estimates.
Cross-validation has the advantage that it uses all the data for training
and testing; this means that the analyst does not have to form a separate
test set. Recursive partitioning, discussed in the classification trees section,
uses cross-validation to determine the final size of the tree. In this way
cross-validation is frequently used to find parameter values for the different
data mining techniques. For those methods that do not use it for parameter
estimation it provides a convenient testing approach to assess a data mining
technique.
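The M-fold procedure just described is short to write out explicitly. A sketch for the logistic regression model, with the usual caveat that the data frame and column names are assumptions:

```r
set.seed(1)
M <- 10
fold <- sample(rep(1:M, length.out = nrow(pima)))   # assign each observation to a fold

cv_error <- sapply(1:M, function(m) {
  fit  <- glm(Diabetes ~ Glucose + BodyMass + Age, family = binomial,
              data = pima[fold != m, ])              # fit on the other M - 1 parts
  pred <- predict(fit, newdata = pima[fold == m, ], type = "response")
  mean((pred > 0.5) != pima$Diabetes[fold == m])     # metric on the held-out part
})

mean(cv_error)   # final estimate: the average over the M folds
```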
In addition to testing procedures, the analyst must also select a metric or
metrics to use to evaluate the techniques. For numeric response problems,
common metrics are functions of sums of squares or sums of absolute deviations. Both measures weight performance by distance to the correct response, but the former measure tends to penalize extreme errors more
than measures that use absolute deviation.
For categorical response, metrics that count the number of errors are typically used. In many applications the type of error is also important, however. This observation is particularly true in diagnostic applications. In
these cases it is convenient to separate the errors into false positives and false
negatives. False positives occur when the data mining technique predicts an
outcome and the outcome does not occur. False negatives happen when the
data mining technique fails to predict an outcome that occurred. The diabetes example illustrates a case wherein these two errors are not equally
weighted. In this case a false negative typically is worse than a false positive
because the latter error can be caught by subsequent testing. On the other
hand, it would be disastrous if only false positives occurred because this
would quickly overwhelm the available testing resources. In performing
evaluations on classifiers, therefore, both types of errors need to be measured
and trade-offs made between their predicted values.
A useful display that allows for viewing of both metrics is the receiver
operating characteristic (ROC) curve. The name for this graphic derives
from its origin in World War II, where it was used by the Allies to assess the performance of early radar systems. The ROC curve shows the trade-offs between
false positives and false negatives by plotting true positives (1 - false negatives) versus false positives. This plotting means that the ideal performance
is in the upper left hand corner of the plot. The worst performance is in the
lower right hand corner. Random performance is shown by a diagonal line
at 45°.
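An ROC curve can be traced directly from predicted probabilities by sweeping the classification threshold. The sketch below reuses the test-set predictions from the logistic regression sketch; packages such as pROC or ROCR produce the same curve in a single call.

```r
thresholds <- seq(0, 1, by = 0.01)

# True positive rate and false positive rate at each threshold.
tpr <- sapply(thresholds, function(t) mean(pred[test$Diabetes == 1] > t))
fpr <- sapply(thresholds, function(t) mean(pred[test$Diabetes == 0] > t))

plot(fpr, tpr, type = "l", xlab = "False Positive", ylab = "True Positive",
     main = "ROC Curve")
abline(0, 1, lty = 2)   # the 45-degree diagonal of random performance
```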
Fig. 13 shows an ROC curve for several of the data mining techniques
discussed using the Pima Indian data set. These curves are for a test set of
100 observations selected randomly from the original 392 observations. This
plot gives us a way to decide among the techniques given the desired
trade-off between false positives and false negatives.
The plot in Fig. 13 illustrates another aspect of data mining techniques.
In most applications, there is no clear winner among the techniques. The
choice of technique depends on the application, as illustrated in this case
through the choice of trade-offs between false positives and false negatives.
The choice also depends on the importance of understanding and interpretability of the results, because some of the techniques provide these attributes more easily than others. Fortunately, the variety and the
capabilities of data mining techniques continue to improve. This variety
has to a large extent built on the successes of the methods described in
this article.
Fig. 13. ROC curve for data mining techniques for the Pima Indian data set.
References
[1] Kahneman D, Slovic P, Tversky A. Judgment under uncertainty: heuristics and biases. Cambridge (UK): Cambridge University Press; 1982.
[2] Hand D, Mannila H, Smyth P. Principles of data mining. Cambridge (MA): MIT Press; 2001.
[3] Wold H. Soft modeling by latent variables: the nonlinear iterative partial least squares (NIPALS) approach. In: Perspectives in probability and statistics, in honor of MS Bartlett. Sheffield (UK): Applied Probability Trust; 1975. p. 117-44.
[4] Hoerl AE, Kennard R. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 1970;12:55-67.
[5] Comon P. Independent component analysis, a new concept? Signal Processing 1994;36:287-314.
[6] Copas JB. Regression, prediction and shrinkage (with discussion). J Roy Stat Soc B 1983;45:311-54.
[7] Agrawal R, Mannila H, Srikant R, et al. Fast discovery of association rules. In: Fayyad UM, Piatetsky-Shapiro G, Smyth P, editors. Advances in knowledge discovery and data mining. Cambridge (MA): AAAI/MIT Press; 1996. p. 307-28.
[8] Breiman L, Friedman J, Olshen R, et al. Classification and regression trees. Belmont (CA): Wadsworth; 1984.
[9] Kass GV. An exploratory technique for investigating large quantities of categorical data. Appl Stat 1980;29:119-27.
[10] Brown DE, Pittard CL. Classification trees with optimal multi-variate splits. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, Le Touquet (France); 1993. p. 4758.
[11] Freund Y, Schapire R. A decision theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 1997;55:119-39.
[12] Breiman L. Random forests. Mach Learn 2001;45:5-32.
[13] Vapnik V. The nature of statistical learning theory. New York: Springer-Verlag; 1995.
The history of software packages for data mining is short but eventful.
Although the term data mining was coined in the mid-1990s [1], statistics,
machine learning, data visualization, and knowledge engineering (research
fields that contribute their methods to data mining) were at that time
already well developed and used for data exploration and model inference.
Obviously, software packages were in use that supported various data mining tasks. But compared with the data mining suites of today, they were
awkward, most often providing only command-line interfaces and at best
offering some integration with other packages through shell scripting, pipelining, and file interchange. For an expert physician, the user interfaces of
early data mining programs were as cryptic as the end of the last sentence.
It took several decades and substantial progress in software engineering and
user interface paradigms to create modern data mining suites, which offer
simplicity in deployment, integration of excellent visualization tools for
exploratory data mining, and, for those with some programming background, the flexibility of crafting new ways to analyze the data and adapting algorithms fit to the particular needs of the problem at hand.
Within data mining, there is a group of tools that have been developed by
a research community and data analysis enthusiasts; they are offered free of
charge using one of the existing open-source licenses. An open-source development model usually means that the tool is a result of a community effort,
not necessarily supported by a single institution but instead the result of
contributions from an international and informal development team. This
development style offers a means of incorporating the diverse experiences
This work was supported by Program Grant P20209 and Project Grants J29699
and V20221 from the Slovenian Research Agency and by NIH/NICHD Program Project Grant
P01 HD39691.
* Corresponding author. University of Ljubljana, Trzaska 25, SI-1000 Ljubljana,
Slovenia.
E-mail address: blaz.zupan@fri.uni-lj.si (B. Zupan).
particular scripting language. Although harder to learn and use for novices
and those with little expertise in computer science or math than systems
driven completely by graphical user interfaces, scripting in data mining
environments is essential for fast prototyping and development of new techniques and is a key to the success of packages like R.
Why mine medical data with open-source tools?
Compared with off-the-shelf commercial data mining suites, open-source
tools may have several disadvantages. They are developed mostly by
research communities that often incorporate their most recent data analysis
algorithms, resulting in software that may not be completely stable. Commercial data mining tools are often closely integrated with a commercial
database management system, usually offered by the same vendor. Open-source data mining suites instead come with plug-ins that allow the user
to query for the data from standard databases, but integration with these
may require more effort than a single-vendor system.
These and other potential shortcomings are offset by several advantages offered by open-source data mining tools. First, open-source data mining
suites are free. They may incorporate new, experimental techniques, including some in prototype form, and may address emerging problems sooner
than commercial software. This feature is particularly important in biomedicine, with the recent emergence of many genome-scale data sets and new
data and knowledge bases that could be integrated within analysis schemata.
Provided that a large and diverse community is working with a tool, the set of techniques it may offer can be large and thus may address a wide range of problems. Research-oriented biomedical groups find substantial usefulness
in the extendibility of the open-source data mining suites, the availability
of direct access to code and components, and the ability to cross-link the
software with various other data analysis programs. Modern scripting
languages are particularly strong in supporting this type of ad hoc integration. Documentation for open-source software may not be as polished as
that for commercial packages, but it is available in many forms and often
includes additional tutorials and use cases written by enthusiasts outside
the core development team. Finally, there is user support, which is different for open-source than for commercial packages. Users of commercial packages depend on the company's user support department, whereas users of
open-source suites are, as a matter of principle, usually eager to help each
other. This cooperation is especially true for open-source packages with
large and active established user bases. Such communities communicate
by online forums, mailing lists, and bug tracking systems to provide encouragement and feedback to developers, propose and prioritize improvements,
report on bugs and errors, and support new users.
As these open-source tools incorporate advances in user interfaces and
reporting tools, implement the latest analysis methods, and grow their user
Fig. 1. Snapshot of the basic R environment (RGui) with an example script that reads the data,
constructs an object that stores the result of hierarchical clustering, and displays it as a dendrogram in a separate window.
Fig. 2. Snapshots of Tanagra with an experimental setup defined in the left column, which loads the data (Dataset), shows a scatterplot (Scatterplot 1), selects a set of features (Define status 1), computes linear correlations (Linear correlation 1), selects a subset of instances based on a set of conditions (Rule-based selection 1), computes the correlation and a scatterplot for these instances, and so on. The components of the data processing tree are dragged from the list at the bottom (Components); the snapshot shows only those related to statistics. The scatterplot on the right side shows the separation of the instances based on the first two axes as found by the partial least squares analysis, where each symbol represents a patient, with the symbol's shape corresponding to a diagnosis.
Weka
Weka (Waikato Environment for Knowledge Analysis, http://www.cs.
waikato.ac.nz/ml/weka/) [9] is perhaps the best-known open-source machine
learning and data mining environment. Advanced users can access its components through Java programming or through a command-line interface.
For others, Weka provides a graphical user interface through two applications: the Weka KnowledgeFlow Environment, which features visual programming, and Weka Explorer (Fig. 3), which provides a less flexible interface that is perhaps easier to use. Both environments include Weka's impressive array of machine learning and data mining algorithms. They both offer some functionality for data and model visualization, although not as elaborate as in the other suites reviewed here. Compared with R, Weka is much weaker in classical statistics but stronger in machine learning techniques. Weka's community
has also developed a set of extensions (http://weka.sourceforge.net/wiki/
index.php/Related_Projects) covering diverse areas, such as text mining,
visualization, bioinformatics, and grid computing. Like R in statistics,
Fig. 3. Weka Explorer, with which we loaded the heart disease data set and induced a naïve Bayesian classifier. On the right side of the window are the results of evaluation of the model
using 10-fold cross-validation.
Fig. 4. A snapshot of YALE with the experimental setup for cross-validation that reads the
data, computes some basic statistics about the features, and then cross-validates a classification tree inducer, J48. Selection of any component from the list on the left of the window provides
access to its parameters; those for cross-validation are displayed in the snapshot. The experiment log is displayed in the bottom part of the window. After executing the experiment, the
results of experiments are available in the Results tab.
into a treelike structure and runs the program (Fig. 4). Internal nodes of the tree represent functions whose children are the arguments (which may in turn be, and usually are, functions). For example, an operator XValidation performs cross-validation and requires two child nodes. The first must be able to handle an ExampleSet and deliver a Model. The second child node gets an ExampleSet and a Model and outputs a PerformanceVector. The second child would typically be an operator chain consisting of a ModelApplier, which uses the prediction Model on an ExampleSet, resulting in a table of predictions and actual classes, and a PerformanceEvaluator, which takes the table and computes the corresponding classifier scores.
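The XML syntax YALE uses for such operator trees is not reproduced here; the following Python sketch (using scikit-learn as a stand-in toolkit and a built-in sample data set) only illustrates the logic of the chain just described: a cross-validation loop in which one branch builds a Model from an ExampleSet while the other applies the model and evaluates its predictions.

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                 # the "ExampleSet"
scores = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])   # learner delivers a Model
    predictions = model.predict(X[test_idx])                           # "ModelApplier" step
    scores.append(accuracy_score(y[test_idx], predictions))            # "PerformanceEvaluator" step
print("mean accuracy over 10 folds: %.3f" % (sum(scores) / len(scores)))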
YALE incorporates a reasonable number of visualizations ranging from
the basic histograms to multidimensional RadViz [14] projections. YALE is
written in Java and is built on top of Weka, thus including its vast array of
data analysis components. Although data miners with a background in
programming easily grasp its visual functional programming concepts,
Fig. 5. Screenshot of KNIME. The central part of the window shows the experimental setup with several interconnected nodes; the right part contains a useful description of the selected node. The screenshot shows an experiment in which we loaded the data, colored the instances according to their class and showed them in a table, and used parallel coordinates and a scatterplot for visualization. In the middle of the graph we placed the nodes for testing the performance of a classification tree inducer; the node Cross-validation has an internal workflow with the definition of the evaluated learning algorithm. At the bottom part of the graph are nodes for random partitioning of the data set, binning of the training set, and derivation of a classification tree used to predict the classes of the test set and obtain the related performance scores. In addition, we visualized the training set in a scatterplot, but put the instances through the HiLite Filter. With this setup, we can pick a node in the classification tree J48 (Weka) and see the corresponding examples in the Scatter Plot.
Fig. 6. A dialog of the node CAIM Binner (from Fig. 5) that transforms continuous features
into discrete features (discretization). Features to be discretized are selected in the bottom part
of the window, with the top part of the window displaying the corresponding split points.
unsupervised data mining algorithms with those provided by Weka. But unlike that of YALE, KNIME's visual programming is organized like a data flow. The user programs by dragging nodes from the node repository to the central part of the workbench (Fig. 5). Each node performs a certain function, such as reading the data, filtering, modeling, or visualization. Nodes have input and output ports; most ports send and receive data, whereas some handle data models, such as classification trees. Unlike nodes in Weka's KnowledgeFlow, the different types of ports are clearly marked, relieving the beginner of the guesswork of what connects where.
Typical nodes in KNIME's workflows have two dialog boxes, one for configuring the algorithm or a visualization and the other for showing its results (Fig. 6). Each node can be in one of three states, depicted with a traffic-light display: it can be disconnected, not properly configured, or lacking the input data (red); be ready for execution (amber); or have finished the processing (green). A nice feature called HiLite (Fig. 7) allows the user to select a set of instances in one node and have them marked in any other visualization in the current application, in this way further supporting exploratory data analysis.
Fig. 7. KNIME HiLiteing (see Fig. 5), where the instances from the selected classification tree node are HiLited and marked in the scatterplot.
Orange
Orange (http://www.ailab.si/orange) is a data mining suite built using the
same principles as KNIME and Weka KnowledgeFlow. In its graphical
environment called Orange Canvas (Fig. 8), the user places widgets on a canvas
and connects them into a schema. Each widget performs some basic function, but unlike in KNIME, with its two data types (models and sets of instances), the signals passed around Orange's schemata may be of different types and may include objects such as learners, classifiers, evaluation results, distance matrices, dendrograms, and so forth. Orange's widgets are also coarser than KNIME's nodes, so typically a smaller number of widgets is needed to accomplish the same task. The difference is most striking in setting up a cross-validation experiment, which is much more complicated in KNIME, but with the benefit of giving the user more control over the details of the experiment, such as separate preprocessing of the training and testing example sets.
Fig. 8. Snapshot of the Orange canvas. The upper part of the schema, centered around Test Learners, uses cross-validation to compare the performance of three classifiers: naïve Bayes, logistic regression, and a classification tree. Numerical scores are displayed in Test Learners, with evaluation results also passed on to ROC Analysis and Calibration Plot, which provide means to graphically analyze the predictive performance. The bottom part contains a setup similar to that in KNIME (see Fig. 5): the data instances are split into training and test sets. Both parts are fed into Test Learners, which, in this case, requires a separate test set and tests a classification tree built on the training set that is also visualized in Classification Tree Graph. Linear Projection visualizes the training instances, separately marking the subset selected in the Classification Tree Graph widget.
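For readers who prefer scripting to visual programming, the sketch below reproduces the spirit of the comparison in Fig. 8 in Python; it uses scikit-learn rather than Orange's own scripting interface, and the data set and parameters are arbitrary choices made for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
learners = {"naive Bayes": GaussianNB(),
            "logistic regression": LogisticRegression(max_iter=1000),
            "classification tree": DecisionTreeClassifier(max_depth=5)}
for name, learner in learners.items():
    # Cross-validate each learner and report accuracy and area under the ROC curve.
    acc = cross_val_score(learner, X, y, cv=10, scoring="accuracy").mean()
    auc = cross_val_score(learner, X, y, cv=10, scoring="roc_auc").mean()
    print("%-20s accuracy=%.3f AUC=%.3f" % (name, acc, auc))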
Besides friendliness and simplicity of use, Orange's strong points are a large number of different visualizations of data and models, including intelligent search for good visualizations, and support of exploratory data analysis through interaction. In a concept similar to KNIME's HiLiteing (yet subtly different from it), the user can select a subset of examples in a visualization, in a model, or with an explicit filter, and pass them to, for
instance, a model inducer or another visualization widget that can show
them as a marked subset of the data (Fig. 9).
Orange is weak in classical statistics; although it can compute basic statistical properties of the data, it provides no widgets for statistical testing. Its
Fig. 9. The linear projection widget from Orange displaying a two-dimensional projection of data, where the x and y axes are a linear combination of feature values whose components are delineated with feature vectors. Coming from the schema shown in Fig. 8, the points corresponding to instances selected in the classification tree are filled, and those not in the selection are open.
Fig. 10. Scatterplot, a matrix of scatterplots and parallel coordinates as displayed by GGobi.
The instances selected in one visualization (scatterplot, in this case) are marked in the others.
Fig. 11. GGobi's Grand Tour shows a projection similar to the Linear Projection in Orange (see Fig. 9) but animates it by smoothly switching between different interesting projections, which gives a good impression of the positions of the instances in the multidimensional space.
Summary
Today's state-of-the-art open-source data mining suites have come a long way from where they were only a decade ago. They offer nice graphical interfaces, focus on usability and interactivity, and support extensibility through augmentation of the source code or (better) through interfaces for add-on modules. They provide flexibility through either visual programming within graphical user interfaces or prototyping by way of scripting languages. Major toolboxes are well documented and use forums or discussion groups for user support and exchange of ideas.
The degree to which all of the above is implemented of course varies from one suite to another, but the packages we have reviewed in this article address most of these issues, and we could not find a clear winner that supports all of the aspects in the best way. For a medical practitioner or biomedical researcher starting with data mining, the choice of the right suite may be guided by the simplicity of the interface, whereas for research teams the choice of implementation or integration language (Java, R, C/C++, Python, and so forth) may be important. Regarding the wish list of data mining techniques, we find that all packages we have reviewed (with the exception of GGobi, which focuses on visualization only) cover most of the standard data mining operations, ranging from preprocessing to modeling, with some providing better support for statistics and others for visualization.
There are many open-source data mining tools available, and our intention was only to demonstrate the ripeness of the field through exemplary
References
[1] Fayyad U, Piatetsky-Shapiro G, Smyth P, et al, editors. Advances in knowledge discovery
and data mining. Menlo Park (CA): AAAI Press; 1996.
[2] Quinlan JR. C4.5: programs for machine learning. San Mateo (CA): Morgan Kaufmann
Publishers; 1993.
[3] Michalski RS, Kaufman K. Learning patterns in noisy data: the AQ approach. In: Paliouras
G, Karkaletsis V, Spyropoulos C, editors. Machine learning and its applications. Berlin:
Springer-Verlag; 2001. p. 22-38.
[4] Clark P, Niblett T. The CN2 induction algorithm. Machine Learning 1989;3:261-83.
[5] Asuncion A, Newman DJ. UCI Machine Learning Repository. Available at: http://www.ics.uci.edu/~mlearn/MLRepository.html. Accessed April 15, 2007. Irvine (CA): University of California, Department of Information and Computer Science; 2007.
[6] Wall L, Christiansen T, Orwant J. Programming Perl. 3rd edition. Sebastopol (CA): O'Reilly Media, Inc.; 2000.
[7] Kohavi R, Sommerfield D, Dougherty J. Data mining using MLC++: a machine learning library in C++. International Journal on Artificial Intelligence Tools 1997;6:537-66.
[8] Brunk C, Kelly J, Kohavi R. MineSet: an integrated system for data mining. In: Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining. Menlo Park (CA). p. 135-8.
[9] Witten IH, Frank E. Data mining: practical machine learning tools and techniques with Java implementations. 2nd edition. San Francisco (CA): Morgan Kaufmann; 2005.
[10] Zupan B, Holmes JH, Bellazzi R. Knowledge-based data analysis and interpretation. Artif Intell Med 2006;37:163-5.
[11] Bellazzi R, Zupan B. Predictive data mining in clinical medicine: current issues and guidelines. Int J Med Inform 2006; in press.
[12] Cios KJ, Moore GW. Uniqueness of medical data mining. Artif Intell Med 2002;26:1-24.
[13] Becker RA, Chambers JM. S: an interactive environment for data analysis and graphics. Pacific Grove (CA): Wadsworth & Brooks/Cole; 1984.
[14] Hoffman PE, Grinstein GG, Marx KE. DNA visual and analytic data mining. In: Proceedings of IEEE Visualization. Phoenix (AZ); 1997. p. 437-41.
[15] Asimov D. The grand tour: a tool for viewing multidimensional data. SIAM J Sci Statist Comput 1985;6:128-43.
Data mining requires an underlying data set and these data may be
acquired, stored, and managed in multiple ways. As this data set increases
in volume, data mining techniques generally become more effective and useful. Medicine is becoming well-positioned to take advantage of the capabilities of data mining; there is a tremendous wealth of largely untapped
clinical data available in the operational clinical information systems of laboratories, hospitals, and clinics around the world, and the volume of these
data is increasing rapidly as medical centers adopt electronic medical
records. An important initial step toward the most effective mining of these data for biomedical and translational research is the development of enterprise clinical data warehouses that offload information from production systems into separate, fully integrated databases optimized for performing
population-based queries. Without such systems, mining of production clinical data requires complex and time-consuming preparatory data aggregation and processing steps on a project-by-project basis. Because most
health care data have potential value for multiple research endeavors, the
benefits of developing and maintaining large-scale multipurpose enterprise
data warehouses can be considerable. In this article, we present an introduction to clinical data warehouses, highlight examples from the literature in
* Corresponding author. Division of Clinical Informatics, Department of Public Health
Sciences, Suite 3181 West Complex, 1335 Lee Street, University of Virginia Health System,
Charlottesville, VA 22908.
E-mail address: lyman@virginia.edu (J.A. Lyman).
0272-2712/08/$ - see front matter 2008 Elsevier Inc. All rights reserved.
doi:10.1016/j.cll.2007.10.003
labmed.theclinics.com
which they support data mining, and describe specific issues and challenges
related to their development and use.
Introduction to data warehouses
Health care data warehouses, which integrate data from multiple operational systems and provide longitudinal, population-based views of health
information, are becoming increasingly common [1-9]. One challenge in discussing these systems is the lack of consensus about the appropriate terminology. The terms data repository, data warehouse, knowledge warehouse,
and information warehouse have all been used in the academic and business
communities to describe databases that are distinct from production systems
(eg, electronic health records) and exist to support analytic processing and
strategic decision making (in the business world) and biomedical research
(in the health care sector). There is ambiguity in the terminology, however,
because the term clinical data repository is often used to describe a production clinical system designed to display integrated patient data to health
care providers for the purposes of patient care [10].
A data warehouse, in general terms, can be defined as "a copy of transaction data specifically structured for query and analysis" [11]. Transaction
data, as it pertains to health care organizations, may include clinical laboratory results derived directly from a laboratory information system, medication orders, textual documents (eg, discharge summaries, radiology reports,
surgical pathology reports), and administrative claims data, to name a few.
Although small data warehouses may serve primarily to structure production data from individual systems for analytic purposes, the most useful
warehouses incorporate data from multiple traditional production systems.
Integrating these disparate data into a common system that provides longitudinal views of retrospective clinical data is the essence of a clinical data
warehouse. A related concept, the clinical data mart, is a more specialized
version of a clinical data warehouse with a subset of data pertinent to a particular setting or topic [12,13].
Creating a copy of transaction data in a separate integrated database
usually yields significant performance benefits for research queries. Production clinical systems are optimized for high speed, continuous updating of
individual patient data elements, and individual patient queries in small transactions. These types of systems are generally referred to as online transaction
processing (OLTP) systems and their optimization for processing large
numbers of small transactions at high speed introduces important constraints
on their internal data models [14]. In contrast to production system transactions, researchers' queries may be complex, involving long time periods or
multiple conditions and patients, and may return substantial data sets.
Clinical production systems are not designed for these types of queries and often respond sluggishly, yielding poor performance with research queries and
potentially compromising the speed of concurrent clinical care processes as
an unwanted side effect. Systems whose data models are structured to support
data analysis are generally termed online analytical processing (OLAP)
systems to distinguish them from OLTP systems [14]. Systems designed for
OLAP, such as data warehouses, are also typically nonvolatile: data are intermittently added and never changed. Because updating performance is not
critical, data additions are batch-scheduled during slow use periods and
include many data types on multiple patients [15]. Consequently, the underlying data model can be optimized to maximize the performance of the types
of complex queries across populations that are characteristic of research.
Data warehouses as sources for data mining initiatives
In recent years, there have been several published examples of the use of
data warehouses for data mining efforts related to clinical investigation.
Researchers have used these databases for factor analysis to identify patients
at risk for suboptimal outcomes, to explore associations between diseases
and clinical findings, for improvement of clinical information systems, and to examine disease-disease and disease-procedure associations.
Prather and colleagues [16] describe the development of a clinical data
warehouse at Duke University Medical Center and its subsequent use for
a data mining project intended to identify factors associated with increased
risk for preterm birth. Data on more than 45,000 patients were transferred
from the perinatal database of their organization's electronic medical record
system to a separate relational database designed for research. In a preliminary analysis restricted to a 2-year time period, multiple queries against the
data warehouse were used to create a data set of 3902 patients. During this
process, variables were cleansed to correct erroneous or discrepant values
and formatting inconsistencies. Once the final data set was created, factor
analysis was performed. For patients who met study criteria, more than
91% of the data values were usable, and three latent factors were identified
that explained 48.7% of the variance in the data. In a subsequent larger
analysis of 19,970 patients with 1,622 variables per patient, seven predictor variables for preterm birth risk were identified, yielding an area under the receiver operating characteristic curve of 0.72 [17].
In another example, researchers at Columbia University used co-occurrence statistics to discover associations between the presence of diseases
and clinical findings [18]. Such associations, they believed, could improve
the usefulness of an automated problem list summarization tool used in
their electronic medical record system by allowing the removal of redundant, clinically non-informative findings (eg, chest pain in a patient who had myocardial infarction). They compared two methods, chi-square and the proportion confidence interval (PCI), assessing their respective abilities to detect clinically recognized disease-finding associations in discharge summaries. Using the former technique, 94% of the associations that were identified were believed to be true associations as judged by expert physicians,
whereas 77% of the associations found using the PCI approach were believed to be correct. Although the purpose of their knowledge discovery effort was to improve the usefulness and accuracy of an automated problem list generator, they acknowledged the potential usefulness of their approach for identifying novel associations between disease-disease, disease-medication, disease-procedure, and so forth.
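The exact statistics used in that study are not reproduced here; the Python fragment below merely illustrates the general idea of testing a disease-finding co-occurrence with a chi-square statistic on a 2 x 2 contingency table, with document counts that are entirely fabricated.

from scipy.stats import chi2_contingency

# Rows: finding present / absent; columns: disease present / absent (made-up counts).
table = [[120,   80],     # e.g. "chest pain" noted, with / without myocardial infarction
         [300, 4500]]     # "chest pain" not noted
chi2, p_value, dof, expected = chi2_contingency(table)
print("chi-square = %.1f, p = %.2g" % (chi2, p_value))
if p_value < 0.01:
    print("the finding and the disease co-occur more often than expected by chance")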
The identification of such associations was the focus of another data mining effort using the University of Virginia's Clinical Data Repository,
an enterprise-wide data warehouse developed to support clinical investigation. Mullins and colleagues [19] described a collaborative endeavor between
researchers at the University of Virginia, the Virginia Commonwealth
University, and IBM Life Sciences, in which a data set with 667,000 patients
was mined using three different unsupervised methods to identify potentially
interesting disease associations. Results were compared with automated
searches of the biomedical literature in an effort to distinguish between associations that were well established versus those that might represent previously unknown relationships. The analysis identified multiple associations of both types, including congestive heart failure-valvular disease-hypertension (a well-known association) and albuterol-tracheostomy-magnesium (an association not found in the biomedical literature).
documented. In the context of building a data warehouse for clinical investigation and data mining, this can be challenging because of the diverse,
ever-changing nature of biomedical research. It is difficult to anticipate at
the outset the breadth and depth of information that will ultimately be
required. This is in contrast to business-oriented applications, in which there
are typically a small number of focused questions that a data warehouse is
intended to support.
Murphy and colleagues [7] approached the requirements process by
retrospectively examining queries of an existing clinical information system
to identify particular data types that were of most value. In their study,
coded diagnoses and medications accounted for more than 90% of queries,
and this information was used to optimize their data model for these types
of queries. This approach has benefits but is also limited because historical
queries can only use information that was readily available at the time, and
by definition cannot represent data or types of queries that may become
available in a new system.
By practical necessity, the development of a clinical data warehouse is often largely driven by the data that are electronically available at the time of
its creation. Universally collected data, at least in the United States, include
administrative information necessary to support billing and governmental
reporting requirements. These include coded diagnoses, procedures, demographics, discharge disposition, and payer information. Although there is
debate about whether the accuracy of such administrative data is sufficient
for clinical research or quality assessment [20,21], this type of information
has been used successfully for scores of clinical and health services research
projects. Because clinical laboratory results are typically stored in electronic
format at most institutions and are important indicators of diagnosis,
disease progression, and response to therapy, this information along with
administrative data often form the core around which a clinical data
warehouse is constructed. Other data, including medications (ordered or administered), vital signs, monitoring data (eg, EKG), or textual reports, are
often useful for research studies but are less commonly available electronically. Patient identifiers, although perhaps not necessary for many research
projects, are still important for linking data from multiple systems over time,
and their inclusion has significant ramifications for the security requirements
of the data warehouse.
Database design includes not only requirements gathering and the identification of desired data elements but also the adoption or development of
the data model that will be used. The data model serves as the blueprint for
database construction and ultimately determines how data will be organized by the database management system. Data models for data warehouses tend to be fundamentally different from those developed for
transactional (OLTP) databases that support, for example, electronic medical records and other production systems. Analytic (OLAP) systems typically use a multidimensional data model that is implemented in one of two
Fig. 1. Star schema for clinical laboratory data. Star schemata are characterized by a central
fact table (Clinical Lab Result) containing a minimal representation of an event, such as a laboratory test, with links to multiple surrounding dimension tables holding the detailed data on
the characteristics of the events. FK, foreign key, which points to a primary key in another table; PK, primary key.
A fact table focused on laboratory test results might contain the following fields: patient identifier, date/time identifier, laboratory test identifier, specimen type, and numeric result (see Fig. 1). If a patient had multiple serum glucose tests, each result would be stored in a row in the table. In this example, a single fact refers to a specific result for a specific laboratory test that was obtained for a specific patient at a specific point in time. The identifiers function as links to the dimension tables: a patient dimension,
time dimension, and laboratory test dimension, where more detailed information about each of those elements would be stored. Star schemata
make it easy to add new dimensional characteristics. For example, if one
of the dimensions is time, then the dimension table would allow you to
group the time of events recorded in the fact table in different ways: by
year, quarter, month, day of week, season, weekend indicator, holiday indicator, and so forth. A patient age dimension might be constructed to easily
allow data to be queried or analyzed by adult versus pediatric, age decade,
or some other locally defined age group.
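A minimal sketch of such a star schema, written here in Python against SQLite, is shown below; the table and column names are illustrative rather than those of any particular warehouse, and the single query simply groups facts by attributes of the time dimension.

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE patient_dim (patient_key INTEGER PRIMARY KEY, sex TEXT, birth_year INTEGER);
CREATE TABLE time_dim    (time_key INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER,
                          day_of_week TEXT, weekend INTEGER);
CREATE TABLE test_dim    (test_key INTEGER PRIMARY KEY, test_name TEXT, loinc_code TEXT);
CREATE TABLE lab_fact    (patient_key INTEGER REFERENCES patient_dim,
                          time_key    INTEGER REFERENCES time_dim,
                          test_key    INTEGER REFERENCES test_dim,
                          specimen_type TEXT, numeric_result REAL);
""")
# Example analytic query: mean glucose result by quarter across the population.
query = """
SELECT t.year, t.quarter, AVG(f.numeric_result) AS mean_result, COUNT(*) AS n
FROM lab_fact f
JOIN time_dim t ON f.time_key = t.time_key
JOIN test_dim d ON f.test_key = d.test_key
WHERE d.test_name = 'Serum glucose'
GROUP BY t.year, t.quarter;
"""
for row in db.execute(query):
    print(row)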
idiosyncrasies, mining the data is difficult unless they are processed into an appropriate form and collected into a single database. One of the most important functions of data warehouses is to aggregate these data in a form that is appropriate for mining. Several steps are required to adequately organize and process the data before loading them into the data warehouse: extraction, filtering and transformation (sometimes referred to as data cleansing), classification, and aggregation. The ultimate goal is to standardize the information from disparate systems and transform it so that it is useful for mining.
Extraction
Data may be extracted from production systems as files or may be received in real time through existing hospital interfaces (Fig. 2). Real-time
data are often communicated through an interface engine using the HL7
messaging standard [32], and there are general purpose HL7 parsers (commercial and open source) that can extract data from HL7 feeds. Nonstandard real-time formats require custom programming. Alternatively, some
data sources (particularly those focused on administrative requirements,
such as billing) may deliver data as batch files 1 to 2 months after hospital discharge. These batch files can be parsed and processed using standard text
parsing or database import libraries.
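The fragment below is a deliberately simplified illustration of pulling one laboratory result out of an HL7 version 2 OBX segment by splitting on the standard field (|) and component (^) separators; the message content is fabricated, and a production interface would rely on a full HL7 parser with proper error handling.

# Simplified HL7 v2 OBX handling: split the segment into fields and read out
# the observation identifier, value, and units. Real feeds need a real parser.
segment = "OBX|1|NM|2345-7^GLUCOSE^LN||105|mg/dL|70-99|H|||F"
fields = segment.split("|")
if fields[0] == "OBX":
    test_code, test_name = fields[3].split("^")[0:2]   # OBX-3: observation identifier
    value = fields[5]                                  # OBX-5: observation value
    units = fields[6]                                  # OBX-6: units
    print(test_code, test_name, value, units)          # -> 2345-7 GLUCOSE 105 mg/dL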
Filtering
Not all of the data provided by source systems are desirable or accurate.
Some of the information may have no potential clinical research benet and
may be removed (eg, inventory data or clerical information about how reports were transcribed). Data related to patient identity (eg, social security
number, insurance policy numbers), which are usually required for accurate
linkage of records from multiple systems, need special handling. Data errors
are a perpetual challenge and the optimal approach to them is to identify
errors at the time of data acquisition and filter them out. For example, invalid data or data with related missing critical values may need to be eliminated. Decisions must be made on which data will be filtered versus which
will be transformed into missing or unknown values. For example, if a lab
result is received with a date that is in the future, should the record be eliminated, should it be loaded with the bad date, or should the date be changed
to a best guess or unknown value? Each approach has its advantages and
disadvantages, and the answer should be determined by the goals and requirements of the data warehouse.
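The toy fragment below shows the kind of policy decision involved: a record whose result date lies in the future can be dropped, loaded unchanged, or loaded with the date marked unknown. The record and field names are hypothetical.

from datetime import date

record = {"patient_id": "A001", "test": "Serum glucose", "result_date": date(2099, 3, 23)}

if record["result_date"] > date.today():
    # Option 1: drop the record entirely (record = None).
    # Option 2: keep the record but flag the date as unknown, as done here.
    record["result_date"] = None
    record["date_quality"] = "invalid-future"
print(record)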
Transformation
It may be useful to force some data to fit predetermined standard values. For example, a patient's gender may be represented in one data source as "M" or "F" but in another as "1" or "2." Some systems provide patient names in a non-atomic format that combines first, middle, and last name all into one field. Transformation should yield consistent data
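A small sketch of such value standardization follows: source-specific gender codes are mapped onto a single convention, and a combined name field is split into atomic parts. The mappings and name formats are illustrative only.

# Map source-specific codes onto one convention and atomize a combined name field.
GENDER_MAP = {"M": "M", "F": "F", "1": "M", "2": "F"}   # source codes -> warehouse standard

def transform(record):
    record["gender"] = GENDER_MAP.get(record.get("gender", ""), "U")   # U = unknown
    if "name" in record:   # e.g. "SMITH^JOHN^Q" or "Smith, John Q"
        parts = [p.strip() for p in record.pop("name").replace(",", "^").split("^")]
        record["last_name"] = parts[0]
        rest = parts[1].split() if len(parts) > 1 else []
        record["first_name"] = rest[0] if rest else ""
        record["middle_name"] = " ".join(rest[1:])
    return record

print(transform({"gender": "2", "name": "Smith, Jane Q"}))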
Fig. 2. Data flow associated with the University of Virginia Clinical Data Repository (CDR). Data are derived from clinical production systems (left) as HL7 messages or batch files. Messages pass through an HL7 interface engine (1) to the CDR PHI database in a secure environment (top). PHI (HIPAA Safe Harbor data elements) are stripped and stored there, and the
remaining limited data set is passed to a set of staging databases (middle) where data are
held between batch updates of the main CDR database (CDR DB). During this time, data corrections are passed from the production system to the PHI and staging databases, and the data
are filtered and transformed as necessary. An interface (2) is available in the secure CDR environment that allows CDR staff access to the PHI and de-identified databases, and also executes prebuilt queries for quality assurance studies or researchers who have permission to access patient identity data (see below). CDR users (bottom right) working with de-identified data or limited data sets with IRB permission query the database directly through an SSL-secured Web interface (3) located with the de-identified databases in a controlled-access environment (middle) that is restricted by user IP address and account name/password. When users who have permission to view PHI extract data, a call is passed from interface 3 to interface 2 to run an appropriate prebuilt query, and combined data from the de-identified and PHI databases are
returned. There is no direct external user access into the secure CDR environment. The system
also provides a separate database in the secure environment for PHI for external patients (xID
DB, upper right) with capabilities similar to the primary PHI database, to accommodate, for
example, outside data from multicenter trials. The CDR is implemented as MySQL databases
running in Linux, and Web access is provided by way of JDBC from custom Java servlets running in the Apache Tomcat environment.
essential for linking records derived from disparate systems and for maintaining longitudinal records over time. At the University of Virginia, our
data warehouse, the Clinical Data Repository (CDR, see Fig. 2), addresses
this task by separating direct identifiers, such as name, medical record number, and social security number, and storing this information in a distinct
database on its own highly protected secure server, accessible only to members of the CDR project team [4]. Date of birth, gender, and race are also
stored on this server because they are sometimes useful for resolving ambiguous matches when linking records. Our system assigns a disguised identifier
for each new patient, which links clinical records for that patient within the
database that is accessible to our research users. This disguised identifier is mapped to the identifiable data in the highly secure database and serves as
the master link between the two systems. The names, dates of birth, medical
record numbers, and so forth can thus be omitted from the accessible research server but are available when necessary for record linking and for
research purposes with appropriate Institutional Review Board permission.
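The following toy fragment illustrates the general idea of a disguised identifier (it is not the CDR's actual implementation): direct identifiers stay in a separate protected store, and research tables carry only an opaque surrogate key that can be mapped back through that store when permitted.

import secrets

phi_store = {}   # protected store: disguised id -> identifying data
id_index = {}    # protected store: medical record number -> disguised id

def disguised_id(mrn, name, dob):
    # Assign (or reuse) an opaque surrogate key for a patient.
    if mrn not in id_index:
        new_id = "P" + secrets.token_hex(8)
        id_index[mrn] = new_id
        phi_store[new_id] = {"mrn": mrn, "name": name, "dob": dob}
    return id_index[mrn]

research_row = {"patient": disguised_id("123456", "Jane Smith", "1950-01-01"),
                "test": "Serum glucose", "result": 105}
print(research_row)   # contains no direct identifiers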
The HIPAA privacy rule specifies a list of 18 data elements that must be removed for a database to be considered de-identified [37]. A data warehouse that maintains a de-identified database for direct user access has security and
convenience advantages that may be preferable to requiring various users to
obtain permission to access a system that contains patient identiers.
Removal of HIPAA identifiers may not prevent identification of patients
who have rare conditions or rare combinations of conditions, however. In
these cases, queries may return very small groups or single patients [38].
For this reason, most clinical data warehouses that implement investigator
access to de-identified data return only summary data if the data set resulting from a query contains fewer than a defined minimum number of patients.
Although many of the data elements that HIPAA includes as potentially
identifiable are not needed at the user query stage, temporal information is
often important. The HIPAA privacy rule prohibits specic health-related
dates (other than year) in a de-identified data set, so disguising dates in a way that preserves their research usefulness is beneficial. The use of date offsets, the number of days (or hours, minutes, and so forth) between events, is one way to address researchers' needs for temporal constraints on query
conditions. Typically the absolute date is not required for research, but it is
necessary to determine whether events occurred within some specified time
period (eg, readmission to the hospital within 30 days of discharge, or the
use of a medication within 24 hours of hospital admission). The use of offsets requires the identification of a time zero for each patient; our data warehouse uses the date/time of the first event for a patient as a starting point, and all offsets are calculated based on that reference. The raw dates can be stored in the highly secure database containing the identifiable data set. This approach, although having the benefit of improved confidentiality, does have important limitations. Strictly adhering to this method
means that a table storing outpatient clinic visits could not include the
date of the visit, just the year and the number of days between that visit and
the patient's first event in the database. A user query aimed at identifying specific seasonal trends in hospitalized patients, for example, would require links back to the identifiable data to find the correct dates of origin for the offset calculations.
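A short sketch of the offset approach follows, with fabricated dates: the patient's first recorded event serves as time zero, and only day offsets are retained, which still supports questions such as readmission within 30 days of discharge.

from datetime import date

events = [("admit", date(2006, 1, 10)), ("discharge", date(2006, 1, 15)),
          ("readmit", date(2006, 2, 3))]

time_zero = min(d for _, d in events)                       # patient's first event
offsets = [(label, (d - time_zero).days) for label, d in events]
print(offsets)                                              # [('admit', 0), ('discharge', 5), ('readmit', 24)]

discharge = dict(offsets)["discharge"]
readmit = dict(offsets)["readmit"]
print("readmitted within 30 days:", 0 < readmit - discharge <= 30)   # True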
The challenges of textual data
Some of the richest, most clinically detailed information in the medical
record is stored as text. This information includes consultative reports (eg,
pathology, radiology), operative notes, progress notes, discharge summaries, and also some clinical laboratory results, including microbiology. These
data are often valued by researchers but present the data warehouse developer with multiple challenges, including ensuring confidentiality, classifying
data correctly, and providing useful query methods.
Confidentiality and textual data
As opposed to coded clinical/administrative data (eg, diagnoses, procedures, medications, and so forth) or numeric clinical laboratory tests, data
in textual fields are much more susceptible to the inclusion of identifiable data, because the data are entered in a free text format, and the identifiers
may be clinically and operationally important for communication between
members of the health care team. A surgical pathology report typically includes phrases that describe how specimens are labeled, and may explicitly
identify patient name and medical record number. A textual field in a laboratory result might include the phrase, "Result called at 3/23/2006 9:12 AM to Dr. Smith," or a reference to a particular health care facility. Although access to these reports might be allowable for researchers who have IRB approval to review identifiable data, the ultimate goal is to provide as much de-identified information as possible so that researchers may work without identities unless they are truly necessary. The automated de-identification,
or scrubbing, of textual reports is an active area of medical informatics research, and there are increasing numbers of available tools to accomplish
this [35,39,40]. A detailed discussion is beyond the scope of this article
but two approaches described in the literature include (1) automated extraction of accepted medical terms, which are stored in the de-identified database in lieu of the textual report [35], and (2) removal of identifiable data from the corpus of the text, leaving behind a report that is ostensibly clinically detailed but free from any information that might make the patient's
identity known [39,40].
Querying textual data
A common query into a clinical data warehouse might ask for all newly
diagnosed patients who have a certain form of cancer. Surgical pathology
reports are often a valuable source of information for this type of question,
but locating these cases means successfully searching potentially hundreds of
thousands of documents that are rife with abbreviations, homonyms, synonyms, and misspellings. The query challenges are, in essence, similar to those
that users face when searching the biomedical literature or even the World
Wide Web, and consequently similar approaches can be used to facilitate
successful queries. One of the commonly studied methods for addressing
this challenge is to use a computer-based auto-coding approach in which
the text is parsed and codes are assigned from a standard terminology,
such as the Unified Medical Language System Metathesaurus [35,41,42].
Currently, though, the large-scale use of standard terminologies for linkage
to, or replacement of, textual reports remains a future goal.
Performance issues
Except for very small databases, performance should be a major focus
when designing a relational clinical data warehouse. If the database will
be large, performance will be a significant issue and good database design
will be a particularly critical requirement for success.
One of the most significant features of a relational database system that
can enhance performance is the ability to place indexes on any data element
(or set of data elements). When searching the database or joining tables together, tremendous performance gains can be achieved by strategically adding indexes on those columns that will be used in database queries. Indexes
are separate database les that are maintained internally and contain a copy
of the indexed columns in sorted order along with pointers to corresponding
rows in the original data table. Indexes allow the database system to use
a fast binary search algorithm to find data and the original table rows
that contain it quickly.
Indexes, however, cannot usually be placed on every data element. They are only useful if the ratio of unique data values to the number of rows is relatively high. Creating indexes on data elements that contain few distinct values (such as gender or race) typically offers little benefit, and can actually be detrimental to query performance. There are times, therefore, when a query must find rows by doing a sequential read through the entire table.
In these cases, query speed is related to table size, so tables should be designed to be as lean as possible.
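The SQLite fragment below sketches this selectivity principle: indexes are created on high-cardinality columns that appear in query predicates and joins, while a low-cardinality column such as gender is deliberately left unindexed. Table and column names follow the earlier star-schema sketch and are illustrative.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE lab_fact (patient_key INTEGER, time_key INTEGER,
              test_key INTEGER, specimen_type TEXT, numeric_result REAL)""")
db.execute("CREATE INDEX idx_fact_test ON lab_fact (test_key)")        # useful: many distinct values
db.execute("CREATE INDEX idx_fact_patient ON lab_fact (patient_key)")  # useful for per-patient joins
# An index on a two-valued column such as gender would add maintenance cost
# with little or no benefit, so it is deliberately omitted.
print(db.execute("EXPLAIN QUERY PLAN SELECT * FROM lab_fact WHERE test_key = 42").fetchall())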
In the laboratory test result tables of our CDR, the lab test description
and unit of measure were extracted to separate tables. Some space was saved
because of the reduction in redundancy, but there was a larger reward in performance. Descriptions tend to be large, so they can significantly increase the
size of a table, which in turn adds to sequential search times. By extracting
the descriptions, we decreased the size of the table yielding faster sequential
search times. The added time required to extract the description from
data mining. Even in cases in which data are exported from a data warehouse so that they can be restructured into an appropriate format for mining, such tasks are orders of magnitude less time-consuming and resource-intensive than de novo collection and processing of data from multiple clinical operational systems. Furthermore, data mining software developers often incorporate relational database connectivity into their applications,
allowing mining directly against warehouses without the need for data export and transformation. These developments indicate that the nature of
biomedical research is evolving: we are entering an era in which large
amounts of clinical data in electronic form will be accessible to researchers
using well-designed analysis tools to pursue biomedical and translational
knowledge discovery.
References
[1] Dewitt JG, Hampton PM. Development of a data warehouse at an academic health system: knowing a place for the first time. Acad Med 2005;80:1019-25.
[2] Kamal J, Pasuparthi K, Rogers P, et al. Using an information warehouse to screen patients for clinical trials: a prototype. Proc AMIA Symp 2005;1004.
[3] Bock BJ, Dolan CT, Miller GC, et al. The data warehouse as a foundation for population-based reference intervals. Am J Clin Pathol 2003;120:662-70.
[4] Einbinder JS, Scully KW, Pates RD, et al. Case study: a data warehouse for an academic medical center. J Healthc Inf Manag 2001;15:165-75.
[5] Tusch G, Muller M, Rohwer-Mensching K, et al. Data warehouse and data mining in a surgical clinic. Stud Health Technol Inform 2000;77:784-9.
[6] Wisniewski MF, Kieszkowski P, Zagorski BM, et al. Development of a clinical data warehouse for hospital infection control. J Am Med Inform Assoc 2003;10:454-62.
[7] Murphy SN, Morgan MM, Barnett GO, et al. Optimizing healthcare research data warehouse design through past COSTAR query analysis. Proc AMIA Symp 1999;892-6.
[8] Verma R, Harper J. Life cycle of a data warehousing project in healthcare. J Healthc Inf Manag 2001;15:107-17.
[9] Berndt DJ, Hevner AR, Studnicki J. The CATCH data warehouse: support for community health care decision-making. Decision Support Systems 2003;35(3):367-84.
[10] Sittig DF, Pappas J, Rubalcaba P. Building and using a clinical data repository. Available at:
http://www.informatics-review.com/thoughts/cdr.html. Accessed April 23, 2007.
[11] Kimball R. The data warehouse toolkit. New York, NY: John Wiley & Sons, Inc.; 1996.
[12] McNamee LA, Launsby BD, Frisse ME, et al. Scaling an expert system data mart: more facilities in real-time. Proc AMIA Symp 1998;498-502.
[13] Brandt CA, Morse R, Matthews K, et al. Metadata-driven creation of data marts from an EAV-modeled clinical research database. Int J Med Inform 2002;65:225-41.
[14] Rob P, Coronel C. Database systems: design, implementation, and management. 7th edition.
Boston: Thomson/Course Technology; 2007.
[15] Inmon WH. Building the data warehouse. 4th edition. Indianapolis (IN): Wiley; 2005.
[16] Prather JC, Lobach DF, Goodwin LK, et al. Medical data mining: knowledge discovery in a clinical data warehouse. Proc AMIA Symp 1997;101-5.
[17] Goodwin LK, Iannacchione MA. Data mining methods for improving birth outcomes prediction. Outcomes Manag 2002;6:80-5.
[18] Cao H, Markatou M, Melton GB, et al. Mining a clinical data warehouse to discover disease-finding associations using co-occurrence statistics. Proc AMIA Symp 2005;106-10.
[19] Mullins IM, Siadaty MS, Lyman J, et al. Data mining and clinical data repositories: insights from a 667,000 patient data set. Comput Biol Med 2006;36:1351-77.
[20] Humphries KH, Rankin JM, Carere RG, et al. Co-morbidity data in outcomes research: are clinical data derived from administrative databases a reliable alternative to chart review? J Clin Epidemiol 2000;53:343-9.
[21] Iezzoni LI. Assessing quality using administrative data. Ann Intern Med 1997;127:666-74.
[22] Gorla N. Features to consider in a data warehousing system. Commun ACM 2003;46(11):
1115.
[23] Weber R, Schek H, Blott S. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: VLDB '98, Proceedings of the 24th International Conference on Very Large Data Bases; 1998. p. 194-205.
[24] Levene M, Loizou G. Why is the snowflake schema a good data warehouse design? Information Systems 2003;28(3):225-40.
[25] Nadkarni PM, Brandt C. Data extraction and ad hoc query of an entity-attribute-value database. J Am Med Inform Assoc 1998;5:511-27.
[26] Breen C, Rodrigues LM. Implementing a data warehouse at Inglis Innovative Services. J Healthc Inf Manag 2001;15:87-97.
[27] Murphy SN, Gainer V, Chueh HC. A visual interface designed for novice users to find research patient cohorts in a large biomedical database. Proc AMIA Symp 2003;489-93.
[28] Ledbetter CS, Morgan MW. Toward best practice: leveraging the electronic patient record as a clinical data warehouse. J Healthc Inf Manag 2001;15:119-31.
[29] Nigrin DJ, Kohane IS. Scaling a data retrieval and mining application to the enterprise-wide level. Proc AMIA Symp 1999;901-5.
[30] Corwin J, Silberschatz A, Miller PL, et al. Dynamic tables: an architecture for managing evolving, heterogeneous biomedical data in relational database management systems. J Am Med Inform Assoc 2007;14:86-93.
[31] Lyman JA, Scully K, Tropello S, et al. Mapping from a clinical data warehouse to the HL7 reference information model. Proc AMIA Symp 2003;920.
[32] HL7. Health level 7. Available at: http://www.hl7.org. Accessed June 15, 2007.
[33] Nardon FB, Moura LA. Knowledge sharing and information integration in healthcare using ontologies and deductive databases. Medinfo 2004;11(Pt 1):62-6.
[34] Khan AN, Griffith SP, Moore C, et al. Standardizing laboratory data by mapping to LOINC. J Am Med Inform Assoc 2006;13(3):353-5.
[35] Berman JJ. Concept-match medical data scrubbing. How pathology text can be used in research. Arch Pathol Lab Med 2003;127:680-6.
[36] National Institutes of Health. Clinical research and the HIPAA privacy rule. Available at:
http://privacyruleandresearch.nih.gov/clin_research.asp. Accessed June 18, 2007.
[37] Schell SR. Creation of clinical research databases in the 21st century: a practical algorithm for HIPAA compliance. Surg Infect (Larchmt) 2006;7(1):37-44.
[38] El Emam K, Jabbouri S, Sams S, et al. Evaluating common de-identification heuristics for personal health information. J Med Internet Res 2006;8(4):E28.
[39] Gupta D, Saul M, Gilbertson J. Evaluation of a deidentification (de-id) software engine to share pathology reports and clinical documents for research. Am J Clin Pathol 2004;121:176-86.
[40] Beckwith BA, Mahaadevan R, Balis UJ, et al. Development and evaluation of an open source software tool for deidentification of pathology reports. BMC Med Inform Decis Mak 2006;6:12.
[41] Nadkarni P, Chen R, Brandt C. UMLS concept indexing for production databases: a feasibility study. J Am Med Inform Assoc 2001;8:80-91.
[42] Hazlehurst B, Frost HR, Sittig DF, et al. MediClass: a system for detecting and classifying encounter-based clinical events in any electronic medical record. J Am Med Inform Assoc 2005;12:517-29.
[43] McDonald CJ, Dexter P, Schadow G, et al. SPIN query tools for de-identified research on a humongous database. Proc AMIA Symp 2005;515-9.
Multi-Database Mining
Mir S. Siadaty, MD, MS*,
James H. Harrison, Jr, MD, PhD
Division of Clinical Informatics, Department of Public Health Sciences, University of Virginia,
Suite 3181 West Complex, 1335 Hospital Drive Charlottesville, VA 22908, USA
* Corresponding author.
E-mail address: mirsiadaty@virginia.edu (M.S. Siadaty).
0272-2712/08/$ - see front matter 2008 Elsevier Inc. All rights reserved.
doi:10.1016/j.cll.2007.10.004
labmed.theclinics.com
Fig. 1. Flow chart for applying single and multi-database data mining techniques.
whether it is best to pool the data or mine the databases separately (see
Fig. 1, Box 1 versus 2). When the databases differ substantially in structure or data representation (see Fig. 1, Box B), the approach to mining is determined by whether logically related patterns and compatible measures of association strength can be defined across the set of databases. Logical
relationships between patterns do not require uniform data representation
or structure or the existence of identical patterns across the set of databases.
If a logical correspondence between patterns and their association strengths
cannot be created, data mining may be performed on one primary database
(see Fig. 1, Box 3), and the remaining databases may be used to provide supplemental data or annotation of the results. If a correspondence between
patterns in the databases can be established, multiple database data mining
techniques may be applied (see Fig. 1, Box 4). In this case, multiple database
mining can yield specific insights into the characteristics of the data set and
make unique interpretations possible. The dual mining method, which is
described later in this article, is an example of this scenario.
Methods specific to multi-database mining
Multi-database mining shares many technical features with single database mining, but it also includes some special requirements and methods.
Methods for single database mining are covered elsewhere in this issue.
This article briefly discusses some of the differences in methodology between
single and multi-database mining and then illustrates several multi-database
mining tasks.
Because databases vary in content and because there may be many candidate databases that could be used in a data mining project, multi-database
mining benefits from ranking databases based on their similarity to each
other and relevance to the problem at hand. Liu and colleagues [12] developed techniques for identifying and ranking relevant databases based on the
pertinence of the information in the databases to the planned analysis. They
argue that effective data mining from multi-databases should involve an
explicit selection phase before mining in which the databases are objectively
evaluated for relevance. Their method computes a relevance factor for each
candidate database from a listing of the data elements it contains that are
related to the intended mining project. Databases are then ranked and
selected for inclusion on the basis of their relevance factor. Zhang and
colleagues [13] also argue for the evaluation of databases before mining projects, and present a set of methods for database classication that are
application independent.
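The fragment below gives only a loose illustration of this idea (it does not reproduce the published formulas): each candidate database is scored by the fraction of the data elements relevant to the planned project that it actually contains, and the databases are then ranked by that score. All names and elements are invented.

# Rank candidate databases by a simple overlap-based relevance score.
project_elements = {"serum glucose", "hba1c", "diagnosis code", "medication", "age"}

candidate_databases = {
    "lab_db":      {"serum glucose", "hba1c", "age", "specimen type"},
    "pharmacy_db": {"medication", "dose", "age"},
    "billing_db":  {"diagnosis code", "procedure code", "payer"},
}

relevance = {name: len(elements & project_elements) / len(project_elements)
             for name, elements in candidate_databases.items()}
for name, score in sorted(relevance.items(), key=lambda kv: -kv[1]):
    print("%-12s relevance factor %.2f" % (name, score))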
Cleansing and preparation of data for mining are important in single
and multi-database mining. Because databases may differ substantially in
data structure, representation, and integrity, mining multiple databases usually requires each database to be prepared individually using techniques
appropriate for its characteristics. In addition, the data in each database
must be prepared or transformed as necessary such that the patterns discovered during mining are comparable as intended across databases.
When data mining in each database is complete, the patterns across all
databases are classified into one or more of four types [11]. Individual databases produce a set of findings termed local patterns, similar to patterns produced by a standard single database mining task. Some local patterns may
also be shared across most of the databases mined. These patterns are
termed high-voting patterns and generally have global implications. Other
local patterns may be prominent in a few or only one of the databases. These
exceptional patterns may highlight the unique characteristics of a database,
for example, features that are visible only from a particular perspective. Patterns that are present across multiple databases with a moderate frequency
slightly below that required for attention are termed suggestive patterns.
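A rough sketch of this classification follows; the thresholds used to separate high-voting, suggestive, and exceptional patterns are arbitrary values chosen for illustration, not those of the cited work.

# Classify local patterns by the share of databases in which they appear.
def classify_patterns(local_patterns, n_databases, high=0.8, suggestive=0.5):
    counts = {}
    for db_patterns in local_patterns:          # one set of patterns per database
        for p in db_patterns:
            counts[p] = counts.get(p, 0) + 1
    types = {}
    for p, c in counts.items():
        share = c / n_databases
        if share >= high:
            types[p] = "high-voting"
        elif share >= suggestive:
            types[p] = "suggestive"
        elif c == 1:
            types[p] = "exceptional"
        else:
            types[p] = "local"
    return types

dbs = [{"A->B", "C->D"}, {"A->B", "E->F"}, {"A->B"}, {"A->B", "C->D"}]
print(classify_patterns(dbs, len(dbs)))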
The dual mining method aims to solve the knowledge acquisition bottleneck for discovering useful or interesting patterns by automatically comparing the strengths of associations mined from a target database with the
strengths of corresponding associations mined from a relevant knowledge
base, for example, published biomedical literature. When the estimates of
the strength of an association do not match in the knowledge base and
target database, a high surprise score is assigned to that association to
identify it as potentially interesting. The surprise score captures the degree
of novelty or interestingness of mined patterns without the need for
a domain expert to evaluate the patterns by hand.
As a simple example of surprise scores, consider four patterns mined
from a target database for which the strengths of association, on a scale
of 0 to 1, are 0.09, 0.12, 0.97, and 0.84, respectively. Although patterns 3
and 4 are obviously the stronger ones, these patterns might already be
known and therefore not very interesting or useful. To determine which
of the patterns might be interesting, we estimate the strengths of the same
patterns in a pertinent knowledge base. Corresponding association strengths
of the patterns in the knowledge base are 0.08, 0.96, 0.84, and 0.12. Patterns
2 and 3 are the strong ones in the knowledge base. We argue for a model in
which patterns that are similarly associated in both the database and knowledge base (patterns 1 and 3) are less interesting than patterns that are
strongly associated in the knowledge base but not the database or vice versa
(patterns 2 and 4). Intuitively, a pattern with strong association in the database but weak association in the knowledge base may represent a discovery
with little current knowledge in its support, warranting further investigation. Likewise, pattern 2 appears to be well-established knowledge, but
the association estimated in the database is not large; therefore, pattern 2
also represents a surprise finding.
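A minimal sketch of this calculation for the four example patterns is shown below, using the absolute difference between the two association strengths as a stand-in for the surprise score; the published method additionally weights each comparison by the variance of the estimates.

```python
# Minimal sketch of a surprise score for the four example patterns above,
# computed simply as the absolute difference between the association
# strength in the target database and in the knowledge base (distance from
# the diagonal in Fig. 2). The variance weighting used in the published
# method is omitted here.

database_strength = [0.09, 0.12, 0.97, 0.84]
knowledge_strength = [0.08, 0.96, 0.84, 0.12]

surprise = [abs(d - k) for d, k in zip(database_strength, knowledge_strength)]

for i, score in enumerate(surprise, start=1):
    print(f"pattern {i}: surprise score {score:.2f}")
# Patterns 2 and 4 score highest (0.84 and 0.72), matching the discussion:
# they lie far from the diagonal and are flagged as potentially interesting.
```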
The scatterplot in Fig. 2 visualizes the relationship between strengths of
patterns in the database and strengths of patterns in the knowledge base.
The diagonal 45-degree line represents equal database and knowledge
base association strengths. According to our model, patterns close to the
diagonal are not of much interest to the user, whereas patterns that are
farther away from the diagonal are more likely to be interesting. The distance from the diagonal, rather than the strength of an association, defines its degree of interestingness. We define patterns that are located far from the diagonal as surprise patterns in dual mining. They are an example of the exceptional pattern in multi-database mining that was described in the previous section as occurring with substantially different frequency within a set of mined databases.
To apply the method, a target database is paired with a relevant knowledge base that contains facts and relationships describing the data items in the target. For example, the US National Library of Medicine's Medical Subject Headings (MeSH)-encoded MEDLINE database could represent a pertinent knowledge base applicable to a target database of clinical laboratory
Fig. 2. Scatterplot showing the association strength of four patterns (circles labeled 1–4) simultaneously mined from a database (x-axis) and knowledge base (y-axis). Patterns 1 and 3 are near the diagonal (similar association strengths in both data sets) and may represent uninteresting findings. Patterns 2 and 4 are far away from the diagonal (different association strengths in the data sets) and may be interesting. (Adapted from Siadaty MS, Knaus WA. Locating previously unknown patterns in data-mining results: a dual data- and knowledge-mining method. BMC Med Inform Decis Mak 2006;6:13; with permission.)
are also saved in the CDR. Several thousand different laboratory tests and diagnosis codes (ie, laboratory and disease concepts) in the CDR are observed with varying frequency. We chose as concepts to study those that appeared with high frequency in the CDR to ensure an adequate number of patterns for analysis, and we focused on concepts that could be defined and detected with certainty in the CDR. The latter are concepts that can be coded directly based on unambiguous criteria that are generally accepted clinically. The appearance of those codes in the CDR yields high confidence that the condition existed in the patient. After applying these filtering criteria, we obtained 96 disease concepts and 105 laboratory concepts.
Because dual mining is based on a comparison of the incidence of associations across databases, concepts in the databases must be represented consistently so that their associations can be correctly categorized. Data representation in the CDR, which uses aggregate disease classifications such as ICD-9, differs from that in MEDLINE. The latter codes articles with MeSH, a detailed biomedical vocabulary developed for the medical research literature. To reconcile this difference in data representation, we obtained the MEDLINE database (including citations, abstracts, and MeSH encoding) from the National Library of Medicine [18] and used the Unified Medical Language System [19] to map MeSH and free-text medical terminology (contents of titles and abstracts) to ICD diagnosis codes in the CDR.
Concepts in the CDR were generated from ICD codes (disease concepts), the presence or absence of laboratory tests, and laboratory results (separately classified as abnormal, elevated, and depressed). The laboratory result concepts were based on reference ranges stored with the test result in the CDR. To detect the 96 disease and 105 laboratory concepts in MEDLINE, we used two complementary approaches. In one approach, the textual description of the CDR ICD codes and the laboratory test concepts was used to identify corresponding text strings in MEDLINE titles, abstracts, and MeSH terms. In the second approach, the automatic term mapping capability of ReleMed, a publicly accessible search engine for MEDLINE (www.relemed.com), was used to dynamically generate additional term mappings at runtime [20]. ReleMed evaluates both the presence of and relationships between query terms in MEDLINE records.
We constructed all possible pairs of one disease concept and one laboratory test concept, resulting in 10,080 patterns. We additionally constructed a set of patterns containing each pair of diagnosis and laboratory concepts in gender (male, female) and race (black, white) subsets, for an additional 10,080 × 4 = 40,320 patterns. In total, we constructed 50,400 patterns containing associations of two to four concepts. Scripts written in Perl (www.perl.org) were used to scan the CDR database and MEDLINE citations for instances of these patterns. In the CDR, patterns were constrained to within patient visits. In MEDLINE, patterns were constrained to within articles (title, abstract, and full-text when available). In total, we scanned 27.5 million tests and diagnosis codes in the CDR (containing data from 9.4 million visits from 1993 to 2005) and 15.7 million MEDLINE citations.
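The sketch below illustrates the kind of within-visit co-occurrence counting that such scanning performs, written in Python for brevity rather than the Perl used in the study; the visit records, concept names, and the simple co-occurrence fraction are illustrative assumptions rather than the published statistic.

```python
# Illustrative sketch of counting how often a disease concept and a
# laboratory concept co-occur within the same patient visit. Visit records
# and concept names are hypothetical; association strength is reported as a
# plain co-occurrence fraction for simplicity.

from collections import Counter
from itertools import product

visits = [
    {"concepts": {"ventricular_fibrillation", "anemia", "low_serum_albumin"}},
    {"concepts": {"sleep_apnea", "thrombocytopenia"}},
    {"concepts": {"ventricular_fibrillation", "anemia"}},
]

disease_concepts = ["ventricular_fibrillation", "sleep_apnea"]
lab_concepts = ["anemia", "thrombocytopenia", "low_serum_albumin"]

pair_counts = Counter()
for visit in visits:
    for disease, lab in product(disease_concepts, lab_concepts):
        if disease in visit["concepts"] and lab in visit["concepts"]:
            pair_counts[(disease, lab)] += 1

for (disease, lab), count in pair_counts.most_common():
    print(f"{disease} + {lab}: {count}/{len(visits)} visits")
```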
Fig. 3 shows the correspondence between the association scores estimated in the CDR and MEDLINE for each pattern. Patterns with the 100 highest surprise scores are shown as larger circles. Note that some points are not in the top 100 even though they appear to be farther away from the diagonal line when compared with some circles. These patterns have scores with larger variances, making their surprise scores less significant [17]. As a preliminary evaluation of the ability of the surprise score to prune uninteresting patterns, we built a list of top patterns ranked according to the strength of their association in the CDR and compared it with a list of patterns ranked by surprise score. One would expect most strong associations in a clinical database to be previously described and thus not interesting. Consistent with this notion, 99% of patterns with strong associations in the database were eliminated by using the surprise score. The remaining 1% of associations may be of interest for more detailed follow-up. Table 1 shows a listing of ten representative patterns deemed interesting/surprising by the automated dual mining algorithm.
Fig. 3. Pairs of associations mined from the University of Virginia CDR database (x-axis) and
MEDLINE knowledge base (y-axis). The diagonal line represents uninterestingness as in
Fig. 2. The points appear non-homogeneously distributed because a weighted normalization
procedure was used. The graph depicts more than 50,000 data points represented as tiny
dots. The 100 patterns with the largest surprise scores are shown as larger circles in the upper
left and lower right corners of the graph. (Adapted from Siadaty MS, Knaus WA. Locating
previously unknown patterns in data-mining results: a dual data- and knowledge-mining
method. BMC Med Inform Decis Mak 2006;6:13; with permission.)
Table 1
Sample interesting mined patterns

Sex and race | Disease concept | Laboratory concept | rDbi | rKbi | SS rank
Fw | Nephritis | Hypercapnia | 0.571 | 0.95 | 5
Tot | Secondary hyperparathyroidism | Hypophosph(or|at)emia | 0.77 | 0.64 | 7
Fw | Ventricular fibrillation | Low serum albumin | 0.615 | 0.859 | 33
Fw | Ventricular fibrillation | Anemia | 0.564 | 0.772 | 45
Fb | Apnea | High serum albumin | 0.605 | 0.848 | 83
Fb | Ventricular fibrillation | Thrombocytopenia | 0.626 | 0.74 | 88
Fw | Ventricular tachycardia | Anemia | 0.521 | 0.785 | 91
Fw | Sleep apnea | Thrombocytopenia | 0.443 | 0.96 | 92
Fw | Glomerulonephritis | Hypercapnia | 0.553 | 0.867 | 93
Fw | Ventricular tachycardia | High serum albumin | 0.543 | 0.884 | 99

Abbreviations: Fw, female white; Fb, female black; rDbi, strength of association of the sex, race, disease, and laboratory concepts in the database (CDR); rKbi, strength of association of the concepts in the knowledge base (MEDLINE); SS rank, rank based on surprise score; Tot, all sex and race groups combined.
Data from Siadaty MS, Knaus WA. Locating previously unknown patterns in data-mining results: a dual data- and knowledge-mining method. BMC Med Inform Decis Mak 2006;6:13.
Although the surprise score indicates that these patterns occur at unexpected incidences in the database when compared with the knowledge base, further work is required to determine the extent to which they identify meaningful clinical associations or are useful in creating research hypotheses.
Summary
Data mining is often performed against data that were originally collected into multiple databases. In many cases, it is appropriate to integrate these databases for standard data mining using various data transformation and data fusion approaches, or to create a federated database across multiple contributing systems; however, multiple databases may provide multiple useful perspectives on a biomedical problem. Typical integrative approaches may lose the unique perspectives inherent in separate databases. Under appropriate conditions, these differing perspectives can be leveraged using multi-database mining techniques to yield valuable insights into a data set. The authors' dual mining approach is an example of this potential, in which the differing perspectives of related databases are used to identify interesting association patterns within the databases.
References
[1] Lussier YA, Liu Y. Computational approaches to phenotyping: high-throughput phenomics. Proc Am Thorac Soc 2007;4(1):18–25.
Fig. 1. State intervals of digoxin drug levels. A patient's serum digoxin measurements were used to infer periods of decreasing and increasing values. A period of normal levels can be inferred from the first eight values because they are all within the normal range (between the thin horizontal lines). The last three values are above the upper limit of normal and thus indicate that the patient had toxic levels during that time period. Intervals like these can be computed by temporal abstraction (see Fig. 4).
Fig. 2. Temporal relationships between intervals defined in Allen's temporal logic [8]. There are seven basic relationships and their inverses (not shown). In the case of Equals, the basic and inverse relationships are identical.
from the discovery of patterns in stock market data [20] to the identification
of anomalies in space shuttle telemetry [21]. The use of temporal data mining with clinical time sequences has emerged more recently [22] because
increasing volumes of patient data are being stored electronically [23].
Some kinds of clinical data (electrocardiogram [21] and intensive care unit
[24] data, in particular) have been found to have properties resembling these
business and engineering data sources and are thus amenable to similar
types of analyses.
Temporal data mining techniques in the medical domain [18] are generally designed either for exploration or for prediction. Exploratory methods
involve processing a database to identify groups of time series with similar
combinations of frequent intervals (see Fig. 1) and temporal relationships
(see Fig. 2). Clinical domain knowledge may be applied to these groups
or clusters to determine if they represent useful or previously unknown relationships between data types. Predictive techniques may target a diagnosis,
therapeutic response, or other clinical or patient care process, and search for
combinations of intervals that frequently occur with some temporal relationship to the target. These methods vary in the temporal features of
time sequences that they incorporate into pattern discovery and the ease
with which they can be configured for use with a particular data set. All of them also depend on effective techniques for measuring the similarity of time series.
Time series similarity measures
The distance metrics used by conventional data mining techniques to determine similarity between data elements can be applied to detecting similarity in some clinical time series. The most commonly used measure for this purpose is the Euclidean distance (Table 1) [25], which computes the square root of the sum-squared differences between sequential pairs of points in two series. Euclidean distance requires each time series to have the same number of values, or new values must be interpolated into the time series to equalize their lengths. Although interpolation preserves the general profile of a time sequence, it distorts the speed with which the profile changes value (ie, the local slope of the graph) and assumes that additional values are accurately predictable from existing data. Euclidean distance is not sensitive to data order, hence randomly shuffling the elements of two time sequences would not change their Euclidean distance, and the difference between time series with significant variation tends to be underestimated. Euclidean distance also disregards the durations of the time spans between data elements, is sensitive to outliers, and poorly estimates the difference between short time series. Results are thus usually unsatisfactory for clinically relevant comparisons between medical time series.
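The following sketch shows the basic calculation and the interpolation step needed when series lengths differ; the glucose values are hypothetical, and the caveats above about interpolation and temporal spacing still apply.

```python
# Minimal sketch of the Euclidean distance between two time series, with
# naive linear interpolation to equalize lengths when needed. Values are
# hypothetical serial glucose measurements.

import numpy as np

def resample(series, length):
    """Linearly interpolate a series onto `length` evenly spaced points."""
    old_x = np.linspace(0.0, 1.0, len(series))
    new_x = np.linspace(0.0, 1.0, length)
    return np.interp(new_x, old_x, series)

def euclidean_distance(a, b):
    n = max(len(a), len(b))
    a_r = resample(np.asarray(a, dtype=float), n)
    b_r = resample(np.asarray(b, dtype=float), n)
    return float(np.sqrt(np.sum((a_r - b_r) ** 2)))

glucose_a = [95, 110, 160, 180, 150]       # five measurements
glucose_b = [90, 120, 170, 175, 155, 140]  # six measurements
print(f"Euclidean distance: {euclidean_distance(glucose_a, glucose_b):.1f}")
```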
Several distance measures improve on Euclidean distance by correlating successive values of a pair of time series, thus taking into account the relative order of the data elements.
Table 1
Representative methods for evaluating time series similarity: Euclidean distance, Pearson correlation distance, Fourier transform, and discrete wavelet transform (with the strategy and application of each; length refers to the number of elements in the series rather than its total duration).
The Pearson correlation distance (see Table 1) [26] recognizes similarities in the shapes of two sequences, as long as those sequences are the same length, and can also capture inverse relationships (an increase in the values of one time sequence with a concurrent decrease in those of another). For data sets in which time sequences all have the same number of data elements, the time spans between data elements are unimportant, and comparisons of entire sequences are of interest, this distance metric can produce satisfactory results. Because routine medical observations are not made with the intent of cross-population comparison, however, medical time series are often composed of data elements of varying numbers that are spaced irregularly and differently across episodes and patients. Traditional metrics, such as Pearson, may incorrectly classify profiles that are similar but are sampled at different times. If the durations of the time spans between data elements are important, the number of data elements in each sequence is variable, or temporal relationships between subsequences (eg, trends and periodicity) are important, more sophisticated methods are needed to incorporate these temporal relationships into similarity measures.
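A minimal sketch of one common form of this metric (1 minus the Pearson correlation coefficient) is shown below for equal-length series; values near 0 indicate similar shapes and values near 2 indicate inverse profiles. The series values are hypothetical.

```python
# Minimal sketch of a Pearson correlation distance (1 - r) for two series of
# equal length. A distance near 0 indicates similar shape; near 2 indicates
# an inverse relationship. Series values are hypothetical.

import numpy as np

def pearson_distance(a, b):
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    r = np.corrcoef(a, b)[0, 1]
    return 1.0 - r

rising = [1.0, 2.1, 2.9, 4.2, 5.0]
also_rising = [10, 21, 28, 44, 49]
falling = [5.1, 4.0, 3.1, 1.9, 1.0]

print(f"rising vs also_rising: {pearson_distance(rising, also_rising):.2f}")  # near 0
print(f"rising vs falling:     {pearson_distance(rising, falling):.2f}")      # near 2
```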
Dynamic time warping
Dynamic time warping (see Table 1) is a robust distance calculation
method developed for speech recognition [27]. It computes the distance
coefficients. The coefficients represent progressively finer levels of detail associated with their corresponding time scales. The wavelet transform can thus represent the general shape of a time sequence and its fine structure, effectively allowing for zooming in and out of the sequence's temporal features. It, like the Fourier transform, assumes regularly spaced data in which the number of data elements is a power of two, and interpolation can be used to satisfy these requirements. Because the functional descriptions produced by the wavelet transform are localized in time, this technique is applicable to detection of nonperiodic patterns in clinical data sequences. Clinical application areas include early detection of hemodynamic deterioration as measured by multiple physiologic variables in ICU patients [25] and early detection of infectious disease outbreaks by monitoring for spikes in emergency department chief complaints [32].
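As a concrete illustration of the decomposition idea, the sketch below performs a single level of the Haar transform, the simplest discrete wavelet, splitting a sequence into coarse approximation coefficients and time-localized detail coefficients; the platelet values are hypothetical and the input length is assumed to be a power of two.

```python
# Minimal sketch of a single-level Haar wavelet decomposition. The
# approximation coefficients capture the general shape of the sequence and
# the detail coefficients capture fine, time-localized structure. Input
# length is assumed to be a power of two; values are hypothetical.

import numpy as np

def haar_step(series):
    """One Haar decomposition level: scaled pairwise averages and differences."""
    x = np.asarray(series, dtype=float)
    evens, odds = x[0::2], x[1::2]
    approximation = (evens + odds) / np.sqrt(2.0)
    detail = (evens - odds) / np.sqrt(2.0)
    return approximation, detail

platelets = [250, 240, 200, 150, 120, 90, 110, 160]   # length 8 = 2**3
approx, detail = haar_step(platelets)
print("approximation:", np.round(approx, 1))
print("detail:       ", np.round(detail, 1))
```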
Although the transform-based methods feature significantly improved temporal pattern recognition over plain distance measures, including the ability to identify periodic patterns and similarities between complex profiles, they suffer from the need to prespecify parameters (eg, the wavelet function for the wavelet transform) that may be unintuitive for clinical experts, and, like basic similarity measures and dynamic time warping, they compare the similarity of entire time sequences. In practice, however, interesting clinical features are most often expressed as characteristic subsequences within longer sequences of data. In clinical settings comparison of entire time sequences is often undesirable, and methods are needed to split time sequences into shorter, clinically meaningful subsequences.
Subsequencing methods
Sliding window methods
Time series may be directly divided into subsequences by scanning a sliding window that views a fixed number of data elements across each sequence, creating the set of all possible subsequences of that length (Table 2). The subsequences overlap, and thus every data element is analyzed in its local temporal context. The sliding window method is illustrated in Fig. 3. The position of a subsequence within a sequence is lost in this technique, thus it is primarily useful for identifying common motifs or combinations of motifs across a collection of time sequence data for which the specific location of a motif within each sequence does not matter. Also, because the size of the sliding window is defined as a number of data elements rather than a duration, the sliding window approach is not appropriate when the duration of a motif is a critical feature, particularly if the time sequences are not regularly spaced or have missing values.
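A minimal sketch of this windowing step, in the spirit of Fig. 3, is shown below; the platelet counts, window length, and the simple slope-sign feature are illustrative assumptions.

```python
# Minimal sketch of sliding window subsequencing: every contiguous run of
# `window` data elements becomes a candidate subsequence, and a simple
# slope sign is computed for each as a feature a mining algorithm might
# cluster on. Platelet values are hypothetical.

def sliding_windows(series, window):
    return [series[i:i + window] for i in range(len(series) - window + 1)]

def slope_sign(subsequence):
    return "increasing" if subsequence[-1] > subsequence[0] else "decreasing"

platelet_counts = [250, 220, 190, 160, 140, 170, 210]
for i, sub in enumerate(sliding_windows(platelet_counts, window=3), start=1):
    print(f"subsequence {i}: {sub} -> {slope_sign(sub)}")
# The first three subsequences are decreasing and the last two increasing,
# mirroring the clustering illustrated in Fig. 3.
```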
As compared with transform-based methods, the window size parameter is relatively intuitive and the approach allows for clustering time series based on specific temporal features that can be contained by the window, even if the time series differ significantly in other respects.
Table 2
Selected methods for subsequencing time series: sliding window, segmentation, and temporal abstraction. Temporal abstraction specifically identifies subsequences of interest based on defined temporal and mathematic relationships between data elements. (Length refers to the number of elements in the series rather than its total duration.)
The window size can markedly affect what temporal features are found, however, and empirical testing may be required to pick the best window size. Also complicating the use of this technique is that time sequences from the same source tend
Fig. 3. Illustration of a sliding window of length 3 scanning a time sequence of platelet counts. Subsequences are indicated by dashed rectangles and are labeled by number in the order in which they are processed. The first three scanned subsequences have negative slopes, and the last two subsequences have positive slopes. A data mining algorithm might cluster the first three and last two subsequences into two clusters representing decreasing (A) and increasing (B) values.
interval of time over which a state or process exists (see Fig. 1) [37]. These temporal abstractions may be inferred from raw time-stamped data elements based on prespecified mathematic relationships (eg, states, trends) in the data, yielding low-level abstractions, or from prespecified combinations of previously inferred abstractions based on temporal relationships between their intervals, yielding high-level abstractions. Examples of abstractions are shown in Fig. 4. Knowledge elicitation techniques have been developed to facilitate the process of encoding the expert knowledge defining abstractions in a computable form [38]. Temporal abstraction has been developed primarily in the medical domain, and has been applied successfully to pattern detection in laboratory test results [16,30,39] and children's
Fig. 4. Example time series of platelet (PLT) counts in HELLP (Hemolysis, Elevated Liver enzymes, Low Platelets) syndrome, and intervals identified by temporal abstraction. HELLP is a dangerous complication of pregnancy that appears during the latter part of the third trimester or after childbirth [50]. HELLP syndrome has been defined as pre-eclampsia with PLT less than 100,000/µL, lactate dehydrogenase (LDH) greater than 600 U/L, and aspartate aminotransferase (AST) greater than 70 U/L, and increasing PLT indicates recovery [51,52]. Subsequences of PLT, LDH, and AST values that satisfy these thresholds are labeled as PLT_Low, LDH_High, and AST_High intervals, respectively. Overlapping subsequences of low PLT, high LDH, and high AST are labeled as Lab_HELLP intervals, subsequences of increasing PLT count are labeled as PLT_Increasing intervals, and subsequences of increasing PLT count that begin at or after the start of a Lab_HELLP interval are labeled as Recovering intervals.
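The sketch below illustrates the low-level abstraction step in simplified form: contiguous runs of time-stamped platelet counts below a threshold are merged into PLT_Low state intervals, as in Figs. 1 and 4. The timestamps, values, and threshold are hypothetical, and real temporal abstraction systems also handle measurement gaps, units, and higher-level interval combinations.

```python
# Minimal sketch of low-level temporal abstraction: contiguous observations
# that satisfy a predicate are merged into a labeled state interval.
# Timestamps (hours) and platelet counts (x1,000/uL) are hypothetical.

def abstract_state_intervals(observations, predicate, label):
    """observations: list of (time, value) pairs sorted by time."""
    intervals, start, last = [], None, None
    for time, value in observations:
        if predicate(value):
            if start is None:
                start = time
            last = time
        elif start is not None:
            intervals.append((label, start, last))
            start = None
    if start is not None:
        intervals.append((label, start, last))
    return intervals

plt_counts = [(0, 180), (6, 140), (12, 90), (18, 70), (24, 85), (30, 120), (36, 150)]
print(abstract_state_intervals(plt_counts, lambda v: v < 100, "PLT_Low"))
# -> [('PLT_Low', 12, 24)]
```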
high-level intervals can be specified and the goal is discovering novel associations and predictive relationships between those intervals.
Mining interval databases for temporal patterns
Once interesting subsequences of time series are identified by segmentation, temporal abstraction, or other methods, temporal mining strategies can assemble them into multivariate patterns that may have clinical meaning. Methods have been proposed that identify combinations of intervals that frequently co-occur in a data set, and some of these methods allow for mining combinations of intervals that frequently co-occur with some temporal relationship (see Fig. 2). The former methods are generally called association learning methods, and the latter are called temporal rule learning methods.
Association learning methods use a variant of the Apriori algorithm [43], which is an iterative process that first identifies all single instances of a particular type of interval, finds all frequent combinations, or itemsets, of the first interval type with a second interval type, and then finds all frequent combinations of the first two interval types with a third interval type, and so on (Fig. 5). These aggregations can either be used for discovery of frequently co-occurring intervals, or they can be thought of as predictive rules in which the presence of the interval types in an itemset predicts the presence of the interval type added to the itemset in each iteration of the algorithm.
When used for discovery of frequent combinations of interval types, Apriori has a parameter called the minimum support threshold that defines the minimum number or percentage of time sequences in which an itemset must occur for the itemset to be passed on to the next iteration of the algorithm (see Fig. 5). For mining interval types that may have multiple occurrences within a time sequence, a useful alternative definition of support is the minimum total time duration of all instances of an interval type in a sequence. For rule learning, there is an additional parameter called the minimum confidence threshold, which defines the minimum conditional probability that a predicted interval type will occur within a sequence, given an already-discovered itemset. This parameter is useful for filtering out rules that have low predictive value. Apriori-based association rule learning has been applied to assess the quality of a hemodialysis service: frequent combinations of temporal abstractions in physiologic parameters were mined and used to improve the understanding of the contribution of physiologic factors to the values of a set of quality indicators [36].
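The sketch below illustrates the Apriori iteration of Fig. 5 on a few hypothetical records of HELLP-related intervals; it keeps only itemsets meeting a minimum support of three records and omits the candidate-pruning and confidence-based rule filtering used in full implementations.

```python
# Minimal sketch of Apriori-style frequent itemset mining: singletons are
# counted first, then candidate itemsets are grown one item at a time and
# kept only if they meet the minimum support threshold. Records and the
# threshold are hypothetical.

from itertools import combinations

def apriori(records, min_support):
    records = [frozenset(r) for r in records]
    items = {item for r in records for item in r}
    current = [frozenset([i]) for i in items]
    frequent = []
    while current:
        counts = {c: sum(1 for r in records if c <= r) for c in current}
        kept = [c for c, n in counts.items() if n >= min_support]
        frequent.extend((set(c), counts[c]) for c in kept)
        # Grow (k+1)-itemsets from pairs of frequent k-itemsets that differ
        # by exactly one item.
        current = list({a | b for a, b in combinations(kept, 2)
                        if len(a | b) == len(a) + 1})
    return frequent

records = [
    {"PLT_Low", "LDH_High", "AST_High"},
    {"PLT_Low", "LDH_High"},
    {"PLT_Low", "LDH_High", "AST_High"},
    {"PLT_Low", "AST_High"},
]
for itemset, support in apriori(records, min_support=3):
    print(sorted(itemset), "support =", support)
```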
Temporal rule learning extends Apriori by storing temporal relationships between interval types as additional attributes of each itemset [35,44], using a temporal reasoning language (see later discussion) to encode relationships. Many itemsets are created for a given group of interval types, one for each temporal configuration of those interval types that is found in the dataset based on the selected support and confidence parameters. Temporal Apriori
Fig. 5. Illustration of the Apriori algorithm processing a hypothetical database of clinical findings for frequent associations between findings, with a minimum support threshold of 3 occurrences in the database. In the left table (Items), findings and number of occurrences are listed. In the center table (Pairs), pairs of findings that co-occur in the same patient record are shown. In the right table (Triplets), combinations of three findings that co-occur in the same patient record are shown. Findings and combinations of findings that satisfy the minimum support threshold are highlighted in gray.
[19] Antunes CM, Oliveira AL. Temporal data mining: an overview. Paper presented at the Knowledge Discovery and Data Mining Workshop on Temporal Data Mining (KDD 01); August 26–29, 2001; San Francisco.
[20] Fu T-c, Chung F-l, Luk R, et al. Preventing meaningless stock time series pattern discovery by changing perceptually important point detection. Fuzzy Systems and Knowledge Discovery 2005;1171–4.
[21] Keogh E, Lin J, Fu A. HOT SAX: finding the most unusual time series subsequence: algorithms and applications. Paper presented at the 5th IEEE International Conference on Data Mining. New Orleans (LA), November 27–30, 2005.
[22] Keogh E, Pazzani M. An enhanced representation of time series which allows fast and accurate classification, clustering, and relevance feedback. In: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining. AAAI Press; 1998. p. 239–41.
[23] Haux R. Health information systems – past, present, future. Int J Med Inform 2006;75(3–4):268–81.
[24] Li J, Leong TY. Using linear regression functions to abstract high-frequency data in medicine. Proc AMIA Symp 2000;492–6.
[25] Saeed M, Mark R. A novel method for the efficient retrieval of similar multiparameter physiologic time series using wavelet-based symbolic representations. AMIA Annu Symp Proc 2006;679–83.
[26] Altiparmak F, Ferhatosmanoglu H, Erdal S, et al. Information mining over heterogeneous and high-dimensional time-series data in clinical trials databases. IEEE Trans Inf Technol Biomed 2006;10(2):254–63.
[27] Ratanamahatana CA, Keogh E. Everything you know about dynamic time warping is wrong. Paper presented at the Third Workshop on Mining Temporal and Sequential Data (KDD-2004). Seattle (WA), August 22–25, 2004.
[28] Fritsche L, Schlaefer A, Budde K, et al. Recognition of critical situations from time series of laboratory results by case-based reasoning. J Am Med Inform Assoc 2002;9(5):520–8.
[29] Chatfield C. Analysis of time series. 4th edition. New York: Chapman and Hall; 1989.
[30] Bellazzi R, Larizza C, Riva A. Temporal abstractions for interpreting diabetic patients' monitoring data. Intelligent Data Analysis 1998;2(1–4):97–122.
[31] Graps A. An introduction to wavelets. IEEE Comput Sci Eng 1995;2(2):50–61.
[32] Zhang J, Tsui FC, Wagner MM, et al. Detection of outbreaks from time series data using wavelet transform. Proc AMIA Symp 2003;748–52.
[33] Keogh E, Lin J, Truppel W. Clustering of time series subsequences is meaningless: implications for previous and future research. Paper presented at the Third IEEE International Conference on Data Mining (ICDM 03). Melbourne (FL), November 19–22, 2003.
[34] Höppner F. Time series abstraction methods – a survey. Paper presented at the GI Jahrestagung. Dortmund, Germany; September 30–October 3, 2002.
[35] Höppner F. Learning dependencies in multivariate time series. Paper presented at the ECAI'02 Workshop on Knowledge Discovery in (Spatio-) Temporal Data. Lyon, France; July 22–23, 2002.
[36] Bellazzi R, Larizza C, Magni P, et al. Temporal data mining for the quality assessment of hemodialysis services. Artif Intell Med 2005;34(1):25–39.
[37] Stacey M, McGregor C. Temporal abstraction in intelligent clinical data analysis: a survey. Artif Intell Med 2007;39(1):1–24.
[38] Shahar Y, Chen H, Stites DP, et al. Semi-automated entry of clinical temporal-abstraction knowledge. J Am Med Inform Assoc 1999;6(6):494–511.
[39] Larizza C, Moglia A, Stefanelli M. M-HTP: a system for monitoring heart transplant patients. Artif Intell Med 1992;4:111–26.
[40] Kuilboer MM, Shahar Y, Wilson DM, et al. Knowledge reuse: temporal-abstraction mechanisms for the assessment of children's growth. Proc Annu Symp Comput Appl Med Care 1993;449–53.
[41] Kohane IS, Haimowitz IJ. Hypothesis-driven data abstraction with trend templates. Proc Annu Symp Comput Appl Med Care 1993;444–8.
[42] Post AR, Harrison JH Jr. PROTEMPA: a method for specifying and identifying temporal sequences in retrospective data for patient selection. J Am Med Inform Assoc 2007;14(5):674–83.
[43] Agrawal R, Srikant R. Fast algorithms for mining association rules in large databases. Paper presented at the 20th International Conference on Very Large Databases (VLDB). Santiago de Chile, Chile; September 12–15, 1994.
[44] Mörchen F, Ultsch A. Discovering temporal knowledge in multivariate time series. Paper presented at the Gesellschaft für Klassifikation (GfKl). Dortmund, Germany; March 9–11, 2004.
[45] Mörchen F. A better tool than Allen's relations for expressing temporal knowledge in interval data. Paper presented at the Theory and Practice of Temporal Data Mining (TPTDM 2006). Philadelphia; August 20–23, 2006.
[46] Korfhage RR. Information storage and retrieval. New York: Wiley; 1997.
[47] Siadaty MS, Knaus WA. Locating previously unknown patterns in data-mining results: a dual data- and knowledge-mining method. BMC Med Inform Decis Mak 2006;6:13.
[48] Tsoi AC, Zhang S, Hagenbuchner M. Pattern discovery on Australian medical claims data – a systematic approach. IEEE Trans Knowl Data Eng 2005;17(10):1420–35.
[49] McCallum A, Nigam K, Ungar LH. Efficient clustering of high-dimensional data sets with application to reference matching. Paper presented at the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Boston (MA); August 20–23, 2000.
[50] Sibai BM. The HELLP syndrome (hemolysis, elevated liver enzymes, and low platelets): much ado about nothing? Am J Obstet Gynecol 1990;162(2):311–6.
[51] Sibai BM, Barton JR. Dexamethasone to improve maternal outcome in women with hemolysis, elevated liver enzymes, and low platelets syndrome. Am J Obstet Gynecol 2005;193(5):1587–90.
[52] Fonseca JE, Mendez F, Catano C, et al. Dexamethasone treatment does not improve the outcome of women with HELLP syndrome: a double-blind, placebo-controlled, randomized clinical trial. Am J Obstet Gynecol 2005;193(5):1591–8.
Table 1
Contents of common regional data sets. Typical contents listed include ICD-9-CM codes, CPT codes, HCPCS codes, NDC codes, charges, ICD/CPT codes, problem lists, clinical orders, observation/test results, procedure descriptions, textual case summaries, diagnoses/codes, follow-up data as specified (including geographic and environmental data), and large-scale survey results.
therapeutics, a therapeutic class categorization, and prescription date. Timestamps generally provide only the day of the charge (which may not be the day of the clinical event) and thus do not support precise short-term calculations such as most pharmacokinetics evaluations or confirmation of a rapid response to an event. Claims databases have been used in a large number of health care process and outcomes studies (eg, Ref. [18]), and similar data are used by the Health Plan Employer Data and Information Set (HEDIS) to compare quality of care among providers [19].
A number of problems exist with claims data, and their use is controversial [15,16,20]. By their nature, claims records exist only for items that generate a financial transaction, and there is an incentive for the choice of codes to support the operation of the financial system. Many codes are added after patient care during a separate clinical records review, rather than by caregivers. Claims forms may provide a limited number of slots for codes, forcing code selection based on the needs of the current transaction. Diagnostic codes attached to procedures and laboratory tests may represent rule-out possibilities rather than confirmed diagnoses. Finally, the ICD coding system is not comprehensive for all clinical conditions and does
Fig. 1. Reportable diseases and health conditions, Los Angeles County, California. Reporting
may be performed by submission of paper forms, telephone, or electronic communication. The
submitted information, along with additional follow-up data, is incorporated into a regional public health data warehouse. (Courtesy of County of Los Angeles, Department of Health Services,
Public Health, Los Angeles, CA.)
Fig. 2. Prevalence of a rash symptom in patients seen in emergency departments in Los Angeles
County. Color coding indicates how statistically abnormal the rate is in each zip code of patient
residence. The bright red area indicates a probable outbreak of a rash-related illness.
governmental data sets are limited, in that they are not complete health records and contain only data that are mandated for survey, reporting, or capture during follow-up. Thus they are not regional or national clinical
repositories in the sense of some of the European systems, but they can
be useful for addressing particular questions in appropriate domains.
Challenges inherent in regional databases
Regional data warehouses, whether they house clinical data from collaborating care providers or public health data, offer all the challenges previously described in this issue for data warehouse construction, with the added complexity of multiple independent data-contributing organizations.
Furthermore, regional warehouses may receive useful data from organizations or locations distinct from traditional health care providers, for example environmental or veterinary laboratories (see Fig. 3), nursing homes or
home health care providers, or remote sensors such as home blood-glucose
monitors. Some users may be outside the traditional health care provider or
health services researcher roles, and data presentation or analysis capabilities
should support their needs. For example, public health data are important for
highlighting geographic clusters of symptoms in biosurveillance (see Fig. 2)
Fig. 3. Veterinary data indicating the distribution of West Nile virus (WNV) in bird autopsies in Los Angeles County. Larger dots indicate a higher number of WNV-positive birds. These data permit the County to focus its mosquito-abatement efforts into high-incidence areas, which may help limit transmission to humans.
and for targeting mosquito abatement in efforts to suppress the West Nile virus in areas of high bird loss (see Fig. 3). Beyond these special applications, regional data warehouses also have distinct challenges related to data communication and loading, reconciliation of varied data representation, linkage of data by patient, and normalization of laboratory and other numerical data.
may represent values in multiple ways. For example, most laboratories have created their own unique sets of test codes and text mnemonics, and other data similarly may have site-specific representation. With a few exceptions, it generally is necessary to build a translation table into each source system, mapping its vocabulary to a standard target vocabulary used in the regional warehouse. The Logical Observation Identifiers Names and Codes (LOINC) has established a well-accepted standard for laboratory test names [60], and organism names in microbiology can be converted to the Systematized Nomenclature of Medicine Clinical Terminology (SNOMED CT [61]). The task of fully coding a laboratory database for LOINC and SNOMED nomenclature is significant, and most laboratories create mapping tables only for tests that are exported. This approach can cause problems, because each laboratory's mapping table must be maintained as tests change and new tests are incorporated into the regional database. Failure to maintain the table appropriately can cause failure of data transmission and require retroactive identification and transmission of unreported results. In the future, as an increasing number of regulatory agencies and other organizations require the use of communication standards, laboratories should be able to provide conversion to standard terminologies as a part of routine data export. Similar data-representation considerations apply to other clinical systems that supply data to a regional warehouse.
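A minimal sketch of such a translation table is shown below; the local mnemonics are hypothetical, the LOINC codes are included for illustration only, and unmapped codes are routed to a maintenance queue rather than silently dropped.

```python
# Minimal sketch of a source-to-standard translation table: site-specific
# test mnemonics are mapped to LOINC codes before export to a regional
# warehouse. The local mnemonics and mapping entries are illustrative, not
# an authoritative LOINC mapping for any real laboratory.

LOCAL_TO_LOINC = {
    "GLU": ("2345-7", "Glucose [Mass/volume] in Serum or Plasma"),
    "K":   ("2823-3", "Potassium [Moles/volume] in Serum or Plasma"),
    "HGB": ("718-7",  "Hemoglobin [Mass/volume] in Blood"),
}

def translate_result(local_code, value, units):
    mapping = LOCAL_TO_LOINC.get(local_code)
    if mapping is None:
        # Unmapped test: route to a work queue so the table can be updated.
        return {"status": "unmapped", "local_code": local_code}
    loinc_code, loinc_name = mapping
    return {"status": "ok", "loinc": loinc_code, "name": loinc_name,
            "value": value, "units": units}

print(translate_result("GLU", 105, "mg/dL"))
print(translate_result("TROPX", 0.04, "ng/mL"))   # new local test, not yet mapped
```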
Linkage of data by patient
A health information data warehouse that is optimally useful for understanding disease presentation and course as well as care processes should
contain accurate longitudinal representations of disease, care, and therapeutic response for each patient. When data come from multiple care providers
who share no common patient identification mechanism, data linkage requires transmission of some type of patient identifier [62]. These identifiers may be stripped from the data before the data are used for analysis, but it is necessary to maintain the identifiers and their link to the patient data
in a secure environment in the data warehouse so that future data may be
linked to data from the same patient that are already in the warehouse,
irrespective of the data source. Although this architecture is used in regional
and national European health data warehouses [45,46], the authors are not
aware that it has been deployed successfully in a multiorganizational health
data warehouse in the United States.
Importance, standardization, and normalization of clinical
laboratory data
Laboratory data occupy an important position in health data warehouses. Much of the data in these databases is coded (eg, ICD-9-CM codes),
and the codes typically are entered by people who are performing a rapid review of all or part of the medical record for the purpose of financial transactions [16]. In databases that contain textual records, physicians' comments in text may be unclear, ambiguous, or misleading. Laboratory data are objective and clinically meaningful and generally are transmitted accurately. Laboratory data can serve as validation for correctly coded data and as a flag for incorrect codes. Laboratory data also can support risk stratification and inference of disease severity or unusual presentation, which are poorly represented in the currently common coding schemes [63]. Finally, with appropriate analysis techniques, laboratory data can allow correct classification of patients to conditions for which codes do not exist or when codes are omitted [64].
The aggregation of data from multiple laboratories into data warehouses offers several important challenges in addition to the previously mentioned differences in naming of test codes. Results that are expressed as categoric values or discrete scales may differ between laboratories based on different category names and scales. These values must be reconciled or standardized to a common set of values or scale. Many clinical laboratory tests that yield numeric values are not standardized to reference material, and the variability of results between test methodologies, and even between laboratories using the same methodology, has been well described [65–67]. The authors have found that many analytes can yield differences between laboratories in a range that would affect clinical interpretation and yield spurious associations in data mining. Merely comparing the results to reference ranges is insufficient for rigorous manual evaluation and is inadequate for data mining. Statistical methods to normalize result values between different laboratories are available and should be considered for general application across the data warehouse [68,69].
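One simple approach, sketched below, converts each laboratory's results for an analyte to z-scores computed from that laboratory's own distribution so that values become comparable across sites; this is only an illustration of the general idea rather than the specific methods of references [68,69], and the result values are hypothetical.

```python
# Minimal sketch of between-laboratory normalization: each laboratory's
# results for one analyte are converted to z-scores using that laboratory's
# own mean and standard deviation. Result values are hypothetical.

import statistics
from collections import defaultdict

results = [
    ("lab_a", 0.9), ("lab_a", 1.1), ("lab_a", 1.0), ("lab_a", 1.3),
    ("lab_b", 1.4), ("lab_b", 1.6), ("lab_b", 1.5), ("lab_b", 1.9),
]

by_lab = defaultdict(list)
for lab, value in results:
    by_lab[lab].append(value)

stats = {lab: (statistics.mean(v), statistics.stdev(v)) for lab, v in by_lab.items()}

for lab, value in results:
    mean, sd = stats[lab]
    print(f"{lab}: {value:.2f} -> z = {(value - mean) / sd:+.2f}")
```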
Summary
Although the United States lags behind Europe in creating large-scale, detailed health data repositories at the regional level, several types of repositories exist in this country that may provide data useful for particular data-mining applications. Regional repositories or data warehouses have special requirements and constraints that distinguish them from intraorganizational data warehouses. These differences are related to their intended purpose, the diversity of data they may receive, and their connection to multiple unrelated data providers. Public health databases differ further because they are exempt from HIPAA privacy requirements and operate under their own security and confidentiality mandates. As objective data that are highly likely to be transmitted correctly, clinical laboratory results play an important role in all these databases in validating other data, including diagnosis and procedure codes, and in supporting inference of disease severity and risk-related information that is poorly handled in coding systems. Aggregation of laboratory results from multiple data providers yields specific challenges related to representation and normalization of data that must be
[18] Udvarhelyi IS, Gatsonis C, Epstein AM, et al. Acute myocardial infarction in the Medicare population. Process of care and clinical outcomes. JAMA 1992;268(18):2530–6.
[19] National Committee for Quality Assurance. Health Plan Employer Data and Information Set 3.0. Washington, DC: National Committee for Quality Assurance; 1998.
[20] Ahmed F, Janes GR, Baron R, et al. Preferred provider organization claims showed high predictive value but missed substantial proportion of adults with high-risk conditions. J Clin Epidemiol 2005;58:624–8.
[21] Kern EF, Maney M, Miller DR, et al. Failure of ICD-9-CM codes to identify patients with comorbid chronic kidney disease in diabetes. Health Serv Res 2006;41(2):564–80.
[22] Birman-Deych E, Waterman AD, Yan Y, et al. Accuracy of ICD-9-CM codes for identifying cardiovascular and stroke risk factors. Med Care 2005;43(5):480–5.
[23] Bullano MF, Kamat S, Willey VJ, et al. Agreement between administrative claims and the medical record in identifying patients with a diagnosis of hypertension. Med Care 2006;44(5):486–90.
[24] Waikar SS, Wald R, Chertow GM, et al. Validity of international classification of diseases, ninth revision, clinical modification codes for acute renal failure. J Am Soc Nephrol 2006;17(6):1688–94.
[25] Aronsky D, Haug PJ, Lagor C, et al. Accuracy of administrative data for identifying patients with pneumonia. Am J Med Qual 2005;20(6):319–28.
[26] Dombkowski KJ, Wasilevich EA, Lyon-Callo SK. Pediatric asthma surveillance using Medicaid claims. Public Health Rep 2005;120(5):515–24.
[27] Kolodner K, Lipton RB, Lafata JE, et al. Pharmacy and medical claims data identified migraine sufferers with high specificity but modest sensitivity. J Clin Epidemiol 2004;57(9):962–72.
[28] Harrold LR, Saag KG, Yood RA, et al. Validity of gout diagnoses in administrative data. Arthritis Rheum 2007;57(1):103–8.
[29] Bazarian JJ, Veazie P, Mookerjee S, et al. Accuracy of mild traumatic brain injury case ascertainment using ICD-9 codes. Acad Emerg Med 2006;13(1):31–8.
[30] Mapel DW, Frost FJ, Hurley JS, et al. An algorithm for the identification of undiagnosed COPD cases using administrative claims data. J Manag Care Pharm 2006;12(6):457–65.
[31] Hollis J. Deploying an HMO's data warehouse. Health Manag Technol 1998;19(8):46–8.
[32] Joyce JS, Fetter MM, Klopfenstein DH, et al. The Kaiser Permanente Northwest Cardiovascular Risk Factor Management Program: a model for all. The Permanente Journal 2004;9(2):19–26.
[33] Sequist TD, Cullen T, Ayanian JZ. Information technology as a tool to improve the quality of American Indian health care. Am J Public Health 2005;95(12):2173–9.
[34] Koski E, Teates KS, Tellez P, et al. Exploring the role of Quest Diagnostics corporate data warehouse for timely influenza surveillance. Advances in Disease Surveillance 2006;1:41.
[35] Kupersmith J, Francis J, Kerr E, et al. Advancing evidence-based care for diabetes: lessons from the Veterans Health Administration. Health Aff (Millwood) 2007;26(2):w156–68.
[36] Berndt DJ, Hevner AR, Studnicki J. The Catch data warehouse: support for community health care decision-making. Decision Support Systems 2003;35:367–84.
[37] Starr P. Smart technology, stunted policy: developing health information networks. Health Aff (Millwood) 1997;16(3):91–105.
[38] Payton FC, Brennan PF. How a community health information network is really used. Commun ACM 1999;42(12):85–9.
[39] Berberabe T. Information: it's better when you share. Manag Care 2005;14(2):30, 35–7.
[40] Solomon MR. Regional health information organizations: a vehicle for transforming health care delivery? J Med Syst 2007;31(1):35–47.
[41] Massachusetts Health Data Consortium. Active Regional Health Information Organizations (RHIO) List. 2006. Available at: http://www.mahealthdata.org/data/library/20061127_ActiveRHIOs.pdf. Accessed August 19, 2007.
[42] Wright A, Ricciardi TN, Zwick M. Application of information-theoretic data mining techniques in a national ambulatory practice outcomes research network. AMIA Annu Symp Proc 2005;829–33.
[43] Everett AD, Ringel R, Rhodes JF, et al. Development of the MAGIC congenital heart disease catheterization database for interventional outcome studies. J Interv Cardiol 2006;19(2):173–7.
[44] Currie CJ, McEwan P, Peters JR, et al. The routine collation of health outcomes data from hospital treated subjects in the Health Outcomes Data Repository (HODaR): descriptive analysis from the first 20,000 subjects. Value Health 2005;8(5):581–90.
[45] van Bemmel JH, van Mulligen EM, Mons B, et al. Databases for knowledge discovery. Examples from biomedicine and health care. Int J Med Inform 2006;75(3–4):257–67.
[46] Ben Said M, le Mignot L, Mugnier C, et al. A multi-source information system via the Internet for end-stage renal disease: scalability and data quality. Stud Health Technol Inform 2005;116:994–9.
[47] Steil H, Amato C, Carioni C, et al. EuCliD – a medical registry. Methods Inf Med 2004;43(1):83–8.
[48] Hristovski D, Rogac M, Markota M. Using data warehousing and OLAP in public health care. Proc AMIA Symp 2000;369–73.
[49] Muilu J, Peltonen L, Litton JE. The federated database – a basis for biobank-based postgenome studies, integrating phenome and genome data from 600,000 twin pairs in Europe. Eur J Hum Genet 2007;15(7):718–23.
[50] Scotch M, Parmanto B. Development of SOVAT: a numerical-spatial decision support system for community health assessment research. Int J Med Inform 2006;75(10–11):771–84.
[51] Inhorn SL, Wilcke BW Jr, Downes FP, et al. A comprehensive Laboratory Services Survey of State Public Health Laboratories. J Public Health Manag Pract 2006;12(6):514–21.
[52] Hanrahan LP, Foldy S, Barthell EN, et al. Medical informatics in population health: building Wisconsin's strategic framework for health information technology. WMJ 2006;105(1):16–20.
[53] Loonsk JW, McGarvey SR, Conn LA, et al. The Public Health Information Network (PHIN) Preparedness initiative. J Am Med Inform Assoc 2006;13(1):1–4.
[54] Banks D, Woo EJ, Burwen DR, et al. Comparing data mining methods on the VAERS database. Pharmacoepidemiol Drug Saf 2005;14(9):601–9.
[55] Iskander J, Pool V, Zhou W, et al. Data mining in the US using the Vaccine Adverse Event Reporting System. Drug Saf 2006;29(5):375–84.
[56] Zeni MB, Kogan MD. Existing population-based health databases: useful resources for nursing research. Nurs Outlook 2007;55(1):20–30.
[57] Cabell CH, Noto TC, Krucoff MW. Clinical utility of the Food and Drug Administration electrocardiogram warehouse: a paradigm for the critical pathway initiative. J Electrocardiol 2005;38(Suppl 4):175–9.
[58] Health Level Seven, Inc. About HL7. Available at: http://www.hl7.org/about/hl7about.htm. Accessed August 19, 2007.
[59] Blobel BGME, Engel K, Pharow P. Semantic interoperability – HL7 version 3 compared to advanced architecture standards. Methods Inf Med 2006;45(4):343–53.
[60] Regenstrief Institute, Inc. Logical Observation Identifiers Names and Codes (LOINC). Available at: http://www.regenstrief.org/medinformatics/loinc/. Accessed August 19, 2007.
[61] National Library of Medicine. SNOMED Clinical Terms (SNOMED CT). Available at: http://www.nlm.nih.gov/research/umls/Snomed/snomed_main.html. Accessed August 19, 2007.
[62] Black N. Secondary use of personal data for health and health services research: why identifiable data are essential. J Health Serv Res Policy 2003;8(3 Suppl 1):36–40.
[63] Emons MF. Integrated patient data for optimal patient management: the value of laboratory data in quality improvement. Clin Chem 2001;47(8):1516–20.
[64] Post AR, Harrison JH Jr. PROTEMPA: a method for specifying and identifying temporal sequences in retrospective data for patient selection. J Am Med Inform Assoc 2007;14(5):674–83.
[65] Ricos C, Domenech MV, Perich C. Analytical quality specifications for common reference intervals. Clin Chem Lab Med 2004;42(7):858–62.
[66] Westgard JO, Westgard SA. The quality of laboratory testing today: an assessment of sigma metrics for analytic quality using performance data from proficiency testing surveys and the CLIA criteria for acceptable performance. Am J Clin Pathol 2006;125(3):343–54.
[67] Viljoen A, Twomey PJ. True or not: uncertainty of laboratory results. J Clin Pathol 2007;60(6):587–8.
[68] Karvanen J. The statistical basis of laboratory data normalization. Drug Inf J 2003;37:101–7.
[69] Ruvuna F, Flores D, Mikrut B, et al. Generalized lab norms for standardizing data from multiple laboratories. Drug Inf J 2003;37:61–79.
patent pending, Cardinal Health). The NIM outperforms the CDC National Nosocomial Infection Surveillance system (NNIS) clinical case definitions in the ICU and the Study on the Efficacy of Nosocomial Infection Control (SENIC) case definitions house-wide [2], and is based solely on electronic clinical microbiology and electronic patient census and movement data. As a result, the NIM is reproducibly computable, thus solving a major limitation of manual case-finding methods. Data models based on the NIM, or other deterministically computable infection proxies, can specifically and reliably describe patterns of NIs, not just laboratory results, allowing for more specific and objective process improvement initiatives.
Predictive data mining
Descriptive data mining should reveal and describe important, previously unknown patterns of nosocomial (and community-acquired) infections, contamination, and colonization. Predictive data mining should construct models to predict NI risk. To some extent, the NIM algorithm accomplishes as much. If an NIM is detected, it is likely associated with NI; if an NIM is not detected, an NI is likely not present [2]. The previously described GermWatcher system implemented culture-based definitions for NI to accomplish similar goals [3]. The NIM and GermWatcher are expert rules systems; neither generates models to predict risk. Both, however, can be used to provide data to model-generating systems.
Predictive data mining to build models of infection risk could use any of the classifier building techniques from machine learning [4] (eg, neural networks) or even techniques used mostly for descriptive mining, such as association rules. This endeavor would require substantial research, but once developed, classifiers could be used to proactively target high-risk patients for prevention efforts. Of course, the exercise could lead to obvious conclusions, such as "neutropenic patients are at high risk for NI," but it also may provide insights that are currently unknown or underappreciated.
Descriptive data mining: the Data Mining Surveillance System
Descriptive data mining in laboratory medicine and infection control is at this time entirely represented by the Data Mining Surveillance System (DMSS) [5–8]. DMSS uses frequent set and association rule analysis to automatically construct patterns of statistical and clinical interest from laboratory medicine and patient movement data. The reason these techniques are useful for infection control is that NI risks are complex, and subtle patterns of infection, colonization, contamination, and multidrug resistance often go unnoticed. This is not hard to understand; the combinatorial complexity of a simple infection event space is substantial. A hypothetical hospital with 20 common bacterial pathogens, 10 specimen sources, 10 physicians or services, and 10 in vitro antimicrobial sensitivity results (each sensitive,
Operational considerations
Although DMSS is a complex system and a full description is beyond the
scope of this article, a few distilled operational principles and challenges are
worth discussing.
Data collection
DMSS requires usable data. The garbage-in, garbage-out adage is applicable. Data are obtained from the laboratory information system (LIS), the admit-discharge-transfer (ADT) system, and the hospital census system. Clinical laboratory data, especially clinical microbiology data, are poorly structured, however, and contain free text and natural language. Clinical microbiology data (including molecular testing) and infection-associated serology data from the LIS can be obtained in three ways: custom-built LIS queries, printed reports, and HL7 messages. Custom queries can be built to specification (content and presentation can be controlled), but they require programmer resources to construct and their results must be checked against gold-standard results (usually printed reports) for completeness.
Printed reports from the LIS, specifically those used to present information to clinicians, are mostly complete, but may suppress results that are selectively reported (eg, imipenem susceptibility in Pseudomonas aeruginosa). Suppressed results limit the ability of frequent set and association rule algorithms to detect relationships that may exist, but these limitations are usually not significant because the results are not suppressed in cases in which the information is most useful (eg, imipenem resistance in the presence of aminoglycoside and cephalosporin resistance). Printed reports can also change format with LIS upgrades, the introduction of new tests, and the removal of discontinued tests. For these reasons, structure and content must be actively monitored for change. Printed reports can be readily obtained in file format from printer queues (usually custom queues established expressly for file retrieval), but need to be parsed to load the data into a database. Tools such as Monarch Data Pump (www.datawatch.com) can be useful.
Clinical microbiology HL7 messages are often poorly structured but are
readily available from HL7 routers in most hospitals. Their modeling and
parsing, however, require considerable sophistication. Print-structured
data are often simply embedded in message segments, and therefore all challenges and considerations of print report modeling apply to HL7 message
modeling. Additional challenges exist, such as identifying and modeling
only applicable messages. DMSS obtains LIS data by HL7 messages.
Patient movement and census data are obtained from two sources: HL7
ADT messages and electronic census reports. Although ADT messages are
rich in content, near real-time, and precise, they are transaction-based, and
occasional message omission is not uncommon. For example, if for some
reason a discharge message is not generated for a patient, the patient
appears to never leave the hospital. For this reason, DMSS uses census
reports obtained throughout the day to reconcile ADT data errors.
Data cleaning/normalization
Once data are obtained using one of the three mechanisms above, they
must be loaded into a database, quality checked, and mapped. Database
design and population are beyond the scope of this article (see the article
by Lyman and colleagues, elsewhere in this issue, for a general discussion
of database design for mining), but once data files are retrieved and checked
to make sure their sizes are within normal limits and their data are from the
expected time periods, data can be loaded and mapped. Mapping requires,
among other things, that "SA," "S. AURIUS," "STAPH AUREUS," and
so forth all be mapped to Staphylococcus aureus. Original data are also
maintained. Cardinal Health DMSS databases contain data from more
than 250 hospitals and have hundreds of mappings to single organisms
and specimen sources. For example, there are hundreds of terms for blood
specimens, including ones with misspellings of "blood." The management of
term mapping alone requires pattern recognition and quality assurance
systems. After terms are mapped, data must be checked again to make
sure certain common specimens, tests, and organisms exist within statistical
limits.
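The organism and specimen term mapping described above can be illustrated with a simple lookup-table sketch; the table entries, helper name, and cleanup rules below are hypothetical simplifications of the pattern recognition and quality assurance systems DMSS actually uses.

```python
# Hypothetical sketch of term normalization for raw LIS organism strings.
# The mapping table and cleanup rules are illustrative only.
CANONICAL_ORGANISMS = {
    "SA": "Staphylococcus aureus",
    "S AURIUS": "Staphylococcus aureus",    # deliberate misspelling example
    "STAPH AUREUS": "Staphylococcus aureus",
    "E COLI": "Escherichia coli",
}

def normalize_organism(raw: str) -> str:
    """Map a raw organism string to a canonical name; keep the original if unmapped."""
    key = " ".join(raw.upper().replace(".", " ").split())
    return CANONICAL_ORGANISMS.get(key, raw)

assert normalize_organism("S. aurius") == "Staphylococcus aureus"
assert normalize_organism("unrecognized bug") == "unrecognized bug"
```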
The next step is to impart additional meaning to the data. For example,
NIM criteria are applied so that NIs, community-acquired infections, and
specimen contamination proxies (indicators) are computed. If this information were not imparted to the data before pattern analysis, patterns would
less reliably describe nosocomial versus community-acquired infections versus colonization or contamination. Electronic proxies for these clinical and
laboratory states, like the NIM, add value to the data and make data mining
more productive. Once data are annotated with these proxies, they can be
analyzed.
Frequent set and association rule analysis and alert generation
FS/AR analyses generally work as described above, but are typically
fraught with complexity for the inexperienced practitioner. Time partitioning of the data, the organization of association rules obtained from each
partition, and the ability to track changes among rules need to be handled.
Once rules and their confidences are stored over time, rules whose confidences change significantly between two single or aggregate time periods
compose alerts. Rules whose confidences change insignificantly are ignored. Alert clustering reduces alert volume by a factor of two to four and is
yet another tool used to reduce pattern overload. All data mining steps from
data selection to pattern presentation need to be designed with this problem
in mind. Generating too many statistically significant but often meaningless
or redundant patterns leads to user exhaustion and project failure.
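To make the alerting step concrete, the sketch below flags association rules whose confidence shifts between two time partitions; the rule encoding, counts, and fixed change threshold are simplifications (DMSS assesses the statistical significance of confidence changes rather than using a fixed cutoff).

```python
# Hedged sketch: flag association rules whose confidence changes between two
# time partitions. Rule keys, counts, and the threshold are illustrative only.
from typing import Dict, Tuple

Rule = Tuple[str, str]  # (antecedent, consequent)

def confidence(counts: Dict[Rule, Tuple[int, int]], rule: Rule) -> float:
    """Confidence = count(antecedent and consequent) / count(antecedent)."""
    both, antecedent = counts[rule]
    return both / antecedent if antecedent else 0.0

def alerts(prev, curr, min_change: float = 0.2):
    """Return rules whose confidence moved by more than min_change between periods."""
    flagged = {}
    for rule in prev.keys() & curr.keys():
        c0, c1 = confidence(prev, rule), confidence(curr, rule)
        if abs(c1 - c0) > min_change:
            flagged[rule] = (round(c0, 2), round(c1, 2))
    return flagged

prev = {("ICU, S. aureus", "oxacillin resistant"): (5, 40)}    # confidence 0.125
curr = {("ICU, S. aureus", "oxacillin resistant"): (18, 45)}   # confidence 0.40
print(alerts(prev, curr))  # sharp rise in confidence -> candidate alert for review
```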
The final step of data mining is report preparation. In DMSS, reports are
prepared from clustered patterns by domain experts who select patterns by
their usefulness. Pattern usefulness, or interestingness, is a function of clinical
significance and actionability, and it includes an estimate of how much information the end user can use efficiently. These are largely subjective measures
that are difficult to code explicitly, but experience shows that end
users do well with 5 to 10 patterns a month, about one tenth of all clustered
patterns (Table 1). DMSS pattern reports are currently presented monthly.
Table 1
                        Median      Interquartile range
Inpatient admits        1498        809-2368
Specimens               2728        1367-4289
Tests                   3254        1604-5170
NIMs                    61          26-112
CIMs                    245         157-424
Clustered patterns      52          27.5-83
Reported patterns       6           3-9
services that include DMSS pattern analysis, and from these hospitals more
than 20 DMSS-based abstracts have been presented at national conferences.
Future directions
In its current form, DMSS provides a practical illustration of the usefulness of data mining in health care. Access to additional electronic data could
extend the model-building capabilities and usefulness of DMSS. For example, additional data about patient origin could allow models to describe
or predict significant patterns from nursing homes, zip codes, counties, and
so forth. Additional electronic data, such as surgical procedure, operating
room, operative time, anesthesia scores, and wound class, could increase
the descriptiveness of surgery-associated patterns. Antimicrobial use data
or complete blood counts could increase the sensitivity and specificity of
the NIM, even if only for specific subsets of patients. Any gains in pattern
specificity and marker performance, however, add data acquisition costs
and require additional effort for data validation and cleansing. These
requirements must be matched by a corresponding increase in the clinical
usefulness of alerts and reports to justify additional development.
References
[1] Emori TG, Edwards JR, Culver DH, et al. Accuracy of reporting nosocomial infections in intensive-care-unit patients to the National Nosocomial Infections Surveillance System: a pilot study. Infect Control Hosp Epidemiol 1998;19:308-16.
[2] Brossette SE, Hacek DM, Gavin PJ, et al. A laboratory-based, hospital-wide, electronic marker for nosocomial infection. Am J Clin Pathol 2006;125:34-9.
[3] Kahn MG, Steib SA, Fraser VJ, et al. An expert system for culture-based infection control surveillance. Proc Annu Symp Comput Appl Med Care 1993:171-5.
[4] Mitchell TM. Machine learning. McGraw-Hill; 1997.
[5] Brossette SE, Sprague AP, Hardin JM, et al. Association rules and data mining in hospital infection control and public health surveillance. J Am Med Inform Assoc 1998;5:373-81.
[6] Brossette SE, Moser SA. Application of knowledge discovery and data mining to intensive care microbiologic data. Emerg Infect Dis 1999;5:454-7.
[7] Brossette SE, Sprague AP, Jones WT, et al. A data mining system for infection control surveillance. Methods Inf Med 2000;39:303-10.
[8] Peterson LR, Brossette SE. Hunting healthcare-associated infections from the clinical microbiology laboratory: passive, active, and virtual surveillance. J Clin Microbiol 2002;40:1-4.
[9] Peterson LR, Hacek DM, Rolland D, et al. Detection of a community infection outbreak with virtual surveillance [letter]. Lancet 2003;362(9395):1587-8.
are anchored at the 3′ end of a gene. The present build of the database
(#201, updated 03/01/2007) contains sequence data for 77 species, including
more than 6,694,833 human transcripts in 124,179 clusters.
The Unigene interface for tissue-specificity analysis is well suited for
single-gene queries. Each EST cluster entry provides a Gene Expression
Summary section that includes links to the cluster Expression Profile.
Within the profile, a numeric and graphic display of the cluster expression
in different tissue libraries is provided. Three distinct profiles are provided,
illustrating the relative cluster expression in different normal tissues, in
various health states, and in independent developmental stages. When
more than half of a cluster's entire EST counts are assigned to a single
profile state, that state is labeled as a restricted expression state. For
example, PSA (KLK3), a well-known prostate cancer serum biomarker,
shows restricted expression in prostate, prostate cancer, and the adult developmental stage. Unigene also can be queried for specific tissue-expression
profiles using the Digital Differential Display (DDD) tool, which generates
information on the tissue-specific distribution of expression defined by the EST
library of origin. DDD uses the Fisher exact test to compare the distribution of transcripts in two different libraries, thereby allowing users to
compare expression in two libraries (tissue types) of interest. Analysis is
restricted to deeply sequenced libraries (>1,000 sequences) to ensure
validity of results. Overall, Unigene provides a rich information resource
for biomarker data mining, generating gene-expression profiles across tissue,
disease, and developmental stage strata.
The Cancer Genome Anatomy Project
SAGE is a transcriptional profiling technique that surveys short 3′ oligonucleotides derived from mRNA to quantify gene expression. High-throughput, highly redundant sequencing generates millions of transcripts
that provide information on gene expression that is complementary to
traditional EST methods. The Cancer Genome Anatomy Project (CGAP)
of the NIH maintains a large database of SAGE sequences [15]. An internal
analysis pipeline filters out erroneous data and maps SAGE transcripts to
gene sequences. Based on the relative number of transcripts mapped to
a gene and the disease or tissue characteristics of the originating cellular
library, tissue-specific and disease-specific expression profiles can be
generated.
Within CGAP, the SAGE Genie is used to mine the database and generate several tissue-specific expression profiles. The SAGE Anatomical Viewer
creates a list of best SAGE tags for a specific gene. For any tag, three
different visualizations of the data can be selected. The Ludwig Transcript
Viewer provides a positional map of SAGE transcripts on the target gene,
the Digital Northern provides a frequency-sorted list of tag occurrence in
SAGE libraries, and the Anatomical Viewer generates three heat-map
Table 1
Methods for computing tissue-specificity metrics from large gene-expression data sets
Study: Selective expression; TissueInfo; ExQuest; GEPIS; Shannon Entropy; Akaike's Information Criterion; Tissue selectivity; ROKU
Data source: EST; EST; EST, microarray; Microarray; Microarray; Microarray
Specificity metric/test
Abbreviations: DEU, digital expression units; EST, expressed sequence tag; GEPIS, Gene Expression Profiling in silico; MPSS, massively parallel signature sequencing; SAGE, serial analysis of gene expression; TSU, tissue-specific units.
using the Dixon discordance test for uniform distributions. The data analysis includes the following steps:
1. Filter the data points for high-quality measurements.
2. Verify a minimum number of data points passing step one for each source.
3. Apply the quantitative test of discordance and use the statistical significance score to determine whether scores are reliable.
4. Adjust the scores computed in step three for the baseline expression level of nonvariant genes.
5. Compute the minimum intensity gap (the separation between the largest intensity and the second largest intensity).
6. Compute an overall confidence for the selective expression parameter by combining steps four and five, providing an overall confidence level for the selectivity defined by the numeric evaluation of the parameters in a single decision function.
Step four of the process allows the data to be adjusted to reduce the
significance of discordant values as the baseline expression level approaches
saturation for the measurement system used. Step five introduces a level of
robustness to the discordant values in step six by noting whether the minimum
intensity gap approaches the resolution power of the measurement system
used.
The decision metric described in this approach is designed to be universally applicable to any data source capable of reporting tissue-specific
expression profiles. The six-step analysis method computes a numeric value
for tissue selectivity that can be configured to the technique used to
generate the data to prevent erroneous predictions. The algorithm has
been used in several studies to identify tissue-selective expression [20-24].
TissueInfo
The TissueInfo method and interface were constructed to provide tissue-specific expression profiles for a target gene sequence on the basis of EST
sequence counts [25]. TissueInfo uses the Basic Local Alignment Search Tool
(BLAST) [26] and MegaBLAST [27] sequence comparison programs to
align individual EST sequences obtained from dbEST to a user-submitted
query sequence. Based on the number of ESTs matching the query sequence
and the associated annotation of the matching EST source libraries, the
query sequence is annotated as (1) expressed in a tissue, (2) specifically
expressed in a tissue, or (3) tissue specific. To be labeled as expressed in
a tissue, the query sequence need only have a single matching EST sequence
derived from a library constructed from that tissue type. To be specific to
a tissue, the number of ESTs from the tissue in question matching the query
sequence, divided by the total number of ESTs matching the query sequence,
must be greater than a user-defined threshold for specificity (the authors use
ExQuest does not incorporate any statistical test for tissue specificity. The
TSU metric provides an interesting and simple approach to normalizing
expression measurements between libraries that are sampled to different
extents. It should provide a fair estimate of tissue-specific expression levels,
provided the analysis includes no libraries that are sampled at a very low
level.
ExQuest provides a unique hierarchical organization of tissue categories
embedded in the program. EST libraries are organized into hierarchical
tissue bins in which related libraries are grouped together, thereby increasing
the overall transcript count per tissue category. The hierarchical organization is designed to provide users with three levels of organization for querying specificity: the primary tissue, the secondary tissue, and the individual
EST library. For example, a primary tissue would be the pancreas, a secondary tissue would be the islet of Langerhans, and the individual EST library
would be a single EST library derived from the islet of Langerhans. This
hierarchical organization gives users the flexibility to balance sampling
depth in a category against more exact feature specificity. A second unique
feature of ExQuest is that users can define the degree of similarity by which
EST sequences are mapped to gene entities by selecting the percentage alignment parameter used by MegaBLAST. The authors demonstrate that varying this parameter allows related paralog sequences to be differentiated from
each other. Finally, ExQuest also provides an interesting interface option to
view chromosomal mapping of sequence-associated ESTs. This feature
enables users to zoom in and out of a chromosomal map while selecting
specific tissues for which matching ESTs will be displayed. This interface
lets users evaluate tissue-specific expression patterns in a positional context
and associate these patterns with established genomic effects. The ExQuest
program is available through a Web server (http://lena.jax.org/wdcb/
ensRNA/exquest.html) but has not been updated since 2004 (Derry Roopenian, PhD, personal communication, Bar Harbor, Maine, 2007). The hierarchical grouping of libraries is a unique and useful feature not explicitly
provided in other tissue-specificity methods, allowing users to adapt the
analysis to the level best suited to their biomarker study. It is unfortunate
that the system is not in active development, but even in its present state it
may be a useful resource for investigators.
Gene Expression Profiling in silico
The Gene Expression Profiling in silico (GEPIS) method uses EST sequence mapping to compute tissue-specific expression in 43 tissue types
[29]. Before computing specificity, the GEPIS system executes a rigorous
data-quality filtering protocol on the ESTs and EST libraries obtained
from dbEST. All ESTs obtained from libraries with tissue-source annotations of unknown, ambiguous, or pooled tissue type are excluded from
the analysis. EST libraries that have been normalized or subtracted and
EST libraries derived from fetal or embryonic tissues also are excluded from
the analysis. Finally, the authors also eliminated several EST libraries that
were deemed to have been misannotated according to their expression analysis. The remaining EST sequences were assigned to gene sequences using
BLAST comparisons.
GEPIS computes EST library-normalized expression values, called
digital expression units (DEU), for all genes in all tissues. The DEU is
equal to the number of ESTs assigned to a gene in a tissue, divided by the
total ESTs in the tissue, and multiplied by a scaling factor of 1,000,000.
A Z-statistic then is used to compare any two tissue categories and determine statistical significance. The Z-statistic is computed by Equation 1,
and the result is compared with a normal distribution to obtain a p-value.
Z = \frac{\hat{p}_A - \hat{p}_B}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{N_A} + \frac{1}{N_B}\right)}} \qquad (1)
where p̂_A and p̂_B are the DEU values for tissues A and B, p̂ is the DEU for
the gene of interest across all tissues, and N_A and N_B are the total numbers of
EST sequences from tissues A and B. For a given gene, the Web interface
returns a graphic representation of the tissue-specific expression profile. It
also provides a chart containing raw EST counts and DEU values for
both normal and cancer tissue libraries, a p-value based on the Z-statistic
comparing the normal and cancer DEU values within a tissue, and the highest p-value obtained from the Z-statistics comparing the gene expression in
a normal tissue with the expression in each of the other normal tissues.
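As a rough illustration of the DEU and Z-statistic computation just described, the sketch below uses made-up EST counts; the function names are mine, and the gene's overall EST proportion across all tissues is used as the pooled estimate, following the definition of p̂ above.

```python
# Hedged sketch of GEPIS-style DEU values and the two-proportion Z comparison (Eq. 1).
# Counts and names are illustrative only.
from math import sqrt
from scipy.stats import norm

def deu(gene_ests: int, tissue_ests: int) -> float:
    """Digital expression units: gene ESTs / total tissue ESTs x 1,000,000."""
    return gene_ests / tissue_ests * 1_000_000

def z_compare(gene_a, total_a, gene_b, total_b, gene_all, total_all):
    """Compare the gene's EST proportions in tissues A and B (Eq. 1)."""
    p_a, p_b = gene_a / total_a, gene_b / total_b
    p = gene_all / total_all                     # gene's proportion across all tissues
    z = (p_a - p_b) / sqrt(p * (1 - p) * (1 / total_a + 1 / total_b))
    return z, 2 * norm.sf(abs(z))                # two-sided p-value

z, p_value = z_compare(gene_a=30, total_a=50_000, gene_b=5, total_b=60_000,
                       gene_all=60, total_all=500_000)
print(round(deu(30, 50_000)), round(z, 2), round(p_value, 4))
```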
GEPIS, much like the ExQuest program [28], computes a Regional Atlas
that graphically shows the expression of all genes in proximity to the gene of
interest, across a user-defined range and number of tissue types. This feature is
useful when investigating events such as cancer-induced copy-number effects.
The program authors also attempted to verify GEPIS gene-expression profiles
experimentally using quantitative polymerase chain reaction (qPCR) measurements of 40 genes over a range of expression levels in normal and cancer
colon samples [29]. The analysis revealed a strong correlation between the
GEPIS expression level and the qPCR expression level at all expression ranges
(low, medium, and high). GEPIS recently has been revised and released
as GeneHub-GEPIS [30]. The upgraded version now allows users to search
the database using multiple database identifiers.
Shannon Entropy
Schug and colleagues used Shannon entropy (generally designated H in
information theory) to measure the overall specificity of a gene,
that is, the degree by which a gene-expression profile differs from a ubiquitous expression profile [31]. This statistical measure of specificity provides
a single metric for assessing the complete gene-expression profile but does
not provide any information regarding the tissues in which a gene may be
specifically expressed. To obtain this information, a new statistic, Q, is introduced to measure what the authors label categorical specificity, or expression specific to a tissue type. This method was used to measure tissue
specificity in both the GNF Gene Expression Atlas [18] microarray data
and the Database of Transcribed Sequences EST data [32]. The microarray
expression values were analyzed, without modification, as obtained from the
Gene Expression Atlas. For the EST data, counts of EST sequences associated with a given gene in a given tissue were normalized into pseudo-counts
by Equation 2.
w_{g,t} = \frac{n_{g,t} + 1}{N_t + N_g} \qquad (2)
where n_{g,t} is the number of EST sequences mapped to gene g in tissue t, N_t
is the total number of EST sequences associated with tissue t, and N_g is the
total number of genes. Relative expression values p_{t|g} were computed for all
expression values by dividing w_{g,t} by the total expression for the gene across
all tissues. Using the relative expression values, the Shannon entropy was
computed using Equation 3.
H_g = -\sum_{1 \le t \le N} p_{t|g} \log_2 p_{t|g} \qquad (3)
where N is the total number of tissues evaluated. A low entropy value reflects a highly specific gene-expression profile. The Q-statistic, for measuring
categorical specificity, then is computed from the Shannon entropy using
Equation 4.
Q_{g|t} = H_g - \log_2 p_{t|g} \qquad (4)
where a Q-value of zero denotes expression restricted to that tissue only, and
an increasing Q-value is indicative of a more ubiquitous expression profile.
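A minimal sketch of Equations 2 through 4 applied to EST counts for one gene follows; the toy counts and function name are hypothetical.

```python
# Hedged sketch of pseudo-counts (Eq. 2), Shannon entropy H (Eq. 3), and the
# categorical specificity Q (Eq. 4). Numbers and names are illustrative only.
import numpy as np

def entropy_and_q(gene_counts: np.ndarray, tissue_totals: np.ndarray, n_genes: int):
    """gene_counts[t]: ESTs for this gene in tissue t; tissue_totals[t]: all ESTs in tissue t."""
    w = (gene_counts + 1) / (tissue_totals + n_genes)   # Eq. 2
    p = w / w.sum()                                      # relative expression p_{t|g}
    h = -(p * np.log2(p)).sum()                          # Eq. 3
    q = h - np.log2(p)                                   # Eq. 4, one value per tissue
    return h, q

gene_counts = np.array([40, 1, 0, 2])                    # expression concentrated in tissue 0
tissue_totals = np.array([5000, 8000, 6000, 7000])
h, q = entropy_and_q(gene_counts, tissue_totals, n_genes=20000)
print(round(h, 2), np.round(q, 2))                       # tissue 0 has the smallest Q
```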
A comparison of specificity measurements between the microarray and
EST data using the same gene set and tissue set showed that the overall distribution of entropy across all genes was slightly depressed in the EST data set
versus the microarray data set. This finding may reflect insufficient sampling
associated with some EST sequence libraries, so that several genes seem to
be expressed in a more specific manner than found when analyzing the
microarray data set. Independent experimental measurements would be
required to understand the discrepancies between the specificity measurements from the two data sources.
The method described by Schug and colleagues introduces a classic measure from
information theory, Shannon entropy, to detect tissue-specific gene-expression patterns, and a novel statistic, Q, to measure tissue-specific expression.
By using these two statistics together, genes that are expressed in a tissue-selective manner can be identified and ranked by the overall measure of specificity, in a manner similar to that developed to analyze the Gene Logic
(http://www.genelogic.com) microarray data set [33]. It would be interesting
to apply the Shannon entropy method to the GeneLogic data and compare
the overall ranks and measures of gene specificity computed by the two different approaches. The method also is advantageous because it is applicable
to data from two different sources (EST and microarray).
Akaike's Information Criterion
Akaike's Information Criterion can be used to identify genes with markedly different expression in one or a few tissues relative to the gene expression
in most tissue types [34]. The criterion originally was developed to identify
an optimal model from a class of competing models [35] but has been adapted to the detection of outlier gene expression. The authors rationalize the
use of this criterion because it operates without requiring a significance level
to be selected, thereby providing an objective method of gene selection. In
a study of mouse microarray data obtained from the RIKEN Expression Array Database [36], 49 samples from different normal tissues were evaluated
with the criterion. After filtering out low-quality data and normalizing to
a reference array, genes of interest were selected using a U-statistic, where
a low score was desirable, as defined in Equation 5.
U = n \log \hat{\sigma} + s\left(2 + \frac{\log n!}{n}\right) \qquad (5)
where (n + s) equals the total number of observations across tissue types, s
equals the number of outlier candidates, and σ̂ equals the SD of the scores
from the n tissues (but not the s tissues).
A U-statistic was computed for up to X(X − 1)/2 combinations of possible
outliers, where X = 1 + (n + s)/2. X varied according to the number of data
points (ie, it changed if there were missing data points), and combinations
of outliers that were both up- and down-regulated were used. This method can be
adapted easily to the analysis of any large data set (eg, microarray data)
for which expression ratios in multiple tissues are available or computable.
The method is fairly straightforward and easy for other users to implement,
and the authors believe that the lack of a p-value for specificity is an advantage. It does, however, require a significant amount of computation, up to
N × X(X − 1) calculations, where X is the number of tissue types evaluated
and N is the number of genes evaluated.
Tissue selectivity
A robust method to calculate tissue selectivity uses the GeneLogic microarray expression data set [33]. The methods were designed to identify
the question remains as to which data type should be analyzed to benefit the
study best. Clearly, this decision depends on the specific nature of the study
and whether the study-specific tissue is sampled, and sampled sufficiently,
in a given database. In some cases, the study target could fall within a very
specific tissue type and may be restricted to a single data source, such as the
EST database or the CGAP SAGE collection. In most cases, however, any
one of the four data types probably will be able to provide pertinent
information. To address whether this information is redundant or complementary, Huminiecki and Bicknell [45] evaluated the congruence of specificity data obtained from SAGE and EST data analysis and later compared
these data with microarray data analysis. The initial study concluded that
evaluating SAGE and EST data together provided a more accurate assessment of tissue specificity than could be obtained from either data set alone.
In a later study, the correlation of specificity analysis by EST and SAGE
data compared with that of microarray data was strong in tissues that
were extensively sampled [46]. The study reported, however, that correlation
was not strong between microarray and the EST and SAGE data types in
tissues that had complex cellular composition or that had not been sampled
extensively. This finding suggests that EST and SAGE libraries measure up
to microarray analysis only in tissues that are deeply sampled within those
libraries. None of the microarray analysis methods discussed here have
addressed the possible effect of low-level microarray measurements on the
robustness of the tissue-specificity predictions. Low-expression measurements often fall within the background level of an array technology, and
any analysis dependent on these signals is potentially erroneous. Consequently, it would seem that for genes expressed at a low level or in a poorly
sampled tissue, analysis of a single data type would be insufficient. Therefore, researchers should consider evaluating tissue-specific expression in
multiple (or all available) data types to obtain the most comprehensive
expression profile.
Tissue-specificity expression profiling has been used widely for biomarker
discovery but is equally or more applicable to candidate biomarker characterization. The literature is filled with references to studies that have identified candidate biomarkers by mining transcriptomic data for genes
differentially expressed in normal and cancerous tissue [47-55]. This type
of candidate biomarker discovery can be undertaken using the data-mining
methods described herein. It is the purpose of this review, however, to
encourage investigators to consider using these data-mining methods to generate tissue-specific expression profiles that can complement existing efforts
to discover candidate biomarkers. This approach has been used to identify
candidate cardiac markers that are specifically expressed in the heart [55], to
identify candidate prostate cancer biomarkers that are differentially
expressed in normal and cancer tissues and also are selectively expressed
in prostate [54], to identify candidate brain-injury markers specific to the
brain [53], and to identify candidate bladder carcinoma biomarkers specific
[22] Lai C, Chou C, Chang L, et al. Identification of novel human genes evolutionarily conserved in Caenorhabditis elegans by comparative proteomics. Genome Res 2000;10(5):703-13.
[23] Walker MG, Volkmuth W, Sprinzak E, et al. Prediction of gene function by genome-scale expression analysis: prostate cancer-associated genes. Genome Res 1999;9(12):1198-203.
[24] Ewing RM, Kahla AB, Poirot O, et al. Large-scale statistical analyses of rice ESTs reveal correlated patterns of gene expression. Genome Res 1999;9(10):950-9.
[25] Skrabanek L, Campagne F. TissueInfo: high-throughput identification of tissue expression profiles and specificity. Nucleic Acids Res 2001;29(21):E102.
[26] Altschul SF, Gish W, Miller W, et al. Basic local alignment search tool. J Mol Biol 1990;215:403-10.
[27] Zhang Z, Schwartz S, Wagner L, et al. A greedy algorithm for aligning DNA sequences. J Comput Biol 2000;7:203-14.
[28] Brown AC, Kai K, May ME, et al. ExQuest, a novel method for displaying quantitative gene expression from ESTs. Genomics 2004;83(3):528-39.
[29] Zhang Y, Eberhard DA, Frantz GD, et al. GEPIS: quantitative gene expression profiling in normal and cancer tissues. Bioinformatics 2004;20(15):2390-8.
[30] Zhang Y, Luoh SM, Hon L, et al. GeneHub-GEPIS: digital expression profiling for normal and cancer tissues based on an integrated gene database. Nucleic Acids Res 2007;35:W152-8.
[31] Schug J, Schuller WP, Kappen C, et al. Promoter features related to tissue specificity as measured by Shannon entropy. Genome Biol 2005;6(4):R33.
[32] The Computational Biology and Informatics Laboratory. AllGenes: a Web site providing access to an integrated database of known and predicted human (release 9.0, 2004) and mouse genes (release 10.0, 2004). Center for Bioinformatics, University of Pennsylvania. Available at: http://www.allgenes.org. Accessed November 19, 2007.
[33] Liang S, Li Y, Be X, et al. Detecting and profiling tissue-selective genes. Physiol Genomics 2006;26(2):158-62.
[34] Kadota K, Nishimura S, Bono H, et al. Detection of genes with tissue-specific expression patterns using Akaike's information criterion procedure. Physiol Genomics 2003;12(3):251-9.
[35] Akaike H. Information theory and an extension of the maximum likelihood principle. In: Proceedings of the 2nd International Symposium on Information Theory. Budapest; 1973. p. 267-81.
[36] Miki R, Kadota K, Bono H, et al. Delineating developmental and metabolic pathways in vivo by expression profiling using the RIKEN set of 18,816 full-length enriched mouse cDNA arrays. Proc Natl Acad Sci U S A 2001;98(5):2199-204.
[37] Kadota K, Ye J, Nakai Y, et al. ROKU: a novel method for identification of tissue-specific genes. BMC Bioinformatics 2006;7:294.
[38] Saito-Hisaminato A, Katagiri T, Kakiuchi S, et al. Genome-wide profiling of gene expression in 29 normal human tissues with a cDNA microarray. DNA Res 2002;9(2):35-45.
[39] Hsiao LL, Dangond F, Yoshida T, et al. A compendium of gene expression in normal human tissues. Physiol Genomics 2001;7(2):95-6.
[40] Misra J, Schmitt W, Hwang D, et al. Interactive exploration of microarray gene expression patterns in a reduced dimensional space. Genome Res 2002;12(7):1112-20.
[41] Vasmatzis G, Klee E, Kube DM, et al. Quantitating tissue specificity of human genes to facilitate biomarker discovery. Bioinformatics 2007;23(11):1348-55.
[42] Gupta S, Vingron M, Haas SA. T-STAG: resource and Web-interface for tissue-specific transcripts and genes. Nucleic Acids Res 2005;33(Web Server issue):W654-8.
[43] Wang J, Liang P. DigiNorthern, digital expression analysis of query genes based on ESTs. Bioinformatics 2003;19(5):653-4.
[44] Madden SF, O'Donovan B, Furney SJ, et al. Digital extractor: analysis of digital differential display output. Bioinformatics 2003;19(12):1594-5.
[45] Huminiecki L, Bicknell R. In silico cloning of novel endothelial-specific genes. Genome Res 2000;10(11):1796-806.
[46] Huminiecki L, Lloyd AT, Wolfe KH. Congruence of tissue expression profiles from gene expression atlas, SAGEmap and TissueInfo databases. BMC Genomics 2003;4(1):31.
[47] Campagne F, Skrabanek L. Mining expressed sequence tags identifies cancer markers of clinical interest. BMC Bioinformatics 2006;7:481.
[48] Wang XS, Zhang Z, Wang HC, et al. Rapid identification of UCA1 as a very sensitive and specific unique marker for human bladder carcinoma. Clin Cancer Res 2006;12(16):4851-8.
[49] Wang AG, Yoon SY, Oh JH, et al. Identification of intrahepatic cholangiocarcinoma related genes by comparison with normal liver tissues using expressed sequence tags. Biochem Biophys Res Commun 2006;345(3):1022-32.
[50] Yoon SY, Kim JM, Oh JH, et al. Gene expression profiling of human HBV- and/or HCV-associated hepatocellular carcinoma cells using expressed sequence tags. Int J Oncol 2006;29(2):315-27.
[51] Huang ZG, Ran ZH, Lu W, et al. Analysis of gene expression profile in colon cancer using the cancer genome anatomy project and RNA interference. Chin J Dig Dis 2006;7(2):97-102.
[52] Aouacheria A, Navratil V, Barthelaix A, et al. Bioinformatic screening of human ESTs for differentially expressed genes in normal and tumor tissues. BMC Genomics 2006;7:94.
[53] Laterza OF, Modur VR, Crimmins DL, et al. Identification of novel brain biomarkers. Clin Chem 2006;52(9):1713-21.
[54] Asmann YW, Kosari F, Wang K, et al. Identification of differentially expressed genes in normal and malignant prostate by electronic profiling of expressed sequence tags. Cancer Res 2002;62(11):3308-14.
[55] Megy K, Audic S, Claverie JM. Heart-specific genes revealed by expressed sequence tag (EST) sampling. Genome Biol 2002;3(12):RESEARCH0074:1-11.
[56] Klee EW, Finlay J, McDonald C, et al. Bioinformatics methods for prioritizing serum biomarker candidates. Clin Chem 2006;52(11):2162-4.
in genomic data analysis have been theoretically proven to be of NP-hard complexity, implying that no computational algorithm can exhaustively search
all possible candidate solutions. Thus, heuristic (most frequently statistical) algorithms that effectively search and investigate a very small portion
of all possible solutions are often sought for genomic data mining problems.
The success of many bioinformatics studies critically depends on the construction and use of effective and efficient heuristic algorithms, most of
which are based on the careful application of probabilistic modeling and statistical inference techniques.
Challenge 5: noisy high-throughput biological data
The next challenge derives from the fact that high-throughput biotechnical data and large biological databases are inevitably noisy, because biological information and signals of interest are often observed together with many other
random or confounding factors. Furthermore, a one-size-fits-all experimental design for high-throughput biotechniques can introduce bias and error
for many candidate targets. Therefore, many investigations in bioinformatics can be performed successfully only when the variability of genomic
data is well understood. In particular, the distributional characteristics of
each data set must be analyzed using statistical and quality control techniques on the initial data so that relevant statistical approaches may be
applied appropriately. This preprocessing step is critical for all subsequent
bioinformatics analyses, and reconciling dramatically different results that
may stem from slightly different preprocessing procedures can sometimes
be difficult. Although this issue has no easy answer, consistent preprocessing
procedures within each analysis and across different analyses, with good documentation of the procedures, must be used.
Challenge 6: integration of multiple, heterogeneous biological
data for translational bioinformatics research
The last challenge is the integration of genomic data with heterogeneous
biological data and associated metadata, such as gene function, biological
subjects' phenotypes, and patients' clinical parameters. For example, multiple
heterogeneous data sets, including gene expression data, biological responses, clinical findings, and outcomes data, may need to be combined
to discover genomic biomarkers and gene networks that are relevant to disease and predictive of clinical outcomes, such as cancer progression and chemosensitivity to an anticancer compound. Some data sets exist in different
formats and may require combined preprocessing, mapping between data elements, or other preparatory steps before correlative analysis, depending on
their biological characteristics and data distributions. Effective combination
and use of the information from these heterogeneous genomic, clinical, and
other data resources remain a significant challenge.
This article reviews novel concepts and techniques for tackling various
genomic data mining problems. In particular, because DNA microarray
and GeneChip techniques have become an important tool in biological
and biomedical investigations, this article focuses on statistical approaches
that have been applied to various microarray data analyses to overcome
some of the challenges mentioned earlier.
Table 1
Classification of the candidate hypotheses
                 Null true    Alternative true    Total
Null accept      U            T                   W
Null reject      V            S                   R
Total            M0           M1                  M
The FDR evaluation has been rapidly adopted for microarray data analysis, including the widely used significance analysis of microarrays (SAM)
and other approaches [1,6]. Many different methods have been suggested
for estimating FDR directly from test statistics, or indirectly from classical
P values of these statistics. The latter methods are convenient because standard P values can be simply converted into their corresponding FDR and Q values,
the latter especially when based on a resampling technique [5,7].
More careful FDR assessment can also be found in many other recent
studies [7].
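For illustration, the common Benjamini-Hochberg conversion of classical P values into FDR-style q-values can be sketched as follows; this generic recipe is not the specific resampling-based estimator cited above.

```python
# Hedged sketch: Benjamini-Hochberg style q-values from a vector of P values.
import numpy as np

def bh_qvalues(pvals: np.ndarray) -> np.ndarray:
    m = len(pvals)
    order = np.argsort(pvals)
    scaled = pvals[order] * m / np.arange(1, m + 1)        # p * m / rank
    q_sorted = np.minimum.accumulate(scaled[::-1])[::-1]   # enforce monotonicity
    q = np.empty(m)
    q[order] = np.clip(q_sorted, 0, 1)
    return q

pvals = np.array([0.0002, 0.009, 0.04, 0.31, 0.76])
print(np.round(bh_qvalues(pvals), 4))   # small q-values pass a chosen FDR cutoff
```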
the variance variability across different intensity ranges [1]. Based on the observation that the signal-to-noise ratio varies with different gene expression intensities, SAM tries to stabilize gene-specific fluctuations and is defined based
on the ratio of the change in gene expression to the standard deviation in the data
for that gene. The relative difference d(i) in gene expression is defined as:
d(i) = \frac{\bar{x}_I(i) - \bar{x}_U(i)}{s(i) + s_0}
where x̄_I(i) and x̄_U(i) are the average expression values of gene i in states I
and U, respectively. The gene-specific scatter s(i) is the pooled standard
deviation of replicated expression values of the gene in the two states. To
compare values of d(i) across all genes, the distribution of d(i) is assumed
to be independent of the level of gene expression. However, at low expression levels, variability in d(i) can be high because of small values of s(i).
To ensure that the variance of d(i) is independent of gene expression, a positive constant s_0 is added to the denominator. The value for s_0 is chosen to
minimize the coefficient of variation, where the coefficient of variation of
d(i) is computed as a function of s(i) in moving windows across all the genes.
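A simplified sketch of this relative difference follows; the fudge constant s_0 is fixed here rather than tuned by minimizing the coefficient of variation, and the data are simulated.

```python
# Hedged sketch of the SAM-style relative difference d(i) with a fudge constant s0.
import numpy as np

def sam_d(x_i: np.ndarray, x_u: np.ndarray, s0: float = 0.1) -> float:
    """x_i, x_u: replicated expression values of one gene in states I and U."""
    n_i, n_u = len(x_i), len(x_u)
    pooled_var = ((n_i - 1) * np.var(x_i, ddof=1) + (n_u - 1) * np.var(x_u, ddof=1)) \
                 / (n_i + n_u - 2)
    s = np.sqrt(pooled_var * (1 / n_i + 1 / n_u))          # gene-specific scatter s(i)
    return (x_i.mean() - x_u.mean()) / (s + s0)

rng = np.random.default_rng(0)
treated = rng.normal(8.0, 0.3, size=4)
control = rng.normal(7.0, 0.3, size=4)
print(round(sam_d(treated, control), 2))
```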
Local pooled error
Based on a more careful error-pooling technique, the so-called local-pooled-error (LPE) test was also introduced. This testing technique is
particularly useful when the sample size is very small (eg, two or three per
condition). LPE variance estimates for genes are formed by pooling variance
estimates for genes with similar expression intensities from replicated arrays
within experimental conditions [6] (LPE package, www.bioconductor.org). The LPE approach leverages the observations that genes with similar expression intensity values often show similar
array-experimental variability within experimental conditions, and that the variance of individual gene expression measurements within experimental conditions typically decreases as a (nonlinear) function of intensity.
The LPE approach is possible because common background noise
can often be found within each local intensity region of the microarray
data. At high levels of expression intensity, this background noise is dominated by the expression intensity, whereas at low levels the background
noise is a larger component of the observed expression intensity; this can
be easily observed in the so-called M versus A log-intensity scatter
plot of two replicated chips among three different immune conditions
(Fig. 1) [6]. The LPE approach controls for the situation in which a gene with
low expression has very low variance by chance and the resulting signal-to-noise ratio is unrealistically large.
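A rough sketch of the local pooling idea (sharing a variance estimate among genes of similar average intensity within one condition) is shown below; the quantile binning and median pooling are simplifications of the published LPE procedure and its Bioconductor implementation.

```python
# Hedged sketch of local error pooling: genes borrow a variance estimate from
# other genes of similar intensity within the same condition. Simplified.
import numpy as np

def pooled_variance_by_intensity(expr: np.ndarray, n_bins: int = 10) -> np.ndarray:
    """expr: genes x replicates matrix of log intensities for one condition."""
    means = expr.mean(axis=1)
    gene_var = expr.var(axis=1, ddof=1)
    edges = np.quantile(means, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.digitize(means, edges[1:-1]), 0, n_bins - 1)
    bin_var = np.array([np.median(gene_var[bins == b]) for b in range(n_bins)])
    return bin_var[bins]                      # per-gene pooled variance estimate

rng = np.random.default_rng(1)
expr = rng.normal(loc=np.linspace(4, 12, 500)[:, None], scale=0.4, size=(500, 3))
print(pooled_variance_by_intensity(expr)[:3].round(3))
```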
Fig. 1. Log-intensity ratio (M) as a function of average gene expression between replicated
chips (A). Top panels represent the estimated error distributions (based on a nonparametric
regression) for (A) naive, (B) 48-hour activated, and (C) T-cell clone D4 conditions in the mouse
immune response microarray study.
array, gene, condition, array-by-gene interaction, and condition-by-gene interaction separately on complementary DNA (cDNA) microarray data [11], and
a two-stage mixed model was proposed first to model cDNA microarray
data with the effects of array, condition, and condition-by-array interaction
and then to fit the residuals with the effects of gene, gene-by-condition interaction,
and gene-by-array interaction [12]. Several approaches have also been developed using the Bayesian paradigm for analyzing microarray data, including
Bayesian parametric modeling [13], the Bayesian regularized t test [8], Bayesian
hierarchical modeling with a multivariate normal prior [14], and the Bayesian
heterogeneous error model (HEM) with two error components [15].
Analysis of variance modeling
The use of ANOVA models has been suggested to estimate relative gene
expression and to account for other sources of variation in microarray data
[16]. Although the exact form of the ANOVA model depends on the particular data set, a typical ANOVA model for two-color-based cDNA microarray data can be defined as
y_{ijkg} = \mu + A_i + D_j + V_k + G_g + (AD)_{ij} + (AG)_{ig} + (DG)_{jg} + (VG)_{kg} + \varepsilon_{ijkg}
where y_{ijkg} is the measured intensity from array i, dye j, variety k, and gene g
on an appropriate scale (typically the log scale). The generic term variety is
often used to refer to the mRNA samples studied, such as treatment and
control samples; cancer and normal cells; or time points of a biological process. The terms A, D, and AD account for the overall effects that are not
gene-specific. The gene effects G_g capture the average levels of expression
for genes, and the array-by-gene interactions (AG)_{ig} capture differences caused
by varying sizes of spots on arrays. The dye-by-gene interactions (DG)_{jg} represent gene-specific dye effects. None of these effects is of biological interest; they amount to a normalization of the data for ancillary sources of
variation. The effects of primary interest are the interactions between genes
and varieties, (VG)_{kg}. These terms capture differences from overall averages
that are attributable to the specific combination of variety k and gene g. Differences among these variety-by-gene interactions provide the estimates for
the relative expression of gene g in varieties 1 and 2 through (VG)_{1g} − (VG)_{2g}.
Note that AV, DV, and other higher-order interaction terms are typically assumed to be negligible and are considered together with the error terms. The
error terms ε_{ijkg} are often assumed to be independent and normal with mean
zero and a common variance. However, such a global ANOVA model is difficult to implement in practice because of its computational burden. Instead, one often considers gene-by-gene ANOVA models such as
y_{ijkg} = \mu_g + A_i + D_j + V_k + (AD)_{ij} + (VG)_{kg} + \varepsilon_{ijkg}
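As an illustration, a gene-by-gene model of this form can be fit with standard linear-model software; the sketch below uses statsmodels with simulated log intensities and made-up factor labels, and reads off the variety effect for one gene.

```python
# Hedged sketch: gene-by-gene ANOVA with array, dye, and variety effects via OLS.
# Factor labels and the simulated data are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
design = pd.DataFrame({
    "array":   ["a1", "a1", "a2", "a2", "a3", "a3", "a4", "a4"],
    "dye":     ["cy3", "cy5"] * 4,
    "variety": ["control", "treated", "treated", "control"] * 2,
})
design["y"] = (rng.normal(8.0, 0.1, len(design))                       # baseline log intensity
               + np.where(design["variety"] == "treated", 1.0, 0.0))   # true effect of ~1

fit = smf.ols("y ~ C(array) + C(dye) + C(variety)", data=design).fit()
print(round(fit.params["C(variety)[T.treated]"], 2))  # estimated relative expression
```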
Fig. 2. Receiver operating characteristic curves from heterogeneous error model (solid lines) and
analysis of variance (dotted lines) models with two and five replicated arrays. The horizontal axis
is 1 − false-positive error rate (FPR) and the vertical axis is 1 − false-negative error rate (FNR).
clustering analysis is one of the most frequently used techniques for genomic data mining in biomedical studies [17-19]. Some technical aspects of
these approaches are summarized here. A clustering approach first must be defined through a measure or distance index of similarity or dissimilarity,
such as
Euclidean: d(x, y) = \sqrt{\sum_k (x_k - y_k)^2}
Manhattan: d(x, y) = \sum_k |x_k - y_k|
Correlation: d(x, y) = 1 - r(x, y), where r(x, y) is a correlation coefficient
Next, an allocation algorithm must be defined based on one of these distance metrics (a brief clustering sketch follows the list below). Two classes of clustering algorithms have been used in genomic data analysis: hierarchical and partitioning allocation algorithms.
Hierarchical algorithms that allocate each subject to its nearest subject or
group include:
Agglomerative methods: average linkage based on group average distance, single linkage based on minimum nearest distance, and complete
linkage based on maximum furthest distance;
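For example, agglomerative hierarchical clustering with a correlation-based distance and average linkage can be run with SciPy as sketched below; the expression matrix is simulated and the parameter choices are illustrative.

```python
# Hedged sketch: average-linkage hierarchical clustering of genes with a
# correlation distance (1 - r). Data are simulated.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(3)
pattern1 = np.r_[np.full(5, 2.0), np.zeros(5)]      # high in the first five samples
pattern2 = np.r_[np.zeros(5), np.full(5, 2.0)]      # high in the last five samples
expr = np.vstack([pattern1 + rng.normal(0, 0.3, (20, 10)),
                  pattern2 + rng.normal(0, 0.3, (20, 10))])   # 40 genes x 10 samples

dist = pdist(expr, metric="correlation")            # 1 - Pearson correlation
tree = linkage(dist, method="average")              # average-linkage agglomeration
labels = fcluster(tree, t=2, criterion="maxclust")
print(np.bincount(labels)[1:])                      # sizes of the two recovered gene clusters
```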
Fig. 3. Dendrogram (top panel) and heatmap (bottom panel) of hierarchical clustering analysis
for the concordant complementary DNA (cDNA) and oligo array expression patterns on the
NCI-60 cancer cell lines. A region of the heatmap occupied by melanoma genes is shown from
the combined set of 3297 oligo and cDNA transcripts. Each gene expression pattern is designated as coming from the cDNA or oligo array set. The concordant oligo and cDNA microarray expressions are marked with blue bars.
One of the most difficult aspects of using these clustering analyses is the
interpretation of their heuristic, often unstable, clustering results. To overcome this shortcoming, several refined clustering approaches have been suggested. For example, the use of bootstrapping was suggested to evaluate the
consistency and confidence of each gene's membership in particular cluster
groups [11]. The gene shaving approach was suggested to find the clusters
directly relevant to the major variance directions of an array data set [3]. Recently, tight clustering, a refined bootstrap-based hierarchical clustering,
was proposed to formally assess and identify the groups of genes that are
most tightly clustered with each other [20].
evaluation. Several different measures are currently used to evaluate the performance of classification models: the classification error rate, the area under the
receiver operating characteristic curve (area under the curve [AUC]), and
the product of posterior classification probabilities [27,28].
When a large number of candidate models (eg, approximately 10^8 two-gene models on 10K array data) are compared in their performance, these
measures are often saturated (their maximum performance levels are
achieved by many competing models), so that identification of the best
(most robust) prediction model among them is extremely difficult. Furthermore, these measures cannot capture an important aspect of classification
model performance, as follows: suppose three samples are classified using
two classification models (or rules); one model provides the correct posterior
classification probabilities 0.8, 0.9, and 0.4, and the other 0.8, 0.8, and 0.4
for the three samples. Assuming these were unbiased estimates of the classification probabilities (on future data), the former model would be preferred because it will perform better in terms of the expected
number of correctly classified samples in future data.
Note that the two models provide the same misclassification error rate,
one third. This aspect of classification performance cannot be captured
by evaluating the commonly used error rate or AUC criteria, which
simply add one count for each correctly classified sample, ignoring its degree
of classification error probability.
To overcome this limitation, the so-called misclassification-penalized
posterior (MiPP) criterion has been suggested recently [4]. This measure is
the sum of the correct-classification (posterior) probabilities of correctly classified samples minus the sum of the misclassification (posterior) probabilities of misclassified samples. Suppose there are m classes π_k, k = 1, . . ., m,
from a population of N samples, and let X_j, j = 1, . . ., n_k, be the jth sample in class k. Under a particular prediction model (eg, a one- or two-gene model) from a classification rule, such as linear discriminant analysis or SVMs, MiPP is then defined
as:
L = \sum_{\mathrm{correct}} p_k(X_j) \; - \; \sum_{\mathrm{wrong}} \left(1 - p_k(X_j)\right)
where p_k(X_j) is the posterior classification probability of sample X_j into the
kth class. Here "correct" and "wrong" correspond to the samples that are correctly and incorrectly classified. In the two-class problem, correct classification simply
means p_k(X_j) is more than 0.5, but in general it occurs when p_k(X_j) =
max_{i=1,...,m} p_i(X_j). MiPP can also be shown to be the sum of the posterior probabilities of correct classification penalized by the number of misclassified samples (N_M): L = Σ p_k(X_j) − N_M. Thus, MiPP is a continuous
measure (compared with the discrete error rate) of classification performance that accounts for both the degree of classification certainty and the error
rate, and it is sensitive enough to distinguish subtle differences in prediction
performance among many competing models.
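The MiPP computation itself is simple once true-class posterior probabilities are available; the sketch below reproduces the three-sample comparison from the text (the function and variable names are mine).

```python
# Hedged sketch of the MiPP criterion for a two-class problem: sum of true-class
# posteriors minus the number of misclassified samples.
import numpy as np

def mipp(true_class_posteriors: np.ndarray) -> float:
    """Each entry is the posterior probability of that sample's true class."""
    misclassified = np.sum(true_class_posteriors <= 0.5)    # two-class rule
    return float(true_class_posteriors.sum() - misclassified)

model_a = np.array([0.8, 0.9, 0.4])   # example from the text
model_b = np.array([0.8, 0.8, 0.4])
print(round(mipp(model_a), 2), round(mipp(model_b), 2))
# same error rate (1/3), but model A receives the higher MiPP score
```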
Classification modeling
Several classification modeling approaches are currently widely used in
genomic data analysis.
Gene voting
Gene voting [21] is an intuitively derived technique that aggregates the
weighted votes from all modeling gene signatures; the advantage of this
technique is that it can be easily implemented without complicated computing and statistical arguments. It was proposed for predicting subclasses
of patients who have acute leukemia observed with microarray gene expression data [21]. This method gains accuracy through aggregating predictors
built from a learning set and casting their voting weights. For binary classification of a sample, each of the p predictor genes casts a vote for class 1 or 2, and the
votes are aggregated over genes. For gene g_j, the vote is v_j = a_j(g_j − b_j),
where a_j = (m_1 − m_2)/(s_1 + s_2) and b_j = (m_1 + m_2)/2 for sample means
m_1 and m_2 and sample standard deviations s_1 and s_2. Using this method
based on 50 gene predictors, 36 of 38 patients in an independent validation
set were correctly classified between acute myeloid leukemia and acute
lymphoblastic leukemia.
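A compact sketch of this voting scheme for one test sample follows; the per-gene class means and SDs are simulated, and taking a positive aggregate vote to mean class 1 is a convention of this sketch.

```python
# Hedged sketch of weighted gene voting for a two-class problem. Training
# summaries and the test sample are simulated; names are illustrative only.
import numpy as np

def gene_votes(x, m1, m2, s1, s2):
    """x: expression of the predictor genes in one test sample;
    m1/m2, s1/s2: per-gene class means and SDs from the training set."""
    a = (m1 - m2) / (s1 + s2)          # weight a_j
    b = (m1 + m2) / 2.0                # decision midpoint b_j
    return a * (x - b)                 # votes v_j = a_j (g_j - b_j)

rng = np.random.default_rng(4)
m1, m2 = rng.normal(8, 1, 50), rng.normal(7, 1, 50)
s1 = s2 = np.full(50, 0.5)
test_sample = m1 + rng.normal(0, 0.5, 50)           # a sample resembling class 1
total = gene_votes(test_sample, m1, m2, s1, s2).sum()
print("class 1" if total > 0 else "class 2", round(total, 1))
```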
Linear and quadratic discriminant analysis
Linear or quadratic discriminant analysis is a classical statistical classification technique based on the multivariate normal distribution assumption.
This technique is frequently found to be robust and powerful for many different applications, despite the distributional assumption; the gene voting
technique can be considered a variant of linear discriminant analysis. Linear discriminant analysis can be applied with leave-one-out classification,
assuming each class follows a multivariate normal distribution. Each sample
is then allocated to the group k for which its classification probability is
maximized. Quadratic discriminant analysis can be performed similarly,
except that the covariance matrix of the multivariate normal distribution
(for each of the m classes) is now allowed to differ among the m classes. Differences between linear discriminant analysis and quadratic discriminant
analysis are typically small, especially if polynomial factors are considered
in linear discriminant analysis. In general, quadratic discriminant analysis
requires more observations to estimate each variance-covariance matrix
for each class. Linear and quadratic discriminant analysis have consistently
shown high performance, not because the data are likely derived from Gaussian distributions, but more likely because the data support only simple
boundaries, such as linear or quadratic ones [28].
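As an illustration, leave-one-out evaluation of linear and quadratic discriminant analysis on a small simulated two-gene data set might look like the following (scikit-learn is assumed; the data are synthetic).

```python
# Hedged sketch: leave-one-out error of LDA and QDA on simulated two-gene data.
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(5)
X = np.vstack([rng.normal([0, 0], 1.0, (20, 2)),
               rng.normal([2, 2], 1.0, (20, 2))])    # two classes, two "genes"
y = np.repeat([0, 1], 20)

for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                  ("QDA", QuadraticDiscriminantAnalysis())]:
    accuracy = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
    print(name, "leave-one-out error rate:", round(1 - accuracy, 3))
```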
Logistic regression
The logistic regression classification technique is based on a regression
fit to the probabilistic odds among the conditions being compared. This technique
requires no specific distributional assumption but is often found to be less sensitive than other approaches. Logistic regression methods simply maximize
the conditional likelihood Pr(G = k | X), typically by a Newton-Raphson
algorithm [29]. The allocation decision on a sample is based on the logistic
regression fit:
\mathrm{logit}(p_i) = \log\left(\frac{p_i}{1 - p_i}\right) = \hat{b}^{T} x
where b̂ is the estimated logistic regression coefficient vector for the
microarray data. Logistic regression discriminant analysis is often used
because of its flexible assumption about the underlying distribution, but if
the data actually come from a Gaussian distribution, logistic regression shows a loss
of about 30% efficiency in the (misclassification) error rate compared with linear
discriminant analysis.
Support vector machines
Conceptually similar to gene voting, SVMs are one of the more recent
machine-learning classification techniques, based on projecting the data into a
high-dimensional kernel space. This technique also does not require a distributional assumption, yet it can perform better than other approaches in
some complicated cases. However, it often requires large numbers of samples and predictor gene signatures for optimal performance. SVMs separate
a given set of binary labeled training data with a hyperplane that is maximally distant from them, known as the maximal margin hyperplane [24].
Based on a kernel, such as a polynomial of dot products, the current data
space is embedded in a higher dimensional space. Commonly used kernels include the following (a brief sketch using these kernels appears after the list):
Radial basis function kernel: K(x, y) = \exp\left(-\|x - y\|^2 / 2\sigma^2\right)
Polynomial kernel: K(x, y) = \langle x, y\rangle^{d} or K(x, y) = \left(\langle x, y\rangle + c\right)^{d},
where \langle\cdot,\cdot\rangle denotes the inner product.
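A brief sketch of SVM classification with these kernels on simulated expression data follows; scikit-learn is assumed, and the data and parameter values are illustrative.

```python
# Hedged sketch: SVMs with RBF and polynomial kernels, assessed by
# leave-one-out cross-validation on simulated expression data.
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0.0, 1.0, (25, 30)),
               rng.normal(0.8, 1.0, (25, 30))])      # 50 samples x 30 genes
y = np.repeat([0, 1], 25)

for kernel, params in [("rbf", {"gamma": "scale"}),
                       ("poly", {"degree": 2, "coef0": 1})]:
    accuracy = cross_val_score(SVC(kernel=kernel, C=1.0, **params),
                               X, y, cv=LeaveOneOut()).mean()
    print(kernel, "accuracy:", round(accuracy, 2))
```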
Comparison of classification methods
These classification techniques must be carefully applied in prediction
model training on genomic data. In particular, if all the samples are used
for model search/training and evaluation in a large screening search for classification models, a serious selection bias is inevitably introduced [30]. To
avoid this pitfall, a stepwise (leave-one-out) cross-validated discriminant
procedure has been suggested that gradually adds genes to the prediction model
[4,28]. The prediction performance is typically found to improve continuously
(or at least not decrease) as more features are added to the model.
This result is again caused by a sequential search-and-selection strategy
over an astronomically large number of candidate models; some of
them can show overoptimistic prediction performance for a particular
Table 2
Classification results of the classification rules and the corresponding gene model
Method            Gene model    Error rate on    MiPP on          Error rate on    MiPP on
                                training data    training data    test data        test data
LDA               1144, 5062    0%               37.91            2.9%             31.46
QDA               4211, 575     0%               37.96            5.8%             29.81
Logistic          4377, 1882    0%               37.99            11.8%            25.64
SVM (K: linear)   1882, 4847    0%               35.16            0%               29.26
SVM (K: RBF)      1807, 2020    0%               32.52            5.9%             21.71
Abbreviations: K, kernel; LDA, linear discriminant analysis; MiPP, misclassification-penalized posterior; QDA, quadratic discriminant analysis; RBF, radial basis function; SVM, support vector machine.
regression. In an application to a different microarray study on colon cancer, the radial basis function-kernel SVM model with three genes was found
to perform best among these classification techniques.
The MiPP-based stepwise cross-validated discriminant (SCVD) procedure was the most robust classification
model and could accurately classify samples with a very small number of
features (only two or three genes for the two well-known microarray
data sets, outperforming many previous models with 50 to 100 features),
although different classification methods may perform differently in different data sets. These data are consistent with the notion that many correlated
genes share more or less similar information and may discriminate similarly
among different subtypes of a particular disease, and that multiple small-feature models may perform well in terms of the construction of a classification model. As shown, the prediction performance on the training set is
quickly saturated, with a 0% error rate and values very close to the maximum
MiPP value of 38 (the total sample size). However, error rates and MiPP values
vary greatly on the independent test set. The error rates were also found to
be misleading and less informative than MiPP.
and techniques for these microarray data have been significantly improved,
especially in testing (eg, SAM, LPE, false discovery rate), clustering (eg, hierarchical, self-organizing map, K-means, response projected clustering),
classification (linear discriminant analysis, SVMs, logistic regression, random forest), and pathway analysis (Gene Map Annotator and Pathway Profiler, Ingenuity Pathway Analysis) for investigating the complex and extensive
information in massive genomic data sets effectively and efficiently [1,4,24].
Finally, and most importantly, based on significant efforts by the National Institutes of Health (Gene Expression Omnibus [GEO]) and the European Bioinformatics Institute (ArrayExpress), many precious microarray data sets of
cancer (cell lines and patients) have been archived for public access. For example, GEO currently archives more than 5550 microarray data sets on
more than 150,000 different biomedical samples and human patients, with
more than 1500 sets for cancer alone. Furthermore, despite their technical
differences, microarray data sets from different time points, laboratories,
and even platforms contain consistent information for many gene expression
patterns, so that investigations can be performed successfully across those
different genomic data sets. This large and rapidly increasing compendium
of data demands data mining approaches and ensures that genomic data
mining will continue to be a necessary and highly productive field.
References
[1] Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A 2001;98(9):5116–21.
[2] Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci U S A 2003;100(16):9440–5.
[3] Hastie T, Tibshirani R, Eisen MB, et al. Gene shaving as a method for identifying distinct sets of genes with similar expression patterns. Genome Biol 2000;1(2):RESEARCH0003.
[4] Soukup M, Cho H, Lee JK. Robust classification modeling on microarray data using misclassification penalized posterior. Bioinformatics 2005;21(Suppl 1):i423–30.
[5] Benjamini Y, Drai D, Elmer G, et al. Controlling the false discovery rate in behavior genetics research. Behav Brain Res 2001;125(1–2):279–84.
[6] Jain N, Thatte J, Braciale T, et al. Local-pooled-error test for identifying differentially expressed genes with a small number of replicated microarrays. Bioinformatics 2003;19(15):1945–51.
[7] Jain N, Cho H, O'Connell N, et al. Rank-invariant resampling based estimation of false discovery rate for analysis of small sample microarray data. BMC Bioinformatics 2005;6:187.
[8] Baldi P, Long AD. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 2001;17(6):509–19.
[9] Efron B, Tibshirani R. Empirical Bayes methods and false discovery rates for microarrays. Genet Epidemiol 2002;23(1):70–86.
[10] Kerr MK, Martin M, Churchill GA. Analysis of variance for gene expression microarray data. J Comput Biol 2000;7(6):819–37.
[11] Kerr MK, Churchill GA. Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments. Proc Natl Acad Sci U S A 2001;98(16):8961–5.
[12] Wolfinger RD, Gibson G, Wolfinger ED, et al. Assessing gene significance from cDNA microarray expression data via mixed models. J Comput Biol 2001;8(6):625–37.
[13] Newton MA, Kendziorski CM, Richmond CS, et al. On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J Comput Biol 2001;8(1):37–52.
[14] Ibrahim JG, Chen MH, Gray RJ. Bayesian models for gene expression with DNA microarray data. J Am Stat Assoc 2002;97:88–99.
[15] Cho H, Lee JK. Bayesian hierarchical error model for analysis of gene expression data. Bioinformatics 2004;20(13):2016–25.
[16] Kerr MK, Churchill GA. Statistical design and the analysis of gene expression microarray data. Genet Res 2001;77(2):123–8.
[17] Lee JK, Bussey KJ, Gwadry FG, et al. Comparing cDNA and oligonucleotide array data:
concordance of gene expression across platforms for the NCI-60 cancer cells. Genome
Biol 2003;4(12):R82.
[18] Scherf U, Ross DT, Waltham M, et al. A gene expression database for the molecular pharmacology of cancer. Nat Genet 2000;24(3):236–44.
[19] Weinstein JN, Scherf U, Lee JK, et al. The bioinformatics of microarray gene expression profiling. Cytometry 2002;47(1):46–9.
[20] Tseng GC, Wong WH. Tight clustering: a resampling-based approach for identifying stable and tight patterns in data. Biometrics 2005;61(1):10–6.
[21] Golub TR, Slonim DK, Tamayo P, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999;286(5439):531–7.
[22] West M, Blanchette C, Dressman H, et al. Predicting the clinical status of human breast cancer by using gene expression profiles. Proc Natl Acad Sci U S A 2001;98(20):11462–7.
[23] Su AI, Welsh JB, Sapinoso LM, et al. Molecular classification of human carcinomas by use of gene expression signatures. Cancer Res 2001;61(20):7388–93.
[24] Furey TS, Cristianini N, Duffy N, et al. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 2000;16(10):906–14.
[25] Nguyen DV, Rocke DM. Partial least squares proportional hazard regression for application to DNA microarray survival data. Bioinformatics 2002;18(12):1625–32.
[26] Li L, Darden TA, Weinberg CR, et al. Gene assessment and sample classification for gene expression data using a genetic algorithm/k-nearest neighbor method. Comb Chem High Throughput Screen 2001;4(8):727–39.
[27] Hand DJ. Construction and assessment of classification rules. Chichester: John Wiley & Sons; 1997.
[28] Soukup M, Lee JK. Developing optimal prediction models for cancer classification using gene expression data. J Bioinform Comput Biol 2004;1(4):681–94.
[29] Pampel FC. Logistic regression: a primer. Sage University Papers Series on Quantitative
Applications of the Social Sciences; 2000.
[30] Ambroise C, McLachlan GJ. Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci U S A 2002;99(10):6562–6.
[31] Romero PR, Karp PD. Using functional and organizational information to improve genome-wide computational prediction of transcription units on pathway-genome databases. Bioinformatics 2004;20(5):709–17.
[32] Brivanlou AH, Darnell JE Jr. Signal transduction and the control of gene expression. Science 2002;295(5556):813–8.
[33] Friedman N, Linial M, Nachman I, et al. Using Bayesian networks to analyze expression data. J Comput Biol 2000;7(3–4):601–20.
[34] Segal E, Taskar B, Gasch A, et al. Rich probabilistic models for gene expression. Bioinformatics 2001;17(Suppl 1):S243–52.
[35] Segal E, Friedman N, Koller D, et al. A module map showing conditional activity of expression modules in cancer. Nat Genet 2004;36(10):1090–8.
[36] Conlon EM, Liu XS, Lieb JD, et al. Integrating regulatory motif discovery and genome-wide expression analysis. Proc Natl Acad Sci U S A 2003;100(6):3339–44.
[37] van 't Veer LJ, Dai H, van de Vijver MJ, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002;415(6871):530–6.
[38] van 't Veer LJ, Dai H, van de Vijver MJ, et al. Expression profiling predicts outcome in breast cancer. Breast Cancer Res 2003;5(1):57–8.
[39] Dressman HK, Hans C, Bild A, et al. Gene expression profiles of multiple breast cancer phenotypes and response to neoadjuvant chemotherapy. Clin Cancer Res 2006;12(3 Pt 1):819–26.
[40] Potti A, Mukherjee S, Petersen R, et al. A genomic strategy to refine prognosis in early-stage non-small-cell lung cancer. N Engl J Med 2006;355(6):570–80.
[41] Miller LD, Smeds J, George J, et al. An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc Natl Acad Sci U S A 2005;102(38):13550–5.
[42] Havaleshko DM, Cho H, Conaway M, et al. Prediction of drug combination chemosensitivity in human bladder cancer. Mol Cancer Ther 2007;6(2):578–86.
[43] Paik S, Shak S, Tang G, et al. A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N Engl J Med 2004;351(27):2817–26.
[44] Horvath S, Zhang B, Carlson M, et al. Analysis of oncogenic signaling networks in glioblastoma identifies ASPM as a molecular target. Proc Natl Acad Sci U S A 2006;103(46):17402–7.
[45] Bild AH, Yao G, Chang JT, et al. Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature 2006;439(7074):353–7.
[46] Potti A, Yao G, Chang JT, et al. Genomic signatures to guide the use of chemotherapeutics. Nat Med 2006;12(11):1294–300.
[47] Ma XJ, Patel R, Wang X, et al. Molecular classification of human cancers using a 92-gene real-time quantitative polymerase chain reaction assay. Arch Pathol Lab Med 2006;130(4):465–73.
[48] Puskas LG, Juhasz F, Zarva A, et al. Gene profiling identifies genes specific for well-differentiated epithelial thyroid tumors. Cell Mol Biol (Noisy-le-grand) 2005;51(2):177–86.