Norrköping Presentation

Recurrence Prediction and Non-compliance with Guidelines: Register
Analysis Using Data Mining
Amir R Razavi
Department of Biomedical Engineering, Division of Medical Informatics
Linköpings universitet, Linköping, Sweden
http://www.imt.liu.se
Outline
• Introduction
• Knowledge Discovery in Databases
• Clinical Guidelines
• Discussion
• Future works

Introduction
• Data in medicine are stored in many different ways:
– Hospital Information Systems (HIS)
– Electronic Medical Records (EMR)
– Medical registers
– Output from devices, for example imaging devices
– …
• And the storing of patient data continues…
– “Patientjournal 08” project in Östergötland: computerization
of all patient data, completed Dec 2008
– …

Introduction
• Medical registers are used for:
– Monitoring trends in the incidence of
conditions and diseases
– Monitoring outcomes after the implementation
of disease-prevention and treatment programs
– Assessing the safety of new drugs and
procedures, identifying best clinical practice and
comparing healthcare systems

Introduction
• Data in medicine are unique and need
special attention.
– Heterogeneity of medical data
• Complexity of medical data; images, signals, …
• Physician's interpretation
• Poor mathematical characterization
• Degrees of relationships between variables
– Ethical/legal/social issues
• Data ownership
• Confidentiality of human data

Introduction
• What can be extracted from these large
repositories of medical data?
• How can the hidden knowledge be
extracted from medical registers?
• Can the extracted knowledge be used to
support clinicians in their decision
making?

Introduction
• Why is decision support important?
– Limited resources
– Need for ways to improve health care processes
and their outcomes
– Improving the decision-making ability of clinicians
by allowing more or better decisions within the
constraints of their knowledge and time

Introduction
• Playground
– Breast cancer register
– Knowledge Discovery in Databases
– Clinical Guidelines
• Aim
– Decision Support to clinicians in oncology

Introduction
• Breast cancer register
– Breast cancer register of the south-east region of
Sweden, vårdprogram (care programme)

Introduction
• Knowledge Discovery in Databases (KDD)

– Process of semi-automatically analyzing large
databases to find patterns that are:
• Valid: true for new data with some certainty
• Novel: non-obvious
• Useful: it should be possible to act upon the item
• Understandable: humans should be able to interpret
the pattern

Introduction
• Clinical Guidelines (CG)

– Developed to help physicians make decisions
about appropriate treatment for specific
circumstances
– Can result in improvements in overall
healthcare, including clinical practice
– Can provide decision support tools for
practitioners

Introduction
• Decision Support Systems (DSS)
– A DSS is a computerized system that helps in making
decisions
– A decision is a choice between alternatives
based on estimates of the values of those
alternatives
– Interactive computer-based systems that help
decision makers utilize data and models to
solve problems (Sprague and Carlson, 1982)

Knowledge Discovery in Databases
• KDD steps:
– Understanding the domain
– Creating the main dataset for the KDD
– Data pre-processing
– Data mining
– Interpretation of the result or found patterns
– Evaluation

KDD
• Understanding the domain

• Creating the main dataset for the KDD

KDD
• Data pre-processing:
– Tasks in data preprocessing
• Cleaning
• Data integration
• Handling missing values
• Transformation
• Data reduction

KDD
• Data pre-processing:
– Any type of processing performed
on raw data to prepare it for another processing
procedure.
– Why? Real world data are generally
• Incomplete
• Noisy
• Inconsistent

KDD
• Why is data pre-processing needed?
– If data quality is poor, the
analysis results will be poor as well.
– Decisions must be based on high-quality data.
– Duplicate or missing data may cause incorrect
or even misleading results.

KDD
• Cleaning
– Outliers
– Multies
– Noise
• Data integration

KDD
• Handling missing values:

– Listwise deletion
– Pairwise deletion
– Hot deck imputation
– Mean substitution
– Regression substitution
– Expectation-Maximization algorithm
– Multiple Imputation
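The two simplest strategies in this list can be contrasted in a few lines. The following is a minimal sketch, assuming records are held as Python dictionaries with `None` for a missing value; the field names and numbers are invented for illustration and are not taken from the register.

```python
from statistics import mean

# Toy records (invented for illustration); None marks a missing value.
records = [
    {"age": 52, "tumor_size": 18.0},
    {"age": 61, "tumor_size": None},
    {"age": None, "tumor_size": 25.0},
    {"age": 47, "tumor_size": 11.0},
]


def listwise_deletion(rows):
    """Keep only the rows in which every field is present."""
    return [r for r in rows if all(v is not None for v in r.values())]


def mean_substitution(rows):
    """Replace each missing value with the mean of the observed values of that field."""
    fields = list(rows[0].keys())
    means = {f: mean(r[f] for r in rows if r[f] is not None) for f in fields}
    return [
        {f: (r[f] if r[f] is not None else means[f]) for f in fields}
        for r in rows
    ]


complete = listwise_deletion(records)   # drops the two incomplete rows
imputed = mean_substitution(records)    # keeps all four rows
```

Listwise deletion shrinks the dataset, while mean substitution keeps every case but reduces the variance of the imputed field; the other methods in the list try to balance those two costs.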

KDD
• Data reduction:
– Obtain a reduced representation of the dataset
that is much smaller in volume yet produces
the same or almost the same analytical results.
• Why do it?
– The dataset may be very large
– Processing time

KDD
• Dimension reduction
– Removes unimportant attributes, e.g. with Canonical
Correlation Analysis (CCA)
• Data compression
• Reducing the number of instances
• Discretization and concept hierarchy
generation
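As an illustration of the discretization item, a continuous attribute can be mapped to a small number of ordered bands. This is only a sketch: the cut points and labels below are hypothetical and are not the ones used in the studies presented here.

```python
from bisect import bisect_right

# Hypothetical cut points and band labels, chosen only for illustration.
CUTS = [40, 50, 60, 70]
LABELS = ["<40", "40-49", "50-59", "60-69", ">=70"]


def discretize_age(age):
    """Map a numeric age to an ordinal band using the cut points above."""
    return LABELS[bisect_right(CUTS, age)]


bands = [discretize_age(a) for a in (39, 40, 49, 50, 69, 70)]
```

Replacing a numeric attribute with a handful of bands reduces the number of distinct values a learning algorithm has to consider.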

KDD
• Canonical correlation analysis (CCA)
– CCA seeks to identify and quantify the
associations between two sets of variables (e.g.,
predictors and outcomes of a disease)
– It focuses on the correlation between a linear
combination of the variables in one set
(independents/predictors) and a linear
combination of the variables in another set
(dependents/outcomes)

KDD
• It creates a number of canonical solutions, each
consisting of a linear combination of one set of
variables:
U_i = a_1 X_1 + a_2 X_2 + … + a_m X_m
and a linear combination of the other set of
variables:
V_i = b_1 Y_1 + b_2 Y_2 + … + b_n Y_n
• The goal is to determine the coefficients (a's and
b's) that maximize the correlation between the
canonical variates U_i and V_i (a canonical variate is a
linear combination of a set of original variables)
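The maximization can be written compactly in matrix form. The covariance notation below (\Sigma_{XX}, \Sigma_{YY}, \Sigma_{XY}) and the vectors a, b collecting the coefficients are assumptions introduced here for illustration; they do not appear on the slides.

```latex
\rho_i \;=\; \max_{a,\,b}\ \operatorname{corr}(U_i, V_i)
      \;=\; \max_{a,\,b}\ \frac{a^{\top}\Sigma_{XY}\,b}
                               {\sqrt{a^{\top}\Sigma_{XX}\,a}\;\sqrt{b^{\top}\Sigma_{YY}\,b}},
\qquad U_i = a^{\top}X,\quad V_i = b^{\top}Y
```

Each further canonical solution maximizes the same ratio subject to being uncorrelated with the solutions already found.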
KDD
• Examining canonical solutions to determine
the relative importance of each of the
original variables in the canonical variate
– Canonical loadings
• Represent the simple linear correlation between an
original observed variable and its canonical variate
• Show how each original variable contributes
to each canonical variate
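A canonical loading is an ordinary Pearson correlation, so it can be computed directly once the canonical coefficients are known. The sketch below assumes two toy variables and hypothetical coefficients `a`; in practice the coefficients come from fitting CCA, which is not shown here.

```python
from math import sqrt


def pearson(x, y):
    """Simple linear (Pearson) correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)


# Toy predictor variables (invented for illustration).
X1 = [1.0, 2.0, 3.0, 4.0, 5.0]
X2 = [2.0, 1.0, 4.0, 3.0, 5.0]

# Hypothetical canonical coefficients; a fitted CCA would supply these.
a = [0.7, 0.3]

# Canonical variate U = a_1 X_1 + a_2 X_2, computed per observation.
U = [a[0] * x1 + a[1] * x2 for x1, x2 in zip(X1, X2)]

# Loading of each original variable on the variate.
loadings = [pearson(X1, U), pearson(X2, U)]
```

A variable with a loading close to 1 or -1 contributes strongly to the variate; a loading near 0 suggests it could be dropped.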

KDD
• An example:
Exploring Cancer Register Data to find Risk
Factors for Recurrence of Breast Cancer -
Application of Canonical Correlation
Analysis
Razavi AR, Gill H, Stål O, Sundquist M, Thorstenson S, Åhlfeldt H,
Shahsavar N, the South-East Swedish Breast Cancer Study Group
BMC Medical Informatics and Decision Making. 2005 Aug 22;5:29

KDD
Predictor Set              Outcome Set ‡
Age                        DM, first two years
Tumor location             DM, 2-4 years
Side                       DM, more than 4 years
Tumor size *               LRR, first two years
LN involvement *           LRR, 2-4 years
LN involvement †           LRR, more than 4 years
Periglandular growth *
NHG
Multiple tumors *
Estrogen receptor
Progesterone receptor
S-phase fraction
DNA index
DNA ploidy
Abbreviations: LN: lymph node, NHG: Nottingham Histologic Grade, DM: Distant Metastasis, LRR: Loco-regional Recurrence
* from pathology report, † N0: Not palpable LN metastasis, ‡ all periods are time after diagnosis.

KDD
• CCA is suggested as an appropriate method
when there are many variables in the input set
and more than one variable in the output set.
• The analysis successfully detected well-known
predictors for breast cancer recurrence.
• CCA can be regarded as the dimension
reduction step in the process of knowledge
discovery in databases.

KDD
• Data Mining
– “…the process of discovering meaningful new
correlations, patterns, and trends by sifting through
large amounts of data…” (Gartner Group)
– “…the analysis of observational data sets to find
unsuspected relationships and to summarize data in
novel ways…” (Hand et al.)
– “…is an interdisciplinary field bringing together
techniques from machine learning, pattern recognition,
statistics, databases, and visualization…” (Cabena et
al.)
– …

KDD
• Supervised vs. unsupervised learning

– Supervised learning (classification) is seen as
learning from examples.
• Supervision: The data (observations, measurements,
etc.) are labeled with pre-defined classes.
– Unsupervised learning (clustering)
• Class labels of the data are unknown.
• Given a set of data, the task is to establish the
existence of classes or clusters in the data.

KDD
• Supervised learning (classification)

– Decision Tree Induction (DTI)
– Artificial Neural Networks (ANN)
– Support Vector Machines (SVM)
– Multiple Regression Analysis (MRA)
– …

KDD
• Decision Tree Induction (DTI)

– Decision tree learning is widely used
• Its classification accuracy is competitive with other
methods
• Representation as If-then rules is easy to interpret
• Works well on noisy data

KDD
• An example:
A Data Pre-processing Method to Increase
Efficiency and Accuracy in Data Mining
A. R. Razavi, H. Gill, H. Åhlfeldt, and N. Shahsavar
Lecture Notes in Computer Science, Artificial Intelligence in
Medicine, S. Miksch, J. Hunter, and E. Keravnou, Eds.: Springer-Verlag
GmbH, 2005, pp. 434-443

KDD
                   Without          With replacing    With
                   preprocessing    missing values    preprocessing
Accuracy           54%              57%               67%
Sensitivity        83%              82%               80%
Specificity        41%              46%               63%
Number of Leaves   137              196               14
Tree Size          273              391               27


KDD
• DTI is a greedy divide-and-conquer
algorithm
– The tree is constructed in a top-down recursive manner
– At the start, all the training examples are at the root
– Examples are partitioned recursively based on
selected attributes
– Attributes are selected on the basis of an impurity
function (e.g., information gain)
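The attribute-selection step can be sketched as follows. This is a minimal illustration of information gain on invented toy rows, not the J48 implementation used in the studies; the attribute names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` obtained by splitting `rows` on `attribute`."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Toy training examples (invented for illustration).
rows = [
    {"ln_involved": "yes", "side": "left", "recurrence": "yes"},
    {"ln_involved": "yes", "side": "right", "recurrence": "yes"},
    {"ln_involved": "no", "side": "left", "recurrence": "no"},
    {"ln_involved": "no", "side": "right", "recurrence": "no"},
]

# The greedy step: pick the attribute with the highest gain at this node.
best = max(["ln_involved", "side"],
           key=lambda a: information_gain(rows, a, "recurrence"))
```

The tree builder repeats this choice recursively on each partition until the partitions are pure or a stopping rule applies.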

KDD
• DTI
– Pros
• Reasonable training time
• Fast application
• Easy to interpret
• Easy to implement
• Can handle large number of features
– Cons
• Cannot handle complicated relationships between
features

KDD
• Building Predictive Models

– An important function of data mining is the production
of a model. A model can be descriptive or predictive.
– A descriptive model helps in understanding underlying
processes or behavior.
– A predictive model is an equation or set of rules that
makes it possible to predict an unseen or unmeasured
value (the dependent variable or output) from other,
known values (independent variables or input).

KDD
• Transparency of the model:
– Choose data mining methods that
produce an understandable predictive model,
such as decision tree induction (DTI).
– Provide the model to clinicians so that they can inspect
the details of how decisions are made in
the model by studying the tree and rules.

KDD
• Showing that a model's performance is good
– Test the model for accuracy on an
independent dataset, one that has not been used
to create the model.
– Performance on the
training set is not a good indicator because of
overfitting.
– The model's predictions for the independent
dataset are compared with the actual outcomes.
– The comparison is summarized with performance
measures that show how well the model is performing.
KDD
• Validation methods:
– Examining an independent dataset.
– Cross-validation:
• Divides the whole dataset by random sampling into n
folds (partitions) and performs testing n times.
– In each round, one partition is used as the test
set and the rest as the training set.
• Leave-one-out cross-validation
– …
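The fold construction can be sketched in a few lines. The function name `k_fold_indices` and the fixed seed are assumptions made for this illustration; setting the number of folds equal to the number of cases gives leave-one-out cross-validation.

```python
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train, test) index lists for n-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random sampling into folds
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        # Every fold except the i-th one forms the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(10, 5))              # 5-fold on 10 cases
leave_one_out = list(k_fold_indices(10, 10))      # one case per test set
```

Each case appears in exactly one test set, so every case is used for testing once and for training in all other rounds.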

KDD
• Methods for showing how well a model
works:
– Accuracy: the proportion of cases that the
model classifies correctly.
– Sensitivity and specificity.
– Confusion matrix: shows the counts of the
actual versus predicted class values.
– ROC curve and area under the curve (AUC).
– …
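The first three measures all follow from the four cells of a binary confusion matrix. The sketch below uses invented labels (1 = event, 0 = no event) purely to show the arithmetic.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return tp, fp, tn, fn


# Toy labels (invented for illustration).
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]

tp, fp, tn, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / (tp + fp + tn + fn)   # share of correct predictions
sensitivity = tp / (tp + fn)                 # share of actual positives found
specificity = tn / (tn + fp)                 # share of actual negatives found
```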

KDD
• Our suggested approach
– Give some cases, without any
data pre-processing, to clinicians and ask for their
predictions.
– Validate the predictive model on
the same cases as the clinicians.
– Compare the results and check whether there are
statistically significant differences.

KDD
• An Example:
Predicting Metastasis in Breast Cancer:
Comparing a Decision Tree with
Domain Experts
Amir R. Razavi, Hans Gill, Hans Åhlfeldt, and Nosrat Shahsavar
In press “Journal of Medical Systems”

KDD
• 3699 patients
• A decision tree was trained with all patients except
for 100 cases and tested with those 100 cases.
• Two domain experts were asked to give their
opinion about the probability of a
given recurrence outcome for these 100 patients.
• ROC curves and areas under the ROC curves
(AUC) for the predictions were computed and
compared.
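The AUC of a set of predicted probabilities can be computed without drawing the curve, as the fraction of (positive, negative) pairs that are ranked correctly. This is a sketch of that pairwise formulation on invented scores; it is not the software used for the comparison reported here.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy predicted probabilities and true outcomes (invented for illustration).
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
labels = [1, 0, 1, 0, 0]

area = auc(scores, labels)
```

The same function can be applied to the probabilities given by the model and by each expert, so the resulting areas are directly comparable.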

KDD
[Figure: ROC curves (Sensitivity vs. 100-Specificity) for DTI_J48, Oncologist_1 and Oncologist_2]

       DTI (J48)   Oncologist 1   Oncologist 2
AUC    0.761       0.847          0.810
DTI: decision tree induction

KDD
• It is possible to formulate the knowledge that is
hidden in registers in the form of a decision tree
• A methodology producing an understandable model
with about the same accuracy as domain experts
can be used as a semi-automatic knowledge
discovery method for building a predictive model
in oncology

Clinical Guidelines
• Physicians’ adherence to clinical guidelines is not
100 percent.
– Physicians’ disagreement with guidelines
– Poor availability
– Low outcome expectancy
– Patient-related obstacles

Clinical Guidelines
• Non-compliance with the guideline in some
individual cases is of less general importance
– Some patients may have refused the treatment
– Sometimes physicians believe that a different treatment is
more appropriate for a particular patient
• If systematic and repetitive patterns are identified,
they can be used as rules to alert physicians

Clinical Guidelines
• However, there may be useful information in the
disagreements
– Every institution has its own experience with treatments
– Repetitive patterns can provide the writers of guidelines
with new insight that may result in improvements in the
guidelines

Clinical Guidelines
• One method for finding disagreements is to
evaluate each case separately and analyze the
reason for the disagreement
• However, a faster method that ignores sporadic
disagreements and can find repetitive patterns is
preferable

Clinical Guidelines
• An example:
A Data Mining Approach to Analyze Non-
compliance with a Guideline for the
Treatment of Breast Cancer
Razavi AR, Gill H, Åhlfeldt H, Shahsavar N
To be presented at MedInfo 2007, Brisbane, Australia

Clinical Guidelines
• Our suggested approach:
– Data mining is used to find patterns of non-compliance
between the guideline and the actual recorded practice for
post-mastectomy radiotherapy (PMRT)
– The dataset is filtered using the guideline rules to
find non-compliant cases
– Repetitive patterns of inconsistency with the guideline
are identified by decision tree induction
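The filtering step can be sketched as a comparison between what a rule recommends and what was recorded. The rule, field names and cases below are hypothetical and serve only as an illustration; they are not the locally modified PMRT guideline or the register data used in the study.

```python
def pmrt_recommended(case):
    """Hypothetical rule for illustration only; NOT the guideline used in the study."""
    return case["involved_lns"] >= 4 or case["tumor_size_mm"] > 50


def non_compliant(cases):
    """Return the cases where recorded treatment differs from the rule's recommendation."""
    return [c for c in cases if c["received_pmrt"] != pmrt_recommended(c)]


# Toy cases (invented for illustration).
cases = [
    {"id": 1, "involved_lns": 5, "tumor_size_mm": 20, "received_pmrt": True},
    {"id": 2, "involved_lns": 6, "tumor_size_mm": 30, "received_pmrt": False},
    {"id": 3, "involved_lns": 0, "tumor_size_mm": 15, "received_pmrt": False},
    {"id": 4, "involved_lns": 1, "tumor_size_mm": 12, "received_pmrt": True},
]

flagged = non_compliant(cases)
flagged_ids = [c["id"] for c in flagged]
```

The flagged cases, labeled as non-compliant, would then be passed to a decision tree learner so that recurring combinations of attributes stand out from sporadic deviations.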

Clinical Guidelines
• 12 variables important in the recurrence of breast
cancer
• Data were filtered using the locally modified
guideline for PMRT (125 out of 962 cases)
• Data were then analyzed with Decision Tree
Induction
• 3 variables proved to be important (age, involved LNs,
tumor size)

Clinical Guidelines
• Patterns for non-compliance with the PMRT
guideline:

Clinical Guidelines
• Supporting clinicians in following the guidelines
can be done by using the resulting rules as alerts for
identifying inconsistencies between clinicians’
practices and guidelines.
• Resulting rules from mining historical data can also
be embedded in knowledge bases of guideline-based
decision support systems.

Discussion
• In some domains, such as finance and
banking, KDD has already shown great
benefit to the industry, but medicine
lags far behind.

Discussion
• There is no gold standard method for
the pre-processing step or for handling
missing values.
• CCA can handle multiple outcomes, which
sets it apart from methods such as
MRA and Cox regression analysis (Cox RA).

Discussion
• The predictions of the DTI model do not differ
significantly from predictions made by
domain experts.
• Compared to other data mining methods,
DTI is more explainable. In contrast, an ANN
works as a “black box”.

Discussion
• A predictive model that is built on
the most relevant and important predictors of
an event can perform better.
• The quality of cancer
registers can be improved by adding variables with high
predictive ability.

Discussion
• Combining KDD and clinical guidelines can
yield knowledge that is useful in improving
medical practice.
• This can be done by embedding the resulting
knowledge in decision support systems.

Discussion
• In medicine, assisting clinicians in their
decision making at the right time, in the right
place and in a suitable format is valuable.
• Providing reminders, interpretations or
advice specific to a given patient at a
particular time is advantageous.

Future Works
• Results from the presented methodology
can be used to build a decision support
application in the field of oncology.

