
Recurrence Prediction and Non-compliances to Guideline: Register Analysis Using Data Mining

Amir R Razavi
Department of Biomedical Engineering, Division of Medical Informatics
Linköpings universitet, Linköping, Sweden

Dept of Biomedical Engineering, Medical Informatics


Linköpings universitet, Linköping, Sweden
http://www.imt.liu.se
Outline

• Introduction
• Knowledge Discovery in Databases
• Clinical Guidelines
• Discussion
• Future works

Introduction

• Data in medicine are stored in many different ways:
– Hospital Information Systems (HIS)
– Electronic Medical Records (EMR)
– Medical registers
– Output from devices, for example imaging devices
– …
• And the storing of patient data continues…
– “Patientjournal 08” project in Östergötland: computerization of all patient data completed Dec 2008
– …


• Medical registers are used to:
– Monitor trends in the incidence of conditions and diseases
– Monitor outcomes after the implementation of disease-prevention and treatment programs
– Assess the safety of new drugs and procedures, identify best clinical practice, and compare healthcare systems


• Data in medicine are unique and need special attention.
– Heterogeneity of medical data
• Complexity of medical data; images, signals, …
• Physician's interpretation
• Poor mathematical characterization
• Degrees of relationships between variables
– Ethical/legal/social issues
• Data ownership
• Confidentiality of human data


• What can be extracted from these large repositories of medical data?
• How can the hidden knowledge be
extracted from medical registers?
• Can the extracted knowledge be used to
give support to clinicians in their decision
making?


• Why is decision support important?
– Limited resources
– Need for ways to improve health care processes and their outcomes
– Improving the decision-making ability of clinicians by allowing more or better decisions within the constraints of their knowledge and time


• Playground
– Breast cancer register
– Knowledge Discovery in Databases
– Clinical Guidelines
• Aim
– Decision Support to clinicians in oncology


• Breast cancer register
– Breast cancer register of the south-east region of Sweden, vårdprogram (care programme)


• Knowledge Discovery in Databases (KDD)


– Process of semi-automatically analyzing large
databases to find patterns that are:
• Valid: true for new data with some certainty
• Novel: non-obvious
• Useful: it should be possible to act upon the item
• Understandable: humans should be able to interpret
the pattern


• Clinical Guidelines (CG)


– Developed to help physicians make decisions
about appropriate treatment for specific
circumstances
– Can result in improvements in overall
healthcare, including clinical practice
– Can provide decision support tools for
practitioners


• Decision Support Systems (DSS)


– A DSS is a computerized system that helps in making decisions
– A decision is a choice between alternatives based on estimates of the values of those alternatives
– Interactive computer-based systems that help decision makers utilize data and models to solve problems (Sprague and Carlson, 1982)

Knowledge Discovery in Databases

• KDD steps:
– Understanding the domain
– Creating the main dataset for the KDD
– Data pre-processing
– Data mining
– Interpretation of the result or found patterns
– Evaluation


• Understanding the domain


• Creating the main dataset for the KDD


• Data pre-processing:
– Tasks in data preprocessing
• Cleaning
• Data integration
• Handling missing values
• Transformation
• Data reduction


• Data pre-processing:
– It describes any type of processing performed on raw data to prepare it for another processing procedure.
– Why? Real-world data are generally:
• Incomplete
• Noisy
• Inconsistent


• Why is data pre-processing needed?
– If the data are of poor quality, the analysis results will be poor as well.
– Decisions must be based on high-quality data.
– Duplicate or missing data may cause incorrect or even misleading results.


• Cleaning
– Outliers
– Multies
– Noise
• Data integration


• Handling missing values:


– Listwise deletion
– Pairwise deletion
– Hot deck imputation
– Mean substitution
– Regression substitution
– Expectation-Maximization algorithm
– Multiple Imputation
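
The two simplest strategies in the list above, listwise deletion and mean substitution, can be sketched as follows. This is only an illustration in plain Python: the field names (`age`, `tumor_size`) and all values are invented, and it is not the procedure used in the studies presented here.

```python
from statistics import mean

# Toy records; None marks a missing value. All values are invented.
records = [
    {"age": 52, "tumor_size": 18.0},
    {"age": 61, "tumor_size": None},
    {"age": None, "tumor_size": 25.0},
    {"age": 47, "tumor_size": 12.0},
]


def listwise_deletion(rows):
    """Drop every record that has at least one missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def mean_substitution(rows):
    """Replace each missing value with the mean of the observed values
    of the same variable."""
    keys = list(rows[0].keys())
    means = {k: mean(r[k] for r in rows if r[k] is not None) for k in keys}
    return [
        {k: (r[k] if r[k] is not None else means[k]) for k in keys}
        for r in rows
    ]


complete = listwise_deletion(records)   # keeps only fully observed records
imputed = mean_substitution(records)    # keeps all records, fills the gaps
```

Listwise deletion shrinks the dataset, while mean substitution keeps every record but reduces the variance of the imputed variable. That trade-off is one reason the more elaborate methods in the list exist.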


• Data reduction:
– Obtain a reduced representation of the dataset that is much smaller in volume yet produces the same or almost the same analytical results.
• Why do it?
– The dataset may be gigantic in volume
– Processing time


• Dimension reduction
– Removes unimportant attributes: Canonical
Correlation Analysis (CCA)
• Data Compression
• Reducing the number of instances
• Discretization and concept hierarchy
generation


• Canonical correlation analysis (CCA)


– CCA seeks to identify and quantify the associations between two sets of variables (e.g., predictors and outcomes of a disease)
– It focuses on the correlation between a linear
combination of the variables in one set
(independents/predictors) and a linear
combination of the variables in another set
(dependents/outcomes)


• It creates a number of canonical solutions, each consisting of a linear combination of one set of variables:
$U_i = a_1 X_1 + a_2 X_2 + \dots + a_m X_m$
and a linear combination of the other set of variables:
$V_i = b_1 Y_1 + b_2 Y_2 + \dots + b_n Y_n$
• The goal is to determine the coefficients (the a's and b's) that maximize the correlation between the canonical variates $U_i$ and $V_i$ (a canonical variate is a linear combination of a set of original variables)
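
Written out, the quantity being maximized is the correlation between the two canonical variates. The covariance matrices $\Sigma_{XX}$, $\Sigma_{YY}$ and $\Sigma_{XY}$ are notation introduced here for illustration and do not appear on the slides:

```latex
\rho_i = \operatorname{corr}(U_i, V_i)
       = \frac{a^{\top} \Sigma_{XY}\, b}
              {\sqrt{a^{\top} \Sigma_{XX}\, a}\;\sqrt{b^{\top} \Sigma_{YY}\, b}},
\qquad a = (a_1, \dots, a_m)^{\top},\quad b = (b_1, \dots, b_n)^{\top}
```

The first canonical solution maximizes this ratio over all coefficient vectors. Each later solution maximizes it subject to being uncorrelated with the earlier canonical variates.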

• Examining canonical solutions to determine the relative importance of each of the original variables in the canonical variate
– Canonical Loadings
• Represent the simple linear correlation between an original observed variable and its canonical variate.
• Show how each original variable contributes towards each canonical variate
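
A canonical loading is an ordinary Pearson correlation, so it can be computed directly once the coefficients are known. The sketch below assumes the coefficients have already been obtained from a CCA fit. The coefficient values, the function names and the toy columns are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Simple linear (Pearson) correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def canonical_variate(columns, coefficients):
    """U = a1*X1 + ... + am*Xm, evaluated for every observation."""
    n_obs = len(columns[0])
    return [
        sum(c * col[i] for c, col in zip(coefficients, columns))
        for i in range(n_obs)
    ]


def canonical_loadings(columns, coefficients):
    """Correlation of each original variable with its canonical variate."""
    u = canonical_variate(columns, coefficients)
    return [pearson(col, u) for col in columns]


# Toy data: two predictor columns and made-up (not fitted) coefficients.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
a = [0.7, 0.3]
loadings = canonical_loadings([x1, x2], a)
```

Variables with loadings close to 1 or -1 contribute most to the canonical variate. This is how loadings can be read when ranking predictors.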


• An example:
Exploring Cancer Register Data to find Risk Factors for Recurrence of Breast Cancer - Application of Canonical Correlation Analysis
Razavi AR, Gill H, Stål O, Sundquist M, Thorstenson S, Åhlfeldt H, Shahsavar N, the South-East Swedish Breast Cancer Study Group

BMC Medical Informatics and Decision Making. 2005 Aug 22;5:29

Predictor Set              Outcome Set ‡
Age                        DM, first two years
Tumor location             DM, 2-4 years
Side                       DM, more than 4 years
Tumor size *               LRR, first two years
LN involvement *           LRR, 2-4 years
LN involvement †           LRR, more than 4 years
Periglandular growth *
NHG
Multiple tumors *
Estrogen receptor
Progesterone receptor
S-phase fraction
DNA index
DNA ploidy

Abbreviations: LN: lymph node, NHG: Nottingham Histologic Grade, DM: Distant Metastasis, LRR: Loco-regional Recurrence.
* from pathology report, † N0: Not palpable LN metastasis, ‡ all periods are time after diagnosis.


• CCA is suggested as an appropriate method when there are many variables in the input set and more than one variable in the output set.
• The results successfully detected well-known predictors for breast cancer recurrence.
• CCA can serve as the dimension reduction step in the process of knowledge discovery in databases.


• Data Mining
– “…the process of discovering meaningful new
correlations, patterns, and trends by sifting through
large amounts of data…” (Gartner Group)
– “…the analysis of observational data sets to find
unsuspected relationships and to summarize data in
novel ways…” (Hand et al.)
– “…is an interdisciplinary field bringing together techniques from machine learning, pattern recognition, statistics, databases, and visualization…” (Cabena et al.)
– …


• Supervised vs. unsupervised learning


– Supervised learning (classification) is seen as
learning from examples.
• Supervision: The data (observations, measurements,
etc.) are labeled with pre-defined classes.
– Unsupervised learning (clustering)
• Class labels of the data are unknown.
• Given a set of data, the task is to establish the
existence of classes or clusters in the data.


• Supervised learning (classification)


– Decision Tree Induction (DTI)
– Artificial Neural Networks (ANN)
– Support Vector Machines (SVM)
– Multiple Regression Analysis (MRA)
– …


• Decision Tree Induction (DTI)


– Decision tree learning is widely used
• Its classification accuracy is competitive with other
methods
• Representation as If-then rules is easy to interpret
• Works well on noisy data


• An example:
A Data Pre-processing Method to Increase Efficiency and Accuracy in Data Mining
A. R. Razavi, H. Gill, H. Åhlfeldt, and N. Shahsavar

Lecture Notes in Computer Science, Artificial Intelligence in Medicine, J. H. S. Miksch, E. Keravnou, Ed.: Springer-Verlag GmbH, 2005, pp. 434-443


                    Without           With replacing    With
                    pre-processing    missing values    pre-processing
Accuracy            54%               57%               67%
Sensitivity         83%               82%               80%
Specificity         41%               46%               63%
Number of Leaves    137               196               14
Tree Size           273               391               27


• DTI is a greedy divide-and-conquer algorithm
• The tree is constructed in a top-down recursive manner
• At the start, all the training examples are at the root
• Examples are partitioned recursively based on selected attributes
• Attributes are selected on the basis of an impurity function (e.g., information gain)
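
The attribute-selection step can be illustrated with information gain, the impurity function named above. The sketch below only picks the split for a single node. It is not a full tree learner and not the J48 implementation used in the studies. The attribute names and the four toy cases are invented.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by partitioning on one attribute."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(
        len(part) / total * entropy(part) for part in partitions.values()
    )
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """Greedy choice: the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Invented toy cases: one attribute separates the classes, the other does not.
rows = [
    {"ln_involved": "yes", "side": "left"},
    {"ln_involved": "yes", "side": "right"},
    {"ln_involved": "no", "side": "left"},
    {"ln_involved": "no", "side": "right"},
]
labels = ["recurrence", "recurrence", "none", "none"]
root = best_attribute(rows, labels, ["ln_involved", "side"])
```

A complete learner would apply `best_attribute` recursively to each partition until the partitions are pure or a stopping rule is met. That is the top-down, divide-and-conquer construction described above.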


• DTI
– Pros
• Reasonable training time
• Fast application
• Easy to interpret
• Easy to implement
• Can handle a large number of features
– Cons
• Cannot handle complicated relationships between features


• Building Predictive Models


– An important function of data mining is the production
of a model. A model can be descriptive or predictive.
– A descriptive model helps in understanding underlying
processes or behavior.
– A predictive model is an equation or set of rules that
makes it possible to predict an unseen or unmeasured
value (the dependent variable or output) from other,
known values (independent variables or input).


• Transparency of the model:
– Choose data mining methods that produce an understandable predictive model, such as decision tree induction (DTI).
– Provide the model to clinicians so that they can inspect all the details and how decisions are made in the model by studying the tree and rules.


• Showing that the model's performance is good


– Testing the model for accuracy on an
independent dataset, one that has not been used
to create the model.
– Examining the performance of the model on the
training set is not a good indicator because of
overfitting.
– The prediction of the model for an independent
dataset is compared to the actual outcome.
– An analysis is performed which measures how
well a model is performing.

• Validation methods:
– Examining an independent dataset.
– Cross validation:
• Divides the whole dataset by random sampling into n folds (partitions) and performs testing n times.
– In each test, one partition is used as the testing set and the rest as the training set.
• Leave-one-out cross-validation
–…
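
The fold construction can be sketched in a few lines. This is a minimal illustration in plain Python. The function name `k_fold_indices` and the fixed seed are choices made here, not part of the presented work.

```python
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_indices, test_indices) pairs for n-fold cross validation.

    The samples are shuffled once, dealt into n_folds partitions, and each
    partition serves as the test set exactly once.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, test_idx


# 5-fold cross validation on 10 samples.
splits = list(k_fold_indices(n_samples=10, n_folds=5))

# Leave-one-out is the special case where the number of folds equals
# the number of samples.
leave_one_out = list(k_fold_indices(n_samples=10, n_folds=10))
```

Because every case is used for testing exactly once, the averaged test performance uses all the data without ever scoring a model on cases it was trained on.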


• Methods for showing how well a model works:
– Accuracy: the degree of fit between the model and the data, e.g. the proportion of correctly classified cases.
– Sensitivity and specificity.
– Confusion matrix: shows the counts of the
actual versus predicted class values.
– ROC curve and area under the curve (AUC).
–…
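
For a two-class problem, accuracy, sensitivity and specificity all follow from the four cells of the confusion matrix. The sketch below uses invented labels (1 = event, 0 = no event). The function names are illustrative only.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, tn, fn) for a two-class problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def summarize(actual, predicted, positive=1):
    """Accuracy, sensitivity and specificity from the confusion matrix."""
    tp, fp, tn, fn = confusion_counts(actual, predicted, positive)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # share of actual positives found
        "specificity": tn / (tn + fp),   # share of actual negatives found
    }


# Invented example: 4 actual positives and 4 actual negatives.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = summarize(actual, predicted)
```

Reporting sensitivity and specificity alongside accuracy matters when the classes are unbalanced, since a model can reach high accuracy while missing most of the rare class.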


• Our suggested approach


– Give some cases to clinicians without any data pre-processing and ask for their predictions.
– Validate the predictive model on the same cases as the clinicians.
– Compare the results and check whether there are statistically significant differences.


• An example:
Predicting Metastasis in Breast Cancer: Comparing a Decision Tree with Domain Experts
Amir R. Razavi, Hans Gill, Hans Åhlfeldt, and Nosrat Shahsavar

In press, Journal of Medical Systems


• 3699 patients
• A decision tree was trained with all patients except
for 100 cases and tested with those 100 cases.
• Two domain experts were asked to give their
opinion about the probability of recurrence of a
certain outcome for these 100 patients.
• ROC curves and area under the ROC curves
(AUC) for predictions were computed and
compared.
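
The AUC can be computed from predicted probabilities without drawing the curve: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below uses that rank-based formulation on invented scores. It does not include the statistical test needed to compare two AUCs, and it is not the software used in the study.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the positive
    case has the higher score; ties count as half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented outcomes (1 = event) and two invented sets of predicted
# probabilities for the same six cases.
labels = [1, 1, 0, 1, 0, 0]
model_scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
expert_scores = [0.8, 0.9, 0.2, 0.6, 0.5, 0.1]

model_auc = auc(model_scores, labels)
expert_auc = auc(expert_scores, labels)
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two groups.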

[Figure: ROC curves (Sensitivity vs. 100-Specificity) for DTI_J48, Oncologist_1 and Oncologist_2]

        DTI (J48)    Oncologist 1    Oncologist 2
AUC     0.761        0.847           0.810

DTI: decision tree induction


• It is possible to formulate the knowledge that is hidden in registers in the form of a decision tree
• A methodology producing an understandable model
with about the same accuracy as domain experts
can be used as a semi-automatic knowledge
discovery method for building a predictive model
in oncology

Clinical Guidelines

• Physicians’ adherence to clinical guidelines is not 100 percent.
– Physicians’ disagreement with guidelines
– Poor availability
– Low outcome expectancy
– Patient-related obstacles


• Non-compliance with the guideline in some individual cases is of less general importance
– Some patients may have refused to accept the treatment
– Sometimes physicians believe that a treatment is
appropriate for a particular patient
• If systematic and repetitive patterns are identified,
then they can be used as rules to alert physicians


• However, there may be useful information in the disagreements
– Every institution has its own experience with treatments
– Repetitive patterns can provide the writers of guidelines
with new insight that may result in improvements in the
guidelines


• One method for finding disagreements is to evaluate each case separately and analyze the reason for the disagreement
• However, a faster method that ignores sporadic
disagreements and can find repetitive patterns is
preferable


• An example:
A Data Mining Approach to Analyze Non-compliance with a Guideline for the Treatment of Breast Cancer
Razavi AR, Gill H, Åhlfeldt H, Shahsavar N

To be presented at MedInfo 2007, Brisbane, Australia


• Our suggested approach:
– Use data mining to find patterns of non-compliance between the guideline and the actual recorded practice for post-mastectomy radiotherapy (PMRT)
– Filter the dataset using the guideline rules to find non-compliant cases
– Identify repetitive patterns of inconsistency with the guideline by decision tree induction
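
The filtering step can be sketched as a comparison between what a rule recommends and what was actually done. Everything specific below is a placeholder: the field names, the toy records, and above all `toy_rule`, whose thresholds are invented for illustration and are not the PMRT guideline used in the study.

```python
def find_non_compliant(records, guideline_recommends):
    """Return the records where recorded practice differs from the rule.

    `guideline_recommends` is any function mapping a record to True
    (treatment recommended) or False (treatment not recommended).
    """
    return [
        r for r in records
        if guideline_recommends(r) != r["received_pmrt"]
    ]


def toy_rule(record):
    """Hypothetical stand-in for a guideline rule, NOT a clinical guideline."""
    return record["involved_lns"] >= 4 or record["tumor_size_mm"] > 50


# Invented toy records.
records = [
    {"id": 1, "involved_lns": 5, "tumor_size_mm": 22, "received_pmrt": True},
    {"id": 2, "involved_lns": 6, "tumor_size_mm": 30, "received_pmrt": False},
    {"id": 3, "involved_lns": 0, "tumor_size_mm": 15, "received_pmrt": False},
    {"id": 4, "involved_lns": 1, "tumor_size_mm": 12, "received_pmrt": True},
]

non_compliant = find_non_compliant(records, toy_rule)
```

In the approach described above, the non-compliant subset is what would then be passed to decision tree induction, so that recurring patterns stand out from sporadic, case-specific deviations.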


• 12 variables important in the recurrence of breast cancer
• Data were filtered using the locally modified guideline for PMRT (125 out of 962 cases)
• Data were then analyzed with Decision Tree Induction
• 3 variables proved to be important (age, involved LNs, tumor size)


• Patterns for non-compliance with the PMRT guideline:


• Supporting clinicians in following the guidelines can be done by using the resulting rules as alerts for identifying inconsistencies between clinicians’ practices and guidelines.
• Resulting rules from mining historical data can also
be embedded in knowledge bases of guideline-
based decision support systems.

Discussion

• In some domains, such as finance and banking, KDD has already shown great benefit to the industry, but medicine is far behind.


• There is no gold standard method for the pre-processing step or for handling missing values.
• CCA can handle multiple outcomes, which sets it apart from methods such as MRA and Cox regression analysis (Cox RA).


• The predictions of the DTI model do not differ significantly from those made by domain experts.
• Compared to other data mining methods, DTI is more explainable. In contrast, ANN works as a “black box”.


• A predictive model built on the most relevant and important predictors of an event can perform better.
• The quality of cancer registers can be improved by adding variables with high predictive ability.


• Combining KDD and clinical guidelines can yield knowledge that is useful for improving medical practice.
• This can be done by embedding the resulting knowledge in decision support systems.


• In medicine, assisting clinicians in their decision making at the right time, in the right place and in a suitable format is valuable.
• Providing reminders, interpretations or advice specific to a given patient at a particular time is advantageous.

Future Works

• Results from the presented methodology can be used to build a decision support application in the field of oncology.
