Welcome
Ozlem Uzuner, Associate Professor, University at Albany
Meliha Yetisgen, Assistant Professor, University of Washington
Amber Stubbs, Assistant Professor, Simmons College
Outline
Introduction to clinical and biomedical NLP
Research questions in clinical and biomedical NLP
Data and annotation processes
Methods
Open questions and future directions
Introduction to Clinical and Biomedical NLP
Biomedical NLP
Focus on scientific discoveries about biology, physiology, and medicine
Journal articles, clinical trials, webpages
Number of visits (to physician offices, hospital outpatient and emergency departments): 1.2 billion (actual number reported by CDC in 2010)
Hospital inpatient care
Number of discharges: 35.1 million
Discharges per 10,000 population: 1,139.6
Average length of stay in days: 4.8
Biomedical text:
PubMed contains more than 23 million biomedical articles from MEDLINE, life science journals, and online books
~500,000 new records are added each year
13.1 million abstracts, and 14.2 million full-text
*http://www.cdc.gov/nchs/fastats/physician_visits.htm
Clinical Documents
HISTORY OF PRESENT ILLNESS: Mrs. [Huntington] is a 77-year-old-woman with long standing hypertension who presented as a Walk-in to me at the [Bronx] Health Center on [DATE]. Recently had been started q.o.d. on Clonidine since [DATE] to taper off of the drug. Was told to start Zestril 20 mg. q.d. again. The patient was sent to the Emergency Unit for direct admission for cardioversion and anticoagulation, with the Cardiologist, Dr. [Swasissz] to follow.
SOCIAL HISTORY: Lives alone, has one daughter living in [Spring]. Is a non-smoker, and does not drink alcohol.
HOSPITAL COURSE AND TREATMENT: During admission, the patient was seen by Cardiology, Dr. [Tylenol], was started on IV Heparin, Sotalol 40 mg PO b.i.d. increased to 80 mg b.i.d., and had an echocardiogram. By [DATE] the patient had better rate control and blood pressure control but remained in atrial fibrillation. On [DATE], the patient was felt to be medically stable
Clinical Documents
The patient is a 46 year old woman with a history of Q wave myocardial infarction with right ventricular infarct in October 1992. Peak CK's were 2300. Catheterization showed 100% RCA lesion which was treated with angioplasty reduced to 20-30% stenosis. Subsequent catheterization October 92 , July 92 and September 92 for atypical chest pain , showed clean coronaries. Exercise tread mill test in September 92 , the patient went three minutes and 31 seconds with standard Bruce protocol and stopped secondary to atypical chest pain. Maximum heart rate 162 , blood pressure 176/90 , no ST or T wave changes. In April 92 she ruled out for myocardial infarction by enzymes and EKG , after presenting with prolonged chest pain. VQ scan was low probability. Chest CT ruled out aortic dissection. The patient now presents to the hospital with 24 hours of right sided chest pain , stating that it was squeezing in her right breast , felt to be between the shoulder blades. She complained of shortness of breath , dizziness , weakness and nausea , no palpitations were noted
Clinical Documents
Pt recently hospitalized 7/19/06 for chf exacerbation ( diastolic dysfunction ) 2nd to dietary and medicine noncompliance ( salty foods , stopped her HCTZ ) and continued to smoke. Pt diuresed and sent home on new lasix 60qam 40qpm regimen. Pt noticed steady decline in functional status during the last 3 weeks because of SOB. at baseline should sat 85% on ra , 95% on 6L02NC at rest and ambulation. ( on home o2 ) but now , can't ambulate , satting 83-89% on 6l at rest. also notes pnd , orthopnea. Pt notes intermittent chest pain on and off lasting 5 minutes not associated with exertion or any other cardiac sx.
8/15 dobuta mibi-> ischemia in d1 territory. 11/19 :echo->ef 60% , Pa pressure 48 + RA. no valve dz. rv enlarged and hypokinetic.
A/P: pump: decompesated CHF ( diastolic dysfxn , ? cor pulmonale component ) 2nd to diet/med non-compliance. uptitrate captopril , continue iv lasix 60 qd with goal net neg 2 liters , daily weights , strict I and O. check cxray. Switched to po lasix 10/06 , back to lisinopril for d/c Fri.
ischemia: has + mibi in past , but no further workup to d1 lesion. can't get ecasa 2nd to vWD. continue BB , will hold off on statin since not hyperlipidemic.
rate:tele.
Clinical Language
Domain-specific, jargon, idioms
Telegraphic, with misspellings, incomplete sentences
Speculations, hypotheses, and negations
Some structure
*Slide courtesy of Brett South. D'Avolio L.W., Demner-Fushman D., South B.R. (2013). Tutorial on An Introduction to Clinical Natural Language Processing, Fall Symposium of AMIA, 2013.
Clinical Language
Linguistic variation
Derivation: mediastinal = mediastinum
Inflection
Synonymy
Clinical Language
Polysemy
General polysemy
Clinical Language
Negation and uncertainty
Approximately half of all clinical concepts in dictated reports are negated*
Explicit negation
Implicit negation
Uncertainty
*Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. Evaluation of negation phrases in narrative clinical reports. Proc AMIA Symp. 2001:105-9.
Clinical Language
Hypotheses
It was felt that the patient probably had a cerebrovascular accident involving the left side of the brain. Other differentials entertained were perhaps seizure and the patient being post-ictal when he was found, although this consideration is less likely.
R/O out pneumonia.
Clinical Language
Implication
Requires inference
Clinical Language
More inference
Fever: Temperature 38.5C
Oxygen desaturation: Oxygen saturation low; Oxygen saturation 85% on room air
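The inference examples above (a numeric temperature implying fever, a saturation reading implying desaturation) can be sketched as simple threshold rules. This is a minimal illustration, not a method from the tutorial: the function name `infer_concepts`, the regular expressions, and both thresholds (38.0 C for fever, 90% for low saturation) are assumptions chosen only to make the idea concrete.

```python
import re

# Hypothetical thresholds, assumed for illustration only.
FEVER_THRESHOLD_C = 38.0
LOW_SATURATION_PCT = 90.0

def infer_concepts(text: str) -> set:
    """Map literal measurements in a sentence to the concepts they imply."""
    concepts = set()
    # "Temperature 38.5C" implies fever when at or above the threshold.
    temp = re.search(r"temperature\s+(\d+(?:\.\d+)?)\s*C\b", text, re.I)
    if temp and float(temp.group(1)) >= FEVER_THRESHOLD_C:
        concepts.add("fever")
    # "Oxygen saturation 85%" implies desaturation when below the threshold.
    sat = re.search(r"oxygen\s+saturation\s+(\d+(?:\.\d+)?)\s*%", text, re.I)
    if sat and float(sat.group(1)) < LOW_SATURATION_PCT:
        concepts.add("oxygen desaturation")
    # The qualitative phrasing needs no threshold.
    if re.search(r"oxygen\s+saturation\s+low", text, re.I):
        concepts.add("oxygen desaturation")
    return concepts
```

A real system would also need unit handling (Fahrenheit versus Celsius) and context such as supplemental oxygen, which this sketch ignores.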
*Slide adapted from Brett South. D'Avolio L.W., Demner-Fushman D., South B.R. (2013). Tutorial on An Introduction to Clinical Natural Language Processing, Fall Symposium of AMIA, 2013.
Clinical Language
Temporality
Hypothetical or non-specific mentions
Clinical Language
Report structure
Anatomic location sometimes in section header
NECK: no adenopathy.
Biomedical Documents
Grows in size at a dramatic pace
Biomedical Language
Contains domain specific rich and evolving vocabulary
Concepts introduced when new discoveries are presented
Very structured
Research Problems in Clinical and Biomedical NLP
Application: Retrospective cohort study of high B12 levels as ICU mortality predictor
Hypotheses:
High B12 levels are associated with liver function
Alcohol consumption impacts liver
Data:
The Multiparameter Intelligent Monitoring in Intensive Care (MIMIC-II) database
B12 measurements for ~2,000 adult patients
Structured data: ICD9 codes of alcohol-based illness, e.g., delirium tremens (291*)
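The wildcard notation "291*" above suggests selecting patients by ICD9 code prefix. A minimal sketch of that selection follows, assuming codes arrive as strings such as "291.0"; the function name `has_code_with_prefix` and the input format are hypothetical and are not taken from the MIMIC-II schema.

```python
def has_code_with_prefix(icd9_codes, prefixes=("291",)):
    """Return True if any ICD9 code starts with one of the given prefixes.

    Codes are compared as strings after stripping whitespace, so
    "291.0" and "2910" both match the prefix "291".
    """
    return any(
        code.strip().startswith(prefix)
        for code in icd9_codes
        for prefix in prefixes
    )
```

For example, `has_code_with_prefix(["401.9", "291.0"])` is true, while a patient coded only with "401.9" is not selected.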
*Slide courtesy of Dina Demner-Fushman. D'Avolio L.W., Demner-Fushman D., South B.R. (2013). Tutorial on An Introduction to Clinical Natural Language Processing, Fall Symposium of AMIA, 2013.
Callaghan F.M., Leishear K., Abhyankar S., Demner-Fushman D., McDonald C.J. (2014) High vitamin B12 levels are not associated with increased mortality risk for ICU patients after adjusting for liver function: a cohort study. ESPEN J. 2014 Apr 1;9(2):e76-e83.
Nadkarni P.M., Ohno-Machado L., Chapman W.W. (2011). Natural language processing: an introduction. Journal of the American Medical Informatics Association 2011 Sep-Oct; 18(5): 544-551. doi: 10.1136/amiajnl-2011-00046
Tokenization: complicated by characters typically used as token boundaries, e.g., 10 mg/day, N-acetylcysteine.
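To make the tokenization problem concrete, the sketch below contrasts a naive word-character tokenizer with one that keeps internal hyphens, slashes, and periods inside a token. The pattern and the name `tokenize` are assumptions for illustration; they are not the tokenizer of any system cited in this tutorial.

```python
import re

# Alphanumeric runs may be joined by an internal "-", "/" or "."
# (as in "mg/day" or "N-acetylcysteine"); any other non-space
# character becomes a token of its own.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:[-/.][A-Za-z0-9]+)*|\S")

def tokenize(text):
    """Split text into tokens while preserving compound biomedical terms."""
    return TOKEN_PATTERN.findall(text)

def naive_tokenize(text):
    """Baseline that treats every non-word character as a boundary."""
    return re.findall(r"\w+", text)
```

On "10 mg/day of N-acetylcysteine." the naive version yields `['10', 'mg', 'day', 'of', 'N', 'acetylcysteine']`, while `tokenize` keeps `mg/day` and `N-acetylcysteine` whole and emits the final period separately.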
Derivation:
Inflection:
Synonymy:
Acronyms:
Datasets and the Annotation Processes
There are various biomedical corpora annotated for syntax and semantics
MedTag: A collection of biomedical annotations (MEDLINE abstracts): the AbGene corpus of annotated sentences of genes and protein named entities, the MedPost corpus of part of speech tagged sentences and the GENETAG corpus for named entity identification used for BioCreAtIvE I.
TREC Genomics Track: A set of data collections provided by TREC Genomics Track useful for development and evaluation of retrieval and text categorization strategies in the biomedical domain.
BioCreative corpus: Dataset produced by the BioCreative assessment, text passages relevant for GO annotations of human proteins.
GENIA corpus: Annotated corpus of literature related to the MeSH terms: Human, Blood Cells, and Transcription Factors.
Yapex corpus: Training and test data for the protein tagger (NER) YAPEX.
PASBio: Predicate-argument structures of biomedical literature.
LLL05 dataset: Genic Interaction Extraction Challenge: protein/gene interactions IE data set
IEPA corpus: The Interaction Extraction Performance Assessment corpus
BioText Data: Dataset for extraction of disease/treatment entities relations
BioText NC Semantics Dataset: Dataset of Noun Compound Semantics used in experiments described in articles
PennBioIE: UPenn Biomedical Information Extraction datasets of annotated PubMed abstracts: CYP450 domain and oncology domain
Medstract corpus: Biomedical annotation corpus useful for acronym definition and coreference resolution
OHSUMED text collection: Document collection used for the TREC-9 contest.
BMC corpus: Open access corpus of full text articles provided by BioMed Central.
FetchProt corpus: Full text journal articles from the biological domain analyzed for experiments on proteins.
PDG Bio-sentence splitter corpus: Small collection of text data sets derived from PubMed abstracts to develop and assess sentence splitting tools.
Bio1 corpus: annotated corpus, same field as GENIA, but annotated to small top-level ontology.
Run as shared-tasks
Training data made available
Testing set held out for evaluation
Evaluation performed by i2b2
Task | Classification (C) or Extraction (E) | Data Sources | Annotation Protocol | Training / Testing Records
De-identification | | Partners Healthcare (PH) | Standard | 669 / 220
Smoking Status | | PH | Standard by experts | 398 / 104
Obesity Diagnosis | | PH | Standard by experts | 730 / 507
Medication Extraction | E & C | PH | Community annotation | 17 / 251
Concepts, Assertions, and Relations | E & C | | Standard, then machine validation | 349 / 477
Coreference Resolution | E & C | | Standard, then machine validation | 492 / 322
Temporal Relations | E & C | PH, MIMIC II | Standard | 190 / 120
Heart disease risk factors | | PH Longitudinal records | |
Uzuner Ö., Luo Y., Szolovits P. (2007). Evaluating the State-of-the-Art in Automatic De-identification. Journal of the American Medical Informatics Association. September 2007;14(5):550-563.
Discharge Summaries
HISTORY OF PRESENT ILLNESS: Mrs. [Huntington] is a 77-year-old-woman with long standing hypertension who presented as a Walk-in to me at the [Bronx] Health Center on [DATE]. Recently had been started q.o.d. on Clonidine since [DATE] to taper off of the drug. Was told to start Zestril 20 mg. q.d. again. The patient was sent to the Emergency Unit for direct admission for cardioversion and anticoagulation, with the Cardiologist, Dr. [Swasissz] to follow.
SOCIAL HISTORY: Lives alone, has one daughter living in [Spring]. Is a non-smoker, and does not drink alcohol.
HOSPITAL COURSE AND TREATMENT: During admission, the patient was seen by Cardiology, Dr. [Tylenol], was started on IV Heparin, Sotalol 40 mg PO b.i.d. increased to 80 mg b.i.d., and had an echocardiogram. By [DATE] the patient had better rate control and blood pressure control but remained in atrial fibrillation. On [DATE], the patient was felt to be medically stable
Misspelled or foreign name?
Challenge 2006
Smoking Status
Document classification into the following classes:
Smoker
Current smoker
Past smoker
Non-smoker
Uzuner Ö., Goldstein I., Luo Y., Kohane I. (2008). Identifying Patient Smoking Status from Medical Discharge Records. Journal of the American Medical Informatics Association. January 2008;15(1):14-24.
Challenge 2008
Obesity Diagnosis
Obesity and 15 of its co-morbidities
Asthma, atherosclerotic cardiovascular disease (CAD), congestive heart failure (CHF), depression, diabetes mellitus (DM), gallstones / cholecystectomy, gastroesophageal reflux disease (GERD), gout, hypercholesterolemia, hypertension (HTN), hypertriglyceridemia, obstructive sleep apnea (OSA), osteoarthritis (OA), peripheral vascular disease (PVD), and venous insufficiency
Challenge 2009
Medication extraction
Medications and information related to them from medical discharge summaries
Medication names
Dosages
Modes
Frequencies
Durations
Reasons
List/narrative
Medication extraction
Extraction task
Medications and information related to them
Classification task
Whether a piece of information is related to a medication
Uzuner Ö., Solti I., Cadag E. (2010). Extracting Medication Information from Clinical Text. Journal of the American Medical Informatics Association. 2010;17:514-518 doi:10.1136/jamia.2010.003947.
Uzuner Ö., Solti I., Xia F., Cadag E. (2010). Community Annotation Experiment for Ground Truth Generation for the i2b2 Medication Challenge. Journal of the American Medical Informatics Association. 2010;17:519-523 doi:10.1136/jamia.2010.004200.
Challenge 2010
Relation Extraction
Three tier task
Clinical concepts
Assertions on concepts
Relations of concepts
Aims
Corpus selection
297 patients
hyperlipidemia/hypercholesterolemia, hypertension, obesity, a family history of premature CAD, and being a smoker
1 medical doctor
6 registered nurses
1 medical assistant
Each file triple-annotated
Gold standard: any tags appearing in 2/3 or more of the annotations
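The gold-standard rule above (keep any tag that appears in at least two of the three annotations of a file) can be sketched as a vote count. This is an illustration under stated assumptions: tags are represented as hashable tuples and two tags agree only on exact equality; the name `gold_standard` and the tuple layout are hypothetical, and the challenge organizers may have matched tags differently.

```python
from collections import Counter

def gold_standard(annotations, min_votes=2):
    """Keep tags proposed by at least `min_votes` annotators.

    `annotations` holds one collection of tags per annotator. Each
    annotator counts once per tag, even if the tag is repeated.
    """
    votes = Counter(tag for tags in annotations for tag in set(tags))
    return {tag for tag, count in votes.items() if count >= min_votes}

# Hypothetical tags as (start offset, end offset, label).
annotator_1 = [(0, 5, "HYPERTENSION"), (10, 17, "SMOKER")]
annotator_2 = [(0, 5, "HYPERTENSION")]
annotator_3 = [(0, 5, "HYPERTENSION"), (10, 17, "SMOKER"), (20, 25, "OBESE")]
agreed = gold_standard([annotator_1, annotator_2, annotator_3])
```

Here `agreed` keeps the hypertension and smoker tags and drops the tag only one annotator proposed.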
Training/testing
Methods
Depending on the problem, a wide range of methods are applied.
Rule based
Statistical
Hybrid
Clinical applications
Phenotype modeling in the ICU
Pneumonia predictor
Acute lung injury predictor
Section segmentation
HISTORY OF PRESENT ILLNESS: This is an 85 year old man initially admitted to the Plastic Surgery Service for evaluation of a left facial mass. Subsequently, CMED CCU was consulted and he was transferred to our Service postoperatively.
PAST MEDICAL HISTORY: His past medical history is significant for prostate cancer, benign prostatic hypertrophy, hypothyroidism , status post radiation for non Hodgkin 's lymphoma, chronic painless hematuria , degenerative joint disease and history of a murmur. Last colonoscopy, five years ago. Dementia.
ALLERGIES: No known drug allergies.
MEDICATIONS: 1. Levothyroxine. 2. Lasix. 3. Proscar. 4. Aeroseb. 5. Ancef.
PHYSICAL EXAMINATION: On examination, he is afebrile. Vital signs, stable. Elderly man, somewhat cachectic . Head, eye, ears, nose and throat, polypoid lesion just inferior to the left zygoma, elevated superiorly, with visible bone. No exudate. Minimal bleeding. Regular rate and rhythm. Clear to auscultation. Nontender, nondistended.
HOSPITAL COURSE: He was initially admitted to CMED for resection and repair of this left facial lesion. He also had consults from Urology for his hematuria as well as Medicine preoperatively and CMED CCU. He went to the Operating Room on 2016-03-10 with Urology for hematuria where he had a cystoscopy transurethral resection of prostate placement. He then went to the Operating Room on 2016-03-14 where he had ...
Framework
We create a section header ontology for a given note type (e.g., radiology reports, discharge summaries)
We annotate a small set of documents for each of the report types
We train a two level classifier
First classifier identifies the boundaries of the sections
Second classifier identifies the section type
Exam Details
Exam
Comparison
Contrast
Procedure
Findings
Findings
Impression
Impression
Attending Statement
Admission diagnoses
History
Reason for admission
Medications
Conditions at discharge
Discharge diagnoses
Other diagnoses
Physician exam on discharge
Disposition
Other diagnoses
Condition
Consultation
Procedures
Hospital course
Studies
Physical
Discharge instructions
Allergies
Past medical history
Past surgical history
Family history
Gynecological history
Social history
Hospital course
Admit physician
Attending physician
Discharge physician
Attending surgeon
Medical history
Provider information
Admit date
Discharge date
Service
Discharge instructions
Discharge medications
Follow-up
Addenda
Attending statement
Note
Annotation Process
Annotated section labels: History of present illness, Past medical history, Allergies, Medications, Physical Examination, Hospital Course
HISTORY OF PRESENT ILLNESS:
This is an 85 year old man initially admitted to the Plastic Surgery Service for evaluation of a left facial mass. Subsequently, CMED CCU was consulted and he was transferred to our Service postoperatively.
PAST MEDICAL HISTORY:
His past medical history is significant for prostate cancer, benign prostatic hypertrophy, hypothyroidism , status post radiation for non Hodgkin 's lymphoma, chronic painless hematuria , degenerative joint disease and history of a murmur. Last colonoscopy, five years ago. Dementia.
ALLERGIES:
No known drug allergies.
MEDICATIONS:
1. Levothyroxine.
2. Lasix.
3. Proscar.
4. Aeroseb.
5. Ancef.
PHYSICAL EXAMINATION:
On examination, he is afebrile. Vital signs, stable. Elderly man, somewhat cachectic . Head, eye, ears, nose and throat, polypoid lesion just inferior to the left zygoma, elevated superiorly, with visible bone. No exudate. Minimal bleeding. Regular rate and rhythm. Clear to auscultation. Nontender, nondistended.
HOSPITAL COURSE:
He was initially admitted to CMED for resection and repair of this left facial lesion. He also had consults from Urology for his hematuria as well as Medicine preoperatively and CMED CCU. He went to the Operating Room on 2016-03-10 with Urology for hematuria where he had a cystoscopy transurethral resection of prostate placement. He then went to the Operating Room on 2016-03-14 where he had ...
Two-level classifier
Line Labeling
Assigns each line of the report one of the following three labels
B: Beginning of section
I: Inside of section
O: Outside of section
Classifier: MaxEnt
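The B/I/O line-labeling scheme above can be illustrated by deriving training labels from a note whose section header lines have already been marked by an annotator. The sketch covers only the label encoding, not the MaxEnt classifier itself. The function name `bio_line_labels`, the choice to label lines before the first header as O, and the example report lines are assumptions for illustration.

```python
def bio_line_labels(lines, header_line_indices):
    """Label each line B (section start), I (inside) or O (outside).

    `header_line_indices` holds the indices of lines an annotator
    marked as section headers. Lines before the first header are
    treated as outside any section (an assumption of this sketch).
    """
    headers = set(header_line_indices)
    labels = []
    inside_section = False
    for index, _line in enumerate(lines):
        if index in headers:
            labels.append("B")
            inside_section = True
        elif inside_section:
            labels.append("I")
        else:
            labels.append("O")
    return labels

report = [
    "Report status: signed",        # hypothetical preamble line
    "ALLERGIES:",
    "No known drug allergies.",
    "MEDICATIONS:",
    "1. Levothyroxine.",
]
labels = bio_line_labels(report, header_line_indices=[1, 3])
```

Each (line, label) pair would then serve as one training instance for the line-level classifier.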
Features
Features
for
line
labeling
Type
Features
Text features
Tag features
Features
Header features
Body features
Tag features
prevTag
Results*
Discharge Summaries
Dataset: 430 notes
Performance: Precision: 91.1%, Recall: 92.4%, F: 91.8%
Radiology Reports
Dataset: 100 notes
Performance: Precision: 93.1%, Recall: 91.1%, F: 92.1%
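The F values reported above are consistent with the balanced F-measure, the harmonic mean of precision and recall, although the slides say only "F". The sketch below assumes that definition; the function name `f_score` and the `beta` parameter are illustrative.

```python
def f_score(precision, recall, beta=1.0):
    """F-measure of precision and recall (balanced when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)

radiology_f = f_score(93.1, 91.1)   # rounds to the reported 92.1
discharge_f = f_score(91.1, 92.4)   # about 91.7 from the rounded inputs
```

The discharge-summary value computed from the rounded precision and recall is about 91.7 rather than the reported 91.8; a gap of this size would be expected if the reported F was computed from unrounded figures, which is an assumption here.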
Assertion analysis
Assertion classification: Given a medical problem concept mentioned in a clinical note (e.g., chest pain), the purpose of a system solving this task is to determine whether the concept is:
present
absent
conditional
hypothetical
possible
not associated with the patient (not patient).
Annotated corpus
Part of 2010 Informatics for Integrating Biology and the Bedside (i2b2)/Veterans Affairs (VA) shared-task challenge: 21 teams competed for this task
Dataset: 826 clinical reports (mostly discharge summaries):
349 training documents (11,968 instances)
477 test documents (18,550 instances)
69% present
20% absent
4.5% each for hypothetical and possible
<1% each for conditional and not patient
Related work
de Bruijn et al. (2011):
Basic features
Encode the surrounding contextual information of the medical concept at the sentence level:
word, lemma, and stem uni/bi/tri-grams occurring before and after the concept
right sparse stem trigram (e.g., "* lift furnitur": while lifting furniture, if lifting furniture, after lifting furniture)
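The contextual n-gram features described above can be sketched for the word level as follows. Lemma and stem variants are omitted because they need an external lemmatizer or stemmer, and the sparse trigram is not covered. The function names and the feature-string format (`L2=...`, `R3=...`) are assumptions, not the format of the cited system.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def context_features(tokens, concept_start, concept_end, max_n=3):
    """Word uni/bi/tri-grams before (L) and after (R) a concept span.

    The concept occupies tokens[concept_start:concept_end] within
    one sentence; features are drawn only from that sentence.
    """
    left = tokens[:concept_start]
    right = tokens[concept_end:]
    features = []
    for n in range(1, max_n + 1):
        features += [f"L{n}={gram}" for gram in ngrams(left, n)]
        features += [f"R{n}={gram}" for gram in ngrams(right, n)]
    return features

sentence = "he does become slightly short of breath when lifting furniture".split()
# The concept "short of breath" spans tokens 4 to 6.
features = context_features(sentence, 4, 7)
```

For this sentence the right-context trigram `R3=when lifting furniture` is the kind of feature that signals a conditional assertion.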
Columns: L/R, Absent, NP, Cond., Hypo., Possible, Present, Score
Entries are listed as cue (count, score).
Present: found to have (35, 1.0); which showed (27, 1.0); showed (153, .99); revealed (83, .97); developed (54, .96)
Conditional: on exertion (11, 1.0); with exertion (1.0); when (.58)
Columns: L/R, Absent, NP, Cond., Hypo., Possible, Present, Score
Entries are listed as cue (count, score).
Absent: no evidence * (90, 1.0); no * or (61, 1.0); did not * (33, 1.0); no (1000, .99)
Possible: r/o (14, 1.0); questionable (12, 1.0); possible (50, .98); question of (16, .94)
Semantic features
The patient was then followed in the cardiac critical care unit where he had evidence of anoxic encephalopathy. (present)
Heart was regular with a I/VI systolic ejection murmur without jugular venous distention. (absent)
He does become slightly short of breath when lifting furniture. (conditional)
If you have fevers please contact your PCP or return to the emergency room. (hypothetical)
The patient was continued on antibiotics for possible pneumonia. (possible)
Father had coronary artery disease. (not patient)
Semantic features
Assertion cues:
Semantic features:
encode the connection between assertion cues and medical concepts
capture the meaning of assertion cues
help classifiers decide whether or not a concept is within the focus of an assertion cue
Semantic features
the closest negative cue in the left token context window (size=8)
the first assertion cue on the path in the dependency tree between the concept and root
the first verb on the dependency tree path between the medical concept and root
the modal auxiliary verb associated with the first verb on the dependency tree path between the medical concept and the closest assertion cue
the sequence of POS labels between the closest left assertion cue and the medical concept
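The first feature in the list above, the closest negative cue within a window of 8 tokens to the left of the concept, can be sketched without a parser. The dependency-tree features are not covered because they require a syntactic parser. The cue set below is a small hypothetical sample of single-token cues, and the name `closest_left_negative_cue` is illustrative.

```python
# Hypothetical single-token negation cues; a real lexicon would be
# larger and include multi-word cues such as "no evidence".
NEGATIVE_CUES = {"no", "not", "without"}

def closest_left_negative_cue(tokens, concept_start, window=8):
    """Return the nearest negative cue in the left context window.

    Scans backwards from the token just before the concept, looking
    at no more than `window` tokens. Returns None if no cue is found.
    """
    lowest_index = max(0, concept_start - window)
    for index in range(concept_start - 1, lowest_index - 1, -1):
        token = tokens[index].lower()
        if token in NEGATIVE_CUES:
            return token
    return None

absent_example = ("Heart was regular with a I/VI systolic ejection murmur "
                  "without jugular venous distention .").split()
# "jugular venous distention" starts at token 10.
cue = closest_left_negative_cue(absent_example, 10)
```

For the example sentence the feature value is `without`, matching the absent label shown for it in the slides.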
Set | System | absent (20%) P / R | not patient (<1%) P / R | conditional (<1%) P / R | hypothetical (4.5%) P / R | possible (4.5%) P / R | present (69%) P / R | Overall MF | Overall mF
training set | basic | 95.77 / 95.66 | 85.48 / 57.61 | 76.19 / 31.07 | 94.26 / 88.33 | 79.36 / 64.67 | 95.05 / 97.81 | 77.93 | 94.48
training set | +sect | 96.20 / 95.78 | 92.68 / 82.61 | 73.33 / 32.04 | 95.36 / 91.55 | 79.73 / 65.42 | 95.5 / 97.89 | 81.65 | 94.96
training set | +spec | 96.50 / 95.78 | 92.94 / 85.87 | 80.39 / 39.81 | 95.51 / 91.55 | 84.87 / 72.34 | 95.97 / 98.16 | 84.55 | 95.55
training set | +focus | 96.87 / 96.37 | 95.18 / 85.87 | 82.35 / 40.78 | 95.54 / 92.17 | 87.03 / 74.02 | 96.2 / 98.31 | 85.42 | 95.89
test set | Kim et al 2011 | 96.31 / 94.71 | 97.52 / 81.38 | 81.25 / 30.41 | 92.07 / 87.45 | 78.3 / 54.36 | 94.46 / 98.07 | 79.76 | 94.17
test set | Our system | 95.71 / 93.88 | 91.79 / 84.83 | 80.00 / 30.41 | 92.42 / 86.75 | 83.16 / 55.95 | 94.51 / 98.28 | 79.96 | 94.23
*C.A. Bejan, L. Vanderwende, F. Xia, M. Yetisgen-Yildiz. Assertion modeling and its role in clinical phenotype identification. J Biomed Inform. 2013;46(1):68-74.
Fred Hutch: Pathology notes
Data - unstructured
All free-text notes generated during ICU stay
Admit notes
ICU daily progress notes
Acute care daily progress notes
Transfer notes
Cardiology daily progress notes
Respiratory therapy notes
Radiology notes (chest x-rays)
Microbiology notes
Discharge summary
[Figure: pneumonia identification pipeline. Components: Patient records, Feature Extractor, MetaMap, Training Data, Pneumonia Learner, Test Data, Pneumonia Predictor. Output: Positive / Negative]
*M. Yetisgen-Yildiz, B.J. Glavan, F. Xia, L. Vanderwende, M.M. Wurfel. Extraction of Pneumonia Cases from Free-Text Intensive Care Unit Reports. Proceedings AMIA'2011.
*M. Yetisgen-Yildiz, B.J. Glavan, F. Xia, L. Vanderwende, M.M. Wurfel. Identifying Patients with Pneumonia from Free-Text Intensive Care Unit Reports. Proceedings of Learning from Unstructured Clinical Text Workshop of ICML'2011.
*C.A. Bejan, L. Vanderwende, M.M. Wurfel, and M. Yetisgen-Yildiz. Assessing Pneumonia Identification from Time-Ordered Narrative Reports. Proceedings of AMIA'12.
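[Figure: pneumonia identification pipeline with statistical feature selection. Components: Patient records, Ranked words, Ranked concepts, Feature Extractor, MetaMap, Training Data, Pneumonia Learner, Test Data, Pneumonia Predictor, Assertion Classifier. Output: Positive / Negative]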
*C.A. Bejan, L. Vanderwende, F. Xia, M. Wurfel, M. Yetisgen-Yildiz. Pneumonia identification using statistical feature selection. J Am Med Inform Assoc. 2012;19(5):817-23.
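Ranking words or concepts by a statistical association score is one way to obtain "ranked words" and "ranked concepts" for a classifier. The following is a minimal Python sketch, assuming a chi-square statistic over binary presence/absence features. The slides do not say which statistic was used; the function names `chi_square` and `rank_features` and the toy data are hypothetical.

```python
from collections import Counter


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 feature-by-label contingency table.

    n11: positive docs containing the feature
    n10: negative docs containing the feature
    n01: positive docs without the feature
    n00: negative docs without the feature
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0  # feature or label is constant: no evidence either way
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def rank_features(docs, labels):
    """Rank features by chi-square score, highest first.

    docs: one set of features (words or concept IDs) per patient
    labels: one bool per patient (True = positive case)
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for feats, is_pos in zip(docs, labels):
        (pos_df if is_pos else neg_df).update(set(feats))
    scores = {}
    for feat in set(pos_df) | set(neg_df):
        n11, n10 = pos_df[feat], neg_df[feat]
        scores[feat] = chi_square(n11, n10, n_pos - n11, n_neg - n10)
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data, invented for illustration only.
toy_docs = [
    {"infiltrate", "fever", "cough"},
    {"infiltrate", "consolidation"},
    {"fever", "fracture"},
    {"fracture", "cough"},
]
toy_labels = [True, True, False, False]
ranking = rank_features(toy_docs, toy_labels)
```

Chi-square scores association in either direction, so a feature seen only in negative patients ranks as high as one seen only in positive patients. A learner would then keep the top-k entries of `ranking`.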
InstanceLabel(Patient A, Day 0) =
InstanceLabel(Patient A, Day 1) =
InstanceLabel(Patient A, Day 2) = +
F-score = 76.46
lp = 2
Other experiments
Acute lung injury prediction from chest x-rays
M. Yetisgen-Yildiz, C.A. Bejan, M.M. Wurfel. Identification of patients with acute lung injury from free-text chest x-ray reports. Proceedings of BioNLP Workshop of ACL'2013, 2013.
Application #2
Information extraction from radiology notes
Goal: Clinically important recommendation extraction.
Setting: patient, clinician, radiologist
  The clinician orders a radiology test for the patient
  The radiologist takes an X-ray and writes a radiology report, which is sent back to the clinician
*M. Yetisgen-Yildiz, M.L. Gunn, F. Xia, T.H. Payne. Automatic Identification of Critical Follow-Up Recommendation Sentences in Radiology Reports. Proceedings of AMIA'2011.
*M. Yetisgen-Yildiz, M.L. Gunn, F. Xia, T.H. Payne. A Text Processing Methodology to Extract Recommendation Information from Radiology Reports. J Biomed Inform. 2013;46(2):354-62.
Architecture
Features
Feature Type: Features
  Syntactic:
  Knowledge-based: umlsConcept, umlsSemanticType, radlexConcept
  Structural: sectionType
Performance:
Current research
Events in clinical text
  How are events represented in clinical text?
  How can we extract events?
Report types:
  Microbiology notes
  Longitudinal chest x-ray notes
Microbiology notes
Microbiology laboratory culture tests are ordered to (1) identify sources of bacterial infection, (2) distinguish between differential diagnoses, and (3) adjust antibiotic treatments
Unlike other report types, microbiology notes change over time as more information becomes available about the culture
Event definition
Main attributes:
Additional attributes:
Relations
Annotation examples
Corpus*
1442 microbiology reports from UW Medical Center
100 reports were double annotated by a medical student and a biomedical informatics PhD student
Entity level
  Kappa: 0.977
  F-score: 0.964
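Agreement figures such as the entity-level kappa reported here can be computed from two annotators' labels. The following is a minimal Python sketch of Cohen's kappa, assuming both annotators labeled the same token sequence. The slides do not state how entities were aligned; the label names and toy data are invented for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same sequence of items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy token labels ("O" = outside any entity), invented for illustration.
ann_a = ["Organism", "O", "Drug", "O", "O", "Specimen", "O", "O"]
ann_b = ["Organism", "O", "Drug", "O", "O", "O", "O", "O"]
kappa = cohen_kappa(ann_a, ann_b)
```

In this toy case the annotators agree on 7 of 8 tokens (0.875), chance agreement is 0.5, and kappa is (0.875 - 0.5) / (1 - 0.5) = 0.75.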
Entity extraction
Rule based:
STAT. HYBRID RULE FP FN
Rating 453 0.99 0.99
MIC 83 0.98 0.99
No-growth 26
No-growth measure 26
Specimen 124 0.99 0.97 0.98
Reference 134
Drug resistance 262 0.99 0.99 0.99
Organism quantity 738 18 0.99 0.97 0.98
Drug 304 0.99 0.97 0.98
Organism 1281 94 123 0.93 0.91 0.92
Specimen 109 13 27 0.89 0.80 0.85
All entities 0.968 0.952 0.960
Relations 0.915 0.860 0.886
Event description
Main attributes
  loc: anatomical location
  attr: something being measured or observed (e.g., volume, opacity)
  val: a possible value for the attr (e.g., clear)
  cos: change of state compared to other reports for the same patient (e.g., unchanged)
  ref: a link to the report(s) that the change of state is compared to (e.g., prior examination)
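The five attributes map naturally onto a small record type. The following is a minimal Python sketch, assuming each attribute is stored as an optional text span. The class name `ChangeOfStateEvent`, the field types, and the `is_complete` helper are hypothetical and are not the annotation tool's actual schema; the value "lungs" is invented, while the other example values come from the attribute descriptions.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class ChangeOfStateEvent:
    """One annotated event tuple; a field is None when not annotated."""

    loc: Optional[str] = None   # anatomical location
    attr: Optional[str] = None  # what is measured or observed
    val: Optional[str] = None   # value of the attribute
    cos: Optional[str] = None   # change of state versus other reports
    ref: Optional[str] = None   # report(s) the change is compared to

    def is_complete(self) -> bool:
        """True when all five fields of the tuple are filled."""
        return all(getattr(self, f.name) is not None for f in fields(self))


event = ChangeOfStateEvent(
    loc="lungs",
    attr="opacity",
    val="clear",
    cos="unchanged",
    ref="prior examination",
)
partial = ChangeOfStateEvent(loc="lungs", val="clear")
```

Keeping every field optional lets the same type hold partially annotated snippets, which matters when a sentence mentions a location and value but no change of state.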
Example annotations
A snippet featuring an event annotation connecting all five fields of the COS tuple.
Corpus*
1008 sentences from 1344 chest x-ray notes
  7173 entities
  4128 relations
  2101 event tuples
Agreement:
*P. Klassen, F. Xia, L. Vanderwende, M. Yetisgen. Annotating Clinical Events in Text Snippets for Phenotype Detection. To appear in Proceedings of International Conference on Language Resources and Evaluation (LREC). Reykjavik, Iceland, May, 2014.
*L. Vanderwende, F. Xia, M. Yetisgen-Yildiz. Annotating Change of State for Clinical Events. Proceedings of The 1st Workshop on EVENTS: Definition, Detection, Coreference, and Representation Workshop of NAACL'2013, 2013.
Event extraction
Sequential classification for entity recognition
SVM for relation classification
Entity extraction performance: P: 0.94 R: 0.95 F: 0.95
Future steps
Running experiments with the VAP data to see whether features extracted from microbiology and radiology events improve VAP classification
Local Context
Characteristics of the target word (TW) and of the words immediately surrounding the TW
  Lexical and orthographic features
  Syntactic features
  Semantic features
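As a rough illustration of local context, the sketch below builds lexical and orthographic features for a target word and its immediate neighbors. The window size of two, the feature names, and the particular orthographic tests are assumptions made for this example; the slides do not list the exact feature set.

```python
import re


def orthographic(word):
    """Surface-form cues that often distinguish names, dates, and IDs."""
    return {
        "is_capitalized": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "has_digit": any(c.isdigit() for c in word),
        "has_punct": bool(re.search(r"[^\w\s]", word)),
        "length": len(word),
    }


def local_context_features(tokens, i, window=2):
    """Features for the target word tokens[i] and `window` words on each side."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        # Positions outside the sentence get a padding symbol.
        word = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
        prefix = f"w[{offset:+d}]" if offset else "w[0]"
        feats[f"{prefix}.lower"] = word.lower()  # lexical feature
        if word != "<PAD>":
            for name, value in orthographic(word).items():
                feats[f"{prefix}.{name}"] = value
    return feats


tokens = "was seen by Cardiology , Dr. Tylenol , was started".split()
feats = local_context_features(tokens, tokens.index("Tylenol"))
```

Here the target word "Tylenol" looks like a drug name in isolation, but the neighboring "Dr." (captured as `w[-1].lower`) is the local cue that it is being used as a doctor name.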
De-identification
Privacy concerns related to medical records
Health Insurance Portability and Accountability Act (HIPAA)
Discharge Summaries
HISTORY OF PRESENT ILLNESS: Mrs. [Huntington] is a 77-year-old-woman with long standing hypertension who presented as a Walk-in to me at the [Bronx] Health Center on [DATE]. Recently had been started q.o.d. on Clonidine since [DATE] to taper off of the drug. Was told to start Zestril 20 mg. q.d. again. The patient was sent to the Emergency Unit for direct admission for cardioversion and anticoagulation, with the Cardiologist, Dr. [Swasissz] to follow.
SOCIAL HISTORY: Lives alone, has one daughter living in [Spring]. Is a non-smoker, and does not drink alcohol.
HOSPITAL COURSE AND TREATMENT: During admission, the patient was seen by Cardiology, Dr. [Tylenol], was started on IV Heparin, Sotalol 40 mg PO b.i.d. increased to 80 mg b.i.d., and had an echocardiogram. By [DATE] the patient had better rate control and blood pressure control but remained in atrial fibrillation. On [DATE], the patient was felt to be medically stable
Callout: Misspelled or foreign name?
Related Work
Named Entity Recognition (NER)
  Exploit both the characteristics of the names of the entities and contextual clues related to these entities (Bikel et al.; McCallum et al.; Riloff and Jones; ...)
De-identification
  Combinations of statistical and rule-based approaches
  Most statistical approaches focused on sub-categories of PHI (Taira et al.; Thomas et al.; ...)
  Approaches that target full de-identification use dictionaries, rules, and patterns (Gupta et al.; Douglass; ...)
Syntactic features:
  Part of speech (POS) of TW, of the word before, and of the word after
  Syntactic bigrams
Semantic features:
  Presence of TW, of the word before, and of the word after in relevant dictionaries
  MeSH ID
Transferred to
Transferred immediately to
Transferred later to
Semantic features:
  The heading of the section in which TW appears
Syntactic Information
From the output of the Link Grammar Parser (Sleator and Temperley, 1991)
[Figure: example parse with link labels Xp, MVa, Wd, Ss, Op, Dmc]
Syntactic Bigrams
[Figure: links Wd, Ss marked as the left syntactic bigram]
Stat De-id
Multi-class SVM (linear kernel) with local context
Determine if a word is:
  PHI
    Patient name
    Doctor name
    Date
    Phone
    ID
    Hospital name
    Location
  Non-PHI
Evaluation
Compare with a Heuristic + Dictionary (H+D) approach that benefits from dictionaries, rules, and patterns (Douglass)
Compare with approaches that benefit from wider context, i.e., IdentiFinder (Bikel et al.) and a Conditional Random Field (CRF) De-identifier
  Wider context: Characteristics and dependencies of the entities in the sentence containing the target
Methods
  Stat De-id: Cross-validated
  CRFD: Cross-validated
  H+D: Rule-based
  IdentiFinder: Obtained pre-trained on newswire
Data
Number of tokens
PHI Category   Random corpus   Authentic corpus   Challenge corpus
Non-PHI        17,874          112,669            444,127
Patient        1,048           294                1,737
Doctor         311             738                7,697
Location       24              88                 518
Hospital       600             656                5,204
Date           735             1,953              7,651
ID             36              482                5,110
Phone          39              32                 271
Data
Table 2: Distribution of words, i.e., tokens, that are ambiguous between PHI and non-PHI.
PHI Category: Non-PHI, Patient, Doctor, Location, Hospital, Date, ID, Phone
Data
Corpus: Random, Authentic, Challenge
Patients in names dict.: 86.45%, 78.57%, 14.10%
Doctors in names dict.: 86.50%, 70.33%, 17.20%
Locations in location dict. / Hospitals in hospital dict.: 87.5%, 54.55%, 11.40%
Non-PHI in names dict.: 15.87%, 16.12%, 15.36%
Non-PHI in location dict.: 9.19%, 10.19%, 11.32%
Dates in month dict.: 87.5%, 80.18%, 26.59%
Non-PHI in hospitals dict.: 14.10%, 12.74%, 8.61%, 12.65%, 21.97%, 5.15%
Non-PHI in month dict.: 0.07%, 0.02%, 0.06%
Evaluation Metrics
Precision
Recall
F-measure
  Only on the PHI
  Aggregate over all PHI
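Scoring only PHI and aggregating over all PHI categories can be made concrete with a small token-level scorer. This is a minimal Python sketch under stated assumptions: the label strings, the `NON_PHI` constant, and the decision to count a PHI token as correct only when its category matches exactly are illustrative choices, not necessarily the evaluation used in the study.

```python
NON_PHI = "Non-PHI"  # hypothetical label for tokens that are not PHI


def phi_prf(gold, pred):
    """Precision, recall, and F-measure aggregated over all PHI tokens.

    Non-PHI tokens that are correctly left alone never enter the counts,
    so the very large Non-PHI class cannot inflate the scores.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    # A true positive is a PHI token predicted with the right category.
    tp = sum(g == p and g != NON_PHI for g, p in zip(gold, pred))
    n_pred = sum(p != NON_PHI for p in pred)  # tokens predicted as PHI
    n_gold = sum(g != NON_PHI for g in gold)  # tokens that truly are PHI
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy labels, invented for illustration.
gold = ["Non-PHI", "Doctor", "Doctor", "Non-PHI",
        "Date", "Non-PHI", "Hospital", "Non-PHI"]
pred = ["Non-PHI", "Doctor", "Non-PHI", "Non-PHI",
        "Date", "Date", "Location", "Non-PHI"]
p, r, f = phi_prf(gold, pred)
```

In the toy example two of the four predicted PHI tokens are correct and two of the four gold PHI tokens are found, so precision, recall, and F-measure are all 0.5, even though six of the eight tokens are labeled correctly overall.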
Results
F-measures on PHI in the Randomly Re-id'ed Corpus
[Figure: bar chart of F-measures, axis 0.00 to 100.00. Series: Stat De-id, CRFD (All PHI). Corpus labels: Random, Authentic, Challenge. Bar values shown: 95.33, 96.82, 94.28, 98.03, 95.08]
All differences significant at alpha=0.05.
Contributions
We can:
  De-identify medical discharge summaries using a statistical representation of local context
We showed that:
  The stronger the local context, the better the performance
  When using an SVM
Open Questions and Future Directions