Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
Record Linkage
Introduction, recent
advances, privacy issues
Peter Christen
The Australian
National University
Record Linkage: Introduction,
recent advances, and privacy issues
Peter Christen
Contact: peter.christen@anu.edu.au
Database A Database B
Indexing /
Searching
Matches
Clerical Potential
Review Matches
Deterministic matching
Rule-based matching (complex to build and maintain)
Probabilistic record linkage (Fellegi and Sunter, 1969)
Use available attributes for linking (often personal
information, like names, addresses, dates of birth, etc.)
Calculate match weights for attributes
“Computer science” approaches
Based on machine learning, data mining, database, or
information retrieval techniques
Supervised: Requires training data (true matches)
Unsupervised: Clustering, collective, and graph based
Database A Database B
Indexing /
Searching
Matches
Clerical Potential
Review Matches
Clean input
Remove unwanted characters and words
Expand abbreviations and correct misspellings
Segment address into well defined output fields
Verify if address (or parts of it) exists in reality
July 2015 – p. 23/69
Standardisation approaches
Cleaning
Based on look-up tables and correction lists
Remove unwanted characters and words
Correct various misspellings and abbreviations
Tagging
Split input into a list of words, numbers and separators
Assign one or more tags to each element of this list
(using look-up tables and/or features)
Segmenting
Use for example a trained HMM to assign list elements
to output fields
July 2015 – p. 26/69
Example address standardisation
Database A Database B
Indexing /
Searching
Matches
Clerical Potential
Review Matches
g a y l e
0 1 2 3 4 5
g 1 0 1 2 3 4
a 2 1 0 1 2 3
i 3 2 1 1 2 2
l 4 3 2 2 1 2
−5 0 5 10 15
Total matching weight
July 2015 – p. 36/69
Weight calculation: Month of birth
abbybond r5
paulsmith r2 First
pedrosmith r4 window
Second
of records
pedrosmith r9 window
Third
of records
percysmith r1 window
Fourth
of records
petersmith r7 window
Fifth
of records
petersmith r10 window
Last
of records
robinstevens r3 window
of records
sallytaylor r6
sallytaylor r8
Johnathon, Smith, 2009 Johnathon, Smith, 2009 Johnathon, Smith, 2009 Johnathon, Smith, 2009 Johnathon, Smith, 2009 Johnathon, Smith, 2009
Joey, Schmidt, 2009 Joey, Schmidt, 2009 Joey, Schmidt, 2009 <’S253’> Joey, Schmidt, 2009 Joey, Schmidt, 2009
Joe, Miller, 2902 Joe, Miller, 2902 Joe, Miller, 2902 Joey, Schmidt, 2009 <’M460’, ’M450’> <’Jo’><’M460’, ’M450’>
Joseph, Milne, 2902 Joseph, Milne, 2902 Joseph, Milne, 2902 <’M460’> Joe, Miller, 2902 Joe, Miller, 2902
Paul, , 3000 <’Pa’> Joe, Miller, 2902 Joseph, Milne, 2902 Joseph, Milne, 2902
<’Pa’, ’Pe’>
Peter, Jones, 3000 Paul, , 3000 <’M450’> <’Pa’, ’Pe’>
Paul, , 3000
Joseph, Milne, 2902 Paul, , 3000
<’Pe’> Peter, Jones, 3000 Peter, Jones, 3000
Peter, Jones, 3000 Blocking Keys = <FN, F2>, <SN, Sdx>
Smin = 2, S max = 3
a1 a4
a2 a3
H1 H2
31 30 29 30
cc (ph)
Printed Handwritten Memory sub, ins, del
attr swap, repl
cc (ph) cc (ty)
sub, ins, del sub, ins, del, trans
attr swap, repl attr swap, repl
Dictate
Typed
OCR Abbreviations:
cc : character change
wc : word change
cc (ph,ty) subs : substitution
cc (ph and or ty) sub, ins, del, trans ins : insertion
cc (ph) sub, ins, del, trans wc split, merge
sub, ins, del attr swap, repl del : deletion
-
attr swap, repl
Speech recognition trans : transpose
repl : replace
cc (ocr) ty : typographic
sub, ins, del
ph : phonetic
wc split, merge
attr : attribute
Electronic document
Linkage Researchers
unit
Details given in: Chris Kelman, John Bass, and D’Arcy Holman: Research use of Linked
Health Data – A Best Practice Protocol, Aust NZ Journal of Public Health, vol. 26, 2002.
Database A Database B
Indexing /
Searching
Matches
Clerical Potential
Encoded data Review Matches
(1)
(1)
Alice Bob
(2)
Alice (2) Bob (2) (2)