Sei sulla pagina 1di 16

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 19, NO.

1, JANUARY 2007 1

Duplicate Record Detection: A Survey


Ahmed K. Elmagarmid, Senior Member, IEEE,
Panagiotis G. Ipeirotis, Member, IEEE Computer Society, and
Vassilios S. Verykios, Member, IEEE Computer Society

Abstract—Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common
key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors,
incomplete information, lack of standard formats, or any combination of these factors. In this paper, we present a thorough analysis of
the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we
present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also
cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with
coverage of existing tools and with a brief discussion of the big open problems in the area.

Index Terms—Duplicate detection, data cleaning, data integration, record linkage, data deduplication, instance identification,
database hardening, name matching, identity uncertainty, entity resolution, fuzzy duplicate detection, entity matching.

1 INTRODUCTION
Data cleaning [2], or data scrubbing [3], refers to the process of
D ATABASES play an important role in today’s IT-based
economy. Many industries and systems depend on the
accuracy of databases to carry out operations. Therefore, the
resolving such identification problems in the data. We
distinguish between two types of data heterogeneity:
quality of the information (or the lack thereof) stored in the structural and lexical. Structural heterogeneity occurs when
databases can have significant cost implications to a system the fields of the tuples in the database are structured
that relies on information to function and conduct business. differently in different databases. For example, in one
In an error-free system with perfectly clean data, the database, the customer address might be recorded in one
construction of a comprehensive view of the data consists field named, say, addr, while, in another database, the same
of linking—in relational terms, joining—two or more tables information might be stored in multiple fields such as street,
city, state, and zipcode. Lexical heterogeneity occurs when the
on their key fields. Unfortunately, data often lack a unique,
tuples have identically structured fields across databases,
global identifier that would permit such an operation.
but the data use different representations to refer to the same
Furthermore, the data are neither carefully controlled for
real-world object (e.g., StreetAddress = 44 W. 4th St. versus
quality nor defined in a consistent way across different data StreetAddress = 44 West Fourth Street).
sources. Thus, data quality is often compromised by many In this paper, we focus on the problem of lexical
factors, including data entry errors (e.g., Microsft instead of heterogeneity and survey various techniques which have
Microsoft), missing integrity constraints (e.g., allowing been developed for addressing this problem. We focus on
entries such as EmployeeAge ¼ 567), and multiple conven- the case where the input is a set of structured and properly
tions for recording information (e.g., 44 W. 4th St. versus segmented records, i.e., we focus mainly on cases of database
44 West Fourth Street). To make things worse, in indepen- records. Hence, we do not cover solutions for various other
dently managed databases, not only the values, but also the problems, such as that of mirror detection, in which the goal
structure, semantics, and underlying assumptions about the is to detect similar or identical Web pages (e.g., see [4], [5]).
data may differ as well. Also, we do not cover solutions for problems such as
Often, while integrating data from different sources to
anaphora resolution [6] in which the problem is to locate
implement a data warehouse, organizations become aware
different mentions of the same entity in free text (e.g., that
of potential systematic differences or conflicts. Such pro- the phrase “President of the US” refers to the same entity as
blems fall under the umbrella-term data heterogeneity [1]. “George W. Bush”). We should note that the algorithms
developed for mirror detection or for anaphora resolution
. A.K. Elmagarmid is with the Department of Computer Sciences and Cyber are often applicable for the task of duplicate detection.
Center, Purdue University, West Lafayette, IN 47907. Techniques for mirror detection have been used for
E-mail: ake@cs.purdue.edu. detection of duplicate database records (see, for example,
. P.G. Ipeirotis is with the Department of Information, Operations, and Section 5.1.4) and techniques for anaphora resolution are
Management Sciences, Leonard N. Stern School of Business, New York
University, 44 West 4th Street, HKMC 8-84, New York, NY 10012. commonly used as an integral part of deduplication in
E-mail: panos@stern.nyu.edu. relations that are extracted from free text using information
. V.S. Verykios is with the Department of Computer and Communication extraction systems [7].
Engineering, University of Thessaly, Glavani 37 and 28th str., 38221 The problem that we study has been known for more
Volos, Greece. E-mail: verykios@inf.uth.gr.
than five decades as the record linkage or the record matching
Manuscript received 21 June 2005; revised 18 Mar. 2006; accepted 6 Sept.
2006; published online 20 Nov. 2006.
problem [8], [9], [10], [11], [12], [13] in the statistics
For information on obtaining reprints of this article, please send e-mail to: community. The goal of record matching is to identify
tkde@computer.org, and reference IEEECS Log Number TKDE-0240-0605. records in the same or different databases that refer to the
1041-4347/07/$20.00 ß 2007 IEEE Published by the IEEE Computer Society
Authorized licensed use limited to: IEEE Xplore. Downloaded on May 6, 2009 at 22:31 from IEEE Xplore. Restrictions apply.
2 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 19, NO. 1, JANUARY 2007

same real-world entity, even if the records are not identical. type that makes sense within the context of the original
In slightly ironic fashion, the same problem has multiple application, but not in a newly developed or subsequent
names across research communities. In the database system. The renaming of a field from one name to another is
community, the problem is described as merge-purge [14], considered data transformation as well. Encoded values in
data deduplication [15], and instance identification [16]; in the operational systems and in external data is another problem
AI community, the same problem is described as database that is addressed at this stage. These values should be
hardening [17] and name matching [18]. The names coreference converted to their decoded equivalents, so records from
resolution, identity uncertainty, and duplicate detection are also different sources can be compared in a uniform manner.
commonly used to refer to the same task. We will use the Range checking is yet another kind of data transformation
term duplicate record detection in this paper. which involves examining data in a field to ensure that it
The remaining parts of this paper are organized as falls within the expected range, usually a numeric or date
follows: In Section 2, we briefly discuss the necessary steps range. Last, dependency checking is slightly more involved
in the data cleaning process before the duplicate record since it requires comparing the value in a particular field to
detection phase. Then, Section 3 describes techniques used the values in another field to ensure a minimal level of
to match individual fields and Section 4 presents techniques consistency in the data.
for matching records that contain multiple fields. Section 5 Data standardization refers to the process of standardiz-
describes methods for improving the efficiency of the ing the information represented in certain fields to a specific
duplicate record detection process and Section 6 presents content format. This is used for information that can be
a few commercial, off-the-shelf tools used in industry for stored in many different ways in various data sources and
must be converted to a uniform representation before the
duplicate record detection and for evaluating the initial
duplicate detection process starts. Without standardization,
quality of the data and of the matched records. Finally,
many duplicate entries could be erroneously designated as
Section 7 concludes the paper and discusses interesting
nonduplicates based on the fact that common identifying
directions for future research. information cannot be compared. One of the most common
standardization applications involves address information.
2 DATA PREPARATION There is no one standardized way to capture addresses, so
the same address can be represented in many different ways.
Duplicate record detection is the process of identifying Address standardization locates (using various parsing
different or multiple records that refer to one unique real- techniques) components such as house numbers, street
world entity or object. Typically, the process of duplicate names, post office boxes, apartment numbers, and rural
detection is preceded by a data preparation stage during routes, which are then recorded in the database using a
which data entries are stored in a uniform manner in the standardized format (e.g., 44 West Fourth Street is stored as 44
database, resolving (at least partially) the structural hetero- W. 4th St.). Date and time formatting and name and title
geneity problem. The data preparation stage includes a formatting pose other standardization difficulties in a
parsing, a data transformation, and a standardization step. The database. Typically, when operational applications are
approaches that deal with data preparation are also designed and constructed, there is very little uniform
described under the using the term ETL (Extraction, handling of date and time formats across applications.
Transformation, Loading) [19]. These steps improve the Because most operational environments have many different
quality of the in-flow data and make the data comparable formats for representing dates and times, there is a need to
and more usable. While data preparation is not the focus of transform dates and times into a standardized format. Name
this survey, for completeness we briefly describe the tasks standardization identifies components such as first names,
performed in that stage. A comprehensive collection of last names, title, and middle initials and records everything
papers related to various data transformation approaches using some standardized convention. Data standardization
can be found in [20]. is a rather inexpensive step that can lead to fast identification
Parsing is the first critical component in the data of duplicates. For example, if the only difference between
preparation stage. Parsing locates, identifies, and isolates two records is the differently recorded address (44 West
individual data elements in the source files. Parsing makes Fourth Street versus 44 W. 4th St.), then the data standardiza-
it easier to correct, standardize, and match data because it tion step would make the two records identical, alleviating
allows the comparison of individual components, rather the need for more expensive approximate matching ap-
than of long complex strings of data. For example, the proaches, which we describe in the later sections.
appropriate parsing of name and address components into
After the data preparation phase, the data are typically
consistent packets of information is a crucial part in the data
stored in tables having comparable fields. The next step is to
cleaning process. Multiple parsing methods have been
identify which fields should be compared. For example, it
proposed recently in the literature (e.g., [21], [22], [23], [24],
[25]) and the area continues to be an active field of research. would not be meaningful to compare the contents of the
Data transformation refers to simple conversions that can field LastName with the field Address. Perkowitz et al. [26]
be applied to the data in order for them to conform to the presented a supervised technique for understanding the
data types of their corresponding domains. In other words, “semantics” of the fields that are returned by Web
this type of conversion focuses on manipulating one field at databases. The idea was that similar values (e.g. last names)
a time, without taking into account the values in related tend to appear in similar fields. Hence, by observing value
fields. The most common form of a simple transformation is overlap across fields, it is possible to parse the results into
the conversion of a data element from one data type to fields and discover correspondences across fields at the
another. Such a data type conversion is usually required same time. Dasu et al. [27] significantly extend this concept
when a legacy or parent application stored data in a data and extract a “signature” from each field in the database;

Authorized licensed use limited to: IEEE Xplore. Downloaded on May 6, 2009 at 22:31 from IEEE Xplore. Restrictions apply.
ELMAGARMID ET AL.: DUPLICATE RECORD DETECTION: A SURVEY 3

 
this signature summarizes the content of each column in the for the nontrivial case where j1 j  j2 j  k.) Needleman
database. Then, the signatures are used to identify fields and Wunsch [31] modified the original edit distance model
with similar values, fields whose contents are subsets of and allowed for different costs for different edit distance
other fields, and so on. operations. (For example, the cost of replacing O with 0
Even after parsing, data standardization, and identifica- might be smaller than the cost of replacing f with q.) Ristad
tion of similar fields, it is not trivial to match duplicate and Yiannilos [32] presented a method for automatically
records. Misspellings and different conventions for recording determining such costs from a set of equivalent words that
the same information still result in different, multiple are written in different ways. The edit distance metrics
representations of a unique object in the database. In the next work well for catching typographical errors, but they are
section, we describe techniques for measuring the similarity typically ineffective for other types of mismatches.
of individual fields and, later, in Section 4, we describe
techniques for measuring the similarity of entire records. 3.1.2 Affine Gap Distance
The edit distance metric described above does not work
3 FIELD MATCHING TECHNIQUES well when matching strings that have been truncated or
shortened (e.g., “John R. Smith” versus “Jonathan Richard
One of the most common sources of mismatches in database
Smith”). The affine gap distance metric [33] offers a solution
entries is the typographical variations of string data.
to this problem by introducing two extra edit operations:
Therefore, duplicate detection typically relies on string
comparison techniques to deal with typographical varia- open gap and extend gap. The cost of extending the gap is
tions. Multiple methods have been developed for this task, usually smaller than the cost of opening a gap, and this
and each method works well for particular types of errors. results in smaller cost penalties for gap mismatches than the
While errors might appear in numeric fields as well, the equivalent cost under the edit distance metric. The algo-
related research is still in its infancy. rithm for computing the affine gap distance requires Oða 
In this section, we describe techniques that have been j1 j  j2 jÞ time when the maximum length of a gap
applied for matching fields with string data in the duplicate a  minfj1 j; j2 jg. In the general case, the algorithm runs
record detection context. We also briefly review some in approximately Oða2  j1 j  j2 jÞ steps. Bilenko et al. [18],
common approaches for dealing with errors in numeric in a spirit similar to what Ristad and Yiannilos [32]
data. proposed for edit distances, describe how to train an edit
distance model with affine gaps.
3.1 Character-Based Similarity Metrics
The character-based similarity metrics are designed to 3.1.3 Smith-Waterman Distance
handle typographical errors well. In this section, we cover Smith and Waterman [34] described an extension of edit
the following similarity metrics: distance and affine gap distance in which mismatches at the
beginning and the end of strings have lower costs than
. edit distance,
mismatches in the middle. This metric allows for better local
. affine gap distance,
. Smith-Waterman distance, alignment of the strings (i.e., substring matching). There-
. Jaro distance metric, and fore, the strings “Prof. John R. Smith, University of Calgary”
. Q-gram distance. and “John R. Smith, Prof.” can match within a short distance
using the Smith-Waterman distance since the prefixes and
3.1.1 Edit Distance suffixes are ignored. The distance between two strings can
The edit distance between two strings 1 and 2 is the be computed using a dynamic programming technique
minimum number of edit operations of single characters based on the Needleman and Wunsch algorithm [31]. The
needed to transform the string 1 into 2 . There are three Smith and Waterman algorithm requires Oðj1 j  j2 jÞ time
types of edit operations: and space for two strings of length j1 j and j2 j; many
improvements have been proposed (e.g., the BLAST
. insert a character into the string, algorithm [35] in the context of computational biology
. delete a character from the string, and
applications, the algorithms by Baeza-Yates and Gonnet
. replace one character with a different character.
[36], and the agrep tool by Wu and Manber [37]). Pinheiro
In the simplest form, each edit operation has cost 1. This and Sun [38] proposed a similar similarity measure which
version of edit distance is also referred to as the Levenshtein tries to find the best character alignment for the two
distance [28]. The basic dynamic programming algorithm
compared strings 1 and 2 so that the number of character
[29] for computing the edit distance between two strings
mismatches is minimized.
takes Oðj1 j  j2 jÞ time for two strings of length j1 j and j2 j,
respectively. Landau and Vishkin [30] presented an algo- 3.1.4 Jaro Distance Metric
rithm for detecting in Oðmaxfj1 j; j2 jg  kÞ whether two
Jaro [39] introduced a string comparison algorithm that was
strings
 have an edit distance less than k. (Notice that if
j1 j  j2 j > k, then, by definition, the two strings do not mainly used for comparison of last and first names. The basic
algorithm for computing the Jaro metric for two strings 1
match within distance k, so
and 2 includes the following steps:
Oðmaxfj1 j; j2 jg  kÞ  Oðj1 j  kÞ  Oðj2 j  kÞ
1. Compute the string lengths j1 j and j2 j.

Authorized licensed use limited to: IEEE Xplore. Downloaded on May 6, 2009 at 22:31 from IEEE Xplore. Restrictions apply.
4 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 19, NO. 1, JANUARY 2007

2. Find the “common characters” c in the two strings; algorithm, the similarity of two fields is the number of their
common are all the characters 1 ½j and 2 ½j for matching atomic strings divided by their average number of
which 1 ½i ¼ 2 ½j and ji  jj  12 minfj1 j; j2 jg. atomic strings.
3. Find the number of transpositions t; the number of
transpositions is computed as follows: We compare 3.2.2 WHIRL
the ith common character in 1 with the ith common Cohen [48] described a system named WHIRL that adopts
character in 2 . Each nonmatching character is a from information retrieval the cosine similarity combined
transposition. with the tf.idf weighting scheme to compute the similarity of
The Jaro comparison value is: two fields. Cohen separates each string  into words and
  each word w is assigned a weight
1 c c c  t=2
Jaroð1 ; 2 Þ ¼ þ þ : ð1Þ v ðwÞ ¼ logðtfw þ 1Þ  logðidfw Þ;
3 j1 j j2 j c
From the description of the Jaro algorithm, we can see where tfw is the number of times that w appears in the field
that the Jaro algorithm requires Oðj1 j  j2 jÞ time for two and idfw is jDj
nw , where nw is the number of records in the
strings of length j1 j and j2 j, mainly due to Step 2, which database D that contain w. The tf.idf weight for a word w in
computes the “common characters” in the two strings. a field is high if w appears a large number of times in the
Winkler and Thibaudeau [40] modified the Jaro metric to field (large tfw ) and w is a sufficiently “rare” term in the
give higher weight to prefix matches since prefix matches database (large idfw ). For example, for a collection of
are generally more important for surname matching. company names, relatively infrequent terms such as
“AT&T” or “IBM” will have higher idf weights than more
3.1.5 Q-Grams frequent terms such as “Inc.” The cosine similarity of 1 and
The q-grams are short character substrings1 of length q of 2 is defined as
the database strings [41], [42]. The intuition behind the use PjDj
of q-grams as a foundation for approximate string matching j¼1 v1 ðjÞ  v2 ðjÞ
simð1 ; 2 Þ ¼ :
is that, when two strings 1 and 2 are similar, they share a kv1 k2  kv2 k2
large number of q-grams in common. Given a string , its The cosine similarity metric works well for a large
q-grams are obtained by “sliding” a window of length q variety of entries and is insensitive to the location of words,
over the characters of . Since q-grams at the beginning and thus allowing natural word moves and swaps (e.g., “John
the end of the string can have fewer than q characters from Smith” is equivalent to “Smith, John”). Also, the introduc-
, the strings are conceptually extended by “padding” the tion of frequent words only minimally affects the similarity
beginning and the end of the string with q  1 occurrences of the two strings due to the low idf weight of the frequent
of a special padding character, not in the original alphabet. words. For example, “John Smith” and “Mr. John Smith”
With the appropriate use of hash-based indexes, the would have similarity close to one. Unfortunately, this
average time required for computing the q-gram overlap similarity metric does not capture word spelling errors,
between two strings 1 and 2 is Oðmaxfj1 j; j2 jgÞ. Letter especially if they are pervasive and affect many of the
q-grams, including trigrams, bigrams, and/or unigrams, words in the strings. For example, the strings “Compter
have been used in a variety of ways in text recognition and Science Department” and “Deprtment of Computer Scence”
spelling correction [43]. One natural extension of q-grams is will have zero similarity under this metric. Bilenko et al.
the positional q-grams [44], which also record the position of [18] suggest the SoftTF-IDF metric to solve this problem. In
the q-gram in the string. Gravano et al. [45], [46] showed the SoftTF.IDF metric, pairs of tokens that are “similar”2
how to use positional q-grams to efficiently locate similar (and not necessarily identical) are also considered in the
strings within a relational database. computation of the cosine similarity. However, the product
of the weights for nonidentical token pairs is multiplied by
3.2 Token-Based Similarity Metrics the the similarity of the token pair, which is less than one.
Character-based similarity metrics work well for typogra-
3.2.3 Q-Grams with tf.idf
phical errors. However, it is often the case that typogra-
phical conventions lead to rearrangement of words (e.g., Gravano et al. [49] extended the WHIRL system to handle
“John Smith” versus “Smith, John”). In such cases, character- spelling errors by using q-grams, instead of words, as
level metrics fail to capture the similarity of the entities. tokens. In this setting, a spelling error minimally affects the
Token-based metrics try to compensate for this problem. set of common q-grams of two strings, so the two strings
“Gteway Communications” and “Comunications Gateway”
3.2.1 Atomic Strings have high similarity under this metric, despite the block
Monge and Elkan [47] proposed a basic algorithm for move and the spelling errors in both words. This metric
matching text fields based on atomic strings. An atomic string handles the insertion and deletion of words nicely. The
is a sequence of alphanumeric characters delimited by string “Gateway Communications” matches with high
punctuation characters. Two atomic strings match if they similarity the string “Communications Gateway Interna-
are equal or if one is the prefix of the other. Based on this tional” since the q-grams of the word “International” appear
often in the relation and have low weight.
1. The q-grams in our context are defined on the character level. In speech
processing and in computational linguistics, researchers often use the term 2. The token similarity is measured using a metric that works well for
n-gram to refer to sequences of n words. short strings, such as edit distance and Jaro.

Authorized licensed use limited to: IEEE Xplore. Downloaded on May 6, 2009 at 22:31 from IEEE Xplore. Restrictions apply.
ELMAGARMID ET AL.: DUPLICATE RECORD DETECTION: A SURVEY 5

3.3 Phonetic Similarity Metrics NYSIIS, using a name database of New York State, and
Character-level and token-based similarity metrics focus on concluded that NYSIIS is 98.72 percent accurate, while
the string-based representation of the database records. Soundex is 95.99 percent accurate for locating surnames.
However, strings may be phonetically similar even if they The NYSIIS encoding system is still used today by the New
are not similar in a character or token level. For example, York State Division of Criminal Justice Services.
the word Kageonne is phonetically similar to Cajun despite
the fact that the string representations are very different. 3.3.3 Oxford Name Compression Algorithm (ONCA)
The phonetic similarity metrics are trying to address such ONCA [53] is a two-stage technique, designed to overcome
issues and match such strings. most of the unsatisfactory features of pure Soundex-ing,
retaining in parallel the convenient four-character fixed-
3.3.1 Soundex length format. In the first step, ONCA uses a British version
Soundex, invented by Russell [50], [51], is the most common of the NYSIIS method of compression. Then, in the second
phonetic coding scheme. Soundex is based on the assignment step, the transformed and partially compressed name is
of identical code digits to phonetically similar groups of Soundex-ed in the usual way. This two-stage technique has
consonants and is used mainly to match surnames. The been used successfully for grouping similar names together.
rules of Soundex coding are as follows:
3.3.4 Metaphone and Double Metaphone
1. Keep the first letter of the surname as the prefix Philips [54] suggested the Metaphone algorithm as a better
letter and completely ignore all occurrences of W alternative to Soundex. Philips suggested using 16 con-
and H in other positions. sonant sounds that can describe a large number of sounds
2. Assign the following codes to the remaining letters: used in many English and non-English words. Double
Metaphone [55] is a better version of Metaphone, improving
. B; F ; P ; V ! 1,
some encoding choices made in the initial Metaphone and
. C; G; J; K; Q; S; X; Z ! 2,
allowing multiple encodings for names that have various
. D; T ! 3,
possible pronunciations. For such cases, all possible encod-
. L ! 4,
ings are tested when trying to retrieve similar names. The
. M; N ! 5, and
introduction of multiple phonetic encodings greatly en-
. R ! 6.
hances the matching performance, with rather small over-
3. A, E, I, O, U, and Y are not coded but serve as
head. Philips suggested that, at most, 10 percent of
separators (see below).
American surnames have multiple encodings.
4. Consolidate sequences of identical codes by keeping
only the first occurrence of the code. 3.4 Numeric Similarity Metrics
5. Drop the separators. While multiple methods exist for detecting similarities of
6. Keep the letter prefix and the three first codes, string-based data, the methods for capturing similarities in
padding with zeros if there are fewer than three numeric data are rather primitive. Typically, the numbers
codes. are treated as strings (and compared using the metrics
Newcombe [10] reports that the Soundex code remains described above) or simple range queries, which locate
largely unchanged, exposing about two-thirds of the numbers with similar values. Koudas et al. [56] suggest, as a
spelling variations observed in linked pairs of vital records, direction for future research, consideration of the distribu-
and that it sets aside only a small part of the total tion and type of the numeric data, or extending the notion
discriminating power of the full alphabetic surname. The of cosine similarity for numeric data [57] to work well for
code is designed primarily for Caucasian surnames, but duplicate detection purposes.
works well for names of many different origins (such as
those appearing on the records of the US Immigration and 3.5 Concluding Remarks
Naturalization Service). However, when the names are of The large number of field comparison metrics reflects the
predominantly East Asian origin, this code is less satisfac- large number of errors or transformations that may occur in
tory because much of the discriminating power of these real-life data. Unfortunately, there are very few studies that
names resides in the vowel sounds, which the code ignores. compare the effectiveness of the various distance metrics
presented here. Yancey [58] shows that the Jaro-Winkler
3.3.2 New York State Identification and Intelligence metric works well for name matching tasks for data coming
System (NYSIIS) from the US census. A notable comparison effort is the work
The NYSIIS system, proposed by Taft [52], differs from of Bilenko et al. [18], who compare the effectiveness of
Soundex in that it retains information about the position of character-based and token-based similarity metrics. They
vowels in the encoded word by converting most vowels to show that the Monge-Elkan metric has the highest average
the letter A. Furthermore, NYSIIS does not use numbers to performance across data sets and across character-based
replace letters; instead, it replaces consonants with other, distance metrics. They also show that the SoftTF.IDF metric
phonetically similar letters, thus returning a purely alpha works better than any other metric. However, Bilenko et al.
code (no numeric component). Usually, the NYSIIS code for emphasize that no single metric is suitable for all data sets.
a surname is based on a maximum of nine letters of the full Even metrics that demonstrate robust and high perfor-
alphabetical name, and the NYSIIS code itself is then mance for some data sets can perform poorly on others.
limited to six characters. Taft [52] compared Soundex with Hence, they advocate more flexible metrics that can

Authorized licensed use limited to: IEEE Xplore. Downloaded on May 6, 2009 at 22:31 from IEEE Xplore. Restrictions apply.
6 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 19, NO. 1, JANUARY 2007

accommodate multiple similarity comparisons (e.g., [18], of the two classes. Then, if the density function for each
[59]). In the next section, we review such approaches. class is known, the duplicate detection problem becomes a
Bayesian inference problem. In the following sections, we
will discuss various techniques that have been developed
4 DETECTING DUPLICATE RECORDS for addressing this (general) decision problem.
In the previous section, we described methods that can be
used to match individual fields of a record. In most real-life 4.2.1 The Bayes Decision Rule for Minimum Error
situations, however, the records consist of multiple fields, Let x be a comparison vector, randomly drawn from the
making the duplicate detection problem much more comparison space that corresponds to the record pair h; i.
complicated. In this section, we review methods that are The goal is to determine whether h; i 2 M or h; i 2 U. A
used for matching records with multiple fields. The decision rule, based simply on probabilities, can be written
presented methods can be broadly divided into two as follows:
categories: 
M if pðMjxÞ  pðUjxÞ
h; i 2 ð2Þ
. Approaches that rely on training data to “learn” how U otherwise:
to match the records. This category includes (some)
probabilistic approaches and supervised machine This decision rule indicates that, if the probability of the
learning techniques. match class M, given the comparison vector x, is larger than
. Approaches that rely on domain knowledge or on the probability of the nonmatch class U, then x is classified
generic distance metrics to match records. This to M, and vice versa. By using the Bayes theorem, the
category includes approaches that use declarative previous decision rule may be expressed as:
languages for matching and approaches that devise (
pðxjMÞ pðUÞ
distance metrics appropriate for the duplicate M if lðxÞ ¼ pðxjU Þ  pðMÞ
h; i 2 ð3Þ
detection task. U otherwise:
The rest of this section is organized as follows: Initially, The ratio
in Section 4.1, we describe the notation. In Section 4.2, we
present probabilistic approaches for solving the duplicate pðxjMÞ
lðxÞ ¼ ð4Þ
detection problem. In Section 4.3, we list approaches that pðxjUÞ
use supervised machine learning techniques and, in pðUÞ
Section 4.4, we describe variations based on active learning is called the likelihood ratio. The ratio pðMÞ denotes the
methods. Section 4.5 describes distance-based methods and threshold value of the likelihood ratio for the decision. We
Section 4.6 describes declarative techniques for duplicate refer to the decision rule in (3) as the Bayes test for minimum
detection. Finally, Section 4.7 covers unsupervised machine error. It can be easily shown [60] that the Bayes test results in
learning techniques and Section 4.8 provides some con- the smallest probability of error and it is, in that respect, an
cluding remarks. optimal classifier. Of course, this holds only when the
distributions of pðxjMÞ, pðxjUÞ and the priors pðUÞ and
4.1 Notation pðMÞ are known; this, unfortunately, is very rarely the case.
We use A and B to denote the tables that we want to match, One common approach, usually called Naive Bayes, to
and we assume, without loss of generality, that A and B computing the distributions of pðxjMÞ and pðxjUÞ is to
have n comparable fields. In the duplicate detection make a conditional independence assumption and postu-
problem, each tuple pair h; i ð 2 A;  2 BÞ is assigned late that the probabilities pðxi jMÞ and pðxj jMÞ are inde-
to one of the two classes M and U. The class M contains the pendent if i 6¼ j. (Similarly, for pðxi jUÞ and pðxj jUÞ.) In that
record pairs that represent the same entity (“match”) and the case, we have
class U contains the record pairs that represent two Y
n
different entities (“nonmatch”). pðxjMÞ ¼ pðxi jMÞ
We represent each tuple pair h; i as a random vector i¼i
x ¼ ½x1 ; . . . ; xn T with n components that correspond to the Yn

n comparable fields of A and B. Each xi shows the level of pðxjUÞ ¼ pðxi jUÞ:
i¼i
agreement of the ith field for the records  and . Many
approaches use binary values for the xi s and set xi ¼ 1 if The values of pðxi jMÞ and pðxi jUÞ can be computed using a
field i agrees and let xi ¼ 0 if field i disagrees. training set of prelabeled record pairs. However, the
probabilistic model can also be used without using training
4.2 Probabilistic Matching Models data. Jaro [61] used a binary model for the values of xi (i.e.,
Newcombe et al. [8] were the first to recognize duplicate if the field i “matches” xi ¼ 1, else xi ¼ 0) and suggested
detection as a Bayesian inference problem. Then, Fellegi using an expectation maximization (EM) algorithm [62] to
and Sunter [12] formalized the intuition of Newcombe et al. compute the probabilities pðxi ¼ 1jMÞ. The probabilities
and introduced the notation that we use, which is also pðxi ¼ 1jUÞ can be estimated by taking random pairs of
commonly used in duplicate detection literature. The records (which are with high probability in U).
comparison vector x is the input to a decision rule that When the conditional independence is not a reasonable
assigns x to U or to M. The main assumption is that x is a assumption, then Winkler [63] suggested using the general
random vector whose density function is different for each expectation maximization algorithm to estimate pðxjMÞ,

Authorized licensed use limited to: IEEE Xplore. Downloaded on May 6, 2009 at 22:31 from IEEE Xplore. Restrictions apply.
ELMAGARMID ET AL.: DUPLICATE RECORD DETECTION: A SURVEY 7

pðxjUÞ. In [64], Winkler claims that the general, unsuper- M if rM ðxÞ < rU ðxÞ
h; i 2 ð7Þ
vised EM algorithm works well under five conditions: U otherwise:
It can be easily proved [67] that the minimum cost
1. the data contain a relatively large percentage of
decision rule for the problem can be stated as:
matches (more than 5 percent),
2. the matching pairs are “well-separated” from the 
M if lðxÞ > ðcðcUM
MU cUU ÞpðUÞ

other classes, h; i 2 cMM ÞpðMÞ ð8Þ


U otherwise:
3. the rate of typographical errors is low,
4. there are sufficiently many redundant identifiers to Comparing the minimum error and minimum cost
overcome errors in other fields of the record, and decision rule, we notice that the two decision rules
5. the estimates computed under the conditional become the same for the special setting of the cost
independence assumption result in good classifica- functions to cUM  cMM ¼ cMU  cUU . In this case, the cost
tion performance. functions are termed symmetrical. For a symmetrical cost
Winkler [64] shows how to relax the assumptions above function, the cost becomes the probability of error and the
(including the conditional independence assumption) and Bayes test for minimum cost specifically addresses and
still get good matching results. Winkler shows that a semi- minimizes this error.
supervised model, which combines labeled and unlabeled
4.2.3 Decision with a Reject Region
data (similar to Nigam et al. [65]), performs better than
Using the Bayes Decision rule when the distribution
purely unsupervised approaches. When no training data is
parameters are known leads to optimal results. However,
available, unsupervised EM works well, even when a
even in an ideal scenario, when the likelihood ratio lðxÞ is
limited number of interactions is allowed between the close to the threshold, the error (or cost) of any decision is
variables. Interestingly, the results under the independence high [67]. Based on this well-known and general idea in
assumption are not considerably worse compared to the decision theory, Fellegi and Sunter [12], suggested adding
case in which the EM model allows variable interactions. an extra “reject” class in addition to the classes M and U.
Du Bois [66] pointed out the importance of the fact that, The reject class contained record pairs for which it is not
many times, fields have missing (null) values and proposed possible to make any definite inference and a “clerical
a different method to correct mismatches that occur due to review” is necessary. These pairs are examined manually by
missing values. Du Bois suggested using a new comparison experts to decide whether they are true matches or not. By
vector x with dimension 2n instead of the n-dimensional setting thresholds for the conditional error on M and U, we
comparison vector x such that can define the reject region and the reject probability, which
measure the probability of directing a record pair to an
x ¼ ðx1 ; x2 ; . . . ; xn ; x1 y1 ; x2 y2 ; . . . ; xn yn Þ; ð5Þ expert for review.
where Tepping [11] was the first to suggest a solution
 methodology focusing on the costs of the decision. He
1 if the ith field on both records is present; presented a graphical approach for estimating the like-
yi ¼ ð6Þ
0 otherwise: lihood thresholds. Verykios et al. [68] developed a formal
framework for the cost-based approach taken by Tepping
Using this representation, mismatches that occur due to which shows how to compute the thresholds for the three
missing data are typically discounted, resulting in im- decision areas when the costs and the priors P ðMÞ and
proved duplicate detection performance. Du Bois proposed P ðUÞ are known.
using an independence model to learn the distributions of The “reject region” approach can be easily extended to a
pðxi yi jMÞ and pðxi yi jUÞ by using a set of prelabeled training larger number of decision areas [69]. The main problem
record pairs. with such a generalization is appropriately ordering the
thresholds which determine the regions in such a way that
4.2.2 The Bayes Decision Rule for Minimum Cost no region disappears.
Often, in practice, the minimization of the probability of
4.3 Supervised and Semisupervised Learning
error is not the best criterion for creating decision rules as
The probabilistic model uses a Bayesian approach to
the misclassifications of M and U samples may have
classify record pairs into two classes, M and U. This model
different consequences. Therefore, it is appropriate to was widely used for duplicate detection tasks, usually as an
assign a cost cij to each situation, which is the cost of application of the Fellegi-Sunter model. While the Fellegi-
deciding that x belongs to the class i when x actually Sunter approach dominated the field for more than two
belongs to the class j. Then, the expected costs rM ðxÞ and decades, the development of new classification techniques
rU ðxÞ of deciding that x belongs to the class M and U, in the machine learning and statistics communities
respectively, are: prompted the development of new deduplication techni-
ques. The supervised learning systems rely on the existence
rM ðxÞ ¼ cMM  pðMjxÞ þ cMU  pðUjxÞ of training data in the form of record pairs, prelabeled as
rU ðxÞ ¼ cUM  pðMjxÞ þ cUU  pðUjxÞ: matching or not.
One set of supervised learning techniques treat each
In that case, the decision rule for assigning x to M record pair h; i independently, similarly to the probabil-
becomes: istic techniques of Section 4.2. Cochinwala et al. [70] used the

Authorized licensed use limited to: IEEE Xplore. Downloaded on May 6, 2009 at 22:31 from IEEE Xplore. Restrictions apply.
8 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 19, NO. 1, JANUARY 2007

well-known CART algorithm [71], which generates classifi- 4.4 Active-Learning-Based Techniques
cation and regression trees, a linear discriminant algorithm One of the problems with the supervised learning techni-
[60], which generates a linear combination of the parameters ques is the requirement for a large number of training
for separating the data according to their classes, and a examples. While it is easy to create a large number of
“vector quantization” approach, which is a generalization of training pairs that are either clearly nonduplicates or clearly
nearest neighbor algorithms. The experiments which were duplicates, it is very difficult to generate ambiguous cases
conducted indicate that CART has the smallest error that would help create a highly accurate classifier. Based on
percentage. Bilenko et al. [18] use SVMlight [72] to learn this observation, some duplicate detection systems used
how to merge the matching results for the individual fields active learning techniques [79] to automatically locate such
of the records. Bilenko et al. showed that the SVM approach ambiguous pairs. Unlike an “ordinary” learner that is
usually outperforms simpler approaches, such as treating trained using a static training set, an “active” learner
the whole record as one large field. A typical postprocessing actively picks subsets of instances from unlabeled data,
step for these techniques (including the probabilistic which, when labeled, will provide the highest information
techniques of Section 4.2) is to construct a graph for all the gain to the learner.
records in the database, linking together the matching Sarawagi and Bhamidipaty [15] designed ALIAS, a
records. Then, using the transitivity assumption, all the learning-based duplicate detection system, that uses the
records that belong to the same connected component are idea of a “reject region” (see Section 4.2.3) to significantly
considered identical [73]. reduce the size of the training set. The main idea behind
The transitivity assumption can sometimes result in ALIAS is that most duplicate and nonduplicate pairs are
inconsistent decisions. For example, h; i and h; i can be clearly distinct. For such pairs, the system can automatically
considered matches, but h; i not. Partitioning such categorize them in U and M without the need of manual
“inconsistent” graphs with the goal of minimizing incon- labeling. ALIAS requires humans to label pairs only for
sistencies is an NP-complete problem [74]. Bansal et al. [74] cases where the uncertainty is high. This is similar to the
propose a polynomial approximation algorithm that can “reject region” in the Fellegi and Sunter model, which
partition such a graph, automatically identifying the marked ambiguous cases as cases for clerical review.
clusters and the number of clusters in the data set. Cohen ALIAS starts with small subsets of pairs of records
and Richman [75] proposed a supervised approach in designed for training which have been characterized as
which the system learns from training data how to cluster either matched or unique. This initial set of labeled data
together records that refer to the same real-world entry. The forms the training data for a preliminary classifier. In the
main contribution of this approach is the adaptive distance sequel, the initial classifier is used for predicting the status
function, which is learned from a given set of training of unlabeled pairs of records. The initial classifier will make
examples. McCallum and Wellner [76] learn the clustering clear determinations on some unlabeled instances but lack
determination on most. The goal is to seek out from the
method using training data; their technique is equivalent to
unlabeled data pool those instances which, when labeled,
a graph partitioning technique that tries to find the min-cut
will improve the accuracy of the classifier at the fastest
and the appropriate number of clusters for the given data
possible rate. Pairs whose status is difficult to determine
set, similarly to the work of Bansal et al. [74].
serve to strengthen the integrity of the learner. Conversely,
The supervised clustering techniques described above
have records as nodes for the graph. Singla and instances in which the learner can easily predict the status
Domingos [77] observed that, by using attribute values of the pairs do not have much effect on the learner. Using
as nodes, it is possible to propagate information across this technique, ALIAS can quickly learn the peculiarities of
nodes and improve duplicate record detection. For a data set and rapidly detect duplicates using only a small
example, if the records hGoogle; MountainV iew; CAi number of training data.
and hGoogleInc:; MountainV iew; Californiai are deemed Tejada et al. [59], [80] used a similar strategy and
equal, then CA and California are also equal and this employed decision trees to teach rules for matching records
information can be useful for other record comparisons. with multiple fields. Their method suggested that, by
The underlying assumption is that the only differences creating multiple classifiers, trained using slightly different
are due to different representations of the same entity data or parameters, it is possible to detect ambiguous cases
(e.g., “Google” and “Google Inc.”) and that there is no and then ask the user for feedback. The key innovation in
erroneous information in the attribute values (e.g., by this work is the creation of several redundant functions and
mistake someone entering Bismarck; ND as the location the concurrent exploitation of their conflicting actions in
of Google headquarters). Pasula et al. [78] propose a order to discover new kinds of inconsistencies among
semisupervised probabilistic relational model that can duplicates in the data set.
handle a generic set of transformations. While the model
can handle a large number of duplicate detection 4.5 Distance-Based Techniques
problems, the use of exact inference results in a Even active learning techniques require some training data
computationally intractable model. Pasula et al. propose or some human effort to create the matching models. In the
using a Markov Chain Monte Carlo (MCMC) sampling absence of such training data or the ability to get human
algorithm to avoid the intractability issue. However, it is input, supervised and active learning techniques are not
unclear whether techniques that rely on graph-based appropriate. One way of avoiding the need for training data
probabilistic inference can scale well for data sets with is to define a distance metric for records which does not need
hundreds of thousands of records. tuning through training data. Using the distance metric and

Authorized licensed use limited to: IEEE Xplore. Downloaded on May 6, 2009 at 22:31 from IEEE Xplore. Restrictions apply.
ELMAGARMID ET AL.: DUPLICATE RECORD DETECTION: A SURVEY 9

an appropriate matching threshold, it is possible to match duplicate detection in databases that use multiple tables
similar records without the need for training. to store the entries of a record. This approach is
One approach is to treat a record as a long field and use conceptually similar to the work of Perkowitz et al. [26]
one of the distance metrics described in Section 3 to and of Dasu et al. [27], which examine the contents of
determine which records are similar. Monge and Elkan fields to locate the matching fields across two tables (see
[47], [73] proposed a string matching algorithm for Section 2).
detecting highly similar database records. The basic idea Finally, one of the problems of the distance-based
was to apply a general purpose field matching algorithm, techniques is the need to define the appropriate value for
especially one that is able to account for gaps in the strings, the matching threshold. In the presence of training data, it is
to play the role of the duplicate detection algorithm. possible to find the appropriate threshold value. However,
Similarly, Cohen [81] suggested using the tf.idf weighting this would nullify the major advantage of distance-based
scheme (see Section 3.2), together with the cosine similarity techniques, which is the ability to operate without training
metric to measure the similarity of records. Koudas et al. data. Recently, Chaudhuri et al. [86] proposed a new
[56] presented some practical solutions to problems en- framework for distance-based duplicate detection, obser-
countered during the deployment of such a string-based ving that the distance thresholds for detecting real duplicate
duplicate detection system at AT&T. entries are different from each database tuple. To detect the
Distance-based approaches that conflate each record in appropriate threshold, Chaudhuri et al. observed that
one big field may ignore important information that can be entries that correspond to the same real-world object but
used for duplicate detection. A simple approach is to have different representation in the database tend 1) to
measure the distance between individual fields, using the have small distances from each other (compact set prop-
appropriate distance metric for each field, and then erty), and 2) to have only a small number of other neighbors
compute the weighted distance [82] between the records. within a small distance (sparse neighborhood property).
In this case, the problem is the computation of the weights Furthermore, Chaudhuri et al. propose an efficient algo-
and the overall setting becomes very similar to the rithm for computing the required threshold for each object
probabilistic setting that we discussed in Section 4.2. An in the database and show that the quality of the results
alternative approach, proposed by Guha et al. [83] is to outperforms approaches that rely on a single, global
create a distance metric that is based on ranked list threshold.
merging. The basic idea is that if we compare only one
4.6 Rule-Based Approaches
field from the record, the matching algorithm can easily
find the best matches and rank them according to their A special case of distance-based approaches is the use of
rules to define whether two records are the same or not.
similarity, putting the best matches first. By applying the
Rule-based approaches can be considered as distance-based
same principle for all the fields, we can get, for each record,
techniques, where the distance of two records is either 0 or
n ranked lists of records, one for each field. Then, the goal is
1. Wang and Madnick [16] proposed a rule-based approach
to create a rank of records that has the minimum aggregate
for the duplicate detection problem. For cases in which
rank distance when compared to all the n lists. Guha et al.
there is no global key, Wang and Madnick suggest the use
map the problem into the minimum cost perfect matching
of rules developed by experts to derive a set of attributes
problem and develop then efficient solutions for identifying
that collectively serve as a “key” for each record. For
the top-k matching records. The first solution is based on
example, an expert might define rules such as
the Hungarian Algorithm [84], a graph-theoretic algorithm
that solves the minimum cost perfect matching problem.
Guha et al. also present the Successive Shortest Paths
algorithm that works well for smaller values of k and is
based on the idea that it is not required to examine all
potential matches to identify the top-k matches. Both of the
proposed algorithms are implemented in T-SQL and are
directly deployable over existing relational databases. By using such rules, Wang and Madnick hoped to
The distance-based techniques described so far treat generate unique keys that can cluster multiple records that
each record as a flat entity, ignoring the fact that data is represent the same real-world entity. Lim et al. [87] also
often stored in relational databases, in multiple tables. used a rule-based approach, but with the extra restriction
Ananthakrishna et al. [85] describe a similarity metric that that the result of the rules must always be correct.
uses not only the textual similarity, but the “co- Therefore, the rules should not be heuristically defined
occurrence” similarity of two entries in a database. For but should reflect absolute truths and serve as functional
example, the entries in the state column “CA” and dependencies.
“California” have small textual similarity; however, the Hernández and Stolfo [14] further developed this idea
city entries “San Francisco,” “Los Angeles,” “San Diego,” and derived an equational theory that dictates the logic of
and so on, often have foreign keys that point both to domain equivalence. This equational theory specifies an
“CA” and “California.” Therefore, it is possible to infer inference about the similarity of the records. For example, if
that “CA” and “California” are equivalent. Ananthakrish- two people have similar name spellings and these people
na et al. show that, by using “foreign key co-occurrence” have the same address, we may infer that they are the same
information, they can substantially improve the quality of person. Specifying such an inference in the equational

Authorized licensed use limited to: IEEE Xplore. Downloaded on May 6, 2009 at 22:31 from IEEE Xplore. Restrictions apply.
10 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 19, NO. 1, JANUARY 2007

Note that "similar to" is measured by one of the string comparison techniques (Section 3), and "matches" means to declare that those two records are matched and therefore represent the same person.

AJAX [88] is a prototype system that provides a declarative language for specifying data cleaning programs, consisting of SQL statements enhanced with a set of primitive operations to express various cleaning transformations. AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations starting from some input source data. Four types of data transformations are provided to the user of the system. The mapping transformation standardizes data, the matching transformation finds pairs of records that probably refer to the same real object, the clustering transformation groups together matching pairs with a high similarity value, and, finally, the merging transformation collapses each individual cluster into a tuple of the resulting data source.

It is noteworthy that such rule-based approaches, which require a human expert to devise meticulously crafted matching rules, typically result in systems with high accuracy. However, this tuning requires extremely high manual effort from the human experts, which makes the deployment of such systems difficult in practice. Currently, the typical approach is to use a system that generates matching rules from training data (see Sections 4.3 and 4.4) and then manually tune the automatically generated rules.

4.7 Unsupervised Learning
As we mentioned earlier, the comparison space consists of comparison vectors which contain information about the differences between fields in a pair of records. Unless some information exists about which comparison vectors correspond to which category (match, nonmatch, or possible-match), the labeling of the comparison vectors in the training data set should be done manually. One way to avoid manual labeling of the comparison vectors is to use clustering algorithms and group together similar comparison vectors. The idea behind most unsupervised learning approaches for duplicate detection is that similar comparison vectors correspond to the same class.

The idea of unsupervised learning for duplicate detection has its roots in the probabilistic model proposed by Fellegi and Sunter (see Section 4.2). As we discussed in Section 4.2, when there is no training data to compute the probability estimates, it is possible to use variations of the Expectation Maximization algorithm to identify appropriate clusters in the data.

Verykios et al. [89] propose the use of a bootstrapping technique based on clustering to learn matching models. The basic idea, also known as co-training [90], is to use very few labeled data and then use unsupervised learning techniques to appropriately label the data with unknown labels. Initially, Verykios et al. treat each entry of the comparison vector (which corresponds to the result of a field comparison) as a continuous, real variable. Then, they partition the comparison space into clusters by using the AutoClass [91] clustering tool. The basic premise is that each cluster contains comparison vectors with similar characteristics. Therefore, all the record pairs in the cluster belong to the same class (matches, nonmatches, or possible-matches). Thus, by knowing the real class of only a few vectors in each cluster, it is possible to infer the class of all vectors in the cluster and, therefore, mark the corresponding record pairs as matches or not. Elfeky et al. [92] implemented this idea in TAILOR, a toolbox for detecting duplicate entries in data sets. Verykios et al. show that the classifiers generated using the new, larger training set have high accuracy, and require only a minimal number of prelabeled record pairs.
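As a rough, hedged illustration of this bootstrapping idea (AutoClass itself is not shown; a generic k-means clustering and fabricated comparison vectors stand in for it), the cluster-and-propagate step might look like this:

```python
# Sketch: cluster comparison vectors, then label every vector in a cluster with the
# class of the few hand-labeled vectors it contains. Data and seeds are made up.
import numpy as np
from sklearn.cluster import KMeans
from collections import Counter

# Each row is a comparison vector: per-field similarity scores in [0, 1].
X = np.array([
    [0.95, 0.90, 1.00], [0.92, 0.88, 0.97],   # look like matches
    [0.10, 0.05, 0.00], [0.15, 0.12, 0.05],   # look like nonmatches
    [0.55, 0.40, 0.60], [0.50, 0.45, 0.58],   # ambiguous
])
seed_labels = {0: "match", 2: "nonmatch", 4: "possible-match"}  # a few known pairs

cluster_of = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Propagate: every vector inherits the majority seed label of its cluster.
labels = {}
for c in set(cluster_of):
    seeds = [lab for i, lab in seed_labels.items() if cluster_of[i] == c]
    majority = Counter(seeds).most_common(1)[0][0] if seeds else "possible-match"
    for i in np.where(cluster_of == c)[0]:
        labels[int(i)] = majority

print(labels)
```

The propagated labels can then serve as the larger training set from which the final classifier is built.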
Ravikumar and Cohen [93] follow a similar approach and propose a hierarchical, graphical model for learning to match record pairs. The foundation of this approach is to model each field of the comparison vector as a latent binary variable which shows whether the two fields match or not. The latent variable then defines two probability distributions for the values of the corresponding "observed" comparison variable. Ravikumar and Cohen show that it is easier to learn the parameters of a hierarchical model than to attempt to directly model the distributions of the real-valued comparison vectors. Bhattacharya and Getoor [94] propose using the Latent Dirichlet Allocation generative model to perform duplicate detection. In this model, the latent variable is a unique identifier for each entity in the database.

4.8 Concluding Remarks
There are multiple techniques for duplicate record detection. We can divide the techniques into two broad categories: ad hoc techniques that work quickly on existing relational databases and more "principled" techniques that are based on probabilistic inference models. While probabilistic methods outperform ad hoc techniques in terms of accuracy, the ad hoc techniques work much faster and can scale to databases with hundreds of thousands of records. Probabilistic inference techniques are practical today only for data sets that are one or two orders of magnitude smaller than the data sets handled by ad hoc techniques. A promising direction for future research is to devise techniques that can substantially improve the efficiency of approaches that rely on machine learning and probabilistic inference.

A question that is unlikely to be resolved soon is which of the presented methods should be used for a given duplicate detection task. Unfortunately, there is no clear answer to this question. The duplicate record detection task is highly data-dependent and it is unclear if we will ever see a technique dominating all others across all data sets. The problem of choosing the best method for duplicate detection is very similar to the problem of model selection and performance prediction for data mining: We expect that progress on that front will also benefit the task of selecting the best method for duplicate detection.


5 IMPROVING THE EFFICIENCY OF DUPLICATE DETECTION
So far, in our discussion of methods for detecting whether two records refer to the same real-world object, we have focused mainly on the quality of the comparison techniques and not on the efficiency of the duplicate detection process. Now, we turn to the central issue of improving the speed of duplicate detection.

An elementary technique for discovering matching entries in tables A and B is to execute a "nested-loop" comparison, i.e., to compare every record of table A with every record in table B. Unfortunately, such a strategy requires a total of |A| × |B| comparisons, a cost that is prohibitively expensive even for moderately sized tables. In Section 5.1, we describe techniques that substantially reduce the number of required comparisons.

Another factor that can lead to increased computational expense is the cost required for a single comparison. It is not uncommon for a record to contain tens of fields. Therefore, each record comparison requires multiple field comparisons and each field comparison can be expensive. For example, computing the edit distance between two long strings σ1 and σ2 has a cost of O(|σ1| · |σ2|); just checking whether they are within a prespecified edit distance threshold k can reduce the complexity to O(max{|σ1|, |σ2|} · k) (see Section 3.1). We examine some of the methods that can be used to reduce the cost of record comparison in Section 5.2.
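A hedged sketch of the kind of thresholded computation referred to above: a banded dynamic program fills only the cells within k of the diagonal and stops early once no cell in the band can stay within the threshold. This is a generic textbook construction, not code from any of the surveyed systems.

```python
def within_edit_distance(s1: str, s2: str, k: int) -> bool:
    """Return True iff the edit distance between s1 and s2 is at most k, filling
    only a band of width 2k+1 around the diagonal: O(max(|s1|, |s2|) * k) time."""
    if abs(len(s1) - len(s2)) > k:          # cheap length filter
        return False
    INF = k + 1
    prev = list(range(len(s2) + 1))          # row 0 of the usual DP matrix
    for i in range(1, len(s1) + 1):
        lo, hi = max(1, i - k), min(len(s2), i + k)   # only |i - j| <= k matters
        curr = [INF] * (len(s2) + 1)
        curr[0] = i if i <= k else INF
        for j in range(lo, hi + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution / match
        if min(curr[lo - 1:hi + 1]) > k:      # whole band exceeds k: give up early
            return False
        prev = curr
    return prev[len(s2)] <= k

if __name__ == "__main__":
    print(within_edit_distance("44 West Fourth Street", "44 W. 4th St.", 3))  # False (length filter)
    print(within_edit_distance("Microsft", "Microsoft", 1))                   # True
```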
5.1 Reducing the Number of Record Comparisons

5.1.1 Blocking
One "traditional" method for identifying identical records in a database table is to scan the table and compute the value of a hash function for each record. The value of the hash function defines the "bucket" to which this record is assigned. By definition, two records that are identical will be assigned to the same bucket. Therefore, in order to locate duplicates, it is enough to compare only the records that fall into the same bucket for matches. The hashing technique cannot be used directly for approximate duplicates since there is no guarantee that the hash value of two similar records will be the same. However, there is an interesting counterpart of this method, named blocking.

As discussed above with relation to utilizing the hash function, blocking typically refers to the procedure of subdividing files into a set of mutually exclusive subsets (blocks) under the assumption that no matches occur across different blocks. A common approach to achieving these blocks is to use a function such as Soundex, NYSIIS, or Metaphone (see Section 3.3) on highly discriminating fields (e.g., last name) and compare only records that have similar, but not necessarily identical, fields.

Although blocking can substantially increase the speed of the comparison process, it can also lead to an increased number of false mismatches due to the failure of comparing records that do not agree on the blocking field. It can also lead to an increased number of missed matches due to errors in the blocking step that placed entries in the wrong buckets, thereby preventing them from being compared to actual matching entries. One alternative is to execute the duplicate detection algorithm in multiple runs, using a different field for blocking each time. This approach can substantially reduce the probability of false mismatches, with a relatively small increase in the running time.
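A minimal, hedged sketch of blocking as described above: a simplified Soundex code of a highly discriminating field defines the bucket, and only records sharing a bucket are ever compared. The field name and the simplified encoding are illustrative assumptions.

```python
# Blocking sketch: bucket records by a cheap, error-tolerant key (simplified Soundex
# of the last name), then yield only within-bucket pairs for expensive comparison.
from collections import defaultdict
from itertools import combinations

SOUNDEX_CODES = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
                 **dict.fromkeys("DT", "3"), "L": "4",
                 **dict.fromkeys("MN", "5"), "R": "6"}

def soundex(name: str) -> str:
    """Simplified Soundex: first letter plus up to three consonant codes."""
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return "0000"
    digits, prev = [], SOUNDEX_CODES.get(name[0], "")
    for ch in name[1:]:
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code
    return (name[0] + "".join(digits) + "000")[:4]

def blocked_pairs(records, key_field="last_name"):
    """Yield only the record pairs that fall into the same block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[soundex(rec[key_field])].append(rec)
    for bucket in blocks.values():
        yield from combinations(bucket, 2)

if __name__ == "__main__":
    people = [{"last_name": "Robert"}, {"last_name": "Rupert"}, {"last_name": "Ashcraft"}]
    for a, b in blocked_pairs(people):
        print(a, b)   # Robert and Rupert share block R163; Ashcraft is alone
```

A multiple-run variant would simply call blocked_pairs once per blocking field and take the union of the generated candidate pairs.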
5.1.2 Sorted Neighborhood Approach
Hernández and Stolfo [14] describe the so-called sorted neighborhood approach. The method consists of the following three steps:

- Create key: A key for each record in the list is computed by extracting relevant fields or portions of fields.
- Sort data: The records in the database are sorted by using the key found in the first step. A sorting key is defined to be a sequence of attributes, or a sequence of substrings within the attributes, chosen from the record in an ad hoc manner. Attributes that appear first in the key have a higher priority than those that appear subsequently.
- Merge: A fixed size window is moved through the sequential list of records in order to limit the comparisons for matching records to those records in the window. If the size of the window is w records, then every new record that enters that window is compared with the previous w − 1 records to find "matching" records. The first record in the window slides out of it.

The sorted neighborhood approach relies on the assumption that duplicate records will be close in the sorted list, and therefore will be compared during the merge step. The effectiveness of the sorted neighborhood approach is highly dependent upon the comparison key that is selected to sort the records. In general, no single key will be sufficient to sort the records in such a way that all the matching records can be detected. If the error in a record occurs in the particular field or portion of the field that is the most important part of the sorting key, there is a very small possibility that the record will end up close to a matching record after sorting.

To increase the number of similar records merged, Hernández and Stolfo implemented a strategy for executing several independent runs of the sorted neighborhood method (presented above) by using a different sorting key and a relatively small window each time. This strategy is called the multipass approach. This method is similar in spirit to the multiple-run blocking approach described above. Each independent run produces a set of pairs of records that can be merged. The final results, including the transitive closure of the records matched in different passes, are subsequently computed.
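A hedged sketch of the sorted neighborhood method described above; the sorting key, window size, and match test are toy placeholders rather than the choices made by Hernández and Stolfo.

```python
# Sorted neighborhood sketch: build a sorting key, sort, then compare each record
# only with the w - 1 records immediately before it in the sorted order.
def sorting_key(rec):
    # e.g., first three letters of last and first name, plus the 5-digit ZIP code
    return (rec["last_name"][:3].upper(), rec["first_name"][:3].upper(), rec["zipcode"][:5])

def sorted_neighborhood(records, window=4, is_match=None):
    is_match = is_match or (lambda a, b: a["last_name"].upper() == b["last_name"].upper())
    ordered = sorted(records, key=sorting_key)
    matches = []
    for i, rec in enumerate(ordered):
        for j in range(max(0, i - window + 1), i):   # previous w - 1 records only
            if is_match(ordered[j], rec):
                matches.append((ordered[j], rec))
    return matches

if __name__ == "__main__":
    data = [
        {"first_name": "Maria", "last_name": "Garcia", "zipcode": "90210"},
        {"first_name": "Mariah", "last_name": "Garcia", "zipcode": "90210"},
        {"first_name": "John", "last_name": "Smith", "zipcode": "10001"},
    ]
    print(sorted_neighborhood(data, window=3))
```

The multipass approach corresponds to running this function once per sorting key and merging the resulting pair sets, followed by a transitive closure step.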


5.1.3 Clustering and Canopies
Monge and Elkan [73] try to improve the performance of a basic "nested-loop" record comparison by assuming that duplicate detection is transitive. This means that if α is deemed to be a duplicate of β and β is deemed to be a duplicate of γ, then α and γ are also duplicates. Under the assumption of transitivity, the problem of matching records in a database can be described in terms of determining the connected components of an undirected graph. At any time, the connected components of the graph correspond to the transitive closure of the "record matches" relationships discovered so far. Monge and Elkan [73] use a union-find structure to efficiently compute the connected components of the graph. During the Union step, duplicate records are "merged" into a cluster and only a "representative" of the cluster is kept for subsequent comparisons. This reduces the total number of record comparisons without substantially reducing the accuracy of the duplicate detection process. The concept behind this approach is that, if a record α is not similar to a record β already in the cluster, then it will not match the other members of the cluster either.

McCallum et al. [95] propose the use of canopies for speeding up the duplicate detection process. The basic idea is to use a cheap comparison metric to group records into overlapping clusters called canopies. (This is in contrast to blocking that requires hard, nonoverlapping partitions.) After the first step, the records are then compared pairwise, using a more expensive similarity metric that leads to better qualitative results. The assumption behind this method is that there is an inexpensive similarity function that can be used as a "quick-and-dirty" approximation for another, more expensive function. For example, if two strings have a length difference larger than 3, then their edit distance cannot be smaller than 3. In that case, the length comparison serves as a cheap (canopy) function for the more expensive edit distance. Cohen and Richman [75] propose the tf.idf similarity metric as a canopy distance and then use multiple (expensive) similarity metrics to infer whether two records are duplicates. Gravano et al. [45] propose using the string lengths and the number of common q-grams of two strings as canopies (filters according to [45]) for the edit distance metric, which is expensive to compute in a relational database. The advantage of this technique is that the canopy functions can be evaluated efficiently using vanilla SQL statements. In a similar fashion, Chaudhuri et al. [96] propose using an indexable canopy function for easily identifying similar tuples in a database. Baxter et al. [97] perform an experimental comparison of canopy-based approaches with traditional blocking and show that the flexible nature of canopies can significantly improve the quality and speed of duplicate detection.
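A hedged sketch of canopy formation: a cheap token-overlap similarity plays the role of the inexpensive canopy function, and the loose and tight thresholds are illustrative values, not parameters taken from the cited papers. Records inside each canopy would then be compared with the expensive metric.

```python
# Canopy sketch: a cheap similarity groups records into overlapping canopies.
def cheap_sim(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def canopies(strings, loose=0.2, tight=0.6):
    remaining = list(range(len(strings)))
    result = []
    while remaining:
        center = remaining[0]
        canopy = [i for i in remaining if cheap_sim(strings[center], strings[i]) >= loose]
        result.append(canopy)
        # Points tightly similar to the center leave the pool; looser members may
        # still seed or join other canopies (canopies overlap, unlike blocking).
        remaining = [i for i in remaining
                     if i != center and cheap_sim(strings[center], strings[i]) < tight]
    return result

if __name__ == "__main__":
    names = ["John A. Smith", "Jon Smith", "Jonathan Smyth", "Maria Garcia", "M. Garcia"]
    print(canopies(names))  # the Smith variants share a canopy, as do the Garcia variants
```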
5.1.4 Set Joins
Another direction toward efficiently implementing data cleaning operations is to speed up the execution of set operations: A large number of similarity metrics, discussed in Section 3, use set operations as part of the overall computation. Running set operations on all pair combinations is a computationally expensive operation and is typically unnecessary. For data cleaning applications, the interesting pairs are only those in which the similarity value is high. Many techniques use this property and suggest algorithms for fast computation of set-based operations on a set of records.

Cohen [81] proposed using a set of in-memory inverted indexes together with an A* search algorithm to locate the top-k most similar pairs, according to the cosine similarity metric. Soffer et al. [98], mainly in the context of information retrieval, suggest pruning the inverted index, removing terms with low weights since they do not contribute much to the computation of the tf.idf cosine similarity. Gravano et al. [49] present a SQL-based approach that is analogous to the approach of Soffer et al. [98] and allows fast computation of cosine similarity within an RDBMS. Mamoulis [99] presents techniques for efficiently processing a set join in a database, focusing on the containment and non-zero-overlap operators. Mamoulis shows that inverted indexes are typically superior to approaches based on signature files, confirming earlier comparison studies [100]. Sarawagi and Kirpal [101] extend the set joins approach to a large number of similarity predicates that use set joins. The Probe-Cluster approach of Sarawagi and Kirpal works well in environments with limited main memory and can be used to compute efficiently a large number of similarity predicates, in contrast to previous approaches which were tuned for a smaller number of similarity predicates (e.g., set containment or cosine similarity). Furthermore, Probe-Cluster returns exact values for the similarity metrics, in contrast to previous approaches which used approximation techniques.
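A hedged sketch of the set-join idea: an inverted index over tokens retrieves, for each record, only the records that share tokens with it, so candidate pairs with sufficient overlap are found without enumerating all pairs. This is a generic illustration, not the algorithm of any single paper cited above.

```python
# Set-overlap join sketch: token -> record-id inverted index, then per-record
# overlap counting against the retrieved candidates only.
from collections import defaultdict, Counter

def tokens(s: str) -> set:
    return set(s.lower().split())

def set_overlap_join(strings, min_overlap=2):
    index = defaultdict(list)                      # token -> ids of records containing it
    for i, s in enumerate(strings):
        for t in tokens(s):
            index[t].append(i)
    results = set()
    for i, s in enumerate(strings):
        overlap = Counter()
        for t in tokens(s):
            for j in index[t]:
                if j > i:                          # count shared tokens per candidate
                    overlap[j] += 1
        results.update((i, j) for j, c in overlap.items() if c >= min_overlap)
    return results

if __name__ == "__main__":
    rows = ["44 west fourth street new york",
            "44 west 4th street new york",
            "12 main street springfield"]
    print(set_overlap_join(rows))   # {(0, 1)}: the two address variants share many tokens
```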
5.2 Improving the Efficiency of Record Comparison
So far, we have examined techniques that reduce the number of required record comparisons without compromising the quality of the duplicate detection process. Another way of improving the efficiency of duplicate detection is to improve the efficiency of a single record comparison. Next, we review some of these techniques.

When comparing two records, after having computed the differences of only a small portion of the fields of the two records, it may already be obvious that the pair does not match, irrespective of the results of further comparisons. Therefore, it is paramount to determine the outcome of the field comparisons for a pair of records as soon as possible, to avoid wasting additional, valuable time. The field comparisons should be terminated when even complete agreement of all the remaining fields cannot reverse the unfavorable evidence for the matching of the records [13]. To make the early termination work, the global likelihood ratio for the full agreement of each of the identifiers should be calculated. At any given point in the comparison sequence, the maximum collective favorable evidence, which could be accumulated from that point forward, will indicate what improvement in the overall likelihood ratio might conceivably result if the comparisons were continued.

Verykios et al. [89] propose a set of techniques for reducing the complexity of record comparison. The first step is to apply a feature subset selection algorithm for reducing the dimensionality of the input set. By using a feature selection algorithm (e.g., [102]) as a preprocessing step, the record comparison process uses only a small subset of the record fields, which speeds up the comparison process. Additionally, the induced model can be generated in a reduced amount of time and is usually characterized by a higher predictive accuracy. Verykios et al. [89] also suggest using a pruning technique on the derived decision trees that are used to classify record pairs as matches or mismatches. Pruning produces models (trees) of smaller size that not only avoid overfitting and have a higher accuracy, but also allow for faster execution of the matching algorithm.
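A hedged sketch of the early-termination idea described earlier in this subsection: per-field agreement and disagreement weights stand in for the likelihood ratios, and the comparison stops as soon as full agreement on every remaining field could no longer push the score above the match threshold. The weights, fields, and threshold are invented for illustration.

```python
# Early termination sketch: stop comparing fields once a match is impossible.
FIELDS = [  # (field, weight if the field agrees, weight if it disagrees)
    ("ssn", 9.0, -5.0), ("last_name", 4.0, -3.0),
    ("first_name", 2.5, -2.0), ("zipcode", 1.5, -1.0),
]

def compare(r1, r2, threshold=6.0):
    score = 0.0
    max_remaining = sum(agree for _, agree, _ in FIELDS)
    for field, agree_w, disagree_w in FIELDS:
        max_remaining -= agree_w
        score += agree_w if r1.get(field) == r2.get(field) else disagree_w
        if score + max_remaining < threshold:      # cannot become a match anymore
            return "nonmatch (terminated after %s)" % field
    return "match" if score >= threshold else "nonmatch"

if __name__ == "__main__":
    a = {"ssn": "123-45-6789", "last_name": "Smith", "first_name": "John", "zipcode": "10001"}
    b = {"ssn": "987-65-4321", "last_name": "Jones", "first_name": "Mary", "zipcode": "90210"}
    print(compare(a, b))   # stops early: the SSN disagreement alone is decisive
```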


6 DUPLICATE DETECTION TOOLS
Over the past several years, a range of tools for cleaning data has appeared on the market and research groups have made available to the public software packages that can be used for duplicate record detection. In this section, we review such packages, focusing on tools that have an open architecture and allow the users to understand the underlying mechanics of the matching mechanisms.

The Febrl system (Freely Extensible Biomedical Record Linkage, http://sourceforge.net/projects/febrl) is an open-source data cleaning toolkit with two main components: The first component deals with data standardization and the second performs the actual duplicate detection. The data standardization relies mainly on hidden Markov models (HMMs); therefore, Febrl typically requires training to correctly parse the database entries. For duplicate detection, Febrl implements a variety of string similarity metrics, such as Jaro, edit distance, and q-gram distance (see Section 3). Finally, Febrl supports phonetic encoding (Soundex, NYSIIS, and Double Metaphone) to detect similar names. Since phonetic similarity is sensitive to errors in the first letter of a name, Febrl also computes phonetic similarity using the reversed version of the name string, sidestepping the "first-letter" sensitivity problem.

TAILOR [92] is a flexible record matching toolbox which allows the users to apply different duplicate detection methods on the data sets. The flexibility of using multiple models is useful when the users do not know which duplicate detection model will perform most effectively on their particular data. TAILOR follows a layered design, separating comparison functions from the duplicate detection logic. Furthermore, the execution strategies which improve the efficiency are implemented in a separate layer, making the system more extensible than systems that rely on monolithic designs. Finally, TAILOR reports statistics, such as estimated accuracy and completeness, which can help the users better understand the quality of a given duplicate detection execution over a new data set.

WHIRL (http://www.cs.cmu.edu/~wcohen/whirl/) is a duplicate record detection system available for free for academic and research use. WHIRL uses the tf.idf token-based similarity metric to identify similar strings within two lists. The Flamingo Project (http://www.ics.uci.edu/~flamingo/) is a similar tool: it provides a simple string matcher that takes as input two string lists and returns the string pairs that are within a prespecified edit distance threshold. WizSame by WizSoft is also a product that allows the discovery of duplicate records in a database. The matching algorithm is very similar to SoftTF.IDF (see Section 3.2): Two records match if they contain a significant fraction of identical or similar words, where similar words are those within edit distance one.

BigMatch [103] is the duplicate detection program used by the US Census Bureau. It relies on blocking strategies to identify potential matches between the records of two relations and scales well for very large data sets. The only requirement is that one of the two relations should fit in memory, and it is possible to fit in memory even relations with 100 million records. The main goal of BigMatch is not to perform sophisticated duplicate detection, but rather to generate a set of candidate pairs that should then be processed by more sophisticated duplicate detection algorithms.

Finally, we should note that, currently, many database vendors (Oracle, IBM, and Microsoft) do not provide sufficient tools for duplicate record detection. Most of the effort until now has focused on creating easy-to-use ETL tools that can standardize database records and fix minor errors, mainly in the context of address data. Another typical function of the tools that are provided today is the ability to use reference tables and standardize the representation of entities that are well known to have multiple representations. (For example, "TKDE" is also frequently written as "IEEE TKDE" or as "Transactions on Knowledge and Data Engineering.") A recent, positive step is the inclusion of multiple data cleaning operators within Microsoft SQL Server Integration Services, which is part of Microsoft SQL Server 2005. For example, SQL Server now includes the ability to perform "fuzzy matches" and implements "error-tolerable indexes" that allow fast execution of such approximate lookups. The adopted similarity metric is similar to SoftTF.IDF, described in Section 3.2. Ideally, the other major database vendors would also follow suit, add similar capabilities, and extend the current ETL packages.

7 FUTURE DIRECTIONS AND CONCLUSIONS
In this paper, we have presented a comprehensive survey of the existing techniques used for detecting nonidentical duplicate entries in database records. The interested reader may also want to read a complementary survey by Winkler [104] and the special issue of the IEEE Data Engineering Bulletin on data quality [105].

As database systems are becoming more and more commonplace, data cleaning is going to be the cornerstone for correcting errors in systems which are accumulating vast amounts of errors on a daily basis. Despite the breadth and depth of the presented techniques, we believe that there is still room for substantial improvement in the current state-of-the-art.

First of all, it is currently unclear which metrics and techniques represent the state-of-the-art. The lack of standardized, large-scale benchmarking data sets can be a big obstacle for the further development of the field, as it is almost impossible to convincingly compare new techniques with existing ones. A repository of benchmark data sources with known and diverse characteristics should be made available to developers so they may evaluate their methods during the development process. Along with benchmark and evaluation data, various systems need some form of training data to produce the initial matching model. Although small data sets are available, we are not aware of large-scale, validated data sets that could be used as benchmarks. Winkler [106] highlights techniques on how to derive data sets that are properly anonymized and are still useful for duplicate record detection purposes.

Currently, there are two main approaches for duplicate record detection. Research in databases emphasizes relatively simple and fast duplicate detection techniques that can be applied to databases with millions of records. Such techniques typically do not rely on the existence of training data and emphasize efficiency over effectiveness. On the other hand, research in machine learning and statistics aims to develop more sophisticated matching techniques that rely on probabilistic models. An interesting direction for future research is to develop techniques that combine the best of both worlds.


Most of the duplicate detection systems available today offer various algorithmic approaches for speeding up the duplicate detection process. The changing nature of the duplicate detection process also requires adaptive methods that detect different patterns for duplicate detection and automatically adapt themselves over time. For example, a background process could monitor the current data, incoming data, and any data sources that need to be merged or matched, and decide, based on the observed errors, whether a revision of the duplicate detection process is necessary or not. Another related aspect of this challenge is to develop methods that permit the user to derive the proportions of errors expected in data cleaning projects.

Finally, large amounts of structured information are now derived from unstructured text and from the Web. This information is typically imprecise and noisy; duplicate record detection techniques are crucial for improving the quality of the extracted data. The increasing popularity of information extraction techniques is going to make this issue more prevalent in the future, highlighting the need to develop robust and scalable solutions. This only adds to the sentiment that more research is needed in the area of duplicate record detection and in the area of data cleaning and information quality in general.

ACKNOWLEDGMENTS
Ahmed Elmagarmid was partially supported by the US National Science Foundation under Grants ITR-0428168 and IIS-0209120, an RVAC grant from the US Department of Homeland Security, and a generous grant from the Lilly Endowment.

REFERENCES
[1] A. Chatterjee and A. Segev, "Data Manipulation in Heterogeneous Databases," ACM SIGMOD Record, vol. 20, no. 4, pp. 64-68, Dec. 1991.
[2] IEEE Data Eng. Bull., S. Sarawagi, ed., special issue on data cleaning, vol. 23, no. 4, Dec. 2000.
[3] J. Widom, "Research Problems in Data Warehousing," Proc. 1995 ACM Conf. Information and Knowledge Management (CIKM '95), pp. 25-30, 1995.
[4] A.Z. Broder, S.C. Glassman, M.S. Manasse, and G. Zweig, "Syntactic Clustering of the Web," Proc. Sixth Int'l World Wide Web Conf. (WWW6), pp. 1157-1166, 1997.
[5] J. Cho, N. Shivakumar, and H. Garcia-Molina, "Finding Replicated Web Collections," Proc. 2000 ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '00), pp. 355-366, 2000.
[6] R. Mitkov, Anaphora Resolution, first ed. Longman, Aug. 2002.
[7] A. McCallum, "Information Extraction: Distilling Structured Data from Unstructured Text," ACM Queue, vol. 3, no. 9, pp. 48-57, 2005.
[8] H.B. Newcombe, J.M. Kennedy, S. Axford, and A. James, "Automatic Linkage of Vital Records," Science, vol. 130, no. 3381, pp. 954-959, Oct. 1959.
[9] H.B. Newcombe and J.M. Kennedy, "Record Linkage: Making Maximum Use of the Discriminating Power of Identifying Information," Comm. ACM, vol. 5, no. 11, pp. 563-566, Nov. 1962.
[10] H.B. Newcombe, "Record Linking: The Design of Efficient Systems for Linking Records into Individual and Family Histories," Am. J. Human Genetics, vol. 19, no. 3, pp. 335-359, May 1967.
[11] B.J. Tepping, "A Model for Optimum Linkage of Records," J. Am. Statistical Assoc., vol. 63, no. 324, pp. 1321-1332, Dec. 1968.
[12] I.P. Fellegi and A.B. Sunter, "A Theory for Record Linkage," J. Am. Statistical Assoc., vol. 64, no. 328, pp. 1183-1210, Dec. 1969.
[13] H.B. Newcombe, Handbook of Record Linkage. Oxford Univ. Press, 1988.
[14] M.A. Hernández and S.J. Stolfo, "Real-World Data Is Dirty: Data Cleansing and the Merge/Purge Problem," Data Mining and Knowledge Discovery, vol. 2, no. 1, pp. 9-37, Jan. 1998.
[15] S. Sarawagi and A. Bhamidipaty, "Interactive Deduplication Using Active Learning," Proc. Eighth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '02), pp. 269-278, 2002.
[16] Y.R. Wang and S.E. Madnick, "The Inter-Database Instance Identification Problem in Integrating Autonomous Systems," Proc. Fifth IEEE Int'l Conf. Data Eng. (ICDE '89), pp. 46-55, 1989.
[17] W.W. Cohen, H. Kautz, and D. McAllester, "Hardening Soft Information Sources," Proc. Sixth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '00), pp. 255-259, 2000.
[18] M. Bilenko, R.J. Mooney, W.W. Cohen, P. Ravikumar, and S.E. Fienberg, "Adaptive Name Matching in Information Integration," IEEE Intelligent Systems, vol. 18, no. 5, pp. 16-23, Sept./Oct. 2003.
[19] R. Kimball and J. Caserta, The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data. John Wiley & Sons, 2004.
[20] IEEE Data Eng. Bull., E. Rundensteiner, ed., special issue on data transformation, vol. 22, no. 1, Jan. 1999.
[21] A. McCallum, D. Freitag, and F.C.N. Pereira, "Maximum Entropy Markov Models for Information Extraction and Segmentation," Proc. 17th Int'l Conf. Machine Learning (ICML '00), pp. 591-598, 2000.
[22] V.R. Borkar, K. Deshmukh, and S. Sarawagi, "Automatic Segmentation of Text into Structured Records," Proc. 2001 ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '01), pp. 175-186, 2001.
[23] E. Agichtein and V. Ganti, "Mining Reference Tables for Automatic Text Segmentation," Proc. 10th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '04), pp. 20-29, 2004.
[24] C. Sutton, K. Rohanimanesh, and A. McCallum, "Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data," Proc. 21st Int'l Conf. Machine Learning (ICML '04), 2004.
[25] V. Raman and J.M. Hellerstein, "Potter's Wheel: An Interactive Data Cleaning System," Proc. 27th Int'l Conf. Very Large Databases (VLDB '01), pp. 381-390, 2001.
[26] M. Perkowitz, R.B. Doorenbos, O. Etzioni, and D.S. Weld, "Learning to Understand Information on the Internet: An Example-Based Approach," J. Intelligent Information Systems, vol. 8, no. 2, pp. 133-153, Mar. 1997.
[27] T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk, "Mining Database Structure; or, How to Build a Data Quality Browser," Proc. 2002 ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '02), pp. 240-251, 2002.
[28] V.I. Levenshtein, "Binary Codes Capable of Correcting Deletions, Insertions and Reversals," Doklady Akademii Nauk SSSR, vol. 163, no. 4, pp. 845-848, 1965, original in Russian; English translation in Soviet Physics Doklady, vol. 10, no. 8, pp. 707-710, 1966.
[29] G. Navarro, "A Guided Tour to Approximate String Matching," ACM Computing Surveys, vol. 33, no. 1, pp. 31-88, 2001.
[30] G.M. Landau and U. Vishkin, "Fast Parallel and Serial Approximate String Matching," J. Algorithms, vol. 10, no. 2, pp. 157-169, June 1989.
[31] S.B. Needleman and C.D. Wunsch, "A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins," J. Molecular Biology, vol. 48, no. 3, pp. 443-453, Mar. 1970.
[32] E.S. Ristad and P.N. Yianilos, "Learning String Edit Distance," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 5, pp. 522-532, May 1998.
[33] M.S. Waterman, T.F. Smith, and W.A. Beyer, "Some Biological Sequence Metrics," Advances in Math., vol. 20, no. 4, pp. 367-387, 1976.
[34] T.F. Smith and M.S. Waterman, "Identification of Common Molecular Subsequences," J. Molecular Biology, vol. 147, pp. 195-197, 1981.


[35] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman, "Basic Local Alignment Search Tool," J. Molecular Biology, vol. 215, no. 3, pp. 403-410, Oct. 1990.
[36] R. Baeza-Yates and G.H. Gonnet, "A New Approach to Text Searching," Comm. ACM, vol. 35, no. 10, pp. 74-82, Oct. 1992.
[37] S. Wu and U. Manber, "Fast Text Searching Allowing Errors," Comm. ACM, vol. 35, no. 10, pp. 83-91, Oct. 1992.
[38] J.C. Pinheiro and D.X. Sun, "Methods for Linking and Mining Heterogeneous Databases," Proc. Int'l Conf. Knowledge Discovery and Data Mining (KDD '98), pp. 309-313, 1998.
[39] M.A. Jaro, "Unimatch: A Record Linkage System: User's Manual," technical report, US Bureau of the Census, Washington, D.C., 1976.
[40] W.E. Winkler and Y. Thibaudeau, "An Application of the Fellegi-Sunter Model of Record Linkage to the 1990 US Decennial Census," Technical Report Statistical Research Report Series RR91/09, US Bureau of the Census, Washington, D.C., 1991.
[41] J.R. Ullmann, "A Binary n-Gram Technique for Automatic Correction of Substitution, Deletion, Insertion, and Reversal Errors in Words," The Computer J., vol. 20, no. 2, pp. 141-147, 1977.
[42] E. Ukkonen, "Approximate String Matching with q-Grams and Maximal Matches," Theoretical Computer Science, vol. 92, no. 1, pp. 191-211, 1992.
[43] K. Kukich, "Techniques for Automatically Correcting Words in Text," ACM Computing Surveys, vol. 24, no. 4, pp. 377-439, Dec. 1992.
[44] E. Sutinen and J. Tarhio, "On Using q-Gram Locations in Approximate String Matching," Proc. Third Ann. European Symp. Algorithms (ESA '95), pp. 327-340, 1995.
[45] L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava, "Approximate String Joins in a Database (Almost) for Free," Proc. 27th Int'l Conf. Very Large Databases (VLDB '01), pp. 491-500, 2001.
[46] L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan, L. Pietarinen, and D. Srivastava, "Using q-Grams in a DBMS for Approximate String Processing," IEEE Data Eng. Bull., vol. 24, no. 4, pp. 28-34, Dec. 2001.
[47] A.E. Monge and C.P. Elkan, "The Field Matching Problem: Algorithms and Applications," Proc. Second Int'l Conf. Knowledge Discovery and Data Mining (KDD '96), pp. 267-270, 1996.
[48] W.W. Cohen, "Integration of Heterogeneous Databases without Common Domains Using Queries Based on Textual Similarity," Proc. 1998 ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '98), pp. 201-212, 1998.
[49] L. Gravano, P.G. Ipeirotis, N. Koudas, and D. Srivastava, "Text Joins in an RDBMS for Web Data Integration," Proc. 12th Int'l World Wide Web Conf. (WWW12), pp. 90-101, 2003.
[50] R.C. Russell Index, U.S. Patent 1,261,167, http://patft.uspto.gov/netahtml/srchnum.htm, Apr. 1918.
[51] R.C. Russell Index, U.S. Patent 1,435,663, http://patft.uspto.gov/netahtml/srchnum.htm, Nov. 1922.
[52] R.L. Taft, "Name Search Techniques," Technical Report Special Report No. 1, New York State Identification and Intelligence System, Albany, N.Y., Feb. 1970.
[53] L.E. Gill, "OX-LINK: The Oxford Medical Record Linkage System," Proc. Int'l Record Linkage Workshop and Exposition, pp. 15-33, 1997.
[54] L. Philips, "Hanging on the Metaphone," Computer Language Magazine, vol. 7, no. 12, pp. 39-44, Dec. 1990, http://www.cuj.com/documents/s=8038/cuj0006philips/.
[55] L. Philips, "The Double Metaphone Search Algorithm," C/C++ Users J., vol. 18, no. 5, June 2000.
[56] N. Koudas, A. Marathe, and D. Srivastava, "Flexible String Matching against Large Databases in Practice," Proc. 30th Int'l Conf. Very Large Databases (VLDB '04), pp. 1078-1086, 2004.
[57] R. Agrawal and R. Srikant, "Searching with Numbers," Proc. 11th Int'l World Wide Web Conf. (WWW11), pp. 420-431, 2002.
[58] W.E. Yancey, "Evaluating String Comparator Performance for Record Linkage," Technical Report Statistical Research Report Series RRS2005/05, US Bureau of the Census, Washington, D.C., June 2005.
[59] S. Tejada, C.A. Knoblock, and S. Minton, "Learning Domain-Independent String Transformation Weights for High Accuracy Object Identification," Proc. Eighth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '02), 2002.
[60] T. Hastie, R. Tibshirani, and J.H. Friedman, The Elements of Statistical Learning. Springer Verlag, Aug. 2001.
[61] M.A. Jaro, "Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida," J. Am. Statistical Assoc., vol. 84, no. 406, pp. 414-420, June 1989.
[62] A.P. Dempster, N.M. Laird, and D.B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," J. Royal Statistical Soc., vol. B, no. 39, pp. 1-38, 1977.
[63] W.E. Winkler, "Improved Decision Rules in the Fellegi-Sunter Model of Record Linkage," Technical Report Statistical Research Report Series RR93/12, US Bureau of the Census, Washington, D.C., 1993.
[64] W.E. Winkler, "Methods for Record Linkage and Bayesian Networks," Technical Report Statistical Research Report Series RRS2002/05, US Bureau of the Census, Washington, D.C., 2002.
[65] K. Nigam, A. McCallum, S. Thrun, and T.M. Mitchell, "Text Classification from Labeled and Unlabeled Documents Using EM," Machine Learning, vol. 39, nos. 2/3, pp. 103-134, 2000.
[66] N.S.D. Du Bois Jr., "A Solution to the Problem of Linking Multivariate Documents," J. Am. Statistical Assoc., vol. 64, no. 325, pp. 163-174, Mar. 1969.
[67] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis. Wiley, 1973.
[68] V.S. Verykios, G.V. Moustakides, and M.G. Elfeky, "A Bayesian Decision Model for Cost Optimal Record Matching," VLDB J., vol. 12, no. 1, pp. 28-40, May 2003.
[69] V.S. Verykios and G.V. Moustakides, "A Generalized Cost Optimal Decision Model for Record Matching," Proc. 2004 Int'l Workshop Information Quality in Information Systems, pp. 20-26, 2004.
[70] M. Cochinwala, V. Kurien, G. Lalk, and D. Shasha, "Efficient Data Reconciliation," Information Sciences, vol. 137, nos. 1-4, pp. 1-15, Sept. 2001.
[71] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, Classification and Regression Trees. CRC Press, July 1984.
[72] T. Joachims, "Making Large-Scale SVM Learning Practical," Advances in Kernel Methods: Support Vector Learning, B. Schölkopf, C.J.C. Burges, and A.J. Smola, eds., MIT Press, 1999.
[73] A.E. Monge and C.P. Elkan, "An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records," Proc. Second ACM SIGMOD Workshop Research Issues in Data Mining and Knowledge Discovery (DMKD '97), pp. 23-29, 1997.
[74] N. Bansal, A. Blum, and S. Chawla, "Correlation Clustering," Machine Learning, vol. 56, nos. 1-3, pp. 89-113, 2004.
[75] W.W. Cohen and J. Richman, "Learning to Match and Cluster Large High-Dimensional Data Sets for Data Integration," Proc. Eighth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '02), 2002.
[76] A. McCallum and B. Wellner, "Conditional Models of Identity Uncertainty with Application to Noun Coreference," Advances in Neural Information Processing Systems (NIPS '04), 2004.
[77] P. Singla and P. Domingos, "Multi-Relational Record Linkage," Proc. KDD-2004 Workshop Multi-Relational Data Mining, pp. 31-48, 2004.
[78] H. Pasula, B. Marthi, B. Milch, S.J. Russell, and I. Shpitser, "Identity Uncertainty and Citation Matching," Advances in Neural Information Processing Systems (NIPS '02), pp. 1401-1408, 2002.
[79] D.A. Cohn, L. Atlas, and R.E. Ladner, "Improving Generalization with Active Learning," Machine Learning, vol. 15, no. 2, pp. 201-221, 1994.
[80] S. Tejada, C.A. Knoblock, and S. Minton, "Learning Object Identification Rules for Information Integration," Information Systems, vol. 26, no. 8, pp. 607-633, 2001.
[81] W.W. Cohen, "Data Integration Using Similarity Joins and a Word-Based Information Representation Language," ACM Trans. Information Systems, vol. 18, no. 3, pp. 288-321, 2000.
[82] D. Dey, S. Sarkar, and P. De, "Entity Matching in Heterogeneous Databases: A Distance Based Decision Model," Proc. 31st Ann. Hawaii Int'l Conf. System Sciences (HICSS '98), pp. 305-313, 1998.
[83] S. Guha, N. Koudas, A. Marathe, and D. Srivastava, "Merging the Results of Approximate Match Operations," Proc. 30th Int'l Conf. Very Large Databases (VLDB '04), pp. 636-647, 2004.
[84] R.K. Ahuja, T.L. Magnanti, and J.B. Orlin, Network Flows: Theory, Algorithms, and Applications, first ed. Prentice Hall, Feb. 1993.
[85] R. Ananthakrishna, S. Chaudhuri, and V. Ganti, "Eliminating Fuzzy Duplicates in Data Warehouses," Proc. 28th Int'l Conf. Very Large Databases (VLDB '02), 2002.


[86] S. Chaudhuri, V. Ganti, and R. Motwani, "Robust Identification of Fuzzy Duplicates," Proc. 21st IEEE Int'l Conf. Data Eng. (ICDE '05), pp. 865-876, 2005.
[87] E.-P. Lim, J. Srivastava, S. Prabhakar, and J. Richardson, "Entity Identification in Database Integration," Proc. Ninth IEEE Int'l Conf. Data Eng. (ICDE '93), pp. 294-301, 1993.
[88] H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita, "Declarative Data Cleaning: Language, Model, and Algorithms," Proc. 27th Int'l Conf. Very Large Databases (VLDB '01), pp. 371-380, 2001.
[89] V.S. Verykios, A.K. Elmagarmid, and E.N. Houstis, "Automating the Approximate Record Matching Process," Information Sciences, vol. 126, nos. 1-4, pp. 83-98, July 2000.
[90] A. Blum and T. Mitchell, "Combining Labeled and Unlabeled Data with Co-Training," Proc. 11th Ann. Conf. Computational Learning Theory (COLT '98), pp. 92-100, 1998.
[91] P. Cheeseman and J. Stutz, "Bayesian Classification (AutoClass): Theory and Results," Advances in Knowledge Discovery and Data Mining, pp. 153-180, AAAI Press/The MIT Press, 1996.
[92] M.G. Elfeky, A.K. Elmagarmid, and V.S. Verykios, "TAILOR: A Record Linkage Tool Box," Proc. 18th IEEE Int'l Conf. Data Eng. (ICDE '02), pp. 17-28, 2002.
[93] P. Ravikumar and W.W. Cohen, "A Hierarchical Graphical Model for Record Linkage," Proc. 20th Conf. Uncertainty in Artificial Intelligence (UAI '04), 2004.
[94] I. Bhattacharya and L. Getoor, "Latent Dirichlet Allocation Model for Entity Resolution," Technical Report CS-TR-4740, Computer Science Dept., Univ. of Maryland, Aug. 2005.
[95] A. McCallum, K. Nigam, and L.H. Ungar, "Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching," Proc. Sixth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '00), pp. 169-178, 2000.
[96] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani, "Robust and Efficient Fuzzy Match for Online Data Cleaning," Proc. 2003 ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '03), pp. 313-324, 2003.
[97] R. Baxter, P. Christen, and T. Churches, "A Comparison of Fast Blocking Methods for Record Linkage," Proc. ACM SIGKDD '03 Workshop Data Cleaning, Record Linkage, and Object Consolidation, pp. 25-27, 2003.
[98] A. Soffer, D. Carmel, D. Cohen, R. Fagin, E. Farchi, M. Herscovici, and Y.S. Maarek, "Static Index Pruning for Information Retrieval Systems," Proc. 24th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '01), pp. 43-50, 2001.
[99] N. Mamoulis, "Efficient Processing of Joins on Set-Valued Attributes," Proc. 2003 ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '03), pp. 157-168, 2003.
[100] J. Zobel, A. Moffat, and K. Ramamohanarao, "Inverted Files versus Signature Files for Text Indexing," ACM Trans. Database Systems, vol. 23, no. 4, pp. 453-490, Dec. 1998.
[101] S. Sarawagi and A. Kirpal, "Efficient Set Joins on Similarity Predicates," Proc. 2004 ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '04), pp. 743-754, 2004.
[102] D. Koller and M. Sahami, "Hierarchically Classifying Documents Using Very Few Words," Proc. 14th Int'l Conf. Machine Learning (ICML '97), pp. 170-178, 1997.
[103] W.E. Yancey, "BigMatch: A Program for Extracting Probable Matches from a Large File for Record Linkage," Technical Report Statistical Research Report Series RRC2002/01, US Bureau of the Census, Washington, D.C., Mar. 2002.
[104] W.E. Winkler, "Overview of Record Linkage and Current Research Directions," Technical Report Statistical Research Report Series RRS2006/02, US Bureau of the Census, Washington, D.C., 2006.
[105] IEEE Data Eng. Bull., N. Koudas, ed., special issue on data quality, vol. 29, no. 2, June 2006.
[106] W.E. Winkler, "The State of Record Linkage and Current Research Problems," Technical Report Statistical Research Report Series RR99/04, US Bureau of the Census, Washington, D.C., 1999.

Ahmed K. Elmagarmid received the BS degree in computer science from the University of Dayton and the MS and PhD degrees from The Ohio State University in 1977, 1981, and 1985, respectively. He has been with the Department of Computer Science at Purdue University since 1988, where he is now the director of the Cyber Center at Discovery Park. He served as a corporate chief scientist for Hewlett-Packard, on the faculty of the Pennsylvania State University, and as an industry adviser for corporate strategy and product roadmaps. Professor Elmagarmid has been a database consultant for the past 20 years. He received a Presidential Young Investigator award from the US National Science Foundation and the distinguished alumni awards from The Ohio State University and the University of Dayton in 1988, 1993, and 1995, respectively. Professor Elmagarmid has served on several editorial boards and has been active in many of the professional societies. He is a member of the ACM, the AAAS, and a senior member of the IEEE.

Panagiotis G. Ipeirotis received the BSc degree from the Computer Engineering and Informatics Department (CEID) at the University of Patras, Greece, in 1999 and the PhD degree in computer science from Columbia University in 2004. He is an assistant professor in the Department of Information, Operations, and Management Sciences at the Leonard N. Stern School of Business at New York University. His area of expertise is databases and information retrieval, with an emphasis on management of textual data. His research interests include Web searching, text and Web mining, data cleaning, and data integration. He is the recipient of the Microsoft Live Labs Award, the "Best Paper" award for the IEEE ICDE 2005 Conference, and the "Best Paper" award for the ACM SIGMOD 2006 Conference. He is a member of the IEEE Computer Society.

Vassilios S. Verykios received the diploma degree in computer engineering from the University of Patras, Greece, and the MS and PhD degrees from Purdue University in 1992, 1997, and 1999, respectively. In 1999, he joined the faculty of information systems in the College of Information Science and Technology at Drexel University, Pennsylvania, as a tenure track assistant professor. Since 2005, he has been an assistant professor in the Department of Computer and Communication Engineering at the University of Thessaly, in Volos, Greece. He has also served on the faculty of the Athens Information Technology Center, Hellenic Open University, and University of Patras, Greece. His main research interests include knowledge-based systems, privacy and security in advanced database systems, data mining, data reconciliation, parallel computing, and performance evaluation of large-scale parallel systems. Dr. Verykios has published more than 40 papers in major refereed journals and in the proceedings of international conferences and workshops, and he has served on the program committees of several international scientific events. He has consulted for Telcordia Technologies, ChoiceMaker Technologies, Intracom SA, and LogicDIS SA. He has also been a visiting researcher for CERIAS, the Department of Computer Sciences at Purdue University, the US Naval Research Laboratory, and the Research and Academic Computer Technology Institute in Patras, Greece. He is a member of the IEEE Computer Society.
