
Human Language Technology

Conflation Algorithms

October 2009

HLT: Conflation Algorithms

Acknowledgements
John Repici (2002), http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm
Porter, M.F., 1980, An algorithm for suffix stripping, reprinted in Sparck Jones, Karen, and Peter Willett, 1997, Readings in Information Retrieval, San Francisco: Morgan Kaufmann, ISBN 1-55860-454-4. [Vince has a copy of this]
Jurafsky & Martin, Appendix B, pp. 833-836.


Conflation
[Diagram: the word forms COMPUTE, COMPUTER, COMPUTING, COMPUTES, COMPUTABILITY and COMPUTATION all conflate to COMPUT]


Types of Conflation Algorithm


Stemming
Process based - e.g. affix stripping

Lemmatisation
Attempts to map word forms to the same lemma; part-of-speech (POS) dependent

Morphological Analysis
Includes morpho-syntactic information


Word Conflation Algorithms


Morphological analysis versus conflation
Notion of word class used is application dependent
  Genealogy: phonetic similarity
  Information Retrieval: semantic similarity
Based on written language (not phonetic transcription)
Well-known algorithms
  Soundex
  Porter

Soundex: Problems with Names


Names can be misspelt: Rossner
The same name can be spelt in different ways: Kirkop; Chircop
The same name appears differently in different cultures: Tchaikovsky; Chaicowski
To solve this problem, we need phonetically oriented algorithms that can find similar-sounding terms and names. Just such a family of algorithms exists; they are called Soundexes, after the first patented version.

The Soundex Algorithm


A Soundex algorithm takes a word as input and produces a character string which identifies a set of words that are (roughly) phonetically alike.
It is very handy for searching large databases.
Originally developed in 1918 by Margaret K. Odell and Robert C. Russell of the US Bureau of Archives, to simplify census-taking.


Soundex Algorithm 1
The Soundex Algorithm uses the following steps to encode a word:
1. The first character of the word is retained as the first character of the Soundex code.
2. The following letters are discarded: a, e, i, o, u, h, w, and y.
3. Remaining consonants are given a code number.
4. If consonants having the same code number appear consecutively, the number will only be coded once (e.g. "B233" becomes "B23").


Code Numbers
Letters                   Code
b, p, f, v                1
c, s, k, g, j, q, x, z    2
d, t                      3
l                         4
m, n                      5
r                         6

Soundex Algorithm: Example


The Soundex Algorithm uses the following steps to encode a word, here ROSNER:
1. The first character of the word is retained as the first character of the Soundex code. [R]
2. The following letters are discarded: a, e, i, o, u, h, w, and y. [RSNR]
3. Remaining consonants are given a code number. [R256]
4. If consonants having the same code number appear consecutively, the number will only be coded once (e.g. "B233" becomes "B23"). [R256]

Soundex Algorithm 2
The resulting code is modified so that it becomes exactly four characters long:
If it is shorter than 4 characters, zeroes are added to the end (e.g. "B2" becomes "B200").
If it is longer than 4 characters, the code is truncated (e.g. "B2435" becomes "B243").
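The encoding steps and the padding/truncation rule can be sketched in a few lines of Python. This is a minimal sketch that follows the simplified steps on these slides literally; the function name soundex and the table name CODES are illustrative choices, and other published Soundex variants treat some cases differently (for example, letters separated by a vowel, or h and w).

```python
# Letter-to-digit table from the "Code Numbers" slide.
CODES = {}
for letters, digit in [("bpfv", "1"), ("cskgjqxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Encode a word using the simplified steps given on the slides."""
    word = word.lower()
    if not word:
        return ""
    # Step 1: keep the first character.
    first = word[0].upper()
    # Steps 2 and 3: drop a, e, i, o, u, h, w, y (they are not in CODES)
    # and map the remaining consonants to their code numbers.
    digits = [CODES[ch] for ch in word[1:] if ch in CODES]
    # Step 4: consecutive identical code numbers are coded only once.
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    # Pad with zeroes or truncate to exactly four characters.
    return (first + "".join(collapsed) + "000")[:4]


print(soundex("Rosner"))   # R256
print(soundex("Rossner"))  # R256
```

The misspelling Rossner receives the same code as Rosner, which is the behaviour the algorithm is designed for. Because the first letter is always retained, names that differ in their initial letter (such as Kirkop and Chircop) still receive different codes under this scheme.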


Uses for the Soundex Code


Airline reservations - The Soundex code for a passenger's surname is often recorded to avoid confusion when trying to pronounce it.
U.S. Census - As is noted above, the U.S. Census Department was a frequent user of the Soundex algorithm while trying to compile a listing of families around the turn of the century.
Genealogy - In genealogy, the Soundex code is most often used to avoid problems when dealing with names that might have alternate spellings.

Improvements
Preprocessing before applying the basic algorithm, e.g. identification of
DG with G
GH with H
GN with N (not 'ng')
KN with N
PH with F

Question: where to stop?
Question: how to evaluate?
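One way to picture this preprocessing is as a list of letter-pair rewrites applied before the word is encoded. The sketch below is an assumption about how such a step could be written, not part of the original algorithm; the names REWRITES and preprocess are hypothetical, and the rewrites are applied naively, in order, everywhere in the word.

```python
# Letter-pair identifications listed on the slide, as (old, new) pairs.
REWRITES = [("dg", "g"), ("gh", "h"), ("gn", "n"), ("kn", "n"), ("ph", "f")]


def preprocess(word):
    """Apply each rewrite in turn to the lower-cased word."""
    word = word.lower()
    for old, new in REWRITES:
        word = word.replace(old, new)
    return word


print(preprocess("Philip"))  # filip
print(preprocess("Knight"))  # niht
print(preprocess("sing"))    # sing ('ng' is left alone)
```

The output of preprocess would then be passed to the basic encoder. Each extra rewrite brings the spelling closer to pronunciation, which is exactly why the questions of where to stop and how to evaluate arise.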



IR Applications
Information Retrieval: Query → Relevant Documents

Bag of Terms document model
What is a single term?



Why Stemming is Necessary


Frequently we get collections of words of the following kind in the same document:
compute, computer, computing, computation, computability.

The performance of an IR system will be improved if all of these terms are conflated:
Fewer terms to worry about
More accurate statistics
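The effect on a bag-of-terms model can be illustrated with a toy count. In this sketch the dictionary STEMS is a hand-written, hypothetical lookup table standing in for a real stemmer, and bag_of_terms is an illustrative helper name.

```python
from collections import Counter

# Hypothetical lookup table standing in for a real stemmer.
STEMS = {w: "comput" for w in
         ["compute", "computer", "computing", "computation", "computability"]}


def bag_of_terms(tokens, conflate=False):
    """Count terms, optionally mapping each token to its conflated form."""
    if conflate:
        tokens = [STEMS.get(t, t) for t in tokens]
    return Counter(tokens)


doc = "compute computer computing computation computability".split()
print(bag_of_terms(doc))                 # five distinct terms, each seen once
print(bag_of_terms(doc, conflate=True))  # Counter({'comput': 5})
```

Without conflation the document contributes five terms with a count of one each; with conflation it contributes a single term with a count of five, which gives fewer terms and a more informative frequency.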

Issues
Is a dictionary available?
  Stems
  Affixes

Motivation: linguistic credibility or engineering performance?
When to remove an affix versus when to leave it alone
Porter (1980): W1 and W2 should be conflated if there appears to be no difference between the statements "this document is about W1" and "this document is about W2"
relate/relativity vs. radioactive/radioactivity

Consonants and Vowels


A consonant is a letter other than a, e, i, o, u and other than y preceded by a consonant. So the y in sky is not regarded as a consonant, whereas the y in toy is.
If a letter is not a consonant it is a vowel.
A sequence of consonants (cc..c) or vowels (vv..v) will be represented by C or V respectively.
For example, the word troubles maps to C V C V C.
Any word or part of a word therefore has one of the following forms:
(CV)^n C
(CV)^n V
(VC)^n C
(VC)^n V

Measure
All the above patterns can be replaced by the following regular expression:
(C) (VC)^m (V)
where the leading (C) and the trailing (V) are optional.
m is called the measure of any word or word part.
m=0: tr, ee, tree, y, by
m=1: trouble, oats, trees, ivy
m=2: troubles, private
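The consonant/vowel mapping and the measure can be computed directly from these definitions. The following is a minimal sketch following Porter's (1980) definition of a consonant; the function names is_consonant, cv_pattern and measure are illustrative.

```python
VOWELS = "aeiou"


def is_consonant(word, i):
    """True if word[i] is a consonant in Porter's sense."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # y is a consonant unless it is preceded by a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def cv_pattern(word):
    """Collapse runs of consonants and vowels into single C and V symbols."""
    word = word.lower()
    pattern = ""
    for i in range(len(word)):
        symbol = "C" if is_consonant(word, i) else "V"
        if not pattern.endswith(symbol):
            pattern += symbol
    return pattern


def measure(word):
    """m is the number of VC pairs in the collapsed pattern."""
    return cv_pattern(word).count("VC")


print(cv_pattern("troubles"))  # CVCVC
print(measure("tree"), measure("trouble"), measure("troubles"))  # 0 1 2
```

Because the collapsed pattern strictly alternates between C and V, counting the occurrences of "VC" gives exactly the exponent m in (C)(VC)^m(V).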

Rules
Rules for removing a suffix are given in the form
(condition) S1 → S2
i.e. if a word ends with suffix S1, and the stem before S1 satisfies the condition, then S1 is replaced with S2.
Example rule: (m > 1) EMENT → (null)
Example: enlargement → enlarg
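A single rule of this form can be sketched as a function that takes the suffix S1, the replacement S2 and the condition as a predicate on the stem. The names apply_rule and measure are illustrative, and the measure helper is repeated here only so that the example runs on its own.

```python
def is_consonant(word, i):
    """Porter's consonant test: not a, e, i, o, u, and not y after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Number of VC pairs in the collapsed consonant/vowel pattern."""
    pattern = ""
    for i in range(len(stem)):
        symbol = "C" if is_consonant(stem, i) else "V"
        if not pattern.endswith(symbol):
            pattern += symbol
    return pattern.count("VC")


def apply_rule(word, s1, s2, condition):
    """(condition) S1 -> S2: rewrite only if the stem before S1 passes."""
    if word.endswith(s1):
        stem = word[:len(word) - len(s1)]
        if condition(stem):
            return stem + s2
    return word


# (m > 1) EMENT -> (null)
print(apply_rule("enlargement", "ement", "", lambda s: measure(s) > 1))  # enlarg
print(apply_rule("cement", "ement", "", lambda s: measure(s) > 1))       # cement
```

The second call shows why the condition matters: the stem before EMENT in cement is just "c", whose measure is 0, so the suffix is left alone.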

Conditions
*S - stem ends with s
*Z - stem ends with z
*T - stem ends with t
*v* - stem contains a vowel
*d - stem ends with a double consonant
*o - stem ends cvc, where the second c is not w, x or y, e.g. -wil, -hop
In conditions, Boolean operators are possible, e.g. (m>1 and (*S or *T))
The rules are grouped into sets that are applied in 7 steps. Within each step, the rule matching the longest suffix applies.
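Each of these conditions is a simple predicate on the stem, and Boolean combinations are ordinary Boolean expressions. The sketch below uses illustrative function names (contains_vowel, ends_double_consonant, ends_cvc) for *v*, *d and *o; they are not names from Porter's paper.

```python
def is_consonant(word, i):
    """Porter's consonant test: not a, e, i, o, u, and not y after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def contains_vowel(stem):
    """*v*: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem):
    """*d: the stem ends with a double consonant."""
    n = len(stem)
    return n >= 2 and stem[-1] == stem[-2] and is_consonant(stem, n - 1)


def ends_cvc(stem):
    """*o: the stem ends cvc, where the second c is not w, x or y."""
    n = len(stem)
    return (n >= 3
            and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1] not in "wxy")


print(contains_vowel("plaster"), contains_vowel("bl"))  # True False
print(ends_double_consonant("hopp"))                     # True
print(ends_cvc("hop"), ends_cvc("snow"))                 # True False

# A Boolean combination such as (*S or *T) is just an expression:
stem = "adopt"
print(stem.endswith("s") or stem.endswith("t"))          # True
```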

Organisation
Step 1: Plurals and Third Person Singular Verbs (-s)
Step 2: Verbal Past Tense and Progressive (-ed, -ing)
Step 3: Y to I, Noun Inflections (fly/flies)
Steps 4 and 5: Derivational Morphology, Multiple Suffixes (visualisation → visualise)
Step 6: Derivational Morphology, Single Suffixes
Step 7: Cleanup



Step 1: Plural Nouns and 3rd Person Singular Verbs

condition   rewrite       example
            SSES → SS     caresses → caress
            IES → I       ponies → poni
            SS → SS       caress → caress
            S → (null)    cats → cat
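Step 1 shows why the rule matching the longest suffix must win: caress ends in both SS and S, and only the SS rule gives the intended result. A minimal sketch, with STEP1_RULES and step1 as illustrative names:

```python
# (S1, S2) pairs for Step 1; none of these rules carries a condition.
STEP1_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def step1(word):
    """Apply the Step 1 rule whose suffix is the longest match."""
    # Try longer suffixes first so that SS is tested before S.
    for s1, s2 in sorted(STEP1_RULES, key=lambda r: len(r[0]), reverse=True):
        if word.endswith(s1):
            return word[:-len(s1)] + s2
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", step1(w))
# caresses -> caress
# ponies -> poni
# caress -> caress
# cats -> cat
```

The rule SS → SS changes nothing, but its presence blocks the shorter rule S → (null) from turning caress into cares.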


Step 2a Verbal Past Tense and Progressive Forms


condition   rewrite         example
(m>0)       EED → EE        feed → feed; agreed → agree
(*v*)       ED → (null)     plastered → plaster; bled → bled
(*v*)       ING → (null)    killing → kill; sing → sing

Step 2b: Cleanup (applied if the 2nd or 3rd rule of Step 2a succeeds)

condition                        rewrite            example
                                 AT → ATE           generat → generate
                                 BL → BLE           troubl → trouble
                                 IZ → IZE           capsiz → capsize
(*d and not (*L or *S or *Z))    → single letter    hopp → hop; hiss → hiss
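Steps 2a and 2b can be sketched together, since the cleanup only runs when the ED or ING rule has fired. This sketch covers only the rules shown on these two slides and assumes Porter's (1980) condition m > 0 for the EED rule, which is what the agreed → agree example requires; the names step2 and cleanup are illustrative.

```python
def is_consonant(word, i):
    """Porter's consonant test: not a, e, i, o, u, and not y after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Number of VC pairs in the collapsed consonant/vowel pattern."""
    pattern = ""
    for i in range(len(stem)):
        symbol = "C" if is_consonant(stem, i) else "V"
        if not pattern.endswith(symbol):
            pattern += symbol
    return pattern.count("VC")


def contains_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def cleanup(stem):
    """Step 2b: restore a final e, or undouble a final consonant."""
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"
    if (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1)
            and stem[-1] not in "lsz"):
        return stem[:-1]
    return stem


def step2(word):
    """Step 2a, followed by Step 2b when the ED or ING rule succeeds."""
    if word.endswith("eed"):  # longest suffix is tried first
        stem = word[:-3]
        return stem + "ee" if measure(stem) > 0 else word
    for s1 in ("ed", "ing"):
        if word.endswith(s1):
            stem = word[:-len(s1)]
            return cleanup(stem) if contains_vowel(stem) else word
    return word


for w in ["feed", "agreed", "plastered", "bled", "killing", "sing", "hopping"]:
    print(w, "->", step2(w))
```

Running the loop reproduces the examples in the two tables: feed, bled and sing are unchanged, agreed becomes agree, plastered becomes plaster, killing becomes kill (a final ll is not undoubled), and hopping becomes hop.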


Step 3: Y to I
condition   rewrite   example
(*v*)       Y → I     happy → happi; cry → cry


Step 4: Derivational Morphology I: Multiple Suffixes (excerpt)

Condition   Rewrite           Example
(m > 0)     ATIONAL → ATE     relational → relate
(m > 0)     TIONAL → TION     conditional → condition
(m > 0)     ENCI → ENCE       valenci → valence
(m > 0)     ABLI → ABLE       comfortabli → comfortable
(m > 0)     OUSLI → OUS       analogousli → analogous
(m > 0)     IZATION → IZE     vietnamization → vietnamize
(m > 0)     ATION → ATE       generation → generate
(m > 0)     ATOR → ATE        operator → operate
(m > 0)     ALISM → AL        formalism → formal
(m > 0)     IVENESS → IVE     pensiveness → pensive
(m > 0)     FULNESS → FUL     hopefulness → hopeful
(m > 0)     OUSNESS → OUS     callousness → callous
(m > 0)     ALITI → AL        formaliti → formal
(m > 0)     BILITI → BLE      possibiliti → possible


Step 6: Derivational Morphology III: Single Suffixes

Condition                    Rewrite           Example
(m > 1)                      AL → (null)       revival → reviv
(m > 1)                      ANCE → (null)     allowance → allow
(m > 1)                      ENCE → (null)     inference → infer
(m > 1)                      ER → (null)       airliner → airlin
(m > 1)                      IC → (null)       gyroscopic → gyroscop
(m > 1)                      ABLE → (null)     adjustable → adjust
(m > 1)                      ANT → (null)      irritant → irrit
(m > 1)                      EMENT → (null)    replacement → replac
(m > 1)                      MENT → (null)     adjustment → adjust
(m > 1)                      ENT → (null)      dependent → depend
(m > 1 and (*S or *T))       ION → (null)      adoption → adopt
(m > 1)                      OU → (null)       homologou → homolog
(m > 1)                      ISM → (null)      formalism → formal
(m > 1)                      ATE → (null)      activate → activ
(m > 1)                      ITI → (null)      angulariti → angular

Porter Example
INPUT: in the first focus area, integrated projects shall help develop, principally, common open platforms for software and services supporting a distributed information and decision systems for risk and crisis management


Porter Output
Original Word   Stemmed Word
first           first
focus           focu
area            area
integrated      integr
projects        project
help            help
develop         develop
principally     princip
common          common
open            open
platforms       platform
software        softwar
services        servic
supporting      support
distributed     distribut
information     inform
decision        decis
systems         system
risk            risk
crisis          crisi
management      manag


Stemming Errors
Under-stemming
the error of taking off too small a suffix
example: croulons → croulon, since croulons is a form of the verb crouler

Over-stemming
the error of taking off too much
example: crotons → crot, since crotons is the plural of croton

Mis-stemming
taking off what looks like an ending, but is really part of the stem
example: reply → rep

Summary
Conflation serves different purposes.
Generally, the motivation is to achieve an engineering goal rather than linguistic fidelity. This can cause errors in the bag of words model.
Soundex and Porter are very well established and easily available.