
Morphology

Morphology
Morpheme = "minimal meaning-bearing unit in a language" Morphology handles the formation of words by using morphemes Morphological parsing = the task of recognizing the morphemes inside a word Important for many tasks
e.g., hands, foxes, children machine translation information retrieval lexicography any further processing (e.g., part-of-speech tagging)
Slide 1

base form (stem), e.g., believe affixes (suffixes, prefixes, infixes), e.g., un-, -able, -ly

Morphemes and Words


Combine morphemes to create words
Inflection: combination of a word stem with a grammatical morpheme; same word class, e.g. clean (verb), clean-ing (verb)
Derivation: combination of a word stem with a grammatical morpheme; yields a different word class, e.g. clean (verb), clean-ing (noun)
Compounding: combination of multiple word stems
Cliticization: combination of a word stem with a clitic; different words from different syntactic categories, e.g. I've = I + have

Slide 2

Inflectional Morphology
Inflectional Morphology: word stem + grammatical morpheme, e.g. cat + s; only for nouns, verbs, and some adjectives
Nouns
plural: regular: +s, +es; irregular: mouse - mice, ox - oxen; rules for exceptions: e.g. -y -> -ies, like butterfly - butterflies
possessive: +'s, +'
Verbs
main verbs (sleep, eat, walk), modal verbs (can, will, should), primary verbs (be, have, do)

Slide 3

Inflectional Morphology (verbs)


Verb Inflections for main verbs (sleep, eat, walk) and primary verbs (be, have, do):

Morphological form   Regularly inflected forms              Irregularly inflected forms
stem                 walk     merge    try     map          eat      catch     cut
-s form              walks    merges   tries   maps         eats     catches   cuts
-ing participle      walking  merging  trying  mapping      eating   catching  cutting
-ed past             walked   merged   tried   mapped       ate      caught    cut
-ed participle       walked   merged   tried   mapped       eaten    caught    cut

(For regular verbs the -ed past and the -ed participle are identical.)


Slide 4

Inflectional Morphology (nouns)


Noun Inflections for regular nouns (cat, hand) and irregular nouns (child, ox):

Morphological form   Regularly inflected forms   Irregularly inflected forms
stem                 cat     hand                child      ox
plural form          cats    hands               children   oxen

Slide 5

Inflectional and Derivational Morphology (adjectives)


Adjective Inflections and Derivations (for happy):

Affix          Example      Result
prefix un-     unhappy      adjective, negation
suffix -ly     happily      adverb, mode
suffix -er     happier      adjective, comparative
suffix -est    happiest     adjective, superlative
suffix -ness   happiness    noun

plus combinations, like unhappiest, unhappiness. Distinguish different adjective classes, which can or cannot take certain inflectional or derivational forms, e.g. no negation for big.
Slide 6

Derivational Morphology (nouns)

Slide 7

Derivational Morphology (adjectives)

Slide 8

Verb Clitics

Slide 9

Methods, Algorithms

Stemming
Stemming algorithms strip off word affixes yield stem only, no additional information (like plural, 3rd person etc.) used, e.g. in web search engines famous stemming algorithm: the Porter stemmer

Slide 11

Stemming
Reduce tokens to root form of words to recognize morphological variation.
computer, computational, computation all reduced to same token compute

Correct morphological analysis is language specific and can be complex. Stemming blindly strips off known affixes (prefixes and suffixes) in an iterative fashion.

for example compressed and compression are both accepted as equivalent to compress.

for exampl compres and compres are both accept as equival to compres.
Slide 12

Porter Stemmer
Simple procedure for removing known affixes in English without using a dictionary. Can produce unusual stems that are not English words. May conflate (reduce to the same token) words that are actually distinct. Does not recognize all morphological derivations.
Typical rules in the Porter stemmer: sses -> ss, ies -> i, ational -> ate, tional -> tion, ing -> (removed)

computer, computational, computation all reduced to same token comput
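A minimal sketch of running a Porter stemmer in practice, assuming NLTK is installed (not part of the original slides; output shown in comments is the typical result):

```python
# Porter stemming with NLTK (assumes: pip install nltk).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["computer", "computational", "computation", "compressed", "compression"]:
    # All three "comput*" words typically reduce to the same non-word stem "comput".
    print(word, "->", stemmer.stem(word))
```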

Slide 13

Stemming Problems

Errors of commission (distinct words conflated or mangled):
organization -> organ, doing -> doe, generalization -> generic, numerical -> numerous, policy -> police

Errors of omission (related words not conflated):
European / Europe, analysis / analyzes, matrices / matrix, noise / noisy, sparse / sparsity

Slide 14

Tokenization, Word Segmentation


Tokenization or word segmentation separate out words (lexical entries) from running text expand abbreviated terms
E.g. expand I'm into I am, it's into it is

collect tokens forming single lexical entry


E.g. New York marked as one single entry

More of an issue in languages like Chinese

Slide 15

Simple Tokenization
Analyze text into a sequence of discrete tokens (words). Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token.
However, frequently they are not.

Simplest approach is to ignore all numbers and punctuation and use only case-insensitive unbroken strings of alphabetic characters as tokens. More careful approach:
Separate out ? ! ; : [ ] ( ) < >. Take care with . and - (abbreviations, decimals, hyphenation): why? when? Take care with apostrophes and quotes.
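A minimal sketch of the "simplest approach" described above (an assumption for illustration, not the slides' own code): keep only case-insensitive, unbroken alphabetic strings, dropping numbers and punctuation. The output also shows why periods and hyphens need more care.

```python
import re

# Crude tokenizer: lowercase alphabetic runs only; numbers and punctuation dropped.
def simple_tokens(text):
    return re.findall(r"[a-z]+", text.lower())

print(simple_tokens("The U.S.A. poster-print costs $12.40..."))
# ['the', 'u', 's', 'a', 'poster', 'print', 'costs']  -- note what happens to U.S.A. and poster-print
```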
Slide 16

Punctuation
Children's: use language-specific mappings to normalize (e.g. Anglo-Saxon genitive of nouns, verb contractions: won't -> wo n't). State-of-the-art: break up hyphenated sequences. Care with periods: U.S.A. vs. USA, a.out

Slide 17

Numbers
3/12/91 Mar. 12, 1991 55 B.C. B-52 100.2.86.144
Generally, don't index numbers as text. Exception: creation dates for docs.

Slide 18

Lemmatization
Reduce inflectional/derivational forms to the base form. Direct impact on vocabulary size. E.g.,
am, are, is -> be; car, cars, car's, cars' -> car

the boy's cars are different colors -> the boy car be different color

How to do this?
Need a list of grammatical rules + a list of irregular words: children -> child, spoken -> speak. Practical implementation: use WordNet's morphstr function
Perl: WordNet::QueryData (first returned value from validForms function)
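A minimal Python sketch of the same idea using NLTK's WordNet-based lemmatizer (an assumption for illustration; it requires the WordNet data to be downloaded and is not the Perl route named above):

```python
# WordNet lemmatization with NLTK (assumes nltk and the wordnet corpus are installed).
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
print(wnl.lemmatize("cars"))             # car
print(wnl.lemmatize("children"))         # child
print(wnl.lemmatize("are", pos="v"))     # be
print(wnl.lemmatize("spoken", pos="v"))  # speak
```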
Slide 19

Morphological Processing
Knowledge
lexical entry: stem plus possible prefixes, suffixes plus word classes, e.g. endings for verb forms (see tables above) rules: how to combine stem and affixes, e.g. add s to form plural of noun as in dogs orthographic rules: spelling, e.g. double consonant as in mapping

Processing: Finite State Transducers


take information above and analyze word token / generate word form

Slide 20

Fig. 3.3 FSA for verb inflection.

Slide 21

Fig. 3.4 Simple FSA for adjective inflection.

Fig. 3.5 More detailed FSA for adjective inflection.


Slide 22

Fig. 3.7 Compiled FSA for noun inflection.


Slide 23

LINGUIST 180: Introduction to Computational Linguistics


Dan Jurafsky Lecture 5: Intro to Probability, Language Modeling

IP notice: some slides for today from: Jim Martin, Sandiway Fong, Dan Klein
LING 180 Autumn 2007

Outline
Probability
Basic probability Conditional probability Bayes Rule

Language Modeling (N-grams)


N-gram Intro, The Chain Rule, The Shannon Visualization Method, Evaluation: Perplexity, Smoothing: Add-1
Advanced stuff I won't discuss:

Discounting: Good-Turing and Katz backoff Interpolation Unknown words Advanced LM algorithms
LING 180 Autumn 2007
2

1. Introduction to Probability
Experiment (trial): repeatable procedure with well-defined possible outcomes
Sample Space (S): the set of all possible outcomes; finite or infinite
Example: coin toss experiment, possible outcomes: S = {heads, tails}
Example: die toss experiment, possible outcomes: S = {1,2,3,4,5,6}


LING 180 Autumn 2007

Slides from Sandiway Fong

Introduction to Probability
Definition of sample space depends on what we are asking
Sample Space (S): the set of all possible outcomes Example
die toss experiment for whether the number is even or odd possible outcomes: {even,odd} not {1,2,3,4,5,6}


LING 180 Autumn 2007

More definitions
Events: an event is any subset of outcomes from the sample space
Example: die toss experiment; let A represent the event that the outcome of the die toss is divisible by 3: A = {3,6}; A is a subset of the sample space S = {1,2,3,4,5,6}
Example: draw a card from a deck; suppose the sample space is S = {heart,spade,club,diamond} (four suits); let A represent the event of drawing a heart and B the event of drawing a red card: A = {heart}, B = {heart,diamond}

LING 180 Autumn 2007

Introduction to Probability
Some definitions
Counting: suppose operation o_i can be performed in n_i ways; then a sequence of k operations o_1 o_2 ... o_k can be performed in n_1 x n_2 x ... x n_k ways
Example: die toss experiment, 6 possible outcomes; two dice are thrown at the same time; number of sample points in the sample space = 6 x 6 = 36

LING 180 Autumn 2007

Definition of Probability
The probability law assigns to an event A a nonnegative number, called P(A), also called the probability of A, that encodes our knowledge or belief about the collective likelihood of all the elements of A. The probability law must satisfy certain properties.

LING 180 Autumn 2007

Probability Axioms
Nonnegativity
P(A) >= 0, for every event A

Additivity
If A and B are two disjoint events, then the probability of their union satisfies: P(A U B) = P(A) + P(B)

Normalization
The probability of the entire sample space S is equal to 1, I.e. P(S) = 1.

LING 180 Autumn 2007

An example
An experiment involving a single coin toss There are two possible outcomes, H and T Sample space S is {H,T} If coin is fair, should assign equal probabilities to 2 outcomes Since they have to sum to 1 P({H}) = 0.5 P({T}) = 0.5 P({H,T}) = P({H})+P({T}) = 1.0

LING 180 Autumn 2007

Another example
Experiment involving 3 coin tosses. Outcome is a 3-long string of H or T: S = {HHH,HHT,HTH,HTT,THH,THT,TTH,TTT}. Assume each outcome is equiprobable. What is the probability of the event that exactly 2 heads occur? A = {HHT,HTH,THH}; P(A) = P({HHT})+P({HTH})+P({THH}) = 1/8 + 1/8 + 1/8 = 3/8
Uniform distribution

LING 180 Autumn 2007

10

Probability definitions
In summary:

Probability of drawing a spade from 52 well-shuffled playing cards: 13/52 = 1/4 = .25

LING 180 Autumn 2007

11

Probabilities of two events


If two events A and B are independent Then
P(A and B) = P(A) x P(B)

If flip a fair coin twice


What is the probability that they are both heads?

If draw a card from a deck, then put it back, draw a card from the deck again
What is the probability that both drawn cards are hearts?

LING 180 Autumn 2007

12

How about non-uniform probabilities? An example


A biased coin,
twice as likely to come up tails as heads, is tossed twice

What is the probability that at least one head occurs? Sample space = {hh, ht, th, tt} (h = heads, t = tails) Sample points/probability for the event:
ht 1/3 x 2/3 = 2/9 th 2/3 x 1/3 = 2/9 hh 1/3 x 1/3= 1/9 tt 2/3 x 2/3 = 4/9

Answer: 5/9 = 0.56 (the sum of the weights for ht, th and hh)
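A tiny sketch that enumerates the same sample space and checks the answer (illustrative only, not from the slides):

```python
from itertools import product

# Two tosses of a biased coin: P(heads) = 1/3, P(tails) = 2/3
# (tails is twice as likely as heads, as in the example above).
p = {"h": 1 / 3, "t": 2 / 3}
prob_at_least_one_head = sum(
    p[a] * p[b] for a, b in product("ht", repeat=2) if "h" in (a, b)
)
print(prob_at_least_one_head)  # 5/9 ≈ 0.556
```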

LING 180 Autumn 2007

13

Moving toward language


What's the probability of drawing a 2 from a deck of 52 cards with four 2s?

P(drawing a two) = 4/52 = 1/13 ≈ .077

What's the probability of a random word (from a random dictionary page) being a verb?

P(drawing a verb) = (# of ways to get a verb) / (all words)
LING 180 Autumn 2007
14

Probability and part of speech tags


What's the probability of a random word (from a random dictionary page) being a verb?

P(drawing a verb) = (# of ways to get a verb) / (all words)

How to compute each of these: all words = just count all the words in the dictionary; # of ways to get a verb = number of words which are verbs. If a dictionary has 50,000 entries, and 10,000 are verbs, P(V) is 10000/50000 = 1/5 = .20

LING 180 Autumn 2007

15

Conditional Probability
A way to reason about the outcome of an experiment based on partial information
In a word guessing game the first letter for the word is a t. What is the likelihood that the second letter is an h? How likely is it that a person has a disease given that a medical test was negative? A spot shows up on a radar screen. How likely is it that it corresponds to an aircraft?

LING 180 Autumn 2007

16

More precisely
Given an experiment, a corresponding sample space S, and a probability law Suppose we know that the outcome is within some given event B We want to quantify the likelihood that the outcome also belongs to some other given event A. We need a new probability law that gives us the conditional probability of A given B P(A|B)

LING 180 Autumn 2007

17

An intuition
A is "it's raining now". P(A) in dry California is .01. B is "it was raining ten minutes ago". P(A|B) means "what is the probability of it raining now if it was raining 10 minutes ago". P(A|B) is probably way higher than P(A); perhaps P(A|B) is .10. Intuition: the knowledge about B should change our estimate of the probability of A.

LING 180 Autumn 2007

18

Conditional probability
One of the following 30 items is chosen at random What is P(X), the probability that it is an X? What is P(X|red), the probability that it is an X given that it is red?

LING 180 Autumn 2007

19

Conditional Probability
Let A and B be events. P(B|A) = the probability of event B occurring given that event A occurs. Definition: P(B|A) = P(A ∩ B) / P(A)


LING 180 Autumn 2007

20

Conditional probability
P(A|B) = P(A ∩ B) / P(B), or equivalently

P(A|B) = P(A,B) / P(B)

Note: P(A,B) = P(A|B) P(B). Also: P(A,B) = P(B,A).

LING 180 Autumn 2007

21

Independence

What is P(A,B) if A and B are independent? P(A,B) = P(A) x P(B) iff A,B independent.
P(heads,tails) = P(heads) x P(tails) = .5 x .5 = .25
Note: P(A|B) = P(A) iff A,B independent. Also: P(B|A) = P(B) iff A,B independent.

LING 180 Autumn 2007

22

Bayes Theorem

P(B|A) = P(A|B) P(B) / P(A)
Swap the conditioning Sometimes easier to estimate one kind of dependence than the other
LING 180 Autumn 2007
23

Deriving Bayes Rule

P(A|B) = P(A ∩ B) / P(B)        P(B|A) = P(A ∩ B) / P(A)

P(A|B) P(B) = P(A ∩ B)          P(B|A) P(A) = P(A ∩ B)

P(A|B) P(B) = P(B|A) P(A)

P(A|B) = P(B|A) P(A) / P(B)
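A quick numeric check of the derivation on a small, purely illustrative joint distribution (not from the slides):

```python
# p_joint[(a, b)] = P(A=a, B=b); the values here are made up for illustration.
p_joint = {(True, True): 0.08, (True, False): 0.02,
           (False, True): 0.12, (False, False): 0.78}

p_A = sum(v for (a, b), v in p_joint.items() if a)        # P(A) = 0.10
p_B = sum(v for (a, b), v in p_joint.items() if b)        # P(B) = 0.20
p_A_given_B = p_joint[(True, True)] / p_B
p_B_given_A = p_joint[(True, True)] / p_A

# Bayes: P(A|B) should equal P(B|A) * P(A) / P(B)
print(p_A_given_B, p_B_given_A * p_A / p_B)               # both 0.4
```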
LING 180 Autumn 2007
24

Summary
Probability Conditional Probability Independence Bayes Rule

LING 180 Autumn 2007

25

How many words?


I do uh main- mainly business data processing
Fragments Filled pauses

Are cat and cats the same word? Some terminology


Lemma: a set of lexical forms having the same stem, major part of speech, and rough word sense
Cat and cats = same lemma

Wordform: the full inflected surface form.


Cat and cats = different wordforms

LING 180 Autumn 2007

26

How many words?


they picnicked by the pool then lay back on the grass and looked at the stars
16 tokens 14 types

SWBD:
~20,000 wordform types, 2.4 million wordform tokens

Brown et al (1992) large corpus


583 million wordform tokens 293,181 wordform types

Let N = number of tokens, V = vocabulary = number of types General wisdom: V > O(sqrt(N))

LING 180 Autumn 2007

27

Language Modeling
We want to compute P(w1,w2,w3,w4,w5,...,wn), the probability of a sequence. Alternatively we want to compute P(w5|w1,w2,w3,w4): the probability of a word given some previous words. The model that computes P(W) or P(wn|w1,w2,...,wn-1) is called the language model. A better term for this would be "The Grammar", but "language model" or LM is standard.

LING 180 Autumn 2007

28

Computing P(W)
How to compute this joint probability:
P(the,other,day,I,was,walking,along,and,saw,a,lizard)

Intuition: let's rely on the Chain Rule of Probability

LING 180 Autumn 2007

29

The Chain Rule


Recall the definition of conditional probabilities:

P(A|B) = P(A ∧ B) / P(B)

Rewriting:

P(A ∧ B) = P(A|B) P(B)

More generally: P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
In general: P(x1,x2,x3,...,xn) = P(x1) P(x2|x1) P(x3|x1,x2) ... P(xn|x1...xn-1)

LING 180 Autumn 2007

30

The Chain Rule Applied to joint probability of words in sentence

P(the big red dog was)=


P(the)*P(big|the)*P(red|the big)*P(dog|the big red)*P(was|the big red dog)
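A minimal sketch of scoring a sentence this way in code. The conditional probabilities here come from a toy dictionary, a stand-in for whatever model actually supplies P(word | history); the names and numbers are assumptions for illustration only:

```python
# Chain-rule sentence probability: product of P(w_i | w_1..w_{i-1}).
def sentence_prob(words, cond_prob):
    prob = 1.0
    for i, w in enumerate(words):
        prob *= cond_prob(w, tuple(words[:i]))  # P(w_i | history)
    return prob

toy = {("the", ()): 0.1, ("big", ("the",)): 0.01, ("red", ("the", "big")): 0.02}
print(sentence_prob(["the", "big", "red"], lambda w, h: toy.get((w, h), 1e-6)))
```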

LING 180 Autumn 2007

31

Very easy estimate:


How to estimate?
P(the|its water is so transparent that)

P(the|its water is so transparent that) = C(its water is so transparent that the) _______________________________ C(its water is so transparent that)

LING 180 Autumn 2007

32

Unfortunately
There are a lot of possible sentences. We'll never be able to get enough data to compute the statistics for those long prefixes: P(lizard|the,other,day,I,was,walking,along,and,saw,a) or P(the|its water is so transparent that)

LING 180 Autumn 2007

33

Markov Assumption
Make the simplifying assumption
P(lizard|the,other,day,I,was,walking,along,and,saw,a ) = P(lizard|a)

Or maybe
P(lizard|the,other,day,I,was,walking,along,and,saw,a ) = P(lizard|saw,a)

LING 180 Autumn 2007

34

Markov Assumption
So for each component in the product replace with the approximation (assuming a prefix of N)

P(wn | w1...wn-1) ≈ P(wn | wn-N+1...wn-1)

Bigram version:

P(wn | w1...wn-1) ≈ P(wn | wn-1)

LING 180 Autumn 2007
35

Estimating bigram probabilities


The Maximum Likelihood Estimate

P(wi | wi-1) = count(wi-1, wi) / count(wi-1)

P(wi | wi-1) = c(wi-1, wi) / c(wi-1)


LING 180 Autumn 2007
36

An example
<s> I am Sam </s> <s> Sam I am </s> <s> I do not like green eggs and ham </s>

This is the Maximum Likelihood Estimate, because it is the one which maximizes P(Training set|Model)
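A minimal sketch of computing these MLE bigram estimates from the toy corpus above (illustrative code, not from the slides):

```python
from collections import Counter

# MLE bigram estimates: P(w | prev) = c(prev, w) / c(prev).
corpus = ["<s> I am Sam </s>", "<s> Sam I am </s>",
          "<s> I do not like green eggs and ham </s>"]
unigrams, bigrams = Counter(), Counter()
for line in corpus:
    toks = line.split()
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

def p(w, prev):
    return bigrams[(prev, w)] / unigrams[prev]

print(p("I", "<s>"))    # 2/3
print(p("Sam", "<s>"))  # 1/3
print(p("am", "I"))     # 2/3
```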

LING 180 Autumn 2007

37

Maximum Likelihood Estimates


The maximum likelihood estimate of some parameter of a model M from a training set T
Is the estimate that maximizes the likelihood of the training set T given the model M

Suppose the word Chinese occurs 400 times in a corpus of a million words (Brown corpus) What is the probability that a random word from some other text will be Chinese MLE estimate is 400/1000000 = .004
This may be a bad estimate for some other corpus

But it is the estimate that makes it most likely that Chinese will occur 400 times in a million word corpus.

LING 180 Autumn 2007

38

More examples: Berkeley Restaurant Project sentences


can you tell me about any good cantonese restaurants close by mid priced thai food is what im looking for tell me about chez panisse can you give me a listing of the kinds of food that are available im looking for a good place to eat breakfast when is caffe venezia open during the day

LING 180 Autumn 2007

39

Raw bigram counts


Out of 9222 sentences

LING 180 Autumn 2007

40

Raw bigram probabilities


Normalize by unigrams: Result:

LING 180 Autumn 2007

41

Bigram estimates of sentence probabilities


P(<s> I want english food </s>) = p(i|<s>) x p(want|I) x p(english|want) x p(food|english) x p(</s>|food) =.000031

LING 180 Autumn 2007

42

What kinds of knowledge?


P(english|want) = .0011 P(chinese|want) = .0065 P(to|want) = .66 P(eat | to) = .28 P(food | to) = 0 P(want | spend) = 0 P (i | <s>) = .25

LING 180 Autumn 2007

43

The Shannon Visualization Method


Generate random sentences: Choose a random bigram <s>, w according to its probability Now choose a random bigram (w, x) according to its probability And so on until we choose </s> Then string the words together
<s> I I want want to to eat eat Chinese Chinese food food </s>

LING 180 Autumn 2007

44

LING 180 Autumn 2007

45

Shakespeare as corpus
N=884,647 tokens, V=29,066 Shakespeare produced 300,000 bigram types out of V2= 844 million possible bigrams: so, 99.96% of the possible bigrams were never seen (have zero entries in the table) Quadrigrams worse: What's coming out looks like Shakespeare because it is Shakespeare

LING 180 Autumn 2007

46

The wall street journal is not shakespeare (no offense)

LING 180 Autumn 2007

47

Evaluation
We train the parameters of our model on a training set. How do we evaluate how well our model works? We look at the model's performance on some new data. This is what happens in the real world; we want to know how our model performs on data we haven't seen. So we use a test set: a dataset which is different from our training set. Then we need an evaluation metric to tell us how well our model is doing on the test set. One such metric is perplexity (to be introduced below).
LING 180 Autumn 2007
48

Unknown words: Open versus closed vocabulary tasks


If we know all the words in advance: vocabulary V is fixed; closed vocabulary task.
Often we don't know this: Out Of Vocabulary = OOV words; open vocabulary task.
Instead: create an unknown word token <UNK>.
Training of <UNK> probabilities:
Create a fixed lexicon L of size V. At the text normalization phase, any training word not in L is changed to <UNK>. Now we train its probabilities like a normal word.
At decoding time: if text input, use the <UNK> probabilities for any word not in training.

LING 180 Autumn 2007

49

Evaluating N-gram models


Best evaluation for an N-gram
Put model A in a speech recognizer Run recognition, get word error rate (WER) for A Put model B in speech recognition, get word error rate for B Compare WER for A and B Extrinsic evaluation

LING 180 Autumn 2007

50

Difficulty of extrinsic (in-vivo) evaluation of N-gram models


Extrinsic evaluation
This is really time-consuming Can take days to run an experiment

So
As a temporary solution, in order to run experiments, to evaluate N-grams we often use an intrinsic evaluation, an approximation called perplexity. But perplexity is a poor approximation unless the test data looks just like the training data, so it is generally only useful in pilot experiments (generally not sufficient to publish). But it is helpful to think about.

LING 180 Autumn 2007

51

Perplexity
Perplexity is the probability of the test set (assigned by the language model), normalized by the number of words:

PP(W) = P(w1 w2 ... wN)^(-1/N)

Chain rule: PP(W) = (∏i 1/P(wi | w1...wi-1))^(1/N)

For bigrams: PP(W) = (∏i 1/P(wi | wi-1))^(1/N)

Minimizing perplexity is the same as maximizing probability


The best language model is one that best predicts an unseen test set
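A minimal sketch of computing bigram perplexity on a test sequence, done in log space; the bigram probability function here is a toy stand-in for a trained (and smoothed) model:

```python
import math

# Perplexity under a bigram model: PP(W) = P(w_1..w_N)^(-1/N), in log space.
def perplexity(words, bigram_prob):
    log_prob = 0.0
    for prev, w in zip(words, words[1:]):
        log_prob += math.log(bigram_prob(w, prev))   # log P(w | prev)
    n = len(words) - 1                               # number of predicted words
    return math.exp(-log_prob / n)

print(perplexity(["<s>", "I", "want", "</s>"],
                 lambda w, prev: {"I": 0.25, "want": 0.32, "</s>": 0.1}.get(w, 1e-6)))
```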
LING 180 Autumn 2007
52

A totally different perplexity Intuition


How hard is the task of recognizing digits 0,1,2,3,4,5,6,7,8,9,oh: easy, perplexity 11 (or if we ignore oh, perplexity 10) How hard is recognizing (30,000) names at Microsoft. Hard: perplexity = 30,000 If a system has to recognize
Operator (1 in 4) Sales (1 in 4) Technical Support (1 in 4) 30,000 names (1 in 120,000 each) Perplexity is 54

Perplexity is weighted equivalent branching factor

Slide from Josh Goodman

LING 180 Autumn 2007

53

Perplexity as branching factor

LING 180 Autumn 2007

54

Lower perplexity = better model


Training 38 million words, test 1.5 million words, WSJ

LING 180 Autumn 2007

55

Lesson 1: the perils of overfitting


N-grams only work well for word prediction if the test corpus looks like the training corpus
In real life, it often doesn't. We need to train robust models, adapt to the test set, etc.

LING 180 Autumn 2007

56

Lesson 2: zeros or not?


Zipf's Law: a small number of events occur with high frequency; a large number of events occur with low frequency. You can quickly collect statistics on the high frequency events; you might have to wait an arbitrarily long time to get valid statistics on low frequency events.

Result: our estimates are sparse! No counts at all for the vast bulk of things we want to estimate! Some of the zeroes in the table are really zeros, but others are simply low frequency events you haven't seen yet. After all, ANYTHING CAN HAPPEN!

How to address? Answer: estimate the likelihood of unseen N-grams!

Slide adapted from Bonnie Dorr and Julia Hirschberg

LING 180 Autumn 2007

57

Smoothing is like Robin Hood: Steal from the rich and give to the poor (in probability mass)

Slide from Dan Klein

LING 180 Autumn 2007

58

Laplace smoothing
Also called add-one smoothing. Just add one to all the counts! Very simple.
MLE estimate: P(wi | wi-1) = c(wi-1, wi) / c(wi-1)
Laplace estimate: P_Laplace(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V)
Reconstructed counts: c* = (c(wi-1, wi) + 1) * c(wi-1) / (c(wi-1) + V)
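A minimal sketch of the Laplace estimate in code, checked against the BERP-style numbers used later in these slides (c(want to) = 786, c(want) = 1215, V = 1616):

```python
# Add-one (Laplace) smoothed bigram estimate:
# P(w | prev) = (c(prev, w) + 1) / (c(prev) + V), with V = vocabulary size.
def laplace_bigram(w, prev, bigram_counts, unigram_counts, V):
    return (bigram_counts.get((prev, w), 0) + 1) / (unigram_counts.get(prev, 0) + V)

print(laplace_bigram("to", "want", {("want", "to"): 786}, {"want": 1215}, 1616))
# 787 / 2831 ≈ 0.28
```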

LING 180 Autumn 2007

59

Laplace smoothed bigram counts

LING 180 Autumn 2007

60

Laplace-smoothed bigrams

LING 180 Autumn 2007

61

Reconstituted counts

LING 180 Autumn 2007

62

Note big change to counts


C(want to) went from 608 to 238! P(to|want) from .66 to .26! Discount d = c*/c: d for "chinese food" = .10 — a 10x reduction! So in general, Laplace is a blunt instrument. Could use a more fine-grained method (add-k).

But Laplace smoothing is not used for N-grams, as we have much better methods. Despite its flaws, Laplace (add-k) is however still used to smooth other probabilistic models in NLP, especially for pilot studies in domains where the number of zeros isn't so huge.

LING 180 Autumn 2007

63

Better discounting algorithms


Intuition used by many smoothing algorithms
Good-Turing Kneser-Ney Witten-Bell

Is to use the count of things we've seen once to help estimate the count of things we've never seen

LING 180 Autumn 2007

64

Good-Turing: Josh Goodman intuition


Imagine you are fishing
There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass

You have caught


10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish

How likely is it that next species is new (i.e. catfish or bass)


3/18

Assuming so, how likely is it that next species is trout?


Must be less than 1/18
Slide adapted from Josh Goodman
LING 180 Autumn 2007

65

Good-Turing Intuition
Notation: Nx is the frequency-of-frequency-x
So N10=1, N1=3, etc

To estimate total number of unseen species


Use the number of species (words) we've seen once: c0* = c1, p0 = N1/N

All other estimates are adjusted (down) to give probabilities for unseen

Slide from Josh Goodman

LING 180 Autumn 2007

66

Good-Turing Intuition
Notation: Nx is the frequency-of-frequency-x
So N10=1, N1=3, etc

To estimate total number of unseen species


Use the number of species (words) we've seen once: p0 = N1/N = 3/18

All other estimates are adjusted (down) to give probabilities for the unseen:
c*(eel) = c*(1) = (1+1) x N2/N1 = 2 x 1/3 = 2/3
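A minimal sketch of the Good-Turing count re-estimate c* = (c+1) N_{c+1} / N_c on the fishing example above (illustrative code, not from the slides):

```python
from collections import Counter

# Fishing example: 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel.
counts = Counter({"carp": 10, "perch": 3, "whitefish": 2,
                  "trout": 1, "salmon": 1, "eel": 1})
N = sum(counts.values())          # 18 fish
Nc = Counter(counts.values())     # frequency of frequencies: N1 = 3, N2 = 1, ...

p0 = Nc[1] / N                    # probability mass reserved for unseen species = 3/18
c_star_1 = (1 + 1) * Nc[2] / Nc[1]  # revised count for things seen once = 2/3
print(p0, c_star_1, c_star_1 / N)   # 0.167, 0.667, P_GT(trout) ≈ 0.037
```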

Slide from Josh Goodman

LING 180 Autumn 2007

67

LING 180 Autumn 2007

68

Bigram frequencies of frequencies and GT re-estimates

LING 180 Autumn 2007

69

Complications
In practice, assume large counts (c>k for some k) are reliable:

That complicates c*, making it:

Also: we assume singleton counts c=1 are unreliable, so treat N-grams with count of 1 as if they were count=0 Also, need the Nk to be non-zero, so we need to smooth (interpolate) the Nk counts before computing c* from them

LING 180 Autumn 2007

70

Backoff and Interpolation


Another really useful source of knowledge If we are estimating:
trigram p(z|xy) but c(xyz) is zero

Use info from:


Bigram p(z|y)

Or even:
Unigram p(z)

How to combine the trigram/bigram/unigram info?

LING 180 Autumn 2007

71

Backoff versus interpolation


Backoff: use trigram if you have it, otherwise bigram, otherwise unigram Interpolation: mix all three

LING 180 Autumn 2007

72

Interpolation
Simple interpolation

Lambdas conditional on context:
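A minimal sketch of simple linear interpolation (the non-context-conditioned version); the three probability functions are stand-ins for trained models, and the lambda values are illustrative, to be tuned on held-out data:

```python
# Linear interpolation of trigram, bigram and unigram estimates; lambdas sum to 1.
def interpolated(w, u, v, p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    l3, l2, l1 = lambdas
    return l3 * p_tri(w, u, v) + l2 * p_bi(w, v) + l1 * p_uni(w)

p = interpolated("food", "to", "eat",
                 p_tri=lambda w, u, v: 0.0,   # unseen trigram
                 p_bi=lambda w, v: 0.02,
                 p_uni=lambda w: 0.001)
print(p)  # 0.3*0.02 + 0.1*0.001 = 0.0061
```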

LING 180 Autumn 2007

73

How to set the lambdas?


Use a held-out corpus: choose lambdas which maximize the probability of some held-out data.
I.e. fix the N-gram probabilities, then search for lambda values that, when plugged into the previous equation, give the largest probability for the held-out set. Can use EM to do this search.

LING 180 Autumn 2007

74

Katz Backoff

LING 180 Autumn 2007

75

Why discounts P* and alpha?


MLE probabilities sum to 1

So if we used MLE probabilities but backed off to lower order model when MLE prob is zero We would be adding extra probability mass And total probability would be greater than 1

LING 180 Autumn 2007

76

GT smoothed bigram probs

LING 180 Autumn 2007

77

Intuition of backoff+discounting
How much probability to assign to all the zero trigrams?
Use GT or other discounting algorithm to tell us

How to divide that probability mass among different contexts?


Use the N-1 gram estimates to tell us

What do we do for the unigram words not seen in training?


Out Of Vocabulary = OOV words

LING 180 Autumn 2007

78

OOV words: <UNK> word


Out Of Vocabulary = OOV words. We don't use GT smoothing for these, because GT assumes we know the number of unseen events. Instead: create an unknown word token <UNK>.
Training of <UNK> probabilities:
Create a fixed lexicon L of size V. At the text normalization phase, any training word not in L is changed to <UNK>. Now we train its probabilities like a normal word.
At decoding time: if text input, use the <UNK> probabilities for any word not in training.

LING 180 Autumn 2007

79

Practical Issues
We do everything in log space
Avoid underflow (also adding is faster than multiplying)
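A tiny sketch of the log-space trick, using the sentence probabilities from the earlier BERP example (.25 × .32 × .65 × .26 × .001 × .60):

```python
import math

# Sum log probabilities instead of multiplying probabilities: no underflow,
# and addition is cheaper than multiplication.
probs = [0.25, 0.32, 0.65, 0.26, 0.001, 0.60]
log_prob = sum(math.log(p) for p in probs)
print(log_prob)            # about -11.7
print(math.exp(log_prob))  # back to the tiny probability (~8.1e-06) if needed
```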

LING 180 Autumn 2007

80

ARPA format

LING 180 Autumn 2007

81

LING 180 Autumn 2007

82

Language Modeling Toolkits


SRILM CMU-Cambridge LM Toolkit

LING 180 Autumn 2007

83

Google N-Gram Release

LING 180 Autumn 2007

84

Google N-Gram Release


serve as the incoming 92 serve as the incubator 99 serve as the independent 794 serve as the index 223 serve as the indication 72 serve as the indicator 120 serve as the indicators 45 serve as the indispensable 111 serve as the indispensible 40 serve as the individual 234

LING 180 Autumn 2007

85

Advanced LM stuff
Current best smoothing algorithm
Kneser-Ney smoothing

Other stuff
Variable-length n-grams Class-based n-grams
Clustering Hand-built classes

Cache LMs Topic-based LMs Sentence mixture models Skipping LMs Parser-based LMs

LING 180 Autumn 2007

86

Summary
Probability
Basic probability Conditional probability Bayes Rule

Language Modeling (N-grams)


N-gram Intro, The Chain Rule, The Shannon Visualization Method, Evaluation: Perplexity, Smoothing: Add-1
Advanced stuff I won't discuss:

Discounting: Good-Turing and Katz backoff Interpolation Unknown words Advanced LM algorithms
LING 180 Autumn 2007
87

Today's Lecture
N-Grams; Bigrams; Smoothing of N-Grams:

Add one smoothing Witten-Bell Smoothing

Simple N-Grams

An N-gram model uses the previous N-1 words to predict the next one:

P(wn | wn -1) We'll be dealing with P(<word> | <some previous words>)

unigrams: P(dog) bigrams: P(dog | big) trigrams: P(dog | the big) quadrigrams: P(dog | the big dopey)

Chain Rule
Conditional probability: P(A|B) = P(A ∩ B) / P(B)

So: P(A ∩ B) = P(A|B) P(B) and P(A ∩ B) = P(B|A) P(A)

the dog:

P(The dog ) = P(dog | the) P(the)

the dog bites:

P(The dog bites ) = P(The) P(dog | The) P(bites | The dog )

Chain Rule
the probability of a word sequence is the probability of a conjunctive event.

P(w1...wn) = P(w1) P(w2|w1) P(w3|w1 w2) ... P(wn|w1...wn-1)

           = ∏ k=1..n P(wk | w1...wk-1)

Unfortunately, that's really not helpful in general. Why?


4

Markov Assumption
P(wn | w1...wn-1) ≈ P(wn | wn-N+1...wn-1)

P(wn) can be approximated using only N-1 previous words of context This lets us collect statistics in practice Markov models are the class of probabilistic models that assume that we can predict the probability of some future unit without looking too far into the past Order of a Markov model: length of prior context

Language Models and N-grams


Given a word sequence: w1 w2 w3 ... wn Chain rule


Note:

p(w1 w2) = p(w1) p(w2|w1)
p(w1 w2 w3) = p(w1) p(w2|w1) p(w3|w1 w2)
...
p(w1 w2 w3...wn) = p(w1) p(w2|w1) p(w3|w1 w2) ... p(wn|w1...wn-2 wn-1)

Note: it's not easy to collect (meaningful) statistics on p(wn|w1...wn-2 wn-1) for all possible word sequences.

Bigram approximation: just look at the previous word only (not all the preceding words). Markov Assumption: finite-length history. 1st order Markov Model:
p(w1 w2 w3...wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)

Note: p(wn|wn-1) is a lot easier to estimate well than p(wn|w1...wn-2 wn-1).
6

Language Models and N-grams


Given a word sequence: w1 w2 w3 ... wn Chain rule


Trigram approximation

p(w1 w2) = p(w1) p(w2|w1) p(w1 w2 w3) = p(w1) p(w2|w1) p(w3|w1w2) ... p(w1 w2 w3...wn) = p(w1) p(w2|w1) p(w3|w1w2)... p(wn|w1...wn-2 wn-1)

2nd order Markov Model: just look at the preceding two words only.
p(w1 w2 w3 w4...wn) = p(w1) p(w2|w1) p(w3|w1 w2) p(w4|w1 w2 w3) ... p(wn|w1...wn-3 wn-2 wn-1)
p(w1 w2 w3...wn) ≈ p(w1) p(w2|w1) p(w3|w1 w2) p(w4|w2 w3) ... p(wn|wn-2 wn-1)

Note:

p(wn|wn-2wn-1) is a lot easier to estimate well than p(wn|w1...wn-2 wn-1) but harder than p(wn|wn-1 )
7

Corpora
Corpora are (generally online) collections of text and speech e.g.

Brown Corpus (1M words) Wall Street Journal and AP News corpora ATIS, Broadcast News (speech) TDT (text and speech) Switchboard, Call Home (speech) TRAINS, FM Radio (speech)
8

Sample Word frequency (count)Data


(The Text REtrieval Conference) - (from B. Croft, UMass)

Counting Words in Corpora


Probabilities are based on counting things, so . What should we count? Words, word classes, word senses, speech acts ?

What is a word?

e.g., are cat and cats the same word? September and Sept? zero and oh? Is seventy-two one word or two? AT&T? Where do we find the things to count?
10

Terminology
Sentence: unit of written language Utterance: unit of spoken language Wordform: the inflected form that appears in the corpus Lemma: lexical forms having the same stem, part of speech, and word sense Types: number of distinct words in a corpus (vocabulary size) Tokens: total number of words

11

Training and Testing

Probabilities come from a training corpus, which is used to design the model.

narrow corpus: probabilities don't generalize general corpus: probabilities don't reflect task or domain

A separate test corpus is used to evaluate the model

12

Simple N-Grams

An N-gram model uses the previous N-1 words to predict the next one:

P(wn | wn -1) We'll be dealing with P(<word> | <some prefix>)

unigrams: P(dog) bigrams: P(dog | big) trigrams: P(dog | the big) quadrigrams: P(dog | the big red)

13

Using N-Grams

Recall that

For a bigram grammar

P(wn | w1..n-1) ≈ P(wn | wn-N+1..n-1). P(sentence) can be approximated by multiplying all the bigram probabilities in the sequence: P(I want to eat Chinese food) = P(I | <start>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese) P(<end> | food)
14

Chain Rule

Recall the definition of conditional probabilities:

P(A|B) = P(A ∧ B) / P(B)

Rewriting: P(A ∧ B) = P(A|B) P(B)

Or: P(The big) = P(big | the) P(the)
Or: P(The big) = P(the) P(big | the)
15

Example
The

big red dog

P(The)*P(big|the)*P(red|the big)*P(dog|the big red) Better P(The| <Beginning of sentence>) written as P(The | <S>) Also <end> for end of sentence
16

General Case

The word sequence from position 1 to n is w1..n. So the probability of a sequence is

P(w1..n) = P(w1) P(w2|w1) P(w3|w1..2) ... P(wn|w1..n-1) = P(w1) ∏ k=2..n P(wk | w1..k-1)

17

Unfortunately

That doesn't help, since it's unlikely we'll ever gather the right statistics for the prefixes.

18

Markov Assumption

Assume that the entire prefix history isn't necessary. In other words, an event doesn't depend on all of its history, just a fixed-length near history.

19

Markov Assumption

So for each component in the product replace each with its approximation (assuming a prefix (Previous words) of N)

P(wn | w1..n-1) ≈ P(wn | wn-N+1..n-1)
20

N-Grams The big red dog


Unigrams: P(dog) Bigrams: P(dog|red) Trigrams: P(dog|big red) Four-grams: P(dog|the big red)

In general, we'll be dealing with P(Word | Some fixed prefix)


Note: prefix is Previous words
21

N-gram models can be trained by counting and normalization


Bigram: P(wn | wn-1) = C(wn-1 wn) / C(wn-1)

N-gram: P(wn | wn-N+1..n-1) = C(wn-N+1..n-1 wn) / C(wn-N+1..n-1)

22

An example

<s> I am Sam <\s> <s> Sam I am <\s> <s> I do not like green eggs and meet <\s>
P( I |< s >) =

2 = 0.67 3 1 P( Sam |< s >) = = 0.33 3 2 P(am | I = ) = 0.67 3 1 P(< \ s >| Sam) = = 0.5 2 1 P(< s >| Sam) = = 0.5 2 1 P( Sam | am= ) = 0.5 2 1 P(do | I ) = = 0.33 3

23

BERP Bigram Counts


BErkeley Restaurant Project (speech)

Bigram counts (row = first word, column = second word):

         I     want   to    eat   Chinese  food  lunch
I        8     1087   0     13    0        0     0
want     3     0      786   0     6        8     6
to       3     0      10    860   3        0     12
eat      0     0      2     0     19       2     52
Chinese  2     0      0     0     0        120   1
food     19    0      17    0     0        0     0
lunch    4     0      0     0     0        1     0

24

BERP Bigram Probabilities

Normalization: divide each row's counts by appropriate unigram counts


Unigram counts: I 3437, want 1215, to 3256, eat 938, Chinese 213, food 1506, lunch 459

Computing the probability of "I I":

P(I | I) = C(I I) / C(I) = 8 / 3437 = .0023

A bigram grammar is an NxN matrix of probabilities, where N is the vocabulary size


25

A Bigram Grammar Fragment from BERP


Eat on       .16      Eat Thai       .03
Eat some     .06      Eat breakfast  .03
Eat lunch    .06      Eat in         .02
Eat dinner   .05      Eat Chinese    .02
Eat at       .04      Eat Mexican    .02
Eat a        .04      Eat tomorrow   .01
Eat Indian   .04      Eat dessert    .007
Eat today    .03      Eat British    .001
26

<start> I <start> Id <start> Tell <start> Im I want I would I dont I have Want to Want a

.25 .06 .04 .02 .32 .29 .08 .04 .65 .05

Want some Want Thai To eat To have To spend To be British food British restaurant British cuisine British lunch

.04 .01 .26 .14 .09 .02 .60 .15 .01 .01
27

Language Models and N-grams

Example:

From the bigram frequencies C(wn-1 wn) and the unigram frequencies C(wn-1) we compute the bigram probabilities P(wn | wn-1). The resulting matrix is sparse: the zero probabilities make it unusable (we'll need to do smoothing).


28

Example

P(I want to eat British food) = P(I|<start>) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British) = .25*.32*.65*.26*.001*.60 = (different from textbook) 0.0000081 vs. I want to eat Chinese food = .00015

29

Note on Example

Probabilities seem to capture syntactic facts, world knowledge


eat is often followed by a NP British food is not too popular

30

What do we learn about the language?

What's being captured with ...


P(want | I) = .32 P(to | want) = .65 P(eat | to) = .26 P(food | Chinese) = .56 P(lunch | eat) = .055

31

Some Observations

P(I | I) P(want | I) P(I | food)

I I I want I want I want to The food I want is

32

What

about

P(I | I) = .0023 I I I I want P(I | want) = .0025 I want I want P(I | food) = .013 the kind of food I want is ...

33

To avoid underflow use Logs

You don't really do all those multiplies. The numbers are too small and lead to underflow. Convert the probabilities to logs and then do additions. To get the real probability (if you need it), go back to the antilog.

34

Generation

Choose N-Grams according to their probabilities and string them together

35

BERP

I want
want to to eat eat Chinese Chinese food food .

36

Some Useful Observations

A small number of events occur with high frequency

You can collect reliable statistics on these events with relatively small samples

A large number of events occur with small frequency

You might have to wait a long time to gather statistics on the low frequency events

37

Some Useful Observations

Some zeroes are really zeroes

Meaning that they represent events that cant or shouldnt occur

On the other hand, some zeroes arent really zeroes

They represent low frequency events that simply didnt occur in the corpus

38

Problem
Let's assume we're using N-grams. How can we assign a probability to a sequence where one of the component n-grams has a value of zero? Assume all the words are known and have been seen.

Possible answers: go to a lower order n-gram (back off from bigrams to unigrams), or replace the zero with something else.

39

Add-One

Make the zero counts 1. Justification: they're just events you haven't seen yet. If you had seen them, you would only have seen them once, so make the count equal to 1.

40

Add-one: Example

unsmoothed bigram counts (1st word = row, 2nd word = column):

         I     want   to    eat   Chinese  food  lunch   Total (N)
I        8     1087   0     13    0        0     0       3437
want     3     0      786   0     6        8     6       1215
to       3     0      10    860   3        0     12      3256
eat      0     0      2     0     19       2     52      938
Chinese  2     0      0     0     0        120   1       213
food     19    0      17    0     0        0     0       1506
lunch    4     0      0     0     0        1     0       459

unsmoothed normalized bigram probabilities:

         I                want    to      eat              Chinese  food    lunch   Total
I        .0023 (8/3437)   .32     0       .0038 (13/3437)  0        0       0       1
want     .0025            0       .65     0                .0049    .0066   .0049   1
to       .00092           0       .0031   .26              .00092   0       .0037   1
eat      0                0       .0021   0                .020     .0021   .055    1
Chinese  .0094            0       0       0                0        .56     .0047   1
food     .013             0       .011    0                0        0       0       1
lunch    .0087            0       0       0                0        .0022   0       1

41

Add-one: Example (cont)


add-one smoothed bigram counts:

         I    want   to    eat   Chinese  food  lunch   Total (N+V)
I        9    1088   1     14    1        1     1       5053
want     4    1      787   1     7        9     7       2831
to       4    1      11    861   4        1     13      4872
eat      1    1      3     1     20       3     53      2554
Chinese  3    1      1     1     1        121   2       1829
food     20   1      18    1     1        1     1       3122
lunch    5    1      1     1     1        2     1       2075

add-one normalized bigram probabilities:

         I                want     to       eat              Chinese  food     lunch    Total
I        .0018 (9/5053)   .22      .0002    .0028 (14/5053)  .0002    .0002    .0002    1
want     .0014            .00035   .28      .00035           .0025    .0032    .0025    1
to       .00082           .00021   .0023    .18              .00082   .00021   .0027    1
eat      .00039           .00039   .0012    .00039           .0078    .0012    .021     1
Chinese  .0016            .00055   .00055   .00055           .00055   .066     .0011    1
food     .0064            .00032   .0058    .00032           .00032   .00032   .00032   1
lunch    .0024            .00048   .00048   .00048           .00048   .0022    .00048   1

42

The example again


unsmoothed bigram counts (V = 1616 word types):

         I     want   to    eat   Chinese  food  lunch   Total (N)
I        8     1087   0     13    0        0     0       3437
want     3     0      786   0     6        8     6       1215
to       3     0      10    860   3        0     12      3256
eat      0     0      2     0     19       2     52      938
Chinese  2     0      0     0     0        120   1       213
food     19    0      17    0     0        0     0       1506
lunch    4     0      0     0     0        1     0       459

Smoothed P(I eat) = (C(I eat) + 1) / (nb of bigrams starting with I + nb of possible bigrams starting with I) = (13 + 1) / (3437 + 1616) = 0.0028
43

Smoothing and N-grams


Add-One Smoothing

Bigram

add 1 to all frequency counts: p(wn|wn-1) = (C(wn-1 wn)+1) / (C(wn-1)+V); adjusted count: c* = (C(wn-1 wn)+1) * C(wn-1) / (C(wn-1)+V)

Frequencies (original counts):

         I    want   to    eat   Chinese  food  lunch
I        8    1087   0     13    0        0     0
want     3    0      786   0     6        8     6
to       3    0      10    860   3        0     12
eat      0    0      2     0     19       2     52
Chinese  2    0      0     0     0        120   1
food     19   0      17    0     0        0     0
lunch    4    0      0     0     0        1     0

Add-one adjusted counts:

         I       want     to       eat      Chinese  food     lunch
I        6.12    740.05   0.68     9.52     0.68     0.68     0.68
want     1.72    0.43     337.76   0.43     3.00     3.86     3.00
to       2.67    0.67     7.35     575.41   2.67     0.67     8.69
eat      0.37    0.37     1.10     0.37     7.35     1.10     19.47
Chinese  0.35    0.12     0.12     0.12     0.12     14.09    0.23
food     9.65    0.48     8.68     0.48     0.48     0.48     0.48
lunch    1.11    0.22     0.22     0.22     0.22     0.44     0.22

Remarks: add-one causes large changes in some frequencies due to the relative size of V (1616):
want to: 786 -> 338 = (786 + 1) * 1215 / (1215 + 1616)
In general the adjusted count is (c_i + 1) * N / (N + V).

44

Problem with add-one smoothing

bigrams starting with Chinese are boosted by a factor of 8 ! (1829 / 213) unsmoothed bigram counts:
I want to eat Chinese food lunch I
8 3 3 0 2 19 4

want
1087 0 0 0 0 0 0

to
0 786 10 2 0 17 0

eat
13 0 860 0 0 0 0

Chinese food
0 6 3 19 0 0 0 0 8 0 2 120 0 1

lunch
0 6 12 52 1 0 0

Total (N)
3437 1215 3256 938 213 1506 459

add-one smoothed bigram counts:


I I want
4 4 1 3 20 5 9

1st word

want
1088 1 1 1 1 1 1

to
1 787 11 23 1 18 1

eat
14 1 861 1 1 1 1

Chinese
1 7 4 20 1 1 1

food
1 9 1 3 121 1 2

lunch
1 7 13 53 2 1 1

Total (N+ V)
5053 2831 4872 2554 1829 3122 2075
45

1st word

to eat Chinese food lunch

Problem with add-one smoothing (cont)

Data from the AP from (Church and Gale, 1991)


Corpus of 22,000,000 bigrams Vocabulary of 273,266 words (i.e. 74,674,306,756 possible bigrams) 74,671,100,000 bigrams were unseen And each unseen bigram was given a frequency of 0.000295

f_MLE (freq. from training data)         0          1          2          3          4          5
f_empirical (freq. from held-out data)   0.000027   0.448      1.25       2.24       3.23       4.21
f_add-one (add-one smoothed freq.)       0.000295   0.000274   0.000411   0.000548   0.000685   0.000822

The add-one smoothed frequency is too high for the unseen bigrams (f_MLE = 0) and too low for the bigrams that were actually seen.

Total probability mass given to unseen bigrams =

(74,671,100,000 x 0.000295) / 22,000,000 ~99.96 !!!!


46

Smoothing and N-grams

Witten-Bell Smoothing

Unigram

Equate zero frequency items with frequency 1 items: use the frequency of things seen once to estimate the frequency of things we haven't seen yet. Smaller impact than Add-One. A zero frequency word (unigram) is an event that hasn't happened yet. Count the number of word types (T) we've observed in the corpus. Then

p(w) = T / (Z * (N + T))

where w is a word with zero frequency, Z = number of zero frequency words, N = size of corpus.

47

Distributing

The amount to be distributed is T / (N + T). The number of events with count zero is Z. So distributing evenly gets us

(1/Z) * T / (N + T)
48

Smoothing and N-grams

Bigram

p(wn|wn-1) = C(wn-1 wn) / C(wn-1)   (original)

For zero bigrams (after Witten-Bell):
p(wn|wn-1) = T(wn-1) / (Z(wn-1) * (C(wn-1) + T(wn-1)))
estimated zero bigram frequency (count): T(wn-1)/Z(wn-1) * C(wn-1)/(C(wn-1) + T(wn-1))

For non-zero bigrams (after Witten-Bell):
p(wn|wn-1) = C(wn-1 wn) / (C(wn-1) + T(wn-1))

T(wn-1) = number of bigram types beginning with wn-1
Z(wn-1) = number of unseen bigrams beginning with wn-1 = total number of possible bigrams beginning with wn-1 minus the ones we've seen = V - T(wn-1)
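A minimal sketch of these two Witten-Bell cases in code; the toy counts and vocabulary size are assumptions for illustration:

```python
# Witten-Bell bigram estimate, following the formulas above.
# counts: dict of (prev, w) -> count; V: vocabulary size.
def witten_bell(w, prev, counts, unigram_counts, V):
    T = len([1 for (p, _), c in counts.items() if p == prev and c > 0])  # seen types after prev
    Z = V - T                                                            # unseen types after prev
    c_prev = unigram_counts[prev]
    if counts.get((prev, w), 0) > 0:
        return counts[(prev, w)] / (c_prev + T)      # seen bigram: discounted MLE
    return T / (Z * (c_prev + T))                    # unseen bigram: share of reserved mass

counts = {("want", "to"): 786, ("want", "a"): 5}     # toy fragment, not the full table
print(witten_bell("to", "want", counts, {"want": 1215}, V=1616))
print(witten_bell("zebra", "want", counts, {"want": 1215}, V=1616))
```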

49

Smoothing and N-grams

Witten-Bell Smoothing

Bigram

Use the frequency (count) of things seen once to estimate the frequency (count) of things we haven't seen yet.
Estimated zero bigram count: T(wn-1)/Z(wn-1) * C(wn-1)/(C(wn-1) + T(wn-1))
T(wn-1) = number of bigram types beginning with wn-1; Z(wn-1) = number of unseen bigrams beginning with wn-1

Original counts:

         I    want   to    eat   Chinese  food  lunch
I        8    1087   0     13    0        0     0
want     3    0      786   0     6        8     6
to       3    0      10    860   3        0     12
eat      0    0      2     0     19       2     52
Chinese  2    0      0     0     0        120   1
food     19   0      17    0     0        0     0
lunch    4    0      0     0     0        1     0

Witten-Bell smoothed counts:

         I        want       to        eat       Chinese   food      lunch
I        7.785    1057.763   0.061     12.650    0.061     0.061     0.061
want     2.823    0.046      739.729   0.046     5.647     7.529     5.647
to       2.885    0.084      9.616     826.982   2.885     0.084     11.539
eat      0.073    0.073      1.766     0.073     16.782    1.766     45.928
Chinese  1.828    0.011      0.011     0.011     0.011     109.700   0.914
food     18.019   0.051      16.122    0.051     0.051     0.051     0.051
lunch    3.643    0.026      0.026     0.026     0.026     0.911     0.026

Remark: smaller changes than with add-one smoothing.

50

Distributing Among the Zeros

If a bigram wx wi has a zero count


P(wi | wx) = (1 / Z(wx)) * T(wx) / (N(wx) + T(wx))

where T(wx) = number of bigram types starting with wx, Z(wx) = number of bigrams starting with wx that were not seen, and N(wx) = actual frequency (count) of bigrams beginning with wx.
51

Thank you

52

By : Mu10co05 Mu10co22

Top-Down Parsing (TD)

It is a parsing strategy where one first looks at the highest level of the parse tree and works down the parse tree by using the rewriting rules of a formal grammar

Top-Down Parsing (TD)


Since we're trying to find trees rooted with an S (sentence), start with the rules that give us an S. Then work your way down from there to the words.
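A tiny sketch of this strategy as a recursive top-down recognizer over a toy grammar modeled on the sample grammar used in these slides (illustrative only; like the parser discussed later, it cannot handle left-recursive rules):

```python
GRAMMAR = {
    "S":  [["NP", "VP"], ["Aux", "NP", "VP"], ["VP"]],
    "NP": [["Det", "Nominal"]],
    "Nominal": [["Noun"]],
    "VP": [["Verb", "NP"], ["Verb"]],
    "Det": [["that"], ["this"], ["a"]],
    "Noun": [["flight"], ["meal"]],
    "Verb": [["book"], ["include"]],
    "Aux": [["does"]],
}

def parse(symbol, words, i):
    """Return all positions j such that `symbol` derives words[i:j], expanding top-down."""
    if symbol not in GRAMMAR:                      # terminal: must match the next word
        return [i + 1] if i < len(words) and words[i] == symbol else []
    ends = []
    for rhs in GRAMMAR[symbol]:                    # try each expansion in turn
        positions = [i]
        for sym in rhs:
            positions = [j for p in positions for j in parse(sym, words, p)]
        ends.extend(positions)
    return ends

words = "book that flight".split()
print(len(words) in parse("S", words, 0))          # True: "book that flight" is an S
```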

12/2/2013

Top-down parsing (TD)


Book that flight.

Start with S and try each rule for S:
S -> NP VP
S -> Aux NP VP
S -> VP

Expanding the S -> VP choice:
S -> VP -> V NP
S -> VP -> V
4

(Parse-tree sketches: each candidate S expansion is grown downward, with NP expanded to Det NOMINAL, PropN, or Pronoun, and VP expanded in turn.)

12/2/2013


Problems with the top-down parser


Left-recursion; ambiguity; inefficiency (reparsing of subtrees)

Left-recursion

It refers to any recursive non-terminal which, when it produces a sentential form containing itself, has that new copy of itself appear on the left of the production rule, e.g. NP -> NP PP, VP -> VP PP, S -> S and S

Ambiguity

Common structural ambiguity


Attachment ambiguity Coordination ambiguity

Ambiguities: PP Attachment

Coordination ambiguity
Different sets of phrases can be conjoined by a conjunction like and. For example, old men and women can be [old [men and women]] or [old men] and [women].

Repeated Parsing Subtrees

The parser often builds valid parse trees for portions of the input, then discards them during backtracking, only to find that it has to rebuild them again. For "a flight from Indianapolis to Houston on TWA", the number of times each constituent is built:

a flight                                        4
from Indianapolis                               3
to Houston                                      2
on TWA                                          1
a flight from Indianapolis                      3
a flight from Indianapolis to Houston           2
a flight from Indianapolis to Houston on TWA    1

Parsing with CFG

10

Comparison Of Top-down and Bottom-Up Parsing


Top-Down parsers never explore illegal parses (never explore trees that can't form an S) -- but waste time on trees that can never match the input. Bottom-Up parsers never explore trees inconsistent with the input -- but waste time exploring illegal parses (trees with no S root).

Thank You!!

Selectional Restrictions & its Limitations


Presented By: MU10CO26

Word Sense Disambiguation


The task of word sense disambiguation is to

examine word tokens in context and specify exactly which sense of each word is being used.

MOTIVATION
One of the central challenges in NLP is ambiguity. Compositional semantic analyzers ignore the issue

of lexical ambiguity. Needed in:


Machine Translation: For correct lexical choice. Information Retrieval: Resolving ambiguity in queries. Information Extraction: For accurate analysis of text.

SOLUTION:
Knowledge Based Approach. Rely on knowledge resources like WordNet. May use grammar rules for disambiguation. May use hand coded rules for disambiguation.

Necessity of a Mechanism
A system should include a mechanism which

ensures that only nouns with appropriate properties are associated with given verbs in a given context.

Selectional Restrictions:
Used to perform disambiguation. Used to rule out inappropriate senses and

thereby reduce the amount of ambiguity present during semantic analysis. Introduced by Fodor and Katz (1963). In selectional restriction, a predicate ( verb) imposes semantic constraints on its arguments (noun).
6

A violation of selectional restrictions is the explanation for the oddity of the following examples:

Kim ate a motor-bike. There is an apple bathing in the water. The stone thinks. The verb think selects a subject with the

feature human, which suggests that words labeled with inanimate are rejected.

Problems with the inference and the constraint view!


Inference view: Edible = appears as object of an eating

event? Constraint view: There are almost no strict constraints.

Let's look more closely at the selectional restrictions of eat!


Peter ate a banana.

=> eat(Human, Fruit)
Peter ate fish. => eat(Human, Edible)
Kim's mother bought a motor-bike of chocolate at the bakery. Kim ate the motorbike immediately. => eat(Animate, Physical Object)
9

Examples of this approach:


"In our house, everybody has a career and none of them includes washing dishes," he says. In her tiny kitchen at home, Ms. Chen works efficiently, stir-frying several simple dishes, including braised pig's ears and chicken livers with green peppers. The dishwasher read the article.

10

Limitations:
There are examples like the following where the

available selectional restrictions are too general to uniquely select a correct sense. What kind of dishes do you recommend?

11

But it fell apart in 1931, perhaps because people

realized you can't eat gold for lunch if you're hungry. In his two championship trials, Mr. Kulkarni ate glass on an empty stomach, accompanied only by water and tea. The sentence itself is not semantically ill-formed, despite the violation of eat's selectional restrictions.

12

If you want to kill the Soviet Union, get it

to try to eat Afghanistan. Here the typical selectional restrictions on both kill and eat will eliminate all possible literal senses, leaving the system with no possible meanings! It brings the semantic analysis to a halt!
13

THANK YOU

14

Semantics
Going beyond syntax

1/27

Semantics
Relationship between surface form and meaning What is meaning? Lexical semantics Syntax and semantics

2/27

What is meaning?
Reference to worlds
Objects, relationships, events, characteristics Meaning as truth

Understanding
Inference, implication Modelling beliefs

Meaning as action
Understanding activates procedures
3/27

Lexical semantics
Meanings of individual words
Sense and Reference What do we understand by the word lion ? Is a toy lion a lion? Is a toy gun a gun? Is a fake gun a gun?

Grammatical meaning
What do we understand by the lion, lions, the lions, as in The lion is a dangerous animal The lion was about to attack
4/27

Lexical relations
Lexical meanings can be defined in terms of other words
Synonyms, antonyms, broader/narrower terms synsets Part-whole relationships (often reflect realworld relationships) Linguistic usage (style, register) also a factor
5/27

Semantic features
Meanings can be defined (to a certain extent) in terms of distinctive features
e.g. man = adult, male, human

Meanings can be defined (to a certain extent) in terms of distinctive features

6/27

Types of representation
1. Syntactic relations
The man shot an elephant with his gun

shot subj man det the obj adv

elephant gun det an mod his


7/27

Types of representation
2. Deep syntax
The man shot an elephant with his gun An elephant was shot by the man with his gun

shot dsubj man qtf the dobj

instr

elephant gun qtf an poss his


8/27

Types of representation
3. Semantic roles, deep cases
The man shot an elephant with his gun An elephant was shot by the man with his gun The man used his gun to shoot an elephant

shot agent patient man qtf the

instr

elephant gun qtf an poss his


9/27

Types of representation
4. Event representation, semantic network
The man shot an elephant with his gun An elephant was shot by the man with his gun The man used his gun to shoot an elephant

shooting

shooter shot- instr thing man elephant gun qtf the qtf poss man
10/27

Types of representation
5. Predicate calculus
The man shot an elephant with his gun An elephant was shot by the man with his gun The man used his gun to shoot an elephant The man owned the gun which he used to shoot an elephant The man used the gun which he owned to shoot an elephant

event(e) & time(e,past) & pred(e,shoot) & man(A) & the(A) & (B) & dog(B) & shoot(A,B) & (C) & gun(C) & own(A,C) & use(A,C,e)

11/27

Types of representation
6. Conceptual dependency (Schank) John punched Mary

12/27

Types of representation
7. Semantic formulae (Wilks)

((THIS((PLANT STUFF)SOUR)) ((((((THRU PART)OBJE) (NOTUSE *ANI))GOAL) ((MAN USE) (OBJE THING) )))

door
13/27

Uses for semantic representations


As a linguistic artefact (because it's there). To capture the text-meaning relationship. Identifying paraphrases, equivalences (e.g. summarizing a text, searching a text for information). Understanding and making inferences (e.g. so as to understand a sequence of events). Interpreting questions (so as to find the answer), commands (so as to carry them out), statements (so as to update data). Translating.
14/27

Uses for semantic representations


Different levels of understanding/meaning Textual meaning may be little more than disambiguating
Attachment ambiguities Word-senses Anaphora (pronoun reference, coreference)

Conceptual meaning may be much deeper. Somewhere in between: a good example is Wilks' preference semantics, especially good for metaphor.
15/27

Linguistic issues
Words and Concepts
Objects, properties, actions n, adj, v Language allows us to be vague (e.g. toy gun)

Semantic primitives what are they? Meaning equivalence when do two things mean the same? Grammatical meaning
Tense vs. time Topic and focus Quantifiers, plurals, etc.
16/27

Linguistic issues
There are many other similarly tricky linguistic phenomena
Modality (could, should, would, must, may) Aspect (completed, ongoing, resulting) Determination (the, a, some, all, none) Fuzzy sets (often, some, many, usually)

17/27

Lexical semantics
Lexical relations (familiar to linguists) have an impact on NLP systems
Homonymy - word-sense selection; homophones in speech-based systems. Polysemy - understanding narrow senses. Synonymy - lexical equivalence. Ontology - structures the vocabulary, holds much of the knowledge used by clever systems.
18/27

WordNet
Began as a psycholinguistic theory of how the brain organizes its vocabulary (Miller) Organizes vocabulary into synsets, hierarchically arranged together with other relations (hyp[er|o]nymy, isa, member, antonyms, entailments) Turns out to be very useful for many applications Has been replicated for many languages (sometimes just translated!)
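A minimal sketch of browsing synsets and hypernym relations through NLTK's WordNet interface (an assumption for illustration; requires the wordnet corpus to be downloaded):

```python
from nltk.corpus import wordnet as wn

# List a few synsets for "lion" with their definitions and hypernyms (broader terms).
for syn in wn.synsets("lion")[:3]:
    print(syn.name(), "-", syn.definition())
    print("  hypernyms:", [h.name() for h in syn.hypernyms()])
```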
19/27

Natural Language Processing - Parsing Language, Syntax, Parsing Problems in Parsing


Ambiguity, Attachment / Binding Bottom vs. Top Down Parsing Earley-Algorithm

Where does NLP fit in the CS taxonomy?


Computers Databases Artificial Intelligence Algorithms Networking

Robotics Information Retrieval

Natural Language Processing Machine Translation

Search

Language Analysis

Semantics
12/2/2013

Parsing
2

The Steps in NLP


Discourse Pragmatics

Semantics Syntax

**we can go up, down and up and


Morphology

down and combine steps too!! **every step is equally complex


12/2/2013 3

What is Syntax?
Study of structure of language Specifically, goal is to relate an interface to morphological component to an interface to a semantic component Note: interface to morphological component may look like written text Representational device is tree structure

Simplified View of Linguistics


Phonology /waddyasai/

Morphology

/waddyasai/

what did you say

Syntax

what did you say

say
subj you obj

say
Semantics

what

subj you

obj

P[ x. say(you, x) ]

what

Natural Language - Parsing


Natural Language Syntax is described like a formal language, usually through a context-free grammar:
the Start-Symbol S - sentence
Non-Terminals NT - syntactic constituents
Terminals T - lexical entries / words
Productions/Rules P: NT -> (NT | T)+ - grammar rules
Parsing: derive the syntactic structure of a sentence based on a language model (grammar); construct a parse tree, i.e. the derivation of the sentence based on the grammar (rewrite system)

Sample Grammar
Grammar (S, NT, T, P): Sentence Symbol S in NT; Parts-of-Speech in NT; syntactic Constituents in NT; Grammar Rules P: NT -> (NT | T)*
S -> NP VP            (statement)
S -> Aux NP VP        (question)
S -> VP               (command)
NP -> Det Nominal
NP -> Proper-Noun
Nominal -> Noun | Noun Nominal | Nominal PP
VP -> Verb | Verb NP | Verb PP | Verb NP PP
PP -> Prep NP
Det -> that | this | a
Noun -> book | flight | meal | money
Proper-Noun -> Houston | American Airlines | TWA
Verb -> book | include | prefer
Aux -> does
Prep -> from | to | on

Task: Parse "Does this flight include a meal?"

Sample Parse Tree


Task: Parse "Does this flight include a meal?"
[S [Aux does] [NP [Det this] [Nominal [Noun flight]]] [VP [Verb include] [NP [Det a] [Nominal [Noun meal]]]]]
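
A minimal sketch (assuming NLTK is installed) of the sample grammar written as an NLTK CFG and used to parse the example sentence; the multi-word name "American Airlines" and the question mark are left out for simplicity:

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP | Aux NP VP | VP
NP -> Det Nominal | ProperNoun
Nominal -> Noun | Noun Nominal | Nominal PP
VP -> Verb | Verb NP | Verb PP | Verb NP PP
PP -> Prep NP
Det -> 'that' | 'this' | 'a'
Noun -> 'book' | 'flight' | 'meal' | 'money'
ProperNoun -> 'Houston' | 'TWA'
Verb -> 'book' | 'include' | 'prefer'
Aux -> 'does'
Prep -> 'from' | 'to' | 'on'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("does this flight include a meal".split()):
    tree.pretty_print()       # draws the parse tree shown above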

Structure in Strings
Some words: the, a, small, nice, big, very, boy, girl, sees, likes

Some good sentences:
the boy likes a girl
the small girl likes the big girl
a very small nice boy sees a very nice boy

Some bad sentences:
*the boy the girl
*small boy likes nice girl

Can we find subsequences of words (constituents) which in some way behave alike?

From Substrings to Trees


(((the) boy) likes ((a) girl))

[Tree diagram of the same bracketing: "likes" at the top, "boy" and "girl" below it, "the" and "a" below them.]

Node Labels?
( ((the) boy) likes ((a) girl) )
Choose constituents so each one has one non-bracketed word: the head
Group words by distribution of constituents they head (part-of-speech, POS):
Noun (N), verb (V), adjective (Adj), adverb (Adv), determiner (Det)

Category of constituent: XP, where X is POS


NP, S, AdjP, AdvP, DetP

Node Labels
(((the/Det) boy/N) likes/V ((a/Det) girl/N))

[Phrase-structure tree: S dominates an NP (DetP "the" + "boy"), "likes", and an NP (DetP "a" + "girl").]

Types of Nodes
(((the/Det) boy/N) likes/V ((a/Det) girl/N))

[Same phrase-structure tree as above.]
Nonterminal symbols (S, NP, DetP) = constituents
Terminal symbols (the, boy, likes, a, girl) = words

Determining Part-of-Speech
noun or adjective?
a blue seat / a very blue seat / this seat is blue
a child seat / *a very child seat / *this seat is child

blue and child are not the same POS: blue is Adj, child is Noun

Determining Part-of-Speech (2)


preposition or particle?
A: he threw out the garbage
B: he threw the garbage out the door
A: he threw the garbage out
B: *he threw the garbage the door out
The two instances of "out" are not the same POS: A is a particle, B is a preposition

Word Classes (=POS)


Heads of constituents fall into distributionally defined classes
Additional support for this definition of word classes comes from morphology

Some Points on POS Tag Sets


Possible basic set: N, V, Adj, Adv, P, Det, Aux, Comp, Conj
2 supertypes: open- and closed-class
Open: N, V, Adj, Adv
Closed: P, Det, Aux, Comp, Conj

Many subtypes:
eat/V -> eat/VB, eat/VBP, eats/VBZ, ate/VBD, eaten/VBN, eating/VBG, ...
Subtypes reflect morphological form & syntactic function

Phrase Structure and Dependency Structure


[Left: phrase-structure tree for "the boy likes a girl" - S dominates NP (DetP the, boy), likes, NP (DetP a, girl). Only leaf nodes are labeled with words!]

[Right: dependency tree for the same sentence - likes/V has dependents boy/N (with the/Det) and girl/N (with a/Det). All nodes are labeled with words!]

Phrase Structure and Dependency Structure (ctd)


[Same two trees as above.]

Representationally equivalent if each nonterminal node has one lexical daughter (its head)

Types of Dependency

[Dependency tree for "sometimes the very small boy likes a girl":
likes/V - Adj(unct) -> sometimes/Adv
likes/V - Subj -> boy/N, which has Fw -> the/Det and Adj -> small/Adj, which has Adj -> very/Adv
likes/V - Obj -> girl/N, which has Fw -> a/Det]

Grammatical Relations
Types of relations between words
Arguments: subject, object, indirect object, prepositional object
Adjuncts: temporal, locative, causal, manner, ...
Function words

Subcategorization
List of arguments of a word (typically, a verb), with features about realization (POS, perhaps case, verb form etc.)
In canonical order Subject-Object-IndObj
Example:
like: N-N, N-V(to-inf)
see: N, N-N, N-N-V(inf)

Note: J&M talk about subcategorization only within VP

What About the VP?


[Two analyses of "the boy likes the girl":
Left, flat: S dominates NP (DetP the, boy), likes, NP (DetP the, girl)
Right, with a VP: S dominates NP (DetP the, boy) and VP, where VP dominates likes and NP (DetP the, girl)]

What About the VP?


Existence of VP is a linguistic (i.e., empirical) claim, not a methodological claim
Semantic evidence???
Syntactic evidence:
VP-fronting (and quickly clean the carpet he did!)
VP-ellipsis (He cleaned the carpets quickly, and so did she)
Can have adjuncts before and after VP, but not in VP (He often eats beans, *he eats often beans)

Note: VP cannot be represented in a dependency representation

Context-Free Grammars
Defined in formal language theory (computer science)
Terminals, nonterminals, start symbol, rules
String-rewriting system: start with the start symbol, rewrite using rules, done when only terminals are left
NOT A LINGUISTIC THEORY, just a formal device

CFG: Example
Many possible CFGs for English, here is an example (fragment):
S -> NP VP
VP -> V NP
NP -> DetP N | AdjP NP
AdjP -> Adj | Adv AdjP
N -> boy | girl
V -> sees | likes
Adj -> big | small
Adv -> very
DetP -> a | the
the very small boy likes a girl

Derivations in a CFG

Using the grammar above, a derivation of "the boy likes a girl" (the phrase-structure tree grows at each step):

S
=> NP VP
=> DetP N VP
=> the boy VP
=> the boy likes NP
=> the boy likes a girl

Derivations in a CFG: Order of Derivation Irrelevant

Rewriting in a different order (e.g. via the intermediate form "NP likes DetP girl") yields the same phrase-structure tree.

Derivations of CFGs
String-rewriting system: we derive a string (= derived structure)
But the derivation history is represented by a phrase-structure tree (= derivation structure)!
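
A minimal sketch of the string-rewriting view, using the grammar fragment above; the rule choices are hard-coded for illustration rather than found by search:

# CFG as a string-rewriting system: repeatedly rewrite the leftmost
# nonterminal until only terminals remain (a leftmost derivation).
GRAMMAR = {
    "S":    [["NP", "VP"]],
    "VP":   [["V", "NP"]],
    "NP":   [["DetP", "N"], ["AdjP", "NP"]],
    "AdjP": [["Adj"], ["Adv", "AdjP"]],
    "N":    [["boy"], ["girl"]],
    "V":    [["sees"], ["likes"]],
    "Adj":  [["big"], ["small"]],
    "Adv":  [["very"]],
    "DetP": [["a"], ["the"]],
}

def derive(sentential_form, choices):
    """Rewrite the leftmost nonterminal at each step, printing every form."""
    print(" ".join(sentential_form))
    for choice in choices:
        i = next(k for k, sym in enumerate(sentential_form) if sym in GRAMMAR)
        rhs = GRAMMAR[sentential_form[i]][choice]
        sentential_form = sentential_form[:i] + rhs + sentential_form[i + 1:]
        print("=> " + " ".join(sentential_form))
    return sentential_form

# Prints the leftmost derivation ending in "the boy likes a girl"
derive(["S"], choices=[0, 0, 1, 0, 0, 1, 0, 0, 1])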

Representing Immediate Constituent Structure


The constituent structure of the whole sentence can be represented by means of labeled bracketing, e.g.
[ [ [Poor] [John] ] [ [lost] [ [his] [watch] ] ] ]
or by using a tree diagram for the same.

Representing Immediate Constituent Structure (contd.)


Labeled bracketing using category symbols:
[ [ [Poor]ADJ [John]N ]NP [ [lost]V [ [his]PRON [watch]N ]NP ]VP ]S

Word level: Poor - ADJ, John - N, lost - V, his - PRON, watch - N
Phrase level: Poor John - NP, his watch - NP, lost his watch - VP, Poor John lost his watch - S
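
A minimal sketch (assuming NLTK is installed): the labeled bracketing can be read directly into a tree object and printed back as a tree diagram:

from nltk import Tree

t = Tree.fromstring(
    "(S (NP (ADJ Poor) (N John)) (VP (V lost) (NP (PRON his) (N watch))))")
t.pretty_print()          # draws the tree diagram in ASCII
print(t.leaves())         # ['Poor', 'John', 'lost', 'his', 'watch']
print(t[0].label())       # 'NP' -- the first immediate constituent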

Immediate Constituent Structure using Tree Diagram


[Tree diagram: S dominates NP (ADJ Poor, N John) and VP (lost, NP (PRON his, N watch)).]

Exercise
Analyze the following constructions using labeled bracketing as well as a tree diagram:
1. I saw a man with a telescope.
2. She touched the cat with a feather.
3. The girl pushed the large box towards the huge door.
4. The man in the blue shirt is waiting for you.

Problems in Parsing - Ambiguity


Ambiguity: "One morning, I shot an elephant in my pajamas. How he got into my pajamas, I don't know." - Groucho Marx
Syntactic/structural ambiguity - several parse trees are possible, e.g. the sentence above
Semantic/lexical ambiguity - several word meanings, e.g. bank (where you get money) and (river) bank
Even different word categories are possible (interim), e.g. "He books the flight." vs. "The books are here.", or "Fruit flies from the balcony" vs. "Fruit flies are on the balcony."

Ambiguity examples

Ambiguities: PP Attachment

Attachments
I cleaned the dishes from dinner.
I cleaned the dishes with detergent.
I cleaned the dishes in my pajamas.
I cleaned the dishes in the sink.

Syntactic Ambiguities 1
Prepositional Phrases
They cooked the beans in the pot on the stove with handles.

Particle vs. Preposition


The puppy tore up the staircase.

Complement Structure
The tourists objected to the guide that they couldn't hear.
She knows you like the back of her hand.

Gerund vs. Participial Adjective


Visiting relatives can be boring.
Changing schedules frequently confused passengers.

Syntactic Ambiguities 2
Modifier scope within NPs
impractical design requirements
plastic cup holder

Multiple gap constructions


The chicken is ready to eat.
The contractors are rich enough to sue.

Coordination scope
Small rats and mice can squeeze into holes or cracks in the wall.

Classical NLP Parsing: The problem and its solution


Very constrained grammars attempt to limit unlikely/weird parses for sentences
But the attempt makes the grammars not robust: many sentences have no parse

A less constrained grammar can parse more sentences


But simple sentences end up with ever more parses

Solution: We need mechanisms that allow us to find the most likely parse(s)
Statistical parsing lets us work with very loose grammars that admit millions of parses for sentences and still quickly find the best parse(s)

Introduction
Parsing = associating a structure (parse tree) to an input string using a grammar
CFGs are declarative: they don't specify how the parse tree will be constructed
Example: "Book that flight." - [S [VP [Verb Book] [NP [Det that] [NOMINAL [Noun flight]]]]]
Parse trees are used in: grammar checking, semantic analysis, machine translation, question answering, information extraction


Parsing
Parsing with CFGs refers to the task of assigning correct trees to input strings
"Correct" here means a tree that covers all and only the elements of the input and has an S at the top
It doesn't actually mean that the system can select the correct tree from among the possible trees

Parsing
Parsing involves search, which involves making choices
Some parsing techniques:
Top-down parsing
Bottom-up parsing


Bottom-up and Top-down Parsing


Bottom-up parsing: from word nodes to the sentence symbol
Top-down parsing: from the sentence symbol to the words

[Parse tree of "does this flight include a meal", as shown above.]

For Now
Assume
You have all the words already in some buffer
The input isn't POS tagged
We won't worry about morphological analysis
All the words are known


Parsing as search
A Grammar to be used in our example
S -> NP VP
S -> Aux NP VP
S -> VP
NP -> Pronoun
NP -> Det NOMINAL
NP -> Proper-Noun
NOMINAL -> Noun
NOMINAL -> NOMINAL Noun
NOMINAL -> NOMINAL PP
VP -> Verb
VP -> Verb NP
VP -> Verb NP PP
VP -> Verb PP
VP -> VP PP
PP -> Preposition NP
Det -> that | this | a
Noun -> book | flight | meal | money
Verb -> book | include | prefer
Aux -> does
Proper-Noun -> Houston | TWA
Preposition -> from | to | on | near | through
Pronoun -> I | she | me

Parsing as search
"Book that flight."
Two types of constraints on the parses:
1. some come from the input string
2. others come from the grammar
[Target parse tree: [S [VP [Verb Book] [NP [Det that] [NOMINAL [Noun flight]]]]]]


Top-Down Parsing (TD)


Since we're trying to find trees rooted with an S (sentence), start with the rules that give us an S
Then work your way down from there to the words


Top-down parsing (TD)

"Book that flight."

[Search space of partial trees, expanded top-down: S is expanded by S -> NP VP, S -> Aux NP VP, and S -> VP; the NP in each is then expanded in turn (NP -> Det NOMINAL, NP -> Proper-Noun, NP -> Pronoun), the Aux and VP likewise, and so on down towards the words.]

Bottom-Up Parsing
Since we want trees that cover the input words, start with trees that link up with the words in the right way
Then work your way up from there


Bottom-up parsing (BU)

[Search space of partial trees, built bottom-up for "Book that flight": "Book" is tagged either Noun or Verb, "that" as Det, "flight" as Noun; "flight" forms a NOMINAL, which combines with "that" into an NP; the Verb reading of "Book" then combines with this NP into a VP, while the Noun reading leads to dead ends.]

Comparing Top-Down and Bottom-Up


Top-down parsers never explore illegal parses (never explore trees that can't form an S) -- but waste time on trees that can never match the input
Bottom-up parsers never explore trees inconsistent with the input -- but waste time exploring illegal parses (trees with no S root)
For both: how to explore the search space?
Pursuing all parses in parallel, or ...? Which rule to apply next? Which node to expand next?

Needed: some middle ground.

Problems with Bottom-up and Top-down Parsing


Problems with left-recursive rules like NP -> NP PP: we don't know how many times the recursion is needed
Pure bottom-up or top-down parsing is inefficient because it generates and explores too many structures which in the end turn out to be invalid (several grammar rules applicable - interim ambiguity)
Combine the top-down and bottom-up approaches: start with the sentence; use rules top-down (look-ahead); read input; try to find the shortest path from the input to the highest unparsed constituent (from left to right)
Chart parsing / Earley parser (see the sketch below)
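
A minimal sketch (assuming NLTK and its EarleyChartParser are available): a chart-based parser handles the left-recursive rules (NOMINAL -> NOMINAL PP, VP -> VP PP) that send a pure top-down parser into infinite recursion, and returns all PP-attachment parses:

import nltk
from nltk.parse import EarleyChartParser

grammar = nltk.CFG.fromstring("""
S -> NP VP | Aux NP VP | VP
NP -> Pronoun | Det NOMINAL | ProperNoun
NOMINAL -> Noun | NOMINAL Noun | NOMINAL PP
VP -> Verb | Verb NP | Verb NP PP | Verb PP | VP PP
PP -> Preposition NP
Det -> 'that' | 'this' | 'a'
Noun -> 'book' | 'flight' | 'meal' | 'money'
Verb -> 'book' | 'include' | 'prefer'
Aux -> 'does'
ProperNoun -> 'Houston' | 'TWA'
Preposition -> 'from' | 'to' | 'on' | 'near' | 'through'
Pronoun -> 'I' | 'she' | 'me'
""")

parser = EarleyChartParser(grammar)
for tree in parser.parse("book that flight from Houston".split()):
    print(tree)    # several parses, differing in where the PP attaches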

Basic Top-Down (TD) parser


Practically infeasible to generate all trees in parallel, so use a depth-first strategy
When arriving at a tree that is inconsistent with the input, backtrack to the most recently generated but still unexplored tree
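
A minimal sketch of this depth-first, backtracking top-down strategy, on a toy grammar without left recursion; the Python generator backtracks automatically when a partial tree cannot match the input:

# Grammar format: nonterminal -> list of right-hand sides.
GRAMMAR = {
    "S":          [["NP", "VP"], ["Aux", "NP", "VP"], ["VP"]],
    "NP":         [["Det", "NOMINAL"], ["ProperNoun"]],
    "NOMINAL":    [["Noun"]],
    "VP":         [["Verb"], ["Verb", "NP"]],
    "Det":        [["that"], ["this"], ["a"]],
    "Noun":       [["book"], ["flight"], ["meal"]],
    "Verb":       [["book"], ["include"], ["prefer"]],
    "Aux":        [["does"]],
    "ProperNoun": [["Houston"], ["TWA"]],
}

def parse(symbols, words):
    """Yield every way the symbol sequence can derive exactly `words` (depth-first)."""
    if not symbols:
        if not words:
            yield []                                   # everything consumed
        return
    first, rest = symbols[0], symbols[1:]
    if first in GRAMMAR:                               # nonterminal: expand, rule by rule
        for rhs in GRAMMAR[first]:
            for subtrees in parse(rhs + rest, words):
                yield [(first, subtrees[:len(rhs)])] + subtrees[len(rhs):]
    elif words and words[0] == first:                  # terminal: must match the input
        for subtrees in parse(rest, words[1:]):
            yield [first] + subtrees

for tree in parse(["S"], "does this flight include a meal".split()):
    print(tree[0])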


Presented By Dhruba Baishya (mu10co01) Kallol Ghose (mu10co14)

WORD SENSE DISAMBIGUATION

Introduction
Word Sense Disambiguation
Word sense disambiguation is the task of examining word tokens and their context and specifying which sense of each word is being used.
Some examples:


"A bank can hold the investments in a custodial account in the client's name."
"As agriculture burgeons on the east bank, the river will shrink even more."
Here, the word BANK is used in different senses.

Motivation
One of the central challenges in NLP; ubiquitous across all languages.
Needed in:
Machine translation - for correct lexeme choice
Information retrieval - resolving queries
Information extraction - for accurate analysis of text

Approaches for Word Sense Disambiguation

Knowledge-based approaches:
Selectional restriction
Dictionary based

Machine learning approaches:
Supervised machine learning
Semi-supervised machine learning
Unsupervised learning


Selectional Restriction
Uses hierarchical type information about arguments to rule out inappropriate senses and hence reduce the amount of ambiguity.
Selectional restrictions are used to block the formation of component meaning representations that violate them.

Selectional Restriction
Senses for the word DISHES:
Sense 1 (Artifact): "In our house, everybody has a career and none of them includes washing dishes with soap," he says.
Sense 2 (Food): Ms Chen works efficiently, stir-frying several simple dishes, including fried chicken.

Selectional Restriction
Senses for the word SERVE:
Sense 1 (Geographical Entity): What is the name of the airlines that serve Denver?
Sense 2 (Food): Well, there was a time when they served green-lipped mussels from New Zealand.
Sense 3 (Meal Designator): Which one of the airliners serves breakfast?

Selectional Restriction
Consider an example:
"I'm looking for a restaurant that serves vegetarian dishes."
serve has senses requiring a Geographical Entity, Food, or Meal Designator object; dishes has Artifact and Food senses.
Only the Food sense of serve is compatible with a sense of dishes, so the Food senses of both are selected.
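
A toy sketch of this use of selectional restrictions; the mini type inventory and sense names below are invented for illustration, not taken from any real lexicon:

VERB_RESTRICTIONS = {            # verb sense -> semantic type required of its object
    "serve-food":      "Food",
    "serve-territory": "GeographicalEntity",
    "serve-meal":      "MealDesignator",
}
NOUN_SENSES = {                  # noun -> {sense: semantic type}
    "dishes": {"dishes-artifact": "Artifact", "dishes-food": "Food"},
}

def compatible_senses(verb_sense, noun):
    """Keep only the noun senses whose type satisfies the verb's restriction."""
    required = VERB_RESTRICTIONS[verb_sense]
    return [s for s, t in NOUN_SENSES[noun].items() if t == required]

# "... a restaurant that serves vegetarian dishes": the Food-requiring sense of
# serve rules out the Artifact sense of dishes.
print(compatible_senses("serve-food", "dishes"))   # ['dishes-food']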


Dictionary Based Approach


All the sense definitions of the word are retrieved from the dictionary.
Each sense is compared to the dictionary definitions of the remaining words in the context.
The sense with the highest overlap with these context words is chosen as the correct sense.

Example

Consider the phrase "pine cone":
Pine: 1. kinds of evergreen tree with needle-shaped leaves; 2. waste away through sorrow or illness.
Cone: 1. solid body which narrows to a point; 2. something of this shape, whether solid or hollow; 3. fruit of certain evergreen trees.
The sense of cone whose definition overlaps most with the definitions of pine is "fruit of certain evergreen trees" (shared word: evergreen), so that sense is chosen.
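
A minimal sketch of this simplified-Lesk idea, using the toy glosses above; a real system would take glosses from a machine-readable dictionary such as WordNet:

DICTIONARY = {
    "pine": {
        "pine#1": "kinds of evergreen tree with needle shaped leaves",
        "pine#2": "waste away through sorrow or illness",
    },
    "cone": {
        "cone#1": "solid body which narrows to a point",
        "cone#2": "something of this shape whether solid or hollow",
        "cone#3": "fruit of certain evergreen trees",
    },
}
STOPWORDS = {"of", "or", "a", "with", "which", "this", "to", "whether", "through", "the"}

def content_words(gloss):
    return {w for w in gloss.split() if w not in STOPWORDS}

def disambiguate(word, context_words):
    """Pick the sense whose gloss overlaps most with the glosses of the context words."""
    context = set()
    for w in context_words:
        for gloss in DICTIONARY.get(w, {}).values():
            context |= content_words(gloss)
    return max(DICTIONARY[word],
               key=lambda sense: len(content_words(DICTIONARY[word][sense]) & context))

print(disambiguate("cone", ["pine"]))   # cone#3 -- shares "evergreen" with the pine glosses
print(disambiguate("pine", ["cone"]))   # pine#1 -- shares "evergreen" with the cone glosses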


Machine Learning Approaches


Systems are trained to perform the task of word sense disambiguation.
In this method a classifier is learned, which is then used to assign senses.

Machine learning approaches(continued)


The input consists of:
the word to be disambiguated
the context text in which the word is embedded
A fixed set of linguistic features relevant to the learning task is extracted.

Machine learning approaches(continued)


Features are of two classes:
Co-location
Co-occurrence

Co-location
Collocational (co-location) features encode information about words at specific positions located to the left or right of the target word.
Example:
"An electric guitar and bass player stand off to one side, not really part of the scene."

Co-occurrence
Features consist of data about neighboring words; the words themselves serve as features.
For the earlier example, a co-occurrence vector over the 12 most frequent words from a collection of "bass" sentences drawn from the WSJ corpus has the features
[fishing, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band]
and takes the value [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0] for the example sentence.
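
A minimal sketch of both feature types for the bass example; the window size and feature layout are illustrative choices:

VOCAB = ["fishing", "big", "sound", "player", "fly", "rod",
         "pound", "double", "runs", "playing", "guitar", "band"]

def collocational_features(tokens, i, window=2):
    """Words at fixed positions around the target word tokens[i]
    (real systems also add their POS tags)."""
    feats = {}
    for offset in range(-window, window + 1):
        if offset != 0:
            j = i + offset
            feats[f"w[{offset:+d}]"] = tokens[j] if 0 <= j < len(tokens) else None
    return feats

def cooccurrence_vector(tokens, vocab=VOCAB):
    """Binary bag-of-words vector over a fixed vocabulary."""
    present = set(tokens)
    return [1 if w in present else 0 for w in vocab]

sentence = ("an electric guitar and bass player stand off to one side "
            "not really part of the scene").split()
i = sentence.index("bass")
print(collocational_features(sentence, i))   # {'w[-2]': 'guitar', 'w[-1]': 'and', 'w[+1]': 'player', 'w[+2]': 'stand'}
print(cooccurrence_vector(sentence))         # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]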

Supervised learning approach


Words are labeled with their senses.
Example:
She pays 3% interest/INTEREST-MONEY on the loan.
He showed a lot of interest/INTEREST-CURIOSITY in the painting.

Supervised learning approach(continued)


Supervised approaches are therefore similar to tagging:
given a corpus tagged with senses,
define features that indicate one sense over another,
learn a model that predicts the correct sense given the features.
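
A minimal sketch of the supervised setup as text classification (assuming scikit-learn is installed); the tiny sense-tagged "corpus" below is invented for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_contexts = [
    "she pays three percent interest on the loan",
    "the bank raised the interest rate again",
    "he showed a lot of interest in the painting",
    "her interest in linguistics began at school",
]
train_senses = ["INTEREST-MONEY", "INTEREST-MONEY",
                "INTEREST-CURIOSITY", "INTEREST-CURIOSITY"]

# Bag-of-words features + Naive Bayes classifier over sense labels
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_contexts, train_senses)

print(model.predict(["the loan charges very high interest"]))   # expected: ['INTEREST-MONEY']
print(model.predict(["they took an interest in modern art"]))   # expected: ['INTEREST-CURIOSITY']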

Semi-supervised approach
The problem with the supervised approach is the need for a large tagged training set.
The semi-supervised approach relies on a relatively small number of labeled instances.
These labeled instances are used as seeds to train an initial classifier.
This initial classifier is then used to extract a larger training set.
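
A minimal sketch of the bootstrapping loop (in the spirit of Yarowsky's algorithm), reusing the scikit-learn pipeline idea from above; the seeds, unlabeled examples, and confidence threshold are illustrative assumptions:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

seeds = [("caught a huge bass while fishing", "bass-fish"),
         ("the bass player tuned his guitar", "bass-music")]
unlabeled = [
    "fishing for bass on the lake at dawn",
    "the band needs a new bass guitar",
    "bass and trout live in this river",
    "she plays bass in a jazz band",
]

labeled = list(seeds)
for _ in range(3):                                   # a few bootstrapping rounds
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit([text for text, _ in labeled], [sense for _, sense in labeled])
    probs = model.predict_proba(unlabeled)
    # Add only the confidently classified unlabeled examples to the training set
    confident = [(text, model.classes_[p.argmax()])
                 for text, p in zip(unlabeled, probs) if p.max() > 0.6]
    labeled = list(seeds) + confident

for text, sense in labeled:
    print(sense, "<-", text)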

Unsupervised approach
Unlabeled instances are taken as input and grouped into clusters according to a similarity metric.
These clusters are then labeled by hand with word senses.
The main disadvantage is that the induced senses are not well defined.
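
A minimal sketch of the unsupervised route (assuming scikit-learn is installed): cluster bag-of-words vectors of the target word's contexts; a human then labels each cluster with a sense:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

contexts = [
    "caught a huge bass while fishing on the lake",
    "bass and trout swim in the cold river",
    "the bass player joined the jazz band",
    "a new bass guitar and amplifier",
]

vectors = TfidfVectorizer().fit_transform(contexts)          # context -> tf-idf vector
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
for context, cluster in zip(contexts, kmeans.labels_):
    print(cluster, context)        # cluster ids; a human then names the senses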
