Morphology
Morpheme = "minimal meaning-bearing unit in a language"
Morphology handles the formation of words by combining morphemes
Morphological parsing = the task of recognizing the morphemes inside a word (e.g., hands, foxes, children)
Important for many tasks: machine translation, information retrieval, lexicography, any further processing (e.g., part-of-speech tagging)
base form (stem), e.g., believe
affixes (suffixes, prefixes, infixes), e.g., un-, -able, -ly
Word formation processes: derivation, compounding, cliticization
Inflectional Morphology
word stem + grammatical morpheme, e.g., cat + s; applies only to nouns, verbs, and some adjectives
Nouns
plural: regular: +s, +es; irregular: mouse - mice, ox - oxen; rules for exceptions: e.g. -y -> -ies as in butterfly - butterflies
possessive: +'s, +'
Verbs
main verbs (sleep, eat, walk), modal verbs (can, will, should), primary verbs (be, have, do)
Adjectives also take suffixes, plus combinations, like unhappiest, unhappiness. Distinguish different adjective classes, which can or cannot take certain inflectional or derivational forms, e.g. no negation for big.
Verb Clitics
Methods, Algorithms
Stemming
Stemming algorithms strip off word affixes and yield the stem only, with no additional information (like plural, 3rd person etc.)
Used, e.g., in web search engines
Famous stemming algorithm: the Porter stemmer
Stemming
Reduce tokens to root form of words to recognize morphological variation.
computer, computational, computation all reduced to same token compute
Correct morphological analysis is language specific and can be complex. Stemming blindly strips off known affixes (prefixes and suffixes) in an iterative fashion.
for example compressed and compression are both accepted as equivalent to compress.
After stemming: "for exampl compres and compres are both accept as equival to compres."
Porter Stemmer
Simple procedure for removing known affixes in English without using a dictionary.
Can produce unusual stems that are not English words.
May conflate (reduce to the same token) words that are actually distinct.
Does not recognize all morphological derivations.
Typical rules in the Porter stemmer:
  sses -> ss
  ies -> i
  ational -> ate
  tional -> tion
  ing -> (removed)
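As a rough illustration (not the full Porter algorithm, which adds measure-based conditions and many more steps), a minimal Python sketch of just the suffix rules listed above:

```python
# Minimal sketch of Porter-style suffix stripping, covering only the
# rules listed above; the real Porter stemmer has many more steps.
RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ational", "ate"),
    ("tional", "tion"),
    ("ing", ""),
]

def strip_suffix(word):
    # Apply the first matching rewrite rule, longest suffix first.
    for suffix, replacement in sorted(RULES, key=lambda r: -len(r[0])):
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

if __name__ == "__main__":
    for w in ["caresses", "ponies", "relational", "conditional", "walking"]:
        print(w, "->", strip_suffix(w))
```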
Stemming Problems
Errors of Commission (words wrongly conflated):
organization -> organ
doing -> doe
Generalization -> Generic
Numerical -> numerous
Policy -> police
Errors of Omission (related words not conflated):
European / Europe
analysis / analyzes
Matrices / matrix
Noise / noisy
sparse / sparsity
Simple Tokenization
Analyze text into a sequence of discrete tokens (words). Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token.
However, frequently they are not.
Simplest approach is to ignore all numbers and punctuation and use only case-insensitive unbroken strings of alphabetic characters as tokens. More careful approach:
Separate ? ! ; : [ ] ( ) < >
Care with . and - (why? when?)
Care with apostrophes and quotation marks
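A minimal sketch of both tokenization approaches using regular expressions:

```python
import re

# Simplest approach: lowercase the text and keep only unbroken runs of
# alphabetic characters as tokens, discarding numbers and punctuation.
def simple_tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

# Slightly more careful variant: also keep numbers, and split off the
# punctuation marks listed above as separate tokens.
def careful_tokenize(text):
    return re.findall(r"\w+|[?!;:()\[\]<>.,-]", text)

print(simple_tokenize("Republican senators met on 3/12/91, didn't they?"))
# ['republican', 'senators', 'met', 'on', 'didn', 't', 'they']
print(careful_tokenize("Republican senators met on 3/12/91, didn't they?"))
```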
Punctuation
Children's: use language-specific mappings to normalize (e.g. Anglo-Saxon genitive of nouns; verb contractions: won't -> wo n't)
State-of-the-art: break up hyphenated sequences
Periods and acronyms: U.S.A. vs. USA, a.out
Numbers
3/12/91 Mar. 12, 1991 55 B.C. B-52 100.2.86.144
Generally, don't index numbers as text (one exception: creation dates for docs)
Lemmatization
Reduce inflectional/derivational forms to base form; direct impact on vocabulary size. E.g.,
am, are, is -> be
car, cars, car's, cars' -> car
the boy's cars are different colors -> the boy car be different color
How to do this? Need a list of grammatical rules + a list of irregular words: Children -> child, spoken -> speak
Practical implementation: use WordNet's morphstr function
Perl: WordNet::QueryData (first returned value from validForms function)
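A rough Python analogue (a sketch assuming NLTK and its WordNet data are installed) of the morphstr/validForms lookups mentioned above:

```python
# Uses NLTK's WordNet interface; requires nltk.download('wordnet').
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# morphy applies WordNet's detachment rules plus its irregular-form lists.
print(wn.morphy("children", wn.NOUN))         # child
print(wn.morphy("spoken", wn.VERB))           # speak
print(lemmatizer.lemmatize("cars", pos="n"))  # car
print(lemmatizer.lemmatize("are", pos="v"))   # be
```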
Morphological Processing
Knowledge
lexical entry: stem plus possible prefixes and suffixes, plus word classes, e.g. endings for verb forms (see tables above)
rules: how to combine stem and affixes, e.g. add -s to form the plural of a noun, as in dogs
orthographic rules: spelling, e.g. double consonant as in mapping
IP notice: some slides for today from: Jim Martin, Sandiway Fong, Dan Klein
LING 180 Autumn 2007
Outline
Probability: basic probability, conditional probability, Bayes Rule
Language modeling: discounting (Good-Turing and Katz backoff), interpolation, unknown words, advanced LM algorithms
1. Introduction to Probability
Experiment (trial): a repeatable procedure with well-defined possible outcomes
Sample Space (S): the set of all possible outcomes (finite or infinite)
Example: coin toss experiment, possible outcomes: S = {heads, tails}
Example: die toss experiment, possible outcomes: S = {1,2,3,4,5,6}
Introduction to Probability
Definition of sample space depends on what we are asking
Sample Space (S): the set of all possible outcomes
Example: die toss experiment for whether the number is even or odd; possible outcomes: {even, odd}, not {1,2,3,4,5,6}
More definitions
An event is any subset of outcomes from the sample space.
Example: die toss experiment; let A represent the event that the outcome of the die toss is divisible by 3: A = {3,6}, a subset of the sample space S = {1,2,3,4,5,6}.
Example: draw a card from a deck; suppose the sample space is S = {heart, spade, club, diamond} (four suits); let A represent the event of drawing a heart and B the event of drawing a red card: A = {heart}, B = {heart, diamond}.
Introduction to Probability
Some definitions
Counting
Suppose operation o_i can be performed in n_i ways; then a sequence of k operations o_1 o_2 ... o_k can be performed in n_1 × n_2 × ... × n_k ways.
Example: a die toss experiment has 6 possible outcomes; if two dice are thrown at the same time, the number of sample points in the sample space is 6 × 6 = 36.
Definition of Probability
The probability law assigns to an event A a nonnegative number, called P(A) (the probability of A), that encodes our knowledge or belief about the collective likelihood of all the elements of A. The probability law must satisfy certain properties.
Probability Axioms
Nonnegativity
P(A) >= 0, for every event A
Additivity
If A and B are two disjoint events, then the probability of their union satisfies: P(A U B) = P(A) + P(B)
Normalization
The probability of the entire sample space S is equal to 1, i.e., P(S) = 1.
An example
An experiment involving a single coin toss: there are two possible outcomes, H and T, so the sample space S is {H,T}.
If the coin is fair, we should assign equal probabilities to the 2 outcomes; since they have to sum to 1:
P({H}) = 0.5, P({T}) = 0.5, P({H,T}) = P({H}) + P({T}) = 1.0
Another example
Experiment involving 3 coin tosses; the outcome is a 3-long string of H or T: S = {HHH,HHT,HTH,HTT,THH,THT,TTH,TTT}
Assume each outcome is equiprobable.
What is the probability of the event that exactly 2 heads occur?
A = {HHT,HTH,THH}
P(A) = P({HHT}) + P({HTH}) + P({THH}) = 1/8 + 1/8 + 1/8 = 3/8
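A tiny check of this computation (a sketch that just enumerates the eight equiprobable outcomes):

```python
# Enumerate the 8 outcomes of 3 coin tosses and compute P(exactly 2 heads).
from itertools import product

outcomes = list(product("HT", repeat=3))           # 8 outcomes
event = [o for o in outcomes if o.count("H") == 2]
print(len(event) / len(outcomes))                  # 0.375 = 3/8
```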
Uniform distribution
Probability definitions
In summary:
11
If we draw a card from a deck, then put it back, and draw a card from the deck again, what is the probability that both drawn cards are hearts?
(Since the card is replaced, the two draws are independent: P = 1/4 × 1/4 = 1/16.)
What is the probability that at least one head occurs, if the coin is biased with P(h) = 1/3?
Sample space = {hh, ht, th, tt} (h = heads, t = tails)
Sample points / probabilities:
hh: 1/3 × 1/3 = 1/9
ht: 1/3 × 2/3 = 2/9
th: 2/3 × 1/3 = 2/9
tt: 2/3 × 2/3 = 4/9
P(at least one head) = 1/9 + 2/9 + 2/9 = 5/9
P(drawing a verb) = (# of ways to get a verb) / (all words)
How to compute each of these:
All words = just count all the words in the dictionary
# of ways to get a verb = number of words which are verbs
If a dictionary has 50,000 entries, and 10,000 are verbs, then P(V) is 10000/50000 = 1/5 = .20
Conditional Probability
A way to reason about the outcome of an experiment based on partial information
In a word guessing game the first letter for the word is a t. What is the likelihood that the second letter is an h? How likely is it that a person has a disease given that a medical test was negative? A spot shows up on a radar screen. How likely is it that it corresponds to an aircraft?
16
More precisely
Given an experiment, a corresponding sample space S, and a probability law, suppose we know that the outcome is within some given event B. We want to quantify the likelihood that the outcome also belongs to some other given event A. We need a new probability law that gives us the conditional probability of A given B, written P(A|B).
An intuition
A is "it's raining now"; P(A) in dry California is .01
B is "it was raining ten minutes ago"
P(A|B) means: what is the probability of it raining now if it was raining 10 minutes ago?
P(A|B) is probably way higher than P(A); perhaps P(A|B) is .10
Intuition: the knowledge about B should change our estimate of the probability of A.
Conditional probability
One of the following 30 items is chosen at random What is P(X), the probability that it is an X? What is P(X|red), the probability that it is an X given that it is red?
19
Conditional Probability
Let A and B be events. P(B|A) = the probability of event B occurring given that event A occurs.
Definition: P(B|A) = P(A ∩ B) / P(A)
Conditional probability
P(A|B) = P(A ∩ B) / P(B)
Or equivalently: P(A|B) = P(A, B) / P(B)
Independence
What is P(A,B) if A and B are independent?
P(A,B) = P(A) × P(B) iff A, B independent.
P(heads, tails) = P(heads) × P(tails) = .5 × .5 = .25
Note: P(A|B) = P(A) iff A, B independent
Also: P(B|A) = P(B) iff A, B independent
Bayes Theorem
P(B|A) = P(A|B) P(B) / P(A)
Swap the conditioning Sometimes easier to estimate one kind of dependence than the other
Deriving Bayes Rule:
P(B|A) = P(A ∩ B) / P(A)    and    P(A|B) = P(A ∩ B) / P(B)
so P(A|B) P(B) = P(A ∩ B)   and    P(B|A) P(A) = P(A ∩ B)
therefore P(A|B) P(B) = P(B|A) P(A)
and so P(A|B) = P(B|A) P(A) / P(B)
Summary
Probability Conditional Probability Independence Bayes Rule
SWBD:
~20,000 wordform types, 2.4 million wordform tokens
Let N = number of tokens, V = vocabulary = number of types General wisdom: V > O(sqrt(N))
27
Language Modeling
We want to compute P(w1,w2,w3,w4,w5...wn), the probability of a word sequence.
Alternatively, we want to compute P(w5|w1,w2,w3,w4): the probability of a word given some previous words.
The model that computes P(W) or P(wn|w1,w2...wn-1) is called the language model.
A better term for this would be "The Grammar", but "Language model" or LM is standard.
Computing P(W)
How to compute this joint probability:
P(the,other,day,I,was,walking,along,and,saw,a,lizard)
29
Recall the definition of conditional probability: P(A|B) = P(A ∩ B) / P(B)
Rewriting: P(A ∩ B) = P(A|B) P(B)
More generally (the chain rule): P(w1,w2,...,wn) = P(w1) P(w2|w1) P(w3|w1,w2) ... P(wn|w1,...,wn-1)
P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)
Unfortunately
There are a lot of possible sentences. We'll never be able to get enough data to compute the statistics for those long prefixes:
P(lizard | the,other,day,I,was,walking,along,and,saw,a)
or P(the | its water is so transparent that)
Markov Assumption
Make the simplifying assumption
P(lizard | the,other,day,I,was,walking,along,and,saw,a) ≈ P(lizard | a)
Or maybe
P(lizard | the,other,day,I,was,walking,along,and,saw,a) ≈ P(lizard | saw,a)
Markov Assumption
So for each component in the product replace with the approximation (assuming a prefix of N)
P(wn | w1..wn-1) ≈ P(wn | wn-N+1..wn-1)
Bigram version:
P(wn | w1..wn-1) ≈ P(wn | wn-1)
An example
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
This is the Maximum Likelihood Estimate, because it is the one which maximizes P(Training set|Model)
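A minimal sketch of these maximum likelihood bigram estimates computed from the three-sentence corpus above:

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams = Counter()
bigrams = Counter()
for sent in corpus:
    words = sent.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_mle(word, prev):
    # P(word | prev) = C(prev word) / C(prev)
    return bigrams[(prev, word)] / unigrams[prev]

print(p_mle("I", "<s>"))    # 2/3 ≈ 0.67
print(p_mle("Sam", "<s>"))  # 1/3 ≈ 0.33
print(p_mle("am", "I"))     # 2/3 ≈ 0.67
```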
37
Suppose the word "Chinese" occurs 400 times in a corpus of a million words (the Brown corpus).
What is the probability that a random word from some other text will be "Chinese"?
The MLE estimate is 400/1,000,000 = .0004
This may be a bad estimate for some other corpus
But it is the estimate that makes it most likely that Chinese will occur 400 times in a million word corpus.
Shakespeare as corpus
N = 884,647 tokens, V = 29,066 types
Shakespeare produced 300,000 bigram types out of V^2 = 844 million possible bigrams: so, 99.96% of the possible bigrams were never seen (have zero entries in the table)
Quadrigrams are worse: what's coming out looks like Shakespeare because it is Shakespeare
Evaluation
We train the parameters of our model on a training set. How do we evaluate how well our model works? We look at the model's performance on some new data. This is what happens in the real world: we want to know how our model performs on data we haven't seen. So we use a test set, a dataset which is different from our training set. Then we need an evaluation metric to tell us how well our model is doing on the test set. One such metric is perplexity (introduced below).
So
As a temporary solution, in order to run experiments:
To evaluate N-grams we often use an intrinsic evaluation, an approximation called perplexity.
But perplexity is a poor approximation unless the test data looks just like the training data.
So it is generally only useful in pilot experiments (generally not sufficient to publish), but it is helpful to think about.
Perplexity
Perplexity is the probability of the test set (assigned by the language model), normalized by the number of words:
PP(W) = P(w1 w2 ... wN)^(-1/N)
By the chain rule:
PP(W) = (∏_i 1 / P(wi | w1 ... wi-1))^(1/N)
For bigrams:
PP(W) = (∏_i 1 / P(wi | wi-1))^(1/N)
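A sketch of the bigram-perplexity computation; `p_bigram` here is an assumed placeholder for whatever (smoothed) bigram estimator is in use:

```python
import math

def perplexity(words, p_bigram):
    # Assumes every needed bigram has non-zero probability
    # (otherwise smoothing is required, as discussed below).
    log_prob = 0.0
    for prev, word in zip(words, words[1:]):
        log_prob += math.log(p_bigram(word, prev))
    n = len(words) - 1
    return math.exp(-log_prob / n)
```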
Smoothing is like Robin Hood: Steal from the rich and give to the poor (in probability mass)
58
Laplace smoothing
Also called add-one smoothing: just add one to all the counts! Very simple.
MLE estimate: P(wi) = ci / N
Laplace estimate: P_Laplace(wi) = (ci + 1) / (N + V)
Reconstructed counts: ci* = (ci + 1) × N / (N + V)
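A minimal sketch of the Laplace-smoothed bigram estimate, reusing the `unigrams`/`bigrams` counters from the MLE sketch earlier; V is the vocabulary size:

```python
def p_laplace(word, prev, unigrams, bigrams, V):
    # P(word | prev) = (C(prev word) + 1) / (C(prev) + V)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)
```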
Laplace-smoothed bigrams
61
Reconstituted counts
62
But Laplace smoothing is not used for N-grams, as we have much better methods. Despite its flaws, Laplace (add-k) is however still used to smooth other probabilistic models in NLP, especially for pilot studies and in domains where the number of zeros isn't so huge.
63
The Good-Turing intuition is to use the count of things we've seen once to help estimate the count of things we've never seen.
Good-Turing Intuition
Notation: Nx is the frequency-of-frequency-x
So N10=1, N1=3, etc
All other estimates are adjusted (down) to give probabilities for unseen
General Good-Turing formula: c* = (c+1) × N_{c+1} / N_c
For the example: c*(eel) = c*(1) = (1+1) × N_2 / N_1 = 2 × 1/3 = 2/3
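A tiny sketch of the adjusted-count formula on toy frequency-of-frequency counts matching the example above (N1 = 3, N2 = 1 are assumed values):

```python
# Good-Turing adjusted count c* = (c+1) * N_{c+1} / N_c,
# using frequency-of-frequency counts N_c.
def good_turing_count(c, freq_of_freq):
    return (c + 1) * freq_of_freq.get(c + 1, 0) / freq_of_freq[c]

N = {1: 3, 2: 1}                 # 3 items seen once, 1 item seen twice
print(good_turing_count(1, N))   # (1+1) * 1/3 ≈ 0.67
```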
Complications
In practice, assume large counts (c>k for some k) are reliable:
Also: we assume singleton counts c=1 are unreliable, so treat N-grams with count of 1 as if they were count=0 Also, need the Nk to be non-zero, so we need to smooth (interpolate) the Nk counts before computing c* from them
70
Interpolation
Simple interpolation: mix the trigram, bigram, and unigram estimates with weights that sum to 1:
P_hat(wn | wn-2, wn-1) = λ1 P(wn | wn-2, wn-1) + λ2 P(wn | wn-1) + λ3 P(wn),   with λ1 + λ2 + λ3 = 1
Katz Backoff
75
If we used MLE probabilities but backed off to a lower-order model when the MLE probability is zero, we would be adding extra probability mass, and the total probability would be greater than 1. So the higher-order probabilities must be discounted to reserve mass for the backoff.
Intuition of backoff+discounting
How much probability to assign to all the zero trigrams?
Use GT or other discounting algorithm to tell us
78
At decoding time
Create a fixed lexicon L of size V.
At the text normalization phase, any training word not in L is changed to <UNK>.
Now we train its probabilities like a normal word.
At test time: use <UNK> probabilities for any word not seen in training.
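A one-function sketch of the <UNK> normalization step:

```python
# Map any token outside the fixed lexicon to <UNK> before counting/scoring.
def normalize(tokens, lexicon, unk="<UNK>"):
    return [t if t in lexicon else unk for t in tokens]
```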
Practical Issues
We do everything in log space
Avoid underflow (also adding is faster than multiplying)
80
ARPA format
Advanced LM stuff
Current best smoothing algorithm
Kneser-Ney smoothing
Other stuff
Variable-length n-grams
Class-based n-grams (clustering, hand-built classes)
Cache LMs
Topic-based LMs
Sentence mixture models
Skipping LMs
Parser-based LMs
Summary
Probability
Basic probability, conditional probability, Bayes Rule
Language modeling: discounting (Good-Turing and Katz backoff), interpolation, unknown words, advanced LM algorithms
Today's Lecture
N-Grams; bigram models; smoothing of N-gram models
Simple N-Grams
An N-gram model uses the previous N-1 words to predict the next one:
unigrams: P(dog) bigrams: P(dog | big) trigrams: P(dog | the big) quadrigrams: P(dog | the big dopey)
Chain Rule
Conditional probability: P(A|B) = P(A ∩ B) / P(B)
So: P(A ∩ B) = P(A|B) P(B) and P(A ∩ B) = P(B|A) P(A)
Example "the dog": P(the dog) = P(the) × P(dog | the)
Chain Rule
the probability of a word sequence is the probability of a conjunctive event.
P(w1..wn) = P(w1) P(w2|w1) P(w3|w1 w2) ... P(wn|w1..wn-1) = ∏_{k=1..n} P(wk | w1..wk-1)
Markov Assumption
P(wn | w1..wn-1) ≈ P(wn | wn-N+1..wn-1)
P(wn) can be approximated using only N-1 previous words of context This lets us collect statistics in practice Markov models are the class of probabilistic models that assume that we can predict the probability of some future unit without looking too far into the past Order of a Markov model: length of prior context
Note:
p(w1 w2) = p(w1) p(w2|w1)
p(w1 w2 w3) = p(w1) p(w2|w1) p(w3|w1 w2)
...
p(w1 w2 w3...wn) = p(w1) p(w2|w1) p(w3|w1 w2) ... p(wn|w1...wn-2 wn-1)
It's not easy to collect (meaningful) statistics on p(wn|wn-1 wn-2...w1) for all possible word sequences.
Markov Assumption: finite length history.
1st order Markov Model: just look at the previous word only (not all the preceding words):
p(w1 w2 w3...wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)
p(wn|wn-1) is a lot easier to estimate well than p(wn|w1...wn-2 wn-1).
Bigram approximation: 1st order Markov Model, p(w1 w2 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)
Trigram approximation: 2nd order Markov Model, just look at the preceding two words only:
p(w1 w2 w3...wn) = p(w1) p(w2|w1) p(w3|w1 w2) ... p(wn|w1...wn-2 wn-1)
              ≈ p(w1) p(w2|w1) p(w3|w1 w2) p(w4|w2 w3) ... p(wn|wn-2 wn-1)
Note: p(wn|wn-2 wn-1) is a lot easier to estimate well than p(wn|w1...wn-2 wn-1), but harder than p(wn|wn-1).
Corpora
Corpora are (generally online) collections of text and speech e.g.
Brown Corpus (1M words) Wall Street Journal and AP News corpora ATIS, Broadcast News (speech) TDT (text and speech) Switchboard, Call Home (speech) TRAINS, FM Radio (speech)
8
Probabilities are based on counting things, so . What should we count? Words, word classes, word senses, speech acts ?
What is a word?
e.g., are cat and cats the same word? September and Sept? zero and oh? Is seventy-two one word or two? AT&T? Where do we find the things to count?
10
Terminology
Sentence: unit of written language Utterance: unit of spoken language Wordform: the inflected form that appears in the corpus Lemma: lexical forms having the same stem, part of speech, and word sense Types: number of distinct words in a corpus (vocabulary size) Tokens: total number of words
11
Probabilities come from a training corpus, which is used to design the model.
narrow corpus: probabilities don't generalize general corpus: probabilities don't reflect task or domain
12
Simple N-Grams
An N-gram model uses the previous N-1 words to predict the next one:
unigrams: P(dog) bigrams: P(dog | big) trigrams: P(dog | the big) quadrigrams: P(dog | the big red)
13
Using N-Grams
Recall that
P(wn | w1..n-1) ≈ P(wn | wn-N+1..n-1)
P(sentence) can be approximated by multiplying all the bigram probabilities in the sequence:
P(I want to eat Chinese food) ≈ P(I | <start>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese) P(<end> | food)
Chain Rule
P(A|B) = P(A ∩ B) / P(B)
P(A ∩ B) = P(A|B) P(B)
P(The big) = P(the) P(big | the)
Example
The big red dog:
P(The) * P(big | the) * P(red | the big) * P(dog | the big red)
Better: use P(The | <Beginning of sentence>), written as P(The | <S>); also <end> for end of sentence.
General Case
P(w1 w2 ... wn) = P(w1) P(w2|w1) ... P(wn | w1 ... wn-1)
Unfortunately
That doesn't help, since it's unlikely we'll ever gather the right statistics for the prefixes.
18
Markov Assumption
Assume that the entire prefix history isn't necessary. In other words, an event doesn't depend on all of its history, just a fixed-length near history.
19
Markov Assumption
So for each component in the product replace each with its approximation (assuming a prefix (Previous words) of N)
P(wn | w1..wn-1) ≈ P(wn | wn-N+1..wn-1)
Unigrams: P(dog)
Bigrams: P(dog | red)
Trigrams: P(dog | big red)
Four-grams: P(dog | the big red)
Bigram: P(wn | wn-1)
N-gram: P(wn | wn-N+1..wn-1)
An example
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
P(I | <s>) = 2/3 = 0.67        P(Sam | <s>) = 1/3 = 0.33
P(am | I) = 2/3 = 0.67         P(</s> | Sam) = 1/2 = 0.5
P(Sam | am) = 1/2 = 0.5        P(do | I) = 1/3 = 0.33
Some BERP bigram probabilities:
<start> I      .25        Want some            .04
<start> I'd    .06        Want Thai            .01
<start> Tell   .04        To eat               .26
<start> I'm    .02        To have              .14
I want         .32        To spend             .09
I would        .29        To be                .02
I don't        .08        British food         .60
I have         .04        British restaurant   .15
Want to        .65        British cuisine      .01
Want a         .05        British lunch        .01
Example: P(wn | wn-1) = C(wn-1 wn) / C(wn-1), computed from the unigram frequencies and bigram counts.
Example
P(I want to eat British food) = P(I|<start>) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British)
= .25 * .32 * .65 * .26 * .001 * .60 = 0.0000081 (different from textbook)
vs. P(I want to eat Chinese food) = .00015
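Reproducing the arithmetic above (the individual bigram probabilities are taken from the tables in this section):

```python
# P(I|<start>) * P(want|I) * P(to|want) * P(eat|to) * P(British|eat) * P(food|British)
probs = [.25, .32, .65, .26, .001, .60]
p = 1.0
for x in probs:
    p *= x
print(p)   # ≈ 0.0000081
```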
29
Note on Example
30
P(want | I) = .32
P(to | want) = .65
P(eat | to) = .26
P(food | Chinese) = .56
P(lunch | eat) = .055
Some Observations
What about:
P(I | I) = .0023   ("I I I I want")
P(I | want) = .0025   ("I want I want")
P(I | food) = .013   ("the kind of food I want is ...")
You don't really do all those multiplies. The numbers are too small and lead to underflow. Convert the probabilities to logs and then do additions. To get the real probability (if you need it), go back to the antilog.
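The same sentence-probability computation done in log space, a minimal sketch:

```python
import math

# Sum of logs instead of product of probabilities; antilog only at the end.
log_p = sum(math.log(x) for x in [.25, .32, .65, .26, .001, .60])
print(log_p)            # about -11.72
print(math.exp(log_p))  # about 8.1e-06
```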
34
Generation
35
BERP
I want
want to
to eat
eat Chinese
Chinese food
food .
You can collect reliable statistics on these events with relatively small samples
You might have to wait a long time to gather statistics on the low frequency events
37
They represent low-frequency events that simply didn't occur in the corpus.
38
Problem
Let's assume we're using N-grams. How can we assign a probability to a sequence where one of the component n-grams has a value of zero? Assume all the words are known and have been seen.
Possible solutions: go to a lower-order n-gram (back off from bigrams to unigrams), or replace the zero with something else (smoothing).
Add-One
Make the zero counts 1. Justification: they're just events you haven't seen yet. If you had seen them, you would only have seen them once, so make the count equal to 1.
Add-one: Example
Unsmoothed bigram counts (BERP corpus; rows = 1st word, columns = 2nd word):
          I     want   to    eat   Chinese  food  lunch   Total (N)
I         8     1087   0     13    0        0     0       3437
want      3     0      786   0     6        8     6       1215
to        3     0      10    860   3        0     12      3256
eat       0     0      2     0     19       2     52      938
Chinese   2     0      0     0     0        120   1       213
food      19    0      17    0     0        0     0       1506
lunch     4     0      0     0     0        1     0       459
Unsmoothed bigram probabilities, P(2nd word | 1st word) (I column omitted; each full row sums to 1):
          want   to     eat               Chinese  food   lunch
I         .32    0      .0038 (13/3437)   0        0      0
want      0      .65    0                 .0049    .0066  .0049
to        0      .0031  .26               .00092   0      .0037
eat       0      .0021  0                 .020     .0021  .055
Chinese   0      0      0                 0        .56    .0047
food      0      .011   0                 0        0      0
lunch     0      0      0                 0        .0022  0
Add-one smoothed bigram counts (add 1 to every cell; Total = N + V):
          I     want   to    eat   Chinese  food  lunch   Total (N+V)
I         9     1088   1     14    1        1     1       5053
want      4     1      787   1     7        9     7       2831
to        4     1      11    861   4        1     13      4872
eat       1     1      3     1     20       3     53      2554
Chinese   3     1      1     1     1        121   2       1829
food      20    1      18    1     1        1     1       3122
lunch     5     1      1     1     1        2     1       2075
Add-one smoothed bigram probabilities (Chinese, food, and lunch columns):
          Chinese  food     lunch
I         .0002    .0002    .0002
want      .0025    .0032    .0025
to        .00082   .00021   .0027
eat       .0078    .0012    .021
Chinese   .00055   .066     .0011
food      .00032   .00032   .00032
lunch     .00048   .00096   .00048
V = 1616 word types in the BERP lexicon.
Smoothed P(I eat) = (C(I eat) + 1) / (C(I) + V) = (13 + 1) / (3437 + 1616) = 0.0028
Add-One Smoothing
Bigram
Add 1 to all frequency counts.
Smoothed probability: p(wn|wn-1) = (C(wn-1 wn) + 1) / (C(wn-1) + V)
Adjusted (reconstituted) count: c* = (C(wn-1 wn) + 1) × C(wn-1) / (C(wn-1) + V)
Original frequencies: see the unsmoothed bigram count table above.
Add-one reconstituted counts:
          I      want     to       eat      Chinese  food     lunch
I         6.12   740.05   0.68     9.52     0.68     0.68     0.68
want      1.72   0.43     337.76   0.43     3.00     3.86     3.00
to        2.67   0.67     7.35     575.41   2.67     0.67     8.69
eat       0.37   0.37     1.10     0.37     7.35     1.10     19.47
Chinese   0.35   0.12     0.12     0.12     0.12     14.09    0.23
food      9.65   0.48     8.68     0.48     0.48     0.48     0.48
lunch     1.11   0.22     0.22     0.22     0.22     0.44     0.22
Remark: add-one causes large changes in some frequencies due to the relative size of V (1616), e.g. want to: 786 -> 338 = (786 + 1) × 1215 / (1215 + 1616)
In general: c_i* = (c_i + 1) × N / (N + V)
Bigrams starting with "Chinese" are boosted by a factor of 8 (1829/213); compare the unsmoothed and add-one smoothed bigram count tables above.
Corpus of 22,000,000 bigrams; vocabulary of 273,266 words (i.e. 74,674,306,756 possible bigrams).
74,671,100,000 bigrams were unseen, and each unseen bigram was given a frequency of 0.000295.
(Comparison of MLE frequencies f = 0, 1, 2, 3, 4, 5 with their add-one estimates: the add-one estimates are too high for the unseen bigrams and too low for the bigrams that were actually seen.)
Witten-Bell Smoothing
Unigram
Equate zero-frequency items with frequency-1 items: use the frequency of things seen once to estimate the frequency of things we haven't seen yet (smaller impact than Add-One).
A zero-frequency word (unigram) is an event that hasn't happened yet.
Count the number of word types (T) we've observed in the corpus.
For a word w with zero frequency: p(w) = T / (Z × (N + T))
where Z = number of zero-frequency words and N = size of the corpus
Distributing
The total probability mass to be distributed to unseen events is T / (N + T).
The number of events with count zero is Z.
So distributing evenly gives each unseen event probability (1/Z) × T / (N + T).
Bigram
p(wn|wn-1) = C(wn-1 wn) / C(wn-1)   (original)
p(wn|wn-1) = T(wn-1) / (Z(wn-1) × (N(wn-1) + T(wn-1)))   for zero bigrams (after Witten-Bell)
T(wn-1) = number of bigram types beginning with wn-1
Z(wn-1) = number of unseen bigram types beginning with wn-1
        = total number of possible bigrams beginning with wn-1 minus the ones we've seen
        = V - T(wn-1)
(estimated zero-bigram frequency)
Witten-Bell Smoothing
Bigram
Use the frequency (count) of things seen once to estimate the frequency (count) of things we haven't seen yet.
Estimated zero-bigram frequency (count): (T(wn-1) / Z(wn-1)) × C(wn-1) / (C(wn-1) + T(wn-1))
where T(wn-1) = number of bigram types beginning with wn-1 and Z(wn-1) = number of unseen bigram types beginning with wn-1.
Original frequencies: see the unsmoothed bigram count table above.
Witten-Bell smoothed (reconstituted) counts:
          I       want       to        eat       Chinese   food      lunch
I         7.785   1057.763   0.061     12.650    0.061     0.061     0.061
want      2.823   0.046      739.729   0.046     5.647     7.529     5.647
to        2.885   0.084      9.616     826.982   2.885     0.084     11.539
eat       0.073   0.073      1.766     0.073     16.782    1.766     45.928
Chinese   1.828   0.011      0.011     0.011     0.011     109.700   0.914
food      18.019  0.051      16.122    0.051     0.051     0.051     0.051
lunch     3.643   0.026      0.026     0.026     0.026     0.911     0.026
P(wi | wx) = (1 / Z(wx)) × T(wx) / (N(wx) + T(wx))
where Z(wx) = number of bigrams starting with wx that were not seen, and N(wx) = actual frequency (count) of bigrams beginning with wx.
Thank you
52
By : Mu10co05 Mu10co22
Top-down parsing: a parsing strategy where one first looks at the highest level of the parse tree and works down the parse tree by using the rewriting rules of a formal grammar.
(Top-down search space: the start symbol S is expanded by the rules S -> NP VP, S -> Aux NP VP, and S -> VP, and each resulting non-terminal (NP, VP, Det, NOMINAL, PropN, Pronoun, ...) is expanded in turn.)
Left-recursion
It refers to any recursive non-terminal that, when it produces a sentential form containing itself, the new copy of itself appears at the left edge of the production. Examples: NP -> NP PP, VP -> VP PP, S -> S and S.
Ambiguity
Ambiguities: PP Attachment
Coordination ambiguity: different sets of phrases can be conjoined by a conjunction like "and". For example, "old men and women" can be [old [men and women]] or [[old men] and [women]].
The parser often builds valid parse trees for portions of the input, then discards them during backtracking, only to find that it has to rebuild them again. For "a flight from Indianapolis to Houston on TWA", the number of times each constituent is built:
a flight: 4
from Indianapolis: 3
to Houston: 2
on TWA: 1
a flight from Indianapolis: 3
a flight from Indianapolis to Houston: 2
a flight from Indianapolis to Houston on TWA: 1
Thank You!!
Word Sense Disambiguation: examine word tokens in context and specify exactly which sense of each word is being used.
MOTIVATION
One of the central challenges in NLP is ambiguity. Compositional semantic analyzers ignore the issue
SOLUTION:
Knowledge Based Approach. Rely on knowledge resources like WordNet. May use grammar rules for disambiguation. May use hand coded rules for disambiguation.
Necessity of a Mechanism
A system should include a mechanism which
ensures that only nouns with appropriate properties are associated with given verbs in a given context.
Selectional Restrictions:
Used to perform disambiguation. Used to rule out inappropriate senses and thereby reduce the amount of ambiguity present during semantic analysis. Introduced by Fodor and Katz (1963). In a selectional restriction, a predicate (e.g., a verb) imposes semantic constraints on its arguments (e.g., nouns).
A violation of selectional restrictions is the explanation for the oddity of the following examples:
Kim ate a motor-bike. There is an apple bathing in the water. The stone thinks.
The verb "think" selects a subject with the feature "human", which means that words labeled "inanimate" are rejected.
eat(Human, Fruit)
Peter ate fish. => eat(Human, Edible)
Kim's mother bought a motor-bike of chocolate at the bakery. Kim ate the motorbike immediately. => eat(Animate, Physical Object)
Sense 1: "In our house, everybody has a career and none of them includes washing dishes," he says.
Sense 2: "In her tiny kitchen at home, Ms. Chen works efficiently, stir-frying several simple dishes, including braised pig's ears and chicken livers with green peppers."
The dishwasher read the article.
Limitations:
There are examples like the following where the available selectional restrictions are too general to uniquely select a correct sense: "What kind of dishes do you recommend?"
"... realized you can't eat gold for lunch if you're hungry."
"In his two championship trials, Mr. Kulkarni ate glass on an empty stomach, accompanied only by water and tea."
The sentences themselves are not semantically ill-formed, despite the violation of eat's selectional restrictions.
"... to try to eat Afghanistan."
Here the typical selectional restrictions on both "kill" and "eat" will eliminate all possible literal senses, leaving the system with no possible meanings. It brings the semantic analysis to a halt!
THANK YOU
14
Semantics
Going beyond syntax
1/27
Semantics
Relationship between surface form and meaning What is meaning? Lexical semantics Syntax and semantics
2/27
What is meaning?
Reference to worlds
Objects, relationships, events, characteristics Meaning as truth
Understanding
Inference, implication Modelling beliefs
Meaning as action
Understanding activates procedures
3/27
Lexical semantics
Meanings of individual words
Sense and Reference What do we understand by the word lion ? Is a toy lion a lion? Is a toy gun a gun? Is a fake gun a gun?
Grammatical meaning
What do we understand by the lion, lions, the lions, as in The lion is a dangerous animal The lion was about to attack
4/27
Lexical relations
Lexical meanings can be defined in terms of other words
Synonyms, antonyms, broader/narrower terms synsets Part-whole relationships (often reflect realworld relationships) Linguistic usage (style, register) also a factor
5/27
Semantic features
Meanings can be defined (to a certain extent) in terms of distinctive features
e.g. man = adult, male, human
6/27
Types of representation
1. Syntactic relations
The man shot an elephant with his gun
Types of representation
2. Deep syntax
The man shot an elephant with his gun An elephant was shot by the man with his gun
instr
Types of representation
3. Semantic roles, deep cases
The man shot an elephant with his gun An elephant was shot by the man with his gun The man used his gun to shoot an elephant
instr
Types of representation
4. Event representation, semantic network
The man shot an elephant with his gun An elephant was shot by the man with his gun The man used his gun to shoot an elephant
(Semantic network: a "shooting" event node with roles shooter = the man, shot-thing = an elephant, instr = his gun, the gun possessed by the man.)
Types of representation
5. Predicate calculus
The man shot an elephant with his gun An elephant was shot by the man with his gun The man used his gun to shoot an elephant The man owned the gun which he used to shoot an elephant The man used the gun which he owned to shoot an elephant
event(e) & time(e,past) & pred(e,shoot) & man(A) & the(A) & a(B) & elephant(B) & shoot(A,B) & the(C) & gun(C) & own(A,C) & use(A,C,e)
11/27
Types of representation
6. Conceptual dependency (Schank) John punched Mary
12/27
Types of representation
7. Semantic formulae (Wilks)
((THIS((PLANT STUFF)SOUR)) ((((((THRU PART)OBJE) (NOTUSE *ANI))GOAL) ((MAN USE) (OBJE THING) )))
door
13/27
Conceptual meaning may be much deeper Somewhere in between a good example is Wilks preference semantics: especially good for metaphor
15/27
Linguistic issues
Words and Concepts
Objects, properties, actions n, adj, v Language allows us to be vague (e.g. toy gun)
Semantic primitives what are they? Meaning equivalence when do two things mean the same? Grammatical meaning
Tense vs. time Topic and focus Quantifiers, plurals, etc.
16/27
Linguistic issues
There are many other similarly tricky linguistic phenomena
Modality (could, should, would, must, may) Aspect (completed, ongoing, resulting) Determination (the, a, some, all, none) Fuzzy sets (often, some, many, usually)
17/27
Lexical semantics
Lexical relations (familiar to linguists) have an impact on NLP systems
Homonymy: word-sense selection; homophones in speech-based systems
Polysemy: understanding narrow senses
Synonymy: lexical equivalence
Ontology: structures the vocabulary, holds much of the knowledge used by clever systems
WordNet
Began as a psycholinguistic theory of how the brain organizes its vocabulary (Miller)
Organizes vocabulary into synsets, hierarchically arranged together with other relations (hyp[er|o]nymy, isa, member, antonyms, entailments)
Turns out to be very useful for many applications
Has been replicated for many languages (sometimes just translated!)
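A quick look at synsets and hypernyms through NLTK's WordNet interface (assumes the wordnet data has been downloaded):

```python
from nltk.corpus import wordnet as wn

for syn in wn.synsets("lion"):
    print(syn.name(), "-", syn.definition())

# Hypernym link for one sense of "lion".
lion = wn.synset("lion.n.01")
print([s.name() for s in lion.hypernyms()])
```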
19/27
(Overview: language analysis — parsing (syntax) and semantics — as components supporting search.)
What is Syntax?
Study of the structure of language
Specifically, the goal is to relate an interface to the morphological component to an interface to a semantic component
Note: the interface to the morphological component may look like written text
Representational device is tree structure
(Example pipeline: morphology maps /waddyasai/ to "what do you say"; syntax produces the structure say(subj: you, obj: what); semantics yields a representation like λx. say(you, x).)
Sample Grammar
Grammar (S, NT, T, P): sentence symbol S ∈ NT; part-of-speech and syntactic constituents ⊂ NT; grammar rules P: NT -> (NT ∪ T)*
S -> NP VP            (statement)
S -> Aux NP VP        (question)
S -> VP               (command)
NP -> Det Nominal
NP -> Proper-Noun
Nominal -> Noun | Noun Nominal | Nominal PP
VP -> Verb | Verb NP | Verb PP | Verb NP PP
PP -> Prep NP
Det -> that | this | a
Noun -> book | flight | meal | money
Proper-Noun -> Houston | American Airlines | TWA
Verb -> book | include | prefer
Aux -> does
Prep -> from | to | on
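A sketch that encodes a simplified fragment of this grammar (PP rules and multi-word proper nouns omitted) and parses a sentence with NLTK's chart parser; assumes nltk is installed:

```python
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP | Aux NP VP | VP
NP -> Det Nominal | ProperNoun
Nominal -> Noun | Noun Nominal
VP -> Verb | Verb NP
Det -> 'that' | 'this' | 'a'
Noun -> 'book' | 'flight' | 'meal' | 'money'
ProperNoun -> 'Houston' | 'TWA'
Verb -> 'book' | 'include' | 'prefer'
Aux -> 'does'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("book that flight".split()):
    tree.pretty_print()
```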
Structure in Strings
Some words: the a small nice big very boy girl sees likes Some good sentences:
the boy likes a girl the small girl likes the big girl a very small nice boy sees a very nice boy
Can we find subsequences of words (constituents) which in some way behave alike?
Node Labels?
( ((the) boy) likes ((a) girl) ) Choose constituents so each one has one non-bracketed word: the head Group words by distribution of constituents they head (part-of-speech, POS):
Noun (N), verb (V), adjective (Adj), adverb (Adv), determiner (Det)
Node Labels
(((the/Det) boy/N) likes/V ((a/Det) girl/N))
(Phrase-structure tree for "the boy likes a girl": S dominating NP (DetP "the", "boy") and "likes" with NP (DetP "a", "girl").)
Types of Nodes
(((the/Det) boy/N) likes/V ((a/Det) girl/N))
(The same phrase-structure tree: nonterminal symbols (S, NP, DetP) = constituents; terminal symbols = words.)
Determining Part-of-Speech
noun or adjective?
a blue seat a very blue seat this seat is blue a child seat *a very child seat *this seat is child
blue and child are not the same POS blue is Adj, child is Noun
Many subtypes:
eat/V eat/VB, eat/VBP, eats/VBZ, ate/VBD, eaten/VBN, eating/VBG, Reflect morphological form & syntactic function
(The same tree with node labels that include the head word and its POS, e.g. likes/V, girl/N, a/Det; only leaf nodes are labeled with words!)
(The lexicalized tree is representationally equivalent if each nonterminal node has one lexical daughter, its head.)
Types of Dependency
(Dependency tree for "the very small boy sometimes likes a girl": likes/V has Subj boy/N (with function word the/Det and Adj small/Adj, itself modified by very/Adv), Obj girl/N (with function word a/Det), and Adj(unct) sometimes/Adv.)
Grammatical Relations
Types of relations between words
Arguments: subject, object, indirect object, prepositional object Adjuncts: temporal, locative, causal, manner, Function Words
Subcategorization
List of arguments of a word (typically, a verb), with features about realization (POS, perhaps case, verb form etc.), in canonical order Subject-Object-IndObj. Example:
like: N-N, N-V(to-inf)
see: N, N-N, N-N-V(inf)
Context-Free Grammars
Defined in formal language theory (computer science): terminals, nonterminals, start symbol, rules
A string-rewriting system: start with the start symbol, rewrite using rules, done when only terminals are left
NOT A LINGUISTIC THEORY, just a formal device
CFG: Example
Many possible CFGs for English, here is an example (fragment):
S -> NP VP
VP -> V NP
NP -> DetP N | AdjP NP
AdjP -> Adj | Adv AdjP
N -> boy | girl
V -> sees | likes
Adj -> big | small
Adv -> very
DetP -> a | the
the very small boy likes a girl
Derivations in a CFG
Using the grammar above, "the boy likes a girl" is derived by rewriting one nonterminal at each step:
S
=> NP VP
=> DetP N VP
=> the boy VP
=> the boy likes NP
=> the boy likes a girl
The derivation history corresponds to the phrase-structure tree [S [NP [DetP the] [N boy]] [VP [V likes] [NP [DetP a] [N girl]]]].
Derivations of CFGs
String rewriting system: we derive a string (=derived structure) But derivation history represented by phrase-structure tree (=derivation structure)!
Poor John - NP
his watch - NP
lost his watch - VP
Poor John lost his watch - S
(Tree: [S [NP [ADJ Poor] [N John]] [VP [V lost] [NP [PRON his] [N watch]]]])
Exercise
Analyze the following constructions using Labeled Bracketing as well as tree Diagram: I saw a man with a telescope. She touched the cat with a feather. The girl pushed the large box towards the huge door. The man in the blue shirt is waiting for you.
Ambiguity examples
Ambiguities: PP Attachment
Attachments
I cleaned the dishes from dinner. I cleaned the dishes with detergent. I cleaned the dishes in my pajamas. I cleaned the dishes in the sink.
Syntactic Ambiguities 1
Prepositional Phrases
They cooked the beans in the pot on the stove with handles.
Complement Structure
The tourists objected to the guide that they couldn't hear. She knows you like the back of her hand.
Syntactic Ambiguities 2
Modifier scope within NPs
impractical design requirements plastic cup holder
Coordination scope
Small rats and mice can squeeze into holes or cracks in the wall.
Solution: We need mechanisms that allow us to find the most likely parse(s)
Statistical parsing lets us work with very loose grammars that admit millions of parses for sentences but to still quickly find the best parse(s)
Introduction
Parsing = associating a structure (parse tree) with an input string using a grammar. CFGs are declarative; they don't specify how the parse tree will be constructed. Example: "Book that flight." Parse trees are used in:
Grammar checking Semantic analysis Machine translation Question answering Information extraction
(Parse tree for "Book that flight": [S [VP [Verb Book] [NP [Det that] [NOMINAL [Noun flight]]]]])
Parsing
Parsing with CFGs refers to the task of assigning correct trees to input strings. "Correct" here means a tree that covers all and only the elements of the input and has an S at the top. It doesn't actually mean that the system can select the correct tree from among the possible trees.
Parsing
Parsing involves a search, which involves making choices. Some parsing techniques:
Top-down parsing
Bottom-up parsing
For Now
Assume
You have all the words already in some buffer
The input isn't POS tagged
We won't worry about morphological analysis
All the words are known
Parsing as search
A Grammar to be used in our example
S -> NP VP
S -> Aux NP VP
S -> VP
NP -> Pronoun
NP -> Det NOMINAL
NP -> Proper-Noun
NOMINAL -> Noun
NOMINAL -> NOMINAL PP
VP -> Verb
VP -> Verb NP
VP -> Verb NP PP
VP -> Verb PP
VP -> VP PP
PP -> Preposition NP
Det -> that | this | a
Noun -> book | flight | meal | money
Verb -> book | include | prefer
Aux -> does
Proper-Noun -> Houston | TWA
Preposition -> from | to | on | near | through
Parsing as search
Book that flight.
Two types of constraints on the parses:
1. some that come from the input string
2. others that come from the grammar
(Target parse: [S [VP [Verb Book] [NP [Det that] [NOMINAL [Noun flight]]]]])
Top-Down Parsing
(Top-down search space: start from S, expand with S -> NP VP, S -> Aux NP VP, S -> VP, then expand NP as Pronoun, Det NOMINAL, or Proper-Noun, VP as Verb or Verb NP, and so on, until the fringe matches the input words.)
Bottom-Up Parsing
Since we want trees that cover the input words start with trees that link up with the words in the right way. Then work your way up from there.
(Bottom-up search: start from the words "Book that flight", build Noun/Verb over "Book", Det over "that", Noun over "flight", then NOMINAL and NP, and work upward toward S.)
Introduction
Word Sense Disambiguation
It is to examine word tokens in their context and specify which sense of each word is being used.
Motivation
One of the central challenges in NLP; ambiguity is ubiquitous across all languages. Needed in:
Machine Translation - for correct lexeme choice
Information Retrieval - for resolving queries
Information Extraction - for accurate analysis of text
Selectional Restriction
Uses hierarchical type information about arguments. Rules out inappropriate senses and hence reduces the amount of ambiguity. Selectional restrictions are used to block the formation of component meaning representations which violate them.
Selectional Restriction
Senses for the word DISHES:
Sense 1 (Artifact): "In our house, everybody has a career and none of them includes washing dishes with soap," he says.
Sense 2 (Food): "Ms. Chen works efficiently, stir-frying several simple dishes, including fried chicken."
Selectional Restriction
Senses for the word SERVE:
Sense 1 (Geographical Entity): "What is the name of the airlines that serve Denver?"
Sense 2 (Food): "Well, there was a time when they served green-lipped mussels from New Zealand."
Sense 3 (Meal Designator): "Which one of the airliners serves breakfast?"
Selectional Restriction
Consider an Example
I'm looking for a restaurant that serves vegetarian dishes.
"serve" allows: Geographical Entity, Food, or Meal Designator; "dishes" allows: Artifact or Food.
The only sense consistent with both selectional restrictions is Food, so both words are disambiguated at once.
Dictionary-based approach (the Lesk idea): sense definitions are taken from a machine-readable dictionary. Each of the senses is compared to the dictionary definitions of the remaining words in the context. The sense with the highest overlap with these context words is chosen as the correct sense.
Example: disambiguating "cone" in "pine cone" — the sense glossed as "fruit of certain evergreen trees" overlaps with the definition of "pine" (a kind of evergreen tree), so that sense is selected.
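A minimal sketch of the dictionary-overlap idea (simplified Lesk) using WordNet glosses as the dictionary; assumes NLTK's wordnet data is installed:

```python
from nltk.corpus import wordnet as wn

def simplified_lesk(word, context_words):
    # Pick the sense whose gloss shares the most words with the context.
    context = set(w.lower() for w in context_words)
    best, best_overlap = None, -1
    for sense in wn.synsets(word):
        gloss = set(sense.definition().lower().split())
        overlap = len(gloss & context)
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

print(simplified_lesk("cone", "pine cone fruit of an evergreen tree".split()))
```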
Supervised approach to disambiguation: in this method a classifier is learned, which is then used to assign senses to new instances.
Co-location
Looks for information about words of specific positions
Co-occurrence
Features consist of data about neighboring words; the words themselves serve as features. For the earlier example, a co-occurrence vector consisting of the 12 most frequent words from a collection of "bass" sentences drawn from the WSJ corpus has the following features:
fishing, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band
and a particular instance might be encoded as: 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0
Semi-supervised approach
The problem with the supervised approach is the need for a large tagged training set. The semi-supervised approach relies on a relatively small number of labeled instances. These labeled instances are used as seeds to train an initial classifier, which is then used to extract a larger training set from unlabeled data.
Unsupervised approach
Unlabeled instances are taken as input and are grouped into clusters according to a similarity metric. These clusters are then labeled by hand with word senses. The main disadvantage is that the senses are not well defined.