Morphology
Morpheme = "minimal meaning-bearing unit in a language"
Morphology handles the formation of words by combining morphemes
Morphological parsing = the task of recognizing the morphemes inside a word (e.g., hands, foxes, children)
Important for many tasks: machine translation, information retrieval, lexicography, any further processing (e.g., part-of-speech tagging)
base form (stem), e.g., believe
affixes (suffixes, prefixes, infixes), e.g., un-, -able, -ly
Word formation processes: derivation, compounding, cliticization
Inflectional Morphology
word stem + grammatical morpheme, e.g., cat + s; applies only to nouns, verbs, and some adjectives
Nouns
plural: regular: +s, +es; irregular: mouse - mice, ox - oxen; rules for exceptions: e.g. -y -> -ies as in butterfly - butterflies
possessive: +'s, +'
Verbs
main verbs (sleep, eat, walk), modal verbs (can, will, should), primary verbs (be, have, do)
Adjectives also take suffixes, plus combinations, like unhappiest, unhappiness. Distinguish different adjective classes, which can or cannot take certain inflectional or derivational forms, e.g. no negation for big.
Verb Clitics
Methods, Algorithms
Stemming
Stemming algorithms strip off word affixes and yield the stem only, with no additional information (like plural, 3rd person etc.)
Used, e.g., in web search engines
Famous stemming algorithm: the Porter stemmer
Stemming
Reduce tokens to root form of words to recognize morphological variation.
computer, computational, computation all reduced to same token compute
Correct morphological analysis is language specific and can be complex. Stemming blindly strips off known affixes (prefixes and suffixes) in an iterative fashion.
for example compressed and compression are both accepted as equivalent to compress.
After stemming: "for exampl compres and compres are both accept as equival to compres."
Porter Stemmer
Simple procedure for removing known affixes in English without using a dictionary.
Can produce unusual stems that are not English words.
May conflate (reduce to the same token) words that are actually distinct.
Does not recognize all morphological derivations.
Typical rules in the Porter stemmer:
  sses -> ss
  ies -> i
  ational -> ate
  tional -> tion
  ing -> (removed)
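As a rough illustration (not the full Porter algorithm, which adds measure-based conditions and many more steps), a minimal Python sketch of just the suffix rules listed above:

```python
# Minimal sketch of Porter-style suffix stripping, covering only the
# rules listed above; the real Porter stemmer has many more steps.
RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ational", "ate"),
    ("tional", "tion"),
    ("ing", ""),
]

def strip_suffix(word):
    # Apply the first matching rewrite rule, longest suffix first.
    for suffix, replacement in sorted(RULES, key=lambda r: -len(r[0])):
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

if __name__ == "__main__":
    for w in ["caresses", "ponies", "relational", "conditional", "walking"]:
        print(w, "->", strip_suffix(w))
```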
Stemming Problems
Errors of Commission (words wrongly conflated):
organization -> organ
doing -> doe
Generalization -> Generic
Numerical -> numerous
Policy -> police
Errors of Omission (related words not conflated):
European / Europe
analysis / analyzes
Matrices / matrix
Noise / noisy
sparse / sparsity
Simple Tokenization
Analyze text into a sequence of discrete tokens (words). Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token.
However, frequently they are not.
Simplest approach is to ignore all numbers and punctuation and use only case-insensitive unbroken strings of alphabetic characters as tokens. More careful approach:
Separate ? ! ; : [ ] ( ) < >
Care with . and - (why? when?)
Care with apostrophes and quotation marks
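A minimal sketch of both tokenization approaches using regular expressions:

```python
import re

# Simplest approach: lowercase the text and keep only unbroken runs of
# alphabetic characters as tokens, discarding numbers and punctuation.
def simple_tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

# Slightly more careful variant: also keep numbers, and split off the
# punctuation marks listed above as separate tokens.
def careful_tokenize(text):
    return re.findall(r"\w+|[?!;:()\[\]<>.,-]", text)

print(simple_tokenize("Republican senators met on 3/12/91, didn't they?"))
# ['republican', 'senators', 'met', 'on', 'didn', 't', 'they']
print(careful_tokenize("Republican senators met on 3/12/91, didn't they?"))
```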
Punctuation
Children's: use language-specific mappings to normalize (e.g. Anglo-Saxon genitive of nouns; verb contractions: won't -> wo n't)
State-of-the-art: break up hyphenated sequences
Periods and acronyms: U.S.A. vs. USA, a.out
Numbers
3/12/91 Mar. 12, 1991 55 B.C. B-52 100.2.86.144
Generally, don't index numbers as text (one exception: creation dates for docs)
Lemmatization
Reduce inflectional/derivational forms to base form; direct impact on vocabulary size. E.g.,
am, are, is -> be
car, cars, car's, cars' -> car
the boy's cars are different colors -> the boy car be different color
How to do this? Need a list of grammatical rules + a list of irregular words: Children -> child, spoken -> speak
Practical implementation: use WordNet's morphstr function
Perl: WordNet::QueryData (first returned value from validForms function)
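A rough Python analogue (a sketch assuming NLTK and its WordNet data are installed) of the morphstr/validForms lookups mentioned above:

```python
# Uses NLTK's WordNet interface; requires nltk.download('wordnet').
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# morphy applies WordNet's detachment rules plus its irregular-form lists.
print(wn.morphy("children", wn.NOUN))         # child
print(wn.morphy("spoken", wn.VERB))           # speak
print(lemmatizer.lemmatize("cars", pos="n"))  # car
print(lemmatizer.lemmatize("are", pos="v"))   # be
```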
Morphological Processing
Knowledge
lexical entry: stem plus possible prefixes and suffixes, plus word classes, e.g. endings for verb forms (see tables above)
rules: how to combine stem and affixes, e.g. add -s to form the plural of a noun, as in dogs
orthographic rules: spelling, e.g. double consonant as in mapping
IP notice: some slides for today from: Jim Martin, Sandiway Fong, Dan Klein
LING 180 Autumn 2007
Outline
Probability: basic probability, conditional probability, Bayes Rule
Language modeling: discounting (Good-Turing and Katz backoff), interpolation, unknown words, advanced LM algorithms
1. Introduction to Probability
Experiment (trial): a repeatable procedure with well-defined possible outcomes
Sample Space (S): the set of all possible outcomes (finite or infinite)
Example: coin toss experiment, possible outcomes: S = {heads, tails}
Example: die toss experiment, possible outcomes: S = {1,2,3,4,5,6}
Introduction to Probability
Definition of sample space depends on what we are asking
Sample Space (S): the set of all possible outcomes
Example: die toss experiment for whether the number is even or odd; possible outcomes: {even, odd}, not {1,2,3,4,5,6}
More definitions
An event is any subset of outcomes from the sample space.
Example: die toss experiment; let A represent the event that the outcome of the die toss is divisible by 3: A = {3,6}, a subset of the sample space S = {1,2,3,4,5,6}.
Example: draw a card from a deck; suppose the sample space is S = {heart, spade, club, diamond} (four suits); let A represent the event of drawing a heart and B the event of drawing a red card: A = {heart}, B = {heart, diamond}.
Introduction to Probability
Some definitions
Counting
Suppose operation o_i can be performed in n_i ways; then a sequence of k operations o_1 o_2 ... o_k can be performed in n_1 × n_2 × ... × n_k ways.
Example: a die toss experiment has 6 possible outcomes; if two dice are thrown at the same time, the number of sample points in the sample space is 6 × 6 = 36.
Definition of Probability
The probability law assigns to an event A a nonnegative number, called P(A) (the probability of A), that encodes our knowledge or belief about the collective likelihood of all the elements of A. The probability law must satisfy certain properties.
Probability Axioms
Nonnegativity
P(A) >= 0, for every event A
Additivity
If A and B are two disjoint events, then the probability of their union satisfies: P(A U B) = P(A) + P(B)
Normalization
The probability of the entire sample space S is equal to 1, i.e., P(S) = 1.
An example
An experiment involving a single coin toss: there are two possible outcomes, H and T, so the sample space S is {H,T}.
If the coin is fair, we should assign equal probabilities to the 2 outcomes; since they have to sum to 1:
P({H}) = 0.5, P({T}) = 0.5, P({H,T}) = P({H}) + P({T}) = 1.0
Another example
Experiment involving 3 coin tosses; the outcome is a 3-long string of H or T: S = {HHH,HHT,HTH,HTT,THH,THT,TTH,TTT}
Assume each outcome is equiprobable.
What is the probability of the event that exactly 2 heads occur?
A = {HHT,HTH,THH}
P(A) = P({HHT}) + P({HTH}) + P({THH}) = 1/8 + 1/8 + 1/8 = 3/8
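A tiny check of this computation (a sketch that just enumerates the eight equiprobable outcomes):

```python
# Enumerate the 8 outcomes of 3 coin tosses and compute P(exactly 2 heads).
from itertools import product

outcomes = list(product("HT", repeat=3))           # 8 outcomes
event = [o for o in outcomes if o.count("H") == 2]
print(len(event) / len(outcomes))                  # 0.375 = 3/8
```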
Uniform distribution
Probability definitions
In summary:
11
If we draw a card from a deck, then put it back, and draw a card from the deck again, what is the probability that both drawn cards are hearts?
(Since the card is replaced, the two draws are independent: P = 1/4 × 1/4 = 1/16.)
What is the probability that at least one head occurs, if the coin is biased with P(h) = 1/3?
Sample space = {hh, ht, th, tt} (h = heads, t = tails)
Sample points / probabilities:
hh: 1/3 × 1/3 = 1/9
ht: 1/3 × 2/3 = 2/9
th: 2/3 × 1/3 = 2/9
tt: 2/3 × 2/3 = 4/9
P(at least one head) = 1/9 + 2/9 + 2/9 = 5/9
P(drawing a verb) = (# of ways to get a verb) / (all words)
How to compute each of these:
All words = just count all the words in the dictionary
# of ways to get a verb = number of words which are verbs
If a dictionary has 50,000 entries, and 10,000 are verbs, then P(V) is 10000/50000 = 1/5 = .20
Conditional Probability
A way to reason about the outcome of an experiment based on partial information
In a word guessing game the first letter for the word is a t. What is the likelihood that the second letter is an h? How likely is it that a person has a disease given that a medical test was negative? A spot shows up on a radar screen. How likely is it that it corresponds to an aircraft?
16
More precisely
Given an experiment, a corresponding sample space S, and a probability law, suppose we know that the outcome is within some given event B. We want to quantify the likelihood that the outcome also belongs to some other given event A. We need a new probability law that gives us the conditional probability of A given B, written P(A|B).
An intuition
A is "it's raining now"; P(A) in dry California is .01
B is "it was raining ten minutes ago"
P(A|B) means: what is the probability of it raining now if it was raining 10 minutes ago?
P(A|B) is probably way higher than P(A); perhaps P(A|B) is .10
Intuition: the knowledge about B should change our estimate of the probability of A.
Conditional probability
One of the following 30 items is chosen at random What is P(X), the probability that it is an X? What is P(X|red), the probability that it is an X given that it is red?
19
Conditional Probability
Let A and B be events. P(B|A) = the probability of event B occurring given that event A occurs.
Definition: P(B|A) = P(A ∩ B) / P(A)
Conditional probability
P(A|B) = P(A ∩ B) / P(B)
Or equivalently: P(A|B) = P(A, B) / P(B)
Independence
What is P(A,B) if A and B are independent?
P(A,B) = P(A) × P(B) iff A, B independent.
P(heads, tails) = P(heads) × P(tails) = .5 × .5 = .25
Note: P(A|B) = P(A) iff A, B independent
Also: P(B|A) = P(B) iff A, B independent
Bayes Theorem
P(B|A) = P(A|B) P(B) / P(A)
Swap the conditioning Sometimes easier to estimate one kind of dependence than the other
Deriving Bayes Rule:
P(B|A) = P(A ∩ B) / P(A)    and    P(A|B) = P(A ∩ B) / P(B)
so P(A|B) P(B) = P(A ∩ B)   and    P(B|A) P(A) = P(A ∩ B)
therefore P(A|B) P(B) = P(B|A) P(A)
and so P(A|B) = P(B|A) P(A) / P(B)
Summary
Probability Conditional Probability Independence Bayes Rule
SWBD:
~20,000 wordform types, 2.4 million wordform tokens
Let N = number of tokens, V = vocabulary = number of types General wisdom: V > O(sqrt(N))
27
Language Modeling
We want to compute P(w1,w2,w3,w4,w5...wn), the probability of a word sequence.
Alternatively, we want to compute P(w5|w1,w2,w3,w4): the probability of a word given some previous words.
The model that computes P(W) or P(wn|w1,w2...wn-1) is called the language model.
A better term for this would be "The Grammar", but "Language model" or LM is standard.
Computing P(W)
How to compute this joint probability:
P(the,other,day,I,was,walking,along,and,saw,a,lizard)
29
Recall the definition of conditional probability: P(A|B) = P(A ∩ B) / P(B)
Rewriting: P(A ∩ B) = P(A|B) P(B)
More generally (the chain rule): P(w1,w2,...,wn) = P(w1) P(w2|w1) P(w3|w1,w2) ... P(wn|w1,...,wn-1)
P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)
Unfortunately
There are a lot of possible sentences. We'll never be able to get enough data to compute the statistics for those long prefixes:
P(lizard | the,other,day,I,was,walking,along,and,saw,a)
or P(the | its water is so transparent that)
Markov Assumption
Make the simplifying assumption
P(lizard | the,other,day,I,was,walking,along,and,saw,a) ≈ P(lizard | a)
Or maybe
P(lizard | the,other,day,I,was,walking,along,and,saw,a) ≈ P(lizard | saw,a)
Markov Assumption
So for each component in the product replace with the approximation (assuming a prefix of N)
P(wn | w1..wn-1) ≈ P(wn | wn-N+1..wn-1)
Bigram version:
P(wn | w1..wn-1) ≈ P(wn | wn-1)
An example
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
This is the Maximum Likelihood Estimate, because it is the one which maximizes P(Training set|Model)
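A minimal sketch of these maximum likelihood bigram estimates computed from the three-sentence corpus above:

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams = Counter()
bigrams = Counter()
for sent in corpus:
    words = sent.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_mle(word, prev):
    # P(word | prev) = C(prev word) / C(prev)
    return bigrams[(prev, word)] / unigrams[prev]

print(p_mle("I", "<s>"))    # 2/3 ≈ 0.67
print(p_mle("Sam", "<s>"))  # 1/3 ≈ 0.33
print(p_mle("am", "I"))     # 2/3 ≈ 0.67
```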
37
Suppose the word "Chinese" occurs 400 times in a corpus of a million words (the Brown corpus).
What is the probability that a random word from some other text will be "Chinese"?
The MLE estimate is 400/1,000,000 = .0004
This may be a bad estimate for some other corpus
But it is the estimate that makes it most likely that Chinese will occur 400 times in a million word corpus.
Shakespeare as corpus
N = 884,647 tokens, V = 29,066 types
Shakespeare produced 300,000 bigram types out of V^2 = 844 million possible bigrams: so, 99.96% of the possible bigrams were never seen (have zero entries in the table)
Quadrigrams are worse: what's coming out looks like Shakespeare because it is Shakespeare
Evaluation
We train the parameters of our model on a training set. How do we evaluate how well our model works? We look at the model's performance on some new data. This is what happens in the real world: we want to know how our model performs on data we haven't seen. So we use a test set, a dataset which is different from our training set. Then we need an evaluation metric to tell us how well our model is doing on the test set. One such metric is perplexity (introduced below).
So
As a temporary solution, in order to run experiments:
To evaluate N-grams we often use an intrinsic evaluation, an approximation called perplexity.
But perplexity is a poor approximation unless the test data looks just like the training data.
So it is generally only useful in pilot experiments (generally not sufficient to publish), but it is helpful to think about.
Perplexity
Perplexity is the probability of the test set (assigned by the language model), normalized by the number of words:
PP(W) = P(w1 w2 ... wN)^(-1/N)
By the chain rule:
PP(W) = (∏_i 1 / P(wi | w1 ... wi-1))^(1/N)
For bigrams:
PP(W) = (∏_i 1 / P(wi | wi-1))^(1/N)
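A sketch of the bigram-perplexity computation; `p_bigram` here is an assumed placeholder for whatever (smoothed) bigram estimator is in use:

```python
import math

def perplexity(words, p_bigram):
    # Assumes every needed bigram has non-zero probability
    # (otherwise smoothing is required, as discussed below).
    log_prob = 0.0
    for prev, word in zip(words, words[1:]):
        log_prob += math.log(p_bigram(word, prev))
    n = len(words) - 1
    return math.exp(-log_prob / n)
```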
Smoothing is like Robin Hood: Steal from the rich and give to the poor (in probability mass)
58
Laplace smoothing
Also called add-one smoothing: just add one to all the counts! Very simple.
MLE estimate: P(wi) = ci / N
Laplace estimate: P_Laplace(wi) = (ci + 1) / (N + V)
Reconstructed counts: ci* = (ci + 1) × N / (N + V)
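A minimal sketch of the Laplace-smoothed bigram estimate, reusing the `unigrams`/`bigrams` counters from the MLE sketch earlier; V is the vocabulary size:

```python
def p_laplace(word, prev, unigrams, bigrams, V):
    # P(word | prev) = (C(prev word) + 1) / (C(prev) + V)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)
```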
Laplace-smoothed bigrams
61
Reconstituted counts
62
But Laplace smoothing is not used for N-grams, as we have much better methods. Despite its flaws, Laplace (add-k) is however still used to smooth other probabilistic models in NLP, especially for pilot studies and in domains where the number of zeros isn't so huge.
63
The Good-Turing intuition is to use the count of things we've seen once to help estimate the count of things we've never seen.
Good-Turing Intuition
Notation: Nx is the frequency-of-frequency-x
So N10=1, N1=3, etc
All other estimates are adjusted (down) to give probabilities for unseen
General Good-Turing formula: c* = (c+1) × N_{c+1} / N_c
For the example: c*(eel) = c*(1) = (1+1) × N_2 / N_1 = 2 × 1/3 = 2/3
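A tiny sketch of the adjusted-count formula on toy frequency-of-frequency counts matching the example above (N1 = 3, N2 = 1 are assumed values):

```python
# Good-Turing adjusted count c* = (c+1) * N_{c+1} / N_c,
# using frequency-of-frequency counts N_c.
def good_turing_count(c, freq_of_freq):
    return (c + 1) * freq_of_freq.get(c + 1, 0) / freq_of_freq[c]

N = {1: 3, 2: 1}                 # 3 items seen once, 1 item seen twice
print(good_turing_count(1, N))   # (1+1) * 1/3 ≈ 0.67
```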
Complications
In practice, assume large counts (c>k for some k) are reliable:
Also: we assume singleton counts c=1 are unreliable, so treat N-grams with count of 1 as if they were count=0 Also, need the Nk to be non-zero, so we need to smooth (interpolate) the Nk counts before computing c* from them
70
Interpolation
Simple interpolation: mix the trigram, bigram, and unigram estimates with weights that sum to 1:
P_hat(wn | wn-2, wn-1) = λ1 P(wn | wn-2, wn-1) + λ2 P(wn | wn-1) + λ3 P(wn),   with λ1 + λ2 + λ3 = 1
Katz Backoff
75
If we used MLE probabilities but backed off to a lower-order model when the MLE probability is zero, we would be adding extra probability mass, and the total probability would be greater than 1. So the higher-order probabilities must be discounted to reserve mass for the backoff.
Intuition of backoff+discounting
How much probability to assign to all the zero trigrams?
Use GT or other discounting algorithm to tell us
78
At decoding time
Create a fixed lexicon L of size V.
At the text normalization phase, any training word not in L is changed to <UNK>.
Now we train its probabilities like a normal word.
At test time: use <UNK> probabilities for any word not seen in training.
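A one-function sketch of the <UNK> normalization step:

```python
# Map any token outside the fixed lexicon to <UNK> before counting/scoring.
def normalize(tokens, lexicon, unk="<UNK>"):
    return [t if t in lexicon else unk for t in tokens]
```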
Practical Issues
We do everything in log space
Avoid underflow (also adding is faster than multiplying)
80
ARPA format
Advanced LM stuff
Current best smoothing algorithm
Kneser-Ney smoothing
Other stuff
Variable-length n-grams
Class-based n-grams (clustering, hand-built classes)
Cache LMs
Topic-based LMs
Sentence mixture models
Skipping LMs
Parser-based LMs
Summary
Probability
Basic probability, conditional probability, Bayes Rule
Language modeling: discounting (Good-Turing and Katz backoff), interpolation, unknown words, advanced LM algorithms
Today's Lecture
N-Grams; bigram models; smoothing of N-gram models
Simple N-Grams
An N-gram model uses the previous N-1 words to predict the next one:
unigrams: P(dog) bigrams: P(dog | big) trigrams: P(dog | the big) quadrigrams: P(dog | the big dopey)
Chain Rule
Conditional probability: P(A|B) = P(A ∩ B) / P(B)
So: P(A ∩ B) = P(A|B) P(B) and P(A ∩ B) = P(B|A) P(A)
Example "the dog": P(the dog) = P(the) × P(dog | the)
Chain Rule
the probability of a word sequence is the probability of a conjunctive event.
P(w1..wn) = P(w1) P(w2|w1) P(w3|w1 w2) ... P(wn|w1..wn-1) = ∏_{k=1..n} P(wk | w1..wk-1)
Markov Assumption
P(wn | w1..wn-1) ≈ P(wn | wn-N+1..wn-1)
P(wn) can be approximated using only N-1 previous words of context This lets us collect statistics in practice Markov models are the class of probabilistic models that assume that we can predict the probability of some future unit without looking too far into the past Order of a Markov model: length of prior context
Note:
p(w1 w2) = p(w1) p(w2|w1)
p(w1 w2 w3) = p(w1) p(w2|w1) p(w3|w1 w2)
...
p(w1 w2 w3...wn) = p(w1) p(w2|w1) p(w3|w1 w2) ... p(wn|w1...wn-2 wn-1)
It's not easy to collect (meaningful) statistics on p(wn|wn-1 wn-2...w1) for all possible word sequences.
Markov Assumption: finite length history.
1st order Markov Model: just look at the previous word only (not all the preceding words):
p(w1 w2 w3...wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)
p(wn|wn-1) is a lot easier to estimate well than p(wn|w1...wn-2 wn-1).
Bigram approximation: 1st order Markov Model, p(w1 w2 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)
Trigram approximation: 2nd order Markov Model, just look at the preceding two words only:
p(w1 w2 w3...wn) = p(w1) p(w2|w1) p(w3|w1 w2) ... p(wn|w1...wn-2 wn-1)
              ≈ p(w1) p(w2|w1) p(w3|w1 w2) p(w4|w2 w3) ... p(wn|wn-2 wn-1)
Note: p(wn|wn-2 wn-1) is a lot easier to estimate well than p(wn|w1...wn-2 wn-1), but harder than p(wn|wn-1).
Corpora
Corpora are (generally online) collections of text and speech e.g.
Brown Corpus (1M words) Wall Street Journal and AP News corpora ATIS, Broadcast News (speech) TDT (text and speech) Switchboard, Call Home (speech) TRAINS, FM Radio (speech)
8
Probabilities are based on counting things, so . What should we count? Words, word classes, word senses, speech acts ?
What is a word?
e.g., are cat and cats the same word? September and Sept? zero and oh? Is seventy-two one word or two? AT&T? Where do we find the things to count?
10
Terminology
Sentence: unit of written language Utterance: unit of spoken language Wordform: the inflected form that appears in the corpus Lemma: lexical forms having the same stem, part of speech, and word sense Types: number of distinct words in a corpus (vocabulary size) Tokens: total number of words
11
Probabilities come from a training corpus, which is used to design the model.
narrow corpus: probabilities don't generalize general corpus: probabilities don't reflect task or domain
12
Simple N-Grams
An N-gram model uses the previous N-1 words to predict the next one:
unigrams: P(dog) bigrams: P(dog | big) trigrams: P(dog | the big) quadrigrams: P(dog | the big red)
13
Using N-Grams
Recall that
P(wn | w1..n-1) ≈ P(wn | wn-N+1..n-1)
P(sentence) can be approximated by multiplying all the bigram probabilities in the sequence:
P(I want to eat Chinese food) ≈ P(I | <start>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese) P(<end> | food)
Chain Rule
P(A|B) = P(A ∩ B) / P(B)
P(A ∩ B) = P(A|B) P(B)
P(The big) = P(the) P(big | the)
Example
The big red dog:
P(The) * P(big | the) * P(red | the big) * P(dog | the big red)
Better: use P(The | <Beginning of sentence>), written as P(The | <S>); also <end> for end of sentence.
General Case
P(w1 w2 ... wn) = P(w1) P(w2|w1) ... P(wn | w1 ... wn-1)
Unfortunately
That doesn't help, since it's unlikely we'll ever gather the right statistics for the prefixes.
18
Markov Assumption
Assume that the entire prefix history isn't necessary. In other words, an event doesn't depend on all of its history, just a fixed-length near history.
19
Markov Assumption
So for each component in the product replace each with its approximation (assuming a prefix (Previous words) of N)
P(wn | w1..wn-1) ≈ P(wn | wn-N+1..wn-1)
Unigrams: P(dog)
Bigrams: P(dog | red)
Trigrams: P(dog | big red)
Four-grams: P(dog | the big red)
Bigram: P(wn | wn-1)
N-gram: P(wn | wn-N+1..wn-1)
An example
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
P(I | <s>) = 2/3 = 0.67        P(Sam | <s>) = 1/3 = 0.33
P(am | I) = 2/3 = 0.67         P(</s> | Sam) = 1/2 = 0.5
P(Sam | am) = 1/2 = 0.5        P(do | I) = 1/3 = 0.33
Some BERP bigram probabilities:
<start> I      .25        Want some            .04
<start> I'd    .06        Want Thai            .01
<start> Tell   .04        To eat               .26
<start> I'm    .02        To have              .14
I want         .32        To spend             .09
I would        .29        To be                .02
I don't        .08        British food         .60
I have         .04        British restaurant   .15
Want to        .65        British cuisine      .01
Want a         .05        British lunch        .01
Example: P(wn | wn-1) = C(wn-1 wn) / C(wn-1), computed from the unigram frequencies and bigram counts.
Example
P(I want to eat British food) = P(I|<start>) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British)
= .25 * .32 * .65 * .26 * .001 * .60 = 0.0000081 (different from textbook)
vs. P(I want to eat Chinese food) = .00015
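Reproducing the arithmetic above (the individual bigram probabilities are taken from the tables in this section):

```python
# P(I|<start>) * P(want|I) * P(to|want) * P(eat|to) * P(British|eat) * P(food|British)
probs = [.25, .32, .65, .26, .001, .60]
p = 1.0
for x in probs:
    p *= x
print(p)   # ≈ 0.0000081
```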
29
Note on Example
30
P(want | I) = .32
P(to | want) = .65
P(eat | to) = .26
P(food | Chinese) = .56
P(lunch | eat) = .055
Some Observations
What about:
P(I | I) = .0023   ("I I I I want")
P(I | want) = .0025   ("I want I want")
P(I | food) = .013   ("the kind of food I want is ...")
You don't really do all those multiplies. The numbers are too small and lead to underflow. Convert the probabilities to logs and then do additions. To get the real probability (if you need it), go back to the antilog.
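The same sentence-probability computation done in log space, a minimal sketch:

```python
import math

# Sum of logs instead of product of probabilities; antilog only at the end.
log_p = sum(math.log(x) for x in [.25, .32, .65, .26, .001, .60])
print(log_p)            # about -11.72
print(math.exp(log_p))  # about 8.1e-06
```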
34
Generation
35
BERP
I want
want to
to eat
eat Chinese
Chinese food
food .
You can collect reliable statistics on these events with relatively small samples
You might have to wait a long time to gather statistics on the low frequency events
37
They represent low-frequency events that simply didn't occur in the corpus.
38
Problem
Let's assume we're using N-grams. How can we assign a probability to a sequence where one of the component n-grams has a value of zero? Assume all the words are known and have been seen.
Possible solutions: go to a lower-order n-gram (back off from bigrams to unigrams), or replace the zero with something else (smoothing).
Add-One
Make the zero counts 1. Justification: they're just events you haven't seen yet. If you had seen them, you would only have seen them once, so make the count equal to 1.
Add-one: Example
Unsmoothed bigram counts (BERP corpus; rows = 1st word, columns = 2nd word):
          I     want   to    eat   Chinese  food  lunch   Total (N)
I         8     1087   0     13    0        0     0       3437
want      3     0      786   0     6        8     6       1215
to        3     0      10    860   3        0     12      3256
eat       0     0      2     0     19       2     52      938
Chinese   2     0      0     0     0        120   1       213
food      19    0      17    0     0        0     0       1506
lunch     4     0      0     0     0        1     0       459
Unsmoothed bigram probabilities, P(2nd word | 1st word) (I column omitted; each full row sums to 1):
          want   to     eat               Chinese  food   lunch
I         .32    0      .0038 (13/3437)   0        0      0
want      0      .65    0                 .0049    .0066  .0049
to        0      .0031  .26               .00092   0      .0037
eat       0      .0021  0                 .020     .0021  .055
Chinese   0      0      0                 0        .56    .0047
food      0      .011   0                 0        0      0
lunch     0      0      0                 0        .0022  0
Add-one smoothed bigram counts (add 1 to every cell; Total = N + V):
          I     want   to    eat   Chinese  food  lunch   Total (N+V)
I         9     1088   1     14    1        1     1       5053
want      4     1      787   1     7        9     7       2831
to        4     1      11    861   4        1     13      4872
eat       1     1      3     1     20       3     53      2554
Chinese   3     1      1     1     1        121   2       1829
food      20    1      18    1     1        1     1       3122
lunch     5     1      1     1     1        2     1       2075
Add-one smoothed bigram probabilities (Chinese, food, and lunch columns):
          Chinese  food     lunch
I         .0002    .0002    .0002
want      .0025    .0032    .0025
to        .00082   .00021   .0027
eat       .0078    .0012    .021
Chinese   .00055   .066     .0011
food      .00032   .00032   .00032
lunch     .00048   .00096   .00048
V = 1616 word types in the BERP lexicon.
Smoothed P(I eat) = (C(I eat) + 1) / (C(I) + V) = (13 + 1) / (3437 + 1616) = 0.0028
Add-One Smoothing
Bigram
Add 1 to all frequency counts.
Smoothed probability: p(wn|wn-1) = (C(wn-1 wn) + 1) / (C(wn-1) + V)
Adjusted (reconstituted) count: c* = (C(wn-1 wn) + 1) × C(wn-1) / (C(wn-1) + V)
Original frequencies: see the unsmoothed bigram count table above.
Add-one reconstituted counts:
          I      want     to       eat      Chinese  food     lunch
I         6.12   740.05   0.68     9.52     0.68     0.68     0.68
want      1.72   0.43     337.76   0.43     3.00     3.86     3.00
to        2.67   0.67     7.35     575.41   2.67     0.67     8.69
eat       0.37   0.37     1.10     0.37     7.35     1.10     19.47
Chinese   0.35   0.12     0.12     0.12     0.12     14.09    0.23
food      9.65   0.48     8.68     0.48     0.48     0.48     0.48
lunch     1.11   0.22     0.22     0.22     0.22     0.44     0.22
Remark: add-one causes large changes in some frequencies due to the relative size of V (1616), e.g. want to: 786 -> 338 = (786 + 1) × 1215 / (1215 + 1616)
In general: c_i* = (c_i + 1) × N / (N + V)
Bigrams starting with "Chinese" are boosted by a factor of 8 (1829/213); compare the unsmoothed and add-one smoothed bigram count tables above.
Corpus of 22,000,000 bigrams; vocabulary of 273,266 words (i.e. 74,674,306,756 possible bigrams).
74,671,100,000 bigrams were unseen, and each unseen bigram was given a frequency of 0.000295.
(Comparison of MLE frequencies f = 0, 1, 2, 3, 4, 5 with their add-one estimates: the add-one estimates are too high for the unseen bigrams and too low for the bigrams that were actually seen.)
Witten-Bell Smoothing
Unigram
Equate zero-frequency items with frequency-1 items: use the frequency of things seen once to estimate the frequency of things we haven't seen yet (smaller impact than Add-One).
A zero-frequency word (unigram) is an event that hasn't happened yet.
Count the number of word types (T) we've observed in the corpus.
For a word w with zero frequency: p(w) = T / (Z × (N + T))
where Z = number of zero-frequency words and N = size of the corpus
Distributing
The total probability mass to be distributed to unseen events is T / (N + T).
The number of events with count zero is Z.
So distributing evenly gives each unseen event probability (1/Z) × T / (N + T).
Bigram
p(wn|wn-1) = C(wn-1 wn) / C(wn-1)   (original)
p(wn|wn-1) = T(wn-1) / (Z(wn-1) × (N(wn-1) + T(wn-1)))   for zero bigrams (after Witten-Bell)
T(wn-1) = number of bigram types beginning with wn-1
Z(wn-1) = number of unseen bigram types beginning with wn-1
        = total number of possible bigrams beginning with wn-1 minus the ones we've seen
        = V - T(wn-1)
(estimated zero-bigram frequency)
Witten-Bell Smoothing
Bigram
Use the frequency (count) of things seen once to estimate the frequency (count) of things we haven't seen yet.
Estimated zero-bigram frequency (count): (T(wn-1) / Z(wn-1)) × C(wn-1) / (C(wn-1) + T(wn-1))
where T(wn-1) = number of bigram types beginning with wn-1 and Z(wn-1) = number of unseen bigram types beginning with wn-1.
Original frequencies: see the unsmoothed bigram count table above.
Witten-Bell smoothed (reconstituted) counts:
          I       want       to        eat       Chinese   food      lunch
I         7.785   1057.763   0.061     12.650    0.061     0.061     0.061
want      2.823   0.046      739.729   0.046     5.647     7.529     5.647
to        2.885   0.084      9.616     826.982   2.885     0.084     11.539
eat       0.073   0.073      1.766     0.073     16.782    1.766     45.928
Chinese   1.828   0.011      0.011     0.011     0.011     109.700   0.914
food      18.019  0.051      16.122    0.051     0.051     0.051     0.051
lunch     3.643   0.026      0.026     0.026     0.026     0.911     0.026
P(wi | wx) = (1 / Z(wx)) × T(wx) / (N(wx) + T(wx))
where Z(wx) = number of bigrams starting with wx that were not seen, and N(wx) = actual frequency (count) of bigrams beginning with wx.
Thank you
52
By : Mu10co05 Mu10co22
Top-down parsing: a parsing strategy where one first looks at the highest level of the parse tree and works down the parse tree by using the rewriting rules of a formal grammar.
(Top-down search space: the start symbol S is expanded by the rules S -> NP VP, S -> Aux NP VP, and S -> VP, and each resulting non-terminal (NP, VP, Det, NOMINAL, PropN, Pronoun, ...) is expanded in turn.)
Left-recursion
It refers to any recursive non-terminal that, when it produces a sentential form containing itself, the new copy of itself appears at the left edge of the production. Examples: NP -> NP PP, VP -> VP PP, S -> S and S.
Ambiguity
Ambiguities: PP Attachment
Coordination ambiguity: different sets of phrases can be conjoined by a conjunction like "and". For example, "old men and women" can be [old [men and women]] or [[old men] and [women]].
The parser often builds valid parse trees for portions of the input, then discards them during backtracking, only to find that it has to rebuild them again. For "a flight from Indianapolis to Houston on TWA", the number of times each constituent is built:
a flight: 4
from Indianapolis: 3
to Houston: 2
on TWA: 1
a flight from Indianapolis: 3
a flight from Indianapolis to Houston: 2
a flight from Indianapolis to Houston on TWA: 1
Thank You!!
Word Sense Disambiguation: examine word tokens in context and specify exactly which sense of each word is being used.
MOTIVATION
One of the central challenges in NLP is ambiguity. Compositional semantic analyzers ignore the issue
SOLUTION:
Knowledge Based Approach. Rely on knowledge resources like WordNet. May use grammar rules for disambiguation. May use hand coded rules for disambiguation.
Necessity of a Mechanism
A system should include a mechanism which
ensures that only nouns with appropriate properties are associated with given verbs in a given context.
Selectional Restrictions:
Used to perform disambiguation. Used to rule out inappropriate senses and thereby reduce the amount of ambiguity present during semantic analysis. Introduced by Fodor and Katz (1963). In a selectional restriction, a predicate (e.g., a verb) imposes semantic constraints on its arguments (e.g., nouns).
A violation of selectional restrictions is the explanation for the oddity of the following examples:
Kim ate a motor-bike. There is an apple bathing in the water. The stone thinks.
The verb "think" selects a subject with the feature "human", which means that words labeled "inanimate" are rejected.
eat(Human, Fruit)
Peter ate fish. => eat(Human, Edible)
Kim's mother bought a motor-bike of chocolate at the bakery. Kim ate the motorbike immediately. => eat(Animate, Physical Object)
Sense 1: "In our house, everybody has a career and none of them includes washing dishes," he says.
Sense 2: "In her tiny kitchen at home, Ms. Chen works efficiently, stir-frying several simple dishes, including braised pig's ears and chicken livers with green peppers."
The dishwasher read the article.
Limitations:
There are examples like the following where the available selectional restrictions are too general to uniquely select a correct sense: "What kind of dishes do you recommend?"
"... realized you can't eat gold for lunch if you're hungry."
"In his two championship trials, Mr. Kulkarni ate glass on an empty stomach, accompanied only by water and tea."
The sentences themselves are not semantically ill-formed, despite the violation of eat's selectional restrictions.
"... to try to eat Afghanistan."
Here the typical selectional restrictions on both "kill" and "eat" will eliminate all possible literal senses, leaving the system with no possible meanings. It brings the semantic analysis to a halt!
THANK YOU
14
Semantics
Going beyond syntax
1/27
Semantics
Relationship between surface form and meaning What is meaning? Lexical semantics Syntax and semantics
2/27
What is meaning?
Reference to worlds
Objects, relationships, events, characteristics Meaning as truth
Understanding
Inference, implication Modelling beliefs
Meaning as action
Understanding activates procedures
3/27
Lexical semantics
Meanings of individual words
Sense and Reference What do we understand by the word lion ? Is a toy lion a lion? Is a toy gun a gun? Is a fake gun a gun?
Grammatical meaning
What do we understand by the lion, lions, the lions, as in The lion is a dangerous animal The lion was about to attack
4/27
Lexical relations
Lexical meanings can be defined in terms of other words
Synonyms, antonyms, broader/narrower terms synsets Part-whole relationships (often reflect realworld relationships) Linguistic usage (style, register) also a factor
5/27
Semantic features
Meanings can be defined (to a certain extent) in terms of distinctive features
e.g. man = adult, male, human
6/27
Types of representation
1. Syntactic relations
The man shot an elephant with his gun
Types of representation
2. Deep syntax
The man shot an elephant with his gun An elephant was shot by the man with his gun
instr
Types of representation
3. Semantic roles, deep cases
The man shot an elephant with his gun An elephant was shot by the man with his gun The man used his gun to shoot an elephant
instr
Types of representation
4. Event representation, semantic network
The man shot an elephant with his gun An elephant was shot by the man with his gun The man used his gun to shoot an elephant
(Semantic network: a "shooting" event node with roles shooter = the man, shot-thing = an elephant, instr = his gun, the gun possessed by the man.)
Types of representation
5. Predicate calculus
The man shot an elephant with his gun An elephant was shot by the man with his gun The man used his gun to shoot an elephant The man owned the gun which he used to shoot an elephant The man used the gun which he owned to shoot an elephant
event(e) & time(e,past) & pred(e,shoot) & man(A) & the(A) & a(B) & elephant(B) & shoot(A,B) & the(C) & gun(C) & own(A,C) & use(A,C,e)
11/27
Types of representation
6. Conceptual dependency (Schank) John punched Mary
12/27
Types of representation
7. Semantic formulae (Wilks)
((THIS((PLANT STUFF)SOUR)) ((((((THRU PART)OBJE) (NOTUSE *ANI))GOAL) ((MAN USE) (OBJE THING) )))
door
13/27
Conceptual meaning may be much deeper Somewhere in between a good example is Wilks preference semantics: especially good for metaphor
15/27
Linguistic issues
Words and Concepts
Objects, properties, actions n, adj, v Language allows us to be vague (e.g. toy gun)
Semantic primitives what are they? Meaning equivalence when do two things mean the same? Grammatical meaning
Tense vs. time Topic and focus Quantifiers, plurals, etc.
16/27
Linguistic issues
There are many other similarly tricky linguistic phenomena
Modality (could, should, would, must, may) Aspect (completed, ongoing, resulting) Determination (the, a, some, all, none) Fuzzy sets (often, some, many, usually)
17/27
Lexical semantics
Lexical relations (familiar to linguists) have an impact on NLP systems
Homonymy: word-sense selection; homophones in speech-based systems
Polysemy: understanding narrow senses
Synonymy: lexical equivalence
Ontology: structures the vocabulary, holds much of the knowledge used by clever systems
WordNet
Began as a psycholinguistic theory of how the brain organizes its vocabulary (Miller)
Organizes vocabulary into synsets, hierarchically arranged together with other relations (hyp[er|o]nymy, isa, member, antonyms, entailments)
Turns out to be very useful for many applications
Has been replicated for many languages (sometimes just translated!)
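A quick look at synsets and hypernyms through NLTK's WordNet interface (assumes the wordnet data has been downloaded):

```python
from nltk.corpus import wordnet as wn

for syn in wn.synsets("lion"):
    print(syn.name(), "-", syn.definition())

# Hypernym link for one sense of "lion".
lion = wn.synset("lion.n.01")
print([s.name() for s in lion.hypernyms()])
```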
19/27
(Overview: language analysis — parsing (syntax) and semantics — as components supporting search.)
What is Syntax?
Study of the structure of language
Specifically, the goal is to relate an interface to the morphological component to an interface to a semantic component
Note: the interface to the morphological component may look like written text
Representational device is tree structure
(Example pipeline: morphology maps /waddyasai/ to "what do you say"; syntax produces the structure say(subj: you, obj: what); semantics yields a representation like λx. say(you, x).)
Sample Grammar
Grammar (S, NT, T, P): sentence symbol S ∈ NT; part-of-speech and syntactic constituents ⊂ NT; grammar rules P: NT -> (NT ∪ T)*
S -> NP VP            (statement)
S -> Aux NP VP        (question)
S -> VP               (command)
NP -> Det Nominal
NP -> Proper-Noun
Nominal -> Noun | Noun Nominal | Nominal PP
VP -> Verb | Verb NP | Verb PP | Verb NP PP
PP -> Prep NP
Det -> that | this | a
Noun -> book | flight | meal | money
Proper-Noun -> Houston | American Airlines | TWA
Verb -> book | include | prefer
Aux -> does
Prep -> from | to | on
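A sketch that encodes a simplified fragment of this grammar (PP rules and multi-word proper nouns omitted) and parses a sentence with NLTK's chart parser; assumes nltk is installed:

```python
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP | Aux NP VP | VP
NP -> Det Nominal | ProperNoun
Nominal -> Noun | Noun Nominal
VP -> Verb | Verb NP
Det -> 'that' | 'this' | 'a'
Noun -> 'book' | 'flight' | 'meal' | 'money'
ProperNoun -> 'Houston' | 'TWA'
Verb -> 'book' | 'include' | 'prefer'
Aux -> 'does'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("book that flight".split()):
    tree.pretty_print()
```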
Structure in Strings
Some words: the a small nice big very boy girl sees likes Some good sentences:
the boy likes a girl the small girl likes the big girl a very small nice boy sees a very nice boy
Can we find subsequences of words (constituents) which in some way behave alike?
Node Labels?
( ((the) boy) likes ((a) girl) ) Choose constituents so each one has one non-bracketed word: the head Group words by distribution of constituents they head (part-of-speech, POS):
Noun (N), verb (V), adjective (Adj), adverb (Adv), determiner (Det)
Node Labels
(((the/Det) boy/N) likes/V ((a/Det) girl/N))
(Phrase-structure tree for "the boy likes a girl": S dominating NP (DetP "the", "boy") and "likes" with NP (DetP "a", "girl").)
Types of Nodes
(((the/Det) boy/N) likes/V ((a/Det) girl/N))
(The same phrase-structure tree: nonterminal symbols (S, NP, DetP) = constituents; terminal symbols = words.)
Determining Part-of-Speech
noun or adjective?
a blue seat a very blue seat this seat is blue a child seat *a very child seat *this seat is child
blue and child are not the same POS blue is Adj, child is Noun
Many subtypes:
eat/V eat/VB, eat/VBP, eats/VBZ, ate/VBD, eaten/VBN, eating/VBG, Reflect morphological form & syntactic function
(The same tree with node labels that include the head word and its POS, e.g. likes/V, girl/N, a/Det; only leaf nodes are labeled with words!)
(The lexicalized tree is representationally equivalent if each nonterminal node has one lexical daughter, its head.)
Types of Dependency
(Dependency tree for "the very small boy sometimes likes a girl": likes/V has Subj boy/N (with function word the/Det and Adj small/Adj, itself modified by very/Adv), Obj girl/N (with function word a/Det), and Adj(unct) sometimes/Adv.)
Grammatical Relations
Types of relations between words
Arguments: subject, object, indirect object, prepositional object Adjuncts: temporal, locative, causal, manner, Function Words
Subcategorization
List of arguments of a word (typically, a verb), with features about realization (POS, perhaps case, verb form etc.), in canonical order Subject-Object-IndObj. Example:
like: N-N, N-V(to-inf)
see: N, N-N, N-N-V(inf)
Context-Free Grammars
Defined in formal language theory (computer science): terminals, nonterminals, start symbol, rules
A string-rewriting system: start with the start symbol, rewrite using rules, done when only terminals are left
NOT A LINGUISTIC THEORY, just a formal device
CFG: Example
Many possible CFGs for English, here is an example (fragment):
S -> NP VP
VP -> V NP
NP -> DetP N | AdjP NP
AdjP -> Adj | Adv AdjP
N -> boy | girl
V -> sees | likes
Adj -> big | small
Adv -> very
DetP -> a | the
the very small boy likes a girl
Derivations in a CFG
Using the grammar above, "the boy likes a girl" is derived by rewriting one nonterminal at each step:
S
=> NP VP
=> DetP N VP
=> the boy VP
=> the boy likes NP
=> the boy likes a girl
The derivation history corresponds to the phrase-structure tree [S [NP [DetP the] [N boy]] [VP [V likes] [NP [DetP a] [N girl]]]].
Derivations of CFGs
String rewriting system: we derive a string (=derived structure) But derivation history represented by phrase-structure tree (=derivation structure)!
Poor John - NP
his watch - NP
lost his watch - VP
Poor John lost his watch - S
(Tree: [S [NP [ADJ Poor] [N John]] [VP [V lost] [NP [PRON his] [N watch]]]])
Exercise
Analyze the following constructions using Labeled Bracketing as well as tree Diagram: I saw a man with a telescope. She touched the cat with a feather. The girl pushed the large box towards the huge door. The man in the blue shirt is waiting for you.
Ambiguity examples
Ambiguities: PP Attachment
Attachments
I cleaned the dishes from dinner. I cleaned the dishes with detergent. I cleaned the dishes in my pajamas. I cleaned the dishes in the sink.
Syntactic Ambiguities 1
Prepositional Phrases
They cooked the beans in the pot on the stove with handles.
Complement Structure
The tourists objected to the guide that they couldn't hear. She knows you like the back of her hand.
Syntactic Ambiguities 2
Modifier scope within NPs
impractical design requirements plastic cup holder
Coordination scope
Small rats and mice can squeeze into holes or cracks in the wall.
Solution: We need mechanisms that allow us to find the most likely parse(s)
Statistical parsing lets us work with very loose grammars that admit millions of parses for sentences but to still quickly find the best parse(s)
Introduction
Parsing = associating a structure (parse tree) with an input string using a grammar. CFGs are declarative; they don't specify how the parse tree will be constructed. Example: "Book that flight." Parse trees are used in:
Grammar checking Semantic analysis Machine translation Question answering Information extraction
(Parse tree for "Book that flight": [S [VP [Verb Book] [NP [Det that] [NOMINAL [Noun flight]]]]])
Parsing
Parsing with CFGs refers to the task of assigning correct trees to input strings. "Correct" here means a tree that covers all and only the elements of the input and has an S at the top. It doesn't actually mean that the system can select the correct tree from among the possible trees.
Parsing
Parsing involves a search, which involves making choices. Some parsing techniques:
Top-down parsing
Bottom-up parsing
For Now
Assume
You have all the words already in some buffer
The input isn't POS tagged
We won't worry about morphological analysis
All the words are known
Parsing as search
A Grammar to be used in our example
S -> NP VP
S -> Aux NP VP
S -> VP
NP -> Pronoun
NP -> Det NOMINAL
NP -> Proper-Noun
NOMINAL -> Noun
NOMINAL -> NOMINAL PP
VP -> Verb
VP -> Verb NP
VP -> Verb NP PP
VP -> Verb PP
VP -> VP PP
PP -> Preposition NP
Det -> that | this | a
Noun -> book | flight | meal | money
Verb -> book | include | prefer
Aux -> does
Proper-Noun -> Houston | TWA
Preposition -> from | to | on | near | through
Parsing as search
Book that flight.
Two types of constraints on the parses:
1. some that come from the input string
2. others that come from the grammar
(Target parse: [S [VP [Verb Book] [NP [Det that] [NOMINAL [Noun flight]]]]])
Top-Down Parsing
(Top-down search space: start from S, expand with S -> NP VP, S -> Aux NP VP, S -> VP, then expand NP as Pronoun, Det NOMINAL, or Proper-Noun, VP as Verb or Verb NP, and so on, until the fringe matches the input words.)
Bottom-Up Parsing
Since we want trees that cover the input words start with trees that link up with the words in the right way. Then work your way up from there.
(Bottom-up search: start from the words "Book that flight", build Noun/Verb over "Book", Det over "that", Noun over "flight", then NOMINAL and NP, and work upward toward S.)
Introduction
Word Sense Disambiguation
It is to examine word tokens in their context and specify which sense of each word is being used.
Motivation
One of the central challenges in NLP; ambiguity is ubiquitous across all languages. Needed in:
Machine Translation - for correct lexeme choice
Information Retrieval - for resolving queries
Information Extraction - for accurate analysis of text
Selectional Restriction
Uses hierarchical type information about arguments. Rules out inappropriate senses and hence reduces the amount of ambiguity. Selectional restrictions are used to block the formation of component meaning representations which violate them.
Selectional Restriction
Senses for the word DISHES:
Sense 1 (Artifact): "In our house, everybody has a career and none of them includes washing dishes with soap," he says.
Sense 2 (Food): "Ms. Chen works efficiently, stir-frying several simple dishes, including fried chicken."
Selectional Restriction
Senses for the word SERVE:
Sense 1 (Geographical Entity): "What is the name of the airlines that serve Denver?"
Sense 2 (Food): "Well, there was a time when they served green-lipped mussels from New Zealand."
Sense 3 (Meal Designator): "Which one of the airliners serves breakfast?"
Selectional Restriction
Consider an Example
I'm looking for a restaurant that serves vegetarian dishes.
"serve" allows: Geographical Entity, Food, or Meal Designator; "dishes" allows: Artifact or Food.
The only sense consistent with both selectional restrictions is Food, so both words are disambiguated at once.
Dictionary-based approach (the Lesk idea): sense definitions are taken from a machine-readable dictionary. Each of the senses is compared to the dictionary definitions of the remaining words in the context. The sense with the highest overlap with these context words is chosen as the correct sense.
Example: disambiguating "cone" in "pine cone" — the sense glossed as "fruit of certain evergreen trees" overlaps with the definition of "pine" (a kind of evergreen tree), so that sense is selected.
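A minimal sketch of the dictionary-overlap idea (simplified Lesk) using WordNet glosses as the dictionary; assumes NLTK's wordnet data is installed:

```python
from nltk.corpus import wordnet as wn

def simplified_lesk(word, context_words):
    # Pick the sense whose gloss shares the most words with the context.
    context = set(w.lower() for w in context_words)
    best, best_overlap = None, -1
    for sense in wn.synsets(word):
        gloss = set(sense.definition().lower().split())
        overlap = len(gloss & context)
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

print(simplified_lesk("cone", "pine cone fruit of an evergreen tree".split()))
```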
Supervised approach to disambiguation: in this method a classifier is learned, which is then used to assign senses to new instances.
Co-location
Looks for information about words of specific positions
Co-occurrence
Features consist of data about neighboring words; the words themselves serve as features. For the earlier example, a co-occurrence vector consisting of the 12 most frequent words from a collection of "bass" sentences drawn from the WSJ corpus has the following features:
fishing, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band
and a particular instance might be encoded as: 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0
Semi-supervised approach
The problem with the supervised approach is the need for a large tagged training set. The semi-supervised approach relies on a relatively small number of labeled instances. These labeled instances are used as seeds to train an initial classifier, which is then used to extract a larger training set from unlabeled data.
Unsupervised approach
Unlabeled instances are taken as input and are grouped into clusters according to a similarity metric. These clusters are then labeled by hand with word senses. The main disadvantage is that the senses are not well defined.