
Language Models

CS6370: Natural Language Processing

Why model language?


- Validation/Verification
  - Check for syntax
  - Better understanding
- Generation
  - Q & A
  - Dialogue systems
- Prediction
  - Speech recognition
  - Spelling correction
- Discrimination
  - Topic detection
  - Authorship verification

Language model: Grammar?


- Model: rules of a language + a dictionary
- Context-Free Grammars
  - Constructing a full parse tree: expensive
  - Ambiguity: PCFGs
  - Specified a priori: not data-driven
  - Overkill: a complete grammar specification is not needed for several tasks
    - spell check, prediction, discrimination, etc.

What model do you use?


- We turned ___ the TV to watch the Cricket __________.
  - We turned on the TV to watch the Cricket match.
  - We turned in the TV to watch the Cricket game.
  - We turned on the TV to watch the Cricket tournament.
  - We turned off the TV to watch the Cricket hop.

Frequentist!
- We subconsciously order things based on frequency
- Such statistical models can benefit automated language processing also
  - Need a training corpus
- Many such models are used in practical systems
  - N-grams
  - HMMs
  - Topic Models
  - PCFGs
  - etc.

Why do they help?


- "match Cricket watched the I" is less common than "I watched the Cricket match"
- Similarly, "match" is a more likely completion for "I watched the Cricket" than "game" or "hop".
- You probably used this idea in your spell check assignment.
- A simple way of modeling this is to use N-gram statistics, i.e., N-gram models.

N-gram model intuition


- An N-gram is a sequence of N consecutive words.
  - "the cricket match" is a 3-gram (trigram)
- Look at the relative frequencies of the various N-grams in your training corpus:

  P(I watched the cricket match) = P(I | . .) x P(watched | . I) x P(the | I watched)
      x P(Cricket | watched the) x P(match | the Cricket) x P(. | Cricket match) x P(. | match .)

  (here "." stands for a sentence-boundary padding symbol)

N-gram computation
  P(the | I watched) = C(I watched the) / C(I watched)

- Cannot do this for the entire sequence of words
  - Computational issues
  - Scalability
- So estimate probabilities for N-grams and use the chain rule.

Bigrams
- The most popular version of the N-gram model is the 2-gram, or bigram, model

  $$P(w_1 w_2 \ldots w_n) = \prod_{i=1}^{n} P(w_i \mid w_{i-1})$$

- Typically w_0 is a special start-of-sentence symbol and w_n is a special end-of-sentence symbol
- Markov assumption
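A minimal sketch of scoring a sentence with this product; the estimator p_bigram is a placeholder (the next slide shows how to estimate it from counts):

```python
import math

def sentence_log_prob(tokens, p_bigram):
    """log P(w_1 ... w_n) = sum of log P(w_i | w_{i-1}), per the Markov assumption.

    tokens should include the boundary symbols, e.g.
    ["<s>", "I", "am", "Sam", "</s>"]; p_bigram(w, prev) returns P(w | prev).
    """
    return sum(math.log(p_bigram(w, prev))
               for prev, w in zip(tokens, tokens[1:]))
```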

Estimating the bigram model


  $$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{\sum_{w} C(w_{n-1} w)} = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$$

- Relative frequency
- Maximum Likelihood estimate

  <s> I am Sam </s>
  <s> Sam I am </s>
  <s> I do not like green eggs and ham </s>

  P(I | <s>) = 2/3;    P(Sam | <s>) = 1/3;  P(am | I) = 2/3
  P(</s> | Sam) = 1/2; P(Sam | am) = 1/2;   P(do | I) = 1/3
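A minimal sketch of these MLE estimates computed from the toy corpus above (names such as p_mle are illustrative, not from the slides):

```python
from collections import Counter

# Toy corpus from the slide, with explicit sentence-boundary symbols.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_mle(w, prev):
    """Maximum-likelihood bigram estimate: C(prev w) / C(prev)."""
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_mle("I", "<s>"))    # 2/3
print(p_mle("Sam", "<s>"))  # 1/3
print(p_mle("do", "I"))     # 1/3
```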

Data sets: Some issues


- Usually a corpus is split into training and testing sets
  - Use various measures for evaluation
  - Can also use a hold-out set and a development set
- Performance depends on the training corpus, as with all data-driven methods
- Other issue: closed vs. open vocabulary
  - Closed: only a certain set of pre-determined words can occur
  - Open: allow for an unlimited vocabulary
    - Choose a vocabulary to model
    - Replace out-of-vocabulary words with <UNK>
    - Find the probability of <UNK> in training
    - Alternative: treat the first occurrence of each word as <UNK>

Counting words: Some issues


- What is a word?
  - cat vs. cats
  - eat vs. ate vs. eating
  - President of the United States vs. POTUS
  - ahh, umm
  - punctuation: . , ? ! ; etc.
- Lemmatization, stemming, tokenization
- Task dependent
  - All distinctions are needed in speech recognition.

Evaluating a Language Model: Perplexity


- How well does the model fit the test data?
- The higher the probability of the test data, the lower the perplexity
- A measure of branching factor
  - Related to entropy

  $$PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$$

- Count </s> but not <s>. Why?
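A small sketch of this computation for a bigram model, counting </s> but not <s> in N as the slide asks; prob(w, prev) can be any bigram estimator:

```python
import math

def perplexity(test_sentences, prob):
    """Perplexity of a bigram model over '<s> ... </s>' test sentences.

    prob(w, prev) returns P(w | prev). <s> only conditions the first word
    and is not counted in N; </s> is predicted and therefore counted.
    """
    log_prob = 0.0
    n_tokens = 0
    for sentence in test_sentences:
        tokens = sentence.split()
        for prev, w in zip(tokens, tokens[1:]):
            log_prob += math.log(prob(w, prev))
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)
```

With an unsmoothed MLE estimator, any unseen test bigram has probability 0 and the logarithm is undefined, which is one motivation for the smoothing methods discussed later.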

Using the N-gram model


- Validity
  - P(Sam I am) = P(Sam | <s>) x P(I | Sam) x P(am | I) x P(</s> | am)
  - If P(Sam I am) < threshold, invalid sentence!
- Prediction
  - Bring me green eggs and ...
  - argmax_w [ P(w | and) x P(</s> | w) ]
- Discrimination
  - I like green eggs and ham
  - P_Seuss(I like green eggs and ham) > P_Shakespeare(I like green eggs and ham)

Using the N-gram model


- Generation
  - Fix w_{n-N+1} ... w_{n-1}. Sample w_n from P(. | w_{n-N+1} w_{n-N+2} ... w_{n-1}) (see the sketch below)
  - Treat punctuation, end of sentence, etc., as words.
  - Unigram: To him swallowed confess hear both
  - Bigram: What means, sir. I confess she?
  - Trigram: Therefore the sadness of parting, as they say, 'tis done.
  - Quadrigram: What! I shall go seek the traitor Gloucester
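A sketch of that sampling loop for the bigram case (N = 2); bigram_counts is assumed to be a counter over adjacent word pairs such as the one built in the earlier estimation sketch:

```python
import random
from collections import Counter, defaultdict

def sample_sentence(bigram_counts, max_len=20):
    """Generate words by repeatedly sampling w_n from P(. | w_{n-1})."""
    # Group continuation counts by the history word.
    continuations = defaultdict(Counter)
    for (prev, w), c in bigram_counts.items():
        continuations[prev][w] += c

    word, generated = "<s>", []
    for _ in range(max_len):
        candidates, weights = zip(*continuations[word].items())
        word = random.choices(candidates, weights=weights)[0]
        if word == "</s>":
            break
        generated.append(word)
    return " ".join(generated)
```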

- Limited applicability
  - Combine with other methods such as back-off and interpolation
  - Most success in speech recognition and discrimination
  - Useful in non-grammatical settings!

Smoothing
  <s> I am Sam </s>
  <s> Sam I am </s>
  <s> I do not like green eggs and ham </s>

- What is the probability of "I do not like Sam I am"?
  - P(Sam | like) = 0; hence this probability is also 0
- Do not assign zero probability even to unseen bigrams
  - ...those that are zero only due to sparseness of data
- How about really rare or nonsense combinations?
  - "Sam green ham"
- Smoothing helps handle low- or zero-count cases.


Laplace smoothing

- Increment the counts of all words by 1
- Unigram case:

  $$P_{Laplace}(w_i) = \frac{C(w_i) + 1}{N + V}$$

  V: size of the vocabulary

- Also called add-one smoothing; add-δ smoothing (adding a fraction instead of 1) is less dramatic
- Bigram case:

  $$P_{Laplace}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}$$
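A minimal sketch of the add-one bigram estimate, assuming the Counter tables from the earlier sketch (so missing bigrams count as 0):

```python
def p_laplace(w, prev, bigram_counts, unigram_counts, V):
    """Add-one smoothed bigram estimate: (C(prev w) + 1) / (C(prev) + V)."""
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + V)

# With the toy corpus and V = 11 (the vocabulary including </s> but not <s>,
# which is what the numbers on the next slide imply):
#   p_laplace("Sam", "like", ...) = (0 + 1) / (1 + 11) = 1/12 ~ 0.083
```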

Laplace adjusted estimates


  <s> I am Sam </s>
  <s> Sam I am </s>
  <s> I do not like green eggs and ham </s>

  MLE:      P(I | <s>) = 2/3;     P(Sam | <s>) = 1/3;    P(am | I) = 2/3
            P(</s> | Sam) = 1/2;  P(Sam | am) = 1/2;     P(do | I) = 1/3

  Laplace:  P*(I | <s>) = 3/14;    P*(Sam | <s>) = 2/14;  P*(am | I) = 3/14
            P*(</s> | Sam) = 2/13; P*(Sam | am) = 2/13;   P*(do | I) = 2/14

            P*(Sam | like) = 1/12

Laplace adjusted counts


  Unigram:

  $$C^{*}(w_i) = (C(w_i) + 1)\,\frac{N}{N + V}$$

  Bigram:

  $$C^{*}(w_{n-1} w_n) = (C(w_{n-1} w_n) + 1)\,\frac{C(w_{n-1})}{C(w_{n-1}) + V}$$

- Divide by N or C(w_{n-1}) to get the smoothed estimates
- Discounted counts!

  <s> I am Sam </s>
  <s> Sam I am </s>
  <s> I do not like green eggs and ham </s>

  C(I) = 3;         C*(I) = 4 x 17/28 = 2.43
  C(I am) = 2;      C*(I am) = 3 x 3/14 = 0.64
  C(like Sam) = 0;  C*(like Sam) = 1 x 1/12 = 0.083

  (here N = 17, counting </s> but not <s>, and V = 11)

Good-Turing Discounting
- Laplace is simple but smooths too much
- Use the frequency of events that occurred once to estimate the frequency of unseen events
- Let N_c be the number of items that occur exactly c times (the "frequency of frequency c")
- The adjusted count is given by:

  $$c^{*} = (c + 1)\,\frac{N_{c+1}}{N_c}$$

- The probability of zero-frequency items:

  $$P^{*}_{GT}(\text{items in } N_0) = \frac{N_1}{N}$$

- Developed by Turing and Good during World War II as part of their work on deciphering German codes (Enigma)
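A minimal sketch of these adjusted counts; the function name and the dictionary layout are illustrative assumptions:

```python
from collections import Counter

def good_turing(counts):
    """Good-Turing adjusted counts c* = (c + 1) * N_{c+1} / N_c.

    counts maps items (e.g. bigrams) to raw counts. Returns (adjusted, p_unseen),
    where p_unseen = N_1 / N is the total mass reserved for unseen items.
    No fix-up is applied when N_{c+1} = 0; that is what Simple Good-Turing
    (next slide) addresses.
    """
    freq_of_freq = Counter(counts.values())               # N_c
    total = sum(counts.values())                           # N
    adjusted = {item: (c + 1) * freq_of_freq.get(c + 1, 0) / freq_of_freq[c]
                for item, c in counts.items()}
    p_unseen = freq_of_freq.get(1, 0) / total              # N_1 / N
    return adjusted, p_unseen
```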

Good-Turing: some issues

- The set of potential N-grams is known
  - So the number of unseen N-grams can be calculated
- Assumes that the count of each N-gram follows a binomial distribution
  - The probability of a bigram occurring once is given by the GT estimate
  - So the observed counts could come from bigrams of a different frequency

Simple Good-Turing (Gale and Sampson)

- What happens when N_{c+1} is zero?
  - We need a way of estimating the missing N_c
- Assumption: N_c = a c^b, i.e., log N_c = a + b log c
  - Linear regression on a log-log scale: log c vs. log N_c
- Alternatively: fit adjusted counts

  $$N_c^{*} = \frac{N_c}{0.5\,(c'' - c')}$$

  where c' and c'' are the consecutive non-zero frequencies around c

- Not a good fit for small c, hence use N_c as it is, if available
- Switch from actual counts to estimated counts when the error is small
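A sketch of the log-log regression step, assuming NumPy is available; np.polyfit with degree 1 returns the slope and intercept of the least-squares line:

```python
import numpy as np

def smoothed_freq_of_freq(freq_of_freq):
    """Fit log N_c = a + b log c and return a function c -> estimated N_c."""
    cs = sorted(freq_of_freq)
    log_c = np.log(np.array(cs, dtype=float))
    log_n = np.log(np.array([freq_of_freq[c] for c in cs], dtype=float))
    b, a = np.polyfit(log_c, log_n, 1)   # slope b, intercept a
    return lambda c: float(np.exp(a + b * np.log(c)))
```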

Good-Turing: Katz's correction

- Assume that low-frequency items could really be zero-frequency items
- Assume that the counts of very high-frequency items are correct and do not have to be discounted

  $$c^{*} = \frac{(c+1)\,\frac{N_{c+1}}{N_c} \;-\; c\,\frac{(k+1)\,N_{k+1}}{N_1}}{1 - \frac{(k+1)\,N_{k+1}}{N_1}}, \qquad \text{for } 1 \le c \le k$$

- A k of 5 is suggested by Katz.
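A small sketch of this corrected count; N here is assumed to be a frequency-of-frequencies table such as the one computed in the Good-Turing sketch above:

```python
def katz_corrected_count(c, N, k=5):
    """Katz-corrected count c* for 1 <= c <= k; counts above k are kept as is.

    N is the frequency-of-frequencies table, e.g. N[1] = number of items seen once.
    """
    if c > k:
        return c
    ratio = (k + 1) * N[k + 1] / N[1]
    return ((c + 1) * N[c + 1] / N[c] - c * ratio) / (1 - ratio)
```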

Combining Estimators
- Another technique for handling sparseness
- Estimate the N-gram probability by using the estimates for the constituent lower-order grams
  - e.g., a trigram using trigram, bigram and unigram estimates
- Yields better models than smoothing a fixed N-gram model
- We look at two methods:
  - Interpolation
  - Back-off

Simple Linear Interpolation


  $$P_{li}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1 P_1(w_n) + \lambda_2 P_2(w_n \mid w_{n-1}) + \lambda_3 P_3(w_n \mid w_{n-2} w_{n-1})$$

- A linear combination of shorter grams

  $$0 \le \lambda_i \le 1, \qquad \sum_i \lambda_i = 1$$

- A finite mixture model
- Weights determined by EM, or empirically through a hold-out set
- Discounts the higher-order probabilities
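A sketch of the interpolated trigram estimate with fixed weights; the lambda values below are placeholders, since in practice they are set by EM or on a hold-out set as noted above:

```python
def p_interpolated(w, prev2, prev1, p_uni, p_bi, p_tri,
                   lambdas=(0.1, 0.3, 0.6)):
    """lambda1*P(w) + lambda2*P(w | prev1) + lambda3*P(w | prev2 prev1)."""
    l1, l2, l3 = lambdas   # non-negative and summing to 1
    return (l1 * p_uni(w)
            + l2 * p_bi(w, prev1)
            + l3 * p_tri(w, prev2, prev1))
```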

General Linear Interpolation


  $$P_{li}(w \mid h) = \sum_{i=1}^{k} \lambda_i(h)\,P_i(w \mid h), \quad \text{where } \forall h,\; 0 \le \lambda_i(h) \le 1 \text{ and } \sum_i \lambda_i(h) = 1$$

- The combining weights are functions of the history
  - Can give higher weight to a longer history if its counts are high
- Histories are not treated individually but binned
  - e.g., histories with the same frequencies
  - The weight of the (N-1)-gram model is determined by the average number of non-zero N-grams that follow this (N-1)-gram
- Takes care of grammatical zeroes

Katz Back-off
  $$P_{bo}(w_i \mid w_{i-2} w_{i-1}) =
    \begin{cases}
      (1 - d_{w_{i-2} w_{i-1}})\,\dfrac{C(w_{i-2} w_{i-1} w_i)}{C(w_{i-2} w_{i-1})} & \text{if } C(w_{i-2} w_{i-1} w_i) > k \\[1ex]
      \alpha_{w_{i-2} w_{i-1}}\,P_{bo}(w_i \mid w_{i-1}) & \text{otherwise}
    \end{cases}$$

- Different models are used in order of availability
  - If the N-gram model is not reliable/available, then use the (N-1)-gram model, and so on
- The N-gram estimate has to be discounted
  - Use Good-Turing
- The backed-off probability is multiplied by α to ensure that only the left-over probability mass is assigned

Katz back-off for trigrams


  With x = w_{i-2}, y = w_{i-1}, z = w_i:

  $$P_{bo}(z \mid x, y) =
    \begin{cases}
      P^{*}(z \mid x, y) & \text{if } C(x, y, z) > k \\
      \alpha_{xy}\,P_{bo}(z \mid y) & \text{else if } C(x, y) > 0 \\
      P^{*}(z) & \text{otherwise}
    \end{cases}$$

  $$P_{bo}(z \mid y) =
    \begin{cases}
      P^{*}(z \mid y) & \text{if } C(y, z) > k \\
      \alpha_{y}\,P^{*}(z) & \text{otherwise}
    \end{cases}$$

  $$\alpha_{xy} = \frac{1 - \sum_{z:\,C(x,y,z) > k} P^{*}(z \mid x, y)}{1 - \sum_{z:\,C(x,y,z) > k} P^{*}(z \mid y)}
  \qquad
  \alpha_{y} = \frac{1 - \sum_{z:\,C(y,z) > k} P^{*}(z \mid y)}{1 - \sum_{z:\,C(y,z) > k} P^{*}(z)}$$
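A sketch of the bigram-to-unigram case of this scheme. For brevity it uses k = 0 and a fixed absolute discount in place of the Good-Turing discounted P*, so it illustrates the back-off and alpha bookkeeping rather than full Katz back-off:

```python
from collections import defaultdict

def build_katz_bigram_backoff(bigram_counts, unigram_counts, discount=0.5):
    """Bigram -> unigram back-off in the spirit of the equations above."""
    total = sum(unigram_counts.values())
    p_uni = {w: c / total for w, c in unigram_counts.items()}

    # Discounted bigram probabilities P*(w | prev) for seen bigrams.
    p_star = defaultdict(dict)
    for (prev, w), c in bigram_counts.items():
        p_star[prev][w] = (c - discount) / unigram_counts[prev]

    # alpha(prev): left-over bigram mass divided by left-over unigram mass
    # (assumes some unigram mass remains unseen after every history).
    alpha = {}
    for prev, seen in p_star.items():
        alpha[prev] = ((1.0 - sum(seen.values()))
                       / (1.0 - sum(p_uni[w] for w in seen)))

    def p_bo(w, prev):
        if w in p_star.get(prev, {}):
            return p_star[prev][w]                        # seen: discounted estimate
        return alpha.get(prev, 1.0) * p_uni.get(w, 0.0)   # unseen: back off

    return p_bo
```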

Katz back-off: some issues


- If the (N-1)-gram was never seen, then α is 1
- Can start from a quadrigram and go all the way down to a unigram!
- Generally performs well
- Can have a problem with grammatical zeroes
  - If w is very frequent but not part of a trigram, then it is a potential zero
  - Backing off would instead estimate it to be some fraction of the bigram probability
- Estimates can change dramatically with new data

Absolute Discounting
- Instead of a multiplicative reduction of the higher-order N-gram probability, do an additive reduction
  - Limits the total mass subtracted

  $$P_{absolute}(w_i \mid w_{i-1}) =
    \begin{cases}
      \dfrac{C(w_{i-1} w_i) - D}{C(w_{i-1})} & \text{if } C(w_{i-1} w_i) > 0 \\[1ex]
      \alpha(w_i)\,P(w_i) & \text{otherwise}
    \end{cases}$$

Kneser-Ney Discounting
- The unigram probability is used only when the bigram probability is not available
  - "Francisco" is more frequent than "glasses"
  - But it appears only as "San Francisco"
  - Hence the unigram count used for "Francisco" should be low!
  - If not after "San", then the probability of "Francisco" is small
- Continuation probability!

Kneser-Ney Discounting
- The unigram count depends on the number of different bigrams in which the word occurs

  $$P^{*}(w_i) = \frac{\left|\{w_{i-1} : C(w_{i-1} w_i) > 0\}\right|}{\sum_{w} \left|\{w_{i-1} : C(w_{i-1} w) > 0\}\right|}$$

  Back-off form:

  $$P_{KN}(w_i \mid w_{i-1}) =
    \begin{cases}
      \dfrac{C(w_{i-1} w_i) - D}{C(w_{i-1})} & \text{if } C(w_{i-1} w_i) > 0 \\[1ex]
      \alpha(w_i)\,\dfrac{\left|\{w_{i-1} : C(w_{i-1} w_i) > 0\}\right|}{\sum_{w} \left|\{w_{i-1} : C(w_{i-1} w) > 0\}\right|} & \text{otherwise}
    \end{cases}$$

  Interpolated form:

  $$P_{KN}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) - D}{C(w_{i-1})} + \lambda(w_i)\,\frac{\left|\{w_{i-1} : C(w_{i-1} w_i) > 0\}\right|}{\sum_{w} \left|\{w_{i-1} : C(w_{i-1} w) > 0\}\right|}$$
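A sketch of the interpolated variant, built from the Counter tables of the earlier sketches. The back-off weight used here, D times the number of distinct continuations of the history divided by C(history), is a standard choice, not taken from the slide (which leaves that weight unspecified):

```python
from collections import Counter

def build_kneser_ney_bigram(bigram_counts, unigram_counts, discount=0.75):
    """Interpolated Kneser-Ney bigram estimate built from count tables."""
    # Continuation count: in how many distinct bigrams does w occur as the
    # second word?  Replaces the raw unigram count (the "Francisco" effect).
    continuation = Counter(w for (_, w) in bigram_counts)
    total_bigram_types = len(bigram_counts)

    # Number of distinct words that follow each history.
    followers = Counter(prev for (prev, _) in bigram_counts)

    def p_kn(w, prev):
        p_cont = continuation[w] / total_bigram_types
        lam = discount * followers[prev] / unigram_counts[prev]
        discounted = max(bigram_counts[(prev, w)] - discount, 0) / unigram_counts[prev]
        return discounted + lam * p_cont

    return p_kn
```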

Class-based N-grams
- Another way to handle sparseness
- to London, to Beijing, to Delhi
  - to Shanghai?
- Learn the frequency of "to <city>"
- Takes care of new vocabulary terms

  $$P(w_i \mid w_{i-1}) \approx P(c_i \mid c_{i-1}) \times P(w_i \mid c_i)$$
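For instance, with a hypothetical class CITY covering the city names, the unseen pair "to Shanghai" can still be scored: P(Shanghai | to) ≈ P(CITY | to) x P(Shanghai | CITY).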

Other N-gram models


- Use longer spheres of influence
- Skip N-grams
  - "Green Eggs", "Green Duck Eggs" - both contain the skip bigram "Green ... Eggs"
  - Variable skip length
- Variable-length N-grams
  - "Green Eggs and", "and Ham" - both bigrams
  - Use semantic information to guide the formation of longer N-grams
- Trigger based
  - Only after a trigger word: like ... ham, like ... Sam, like ... cricket
  - Within a window from the trigger

N-gram models: Summary


- Frequentist
  - Use MLE estimates of probabilities computed from a corpus
  - Simple, yet effective
- Smoothing to handle sparseness of data
  - Add-one, add-delta
  - Good-Turing
- Combine estimators for better models
  - Interpolation
  - Back-off
