- Validation/Verification: check for syntax, better understanding
- Generation: Q&A, dialogue systems
- Prediction: speech recognition, spelling correction
- Discrimination: topic detection, authorship verification
- Constructing a full parse tree: expensive
- Ambiguity: requires a PCFG
- Specified a priori: not data-driven
- Overkill: a complete grammar specification is not needed for several tasks
- We turned ___ the TV to watch the Cricket __________.
- We turned on the TV to watch the Cricket match.
- We turned in the TV to watch the Cricket game.
- We turned on the TV to watch the Cricket tournament.
- We turned off the TV to watch the Cricket hop.
Frequentist!
- We subconsciously order things based on frequency.
- Such statistical models can also benefit automated language processing.
- Similarly, "match" is a more likely completion of "I watched the Cricket ___" than "game" or "hop".
- You probably used this idea in your spell-check assignment.
- A simple way of modeling this is to use N-gram statistics, i.e., N-gram models.
"I watched the" is a 3-gram (trigram).
N-gram computation
$$P(\text{the} \mid \text{I watched}) = \frac{C(\text{I watched the})}{C(\text{I watched})}$$
Bigrams
- Typically, w_0 is a special start-of-sentence symbol (<s>) and w_n is a special end-of-sentence symbol (</s>).
- Markov assumption: $P(w_1 w_2 \cdots w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})$
Training corpus:
- <s> I am Sam </s>
- <s> Sam I am </s>
- <s> I do not like green eggs and ham </s>
$P(\text{I} \mid \text{<s>}) = \frac{2}{3}$; $P(\text{Sam} \mid \text{<s>}) = \frac{1}{3}$; $P(\text{am} \mid \text{I}) = \frac{2}{3}$
$P(\text{</s>} \mid \text{Sam}) = \frac{1}{2}$; $P(\text{Sam} \mid \text{am}) = \frac{1}{2}$; $P(\text{do} \mid \text{I}) = \frac{1}{3}$
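A minimal sketch (not from the original slides) of how these bigram MLEs could be computed; the function and variable names are illustrative:

```python
from collections import Counter

# Toy training corpus from the slides
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_mle(w, prev):
    """Maximum-likelihood bigram estimate P(w | prev) = C(prev w) / C(prev)."""
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_mle("I", "<s>"))    # 2/3
print(p_mle("Sam", "<s>"))  # 1/3
print(p_mle("am", "I"))     # 2/3
```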
- Usually a corpus is split into training and test sets.
- Use various measures for evaluation.
- Performance depends on the training corpus, as with all data-driven methods.
- Other issue: closed vs. open vocabulary.
- Closed: only a certain set of pre-determined words can occur.
- Open: allow for an unlimited vocabulary.
  - Choose a vocabulary to model.
  - Replace out-of-vocabulary words with <UNK>.
  - Estimate the probability of <UNK> from the training data.
  - Alternative: treat the first occurrence of every word as <UNK>.
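A minimal sketch of the <UNK> replacement step (the vocabulary below is an arbitrary illustration, not from the slides):

```python
def apply_unk(tokens, vocab):
    """Map any word outside the chosen vocabulary to the <UNK> token."""
    return [t if t in vocab else "<UNK>" for t in tokens]

vocab = {"<s>", "</s>", "I", "am", "Sam"}
print(apply_unk("<s> I am Batman </s>".split(), vocab))
# ['<s>', 'I', 'am', '<UNK>', '</s>']
```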
What is a word?
- cat vs. cats
- eat vs. ate vs. eating
- President of the United States vs. POTUS
- ahh, umm
- . , ? ! ; etc.
Perplexity
- How well does the model fit the test data?
- The higher the probability of the test data, the lower the perplexity.
- A measure of the branching factor.
- Related to entropy.
$$PP(W) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \cdots w_N)}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$$
(the last step uses a bigram model)
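A minimal sketch of this perplexity computation for a bigram model, done in log space to avoid underflow; prob can be any conditional probability function, e.g. the MLE sketch above:

```python
import math

def perplexity(tokens, prob):
    """PP(W) = (prod_i 1 / P(w_i | w_{i-1}))^(1/N), computed via log probabilities."""
    log_prob = 0.0
    n = 0
    for prev, w in zip(tokens, tokens[1:]):
        log_prob += math.log(prob(w, prev))
        n += 1
    return math.exp(-log_prob / n)

# e.g. perplexity("<s> I am Sam </s>".split(), p_mle) with the earlier MLE sketch
```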
Validity
- P(Sam I am) = P(Sam | <s>) × P(I | Sam) × P(am | I) × P(</s> | am)
- If P(Sam I am) < threshold, the sentence is declared invalid.

Prediction
- "Bring me green eggs and ___"
- Predict the next word: argmax_w [ P(w | and) × P(</s> | w) ]
Discrimination
- "I like green eggs and ham"
- P_Seuss(I like green eggs and ham) > P_Shakespeare(I like green eggs and ham)
Generation
- Fix w_{n-N+1} ... w_{n-1} and sample w_n from P(· | w_{n-N+1} w_{n-N+2} ... w_{n-1}).
- Treat punctuation, end of sentence, etc., as words.
- Unigram: "To him swallowed confess hear both"
- Bigram: "What means, sir. I confess she?"
- Trigram: "Therefore the sadness of parting, as they say, 'tis done."
- Quadrigram: "What! I shall go seek the traitor Gloucester"
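A minimal sketch of this sampling procedure for a bigram model, using the toy corpus from earlier and sampling proportionally to bigram counts (names are illustrative):

```python
import random
from collections import Counter, defaultdict

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

# For each history word, collect the observed next words and their counts.
successors = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, w in zip(tokens, tokens[1:]):
        successors[prev][w] += 1

def generate(max_len=20):
    """Repeatedly sample w_n from P(. | w_{n-1}) until </s> is produced."""
    output, prev = [], "<s>"
    for _ in range(max_len):
        words, counts = zip(*successors[prev].items())
        prev = random.choices(words, weights=counts)[0]
        if prev == "</s>":
            break
        output.append(prev)
    return " ".join(output)

print(generate())
```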
Limited applicability
- Combine with other methods such as back-off and interpolation.
- Most success in speech recognition and discrimination.
- Useful in non-grammatical settings!
Smoothing
Training corpus:
- <s> I am Sam </s>
- <s> Sam I am </s>
- <s> I do not like green eggs and ham </s>
- P(Sam | like) = 0 in this corpus, hence the probability of any sentence containing "like Sam" is also 0.
- Do not assign zero probability even to unseen bigrams that arise due to sparseness.
- What about really rare or nonsense combinations, e.g., "Sam green ham"?
Sparseness of data
Laplace smoothing
- Unigram case: $P^{*}(w_i) = \frac{C(w_i) + 1}{N + V}$
- Bigram case: $P^{*}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) + 1}{C(w_{i-1}) + V}$
(N = number of tokens, V = vocabulary size)
$P^{*}(\text{I} \mid \text{<s>}) = \frac{3}{14}$; $P^{*}(\text{Sam} \mid \text{<s>}) = \frac{2}{14}$; $P^{*}(\text{am} \mid \text{I}) = \frac{3}{14}$
$P^{*}(\text{</s>} \mid \text{Sam}) = \frac{2}{13}$; $P^{*}(\text{Sam} \mid \text{am}) = \frac{2}{13}$; $P^{*}(\text{do} \mid \text{I}) = \frac{2}{14}$
$P^{*}(\text{Sam} \mid \text{like}) = \frac{1}{12}$
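A minimal sketch of add-one (Laplace) smoothing on the toy corpus; V here counts the word types that may appear as a continuation (everything except <s>), which reproduces V = 11 and the values above:

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

V = len(set(unigram_counts) - {"<s>"})  # 11 word types can follow a history

def p_laplace(w, prev):
    """Add-one smoothed bigram estimate: (C(prev w) + 1) / (C(prev) + V)."""
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + V)

print(p_laplace("I", "<s>"))     # 3/14
print(p_laplace("Sam", "like"))  # 1/12
```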
Adjusted (reconstituted) counts
- Unigram: $c^{*}(w_i) = (C(w_i) + 1)\frac{N}{N + V}$
- Bigram: $c^{*}(w_{i-1} w_i) = (C(w_{i-1} w_i) + 1)\frac{C(w_{i-1})}{C(w_{i-1}) + V}$
- Divide by N or C(w_{n-1}) to get back the smoothed estimates.
- Example: $C(\text{I}) = 3$; $c^{*}(\text{I}) = 4 \times \frac{17}{28} \approx 2.43$
- Discounted counts!
- $c^{*}(\text{like Sam}) = 1 \times \frac{1}{12} \approx 0.083$; $c^{*}(\text{I am}) = 3 \times \frac{3}{14} \approx 0.64$
Good-Turing Discounting
- Laplace is simple but smooths too much.
- Use the frequency of events that occurred once to estimate the frequency of unseen events.
- Let N_c be the number of items that occur exactly c times (the "frequency of frequency" c).
- The adjusted count is given by: $c^{*} = (c + 1)\frac{N_{c+1}}{N_c}$
- The probability of zero-frequency items: $P_{GT}(\text{items in } N_0) = \frac{N_1}{N}$
- Developed by Turing and Good during World War II as part of their work on deciphering German codes (Enigma).
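A minimal sketch of the adjusted-count computation $c^{*} = (c+1) N_{c+1} / N_c$; the bigram counts below are illustrative, and note how a missing N_{c+1} gives a zero, which motivates the N_c smoothing discussed next:

```python
from collections import Counter

def good_turing_counts(counts):
    """Return adjusted counts c* = (c + 1) * N_{c+1} / N_c for every observed item."""
    n = Counter(counts.values())  # N_c: number of items occurring exactly c times
    return {item: (c + 1) * n.get(c + 1, 0) / n[c] for item, c in counts.items()}

bigram_counts = {("I", "am"): 2, ("am", "Sam"): 1, ("Sam", "I"): 1, ("I", "do"): 1}
print(good_turing_counts(bigram_counts))
# c = 1 items get c* = 2 * N_2 / N_1 = 2/3; the c = 2 item gets 0 because N_3 = 0
```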
- The probability of a bigram occurring once is given by the GT estimate.
- So observed counts could be from bigrams of a different frequency.
- We need a way of estimating the missing N_c.
- Assumption: $N_c = a c^{b}$, i.e., $\log N_c = a + b \log c$.
- Linear regression on a log-log scale: $\log c$ vs. $\log N_c$.
- Alternatively: fit adjusted counts $N_c^{*}$.
- Not a good fit for small c, hence use N_c as-is if available.
- Switch from actual counts to estimated counts when the error is small.
- Assume that low-frequency items are really zero-frequency items.
- Assume that the counts of very high-frequency items are correct and do not have to be discounted.
$$c^{*} = \frac{(c+1)\frac{N_{c+1}}{N_c} - c\,\frac{(k+1)N_{k+1}}{N_1}}{1 - \frac{(k+1)N_{k+1}}{N_1}}, \qquad \text{for } 1 \le c \le k$$
A value of k = 5 is suggested by Katz.
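A minimal sketch of the corrected-count formula above; freq_of_freq maps c to N_c and is assumed to be non-zero for the counts used:

```python
def katz_adjusted_count(c, freq_of_freq, k=5):
    """Good-Turing count with Katz's cutoff: counts above k are trusted as-is."""
    if c > k:
        return c
    a = (k + 1) * freq_of_freq.get(k + 1, 0) / freq_of_freq[1]
    c_gt = (c + 1) * freq_of_freq.get(c + 1, 0) / freq_of_freq[c]
    return (c_gt - c * a) / (1 - a)
```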
Combining Estimators
- Another technique for handling sparseness.
- Estimate N-gram probabilities by using the estimates for the constituent lower-order grams.
- Yields better models than smoothing a fixed N-gram model.
- We look at two methods:
  - Interpolation
  - Back-off
#"
i
=1
- Finite mixture model.
- Weights determined by EM, or empirically through a hold-out set.
- Discounts higher-order probabilities.
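A minimal sketch of the interpolated estimate; p_uni, p_bi and p_tri stand for the component models, and the lambda values are placeholders that would in practice be set by EM or on a held-out set as noted above:

```python
def p_interpolated(w, u, v, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """P_hat(w | u v) = l1*P(w) + l2*P(w | v) + l3*P(w | u v), with l1 + l2 + l3 = 1."""
    l1, l2, l3 = lambdas
    return l1 * p_uni(w) + l2 * p_bi(w, v) + l3 * p_tri(w, u, v)
```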
- Can give higher weights to longer histories if their counts are high.
- Histories with the same frequencies get the same weights.
- The weight of the (N-1)-gram model is determined by the average number of non-zero N-grams that follow this (N-1)-gram.
Katz Back-off
$$P_{bo}(w_i \mid w_{i-2} w_{i-1}) = \begin{cases} (1 - d_{w_{i-2} w_{i-1}})\,\dfrac{C(w_{i-2} w_{i-1} w_i)}{C(w_{i-2} w_{i-1})} & \text{if } C(w_{i-2} w_{i-1} w_i) > k \\[2mm] \alpha_{w_{i-2} w_{i-1}}\, P_{bo}(w_i \mid w_{i-1}) & \text{otherwise} \end{cases}$$
- If the N-gram estimate is not reliable/available, use the (N-1)-gram estimate, and so on.
- Use Good-Turing for the discounting.
- The back-off probability is multiplied by α to ensure that only the left-over probability mass is assigned.
1"
! xy =
z :C ( x , y , z )> k
1"
bo z :C ( x , y , z )> k
# P (z x, y ) # P (z y )
*
1"
!y =
z :C ( y , z )> k
1"
# P (z y ) # P ( z)
* *
z :C ( y , z )> k
- If the (N-1)-gram was never seen, then α is 1.
- Can start from a quadrigram and go all the way down to a unigram!
- Generally performs well.
- Can have problems with grammatical zeroes.
- If w is very frequent but never occurs in a particular trigram context, the trigram probability should really be zero.
- Backing off would instead estimate it to be some fraction of the bigram probability.
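A minimal sketch of bigram-to-unigram back-off on the toy corpus, using a fixed absolute discount in place of Good-Turing for brevity; alpha implements the left-over mass formula above with k = 0:

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

N = sum(unigram_counts.values())
D = 0.5  # illustrative fixed discount

def p_star(w, prev):
    """Discounted probability for seen bigrams."""
    return (bigram_counts[(prev, w)] - D) / unigram_counts[prev]

def p_uni(w):
    return unigram_counts[w] / N

def alpha(prev):
    """Left-over mass of prev, renormalised over its unseen continuations."""
    seen = [w for (p, w) in bigram_counts if p == prev]
    return (1 - sum(p_star(w, prev) for w in seen)) / (1 - sum(p_uni(w) for w in seen))

def p_backoff(w, prev):
    return p_star(w, prev) if bigram_counts[(prev, w)] > 0 else alpha(prev) * p_uni(w)

print(p_backoff("Sam", "like"))  # unseen bigram: alpha('like') * P('Sam')
```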
Absolute Discounting
- Subtract a fixed discount D from every observed count and redistribute the freed mass to unseen events.
Kneser-Ney Discounting
- "Francisco" is more frequent than "glasses", but it appears only in "San Francisco".
- Hence, despite its high raw count, "Francisco" should get a low unigram weight when backing off.
Continuation probability!
Kneser-Ney Discounting
- The unigram estimate depends on the number of different bigrams in which the word occurs (continuation probability):
$$P_{\text{cont}}(w_i) = \frac{\left|\{w_{i-1} : C(w_{i-1} w_i) > 0\}\right|}{\sum_{w}\left|\{w_{i-1} : C(w_{i-1} w) > 0\}\right|}$$
- Back-off form:
$$P_{KN}(w_i \mid w_{i-1}) = \begin{cases} \dfrac{C(w_{i-1} w_i) - D}{C(w_{i-1})} & \text{if } C(w_{i-1} w_i) > 0 \\[2mm] \alpha(w_{i-1})\, P_{\text{cont}}(w_i) & \text{otherwise} \end{cases}$$
- Interpolated form:
$$P_{KN}(w_i \mid w_{i-1}) = \frac{\max\big(C(w_{i-1} w_i) - D,\, 0\big)}{C(w_{i-1})} + \lambda(w_{i-1})\, P_{\text{cont}}(w_i)$$
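A minimal sketch of the interpolated form on the toy corpus; the discount D = 0.75 is an arbitrary illustrative value:

```python
from collections import Counter, defaultdict

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

# Distinct left contexts of each word, and distinct continuations of each word.
left_contexts, continuations = defaultdict(set), defaultdict(set)
for (prev, w) in bigram_counts:
    left_contexts[w].add(prev)
    continuations[prev].add(w)

total_bigram_types = len(bigram_counts)
D = 0.75  # illustrative discount

def p_continuation(w):
    """P_cont(w) = |{v : C(v w) > 0}| / total number of distinct bigram types."""
    return len(left_contexts[w]) / total_bigram_types

def p_kn(w, prev):
    """Interpolated Kneser-Ney bigram estimate."""
    discounted = max(bigram_counts[(prev, w)] - D, 0) / unigram_counts[prev]
    lam = D * len(continuations[prev]) / unigram_counts[prev]  # lambda(prev)
    return discounted + lam * p_continuation(w)

print(p_kn("Sam", "am"))    # seen bigram
print(p_kn("Sam", "like"))  # unseen bigram, falls back to the continuation term
```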
Class-based N-grams
- Example: "to Shanghai" may never occur in training even though "to London" and "to Beijing" do.
- Use word classes: P(Shanghai | to) ≈ P(CITY | to) × P(Shanghai | CITY).
Skip N-grams
- Skip bigrams such as "Green Eggs" also match "Green Duck Eggs" (see the sketch below).
- Variable skip length: "Green Eggs and" and "and Ham" both yield bigrams.
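A minimal sketch of extracting skip bigrams with a bounded skip length (the function name and limit are illustrative, not from the slides):

```python
def skip_bigrams(tokens, max_skip=1):
    """All ordered pairs (w_i, w_j) with at most max_skip words skipped between them."""
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            pairs.append((w, tokens[j]))
    return pairs

print(skip_bigrams("Green Duck Eggs".split()))
# [('Green', 'Duck'), ('Green', 'Eggs'), ('Duck', 'Eggs')] -- includes "Green Eggs"
```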
Trigger based
- Use semantic information to guide the formation of longer N-grams.
- Only after a trigger word.
Frequentist
- Use MLE estimates of probabilities from a corpus.
- Simple, yet effective.
- Smoothing: add-one, add-delta, Good-Turing.
- Combining estimators: interpolation, back-off.