
Experiments with Statistical Language Generation

Some time ago I started to wonder how amazingly our brain is able to generate language. This is a very hard problem, and it is not fully understood by science. Since I wanted to do some practical experiments, I used what I had at hand: statistics. The Natural Language Toolkit for Python (NLTK) provides some simple methods for generating text. I studied them and created a few curious examples, which I post here. First, let's write some code that learns an n-gram language model where each gram is a word from the text. I use two corpora, the Bible's Genesis in English and the Reuters trade category, both available in NLTK. The code is below.

from nltk.corpus import reuters
from nltk.corpus import genesis
from nltk.probability import LidstoneProbDist
from nltk.model import NgramModel

tokens = list(genesis.words('english-kjv.txt'))
tokens.extend(list(reuters.words(categories='trade')))

estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)

model = NgramModel(3, tokens, estimator)

text_words = model.generate(50)

text = ' '.join([word for word in text_words])

print text

In the line that defines the estimator I use Lidstone smoothing for the n-gram language model. For more information about smoothing techniques in n-gram language modelling, consult Wikipedia or a good statistical natural language processing book. In this example I generate 50 words with a 3-gram model. This means the model takes the previous 2 words, checks which words followed that pair in the training text, and picks one of them according to the estimated probabilities. For every pair of words the model selects a third, and it repeats this until all 50 words are generated. Two texts generated by the 3-gram model are shown below (note that each run of the algorithm produces a different random example); a small from-scratch sketch of the same sampling idea follows them.

In the fiscal year, with tabret, and menservants, and sat in the presence of the heaven; and darkness was upon all that he may waive any retaliation if it accelerated too fast and the one affected by the hand of Es for I also here looked

In the speech, the brother of Japheth the elder, even with Isaac's steel exports, the statistics department said first quarter. There were indications of improvement in Britain's deputy minister of Economy and Communication. Speaking in a coffin in Egypt, Jacob
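To make the sampling procedure concrete, here is a minimal from-scratch sketch of the same idea. It is not the NLTK implementation: the names (followers, sample_next, gamma) are mine, and for simplicity it only samples among continuations actually seen in training, whereas the Lidstone estimator used above also reserves probability mass for unseen words. The constant 0.2 matches the value passed to LidstoneProbDist.

import random
from collections import defaultdict, Counter
from nltk.corpus import genesis, reuters

tokens = list(genesis.words('english-kjv.txt'))
tokens.extend(list(reuters.words(categories='trade')))

# For every pair of consecutive words, count which words followed it.
followers = defaultdict(Counter)
for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
    followers[(w1, w2)][w3] += 1

def sample_next(context, gamma=0.2):
    # Weight each observed continuation by count + gamma (Lidstone-style).
    counts = followers[context]
    if not counts:
        return None
    words = list(counts)
    weights = [counts[w] + gamma for w in words]
    r = random.uniform(0, sum(weights))
    for word, weight in zip(words, weights):
        r -= weight
        if r <= 0:
            return word
    return words[-1]

# Generate 50 words, starting from the first two tokens of the corpus.
generated = [tokens[0], tokens[1]]
while len(generated) < 50:
    next_word = sample_next((generated[-2], generated[-1]))
    if next_word is None:
        break
    generated.append(next_word)
print(' '.join(generated))

The output has the same flavour as the examples above, since the counts behind it are the same.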

You can see that the algorithm mixed the texts: sometimes it speaks about heaven, Jacob and Egypt, and other times about improvements in Britain, retaliation, or a minister of Economy and Communication. Most importantly, you can say that, at a high level, the text respects English grammar and that the sequence of words is not totally random. This is the result of the n-gram model. The higher the n-gram order of your model, the more coherent the text will be. For example, let's generate with 5-grams:

model = NgramModel(5, tokens, estimator)

In the beginning God created the heaven and the earth. And when Joseph saw Benjamin with them, he said. Such an approach would finance the trade deficit but allow for its gradual resolution over time. YEUTTER SAYS U.S., Company spokesmen told Reuters.

In this example the generated text starts exactly like the Bible, but in the middle it changes to some random text from Reuters. In my examples I am not moving the start point, so the random text will always start with "In the beginning God ..." (for 5-grams). What we can verify is that the text is much more coherent (again, at a high level) than the previous 3-gram model. It keeps basic English syntax even if the meaning (the semantics) is not coherent. Ok, if the text generated with higher-order n-grams is more coherent, why do we not use an even higher order for modelling? First, because of time and space constraints: the larger the n-gram order, the more processing time and memory we require (it grows exponentially). In this example I am using a text of roughly 200,000 words, but imagine using a corpus of more than 1 billion words, which is usual for language modelling. Second, because of the probabilities themselves. The number of different words that can follow the same previous 3 words is much larger than the number that can follow the same previous 6 or 7 words. So, with a high-order n-gram model, the probability estimates do not perform well because of the insufficient amount of data (you can check this sparsity on our own corpus with the short snippet shown after the 2-gram example below). Again, more data means more time; it is a complex trade-off. What is usual in computational linguistics is 3- to 5-gram models; this is what you will find in the papers. Just for fun, let's see how crazily the random words are chosen by a 2-gram language model.

model = NgramModel(2, tokens, estimator)

In the rods of the land of her pitcher from 27 of an agreement aims to impose permanent quotas on Japan hopes of the latest monthly figure excluding Dutch - import quotas on its currency and 24. Another possible to rebuild our goal set it said " We cannot
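Here is the sparsity check mentioned above. It is just a small illustration I wrote for this post (the variable names and the choice of orders to test are mine): it counts, for several n-gram orders, how many distinct (n-1)-word histories occur in our roughly 200,000-word corpus and what fraction of them occur only once.

from collections import Counter
from nltk.corpus import genesis, reuters

tokens = list(genesis.words('english-kjv.txt'))
tokens.extend(list(reuters.words(categories='trade')))

for n in (2, 3, 5, 7):
    # The model conditions on (n - 1)-word histories; count how often each occurs.
    size = n - 1
    histories = Counter(tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1))
    singletons = sum(1 for count in histories.values() if count == 1)
    print('%d-gram: %d distinct histories, %.0f%% seen only once'
          % (n, len(histories), 100.0 * singletons / len(histories)))

As the history gets longer, almost every history becomes unique, so the counts behind the probability estimates shrink towards one and the model essentially memorises the corpus.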

To complement the example, I will now demonstrate that you can even choose letters as the units of your language model. The code is posted below.

from nltk.corpus import reuters
from nltk.corpus import genesis
from nltk.probability import LidstoneProbDist
from nltk.model import NgramModel

tokens = list(genesis.words('english-kjv.txt'))
tokens.extend(list(reuters.words(categories='trade')))

letters = [letter for letter in ' '.join([word for word in tokens])]

estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)

model = NgramModel(4, letters, estimator)

text = model.generate(200)
text_output = ''.join([letter for letter in text])
print text_output

I chose 4-grams, i.e., the model picks the 3 previous letters and selects the fourth. The text generated for 200 letters is below, and after it a small sketch shows the kind of 3-letter histories the model conditions on.

In to On to serity seriod on the 62 bid to Wash, and he U." the rawing up, sinese ther it he sharah fedeji Fulforce againted import becassed the Asen surplus in July alway to Tariffections in 1.w
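Here is the promised sketch, a tiny snippet I wrote for illustration (the variable names and the example histories are mine): it rebuilds the same letter stream and prints, for a few 3-character histories, the characters most often observed right after them.

from collections import defaultdict, Counter
from nltk.corpus import genesis, reuters

tokens = list(genesis.words('english-kjv.txt'))
tokens.extend(list(reuters.words(categories='trade')))
letters = ' '.join(tokens)

# Map each 3-character history to a count of the characters that followed it.
followers = defaultdict(Counter)
for i in range(len(letters) - 3):
    followers[letters[i:i + 3]][letters[i + 3]] += 1

for history in ('In ', 'the', 'God'):
    print('%r -> %s' % (history, followers[history].most_common(5)))

A 4-gram letter model does nothing more than sample from these counts, exactly as in the word-level sketch earlier.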

Of course, the readability of the generated text above is worse than when picking words, since some of the words do not even exist. The punctuation also does not work well, and the syntax, well, it is just some letters. You can increase or decrease the n-gram order to see more results. The most amazing thing in these examples is this: yes, we could generate text, and in some cases we could generate texts that people would say were written by other people (some crazy people, maybe). And what are we using? Just a corpus and statistics! Natural Language Processing is amazing!

nlpb.blogspot.com

http://nlpb.blogspot.com/2011/01/experimentswithstatisticallanguage.html http://goo.gl/ffgS
