423/723 Natural Language Processing

Assignment 1
Part of speech tagging and chunking (partial parsing)
Due: Tuesday Sep 29, 2015 Midnight to the D2L Dropbox
Some parts require you to use the NLTK and some simple programming in
Python; with permission, a student from non-informatics programs may do the
alternative that does not require programming. Put each part of the assignment
into its own file. Use tar or zip (only) to combine the directory into a single file
for the drop box.

Part of Speech Tagging


1. Find one tagging error in each of the following sentences that are tagged
with the Penn Treebank tagset:
(a) I/PRF want/VBP a/DT train/NN from/IN Milwaukee/NN
(b) Does/VBZ this/DT trip/VB serve/VB dinner/NNS
(c) I/PRP had/VB a/DT friend/NN living/VBG in/IN Denver/NNP
(d) Can/VBP you/PRP list/VB the/DT fastest/JJ morning/NN flights/NNS
2. Use the Penn Treebank Tagset to tag each word in the following sentences.
(a) It is a nice day
(b) This birthday party will be the best one ever.
(c) Nobody ever reads the comments she writes.
(d) He is a tall, skinny guy with a long, sad, mean-looking kisser, and a
creaky voice
(e) (GRAD ONLY) I am sitting in Sandys restaurant hoping to order
ceasar salad on fleek, which I would eat all day long.
(f) (GRAD ONLY) When a boy and a girl start talking back and forth
at each other, you know there is gonna be trouble.
3. (PROGRAMMING OPTION) Write 5 small functions in Python (as named
below under parts a to e). To prepare for this question, read chapter 5 of
the NLTK book. Test your functions with the Brown corpus (distributed
as part of nltk) using at least two categories (news, religion), and at least
3 words. One of the words should occur with only one sense, one
should occur with at least 4 senses, and one must occur with exactly 3
senses. (Hint: the function PlotNumberOfTags that you write will help
you find these.)
Save copies of the output of your program to a file called hw1_part1_output.txt
Below is an example of how the 5 functions would be used together.
>>> import nltk
>>> corpus = GetCorpus(nltk.corpus.brown, 'news')
>>> PlotNumberOfTags(corpus)
...show a plot with axes: X = number of tags (1, 2, ...) and Y = number of words having this number of tags...
>>> cfd = GetAmbiguousWords(corpus, 4)
<conditionalFrequency ...>
>>> TestGetAmbiguousWords(cfd, 4)
All words occur with more than 4 tags.
>>> ShowExamples('book', cfd, corpus)
book as NN: ....
book as VB: ....
(a) Function 1: Write a function GetCorpus(corpusName,categoryName)
that returns the tagged words for that corpus and category, so that
you can call the remaining functions without repeating this work.
(Hint: the body of this function should be a single statement.)
(b) Function 2: Write a function PlotNumberOfTags(corpus) that plots
the number of words having a given number of tags. The X-axis
should show the number of tags and the Y-axis the number of words
having exactly this number of tags.
If you do not have access to the pylab module, you can simplify your
plot by using an X at the y-value.
If you have access to the pylab module, you can use the following
example from the NLTK book as an inspiration:
import nltk
from nltk.corpus import brown

def performance(cfd, wordlist):
    # Lookup table: each word mapped to its most frequent tag.
    lt = dict((word, cfd[word].max()) for word in wordlist)
    baseline_tagger = nltk.UnigramTagger(model=lt, backoff=nltk.DefaultTagger('NN'))
    return baseline_tagger.evaluate(brown.tagged_sents(categories='news'))

def display():
    import pylab
    # In NLTK 3, FreqDist is a Counter, so sort by frequency explicitly.
    word_freqs = nltk.FreqDist(brown.words(categories='news')).most_common()
    words_by_freq = [w for (w, _) in word_freqs]
    cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
    sizes = 2 ** pylab.arange(15)
    perfs = [performance(cfd, words_by_freq[:size]) for size in sizes]
    pylab.plot(sizes, perfs, '-bo')
    pylab.title('Lookup Tagger Performance with Varying Model Size')
    pylab.xlabel('Model Size')
    pylab.ylabel('Performance')
    pylab.show()

(c) Function 3: Write a function GetAmbiguousWords(corpus,N) that
finds words in the corpus with more than N observed tags. This
function should return a ConditionalFreqDist object where the
conditions are the words and the frequency distribution indicates the tag
frequencies for each word.
(d) Function 4: Write a test function TestGetAmbiguousWords(cfd,N)
that verifies that the words indeed have more than N distinct tags in
the returned value.
(e) Function 5: Write a function ShowExamples(word,corpus) that, given
a word, finds one example of usage of the word with each of the
different tags in which it can occur. (The corpus can be the tagged
sentences or the tagged words, whichever is most convenient.)
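As a rough illustration of the bookkeeping behind functions (b) to (d), the sketch below counts tags per word on a tiny hand-made list of (word, tag) pairs, using only the Python standard library. The names toy_tagged, tags_per_word, and histogram_of_tag_counts are made up for this example and are not part of the assignment. With NLTK, an nltk.ConditionalFreqDist built from the tagged words would play the role of the dictionary of Counters. This is a picture of the data structure, not a solution.

```python
from collections import Counter, defaultdict

# Hypothetical toy data standing in for a corpus's tagged words.
toy_tagged = [
    ("the", "DT"), ("book", "NN"), ("is", "VBZ"), ("long", "JJ"),
    ("they", "PRP"), ("book", "VB"), ("a", "DT"), ("flight", "NN"),
    ("the", "DT"), ("book", "NN"),
]

def tags_per_word(tagged_words):
    """Map each lowercased word to a Counter of the tags it was seen with."""
    table = defaultdict(Counter)
    for word, tag in tagged_words:
        table[word.lower()][tag] += 1
    return table

def histogram_of_tag_counts(table):
    """Map 'number of distinct tags' to 'number of words with that many tags'."""
    return Counter(len(tag_counts) for tag_counts in table.values())

table = tags_per_word(toy_tagged)
histogram = histogram_of_tag_counts(table)

# Words observed with more than one distinct tag.
ambiguous = {w: c for w, c in table.items() if len(c) > 1}

print(dict(table["book"]))        # tag frequencies for one word
print(sorted(histogram.items()))  # (number of tags, number of words) pairs
print(sorted(ambiguous))          # the ambiguous words
```

The histogram pairs are the X and Y values that PlotNumberOfTags is asked to display, and the filter on len(c) is the kind of check TestGetAmbiguousWords would repeat on the returned value.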
NON-PROGRAMMING OPTION
(a) Write a set of 4 sentences and then tag them with the Penn
Treebank tagset. (Include both the untagged and tagged text.) The
sentences should contain at least 3 words with at least 3 different tags.
You should use the online tagger program: ONLINE POS DEMO to
assign the tags.
(b) Create a table that includes all words in your hand-written text that
received more than one tag, showing the overall frequency of the
word, the number of tags received, as well as the actual tags with
their frequency. For example,
word    Total Occurrences    Total Tags    Distribution
book    3                    2             NN (2), VB (1)
(c) Select an existing text of at least 500 words that is available on
the web that appears to have at least 3 ambiguous words, tag it
using the Penn Treebank tagset, and then create a table for this text
that includes all words that received more than one tag, showing the
overall frequency of the word, the number of tags received, as well as
the actual tags with their frequency, as before.
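For readers who want to check a hand-built table, here is a minimal sketch that assumes the tagged text is written in the word/TAG form used in question 1. The function names parse_tagged_text and ambiguity_rows, and the sample sentences with their tags, are invented for illustration. Only the standard library is used.

```python
from collections import Counter, defaultdict

def parse_tagged_text(text):
    """Split 'word/TAG' tokens into (word, tag) pairs, splitting on the last slash."""
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

def ambiguity_rows(pairs):
    """Return (word, total occurrences, number of tags, distribution) rows
    for every word that received more than one tag."""
    table = defaultdict(Counter)
    for word, tag in pairs:
        table[word.lower()][tag] += 1
    rows = []
    for word in sorted(table):
        counts = table[word]
        if len(counts) > 1:
            dist = ", ".join(f"{tag}({n})" for tag, n in counts.most_common())
            rows.append((word, sum(counts.values()), len(counts), dist))
    return rows

# Hypothetical hand-tagged text; the tags are illustrative only.
sample = ("I/PRP will/MD book/VB a/DT flight/NN ./. "
          "The/DT book/NN is/VBZ here/RB ./. "
          "This/DT book/NN is/VBZ long/JJ ./.")

rows = ambiguity_rows(parse_tagged_text(sample))
print(f"{'word':<8}{'Total Occurrences':<20}{'Total Tags':<12}Distribution")
for word, total, n_tags, dist in rows:
    print(f"{word:<8}{total:<20}{n_tags:<12}{dist}")
```

Words are lowercased before counting, so "The" and "the" are treated as one word. Whether that is the right choice for your text is a judgment call you should state in your answer.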
PART 2:

Chunking and Partial Parsing


(Readings: Jurafsky and Martin sections 13.1, 13.2, 13.5 (up to 13.5.2); NLTK
Book Ch 7)
1. (Exercise 7.2 from NLTK): Write a tag pattern to match noun phrases
containing plural head nouns, e.g. many/JJ researchers/NNS, two/CD
weeks/NNS, both/DT new/JJ positions/NNS. Try to do this by generalizing the tag pattern (discussed in the text) that handles singular noun
phrases. Reminder: Tag patterns are regular expression grammars that
use the tagset as categories to be matched. For example, the following
pattern says that a noun phrase might be made of an optional determiner
(as indicated by the DT followed by a question mark), any number of
adjectives (as indicated by the JJ followed by an asterisk), followed by a
noun (NN).

NP: {<DT>?<JJ>*<NN>}
Another symbol you might use is a +, which indicates one or more
occurrences.
2. (Exercise 7.5 from NLTK) Write a tag pattern to cover noun phrases that
contain gerunds, e.g. the/DT receiving/VBG end/NN, assistant/NN
managing/VBG editor/NN. For each pattern, include an example phrase
that matches the pattern. If programming, add these patterns to the
grammar, one per line, and test your work using some tagged sentences
of your own devising. (Non-programming students should provide a set of
tagged sentences and mark by hand the noun phrases with gerunds that
match the new patterns; you can also use the online tagger demo to help
check your work.)
3. (Exercise 7.6 from NLTK) Write one or more tag patterns to handle
coordinated noun phrases, e.g. July/NNP and/CC August/NNP,
all/DT your/PRP$ managers/NNS and/CC supervisors/NNS,
company/NN courts/NNS and/CC adjudicators/NNS. For each pattern
include an example that matches the pattern.
4. Provide a text fragment of at least 120 words from an online source that
includes both noun phrases with gerunds and with coordination and show
the chunk boundaries. (Include both your chunk-marked text and a link
to the original source.)
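To make the reminder in exercise 1 concrete, the sketch below shows one way a tag pattern can be applied: the tags of a sentence are written out as a string of <TAG> tokens, and the pattern is run over that string as an ordinary regular expression. This is a simplified standard-library illustration with made-up helper names (compile_tag_pattern, chunk); it treats each tag literally and supports only the ?, * and + quantifiers. In NLTK itself, nltk.RegexpParser with a grammar such as "NP: {<DT>?<JJ>*<NN>}" does the chunking and returns a tree. Only the singular pattern from the text is used here.

```python
import re

def compile_tag_pattern(pattern):
    """Turn a tag pattern such as '<DT>?<JJ>*<NN>' into a regex over '<TAG>' tokens.
    Each <TAG> is wrapped in a group so that ?, * and + apply to the whole tag."""
    wrapped = re.sub(
        r"<([^>]+)>",
        lambda m: "(?:<" + re.escape(m.group(1)) + ">)",
        pattern,
    )
    return re.compile(wrapped)

def chunk(tagged, pattern):
    """Return the chunks (lists of (word, tag) pairs) matched from left to right."""
    regex = compile_tag_pattern(pattern)
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    chunks = []
    for match in regex.finditer(tag_string):
        # Convert character offsets back to token indices by counting '<'.
        start = tag_string.count("<", 0, match.start())
        end = tag_string.count("<", 0, match.end())
        if end > start:
            chunks.append(tagged[start:end])
    return chunks

singular_np = "<DT>?<JJ>*<NN>"  # the pattern given in exercise 1

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
plural = [("many", "JJ"), ("researchers", "NNS")]

chunks = chunk(sentence, singular_np)
for c in chunks:
    print(" ".join(f"{w}/{t}" for w, t in c))

# The singular pattern finds no chunk in a phrase whose head noun is NNS.
print(chunk(plural, singular_np))
```

The empty result for the plural phrase is the gap that exercise 1 asks you to close by generalizing the pattern.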
