
Methods in Computational Linguistics 2
Final Exam
Spring 2014

For all questions solved with python code, include your code with the answer.

Probability
1. (15pts) You select a shape from one box.

[Figure: three boxes -- A, B, and C -- each containing a mix of stars, circles, and triangles.]

Determine the following probabilities.

Conditional
P(Star | Box C)

Marginal
P(Star) if P(Box A) = 2/15, P(Box B) = 1/6, and P(Box C) = 1/5

P(Circle) if P(Box A) = 1/3, P(Box B) = 1/3, and P(Box C) = 1/3

Conditional
P(Box B | Triangle), where the prior probabilities of the three boxes are P(Box A) = 1/3, P(Box B) = 1/3, P(Box C) = 1/3

P(Box A | Star), where the prior probabilities of the three boxes are P(Box A) = 1/4, P(Box B) = 1/4, P(Box C) = 1/2
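Both kinds of quantity follow mechanically from the shape counts in each box. A minimal sketch of the arithmetic -- note the boxes dict below uses hypothetical counts as a placeholder, not the actual contents of the figure:

```python
from fractions import Fraction

# Hypothetical shape counts per box -- read the real counts off the figure.
boxes = {'A': {'star': 2, 'circle': 3, 'triangle': 1},
         'B': {'star': 1, 'circle': 2, 'triangle': 3},
         'C': {'star': 3, 'circle': 1, 'triangle': 2}}
prior = {'A': Fraction(1, 3), 'B': Fraction(1, 3), 'C': Fraction(1, 3)}

def p_shape_given_box(shape, box):
    """Conditional P(shape | box): that box's fraction of this shape."""
    counts = boxes[box]
    return Fraction(counts[shape], sum(counts.values()))

def p_shape(shape):
    """Marginal via the law of total probability: sum_b P(b) P(shape|b)."""
    return sum(prior[b] * p_shape_given_box(shape, b) for b in boxes)

def p_box_given_shape(box, shape):
    """Bayes' rule: P(box | shape) = P(box) P(shape|box) / P(shape)."""
    return prior[box] * p_shape_given_box(shape, box) / p_shape(shape)
```

Using Fraction keeps the answers exact, which makes them easy to check by hand.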

Regular Expressions
2. (10pts) Japanese uses a syllabary to construct words. Each syllable is a CV (consonant-vowel) unit with an optional /n/ coda. When foreign words enter the Japanese lexicon they need to be converted to this syllabary. This leads to words like taxi being converted to takushi. However, some words like "banana" and "chiron" need no conversion. Write a regular expression that matches English words that need no modification for Japanese. Use the simplified syllabary below for this question. For reference: Japanese also has /b/, /z/, /d/, /g/, /p/, and /j/ consonants, as well as modified vowels that make sounds like gya; for simplicity these are omitted here. Make particular note of the missing cells and of the cells that do not follow the standard pattern.
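One possible approach (a sketch, not the only answer): list the legal syllables from the table and build an alternation, longest strings first so that chi, tsu, and shi are tried before a shorter prefix wins. Note that because the simplified table omits the /b/ row, banana will not match this particular pattern even though full Japanese accommodates it.

```python
import re

# Syllables taken from the simplified table in this question.
syllables = ['a', 'i', 'u', 'e', 'o',
             'ka', 'ki', 'ku', 'ke', 'ko',
             'ta', 'chi', 'tsu', 'te', 'to',
             'sa', 'shi', 'su', 'se', 'so',
             'na', 'ni', 'nu', 'ne', 'no',
             'ha', 'hi', 'fu', 'he', 'ho',
             'ma', 'mi', 'mu', 'me', 'mo',
             'ya', 'yu', 'yo',
             'ra', 'ri', 'ru', 're', 'ro',
             'wa', 'wo']

# Longest alternatives first; each syllable may carry an optional /n/ coda.
alts = '|'.join(sorted(syllables, key=len, reverse=True))
pattern = re.compile(r'^(?:(?:%s)n?)+$' % alts)

print(bool(pattern.match('chiron')))  # chi-ro-n: True
print(bool(pattern.match('taxi')))    # needs conversion: False
```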

a    i    u    e    o
ka   ki   ku   ke   ko
ta   chi  tsu  te   to
sa   shi  su   se   so
na   ni   nu   ne   no
ha   hi   fu   he   ho
ma   mi   mu   me   mo
ya   --   yu   --   yo
ra   ri   ru   re   ro
wa   --   --   --   wo

(-- marks the missing cells; chi, tsu, shi, and fu do not follow the standard pattern.)


WordNet

3. (10pts) You can access synsets in WordNet by their part of speech using the commands wn.synsets(word, 'n') for NOUNS, wn.synsets(word, 'v') for VERBS, wn.synsets(word, 'a') for ADJECTIVES, and wn.synsets(word, 'r') for ADVERBS. The polysemy of a word can be measured by the number of senses that it has. Similarly, you can find all lemmas by part of speech using wn.all_lemma_names('n'), wn.all_lemma_names('v'), wn.all_lemma_names('a'), and wn.all_lemma_names('r').

Calculate the average number of senses for nouns, verbs, adjectives, and adverbs as measured by WordNet synsets. Note: only include words that have senses of the given part of speech. For example, desk only has noun senses, so it should not contribute to the calculations for verbs, adjectives, or adverbs.
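The averaging step is a one-liner once the per-lemma synset lists are in hand. A sketch -- the NLTK loop is commented out because it assumes the WordNet corpus data is downloaded, and the toy data at the bottom is a stand-in:

```python
def average_polysemy(synset_lists):
    """Average senses per lemma, counting only lemmas with at least one sense."""
    counts = [len(s) for s in synset_lists if s]
    return sum(counts) / float(len(counts))

# With NLTK's WordNet (assuming the corpus data has been downloaded):
# from nltk.corpus import wordnet as wn
# for pos in ['n', 'v', 'a', 'r']:
#     lists = [wn.synsets(lemma, pos) for lemma in wn.all_lemma_names(pos)]
#     print(pos, average_polysemy(lists))

# Toy stand-in: two lemmas with senses, one without.
print(average_polysemy([['s1'], ['s1', 's2'], []]))  # 1.5
```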

Part of Speech Tagging


NLTK includes part of the Brown corpus. (If your version doesn't have it, type nltk.download() to download any available corpora.) The corpus should include 57,340 tagged sentences. Confirm this locally with the following commands:

>>> from nltk.corpus import brown
>>> len(brown.tagged_sents())
57340

4A. (5pts) Identify the 10 most common part-of-speech trigrams in the NLTK Brown corpus. How frequent is each of them?

4B. (10pts) Many words contain internal hyphens, like term-end, forty-five, and over-all. Identify the 10 most common part-of-speech tags for words containing internal hyphens. How frequent is each of them?
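Both 4A and 4B reduce to building a Counter over the tagged corpus. A sketch of the counting logic on toy data -- substitute brown.tagged_sents() and brown.tagged_words() for the real runs:

```python
from collections import Counter

def tag_trigrams(tagged_sents):
    """Count part-of-speech trigrams within each sentence."""
    trigrams = Counter()
    for sent in tagged_sents:
        tags = [tag for (word, tag) in sent]
        for i in range(len(tags) - 2):
            trigrams[tuple(tags[i:i + 3])] += 1
    return trigrams

def hyphen_tags(tagged_words):
    """Count tags of words with an internal hyphen (not leading/trailing)."""
    return Counter(tag for (word, tag) in tagged_words
                   if '-' in word.strip('-'))

# Toy data standing in for the Brown corpus:
sents = [[('the', 'AT'), ('forty-five', 'CD'), ('dogs', 'NNS'), ('ran', 'VBD')]]
print(tag_trigrams(sents).most_common(2))
print(hyphen_tags(w for s in sents for w in s).most_common(1))
```

Counter.most_common(10) then gives the 10 most frequent items with their counts.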

Dynamic Programming

5A. (5pts) Definitions


Dynamic programming shows up throughout efficient algorithms. Define the two properties a problem must have in order to be solvable by dynamic programming.

Optimal Substructure:

Repeated Subproblems:

5B. (20pts) Implementation

Given a set of numbers in a triangle, find the maximum total obtained by summing numbers from the top of the triangle down, moving down-left or down-right at each step.

In this example the maximum total is 24 = 4 + 7 + 4 + 9:

   4
  7 4
 2 4 6
8 5 9 3

For the purposes of this question, the triangle will be represented as a list of lists, each of which has the same length. Extra entries will be padded with -1. The above example will be input as:

A = [[4,-1,-1,-1],[7,4,-1,-1],[2,4,6,-1],[8,5,9,3]]

5B1. Write a function that, given this input as a variable A, returns the maximum sum from top to bottom in the triangle.

def triangleSum(A):
    # your code here
    return total

Confirm that your answer works on the above triangle first.

if triangleSum(A) == 24:
    print('correct')
else:
    print('incorrect')
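For reference, one possible bottom-up sketch (not the only valid solution): seed best with the bottom row and fold upward, so each cell comes to hold the best sum reachable from that cell down. The -1 padding never participates, because row i only reads columns 0 through i+1.

```python
def triangle_max_sum(A):
    """Bottom-up DP over a padded triangle (list of equal-length rows)."""
    best = A[-1][:]                  # best[j]: max sum starting at (row, j)
    for row in range(len(A) - 2, -1, -1):
        for j in range(row + 1):     # only the real cells of this row
            best[j] = A[row][j] + max(best[j], best[j + 1])
    return best[0]

A = [[4, -1, -1, -1], [7, 4, -1, -1], [2, 4, 6, -1], [8, 5, 9, 3]]
print(triangle_max_sum(A))  # 24
```

This runs in O(n^2) time for n rows, versus the 2^(n-1) paths a brute-force enumeration would visit.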

5B2. With your function, determine the maximum sum of this triangle.

B = [[53, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
     [200, 159, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
     [182, 28, 190, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
     [197, 103, 143, 123, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
     [35, 73, 39, 89, 105, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
     [193, 42, 168, 1, 49, 23, -1, -1, -1, -1, -1, -1, -1, -1, -1],
     [110, 87, 31, 99, 40, 199, 159, -1, -1, -1, -1, -1, -1, -1, -1],
     [65, 117, 155, 128, 76, 99, 102, 81, -1, -1, -1, -1, -1, -1, -1],
     [44, 4, 171, 110, 64, 142, 164, 93, 142, -1, -1, -1, -1, -1, -1],
     [73, 51, 1, 133, 41, 68, 88, 58, 98, 118, -1, -1, -1, -1, -1],
     [113, 196, 44, 24, 199, 164, 41, 108, 134, 21, 4, -1, -1, -1, -1],
     [42, 140, 37, 55, 168, 40, 81, 115, 134, 184, 119, 91, -1, -1, -1],
     [66, 118, 59, 24, 149, 9, 29, 105, 34, 62, 154, 42, 167, -1, -1],
     [82, 88, 120, 105, 167, 113, 107, 59, 72, 35, 147, 164, 139, 70, -1],
     [147, 158, 76, 57, 131, 48, 56, 119, 156, 86, 159, 36, 108, 120, 118]]

053
200 159
182 028 190
197 103 143 123
035 073 039 089 105
193 042 168 001 049 023
110 087 031 099 040 199 159
065 117 155 128 076 099 102 081
044 004 171 110 064 142 164 093 142
073 051 001 133 041 068 088 058 098 118
113 196 044 024 199 164 041 108 134 021 004
042 140 037 055 168 040 081 115 134 184 119 091
066 118 059 024 149 009 029 105 034 062 154 042 167
082 088 120 105 167 113 107 059 072 035 147 164 139 070
147 158 076 057 131 048 056 119 156 086 159 036 108 120 118

Classification and Plotting

In homework 3 and in class, we used the following code to distinguish positive from negative reviews:

## Document Classification
import random
import nltk
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for (w, _) in all_words.most_common(2000)]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d, c) in documents]

train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

This classification example uses only bag-of-words features drawn from the 2000 most common words in all movie reviews. Modify this code in two ways.

6A. (20pts) Allow the number of most common words to be varied. Change the document_features function so that it also takes two arguments, words and n, where words is the full list of words sorted by frequency, and n is the desired number of most common words to include bag-of-words features for.
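One way to sketch the requested signature (a sketch that assumes words is already sorted by descending frequency, as the question states):

```python
def document_features(document, words, n):
    """Bag-of-words features over the n most common words.

    words: full vocabulary sorted by descending frequency;
    n: how many of those words to use as features.
    """
    word_features = words[:n]
    document_words = set(document)
    return {'contains(%s)' % w: (w in document_words)
            for w in word_features}

# Toy usage with a hypothetical 4-word vocabulary:
vocab = ['the', 'movie', 'great', 'awful']
print(document_features(['great', 'movie'], vocab, 2))
```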

6B. (5pts) Plot the accuracy of this classifier as n increases by 10 from 10 to 3000. Note: while this operation may take some time, you should not be running 300 commands by hand one at a time.
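The sweep is just a loop over n. A sketch, where accuracy_for is a hypothetical stand-in for "rebuild the featuresets at this n, retrain, and score on the test set":

```python
def sweep(accuracy_for, start=10, stop=3000, step=10):
    """Evaluate a classifier-accuracy function at each n in the range."""
    ns = list(range(start, stop + 1, step))
    return ns, [accuracy_for(n) for n in ns]

# Plotting (assumes matplotlib is available):
# import matplotlib.pyplot as plt
# ns, accs = sweep(accuracy_for)   # accuracy_for is hypothetical
# plt.plot(ns, accs)
# plt.xlabel('n most common words'); plt.ylabel('accuracy'); plt.show()

ns, accs = sweep(lambda n: 0.5)    # dummy scorer: 300 evaluations
print(len(ns), ns[0], ns[-1])      # 300 10 3000
```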
