
Lecture 9: Part of Speech

Kai-Wei Chang
CS @ University of Virginia
kw@kwchang.net

Course webpage: http://kwchang.net/teaching/NLP16

CS6501 Natural Language Processing 1


This lecture

v Parts of speech (POS)
v POS Tagsets



Parts of Speech
v Traditional parts of speech
v ~ 8 of them



POS examples

v N    noun         chair, bandwidth, pacing
v V    verb         study, debate, munch
v ADJ  adjective    purple, tall, ridiculous
v ADV  adverb       unfortunately, slowly
v P    preposition  of, by, to
v PRO  pronoun      I, me, mine
v DET  determiner   the, a, that, those



Parts of Speech
v A.k.a. parts-of-speech, lexical categories,
word classes, morphological classes,
lexical tags...

v Lots of debate within linguistics about the
number, nature, and universality of these



POS Tagging
v The process of assigning a part-of-speech tag to
each word in a text (e.g., a sentence).
WORD   TAG
the    DET
koala  N
put    V
the    DET
keys   N
on     P
the    DET
table  N
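
The word/tag pairs above map naturally onto a list of tuples. The following is a minimal sketch, assuming a hypothetical hand-written lexicon (`TOY_LEXICON`) that covers only this one sentence. Real taggers learn from annotated corpora and use context, so this lookup only illustrates the input and output of the task.

```python
# Toy lexicon (hypothetical): covers only the example sentence.
TOY_LEXICON = {
    "the": "DET",
    "koala": "N",
    "put": "V",
    "keys": "N",
    "on": "P",
    "table": "N",
}


def tag_sentence(words, lexicon, unknown_tag="UNK"):
    """Return (word, tag) pairs using a plain dictionary lookup."""
    return [(w, lexicon.get(w.lower(), unknown_tag)) for w in words]


tagged = tag_sentence("the koala put the keys on the table".split(), TOY_LEXICON)
for word, tag in tagged:
    print(f"{word}\t{tag}")
```

A word missing from the lexicon receives the placeholder tag `UNK`, which is an assumption of this sketch and not a tag used in the slides.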
Why is POS Tagging Useful?
v First step of a vast number of practical tasks
  v Parsing
    v Need to know if a word is an N or V before you can parse
  v Information extraction
    v Finding names, relations, etc.
  v Speech synthesis/recognition
    v OBject (N) vs. obJECT (V)
    v OVERflow (N) vs. overFLOW (V)
    v DIScount (N) vs. disCOUNT (V)
    v CONtent (N) vs. conTENT (ADJ)
  v Machine Translation
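
To illustrate the speech-synthesis case, the sketch below picks a stress pattern from a word and its tag. The table `STRESS_BY_TAG` and the function `pronounce` are hypothetical names, and the entries are hand-written for illustration; a real system would consult a pronunciation lexicon.

```python
# (word, coarse tag) -> spelling with the stressed syllable capitalized.
# Hand-written illustration only, not a real pronunciation lexicon.
STRESS_BY_TAG = {
    ("object", "N"): "OBject",
    ("object", "V"): "obJECT",
    ("overflow", "N"): "OVERflow",
    ("overflow", "V"): "overFLOW",
    ("discount", "N"): "DIScount",
    ("discount", "V"): "disCOUNT",
    ("content", "N"): "CONtent",
    ("content", "ADJ"): "conTENT",
}


def pronounce(word, tag):
    """Look up the stress pattern; fall back to the plain word if unknown."""
    return STRESS_BY_TAG.get((word.lower(), tag), word)


print(pronounce("object", "N"))  # noun reading
print(pronounce("object", "V"))  # verb reading
```

The same spelling yields different output only because the tag differs, which is why a tagger has to run before the synthesizer.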



Open and Closed Classes
v Closed class: a small fixed membership
  v Prepositions: of, in, by, …
  v Pronouns: I, you, she, mine, his, them, …
  v Usually function words (short common words which
    play a role in grammar)
v Open class: new ones can be created
  v English has 4: Nouns, Verbs, Adjectives, Adverbs
  v Many languages have these 4, but not all!



Open Class Words

v Nouns
  v Proper nouns (Boulder, Granby, Eli Manning)
  v Common nouns (the rest)
  v Count nouns and mass nouns
    v Count: have plurals, get counted: goat/goats, one goat, two goats
    v Mass: don’t get counted (snow, salt, communism) (*two snows)
v Verbs
  v In English, have morphological affixes (eat/eats/eaten)



Closed Class Words
Examples:
v prepositions: on, under, over, …
v particles: up, down, on, off, …
v determiners: a, an, the, …
v pronouns: she, who, I, …
v conjunctions: and, but, or, …
v auxiliary verbs: can, may, should, …
v numerals: one, two, three, third, …



Prepositions from CELEX

CELEX: online dictionary
Frequency counts are from the COBUILD 16-million-word corpus



English Particles



Conjunctions



Choosing a Tagset

v Could pick very coarse tagsets
  v N, V, Adj, Adv, Other
v More commonly used set is finer grained
  v E.g., “Penn TreeBank tagset”, 45 tags: PRP$, WRB, WP$, VBG
  v Brown corpus, 87 tags
v Prague Dependency Treebank (Czech)
  v 4452 tags
  v AAFP3----3N----: (nejnezajímavějším)
    Adj Regular Feminine Plural….Superlative [Hajic 2006, VMC tutorial]



Penn TreeBank POS Tagset



Using the Penn Tagset

v The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN
of/IN other/JJ topics/NNS ./.
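
The `word/TAG` notation is easy to read back into (word, tag) pairs. The helper below, with the hypothetical name `parse_tagged`, is a minimal sketch that assumes tokens are separated by whitespace and that the tag follows the last slash in each token.

```python
def parse_tagged(text):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs.

    Splitting on the LAST slash keeps any slash inside the word itself.
    """
    pairs = []
    for token in text.split():
        word, tag = token.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs


sentence = (
    "The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN "
    "of/IN other/JJ topics/NNS ./."
)
pairs = parse_tagged(sentence)
for word, tag in pairs:
    print(f"{word}\t{tag}")
```

The final token `./.` shows that punctuation is tagged too: the word is `.` and its tag is also `.`.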



Universal Tag set

v ~ 12 different tags
v NOUN, VERB, ADJ, ADV, PRON, DET, ADP,
NUM, CONJ, PRT, “.”, X
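
Fine-grained tags can be collapsed onto this coarse set with a lookup table. The sketch below covers only a handful of Penn Treebank tags that appear in these slides. The table `PENN_TO_UNIVERSAL` is a partial, hand-written illustration, and sending unlisted tags to `X` is an assumption of this sketch.

```python
# Partial Penn Treebank -> universal tag mapping (illustrative subset only).
PENN_TO_UNIVERSAL = {
    "DT": "DET",
    "JJ": "ADJ",
    "NN": "NOUN",
    "NNS": "NOUN",
    "VB": "VERB",
    "VBD": "VERB",
    "VBG": "VERB",
    "IN": "ADP",
    "RB": "ADV",
    "WRB": "ADV",
    "PRP$": "PRON",
    "WP$": "PRON",
    ".": ".",
}


def to_universal(penn_tag):
    """Map a Penn tag to a universal tag; unlisted tags fall back to 'X'."""
    return PENN_TO_UNIVERSAL.get(penn_tag, "X")


penn_tagged = [("The", "DT"), ("grand", "JJ"), ("jury", "NN"),
               ("commented", "VBD"), ("on", "IN"), ("topics", "NNS"), (".", ".")]
universal_tagged = [(word, to_universal(tag)) for word, tag in penn_tagged]
print(universal_tagged)
```

Several Penn tags (here `NN` and `NNS`, or `VB`, `VBD`, and `VBG`) share one universal tag, which is the sense in which the universal set is coarser.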



POS Tagging vs. Word Clustering

v Words often have more than one POS: back
  v The back door = JJ
  v On my back = NN
  v Win the voters back = RB
  v Promised to back the bill = VB

These examples are from Dekang Lin
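
This ambiguity can be measured by collecting the set of tags seen for each word. The sketch below uses the four readings of "back" from the examples plus two unambiguous words; the `observations` list is toy data assembled for illustration.

```python
from collections import defaultdict

# (word, tag) observations: the four "back" readings plus two other words.
observations = [
    ("back", "JJ"),   # The back door
    ("back", "NN"),   # On my back
    ("back", "RB"),   # Win the voters back
    ("back", "VB"),   # Promised to back the bill
    ("door", "NN"),
    ("the", "DT"),
]

tags_for = defaultdict(set)
for word, tag in observations:
    tags_for[word].add(tag)

# Keep only word types that were seen with more than one tag.
ambiguous = {w: sorted(tags) for w, tags in tags_for.items() if len(tags) > 1}
print(ambiguous)
```

A fixed word-to-cluster assignment would place "back" in a single class, whereas a tagger has to choose among these four tags for each occurrence.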


How Hard is POS Tagging?



POS tag sequences

v Some tag sequences are more likely to occur than others
v POS Ngram view
https://books.google.com/ngrams/graph?content=_ADJ_+_NOUN_%2C_ADV_+_NOUN_%2C+_ADV_+_VERB_

Existing methods often model POS tagging as a sequence tagging problem
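
One simple way to quantify which tag sequences are more likely is to count tag bigrams and normalize them into conditional probabilities. The sketch below does this for a single toy tag sequence. The function name `transition_probs` and the boundary markers `<s>` and `</s>` are assumptions of this example, and the slides do not say that this particular estimator is the one used.

```python
from collections import Counter, defaultdict


def transition_probs(tag_sequences):
    """Estimate P(current tag | previous tag) by relative frequency."""
    counts = defaultdict(Counter)
    for tags in tag_sequences:
        # Pad with sentence-boundary markers so first/last tags are counted.
        padded = ["<s>"] + list(tags) + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            counts[prev][cur] += 1
    return {
        prev: {cur: n / sum(c.values()) for cur, n in c.items()}
        for prev, c in counts.items()
    }


# Tags of "the koala put the keys on the table".
corpus = [["DET", "N", "V", "DET", "N", "P", "DET", "N"]]
probs = transition_probs(corpus)
print(probs["DET"])  # every DET in this toy corpus is followed by N
print(probs["N"])    # N is followed by V, P, or the end of the sentence
```

With one sentence the estimates are extreme, so a usable model would need a large tagged corpus and some form of smoothing.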



Evaluation

v How many words in the unseen test data can be tagged correctly?
v Usually evaluated on Penn Treebank
v State of the art ~97%
v Trivial baseline (most likely tag) ~94%
v Human performance ~97%
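
The accuracy metric and the most-likely-tag baseline can both be written in a few lines. The sketch below uses tiny made-up training and test sets, so its score says nothing about the Penn Treebank figures above. The function names, and the choice to give unseen words the overall most frequent tag, are assumptions of this example.

```python
from collections import Counter, defaultdict


def train_most_frequent_tag(tagged_sentences):
    """For each word, remember the tag it carried most often in training."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sent in tagged_sentences:
        for word, tag in sent:
            counts[word][tag] += 1
            all_tags[tag] += 1
    word_to_tag = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    default_tag = all_tags.most_common(1)[0][0]  # used for unseen words
    return word_to_tag, default_tag


def accuracy(model, tagged_sentences):
    """Fraction of test tokens whose predicted tag equals the gold tag."""
    word_to_tag, default_tag = model
    correct = total = 0
    for sent in tagged_sentences:
        for word, gold in sent:
            correct += word_to_tag.get(word, default_tag) == gold
            total += 1
    return correct / total


train = [
    [("on", "IN"), ("my", "PRP$"), ("back", "NN")],
    [("his", "PRP$"), ("back", "NN")],
    [("back", "VB"), ("the", "DT"), ("bill", "NN")],
    [("the", "DT"), ("door", "NN")],
]
test = [[("the", "DT"), ("back", "JJ"), ("door", "NN")]]

model = train_most_frequent_tag(train)
acc = accuracy(model, test)
print(f"accuracy = {acc:.3f}")
```

The baseline tags "back" as NN everywhere, so it misses the adjective reading in the test sentence. Errors of this kind, on ambiguous words, are what separates the baseline from a context-aware tagger.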
