4.2 Using the Preceding and Succeeding POS Tags

The second method uses the POS tag information on both sides of the word being guessed. The same features as shown in Table 1 are used. In general, the POS tags of the succeeding words are unknown. Roth and Zelenko address POS tagging with no unknown words and use a dictionary containing the most frequent POS tag for each word (Roth and Zelenko, 1998). They use that POS tag for the succeeding words, and report that the incorrectly attached POS tags cause an accuracy decrease of about 2%. We use a similar two-pass method without a dictionary. In the first pass, all POS tags are predicted without using the POS tag information of succeeding words (i.e., using the same features as Section 4.1). In the second pass, POS tagging is performed using the POS tags predicted in the first pass as the succeeding context (i.e., using the same features as Section 3). This method has the advantage of handling known and unknown words in the same way. The tag dictionary is also used in this method.

5 Evaluation

Experiments on unknown word guessing and POS tagging are performed using the Penn Treebank WSJ corpus, which has 50 POS tags. Four training data sets were constructed by randomly selecting approximately 1,000, 10,000, 100,000 and 1,000,000 tokens. Test data for unknown word guessing consists of words that do not appear in the training data; the POS tags on both sides of each unknown word were assigned by TnT. Test data for POS tagging consists of about 285,000 tokens distinct from the training data. The numbers of known/unknown words and the percentage of unknown words in the test data are shown in Table 2. The accuracies are compared with TnT, which is based on a second-order Markov model.

5.1 Unknown Word Guessing

The accuracy of unknown word guessing is shown in Table 3, together with the degree of the polynomial kernel used in each experiment. Our method achieves higher accuracy than TnT for every training data set.

Accuracies under various settings are shown in Table 4. The first column shows the case in which the correct POS context on both sides of the word being guessed is given. In the second to fourth columns, some features are deleted in order to see their contribution to the accuracy. The decrease in accuracy caused by errors in TnT's POS tagging is about 1%. Information from substrings (prefixes, suffixes, and the presence of numerals, capital letters and hyphens) plays the most important role in predicting POS tags. On the other hand, the words themselves contribute much less, while the POS context makes a moderate contribution to the final accuracy.

In general, features that rarely appear in the training data are statistically unreliable and often decrease the performance of the system. Ratnaparkhi ignored features that appeared fewer than 10 times in the training data (Ratnaparkhi, 1996). We examined the behavior of our system when such sparse features are removed. Table 5 shows the result for 10,000 training tokens. Ignoring features that appeared only once slightly improves the accuracy.
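The frequency cutoff examined above can be sketched as follows. This is a minimal, hypothetical illustration (the feature names and the event representation are invented for this example, not the paper's actual feature extractor): count how often each feature occurs in the training events and drop features below a threshold.

```python
from collections import Counter

def apply_feature_cutoff(training_events, threshold=10):
    """Drop features appearing fewer than `threshold` times in training data."""
    counts = Counter(f for features, _tag in training_events for f in features)
    kept = {f for f, c in counts.items() if c >= threshold}
    return [([f for f in features if f in kept], tag)
            for features, tag in training_events]

# Toy training events: (feature list, POS tag).
events = [(["suffix=ing", "has_capital=0"], "VBG"),
          (["suffix=ed", "has_capital=0"], "VBD"),
          (["suffix=ing", "has_hyphen=1"], "VBG")]
# With threshold=2, "suffix=ed" and "has_hyphen=1" occur only once
# and are removed from every event.
filtered = apply_feature_cutoff(events, threshold=2)
```
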
Table 3: Performance of Unknown Word Guessing vs. Size of Training Tokens

Training Tokens   SVM Accuracy   Degree d   TnT Accuracy
1,000             69.8%          1          69.4%
10,000            82.3%          2          81.5%
100,000           86.7%          2          83.3%
1,000,000         87.1%          2          84.2%
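The degree-d polynomial kernel reported in Table 3 is what lets the SVM consider conjunctions of features without enumerating them. A minimal sketch (NumPy, with illustrative binary feature vectors that are not from the paper's data): the kernel value (x·z + 1)^d equals an inner product in the expanded space of all feature products up to size d.

```python
import numpy as np

def poly_kernel(x, z, d=2):
    """Polynomial kernel (x.z + 1)^d: implicitly scores all feature
    conjunctions up to size d without constructing them explicitly."""
    return (np.dot(x, z) + 1) ** d

# Two binary feature vectors (e.g., suffix/capitalization indicators).
x = np.array([1, 0, 1])
z = np.array([1, 1, 1])
# dot(x, z) = 2, so with d=2 the kernel value is (2 + 1)**2 = 9.
k = poly_kernel(x, z, d=2)
```

With d=1 this reduces to a linear model over the original features; d=2 adds all pairwise feature conjunctions, which matches the degrees that work best in Table 3.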
Table 8: Performance of POS Tagging for Different Sets of Features, as Overall % (Known Words / Unknown Words)

Training Tokens   No POS              No Word             No Substrings
1,000             81.3% (96.0/64.2)   83.2% (96.2/68.0)   65.2% (96.2/29.0)
10,000            90.3% (94.7/75.8)   91.9% (95.5/79.9)   80.7% (94.8/34.4)
100,000           93.9% (95.1/80.1)   95.4% (96.4/85.0)   82.4% (87.0/30.7)
1,000,000         95.0% (95.3/80.8)   96.7% (96.9/85.7)   74.4% (75.6/24.1)
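The two-pass scheme of Section 4.2, used for the POS-tagging results discussed here, can be sketched as follows. The tagger callables are hypothetical stand-ins for the paper's SVM classifiers, invented for this illustration: pass 1 tags each word with left context only, pass 2 re-tags using the pass-1 predictions as the succeeding context.

```python
def two_pass_tag(words, tag_first, tag_second):
    """Sketch of the two-pass method (hypothetical tagger callables).

    tag_first(words, i, left_tags) -> tag               # pass 1: left context only
    tag_second(words, i, left_tags, right_tags) -> tag  # pass 2: both sides
    """
    # Pass 1: predict all tags without succeeding-word POS information.
    first = []
    for i in range(len(words)):
        first.append(tag_first(words, i, first[:i]))
    # Pass 2: re-tag, taking the succeeding-context tags from pass 1.
    second = []
    for i in range(len(words)):
        second.append(tag_second(words, i, second[:i], first[i + 1:]))
    return second

# Toy taggers for illustration only.
def toy_first(words, i, left):
    return "NN" if words[i][0].isupper() else "VB"

def toy_second(words, i, left, right):
    # Promote to NNP when followed by another (predicted) noun.
    t = toy_first(words, i, left)
    return "NNP" if t == "NN" and right[:1] == ["NN"] else t

tags = two_pass_tag(["John", "Smith", "runs"], toy_first, toy_second)
```

Both passes visit every word, so known and unknown words are handled uniformly, which is the advantage noted in Section 4.2.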
However, even if a large number of features is used without a cutoff, SVMs hardly overfit and retain good generalization performance.

The performance for different degrees of the polynomial kernel is shown in Table 6. The best degree seems to be 1 or 2 for this task, and it tends to increase as the training data grows.

5.2 Part-of-Speech Tagging

The accuracies of POS tagging are shown in Table 7. The table shows two cases: one refers only to the preceding POS tags, and the other refers to both preceding and succeeding POS tags using the two-pass method. The results show that the performance is comparable to TnT in the first case and better in the second. Between the two cases, the accuracy for known words is almost equal, but the accuracy for unknown words is lower in the first case than in the second.

Accuracies measured by deleting each feature are shown in Table 8. The contribution of each feature shows the same tendency as in the unknown word guessing experiments of Section 5.1. The biggest difference in features between our method and TnT is the use of word context. Although using many features such as word context is difficult in Markov models, it is easy with SVMs, as seen in Section 5.1. For small training data, the accuracy without word context is lower than that of TnT. This indicates that one reason for the better performance of our method is the use of word context.

6 Conclusion and Future Work

In this paper, we applied SVMs to unknown word guessing and showed that they perform quite well using context and substring information. Furthermore, extending the method to POS tagging, the resulting tagger achieves higher accuracy than a state-of-the-art HMM-based tagger. Compared to other machine learning algorithms, SVMs have the advantage of considering combinations of features automatically by introducing a kernel function, and they seldom overfit even with a large set of features. Our methods do not depend on particular characteristics of English and can therefore be applied to other languages such as German and French. However, for languages like Japanese and Chinese, it is difficult to apply our methods straightforwardly, because words are not separated by spaces in those languages.

One problem with our methods is computational cost. In POS tagging using POS tags on both sides, training with 100,000 tokens took about 16.5 hours and testing with 285,000 tokens took about 4 hours on an Alpha 21164A 500MHz processor. Although SVMs have good properties and performance, their computational cost is large: it is difficult to train on a large amount of data, and testing time increases with more complex models.

Another point to be improved is the search algorithm for POS tagging. In this paper, a deterministic method is used as the search algorithm. Compared to probabilistic models, this method does not consider the overall likelihood of a whole sentence and uses only local information. The accuracy may be improved by incorporating a beam search scheme. Furthermore, our method outputs only the best answer and cannot output the second- or third-best answers. There is a way to interpret the outputs of SVMs as probabilities (Platt, 1999), which may be applied directly to remedy this problem.

References

T. Brants. 2000. TnT — A Statistical Part-of-Speech Tagger. In Proceedings of the 6th Applied NLP Conference (ANLP-2000), pages 224–231.

E. Brill. 1995. Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging. Computational Linguistics, 21(4), pages 543–565.

E. Charniak, C. Hendrickson, N. Jacobson and M. Perkowitz. 1993. Equations for Part-of-Speech Tagging. In Proceedings of the Eleventh National Conference on Artificial Intelligence (AAAI-93), pages 784–789.

C. Cortes and V. Vapnik. 1995. Support Vector Networks. Machine Learning, 20, pages 273–297.

S. Cucerzan and D. Yarowsky. 2000. Language Independent, Minimally Supervised Induction of Lexical Probabilities. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL-2000), pages 270–277.

S. Džeroski, T. Erjavec and J. Zavrel. 2000. Morphosyntactic Tagging of Slovene: Evaluating Taggers and Tagsets. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC-2000), pages 1099–1104.

T. Joachims. 1998. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In Proceedings of the 10th European Conference on Machine Learning (ECML-98), pages 137–142.

T. Kudoh and Y. Matsumoto. 2000. Use of Support Vector Learning for Chunk Identification. In Proceedings of the Fourth Conference on Computational Natural Language Learning (CoNLL-2000), pages 142–144.

A. Mikheev. 1997. Automatic Rule Induction for Unknown-Word Guessing. Computational Linguistics, 23(3), pages 405–423.

A. Ratnaparkhi. 1996. A Maximum Entropy Model for Part-of-Speech Tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-1), pages 133–142.

D. Roth and D. Zelenko. 1998. Part of Speech Tagging Using a Network of Linear Separators. In Proceedings of the joint 17th International Conference on Computational Linguistics and 36th Annual Meeting of the Association for Computational Linguistics (ACL/COLING-98), pages 1136–1142.

H. Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of the International Conference on New Methods in Language Processing (NeMLaP-1), pages 44–49.

S. Thede. 1998. Predicting Part-of-Speech Information about Unknown Words using Statistical Methods. In Proceedings of the joint 17th International Conference on Computational Linguistics and 36th Annual Meeting of the Association for Computational Linguistics (ACL/COLING-98), pages 1505–1507.

V. Vapnik. 1999. The Nature of Statistical Learning Theory. Springer.

R. Weischedel, M. Meteer, R. Schwartz, L. Ramshaw and J. Palmucci. 1993. Coping with Ambiguity and Unknown Words through Probabilistic Models. Computational Linguistics, 19(2), pages 359–382.