
Vietnamese Proper Noun Recognition

Chau Q. Nguyen, Tuoi T. Phan, Tru H. Cao


Abstract - Recognition of proper nouns is one of the complex problems in Vietnamese natural language processing (NLP). This paper introduces, first, an approach for recognizing Vietnamese proper nouns. Second, it introduces the NLP techniques that we propose for analyzing Vietnamese texts, focusing on advanced word segmentation, which accounts for a number of complex linguistic phenomena, as well as on the pre-tagging, part-of-speech (POS) tagging, and post-tagging tasks for proper noun recognition. Finally, the paper discusses the results of several experiments that have been performed to study the impact of the strategy chosen for the recognition of Vietnamese proper nouns, and the further work we intend to carry out.

Index Terms - Vietnamese, natural language processing, word segmentation, tagger, tagset, part-of-speech, POS tagging, text corpus, pre-tagging, post-tagging, corpus, syntax, context of texts, proper names, proper noun recognition.

I. INTRODUCTION

Each word in a language potentially has one or more parts-of-speech (POS) depending upon the context of its usage. Proper Noun (PN) recognition is the identification of morpho-syntactic categories [11], [12], [13], [14] over a continuous stream of word tokens. This task is essential for any further step of language processing: syntactic analysis, semantic or even pragmatic processing. Most PN recognition systems use lexical resources such as gazetteers (lists of names), but these are necessarily incomplete as new names are constantly coming into existence. Therefore, further techniques must be used, including syntactic and semantic analysis based on the styles of texts. The PN recognition process generally consists of two steps: tokenization, for identifying lexical units in the text, and disambiguation, for recognizing PNs among those words. There are two principal approaches to the disambiguation task: rule based methods and probabilistic methods [6], [7], [8], [9], [10], [15], [18]. Rule based methods apply a set of grammatical rules to solve the tagging problem. The unsupervised method uses constraint rules built by linguists and a dictionary in which each word has its possible tags. Such a tagger behaves much like a parser. The supervised method (Brill, 1992) [6], [1] learns tags

and transformation rules from a manually tagged corpus. In the lexical lookup step, each word is assigned the most frequent tag recorded for it in the lexicon. Subsequently, transformation rules permit iterative correction of this initial tagging. Probabilistic methods [1], [2] make use of the probability distribution over the space of possible associations between word sequences and tag sequences. This distribution is estimated from a training corpus, with or without tags. The tag disambiguation task then becomes the choice of the tag sequence that maximizes the conditional probability given the word sequence. These methods rely on some probabilistic hypotheses: the probability of a word is conditioned only on its tag, and the probability of a tag is conditioned only on the tags in a neighborhood of fixed size. The performance of tagging systems is generally measured by the precision rate (at word level). This depends strongly on the nature and the size of the set of tags (tagset). Most results are above 90%. The best results have been obtained in the GRACE evaluation [19] and were 97.8%, 96.7% and 94.8%.
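Under these hypotheses, the disambiguation criterion can be written compactly as follows (a standard bigram formulation given here for concreteness; the notation is ours, not taken from the cited systems):

\hat{t}_1 \ldots \hat{t}_n = \arg\max_{t_1 \ldots t_n} \Pr(t_1 \ldots t_n \mid w_1 \ldots w_n)
                           = \arg\max_{t_1 \ldots t_n} \prod_{i=1}^{n} \Pr(w_i \mid t_i)\, \Pr(t_i \mid t_{i-1})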
Chau Q. Nguyen is with the Faculty of Information Technology, Ho Chi Minh City University of Industry, Vietnam (cell phone: 84-903-971-541; fax: 84-8-661420; e-mail: quangchau@hcmut.edu.vn). Tuoi T. Phan is with the Faculty of Information Technology, Ho Chi Minh City University of Technology, Vietnam (e-mail: tuoi@dit.hcmut.edu.vn). Tru H. Cao is with the Faculty of Information Technology, Ho Chi Minh City University of Technology, Vietnam (e-mail: tru@dit.hcmut.edu.vn).

II. METHODOLOGY

We consider the employment of a set of NLP techniques adequate for dealing with the Vietnamese proper noun recognition problem, and we propose the following general proper noun recognition model (see Fig. 1). This paper focuses on the description of the proper noun recognition model, with emphasis on the proper noun recognition tasks.




Fig. 1. The General Proper Noun Recognition Model: Vietnamese text (plain text, HTML, SGML, XML, RTF, or e-mail) passes through the Segmentation module and the Proper Noun Recognition module, which outputs the proper nouns.

Concurrently, we also propose a model which employs NLP methods adequate for dealing with the disambiguation task in the POS tagging module.


This method combines the rule based method and the probabilistic method. The two methods complement each other to increase the expected performance of the model. Although our scheme is oriented towards the indexing of Vietnamese texts, it is also a proposal of a general architecture that can be applied to other languages with very slight modifications.

A. Segmentation

The main function of the segmentation module (see Fig. 2) is to identify and separate the tokens present in the text in such a way that every individual word, as well as every punctuation mark, becomes a different token. The segmentation module handles words, numbers with decimals, and dates in numerical format so as not to separate the dot, the comma, or the slash (respectively) from the preceding and/or following elements. For this purpose, it uses a dictionary (an allocation list) and a small set of rules to detect numbers and dates. The rules, written in the Common Pattern Specification Language [18], are applied in one of two ways: Brill-style, in which each rule is applied at every point in the document that it matches; and Appelt-style, in which only the longest matching rule is applied at any point at which more than one might apply.
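As a purely illustrative sketch (not the module described above), the following Java fragment shows how a small regular-expression rule set can keep dates and decimal numbers as single tokens while separating words and punctuation marks; the class name and pattern are hypothetical:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration: keep dates (dd/mm/yyyy) and numbers with decimals
// as single tokens, and split everything else into words and punctuation marks.
public class SimpleTokenizer {
    private static final Pattern TOKEN = Pattern.compile(
        "\\d{1,2}/\\d{1,2}/\\d{2,4}"   // date in numerical format
        + "|\\d+[.,]\\d+"               // number with decimals
        + "|\\p{L}+"                    // a word (letters, including Vietnamese)
        + "|\\S");                      // any other single symbol (punctuation)

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "On 02/09/2006 the price rose by 1,5 percent."
        System.out.println(tokenize("Ngày 02/09/2006 giá tăng 1,5 phần trăm."));
    }
}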

In the Appelt case, the rule set for a phase may be considered as a single disjunctive expression (and an efficient implementation would construct a single automaton to recognize the whole rule set). To solve this problem, we need to employ two algorithms: one that takes as input a Common Pattern Specification Language (CPSL) representation and builds a machine capable of recognizing the situations that match the rules and of making the bindings that occur each time a rule is applied (this machine is a Finite State Machine (FSM), somewhat similar to a lexical analyzer, i.e. a deterministic finite state automaton); and another algorithm that uses the FSM built by the first one and traverses the annotation graph in order to find the situations that the FSM can recognize.

1) The first algorithm

The first step that needs to be taken in order to create the FSM is to read the CPSL description from the external file(s). This is already done in the old version of the Java Annotation Patterns Engine (JAPE) [18].

Fig. 2. The Vietnamese Segmentation Module: Vietnamese text is processed by the Segmentation module, which uses dictionaries and a finite state machine and produces vnTokens.

The second step is to build a nondeterministic FSM from the Java objects obtained from the parsing process. This FSM will have one initial state and a set of final states, each of which is associated with one rule (this way we know which right-hand side we have to execute in case of a match). The nondeterministic FSM will also have empty transitions (arcs labeled with nil). In order to build this FSM we will need to implement a version of the algorithm used to convert regular expressions into NFAs. Finally, this nondeterministic FSM will have to be converted to a deterministic one. The deterministic FSM (DFSM) will have more states (in the worst case 2^s, where s is the number of states in the nondeterministic FSM, though this case is very improbable), but will be more efficient because it will not have to backtrack. One way to do the (combined) matching is to pre-process the DFSM and to convert all transitions to matchers. This could be done using the following algorithm:

Input: a DFSM. Output: a DFSM with compound restriction checks.
for each state s of the DFSM do
  collect all the restrictions in the labels of the outgoing arcs from s (in the DFSM transition graph);
  (these restrictions are either of the form [Type == t1] or of the form [Type == t1 && Attr_i == Value_i])
  group all these restrictions by type and by branch, and create compound restrictions of the form
  [Type == t1 && Attr_1 == Value_1 && Attr_2 == Value_2 && ... && Attr_n == Value_n];
  (the grouping has to be done with care so that it does not mix restrictions from different branches,
  which would create unnecessarily restrictive queries; these restrictions will be sent to the annotation
  graph, which will do the matching for us; note that we can only reuse previous queries if the
  restrictions are identical on two branches)
  create the data structures necessary for linking the bindings to the results of the queries;
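The grouping step above can be pictured with the following hypothetical Java sketch; the Restriction class and the key layout are our own simplifications, not the JAPE implementation:

import java.util.*;

// Hypothetical sketch of the pre-processing step described above: for each
// DFSM state, group the restrictions on its outgoing arcs by annotation type
// (and branch) into one compound restriction per group, so that a single
// query per group can be sent to the annotation graph.
class Restriction {
    final String type;        // e.g. "Token"
    final String attribute;   // e.g. "orth", or null for a bare type test
    final String value;       // e.g. "upperInitial"
    final int branch;         // restrictions from different branches must not be mixed
    Restriction(String type, String attribute, String value, int branch) {
        this.type = type; this.attribute = attribute; this.value = value; this.branch = branch;
    }
}

class CompoundRestrictionBuilder {
    // Key = annotation type + branch; value = all attribute tests for that key.
    static Map<String, List<Restriction>> group(List<Restriction> outgoingArcRestrictions) {
        Map<String, List<Restriction>> compound = new LinkedHashMap<>();
        for (Restriction r : outgoingArcRestrictions) {
            String key = r.type + "#" + r.branch;
            compound.computeIfAbsent(key, k -> new ArrayList<>()).add(r);
        }
        return compound;   // each entry corresponds to one query on the annotation graph
    }

    public static void main(String[] args) {
        List<Restriction> arcs = List.of(
            new Restriction("Token", "orth", "upperInitial", 0),
            new Restriction("Token", "kind", "word", 0),
            new Restriction("Lookup", "majorType", "location", 1));
        System.out.println(group(arcs).keySet()); // [Token#0, Lookup#1]
    }
}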

When this machine is used for the actual matching, the three queries will be run and the results will be stored in sets of annotations (S1..S3). For each pair of annotations (A1, A2) such that A1 is in S1 and A2 is in S2, a new DFSM instance will be created; this instance will move to state 2; <A1, A2> will be bound to L1; and the corresponding node in the annotation graph will become max(A1.endNode(), A2.endNode()). Similarly, for each pair of annotations (A1, A3) such that A1 is in S1 and A3 is in S3, a new DFSM instance will be created; this instance will move to state 3; <A1, A3> will be bound to L2; and


the corresponding node in the annotation graph will become max(A1.endNode(), A3.endNode()). While building the compound matcher, it is possible to detect queries that depend on one another (e.g. if the expected results of a query are a subset of the results from another query). These kinds of situations can be marked so that when the queries are actually run, some operations can be avoided (e.g. if the less restrictive search returned no results, then the more restrictive one can be skipped; or, if a search returns an AnnotationSet [18] (an object that can be queried), then the more restrictive query can be skipped).

2) The second algorithm

Basically, the algorithm (see Fig. 3) has to traverse this graph starting from the leftmost node to the rightmost one. Each path found is a sequence of possible matches. Because more than one annotation (all starting at the same point) can be matched at one step, a path is not viewed as a classical path in a graph, but as a sequence of steps, each step being a set of annotations that start in the same node.
Fig. 3. The Second Algorithm. (The figure shows an annotation graph whose numbered nodes 1-11 carry Vietnamese syllables; the example path below refers to these node numbers.)

E.g. a path in the graph above can be: [1.3], [1.2.4.8], [1.2.4.9], [1.2.5.10], [1.2.6.10], [1.2.7.11]. Note that the next step continues from the rightmost node reached by the annotations in the current step. The matchings are made by a Finite State Machine that resembles a classical lexical analyzer. The main difference from a scanner is that there are no input symbols; the transition from one state to another is based on matching a set of objects (annotations) against a set of restrictions (the constraint group in the left-hand side of a CPSL rule). The algorithm can be the following:

startNode = the leftmost node;
create a first instance of the FSM and add it to the list of active instances;
for this FSM instance set the current node as the leftmost node;
while (startNode != last node) do
  while (not over) do
    for each active instance Fi of the FSM do
      if this instance is in a final state then
        save a clone of it in the set of accepting FSMs (instances of the FSM that have reached a final state);
      read all the annotations starting from the current node;
      select all sets of annotations that can be used to advance one step in the transition graph of the FSM;
      for each such set
        create a new instance of the FSM, put it in the active list and make it consume the corresponding
        set of annotations, making any necessary bindings in the process (this new instance will advance
        in the annotation graph to the rightmost node that is an end of a matched annotation);
      discard Fi;
    end for;
    if the set of active instances of the FSM is empty then over = true;

end while;

  if the set of accepting FSMs is not empty then
    from all accepting FSMs select** the one that matched the longest path;
    if there are more than one for the same path length then select the one with the highest priority;
    execute the action associated with the final state of the selected FSM instance;
    startNode = selectedFSMInstance.getLastNode().getNextNode();   // start over from the next node
  else   // the matching failed
    startNode = startNode.getNextNode();
end while;

B. Proper Noun Recognition

We propose a Proper Noun Recognition Module which consists of a sequence of processes (see Fig. 4), each of them corresponding to the recognition of intuitive linguistic elements.



*: The set of active FSM instances can decrease when an active instance cannot continue (there is no set of annotations starting from its current node that can be matched). In this case it will be removed from the set.
**: If we do Brill-style matching, we have to process each of the accepting instances.


Fig. 4. The Proper Noun Recognition Module: Vietnamese text passes through the Pre Proper Noun Recognition Module (with its Proper noun trainer and Proper noun identifier sub-modules), the Part Of Speech Tagging Module, and the Post Proper Noun Recognition Module, which outputs the proper nouns.


1) The Pre Proper Noun Recognition Module


a) Proper noun trainer

Given a sample of the texts that are going to be indexed, this sub-module learns a set of candidate proper nouns that are stored in the trained dictionary. This sub-module identifies the words that begin with a capital letter and appear in non-ambiguous positions, i.e. in positions where, if a word begins with a capital letter, then it is a proper noun. For instance, words appearing after a dot are not considered, while words in the middle of the text are considered. These words are added to a dictionary which is used later by the Proper Noun Identifier sub-module. The sub-module also identifies sequences of capitalized words connected by some valid connectives, like và ("and") and definite articles. All possible segmentations of these sequences are considered. For example, for Bộ Giáo dục và Đào tạo the following proper nouns would be generated:

Bộ&Giáo dục&và&Đào tạo
Bộ&Giáo dục
Giáo dục&và&Đào tạo
Giáo dục
Đào tạo
Bộ

where & is used to join the words that form the compound proper noun. Then, all these proper nouns are added to the trained dictionary of proper nouns.
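A minimal sketch of this enumeration, assuming the capitalized units of the sequence have already been identified (class and method names are illustrative only, and the connective is not re-inserted into the shorter candidates):

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the trainer's segmentation step: given the capitalized
// words of a sequence, emit every contiguous sub-sequence, joining its members with '&'.
public class CandidateProperNouns {
    public static List<String> segmentations(List<String> capitalizedWords) {
        List<String> candidates = new ArrayList<>();
        int n = capitalizedWords.size();
        for (int start = 0; start < n; start++) {
            for (int end = start; end < n; end++) {
                candidates.add(String.join("&", capitalizedWords.subList(start, end + 1)));
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        // For "Bộ Giáo dục và Đào tạo" the capitalized units are the three below.
        System.out.println(segmentations(List.of("Bộ", "Giáo dục", "Đào tạo")));
        // -> [Bộ, Bộ&Giáo dục, Bộ&Giáo dục&Đào tạo, Giáo dục, Giáo dục&Đào tạo, Đào tạo]
    }
}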

b) Proper noun identifier


1 In Vietnamese, Triết is a traditional first name, Nguyễn is a traditional family name, and văn is the common word for literature. The use of common nouns as part of a name (in this case Nguyễn văn Triết) is a typical phenomenon in Vietnamese.

This sub-module uses an external dictionary of proper nouns, to which the trained dictionary extracted by the Proper Noun Trainer sub-module can be added. With these resources, this phase of the Pre Proper Noun Recognition Module is able to detect proper nouns, whether simple or compound, and whether appearing in ambiguous positions or in non-ambiguous ones. In the case of non-ambiguous positions, we check for the longest sequence of valid connectives and capitalized words present in the external dictionary, assigning the tag corresponding to the leading capitalized word. If this fails, we assign a proper noun tag with gender under-specified to the longest sequence. In the case of ambiguous positions, we check for the longest sequence of valid connectives and capitalized words present in the external dictionary, assigning the corresponding tag. If this fails, we check for the longest sequence in the trained dictionary, labeling the sequence as a proper noun with gender under-specified. If this also fails, the sequence is not tagged. As an example, if we find Ông Nguyễn văn Triết in an ambiguous position, supposing the training phase has found Nguyễn&văn&Triết in a non-ambiguous position and that Ông is found in the external dictionary as a masculine proper noun, the whole sequence Ông&Nguyễn&văn&Triết is tagged as a masculine proper noun.1

2) The POS Tagging Module

The output of the Pre Proper Noun Recognition module is taken as input by the POS tagging module. Almost any kind of POS tagging could be applied. In our system, we propose a probability-based model built on a lexicon with information about the possible POS tags for each word, a manually labeled corpus, syntax, and the context of texts. Concurrently, we also built a corpus with 75,000 entries and a lexicon with 80,000 entries. We have used a Hidden Markov Model and the Viterbi algorithm [1], [15] for part-of-speech tagging. The elements of the model and the procedures to estimate its parameters are based on Brants' work [19]. We build the following module for Vietnamese part-of-speech tagging (see Fig. 5)


Fig. 5. The Vietnamese part-of-speech tagging module: the input first goes through Part Of Speech Tagging based on rules (syntactic and semantic analysis based on the style of texts, driven by a set of recognising rules for dates, proper nouns, numerals, collective nouns, classifier nouns, personal pronouns, etc.), and then through Part Of Speech Tagging based on a probability model trained on the Vietnamese Corpus, which produces the output.

and an algorithm for Vietnamese part-of-speech tagging as follows:

read tokens;
part-of-speech tagging for the non-ambiguous cases;
write to buffers;
while (buffers is not empty) do
  read three tokens from buffers;
  for each token of the three tokens do
    if this token is in the dictionary then
      tag it with its part-of-speech from the dictionary;
    else
      tag it with a probable part-of-speech;
    j = 0;
    while (j < Number of Tags) do
      Pw = P(tag | token);
      Pc = P(tag | t1, t2);
      Pw,c = Pw * Pc;
      j = j + 1;

    end while;
  end for;
end while;
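In symbols, the score computed in the inner loop for a candidate tag t of a token w, given the two preceding tags t1 and t2, is simply the product transcribed from the pseudocode above (in LaTeX notation):

P_{w,c} = P(t \mid w) \cdot P(t \mid t_1, t_2)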

Here is an example of tagging output for the sentence "Năm ngoái / , / ông / Nguyễn Thành Tài / đi / thăm / khu / di tích / lịch sử / Củ Chi .", which means "Last year / , / Mr. / Nguyen Thanh Tai / visited / the / Cu Chi / historical / monument":

<w pos="Jt"> Năm ngoái </w>
<w pos=","> , </w>
<w pos="Nt"> ông </w>
<w pos="Np"> Nguyễn Thành Tài </w>
<w pos="Vm"> đi </w>
<w pos="Vtim"> thăm </w>
<w pos="Nt"> khu </w>
<w pos="Na"> di tích </w>
<w pos="Na"> lịch sử </w>
<w pos="Np"> Củ Chi </w>
<w pos="."> . </w>

in which: Jt - time adjunct, Nt - classifier noun, Np - proper noun, Vm - move verb, Vtim - transitive-impression verb, Na - abstract noun.

3) The Post Proper Noun Recognition Module

We have investigated the effects of the Pre Proper Noun Recognition module on the results, concluding that the tasks performed by the Proper Noun Identifier sub-module have a great impact on the performance. The main problems are:
1. The presence of different name forms designating the same entity. For example, Bác Hồ, Bác Hồ Chí Minh and Cụ Hồ refer to the same person, but the sub-module generates different terms for them.
2. Entities that are sometimes considered as common nouns, written in lower case, and sometimes as proper names, written capitalized. For example, we can find Kế Toán Trưởng and kế toán trưởng (chief accountant). The effect on indexing is important: when indexed as common nouns, all words involved are conflated via morphological families, which makes possible the matching of nouns formed by derivatives, as in the case of chief accountant.
3. Entities with complex proper names, for example Bộ Giáo Dục-Đào Tạo và Văn Hóa-Thông Tin. If such an entity (or other specified subject) is indexed as a single term, it cannot match queries referring to Bộ Giáo Dục-Đào Tạo or to Bộ Văn Hóa-Thông Tin, as often occurs. In addition, such complex nouns form structured noun phrases by themselves, but no dependency structure can be extracted when they are considered as proper nouns.
In order to solve these problems, we have decided to change the way proper nouns are managed: instead of tagging a sequence of capitalized words as a compound proper noun, each individual word is tagged as a simple proper noun. Accordingly, the grammar used by the parser is modified to take into account that several consecutive proper nouns can appear in the text. This is necessary to ensure that the right dependencies are extracted. For each capitalized word W we apply the following algorithm, both in the Proper Noun Trainer and in the Proper Noun Identifier sub-modules:

if W appears in the dictionaries of proper nouns then
  W is tagged as a proper noun
else if W appears in the lexicon with label T then
  W is tagged as a T

else if W is in a non-ambiguous position then
  W is tagged as a proper noun
else
  W is tagged as an unknown word

The Proper Noun Trainer can only consult the external dictionary of proper nouns, while the Proper Noun Identifier can consult both the external and the trained dictionaries. Applying this algorithm, Hồ Chí Minh is tagged as a sequence of three proper nouns, kế toán trưởng is tagged as commonNoun-adjective, and Bộ Giáo Dục-Đào Tạo và Văn Hóa-Thông Tin is tagged as commonNoun-commonNoun-coordination-commonNoun.
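A minimal Java sketch of this per-word decision is given below; the dictionary and lexicon are stubbed as a set and a map, the method names are hypothetical, and the returned labels Np and X follow the tagset in TABLE II:

import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the decision applied to each capitalized word W in the
// Post Proper Noun Recognition module (names are illustrative, not the authors' API).
public class CapitalizedWordTagger {
    static String tag(String w, boolean nonAmbiguousPosition,
                      Set<String> properNounDictionaries, Map<String, String> lexicon) {
        if (properNounDictionaries.contains(w)) {
            return "Np";                       // proper noun
        } else if (lexicon.containsKey(w)) {
            return lexicon.get(w);             // the label T recorded in the lexicon
        } else if (nonAmbiguousPosition) {
            return "Np";                       // capitalized in a non-ambiguous position
        } else {
            return "X";                        // unknown word
        }
    }
}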


III. EVALUATION

We have experimented with this method on many Vietnamese documents [20] with different styles of texts (see TABLE I). The best precision rate for Vietnamese Proper Noun Recognition is around 90%, with a training corpus of 75,000 entries and a lexicon of 80,000 entries (using a tagset of 48 lexical tags (see TABLE II) and 10 punctuation tags).
TABLE I
THE RESULT OF VIETNAMESE PROPER NOUN RECOGNITION

Documents / styles of texts                   Words    Precision
Love-story (1) / Vietnamese Novel             16,787   90.75%
Love-story (2) / Vietnamese Novel             14,698   90.39%
A little prince / Foreign Story               18,663   90.48%
Summary history of time / Scientific Book     11,626   88.20%
Technology / Newspapers                       10,662   87.90%
Average Precision                                      89.54%

The experimentation confirms that the larger the document is, the more accurate the result is, and the more a document differs in its style of text, the more the results vary. These are important notes for any further steps of language processing, such as syntactic analysis, semantic or even pragmatic processing.

IV. CONCLUSION

Recognizing proper nouns involves identifying which strings in a text name individuals. This task is made difficult by the unpredictable length of names (company names can be twelve or more words long), ambiguity in names (Sai Gon can be a company, a person, or a location), embedding, where e.g. a location name occurs within an organization name, variant forms, and the unreliability of capitalization as a cue, e.g. in headlines and generally in Vietnamese text. Being able to recognize names correctly clearly has relevance for a number of application areas: precision in IR systems should increase if multiword names are treated as unitary terms and if variant forms can be linked; IE systems rely heavily on PN recognition and classification. In this paper, we have proposed a model for the Vietnamese proper noun recognition problem. It includes some natural language processing techniques that can be used to analyze Vietnamese texts, focusing on the advanced word segmentation which accounts for a number of complex linguistic phenomena, as well as on the pre-tagging and post-tagging tasks for proper noun recognition. We have also introduced our method of POS tagging for Vietnamese texts, which is based on a lexicon with information about the possible POS tags for each word, a manually labeled corpus, syntax, and the context of texts.

Vietnamese researchers have only recently become involved in the Natural Language Processing domain. Accordingly, we had to construct all the necessary linguistic resources and define all data structures from scratch. Nevertheless, we benefit from some advantages: many existing methodologies for morpho-syntactic annotation and a high awareness of the standardization tendency. The tagset we used [3], [4], [5], [16], [17], [20] can be easily readjusted and extended thanks to a base of lexical descriptions. With automatic tagging assistance, we can easily increase the size of the annotated corpus. The obtained results lay the foundation for further research in NLP for Vietnamese: syntactic analysis, information retrieval, information extraction, multilingual alignment, machine translation, etc.

APPENDIX

A. Viterbi algorithm

Given a word sequence W1, ..., WT, lexical categories L1, ..., LN, lexical probabilities Pr(Wi | Li), and bigram probabilities Pr(Li | Lj), find the most likely sequence of lexical categories C1, ..., CT for the word sequence.

Initialization Step:
For i = 1 to K do
  SeqScore(i, 1) = Pr(Li | start) * Pr(W1 | Li)
  BACKPTR(i, 1) = 0

Iteration Step:
For t = 2 to T do
  For i = 1 to K do
    SeqScore(i, t) = Max over j = 1..K of (SeqScore(j, t-1) * Pr(Li | Lj)) * Pr(Wt | Li)
    BACKPTR(i, t) = index of the j that gave the maximum above

Sequence Identification Step:
C(T) = the i that maximizes SeqScore(i, T)
For i = T-1 down to 1 do
  C(i) = BACKPTR(C(i+1), i+1)
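The pseudocode above translates directly into the following compact Java sketch (array indexing and variable names are ours; probabilities are assumed to be given as dense arrays):

// SeqScore and BACKPTR are filled left to right, then the best tag sequence is read off backwards.
// lexProb[i][t] ~ Pr(W_t | L_i), biProb[j][i] ~ Pr(L_i | L_j), initProb[i] ~ Pr(L_i | start).
public class Viterbi {
    public static int[] bestTags(double[] initProb, double[][] biProb,
                                 double[][] lexProb /* [category][position] */) {
        int K = initProb.length;        // number of lexical categories
        int T = lexProb[0].length;      // number of words
        double[][] seqScore = new double[K][T];
        int[][] backPtr = new int[K][T];

        // Initialization step
        for (int i = 0; i < K; i++) {
            seqScore[i][0] = initProb[i] * lexProb[i][0];
            backPtr[i][0] = 0;
        }
        // Iteration step
        for (int t = 1; t < T; t++) {
            for (int i = 0; i < K; i++) {
                double best = -1.0; int arg = 0;
                for (int j = 0; j < K; j++) {
                    double s = seqScore[j][t - 1] * biProb[j][i];
                    if (s > best) { best = s; arg = j; }
                }
                seqScore[i][t] = best * lexProb[i][t];
                backPtr[i][t] = arg;
            }
        }
        // Sequence identification step
        int[] tags = new int[T];
        int bestLast = 0;
        for (int i = 1; i < K; i++) if (seqScore[i][T - 1] > seqScore[bestLast][T - 1]) bestLast = i;
        tags[T - 1] = bestLast;
        for (int t = T - 1; t > 0; t--) tags[t - 1] = backPtr[tags[t]][t];
        return tags;
    }
}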



B. Vietnamese POS Set


TABLE II
VIETNAMESE POS SET

Np     proper noun
Nc     countable noun
Ng     collective noun
Nt     classifier noun
Nu     concrete noun
Na     abstract noun
Nn     numeral
Nl     locative noun
Vt     transitive verb
Vit    intransitive verb
Vim    impression verb
Vo     orientation verb
Vs     state verb
Vb     transformation verb
Vv     volitive verb
Va     acceptation verb
Vc     comparative verb
Vm     move verb
Vla    "là" verb
Vtim   transitive-impression verb
Vta    transitive-acceptation verb
Vtc    transitive-comparative verb
Vtb    transitive-transformation verb
Vto    transitive-orientation verb
Vts    transitive-state verb
Vtm    transitive-move verb
Vtv    transitive-volitive verb
Vitim  intransitive-impression verb
Vitb   intransitive-transformation verb
Vits   intransitive-state verb
Vitc   intransitive-comparative verb
Vitm   intransitive-move verb
Aa     quality adjective
An     quantity adjective
Pp     personal pronoun
Pd     demonstrative pronoun
Pn     quantity pronoun
Pa     quality pronoun
Pi     interrogative pronoun
Jt     time adjunct
Jd     degree adjunct
Jr     rapport adjunct
Ja     adjunct of negation and acceptation
Ji     imperative adjunct
Cm     major/minor conjunction
Cc     combination conjunction
E      emotion word
I      introductory word
X      unknown

ACKNOWLEDGMENT

Chau Q. Nguyen thanks all postgraduate students in the NLP group of HCM City University of Technology, Vietnam, for their enthusiastic collaboration, and all members of the Vietnamese national project KC01_21 who have contributed to this research work.

REFERENCES
[1] James Allen, Natural Language Understanding, Benjamin/Cummings Publishing Company, 1995.
[2] Christopher Manning, Hinrich Schutze, Foundations of Statistical Natural Language Processing, MIT Press, fourth printing, 2001.
[3] PGS. TS. Phan Thị Tươi, Bắt lỗi chính tả Tiếng Việt, Đại học Quốc gia Tp. HCM, báo cáo khoa học tại Sở Khoa học - Công nghệ - Môi trường Tp. HCM, 1998.
[4] Cao Xuân Hạo, Tiếng Việt - mấy vấn đề ngữ âm, ngữ pháp, ngữ nghĩa, NXB Giáo dục, 2000.
[5] Nguyễn Tài Cẩn, Ngữ pháp tiếng Việt, NXB Đại học Quốc gia Hà Nội, 1999.
[6] Brill E., "Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging", Computational Linguistics, 21(4), pp. 543-565, December 1995.
[7] Gina-Anne Levow, Corpus-based Techniques for Word Sense Disambiguation, University of Maryland, USA, 1997.
[8] Dermatas E., Kokkinakis G., "Automatic Stochastic Tagging of Natural Language Texts", Computational Linguistics, 21(2), pp. 137-163, 1995.
[9] Schmid H., "Part-of-Speech Tagging with Neural Networks", International Conference on Computational Linguistics, Kyoto, Japan, pp. 172-176, 1994.
[10] Tufis D., "Tiered Tagging and combined classifiers", in Jelinek F. and Nöth E. (Eds.), Text, Speech and Dialogue, Lecture Notes in Artificial Intelligence 1692, Springer, 1999.
[11] Oflazer K., "Error-tolerant finite-state recognition with applications to morphological analysis and spelling correction", Computational Linguistics, 22(1), pp. 73-89, 1996.
[12] Levinger M., Ornan U., Itai A., "Learning morpho-lexical probabilities from an untagged corpus with an application to Hebrew", Computational Linguistics, 21(3), pp. 383-404, 1995.
[13] MacMahon J.G., Smith F.J., "Improving statistical language model performance with automatically generated word hierarchies", Computational Linguistics, 19(2), pp. 313-330, 1993.
[14] Eric V. Siegel, Corpus-based Linguistic Indicators for Aspectual Classification, Columbia University, USA, 1999.
[15] Sang-Zoo Lee, Jun-ichi Tsujii, Hae-Chang Rim, Lexicalized Hidden Markov Models for Part-of-Speech Tagging, University of Tokyo, Japan, and Korea University, Korea, 2000.
[16] Ủy ban Khoa học Xã hội Việt Nam, Ngữ pháp tiếng Việt, NXB Khoa học Xã hội, Hà Nội, 1993.
[17] Hoàng Phê, Từ điển tiếng Việt 2002, Nhà xuất bản Đà Nẵng - Trung tâm Từ điển học, 2002.
[18] Hamish Cunningham, Diana Maynard, Kalina Bontcheva, Valentin Tablan, Cristian Ursu, Marin Dimitrov, Mike Dowman, Niraj Aswani, Developing Language Processing Components with GATE, The University of Sheffield, 2001-2005. [Online]. Available: http://gate.ac.uk/sale/tao/


[19] T. Brants, "TnT - a statistical part-of-speech tagger", in Proceedings of the Sixth Applied Natural Language Processing Conference (ANLP-2000), Seattle, 2000.
[20] Nguyễn Thị Minh Huyền, Vũ Xuân Lương, Lê Hồng Phong, "Sử dụng bộ gán nhãn từ loại xác suất QTAG cho văn bản tiếng Việt", Proceedings of ICT.rda'03, Hanoi, Feb. 2003.

