
Collocation and Thai Word Segmentation

Wirote Aroonmanakun

Department of Linguistics
Faculty of Arts, Chulalongkorn University
Phyathai Rd., Bangkok, Thailand, 10330
Tel. +66-2-218-4696
Fax: +66-2-218-4695
e-mail: wirote.a@chula.ac.th

Abstract

This paper presents another approach to Thai word segmentation, composed of two processes: syllable segmentation and syllable merging. Syllable segmentation is done on the basis of trigram statistics. Syllable merging is done on the basis of collocation between syllables. We argue that many word segmentation ambiguities can be resolved at the level of syllable segmentation. Since a syllable is a better-defined unit and more consistent in analysis than a word, this approach is more reliable than other approaches that use a word-segmented corpus. The approach performs at an accuracy of 81-98%, depending on the dictionary used in the segmentation.

a. ˜µ„¨¤  ˜µB„¨¤ or ˜µ„B¨¤
b. ǨŠÁ¦º°  ÇB¨ŠBÁ¦º° or ǨŠBÁ¦º°
c. …œœ°„  …œBœB°„ or …œBœ°„
d. ¤µ„ªnµ  ¤µB„ªnµ or ¤µ„Bªnµ
e. ®¨ªŠ˜µ¤®µ´ª  ®¨ªŠ˜µB¤®µ´ª or ®¨ªŠB˜µ¤B®µB´ª

Figure 1. Ambiguity of word segmentation

1 Background

Thai word segmentation is a basic and essential issue for processing the Thai language. Since 1981, many approaches have been proposed to handle this task. As in many languages that do not have explicit word boundaries, such as Chinese, word segmentation is viewed as a problem of inserting word boundaries, or word separators, into the input sentence. Word segmentation is difficult because there is usually more than one way to insert word separators. Examples like those in Figure 1 are commonly used to illustrate the difficulty of Thai word segmentation.

Although many approaches have been implemented for word segmentation, the results are not yet satisfying. When a Thai word segmentation program based on a trigram model and a learning algorithm (Charoenpornsawat 1998) is applied to real texts from a newspaper, incorrect segmentations are easily found, even though the program is claimed to segment Thai words correctly more than 90% of the time. For example, some words, e.g. ‡œÄo (servant), „µ¦Á¤º°Š (politics), Á¨º°„˜´ÊŠ (elect), …¸ÊÁ™oµ (ash), are incorrectly segmented into two words, as ‡œ (man) + čo (use), „µ¦ (nom.) + Á¤º°Š (town), Á¨º°„ (choose) + ˜´ÊŠ (set up), …¸Ê (excrement) + Á™oµ (ash), respectively. Other words, e.g. Ÿ´œŸªœ, are incorrectly segmented into meaningless units, Ÿ´œBŸBªœ.

This paper discusses why Thai word segmentation is difficult. We will briefly review previous approaches and their underlying assumptions about the Thai language. We will point out the drawback of the trigram approach trained on a word-segmented corpus. Then we will argue that most segmentation ambiguities can be viewed as a problem of syllable segmentation rather than a problem of word segmentation. Thus, we will propose another word segmentation approach based on a syllable-based trigram model and maximum collocation. In short, word segmentation can be viewed as the processes of segmenting syllables and merging syllables, rather than as inserting word boundaries. Segmenting syllables is done by applying a trigram model of syllables. Merging syllables is done on the basis of collocation strength between syllables. The best segmentation is the one with maximum collocation strength.

2 Previous research

This section briefly reviews previous research on Thai word segmentation. We discuss the basic assumptions underlying each method and its implications for the Thai language. The rule-based approach (Thairatananond 1981, Chamyapornpong 1983) is excluded because it is used for syllable segmentation rather than word segmentation. Most approaches use a dictionary as the basis but segment texts by applying different strategies, such as longest matching (Poowarawan 1986) or maximum matching1 (Sornlertlamvanich 1993). For corpus-based approaches, the dictionary is already implicit, as in a trigram model (Kawtrakul et al. 1997) and feature-based segmentation (Meknavin et al. 1997). Some do not use a dictionary at all, to avoid the problems resulting from unknown words (Theeramunkong et al. 2000).

These approaches reflect different assumptions about the Thai language. The longest matching and maximum matching approaches share the view that compound words are preferred over simple words when both are applicable, but maximum matching also prefers the overall number of words to be minimal. While these assumptions have not been proved, they seem to be mostly correct, since the performance of maximum matching is claimed to be higher than 90% correct. For the corpus-based approach, trigram statistics should play an important role in resolving segmentation ambiguities. However, the result relies heavily on the training corpus, which is manually word-segmented. This approach can therefore suffer from the lack of a clear definition of Thai words and from inconsistency of segmentation in the training corpus. In a simple experiment, in which five subjects were asked to manually segment words in a sample text of 3,070 syllables, the results indicate that agreement on word boundaries is not perfect. The Fleiss' kappa coefficient (Rietveld and van Hout 1993:221-222) indicates the degree of agreement at 0.75. (The value for perfect agreement is 1, while the value for agreement by chance is 0.) The formula of Fleiss' kappa is presented below:

    kappa = (Po - Pe) / (1 - Pe)

with:
Po = proportion of agreeing pairs of judgments
Pe = proportion of agreeing pairs on the basis of chance

These proportions are calculated as follows:

    Po = (1/N) * SUM[i=1..N] P_i,  where P_i = ( SUM[j=1..v] n_ij^2 - k ) / ( k * (k-1) )

    Pe = SUM[j=1..v] p_j^2,  where p_j = ( SUM[i=1..N] n_ij ) / (N * k)

The symbols used are:
N = number of judged objects, i.e. the number of syllables in the text.
k = number of subjects, which is 5 in this case.
v = number of categories, which is 2, because subjects decide whether each syllable boundary is a word boundary.
n_ij = number of subjects who assign object i to category j.

When the segmentation results of each pair of subjects are compared, the average precision, recall, and balanced F-measure (2 * P * R / (P + R))2 (Rijsbergen 1979), as shown in Table 1, confirm that the agreement is not perfect. The average F-measure is only about 82%. This result suggests that word boundaries cannot always be determined intuitively. For a word-segmented corpus to be useful, the corpus has to be prepared by a few people trained with the same operational criteria for word segmentation.

1 The term 'maximum matching' may be used with the same meaning as 'longest matching' in other papers, such as Palmer (1997).
2 Precision is the number of identical words when comparing the segmentation of subject_i with that of subject_j. Recall is counted in the opposite direction.
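The Fleiss' kappa formula above can be sketched in code. The tabulation below (one row per judged object, one column per category, entries counting how many subjects chose that category) and the example matrices in the test are illustrative assumptions, not the experiment's actual data.

```python
# Sketch of the Fleiss' kappa formula above. Each row of n is one judged
# object (a syllable boundary); n[i][j] counts how many of the k subjects
# assigned object i to category j (word boundary / not a word boundary).

def fleiss_kappa(n):
    N = len(n)                     # number of judged objects
    k = sum(n[0])                  # number of subjects (5 in the experiment)
    v = len(n[0])                  # number of categories (2 here)

    # P_i: proportion of agreeing pairs of judgments for object i
    P = [(sum(c * c for c in row) - k) / (k * (k - 1)) for row in n]
    Po = sum(P) / N                # observed agreement

    # p_j: proportion of all judgments falling into category j
    p = [sum(row[j] for row in n) / (N * k) for j in range(v)]
    Pe = sum(pj * pj for pj in p)  # agreement expected by chance

    return (Po - Pe) / (1 - Pe)
```

With five subjects and two categories, a row such as [3, 2] records a boundary on which the subjects split three against two.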
          Precision  Recall  F-Measure
S1-S2       86.35     84.12    85.22
S1-S3       84.77     78.29    81.40
S1-S4       83.30     88.04    85.61
S1-S5       82.04     84.74    83.37
S2-S3       93.16     88.32    90.68
S2-S4       80.54     87.38    83.82
S2-S5       77.59     82.28    79.87
S3-S4       72.62     83.12    77.52
S3-S5       73.24     81.93    77.34
S4-S5       80.14     78.33    79.22
Average     81.38     83.66    82.41

Table 1. Comparing word segmentation between each pair of subjects

3 Problems on Word Segmentation

The lack of a clear definition of Thai words can cause a problem for word segmentation. Without agreement on word segmentation, a training corpus will not be very useful, because a corpus that is word-segmented by different persons is not compatible, or is sometimes in conflict. In fact, even when segmentation is carried out by the same person, it can be inconsistent. Thus, we should first discuss the definition of Thai words. Though in linguistics a word is defined as a linguistic unit composed of one or more morphemes, Thai grammar books usually view a word as a composition of syllables and distinguish two types of words as follows:

1. Simple words: A simple word can have one or more syllables. In a multi-syllable word, each syllable may have a meaning, but the meaning of the word is not related to the meaning of any syllable. Examples of simple words are œ°œ (sleep), °nµœ (read), ­³¡µœ (bridge), œµ¯·„µ (clock), etc.

2. Compound words: A compound word is composed of two or more simple words. The meaning of the word may not be the total composition of the meanings of its parts, though it can be related to them. For example, ¤nœÎʵ (river) is composed of ¤n (mother) and œÎʵ (water); though the meaning of ¤nœÎʵ is not 'the mother of water', it is related to water. The meaning of some compound words can be quite different from the meanings of their parts: ¥·œ—¸ (glad) is composed of ¥·œ (hear) and —¸ (good), and ®µ¥Ä‹ (breathe) is composed of ®µ¥ (lost) and ċ (heart). Some compound words are created by conjoining two simple words that are quite similar in meaning, such as —¼ (look) and ¨ (see), or ­ª¥ (pretty) and Šµ¤ (beautiful). Some are created by conjoining the same word, using the symbol for duplication Ç, such as —Š Ç (red) and —ε Ç (black).

3 In this paper, the symbol - is used for segmenting syllables, while _ is used for segmenting words, though in actual text there are no boundary markers.

Although the criteria above seem clear, when looking at real data it is not always easy to determine the number of words in a given input. For example, should ®¤o°®»Š…oµª (rice cooker) be analyzed as one compound word, or as three simple words, ®¤o° (pot), ®»Š (cook), and …oµª (rice)? If we analyze ®¤o°®»Š…oµª as a single word on the assumption that it denotes a single referent, should we then analyze ®¤o°®»Š…oµªÅ¢¢jµ (electric rice cooker) as a single word too, since it also denotes a single referent? How about ®œ´Š­º°¦ª¤š‡ªµ¤šµŠª·µ„µ¦Ĝ„µ¦ž¦³»¤­´¤¤œµ, which is assigned as a translation equivalent of the word 'proceedings'? Should this string be a single word? Or should it be analyzed as composed of nine words: ®œ´Š­º° (book), ¦ª¤ (collect), š‡ªµ¤ (article), šµŠ (about), ª·µ„µ¦ (academic), Ĝ (in), „µ¦ (nom.), ž¦³»¤ (meet), ­´¤¤œµ (seminar)?

This unclear semantic criterion could be one reason why word segmentations performed by different persons, or even by the same person, can be inconsistent. Chaicharoen (2002) thus proposes to use the uninterruptability of a word as one criterion for determining a Thai word. A sequence of syllables is considered a single word if its meaning changes when some other syllables are inserted in between. For example, ®¤o°®»Š…oµª (rice cooker) is considered one word because when š¸Éčo­Îµ®¦´ (that-is-used-for) is inserted between ®¤o° (pot) and ®»Š…oµª (cook-rice), the meaning is not the same. But Á‡¦ºÉ°Š¡·¤¡r—¸—Å¢¢jµ (electric typewriter) is considered two words, Á‡¦ºÉ°Š¡·¤¡r—¸— (typewriter) and Å¢¢jµ (electric), because when š¸Éčo (that-use) is inserted, the meaning of Á‡¦ºÉ°Š¡·¤¡r—¸—š¸ÉčoÅ¢¢jµ is not much different from the original. This criterion reflects the internal cohesion of a word.

Of course, we cannot apply an insertion test directly in word segmentation programs, but we may use the idea of internal cohesion as a clue for word segmentation. In this study, we use the collocation strength between syllables to measure this internal cohesion.
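As a rough illustration of using collocation strength between syllables to measure internal cohesion (the measure itself is defined formally in section 6), one can compare how often two syllables occur adjacent against how often they occur separated by one other syllable. The function name, the toy corpus, and the add-one adjustment in the denominator are our own illustrative assumptions, not the paper's implementation.

```python
import math
from collections import Counter

def cohesion(syllables, x, y):
    # p: relative frequency of x and y occurring adjacent (x-y)
    # q: relative frequency of x and y separated by one syllable (x-ANY-y);
    #    add-one (an assumption here) keeps the ratio finite for unseen patterns
    n = len(syllables)
    adjacent = Counter(zip(syllables, syllables[1:]))
    skip_one = Counter(zip(syllables, syllables[2:]))
    p = adjacent[(x, y)] / (n - 1)
    q = (skip_one[(x, y)] + 1) / (n - 2)
    return math.log(p / q) if p > 0 else float("-inf")
```

Syllable pairs that belong together inside a word should come out with a higher score than pairs that are adjacent only occasionally.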
4 Segmentation As Two Processes

Previous approaches to Thai word segmentation usually view the segmentation problem as the resolution of word boundary ambiguities, as shown in Figure 1. However, we think that many segmentation ambiguities can be resolved by just performing syllable segmentation. Since a syllable is a better-defined unit than a word, it is easier and more consistent to build a syllable-segmented corpus. Therefore, we view Thai word segmentation as composed of two processes.4 The first is syllable segmentation, which can be done by applying a trigram model trained on a syllable-segmented corpus. This process should resolve many segmentation ambiguities, at least in the classic examples in Figure 1. The next process is to group syllables into words. The latter is more difficult than the first. In this study, we use the idea of collocation to measure the internal cohesion of a word. Collocation here refers to the co-occurrence of syllables observed in the training corpus. It can be measured by many statistical methods, such as mutual information, chi-square, Dunning's log-likelihood, etc. (Manning and Schütze 1999). But in this study, to reflect the idea of internal cohesion, we use the ratio of the chance of finding two syllables together to the chance of finding other syllables in between the two. This is discussed in section 6.

4 Sawamiphakdi (1990) also did word segmentation in two steps: building syllables by rules and merging them by dictionary look-up. However, she did not use statistical methods for resolving segmentation ambiguities.

5 Syllable Segmentation

Thai syllables here refer to written syllables only. Typically, a syllable is composed of vowel forms, initial consonants, and final consonants. In some syllables, the vowel forms are omitted, as in „—. Some syllables use more than one character for vowel forms, such as Á­¸¥Š, Á­º°, etc. Some have two initial consonants, such as „ªoµŠ, Œ¨µ—, Á…¨µ, etc. Some have more than one character for the final consonant, such as ‹´„¦, Œ´˜¦, etc. Some syllables are unique, such as „È, –, etc. Nevertheless, we can define all syllable patterns, and the number of patterns is finite. In this study, we define about 200 syllable patterns for matching an input string; these are patterns such as Á&57µ³, Á;º7°³, Á&5·7<, in which X, C, R, Y, T each stand for a different group of characters. The results of matching these syllable patterns are usually ambiguous. For example, in segmenting an input string like „¦¦¤„µ¦„¦¤¡¨«¹„¬µ¦°¥„¦nµŠ, 36 possible segmentations were found. But when trigram statistics of syllables are applied, this sentence is segmented correctly as „¦¦¤„µ¦„¦¤¡¨«¹„¬µ¦°¥„¦nµŠ.

In this study, a training corpus of 553,372 syllables from a newspaper was manually syllable-segmented. Witten-Bell discounting is used for smoothing (Chen and Goodman 1998), and the Viterbi algorithm is used to determine the best segmentation. When tested on another corpus of 30,498 syllables, 52 segmentation errors were found; that is, the program segments syllables correctly up to 99.8%. Of these 52 errors, 22 involve proper names and foreign words written in Thai.

6 Syllable Merging

In this step, we assume that every syllable boundary is a potential word boundary. In a sentence that is syllable-segmented, this process determines which boundaries can be deleted; those that are left are regarded as word boundaries. The output then is a sentence that is word-segmented. The first design is to use the collocation strength between syllables to merge them. The assumption is that if a word contains two or more syllables, those syllables will always co-occur, so the probability of their co-occurrence should be much greater than chance. The collocation strength between two syllables that are parts of a word should be higher than that between two syllables that are not parts of a word. For example, in the phrase Ážd—B®œoµ˜nµŠ, which consists of the two words Ážd— (open) and ®œoµ˜nµŠ (window), the collocation between ®œoµ and ˜nµŠ should be higher than that between Ážd— and ®œoµ. However, since the collocation strength between two syllables has the same value in all circumstances, it is insufficient to determine a word boundary by considering only the collocation at that boundary. For example, …o°˜n° could be a two-syllable word, as in (2a), but it could also be part of a multi-syllable word, as in (2b) and (2c), or two different words, as in (2d), or one word plus a syllable of another word, as in (2e) and (2f). Clearly, we cannot use the collocation
between …o° and ˜n° to determine whether that boundary is a word boundary.

a. µ›»¦„·‹Á„¸É¥ª„´…o°˜n°¨³­µ¥°n°œš»„œ·—¢d˜Á¢¨È„ŽrÁ¨º°„šÎ -- …o°˜n° (joint)
b. š¸Éœµ¥žd~œ„¨nµª¤µÁž}œÁ¡¸¥Š…o°°oµŠ®¦º°…o°˜n°­¼o®œ¹ÉŠÁšnµœ´Êœ -- …o°˜n°­¼o (argument)
c. °°­Á˜¦Á¨¸¥°µ‹®¥·¥„Á¦ºÉ°Šœ¸Ê…¹Êœ¤µÁž}œ…o°˜n°¦°Šžd—˜¨µ—ĜµŠ­·œ‡oµ…°ŠÅš¥ -- …o°˜n°¦°Š (bargaining point)
d. ¤°œÃ¥µ¥…o°˜n°—¦­¤¼¦–r -- …o° (clss.) ˜n° (to)
e. ‡ªµ¤Á­¸É¥Šš¸É­¼Š…o°˜n°¤µ‡º°„µ¦Áž¨¸É¥œž¨Š…°Š°´˜¦µ—°„Á¸Ê¥ -- …o° (clss.) ˜n°¤µ (next)
f. ®µ„‡–³„¦¦¤„µ¦ž¨i°¥Ä®oŸ¼o¦´Á®¤µ…Ȋ…o°˜n°¦µ‡µž¦³¤¼¨ -- …Ȋ…o° (defy) ˜n° (to)

Figure 2. Examples of …o°˜n° in different contexts

Examples like those in Figure 2 not only indicate the insufficiency of considering only syllable collocation, but also raise the question of how to determine a word. Two syllables like …o°˜n° can be either one word, …o°˜n°, or two words, …o°B˜n°. We know that a sequence of syllables is a word because it refers to something; in other words, we have lexical knowledge of that word. Thus, in this study, we use a dictionary to determine whether a sequence of syllables could be a word. Without a dictionary, the program might have to check all possible sequences of syllables: given that a word can consist of one or more syllables, a sequence of n syllables allows 2^(n-1) possible segmentations.

Therefore, in this process, a dictionary look-up is used to match all possible words in the sequence of syllables. The result of matching the input syllables against the dictionary can be ambiguous. To determine the best resolution, though we cannot use the collocation between syllables directly, we may use the overall collocation strength in the sentence. It is assumed here that there is collocation strength between the syllables at every syllable boundary. This strength is a force that binds syllables into a word. On the other hand, there can be a driving force that prevents one syllable from combining with another. For example, in a sequence of syllables ...a-b-c-d-e..., in which b-c-d forms a word, there are forces between b-c and c-d that combine them, but there are also forces between a-b and d-e that prevent b from combining with c, and d from combining with c. Thus, the overall collocation strength of an input sentence is defined as the sum of the collocation within words minus the collocation strength between words:

    S = SUM[(x,y) inside a word] C(x,y) - SUM[(x,y) across a word boundary] D(x,y)

where C(x,y) and D(x,y) are both the collocation strength between adjacent syllables x and y, counted as a binding force when x-y falls inside a word and as a driving force when x-y straddles a word boundary. The best segmentation is the one with the maximum overall collocation strength.

In addition, we also tested two variations of this model. The first subtracts the collocation D only when the pair of syllables could be part of another word. For example, in a sequence ...a-b-c-d-e..., in which b-c-d forms a word, D(a,b) is subtracted only if a-b could be the ending of some word. This variation is called MaxColl-B (the first model is named MaxColl-A). The other variation ignores the driving force D, i.e. subtracts nothing; it is called MaxColl-C.

For the collocation strength between syllables, since we hold the idea of internal cohesion for determining a word, we use the ratio of p(x,y) to q(x,y), where p(x,y) is the probability of finding syllables x and y together, and q(x,y) is the probability of finding some syllable in between x and y (x-ANY-y), i.e. the probability of x and y being separated by another syllable. The collocation between syllables x-y is then defined as:

    Coll(x,y) = log( p(x,y) / q(x,y) ) = log( prob(x-y) / prob(x-ANY-y) )

7 The Results

Four segmentation algorithms, MaxColl-A, MaxColl-B, MaxColl-C, and MaxMatch (maximum matching), were tested on the testing corpus of 30,498 syllables. The results of these algorithms are compared with the manual
segmentation done by the author. In the process of syllable segmentation, the results are 100% correct. But in the process of syllable merging, the result of each algorithm differs. In a perfect situation, where every word in the testing corpus is included in the dictionary, MaxColl-A segments words correctly with an F-measure of 96.76%, as shown in Table 2. However, in the same setting, when compared with the results of MaxColl-B, MaxColl-C, and MaxMatch, the result of MaxMatch seems to be the best.5

            Precision            Recall               F-measure  Severe errors
MaxColl-A   96.36 (19271/19998)  97.16 (19271/19835)  96.76      37
MaxColl-B   97.97 (19371/19773)  97.66 (19371/19835)  97.81      31
MaxColl-C   98.02 (19380/19772)  97.71 (19380/19835)  97.86      28
MaxMatch    98.56 (19317/19600)  97.39 (19317/19835)  97.97      47

Table 2. Results of the word segmentation.

5 The accuracy is very high because, like all other research, it is evaluated against the researcher's own analysis. By using the dictionary prepared by the researcher, the program shares the same idea of lexical units in the analysis. Thus, unlike the results shown in Table 1, problems from disagreement over lexical units are minimal.

MaxMatch has the best score because it produces the lowest number of words (19,600), which explains why its precision rate is high. But if we look at the recall rate, we see that MaxColl-B and MaxColl-C produce more correct words than MaxMatch. However, the accuracy figure alone may not be the best indicator, since accuracy is measured against the author's manual segmentation, and the manual segmentation is not always perfectly consistent. A mismatch between the manual segmentation and the segmentation from an algorithm does not necessarily mean the algorithm's result is incorrect. For example, the program always segments „¨»n¤˜´ª°¥nµŠ as one word, but in the manual segmentation, even though it was carried out with care and rechecked for consistency, this string is sometimes segmented as one word and sometimes as two words. Therefore, we should compare the performance of these algorithms by examining which one produces fewer severe errors of segmentation.

By severe errors, we mean segmentation that results in a wrong word, so that the meaning of the sentence is incorrect. The examples in Figure 3 illustrate severe errors of segmentation from MaxMatch. In (3a), š¸É¤µ should be segmented as two words, š¸É (comp.) and ¤µ (come), rather than as one word, š¸É¤µ (source). In (3b), ¤µ„ªnµ should be segmented as ¤µ (asp.) and „ªnµ (over), not ¤µ„ (more) and ªnµ (say). In (3c), šµŠ„µ¦Á¤º°Š should be segmented as šµŠ (prep.) and „µ¦Á¤º°Š (politics), not šµŠ„µ¦ (official) and Á¤º°Š (city).

a. °—¸˜B¦´“¤œ˜¦¸Bš¸É¤µB°¥¼nB¡¦¦‡BŚ¥¦´„Åš¥BĜBž{‹‹»´œ
b. ˜´—­·œBÁ¦ºÉ°ŠB„µ¦BčoB™o°¥‡ÎµBĜB—œB¤³„³Ã¦œ¸B¤µ„BªnµBBže
c. —oª¥B„µ¦B…nŠ…´œBÁ…oµ­¼nB„µ¦B¤¸B˜Îµ®œnŠBšµŠ„µ¦BÁ¤º°Š

Figure 3. Examples of severe errors

In terms of severe errors, MaxMatch produced 47 errors, while MaxColl-A, MaxColl-B, and MaxColl-C produced 37, 31, and 28 errors respectively. Thus, by this criterion, MaxColl-C should be the best segmentation algorithm.

In a non-perfect situation, where there are unknown words, the performance of all algorithms dropped, as shown in Table 3. The dictionary used in this setting is derived from the word list of another corpus. In this setting, there are 883 unknown lexical words out of a total of 3,082 lexical words in the testing corpus, or 29% of the lexicon. The F-measure indicates that MaxColl-B and MaxColl-C perform equally well as, or better than, MaxMatch.

However, in terms of severe errors, MaxMatch, MaxColl-A, MaxColl-B, and MaxColl-C produced 64, 64, 52, and 51 errors respectively. Again, the result suggests that MaxColl-C is the best segmentation algorithm. Therefore, the best maximum collocation model for word segmentation should do only the summation of the collocation strength within each word. The formula is then changed as below:

    S = SUM[w in W] SUM[(x,y) inside w] Coll(x,y)
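The MaxColl-C selection, which sums collocation strength only inside words and takes the grouping with the maximum total, can be sketched as a dynamic program over the syllable-segmented input. The dictionary, the collocation table, and the fallback that any single syllable may stand alone as a word are toy assumptions for illustration, not the paper's actual resources.

```python
# Sketch of MaxColl-C-style merging: among the groupings of a syllable
# sequence that the dictionary allows, pick the one maximizing the summed
# collocation strength of adjacent syllable pairs inside words.

def best_segmentation(sylls, dictionary, coll):
    best = {0: (0.0, [])}                 # end position -> (score, words)
    for i in range(len(sylls)):
        if i not in best:
            continue
        score, words = best[i]
        for j in range(i + 1, len(sylls) + 1):
            word = tuple(sylls[i:j])
            # multi-syllable groupings must be dictionary words; a single
            # syllable may always stand alone (our fallback assumption)
            if len(word) > 1 and word not in dictionary:
                continue
            gain = sum(coll.get((a, b), 0.0) for a, b in zip(word, word[1:]))
            cand = (score + gain, words + [word])
            if j not in best or cand[0] > best[j][0]:
                best[j] = cand
    return best[len(sylls)][1]
```

Because only within-word pairs contribute to the score, a grouping wins exactly when its words cover the strongly collocated syllable pairs.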
            Precision            Recall               F-measure  Severe errors
MaxColl-A   75.39 (17003/22553)  85.72 (17003/19835)  80.23      64
MaxColl-B   76.56 (17071/22298)  86.07 (17071/19835)  81.03      52
MaxColl-C   76.56 (17070/22295)  86.06 (17070/19835)  81.03      51
MaxMatch    85.13 (15326/18003)  77.27 (15326/19835)  81.01      64

Table 3. Results when there are unknown words

8 Conclusion

Though the maximum collocation approach does not clearly outperform the maximum matching approach, segmentation by the latter relies heavily on the words listed in the dictionary. When the maximum matching approach is used, there is a preference for compound words in the dictionary over simple words. For example, when the compound word š¸É¤µ (source) is included in the dictionary, maximum matching will always view the string š¸É¤µ as one word rather than as two words, š¸É (comp.) and ¤µ (come). There is no such preference when the maximum collocation approach is used: whether this string is one word or two depends on the overall collocation strength of the input sentence. Thus, in this approach, the dictionary only plays the role of defining what can be a word. How well the algorithm performs depends on the exhaustiveness of the dictionary; when there are many unknown words, the segmentation results may not be satisfying. However, we think that a dictionary is still a necessary component for defining a word. To cope with the problem of unknown words, further study should focus on extending this approach to determine unknown words. In case an unknown word may exist, since the output of syllable segmentation is a sequence of syllables, we can use statistical methods to determine potential words from this syllable sequence.

In addition, in this study we simplify the ratio of prob(x-y) to prob(x-ANY-y) by considering only one syllable in between x and y. It is possible to consider a wider scope of syllables in between x and y when calculating prob(x-ANY-y). In fact, if there is no restriction on the scope, prob(x-ANY-y) equals prob(x)*prob(y), and the log ratio of prob(x-y) to prob(x)*prob(y) is then the same as mutual information. Thus, it may be possible to improve the performance of the algorithm by using other statistical methods for measuring the collocation strength between syllables. In our preliminary tests, applying log-likelihood, mutual information, and chi-square as the collocation strength in MaxColl-C, we found no improvement with log-likelihood, and a little improvement with mutual information or chi-square. Since mutual information is computationally the least expensive here, it might be the best method for capturing collocation strength at the moment. However, which statistical method is best for this approach is still an area for further study.

References

Chaicharoen, N. 2002. Computerized Integrated Word Segmentation and Part-of-Speech Tagging of Thai. Master thesis, Faculty of Arts, Chulalongkorn University. (in Thai)
Chamyapornpong, S. 1983. A Thai Syllable Separation Algorithm. Master thesis, Asian Institute of Technology.
Charoenpornsawat, P. 1998. Feature-Based Thai Word Segmentation. Master thesis, Department of Engineering, Chulalongkorn University.
Chen, Stanley F., and Joshua Goodman. 1998. An Empirical Study of Smoothing Techniques for Language Modeling. TR-10-98. Harvard University.
Fleiss, J.L. 1971. Measuring Nominal Scale Agreement among Many Raters. Psychological Bulletin 76. 378-382.
Kawtrakul, A., Thumkanon, C., Poovorawan, Y., and Suktarachan, M. 1997. Automatic Thai Unknown Word Recognition. In Proceedings of the Natural Language Processing Pacific Rim Symposium 1997 (NLPRS'97).
Manning, Christopher and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge: MIT Press.
Meknavin, S., Charoenpornsawat, P., and Kijsirikul, B. 1997. Feature-based Thai Word Segmentation. NLPRS, Incorporating SNLP-97.
Palmer, David D. 1997. A Trainable Rule-based Algorithm for Word Segmentation.
Poowarawan, Y. 1986. Dictionary-based Thai Syllable Separation. In Proceedings of the Ninth Electronics Engineering Conference.
Rietveld, Toni and Roeland van Hout. 1993. Statistical Techniques for the Study of Language and Language Behaviour. Berlin, New York: Mouton de Gruyter.
Rijsbergen, C.J. van. 1979. Information Retrieval. London: Butterworths.
Sawamiphakdi, Duangkaew. 1990. Building a Thai Grammar Analyzer Software under the UNIX System. Bangkok: Thammasat University Press. (in Thai)
Sornlertlamvanich, V. 1993. Word Segmentation for Thai in a Machine Translation System. NECTEC. (in Thai)
Thairatananond, Y. 1981. Towards the Design of a Thai Text Syllable Analyzer. Master thesis, Asian Institute of Technology.
Theeramunkong, T., Usanavasin, S., Machomsomboon, T., and Opasanont, B. 2000. Thai Word Segmentation without a Dictionary by Using Decision Trees. The Fourth Symposium on Natural Language Processing 2000.

Copyright © 2002, Joint International Conference of SNLP-Oriental COCOSDA 2002, SIIT, Thammasat University, Pathumthani, Thailand. All Rights Reserved.
