Wirote Aroonmanakun
Department of Linguistics
Faculty of Arts, Chulalongkorn University
Phyathai Rd., Bangkok, Thailand, 10330
Tel. +66-2-218-4696
Fax.: +66-2-218-4695
e-mail: wirote.a@chula.ac.th
a. ตากลม: ตา|กลม or ตาก|ลม
b. โคลงเรือ: โค|ลง|เรือ or โคลง|เรือ

Abstract
perform well at the level of accuracy 81-98% depending on the dictionary use),

[...] (ash), are incorrectly segmented into two words as [...] (man) [...] (nom.) [...] (town), [...]
1 The term ‘maximum matching’ may be used in the same meaning as ‘longest matching’ in other papers, such as Palmer (1997).
2 Precision is the number of identical words when comparing the segmentation of subject_i with that of subject_j. Recall is counted in the opposite direction.
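As an illustration of the measure described in footnote 2, the following is a minimal sketch of how two segmentations of the same text could be compared. It assumes that "identical words" means words covering the same character span in both segmentations, that precision divides by the number of words in the first segmentation and recall by the number of words in the second, and that the F-measure is the usual harmonic mean of the two. The function names are hypothetical and are not taken from the paper.

```python
def word_spans(words):
    """Return the set of (start, end) character spans of a segmentation."""
    spans, start = set(), 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def compare(seg_i, seg_j):
    """Compare two segmentations (lists of words) of the same text.

    Assumption: precision is counted against seg_i and recall against
    seg_j; swapping the arguments swaps the two figures.
    """
    if not seg_i or not seg_j:
        return 0.0, 0.0, 0.0
    same = len(word_spans(seg_i) & word_spans(seg_j))
    precision = same / len(seg_i)
    recall = same / len(seg_j)
    return precision, recall, f_measure(precision, recall)


# Toy example with Latin letters standing in for syllables.
print(compare(["ab", "c", "de"], ["ab", "cde"]))
```

With the S1-S2 figures reported in the paper (precision 86.35, recall 84.12), `f_measure` gives 85.22 after rounding to two decimals, which agrees with the reported F-measure.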
         Precision   Recall   F-Measure
S1-S2      86.35     84.12     85.22
S1-S3      84.77     78.29     81.40
S1-S4      83.30     88.04     85.61
S2-S3      93.16     88.32     90.68
S2-S4      80.54     87.38     83.82
S2-S5      77.59     82.28     79.87
S3-S4      72.62     83.12     77.52
S3-S5      73.24     81.93     77.34
S4-S5      80.14     78.33     79.22

[...] ดี (good), หายใจ (breathe) is composed of หาย (lost) and ใจ (heart). Some compound words are created by conjoining two simple words that are quite similar in meaning, such as [...] (look) [...] created by conjoining the same word, such as แดงๆ (red; ๆ is the symbol for duplication) and ดำๆ (black).

Although the criteria above seem to be clear, when looking at the real data, it is not always easy to determine the number of words in a given input. For example, should หม้อหุงข้าว (rice [...]
cause a problem for word segmentation. Without agreement on word segmentation, a training corpus will not be very useful because a corpus [...]

[...] each syllable may have a meaning, but the meaning of the word is not related to the [...]

[...] composed of two or more simple words. The meaning of the word may not be the total composition of the meaning of its parts, though it can be related to the meaning of its parts. For example, แม่น้ำ (river) is composed of แม่ (mother) and น้ำ (water); though the meaning of แม่น้ำ is not ‘the mother of water’, it is related to water. The meaning of some compound words could be different from the meaning of its part, such as ยินดี (glad) is composed of ยิน (hear) and [...]

[...] translation equivalent of the word ‘proceeding’? Should this string be a single word? Or should it be analyzed as composed of nine words: [...]

[...] (rice cooker) is considered a word because when inserting ที่ใช้สำหรับ (that-is-used-for) between [...]

[...] ไฟฟ้า (electric), because when inserting ที่ใช้ (that-use), the meaning of เครื่องพิมพ์ดีดที่ใช้ไฟฟ้า is not much different from the original one. This criterion reflects the internal cohesion of a word.

Of course, we cannot apply an insertion test directly for word segmentation programs, but we may use the idea of internal cohesion as a clue for word segmentation. In this study, we use collocation strength between syllables for measuring this internal cohesion.
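The paper defines collocation strength between syllables x and y as the ratio of p(x,y), the probability of finding x and y together, to q(x,y), the probability of finding x and y separated by one other syllable. A minimal sketch of how such a ratio might be estimated from a syllable-segmented corpus is given below. The normalisation of the two probabilities, the smoothing constant `eps` that avoids division by zero, and all function names are assumptions made for illustration; the exact formula in the paper is not reproduced here.

```python
from collections import Counter


def collocation_table(sentences):
    """Count adjacent (x, y) pairs and x-ANY-y pairs.

    `sentences` is a list of syllable lists.
    """
    adjacent, separated = Counter(), Counter()
    n_adj = n_sep = 0
    for syls in sentences:
        for x, y in zip(syls, syls[1:]):      # x immediately followed by y
            adjacent[(x, y)] += 1
            n_adj += 1
        for x, y in zip(syls, syls[2:]):      # x and y with one syllable between
            separated[(x, y)] += 1
            n_sep += 1
    return adjacent, n_adj, separated, n_sep


def coll(x, y, table, eps=1e-6):
    """Ratio p(x,y) / q(x,y); eps is an assumed smoothing term."""
    adjacent, n_adj, separated, n_sep = table
    p = adjacent[(x, y)] / n_adj if n_adj else 0.0
    q = separated[(x, y)] / n_sep if n_sep else 0.0
    return p / (q + eps)


# Toy corpus with Latin letters standing in for Thai syllables.
corpus = [["a", "b", "c"], ["a", "b", "d"], ["a", "x", "b"]]
table = collocation_table(corpus)
print(coll("a", "b", table), coll("b", "c", table))
```

In this toy corpus, b-c is never split by another syllable, so its ratio is much larger than that of a-b, which also occurs as a-x-b. This is the sense in which a high ratio suggests internal cohesion.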
Previous approaches on Thai word segmentation usually view the segmentation problem as the resolution of word boundary ambiguities, as shown in Figure 1. However, we think that [...]

[...] Y, T stands for a different group of characters. The results after matching these syllable patterns are usually ambiguous. For example, in segmenting an input string like กรรมการกรมพลศึกษา [...]

[...] boundary is a word boundary.

Examples as in Figure 2 not only indicate the insufficiency of considering only syllable collocation, but also raise the question of how to determine a word. Two syllables like [...] can be either one word, [...], or two words, [...].
We know whether a sequence of syllables is a word because it refers to something. In other words, we have the lexical knowledge of that word. Thus, in this study, we will use a dictionary for determining whether a sequence of syllables could be a word. Without a dictionary, the program might have to check all possible sequences of syllables. Given that a word can consist of one or more syllables, in a sequence of n syllables, there could be $2^{n-1}$ possible segmentations.

Therefore, in this process, a dictionary look-up is used to match all possible words from the sequence of syllables. The result after matching the input syllables with the dictionary could be ambiguous. To determine the best resolution, though we cannot use collocation between syllables directly, we may use the overall collocation strength in the sentence. It is assumed here that there is collocation strength between syllables at every syllable boundary. This strength is a force that binds syllables into a word. On the other hand, there could be a driving force that prevents one syllable from combining with another. For example, in a sequence of syllables ...a-b-c-d-e..., in which b-c-d forms a word, there are forces between b-c and c-d that combine them together, but there are also forces between a-b and d-e that prevent a from combining with b and d from combining with e. Thus, the overall collocation strength of an input sentence is defined as the sum of collocation within a word minus the collocation strength between words.

$$\mathrm{Strength}(S) = \sum_{(x,y)\ \text{within a word}} \mathrm{Coll}(x,y) \; - \sum_{(x,y)\ \text{between words}} D_{x,y}$$

The best segmentation is the one with the maximum collocation strength. In addition, we also tested two variations of this model. The first one is to subtract the collocation D only when the pair of syllables could be a part of another word. For example, in a sequence ...a-b-c-d-e..., in which b-c-d forms a word, $D_{a,b}$ will be subtracted only if a-b could be an ending of some word. This variation will be called MaxColl-B. (The first model is named MaxColl-A.) Another variation is to ignore the driving force D, that is, not to subtract anything. It will be called MaxColl-C.

For the collocation strength between syllables, since we hold the idea of internal cohesion for determining a word, we use the ratio of p(x,y) to q(x,y), where p(x,y) is the probability of finding syllables x and y together, and q(x,y) is the probability of finding any syllable in between x and y (x-ANY-y), or the probability for x and y to be separated by another syllable. The collocation between syllables x-y then is defined as below:

$$\mathrm{Coll}(x,y) = \frac{p(x,y)}{q(x,y)}$$

[The rest of this equation, including the definition of D, is not recoverable from the extraction.]

7 The Results

Four segmentation algorithms, MaxColl-A, MaxColl-B, MaxColl-C, and MaxMatch (maximum matching), are tested on the testing corpus of 30,498 syllables. The results from these algorithms are compared with the manual
segmentation done by the author. In the process of syllable segmentation, the results are 100% correct. But in the process of syllable merging, the result of each algorithm is different. In a perfect situation, where every word in the testing corpus is included in the dictionary, MaxColl-A can segment words correctly with the F-measure at 96.76%, as shown in Table 2. However, in the same setting, when compared with the results from MaxColl-B, MaxColl-C, and MaxMatch, the result from MaxMatch [...]

[...] as one word, sometimes as two words. Therefore, we should compare the performances of these algorithms by examining which one produces less severe errors of segmentation.

By severe errors, we refer to segmentation that results in a wrong word. The meaning of the sentence then is incorrect. Examples in Figure 3 illustrate severe errors of segmentation from MaxMatch. In (3a), ที่มา should be segmented as two words, ที่ (comp.) and มา (come), rather than one word ที่มา (source). In (3b), [...] should be [...] ว่า (say). In (3c), ทางการเมือง should be segmented [...]
Table 2 (surviving rows)
MaxColl-A   19271/19998   19271/19835   Er:37
MaxColl-B   19371/19773   19371/19835   Er:31

[Figure 3: examples (a) to (c) of severe segmentation errors from MaxMatch; the Thai example text did not survive extraction.]
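The syllable-merging step described in the paper (match candidate words against a dictionary, then keep the segmentation with the maximum overall collocation strength) can be sketched as follows. This is a minimal illustration, not the author's implementation. It enumerates all $2^{n-1}$ groupings of n syllables rather than using a more efficient search, it assumes that the quantity subtracted at a between-word boundary equals the collocation strength at that boundary, and the syllables, dictionary entries, and strength values are invented for the example.

```python
from itertools import product


def all_segmentations(syllables):
    """Yield every grouping of consecutive syllables into words.

    Each of the n-1 syllable boundaries is either a word boundary or
    not, which gives 2**(n-1) groupings for n syllables.
    """
    if not syllables:
        return
    for cuts in product([False, True], repeat=len(syllables) - 1):
        words, current = [], [syllables[0]]
        for syl, cut in zip(syllables[1:], cuts):
            if cut:
                words.append(tuple(current))
                current = [syl]
            else:
                current.append(syl)
        words.append(tuple(current))
        yield words


def strength(words, coll):
    """Within-word collocation minus between-word collocation.

    coll[i] is the (assumed, precomputed) collocation strength at the
    boundary between syllable i and syllable i + 1.
    """
    total, i = 0.0, 0
    for w_idx, word in enumerate(words):
        for _ in range(len(word) - 1):    # boundaries inside the word
            total += coll[i]
            i += 1
        if w_idx < len(words) - 1:        # boundary to the next word
            total -= coll[i]
            i += 1
    return total


def best_segmentation(syllables, coll, dictionary):
    """Keep groupings whose words are all in the dictionary; pick the strongest."""
    candidates = [seg for seg in all_segmentations(syllables)
                  if all(word in dictionary for word in seg)]
    if not candidates:
        return None
    return max(candidates, key=lambda seg: strength(seg, coll))


# Hypothetical data following the paper's a-b-c-d-e example.
syllables = ["a", "b", "c", "d", "e"]
coll = [0.2, 3.0, 2.5, 0.4]               # a-b, b-c, c-d, d-e
dictionary = {("a",), ("e",), ("b", "c", "d"), ("a", "b"), ("c", "d", "e")}
print(best_segmentation(syllables, coll, dictionary))
```

Here two groupings pass the dictionary check, a|b-c-d|e and a-b|c-d-e, and the first is preferred because the strong b-c and c-d boundaries fall inside a word. Dropping the subtraction in `strength` would correspond to the variant the paper calls MaxColl-C.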