
An algorithm for morphological analysis of sentence

Petreski Metodija
Faculty of Electrical Engineering and Information Technologies, Ss. Cyril and Methodius University, Skopje, R. Macedonia
Key words: morphological analysis, Macedonian language

Abstract
This paper describes an algorithm for morphological analysis of the Macedonian language. We are concerned with recognizing the lexical category (part-of-speech tagging) of each word in a Macedonian sentence. The algorithm is usable only for grammatically correct sentences.

1. Introduction

Macedonian is an inflecting language. It has 6 persons, 9 tenses, 3 grammatical genders and 3 types of article. So each verb can have up to 6 (persons) * 9 (tenses) = 54 forms, and each noun or adjective can have up to (3 (grammatical genders) + 1 (plural)) * (3 (types of article) + 1 (no article)) = 16 forms. Words in Macedonian are divided into eleven groups: nouns, verbs, adjectives, adverbs, pronouns, numerals, prepositions, particles, exclamations, conjunctions and modal words. Prepositions, particles, exclamations, conjunctions, adverbs and modal words are closed (unchangeable) word classes. Nouns, verbs, adjectives, pronouns and numerals are open (changeable) word classes: they can be inflected by person, tense or gender. In this article we explain our algorithm for recognizing a word's part of speech. Recognizing conjunctions, modal words, exclamations, particles, prepositions and numerals is easy. In Macedonian the number of conjunctions, modal words, exclamations, particles, prepositions and pronouns is not greater than 200, and the number of numeral roots is not greater than 50, so we store these words and roots in a database.

Recognition of the other lexical classes is harder, because there are thousands of nouns, verbs, adjectives and adverbs. Although Macedonian has few homographs, the same word can often be either an adverb or an adjective (neuter gender). Word order in a Macedonian sentence is not fixed: an adjective can come before or after a noun, or some adjectives before and others after the noun, and the same holds for adverbs and verbs. Some rules do help, however, such as suffixes or specific word groupings. The success of the algorithm is measured by the ratio of correctly to wrongly selected words.
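The form counts quoted in the introduction can be checked by enumerating the combinations. The following is only a small sketch in Python: the labels are placeholders of ours (the text does not name the tenses or the article types), and only the counts themselves come from the paper.

```python
from itertools import product

# Placeholder labels (assumption): the paper gives only the counts.
PERSONS = [f"person_{i}" for i in range(1, 7)]      # 6 persons
TENSES = [f"tense_{i}" for i in range(1, 10)]       # 9 tenses
GENDER_OR_PLURAL = ["masculine", "feminine", "neuter", "plural"]
ARTICLES = ["article_1", "article_2", "article_3", "no_article"]


def verb_slots():
    """Every (person, tense) combination a verb can fill."""
    return list(product(PERSONS, TENSES))


def nominal_slots():
    """Every (gender-or-plural, article) combination of a noun or adjective."""
    return list(product(GENDER_OR_PLURAL, ARTICLES))


if __name__ == "__main__":
    print(len(verb_slots()), "verb forms at most")
    print(len(nominal_slots()), "noun or adjective forms at most")
```

These are upper bounds: a given word need not have a distinct surface form for every combination.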

2. Previous works
Most of the algorithms created for this problem analyze affixes (suffixes and prefixes), with approaches adapted to the specific language. These algorithms helped us in building the algorithm described here. There are also algorithms for morphological analysis based on dictionary systems. Because grammar rules and word formation are especially important in Macedonian, we used the grammars of the Macedonian language [1], [2] and [3]. Each of these grammars treats the problem of morphological analysis, but the grammar by Blazhe Koneski covers it best.

3. Sentence structure
Sentences in Macedonian can have different structures. They can include three compositions: a subject composition, a verb composition and an object composition. Subject and object compositions have the same structure and can contain one or more noun compositions. Noun compositions can be: noun, pronoun, adjective, noun + adjective, noun + numeral, noun + adjective + numeral, etc. Verb compositions can be: verb, verb + adverb, verb + preposition + verb (+ adverb). These compositions can also include prepositions, particles and conjunctions. For example:

Subject composition                 Verb composition       Object composition
Petko with his mother and father    every Wednesday eat    fish

Table 3.1. Subject, verb and object compositions in sentences.

As we can see in this example, the subject composition consists of nouns, pronouns, conjunctions and a preposition, the verb composition consists of an adjective, a noun and a verb, and the object composition consists of a noun. The order of these compositions is not always the same. Sometimes the subject composition is omitted. Sometimes the object composition comes first and the subject composition third. In questions, the verb composition is in second place (or in third place, when a pronoun is in second place). In imperative sentences, the verb composition is in first place. The verb composition can also be split. For example, in the sentence "Yesterday, Mitre saw Petre at the inn", Mitre is the subject composition, Petre is the object composition, and the verb composition consists of "yesterday", "saw" and "at the inn".

3.1. Structure of the noun composition


The word order in the noun composition can be: noun(s) + noun (student Nikola Nikoloski), adjective + noun (little girl), [...]

According to the typical noun endings, our algorithm finds about 60% of the nouns, but it cannot find nouns with other endings. Another problem is the article. In Macedonian the article is a suffix that can be added to a noun, an adjective or a numeral. For better recognition our algorithm removes the article, but some nouns themselves end in an article suffix. In noun compositions in the modern language the noun is in the last position, but in poetry and dialects an adjective may follow the noun, which is also grammatically correct.
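The article removal described above can be sketched as a simple suffix check. This is an illustration under our own assumptions, not the paper's implementation: the suffix list is the standard set of Macedonian definite-article suffixes (the paper's own list was not preserved), and `strip_article` and `min_stem` are hypothetical names.

```python
# Definite-article suffixes of standard Macedonian (assumption: the
# paper's own list is not available, so this set is ours).
ARTICLE_SUFFIXES = (
    "от", "ов", "он",   # masculine
    "та", "ва", "на",   # feminine
    "то", "во", "но",   # neuter
    "те", "ве", "не",   # plural
)


def strip_article(word, min_stem=2):
    """Return (stem, has_article) after removing a trailing article suffix.

    min_stem is a hypothetical guard that keeps very short words intact.
    """
    lowered = word.lower()
    for suffix in ARTICLE_SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem:
            return lowered[: -len(suffix)], True
    return lowered, False


if __name__ == "__main__":
    for w in ["книгата", "човекот", "жена", "маса"]:
        print(w, strip_article(w))
```

The word "жена" (woman) shows the difficulty noted in the text: it carries no article, yet it ends in an article suffix, so a purely suffix-based check strips it wrongly.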


4.2. Recognition of adjectives


Adjectives are changeable words. They can be inflected by gender and number (singular or plural), and they can take an article. They also have comparative and superlative forms. A characteristic of Macedonian and Old Slavonic is an intensifying adjective prefix: an adjective with this prefix expresses its quality more strongly than the same adjective without it. Masculine adjectives usually end in a consonant (the same as nouns). Feminine adjectives always end in the vowel a, neuter adjectives always end in the vowel o, and plural adjectives always end in the vowel и. Typical endings of adjectives are: [...]

Table 4.5.1. Suffixes for the present tense (by person: 1st, 2nd, 3rd; by number: singular, plural)

Table 4.5.2. The verb "to be" in the present tense (by person: 1st, 2nd, 3rd; by number: singular, plural)

Table 4.5.3. Suffixes for the imperfect (by person: 1st, 2nd, 3rd; by number: singular, plural)

Table 4.5.4. Suffixes for the aorist (by person: 1st, 2nd, 3rd; by number: singular, plural)

Table 4.5.5. Suffixes for the imperative (3rd person; singular, plural)

Table 4.5.6. Suffixes for the [...] verb form (masculine, feminine, neuter, plural)

All other tenses can be formed by combining these verb forms with prepositions or with the auxiliary verb (in its present or imperfect form). Verbs are usually used in verb compositions, and every sentence must have at least one verb. Sometimes, in complex sentences, there is more than one verb. Some verb compositions also contain more than one verb; these verbs are usually separated by a preposition or by a preposition + pronoun(s).

4.6. Recognition of adverbs

[...]

For this algorithm we need a grammatically correct sentence written in Macedonian, in Cyrillic script. First, the algorithm divides the text into sentences. After each sentence is divided into words, the algorithm starts to check the part of speech of each word. For that purpose we create a structure in which information about each word is saved. This structure contains:
string Word - the value of the separated word
array of integers Kinds_of_Word - all possible parts of speech to which the word may belong
array of floats Probability_of_kind - the probabilities that the word belongs to each part of speech (with the same ordinal number as the corresponding member of Kinds_of_Word)
boolean Article - true if the word has an article
integer ord_number - the ordinal number of the word in the sentence
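The word structure listed above could be written, for instance, as a Python dataclass. This is a sketch under our own assumptions: the field names follow the paper, but the `Kind` enumeration, its numbering (taken from the order in which the introduction lists the eleven groups) and the helper methods `add` and `best` are ours.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List, Optional, Tuple


class Kind(IntEnum):
    """The eleven word groups; the numbering is our assumption."""
    NOUN = 1
    VERB = 2
    ADJECTIVE = 3
    ADVERB = 4
    PRONOUN = 5
    NUMERAL = 6
    PREPOSITION = 7
    PARTICLE = 8
    EXCLAMATION = 9
    CONJUNCTION = 10
    MODAL_WORD = 11


@dataclass
class WordInfo:
    word: str                      # value of the separated word
    ord_number: int                # position of the word in the sentence
    kinds_of_word: List[int] = field(default_factory=list)
    probability_of_kind: List[float] = field(default_factory=list)
    article: bool = False          # True if the word has an article

    def add(self, kind: Kind, probability: float) -> None:
        """Record one candidate part of speech with its probability."""
        self.kinds_of_word.append(int(kind))
        self.probability_of_kind.append(probability)

    def best(self) -> Optional[Tuple[Kind, float]]:
        """Return the most probable candidate, or None if there is none."""
        if not self.kinds_of_word:
            return None
        probs = self.probability_of_kind
        i = max(range(len(probs)), key=probs.__getitem__)
        return Kind(self.kinds_of_word[i]), probs[i]
```

Keeping the two arrays parallel mirrors the description in the text; a single mapping from kind to probability would hold the same information.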

First, the algorithm checks whether the word belongs to one of the word classes that have a small, fixed number of members (the order of checking is: preposition, particle, conjunction, pronoun). All members of these classes are listed in the algorithm, so the probability of error in these cases is very small. If the word belongs to one of these groups, then Kinds_of_Word has only one member (the ordinal number of the kind of word), and Probability_of_kind also has only one element, with value 1.

After that, the word is checked against a list of the most used exclamations. If the word is found in this list, the algorithm records it in the same way as for prepositions, particles, conjunctions and pronouns. Otherwise the length of the word is checked: if it is equal to two, the word is set as an exclamation with probability 0.8. Another case of exclamation is a word that contains two or more repeating characters (consonants or vowels), excluding words that end in a double a or e; in this case the probability of being an exclamation is also 0.8. If the sentence consists of a single word and ends with an exclamation mark, 0.1 is added to the probability.

The next step is the check for numerals. As was said previously, numerals can be recognized by analyzing their endings or beginnings. The probability of a correct recognition is 1 if the word ends or begins with a characteristic ending or beginning. When the endings are analyzed and the word has an article, the article is removed first.

After this, the algorithm checks whether the word has an article. If it does, Article is set to true. Next, the algorithm checks whether the word is a verb. In this sub-algorithm we check whether the word is equal to one of the forms of the verb "to be"; if so, verb is added to the word's array of parts of speech with probability 1. If this condition is not met, we check the endings of the word. Some endings guarantee that the word is a verb, with probability 1. When the word ends in one of the remaining verb endings, the probability is 0.6. If the word ends in [...]

Figure 5.1. Diagram of the algorithm for morphological analysis

[...] adjective. If there is an article, then the probability for noun is 0.45 and for adjective 0.55. If the word has one of certain prefixes, the probability for adjective is increased to 0.75.

The recognition of adjectives comes after the recognition of nouns. Adjectives, just like nouns and numerals, can have an article. If the article exists, it is removed before the endings are analyzed. If the ending is typical only for adjectives, the probability is 1. Otherwise, if the word ends in [...] or o, it is an adjective with probability 0.3; if there is an article, the probability increases to 0.5.

The first cycle of the algorithm ends with the recognition of adverbs and modal words. To recognize adverbs, we check whether the word is in the list of adverbs; if it is, it is an adverb with probability 1. If the word has one of the typical adverb endings, the probability is also 1. When the word has an ending typical for adjectives in the neuter gender, the probability is 0.5. The final step is the recognition of modal words. For these words we first determine whether the word begins with a preposition and the remaining part is a noun or an adjective. There is also a list of modal words.

In the second cycle the position of the word in the sentence is analyzed. The second cycle applies to words that do not have the value 1 in their array of probabilities. Every sentence must have a verb (the exceptions are exclamatory sentences, which may consist of a single word: an exclamation, a pronoun or a noun), so if the sentence contains only one candidate for a verb, the algorithm tags this word as a verb. If there are two or more potential verbs, the algorithm checks whether the sentence contains commas or conjunctions, or whether the two potential verbs are separated by a particle (and a short pronoun form); if a potential verb is preceded by one of two particles, the algorithm tags the word as a verb.

In a pure noun composition (which may include only numerals, adjectives and nouns, in this order), an article, if present, is on the first word. Nouns are in the last place, so the algorithm recognizes all other words as adjectives (numerals are recognized in the first cycle). There may be a problem if the composition contains several nouns or if some adjectives come at the end.

Adverbs usually come after the verb, or before the noun in pure noun compositions. If the verb is a form of "to be" in any tense, then the word after it is not an adverb but an adjective. Adverbs before a pure noun composition can be recognized successfully only if the gender of the noun composition is not neuter or the noun composition has an article (the article is always on the first word of the composition); otherwise the algorithm recognizes the word as an adjective.
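Two of the steps described in this section, the closed-class and exclamation checks of the first cycle and the single-verb rule of the second cycle, can be sketched as follows. Everything here is illustrative: the word lists are tiny samples of ours (the paper's lists were not preserved), "repeating characters" is read as a character repeated consecutively, and the function names are hypothetical.

```python
import re

# Tiny sample lists (assumption); checked in the order given in the text.
CLOSED_CLASSES = [
    ("preposition", {"во", "на", "со", "од", "за"}),
    ("particle", {"не", "да", "ќе"}),
    ("conjunction", {"и", "а", "но", "или"}),
    ("pronoun", {"јас", "ти", "тој", "таа", "ние"}),
]
EXCLAMATIONS = {"ах", "ох", "еј"}


def first_cycle(word, sentence_len=None, ends_with_bang=False):
    """Return {kind: probability} for the closed-class and exclamation checks."""
    w = word.lower()
    for kind, members in CLOSED_CLASSES:
        if w in members:
            return {kind: 1.0}
    if w in EXCLAMATIONS:
        return {"exclamation": 1.0}
    # A consecutive repeated character, unless the word ends in double a or e.
    repeated = bool(re.search(r"(.)\1", w)) and not w.endswith(("аа", "ее"))
    if len(w) == 2 or repeated:
        probability = 0.8
        if sentence_len == 1 and ends_with_bang:
            probability += 0.1
        return {"exclamation": round(probability, 2)}
    return {}


def second_cycle_verb(tags):
    """If exactly one word is a verb candidate, tag it as a verb."""
    candidates = [i for i, t in enumerate(tags) if "verb" in t]
    if len(candidates) == 1:
        tags[candidates[0]] = {"verb": 1.0}
    return tags


if __name__ == "__main__":
    print(first_cycle("на"))
    print(first_cycle("ммм", sentence_len=1, ends_with_bang=True))
    print(second_cycle_verb([{"noun": 0.45, "adjective": 0.55},
                             {"verb": 0.6}, {}]))
```

The sketch leaves out the numeral, noun, adjective, adverb and modal-word checks, and the rules for sentences with several verb candidates.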

6. Results
To measure the efficiency of the algorithm for morphological analysis we created three test sets: the first from prose texts, the second from poetry and the third from scholarly texts. Each test set has the same number of words. We obtained the following results:

Poetry                         nouns   verbs   adjectives   pronouns   numerals
words in the class               813     792          386        534         62
right in class                   327     594          113        526         61
wrong in class                   486     198          273          8          1
in other classes                 478      72          452          0          0
right in class (percentage)    40.22      75        29.27      98.50      98.38
in other class (percentage)    59.37   10.81           80          0          0

Figure 6.1. Results of the algorithm for open-class words from poetry
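The two percentage rows of Figure 6.1 can be reproduced from the counts. In the sketch below the function names are ours, and the formulas are inferred from the numbers in the table rather than stated in the text: "right in class" is the share of the words of a class that were tagged correctly, and "in other class" is the share of words tagged as that class that actually belong to another class.

```python
def right_in_class_pct(words_in_class, right):
    """Share of the words of a class that the algorithm tagged correctly."""
    return 100.0 * right / words_in_class


def in_other_class_pct(right, from_other_classes):
    """Share of the words tagged as this class that belong to another class."""
    tagged = right + from_other_classes
    return 100.0 * from_other_classes / tagged if tagged else 0.0


# (words in the class, right in class, in other classes), from Figure 6.1.
POETRY = {
    "nouns": (813, 327, 478),
    "verbs": (792, 594, 72),
    "adjectives": (386, 113, 452),
    "pronouns": (534, 526, 0),
    "numerals": (62, 61, 0),
}

if __name__ == "__main__":
    for name, (total, right, other) in POETRY.items():
        print(name,
              round(right_in_class_pct(total, right), 2),
              round(in_other_class_pct(right, other), 2))
```

In more common terms, the first measure is the recall of a class and the second is one minus its precision.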

Poetry   adverbs   prepositions   particles   conjunctions   exclamations   modal words

Scholar                        adverbs   prepositions   particles   conjunctions   exclamations   modal words
words in the class                 124            264          69             96              0             0
right in class                      43            264          69             96              0             0
wrong in class                      81              0           0              0              0             0
in other classes                    68              8           0              0              0             0
right in class (percentage)      34.67            100         100            100              0             0
in other class (percentage)      61.26           2.94           0              0              0             0

Figure 6.6. Results of the algorithm for closed-class words from scholar text

All texts                                 nouns      verbs      adjectives   pronouns   numerals
Average (right in class (percentage))     46.93592   66.89533   50.61232     92.03877   99.46237
Average (in other class (percentage))     54.84862   14.88142   71.82362     2.941176   0

Figure 6.7. Results of the algorithm for open-class words from all texts

All texts
Average (right in class (percentage)): adverbs 32.70, prepositions 99.87, particles 100, conjunctions 100, exclamations 100, modal words 100
Average (in other class (percentage)): 48.84, 3.29, 11.11, 45.83

Figure 6.8. Results of the algorithm for closed-class words from all texts

As we can see in Figures 6.1 - 6.8, more than 90% of prepositions and pronouns are tagged with the correct part of speech. The result for particles and conjunctions is even higher, 100%, and no words of other parts of speech are placed in these classes. The group of exclamations also contains only words from this class, but some exclamations are tagged as other parts of speech. This is because some exclamations are formed by imitating sounds from nature, and not all of them can be included in our algorithm. The group of modal words contains only words from this class, but some of them may previously have been added to another class, because modal words can sometimes belong to another class as well.

The greatest problem with adverbs is that adjectives can also be recognized as adverbs, while adverbs themselves may be classified as nouns (for example, the words for "today" and "yesterday"). Therefore less than 33% of all adverbs are correctly selected, while about 50% of the words tagged as adverbs are tagged correctly. The percentage of recognized nouns is also around 50%, because of the similarity of nouns to other parts of speech such as verbs and adjectives. Another problem is that in prose and poetry adjectives are often used after nouns; this problem does not appear in the scholarly texts. The percentage of recognized verbs is around 85%. Bearing in mind that verbs are open-class words and that they can be recognized as other parts of speech, this result is good.

The most undesirable results are those for adjectives: less than 30% of adjectives are recognized. This problem arises because adjectives are recognized at the end of the algorithm. Another problem is that adjectives can sometimes be recognized as adverbs, or as nouns if they are placed after the nouns. According to these results numerals are almost always recognized, but sometimes other parts of speech can be recognized as numerals. This is because the poetry texts use some archaisms and words from Old Slavonic. Old Slavonic words are also the reason for unrecognized modal words in the poetry texts. Problems with recognizing pronouns exist because some short-form pronouns are identical to forms of the verb "to be".

7. Future works
As we can see in the previous section, the results of this algorithm are not perfect. In the future there are two possible ways for improving the results. The first way is to improve the statistical measures

