Arabic Document Classification: A Comparative Study

JOURNAL OF COMPUTING, VOLUME 3, ISSUE 4, APRIL 2011, ISSN 2151-9617
HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
WWW.JOURNALOFCOMPUTING.ORG
M. E. Abd El-Monsef , M. Amin , El-Sayed Atlam , O. El-Barbary
Abstract— Text classification has been widely used to help users discover useful information on the Internet. However, traditional classification methods are based on a word representation that only accounts for term frequency in the documents and ignores important relationships between key terms and fields. This paper considers specific words related to a field, called Field Association (FA) words, together with their ranks (levels). Moreover, this paper presents a Java software system that classifies Arabic text using keywords, FA words and compound FA words. Furthermore, a comparative study of keywords, FA words and compound FA words on Arabic text was carried out using experimental results generated by our software. The presented methods are evaluated by simulation results on 1819 files and 16 super-fields.
Index Terms— Field association words; compound Field association words; Arabic Information retrieval; precision; recall;
document classification.
——————————  ——————————
M. E. Abd El-Monsef is with the Mathematics Department, Tanta University, Faculty of Science, Tanta, Egypt.
M. Amin is with the Department of Mathematics, Menofya University, Faculty of Science.
El-Sayed Atlam is with the Tokishima University, Faculty of Engineering, Japan, and with the Mathematics Department, Tanta University, Faculty of Science, Tanta, Egypt.
O. El-Barbary is with the Mathematics Department, Tanta University, Faculty of Science, Tanta, Egypt.

1 INTRODUCTION
The World Wide Web (WWW) has become one of the most popular mediums for the dissemination of multilingual web pages. Automatic mediation of access to foreign web pages is becoming an increasingly important problem. Automatic document classification is important for organizing huge amounts of data. Many document retrieval systems have been developed along with advances in areas such as keyword retrieval, similar file retrieval, automatic document classification, cross-language retrieval and document summarization [9, 16, 17, 19, 27, 34, 39, 42, 22, 47, 51].

Arabic Information Retrieval (AIR) has recently become a focus of research and commercial development. Very few standards for the evaluation of such tools are known and available. A concrete evaluation of AIR systems is necessary for the advancement of this field.

Arabic is one of the six official languages of the United Nations. It is the mother tongue of 300 million people [38]. Statistics show that since 1995, when the first Arabic newspaper was launched online (www.asharqalawsat.com), the number of Arabic websites has been growing exponentially. By 2000 there were about 20 thousand Arabic sites on the web, about 7% of the published sites on the web [14]. Unlike the Latin-based alphabets, the orientation of writing in Arabic is from right to left. The Arabic alphabet consists of 28 letters and can be extended to ninety elements by additional shapes, marks, and vowels. Most Arabic words are morphologically derived from a list of roots. The root is the bare verb form; it can be triliteral, quadriliteral, or pentaliteral. Most of these roots are made up of three consonants. Arabic words are classified into nouns (adjectives and adverbs), verbs and particles. In formal writing, Arabic sentences are delimited by commas and periods as in English.

Most work on document classification treats documents as a bag-of-words, in which the text is represented as a vector of weighted frequencies. Such a simplified representation of text has been shown to be quite effective for a number of applications [7, 12, 40]. There are several attempts to enhance text representation using concepts or multi-word terms [56]. El-Kourdi et al. [46] used the Naïve Bayes algorithm for Arabic document classification; the average accuracy reported was about 68.781%. Sawaf et al. [27] used statistical classification methods such as maximum entropy to classify and cluster news articles; the accuracy reported was 62.7%. These methods make it possible to retrieve and classify texts in response to arbitrary databases without referring to systematically classified information. As mentioned above, document classification independent of existing classified information is certainly important. However, humans can recognize fields such as health or beauty by finding specific words without reading the whole text; these words are called keywords.

Most previous studies focused on keyword selection metrics such as chi-square, information gain, odds ratio, probability ratio, document frequency, and binomial separation [33, 34, 42] for the English language.
They used corpus-based keyword selection, "a common keyword set for all class that reflects the most important words in all document is selected", or class-based keyword selection, "the keyword selection process is performed separately for each class". In this way, the most important words specific to each class are determined.

The concept of Field Association (FA) words is based on the fact that the subject of a text (document field) can usually be identified by looking at certain specific words or phrases in that text. It is natural for people to identify the field of a document when they notice these specific words or phrases. These specific words or phrases are called FA words. An FA word is defined as the minimum word or phrase that serves to identify a particular field [20]. FA words form a limited set of discriminating terms that can specify document fields [4, 5]. For example, "homerun" indicates the sub-field <Baseball> of super-field <Sports>, and "US presidential election" indicates sub-field <Election> of super-field <Politics>. Therefore, "homerun" and "US presidential election" are examples of FA terms.

The aim of this paper is to evaluate the use of normal keywords versus the use of field association words for the Arabic language. In addition, [4] presented compound FA words and applied them to the classification of English text. In this paper we apply them to Arabic documents with respect to the morphological analysis of the Arabic language. Accordingly, this paper presents two methods to classify Arabic documents using FA words and an NB classifier.

The rest of the paper is organized as follows. Section 2 presents Arabic word formation and defines compound FA words in the Arabic language. Section 3 defines the Arabic document field tree. Section 4 identifies field association words, their levels and how we determine them. Section 5 is a comparison with traditional classification methods. Section 6 illustrates how we classify Arabic documents using FA words and compound FA words. Section 7 presents the experimental evaluation and the comparison with other traditional approaches. Section 8 focuses on the conclusion and future work.

2. ARABIC WORD STRUCTURE
Unlike the English language, nouns in Arabic can be masculine or feminine. Nouns can be definite, as in (المعلم) (al moalem, which means "the teacher" in English), or indefinite, as in (معلم) (moalem, which means "a teacher" in English). Adding the prefix (الـ) (al, which means "the" in English) makes a difference in meaning. Plurals in Arabic are of three kinds: the masculine plural, the feminine plural, and the broken plural. The plural is formed via suffixes or via pattern modification of the noun. In the first case, the suffix ~een for the accusative and genitive, as in (معلمين) (moalmeen, which means teachers in English), or ~oon for the nominative, as in (معلمون) (moalmoon, which means teachers in English), is appended to the masculine noun to form the masculine plural. In the second case, ~aat, as in (معلمات) (moalmaat, which means teachers in English), is appended to form the feminine plural, and the suffix "ha" is attached to the end of the word to form the singular feminine noun, as in (معلمة) (moalemah, which means female teacher in English). The dual is formed by adding (ان) or (ين) at the end of the noun, as in (معلمان) (moaleman, which means two teachers in English). In the third case, often referred to as broken plurals, the pattern of the singular noun is dramatically altered. We can recognize these plurals from their patterns. There are 27 patterns for most of the broken nouns.

Another kind of suffixation is the personal pronouns. The personal pronoun can appear as an isolated form or as a suffix attached to nouns, verbs, or prepositions. Certain suffixes are attached at the end of words to make them possessive. The attached suffix can be one letter: for example, when the letter "ي" (y) is attached to the end of the word (بيت) (baiet, which means "home" in English), it forms (بيتي) (baiety, which means "my home" in English). For the plural, two letters are attached to the end of the word: the letters "هم" (hom) for the masculine, as in (بيتهم) (baiethom, which means their home in English), and the letters "هن" (hn) for the feminine, as in (بيتهن) (baiethn, which means their home in English). These are the most common modifications to Arabic words. Example 1 shows different patterns for one word, "طفل" (tefl, which means child in English), and a summary of Arabic affixes is shown in Table 1. Dictionaries do not store every form of regular words. Most dictionary entries are stored in singular form, except for words that are usually used in the plural, like (كماليات) (kamaleyat, which means "luxuries" in English). Verbs are stored in the perfect form. Therefore, compound FA terms may be classified as permanent or temporary. Permanent compounds are fixed by common usage and can usually be found in a dictionary. Temporary compounds consist of words with additions. Dictionaries or reference books may disagree on the evolutionary stage of a compound or may not include temporary compounds.

Compound FA terms appear more often in Arabic text, and some compound FA terms are more restricted and allow more information to be retrieved. If compound FA terms are divided into single words, they will be ranked lower and will be at different levels. For example, the compound FA term "أنفلونزا الخنازير" relates to the sub-field <الأمراض> (al amrad, which means diseases in English) with high rank. Therefore, using compound FA words is more accurate than using single FA words. [18, 35] explain methods for determining compound FA words.

Table 1: Arabic affixes
Antefixes: وبال, وال, بال, فال, كال, ولل, ال, وب, ول, لل, فس, فب, فل, وس, ك, ف, و, ب, ل
  Prepositions meaning respectively: and with the, and the, with the, then the, as the, and to (for) the, the, and with, and to (for), then will, then with, then to (for), and will, as, then, and, with, to (for)
Prefixes: ا, ن, ي, ت
  Letters meaning the conjugation person of verbs in the present tense
Infixes: ا, و
  A letter added to give the meaning of the conjugation
Suffixes: تما, يون, تين, تان, ات, ان, ون, ين, وا, تا, تم, تن, نا, ت, ن, ا, ي, و
  Terminations of conjugation for verbs, and person and dual/plural/female marks for nouns
Postfixes: كما, هما, كن, هن, تي, ها, نا, هم, كم, ك, ه, ي
  Pronouns meaning respectively: your, their, your, their, my, her, our, their, your, his, your, my
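The suffixation patterns described in Section 2 can be illustrated with a minimal Python sketch. It only concatenates the regular suffixes quoted in the text to a stem; the function and dictionary names are hypothetical, broken plurals (which alter the word pattern) are deliberately not handled, and this is not the morphological analysis used by the authors.

```python
# Regular suffixes quoted in Section 2; the dictionary names are hypothetical.
NOUN_SUFFIXES = {
    "masculine plural (nominative)": "ون",
    "masculine plural (accusative/genitive)": "ين",
    "feminine plural": "ات",
    "dual": "ان",
}
POSSESSIVE_SUFFIXES = {
    "my": "ي",
    "their (masculine)": "هم",
    "their (feminine)": "هن",
}
DEFINITE_ARTICLE = "ال"


def inflect(stem, suffixes):
    """Return {label: stem + suffix} for every suffix in the mapping."""
    return {label: stem + suffix for label, suffix in suffixes.items()}


def make_definite(noun):
    """Attach the definite article as a prefix."""
    return DEFINITE_ARTICLE + noun


teacher = "معلم"  # moalem, "teacher"
home = "بيت"      # baiet, "home"

plural_forms = inflect(teacher, NOUN_SUFFIXES)
possessive_forms = inflect(home, POSSESSIVE_SUFFIXES)
definite_teacher = make_definite(teacher)

for label, form in {**plural_forms, **possessive_forms}.items():
    print(f"{label}: {form}")
```

Plain concatenation is enough for the regular cases named in the text; any realistic system would also need the pattern-based rules for broken plurals.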
3. FIELD TREE
A document field is defined as basic and common knowledge useful for human communication [37, 40]. A field tree is a schematic representation of the relationships among document fields. Leaf nodes in the field tree correspond to terminal fields, nodes connected to the root are super-fields, and other nodes correspond to medium fields. FA words are saved in a field tree as a knowledge base. This knowledge base constitutes an FA words dictionary. To build this field tree, a set of relationships is used to link its elements together. For example, Ancestor, Descendant, Parent, Child and Sibling are tree relations. For more details about building a document field tree refer to [37, 40].

In this study we construct a field tree for Arabic documents that contains 15 super-fields and 114 sub-fields. An example of a field tree is given in Figure 1. All FA terms and paths are manually assigned to levels.

4. BACKGROUND WITH FA WORDS
4.1 FA words and levels
A single FA word is a minimum unit (word) with semantic meaning that identifies a particular field; e.g., the words "فيروس" (vairoos, which means virus in English) and "أنفلونزا" (anfluanza, which means flu in English) are single FA words. A compound FA word consists of two or more single FA words; e.g., the terms "أنفلونزا الخنازير" (anfluanza al khanazeer, which means swine flu in English), "علوم الحاسب" (oloum el haseb, which means computer science in English) and "نظم المعلومات" (nozom el malomat, which means information system in English) are compound FA words.

FA words each have their own strength and scope; the scope in particular is ambiguous. In [5] the levels of the scope are defined as follows for field identification.

Definition 1: FA words have different scopes for associating with a field; five precision levels are used to classify FA words to document fields. The levels of FA words are:
(Level 1): perfect FA words are associated with one sub-field uniquely.
(Level 2): medium FA words are associated with a few sub-fields of one super-field.
(Level 3): super FA words are associated with one super-field.
(Level 4): multiple FA words are associated with a few sub-fields of several different super-fields.
(Level 5): non-FA words are unable to specify any field.

Example 2: Table 2 shows some examples of FA words with their field paths and levels.

FA word | Field association path | Level
"التمريض" (al-tamreed, which means nursing in English) | <الطب\علوم الصحة> (<al tep / olom al sehha>, which means <medicine \ health science> in English) | 1
"الصيدلة" (al-sydalah, which means pharmacology in English) | <الطب\علوم الصحة> (<al tep / olom al sehha>, which means <medicine \ health science> in English); <الطب\صيدلة> (<al tep al-sydalah>, which means <medicine \ pharmacology> in English) | 2
"التدخين" (al-tadkheen, which means smoking in English) | <الطب> (<al tep>, which means <health> in English) | 3
"الأمراض والعلاج" (al amrad wa al elag, which means diseases and cure in English) | <الطب\الصحة> (<al tep \ al seha>, which means <medicine \ health> in English); <العلاج\صيدلة> (<al elag \ al-sydalah>, which means <cure pharmacology> in English) | 4
"التي" (al late, which means "who" in English); "هو" (howa, which means "he" in English) | | 5
Table 2: An example of the levels of FA words

4.2 Determination of FA words
The traditional algorithm [4, 39, 50] that automatically determines the candidates for FA words and their ranks produces misleading redundant (unnecessary) words. In [5] the author introduced normalized word frequency instead of word frequency. In this paper, we use it to extract efficient Arabic FA words and apply them to Arabic document classification.

Definition 2: Let Total_Frequency(<T>) be the total frequency of all words in the terminal field <T>, and let Frequency(w, <T>) be the frequency of the word w in the terminal field <T>. Then Normalization(w, <T>) is defined as follows:

\[ \mathrm{Normalization}(w, \langle T \rangle) = \frac{\mathrm{Frequency}(w, \langle T \rangle)}{\mathrm{Total\_Frequency}(\langle T \rangle)} \qquad (1) \]

The normalized frequency defines how much a specific word is concentrated in a specific field.

Definition 3: For the parent field <S> and the child field <C>, the concentration ratio Concentration(w, <C>) of the FA word w in the field <C> is defined as follows:

\[ \mathrm{Concentration}(w, \langle C \rangle) = \frac{\mathrm{Normalization}(w, \langle C \rangle)}{\mathrm{Normalization}(w, \langle S \rangle)} \qquad (2) \]

The following algorithm determines FA words by considering their ranks.

Algorithm 1: FA words determination algorithm
Input: (a) w, candidates for FA words,
(b) Normalization(w, <C>) for w and for field <C>,
(c) Threshold α, to judge FA word ranks,
(d) Field tree.
Output: associated FA words and their ranks for w.
(Step 1): Determination of Perfect FA words
For the root = <S> and the child field = <S/C> of the field tree, the following formula is used to judge whether or not the word w is a Perfect FA word.

\[ \mathrm{Concentration}(w, \langle S \rangle) > \alpha \qquad (3) \]
[Figure: field tree rooted at Root. Node labels legible in the figure: الطب (Al teb, medicine), الصحة (Al seha, health), الرياضة (Al riyadah, sport), ثقافة (thaqafa, culture), الاقتصاد, السياحة والسفر (Al siyaha wa al safer, travel and tourism), أمراض (Amrad, diseases), أمراض القلب (Amrad el qalb, heart disease), أنفلونزا الخنازير (anfluanza al khanazeer, swine flu), السرطان (Al saratan, cancer), الوقاية (Al wekaia, protection), العلاج (Al elaag, treatment), الصحة و الغذاء (Al seha wa al ghezaa, health and food), الفيتامينات (Al vitamenat, vitamins), الحمية (Al hamyha, diet), التغذية (Al taghzya, feeding), أمراض سوء التغذية (Amrad soaa al taghzya, malnutrition disease), التغذية السليمة (al taghzya al saleema, proper nutrition).]
Figure 1: Arabic field tree
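Definitions 2 and 3 and the threshold test of Step 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word counts are invented toy data, the function names are hypothetical, the normalized frequency of a parent field is assumed to be the sum of its children's normalized frequencies (so that the ratio stays between 0 and 1), and the comparison with α is assumed to be strict.

```python
from collections import Counter

# Toy word counts for two child fields of one parent; numbers are invented.
CHILD_COUNTS = {
    "medicine/diseases": Counter({"flu": 30, "virus": 15, "diet": 5}),
    "medicine/treatment": Counter({"flu": 2, "serum": 18, "diet": 30}),
}


def normalization(word, counts):
    """Formula (1): frequency of the word over the total frequency of the field."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0


def parent_normalization(word, children):
    """Assumption: the parent's value is the sum over its child fields."""
    return sum(normalization(word, counts) for counts in children.values())


def concentration(word, child, children):
    """Formula (2): normalized frequency in the child over that of the parent."""
    parent = parent_normalization(word, children)
    return normalization(word, children[child]) / parent if parent else 0.0


def passes_threshold(word, child, children, alpha=0.90):
    """Threshold test of Step 1, assuming a strict comparison with alpha."""
    return concentration(word, child, children) > alpha


for word in ("flu", "diet", "serum"):
    for child in CHILD_COUNTS:
        ratio = concentration(word, child, CHILD_COUNTS)
        print(f"{word!r} in {child}: {ratio:.3f}")
```

With these toy counts, "flu" is concentrated in the diseases field (0.6 / 0.64), while "diet" is spread over both children and passes the threshold in neither, which is the situation handled by Steps 2 and 3.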

If Formula (3) is fulfilled, <S> is replaced by <S/C> and the same judgment is carried out on the new child field <S/C>. By repeating the same determination process, if <S/C> becomes a terminal field, w is determined as a Perfect FA word in the field <S/C>. If the field <S/C> cannot fulfill the condition in Formula (3), the process enters (Step 2).

(Step 2): Determination of Semi-perfect or Medium FA words
If w is not determined as a Perfect FA word in the field <S/C>, the terminal field has not been reached. Therefore, the field <S> should be a medium field and has at least two (m ≥ 2) children fields. From all children fields <S/Ck> (1 ≤ k ≤ m) of the medium field <S>, calculate the average normalized frequency of the word w over the children as follows:

\[ \frac{\sum_{k=1}^{m} \mathrm{Normalization}(w, \langle S/C_k \rangle)}{m} \qquad (4) \]

The accumulated concentration ratio, Accumulated Concentration(w, <S/Ck>), is taken over the children fields whose normalized frequencies are higher than the average value in Formula (4). If the accumulated concentration ratio over these children fields (1 ≤ k ≤ m) exceeds α and the children fields <S/Ck> are all terminal fields, w is judged as a Semi-perfect FA word in the fields <S/Ck>. If the accumulated value does not exceed the threshold α, w is determined as a Medium FA word of the field <S>. However, if not all of these children fields are terminal fields, the process enters (Step 3) and conducts the determination process for Multiple FA words.

(Step 3): Determination of Multiple FA words
Extract the terminal field <S/C> from the k children fields and determine w as a Multiple FA word of the field <S/C>. For the children that are not terminal fields, the child field <S/C> is changed into the root <S> of the field tree, and the process repeats (Step 1) and (Step 2). Then, many medium fields and terminal fields are obtained, and w is judged as a Multiple FA word of the field <S>. (end of algorithm)

Example 3: Consider the FA word candidates "أنفلونزا" (anflwanza, which means flu in English), "تغذية" (taghzya, which means feeding in English) and "مصل" (masel, which means serum in English), as in Figure 1. The number of children fields in <root> is 16. We chose <طب> (tep, which means medicine in English), <الصحة> (al seha, which means health in English) and <التغذية> (taghzya, which means feeding in English) as sub-fields. The threshold value α was chosen to be 0.90.

In (Step 1), suppose that w is "أنفلونزا" and <S> is <root>. The word "أنفلونزا" appears most frequently in the field <طب>, so the concentration ratio of the field <C> = <طب> is calculated on the field <S/C> = <root/طب>:

Concentration("أنفلونزا", <الطب>) = 0.91

Repeating the same process, select the terminal field <أنفلونزا الخنازير> (anfluanza al khanazeer, which means swine flu in English) in the medium field <الطب>, where "أنفلونزا" appears most frequently. As the determination is made only in the terminal field <C> = <أنفلونزا الخنازير> and the concentration ratio (0.965) exceeds the threshold α (0.90), the word "أنفلونزا" is determined as a Perfect FA word in the terminal field <أنفلونزا الخنازير>.

Consequently, when Algorithm 2 is applied after this algorithm, all checked documents that contain this word as a Perfect FA word are assigned to the same field. Otherwise, if a document contains the word as a Semi-perfect or Medium FA word, one or more children fields will appear.

5. COMPARISON WITH TRADITIONAL CLASSIFICATION
Popular supervised machine learning text classification algorithms include Naïve Bayes, Centroid-based [21], k-Nearest Neighbour (kNN) [31], and Support Vector Machines (SVM) [49]. Centroid-based, kNN and SVM are based on the vector space model, which represents text documents as vectors consisting of features. The Naïve Bayes classifier [21] is a probabilistic model based on applying Bayes' theorem with strong (naive) independence assumptions.

These algorithms normally use the classical text representation technique [41] that maps a document to a high-dimensional feature vector consisting of a "bag of words". This representation leads to the inclusion of unimportant features and the loss of important semantic relationships and inflection information [29], resulting in reduced accuracy.

Moreover, the accuracy of a text classifier depends to a large extent upon the classification granularity, and on how well separated the training or test documents belonging to different categories are, as in [32]. It may be relatively easy to classify two distinct categories such as 'sports' and 'politics', but it may be more difficult to distinguish between similar categories.

To overcome such problems, [3] has proposed the use of compound words extracted by morphological analyzers. [32] used automatically extracted summaries rather than the whole documents, while [24] have proposed the use of a limited set of automatically selected keywords as features. Recently, [30] have investigated the use of ontology for text categorization. However, the problems are still far from being solved.

In this paper, we introduce a text classification methodology based on Field Association (FA) words and compound FA words in the Arabic language. As the FA words of a field collectively store the essence (knowledge) of that field, they are effective for text classification. Accordingly, this paper presents two methods to classify Arabic documents using FA words and an NB classifier.

6. DOCUMENT CLASSIFICATION USING FA WORDS
Text classification techniques are used in many applications, including e-mail filtering, mail routing, spam filtering, news monitoring, sorting through
digitized paper archives, automated indexing of scientific articles, classification of news stories and searching for interesting information on the WWW [6, 11, 13, 23, 25, 26, 43, 44, 53, 57, 58, 52, 49, 48, 45, 54, 55]. The majority of these systems are designed to handle documents written in the English language and are not applicable to documents written in the Arabic language. For example, stemming refers to the process of removing affixes from words. If we remove the affixes from the Arabic word "وسام" (wisaam, which means accolade in English), it becomes "سام" (saam, which means poisonous in English) after stemming. The meaning of the word changes completely. For more detail about the defects of applying traditional methods to information retrieval for the Arabic language refer to [1]. Developing text classification systems for Arabic documents is a challenging task due to the complex and rich nature of the Arabic language. Previous work on Arabic text classification has used distance-based algorithms [38], learning algorithms [9], and Bayesian classification methods [8] in developing automated text classification systems. Specifically, [10] used N-grams for searching Arabic text documents.

In the following we illustrate the new algorithm and explain how to use FA words in Arabic document classification.

Because all FA words for a document must exist, Algorithm 2 first calls Algorithm 1 to find all FA words. Then the algorithm generates the derivation frame for each FA word by adding all affixes to the FA word; all affixes are summarized in Table 1. Each document that contains an FA word or any of its derivatives belongs to the same field.

Algorithm 2: FA Words Classification Algorithm
Input: (a) V = {v1, v2, ..., vn}, the set of FA words
(b) D, a collection of unclassified documents dk
Output: F, the classification of D.
Method:
1. Run Algorithm 1 to get the set of FA words
2.
3. Set F = { }
4. Set
5. for each vi ∈ V, do
6. for each document dk in D, do
7. if vi ∈ dk, copy dk to Fi,
8.
9. else, go to step 4
10. Return F.

The new idea of using FA words with derivation frames for the Arabic language is better suited to the complexity of Arabic morphology. In addition, it can be applied to earlier techniques such as the vector space model, the probabilistic model and the language model, modifying them to become efficient and suitable for the Arabic language. The aim of this paper is to show the advantage of using FA words with derivation frames for the Arabic language through a comparative study.

7. EXPERIMENTAL
The new method for using FA words can be applied to classify any Arabic document, such as web documents, scientific papers, articles, news and others. In this paper the experiment is performed on a collection of web documents because they are unstructured data and hard to classify.

7.1 Experimental Data
Our experiments trained the system using Arabic documents collected from the Internet. They were mainly collected from the Al-jazeera Arabic news channel, which is the largest Arabic site, Al-Ahram newspaper, Al-watan newspaper, Al Akhbar, Al Arabiya and Wikipedia, the free encyclopedia. The documents are categorized into 16 super-fields and 137 sub-fields. Our corpus contains 1,819 files and is about 26.4 MB. For experimental evaluation, we downloaded source code written in JAVA from http://nlp.cs.byu.edu/mediawiki/index.php/CS601R:Project_1_Guidelines. In addition, we modified it to be suitable for the new NB algorithm. We also prepared the system according to http://nlp.cs.byu.edu/mediawiki/index.php/How_to_prepare_your_system.

7.2 Preprocess
Before applying the classification algorithm to the testing data, some preprocessing of the text is performed. All the experiments are performed after normalizing the text. In normalization, the text is converted to UTF-8 encoding, and punctuation and non-letters are removed. Also, some Arabic letters are normalized, such as:
- Replace a final "ؤ" with "ء"
- Replace a final "ئ" with "ء"
- Replace a final "ت" with "ة"
- Replace "إ", "أ" or "آ" with "ا"
- Replace "ى" with "ي"
- Replace "ة" with "ه"
- Replace "وء" with "ؤ"
- Replace "ئ" with "ى"
- Replace "اا" with "ا"

In addition, all Arabic text contains redundant or unnecessary words; these words are called stop words. They are very common words that appear in the text and carry little meaning; they serve only a syntactic function and do not indicate subject matter. These stop words have two different impacts on the information retrieval process. They can affect retrieval effectiveness because they have a very high frequency and tend to diminish the impact of frequency differences among less common words. Deleting the stop words changes the document length and affects the weighting process. Identifying a stop word list, or stop list, that contains such words in order to eliminate them from text processing is essential to an information retrieval system. [2] explores the use of stop words and their effect on Arabic information retrieval. A general stop list (the stop lists for all the languages are available at http://www.unine.ch/info/clef) is created, based on the Arabic language structure and characteristics without any
additions. The word categories used are:
- Adverbs.
- Conditional pronouns.
- Interrogative pronouns.
- Prepositions.
- Pronouns.
- Referral names / determiners.
- Relative pronouns.
- Transformers (verbs, letters).
- Verbal pronouns.
- Other.

7.3 Experimental Evaluation
7.3.1 Experimental Evaluation Using FA Words Classification Algorithm
For experimental evaluation, we designed software written in JAVA with three versions. First, a classification of Arabic text using normal keywords was made. Second, a classification using field association words was made. Finally, a classification using compound FA words was made.

Simulation results for classification
Input data: (keywords, text)
Output: classified data according to keywords.

We have used about 150 keywords selected by humans from the corpus. Precision, Recall and F-measure are used to estimate the relevance of the presented methods and are defined as follows [8] [35]:

\[ \mathrm{Precision} = \frac{\text{number of documents correctly assigned to the field}}{\text{number of documents assigned to the field}} \]

\[ \mathrm{Recall} = \frac{\text{number of documents correctly assigned to the field}}{\text{number of documents belonging to the field}} \]

\[ \text{F-measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]

Precision, Recall and F-measure for six super-fields, measured using normal keywords, FA words and compound FA words, are shown in Table 3, Table 4 and Table 5. From the evaluation results it turns out that the best performance is recorded in classification with compound FA words, as shown in Table 6. Moreover, the F-measure calculated for each class separately using compound FA words is more accurate than with normal keywords and FA words.

Name of field | Precision | Recall | F-measure
الاستنساخ (al estensakh, which means cloning in English) | 0.04 | 0.8 | 0.07
الأمراض (al amrad, which means diseases in English) | 0.6 | 0.6 | 0.6
البيئة (al beaa, which means the environment in English) | 0.14 | 0.84 | 0.24
التكنولوجيا (al tecnologia, which means technology in English) | 0.25 | 0.98 | 0.4
جسم الإنسان (gesem al ensaan, which means the human body in English) | 0.08 | 0.9 | 0.14
مقالات علمية (makalat elmiah, which means scientific articles in English) | 0.07 | 0.75 | 0.13
Table 3: Classification using normal keywords

Name of field | Precision | Recall | F-measure
الاستنساخ (al estensakh, which means cloning in English) | 0.67 | 1 | 0.8
الأمراض (al amrad, which means diseases in English) | 0.74 | 0.69 | 0.71
البيئة (al beaa, which means the environment in English) | 0.72 | 1 | 0.8
التكنولوجيا (al tecnologia, which means technology in English) | 0.98 | 0.6 | 0.74
جسم الإنسان (gesem al ensaan, which means the human body in English) | 0.5 | 0.9 | 0.64
مقالات علمية (makalat elmiah, which means scientific articles in English) | 0.44 | 0.1 | 0.6
Table 4: Classification using FA words

Name of field | Precision | Recall | F-measure
الاستنساخ (al estensakh, which means cloning in English) | 0.86 | 1 | 0.92
الأمراض (al amrad, which means diseases in English) | 0.83 | 1 | 0.9
البيئة (al beaa, which means the environment in English) | 0.75 | 0.9 | 0.83
التكنولوجيا (al tecnologia, which means technology in English) | 0.74 | 0.99 | 0.85
جسم الإنسان (gesem al ensaan, which means the human body in English) | 0.52 | 1 | 0.7
مقالات علمية (makalat elmiah, which means scientific articles in English) | 0.57 | 1 | 0.73
Table 5: Classification using compound FA words

Measure | Classification with keywords | Classification with FA words | Classification with compound FA words
Precision | 0.2 | 0.68 | 0.72
Recall | 0.81 | 0.87 | 0.98
F-measure | 0.26 | 0.72 | 0.82
Table 6: Classification with keywords, FA words and compound FA words

In conclusion, the F-measure results for each class separately show that classification using compound FA words is more accurate than classification using normal keywords and FA words.

8. CONCLUSION
With the increasing popularity of the Internet and the tremendous amount of online text, automatic document classification is important for organizing huge amounts of data, especially Arabic documents. Moreover, document fields can be decided efficiently if there are many FA words and if the frequency rate is high. FA words are used to classify Arabic documents. Words are extracted from the document corpora to get FA word candidates. From the experimental results, our new software can automatically classify Arabic documents. The F-measure is 72% for classification using FA words and 82% for classification using compound FA words. Future work could focus on the automatic building of Arabic FA words using morphological analysis.
REFERENCES

[1] Abdu-salam F. "Effective retrieval technique for Arabic text", Doctor of Philosophy thesis, School of Computer Science and Information Technology, RMIT University, Melbourne, Victoria, Australia, May (2008).
[2] Abu El-Khair I. "Effects of stop words elimination for Arabic information retrieval: A comparative study", International Journal of Computing & Information Sciences, Vol. 4, No. 3, pp. 119-133, December (2006).
[3] Aizawa A. "Linguistic techniques to improve the performance of automatic text categorization", In Proceedings of NLPRS-01, 6th Natural Language Processing Pacific Rim Symposium, 307–314, (2001).
[4] Atlam E., Morita K., Fuketa M. and Aoe J. "A new method for selecting English field association terms of compound words and its knowledge representation", Information Processing and Management (38) 807–821, (2002).
[5] Atlam E., Elmarhomy G., Fuketa M., Morita K., Sumitomo T. and Aoe J. "Automatic Deletion of Unnecessary Field Association Word Using Morphological Analysis", International Journal of Computer and Mathematics, Vol. 83, No. 3, pp. 247-262, March (2006).
[6] Cavnar W. B. and Trenkle J. M. "N-Gram Based Text Categorization", Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, (1994).
[7] Chinchor N. "Named Entity task definition", In Proceedings of the Seventh Message Understanding Conference, (1998).
[8] Manning C. D., Raghavan P. and Schütze H. "Introduction to Information Retrieval", ISBN 978-0-521-86571-5, Cambridge University Press, (2008).
[9] Ciravegna F., Gilardoni L., Lavelli A., Ferraro M., Mana N., Mazza S., Matiasek J., Black W. and Rinaldi F. "Flexible Text Classification for Financial Applications: the FACILE System", In Proceedings of PAIS-2000, Prestigious Applications of Intelligent Systems sub-conference of ECAI2000, (2000).
[10] Damerau A. and Weiss "Towards Language Independent Automated Learning of Text Categorization Models", Research and Development in Information Retrieval, pp. 23-30, (1994).
[11] Debole F. and Sebastiani F. "Supervised Term Weighting for Automated Text Categorization", In: Proceedings of SAC-03, 18th ACM Symposium on Applied Computing, ACM Press, 784–788, (2003).
[12] Diederich J., Kindermann J., Leopold E. and Paaß G. "Authorship attribution with support vector machines", Applied Intelligence, 19(1/2): 109-123, (2003).
[13] Duwairi R. "A Distance-based Classifier for Arabic Text Categorization", In Proceedings of the International Conference on Data Mining, Las Vegas, USA, (2005).
[14] Egyptian Demographic Center, http://www.frcu.eun.eg/www/homepage/cdc/cdc.htm, (2000).
[15] El-Kourdi M., Bensaid A. and Rachidi T. "Automatic Arabic Document Categorization Based on the Naïve Bayes Algorithm", 20th International Conference on Computational Linguistics, August, Geneva, (2004).
[16] Forman G. "An Extensive Empirical Study of Feature Selection Metrics for Text Classification", Journal of Machine Learning Research 3, pp. 1289–1305, (2003).
[17] Fukumoto F. and Suzuki Y. "Automatic clustering of articles using dictionary definitions", In Proceedings of the 16th international conference on computational linguistic (COLING'96) (pp. 406–411), (1996).
[18] Fuketa M., Mizobuchi S., Hayashi Y. and Aoe J. "A fast method of determining weighted compound keywords from text databases", Information Processing and Management, 34(4), 431–442, (1998).
[19] Fuketa M., Lee S., Tsuji T., Okada M. and Aoe J. "A document classification method by using field association words", Information Sciences 126, 57-70, (2000).
[20] Fuhr N. "Models for retrieval with probabilistic indexing", Information Processing and Retrieval, 25(1), 55–72, (1989).
[21] Graham-Cumming J. "Naive Bayesian Text Classification: Fast, accurate, and easy to implement", Dr. Dobb's Journal, http://www.ddj.com/development-tools/184406064, [Accessed 2010-12-01], (2005).
[22] Gu Q. and Zhou J. "Learning the Shared Subspace for Multi-task Clustering and Transductive Transfer Classification", ICDM, pp. 159-168, 2009 Ninth IEEE International Conference on Data Mining, (2009).
[23] Han E. and Karypis G. "Centroid-Based Document Classification: Analysis and Experimental Results", In Proceedings of Principles of Data Mining and Knowledge Discovery, 424-431, (2000).
[24] Hulth A. and Megyesi B. "A study on automatically extracted keywords in text categorization", In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, 537–544, (2006).
[25] Janik M. and Kochut K. "Wikipedia in Action: Ontological Knowledge in Text Categorization", IEEE International Conference on Semantic Computing, pp. 268-275, (2008).
[26] Joachims T. "Text Categorization with Support Vector Machines: Learning with Many Relevant Features", In: European Conference on Machine Learning (ECML), (1998).
[27] Kimoto H. "Automatic indexing and evaluation of keywords for Japanese newspapers", Transaction on Information and Systems, IEICE of Japan, J74-D-I(8), 556–566 (in Japanese), (1991).
[28] Kupiec J., Pedersen J. and Chen F. "A trainable document summarizer", In Proceedings of annual international ACM SIGIR conference on research and development in information retrieval (SIGIR'95) (pp. 68–73), (1995).
[29] Lodhi H., Saunders C., Shawe-Taylor J., Cristianini N. and Watkins C. "Text Classification using String Kernels", Journal of Machine Learning Research, 2, 419-444, (2002).
[30] Melo G. and Siersdorfer S. "Multilingual Text Classification Using Ontologies", Lecture Notes in Computer Science: Advances in Information Retrieval, 4425/2007, 541-548, (2007).
[31] Miah M. "Improved k-NN Algorithm for Text Classification", In Proceedings of the International Conference on Data Mining (DMIN), (2009).
[32] Mihalcea R. and Hassan S. "Using the essence of texts to improve document classification", In Proceedings of the Conference on Recent Advances in Natural Language Processing (RANLP), (2005).
[33] Mesleh A. "Chi Square Feature Extraction Based Svms Arabic Language Text Categorization System", Journal of Computer Science 3(6): 430-435, ISSN 1549-3636, (2007).
[34] Mladenic D. and Grobelnic M. "Feature Selection for Unbalanced Class Distribution and Naïve Bayes", In: Proceedings of the 16th International Conference on Machine Learning, 258–267, (1999).
[35] Miyazaki M., Ikehara S. and Yoko A. "Combined word retrieval for bilingual dictionary based on the analysis of compound words", Transaction of the Information Processing Society of Japan 34(4), 743-754 (in Japanese), (1993).
[36] Miyazaki M. "Automatic segmentation for compound words using semantic dependent relationships between words", Transaction of the Information Processing Society of Japan 25(6), 970-979 (in Japanese), (1984).
[37] Olshen R. A., Breiman L., Friedman J. H. and Stone C. J. "Classification and regression trees", London: Chapman & Hall, (1984).
[38] Paice C. D. "Constructing literature abstracts by computer: techniques and prospects", Information Processing and Management, 26, 171–186, (1990).
[39] Peng F., Huang X., Schuurmans D. and Wang S. "Text Classification in Asian Languages without Word Segmentation", In Proceedings of the Sixth International Workshop on Information Retrieval with Asian Languages (IRAL 2003), Association for Computational Linguistics, Sapporo, Japan, (2003).
[40] Safavian S. R. and Landgrebe D. "A survey of decision tree classifier methodology", IEEE Transactions on Systems, Man, and Cybernetics, 21(3), 660–674, (1991).
[41] Salton G., Wong A. and Yang C. "A vector space model for automatic indexing", Communications of the ACM, 18(11), 613–620, (1975).
[42] Sawaf H., Zaplo J. and Ney H. "Statistical classification methods for Arabic news articles", Natural Language Processing in ACL2001, Toulouse, France, (2001).
[43] Sebastiani F. "Machine learning in automated text categorization", ACM Computing Surveys, Vol. 34, No. 1, pp. 1-47, (2002).
[44] Shin K., Abraham A. and Han A. "Enhanced Centroid-Based Classification Technique by Filtering Outliers", In Proceedings of the 9th International Conference on Text, Speech and Dialogue, (2006).
[45] Saha S. and Bandyopadhyay S. "A new multi objective clustering technique based on the concepts of stability and symmetry", Knowledge and Information Systems, Volume 23, Number 1 (April), 1-27, DOI: 10.1007/s10115-009-0204-4, (2010).
[46] Tayli M. and Al-Salamah A. "Building Bilingual Microcomputer Systems", In Communications of the ACM, Vol. 33, No. 5, pp. 495-505, (1990).
[47] Tang X. and Han M. "Ternary reversible extreme learning machines: the incremental tri-training method for semi-supervised classification", Knowledge and Information Systems, Volume 23, Number 3, 345-372, DOI: 10.1007/s10115-009-0220-4, (2010).
[48] Tamine-Lechani L., Boughanem M. and Daoud M. "Evaluation of contextual information retrieval effectiveness: overview of issues and research", Knowledge and Information Systems, Volume 24, Number 2, 221-233, DOI: 10.1007/s10115-009-0245-8, (2010).
[49] Tezel S. and Latecki L. "Improving SVM Classification on Imbalanced Data Sets in Distance Spaces", ICDM, pp. 259-267, 2009 Ninth IEEE International Conference on Data Mining, (2009).
[50] Tsuji T., Nigazawa H., Okada M. and Aoe J. "Early Field Recognition by Using Field Association Words", In the Proceeding of the 18th International Conference on Computer Processing of Oriental Language, 2, 301-304, (1999).
[51] Wei F., Li W., Lu Q. and He Y. "A document-sensitive graph model for multi-document summarization", Knowledge and Information Systems, Volume 22, Number 2, 245-259, DOI: 10.1007/s10115-009-0194-2, (2010).
[52] Wang H. and Wang S. "Mining incomplete survey data through classification", Knowledge and Information Systems, Volume 24, Number 3, 441-465, DOI: 10.1007/s10115-009-0214-2.
[53] Yang Y. and Liu X. "A Re-examination of Text Categorization Methods", In Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval, Berkeley, (1996).
[54] Yang Y. and Pedersen J. O. "A Comparative Study on Feature Selection in Text Categorization", In: Proceedings of the 14th International Conference on Machine Learning, 412–420, (1997).
[55] Yang J., Zhong N., Yao Y. and Wang J. "Peculiarity Analysis for Classifications", ICDM, pp. 607-616, 2009 Ninth IEEE International Conference on Data Mining, (2009).
[56] Zechner K. "Fast generation of abstracts from general relevant sentences", In Proceedings of the 16th international conference on computational linguistic (COLING'96) (pp. 986–989), (1996).
[57] Zhang W. and Yoshida T. "Text classification based on multi-word with support vector machine", Knowledge-Based Systems, 21(8), 879-886, (2008).
[58] http://trec.nist.gov/pubs/trec10/papers/UMass_TREC10_final.pdf, (2002).

M. E. Abd El-Monsef received the B.S&ED degree in Mathematics from Assuit University, Egypt, in 1968 and the B.S degree from the Faculty of Science, Assuit University, Egypt, in 1973. He received the MS degree in Mathematics from Al Azhar University, Cairo, Egypt, in 1977. He received his PhD degree in Mathematics from Tanta University, Tanta, Egypt, in 1980. He was assistant professor in the Department of Mathematics, Faculty of Science, Tanta University, from 1984. He was a professor of Mathematics in the Department of Mathematics, Faculty of Science, Tanta University, from 1988. He worked as Vice Dean of the Faculty of Science, Tanta University, for postgraduate and research affairs from 1991 to 1996. He worked as Vice Dean of the Faculty of Science, Tanta University, for student affairs from 1996 to 1999. He worked as Dean of the Faculty of Science, Tanta University, from 1999 to 2005. He was Chairman of the Scientific Committee of Promotion to Assistant Professors Post in Mathematics of the Scientific Council of the Egyptian Universities from 2001 to 2004, and a member of the Scientific Committee of Promotion to Professors Post in Mathematics from 2004 to 2008. He is a member of the National Committee for Mathematics, a member of the National Committee for History and Philosophy of Science, a member of the Board of Directors of the Egyptian Mathematical Society, a member of the editorial board of the Journal of Egyptian Mathematical Society, and a member of the Egyptian Society for the Arabization of Sciences. He was the Editor of the Delta Science Journal. He is the Representative of Tanta University in the League of Islamic Universities and a member of the Editorial Boards of the international scientific journals Science Echoes and Applied Mathematics & Information Sciences. He is a member of the Supreme Advisory Committee of the Centre for Development of the Delta Region of the Academy of Scientific Research and Technology. He has participated in more than 90 scientific conferences and specialized seminars. He has also supervised over 51 PhD and about 47 Master's theses. He has published about 100 research papers in the fields of general topology and fuzzy topology in prestigious national and international scientific journals. He was awarded the Tanta University Appreciation Award in basic sciences for the year 2001/2002. His research interests include General Topology, Rough Sets, Digital Topology and Fuzzy Sets.

Dr. El-Sayed Atlam received B.Sc. and M.Sc. degrees in Mathematics from the Faculty of Science, Tanta University, Egypt, in 1990 and 1994, respectively, and the Ph.D. degree in information science and intelligent systems from the University of Tokushima, Japan, in 2002. He was awarded a Japan Society of the Promotion of Science (JSPS) postdoctoral fellowship from 2003 to 2005 in the Department of Information Science & Intelligent Systems, Tokushima University. He is currently Associate Professor at the Department of Information Science and Intelligent Systems, University of Tokushima, Japan. He is also Associate Professor at the Department of Statistical and Computer Science, Tanta University, Egypt. Dr. Atlam is a member of the Computer Algorithm Series of the IEEE Computer Society Press (CAS) and the Egyptian Mathematical Association (EMA). His research interests include information retrieval, natural language processing and document processing.

Mohammed Amin graduated in mathematics in 1983 at Menoufiya University. He studied computer science from 1986 to 1989 at Ain Shams University in Cairo and received the M.Sc. degree in 1990 and the Ph.D. degree in computer science in 1997 at the University of Gdansk, Poland. He is associate professor of computer science at the Faculty of Science, Menoufiya University, and research visitor to the Faculty of Philosophy and Sciences of the Silesian University, Opava, Czech Republic. His research areas are formal languages and their application in compiler design, cooperating/distributed systems, web information retrieval, Petri nets and their applications, and finite automata and cryptography.

O. El-Barbary received the B.Sc. degree in Statistics and Computer Science from the Faculty of Science, Tanta University, Egypt, in 2004 and the M.Sc. in computer science from the Faculty of Science, Menofya University, Egypt, in 2007. She is now studying for a Ph.D. at the Faculty of Science, Menofya University, Egypt, in the field of information retrieval and natural language processing.
