
Comparison of Some Arabic Text Classification

Techniques Using a Multinomial Mixture Model



Prepared by:
Siham Abdalhady Hasan
Supervised by:
Prof. Ghassan Kanaan
This Thesis was submitted in Partial Fulfillment of the
Requirements for the Master's Degree of Science in Computer
Science, Faculty of Computer Sciences and Informatics

Amman Arab University
2013

Abbreviations
Abbreviation Description
IR Information Retrieval
TC Text Classification
ATC Arabic Text Classification
WWW World Wide Web
MMM Multinomial Mixture Model
KNN K-Nearest Neighbor
NB Naïve Bayes
SVM Support Vector Machine
D Document
C Class
HTML Hyper Text Markup Language
SGML Standard Generalized Markup Language
XML Extensible Markup Language
Re Recall
Pr Precision
FFS Feature subset selection
VMF Von Mises Fisher
BPSO Binary Particle Swarm Optimisation
LDA Latent Dirichlet Allocation


Table of Contents
1. Chapter one: Introduction .......................................................................................................... 1
1.1. Introduction ........................................................................................................................ 2
1.1.1. Information retrieval ................................................................................................... 2
1.1.2. Text classification ........................................................................................................ 6
1.1.3. Arabic language ......................................................................................................... 13
1.2. The statement of the problem .......................................................................................... 16
1.3. Thesis Objective ................................................................................................................ 17
1.4. Summary ........................................................................................................................... 17
2. Chapter two: literature Review ................................................................................................. 18
2.1. Literature Review .............................................................................................................. 19
2.1.1. Text classification ...................................................................................................... 19
2.1.2. Arabic Text Classification .......................................................................................... 21
2.2. Summary ........................................................................................................................... 29
3. Chapter three: Methodology .................................................................................................... 30
3.1. Introduction ...................................................................................................................... 31
3.2. System Architecture .......................................................................................................... 31
3.2.1. Corpus .................................................................................................................. 32
3.2.2. Pre-processing ........................................................................................................... 33
3.2.3. Classifiers ................................................................................................................... 38
3.2.4. Evaluation .................................................................................................................. 45
3.3. Summary ........................................................................................................................... 47
4. Chapter four: Experiments and Evaluation .............................................................................. 48
4.1. Introduction ...................................................................................................................... 49
4.2. Data set preparation ......................................................................................................... 49
4.3. Performance measures: .................................................................................................... 50
4.4. Evaluation Results ............................................................................................................. 54
4.4.1. Naïve Bayes algorithm using (MMM)........................................................ 54
4.4.2. Comparisons MMM with other techniques and Discussions of Results ................... 57
4.5. Results of Naïve Bayes algorithm (MMM) with 5070 documents .................... 60
4.6. Summary ........................................................................................................................... 63
4.7. Conclusion and Future Work: ........................................................................................... 64
4.8. Reference .......................................................................................................................... 65


Acknowledgements

I would like to express my sincerest gratitude to my supervisor, Prof. Ghassan
Kanaan, who has been exceptionally patient and understanding with me during
my studies. Without his kind words of encouragement and advice this work
would not have been possible.
I am extremely grateful to all the staff who have assisted me in the
Department of Computer Sciences and Informatics, especially Prof. Alaa
Al-Hamami. Thanks also to all of my other colleagues in Computer Sciences
and Informatics for making my time here an enjoyable experience.

I would like to thank the Libyan Embassy in Amman for taking care of me
and supporting my studies.

The support of my family and friends has been much appreciated, and most
importantly, I would like to thank my husband, Ali and my children, to whom I
am indebted for all of the moral and loving support they have given me during
this time.

Abstract
Text Classification (TC) assigns documents to one or more predefined
categories based on their contents. This project compares three automatic
TC techniques on the Arabic language: Rocchio, K-Nearest Neighbor (KNN),
and the Naïve Bayes (NB) classifier using a multinomial mixture model
(MMM). To evaluate these techniques, an Arabic TC corpus of 1445 Arabic
documents was used, classified into nine categories: Computer, Economics,
Education, Sport, Politics, Engineering, Medicine, Law, and Religion. The
main goal of this project is to compare several automatic text
classification techniques using a multinomial mixture model on the Arabic
language. The classification effectiveness has also been compared with the
SVM model, which was applied in another project that used the same
traditional classifiers and the same collection. The experimental results
are presented in terms of macro-averaged precision, macro-averaged recall
and macro-averaged F1 measures. The results reveal that Naïve Bayes using
the MMM works best for Arabic TC tasks and outperforms the k-NN and
Rocchio classifiers.

1. Chapter one: Introduction

1.1. Introduction
With the rapid development of the Internet, a large amount of Arabic
information has become available online; this motivates researchers to
build tools that help people classify this huge volume of Arabic
information.
1.1.1. Information retrieval
It is necessary to clarify exactly what is meant by an Information
Retrieval (IR) system. An information retrieval system is designed to
analyse, process, and store sources of information and to retrieve those
that match a particular user's requirements. In other words, an IR system
calculates similarity scores between a query and a set of documents, and
ranks the relevant documents by those scores. There are two main issues in
IR systems. The first is that the characterization of the user's
information need is not always clear and must be transformed into a form
the IR system can understand, known as a query (a short document
containing a few words) (Hasan, 2009). The second is the structure of the
information: there are no standards or rules that control this structure,
especially on the World Wide Web (WWW), and each language has its own
characteristics and semantics. In addition, users need to find
high-quality information suited to their requirements, and response time
must be taken into account so that information is found quickly. These
issues point to a very important topic: Text Classification (TC).
Information retrieval (IR) is a branch of computer science. The main
objective of IR is to provide effective methods for satisfying information
needs. Information that satisfies an information need is called relevant.



Figure 1.1: IR system components (Alnobani, 2008)

Three essential components can be used to represent an IR system:
Input: the set of available documents and the requested information (the
query). The problem here is that all this information must be converted
into a form suitable for the computer to use.
The processor: the part of the retrieval system concerned with the
retrieval process. A retrieval algorithm is given a query, created by a
user, that represents their information need. In the case of text, this
query consists of a series of words, possibly along with a set of
relations between them. When all the inputs are ready, the process
compares the query against the documents to satisfy the user's request
exactly, or at least to retrieve the nearest results. The information to
be found resides in a collection, which consists of a set of documents.
Output: As a result of the requested information (the query), the
retrieval algorithm scores the documents in the collection, ranking them
according to some measure of how well the query terms and relations are
matched by information in the document. For text, the relations most often
used between terms are co-occurrence or proximity constraints. Traditional
relevance also relies on the frequency with which terms occur in a
document, and on how unusual the terms are in the
collection (Collins-Thompson and Adviser-Callan, 2008).
It is necessary to understand how the retrieval process works. There are
two main approaches to building a system that retrieves documents for a
given query: an ad-hoc method, and a more principled, model-based method.
Ad-hoc information retrieval: In ad-hoc retrieval the documents in the
collection remain relatively static while new queries are submitted to the
system. For example, when a query is compared to a set of documents, the
documents that contain the query terms are retrieved. To improve upon this
in an ad-hoc method, we could also decide to factor in the number of times
a term appears in a document. Ad-hoc information retrieval has several
benefits: an ad-hoc retrieval system is quick to build, and many ad-hoc
retrieval methods are also very fast, needing only to look at the
occurrence of query terms in the documents at query time. On the other
hand, the weak points are that no model is built at all, and every change
made to an ad-hoc retrieval system affects the retrieval in unpredictable
ways. Additionally, it is very hard to understand exactly what is
happening in these systems, and therefore what should be done to improve
them. For this reason, researchers prefer to build models for information
retrieval (Alnobani, 2008).
Information retrieval using a model: Unlike the ad-hoc case, here a
retrieval model is built. If the model is built correctly, it captures the
important aspects of a query and of the documents needed for retrieval.
Moreover, the benefit of building a model is the ability to understand and
control the system.

1.1.1.1. Information retrieval models
Several information retrieval models have been applied. The Boolean model
is one of the earliest IR systems; it offers clear formalism and
simplicity. However, a major problem with this kind of model is its binary
decision criterion, with no notion of a grading scale, together with the
difficulty of translating a query into Boolean expressions. In addition,
it is difficult to control the number of documents retrieved, because all
matched documents are returned. Another popular retrieval model is the
vector model. Its advantages include a term-weighting scheme that improves
retrieval performance by sorting documents according to their degree of
similarity to the query, and a partial-matching strategy that approximates
the query conditions. Despite these advantages, the vector model suffers
from several drawbacks, such as a lack of clean formalism. The third kind
of information retrieval model is the probabilistic model, one of the most
common frameworks for building principled information retrieval models.
Probabilistic models assume the existence of a query and then model the
generation of documents that are relevant and irrelevant to that query.
Given a probabilistic generative model for the corpus, a probabilistic
model must retrieve and rank documents. On the other hand, the
probabilistic model has weak points: an initial definition of the relevant
documents has to be assumed, and the weights ignore term frequency.
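The vector model's similarity ranking can be illustrated with a minimal sketch. Plain term-frequency weights and whitespace tokenization are simplifying assumptions here; a real system would typically use TF-IDF weighting:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (Counters)."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc = Counter("the vector model ranks documents by similarity".split())
query = Counter("vector similarity".split())
print(round(cosine(query, doc), 3))  # 0.535
```

Documents are then sorted by this score, which is exactly the "degree of similarity to the query" mentioned above.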
1.1.1.2. Evaluating information retrieval
Finally, the performance of any system is judged by the results it
achieves. The most common measures of system efficiency are time and
space: a shorter response time and a smaller space used indicate the
better system. In addition, effectiveness is a measure of the ability of
the system to retrieve relevant documents while at the same time holding
back non-relevant ones; it can be measured by recall and precision (recall
and precision are explained in more detail in chapter 3).
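The two effectiveness measures just mentioned can be computed directly from the sets of retrieved and relevant documents. This is a minimal sketch with made-up document identifiers:

```python
def precision_recall_f1(retrieved, relevant):
    """Effectiveness measures for one query; inputs are sets of doc ids."""
    hits = len(set(retrieved) & set(relevant))
    pr = hits / len(retrieved) if retrieved else 0.0   # fraction retrieved that is relevant
    re = hits / len(relevant) if relevant else 0.0     # fraction relevant that was retrieved
    f1 = 2 * pr * re / (pr + re) if pr + re else 0.0   # harmonic mean of the two
    return pr, re, f1

# 4 documents retrieved, 3 truly relevant, 2 of them found.
pr, re, f1 = precision_recall_f1({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
print(round(pr, 3), round(re, 3), round(f1, 3))  # 0.5 0.667 0.571
```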
1.1.2. Text classification
A TC system classifies documents into a fixed number of predefined
categories based on their content. Text classification may be either
single-label, where exactly one category must be assigned to each
document, or multi-label, where one or more categories can be assigned to
each document. The main objective of using TC is to make IR system results
better than they would be without TC (Ghwanmeh et al., 2007). These
advantages have led to the development of automatic text and document
classification systems, which are capable of automatically organizing and
classifying documents (Duwairi, 2007b).
The classification process can be done manually or automatically. It is
interesting to note that manual categorization is considered a difficult
and complex task, especially with huge amounts of information, because
documents are classified one by one by human experts, and the time needed
to complete this mission is considerable. On the other hand, with the
rapid growth of online text documents, automatic text categorization (TC)
has become an essential tool for handling text documents efficiently and
effectively (Wang et al.).
Text classification is the task of classifying a document under a
predefined category. More formally, if d_i is a document of the entire set
of documents D and (c_1, c_2, ..., c_n) is the set of all the categories,
then text classification assigns one category c_j to a document d_i. Given
the increasing amount of Arabic information on the Internet, classifying
documents manually is not practical; automatic text classification has
therefore become an essential task that saves the human effort of manual
classification. The optimal approach is thus automatic classification. TC
makes use of the science of grammar ( ), such as conjugation ( ) and
expression ( ), as well as thesauri and dictionaries. With these, the
system can understand the main topics in a document. This is done using
statistical methods to study the repetition of words within a document and
then determine the context, which helps the search operation.
There are three essential stages in a TC system: document indexing,
classifier learning, and classifier evaluation.
Document indexing: one of the most substantial issues in TC, which
includes document representation and a term-weighting scheme. The
bag-of-words model is the most common way to represent the content of
text. This approach is considered simple because it records only the
frequency of each word in a document. Moreover, for each predefined
category, the synonyms and prefix words for the category are found, which
helps to assign a document to that category based on the synonym or prefix
of a term. Some term-weighting schemes are described in detail in
chapter 3.
Classifier learning: several machine learning algorithms have been applied
to automatic text classification by supervised learning (Ko et al., 2004).
A supervised learning algorithm finds a representation or judgment rule
from an example set of labelled documents for each class (Ko and Seo,
2009). Examples include Naïve Bayes (NB) (Chen et al., 2009, Noaman et
al., Zhang and Gao), Support Vector Machine (SVM) (Moraes et al., Wang and
Chiang, Mesleh and Kanaan, 2008), Nearest Neighbour (k-NN) (Wan et al.,
Jiang et al.), Decision Trees (DT), Rocchio (Ko et al., 2004), and Voting,
etc.

Classifier evaluation: the effectiveness of each classifier is determined
according to the results it achieves. Standard evaluation measures such as
Recall, Precision and the F1-measure are used to evaluate the different
classifiers.
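The document-indexing stage described above can be sketched as follows. The bag-of-words representation records only word frequencies, ignoring order; the example text is illustrative:

```python
from collections import Counter

def bag_of_words(text):
    """Document indexing in its simplest form: count how often each
    (lowercased) word occurs, discarding word order entirely."""
    return Counter(text.lower().split())

bow = bag_of_words("The classifier assigns the document to a category")
print(bow["the"], bow["classifier"])  # 2 1
```

These counts are the raw input that term-weighting schemes (chapter 3) then rescale.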
1.1.2.1. Classification Based on Supervised Learning
The target of classification methods is to assign class labels to
unlabelled text documents from a fixed number of predefined categories.
Each document can belong to multiple categories, to exactly one, or to
none at all.
Supervised machine learning methods prescribe the input and output format.
The input to these methods is a set of objects (the training data), and
the output is the classes to which these objects belong.
The key advantage of supervised learning methods over unsupervised methods
is that, by having clear knowledge of the classes the different objects
belong to, these algorithms can perform effective feature selection when
that leads to better prediction accuracy.
Automatic text classification is treated as a supervised learning task.
The target of this task is to evaluate a Boolean function that determines
whether a given document belongs to a category or not by looking at the
synonyms or prefixes of that category (Deisy et al., 2010).
Sentiment classification can obviously be formulated as a supervised
learning problem with two class labels (positive and negative). The
training and testing data used in existing research are mostly product
reviews, which is not surprising given the above assumption. Since each
review at a typical review site already has a reviewer-assigned rating
(e.g., 1-5 stars), training and testing data are readily available.
Typically, a review with 4-5 stars is considered a positive review
(thumbs-up), and a review with 1-2 stars is considered a negative review
(thumbs-down).
Sentiment classification is similar to, but also different from, classic
topic-based text classification, which classifies documents into
predefined topic classes (politics, science, sports, etc.). In topic-based
classification, topic-related words are important, whereas in sentiment
classification they are not. Instead, sentiment or opinion words that
indicate positive or negative opinions are important (e.g., great,
excellent, amazing, horrible, bad, worst).
Existing supervised learning methods, such as naïve Bayes and support
vector machines (SVM), can be readily applied to sentiment classification
(Pang et al., 2002). This approach has been used to classify movie reviews
into two classes (positive and negative); it was shown that using unigrams
(a bag of individual words) as features performed well with both naïve
Bayes and SVM. Neutral reviews were not used in that work, making the
problem easier. The features here are data attributes in the
machine-learning sense, not the object features referred to in the
previous section.
Subsequent research has used many more kinds of features and techniques in
learning. As in most machine learning applications, the main task of
sentiment classification is to find a suitable set of features. Some
example features used in research, and possibly in practice, are the
following (Pang and Lee, 2008).
Terms and their frequency: These features are individual words or word
n-grams and their frequency counts. Sometimes word positions may also be
considered, and the TF-IDF weighting scheme from information retrieval may
be applied too. These features are also commonly used in traditional
topic-based text classification, and they have been shown to be quite
effective in sentiment classification as well.
Part-of-speech tags: Much early research found that adjectives are
important indicators of subjectivity and opinion. Therefore, adjectives
have been treated as special features.
Opinion words and phrases: Opinion words are words that are commonly used
to express positive or negative sentiments. For example, beautiful, wonderful,
good, and amazing are positive opinion words, and bad, poor, and terrible are
negative opinion words. Although many opinion words are adjectives and
adverbs, nouns (rubbish, junk, crap, etc.) and verbs (hate and like) can also
indicate opinions. In addition to opinion words, there are also opinion phrases
and idioms (cost someone an arm and a leg). Opinion words and phrases are
helpful to sentiment analysis.
Syntactic dependency: Word-dependency features generated from parsing or
dependency trees have also been tried by several researchers.
Negation: Clearly, negation words are important since their appearance
often changes the opinion orientation. For example, the sentence "I don't
like this camera" is negative. Negation words must be handled with care
because not all occurrences of such words mean negation. For example,
"not" in "not only ... but also" does not change the orientation.
Research has also predicted rating scores (Pang et al., 2002). In this
case, the problem is formulated as a regression problem, since the rating
scores are ordinal. Another investigated research direction is transfer
learning, or domain adaptation. As has been shown, sentiment
classification is highly sensitive to the domain from which the training
data are extracted: a classifier trained on opinionated texts from one
domain often performs poorly when it is applied or tested on opinionated
texts from another domain. The reason is that the words, and even the
language constructs, used to express opinions can differ substantially
between domains. Sometimes the same word is positive in one domain but
negative in another (Turney, 2002). For example, the adjective
"unpredictable" may have a negative orientation in a car review
("unpredictable steering"), but it could have a positive orientation in a
movie review ("unpredictable plot"). Therefore, domain adaptation is
needed. Existing research has used labelled data from one domain,
unlabelled data from the target domain, and general opinion words as
features for adaptation (Gamon and Aue, 2005).
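Since naïve Bayes with unigram features recurs throughout this discussion, a minimal multinomial naïve Bayes sketch may help. This is plain NB with Laplace smoothing, not the thesis's MMM variant, and the two-document training set is entirely made up:

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """Train multinomial naive Bayes from (tokens, label) pairs."""
    word_counts = defaultdict(Counter)   # per-class word frequencies
    class_counts = Counter()             # document counts per class (priors)
    vocab = set()
    for tokens, label in labeled_docs:
        word_counts[label].update(tokens)
        class_counts[label] += 1
        vocab.update(tokens)
    return word_counts, class_counts, vocab

def classify(tokens, word_counts, class_counts, vocab):
    """Pick the class with the highest log posterior (Laplace smoothing)."""
    total = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for c in class_counts:
        lp = math.log(class_counts[c] / total)           # log prior
        denom = sum(word_counts[c].values()) + len(vocab)
        for t in tokens:
            lp += math.log((word_counts[c][t] + 1) / denom)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

train = [
    (["great", "excellent", "film"], "pos"),
    (["bad", "boring", "plot"], "neg"),
]
model = train_nb(train)
print(classify(["great", "excellent", "plot"], *model))  # pos
```

Swapping in a mixture of multinomials per class, as in the MMM, changes only how the per-class word probabilities are estimated.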
1.1.2.2. Classification Based on Unsupervised Learning
Opinion words and phrases are the dominant indicators for sentiment
classification, so using unsupervised learning based on such words and
phrases is quite natural. The method used in (Turney, 2002) performs
classification based on some fixed syntactic phrases that are likely to be
used to express opinions. The algorithm consists of three steps:
Step 1: It extracts phrases containing adjectives or adverbs, because
research has shown that adjectives and adverbs are good indicators of
subjectivity and opinion. Although an isolated adjective may indicate
subjectivity, there may be insufficient context to determine its opinion
orientation. Thus, the algorithm extracts two consecutive words, where one
member of the pair is an adjective or adverb and the other is a context
word.
For example: in the sentence "This camera produces beautiful pictures",
the phrase "beautiful pictures" will be extracted, as it satisfies the
first pattern.
Step 2: It estimates the orientation of the extracted phrases using the
Pointwise Mutual Information (PMI) measure given in equation (1.1).

PMI(term1, term2) = log2 [ P(term1 & term2) / ( P(term1) · P(term2) ) ]        (1.1)


P(term1 & term2) is the probability that term1 and term2 co-occur, and
P(term1) · P(term2) is the probability that the two terms would co-occur
if they were statistically independent. The ratio between P(term1 & term2)
and P(term1) · P(term2) is a measure of the degree of statistical
dependence between them. The log of this ratio is the amount of
information that we acquire about the presence of one of the words when
the other is observed.
The semantic orientation (SO) of a phrase is computed based on its
association with the positive reference word "excellent" and its
association with the negative reference word "poor":

SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor")        (1.2)

The probabilities are calculated by issuing queries to a search engine and
collecting the number of hits. For each search query, a search engine
gives the number of documents relevant to the query, i.e., the number of
hits. Thus, by searching for the two terms together and separately, we can
estimate the probabilities in equation (1.1). (Turney, 2002) used the
AltaVista search engine because it has a NEAR operator, which constrains
the search to documents that contain the words within ten words of one
another, in either order. Let hits(query) be the number of hits returned.
Equation (1.2) can then be rewritten as:

SO(phrase) = log2 [ hits(phrase NEAR "excellent") · hits("poor") / ( hits(phrase NEAR "poor") · hits("excellent") ) ]        (1.3)


Step 3: Given a review, the algorithm computes the average SO of all
phrases in the review, and classifies the review as recommended if the
average SO is positive, and as not recommended otherwise.
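Equation (1.3) can be sketched directly from hit counts. The additive 0.01 smoothing to avoid division by zero follows Turney (2002); the hit counts below are made-up illustrations, not real search-engine results:

```python
import math

def semantic_orientation(hits_near_exc, hits_near_poor, hits_exc, hits_poor):
    """Equation (1.3): semantic orientation of a phrase from search-engine
    hit counts, with 0.01 smoothing to avoid division by zero."""
    num = (hits_near_exc + 0.01) * (hits_poor + 0.01)
    den = (hits_near_poor + 0.01) * (hits_exc + 0.01)
    return math.log2(num / den)

# A phrase seen 2000 times near "excellent" but only 50 times near "poor"
# (with both reference words equally common) gets a positive orientation.
so = semantic_orientation(2000, 50, 1_000_000, 1_000_000)
print(so > 0)  # True
```

Averaging this value over all extracted phrases gives the review-level decision of Step 3.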
In this project, text classification with three different classifiers,
KNN, Rocchio and Naïve Bayes using MMMs, is compared. A TC system for the
Arabic language is not considered an easy task compared with English,
because Arabic has very complex morphology (Albalooshi et al.). Moreover,
testing is the most important stage in any IR system: it is used to
determine the efficiency of the system and helps to establish which system
is better than another. The major goal of an IR system is to retrieve all
the documents that are relevant to a user query while retrieving as few
non-relevant documents as possible. The evaluation can be achieved by the
Recall and Precision measures.
1.1.3. Arabic language
Applying text classification systems to the Arabic language is a
challenging task because Arabic has very complex morphology (Albalooshi et
al.). The Arabic alphabet consists of 28 letters.
The characters ( ) are called vowels and the rest of the letters are
consonants. Arabic letters can be written in different forms, depending on
their position in the word (beginning, middle, or end). For example, the
letter ( ) has several shapes: ( ) if it appears at the beginning, as in
( ) (which means "road" in English); ( ) if it appears in the middle, as
in ( ) (which means "surface"); and ( ) if it appears at the end, as in
( ) (which means "rubber"). Furthermore, the Arabic language contains
diacritics ( ), which are placed above or below the letters. The
diacritics (fathah, kasrah, dammah, sukun, double fathah, double kasrah,
double dammah and shaddah) are used to clarify the meaning of words
(Duwairi, 2007c). On top of that, when diacritics are not written, an
Arabic text can have several meanings; this ambiguity negatively affects
text classification. To avoid these problems, pre-processing can be
applied to the Arabic language.
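One such pre-processing step, diacritic removal, can be sketched in a few lines; this assumes Unicode input and is an illustration, not the thesis's actual pipeline. Arabic diacritics are combining marks, so dropping combining characters collapses differently vocalized spellings to one form:

```python
import unicodedata

def strip_diacritics(text):
    """Drop combining marks; for Arabic this removes the diacritics
    (fathah, kasrah, dammah, sukun, tanwin, shaddah) while leaving
    the base letters intact."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))

print(strip_diacritics("كَتَبَ"))  # كتب  ('he wrote', diacritics removed)
```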
The Arabic language has more complex morphology than English. It is
written from right to left. Arabic words have two genders, feminine and
masculine; three numbers, singular, dual, and plural; and three
grammatical cases, nominative, accusative, and genitive. A noun takes the
nominative case when it is the subject, the accusative when it is the
object of a verb, and the genitive when it is the object of a preposition.
In addition, Arabic sentences are divided into three parts of speech:
noun, verb, and particle. Noun and verb stems are derived from a few
thousand roots by infixing, creating, for example, words like (computer),
(he calculates), and (we calculate) from the same root (Duwairi, 2006).
A noun is a name or a word that describes a person, a thing, or an idea.
Arabic verbs, like English verbs, are classified into perfect and
imperfect. The perfect tense denotes completed actions, while the
imperfect denotes incomplete actions. The imperfect tense has four moods:
indicative, subjunctive, jussive, and imperative (Abboud and McCarus,
1983).
Arabic particles include prepositions, adverbs, conjunctions,
interrogative particles, exceptions, and interjections.

Most Arabic words are derived from the pattern ( ); all words following
the same pattern have common properties and states. For example, the
pattern ( ) indicates the subject of the verb, while the pattern ( )
represents the object of the verb.
An Arabic adjective can also have many variants. When an adjective
modifies a noun in a phrase, the adjective agrees with the noun in gender,
number, case, and definiteness. An adjective has a masculine singular form
such as (new), a feminine singular form such as (new), a masculine plural
form such as (new), and a feminine plural form such as (new) (Chen and
Gey, 2002).
In addition to the different forms of an Arabic word that result from the
derivational process, most connectors, conjunctions, prepositions,
pronouns, and possessive forms are attached to the Arabic surface form as
prefixes and suffixes. For instance, definite nouns are formed by
attaching the article ( ) ("the") to the immediate front of the noun. The
conjunction ( ) ("and") is often attached to the following word. The
letters ( , , , ) can be added to the front of a word as prepositions. The
suffix ( ) is attached to mark the feminine gender of a word. Some
suffixes are also added to represent possessive pronouns, such as ( ) for
"her", ( ) for "my", and ( ) for "their" (Chen and Gey, 2002, Zrigui et
al., 2012).
In addition, Arabic has two kinds of plurals: sound plurals and broken
plurals. Sound plurals are formed by adding plural suffixes to singular
nouns: the suffix ( ) for feminine nouns in all three grammatical cases,
( ) for masculine nouns in the nominative case, and ( ) for masculine
nouns in the genitive and accusative cases. The formation of broken
plurals is more complex and often irregular, and therefore difficult to
predict; furthermore, broken plurals are very common in Arabic. For
example, the plural form of the noun ( ) (child) is ( ) (children), which
is formed by attaching a prefix and inserting an infix. The plural form of
the noun ( ) (book) is ( ) (books), which is formed by deleting an infix.
The plural form of ( ) (woman) is ( ) (women); here the plural form is
completely different from the singular form (Chen and Gey, 2002).
1.2. The statement of the problem
IR systems are widely used to help users discover useful information on
the Internet. Current IR systems are based on the similarity and term
frequency between a query (the user's requirement) and the information
available on the Internet. However, IR ignores important semantic
relationships between them; this makes the search operation slow and
wastes a lot of time. In addition, the retrieved documents may not be
useful, and a major problem arises when a word has a double meaning. To
overcome this problem, text categorization (classification) is a solution.
Text classification techniques have been applied relatively little to the
Arabic language compared with other languages (Al-Harbi et al., 2008).
Unfortunately, there is no perfect technique for classifying text, so
researchers have been encouraged to develop TC techniques using many
different models and methods.
In this project the multinomial mixture model (MMM) has been suggested and
applied to classify Arabic documents. In addition, this experiment is
compared with other classifiers in order to clarify which model performs
better than the others.




1.3. Thesis Objective
Arabic text is completely different from English text and has a complex
morphology. In this thesis, the multinomial mixture model (MMM) is
recommended and applied to classify Arabic documents. Moreover, three other
techniques are examined on Arabic text: the Rocchio algorithm, traditional
k-NN and naïve Bayes.
The text classification system with these techniques is evaluated using the
standard measures recall, precision and F-measure, and the effectiveness of
each classifier is judged according to the results achieved. Finally, the
results of the MMM are compared with those of the other algorithms to
determine the best information retrieval system for the Arabic language.
1.4. Summary
This chapter has given a short introduction to information retrieval (IR)
systems. It also focused on text categorization (TC) and described the most
important tasks of a text categorization system. After the short introduction,
some interesting text categorization systems and the Arabic language were
described briefly. Moreover, the thesis's problem was presented. Finally, the
multinomial mixture model was adopted as the thesis objective.




















2. Chapter two: Literature Review













2.1. Literature Review
Text classification is defined as assigning new documents to a set of
pre-defined categories based on classification patterns (Al Zamil and Can,
Uğuz). In recent years, there has been an increasing amount of literature on
the TC topic, and researchers have shown an increased interest in continuing
and developing this line of research on the basis of previous work.
2.1.1. Text classification
Text classification techniques have been investigated and used in many
application areas, and many researchers have studied text classification
using different techniques.
The study (Guiying et al.) presented a review of the key techniques used in
building a text classification system, including text models, feature
selection methods and text classification algorithms. In addition, a text
classification system based on Mutual Information, the K-Nearest Neighbour
algorithm and the Support Vector Machine was implemented. The data set was
created from the famous Reuters-21578 text classification collection. The
experimental results showed a classification accuracy of 91.1%, which was
reported to be better than with no feature selection and to improve the
classification rate. Moreover, the SVM classifier achieved higher performance
than the KNN classifier.
In (Zhang and Gao), a new feature selection method (Auxiliary Feature) was
applied, and the improvement in the performance of Naive Bayes for text
classification was demonstrated. In the proposed method, features are first
determined by an existing feature selection method, and then an auxiliary
feature that can re-partition the text space with respect to the chosen
features is selected. To evaluate the experiment, a data set of 30,000 junk
mails and 10,000 normal mails from CCERT was chosen. The results of this
study show that the proposed method indeed improves the performance of the
naive Bayes classifier.
Feature sub-set selection (FSS) is an important step for effective text
classification (TC) systems, since it may have a great effect on the accuracy
of the classifier (Karabulut et al., Mesleh). There are many valuable studies
that investigate FSS metrics for English TC tasks, using different classifiers
and many TC corpora (Uğuz, Al-Ani et al., Khushaba et al.). For Arabic TC
tasks, there are some works that handle the FSS problem. One empirical
comparison studied seventeen FSS metrics (Chi, Gss, Gsss, Ngl, Or, Mi, Ig,
Bns, Df, Pwr, Acc, Acc2, F1, Pr, Re, Fo and Er) for an Arabic TC task using
an SVM classifier; the evaluation used an Arabic corpus of 7842 documents
independently classified into ten categories. The experimental results proved
that the Chi-square and Fallout FSS metrics work best for Arabic TC tasks
(Mesleh).
Another study addressed the FSS problem, using two-stage feature selection and
feature extraction to improve the performance of text categorization
(Uğuz, 2011).
An improved KNN algorithm for text classification was proposed, which builds
the classification model by combining a constrained one-pass clustering
algorithm with KNN text categorization. Although KNN is a simple and effective
method for text classification, it has three drawbacks: first, the complexity
of its sample-similarity computation is huge; second, its performance is
easily affected by single training samples; third, KNN is considered a lazy
learner because it does not build a classification model. To overcome these
drawbacks, the improved KNN algorithm was implemented, using the Vector Space
Model (VSM) to represent the documents. The results show that the resulting
INNTC classifier is much more effective and efficient than KNN (Jiang
et al.).
In (Zhang et al., 2013), a novel projected-prototype-based classifier for text
classification was implemented. The basic idea behind the algorithm is that
each document category is modelled by a set of prototypes together with their
individual term subspaces. The classifier was tested on two English data sets
and its performance was compared with five other classifiers: SVM,
three-prototype, KNN, KNN-model and centroid classifiers. The experimental
results show that the projected-prototype-based classifier achieved higher
classification accuracy at a lower computational cost than the traditional
prototype-based classifier, especially for data that includes interfering
documents.
2.1.2. Arabic Text Classification
The studies carried out on Arabic text classification are very few compared
to those on other languages (such as English), because the Arabic language
has an extremely rich morphology and a complex orthography. However, some
related work has been proposed to classify Arabic documents:
(Duwairi, 2007a) implemented three classifiers for Arabic TC: KNN, NB and a
distance-based classifier. In the distance-based and KNN classifiers, every
category was represented as a vector of keywords, while with NB the vectors
were bags of words. The Dice measure was used to calculate similarity. The
accuracy of the classifiers was tested using an Arabic text corpus collected
from online magazines and newspapers. According to the results, the NB
classifier does better than the other two classifiers.

In 2008 (Mesleh and Kanaan, 2008), the SVM algorithm was implemented for
Arabic text classification. The paper pointed out that the SVM classifier
achieved better results than other classifiers such as Naïve Bayes and KNN.
In addition, light stemming for Arabic TC tasks was evaluated with the SVM
classifier; as a result, light stemming did not enhance the performance of
the Arabic SVM text classifier. On the other hand, Feature Subset Selection
(FSS) was implemented and did improve the performance of the Arabic SVM text
classifier: all the feature subset selection methods (Chi-square, GSS, NGL,
OR, IG and MI) achieved better recall and a better F1 measure, with the best
results achieved by two of them (Chi-square and NGL). Finally, a new
Ant-Colony-based FSS algorithm (ACO) was applied and achieved the greatest
TC effectiveness of the six FSS methods.
The main objective was to compare automatic text classification using the
kNN, Rocchio and NB classifiers on the Arabic language (Kanaan et al.,
2009a). The system was tested using a corpus of 1445 Arabic text documents.
Two models were used: the first was the vector space model, used to implement
the kNN and Rocchio classifiers, in which each document is represented as a
vector of terms; the second was probabilistic, used to execute the NB
classifier. In the probabilistic model, the probability of a document
belonging to each class is calculated, and the document is assigned to the
class with maximum probability. The experiments showed that Naïve Bayes is
the best performer, followed by kNN and Rocchio.

The paper (McCallum and Nigam, 1998a) reported a comparison between two
probabilistic classifiers. The researchers found that the multinomial model
gave better results than the multivariate Bernoulli model at large vocabulary
sizes; in contrast, when the vocabulary size is smaller, the multivariate
Bernoulli model outperforms the multinomial model. The results were tested on
five real-world corpora, and the evaluation of their experiments proved that
the multinomial model reduced error by an average of 27%, and sometimes by
more than 50%.

(Ueda and Saito, 2002) implemented probabilistic generative models called
parametric mixture models (PMMs). The main goal of PMMs is to handle
multiclass, multi-labelled text categorization problems. PMMs achieved good
results compared with binary classification, because they can simultaneously
detect multiple categories of a text instead of depending on binary
judgments. The PMM approach was applied to World Wide Web pages and its
efficiency was demonstrated.
(Zhong and Ghosh, 2003) presented a comparative study of generative models
for document clustering that used the multinomial model. This model was
compared with two other probabilistic models, the multivariate Bernoulli and
the von Mises-Fisher (vMF) model, on clustering tasks. The Bernoulli model
was the worst for text clustering, while the vMF model produced better
clustering results than both the Bernoulli and multinomial models.

As mentioned in (Li and Zhang, 2008), a novel mixture-model method for text
clustering was named the multinomial mixture model with feature selection
(M3FS). The M3FS method uses the MMM instead of Gaussian mixtures to improve
text clustering tasks. Prior studies noted that, with no labels available in
unsupervised text clustering, feature selection is a hard problem; M3FS was
proposed for text clustering in order to overcome it. The results demonstrate
that the M3FS method has good clustering performance and feature selection
capability.

(Bouguila et al., 2012) discussed two problems: first, many irrelevant
features may affect the speed and compromise the accuracy of the learning
algorithm; second, the presence of outliers affects the parameters of the
resulting models. For this reason, the researchers suggested an algorithm
that partitions a given data set without a priori information about the
number of clusters, together with a novel statistical mixture model, based on
the Gamma distribution, which makes explicit what data or features have to be
ignored and what information has to be retained. The performance of this
finite-mixture-model method was evaluated on different applications,
including data analysis, real data and object-shape clustering. The
experiments proved that this approach has excellent modelling capabilities
and that feature selection combined with outlier detection significantly
influences the clustering performance.
(McCallum and Nigam, 1998b, Lewis, 1998) discussed the history of naive Bayes
in information retrieval and presented a theoretical comparison of the
multinomial and the multivariate Bernoulli models (the latter also called the
binary independence model).
Compared to Indo-European languages (like English), the Arabic language has
an extremely rich morphology and a complex orthography. This is one of the
main reasons (El-Halees, 2007, Duwairi, 2006, MESLEH, 2007) behind the lack
of research in the field of Arabic text classification. Nevertheless, many
machine learning approaches have been proposed to classify Arabic documents:
the Support Vector Machine (SVM) classifier with the Chi-square feature
extraction method (MESLEH, 2007), the Naïve Bayesian method (El-Kourdi et
al., 2004), k-Nearest Neighbours (Al-Shalabi et al., 2006), distance-based
classifiers, and the Rocchio algorithm (Syiam et al., 2006).
Sawaf, Zaplo and Ney (Sawaf et al., 2001) used the maximum entropy method for
Arabic document clustering. Initially, documents were randomly assigned to
clusters; in subsequent iterations, documents were shifted from one cluster
to another if an improvement was gained, and the algorithm terminated when no
further improvement could be achieved. Their text classification method is
based on unsupervised learning.
El-Kourdi, Bensaid and Rachidi (2004) used a Naïve Bayesian classifier to
classify an in-house collection of Arabic documents. They concluded that
there is some indication that the performance of the Naïve Bayesian algorithm
in classifying Arabic documents is not sensitive to the Arabic root
extraction algorithm. In addition to their own root extraction algorithm,
they used other root extraction algorithms such as those suggested by
(Baeza-Yates and Ribeiro-Neto, 1999, Al-Shalabi and Evens, 1998).
Duwairi (Duwairi, 2006) proposed a distance-based classifier for Arabic TC
tasks, where the Dice measure was used as the similarity measure. In this
work, each category was represented as a vector of words. In the training
phase, the text classifier scanned the training documents to extract the
features that best capture the inherent category-specific properties.
Documents were then classified on the basis of their closeness to these
feature vectors.
El-Halees (El-Halees, 2007) implemented a maximum-entropy-based classifier to
classify Arabic documents. Compared with other text classification systems
(such as those of El-Kourdi et al. and Sawaf et al.), the overall performance
of the system was good (in the comparisons, El-Halees used the results as
recorded in the published papers mentioned above).
Hmeidi, Hawashin and El-Qawasmeh (Hmeidi et al., 2008) reported a comparative
study of SVM and K-Nearest Neighbour (KNN) classifiers on Arabic text
classification tasks. They concluded that the SVM classifier shows a better
micro-averaged F1-measure.
Al-Saleem (Alsaleem, 2011) proposed automated Arabic text classification
using the SVM and NB classification methods. These methods were investigated
on different Arabic datasets, and several text evaluation measures were used.
The experimental results on the different Arabic text categorization datasets
showed that the SVM algorithm outperforms NB with regard to all measures
(recall, precision and F-measure): the F-measure of SVM was 77.8%, versus
74% for NB.

Al-Diabat et al. (Al-diabat, 2012) investigated the problem of Arabic Text
Classification (ATC) using rule-based classification approaches. The
performance of different classification approaches that produce simple
"IF-Then" knowledge was evaluated to find the most appropriate one for the
ATC problem. Four rule-based classification algorithms were investigated:
One Rule, rule induction (RIPPER), decision tree (C4.5), and a hybrid (PART).
An Arabic data collection of 1526 text documents belonging to 6 categories
was used. The results showed that the hybrid PART approach outperforms the
rest of the algorithms, with an average precision of 61.9% and an average
recall of 62.3%.
In 2012, Wahbeh et al. (Al-Kabi et al., 2012) compared three text
classification techniques: SVM, NB and C4.5. A set of Arabic text documents
collected from different websites and covering four categories was used, and
the WEKA toolkit was used for running the classifiers. Word representation
was used to represent the documents.
(Al-Radaideh, Al-Shawakfa, Ghareb, & Abu-Salem, 2013) proposed an approach to
ATC using Association Rule Mining; their approach facilitates the discovery
of association rules for building a classification model for ATC. Three
classification methods that use association rules were applied: ordered
decision list, weighted rules, and majority voting. The experimental results
showed that the majority voting method gave better results than the other
methods.
In (Goudjil et al., 2013, Settles, 2010), a novel batch-mode active learning
approach using SVM for Arabic text classification was presented; few studies
had been done in this area for Arabic text. The purpose of applying active
learning is to reduce the amount of data needed for the training phase: the
cost of manually annotating the data becomes lower, and the learning process
can be sped up, since the active method is allowed to choose the data from
which it learns.
Since feature selection is a key factor in the accuracy and effectiveness of
the resulting classification, the authors of (Chantar and Corne, 2011)
applied Binary Particle Swarm Optimisation (BPSO) as the feature selection
method for Arabic text classification. The aim of applying BPSO/KNN is to
find a good subset of features to facilitate the task of Arabic text
categorization. SVM, Naïve Bayes and the C4.5 decision tree were applied as
classification algorithms. The suggested method was effective and achieved
satisfactory classification accuracy.
Particle swarm optimization has also been used to achieve excellent feature
selection in (Al-Saleem, 2010).
In 2013 (Abuaiadah, 2013), it was reported that multiword features were
implemented to improve Arabic information retrieval. Multiword features are
represented as combinations of words appearing within windows of varying
size. Multiword features were applied with two similarity functions, the Dice
similarity function and the cosine similarity function, to improve the
outcome of Arabic text classification. According to the results achieved, the
Dice function performs better than the cosine function; with the Dice
similarity function, the frequencies of the features in a document are
ignored and only their existence is recognized.
In (Alsaleem, 2011), the investigator concentrated on single-label assignment
only. The goal of this paper was to present and compare results obtained on a
Saudi newspaper Arabic text collection using the SVM and NB algorithms. The
experiments show that the SVM classifier achieved better results than the NB
classifier.
In (Zrigui et al., 2012), Latent Dirichlet Allocation (LDA) was proposed as a
text feature and used to index and represent Arabic texts. The main idea
behind LDA is that documents are represented as random mixtures over latent
topics, where each topic is described by a distribution over words. SVM was
used for the classification task. The LDA-SVM algorithm achieved high
effectiveness for Arabic text classification, exceeding SVM without LDA,
Naïve Bayes and KNN classifiers.
2.2. Summary
In chapter 2, different text classification algorithms were described: some
works on text classification in general, and others specifically on the
Arabic language. Finally, some papers related to the multinomial mixture
model were presented.





























3. Chapter three: Methodology














3.1. Introduction
There are many approaches that can be used in text classification. Here, KNN,
Rocchio and Naïve Bayes using the MMM model have been implemented, and these
algorithms have been applied to the same datasets.
The main aim of applying TC to the Arabic language is to improve on the
performance of information retrieval without TC. Several steps are needed to
implement the TC task; all phases are explained in section 3.2.
An IR process starts with the submission of a query, which describes a user's
topic, and finishes with a set of ranked results estimated by the IR system's
ranking scheme to be the most relevant to the query (Ajayi et al.).
Recall and precision are well-known measures that can be used to evaluate any
IR system; the efficiency of the system can be determined using these
measures.
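As a sketch, these measures can be computed from counts of true positives
(tp), false positives (fp) and false negatives (fn); the counts used in the
example below are illustrative, not results from this thesis:

```python
def precision(tp, fp):
    # fraction of retrieved documents that are actually relevant
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    # fraction of relevant documents that were actually retrieved
    return tp / (tp + fn) if tp + fn else 0.0

def f_measure(pr, re):
    # F-measure (F1): harmonic mean of precision and recall
    return 2 * pr * re / (pr + re) if pr + re else 0.0

# illustrative counts: 8 correct retrievals, 2 wrong, 4 missed
pr = precision(tp=8, fp=2)   # 0.8
re = recall(tp=8, fn=4)      # 2/3
f1 = f_measure(pr, re)
```

The same three functions are reused when evaluating the classifiers in the
later chapters, applied per category and then averaged.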
This chapter is divided into three main sections. Section 3.1 gives an
overview of the project, section 3.2 presents the main text classification
system architecture, and section 3.3 gives a short summary of the chapter.
3.2. System Architecture
The text classification technique is implemented by passing through several
phases, which execute sequentially to facilitate the TC task. Uncategorized
documents are pre-processed by removing punctuation marks and stopwords.
Every document is then represented either as a vector of words only, or as a
vector of words together with their frequencies and the number of documents
in which these words appear (inverse document frequency). Stemming is used to
decrease the dimensionality of the document feature vectors. The accuracy of
the classifier is computed using recall, precision and F-measure (Duwairi,
2007a). A brilliant review of the text classification area is given by
Sebastiani (Sebastiani, 2002). Figure 3.1 below shows the main steps of the
classification system.

Figure 3.1 Overview of the classification process (Ikonomakis et al., 2005)

3.2.1. Corpus
The accuracy of the classifiers was tested using a corpus of 1445 documents,
divided into nine categories: Computer, Economics, Education, Sport,
Politics, Engineering, Medicine, Law, and Religion. Some of the documents are
used for training the classifiers and the rest for testing them. The testing
set contains the input documents that need to be classified; the training set
is a set of documents tagged with the correct classes. The corpus and
categories are shown and explained in more detail in chapter 4.
3.2.2. Pre-processing
Pre-processing can be defined as the process of filtering out words that may
not give any meaning to a text and may not be useful in information retrieval
systems; these words are called stopwords (Al-Maimani et al.). The purpose of
applying pre-processing is to transform documents into a representation
suitable for the classification task. In addition, it reduces the amount of
information, which may make the search operation faster. The pre-processing
is done as follows:
-Documents in formats such as HTML, SGML and XML are converted to plain text
format.
-Digits and punctuation marks are removed from each document.
Tokenization: Tokenization divides the document into a set of tokens (words).
Stopword removal: There are two kinds of terms in any document. The first
kind, called stopwords, occur commonly in all documents and may not give any
meaning to the document; the second kind can be described as keywords or
features. Stopwords (such as punctuation marks, formatting tags, prepositions,
pronouns, conjunctions and auxiliary verbs) are removed to reduce the text
size and save processing time. Removing these high-frequency words is
essential because they may cause documents to be misclassified (Uğuz).
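The tokenization and stopword-removal steps above can be sketched as follows;
the regular expression and the tiny stopword list are illustrative
assumptions, not the exact resources used in this thesis:

```python
import re

# tiny sample stopword list (real Arabic stopword lists are much larger)
STOPWORDS = {"في", "من", "على", "و", "أن", "إلى"}

def tokenize(text):
    # keep runs of letters only; digits, punctuation and symbols are dropped,
    # which also covers the digit/punctuation removal step above
    return re.findall(r"[^\W\d_]+", text)

def remove_stopwords(tokens):
    # drop any token that appears in the stopword list
    return [t for t in tokens if t not in STOPWORDS]
```

For example, `remove_stopwords(tokenize(text))` yields the keyword tokens
that proceed to normalization and stemming.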
Normalization: This essential phase merges the many words that have the same
meaning but are written in different forms. Arabic exhibits a very common
problem in which a single word may be written in many forms, such as
- - (which mean "start" in English). Table 3.1 shows some letters that are
normalized:


(hamza), (hamza on the top of alef), (hamza under alef), (mad on the top of
alef), (hamza on waw), (hamza on ya): all these letters are normalized to the
letter (alef).
() is normalized to ().
Final () is replaced with ().
Table 3.1: Arabic character normalization
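As a sketch, the normalizations of Table 3.1 can be implemented with a simple
character map. The exact mapping set below is an assumption based on common
Arabic IR practice (hamza/alef variants to bare alef, ta marbuta to ha, final
alef maqsura to ya), since parts of the table are not reproducible in this
copy:

```python
# assumed normalization map; adjust to match Table 3.1 exactly
NORMALIZE_MAP = str.maketrans({
    "أ": "ا", "إ": "ا", "آ": "ا",  # hamza/mad on alef -> bare alef
    "ؤ": "ا", "ئ": "ا",            # hamza on waw / ya -> bare alef
    "ة": "ه",                       # ta marbuta -> ha
    "ى": "ي",                       # final alef maqsura -> ya
})

def normalize(token):
    # apply the character-level normalization to one token
    return token.translate(NORMALIZE_MAP)
```

Applying `normalize` to every token after stopword removal collapses the
spelling variants into a single surface form.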
Stemming: Stemming is another common pre-processing step. The stemming phase
in Arabic is more complex than in English, since Arabic words have forms for
gender and for singular, dual and plural. Decreasing the size of the initial
feature set by conflating misspellings and words that share the same stem is
necessary to enhance the performance of an information retrieval system. In
this way, terms that share the same root but appear as different words
because of their affixes can be identified. For example, computer (),
computing (), computation () and computes () all share the same "compute"
root (Porter, 2006). There are several different approaches to performing
stemming: the root-based stemmer, the light stemmer, and the statistical
stemmer.
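A minimal light-stemming sketch is shown below; the prefix and suffix lists
are illustrative assumptions, not the stemmer actually used in this thesis:

```python
# illustrative affix lists; a real light stemmer uses longer, ordered lists
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و"]
SUFFIXES = ["ات", "ون", "ين", "ها", "ية", "ة"]

def light_stem(word):
    # strip at most one known prefix, keeping at least 3 letters
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    # strip at most one known suffix, keeping at least 3 letters
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word
```

Unlike a root-based stemmer, this only peels surface affixes, so broken
plurals are left untouched; that trade-off is exactly why the two approaches
behave differently on Arabic TC tasks.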
Indexing:
Document indexing is one of the most substantial issues in TC; it includes
document representation and a term weighting scheme. The bag of words is the
most common way to represent the content of a text. This approach is
considered simple because it records only the frequency of each word in a
document. Moreover, for all the predefined categories, the synonyms and
prefix words of the category are found, which helps to assign any document
to that category based on the synonym or prefix of a term.
Several measures have been applied to calculate term weights:
Term Frequency (TF): the simplest measure to weight each term in a text. The
drawback of TF is that it considers only term occurrence within a text;
according to the results achieved, this improves recall but does not improve
precision.
Inverse Document Frequency (IDF): the other popular weighting measure. The
main idea of IDF is to concentrate on terms that rarely occur in a collection
of texts, which improves precision without enhancing recall.
TF.IDF: since term weights affect text classification performance, TF.IDF
combines the two weighting measures TF and IDF to enhance both recall and
precision, and hence the text classification results. On the other hand, with
TF.IDF, when a new document arrives, the weighting factors of all documents
must be recalculated, since they depend on the number of documents.
Other papers have indicated further weighting measures:
Weighted Inverse Document Frequency (WIDF): to overcome the TF.IDF drawback,
WIDF weights each term so that its weights sum to one over the collection of
texts; as a result, WIDF improves both precision and recall. A disadvantage
of WIDF is that when the number of documents becomes huge, terms with nearly
equal frequencies receive almost equal weights, which makes the learning task
more difficult (El-Halees, 2007).
A new method, the Modified Inverse Document Frequency (MIDF), has been
proposed (Deisy et al., 2010). The key idea of MIDF is that it depends on
term frequency and document frequency but not on the number of documents in
the collection. MIDF is the term frequency normalized over the collection,
which provides the correct terms for learning; MIDF performs better than
existing term weighting schemes such as TF.IDF and WIDF.
Indexing a document is the method of characterizing its content so as to make
subsequent retrieval from document storage easy. The index terms of
information retrieval systems are word stems automatically derived from a
document and weighted according to their distribution in a document
collection. Automatic indexing is the process of producing the descriptors
(index terms) of a text automatically (Lahtinen, 2000). Automatically
indexing an information source saves time, since most of the precise human
effort can be performed by a machine. In the indexing approach, the order of
terms in the vector is ignored. In information retrieval systems, index terms
are usually weighted according to their importance for describing documents,
and typically the weighting schemes are based on detecting word frequencies
across the document collection (Obaseki).
The vector of words, also called a vector of weighted terms, consists of all
distinct terms that appear in all training documents. It holds the term
frequency, which measures the number of times term i appears in document j,
and the Inverse Document Frequency (IDF), which is based on the number of
documents in the collection in which term i appears.
Term Frequency-Inverse Document Frequency (TF-IDF) has been used in this
work as one of the most popular weighting schemes. It considers not only term
frequencies in a document, but also the frequencies of a term in the entire
collection of documents (Moraes et al.). The classic TF-IDF assigns to term i
a weight in document j as:

TFIDF(i, j) = TF(i, j) · IDF(i)          (3.1)
Thus, TF-IDF weighting assigns a high degree of importance to terms that
occur frequently in only a few documents of a collection. The Inverse
Document Frequency for term Ti is calculated as follows:

IDF(i) = log( N / DF(i) )          (3.2)

where DF(i), the document frequency of term Ti, is the number of documents in
which Ti occurs, and N is the total number of documents in the collection.
Automatic indexing typically relies on word frequencies: if a word occurs
frequently in a document but does not occur in many other documents, it is
probably an appropriate document descriptor and should be weighted highly by
the indexer.
Feature selection:
Feature sub-set selection (FSS) is one of the important pre-processing steps
of machine learning and an essential task in text classification. Feature
selection methods study how to choose a subset of attributes that are used to
construct models describing the data (Khushaba et al.). Many FSS methods have
been applied to Arabic text (Al-Ani et al., Mesleh and Kanaan, 2008).
According to previous related work, the FSS approach provides several
advantages for a text classification system: it is very effective in reducing
dimensionality, removing irrelevant and redundant terms from documents, and
decreasing computational complexity. In addition, FSS increases learning
accuracy and improves classification efficiency and scalability by making the
classifier simpler and faster to build. On the other hand, FSS may decrease
the classifier's accuracy (Mesleh, Singh et al., Khushaba et al., Al-Ani
et al.).
Since the number of features in a text classification task is huge and
redundant, it is important to examine how to select the features that achieve
better efficiency than others.
Many FSS algorithms have been tested and compared in text classification
systems. For example, Chi-square and Fallout achieved satisfactory results in
Arabic TC tasks, and Ant Colony Optimization (ACO), an optimization algorithm
derived from the study of real ant colonies, is one of the promising
approaches to better feature selection.
To classify a new document, it is pre-processed by removing punctuation marks
and stopwords, and the roots of the remaining keywords are extracted. The
feature vector of the new document is then compared with the feature vectors
of all categories, and the document is assigned to the category with the
maximum similarity.
3.2.3. Classifiers
Many types of classifier have been applied and evaluated in the text
classification area. The results differ considerably from one classifier to
another, since every classifier uses a specific algorithm. Several kinds of
classifier are explained below, together with their advantages and drawbacks
according to the results they achieve.
3.2.3.1. Support vector machine (SVM)
Support vector machines have been widely applied in the text classification area
(Alsaleem, 2011, Zrigui et al., 2012, Mesleh and Kanaan, 2008). The SVM classifier
is a supervised machine learning technique. A document is represented
as a vector of terms (words), where each dimension corresponds to a separate term.
When a term occurs in the document, the corresponding value in the vector is non-zero;
this value can be calculated using a weighting method such as tf*idf. In linear
classification, SVM creates a hyperplane that divides the data into two sets with
the maximum margin: the maximum-margin hyperplane is the one whose distances to
the closest points on the two sides are equal. The SVM learns the function below:


f(x) = sign(w·x + b)          3.3

where w is a weight vector in R^n. The SVM finds the hyperplane y = w·x + b by
separating the space R^n into two half-spaces with the maximum
margin (Alsaleem, 2011). The SVM classifier is one of the simplest and most effective
algorithms for the classification task, and it can handle a huge number of features.
On the other hand, with the SVM classifier, documents containing similar contexts
but different term vocabularies are not classified into the same category. In
addition, the vector representation loses the order in which terms appear in the document.
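The decision rule of equation 3.3 can be sketched directly; the weight vector w and bias b are assumed to come from an already-trained SVM, since the training step itself is not shown:

```python
def svm_predict(w, b, x):
    """Linear SVM decision rule f(x) = sign(w·x + b) from equation 3.3.

    w: learned weight vector, b: learned bias, x: document feature vector.
    Returns +1 or -1 for the two classes.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1
```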
The goal of this project is to compare three different classification techniques
on the Arabic language, namely kNN, Rocchio and Naïve Bayes using the
multinomial model.
3.2.3.2. K-Nearest Neighbor (KNN)
The k-nearest neighbor (KNN) classifier is one of the most famous text classification
techniques. The principle of the KNN technique is that documents that are close in
the space belong to the same class; the essential idea is to identify the class
of a document based on a similarity measure.
KNN has several advantages: it is simple, non-parametric, and shows very good
performance on text categorization tasks for the Arabic language. On the other
hand, KNN has drawbacks: it is difficult to find the optimal value of k, and
classification time is long because the distance from each query instance to all
training samples must be computed. In addition, this classifier is called a
lazy learning system, because it does not involve a true training phase (Wan et
al.).
The major steps to apply the k-nearest neighbor classifier are:
Pre-process the documents in the training set.
Choose the parameter k, the number of nearest neighbors of d in the training
data.
Determine the distance between the testing document (d) and the training
documents (previously classified).
Sort the distances and determine the neighbors based on the k minimum
distances.
To classify an unknown document, the KNN classifier ranks the document's
neighbors among the training documents and uses the class labels of the k most
similar neighbors. The similarity score of each nearest-neighbor document to
the test document is used as the weight of that neighbor's classes. If a
specific category is shared by more than one of the k nearest neighbors, then
the sum of the similarity scores of those neighbors gives the weight of that
shared category (Mitra et al., 2007).
An example of KNN classification is shown in figure 3.2.a. The document X is
assumed to be a test sample, which should be classified either to the first
category (white circles) or to the second category (black circles). If k = 1,
document X is classified to the white category, because there is one white
circle and no black circle inside the inner circle. If k = 5, it is classified
to the black category, because the number of black circles is greater than the
number of white circles; majority voting is used to determine the category of
an unclassified document. On the other hand, if k = 10 the document would be
classified to both categories (black and white). To avoid this problem, the
similarity is determined according to the total weight of the two categories,
as shown in figure 3.2.b.





Figure 3.2: An example of KNN classification (panels A and B)
If k = 5 the document is classified to the white category, because the sum of the
weights of the white category (9) is greater than the black category's weight (8).
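The similarity-weighted voting described above can be sketched as follows; cosine similarity and the {term: weight} dictionary representation are assumptions made for illustration:

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(test_vec, training, k):
    """training: list of (vector, label) pairs. Each of the k nearest
    neighbors votes for its class with its similarity score as weight."""
    ranked = sorted(training, key=lambda p: cosine(test_vec, p[0]), reverse=True)
    votes = defaultdict(float)
    for vec, label in ranked[:k]:
        votes[label] += cosine(test_vec, vec)
    return max(votes, key=votes.get)
```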
3.2.3.3. Rocchio:
The Rocchio relevance feedback algorithm is one of the most popular and widely
applied learning methods from information retrieval. In addition, Rocchio
is easy to implement and very fast compared to KNN (Kanaan et al.,
2009a). The basic idea behind the Rocchio approach is to use a vector
to represent each document and each class. The vector representing class c_j is
called the prototype or centroid (Ko et al., 2004, Kanaan et al., 2009a).
The prototype for each class is calculated by subtracting the average of all
documents that do not appear in class C_j from the average of all documents
that appear in class C_j:

c_j = α · (1/|C_j|) · Σ_{d ∈ C_j} d − β · (1/|D − C_j|) · Σ_{d ∈ D − C_j} d          3.4

where α and β are parameters that adjust the relative impact of positive and
negative training examples.
In practice, for text classification Rocchio calculates the similarity between
the test document and each of the prototype vectors; the test document is then
assigned to the category with the maximum similarity score.
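A sketch of the prototype computation of equation 3.4; the default values α = 16 and β = 4 are commonly cited in the Rocchio literature, not necessarily those used in the referenced experiments:

```python
from collections import defaultdict

def rocchio_prototype(pos_docs, neg_docs, alpha=16.0, beta=4.0):
    """Centroid of equation 3.4: alpha times the mean of documents in the
    class minus beta times the mean of documents outside the class.
    Documents are {term: weight} dicts."""
    proto = defaultdict(float)
    for d in pos_docs:
        for t, w in d.items():
            proto[t] += alpha * w / len(pos_docs)
    for d in neg_docs:
        for t, w in d.items():
            proto[t] -= beta * w / len(neg_docs)
    return dict(proto)
```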
3.2.3.4. Naïve Bayes:
The Naïve Bayes classifier uses a probabilistic model of text. It achieves good
performance on TC tasks for Arabic text (Kanaan et al., 2009b).
NB is a simple probabilistic classifier based on applying Bayes' theorem
(Noaman et al., Zhang and Gao); the conditional probability P(c_j|d_i) for
each class can be computed as:

P(c_j|d_i) = P(c_j) · P(d_i|c_j) / P(d_i)          3.5

where P(c_j) is the prior probability of a document occurring in class c_j.
Frequently, each document d_i in text classification is represented as a vector
of words (v_1, v_2, ..., v_t); the equation above then becomes:

P(c_j|d_i) = P(c_j) · Π_{k=1..t} P(v_k|c_j) / P(d_i)          3.6

P(d_i) is constant across all categories.

P(v_k|c_j) = f / F          3.7

where f is the frequency of the word v_k in the test document, and F is the
number of documents in which the word v_k has appeared.
Note that, to avoid zero probabilities, add-one (Laplace) smoothing is used;
the equation then becomes:

P(v_k|c_j) = (f + 1) / (F + w_j)          3.8

where w_j equals the number of training documents in the category c_j.
The Bayes classifier computes separately the posterior probability of document
d_i falling into each class, and assigns the document to the class with the
highest probability, that is:

c_optimal = arg max_j P(c_j|d_i),   1 ≤ j ≤ |C|          3.9

where |C| is the total number of classes (Duwairi, 2007a).
Naïve Bayes is frequently used for categorization in text classification due to
its speed and simplicity. Moreover, there are two event models of Naïve
Bayes: the multinomial model and the Bernoulli model (Prasad).
In the Bernoulli model, a test document is represented as binary occurrence
information and the number of occurrences is ignored, whereas the multinomial
model keeps track of multiple occurrences (Zhong and Ghosh, 2003,
McCallum and Nigam, 1998b).
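The difference between the two event models is visible in the document representations they consume; the vocabulary and function names below are hypothetical:

```python
from collections import Counter

def multinomial_features(tokens, vocab):
    """Multinomial model input: the count of each vocabulary word."""
    counts = Counter(tokens)
    return [counts[t] for t in vocab]

def bernoulli_features(tokens, vocab):
    """Bernoulli model input: binary presence/absence only."""
    present = set(tokens)
    return [1 if t in present else 0 for t in vocab]
```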
3.2.3.5. Multinomial Mixture Model
It is necessary to clarify exactly what is meant by MMM. It models the
distribution of words in a document as a multinomial: a document is treated as
a sequence of words, and it is assumed that each word position is generated
independently of every other (Rennie et al., 2003). In text classification,
the use of class-conditional multinomial mixtures can be seen as a
generalization of the Naïve Bayes text classifier that relaxes its
(class-conditional feature) independence assumption (Civera and Juan, 2005).
When a test document is classified, an MMM keeps track of multiple
occurrences, in contrast with a model such as the Bernoulli model (Zhong and
Ghosh, 2003), which uses binary occurrence information and ignores the number
of occurrences. Because an MMM keeps the occurrence information of all words
(frequency, position), the classification task becomes easier.


P(c_j|d_i) = P(c_j) · P(d_i|c_j) / P(d_i)          3.10

P(c_j) = (1 + n_j) / (L + n_all)          3.11

where n_j is the number of documents in class c_j, n_all is the number of
documents in the training set D, and L is the number of classes (Chen et al., 2009).


P(d_i|c_j) = P(|d_i|) · |d_i|! · Π_{k=1..V} P(w_k|c_j)^{n_ik} / n_ik!          3.12

P(w_k|c_j) = (1 + n_cjk) / (n_all + n_j)          3.13

where n_cjk is the number of documents in class c_j that contain the word w_k.
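Equations 3.10–3.13 can be combined into a small classifier sketch. Scoring is done in log space, and the length probability and factorial terms of equation 3.12 are dropped because they are constant across classes; the toy training format is an implementation choice, not the thesis's actual code:

```python
import math
from collections import Counter, defaultdict

def train_mmm(docs):
    """docs: list of (tokens, label) pairs."""
    n_all = len(docs)
    n_j = Counter(label for _, label in docs)
    L = len(n_j)
    # n_cjk: number of documents in class c_j containing word w_k (eq. 3.13)
    n_cjk = defaultdict(Counter)
    for tokens, label in docs:
        for w in set(tokens):
            n_cjk[label][w] += 1
    prior = {c: (1 + n_j[c]) / (L + n_all) for c in n_j}  # eq. 3.11
    return prior, n_cjk, n_j, n_all

def classify_mmm(tokens, model):
    """Return the class with the highest log posterior (eq. 3.10)."""
    prior, n_cjk, n_j, n_all = model
    def score(c):
        s = math.log(prior[c])
        for w, n_ik in Counter(tokens).items():
            p = (1 + n_cjk[c][w]) / (n_all + n_j[c])  # eq. 3.13
            s += n_ik * math.log(p)
        return s
    return max(prior, key=score)
```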
3.2.4. Evaluation
There are many retrieval systems on the market, but which one is the best?
That depends on the results each one produces. An important issue for
information retrieval systems is the notion of relevance: the purpose of an
information retrieval system is to retrieve all the relevant documents
(recall) and no non-relevant documents (precision). Recall and precision are
defined as follows:
Precision: the ability to retrieve top-ranked documents that are mostly
relevant.

Precision = (number of relevant documents retrieved) / (total number of documents retrieved)          3.14

The maximum (and optimal) precision value is 100%, and the worst possible
precision, 0%, is achieved when not a single relevant document is found.
Recall: The ability of the search to find all of the relevant items in the corpus.

Recall = (number of relevant documents retrieved) / (total number of relevant documents)          3.15

One substantial aspect of the results is how many of the relevant documents in a
collection have been found. Recall shows how many of the relevant documents
a user could possibly come across when reading all documents in the result set;
therefore, the higher the recall, the better the system.
Note that while the number of relevant items retrieved and the total number of
items retrieved are readily available, the total number of relevant items in the
collection is usually not.

The most essential averages are the micro-average, which counts each document as
equally important, and the macro-average, which counts each category as equally
important (see 4.3 for extra details).


A perfect information retrieval system is achieved when both recall and
precision equal one.
F1-measure: a measure of effectiveness that combines the contributions of
precision and recall. The well-known F1 measure is used to test the
performance of information retrieval systems and is defined as:

F1 = 2 · Pr · Re / (Pr + Re)          3.16
Fallout: another measure that can be used to evaluate information retrieval
systems. Although recall and precision are considered good evaluation
measures, they do not take the number of irrelevant documents in the
collection into account; this leads to undefined recall when there is no
relevant document in the collection, and to undefined precision when no
document is retrieved. Fallout, however, takes the number of irrelevant
documents in the collection into account. It is, in a sense, the counterpart
of recall computed over the irrelevant documents, which indicates that a good
system should have high recall and low fallout.
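The four measures discussed above can be computed together from the contingency counts; this generic sketch returns 0.0 in the undefined cases just mentioned:

```python
def ir_measures(tp, fp, fn, tn):
    """Precision, recall, F1 and fallout from contingency counts.
    Guards return 0.0 where the textbook ratio would be undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fallout = fp / (fp + tn) if fp + tn else 0.0  # share of irrelevant docs retrieved
    return precision, recall, f1, fallout
```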

3.3. Summary
This chapter gave an introduction to information retrieval and described the
common tasks of a TC system. Using the multinomial mixture model as a machine
learning algorithm is nowadays a popular approach. In the rest of the chapter,
three interesting kinds of TC algorithms were described briefly.



























4. Chapter four: Experiments and Evaluation












4.1. Introduction
Automatic text classification is defined as classifying unlabelled documents
into predefined categories based on their contents. It has become an important
topic due to the increased number of documents on the internet that people have
to deal with daily, which has led to an urgent need to organize them. In
this chapter, the experiments are carried out and the performance of the Rocchio
algorithm, traditional k-NN, and Naïve Bayes using the MMM classifier is
documented.
These classifiers are evaluated by several measures in order to determine whether
Naïve Bayes using MMM outperforms the other classifiers. The rest of this
chapter is organized as follows: section 4.2 discusses the preparation of the
data set for evaluation; section 4.3 lists the performance measures;
section 4.4 discusses the evaluation results; section 4.5 discusses the
results of MMM with 5070 documents; section 4.6 gives the summary;
section 4.7 explains the conclusion and future work; and section 4.8 lists
the references.
4.2. Data set preparation
The corpus was downloaded from (SAAD, 2010). The documents are classified
into nine categories; the categories and the number of documents in each
appear in table 4.1. The total number of documents is 1445, and the documents
vary in length. The nine categories are: Computer,
Economics, Education, Sport, Politics, Engineer, Medicine, Law, and Religion.
After pre-processing all the documents, a copy of the pre-processed documents
was converted into Attribute-Relation File Format (ARFF) in order to be
suitable for the Weka tool.
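The conversion step can be sketched as a minimal ARFF serializer; the attribute layout (one string attribute plus a nominal class attribute) is an assumption about the format used, not the actual script behind the experiments:

```python
def to_arff(relation, categories, rows):
    """Build a minimal ARFF document for Weka.
    rows: list of (text, category) pairs."""
    lines = [
        f"@relation {relation}",
        "",
        "@attribute text string",
        "@attribute category {" + ",".join(categories) + "}",
        "",
        "@data",
    ]
    for text, cat in rows:
        escaped = text.replace("'", "\\'")  # escape quotes inside ARFF strings
        lines.append(f"'{escaped}',{cat}")
    return "\n".join(lines)
```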

NO   Category    Number of documents
1    Medicine    232
2    Economics   222
3    Religion    222
4    Sport       232
5    Politics    481
6    Engineer    441
7    Law         72
8    Computer    22
9    Education   88

Table 4.1: The number of documents in each category

4.3. Performance measures:
The performance of a text classification algorithm means its computational
efficiency and its classification effectiveness. When a large number of
documents is categorized into many categories, the efficiency of text
classification must be taken into account. The effectiveness of text
classification is measured by precision and recall (Kanaan et al., 2009a),
which are defined as follows:




Precision = tp / (tp + fp),   tp + fp > 0          4.1

Recall = tp / (tp + fn),   tp + fn > 0          4.2
where tp counts the documents that the classifier correctly assigned to the
category, fp counts the documents that the classifier incorrectly assigned to
the category, fn counts the documents that were not assigned to the category
but should have been, and tn counts the documents that were correctly not
assigned, as shown in table 4.2.
Classifier decision      Correct decision by expert
                         YES (correct)    NO (incorrect)
Assigned (YES)           tp               fp
Not assigned (NO)        fn               tn

Table 4.2: Confusion matrix for the performance measures
Precision is the fraction of retrieved instances that are relevant, as in
equation 4.1, while recall is the fraction of relevant instances that are
retrieved, as in equation 4.2. Both precision and recall are therefore based on
an understanding and a measure of relevance. Precision and recall values often
depend on parameter tuning; that means there is a trade-off between precision
and recall. This is why another measure that combines precision and
recall is used: the F-measure, which is defined as follows:

F-measure = 2 · (Precision × Recall) / (Precision + Recall)          4.3

To evaluate the performance across categories, the F-measure is averaged. There
are two kinds of averaged values, namely the micro-average and the macro-average
(MESLEH, 2007).
For obtaining estimates of precision and recall relative to the whole category
set C = {c_1, ..., c_|C|}, two different methods may be adopted. The global
contingency table sums the category-specific counts:

                        Expert judgments
Classifier judgments    YES                        NO
YES                     TP = Σ_{i=1..|C|} TP_i     FP = Σ_{i=1..|C|} FP_i
NO                      FN = Σ_{i=1..|C|} FN_i     TN = Σ_{i=1..|C|} TN_i

Table 4.3: The global contingency table
Macroaveraging: precision and recall are first evaluated locally for each
category, and then globally by averaging over the results of the different
categories:

Pr_macro = (1/|C|) · Σ_{i=1..|C|} TP_i / (TP_i + FP_i)
Re_macro = (1/|C|) · Σ_{i=1..|C|} TP_i / (TP_i + FN_i)

Table 4.4: Macro-averaged precision and recall
Microaveraging: precision and recall are obtained by globally summing over
all individual decisions. For this, the global contingency table of table 4.3,
obtained by summing over all category-specific contingency tables, is needed:

Pr_micro = Σ_{i=1..|C|} TP_i / Σ_{i=1..|C|} (TP_i + FP_i)
Re_micro = Σ_{i=1..|C|} TP_i / Σ_{i=1..|C|} (TP_i + FN_i)

Table 4.5: Micro-averaged precision and recall
The macro- and micro-averaging formulas for precision and recall are shown in
tables 4.4 and 4.5. There are differences between micro-averaged and
macro-averaged results, and the dissimilarity between the two can be large.
Micro-averaged results give equal weight to the documents and thus emphasize
larger topics, while macro-averaged results give equal weight to the topics and
thus emphasize smaller topics more than micro-averaged results do (Joachims,
1996). As a result, the ability of a classifier to behave well on categories
with low generality (i.e., categories with few positive training instances)
will be emphasized by macro-averaging and much less so by micro-averaging.
Micro-averaged results are therefore really a measure of performance on the
large classes in a test collection (Han et al., 2001, Duwairi, 2007a). To get
a sense of performance on small classes, macro-averaged results should be
computed. Whether one or the other should be used obviously depends on the
application requirements.
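The contrast between the two averages can be made concrete with precision; the per-class (tp, fp) input format is hypothetical:

```python
def macro_micro_precision(per_class):
    """per_class: list of (tp, fp) pairs, one per category.
    Returns (macro precision, micro precision)."""
    precisions = [tp / (tp + fp) if tp + fp else 0.0 for tp, fp in per_class]
    macro = sum(precisions) / len(per_class)    # each category weighted equally
    total_tp = sum(tp for tp, _ in per_class)
    total = sum(tp + fp for tp, fp in per_class)
    micro = total_tp / total if total else 0.0  # each decision weighted equally
    return macro, micro
```

With one large, accurate category and one small, weak one, the micro average stays close to the large category's score, while the macro average is pulled down by the small category.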
As a measure of effectiveness that combines the contributions of precision and
recall, the well-known F1 measure has been used, defined as:

F1 = 2 · Pr · Re / (Pr + Re)          4.4
In single-label classification, as implemented in these experiments,
micro-averaged precision equals micro-averaged recall (Rocchio, 1971) and is
equal to F1, so only micro F1 is reported for the micro-averaged results.
4.4. Evaluation Results
The results obtained for each of the k-nearest neighbor, Rocchio, and
Naïve Bayes using MMM classifiers are as follows.
4.4.1. Naïve Bayes algorithm using MMM
Table 4.6 shows the confusion matrix for the Naïve Bayes using MMM algorithm.
The number reported in an entry of the confusion matrix corresponds to the
number of documents that are known to actually belong to the category given
by the row header of the matrix but that are assigned by NB using MMM to the
category given by the column header.
As shown in table 4.6, 67 documents of the Computer category are classified
correctly into the Computer category, while 3 documents of Computer are
classified incorrectly: 2 of these 3 documents are classified as Education and
1 as Law. The best-classified category is Sport, where 231 documents are
classified correctly. The lowest number of correctly classified documents is
for the Education category, where 56 documents are classified correctly and 12
documents are classified incorrectly.







Table 4.6: Confusion Matrix results for NB using MMM algorithm
Figure 4.1 shows the recall, precision and F-measure for every category when
the Naïve Bayes classifier is used. Precision reaches its highest value (1)
for the Sport and Computer categories, while the lowest precision (0.812) is
for the Education category. Recall reaches its highest value (0.996) for the
Sport category and its lowest value (0.804) for the Law category. The
F-measure reaches its highest value (0.998) for the Sport category and its
lowest value (0.818) for the Education category. The rest of the figure is
self-explanatory.







Table 4.7: Confusion matrix results for the NB algorithm

The next figure [4.1] shows the precision, recall, and f-measure for all the
categories classified using Naïve Bayes with MMM. The plotted values are:

Category    Precision   Recall   F-measure
Computer    1.000       0.957    0.978
Economy     0.864       0.841    0.852
Education   0.812       0.824    0.818
Engineer    0.948       0.948    0.948
Law         0.839       0.804    0.821
Medicine    0.996       0.991    0.993
Politics    0.833       0.918    0.873
Religion    0.905       0.885    0.895
Sport       1.000       0.996    0.998

Figure 4.1: Results of the Naïve Bayes MMM classification algorithm
Table 4.8 shows the average of the above values over all categories for the
MMM algorithm; the overall F-measure is 0.908, which is considered high.

                         Precision   Recall   F-measure
Naïve Bayes using MMM
(weighted average)       0.911       0.907    0.908

Table 4.8: NB using MMM classifier weighted average for the nine categories
4.4.2. Comparison of MMM with other techniques and
discussion of results
First, a comparison was made between the k-NN, Rocchio and Naïve Bayes
classifiers. All the results for KNN and Rocchio were taken from (Kanaan et al.,
2009a). A summary of the recall, precision and F1 measures is shown in table
4.9. Naïve Bayes gave the best F-measures, with miF1 = 0.9185 and
maF1 = 0.908, followed by kNN widf with miF1 = 0.7970 and maF1 = 0.7871,
closely followed by Rocchio tf.idf with miF1 = 0.7314 and maF1 = 0.7882. A
comparison of the miF1 and maF1 values is shown in figure 4.2.

Method          maP      maR      maF1     miF1
kNN tf          0.7100   0.5359   0.6100   0.5711
kNN tfidf       0.8363   0.6902   0.7562   0.7272
kNN widf        0.8094   0.7662   0.7871   0.7970
Rocchio tf      0.5727   0.4501   0.5022   0.4427
Rocchio tfidf   0.8515   0.7337   0.7882   0.7314
Rocchio widf    0.7796   0.7199   0.7484   0.6968
Naïve Bayes     0.911    0.907    0.908    0.9185

Table 4.9: Classifier comparison
Figure [4.2] shows the maF1 and miF1 values for all the classifiers (KNN,
Rocchio, and Naïve Bayes); from the figure, we can see that Naïve Bayes using
MMM obtained the highest value for both maF1 and miF1.

Figure 4.2: maF1 and miF1 comparison for the classifiers

The next figure [4.3] shows the macro precision of all the classifiers, and it
appears that the highest value is for Naïve Bayes using MMM, then Rocchio,
with KNN tf.idf not far behind Rocchio.

Figure 4.3: maP comparison for the classifiers
The next figure [4.4] shows the macro recall of all the classifiers, and it
appears that the highest value is for Naïve Bayes using MMM, followed by KNN,
with Rocchio not far behind KNN.

Figure 4.4: maR comparison for the classifiers
It is clear that the Naïve Bayes classifier has the highest values for the
three measures, and the KNN classifier comes in second place; the worst values
in the three measures were for Rocchio. There is also a disproportion between
the precision, recall and f-measure values for k-NN, which reaches a high
value (0.83) on the precision measure but a very low value on recall (0.53).
The precision, recall and f-measure values for the other two classifiers,
Rocchio and Naïve Bayes, are more stable.


Figure 4.5: Precision, recall and f-measure for the three classifiers
4.5. Results of the Naïve Bayes algorithm (MMM) with 5070
documents
Another experiment was conducted: the collected corpus, shown in table 4.10,
contains 5070 documents that vary in length (SAAD, 2010). These documents
fall into six categories: Business, Entertainment, Middle East news, Sport,
World news, and Science and Technology.

NO   Category                 Number of documents
1    Business                 836
2    Entertainment            474
3    Middle East news         1462
4    Sport                    762
5    World news               1010
6    Science and Technology   526

Table 4.10: Categories and their distributions in the corpus (5070 documents)
Table 4.11 shows the confusion matrix for the Naïve Bayes using MMM algorithm.
The lowest number of correctly classified documents is for the Entertainment
category, where 400 documents are classified correctly and 74 documents are
classified incorrectly.

Table 4.11: Confusion matrix results for the NB algorithm in the corpus (5070 documents)
Figure 4.6 shows the recall, precision and F-measure for every category when
the Naïve Bayes classifier is used. Precision reaches its highest value
(0.991) for the Sport category, while the lowest precision (0.746) is for the
Entertainment category. Recall reaches its highest value (0.979) for the Sport
category and its lowest value (0.832) for the Middle East news category. The
F-measure reaches its highest value (0.985) for the Sport category and its
lowest value (0.792) for the Entertainment category. The rest of the figure is
self-explanatory.



Table 4.12: Confusion matrix results for the NB algorithm in the corpus (5070 documents)
The next figure [4.6] shows the precision and recall for all the categories
classified using Naïve Bayes with MMM.

Figure 4.6: Results of the Naïve Bayes classification algorithm
Table 4.13 shows the average of the above values over all categories for the
NB algorithm; the overall F-measure is 0.884.
                         Precision   Recall   F-measure
Naïve Bayes using MMM
(weighted average)       0.882       0.890    0.884

Table 4.13: NB using MMM classifier weighted average for the six categories in
the corpus (5070 documents)
Comparing the overall results from tables 4.8 and 4.13 shows a slight
degradation in precision, recall, and F-measure. This is because the testing
was still done with 4-fold cross-validation; with a percentage split the
results would differ, since the classifier would learn from more data.
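The 4-fold cross-validation setup mentioned above can be sketched as an index partition; this is a generic illustration rather than the Weka procedure that was actually used:

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k disjoint test folds; each split pairs a
    training index list with its held-out test fold."""
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for i in range(k):
        test = set(folds[i])
        train = [j for j in range(n) if j not in test]
        splits.append((train, folds[i]))
    return splits
```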

4.6. Summary
Naïve Bayes using MMM was evaluated against KNN and Rocchio, and it
outperformed both the k-NN and Rocchio classifiers. The Naïve Bayes (MMM)
classifier has the best precision, with the other techniques coming after it.










4.7. Conclusion and Future Work:
Text classification for the Arabic language has been investigated in this project.
Three classifiers were compared: KNN, Rocchio and Naïve Bayes using the
Multinomial Mixture Model (MMM).
Unclassified documents were pre-processed by removing stopwords and
punctuation marks. The remaining words were stemmed and stored in feature
vectors; every test document has its own feature vector. Finally, each document
is assigned to the best class according to the classifier technique.
The accuracy of the classifiers was measured using recall, precision and
F-measure. For the project experiments, the classifiers were tested using 1445
documents. The results show that the performance of NB using the multinomial
model outperformed the other two classifiers.
As future work, we plan to continue working with Arabic text categorization,
as this area is not widely explored in the literature, and to try the
classifiers on a larger collection:
- Apply an auxiliary feature method with the multinomial model in order to
improve classification accuracy.
- Compare the Naïve Bayes MMM model with different models such as the
multivariate Bernoulli (Zhang and Gao).
- Evaluate BPSO feature selection with the multinomial classifier using the
same Arabic database mentioned in (Chantar and Corne, 2011), then compare the
two achieved results.




4.8. References
1 Proceedings of the Workshop on Computational Approaches to Arabic Script-based
Languages. 2004 Geneva, Switzerland. 1621804: Association for Computational
Linguistics, 98.
2 ABBOUD, P. F. & MCCARUS, E. N. 1983. Elementary Modern Standard Arabic: Volume 1,
Pronunciation and Writing; Lessons 1-30, Cambridge University Press.
3 ABUAIADAH, D. 2013. Arabic Document Classification Using Multiword Features.
4 AJAYI, A. O., ADEROUNMU, G. A. & SORIYAN, H. A. An adaptive fuzzy information retrieval
model to improve response time perceived by e-commerce clients. Expert Systems with
Applications, 37, 82-91.
5 AL-ANI, A., ALSUKKER, A. & KHUSHABA, R. N. Feature subset selection using differential
evolution and a wheel based search strategy. Swarm and Evolutionary Computation.
6 AL-DIABAT, M. 2012. Arabic Text Categorization Using Classification Rule Mining. Applied
Mathematical Sciences, 6, 4033-4046.
7 AL-HARBI, S., ALMUHAREB, A., AL-THUBAITY, A., KHORSHEED, M. S. & AL-RAJEH, A. 2008.
Automatic Arabic text classification.
8 AL-KABI, M., WAHSHEH, H., ALSMADI, I., AL-SHAWAKFA, E., WAHBEH, A. & AL-HMOUD,
A. 2012. Content-based analysis to detect Arabic web spam. Journal of Information
Science, 38, 284-296.
9 AL-MAIMANI, M. R., NAAMANY, A. A. & BAKAR, A. Z. A. Arabic information retrieval:
techniques, tools and challenges. GCC Conference and Exhibition (GCC), 2011 IEEE. IEEE,
541-544.
10 AL-SALEEM, S. 2010. Associative classification to categorize Arabic data sets. The
International Journal Of ACM JORDAN, 1, 118-127.
11 AL-SHALABI, R. & EVENS, M. A computational morphology system for Arabic. Proceedings
of the Workshop on Computational Approaches to Semitic Languages, 1998. Association
for Computational Linguistics, 66-72.
12 AL-SHALABI, R., KANAAN, G. & GHARAIBEH, M. 2006. Arabic text categorization using kNN
algorithm. Proc. 4th Internat. Multiconf. on Computer Science and Information
Technology (CSIT 2006), 4.
13 AL ZAMIL, M. G. H. & CAN, A. B. ROLEX-SP: Rules of lexical syntactic patterns for free text
categorization. Knowledge-Based Systems, 24, 58-65.
14 ALBALOOSHI, N., MOHAMED, N. & AL-JAROODI, J. The challenges of Arabic language use
on the Internet. Internet Technology and Secured Transactions (ICITST), 2011
International Conference for, 11-14 Dec. 2011. 378-382.
15 ALNOBANI, A. A. 2008. Improving Search Engines performance in query
processing and indexing for Arabic language.
17 ALSALEEM, S. 2011. Automated Arabic Text Categorization Using SVM and NB. Int. Arab J.
e-Technol., 2, 124-128.
18 BAEZA-YATES, R. & RIBEIRO-NETO, B. 1999. Modern information retrieval, ACM press New
York.
19 BOUGUILA, N., ALMAKADMEH, K. & BOUTEMEDJET, S. 2012. A finite mixture model for
simultaneous high-dimensional clustering, localized feature selection and outlier
rejection. Expert Systems with Applications, 39, 6641-6656.
20 CHANTAR, H. K. & CORNE, D. W. Feature subset selection for Arabic document
categorization using BPSO-KNN. Nature and Biologically Inspired Computing (NaBIC),
2011 Third World Congress on, 2011. IEEE, 546-551.
21 CHEN, A. & GEY, F. C. Building an Arabic Stemmer for Information Retrieval. TREC, 2002.
22 CHEN, J., HUANG, H., TIAN, S. & QU, Y. 2009. Feature selection for text classification with
Naïve Bayes. Expert Systems with Applications, 36, 5432-5435.
23 CIVERA, J. & JUAN, A. 2005. Multinomial Mixture Modelling for Bilingual Text
Classification. Technical report DSIC-II/10/05, UPV.
24 COLLINS-THOMPSON, K. & ADVISER-CALLAN, J. 2008. Robust model estimation methods
for information retrieval, Carnegie Mellon University.
25 DEISY, C., GOWRI, M., BASKAR, S., KALAIARASI, S. & RAMRAJ, N. 2010. A novel term
weighting scheme MIDF for Text Categorization. Journal of Engineering Science and
Technology, 5, 94-107.
26 DUWAIRI, R. 2007a. Arabic text categorization. the international Arab Journal of
information Technology, 7.
27 DUWAIRI, R. 2007b. Arabic Text Categorization. International Arab Journal on Information
Technology, 4.
28 DUWAIRI, R. M. 2006. Machine learning for Arabic text categorization. Journal of the
American Society for Information Science and Technology, 57, 1005-1010.
29 DUWAIRI, R. M. 2007c. Arabic Text Categorization. Int. Arab J. Inf. Technol., 4, 125-132.
30 EL-HALEES, A. 2007. Arabic text classification using maximum entropy. The Islamic
University Journal (Series of Natural Studies and Engineering) Vol, 15, 157-167.
31 GAMON, M. & AUE, A. Automatic identification of sentiment vocabulary: exploiting low
association with known sentiment terms. Proceedings of the ACL Workshop on Feature
Engineering for Machine Learning in Natural Language Processing, 2005. Association for
Computational Linguistics, 57-64.
32 GHWANMEH, S., KANAAN, G., AL-SHALABI, R. & ABABNEH, A. Enhanced Arabic
Information Retrieval System based on Arabic Text Classification. Innovations in
Information Technology, 2007. IIT '07. 4th International Conference on, 18-20 Nov.
2007. 461-465.
33 GOUDJIL, M., KOUDIL, M., HAMMAMI, N., BEDDA, M. & ALRUILY, M. Arabic text
categorization using SVM active learning technique: An overview. Computer and
Information Technology (WCCIT), 2013 World Congress on, 22-24 June 2013. 1-2.
34 GUIYING, W., XUEDONG, G. & SEN, W. Study of text classification methods for data sets
with huge features. Industrial and Information Systems (IIS), 2010 2nd International
Conference on, 10-11 July 2010. 433-436.
35 HAN, E. H., KARYPIS, G. & KUMAR, V. 2001. Text categorization using weight adjusted k-
nearest neighbor classification. Advances in knowledge discovery and data mining, 53-65.
36 HASAN, M. M. Can Information Retrieval techniques meet automatic assessment challenges?
Computers and Information Technology, 2009. ICCIT '09. 12th International Conference
on, 21-23 Dec. 2009. 333-338.
37 HMEIDI, I., HAWASHIN, B. & EL-QAWASMEH, E. 2008. Performance of KNN and SVM
classifiers on full word Arabic articles. Advanced Engineering Informatics, 22, 106-111.
38 IKONOMAKIS, M., KOTSIANTIS, S. & TAMPAKAS, V. 2005. Text classification using machine
learning techniques. WSEAS Transactions on Computers, 4, 966-974.
39 JIANG, S., PANG, G., WU, M. & KUANG, L. 2012. An improved K-nearest-neighbor algorithm
for text categorization. Expert Systems with Applications, 39, 1503-1509.
40 JOACHIMS, T. 1996. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text
Categorization. DTIC Document.
41 KANAAN, G., AL-SHALABI, R., GHWANMEH, S. & AL-MA'ADEED, H. 2009a. A
comparison of text classification techniques applied to Arabic text. Journal of the
American Society for Information Science and Technology, 60, 1836-1844.
42 KANAAN, G., AL-SHALABI, R., GHWANMEH, S. & AL-MA'ADEED, H. 2009b. A
comparison of text classification techniques applied to Arabic text. Journal of the
American Society for Information Science and Technology, 60, 1836-1844.
43 KARABULUT, E. M., ÖZEL, S. A. & İBRİKÇİ, T. 2012. A comparative study on the effect of
feature selection on classification accuracy. Procedia Technology, 1, 323-327.
44 KHUSHABA, R. N., AL-ANI, A. & AL-JUMAILY, A. 2011. Feature subset selection using
differential evolution and a statistical repair mechanism. Expert Systems with
Applications, 38, 11515-11526.
45 KO, Y., PARK, J. & SEO, J. 2004. Improving text categorization using the importance of
sentences. Information Processing & Management, 40, 65-79.
46 KO, Y. & SEO, J. 2009. Text classification from unlabeled documents with bootstrapping
and feature projection techniques. Information Processing & Management, 45, 70-83.
47 LAHTINEN, T. 2000. Automatic indexing: an approach using an index term corpus and
combining linguistic and statistical methods, University of Helsinki.
48 LEWIS, D. D. 1998. Naive (Bayes) at forty: The independence assumption in information
retrieval. Machine learning: ECML-98. Springer.
49 LI, M. & ZHANG, L. 2008. Multinomial mixture model with feature selection for text
clustering. Knowledge-Based Systems, 21, 704-708.
50 MCCALLUM, A. & NIGAM, K. A comparison of event models for naive bayes text
classification. AAAI-98 workshop on learning for text categorization, 1998a. 41-48.
51 MCCALLUM, A. & NIGAM, K. A comparison of event models for naive bayes text
classification. AAAI-98 workshop on learning for text categorization, 1998b. Citeseer, 41-
48.
52 MESLEH, A. M. 2007. Chi square feature extraction based SVMs Arabic language text
categorization system. Journal of Computer Science, 3, 430.
53 MESLEH, A. M. & KANAAN, G. Support vector machine text classification system: Using
Ant Colony Optimization based feature subset selection. Computer Engineering &
Systems, 2008. ICCES 2008. International Conference on, 25-27 Nov. 2008. 143-148.
54 MESLEH, A. M. D. 2011. Feature sub-set selection metrics for Arabic text classification.
Pattern Recognition Letters, 32, 1922-1929.
55 MITRA, V., WANG, C.-J. & BANERJEE, S. 2007. Text classification: A least square support
vector machine approach. Applied Soft Computing, 7, 908-914.
56 MORAES, R., VALIATI, J. O. F. & GAVIÃO NETO, W. P. 2013. Document-level sentiment
classification: An empirical comparison between SVM and ANN. Expert Systems with
Applications, 40, 621-633.
57 NOAMAN, H. M., ELMOUGY, S., GHONEIM, A. & HAMZA, T. Naive Bayes Classifier Based
Arabic Document Categorization. Informatics and Systems (INFOS), 2010 The 7th
International Conference on. IEEE, 1-5.
58 OBASEKI, T. I. Automated Indexing: The Key to Information Retrieval in the 21st Century.
59 PANG, B. & LEE, L. 2008. Opinion mining and sentiment analysis. Foundations and trends
in information retrieval, 2, 1-135.
60 PANG, B., LEE, L. & VAITHYANATHAN, S. Thumbs up?: sentiment classification using
machine learning techniques. Proceedings of the ACL-02 conference on Empirical
methods in natural language processing-Volume 10, 2002. Association for Computational
Linguistics, 79-86.
61 PORTER, M. F. 2006. An algorithm for suffix stripping. Program: electronic library and
information systems, 40, 211-218.
62 PRASAD, S. Micro-blogging Sentiment Analysis Using Bayesian Classification Methods.
Technical Report.
63 RENNIE, J. D., SHIH, L., TEEVAN, J. & KARGER, D. Tackling the poor assumptions of naive
Bayes text classifiers. Proceedings of the 20th International Conference on Machine
Learning (ICML-2003), 2003. 616-623.
64 ROCCHIO, J. J. 1971. Relevance feedback in information retrieval. In: SALTON, G. (ed.) The
SMART Retrieval System: Experiments in Automatic Document Processing. Englewood
Cliffs, NJ: Prentice-Hall.
65 SAAD, M. 2010. OSAC: Open Source Arabic Corpora [Online]. Available:
https://sites.google.com/site/motazsite/Home/osac.
66 SAWAF, H., ZAPLO, J. & NEY, H. 2001. Statistical classification methods for Arabic news
articles. Natural Language Processing in ACL2001, Toulouse, France.
67 SEBASTIANI, F. 2002. Machine learning in automated text categorization. ACM computing
surveys (CSUR), 34, 1-47.
68 SETTLES, B. 2010. Active learning literature survey. University of Wisconsin, Madison.
69 SINGH, S. R., MURTHY, H. A. & GONSALVES, T. A. Feature selection for text classification
based on Gini coefficient of inequality. Proceedings of the fourth international workshop
on feature selection in data mining, 2010. Citeseer, 76-85.
70 SYIAM, M. M., FAYED, Z. T. & HABIB, M. 2006. An intelligent system for Arabic text
categorization. International Journal of Intelligent Computing and Information Sciences,
6, 1-19.
71 TURNEY, P. D. Thumbs up or thumbs down?: semantic orientation applied to
unsupervised classification of reviews. Proceedings of the 40th annual meeting on
association for computational linguistics, 2002. Association for Computational Linguistics,
417-424.
72 UĞUZ, H. 2011. A two-stage feature selection method for text categorization by using
information gain, principal component analysis and genetic algorithm. Knowledge-Based
Systems, 24, 1024-1032.
73 UEDA, N. & SAITO, K. 2002. Parametric mixture models for multi-labeled text. Advances
in neural information processing systems, 15, 721-728.
74 UĞUZ, H. 2011. A two-stage feature selection method for text categorization by using
information gain, principal component analysis and genetic algorithm. Knowledge-Based
Systems, 24, 1024-1032.
75 WAN, C. H., LEE, L. H., RAJKUMAR, R. & ISA, D. 2012. A hybrid text classification approach
with low dependency on parameter by integrating K-nearest neighbor and support
vector machine. Expert Systems with Applications, 39, 11880-11888.
76 WANG, D., WU, J., ZHANG, H., XU, K. & LIN, M. 2013. Towards enhancing centroid classifier
for text classification: A border-instance approach. Neurocomputing, 101, 299-308.
77 WANG, T.-Y. & CHIANG, H.-M. 2011. Solving multi-label text categorization problem using
support vector machine approach with membership function. Neurocomputing, 74,
3682-3689.
78 ZHANG, J., CHEN, L. & GUO, G. 2013. Projected-prototype based classifier for text
categorization. Knowledge-Based Systems, 49, 179-189.
79 ZHANG, W. & GAO, F. 2011. An Improvement to Naive Bayes for Text Classification.
Procedia Engineering, 15, 2160-2164.
80 ZHONG, S. & GHOSH, J. A comparative study of generative models for document
clustering. Proceedings of the workshop on Clustering High Dimensional Data and Its
Applications in SIAM Data Mining Conference, 2003.
81 ZRIGUI, M., AYADI, R., MARS, M. & MARAOUI, M. 2012. Arabic Text Classification
Framework Based on Latent Dirichlet Allocation. Journal of Computing and Information
Technology, 20, 125-140.