in Arabic tweets
Rihab Bouchlaghem, LARODEC, ISG, University of Tunis, Tunisia, rihab.bouchlaghem@isg.rnu.tn
Aymen Elkhelifi, Paris Sorbonne University, France, Aymen.Elkhlifi@paris.sorbonne.fr
Rim Faiz, LARODEC, IHEC, University of Carthage, Tunisia, Rim.Faiz@ihec.rnu.tn
ABSTRACT
Sentiment analysis methods are becoming increasingly popular, especially with the growing number of social media users. In this context, this paper presents a sentiment analysis approach that captures the sentiment orientation of Arabic Twitter posts, based on a novel data representation and machine learning techniques. The proposed approach applies a wide range of features: lexical, surface-form, syntactic, etc. We also use lexicon features inferred from two Arabic sentiment word lexicons. To build our supervised sentiment analysis system, we use several standard classification methods (Support Vector Machines, K-Nearest Neighbour, Naïve Bayes, Decision Trees, Random Forest) known for their effectiveness on such classification problems. In our study, the Support Vector Machines classifier outperforms the other supervised algorithms in Arabic Twitter sentiment analysis. Through an ablation experiment, we show the positive impact of lexicon-based features on prediction performance.

CCS Concepts
• Computing methodologies➝Artificial intelligence➝Natural language processing➝Language resources • Computing methodologies➝Machine learning approaches.

Keywords
Sentiment analysis; Twitter; Modern Standard Arabic; Supervised classification; Arabic sentiment lexicon

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
WIMS ’16, June 13 - 15, 2016, Nîmes, France.
2016 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ISBN 978-1-4503-4056-4/16/06. . . $15.00
DOI: http://dx.doi.org/10.1145/2912845.2912874

1. INTRODUCTION
Sentiment analysis has recently become one of the most rapidly emerging research areas. Its main purpose is to extract users’ sentiments and opinions from the content they create, using automatic mining techniques to determine their attitudes with respect to some topic, often expressed in textual form.

This research field has found many useful applications, such as opinionated web search, automatic analysis of product reviews, and discovery of customers' opinions for marketing purposes. Sentiment analysis is also used in politics, since it makes it possible to predict election results or to gauge public opinion about different policies. Consequently, it is being actively studied by researchers, particularly with the use of machine learning algorithms for various languages. Most existing research on sentiment analysis focuses on English text [1, 2, 3, 4, 5, 6, 7, 8]. Although Arabic is one of the most used languages in the world, only a few studies have dealt with Arabic sentiment analysis [15, 16, 17].

In this work, we study sentiment analysis for Modern Standard Arabic (MSA) Twitter posts from a machine learning perspective. For this purpose, we propose a novel data representation model applying several lexicon-based features generated from two different sentiment word lexicons, in addition to other feature categories (syntactic, linguistic, etc.). To investigate the impact of the proposed feature set, we applied five supervised classification algorithms: Support Vector Machines (SVM), K-Nearest Neighbour (KNN), Naïve Bayes (NB), Decision Trees (DT) and Random Forest (RF). The proposed classification system was tested on Arabic tweets related to recent terrorist acts and organizations in the Arab world. To our knowledge, research on sentiment in texts related to this domain is almost non-existent. The obtained results show that the SVM classifier gives the best classification accuracy. We also propose an ablation experiment that evaluates the impact of each feature group on the performance of the classification system.

For the rest of this paper, we first introduce Twitter, our data source. Then, we describe our data collection, filtering and pre-processing methods. After that, a detailed description of the proposed data representation and applied features is given. The next section covers the experimental steps and results. Finally, we conclude the paper and point out directions for future work.

2. Twitter, our data source
Our corpus is collected from one of the best-known social networks: Twitter. It is a free microblogging service where a great number of users broadcast their content. Public figures such as celebrities and politicians, media channels and companies are also interested in Twitter, which they use to engage with their followers. A Twitter text unit is called a “tweet”, which is a short text easily
disseminated. Tweets have a specific syntax that must respect some conventions, which comprise [4]:
• Limited number of characters: the content of a tweet cannot exceed 140 characters.
• Mention: a user inserts “@username” in a tweet in order to mention the corresponding user “username”, for example:
“يارب تنتهي فتنة الخوارج في هذا الشهر الفضيل #ساعة_استجابة @Manar480”
• Reply: a particular case of mention in which “@username” is placed at the beginning of the tweet in order to start a conversation with another user, e.g.:
“@Manar48: يارب تنتهي فتنة الخوارج في هذا الشهر الفضيل #ساعة_استجابة”
• Retweet: a user re-shares a tweet previously posted by another user. The new tweet comprises the RT symbol and the username of the original tweet's publisher, followed by the original tweet content.
• Hashtag: a term prefixed with a hash symbol (#), commonly used in social networks. Hashtags tend to represent the topic or the key words of the tweet [21].

3. Data
Using the Twitter API1, we developed a tweet collection module and used it to automatically gather tweets about recent major political events and social reforms in the Arab world. To this end, we launched collection campaigns at specific time intervals, using suitable key words to retrieve relevant tweets that are subjective towards specific target entities, events, organizations, etc.
1 http://twitter4j.org/en/index.html

The resulting corpus comprises a great number of duplicated and non-informative tweets. We propose a token-based similarity measure to identify all similar tweets in a given corpus.

Let:
• T1, T2 be two pre-processed tweets (after deleting URLs, RT symbols and @usermentions),
• Token(T) be a function that returns the tokens of a tweet T (the result of the tokenization task).

We define Sim_Tweet, a similarity measure based on counting tweet tokens, as follows:

$$\mathrm{Sim\_Tweet}(T_1, T_2) = \frac{|\mathrm{Token}(T_1) \cap \mathrm{Token}(T_2)|}{|\mathrm{Token}(T_1) \cup \mathrm{Token}(T_2)|} \qquad (1)$$

The proposed measure helps us filter the tweets.

From the collected corpus, we selected a subset of tweets to be manually annotated for sentiment polarity. Each tweet in this subset must hold one sentiment with a clear orientation, be written in MSA, have informative content, etc. If a given tweet satisfies these relevance restrictions, it receives one of the tags positive, negative or neutral, expressing the position of the tweet's owner.

Because of the particularities of the Arabic language, several NLP methods cannot be applied directly to Arabic texts and yield valid output. Thus, the filtered corpus needs to be pre-processed to promote an efficient use of such methods.

Firstly, the tweets of the corpus are normalized to unify all characters that can cause confusion. On the one hand, the normalization task converts all the various forms of a word to a common form. On the other hand, some Arabic characters must be removed, mainly the shadda, a special symbol used to accentuate the consonant (e.g. “عذّب”, which means “to torment”), and diacritics, which represent short vowels in Arabic texts (e.g. “عُذِّبَ”, “عَذَّبَ”) [21]. We used an existing normalizer to apply most of the Arabic language normalization rules, such as: {ئ، ؤ، ء} → <ء>; {ا، إ، أ، آ} → <ا>.

In the same perspective, the last pre-processing task deletes URLs, user mentions, and the '#' symbol, after the Twitter-specific features have been extracted.

The corpus is then ready for specific NLP methods. We perform tokenization and POS (Part Of Speech) tagging using the Stanford Parser2 NLP tool [22].
2 http://nlp.stanford.edu/

4. Data representation
We propose to represent the Arabic tweets of our dataset by applying commonly used text classification features such as n-grams and part-of-speech tag counts, as well as common Twitter-specific features such as user mention and hashtag counts.

We also introduce several sentiment features generated from two sentiment lexicons introduced in [31]. Each tweet is then represented as a feature vector.

4.1.1 Lexicon based features
The sentiment lexicon features are derived from two newly created Arabic sentiment word lexicons.

4.1.1.1 General-purpose lexicon
Following [12], these features are generated from a general-purpose lexicon containing subjective words and their sentiment scores. For each token t occurring in a tweet and present in the lexicon, we use its sentiment score to compute:
• The number of tokens with sentiment score(t) ≠ 0;
• The number of tokens with sentiment score(t) > 0;
• The number of tokens with sentiment score(t) < 0;
• The total sentiment score $= \sum_{t \in \mathrm{tweet}} \mathrm{sentiment\_score}(t)$;
• The maximal score $= \max_{t \in \mathrm{tweet}} \mathrm{sentiment\_score}(t)$.

The general-purpose lexicon introduced in [32] is created from large word-sentiment association lists which are automatically generated from an existing English polarity-annotated corpus, manually filtered, and automatically expanded and translated.

4.1.1.2 Tweet specific lexicon
The proposed tweet-specific sentiment lexicon [32] is built using a gold set of seed words, manually annotated and extracted from the data set, and automatically expanded from 500000 tweets using co-occurrence and coordination computing methods.

We mainly employed two features derived from this lexicon:
• The number of tokens appearing in the positive tweet-specific lexicon;
• The number of tokens appearing in the negative tweet-specific lexicon.

4.1.2 Linguistic features
We explored a set of linguistic features able to handle several Arabic language structures: negation (“لا”, “ليس”, “لم”, “لن”),
intensifiers (“كثيرا”, “جدا”), supplication and questions. Table 1 summarises the purpose of such features and gives related examples.

In our context, classifying a given tweet according to its sentiment polarity consists in performing multi-class categorization by mapping it to one of the classes positive, negative or neutral.
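As an illustration of the lexicon-based features (Section 4.1.1) and of the negation and intensifier markers (Section 4.1.2), the following minimal Python sketch computes the corresponding counts for one tokenized tweet. It is not our implementation: the toy lexicon entries and their scores, the feature names, the function name `extract_features`, and the choice of 0.0 as the maximal score when no token is found in the lexicon are all assumptions made for the example. Only the marker lists are taken from the text above.

```python
from typing import Dict, List, Set

# Hypothetical toy resources (assumptions): the real lexicons of [32] are not reproduced here.
GENERAL_LEXICON: Dict[str, float] = {"جميل": 0.8, "سيء": -0.6, "رائع": 0.9}
POSITIVE_TWEET_LEXICON: Set[str] = {"رائع"}
NEGATIVE_TWEET_LEXICON: Set[str] = {"فتنة"}

# Marker lists quoted in Section 4.1.2.
NEGATION_MARKERS: Set[str] = {"لا", "ليس", "لم", "لن"}
INTENSIFIER_MARKERS: Set[str] = {"كثيرا", "جدا"}


def extract_features(tokens: List[str]) -> Dict[str, float]:
    """Return lexicon-based and marker-count features for one tokenized tweet."""
    # Sentiment scores of the tokens that are present in the general-purpose lexicon.
    scores = [GENERAL_LEXICON[t] for t in tokens if t in GENERAL_LEXICON]
    return {
        # General-purpose lexicon features (Section 4.1.1.1).
        "n_nonzero": sum(1 for s in scores if s != 0),
        "n_positive": sum(1 for s in scores if s > 0),
        "n_negative": sum(1 for s in scores if s < 0),
        "total_score": sum(scores),
        # Assumption: 0.0 when no token of the tweet is in the lexicon.
        "max_score": max(scores, default=0.0),
        # Tweet-specific lexicon features (Section 4.1.1.2).
        "n_pos_tweet_lexicon": sum(1 for t in tokens if t in POSITIVE_TWEET_LEXICON),
        "n_neg_tweet_lexicon": sum(1 for t in tokens if t in NEGATIVE_TWEET_LEXICON),
        # Linguistic marker counts (Section 4.1.2).
        "n_negation": sum(1 for t in tokens if t in NEGATION_MARKERS),
        "n_intensifier": sum(1 for t in tokens if t in INTENSIFIER_MARKERS),
    }


# Usage on a toy tokenized tweet.
example_features = extract_features(["ليس", "الفيلم", "جميل", "جدا", "سيء"])
```

The resulting dictionary can be concatenated with the other feature groups to form the feature vector of the tweet. Counting markers by exact token match is a simplification that ignores clitics and spelling variants.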
4.1.3 Tweet specific features
Like many existing works, we call Twitter-specific features the commands and conventions used by Twitter users in their posts. We used the following features:
• URLs (or links): the number of links in the tweet,
• User mentions: the number of username mentions in a given tweet. It also indicates whether the tweet replies to other users,
• Presence of the retweet symbol “RT”,
• Number of hashtags,
• Tweet length.

Figure 1. The obtained KNN classification results when varying the K value
K value | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
F-measure | 0.64 | 0.65 | 0.63 | 0.65 | 0.66 | 0.66 | 0.66 | 0.65
Precision | 0.63 | 0.62 | 0.61 | 0.63 | 0.63 | 0.63 | 0.63 | 0.61
Recall | 0.68 | 0.71 | 0.65 | 0.74 | 0.73 | 0.75 | 0.75 | 0.75

Table 1. Linguistic features examples
Features type | Arabic example | English translation | Arabic markers | Markers translation