
2014. The copyright of this document resides with its authors.

It may be distributed unchanged freely in print or electronic forms.


DR. M. TAHIR HASSAN: CATEGORIZATION OF USERS ON TWITTER
Abstract
This report addresses the task of user categorization in social media, with
an application to Twitter. We automatically infer the values of user category,
such as company, individual, professional, home user, sportsman, student, or
teacher. We employ a machine learning approach which relies on a
comprehensive set of features derived from users' tweets. Our results showed
nearly 60% accuracy across two classifiers: Naive Bayes classifier and
Sequential Minimal Optimization (SMO).
Introduction
Successful microblogging services such as Twitter have become an integral part of the
daily life of millions of users. In addition to communicating with friends, family, or
acquaintances, microblogging services are used as recommendation services, real-time
news sources, and content sharing venues.
A user's experience with a microblogging service could be significantly improved if
information about the demographic attributes or personal interests of the particular user, as
well as the other users of the service, were available. Such information could allow for
personalized recommendations of users to follow or user posts to read; events and topics of
interest to particular communities could be highlighted; additionally, targeted
advertisements could be displayed.

Categorization of users on Twitter
Dr. Malik Tahir Hassan
tahir.hassan@umt.edu.pk
Muhammad Usman Riaz
101620043@umt.edu.pk
Daud Khan
101620016@umt.edu.pk
Muhammad in Ul Hassan
10162000!@umt.edu.pk
"id #a$ed
10162002!@umt.edu.pk
Muzamil sad
101620013@umt.edu.pk
School of Science and Technology
University of Management &
Technology
Lahore, Pakistan
Literature review
Detecting user attributes based on user communication streams. Previous work has
explored the impact of people's profiles on the style, patterns, and content of their
communication streams. Researchers investigated the detection of gender from well
written traditional text (Herring and Paolillo 2010; Singh 2001), blogs (Burger and
Henderson 2010), reviews (Otterbacher 2010), e-mail (Garera and Yarowsky 2007), user
search queries (Jones et al. 2007; Weber and Castillo 2010), and for Twitter (Rao et al.
2010). Other previously explored attributes include the user's location (Jones et al. 2007;
Fink et al. 2009; Cheng, Caverlee, and Lee 2010), location of origin (Rao et al. 2010),
age (Jones et al. 2007; Rao et al. 2010), and political orientation (Thomas, Pang, and Lee 2006;
Rao et al. 2010).
While such previous work has addressed blogs and other informal texts, microblogs
are just starting to be explored for user classification. Additionally, previous work uses a
mixture of sociolinguistic features and n-gram models, while we focus on the content of the
tweets to achieve our task of user classification.
Topic models for Twitter. Ramage (2010) uses large-scale topic models to
represent Twitter feeds and users, showing improved performance on tasks such as post
and user recommendation. We confirm the value of large-scale topic models for a different
set of tasks (user classification) and analyse their impact as part of a rich feature set.
Methodology
Our methodology for categorization of users consists of two steps: (i) Pre-processing, (ii)
Classification.
Pre-processing is further divided into four steps: (i) Conversion to ARFF format, (ii)
Conversion of tweets (strings) into words, (iii) Removal of stop words, (iv) Manual removal
of unnecessary words.
Figure 1 (a) Raw Data
Figure 1 (b) Data Organization of raw data
Figure 1: (a) shows the raw data of the training and test sets. This data is processed to
remove unnecessary attributes and words. (b) shows the data organization of the raw data.
1.1 Pre-processing
Pre-processing is performed on both training and testing data sets. The given data sets
were in .txt format. To load these data sets into Weka, they are first converted into ARFF
format. An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes
a list of instances sharing a set of attributes. ARFF files were developed by the Machine
Learning Project at the Department of Computer Science of The University of Waikato for
use with the Weka machine learning software.
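As an illustration of the ARFF layout described above, the following minimal Python sketch builds an ARFF file body from (tweet, category) pairs. The relation name, the attribute names "text" and "category", and the sample rows are assumptions made for this example; they are not taken from the data sets used in this report.

```python
def escape_arff_string(value):
    """Quote a string value for the @data section, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    # Collapse line breaks so that each instance stays on a single line.
    escaped = escaped.replace("\r", " ").replace("\n", " ")
    return "'" + escaped + "'"


def to_arff(relation, rows, categories):
    """Build ARFF text for (tweet_text, category) rows.

    The attribute names 'text' and 'category' are illustrative assumptions.
    """
    lines = [
        "@relation " + relation,
        "",
        "@attribute text string",
        "@attribute category {" + ",".join(categories) + "}",
        "",
        "@data",
    ]
    for text, category in rows:
        lines.append(escape_arff_string(text) + "," + category)
    return "\n".join(lines) + "\n"


# Hypothetical sample rows, for illustration only.
sample_rows = [
    ("Our new store opens on Monday", "company"),
    ("Can't wait for tonight's match", "sportsman"),
]
arff_text = to_arff("twitter_users", sample_rows, ["company", "sportsman"])
print(arff_text)
```

The header declares one string attribute and one nominal class attribute; each line after @data is one instance.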
The next step was to remove unnecessary attributes from the raw ARFF file. This was
achieved by using the "Edit" function of Weka.
Figure 2: (a) Attributes to be removed during pre-processing are highlighted
Figure 2: (b) Data Organization after removal of unnecessary attributes from raw data
Once attributes are removed, the tweets, which were primarily in string format, are
converted into words using the Weka function StringToWordVector. The complete filter applied
to convert strings into words is as below:
weka.filters.unsupervised.attribute.StringToWordVector -R first-last -W 1000 -prune-rate
-1.0 -N 0 -stemmer weka.core.stemmers.NullStemmer -M 1 -tokenizer
"weka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\""
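To make the effect of this filter concrete, the Python sketch below splits tweets on the same delimiter characters and turns each tweet into a binary word-presence vector. It is a simplified approximation and not Weka's implementation: the vocabulary is limited by overall word frequency, which may differ from how Weka selects the words to keep, and the sample tweets are invented.

```python
from collections import Counter

# Delimiter characters taken from the -delimiters option of the filter above.
DELIMITERS = " \r\n\t.,;:'\"()?!"


def tokenize(text, delimiters=DELIMITERS):
    """Split text into words at any delimiter character."""
    tokens = []
    current = []
    for ch in text:
        if ch in delimiters:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def build_vocabulary(documents, max_words=1000):
    """Keep the most frequent words (simplified assumption, see lead-in)."""
    counts = Counter()
    for document in documents:
        counts.update(tokenize(document))
    # Sort by descending count, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:max_words]]


def to_binary_vector(document, vocabulary):
    """Return 1 for each vocabulary word present in the document, else 0."""
    present = set(tokenize(document))
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical tweets, for illustration only.
tweets = ["Great match today!", "New product launch today, great deals"]
vocabulary = build_vocabulary(tweets, max_words=3)
vectors = [to_binary_vector(tweet, vocabulary) for tweet in tweets]
print(vocabulary)
print(vectors)
```

Note that the sketch is case-sensitive, so "Great" and "great" are counted as different words.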

Figure 3: (a) Data Organization after conversion of strings (tweets) into words.
After conversion of tweets into words, the Weka stop words filter is applied to the data set to
remove words with no value to our results (is, are, of). Stop words are words which are
filtered out during the processing of natural language data (text). Later, a few words (mostly
outliers) were removed manually after visual inspection of the filtered data set.
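The two removal steps described above can be sketched in Python as follows. The stop word list here is a small hypothetical sample, not the list shipped with Weka, and the manually removed word "rt" is an invented example of an outlier.

```python
# Small hypothetical stop word list; Weka's own list is considerably longer.
STOP_WORDS = {"is", "are", "of", "the", "a", "an", "and", "to", "in"}


def remove_stop_words(tokens, stop_words=STOP_WORDS, extra=()):
    """Drop stop words and any manually chosen extra words (case-insensitive)."""
    blocked = set(stop_words) | {word.lower() for word in extra}
    return [token for token in tokens if token.lower() not in blocked]


# 'rt' stands in for a word removed by hand after inspecting the data.
tokens = ["Tickets", "are", "of", "value", "rt"]
filtered = remove_stop_words(tokens, extra=["rt"])
print(filtered)
```

Keeping the manual list separate from the stop word list makes it easy to record which words were removed by hand.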
1.2 Classification
After pre-processing of the training data,
the testing data is also pre-processed
through batch filtering so it could be
loaded in the classifier. Batch filtering
is used if a second dataset, normally
the test set, needs to be processed
with the same statistics as the first
dataset, normally the training set.
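The idea behind batch filtering can be sketched in Python as a filter that learns its vocabulary from the training set only and then applies that same vocabulary to the test set. The class name BagOfWordsFilter, the whitespace tokenization, and the sample tweets are assumptions for this illustration; the sketch does not reproduce Weka's batch mode.

```python
class BagOfWordsFilter:
    """Learns a vocabulary from one dataset and reuses it on another."""

    def __init__(self):
        self.vocabulary = []

    def fit(self, documents):
        # Collect the vocabulary from the training documents only.
        words = set()
        for document in documents:
            words.update(document.lower().split())
        self.vocabulary = sorted(words)
        return self

    def transform(self, documents):
        # Words never seen during fit() are ignored, so train and test
        # vectors always share the same attributes in the same order.
        rows = []
        for document in documents:
            present = set(document.lower().split())
            rows.append([1 if word in present else 0 for word in self.vocabulary])
        return rows


# Hypothetical tweets, for illustration only.
train = ["goal scored tonight", "exam results tonight"]
test = ["exam postponed"]

word_filter = BagOfWordsFilter().fit(train)
train_vectors = word_filter.transform(train)
test_vectors = word_filter.transform(test)
print(word_filter.vocabulary)
print(test_vectors)
```

Because the test set is transformed with the vocabulary learned from the training set, the word "postponed" is dropped and both sets end up with identical attributes.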
Figure 4: Testing data being supplied
in the classifier
Results and Discussion
We applied several classifiers to the training and test data sets; however, we were able to
achieve better results with NaïveBayes and SMO. Naive Bayes classifiers are a family of
simple classifiers based on applying Bayes' theorem with strong (naive) independence
assumptions between the features, while Sequential Minimal Optimization (SMO) is an
algorithm for efficiently solving the optimization problem which arises during the training
of support vector machines.
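The Naive Bayes idea can be illustrated with a short Python sketch. This is a multinomial variant with add-one smoothing, chosen for brevity; it is not necessarily what Weka's NaïveBayes classifier does internally, and the training samples are invented.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples):
    """samples: list of (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in samples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict(model, tokens):
    """Return the label with the highest log posterior score."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        # Log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one smoothed likelihood; words are treated as independent.
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training samples, for illustration only.
samples = [
    (["goal", "match", "team"], "sportsman"),
    (["match", "training", "goal"], "sportsman"),
    (["exam", "lecture", "homework"], "student"),
    (["lecture", "exam", "grades"], "student"),
]
model = train_naive_bayes(samples)
print(predict(model, ["goal", "exam", "match"]))
print(predict(model, ["lecture", "grades"]))
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow when a tweet contains many words.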
Figure 5: (a) Results by NaïveBayes Classifier
Figure 5: (b) Confusion matrix by NaïveBayes
Figure 5: (a) shows the results given by the NaïveBayes classifier using Weka. (b) shows the
confusion matrix on the given training and test data sets by the NaïveBayes classifier.
Figure 6: (a) Results by SMO Classifier
Figure 6: (b) Confusion matrix by SMO
SMO is a simple algorithm with high classification accuracy for our dataset. It performs
well when the training data given as input has a balanced class distribution.
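For readers unfamiliar with the evaluation measures used in this section, the following Python sketch shows how a confusion matrix and the accuracy derived from it can be computed. The labels and predictions are toy values invented for the example; they are not the results reported here.

```python
def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Share of instances on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


# Toy labels and predictions, for illustration only.
labels = ["company", "student", "teacher"]
actual = ["company", "company", "student", "teacher"]
predicted = ["company", "student", "student", "teacher"]

matrix = confusion_matrix(actual, predicted, labels)
print(matrix)
print(accuracy(matrix))
```

Off-diagonal entries show which categories are confused with each other, which accuracy alone does not reveal.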
References
[1] Blei D.; Ng A.; and Jordan M. 2002. Latent dirichlet allocation. JMLR(3): 993-1022.
[2] Burger J. and Henderson J. 2010. An exploration of observable features related to blogger age.
In Computational Approaches to Analyzing Weblogs: Papers from the 2006 AAAI Spring
Symposium, 710-718.
[3] Cheng Z.; Caverlee J.; and Lee K. 2010. You are where you tweet: A Content-based Approach
to Geo-locating Twitter Users. In Proceedings of CIKM.
[4] Friedman J. H. 2006. Recent advances in predictive (machine) learning. Journal of Classification
23(2): 175-197.
[5] Java A.; Song X.; Finin T.; and Tseng B. 2007. Why we twitter: understanding microblogging
usage and communities. In Proceedings of the 9th WebKDD and 1st SNA-KDD 2007.
[6] Ramage D. 2010. Characterizing Microblogs with Topic Models. In Proceedings of ICWSM
2010.
[7] Twitter. 2010. Twitter API documentation. In http://dev.twitter.com/doc.
[8] Wiebe J.; Wilson T.; and Cardie C. 2005. Annotating expressions of opinions and emotions in
language. In Language Resources and Evaluation, 165-210.