
Tweet Classification

Cardenas, Jimenez, King, Panuelos
S18
July 6, 2012

Introduction

For this week, the group was assigned to find data classification algorithms to see whether a machine can learn to classify tweets based on their text. The tweets to be gathered are in the English language and can be categorized as happy, sad, mad, or disgusted. Weka was used to process the data and attempt to classify the tweets accurately. The classifier algorithms used in Weka were J48, LibSVM, SMO, and NaiveBayesMultinomial.

There were also changes to the data gathering software. It has been altered to output CSV and ARFF files.

Figure 1. The Data Gatherer can now save ARFF and CSV formats of the dataset

Rationale

It was advised that the dataset be reworked so that each token (or word) in each received tweet becomes an attribute that falls under a given class. As a result, further tweets would need to contain instances of these tokens to fall under that particular class of sentiment. In other words, if a tweet contained:

I am eating breakfast today

the dataset would separate the words of the tweet string and place each one into a separate column:

emotion  token1  token2  token3  token4     token5
happy    I       am      eating  breakfast  today

This would allow for much more streamlined testing and results. In CSV form, the dataset would look like the tabular form shown in Figure 2.
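The token-per-column layout described above can be sketched in a few lines of Python. This is only an illustration, not the Data Gatherer's actual code: the function names `tweet_to_row` and `rows_to_csv` are hypothetical, tokens are split on whitespace only, and padding shorter tweets with empty cells is an assumption.

```python
import csv
import io


def tweet_to_row(emotion, tweet):
    """Split a tweet on whitespace and prepend its emotion label."""
    return [emotion] + tweet.split()


def rows_to_csv(rows):
    """Render rows as CSV text with the header emotion,token1..tokenN."""
    # The widest tweet decides how many token columns are needed.
    width = max(len(row) for row in rows) - 1
    header = ["emotion"] + [f"token{i}" for i in range(1, width + 1)]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        # Assumption: shorter tweets are padded with empty cells.
        writer.writerow(row + [""] * (width + 1 - len(row)))
    return buffer.getvalue()


rows = [tweet_to_row("happy", "I am eating breakfast today")]
csv_text = rows_to_csv(rows)
print(csv_text)
```

Because tweets differ in length, a layout like this leaves many cells empty, which is one practical reason to prefer a single string attribute per tweet.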

Figure 2. The CSV representation of the dataset as shown in Microsoft Excel

In Weka's documentation¹, however, text categorization is done with tokenization only if the dataset is formatted as ARFF and the tweet as a whole is defined with the STRING type as opposed to the NOMINAL type (as depicted in Figure 3). This led to the addition of a Save to ARFF version, leaving the CSV document present only for the convenience of viewing it in an application like Excel.
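A minimal sketch of what a Save to ARFF step might produce is given below. It is not the Data Gatherer's code: the relation name `tweets`, the attribute name `tweet`, and the helpers `arff_quote` and `to_arff` are assumptions made for illustration. The tweet is declared as a string attribute and the emotion as a nominal attribute with the four classes used in this report.

```python
def arff_quote(text):
    """Wrap text in single quotes, escaping backslashes and single quotes."""
    return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"


def to_arff(relation, samples, classes):
    """Build ARFF text with one string attribute and one nominal class."""
    lines = [
        f"@relation {relation}",
        "",
        "@attribute tweet string",
        "@attribute emotion {" + ",".join(classes) + "}",
        "",
        "@data",
    ]
    for tweet, emotion in samples:
        if emotion not in classes:
            raise ValueError(f"unknown emotion: {emotion}")
        lines.append(f"{arff_quote(tweet)},{emotion}")
    return "\n".join(lines) + "\n"


classes = ["happy", "sad", "mad", "disgusted"]
samples = [("I am eating breakfast today", "happy")]
arff_text = to_arff("tweets", samples, classes)
print(arff_text)
```

Keeping the whole tweet in one quoted string leaves tokenization to Weka's filters instead of fixing the number of token columns in advance.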

Figure 3. The ARFF representation of the same dataset in Figure 2

The chief difference between the ARFF and CSV formats is that in ARFF the words are not separated. Weka's documentation¹ recommends using the StringToWordVector filter. This filter separates the words (tokenization) and gives each word its own attribute, whose value is a number representing the word's frequency. Each word is also given an emotional descriptor. Figure 4 shows the result of applying the filter to the ARFF file. The data can then be used as training data for the desired algorithms in Weka.

Figure 4. The StringToWordVector filter is applied to the ARFF dataset
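The idea behind the filter, one attribute per word holding that word's frequency, can be illustrated with a simplified bag-of-words sketch. This is not Weka's implementation: the function names are hypothetical, and lowercasing and whitespace tokenization are assumptions made to keep the example short.

```python
from collections import Counter


def build_vocabulary(tweets):
    """Collect every distinct lowercased word, sorted for a stable order."""
    return sorted({word for tweet in tweets for word in tweet.lower().split()})


def to_count_vector(tweet, vocabulary):
    """Return one frequency per vocabulary word for the given tweet."""
    counts = Counter(tweet.lower().split())
    return [counts[word] for word in vocabulary]


tweets = ["I am eating breakfast today", "today I am sad sad"]
vocabulary = build_vocabulary(tweets)
vectors = [to_count_vector(tweet, vocabulary) for tweet in tweets]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each tweet becomes a fixed-length numeric row regardless of how many words it contains, which is the form the classifiers need as training data.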

Processing

The ARFF dataset was filtered and preprocessed before testing. Each algorithm was run with 10-fold cross-validation and then run several more times, from 10% training data up to 90% training data, to provide a learning curve for model building. The results determine how effective the algorithms are for prediction.

Classification algorithms

The classification algorithms yielded different accuracies in categorizing the tweets, but none of them is sufficient for making credible classifications. More training data is likely needed to improve accuracy. Additionally, Weka's StringToWordVector filter was applied to the data prior to classification. The graph below shows the percentage of correct classifications. The Y axis displays the percentage of accuracy, while the X axis displays the percentage split of training and test data.

Figure 5. Line graph showing the accuracies of the algorithms
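The percentage splits on the X axis can be sketched as follows. This is only an outline of the splitting step, not Weka's evaluation code: `percentage_split` is a hypothetical helper, the fixed seed is an assumption, and the 200 placeholder instances stand in for the real dataset, whose size is not stated in this report. Training and scoring a classifier on each split is omitted.

```python
import random


def percentage_split(instances, train_percent, seed=1):
    """Shuffle a copy of the instances and cut it into train and test parts."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_percent / 100)
    return shuffled[:cut], shuffled[cut:]


# Placeholder data: 200 integers stand in for tweet instances.
data = list(range(200))

for percent in range(10, 100, 10):
    train, test = percentage_split(data, percent)
    print(f"{percent}% training: {len(train)} train, {len(test)} test")
```

At a 90% split only a tenth of the instances remain for testing, so the measured accuracy at that point rests on far fewer test tweets than at the lower splits.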

The graph shows that the J48 algorithm is the least accurate, while the rest are close to each other. It also shows that the accuracy of some classification algorithms varies with the percentage split of training and test data. The peak accuracy is around 40 percent, so more work must be done to improve the accuracy of the classification.

Comparisons

Percentage Split        10%  20%  30%  40%  50%  60%  70%  80%  90%
LibSVM                   34   35   38   39   38   37   38   40   36
SMO                      34   37   38   39   38   39   39   40   37
J48                      25   26   27   27   28   28   28   29   26
NaiveBayesMultinomial    32   37   38   39   38   38   40   40   39

Table 1. Percentage of Correctly Classified Instances

The table indicates that a high percentage split for training data is needed to yield more accurate results, but too much training data also reduced accuracy: every algorithm scored lower at a 90% split than at 80%. For the given algorithms it is best to have 70 to 80% of the data as training data and the rest as test data. The table also shows that, from a 20 percent split upward, the NaiveBayesMultinomial algorithm matched or exceeded the other algorithms at every split except 60%, where SMO scored 39 against its 38.

Conclusion

Weka provides a variety of data classification algorithms, and they can classify tweets. Based on the results of the classification tests run in Weka, the NaiveBayesMultinomial algorithm provides the highest reliability in categorizing tweets as happy, sad, mad, or disgusted. However, more data is needed in order to have a model that classifies tweets according to emotion with high credibility.
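The comparison drawn from Table 1 can be re-checked with a short script. The figures are copied from the table; the helper names `peak` and `leaders` are hypothetical and only summarize those figures, so the script adds no new results.

```python
SPLITS = [10, 20, 30, 40, 50, 60, 70, 80, 90]

# Percentage of correctly classified instances, copied from Table 1.
ACCURACY = {
    "LibSVM": [34, 35, 38, 39, 38, 37, 38, 40, 36],
    "SMO": [34, 37, 38, 39, 38, 39, 39, 40, 37],
    "J48": [25, 26, 27, 27, 28, 28, 28, 29, 26],
    "NaiveBayesMultinomial": [32, 37, 38, 39, 38, 38, 40, 40, 39],
}


def peak(scores):
    """Return the best accuracy and every split at which it occurs."""
    best = max(scores)
    return best, [s for s, a in zip(SPLITS, scores) if a == best]


def leaders(split):
    """Return the algorithms with the top accuracy at the given split."""
    index = SPLITS.index(split)
    top = max(scores[index] for scores in ACCURACY.values())
    return [name for name, scores in ACCURACY.items() if scores[index] == top]


for name, scores in ACCURACY.items():
    best, at = peak(scores)
    print(f"{name}: peak {best}% at splits {at}")

for split in SPLITS:
    print(f"{split}% split led by {leaders(split)}")
```

Every algorithm peaks at the 70% or 80% split, and NaiveBayesMultinomial leads or ties at every split from 20% upward except 60%.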
