Deliverable03 Tweet Miner

Tweet Classification
Cardenas, Jimenez, King, Panuelos
S18
July 6, 2012

Introduction
For this week, the group was assigned to find data classification algorithms to see whether a machine can learn to classify tweets based on their text. The tweets to be gathered are in the English language and can be categorized as happy, sad, mad, or disgusted. Weka was used to process the data and attempt to classify the tweets accurately. The classifier algorithms used in Weka were J48, LibSVM, SMO, and NaiveBayesMultinomial.

There were also changes in the data gathering software. It has been altered to output CSV and ARFF files.
Figure 1. The Data Gatherer can now save ARFF and CSV formats of the dataset

Rationale
It was advised that the dataset be reworked so that each token (or word) in each received tweet becomes an attribute that falls under a given class. As a result, further tweets would need to contain instances of these tokens to fall under that particular class of sentiment. In other words, if a tweet contained:

I am eating breakfast today

the dataset would separate the words of the tweet string and place each one into a separate column.

emotion   token1   token2   token3   token4      token5
happy     I        am       eating   breakfast   today

This would allow for much more streamlined testing and results. In CSV form, the dataset would look like the tabularized form shown in Figure 2.
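
The one-token-per-column layout described above can be sketched in a few lines of Python. This is only an illustration, not the Data Gatherer's actual code: the function names, the whitespace tokenization, and the padding of short tweets with empty cells are all assumptions made for the example.

```python
import csv
import io

def tweet_to_row(emotion, tweet, n_tokens):
    """Return [emotion, token1, ..., tokenN]; short tweets are padded with ''."""
    # Assumption: tokens are separated by whitespace only.
    tokens = tweet.split()[:n_tokens]
    return [emotion] + tokens + [""] * (n_tokens - len(tokens))

def rows_to_csv(rows, n_tokens):
    """Serialize the rows as CSV text with an emotion,token1..tokenN header."""
    header = ["emotion"] + [f"token{i}" for i in range(1, n_tokens + 1)]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

row = tweet_to_row("happy", "I am eating breakfast today", 5)
csv_text = rows_to_csv([row], 5)
print(csv_text)
```

A fixed number of token columns is needed because every CSV row must have the same width, which is one reason this layout becomes awkward for tweets of very different lengths.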
Figure 2. The CSV representation of the dataset as shown in Microsoft Excel

According to Weka's documentation [1], however, text categorization is done with tokenization only if the dataset is formatted as ARFF and the tweet as a whole is defined with the STRING type as opposed to the NOMINAL type (as depicted in Figure 3). This led to the addition of a "Save to ARFF" option, leaving the CSV document present only for the convenience of viewing it in an application like Excel.
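
As a rough illustration of the ARFF layout described above, the sketch below writes each tweet as a single STRING attribute followed by a nominal emotion class. The relation name `tweets`, the attribute names `text` and `emotion`, and the helper names are assumptions for the example and may differ from what the Data Gatherer emits. Tweets are assumed to be single-line, and quotes are escaped with a backslash.

```python
EMOTIONS = ["happy", "sad", "mad", "disgusted"]

def quote_arff(value):
    """Wrap a string value in single quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"

def to_arff(instances, relation="tweets"):
    """Build ARFF text from (tweet_text, emotion) pairs."""
    lines = [
        f"@relation {relation}",
        "",
        "@attribute text string",  # the whole tweet, not yet tokenized
        "@attribute emotion {" + ",".join(EMOTIONS) + "}",  # nominal class
        "",
        "@data",
    ]
    for text, emotion in instances:
        if emotion not in EMOTIONS:
            raise ValueError(f"unknown emotion: {emotion}")
        lines.append(f"{quote_arff(text)},{emotion}")
    return "\n".join(lines) + "\n"

arff_text = to_arff([("I am eating breakfast today", "happy")])
print(arff_text)
```

Keeping the tweet in one string attribute leaves tokenization to Weka, so the file does not need a fixed number of token columns.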
Figure 3. The ARFF representation of the same dataset in Figure 2

The chief difference between the ARFF and CSV formats is that the words are not separated in the ARFF file. Weka's documentation [1] recommends using the StringToWordVector filter. This filter separates each word (tokenization) and gives each one its own attribute containing a number representing its frequency. Each word is also given an emotional descriptor. Figure 4 shows the results of applying the filter to the ARFF file. The data can then be used as training data under the desired algorithms in Weka.
Figure 4. The StringToWordVector filter is applied to the ARFF dataset
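
The idea behind the filter, one attribute per word holding that word's frequency, can be sketched as follows. This is a simplified stand-in and not Weka's implementation: lowercasing and whitespace tokenization are assumptions, and the second example tweet is made up for illustration.

```python
from collections import Counter

def build_vocabulary(tweets):
    """Collect every distinct word across all tweets, in sorted order."""
    return sorted({word for tweet in tweets for word in tweet.lower().split()})

def to_word_vector(tweet, vocabulary):
    """Return one count per vocabulary word; 0 if the word is absent."""
    counts = Counter(tweet.lower().split())
    return [counts[word] for word in vocabulary]

# The first tweet is the report's example; the second is hypothetical.
tweets = ["I am eating breakfast today", "today I am sad sad"]
vocabulary = build_vocabulary(tweets)
vectors = [to_word_vector(tweet, vocabulary) for tweet in tweets]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Every tweet becomes a vector of the same length, which is what lets the classifiers treat each word as an ordinary numeric attribute.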

Processing
The ARFF dataset was filtered and preprocessed before testing. Each algorithm is run with 10-fold cross-validation and then run several more times, from 10% training data all the way up to 90% training data, to provide a learning curve for model building. The results will determine how effective the algorithms are for prediction.

Classification algorithms
The classification algorithms yielded different accuracies in categorizing the tweets, but none of them is sufficient for making credible classifications. More training data is likely needed to improve accuracy. Additionally, Weka's StringToWordVector filter was applied to the data prior to classification. The graph below shows the percentage of correct classifications. The Y axis displays the percentage of accuracy, while the X axis displays the percentage split of training and test data.
Figure 5. Line graph showing the accuracies of the algorithms
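
The percentage-split procedure behind the graph can be sketched as a loop over training fractions from 10% to 90%. The tests themselves were run in Weka; this sketch only shows the shape of the procedure. The function names, the fixed shuffle seed, the synthetic data, and the majority-class baseline used in place of a real classifier are all assumptions for illustration.

```python
import random
from collections import Counter

def percentage_split(instances, train_percent, seed=1):
    """Shuffle a copy of the data and cut it into (train, test) parts."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_percent / 100)
    return shuffled[:cut], shuffled[cut:]

def majority_baseline(train, test):
    """Hypothetical evaluator: always predict the most common training label."""
    if not train or not test:
        return 0.0
    guess = Counter(label for _, label in train).most_common(1)[0][0]
    hits = sum(1 for _, label in test if label == guess)
    return 100.0 * hits / len(test)

def learning_curve(instances, evaluate, percents=range(10, 100, 10)):
    """Map each training percentage to the accuracy returned by evaluate."""
    return {p: evaluate(*percentage_split(instances, p)) for p in percents}

# Synthetic placeholder data, not the project's tweets.
emotions = ["happy", "sad", "mad", "disgusted"]
data = [(f"tweet {i}", emotions[i % 4]) for i in range(100)]
curve = learning_curve(data, majority_baseline)
for percent, accuracy in curve.items():
    print(f"{percent}% training: {accuracy:.1f}% correct")
```

Swapping `majority_baseline` for a function that trains and scores a real model would produce one line of the graph per algorithm.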

The graph shows that the J48 algorithm is the least accurate in its classifications, while the rest are close to each other. Some classification algorithms also vary in accuracy depending on the percentage split of training and test data. The peak accuracy is around 40 percent, so more work must be done to improve the accuracy of the classification.

Comparisons

Percentage Split        10%  20%  30%  40%  50%  60%  70%  80%  90%
LibSVM                   34   35   38   39   38   37   38   40   36
SMO                      34   37   38   39   38   39   39   40   37
J48                      25   26   27   27   28   28   28   29   26
NaiveBayesMultinomial    32   37   38   39   38   38   40   40   39

Table 1. Percentage of Correctly Classified Instances

The table indicates that a high percentage split for training data is needed to yield more accurate results, but too much training data also reduced accuracy: every algorithm scored lower at 90% than at 80%. For the given algorithms, it is best to have 70-80% of the data as training data and the rest as test data. The table also shows that NaiveBayesMultinomial matched or exceeded the other algorithms at nearly every split from 20 percent upward. The one exception is the 60 percent split, where SMO scored 39 against NaiveBayesMultinomial's 38.

Conclusion
Weka provides a variety of data classification algorithms, and they can classify tweets. Based on the results of the classification tests run in Weka, the NaiveBayesMultinomial algorithm provides the highest reliability in categorizing tweets as happy, sad, mad, or disgusted. However, more data is needed in order to have a model that classifies tweets according to emotion with high credibility.
