Mostly technology with occasional sprinkling of other random thoughts
Counting Unique Mobile App Users with HyperLogLog
Posted on November 16, 2014
Continuing along the theme of real time analytics with approximate algorithms, the focus this time is approximate
cardinality estimation. To put the ideas in context, the use case we will be working with is counting the number
of unique users of a mobile app. Analyzing the trend of such unique counts reveals valuable insights into the
popularity of an app.
We will be using the HyperLogLog algorithm, which is available as a Java API in my open source project hoidla. The
Storm implementation of the use case is available in my other open source web analytics project, visitante.
HyperLogLog
Cardinality is the number of unique items in a list. In a naive implementation, cardinality can be computed exactly using
memory proportional to the cardinality. However, when the cardinality is very high (e.g., IP addresses, phone
numbers), such a naive approach is not practical. Various approximate probabilistic algorithms can be used for
cardinality estimation instead.
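As a point of comparison, the naive approach can be sketched in a few lines of Python. This is only an illustration and is not part of hoidla or visitante; the function name is made up for the example. The set holds every distinct item, which is why memory grows with the cardinality.

```python
def exact_unique_count(items):
    """Exact cardinality: remembers every distinct item seen so far."""
    seen = set()
    for item in items:
        seen.add(item)  # duplicates are ignored by the set
    return len(seen)


phone_numbers = ["(408)4937187", "(339)8242149", "(408)4937187"]
print(exact_unique_count(phone_numbers))
```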
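HyperLogLog is based on analyzing bit patterns in the hashed value of an item. We look at the length of the
sequence of zero bits at the most significant end of the hash. A hash starting with k zero bits occurs with probability
of about 1 in 2^k, so the maximum length among such zero bit sequences from all the hashed
values is indicative of the unique item count.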
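The idea can be illustrated with a rough single register estimator. This is a sketch under assumptions, not the hoidla code: the 32 bit hash is taken from the first four bytes of SHA-1, and the helper names are hypothetical. The estimate is very coarse (always a power of two), which is what motivates the averaging described next.

```python
import hashlib


def hash32(item: str) -> int:
    """Hypothetical helper: a 32 bit hash taken from the first 4 bytes of SHA-1."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def leading_zeros32(value: int) -> int:
    """Number of zero bits before the first 1 bit in a 32 bit value."""
    return 32 - value.bit_length()


def rough_estimate(items) -> int:
    """Coarse cardinality guess: 2 to the power of the longest run of leading zeros."""
    max_zeros = 0
    for item in items:
        max_zeros = max(max_zeros, leading_zeros32(hash32(item)))
    return 2 ** max_zeros
```

Because only the maximum is kept, feeding the same item many times does not change the result.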
To improve the quality of the result, multiple independent hash functions can be used, and the longest
sequence of zero bits resulting from each hash function is used to produce the final averaged unique count value.
Instead of using multiple hash functions, we will use a technique called stochastic averaging. We set aside the
most significant bits of the hash to select a bucket. From the remaining bits we find the sequence of most
significant zero bits. For example, to use 256 buckets with a 32 bit hash, we use the most significant 8 bits for the bucket and the
remaining 24 bits to find the sequence of most significant zero bits. For each bucket, we maintain the
maximum length of the sequence of zero bits.
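A minimal sketch of stochastic averaging with 256 buckets follows. It is not the hoidla implementation; the class name and the SHA-1 based 32 bit hash are assumptions for illustration. As in the HyperLogLog paper, each register stores the position of the first 1 bit (zero run length plus one), and the raw estimate uses the harmonic mean of the registers without any small or large range correction.

```python
import hashlib

NUM_BUCKET_BITS = 8                    # most significant 8 bits pick the bucket
NUM_BUCKETS = 1 << NUM_BUCKET_BITS     # 256 buckets
REMAINING_BITS = 32 - NUM_BUCKET_BITS  # 24 bits used for the zero run


class SimpleHyperLogLog:
    """Illustrative HyperLogLog with stochastic averaging (raw estimate only)."""

    def __init__(self):
        self.registers = [0] * NUM_BUCKETS

    def add(self, item: str) -> None:
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:4], "big")          # assumed 32 bit hash
        bucket = h >> REMAINING_BITS                   # top 8 bits
        rest = h & ((1 << REMAINING_BITS) - 1)         # low 24 bits
        rank = REMAINING_BITS - rest.bit_length() + 1  # leading zeros + 1
        if rank > self.registers[bucket]:
            self.registers[bucket] = rank

    def raw_estimate(self) -> float:
        m = NUM_BUCKETS
        alpha = 0.7213 / (1 + 1.079 / m)  # bias constant for m >= 128
        harmonic = sum(2.0 ** -r for r in self.registers)
        return alpha * m * m / harmonic


hll = SimpleHyperLogLog()
for i in range(100000):
    hll.add(f"user-{i}")
print(round(hll.raw_estimate()))
```

With 256 buckets the expected relative error is about 1.04 / sqrt(256), roughly 6.5%, while the state is only 256 small integers.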
Small Cardinality
The HyperLogLog algorithm as described in the original paper does not work well for small cardinality. As
suggested in the paper, when the unique count falls below a threshold, an algorithm based on probabilistic
properties of random allocations is used instead.
The correction for small cardinality is included in the implementation in hoidla. However, if it is
known a priori that the cardinality will be small, you could do simple hash based counting instead of HyperLogLog.
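The small range correction from the HyperLogLog paper can be sketched as follows. This is an illustration, not the hoidla code, and the function name is hypothetical. When the raw estimate is at most 2.5 times the number of buckets and some buckets are still empty, the estimate is replaced by the linear counting formula m * ln(m / V), where V is the number of empty buckets.

```python
import math


def corrected_estimate(raw_estimate: float, registers: list) -> float:
    """Apply the small range (linear counting) correction when it is applicable."""
    m = len(registers)
    zero_registers = registers.count(0)  # buckets that never received an item
    if raw_estimate <= 2.5 * m and zero_registers > 0:
        return m * math.log(m / zero_registers)
    return raw_estimate


# With 256 buckets of which half are still empty, linear counting gives about 177.
print(corrected_estimate(200.0, [0] * 128 + [1] * 128))
```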
Mobile App Usage Data
By instrumenting the SDK calls of the app, usage data is created with the following 5 fields. The phone number is
used as an identifier for the user.
1. Date
2. Time
3. Session ID
4. Phone area code
5. Phone number
As we will see later, the phone area code is used to partition the data in the Storm implementation. The data is fed
to Storm through a message queue. Here is some sample data:
Date        Time      Session ID                            Area code  Phone number
2014-11-16  02:17:48  c080f1fa-6d79-11e4-aa9d-c42c030f8af1  310        (310)6121967
2014-11-16  02:17:50  d0997c05-6d79-11e4-a360-c42c030f8af1  408        (408)4937187
2014-11-16  02:17:50  cbd4af2e-6d79-11e4-b7e1-c42c030f8af1  339        (339)8242149
2014-11-16  02:17:52  d0997c05-6d79-11e4-a360-c42c030f8af1  408        (408)4937187
2014-11-16  02:17:54  cbd4af2e-6d79-11e4-b7e1-c42c030f8af1  339        (339)8242149
2014-11-16  02:17:56  d0997c05-6d79-11e4-a360-c42c030f8af1  408        (408)4937187
2014-11-16  02:17:56  cbd4af2e-6d79-11e4-b7e1-c42c030f8af1  339        (339)8242149
2014-11-16  02:17:57  d5f8956b-6d79-11e4-891d-c42c030f8af1  213        (213)7703334
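A record with these five fields could be parsed along the following lines. The wire format is not stated in this post, so the comma delimiter, the class name and the function name are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    date: str
    time: str
    session_id: str
    area_code: str
    phone_number: str  # used as the user identifier


def parse_usage_record(line: str, delimiter: str = ",") -> UsageRecord:
    """Split one message into the 5 usage fields (delimiter is an assumption)."""
    fields = [field.strip() for field in line.split(delimiter)]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    return UsageRecord(*fields)


record = parse_usage_record(
    "2014-11-16,02:17:50,d0997c05-6d79-11e4-a360-c42c030f8af1,408,(408)4937187"
)
print(record.area_code, record.phone_number)
```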
Storm Topology
The Storm topology consists of a spout and two bolts. The spout reads usage data from a message queue.
A simple message queue abstraction available in my open source project chombo is used. It facilitates usage of
any message queue. I have used Redis.
The data emitted by the spout is field grouped on area code. It's essentially hash partitioning on the area code. All
the data for the same area code is processed by the same bolt instance of UniqueVisitorCounterBolt. Each bolt
instance maintains an instance of the HyperLogLog object. When a new tuple arrives, it's processed by
the HyperLogLog object.
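The effect of field grouping can be sketched as plain hash partitioning. This is not Storm's actual grouping code; the CRC32 hash, the function names and the bolt count of 4 are assumptions chosen for the illustration. The point is that a given area code always lands on the same bolt index.

```python
import zlib
from collections import defaultdict


def bolt_index_for(area_code: str, num_bolts: int) -> int:
    """Stable hash partitioning: the same area code always maps to the same bolt."""
    return zlib.crc32(area_code.encode("utf-8")) % num_bolts


def route(records, num_bolts: int) -> dict:
    """Group (area_code, phone_number) pairs by the bolt that would receive them."""
    per_bolt = defaultdict(list)
    for area_code, phone_number in records:
        per_bolt[bolt_index_for(area_code, num_bolts)].append(phone_number)
    return dict(per_bolt)


records = [
    ("408", "(408)4937187"),
    ("339", "(339)8242149"),
    ("408", "(408)4937187"),
]
print(route(records, num_bolts=4))
```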
When the UniqueVisitorCounterBolt receives a tick tuple, it obtains the unique count from the HyperLogLog
object and emits the tuple (boltID, uniqueCount).
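The tuple emitted by UniqueVisitorCounterBolt is processed by the UniqueVisitorAggregatorBolt, of which
there is only one instance. As the name suggests, it aggregates the counts, which is simply summing up the unique
counts from the predecessor bolt layer. Summing is valid because a phone number always maps to the same area code,
so no user is counted by more than one bolt. The result is written to a message queue, which any client application can
consume for further trend analysis of unique user count data.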
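The aggregation step might look like the sketch below. It is not the visitante code: the class name is hypothetical, and keeping only the latest count reported by each bolt before summing is an assumption about how repeated (boltID, uniqueCount) tuples are combined.

```python
class UniqueCountAggregator:
    """Keeps the latest unique count per counter bolt and reports their sum."""

    def __init__(self):
        self.latest_counts = {}  # boltID -> most recent uniqueCount

    def update(self, bolt_id: int, unique_count: int) -> int:
        # A newer count from the same bolt replaces the older one.
        self.latest_counts[bolt_id] = unique_count
        return sum(self.latest_counts.values())


aggregator = UniqueCountAggregator()
aggregator.update(0, 30)
aggregator.update(1, 46)
print(aggregator.update(0, 31))
```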
The output is simply the tuple (currentTime, uniqueCount). Here is some sample output. As new records are
processed, the unique count grows. The unique count never decreases.
1416191997120  76
1416192007121  76
1416192017123  77
1416192027584  78
1416192037125  78
1416192047126  78
1416192057126  78
1416192067128  79
1416192077128  79
Temporal Reference
The unique count has a temporal reference point for time series data. The counting is with respect to some point
in the past. In case of a mobile app, it will be the launch date of the app.
Although generally the temporal reference does not change once set, sometimes it may be necessary to change it.
There is a mechanism to clear the counter and start counting with a clean slate.
A simple publish subscribe mechanism is used to dispatch commands to bolt instances. A simple pub sub
interface is available in chombo, with implementations for different messaging providers. I have used Redis. On
receipt of a tick tuple, the UniqueVisitorCounterBolt fetches the command, if any, from the pub sub system.
Then the HyperLogLog counter is cleared.
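The tick handling can be sketched as follows. None of this is the visitante or chombo API: the class, the "clear" command string and the fetch_command callable are hypothetical, a plain set stands in for the HyperLogLog object, and handling the command before emitting the count is an assumption about ordering.

```python
from collections import deque


class CounterBoltSketch:
    """Stand-in for a counter bolt: counts users and obeys a clear command on tick."""

    def __init__(self, bolt_id: int, fetch_command):
        self.bolt_id = bolt_id
        self.fetch_command = fetch_command  # returns a command string or None
        self.seen = set()                   # stand-in for the HyperLogLog object

    def on_tuple(self, phone_number: str) -> None:
        self.seen.add(phone_number)

    def on_tick(self):
        if self.fetch_command() == "clear":  # hypothetical command name
            self.seen = set()                # start counting with a clean slate
        return (self.bolt_id, len(self.seen))


commands = deque([None, "clear"])
bolt = CounterBoltSketch(0, lambda: commands.popleft() if commands else None)
bolt.on_tuple("(408)4937187")
print(bolt.on_tick())  # no command pending, the count is reported
print(bolt.on_tick())  # clear command received, the count restarts from zero
```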
Summing Up
We have gone through an exercise of using a probabilistic counting algorithm for approximate unique count
estimation. The HyperLogLog algorithm has been used for network traffic data analysis and in database query
planners. Step by step instructions to run this use case are available in this tutorial document.
About Pranab
I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living
beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage
startups, large corporations and anything in between. I am an active blogger and open source contributor. I am passionate about
technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases,
Data Mining and Programming languages. I am fascinated by problems that don't have neat closed form solution.
This entry was posted in Approximate Query, Big Data, Mobile, Real Time Processing, Storm and tagged cardinality, mobile, unique count.