
Douglas Moore
Principal Consultant & Architect
February 2013

Predictive Analytics with Storm, Hadoop, and R on AWS
Leading Provider of Data Science and Engineering Services
Accelerating Your Time to Value using Big Data

IMAGINE: Strategy and Roadmap
ILLUMINATE: Training and Education
IMPLEMENT: Hands-On Data Science and Data Engineering
Boston Storm Meetup, 2013-02-28: Agenda

- Intro & Agenda
- Project Information
- Predictive Analytics
- Storm Overview
- Architecture & Design
- Deployment
- Lessons
- Best Practices
- Future
- Bonus: Storm & Big Data Patterns
Project Definition

AdGlue: solving the biggest problem for local advertisers: "Where's my ad?"

Their Needs:
- Scale up for new business deals
- A more lively site
- Better predictions
- Recommendations

Use Cases
- Scale the batch analysis pipeline; generate timely stats
- Recommendations
- Predictions: how many page views in the next 30 days?

Environment
- AWS
- Version 1 of site & analytics in production

Project Plan
- 8-9 weeks
- Combined Data Engineering + Data Science engagement
- Staff: 1 Architect + 1 PM, 1 Data Engineer, 2 Data Scientists, 3 Client Engineers
Predictive Analytics Process

Model Design & Build
- Listening & Learning
- Discovery (digging through the data)
- Creating a Research Agenda
- Testing & Learning

Production Predictive Model Development
- Data Cleansing, Aggregations, Conditioning
- Predictive Model Training Process
- Predictive Model Execution Process

Challenges:
- What functional forms predict future impression counts, given counts up to time T?
- Robust estimators, such as medians rather than means, to cope with outliers (see the sketch below)
- How do we distinguish new articles from old articles we're seeing for the first time?
- How well do impression counts correspond to real humans?
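
To illustrate the robust-estimator point, a minimal sketch (hypothetical code, not from the project): the median of a window of hourly impression counts barely moves when a bot-driven spike lands in the window, while the mean gets dragged far off.

    import java.util.Arrays;

    // Hypothetical helper: robust central tendency for noisy impression counts.
    public class RobustStats {

        public static double median(long[] counts) {
            long[] sorted = counts.clone();
            Arrays.sort(sorted);
            int mid = sorted.length / 2;
            return (sorted.length % 2 == 1)
                    ? sorted[mid]
                    : (sorted[mid - 1] + sorted[mid]) / 2.0;
        }

        public static double mean(long[] counts) {
            long sum = 0;
            for (long c : counts) sum += c;
            return (double) sum / counts.length;
        }

        public static void main(String[] args) {
            long[] hourly = {120, 135, 128, 9500, 131};   // one bot-driven spike
            System.out.println("mean   = " + mean(hourly));    // 2002.8
            System.out.println("median = " + median(hourly));  // 131.0
        }
    }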
Solution based on this approach

Analyze the massive historical data set, analyze the recent past, then serve near-realtime predictions:

- Massive Historical Set = S3; Analyze = Hadoop + Pig + R
- Recent Past = Storm + NoSQL; Analyze = R + Web Service
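
To make the hand-off between the two halves concrete, a hedged sketch: parameters fit by the batch pipeline plus a recent count from the realtime side go in, a prediction comes out. The exponential-decay form and all class and method names here are illustrative assumptions, not the project's actual model or API.

    // Illustrative sketch of combining the batch and realtime paths.
    public class PredictionSketch {

        // Parameters fit offline by the Hadoop + Pig + R batch pipeline
        static class ModelParameters {
            final double hourlyDecayRate;
            ModelParameters(double hourlyDecayRate) { this.hourlyDecayRate = hourlyDecayRate; }
        }

        // Forecast total impressions over the next `hours`, seeded with the
        // current rate observed by Storm in the realtime impression buckets
        static double getPrediction(ModelParameters p, double currentHourlyRate, int hours) {
            double total = 0.0;
            for (int h = 0; h < hours; h++) {
                total += currentHourlyRate * Math.exp(-p.hourlyDecayRate * h);
            }
            return total;
        }

        public static void main(String[] args) {
            ModelParameters params = new ModelParameters(0.10);          // from batch training
            System.out.println(getPrediction(params, 500.0, 30 * 24));   // 30-day forecast
        }
    }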

Storm Overview

DAG processing of never-ending streams of data

- Open sourced: https://github.com/nathanmarz/storm/wiki
- Used at Twitter plus > 24 other companies
- Reliable: at-least-once semantics
- Think MapReduce for data streams
- Java / Clojure based
- Bolts in Java, plus Shell Bolts for other languages
- Not a queue, but usually reads from a queue

Related:
- S4, CEP engines

Compromises:
- Static topologies & cluster sizing (avoids messy dynamic rebalancing)
- Nimbus is a single point of failure

Strong community support, though no commercial support

Storm Concepts Review

Cluster
- Supervisor
- Worker
Topology
Streams
Spout
Bolt
Tuple
Stream Groupings
- Shuffle, Fields
Trident
DRPC
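
A minimal sketch tying several of these concepts together, using the 0.8.x-era API: a topology with one spout and two bolts, wired with shuffle and fields groupings and run in local mode. The spout and bolt classes are hypothetical placeholders.

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;

    // Minimal sketch (Storm 0.8.x API). ImpressionSpout, FilterBolt and
    // CountBolt are hypothetical placeholder classes.
    public class ExampleTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();

            // 2 spout tasks emitting tuples that carry an "adId" field
            builder.setSpout("impressions", new ImpressionSpout(), 2);

            // shuffle grouping: tuples distributed randomly across 4 filter tasks
            builder.setBolt("filter", new FilterBolt(), 4)
                   .shuffleGrouping("impressions");

            // fields grouping: all tuples for a given adId go to the same counter task
            builder.setBolt("count", new CountBolt(), 4)
                   .fieldsGrouping("filter", new Fields("adId"));

            Config conf = new Config();
            conf.setDebug(true);

            // Local mode: the whole topology runs in-process for debugging
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("example", conf, builder.createTopology());
        }
    }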

Why Storm? Why Realtime?

Needed a better way to manage queue readers and the logic pipeline
- Much better than rolling your own
- Reliable (message guarantees, fault tolerant)
- Multi-node scaling (1MM messages / 10 nodes)
- It works
- More reasons: https://github.com/nathanmarz/storm/wiki/Rationale

Better end-user experience
- View an ad, see the counter move

Need to catch fast-moving events
- Content half-life is measured in hours

Path to additional real-time capabilities
- Trend analysis to recommend hot articles, for example
- Ability to bolt on additional analytics

Overall Architecture

[Architecture diagram, roughly:]

- Ad-serving edge servers (behind load balancers) log impressions and interactions ("view ad") to SQS
- Storm consumes from SQS: queue management, simple bot filtering, real-time bucketization, performance counters, event logging
- ElastiCache: tuple state tracking
- S3: edge archive logs and Storm event logs
- EMR (Hadoop) over S3: cleansing, model training, recommendations
- DynamoDB: performance counters, impression buckets
- R model behind a getPrediction web service; model parameters stored in RDS (MySQL)
- Ad management / ad selling edge servers call getPrediction
- CloudWatch, SNS: metrics, alarms, notifications
- Management server
Analytics Architecture

[Diagram, roughly:]

Batch path:
- Impressions archived in S3 feed an EMR (Hadoop) impression bucketization job
- Impression buckets (batch) drive predictive model training in R, which emits model parameters

Realtime path (Storm):
- Impression Spout -> Simple Bot Annotator -> BucketBolt -> S3 Adapter
- Produces impression buckets (realtime)

Serving path:
- A web request for an impression prediction is answered by the R model, which combines the trained model parameters with the realtime impression buckets
Storm Topology (Greatly Simplified)

[Topology diagram, roughly:]

- Event Spout (reading from SQS) -> SimpleBotFilter -> Performance Counters<T>
- S3 Adapter<T> batches events out to S3
- DynamoDB Adapter<T> writes performance counters to DynamoDB
- A separate SQS Command Spout injects control messages

A hypothetical wiring of this topology is sketched below.
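
A hedged sketch of how this wiring might look in TopologyBuilder terms; the component classes and the exact groupings are assumptions drawn from the diagram, not the project's real code.

    import backtype.storm.generated.StormTopology;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;

    // Hypothetical wiring for the simplified topology above. All component
    // classes (EventSpout, CommandSpout, SimpleBotFilter, PerformanceCounters,
    // S3Adapter, DynamoDBAdapter) are placeholders.
    public class AdTopology {
        public static StormTopology build() {
            TopologyBuilder builder = new TopologyBuilder();

            builder.setSpout("events", new EventSpout("event-queue"), 2);
            builder.setSpout("commands", new CommandSpout("command-queue"), 1);

            builder.setBolt("botFilter", new SimpleBotFilter(), 4)
                   .shuffleGrouping("events");

            // fields grouping: all events for one ad hit the same counter task;
            // allGrouping broadcasts control messages to every counter task
            builder.setBolt("counters", new PerformanceCounters(), 4)
                   .fieldsGrouping("botFilter", new Fields("adId"))
                   .allGrouping("commands");

            builder.setBolt("s3", new S3Adapter(), 2)
                   .shuffleGrouping("botFilter");   // raw event logging to S3

            builder.setBolt("dynamo", new DynamoDBAdapter(), 2)
                   .shuffleGrouping("counters");    // flushed counters to DynamoDB

            return builder.createTopology();
        }
    }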

Storm Deployment

storm-deploy project
- https://github.com/nathanmarz/storm-deploy/wiki
- Uses the pallet & jclouds projects to deploy a cluster
- Configured through conf/clusters.yaml & ~/.pallet/config.clj

Pros:
- Quick and easy AWS deployment

Cons:
- Requires Leiningen v1.x, with no warning otherwise
- Project not kept up to date
- Changes & debugging are in Clojure
- Recovering a node is possible but slow

Tip: use Puppet/Chef for production deployment
Lessons

Easy to develop, hard to debug
- Timeouts

Storm infinite loop of failures
- Use Memcached to count per-tuple failures

At-least-once processing
- Hadoop-based read-repair job

Performance counters not getting flushed
- Tick tuples (see the sketch below)
- Always ACK

Batching to S3
- Run a compaction & event de-duplication job in Hadoop
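
For the tick-tuple fix, a minimal sketch (Storm 0.8+ API): the bolt requests a system tick every 60 seconds and flushes its accumulated counters whenever one arrives; the counter and flush details are placeholders.

    import java.util.HashMap;
    import java.util.Map;

    import backtype.storm.Config;
    import backtype.storm.Constants;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Tuple;

    // Sketch: flush accumulated counters on a periodic tick tuple.
    public class CounterBolt extends BaseBasicBolt {
        private final Map<String, Long> counts = new HashMap<String, Long>();

        @Override
        public Map<String, Object> getComponentConfiguration() {
            Map<String, Object> conf = new HashMap<String, Object>();
            // Ask Storm to send this bolt a tick tuple every 60 seconds
            conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 60);
            return conf;
        }

        private static boolean isTickTuple(Tuple tuple) {
            return Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
                && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
        }

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            if (isTickTuple(tuple)) {
                flush();   // placeholder: e.g. write counters to DynamoDB
            } else {
                String key = tuple.getString(0);
                Long current = counts.get(key);
                counts.put(key, current == null ? 1L : current + 1);
            }
            // BaseBasicBolt acks automatically ("always ACK")
        }

        private void flush() { counts.clear(); /* placeholder */ }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }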

Lessons

Understand your timescales
- Frequency at which you emit running totals / averages / stats
- Frequency at which you write logs to S3
- Frequency at which you commit to DynamoDB / RDS

Tuning is painful when your topology carries lots of in-flight tuples (see the sketch below)
- TOPOLOGY_MESSAGE_TIMEOUT_SECS
- TOPOLOGY_MAX_SPOUT_PENDING
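
A minimal sketch of setting those two knobs on the topology Config; the values are arbitrary examples, not recommendations.

    import backtype.storm.Config;

    public class TuningExample {
        public static Config tunedConfig() {
            // Example values only; tune for your own tuple volumes and timescales.
            Config conf = new Config();
            // Fail (and replay) any tuple tree not fully acked within 120 s
            conf.setMessageTimeoutSecs(120);
            // Cap un-acked tuples per spout task so slow bolts aren't overwhelmed
            conf.setMaxSpoutPending(1000);
            return conf;
        }
    }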

Storm Best Practices

Debug and unit test topology application logic in local mode
- Mock testing
- Multiple environments
- Exception handling & logging

When running distributed
- Start with a small number of workers and slots, so there are fewer log files to dig through
- Automated deployment

Use metrics (see the sketch below)
- Instrument your spouts and bolts
- Needed when scaling, in order to optimize performance
- Helps diagnose problems
- The latest WIP versions of Storm add specialized metrics and improve Nimbus reporting

Use test data that is similar to production data: distribution across the topology is data dependent.
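
As one way to instrument a bolt, a sketch against the metrics API in the then-WIP 0.9 line (backtype.storm.metric.api); treat its availability in your Storm version as an assumption. A CountMetric is registered in prepare() and bumped per tuple.

    import java.util.Map;

    import backtype.storm.metric.api.CountMetric;
    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Tuple;

    // Sketch: instrumenting a bolt with the (then-new) metrics API.
    public class InstrumentedBolt extends BaseRichBolt {
        private transient CountMetric executeCount;
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
            // Report the count to the metrics consumer every 60 seconds
            this.executeCount = context.registerMetric("execute_count", new CountMetric(), 60);
        }

        @Override
        public void execute(Tuple tuple) {
            executeCount.incr();   // one data point per processed tuple
            // ... application logic here ...
            collector.ack(tuple);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }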
Future Improvements

Exactly-once semantics
- Trident (see the sketch below)

S3 small file sizes
- Segment the topology just for S3 persistence
- Incremental S3 uploads (faster, too)

DynamoDB costs
- Use DRPC to access time series and metrics

Deploy using Chef/Puppet
- AWS OpsWorks?

Revisit analytical models
- Compare performance
- Compare with other models: do they perform better?
- Feature analysis
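
For the Trident item, a sketch of the canonical persistent-aggregation pattern that yields exactly-once state updates; the spout and the "adId" field are placeholders, and MemoryMapState is an in-memory stand-in for a real transactional store.

    import backtype.storm.tuple.Fields;
    import storm.trident.TridentState;
    import storm.trident.TridentTopology;
    import storm.trident.operation.builtin.Count;
    import storm.trident.testing.MemoryMapState;

    // Sketch: exactly-once counting with Trident. Trident's batching plus
    // state versioning ensure each batch updates the store exactly once.
    public class TridentCounts {
        public static TridentState buildCounts(TridentTopology topology,
                                               storm.trident.spout.IBatchSpout spout) {
            return topology.newStream("impressions", spout)
                           .groupBy(new Fields("adId"))
                           .persistentAggregate(new MemoryMapState.Factory(),
                                                new Count(),
                                                new Fields("count"));
        }
    }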

Bonus: Storm & Big Data Patterns

[Pattern diagram, roughly:]

Sources: edge/transactional servers, transactional systems, and devices emitting events and CRUD changes

STORM: parse, map, enrich, filter, distribute

Sinks:
- Log aggregation -> DFS
- ETL -> system of record
- Dimensional counts -> OLAP
- Analytics indexer -> fuzzy search
- Dashboards
- Subscription services -> partners
Questions?
