
Douglas Moore
Principal Consultant & Architect
February 2013

Predictive Analytics with Storm, Hadoop, and R on AWS
Leading Provider of Data Science and Engineering Services
Accelerating Your Time to Value using Big Data

IMAGINE: Strategy and Roadmap
ILLUMINATE: Training and Education
IMPLEMENT: Hands-On Data Science and Data Engineering
Boston Storm Meetup, 2013-02-28: Agenda

- Intro & Agenda
- Project Information
- Predictive Analytics
- Storm Overview
- Architecture & Design
- Deployment
- Lessons
- Best Practices
- Future
- Bonus: Storm & Big Data Patterns
Project Definition

AdGlue: solving the biggest problem for local advertisers: "Where's my ad?"

Their Needs:
- Scale up for new business deals
- A more lively site
- Better predictions
- Recommendations

Use Cases
- Scale the batch analysis pipeline; generate timely stats
- Recommendations
- Predictions: how many page views in the next 30 days?

Environment
- AWS
- Version 1 of site & analytics in production

Project Plan
- 8-9 weeks
- Combined Data Engineering + Data Science engagement
- Staff: 1 Architect + 1 PM, 1 Data Engineer, 2 Data Scientists, 3 Client Engineers
Predictive Analytics Process

Model Design & Build
- Listening & Learning
- Discovery (digging through the data)
- Creating a Research Agenda
- Testing & Learning

Production Predictive Model Development
- Data Cleansing, Aggregations, Conditioning
- Predictive Model Training Process
- Predictive Model Execution Process

Challenges:
- What functional forms predict future impression counts, given counts up to time T?
- Robust estimators, such as medians rather than means, to cope with outliers (see the sketch below)
- How do we distinguish new articles from old articles we're seeing for the first time?
- How well do impression counts correspond to real humans?
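
To illustrate the robust-estimator point, a minimal sketch (hypothetical code, not from the project): the median of a window of hourly impression counts barely moves when a bot-driven spike lands in the window, while the mean gets dragged far off.

    import java.util.Arrays;

    // Hypothetical helper: robust central tendency for noisy impression counts.
    public class RobustStats {

        public static double median(long[] counts) {
            long[] sorted = counts.clone();
            Arrays.sort(sorted);
            int mid = sorted.length / 2;
            return (sorted.length % 2 == 1)
                    ? sorted[mid]
                    : (sorted[mid - 1] + sorted[mid]) / 2.0;
        }

        public static double mean(long[] counts) {
            long sum = 0;
            for (long c : counts) sum += c;
            return (double) sum / counts.length;
        }

        public static void main(String[] args) {
            long[] hourly = {120, 135, 128, 9500, 131};   // one bot-driven spike
            System.out.println("mean   = " + mean(hourly));    // 2002.8
            System.out.println("median = " + median(hourly));  // 131.0
        }
    }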
Solution based on this approach

Analyze the massive historical data set, analyze the recent past, then serve near-realtime predictions:

- Massive Historical Set = S3; Analyze = Hadoop + Pig + R
- Recent Past = Storm + NoSQL; Analyze = R + Web Service
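
To make the hand-off between the two halves concrete, a hedged sketch: parameters fit by the batch pipeline plus a recent count from the realtime side go in, a prediction comes out. The exponential-decay form and all class and method names here are illustrative assumptions, not the project's actual model or API.

    // Illustrative sketch of combining the batch and realtime paths.
    public class PredictionSketch {

        // Parameters fit offline by the Hadoop + Pig + R batch pipeline
        static class ModelParameters {
            final double hourlyDecayRate;
            ModelParameters(double hourlyDecayRate) { this.hourlyDecayRate = hourlyDecayRate; }
        }

        // Forecast total impressions over the next `hours`, seeded with the
        // current rate observed by Storm in the realtime impression buckets
        static double getPrediction(ModelParameters p, double currentHourlyRate, int hours) {
            double total = 0.0;
            for (int h = 0; h < hours; h++) {
                total += currentHourlyRate * Math.exp(-p.hourlyDecayRate * h);
            }
            return total;
        }

        public static void main(String[] args) {
            ModelParameters params = new ModelParameters(0.10);          // from batch training
            System.out.println(getPrediction(params, 500.0, 30 * 24));   // 30-day forecast
        }
    }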

Storm Overview

DAG processing of never-ending streams of data

- Open sourced: https://github.com/nathanmarz/storm/wiki
- Used at Twitter plus > 24 other companies
- Reliable: at-least-once semantics
- Think MapReduce for data streams
- Java / Clojure based
- Bolts in Java, plus Shell Bolts for other languages
- Not a queue, but usually reads from a queue

Related:
- S4, CEP engines

Compromises:
- Static topologies & cluster sizing (avoids messy dynamic rebalancing)
- Nimbus is a single point of failure

Strong community support, though no commercial support

Storm Concepts Review

Cluster
- Supervisor
- Worker
Topology
Streams
Spout
Bolt
Tuple
Stream Groupings
- Shuffle, Fields
Trident
DRPC
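
A minimal sketch tying several of these concepts together, using the 0.8.x-era API: a topology with one spout and two bolts, wired with shuffle and fields groupings and run in local mode. The spout and bolt classes are hypothetical placeholders.

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;

    // Minimal sketch (Storm 0.8.x API). ImpressionSpout, FilterBolt and
    // CountBolt are hypothetical placeholder classes.
    public class ExampleTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();

            // 2 spout tasks emitting tuples that carry an "adId" field
            builder.setSpout("impressions", new ImpressionSpout(), 2);

            // shuffle grouping: tuples distributed randomly across 4 filter tasks
            builder.setBolt("filter", new FilterBolt(), 4)
                   .shuffleGrouping("impressions");

            // fields grouping: all tuples for a given adId go to the same counter task
            builder.setBolt("count", new CountBolt(), 4)
                   .fieldsGrouping("filter", new Fields("adId"));

            Config conf = new Config();
            conf.setDebug(true);

            // Local mode: the whole topology runs in-process for debugging
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("example", conf, builder.createTopology());
        }
    }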

Why Storm? Why Realtime?

Needed a better way to manage queue readers and the logic pipeline
- Much better than rolling your own
- Reliable (message guarantees, fault tolerant)
- Multi-node scaling (1MM messages / 10 nodes)
- It works
- More reasons: https://github.com/nathanmarz/storm/wiki/Rationale

Better end-user experience
- View an ad, see the counter move

Need to catch fast-moving events
- Content half-life is measured in hours

Path to additional real-time capabilities
- Trend analysis to recommend hot articles, for example
- Ability to bolt on additional analytics

Overall Architecture

[Architecture diagram, roughly:]

- Ad-serving edge servers (behind load balancers) log impressions and interactions ("view ad") to SQS
- Storm consumes from SQS: queue management, simple bot filtering, real-time bucketization, performance counters, event logging
- ElastiCache: tuple state tracking
- S3: edge archive logs and Storm event logs
- EMR (Hadoop) over S3: cleansing, model training, recommendations
- DynamoDB: performance counters, impression buckets
- R model behind a getPrediction web service; model parameters stored in RDS (MySQL)
- Ad management / ad selling edge servers call getPrediction
- CloudWatch, SNS: metrics, alarms, notifications
- Management server
Analytics Architecture

[Diagram, roughly:]

Batch path:
- Impressions archived in S3 feed an EMR (Hadoop) impression bucketization job
- Impression buckets (batch) drive predictive model training in R, which emits model parameters

Realtime path (Storm):
- Impression Spout -> Simple Bot Annotator -> BucketBolt -> S3 Adapter
- Produces impression buckets (realtime)

Serving path:
- A web request for an impression prediction is answered by the R model, which combines the trained model parameters with the realtime impression buckets
Storm Topology (Greatly Simplified)

[Topology diagram, roughly:]

- Event Spout (reading from SQS) -> SimpleBotFilter -> Performance Counters<T>
- S3 Adapter<T> batches events out to S3
- DynamoDB Adapter<T> writes performance counters to DynamoDB
- A separate SQS Command Spout injects control messages

A hypothetical wiring of this topology is sketched below.
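
A hedged sketch of how this wiring might look in TopologyBuilder terms; the component classes and the exact groupings are assumptions drawn from the diagram, not the project's real code.

    import backtype.storm.generated.StormTopology;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;

    // Hypothetical wiring for the simplified topology above. All component
    // classes (EventSpout, CommandSpout, SimpleBotFilter, PerformanceCounters,
    // S3Adapter, DynamoDBAdapter) are placeholders.
    public class AdTopology {
        public static StormTopology build() {
            TopologyBuilder builder = new TopologyBuilder();

            builder.setSpout("events", new EventSpout("event-queue"), 2);
            builder.setSpout("commands", new CommandSpout("command-queue"), 1);

            builder.setBolt("botFilter", new SimpleBotFilter(), 4)
                   .shuffleGrouping("events");

            // fields grouping: all events for one ad hit the same counter task;
            // allGrouping broadcasts control messages to every counter task
            builder.setBolt("counters", new PerformanceCounters(), 4)
                   .fieldsGrouping("botFilter", new Fields("adId"))
                   .allGrouping("commands");

            builder.setBolt("s3", new S3Adapter(), 2)
                   .shuffleGrouping("botFilter");   // raw event logging to S3

            builder.setBolt("dynamo", new DynamoDBAdapter(), 2)
                   .shuffleGrouping("counters");    // flushed counters to DynamoDB

            return builder.createTopology();
        }
    }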

Storm Deployment

storm-deploy project
- https://github.com/nathanmarz/storm-deploy/wiki
- Uses the pallet & jclouds projects to deploy a cluster
- Configured through conf/clusters.yaml & ~/.pallet/config.clj

Pros:
- Quick and easy AWS deployment

Cons:
- Requires Leiningen v1.x, with no warning otherwise
- Project not kept up to date
- Changes & debugging are in Clojure
- Recovering a node is possible but slow

Tip: use Puppet/Chef for production deployment
Lessons

Easy to develop, hard to debug
- Timeouts

Storm infinite loop of failures
- Use Memcached to count per-tuple failures

At-least-once processing
- Hadoop-based read-repair job

Performance counters not getting flushed
- Tick tuples (see the sketch below)
- Always ACK

Batching to S3
- Run a compaction & event de-duplication job in Hadoop
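
For the tick-tuple fix, a minimal sketch (Storm 0.8+ API): the bolt requests a system tick every 60 seconds and flushes its accumulated counters whenever one arrives; the counter and flush details are placeholders.

    import java.util.HashMap;
    import java.util.Map;

    import backtype.storm.Config;
    import backtype.storm.Constants;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Tuple;

    // Sketch: flush accumulated counters on a periodic tick tuple.
    public class CounterBolt extends BaseBasicBolt {
        private final Map<String, Long> counts = new HashMap<String, Long>();

        @Override
        public Map<String, Object> getComponentConfiguration() {
            Map<String, Object> conf = new HashMap<String, Object>();
            // Ask Storm to send this bolt a tick tuple every 60 seconds
            conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 60);
            return conf;
        }

        private static boolean isTickTuple(Tuple tuple) {
            return Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
                && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
        }

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            if (isTickTuple(tuple)) {
                flush();   // placeholder: e.g. write counters to DynamoDB
            } else {
                String key = tuple.getString(0);
                Long current = counts.get(key);
                counts.put(key, current == null ? 1L : current + 1);
            }
            // BaseBasicBolt acks automatically ("always ACK")
        }

        private void flush() { counts.clear(); /* placeholder */ }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }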

Lessons

Understand your timescales
- Frequency at which you emit running totals / averages / stats
- Frequency at which you write logs to S3
- Frequency at which you commit to DynamoDB / RDS

Tuning is painful when your topology carries lots of in-flight tuples (see the sketch below)
- TOPOLOGY_MESSAGE_TIMEOUT_SECS
- TOPOLOGY_MAX_SPOUT_PENDING
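
A minimal sketch of setting those two knobs on the topology Config; the values are arbitrary examples, not recommendations.

    import backtype.storm.Config;

    public class TuningExample {
        public static Config tunedConfig() {
            // Example values only; tune for your own tuple volumes and timescales.
            Config conf = new Config();
            // Fail (and replay) any tuple tree not fully acked within 120 s
            conf.setMessageTimeoutSecs(120);
            // Cap un-acked tuples per spout task so slow bolts aren't overwhelmed
            conf.setMaxSpoutPending(1000);
            return conf;
        }
    }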

Storm Best Practices

Debug and unit test topology application logic in local mode
- Mock testing
- Multiple environments
- Exception handling & logging

When running distributed
- Start with a small number of workers and slots, so there are fewer log files to dig through
- Automated deployment

Use metrics (see the sketch below)
- Instrument your spouts and bolts
- Needed when scaling, in order to optimize performance
- Helps diagnose problems
- The latest WIP versions of Storm add specialized metrics and improve Nimbus reporting

Use test data that is similar to production data: distribution across the topology is data dependent.
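
As one way to instrument a bolt, a sketch against the metrics API in the then-WIP 0.9 line (backtype.storm.metric.api); treat its availability in your Storm version as an assumption. A CountMetric is registered in prepare() and bumped per tuple.

    import java.util.Map;

    import backtype.storm.metric.api.CountMetric;
    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Tuple;

    // Sketch: instrumenting a bolt with the (then-new) metrics API.
    public class InstrumentedBolt extends BaseRichBolt {
        private transient CountMetric executeCount;
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
            // Report the count to the metrics consumer every 60 seconds
            this.executeCount = context.registerMetric("execute_count", new CountMetric(), 60);
        }

        @Override
        public void execute(Tuple tuple) {
            executeCount.incr();   // one data point per processed tuple
            // ... application logic here ...
            collector.ack(tuple);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }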
Future Improvements

Exactly-once semantics
- Trident (see the sketch below)

S3 small file sizes
- Segment the topology just for S3 persistence
- Incremental S3 uploads (faster, too)

DynamoDB costs
- Use DRPC to access time series and metrics

Deploy using Chef/Puppet
- AWS OpsWorks?

Revisit analytical models
- Compare performance
- Compare with other models: do they perform better?
- Feature analysis
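
For the Trident item, a sketch of the canonical persistent-aggregation pattern that yields exactly-once state updates; the spout and the "adId" field are placeholders, and MemoryMapState is an in-memory stand-in for a real transactional store.

    import backtype.storm.tuple.Fields;
    import storm.trident.TridentState;
    import storm.trident.TridentTopology;
    import storm.trident.operation.builtin.Count;
    import storm.trident.testing.MemoryMapState;

    // Sketch: exactly-once counting with Trident. Trident's batching plus
    // state versioning ensure each batch updates the store exactly once.
    public class TridentCounts {
        public static TridentState buildCounts(TridentTopology topology,
                                               storm.trident.spout.IBatchSpout spout) {
            return topology.newStream("impressions", spout)
                           .groupBy(new Fields("adId"))
                           .persistentAggregate(new MemoryMapState.Factory(),
                                                new Count(),
                                                new Fields("count"));
        }
    }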

Bonus: Storm & Big Data Patterns

[Pattern diagram, roughly:]

Sources: edge/transactional servers, transactional systems, and devices emitting events and CRUD changes

STORM: parse, map, enrich, filter, distribute

Sinks:
- Log aggregation -> DFS
- ETL -> system of record
- Dimensional counts -> OLAP
- Analytics indexer -> fuzzy search
- Dashboards
- Subscription services -> partners
Questions?
