CONFIDENTIAL | 2
Boston Storm Meetup, 2013-02-28: Agenda
Intro
- Agenda
- Project Information
- Predictive Analytics
- Storm Overview
- Architecture & Design
- Deployment
Lessons
- Best Practices
- Future
Bonus:
- Storm & Big Data Patterns
Project Definition
AdGlue: Solving the biggest problem for local advertisers: "Where's my ad?"
Their Needs:
Scale up for new business deals
More lively site
Better predictions
Recommendations.
Use Cases
- Scale batch analysis pipeline; generate timely stats
- Recommendations
- Predictions
  How many page views in the next 30 days?
Environment
- AWS
- Version 1 of site & analytics in production
Project Plan
- 8-9 weeks
- Combined Data Engineering + Data Science Engagement
- Staff
  1 Arch + 1 PM
  1 Data Engineer
  2 Data Scientists
  3 Client Engineers
Predictive Analytics Process
Model Design & Build
- Listening & Learning
- Discovery (Digging through the data)
- Creating a Research Agenda
- Testing & Learning
Production Predictive Model Development
- Data Cleansing, Aggregations, Conditioning
- Predictive Model Training Process
- Predictive Model Execution Process
Challenges:
- What functional forms predict future impression counts given counts up to
time T?
- Robust estimators, like medians rather than means, to cope with outliers
- How do we distinguish between new articles, versus old articles we're
seeing for the first time?
- How well do impression counts correspond to real humans?
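The point about robust estimators can be illustrated with a minimal sketch. It assumes a hypothetical approach in which the final impression count is predicted by scaling the count seen up to time T by a typical historical growth ratio. The function names and the sample data are invented for illustration and are not the model built in the project.

```python
from statistics import mean, median

def growth_ratios(history):
    """history: list of (count_at_T, final_count) pairs for past articles."""
    return [final / at_t for at_t, final in history if at_t > 0]

def predict_final(count_at_t, history, estimator=median):
    """Scale the count observed so far by a typical historical growth ratio."""
    return count_at_t * estimator(growth_ratios(history))

# Hypothetical history: four ordinary articles and one viral outlier.
history = [(100, 200), (50, 110), (80, 160), (120, 228), (10, 1000)]

robust = predict_final(100, history, estimator=median)  # 200.0
naive = predict_final(100, history, estimator=mean)     # roughly 2162, dominated by the outlier
```

With the median, the single viral article barely moves the prediction; with the mean, it inflates the prediction by about an order of magnitude.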
Solution based on this approach theory
Storm Overview
Storm Concepts Review
Cluster
-Supervisor
-Worker
Topology
Streams
Spout
Bolt
Tuple
Stream Groupings
- Shuffle, Fields
Trident
DRPC
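The difference between the two stream groupings named above can be sketched in plain Python. This is an illustration of the routing idea only, not Storm's implementation or API: round-robin stands in for Storm's random shuffle, and the `ad_id` field and the helper names are assumptions.

```python
import itertools
import zlib

def shuffle_grouping(num_tasks):
    """Router that spreads tuples evenly (round-robin stand-in for a random shuffle)."""
    counter = itertools.count()
    return lambda tup: next(counter) % num_tasks

def fields_grouping(field, num_tasks):
    """Router that always sends equal field values to the same task."""
    def route(tup):
        key = str(tup[field]).encode("utf-8")
        return zlib.crc32(key) % num_tasks  # stable hash, unlike built-in hash()
    return route

tuples = [{"ad_id": "a1"}, {"ad_id": "b2"}, {"ad_id": "a1"}, {"ad_id": "c3"}]

shuffle = shuffle_grouping(num_tasks=2)
fields = fields_grouping("ad_id", num_tasks=2)

shuffle_tasks = [shuffle(t) for t in tuples]  # [0, 1, 0, 1]
fields_tasks = [fields(t) for t in tuples]    # both "a1" tuples land on the same task
```

A fields grouping is what makes per-key counting safe: every tuple for a given ad reaches the same bolt task, so that task can hold the running count.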
Why Storm? Why Realtime?
Needed a better way to manage queue readers and the logic pipeline
Much better than rolling your own
Reliable (Message guarantees, fault tolerant)
Multi-node scaling (1MM messages / 10 nodes)
It works
For more reasons: https://github.com/nathanmarz/storm/wiki/Rationale
Better end-user experience
- View an ad, see the counter move.
Need to catch fast moving events
- Content half life measured in hours
Path to additional real-time capabilities
- Trend analysis to recommend hot articles for example.
- Ability to bolt on additional analytics
Overall Architecture
[Diagram. Components recoverable from the slide:]
- CloudWatch, SNS (Metrics, Alarms, Notifications)
- ElastiCache (Tuple state tracking)
- Ad Serving: LB, Edge servers (Impression, Interactions, View Ad)
- SQS
- Storm: Queue Management, Simple Bot Filtering, Real-time Bucketization, Performance Counters, Event Logging
- S3 (Archive Logs)
- EMR (Hadoop): Cleansing, Model Training, Recommendations
- DynamoDB: Performance Counters, Impression Buckets
- R Model: getPrediction, Model Parameters
- RDS (MySQL)
- Ad Management, Ad Selling: LB, Edge servers, Management Server
Analytics Architecture
[Diagram. Components recoverable from the slide:]
- EMR (Hadoop): Impression Bucketization, Train Model, R Model Parameters
- Storm: Impression Spout, Simple Bot Annotator, BucketBolt, S3 Adapter
- Web Request: Impression Prediction, R Model
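Impression bucketization can be sketched as counting events per ad and time bucket. The one-hour bucket width, the `(ad_id, timestamp)` event shape, and the function names are assumptions for illustration; the slides do not state how the buckets were defined.

```python
from collections import Counter

BUCKET_SECONDS = 3600  # assumed bucket width: one hour

def bucket_start(epoch_seconds, width=BUCKET_SECONDS):
    """Round a timestamp down to the start of its bucket."""
    return epoch_seconds - (epoch_seconds % width)

def bucketize(impressions, width=BUCKET_SECONDS):
    """Count impressions per (ad_id, bucket_start) pair."""
    counts = Counter()
    for ad_id, ts in impressions:
        counts[(ad_id, bucket_start(ts, width))] += 1
    return counts

# Hypothetical impressions as (ad_id, epoch seconds).
impressions = [("ad-1", 7200), ("ad-1", 7260), ("ad-1", 10800), ("ad-2", 7300)]
counts = bucketize(impressions)
# counts[("ad-1", 7200)] == 2, counts[("ad-1", 10800)] == 1, counts[("ad-2", 7200)] == 1
```

Bucketed counts like these are a natural input for a model that predicts future impressions from the counts seen so far.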
Storm Topology (Greatly Simplified)
[Diagram. Components recoverable from the slide:]
- Spouts: Event Spout, Command Spout (each shown with an SQS queue)
- Bolts: SimpleBotFilter, Performance Counters<T>, S3 Adapter<T>, DynamoDB Adapter<T>
- Sinks: S3, DynamoDB
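The flow from an event spout through SimpleBotFilter into performance counters can be sketched with chained generators. This is plain Python rather than Storm code, and the user-agent rule, the event fields, and the function names are hypothetical; the talk does not describe the actual filtering logic.

```python
from collections import Counter

BOT_MARKERS = ("bot", "crawler", "spider")  # assumed heuristic, not the talk's rule

def event_spout(raw_events):
    """Stand-in for a spout: yields one event dict at a time."""
    for event in raw_events:
        yield event

def simple_bot_filter(events):
    """Drop events whose user agent contains a known bot marker."""
    for event in events:
        agent = event.get("user_agent", "").lower()
        if not any(marker in agent for marker in BOT_MARKERS):
            yield event

def performance_counters(events):
    """Count surviving events per (ad_id, event type)."""
    counts = Counter()
    for event in events:
        counts[(event["ad_id"], event["type"])] += 1
    return counts

raw = [
    {"ad_id": "ad-1", "type": "impression", "user_agent": "Mozilla/5.0"},
    {"ad_id": "ad-1", "type": "impression", "user_agent": "Googlebot/2.1"},
    {"ad_id": "ad-1", "type": "click", "user_agent": "Mozilla/5.0"},
]
counts = performance_counters(simple_bot_filter(event_spout(raw)))
# The Googlebot impression is dropped; one impression and one click remain.
```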
Storm Deployment
Storm-deploy project
https://github.com/nathanmarz/storm-deploy/wiki
Uses pallet & jclouds project to deploy cluster
Configured through conf/clusters.yaml & ~/.pallet/config.clj files
Pros:
- Quick and easy AWS deployment
Cons:
- Requires Leiningen v1.x, no warning
- Project not kept up to date
- Changes & debugging in Clojure
- Recovering a node is possible but slow
Tip: Use Puppet/Chef for production deployment
Lessons
Storm Best Practices
Debug and unit test topology application logic in local mode.
- Mock testing
- Multiple environments
- Exception Handling & Logging
When running distributed
- Start with small number of workers and slots, with fewer log files to dig
through.
- Automated deployment
Use Metrics
- Instrument your spouts and bolts.
- Needed when scaling in order to optimize performance.
- Helps diagnose problems.
The latest WIP versions of Storm add specialized metrics and improve
Nimbus reporting.
Use test data that is similar to production data. Distribution across
topology is data dependent.
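Mock testing and instrumentation can be combined by keeping bolt logic free of framework code. A minimal sketch, assuming a hypothetical `BotFilterLogic` class and a hand-written `FakeCollector` test double; neither is part of Storm's API.

```python
class FakeCollector:
    """Test double that records emitted tuples instead of sending them on."""
    def __init__(self):
        self.emitted = []

    def emit(self, tup):
        self.emitted.append(tup)

class BotFilterLogic:
    """Hypothetical bolt logic, kept framework-free so it can be tested directly."""
    def __init__(self, collector):
        self.collector = collector
        self.metrics = {"seen": 0, "dropped": 0}  # simple in-process counters

    def process(self, tup):
        self.metrics["seen"] += 1
        if "bot" in tup["user_agent"].lower():
            self.metrics["dropped"] += 1
            return
        self.collector.emit(tup)

collector = FakeCollector()
logic = BotFilterLogic(collector)
for tup in [{"user_agent": "Mozilla/5.0"}, {"user_agent": "Googlebot/2.1"}]:
    logic.process(tup)
# collector.emitted holds only the Mozilla tuple; metrics show 2 seen, 1 dropped.
```

The same counters that a test asserts on can later be reported from the running topology, so the instrumentation is exercised before deployment.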
Future Improvements
Exactly-once semantics
- Trident
S3 small file sizes
- Segment topology just for S3 persistence
- Incremental S3 uploads (faster too)
DynamoDB costs
- Use DRPC to access time series and metrics
Deploy using Chef/Puppet
- AWS OpsWorks?
Revisit analytical models
- Compare performance
- Compare with other models, do they perform better?
- Feature Analysis
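The exactly-once idea behind Trident can be sketched as storing the id of the last applied batch next to each value, so a replayed batch is not counted twice. This is a plain-Python illustration, not Trident code. It assumes batches are applied in order and that a replayed batch carries the same contents; the class and method names are hypothetical.

```python
class TransactionalCounter:
    """Counter that stores the last applied batch id next to each value."""
    def __init__(self):
        self.state = {}  # key -> (last_txid, count)

    def apply_batch(self, txid, key, increment):
        last_txid, count = self.state.get(key, (None, 0))
        if last_txid == txid:
            return count  # batch replayed: already applied, so skip it
        count += increment
        self.state[key] = (txid, count)
        return count

counter = TransactionalCounter()
counter.apply_batch(1, "ad-1", 5)
counter.apply_batch(2, "ad-1", 3)
replayed = counter.apply_batch(2, "ad-1", 3)  # replay of batch 2 is ignored; count stays 8
```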
Bonus
Storm & Big Data Patterns
[Diagram. Recoverable labels:]
- Event sources: Edge Servers, Transactional Servers, Devices, Source Systems, Transactional CRUD Systems
- STORM: Parse, Map, Enrich, Filter, Distribute
- Outputs: Log, ETL, Dimensional, Analytics, Subscription, Aggregation, Indexer, Counts, Services
Questions?