Navin Chandra
OUTLINE
• Pig
• Hive
• HCatalog
• HBase
• Oozie
• Sqoop
• Mahout
• Provisioning, Managing, Monitoring Cluster
• Real-time Use Case Scenarios
APACHE PIG
BACKGROUND
• Yahoo! was the first big adopter of Hadoop.
• Hadoop gained popularity in the company quickly.
• Yahoo! Research developed Pig to address the need for a higher
level language.
• Roughly 30% of Hadoop jobs run at Yahoo! are Pig jobs.
APACHE PIG
• Pig 0.11.1: a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
• Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs.
• Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:
• Ease of programming
• Optimization opportunities
• Extensibility
PIG
• What is Pig?
– An open-source high-level dataflow system
– Provides a simple language for queries and data manipulation,
Pig Latin, that is compiled into map-reduce jobs that are run on
Hadoop
– Pig Latin combines the high-level data manipulation constructs
of SQL with the procedural programming of map-reduce
• Why is it important?
– Companies and organizations like Yahoo, Google and Microsoft
are collecting enormous data sets in the form of click streams,
search logs, and web crawls
– Some form of ad-hoc processing and analysis of all of this
information is required
PIG
EXAMPLE 1
$ cp /etc/passwd .
$ ls passwd
$ cat passwd (fields in this system file are separated by colons ':')
$ pig -x local
grunt> A = LOAD 'passwd' using PigStorage(':');
grunt> DUMP A;
grunt> B = FOREACH A GENERATE $0;
grunt> DUMP B;
grunt> STORE B INTO 'Passout'; (the Passout directory is created)
grunt> quit;
root@ubuntu:/home/Navin# cd Passout
root@ubuntu:/home/Navin/Passout# cat part-m-00000
Data Model
• Supports four basic types
– Atom: a simple atomic value (int, long, double, string)
• ex: ‘Peter’
– Tuple: a sequence of fields that can be any of the data
types
• ex: (‘Peter’, 14)
– Bag: a collection of tuples of potentially varying
structures, can contain duplicates
• ex: {(‘Peter’), (‘Bob’, (14, 21))}
– Map: an associative array, the key must be a chararray
but the value can be any type
Pig Latin vs. SQL
• Pig Latin is procedural (dataflow programming model)
– Step-by-step query style is much cleaner and easier to write and
follow than trying to wrap everything into a single block of SQL
Pig Latin vs. SQL (continued)
• Lazy evaluation (data not processed prior to STORE command)
• Data can be stored at any point during the pipeline
• An execution plan can be explicitly defined
• Pipeline splits are supported
PIG LATIN
PIG
MULTIPLE OUTPUTS
raw = LOAD . . .
. . .
A = LOAD . . .
B = LOAD . . .
C = LOAD . . .
A = FILTER . . .
B = FOREACH . . .
C = FILTER . . .
C = FOREACH . . .
. . .
PARTITIONER EXAMPLE FOR NUMBER OF BOOKS
grunt> A = load 'tabcollege.txt' using PigStorage('\t') as (user:chararray, age:int, college:chararray, nbooks:int);
grunt> B = group A by college;
grunt> C = FOREACH B { D = ORDER A by nbooks DESC; E = LIMIT D 1; GENERATE FLATTEN(E); }
grunt> DUMP C;
COMPUTING AVERAGE NUMBER OF PAGE VISITS BY USER
...
– Calculate the average for all users
...
– Store the result
With HCatalog
Feature        | MapReduce + HCatalog                      | Pig + HCatalog                                 | Hive
Record format  | Record                                    | Tuple                                          | Record
Data model     | int, float, string, maps, structs, lists  | int, float, string, bytes, maps, tuples, bags  | int, float, string, maps, structs, lists
Schema         | Read from metadata                        | Read from metadata                             | Read from metadata
Data location  | Read from metadata                        | Read from metadata                             | Read from metadata
Data format    | Read from metadata                        | Read from metadata                             | Read from metadata
HOW DOES IT WORK?
• Pig
– HCatLoader + HCatStorer interface
• MapReduce
– HCatInputFormat + HCatOutputFormat interface
• Hive
– No interface necessary
– Direct access to metadata
• Notifications when data available
Data & Metadata Access With HCatalog
[Diagram: external systems use a REST interface to the metastore; MapReduce uses HCatInputFormat/HCatOutputFormat and Pig uses HCatLoader/HCatStorer, both built on SerDe and InputFormat/OutputFormat, with data in HDFS and metadata in the Metastore.]
HIVE AND PIG
[Diagram: MapReduce, Hive, and Pig each use a Metastore client plus SerDe and InputFormat/OutputFormat (Load/Store for Pig) to reach data in HDFS and metadata in the Metastore.]
HIVE ODBC/JDBC TODAY
Need to have Hive JDBC code on the client
Issues:
• Not concurrent
• Not secure
• Not scalable
[Diagram: JDBC and ODBC clients connect directly to the Hive server in front of Hadoop; HCatalog sits alongside HDFS, HBase, and MPP stores.]
HCATALOG ARCHITECTURE
[Diagram: HCatLoader and HCatStorer sit on top of the Hive MetaStore, which is backed by an RDBMS.]
WHERE IS THE DATA
[Diagram: Pig, Hive, and MapReduce each access storage directly.]
USING HCATALOG
[Diagram: Pig, Hive, and MapReduce all access storage through HCatalog.]
Problem: data lives in a variety of formats across storage (HDFS, HBase, etc.)
Solution: HCatalog provides a common abstraction
• Data is registered with a schema
• HCatalog normalizes data to the application
[Diagram: a Hadoop application accesses storage through HCatalog.]
HCATALOG EXAMPLE
Templeton Specific Support
Move data directly into/out-of HDFS through WebHDFS
Stateless Server
– Horizontally scale for load
– Configurable for HA
– Currently Requires ZooKeeper to track job status info
HBASE
• HBase is a distributed column-oriented data store built on top of
HDFS
• HBase is an Apache open source project whose goal is to provide storage for Hadoop distributed computing
• Data is logically organized into tables, rows and columns
• Key/value column family store
• Data stored in HDFS
• ZooKeeper for coordination
• Access model is get/put/del
• Plus range scans and versions
HBASE
• HBase is a key-value store on top of HDFS
• It is a NoSQL database
• Very thin layer over raw HDFS
– Data is grouped in a Table that has rows of data.
– Each row can have multiple ‘Column Families’
– Each ‘Column Family’ contains multiple columns.
– Each column name is the key, and it has its corresponding column value.
– Rows don't all need to have the same number of columns.
KEY VALUE COLUMN VALUE STORE
• A column family is a collection of columns
• One or more cells form a row that is addressed by a unique row key
• All rows are sorted alphabetically by row key
• Each column may have multiple versions, with each distinct value stored in a different cell
• Access to row data is atomic and includes any number of columns being read or written
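A minimal Java sketch of how a cell is addressed by row key, column family, qualifier, and timestamp, and how multiple versions can be requested (this assumes the 0.9x-era HBase client API; table and column names are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionedRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();               // reads hbase-site.xml from the classpath
    HTable table = new HTable(conf, "mytable");                     // hypothetical table name

    Get get = new Get(Bytes.toBytes("first"));                      // address the row by its row key
    get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("message"));   // column family + qualifier
    get.setMaxVersions(3);                                          // ask for up to 3 versions of this cell

    Result result = table.get(get);                                 // the row read is atomic
    for (KeyValue kv : result.raw()) {                              // one KeyValue per returned version
      System.out.println(kv.getTimestamp() + " -> " + Bytes.toString(kv.getValue()));
    }
    table.close();
  }
}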
INSTALLATION (1)
START Hadoop…
$ wget
http://ftp.twaren.net/Unix/Web/apache/hadoop/hbase/h
base-0.20.2/hbase-0.20.2.tar.gz
$ sudo tar -zxvf hbase-*.tar.gz -C /opt/
$ sudo ln -sf /opt/hbase-0.20.2 /opt/hbase
$ sudo chown -R $USER:$USER /opt/hbase
$ sudo mkdir /var/hadoop/
$ sudo chmod 777 /var/hadoop
CONNECTING TO HBASE
• Java client
– get(byte [] row, byte [] column, long timestamp, int versions);
• Non-Java clients
– Thrift server hosting HBase client instance
• Sample ruby, C++, & java (via thrift) clients
– REST server hosts HBase client
• TableInputFormat/TableOutputFormat for MapReduce
– HBase as MR source
• HBase Shell
– JRuby shell adds get, scan, and admin commands
– ./bin/hbase shell YOUR_SCRIPT
THRIFT
hbase(main):001:0> list
TABLE
0 row(s) in 0.5710 seconds
hbase(main):002:0>
• You now have three cells in three rows in your table. Notice that you didn't define the columns before you used them, nor did you specify what type of data you stored in each column. This is what the NoSQL crowd means when they say HBase is a schema-less database.
READING DATA
• HBase gives you two ways to read data: get and scan. The command to store cells was put; get is the complement of put, reading back a single row.
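A short Java sketch of the scan path (same era of client API; the table name and row-key range are hypothetical); get was shown earlier, so this only covers reading a range of rows:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");                           // hypothetical table

    Scan scan = new Scan(Bytes.toBytes("first"), Bytes.toBytes("third")); // [start row, stop row) range
    scan.addFamily(Bytes.toBytes("cf"));                                  // limit the scan to one column family

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result row : scanner) {                                        // rows come back sorted by row key
        System.out.println(row);
      }
    } finally {
      scanner.close();                                                    // always release the scanner
      table.close();
    }
  }
}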
MYTABLE (column family 'cf')
Row key | Column     | Value
first   | cf:message | hello HBase
second  | cf:foo     | 0
third   | cf:bar     | 3.14
SCAN AND GET
hbase(main):008:0> scan 'mytable'
ROW COLUMN+CELL
first column=cf:message, timestamp=1323483954406, value=hell
The shell shows you all the cells in the row, organized by column, with the value
associated at each timestamp. HBase can store multiple versions of each cell. The
default number of versions stored is three, but it’s configurable.
At read time, only the latest version is returned, unless otherwise specified.
TABLE ‘USERS’ CREATION
hbase(main):001:0> create 'users', 'info'
0 row(s) in 0.1200 seconds
hbase(main):002:0>
Shell describes table with two properties: table name and list of column families
(physical characteristics)
When we don't use the shell, a connection is opened through the Java client library instead. The code for connecting to this table is:
HTableInterface usersTable = new HTable("users");
JAVA CONNECTION
• The HTable constructor reads the default configuration information to locate HBase, similar to the way the shell did. It then locates the 'users' table you created earlier and gives you a handle to it. (Configuration parameters can be picked up by the Java client from the hbase-site.xml file on its classpath.) HBase client applications need only one configuration piece available to them to access HBase: the ZooKeeper quorum address.
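A minimal sketch of writing and then reading one row of the 'users' table from this example (0.9x-era client API; the row key, qualifiers, and values are made up for illustration):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class UsersExample {
  public static void main(String[] args) throws Exception {
    // Equivalent to the shorter new HTable("users") above, but with the configuration made explicit
    HTableInterface usersTable = new HTable(HBaseConfiguration.create(), "users");

    // Write one row: row key "user1", two columns in the 'info' column family
    Put put = new Put(Bytes.toBytes("user1"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Peter"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("email"), Bytes.toBytes("peter@example.com"));
    usersTable.put(put);

    // Read it back
    Get get = new Get(Bytes.toBytes("user1"));
    Result result = usersTable.get(get);
    byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
    System.out.println("name = " + Bytes.toString(name));

    usersTable.close();
  }
}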
WRITE PROCESS
• When something is written to HBase, it is first written to an in-memory store (the MemStore). Once the MemStore reaches a certain size, it is flushed to disk into a store file, also called an HFile (everything is also written immediately to a log file for durability). The store files created on disk are immutable. Periodically, store files are merged together by a process called compaction.
• This log file is called the write-ahead log (WAL), also referred to as the HLog.
• HFile is the underlying storage format for HBase. HFiles belong to a column family, and a column family can have multiple HFiles, but a single HFile can't hold data for multiple column families. There is one MemStore per column family.
• http://www.ngdata.com/visualizing-hbase-flushes-and-compactions/
NOTES ON DATA MODEL
• HBase schema consists of several Tables
• Each table consists of a set of Column Families
– Columns are not part of the schema
• HBase has Dynamic Columns
– Because column names are encoded inside the cells
– Different cells can have different columns
NOTES ON DATA MODEL (CONT’D)
• The version number can be user-supplied
– It does not even have to be inserted in increasing order
– Version numbers are unique within each key
• Tables can be very sparse
– Many cells are empty
• Keys are indexed as the primary key
(In the example table, the row has two anchor columns: cnnsi.com and my.look.ca)
HBASE PHYSICAL MODEL
• Each column family is stored in a separate file (called HTables)
• Key & Version numbers are replicated with each column family
• Empty cells are not stored
EXAMPLE
COLUMN FAMILIES
HBASE REGIONS
• Each HTable (column family) is partitioned horizontally into regions
– Regions are counterpart to HDFS blocks
THREE MAJOR COMPONENTS
• The HBaseMaster
– One master
• The HRegionServer
– Many region servers
HBASE COMPONENTS
• Region
– A subset of a table’s rows, like horizontal range partitioning
– Automatically done
• RegionServer (many slaves)
– Manages data regions
– Serves data for reads and writes (using a log)
• Master
– Responsible for coordinating the slaves
– Assigns regions, detects failures
– Admin functions
HBASE
ARCHITECTURE
Logical Data Model
A sparse, multi-dimensional, sorted map
Table A (example)
rowkey | column family | column qualifier | timestamp  | value
a      | cf1           | "bar"            | 1368394583 | 7
a      | cf1           | "bar"            | 1368394261 | "hello"
a      | cf1           | "foo"            | 1368394583 | 22
a      | cf1           | "foo"            | 1368394925 | 13.6
a      | cf1           | "foo"            | 1368393847 | "world"
Legend:
- Rows are sorted by rowkey.
- Within a row, values are located by column family and qualifier.
- Values also carry a timestamp; there can be multiple versions of a value.
- Within a column family, data is schemaless. Qualifiers and values are treated as arbitrary bytes.
Logical Architecture
Distributed, persistent partitions of a BigTable
[Diagram: Table A is split into Regions 1 through 4 over its sorted row keys (a through p); these regions are distributed across Region Server 7, Region Server 86, and Region Server 367, each of which also hosts regions of other tables (C, E, F, G, L, P).]
Legend:
- A single table is partitioned into Regions of roughly equal size.
- Regions are assigned to Region Servers across the cluster.
- Region Servers host roughly the same number of regions.
Physical Architecture
Distribution and Data Path
...
Legend:
- An HBase RegionServer is collocated with an HDFS DataNode.
- HBase clients communicate directly with Region Servers for sending and receiving data.
- HMaster manages Region assignment and handles DDL operations.
- Online configuration state is maintained in ZooKeeper.
- HMaster and ZooKeeper are NOT involved in data path.
HBASE
Anatomy of a
RegionServer
REGIONSERVER
• RegionServer: every write request goes to a RegionServer, which directs the request to the appropriate Region.
• Each region stores rows. Row data is separated by column families; data for a particular CF is stored in an HStore (a MemStore plus a set of HFiles). The MemStore lives in RegionServer memory, while the HFiles are in HDFS.
Storage Machinery
Implementing the data model
[Diagram: a RegionServer holds one HLog (WAL), one BlockCache, and multiple HRegions; each HRegion contains Stores made up of a MemStore and HFiles, all persisted on HDFS.]
Legend:
- A RegionServer contains a single WAL, single BlockCache, and multiple Regions.
- A Region contains multiple Stores, one for each Column Family.
- A Store consists of multiple StoreFiles and a MemStore.
- A StoreFile corresponds to a single HFile.
- HFiles and WAL are persisted on HDFS.
For what workloads
• It depends on how you tune it, but…
• HBase is good for:
– Large datasets
– Sparse datasets
– Loosely coupled (denormalized) records
– Lots of concurrent clients
• Try to avoid:
– Small datasets (unless you have lots of them)
– Highly relational records
– Schema designs requiring transactions *
MEMSTORE
• When something is written to HBase, it is first written to an in-
memory store (memstore), once this memstore reaches a certain
size, it is flushed to disk into a store file (everything is also written
immediately to a log file for durability). The store files created on
disk are immutable. Sometimes the store files are merged together,
this is done by a process called compaction.
• http://www.ngdata.com/visualizing-hbase-flushes-and-compactions/
LOGGING OPERATIONS
HBASE DEPLOYMENT
[Diagram: HBase deployment with a master node and several slave nodes.]
How does it integrate with my infrastructure?
• Horizontally scale application data
– Highly concurrent, read/write access
– Consistent, persisted shared state
– Distributed online data processing via Coprocessors (experimental)
• Gateway between online services and offline storage/analysis
– Staging area to receive new data
– Serve online, indexed “views” on datasets from HDFS
– Glue between batch (HDFS, MR1) and online (CEP, Storm) systems
What data semantics
• GET, PUT, DELETE key-value operations
• SCAN for queries
• Row-level write atomicity
• MapReduce integration
– Online API (today)
– Bulkload (today)
– Snapshots (coming)
What about operational concerns?
• Provision hardware with more spindles/TB
• Balance memory and IO for reads
– Contention between random and sequential access
– Configure Block size, BlockCache, compression, codecs based on access patterns
– Additional resources
– “HBase: Performance Tuners,” http://labs.ericsson.com/blog/hbase-performance-tuners
– “Scanning in HBase,” http://hadoop-hbase.blogspot.com/2012/01/scanning-in-hbase.html
• Balance IO for writes
– Configure C1 (compactions, region size, compression, pre-splits, &c.) based on
write pattern
– Balance IO contention between maintaining C1 and serving reads
– Additional resources
WRITE PROCESS
WRITE PROCESS-2
• If HBase goes down, the data that was not yet flushed from the MemStore to the HFile can be recovered by replaying the WAL, all handled under the hood. There is a single WAL per HBase server, shared by all tables (and their column families) served from that server.
• We don't recommend disabling the WAL unless you're willing to lose data when things fail.
MEMSTORE FLUSHING
• MemStore size which causes flushing is configured on two levels:
– per RS: % of heap occupied by memstores
– per table: size in MB of single memStore (per CF) of region
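A hedged sketch of setting those two thresholds programmatically (the property names below are the ones commonly used for HBase of that era, but they vary across versions, and the values here are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FlushTuning {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();

    // Per-RegionServer limit: fraction of the heap all memstores together may occupy
    conf.setFloat("hbase.regionserver.global.memstore.upperLimit", 0.4f);

    // Per-region (per-CF) limit: flush a memstore once it reaches this size (128 MB here)
    conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024);

    System.out.println(conf.get("hbase.hregion.memstore.flush.size"));
  }
}

In practice these settings would normally live in hbase-site.xml rather than in code; the snippet only illustrates which knob maps to which level.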
MEMSTORE
• Apart from solving the “non-ordered” problem, the MemStore also has other benefits, e.g.:
• It acts as an in-memory cache which keeps recently added data. This is useful in the numerous cases where recently written data is accessed more frequently than older data.
• Certain optimizations can be applied to rows/cells while they are held in memory, before they are written to the persistent store. E.g., when a CF is configured to keep only one version of a cell and the MemStore contains multiple updates for that cell, only the most recent one needs to be kept; older ones can be omitted (and never written to the HFile).
• An important thing to note is that every MemStore flush creates one HFile per CF.
MEMSTORE FLUSHES
HFILES COMPACTION
DATA LOCALITY
DETAILED ARCHITECTURE
READ PROCESS
ZOOKEEPER
• HBase depends on ZooKeeper and by default it manages a ZooKeeper instance as the authority on cluster state
ZOOKEEPER
• HBase depends on ZooKeeper
• By default HBase manages the ZooKeeper instance
– E.g., starts and stops ZooKeeper
• HMaster and HRegionServers register themselves with ZooKeeper
CREATING A TABLE
HBaseAdmin admin = new HBaseAdmin(config);                     // config points at the cluster (hbase-site.xml)
HColumnDescriptor[] column = new HColumnDescriptor[2];
column[0] = new HColumnDescriptor("columnFamily1:");           // one descriptor per column family
column[1] = new HColumnDescriptor("columnFamily2:");
HTableDescriptor desc = new HTableDescriptor(Bytes.toBytes("MyTable"));
desc.addFamily(column[0]);
desc.addFamily(column[1]);
admin.createTable(desc);                                       // creates "MyTable" with both families
OPERATIONS ON REGIONS: GET()
• Given a key return corresponding record
• For each value return the highest version
OPERATIONS ON REGIONS: SCAN()
GET()
Select value from table where key='com.apache.www' AND label='anchor:apache.com'

Row key            | Time Stamp | Column "anchor:"
"com.apache.www"   | t12        |
                   | t11        |
"com.cnn.www"      | t9         | "anchor:cnnsi.com" = "CNN"
                   | t8         | "anchor:my.look.ca" = "CNN.com"
                   | t6         |
                   | t5         |
                   | t3         |
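A rough Java equivalent of such a lookup (0.9x-era client API; the table name 'webtable' is hypothetical). The slide's query names com.apache.www, but the values shown belong to com.cnn.www, so the sketch uses that row:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class GetOneColumn {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "webtable");   // hypothetical table name

    Get get = new Get(Bytes.toBytes("com.cnn.www"));                      // row key
    get.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes("cnnsi.com"));   // family "anchor", qualifier "cnnsi.com"

    Result result = table.get(get);                                       // returns the highest version by default
    byte[] value = result.getValue(Bytes.toBytes("anchor"), Bytes.toBytes("cnnsi.com"));
    System.out.println(Bytes.toString(value));                            // expected: "CNN"
    table.close();
  }
}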
OPERATIONS ON REGIONS: PUT()
• Insert a new record (with a new key), Or
• Insert a record for an existing key
Implicit version number
(timestamp)
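A small sketch of both cases against the same hypothetical 'webtable'; if no timestamp is supplied, the server stamps the cell with the current time as its version:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutExample {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "webtable");

    // Implicit version: the server assigns the current time as the timestamp
    Put p1 = new Put(Bytes.toBytes("com.example.www"));                             // new row key
    p1.add(Bytes.toBytes("anchor"), Bytes.toBytes("example.org"), Bytes.toBytes("Example"));
    table.put(p1);

    // Explicit version: supply the timestamp (version) yourself
    long version = 1234567890L;
    Put p2 = new Put(Bytes.toBytes("com.cnn.www"));                                 // existing row key
    p2.add(Bytes.toBytes("anchor"), Bytes.toBytes("cnnsi.com"), version, Bytes.toBytes("CNN"));
    table.put(p2);

    table.close();
  }
}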
OPERATIONS ON REGIONS: DELETE()
ALTERING A TABLE
WHEN TO USE HBASE
• Random write, read or both
• Variable schema in each record
• Collections of data for each key
• Atomic control of per-key data
• Row access to each column family
• Access patterns well-known and simple
HBASE
• HBase uses HDFS for reliable storage
– Handles checksums, replication, failover
• Master manages cluster
• RegionServer manage data
• ZooKeeper is the ‘neural network’ for bootstrapping and
coordinating cluster
BLOOM FILTER
• Generated when an HFile is persisted; stored at the end of each file and loaded into memory
• Allows checks at the row or row+column level
• Can filter entire store files out of reads
• Useful when many misses are expected during reads (non-existent keys)
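A hedged sketch of enabling a row-level bloom filter when defining a column family (the setter and BloomType enum come from the 0.92/0.94-era client API; table and family names are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.regionserver.StoreFile;

public class BloomFilterExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HColumnDescriptor cf = new HColumnDescriptor("cf");
    cf.setBloomFilterType(StoreFile.BloomType.ROW);   // ROW filters on row key; ROWCOL on row + column

    HTableDescriptor desc = new HTableDescriptor("mytable");
    desc.addFamily(cf);
    admin.createTable(desc);                          // each flushed HFile for 'cf' will carry a bloom filter
  }
}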
HIVE HBASE INTEGRATION
• Reasons to use Hive on HBase:
– A lot of data sitting in HBase due to its usage in a real-time
environment, but never used for analysis
– Give access to data in HBase usually only queried through
MapReduce to people that don’t code (business analysts)
– When needing a more flexible storage solution, so that rows can be updated live by either a Hive job or an application, with the changes immediately visible to the other
• How it works:
– Hive can use tables that already exist in HBase or manage
its own ones, but they still all reside in the same HBase
instance
HBase is a special case here: its unique row key is mapped with the special :key column mapping
Not all the columns in the table need to be mapped
HBASE VS. HDFS
• Both are distributed systems that scale to hundreds or thousands of
nodes
HBASE VS. HDFS (CONT’D)
• HBase is designed to efficiently address the above points
– Fast record lookup
– Support for record-level insertion
– Support for updates (not in place)
HBASE VS. RDBMS
HBASE VS. HDFS
OOZIE
OOZIE OVERVIEW
Main Features
– Execute and monitor workflows in Hadoop
– Periodic scheduling of workflows
– Trigger execution by data availability
– HTTP and command line interface + Web console
Adoption
– ~100 users on mailing list since launch on github
– In production at Yahoo!, running >200K jobs/day
OOZIE WORKFLOW OVERVIEW
Purpose:
Execution of workflows on the Grid
[Diagram: the Oozie server exposes a WS API from a Tomcat web-app backed by a DB and submits work to Hadoop/Pig/HDFS.]
OOZIE WORKFLOW
Directed Acyclic Graph of Jobs
[Diagram: example workflow DAG with a start node, a fork/join around an M/R streaming job, a Pig job, and a Java Main action, a decision node (MORE loops back, ENOUGH continues), followed by an M/R job, a Java Main action, and an FS action before the end node.]
OOZIE WORKFLOW EXAMPLE
[Diagram: Start leads to the map-reduce action 'wordcount'; OK goes to End, Error goes to Kill.]
<workflow-app name='wordcount-wf'>
  <start to='wordcount'/>
  <action name='wordcount'>
    <map-reduce>
      <job-tracker>foo.com:9001</job-tracker>
      <name-node>hdfs://bar.com:9000</name-node>
      <configuration>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to='end'/>
    <error to='kill'/>
  </action>
  <kill name='kill'/>
  <end name='end'/>
</workflow-app>
OOZIE WORKFLOW NODES
• Control Flow:
– start/end/kill
– decision
– fork/join
• Actions:
– map-reduce
– pig
– hdfs
– sub-workflow
– java – run custom Java code
OOZIE WORKFLOW APPLICATION
Application Deployment:
$ hadoop fs -put wordcount-wf hdfs://bar.com:9000/usr/abc/wordcount
Job Execution:
$ oozie job -run -config job.properties
job: 1-20090525161321-oozie-xyz-W
MONITORING AN OOZIE WORKFLOW JOB
OOZIE COORDINATOR OVERVIEW
Purpose:
– Coordinated execution of workflows on the Grid
– Workflows are backwards compatible
[Diagram: clients call the Oozie WS API (Tomcat web-app); the Oozie Coordinator checks data availability and triggers the Oozie Workflow engine, which runs jobs on Hadoop.]
OOZIE APPLICATION LIFECYCLE
[Diagram: a Coordinator Job in the Oozie Coordinator Engine creates actions over time; each action starts a workflow (WF) in the Oozie Workflow Engine.]
USE CASE 1: TIME TRIGGERS
• Execute your workflow every 15 minutes (CRON)
ROLLING WINDOWS
<coordinator-app name="coord1" frequency="${1*HOURS}" ...>
  <datasets>
    <dataset name="logs" frequency="15" initial-instance="2009-01-01T00:00Z">
      <uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="inputLogs" dataset="logs">
      <start-instance>${current(-3)}</start-instance>
      <end-instance>${current(0)}</end-instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
      <configuration>
        <property> <name>inputData</name><value>${dataIn('inputLogs')}</value> </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
SLIDING WINDOWS
• Access last 24 hours of data, and roll them up every hour.
[Timeline diagram: each hourly run consumes the preceding 24 hours of data, so consecutive runs overlap and the window slides forward one hour at a time.]
OOZIE COORDINATOR APPLICATION
Application Deployment:
$ hadoop fs -put coord_job hdfs://bar.com:9000/usr/abc/coord_job
Job Execution:
$ oozie job -run -config job.properties
job: 1-20090525161321-oozie-xyz-C
MONITORING AN OOZIE COORDINATOR JOB
To Contribute:
• https://github.com/yahoo/oozie/wiki/How-To-Contribute
SQOOP
WHAT IS SQOOP
• Tool to transfer data from relational databases
– Teradata, MySQL, PostgreSQL, Oracle, Netezza
• To Hadoop ecosystem
– HDFS (text, sequence file), Hive, HBase, Avro
• And vice versa
• Based on Connectors
– Responsible for Metadata lookups, and Data Transfer
– Majority of connectors are JDBC based
– Non-JDBC (direct) connectors for optimized data transfer
• Connectors responsible for all supported functionality
– HBase Import, Avro Support
• The canonical use case is performing a nightly dump of all the data in a
transactional relational database into Hadoop for offline analysis. The
popularity of Sqoop in enterprise systems confirms that Sqoop does
bulk transfer admirably.
What is Sqoop?
Traditional ETL
[Diagram: application data flows through a single traditional ETL pipeline into the target data store.]
What is Sqoop?
A very scalable, different paradigm
[Diagram: data from many application databases is transferred in parallel into the target data store.]
WHY SQOOP?
• Efficient/Controlled resource utilization
– Concurrent connections, Time of operation
• Datatype mapping and conversion
– Automatic, and User override
• Metadata propagation
– Sqoop Record
– Hive Metastore
– Avro
SQOOP
MAHOUT
DATA ANALYTICS
• Include machine learning and data mining tools
– Analyze/mine/summarize large datasets
– Extract knowledge from past data
– Predict trends in future data
DATA MINING & MACHINE LEARNING
TOOLS & ALGORITHMS
• Collaborative Filtering
• Clustering Techniques
• Classification Algorithms
• Association Rules
• Frequent Pattern Mining
• Statistical libraries (Regression, SVM, …)
• Others…
COMMON USE CASES
RECOMMENDATIONS
• Predict what the user likes based on
– His/Her historical behavior
– Aggregate behavior of people similar to him
IN OUR CONTEXT…
OTHER PROJECTS
• Apache Mahout
– Open-source package on Hadoop for data mining and
machine learning
• Revolution R (R-Hadoop)
– Extensions to R package to run on Hadoop
APACHE MAHOUT
• Apache Software Foundation project
• Create scalable machine learning libraries
• Why Mahout? Many Open Source ML libraries either:
– Lack Community
– Lack Documentation and Examples
– Lack Scalability
– Or are research-oriented
GOAL 1: MACHINE LEARNING
[Diagram: Applications and Examples sit on top of algorithm libraries (Genetic, Frequent Pattern Mining, Classification, Clustering, Recommenders), which build on Math (Vectors/Matrices/SVD), Utilities (Lucene/Vectorizer), Collections (primitives), and Apache Hadoop.]
GOAL 2: SCALABILITY
• Be as fast and efficient as possible given the intrinsic design of the algorithm
• Most Mahout implementations are Map Reduce enabled
• Work in Progress
INTERESTING PROBLEMS
• Cluster users talking about the Faculty Summit based on what they are tweeting
– Can you suggest people to network with?
• Take user-generated tags that people have given to musicians and cluster them
– Use the clusters to pre-populate the suggest-box and autocomplete tags as users type
• Cluster movies based on abstract and description and show related movies
– Note how this can augment recommendations or collaborative filtering algorithms
MAHOUT PACKAGE
C1: COLLABORATIVE FILTERING
C2: CLUSTERING
• Group similar objects together
C3: CLASSIFICATION
FPM: FREQUENT PATTERN MINING
• Find the frequent itemsets
– <milk, bread, cheese> are sold frequently together
O: OTHERS
• Outlier detection
• Math library
– Vectors, matrices, etc.
• Noise reduction
WE FOCUS ON…
• Clustering: K-Means
– Technique logic
• Frequent Pattern Mining: Apriori
– How to implement in Hadoop
K-MEANS ALGORITHM
K-MEANS ALGORITHM
• Step 1: Select K points at random (Centers)
• Step 2: For each data point, assign it to the closest center
– Now we formed K clusters
• Step 3: For each cluster, re-compute the centers
– E.g., in the case of 2D points
• X: average over all x-axis points in the cluster
• Y: average over all y-axis points in the cluster
• Step 4: If the new centers are different from the old centers
(previous iteration) Go to Step 2
K-MEANS IN MAPREDUCE
• Input
– Dataset (set of points in 2D) --Large
– Initial centroids (K points) --Small
• Map Side
– Each map reads the K-centroids + one block from dataset
– Assign each point to the closest centroid
– Output <centroid, point>
K-MEANS IN MAPREDUCE (CONT’D)
• Reduce Side
– Gets all points for a given centroid
– Re-compute a new centroid for this cluster
– Output: <new centroid>
• Iteration Control
– Compare the old and new set of K-centroids
• If similar Stop
• Else
– If max iterations has reached Stop
– Else Start another Map-Reduce Iteration
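A compact sketch of one K-Means iteration as a Hadoop MapReduce job using the org.apache.hadoop.mapreduce API (the class names, the input format of one "x,y" point per line, and the convention of passing centroids through the job configuration are all assumptions made for illustration):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// One K-Means iteration: assign each point to its nearest centroid, then recompute centroids.
public class KMeansIteration {

  // Parses "x,y" into a 2-element array.
  static double[] parsePoint(String line) {
    String[] parts = line.split(",");
    return new double[] { Double.parseDouble(parts[0]), Double.parseDouble(parts[1]) };
  }

  public static class AssignMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    private double[][] centroids;   // the K current centroids; small enough to hold in memory

    @Override
    protected void setup(Context context) {
      // Centroids are passed as "x,y;x,y;..." in the job configuration (hypothetical convention)
      String[] encoded = context.getConfiguration().get("kmeans.centroids").split(";");
      centroids = new double[encoded.length][];
      for (int i = 0; i < encoded.length; i++) centroids[i] = parsePoint(encoded[i]);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      double[] p = parsePoint(value.toString());
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int i = 0; i < centroids.length; i++) {            // find the closest centroid
        double dx = p[0] - centroids[i][0], dy = p[1] - centroids[i][1];
        double dist = dx * dx + dy * dy;
        if (dist < bestDist) { bestDist = dist; best = i; }
      }
      context.write(new IntWritable(best), value);             // emit <centroid index, point>
    }
  }

  public static class RecomputeReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable centroid, Iterable<Text> points, Context context)
        throws IOException, InterruptedException {
      double sumX = 0, sumY = 0;
      long count = 0;
      for (Text t : points) {                                  // average all points assigned to this centroid
        double[] p = parsePoint(t.toString());
        sumX += p[0]; sumY += p[1]; count++;
      }
      context.write(centroid, new Text((sumX / count) + "," + (sumY / count)));
    }
  }
}

A driver (not shown) would compare the new centroids with the previous ones and, if they differ and the iteration limit has not been reached, launch another job with the updated centroids.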
K-MEANS OPTIMIZATIONS
• Use of Combiners
– Similar to the reducer
– Computes for each centroid the local sums (and counts) of the assigned points
– Sends to the reducer <centroid, <partial sums>>
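A sketch of such a combiner: it emits "sumX,sumY,count" per centroid so far less data crosses the network. Because a combiner may be applied to its own output, this version accepts both raw "x,y" points and already-combined partial sums, and the reducer in the sketch above would need the same parsing change (this is an illustrative adaptation, not a drop-in class):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Combiner for the K-Means job: pre-aggregates points into partial sums per centroid.
public class PartialSumCombiner extends Reducer<IntWritable, Text, IntWritable, Text> {
  @Override
  protected void reduce(IntWritable centroid, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    double sumX = 0, sumY = 0;
    long count = 0;
    for (Text t : values) {
      String[] parts = t.toString().split(",");
      sumX += Double.parseDouble(parts[0]);
      sumY += Double.parseDouble(parts[1]);
      count += (parts.length == 3) ? Long.parseLong(parts[2]) : 1;   // raw point or partial sum
    }
    context.write(centroid, new Text(sumX + "," + sumY + "," + count));
  }
}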
NAÏVE BAYES CLASSIFIER
• In simple terms, a naive Bayes classifier assumes that the presence
or absence of a particular feature is unrelated to the presence or
absence of any other feature, given the class variable. For example, a
fruit may be considered to be an apple if it is red, round, and about
3" in diameter. A naive Bayes classifier considers each of these
features to contribute independently to the probability that this fruit
is an apple, regardless of the presence or absence of the other
features.
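A tiny illustration of that independence assumption with made-up numbers (all probabilities below are hypothetical, purely to show how the per-feature factors multiply):

public class NaiveBayesToy {
  public static void main(String[] args) {
    // Hypothetical learned statistics for the class "apple"
    double pApple = 0.30;             // prior P(apple)
    double pRedGivenApple = 0.80;     // P(red | apple)
    double pRoundGivenApple = 0.90;   // P(round | apple)
    double pSizeGivenApple = 0.70;    // P(about 3" in diameter | apple)

    // Naive Bayes treats the features as independent given the class,
    // so the (unnormalized) score is just the product of the factors.
    double score = pApple * pRedGivenApple * pRoundGivenApple * pSizeGivenApple;
    System.out.println("score(apple | red, round, ~3\") = " + score);   // 0.1512
  }
}

The same product would be computed for every other class, and the class with the highest score wins.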
NAÏVE BAYES CLASSIFIER
• Given a dataset (training data), we learn (build) a statistical
model
– This model is called “Classifier”
NAÏVE BAYES CLASSIFIER: EXAMPLE
• Example
Three features
FREQUENT PATTERN MINING
• Very common problem in Market-Basket applications
FREQUENT PATTERN MINING
• Given a set of items I ={milk, bread, jelly, …}
• Given a set of transactions where each transaction contains
subset of items
– t1 = {milk, bread, water}
– t2 = {milk, nuts, butter, rice}
EXAMPLE
• {Bread} 80%
• {PeanutButter} 60%
• {Bread, PeanutButter} 60%
These percentages are called the “Support”
CAN WE OPTIMIZE??
• {Bread} 80%
• {PeanutButter} 60%
• {Bread, PeanutButter} 60%
These percentages are called the “Support”
Property
For itemset S={X, Y, Z, …} of size n to be frequent, all its subsets of
size n-1 must be frequent as well
HOW TO FIND FREQUENT ITEMSETS
• Naïve Approach
– Enumerate all possible itemsets and then count
each one
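A sketch of that naive approach in Java: enumerate candidate itemsets and count how many transactions contain each one (the transactions, candidates, and minimum-support threshold below are made up, chosen so the support values match the example above):

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NaiveSupportCount {
  public static void main(String[] args) {
    // Hypothetical transactions
    List<Set<String>> transactions = Arrays.asList(
        new HashSet<>(Arrays.asList("bread", "peanutbutter", "milk")),
        new HashSet<>(Arrays.asList("bread", "jelly")),
        new HashSet<>(Arrays.asList("bread", "peanutbutter")),
        new HashSet<>(Arrays.asList("bread", "peanutbutter", "jelly")),
        new HashSet<>(Arrays.asList("milk", "water")));

    // Candidate itemsets to test (a real implementation would enumerate these systematically)
    List<Set<String>> candidates = Arrays.asList(
        new HashSet<>(Arrays.asList("bread")),
        new HashSet<>(Arrays.asList("peanutbutter")),
        new HashSet<>(Arrays.asList("bread", "peanutbutter")));

    double minSupport = 0.6;   // itemsets below this fraction of transactions are not "frequent"
    for (Set<String> itemset : candidates) {
      long count = transactions.stream().filter(t -> t.containsAll(itemset)).count();
      double support = (double) count / transactions.size();
      System.out.printf("%s support=%.0f%% frequent=%b%n", itemset, support * 100, support >= minSupport);
    }
  }
}

Enumerating every possible itemset explodes combinatorially, which is exactly what the Apriori property above is used to avoid.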
RESOURCES
• http://mahout.apache.org
• dev@mahout.apache.org - Developer mailing list
• user@mahout.apache.org - User mailing list
• Check out the documentations and wiki for quickstart
• http://svn.apache.org/repos/asf/mahout/trunk/ Browse Code
PROVISIONING,
MANAGING,
MONITORING
CLUSTER
WHAT IS NAGIOS
• “Nagios is an enterprise-class monitoring solution for hosts, services, and networks released under an Open Source license.”
“Nagios is a popular open source computer system and network
monitoring application software. It watches hosts and services that
you specify, alerting you when things go bad and again when they
get better.”
CACTI
• Performance Graphing System
• Slick Web Interface
• Template System for Graph Types
• Pluggable
– SNMP (Simple Network Management Protocol) input
– Shell script /external program
CACTI
NAGIOS
• Answers “IS IT RUNNING?”
• Text based Configuration
CACTI
• Answers “HOW WELL IS IT RUNNING?”
• Web Based configuration
– php-cli tools
AMBARI
• Extend core capabilities to include the critical tasks associated with provisioning and operating Hadoop clusters.
• Enable insight into job performance and reduce the burden on specialized Hadoop skills and knowledge.
• Expose integration and customization points so Hadoop can interoperate with existing operational tooling.
MORE DATABASES
• Ambari to support Postgres, MySQL or Oracle
• Configure Hive and Oozie to use MySQL or Oracle
OTHER GOODIES
JOB DIAGNOSTICS
• Enhanced swimlane visualizations
• See job DAG with task overlay
• See task scatter plot across jobs
ELECTRONIC ARTS ON HADOOP AND EC2
WHAT WE HAVE DONE!
• Setup EC2, requested machines, configured firewalls and
passwordless SSH;
• Downloaded Java and Hadoop;
• Configured HDFS and MapReduce and pushed configuration around
the cluster;
• Started HDFS and MapReduce;
• Submitted the job, ran it successfully, and viewed the output.
1. START EC2 SERVERS
• Amazon Web Services @ http://aws.amazon.com/;
• Used the ‘classic wizard’, created three micro instances running the
latest 64 bit Ubuntu Server;
• A key pair (.pem file) either already exists or you create one; it is used to connect to the servers and to navigate around within the cluster
2. NAME EC2 SERVERS
• For reference, instances are named Master, Slave 1, and Slave 2
within the EC2 console once they are running;
• Note down the host names for each of the 3 instances in the bottom
part of the management console. We will use these to access the
servers:
PUTTY CONFIGURATION
WEB INTERFACES
MAPREDUCE PROGRAM
Thank You!
Navin Chandra