
FACULTY SUMMIT ON BIG

DATA
Navin Chandra
OUTLINE
• Pig
• Hive
• Hcatalog
• HBase
• Oozie
• Sqoop
• Mahout
• Provisioning, Managing, Monitoring Cluster
• Real-time Use Case Scenarios
APACHE PIG
BACKGROUND
• Yahoo! was the first big adopter of Hadoop.
• Hadoop gained popularity in the company quickly.
• Yahoo! Research developed Pig to address the need for a higher
level language.
• Roughly 30% of Hadoop jobs run at Yahoo! are Pig jobs.
APACHE PIG
• Pig 0.11.1 - a platform for analyzing large data sets that consists of a
high-level language for expressing data analysis programs, coupled
with infrastructure for evaluating these programs.
• Pig's infrastructure layer consists of
– a compiler that produces sequences of Map-Reduce programs
• Pig's language layer currently consists of a textual language called Pig
Latin, which has the following key properties:
• Ease of programming
• Optimization opportunities
• Extensibility

PIG
• What is Pig?
– An open-source high-level dataflow system
– Provides a simple language for queries and data manipulation,
Pig Latin, that is compiled into map-reduce jobs that are run on
Hadoop
– Pig Latin combines the high-level data manipulation constructs
of SQL with the procedural programming of map-reduce
• Why is it important?
– Companies and organizations like Yahoo, Google and Microsoft
are collecting enormous data sets in the form of click streams,
search logs, and web crawls
– Some form of ad-hoc processing and analysis of all of this
information is required
PIG

• Pig provides a higher level language, Pig Latin, that:


– Increases productivity.
– 10 lines of Pig Latin ≈ 200 lines of Java.
– Opens the system to non-Java programmers.
– Provides common operations like join, group, filter, sort
EXISTING SOLUTION

• Parallel database products (ex: Teradata)


– Expensive at web scale
– Data analysis programmers find the declarative SQL
queries to be unnatural and restrictive
• Raw map-reduce
– Complex n-stage dataflows are not supported; joins
and related tasks require workarounds or custom
implementations
– Resulting code is difficult to reuse and maintain; shifts
focus and attention away from data analysis
Language Features
• Several options for user-interaction
– Interactive mode (console)
– Batch mode (prepared script files containing Pig Latin commands)
– Embedded mode (execute Pig Latin commands within a Java program)
• Built primarily for scan-centric workloads and read-only data
analysis
– Easily operates on both structured and schema-less, unstructured data
– Transactional consistency and index-based lookups not required
– Data curation and schema management can be overkill
• Flexible, fully nested data model
• Extensive UDF support
– Currently must be written in Java
– Can be written for filtering, grouping, per-tuple processing, loading
and storing
PIG OPERATORS
RUNNING PIG
• You can execute Pig Latin statements:
– Using grunt shell or command line
$ pig ... - Connecting to ...
grunt> A = load 'data';
grunt> B = ... ;
– In local mode or Hadoop MapReduce mode
$ pig myscript.pig              (command line - batch, MapReduce mode)
$ pig -x local myscript.pig     (command line - batch, local mode)
– Either interactively or in batch

EXAMPLE 1
$ cp /etc/passwd .
$ ls passwd
$ cat passwd              (a system file with fields separated by colons :)
$ pig -x local
grunt> A = LOAD 'passwd' USING PigStorage(':');
grunt> DUMP A;
grunt> B = FOREACH A GENERATE $0;
grunt> DUMP B;
grunt> STORE B INTO 'Passout';     (the Passout directory is created)
grunt> quit;
$ root@ubuntu:/home/Navin# cd Passout
$ root@ubuntu:/home/Navin/Passout# cat part-m-00000
Data Model
• Supports four basic types
– Atom: a simple atomic value (int, long, double, string)
• ex: ‘Peter’
– Tuple: a sequence of fields that can be any of the data
types
• ex: (‘Peter’, 14)
– Bag: a collection of tuples of potentially varying
structures, can contain duplicates
• ex: {(‘Peter’), (‘Bob’, (14, 21))}
– Map: an associative array, the key must be a chararray
but the value can be any type
DATA MODEL

• By default Pig treats undeclared fields as bytearrays (a collection of
uninterpreted bytes)
• Can infer a field’s type based on:
– Use of operators that expect a certain type of field
– UDFs with a known or explicitly set return type
– Schema information provided by a LOAD function
or explicitly declared using an AS clause
• Type conversion is lazy
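
For illustration, a hedged Pig Latin sketch of declaring these types explicitly with an AS clause at load time (the file name 'students' and its layout are assumptions, not from the slides):

-- assumed tab-separated input; field names and file are hypothetical
A = LOAD 'students' USING PigStorage('\t')
    AS (name:chararray,                        -- atom
        details:tuple(age:int, gpa:float),     -- tuple
        courses:bag{t:(course:chararray)},     -- bag of tuples
        scores:map[]);                         -- map (chararray keys)
B = FOREACH A GENERATE name, details.age, scores#'math';
DUMP B;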
USAGE
• Web log processing.
• Data processing for web search platforms.
• Ad hoc queries across large data sets.
• Rapid prototyping of algorithms for processing large data sets.
PROGRAM/FLOW ORGANIZATION
• A LOAD statement reads data from the file system.
• A series of "transformation" statements process the data.
• A STORE statement writes output to the file system; or, a DUMP
statement displays output to the screen.
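
A minimal Pig Latin sketch of this flow (the file name 'data' and the field layout are made up for illustration):

A = LOAD 'data' USING PigStorage(',') AS (id:int, amount:double);   -- read from the file system
B = FILTER A BY amount > 100.0;                                     -- transformation
C = GROUP B BY id;                                                  -- transformation
D = FOREACH C GENERATE group AS id, SUM(B.amount) AS total;         -- transformation
STORE D INTO 'output';               -- write output (or: DUMP D; to display on screen)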

Pig Latin vs. SQL
• Pig Latin is procedural (dataflow programming model)
– Step-by-step query style is much cleaner and easier to write and
follow than trying to wrap everything into a single block of SQL
Pig Latin vs. SQL (continued)
• Lazy evaluation (data not processed prior to STORE command)
• Data can be stored at any point during the pipeline
• An execution plan can be explicitly defined
• Pipeline splits are supported
PIG LATIN

• FOREACH ... GENERATE (per-tuple processing)
– Iterates over every input tuple in the bag, producing one output
tuple each, which allows an efficient parallel implementation
– Expressions within the GENERATE clause can take the form of any
Pig Latin expression (constants, field references, function calls, ...)
INTERPRETATION
• In general, Pig processes Pig Latin statements as follows:
– First, Pig validates the syntax and semantics of all statements.
– Next, if Pig encounters a DUMP or STORE, Pig will execute the
statements.

A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
DUMP B;
(John)
(Mary)
(Bill)
(Joe)
• The STORE operator would instead write the result to a file

PIG
MULTIPLE OUTPUTS

raw = LOAD . . .
. . .
SPLIT raw INTO x IF $0 > 4, y IF $1 == 'FOO', z IF $0 == 2 AND $3 < 2;
STORE x INTO 'x_out';
STORE y INTO 'y_out';
STORE z INTO 'z_out';
MULTIPLE INPUTS

A = LOAD . . .
B = LOAD . . .
C = LOAD . . .
A = FILTER . . .
B = FOREACH . . .
C = FILTER . . .
C = FOREACH . . .

. . .
PARTITIONER EXAMPLE FOR NO OF
BOOKS
• grunt> A = load 'tabcollege.txt' using PigStorage('\t') as
(user:chararray, age:int, college:chararray, nbooks:int);
• grunt> B = group A by college;
• grunt> C = FOREACH B { D = ORDER A by nbooks DESC; E = LIMIT
D 1; GENERATE FLATTEN (E); }
• grunt> DUMP C;
COMPUTING AVERAGE NUMBER OF PAGE VISITS BY USER

• Logs of users visiting web pages consist of (user, url, time)
• Fields of the log are tab-separated and in text format
• Basic idea:
– Load the log file
– Group based on the user field
– Count each group
– Calculate the average across all users

user url time


Amy www.cnn.com 8:00
Amy www.crap.com 8:05
Amy www.myblog.com 10:00
Amy www.flickr.com 10:05
Fred cnn.com/index.htm 12:00
Fred cnn.com/index.htm 1:00
HOW TO PROGRAM THIS IN PIG LATIN

VISITS = LOAD 'visits' AS (user, url, time);
DUMP VISITS;

USER_VISITS = GROUP VISITS BY user;

USER_CNTS = FOREACH USER_VISITS GENERATE group AS user, COUNT(VISITS) AS numvisits;

ALL_CNTS = GROUP USER_CNTS ALL;

AVG_CNT = FOREACH ALL_CNTS GENERATE AVG(USER_CNTS.numvisits);


IDENTIFY USERS WHO VISIT “GOOD
PAGES”
• Good pages are pages with a pagerank greater than 0.5; we want the users
whose visited pages have an average pagerank above that threshold
• Basic idea
– Join the two tables on url
– Group based on user
– Calculate the average pagerank of each user's visited pages
– Filter users whose average pagerank is greater than 0.5
– Store the result

visits:
user  url                time
Amy   www.cnn.com        8:00
Amy   www.crap.com       8:05
Amy   www.myblog.com     10:00
Amy   www.flickr.com     10:05
Fred  cnn.com/index.htm  12:00
Fred  cnn.com/index.htm  1:00

pages:
url             pagerank
www.cnn.com     0.9
www.flickr.com  0.9
www.myblog.com  0.7
www.crap.com    0.2
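
A hedged Pig Latin sketch of this pipeline, assuming the two inputs live in tab-separated files named 'visits' and 'pages' (the 'pages' name is an assumption for illustration):

VISITS     = LOAD 'visits' AS (user, url, time);
PAGES      = LOAD 'pages'  AS (url, pagerank);
VP         = JOIN VISITS BY url, PAGES BY url;          -- join the tables on url
BY_USER    = GROUP VP BY user;                          -- group based on user
USER_PR    = FOREACH BY_USER GENERATE group AS user,
                 AVG(VP.pagerank) AS avgpr;             -- average pagerank of visited pages
GOOD_USERS = FILTER USER_PR BY avgpr > 0.5;             -- keep users above the threshold
STORE GOOD_USERS INTO 'good_users';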
FRONT END ARCHITECTURE
Pig Latin program → Query Parser → Logical Plan
→ Semantic Checking → Logical Plan
→ Logical Optimizer → Optimized Logical Plan
→ Logical-to-Physical Translator → Physical Plan
→ Physical-to-MapReduce Translator → MapReduce Plan
→ MapReduce Launcher (creates a job jar to be submitted to the Hadoop cluster)

NEW FEATURES IN THE PIPELINE
• PigPen, an Eclipse plug-in for developing and testing Pig Latin
– Currently available for use in M/R mode
• Performance
– Expanding cases where combiner is used
– Map side join
– Improving order by
• Sampler
• Reducing number of M/R jobs from 3 to 2
– Efficient multi-store queries
• Better error handling
PIG RESOURCES

• Pig is an Apache subproject


• Documentation
– General info is at: http://wiki.apache.org/pig/
– Pig UDF : http://wiki.apache.org/pig/UDFManual
• Mailing lists
– pig-user@hadoop.apache.org
– pig-dev@hadoop.apache.org
• Source code
– https://svn.apache.org/repos/asf/hadoop/pig/trunk
• Code submissions
– https://issues.apache.org/jira/browse/PIG-*
WHAT IS HIVE?

A data warehouse infrastructure built on top of Hadoop


for providing data summarization, query, and analysis.
– ETL.
– Structure.
– Access to different storage.
– Query execution via MapReduce.
Key Building Principles:
– SQL is a familiar language
– Extensibility – Types, Functions, Formats, Scripts
– Performance
HIVE, WHY?

• Need a Multi Petabyte Warehouse


• Files are insufficient data abstractions
– Need tables, schemas, partitions, indices
• SQL is highly popular
• Need for an open data format
– RDBMS have a closed data format
– flexible schema
• Hive is a Hadoop subproject!
DATA UNITS
Databases.
Tables.
Partitions.
Buckets.
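
A hedged HiveQL sketch touching all four units; the database, table and column names are made up for illustration:

CREATE DATABASE weblogs;                               -- database
USE weblogs;
CREATE TABLE page_views (                              -- table
    userid STRING, url STRING, visit_time STRING)
PARTITIONED BY (dt STRING)                             -- partitions: one directory per day
CLUSTERED BY (userid) INTO 32 BUCKETS                  -- buckets: hash of userid into 32 files
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';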
HIVE CHARACTERISTICS
Batch oriented
Data Warehouse focused
Entire data sets (table scans)
Generates/runs MapReduce (not faster than MR!)
Limited indexing, no stats, no cache
Programmer is the optimizer
Append only (mostly)
CONCLUSION
An easy way to process large-scale data.
Supports SQL-based queries.
Provides user-defined interfaces to extend programmability.
Files in HDFS are immutable. Typical uses:
– Log processing: daily reports, user activity measurement
– Data/text mining: machine learning (training data)
– Business intelligence: advertising delivery, spam detection
HCATALOG
WHAT IS IT?
• HCatalog, in brief, is a "table management and storage management layer
for Apache Hadoop" which:
– enables Pig, MapReduce, and Hive users to easily share data on the grid
– provides a table abstraction for a relational view of data in HDFS
– ensures format indifference (e.g., RCFile, text files, sequence files)
– provides a notification service when new data becomes available
Without HCatalog
• Record format:  MapReduce – key-value pairs;  Pig – tuple;  Hive – record
• Data model:     MapReduce – user defined;  Pig – int, float, string, bytes, maps, tuples, bags;  Hive – int, float, string, maps, structs, lists
• Schema:         MapReduce – encoded in app;  Pig – declared in script or read by loader;  Hive – read from metadata
• Data location:  MapReduce – encoded in app;  Pig – declared in script;  Hive – read from metadata
• Data format:    MapReduce – encoded in app;  Pig – declared in script;  Hive – read from metadata
With HCatalog
• Record format:  MapReduce + HCatalog – record;  Pig + HCatalog – tuple;  Hive – record
• Data model:     MapReduce + HCatalog – int, float, string, maps, structs, lists;  Pig + HCatalog – int, float, string, bytes, maps, tuples, bags;  Hive – int, float, string, maps, structs, lists
• Schema:         read from metadata in all three
• Data location:  read from metadata in all three
• Data format:    read from metadata in all three
HOW DOES IT WORK?
• Pig
– HCatLoader + HCatStorer interface
• MapReduce
– HCatInputFormat + HCatOutputFormat interface
• Hive
– No interface necessary
– Direct access to metadata
• Notifications when data becomes available
Data & Metadata Access With HCatalog
[Diagram: MapReduce accesses data through HCatInputFormat/HCatOutputFormat, Pig through
HCatLoader/HCatStorer, and Hive through its SerDe and InputFormat/OutputFormat; all go
through the Metastore client to the Metastore, while external systems use the REST
interface. The data itself lives in HDFS.]
HIVE AND PIG
[Diagram: without HCatalog, MapReduce and Pig read HDFS files directly through their own
InputFormat/OutputFormat and Load/Store functions, while only Hive goes through the SerDe
and Metastore client to the Metastore.]
HIVE ODBC/JDBC TODAY
[Diagram: JDBC and ODBC clients need Hive code on the client side and connect to a single
Hive server in front of Hadoop.]
Issues:
• Not concurrent
• Not secure
• Not scalable
• Open source version not easy to use
MAKING YOUR STRUCTURED DATA
AVAILABLE TO THE MAPREDUCE ENGINE

[Diagram: MapReduce, Pig and Hive all go through HCatalog, which can front HDFS, HBase,
or an MPP store.]
HCATALOG ARCHITECTURE

[Diagram: HCatLoader/HCatStorer and HCatInputFormat/HCatOutputFormat, together with the
CLI and the notification service, sit on top of the Hive MetaStore client, which talks
through a generated Thrift client to the Hive MetaStore, backed by an RDBMS.]
WHERE IS THE DATA
[Diagram: Pig, Hive and MapReduce each talk to the storage layer directly.]

USING HCATALOG
[Diagram: with HCatalog in place, Pig, Hive and MapReduce all go through HCatalog to
reach the storage layer.]
Problem: Data in a variety of formats
• Data files may be organized in different formats
• Data files may contain different formats in different partitions
[Diagram: storage layer — HDFS, HBase, etc.]
Solution: HCat provides a common abstraction
• Data is registered with a schema
• HCat normalizes data for the application
[Diagram: Hadoop Application → HCatalog → Storage]
HCATALOG EXAMPLE
Templeton Specific Support
Move data directly into/out-of HDFS through WebHDFS

Webservice calls to HCatalog


– Register table relationships for data (e.g., createTable, createDatabase)
– Adjust tables (e.g., AlterTable)
– Look at statistics (e.g., ShowTable)

Webservice calls to start work


– MapReduce, Pig, Hive
– Poll for job status
– Notification URL when job completes (optional)

Stateless Server
– Horizontally scale for load
– Configurable for HA
– Currently Requires ZooKeeper to track job status info
GETTING INVOLVED

Incubator site : http://incubator.apache.org/hcatalog

User list: hcatalog-user@incubator.apache.org

Dev list: hcatalog-dev@incubator.apache.org


HBASE
HBASE

• HBase is a database: the Hadoop database
• Not an RDBMS and does not follow SQL; it can store an integer in one
row and a string in another in the same column
• HBase is designed to run on a cluster of computers instead of a single
computer. The cluster can be built using commodity hardware; HBase
scales horizontally as you add more machines to the cluster.
• RDBMS, by contrast, follow ACID properties (Atomicity, Consistency,
Isolation and Durability) with an upfront schema definition
HBASE
• HBase does qualify as a NoSQL store. It provides a key-value API
• Strong consistency, so clients can see data immediately after it is written
• HBase is designed for terabytes to petabytes of data, so it optimizes
for this use case.
Use cases: Mozilla crash reports, Facebook, Twitter, StumbleUpon
HBASE: PART OF HADOOP’S ECOSYSTEM

HBase is built on top of HDFS
HBase files are internally stored in HDFS
HBASE
• HBase is a distributed column-oriented data store built on top of
HDFS
• HBase is an Apache open source project whose goal is to provide
storage for the Hadoop Distributed Computing
• Data is logically organized into tables, rows and columns
• Key/value column family store
• Data stored in HDFS
• ZooKeeper for coordination
• Access model is get/put/del
• Plus range scans and versions

HBASE
• HBase is a key-value store on top of HDFS
• It is a NoSQL database
• Very thin layer over raw HDFS
– Data is grouped in a Table that has rows of data.
– Each row can have multiple 'Column Families'
– Each 'Column Family' contains multiple columns.
– Each column name is the key and it has its corresponding column value.
– Each row doesn't need to have the same number of columns
KEY VALUE COLUMN VALUE STORE
• A column family is a collection of columns
• One or more cells form a row that is addressed by a unique row key
• All rows are sorted lexicographically by row key
• Each column may have multiple versions, with each distinct value stored
in a different cell
• Access to row data is atomic and includes any number of columns being
read or written
INSTALLATION (1)
START Hadoop…

$ wget
http://ftp.twaren.net/Unix/Web/apache/hadoop/hbase/h
base-0.20.2/hbase-0.20.2.tar.gz
$ sudo tar -zxvf hbase-*.tar.gz -C /opt/
$ sudo ln -sf /opt/hbase-0.20.2 /opt/hbase
$ sudo chown -R $USER:$USER /opt/hbase
$ sudo mkdir /var/hadoop/
$ sudo chmod 777 /var/hadoop
CONNECTING TO HBASE

• Java client
– get(byte [] row, byte [] column, long timestamp, int versions);
• Non-Java clients
– Thrift server hosting HBase client instance
• Sample ruby, C++, & java (via thrift) clients
– REST server hosts HBase client
• TableInput/OutputFormat for MapReduce
– HBase as MR source
– HBase Shell
– JRuby to add get, scan, and admin
– ./bin/hbase shell YOUR_SCRIPT
THRIFT

• A software framework for scalable cross-language services development.
• Works seamlessly between C++, Java, Python, PHP, and Ruby.
• A similar alternative is the REST gateway.

$ hbase-daemon.sh start thrift
$ hbase-daemon.sh stop thrift
HBASE
SHELL
HBASE SHELL
$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.92.1-cdh4.0.0, rUnknown, Mon Jun 4 17:27:36 PDT 2012
hbase(main):001:0>

hbase(main):001:0> list
TABLE
0 row(s) in 0.5710 seconds
hbase(main):002:0>

Now lets create a table and store data


STORING DATA

HBase uses the table as the top-level structure for storing


data. To write data into HBase, you need a table to write it
into. To begin, create a table called mytable with a single
column family

hbase(main):002:0> create 'mytable', 'cf'


0 row(s) in 1.0730 seconds
hbase(main):003:0> list
TABLE
mytable
1 row(s) in 0.0080 seconds
WRITING DATA

Let's add the string hello HBase to the table. In HBase parlance, we say, "Put
the bytes 'hello HBase' to a cell in 'mytable' in the 'first' row at the
'cf:message' column"

hbase(main):004:0> put 'mytable', 'first', 'cf:message', 'hello HBase'
0 row(s) in 0.2070 seconds
hbase(main):005:0> put 'mytable', 'second', 'cf:foo', 0x0 (zero, x, zero)
0 row(s) in 0.0130 seconds
hbase(main):006:0> put 'mytable', 'third', 'cf:bar', 3.14159
0 row(s) in 0.0080 seconds

• You now have three cells in three rows in your table. Notice that you didn't define
the columns before you used them. Nor did you specify what type of data you stored in
each column. This is what the NoSQL crowd means when they say HBase is a
schema-less database.
READING DATA
• HBase gives you two ways to read data: get and scan. Command to
store the cells was put. get is the complement of put, reading back a
single row.

mytable (column family cf):
row key   column       value
first     cf:message   hello HBase
second    cf:foo       0
third     cf:bar       3.14159
SCAN AND GET
hbase(main):008:0> scan 'mytable'
ROW COLUMN+CELL
first column=cf:message, timestamp=1323483954406, value=hell

second column=cf:foo, timestamp=1323483964825, value=0


third column=cf:bar, timestamp=1323483997138, value=3.14159
3 row(s) in 0.0240 seconds

hbase(main):007:0> get 'mytable', 'first'


COLUMN CELL
cf:message timestamp=1323483954406, value=hello HBase
1 row(s) in 0.0250 seconds

The shell shows you all the cells in the row, organized by column, with the value
associated at each timestamp. HBase can store multiple versions of each cell. The
default number of versions stored is three, but it’s configurable.
At read time, only the latest version is returned, unless otherwise specified.
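
A small shell illustration of working with versions (the table 'vtable' and the prompt numbers are made up; VERSIONS is a real column-family attribute and get option):

hbase(main):009:0> get 'mytable', 'first', {COLUMN => 'cf:message', VERSIONS => 3}
hbase(main):010:0> create 'vtable', {NAME => 'cf', VERSIONS => 5}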
TABLE ‘USERS’ CREATION
hbase(main):001:0> create 'users', 'info'
0 row(s) in 0.1200 seconds
hbase(main):002:0>

• Columns in HBase are organized into groups called column families.


info is a column family in table ‘users’
• Column families impact physical characteristics of the data store in
HBase. For this reason, at least one column family must be specified
at table creation time. Other than the column family name, HBase
doesn't require you to tell it anything about your data ahead of time.
That's why HBase is often described as a schema-less database.
SHELL
hbase(main):003:0> describe 'users'
DESCRIPTION ENABLED
{NAME => 'users', FAMILIES => [{NAME => 'info', true
BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0
', COMPRESSION => 'NONE', VERSIONS => '3', TTL
=> '2147483647', BLOCKSIZE => '65536', IN_MEMOR
Y => 'false', BLOCKCACHE => 'true'}]}
1 row(s) in 0.0330 seconds
hbase(main):004:0>

The shell describes the table with two properties: the table name and the list of
column families (physical characteristics).
When not using the shell, we open a connection with the Java client library instead.
The code for connecting to this table is:
HTableInterface usersTable = new HTable("users");
JAVA CONNECTION
• The HTable constructor reads the default configuration information
to locate HBase, similar to the way the shell did. It then locates the
‘users’ table you created earlier and gives you a handle to it.
(configuration parameters can be picked by the Java client from the
hbase-site.xml file in their classpath). HBase client applications need
to have only one configuration piece available to them to access
HBase—the ZooKeeper quorum address.

• Rather than instantiating HTables directly, using HTablePool is more
common in practice.
ACCESS TO DATA
COORDINATES
• Coordinates are used to locate data:
(rowkey, column family, column qualifier) → cell value
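
A hedged Java sketch (HBase 0.92-era client API; the row key, qualifier and value are made up) showing how a Put and a Get address a cell by these coordinates:

// Illustrative sketch only: writing and reading a cell by its coordinates
// (rowkey, column family, column qualifier).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class UsersExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
    HTable users = new HTable(conf, "users");

    // Put: coordinates (rowkey "row1", family "info", qualifier "name") -> value
    Put p = new Put(Bytes.toBytes("row1"));
    p.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Amy"));
    users.put(p);

    // Get: read the same cell back using the same coordinates
    Get g = new Get(Bytes.toBytes("row1"));
    Result r = users.get(g);
    byte[] value = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
    System.out.println(Bytes.toString(value));           // prints "Amy"

    users.close();
  }
}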
HBASE: KEYS AND COLUMN FAMILIES

Each record is divided into Column Families

Each row has a Key

Each column family consists of one or more Columns

WRITE PROCESS
• When something is written to HBase, it is first written to an in-memory
store (MemStore); once this MemStore reaches a certain size, it is flushed
to disk into a store file, also called an HFile (everything is also written
immediately to a log file for durability). The store files created on disk
are immutable. Sometimes the store files are merged together; this is done
by a process called compaction.
• This log file is called the write-ahead log (WAL), also referred to as the
HLog
• HFile is the underlying storage format for HBase. HFiles belong to a
column family, and a column family can have multiple HFiles. But a
single HFile can’t have data for multiple column families. There is
one MemStore per column family
• http://www.ngdata.com/visualizing-hbase-flushes-and-compactions/
NOTES ON DATA MODEL
• HBase schema consists of several Tables
• Each table consists of a set of Column Families
– Columns are not part of the schema
• HBase has Dynamic Columns
– Because column names are encoded inside the cells
– Different cells can have different columns

[Figure: the "Roles" column family has different columns in different cells]
NOTES ON DATA MODEL (CONT’D)
• The version number can be user-supplied
– It does not even have to be inserted in increasing order
– Version numbers are unique within each key
• Tables can be very sparse
– Many cells are empty
• Keys are indexed as the primary key
[Figure: the example row has two columns, cnnsi.com & my.look.ca]
HBASE PHYSICAL MODEL
• Each column family is stored in a separate file (called HTables)
• Key & Version numbers are replicated with each column family
• Empty cells are not stored

HBase maintains a multi-level index on values:
<key, column family, column name, timestamp>
EXAMPLE

COLUMN FAMILIES

HBASE REGIONS
• Each HTable (column family) is partitioned horizontally into regions
– Regions are the counterpart to HDFS blocks
[Figure: each horizontal partition of the table becomes one region]
THREE MAJOR COMPONENTS

• The HBaseMaster
– One master

• The HRegionServer
– Many region servers

• The HBase client

HBASE COMPONENTS
• Region
– A subset of a table’s rows, like horizontal range partitioning
– Automatically done
• RegionServer (many slaves)
– Manages data regions
– Serves data for reads and writes (using a log)
• Master
– Responsible for coordinating the slaves
– Assigns regions, detects failures
– Admin functions

HBASE
ARCHITECTURE

Logical Data Model
A sparse, multi-dimensional, sorted map

[Table figure: Table A has rows "a" and "b" with column families cf1 and cf2; qualifiers
such as "bar", "foo", "2011-07-04", 1.0001 and "thumb" hold values ranging from numbers
and strings to raw PNG bytes, each tagged with a timestamp.]

Legend:
- Rows are sorted by rowkey.
- Within a row, values are located by column family and qualifier.
- Values also carry a timestamp; there can be multiple versions of a value.
- Within a column family, data is schemaless. Qualifiers and values are treated as arbitrary bytes.
Logical Architecture
Distributed, persistent partitions of a BigTable

[Figure: Table A's rowkey space (a–p) is split into Regions 1–4, which are hosted on
different Region Servers (7, 86, 367) alongside regions of other tables (C, E, F, G,
L, P).]

Legend:
- A single table is partitioned into Regions of roughly equal size.
- Regions are assigned to Region Servers across the cluster.
- Region Servers host roughly the same number of regions.
Physical Architecture
Distribution and Data Path

[Figure: Java applications, the REST/Thrift gateway and the HBase shell all embed the
HBase client, which talks to ZooKeeper, the HBase Master and the Region Servers; each
Region Server is collocated with an HDFS DataNode, with the NameNode alongside.]

Legend:
- An HBase RegionServer is collocated with an HDFS DataNode.
- HBase clients communicate directly with Region Servers for sending and receiving data.
- HMaster manages Region assignment and handles DDL operations.
- Online configuration state is maintained in ZooKeeper.
- HMaster and ZooKeeper are NOT involved in the data path.
HBASE

Anatomy of a
RegionServer

REGIONSERVER
• RegionServer: every write request goes to a RegionServer, which directs
the request to the appropriate Region.
• Each region stores rows. Row data is separated by column families. Data
related to a particular CF is stored in an HStore (which consists of a
MemStore plus a set of HFiles). The MemStore lives in the RegionServer's
memory, while the HFiles are in HDFS.
Storage Machinery
Implementing the data model

[Figure: a RegionServer hosts the HLog (WAL), the BlockCache and several HRegions; each
HRegion contains one HStore per column family, and each HStore holds a MemStore plus
StoreFiles backed by HFiles on HDFS.]

Legend:
- A RegionServer contains a single WAL, a single BlockCache, and multiple Regions.
- A Region contains multiple Stores, one for each Column Family.
- A Store consists of multiple StoreFiles and a MemStore.
- A StoreFile corresponds to a single HFile.
- HFiles and the WAL are persisted on HDFS.
For what workloads
• It depends on how you tune it, but…
• HBase is good for:
– Large datasets
– Sparse datasets
– Loosely coupled (denormalized) records
– Lots of concurrent clients
• Try to avoid:
– Small datasets (unless you have lots of them)
– Highly relational records
– Schema designs requiring transactions
MEMSTORE
• When something is written to HBase, it is first written to an in-memory
store (MemStore); once this MemStore reaches a certain size, it is flushed
to disk into a store file (everything is also written immediately to a log
file for durability). The store files created on disk are immutable.
Sometimes the store files are merged together; this is done by a process
called compaction.
• http://www.ngdata.com/visualizing-hbase-flushes-and-compactions/
LOGGING OPERATIONS

HBASE DEPLOYMENT

[Figure: a master node and several slave nodes in a typical HBase deployment.]
How does it integrate with my infrastructure?
• Horizontally scale application data
– Highly concurrent, read/write access
– Consistent, persisted shared state
– Distributed online data processing via Coprocessors (experimental)
• Gateway between online services and offline storage/analysis
– Staging area to receive new data
– Serve online, indexed "views" on datasets from HDFS
– Glue between batch (HDFS, MR1) and online (CEP, Storm) systems
What data semantics
• GET, PUT, DELETE key-value operations
• SCAN for queries
• Row-level write atomicity
• MapReduce integration
– Online API (today)
– Bulkload (today)
– Snapshots (coming)
What about operational concerns?
• Provision hardware with more spindles/TB
• Balance memory and IO for reads
– Contention between random and sequential access
– Configure block size, BlockCache, compression, codecs based on access patterns
– Additional resources:
"HBase: Performance Tuners," http://labs.ericsson.com/blog/hbase-performance-tuners
"Scanning in HBase," http://hadoop-hbase.blogspot.com/2012/01/scanning-in-hbase.html
• Balance IO for writes
– Configure compactions, region size, compression, pre-splits, etc. based on the
write pattern
– Balance IO contention between compaction work and serving reads
WRITE PROCESS
WRITE PROCESS-2
• If HBase goes down, the data that was not yet flushed from the MemStore to
an HFile can be recovered by replaying the WAL, all handled under the hood.
• There is a single WAL per HBase server, shared by all tables (and their
column families) served from that server.
• We don't recommend disabling the WAL unless you're willing to lose data
when things fail.
MEMSTORE FLUSHING
• MemStore size which causes flushing is configured on two levels:
– per RS: % of heap occupied by memstores
– per table: size in MB of single memStore (per CF) of region
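
As an illustration of where those two levels are configured, a hedged hbase-site.xml snippet (property names from HBase 0.92-era defaults; the values are examples, not recommendations):

<!-- per RS: flush when all memstores together occupy this fraction of the heap -->
<property>
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.4</value>
</property>
<!-- per region/CF: flush a MemStore once it reaches this many bytes (128 MB here) -->
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>134217728</value>
</property>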
MEMSTORE
• Apart from solving the "non-ordered" problem, the MemStore also has other
benefits, e.g.:
• It acts as an in-memory cache which keeps recently added data. This is
useful in numerous cases when the last written data is accessed more
frequently than older data
• Certain optimizations can be applied to rows/cells while they are stored in
memory, before they are written to the persistent store. E.g. when a CF is
configured to store only one version of a cell and the MemStore contains
multiple updates for that cell, only the most recent one needs to be kept;
older ones can be omitted (and never written to an HFile).
• An important thing to note is that every MemStore flush creates one HFile
per CF.
MEMSTORE FLUSHES
HFILES COMPACTION
DATA LOCALITY
DETAILED ARCHITECTURE
READ PROCESS

• HBase keeps data ordered and keeps as much of it as possible in memory.
• HBase has a BlockCache for reads that sits in the JVM heap alongside the
MemStore. Each column family has its own BlockCache.
• An HFile is physically laid out as blocks plus an index on the blocks.
• A block in the BlockCache is 64 KB and is the unit of data read in a
single pass. Reading a row from HBase requires first checking the MemStore
for any pending modifications. Then the BlockCache is examined to see if
the block containing this row has been recently accessed. Finally, the
relevant HFiles on disk are accessed.
READ OPERATION
The .META. table holds the list of all user-space regions.
The -ROOT- table holds the list of .META. table regions.
BIG PICTURE

ZOOKEEPER

• HBase depends on ZooKeeper and by default it manages a ZooKeeper
instance as the authority on cluster state
ZOOKEEPER
• HBase depends on ZooKeeper
• By default HBase manages the ZooKeeper instance
– E.g., starts and stops ZooKeeper
• HMaster and HRegionServers register themselves with ZooKeeper

CREATING A TABLE
HBaseAdmin admin = new HBaseAdmin(config);
HColumnDescriptor[] column = new HColumnDescriptor[2];
column[0] = new HColumnDescriptor("columnFamily1:");
column[1] = new HColumnDescriptor("columnFamily2:");
HTableDescriptor desc = new HTableDescriptor(Bytes.toBytes("MyTable"));
desc.addFamily(column[0]);
desc.addFamily(column[1]);
admin.createTable(desc);
OPERATIONS ON REGIONS: GET()
• Given a key → return the corresponding record
• For each value, return the highest version
• You can control the number of versions you want
OPERATIONS ON REGIONS: SCAN()

GET()
Select value from table where key='com.apache.www' AND label='anchor:apache.com'

[Table figure: row "com.apache.www" holds anchor:apache.com = "APACHE" at timestamp t10;
row "com.cnn.www" holds anchor:cnnsi.com = "CNN" at t9 and anchor:my.look.ca = "CNN.com"
at t8; further timestamps t3–t12 are shown without values.]
OPERATIONS ON REGIONS: PUT()
• Insert a new record (with a new key), Or
• Insert a record for an existing key
Implicit version number
(timestamp)

Explicit version number

OPERATIONS ON REGIONS: DELETE()

• Marking table cells as deleted
• Multiple levels
– Can mark an entire column family as deleted
– Can mark all column families of a given row as deleted
• All operations are logged by the RegionServers
• The log is flushed periodically
ALTERING A TABLE

Disable the table before changing the schema

WHEN TO USE HBASE
• Random write, read or both
• Variable schema in each record
• Collections of data for each key
• Atomic control of per-key data
• Row access to each column family
• Access patterns well-known and simple

HBASE
• HBase uses HDFS for reliable storage
– Handles checksums, replication, failover
• Master manages cluster
• RegionServer manage data
• ZooKeeper is the ‘neural network’ for bootstrapping and
coordinating cluster
BLOOM FILTER
• Generated when an HFile is persisted, stored at the end of each file, and
loaded into memory
• Allows a check at the row or row+column level
• Can filter entire store files out of reads
• Useful when many misses are expected during reads (non-existing keys)
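
Bloom filters are enabled per column family when the table is created (or altered); a hedged shell example with a made-up table name:

hbase(main):011:0> create 'webtable', {NAME => 'cf', BLOOMFILTER => 'ROWCOL'}    (row+column level; use 'ROW' for row-level checks)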
HIVE HBASE INTEGRATION
• Reasons to use Hive on HBase:
– A lot of data sitting in HBase due to its usage in a real-time
environment, but never used for analysis
– Give access to data in HBase usually only queried through
MapReduce to people that don’t code (business analysts)
– When needing a more flexible storage solution, so that rows can
be updated live by either a Hive job or an application, with the
change immediately visible to the other

• Reasons not to do it:


– Run SQL queries on HBase to answer live user requests (it’s still
a MR job)
– Hoping to see interoperability with other SQL analytics systems
HIVE AND HBASE

• How it works:
– Hive can use tables that already exist in HBase or manage
its own ones, but they still all reside in the same HBase
instance

[Figure: Hive table definitions either point to an existing HBase table or manage a
table created from Hive; either way, the data resides in the same HBase instance.]
INTEGRATION

[Figure: the Hive table definition "persons" maps onto the HBase table "people":
name STRING → d:fullname, age INT → d:age, siblings MAP<string, string> → the f:
column family; the HBase column d:address is shown unmapped.]
HIVE WITH HBASE

The first step is to create a sample HBase table ‘my_table’


hbase(main):004:0> create 'my_table', 'test'
0 row(s) in 1.2650 seconds

hbase(main):006:0> put 'my_table', '1', 'test:mydata1', 'value1'


0 row(s) in 0.1420 seconds

hbase(main):007:0> put 'my_table', '2', 'test:mydata1', 'value2'


0 row(s) in 0.0050 seconds

hbase(main):008:0> put 'my_table', '3', 'test:mydata1', 'value3'


0 row(s) in 0.0140 seconds

hbase(main):009:0> put 'my_table', '4', 'test:mydata1', 'value4'


0 row(s) in 0.0170 seconds
HIVE WITH HBASE
Go to Hive terminal and create external table test_all
[root@localhost training]# hive
hive> create external table test_all (id string,colname
map<string,string>) stored by
'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with
serdeproperties ("hbase.columns.mapping" = ":key,test:")
tblproperties("hbase.table.name"="my_table");
OK
Time taken: 7.574 seconds
hive>
HIVE WITH HBASE: SELECT * FROM TEST_ALL

hive> select * from test_all;


OK
1 {"mydata1":"value1"}
2 {"mydata1":"value2"}
3 {"mydata1":"value3"}
4 {"mydata1":"value4"}
Time taken: 0.777 seconds
hive>
INTEGRATION
• Drawbacks (that can be fixed with brain juice):
– Binary keys and values (like integers represented on 4
bytes) aren’t supported since Hive prefers string
representations, HIVE-1634
– Compound row keys aren’t supported, there’s no way of
using multiple parts of a key as different “fields”
– This means that concatenated binary row keys are
completely unusable, which is what people often use for
HBase
– Filters are done at Hive level instead of being pushed to
the region servers
– Partitions aren’t supported
DATA FLOWS
• Data is being generated all over the place:
– Apache logs
– Application logs
– MySQL clusters
– HBase clusters
USE CASES
• Front-end engineers
– They need some statistics regarding their latest product
• Research engineers
– Ad-hoc queries on user data to validate some assumptions
– Generating statistics about recommendation quality
• Business analysts
– Statistics on growth and activity
– Effectiveness of advertiser campaigns
– Users’ behavior VS past activities to determine, for example,
why certain groups react better to email communications
– Ad-hoc queries on stumbling behaviors of slices of the user base
TABLE IN HBASE
Using a simple table in HBase:
CREATE EXTERNAL TABLE blocked_users(
userid INT,
blockee INT,
blocker INT,
created BIGINT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
":key,f:blockee,f:blocker,f:created")
TBLPROPERTIES("hbase.table.name" = "m2h_repl-userdb.stumble.blocked_users");

HBase is a special case here, it has a unique row key map with :key
Not all the columns in the table need to be mapped
HBASE VS. HDFS
• Both are distributed systems that scale to hundreds or thousands of
nodes

• HDFS is good for batch processing (scans over big files)


– Not good for record lookup
– Not good for incremental addition of small batches
– Not good for updates

HBASE VS. HDFS (CONT’D)
• HBase is designed to efficiently address the above points
– Fast record lookup
– Support for record-level insertion
– Support for updates (not in place)

• HBase updates are done by creating new versions of values

HBASE VS. RDBMS

HBASE VS. HDFS

If the application has neither random reads nor random writes → stick to HDFS
OOZIE
OOZIE OVERVIEW

Main Features
– Execute and monitor workflows in Hadoop
– Periodic scheduling of workflows
– Trigger execution by data availability
– HTTP and command line interface + Web console

Adoption
– ~100 users on mailing list since launch on github
– In production at Yahoo!, running >200K jobs/day
OOZIE WORKFLOW OVERVIEW
Purpose:
Execution of workflows on the Grid

Oozie

WS Tomcat Hadoop/Pig/HDFS
API web-app

DB
OOZIE WORKFLOW
Directed Acyclic Graph of Jobs
[Figure: an example DAG — start forks into an M/R streaming job and a Java main, which
join into a Pig job; a decision node then either loops back through another M/R job
(MORE) or, when ENOUGH, continues through a Java main and an FS action to the end node.]
OOZIE WORKFLOW EXAMPLE
[Figure: Start → M-R wordcount → (OK) End / (Error) Kill]

<workflow-app name='wordcount-wf'>
  <start to='wordcount'/>
  <action name='wordcount'>
    <map-reduce>
      <job-tracker>foo.com:9001</job-tracker>
      <name-node>hdfs://bar.com:9000</name-node>
      <configuration>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to='end'/>
    <error to='kill'/>
  </action>
  <kill name='kill'/>
  <end name='end'/>
</workflow-app>
OOZIE WORKFLOW NODES
• Control Flow:
– start/end/kill
– decision
– fork/join

• Actions:
– map-reduce
– pig
– hdfs
– sub-workflow
– java – run custom Java code
OOZIE WORKFLOW APPLICATION

A HDFS directory containing:

– Definition file: workflow.xml


– Configuration file: config-default.xml
– App files: lib/ directory with JAR and SO files
– Pig Scripts
RUNNING AN OOZIE WORKFLOW JOB

Application Deployment:
$ hadoop fs -put wordcount-wf hdfs://bar.com:9000/usr/abc/wordcount

Workflow Job Parameters:


$ cat job.properties
oozie.wf.application.path = hdfs://bar.com:9000/usr/abc/wordcount
input = /usr/abc/input-data
output = /user/abc/output-data

Job Execution:
$ oozie job -run -config job.properties
job: 1-20090525161321-oozie-xyz-W
MONITORING AN OOZIE WORKFLOW JOB

Workflow Job Status:


$ oozie job -info 1-20090525161321-oozie-xyz-W
------------------------------------------------------------------------
Workflow Name : wordcount-wf
App Path : hdfs://bar.com:9000/usr/abc/wordcount
Status : RUNNING

Workflow Job Log:
$ oozie job -log 1-20090525161321-oozie-xyz-W

Workflow Job Definition:
$ oozie job -definition 1-20090525161321-oozie-xyz-W
OOZIE COORDINATOR OVERVIEW

Purpose:
– Coordinated execution of workflows on the Grid
– Workflows are backwards compatible
[Figure: the Oozie client calls the WS API on Tomcat; the Oozie Coordinator checks data
availability and triggers the Oozie Workflow engine, which runs jobs on Hadoop.]
OOZIE APPLICATION LIFECYCLE
[Figure: a coordinator job spans a start and end time; at each multiple of the frequency
(0*f, 1*f, 2*f, …, N*f) it materializes a coordinator action, and each action starts a
workflow in the Oozie Workflow Engine.]
USE CASE 1: TIME TRIGGERS
• Execute your workflow every 15 minutes (CRON)

00:15 00:30 00:45 01:00


RUN WORKFLOW EVERY 15 MINS
<coordinator-app name="coord1"
start="2009-01-08T00:00Z"
end="2010-01-01T00:00Z"
frequency="15"
xmlns="uri:oozie:coordinator:0.1">
<action>
<workflow>
<app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
<configuration>
<property> <name>key1</name><value>value1</value> </property>
</configuration>
</workflow>
</action>
</coordinator-app>
TIME AND DATA TRIGGERS

• Materialize your workflow every hour, but only run them


when the input data is ready.
[Figure: at each hour (01:00, 02:00, 03:00, 04:00, …) Oozie checks whether the input
data exists in Hadoop before running the workflow.]
DATA TRIGGERS
<coordinator-app name="coord1" frequency="${1*HOURS}" …>
  <datasets>
    <dataset name="logs" frequency="${1*HOURS}" initial-instance="2009-01-01T00:00Z">
      <uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="inputLogs" dataset="logs">
      <instance>${current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
      <configuration>
        <property> <name>inputData</name><value>${dataIn('inputLogs')}</value> </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
ROLLING WINDOWS
• Access 15 minute datasets and roll them up into hourly
datasets

00:15 00:30 00:45 01:00 01:15 01:30 01:45 02:00

01:00 02:00
ROLLING WINDOWS
<coordinator-app name="coord1" frequency="${1*HOURS}" …>
  <datasets>
    <dataset name="logs" frequency="15" initial-instance="2009-01-01T00:00Z">
      <uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="inputLogs" dataset="logs">
      <start-instance>${current(-3)}</start-instance>
      <end-instance>${current(0)}</end-instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
      <configuration>
        <property> <name>inputData</name><value>${dataIn('inputLogs')}</value> </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
SLIDING WINDOWS
• Access last 24 hours of data, and roll them up every hour.


[Figure: each hourly run consumes the previous 24 hours of data, so consecutive runs
operate on windows that slide forward by one hour.]
OOZIE COORDINATOR APPLICATION

A HDFS directory containing:

– Definition file: coordinator.xml


– Configuration file: coord-config-default.xml
RUNNING AN OOZIE COORDINATOR JOB

Application Deployment:
$ hadoop fs -put coord_job hdfs://bar.com:9000/usr/abc/coord_job

Coordinator Job Parameters:


$ cat job.properties
oozie.coord.application.path = hdfs://bar.com:9000/usr/abc/coord_job

Job Execution:
$ oozie job -run -config job.properties
job: 1-20090525161321-oozie-xyz-C
MONITORING AN OOZIE COORDINATOR JOB

Coordinator Job Status:


$ oozie job -info 1-20090525161321-oozie-xyz-C
------------------------------------------------------------------------
Job Name : wordcount-coord
App Path : hdfs://bar.com:9000/usr/abc/coord_job
Status : RUNNING

Coordinator Job Log:
$ oozie job -log 1-20090525161321-oozie-xyz-C

Coordinator Job Definition:
$ oozie job -definition 1-20090525161321-oozie-xyz-C
OOZIE WEB CONSOLE: LIST JOBS
OOZIE WEB CONSOLE: JOB DETAILS
OOZIE WEB CONSOLE: FAILED ACTION
OOZIE WEB CONSOLE: ERROR MESSAGES
WHAT’S NEXT FOR OOZIE?
New Features
– More out-of-the-box actions: distcp, hive, …
– Authentication framework
• Authenticate a client with Oozie
• Authenticate an Oozie workflow with downstream
services
– Bundles: Manage multiple coordinators together
– Asynchronous data sets and coordinators
Scalability
– Memory footprint
– Data notification instead of polling
Integration with Howl (http://github.com/yahoo/howl)
RESOURCES

Oozie is Open Source


• Source: http://github.com/yahoo/oozie
• Docs: http://yahoo.github.com/oozie
• List: http://tech.groups.yahoo.com/group/Oozie-users/

To Contribute:
• https://github.com/yahoo/oozie/wiki/How-To-Contribute
SQOOP
WHAT IS SQOOP
• Tool to transfer data from relational databases
– Teradata, MySQL, PostgreSQL, Oracle, Netezza
• To Hadoop ecosystem
– HDFS (text, sequence file), Hive, HBase, Avro
• And vice versa
• Based on Connectors
– Responsible for Metadata lookups, and Data Transfer
– Majority of connectors are JDBC based
– Non-JDBC (direct) connectors for optimized data transfer
• Connectors responsible for all supported functionality
– HBase Import, Avro Support
• The canonical use case is performing a nightly dump of all the data in a
transactional relational database into Hadoop for offline analysis. The
popularity of Sqoop in enterprise systems confirms that Sqoop does
bulk transfer admirably.
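
A hedged example of such a nightly import (the MySQL host, database, table and HDFS path are placeholders):

$ sqoop import \
    --connect jdbc:mysql://db.example.com/sales \
    --username dbuser -P \
    --table orders \
    --target-dir /warehouse/orders \
    --num-mappers 4

Adding --hive-import would load the same table straight into the Hive metastore instead of a plain HDFS directory.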
What is Sqoop?
Traditional ETL
[Figure: a dedicated ETL application sits between the source data store and the target
data store and moves the data through itself.]

What is Sqoop?
A different paradigm
[Figure: the application pulls the data directly from the source store into the target.]

What is Sqoop?
A very scalable different paradigm
[Figure: many application instances (map tasks) transfer slices of the data in parallel
into the target store.]
WHY SQOOP?
• Efficient/Controlled resource utilization
– Concurrent connections, Time of operation
• Datatype mapping and conversion
– Automatic, and User override
• Metadata propagation
– Sqoop Record
– Hive Metastore
– Avro

SQOOP

MAHOUT
DATA ANALYTICS
• Include machine learning and data mining tools
– Analyze/mine/summarize large datasets
– Extract knowledge from past data
– Predict trends in future data

DATA MINING & MACHINE LEARNING

• Subset of Artificial Intelligence (AI)


• Lots of related fields and applications
– Information Retrieval
– Stats
– Biology
– Linear algebra
– Marketing and Sales

TOOLS & ALGORITHMS
• Collaborative Filtering
• Clustering Techniques
• Classification Algorithms
• Association Rules
• Frequent Pattern Mining
• Statistical libraries (Regression, SVM, …)
• Others…

COMMON USE CASES

RECOMMENDATIONS
• Predict what the user likes based on
– His/Her historical behavior
– Aggregate behavior of people similar to him
IN OUR CONTEXT…

• Machine learning / data mining tools: efficient in analyzing and mining data,
but do not scale
• Hadoop: efficient in managing big data, but does not analyze or mine the data

How can we integrate these two worlds together?
OTHER PROJECTS
• Apache Mahout
– Open-source package on Hadoop for data mining and
machine learning

• Revolution R (R-Hadoop)
– Extensions to R package to run on Hadoop

APACHE MAHOUT
• Apache Software Foundation project
• Create scalable machine learning libraries
• Why Mahout? Many Open Source ML libraries either:
– Lack Community
– Lack Documentation and Examples
– Lack Scalability
– Or are research-oriented

GOAL 1: MACHINE LEARNING

[Figure: the Mahout stack]
- Applications
- Examples
- Algorithms: Genetic, Frequent Pattern Mining, Classification, Clustering, Recommenders
- Math utilities: Vectors/Matrices/SVD, Collections (primitives), Apache Lucene/Vectorizer
- Runs on Apache Hadoop
GOAL 2: SCALABILITY
• Be as fast and efficient as possible given the intrinsic design of
the algorithm
• Most Mahout implementations are MapReduce enabled
• Work in progress
INTERESTING PROBLEMS
• Cluster users talking about Faculty Summit and cluster them based
on what they are tweeting
– Can you suggest people to network with.
• Take user-generated tags that people have assigned to musicians and
cluster them
– Use the clusters to pre-populate a suggest-box that autocompletes
tags as users type
• Cluster movies based on abstract and description and show related
movies.
– Note: How it can augment recommendations or
collaborative filtering algorithms.
MAHOUT PACKAGE

C1: COLLABORATIVE FILTERING

C2: CLUSTERING
• Group similar objects together

• K-Means, Fuzzy K-Means, Density-Based,…

• Different distance measures


– Manhattan, Euclidean, …

C3: CLASSIFICATION

FPM: FREQUENT PATTERN MINING
• Find the frequent itemsets
– <milk, bread, cheese> are sold frequently together

• Very common in market analysis, access pattern analysis, etc…

O: OTHERS
• Outlier detection
• Math library
– Vectors, matrices, etc.
• Noise reduction
WE FOCUS ON…
• Clustering  K-Means

• Classification  Naïve Bayes

-- Technique logic
• Frequent Pattern Mining  Apriori
-- How to implement in Hadoop

190
K-MEANS ALGORITHM

An iterative algorithm that repeats until it converges
K-MEANS ALGORITHM
• Step 1: Select K points at random (Centers)
• Step 2: For each data point, assign it to the closest center
– Now we formed K clusters
• Step 3: For each cluster, re-compute the centers
– E.g., in the case of 2D points:
• X: average over all x-axis points in the cluster
• Y: average over all y-axis points in the cluster
• Step 4: If the new centers are different from the old centers
(previous iteration) → go to Step 2
K-MEANS IN MAPREDUCE
• Input
– Dataset (set of points in 2D) --Large
– Initial centroids (K points) --Small

• Map Side
– Each map reads the K-centroids + one block from dataset
– Assign each point to the closest centroid
– Output <centroid, point>

K-MEANS IN MAPREDUCE (CONT’D)
• Reduce Side
– Gets all points for a given centroid
– Re-compute a new centroid for this cluster
– Output: <new centroid>

• Iteration Control
– Compare the old and the new set of K centroids
• If similar → stop
• Else
– If the max number of iterations has been reached → stop
– Else → start another MapReduce iteration
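
A minimal Java sketch of one such iteration (this is not Mahout's implementation; the "x,y" point format and the kmeans.centroids configuration key are assumptions for illustration). The combiner optimization described on the next slide would emit partial sums and counts per centroid instead of the raw points.

// One K-Means iteration as a Hadoop MapReduce job: the mapper assigns each
// point to its closest centroid, the reducer recomputes each centroid.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansIteration {

  public static class AssignMapper
      extends Mapper<LongWritable, Text, IntWritable, Text> {
    private double[][] centroids;   // K centroids, loaded once per mapper

    @Override
    protected void setup(Context ctx) {
      // Centroids passed via the job configuration as "x1,y1;x2,y2;..." (assumed format)
      String[] cs = ctx.getConfiguration().get("kmeans.centroids").split(";");
      centroids = new double[cs.length][2];
      for (int i = 0; i < cs.length; i++) {
        String[] xy = cs[i].split(",");
        centroids[i][0] = Double.parseDouble(xy[0]);
        centroids[i][1] = Double.parseDouble(xy[1]);
      }
    }

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] xy = value.toString().split(",");
      double x = Double.parseDouble(xy[0]), y = Double.parseDouble(xy[1]);
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int i = 0; i < centroids.length; i++) {      // find the closest centroid
        double dx = x - centroids[i][0], dy = y - centroids[i][1];
        double d = dx * dx + dy * dy;
        if (d < bestDist) { bestDist = d; best = i; }
      }
      ctx.write(new IntWritable(best), value);          // emit <centroid id, point>
    }
  }

  public static class RecomputeReducer
      extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable key, Iterable<Text> points, Context ctx)
        throws IOException, InterruptedException {
      double sumX = 0, sumY = 0;
      long n = 0;
      for (Text p : points) {                           // average the assigned points
        String[] xy = p.toString().split(",");
        sumX += Double.parseDouble(xy[0]);
        sumY += Double.parseDouble(xy[1]);
        n++;
      }
      ctx.write(key, new Text((sumX / n) + "," + (sumY / n)));  // new centroid
    }
  }
  // A driver (not shown) reruns this job until the centroids stop moving
  // or the maximum number of iterations is reached.
}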
K-MEANS OPTIMIZATIONS
• Use of Combiners
– Similar to the reducer
– Computes for each centroid the local sums (and counts) of the assigned points
– Sends to the reducer <centroid, <partial sums>>

• Use of Single Reducer


– Amount of data to reducers is very small
– Single reducer can tell whether any of the centers has changed or not
– Creates a single output file

NAÏVE BAYES CLASSIFIER
• In simple terms, a naive Bayes classifier assumes that the presence
or absence of a particular feature is unrelated to the presence or
absence of any other feature, given the class variable. For example, a
fruit may be considered to be an apple if it is red, round, and about
3" in diameter. A naive Bayes classifier considers each of these
features to contribute independently to the probability that this fruit
is an apple, regardless of the presence or absence of the other
features.

NAÏVE BAYES CLASSIFIER
• Given a dataset (training data), we learn (build) a statistical
model
– This model is called “Classifier”

• Each point in the training data is in the form of:
– <label, feature 1, feature 2, …, feature N>
– Label → the class label
– Features 1..N → the features (dimensions of the point)

• Then, given a point without a label <??, feature 1, …, feature N>
– Use the model to decide on its label
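
Formally (a standard formulation, not taken from the slides), the classifier picks the label

\hat{y} = \arg\max_{c} \; P(c) \prod_{i=1}^{N} P(f_i \mid c)

where the prior P(c) and the per-feature likelihoods P(f_i | c) are estimated by counting over the labelled training data.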
NAÏVE BAYES CLASSIFIER: EXAMPLE
• Example

[Figure: a sample training table with three features per row and a class label
(male or female).]
FREQUENT PATTERN MINING
• Very common problem in Market-Basket applications

• Given a set of items I ={milk, bread, jelly, …}

• Given a set of transactions where each transaction contains


subset of items
– t1 = {milk, bread, water}
– t2 = {milk, nuts, butter, rice}

FREQUENT PATTERN MINING
• Given a set of items I ={milk, bread, jelly, …}
• Given a set of transactions where each transaction contains
subset of items
– t1 = {milk, bread, water}
– t2 = {milk, nuts, butter, rice}

What are the itemsets frequently sold together?

An itemset is frequent if the % of transactions in which it appears >= α
EXAMPLE
• {Bread}  80%
• {PeanutButter}  60%
• {Bread, PeanutButter}  60%

Assume α = 60%, what are the frequent itemsets

called “Support”

201
CAN WE OPTIMIZE??
• {Bread}  80%
• {PeanutButter}  60%
• {Bread, PeanutButter}  60%

Assume α = 60%, what are the frequent itemsets

called “Support”

Property
For itemset S={X, Y, Z, …} of size n to be frequent, all its subsets of
size n-1 must be frequent as well
202
HOW TO FIND FREQUENT ITEMSETS
• Naïve Approach
– Enumerate all possible itemsets and then count
each one

All possible itemsets of size 1

All possible itemsets of size 2

All possible itemsets of size 3

All possible itemsets of size 4

RESOURCES
• http://mahout.apache.org
• dev@mahout.apache.org - Developer mailing list
• user@mahout.apache.org - User mailing list
• Check out the documentations and wiki for quickstart
• http://svn.apache.org/repos/asf/mahout/trunk/ Browse Code
PROVISIONING,
MANAGING,
MONITORING
CLUSTER
WHAT IS NAGIOS
• “Nagios is an enterprise-class monitoring solutions for hosts,
services, and networks released under an Open Source license.”
“Nagios is a popular open source computer system and network
monitoring application software. It watches hosts and services that
you specify, alerting you when things go bad and again when they
get better.”
CACTI
• Performance Graphing System
• Slick Web Interface
• Template System for Graph Types
• Pluggable
– SNMP (Simple Network Management Protocol) input
– Shell script /external program
CACTI
NAGIOS
• Answers “IS IT RUNNING?”
• Text based Configuration
CACTI
• Answers “HOW WELL IS IT RUNNING?”
• Web Based configuration
– php-cli tools
AMBARI

• Cluster Operations: extend core capabilities to include the critical tasks
associated with provisioning and operating Hadoop clusters.
• Job Diagnostics: enable insight into job performance and reduce the burden
on specialized Hadoop skills and knowledge.
• Extensible Platform: expose integration and customization points so Hadoop
can interoperate with existing operational tooling.
MORE DATABASES
• Ambari to support Postgres, MySQL or Oracle
• Configure Hive and Oozie to use MySQL or Oracle

OTHER GOODIES

• Add slave components to hosts
• Stop/Start all services
• Re-assign master components
• Host status filtering
JOB DIAGNOSTICS
• Enhanced swimlane visualizations
• See job DAG with task overlay
• See task scatter plot across jobs

ELECTRONIC ARTS ON HADOOP AND EC2
WHAT WE HAVE DONE!
• Setup EC2, requested machines, configured firewalls and
passwordless SSH;
• Downloaded Java and Hadoop;
• Configured HDFS and MapReduce and pushed configuration around
the cluster;
• Started HDFS and MapReduce;
• Submitted the job, ran it successfully, and viewed the output.
1. START EC2 SERVERS
• Amazon Web Services @ http://aws.amazon.com/;

• Used the ‘classic wizard’, created three micro instances running the
latest 64 bit Ubuntu Server;

• Key pair .pem file either exists or you create one to connect to the
servers and to navigate around within the cluster
2. NAME EC2 SERVERS
• For reference, instances are named Master, Slave 1, and Slave 2
within the EC2 console once they are running;

• Note down the host names for each of the 3 instances in the bottom
part of the management console. We will use these to access the
servers:
PUTTY CONFIGURATION
WEB INTERFACES
MAPREDUCE PROGRAM
Thank You!

Navin Chandra
