
FACULTY SUMMIT ON BIG

DATA
Navin Chandra
OUTLINE
• Pig
• Hive
• Hcatalog
• HBase
• Oozie
• Sqoop
• Mahout
• Provisioning, Managing, Monitoring Cluster
• Real-time Use Case Scenarios
APACHE PIG
BACKGROUND
• Yahoo! was the first big adopter of Hadoop.
• Hadoop gained popularity in the company quickly.
• Yahoo! Research developed Pig to address the need for a higher
level language.
• Roughly 30% of Hadoop jobs run at Yahoo! are Pig jobs.
APACHE PIG
• Pig 0.11.1 - a platform for analyzing large data sets that consists of a
high-level language for expressing data analysis programs, coupled
with infrastructure for evaluating these programs.
• Pig's infrastructure layer consists of
– a compiler that produces sequences of Map-Reduce programs
• Pig's language layer currently consists of a textual language called Pig
Latin, which has the following key properties:
• Ease of programming
• Optimization opportunities
• Extensibility

PIG
• What is Pig?
– An open-source high-level dataflow system
– Provides a simple language for queries and data manipulation,
Pig Latin, that is compiled into map-reduce jobs that are run on
Hadoop
– Pig Latin combines the high-level data manipulation constructs
of SQL with the procedural programming of map-reduce
• Why is it important?
– Companies and organizations like Yahoo, Google and Microsoft
are collecting enormous data sets in the form of click streams,
search logs, and web crawls
– Some form of ad-hoc processing and analysis of all of this
information is required
PIG

• Pig provides a higher level language, Pig Latin, that:


– Increases productivity.
– 10 lines of Pig Latin ≈ 200 lines of Java.
– Opens the system to non-Java programmers.
– Provides common operations like join, group, filter, sort
EXISTING SOLUTION

• Parallel database products (ex: Teradata)


– Expensive at web scale
– Data analysis programmers find the declarative SQL
queries to be unnatural and restrictive
• Raw map-reduce
– Complex n-stage dataflows are not supported; joins
and related tasks require workarounds or custom
implementations
– Resulting code is difficult to reuse and maintain; shifts
focus and attention away from data analysis
Language Features
• Several options for user-interaction
– Interactive mode (console)
– Batch mode (prepared script files containing Pig Latin commands)
– Embedded mode (execute Pig Latin commands within a Java program)
• Built primarily for scan-centric workloads and read-only data
analysis
– Easily operates on both structured and schema-less, unstructured data
– Transactional consistency and index-based lookups not required
– Data curation and schema management can be overkill
• Flexible, fully nested data model
• Extensive UDF support
– Currently must be written in Java
– Can be written for filtering, grouping, per-tuple processing, loading
and storing
PIG OPERATORS
RUNNING PIG
• You can execute Pig Latin statements:
– Using grunt shell or command line
$ pig ... - Connecting to ...
grunt> A = load 'data';
grunt> B = ... ;
– In local mode or Hadoop MapReduce mode
$ pig myscript.pig              (command line - batch, MapReduce mode)
$ pig -x local myscript.pig     (command line - batch, local mode)
– Either interactively or in batch

EXAMPLE 1
$ cp /etc/passwd .
$ ls passwd
$ cat passwd              (a system file with fields separated by colons :)
$ pig -x local
grunt> A = LOAD 'passwd' USING PigStorage(':');
grunt> DUMP A;
grunt> B = FOREACH A GENERATE $0;
grunt> DUMP B;
grunt> STORE B INTO 'Passout';     (the Passout directory is created)
grunt> quit;
$ root@ubuntu:/home/Navin# cd Passout
$ root@ubuntu:/home/Navin/Passout# cat part-m-00000
Data Model
• Supports four basic types
– Atom: a simple atomic value (int, long, double, string)
• ex: ‘Peter’
– Tuple: a sequence of fields that can be any of the data
types
• ex: (‘Peter’, 14)
– Bag: a collection of tuples of potentially varying
structures, can contain duplicates
• ex: {(‘Peter’), (‘Bob’, (14, 21))}
– Map: an associative array, the key must be a chararray
but the value can be any type
DATA MODEL

• By default Pig treats undeclared fields as bytearrays (a collection of
uninterpreted bytes)
• Can infer a field’s type based on:
– Use of operators that expect a certain type of field
– UDFs with a known or explicitly set return type
– Schema information provided by a LOAD function
or explicitly declared using an AS clause
• Type conversion is lazy
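
For illustration, a hedged Pig Latin sketch of declaring these types explicitly with an AS clause at load time (the file name 'students' and its layout are assumptions, not from the slides):

-- assumed tab-separated input; field names and file are hypothetical
A = LOAD 'students' USING PigStorage('\t')
    AS (name:chararray,                        -- atom
        details:tuple(age:int, gpa:float),     -- tuple
        courses:bag{t:(course:chararray)},     -- bag of tuples
        scores:map[]);                         -- map (chararray keys)
B = FOREACH A GENERATE name, details.age, scores#'math';
DUMP B;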
USAGE
• Web log processing.
• Data processing for web search platforms.
• Ad hoc queries across large data sets.
• Rapid prototyping of algorithms for processing large data sets.
PROGRAM/FLOW ORGANIZATION
• A LOAD statement reads data from the file system.
• A series of "transformation" statements process the data.
• A STORE statement writes output to the file system; or, a DUMP
statement displays output to the screen.
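
A minimal Pig Latin sketch of this flow (the file name 'data' and the field layout are made up for illustration):

A = LOAD 'data' USING PigStorage(',') AS (id:int, amount:double);   -- read from the file system
B = FILTER A BY amount > 100.0;                                     -- transformation
C = GROUP B BY id;                                                  -- transformation
D = FOREACH C GENERATE group AS id, SUM(B.amount) AS total;         -- transformation
STORE D INTO 'output';               -- write output (or: DUMP D; to display on screen)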

Pig Latin vs. SQL
• Pig Latin is procedural (dataflow programming model)
– Step-by-step query style is much cleaner and easier to write and
follow than trying to wrap everything into a single block of SQL
Pig Latin vs. SQL (continued)
• Lazy evaluation (data not processed prior to STORE command)
• Data can be stored at any point during the pipeline
• An execution plan can be explicitly defined
• Pipeline splits are supported
PIG LATIN

• FOREACH ... GENERATE (per-tuple processing)
– Iterates over every input tuple in the bag, producing one output
tuple each, which allows an efficient parallel implementation
– Expressions within the GENERATE clause can take the form of any
Pig Latin expression (constants, field references, function calls, ...)
INTERPRETATION
• In general, Pig processes Pig Latin statements as follows:
– First, Pig validates the syntax and semantics of all statements.
– Next, if Pig encounters a DUMP or STORE, Pig will execute the
statements.

A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
DUMP B;
(John)
(Mary)
(Bill)
(Joe)
• The STORE operator would instead write the result to a file

PIG
MULTIPLE OUTPUTS

raw = LOAD . . .
. . .
SPLIT raw INTO x IF $0 > 4, y IF $1 == 'FOO', z IF $0 == 2 AND $3 < 2;
STORE x INTO 'x_out';
STORE y INTO 'y_out';
STORE z INTO 'z_out';
MULTIPLE INPUTS

A = LOAD . . .
B = LOAD . . .
C = LOAD . . .
A = FILTER . . .
B = FOREACH . . .
C = FILTER . . .
C = FOREACH . . .

. . .
PARTITIONER EXAMPLE FOR NO OF
BOOKS
• grunt> A = load 'tabcollege.txt' using PigStorage('\t') as
(user:chararray, age:int, college:chararray, nbooks:int);
• grunt> B = group A by college;
• grunt> C = FOREACH B { D = ORDER A by nbooks DESC; E = LIMIT
D 1; GENERATE FLATTEN (E); }
• grunt> DUMP C;
COMPUTING AVERAGE NUMBER OF PAGE VISITS BY USER

• Logs of users visiting web pages consist of (user, url, time)
• Fields of the log are tab-separated and in text format
• Basic idea:
– Load the log file
– Group based on the user field
– Count each group
– Calculate the average across all users

user url time


Amy www.cnn.com 8:00
Amy www.crap.com 8:05
Amy www.myblog.com 10:00
Amy www.flickr.com 10:05
Fred cnn.com/index.htm 12:00
Fred cnn.com/index.htm 1:00
HOW TO PROGRAM THIS IN PIG LATIN

VISITS = LOAD 'visits' AS (user, url, time);
DUMP VISITS;

USER_VISITS = GROUP VISITS BY user;

USER_CNTS = FOREACH USER_VISITS GENERATE group AS user, COUNT(VISITS) AS numvisits;

ALL_CNTS = GROUP USER_CNTS ALL;

AVG_CNT = FOREACH ALL_CNTS GENERATE AVG(USER_CNTS.numvisits);


IDENTIFY USERS WHO VISIT “GOOD
PAGES”
• Good pages are pages with a pagerank greater than 0.5; we want the users
whose visited pages have an average pagerank above that threshold
• Basic idea
– Join the two tables on url
– Group based on user
– Calculate the average pagerank of each user's visited pages
– Filter users whose average pagerank is greater than 0.5
– Store the result

visits:
user  url                time
Amy   www.cnn.com        8:00
Amy   www.crap.com       8:05
Amy   www.myblog.com     10:00
Amy   www.flickr.com     10:05
Fred  cnn.com/index.htm  12:00
Fred  cnn.com/index.htm  1:00

pages:
url             pagerank
www.cnn.com     0.9
www.flickr.com  0.9
www.myblog.com  0.7
www.crap.com    0.2
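
A hedged Pig Latin sketch of this pipeline, assuming the two inputs live in tab-separated files named 'visits' and 'pages' (the 'pages' name is an assumption for illustration):

VISITS     = LOAD 'visits' AS (user, url, time);
PAGES      = LOAD 'pages'  AS (url, pagerank);
VP         = JOIN VISITS BY url, PAGES BY url;          -- join the tables on url
BY_USER    = GROUP VP BY user;                          -- group based on user
USER_PR    = FOREACH BY_USER GENERATE group AS user,
                 AVG(VP.pagerank) AS avgpr;             -- average pagerank of visited pages
GOOD_USERS = FILTER USER_PR BY avgpr > 0.5;             -- keep users above the threshold
STORE GOOD_USERS INTO 'good_users';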
FRONT END ARCHITECTURE
Pig Latin program → Query Parser → Logical Plan
→ Semantic Checking → Logical Plan
→ Logical Optimizer → Optimized Logical Plan
→ Logical-to-Physical Translator → Physical Plan
→ Physical-to-MapReduce Translator → MapReduce Plan
→ MapReduce Launcher (creates a job jar to be submitted to the Hadoop cluster)

NEW FEATURES IN THE PIPELINE
• PigPen, an Eclipse plug-in for developing and testing Pig Latin
– Currently available for use in M/R mode
• Performance
– Expanding cases where combiner is used
– Map side join
– Improving order by
• Sampler
• Reducing number of M/R jobs from 3 to 2
– Efficient multi-store queries
• Better error handling
PIG RESOURCES

• Pig is an Apache subproject


• Documentation
– General info is at: http://wiki.apache.org/pig/
– Pig UDF : http://wiki.apache.org/pig/UDFManual
• Mailing lists
– pig-user@hadoop.apache.org
– pig-dev@hadoop.apache.org
• Source code
– https://svn.apache.org/repos/asf/hadoop/pig/trunk
• Code submissions
– https://issues.apache.org/jira/browse/PIG-*
WHAT IS HIVE?

A data warehouse infrastructure built on top of Hadoop


for providing data summarization, query, and analysis.
– ETL.
– Structure.
– Access to different storage.
– Query execution via MapReduce.
Key Building Principles:
– SQL is a familiar language
– Extensibility – Types, Functions, Formats, Scripts
– Performance
HIVE, WHY?

• Need a Multi Petabyte Warehouse


• Files are insufficient data abstractions
– Need tables, schemas, partitions, indices
• SQL is highly popular
• Need for an open data format
– RDBMS have a closed data format
– flexible schema
• Hive is a Hadoop subproject!
DATA UNITS
Databases.
Tables.
Partitions.
Buckets.
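
A hedged HiveQL sketch touching all four units; the database, table and column names are made up for illustration:

CREATE DATABASE weblogs;                               -- database
USE weblogs;
CREATE TABLE page_views (                              -- table
    userid STRING, url STRING, visit_time STRING)
PARTITIONED BY (dt STRING)                             -- partitions: one directory per day
CLUSTERED BY (userid) INTO 32 BUCKETS                  -- buckets: hash of userid into 32 files
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';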
HIVE CHARACTERISTICS
Batch oriented
Data Warehouse focused
Entire data sets (table scans)
Generates/runs MapReduce (not faster than MR!)
Limited indexing, no stats, no cache
Programmer is the optimizer
Append only (mostly)
CONCLUSION
An easy way to process large-scale data.
Supports SQL-based queries.
Provides user-defined interfaces to extend programmability.
Files in HDFS are immutable. Typical uses:
– Log processing: daily reports, user activity measurement
– Data/text mining: machine learning (training data)
– Business intelligence: advertising delivery, spam detection
HCATALOG
WHAT IS IT?
• HCatalog, in brief, is a "table management and storage management layer
for Apache Hadoop" which:
– enables Pig, MapReduce, and Hive users to easily share data on the grid
– provides a table abstraction for a relational view of data in HDFS
– ensures format indifference (e.g., RCFile, text files, sequence files)
– provides a notification service when new data becomes available
Without HCatalog
• Record format:  MapReduce – key-value pairs;  Pig – tuple;  Hive – record
• Data model:     MapReduce – user defined;  Pig – int, float, string, bytes, maps, tuples, bags;  Hive – int, float, string, maps, structs, lists
• Schema:         MapReduce – encoded in app;  Pig – declared in script or read by loader;  Hive – read from metadata
• Data location:  MapReduce – encoded in app;  Pig – declared in script;  Hive – read from metadata
• Data format:    MapReduce – encoded in app;  Pig – declared in script;  Hive – read from metadata
With HCatalog
• Record format:  MapReduce + HCatalog – record;  Pig + HCatalog – tuple;  Hive – record
• Data model:     MapReduce + HCatalog – int, float, string, maps, structs, lists;  Pig + HCatalog – int, float, string, bytes, maps, tuples, bags;  Hive – int, float, string, maps, structs, lists
• Schema:         read from metadata in all three
• Data location:  read from metadata in all three
• Data format:    read from metadata in all three
HOW DOES IT WORK?
• Pig
– HCatLoader + HCatStorer interface
• MapReduce
– HCatInputFormat + HCatOutputFormat interface
• Hive
– No interface necessary
– Direct access to metadata
• Notifications when data becomes available
Data & Metadata Access With HCatalog
[Diagram: MapReduce accesses data through HCatInputFormat/HCatOutputFormat, Pig through
HCatLoader/HCatStorer, and Hive through its SerDe and InputFormat/OutputFormat; all go
through the Metastore client to the Metastore, while external systems use the REST
interface. The data itself lives in HDFS.]
HIVE AND PIG
[Diagram: without HCatalog, MapReduce and Pig read HDFS files directly through their own
InputFormat/OutputFormat and Load/Store functions, while only Hive goes through the SerDe
and Metastore client to the Metastore.]
HIVE ODBC/JDBC TODAY
[Diagram: JDBC and ODBC clients need Hive code on the client side and connect to a single
Hive server in front of Hadoop.]
Issues:
• Not concurrent
• Not secure
• Not scalable
• Open source version not easy to use
MAKING YOUR STRUCTURED DATA
AVAILABLE TO THE MAPREDUCE ENGINE

[Diagram: MapReduce, Pig and Hive all go through HCatalog, which can front HDFS, HBase,
or an MPP store.]
HCATALOG ARCHITECTURE

[Diagram: HCatLoader/HCatStorer and HCatInputFormat/HCatOutputFormat, together with the
CLI and the notification service, sit on top of the Hive MetaStore client, which talks
through a generated Thrift client to the Hive MetaStore, backed by an RDBMS.]
WHERE IS THE DATA
[Diagram: Pig, Hive and MapReduce each talk to the storage layer directly.]

USING HCATALOG
[Diagram: with HCatalog in place, Pig, Hive and MapReduce all go through HCatalog to
reach the storage layer.]
Problem: Data in a variety of formats
• Data files may be organized in different formats
• Data files may contain different formats in different partitions
[Diagram: storage layer — HDFS, HBase, etc.]
Solution: HCat provides a common abstraction
• Data is registered with a schema
• HCat normalizes data for the application
[Diagram: Hadoop Application → HCatalog → Storage]
HCATALOG EXAMPLE
Templeton Specific Support
Move data directly into/out-of HDFS through WebHDFS

Webservice calls to HCatalog


– Register table relationships for data (e.g., createTable, createDatabase)
– Adjust tables (e.g., AlterTable)
– Look at statistics (e.g., ShowTable)

Webservice calls to start work


– MapReduce, Pig, Hive
– Poll for job status
– Notification URL when job completes (optional)

Stateless Server
– Horizontally scale for load
– Configurable for HA
– Currently Requires ZooKeeper to track job status info
GETTING INVOLVED

Incubator site : http://incubator.apache.org/hcatalog

User list: hcatalog-user@incubator.apache.org

Dev list: hcatalog-dev@incubator.apache.org


HBASE
HBASE

• HBase is a database: the Hadoop database
• Not an RDBMS and does not follow SQL; it can store an integer in one
row and a string in another in the same column
• HBase is designed to run on a cluster of computers instead of a single
computer. The cluster can be built using commodity hardware; HBase
scales horizontally as you add more machines to the cluster.
• RDBMS, by contrast, follow ACID properties (Atomicity, Consistency,
Isolation and Durability) with an upfront schema definition
HBASE
• HBase does qualify as a NoSQL store. It provides a key-value API
• Strong consistency, so clients can see data immediately after it is written
• HBase is designed for terabytes to petabytes of data, so it optimizes
for this use case.
Use cases: Mozilla crash reports, Facebook, Twitter, StumbleUpon
HBASE: PART OF HADOOP’S ECOSYSTEM

HBase is built on top of HDFS
HBase files are internally stored in HDFS
HBASE
• HBase is a distributed column-oriented data store built on top of
HDFS
• HBase is an Apache open source project whose goal is to provide
storage for the Hadoop Distributed Computing
• Data is logically organized into tables, rows and columns
• Key/value column family store
• Data stored in HDFS
• ZooKeeper for coordination
• Access model is get/put/del
• Plus range scans and versions

HBASE
• HBase is a key-value store on top of HDFS
• It is a NoSQL database
• Very thin layer over raw HDFS
– Data is grouped in a Table that has rows of data.
– Each row can have multiple 'Column Families'
– Each 'Column Family' contains multiple columns.
– Each column name is the key and it has its corresponding column value.
– Each row doesn't need to have the same number of columns
KEY VALUE COLUMN VALUE STORE
• A column family is a collection of columns
• One or more cells form a row that is addressed by a unique row key
• All rows are sorted lexicographically by row key
• Each column may have multiple versions, with each distinct value stored
in a different cell
• Access to row data is atomic and includes any number of columns being
read or written
INSTALLATION (1)
START Hadoop…

$ wget
http://ftp.twaren.net/Unix/Web/apache/hadoop/hbase/h
base-0.20.2/hbase-0.20.2.tar.gz
$ sudo tar -zxvf hbase-*.tar.gz -C /opt/
$ sudo ln -sf /opt/hbase-0.20.2 /opt/hbase
$ sudo chown -R $USER:$USER /opt/hbase
$ sudo mkdir /var/hadoop/
$ sudo chmod 777 /var/hadoop
CONNECTING TO HBASE

• Java client
– get(byte [] row, byte [] column, long timestamp, int versions);
• Non-Java clients
– Thrift server hosting HBase client instance
• Sample ruby, C++, & java (via thrift) clients
– REST server hosts HBase client
• TableInput/OutputFormat for MapReduce
– HBase as MR source
– HBase Shell
– JRuby to add get, scan, and admin
– ./bin/hbase shell YOUR_SCRIPT
THRIFT

• A software framework for scalable cross-language services development.
• Works seamlessly between C++, Java, Python, PHP, and Ruby.
• A similar alternative is the REST gateway.

$ hbase-daemon.sh start thrift
$ hbase-daemon.sh stop thrift
HBASE
SHELL
HBASE SHELL
$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.92.1-cdh4.0.0, rUnknown, Mon Jun 4 17:27:36 PDT 2012
hbase(main):001:0>

hbase(main):001:0> list
TABLE
0 row(s) in 0.5710 seconds
hbase(main):002:0>

Now lets create a table and store data


STORING DATA

HBase uses the table as the top-level structure for storing


data. To write data into HBase, you need a table to write it
into. To begin, create a table called mytable with a single
column family

hbase(main):002:0> create 'mytable', 'cf'


0 row(s) in 1.0730 seconds
hbase(main):003:0> list
TABLE
mytable
1 row(s) in 0.0080 seconds
WRITING DATA

Let's add the string hello HBase to the table. In HBase parlance, we say, "Put
the bytes 'hello HBase' to a cell in 'mytable' in the 'first' row at the
'cf:message' column"

hbase(main):004:0> put 'mytable', 'first', 'cf:message', 'hello HBase'
0 row(s) in 0.2070 seconds
hbase(main):005:0> put 'mytable', 'second', 'cf:foo', 0x0 (zero, x, zero)
0 row(s) in 0.0130 seconds
hbase(main):006:0> put 'mytable', 'third', 'cf:bar', 3.14159
0 row(s) in 0.0080 seconds

• You now have three cells in three rows in your table. Notice that you didn't define
the columns before you used them. Nor did you specify what type of data you stored in
each column. This is what the NoSQL crowd means when they say HBase is a
schema-less database.
READING DATA
• HBase gives you two ways to read data: get and scan. Command to
store the cells was put. get is the complement of put, reading back a
single row.

mytable (column family cf):
row key   column       value
first     cf:message   hello HBase
second    cf:foo       0
third     cf:bar       3.14159
SCAN AND GET
hbase(main):008:0> scan 'mytable'
ROW COLUMN+CELL
first column=cf:message, timestamp=1323483954406, value=hell

second column=cf:foo, timestamp=1323483964825, value=0


third column=cf:bar, timestamp=1323483997138, value=3.14159
3 row(s) in 0.0240 seconds

hbase(main):007:0> get 'mytable', 'first'


COLUMN CELL
cf:message timestamp=1323483954406, value=hello HBase
1 row(s) in 0.0250 seconds

The shell shows you all the cells in the row, organized by column, with the value
associated at each timestamp. HBase can store multiple versions of each cell. The
default number of versions stored is three, but it’s configurable.
At read time, only the latest version is returned, unless otherwise specified.
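
A small shell illustration of working with versions (the table 'vtable' and the prompt numbers are made up; VERSIONS is a real column-family attribute and get option):

hbase(main):009:0> get 'mytable', 'first', {COLUMN => 'cf:message', VERSIONS => 3}
hbase(main):010:0> create 'vtable', {NAME => 'cf', VERSIONS => 5}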
TABLE ‘USERS’ CREATION
hbase(main):001:0> create 'users', 'info'
0 row(s) in 0.1200 seconds
hbase(main):002:0>

• Columns in HBase are organized into groups called column families.


info is a column family in table ‘users’
• Column families impact physical characteristics of the data store in
HBase. For this reason, at least one column family must be specified
at table creation time. Other than the column family name, HBase
doesn't require you to tell it anything about your data ahead of time.
That's why HBase is often described as a schema-less database.
SHELL
hbase(main):003:0> describe 'users'
DESCRIPTION ENABLED
{NAME => 'users', FAMILIES => [{NAME => 'info', true
BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0
', COMPRESSION => 'NONE', VERSIONS => '3', TTL
=> '2147483647', BLOCKSIZE => '65536', IN_MEMOR
Y => 'false', BLOCKCACHE => 'true'}]}
1 row(s) in 0.0330 seconds
hbase(main):004:0>

The shell describes the table with two properties: the table name and the list of
column families (physical characteristics).
When not using the shell, we open a connection with the Java client library instead.
The code for connecting to this table is:
HTableInterface usersTable = new HTable("users");
JAVA CONNECTION
• The HTable constructor reads the default configuration information
to locate HBase, similar to the way the shell did. It then locates the
‘users’ table you created earlier and gives you a handle to it.
(configuration parameters can be picked by the Java client from the
hbase-site.xml file in their classpath). HBase client applications need
to have only one configuration piece available to them to access
HBase—the ZooKeeper quorum address.

• Rather than instantiating HTables directly, using HTablePool is more
common in practice.
ACCESS TO DATA
COORDINATES
• Coordinates are used to locate data:
(rowkey, column family, column qualifier) → cell value
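
A hedged Java sketch (HBase 0.92-era client API; the row key, qualifier and value are made up) showing how a Put and a Get address a cell by these coordinates:

// Illustrative sketch only: writing and reading a cell by its coordinates
// (rowkey, column family, column qualifier).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class UsersExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
    HTable users = new HTable(conf, "users");

    // Put: coordinates (rowkey "row1", family "info", qualifier "name") -> value
    Put p = new Put(Bytes.toBytes("row1"));
    p.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Amy"));
    users.put(p);

    // Get: read the same cell back using the same coordinates
    Get g = new Get(Bytes.toBytes("row1"));
    Result r = users.get(g);
    byte[] value = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
    System.out.println(Bytes.toString(value));           // prints "Amy"

    users.close();
  }
}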
HBASE: KEYS AND COLUMN FAMILIES

Each record is divided into Column Families

Each row has a Key

Each column family consists of one or more Columns

WRITE PROCESS
• When something is written to HBase, it is first written to an in-memory
store (MemStore); once this MemStore reaches a certain size, it is flushed
to disk into a store file, also called an HFile (everything is also written
immediately to a log file for durability). The store files created on disk
are immutable. Sometimes the store files are merged together; this is done
by a process called compaction.
• This log file is called the write-ahead log (WAL), also referred to as the
HLog
• HFile is the underlying storage format for HBase. HFiles belong to a
column family, and a column family can have multiple HFiles. But a
single HFile can’t have data for multiple column families. There is
one MemStore per column family
• http://www.ngdata.com/visualizing-hbase-flushes-and-compactions/
NOTES ON DATA MODEL
• HBase schema consists of several Tables
• Each table consists of a set of Column Families
– Columns are not part of the schema
• HBase has Dynamic Columns
– Because column names are encoded inside the cells
– Different cells can have different columns

[Figure: the "Roles" column family has different columns in different cells]
NOTES ON DATA MODEL (CONT’D)
• The version number can be user-supplied
– It does not even have to be inserted in increasing order
– Version numbers are unique within each key
• Tables can be very sparse
– Many cells are empty
• Keys are indexed as the primary key
[Figure: the example row has two columns, cnnsi.com & my.look.ca]
HBASE PHYSICAL MODEL
• Each column family is stored in a separate file (called HTables)
• Key & Version numbers are replicated with each column family
• Empty cells are not stored

HBase maintains a multi-level index on values:
<key, column family, column name, timestamp>
EXAMPLE

COLUMN FAMILIES

HBASE REGIONS
• Each HTable (column family) is partitioned horizontally into regions
– Regions are the counterpart to HDFS blocks
[Figure: each horizontal partition of the table becomes one region]
THREE MAJOR COMPONENTS

• The HBaseMaster
– One master

• The HRegionServer
– Many region servers

• The HBase client

HBASE COMPONENTS
• Region
– A subset of a table’s rows, like horizontal range partitioning
– Automatically done
• RegionServer (many slaves)
– Manages data regions
– Serves data for reads and writes (using a log)
• Master
– Responsible for coordinating the slaves
– Assigns regions, detects failures
– Admin functions

HBASE
ARCHITECTURE

Logical Data Model
A sparse, multi-dimensional, sorted map

[Table figure: Table A has rows "a" and "b" with column families cf1 and cf2; qualifiers
such as "bar", "foo", "2011-07-04", 1.0001 and "thumb" hold values ranging from numbers
and strings to raw PNG bytes, each tagged with a timestamp.]

Legend:
- Rows are sorted by rowkey.
- Within a row, values are located by column family and qualifier.
- Values also carry a timestamp; there can be multiple versions of a value.
- Within a column family, data is schemaless. Qualifiers and values are treated as arbitrary bytes.
Logical Architecture
Distributed, persistent partitions of a BigTable

[Figure: Table A's rowkey space (a–p) is split into Regions 1–4, which are hosted on
different Region Servers (7, 86, 367) alongside regions of other tables (C, E, F, G,
L, P).]

Legend:
- A single table is partitioned into Regions of roughly equal size.
- Regions are assigned to Region Servers across the cluster.
- Region Servers host roughly the same number of regions.
Physical Architecture
Distribution and Data Path

[Figure: Java applications, the REST/Thrift gateway and the HBase shell all embed the
HBase client, which talks to ZooKeeper, the HBase Master and the Region Servers; each
Region Server is collocated with an HDFS DataNode, with the NameNode alongside.]

Legend:
- An HBase RegionServer is collocated with an HDFS DataNode.
- HBase clients communicate directly with Region Servers for sending and receiving data.
- HMaster manages Region assignment and handles DDL operations.
- Online configuration state is maintained in ZooKeeper.
- HMaster and ZooKeeper are NOT involved in the data path.
HBASE

Anatomy of a
RegionServer

REGIONSERVER
• RegionServer: every write request goes to a RegionServer, which directs
the request to the appropriate Region.
• Each region stores rows. Row data is separated by column families. Data
related to a particular CF is stored in an HStore (which consists of a
MemStore plus a set of HFiles). The MemStore lives in the RegionServer's
memory, while the HFiles are in HDFS.
Storage Machinery
Implementing the data model

[Figure: a RegionServer hosts the HLog (WAL), the BlockCache and several HRegions; each
HRegion contains one HStore per column family, and each HStore holds a MemStore plus
StoreFiles backed by HFiles on HDFS.]

Legend:
- A RegionServer contains a single WAL, a single BlockCache, and multiple Regions.
- A Region contains multiple Stores, one for each Column Family.
- A Store consists of multiple StoreFiles and a MemStore.
- A StoreFile corresponds to a single HFile.
- HFiles and the WAL are persisted on HDFS.
For what workloads
• It depends on how you tune it, but…
• HBase is good for:
– Large datasets
– Sparse datasets
– Loosely coupled (denormalized) records
– Lots of concurrent clients
• Try to avoid:
– Small datasets (unless you have lots of them)
– Highly relational records
– Schema designs requiring transactions
MEMSTORE
• When something is written to HBase, it is first written to an in-memory
store (MemStore); once this MemStore reaches a certain size, it is flushed
to disk into a store file (everything is also written immediately to a log
file for durability). The store files created on disk are immutable.
Sometimes the store files are merged together; this is done by a process
called compaction.
• http://www.ngdata.com/visualizing-hbase-flushes-and-compactions/
LOGGING OPERATIONS

HBASE DEPLOYMENT

[Figure: a master node and several slave nodes in a typical HBase deployment.]
How does it integrate with my infrastructure?
• Horizontally scale application data
– Highly concurrent, read/write access
– Consistent, persisted shared state
– Distributed online data processing via Coprocessors (experimental)
• Gateway between online services and offline storage/analysis
– Staging area to receive new data
– Serve online, indexed "views" on datasets from HDFS
– Glue between batch (HDFS, MR1) and online (CEP, Storm) systems
What data semantics
• GET, PUT, DELETE key-value operations
• SCAN for queries
• Row-level write atomicity
• MapReduce integration
– Online API (today)
– Bulkload (today)
– Snapshots (coming)
What about operational concerns?
• Provision hardware with more spindles/TB
• Balance memory and IO for reads
– Contention between random and sequential access
– Configure block size, BlockCache, compression, codecs based on access patterns
– Additional resources:
"HBase: Performance Tuners," http://labs.ericsson.com/blog/hbase-performance-tuners
"Scanning in HBase," http://hadoop-hbase.blogspot.com/2012/01/scanning-in-hbase.html
• Balance IO for writes
– Configure compactions, region size, compression, pre-splits, etc. based on the
write pattern
– Balance IO contention between compaction work and serving reads
WRITE PROCESS
WRITE PROCESS-2
• If HBase goes down, the data that was not yet flushed from the MemStore to
an HFile can be recovered by replaying the WAL, all handled under the hood.
• There is a single WAL per HBase server, shared by all tables (and their
column families) served from that server.
• We don't recommend disabling the WAL unless you're willing to lose data
when things fail.
MEMSTORE FLUSHING
• MemStore size which causes flushing is configured on two levels:
– per RS: % of heap occupied by memstores
– per table: size in MB of single memStore (per CF) of region
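
As an illustration of where those two levels are configured, a hedged hbase-site.xml snippet (property names from HBase 0.92-era defaults; the values are examples, not recommendations):

<!-- per RS: flush when all memstores together occupy this fraction of the heap -->
<property>
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.4</value>
</property>
<!-- per region/CF: flush a MemStore once it reaches this many bytes (128 MB here) -->
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>134217728</value>
</property>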
MEMSTORE
• Apart from solving the "non-ordered" problem, the MemStore also has other
benefits, e.g.:
• It acts as an in-memory cache which keeps recently added data. This is
useful in numerous cases when the last written data is accessed more
frequently than older data
• Certain optimizations can be applied to rows/cells while they are stored in
memory, before they are written to the persistent store. E.g. when a CF is
configured to store only one version of a cell and the MemStore contains
multiple updates for that cell, only the most recent one needs to be kept;
older ones can be omitted (and never written to an HFile).
• An important thing to note is that every MemStore flush creates one HFile
per CF.
MEMSTORE FLUSHES
HFILES COMPACTION
DATA LOCALITY
DETAILED ARCHITECTURE
READ PROCESS

• HBase keeps data ordered and keeps as much of it as possible in memory.
• HBase has a BlockCache for reads that sits in the JVM heap alongside the
MemStore. Each column family has its own BlockCache.
• An HFile is physically laid out as blocks plus an index on the blocks.
• A block in the BlockCache is 64 KB and is the unit of data read in a
single pass. Reading a row from HBase requires first checking the MemStore
for any pending modifications. Then the BlockCache is examined to see if
the block containing this row has been recently accessed. Finally, the
relevant HFiles on disk are accessed.
READ OPERATION
The .META. table holds the list of all user-space regions.
The -ROOT- table holds the list of .META. table regions.
BIG PICTURE

ZOOKEEPER

• HBase depends on ZooKeeper and by default it manages a ZooKeeper
instance as the authority on cluster state
ZOOKEEPER
• HBase depends on ZooKeeper
• By default HBase manages the ZooKeeper instance
– E.g., starts and stops ZooKeeper
• HMaster and HRegionServers register themselves with ZooKeeper

CREATING A TABLE
HBaseAdmin admin = new HBaseAdmin(config);
HColumnDescriptor[] column = new HColumnDescriptor[2];
column[0] = new HColumnDescriptor("columnFamily1:");
column[1] = new HColumnDescriptor("columnFamily2:");
HTableDescriptor desc = new HTableDescriptor(Bytes.toBytes("MyTable"));
desc.addFamily(column[0]);
desc.addFamily(column[1]);
admin.createTable(desc);
OPERATIONS ON REGIONS: GET()
• Given a key → return the corresponding record
• For each value, return the highest version
• You can control the number of versions you want
OPERATIONS ON REGIONS: SCAN()

GET()
Select value from table where key='com.apache.www' AND label='anchor:apache.com'

[Table figure: row "com.apache.www" holds anchor:apache.com = "APACHE" at timestamp t10;
row "com.cnn.www" holds anchor:cnnsi.com = "CNN" at t9 and anchor:my.look.ca = "CNN.com"
at t8; further timestamps t3–t12 are shown without values.]
OPERATIONS ON REGIONS: PUT()
• Insert a new record (with a new key), Or
• Insert a record for an existing key
Implicit version number
(timestamp)

Explicit version number

OPERATIONS ON REGIONS: DELETE()

• Marking table cells as deleted
• Multiple levels
– Can mark an entire column family as deleted
– Can mark all column families of a given row as deleted
• All operations are logged by the RegionServers
• The log is flushed periodically
ALTERING A TABLE

Disable the table before changing the schema

WHEN TO USE HBASE
• Random write, read or both
• Variable schema in each record
• Collections of data for each key
• Atomic control of per-key data
• Row access to each column family
• Access patterns well-known and simple

HBASE
• HBase uses HDFS for reliable storage
– Handles checksums, replication, failover
• Master manages cluster
• RegionServer manage data
• ZooKeeper is the ‘neural network’ for bootstrapping and
coordinating cluster
BLOOM FILTER
• Generated when an HFile is persisted, stored at the end of each file, and
loaded into memory
• Allows a check at the row or row+column level
• Can filter entire store files out of reads
• Useful when many misses are expected during reads (non-existing keys)
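
Bloom filters are enabled per column family when the table is created (or altered); a hedged shell example with a made-up table name:

hbase(main):011:0> create 'webtable', {NAME => 'cf', BLOOMFILTER => 'ROWCOL'}    (row+column level; use 'ROW' for row-level checks)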
HIVE HBASE INTEGRATION
• Reasons to use Hive on HBase:
– A lot of data sitting in HBase due to its usage in a real-time
environment, but never used for analysis
– Give access to data in HBase usually only queried through
MapReduce to people that don’t code (business analysts)
– When needing a more flexible storage solution, so that rows can
be updated live by either a Hive job or an application, with the
change immediately visible to the other

• Reasons not to do it:


– Run SQL queries on HBase to answer live user requests (it’s still
a MR job)
– Hoping to see interoperability with other SQL analytics systems
HIVE AND HBASE

• How it works:
– Hive can use tables that already exist in HBase or manage
its own ones, but they still all reside in the same HBase
instance

[Figure: Hive table definitions either point to an existing HBase table or manage a
table created from Hive; either way, the data resides in the same HBase instance.]
INTEGRATION

[Figure: the Hive table definition "persons" maps onto the HBase table "people":
name STRING → d:fullname, age INT → d:age, siblings MAP<string, string> → the f:
column family; the HBase column d:address is shown unmapped.]
HIVE WITH HBASE

The first step is to create a sample HBase table ‘my_table’


hbase(main):004:0> create 'my_table', 'test'
0 row(s) in 1.2650 seconds

hbase(main):006:0> put 'my_table', '1', 'test:mydata1', 'value1'


0 row(s) in 0.1420 seconds

hbase(main):007:0> put 'my_table', '2', 'test:mydata1', 'value2'


0 row(s) in 0.0050 seconds

hbase(main):008:0> put 'my_table', '3', 'test:mydata1', 'value3'


0 row(s) in 0.0140 seconds

hbase(main):009:0> put 'my_table', '4', 'test:mydata1', 'value4'


0 row(s) in 0.0170 seconds
HIVE WITH HBASE
Go to Hive terminal and create external table test_all
[root@localhost training]# hive
hive> create external table test_all (id string,colname
map<string,string>) stored by
'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with
serdeproperties ("hbase.columns.mapping" = ":key,test:")
tblproperties("hbase.table.name"="my_table");
OK
Time taken: 7.574 seconds
hive>
HIVE WITH HBASE: SELECT * FROM TEST_ALL

hive> select * from test_all;


OK
1 {"mydata1":"value1"}
2 {"mydata1":"value2"}
3 {"mydata1":"value3"}
4 {"mydata1":"value4"}
Time taken: 0.777 seconds
hive>
INTEGRATION
• Drawbacks (that can be fixed with brain juice):
– Binary keys and values (like integers represented on 4
bytes) aren’t supported since Hive prefers string
representations, HIVE-1634
– Compound row keys aren’t supported, there’s no way of
using multiple parts of a key as different “fields”
– This means that concatenated binary row keys are
completely unusable, which is what people often use for
HBase
– Filters are done at Hive level instead of being pushed to
the region servers
– Partitions aren’t supported
DATA FLOWS
• Data is being generated all over the place:
– Apache logs
– Application logs
– MySQL clusters
– HBase clusters
USE CASES
• Front-end engineers
– They need some statistics regarding their latest product
• Research engineers
– Ad-hoc queries on user data to validate some assumptions
– Generating statistics about recommendation quality
• Business analysts
– Statistics on growth and activity
– Effectiveness of advertiser campaigns
– Users’ behavior VS past activities to determine, for example,
why certain groups react better to email communications
– Ad-hoc queries on stumbling behaviors of slices of the user base
TABLE IN HBASE
Using a simple table in HBase:
CREATE EXTERNAL TABLE blocked_users(
userid INT,
blockee INT,
blocker INT,
created BIGINT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
":key,f:blockee,f:blocker,f:created")
TBLPROPERTIES("hbase.table.name" = "m2h_repl-userdb.stumble.blocked_users");

HBase is a special case here, it has a unique row key map with :key
Not all the columns in the table need to be mapped
HBASE VS. HDFS
• Both are distributed systems that scale to hundreds or thousands of
nodes

• HDFS is good for batch processing (scans over big files)


– Not good for record lookup
– Not good for incremental addition of small batches
– Not good for updates

HBASE VS. HDFS (CONT’D)
• HBase is designed to efficiently address the above points
– Fast record lookup
– Support for record-level insertion
– Support for updates (not in place)

• HBase updates are done by creating new versions of values

HBASE VS. RDBMS

HBASE VS. HDFS

If the application has neither random reads nor random writes → stick to HDFS
OOZIE
OOZIE OVERVIEW

Main Features
– Execute and monitor workflows in Hadoop
– Periodic scheduling of workflows
– Trigger execution by data availability
– HTTP and command line interface + Web console

Adoption
– ~100 users on mailing list since launch on github
– In production at Yahoo!, running >200K jobs/day
OOZIE WORKFLOW OVERVIEW
Purpose:
Execution of workflows on the Grid

Oozie

WS Tomcat Hadoop/Pig/HDFS
API web-app

DB
OOZIE WORKFLOW
Directed Acyclic Graph of Jobs
[Figure: an example DAG — start forks into an M/R streaming job and a Java main, which
join into a Pig job; a decision node then either loops back through another M/R job
(MORE) or, when ENOUGH, continues through a Java main and an FS action to the end node.]
OOZIE WORKFLOW EXAMPLE
[Figure: Start → M-R wordcount → (OK) End / (Error) Kill]

<workflow-app name='wordcount-wf'>
  <start to='wordcount'/>
  <action name='wordcount'>
    <map-reduce>
      <job-tracker>foo.com:9001</job-tracker>
      <name-node>hdfs://bar.com:9000</name-node>
      <configuration>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to='end'/>
    <error to='kill'/>
  </action>
  <kill name='kill'/>
  <end name='end'/>
</workflow-app>
OOZIE WORKFLOW NODES
• Control Flow:
– start/end/kill
– decision
– fork/join

• Actions:
– map-reduce
– pig
– hdfs
– sub-workflow
– java – run custom Java code
OOZIE WORKFLOW APPLICATION

A HDFS directory containing:

– Definition file: workflow.xml


– Configuration file: config-default.xml
– App files: lib/ directory with JAR and SO files
– Pig Scripts
RUNNING AN OOZIE WORKFLOW JOB

Application Deployment:
$ hadoop fs -put wordcount-wf hdfs://bar.com:9000/usr/abc/wordcount

Workflow Job Parameters:


$ cat job.properties
oozie.wf.application.path = hdfs://bar.com:9000/usr/abc/wordcount
input = /usr/abc/input-data
output = /user/abc/output-data

Job Execution:
$ oozie job -run -config job.properties
job: 1-20090525161321-oozie-xyz-W
MONITORING AN OOZIE WORKFLOW JOB

Workflow Job Status:


$ oozie job -info 1-20090525161321-oozie-xyz-W
------------------------------------------------------------------------
Workflow Name : wordcount-wf
App Path : hdfs://bar.com:9000/usr/abc/wordcount
Status : RUNNING

Workflow Job Log:
$ oozie job -log 1-20090525161321-oozie-xyz-W

Workflow Job Definition:
$ oozie job -definition 1-20090525161321-oozie-xyz-W
OOZIE COORDINATOR OVERVIEW

Purpose:
– Coordinated execution of workflows on the Grid
– Workflows are backwards compatible
[Figure: the Oozie client calls the WS API on Tomcat; the Oozie Coordinator checks data
availability and triggers the Oozie Workflow engine, which runs jobs on Hadoop.]
OOZIE APPLICATION LIFECYCLE
[Figure: a coordinator job spans a start and end time; at each multiple of the frequency
(0*f, 1*f, 2*f, …, N*f) it materializes a coordinator action, and each action starts a
workflow in the Oozie Workflow Engine.]
USE CASE 1: TIME TRIGGERS
• Execute your workflow every 15 minutes (CRON)

00:15 00:30 00:45 01:00


RUN WORKFLOW EVERY 15 MINS
<coordinator-app name="coord1"
start="2009-01-08T00:00Z"
end="2010-01-01T00:00Z"
frequency="15"
xmlns="uri:oozie:coordinator:0.1">
<action>
<workflow>
<app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
<configuration>
<property> <name>key1</name><value>value1</value> </property>
</configuration>
</workflow>
</action>
</coordinator-app>
TIME AND DATA TRIGGERS

• Materialize your workflow every hour, but only run them


when the input data is ready.
[Figure: at each hour (01:00, 02:00, 03:00, 04:00, …) Oozie checks whether the input
data exists in Hadoop before running the workflow.]
DATA TRIGGERS
<coordinator-app name="coord1" frequency="${1*HOURS}" …>
  <datasets>
    <dataset name="logs" frequency="${1*HOURS}" initial-instance="2009-01-01T00:00Z">
      <uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="inputLogs" dataset="logs">
      <instance>${current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
      <configuration>
        <property> <name>inputData</name><value>${dataIn('inputLogs')}</value> </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
ROLLING WINDOWS
• Access 15 minute datasets and roll them up into hourly
datasets

00:15 00:30 00:45 01:00 01:15 01:30 01:45 02:00

01:00 02:00
ROLLING WINDOWS
<coordinator-app name="coord1" frequency="${1*HOURS}" …>
  <datasets>
    <dataset name="logs" frequency="15" initial-instance="2009-01-01T00:00Z">
      <uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="inputLogs" dataset="logs">
      <start-instance>${current(-3)}</start-instance>
      <end-instance>${current(0)}</end-instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
      <configuration>
        <property> <name>inputData</name><value>${dataIn('inputLogs')}</value> </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
SLIDING WINDOWS
• Access last 24 hours of data, and roll them up every hour.


[Figure: each hourly run consumes the previous 24 hours of data, so consecutive runs
operate on windows that slide forward by one hour.]
OOZIE COORDINATOR APPLICATION

A HDFS directory containing:

– Definition file: coordinator.xml


– Configuration file: coord-config-default.xml
RUNNING AN OOZIE COORDINATOR JOB

Application Deployment:
$ hadoop fs -put coord_job hdfs://bar.com:9000/usr/abc/coord_job

Coordinator Job Parameters:


$ cat job.properties
oozie.coord.application.path = hdfs://bar.com:9000/usr/abc/coord_job

Job Execution:
$ oozie job -run -config job.properties
job: 1-20090525161321-oozie-xyz-C
MONITORING AN OOZIE COORDINATOR JOB

Coordinator Job Status:


$ oozie job -info 1-20090525161321-oozie-xyz-C
------------------------------------------------------------------------
Job Name : wordcount-coord
App Path : hdfs://bar.com:9000/usr/abc/coord_job
Status : RUNNING

Coordinator Job Log:
$ oozie job -log 1-20090525161321-oozie-xyz-C

Coordinator Job Definition:
$ oozie job -definition 1-20090525161321-oozie-xyz-C
OOZIE WEB CONSOLE: LIST JOBS
OOZIE WEB CONSOLE: JOB DETAILS
OOZIE WEB CONSOLE: FAILED ACTION
OOZIE WEB CONSOLE: ERROR MESSAGES
WHAT’S NEXT FOR OOZIE?
New Features
– More out-of-the-box actions: distcp, hive, …
– Authentication framework
• Authenticate a client with Oozie
• Authenticate an Oozie workflow with downstream
services
– Bundles: Manage multiple coordinators together
– Asynchronous data sets and coordinators
Scalability
– Memory footprint
– Data notification instead of polling
Integration with Howl (http://github.com/yahoo/howl)
RESOURCES

Oozie is Open Source


• Source: http://github.com/yahoo/oozie
• Docs: http://yahoo.github.com/oozie
• List: http://tech.groups.yahoo.com/group/Oozie-users/

To Contribute:
• https://github.com/yahoo/oozie/wiki/How-To-Contribute
SQOOP
WHAT IS SQOOP
• Tool to transfer data from relational databases
– Teradata, MySQL, PostgreSQL, Oracle, Netezza
• To Hadoop ecosystem
– HDFS (text, sequence file), Hive, HBase, Avro
• And vice versa
• Based on Connectors
– Responsible for Metadata lookups, and Data Transfer
– Majority of connectors are JDBC based
– Non-JDBC (direct) connectors for optimized data transfer
• Connectors responsible for all supported functionality
– HBase Import, Avro Support
• The canonical use case is performing a nightly dump of all the data in a
transactional relational database into Hadoop for offline analysis. The
popularity of Sqoop in enterprise systems confirms that Sqoop does
bulk transfer admirably.
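
A hedged example of such a nightly import (the MySQL host, database, table and HDFS path are placeholders):

$ sqoop import \
    --connect jdbc:mysql://db.example.com/sales \
    --username dbuser -P \
    --table orders \
    --target-dir /warehouse/orders \
    --num-mappers 4

Adding --hive-import would load the same table straight into the Hive metastore instead of a plain HDFS directory.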
What is Sqoop?
Traditional ETL
[Figure: a dedicated ETL application sits between the source data store and the target
data store and moves the data through itself.]

What is Sqoop?
A different paradigm
[Figure: the application pulls the data directly from the source store into the target.]

What is Sqoop?
A very scalable different paradigm
[Figure: many application instances (map tasks) transfer slices of the data in parallel
into the target store.]
WHY SQOOP?
• Efficient/Controlled resource utilization
– Concurrent connections, Time of operation
• Datatype mapping and conversion
– Automatic, and User override
• Metadata propagation
– Sqoop Record
– Hive Metastore
– Avro

SQOOP

MAHOUT
DATA ANALYTICS
• Include machine learning and data mining tools
– Analyze/mine/summarize large datasets
– Extract knowledge from past data
– Predict trends in future data

DATA MINING & MACHINE LEARNING

• Subset of Artificial Intelligence (AI)


• Lots of related fields and applications
– Information Retrieval
– Stats
– Biology
– Linear algebra
– Marketing and Sales

TOOLS & ALGORITHMS
• Collaborative Filtering
• Clustering Techniques
• Classification Algorithms
• Association Rules
• Frequent Pattern Mining
• Statistical libraries (Regression, SVM, …)
• Others…

COMMON USE CASES

RECOMMENDATIONS
• Predict what the user likes based on
– His/Her historical behavior
– Aggregate behavior of people similar to him
IN OUR CONTEXT…

• Machine learning / data mining tools: efficient in analyzing and mining data,
but do not scale
• Hadoop: efficient in managing big data, but does not analyze or mine the data

How can we integrate these two worlds together?
OTHER PROJECTS
• Apache Mahout
– Open-source package on Hadoop for data mining and
machine learning

• Revolution R (R-Hadoop)
– Extensions to R package to run on Hadoop

APACHE MAHOUT
• Apache Software Foundation project
• Create scalable machine learning libraries
• Why Mahout? Many Open Source ML libraries either:
– Lack Community
– Lack Documentation and Examples
– Lack Scalability
– Or are research-oriented

GOAL 1: MACHINE LEARNING

[Figure: the Mahout stack]
- Applications
- Examples
- Algorithms: Genetic, Frequent Pattern Mining, Classification, Clustering, Recommenders
- Math utilities: Vectors/Matrices/SVD, Collections (primitives), Apache Lucene/Vectorizer
- Runs on Apache Hadoop
GOAL 2: SCALABILITY
• Be as fast and efficient as possible given the intrinsic design of
the algorithm
• Most Mahout implementations are MapReduce enabled
• Work in progress
INTERESTING PROBLEMS
• Cluster users talking about Faculty Summit and cluster them based
on what they are tweeting
– Can you suggest people to network with.
• Take user-generated tags that people have assigned to musicians and
cluster them
– Use the clusters to pre-populate a suggest-box that autocompletes
tags as users type
• Cluster movies based on abstract and description and show related
movies.
– Note: How it can augment recommendations or
collaborative filtering algorithms.
MAHOUT PACKAGE

C1: COLLABORATIVE FILTERING

C2: CLUSTERING
• Group similar objects together

• K-Means, Fuzzy K-Means, Density-Based,…

• Different distance measures


– Manhattan, Euclidean, …

C3: CLASSIFICATION

FPM: FREQUENT PATTERN MINING
• Find the frequent itemsets
– <milk, bread, cheese> are sold frequently together

• Very common in market analysis, access pattern analysis, etc…

O: OTHERS
• Outlier detection
• Math library
– Vectors, matrices, etc.
• Noise reduction
WE FOCUS ON…
• Clustering  K-Means

• Classification  Naïve Bayes

-- Technique logic
• Frequent Pattern Mining  Apriori
-- How to implement in Hadoop

190
K-MEANS ALGORITHM

An iterative algorithm that repeats until it converges
K-MEANS ALGORITHM
• Step 1: Select K points at random (Centers)
• Step 2: For each data point, assign it to the closest center
– Now we formed K clusters
• Step 3: For each cluster, re-compute the centers
– E.g., in the case of 2D points:
• X: average over all x-axis points in the cluster
• Y: average over all y-axis points in the cluster
• Step 4: If the new centers are different from the old centers
(previous iteration) → go to Step 2
K-MEANS IN MAPREDUCE
• Input
– Dataset (set of points in 2D) --Large
– Initial centroids (K points) --Small

• Map Side
– Each map reads the K-centroids + one block from dataset
– Assign each point to the closest centroid
– Output <centroid, point>

K-MEANS IN MAPREDUCE (CONT’D)
• Reduce Side
– Gets all points for a given centroid
– Re-compute a new centroid for this cluster
– Output: <new centroid>

• Iteration Control
– Compare the old and the new set of K centroids
• If similar → stop
• Else
– If the max number of iterations has been reached → stop
– Else → start another MapReduce iteration
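
A minimal Java sketch of one such iteration (this is not Mahout's implementation; the "x,y" point format and the kmeans.centroids configuration key are assumptions for illustration). The combiner optimization described on the next slide would emit partial sums and counts per centroid instead of the raw points.

// One K-Means iteration as a Hadoop MapReduce job: the mapper assigns each
// point to its closest centroid, the reducer recomputes each centroid.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansIteration {

  public static class AssignMapper
      extends Mapper<LongWritable, Text, IntWritable, Text> {
    private double[][] centroids;   // K centroids, loaded once per mapper

    @Override
    protected void setup(Context ctx) {
      // Centroids passed via the job configuration as "x1,y1;x2,y2;..." (assumed format)
      String[] cs = ctx.getConfiguration().get("kmeans.centroids").split(";");
      centroids = new double[cs.length][2];
      for (int i = 0; i < cs.length; i++) {
        String[] xy = cs[i].split(",");
        centroids[i][0] = Double.parseDouble(xy[0]);
        centroids[i][1] = Double.parseDouble(xy[1]);
      }
    }

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] xy = value.toString().split(",");
      double x = Double.parseDouble(xy[0]), y = Double.parseDouble(xy[1]);
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int i = 0; i < centroids.length; i++) {      // find the closest centroid
        double dx = x - centroids[i][0], dy = y - centroids[i][1];
        double d = dx * dx + dy * dy;
        if (d < bestDist) { bestDist = d; best = i; }
      }
      ctx.write(new IntWritable(best), value);          // emit <centroid id, point>
    }
  }

  public static class RecomputeReducer
      extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable key, Iterable<Text> points, Context ctx)
        throws IOException, InterruptedException {
      double sumX = 0, sumY = 0;
      long n = 0;
      for (Text p : points) {                           // average the assigned points
        String[] xy = p.toString().split(",");
        sumX += Double.parseDouble(xy[0]);
        sumY += Double.parseDouble(xy[1]);
        n++;
      }
      ctx.write(key, new Text((sumX / n) + "," + (sumY / n)));  // new centroid
    }
  }
  // A driver (not shown) reruns this job until the centroids stop moving
  // or the maximum number of iterations is reached.
}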
K-MEANS OPTIMIZATIONS
• Use of Combiners
– Similar to the reducer
– Computes for each centroid the local sums (and counts) of the assigned points
– Sends to the reducer <centroid, <partial sums>>

• Use of Single Reducer


– Amount of data to reducers is very small
– Single reducer can tell whether any of the centers has changed or not
– Creates a single output file

NAÏVE BAYES CLASSIFIER
• In simple terms, a naive Bayes classifier assumes that the presence
or absence of a particular feature is unrelated to the presence or
absence of any other feature, given the class variable. For example, a
fruit may be considered to be an apple if it is red, round, and about
3" in diameter. A naive Bayes classifier considers each of these
features to contribute independently to the probability that this fruit
is an apple, regardless of the presence or absence of the other
features.

NAÏVE BAYES CLASSIFIER
• Given a dataset (training data), we learn (build) a statistical
model
– This model is called “Classifier”

• Each point in the training data is in the form of:
– <label, feature 1, feature 2, …, feature N>
– Label → the class label
– Features 1..N → the features (dimensions of the point)

• Then, given a point without a label <??, feature 1, …, feature N>
– Use the model to decide on its label
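
Formally (a standard formulation, not taken from the slides), the classifier picks the label

\hat{y} = \arg\max_{c} \; P(c) \prod_{i=1}^{N} P(f_i \mid c)

where the prior P(c) and the per-feature likelihoods P(f_i | c) are estimated by counting over the labelled training data.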
NAÏVE BAYES CLASSIFIER: EXAMPLE
• Example

[Figure: a sample training table with three features per row and a class label
(male or female).]
FREQUENT PATTERN MINING
• Very common problem in Market-Basket applications

• Given a set of items I ={milk, bread, jelly, …}

• Given a set of transactions where each transaction contains


subset of items
– t1 = {milk, bread, water}
– t2 = {milk, nuts, butter, rice}

FREQUENT PATTERN MINING
• Given a set of items I ={milk, bread, jelly, …}
• Given a set of transactions where each transaction contains
subset of items
– t1 = {milk, bread, water}
– t2 = {milk, nuts, butter, rice}

What are the itemsets frequently sold together?

An itemset is frequent if the % of transactions in which it appears >= α
EXAMPLE
• {Bread}  80%
• {PeanutButter}  60%
• {Bread, PeanutButter}  60%

Assume α = 60%, what are the frequent itemsets

called “Support”

201
CAN WE OPTIMIZE??
• {Bread}  80%
• {PeanutButter}  60%
• {Bread, PeanutButter}  60%

Assume α = 60%, what are the frequent itemsets

called “Support”

Property
For itemset S={X, Y, Z, …} of size n to be frequent, all its subsets of
size n-1 must be frequent as well
202
HOW TO FIND FREQUENT ITEMSETS
• Naïve Approach
– Enumerate all possible itemsets and then count
each one

All possible itemsets of size 1

All possible itemsets of size 2

All possible itemsets of size 3

All possible itemsets of size 4

RESOURCES
• http://mahout.apache.org
• dev@mahout.apache.org - Developer mailing list
• user@mahout.apache.org - User mailing list
• Check out the documentations and wiki for quickstart
• http://svn.apache.org/repos/asf/mahout/trunk/ Browse Code
PROVISIONING,
MANAGING,
MONITORING
CLUSTER
WHAT IS NAGIOS
• “Nagios is an enterprise-class monitoring solutions for hosts,
services, and networks released under an Open Source license.”
“Nagios is a popular open source computer system and network
monitoring application software. It watches hosts and services that
you specify, alerting you when things go bad and again when they
get better.”
CACTI
• Performance Graphing System
• Slick Web Interface
• Template System for Graph Types
• Pluggable
– SNMP (Simple Network Management Protocol) input
– Shell script /external program
CACTI
NAGIOS
• Answers “IS IT RUNNING?”
• Text based Configuration
CACTI
• Answers “HOW WELL IS IT RUNNING?”
• Web Based configuration
– php-cli tools
AMBARI

• Cluster Operations: extend core capabilities to include the critical tasks
associated with provisioning and operating Hadoop clusters.
• Job Diagnostics: enable insight into job performance and reduce the burden
on specialized Hadoop skills and knowledge.
• Extensible Platform: expose integration and customization points so Hadoop
can interoperate with existing operational tooling.
MORE DATABASES
• Ambari to support Postgres, MySQL or Oracle
• Configure Hive and Oozie to use MySQL or Oracle

OTHER GOODIES

• Add slave components to hosts
• Stop/Start all services
• Re-assign master components
• Host status filtering
JOB DIAGNOSTICS
• Enhanced swimlane visualizations
• See job DAG with task overlay
• See task scatter plot across jobs

ELECTRONIC ARTS ON HADOOP AND EC2
WHAT WE HAVE DONE!
• Setup EC2, requested machines, configured firewalls and
passwordless SSH;
• Downloaded Java and Hadoop;
• Configured HDFS and MapReduce and pushed configuration around
the cluster;
• Started HDFS and MapReduce;
• Submitted the job, ran it successfully, and viewed the output.
1. START EC2 SERVERS
• Amazon Web Services @ http://aws.amazon.com/;

• Used the ‘classic wizard’, created three micro instances running the
latest 64 bit Ubuntu Server;

• Key pair .pem file either exists or you create one to connect to the
servers and to navigate around within the cluster
2. NAME EC2 SERVERS
• For reference, instances are named Master, Slave 1, and Slave 2
within the EC2 console once they are running;

• Note down the host names for each of the 3 instances in the bottom
part of the management console. We will use these to access the
servers:
PUTTY CONFIGURATION
WEB INTERFACES
MAPREDUCE PROGRAM
Thank You!

Navin Chandra
