
UNIT –V FRAMEWORK

1.HIVE
Hive-Introduction
 An SQL-like interface to Hadoop
 Hive is a data warehouse infrastructure tool to process structured data in
Hadoop. It resides on top of Hadoop to summarize Big Data, and makes
querying and analyzing easy.
 Hive was initially developed by Facebook; later, the Apache Software
Foundation took it up and developed it further as an open-source project under
the name Apache Hive.
Hive is not
 A relational database
 A design for OnLine Transaction Processing (OLTP)
 A language for real-time queries and row-level updates
Features of Hive
 It stores the schema in a database and the processed data in HDFS.
 It is designed for OLAP.
 It provides SQL type language for querying called HiveQL or HQL.
 It is familiar, fast, scalable, and extensible.
1.Hive-Services
 The Hive shell is only one of several services that you can run using the hive
command.
$ hive --service <service name>
 cli – The command-line interface to Hive (the shell). This is the default service.
 hiveserver – Runs Hive as a server exposing a Thrift service, enabling access
from a range of clients written in different languages.
 hwi – The Hive Web Interface.
 jar – The Hive equivalent of hadoop jar; a convenient way to run Java
applications that include Hadoop and Hive classes on the classpath.

 metastore – By default, the metastore is run in the same process as the Hive
service.
Hive architecture

Thrift Client
 The Hive Thrift Client makes it easy to run Hive commands from a wide range
of programming languages.
 Thrift bindings for Hive are available for C++, Java, PHP, Python, and Ruby.
JDBC Driver
 Hive provides a JDBC driver, defined in the class
org.apache.hadoop.hive.jdbc.HiveDriver.
 When configured with a JDBC URI of the form jdbc:hive://host:port/dbname, a
Java application will connect to a Hive server running in a separate process at
the given host and port.
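A minimal Java sketch of using this driver is shown below; it assumes a Hive server listening on localhost:10000, a default database, and an employee table like the one created later in this unit.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver class named above
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

        // Host, port and database here are assumptions for illustration
        Connection con = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();

        // Run a HiveQL query and print the result rows
        ResultSet rs = stmt.executeQuery("SELECT eid, name FROM employee");
        while (rs.next()) {
            System.out.println(rs.getInt(1) + "\t" + rs.getString(2));
        }
        con.close();
    }
}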
ODBC Driver
 The Hive ODBC Driver allows applications that support the ODBC protocol to
connect to Hive.
The Metastore
 The metastore is the central repository of Hive metadata.
 The metastore is divided into two pieces: a service and the backing store for the
data.

Working of Hive

2.HiveQL
• Hive’s SQL dialect is called HiveQL. Queries are translated to MapReduce jobs
to exploit the scalability of MapReduce.
• Hive QL
– Basic SQL: Select, From, Join, Group-By
– Equi-Join, Multi-Table Insert, Multi-Group-By
– Batch query
1. Hive Data Types
• Hive primitive data types
• Hive complex data types.
Primitive data types
Numeric Types
• TINYINT (1-byte signed integer)
• SMALLINT (2-byte signed integer)
• INT (4-byte signed integer)
• BIGINT (8-byte signed integer)
• FLOAT (4-byte single precision floating point number)
• DOUBLE (8-byte double precision floating point number)
Date/Time Types
TIMESTAMP,DATE
String Types
STRING,VARCHAR,CHAR

Misc Types
BOOLEAN,BINARY
Hive Complex Data Types
• arrays: ARRAY<data_type>
• maps: MAP<primitive_type, data_type>
• structs: STRUCT<col_name : data_type [COMMENT col_comment], ...>
• union: UNIONTYPE<data_type, data_type, ...>
2. Hive - Built-in Operators
• Relational Operators
• Arithmetic Operators
• Logical Operators
• Complex Operators
Relational Operators

Arithmetic Operators

Logical Operators

Complex Operators

3. Hive Built-in Functions


Numeric and Mathematical Functions

Syntax                          Example
ABS( double n )                 ABS(-100)
BIN( bigint n )                 BIN(100)
CEIL( double n )                CEIL(9.5)
FLOOR( double n )               FLOOR(10.9)
SQRT( double n )                SQRT(4)
LOG2( double n )                LOG2(44)
POW( double m, double n )       POW(10,2)
RAND( [int seed] )              RAND( )

Date Functions

Syntax                                          Example
UNIX_TIMESTAMP()                                2016-09-24 12:11:10
UNIX_TIMESTAMP( string date, string pattern )   UNIX_TIMESTAMP('2000-01-01 10:20:30','yyyy-MM-dd')
TO_DATE( string timestamp )                     TO_DATE('2000-01-01 10:20:30')
YEAR( string date )                             YEAR('2000-01-01 10:20:30')
DATEDIFF( string date1, string date2 )          DATEDIFF('2000-03-01', '2000-01-10')
DATE_ADD( string date, int days )               DATE_ADD('2000-03-01', 5)
DATE_SUB( string date, int days )               DATE_SUB('2000-03-20', 5)

String Functions

Syntax                                                          Example
ASCII( string str )                                             ASCII('A')
CONCAT( string str1, string str2... )                           CONCAT('hadoop','-','hive')
LENGTH( string str )                                            LENGTH('hive')
LOWER( string str )                                             LOWER('HiVe')
REVERSE( string str )                                           REVERSE('hive')
SUBSTR( string source_str, int start_position [, int length] )  SUBSTR('hadoop',4,2)
UPPER( string str )                                             UPPER('HiVe')

Collection Functions

Syntax                      Description
size(Array<T>)              Returns the number of elements in the array type
size(Map<K.V>)              Returns the number of elements in the map type
map_keys(Map<K.V>)          Returns an unordered array containing the keys of the input map
map_values(Map<K.V>)        Returns an unordered array containing the values of the input map
sort_array(Array<T>)        Sorts the input array in ascending order

Type Conversion Function

Syntax                      Description
binary(string|binary)       Casts the parameter into a binary
cast(expr as <type>)        Converts the results of the expression to the given type
Eg: cast('1' as BIGINT)

3. Querying Data in Hive


1. Database manipulation
Create Database Statement
Syntax
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
Example
Hive> CREATE DATABASE IF NOT EXISTS userdb;
Verify a databases list
Hive> SHOW DATABASES;
Hive> SHOW DATABASES LIKE 'u.*';
Drop Database Statement
DROP DATABASE [IF EXISTS] database_name
Hive> DROP DATABASE IF EXISTS userdb;
2.Managing Tables
Syntax
CREATE TABLE [IF NOT EXISTS] [db_name.] table_name [(col_name data_type
[COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]
Example
hive> create table if not exists employee (eid int,name String,designation String,
salary int)
COMMENT 'Employee detail'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
Displaying tables
hive> show tables;
Displaying the structure of table ‘employee’
Hive> desc employee;
eid int None
ename string None
designation string None
salary int None
Dropping Tables & View
• The DROP TABLE statement deletes the data and metadata for a table

Syntax
DROP TABLE [IF EXISTS] table_name;
Example
Hive> drop table emp;
Hive> drop view emp1;
Altering Tables
Syntax
1. ALTER TABLE name RENAME TO new_name
2. ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
3. ALTER TABLE name DROP [COLUMN] column_name
4. ALTER TABLE name CHANGE column_name new_name new_type
5. ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])
Example
Renames the table from employee1 to employee
Hive> ALTER TABLE employee1 RENAME TO employee
Rename the column name
Hive> ALTER TABLE employee CHANGE name ename String;
Adds a column named dept to the employee table
Hive> ALTER TABLE employee ADD COLUMNS ( dept STRING COMMENT
'Department name');
Replace column in table
Hive> ALTER TABLE employee REPLACE COLUMNS (empid INT, ename STRING);

3.Importing Data
Syntax
Insert command
INSERT OVERWRITE TABLE target SELECT col1, col2 FROM source;
Example
Hive> insert overwrite table student select * from student1;
Load statement
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename;

Example
Hive> LOAD DATA LOCAL INPATH '/home/mca/sample.txt' OVERWRITE INTO
TABLE employee;
4.Select-Where Clause
 SELECT statement is used to retrieve the data from a table. WHERE clause
works similar to a condition.
Syntax
SELECT [ALL | DISTINCT] select_expr, select_expr, ... FROM table_reference [WHERE
where_condition] [GROUP BY col_list] [HAVING having_condition] [ORDER BY
col_list] [LIMIT number];
Example
 Hive> SELECT * FROM employee WHERE salary>30000;
 Hive> select * from employee where salary>12000 and dept='ADMIN';
 Hive> SELECT * FROM employee WHERE salary>30000 limit 2;
5.Sorting and Aggregating
 Sorting data in Hive can be achieved by use of a standard ORDER BY clause
Example
Hive> SELECT * from employee order by salary;
 The GROUP BY clause is used to group all the records in a result set using a
particular collection column.
Example
Hive> SELECT Dept,count(*) FROM employee GROUP BY DEPT;
6.Joins
 JOIN is a clause that is used for combining specific fields from two tables by
using values common to each one
 JOIN
 LEFT OUTER JOIN
 RIGHT OUTER JOIN
 FULL OUTER JOIN
JOIN
 JOIN clause is used to combine and retrieve the records from multiple tables.
 Example
hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT FROM CUSTOMERS c JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
LEFT OUTER JOIN
 The HiveQL LEFT OUTER JOIN returns all the rows from the left table, even if
there are no matches in the right table.
 Example
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c LEFT OUTER
JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
RIGHT OUTER JOIN
 The HiveQL RIGHT OUTER JOIN returns all the rows from the right table, even
if there are no matches in the left table.
Example
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c RIGHT OUTER
JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
FULL OUTER JOIN
 The HiveQL FULL OUTER JOIN combines the records of both the left and the
right outer tables that fulfil the JOIN condition.
Example
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c FULL OUTER
JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
7.Subqueries
 A subquery is a SELECT statement that is embedded in another SQL statement.
 Hive has limited support for subqueries; originally a subquery was permitted only in
the FROM clause of a SELECT statement, and later releases also allow IN/EXISTS
subqueries in the WHERE clause, as in the example below.
Example
 SELECT emp_name,salary FROM emp WHERE emp.Joining_year IN (SELECT
dept_start_year FROM department);

8.Views
 A view is a sort of “virtual table” that is defined by a SELECT statement.
 Views can be used to present data to users in a different way from how it is
actually stored on disk.
Example
hive> CREATE VIEW emp_30000 AS SELECT * FROM employee WHERE
salary>30000;
2. PIG – Pig Latin
1. Introduction
 A high-level scripting language (Pig Latin)
 Pig Latin, the programming language for Pig provides common data
manipulation operations, such as grouping, joining, and filtering.
 Pig generates Hadoop Map Reduce jobs to perform the data flows.
 This high-level language for ad-hoc analysis allows developers to inspect HDFS
stored data without the need to learn the complexities of the Map Reduce
framework, thus simplifying the access to the data
 The Pig Latin scripting language is not only a higher-level data flow language; it
also has operators similar to SQL (e.g., FILTER and JOIN) that are translated into a
series of map and reduce functions.
 Pig Latin, in essence, is designed to fill the gap between the declarative style of
SQL and the low-level procedural style of MapReduce
History-Pig
 In 2006, Apache Pig was developed as a research project at Yahoo, primarily to
simplify creating and executing MapReduce jobs over large datasets.
 In 2007, Apache Pig was open sourced via Apache incubator.
 In 2008, the first release of Apache Pig came out.
 In 2010, Apache Pig graduated as an Apache top-level project
When Pig & Hive
• Hive is a good choice
– When you want to query the data
– When you need an answer for specific questions
– If you are familiar with SQL
• Pig is a good choice
– For ETL (Extract-Transform-Load)
– For preparing data for easier analysis
– When you have long series of steps to perform
Pig Vs Hive

Apache Pig – Architecture

Pig Latin – Data types


Primary data types
Data type        Example
int              8
long             8L
float            5.5F
double           5.5
chararray        ‘avcce’
boolean          true / false
datetime         2016-01-22T00:00:00

Complex Data types

Data Type Example

Tuple-ordered set of fields. (raja, 30)

Bag-collection of tuples. {(raju,30),(Mohhammad,45)}

Map-set of key-value pairs. [ ‘name’#’Raju’, ‘age’#30]

2. Data Processing Operators


1. Loading and Storing Data
Load Operator
 To load data from external storage for processing in Pig.
Syntax
Relation_name = LOAD 'Input file path' USING function as schema;
• relation_name − We have to mention the relation in which we want to store the
data.
• Input file path − We have to mention the HDFS directory where the file is
stored
• function − We have to choose a function from the set of load functions provided
by Apache Pig (BinStorage, JsonLoader, PigStorage, TextLoader).
• Schema − We have to define the schema of the data
Example
Grunt> record = LOAD '/home/mca/pig-0.14.0/weather.txt' USING
PigStorage(',')
>> AS (year:int, temperature:int, quality:int);
Describe Operator
 The describe operator is used to view the schema of a relation
Example
Grunt> describe record;
Output:
record: {year: int,temperature: int,quality: int}
Dump operator
 The Dump operator is used to run the Pig Latin statements and display
the results on the screen
Example
Grunt> dump record;
(1950,0,1)
(1950,21,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
Storing Data
 To store the loaded data in the file system using the store operator
Syntax
STORE Relation_name INTO ' required_directory_path ' [USING function];
Example
Grunt> STORE record INTO 'out' USING PigStorage(':');
Grunt> cat out;
1950:0:1
1950:21:1
1950:-11:1
1949:111:1
1949:78:1
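The same load/group/store flow can also be driven from a Java program through Pig's embedded PigServer API. The sketch below is illustrative only: it assumes a local-mode Pig installation on the classpath and reuses the weather.txt path and aliases from the examples above.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPigExample {
    public static void main(String[] args) throws Exception {
        // Run Pig in local mode; use ExecType.MAPREDUCE to run against a cluster
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Register the same Pig Latin statements shown above
        pig.registerQuery("record = LOAD '/home/mca/pig-0.14.0/weather.txt' "
                + "USING PigStorage(',') AS (year:int, temperature:int, quality:int);");
        pig.registerQuery("grouped_records = GROUP record BY year;");

        // Store the grouped relation, the equivalent of the STORE operator
        pig.store("grouped_records", "grouped_out", "PigStorage(':')");
    }
}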
2. Filtering Data
 The FILTER operator is used to select the required tuples from a relation based
on a condition.
Syntax
grunt> Relation2_name = FILTER Relation1_name BY (condition);
Example
Grunt>filter_data = FILTER student_details BY city == 'Chennai';
FOREACH...GENERATE
 FOREACH operator is used to generate specified data transformations based on
the column data.
Syntax
grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);
Example
grunt> foreach_data = FOREACH student_details GENERATE id,age,city;
3. Grouping and Joining Data
JOIN Operator
 The JOIN operator is used to combine records from two or more relations
Syntax
grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key ;
Example
Grunt> DUMP A;
(2,Tie)
(4,Coat)
(3,Hat)
Grunt> DUMP B;
(Joe,2)
(Hank,4)
(Eve,3)
Grunt> C = JOIN A BY $0, B BY $1;
Grunt> DUMP C;
(2,Tie,Joe,2)
(3,Hat,Eve,3)
(4,Coat,Hank,4)
GROUP Operator
 The GROUP operator is used to group the data in one or more relations. It
collects the data having the same key.
Syntax
grunt> Group_data = GROUP Relation_name BY column name;
Example
Grunt> dump record;
(1950,0,1)
(1950,21,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
Grunt> grouped_records = group record by year;
Grunt> dump grouped_records;
(1949,{(1949,111,1),(1949,78,1)})
(1950,{(1950,0,1),(1950,21,1),(1950,-11,1)})
Cross Operator
 The CROSS operator computes the cross-product of two or more relations
Syntax
grunt> Relation3_name = CROSS Relation1_name, Relation2_name;
Example
Grunt> DUMP A;
(2,Tie)
(4,Coat)
Grunt> DUMP B;
(Joe,2)
(Hank,4)
Grunt> I = CROSS A, B;
Grunt> DUMP I;
(2,Tie,Joe,2)
(2,Tie,Hank,4)
(4,Coat,Joe,2)
(4,Coat,Hank,4)
4. Sorting Data
 The ORDER BY operator is used to display the contents of a relation in a sorted
order based on one or more fields.
Syntax
Grunt> Relation_name2 = ORDER Relation_name1 BY column_name (ASC|DESC);
Example
Grunt> order_by_data = ORDER student_details BY age ASC;

3. ZOOKEEPER
 Apache Zookeeper is an effort to develop and maintain an open-source server
which enables highly reliable distributed coordination.
 A high-performance coordination service for distributed applications
(naming, configuration management, synchronization, and group services)
 Runs in Java and has bindings for both Java and C
 Developed at Yahoo! Research
 Started as sub-project of Hadoop, now a top-level Apache project

Features of Zookeeper
 Shared hierarchical namespace: consists of znodes (data registers) in
memory
 High performance: can be used in large, distributed systems
 Reliability: keeps from being a single point of failure
 Strict ordered access: sophisticated synchronization primitives can be
implemented at the client
 Replication: replicates itself over a set of hosts called an ensemble

Architecture of ZooKeeper

What is Apache ZooKeeper Meant For?


 Apache ZooKeeper is a service used by a cluster (group of nodes) to coordinate
between themselves and maintain shared data with robust synchronization
techniques.
 The common services provided by ZooKeeper are as follows
Naming service − Identifying the nodes in a cluster by name. It is similar
to DNS, but for nodes.
Configuration management − Latest and up-to-date configuration
information of the system for a joining node.
Cluster management − Joining / leaving of a node in a cluster and node
status at real time.
Leader election − Electing a node as leader for coordination purpose.
Locking and synchronization service − Locking the data while
modifying it. This mechanism helps in automatic failure recovery when
connecting other distributed applications such as Apache HBase.
Highly reliable data registry − Availability of data even when one or a
few nodes are down.
The ZooKeeper Service
 ZooKeeper is a highly available, high-performance coordination service
1.Data Model
 ZooKeeper maintains a hierarchical tree of nodes called znodes.
 A znode stores data and has an associated ACL.
 ZooKeeper is designed for coordination (which typically uses small data
files), not high-volume data storage, so there is a limit of 1 MB on the
amount of data that may be stored in any znode.
2. Types of Znodes
• Persistence znode − Persistence znode is alive even after the client, which
created that particular znode, is disconnected.
• Ephemeral znode − Ephemeral znodes are alive only as long as the client that
created them is alive; they are deleted when the client's session ends.
• Sequential znode − Sequential znodes can be either persistent or ephemeral.


3. Watches
 Watches are a simple mechanism for the client to get notifications about the
changes in the ZooKeeper ensemble.
 Clients can set watches while reading a particular znode. Watches send a
notification to the registered client for any of the znode changes.
4. Operations
 There are nine basic operations in ZooKeeper

5. Multi-update
 There is another ZooKeeper operation, called multi, which batches together
multiple primitive operations into a single unit that either succeeds or fails in
its entirety.
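A minimal Java sketch of multi is shown below, assuming a ZooKeeper ensemble reachable at localhost:2181; the znode paths are purely illustrative.

import java.util.Arrays;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class MultiUpdateExample {
    public static void main(String[] args) throws Exception {
        // Connection string and session timeout are assumptions for illustration
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        // Both creates succeed or fail together as one atomic unit
        zk.multi(Arrays.asList(
                Op.create("/app", new byte[0],
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT),
                Op.create("/app/config", "v1".getBytes(),
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)));
        zk.close();
    }
}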
6.ACLs
 A znode is created with a list of ACLs, which determines who can perform
certain operations on it.
 ACLs depend on authentication, the process by which the client identifies itself
to ZooKeeper.
 There are a few authentication schemes that ZooKeeper provides:
digest
 The client is authenticated by a username and password.
sasl
 The client is authenticated using Kerberos.
ip
 The client is authenticated by its IP address.
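A small Java sketch of combining digest authentication with a creator-only ACL follows; the connection details, credentials, and znode name are assumptions for illustration.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class AclExample {
    public static void main(String[] args) throws Exception {
        // Connection string and timeout are assumptions for illustration
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        // Authenticate this session with the digest scheme (username:password)
        zk.addAuthInfo("digest", "hiveuser:secret".getBytes());

        // CREATOR_ALL_ACL grants all permissions only to the authenticated creator
        zk.create("/secure-node", "restricted".getBytes(),
                ZooDefs.Ids.CREATOR_ALL_ACL, CreateMode.PERSISTENT);
        zk.close();
    }
}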
7. Implementation
 The ZooKeeper service can run in two modes.
 In standalone mode, there is a single ZooKeeper server, which is useful for
testing due to its simplicity but provides no guarantees of high-availability or
resilience.
 ZooKeeper runs in replicated mode, on a cluster of machines called an
ensemble. ZooKeeper achieves high-availability through replication, and can
provide a service as long as a majority of the machines in the ensemble are up.
Phase 1: Leader election
 The machines in an ensemble go through a process of electing a distinguished
member, called the leader. The other machines are termed followers.
Phase 2: Atomic broadcast
 All write requests are forwarded to the leader, which broadcasts the update to
the
followers. When a majority have persisted the change, the leader commits the
update, and the client gets a response saying the update succeeded.
8. Consistency
 Understanding the basis of ZooKeeper’s implementation helps in
understanding the consistency guarantees that the service makes.

Sequential consistency-Updates from any particular client are applied in the order
that they are sent.
Atomicity-Updates either succeed or fail. This means that if an update fails, no client
will ever see it.

Single system image-A client will see the same view of the system regardless of the
server it connects to. This means that if a client connects to a new server during the
same session, it will not see an older state of the system than the one it saw with the
previous server.
Durability -Once an update has succeeded, it will persist and will not be undone.
This means updates will survive server failures.
Timeliness -The lag in any client’s view of the system is bounded, so it will not be
out of date by more than some multiple of tens of seconds.

9. Sessions

 Sessions are very important for the operation of ZooKeeper. Requests in a


session are executed in FIFO order. Once a client connects to a server, the
session will be established and a session id is assigned to the client.
 The client sends heartbeats at a particular time interval to keep the session
valid. If the ZooKeeper ensemble does not receive heartbeats from a client for
more than the period (session timeout) specified at the starting of the service,
it decides that the client died.
States
 The ZooKeeper object transitions through different states in its lifecycle

Examples
 ZooKeeper Command Line Interface (CLI) is used to interact with the
ZooKeeper ensemble for development purpose.
 To perform the following operation
 Create znodes
 Get data
 Watch znode for changes
 Set data
 Create children of a znode
 List children of a znode
 Check Status
 Remove / Delete a znode
Create Znodes

Syntax

create /path /data

Example

create /FirstZnode "Myfirstzookeeper-app"

Get Data

 It returns the associated data of the znode and metadata of the specified znode

Syntax

get /path
Example
get /FirstZnode
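The same create and get operations can be issued from Java through the ZooKeeper client API. This is a minimal sketch under the assumption that the ensemble is reachable at localhost:2181; it reuses the znode name from the CLI example above and registers a watch while reading.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZnodeExample {
    public static void main(String[] args) throws Exception {
        // Watcher that simply prints every notification it receives
        Watcher watcher = event ->
                System.out.println("Event: " + event.getType() + " on " + event.getPath());

        // Connection string and session timeout are assumptions for illustration
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, watcher);

        // Equivalent of: create /FirstZnode "Myfirstzookeeper-app"
        zk.create("/FirstZnode", "Myfirstzookeeper-app".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Equivalent of: get /FirstZnode  (watch=true registers the watcher above)
        Stat stat = new Stat();
        byte[] data = zk.getData("/FirstZnode", true, stat);
        System.out.println(new String(data) + " (version " + stat.getVersion() + ")");

        zk.close();
    }
}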

4. HBASE

1. Introduction

 HBase is a distributed column-oriented database built on top of the Hadoop file


system. It is an open-source project and is horizontally scalable.
 HBase is a data model similar to Google’s Bigtable, designed to provide
quick random access to huge amounts of structured data. It leverages the fault
tolerance provided by the Hadoop File System (HDFS).
 It is a part of the Hadoop ecosystem that provides random real-time
read/write access to data in the Hadoop File System.
HBase History

HBase and HDFS

HDFS: HDFS is a distributed file system suitable for storing large files.
HBase: HBase is a database built on top of HDFS.

HDFS: HDFS does not support fast individual record lookups.
HBase: HBase provides fast lookups for larger tables.

HDFS: It provides only sequential access of data.
HBase: HBase internally uses hash tables and provides random access, and it stores the
data in indexed HDFS files for faster lookups.

Storage Mechanism in HBase

HBase is a column-oriented database and the tables in it are sorted by row

 Table is a collection of rows.


 Row is a collection of column families.
 Column family is a collection of columns.
 Column is a collection of key value pairs.

Example

HBase and RDBMS

HBase: HBase is schema-less; it doesn't have the concept of a fixed column schema and
defines only column families.
RDBMS: An RDBMS is governed by its schema, which describes the whole structure of the tables.

HBase: It is built for wide tables and is horizontally scalable.
RDBMS: It is thin and built for small tables, and is hard to scale.

HBase: It has de-normalized data.
RDBMS: It will have normalized data.

HBase: It is good for semi-structured as well as structured data.
RDBMS: It is good for structured data.
Applications of HBase

 It is used whenever there is a need for write-heavy applications.


 HBase is used whenever we need to provide fast random access to available
data.
 Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally
HBase Architecture

 In HBase, tables are split into regions and are served by the region servers.
Regions are vertically divided by column families into “Stores”. Stores are
saved as files in HDFS

 HBase has three major components: the client library, a master server, and
region servers. Region servers can be added or removed as per requirement

MasterServer
 Assigns regions to the region servers and takes the help of Apache ZooKeeper
for this task.
 Handles load balancing of the regions across region servers. It unloads the
busy servers and shifts the regions to less occupied servers.
Regions
 Regions are nothing but tables that are split up and spread across the region
servers.
Region server
 Communicate with the client and handle data-related operations.
 Handle read and write requests for all the regions under it.
Zookeeper
 Zookeeper is an open-source project that provides services like maintaining
configuration information, naming, providing distributed synchronization, etc.

Installing HBase in Standalone Mode


$ wget http://www.interior-dsgn.com/apache/hbase/stable/hbase-0.98.8-
hadoop2-bin.tar.gz
$ tar -zxvf hbase-0.98.8-hadoop2-bin.tar.gz
$ mv hbase-0.98.8-hadoop2/* HBase/
Configuring HBase in Standalone Mode
hbase-env.sh
export JAVA_HOME=/usr/lib/jvm/java-1.7.0
hbase-site.xml

<configuration>
<property>
<name>hbase.rootdir</name>
<value>file:/home/mca/HBase/HFiles</value>
</property>
<property> <name>hbase.zookeeper.property.dataDir</name>
<value>/home/mca/zookeeper</value> </property></configuration>
$ cd /usr/local/HBase/bin
$ ./start-hbase.sh
$ ./hbase shell
hbase(main):001:0>

1. HBase - General Commands

The general commands in HBase are status, version, table_help, and whoami

Status-This command returns the status of the system including the details of the
servers running on the system
hbase> status
version-This command returns the version of HBase used in your system
hbase> version

table_help-This command guides you on how to use table-referenced
commands.
hbase> table_help
whoami-This command returns the user details of HBase.
hbase> whoami
2. Create Table
Syntax
create '<table name>','<column family>'
Example
hbase> create 'emp', 'personal data', 'professional data'
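The same table can also be created programmatically through the HBase Java client API. The sketch below is a hedged illustration assuming an HBase 1.x-or-later client on the classpath and an hbase-site.xml that points at the running instance.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateTableExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath to locate ZooKeeper/HBase
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {

            // Equivalent of: create 'emp', 'personal data', 'professional data'
            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("emp"));
            table.addFamily(new HColumnDescriptor("personal data"));
            table.addFamily(new HColumnDescriptor("professional data"));
            admin.createTable(table);
        }
    }
}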

3. Listing a Table

hbase > list


Output

TABLE
emp
2 row(s) in 0.0340 seconds

4. Create Data

Syntax

put '<table name>','<row key>','<column family:column name>','<value>'

Example

hbase> put 'emp','1','personal data:name','raju'


hbase> put 'emp','1','personal data:city','hyderabad'
hbase> put 'emp','1','professional data:designation','manager'
hbase> put 'emp','1','professional data:salary','50000'

5. Displaying data

Syntax
scan '<table name>'
Example

hbase> scan 'emp'

ROW COLUMN+CELL
1 column=personal data:city, timestamp=1417524216501, value=hyderabad
1 column=personal data:name, timestamp=1417524185058, value=ramu
1 column=professional data:designation, timestamp=1417524232601,
value=manager
1 column=professional data:salary, timestamp=1417524244109, value=50000

Example-2

6. Delete data

Deleting a Specific Cell in a Table


Syntax
delete '<table name>', '<row>', '<column name>', '<time stamp>'
Example
hbase> delete 'emp', '1', 'personal data:city', 1417521848375
Deleting All Cells in a Table
Syntax
deleteall '<table name>','<row>'
Example
hbase> deleteall 'emp','1'
Deleting All records in a Table
Syntax
truncate 'table name'
Example
hbase> truncate 'emp'

7. Reading Data
Reading a row (all columns)
Syntax
get '<table name>', '<row key>'
Example
hbase> get 'emp', '1'
COLUMN CELL
personal : city timestamp=1417521848375, value=hyderabad
personal : name timestamp=1417521785385, value=ramu
professional: designation timestamp=1417521885277, value=manager
professional: salary timestamp=1417521903862, value=50000
4 row(s) in 0.0270 seconds
Reading a Specific Column
Syntax
hbase> get '<table name>', '<row id>', {COLUMN => '<column family>:<column name>'}
Example
hbase> get 'emp', '1', {COLUMN => 'personal data:name'}
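The put and get operations shown above can likewise be performed from the Java client API; again this is a hedged sketch assuming an HBase 1.x-or-later client and the 'emp' table created earlier.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PutGetExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("emp"))) {

            // Equivalent of: put 'emp','1','personal data:name','raju'
            Put put = new Put(Bytes.toBytes("1"));
            put.addColumn(Bytes.toBytes("personal data"),
                          Bytes.toBytes("name"), Bytes.toBytes("raju"));
            table.put(put);

            // Equivalent of: get 'emp', '1', {COLUMN => 'personal data:name'}
            Get get = new Get(Bytes.toBytes("1"));
            get.addColumn(Bytes.toBytes("personal data"), Bytes.toBytes("name"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("personal data"), Bytes.toBytes("name"))));
        }
    }
}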
8. Describe and alter
Describe command
Syntax
hbase> describe 'table name'
Example
hbase(main):006:0> describe 'emp'

Alter command
Changing the Maximum Number of Cells of a Column Family
hbase> alter 't1', NAME => 'f1', VERSIONS => 5
Example
hbase(main):003:0> alter 'emp', NAME => 'personal data', VERSIONS => 5
Adding a column family to a table
hbase(main):003:0> alter 'emp', NAME => 'pincode'
Deleting a Column Family
Syntax
hbase> alter '<table name>', 'delete' => '<column family>'
Example
hbase(main):007:0> alter 'emp', 'delete' => 'professional data'
9. Drop command
Drop command
Syntax
drop '<table name>'
Example
hbase> disable 'emp'
hbase> drop 'emp'
drop_all command
Syntax
hbase> drop_all 'reg_exp'
Example
hbase> drop_all 't.*'

5. IBM InfoSphere BigInsights


 What is Big Data?
 What is Hadoop?
 IBM InfoSphere BigInsights
 Federal Market Use Cases
 Take Aways
What is Big Data?
• Extracting insight from an immense volume, variety and velocity of data, in
context, beyond what was previously possible

What is Hadoop?
 Apache Hadoop – a free, open-source framework for data-intensive
applications
 Inspired by Google technologies (MapReduce, GFS)
 Originally built to address scalability problems of Web search and analytics
 Extensively used by Yahoo!
 Enables applications to work with thousands of nodes and petabytes of
data in a highly parallel, cost effective manner
 CPU + disks of a commodity box = Hadoop node
 Boxes can be combined into clusters
 New nodes can be added without changing
 Data formats
 How data is loaded
 How jobs are written

IBM InfoSphere BigInsights

InfoSphere BigInsights


• An analytics platform for Big Data at rest
• Volume
–Petabyte range
• Variety
–All kinds of data
–All kinds of analytics

• Adaptive Analytics Integrating Analytics on Data in Motion and Data at Rest

InfoSphere BigInsights – A Full Hadoop Stack

What makes BigInsights special?

Developing BigInsights Applications


• Map Reduce development in Java
• Pig
-Open source language / Apache sub-project
• Hive
– Open source language / Apache sub-project
– Provides a SQL-like interface to Hadoop
• Jaql
–IBM-developed query language
–Includes a SQL-like interface
–Handles nesting gracefully

–Heavily function oriented


–Very flexible; useful for loosely structured data
–Interface for IBM’s column store and text analytics
BigData Application Development Tooling with Eclipse
 Spreadsheet-like Analysis Tool
 Analytics for V3
 System ML
 System T Information Extraction (IE)
 BigInsights Text Analytics
 Social-media Analytics
BigInsights – Enhancing and Hardening Hadoop for Mission

• Tooling
– Installation/Configuration
– Development via Eclipse, jaql, AQL, DML, etc
– Discover/Analyze
– Browser-based administration, performance analysis, etc.
• Flex Scheduler
– Provided in addition to Hadoop’s FIFO and FAIR schedulers
–Optimized for response time, granular control, and SLAs
• Adaptive M/R
– Balance workload across mappers
– Minimize startup and scheduling cost
• LZO Split-able Compression
• Security Enhancements
– Secure access thru console, other open ports closed
– Authentication with LDAP
– Authorization thru roles
• Extensive RDBMS, DataWarehouse Integration.
• Integration with R & SPSS
• BigSheets Visualization
– Rapid analysis without M/R coding
– Collect Data, Extract and Analyze & Explore/Visualize


– Leverage System T Text Analytics Macro
• Advanced Text Analytics Toolkit
– Eclipse Dev Tool Integration, Provenance Viewer, Debug Capability
– Text Analytics Language (AQL)
– Optimized M/R Engine
–Prebuilt Extractors
• Total Hadoop Cluster Solution
• Tight Integration/Support for Data In Motion via InfoSphere Streams
Enterprise-Level File System: GPFS-SNC
• GPFS SNC (General Parallel File System for Shared Nothing Clusters)
• Developed in 1998 by IBM Research
• Used since then in supercomputers and thousands of other deployments
• Shared-nothing architecture designed for Hadoop in 2009

6. IBM InfoSphere Streams


 Streams is a powerful analytic computing platform for analyzing data in real
time with micro-latency
 Gathering large quantities of data, manipulating it, storing it on disk, and then
analyzing it would be the approach taken with BigInsights
 Streams allows you to apply analytics to data in motion
 In Streams, data flows through operators that have the ability to manipulate the
data stream, and in-flight analysis is performed on the data.

Categories of Problems Solved by Streams


• Applications that require on-the-fly processing, filtering and analysis of
streaming data
– Sensors: environmental, industrial, surveillance video, GPS, …
– “Data exhaust”: network/system/web server/app server log files
– High-rate transaction data: financial transactions, call detail records

Criteria: two or more of the following


– Messages are processed in isolation or in limited data windows
– Sources include non-traditional data (spatial, imagery, text, …)
– Sources vary in connection methods, data rates, and processing
requirements, presenting integration challenges
– Data rates/volumes require the resources of multiple processing nodes
– Analysis and response are needed with sub-millisecond latency
– Data rates and volumes are too great for store-and-mine approaches

Industry use cases for InfoSphere Streams

How InfoSphere Streams works


What is a Stream?
 A stream is a graph of nodes connected by edges
 Each node in the graph is an operator or adapter that will process the data within
the stream in some way
 Nodes can have zero or more inputs and zero or more outputs
 The output from one node is connected to the input of another node or nodes
 The edges of the graph that join the nodes together represent the stream of
data moving between the operators

Streams Processing Language


Designed for stream computing
– Define a streaming-data flow graph
– Rich set of data types to define tuple attributes
Declarative
– Operator invocations name the input and output streams
– Referring to streams by name is enough to connect the graph
Procedural support
– Full-featured imperative language
– Custom logic in operator invocations
Extensible
– User-defined data types
– Custom functions written in SPL or a native language (C++ or Java)
– Custom operators written in SPL

Streams Standard Toolkit: Operators

Utility Operators


XML Support: Built Into SPL


• XML type
o Validated for syntax or schema
• XML Parser operator
o with XPath expressions and functions to parse and manipulate XML data
• XQuery function
o Use XML data on the fly
• Adapters support XML format
o Standard Toolkit
Database Toolkit
Enterprise class
 Large, massively parallel jobs have unique availability requirements because in
a large cluster, there are bound to be failures.
 Streams has built-in availability characteristics that take this into account
 In a massive cluster, the creation, visualization, and monitoring of your
application is a critical success factor in keeping your management costs low
 The integration with the enterprise architecture is essential to building a
holistic solution
 The enterprise aspects of the big data problem for streaming analytics are:
availability, ease of use, and integration
High availability
 An application host is a server that runs SPL jobs
 A management host runs the management services that control the flow of
SPL jobs
 A mixed host can run both SPL jobs and management tasks

Easy to use
 Streams comes with an Eclipse-based visual toolset, called InfoSphere Streams
Studio, which allows you to create, edit, test, debug, run, and even visualize a
stream graph model and SPL applications
Integration
 Coordinating the traditional and new-age big data processes takes a vendor
that understands both sides of the equation
7. IBM BigSheets – Visualization
 BigSheets is a spreadsheet-style tool for business analysts provided with IBM
InfoSphere BigInsights, a platform based on the open source Apache Hadoop
project.
 BigSheets enables non-programmers to iteratively explore, manipulate, and
visualize data stored in your distributed file system
 It includes built-in functions to extract names, addresses, organizations, email,
locations, and phone numbers.
 BigSheets has a good deal in common with a typical spreadsheet application,
such as Microsoft® Excel
 The benefit is the ease with which you can open, browse, and ultimately create
graphical views in the form of graphs and tag clouds on your data. As an
entirely web-based interface to the data viewing and analysis process, it is very
easy to use, but not without its complexities.
 BigSheets either takes the data you provide and builds a visualized version of
that information in the form of a graph, or it processes raw information to
provide a summarized view of the data. This enables BigSheets to support
some basic processing alongside its core visualization role

IBM Big Data Platform

• Three steps are involved in using BigSheets to perform big data analysis
• Collect data – you can collect data from multiple sources, including crawling the
web, local files, files on your network, and multiple protocols and formats
(HTTP, HDFS, Amazon S3 Native File System)
• Extract and analyze data – once you have collected your information, you can see
a sample of it in the spreadsheet interface
• Explore and visualize data – you can apply visualizations to help you make sense
of your data.
BigSheets provides the following visualization tools
• Tag cloud – shows word frequencies; the bigger the word, the more frequently it
exists in the sheet
• Pie chart – shows a proportional relationship, where the relative size of the slice
represents its proportion of the data
• Map – shows data values overlaid onto a map of the world
• Heat map – adds the dimension of showing the relative intensity of the values
overlaid on the map
• Bar chart – shows the frequency of values for a specified column
Choosing a reader that suits the data format
• Basic crawler data— Information extracted from web pages using the
BigInsights web crawler.
• Character delimited data— Tab, tilde, or other field-based information
separated by characters.
• Character-delimited data with text qualifier— Delimited data where the
fields are enclosed by quotation marks or other characters to contain the data.
• Comma-separated value (CSV) data— Standard CSV format, including
qualified fields (quotation marks or escape characters), with or without header
rows.
• Hive read— Data read from an existing Hive data table.
• JSON array— An array of JSON data.
• JSON object read— Multiple lines of JSON object data.
• Line reader— Each line is taken as an entry of discrete data. This format is
useful when a separate job has output a list of unique words.
• Sheets data— Data generated by a previous sheet's processing job.
• Tab-separated value (TSV) data— Field-based information separated by
tabs, with or without a header row.
Visualization tools
 Polymaps
 Flot
 Tangle
 D3.js
 FF Chartwell
 CartoDB
 The R Project
Case Study – BigSheets Twitter Analysis – Top tweeting users chart

Step 1: Creating Big Sheets Master Workbooks

Step 2: Tailoring BigSheets Workbooks

• Tweets-IBM+BigData ALL with 17 columns corresponding to the following


fields:
o header1 = created_at
o header2 = id_str
o header3 = geo
o header4 = coordinates
o header5 = location
o header6 = user.id_str
o header7 = user.name
o header8 = user.screen_name
o header9 = user.location

o header10 = user.description
o header11 = user.url
o header12 = user.followers_count
o header13 = user.friends_count
o header14 = retweet_count
o header15 = favorite_count
o header16 = lang
o header17 = text
• WordCount-IBM+BigData with 2 columns corresponding to:
o header1 = word
o header 2 = number of occurrences

1. From the BigSheets page of the web console, open the Tweets-IBM+BigData
ALL master workbook.
2. As mentioned before, master workbooks cannot be modified, so we need to
create a new workbook based on this master workbook. Within the master
workbook, click on Build new workbook.

3. The new workbook will automatically open. In the lower-left corner, click on Add
sheets and choose Pivot as the type of sheet.

4. Set the Group by columns to header16.

5. Go to the Calculate tab and create a new column (for example “total”) showing the
results of the COUNT function applied to the groups based on column header16.

6. Apply the settings. You should get results similar to the following screenshot.

Save the workbook (for example as “Tweets-IBM+BigData ALL – language pie”)


and Exit the editor.

Step 3: Creating charts

Go to the Tweets-IBM+BigData ALL – language pie workbook.


In the lower-left corner, click on Add chart and choose Chart, then Pie.

In the settings set Value to header16 and Count to total. Apply the settings.

Now you are supposed to run the computation of your chart. Note that a status bar to
the right of the Run button enables you to monitor the progress of the job. Behind the
scenes, BigSheets executes Pig scripts that initiate MapReduce jobs. In the end you
should see the pie chart representing the coverage by language. Note that every time
the data source changes, you can just re-run your charts to get fresh results.
