
UNIT –V FRAMEWORK

1.HIVE
Hive-Introduction
 An SQL-like interface to Hadoop
 Hive is a data warehouse infrastructure tool to process structured data in
Hadoop. It resides on top of Hadoop to summarize Big Data, and makes
querying and analyzing easy.
 Hive was initially developed by Facebook; later, the Apache Software
Foundation took it up and developed it further as an open-source project under
the name Apache Hive.
Hive is not
 A relational database
 A design for OnLine Transaction Processing (OLTP)
 A language for real-time queries and row-level updates
Features of Hive
 It stores the schema in a database and the processed data in HDFS.
 It is designed for OLAP.
 It provides SQL type language for querying called HiveQL or HQL.
 It is familiar, fast, scalable, and extensible.
1.Hive-Services
 The Hive shell is only one of several services that you can run using the hive
command.
$ hive --service <service name>
 cli – The command-line interface to Hive (the shell). This is the default service.
 hiveserver – Runs Hive as a server exposing a Thrift service, enabling access
from a range of clients written in different languages.
 hwi – The Hive Web Interface.
 jar – The Hive equivalent of hadoop jar; a convenient way to run Java
applications that include Hadoop and Hive classes on the classpath.

 metastore – By default, the metastore is run in the same process as the Hive
service.
Hive architecture

Thrift Client
 The Hive Thrift Client makes it easy to run Hive commands from a wide range
of programming languages.
 Thrift bindings for Hive are available for C++, Java, PHP, Python, and Ruby.
JDBC Driver
 Hive provides a JDBC driver, defined in the class
org.apache.hadoop.hive.jdbc.HiveDriver.
 When configured with a JDBC URI of the form jdbc:hive://host:port/dbname, a
Java application will connect to a Hive server running in a separate process at
the given host and port.
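A minimal Java sketch of using this driver is shown below; it assumes a Hive server listening on localhost:10000, a default database, and an employee table like the one created later in this unit.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver class named above
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

        // Host, port and database here are assumptions for illustration
        Connection con = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();

        // Run a HiveQL query and print the result rows
        ResultSet rs = stmt.executeQuery("SELECT eid, name FROM employee");
        while (rs.next()) {
            System.out.println(rs.getInt(1) + "\t" + rs.getString(2));
        }
        con.close();
    }
}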
ODBC Driver
 The Hive ODBC Driver allows applications that support the ODBC protocol to
connect to Hive.
The Metastore
 The metastore is the central repository of Hive metadata.
 The metastore is divided into two pieces: a service and the backing store for the
data.

Working of Hive

2.HiveQL
• Hive’s SQL dialect is called HiveQL. Queries are translated to MapReduce jobs
to exploit the scalability of MapReduce.
• Hive QL
– Basic SQL: Select, From, Join, Group-By
– Equi-Join, Multi-Table Insert, Multi-Group-By
– Batch query
1. Hive Data Types
• Hive primitive data types
• Hive complex data types.
Primitive data types
Numeric Types
• TINYINT (1-byte signed integer)
• SMALLINT (2-byte signed integer)
• INT (4-byte signed integer)
• BIGINT (8-byte signed integer)
• FLOAT (4-byte single precision floating point number)
• DOUBLE (8-byte double precision floating point number)
Date/Time Types
TIMESTAMP,DATE
String Types
STRING,VARCHAR,CHAR

Misc Types
BOOLEAN,BINARY
Hive Complex Data Types
• arrays: ARRAY<data_type>
• maps: MAP<primitive_type, data_type>
• structs: STRUCT<col_name : data_type [COMMENT col_comment], ...>
• union: UNIONTYPE<data_type, data_type, ...>
2. Hive - Built-in Operators
• Relational Operators
• Arithmetic Operators
• Logical Operators
• Complex Operators
Relational Operators

Arithmetic Operators

Logical Operators

Complex Operators

3. Hive Built-in Functions


Numeric and Mathematical Functions

Syntax                          Example
ABS( double n )                 ABS(-100)
BIN( bigint n )                 BIN(100)
CEIL( double n )                CEIL(9.5)
FLOOR( double n )               FLOOR(10.9)
SQRT( double n )                SQRT(4)
LOG2( double n )                LOG2(44)
POW( double m, double n )       POW(10,2)
RAND( [int seed] )              RAND( )

Date Functions

Syntax                                          Example
UNIX_TIMESTAMP()                                2016-09-24 12:11:10
UNIX_TIMESTAMP( string date, string pattern )   UNIX_TIMESTAMP('2000-01-01 10:20:30','yyyy-MM-dd')
TO_DATE( string timestamp )                     TO_DATE('2000-01-01 10:20:30')
YEAR( string date )                             YEAR('2000-01-01 10:20:30')
DATEDIFF( string date1, string date2 )          DATEDIFF('2000-03-01', '2000-01-10')
DATE_ADD( string date, int days )               DATE_ADD('2000-03-01', 5)
DATE_SUB( string date, int days )               DATE_SUB('2000-03-20', 5)

String Functions

Syntax                                                          Example
ASCII( string str )                                             ASCII('A')
CONCAT( string str1, string str2... )                           CONCAT('hadoop','-','hive')
LENGTH( string str )                                            LENGTH('hive')
LOWER( string str )                                             LOWER('HiVe')
REVERSE( string str )                                           REVERSE('hive')
SUBSTR( string source_str, int start_position [, int length] )  SUBSTR('hadoop',4,2)
UPPER( string str )                                             UPPER('HiVe')

Collection Functions

Syntax                      Description
size(Array<T>)              Returns the number of elements in the array type
size(Map<K.V>)              Returns the number of elements in the map type
map_keys(Map<K.V>)          Returns an unordered array containing the keys of the input map
map_values(Map<K.V>)        Returns an unordered array containing the values of the input map
sort_array(Array<T>)        Sorts the input array in ascending order

Type Conversion Function

Syntax                      Description
binary(string|binary)       Casts the parameter into a binary
cast(expr as <type>)        Converts the results of the expression to the given type
Eg: cast('1' as BIGINT)

3. Querying Data in Hive


1. Database manipulation
Create Database Statement
Syntax
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
Example
Hive> CREATE DATABASE IF NOT EXISTS userdb;
Verify a databases list
Hive> SHOW DATABASES;
Hive> SHOW DATABASES LIKE 'u.*';
Drop Database Statement
DROP DATABASE [IF EXISTS] database_name
Hive> DROP DATABASE IF EXISTS userdb;
2.Managing Tables
Syntax
CREATE TABLE [IF NOT EXISTS] [db_name.] table_name [(col_name data_type
[COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]
Example
hive> create table if not exists employee (eid int,name String,designation String,
salary int)
COMMENT 'Employee detail'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
Displaying tables
hive> show tables;
Displaying the structure of table ‘employee’
Hive> desc employee;
eid int None
ename string None
designation string None
salary int None
Dropping Tables & View
• The DROP TABLE statement deletes the data and metadata for a table

Syntax
DROP TABLE [IF EXISTS] table_name;
Example
Hive> drop table emp;
Hive> drop view emp1;
Altering Tables
Syntax
1. ALTER TABLE name RENAME TO new_name
2. ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
3. ALTER TABLE name DROP [COLUMN] column_name
4. ALTER TABLE name CHANGE column_name new_name new_type
5. ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])
Example
Renames the table from employee1 to employee
Hive> ALTER TABLE employee1 RENAME TO employee
Rename the column name
Hive> ALTER TABLE employee CHANGE name ename String;
Adds a column named dept to the employee table
Hive> ALTER TABLE employee ADD COLUMNS ( dept STRING COMMENT
'Department name');
Replace column in table
Hive> ALTER TABLE employee REPLACE COLUMNS (empid INT, ename STRING);

3.Importing Data
Syntax
Insert command
INSERT OVERWRITE TABLE target SELECT col1, col2 FROM source;
Example
Hive> insert overwrite table student select * from student1;
Load statement
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename;

Example
Hive> LOAD DATA LOCAL INPATH '/home/mca/sample.txt' OVERWRITE INTO
TABLE employee;
4.Select-Where Clause
 SELECT statement is used to retrieve the data from a table. WHERE clause
works similar to a condition.
Syntax
SELECT [ALL | DISTINCT] select_expr, select_expr, ... FROM table_reference [WHERE
where_condition] [GROUP BY col_list] [HAVING having_condition] [ORDER BY
col_list] [LIMIT number];
Example
 Hive> SELECT * FROM employee WHERE salary>30000;
 Hive> select * from employee where salary>12000 and dept='ADMIN';
 Hive> SELECT * FROM employee WHERE salary>30000 limit 2;
5.Sorting and Aggregating
 Sorting data in Hive can be achieved by use of a standard ORDER BY clause
Example
Hive> SELECT * from employee order by salary;
 The GROUP BY clause is used to group all the records in a result set using a
particular collection column.
Example
Hive> SELECT Dept,count(*) FROM employee GROUP BY DEPT;
6.Joins
 JOIN is a clause that is used for combining specific fields from two tables by
using values common to each one
 JOIN
 LEFT OUTER JOIN
 RIGHT OUTER JOIN
 FULL OUTER JOIN
JOIN
 JOIN clause is used to combine and retrieve the records from multiple tables.
 Example
hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT FROM CUSTOMERS c JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
LEFT OUTER JOIN
 The HiveQL LEFT OUTER JOIN returns all the rows from the left table, even if
there are no matches in the right table.
 Example
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c LEFT OUTER
JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
RIGHT OUTER JOIN
 The HiveQL RIGHT OUTER JOIN returns all the rows from the right table, even
if there are no matches in the left table.
Example
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c RIGHT OUTER
JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
FULL OUTER JOIN
 The HiveQL FULL OUTER JOIN combines the records of both the left and the
right outer tables that fulfil the JOIN condition.
Example
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c FULL OUTER
JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
7.Subqueries
 A subquery is a SELECT statement that is embedded in another SQL statement.
 Hive has limited support for subqueries; originally a subquery was permitted only in
the FROM clause of a SELECT statement, and later releases also allow IN/EXISTS
subqueries in the WHERE clause, as in the example below.
Example
 SELECT emp_name,salary FROM emp WHERE emp.Joining_year IN (SELECT
dept_start_year FROM department);

8.Views
 A view is a sort of “virtual table” that is defined by a SELECT statement.
 Views can be used to present data to users in a different way from how it is
actually stored on disk.
Example
hive> CREATE VIEW emp_30000 AS SELECT * FROM employee WHERE
salary>30000;
2. PIG – Pig Latin
1. Introduction
 A high-level scripting language (Pig Latin)
 Pig Latin, the programming language for Pig provides common data
manipulation operations, such as grouping, joining, and filtering.
 Pig generates Hadoop Map Reduce jobs to perform the data flows.
 This high-level language for ad-hoc analysis allows developers to inspect HDFS
stored data without the need to learn the complexities of the Map Reduce
framework, thus simplifying the access to the data
 The Pig Latin scripting language is not only a higher-level data flow language; it
also has operators similar to SQL (e.g., FILTER and JOIN) that are translated into a
series of map and reduce functions.
 Pig Latin, in essence, is designed to fill the gap between the declarative style of
SQL and the low-level procedural style of MapReduce
History-Pig
 In 2006, Apache Pig was developed as a research project at Yahoo, primarily to
simplify creating and executing MapReduce jobs over large datasets.
 In 2007, Apache Pig was open sourced via Apache incubator.
 In 2008, the first release of Apache Pig came out.
 In 2010, Apache Pig graduated as an Apache top-level project
When Pig & Hive
• Hive is a good choice
– When you want to query the data
– When you need an answer for specific questions
– If you are familiar with SQL
• Pig is a good choice
– For ETL (Extract-Transform-Load)
– For preparing data for easier analysis
– When you have long series of steps to perform
Pig Vs Hive

Apache Pig – Architecture

Pig Latin – Data types


Primary data types
Data type        Example
int              8
long             8L
float            5.5F
double           5.5
chararray        ‘avcce’
boolean          true / false
datetime         2016-01-22T00:00:00

Complex Data types

Data Type Example

Tuple-ordered set of fields. (raja, 30)

Bag-collection of tuples. {(raju,30),(Mohhammad,45)}

Map-set of key-value pairs. [ ‘name’#’Raju’, ‘age’#30]

2. Data Processing Operators


1. Loading and Storing Data
Load Operator
 To load data from external storage for processing in Pig.
Syntax
Relation_name = LOAD 'Input file path' USING function as schema;
• relation_name − We have to mention the relation in which we want to store the
data.
• Input file path − We have to mention the HDFS directory where the file is
stored
• function − We have to choose a function from the set of load functions provided
by Apache Pig (BinStorage, JsonLoader, PigStorage, TextLoader).
• Schema − We have to define the schema of the data
Example
Grunt> record = LOAD '/home/mca/pig-0.14.0/weather.txt' USING
PigStorage(',')
>> AS (year:int, temperature:int, quality:int);
Describe Operator
 The describe operator is used to view the schema of a relation
Example
Grunt> describe record;
Output:
record: {year: int,temperature: int,quality: int}
Dump operator
 The Dump operator is used to run the Pig Latin statements and display
the results on the screen
Example
Grunt> dump record;
(1950,0,1)
(1950,21,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
Storing Data
 To store the loaded data in the file system using the store operator
Syntax
STORE Relation_name INTO ' required_directory_path ' [USING function];
Example
Grunt> STORE record INTO 'out' USING PigStorage(':');
Grunt> cat out;
1950:0:1
1950:21:1
1950:-11:1
1949:111:1
1949:78:1
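The same load/group/store flow can also be driven from a Java program through Pig's embedded PigServer API. The sketch below is illustrative only: it assumes a local-mode Pig installation on the classpath and reuses the weather.txt path and aliases from the examples above.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPigExample {
    public static void main(String[] args) throws Exception {
        // Run Pig in local mode; use ExecType.MAPREDUCE to run against a cluster
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Register the same Pig Latin statements shown above
        pig.registerQuery("record = LOAD '/home/mca/pig-0.14.0/weather.txt' "
                + "USING PigStorage(',') AS (year:int, temperature:int, quality:int);");
        pig.registerQuery("grouped_records = GROUP record BY year;");

        // Store the grouped relation, the equivalent of the STORE operator
        pig.store("grouped_records", "grouped_out", "PigStorage(':')");
    }
}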
2. Filtering Data
 The FILTER operator is used to select the required tuples from a relation based
on a condition.
Syntax
grunt> Relation2_name = FILTER Relation1_name BY (condition);
Example
Grunt>filter_data = FILTER student_details BY city == 'Chennai';
FOREACH...GENERATE
 FOREACH operator is used to generate specified data transformations based on
the column data.
Syntax
grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);
Example
grunt> foreach_data = FOREACH student_details GENERATE id,age,city;
3. Grouping and Joining Data
JOIN Operator
 The JOIN operator is used to combine records from two or more relations
Syntax
grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key ;
Example
Grunt> DUMP A;
(2,Tie)
(4,Coat)
(3,Hat)
Grunt> DUMP B;
(Joe,2)
(Hank,4)
(Eve,3)
Grunt> C = JOIN A BY $0, B BY $1;
Grunt> DUMP C;
(2,Tie,Joe,2)
(3,Hat,Eve,3)
(4,Coat,Hank,4)
GROUP Operator
 The GROUP operator is used to group the data in one or more relations. It
collects the data having the same key.
Syntax
grunt> Group_data = GROUP Relation_name BY column name;
Example
Grunt> dump record;
(1950,0,1)
(1950,21,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
Grunt> grouped_records = group record by year;
Grunt> dump grouped_records;
(1949,{(1949,111,1),(1949,78,1)})
(1950,{(1950,0,1),(1950,21,1),(1950,-11,1)})
Cross Operator
 The CROSS operator computes the cross-product of two or more relations
Syntax
grunt> Relation3_name = CROSS Relation1_name, Relation2_name;
Example
Grunt> DUMP A;
(2,Tie)
(4,Coat)
Grunt> DUMP B;
(Joe,2)
(Hank,4)
Grunt> I = CROSS A, B;
Grunt> DUMP I;
(2,Tie,Joe,2)
(2,Tie,Hank,4)
(4,Coat,Joe,2)
(4,Coat,Hank,4)
4. Sorting Data
 The ORDER BY operator is used to display the contents of a relation in a sorted
order based on one or more fields.
Syntax
Grunt> Relation_name2 = ORDER Relation_name1 BY column_name (ASC|DESC);
Example
Grunt> order_by_data = ORDER student_details BY age ASC;

3. ZOOKEEPER
 Apache Zookeeper is an effort to develop and maintain an open-source server
which enables highly reliable distributed coordination.
 A high-performance coordination service for distributed applications
(naming, configuration management, synchronization, and group services)
 Runs in Java and has bindings for both Java and C
 Developed at Yahoo! Research
 Started as sub-project of Hadoop, now a top-level Apache project

Features of Zookeeper
 Shared hierarchical namespace: consists of znodes (data registers) in
memory
 High performance: can be used in large, distributed systems
 Reliability: keeps from being a single point of failure
 Strict ordered access: sophisticated synchronization primitives can be
implemented at the client
 Replication: replicates itself over a set of hosts called an ensemble

Architecture of ZooKeeper

What is Apache ZooKeeper Meant For?


 Apache ZooKeeper is a service used by a cluster (group of nodes) to coordinate
between themselves and maintain shared data with robust synchronization
techniques.
 The common services provided by ZooKeeper are as follows
Naming service − Identifying the nodes in a cluster by name. It is similar
to DNS, but for nodes.
Configuration management − Latest and up-to-date configuration
information of the system for a joining node.
Cluster management − Joining / leaving of a node in a cluster and node
status at real time.
Leader election − Electing a node as leader for coordination purpose.
Locking and synchronization service − Locking the data while
modifying it. This mechanism helps in automatic failure recovery when
connecting other distributed applications such as Apache HBase.
Highly reliable data registry − Availability of data even when one or a
few nodes are down.
The ZooKeeper Service
 ZooKeeper is a highly available, high-performance coordination service
1.Data Model
 ZooKeeper maintains a hierarchical tree of nodes called znodes.
 A znode stores data and has an associated ACL.
 ZooKeeper is designed for coordination (which typically uses small data
files), not high-volume data storage, so there is a limit of 1 MB on the
amount of data that may be stored in any znode.
2. Types of Znodes
• Persistence znode − Persistence znode is alive even after the client, which
created that particular znode, is disconnected.
• Ephemeral znode − Ephemeral znodes are alive only as long as the client that
created them is alive; they are deleted when the client's session ends.
• Sequential znode − Sequential znodes can be either persistent or ephemeral.


3. Watches
 Watches are a simple mechanism for the client to get notifications about the
changes in the ZooKeeper ensemble.
 Clients can set watches while reading a particular znode. Watches send a
notification to the registered client for any of the znode changes.
4. Operations
 There are nine basic operations in ZooKeeper

5. Multi-update
 There is another ZooKeeper operation, called multi, which batches together
multiple primitive operations into a single unit that either succeeds or fails in
its entirety.
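A minimal Java sketch of multi is shown below, assuming a ZooKeeper ensemble reachable at localhost:2181; the znode paths are purely illustrative.

import java.util.Arrays;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class MultiUpdateExample {
    public static void main(String[] args) throws Exception {
        // Connection string and session timeout are assumptions for illustration
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        // Both creates succeed or fail together as one atomic unit
        zk.multi(Arrays.asList(
                Op.create("/app", new byte[0],
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT),
                Op.create("/app/config", "v1".getBytes(),
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)));
        zk.close();
    }
}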
6.ACLs
 A znode is created with a list of ACLs, which determines who can perform
certain operations on it.
 ACLs depend on authentication, the process by which the client identifies itself
to ZooKeeper.
 There are a few authentication schemes that ZooKeeper provides:
digest
 The client is authenticated by a username and password.
sasl
 The client is authenticated using Kerberos.
ip
 The client is authenticated by its IP address.
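A small Java sketch of combining digest authentication with a creator-only ACL follows; the connection details, credentials, and znode name are assumptions for illustration.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class AclExample {
    public static void main(String[] args) throws Exception {
        // Connection string and timeout are assumptions for illustration
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        // Authenticate this session with the digest scheme (username:password)
        zk.addAuthInfo("digest", "hiveuser:secret".getBytes());

        // CREATOR_ALL_ACL grants all permissions only to the authenticated creator
        zk.create("/secure-node", "restricted".getBytes(),
                ZooDefs.Ids.CREATOR_ALL_ACL, CreateMode.PERSISTENT);
        zk.close();
    }
}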
7. Implementation
 The ZooKeeper service can run in two modes.
 In standalone mode, there is a single ZooKeeper server, which is useful for
testing due to its simplicity but provides no guarantees of high-availability or
resilience.
 ZooKeeper runs in replicated mode, on a cluster of machines called an
ensemble. ZooKeeper achieves high-availability through replication, and can
provide a service as long as a majority of the machines in the ensemble are up.
Phase 1: Leader election
 The machines in an ensemble go through a process of electing a distinguished
member, called the leader. The other machines are termed followers.
Phase 2: Atomic broadcast
 All write requests are forwarded to the leader, which broadcasts the update to
the
followers. When a majority have persisted the change, the leader commits the
update, and the client gets a response saying the update succeeded.
8. Consistency
 Understanding the basis of ZooKeeper’s implementation helps in
understanding the consistency guarantees that the service makes.

Sequential consistency-Updates from any particular client are applied in the order
that they are sent.
Atomicity-Updates either succeed or fail. This means that if an update fails, no client
will ever see it.

Single system image-A client will see the same view of the system regardless of the
server it connects to. This means that if a client connects to a new server during the
same session, it will not see an older state of the system than the one it saw with the
previous server.
Durability -Once an update has succeeded, it will persist and will not be undone.
This means updates will survive server failures.
Timeliness -The lag in any client’s view of the system is bounded, so it will not be
out of date by more than some multiple of tens of seconds.

9. Sessions

 Sessions are very important for the operation of ZooKeeper. Requests in a


session are executed in FIFO order. Once a client connects to a server, the
session will be established and a session id is assigned to the client.
 The client sends heartbeats at a particular time interval to keep the session
valid. If the ZooKeeper ensemble does not receive heartbeats from a client for
more than the period (session timeout) specified at the starting of the service,
it decides that the client died.
States
 The ZooKeeper object transitions through different states in its lifecycle

Examples
 ZooKeeper Command Line Interface (CLI) is used to interact with the
ZooKeeper ensemble for development purpose.
 To perform the following operation
 Create znodes
 Get data
 Watch znode for changes
 Set data
 Create children of a znode
 List children of a znode
 Check Status
 Remove / Delete a znode
Create Znodes

Syntax

create /path /data

Example

create /FirstZnode "Myfirstzookeeper-app"

Get Data

 It returns the associated data of the znode and metadata of the specified znode

Syntax

get /path
Example
get /FirstZnode
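The same create and get operations can be issued from Java through the ZooKeeper client API. This is a minimal sketch under the assumption that the ensemble is reachable at localhost:2181; it reuses the znode name from the CLI example above and registers a watch while reading.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZnodeExample {
    public static void main(String[] args) throws Exception {
        // Watcher that simply prints every notification it receives
        Watcher watcher = event ->
                System.out.println("Event: " + event.getType() + " on " + event.getPath());

        // Connection string and session timeout are assumptions for illustration
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, watcher);

        // Equivalent of: create /FirstZnode "Myfirstzookeeper-app"
        zk.create("/FirstZnode", "Myfirstzookeeper-app".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Equivalent of: get /FirstZnode  (watch=true registers the watcher above)
        Stat stat = new Stat();
        byte[] data = zk.getData("/FirstZnode", true, stat);
        System.out.println(new String(data) + " (version " + stat.getVersion() + ")");

        zk.close();
    }
}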

4. HBASE

1. Introduction

 HBase is a distributed column-oriented database built on top of the Hadoop file


system. It is an open-source project and is horizontally scalable.
 HBase is a data model similar to Google’s Bigtable, designed to provide
quick random access to huge amounts of structured data. It leverages the fault
tolerance provided by the Hadoop File System (HDFS).
 It is a part of the Hadoop ecosystem that provides random real-time
read/write access to data in the Hadoop File System.
HBase History

HBase and HDFS

HDFS: HDFS is a distributed file system suitable for storing large files.
HBase: HBase is a database built on top of HDFS.

HDFS: HDFS does not support fast individual record lookups.
HBase: HBase provides fast lookups for larger tables.

HDFS: It provides only sequential access of data.
HBase: HBase internally uses hash tables and provides random access, and it stores the
data in indexed HDFS files for faster lookups.

Storage Mechanism in HBase

HBase is a column-oriented database and the tables in it are sorted by row

 Table is a collection of rows.


 Row is a collection of column families.
 Column family is a collection of columns.
 Column is a collection of key value pairs.

Example

HBase and RDBMS

HBase: HBase is schema-less; it doesn't have the concept of a fixed column schema and
defines only column families.
RDBMS: An RDBMS is governed by its schema, which describes the whole structure of the tables.

HBase: It is built for wide tables and is horizontally scalable.
RDBMS: It is thin and built for small tables, and is hard to scale.

HBase: It has de-normalized data.
RDBMS: It will have normalized data.

HBase: It is good for semi-structured as well as structured data.
RDBMS: It is good for structured data.
Applications of HBase

 It is used whenever there is a need for write-heavy applications.


 HBase is used whenever we need to provide fast random access to available
data.
 Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally
HBase Architecture

 In HBase, tables are split into regions and are served by the region servers.
Regions are vertically divided by column families into “Stores”. Stores are
saved as files in HDFS

 HBase has three major components: the client library, a master server, and
region servers. Region servers can be added or removed as per requirement

MasterServer
 Assigns regions to the region servers and takes the help of Apache ZooKeeper
for this task.
 Handles load balancing of the regions across region servers. It unloads the
busy servers and shifts the regions to less occupied servers.
Regions
 Regions are nothing but tables that are split up and spread across the region
servers.
Region server
 Communicate with the client and handle data-related operations.
 Handle read and write requests for all the regions under it.
Zookeeper
 Zookeeper is an open-source project that provides services like maintaining
configuration information, naming, providing distributed synchronization, etc.

Installing HBase in Standalone Mode


$ wget http://www.interior-dsgn.com/apache/hbase/stable/hbase-0.98.8-
hadoop2-bin.tar.gz
$ tar -zxvf hbase-0.98.8-hadoop2-bin.tar.gz
$ mv hbase-0.98.8-hadoop2/* HBase/
Configuring HBase in Standalone Mode
hbase-env.sh
export JAVA_HOME=/usr/lib/jvm/java-1.7.0
hbase-site.xml

<configuration>
<property>
<name>hbase.rootdir</name>
<value>file:/home/mca/HBase/HFiles</value>
</property>
<property> <name>hbase.zookeeper.property.dataDir</name>
<value>/home/mca/zookeeper</value> </property></configuration>
$ cd /usr/local/HBase/bin
$ ./start-hbase.sh
$ ./hbase shell
hbase(main):001:0>

1. HBase - General Commands

The general commands in HBase are status, version, table_help, and whoami

Status-This command returns the status of the system including the details of the
servers running on the system
hbase> status
version-This command returns the version of HBase used in your system
hbase> version

table_help-This command guides you on how to use table-referenced
commands.
hbase> table_help
whoami-This command returns the user details of HBase.
hbase> whoami
2. Create Table
Syntax
create '<table name>','<column family>'
Example
hbase> create 'emp', 'personal data', 'professional data'
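The same table can also be created programmatically through the HBase Java client API. The sketch below is a hedged illustration assuming an HBase 1.x-or-later client on the classpath and an hbase-site.xml that points at the running instance.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateTableExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath to locate ZooKeeper/HBase
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {

            // Equivalent of: create 'emp', 'personal data', 'professional data'
            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("emp"));
            table.addFamily(new HColumnDescriptor("personal data"));
            table.addFamily(new HColumnDescriptor("professional data"));
            admin.createTable(table);
        }
    }
}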

3. Listing a Table

hbase > list


Output

TABLE
emp
2 row(s) in 0.0340 seconds

4. Create Data

Syntax

put '<table name>','<row key>','<column family:column name>','<value>'

Example

hbase> put 'emp','1','personal data:name','raju'


hbase> put 'emp','1','personal data:city','hyderabad'
hbase> put 'emp','1','professional data:designation','manager'
hbase> put 'emp','1','professional data:salary','50000'

5. Displaying data

Syntax
scan '<table name>'
Example

hbase> scan 'emp'

ROW COLUMN+CELL
1 column=personal data:city, timestamp=1417524216501, value=hyderabad
1 column=personal data:name, timestamp=1417524185058, value=ramu
1 column=professional data:designation, timestamp=1417524232601,
value=manager
1 column=professional data:salary, timestamp=1417524244109, value=50000

Example-2

6. Delete data

Deleting a Specific Cell in a Table


Syntax
delete '<table name>', '<row>', '<column name>', '<time stamp>'
Example
hbase> delete 'emp', '1', 'personal data:city', 1417521848375
Deleting All Cells in a Table
Syntax
deleteall '<table name>','<row>'
Example
hbase> deleteall 'emp','1'
Deleting All records in a Table
Syntax
truncate 'table name'
Example
hbase> truncate 'emp'

7. Reading Data
Reading a row (all columns)
Syntax
get '<table name>', '<row key>'
Example
hbase> get 'emp', '1'
COLUMN CELL
personal : city timestamp=1417521848375, value=hyderabad
personal : name timestamp=1417521785385, value=ramu
professional: designation timestamp=1417521885277, value=manager
professional: salary timestamp=1417521903862, value=50000
4 row(s) in 0.0270 seconds
Reading a Specific Column
Syntax
hbase> get '<table name>', '<row id>', {COLUMN => '<column family>:<column name>'}
Example
hbase> get 'emp', '1', {COLUMN => 'personal data:name'}
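The put and get operations shown above can likewise be performed from the Java client API; again this is a hedged sketch assuming an HBase 1.x-or-later client and the 'emp' table created earlier.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PutGetExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("emp"))) {

            // Equivalent of: put 'emp','1','personal data:name','raju'
            Put put = new Put(Bytes.toBytes("1"));
            put.addColumn(Bytes.toBytes("personal data"),
                          Bytes.toBytes("name"), Bytes.toBytes("raju"));
            table.put(put);

            // Equivalent of: get 'emp', '1', {COLUMN => 'personal data:name'}
            Get get = new Get(Bytes.toBytes("1"));
            get.addColumn(Bytes.toBytes("personal data"), Bytes.toBytes("name"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("personal data"), Bytes.toBytes("name"))));
        }
    }
}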
8. Describe and alter
Describe command
Syntax
hbase> describe 'table name'
Example
hbase(main):006:0> describe 'emp'

Alter command
Changing the Maximum Number of Cells of a Column Family
hbase> alter 't1', NAME => 'f1', VERSIONS => 5
Example
hbase(main):003:0> alter 'emp', NAME => 'personal data', VERSIONS => 5
Adding a column family to a table
hbase(main):003:0> alter 'emp', NAME => 'pincode'
Deleting a Column Family
Syntax
hbase> alter '<table name>', 'delete' => '<column family>'
Example
hbase(main):007:0> alter 'emp', 'delete' => 'professional data'
9. Drop command
Drop command
Syntax
drop '<table name>'
Example
hbase> disable 'emp'
hbase> drop 'emp'
drop_all command
Syntax
hbase> drop_all 'reg_exp'
Example
hbase> drop_all 't.*'

5. IBM InfoSphere BigInsights


 What is Big Data?
 What is Hadoop?
 IBM InfoSphere BigInsights
 Federal Market Use Cases
 Take Aways
What is Big Data?
• Extracting insight from an immense volume, variety and velocity of data, in
context, beyond what was previously possible

What is Hadoop?
 Apache Hadoop – a free, open-source framework for data-intensive
applications
 Inspired by Google technologies (MapReduce, GFS)
 Originally built to address scalability problems of Web search and analytics
 Extensively used by Yahoo!
 Enables applications to work with thousands of nodes and petabytes of
data in a highly parallel, cost effective manner
 CPU + disks of a commodity box = Hadoop node
 Boxes can be combined into clusters
 New nodes can be added without changing
 Data formats
 How data is loaded
 How jobs are written

IBM InfoSphere BigInsights

InfoSphere BigInsights


• An analytics platform for Big Data at rest
• Volume
–Petabyte range
• Variety
–All kinds of data
–All kinds of analytics

• Adaptive Analytics Integrating Analytics on Data in Motion and Data at Rest

InfoSphere BigInsights – A Full Hadoop Stack

What makes BigInsights special?

Developing BigInsights Applications


• Map Reduce development in Java
• Pig
-Open source language / Apache sub-project
• Hive
– Open source language / Apache sub-project
– Provides a SQL-like interface to Hadoop
• Jaql
–IBM-developed query language
–Includes a SQL-like interface
–Handles nesting gracefully

–Heavily function oriented


–Very flexible; useful for loosely structured data
–Interface for IBM’s column store and text analytics
BigData Application Development Tooling with Eclipse
 Spreadsheet-like Analysis Tool
 Analytics for V3
 System ML
 System T Information Extraction (IE)
 BigInsights Text Analytics
 Social-media Analytics
BigInsights – Enhancing and Hardening Hadoop for Mission

• Tooling
– Installation/Configuration
– Development via Eclipse, jaql, AQL, DML, etc
– Discover/Analyze
– Browser-based administration, performance analysis, etc.
• Flex Scheduler
– Provided in addition to Hadoop’s FIFO and FAIR schedulers
–Optimized for response time, granular control, and SLAs
• Adaptive M/R
– Balance workload across mappers
– Minimize startup and scheduling cost
• LZO Split-able Compression
• Security Enhancements
– Secure access thru console, other open ports closed
– Authentication with LDAP
– Authorization thru roles
• Extensive RDBMS, DataWarehouse Integration.
• Integration with R & SPSS
• BigSheets Visualization
– Rapid analysis without M/R coding
– Collect Data, Extract and Analyze & Explore/Visualize


– Leverage System T Text Analytics Macro
• Advanced Text Analytics Toolkit
– Eclipse Dev Tool Integration, Provenance Viewer, Debug Capability
– Text Analytics Language (AQL)
– Optimized M/R Engine
–Prebuilt Extractors
• Total Hadoop Cluster Solution
• Tight Integration/Support for Data In Motion via InfoSphere Streams
Enterprise-Level File System: GPFS-SNC
• GPFS SNC (General Parallel File System for Shared Nothing Clusters)
• Developed in 1998 by IBM Research
• Used since then in supercomputers and thousands of other deployments
• Shared-nothing architecture designed for Hadoop in 2009

6. IBM InfoSphere Streams


 Streams is a powerful analytic computing platform for analyzing data in real
time with micro-latency
 Gathering large quantities of data, manipulating it, storing it on disk, and then
analyzing it would be the approach taken with BigInsights
 Streams allows you to apply analytics to data in motion
 In Streams, data flows through operators that have the ability to manipulate the
data stream, and in-flight analysis is performed on the data.

Categories of Problems Solved by Streams


• Applications that require on-the-fly processing, filtering and analysis of
streaming data
– Sensors: environmental, industrial, surveillance video, GPS, …
– “Data exhaust”: network/system/web server/app server log files
– High-rate transaction data: financial transactions, call detail records

Criteria: two or more of the following


– Messages are processed in isolation or in limited data windows
– Sources include non-traditional data (spatial, imagery, text, …)
– Sources vary in connection methods, data rates, and processing
requirements, presenting integration challenges
– Data rates/volumes require the resources of multiple processing nodes
– Analysis and response are needed with sub-millisecond latency
– Data rates and volumes are too great for store-and-mine approaches

Industry use cases for InfoSphere Streams

How InfoSphere Streams works


What is a Stream?
 A stream is a graph of nodes connected by edges
 Each node in the graph is an operator or adapter that will process the data within
the stream in some way
 Nodes can have zero or more inputs and zero or more outputs
 The output from one node is connected to the input of another node or nodes
 The edges of the graph that join the nodes together represent the stream of
data moving between the operators

Streams Processing Language


Designed for stream computing
– Define a streaming-data flow graph
– Rich set of data types to define tuple attributes
Declarative
– Operator invocations name the input and output streams
– Referring to streams by name is enough to connect the graph
Procedural support
– Full-featured imperative language
– Custom logic in operator invocations
Extensible
– User-defined data types
– Custom functions written in SPL or a native language (C++ or Java)
– Custom operators written in SPL

Streams Standard Toolkit: Operators

Utility Operators


XML Support: Built Into SPL


• XML type
o Validated for syntax or schema
• XML Parser operator
o with XPath expressions and functions to parse and manipulate XML data
• XQuery function
o Use XML data on the fly
• Adapters support XML format
o Standard Toolkit
Database Toolkit
Enterprise class
 Large, massively parallel jobs have unique availability requirements because in
a large cluster, there are bound to be failures.
 Streams has built-in availability characteristics that take this into account
 In a massive cluster, the creation, visualization, and monitoring of your
application is a critical success factor in keeping your management costs low
 The integration with the enterprise architecture is essential to building a
holistic solution
 The enterprise aspects of the big data problem for streaming analytics are:
availability, ease of use, and integration
High availability
 An application host is a server that runs SPL jobs
 A management host runs the management services that control the flow of
SPL jobs
 A mixed host can run both SPL jobs and management tasks

Easy to use
 Streams comes with an Eclipse-based visual toolset, called InfoSphere Streams
Studio, which allows you to create, edit, test, debug, run, and even visualize a
stream graph model and SPL applications
Integration
 Coordinating the traditional and new-age big data processes takes a vendor
that understands both sides of the equation
7. IBM BigSheets – Visualization
 BigSheets is a spreadsheet-style tool for business analysts provided with IBM
InfoSphere BigInsights, a platform based on the open source Apache Hadoop
project.
 BigSheets enables non-programmers to iteratively explore, manipulate, and
visualize data stored in your distributed file system
 It includes built-in functions to extract names, addresses, organizations, email,
locations, and phone numbers.
 BigSheets has a good deal in common with a typical spreadsheet application,
such as Microsoft® Excel
 The benefit is the ease with which you can open, browse, and ultimately create
graphical views in the form of graphs and tag clouds on your data. As an
entirely web-based interface to the data viewing and analysis process, it is very
easy to use, but not without its complexities.
 BigSheets either takes the data you provide and builds a visualized version of
that information in the form of a graph, or it processes raw information to
provide a summarized view of the data. This enables BigSheets to support
some basic processing alongside its core visualization role

IBM Big Data Platform

• Three steps are involved in using BigSheets to perform big data analysis
• Collect data – you can collect data from multiple sources, including crawling the
web, local files, files on your network, and multiple protocols and formats
(HTTP, HDFS, Amazon S3 Native File System)
• Extract and analyze data – once you have collected your information, you can see
a sample of it in the spreadsheet interface
• Explore and visualize data – you can apply visualizations to help you make sense
of your data.
BigSheets provides the following visualization tools
• Tag cloud – shows word frequencies; the bigger the word, the more frequently it
exists in the sheet
• Pie chart – shows a proportional relationship, where the relative size of the slice
represents its proportion of the data
• Map – shows data values overlaid onto a map of the world
• Heat map – adds the dimension of showing the relative intensity of the values
overlaid on the map
• Bar chart – shows the frequency of values for a specified column
Choosing a reader that suits the data format
• Basic crawler data— Information extracted from web pages using the
BigInsights web crawler.
• Character delimited data— Tab, tilde, or other field-based information
separated by characters.
• Character-delimited data with text qualifier— Delimited data where the
fields are enclosed by quotation marks or other characters to contain the data.
• Comma-separated value (CSV) data— Standard CSV format, including
qualified fields (quotation marks or escape characters), with or without header
rows.
• Hive read— Data read from an existing Hive data table.
• JSON array— An array of JSON data.
• JSON object read— Multiple lines of JSON object data.
• Line reader— Each line is taken as an entry of discrete data. This format is
useful when a separate job has output a list of unique words.
• Sheets data— Data generated by a previous sheet's processing job.
• Tab-separated value (TSV) data— Field-based information separated by
tabs, with or without a header row.
Visualization tools
 Polymaps
 Flot
 Tangle
 D3.js
 FF Chartwell
 CartoDB
 The R Project
Case Study – BigSheets Twitter Analysis – Top tweeting users chart

Step 1: Creating Big Sheets Master Workbooks

Step 2: Tailoring BigSheets Workbooks

• Tweets-IBM+BigData ALL with 17 columns corresponding to the following


fields:
o header1 = created_at
o header2 = id_str
o header3 = geo
o header4 = coordinates
o header5 = location
o header6 = user.id_str
o header7 = user.name
o header8 = user.screen_name
o header9 = user.location

o header10 = user.description
o header11 = user.url
o header12 = user.followers_count
o header13 = user.friends_count
o header14 = retweet_count
o header15 = favorite_count
o header16 = lang
o header17 = text
• WordCount-IBM+BigData with 2 columns corresponding to:
o header1 = word
o header 2 = number of occurrences

1. From the BigSheets page of the web console, open the Tweets-IBM+BigData
ALL master workbook.
2. As mentioned before, master workbooks cannot be modified, so we need to
create a new workbook based on this master workbook. Within the master
workbook, click on Build new workbook.

3. The new workbook will automatically open. In the lower-left corner, click on Add
sheets and choose Pivot as the type of sheet.

4. Set the Group by columns to header16.

5. Go to the Calculate tab and create a new column (for example “total”) showing the
results of the COUNT function applied to the groups based on column header16.

6. Apply the settings. You should get results similar to the following screenshot.

Save the workbook (for example as “Tweets-IBM+BigData ALL – language pie”)


and Exit the editor.

Step 3: Creating charts

Go to the Tweets-IBM+BigData ALL – language pie workbook.


In the lower-left corner, click on Add chart and choose Chart, then Pie.

In the settings set Value to header16 and Count to total. Apply the settings.

Now you are supposed to run the computation of your chart. Note that a status bar to
the right of the Run button enables you to monitor the progress of the job. Behind the
scenes, BigSheets executes Pig scripts that initiate MapReduce jobs. In the end you
should see the pie chart representing the coverage by language. Note that every time
the data source changes, you can just re-run your charts to get fresh results.
