
DSCI 5350 - Big Data Analytics

Lecture 5 – Impala and Hive

Kashif Saeed
Complex Data Types in Hive

Complex data types in Hive/Impala

• Complex types (also referred to as nested types or collection
types) let you represent multiple data values within a single
row/column position
• Basic data types, also known as scalar types, represent a single
value in a column
• With complex data types, your analytic queries involving
multiple tables could benefit from greater locality during join
processing: by packing more related data items within each HDFS
data block, complex types let join queries avoid the network
overhead of the traditional Hadoop shuffle or broadcast join
techniques
• Allow Hive/Impala to handle data from NoSQL databases
• Complex data types work in Impala on CDH 5.5 or above
ARRAY

• Ordered sequence of elements of the same data type, indexed
using zero-based integers
• The elements can be scalar or of a complex type
• You refer to the value of an array item using the
position of the item. For an array X of size n, the item
values are X[0] – X[n-1]
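A minimal sketch (the contacts table and its columns are hypothetical):

CREATE TABLE contacts (
  name   STRING,
  phones ARRAY<STRING>
);

-- first phone number in the array (index 0)
SELECT name, phones[0] FROM contacts;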

MAP

• A complex data type representing an arbitrary set of
key-value pairs
• The key part is a scalar type, while the value part can
be a scalar or another complex type (ARRAY, STRUCT,
or MAP)
• For a MAP M, the value for a given key can be
retrieved as M[key]
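For example, the employees table used in the HiveQL examples later in
this lecture stores deductions as a MAP<STRING, FLOAT>:

-- look up one value by its key
SELECT name, deductions["Federal Taxes"] FROM employees;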

STRUCT

• A complex data type representing multiple fields of a
single item
• Each field can be a different type
• A field within a STRUCT can also be another STRUCT,
an ARRAY, or a MAP
• A field within a STRUCT is accessed using
structname.fieldname
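The employees examples later in this lecture use a STRUCT this way
(address is a STRUCT with street and city fields, among others):

SELECT name, address.street, address.city FROM employees;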

Hands-On

• Hive Activity 3

HiveQL

HiveQL

• In this section, I will provide some query examples of the
different options you have available in HiveQL
• For the sake of time, we will only discuss a few
examples
• Remember, this is NOT a SQL course

Computations in queries

• You can use SQL functions, aggregations, and
arithmetic calculations in your HiveQL
• For a list of available functions, please refer to the Hive
book, pages 82-85

SELECT upper(name), salary, deductions["Federal Taxes"],
       round(salary * (1 - deductions["Federal Taxes"]))
FROM employees;

LIMIT Clause

• The LIMIT clause puts an upper limit on the number of
rows returned:

SELECT upper(name), salary, deductions["Federal Taxes"],
       round(salary * (1 - deductions["Federal Taxes"]))
FROM employees
LIMIT 2;

Column Aliases

• You can define column aliases in your query

SELECT upper(name), salary,
       deductions["Federal Taxes"] AS fed_taxes,
       round(salary * (1 - deductions["Federal Taxes"])) AS salary_minus_fed_taxes
FROM employees
LIMIT 2;

CASE Statement

• You can write CASE statements in Hive

SELECT name, salary,
  CASE
    WHEN salary < 50000.0 THEN 'low'
    WHEN salary >= 50000.0 AND salary < 70000.0 THEN 'middle'
    WHEN salary >= 70000.0 AND salary < 100000.0 THEN 'high'
    ELSE 'very high'
  END AS bracket
FROM employees;

LIKE Operator

• LIKE is a standard SQL operator
• LIKE uses '%' and '_' as wildcard characters:
 % substitutes any number of characters at the beginning,
end, or middle of a string
 _ substitutes exactly one character

SELECT name, address.street FROM employees
WHERE address.street LIKE '%Ave.';

SELECT name, address.city FROM employees
WHERE address.city LIKE 'O%';

SELECT name, address.street FROM employees
WHERE address.street LIKE '%Chi%';
LIKE Examples

Source: http://www.sqlexamples.info/PHP/mysql_rlike.htm
RLIKE Operator

• RLIKE is a Hive extension which uses Java regular
expressions for searches
• Useful metacharacters:
 ^ signifies the BEGINNING of the string
 $ signifies the END of the string
 | means OR
 [[:<:]] and [[:>:]] mark word boundaries in the MySQL-style
examples cited earlier; Java regular expressions use \b instead
RLIKE Examples

Regex Tutorial: http://www.vogella.com/tutorials/JavaRegularExpressions/article.html
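A representative query (adapted from the employees examples in the
Programming Hive book):

SELECT name, address.street FROM employees
WHERE address.street RLIKE '.*(Chicago|Ontario).*';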


Joins and Join Optimizations

• The keyword JOIN is used for inner joins
• The syntax is:

table1 JOIN table2 ON table1.col = table2.col

• For outer joins, use LEFT, RIGHT, or FULL OUTER JOIN
• You can get better performance in Hive if you optimize
your joins
• The number of joins and the sequence of joins are
important for join optimizations

Join Optimization – Number of Joins

• Hive mostly creates a separate MapReduce job for each
join
• However, when every ON clause uses the same join key,
Hive can perform all the joins in a single MapReduce job
• The following example joins three tables on the same key
(a.ymd), so Hive can process it with 1 MapReduce job
instead of 2:
SELECT a.ymd, a.price_close, b.price_close,
c.price_close FROM stocks a JOIN stocks b ON
a.ymd = b.ymd JOIN stocks c ON a.ymd = c.ymd
Join Optimization – Sequence of Joins

• Hive assumes that the last table in your query is
the largest table
 It attempts to buffer the other tables and then
stream the last table through, while performing
joins on individual records
• When joining multiple tables in a query, always
keep the largest table as the last table
• Example: if I am joining two tables, stocks and
dividends, with stocks being the larger table, use
the following:
SELECT s.ymd, s.symbol, s.price_close,
d.dividend FROM dividends d JOIN stocks s ON
(s.ymd = d.ymd AND s.symbol = d.symbol)
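If the largest table cannot be listed last, Hive provides the
STREAMTABLE hint (also described in the Programming Hive book) to
tell the optimizer which table to stream:

SELECT /*+ STREAMTABLE(s) */ s.ymd, s.symbol, s.price_close,
d.dividend FROM stocks s JOIN dividends d ON
(s.ymd = d.ymd AND s.symbol = d.symbol)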
HiveQL: Indexes in Hive

Indexes in Hive

• The goal of Hive indexing is to improve the speed of query
lookups on certain columns of a table
• Without an index, queries with predicates like WHERE tab1.col1 = 10
load the entire table or partition and process all the rows
• Hive has limited indexing capabilities
• Only single-table indexes are supported
• Unlike an RDBMS, indexes do not result in keys, as Hive does not
support keys, but you can build an index on columns to speed up
some operations
• The indexed data for a table is stored in another table
• The improvement in query speed that an index can provide
comes at the cost of additional processing to create the index
and disk space to store the indexed data
CREATE INDEX Command

CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS index_type
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[
  [ ROW FORMAT ...] STORED AS ...
  | STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
[COMMENT "index comment"]
ALTER INDEX Command

ALTER INDEX employees_index ON TABLE employees
PARTITION (country = 'US')
REBUILD;
• WITH DEFERRED REBUILD
• If WITH DEFERRED REBUILD is specified on CREATE INDEX,
the newly created index is initially empty (regardless
of whether the table contains any data)
• You can use the ALTER INDEX command to rebuild the
index
• PARTITIONED BY
• If the PARTITIONED BY clause is omitted completely, the
index spans all partitions of the original table
Index - Examples

• Creating an index on the emp table:
CREATE INDEX table01_index ON TABLE emp
(empid) AS 'COMPACT' WITH DEFERRED REBUILD;

• Showing all indexes on the emp table:
SHOW INDEX ON emp;
Index - Examples

• Verify the tables in Metastore:
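The index data is kept in a shadow table, so SHOW TABLES should list
something like the following (the generated name follows Hive's
default__table_index__ pattern; exact output may vary):

hive> SHOW TABLES;
emp
default__emp_table01_index__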

Index - Examples

• Rebuilding an index:
ALTER INDEX index_name ON table_name
[PARTITION (...)] REBUILD;

• Dropping an index:
DROP INDEX index_name ON table_name;
HiveQL: Partitions in Hive

Partitioned Tables in Hive

• Like databases, you can partition tables in Hive
• Hive creates subdirectories for the partitions in the table
directory
• Partitioning in Hive can significantly improve query
performance because queries can skip the partitions (and the
underlying data) that do not match the partition filter
• Look at the example below for how partitions are created:
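A sketch consistent with the employees examples used elsewhere in
this lecture (the exact column list is illustrative):

CREATE TABLE employees (
  name       STRING,
  salary     FLOAT,
  deductions MAP<STRING, FLOAT>
)
PARTITIONED BY (country STRING, state STRING);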

Partitioned Tables in Hive - Continued

• Below is how the directories in HDFS are created by
Hive when the table is partitioned:
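For the employees sketch above, the table directory would contain one
subdirectory per partition, along these lines (paths are illustrative):

.../employees/country=CA/state=AB
.../employees/country=CA/state=BC
.../employees/country=US/state=AK
.../employees/country=US/state=AL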

Partitioned Tables in Hive - Continued

• When we add predicates to WHERE clauses that
filter on partition values, these predicates are called
partition filters
• After creating partitions, if you run a query without a
partition filter, it can result in an enormous
MapReduce job if the partitions are large
• A safety measure is to put Hive in strict mode, which
prohibits queries of partitioned tables without a
partition filter:
hive> set hive.mapred.mode=strict;

Partitioned Tables in Hive

• You can see the existing partitions on a table:

hive> SHOW PARTITIONS employees;

• If you have multiple partitions and only
want to see specific partitions:

hive> SHOW PARTITIONS employees
      PARTITION(country='US');

Hands-On

• Hive Activity 4

Sampling Data in Hive

Sampling Data in Hive

• You might need to sample very large data files for
analytics purposes
• Hive supports this with queries that sample tables
into buckets
• You can use the rand() function to create buckets with
a random number of rows, or you can create an equal
number of rows in each sample by bucketing on one of the
columns (ideally the unique identifier)
• In BUCKET x OUT OF n, n is the number of buckets and x is
the bucket returned, as in the following examples

Sampling Data in Hive- Examples

Try the following queries in Hive:

SELECT * FROM emp TABLESAMPLE(BUCKET 1 OUT OF 2 ON empid);

SELECT * FROM emp TABLESAMPLE(BUCKET 2 OUT OF 2 ON empid);

Sampling Data in Hive - Examples

• Try the following queries in Hive:

SELECT * FROM emp TABLESAMPLE(BUCKET 1 OUT OF 2 ON rand());

SELECT * FROM emp TABLESAMPLE(BUCKET 2 OUT OF 2 ON rand());

File Formats in Hive

Sources: Programming Hive (O'Reilly)
https://blog.matthewrathbone.com/2016/09/01/a-beginners-guide-to-hadoop-storage-formats.html
https://www.slideshare.net/StampedeCon/choosing-an-hdfs-data-storage-format-avro-vs-parquet-and-more-stampedecon-2015/9

Why are Storage Formats Important?

• The biggest slowdown for processing engines is finding the
relevant data and writing data back to HDFS
• Choosing an appropriate file format can have
significant benefits:
 Faster read times
 Faster write times
 Splittable files (so you don't need to read the whole file, just
a part of it)
 Schema evolution support (allowing you to change the fields
in a dataset)
 Advanced compression support (compress the files with a
compression codec without sacrificing these features)
Storage Formats in Hive

Storage formats:
• Text (txt, csv, tsv)
• Sequence File
• Avro
• Parquet
• Optimized Row Columnar (ORC)
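In Hive, the storage format is chosen per table with the STORED AS
clause; a minimal sketch (table names and columns are hypothetical):

-- text is the default; other formats are declared explicitly
CREATE TABLE events_parquet (id BIGINT, payload STRING) STORED AS PARQUET;
CREATE TABLE events_avro    (id BIGINT, payload STRING) STORED AS AVRO;
CREATE TABLE events_orc     (id BIGINT, payload STRING) STORED AS ORC;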

Text Format

• Used for csv, txt, tsv, and sometimes json sources
• Data is laid out in lines, with each line being a record;
lines are terminated by a newline character \n
• Human readable
• Data storage is bulky and not as efficient to query
• Text files are inherently splittable
• Convenient format if data needs to be used by other
applications or scripts that produce/read delimited
files
• Does not support block compression
Sequence File

• Sequence files are flat files consisting of binary key-value pairs
• Result in row-oriented storage in Hive
• When Hive converts queries to MapReduce jobs, it decides on
the appropriate key-value pairs to use for a given record
• The sequence file is a standard format supported by Hadoop
itself, so it is an acceptable choice when sharing files between
Hive and other Hadoop-related tools
• Sequence files can be compressed at the block and record
level, which is very useful for optimizing disk space utilization
and I/O

Avro

• Widely used as a serialization platform
 To serialize an object means to convert its state to a byte stream so that the
byte stream can be reverted back into a copy of the object
• Row-based, compact, and fast binary format which defines file data schemas
in JSON (for interoperability)
• The schema is encoded in the file, so when Avro data is read, the schema
used when writing it is always present
• Files support block compression and are splittable
• Supports schema evolution (remove a column, add a column) and multiple
serialization/deserialization use cases
• For most Hadoop-based use cases, Avro is a really good choice

Columnar Formats (Parquet, ORC)

• A columnar storage layout allows reading only a small fraction of
the data from a data file or table
 Datasets are partitioned both horizontally and vertically
 Particularly useful if you need access to a subset of data, or to
all values of a single column without reading whole records
 Offers better compression and storage optimization
• Ideal for read-heavy situations
 Optimizes workloads for Hive and Spark when you need to read
segments of records rather than the whole thing (which is more
common in MapReduce)

Handling JSON Data in Hive

Handling JSON Data in Hive

• Hive is a fantastic tool for performing SQL-style
queries across data that is often not appropriate for a
relational database
• This is possible for 2 reasons:
1. Hive has complex data types
2. Hive supports SerDes

What is a SerDe?

• The SerDe interface allows you to instruct Hive as to
how a record should be processed
• A SerDe is a combination of a Serializer and a
Deserializer (hence the name SerDe)
• The Deserializer interface takes a string or binary
representation of a record and translates it into a Java
object that Hive can manipulate
• The Serializer takes a Java object and converts it into a
representation that Hive can write to HDFS
• Hive provides an API for developing a SerDe:
org.apache.hadoop.hive.serde2
• A SerDe allows Hive to read data from a table, and
write it back out to HDFS, in any custom format
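A hedged sketch of using a custom SerDe for JSON (the class below is
the JSON SerDe that ships with HCatalog; other JSON SerDes exist, and
the table is hypothetical):

CREATE TABLE messages (
  id   BIGINT,
  body STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;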
Handling JSON Data in Hive

• JSON (JavaScript Object Notation) is a
lightweight data-interchange format
• Hive provides three different mechanisms for
processing JSON, as illustrated below:
 Use the GET_JSON_OBJECT UDF
 Use the JSON_TUPLE UDTF
 Use a custom SerDe (see the sketch in the previous section)
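A minimal sketch of the first mechanism (json_table and its single
raw STRING column, json, are hypothetical; '$' roots the JSONPath at
the document):

SELECT get_json_object(json, '$.name')   AS name,
       get_json_object(json, '$.salary') AS salary
FROM json_table;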

JSON_TUPLE UDTF

• JSON_TUPLE is a user-defined table function which
takes a set of names (keys) and a JSON string, and
returns a tuple of values using one function
• It is much more efficient than GET_JSON_OBJECT
because it parses the string once,
as opposed to GET_JSON_OBJECT, which parses the
string once for each column
• As JSON_TUPLE is a UDTF, you will need to use
the LATERAL VIEW syntax in order to achieve the same
goal (see Appendix A)
• The disadvantage – it only works for JSON a single level
deep. For files with multiple levels, you have to
create more lateral views
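The same query as the GET_JSON_OBJECT sketch above, parsing the
string only once (json_table remains hypothetical):

SELECT t.name, t.salary
FROM json_table
LATERAL VIEW json_tuple(json, 'name', 'salary') t AS name, salary;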
Hands-On

• Hive Activity 6
• Practice – complete the example in the blog:
http://thornydev.blogspot.com/2013/07/querying-json-records-via-hive.html
• Expect multiple-level JSON for the assignment, but not
for the exam

Appendix

Appendix A – Table Generating Functions

• Table generating functions take zero or more inputs and
produce multiple columns or rows of output
• All table generating functions, user-defined and built-in,
are often referred to generically as user-defined table
generating functions (UDTFs)
• Limitation – UDTFs only work on complex data types
which are a single level deep and have a single row (see
example)

Source: Programming Hive (O'Reilly)


UDTF Example – explode() function

• SELECT * on an array column returns the array as a list
• explode(), a UDTF, returns each element of the list in a
separate row
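A minimal sketch on the hypothetical contacts table from earlier
(phones is an ARRAY<STRING>):

SELECT explode(phones) AS phone FROM contacts;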

• When the table has multiple rows and you want other columns
alongside the exploded values, use a LATERAL VIEW, as sketched
below:
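Again using the hypothetical contacts table:

SELECT c.name, p.phone
FROM contacts c
LATERAL VIEW explode(c.phones) p AS phone;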

Appendix B - Database Indexes

• A database index is a data structure that improves the
speed of data retrieval operations on a database table
• Can be clustered or non-clustered
• A clustered index physically reorganizes the data, hence
there can be only one clustered index per table
• Non-clustered indexes are like the index at the back of a
book, which points to where the data resides
Database Indexes - Continued

• When an index is added to a column, it saves the
database from full table scans and uses index seeks instead
• Upside – querying becomes faster, especially if you
have indexes defined on the fields often used in
WHERE clauses
• Downside – adding data to a table with indexes
means that the indexes need to be updated as well
Appendix C - Database Partitioning

• When a database table grows too big, running
operations like indexing, selection, deletion, or
updates on the table becomes too time consuming
• Partitioning a table means dividing the table and its
indexes into smaller, manageable partitions so that
maintenance can be done partition by partition,
rather than on the entire table
• For reporting purposes, the most frequently used data
and historical data can be saved on different
partitions
Additional read on Partitions:
https://docs.oracle.com/cd/B28359_01/server.111/b32024/partition.htm
How Partitioning works

• Partitioning allows a table and its indexes to be divided into
smaller pieces, each called a partition
• Each partition has its own name and its own storage
characteristics
• From a DBA perspective, a partitioned table can be
managed collectively or individually
• From an application/consumer perspective, a
partitioned table is identical to a non-partitioned
table; no modification to SQL queries is needed when
accessing a partitioned table
Appendix D - SQL vs. NoSQL

SQL                                       NoSQL
Relational databases                      Distributed databases
Standard storage model: tables            No standard storage model: generally
                                          key-value pairs, documents, etc.
Standard schema definition, hence not     No standard schema definition, hence
well suited for unstructured data         well suited for unstructured data
Examples: Oracle, SQL Server, DB2,        Examples: MongoDB, Redis, RavenDB,
MySQL, etc.                               Cassandra, HBase, CouchDB
Ideal for complex querying                No standard interfaces to perform
                                          complex queries
Does not perform well for very large      Ideal for very large datasets
datasets
Emphasize ACID properties (Atomicity,     Follow Eric Brewer's CAP theorem
Consistency, Isolation, Durability)       (BASE)
ACID vs. BASE for Transactions
What is ACID?
 Atomic: everything in a transaction succeeds, or the
entire transaction is rolled back
 Consistent: a transaction cannot leave the database in
an inconsistent state
 Isolated: transactions cannot interfere with each
other
 Durable: completed transactions persist, even when
servers restart, etc.
ACID vs. BASE for Transactions
What if you ran Amazon on ACID?
• Every time someone is in the process of buying
something, you would lock the inventory part of the
database so that visitors to the website
see accurate inventory
• This would ensure that Amazon never sells anything
not in the inventory, but would be disastrous for the
user experience

Amazon may risk selling the last item to two customers and then apologize to one,
rather than interrupting the user experience to show the exact inventory.
ACID vs. BASE for Transactions
• Eric Brewer's CAP theorem says that if you want
consistency, availability, and partition tolerance, you have
to settle for two out of the three
• The alternative to ACID is BASE:
 Basic Availability
 Soft-state
 Eventual consistency
• Rather than requiring consistency after every transaction, the
database will eventually be in a consistent state
• BASE allows for more scalable and affordable systems
