
DSCI 5350 - Big Data Analytics

Lecture 5 – Impala and Hive

Kashif Saeed
Complex Data Types in Hive

Complex data types in Hive/Impala

• Complex types (also referred to as nested types or collection
types) let you represent multiple data values within a single
row/column position
• Basic data types, also known as scalar types, represent a single
value in a column
• With complex data types, your analytic queries involving
multiple tables could benefit from greater locality during join
processing: by packing more related data items within each HDFS
data block, complex types let join queries avoid the network
overhead of the traditional Hadoop shuffle or broadcast join
techniques
• Allow Hive/Impala to handle data from NoSQL databases
• Complex data types work in Impala on CDH 5.5 or above
ARRAY

• Ordered sequence of elements of the same data type, indexed
using zero-based integers
• The elements can be scalar or of a complex type
• You refer to the value of an array item using the
position of the item. For an array X of size n, the item
values are X[0] – X[n-1]
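A minimal sketch (the contacts table and its columns are hypothetical):

CREATE TABLE contacts (
  name   STRING,
  phones ARRAY<STRING>
);

-- first phone number in the array (index 0)
SELECT name, phones[0] FROM contacts;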

MAP

• A complex data type representing an arbitrary set of
key-value pairs
• The key part is a scalar type, while the value part can
be a scalar or another complex type (ARRAY, STRUCT,
or MAP)
• For a MAP M, the value for a given key can be
retrieved as M[key]
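For example, the employees table used in the HiveQL examples later in
this lecture stores deductions as a MAP<STRING, FLOAT>:

-- look up one value by its key
SELECT name, deductions["Federal Taxes"] FROM employees;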

STRUCT

• A complex data type representing multiple fields of a
single item
• Each field can be a different type
• A field within a STRUCT can also be another STRUCT,
an ARRAY, or a MAP
• A field within a STRUCT is accessed using
structname.fieldname
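The employees examples later in this lecture use a STRUCT this way
(address is a STRUCT with street and city fields, among others):

SELECT name, address.street, address.city FROM employees;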

Hands-On

• Hive Activity 3

HiveQL

HiveQL

• In this section, I will provide some query examples of the
different options you have available in HiveQL
• For the sake of time, we will only discuss a few
examples
• Remember, this is NOT a SQL course

Computations in queries

• You can use SQL functions, aggregations, and
arithmetic calculations in your HiveQL
• For a list of available functions, please refer to the Hive
book, pages 82-85

SELECT upper(name), salary, deductions["Federal Taxes"],
       round(salary * (1 - deductions["Federal Taxes"]))
FROM employees;

LIMIT Clause

• The LIMIT clause puts an upper limit on the number of
rows returned:

SELECT upper(name), salary, deductions["Federal Taxes"],
       round(salary * (1 - deductions["Federal Taxes"]))
FROM employees
LIMIT 2;

Column Aliases

• You can define column aliases in your query

SELECT upper(name), salary,
       deductions["Federal Taxes"] AS fed_taxes,
       round(salary * (1 - deductions["Federal Taxes"])) AS salary_minus_fed_taxes
FROM employees
LIMIT 2;

CASE Statement

• You can write CASE statements in Hive

SELECT name, salary,
  CASE
    WHEN salary < 50000.0 THEN 'low'
    WHEN salary >= 50000.0 AND salary < 70000.0 THEN 'middle'
    WHEN salary >= 70000.0 AND salary < 100000.0 THEN 'high'
    ELSE 'very high'
  END AS bracket
FROM employees;

LIKE Operator

• LIKE is a standard SQL operator
• LIKE uses '%' and '_' as wildcard characters:
 % substitutes any number of characters at the beginning,
end, or middle of a string
 _ substitutes exactly one character

SELECT name, address.street FROM employees
WHERE address.street LIKE '%Ave.';

SELECT name, address.city FROM employees
WHERE address.city LIKE 'O%';

SELECT name, address.street FROM employees
WHERE address.street LIKE '%Chi%';
LIKE Examples

Source: http://www.sqlexamples.info/PHP/mysql_rlike.htm
RLIKE Operator

• RLIKE is a Hive extension which uses Java regular
expressions for searches
• Useful metacharacters:
 ^ signifies the BEGINNING of the string
 $ signifies the END of the string
 | means OR
 [[:<:]] and [[:>:]] mark word boundaries in the MySQL-style
examples cited earlier; Java regular expressions use \b instead
RLIKE Examples

Regex Tutorial: http://www.vogella.com/tutorials/JavaRegularExpressions/article.html
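A representative query (adapted from the employees examples in the
Programming Hive book):

SELECT name, address.street FROM employees
WHERE address.street RLIKE '.*(Chicago|Ontario).*';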


Joins and Join Optimizations

• The keyword JOIN is used for inner joins
• The syntax is:

table1 JOIN table2 ON table1.col = table2.col

• For outer joins, use LEFT, RIGHT, or FULL OUTER JOIN
• You can get better performance in Hive if you optimize
your joins
• The number of joins and the sequence of joins are
important for join optimizations

Join Optimization – Number of Joins

• Hive mostly creates a separate MapReduce job for each
join
• However, when every ON clause uses the same join key,
Hive can perform all the joins in a single MapReduce job
• The following example joins three tables on the same key
(a.ymd), so Hive can process it with 1 MapReduce job
instead of 2:
SELECT a.ymd, a.price_close, b.price_close,
c.price_close FROM stocks a JOIN stocks b ON
a.ymd = b.ymd JOIN stocks c ON a.ymd = c.ymd
Join Optimization – Sequence of Joins

• Hive assumes that the last table in your query is
the largest table
 It attempts to buffer the other tables and then
stream the last table through, while performing
joins on individual records
• When joining multiple tables in a query, always
keep the largest table as the last table
• Example: if I am joining two tables, stocks and
dividends, with stocks being the larger table, use
the following:
SELECT s.ymd, s.symbol, s.price_close,
d.dividend FROM dividends d JOIN stocks s ON
(s.ymd = d.ymd AND s.symbol = d.symbol)
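If the largest table cannot be listed last, Hive provides the
STREAMTABLE hint (also described in the Programming Hive book) to
tell the optimizer which table to stream:

SELECT /*+ STREAMTABLE(s) */ s.ymd, s.symbol, s.price_close,
d.dividend FROM stocks s JOIN dividends d ON
(s.ymd = d.ymd AND s.symbol = d.symbol)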
HiveQL: Indexes in Hive

Indexes in Hive

• The goal of Hive indexing is to improve the speed of query
lookups on certain columns of a table
• Without an index, queries with predicates like WHERE tab1.col1 = 10
load the entire table or partition and process all the rows
• Hive has limited indexing capabilities
• Only single-table indexes are supported
• Unlike an RDBMS, indexes do not result in keys, as Hive does not
support keys, but you can build an index on columns to speed up
some operations
• The indexed data for a table is stored in another table
• The improvement in query speed that an index can provide
comes at the cost of additional processing to create the index
and disk space to store the indexed data
CREATE INDEX Command

CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS index_type
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[
  [ ROW FORMAT ...] STORED AS ...
  | STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
[COMMENT "index comment"]
ALTER INDEX Command

ALTER INDEX employees_index ON TABLE employees
PARTITION (country = 'US')
REBUILD;
• WITH DEFERRED REBUILD
• If WITH DEFERRED REBUILD is specified on CREATE INDEX,
the newly created index is initially empty (regardless
of whether the table contains any data)
• You can use the ALTER INDEX command to rebuild the
index
• PARTITIONED BY
• If the PARTITIONED BY clause is omitted completely, the
index spans all partitions of the original table
Index - Examples

• Creating an index on the emp table:
CREATE INDEX table01_index ON TABLE emp
(empid) AS 'COMPACT' WITH DEFERRED REBUILD;

• Showing all indexes on the emp table:
SHOW INDEX ON emp;
Index - Examples

• Verify the tables in Metastore:
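The index data is kept in a shadow table, so SHOW TABLES should list
something like the following (the generated name follows Hive's
default__table_index__ pattern; exact output may vary):

hive> SHOW TABLES;
emp
default__emp_table01_index__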

Index - Examples

• Rebuilding an index:
ALTER INDEX index_name ON table_name
[PARTITION (...)] REBUILD;

• Dropping an index:
DROP INDEX index_name ON table_name;
HiveQL: Partitions in Hive

Partitioned Tables in Hive

• Like databases, you can partition tables in Hive
• Hive creates subdirectories for the partitions in the table
directory
• Partitioning in Hive can significantly improve query
performance because queries can skip the partitions (and the
underlying data) that do not match the partition filter
• Look at the example below for how partitions are created:
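A sketch consistent with the employees examples used elsewhere in
this lecture (the exact column list is illustrative):

CREATE TABLE employees (
  name       STRING,
  salary     FLOAT,
  deductions MAP<STRING, FLOAT>
)
PARTITIONED BY (country STRING, state STRING);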

Partitioned Tables in Hive - Continued

• Below is how the directories in HDFS are created by
Hive when the table is partitioned:
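For the employees sketch above, the table directory would contain one
subdirectory per partition, along these lines (paths are illustrative):

.../employees/country=CA/state=AB
.../employees/country=CA/state=BC
.../employees/country=US/state=AK
.../employees/country=US/state=AL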

Partitioned Tables in Hive - Continued

• When we add predicates to WHERE clauses that
filter on partition values, these predicates are called
partition filters
• After creating partitions, if you run a query without a
partition filter, it can result in an enormous
MapReduce job if the partitions are large
• A safety measure is to put Hive in strict mode, which
prohibits queries of partitioned tables without a
partition filter:
hive> set hive.mapred.mode=strict;

Partitioned Tables in Hive

• You can see the existing partitions on a table:

hive> SHOW PARTITIONS employees;

• If you have multiple partitions and only
want to see specific partitions:

hive> SHOW PARTITIONS employees
      PARTITION(country='US');

Hands-On

• Hive Activity 4

Sampling Data in Hive

Sampling Data in Hive

• You might need to sample very large data files for
analytics purposes
• Hive supports this with queries that sample tables
into buckets
• You can use the rand() function to create buckets with
a random number of rows, or you can create an equal
number of rows in each sample by bucketing on one of the
columns (ideally the unique identifier)
• In BUCKET x OUT OF n, n is the number of buckets and x is
the bucket returned, as in the following examples

Sampling Data in Hive- Examples

Try the following queries in Hive:

SELECT * FROM emp TABLESAMPLE(BUCKET 1 OUT OF 2 ON empid);

SELECT * FROM emp TABLESAMPLE(BUCKET 2 OUT OF 2 ON empid);

Sampling Data in Hive - Examples

• Try the following queries in Hive:

SELECT * FROM emp TABLESAMPLE(BUCKET 1 OUT OF 2 ON rand());

SELECT * FROM emp TABLESAMPLE(BUCKET 2 OUT OF 2 ON rand());

File Formats in Hive

Sources: Programming Hive (O'Reilly)
https://blog.matthewrathbone.com/2016/09/01/a-beginners-guide-to-hadoop-storage-formats.html
https://www.slideshare.net/StampedeCon/choosing-an-hdfs-data-storage-format-avro-vs-parquet-and-more-stampedecon-2015/9

Why are Storage Formats Important?

• The biggest slowdown for processing engines is finding the
relevant data and writing data back to HDFS
• Choosing an appropriate file format can have
significant benefits:
 Faster read times
 Faster write times
 Splittable files (so you don't need to read the whole file, just
a part of it)
 Schema evolution support (allowing you to change the fields
in a dataset)
 Advanced compression support (compress the files with a
compression codec without sacrificing these features)
Storage Formats in Hive

Storage formats:
• Text (txt, csv, tsv)
• Sequence File
• Avro
• Parquet
• Optimized Row Columnar (ORC)
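In Hive, the storage format is chosen per table with the STORED AS
clause; a minimal sketch (table names and columns are hypothetical):

-- text is the default; other formats are declared explicitly
CREATE TABLE events_parquet (id BIGINT, payload STRING) STORED AS PARQUET;
CREATE TABLE events_avro    (id BIGINT, payload STRING) STORED AS AVRO;
CREATE TABLE events_orc     (id BIGINT, payload STRING) STORED AS ORC;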

Text Format

• Used for csv, txt, tsv, and sometimes json sources
• Data is laid out in lines, with each line being a record;
lines are terminated by a newline character \n
• Human readable
• Data storage is bulky and not as efficient to query
• Text files are inherently splittable
• Convenient format if data needs to be used by other
applications or scripts that produce/read delimited
files
• Does not support block compression
Sequence File

• Sequence files are flat files consisting of binary key-value pairs
• Result in row-oriented storage in Hive
• When Hive converts queries to MapReduce jobs, it decides on
the appropriate key-value pairs to use for a given record
• The sequence file is a standard format supported by Hadoop
itself, so it is an acceptable choice when sharing files between
Hive and other Hadoop-related tools
• Sequence files can be compressed at the block and record
level, which is very useful for optimizing disk space utilization
and I/O

Avro

• Widely used as a serialization platform
 To serialize an object means to convert its state to a byte stream so that the
byte stream can be reverted back into a copy of the object
• Row-based, compact, and fast binary format which defines file data schemas
in JSON (for interoperability)
• The schema is encoded in the file, so when Avro data is read, the schema
used when writing it is always present
• Files support block compression and are splittable
• Supports schema evolution (remove a column, add a column) and multiple
serialization/deserialization use cases
• For most Hadoop-based use cases, Avro is a really good choice

Columnar Formats (Parquet, ORC)

• A columnar storage layout allows reading only a small fraction of
the data from a data file or table
 Datasets are partitioned both horizontally and vertically
 Particularly useful if you need access to a subset of data, or to
all values of a single column without reading whole records
 Offers better compression and storage optimization
• Ideal for read-heavy situations
 Optimizes workloads for Hive and Spark when you need to read
segments of records rather than the whole thing (which is more
common in MapReduce)

Handling JSON Data in Hive

Handling JSON Data in Hive

• Hive is a fantastic tool for performing SQL-style
queries across data that is often not appropriate for a
relational database
• This is possible for 2 reasons:
1. Hive has complex data types
2. Hive supports SerDes

What is a SerDe?

• The SerDe interface allows you to instruct Hive as to
how a record should be processed
• A SerDe is a combination of a Serializer and a
Deserializer (hence the name SerDe)
• The Deserializer interface takes a string or binary
representation of a record and translates it into a Java
object that Hive can manipulate
• The Serializer takes a Java object and converts it into a
representation that Hive can write to HDFS
• Hive provides an API for developing a SerDe:
org.apache.hadoop.hive.serde2
• A SerDe allows Hive to read data from a table, and
write it back out to HDFS, in any custom format
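A hedged sketch of using a custom SerDe for JSON (the class below is
the JSON SerDe that ships with HCatalog; other JSON SerDes exist, and
the table is hypothetical):

CREATE TABLE messages (
  id   BIGINT,
  body STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;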
Handling JSON Data in Hive

• JSON (JavaScript Object Notation) is a
lightweight data-interchange format
• Hive provides three different mechanisms for
processing JSON, as illustrated below:
 Use the GET_JSON_OBJECT UDF
 Use the JSON_TUPLE UDTF
 Use a custom SerDe (see the sketch in the previous section)
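A minimal sketch of the first mechanism (json_table and its single
raw STRING column, json, are hypothetical; '$' roots the JSONPath at
the document):

SELECT get_json_object(json, '$.name')   AS name,
       get_json_object(json, '$.salary') AS salary
FROM json_table;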

JSON_TUPLE UDTF

• JSON_TUPLE is a user-defined table function which
takes a set of names (keys) and a JSON string, and
returns a tuple of values using one function
• It is much more efficient than GET_JSON_OBJECT
because it parses the string once,
as opposed to GET_JSON_OBJECT, which parses the
string once for each column
• As JSON_TUPLE is a UDTF, you will need to use
the LATERAL VIEW syntax in order to achieve the same
goal (see Appendix A)
• The disadvantage – it only works for JSON a single level
deep. For files with multiple levels, you have to
create more lateral views
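The same query as the GET_JSON_OBJECT sketch above, parsing the
string only once (json_table remains hypothetical):

SELECT t.name, t.salary
FROM json_table
LATERAL VIEW json_tuple(json, 'name', 'salary') t AS name, salary;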
Hands-On

• Hive Activity 6
• Practice – complete the example in the blog:
http://thornydev.blogspot.com/2013/07/querying-json-records-via-hive.html
• Expect multiple-level JSON for the assignment, but not
for the exam

Appendix

Appendix A – Table Generating Functions

• Table generating functions take zero or more inputs and
produce multiple columns or rows of output
• All table generating functions, user-defined and built-in,
are often referred to generically as user-defined table
generating functions (UDTFs)
• Limitation – UDTFs only work on complex data types
which are a single level deep and have a single row (see
example)

Source: Programming Hive (O'Reilly)


UDTF Example – explode() function

• SELECT * on an array column returns the array as a list
• explode(), a UDTF, returns each element of the list in a
separate row
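A minimal sketch on the hypothetical contacts table from earlier
(phones is an ARRAY<STRING>):

SELECT explode(phones) AS phone FROM contacts;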

• When the table has multiple rows and you want other columns
alongside the exploded values, use a LATERAL VIEW, as sketched
below:
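Again using the hypothetical contacts table:

SELECT c.name, p.phone
FROM contacts c
LATERAL VIEW explode(c.phones) p AS phone;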

Appendix B - Database Indexes

• A database index is a data structure that improves the
speed of data retrieval operations on a database table
• Can be clustered or non-clustered
• A clustered index physically reorganizes the data, hence
there can be only one clustered index per table
• Non-clustered indexes are like the index at the back of a
book, which points to where the data resides
Database Indexes - Continued

• When an index is added to a column, it saves the
database from full table scans and uses index seeks instead
• Upside – querying becomes faster, especially if you
have indexes defined on the fields often used in
WHERE clauses
• Downside – adding data to a table with indexes
means that the indexes need to be updated as well
Appendix C - Database Partitioning

• When a database table grows too big, running
operations like indexing, selection, deletion, or
updates on the table becomes too time consuming
• Partitioning a table means dividing the table and its
indexes into smaller, manageable partitions so that
maintenance can be done partition by partition,
rather than on the entire table
• For reporting purposes, the most frequently used data
and historical data can be saved on different
partitions
Additional read on Partitions:
https://docs.oracle.com/cd/B28359_01/server.111/b32024/partition.htm
How Partitioning works

• Partitioning allows a table and its indexes to be divided into
smaller pieces, each called a partition
• Each partition has its own name and its own storage
characteristics
• From a DBA perspective, a partitioned table can be
managed collectively or individually
• From an application/consumer perspective, a
partitioned table is identical to a non-partitioned
table; no modification to SQL queries is needed when
accessing a partitioned table
Appendix D - SQL vs. NoSQL

SQL                                       NoSQL
Relational databases                      Distributed databases
Standard storage model: tables            No standard storage model: generally
                                          key-value pairs, documents, etc.
Standard schema definition, hence not     No standard schema definition, hence
well suited for unstructured data         well suited for unstructured data
Examples: Oracle, SQL Server, DB2,        Examples: MongoDB, Redis, RavenDB,
MySQL, etc.                               Cassandra, HBase, CouchDB
Ideal for complex querying                No standard interfaces to perform
                                          complex queries
Does not perform well for very large      Ideal for very large datasets
datasets
Emphasize ACID properties (Atomicity,     Follow Eric Brewer's CAP theorem
Consistency, Isolation, Durability)       (BASE)
ACID vs. BASE for Transactions
What is ACID?
 Atomic: everything in a transaction succeeds, or the
entire transaction is rolled back
 Consistent: a transaction cannot leave the database in
an inconsistent state
 Isolated: transactions cannot interfere with each
other
 Durable: completed transactions persist, even when
servers restart, etc.
ACID vs. BASE for Transactions
What if you ran Amazon on ACID?
• Every time someone is in the process of buying
something, you would lock the inventory part of the
database so that visitors to the website
see accurate inventory
• This would ensure that Amazon never sells anything
not in the inventory, but would be disastrous for the
user experience

Amazon may risk selling the last item to two customers and then apologize to one,
rather than interrupting the user experience to show the exact inventory.
ACID vs. BASE for Transactions
• Eric Brewer's CAP theorem says that if you want
consistency, availability, and partition tolerance, you have
to settle for two out of the three
• The alternative to ACID is BASE:
 Basic Availability
 Soft-state
 Eventual consistency
• Rather than requiring consistency after every transaction, the
database will eventually be in a consistent state
• BASE allows for more scalable and affordable systems
