
Copyright ©2012 Cloudwick Technologies 1
● Need for Real-Time Queries (RTQ)?

● Impala Overview and Architecture

● Impala vs Other Technologies

● Impala Roadmap and FAQs

Copyright ©2012 Cloudwick Technologies 2


• Traditional data warehousing systems generally do not have access to real-time data.
• The whole purpose of traditional data warehouses was to collect and organize data for a set of carefully defined questions. Once the data was in a consumable form, it was much easier for users to access it with a variety of tools and applications. But getting the data into that consumable format required several steps:
  o Deciding up front on the questions to be answered
  o Collecting the data relevant to answering those questions from the source systems
  o Refining the data (a process called data modeling) so that it was in a format that analytic applications could consume
  o Creating a pipeline process for extracting, transforming, and loading the data into the analytic database, run periodically as a batch process, typically weekly or monthly

Copyright ©2012 Cloudwick Technologies 3


• Traditional data-warehousing techniques have bottlenecked and broken down principally around the following activities:
• Data modeling for analytic applications is the extremely sophisticated and manual process of transforming data from many potential sources into a format that makes asking questions easy.
• If new questions need to be asked, new data must typically be captured from source applications and then modeled.
• DBAs must examine each data-warehouse schema change, review the selection of indices, and then review changes to materialized views.
• Application architects must review any applications or data-processing utilities, such as extract, transform, and load (ETL) jobs, that touch the data warehouse and could break as a result of the changes.

Copyright ©2012 Cloudwick Technologies 4


• A world with so much more multi-structured data requires a different approach and different trade-offs:
  o Pushing the process of refining and enriching the data out to a broader audience
  o Making the process of refining and enriching iterative
  o Broadening the analytics audience by making the discovery of meaning in the data more incremental and more self-service
• Hive-based queries are too slow because they must be translated into the batch-oriented MapReduce programming framework.
• Another alternative, moving some of the data into a data mart, has generally meant accessing only a summarized subset of the data, one that may have filtered out the signal along with the noise.
• HBase has also been insufficient for analytics because its design center is simple operations such as create, read, update, and delete rather than operations such as aggregation.

Copyright ©2012 Cloudwick Technologies 5


Copyright ©2012 Cloudwick Technologies 6
● Speed to insight
  o Get answers as fast as you can ask questions
  o Interactive analytics directly on source data
  o No jumping between data silos

● Cost savings
  o Reduce duplicate storage in specialized systems
  o Reduce data movement for interactive analysis
  o Leverage existing tools and employee skills

Copyright ©2012 Cloudwick Technologies 7


● Full-fidelity analysis
  o Ask questions of all your data
  o No information loss from aggregation or from conforming to relational schemas for analysis

● Discoverability
  o Single metadata repository for unified business views
  o Supports the familiar SQL language and existing BI/discovery tools
  o Enables more users to interact with data
Copyright ©2012 Cloudwick Technologies 8


● Real-time SQL queries on CDH in seconds
● Integration with leading BI tools
● Support for HDFS and HBase
● Native distributed query engine
● Low-latency scheduler
● In-memory data transfers
● Leverages metadata, the ODBC driver, SQL syntax, and the Beeswax GUI (in Hue) from Apache Hive
● Kerberos authentication

Copyright ©2012 Cloudwick Technologies 9


● General-purpose SQL query engine:
  ○ Should work for both analytic and transactional workloads
  ○ Will support queries that take from milliseconds to hours

● Runs directly within Hadoop:
  ○ Reads widely used Hadoop file formats
  ○ Talks to widely used Hadoop storage managers
  ○ Runs on the same nodes that run Hadoop processes

● High performance:
  ○ C++ instead of Java
  ○ Runtime code generation
  ○ Completely new execution engine that doesn't build on MapReduce

Copyright ©2012 Cloudwick Technologies 10


● Runs as a distributed service in the cluster: one Impala daemon on each node with data
● User submits a query via ODBC/Beeswax to any of the daemons
● Query is distributed to all nodes with relevant data
● If any node fails, the query fails
● Impala uses Hive's metadata interface and connects to Hive's metastore (see the sketch below)
● Supported file formats:
  ○ Text files (GA: with compression, including LZO)
  ○ Sequence files with Snappy/gzip compression
  ○ GA: Avro data files
  ○ GA: Trevni (columnar format)
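As a minimal illustration of the shared metastore, a table defined in Hive over existing HDFS data becomes visible to Impala as well. Table name, columns, and path here are hypothetical:

    -- In the Hive shell: define a table over existing text data in HDFS.
    CREATE EXTERNAL TABLE web_logs (
      ip    STRING,
      ts    STRING,
      url   STRING,
      bytes BIGINT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION '/data/web_logs';

    -- In impala-shell: the same metastore entry is visible,
    -- so the table can be queried directly.
    SELECT count(*) FROM web_logs;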

Copyright ©2012 Cloudwick Technologies 11


● SQL support:
  ○ Patterned after Hive's version of SQL
  ○ Limited to select, project, join, union, subqueries, aggregation, and insert
  ○ ORDER BY only in combination with LIMIT (see the example below)
  ○ GA: DDL support (CREATE, ALTER)
● Functional limitations:
  ○ No custom UDFs, file formats, or SerDes
  ○ Nothing beyond SQL (buckets, samples, transforms, arrays, structs, maps, XPath, JSON)
  ○ Only hash joins; the joined table has to fit in memory:
    ■ Beta: memory of a single node
    ■ GA: aggregate memory of all nodes
  ○ Beta: join order = FROM clause order
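A minimal example of the supported surface, using hypothetical tables, that combines a join, an aggregation, and ORDER BY paired with LIMIT:

    -- Hypothetical 'orders' and 'customers' tables; join plus aggregation,
    -- with ORDER BY allowed only because it is used together with LIMIT.
    SELECT c.region, SUM(o.amount) AS total_sales
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    WHERE o.order_date >= '2012-01-01'
    GROUP BY c.region
    ORDER BY total_sales DESC
    LIMIT 10;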

Copyright ©2012 Cloudwick Technologies 12


● HBase functionality:
  ○ Uses Hive's mapping of an HBase table into a metastore table (see the sketch below)
  ○ Predicates on row-key columns are mapped into start/stop rows
  ○ Predicates on other columns are mapped into SingleColumnValueFilters

● HBase functional limitations:
  ○ No nested-loop joins
  ○ All data stored as text
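A hedged sketch of that Hive-side mapping, with hypothetical table, column-family, and column names; Impala reuses the resulting metastore entry:

    -- Defined through Hive; maps an existing HBase table into the metastore.
    CREATE EXTERNAL TABLE hbase_users (
      rowkey STRING,
      name   STRING,
      city   STRING
    )
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:name,info:city')
    TBLPROPERTIES ('hbase.table.name' = 'users');

    -- A predicate on the row key becomes a start/stop row scan;
    -- a predicate on a regular column becomes a SingleColumnValueFilter.
    SELECT name   FROM hbase_users WHERE rowkey = 'user_1001';
    SELECT rowkey FROM hbase_users WHERE city = 'Hyderabad';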

Copyright ©2012 Cloudwick Technologies 13


The Impala package installs three binaries:
● impalad – the Impala daemon. Plans and executes queries against HDFS and HBase data. One daemon process should run on each node in the cluster that hosts a DataNode.
● impala-shell – the command-line interface for issuing queries to the Impala daemon.
● statestored – the name service that tracks the location and status of all impalad instances in the cluster.

Software Requirements:
• Red Hat Enterprise Linux (RHEL) 5.7 / CentOS 5.7, or RHEL 6.2 / CentOS 6.2
• CDH 4.1.0 or later
• Hive
• MySQL or PostgreSQL
• RPMs or repositories
• Java dependencies

Hardware Requirements:
• Intel: Nehalem (released 2008) or later processors
• AMD: Bulldozer (released 2011) or later processors
• Memory: 32 GB or more
• Storage: DataNodes with 10 or more disks each

Copyright ©2012 Cloudwick Technologies 14


Manual Procedure:
1) Install CDH4
2) Install Hive
3) Install and configure MySQL as the Hive metastore
4) Install the Impala package on all data nodes
5) Enable Impala to connect to the Hive metastore
   - Copy the JAR required by the metastore (e.g., the MySQL JDBC connector)
6) Copy the configuration files hdfs-site.xml, core-site.xml, hive-site.xml, and log4j.properties to /usr/lib/impala/conf
7) Add Impala-specific configuration parameters to core-site.xml and hdfs-site.xml
8) Start the Impala daemons

Copyright ©2012 Cloudwick Technologies 15


● Two binaries: impalad and statestored
● Impala daemon (impalad)
  ○ Handles client requests and all internal requests related to query execution
● State store daemon (statestored)
  ○ Provides name service and metadata distribution

● Query execution phases:
  ○ Request arrives via ODBC/Beeswax
  ○ Planner turns the request into collections of plan fragments (see the EXPLAIN example below)
  ○ Coordinator initiates execution on remote impalads
  ○ During execution:
    ■ Intermediate results are streamed between executors
    ■ Query results are streamed back to the client
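One way to see the plan the planner produces is EXPLAIN in impala-shell; the table name here is hypothetical and the output format varies between Impala versions:

    -- Shows how the query is broken into plan fragments that the
    -- coordinator then distributes to the impalad executors.
    EXPLAIN
    SELECT url, count(*) AS hits
    FROM web_logs
    GROUP BY url;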

Copyright ©2012 Cloudwick Technologies 16


Request arrives via the ODBC/Beeswax API

Copyright ©2012 Cloudwick Technologies 17


Planner turns the request into collections of plan fragments
Coordinator initiates execution on remote impalads

Copyright ©2012 Cloudwick Technologies 18


Intermediate results are streamed between impalads
Query results are streamed back to the client

Copyright ©2012 Cloudwick Technologies 19


● What is Dremel?
  ○ Columnar storage for data with nested structures
  ○ Distributed, scalable aggregation on top of that
● Columnar storage in Hadoop: Trevni
  ○ New columnar format created by Doug Cutting
  ○ Stores data in appropriate native/binary types
  ○ Can also store nested structures similar to Dremel's ColumnIO
● Distributed aggregation: Impala
● Impala plus Trevni: a superset of the published version of Dremel (which didn't support joins)

Copyright ©2012 Cloudwick Technologies 20


● Hive: MapReduce as an execution engine
  ○ High-latency, low-throughput queries
  ○ Fault-tolerance model based on MapReduce's on-disk checkpointing; materializes all intermediate results
  ○ Java runtime allows for easy late binding of functionality: file formats and UDFs
  ○ Extensive layering imposes high runtime overhead
● Impala:
  ○ Direct, process-to-process data exchange
  ○ No fault tolerance
  ○ An execution engine designed for low runtime overhead
Copyright ©2012 Cloudwick Technologies 21
● The Apache Drill project (supported by MapR) is based on the Dremel project at Google.

● It features additional flexibility in the form of pluggable query languages (DrQL, Mongo QL, Cascading), pluggable data formats (column-based: ColumnIO/Dremel, Trevni, RCFile; row-based: RecordIO, Avro, JSON, CSV), and pluggable storage systems (Hadoop, HBase).

● Its initial design centers on nested data types, with the goal of scaling to 10,000 servers, petabytes of data, and trillions of records processed in seconds.
Copyright ©2012 Cloudwick Technologies 22
● Storm – Twitter has open-sourced Storm, its distributed, fault-tolerant, real-time computation system, on GitHub. Storm is the real-time processing system developed by BackType and is mostly written in Clojure.

● Apache S4 (Simple Scalable Streaming System) – open-sourced by Yahoo! and written in Java. S4 is a general-purpose, distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous, unbounded streams of data.

Copyright ©2012 Cloudwick Technologies 23


                        Batch Processing     Interactive Analysis             Stream Processing
Query Run Time          Minutes to hours     Milliseconds to minutes          Never-ending
Data Volume             TBs to PBs           GBs to PBs                       Continuous stream
Programming Model       MapReduce            Queries                          DAG
Users                   Developers           Analysts and developers          Developers
Google Project          MapReduce            Dremel / BigQuery                –
Open Source Projects    Hadoop MapReduce     Apache Drill, Cloudera Impala    Storm and S4

Copyright ©2012 Cloudwick Technologies 24


● New data formats
  ○ LZO-compressed text
  ○ Avro
  ○ Columnar
● Better metadata handling
  ○ Automatic metadata distribution through the statestore
● Connectivity: JDBC
● Improved query execution: partitioned joins
● Performance improvements
● Improved HBase support
  ○ Composite keys, Avro data in columns
  ○ INSERT, UPDATE, and DELETE
Copyright ©2012 Cloudwick Technologies 25
● Additional SQL
  ○ UDFs
  ○ SQL authorization and DDL
  ○ ORDER BY without LIMIT
  ○ Window functions (see the example after this list)
  ○ Support for structured data types
● Better runtime optimizations
● Better resource management
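For reference, a hedged illustration of the kind of statement those post-GA SQL items would enable (hypothetical table; not executable on the beta/GA releases described above):

    -- Window function plus a full ORDER BY without LIMIT,
    -- i.e. the roadmap features listed under "Additional SQL".
    SELECT user_id, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM sales
    ORDER BY region, rnk;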

Copyright ©2012 Cloudwick Technologies 26


● What are good Impala use cases? Under what conditions should Impala or Hive/MapReduce be used?
Impala is well suited to executing SQL queries for interactive, exploratory analytics on large datasets. Hive and MapReduce are better tools for very long-running, batch-oriented tasks such as ETL.

● How do I import or use my existing dataset?
Impala does not require you to import data or change the data as it exists today with Hive. To run an Impala query against a dataset, simply create the table in Hive (a sketch follows below). If Impala is already running, you may need to refresh the Impala metadata cache.
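A minimal sketch of that workflow, assuming a hypothetical table over data already in HDFS; the exact refresh command depends on the Impala version:

    -- In Hive: describe the existing data in place (hypothetical table and path).
    CREATE EXTERNAL TABLE clickstream (
      user_id STRING,
      url     STRING,
      ts      STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/clickstream';

    -- In impala-shell: pick up the new metadata, then query the table
    -- (some releases use a bare REFRESH, others REFRESH <table>).
    REFRESH clickstream;
    SELECT url, count(*) AS hits FROM clickstream GROUP BY url;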
Copyright ©2012 Cloudwick Technologies 27
● Can any Impala query also be executed in Hive?
Yes. There are some minor differences in how certain queries are handled, but Impala queries can also be completed in Hive. Impala SQL is a subset of HiveQL, with some functional limitations such as transforms.

● Can I use Impala to query data already loaded into Hive and HBase? Or are there special steps that must be taken to use Impala to query data in Hive or HBase?
No. There are no additional steps needed for Impala to query tables managed by Hive, whether they are stored in HDFS or HBase.

Copyright ©2012 Cloudwick Technologies 28


● How are joins performed in Impala?
Impala will only support hash joins at GA. The order in which tables are joined is the same order in which they appear in the SELECT statement's FROM clause; that is, no join-order optimization takes place at the moment. It is usually optimal for the smallest table to appear as the right-most table in a join (see the sketch below).
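A hedged illustration with hypothetical tables: the large fact table is listed first and the small dimension table last, so the right-hand (smaller) input is what the hash join must hold in memory:

    -- 'sales' is assumed to be large and 'stores' small; placing the small
    -- table right-most keeps the in-memory build side of the hash join small.
    SELECT st.region, SUM(s.amount) AS revenue
    FROM sales s
    JOIN stores st ON s.store_id = st.id
    GROUP BY st.region;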
● What is the size limit for joins?
In the Impala beta release, all of the tables except the left-most one (that is, the first one in the FROM clause) must fit into the memory of each host on which the query executes. For the GA release, all of the tables will need to fit into the aggregate memory of the executing hosts.

● What is Impala's aggregation strategy?
Impala currently only supports in-memory hash aggregation.
Copyright ©2012 Cloudwick Technologies 29
● How does Impala achieve its performance improvements?
  o Impala avoids MapReduce. While MapReduce is a great general parallel-processing model with many benefits, it is not designed to execute SQL. Impala avoids the inefficiencies of MapReduce in these ways:
  o Impala does not materialize intermediate results to disk. SQL queries often map to multiple MapReduce jobs, with all intermediate data sets written to disk.
  o Impala avoids MapReduce start-up time. For interactive queries, the MapReduce start-up time becomes very noticeable. Impala runs as a service and essentially has no start-up time.
  o Impala can more naturally disperse query plans instead of having to fit them into a pipeline of map and reduce jobs. This enables Impala to parallelize multiple stages of a query and avoid overheads such as sort and shuffle when they are unnecessary.
  o Impala uses a more efficient execution engine by taking advantage of modern hardware and technologies.

Copyright ©2012 Cloudwick Technologies 30


Workload                                                    Impala Query Time   Hive Query Time   MySQL Query Time
5.2 GB HAProxy log – top IPs by request count               3.1s                65.4s             146s
5.2 GB HAProxy log – top IPs by total request time          3.3s                65.2s             164s
800 MB parsed Rails log – slowest accounts                  1.0s                33.2s             48.1s
800 MB parsed Rails log – highest database time paths       1.1s                33.7s             49.6s
8 GB pageview table – daily pageviews and unique visitors   22.4s               92.2s             180s

Copyright ©2012 Cloudwick Technologies 31


DEMO

Copyright ©2012 Cloudwick Technologies 32


Copyright ©2012 Cloudwick Technologies 33
