Student Guide
Introduction to Big Data Ecosystem
Skills Academy: Data Science Technologist
IBM Training
Preface
March 2018
NOTICES
This information was developed for products and services offered in the USA.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for
information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to
state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any
non-IBM product, program, or service. IBM may have patents or pending patent applications covering subject matter described in this document.
The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to:
IBM Director of Licensing
IBM Corporation
North Castle Drive, MD-NC119
Armonk, NY 10504-1785
United States of America
The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law:
INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND,
EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in
certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these
changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the
program(s) described in this publication at any time without notice.
Any references in this information to non-IBM websites are provided for convenience only and do not in any manner serve as an endorsement of
those websites. The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you. Information
concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available
sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM
products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the
examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and
addresses used by an actual business enterprise is entirely coincidental.
TRADEMARKS
IBM, the IBM logo, ibm.com, Big SQL, Db2, and Hortonworks are trademarks or registered trademarks of International Business Machines Corp.,
registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM
trademarks is available on the web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml.
Adobe, and the Adobe logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other
countries.
Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
Microsoft, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.
© Copyright International Business Machines Corporation 2018.
This document may not be reproduced in whole or in part without the prior written permission of IBM.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
Contents
Preface................................................................................................................. P-1
Contents ............................................................................................................. P-3
Course overview............................................................................................... P-15
Document conventions ..................................................................................... P-16
Additional training resources ............................................................................ P-17
IBM product help .............................................................................................. P-18
Introduction to Big Data and Data Analytics ....................................... 1-1
Unit objectives .................................................................................................... 1-3
Introduction to Big Data ...................................................................................... 1-4
Big Data - a tsunami that is hitting us already ..................................................... 1-5
Data has an intrinsic property…it grows and grows ............................................ 1-6
Growing interconnected & instrumented world ................................................... 1-7
Growth in Internet traffic (PCs, smartphones, IoT,…) ......................................... 1-9
Some examples of Big Data ............................................................................. 1-11
The four classic dimensions of Big Data (4 Vs) ................................................ 1-12
The 4 Vs of Big Data - IBM Infographic ............................................................ 1-16
4 dimensions: Volume, Velocity, Variety, & Veracity ......................................... 1-17
Volume & Velocity of Big Data .......................................................................... 1-19
Variety & Veracity of Big Data .......................................................................... 1-21
And, of course, some people have more than 4 Vs .......................................... 1-22
Or even more… ................................................................................................ 1-23
Types of Big Data ............................................................................................. 1-25
Big Data Analytic Techniques ........................................................................... 1-26
Five key Big Data Use Cases ........................................................................... 1-27
Common Use Cases applied to Big Data ......................................................... 1-28
“Data is the New Oil” ........................................................................................ 1-29
Some statistics: Facebook................................................................................ 1-30
fivethirtyeight.com - Nate Silver ........................................................................ 1-31
The challenges of sensor data.......................................................................... 1-32
Engine data - aircraft, diesel trucks & generators,... ......................................... 1-33
Large Hadron Collider (LHC) ............................................................................ 1-35
Largest Radio Telescope (Square Kilometer Array).......................................... 1-37
Wikibon Big Data Software, Hardware & Professional Services Projection 2014-2026 ($B) ......................... 1-39
2017 Big Data and Analytics Forecast - 03-Mar-17 .......................................... 1-40
Evolution of the Internet of Things (IoT)............................................................ 1-43
Meaning of real-time when applied to Big Data ................................................ 1-44
More comments on real-time ............................................................................ 1-45
Course overview
Preface overview
You will review how IBM Open Platform (IOP) with Apache Hadoop serves as the collaborative
platform for developing Big Data solutions on a common set of Apache Hadoop
technologies. You will also receive an in-depth introduction to the main
components of the ODP core, namely Apache Hadoop (including HDFS, YARN, and
MapReduce) and Apache Ambari, as well as a treatment of the main open-source
components that are generally made available with the ODP core in a
production Hadoop cluster. Participants engage with the product through
interactive exercises.
Intended audience
Big data engineers, data scientists, developers or programmers, and administrators
who are interested in learning about IBM's Open Platform with Apache Hadoop.
Topics covered
Topics covered in this course include:
• Introduction to Big Data and Data Analytics
• Introduction to Hortonworks Data Platform (HDP)
• Apache Ambari
• Hadoop and the Hadoop Distributed File System (HDFS)
• MapReduce and YARN
• Apache Spark
• Storing and Querying Data
• ZooKeeper, Slider, and Knox
• Loading Data with Sqoop
• Dataplane Service
• Stream Computing
Course prerequisites
Participants should have:
• None; however, knowledge of Linux would be beneficial.
Document conventions
Conventions used in this guide follow Microsoft Windows application standards, where
applicable. As well, the following conventions are observed:
• Bold: Bold style is used in demonstration and exercise step-by-step solutions to
indicate a user interface element that is actively selected or text that must be
typed by the participant.
• Italic: Used to reference book titles.
• CAPITALIZATION: All file names, table names, column names, and folder names
appear in this guide exactly as they appear in the application.
To keep capitalization consistent with this guide, type text exactly as shown.
Task-oriented help: You are working in the product and you need specific task-oriented help. Use the IBM Product - Help link.
Introduction to
Big Data and
Data Analytics
Unit objectives
• Develop an understanding of the complete open-source Hadoop
ecosystem and its near-term future directions
• Be able to compare and evaluate the major Hadoop distributions and
their ecosystem components, both their strengths and their limitations
• Gain hands-on experience with key components of various big data
ecosystem components and their roles in building a complete big data
solution to common business problems
• Learn the tools that will enable you to continue your big data
education after the course
This learning is going to be a life-long technical journey that you will start
here and continue throughout your business career
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Unit objectives
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
• 1 in 2 business leaders don't have access to the data they need
• 83% of CIOs cited BI and analytics as part of their visionary plan
• 5.4X more likely that top performers use business analytics
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Passive RFID grows by 1.12 billion tags in 2014 to 6.9 billion: More than five years later
than the industry had expected, market research and events firm IDTechEx finds that the
passive RFID tag market is now seeing tremendous volume growth.
• http://www.idtechex.com/research/articles/passive-rfid-grows-by-1-12-billion-tags-in-2014-to-6-9-billion-00007031.asp
• http://www.statista.com/statistics/299966/size-of-the-global-rfid-market/
• https://en.wikipedia.org/wiki/Radio-frequency_identification
It is Predicted This Year That 1.3 Billion RFID Tags Will Be Sold (in 2006), but This
Number is Expected to Grow Rapidly over the Next Ten Years
• http://www.businesswire.com/news/home/20060124005042/en/Predicted-Year-1.3-Billion-RFID-Tags-Sold
The number of internet users has increased tenfold from 1999 to 2013. The
first billion was reached in 2005. The second billion in 2010. The third billion in 2014.
• http://www.internetlivestats.com/internet-users/
Growth in the smart electric meter market has cooled during the last year, even though
strong market drivers remain in place on which utilities can build a business
case. Although the direct operational and societal benefits of smart meters and the
broader benefits of smart grids that are enabled by smart meters continue to be
debated among policymakers, utilities, regulators, and consumers, penetration rates
continue to climb, reaching nearly 39 percent in North America in 2012. According to a
recent report from Navigant Research, the worldwide installed base of smart meters will
grow from 313 million in 2013 to nearly 1.1 billion in 2022.
• https://www.navigantresearch.com/newsroom/the-installed-base-of-smart-meters-will-surpass-1-billion-by-2022
Number of monthly active Facebook users worldwide as of 1st quarter 2016 (in millions)
• http://www.statista.com/statistics/264810/number-of-monthly-active-facebook-users-worldwide/
Number of mobile phones to exceed world population by 2014. The ITU expects the
number of cell phone accounts to rise from 6 billion now to 7.3 billion in 2014, compared
with a global population of 7 billion.
Over 100 countries have the number of cell phone accounts exceeding their
population.
• http://www.digitaltrends.com/mobile/mobile-phone-world-population-2014/
• http://www.siliconindia.com/magazine_articles/World_to_have_more_cell_phone_accounts_than_people_by_2014-DASD767476836.html
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
The number of devices connected to IP networks will be three times as high as the
global population in 2020. There will be 3.4 networked devices per capita by 2020, up
from 2.2 networked devices per capita in 2015. Accelerated in part by the increase in
devices and the capabilities of those devices, IP traffic per capita will reach 25 GB per
capita by 2020, up from 10 GB per capita in 2015.
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
With every passing day, data grows exponentially. Thousands of terabytes of data are
created every minute worldwide via Facebook, tweets, instant messages, email,
internet usage, mobile usage, product reviews, and so on. Every minute, hundreds of Twitter
accounts are created, thousands of applications are downloaded, and thousands of
new posts and ads are published. According to experts, the amount of big data in the
world is likely to double every two years. This will provide immense volumes of data
in the coming years and calls for smarter data management.
Because the volume of data is growing so rapidly, traditional database
technology cannot meet the need for efficient data management, that is, storage and
analysis. The need of the hour is large-scale adoption of newer tools such as
Hadoop and MongoDB, which use distributed systems to facilitate storage and analysis
of this enormous data across various databases. This information explosion has
opened new doors of opportunity in the modern age.
Variety
Big data is collected and created in various formats and sources. It includes structured
data as well as unstructured data like text, multimedia, social media, business reports
etc.
Structured data such as bank records, demographic data, inventory databases,
business data, product data feeds have a defined structure and can be stored and
analyzed using traditional data management and analysis methods.
Unstructured data includes captured content like images, tweets or Facebook status updates,
instant messenger conversations, blogs, video uploads, voice recordings, and sensor data.
These types of data do not have any defined pattern. Unstructured data is most often a
reflection of human thoughts, emotions, and feelings, which can be difficult to express
in exact words.
As the saying goes “A picture paints a thousand words”, one image or video which is
shared on social networking sites and applauded by millions of users can help in
deriving some crucial inferences. Hence, it is the need of the hour to understand this
non-verbal language to unlock some secrets of market trends.
One of the main objectives of big data is to collect all this unstructured data and analyze
it using the appropriate technology. Data crawling, also known as web crawling, is a
popular technology used for systematically browsing the web pages. There are
algorithms designed to reach the maximum depth of a page and extract useful data
worth analyzing.
Variety of data helps to derive insights from different sets of samples, users, and
demographics. It brings different perspectives to the same information. It also allows
you to analyze and understand the impact of different forms and sources of data collection
from a ‘larger picture’ point of view.
For instance, in order to understand the performance of a brand, traditional surveys are
one of the forms of data collection. This is done by selecting a sample, mostly from
panels. The advantage of this approach is that you get direct answers to the questions.
However, we can obtain real time feedback through various other forms like Facebook
activity, product review blogs, and updates posted by customers on merchant websites
like Flipkart, Amazon, and Snapdeal. A combination of these two forms of data definitely
gives a data-backed, clearer perspective to your business decision making process.
Velocity
In today’s fast paced world, speed is one of the key drivers for success in your business
as time is equivalent to money. Fast turn-around is one of the pre-requisites to stay
alive in this fierce competition. Expectations of quick results and quick deliverables are
pressing to a great extent. In such scenarios, it becomes vital to collect and analyze
vast amounts of disparate data swiftly in order to make well-informed decisions in real
time. Even high-quality data delivered at low velocity can hinder a business's
decision making.
The general definition of Velocity is ‘speed in a specific direction’. In big data, Velocity is
the speed or frequency at which data is collected in various forms and from different
sources for processing. The frequency of specific data collected via various sources
defines the velocity of that data. In other terms, it is data in motion to be captured and
explored. It ranges from batch updates, to periodic to real-time flow of the data.
The frequency of Facebook status updates shared, and messages tweeted every
second, videos uploaded and/or downloaded every minute, or the online/offline bank
transactions recorded every hour, determine the velocity of the data. You can relate
velocity with the amount of trade information captured during each trading session in a
stock exchange. Imagine a video or an image going viral in the blink of an eye, reaching
millions of users across the world. Big data technology allows you to process such
data in real time, sometimes without even capturing it in a database.
Streams of data are processed and databases are updated in real-time, using parallel
processing of live streams of data. Data streaming helps extract valuable insights from
incessant and rapid flow of data records. A streaming application like Amazon Web
Services (AWS) Kinesis is an example of an application that handles the velocity of
data.
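Conceptually, the contrast between waiting for a batch and acting on each record as it arrives can be sketched in a few lines of plain Python. This is only a toy illustration with invented event values, not how Kinesis, Storm, or Spark Streaming are implemented.

# Toy contrast between batch processing and streaming-style processing of incoming events.
import time

def event_stream():
    """Simulate events arriving one at a time (e.g., transactions or status updates)."""
    for amount in [120, 75, 3000, 42, 9800]:
        yield amount
        time.sleep(0.1)   # pretend there is a gap between arrivals

# Streaming: act on each record as it arrives, without storing the whole set first
running_total = 0
for amount in event_stream():
    running_total += amount
    if amount > 1000:                      # e.g., flag a large transaction immediately
        print("alert: large event", amount)
print("running total:", running_total)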
The higher the frequency of data collection into your big data platform in a stipulated
time period, the more likely you are to make accurate decisions at the right time.
Veracity
The fascinating trio of volume, variety, and velocity of data brings along a mixed bag of
information. It is quite possible that such huge data may have some uncertainty
associated with it. You will need to filter out clean and relevant data from big data, to
provide insights that power up your business. In order to make accurate decisions, the
data you have used as an input should be appropriately compiled, conformed,
validated, and made uniform.
There are various reasons for data contamination, such as data entry errors or typos (mostly
in structured data), wrong references or links, junk data, pseudo data, and so on. Enormous
volume, wide variety, and high velocity, in conjunction with high-end technology, hold no
significance if the data collected or reported is incorrect. Hence, data trustworthiness
(in other words, data quality) holds the highest importance in the big data world.
In an automated data collection, analysis, report generation, and decision-making process,
it is essential to have a foolproof system in place to avoid any lapses. Even a minor
slip at any stage of the big data extraction process can cause a major blunder.
Any reports generated based on a certain type of data from a certain source must be
validated for accuracy and reliability. It is always advisable to have two different
methods and sources to validate the credibility and consistency of the data, to avoid any
bias. Accuracy after data collection is not the only concern: determining the right
source and form of the data, the required amount or size of the data, and the right method
of analysis all play a vital role in producing reliable results. Integrity holds the
highest significance in any field of business or personal life, so proper measures
must be put in place to take care of this crucial aspect. Doing so will allow you to
position yourself in the market as a reliable authority and help you attain greater
success.
Parting thoughts
These 4 Vs are like four pillars lending stability to the giant structure of big data.
Adding a precious 5th “V” - value - to the information procured serves the whole purpose of
big data: smart decision making.
Infographic: http://www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Velocity: refers to the speed at which vast amounts of data are being generated,
collected and analyzed. Every day the number of emails, twitter messages, photos,
video clips, etc. increases at lightning speed around the world. Every second of every
day, data is increasing. Not only must it be analyzed, but the speed of transmission
and access to the data must also remain near-instantaneous to allow for real-time access to
websites, credit card verification, and instant messaging. Big data technology allows us
now to analyze the data while it is being generated, without ever putting it into
databases.
Variety: defined as the different types of data we can now use. Data today looks very
different than data from the past. We no longer just have structured data (name, phone
number, address, financial info, etc.) that fits nicely and neatly into a data table. Much of
today's data is unstructured. In fact, 80% of all the world's data fits into this category, including
photos, video sequences, social media updates, etc. New and innovative big data
technology is now allowing structured and unstructured data to be harvested, stored,
and used simultaneously.
Veracity: last here, but certainly not the least. Veracity is the quality or trustworthiness of
the data. Just how accurate is all this data? For example, think about all the Twitter
posts with hash tags, abbreviations, spelling errors, typos, etc., and the reliability and accuracy
of all that content. Working with tons and tons of data is of no use if its quality or
trustworthiness is in doubt. Traditionally, mainframe data was considered the most
trustworthy; but what about this new data? Quality, accuracy, and precision are needed.
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
The list goes on. If you aren’t using big data, you have big problems.
See: Classrooms of the Future, http://www.ozy.com/pov/classrooms-of-the-future/66311
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
http://blogs.systweak.com/2017/03/big-data-vs-represents-characteristics-or-challenges-of-big-data
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Or even more…
• Volume - how much data is there?
• Velocity - how quickly is the data being created, moved, or accessed?
• Variety - how many different types of sources are there?
• Veracity - can we trust the data?
• Validity - is the data accurate and correct?
• Viability - is the data relevant to the use case at hand?
• Volatility - how often does the data change?
• Vulnerability - can we keep the data secure?
• Visualization - how can the data be presented to the user?
• Value - can this data produce a meaningful return on investment?
https://healthitanalytics.com/news/understanding-the-many-vs-of-healthcare-big-data-analytics
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Or even more…
Extracting actionable insights from big data analytics - and perhaps especially
healthcare big data analytics - is one of the most complex challenges that organizations
can face in the modern technological world.
In the healthcare realm, big data has quickly become essential for nearly every
operational and clinical task, including population health management, quality
benchmarking, revenue cycle management, predictive analytics, and clinical decision
support.
The complexity of big data analytics is hard to break down into bite-sized pieces, but
the dictionary has done a good job of providing pundits with some adequate
terminology.
Data scientists and tech journalists both love patterns, and few are more pleasing to
both professions than the alliterative properties of the many V’s of big data.
Originally, there were only the big three - volume, velocity, and variety - introduced by
Gartner analyst Doug Laney all the way back in 2001, long before “big data” became a
mainstream buzzword.
As enterprises started to collect more and more types of data, some of which were
incomplete or poorly architected, IBM was instrumental in adding the fourth V, veracity,
to the mix.
Subsequent linguistic leaps have resulted in even more terms being added to the litany.
Value, visualization, viability, vulnerability, volatility, and validity have all been proposed
as candidates for the list.
Each term describes a specific property of big data that organizations must understand
and address in order to succeed with their chosen initiatives. The article applies this list
specifically to healthcare analytics.
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Problems:
1. Invades our privacy
2. Substitutes phony relationships for real ones
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
References:
• https://en.wikipedia.org/wiki/FiveThirtyEight
• http://fivethirtyeight.com/
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
References:
http://www.computerweekly.com/news/2240176248/GE-uses-big-data-to-power-machine-services-business
• "The airline industry spends $200bn on fuel per year so a 2% saving is $4bn. GE
provides software that enables airline pilots to manage fuel efficiency.“
• “Another product, Movement Planner, is a cruise control system for train drivers.
The technology assesses the terrain and the location of the train to calculate the
optimal speed to run the locomotive for fuel economy. ”
http://www.infoworld.com/article/2616433/big-data/general-electric-lays-out-big-plans-for-big-data.html
• “As one of the world's largest companies, GE is a major manufacturer of systems
in aviation, rail, mining, energy, healthcare, and more. In recognition of the
importance of big data to GE, CEO Jeff Immelt launched a new initiative called
the “industrial Internet,” which aims to help customers increase efficiency and to
create new revenue opportunities for GE through analytics.”
• “The industrial Internet is GE's spin on “the Internet of things,” where Internet-
connected sensors collect vast quantities of data for analysis. According to
Immelt, sensors have already been embedded in 250,000 "intelligent machines"
manufactured by GE, including jet engines, power turbines, medical devices, and
so on. Harvesting and analyzing the data generated by those sensors holds
enormous potential for optimization across a broad range of industrial operations.
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
http://www.theverge.com/2016/4/25/11501078/cern-300-tb-lhc-data-open-access
• “If you ever wanted to take a look at raw data produced by the Large Hadron
Collider, but are missing the necessary physics PhD, here's your chance: CERN
has published more than 300 terabytes of LHC data online for free. The data
covers roughly half the experiments run by the LHC's CMS detector during 2011,
with a press release from CERN explaining that this includes about 2.5 inverse
femtobarns of data - around 250 trillion particle collisions. Best not to download this
on a mobile connection then.”
The aperture arrays in the SKA could produce more than 100 times the global internet
traffic
References:
• https://en.wikipedia.org/wiki/Square_Kilometre_Array
• https://www.skatelescope.org/ (SKA homepage)
• http://www.ska.gov.au/About/Pages/default.aspx
• http://www.ska.gov.au/NewZealandSKA/Pages/default.aspx
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Wikibon Big Data Software, Hardware & Professional Services Projection 2014-2026 ($B)
Executive Summary from the Wikibon Report
The big data market grew 23.5% in 2015, led by Hadoop platform revenues. We
believe the market will grow from $18.3B in 2014 to $92.2B in 2026 - a strong 14.4%
CAGR. Growth throughout the next decade will take place in three successive and
overlapping waves of application patterns - Data Lakes, Intelligent Systems of
Engagement, and Self-Tuning Systems of Intelligence. Increasing amounts of data
generated by sensors from the Internet of Things will drive each application pattern. Big
data tool integration, administrative simplicity, and developer adoption are keys to
growth rates. The adoption of streaming technologies, which address a number of
Hadoop limitations, also will be a factor. Ultimately, the market growth will depend on
enterprises: Will doers take the steps required to transform business with big data
systems?
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Source:
Internet of Things (IoT): The Next Cyber Security Target (Webinar)
Praveen Kumar Gandi, Head Information Security Services,
https://www.slideserve.com/ClicTest/webinar-on-internet-of-things-iot-the-next-cyber-security-target
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
• Event-Driven
If, when you say “real-time,” you mean the opposite of scheduled, then you mean event-driven. Instead of
happening in a particular time interval, event-driven data processing happens when a certain action or
condition triggers it. The performance requirement for this is generally before another event happens.
http://blog.syncsort.com/2016/03/big-data/four-really-real-meanings-of-real-time
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Appendix A
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Appendix A
Self-study materials from Module 1.
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
• From a 2012 Big Data @ Work Study surveying 1144 business and IT
professionals in 95 countries
• Gartner Sept. 2014 report: 13% of surveyed organizations have deployed
big data solutions, while 73% have invested in big data or plan to do so
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Use cases for a Big Data platform: Healthcare and Life Sciences
Healthcare
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Healthcare
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Financial
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Financial
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Use cases for a Big Data platform: Retailers & Social Media
• Problem:
Savvy retailers want to use “big data” to
predict trends, prepare for demand, pinpoint
customers, optimize pricing & promotions,
and monitor real-time analytics & results
- by combining data from web browsing
patterns, social media, industry forecasts,
existing customer records, etc.
• How big data analytics can help:
Access social media to gain insight
Federate data between Big Data and RDBMSs
Apply graph analysis to the available data
Work to understand demand and engage
customers
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Use cases for a Big Data platform: Retailers & Social Media
IBM Presentation - Big Data in Retail: Examples in Action
http://www.ibmbigdatahub.com/presentation/big-data-retail-examples-action
http://www.ibm.com/analytics/us/en/industry/retail-analytics/
One example:
Graph Analytics - http://www.ibmbigdatahub.com/blog/what-graph-analytics
A growing number of businesses and industries are finding innovative ways to apply
graph analytics to a variety of use-case scenarios because it affords a unique
perspective on the analysis of networked entities and their relationships. Gain an
understanding of how the four different types of graph analytics listed below can be applied.
http://www.ibmbigdatahub.com/blog/what-graph-analytics
IBM Institute for Business Value (IIBV) -
http://www.ibm.com/services/uk/gbs/thoughtleadership/ - Register to download the
complete IBM Institute for Business Value study.
Graph analytics
• Path analysis
• Connectivity analysis
• Community analysis
• Centrality analysis
www.ibmbigdatahub.com/blog/what-graph-analytics
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Graph analytics
Bringing this concept to the real world, nodes or vertices can be people, such as
customers and employees; affinity groups such as LinkedIn or meet-up groups; and
companies and institutions. They can also be places such as airports, buildings, cities
and towns, distribution centers, houses, landmarks, retail stores, shipping ports and so
on. Vertices can also be things such as assets, bank accounts, computer servers,
customer accounts, devices, grids, molecules, policies, products, twitter handles, URLs,
web pages and so on.
Edges can be elements that represent relationships such as emails, likes and dislikes,
payment transactions, phone calls, social networking, and so on. Edges can be directed,
that is, they have a one-way direction arrow to represent a relationship from one node
to another (for example, Mike made a payment to Bill; Mike follows Corinne on Twitter).
They can also be non-directed (for example, the M1 links London and Leeds) and weighted
(for example, the number of payments between these two accounts is high). The time
between two locations is also an example of a weighted relationship.
http://www.ibmbigdatahub.com/blog/what-graph-analytics
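To make the vertex-and-edge vocabulary concrete, here is a minimal sketch using the open-source networkx Python library (not part of HDP); the people, relationships, and weights are invented purely for illustration.

# Directed, weighted edges with the networkx library; nodes and figures are invented examples.
import networkx as nx

g = nx.DiGraph()   # a directed graph: each edge has a one-way direction

# Directed edges: Mike made payments to Bill; Mike follows Corinne on Twitter
g.add_edge("Mike", "Bill", relation="payment", weight=12)
g.add_edge("Mike", "Corinne", relation="follows", weight=1)
g.add_edge("Corinne", "Bill", relation="payment", weight=3)

# Centrality analysis: which node is the most connected?
print(nx.degree_centrality(g))

# Path analysis: is there a route from Mike to Bill, and what is it?
print(nx.shortest_path(g, "Mike", "Bill"))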
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Industry 4.0
• “Industry 4.0” was the brainchild of the German government, and
describes the next phase in manufacturing - a so-called fourth
industrial revolution
Industry 1.0: Water/steam power
Industry 2.0: Electric power
Industry 3.0: Computing power
Industry 4.0: Internet of Things (IoT) power
• Meaning
− Characteristic of industrial production in an Industry 4.0 environment is the
strong customization of products under the conditions of highly flexible (mass)
production. The required automation technology is improved by the introduction of
methods of self-optimization, self-configuration, self-diagnosis, cognition, and
intelligent support for workers in their increasingly complex work.
− Wikipedia: https://en.wikipedia.org/wiki/Industry_4.0
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Industry 4.0
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Photo: Siemens
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
http://www.datanami.com/2015/10/30/hortonworks-preps-for-coming-iot-device-storm/
The masses may not be getting flying cars or personal robots anytime soon, but thanks
to connected devices, the future of technology is definitely bright. From smart
refrigerators and wireless BBQ sensors to connected turbines and semi-trucks, the
Internet of Things (IoT) will have a huge impact on consumer and industrial tech.
However, managing hundreds of billions of devices-not to mention the data they
generate-will not be easy, which is why Hadoop distributor Hortonworks is taking pains
to prepare for the coming device storm.
According to Intel, the number of connected devices will explode in the near future,
growing from about 15 billion devices in 2015 to more than 200 billion by 2020.
Anybody who’s watching the rise of wearables like the Fitbit, the popularity of smart
thermostats like Nest, or the presence of drone aircraft can tell you that this
phenomenon is real and accelerating.
Getting these wireless IoT devices integrated with command and control systems will
not be an easy task. While we now have enough Internet addresses to go around,
there’s a lot of other messy details that need to be worked out to ensure that devices
aren’t stepping on each other’s toes, that they can’t be co-opted by others, that the data
is safe and secure. Right now, it’s an ad hoc, Wild West IoT world, but that simply won’t
scale.
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Smart Materials - These are responsive to their environment (passive) and/or external
stimuli (active) in such a way that they serve an enabling technological function.
Examples include, but are of course not limited to: shape memory alloys, piezoelectrics,
electro/photo/thermo-chromics, self-healing polymers, magnetics, re-configurable
devices/components, and many others. There is an opportunity to collaboratively
connect people across the current materials and/or application focus, to address
common issues that may center on phenomenology, reliability, integration, adoption and
standards, or any number of others.
Software Defined Networks (SDN) (http://sdn.ieee.org/) - SDN and NFV (Network
Functions Virtualization) are creating the conditions to reinvent network architectures.
This is happening first at the edge of the network where "intelligence" has already
started migrating, and where innovation is more urgently needed to overcome the
"ossification" by improving networks and services infrastructure flexibility.
These initiatives have reached maturity after following through a full life cycle within
Future Directions. Visit their web portals to continue to participate and learn about new
volunteer opportunities.
Cloud Computing (http://cloudcomputing.ieee.org/) - This has become a scalable
service consumption and delivery platform in the modern IT infrastructure. IEEE is
advancing the understanding and use of the cloud computing paradigm, which currently
has a significant impact on the entire information and communications ecosystem.
Life Sciences (http://lifesciences.ieee.org/) - The overall objective is to make IEEE a
major and recognized player in the life sciences, in particular in the disciplines that are
at the intersection between the organization's traditional fields-electrical engineering,
computer engineering, and computer science-and the life sciences.
Smart Grid (http://smartgrid.ieee.org/) - The "smart grid" has come to describe a next-
generation electrical power system that is typified by the increased use of
communications and information technology in the generation, delivery, and
consumption of electrical energy.
Transportation Electrification (http://electricvehicle.ieee.org/): IEEE seeks to
accelerate the development and implementation of new technologies for the
electrification of transportation which is manifested in the electric vehicles (EV) of today
and the future.
[Slide diagram: a Warehousing Zone containing BI & Reporting, an Enterprise Warehouse, and Predictive Analytics, linked through Connectors to Hadoop, which stores documents in a variety of formats, with ETL, MDM, and Data Governance spanning both.]
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
• Multi-channel customer sentiment and experience analysis
• Detect life-threatening conditions at hospitals in time to intervene
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Facets of data
In data science and big data, you will come across many different types
of data, and each of them requires different tools and techniques. The
main categories of data are:
• Structured
• Unstructured
• Natural language
• Machine-generated
• Graph-based
• Audio, video, and image
• Streaming
Cielen, D., Meysman, A. D. B., & Ali, M. (2016). Introducing data science: Big data, machine
learning, and more, using Python tools. Shelter Island, NY: Manning Publications, pp. 4-8.
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Facets of data
We live in the data age. It’s not easy to measure the total volume of data stored
electronically, but an IDC estimate put the size of the “digital universe” at 4.4 zettabytes
in 2013 and is forecasting a tenfold growth by 2020 to 44 zettabytes. A zettabyte is 10^21
bytes, or equivalently one thousand exabytes, one million petabytes, or one billion
terabytes. That’s more than one disk drive for every person in the world.
White, T. (2015). Hadoop: The definitive guide (4th, revised & updated ed.).
Sebastopol, CA: O'Reilly Media, p. 1.
For the meaning of this terminology (terabyte, petabyte, exabyte, zettabyte), see the
next slide and its notes.
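As a rough sanity check of the "one disk drive for every person" claim above, here is the arithmetic as a small Python sketch; the world-population figure is an approximation assumed only for illustration.

# Back-of-the-envelope check: 44 zettabytes spread across the world population.
zettabyte = 10 ** 21                     # bytes
digital_universe_2020 = 44 * zettabyte   # IDC forecast quoted above
world_population = 7.5e9                 # approximate, assumed for illustration

bytes_per_person = digital_universe_2020 / world_population
print(f"{bytes_per_person / 10**12:.1f} TB per person")   # roughly 5.9 TB each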
This flood of data is coming from many sources. Consider the following:
• The New York Stock Exchange generates about 4−5 terabytes of data per day.
• Facebook hosts more than 240 billion photos, growing at 7 petabytes per month.
• Ancestry.com, the genealogy site, stores around 10 petabytes of data.
• The Internet Archive stores around 18.5 petabytes of data.
(All figures are from 2013 or 2014. For more information, see Tom Groenfeldt, “At
NYSE, The Data Deluge Overwhelms Traditional Databases”; Rich Miller, “Facebook
Builds Exabyte Data Centers for Cold Storage”; Ancestry.com’s “Company Facts”;
Archive.org’s “Petabox”; and the Worldwide LHC Computing Grid project’s welcome
page.)
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Be very careful when talking about network speed, as this is fairly standard:
1 Mbps = one million bits per second
1 MBps = one million bytes per second
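As a quick illustrative check of the difference, here is a small sketch (the 100 Mbps figure is just an example link speed, not from the course material):

# Converting between megabits per second (Mbps) and megabytes per second (MBps).
# One byte is eight bits, so a link rated in Mbps moves one-eighth as many MB per second.
link_speed_mbps = 100                    # e.g., a 100 Mbps network link
link_speed_MBps = link_speed_mbps / 8

print(f"{link_speed_mbps} Mbps = {link_speed_MBps} MBps")   # 100 Mbps = 12.5 MBps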
By the way, even the term “byte” is a little ambiguous. The generally accepted meaning
these days is an octet (i.e., eight bits).
The de facto standard of eight bits is a convenient power of two permitting the values
0 through 255 for one byte. The international standard IEC 80000-13 codified this
common meaning. Many types of applications use information representable in eight
or fewer bits and processor designers optimize for this common usage. The
popularity of major commercial computing architectures has aided in the ubiquitous
acceptance of the 8-bit size. The unit octet was defined to explicitly denote a
sequence of 8 bits because of the ambiguity associated at the time with the byte.
https://en.wikipedia.org/wiki/Byte
Unicode UTF-8 encoding is variable-length and uses 8-bit code units. It was
designed for backward compatibility with ASCII and to avoid the complications of
endianness and byte order marks in the alternative UTF-16 and UTF-32 encodings.
The name is derived from: Universal Coded Character Set + Transformation Format
- 8-bit. UTF-8 is the dominant character encoding for the World Wide Web,
accounting for 86.9% of all Web pages in May 2016. The Internet Mail Consortium
(IMC) recommends that all e-mail programs be able to display and create mail using
UTF-8, and the W3C recommends UTF-8 as the default encoding in XML and
HTML. UTF-8 encodes each of the 1,112,064 valid code points in the Unicode code
space (1,114,112 code points minus 2,048 surrogate code points) using one to four
8-bit bytes (a group of 8 bits is known as an octet in the Unicode Standard).
https://en.wikipedia.org/wiki/UTF-8
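The variable-length nature of UTF-8 is easy to see with Python's standard string encoding; this is only a small sketch, and the characters were chosen as examples.

# UTF-8 uses one to four 8-bit bytes (octets) per character.
for ch in ["A", "é", "€", "😀"]:
    encoded = ch.encode("utf-8")
    print(ch, "->", len(encoded), "bytes:", encoded)
# "A" -> 1 byte, "é" -> 2 bytes, "€" -> 3 bytes, "😀" -> 4 bytes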
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
What is Hadoop?
• Apache open source software framework for reliable, scalable,
distributed computing of massive amounts of data
Hides underlying system details and complexities from user
Developed in Java
• Consists of 3 sub projects:
MapReduce
Hadoop Distributed File System (aka. HDFS)
Hadoop Common
• Has a large ecosystem with both open-source & proprietary Hadoop-
related projects
HBase / ZooKeeper / Avro / etc.
• Meant for heterogeneous commodity hardware
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
What is Hadoop?
Hadoop is an open source project of the Apache Foundation.
It is a framework written in Java originally developed by Doug Cutting who named it
after his son's toy elephant.
Hadoop uses concepts from Google’s MapReduce and Google File System (GFS) technologies
as its foundation.
It is optimized to handle massive amounts of data which could be structured,
unstructured or semi-structured, using commodity hardware, that is, relatively
inexpensive computers.
This massive parallel processing is done with great performance. In its initial
conception, which we will study first, it is a batch operation handling massive amounts
of data, so the response time is not immediate.
Hadoop replicates its data across different computers, so that if one goes down, the
data is processed on one of the replicated computers.
You may be familiar with OLTP (Online Transaction Processing) workloads, where structured
data, such as a relational database, is accessed randomly, for example, when
you access your bank account.
You may also be familiar with OLAP (Online Analytical Processing) or DSS (Decision
Support Systems) workloads, where structured data, such as a relational database, is
accessed sequentially to generate reports that provide business intelligence.
Now, you may not be that familiar with the concept of “Big Data”. Big Data is a term
used to describe large collections of data (also known as datasets) that may be
unstructured and that grow so large and quickly that they are difficult to manage with
regular database or statistics tools.
Hadoop is used for neither OLTP nor OLAP, but for Big Data. It complements these two
approaches to managing data, so Hadoop is NOT a replacement for an RDBMS.
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Think differently
As we start to work with Hadoop, we need to think differently:
• Different processing paradigms
• Different approaches to storing data
• Think ELT (extract-load-transform)
rather than ETL (extract-transform-load)
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Think differently
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
References:
• Google File System (2003):
http://research.google.com/archive/gfs.html
o The Google File System
• MapReduce (2004): http://research.google.com/archive/mapreduce.html
o MapReduce: Simplified Data Processing on Large Clusters
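To illustrate the MapReduce programming model described in that paper, here is a tiny, self-contained word-count sketch in plain Python. It only mimics the map, shuffle, and reduce phases on one machine with invented sample documents; real Hadoop jobs distribute these steps across a cluster.

# A single-machine imitation of MapReduce word count.
from collections import defaultdict

documents = ["big data is big", "data grows and grows"]

# Map phase: emit (word, 1) pairs from each input record
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group emitted values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: sum the counts for each word
reduced = {word: sum(counts) for word, counts in grouped.items()}
print(reduced)   # {'big': 2, 'data': 2, 'is': 1, 'grows': 2, 'and': 1}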
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Checkpoint
1. What are the 4Vs of Big Data?
What are some of the additional Vs that some add to the basic four?
2. What are the three types of Big Data?
3. Name some of the industry sectors that are using Big Data and Data
Analytics to manage their business
4. How many bytes are there in a kilobyte (kB)?
5. What type of processing is Hadoop good / not good for?
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Checkpoint
Checkpoint solution
1. What are the 4Vs of Big Data?
What are some of the additional Vs that some add to the basic four?
Velocity, Variety, Volume, Veracity
Value, Validity, Viability, Volatility, Vulnerability, Visualization
2. What are the three types of Big Data?
Structured, Semi-structured, Unstructured
Secondary types: Natural Language, Machine-Generated, Graph-based, Audio / Video / Image,
Streaming
3. Name some of the industry sectors that are using Big Data and Data
Analytics to manage their business
Healthcare, Telecommunications, Utilities, Banking / Finance, Insurance, Agriculture, Travel,
Retail - This list, of course, is not closed or exclusive, but merely representative of the examples
used in this course - other industries are also valid
4. How many bytes are there in a kilobyte (kB)?
1 kB = 1000 bytes (= 10^3). Don’t confuse with kibibyte (KiB) = 1024 bytes.
5. What type of processing is Hadoop good / not good for?
Massive amounts of data processed in parallel
Not good to process transactions, low latency / rapid processing, small files, intensive
calculations
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Checkpoint solution
Unit summary
• Develop an understanding of the complete open-source Hadoop
ecosystem and its near-term future directions
• Be able to compare and evaluate the major Hadoop distributions and
their ecosystem components, both their strengths and their limitations
• Gain hands-on experience with key components of various big data
ecosystem components and their roles in building a complete big data
solution to common business problems
• Learn the tools that will enable you to continue your big data
education after the course
This learning is going to be a life-long technical journey that you will start
here and continue throughout your business career
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018
Unit summary
Introduction to Hortonworks
Data Platform (HDP)
Unit objectives
• Describe the functions and features of HDP
• List the IBM value-add components
• Explain what IBM Watson Studio is
• Give a brief description of the purpose of each of the value-add
components
Unit objectives
• HDP is:
Open
Central
Interoperable
Enterprise ready
Here is the high level view of the Hortonworks Data Platform. It is divided into several
categories, listed in no particular order of importance.
• Governance, Integration
• Tools
• Security
• Operations
• Data Access
• Data Management
These next several slides will go in more detail of each of these groupings.
Data workflow
Data workflow
In this section you will learn a little about some of the data workflow tools that come with
HDP.
Sqoop
• Can also be used to extract data from Hadoop and export it to relational
databases and enterprise data warehouses
Sqoop
Sqoop is a tool for moving data between structured or relational databases and Hadoop.
This works both ways: you can take data in your RDBMS and move it to HDFS, or move
data from HDFS into some other RDBMS. You can use Sqoop to offload tasks such as ETL
from data warehouses to Hadoop for lower-cost, efficient execution of analytics.
Check out the Sqoop documentation for more info: http://sqoop.apache.org/
Flume
• Flume helps you aggregate data from many sources, manipulate the
data, and then add the data into your Hadoop environment.
Flume
To set the stage, Flume is now deprecated by HDP in favor of HDF, Hortonworks DataFlow.
This course will, however, cover what Flume does at a very high level, but will
not get into HDF in any great detail until the end. Essentially, when you have large
amounts of data, such as log files, that need to be moved from one location to another,
Flume is the tool. With that said, in common cases, when moving log data, you
would consider looking into streaming technologies to move the data in real time (or near
real time) as the log files are generated, from the data warehouse into the Hadoop
datastore for processing. This is where Hortonworks DataFlow comes in.
For more information, refer to: https://hortonworks.com/products/data-platforms/hdf/
Kafka
Kafka
Apache Kafka is a messaging system used for real-time data pipelines. Kafka is
used to build real-time streaming data pipelines that get data between systems or
applications. Kafka works with a variety of Hadoop tools for various
applications. Examples of use cases are:
• Website activity tracking: capturing user site activities for real-time
tracking/monitoring
• Metrics: monitoring data
• Log aggregation: collecting logs from various sources to a central location for
processing.
• Stream processing: article recommendations based on user activity
• Event sourcing: state changes in applications are logged as time-ordered
sequence of records
• Commit log: external commit log system that helps with replicating data
between nodes in case of failed nodes
More information can be found here: https://kafka.apache.org/
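As an illustration of Kafka's publish/subscribe model, here is a minimal producer/consumer sketch using the third-party kafka-python package. The broker address and the "site-activity" topic are assumptions for the example, and a running Kafka broker is required.

# Minimal Kafka publish/subscribe sketch using the kafka-python package (pip install kafka-python).
# Assumes a broker at localhost:9092 and a topic named "site-activity" (illustrative only).
from kafka import KafkaProducer, KafkaConsumer

# Producer: publish one website-activity event to the topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("site-activity", b'{"user": "u123", "action": "page_view"}')
producer.flush()

# Consumer: read messages back from the same topic
consumer = KafkaConsumer("site-activity",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)
    break   # stop after the first message for this sketch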
Data access
Data access
In this section you will learn a little about some of the data access tools that come with
HDP. These include MapReduce, Pig, Hive, HBase, Accumulo, Phoenix, Storm, Solr,
Spark, Big SQL, Tez and Slider.
Hive
• Includes HCatalog
Global metadata management layer that exposes Hive table metadata to
other Hadoop applications.
Introduction to Hortonworks Data Platform (HDP) © Copyright IBM Corporation 2018
Hive
Hive is a data warehouse system built on top of Hadoop. Hive supports easy data
summarization, ad-hoc queries, and analysis of large data sets in Hadoop. For those
who have some SQL background, Hive is great tool because it allows you to use a
SQL-like syntax to access data stored in HDFS. Hive also works well with other
applications in the Hadoop ecosystem. It includes HCatalog, which is a global
metadata management layer that exposes the Hive table metadata to other Hadoop
applications.
Hive documentation: https://hive.apache.org/
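As a small, hypothetical illustration of that SQL-like syntax (the table name, columns,
and HDFS path are made up for this example), an external table can be defined over files
already in HDFS and queried ad hoc from the command line:
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
  ip STRING, request_time STRING, url STRING, status INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/biadmin/web_logs';
SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status;"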
Pig
Pig
Another data access tool is Pig, which was written for analyzing large data sets. Pig has
its own language, called Pig Latin, whose purpose is to simplify MapReduce
programming. Pig Latin is a simple scripting language that, once compiled, becomes
MapReduce jobs to run against Hadoop data. The Pig system is able to optimize your
code, so you as the developer can focus on the semantics rather than efficiency.
Pig documentation: http://pig.apache.org/
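As a hedged sketch of Pig Latin (the input and output paths are illustrative; the
Gutenberg directory matches the one loaded in the HDFS lab later in this course), the
classic word count can be written in a few lines and submitted with the pig command:
cat > wordcount.pig <<'EOF'
lines   = LOAD '/user/biadmin/Gutenberg' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO '/user/biadmin/pig_wordcount';
EOF
pig wordcount.pig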
HBase
HBase
HBase is a columnar datastore, which means that the data is organized in columns as
opposed to the traditional rows that an RDBMS is based upon. HBase uses the
concept of column families to store and retrieve data. HBase is great for large datasets,
but not ideal for transactional data processing. This means that if you have use cases
where you rely on transactional processing, you would go with a different datastore that
has the features that you need. The common use for HBase is when you need to perform
random read/write access to your big data.
HBase documentation: https://hbase.apache.org/
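As a minimal sketch using the HBase shell (the table name clickstream and the column
family visit are hypothetical), you can create a table with a column family, store a
cell, and read it back:
hbase shell <<'EOF'
create 'clickstream', 'visit'
put 'clickstream', 'user1-20180301', 'visit:url', '/products/widget'
get 'clickstream', 'user1-20180301'
scan 'clickstream', {LIMIT => 5}
EOF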
Accumulo
• Features:
Server-side programming
Designed to scale
Cell-based access control
Stable
Accumulo
Accumulo is another key/value store, similar to HBase. You can think of Accumulo as a
"highly secure HBase". It has various features that provide robust, scalable data
storage and retrieval. It is also based on Google's BigTable, which again, is the same
technology that HBase is based on. HBase, however, is gaining more features that align
it more closely with what the community needs. It is up to you to evaluate your
requirements and determine the best tool for your needs.
Accumulo documentation: https://accumulo.apache.org/
Phoenix
• Fully integrated with other Hadoop products such as Spark, Hive, Pig,
Flume, and MapReduce
Phoenix
Phoenix enables online transaction processing and operational analytics in Hadoop for
low-latency applications. Essentially, this is SQL for a NoSQL database. Recall that
HBase is not designed for transactional processing. Phoenix combines the best of the
NoSQL datastore with the need for transactional processing. It is fully integrated with
other Hadoop products such as Spark, Hive, Pig, Flume, and MapReduce.
Phoenix documentation: https://phoenix.apache.org/
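As a hedged sketch (the ZooKeeper quorum localhost:2181:/hbase-unsecure and the client
path below are typical for an unsecured HDP install, and the events table is made up
for this example), Phoenix SQL can be run through the sqlline.py client:
cat > /tmp/events.sql <<'EOF'
CREATE TABLE IF NOT EXISTS events (id BIGINT NOT NULL PRIMARY KEY, host VARCHAR, msg VARCHAR);
UPSERT INTO events VALUES (1, 'web01', 'user login');
SELECT * FROM events;
EOF
/usr/hdp/current/phoenix-client/bin/sqlline.py localhost:2181:/hbase-unsecure /tmp/events.sql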
Storm
• Useful when milliseconds of latency matter and Spark isn't fast enough
Has been benchmarked at over a million tuples processed per second per
node
Storm
Storm is designed for real-time computation that is fast, scalable, and fault-tolerant.
When you have a use case to analyze streaming data, consider Storm as an option. Of
course, there are numerous other streaming tools available such as Spark or even IBM
Streams, a proprietary software with decades of research behind it for real-time
analytics.
Solr
Solr
Solr is built on the Apache Lucene search library. It is designed for full-text indexing
and searching. Solr powers the search of many big sites around the internet. It is highly
reliable, scalable, and fault tolerant, providing distributed indexing, replication, load-
balanced querying, automated failover and recovery, centralized configuration, and more.
Spark
Spark
Spark is an in-memory processing engine where speed and scalability are the big
advantages. There are a number of built-in libraries that sit on top of the Spark core
and take advantage of all its capabilities: Spark ML, Spark GraphX, Spark
Streaming, and Spark SQL with DataFrames. There are three main languages supported by
Spark: Scala, Python, and R. In most cases, Spark can run programs faster than
MapReduce can by utilizing its in-memory architecture.
Spark documentation: https://spark.apache.org/
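As a quick, hedged illustration of submitting work to Spark on YARN (the examples jar
path below is typical for an HDP install but varies by version):
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode client \
  /usr/hdp/current/spark2-client/examples/jars/spark-examples*.jar 100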
Druid
Druid
Druid is a datastore designed for business intelligence (OLAP) queries. Druid provides
real-time data ingestion, query, and fast aggregations. It integrates with Apache Hive to
build OLAP cubes and run sub-second queries.
Falcon
Falcon
Falcon is used for managing the data life cycle in Hadoop clusters. One example use
case would be feed management services such as feed retention, replication across
clusters for backups, and archival of data.
Falcon documentation: https://falcon.apache.org/
Atlas
Atlas
Atlas enables enterprises to meet their compliance requirements within Hadoop. It
provides features for data classification, centralized auditing, centralized lineage, and a
security and policy engine. It integrates with the whole enterprise data ecosystem.
Atlas documentation: https://atlas.apache.org/
Security
Security
In this section you will learn a little about some of the security tools that come with HDP.
Ranger
• Using the Ranger console you can manage policies for access to files, folders,
databases, tables, or columns with ease
Ranger
Ranger is used to control data security across the entire Hadoop platform. The Ranger
console can manage policies for access to files, folders, databases, tables, and
columns. The policies can be set for individual users or groups.
Ranger documentation: https://ranger.apache.org/
Knox
• Single access point for all REST interactions with Apache Hadoop
clusters
Knox
Knox is a gateway for the Hadoop ecosystem. It provides perimeter-level security for
Hadoop. You can think of Knox like the castle walls, where within the walls is your Hadoop
cluster. Knox integrates with SSO and identity management systems to simplify Hadoop
security for users who access cluster data and execute jobs.
Knox documentation: https://knox.apache.org/
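As a rough illustration (the gateway host, port 8443, the default topology name, and
the guest/guest-password demo LDAP credentials are assumptions for a typical HDP
setup), a client reaches WebHDFS through the Knox gateway with a single curl call
instead of contacting the cluster nodes directly:
curl -iku guest:guest-password \
  'https://knox-host.example.com:8443/gateway/default/webhdfs/v1/tmp?op=LISTSTATUS'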
Operations
Operations
In this section you will learn a little about some of the operations tools that come with
HDP.
Ambari
Ambari
You will grow to know your way around Ambari, as this is the central place to manage
your entire Hadoop cluster. Installation, provisioning, management, and monitoring of
your Hadoop cluster are done with Ambari. It also comes with easy-to-use RESTful
APIs that allow application developers to easily integrate Ambari with their own
applications.
Ambari documentation: https://ambari.apache.org/
Cloudbreak
Cloudbreak
Cloudbreak is a tool for managing clusters in the cloud. Cloudbreak is a Hortonworks
project and is currently not a part of Apache. It automates the launch of clusters on
various cloud infrastructure platforms.
ZooKeeper
ZooKeeper
Zookeeper provides a centralized service for maintaining configuration information,
naming, providing distributed synchronization and providing group services across your
Hadoop cluster. Applications within the Hadoop cluster can use ZooKeeper to maintain
configuration information.
Zookeeper documentation: https://zookeeper.apache.org/
Oozie
Oozie
Oozie is a workflow scheduler system to manage Hadoop jobs. Oozie is integrated with
the rest of the Hadoop stack. Oozie workflow jobs are Directed Acyclic Graphs (DAGs)
of actions. At the heart of this is YARN.
Oozie documentation: http://oozie.apache.org/
Tools
Tools
In this section you will learn a little about some of the Tools that come with HDP.
Zeppelin
Zeppelin
Zeppelin is a web-based notebook designed for data scientists to easily and quickly
explore datasets through collaboration. Notebooks can contain Spark SQL, SQL, Scala,
Python, JDBC, and more. Zeppelin allows for interaction and visualization of large
datasets.
Zeppelin documentation: https://zeppelin.apache.org/
Zeppelin GUI
Zeppelin GUI
Here is a screenshot of the Zeppelin notebook showing some visualization of a particular
dataset.
Ambari Views
• Ambari web interface includes a built-in set of Views that are pre-
deployed for you to use with your cluster
• Includes views for Hive, Pig, Tez, Capacity Scheduler, File, HDFS
Ambari Views
Ambari Views provides a built-in set of views for Hive, Pig, Tez, Capacity Scheduler,
Files, and HDFS, which allow developers to monitor and manage the cluster. It also allows
developers to create new user interface components that plug in to the Ambari Web UI.
• Big SQL
• Big Replicate
• BigQuality
• BigIntegrate
• Big Match
Big Replicate
• Provides active-active data replication for Hadoop across supported
environments, distributions, and hybrid deployments
• Replicates data automatically with guaranteed consistency across
Hadoop clusters running on any distribution, cloud object storage and
local and NFS mounted file systems
• Provides SDK to extend Big Replicate replication to virtually any data
source
• Patented distributed coordination engine enables:
Guaranteed data consistency across any number of sites at any distance
Minimized RTO/RPO
• Totally non-invasive
No modification to source code
Easy to turn on/off
Big Replicate
For active-active data replication on Hadoop clusters, Big Replicate has no competition.
It replicates data automatically with guaranteed consistency across Hadoop clusters on any
distribution, cloud object storage, and local and NFS mounted file systems. Big
Replicate provides an SDK to extend it to virtually any other data source. Its patented
distributed coordination engine enables guaranteed data consistency across any number
of sites at any distance.
• You can profile, validate, cleanse, transform, and integrate your big
data on Hadoop, an open source framework that can manage large
volumes of structured and unstructured data.
• Connect
Connect to wide range of traditional
enterprise data sources as well as Hadoop
data sources
Native connectors with highest level of
performance and scalability for key
data sources
• Design & Transform
Transform and aggregate any data volume
Benefit from hundreds of built-in
transformation functions
Leverage metadata-driven productivity and
enable collaboration
• Manage & Monitor
Use a simple, web-based dashboard to
manage your runtime environment
Information Server - BigIntegrate: Ingest, transform, process and deliver any data into &
within Hadoop
IBM BigIntegrate is a big data integration solution that provides superior connectivity,
fast transformation and reliable, easy-to-use data delivery features that execute on the
data nodes of a Hadoop cluster. IBM BigIntegrate provides a flexible and scalable
platform to transform and integrate your Hadoop data.
Once you have data sources that are understood and cleansed, the data must be
transformed into a usable format for the warehouse and delivered in a timely fashion
whether in batch, real-time or SOA architectures. All warehouse projects require data
integration – how else will the many enterprise data sources make their way into the
warehouse? Hand-coding is not a scalable option.
Increase developer efficiency
• Top down design – Highly visual dev environment
• Enhanced collaboration through design asset reuse
High performance delivery with flexible deployments
• Support for multiple delivery styles: ETL, ELT, Change Data Capture, SOA
integration etc.
• High-performance, parallel engine
Rapid integration
• Pre-built connectivity
• Balanced Optimization
• Multiple user configuration options
• Job parameter available for all options
• Powerful logging and tracing
BigIntegrate is built for everything from the simplest to the most sophisticated data
transformations. Think about the simple transformations, such as transforming or
calculating total values. This is the most basic kind of transformation across data, like
you would do with a spreadsheet or calculator. Then imagine the more complex, such as
providing a lookup to an automated loan system where the loan qualification date
determines the interest rate for that time of day, based on a lookup to an ever-changing
system.
These are the types of transformations our customers are doing every day and they
require an easy to use canvas that allows you to design as you think. This is exactly
what BigIntegrate has been built to do.
Information Server - BigQuality: Analyze, cleanse and monitor your big data
IBM BigQuality provides a massively scalable engine to analyze, cleanse, and monitor
data.
Analysis discovers patterns, frequencies, and sensitive data that is critical to the
business – the content, quality, and structure of data at rest. While a robust user
interface is provided, the process can be completely automated, too.
Cleanse uses powerful out-of-the-box (and completely customizable) routines to
investigate, standardize, match, and survive free-format data. For example,
understanding that William Smith and Bill Smith are the same person, or knowing that
BLK really means Black in some contexts.
Monitor measures the content, quality, and structure of data in flight to make
operational decisions about data. For example, 'exceptions' can be sent to a full
workflow engine called the Stewardship Center where people can collaborate on the
issues.
Big Match: the Probabilistic Matching Engine (PME) algorithm running on Hadoop.
Checkpoint
1) List the components of HDP which provide data access capabilities.
2) List the components that provide the capability to move data from a
relational database into Hadoop.
3) Managing Hadoop clusters can be accomplished using which
component?
4) True or False? The following components are value-add from IBM:
Big Replicate, Big SQL, BigIntegrate, BigQuality, Big Match
5) True or False? Data Science capabilities can be achieved using only
HDP.
Checkpoint
Checkpoint solution
1) List the components of HDP which provide data access capabilities.
MapReduce, Pig, Hive, HBase, Phoenix, Spark, and more!
2) List the components that provide the capability to move data from a
relational database into Hadoop.
Sqoop, Flume, Kafka
3) Managing Hadoop clusters can be accomplished using which
component?
Ambari
4) True or False? The following components are value-add from IBM:
Big Replicate, Big SQL, BigIntegrate, BigQuality, Big Match
True
5) True or False? Data Science capabilities can be achieved using only
HDP.
False. Data Science capabilities also require Watson Studio.
Checkpoint solution
Unit summary
• Describe the functions and features of HDP
• List the IBM value-add components
• Explain what IBM Watson Studio is
• Give a brief description of the purpose of each of the value-add
components
Unit summary
Lab 1
• Exploration of the lab environment
Start the VMWare image
Launch the Ambari console
Perform some basic setup
Start services and the Hadoop processing environment
Explore the placement of files in the Linux host-system environment and
explain the file system and directory structures
Lab 1:
Exploration of the lab environment
Purpose:
In this lab, you will explore the lab environment. You will access your lab
environment and launch Apache Ambari. You will startup a variety of services
by using the Ambari GUI. You will also explore some of the directory structure
on the Linux system that you will be using.
Note:
• If the Ambari services have not been started, you need to start them now.
Navigate to the Ambari page and start all the services. This can take up to
20 to 25 minutes.
• Hostnames shown in screen captures may vary from your image's hostname.
Alternatively, if you are comfortable using an SSH tool such as PuTTY, you may
SSH into your images if your environment allows you to connect directly using
IP addresses.
2. To open a new Linux terminal, right-click the desktop, and then click
Open in Terminal.
This terminal window can be resized. You may find it useful to stretch the right
side to make it wider, as some commands are long and some responses will be
wide; any text that is longer than the default line width will be word-wrapped to
the next line.
3. Type hostname -f and take note of the hostname that is returned.
4. Type ifconfig to check for the current assigned IP address.
5. Make a note of the IP address next to inet.
Next, you need to edit the /etc/hosts file on each VM to map the hostnames
listed to the IP addresses of your two VMs. Do this by performing the following
steps on each machine:
6. If not already logged in as the root user, switch to root by typing su - and if
prompted for a password, type dalvm3.
7. To open the /etc/hosts file, type gedit /etc/hosts.
8. There will be IP addresses followed by host names. Ensure that the IP
addresses listed correctly match the host names they are paired with.
9. Update with the IP address for your servers from step 4.
10. Save and exit the file, and then close the terminal.
Task 2. Start the Hadoop components.
If all of the components have been started already, you may skip this task. Otherwise,
follow the steps in this task to start up all of the components.
You will start up all the services via the Ambari console to ensure that everything is
ready for the lab. You may stop what you don't need later, but for now, you will start
everything.
1. Launch Firefox, and then if necessary, navigate to the Ambari login page,
http://<hostname>:8080.
2. Log in to the Ambari console as admin/admin.
On the left side of the browser are the statuses of all the services. If any are
currently yellow, wait for several minutes for them to become red before you
start them up.
3. Once all the statuses are red, at the bottom of the left side, click Actions and
then click Start All to start up the services.
4. To review the files and directories in the /usr/hdp directory, type ls -l.
The first of these subdirectories (2.6.1.0-129) is the release level of the HDP
software that you are working with; the version in your lab environment may
differ if the software has been updated.
5. To display the contents of the 2.6.1.0-129 directory, type ls -l 2*.
The results appear similar to the following:
Take note of the user IDs associated with the various directories. Some have a
user name that is same as the software held in the directory; some are owned
by root. You will be looking at the standard users in the Apache Ambari unit
when you explore details of the Ambari Server and work with the management
of your cluster.
6. To view the current subdirectory which has a set of links that point back to files
and subdirectories in the 2.*.*.*-*** directory, type ls -l current.
8. List the contents of the /etc directory by executing the ls -l command. Your
results will look similar to the following:
Apache Ambari
Unit objectives
• Understand the purpose of Apache Ambari in the HDP stack
• Understand the overall architecture of Ambari, and Ambari’s relation to
other services and components of a Hadoop cluster
• List the functions of the main components of Ambari
• Explain how to start and stop services from Ambari Web Console
Unit objectives
Operations
Operations
In this section you will learn about Ambari, which is one of the operations tools that
come with HDP.
Apache Ambari
Apache Ambari
The Apache Ambari project is aimed at making Hadoop management simpler by
developing software for provisioning, managing, and monitoring Apache Hadoop
clusters. Ambari provides an intuitive, easy-to-use Hadoop management web user
interface (UI) backed by its RESTful APIs.
Ambari architecture
In addition to the agents seen in the prior slide, the Ambari Server also
contains or interacts with the following components:
• REST API integrates with the web-based front-end Ambari Web. This
REST API can also be used by custom applications
Ambari architecture
Main tabs:
– Dashboard
– Hosts
– Admin
• Repositories
• Service Accounts
• Security
– Views
– admin
• About
• Manage Ambari
• Settings
• Sign Out
Drop-down to add a service, or start or stop all services
Managing Ambari
Managing Ambari
"cluster_name" : "BI4_QSE",
"host_name" : "rvm.svl.ibm.com"
}
}
]
}
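The JSON fragment above is the tail of a response from the Ambari REST API. As a hedged
illustration (the cluster name BI4_QSE, the host rvm.svl.ibm.com, the default port 8080,
and the admin/admin credentials are taken from this lab environment; adjust them for
your own cluster), similar information can be requested, and services controlled, with
curl:
# list the hosts that belong to the cluster
curl -u admin:admin -H "X-Requested-By: ambari" \
  "http://rvm.svl.ibm.com:8080/api/v1/clusters/BI4_QSE/hosts"
# stop the HDFS service (set its state back to INSTALLED); use "STARTED" to start it again
curl -u admin:admin -H "X-Requested-By: ambari" -X PUT \
  -d '{"RequestInfo":{"context":"Stop HDFS via REST"},"Body":{"ServiceInfo":{"state":"INSTALLED"}}}' \
  "http://rvm.svl.ibm.com:8080/api/v1/clusters/BI4_QSE/services/HDFS"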
Starting / Restarting the Ambari Server
[root@rvm ~]# ambari-server restart
Using python /usr/bin/python2.6
Restarting ambari-server
Using python /usr/bin/python2.6
Stopping ambari-server
Ambari Server stopped
Using python /usr/bin/python2.6
Starting ambari-server
Ambari Server running with 'root' privileges.
Organizing resource files at /var/lib/ambari-server/resources...
Server PID at: /var/run/ambari-server/ambari-server.pid
Server out at: /var/log/ambari-server/ambari-server.out
Server log at: /var/log/ambari-server/ambari-server.log
Waiting for server start....................
Ambari Server 'start' completed successfully.
[root@rvm ~]#
Note also that Ambari has a Python Shell,
https://cwiki.apache.org/confluence/display/AMBARI/Ambari+python+Shell
Ambari terminology
• Service
• Component
• Host/Node
• Node Component
• Operation
• Task
• Stage
• Action
• Stage Plan
• Manifest
• Role
Ambari terminology
The following definitions can be found at http://ambari.apache.org.
Service: Service refers to services in the Hadoop stack. HDFS, HBase, and Pig are
examples of services. A service may have multiple components (for example, HDFS
has NameNode, Secondary NameNode, DataNode, etc.). A service can just be a client
library (for example, Pig does not have any daemon services, but just has a client
library).
Component: A service consists of one or more components. For example, HDFS has 3
components: NameNode, DataNode and Secondary NameNode. Components may be
optional. A component may span multiple nodes (for example, DataNode instances on
multiple nodes).
Node/Host: Node refers to a machine in the cluster. Node and host are used
interchangeably in this document.
Node-Component: Node-component refers to an instance of a component on a
particular node. For example, a particular DataNode instance on a particular node is a
node-component.
Checkpoint
1. True or False? Ambari is backed by RESTful APIs for developers to
easily integrate with their own applications.
2. Which Hadoop functionalities does Ambari provide?
3. Which page from the Ambari UI allows you to check the versions of
the software installed on your cluster?
4. True or False? Creating users through the Ambari UI will also create
the user on the HDFS.
5. True or False? You can use the CURL commands to issue
commands to Ambari.
Checkpoint
Checkpoint Solution
1. True or False? Ambari is backed by RESTful APIs for developers to
easily integrate with their own applications.
True
2. Which Hadoop functionalities does Ambari provide?
Provision, manage, monitor and integrate
3. Which page from the Ambari UI allows you to check the versions of
the software installed on your cluster?
The Admin > Manage Ambari page
4. True or False? Creating users through the Ambari UI will also create
the user on the HDFS.
False.
5. True or False? You can use the CURL commands to issue
commands to Ambari.
True.
Checkpoint Solution
Lab 1
• Managing Hadoop clusters with Apache Ambari
Start the Apache Ambari web console and perform basic start/stop services
Explore other aspects of the Ambari web server
Lab 1:
Managing Hadoop clusters with Apache Ambari
Purpose:
In this lab you will explore the Apache Ambari web console and perform basic
starting and stopping of services, giving you experience in using Apache
Ambari to manage your Hadoop cluster.
Note:
• If the Ambari services have not been started, you need to start them now.
Navigate to the Ambari page and start all the services. This can take up to
20 to 25 minutes.
• Hostnames shown in screen captures may vary from your image's hostname.
Operations which are in progress are shown in blue, complete and successful in
green, and failed in red.
If you stopped the services in step 3, from the Actions menu, click Start All,
and then click OK when prompted for confirmation.
5. To see what services can be added, from the Actions menu, click Add Service.
The services shown will include some or all of the following; your listing will
probably differ if you have a later version of some or all of these components.
Service Version Description
HDFS 2.7.3: Apache Hadoop Distributed File System
YARN + MapReduce2 2.7.3: Apache Hadoop NextGen MapReduce (YARN)
Tez 0.7.0: Tez is the next generation Hadoop Query Processing framework written on top of YARN.
Hive 1.2.1000: Data warehouse system for ad-hoc queries & analysis of large datasets and table & storage management service
HBase 1.1.2: A non-relational distributed database, plus Phoenix, a high performance SQL layer for low latency applications.
Pig 0.16.0: Scripting platform for analyzing large datasets
Sqoop 1.4.6: Tool for transferring bulk data between Apache Hadoop and structured data stores such as relational databases
Oozie 4.2.0: System for workflow coordination and execution of Apache Hadoop jobs. This also includes the installation of the optional Oozie Web Console which relies on and will install the ExtJS Library.
ZooKeeper 3.4.6: Centralized service which provides highly reliable distributed coordination
Falcon 0.10.0: Data management and processing platform
Storm 1.1.0: Apache Hadoop Stream processing framework
Flume 1.5.2: A distributed service for collecting, aggregating, and moving large amounts of streaming data into HDFS
Accumulo 1.7.0: Robust, scalable, high performance distributed key/value store.
Ambari Infra 0.1.0: Core shared service used by Ambari managed components.
Ambari Metrics 0.1.0: A system for metrics collection that provides storage and retrieval capability for metrics collected from the cluster
Services that are installed and available are shown with check-marks. Services
without a check-mark (such as Ranger in this instance) are not yet installed or
available:
9. Explore a variety of other services on the left panel of the main web page.
The Summary tab gives you some overview information about the Hive service.
2. Next, click the Configs tab and explore some of the Hive configuration settings.
3. Now, navigate and explore other aspects of the Ambari web server interface
using the earlier content pages and slides in this unit to guide you. Spend some
time familiarizing yourself with the look, feel, and features of Ambari.
Results:
In this lab you explored the Apache Ambari web console and performed basic
starting and stopping of services, giving you experience in using Apache
Ambari to manage your Hadoop cluster.
Unit summary
• Understand the purpose of Apache Ambari in the HDP stack
• Understand the overall architecture of Ambari, and Ambari’s relation to
other services and components of a Hadoop cluster
• List the functions of the main components of Ambari
• Explain how to start and stop services from Ambari Web Console
Unit summary
Unit objectives
• Understand the basic need for a big data strategy in terms of parallel
reading of large data files and internode network speed in a cluster
• Describe the nature of the Hadoop Distributed File System (HDFS)
• Explain the function of the NameNode and DataNodes in a Hadoop
cluster
• Explain how files are stored and blocks ("splits") are replicated
Unit objectives
Agenda
• Why Hadoop?
Comparison with RDBMS
• Hadoop architecture
MapReduce
HDFS
Hadoop Common
• The origin and history of HDFS
How files are stored; replication of
blocks
Name Node
Secondary Name Node
Fault tolerance
Agenda
Here's an overview of what we will cover in this unit.
This unit will explain the underlying technologies that are important to solving the big
data challenge and tell you about BigInsights. First, you will get an overview, and then
you will dive more deeply into BigInsights itself, and explore how it fits into a broader IT
environment that could involve data warehouses, relational DBMSs, streaming engines,
etc.
What is Hadoop?
• Apache open source software framework for reliable, scalable,
distributed computing over massive amounts of data:
hides underlying system details and complexities from user
developed in Java
• Consists of 4 sub projects:
MapReduce
Hadoop Distributed File System (HDFS)
YARN
Hadoop Common
• Supported by many Apache/Hadoop-related projects:
HBase, Zookeeper, Avro, etc.
• Meant for heterogeneous commodity hardware.
What is Hadoop?
Hadoop is an open source project of the Apache Foundation.
It is a framework written in Java originally developed by Doug Cutting who named it
after his son's toy elephant.
Hadoop uses Google's MapReduce and Google File System (GFS) technologies as its
foundation.
It is optimized to handle massive amounts of data which could be structured,
unstructured or semi-structured, and uses commodity hardware (relatively inexpensive
computers).
This massive parallel processing is done with great performance. However, it is a batch
operation handling massive amounts of data, so the response time is not immediate. As
of Hadoop version 0.20.2, updates are not possible, but appends were made possible
starting in version 0.21.
What is the value of a system if the information it stores or retrieves is not consistent?
Hadoop replicates its data across different computers, so that if one goes down, the
data is processed on one of the replicated computers.
You may be familiar with OLTP (Online Transaction Processing) workloads where
data is randomly accessed on structured data like a relational database, such as when
you access your bank account.
You may also be familiar with OLAP (Online Analytical Processing) or DSS (Decision
Support Systems) workloads where data is sequentially accessed on structured data, like
a relational database, to generate reports that provide business intelligence.
You may not be that familiar with the concept of big data. Big data is a term used to
describe large collections of data (also known as datasets) that may be unstructured
and grow so large and so quickly that they are difficult to manage with regular database
or statistics tools.
Hadoop is not used for OLTP nor OLAP, but is used for big data, and it complements
these two to manage data. Hadoop is not a replacement for a RDBMS.
Timeline:
• 2003: The open source Nutch project (led by Doug Cutting and Mike Cafarella) is
working to handle billions of searches and index millions of web pages.
• Oct 2003: Google releases papers with GFS (Google File System)
• Dec 2004: Google releases papers with MapReduce
• 2005: Nutch used GFS and MapReduce to perform operations
• 2006: Yahoo! created Hadoop based on GFS and MapReduce (with Doug
Cutting and team)
• 2007: Yahoo started using Hadoop on a 1000 node cluster
• Jan 2008: Apache took over Hadoop
• Jul 2008: Tested a 4000 node cluster with Hadoop successfully
• 2009: Hadoop successfully sorted a petabyte of data in less than 17 hours.
• Dec 2011: Hadoop releases version 1.0
For later releases, and the release numbering structure, refer to:
• http://wiki.apache.org/hadoop/Roadmap
HDFS architecture
• Master/Slave architecture
• Master: NameNode
manages the file system namespace and metadata
― FsImage
― Edits Log
regulates client access to files
• Slave: DataNode
many per cluster
manages storage attached to the nodes
periodically reports status to the NameNode
(The slide diagram shows a file, File1, divided into blocks a, b, c, and d, with each
block replicated across several DataNodes.)
HDFS architecture
Important points still to be discussed:
• Secondary NameNode
• Import checkpoint
• Rebalancing
• SafeMode
• Recovery Mode
The entire file system namespace, including the mapping of blocks to files and file
system properties, is stored in a file called the FsImage. The FsImage is stored as a file
in the NameNode's local file system. It contains the metadata on disk (not exact copy of
what is in RAM, but a checkpoint copy).
The NameNode uses a transaction log called the EditLog (or Edits Log) to persistently
record every change that occurs to file system metadata, synchronizes with metadata in
RAM after each write.
The NameNode can be a potential single point of failure (this has been resolved in later
releases of HDFS with Secondary NameNode, various forms of high availability, and in
Hadoop v2 with NameNode federation and high availability as out-of-the-box options).
• Use better quality hardware for all management nodes, and in particular do not
use inexpensive commodity hardware for the NameNode.
• Mitigate by backing up to other storage.
In case of a power failure on the NameNode, recovery is performed using the FsImage and
the EditLog.
HDFS blocks
• HDFS is designed to support very large files
• Each file is split into blocks: Hadoop default is 128 MB
• Blocks reside on different physical DataNodes
• Behind the scenes, each HDFS block is supported by multiple
operating system blocks
(The slide diagram shows one HDFS block, here 64 MB, backed by many smaller
operating system blocks.)
• If a file or a chunk of the file is smaller than the block size, only the
needed space is used. For example, with a 64 MB block size, a 210 MB file is split as:
64 MB + 64 MB + 64 MB + 18 MB
HDFS blocks
Data in a Hadoop cluster is broken down into smaller pieces (called blocks) and
distributed throughout the cluster. In this way, the map and reduce functions can be
executed on smaller subsets of your larger data sets, and this provides the scalability that
is needed for big data processing.
• https://hortonworks.com/apache/hdfs/
In earlier versions of Hadoop/HDFS, the default blocksize was often quoted as 64 MB,
but the current default setting for Hadoop/HDFS is noted in
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-
default.xml
• dfs.blocksize = 134217728
Default block size for new files, in bytes. You can use the following suffix (case
insensitive): k(kilo), m(mega), g(giga), t(tera), p(peta), e(exa) to specify the size
(such as 128k, 512m, 1g, etc.), or you can provide the complete size in bytes
(such as 134217728 for 128 MB).
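A quick way to confirm the effective value on your own cluster (a minimal sketch; run on
a node that has the Hadoop client configuration in place):
hdfs getconf -confKey dfs.blocksize
134217728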
It should be noted that Linux itself has both a logical block size (typically 4 KB) and a
physical or hardware block size (typically 512 bytes).
Linux filesystems:
• For ext2 or ext3, the situation is relatively simple: each file occupies a certain
number of blocks. All blocks on a given filesystem have the same size, usually
one of 1024, 2048 or 4096 bytes.
• What is the physical blocksize?
[user@vm Desktop]$ lsblk -o NAME,PHY-SEC
NAME PHY-SEC
sr0 512
sda 512
├─sda1 512
├─sda2 512
└─sda3 512
• What is the logical blocksize for these file systems (such as /dev/sda1)? This
has to be run as root.
[root@vm ~]# blockdev --getbsz /dev/sda1
1024
For the common case, when the replication factor is three, HDFS's placement policy is
to put one replica on one node in the local rack, another on a different node in the local
rack, and the last on a different node in a different rack. This policy cuts the inter-rack
write traffic which generally improves write performance. The chance of rack failure is
far less than that of node failure; this policy does not impact data reliability and
availability guarantees. However, it does reduce the aggregate network bandwidth used
when reading data since a block is placed in only two unique racks rather than three.
With this policy, the replicas of a file do not evenly distribute across the racks. One third
of replicas are on one node, two thirds of replicas are on one rack, and the other third
are evenly distributed across the remaining racks. This policy improves write
performance without compromising data reliability or read performance.
#!/bin/bash
# Sample rack topology (rack awareness) script.
# Maps each host name or IP address passed in by Hadoop to a rack, using a
# lookup file (rack_topology.data) of "<host> <rack-number>" pairs kept in
# the Hadoop configuration directory.
RACK_PREFIX=default
if [ $# -gt 0 ]; then
  CTL_FILE=${CTL_FILE:-"rack_topology.data"}
  HADOOP_CONF=${HADOOP_CONF:-"/etc/hadoop/conf"}
  # If no lookup file exists, report the default rack for every host
  if [ ! -f ${HADOOP_CONF}/${CTL_FILE} ]; then
    echo -n "/$RACK_PREFIX/rack "
    exit 0
  fi
  while [ $# -gt 0 ] ; do
    nodeArg=$1
    exec< ${HADOOP_CONF}/${CTL_FILE}
    result=""
    while read line ; do
      ar=( $line )
      if [ "${ar[0]}" = "$nodeArg" ] ; then
        result="${ar[1]}"
      fi
    done
    shift
    if [ -z "$result" ] ; then
      echo -n "/$RACK_PREFIX/rack "
    else
      echo -n "/$RACK_PREFIX/rack_$result "
    fi
  done
else
  echo -n "/$RACK_PREFIX/rack "
fi
Compression of files
• File compression brings two benefits:
reduces the space needed to store files
speeds up data transfer across the network or to/from disk
• But is the data splittable? (necessary for parallel reading)
• Use codecs, such as org.apache.hadoop.io.compress.SnappyCodec
Compression Format  Algorithm  Filename extension  Splittable?
DEFLATE  DEFLATE  .deflate  No
Compression of files
gzip
gzip is naturally supported by Hadoop. gzip is based on the DEFLATE algorithm,
which is a combination of LZ77 and Huffman Coding.
bzip2
bzip2 is a freely available, patent free, high-quality data compressor. It typically
compresses files to within 10% to 15% of the best available techniques (the PPM
family of statistical compressors), whilst being around twice as fast at
compression and six times faster at decompression.
LZO
The LZO compression format is composed of many smaller (~256K) blocks of
compressed data, allowing jobs to be split along block boundaries. Moreover, it
was designed with speed in mind: it decompresses about twice as fast as gzip,
meaning it is fast enough to keep up with hard drive read speeds. It does not
compress quite as well as gzip; expect files that are on the order of 50% larger
than their gzipped version. But that is still 20-50% of the size of the files without
any compression at all, which means that IO-bound jobs complete the map phase
about four times faster.
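As a hedged sketch of turning compression on for job output (the examples jar path is
typical for an HDP install, and the input and output paths are illustrative), the codec
can be selected with standard job properties:
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount \
  -D mapreduce.output.fileoutputformat.compress=true \
  -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
  /user/biadmin/Gutenberg /user/biadmin/wordcount_compressed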
NameNode startup
1. NameNode reads fsimage in memory
2. NameNode applies editlog changes
3. NameNode waits for block data from data nodes
NameNode does not store the physical-location information of the blocks
NameNode exits SafeMode when 99.9% of blocks have at least one copy
accounted for
(The slide diagram shows step 1, the fsimage being read into the NameNode, and step 3,
datanode1 sending its block information, the list of blocks held in its datadir, to the
NameNode.)
NameNode startup
During start up, the NameNode loads the file system state from the fsimage and the
edits log file. It then waits for DataNodes to report their blocks so that it does not
prematurely start replicating the blocks though enough replicas already exist in the
cluster.
During this time NameNode stays in SafeMode. SafeMode for the NameNode is
essentially a read-only mode for the HDFS cluster, where it does not allow any
modifications to file system or blocks. Normally the NameNode leaves SafeMode
automatically after the DataNodes have reported that most file system blocks are
available.
If required, HDFS can be placed in SafeMode explicitly using the command hadoop
dfsadmin -safemode. The NameNode front page shows whether SafeMode is on or
off.
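A brief sketch of checking or toggling SafeMode from the command line (run as the hdfs
user on a cluster node):
hdfs dfsadmin -safemode get      # report whether SafeMode is ON or OFF
hdfs dfsadmin -safemode enter    # put HDFS into SafeMode explicitly
hdfs dfsadmin -safemode leave    # return to normal read-write operation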
(The slide diagram shows the write sequence: step 2, the edits log is changed in memory
and on disk on the NameNode, which keeps its namedir with the edits log and fsimage;
step 4, the first DataNode daisy-chain-writes the block to the second DataNode, and the
second to the third, with an acknowledgement back to the previous node; step 5, the
first DataNode then confirms to the NameNode that replication is complete. Each
DataNode stores the blocks in its datadir.)
The client then tells the NameNode that the file is closed. At this point, the NameNode
commits the file creation operation into a persistent store. If the NameNode dies before
the file is closed, the file is lost.
The aforementioned approach was adopted after careful consideration of target
applications that run on HDFS. These applications need streaming writes to files. If a
client writes to a remote file directly without any client side buffering, the network speed
and the congestion in the network impacts throughput considerably. This approach is
not without precedent. Earlier distributed file systems, such as AFS, have used client
side caching to improve performance. A POSIX requirement has been relaxed to
achieve higher performance of data uploads.
Replication Pipelining
When a client is writing data to an HDFS file, its data is first written to a local file as
explained previously. Suppose the HDFS file has a replication factor of three. When the
local file accumulates a full block of user data, the client retrieves a list of DataNodes
from the NameNode. This list contains the DataNodes that will host a replica of that
block. The client then flushes the data block to the first DataNode. The first DataNode
starts receiving the data in small portions (4 KB), writes each portion to its local
repository and transfers that portion to the second DataNode in the list. The second
DataNode, in turn starts receiving each portion of the data block, writes that portion to
its repository and then flushes that portion to the third DataNode. Finally, the third
DataNode writes the data to its local repository. Thus, a DataNode can be receiving
data from the previous one in the pipeline and at the same time forwarding data to the
next one in the pipeline. The data is pipelined from one DataNode to the next.
For good descriptions of the process, see the tutorials at:
• HDFS Users Guide at http://hadoop.apache.org/docs/current/hadoop-project-
dist/hadoop-hdfs/HdfsUserGuide.html
• An introduction to the Hadoop Distributed File System at
https://hortonworks.com/apache/hdfs/
• How HDFS works at https://hortonworks.com/apache/hdfs/#section_2
(Diagram: an Active NameNode paired with a Standby NameNode.)
Secondary NameNode
• During operation primary NameNode cannot merge fsImage and edits log
• This is done on the secondary NameNode
Every couple of minutes, the secondary NameNode copies the new edit log from the primary NN
Merges edits log into fsimage
Copies the new merged fsImage back to primary NameNode
• Not HA but faster startup time
Secondary NN does not have complete image. In-flight transactions would be lost
Primary NameNode needs to merge less during startup
• Was temporarily deprecated because of NameNode HA but has some advantages
(No need for Quorum nodes, less network traffic, fewer moving parts)
Secondary NameNode
In the older approach, a Secondary NameNode is used.
The NameNode stores the HDFS filesystem information in a file named fsimage.
Updates to the file system (add/remove blocks) are not updating the fsimage file, but
instead are logged into a file, so the I/O is fast append only streaming as opposed to
random file writes. When restarting, the NameNode reads the fsimage and then applies
all the changes from the log file to bring the filesystem state up to date in memory. This
process takes time.
The job of the Secondary NameNode is not to be a secondary to the NameNode, but
only to periodically read the filesystem change log and apply the changes to the fsimage
file, thus bringing it up to date. This allows the NameNode to start up faster next time.
Unfortunately, the Secondary NameNode service is not a standby secondary
namenode, despite its name. Specifically, it does not offer HA for the NameNode. This
is well illustrated in the slide above.
Note that more recent distributions have NameNode High Availability using NFS
(shared storage) and/or NameNode High Availability using a Quorum Journal Manager
(QJM).
hadoop fs <args>
Example: listing the current directory in HDFS
hadoop fs -ls .
Just as for the ls command, the file system shell commands can take paths as
arguments. These paths can be expressed in the form of uniform resource identifiers or
URIs. The URI format consists of a scheme, an authority, and path. There are multiple
schemes supported. The local file system has a scheme of "file". HDFS has a scheme
called "hdfs."
For example, if you want to copy a file called "myfile.txt" from your local filesystem to an
HDFS file system on the localhost, you can do this by issuing the command shown.
The copyFromLocal command takes a URI for the source and a URI for the destination.
"Authority" is the hostname of the NameNode. For example, if the NameNode is in
localhost and accessed on port 9000, the authority would be localhost:9000.
The scheme and the authority do not always need to be specified. Instead you may rely
on their default values. These defaults can be overridden by specifying them in a file
named core-site.xml in the conf directory of your Hadoop installation.
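A minimal sketch of such a copy, assuming the NameNode runs on localhost and listens on
port 9000 as in the example above (the file name and target directory are illustrative):
hadoop fs -copyFromLocal myfile.txt hdfs://localhost:9000/user/biadmin/myfile.txt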
HDFS supports many POSIX-like commands. HDFS is not a fully POSIX (Portable
operating system interface for UNIX) compliant file system, but it supports many of the
commands. The HDFS commands are mostly easily-recognized UNIX commands like
cat and chmod. There are also a few commands that are specific to HDFS such as
copyFromLocal.
Note that:
• localsrc and dst are placeholders for your actual file(s)
• localsrc can be a directory or a list of files separated by space(s)
• dst can be a new file name (in HDFS) for a single-file-copy, or a directory (in
HDFS), that is the destination directory
Example:
hadoop fs -put *.txt ./Gutenberg
…copies all the text files in the local Linux directory with the suffix of .txt to the
directory Gutenberg in the users home directory in HDFS
The "direction" implied by the names of these commands (copyFromLocal, put) is
relative to the user, who can be thought to be situated outside HDFS.
Also, you should note there is no cd (change directory) command available for hadoop.
or
hadoop fs -get [-ignorecrc] [-crc] <src> <localdst>
The copyToLocal (aka get) command copies files out of the file system you specify
and into the local file system.
get
Usage: hadoop fs -get [-ignorecrc] [-crc] <src> <localdst>
• Copy files to the local file system. Files that fail the CRC check may be copied
with the -ignorecrc option. Files and CRCs may be copied using the -crc option.
• Example: hadoop fs -get hdfs:/mydir/file file:///home/hdpadmin/localfile
Another important note: for files in Linux, where you would use the file:// authority, two
slashes represent files relative to your current Linux directory (pwd). To reference files
absolutely, use three slashes (and mentally pronounce as "slash-slash pause slash").
Unit summary
• Understand the basic need for a big data strategy in terms of parallel
reading of large data files and internode network speed in a cluster
• Describe the nature of the Hadoop Distributed File System (HDFS)
• Explain the function of the NameNode and DataNodes in a Hadoop
cluster
• Explain how files are stored and blocks ("splits") are replicated
Unit summary
Checkpoint
1. True or False? Hadoop systems are designed for transaction
processing.
2. List the Hadoop open source projects.
3. What is the default number of replicas in a Hadoop system?
4. True or False? One of the driving principles of Hadoop is that the data is
brought to the program.
5. True or False? At least 2 NameNodes are required for a standalone
Hadoop cluster.
Checkpoint
Checkpoint solution
1. True or False? Hadoop systems are designed for transaction
processing.
Hadoop systems are not designed for transaction processing and would perform very
poorly at it. Hadoop systems are designed for batch processing.
2. List the Hadoop open source projects.
To name a few, MapReduce, Yarn, Ambari, Hive, HBase, etc.
3. What is the default number of replicas in a Hadoop system?
3
4. True or False? One of the driving principles of Hadoop is that the data
is brought to the program.
False. The program is brought to the data, to eliminate the need to move large
amounts of data.
5. True or False? At least 2 NameNodes are required for a standalone
Hadoop cluster.
Only 1 NameNode is required per cluster; 2 are required for high-availability.
Checkpoint solution
Lab 1
File access and basic commands with HDFS
Lab 1:
File access and basic commands with HDFS
Purpose:
This lab is intended to provide you with experience in using the Hadoop
Distributed File System (HDFS). The basic HDFS file system commands
learned here will be used throughout the remainder of the course.
You will also be moving some data into HDFS that will be used in later units of
this course. The files that you will need are stored in the Linux directory
/home/labfiles.
VM Hostname: http://datascientist.ibm.com
User/Password: student / student
root/dalvm3
Task 1. Learn some of the basic HDFS file system commands.
The major reference for the HDFS File System Commands can be found on the
Apache Hadoop website. Additional commands can be found there. The URL
for the most current Hadoop version is:
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-
hdfs/HDFSCommands.html
4. To verify that, as the user biadmin, you have a file system directory in HDFS,
type hadoop fs -ls .
Note: At the end of this command there is a period (it means the current
directory in HDFS).
Your results appear as shown below:
[biadmin@ibmclass ~]$ hadoop fs -ls .
[biadmin@ibmclass ~]$
In Linux, there is no response to the command; this indicates success. If you get
this response, go to step 8.
If, however, you receive an error response such as:
[biadmin@ibmclass ~]$ hadoop fs -ls .
ls: '.': No such file or directory
[biadmin@ibmclass ~]$
then you do not have a home directory in HDFS. You will need to have one
created for you. The ability to create a home directory for you in HDFS is not
something that you can do; it must be done as the hdfs user. You will do this in
steps 5 through 7.
5. To login as the hdfs user, use sudo su - hdfs
[biadmin@ibmclass ~]$ sudo su - hdfs
[sudo] password for biadmin:
[hdfs@ibmclass ~]$
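As the hdfs superuser, creating an HDFS home directory for the biadmin user typically
looks like the following (a sketch; the exact steps in your lab instructions may differ):
hadoop fs -mkdir /user/biadmin
hadoop fs -chown biadmin:biadmin /user/biadmin
exit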
9. To validate that you now have a home directory in HDFS, run the following
commands:
hadoop fs -ls
hadoop fs -ls .
hadoop fs -ls /user
You are now ready to do some exploration of the HDFS file system and your
Hadoop ecosystem.
10. To create a subdirectory called Gutenberg in your HDFS user directory, type
hadoop fs -mkdir Gutenberg.
Note that the directory you just created, defaulted to your home directory. This is
how Linux handles this command. If you wanted the directory elsewhere, then
you need to fully qualify it.
11. Type hadoop fs -ls to list your directory.
12. Use the recursive option (-R) of ls to see if there are any files in your Gutenberg
directory (there are none at the moment): hadoop fs -ls -R
[biadmin@ibmclass ~]$ hadoop fs -ls
[biadmin@ibmclass ~]$ hadoop fs -mkdir Gutenberg
[biadmin@ibmclass ~]$ hadoop fs -ls
Found 1 items
drwxr-xr-x - biadmin biadmin 0 2015-06-04 10:41 Gutenberg
[biadmin@ibmclass ~]$ hadoop fs -ls -R
drwxr-xr-x - biadmin biadmin 0 2015-06-04 10:41 Gutenberg
[biadmin@ibmclass ~]$
13. Use the hadoop fs -put command to move files from the Linux /home/labfiles
directory into your HDFS directory, Gutenberg:
hadoop fs -put /home/biadmin/labfiles/*.txt Gutenberg
This command can also be executed as:
hadoop fs -copyFromLocal /home/biadmin/labfiles/*.txt Gutenberg
14. Repeat the list command with the recursive option (step 12).
[biadmin@ibmclass ~]$ hadoop fs -put /home/labfiles/*.txt Gutenberg
[biadmin@ibmclass ~]$ hadoop fs -ls .
Found 1 items
drwxr-xr-x - biadmin biadmin 0 2015-06-04 11:01 Gutenberg
[biadmin@ibmclass ~]$ hadoop fs -ls -R
drwxr-xr-x - biadmin biadmin 0 2015-06-04 11:01 Gutenberg
-rw-r--r-- 3 biadmin biadmin 421504 2015-06-04 11:01 Gutenberg/Frankenstein.txt
-rw-r--r-- 3 biadmin biadmin 697802 2015-06-04 11:01
Gutenberg/Pride_and_Prejudice.txt
-rw-r--r-- 3 biadmin biadmin 757223 2015-06-04 11:01
Gutenberg/Tale_of_Two_Cities.txt
-rw-r--r-- 3 biadmin biadmin 281398 2015-06-04 11:01 Gutenberg/The_Prince.txt
[biadmin@ibmclass ~]$
Note here in the listing of files the following for the last file:
-rw-r--r-- 3 biadmin biadmin 281398
Here you will see the read-write (rw) permissions that you would find with a
typical Linux file.
The "3" here is the typical replication factor for the blocks (or "splits") of the
individual files. These files are too small (last file is 281KB) to have more than
one block (the max block size is 128MB), but each block of each file is
replicated three times.
You may see "1" instead of "3" in a single-node cluster (pseudo-distributed
mode). That too is normal for a single-node cluster as it does not really make
sense to replicate multiple copies on a single node.
COMMAND_OPTION Description
path Start checking from this path.
Unit objectives
• Describe the MapReduce model v1
• List the limitations of Hadoop 1 and MapReduce 1
• Review the Java code required to handle the Mapper class, the
Reducer class, and the program driver needed to access MapReduce
• Describe the YARN model
• Compare Hadoop 2/YARN with Hadoop 1
Unit objectives
Topic:
Introduction to MapReduce
processing based on MR1
Agenda
• MapReduce Introduction
• MapReduce Tasks
• WordCount Example
• Splits
• Execution
Agenda
(The slide diagram shows a logical file divided into blocks 1 through 4, with the blocks
distributed and replicated across the nodes of a cluster.)
MapReduce v1 explained
• Hadoop computational model
Data stored in a distributed file system spanning many inexpensive
computers
Bring function to the data
Distribute application to the compute resources where the data is stored
• Scalable to thousands of nodes and petabytes of data
public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  // emit (word, 1) for every token in the input value
  public void map(Object key, Text val, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(val.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
public static class IntSumReducer ...
(The slide diagram shows a MapReduce application running against the Hadoop data nodes:
1. Map phase (break job into small parts), 2. Shuffle, 3. Reduce phase (boil all output
down to a single result set).)
MapReduce v1 explained
There are two aspects of Hadoop that are important to understand:
• MapReduce is a software framework introduced by Google to support
distributed computing on large data sets on clusters of computers.
• The Hadoop Distributed File System (HDFS) is where Hadoop stores its data.
This file system spans all the nodes in a cluster. Effectively, HDFS links together
the data that resides on many local nodes, making the data part of one big file
system. Furthermore, HDFS assumes nodes will fail, so it replicates a given
chunk of data across multiple nodes to achieve reliability. The degree of
replication can be customized by the Hadoop administrator or programmer.
However, the default is to replicate every chunk of data across 3 nodes: 2 on the
same rack, and 1 on a different rack.
The key to understanding Hadoop lies in the MapReduce programming model. This is
essentially a representation of the divide and conquer processing model, where your
input is split into many small pieces (the map step), and the Hadoop nodes process
these pieces in parallel. Once these pieces are processed, the results are distilled (in
the reduce step) down to a single answer.
MapReduce v1 engine
• Master/Slave architecture
Single master (JobTracker) controls job execution on multiple slaves
(TaskTrackers).
• JobTracker
Accepts MapReduce jobs submitted by clients
Pushes map and reduce tasks out to TaskTracker nodes
Keeps the work as physically close to data as possible
Monitors tasks and TaskTracker status
• TaskTracker
Runs map and reduce tasks
Reports status to JobTracker
Manages storage and transmission of intermediate output
[Figure: a cluster with the JobTracker running on the master node and TaskTrackers
running on the worker nodes.]
MapReduce v1 engine
If one TaskTracker is very slow, it can delay the entire MapReduce job, especially
towards the end of a job, where everything can end up waiting for the slowest task. With
speculative-execution enabled, however, a single task can be executed on multiple
slave nodes.
For job scheduling, Hadoop by default uses FIFO (First In, First Out), with five optional
scheduling priorities, to schedule jobs from a work queue. Other scheduling algorithms
are available as add-ins: the Fair Scheduler, the Capacity Scheduler, and so on.
Agenda
• MapReduce introduction
• MapReduce phases
Map
Shuffle
Reduce
Combiner
• WordCount example
• Splits
• Execution
Agenda
You will review the MapReduce phases: Map, Shuffle, and Reduce. In addition, there is
an optional Combiner phase.
MapReduce 1 overview
Results can be
written to HDFS or a
database
MapReduce 1 overview
This slide provides an overview of the MR1 process.
File blocks (stored on different DataNodes) in HDFS are read and processed by the
Mappers.
The output of the Mapper processes is shuffled (sent) to the Reducers (one output file
from each Mapper to each Reducer); these intermediate files are not replicated and are
stored local to the Mapper node.
The Reducers produce the output, and that output is stored in HDFS, with one file for
each Reducer.
[Figure: the MR1 dataflow - blocks 1-4 of a logical input file are read by map tasks,
sorted locally, shuffled (copied and merged) to the reduce tasks, and each reduce task
writes its output to the DFS as part of a logical output file.]
[Figure: Reduce Phase - the same dataflow with the sort, merge, and reduce steps
highlighted.]
[Figure: Combiner - the same dataflow with a combine step running in each map task
("map & combine") before the shuffle.]
Agenda
• MapReduce introduction
• MapReduce tasks
• WordCount example
• Splits
• Execution
Agenda
You will review WordCount under MR v1 and how the running job differs under YARN
(including the similarities).
Later, you will review the Java code for the Java WordCount program to foster a brief
discussion of the major parts of the program from a code perspective (thus illustrating
Map, Combiner, Reduce).
WordCount example
• In the example of a list of animal names
MapReduce can automatically split files on line breaks
The file has been split into two blocks on two nodes
• To count how often each big cat is mentioned, in SQL you would use:
Node 1: Tiger, Lion, Lion, Panther, Wolf, ...
Node 2: Tiger, Tiger, Wolf, Panther, ...
WordCount example
In a file with two blocks (or "splits") of data, animal names are listed. There is one
animal name per line in the files.
Rather than count the number of mentions of every animal, you are interested only in
members of the cat family.
Since the blocks are held on different nodes, software running on the individual nodes
processes the blocks separately.
If you were using SQL (which is not actually used here), the query would be the one
stated on the slide.
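The query text itself did not carry over from the slide; a query along the following lines
would express the same count (the table name animals and column name name are assumed
purely for illustration):

SELECT name, COUNT(*) AS mentions
FROM animals
WHERE name IN ('Tiger', 'Lion', 'Panther')
GROUP BY name;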
Map task
• There are two requirements for the Map task:
filter out the non big-cat rows
prepare count by transforming to <Text(name), Integer(1)>
Node 1:
  Tiger    ->  <Tiger, 1>
  Lion     ->  <Lion, 1>
  Lion     ->  <Lion, 1>
  Panther  ->  <Panther, 1>
  Wolf     ->  (filtered out)
  ...
Node 2:
  Tiger    ->  <Tiger, 1>
  Tiger    ->  <Tiger, 1>
  Wolf     ->  (filtered out)
  Panther  ->  <Panther, 1>
  ...
The Map tasks are executed locally on each split.
Map task
Reviewing the description of MapReduce from Wikipedia
(http://en.wikipedia.org/wiki/MapReduce):
MapReduce is a framework for processing huge datasets on certain kinds of
distributable problems using a large number of computers (nodes), collectively
referred to as a cluster (if all nodes use the same hardware) or as a grid (if the
nodes use different hardware). Computational processing can occur on data stored
either in a file system (unstructured) or within a database (structured).
"Map" step: The master node takes the input, breaks it up into smaller sub-problems,
and distributes those to worker nodes. A worker node may do this again in turn, leading
to a multi-level tree structure. The worker node processes that smaller problem, and
passes the answer back to its master node.
"Reduce" step: The master node then takes the answers to all the sub-problems and
combines them in some way to get the output, the answer to the problem it was
originally trying to solve.
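A minimal Java sketch of a Mapper meeting the two requirements listed on the Map task
slide (the class name BigCatMapper and the hard-coded set of cat names are illustrative
only; this is not part of the course lab code):

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BigCatMapper extends Mapper<Object, Text, Text, IntWritable> {
  private static final Set<String> BIG_CATS =
      new HashSet<String>(Arrays.asList("Tiger", "Lion", "Panther"));
  private final static IntWritable one = new IntWritable(1);
  private final Text name = new Text();

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String animal = value.toString().trim();
    if (BIG_CATS.contains(animal)) {       // requirement 1: filter out the non big-cat rows
      name.set(animal);
      context.write(name, one);            // requirement 2: emit <Text(name), IntWritable(1)>
    }
  }
}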
Shuffle
• Shuffle moves all values of one key to the same target node
• Distributed by a Partitioner Class (normally hash distribution)
• Reduce Tasks can run on any node - here on Nodes 1 and 3
The number of Map tasks and the number of Reduce tasks do not need to be identical
Differences are handled by the hash partitioner
Map output on Node 1:             After shuffle, on Node 1:
  <Tiger, 1>                        <Panther, <1,1>>
  <Lion, 1>                         <Tiger, <1,1,1>>
  <Lion, 1>                         ...
  <Panther, 1>
  ...
Map output on Node 2:             After shuffle, on Node 3:
  <Tiger, 1>                        <Lion, <1,1>>
  <Tiger, 1>                        ...
  <Panther, 1>
  ...
Shuffle distributes keys using a hash partitioner. Results are stored in HDFS blocks
on the machines that run the Reduce jobs.
Shuffle
Shuffle distributes the key-value pairs to the nodes where the Reducer task will run.
Each Mapper task produces one file for each Reducer task. A hash function running on
the Mapper node determines which Reducer task will receive any particular key-value
pair. All key-value pairs with a particular key will be sent to the same Reducer task.
Reduce tasks can run on any node, either outside the set of nodes where the Map
tasks run or on the same DataNodes. In the slide example, Node 1 is used for one
Reduce task, but a new node, Node 3, is used for the second Reduce task.
There is no relation between the number of Map tasks (generally one for each block
of the file(s) being read) and the number of Reduce tasks. Commonly the number of
Reduce tasks is smaller than the number of Map tasks.
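As a sketch of the hash partitioning logic described above, this mirrors the default
org.apache.hadoop.mapreduce.lib.partition.HashPartitioner that ships with Hadoop (the
class name WordPartitioner is illustrative):

import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    // Mask off the sign bit so the result is non-negative, then take the remainder:
    // every occurrence of a given key is routed to the same Reduce task.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}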
Reduce
• The reduce task computes aggregated values for each key
Normally the output is written to the DFS
Default is one output part-file per Reduce task
Reduce tasks aggregate all values of a specific key - in our case, the count of
the particular animal type
Reducer tasks run on DataNodes; the output files are stored in HDFS.
Node 1:
  <Panther, <1,1>>   ->  <Panther, 2>
  <Tiger, <1,1,1>>   ->  <Tiger, 3>
  ...
Node 3:
  <Lion, <1,1>>      ->  <Lion, 2>
  ...
Reduce
Note that these two Reducer tasks are running on Nodes 1 and 3.
The Reduce node then takes the answers to all the sub-problems and combines them
in some way to get the output - the answer to the problem it was originally trying to
solve.
In this case, the Reduce step shown on this slide does the following processing:
• Each Reduce node receives the data sent to it from the various Map nodes
• This data has been previously sorted (and possibly partially merged)
• The Reduce node aggregates the data; in the case of WordCount, it sums the
counts received for each particular word (each animal in this case)
• One file is produced for each Reduce task and that is written to HDFS where
the blocks will automatically be replicated
Optional: Combiner
• For performance, aggregating within the Map task can be helpful
• Reduces the amount of data sent over the network
Also reduces Merge effort, since data is premerged in Map
Done in the Map task, before Shuffle
Map tasks (with Combiner) running on each of two DataNodes:
  Node 1 input: Tiger, Lion, Panther, Wolf, ...
    map + combine output: <Lion, 1>, <Panther, 1>, <Tiger, 1>, ...
  Node 2 input: Tiger, Tiger, Wolf, Panther, ...
    map + combine output: <Tiger, 2>, <Panther, 1>, ...
Reduce tasks:
  Node 1: <Panther, <1,1>>, <Tiger, <1,2>>, ...
  Node 3: <Lion, 1>, ...
Optional: Combiner
The Combiner phase is optional. When it is used, it runs on the Mapper node and
pre-merges and pre-aggregates the data in the files that will be transmitted to the
Reduce tasks.
The Combiner thus reduces the amount of data that will be sent to the Reducer tasks,
and that speeds up the processing as smaller files need to be transmitted to the
Reducer task nodes.
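In the Java API, enabling the Combiner is a one-line addition to the job driver; for
WordCount the Reducer class can double as the Combiner because addition is associative
and commutative. A sketch, reusing the IntSumReducer from the WordCount example shown
earlier (conf is the job Configuration created in the driver):

Job job = Job.getInstance(conf, "word count");
job.setCombinerClass(IntSumReducer.class);   // run a map-side combine before the shuffle
job.setReducerClass(IntSumReducer.class);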
You will see the shortcomings of this simple approach in the output (for example, the
same word counted separately with and without trailing punctuation). Note that this is
the standard WordCount program and the interest is not in the actual results but only in
the process at this stage.
The WordCount program is to Hadoop Java programs what the "Hello, world!" program is
to the C language. It is generally the first program that people experience when coming
to the new technology.
[Figure: anatomy of an MRv1 job run - the client submits the job; the TaskTracker
retrieves the job resources from the distributed file system, e.g. HDFS (8), launches a
child JVM (9), and the Child process runs the task (10).]
• Client: submits MapReduce jobs
• JobTracker: coordinates the job run
Classes
• There are three main Java classes provided in Hadoop to read data
in MapReduce:
InputSplitter: divides a file into splits
− Splits are normally the block size, but this depends on the number of requested Map
tasks, whether any compression allows splitting, etc.
RecordReader: takes a split and reads the file into records
− For example, one record per line (LineRecordReader)
− But note that a record can be split across splits
InputFormat: takes each record and transforms it into a
<key, value> pair that is then passed to the Map task
• Lots of additional helper classes may be required to handle
compression, etc.
For example, IBM BigInsights provides additional compression handlers
for LZO compression, etc.
Classes
The InputSplitter, RecordReader, and InputFormat classes are provided inside the
Hadoop code.
Other helper classes are needed to support Java MapReduce programs. Some of
these are provided from inside the Hadoop code itself, but distribution vendors and end
programmers can provide other classes that either override or supplement standard
code. Thus some vendors provide the LZO compression algorithm to supplement the
standard compression codecs (such as the codec for bzip2).
Splits
• Files in MapReduce are stored in blocks (128 MB)
• MapReduce divides data into fragments or splits.
One map task is executed on each split
• Most files have records with defined split points
Most common is the end of line character
• The InputSplitter class is responsible for taking an HDFS file and
transforming it into splits.
Aim is to process as much data as possible locally
Splits
RecordReader
• Most of the time a split (block boundary) will not fall exactly at the end of a record
• Files are read into Records by the RecordReader class
Normally the RecordReader will start and stop at the split points.
• LineRecordReader will read over the end of the split till the line end.
HDFS will send the missing piece of the last record over the network
• Likewise LineRecordReader for Block2 will disregard the first
incomplete line
Node 1 (Block 1):      Node 2 (Block 2):
  Tiger\n                ther\n
  Tiger\n                Tiger\n
  Lion\n                 Wolf\n
  Pan                    Lion
In this example, RecordReader1 will not stop at "Pan" but will read on until the end of
the line. Likewise, RecordReader2 will ignore the first incomplete line.
RecordReader
InputFormat
• MapReduce Tasks read files by defining an InputFormat class
Map tasks expect <key, value> pairs
• To read line-delimited text files, Hadoop provides the TextInputFormat
class
It returns one <key, value> pair per line in the text
The value is the content of the line
The key is the byte offset at which the line starts in the file
Node 1
Tiger <0, Tiger>
Lion <6, Lion>
Lion <11, Lion>
Panther <16, Panther>
Wolf <24, Wolf>
… …
InputFormat
InputFormat describes the input-specification for a Map-Reduce job.
The Map-Reduce framework relies on the InputFormat of the job to:
• Validate the input-specification of the job.
• Split-up the input file(s) into logical InputSplits, each of which is then assigned to
an individual Mapper.
• Provide the RecordReader implementation to be used to glean input records
from the logical InputSplit for processing by the Mapper.
The default behavior of file-based InputFormats, typically sub-classes of
FileInputFormat, is to split the input into logical InputSplits based on the total size, in
bytes, of the input files. However, the FileSystem blocksize of the input files is treated
as an upper bound for input splits. A lower bound on the split size can be set via
mapreduce.input.fileinputformat.split.minsize.
Clearly, logical splits based on input-size is insufficient for many applications since
record boundaries are to be respected. In such cases, the application has to also
implement a RecordReader on whom lies the responsibility to respect record-
boundaries and present a record-oriented view of the logical InputSplit to the individual
task.
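As a sketch of how a driver typically wires this together (the 64 MB figure is an
arbitrary illustration, not a course setting; these lines would sit inside the driver's
main method):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// inside main():
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "line-oriented job");
job.setInputFormatClass(TextInputFormat.class);        // one <offset, line> pair per line
// Lower bound on split size, equivalent to mapreduce.input.fileinputformat.split.minsize:
FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);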
Fault tolerance
[Figure: failure points in MRv1 - (1) a task running in a child JVM (MapTask or
ReduceTask), (2) the TaskTracker node, and (3) the JobTracker node ("JobTracker fails");
TaskTrackers send heartbeats to the JobTracker.]
Fault tolerance
What happens when something goes wrong?
Failures can happen at the task level (1), TaskTracker level (2), or JobTracker level (3).
The primary way that Hadoop achieves fault tolerance is through restarting tasks.
Individual task nodes (TaskTrackers) are in constant communication with the head
node of the system, called the JobTracker. If a TaskTracker fails to communicate with
the JobTracker for a period of time (by default, 1 minute), the JobTracker will assume
that the TaskTracker in question has crashed. The JobTracker knows which map and
reduce tasks were assigned to each TaskTracker.
If the job is still in the mapping phase, then other TaskTrackers will be asked to re-
execute all map tasks previously run by the failed TaskTracker. If the job is in the
reducing phase, then other TaskTrackers will re-execute all reduce tasks that were in
progress on the failed TaskTracker.
Reduce tasks, once completed, have been written back to HDFS. Thus, if a
TaskTracker has already completed two out of three reduce tasks assigned to it, only
the third task must be executed elsewhere. Map tasks are slightly more complicated:
even if a node has completed ten map tasks, the reducers may not have all copied their
inputs from the output of those map tasks. If a node has crashed, then its mapper
outputs are inaccessible. So any already-completed map tasks must be re-executed to
make their results available to the rest of the reducing machines. All of this is handled
automatically by the Hadoop platform.
This fault tolerance underscores the need for program execution to be side-effect free.
If Mappers and Reducers had individual identities and communicated with one another
or the outside world, then restarting a task would require the other nodes to
communicate with the new instances of the map and reduce tasks, and the re-executed
tasks would need to reestablish their intermediate state. This process is notoriously
complicated and error-prone in the general case. MapReduce simplifies this problem
drastically by eliminating task identities or the ability for task partitions to communicate
with one another. An individual task sees only its own direct inputs and knows only its
own outputs, to make this failure and restart process clean and dependable.
Topic:
Issues with and
limitations of Hadoop v1
and MapReduce v1
Hadoop v1 stack: MapReduce (batch parallel computing) over HDFS (distributed storage)
Hadoop v2 stack: YARN (resource management) over HDFS (distributed storage)
YARN features
• Scalability
• Multi-tenancy
• Compatibility
• Serviceability
• Higher cluster utilization
• Reliability/Availability
YARN features
Topic:
The architecture of YARN
Hadoop v1 to Hadoop v2
Hadoop v1 to Hadoop v2
The most notable change from Hadoop v1 to Hadoop v2 is the separation of cluster
and resource management from the execution and data processing environment.
This allows for a new variety of application types to run, including MapReduce v2.
Architecture of MRv1
• Classic version of MapReduce (MRv1)
Architecture of MRv1
The effect is most prominently seen in the overall job control.
In MapReduce v1 there is just one JobTracker that is responsible for allocation of
resources, task assignment to data nodes (as TaskTrackers), and ongoing monitoring
("heartbeat") as each job is run (the TaskTrackers constantly report back to the
JobTracker on the status of each running task).
YARN architecture
• High level architecture of YARN
YARN architecture
In the YARN architecture, a global ResourceManager runs as a master daemon,
usually on a dedicated machine, that arbitrates the available cluster resources among
various competing applications. The ResourceManager tracks how many live nodes
and resources are available on the cluster and coordinates what applications submitted
by users should get these resources and when.
The ResourceManager is the single process that has this information so it can make its
allocation (or rather, scheduling) decisions in a shared, secure, and multi-tenant
manner (for instance, according to an application priority, a queue capacity, ACLs, data
locality, etc.).
When a user submits an application, an instance of a lightweight process called the
ApplicationMaster is started to coordinate the execution of all tasks within the
application. This includes monitoring tasks, restarting failed tasks, speculatively running
slow tasks, and calculating total values of application counters. These responsibilities
were previously assigned to the single JobTracker for all jobs. The ApplicationMaster
and tasks that belong to its application run in resource containers controlled by the
NodeManagers.
The NodeManager is a more generic and efficient version of the TaskTracker. Instead
of having a fixed number of map and reduce slots, the NodeManager has a number of
dynamically created resource containers. The size of a container depends upon the
amount of resources it contains, such as memory, CPU, disk, and network IO.
Currently, only memory and CPU (YARN-3) are supported. cgroups might be used to
control disk and network IO in the future. The number of containers on a node is a
product of configuration parameters and the total amount of node resources (such as
total CPUs and total memory) outside the resources dedicated to the slave daemons
and the OS.
Interestingly, the ApplicationMaster can run any type of task inside a container. For
example, the MapReduce ApplicationMaster requests a container to launch a map or a
reduce task, while the Giraph ApplicationMaster requests a container to run a Giraph
task. You can also implement a custom ApplicationMaster that runs specific tasks and,
in this way, invent a shiny new distributed application framework that changes the big
data world. I encourage you to read about Apache Twill, which aims to make it easy to
write distributed applications sitting on top of YARN.
In YARN, MapReduce is simply degraded to a role of a distributed application (but still a
very popular and useful one) and is now called MRv2. MRv2 is simply the
re-implementation of the classical MapReduce engine, now called MRv1, that runs on
top of YARN.
ApplicationMaster  <->  JobTracker (but dedicated and short-lived)
NodeManager        <->  TaskTracker
Container          <->  Slot
YARN
• Acronym for Yet Another Resource Negotiator
• New resource manager included in Hadoop 2.x and later
• De-couples Hadoop workload and resource management
• Introduces a general purpose application container
• Hadoop 2.2.0 includes the first GA version of YARN
• Most Hadoop vendors support YARN
YARN
YARN is a key component in Hortonworks Data Platform.
[Figure sequence: YARN (cluster resource management) running over HDFS. The
ResourceManager @node132 manages NodeManagers @node134, @node135, and @node136.
When Application 1 (analyze the lineitem table) is submitted, the ResourceManager
launches Application Master 1 on a NodeManager; the Application Master sends a resource
request to the ResourceManager, receives container IDs, and launches App 1 containers on
the NodeManagers. When Application 2 (analyze the customer table) is submitted,
Application Master 2 and its App 2 containers are launched in the same way, sharing the
cluster with Application 1.]
If you count "java" process ids (pids) running with the "yarn" user, you will
see 2X
00:00:00 /bin/bash -c /hdp/jdk/bin/java
00:00:00 /bin/bash -c /hdp/jdk/bin/java
. . .
00:07:48 /hdp/jdk/bin/java -Djava.net.pr
00:08:11 /hdp/jdk/bin/java -Djava.net.pr
. . .
Spark in Hadoop 2+
Apache Spark is a newer, in-memory processing framework that is an alternative to
MapReduce.
Spark is the subject of the next unit.
Checkpoint
1. List the phases in a MR job.
2. What are the limitations of MR v1?
3. The JobTracker in MR1 is replaced by which component(s) in YARN?
4. What are the major features of YARN?
Checkpoint
Checkpoint solution
1. List the phases in a MR job.
Map, Shuffle, Reduce, Combiner
2. What are the limitations of MR v1?
Centralized handling of job control flow
Tight coupling of a specific programming model with the resource management
infrastructure
Hadoop is now being used for all kinds of tasks beyond its original design
3. The JobTracker in MR1 is replaced by which component(s) in YARN?
ResourceManager
ApplicationMaster
4. What are the major features of YARN?
Multi-tenancy
Cluster utilization
Scalability
Compatibility
Checkpoint solution
Unit summary
• Describe the MapReduce model v1
• List the limitations of Hadoop 1 and MapReduce 1
• Review the Java code required to handle the Mapper class, the
Reducer class, and the program driver needed to access MapReduce
• Describe the YARN model
• Compare Hadoop 2/YARN with Hadoop 1
Unit summary
Lab 1
Running MapReduce and YARN jobs
Lab 1:
Running MapReduce and YARN jobs
Purpose:
You will run Java programs using Hadoop v2, YARN, and related
technologies. You will not need to do any Java programming. The Hadoop
community provides a number of standard example programs (akin to "Hello,
World" for people learning to write C language programs).
In the real world, the sample/example programs provide opportunities to learn
the relevant technology, and on a newly installed Hadoop cluster, to exercise
the system and the operations environment.
You will start with an existing sample program, selected from among those
available, and progress towards compiling a Java program from scratch and
learning in what type of environment it runs.
The Java program that you will be compiling and running, WordCount2.java,
resides in the Linux directory /home/labfiles.
VM Hostname: http://datascientist.ibm.com
User/Password: student / student
root/dalvm3
Task 1. Run a simple MapReduce job from supplied example
programs available with a Hadoop default installation.
The best references for writing MapReduce programs for MRv1, MRv2, and YARN are
on the Hadoop website. The URLs relevant for the current version of Hadoop are:
• http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-
mapreduce-client-core/MapReduceTutorial.html
• http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-
site/WritingYarnApplications.html
• http://hadoop.apache.org/docs/r2.7.0/hadoop-yarn/hadoop-yarn-
site/WritingYarnApplications.html
This documentation corresponds to the most current release of Hadoop and HDFS
(substitute your own version, for example r2.7.0, for "current" in the URL if it
differs).
1. Connect to and log in to your lab environment with user biadmin and password
biadmin.
2. In a new terminal window, type cd to change to your home directory.
3. To locate all of the Java jar files in your system with example in the file name,
type find / -name "*example*" 2> /dev/null | grep jar.
There will be files and directories in your Linux file system that you, as the biadmin
user, do not have access to, so the find command will produce errors. Those errors
are sent to stderr (standard error) and are eliminated from your console output by
using: 2> /dev/null
[hdfs@ibmclass ~]$ find / -name "*example*" 2> /dev/null | grep jar
/usr/share/java/rhino-examples-1.7.jar
/usr/share/java/rhino-examples.jar
/usr/hdp/2.6.4.0-91/oozie/doc/oozie-4.1.0_HDP_5/examples/apps/java-main/lib/oozie-
examples-4.1.0_HDP_5.jar
/usr/hdp/2.6.4.0-91/oozie/doc/oozie-4.1.0_HDP_5/examples/apps/datelist-java-
main/lib/oozie-examples-4.1.0_HDP_5.jar
/usr/hdp/2.6.4.0-91/oozie/doc/oozie-4.1.0_HDP_5/examples/apps/hadoop-el/lib/oozie-
examples-4.1.0_HDP_5.jar
/usr/hdp/2.6.4.0-91/oozie/doc/oozie-4.1.0_HDP_5/examples/apps/custom-main/lib/oozie-
examples-4.1.0_HDP_5.jar
/usr/hdp/2.6.4.0-91/oozie/doc/oozie-4.1.0_HDP_5/examples/apps/demo/lib/oozie-
examples-4.1.0_HDP_5.jar
/usr/hdp/2.6.4.0-91/oozie/doc/oozie-4.1.0_HDP_5/examples/apps/map-reduce/lib/oozie-
examples-4.1.0_HDP_5.jar
/usr/hdp/2.6.4.0-91/oozie/doc/oozie-4.1.0_HDP_5/examples/apps/aggregator/lib/oozie-
examples-4.1.0_HDP_5.jar
/usr/hdp/2.6.4.0-91/oozie/share/lib/hbase/hbase-examples-0.98.8_HDP_4-hadoop2.jar
/usr/hdp/2.6.4.0-91/knox/samples/hadoop-examples.jar
/usr/hdp/2.6.4.0-91/hadoop-mapreduce/hadoop-mapreduce-examples.jar
/usr/hdp/2.6.4.0-91/hadoop-mapreduce/hadoop-mapreduce-examples-2.6.0-HDP-7.jar
/usr/hdp/2.6.4.0-91/spark/lib/spark-examples.jar
/usr/hdp/2.6.4.0-91/spark/lib/spark-examples-1.2.1_HDP_4-hadoop2.6.0-HDP-7.jar
/usr/hdp/2.6.4.0-91/hbase/lib/hbase-examples.jar
/usr/hdp/2.6.4.0-91/hbase/lib/hbase-examples-0.98.8_HDP_4-hadoop2.jar
[hdfs@ibmclass ~]$
You can list the contents of a jar file using jar tf (t = table of contents, f = file
name follows); you will apply that to one of the jar files that you found with your
find-search.
org/apache/hadoop/examples/BaileyBorweinPlouffe$BbpReducer$1.class
org/apache/hadoop/examples/WordCount.class
org/apache/hadoop/examples/Join.class
org/apache/hadoop/examples/SecondarySort$IntPair$Comparator.class
org/apache/hadoop/examples/MultiFileWordCount$MapClass.class
org/apache/hadoop/examples/AggregateWordCount$WordCountPlugInClass.class
org/apache/hadoop/examples/WordStandardDeviation$WordStandardDeviationReducer.class
org/apache/hadoop/examples/RandomTextWriter$Counters.class
org/apache/hadoop/examples/MultiFileWordCount$CombineFileLineRecordReader.class
org/apache/hadoop/examples/WordCount$TokenizerMapper.class
org/apache/hadoop/examples/BaileyBorweinPlouffe$BbpMapper.class
org/apache/hadoop/examples/QuasiMonteCarlo$QmcReducer.class
org/apache/hadoop/examples/MultiFileWordCount$WordOffset.class
org/apache/hadoop/examples/pi/DistSum$Machine$AbstractInputFormat.class
org/apache/hadoop/examples/pi/DistSum$MapSide$PartitionInputFormat.class
org/apache/hadoop/examples/pi/DistSum$ReduceSide$SummationInputFormat.class
org/apache/hadoop/examples/pi/Parser.class
org/apache/hadoop/examples/pi/DistSum.class
org/apache/hadoop/examples/pi/DistSum$Machine.class
org/apache/hadoop/examples/pi/SummationWritable.class
org/apache/hadoop/examples/pi/Util$Timer.class
org/apache/hadoop/examples/pi/DistSum$Parameters.class
org/apache/hadoop/examples/pi/DistSum$ReduceSide$PartitionMapper.class
org/apache/hadoop/examples/pi/DistBbp.class
org/apache/hadoop/examples/pi/DistSum$1.class
org/apache/hadoop/examples/pi/DistSum$ReduceSide$SummingReducer.class
org/apache/hadoop/examples/pi/DistSum$Machine$AbstractInputFormat$1.class
org/apache/hadoop/examples/pi/DistSum$MapSide$SummingMapper.class
org/apache/hadoop/examples/pi/Combinable.class
org/apache/hadoop/examples/pi/Container.class
org/apache/hadoop/examples/pi/DistSum$Machine$SummationSplit.class
org/apache/hadoop/examples/pi/DistSum$MapSide.class
org/apache/hadoop/examples/pi/DistSum$MixMachine.class
org/apache/hadoop/examples/pi/Util.class
org/apache/hadoop/examples/pi/DistSum$Computation.class
org/apache/hadoop/examples/pi/DistSum$ReduceSide$IndexPartitioner.class
org/apache/hadoop/examples/pi/math/Bellard$Sum$Tail.class
org/apache/hadoop/examples/pi/math/Bellard$Parameter.class
org/apache/hadoop/examples/pi/math/ArithmeticProgression.class
org/apache/hadoop/examples/pi/math/Modular.class
org/apache/hadoop/examples/pi/math/LongLong.class
org/apache/hadoop/examples/pi/math/Summation.class
org/apache/hadoop/examples/pi/math/Montgomery.class
org/apache/hadoop/examples/pi/math/Bellard$Sum$1.class
org/apache/hadoop/examples/pi/math/Bellard$Sum.class
org/apache/hadoop/examples/pi/math/Montgomery$Product.class
org/apache/hadoop/examples/pi/math/Bellard$1.class
org/apache/hadoop/examples/pi/math/Bellard.class
org/apache/hadoop/examples/pi/DistSum$ReduceSide.class
org/apache/hadoop/examples/pi/SummationWritable$ArithmeticProgressionWritable.class
org/apache/hadoop/examples/pi/TaskResult.class
org/apache/hadoop/examples/RandomWriter$RandomMapper.class
org/apache/hadoop/examples/terasort/TeraInputFormat$TeraRecordReader.class
org/apache/hadoop/examples/terasort/TeraChecksum.class
org/apache/hadoop/examples/terasort/TeraOutputFormat$TeraRecordWriter.class
org/apache/hadoop/examples/terasort/TeraValidate$ValidateReducer.class
org/apache/hadoop/examples/terasort/GenSort.class
org/apache/hadoop/examples/terasort/Random16$RandomConstant.class
org/apache/hadoop/examples/terasort/TeraGen$SortGenMapper.class
org/apache/hadoop/examples/terasort/TeraInputFormat$1.class
org/apache/hadoop/examples/terasort/TeraInputFormat$SamplerThreadGroup.class
org/apache/hadoop/examples/terasort/TeraGen$RangeInputFormat$RangeRecordReader.class
org/apache/hadoop/examples/terasort/TeraChecksum$ChecksumReducer.class
org/apache/hadoop/examples/terasort/TeraSort.class
org/apache/hadoop/examples/terasort/Random16.class
org/apache/hadoop/examples/terasort/TeraSort$TotalOrderPartitioner$InnerTrieNode.class
org/apache/hadoop/examples/terasort/TeraSort$SimplePartitioner.class
org/apache/hadoop/examples/terasort/TeraGen.class
org/apache/hadoop/examples/terasort/TeraInputFormat$TextSampler.class
org/apache/hadoop/examples/terasort/TeraGen$Counters.class
org/apache/hadoop/examples/terasort/TeraSort$TotalOrderPartitioner.class
org/apache/hadoop/examples/terasort/TeraOutputFormat.class
org/apache/hadoop/examples/terasort/TeraScheduler$Split.class
org/apache/hadoop/examples/terasort/TeraSort$TotalOrderPartitioner$TrieNode.class
org/apache/hadoop/examples/terasort/TeraValidate$ValidateMapper.class
org/apache/hadoop/examples/terasort/TeraSort$TotalOrderPartitioner$LeafTrieNode.class
org/apache/hadoop/examples/terasort/TeraGen$RangeInputFormat.class
org/apache/hadoop/examples/terasort/TeraScheduler.class
org/apache/hadoop/examples/terasort/TeraChecksum$ChecksumMapper.class
org/apache/hadoop/examples/terasort/TeraScheduler$Host.class
org/apache/hadoop/examples/terasort/TeraValidate.class
org/apache/hadoop/examples/terasort/Unsigned16.class
org/apache/hadoop/examples/terasort/TeraInputFormat.class
org/apache/hadoop/examples/terasort/TeraGen$RangeInputFormat$RangeInputSplit.class
org/apache/hadoop/examples/AggregateWordHistogram$AggregateWordHistogramPlugin.class
org/apache/hadoop/examples/AggregateWordCount.class
org/apache/hadoop/examples/Grep.class
org/apache/hadoop/examples/WordCount$IntSumReducer.class
org/apache/hadoop/examples/WordMean.class
org/apache/hadoop/examples/WordStandardDeviation$WordStandardDeviationMapper.class
org/apache/hadoop/examples/SecondarySort$MapClass.class
org/apache/hadoop/examples/BaileyBorweinPlouffe$BbpReducer.class
org/apache/hadoop/examples/RandomWriter.class
org/apache/hadoop/examples/SecondarySort$FirstPartitioner.class
org/apache/hadoop/examples/ExampleDriver.class
org/apache/hadoop/examples/WordMedian$WordMedianMapper.class
org/apache/hadoop/examples/MultiFileWordCount$MyInputFormat.class
org/apache/hadoop/examples/QuasiMonteCarlo$QmcMapper.class
org/apache/hadoop/examples/DBCountPageView$AccessRecord.class
org/apache/hadoop/examples/WordMedian.class
META-INF/maven/
META-INF/maven/org.apache.hadoop/
META-INF/maven/org.apache.hadoop/hadoop-mapreduce-examples/
META-INF/maven/org.apache.hadoop/hadoop-mapreduce-examples/pom.xml
META-INF/maven/org.apache.hadoop/hadoop-mapreduce-examples/pom.properties
[hdfs@ibmclass ~]$
This is a long list. You are going to use the WordCount example from this file,
but what is the file called? Note also that you can get dates and times for the
various classes and programs in this jar file by using jar tvf (v = verbose).
You can find which runnable example programs are included by attempting to run this
jar with hadoop jar without providing any additional (and perhaps necessary)
parameters.
The program that you will use is wordcount which is a map/reduce program that
counts the words in the input files.
6. To run the program with appropriate parameters, type hadoop jar
/usr/hdp/2.6.4.0-91/hadoop-mapreduce/hadoop-mapreduce-examples.jar
wordcount Gutenberg/Frankenstein.txt wcout.
Scroll through the part-r-00000 file in more and look at least a dozen or more pages
into the output (press the Spacebar to see the next page of output).
Here is a sample from several pages down:
Concerning 1
Constantinople 1
Constantinople, 1
Continue 1
Continuing 1
Copet. 1
Cornelius 6
Could 6
Coupar, 1
Cousin, 1
Covered 1
Creator. 1
Creator; 1
Cumberland 3
While the job was running, you could have watched its progress by using the
suggested URL. For the job listing above, this was:
15/06/04 14:48:48 INFO mapreduce.Job: The url to track the job:
http://ibmclass.localdomain:8088/proxy/application_1433282779965_0002/
8. Type q to exit from more, and then to remove the directory where your output
file was stored (wcout), type hadoop fs -rm -R wcout.
It is necessary to remove this directory and all files in it to use the command
again without changes. Alternatively you could run the command again, but with
a different output directory such as wcout2.
9. You will now run the WordCount program again and, immediately after it starts,
you should copy the URL for that run by selecting it, right-clicking, and clicking
Copy. Paste the URL into a Firefox browser within your lab environment.
Depending on how fast you do this, you will see something like the following, if
your MapReduce job is still running:
10. The same job can be run (and monitored) as a YARN job. To do this, type yarn
jar /usr/hdp/2.6.4.0-91/hadoop-mapreduce/hadoop-mapreduce-
examples.jar wordcount Gutenberg/Frankenstein.txt wcout2.
11. You can also run the job against all four files in the Gutenberg directory. To do
this, type hadoop jar /usr/hdp/2.6.4.0-91/hadoop-mapreduce/hadoop-
mapreduce-examples.jar wordcount Gutenberg/*.txt wcout3 (a new output
directory, since wcout2 already exists).
Question 1: How many Mappers were used for your run?
Question 2: How many Reducers were used for your run?
Results:
You ran Java programs using Hadoop v2, YARN, and related technologies.
You started with an existing sample program, selected from among those
available, and progressed towards compiling a Java program from scratch and
learned in what type of environment it runs.
Lab 2
Creating and coding a simple MapReduce job
Lab 2:
Creating and coding a simple MapReduce job
Purpose:
You will compile and run a more complete version of WordCount that has
been written specifically for MapReduce2.
This can also be run with "yarn" and the MR job can be monitored as previously.
Question 1: How many Mappers were used for your run?
Question 2: How many Reducers were used for your run?
8. Look at your output (in wc2out) and see if it differs from what you generated
previously.
Results:
You compiled and ran a more complete version of WordCount that has been
written specifically for MapReduce2.
Apache Spark
Unit objectives
• Understand the nature and purpose of Apache Spark in the Hadoop
ecosystem
• List and describe the architecture and components of the Spark
unified stack
• Describe the role of a Resilient Distributed Dataset (RDD)
• Understand the principles of Spark programming
• List and describe the Spark libraries
• Launch and use Spark's Scala and Python shells
Unit objectives
Ease of use
• To implement the classic wordcount in Java MapReduce, you need
three classes: the main class that sets up the job, a Mapper, and a
Reducer, each about 10 lines long
• For the same wordcount program, written in Scala for Spark:
val conf = new SparkConf().setAppName("Spark wordcount")
val sc = new SparkContext(conf)
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Ease of use
Spark supports Scala, Python, Java, and R programming languages, all now in
widespread use among data scientists. The slide shows programming in Scala. Python
is widespread among data scientists and in the scientific community, bringing those
users on par with Java and Scala developers.
An important aspect of Spark is the way that it can combine the functionalities of many
tools available in the Hadoop ecosystem to provide a single unifying platform. In
addition, the Spark execution model is general enough that a single framework can be
used for:
• batch processing operations (similar to those provided by MapReduce),
• stream data processing,
• machine learning,
• SQL-like operations, and
• graph operations.
The result is that many different ways of working with data are available on the same
platform, bridging the gap between the work of the classic big data programmer, data
engineers, and data scientists.
All the same, Spark has its own limitations. There are no universal tools. Thus Spark is
not suitable for transaction processing and other ACID types of operations.
There are two groups that we can consider here who would want to use Spark: Data
Scientists and Engineers - and they have overlapping skill sets.
• Data scientists need to analyze and model the data to obtain insight. They need
to transform the data into something they can use for data analysis. They will
use Spark for its ad-hoc analysis to run interactive queries that will give them
results immediately. Data scientists may have experience using SQL, statistics,
machine learning and some programming, usually in Python, MatLab or R.
Once the data scientists have obtained insights on the data and determined that
there is a need to develop a production data processing application, a web
application, or some system to act upon the insight, the work is handed over to
data engineers.
• Data engineers use Spark's programming API to develop systems that
implement business use cases. Spark parallelizes these applications across the
clusters while hiding the complexities of distributed systems programming and
fault tolerance. Data engineers can employ Spark to monitor, inspect, and tune
applications.
For everyone else, Spark is easy to use with a wide range of functionality. The product
is mature and reliable.
[Figure: the Spark unified stack - Spark SQL, Spark Streaming (real-time processing),
MLlib (machine learning), and GraphX (graph processing), all built on Spark Core.]
Spark simplifies the picture by providing many of the Hadoop ecosystem's functions
through several purpose-built components. These are Spark Core, Spark SQL, Spark
Streaming, Spark MLlib, and Spark GraphX:
• Spark SQL is designed to work with Spark via SQL and HiveQL (a Hive
variant of SQL). Spark SQL allows developers to intermix SQL with Spark's
supported programming languages: Python, Scala, Java, and R.
• Spark Streaming provides processing of live streams of data. The Spark
Streaming API closely matches that of the Spark Core API, making it easy for
developers to move between applications that process data stored in memory
vs arriving in real time. It also provides the same degree of fault tolerance,
throughput, and scalability that the Spark Core provides.
• MLlib is the machine learning library that provides multiple types of machine
learning algorithms. These algorithms are designed to scale out across the
cluster as well. Supported algorithms include logistic regression, naive Bayes
classification, SVM, decision trees, random forests, linear regression, k-means
clustering, and others.
• GraphX is a graph processing library with APIs to manipulate graphs and
perform graph-parallel computations. Graphs are data structures composed
of vertices and edges connecting them. GraphX provides functions for building
graphs and implementations of the most important algorithms of the graph
theory, like page rank, connected components, shortest paths, and others.
"If you compare the functionalities of Spark components with the tools in the Hadoop
ecosystem, you can see that some of the tools are suddenly superfluous. For example,
Apache Storm can be replaced by Spark Streaming, Apache Giraph can be replaced
by Spark GraphX and Spark MLlib can be used instead of Apache Mahout. Apache
Pig, and Apache Sqoop are not really needed anymore, as the same functionalities are
covered by Spark Core and Spark SQL. But even if you have legacy Pig workflows and
need to run Pig, the Spork project enables you to run Pig on Spark." - Bonaći, M., &
Zečević, P. (2015). Spark in action. Greenwich, CT: Manning Publications. ("Spork" is
Apache Pig on Spark, see https://github.com/sigmoidanalytics/spork).
The fault tolerance aspect of RDDs allows Spark to reconstruct the transformations
used to build the lineage to get back any lost data.
In the example RDD flow shown on the slide, the first step loads the dataset from
Hadoop. Successive steps apply transformations on this data such as filter, map, or
reduce. Nothing actually happens until an action is called. The DAG is just updated
each time until an action is called. This provides fault tolerance; for example, if a node
goes offline, all it needs to do when it comes back online is to re-evaluate the graph to
where it left off.
In-memory caching is provided with Spark to enable the processing to happen in
memory. If the RDD does not fit in memory, it will spill to disk.
If you want to learn more about Scala, check out its website for tutorials and guide.
Throughout this course, you will see examples in Scala that will have explanations on
what it does. Remember, the focus of this unit is on the context of Spark, and is not
intended to teach Scala, Python or Java.
References for learning Scala:
• Horstmann, C. S. (2012). Scala for the impatient. Upper Saddle River, NJ:
Addison-Wesley Professional
• Odersky, M., Spoon, L., & Venners, B. (2011). Programming in Scala: A
Comprehensive Step-by-Step Guide (2nd ed.). Walnut Creek, CA: Artima Press
• Lambda functions can be used with Scala, Python, and Java 8; this
example is Scala
val HDPLines = lines.filter(hasHDP)
A common example is a MapReduce wordcount. You first split up the file by words
("tokenize") and then map each word into a key value pair with the word as the key, and
the value of 1. Then you reduce by the key, which adds up all the value of the same
key, effectively, counting the number of occurrences of that key. Finally, the counts are
written out to a file in HDFS.
References:
• École Polytechnique Fédérale de Lausanne (EPFL). (2012). A tour of Scala:
Anonymous function syntax, from http://www.scala-lang.org/old/node/133
• Apache Software Foundation,(2015). Spark Examples, from
https://spark.apache.org/examples.html
There are three methods for creating an RDD. You can parallelize an existing collection.
This means that the data already resides within Spark and can now be operated on in
parallel. As an example, if you have an array of data, you can create an RDD out of it by
calling the parallelize method. This method returns a pointer to the RDD, so this new
distributed dataset can now be operated upon in parallel throughout the cluster.
The second method to create a RDD, is to reference a dataset. This dataset can come
from any storage source supported by Hadoop such as HDFS, Cassandra, HBase,
Amazon S3, etc.
The third method to create a RDD is from transforming an existing RDD to create a new
RDD. In other words, if you have the array of data that you parallelized earlier and you
want to filter out some of the records, a new RDD is created using the filter method.
The final point on this slide: Spark supports text files, SequenceFiles, and any other
Hadoop InputFormat.
Creating an RDD
• Launch the Spark shell (requires the PATH environment variable to be set appropriately)
spark-shell
• Create some data
val data = 1 to 10000
Creating an RDD
Here is a quick example of how to create an RDD from an existing collection of data. In
the examples throughout the course, unless otherwise indicated, you will be using
Scala to show how Spark works. In the lab exercises, you will get to work with Python
and Java as well.
First, launch the Spark shell. This command is located under the /usr/bin directory.
Once the shell is up (with the prompt "scala>"), create some data with values from 1 to
10,000. Then, create an RDD from that data using the parallelize method from the
SparkContext, shown as sc on the slide; this means that the data can now be operated
on in parallel.
More will be covered later on the SparkContext (the sc object that invokes the
parallelize function); for now, just know that when you initialize a shell, the
SparkContext, sc, is initialized for you to use.
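A minimal sketch of those steps as entered in the Spark shell (the final filter/collect
call is just a quick way to confirm the RDD works; it is not on the slide):

// sc (the SparkContext) is created for you by spark-shell
val data = 1 to 10000
val distData = sc.parallelize(data)   // the collection can now be operated on in parallel
distData.filter(_ < 10).collect()     // returns Array(1, 2, ..., 9)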
• Applying transformation
val lineLengths = lines.map(s => s.length)
• Invoking action
val totalLengths = lineLengths.reduce((a,b) => a + b)
To view the DAG of an RDD after a series of transformations, use the toDebugString
method as you see here on the slide. It will display the series of transformations that
Spark will go through once an action is called. You read it from the bottom up. In the
sample DAG shown on the slide, you can see that it starts as a textFile and goes
through a series of transformations such as map and filter, followed by more map
operations. Remember, that it is this behavior that allows for fault tolerance. If a node
goes offline and comes back on, all it has to do is just grab a copy of this from a
neighboring node and rebuild the graph back to where it was before it went offline.
In reviewing the steps, the first thing that happens when you load in the text file is the
data is partitioned into different blocks across the cluster.
Then the driver sends the code to be executed on each block. In the example, it would
be the various transformations and actions that will be sent out to the workers. Actually,
it is the executor on each workers that is going to be performing the work on each
block. You will see a bit more on executors later.
The executors read the HDFS blocks to prepare the data for the operations in parallel.
After a series of transformations, you want to cache the results up until that point into
memory. A cache is created.
After the first action completes, the results are sent back to the driver. In this case, you
are looking for messages that relate to mysql. This is then returned back to the driver.
To process the second action, Spark will use the data on the cache; it does not need to
go to the HDFS data again. It just reads it from the cache and processes the data from
there.
Finally the results are sent back to the driver and you have completed a full cycle.
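A Scala sketch of the walk-through above, run in the Spark shell (the HDFS path and the
ERROR/mysql/php filter strings are assumed for illustration; they follow the narration
rather than actual lab files):

val lines  = sc.textFile("hdfs://.../logs/messages")   // partitioned into blocks across the cluster
val errors = lines.filter(_.contains("ERROR"))
errors.cache()                                          // cache the result of the transformations so far
println(errors.toDebugString)                           // show the lineage (DAG) built so far

errors.filter(_.contains("mysql")).count()              // 1st action: reads from HDFS and fills the cache
errors.filter(_.contains("php")).count()                // 2nd action: served from the cache, not HDFS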
Transformation                      Meaning
map(func)                           Return a new dataset formed by passing each element of the
                                    source through a function func.
filter(func)                        Return a new dataset formed by selecting those elements of the
                                    source on which func returns true.
flatMap(func)                       Similar to map, but each input item can be mapped to 0 or more
                                    output items, so func should return a Seq rather than a single item.
join(otherDataset, [numTasks])      When called on datasets of type (K, V) and (K, W), returns a
                                    dataset of (K, (V, W)) pairs with all pairs of elements for each key.
reduceByKey(func)                   When called on a dataset of (K, V) pairs, returns a dataset of
                                    (K, V) pairs where the values for each key are aggregated using
                                    the given reduce function func.
sortByKey([ascending], [numTasks])  When called on a dataset of (K, V) pairs where K implements
                                    Ordered, returns a dataset of (K, V) pairs sorted by keys in
                                    ascending or descending order.

Action                              Meaning
collect()                           Return all the elements of the dataset as an array at the driver
                                    program. This is usually useful after a filter or another operation
                                    that returns a sufficiently small subset of the data.
RDD persistence
• Each node stores partitions of the cache that it computes in memory
• Reuses them in other actions on that dataset (or derived datasets)
Future actions are much faster (often by more than 10x)
• Two methods for RDD persistence: persist(), cache()
Storage Level            Meaning
MEMORY_ONLY              Store as deserialized Java objects in the JVM. If the RDD does not fit in
                         memory, some partitions will not be cached and will be recomputed as
                         needed. This is the default; the cache() method uses this.
MEMORY_AND_DISK          Same, except also store on disk the partitions that do not fit in memory;
                         read them from memory and disk when needed.
MEMORY_ONLY_SER          Store as serialized Java objects (one byte array per partition). Space
                         efficient, but more CPU intensive to read.
MEMORY_AND_DISK_SER      Similar to MEMORY_AND_DISK, but stored as serialized objects.
DISK_ONLY                Store only on disk.
MEMORY_ONLY_2,           Same as the levels above, but replicate each partition on two cluster
MEMORY_AND_DISK_2, etc.  nodes.
OFF_HEAP (experimental)  Store the RDD in serialized format in Tachyon.
RDD persistence
Now we want to get a bit into RDD persistence. You have seen this used already; it is
the cache function. The cache function is actually the default of the persist function;
cache() is essentially just persist with MEMORY_ONLY storage.
One of the key capabilities of Spark is its speed through persisting, or caching. Each
node stores in memory any partitions of the dataset that it computes. When a
subsequent action is called on the same dataset, or a derived dataset, Spark uses the
copy from memory instead of having to retrieve it again. Future actions in such cases
are often 10 times faster. The first time an RDD is persisted, it is kept in memory on the
node. Caching is fault tolerant because if any partition is lost, it will automatically be
recomputed using the transformations that originally created it.
There are two methods to invoke RDD persistence. persist() and cache(). The persist()
method allows you to specify a different storage level of caching. For example, you can
choose to persist the data set on disk, persist it in memory but as serialized objects to
save space, etc. Again the cache() method is just the default way of using persistence
by storing deserialized objects in memory.
The table shows the storage levels and what they mean. Basically, you can choose to
store in memory or in memory and disk. If a partition does not fit in the specified cache
location, then it will be recomputed on the fly. You can also decide to serialize the
objects before storing them. This is space efficient, but it requires the RDD to be
deserialized before it can be read, so it takes more CPU workload. There's also the
option to replicate each partition on two cluster nodes. Finally, there is an experimental
storage level storing the serialized object in Tachyon. This level reduces garbage
collection overhead and allows the executors to be smaller and to share a pool of
memory. You can read more about this on Spark's website.
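A short Scala sketch of the two calls (the path and filter strings are again illustrative):

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs://.../logs")

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY):
val errors = logs.filter(_.contains("ERROR")).cache()

// persist() lets you choose any storage level, for example spilling to disk
// when a partition does not fit in memory:
val warnings = logs.filter(_.contains("WARN")).persist(StorageLevel.MEMORY_AND_DISK)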
Scala:   val pair = ('a', 'b')
         pair._1 // will return 'a'
         pair._2 // will return 'b'
Python:  pair = ('a', 'b')
         pair[0] # will return 'a'
         pair[1] # will return 'b'
Java:    Tuple2 pair = new Tuple2('a', 'b');
         pair._1 // will return 'a'
         pair._2 // will return 'b'
Last, but not least, key-value pairs are available in Scala, Python, and Java. In Scala,
you create a key-value pair by typing val pair = ('a', 'b'). To access each element,
invoke the ._ notation. This is not zero-indexed, so ._1 will return the value in the first
position and ._2 will return the value in the second position. Java is very similar to
Scala in that it is not zero-indexed; you create a Tuple2 object in Java to create a
key-value pair. In Python, the notation is zero-indexed, so the value of the first
position resides in index 0 and the second in index 1.
SparkContext
• The SparkContext is the main entry point for Spark functionality;
it represents the connection to a Spark cluster
• Use the SparkContext to create RDDs, accumulators, and broadcast
variables on that cluster
• With the Spark shell, the SparkContext, sc, is automatically initialized
for you to use
• But in a Spark program, you need to add code to import some classes
and implicit conversions into your program:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
SparkContext
The SparkContext is the main entry point to everything Spark. It can be used to create
RDDs and shared variables on the cluster. When you start up the Spark Shell, the
SparkContext is automatically initialized for you with the variable sc. For a Spark
application, you must first import some classes and implicit conversions and then create
the SparkContext object.
The three import statements for Scala are shown on the slide.
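A minimal standalone sketch putting the imports and the SparkContext creation together
(the application name MyApp and the small parallelize example are illustrative; in a real
application, spark-submit normally supplies the master):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object MyApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MyApp")
    val sc   = new SparkContext(conf)

    val data = sc.parallelize(1 to 1000)
    println(data.sum())                    // 500500.0

    sc.stop()
  }
}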
• If you want to access an HDFS cluster, you must add the dependency
as well.
groupId = org.apache.hadoop
artifactId = hadoop-client
version = <your-hdfs-version>
There are three methods that you can use to pass functions.
• The first method to do this is using an anonymous function syntax. You saw
briefly what an anonymous function is in the first lesson. This is useful for short
pieces of code. For example, here we define the anonymous function that takes
in a parameter x of type Int and add one to it. Essentially, anonymous functions
are useful for one-time use of the function. In other words, you do not need to
explicitly define the function to use it. You define as you go. Again, the left side
of the => are the parameters or the argument. The right side of the => is the
body of the function.
• Another method to pass functions around Spark is to use static methods in a
global singleton object. This means that you can create a global object, in the
example, it is the object MyFunctions. Inside that object, you basically define the
function func1. When the driver requires that function, it only needs to send out
the object - the worker will be able to access it. In this case, when the driver
sends out the instructions to the worker, it just has to send out the singleton
object.
• It is possible to pass a reference to a method in a class instance, as opposed to a
singleton object. This would require sending the object that contains the class
along with the method. To avoid this, consider copying the field to a local variable
within the function instead of accessing it externally.
For example, say you have a field with the string Hello. You want to avoid referencing it
directly inside a function, as shown on the slide as x => field + x.
Instead, assign it to a local variable so that only that value is passed along and not
the entire object, as shown: val field_ = this.field
For an example such as this, it may seem trivial, but imagine if the field object is not a
simple text Hello, but is something much larger, say a large log file. In that case,
passing by reference will have greater value by saving a lot of storage by not having to
pass the entire file.
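A Scala sketch of the three approaches, as entered in the Spark shell (MyFunctions,
func1, and field_ follow the names used in the narration above; MyClass and the example
bodies are illustrative):

import org.apache.spark.rdd.RDD

// 1. Anonymous function syntax - handy for short, one-off functions
val addOne = (x: Int) => x + 1

// 2. A static method in a global singleton object - only the (stateless) object
//    reference has to be shipped to the workers
object MyFunctions {
  def func1(s: String): String = s.toUpperCase
}
// usage: rdd.map(MyFunctions.func1)

// 3. Avoid capturing a whole class instance: copy the field to a local variable
//    so that only that value travels with the closure
class MyClass(field: String) {
  def doStuff(rdd: RDD[String]): RDD[String] = {
    val field_ = field          // local copy; only this string is serialized
    rdd.map(x => field_ + x)
  }
}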
(Diagram: the skeleton of a Spark program: import statements, then creation of SparkConf
and SparkContext, then transformations and actions.)
Spark libraries
• Extension of the core Spark API.
• Improvements made to the core are passed to these libraries.
• Little overhead to use with the Spark core
spark.apache.org
Spark libraries
Spark comes with libraries that you can utilize for specific use cases. These libraries are
an extension of the Spark Core API. Any improvements made to the core will
automatically take effect with these libraries. One of the big benefits of Spark is that
there is little overhead to use these libraries with Spark as they are tightly integrated.
The rest of this lesson covers, at a high level, each of these libraries and their
capabilities. The main focus will be on Scala, with specific callouts to Java or Python where
there are major differences.
The four libraries are Spark SQL, Spark Streaming, MLlib, and GraphX. The remainder
of this course will cover these libraries.
Spark SQL
• Allows relational queries expressed in
SQL
HiveQL
Scala
• SchemaRDD
Row objects
Schema
Created from:
− Existing RDD
− Parquet file
− JSON dataset
− HiveQL against Apache Hive
• Supports Scala, Java, R, and Python
Spark SQL
Spark SQL allows you to write relational queries that are expressed in either SQL,
HiveQL, or Scala to be executed using Spark. Spark SQL has a new RDD called the
SchemaRDD. The SchemaRDD consists of Row objects and a schema that describes
the type of data in each column of each row. You can think of this as a table in a
traditional relational database.
You create a SchemaRDD from existing RDDs, a Parquet file, a JSON dataset, or
using HiveQL to query against the data stored in Hive. Spark SQL is currently an alpha
component, so some APIs may change in future releases.
Spark SQL supports Scala, Java, R, and Python.
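A minimal sketch, assuming the Spark 1.x SQLContext API and a hypothetical people.json
file with one JSON object per line:
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val people = sqlContext.jsonFile("people.json")     // returns a SchemaRDD
people.registerTempTable("people")
val teenagers = sqlContext.sql(
  "SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.collect().foreach(println)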
Spark streaming
Spark streaming
Spark Streaming gives you the capability to process live streaming data in small
batches. Built on Spark's core, Spark Streaming is scalable, high-throughput, and fault-
tolerant. You write streaming programs with DStreams, where a DStream is a sequence of
RDDs created from a stream of data. Spark Streaming can receive data from various
sources, including Kafka, Flume, HDFS, Kinesis, and Twitter. It can push data out to HDFS,
databases, or dashboards.
Spark Streaming supports Scala, Java, and Python. Python support was introduced with
Spark 1.2. Python has all the transformations that Scala and Java have with DStreams,
but it can only support text data types. Support for other sources such as Kafka and
Flume will be available in future releases for Python.
spark.apache.org
In the second diagram, the window length is 3 and the sliding interval is 2. To put it into
perspective, suppose you want to generate word counts over the last 30 seconds of data,
every 10 seconds. To do this, you would apply the reduceByKeyAndWindow operation
on the DStream of (word, 1) pairs for the last 30 seconds of data.
Doing wordcount in this manner is provided as an example program
NetworkWordCount, available on GitHub at
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spar
k/examples/streaming/NetworkWordCount.scala
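A minimal sketch of the windowed count in Scala, assuming pairs is an existing DStream of
(word, 1) tuples built from a StreamingContext whose batch interval divides evenly into the
window and slide durations:
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext._

// Count words over the last 30 seconds of data, every 10 seconds
val windowedWordCounts =
  pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedWordCounts.print()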
MLlib
• MLlib for machine learning library - under active development
• Currently provides the following common algorithms and utilities:
Classification
Regression
Clustering
Collaborative filtering
Dimensionality reduction
MLlib
Here is a short overview of the machine learning library. The MLlib library
contains algorithms and utilities for classification, regression, clustering, collaborative
filtering, and dimensionality reduction. Essentially, you would use this library for
machine learning use cases that require these algorithms.
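As one hedged example, k-means clustering with MLlib might look like the following in the
Spark shell; the input file and parameter values are placeholders:
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("kmeans_data.txt")           // hypothetical space-separated features
val parsed = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
val model = KMeans.train(parsed, 2, 20)             // k = 2 clusters, 20 iterations
println("Cluster centers: " + model.clusterCenters.mkString(", "))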
GraphX
• GraphX for graph processing
Graphs and graph parallel computation
Social networks and language modeling
https://spark.apache.org/docs/latest/graphx-programming-guide.html#overview
GraphX
GraphX is a library that sits on top of the Spark core. It is essentially a graph
processing library, which can be used for social networks and language modeling. Graph
data and the requirement for graph-parallel systems are becoming more common, which
is why the GraphX library was developed. Some scenarios are not efficient when
processed using the data-parallel model, and graph-parallel systems such as Giraph and
GraphLab were introduced to execute graph algorithms much faster than general
data-parallel systems.
There are inherent challenges that come with graph computations, such as
constructing the graph, modifying its structure, or expressing computations that span
several graphs. As such, it is often necessary to move between table and graph views
depending on the objective of the application and the business requirements.
The goal of GraphX is to optimize the process by making it easier to view data both as
a graph and as collections, such as RDD, without data movement or duplication.
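A minimal GraphX sketch, assuming a hypothetical edge-list file with one "srcId dstId" pair
per line:
import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "followers.txt")
val ranks = graph.pageRank(0.0001).vertices          // run PageRank to the given tolerance
ranks.take(5).foreach(println)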
Spark monitoring
• Three ways to monitor Spark applications
1. Web UI
− Port 4040 (lab exercise on port 8088)
− Available for the duration of the application
2. Metrics
− Based on the Coda Hale Metrics Library
− Report to a variety of sinks (HTTP, JMX, and CSV)
/conf/metrics.properties
3. External instrumentations
− Cluster-wide monitoring tool (Ganglia)
− OS profiling tools (dstat, iostat, iotop)
− JVM utilities (jstack, jmap, jstat, jconsole)
Spark monitoring
There are three ways to monitor Spark applications. The first way is the Web UI. The
default port is 4040. The port in the lab environment is 8088. The information on this UI
is available for the duration of the application. If you want to see the information after
the fact, set spark.eventLog.enabled to true before starting the application (see the
sketch below); the information will then be persisted to storage as well.
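One way to do this programmatically is sketched below; the log directory is a placeholder,
and the same properties can be set in spark-defaults.conf or with --conf on spark-submit:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("Monitored App")                        // placeholder name
  .set("spark.eventLog.enabled", "true")              // keep UI data after the app ends
  .set("spark.eventLog.dir", "hdfs:///spark-history") // placeholder directory
val sc = new SparkContext(conf)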
The Web UI has the following information.
• A list of scheduler stages and tasks.
• A summary of RDD sizes and memory usage.
• Environmental information and information about the running executors.
To view the history of an application after it has finished running, you can start up the history
server. The history server can be configured on the amount of memory allocated for it,
the various JVM options, the public address for the server, and a number of properties.
Metrics are a second way to monitor Spark applications. The metric system is based on
the Coda Hale Metrics Library. You can customize it so that it reports to a variety of
sinks such as CSV. You can configure the metrics system in the metrics.properties file
under the conf directory.
In addition, you can also use external instrumentations to monitor Spark. Ganglia is
used to view overall cluster utilization and resource bottlenecks. Various OS profiling
tools and JVM utilities can also be used for monitoring Spark.
Checkpoint
1. List the benefits of using Spark.
2. List the languages supported by Spark.
3. What is the primary abstraction of Spark?
4. What would you need to do in a Spark application that you would not
need to do in a Spark shell to start using Spark?
Checkpoint
Checkpoint solution
1. List the benefits of using Spark.
Speed, generality, ease of use
2. List the languages supported by Spark.
Scala, Python, Java
3. What is the primary abstraction of Spark?
Resilient Distributed Dataset (RDD)
4. What would you need to do in a Spark application that you would not
need to do in a Spark shell to start using Spark?
Import the necessary libraries to load the SparkContext
Checkpoint solution
Unit summary
• Understand the nature and purpose of Apache Spark in the Hadoop
ecosystem
• List and describe the architecture and components of the Spark
unified stack
• Describe the role of a Resilient Distributed Dataset (RDD)
• Understand the principles of Spark programming
• List and describe the Spark libraries
• Launch and use Spark's Scala and Python shells
Unit summary
Additional information and exercises are available in the Cognitive Class course Spark
Fundamentals available at https://cognitiveclass.ai/learn/spark/. That course gets into
further details on most of the topics covered here, especially Spark configuration,
monitoring, and tuning.
Cognitive Class has been chosen by IBM as one of the issuers of badges as part of the
IBM Open Badge program.
Cognitive Class leverages the services of Pearson VUE Acclaim to assist in the
administration of the IBM Open Badge program. By enrolling into this course, you agree
to Cognitive Class sharing your details with Pearson VUE Acclaim for the strict use of
issuing your badge upon completion of the badge criteria. The course comes with a
final quiz; the minimum passing mark for the course is 60%, and the final test is
worth 100% of the course mark.
Lab 1
Working with a Spark RDD with Scala
Lab 1:
Working with a Spark RDD with Scala
Purpose:
In this lab, you will learn to use some of the fundamental aspects of running
Spark in the HDP environment.
Additional information on Spark and additional lab exercises using Spark (with
Scala, Python, and Java) are available in the Cognitive Class course Spark
Fundamentals available at https://cognitiveclass.ai/learn/spark/.
Good locations for additional information on Scala and introductory exercises in
Spark with Scala and Python can be found at:
Scala Community: http://www.scala-lang.org
Quick Start: https://spark.apache.org/docs/latest/quick-start.html
Blogs, e.g.: http://blog.ajduke.in/2013/05/31/various-ways-to-run-scala-code
Note:
You may need to verify that your hostname and IP address are set up correctly, as
noted in earlier units. If you shut down your lab environment, this
verification of hostname and IP address should be repeated. This is particularly
important if you find that you cannot connect to Spark. Resetting the IP values
may require that you reboot the VMware image (as root: reboot).
Task 1. Connect to the VMware Image & to the Spark server.
1. Connect to and login to your lab environment with user student and password
student credentials.
2. In a new terminal window, type cd to change to your home directory.
3. To set the environment variable $SPARK_HOME, type export
SPARK_HOME=/usr/hdp/current/spark-client.
Using Scala version 2.10.4 (OpenJDK 64-Bit Server VM, Java 1.7.0_65)
Type in expressions to have them evaluated.
Type :help for more information.
15/06/05 11:06:07 INFO spark.SecurityManager: Changing view acls to: biadmin
15/06/05 11:06:07 INFO spark.SecurityManager: Changing modify acls to: biadmin
. . .
scala>
You now have a Scala prompt where you can enter Scala interactively.
Note that the Spark context is available as sc.
To exit from Scala at any time, you type sys.exit and press Enter (the official
approach) or use Ctrl-D.
The tab key provides code completion.
Note: if you type sc and press Tab, without the period after sc, you will get an
abbreviated output, since only three keywords start with sc, whereas a lot of
functionality is provided by the Spark context ("sc.").
scala> sc
sc scala schema
The result is a pointer to the file. The file is not actually read at this time, as is
evidenced by noting that you do not get any errors if you misspell the file name.
Now pp is a pointer to the RDD.
We can perform some RDD actions on this data. One simple action is to count
the number of items (lines, records) in the RDD.
2. To count the number of items in the RDD, type pp.count().
scala> pp.count()
15/06/05 12:02:27 INFO mapred.FileInputFormat: Total input paths to process : 1
15/06/05 12:02:27 INFO spark.SparkContext: Starting job: count at <console>:15
15/06/05 12:02:27 INFO scheduler.DAGScheduler: Got job 0 (count at <console>:15) with
2 output partitions (allowLocal=false)
15/06/05 12:02:27 INFO scheduler.DAGScheduler: Final stage: Stage 0(count at
<console>:15)
15/06/05 12:02:27 INFO scheduler.DAGScheduler: Parents of final stage: List()
15/06/05 12:02:27 INFO scheduler.DAGScheduler: Missing parents: List()
15/06/05 12:02:27 INFO scheduler.DAGScheduler: Submitting Stage 0
(Gutenberg/Pride_and_Prejudice.txt MappedRDD[7] at textFile at <console>:12), which
has no missing parents
15/06/05 12:02:27 INFO storage.MemoryStore: ensureFreeSpace(2536) called with
curMem=1263720, maxMem=278302556
...
3. In a new terminal window, to verify the number of lines in the file, use the Linux
command wc on the original file that we uploaded to HDFS:
wc -l /home/biadmin/labfiles/Pr*
4. Restart the Spark Shell, and then to read the first record of the RDD, type
pp.first().
scala> pp.first()
15/06/05 12:40:14 INFO mapred.FileInputFormat: Total input paths to process : 1
15/06/05 12:40:14 INFO spark.SparkContext: Starting job: first at <console>:15
15/06/05 12:40:14 INFO scheduler.DAGScheduler: Got job 1 (first at <console>:15) with
1 output partitions (allowLocal=true)
15/06/05 12:40:14 INFO scheduler.DAGScheduler: Final stage: Stage 1(first at
<console>:15)
15/06/05 12:40:14 INFO scheduler.DAGScheduler: Parents of final stage: List()
15/06/05 12:40:14 INFO scheduler.DAGScheduler: Missing parents: List()
15/06/05 12:40:14 INFO scheduler.DAGScheduler: Submitting Stage 1
(Gutenberg/Pride_and_Prejudice.txt MappedRDD[3] at textFile at <console>:12), which
has no missing parents
15/06/05 12:40:14 INFO storage.MemoryStore: ensureFreeSpace(2560) called with
curMem=631860, maxMem=278302556
15/06/05 12:40:14 INFO storage.MemoryStore: Block broadcast_3 stored as values in
memory (estimated size 2.5 KB, free 264.8 MB)
15/06/05 12:40:14 INFO storage.MemoryStore: ensureFreeSpace(1901) called with
curMem=634420, maxMem=278302556
15/06/05 12:40:14 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes
in memory (estimated size 1901.0 B, free 264.8 MB)
15/06/05 12:40:14 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory
on localhost:46250 (size: 1901.0 B, free: 265.3 MB)
15/06/05 12:40:14 INFO storage.BlockManagerMaster: Updated info of block
broadcast_3_piece0
15/06/05 12:40:14 INFO spark.SparkContext: Created broadcast 3 from broadcast at
DAGScheduler.scala:838
15/06/05 12:40:14 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage
1 (Gutenberg/Pride_and_Prejudice.txt MappedRDD[3] at textFile at <console>:12)
15/06/05 12:40:14 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
15/06/05 12:40:14 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID
1, localhost, ANY, 1343 bytes)
15/06/05 12:40:14 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 1)
15/06/05 12:40:14 INFO rdd.HadoopRDD: Input split:
hdfs://ibmclass.localdomain:8020/user/biadmin/Gutenberg/Pride_and_Prejudice.txt:0+348
901
15/06/05 12:40:14 INFO mapred.LineRecordReader: Found UTF-8 BOM and skipped it
15/06/05 12:40:14 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 1).
1796 bytes result sent to driver
15/06/05 12:40:14 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID
1) in 21 ms on localhost (1/1)
15/06/05 12:40:14 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks
have all completed, from pool
15/06/05 12:40:14 INFO scheduler.DAGScheduler: Stage 1 (first at <console>:15)
finished in 0.021 s
15/06/05 12:40:14 INFO scheduler.DAGScheduler: Job 1 finished: first at <console>:15,
took 0.030293 s
res2: String = PRIDE AND PREJUDICE
scala>
The first actual line in the file has the string: PRIDE AND PREJUDICE.
This string is the title of the book.
Scala, Python, and Java are each substantive languages. It is not our goal to
teach you the complete Scala language in this unit, but merely to introduce you
to it.
Scala, like Java, is a compiled language that runs on the JVM; it has a compiler, scalac, just
as Java has its compiler, javac. The Spark shell provides an interactive REPL on top of it.
The next stage in your learning should be to take the free Cognitive Class
course on Spark, which has programming exercises in Scala and Python that
carry on from what you have learned here. From there you would progress to
one of the textbooks on learning Scala, but be warned that the best ones are
often over 500 pages long.
Close all open windows.
Results:
You learned to use some of the fundamental aspects of running Spark in the
HDP environment.
Unit objectives
• List the characteristics of representative data file formats, including
flat/text files, CSV, XML, JSON, and YAML
• List the characteristics of the four types of NoSQL datastores
• Describe the storage used by HBase in some detail
• Describe and compare the open source programming languages, Pig
and Hive
• List the characteristics of programming languages typically used by
Data Scientists: R and Python
Unit objectives
Introduction to data
• "Data are values of qualitative or quantitative variables, belonging to a
set of items" - Wikipedia
• Common data representation formats used for big data include:
Row- or record-based encodings:
− Flat files / text files
− CSV and delimited files
− Avro / SequenceFile
− JSON
− Other formats: XML, YAML
Column-based storage formats:
− RC / ORC file
− Parquet
NoSQL datastores
• Compression of data
Introduction to data
Storing petabytes of data in Hadoop is relatively easy, but it is more important to
choose an efficient storage format for faster querying.
Row-based encodings (Text, Avro, JSON) with a general purpose compression library
(GZip, LZO, CMX, Snappy) are common mainly for interoperability reasons, but
column-based storage formats (Parquet, ORC) provide not only faster query execution
by minimizing IO but also great compression.
Compression is important to big data file storage:
• Reduces file sizes and thus speeds up transfer to/from disk
• Generally faster to transfer a small file and then decompress than to transfer a
larger file
Data munging as a process typically follows a set of general steps which begin with
extracting the data in a raw form from the data source, munging the raw data using
algorithms (like sorting) or parsing the data into predefined data structures, and finally
depositing the resulting content into a data sink for storage and future use.[1] Given the
rapid growth of the internet such techniques will become increasingly important in the
organization of the growing amounts of data available.
In the world of data warehousing, ETL (extract-transform-load) is referred to; here the
process is often ELT - load first, transform later.
Dasu and Johnson's book (2003) is available online at https://goo.gl/nIoSvj (this url
shortener will open at the quotation).
Data wrangling is also the subject of a New York Times article:
http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-
insights-is-janitor-work.html (For Big-Data Scientists, "Janitor Work" Is Key Hurdle to
Insights).
Avro/SequenceFile
• Avro data files are a compact, efficient binary format that provides
interoperability with applications written in other programming
languages
Avro also supports versioning, so that when, for example, columns are
added or removed from a table, previously imported data files can be
processed along with new ones
• SequenceFiles are a binary format that store individual records in
custom record-specific data types
This format supports exact storage of all data in binary representations, and
is appropriate for storing binary data (for example, VARBINARY columns),
or data that will be principally manipulated by custom MapReduce programs
Reading from SequenceFiles is higher-performance than reading from text
files, as records do not need to be parsed.
Avro/SequenceFile
Doug Cutting (one of the original developers of Hadoop) answered the question, “What
are the advantages of Avro's object container file format over the SequenceFile
container format?” http://www.quora.com/What-are-the-advantages-of-Avros-object-
container-file-format-over-the-SequenceFile-container-format
References:
• Get Started With JSON:
http://www.webmonkey.com/2010/02/get_started_with_json
• GNIP: https://www.gnip.com
The two basic data structures of JSON have also been described as dictionaries
(maps) and lists (arrays). JSON treats an object as a dictionary where attribute names
are used as keys into the map.
• Dictionaries are defined in a way that may be familiar to anyone who has
initialized a python dict with some values (or has printed out the contents of a
dict), pairs of keys and values, separated by a ":", with each key-value pair
delimited by a ",", and each entire object/record surrounded by "{}".
• Lists are also represented using Python-like syntax: a sequence of values
separated by ",", surrounded by "[ ]". These two data structures can be arbitrarily
nested, e.g., a dictionary that contains a list of dictionaries, etc.
Additionally, individual attributes can be text strings, surrounded by double quotes (" "),
numbers, true/false, or null. Note that there is no native support for a 'set' data structure.
Typically a set is transformed into a list when an object is written to JSON, and read back
into a set when being consumed; for instance, in Python: some_set = set([a list, here]).
Quotes in text fields are escaped like \". Note that when inserted into a
file, by convention JSON objects are typically written one per line.
Python's standard library includes a JSON package. This is very useful for reading a
raw JSON string into a dictionary; however, transforming that map into an actual object,
and writing an arbitrary object out into JSON may require some additional
programming.
(Fragment of an XML example: an <Items> document containing <listing> elements, each
with a <description> such as "red shoes" or "black hat".)
Parquet
• "Apache Parquet is a columnar storage format available to any project
in the Hadoop ecosystem, regardless of the choice of data processing
framework, data model or programming language"
http://parquet.apache.org
• Compressed, efficient columnar storage, developed by Cloudera and
Twitter
Efficiently encodes nested structures and sparsely populated data based on
the Google BigTable / BigQuery / Dremel definition/repetition levels
Allows compression schemes to be specified on a per-column level
Future-proofed to allow more encodings to be added as they are
invented and implemented
• Provides some of the best results in various benchmark and performance
tests
Parquet
This is the official announcement from Cloudera and Twitter about Parquet, an efficient
general-purpose columnar file format for Apache Hadoop (from
http://blog.cloudera.com/blog/2013/03/introducing-parquet-columnar-storage-for-
apache-hadoop):
Parquet is designed to bring efficient columnar storage to Hadoop. Compared to, and
learning from, the initial work done toward this goal in Trevni, Parquet includes the
following enhancements:
• Efficiently encode nested structures and sparsely populated data based on the
Google Dremel definition/repetition levels
• Provide extensible support for per-column encodings
(such as delta, run length, etc.)
• Provide extensibility of storing multiple types of data in column data
(such as indexes, bloom filters, statistics)
• Offer better write performance by storing metadata at the end of the file
• A new columnar storage format was introduced for Hadoop called Parquet,
which started as a joint project between Twitter and Cloudera engineers.
• Parquet was created to make the advantages of compressed, efficient columnar
data representation available to any project in the Hadoop ecosystem,
regardless of the choice of data processing framework, data model, or
programming language.
• Parquet is built from the ground up with complex nested data structures in mind.
This seems to be a very efficient method of encoding data in non-trivial object
schemas.
• Parquet is built to support very efficient compression and encoding schemes.
Parquet allows compression schemes to be specified on a per-column level,
and is future-proofed to allow adding more encodings as they are invented and
implemented. The concepts of encoding and compression are separated,
allowing Parquet consumers to implement operators that work directly on
encoded data without paying decompression and decoding penalty when
possible.
• Parquet is built to be used by anyone. The Hadoop ecosystem is rich with data
processing frameworks. An efficient, well-implemented columnar storage
substrate should be useful to all frameworks without the cost of extensive and
difficult to set up dependencies.
• The initial code defines the file format, provides Java building blocks for
processing columnar data, and implements Hadoop Input/Output Formats, Pig
Storers/Loaders, and an example of a complex integration, Input/Output formats
that can convert Parquet-stored data directly to and from Thrift objects.
Related references:
• Google Big Table: https://cloud.google.com/files/BigQueryTechnicalWP.pdf
• Dremel: http://research.google.com/pubs/pub36632.html
NoSQL
• NoSQL, aka "Not only SQL," aka "Non-relational" was specifically
introduced to handle the rise in data types, data access, and data
availability needs brought on by the dot-com boom
• It is generally agreed that there are four types of NoSQL datastores:
Key-value stores
Graph stores
Column stores
Document stores
• Why consider NoSQL?
Flexibility
Scalability - they scale horizontally rather than vertically
Availability
Lower operational costs
Specialized capabilities
NoSQL
Examples of each of the four types of NoSQL datastores:
• Key-value stores: MemcacheD, REDIS, and Riak
• Graph stores: Neo4j and Sesame
• Column stores: HBase and Cassandra
• Document stores: MongoDB, CouchDB, Cloudant, and MarkLogic
NoSQL datastores will not be replacing traditional RDBMSs, neither transactional
relational databases nor data warehouses. Nor will Hadoop replace them.
HBase
• An implementation of Google's BigTable Design
"A Bigtable is a sparse, distributed, persistent multidimensional sorted map"
- http://research.google.com/archive/bigtable-osdi06.pdf
• An open source Apache Top Level Project
Embraced and supported by IBM and all leading Hadoop distributions
• Powers some of the leading sites on the Web, such as Facebook,
Yahoo, etc.
http://wiki.apache.org/hadoop/Hbase/PoweredBy
• Is a NoSQL datastore: column datastore
HBase
In 2004 Google began development of a distributed storage system for structured data
called BigTable. Google engineers designed a system for storing big data that can
scale to petabytes leveraging commodity servers. Projects at Google like Google Earth,
web indexing and Google Finance required a new cost effective, robust and scalable
system that the traditional RDBMS was incapable of supporting. In November of 2006
Google released a whitepaper describing BigTable (see the URL). Roughly one year
later, Powerset created HBase, a BigTable implementation compliant with the
Google specification. In 2008 HBase was released as an open source Apache project,
and it later became a top-level Apache project (http://hbase.apache.org); HBase is now
the Hadoop database.
HBase powers some of the leading sites on the World Wide Web; refer to:
http://wiki.apache.org/hadoop/Hbase/PoweredBy
What is HBase?
• An industry leading implementation of Google's BigTable Design
What's a BigTable?
− "A Bigtable is a sparse, distributed, persistent multidimensional sorted map"
http://research.google.com/archive/bigtable-osdi06.pdf
• An open source Apache Top Level Project
Embraced and supported by IBM
• HBase powers some of the leading sites on the Web (such as
Facebook, Yahoo, etc.)
http://wiki.apache.org/hadoop/Hbase/PoweredBy
• A NoSQL data store
What is HBase?
(Slide shows a sample relational table with columns Id, Fn, Ln, and Addr for comparison.)
Why NoSQL?
So why consider NoSQL Technology? This slide presents three key reasons:
1. Massive data sets exhaust the capacity and scale of existing RDBMSs. Buying
additional licenses and adding more CPU, RAM and DISK is very expensive and
not linear in cost. Many companies and organizations also want to leverage more
cost effective commodity systems and open source technologies. NoSQL
technology deployed on commodity high volume servers can solve these
problems.
2. Distributing the RDBMS is operationally challenging and often technically
impossible. The architecture breaks down when sharding is implemented on a
large scale. Denormalization of the data model, joins, referential integrity and
rebalancing are common issues.
3. Unstructured data (such as social media data: Twitter, Facebook, email, etc.)
and semi-structured data (such as application logs, security logs) do not fit the
traditional model of schema-on-ingest. Typically the schema is developed after
ingest and analysis. Unstructured and semi-structured data generate a variable
number of fields and variable data content, and are therefore problematic for the
data architect when designing the database. There may be many NULL fields
(sparse data), and/or the number and type of fields may be variable.
HBase handles sharding seamlessly and automatically and benefits from the non-
disruptive horizontal scaling feature of Hadoop. When more capacity and/or
performance is needed, users add datanodes to the Hadoop cluster. This provides
immediate growth to HBase data stores since HBase leverages the HDFS. Users can
easily scale from TBs to PBs as their capacity needs grow.
HBase supports a flexible and dynamic data model. The Schema does not need to be
defined up front which makes HBase a natural fit for many big data applications and
some traditional applications as well.
The Apache Hadoop file system, HDFS, does not naturally support applications
requiring random read/write capability. HDFS was designed for large sequential batch
operations (e.g. write once with many large sequential reads during analysis). HBase
does support high performance random r/w applications and for this reason alone it is
often leveraged in Hadoop applications.
Note: Column families contain columns with time-stamped versions. Columns only exist
when inserted (i.e. sparse).
(The slide also begins a definition: "In computer science, an associative array, map, or ...")
Column family
• Basic storage unit. Columns in the same family should have similar
properties and similar size characteristics
• Configurable by column family
Multiple time-stamped versions
− Such as 3rd dimension to Tables
Compression
− (none, Gzip, LZO, SNAPPY)
Version retention policies
− Time To Live (TTL)
• A column is named using the following syntax: family:qualifier
"Column keys are grouped into sets called column families, which form the
basic unit of access control" - Google BigTable paper
Column family
HBase versus RDBMS (HBase side, from the slide):
• Good for sparse data, since non-existent columns are simply ignored
• There are no nulls
Indexes in HBase
• All table accesses are via table row key, effectively, its primary key.
• Rows are stored in byte-lexicographical order
• Within a row, columns are stored in sorted order, with the result that it
is fast, cheap, and easy to scan adjacent rows and columns
Partial key lookups also possible
• HBase does not support indexes natively
Instead, a Table can be created that serves the same purpose
• HBase supports "bloom filters"
Used to decide quickly if a particular row / column combination exists in the
store file and reduce IO and access time
• Secondary Indexes are not supported
Key
Indexes in HBase
This slide explains how data in HBase is sorted and can be searched and indexed.
The sorting within Tables makes adjacent queries and scans more efficient.
HBase does not support indexes natively but Tables can be created to serve the same
purpose.
Bloom filters can be used to reduce IOs and look up time. More information on Bloom
filters can be found at: http://en.wikipedia.org/wiki/Bloom_filter
More information on Bloom Filter usage in HBase can be found in: George, L. (2011).
HBase: The definitive guide. Sebastopol, CA: O'Reilly Media, p. 372.
Pig versus Hive:
• Pig: Not suitable for ad-hoc queries, but happy to do grunt work. Hive: Very good for
ad-hoc analysis, but not necessarily for end users; leverages SQL expertise.
• Pig: Reads data in a large number of file formats and databases. Hive: Uses a SerDe
(serialization/deserialization) interface to read data from a table and write it back out in
any custom format; many standard SerDes are available, and you can write your own
for custom formats.
• Pig: The compiler translates Pig Latin into sequences of MapReduce programs.
• Pig: Recommended for people who are familiar with scripting languages like Python.
Hive: Recommended for people who are familiar with SQL.
Pig
• Pig runs in two modes:
Local mode: on a single machine without requirement for HDFS
MapReduce/Hadoop mode: execution on an HDFS cluster, with the Pig
script converted to a MapReduce job
• When Pig runs in an interactive shell, the prompt is grunt>
• Pig scripts have, by convention, a suffix of .pig
• Pig programs are written in the language Pig Latin
Pig
Diagram from: deRoos, D., Zikopoulos, P. C., Melnyk, R. B., Brown, B., & Coss, R.
(2014). Hadoop for dummies. Hoboken, NJ: Wiley. p. 125
-- Extract words from each line and put them into a Pig bag,
-- and then flatten the bag to get one word for each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(LINE)) AS word;
What is Hive?
• A system for managing and querying structured data built on top of
Hadoop
MapReduce for execution
HDFS for storage
Metadata on raw files
• Key Building Principles:
SQL is a familiar data warehousing language
Extensibility - Types, Functions, Formats, Scripts
Scalability and Performance
What is Hive?
The Apache Hive data warehouse software facilitates querying and managing large
datasets residing in distributed storage. Built on top of Apache Hadoop, it provides:
• tools to enable easy data extract/transform/load (ETL)
• a mechanism to impose structure on a variety of data formats
• access to files stored either directly in Apache HDFS or in other data storage
systems such as Apache HBase
• query execution via MapReduce
Hive defines a simple SQL-like query language, called QL, which enables users familiar
with SQL to query the data. At the same time, this language also allows programmers
who are familiar with the MapReduce framework to be able to plug in their custom
mappers and reducers to perform more sophisticated analysis that may not be
supported by the built-in capabilities of the language. QL can also be extended with
custom scalar functions (UDF's), aggregations (UDAF's), and table functions (UDTF's).
Hive does not mandate read or written data be in the "Hive format" - there is no such
thing. Hive works equally well on Thrift, control delimited, or your specialized data
formats. Please see File Formats and Hive SerDe in the Developer Guide for details.
Hive is not designed for OLTP workloads and does not offer real-time queries or row-
level updates. It is best used for batch jobs over large sets of append-only data (like
web logs). What Hive values most are scalability (scale out with more machines added
dynamically to the Hadoop cluster), extensibility (with MapReduce framework and
UDF/UDAF/UDTF), fault-tolerance, and loose-coupling with its input formats.
Components of Hive include HCatalog and WebHCat:
• HCatalog is a component of Hive. It is a table and storage management layer
for Hadoop that enables users with different data processing tools - including
Pig and MapReduce - to more easily read and write data on the grid.
• WebHCat provides a service that you can use to run Hadoop MapReduce (or
YARN), Pig, Hive jobs or perform Hive metadata operations using an HTTP
(REST style) interface.
(Fragment of a Java MapReduce WordCount program, from the slide:)
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    ...
}
Job job = new Job(conf, "wordcount");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
Hive components
(Diagram: Hive components. Applications connect through client interfaces: Thrift client,
JDBC, ODBC (remote), a web browser via the Hive Web Interface, or the CLI (hive> prompt).
Query execution is handled by the Hive Server and Driver; metadata is held in the Hive
Metastore, and job configuration is passed via JobConf.)
Hive components
These are the major components that you might deal with when working with Hive.
Physical layout
• Hive warehouse directory structure:
  .../hive/warehouse/
      db1.db/                Databases (schemas) are contained in ".db" subdirectories
          tab1/
              tab1.dat       Table "tab1"
          part_tab1/         Table partitioned by the "state" column;
              state=NJ/      one subdirectory per unique value
                  part_tab1.dat
              state=CA/
                  part_tab1.dat
  Query predicates eliminate partitions (directories) that need to be read.
Physical layout
Creating a table
• Creating a delimited table:
hive> create table users
(
  id int,
  office_id int,
  name string,
  children array<string>
)
row format delimited
fields terminated by '|'
collection items terminated by ':'
stored as textfile;

file: users.dat
1|1|Bob Smith|Mary
2|1|Frank Barney|James:Liz:Karen
3|2|Ellen Lacy|Randy:Martin
4|3|Jake Gray|
5|4|Sally Fields|John:Fred:Sue:Hank:Robert
• Inspecting tables:
hive> show tables;
OK
users
Time taken: 2.542 seconds
Creating a table
Two levels of separator are used here:
• Column or attribute separator: |
• Collection item separator: :
$ hive \
--auxpath \
$HIVE_SRC/build/dist/lib/hive-hbase-handler-0.9.0.jar,\
$HIVE_SRC/build/dist/lib/hbase-0.92.0.jar,\
$HIVE_SRC/build/dist/lib/zookeeper-3.3.4.jar,\
$HIVE_SRC/build/dist/lib/guava-r09.jar \
-hiveconf hbase.master=hbase.yoyodyne.com:60000
(From the slide: a Hive table mapped onto an HBase table.)
Hive table hbase_table_1:
  key    value1   value2   value3
  15     "fred"   357      94837
HBase table MY_TABLE:
  (key)   family a: b   family a: c   family d: e
  "15"    "fred"        "357"         0x17275
R versus Python:
• R: "R is an interactive environment for doing statistics" - "has a programming language,
rather than being a programming language". Python: a real programming language.
• R: Rich set of libraries, graphic and otherwise, that are very suitable for Data Science.
Python: Lacks some of R's richness for data analytics, but said to be closing the gap.
• R: Better if the need is to perform data analysis. Python: Better for more generic
programming tasks (e.g. workflow control of a computer model).
• R: Focuses on better, user-friendly data analysis, statistics, and data models.
Python: Emphasizes productivity and code readability.
• R: More adoption from researchers, data scientists, statisticians, and quants.
Python: More adoption from developers and programmers.
• Both have active user communities.
• R: Standard R library plus many, many additional libraries, where statistical algorithms
often appear first. Python: Python + numpy + scipy + scikit + Django + Pandas.
Quick overview of R
• R is a free interpreted language
• Best suited for statistical analysis and modeling
Data exploration and manipulation
Descriptive statistics
Predictive analytics and machine learning
Visualization
+++
• Can produce "publication quality graphics"
• Emerging as a competitor to proprietary platforms
Widely used in Universities and companies
Not as easy to use or performant as SAS or SPSS
• Algorithms tend to first be available in R
Made available by companies and universities as packages
− rpart(classification), tree(random forest trees), etc.
Quick overview of R
More details on R can be found widely in the web. It is similar to Matlab and
SAS/SPSS.
Pluses and minuses:
+ Very sophisticated algorithms,
+ Very good graphics
- Not very good data transformation and file type support
- Not very fast
- Mostly in memory so limited data sizes
R clients
RStudio (shown) is a simple, popular IDE, but there are others too…
(Screenshot of RStudio: write code in the file editor pane or directly in the console; run it
and see the results in the console.)
R clients
There are different R clients out there. This is R Studio (a free download, available for
Windows, Linux, and Mac); it is the most common one.
Simple example
• R supports atomic data types, lists, vectors, matrices and data frames
• Data Frames are analogous to database tables
Can be created or read from CSV files
• Large set of statistical functions
Additional functions can be loaded as packages
# Vectors
> kidsNames <- c("Henry", "Peter", "Allison")
> kidsAges <- c(7, 11, 17)
# data.frame
> kids <- data.frame(ages = kidsAges, names = kidsNames)
> print(kids)
ages names
1 7 Henry
2 11 Peter
3 17 Allison
> mean(kids$ages)
[1] 11.66667
Simple example
This is a very simple interactive R program that works on basic data. The program
creates two vectors, then merges them into one data frame as two columns.
It then shows how to display the table and how to compute a basic average (mean).
This type of interactive use of R can be done in the free, downloadable RStudio, which
runs on Linux, Windows, and Mac.
import re
import sys

# Count word occurrences in the file named on the command line (sys.argv[1]);
# utf-8-sig strips a leading byte-order mark if present.
file = open(sys.argv[1], "r", encoding="utf-8-sig")
wordcount = {}
# Split on spaces, newlines, and the punctuation characters , ; '
for word in re.split(r"[ ,;'\n]+", file.read()):
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
for key in wordcount.keys():
    print("%s %s" % (key, wordcount[key]))
file.close()
Checkpoint
1. List the common data types of big data.
2. True or False. NoSQL database is designed for those that do not
want to use SQL.
3. Which database is a columnar storage database?
4. Which database provides a SQL for Hadoop interface?
Checkpoint
Checkpoint solution
1. List the common data types of big data.
Flat files, text files, CSV, delimited files, Avro, SequenceFile, JSON, RC /
ORC, Parquet
2. True or False. NoSQL database is designed for those that do not
want to use SQL.
False. NoSQL means "Not Only SQL," so such a database offers much more than
just plain SQL.
3. Which database is a columnar storage database?
HBase
4. Which database provides a SQL for Hadoop interface?
Hive
Checkpoint solution
Unit summary
• List the characteristics of representative data file formats, including
flat/text files, CSV, XML, JSON, and YAML
• List the characteristics of the four types of NoSQL datastores
• Describe the storage used by HBase in some detail
• Describe and compare the open source programming languages, Pig
and Hive
• List the characteristics of programming languages typically used by
Data Scientists: R and Python
Unit summary
Lab 1
Using Hive to access Hadoop/HBase data
Lab 1:
Using Hive to access Hadoop/HBase data
Purpose:
This lab is intended to provide you with experience in accessing
Hadoop/HBase data using a command line interface (CLI).
If Hive is not running, you will have to start it using the central panel.
When running, minimize the Ambari Web Console browser.
4. In a new terminal window, type cd to change to your home directory.
hbase(main):001:0>
The last line here is the prompt for the Hbase CLI Client. Note that the
interactions are numbered (001) for each CLI session.
6. To view the online CLI help manual, type help at the prompt and then press
Enter:
hbase(main):001:0> help
HBase Shell, version 0.98.8_HDP_4-hadoop2, rUnknown, Fri Mar 27 21:53:57 PDT 2015
Type 'help "COMMAND"', (e.g. 'help "get"' -- the quotes are necessary) for help on a
specific command.
Commands are grouped. Type 'help "COMMAND_GROUP"', (e.g. 'help "general"') for help on a
command group.
COMMAND GROUPS:
Group name: general
Commands: status, table_help, version, whoami
SHELL USAGE:
Quote all names in HBase Shell such as table and column names. Commas delimit
and are opened and closed with curley-braces. Key/values are delimited by the
'=>' character combination. Usually keys are predefined constants such as
NAME, VERSIONS, COMPRESSION, etc. Constants do not need to be quoted. Type
'Object.constants' to see a (messy) list of all constants in the environment.
If you are using binary keys or values and need to enter them in the shell, use
double-quote'd hexadecimal representation. For example:
The HBase shell is the (J)Ruby IRB with the above HBase-specific commands added.
For more on the HBase Shell, see http://hbase.apache.org/book.html
hbase(main):002:0>
Take a minute to look through the Help to see what is available to you. Note that
in the commands, the names of the tables, rows, column families are all in
quotes. You will need to make sure that when you are referring to specific
tables, rows, column families, that they are enclosed in quotes.
Practical notes:
• Command (such as create) must be lowercase
• Table name (such as t1) must be quoted, …
• In interactive mode, you do not need a semicolon as statement terminator
or separator (unlike standard SQL)
7. Create an Hbase table using the create command:
create 't1', 'cf1', 'cf2', 'cf3'
to create a table t1 with three column families (cf1, cf2, cf3).
hbase(main):002:0> create 't1', 'cf1', 'cf2', 'cf3'
0 row(s) in 5.9840 seconds
=> Hbase::Table - t1
hbase(main):003:0>
Note that these single quotes are plain straight quotes (') and not Microsoft smart
quotes (‘’). Take care anytime, here and elsewhere, when you cut and paste code,
that Microsoft or other products have not changed the original code by
converting it to "smart quotes".
Other notes:
• The create command only requires the name of the table and one or more
column families. Columns can be added dynamically to the table. Also,
each row can have a different set of columns (within each column family).
However, the table may not be mappable to SQL in such cases.
• Our column family names have been deliberately kept short. This is a best
practice: keep your column family names short. For example, instead of
'col_fam1' use 'cf1'. HBase stores the full names on every node where the
data resides, so a long name is repeated across all of those nodes and
increases the total storage used. You want to avoid this by using as short
a name as possible.
• The table name does not need to be short. The name t1 here is short, and
cryptic, merely to save your typing convenience. In reality you should use
more expressive names as that is better documentation of your data
model.
8. Type describe 't1' to verify your table creation.
hbase(main):001:0> describe 't1'
Table t1 is ENABLED
t1
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE =>
'0', VERSIONS => '1', COMPRESSION => 'NONE', M
IN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536',
IN_MEMORY => 'false', BLOCKCACHE => 'true'}
{NAME => 'cf2', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE =>
'0', VERSIONS => '1', COMPRESSION => 'NONE', M
IN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536',
IN_MEMORY => 'false', BLOCKCACHE => 'true'}
{NAME => 'cf3', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE =>
'0', VERSIONS => '1', COMPRESSION => 'NONE', M
IN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536',
IN_MEMORY => 'false', BLOCKCACHE => 'true'}
3 row(s) in 1.4450 seconds
hbase(main):002:0>
Note that directories are created. Files will be put into those directories as
records are stored into the table t1.
10. In the first terminal window, specify compression for a column family in the table
using the following statement:
alter 't1', {NAME => 'cf1', COMPRESSION => 'GZ'}
hbase(main):003:0>
The other compression algorithms that are supported, but would need extra
configuration, are SNAPPY and LZO. Note that GZIP is slower but offers the
most efficient compression.
Sometimes you may find that you need to disable the table prior to executing
the alter statement:
disable 't1'
alter 't1', {NAME => 'cf1', COMPRESSION => 'GZ'}
You will now make additional changes to the metadata for the table.
11. Specify the IN_MEMORY option for a column family that will be queried
frequently. This does not ensure the data will be in memory always. It only gives
priority for the corresponding data to stay in the cache longer.
alter 't1', {NAME => 'cf1', IN_MEMORY => 'true'}
12. Specify the required number of versions for a column family. By default, HBase
stores 1 version of each value, but you can configure it to store more than 1 version.
Enter the following in one continuous line:
alter 't1', {NAME => 'cf1', VERSIONS => 3},
{NAME => 'cf2', VERSIONS => 2}
13. Run the describe statement again, and verify that these changes were made to
the table and the column families.
describe 't1'
14. Insert data into the table using the put command. Each row can have a
different set of columns. The set of put commands below inserts two rows with
different sets of column names. Go ahead and enter each of these commands
(singly or as a group) into the HBase CLI shell, or insert something similar of
your choice.
put 't1', 'row1', 'cf1:c11', 'r1v11'
put 't1', 'row1', 'cf1:c12', 'r1v12'
put 't1', 'row1', 'cf2:c21', 'r1v21'
put 't1', 'row1', 'cf3:c31', 'r1v31'
put 't1', 'row2', 'cf1:d11', 'r2v11'
put 't1', 'row2', 'cf1:d12', 'r2v12'
put 't1', 'row2', 'cf2:d21', 'r2v21'
hbase(main):016:0>
15. To view the data, you may use the get command to retrieve an individual row,
or the scan command to retrieve multiple rows:
get 't1', 'row1'
hbase(main):021:0> get 't1', 'row1'
COLUMN CELL
cf1:c11 timestamp=1433702916069, value=r1v11
cf1:c12 timestamp=1433702916309, value=r1v12
cf2:c21 timestamp=1433702916379, value=r1v21
cf3:c31 timestamp=1433702916476, value=r1v31
4 row(s) in 0.0570 seconds
Curiosity point: All data is versioned, either using an integer timestamp (by default,
milliseconds since the epoch, 1 Jan 1970 UTC/GMT), or another integer of your choice.
16. If you run the scan command, you will see a per-row listing of values:
scan 't1'
hbase(main):023:0>
Notes:
• The above scan results show that HBase tables do not require a set schema.
This is good for some applications that need to store arbitrary data. To put this
in other words, HBase does not store null values. If a value for a column is null
(e.g. values for d11, d12, d21 are null for row1), it is not stored. This is one
aspect that makes HBase work well with sparse data.
• In addition to the actual column value (r1v11), each result row has the row key
value (row1), column family name (cf1), column qualifier/column (c11)
and timestamp. These pieces of information are also stored physically for each
value. Having a large number of columns with values for all rows (in other
words, dense data) would mean this information gets repeated. Also, larger row
key values, longer column family and column names would increase the storage
space used by a table. For example use r1 instead of row1.
Good business practices:
• Try to use smaller row key values, column family and qualifier names.
• Try to use fewer columns if you have dense data
Some configuration is needed for Hive. With the appropriate configuration and
setup, the HBase table that you created can be accessed with Hive and
HiveQL.
It is recommended that you take the Cognitive Class course and/or continue
your learning with one of the many tutorials available on the Internet.
3. Close all open windows.
Results:
You accessed Hadoop/HBase data using a command line interface (CLI).
Unit objectives
• Understand the challenges posed by distributed applications and how
ZooKeeper is designed to handle them
• Explain the role of ZooKeeper within the Apache Hadoop infrastructure
and the realm of Big Data management
• Explore generic use cases and some real-world scenarios for
ZooKeeper
• Define the ZooKeeper services that are used to manage distributed
systems
• Explore and use the ZooKeeper CLI to interact with ZooKeeper
services
• Understand how Apache Slider works in conjunction with YARN to
deploy distributed applications and to monitor them
• Explain how Apache Knox provides peripheral security services to an
Hadoop cluster
Unit objectives
Topic: ZooKeeper
Topic: ZooKeeper
ZooKeeper
• ZooKeeper provides support for writing distributed applications in the
Hadoop ecosystem
• ZooKeeper addresses the issue of partial failure
Partial failure is intrinsic to distributed systems
ZK provides a set of tools to build distributed applications that can safely
handle partial failures
• ZooKeeper has the following characteristics:
simple
expressive
highly available
facilitates loosely coupled interactions
is a library
• Apache ZooKeeper is an open source server that enables highly
reliable distributed coordination
ZooKeeper
ZooKeeper is a centralized service for maintaining configuration information, naming,
providing distributed synchronization, and providing group services. All of these kinds of
services are used in some form or another by distributed applications. Each time they
are implemented there is a lot of work that goes into fixing the bugs and race conditions
that are inevitable. Because of the difficulty of implementing these kinds of services,
applications initially usually skimp on them, which makes them brittle in the presence of
change and difficult to manage. Even when done correctly, different implementations of
these services lead to management complexity when the applications are deployed.
Information on Zookeeper can be found at:
• Primary website: http://zookeeper.apache.org
• Wiki: https://cwiki.apache.org/confluence/display/ZOOKEEPER/Index
• Documentation: http://zookeeper.apache.org/doc/current/index.html
• FAQ: https://cwiki.apache.org/confluence/display/ZOOKEEPER/FAQ
• Chapter 21 (pp. 601-40) in: White, T. (2015). Hadoop: The definitive guide (4th
ed.). Sebastopol, CA: O'Reilly Media.
Distributed systems
• Multiple software components on multiple computers, but run as a
single system
• Computers can be physically close (local network), or geographically
distant (WAN)
• The goal of distributed computing is to make such a network work as a
single computer
• Distributed systems offer many benefits over centralized systems
Scalability: System can easily be expanded by adding more machines as
needed
Redundancy: Several machines can provide the same services, so if one is
unavailable, work does not stop; because smaller machines can be used,
redundancy is not prohibitively expensive
Distributed systems
A distributed computer system consists of multiple software components that are on
multiple computers, but run as a single system. The computers that are in a distributed
system can be physically close together and connected by a local network, or they can
be geographically distant and connected by a wide area network. The goal of
distributed computing is to make such a network work as a single computer.
Distributed systems offer many benefits over centralized systems. One benefit is that of
Scalability. Systems can easily be expanded by adding more machines as needed.
Another major benefit of distributed systems is Redundancy. Several machines can
provide the same services, so if one is unavailable, work does not stop. Additionally,
because many smaller machines can be used, this redundancy does not need to be
prohibitively expensive.
What is ZooKeeper?
• Distributed applications require coordination
Develop your own service (a lot of work)
Or, use robust pre-existing coordination services (ZooKeeper)
• Distributed open-source centralized coordination service for:
Maintaining configuration information: Sharing config info across all "nodes"
Naming: Find a machine in a cluster of 1,000s of servers, Naming Service
Providing distributed synchronization: Locks, Barriers, Queues, etc.
Providing group services: Leader election and more
• The ZooKeeper service is distributed, reliable, fast, and simple
What is ZooKeeper?
Distributed applications require coordination. Coordination services are notoriously hard
to get right. They are especially prone to errors such as race conditions and deadlock.
The motivation behind ZooKeeper is to relieve distributed applications of the responsibility
of implementing coordination services from scratch.
You could develop your own coordination service, however that can take a lot of work
and is not a trivial task. The alternative is using a robust coordination service that is
already available for use. ZooKeeper is just that - a distributed centralized coordination
service that is open-source and ready for use.
ZooKeeper is a distributed, open-source coordination service for distributed
applications. It exposes a simple set of primitives that distributed applications can build
upon to implement higher level services for synchronization, configuration maintenance,
and groups and naming. It is designed to be easy to program to, and uses a data model
styled after the familiar directory tree structure of file systems. It runs in Java and has
bindings for both Java and C.
(Diagram: the ZooKeeper service runs as an ensemble of servers.)
Consistency guarantees
• Sequential consistency
Client updates are applied in order they are received/sent
• Atomicity
Updates succeed or fail; partial updates are not allowed
• Single system image
Client sees same view of ZooKeeper service regardless of server
• Reliability
If update succeeds then it persists
• Timeliness
Client view of system is guaranteed up-to-date within a time bound
− Generally within tens of seconds
− If client does not see system changes within time bound, then service-outage
Consistency guarantees
ZooKeeper provides five consistency guarantees:
1. Sequential Consistency: updates from a client to the ZooKeeper service are
applied in the order they are sent.
2. Atomicity: updates in ZooKeeper either succeed or fail. Partial updates are not
allowed.
3. Single System Image: a client will see the same view of the ZooKeeper service
regardless of the server in the ensemble that it is connected to.
4. Reliability: if an update succeeds in ZooKeeper then it will persist and not be
rolled back. The update will only be overwritten when another client performs a
new update.
5. Timeliness: a client's view of the system is guaranteed to be up-to-date within a
certain time bound, generally within tens of seconds. If a client does not see
system changes within that time bound, then the client assumes a service
outage and will connect to a different server in the ensemble.
ZNode operations
• ZNodes are the main entity that a programmer or administrator uses
• Access is also available through a ZooKeeper Shell
Operation Type
create Write
delete Write
exists Read
getChildren Read
getData Read
setData Write
getACL Read
setACL Write
sync Read
ZNode operations
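As a brief illustration, the following is a minimal sketch of these operations using the
ZooKeeper command-line shell (zkCli.sh). The ZNode path /demo-config and its data are
hypothetical, and the server is assumed to be the local one on the default port 2181:
zkCli.sh -server localhost:2181
create /demo-config "version=1"     (write: creates the ZNode with initial data)
get /demo-config                    (read: getData; prints the data and metadata)
set /demo-config "version=2"        (write: setData; increments dataVersion)
ls /                                (read: getChildren of the root ZNode)
stat /demo-config                   (read: metadata only; the CLI equivalent of exists)
delete /demo-config                 (write: removes the ZNode)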
ZNode watches
• Clients can set watches on ZNodes:
NodeChildrenChanged
NodeCreated
NodeDataChanged
NodeDeleted
• Changes to a ZNode trigger the watch and ZooKeeper sends the client
a notification.
• Watches are one-time triggers.
• Watches are always ordered.
• Client sees watched event before new ZNode data.
• Client should handle cases of latency between getting the event and
sending a new request to get a watch
ZNode watches
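A minimal sketch of a one-time data watch, again using the ZooKeeper CLI and the
hypothetical ZNode /demo-config:
get /demo-config watch              (reads the data and registers a one-time data watch)
set /demo-config "version=3"        (issued from another client session)
The first client then receives a WatchedEvent of type NodeDataChanged. Because watches
are one-time triggers, that client must issue another get (or exists/getChildren) with a
watch to be notified of subsequent changes.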
Topic: Slider
Topic: Slider
Slider
• Apache Slider is a YARN application to deploy existing distributed
applications on YARN, monitor them and make them larger or smaller
as desired, even while the application is running.
• Some of the features are:
Allows users to create on-demand applications in a YARN cluster
Allows different users/applications to run different versions of the
application.
Allows users to configure different application instances differently
Stop/restart application instances as needed
Expand/shrink application instances as needed
• The Slider tool is a Java command line application.
Development model: Apache incubator project
Community activity: Vibrant (80+ issues/month)
License: Apache
Established: 2014
Adoption: HDP, Pivotal
Key backers: Hortonworks
Contributors: 20+
URL: slider.incubator.apache.org
Slider
Apache Slider is a YARN application to deploy existing distributed applications on
YARN, monitor them and make them larger or smaller as desired, even while the
application is running.
Applications can be stopped and then started; the distribution of the deployed application
across the YARN cluster is persisted, enabling a best-effort placement close to the
previous locations. Applications that remember the previous placement of data (such
as HBase) can exhibit fast start-up times from this feature.
YARN itself monitors the health of the "YARN containers" hosting parts of the deployed
application and notifies the Slider manager application of container failure. Slider then
asks YARN for a new container, into which Slider deploys a replacement for the failed
component. As a result, Slider can keep the size of managed applications consistent
with the specified configuration, even in the face of failures of servers in the cluster, as
well as parts of the application itself.
(Diagram: the YARN ResourceManager schedules containers on NodeManagers; one container
hosts the Slider Application Master, and other containers host the managed application's
components, such as HBase.)
Topic: Knox
Topic: Knox
Knox
• The Apache Knox Gateway is an extensible reverse proxy framework
for securely exposing REST APIs and HTTP based services at a
perimeter
• Different types of REST access supported: HTTP(S) client, cURL,
Knox Shell (DSL), SSL, …
Knox
The Apache Knox Gateway is a REST API Gateway for interacting with Hadoop
clusters. The Knox Gateway provides a single access point for all REST interactions
with Hadoop clusters. In this capacity, the Knox Gateway is able to provide valuable
functionality to aid in the control, integration, monitoring and automation of critical
administrative and analytical needs of the enterprise.
• Authentication (LDAP and Active Directory Authentication Provider)
• Federation/SSO (HTTP Header Based Identity Federation)
• Authorization (Service Level Authorization)
• Auditing
While there are a number of benefits for unsecured Hadoop clusters, the Knox
Gateway also complements the Kerberos secured cluster quite nicely. Coupled with
proper network isolation of a Kerberos secured Hadoop cluster, the Knox Gateway
provides the enterprise with a solution that:
• Integrates well with enterprise identity management solutions
• Protects the details of the Hadoop cluster deployment (hosts and ports are
hidden from end-users)
• Simplifies the number of services that clients need to interact with
Reference: https://knox.apache.org (including the diagram above).
Knox runs in a firewall DMZ between the external clients and the Apache Hadoop
components.
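As a hedged example, a client outside the firewall could list an HDFS directory through the
gateway with cURL. The host name, topology name (default), and the guest/guest-password
credentials shown here are the Knox demo LDAP defaults and would differ in a real deployment:
curl -iku guest:guest-password \
  'https://knox-gateway.example.com:8443/gateway/default/webhdfs/v1/tmp?op=LISTSTATUS'
The client sees only the gateway's host and port; the NameNode host and port behind the
firewall are never exposed.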
Additional information:
• cURL is a computer software project providing a library and command line tool
for transferring data using various protocols. The cURL project produces two
products, libcurl and cURL. The name is a recursive acronym that stands for
Curl URL Request Library.
• The Knox Wiki (https://cwiki.apache.org/confluence/display/KNOX/Examples)
provides examples of the use of the Knox Shell DSL including an example that
submits the familiar WordCount Java MapReduce job to the Hadoop cluster via
the gateway using the Knox Shell DSL. However, in this case, the job is
submitted via an Oozie workflow, showing several different ways of doing this.
Knox security
• Knox provides perimeter security for Hadoop clusters with the
following advantages:
Simplified access: Extends Hadoop's REST/HTTP services by encapsulating Kerberos
within the cluster
Enhanced security: Exposes Hadoop's REST/HTTP services without revealing network
details, with SSL provided out of the box
Centralized control: Centrally enforces REST API security and routes requests to
multiple Hadoop clusters
Enterprise integration: Supports LDAP, Active Directory, SSO, SAML, and other
authentication systems
Knox security
REST (Representational State Transfer) is an architectural style for networked
hypermedia applications. It is primarily used to build Web services that are lightweight,
maintainable, and scalable. A service based on REST is called a RESTful service.
REST is not dependent on any protocol, but almost every RESTful service uses HTTP
as its underlying protocol.
The client and service talk to each other via messages. Clients send a request to the
server, and the server replies with a response. Apart from the actual data, these
messages also contain some metadata about the message.
Reference: http://www.drdobbs.com/web-development/restful-web-services-a-tutorial/240169069
(Diagram: layers of security around core Hadoop)
• Data Protection
• Perimeter Level Security: network security (firewalls, etc.) and Apache Knox (gateways, etc.)
• Authentication: Kerberos
• Authorization: HDFS permissions, HDFS ACLs, etc.
• OS Security
Checkpoint
1. Which Apache project provides coordination of resources?
2. What is ZooKeeper's role in the Hadoop infrastructure?
3. True or False? Slider provides an intuitive UI which allows you to
dynamically allocate YARN resources.
4. True or False? Knox can provide all the security you need within your
Hadoop infrastructure.
Checkpoint
Checkpoint solution
1. Which Apache project provides coordination of resources?
ZooKeeper
2. What is ZooKeeper's role in the Hadoop infrastructure?
Manage the coordination between HBase servers
Hadoop and MapReduce use ZooKeeper to aid in high availability of the
Resource Manager
Flume uses ZooKeeper for configuration purposes in recent releases
3. True or False? Slider provides an intuitive UI which allows you to
dynamically allocate YARN resources.
False. Slider is a Java command line tool.
4. True or False? Knox can provide all the security you need within your
Hadoop infrastructure.
False. Knox provides one of many layers of security in Hadoop.
Checkpoint solution
Unit summary
• Understand the challenges posed by distributed applications and how
ZooKeeper is designed to handle them
• Explain the role of ZooKeeper within the Apache Hadoop infrastructure
and the realm of Big Data management
• Explore generic use cases and some real-world scenarios for
ZooKeeper
• Define the ZooKeeper services that are used to manage distributed
systems
• Explore and use the ZooKeeper CLI to interact with ZooKeeper
services
• Understand how Apache Slider works in conjunction with YARN to
deploy distributed applications and to monitor them
• Explain how Apache Knox provides peripheral security services to a
Hadoop cluster
ZooKeeper, Slider, and Knox © Copyright IBM Corporation 2018
Unit summary
Lab 1
Explore ZooKeeper
Lab 1:
Explore ZooKeeper
Purpose:
You will connect to ZooKeeper and explore the ZooKeeper files.
5. In the Components list, beside ZooKeeper, expand the drop down and click
Start (if stopped) or Restart (if currently running).
6. Click OK to confirm, and after the background operation is complete, click OK to
close the Background Operations Running window.
7. Minimize the Ambari Web Console browser (Firefox).
Note the names of some directories. These follow, in general, Linux directory
naming conventions:
• bin: Executables to start/stop/interact with ZooKeeper
• conf: ZooKeeper and log configuration files
• doc: ZooKeeper documentation
• lib: Library files for all the binaries held in the /bin directory
WATCHER::
You may have to press Enter to get this last line as your ongoing prompt.
4. Type help to get a list of available commands for the ZK CLI.
[zk: localhost:2181(CONNECTED) 0] help
ZooKeeper -server host:port cmd args
stat path [watch]
set path data [version]
ls path [watch]
delquota [-n|-b] path
ls2 path [watch]
setAcl path acl
setquota -n|-b val path
history
redo cmdno
printwatches on|off
delete path [version]
sync path
listquota path
rmr path
get path [watch]
create [-s] [-e] path data acl
addauth scheme auth
quit
getAcl path
close
connect host:port
[zk: localhost:2181(CONNECTED) 1]
7. Type get /hbase-unsecure to view the data and metadata stored in any of the
nodes.
You should look at the hbase node:
[zk: localhost:2181(CONNECTED) 3] get /hbase-unsecure
cZxid = 0x100000012
ctime = Wed Jan 24 17:45:59 PST 2018
mZxid = 0x100000012
mtime = Wed Jan 24 17:45:59 PST 2018
pZxid = 0x400000054
cversion = 34
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 18
[zk: localhost:2181(CONNECTED) 4]
The meaning of the data / metadata fields for this response is shown in this
table:
There are a number of other commands that can be applied against this ZNode
and other ZNodes of the ZooKeeper Server. For further details check the
documentation.
Extra exercises are available in the Cognitive Class course Developing
Distributed Applications Using ZooKeeper and excellent documentation can be
found on http://zookeeper.apache.org.
8. Close all open windows.
Results:
You connected to ZooKeeper and explored the ZooKeeper files.
Unit objectives
• List some of the load scenarios that are applicable to Hadoop
• Understand how to load data at rest
• Understand how to load data in motion
• Understand how to load data from common sources such as a data
warehouse, relational database, web server, or database logs
• Explain what Sqoop is and how it works
• Describe how Sqoop can be used to import data from relational
systems into Hadoop and export data from Hadoop into relational
systems
• Brief introduction to what Flume is and how it works
Unit objectives
Load scenarios
Data at rest
Data in motion
Load scenarios
Several load scenarios are discussed in this unit:
• Data at rest
• Data in motion
• Data from web server and database logs
• Data from a data warehouse
Data at rest
Data at rest
• Data is already generated into a file or directories
• No additional updates will be done to the file, and it will be transferred over as is
Data at rest
Data at rest is already in a file in some directory. No additional updates are planned on
this data and it can be transferred as is. This can be accomplished using standard
HDFS shell commands, put, cp or copyFromLocal, or using the HDP web console.
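For example, a minimal sketch using the HDFS shell (the local file and target directory
names are illustrative):
hadoop fs -mkdir -p /user/student/landing
hadoop fs -put /tmp/sales_2017.csv /user/student/landing/
hadoop fs -copyFromLocal /tmp/sales_2018.csv /user/student/landing/
hadoop fs -ls /user/student/landing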
Data in motion
Data in motion
Data in motion
What about when data is in motion?
First of all, what is meant by data in motion? This is data that is continually updated.
New data might be added regularly to these data sources, data might be appended to a
file, or several logs might need to be merged into one. You need the capability to merge
such files before copying them into Hadoop; a simple example follows the list below.
Examples of data in motion include:
• Data from a web server such as WebSphere Application Server or Apache
Server
• Data in server logs, or application logs, for example site browsing histories
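As a simple sketch of the merge-then-load approach (file and directory names are
illustrative), rotated web server logs could be concatenated locally and then copied
into HDFS:
cat /var/log/httpd/access_log.2018-03-* > /tmp/access_merged.log
hadoop fs -put /tmp/access_merged.log /user/student/weblogs/
This is adequate for occasional loads; for continuous feeds, a tool such as Flume
(introduced later in this unit) is a better fit.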
Sqoop
• Transfers data between Hadoop and relational databases
Uses JDBC
Must copy the JDBC driver JAR files for any relational databases to
$SQOOP_HOME/lib
• Uses the database to describe the schema of the data
• Uses MapReduce to import and export the data
Import process creates a Java class
− Can encapsulate one row of the imported table
The source code of the class is provided to you
• Can help you to quickly develop MapReduce applications that use
HDFS-stored records
• The import and export options are named in relation to HDFS: import brings data
into HDFS, and export moves data out of HDFS into the database
Sqoop
After completing this topic, you should be able to:
• Explain the use of Sqoop to:
• Import data from relational database tables into HDFS
• Export data from HDFS into relational database tables
• Describe the basic Sqoop commands:
• Import statement
• Export statement
The sqoop binary is /usr/bin/sqoop.
If Accumulo is needed and installed, $ACCUMULO_HOME should be set.
Sqoop connection
• Database connection requirements are the same for import and export
• Specify as command line options:
JDBC connection string
Username
Password
Table
Target directory
Number of Mappers
• Or, put the options into a file for execution
sqoop import|export \
--connect jdbc:db2://your.db2.com:50000/yourDB \
--username db2user --password yourpassword \
...
Sqoop connection
The example shows using a Db2 database connection. Sqoop works with all databases
that have a JDBC connection.
In the exercise later, you will use Sqoop to import data from a MySQL database into HDFS.
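As a sketch of the options-file alternative (host, database, and file names are
illustrative), the recurring connection options can be placed one per line in a text file,
with per-run options still given on the command line. For example, a file
/home/student/db-connect.txt might contain:
import
--connect
jdbc:mysql://dbhost.example.com:3306/salesdb
--username
sqoopuser
-P
It would then be used as follows:
sqoop --options-file /home/student/db-connect.txt --table customers --num-mappers 2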
Sqoop import
• Imports data from relational tables into HDFS
Each row in the table becomes a separate record in HDFS
• Data can be stored
Text files
Binary files
Into HBase
Into Hive
• Imported data
Can be all rows of a table
Can limit the rows and columns
Can specify your own query to access relational data
• Target directory (--target-dir)
Specifies the directory in HDFS in which to place the data
− If omitted, the name of the directory is the name of the table
Loading data with Sqoop © Copyright IBM Corporation 2018
Sqoop import
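A minimal import sketch (database, table, and directory names are illustrative) that
imports selected columns and rows of a MySQL table into HDFS using two parallel map tasks:
sqoop import \
  --connect jdbc:mysql://dbhost.example.com:3306/salesdb \
  --username sqoopuser -P \
  --table customers \
  --columns "id,name,city" \
  --where "city = 'Austin'" \
  --target-dir /user/student/customers \
  --num-mappers 2
If --target-dir were omitted, the data would land in a directory named customers (the
table name) under the user's HDFS home directory.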
Sqoop export
• Exports a set of files from HDFS to a relational database system
Table must already exist
Records are parsed based upon user's specifications
• Default mode is insert
Inserts rows into the table
• Update mode
Generates update statements
Replaces existing rows in the table
Does not generate an insert
− Missing rows are not inserted
− Not detected as an error
• Call mode
Makes a stored procedure call for each record
• --export-dir
Specifies the directory in HDFS from which to read the data
Loading data with Sqoop © Copyright IBM Corporation 2018
Sqoop export
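A minimal export sketch (names illustrative); the target table must already exist in the
database, and in the default mode each HDFS record becomes an INSERT:
sqoop export \
  --connect jdbc:mysql://dbhost.example.com:3306/salesdb \
  --username sqoopuser -P \
  --table customer_summary \
  --export-dir /user/student/summary
Adding --update-key id would switch to update mode, generating UPDATE statements keyed
on the id column instead of inserts.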
Flume
• Distributed, reliable service for efficient collection, aggregation, and
transfer of large amounts of streaming data into Hadoop
• Supports dynamic reconfiguration without the need to restart
• Each Flume agent runs independently of others with no single point
of failure
• Data ingest coordinated by YARN
• Stream sources include application logs, sensor and machine data,
geo-location data and social media
Flume
Flume has agents, installed at a source, which only know how to read data and pass it
on to a sink. A source can be a file, a stream, or another Flume agent. Likewise, the sink
might be, among other things, HDFS or another Flume agent. The final agent in the
chain is the collector.
(Diagram: a Flume agent with a spooling directory source, a channel, and an Avro sink
passes events to a remote agent with an Avro source, a channel, and an HDFS sink that
writes to storage.)
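A minimal single-agent sketch of the same idea (collapsing the two agents into one;
directory and path names are illustrative): a spooling-directory source feeds a memory
channel, and an HDFS sink drains that channel into Hadoop. The properties file, for
example agent1.properties, might look like this:
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = snk1
agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /var/log/incoming
agent1.sources.src1.channels = ch1
agent1.channels.ch1.type = memory
agent1.sinks.snk1.type = hdfs
agent1.sinks.snk1.hdfs.path = /user/student/flume/events
agent1.sinks.snk1.channel = ch1
The agent would then be started with:
flume-ng agent --name agent1 --conf /etc/flume/conf --conf-file agent1.properties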
Checkpoint
1. True or False? Sqoop is used to transfer data between Hadoop and
relational databases.
2. List four different load scenarios.
3. True or False? For Sqoop to connect to a relational database, the JDBC
JAR files for that database must be located in $SQOOP_HOME/bin.
4. If you omit the Target directory (--target-dir) command line option in a
Sqoop import statement, what is used as the default directory name?
5. True or False? Each Flume node receives data as “source”, stores it in a
“channel”, and sends it via a “sink”.
Checkpoint
Checkpoint solution
1. True or False? Sqoop is used to transfer data between Hadoop and
relational databases.
True. Sqoop is used to transfer data between Hadoop and relational databases.
2. List four different load scenarios.
Data at rest, data in motion, data from web server and database logs, and data
from a data warehouse.
3. True or False? For Sqoop to connect to a relational database, the JDBC
JAR files for that database must be located in $SQOOP_HOME/bin.
False. All database connection JAR files must be in $SQOOP_HOME/lib
4. If you omit the Target directory (--target-dir) command line option in a
Sqoop import statement, what is used as the default directory name?
If omitted, the name of the directory is the name of the table.
5. True or False? Each Flume node receives data as “source”, stores it in a
“channel”, and sends it via a “sink”.
True. Nodes can be chained together and each node has its own source, channel,
and sink.
Loading data with Sqoop © Copyright IBM Corporation 2018
Checkpoint solution
Unit summary
• List some of the load scenarios that are applicable to Hadoop
• Understand how to load data at rest
• Understand how to load data in motion
• Understand how to load data from common sources such as a data
warehouse, relational database, web server, or database logs
• Explain what Sqoop is and how it works
• Describe how Sqoop can be used to import data from relational
systems into Hadoop and export data from Hadoop into relational
systems
• Brief introduction to what Flume is and how it works
Unit summary
Lab 1
Moving data into HDFS with Sqoop
Lab 1:
Moving data into HDFS with Sqoop
Purpose:
You will learn how to move data into an HDFS cluster from a relational
database. The relational database that will be used is MySQL since that is
already installed in an ODP cluster.
You will connect into the MySQL database and create some data records in
internal database storage format.
Sqoop will be used to import that data into HDFS (file movements are in
relation to Hadoop and HDFS).
Additional exercises are available in the Cognitive Class course on Moving Data into
Hadoop (https://cognitiveclass.ai/courses/flume-sqoop-moving-data-into-hadoop/).
Task 1. Login to your lab environment and connect to the
MySQL database.
If you have problems connecting to the MySQL database, it may be for one of
several reasons:
• The MySQL database may be not running. Appropriate commands to
check status, start/stop, restart the MySQL server (needs root /
administrative privileges) are:
sudo service mysqld status
sudo service mysqld start
sudo service mysqld stop
sudo service mysqld restart
• The database driver may be out-of-date for your version of MySQL:
You can download the latest MySQL Connector/J driver for Linux from
http://dev.mysql.com/downloads/connector/j and, after unpacking the tar file
(tar xvf), place the mysql-connector-java-*.*.*-bin.jar file into
/usr/hdp/2.6.4.0-91/sqoop/lib directory.
• As noted previously, you may need to verify that your hostname and IP
address are setup correctly, and make changes if they are needed as
noted in earlier units. Note that if you have shut down your lab environment
anytime, or overnight, this verification of hostname and IP address should
be repeated.
Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql>
mysql>
Note: SQL (and hence MySQL) commands are case-insensitive. The
convention you are using here is that SQL keywords (reserved words) are
written in Upper-Case and words which are not keywords are written here in
lower-case.
The string "mysql>" is the prompt, and should not be entered.
MySQL commands are terminated by a semi-colon (";").
8. To verify that your data was actually stored into the database table, type:
mysql> SELECT * FROM test.mytable;
mysql>
9. You will now need to create a user to connect with and grant privileges to that
user for the new database you created so that you can connect to it from HDP.
To do this, type:
mysql> CREATE USER ''@'localhost';
10. Type quit; since you will not need the database again for awhile.
Note that a MapReduce job was created and run. There was one Mapper and no
Reducers. There is one output file, and it has 7 records.
2. To see what has been stored in the HDFS file system and the content of the file
that was created, type the following commands:
hadoop fs -ls -R
hadoop fs -cat scooper/p*
Unit objectives
• Explain the need for data governance and the role of data security
in this governance
• List the Five Pillars of security and how they are implemented
with HDP
• Discuss the history of security with Hadoop
• Identify the need for and the methods used to secure Personal &
Sensitive Information
• Describe the function of the Hortonworks DataPlane Service (DPS)
Unit objectives
Security must be applied to network connectivity, running processes, and the data itself.
When dealing with data, we are concerned primarily with:
• Integrity
• Confidentiality & privacy
• Rules & regulations concerning who has valid access to what - based on both
the role performed and the individual exercising that role
Another evolving need is the protection of data through encryption and other
confidentiality mechanisms. In the trusted network, it was assumed that data was
inherently protected from unauthorized users because only authorized users were on
the network. Since then, Hadoop has added encryption for data transmitted between
nodes, as well as data stored on disk.
Thus, now, we have the Five Pillars of Security:
• Administration
• Authentication
• Authorization
• Audit
• Data Protection
Hadoop 3.0.0 (GA as of early 2018, https://hadoop.apache.org/docs/r3.0.0/), like
Hadoop 2, is intimately concerned with security:
Hadoop in Secure Mode
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SecureMode.html
Implications of security
• Data is now an essential new driver of competitive advantage
• Hadoop plays critical role in modern data architecture by
Providing low-cost
Scale-out data storage
Value-add processing
• Any internal or external breach of this enterprise-wide data can be
catastrophic
Privacy violations
Regulatory infractions
Damage to corporate image
Damage to long-term shareholder value and consumer confidence
Implications of security
In every industry, data is now an essential new driver of competitive advantage.
Hadoop plays a critical role in the modern data architecture by providing low-cost,
scale-out data storage, and value-add processing.
The Hadoop cluster with its HDFS file system, and the broader role of a Data Lake, are
used to hold the “crown jewels” of the business organization, namely vital operational
data that is used to drive the business and make it unique amongst its peers. Some of
this data, in addition, is highly sensitive.
Any internal or external breach of this enterprise-wide data can be catastrophic, from
privacy violations, to regulatory infractions, to damage to corporate image, and long-
term shareholder value. To prevent damage to the company’s business, customers,
finances and reputation, management and IT leaders must ensure that this data -
whether Hadoop HDFS, Data Lake, or hybrid storage, including cloud storage - meets
the same high standards of security as any legacy data environment.
Further reading
• Spivey, B., & Echeverria, J. (2015). Hadoop security: Protecting your
big data platform. Sebastopol, CA: O’Reilly.
Further reading
Checkpoint
1. What is the new driver of competitive advantage?
2. What role does Hadoop play in modern data architecture?
3. What are some types of sensitive data?
4. Name compliance adherences that exist today.
5. What does HDP use for authentication, authorization and audit?
6. What are the Five Pillars of Security?
7. Through what HDP component are Kerberos, Knox, and Ranger managed?
8. Which security component is used to provide peripheral security?
9. Name some types of sensitive data.
10. What are some of the compliance adherences required of organizations, some of
which are industry-specific?
11. What governance issue does Hortonworks DataPlane Service (DPS) address?
Checkpoint
Checkpoint solution
1. What is the new driver of competitive advantage?
Data is the essential new driver of competitive advantage
2. What role does Hadoop play in modern data architecture?
Provides low cost, scale out data storage and value-add processing
3. What are some types of sensitive data?
Ethnic or Racial Origin, Political Opinion, Religious or Similar Beliefs, Memberships,
Physical/Mental Health Details, Criminal/Civil Offences, Patient Health Information,
Student Performance, Contact, ID Cards/Numbers, Birth Dates, and Parent Names
4. Name compliance adherences that exist today.
SOX, HIPAA, PCI DSS, HITECH, ISO, COBIT, Corporate Security Policies
5. What does HDP use for authentication, authorization and audit?
Apache Knox & Kerberos, Apache Ranger, Apache Ranger
6. What are the Five Pillars of Security?
Administration, Authentication, Authorization, Audit, and Data Protection
Checkpoint solution
Unit summary
• Explain the need for data governance and the role of data security
in this governance
• List the Five Pillars of security and how they are implemented
with HDP
• Discuss the history of security with Hadoop
• Identify the need for and the methods used to secure Personal &
Sensitive Information
• Describe the function of the Hortonworks DataPlane Service (DPS)
Unit summary
Stream Computing
Unit objectives
• Define streaming data
• Describe IBM as a pioneer in streaming data - with System S
IBM Streams
• Explain streaming data - concepts & terminology
• Compare and contrast batch data vs streaming data
• List and explain streaming components & Streaming Data Engines
(SDEs)
Unit objectives
Relational databases brought with them the concept of data warehousing, which
extended the use of databases from OLTP to online analytic processing (OLAP). By
using our example of the bank, the transactions that are captured by the OLTP system
were stored over time and made available to the various business analysts in the
organization. With OLAP, the analysts can now use the stored data to determine trends
in loan defaults, overdrawn accounts, income growth, and so on. By combining and
enriching the data with the results of their analysis, they could do even more complex
analysis to forecast future economic trends or make recommendations on new
investment areas. Additionally, they can mine the data and look for patterns to help
them be more proactive in predicting potential future problems in areas such as
foreclosures. The business can then analyze the recommendations to decide whether
they must take action. The core value of OLAP is focused on understanding why things
happened to make more informed recommendations.
A key component of OLTP and OLAP is that the data is stored. Now, some new
applications require faster analytics than is possible when you have to wait until the
data is retrieved from storage. To meet the needs of these new dynamic applications,
you must take advantage of the increase in the availability of data before storage,
otherwise known as streaming data. This need is driving the next evolution in analytic
processing called real-time analytic processing (RTAP). RTAP focuses on taking the
proven analytics that are established in OLAP to the next level. Data in motion and
unstructured data might be able to provide actual data where OLAP had to settle for
assumptions and hunches. The speed of RTAP allows for the potential of action in
place of making recommendations.
So, what type of analysis makes sense to do in real time? Key types of RTAP include,
but are not limited to, the following analyses:
• Alerting
• Feedback
• Detecting failures
[These uses are discussed in more detail on pp. 26-27.]
IBM System S
System S provides a programming model
and an execution platform for user-
developed applications that ingest, filter,
analyze, and correlate potentially massive
volumes of continuous data streams
Source Adapters = an operator that
connects to a specific type of input (e.g.,
Stock Exchange, Weather, file system, …)
Sink Adapters = an operator that connects
to a specific type of output external to the
streams processing system (RDBMS, …)
Operator = software processing unit
Stream = flow of tuples from one operator
to the next operator (they do not traverse
operators)
IBM System S
References:
• Stream Computing Platforms, Applications, and Analytics (Tab 4: System S)
https://researcher.watson.ibm.com/researcher/view_group_subpage.php?id=2534&t=4
System S: While at IBM, Dr. Ted Codd invented the relational database. In the defining
IBM Research project, it was referred to as System R, which stood for Relational. The
relational database is the foundation for data warehousing that launched the highly
successful Client/Server and On-Demand informational eras. One of the cornerstones
of that success was the capability of OLAP products that are still used in critical
business processes today.
When the IBM Research division again set its sights on developing something to
address the next evolution of analysis (RTAP) for the Smarter Planet evolution, they set
their sights on developing a platform with the same level of world-changing success,
and decided to call their effort System S, which stood for Streams. Like System R,
System S was founded on the promise of a revolutionary change to the analytic
paradigm. The research of the Exploratory Stream Processing Systems team at T.J.
Watson Research Center, which was set on advanced topics in highly scalable stream-
processing applications for the System S project, is the heart and soul of Streams.
Ebbers, M., Abdel-Gayed, A., Budhi, V. B., Dolot, F., Kamat, V., Picone, R., & Trevelin,
J. (2013). Addressing Data Volume, Velocity, and Variety with IBM InfoSphere Streams
V3.0. Armonk, NY: IBM International Technical Support Organization (ITSO), p. 27.
Wikipedia:
“In mathematics and computer science, a directed acyclic graph is a finite directed
graph with no directed cycles. That is, it consists of finitely many vertices and edges,
with each edge directed from one vertex to another, such that there is no way to start at
any vertex v and follow a consistently-directed sequence of edges that eventually loops
back to v again. Equivalently, a DAG is a directed graph that has a topological ordering,
a sequence of the vertices such that every edge is directed from earlier to later in the
sequence.”
The terminology used here is that of IBM Streams, but similar terminology applies to
other Streams Processing Engines (SPE).
References:
• Glossary of terms for IBM Streams (formerly known as IBM InfoSphere Streams)
https://www.ibm.com/support/knowledgecenter/en/SSCRJU_4.2.1/com.ibm.streams.glossary.doc/doc/glossary_streams.html
Examples of other terminology:
• Apache Storm: http://storm.apache.org/releases/current/Concepts.html
o A spout is a source of streams in a topology. Generally spouts will read
tuples from an external source and emit them into the topology.
o All processing in topologies is done in bolts. Bolts can do anything from
filtering, functions, aggregations, joins, talking to databases, and more.
The Streams Processing element is the only component that competes with
IBM Streams.
There is a more complete competitive presentation on Storm available, but we'll cover
the basics here.
• The software design is based on the flow-based programming model and offers
features which prominently include the ability to operate within clusters, security
using TLS encryption, extensibility (users can write their own software to extend
its abilities) and improved usability features like a portal which can be used to
view and modify behavior visually.
NiFi is written in Java and runs within a Java virtual machine running on the server
that hosts it. The main components of NiFi are:
• Web Server - the HTTP-based component used to visually control the software
and monitor events
• Flow Controller - serves as the brains of NiFi's behavior and controls the running
of NiFi extensions, scheduling the allocation of resources for this to happen
• Extensions - various plugins that allow NiFi to interact with different kinds of
systems
• FlowFile repository - used by NiFi to maintain and track the status of the currently
active FlowFile, including the information that NiFi is helping move between
systems
• Cluster manager - an instance of NiFi that provides the sole management point
for the cluster; the same data flow runs on all the nodes of the cluster
• Content repository - where data in transit is maintained
• Provenance repository - metadata detailing the provenance of the data flowing
through the system
Software development and commercial support is offered by Hortonworks. The
software is fully open source. This software is also sold and supported by IBM.
MiNiFi is a subproject of NiFi designed to solve the difficulties of managing and
transmitting data feeds to and from the source of origin, often the first/last mile of
digital signal, enabling edge intelligence to adjust flow behavior/bi-directional
communication.
Reference:
• https://hortonworks.com/blog/edge-intelligence-iot-apache-minifi/
• Free eBook (PDF, 52 pp., authored by Hortonworks & Attunity):
http://discover.attunity.com/apache-nifi-for-dummies-en-report-go-c-lp8558.html
IBM Streams
• IBM Streams is an advanced computing platform that allows user-
developed applications to quickly ingest, analyze, and correlate
information as it arrives from thousands of real-time sources
• The solution can handle very high data throughput rates, up to millions
of events or messages per second
• IBM Streams provides:
Development support - Rich Eclipse-based, visual IDE lets solution
architects visually build applications or use familiar programming languages
like Java, Scala or Python
Rich data connections - Connect with virtually any data source whether
structured, unstructured or streaming, and integrate with Hadoop, Spark
and other data infrastructures
Analysis and visualization - Integrate with business solutions. Built-in
domain analytics like machine learning, natural language, spatial-temporal,
text, acoustic, and more, to create adaptive streams applications
IBM Streams
References:
• https://www.ibm.com/cloud/streaming-analytics
• Streaming Analytics: Resources
https://www.ibm.com/cloud/streaming-analytics/resources
• Free eBook: Addressing Data Volume, Velocity, and Variety with IBM InfoSphere
Streams V3.0
IBM Redbook series, PDF, 326 pp.
http://www.redbooks.ibm.com/redbooks/pdfs/sg248108.pdf
• Toolkits, Sample, and Tutorials for IBM Streams
https://github.com/IBMStreams
• Forrester Total Economic Impact of IBM Streams:
This study provides readers with a framework to evaluate the potential financial
impact of Streams on their organizations
https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=IMC15131USEN
Streams is based on nearly two decades of effort by the IBM Research team to extend
computing technology to handle advanced analysis of high volumes of data quickly.
How important is their research? Consider how it would help crime investigation to
analyze the output of any video cameras in the area that surrounds the scene of a
crime to identify specific faces of any persons of interest in the crowd and relay that
information to the unit that is responding. Similarly, what a competitive edge it could
provide by analyzing 6 million stock market messages per second and execute trades
with an average trade latency of only 25 microseconds (far faster than a hummingbird
flaps its wings). Think about how much time, money, and resources could be saved by
analyzing test results from chip-manufacturing wafer testers in real time to determine
whether there are defective chips before they leave the line.
Critical intelligence, informed actions, and operational efficiencies that are all available
in real time is the promise of Streams.
• Storm integrates with the queueing and database technologies you already use.
A Storm topology consumes streams of data and processes those streams in
arbitrarily complex ways, repartitioning the streams between each stage of the
computation however needed.
• The reference provides a tutorial.
Apache Storm with v1.0 works with Apache Trident. Trident is a high-level
abstraction for doing realtime computing on top of Storm. Trident provides
microbatching.
• Trident allows you to seamlessly intermix high throughput (millions of messages
per second), stateful stream processing with low latency distributed querying. If
you're familiar with high level batch processing tools like Pig or Cascading, the
concepts of Trident will be very familiar – Trident has joins, aggregations,
grouping, functions, and filters. In addition to these, Trident adds primitives for
doing stateful, incremental processing on top of any database or persistence
store. Trident has consistent, exactly-once semantics, so it is easy to reason
about Trident topologies.
• Refer: http://storm.apache.org/releases/current/Trident-tutorial.html
References:
• Apache Storm: Overview
https://hortonworks.com/apache/storm/
• The Future of Apache Storm (YouTube, June 2016)
https://youtu.be/_Q2uzRIkTd8
• Storm or Spark: Choose your real-time weapon
https://www.infoworld.com/article/2854894/application-development/spark-and-storm-for-real-time-computation.html
Example use cases:
• Multi-channel customer sentiment and experience analysis
• Detecting life-threatening conditions at hospitals in time to intervene
As new sources of data continue to grow in volume, variety and velocity, so too does
the potential of this data to revolutionize the decision-making processes in every
industry.
Checkpoint
1. What are the typical sources of streaming data?
2. Name some of the components of Hortonworks DataFlow (HDF)
3. What are the primary differences between NiFi and MiNiFi?
4. What main features does IBM Streams provide as a Streaming Data
Platform?
Checkpoint
Checkpoint solution
1. What are the typical sources of streaming data?
Sensors (environmental, industrial, surveillance video, GPS, …), “Data
exhaust” (network/system/webserver log files), and high-rate transaction data
(financial transactions, call detail records, …)
2. Name some of the components of Hortonworks DataFlow (HDF)
Flow management (NiFi/MiNiFi); Stream processing (Storm, Kafka, Streaming
Analytics Manager [SAM]); and Enterprise services (Schema Registry, Apache
Ranger, Ambari)
3. What are the primary differences between NiFi and MiNiFi?
NiFi is a disk-based, microbatch ETL tool that provides flow management;
MiNiFi is a complementary data collection tool that feeds collected data to NiFi
4. What main features does IBM Streams provide as a Streaming Data
Platform?
Development support (drag & drop); Rich data connections; and Analysis &
visualization
Checkpoint solution
Unit summary
• Define streaming data
• Describe IBM as a pioneer in streaming data - with System S
IBM Streams
• Explain streaming data - concepts & terminology
• Compare and contrast batch data vs streaming data
• List and explain streaming components & Streaming Data Engines
(SDEs)
Unit summary