
Hadoop 101

Learn the Hadoop Fundamentals

Ian Andrews | iandrews@gopivotal.com


Director – Advanced Technology
Pivotal Incorporated



HADOOP 101

Purpose

After completing this module, you will be able to:

1. Discuss the evolution of Data Platforms and why Hadoop was created
2. Discuss the purpose, functionality, and value of Hadoop
3. Describe the various Hadoop components
4. Discuss some of the most common use cases for Hadoop



HADOOP 101

Agenda

1. What Is Hadoop?
2. The Evolution of Data Platforms
3. How Hadoop Is Being Used Today
4. Resources and Key Takeaways



Part 1:
What Is Hadoop?



What is Hadoop?

 A framework that allows for distributed processing of large data
  sets across clusters of commodity servers:
  – Store large amounts of data
  – Process the large amounts of data stored

 Inspired by Google's MapReduce and Google File System (GFS) papers

 An Apache open source project
  – Initial work done at Yahoo! starting in 2005
  – Became a top-level Apache project in 2008; there is now a very
    active open source community



What is Hadoop?

Two Core Components

 HDFS – scalable storage in the Hadoop Distributed File System

 MapReduce – compute via the MapReduce distributed processing platform

• Storage & compute in one framework
• Open source project of the Apache Software Foundation
• Java-intensive programming required (see the sketch below)
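
To make that last point concrete, here is a minimal sketch (not from the original deck) of the canonical word-count job against the Hadoop MapReduce Java API: mappers tokenize lines read from HDFS, and reducers sum the per-word counts.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: splits each input line into tokens and emits (word, 1).
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reducer: sums the 1s for each word and emits (word, total).
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts,
                          Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count"); // Hadoop 2.x style
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(SumReducer.class); // safe for sums: pre-aggregates map output
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, this would be launched with: hadoop jar wordcount.jar WordCount <input> <output>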



Part 2:
The Evolution of Data
Platforms
How it all began



First there was the Data Warehouse

• A new architecture to host data from multiple sources to support
  decision-making

• Why the Data Warehouse exists:
  – Centralization of high-value data
  – Tools to process data into information
  – Highly regulated environment

[Diagram: legacy systems feeding a centralized EDW]



Then the MPP database was introduced

• A new approach to databases was required to handle the new
  analytics environment

• Why the MPP Database exists:
  – Data got larger
  – Queries got uglier
  – Performance became critical
  – R/SAS/statistical languages needed to run in-database



Then things changed

 Internet age introduces the ability to track interactions rather
  than just transactions

 Cost of traditional platforms too high to store and process this
  new data

 Normal databases not able to perform at scale

[Diagram: New Data Streams, New Delivery Platforms, Expanding Data
Volumes, Greater Cost Pressures, New Deployment Models & Languages,
Increasing Customer Expectations]



Now there is Hadoop

 Traditional systems weren't built to handle the storage and
  processing needs of Web 2.0

 Why Hadoop exists:
  – Data volumes moved into the petabyte range
  – Raw (unstructured) forms of data needed to be processed
  – Cost needed to be low
  – Processing must scale with storage



The Hadoop Opportunity

 Internet age + exploding data growth

 Enterprises increasingly interested in leveraging new data sources
  quickly:
  – Understand granular customer behavior
  – Spot emerging trends
  – Identify new opportunities

 Traditional database tools not able to cope:
  – Weren't built for big data use cases
  – Lack scale, aren't cost-effective, and impose rigid data structures

 Need for a new approach → Hadoop



Why Is Hadoop Important?

1. Hadoop reduces the cost of storing and processing data to the point
   that keeping all data indefinitely is suddenly a very real
   possibility – AND that cost is halving every 18 months
2. MapReduce makes developing and executing massively parallel data
   processing tasks trivial compared to historical alternatives
   (e.g., HPC/Grid)
3. The schema-on-read paradigm shifts typical data preparation
   complexity to the analysis phase rather than the acquisition phase
   (see the mapper sketch below)
4. For the modern CIO, "BIG DATA = HADOOP" – don't underestimate the
   irrational exuberance of the market

The cost and effort to consume and extract value from data have been
fundamentally changed
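
As an illustration of point 3, here is a hypothetical schema-on-read mapper: the raw file sits in HDFS exactly as it arrived, and the assumed comma-delimited layout (timestamp, user ID, URL – invented for this sketch, not taken from the slides) is only imposed at analysis time.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Schema on read: the field layout is applied here, in the mapper,
// not at load time. Malformed lines are skipped, not rejected upfront.
public class ClickLogMapper
    extends Mapper<LongWritable, Text, Text, LongWritable> {

  private static final LongWritable ONE = new LongWritable(1);
  private final Text userId = new Text();

  @Override
  protected void map(LongWritable offset, Text rawLine, Context context)
      throws IOException, InterruptedException {
    // Assumed layout: timestamp,userId,url
    String[] fields = rawLine.toString().split(",");
    if (fields.length < 3) {
      return; // skip records that don't fit the schema we chose today
    }
    userId.set(fields[1]);
    context.write(userId, ONE); // count page views per user
  }
}
```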



How Is Hadoop Unique?

 Handles large amounts of data
 Stores data in its native format
 Delivers linear scalability at low cost
 Resilient in case of infrastructure failures
 Transparent application scalability

Enterprises can gain a competitive advantage through the adoption of
big data analytics



Common Apache Hadoop Components

Mahout • Scalable machine learning libraries written in MapReduce

Flume • Log file ingestion framework

Sqoop • SQL-to-Hadoop batch data transfer framework

HBase • NoSQL database for random, real-time read/write access

Hive • System for SQL-like queries over data stored in HDFS

Pig • Procedural language that abstracts MapReduce

ZooKeeper • Highly reliable distributed coordination

MapReduce • Framework for writing scalable data applications

HDFS • The Hadoop Distributed File System (see the client sketch below)
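
For a taste of the lowest layer of this stack, here is a minimal sketch of a client that reads a text file straight out of HDFS through the standard org.apache.hadoop.fs.FileSystem API. The path comes from the command line; cluster settings are picked up from core-site.xml on the classpath.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS client: streams a text file out of the distributed
// file system and prints it line by line, like "hdfs dfs -cat".
public class HdfsCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // loads core-site.xml if present
    try (FileSystem fs = FileSystem.get(conf);
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(fs.open(new Path(args[0])),
                                   StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```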



Evolution of the Commercial Hadoop Market

 Apache Hadoop is open-source software managed by the Apache Software
  Foundation and available as a free download at Apache.org.

 A number of commercial businesses have been created to provide
  packaged Hadoop distributions containing integrated subcomponents of
  Hadoop and other open source projects, plus education, support,
  services, and Hadoop management utilities that make the technology
  easier to implement and use.
  – Cloudera was the first commercial vendor to enter the space, in 2008,
    founded by early Hadoop technologists who later brought in Doug
    Cutting as chief architect.
  – Other commercial vendors of Hadoop software include Hortonworks
    (spun out of Yahoo! in 2011), IBM (InfoSphere BigInsights), MapR
    (2009), and DataStax (2010 – Hadoop and Cassandra).

 Organizations that are evaluating Hadoop are typically also looking at
  NoSQL databases such as Cassandra, MongoDB, and CouchDB, and at hosted
  offerings such as Amazon EMR.

 They might also be evaluating scale-out file systems such as Isilon or
  GlusterFS for storage. These systems mirror Hadoop's scale-out
  architecture and are also capable of handling the volume and
  unstructured nature of data that could be stored in Hadoop.



Greenplum’s Entry into the Hadoop Market

 Greenplum began moving toward the Hadoop space when it added support
  for MapReduce within the database in 2009. The company then officially
  entered the Hadoop market in April 2011 with the launch of
  Greenplum HD.

 Greenplum HD is one of several commercially supported distributions of
  Apache Hadoop. All of these distributions consist of the same core
  Hadoop technology.

 In providing our own distribution of Apache Hadoop, Greenplum (and now
  Pivotal) brings a decade of experience developing the best MPP
  database platform to the open source Big Data platform.



Part 3:
How Hadoop Is Being
Used Today



Target Markets, Verticals, and Use Cases

Hadoop Use Cases by Vertical


Finance
• Risk Modeling/Management
• Portfolio Analysis
• Investment Predictions
• Fraud Detection
• Compliance Checks
• Customer Profiling
• Social Media Analytics
• ETL
• Network analysis based on transactions

Web 2.0
• Product Recommendation Engines
• Search Engine Indexing (Search Assist)
• Content Optimization
• Advertising Optimization
• Customer Churn Analysis
• POS Transaction Analysis
• Data Warehousing

Telecom
• Network Graph Analysis
• Call Detail Record (CDR) Analysis
• Network Optimization
• Service Optimization & Log Processing
• User Behavior Analysis
• Customer Churn Prediction
• Machine-generated data centralization (logs from firewalls, towers,
  switches, servers, etc.)

Healthcare
• Electronic Medical Record Analysis
• Claims Fraud Detection
• Drug Safety Analysis
• Personalized Medicine
• Healthcare Service Optimization
• Drug Development
• Healthcare Information Exchange
• Medical Image Processing



Use Case: Data Warehouse Augmentation / Offload

• Challenges
  – Existing EDW used for low-value, resource-consuming ETL processing
  – Planned growth will far exceed compute capacity
  – Hard to do analytics, or even basic reporting, on the EDW system

• Objectives
  – Reduce EDW total cost of ownership
  – Retain data longer to enable analytics and accelerate time to market
  – Migrate ETL off the EDW to free up compute resources



Hadoop Use Case: Retailer Trend Analysis

 Deep historical reporting for retail trends:
  – A credit card company loads 10 years of data for all retailers
    (hundreds of terabytes) into Hadoop
  – A MapReduce job runs on the Hadoop cluster for a single retailer and
    develops a historical picture of that retailer (or retailers) in a
    specific area (a mapper sketch follows)
  – Results are loaded from Hadoop into the data warehouse and analyzed
    further with standard BI/statistics packages

 Why do this in Hadoop?
  – Ability to store years of data cost-effectively
  – Data available for immediate recall (not on tapes or in flat files)
  – No need to ETL/normalize the data; it exists in its valuable,
    original format
  – Offloads intensive computation from the DW
  – Ability to pull in other, unstructured data sources about the
    retailer and combine them with other data (EDGAR filings?)
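
A hypothetical mapper for the job sketched above. The record layout (date, retailer ID, card ID, amount) and the trend.retailer.id configuration key are invented for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Scans the full transaction history but emits only records for the
// retailer under study, keyed by month so a reducer can build a
// spend-over-time picture.
public class RetailerTrendMapper
    extends Mapper<LongWritable, Text, Text, DoubleWritable> {

  private String targetRetailerId;

  @Override
  protected void setup(Context context) {
    // The retailer of interest is passed in via the job configuration.
    targetRetailerId = context.getConfiguration().get("trend.retailer.id");
  }

  @Override
  protected void map(LongWritable offset, Text rawLine, Context context)
      throws IOException, InterruptedException {
    // Assumed layout: date(yyyy-MM-dd),retailerId,cardId,amount
    String[] fields = rawLine.toString().split(",");
    if (fields.length < 4 || !fields[1].equals(targetRetailerId)) {
      return; // schema on read: skip everything that isn't our retailer
    }
    String month = fields[0].substring(0, 7); // e.g. "2012-07"
    context.write(new Text(month),
                  new DoubleWritable(Double.parseDouble(fields[3])));
  }
}
```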



Hadoop Use Case: Risk Modeling

 Risk Modeling
  – A bank had customer data across multiple lines of business and
    needed to develop a better risk picture of its customers.
    ▪ e.g., if direct deposits stop coming into a checking account, it is
      likely the customer lost his or her job, which impacts
      creditworthiness for other products (credit card, mortgage, etc.)
  – Data existed in silos across multiple LOBs and acquired banks' systems
  – Data size approached 1 petabyte

 Why do this in Hadoop?
  – Ability to cost-effectively integrate over 1 PB of data from multiple
    sources: data warehouse, call center, chat, and email
  – Platform for further analysis with poly-structured data sources,
    e.g., combining bank data with credit bureau data, Twitter, etc.
  – Offloads intensive computation from the DW



Hadoop Use Case: Sentiment Analysis
 Sentiment Analysis
  – Hadoop is frequently used to monitor what customers think of a
    company's products or services
  – Data is loaded from social media sources (Twitter, blogs, Facebook,
    emails, chats, etc.) into the Hadoop cluster
  – MapReduce jobs run continuously to identify sentiment (e.g., Acme
    Company's rates are "outrageous" or a "rip off"); a keyword-spotting
    sketch follows
  – Negative/positive comments can then be acted upon (special offer,
    coupon, etc.)

 Why Hadoop?
  – Social media/web data is unstructured
  – The amount of data is immense
  – New data sources arise weekly
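
A deliberately naive, hypothetical keyword-spotting mapper in the spirit of this use case. The phrase list is invented, and production sentiment analysis would use proper NLP rather than substring matching.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Flags each social media message that contains a negative phrase so
// downstream jobs can aggregate and act on the results.
public class SentimentMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final Set<String> NEGATIVE = new HashSet<>(
      Arrays.asList("outrageous", "rip off", "terrible")); // invented list
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(LongWritable offset, Text message, Context context)
      throws IOException, InterruptedException {
    String text = message.toString().toLowerCase();
    for (String phrase : NEGATIVE) {
      if (text.contains(phrase)) {
        context.write(new Text(phrase), ONE); // count hits per phrase
        break;
      }
    }
  }
}
```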



Part 4:
Resources and Key
Takeaways



Online Resources

 Hadoop Basics: Hadoop Basics from EMC

 Pivotal Hadoop online resources: Greenplum Nation – Pivotal HD

 Selling Guide - Hadoop Selling Guide

 Webcasts
– Hadoop Spotlight Webinar
– Video: HAWQ and Pivotal HD Video
– Internal Webcast: Internal Webcast

 Spring Hadoop: Spring Hadoop



Contact Resources

 Advanced Technology Sales


– Nick Cayou – ncayou@gopivotal.com
– Ian Andrews – iandrews@gopivotal.com

 Hadoop Virtual Team


– Jacque Istok – jistok@gopivotal.com

 Product Management
– SK Krishnamurthy - SK.Krishnamurthy@emc.com

 Hadoop vTeam Mailing List


– GPVTeamHadoop@emc.com



Module Summary & Highlights

 Hadoop was created to meet a need for data storage and processing at
  a scale and cost that existing systems simply could not address

 Hadoop adoption in the enterprise is largely driven by the low cost
  and flexibility of the platform

 Hadoop is relatively immature, but it is quickly advancing with new
  capabilities backed by a broad base of vendor- and community-supported
  efforts

Thank You
Please note:

• It may take up to 24 hours for your transcript to be updated to
  reflect that you have successfully completed this course.

• If after 24 hours your transcript has not been updated, please send
  an e-mail to joneill@gopivotal.com describing your issue.

• Please include the following:
  • BADGE NUMBER
  • ROLE
  • MODULE TITLE
  • ISSUE



Click Here to Provide Feedback

Thank you to the following individual who helped make the creation of
this course possible:
• Ian Andrews

