
Hadoop 101

Learn the Hadoop Fundamentals

Ian Andrews | iandrews@gopivotal.com


Director – Advanced Technology
Pivotal Incorporated



HADOOP 101

Purpose

After completing this module, you will be able to:

1. Discuss the evolution of Data Platforms and why Hadoop was created
2. Discuss the purpose, functionality, and value of Hadoop
3. Describe the various Hadoop components
4. Discuss some of the most common use cases for Hadoop



HADOOP 101

Agenda

1. What Is Hadoop?
2. The Evolution of Data Platforms
3. How Hadoop Is Being Used Today
4. Resources and Key Takeaways



Part 1:
What Is Hadoop?



What is Hadoop?

 A framework that allows for distributed processing of large data
  sets across clusters of commodity servers:
  – Store large amounts of data
  – Process the large amounts of data stored

 Inspired by Google's MapReduce and Google File System (GFS) papers

 An Apache open source project
  – Initial work done at Yahoo! starting in 2005
  – Became a top-level Apache project in 2008; there is now a very
    active open source community



What is Hadoop?

Two Core Components

 HDFS – scalable storage in the Hadoop Distributed File System

 MapReduce – compute via the MapReduce distributed processing platform

• Storage & compute in one framework
• Open source project of the Apache Software Foundation
• Java-intensive programming required (see the sketch below)
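
To make that last point concrete, here is a minimal sketch (not from the original deck) of the canonical word-count job against the Hadoop MapReduce Java API: mappers tokenize lines read from HDFS, and reducers sum the per-word counts.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: splits each input line into tokens and emits (word, 1).
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reducer: sums the 1s for each word and emits (word, total).
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts,
                          Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count"); // Hadoop 2.x style
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(SumReducer.class); // safe for sums: pre-aggregates map output
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, this would be launched with: hadoop jar wordcount.jar WordCount <input> <output>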



Part 2:
The Evolution of Data
Platforms
How it all began



First there was the Data Warehouse

• A new architecture to host data from multiple sources to support
  decision-making

• Why the Data Warehouse exists:
  – Centralization of high-value data
  – Tools to process data into information
  – Highly regulated environment

[Diagram: legacy systems feeding a centralized EDW]



Then the MPP database was introduced

• A new approach to databases was required to handle the new
  analytics environment

• Why the MPP Database exists:
  – Data got larger
  – Queries got uglier
  – Performance became critical
  – R/SAS/statistical languages needed to run in-database



Then things changed

 Internet age introduces the ability to track interactions rather
  than just transactions

 Cost of traditional platforms too high to store and process this
  new data

 Normal databases not able to perform at scale

[Diagram: New Data Streams, New Delivery Platforms, Expanding Data
Volumes, Greater Cost Pressures, New Deployment Models & Languages,
Increasing Customer Expectations]



Now there is Hadoop

 Traditional systems weren't built to handle the storage and
  processing needs of Web 2.0

 Why Hadoop exists:
  – Data volumes moved into the petabyte range
  – Raw (unstructured) forms of data needed to be processed
  – Cost needed to be low
  – Processing must scale with storage



The Hadoop Opportunity

 Internet age + exploding data growth

 Enterprises increasingly interested in leveraging new data sources
  quickly:
  – Understand granular customer behavior
  – Spot emerging trends
  – Identify new opportunities

 Traditional database tools not able to cope:
  – Weren't built for big data use cases
  – Lack scale, aren't cost-effective, and impose rigid data structures

 Need for a new approach → Hadoop



Why Is Hadoop Important?

1. Hadoop reduces the cost of storing and processing data to the point
   that keeping all data indefinitely is suddenly a very real
   possibility – AND that cost is halving every 18 months
2. MapReduce makes developing and executing massively parallel data
   processing tasks trivial compared to historical alternatives
   (e.g., HPC/Grid)
3. The schema-on-read paradigm shifts typical data preparation
   complexity to the analysis phase rather than the acquisition phase
   (see the mapper sketch below)
4. For the modern CIO, "BIG DATA = HADOOP" – don't underestimate the
   irrational exuberance of the market

The cost and effort to consume and extract value from data have been
fundamentally changed
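
As an illustration of point 3, here is a hypothetical schema-on-read mapper: the raw file sits in HDFS exactly as it arrived, and the assumed comma-delimited layout (timestamp, user ID, URL – invented for this sketch, not taken from the slides) is only imposed at analysis time.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Schema on read: the field layout is applied here, in the mapper,
// not at load time. Malformed lines are skipped, not rejected upfront.
public class ClickLogMapper
    extends Mapper<LongWritable, Text, Text, LongWritable> {

  private static final LongWritable ONE = new LongWritable(1);
  private final Text userId = new Text();

  @Override
  protected void map(LongWritable offset, Text rawLine, Context context)
      throws IOException, InterruptedException {
    // Assumed layout: timestamp,userId,url
    String[] fields = rawLine.toString().split(",");
    if (fields.length < 3) {
      return; // skip records that don't fit the schema we chose today
    }
    userId.set(fields[1]);
    context.write(userId, ONE); // count page views per user
  }
}
```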



How Is Hadoop Unique?

 Handles large amounts of data
 Stores data in its native format
 Delivers linear scalability at low cost
 Resilient in case of infrastructure failures
 Transparent application scalability

Enterprises can gain a competitive advantage through the adoption of
big data analytics



Common Apache Hadoop Components

Mahout • Scalable machine learning libraries written in MapReduce

Flume • Log file ingestion framework

Sqoop • SQL-to-Hadoop batch data transfer framework

HBase • NoSQL database for random, real-time read/write access

Hive • System for SQL-like queries over data stored in HDFS

Pig • Procedural language that abstracts MapReduce

ZooKeeper • Highly reliable distributed coordination

MapReduce • Framework for writing scalable data applications

HDFS • The Hadoop Distributed File System (see the client sketch below)
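
For a taste of the lowest layer of this stack, here is a minimal sketch of a client that reads a text file straight out of HDFS through the standard org.apache.hadoop.fs.FileSystem API. The path comes from the command line; cluster settings are picked up from core-site.xml on the classpath.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS client: streams a text file out of the distributed
// file system and prints it line by line, like "hdfs dfs -cat".
public class HdfsCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // loads core-site.xml if present
    try (FileSystem fs = FileSystem.get(conf);
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(fs.open(new Path(args[0])),
                                   StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```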



Evolution of the Commercial Hadoop Market

 Apache Hadoop is open-source software managed by the Apache Software
  Foundation and available as a free download at Apache.org.

 A number of commercial businesses have been created to provide
  packaged Hadoop distributions containing integrated subcomponents of
  Hadoop and other open source projects, plus education, support,
  services, and Hadoop management utilities that make the technology
  easier to implement and use.
  – Cloudera was the first commercial vendor to enter the space, in 2008,
    founded by early Hadoop technologists who later brought in Doug
    Cutting as chief architect.
  – Other commercial vendors of Hadoop software include Hortonworks
    (spun out of Yahoo! in 2011), IBM (InfoSphere BigInsights), MapR
    (2009), and DataStax (2010 – Hadoop and Cassandra).

 Organizations that are evaluating Hadoop are typically also looking at
  NoSQL databases such as Cassandra, MongoDB, and CouchDB, and at hosted
  offerings such as Amazon EMR.

 They might also be evaluating scale-out file systems such as Isilon or
  GlusterFS for storage. These systems mirror Hadoop's scale-out
  architecture and are also capable of handling the volume and
  unstructured nature of data that could be stored in Hadoop.



Greenplum’s Entry into the Hadoop Market

 Greenplum began moving toward the Hadoop space when it added support
  for MapReduce within the database in 2009. The company then officially
  entered the Hadoop market in April 2011 with the launch of
  Greenplum HD.

 Greenplum HD is one of several commercially supported distributions of
  Apache Hadoop. All of these distributions consist of the same core
  Hadoop technology.

 In providing our own distribution of Apache Hadoop, Greenplum (and now
  Pivotal) brings a decade of experience developing the best MPP
  database platform to the open source Big Data platform.



Part 3:
How Hadoop Is Being
Used Today



Target Markets, Verticals, and Use Cases

Hadoop Use Cases by Vertical


Finance
• Risk Modeling/Management
• Portfolio Analysis
• Investment Predictions
• Fraud Detection
• Compliance Checks
• Customer Profiling
• Social Media Analytics
• ETL
• Network analysis based on transactions

Web 2.0
• Product Recommendation Engines
• Search Engine Indexing (Search Assist)
• Content Optimization
• Advertising Optimization
• Customer Churn Analysis
• POS Transaction Analysis
• Data Warehousing

Telecom
• Network Graph Analysis
• Call Detail Record (CDR) Analysis
• Network Optimization
• Service Optimization & Log Processing
• User Behavior Analysis
• Customer Churn Prediction
• Machine-generated data centralization (logs from firewalls, towers,
  switches, servers, etc.)

Healthcare
• Electronic Medical Record Analysis
• Claims Fraud Detection
• Drug Safety Analysis
• Personalized Medicine
• Healthcare Service Optimization
• Drug Development
• Healthcare Information Exchange
• Medical Image Processing



Use Case: Data Warehouse Augmentation / Offload

• Challenges
  – Existing EDW used for low-value, resource-consuming ETL processing
  – Planned growth will far exceed compute capacity
  – Hard to do analytics, or even basic reporting, on the EDW system

• Objectives
  – Reduce EDW total cost of ownership
  – Retain data longer to enable analytics and accelerate time to market
  – Migrate ETL off the EDW to free up compute resources



Hadoop Use Case: Retailer Trend Analysis

 Deep historical reporting for retail trends:
  – A credit card company loads 10 years of data for all retailers
    (hundreds of terabytes) into Hadoop
  – A MapReduce job runs on the Hadoop cluster for a single retailer and
    develops a historical picture of that retailer (or retailers) in a
    specific area (a mapper sketch follows)
  – Results are loaded from Hadoop into the data warehouse and analyzed
    further with standard BI/statistics packages

 Why do this in Hadoop?
  – Ability to store years of data cost-effectively
  – Data available for immediate recall (not on tapes or in flat files)
  – No need to ETL/normalize the data; it exists in its valuable,
    original format
  – Offloads intensive computation from the DW
  – Ability to pull in other, unstructured data sources about the
    retailer and combine them with other data (EDGAR filings?)
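
A hypothetical mapper for the job sketched above. The record layout (date, retailer ID, card ID, amount) and the trend.retailer.id configuration key are invented for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Scans the full transaction history but emits only records for the
// retailer under study, keyed by month so a reducer can build a
// spend-over-time picture.
public class RetailerTrendMapper
    extends Mapper<LongWritable, Text, Text, DoubleWritable> {

  private String targetRetailerId;

  @Override
  protected void setup(Context context) {
    // The retailer of interest is passed in via the job configuration.
    targetRetailerId = context.getConfiguration().get("trend.retailer.id");
  }

  @Override
  protected void map(LongWritable offset, Text rawLine, Context context)
      throws IOException, InterruptedException {
    // Assumed layout: date(yyyy-MM-dd),retailerId,cardId,amount
    String[] fields = rawLine.toString().split(",");
    if (fields.length < 4 || !fields[1].equals(targetRetailerId)) {
      return; // schema on read: skip everything that isn't our retailer
    }
    String month = fields[0].substring(0, 7); // e.g. "2012-07"
    context.write(new Text(month),
                  new DoubleWritable(Double.parseDouble(fields[3])));
  }
}
```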



Hadoop Use Case: Risk Modeling

 Risk Modeling
  – A bank had customer data across multiple lines of business and
    needed to develop a better risk picture of its customers.
    ▪ e.g., if direct deposits stop coming into a checking account, it is
      likely the customer lost his or her job, which impacts
      creditworthiness for other products (credit card, mortgage, etc.)
  – Data existed in silos across multiple LOBs and acquired banks' systems
  – Data size approached 1 petabyte

 Why do this in Hadoop?
  – Ability to cost-effectively integrate over 1 PB of data from multiple
    sources: data warehouse, call center, chat, and email
  – Platform for further analysis with poly-structured data sources,
    e.g., combining bank data with credit bureau data, Twitter, etc.
  – Offloads intensive computation from the DW



Hadoop Use Case: Sentiment Analysis
 Sentiment Analysis
  – Hadoop is frequently used to monitor what customers think of a
    company's products or services
  – Data is loaded from social media sources (Twitter, blogs, Facebook,
    emails, chats, etc.) into the Hadoop cluster
  – MapReduce jobs run continuously to identify sentiment (e.g., Acme
    Company's rates are "outrageous" or a "rip off"); a keyword-spotting
    sketch follows
  – Negative/positive comments can then be acted upon (special offer,
    coupon, etc.)

 Why Hadoop?
  – Social media/web data is unstructured
  – The amount of data is immense
  – New data sources arise weekly
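
A deliberately naive, hypothetical keyword-spotting mapper in the spirit of this use case. The phrase list is invented, and production sentiment analysis would use proper NLP rather than substring matching.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Flags each social media message that contains a negative phrase so
// downstream jobs can aggregate and act on the results.
public class SentimentMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final Set<String> NEGATIVE = new HashSet<>(
      Arrays.asList("outrageous", "rip off", "terrible")); // invented list
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(LongWritable offset, Text message, Context context)
      throws IOException, InterruptedException {
    String text = message.toString().toLowerCase();
    for (String phrase : NEGATIVE) {
      if (text.contains(phrase)) {
        context.write(new Text(phrase), ONE); // count hits per phrase
        break;
      }
    }
  }
}
```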



Part 4:
Resources and Key
Takeaways



Online Resources

 Hadoop Basics: Hadoop Basics from EMC

 Pivotal Hadoop online resources: Greenplum Nation – Pivotal HD

 Selling Guide - Hadoop Selling Guide

 Webcasts
– Hadoop Spotlight Webinar
– Video: HAWQ and Pivotal HD Video
– Internal Webcast: Internal Webcast

 Spring Hadoop: Spring Hadoop



Contact Resources

 Advanced Technology Sales


– Nick Cayou – ncayou@gopivotal.com
– Ian Andrews – iandrews@gopivotal.com

 Hadoop Virtual Team


– Jacque Istok – jistok@gopivotal.com

 Product Management
– SK Krishnamurthy - SK.Krishnamurthy@emc.com

 Hadoop vTeam Mailing List


– GPVTeamHadoop@emc.com



Module Summary & Highlights

 Hadoop was created to meet a need for data storage and processing at
  a scale and cost that existing systems simply could not address

 Hadoop adoption in the enterprise is largely driven by the low cost
  and flexibility of the platform

 Hadoop is relatively immature, but it is quickly advancing with new
  capabilities backed by a broad base of vendor- and community-supported
  efforts

Thank You
Please note:

• It may take up to 24 hours for your transcript to be updated to
  reflect that you have successfully completed this course.

• If after 24 hours your transcript has not been updated, please send
  an e-mail to joneill@gopivotal.com describing your issue.

• Please include the following:
  • BADGE NUMBER
  • ROLE
  • MODULE TITLE
  • ISSUE



Click Here to Provide Feedback

Thank you to the following individual who helped make the creation of
this course possible:
• Ian Andrews

