
Introduction to Big Data and Hadoop

Edureka Contact : corp@edureka.co

www.edureka.co/big-data-and-hadoop

Objectives
At the end of this session, you will understand:
Big Data Introduction
Use Cases of Big Data in Multiple Industry Verticals
Hadoop and Its Eco-System
Hadoop Architecture
Learning Path for Developers, Administrators, Testing Professionals and Aspiring Data Scientists

Slide 2

Un-structured Data is Exploding

Source: Twitter

Slide 3

IBM's Definition of Big Data

IBM's Definition: Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/

Slide 4

Annie's Introduction
Hello there! My name is Annie. I love quizzes and puzzles, and I am here to make you think and answer my questions.

Slide 5

Annie's Question
Map each of the following to its corresponding data type:
XML files, e-mail body
Audio, Video, Images, Archived documents
Data from Enterprise systems (ERP, CRM etc.)

Slide 6

Annie's Answer
Ans.
XML files, e-mail body: Semi-structured data
Audio, Video, Images, Archived documents: Unstructured data
Data from Enterprise systems (ERP, CRM etc.): Structured data

Slide 7

Further Reading
More on Big Data
http://www.edureka.in/blog/the-hype-behind-big-data/
Why Hadoop?
http://www.edureka.in/blog/why-hadoop/
Opportunities in Hadoop
http://www.edureka.in/blog/jobs-in-hadoop/
Big Data
http://en.wikipedia.org/wiki/Big_Data
IBM's Definition: Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
Slide 8

Common Big Data Customer Scenarios

Web and e-tailing
Recommendation Engines
Ad Targeting
Search Quality
Abuse and Click Fraud Detection

Telecommunications
Customer Churn Prevention
Network Performance Optimization
Call Detail Record (CDR) Analysis
Analysing Network to Predict Failure

http://wiki.apache.org/hadoop/PoweredBy
Slide 9

Common Big Data Customer Scenarios (Contd.)

Government
Fraud Detection and Cyber Security
Welfare Schemes
Justice

Healthcare and Life Sciences
Health Information Exchange
Gene Sequencing
Serialization
Healthcare Service Quality Improvements
Drug Safety

http://wiki.apache.org/hadoop/PoweredBy
Slide 10

Common Big Data Customer Scenarios (Contd.)

Banks and Financial Services
Modeling True Risk
Threat Analysis
Fraud Detection
Trade Surveillance
Credit Scoring and Analysis

Retail
Point of Sales Transaction Analysis
Customer Churn Analysis
Sentiment Analysis

http://wiki.apache.org/hadoop/PoweredBy
Slide 11

Why DFS?
Read 1 TB Data
1 Machine (4 I/O Channels, Each Channel 100 MB/s)
10 Machines (4 I/O Channels per Machine, Each Channel 100 MB/s)

Slide 12

Why DFS? (Contd.)
Read 1 TB Data
1 Machine (4 I/O Channels, Each Channel 100 MB/s): 43 Minutes
10 Machines (4 I/O Channels per Machine, Each Channel 100 MB/s)

Slide 13

Why DFS? (Contd.)
Read 1 TB Data
1 Machine (4 I/O Channels, Each Channel 100 MB/s): 43 Minutes
10 Machines (4 I/O Channels per Machine, Each Channel 100 MB/s): 4.3 Minutes

Slide 14
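
The read times on the "Why DFS?" slides can be reproduced with a few lines of arithmetic. The sketch below is illustrative only. It assumes binary units (1 TB = 1024 x 1024 MB), an even split of the data across machines, fully parallel reads, and no network or coordination overhead. The function name read_time_minutes and its parameters are hypothetical and chosen for this example.

```python
MB_PER_TB = 1024 * 1024  # assumption: binary units, 1 TB = 1,048,576 MB


def read_time_minutes(data_tb, machines, channels_per_machine=4,
                      mb_per_s_per_channel=100):
    """Estimate minutes needed to read data_tb terabytes in parallel.

    Assumes the data is split evenly and every I/O channel on every
    machine reads at full speed at the same time.
    """
    total_mb = data_tb * MB_PER_TB
    aggregate_mb_per_s = machines * channels_per_machine * mb_per_s_per_channel
    return total_mb / aggregate_mb_per_s / 60


one_machine = read_time_minutes(1, machines=1)
ten_machines = read_time_minutes(1, machines=10)
print(f"1 machine:   {one_machine:.1f} minutes")
print(f"10 machines: {ten_machines:.1f} minutes")
```

Under these assumptions the estimate is roughly 43.7 minutes for one machine and one tenth of that for ten machines, which the slides show as 43 and 4.3 minutes. The point of the comparison is that aggregate read throughput grows with the number of machines reading in parallel.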

Slide 15

Hadoop Cluster: A Typical Use Case

Active NameNode
RAM: 64 GB
Hard disk: 1 TB
Processor: Xeon with 8 Cores
Ethernet: 3 x 10 Gb/s
OS: 64-bit CentOS
Power: Redundant Power Supply

Secondary NameNode
RAM: 32 GB
Hard disk: 1 TB
Processor: Xeon with 4 Cores
Ethernet: 3 x 10 Gb/s
OS: 64-bit CentOS
Power: Redundant Power Supply

StandBy NameNode
RAM: 64 GB
Hard disk: 1 TB
Processor: Xeon with 8 Cores
Ethernet: 3 x 10 Gb/s
OS: 64-bit CentOS
Power: Redundant Power Supply

DataNode (two shown, identical configuration)
RAM: 16 GB
Hard disk: 6 x 2 TB
Processor: Xeon with 2 Cores
Ethernet: 3 x 10 Gb/s
OS: 64-bit CentOS

Slide 16


Hidden Treasure
Insight into data can provide business advantage.
A few key early indicators can mean fortunes to a business.
More data allows more precise analysis.

Case Study: Sears Holdings Corporation

*Sears was using traditional systems such as Oracle Exadata, Teradata and SAS to store and process its customer activity and sales data.
http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?

Slide 17

Limitations of Existing Data Analytics Architecture

Data flow (bottom to top of the diagram):
Instrumentation
Collection
Storage only Grid (Original Raw Data): storage, mostly append
ETL Compute Grid: processing
RDBMS (Aggregated Data)
BI Reports + Interactive Apps

Limitations:
1. Can't explore original high fidelity raw data
2. Moving data to compute doesn't scale
3. Premature data death

A meagre 10% of the ~2PB data is available for BI; 90% of the ~2PB is archived.

Slide 18

Solution: A Combined Storage Compute Layer

Data flow (bottom to top of the diagram):
Instrumentation
Collection
Hadoop: Storage + Compute Grid (both storage and processing, mostly append)
RDBMS (Aggregated Data)
BI Reports + Interactive Apps

Benefits:
1. Data exploration and advanced analytics
2. Scalable throughput for ETL and aggregation
3. Keep data alive forever

The entire ~2PB of data is available for processing, with no data archiving.

*Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% available with its existing non-Hadoop solutions.

Slide 19

Annie's Question
Hadoop is a framework that allows for the distributed processing of:
Small Data Sets
Large Data Sets

Slide 20

Annie's Answer
Ans. Large Data Sets.
Hadoop can also process small data sets. However, its real strength shows when the data runs into terabytes, where an RDBMS may take hours or fail outright, whereas Hadoop can complete the same work in a few minutes.

Slide 21

Hadoop Ecosystem (Hadoop 2.0)

Apache Oozie (Workflow)
Hive (DW System)
Pig Latin (Data Analysis)
Mahout (Machine Learning)
MapReduce Framework
Other YARN Frameworks (MPI, GRAPH)
HBase
YARN (Cluster Resource Management)
HDFS (Hadoop Distributed File System)
Flume (Unstructured or Semi-structured Data)
Sqoop (Structured Data)

Slide 22


Hadoop Cluster: Facebook

Facebook
We use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.
Currently we have 2 major clusters:
A 1100-machine cluster with 8800 cores and about 12 PB raw storage.
A 300-machine cluster with 2400 cores and about 3 PB raw storage.
Each (commodity) node has 8 cores and 12 TB of storage.
We are heavy users of both streaming as well as the Java APIs. We have built a higher level data warehousing framework using these features called Hive (see http://Hadoop.apache.org/hive/). We have also developed a FUSE implementation over HDFS.
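
The cluster figures quoted above can be cross-checked against the per-node numbers. This is a minimal sketch. The per-node values come from the slide, while the Cluster class and its property names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass

CORES_PER_NODE = 8         # from the slide: each node has 8 cores
STORAGE_TB_PER_NODE = 12   # from the slide: each node has 12 TB of storage


@dataclass
class Cluster:
    """Hypothetical helper that derives cluster totals from a node count."""
    machines: int

    @property
    def cores(self) -> int:
        return self.machines * CORES_PER_NODE

    @property
    def raw_storage_tb(self) -> int:
        return self.machines * STORAGE_TB_PER_NODE


large = Cluster(machines=1100)
small = Cluster(machines=300)
print(large.cores, large.raw_storage_tb)
print(small.cores, small.raw_storage_tb)
```

The core counts match the quoted 8800 and 2400 exactly. The storage products come to 13,200 TB and 3,600 TB, somewhat above the quoted "about 12 PB" and "about 3 PB", so the per-node storage figure is probably best read as approximate.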

Slide 23

YARN: Moving beyond MapReduce

BATCH (MapReduce)
INTERACTIVE (Tez)
ONLINE (HBase)
STREAMING (Storm, S4, ...)
GRAPH (Giraph)
IN-MEMORY (Spark)
HPC MPI (OpenMPI)
OTHER (Search) (Weave..)

http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html
Slide 24

Hadoop Cluster Modes

Hadoop can run in any of the following three modes:

Standalone (or Local) Mode
No daemons; everything runs in a single JVM.
Suitable for running MapReduce programs during development.
Has no DFS.

Pseudo-Distributed Mode
Hadoop daemons run on the local machine.

Fully-Distributed Mode
Hadoop daemons run on a cluster of machines.
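
Standalone mode is described above as a single-process setting suited to developing MapReduce programs. As a rough illustration of the programming model only, the sketch below runs a word count through map, shuffle and reduce steps in plain Python. It does not use Hadoop or any Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in a single process."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data and hadoop", "hadoop stores big data"])
print(counts)
```

In a real cluster the map and reduce steps would run as separate tasks on different machines. Here everything runs in one process, which loosely mirrors how standalone mode keeps the whole job in a single JVM.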

Slide 25

Big Data Learning Path

Developer/Testing
Java / Python / Ruby
Hadoop Eco-system
NoSQL DB
Spark

Administration
Linux Administration
Cluster Management
Cluster Performance
Virtualization

Data Analyst
Statistics Skills
Machine Learning
Hadoop Essentials
Expertise in R

Courses
MapReduce Design Patterns
Apache Cassandra
Apache Spark & Scala
Big Data and Hadoop
Hadoop Administration
Linux Administration
Data Science
Advanced Predictive Modelling in R
Business Analytics Using R
Data Visualization Using Tableau
Talend for Big Data

Slide 26

Learning Path to Certification

Course
LIVE Online Class
Class Recording in LMS
Verifiable Certificate
Project Work
24/7 Post Class Support
Module Wise Quiz and Assignment
1. Assistance from Peers and Support team
2. Review for Certification

Slide 27

Further Reading
Apache Hadoop and HDFS
http://www.edureka.in/blog/introduction-to-apache-hadoop-hdfs/
Apache Hadoop HDFS Architecture
http://www.edureka.in/blog/apache-hadoop-hdfs-architecture/

Slide 28

Assignment
Referring to the documents in the LMS under Assignment, solve the following problem:
How many such DataNodes would you need to read 100 TB of data in 5 minutes in your Hadoop cluster?

Slide 29

Survey
Your feedback is important to us, be it a compliment, a suggestion or a complaint. It helps us make the course better!
Please spare a few minutes to take the survey after the webinar.

Slide 30
