Introduction To Big Data

Introduction to Big Data and Hadoop
Edureka Contact : corp@edureka.co
www.edureka.co/big-data-and-hadoop
Objectives
At the end of this session, you will understand:
Big Data Introduction
Use Cases of Big Data in Multiple Industry Verticals
Hadoop and Its Eco-System
Hadoop Architecture
Learning Path for Developers, Administrators, Testing Professionals and Aspiring Data Scientists
Slide 2
Unstructured Data Is Exploding
Source: Twitter
Slide 3
IBM's Definition of Big Data

IBM's Definition: Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
Slide 4
Annie's Introduction
Hello there! My name is Annie. I love quizzes and puzzles, and I am here to make you think and answer my questions.
Slide 5
Annie's Question
Map the following to corresponding data type:
XML files, e-mail body
Audio, Video, Images, Archived documents
Data from Enterprise systems (ERP, CRM etc.)
Slide 6
Annie's Answer
Ans.
XML files, e-mail body: Semi-structured data
Audio, Video, Images, Archived documents: Unstructured data
Data from Enterprise systems (ERP, CRM etc.): Structured data
Slide 7
Further Reading
More on Big Data
http://www.edureka.in/blog/the-hype-behind-big-data/
Why Hadoop?
http://www.edureka.in/blog/why-hadoop/
Opportunities in Hadoop
http://www.edureka.in/blog/jobs-in-hadoop/
Big Data
http://en.wikipedia.org/wiki/Big_Data
IBM's Definition: Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
Slide 8
Common Big Data Customer Scenarios
Web and e-tailing
Recommendation Engines
Ad Targeting
Search Quality
Abuse and Click Fraud Detection
Telecommunications
Customer Churn Prevention

Network Performance Optimization
Call Detail Record (CDR) Analysis
Analysing Network to Predict Failure
http://wiki.apache.org/hadoop/PoweredBy
Slide 9
Common Big Data Customer Scenarios (Contd.)
Government
Fraud Detection and Cyber Security
Welfare Schemes
Justice
Healthcare and Life Sciences
Health Information Exchange

Gene Sequencing
Serialization
Healthcare Service Quality Improvements
Drug Safety
Slide 10
Common Big Data Customer Scenarios (Contd.)

Banks and Financial services
Modeling True Risk

Threat Analysis
Fraud Detection
Trade Surveillance
Credit Scoring and Analysis
Retail
Point of Sales Transaction Analysis
Customer Churn Analysis
Sentiment Analysis
Slide 11
Why DFS?
Read 1 TB Data
1 Machine
4 I/O Channels
Each Channel 100 MB/s
43 Minutes
10 Machines
4 I/O Channels per Machine
Each Channel 100 MB/s
4.3 Minutes
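The 43-minute and 4.3-minute figures can be reproduced with a short calculation. The sketch below is only an illustration: the function name and parameters are hypothetical, it assumes 1 TB = 1024 x 1024 MB, and it assumes the data is split evenly so that every channel on every machine reads in parallel with no network or coordination overhead.

```python
def read_time_minutes(data_tb, machines, channels_per_machine=4,
                      mb_per_s_per_channel=100):
    """Minutes needed to scan data_tb terabytes with all channels in parallel."""
    data_mb = data_tb * 1024 * 1024  # assumption: 1 TB = 1024 * 1024 MB
    throughput_mb_per_s = machines * channels_per_machine * mb_per_s_per_channel
    return data_mb / throughput_mb_per_s / 60


one = read_time_minutes(1, machines=1)
ten = read_time_minutes(1, machines=10)
print(f"1 machine:   {one:.1f} minutes")
print(f"10 machines: {ten:.1f} minutes")
```

Under these assumptions the result is roughly 43.7 minutes on one machine and one tenth of that on ten machines, in line with the rounded figures on the slide.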
Slide 15
Hadoop Cluster: A Typical Use Case

Active NameNode
Secondary NameNode
RAM: 32 GB
Hard disk: 1 TB
Processor: Xeon with 4 cores
Ethernet: 3 x 10 Gb/s
OS: 64-bit CentOS
Power: Redundant Power Supply
RAM: 64 GB
Hard disk: 1 TB
OS: 64-bit CentOS
DataNode
RAM: 16 GB
Hard disk: 6 x 2 TB
Processor: Xeon with 2 cores
OS: 64-bit CentOS
StandBy NameNode
RAM: 64 GB
Hard disk: 1 TB
OS: 64-bit CentOS
DataNode
RAM: 16 GB
Hard disk: 6 x 2 TB
Processor: Xeon with 2 cores
OS: 64-bit CentOS
Slide 16
Hidden Treasure
Case Study: Sears Holdings Corporation
Insight into data can provide a business advantage.
Some key early indicators can mean fortunes to a business.
More precise analysis is possible with more data.
*Sears was using traditional systems such as Oracle Exadata, Teradata and SAS to store and process its customer activity and sales data.
http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?
Slide 17
Limitations of Existing Data Analytics Architecture

Layers (top to bottom):
BI Reports + Interactive Apps
RDBMS (Aggregated Data)
ETL Compute Grid (Processing)
Storage-only Grid (Original Raw Data, Mostly Append) (Storage)
Collection
Instrumentation

A meagre 10% of the ~2 PB of data is available for BI; 90% of the ~2 PB is archived.

Limitations:
1. Can't explore original high-fidelity raw data
2. Moving data to compute doesn't scale
3. Premature data death
Slide 18
Solution: A Combined Storage and Compute Layer

Layers (top to bottom):
BI Reports + Interactive Apps
RDBMS (Aggregated Data)
Hadoop: Storage + Compute Grid (Both Storage and Processing, Mostly Append)
Collection
Instrumentation

The entire ~2 PB of data is available for processing, with no data archiving.

Benefits:
1. Data exploration and advanced analytics
2. Scalable throughput for ETL and aggregation
3. Keep data alive forever

*Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% available with its existing non-Hadoop solutions.
Slide 19
Annie's Question
Hadoop is a framework that allows for the distributed
processing of:
Small Data Sets
Large Data Sets
Slide 20
Annie's Answer
Ans. Large Data Sets.
Hadoop can also process small data sets. However, to experience its true power, you need data in the terabyte range, because this is where an RDBMS takes hours or fails outright, whereas Hadoop can do the same work in a couple of minutes.
Slide 21
Hadoop Ecosystem
Hadoop 2.0
Apache Oozie (Workflow)
Hive (DW System)
Pig Latin (Data Analysis)
Mahout (Machine Learning)
MapReduce Framework
Other YARN Frameworks (MPI, GRAPH)
HBase
YARN (Cluster Resource Management)
HDFS (Hadoop Distributed File System)
Flume (Unstructured or Semi-structured Data)
Sqoop (Structured Data)
Slide 22
Hadoop Cluster: Facebook

Facebook
We use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.
Currently we have 2 major clusters:
A 1100-machine cluster with 8800 cores and about 12 PB raw storage.
A 300-machine cluster with 2400 cores and about 3 PB raw storage.
Each (commodity) node has 8 cores and 12 TB of storage.
We are heavy users of both streaming as well as the Java APIs. We have built a higher level data warehousing framework using these features called Hive (see http://Hadoop.apache.org/hive/). We have also developed a FUSE implementation over HDFS.
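As a quick sanity check, the per-node figures can be multiplied out and compared with the quoted cluster totals. This is only a sketch: the dictionary layout and function name are illustrative, and it assumes 1 PB = 1024 TB.

```python
# Figures quoted in the slide; the names and layout here are illustrative only.
CORES_PER_NODE = 8
TB_PER_NODE = 12

clusters = {
    "large": {"machines": 1100, "quoted_cores": 8800, "quoted_pb": 12},
    "small": {"machines": 300, "quoted_cores": 2400, "quoted_pb": 3},
}


def derived_totals(machines):
    """Return (total cores, raw storage in PB) derived from per-node figures."""
    cores = machines * CORES_PER_NODE
    storage_pb = machines * TB_PER_NODE / 1024  # assumption: 1 PB = 1024 TB
    return cores, storage_pb


for name, c in clusters.items():
    cores, storage_pb = derived_totals(c["machines"])
    print(f"{name}: {cores} cores (quoted {c['quoted_cores']}), "
          f"{storage_pb:.1f} PB (quoted about {c['quoted_pb']})")
```

The core counts match the quoted totals exactly. The derived storage comes out somewhat higher than the quoted "about 12 PB" and "about 3 PB", which may simply reflect rounding in the quoted figures.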
Slide 23
YARN: Moving beyond MapReduce
BATCH (MapReduce)
INTERACTIVE (Tez)
ONLINE (HBase)
STREAMING (Storm, S4, ...)
GRAPH (Giraph)
IN-MEMORY (Spark)
HPC MPI (OpenMPI)
OTHER (Search, Weave, ...)
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html
Slide 24
Hadoop Cluster Modes

Hadoop can run in any of the following three modes:
Standalone (or Local) Mode
No daemons, everything runs in a single JVM.
Suitable for running MapReduce programs during development.
Has no DFS.
Pseudo-Distributed Mode
Hadoop daemons run on the local machine.
Fully-Distributed Mode
Hadoop daemons run on a cluster of machines.
Slide 25
Big Data Learning Path

Developer/Testing
Java / Python / Ruby
Hadoop Eco-system
NoSQL DB
Spark
Administration
MapReduce Design Patterns
Apache Cassandra
Hadoop Administration
Linux Administration
Cluster Management
Cluster Performance
Virtualization
Data Analyst
Apache Spark & Scala
Big Data and Hadoop
Statistics Skills
Machine Learning
Hadoop Essentials
Expertise in R
Data Science
Advance Predictive Modelling in R
Business Analytics Using R
Data Visualization Using Tableau
Talend for Big Data
Slide 26
Learning Path to Certification
LIVE Online Class
Course
Class Recording in LMS
Verifiable Certificate
Project Work
24/7 Post Class Support
Module Wise Quiz and Assignment
1. Assistance from Peers and Support Team
2. Review for Certification
Slide 27
Further Reading
Apache Hadoop and HDFS
http://www.edureka.in/blog/introduction-to-apache-hadoop-hdfs/
Apache Hadoop HDFS Architecture

http://www.edureka.in/blog/apache-hadoop-hdfs-architecture/
Slide 28
Assignment
Referring to the documents in the LMS under Assignment, solve the problem below.
How many such DataNodes would you need to read 100 TB of data in 5 minutes in your Hadoop cluster?
Slide 29
Survey
Your feedback is important to us, be it a compliment, a suggestion or a complaint. It helps us make the course better!
Please spare a few minutes to take the survey after the webinar.
Slide 30
