www.edureka.co/big-data-and-hadoop
Objectives
At the end of this session, you will understand:
Big Data Introduction
Use Cases of Big Data in Multiple Industry Verticals
Hadoop and Its Eco-System
Hadoop Architecture
Learning Path for Developers, Administrators, Testing Professionals and Aspiring Data Scientists
Slide 2
Source: Twitter
Slide 3
Slide 4
Annie's Introduction
Hello there! My name is Annie. I love quizzes and puzzles, and I am here to make you think and answer my questions.
Slide 5
Annie's Question
Map the following to the corresponding data type:
XML files, e-mail body
Audio, Video, Images, Archived documents
Data from Enterprise systems (ERP, CRM etc.)
Slide 6
Annie's Answer
Ans. XML files and e-mail bodies are semi-structured data; audio, video, images and archived documents are unstructured data; and data from enterprise systems (ERP, CRM, etc.) is structured data.
Slide 7
Further Reading
More on Big Data
http://www.edureka.in/blog/the-hype-behind-big-data/
Why Hadoop?
http://www.edureka.in/blog/why-hadoop/
Opportunities in Hadoop
http://www.edureka.in/blog/jobs-in-hadoop/
Big Data
http://en.wikipedia.org/wiki/Big_Data
IBM's definition of Big Data characteristics
http://www-01.ibm.com/software/data/bigdata/
Slide 8
Recommendation Engines
Ad Targeting
Search Quality
Abuse and Click Fraud Detection
Telecommunications
http://wiki.apache.org/hadoop/PoweredBy
Slide 9
Government
Fraud Detection and Cyber Security
Welfare Schemes
Justice
http://wiki.apache.org/hadoop/PoweredBy
Slide 10
Retail
Point of Sales Transaction Analysis
Customer Churn Analysis
Sentiment Analysis
http://wiki.apache.org/hadoop/PoweredBy
Slide 11
Why DFS?
Reading 1 TB of data, where each machine has 4 I/O channels and each channel delivers 100 MB/s:
1 Machine: 4 x 100 MB/s = 400 MB/s aggregate, so the read takes about 43 minutes.
10 Machines: 10 x 4 x 100 MB/s = 4,000 MB/s aggregate, so the same read takes about 4.3 minutes.
Slide 14
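The read-time figures above can be reproduced with a quick back-of-the-envelope calculation. This is a minimal sketch using the slide's own assumptions (4 I/O channels per machine, 100 MB/s per channel, data spread evenly across machines); the function name is illustrative.

```python
# Back-of-the-envelope read time: each machine contributes 4 I/O channels
# at 100 MB/s, and machines read their share of the data in parallel.
def read_time_minutes(data_tb, machines, channels=4, mb_per_s=100):
    """Minutes to scan `data_tb` terabytes across `machines` machines."""
    total_mb = data_tb * 1024 * 1024               # 1 TB = 1,048,576 MB
    throughput_mb_s = machines * channels * mb_per_s
    return total_mb / throughput_mb_s / 60

print(round(read_time_minutes(1, 1), 1))   # one machine: ~43.7 minutes
print(round(read_time_minutes(1, 10), 1))  # ten machines: ~4.4 minutes
```

Ten machines cut the scan time tenfold, which is the whole point of distributing the file system across a cluster.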
Edureka Contact : corp@edureka.co
Slide 15
Secondary NameNode: RAM: 32 GB; hard disk: 1 TB; processor: Xeon with 4 cores; Ethernet: 3 x 10 Gb/s; OS: 64-bit CentOS; power: redundant power supply
NameNode: RAM: 64 GB; hard disk: 1 TB; processor: Xeon with 8 cores; Ethernet: 3 x 10 Gb/s; OS: 64-bit CentOS; power: redundant power supply
DataNode: RAM: 16 GB; hard disk: 6 x 2 TB; processor: Xeon with 2 cores; Ethernet: 3 x 10 Gb/s; OS: 64-bit CentOS
Slide 16
StandBy NameNode: RAM: 64 GB; hard disk: 1 TB; processor: Xeon with 8 cores; Ethernet: 3 x 10 Gb/s; OS: 64-bit CentOS; power: redundant power supply
Hidden Treasure
Insight into data can provide a business advantage.
*Sears was using traditional systems such as Oracle Exadata, Teradata and SAS to store and process its customer activity and sales data.
http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?
Slide 17
The legacy pipeline ran Instrumentation → Collection → Storage → Processing, with 90% of the ~2 PB collected going straight to archive: a premature data death.
Slide 18
With Hadoop, the entire ~2 PB is available for processing: no data archiving, mostly append-only writes, and the same cluster handles both storage and processing after Instrumentation and Collection.
*Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% available with its previous non-Hadoop solutions.
Slide 19
Annie's Question
Hadoop is a framework that allows for the distributed processing of:
Small Data Sets
Large Data Sets
Slide 20
Annie's Answer
Ans. Large Data Sets.
Hadoop can also process small data sets. However, to experience the true power of Hadoop, you need data in terabytes: this is where an RDBMS takes hours or fails outright, whereas Hadoop does the same job in a couple of minutes.
Slide 21
Hadoop Ecosystem
Hadoop 2.0:
Apache Oozie: workflow scheduling
Hive: data-warehouse system
Pig Latin: data analysis
Mahout: machine learning
MapReduce framework, alongside other YARN frameworks (MPI, graph processing) and HBase
YARN: cluster resource management
HDFS: Hadoop Distributed File System
Flume: ingestion of unstructured or semi-structured data
Sqoop: ingestion of structured data
Slide 22
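To make the MapReduce layer of the ecosystem concrete, here is a minimal word-count sketch in plain Python. With Hadoop Streaming the mapper and reducer would be separate scripts wired together by the framework; here they are local functions so the data flow can be run directly, and the names are illustrative rather than part of any Hadoop API.

```python
# Word count in the MapReduce style: a mapper emits (word, 1) pairs,
# the pairs are sorted (the "shuffle" phase), and a reducer sums the
# counts for each word.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    return (word, sum(counts))

def run(lines):
    # Map every line, then sort all pairs by word to group them.
    pairs = sorted(p for line in lines for p in mapper(line))
    return dict(reducer(word, (c for _, c in group))
                for word, group in groupby(pairs, key=itemgetter(0)))

print(run(["hadoop stores data", "hadoop processes data"]))
# {'data': 2, 'hadoop': 2, 'processes': 1, 'stores': 1}
```

On a real cluster the framework runs many mappers and reducers in parallel across DataNodes; the logic per record stays this simple.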
We are heavy users of both streaming as well as the Java APIs. We have built a higher-level data warehousing framework using these features called Hive (see http://Hadoop.apache.org/hive/). We have also developed a FUSE implementation over HDFS.
Slide 23
Applications that run on YARN:
BATCH (MapReduce)
INTERACTIVE (Tez)
ONLINE (HBase)
STREAMING (Storm, S4, …)
GRAPH (Giraph)
IN-MEMORY (Spark)
HPC MPI (OpenMPI)
OTHER (Search, Weave, …)
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html
Slide 24
Pseudo-Distributed Mode
All Hadoop daemons run on a single local machine, each in a separate JVM process.
Fully-Distributed Mode
Hadoop daemons run across a cluster of machines.
Slide 25
Slide 26
Learning paths:
Developer: MapReduce Design Patterns; Apache Cassandra
Hadoop Administration: Linux Administration; Cluster Management; Cluster Performance; Virtualization
Data Analyst: Apache Spark & Scala; Statistics Skills; Machine Learning; Hadoop Essentials; Expertise in R
Data Science: Advanced Predictive Modelling in R; Business Analytics Using R; Data Visualization Using Tableau
The course includes:
Verifiable Certificate
Project Work
24/7 Post Class Support
Slide 27
Further Reading
Apache Hadoop and HDFS
http://www.edureka.in/blog/introduction-to-apache-hadoop-hdfs/
Slide 28
Assignment
Referring to the documents in the LMS under Assignment, solve the following problem:
How many DataNodes would you need to read 100 TB of data in 5 minutes in your Hadoop cluster?
Slide 29
Survey
Your feedback is important to us, be it a compliment, a suggestion or a complaint. It helps us make the course better!
Please spare a few minutes to take the survey after the webinar.
Slide 30