
Welcome to

Big Data
with

Hadoop & Spark

Introduction

Introduction to Hadoop & Spark


Data Variety


Data Variety

ETL
Extract Transform Load
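The ETL flow can be sketched in a few lines of plain Python. The rows, field names, and in-memory "warehouse" below are made-up illustrations, not from the slides:

```python
# Extract: pull raw records from a source (here, a hard-coded list)
raw_rows = ["  Alice,30 ", "Bob,25", " Carol,35"]

# Transform: clean whitespace, split fields, and cast types
records = [row.strip().split(",") for row in raw_rows]
people = [{"name": name, "age": int(age)} for name, age in records]

# Load: write the cleaned records into the target store (an in-memory list here)
warehouse = []
warehouse.extend(people)
print(warehouse)
```

In a real pipeline the source would be files or databases and the target a data warehouse, but the three stages keep the same shape.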


Distributed Systems

1. Groups of networked computers
2. Interact with each other
3. To achieve a common goal


Question

How Many Bytes in One Petabyte?

1.1259 × 10^15 (= 2^50 bytes)
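The figure follows from binary arithmetic; a quick sanity check in Python:

```python
# A binary petabyte (PiB) is 2^50 bytes; a decimal petabyte is 10^15 bytes.
binary_pib = 2 ** 50
decimal_pb = 10 ** 15

print(binary_pib)               # 1125899906842624
print(binary_pib / decimal_pb)  # ~1.1259, the figure on the slide
```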


Question

How Much Data Does Facebook Store in One Day?

600 TB


What is Big Data?

• Simply: data of very big size

• Can't be processed with the usual tools

• A distributed architecture is needed

• Structured / unstructured


Characteristics of Big Data

VOLUME: Data at Rest
Problems related to storing huge data reliably.
e.g. storage of a website's logs, storage of data by Gmail. FB: 300 PB.

VELOCITY: Data in Motion
Problems involving the handling of data arriving at a fast rate.
e.g. the number of requests received by Facebook, YouTube streaming. 600 TB/day.

VARIETY: Data in Many Forms
Problems involving complex data structures.
e.g. maps, social graphs, recommendations, Google Analytics.


Characteristics of Big Data - Variety

Problems involving complex data structures


e.g. Maps, Social Graphs, Recommendations


Question

Time taken to read 1 TB from HDD?

Around 6 hours
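The estimate follows from sustained disk throughput. Assuming roughly 50 MB/s for an older spinning disk (the rate is an assumption, not stated on the slide):

```python
terabyte = 10 ** 12          # bytes
throughput = 50 * 10 ** 6    # assumed ~50 MB/s sustained read speed
hours = terabyte / throughput / 3600
print(round(hours, 2))       # ~5.56 hours, i.e. around 6 hours
```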


Is One PetaByte Big Data?

If you have to count just the vowels in 1 petabyte of data every day, do you need a distributed system?


Is One PetaByte Big Data?

Yes.
Most of the existing systems can’t handle it.
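The vowel-count thought experiment maps naturally onto a split, count, combine pattern: the same pattern a distributed system applies across machines. A minimal single-machine sketch using threads (the chunking and worker count are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

VOWELS = set("aeiouAEIOU")

def count_vowels(chunk: str) -> int:
    # Each worker counts vowels in its own chunk, independently of the others
    return sum(1 for ch in chunk if ch in VOWELS)

def distributed_vowel_count(text: str, workers: int = 4) -> int:
    # Split the data, count each chunk in parallel, then combine the partial counts
    size = max(1, len(text) // workers)
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_vowels, chunks))

print(distributed_vowel_count("Big Data with Hadoop and Spark"))  # 9
```

At petabyte scale, the "chunks" become blocks on many disks and the "workers" become nodes in a cluster, but the split-and-combine structure is the same.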


Why Big Data?


Why is It Important Now?

Devices × Connectivity => Applications

Devices: smart phones (4.6 billion mobile phones).
Connectivity: WiFi, 4G, NFC, GPS (1 to 2 billion people accessing the internet).
Applications: social networks, Internet of Things.

The devices became cheaper, faster and smaller.
The connectivity improved. Result: many applications.


Computing Components
To process & store data we need:

1. CPU (speed)
2. RAM (speed & size)
3. HDD or SSD (disk size + speed)
4. Network
Which Components Impact the Speed of
Computing?
A. CPU
B. Memory Size
C. Memory Read Speed
D. Disk Speed
E. Disk Size
F. Network Speed
G. All of Above


Which Components Impact the Speed of
Computing?
Answer: G. All of the Above


Example Big Data Customers
1. Ecommerce - Recommendations



Example Big Data Problems
Recommendations - How?

USER ID | MOVIE ID       | RATING
KUMAR   | matrix         | 4.0
KUMAR   | Ice age        | 3.5

USER ID | MOVIE ID       | RATING
GIRI    | apocalypse now | 3.6
KUMAR   | apocalypse now | 3.6
GIRI    | Ice age        | 3.5
GIRI    | matrix         | 4.0
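One simple way to turn such a ratings table into recommendations is user-based collaborative filtering: score how similarly two users rate the movies they share, then use one user's rating to predict the other's. A sketch with cosine similarity (the similarity-weighted prediction rule is a deliberately simplified illustration, not a production algorithm):

```python
import math

# Ratings from the slide: user -> {movie: rating}
ratings = {
    "KUMAR": {"matrix": 4.0, "Ice age": 3.5},
    "GIRI":  {"matrix": 4.0, "Ice age": 3.5, "apocalypse now": 3.6},
}

def cosine_similarity(a: dict, b: dict) -> float:
    # Compare users only on the movies both have rated
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = math.sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = math.sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)

sim = cosine_similarity(ratings["KUMAR"], ratings["GIRI"])
# KUMAR and GIRI agree on every shared movie, so similarity is ~1.0;
# weight GIRI's rating by the similarity to predict KUMAR's unseen rating.
predicted = sim * ratings["GIRI"]["apocalypse now"]
print(round(sim, 4), round(predicted, 4))
```

With perfect agreement the prediction equals GIRI's 3.6, which matches the KUMAR / apocalypse now row in the table.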


Example Big Data Customers
2. Ecommerce - A/B Testing
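A/B testing boils down to comparing a metric across two randomly assigned variants. A toy comparison of conversion rates (all numbers below are made up for illustration):

```python
# Hypothetical experiment: same traffic split across two page variants
variants = {
    "A": {"visitors": 10_000, "purchases": 310},  # control
    "B": {"visitors": 10_000, "purchases": 365},  # new checkout flow
}

# Conversion rate = purchases / visitors, per variant
rates = {name: v["purchases"] / v["visitors"] for name, v in variants.items()}
winner = max(rates, key=rates.get)
uplift = (rates["B"] - rates["A"]) / rates["A"]

print(rates, winner, f"{uplift:.1%}")
```

In practice you would also run a statistical significance test before declaring a winner, since small differences can be noise.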


Big Data Customers
Government
1. Fraud Detection
2. Cyber Security
3. Welfare
4. Justice

Telecommunications
1. Customer Churn Prevention
2. Network Performance Optimization
3. Calling Data Record (CDR) Analysis
4. Analyzing Network to Predict Failure


Example Big Data Customers

Healthcare & Life Sciences


1.Health information exchange
2.Gene sequencing
3.Healthcare improvements
4.Drug Safety


Big Data Solutions

1. Apache Hadoop
   ○ Apache Spark
2. Cassandra
3. MongoDB
4. Google Compute Engine


What is Hadoop?

A. Created by Doug Cutting (of Yahoo)
B. Built for the Nutch search engine project
C. Joined by Mike Cafarella
D. Based on GFS, Google MapReduce (GMR) & Google Bigtable
E. Named after a toy elephant
F. Open source (Apache)
G. Powerful, popular & well supported
H. A framework to handle Big Data
I. For distributed, scalable and reliable computing
J. Written in Java
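Hadoop's processing model comes from Google MapReduce: a map step emits key-value pairs, a shuffle groups them by key, and a reduce step combines each group. The classic word-count example in plain Python (a single-machine sketch of the paradigm, not Hadoop API code):

```python
from collections import defaultdict
from itertools import chain

docs = ["big data with hadoop", "hadoop and spark", "big data"]

# Map: each document emits (word, 1) pairs
mapped = chain.from_iterable(((word, 1) for word in d.split()) for d in docs)

# Shuffle: group the emitted values by key
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce: combine each group's values into a final count per word
word_counts = {word: sum(ones) for word, ones in groups.items()}
print(word_counts)  # {'big': 2, 'data': 2, 'with': 1, 'hadoop': 2, 'and': 1, 'spark': 1}
```

On a real cluster the map tasks run next to the data blocks and the shuffle moves pairs across the network, but the three phases are exactly these.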
Components

The Hadoop ecosystem stack, from top to bottom:
- Workflow
- Spark (SQL-like interface; machine learning / stats)
- SQL Interface
- Compute Engine
- NoSQL Datastore
- Resource Manager
- File Storage

Apache Spark
• Really fast MapReduce
• 100x faster than Hadoop MapReduce in memory
• 10x faster on disk
• Builds on similar paradigms as MapReduce
• Integrated with Hadoop

Spark Core: a fast and general engine for large-scale data processing.


Spark Architecture

Data sources: HDFS, HBase, Hive, Tachyon, Cassandra
Languages: SQL, SparkR, Java, Python, Scala
Libraries: DataFrames, Streaming, MLlib, GraphX
Engine: Spark Core
Resource/cluster managers: Hadoop YARN, Amazon EC2, Standalone, Apache Mesos

