
Lesson - 1 (Intro to Big Data and Hadoop)

Revision (—)

Topics for Today (22nd July 2017) :-
⁃ Introductory Presentation
⁃ Your introduction (In Process)
    ⁃ Name
    ⁃ Location
    ⁃ Years of Exp
⁃ A note on my blog
⁃ A note on technical and related queries
⁃ Google Drive link
    ⁃ TOC File daily updates and uploads
⁃ Java Modules
⁃ Books reference
    ⁃ Interview Book
⁃ Big Data Case Studies and setting the context
    ⁃ Quick Case Studies
        ⁃ Customer Churn Analysis - Slide 11
        ⁃ Point of sale transaction - Slide 12
    ⁃ What is big data? - Slide 15-16
    ⁃ 3Vs - Slide 14
    ⁃ 5Vs - Slide 20
    ⁃ Evolution of Big Data - SR Slide 7
    ⁃ Types of Data - SR Slide 7
    ⁃ Big Data Challenges
⁃ Introduction to Hadoop
    ⁃ Software Know how - done
    ⁃ Local Set up
        ⁃ Mac
        ⁃ Unix
    ⁃ Hadoop Philosophies

Homework for Today (22nd July 2017) :-


1. Download and configure the Acadgild VM on your machines
2. Java modules - My modules
3. Read 2 chapters of the Definitive Guide
4. Read the Case Studies / Reading material
5. Hadoop Single Node Installation
6. Start reading the big data interview guide
7. Commonly used unix / linux commands from my blog
http://syed-rizvi.blogspot.in/

Big Data Challenges

a) Storage
b) Processing
c) Manual Distributed Computing

Apache Hadoop is an open-source framework which provides an
automated distributed computing environment for storing big
data sets. It does that storage using a cluster of commodity
machines. It then analyses this stored big data using a very
simple programming model.

The storage mechanism is known as HDFS (Hadoop Distributed File
System). It is based on Google's GFS (Google File System) white paper.
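
For concreteness, here is a minimal sketch of writing and reading a file through the HDFS Java API (the FileSystem class). The NameNode address hdfs://localhost:9000 and the file path are assumptions for a single-node setup, not values from the lesson.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsHello {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed single-node NameNode
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/hello.txt"); // hypothetical path
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeUTF("hello hdfs"); // written once...
            }
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF()); // ...read many times
            }
        }
    }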

The analytical mechanism is known as MapReduce and is based on
Google's MapReduce white paper.
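
To make "very simple programming model" concrete, here is a sketch of the classic word-count job against the Hadoop 2.x Java API; the class names and tokenising logic are illustrative, not from the lesson.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Map phase: runs in parallel on each block of the input,
        // emitting (word, 1) for every word seen.
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        ctx.write(word, ONE);
                    }
                }
            }
        }

        // Reduce phase: receives every count emitted for one word and sums them.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }
    }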

There are 4 basic philosophies on which Hadoop works.

a) All the basic software that helps start a Hadoop cluster is a
software daemon.
b) All the above daemons are based on a master-slave
architecture.
c) The entire Hadoop framework is divided into 2 broad parts -
storage (HDFS) and processing (MapReduce).

Hadoop 2.x

⁃ HDFS (Storage Part)
    ⁃ Master Daemon - Namenode (High End Admin Machine) (1 in number)
    ⁃ Backup Master Daemon - Secondary Namenode (High End Admin Machine) (1 in number); strictly a checkpointing helper, not a hot standby
    ⁃ Slave Daemons - Datanode (Commodity Machines) (Many in number)
⁃ MapReduce (Processing Part) - YARN (Yet Another Resource Negotiator)
    ⁃ Master Daemon - ResourceManager (High End Admin Machine) (1 in number)
    ⁃ Slave Daemon - NodeManager (Commodity Machines) (Many in number)

d) Hadoop is a batch-oriented system which can never be plugged
in behind an online transaction processing (OLTP) system. Moreover,
it is a write-once, read-many-times data storage mechanism. This
means you can never update the data in place. If you really want to
update, you need to delete the previous version and upload a new copy.
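
A minimal sketch of that delete-then-reupload workflow with the HDFS Java API; both paths and the NameNode address are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplaceFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed, as above
            FileSystem fs = FileSystem.get(conf);

            Path current = new Path("/user/demo/sales.csv");     // existing version in HDFS
            Path updated = new Path("file:///tmp/sales_v2.csv"); // new version on local disk

            if (fs.exists(current)) {
                fs.delete(current, false); // no in-place edit: drop the old version...
            }
            fs.copyFromLocalFile(updated, current); // ...then upload the new copy
        }
    }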
DataNode is the slave daemon software for the storage part of
Hadoop.
Likewise, NodeManager is the slave daemon software for the
processing part of Hadoop.
When you refer to hardware in Hadoop, you always refer to it as
either a commodity machine or a slave node.
An important point to remember: there is a difference between a
"slave daemon" (software) and a "slave node" (hardware, which by
the way is also called commodity hardware).
Both the DataNode slave daemon and the NodeManager slave daemon
run on a commodity machine, or slave machine, which is hardware.


10 TB - 1 machine (commodity machine) -> 10 hrs

10 TB - 10 machines -> 1 hr

Split evenly across 10 machines that read in parallel, each machine
scans only 1 TB, so the same job finishes in roughly one-tenth of the time.
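
The arithmetic, as a toy calculation assuming perfectly linear scaling and no coordination overhead (both idealisations):

    public class ScanTime {
        public static void main(String[] args) {
            double terabytes = 10.0;
            double hoursPerTb = 1.0; // assumed: one machine scans 1 TB per hour
            for (int machines : new int[] {1, 10}) {
                double hours = terabytes * hoursPerTb / machines;
                System.out.println(machines + " machine(s) -> " + hours + " hrs");
            }
        }
    }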

Blog:

http://syed-rizvi.blogspot.in/

Google Drive:

https://drive.google.com/folderview?id=0BwfmpHQetSFES3UzTDhITkR2Q3c&usp=sharing#list

Case Studies

http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?

http://www.computerweekly.com/news/2240219736/Case-Study-How-big-data-powers-the-eBay-customer-journey

- Standalone mode (everything runs in a single JVM; no Hadoop daemons)

- Pseudo-distributed mode (all daemons run on a single machine; see the sketch after this list)

- Fully distributed mode (daemons spread across a cluster of machines)
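
A minimal sketch of how the fs.defaultFS setting separates these modes from a client's point of view. In a real installation this property lives in core-site.xml; setting it in code here, and the localhost:9000 address, are illustrative assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class WhichMode {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Standalone mode: no daemons; the local filesystem is the default.
            // conf.set("fs.defaultFS", "file:///");
            // Pseudo-distributed mode: all daemons on one machine, HDFS on localhost.
            conf.set("fs.defaultFS", "hdfs://localhost:9000");
            FileSystem fs = FileSystem.get(conf);
            System.out.println("Default filesystem: " + fs.getUri());
        }
    }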

Break for 5 mins - Let's meet at 9:57 AM by my computer


- Data science

- Basics of statistics

- R, Tableau
