
Presentation on
Cloud Computing

By:
ELSON D’SOUZA (1MS14IS035)
ESHWAR M. S. (1MS14IS132)
ARPITHA (1MS14IS019)
AGENDA

➢ BIG DATA
➢ HADOOP ECOSYSTEM
➢ HDFS
➢ MAP REDUCE
➢ DEMO on BLUEMIX
BIG DATA

➢ Any data that cannot be easily stored, processed, and analyzed using traditional systems.
BIG DATA

➢ Hadoop stores data in HDFS and processes it using MapReduce.
➢ Data is stored in the cluster and processed in the cluster itself.
➢ A cluster contains many machines.
WHAT IS APACHE HADOOP?

➢ An open-source software framework designed for storage and processing of large-scale data on clusters of commodity hardware.
➢ Created by Doug Cutting and Mike Cafarella in 2005.
➢ Cutting named the project after his son’s toy elephant.

HADOOP ECOSYSTEM

[Diagram: Sqoop, Flume, Oozie, Hue, Mahout, Pig, Hive, MapReduce (MR), Impala, and HBase layered on top of HDFS.]
APPLICATIONS OF HADOOP
➢ Data-intensive text processing
➢ Assembly of large genomes
➢ Graph mining
➢ Machine learning and data mining
➢ Large-scale social network analysis

USERS OF HADOOP
HDFS

HDFS: Hadoop Distributed File System
HDFS

mydata.txt (150 MB) is split into blocks:
  blk_1  64 MB
  blk_2  64 MB
  blk_3  22 MB
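The split arithmetic behind the slide can be sketched in a few lines of plain Java; the 64 MB figure is the classic default block size of early Hadoop releases, and the file name and sizes come from the slide:

    // Sketch: how a 150 MB file is cut into 64 MB HDFS blocks.
    public class BlockSplit {
        public static void main(String[] args) {
            long fileSize = 150L * 1024 * 1024;   // mydata.txt: 150 MB
            long blockSize = 64L * 1024 * 1024;   // default HDFS block size (Hadoop 1.x)
            int block = 1;
            for (long offset = 0; offset < fileSize; offset += blockSize) {
                long len = Math.min(blockSize, fileSize - offset);
                System.out.printf("blk_%d: %d MB%n", block++, len / (1024 * 1024));
            }
            // Prints: blk_1: 64 MB, blk_2: 64 MB, blk_3: 22 MB
        }
    }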
HDFS
➢ HDFS is a file system written in Java, based on Google’s GFS.
➢ Provides redundant storage for massive amounts of data.
➢ Files are split into blocks.
➢ Blocks are spread across many machines at load time.
  • Different blocks from the same file will be stored on different machines.
➢ Blocks are replicated across multiple machines.
➢ The Name Node keeps track of which blocks make up a file and where they are stored (see the sketch below).
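A client can query exactly this block-to-machine mapping through Hadoop's FileSystem API. A minimal sketch, assuming a cluster configuration on the classpath and a hypothetical path /user/demo/mydata.txt:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();       // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/user/demo/mydata.txt");  // hypothetical path
            FileStatus status = fs.getFileStatus(file);
            // The Name Node answers this query: one BlockLocation per block,
            // listing the Data Nodes that hold a replica of it.
            for (BlockLocation blk : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.printf("offset=%d len=%d hosts=%s%n",
                        blk.getOffset(), blk.getLength(), String.join(",", blk.getHosts()));
            }
        }
    }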
HDFS
➢ When a client wants to retrieve data:
  • It communicates with the Name Node to determine which blocks make up the file and on which Data Nodes those blocks are stored.
  • It then communicates directly with the Data Nodes to read the data (sketched below).
➢ HDFS works best with a smaller number of large files:
  • Millions of files as opposed to billions.
  • Typically 100 MB or more per file.
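This two-step read path is wrapped inside FileSystem.open(): the call consults the Name Node, and the returned stream pulls the bytes directly from the Data Nodes. A sketch using the same hypothetical path as above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class ReadFile {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // open() asks the Name Node for block locations; the stream then
            // reads block data straight from the Data Nodes.
            try (FSDataInputStream in = fs.open(new Path("/user/demo/mydata.txt"))) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }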
NAME NODE
➢ Stores metadata for the files, like the directory structure of a typical file system.
➢ The server holding the Name Node instance is quite crucial, as there is only one.
➢ Keeps a transaction log for file deletes/adds, etc.; transactions cover only metadata, not whole blocks or file streams.
➢ Handles creation of additional replica blocks when necessary after a Data Node failure.

DATA NODE

➢ Stores the actual data in HDFS.
➢ Can run on any underlying filesystem (ext3/4, NTFS, etc.).
➢ Notifies the Name Node of what blocks it has.
➢ The Name Node replicates blocks 3x by default (see the sketch below).
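The 3x figure is the dfs.replication default and can be overridden per file; a minimal sketch using FileSystem.setReplication(), with the path again hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // dfs.replication defaults to 3; setReplication overrides it per file.
            fs.setReplication(new Path("/user/demo/mydata.txt"), (short) 5);
        }
    }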

MAP REDUCE

Input records            Grouped by city
Mumbai     ₹ 45          Mumbai     ₹ 45 + ₹ 68
Bangalore  ₹ 64          Bangalore  ₹ 64
Chennai    ₹ 123         Chennai    ₹ 123
Delhi      ₹ 75          Delhi      ₹ 75
Mumbai     ₹ 68
...
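Before bringing Hadoop in, the aggregation above (sum the sales per city) can be simulated in a few lines of plain Java; the figures are taken from the slide:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class SalesByCity {
        public static void main(String[] args) {
            String[][] records = {
                {"Mumbai", "45"}, {"Bangalore", "64"}, {"Chennai", "123"},
                {"Delhi", "75"}, {"Mumbai", "68"}
            };
            // "Map" each record to (city, amount), then "reduce" by summing per key.
            Map<String, Integer> totals = new LinkedHashMap<>();
            for (String[] r : records) {
                totals.merge(r[0], Integer.parseInt(r[1]), Integer::sum);
            }
            totals.forEach((city, sum) -> System.out.println(city + " ₹" + sum));
            // Mumbai ₹113, Bangalore ₹64, Chennai ₹123, Delhi ₹75
        }
    }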
MAP REDUCE

[Diagram: Mappers emit (city, amount) pairs such as (Mum, 45) and (Bang, 64); pairs with the same key are routed to the same Reducer, e.g. Mumbai and Bangalore to one reducer, Chennai and Delhi to another.]
MAP REDUCE

Mappers → Intermediate Records (KEY, VALUE) → Shuffle and Sort → Reducers → Results
MAP REDUCE

[Diagram: a Job Tracker coordinating Task Trackers across the cluster.]
MAP REDUCE

➢ A method for distributing computation across multiple nodes.
➢ Each node processes the data that is stored at that node.
➢ Consists of two main phases:
  • Map
  • Reduce
MAP

➢ Reads data from an input file.
➢ Outputs zero or more key/value pairs (see the mapper sketch below).
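For the city/amount example, the map phase might look like the following sketch against Hadoop's Mapper API; the class name and the tab-separated input layout are assumptions:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Input: one text line per sale, e.g. "Mumbai\t45".
    // Output: zero or more (city, amount) pairs.
    public class SalesMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final Text city = new Text();
        private final IntWritable amount = new IntWritable();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length != 2) return;   // emit nothing for malformed lines
            city.set(fields[0]);
            amount.set(Integer.parseInt(fields[1]));
            context.write(city, amount);
        }
    }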
SORT

➢ Output from the mapper is sorted by key.
➢ All values with the same key are guaranteed to go to the same machine (see the partitioning sketch below).
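That guarantee comes from hashing the key: Hadoop's default HashPartitioner maps each key to a reducer with the one-line formula below, shown here in a standalone sketch:

    public class KeyRouting {
        // Same routing logic as Hadoop's default HashPartitioner.getPartition().
        static int partition(String key, int numReducers) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
        }

        public static void main(String[] args) {
            int reducers = 2;
            for (String city : new String[] {"Mumbai", "Bangalore", "Chennai", "Delhi", "Mumbai"}) {
                // Identical keys always hash to the same reducer.
                System.out.println(city + " -> reducer " + partition(city, reducers));
            }
        }
    }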
REDUCE

➢ Called once for each unique key.
➢ Gets a list of all values associated with a key as input.
➢ The reducer outputs zero or more final key/value pairs (see the reducer sketch below).
  • Usually just one output per input key.
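The matching reduce phase for the city/amount example, again a sketch against Hadoop's Reducer API with an assumed class name:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Called once per unique city with all of its amounts; emits one total.
    public class SalesReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text city, Iterable<IntWritable> amounts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable amount : amounts) {
                sum += amount.get();
            }
            context.write(city, new IntWritable(sum));   // e.g. (Mumbai, 113)
        }
    }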
JOB TRACKER AND TASK TRACKER

➢ Job Tracker
  • Determines the execution plan for the job.
  • Assigns individual tasks.
➢ Task Tracker
  • Keeps track of the progress of an individual mapper or reducer.
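A driver ties the mapper and reducer into a job that the Job Tracker can schedule; a minimal sketch, with the job name and the input/output paths as assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SalesDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "sales by city");
            job.setJarByClass(SalesDriver.class);
            job.setMapperClass(SalesMapper.class);
            job.setReducerClass(SalesReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/user/demo/sales"));     // hypothetical
            FileOutputFormat.setOutputPath(job, new Path("/user/demo/totals"));  // hypothetical
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }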

DEMO
