
Presentation on
Cloud Computing

By:
ELSON D’SOUZA (1MS14IS035)
ESHWAR M. S. (1MS14IS132)
ARPITHA (1MS14IS019)
AGENDA

➢ BIG DATA
➢ HADOOP ECOSYSTEM
➢ HDFS
➢ MAP REDUCE
➢ DEMO on BLUEMIX
BIG DATA

➢ Any data that cannot be easily stored, processed, and analyzed using traditional systems.
BIG DATA

➢ Hadoop stores data in HDFS and processes it using MapReduce.
➢ Data is stored in the cluster and processed in the cluster itself.
➢ A cluster contains many machines.
WHAT IS APACHE HADOOP?

➢ An open-source software framework designed for storage and processing of large-scale data on clusters of commodity hardware.
➢ Created by Doug Cutting and Mike Cafarella in 2005.
➢ Cutting named the project after his son’s toy elephant.

HADOOP ECOSYSTEM

[Diagram: Sqoop, Flume, Oozie, Hue, Mahout, Pig, Hive, MapReduce (MR), Impala, and HBase layered on top of HDFS.]
APPLICATIONS OF HADOOP
➢ Data-intensive text processing
➢ Assembly of large genomes
➢ Graph mining
➢ Machine learning and data mining
➢ Large-scale social network analysis

USERS OF HADOOP
HDFS

HDFS: Hadoop Distributed File System
HDFS

mydata.txt (150 MB) is split into blocks:
  blk_1  64 MB
  blk_2  64 MB
  blk_3  22 MB
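The split arithmetic behind the slide can be sketched in a few lines of plain Java; the 64 MB figure is the classic default block size of early Hadoop releases, and the file name and sizes come from the slide:

    // Sketch: how a 150 MB file is cut into 64 MB HDFS blocks.
    public class BlockSplit {
        public static void main(String[] args) {
            long fileSize = 150L * 1024 * 1024;   // mydata.txt: 150 MB
            long blockSize = 64L * 1024 * 1024;   // default HDFS block size (Hadoop 1.x)
            int block = 1;
            for (long offset = 0; offset < fileSize; offset += blockSize) {
                long len = Math.min(blockSize, fileSize - offset);
                System.out.printf("blk_%d: %d MB%n", block++, len / (1024 * 1024));
            }
            // Prints: blk_1: 64 MB, blk_2: 64 MB, blk_3: 22 MB
        }
    }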
HDFS
➢ HDFS is a file system written in Java, based on Google’s GFS.
➢ Provides redundant storage for massive amounts of data.
➢ Files are split into blocks.
➢ Blocks are spread across many machines at load time.
  • Different blocks from the same file will be stored on different machines.
➢ Blocks are replicated across multiple machines.
➢ The Name Node keeps track of which blocks make up a file and where they are stored (see the sketch below).
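A client can query exactly this block-to-machine mapping through Hadoop's FileSystem API. A minimal sketch, assuming a cluster configuration on the classpath and a hypothetical path /user/demo/mydata.txt:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();       // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/user/demo/mydata.txt");  // hypothetical path
            FileStatus status = fs.getFileStatus(file);
            // The Name Node answers this query: one BlockLocation per block,
            // listing the Data Nodes that hold a replica of it.
            for (BlockLocation blk : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.printf("offset=%d len=%d hosts=%s%n",
                        blk.getOffset(), blk.getLength(), String.join(",", blk.getHosts()));
            }
        }
    }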
HDFS
➢ When a client wants to retrieve data:
  • It communicates with the Name Node to determine which blocks make up the file and on which Data Nodes those blocks are stored.
  • It then communicates directly with the Data Nodes to read the data (sketched below).
➢ HDFS works best with a smaller number of large files:
  • Millions of files as opposed to billions.
  • Typically 100 MB or more per file.
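This two-step read path is wrapped inside FileSystem.open(): the call consults the Name Node, and the returned stream pulls the bytes directly from the Data Nodes. A sketch using the same hypothetical path as above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class ReadFile {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // open() asks the Name Node for block locations; the stream then
            // reads block data straight from the Data Nodes.
            try (FSDataInputStream in = fs.open(new Path("/user/demo/mydata.txt"))) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }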
NAME NODE
➢ Stores metadata for the files, like the directory structure of a typical file system.
➢ The server holding the Name Node instance is quite crucial, as there is only one.
➢ Keeps a transaction log for file deletes/adds, etc.; transactions cover only metadata, not whole blocks or file streams.
➢ Handles creation of additional replica blocks when necessary after a Data Node failure.

DATA NODE

➢ Stores the actual data in HDFS.
➢ Can run on any underlying filesystem (ext3/4, NTFS, etc.).
➢ Notifies the Name Node of what blocks it has.
➢ The Name Node replicates blocks 3x by default (see the sketch below).
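The 3x figure is the dfs.replication default and can be overridden per file; a minimal sketch using FileSystem.setReplication(), with the path again hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // dfs.replication defaults to 3; setReplication overrides it per file.
            fs.setReplication(new Path("/user/demo/mydata.txt"), (short) 5);
        }
    }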

MAP REDUCE

Input records            Grouped by city
Mumbai     ₹ 45          Mumbai     ₹ 45 + ₹ 68
Bangalore  ₹ 64          Bangalore  ₹ 64
Chennai    ₹ 123         Chennai    ₹ 123
Delhi      ₹ 75          Delhi      ₹ 75
Mumbai     ₹ 68
...
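Before bringing Hadoop in, the aggregation above (sum the sales per city) can be simulated in a few lines of plain Java; the figures are taken from the slide:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class SalesByCity {
        public static void main(String[] args) {
            String[][] records = {
                {"Mumbai", "45"}, {"Bangalore", "64"}, {"Chennai", "123"},
                {"Delhi", "75"}, {"Mumbai", "68"}
            };
            // "Map" each record to (city, amount), then "reduce" by summing per key.
            Map<String, Integer> totals = new LinkedHashMap<>();
            for (String[] r : records) {
                totals.merge(r[0], Integer.parseInt(r[1]), Integer::sum);
            }
            totals.forEach((city, sum) -> System.out.println(city + " ₹" + sum));
            // Mumbai ₹113, Bangalore ₹64, Chennai ₹123, Delhi ₹75
        }
    }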
MAP REDUCE

[Diagram: Mappers emit (city, amount) pairs such as (Mum, 45) and (Bang, 64); pairs with the same key are routed to the same Reducer, e.g. Mumbai and Bangalore to one reducer, Chennai and Delhi to another.]
MAP REDUCE

Mappers → Intermediate Records (KEY, VALUE) → Shuffle and Sort → Reducers → Results
MAP REDUCE

[Diagram: a Job Tracker coordinating Task Trackers across the cluster.]
MAP REDUCE

➢ A method for distributing computation across multiple nodes.
➢ Each node processes the data that is stored at that node.
➢ Consists of two main phases:
  • Map
  • Reduce
MAP

➢ Reads data from an input file.
➢ Outputs zero or more key/value pairs (see the mapper sketch below).
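For the city/amount example, the map phase might look like the following sketch against Hadoop's Mapper API; the class name and the tab-separated input layout are assumptions:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Input: one text line per sale, e.g. "Mumbai\t45".
    // Output: zero or more (city, amount) pairs.
    public class SalesMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final Text city = new Text();
        private final IntWritable amount = new IntWritable();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length != 2) return;   // emit nothing for malformed lines
            city.set(fields[0]);
            amount.set(Integer.parseInt(fields[1]));
            context.write(city, amount);
        }
    }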
SORT

➢ Output from the mapper is sorted by key.
➢ All values with the same key are guaranteed to go to the same machine (see the partitioning sketch below).
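That guarantee comes from hashing the key: Hadoop's default HashPartitioner maps each key to a reducer with the one-line formula below, shown here in a standalone sketch:

    public class KeyRouting {
        // Same routing logic as Hadoop's default HashPartitioner.getPartition().
        static int partition(String key, int numReducers) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
        }

        public static void main(String[] args) {
            int reducers = 2;
            for (String city : new String[] {"Mumbai", "Bangalore", "Chennai", "Delhi", "Mumbai"}) {
                // Identical keys always hash to the same reducer.
                System.out.println(city + " -> reducer " + partition(city, reducers));
            }
        }
    }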
REDUCE

➢ Called once for each unique key.
➢ Gets a list of all values associated with a key as input.
➢ The reducer outputs zero or more final key/value pairs (see the reducer sketch below).
  • Usually just one output per input key.
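The matching reduce phase for the city/amount example, again a sketch against Hadoop's Reducer API with an assumed class name:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Called once per unique city with all of its amounts; emits one total.
    public class SalesReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text city, Iterable<IntWritable> amounts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable amount : amounts) {
                sum += amount.get();
            }
            context.write(city, new IntWritable(sum));   // e.g. (Mumbai, 113)
        }
    }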
JOB TRACKER AND TASK TRACKER

➢ Job Tracker
  • Determines the execution plan for the job.
  • Assigns individual tasks.
➢ Task Tracker
  • Keeps track of the progress of an individual mapper or reducer.
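A driver ties the mapper and reducer into a job that the Job Tracker can schedule; a minimal sketch, with the job name and the input/output paths as assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SalesDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "sales by city");
            job.setJarByClass(SalesDriver.class);
            job.setMapperClass(SalesMapper.class);
            job.setReducerClass(SalesReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/user/demo/sales"));     // hypothetical
            FileOutputFormat.setOutputPath(job, new Path("/user/demo/totals"));  // hypothetical
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }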

DEMO
