
Hadoop is an open-source software framework for the storage and
large-scale processing of data sets on clusters of commodity hardware.
Key Hadoop Features

 Accessible – Hadoop runs on clusters of commodity machines
 Robust – Hadoop can handle hardware/software failures
 Scalable – Hadoop scales linearly by adding new nodes to the cluster
Hadoop Includes

– Hadoop Distributed File System (HDFS):
  distributed data storage (storage engine)
– Map/Reduce:
  distributed data processing (processing engine)
HDFS
(Hadoop Distributed File System)
 HDFS stores files in blocks, typically 64 MB in size

 HDFS is optimized for throughput over latency; it is very efficient at
streaming read requests for large files but poor at seek requests for many
small ones.

 HDFS is optimized for workloads that are generally of the write-once and
read-many type

 Each storage node runs a process called a DataNode (SlaveNode) that
manages the blocks on that host, and these are coordinated by a master
NameNode (MasterNode) process running on a separate host.

 Instead of handling disk failures by having physical redundancy in disk
arrays or similar strategies, HDFS uses replication: each of the blocks
comprising a file is stored on multiple nodes within the cluster.
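As a concrete illustration, the replication factor can be set per file through the standard Hadoop FileSystem API. A minimal sketch; the path /input/data.txt and the factor of 3 are assumptions for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Ask HDFS to keep 3 copies of each block of this (hypothetical) file
        fs.setReplication(new Path("/input/data.txt"), (short) 3);
        fs.close();
    }
}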
Hadoop Cluster
Hadoop Components

 NameNode (Master Node)
 DataNode (Slave Node)
 Secondary NameNode
 JobTracker
 TaskTracker
NameNode

 Master of HDFS (Hadoop Distributed File System)
 Manages the DataNodes
 Manages how a file is divided into blocks, and where those blocks are stored
 Keeps metadata in RAM:
  - List of files
  - List of blocks for each file
  - List of DataNodes for each block
  - File attributes, e.g. creation time, replication factor
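To see this metadata in action, a client can ask the NameNode where a file's blocks live through the FileSystem API. A minimal sketch, assuming an existing file at the hypothetical path /input/data.txt:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/input/data.txt"));
        // The NameNode answers with the DataNodes holding each block
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}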
DataNode

 Stores the actual blocks of files on disk
 May not store the whole file, since a file's blocks are spread across DataNodes
 Reports block information to the NameNode
 Receives instructions from the NameNode
Interaction
NameNode & DataNode
SecondaryNameNode (SNN)

 Keeps periodic snapshots (checkpoints) of the NameNode metadata
 Minimizes downtime/data loss if the NameNode fails (a checkpointing
helper, not a hot standby)
JobTracker

Coordinates the parallel processing of data using MapReduce

 Divides a job into tasks and distributes them across the cluster
 Tracks Map/Reduce tasks
 Restarts failed tasks on another node
 Performs speculative execution
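Speculative execution can be toggled per job. A short sketch using the classic (MRv1) property names that match the JobTracker/TaskTracker architecture described here:

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Launch backup copies of slow-running map/reduce tasks on other nodes
conf.setBoolean("mapred.map.tasks.speculative.execution", true);
conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);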
TaskTracker

 Runs and tracks individual map & reduce tasks
 Reports progress of tasks to the JobTracker
Interaction
JobTracker & TaskTracker
Hadoop Cluster Topology
MAP-REDUCE
● MapReduce is a programming model and framework for
processing parallelizable problems across huge datasets
using a large number of computers (nodes), collectively
referred to as a cluster.

● It consists of two steps: map and reduce.

The “map” step takes a key/value pair and produces
intermediate key/value pairs.

The “reduce” step takes a key and the list of the key's values
and outputs the final key/value pair.
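In the standard formulation of the model, the two functions have the following type signatures; note that the intermediate key type (k2) need not match the input key type (k1):

map    (k1, v1)       -> list(k2, v2)
reduce (k2, list(v2)) -> list(v2)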
MapReduce – Some Common Uses

● Machine learning algorithms – e.g. Mahout
● Sorting huge data sets
● Counting
● Data Extraction and transformation
● Data analysis
● Text Analysis
● Advertising Analysis
● Social Network analysis
● Financial analysis
● … and many more
MapReduce

 Mapper – filters and transforms
 Reducer – sorts and aggregates/summarizes
Mapping List
Reducing List
MapReduce Illustration
MapReduce Steps

1. Prepare the Map() input
2. Run the user-provided Map() code
3. Partition & shuffle the Map output to the Reduce processors
4. Run the user-provided Reduce() code
5. Produce the final output
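These five steps can be traced end-to-end without a cluster. A minimal in-memory sketch in plain Java (the input lines are made up for illustration); a sorted map stands in for the partition/shuffle phase:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {
    public static void main(String[] args) {
        // Step 1: prepare the Map() input
        List<String> lines = Arrays.asList("big data", "big cluster");

        // Steps 2-3: map each line to (word, 1) pairs, then group by key
        // (a TreeMap plays the role of the partition/shuffle/sort phase)
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split("\\W+")) {
                shuffled.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }

        // Steps 4-5: reduce each (word, [1,1,...]) group and emit the total
        for (Map.Entry<String, List<Integer>> e : shuffled.entrySet()) {
            int sum = 0;
            for (int count : e.getValue()) {
                sum += count;
            }
            System.out.println(e.getKey() + "\t" + sum);
        }
    }
}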


MapReduce Step 1-2
MapReduce Step 2-3-4
MapReduce Step 4-5
MapReduce Sample Application: Word Count

Word Count Pseudo-code

map(offset, line_contents):
    for each word in line_contents:
        emit(word, 1)

reduce(word, values):
    sum = 0
    for each value in values:
        sum = sum + value
    emit(word, sum)
Hadoop
WordCountMapper
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line on non-word characters and emit (word, 1) for each word
        String line = value.toString();
        for (String word : line.split("\\W+")) {
            if (word.length() > 0) {
                context.write(new Text(word), one);
            }
        }
    }
}
Hadoop
WordCountReducer

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the counts emitted for this word and write the total
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
Hadoop Job
Driver code
// Job creation and submission lines added for completeness;
// the driver class name WordCountDriver is assumed.
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCountDriver.class);

FileInputFormat.addInputPath(job, new Path("/input"));
job.setInputFormatClass(TextInputFormat.class);

job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);

job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

FileOutputFormat.setOutputPath(job, new Path("/output"));
job.setOutputFormatClass(TextOutputFormat.class);

System.exit(job.waitForCompletion(true) ? 0 : 1);
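Assuming the classes above are packaged into a jar (the jar and driver class names here are hypothetical), the job is submitted with the standard hadoop jar command; the input and output paths are hardcoded in the driver:

hadoop jar wordcount.jar WordCountDriver

The results then appear as part files under /output, e.g. /output/part-r-00000.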
MapReduce Example: Word Count

Input (sample text, in Indonesian):

Solusi247 berdiri tahun 2000 yang bisnis utamanya bidang telco.
Sejak tahun 2006 Solusi247 mengembagkan riset bidang Radar dan Big Data
Product Solusi247 untuk Big Data antara lain HGrid247, HSpark247 dan SGrid247.

InputFormat & RecordReader


<0,Solusi247 berdiri tahun 2000 yang bisnis utamanya bidang telco.>
<65,Sejak tahun 2006 Solusi247 mengembagkan riset bidang Radar dan Big Data>
<138,Product Solusi247 untuk Big Data antara lain HGrid247, HSpark247 dan SGrid247.>

Mapper

<Solusi247,1><berdiri,1><tahun,1><2000,1><yang,1><bisnis,1><utamanya,1><bidang,1><telco,1>
<Sejak,1><tahun,1><2006,1><Solusi247,1><mengembagkan,1><riset,1><bidang,1><Radar,1><dan,1>
<Big,1><Data,1>
<Product,1><Solusi247,1><untuk,1><Big,1><Data,1><antara,1><lain,1><HGrid247,1><HSpark247,1>
<dan,1><SGrid247,1>
MapReduce Example: Word Count

Reducer input                  Reducer output
<2000,[1]>                     <2000,1>
<2006,[1]>                     <2006,1>
<Big,[1,1]>                    <Big,2>
<Data,[1,1]>                   <Data,2>
<HGrid247,[1]>                 <HGrid247,1>
<HSpark247,[1]>                <HSpark247,1>
<Product,[1]>                  <Product,1>
<Radar,[1]>                    <Radar,1>
<SGrid247,[1]>                 <SGrid247,1>
<Sejak,[1]>                    <Sejak,1>
<Solusi247,[1,1,1]>            <Solusi247,3>
<antara,[1]>                   <antara,1>
<berdiri,[1]>                  <berdiri,1>
<bidang,[1,1]>                 <bidang,2>
<bisnis,[1]>                   <bisnis,1>
<dan,[1,1]>                    <dan,2>
<lain,[1]>                     <lain,1>
<mengembagkan,[1]>             <mengembagkan,1>
<riset,[1]>                    <riset,1>
<tahun,[1,1]>                  <tahun,2>
<telco,[1]>                    <telco,1>
<untuk,[1]>                    <untuk,1>
<utamanya,[1]>                 <utamanya,1>
<yang,[1]>                     <yang,1>
Word Count
MapReduce process
Why Hadoop?

• If you find the MapReduce programming model simple/easy
• If you have a data-intensive workload
• If you need fault tolerance
• If you have dedicated nodes available
• If you like Java
• If you want to experiment


Thank You
