
Hadoop is an open-source software framework for the storage and
large-scale processing of data sets on clusters of commodity hardware.
Key Hadoop Features

 Accessible – Hadoop runs on clusters of commodity machines
 Robust – Hadoop can handle hardware/software failures
 Scalable – Hadoop scales linearly by adding new nodes to the cluster
Hadoop Includes

– Hadoop Distributed File System (HDFS):
  distributed data storage (storage engine)
– Map/Reduce:
  distributed data processing (processing engine)
HDFS
(Hadoop Distributed File System)
 HDFS stores files in blocks, typically 64 MB in size

 HDFS is optimized for throughput over latency; it is very efficient at
streaming read requests for large files but poor at seek requests for many
small ones.

 HDFS is optimized for workloads that are generally of the write-once and
read-many type

 Each storage node runs a process called a DataNode (SlaveNode) that
manages the blocks on that host, and these are coordinated by a master
NameNode (MasterNode) process running on a separate host.

 Instead of handling disk failures by having physical redundancy in disk
arrays or similar strategies, HDFS uses replication: each of the blocks
comprising a file is stored on multiple nodes within the cluster.
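As a concrete illustration, the replication factor can be set per file through the standard Hadoop FileSystem API. A minimal sketch; the path /input/data.txt and the factor of 3 are assumptions for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Ask HDFS to keep 3 copies of each block of this (hypothetical) file
        fs.setReplication(new Path("/input/data.txt"), (short) 3);
        fs.close();
    }
}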
Hadoop Cluster
Hadoop Components

 NameNode (Master Node)
 DataNode (Slave Node)
 Secondary NameNode
 JobTracker
 TaskTracker
NameNode

 Master of HDFS (Hadoop Distributed File System)
 Manages the DataNodes
 Manages how a file is divided into blocks, and where those blocks are stored
 Keeps metadata in RAM:
  - List of files
  - List of blocks for each file
  - List of DataNodes for each block
  - File attributes, e.g. creation time, replication factor
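To see this metadata in action, a client can ask the NameNode where a file's blocks live through the FileSystem API. A minimal sketch, assuming an existing file at the hypothetical path /input/data.txt:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/input/data.txt"));
        // The NameNode answers with the DataNodes holding each block
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}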
DataNode

 Stores the actual blocks of files on disk
 May not store the whole file, since a file's blocks are spread across DataNodes
 Reports block information to the NameNode
 Receives instructions from the NameNode
Interaction
NameNode & DataNode
SecondaryNameNode (SNN)

 Keeps periodic snapshots (checkpoints) of the NameNode metadata
 Minimizes downtime/data loss if the NameNode fails (a checkpointing
helper, not a hot standby)
JobTracker

Coordinates the parallel processing of data using MapReduce

 Divides a job into tasks and distributes them across the cluster
 Tracks Map/Reduce tasks
 Restarts failed tasks on another node
 Performs speculative execution
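Speculative execution can be toggled per job. A short sketch using the classic (MRv1) property names that match the JobTracker/TaskTracker architecture described here:

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Launch backup copies of slow-running map/reduce tasks on other nodes
conf.setBoolean("mapred.map.tasks.speculative.execution", true);
conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);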
TaskTracker

 Runs and tracks individual map & reduce tasks
 Reports progress of tasks to the JobTracker
Interaction
JobTracker & TaskTracker
Hadoop Cluster Topology
MAP-REDUCE
● MapReduce is a programming model and framework for
processing parallelizable problems across huge datasets
using a large number of computers (nodes), collectively
referred to as a cluster.

● It consists of two steps: map and reduce.

The “map” step takes a key/value pair and produces
intermediate key/value pairs.

The “reduce” step takes a key and the list of the key's values
and outputs the final key/value pair.
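In the standard formulation of the model, the two functions have the following type signatures; note that the intermediate key type (k2) need not match the input key type (k1):

map    (k1, v1)       -> list(k2, v2)
reduce (k2, list(v2)) -> list(v2)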
MapReduce – Some Common Uses

● Machine learning algorithms – e.g. Mahout
● Sorting huge data sets
● Counting
● Data Extraction and transformation
● Data analysis
● Text Analysis
● Advertising Analysis
● Social Network analysis
● Financial analysis
● … and many more
MapReduce

 Mapper – filters and transforms
 Reducer – sorts and aggregates/summarizes
Mapping List
Reducing List
MapReduce Illustration
MapReduce Steps

1. Prepare the Map() input
2. Run the user-provided Map() code
3. Partition & shuffle the Map output to the Reduce processors
4. Run the user-provided Reduce() code
5. Produce the final output
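These five steps can be traced end-to-end without a cluster. A minimal in-memory sketch in plain Java (the input lines are made up for illustration); a sorted map stands in for the partition/shuffle phase:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {
    public static void main(String[] args) {
        // Step 1: prepare the Map() input
        List<String> lines = Arrays.asList("big data", "big cluster");

        // Steps 2-3: map each line to (word, 1) pairs, then group by key
        // (a TreeMap plays the role of the partition/shuffle/sort phase)
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split("\\W+")) {
                shuffled.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }

        // Steps 4-5: reduce each (word, [1,1,...]) group and emit the total
        for (Map.Entry<String, List<Integer>> e : shuffled.entrySet()) {
            int sum = 0;
            for (int count : e.getValue()) {
                sum += count;
            }
            System.out.println(e.getKey() + "\t" + sum);
        }
    }
}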


MapReduce Step 1-2
MapReduce Step 2-3-4
MapReduce Step 4-5
MapReduce Sample Application: Word Count

Word Count Pseudo-code

map(offset, line_contents):
    for each word in line_contents:
        emit(word, 1)

reduce(word, values):
    sum = 0
    for each value in values:
        sum = sum + value
    emit(word, sum)
Hadoop
WordCountMapper
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line on non-word characters and emit (word, 1) for each word
        String line = value.toString();
        for (String word : line.split("\\W+")) {
            if (word.length() > 0) {
                context.write(new Text(word), one);
            }
        }
    }
}
Hadoop
WordCountReducer

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the counts emitted for this word and write the total
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
Hadoop Job
Driver code
// Job creation and submission lines added for completeness;
// the driver class name WordCountDriver is assumed.
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCountDriver.class);

FileInputFormat.addInputPath(job, new Path("/input"));
job.setInputFormatClass(TextInputFormat.class);

job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);

job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

FileOutputFormat.setOutputPath(job, new Path("/output"));
job.setOutputFormatClass(TextOutputFormat.class);

System.exit(job.waitForCompletion(true) ? 0 : 1);
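Assuming the classes above are packaged into a jar (the jar and driver class names here are hypothetical), the job is submitted with the standard hadoop jar command; the input and output paths are hardcoded in the driver:

hadoop jar wordcount.jar WordCountDriver

The results then appear as part files under /output, e.g. /output/part-r-00000.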
MapReduce Example: Word Count

Input (sample text, in Indonesian):

Solusi247 berdiri tahun 2000 yang bisnis utamanya bidang telco.
Sejak tahun 2006 Solusi247 mengembagkan riset bidang Radar dan Big Data
Product Solusi247 untuk Big Data antara lain HGrid247, HSpark247 dan SGrid247.

InputFormat & RecordReader


<0,Solusi247 berdiri tahun 2000 yang bisnis utamanya bidang telco.>
<65,Sejak tahun 2006 Solusi247 mengembagkan riset bidang Radar dan Big Data>
<138,Product Solusi247 untuk Big Data antara lain HGrid247, HSpark247 dan SGrid247.>

Mapper

<Solusi247,1><berdiri,1><tahun,1><2000,1><yang,1><bisnis,1><utamanya,1><bidang,1><telco,1>
<Sejak,1><tahun,1><2006,1><Solusi247,1><mengembagkan,1><riset,1><bidang,1><Radar,1><dan,1>
<Big,1><Data,1>
<Product,1><Solusi247,1><untuk,1><Big,1><Data,1><antara,1><lain,1><HGrid247,1><HSpark247,1>
<dan,1><SGrid247,1>
MapReduce Example: Word Count

Reducer input                  Reducer output
<2000,[1]>                     <2000,1>
<2006,[1]>                     <2006,1>
<Big,[1,1]>                    <Big,2>
<Data,[1,1]>                   <Data,2>
<HGrid247,[1]>                 <HGrid247,1>
<HSpark247,[1]>                <HSpark247,1>
<Product,[1]>                  <Product,1>
<Radar,[1]>                    <Radar,1>
<SGrid247,[1]>                 <SGrid247,1>
<Sejak,[1]>                    <Sejak,1>
<Solusi247,[1,1,1]>            <Solusi247,3>
<antara,[1]>                   <antara,1>
<berdiri,[1]>                  <berdiri,1>
<bidang,[1,1]>                 <bidang,2>
<bisnis,[1]>                   <bisnis,1>
<dan,[1,1]>                    <dan,2>
<lain,[1]>                     <lain,1>
<mengembagkan,[1]>             <mengembagkan,1>
<riset,[1]>                    <riset,1>
<tahun,[1,1]>                  <tahun,2>
<telco,[1]>                    <telco,1>
<untuk,[1]>                    <untuk,1>
<utamanya,[1]>                 <utamanya,1>
<yang,[1]>                     <yang,1>
Word Count
MapReduce process
Why Hadoop?

• If you find the MapReduce programming model simple/easy
• If you have a data-intensive workload
• If you need fault tolerance
• If you have dedicated nodes available
• If you like Java
• If you want to experiment


Thank You
