
Hadoop MapReduce

A Technical Seminar Report submitted in partial fulfillment of the requirements for the award of the degree of BACHELOR OF TECHNOLOGY in Computer Science and Engineering

Submitted by Mr. GANJI MADHU SUDHAN, Roll No: 10071A0579, C.S.E. Department

Department of Computer Science and Engineering, VNR VIGNANA JYOTHI INSTITUTE OF ENGINEERING AND TECHNOLOGY (An Autonomous Institute under JNTUH & Recognised by AICTE), Bachupally, Hyderabad, April 2014.

VALLURIPALLY NAGESHWARA RAO VIGNANA JYOTHI INSTITUTE OF ENGINEERING AND TECHNOLOGY (AFFILIATED TO JNTU, AP), BACHUPALLY, (VIA) KUKATPALLY, HYDERABAD-500072

CERTIFICATE

This is to certify that the seminar entitled HADOOP MAPREDUCE is a bonafide work done and submitted by GANJI MADHU SUDHAN under the guidance of Mrs. P. Radhika, in partial fulfilment of the requirements for the award of the degree of B.Tech (IV year, II semester) in Computer Science and Engineering, VNRVJIET, Hyderabad, during the academic year 2013-2014. This work was carried out under my supervision and has not been submitted to any other University/Institute for the award of any degree/diploma.

Mrs. P. Radhika
Seminar Co-ordinator
Assistant Professor
C.S.E. Department

Dr. B. Raveendra Babu
Professor and Head
C.S.E. Department
VNRVJIET


ACKNOWLEDGEMENT

Firstly, I would like to express my immense gratitude and heartfelt thanks towards our institution, VNR Vignana Jyothi Institute of Engineering and Technology, which created a great platform to attain profound technical skills in the field of Computer Science, thereby fulfilling my cherished goal. I am very much thankful to our H.O.D., Dr. B. Raveendra Babu, for his valuable advice and support. I extend my heartfelt thanks to our seminar coordinator, Mrs. P. Radhika, Assistant Professor, CSE Department, for her excellent supervision and guidance. Without her supervision and many hours of devoted guidance, stimulating and constructive criticism, this report would never have come out in this form. Last but not least, my sincere thanks also go to all the teaching and non-teaching staff members of the Computer Science & Engineering Department, and earnest thanks to my dear parents for their moral support and heartfelt cooperation. I would also like to thank my friends, whose direct or indirect help has enabled me to complete this work successfully.

GANJI MADHU SUDHAN (10071A0579)


ABSTRACT

MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The "MapReduce system" (also called "infrastructure" or "framework") orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance. The model is inspired by the map and reduce functions commonly used in functional programming, although their purpose in the MapReduce framework is not the same as in their original forms. Furthermore, the key contributions of the MapReduce framework are not the actual map and reduce functions, but the scalability and fault tolerance achieved for a variety of applications by optimizing the execution engine once. MapReduce libraries have been written in many programming languages, with different levels of optimization. A popular open-source implementation is Apache Hadoop. The name MapReduce originally referred to the proprietary Google technology but has since been genericized.


INDEX
Contents

1. Introduction
   1.1 What is Hadoop?
   1.2 Challenges in distributed computing - meeting Hadoop
   1.3 Origin of Hadoop
2. Hadoop subprojects
3. The Hadoop approach
4. Hadoop MapReduce
   4.1 MapReduce Paradigm
   4.2 MapReduce Programming Model
   4.3 MapReduce Map function
   4.4 MapReduce Reduce function
   4.5 Example Applications
   4.6 MapReduce Terminology
   4.7 MapReduce Dataflow
   4.8 MapReduce Real Example
   4.9 MapReduce Combiner function
   4.10 MapReduce Fault-tolerance
5. Development and tools
6. Conclusion
7. Bibliography

List of Figures

Fig 1: Hadoop architecture
Fig 2: Hadoop data distribution
Fig 3: MapReduce architecture
Fig 4: Map function
Fig 5: Reduce function
Fig 6: MapReduce JobTracker
Fig 7: MapReduce data flow with a single reduce task
Fig 8: MapReduce data flow with multiple reduce tasks
Fig 9: MapReduce data flow with no reduce tasks
Fig 10: MapReduce Combiner function


HADOOP MAPREDUCE


1. INTRODUCTION
1.1 What is Hadoop?
Hadoop is a popular open source implementation of MapReduce, a powerful tool designed for the deep analysis and transformation of very large data sets. Hadoop enables you to explore complex data, using custom analyses tailored to your information and questions. Hadoop is the system that allows unstructured data to be distributed across hundreds or thousands of machines forming shared-nothing clusters, and the execution of Map/Reduce routines on the data in that cluster. Hadoop has its own filesystem, which replicates data to multiple nodes so that if one node holding data goes down, there are at least two other nodes from which to retrieve that piece of information. This protects data availability from node failure, something which is critical when there are many nodes in a cluster (akin to RAID, but at the server level). Yahoo! is the biggest contributor to the project.

Here is what makes Hadoop especially useful:

Scalable: It can reliably store and process petabytes.
Economical: It distributes the data and processing across clusters of commonly available computers (in the thousands).
Efficient: By distributing the data, it can process it in parallel on the nodes where the data is located.
Reliable: It automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.

1.2 Challenges in distributed computing - meeting Hadoop


Various challenges are faced while developing a distributed application. The first problem to solve is hardware failure: as soon as we start using many pieces of hardware, the chance that one will fail is fairly high. A common way of avoiding data loss is through replication: redundant copies of the data are kept by the system so that in the event of failure, there is another copy available. This is how RAID works, for instance, although Hadoop's filesystem, the Hadoop Distributed Filesystem (HDFS), takes a slightly different approach. The second problem is that most analysis tasks need to be able to combine the data in some way; data read from one disk may need to be combined with data from any number of other disks. Various distributed systems allow data to be combined from multiple sources, but doing this correctly is notoriously challenging. MapReduce provides a programming model that abstracts the problem away from disk reads and writes, transforming it into a computation over sets of keys and values.

1.3 Origin of Hadoop


Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project. Building a web search engine from scratch was an ambitious goal, for not only is the software required to crawl and index websites complex to write, but it is also a challenge to run without a dedicated operations team, since there are so many moving parts. Nevertheless, the developers believed it was a worthy goal, as it would open up and ultimately democratize search engine algorithms.

Nutch was started in 2002, and a working crawler and search system quickly emerged. However, its developers realized that their architecture wouldn't scale to the billions of pages on the Web. Help was at hand with the publication of a paper in 2003 that described the architecture of Google's distributed filesystem, called GFS, which was being used in production at Google. GFS, or something like it, would solve their storage needs for the very large files generated as a part of the web crawl and indexing process. In particular, GFS would free up time being spent on administrative tasks such as managing storage nodes. In 2004, they set about writing an open source implementation, the Nutch Distributed Filesystem (NDFS).

In 2004, Google published the paper that introduced MapReduce to the world. Early in 2005, the Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year all the major Nutch algorithms had been ported to run using MapReduce and NDFS. NDFS and the MapReduce implementation in Nutch were applicable beyond the realm of search, and in February 2006 they moved out of Nutch to form an independent subproject of Lucene called Hadoop. At around the same time, Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale. This was demonstrated in February 2008 when Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster.

In April 2008, Hadoop broke a world record to become the fastest system to sort a terabyte of data. Running on a 910-node cluster, Hadoop sorted one terabyte in 209 seconds (just under 3.5 minutes), beating the previous year's winner of 297 seconds (described in detail in Tom White's Hadoop: The Definitive Guide). In November of the same year, Google reported that its MapReduce implementation sorted one terabyte in 68 seconds. In May 2009, it was announced that a team at Yahoo! had used Hadoop to sort one terabyte in 62 seconds.

Fig 1: Hadoop architecture

2. Hadoop subprojects
Although Hadoop is best known for MapReduce and its distributed filesystem (HDFS, renamed from NDFS), the other subprojects provide complementary services, or build on the core to add higher-level abstractions. The subprojects of Hadoop include:

Core: A set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures).
Avro: A data serialization system for efficient, cross-language RPC and persistent data storage. (At the time of this writing, Avro had been created only as a new subproject, and no other Hadoop subprojects were using it yet.)
MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines.
HDFS: A distributed filesystem that runs on large clusters of commodity machines.
Pig: A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
ZooKeeper: A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which is translated by the runtime engine to MapReduce jobs) for querying the data.
Chukwa: A distributed data collection and analysis system. Chukwa runs collectors that store data in HDFS, and it uses MapReduce to produce reports. (At the time of this writing, Chukwa had only recently graduated from a contrib module in Core to its own subproject.)

3. The Hadoop Approach


Hadoop is designed to efficiently process large volumes of information by connecting many commodity computers together to work in parallel. A theoretical 1,000-CPU machine would cost a very large amount of money, far more than 1,000 single-CPU or 250 quad-core machines. Hadoop ties these smaller and more reasonably priced machines together into a single cost-effective compute cluster. Performing computation on large volumes of data has been done before, usually in a distributed setting. What makes Hadoop unique is its simplified programming model, which allows the user to quickly write and test distributed systems, and its efficient, automatic distribution of data and work across machines, which in turn exploits the underlying parallelism of the CPU cores.

Data distribution

In a Hadoop cluster, data is distributed to all the nodes of the cluster as it is being loaded in. The Hadoop Distributed File System (HDFS) splits large data files into chunks which are managed by different nodes in the cluster. In addition, each chunk is replicated across several machines, so that a single machine failure does not result in any data being unavailable. An active monitoring system then re-replicates the data in response to system failures, which could otherwise result in partial storage. Even though the file chunks are replicated and distributed across several machines, they form a single namespace, so their contents are universally accessible.

Fig 2: Hadoop data distribution
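To make the distribution concrete, the following is a minimal client-side sketch using the Hadoop FileSystem Java API; it assumes a running cluster configured through core-site.xml/hdfs-site.xml, and the path and replication factor of 3 are illustrative. It writes a small file into the single namespace and then asks HDFS which nodes hold replicas of its blocks.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationPeek {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);            // one namespace for the whole cluster

        Path file = new Path("/user/demo/sample.txt");   // hypothetical path
        try (FSDataOutputStream out = fs.create(file, (short) 3)) {  // ask for 3 replicas
            out.writeUTF("hello hdfs");
        }

        FileStatus status = fs.getFileStatus(file);
        System.out.println("replication factor: " + status.getReplication());

        // Each block of the file reports the datanodes that hold a copy of it.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("block hosts: " + String.join(", ", block.getHosts()));
        }
    }
}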

4. Hadoop MapReduce
4.1 MapReduce Paradigm
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model. The abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. Most such computations involve applying a map operation to each logical "record" in the input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that share the same key, in order to combine the derived data appropriately. The use of a functional model with user-specified map and reduce operations makes it easy to parallelize large computations and to use re-execution as the primary mechanism for fault tolerance.

Fig 3: MapReduce architecture

4.2 MapReduce Programming Model


The computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce. Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.

The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory.

4.3 Map function


map (in_key, in_value) -> (out_key, intermediate_value) list

Fig 4: Map function

Example: Upper-case Mapper

let map(k, v) = emit(k.toUpper(), v.toUpper())

(foo, bar)   --> (FOO, BAR)
(Foo, other) --> (FOO, OTHER)
(key2, data) --> (KEY2, DATA)
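The same upper-case mapper can be written against the Hadoop Java API. The class below is a minimal sketch, assuming an input format such as KeyValueTextInputFormat that delivers both the key and the value as Text; the class and field names are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class UpperCaseMapper extends Mapper<Text, Text, Text, Text> {
    private final Text outKey = new Text();
    private final Text outValue = new Text();

    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // emit(k.toUpper(), v.toUpper())
        outKey.set(key.toString().toUpperCase());
        outValue.set(value.toString().toUpperCase());
        context.write(outKey, outValue);
    }
}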

4.4 Reduce function


reduce (out_key, intermediate_value list) -> out_value list

Fig 5: Reduce function



Example: Sum Reducer

let reduce(k, vals) =
    sum = 0
    foreach int v in vals:
        sum += v
    emit(k, sum)

(A, [42, 100, 312]) --> (A, 454)
(B, [12, 6, -2])    --> (B, 16)

Example 2: Counting the number of occurrences of each word in a large collection of documents. The user would write code similar to the following pseudo-code:

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
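A Java counterpart to the Sum Reducer pseudo-code is sketched below, assuming Text keys and IntWritable counts as the per-key values; the class name is illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {   // values arrive through an iterator, so they need not fit in memory
            sum += value.get();
        }
        total.set(sum);
        context.write(key, total);           // e.g. (A, [42, 100, 312]) -> (A, 454)
    }
}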

4.5 Example Applications


Inverted Index: The map function parses each document, and emits a sequence of (word, document ID) pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs, and emits a (word, list(document ID)) pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.

Distributed Sort: The map function extracts the key from each record, and emits a (key, record) pair. The reduce function emits all pairs unchanged.
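As an illustration of the inverted-index application, a minimal Java sketch follows. It assumes the map input key is a document ID and the value is the document's text (for example via KeyValueTextInputFormat); class names are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;
import java.util.TreeSet;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {

    public static class IndexMapper extends Mapper<Text, Text, Text, Text> {
        private final Text word = new Text();

        @Override
        protected void map(Text docId, Text contents, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(contents.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, docId);                  // emit (word, document ID)
            }
        }
    }

    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> docIds, Context context)
                throws IOException, InterruptedException {
            TreeSet<String> sorted = new TreeSet<String>();  // sort and de-duplicate the document IDs
            for (Text id : docIds) {
                sorted.add(id.toString());
            }
            context.write(word, new Text(String.join(",", sorted)));  // emit (word, list(document ID))
        }
    }
}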

4.6 MapReduce Terminology


Job: the unit of work that the client wants to be performed - the MapReduce program, the input data, and configuration information.
Task: a piece of the job; there are map tasks and reduce tasks.
Jobtracker: the node that coordinates all the jobs in the system by scheduling tasks to run on tasktrackers.
Tasktracker: a node that runs tasks and sends progress reports to the jobtracker.
Split: a fixed-size piece of the input data.

4.7 MapReduce Dataflow


A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a filesystem. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks. Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the distributed filesystem run on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.

A MapReduce job is a unit of work that the client wants to be performed: it consists of the input data, the MapReduce program, and configuration information. Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks. There are two types of nodes that control the job execution process: a jobtracker and a number of tasktrackers. The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker. Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.

Fig 6: MapReduce JobTracker

Having many splits means the time taken to process each split is small compared to the time to process the whole input. So if we are processing the splits in parallel, the processing is better load-balanced if the splits are small, since a faster machine will be able to process proportionally more splits over the course of the job than a slower machine. Even if the machines are identical, failed processes or other jobs running concurrently make load balancing desirable, and the quality of the load balancing increases as the splits become more fine-grained. On the other hand, if splits are too small, then the overhead of managing the splits and of map task creation begins to dominate the total job execution time. For most jobs, a good split size tends to be the size of an HDFS block, 64 MB by default, although this can be changed for the cluster (for all newly created files), or specified when each file is created.

Hadoop does its best to run the map task on a node where the input data resides in HDFS. This is called the data locality optimization. It should now be clear why the optimal split size is the same as the block size: it is the largest size of input that can be guaranteed to be stored on a single node. If the split spanned two blocks, it would be unlikely that any HDFS node stored both blocks, so some of the split would have to be transferred across the network to the node running the map task, which is clearly less efficient than running the whole map task using local data.

Map tasks write their output to local disk, not to HDFS. Map output is intermediate output: it is processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away. So storing it in HDFS, with replication, would be overkill. If the node running the map task fails before the map output has been consumed by the reduce task, then Hadoop will automatically rerun the map task on another node to recreate the map output.

Reduce tasks don't have the advantage of data locality: the input to a single reduce task is normally the output from all mappers. In the case shown in Fig 7, a single reduce task is fed by all of the map tasks. Therefore the sorted map outputs have to be transferred across the network to the node where the reduce task is running, where they are merged and then passed to the user-defined reduce function. The output of the reduce is normally stored in HDFS for reliability. For each HDFS block of the reduce output, the first replica is stored on the local node, with the other replicas stored on off-rack nodes. Thus, writing the reduce output does consume network bandwidth, but only as much as a normal HDFS write pipeline consumes. The dotted boxes in Fig 7 indicate nodes, the light arrows show data transfers on a node, and the heavy arrows show data transfers between nodes. The number of reduce tasks is not governed by the size of the input, but is specified independently.

Fig 7: MapReduce data flow with a single reduce task
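The split size and the number of reduce tasks discussed above surface as job-level settings in the Java API. The fragment below is a minimal sketch assuming a reasonably recent org.apache.hadoop.mapreduce release; the 128 MB lower bound, the count of 4 reducers, and the input path are illustrative values only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitAndReduceSettings {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split and reduce settings");

        // Splits normally follow the HDFS block size; a larger lower bound can be forced per job.
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);   // 128 MB, illustrative

        // The number of reduce tasks is not derived from the input size; the user chooses it.
        job.setNumReduceTasks(4);

        FileInputFormat.addInputPath(job, new Path("/user/demo/input")); // hypothetical path
        // A real driver would also set the mapper, reducer, output types and output path,
        // then call job.waitForCompletion(true) - see the full example in section 5.2.
    }
}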

When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. There can be many keys (and their associated values) in each partition, but the records for every key are all in a single partition. The partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner, which buckets keys using a hash function, works very well. Fig 8 makes it clear why the data flow between map and reduce tasks is colloquially known as "the shuffle", as each reduce task is fed by many map tasks. The shuffle is more complicated than this diagram suggests, and tuning it can have a big impact on job execution time. Finally, it is also possible to have zero reduce tasks (Fig 9). This can be appropriate when you don't need the shuffle, since the processing can be carried out entirely in parallel.


Fig 8: MapReduce data flow with multiple reduce tasks

Fig 9: MapReduce data flow with no reduce tasks
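The default hash-style partitioning described above can be pictured with a small sketch. The class below is a minimal re-implementation for illustration, not Hadoop's own HashPartitioner, and it assumes Text keys and IntWritable values; a job would select it with job.setPartitionerClass(HashStylePartitioner.class).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HashStylePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the result is non-negative, then bucket the key by its hash;
        // every record with the same key therefore lands in the same partition, and hence
        // the same reduce task.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}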


4.8 MapReduce Real Example


The canonical real example is counting the number of occurrences of each word in a large collection of documents, using the pseudo-code already shown in section 4.4: the map function emits (word, "1") for every word it sees, and the reduce function sums the counts for each word.

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
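Tracing this pseudo-code on a small input makes the data flow visible; the input line below is invented purely for illustration.

Input split (one line):   "the quick brown fox jumps over the lazy dog"

Map output:               (the,1) (quick,1) (brown,1) (fox,1) (jumps,1) (over,1) (the,1) (lazy,1) (dog,1)

After shuffle and sort:   (brown,[1]) (dog,[1]) (fox,[1]) (jumps,[1]) (lazy,[1]) (over,[1]) (quick,[1]) (the,[1,1])

Reduce output:            (brown,1) (dog,1) (fox,1) (jumps,1) (lazy,1) (over,1) (quick,1) (the,2)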

4.9 MapReduce Combiner function


Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output; the combiner function's output forms the input to the reduce function. Since the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record, if at all. In other words, calling the combiner function zero, one, or many times should produce the same output from the reducer.

Fig 10: MapReduce Combiner function
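Because addition is associative and commutative, a summing reducer such as the SumReducer sketched in section 4.4 can also be registered as the combiner. The fragment below is a minimal illustrative wiring, assuming that SumReducer class is on the classpath; a complete driver appears in section 5.2.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CombinerWiring {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sum with combiner");
        job.setCombinerClass(SumReducer.class);  // runs on each map task's local output, zero or more times
        job.setReducerClass(SumReducer.class);   // runs on the merged, shuffled output
        // Mapper, input/output formats and paths omitted for brevity.
    }
}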


4.10 MapReduce Fault-tolerance


Worker failure: The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling.

Master failure: It is easy to make the master write periodic checkpoints of its data structures. If the master task dies, a new copy can be started from the last checkpointed state. However, in most cases, the user simply restarts the job.


5. Development and tools


5.1 Hadoop operation modes
Hadoop supports three modes of operation:

Standalone (local) mode
Pseudo-distributed mode
Fully-distributed mode

The mode in use is determined by a handful of configuration properties, as sketched below.
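The sketch below illustrates the distinction using the Hadoop 1.x property names (fs.default.name and mapred.job.tracker); the host names and ports are placeholders, and newer releases use different property names.

import org.apache.hadoop.conf.Configuration;

public class OperationModes {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Standalone: local filesystem and the in-process local job runner (the defaults).
        conf.set("fs.default.name", "file:///");
        conf.set("mapred.job.tracker", "local");

        // Pseudo-distributed or fully-distributed: point at HDFS and a jobtracker instead, e.g.
        // conf.set("fs.default.name", "hdfs://namenode:9000");
        // conf.set("mapred.job.tracker", "jobtracker:9001");

        System.out.println("job tracker = " + conf.get("mapred.job.tracker"));
    }
}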

5.2 Java example
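A minimal WordCount job in Java is sketched below, tying together the map, reduce and combiner functions described in section 4. It follows the style of the standard org.apache.hadoop.mapreduce tutorial example and assumes a reasonably recent Hadoop release; the class names (WordCount, TokenizerMapper, IntSumReducer) are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: (line offset, line text) -> (word, 1) for every word in the line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce: (word, [1, 1, ...]) -> (word, total count).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: the job is the MapReduce program plus its input data and configuration.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // same summing logic, run on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, it would typically be run with a command of the form: hadoop jar wordcount.jar WordCount <input dir> <output dir>, where both directories are HDFS paths and the output directory must not already exist.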


6. CONCLUSION
We can conclude that MapReduce is an efficient programming model that uses the power of parallel computing to its fullest: it divides large data-processing workloads across a cluster and delivers fast, accurate results when extracting information from very large data sets.



7. BIBLIOGRAPHY
Tom White (2009). Hadoop: The Definitive Guide, 1st Edition. O'Reilly Media.

Jeffrey Dean and Sanjay Ghemawat (2004). MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004. Available at: http://labs.google.com/papers/mapreduce-osdi04.pdf

Companies using Hadoop: http://wiki.apache.org/hadoop/PoweredBy

Academic Hadoop algorithms: http://atbrox.com/2010/02/12/mapreduce-hadoop-algorithms-in-academic-papersupdated/

