
MapReduce Working

1. The framework splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner.
2. A map task takes (key, value) pairs as input and produces (key, value) pairs as output.
3. The framework sorts the outputs of the maps, which are then input to the reduce tasks.
4. Both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
5. Typically the compute nodes and the storage nodes are the same.
6. The MapReduce framework and the distributed file system run on the same set of nodes.
7. This configuration allows the framework to effectively schedule tasks on the nodes where the data is already present, resulting in very high aggregate bandwidth across the cluster.
8. There are two types of nodes that control the job execution process: the jobtracker and the tasktrackers.
9. The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers.
10. Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job.
11. If a task fails, the jobtracker can reschedule it on a different tasktracker.
12. Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits.
13. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.
14. The quality of the load balancing increases as the splits become more fine-grained. On the other hand, if splits are too small, the overhead of managing the splits and of map task creation begins to dominate the total job execution time. For most jobs, a good split size tends to be the size of an HDFS block, 64 MB by default. Map tasks write their output to the local disk, not to HDFS.
15. Map output is intermediate output: it is processed by the reduce tasks to produce the final output, and once the job is complete the map output can be thrown away. Storing it in HDFS, with replication, would therefore be wasteful.
16. It is also possible that the node running a map task fails before the map output has been consumed by the reduce task.
17. Reduce tasks do not have the advantage of data locality; the input to a single reduce task is normally the output from all mappers.
18. In the case of a single reduce task fed by all of the map tasks, the sorted map outputs have to be transferred across the network to the node where the reduce task is running, where they are merged and then passed to the user-defined reduce function.
19. The output of the reducer is normally stored in HDFS for reliability. For each HDFS block of the reduce output, the first replica is stored on the local node and the other replicas are stored on off-rack nodes. A job runs a number of mappers and reducers, and how many to use is decided by the jobtracker.
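As a concrete illustration of points 2 and 3 above (mappers consuming and producing (key, value) pairs, with the framework sorting map output before the reduce), below is a minimal word-count sketch using the standard Hadoop Java MapReduce API. The class names and job name are illustrative, not part of the notes.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: (offset, line of text) -> (word, 1) for every word in the line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);   // intermediate (key, value) pair, spilled to local disk
      }
    }
  }

  // Reduce: (word, [1, 1, ...]) -> (word, total count); the framework has already
  // sorted and grouped the map output by key before this method is called.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);   // final output, stored in HDFS
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```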

HDFS:

1. A file system that manages storage across a network of machines is called a distributed file system.
2. Hadoop comes with the Hadoop Distributed File System (HDFS), a distributed file system that holds large amounts of data (terabytes or petabytes) and provides high-throughput access to that information. HDFS stores data and files in blocks.
3. A typical block size for a local file system is 512 bytes, but for HDFS it is a much larger unit, 64 MB by default. Files in HDFS are divided into blocks, which are then stored as independent units.
4. To deal with corrupted blocks and with disk or machine failures, each block is replicated to a number of different machines (three by default).
5. An HDFS cluster has two types of nodes operating in a master-worker pattern: a NameNode (the master) and a number of DataNodes (the workers).
6. The functions of the NameNode are:
   1. to manage the file system namespace;
   2. to maintain the file system tree and the metadata for all the files and directories in the tree;
   3. to know on which DataNodes all the blocks for a given file are located.
7. This information is stored on local disk in the form of two files: the namespace image and the edit log.
8. The NameNode does not store block locations persistently, because this information is reconstructed from the DataNodes each time the system starts.
9. A client accesses the file system on behalf of the user by communicating with the NameNode and the DataNodes, as shown in Fig 2.4 HDFS Architecture.
10. DataNodes store and retrieve blocks when the client or the NameNode requests them to do so, and they report back to the NameNode with the list of blocks they hold.
11. HDFS supports a traditional hierarchical file system: any user or application can create directories and store files in those directories.
12. The blocks of a file are replicated for fault tolerance. The block size and the number of replicas are configurable per file.
13. The NameNode makes all decisions regarding the replication of blocks. It periodically receives heartbeats and block reports from each DataNode in the cluster.
14. Receipt of a heartbeat tells the NameNode that the DataNode is functioning properly, and a block report contains the list of all blocks stored on that DataNode.
15. The heartbeat message is also used for fault tolerance; it additionally carries CPU usage and the status of each task, and this information is used for task scheduling by the JobTracker.
16. HDFS is composed of one NameNode and multiple DataNodes. The NameNode is responsible for metadata management, and the DataNodes store the data.
17. In most configurations a dedicated node works as the JobTracker and NameNode, and all other nodes work as TaskTrackers and DataNodes.
18. A task is referred to as a straggler if its progress is slower than that of the other tasks; if it does not complete along with the other tasks, MapReduce reallocates it to an idle node, for which data migration takes place.
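To make point 12 (per-file block size and replication) concrete, here is a minimal sketch using the Hadoop FileSystem Java API. It assumes a reachable HDFS cluster whose address is picked up from the usual configuration files; the file path and the chosen values (2 replicas, 128 MB blocks) are illustrative assumptions, not part of the notes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPerFileSettings {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);          // handle to the configured file system

    Path file = new Path("/user/demo/sample.txt"); // illustrative path

    // Create the file with 2 replicas and a 128 MB block size for this file only;
    // other files keep the cluster-wide defaults.
    short replication = 2;
    long blockSize = 128L * 1024 * 1024;
    int bufferSize = conf.getInt("io.file.buffer.size", 4096);
    try (FSDataOutputStream out =
             fs.create(file, true, bufferSize, replication, blockSize)) {
      out.writeUTF("blocks of this file are replicated for fault tolerance");
    }

    // Read back the per-file settings recorded in the NameNode's metadata.
    FileStatus status = fs.getFileStatus(file);
    System.out.println("replication = " + status.getReplication()
        + ", block size = " + status.getBlockSize());
  }
}
```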

HPC and MapReduce

1. Today's research deals with the increasing volume and complexity of data produced by ultra-scale simulations, high-resolution scientific equipment, and experiments.
2. These data sets are stored in parallel and distributed file systems and are frequently retrieved for analytics applications.
3. The complexity of these data sets affects the way they are stored in the distributed file system, and such data sets present researchers with many challenges in representing, processing, and managing them. However, the raw data obtained from simulations and experiments needs to be stored in a data-intensive file system in a format that is useful for subsequent analytics applications.
4. There is an information gap, because current HPC applications write data to these new file systems in ways that generate unoptimized writes.
5. In HPC analytics this information gap becomes critical, because the source of the data and the commodity-based system do not share the same data semantics.
6. However, scientists are using frameworks like MapReduce, which require data sets to be copied to the accompanying distributed and parallel file system, HDFS.
7. The challenge is to find the best way to retrieve data from the file system while following the semantics of the HPC data.
8. One way to approach this problem is to use a high-level programming abstraction for specifying the semantics and bridging the gap between the way the data was written and the way it will be accessed.
9. Currently, the way to access data in HDFS-like file systems is to use the MapReduce programming abstraction. Because MapReduce is not designed for semantics-based HPC analytics, some existing analytics applications use multiple MapReduce programs to specify and analyze data. For example, consider an application that needs to merge different data sets and then extract subsets of that data. Two MapReduce programs are used to implement this access pattern: the first merges the data sets and the second extracts the subsets (a sketch of such a two-job chain is given after this list). The overhead of this approach is quantified as 1) the effort to transform the data patterns into MapReduce programs, 2) the number of lines of code required for MapReduce data preprocessing, and 3) the performance penalties caused by reading excessive data from disk in each MapReduce program.
10. A framework based on MapReduce that is capable of understanding data semantics simplifies the writing of analytics applications and potentially improves performance by reducing the number of MapReduce phases. The aim of this work is to utilize the scalability and fault-tolerance benefits of MapReduce and combine them with scientific access patterns. The framework, called MapReduce with Access Patterns (MRAP), is a unique combination of the data access semantics and the programming framework (MapReduce) used in implementing HPC analytics applications. The MRAP framework consists of three components to handle these two patterns:
   1. the MRAP API, provided to eliminate the multiple MapReduce phases used to specify data access patterns;
   2. MRAP data restructuring, provided to further improve the performance of access patterns with the small-I/O problem;
   3. MRAP data-centric scheduling, provided to improve the performance of access patterns whose data chunks are distributed across nodes.
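To illustrate the two-program access pattern described in point 9, the sketch below chains two ordinary Hadoop jobs: the first merges two data sets by key, the second re-reads the merged result and extracts a subset. All class names, paths, record formats, and the subset condition are illustrative assumptions and are not part of MRAP; the point is that the intermediate merged data must be written to and re-read from HDFS between the two jobs, which is the overhead MRAP aims to eliminate.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeThenExtract {

  // Job 1 mapper: key each record by its first field (records assumed comma-separated).
  public static class MergeMapper extends Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split(",", 2);
      if (parts.length == 2) ctx.write(new Text(parts[0]), new Text(parts[1]));
    }
  }

  // Job 1 reducer: concatenate all records that share a key (a simple merge/join).
  public static class MergeReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      StringBuilder merged = new StringBuilder();
      for (Text v : values) merged.append(v.toString()).append('|');
      ctx.write(key, new Text(merged.toString()));
    }
  }

  // Job 2 mapper: map-only filter that keeps records whose key matches an illustrative condition.
  public static class SubsetMapper extends Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      if (parts.length == 2 && parts[0].startsWith("A"))
        ctx.write(new Text(parts[0]), new Text(parts[1]));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path mergedDir = new Path(args[2]);   // intermediate result, materialized in HDFS

    // Job 1: merge the two input data sets.
    Job merge = Job.getInstance(conf, "merge data sets");
    merge.setJarByClass(MergeThenExtract.class);
    merge.setMapperClass(MergeMapper.class);
    merge.setReducerClass(MergeReducer.class);
    merge.setOutputKeyClass(Text.class);
    merge.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(merge, new Path(args[0]));
    FileInputFormat.addInputPath(merge, new Path(args[1]));
    FileOutputFormat.setOutputPath(merge, mergedDir);
    if (!merge.waitForCompletion(true)) System.exit(1);

    // Job 2: re-read the merged data from HDFS and extract the subset of interest.
    Job extract = Job.getInstance(conf, "extract subset");
    extract.setJarByClass(MergeThenExtract.class);
    extract.setMapperClass(SubsetMapper.class);
    extract.setNumReduceTasks(0);         // filtering needs no reduce phase
    extract.setOutputKeyClass(Text.class);
    extract.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(extract, mergedDir);
    FileOutputFormat.setOutputPath(extract, new Path(args[3]));
    System.exit(extract.waitForCompletion(true) ? 0 : 1);
  }
}
```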
