Rahul Poddar 11500110119 Santosh Kumar 11500110006 Shubham Raj 11500110054 Vinayak Raj 11500110019 6th semester CSE-B BPPIMT
Outline
MapReduce
HDFS
Cloud computing is the use of computing resources (hardware and software) delivered as a service over a network (typically the Internet).
The Cloud aims to cut costs and to help users focus on their core business instead of being impeded by IT obstacles.
The main enabling technologies for cloud computing are virtualization and autonomic computing.
Software as a Service (SaaS)
Platform as a Service (PaaS)
Infrastructure as a Service (IaaS)
These three service models encapsulate the basic components of cloud computing.
Java requirements: Hadoop is a Java-based system. Recent versions of Hadoop require Sun Java 1.6.
Installing Hadoop: Hadoop 1.0.3 or above, installed in either single-node or multi-node mode.
Operating system: Linux (e.g. Ubuntu 12.04 LTS) or Mac OS X. Hadoop can also run on Windows, but Windows requires Cygwin to be installed.
1) Master (runs the HDFS NameNode, the MapReduce JobTracker, and the HBase Master)
2) Slaves (run the HDFS DataNodes, the MapReduce TaskTrackers, and the HBase RegionServers)
A typical node has two quad-core CPUs and 12 GB to 24 GB of memory.
Hadoop is a scalable, fault-tolerant grid operating system for data storage and processing.
Its scalability comes from the combination of:
HDFS: self-healing, high-bandwidth clustered storage
MapReduce: fault-tolerant distributed processing
http://wiki.apache.org/hadoop/
Characteristics of Hadoop
Runs on commodity hardware
We live in the age of very large and complex data sets, known as Big Data.
IDC estimates that the total size of the digital universe is 1.8 zettabytes, i.e. 1.8 x 10^21 bytes, roughly one hard disk drive for every person in the world.
About 90% of the world's data has been generated in the last two years alone. Such a large and ever-increasing amount of data is becoming difficult for traditional RDBMSs and grid computing systems to manage.
The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately 10 billion photos taking 1 petabyte of storage.
The Large Hadron Collider at CERN, Geneva produces about 15 petabytes of data per year.
The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per month.
The high expense of high-end servers and other proprietary hardware and software for processing and storing large amounts of data, together with their maintenance cost, is unbearable for many organisations. Upgrading these servers to scale up their capacity also requires huge cost.
The traditional single-server architecture is not robust, because a single large computer takes care of all the computing. If it fails or shuts down, the whole system breaks down and the enterprise incurs huge losses. During repair or upgrades the computer has to be switched off, and in the meantime no useful tasks are executed, so computations lag behind.
Not Robust
MapReduce is a programming model for processing large data sets, typically used for distributed computing on clusters of computers.
MapReduce gives ordinary programmers the ability to write parallel distributed programs much more easily. It consists of two simple functions:
map()
reduce()
MapReduce algorithm
"Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes.
A worker node may do this again in turn, leading to a multi-level tree structure.
The worker node processes the smaller problem, and passes the answer back to its master node.
"Reduce" step: The master node collects the answers to all the sub-problems from slaves
Then the master combines the answers in some way to form the output the answer to the problem it was originally trying to solve.
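The two steps can be illustrated with the classic word-count example, simulated here in plain Python in a single process (no Hadoop involved; the function names are only illustrative):

```python
from collections import defaultdict

def map_step(line):
    # Map: turn one input record into intermediate (key, value) pairs.
    return [(word.lower(), 1) for word in line.split()]

def reduce_step(key, values):
    # Reduce: combine all intermediate values that share a key.
    return key, sum(values)

def run_mapreduce(records, mapper, reducer):
    # The framework's shuffle phase groups intermediate values by key
    # between the map and reduce steps; here it is a simple dictionary.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

counts = run_mapreduce(["the quick brown fox", "the lazy dog"],
                       map_step, reduce_step)
# counts["the"] == 2, counts["fox"] == 1
```

In the real framework the mappers and reducers run on different machines and the shuffle moves data over the network; the logical flow is the same.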
MapReduce execution hierarchy (figure): a JobTracker on the master node coordinates TaskTrackers on the slave nodes; each TaskTracker runs individual task instances.
Job: a full program, i.e. an execution of a Mapper and Reducer across a data set.
For a job with 20 map tasks, at least 20 map task attempts will be performed, and more if a machine crashes, etc.
Terminology Example
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets. HDFS is part of the Apache Hadoop project, which originated as a subproject of Apache Lucene.
Master-slave architecture
Namenode (master):
Manages the filesystem namespace
Maintains the mapping from file name to the list of blocks and their locations
Manages block allocation and replication
Checkpoints the namespace and journals namespace changes for reliability
Controls access to the namespace
Datanodes (slaves):
Store blocks using the underlying OS's files
Clients access blocks directly from the datanodes
Periodically send block reports to the Namenode
Periodically check block integrity
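To make the block handling concrete, here is a toy Python sketch, not the real Namenode logic, of how a file might be split into fixed-size blocks and each block assigned to several datanodes. The round-robin placement and the datanode names are illustrative assumptions; real HDFS placement is rack-aware.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # default HDFS block size in Hadoop 1.x
REPLICATION = 3                 # default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    # Return (block_id, length) pairs covering a file of file_size bytes.
    blocks, offset = [], 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((len(blocks), length))
        offset += length
    return blocks

def place_replicas(blocks, datanodes, replication=REPLICATION):
    # Toy round-robin placement: each block goes to `replication`
    # distinct datanodes.
    return {block_id: [datanodes[(block_id + r) % len(datanodes)]
                       for r in range(replication)]
            for block_id, _ in blocks}

blocks = split_into_blocks(200 * 1024 * 1024)        # a 200 MB file
placement = place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4"])
# 200 MB / 64 MB -> 4 blocks; the last block holds the remaining 8 MB
```

Losing any one datanode still leaves two copies of every block, which is why HDFS tolerates commodity-hardware failures.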
HDFS Architecture
Weather sensors all across the globe are collecting climatic data.
Mapper.py:

#!/usr/bin/env python
import re
import sys

for line in sys.stdin:
    val = line.strip()
    (year, temp, q) = (val[15:19], val[87:92], val[92:93])
    if temp != "+9999" and re.match("[01459]", q):
        print "%s\t%s" % (year, temp)
Reduce.py:

#!/usr/bin/env python
import sys

(last_key, max_val) = (None, 0)
for line in sys.stdin:
    (key, val) = line.strip().split("\t")
    if last_key and last_key != key:
        print "%s\t%s" % (last_key, max_val)
        (last_key, max_val) = (key, int(val))
    else:
        (last_key, max_val) = (key, max(max_val, int(val)))

if last_key:
    print "%s\t%s" % (last_key, max_val)
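The scripts can be sanity-checked without a cluster. The sketch below re-implements the same fixed-width parsing and max-per-year logic in Python 3 and runs it over fabricated records; the column positions match the scripts above, but the sample values and helper names are made up for illustration.

```python
import re

def map_record(line):
    # Same fields as Mapper.py: year at columns 15-19, temperature at
    # columns 87-92, quality code at column 92.
    val = line.strip()
    year, temp, q = val[15:19], val[87:92], val[92:93]
    if temp != "+9999" and re.match("[01459]", q):
        return year, int(temp)
    return None          # record rejected by the quality filter

def max_by_year(pairs):
    # What Reduce.py computes: the maximum temperature per year.
    best = {}
    for year, temp in pairs:
        best[year] = max(temp, best.get(year, temp))
    return best

def record(year, temp, q):
    # Build a fabricated fixed-width record with the fields in place.
    return "x" * 15 + year + "x" * 68 + temp + q

pairs = [map_record(record("1950", "+0022", "1")),
         map_record(record("1950", "-0011", "1")),
         map_record(record("1950", "+9999", "1"))]   # filtered out
result = max_by_year(p for p in pairs if p is not None)
# result == {"1950": 22}
```

On a real cluster the same logic would run via Hadoop Streaming, with the mapper's output sorted by key before it reaches the reducer.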
References
Hadoop Wiki: http://wiki.apache.org/hadoop/
http://hadoop.apache.org/core/
http://wiki.apache.org/hadoop/GettingStartedWithHadoop
http://wiki.apache.org/hadoop/HadoopMapReduce
http://hadoop.apache.org/core/docs/current/hdfs_design.html