Rahul Poddar 11500110119 Santosh Kumar 11500110006 Shubham Raj 11500110054 Vinayak Raj 11500110019 6th semester CSE-B BPPIMT
Outline
MapReduce
HDFS
Cloud computing is the use of computing resources (hardware and software) delivered as a service over a network (typically the Internet).
The Cloud aims to cut costs and to help users focus on their core business instead of being impeded by IT obstacles.
The main enabling technologies for cloud computing are virtualization and autonomic computing.
Software as a Service (SaaS)
Platform as a Service (PaaS)
Infrastructure as a Service (IaaS)
These three service models encapsulate the basic components of cloud computing.
Java requirements: Hadoop is a Java-based system. Recent versions of Hadoop require Sun Java 1.6.
Installing Hadoop: Hadoop 1.0.3 or above, installed in either single-node or multi-node mode.
Operating system: Linux (e.g. Ubuntu 12.04 LTS) or Mac OS X. Hadoop can also run on Windows, but Windows requires Cygwin to be installed.
1) Master (runs the HDFS NameNode, the MapReduce JobTracker, and the HBase Master)
2) Slaves (run the HDFS DataNodes, the MapReduce TaskTrackers, and the HBase RegionServers)
A typical node has two quad-core CPUs and 12 GB to 24 GB of memory.
Hadoop is a scalable, fault-tolerant grid operating system for data storage and processing.
Its scalability comes from the combination of:
HDFS: self-healing, high-bandwidth clustered storage
MapReduce: fault-tolerant distributed processing
http://wiki.apache.org/hadoop/
Characteristics of Hadoop
Runs on commodity hardware
We live in the age of very large and complex data sets, known as Big Data.
IDC estimates that the total size of the digital universe is 1.8 zettabytes, i.e. 1.8 x 10^21 bytes, roughly one hard disk drive for every person in the world.
About 90% of the world's data has been generated in the last two years alone. Such a large and ever-increasing amount of data is becoming difficult for traditional RDBMSs and grid computing systems to manage.
The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately 10 billion photos taking 1 petabyte of storage.
The Large Hadron Collider at CERN, Geneva produces about 15 petabytes of data per year.
The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per month.
The high expense of high-end servers and other proprietary hardware and software for processing and storing large amounts of data, together with their maintenance cost, is unbearable for many organisations. Upgrading these servers to scale up their capacity also requires huge cost.
The traditional single-server architecture is not robust, because a single large computer takes care of all the computing. If it fails or shuts down, the whole system breaks down and the enterprise incurs huge losses. During repair or upgrades the computer has to be switched off, and in the meantime no useful tasks are executed, so computations lag behind.
Not Robust
MapReduce is a programming model for processing large data sets, typically used for distributed computing on clusters of computers.
MapReduce gives ordinary programmers the ability to write parallel distributed programs much more easily. It consists of two simple functions:
map()
reduce()
MapReduce algorithm
"Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes.
A worker node may do this again in turn, leading to a multi-level tree structure.
The worker node processes the smaller problem, and passes the answer back to its master node.
"Reduce" step: The master node collects the answers to all the sub-problems from slaves
Then the master combines the answers in some way to form the output the answer to the problem it was originally trying to solve.
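The two steps can be illustrated with the classic word-count example, simulated here in plain Python in a single process (no Hadoop involved; the function names are only illustrative):

```python
from collections import defaultdict

def map_step(line):
    # Map: turn one input record into intermediate (key, value) pairs.
    return [(word.lower(), 1) for word in line.split()]

def reduce_step(key, values):
    # Reduce: combine all intermediate values that share a key.
    return key, sum(values)

def run_mapreduce(records, mapper, reducer):
    # The framework's shuffle phase groups intermediate values by key
    # between the map and reduce steps; here it is a simple dictionary.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

counts = run_mapreduce(["the quick brown fox", "the lazy dog"],
                       map_step, reduce_step)
# counts["the"] == 2, counts["fox"] == 1
```

In the real framework the mappers and reducers run on different machines and the shuffle moves data over the network; the logical flow is the same.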
MapReduce execution hierarchy (figure): a JobTracker on the master node coordinates TaskTrackers on the slave nodes; each TaskTracker runs individual task instances.
Job: a full program, i.e. an execution of a Mapper and Reducer across a data set.
For a job with 20 map tasks, at least 20 map task attempts will be performed, and more if a machine crashes, etc.
Terminology Example
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets. HDFS is part of the Apache Hadoop project, which originated as a subproject of Apache Lucene.
Master-slave architecture
Namenode (master):
Manages the filesystem namespace
Maintains the mapping from file name to the list of blocks and their locations
Manages block allocation and replication
Checkpoints the namespace and journals namespace changes for reliability
Controls access to the namespace
Datanodes (slaves):
Store blocks using the underlying OS's files
Clients access blocks directly from the datanodes
Periodically send block reports to the Namenode
Periodically check block integrity
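To make the block handling concrete, here is a toy Python sketch, not the real Namenode logic, of how a file might be split into fixed-size blocks and each block assigned to several datanodes. The round-robin placement and the datanode names are illustrative assumptions; real HDFS placement is rack-aware.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # default HDFS block size in Hadoop 1.x
REPLICATION = 3                 # default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    # Return (block_id, length) pairs covering a file of file_size bytes.
    blocks, offset = [], 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((len(blocks), length))
        offset += length
    return blocks

def place_replicas(blocks, datanodes, replication=REPLICATION):
    # Toy round-robin placement: each block goes to `replication`
    # distinct datanodes.
    return {block_id: [datanodes[(block_id + r) % len(datanodes)]
                       for r in range(replication)]
            for block_id, _ in blocks}

blocks = split_into_blocks(200 * 1024 * 1024)        # a 200 MB file
placement = place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4"])
# 200 MB / 64 MB -> 4 blocks; the last block holds the remaining 8 MB
```

Losing any one datanode still leaves two copies of every block, which is why HDFS tolerates commodity-hardware failures.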
HDFS Architecture
Weather sensors all across the globe are collecting climatic data.
Mapper.py:

#!/usr/bin/env python
import re
import sys

for line in sys.stdin:
    val = line.strip()
    (year, temp, q) = (val[15:19], val[87:92], val[92:93])
    if temp != "+9999" and re.match("[01459]", q):
        print "%s\t%s" % (year, temp)
Reduce.py:

#!/usr/bin/env python
import sys

(last_key, max_val) = (None, 0)
for line in sys.stdin:
    (key, val) = line.strip().split("\t")
    if last_key and last_key != key:
        print "%s\t%s" % (last_key, max_val)
        (last_key, max_val) = (key, int(val))
    else:
        (last_key, max_val) = (key, max(max_val, int(val)))

if last_key:
    print "%s\t%s" % (last_key, max_val)
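The scripts can be sanity-checked without a cluster. The sketch below re-implements the same fixed-width parsing and max-per-year logic in Python 3 and runs it over fabricated records; the column positions match the scripts above, but the sample values and helper names are made up for illustration.

```python
import re

def map_record(line):
    # Same fields as Mapper.py: year at columns 15-19, temperature at
    # columns 87-92, quality code at column 92.
    val = line.strip()
    year, temp, q = val[15:19], val[87:92], val[92:93]
    if temp != "+9999" and re.match("[01459]", q):
        return year, int(temp)
    return None          # record rejected by the quality filter

def max_by_year(pairs):
    # What Reduce.py computes: the maximum temperature per year.
    best = {}
    for year, temp in pairs:
        best[year] = max(temp, best.get(year, temp))
    return best

def record(year, temp, q):
    # Build a fabricated fixed-width record with the fields in place.
    return "x" * 15 + year + "x" * 68 + temp + q

pairs = [map_record(record("1950", "+0022", "1")),
         map_record(record("1950", "-0011", "1")),
         map_record(record("1950", "+9999", "1"))]   # filtered out
result = max_by_year(p for p in pairs if p is not None)
# result == {"1950": 22}
```

On a real cluster the same logic would run via Hadoop Streaming, with the mapper's output sorted by key before it reaches the reducer.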
References
Hadoop Wiki: http://wiki.apache.org/hadoop/
http://hadoop.apache.org/core/
http://wiki.apache.org/hadoop/GettingStartedWithHadoop
http://wiki.apache.org/hadoop/HadoopMapReduce
http://hadoop.apache.org/core/docs/current/hdfs_design.html