
A new way to store and analyze data

Presented by: Harsha Jain, CSE IV Year Student


Topics Covered
What is Hadoop? Why, Where, When?
Benefits of Hadoop
How Hadoop Works
Hadoop Architecture
Hadoop Common
HDFS
Hadoop MapReduce
Installation & Execution
Demo of Installation
Hadoop Community


What is Hadoop?
Hadoop was created by Doug Cutting, who named it after his child's stuffed elephant, to support the Lucene and Nutch search-engine projects. It is an open-source project administered by the Apache Software Foundation. Hadoop consists of two key services:
a. Reliable data storage using the Hadoop Distributed File System (HDFS).
b. High-performance parallel data processing using a technique called MapReduce.

Hadoop runs large-scale, high-performance processing jobs in spite of system changes or failures.


Hadoop, Why?
Need to process 100 TB datasets.
On 1 node: scanning @ 50 MB/s takes about 23 days.
On a 1000-node cluster: scanning @ 50 MB/s takes about 33 minutes.
We need an efficient, reliable, and usable framework (a quick back-of-the-envelope check follows).
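The arithmetic behind those figures is easy to verify. A minimal sketch (class name and constants are ours, taken from the slide's assumptions of a 100 TB dataset and a 50 MB/s per-node scan rate):

    // Back-of-the-envelope scan-time check (assumptions from the slide above).
    public class ScanTime {
        public static void main(String[] args) {
            double datasetBytes = 100e12;   // 100 TB dataset
            double ratePerNode = 50e6;      // 50 MB/s sequential scan per node
            for (int nodes : new int[] {1, 1000}) {
                double seconds = datasetBytes / (ratePerNode * nodes);
                System.out.printf("%4d node(s): %.0f minutes (%.1f days)%n",
                        nodes, seconds / 60, seconds / 86400);
            }
        }
    }

Running it prints roughly 33,333 minutes (23.1 days) for one node and 33 minutes for the 1000-node cluster, matching the slide.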


Where and When to Use Hadoop


Where
Batch data processing, not real-time / user-facing (e.g. document analysis and indexing, web graphs and crawling)
Highly parallel, data-intensive distributed applications
Very large production deployments (GRID)

When
Processing lots of unstructured data
When your processing can easily be made parallel
When running batch jobs is acceptable
When you have access to lots of cheap hardware


Benefits of Hadoop
Hadoop is designed to run on cheap commodity hardware.
It automatically handles data replication and node failure.
It does the hard work, so you can focus on processing data.
Cost-saving, efficient, and reliable data processing.


How Hadoop Works


Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.

Hadoop Architecture
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing

Hadoop consists of:
Hadoop Common*: The common utilities that support the other Hadoop subprojects.
HDFS*: A distributed file system that provides high-throughput access to application data.
MapReduce*: A software framework for distributed processing of large data sets on compute clusters.
Hadoop is made up of a number of elements. At the bottom is the Hadoop Distributed File System (HDFS), which stores files across storage nodes in a Hadoop cluster. Above HDFS is the MapReduce engine, which consists of JobTrackers and TaskTrackers.
* This presentation focuses primarily on the Hadoop architecture and related subprojects.

Data Flow
[Figure: data flow pipeline: Web Servers → Scribe Servers → Network Storage → Hadoop Cluster → Oracle RAC / MySQL]


Hadoop Common
Hadoop Common is a set of utilities that support the other Hadoop subprojects. Hadoop Common includes FileSystem, RPC, and serialization libraries.
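As a small illustration of the serialization piece, here is a hypothetical sketch (class name ours) that round-trips two of Hadoop's standard Writable types through a byte stream, the same mechanism MapReduce uses to move keys and values between nodes:

    import java.io.*;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;

    public class WritableDemo {
        public static void main(String[] args) throws IOException {
            // Serialize a Text and an IntWritable into an in-memory byte stream.
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            new Text("hadoop").write(out);
            new IntWritable(42).write(out);

            // Deserialize them back, in the same order.
            DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()));
            Text word = new Text();
            IntWritable count = new IntWritable();
            word.readFields(in);
            count.readFields(in);
            System.out.println(word + "\t" + count);  // prints: hadoop<tab>42
        }
    }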


HDFS
Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them across compute nodes throughout a cluster to enable reliable, extremely rapid computation. Replication and data locality are the two key ideas.
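Applications talk to HDFS through the FileSystem API from Hadoop Common. A minimal sketch of reading a file, assuming a configured cluster (class name and path are hypothetical):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsCat {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();     // picks up core-site.xml etc.
            FileSystem fs = FileSystem.get(conf);         // connect to the default FS (HDFS)
            Path path = new Path("/user/demo/input.txt"); // hypothetical HDFS path
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(fs.open(path)));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);                 // print each line of the file
            }
            reader.close();
            fs.close();
        }
    }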

HDFS Architecture


Hadoop MapReduce
The Map/Reduce programming model
A framework for distributed processing of large data sets
Pluggable user code runs in a generic framework
A common design pattern in data processing:
cat * | grep | sort | uniq -c | cat > file
input | map | shuffle | reduce | output
Natural for: log processing, web-search indexing, ad-hoc queries

MapReduce Implementation
1. Split input files (M splits)
2. Assign master and workers
3. Run map tasks
4. Write intermediate data to disk (R regions)
5. Read and sort intermediate data
6. Run reduce tasks
7. Return


MapReduce Cluster Implementation
[Figure: input files are divided into splits (split 0 … split 4); M map tasks write intermediate files, which R reduce tasks read to produce output files (Output 0, Output 1).]
Several map or reduce tasks can run on a single computer. Each intermediate file is divided into R partitions by the partitioning function, and each reduce task corresponds to one partition.

Examples of MapReduce
Word Count: read text files and count how often words occur.
The input is text files.
The output is a text file; each line contains a word, a tab, and its count.
Map: produce (word, count) pairs.
Reduce: for each word, sum up the counts.
(A sketch of the classic Java implementation follows.)
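This is the classic word-count example from the Apache Hadoop tutorial; the sketch below follows that code (Hadoop 1.x-era mapreduce API):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map: emit (word, 1) for every token in the input line.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reduce: sum the counts for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);  // one output line per word: word <tab> count
            }
        }
    }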

Let's Go
Installation:
Requirements: Linux, Java 1.6, sshd, rsync
Configure SSH for password-free authentication
Unpack the Hadoop distribution
Edit a few configuration files
Format the DFS on the name node
Start all the daemon processes

Execution:
Compile your job into a JAR file
Copy the input data into HDFS
Execute bin/hadoop jar with the relevant arguments
Monitor tasks via the web interface (optional)
Examine the output when the job is complete
(A minimal driver sketch follows.)
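To tie the two lists together, here is a minimal driver sketch for the word-count job above; it assumes the WordCount classes from the earlier slide, and the class name is ours:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");        // Hadoop 1.x-style job setup
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCount.TokenizerMapper.class);
            job.setCombinerClass(WordCount.IntSumReducer.class);  // combiner = local reduce
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input dir
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output dir
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a JAR, it would be launched with something like bin/hadoop jar wordcount.jar WordCountDriver /user/demo/input /user/demo/output (paths illustrative).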


Demo Video for installation


Hadoop Community
Hadoop users: Adobe, Alibaba, Amazon, AOL, Facebook, Google, IBM
Major contributors: Apache, Cloudera, Yahoo!


References
Apache Hadoop (http://hadoop.apache.org)
Hadoop on Wikipedia (http://en.wikipedia.org/wiki/Hadoop)
Free Search by Doug Cutting (http://cutting.wordpress.com)
Hadoop and Distributed Computing at Yahoo! (http://developer.yahoo.com/hadoop)
Cloudera - Apache Hadoop for the Enterprise (http://www.cloudera.com)
