
Seminar Report

titled

HADOOP & HDFS


Submitted BY

Mr Indrajit Gohokar 7th/B Roll No:132

Oct, 2013-14

Department of Computer Technology


YESHWANTRAO CHAVAN COLLEGE OF ENGINEERING, Nagpur
(An Autonomous Institution Affiliated to Rashtrasant Tukadoji Maharaj Nagpur University)


Certificate
This is to certify that the Seminar Report titled "Hadoop & HDFS" is submitted towards the partial fulfillment of the requirement of the seminar course in VII Semester, B.E. (Computer Technology), degree awarded by Rashtrasant Tukadoji Maharaj Nagpur University, Nagpur.

Submitted by: Mr. Indrajit Dilip Gohokar (Roll No:132) is approved.

Seminar Guide

Mr. P.DHAVAN

Seminar Coordinator
Mrs.P.DESHKAR

Head, Department of Computer Technology


Mr.A.R.PATIL BHAGAT

Date:Oct/2013 Place:Nagpur

Abstract
Nowadays we encounter huge amounts of data, be it from Facebook or Twitter. The huge amount of data generated every day is known as Big Data. Because of its vastness, we have to find a way to analyse it in order to make any sense of it. This analysis can be done using Hadoop. Apache Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers. This report focuses on the understanding of Apache Hadoop and the HDFS. The Hadoop Distributed File System (HDFS), a core component of Hadoop, is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size.


Table of Contents

1.0 Introduction
2.0 Background Knowledge
2.0.1 Big Data around the world
2.0.2 Use of Cluster Architecture for parallel processing
3.0 Everything to know about Big Data
3.0.1 What is Big Data?
3.0.2 What does the Big Data look like?
3.0.3 The value of Big Data
3.1 Apache Hadoop with HDFS & some implementations
3.1.1 Hadoop Distributed File System (HDFS)
3.1.1.1 Features of HDFS
3.1.1.2 Architecture of HDFS
3.1.1.3 Filesystem Namespace
3.1.1.4 Data Organization and replication
3.1.2 Some implementations of Hadoop
3.1.2.1 The Oracle implementation
3.1.2.2 The Dell implementation
4.0 Advantages and Limitations
4.0.1 Advantages
4.0.2 Limitations
5.0 Applications
6.0 Future Scope
7.0 Conclusion
References

List of Figures

Fig 1: Overview of Cluster Architecture
Fig 2: Technicians working on a large Linux cluster at the Chemnitz University of Technology, Germany
Fig 3: Value of Big Data - Wipro infographic
Fig 4: HDFS Architecture
Fig 5: Feeding Hadoop Data to the Oracle Database
Fig 6: Oracle Grid Engine
Fig 7: Dell Hadoop design
Fig 8: Dell Network implementation


1.0 Introduction
Big Data has become viable as cost-effective approaches have emerged to tame the volume, velocity and variability of massive data. Assuming that the volumes of data are larger than those conventional relational database infrastructures can cope with, processing options break down broadly into two groups: massively parallel processing architectures (data warehouses or databases) and Apache Hadoop-based solutions. Hadoop provides a distributed file system (HDFS) and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm. An important characteristic of Hadoop is the partitioning of data and computation across many (thousands of) hosts, and the execution of application computations in parallel close to their data. A Hadoop cluster scales computation capacity, storage capacity and IO bandwidth by simply adding commodity servers. Hadoop therefore plays an important role in managing and making sense of Big Data.


2.0 Background Knowledge


2.0.1 Big Data around the world
The rise of the internet and Web 2.0 has resulted not only in an enormous increase in the amount of data created, but also in the types of data. Data is also collected, organized, stored, managed and, most importantly, analyzed to enable and accelerate discoveries in science. Examples of scientific Big Data include nuclear research data, where CERN, the European Organization for Nuclear Research, is a major contributor, and the data reported on the generation and consumption of all forms of energy on a global scale, where Smart Grids are a tremendous source, with data obtained from 350 billion annual meter readings [2]. The Large Hadron Collider (LHC), a particle accelerator that will revolutionize our understanding of the workings of the Universe, will generate 60 terabytes of data per day, or 15 petabytes (15 million gigabytes) annually [1]. In the age of Web 2.0, 12 terabytes of Tweets are created each day [2] and 100 terabytes of data are uploaded daily to Facebook [3].

Examples of Big Data in the private sector include Walmart, which handles more than 1 million customer transactions every hour, feeding databases estimated to hold more than 2.5 petabytes of data [3]. Thus Big Data is everywhere and the age of Big Data is upon us. Successfully exploiting the value in Big Data requires experimentation and exploration.

2.0.2 Use of Cluster Architecture for parallel processing:


To effectively harness the power of Big Data, we need an architecture that is distributed and supports parallel processing. It should cater to three needs, namely:

1. Volume: It should be able to handle the extensive volume of Big Data.
2. Speed: It should be able to process and analyze data as fast as possible.
3. Cost: All this should be achieved at minimum cost.

A computer cluster (or Cluster Architecture) consists of a set of loosely connected computers that work together so that in many respects they can be viewed as a single system. The components of a cluster are usually connected to each other through fast local area networks, with each node (a computer used as a server) running its own instance of an operating system. Each node has its own cores, memory and disks.

Computer clusters emerged as a result of the convergence of a number of computing trends, including the availability of low-cost microprocessors, high-speed networks, and software for high-performance distributed computing. The activities of the computing nodes are orchestrated by "clustering middleware", a software layer that sits atop the nodes and allows users to treat the cluster as, by and large, one cohesive computing unit.


Fig 1: Overview of Cluster Architecture

Clusters are usually deployed to improve performance and availability over that of a single computer, while typically being much more cost-effective than single computers of comparable speed or availability.[4]

Fig 2: Technicians working on a large Linux cluster at the Chemnitz University of Technology, Germany

The advantages of cluster architecture are:

Modular and Scalable - It is easier to expand the system without bringing down the application that runs on top of the cluster.

Data Locality - Data can be processed by the cores collocated in the same node or rack, minimizing any transfer over the network.

Parallelization - A higher degree of parallelism is achieved via the simultaneous execution of separate portions of a program on different processors.

Less cost - Built on the principle of commodity hardware, which is to have more low-performance, low-cost machines working in parallel (scale-out computing) rather than fewer high-performance, high-cost machines.
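As a rough illustration of the parallelization and data locality points above, the following Python sketch splits a data set into partitions, lets each worker process only its own partition, and then combines the partial results. The function names are hypothetical, and a local thread pool is used here only as a stand-in for cluster nodes; on a real cluster each task would run on a separate machine next to its data.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_nodes):
    """Split the records into num_nodes roughly equal partitions."""
    return [records[i::num_nodes] for i in range(num_nodes)]


def process_partition(part):
    """Work done by one (simulated) node on its local partition only."""
    return sum(x * x for x in part)


def run_on_cluster(records, num_nodes=4):
    """Process every partition in parallel and combine the partial results."""
    parts = partition(records, num_nodes)
    # Assumption: threads stand in for nodes; real nodes do not share memory.
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partial_results = list(pool.map(process_partition, parts))
    return sum(partial_results)


total = run_on_cluster(list(range(1, 101)))
print(total)
```

Adding more simulated nodes changes only `num_nodes`; the per-partition function stays the same, which is the property that makes a cluster easy to expand.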

However, managing a cluster has a few overheads, which include:

Complexity - Administering a cluster of N machines significantly increases the cost and complexity of using the cluster.

More Storage - As data is replicated to protect against failure, cluster architecture requires more storage capacity.

Data Distribution and Task Scheduling - When a large multi-user cluster needs to access very large amounts of data, task scheduling and data distribution become a challenge. Given that in a complex application environment the performance of each job depends on the characteristics of the underlying cluster, mapping tasks onto CPU cores and GPU devices poses significant challenges. [5]

Need of Careful Management and massive parallel processing Design - Automatic parallelization of programs continues to remain a technical challenge. The development and debugging of parallel programs on a cluster requires parallel language primitives as well as suitable tools.

3.0 Everything to know about Big Data


3.0.1 What is Big Data?
In information technology, Big Data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze [8]. O'Reilly defines big data in the following way: "Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the strictures of your database architectures." [6]

3.0.2 What does the Big Data look like?


As a catch-all term, "Big Data" can be pretty nebulous. Input data to big data systems could be chatter from social networks, web server logs, traffic flow sensors, satellite imagery, broadcast audio streams, banking transactions, MP3s of rock music, the content of web pages, scans of government documents, GPS trails, telemetry from automobiles, financial market data; the list goes on. Are these all really the same thing? To clarify matters, the three Vs of volume, velocity and variety are commonly used to characterize different aspects of big data. They are a helpful lens through which to view and understand the nature of the data and the software platforms available to exploit them. Most probably you will contend with each of the Vs to one degree or another.

1. Volume: Many factors contribute to the increase in data volume: transaction-based data stored through the years, text data constantly streaming in from social media, increasing amounts of sensor data being collected, etc. This volume presents the most immediate challenge to conventional IT structures. It calls for scalable storage and a distributed approach to querying. Many companies already have large amounts of archived data, perhaps in the form of logs, but not the capacity to process it.

2. Velocity: The importance of data's velocity (the increasing rate at which data flows into an organization) has followed a similar pattern to that of volume. Problems previously restricted to segments of industry are now presenting themselves in a much broader setting. Specialized companies such as financial traders have long turned systems that cope with fast-moving data to their advantage. Why is that so? The Internet and mobile era means that the way we deliver and consume products and services is increasingly instrumented, generating a data flow back to the provider. Those who are able to quickly utilize that information, by recommending additional purchases, for instance, gain a competitive advantage. It is not just the velocity of the incoming data that is the issue: it is possible to stream fast-moving data into bulk storage for later batch processing, for example. The importance lies in the speed of the feedback loop, taking data from input through to decision. [6]

3. Variety: Rarely does data present itself in a form perfectly ordered and ready for processing. A common theme in big data systems is that the source data is diverse, and doesn't fall into neat relational structures. It could be text from social networks, image data, a raw feed directly from a sensor source. None of these things come ready for integration into an application. Even on the web, where computer-to-computer communication ought to bring some guarantees, the reality of data is messy. A common use of big data processing is to take unstructured data and extract ordered meaning, for consumption either by humans or as a structured input to an application.
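To make the idea of taking unstructured data and extracting ordered meaning more concrete, here is a small Python sketch that turns raw web server log lines into structured records and then aggregates them. The log lines, the regular expression and the field names are assumptions chosen for illustration; real logs vary in layout, and lines that do not match are simply skipped.

```python
import re
from collections import Counter

# Hypothetical log lines; the third one is deliberately malformed.
RAW_LOG = """\
10.0.0.1 - - [12/Oct/2013:10:15:32 +0530] "GET /index.html HTTP/1.1" 200 5120
10.0.0.2 - - [12/Oct/2013:10:15:40 +0530] "GET /missing.html HTTP/1.1" 404 312
this line is corrupted and cannot be parsed
10.0.0.1 - - [12/Oct/2013:10:16:02 +0530] "POST /login HTTP/1.1" 200 870
"""

LINE_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)


def parse_line(line):
    """Return a structured record for one log line, or None if it is messy."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["size"] = int(record["size"])
    return record


# Keep only the lines that could be parsed into a record.
records = [r for r in map(parse_line, RAW_LOG.splitlines()) if r is not None]

# Structured output: responses per status code and bytes served per client.
status_counts = Counter(r["status"] for r in records)
bytes_per_ip = Counter()
for r in records:
    bytes_per_ip[r["ip"]] += r["size"]

print(status_counts)
print(bytes_per_ip)
```

The structured records could equally be handed to an application or loaded into a table; the point is only that the ordering is imposed after the data arrives.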

3.0.3 The value of Big Data


Big data is more than simply a matter of size; it is an opportunity to find insights in new and emerging types of data and content, to make your business more agile, and to answer questions that were previously considered beyond your reach. The value of Big Data is increasing day by day. The infographic by Wipro shows the value of Big Data in some sectors. [7]


Fig 3. Value of Big Data - Wipro infographic

Until now, there was no practical way to harvest this opportunity. Today, managing Big Data can be done effectively and with ease thanks to the emergence of state-of-the-art solutions like Apache Hadoop.


3.1 Apache Hadoop with HDFS & some implementations


Apache Hadoop is an open source Java framework for processing and querying vast amounts of data on large clusters of commodity hardware. [8] It is an open source Apache project initiated and led by Yahoo. It enables applications to work with thousands of computation-independent computers and petabytes of data. Hadoop was derived from Google's Map-Reduce and Google File System (GFS) papers. The entire Apache Hadoop platform is now commonly considered to consist of the Hadoop kernel, Map-Reduce and HDFS, as well as a number of related projects including Apache Hive, Apache HBase, and others. Hadoop was created by Doug Cutting, was funded by Yahoo from 2006, and reached web-scale capacity in 2008 [9]. The HDFS, or Hadoop Distributed File System, is one of the two core components of the Hadoop framework, the other being Map-Reduce.

3.1.1 Hadoop Distributed File System (HDFS)


The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.


3.1.1.1 Features of HDFS [10]

Highly fault-tolerant: Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

Suitable for applications with large data sets: Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It provides high aggregate data bandwidth and scales to thousands of nodes in a single cluster. It should support tens of millions of files in a single instance.

Streaming access to file system data: Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing than for interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access.

Portability across Heterogeneous Hardware and Software Platforms: HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.
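The claim that some component is always non-functional can be checked with a short calculation. The Python sketch below assumes, purely for illustration, that each of N components is independently out of service with the same probability p at any given moment; the value p = 0.001 is a made-up figure, not a measured one. Under that assumption the probability that at least one component is down is 1 - (1 - p)^N.

```python
def prob_some_component_failed(num_components, p_fail):
    """Probability that at least one of num_components is down,
    assuming independent failures with the same probability p_fail."""
    return 1.0 - (1.0 - p_fail) ** num_components


# Assumption: p_fail = 0.001 is a hypothetical per-component figure.
for n in (10, 100, 1000, 5000):
    print(n, round(prob_some_component_failed(n, 0.001), 3))
```

Even with a small per-component probability, the result approaches 1 as the cluster grows to thousands of machines, which is why fault detection and automatic recovery are treated as core design goals rather than special cases.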
3.1.1.2 Architecture of HDFS


HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by HDFS clients. There are a number of DataNodes, usually one per node in a cluster. The DataNodes manage the storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. A file is split into one or more blocks, and these blocks are stored in a set of DataNodes.

NameNode: Keeps an image of the entire file system namespace and the file Blockmap in memory. When the NameNode starts up, it reads the FsImage and EditLog from its local file system, updates the FsImage with the EditLog information, and then stores a copy of the FsImage on the file system as a checkpoint. Checkpointing is done periodically, so that the system can recover to the last checkpointed state in case of a crash [11].

DataNodes: A DataNode stores data in files in its local file system. A DataNode has no knowledge of HDFS files; it stores each block of HDFS data in a separate local file. The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode. When a DataNode starts up, it generates a list of all the HDFS blocks it holds and sends this report, the Blockreport, to the NameNode [11].
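The checkpoint step described for the NameNode can be sketched in a few lines of Python. This is only a toy model: the dictionary layout of the image, the operation names (`create`, `add_block`, `delete`) and the function names are assumptions made for illustration and do not reflect the real FsImage or EditLog formats.

```python
import copy


def apply_edit(fs_image, edit):
    """Apply one logged namespace change to the in-memory image."""
    op = edit["op"]
    if op == "create":
        fs_image[edit["path"]] = {"replication": edit["replication"], "blocks": []}
    elif op == "add_block":
        fs_image[edit["path"]]["blocks"].append(edit["block_id"])
    elif op == "delete":
        fs_image.pop(edit["path"], None)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fs_image, edit_log):
    """Replay the edit log onto a copy of the image.

    Returns the merged image and an empty log, in the spirit of the
    FsImage + EditLog merge performed when the NameNode starts up.
    """
    new_image = copy.deepcopy(fs_image)
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []


# Hypothetical state: one existing file and three logged changes.
old_image = {"/logs/a.txt": {"replication": 3, "blocks": [1, 2]}}
log = [
    {"op": "create", "path": "/logs/b.txt", "replication": 2},
    {"op": "add_block", "path": "/logs/b.txt", "block_id": 3},
    {"op": "delete", "path": "/logs/a.txt"},
]
new_image, log = checkpoint(old_image, log)
print(new_image)
```

Because the merged image already contains every logged change, the log can start empty again, and a crash after the checkpoint only requires replaying edits made since then.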


Fig 4: HDFS Architecture

3.1.1.3 Filesystem Namespace

HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to most other existing file systems; one can create and remove files, move a file from one directory to another, or rename a file. The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode [11].
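A hierarchical namespace that also records a replication factor per file can be modelled very simply. The Python sketch below is a toy, in-memory illustration; the class and method names are hypothetical and are not part of any Hadoop API, and the default replication factor of 3 is an assumption used only as an example value.

```python
from dataclasses import dataclass


@dataclass
class FileEntry:
    replication: int  # replication factor recorded for this file


class Namespace:
    """Toy namespace: directories are dicts, files are FileEntry objects."""

    def __init__(self):
        self.root = {}

    @staticmethod
    def _split(path):
        parts = [p for p in path.split("/") if p]
        return parts[:-1], parts[-1]

    def _dir(self, parts):
        node = self.root
        for name in parts:
            node = node[name]  # raises KeyError if the directory is missing
            if not isinstance(node, dict):
                raise NotADirectoryError(name)
        return node

    def mkdir(self, path):
        parents, name = self._split(path)
        self._dir(parents).setdefault(name, {})

    def create(self, path, replication=3):
        parents, name = self._split(path)
        self._dir(parents)[name] = FileEntry(replication)

    def rename(self, src, dst):
        src_parents, src_name = self._split(src)
        dst_parents, dst_name = self._split(dst)
        dst_dir = self._dir(dst_parents)  # resolve first so a bad target loses nothing
        dst_dir[dst_name] = self._dir(src_parents).pop(src_name)

    def lookup(self, path):
        parents, name = self._split(path)
        return self._dir(parents)[name]


ns = Namespace()
ns.mkdir("/logs")
ns.mkdir("/archive")
ns.create("/logs/a.txt", replication=2)
ns.rename("/logs/a.txt", "/archive/a-2013.txt")
print(ns.lookup("/archive/a-2013.txt"))
```

Moving or renaming a file changes only the namespace entry; the replication factor travels with the entry, which matches the idea that this metadata lives in one place.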

3.1.1.4 Data Organization and replication [10]

HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on a different DataNode. HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly [11].
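The block and replication arithmetic can be illustrated with a short Python sketch. It cuts a file into 64 MB blocks, spreads replicas over DataNodes, and treats a DataNode as alive if its last Heartbeat is recent. The round-robin placement, the function names, the node names and the timeout value are all simplifying assumptions for illustration; the actual placement policy used by the NameNode is more involved.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical block size quoted above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # only the last block may be smaller
    return sizes


def place_replicas(num_blocks, datanodes, replication):
    """Assign each block to `replication` distinct DataNodes, round-robin.

    Illustration only: ignores racks, free space and load.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def live_datanodes(last_heartbeat, now, timeout_seconds):
    """DataNodes whose most recent Heartbeat is no older than the timeout."""
    return sorted(dn for dn, t in last_heartbeat.items() if now - t <= timeout_seconds)


MB = 1024 * 1024
blocks = split_into_blocks(200 * MB)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=3)
alive = live_datanodes({"dn1": 100, "dn2": 95, "dn3": 40, "dn4": 99}, now=100, timeout_seconds=30)
print([size // MB for size in blocks], placement, alive)
```

A 200 MB file therefore occupies three full blocks plus one 8 MB block, and with a replication factor of 3 it consumes three times its size in raw storage across the cluster.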

3.1.2 Some Implementations of Hadoop


Some of the big corporations around the world have implemented Hadoop for their own use. These corporations include Yahoo, IBM, Facebook, Google, Dell, Oracle, Cloudera and so on. These major corporations have evidently understood the might of Hadoop.

3.1.2.1 The Oracle Implementation

1) Feeding Hadoop Data to the Database for Further Analysis [12]

External tables present data stored in a file system in a table format, and can be used in SQL queries transparently. Hadoop data stored in HDFS can be accessed from inside the Oracle Database by using External Tables through the use of FUSE (Filesystem in Userspace). Using the External Table makes it easier for non-programmers to work with Hadoop data from inside an Oracle Database.


Fig 5. Feeding Hadoop Data to the Oracle Database

2) Drive Enterprise-Level Operational Efficiency of Hadoop Infrastructure with Oracle Grid Engine [12]

All the computing resources allocated to a Hadoop cluster are used exclusively by Hadoop, which can result in underutilized resources when Hadoop is not running. Oracle Grid Engine enables a cluster of computers to run Hadoop alongside other data-oriented compute application models. The benefit of this approach is that you don't have to maintain a dedicated cluster for running only Hadoop applications.

Fig 6. Oracle Grid Engine


3.1.2.2 The Dell Implementation

1) Hadoop design implementation [13]

The representation is broken down into the Hadoop use cases such as Compute, Storage, and Database workloads. Each workload has specific characteristics for operations, deployment, architecture, and management.

Fig 7. Dell Hadoop design

2) Network implementation [13]

Top-of-rack (ToR) switches in a network architecture connect directly to the DataNodes and allow for all inter-node communication within the Hadoop environment. Hadoop networks should utilize inexpensive components that are employed in a way that maximizes performance for DataNode communication.


Fig 8. Dell Network implementation

3) Performance Benchmarks [13]

Within the Hadoop software ecosystem, several benchmark tools are included that can be used for these comparisons.

1. Teragen: Utilizes the parallel framework within Hadoop to quickly create large data sets that can be manipulated.
2. Terasort: Reads the data created by Teragen into the system's physical memory, sorts it, and writes it back out to the HDFS.
3. Teravalidate: Ensures that the data produced by Terasort is accurate, without any errors.
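The generate, sort and validate sequence of the three tools can be mimicked on a single machine. The Python sketch below is not the Hadoop benchmark itself: the function names, record layout and 10-character keys are assumptions chosen for illustration, and everything runs in memory rather than on HDFS.

```python
import random


def generate(num_records, seed=0):
    """Teragen-like step: create records with random 10-character keys."""
    rng = random.Random(seed)
    alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    return [
        ("".join(rng.choice(alphabet) for _ in range(10)), row_id)
        for row_id in range(num_records)
    ]


def sort_records(records):
    """Terasort-like step: sort all records by key."""
    return sorted(records, key=lambda rec: rec[0])


def validate(sorted_records, expected_count):
    """Teravalidate-like step: no record lost and keys in non-decreasing order."""
    if len(sorted_records) != expected_count:
        return False
    return all(a[0] <= b[0] for a, b in zip(sorted_records, sorted_records[1:]))


data = generate(1000)
result = sort_records(data)
print(validate(result, expected_count=1000))
```

Timing each of the three steps separately, on hardware of different sizes, is the kind of comparison the real tools are meant to support.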


4.0 Advantages and Limitations


4.0.1 Advantages:

Distributes data and computation. The data locality principle avoids network overload.

Tasks are independent [14], so it is easy to handle partial failures: entire nodes can fail and restart without shutting down the entire system. This avoids the crawling horrors of failure-tolerant synchronous distributed systems.

Linear scaling in the ideal case. Designed for cheap, commodity hardware.

Simple programming model. The end-user programmer only writes map-reduce tasks, and the overhead of programming other tasks is reduced.

Flat Scalability: One of the major benefits of using Hadoop in contrast to other distributed systems is its flat scalability curve. A program written in distributed frameworks other than Hadoop may require large amounts of refactoring when scaling from ten to one hundred or one thousand machines. After a Hadoop program is written and functioning on ten nodes, very little--if any--work is required for that same program to run on a much larger amount of hardware. [14]
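The simple programming model mentioned above can be shown with the classic word-count example. The Python sketch below imitates the map, shuffle and reduce phases in a single process; the function names are hypothetical and this is not the Hadoop API, only an illustration of what the programmer has to supply (the map and reduce functions) versus what the framework handles (grouping by key and distribution).

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data is big", "hadoop handles big data"])
print(counts)
```

Since each call to `map_phase` depends only on its own document and each call to `reduce_phase` only on its own key, the same two functions could run unchanged on ten machines or a thousand, which is the basis of the flat scalability claim.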

4.0.2 Limitations:
Programming model is very restrictive. Lack of central data can be frustrating.

Still rough - software under active development, e.g. HDFS only recently added support for append operations. [14]

Joins of multiple datasets are tricky and slow. Often, the entire dataset gets copied in the process.

Cluster management is hard (debugging, distributing software, collecting logs and so on).

Still a single master, which requires care and may limit scaling.

Managing job flow isn't trivial when intermediate data should be kept.

Multiple copies of already big data are created. [15]

Limited SQL support. [15]

Inefficient execution, as HDFS has no notion of a query optimizer. [15]

Lack of the skills required for handling Hadoop. [15]


5.0 Application of Hadoop


Hadoop is the leading choice of companies when it comes to managing Big Data. Some of the major applications and companies that use Hadoop are [16]:

IBM offers InfoSphere BigInsights, based on Hadoop.

Facebook uses Hadoop to store copies of internal log and dimension data sources and uses it as a source for reporting/analytics and machine learning. Currently Facebook has 2 major clusters.

eBay uses
o Heavy usage of Java Map-Reduce for search optimization and research.

Yahoo! has
o More than 100,000 CPUs in >40,000 computers running Hadoop
o Its biggest cluster: 4500 nodes (2*4cpu boxes with 4*1TB disk & 16GB RAM)

Other companies include Adobe, Amazon, The New York Times and Hewlett-Packard, and the list keeps on increasing.

6.0 Future Scope


Hadoop is growing really fast, and so is its use in industries that are gearing up for the Big Data challenge. The future of Hadoop definitely involves [9]:

Better scheduling among the nodes for better resource allocation and control of resources.

Splitting Core into sub-projects like Apache Hive, Apache Cassandra, PIG etc.

Improved API for developers in Java.

HDFS security.

Upgrading to Hadoop 1.0.


7.0 Conclusion
Apache Hadoop is 100% open source, and pioneered a fundamentally new way of storing and processing data. Instead of relying on expensive, proprietary hardware and different systems to store and process data, Hadoop enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data, and can scale without limits. With Hadoop, no data is too big. This report focuses on the various properties, challenges and opportunities of Big Data. It also discusses how we can use Apache Hadoop to manage Big Data and extract value out of it. A detailed discussion has been presented on Hadoop and its underlying technology. So in today's hyper-connected world, where more and more data is being created every day, the need for Hadoop is no longer a question. The only question now is how best to take advantage of it.

References
[1] Randal E. Bryant, Randy H. Katz, Edward D. Lazowska, "Big-Data Computing: Creating revolutionary breakthroughs in commerce, science, and society", Version 8: December 22, 2008. Available: http://www.cra.org/ccc/docs/init/Big_Data.pdf
[2] "What is Big Data", Bringing Big Data to the enterprise [Online]. Available: http://www-01.ibm.com/software/data/bigdata/
[3] "A Comprehensive List of Big Data Statistics" [Online]. Available: http://wikibon.org/blog/big-data-statistics/
[4] D.A. Bader and R. Pennington, "Cluster Computing: Applications", The International Journal of High Performance Computing, 15(2):181-185, May 2001. Available: http://www.cc.gatech.edu/~bader/papers/ijhpca.pdf
[5] K. Shirahata, "Hybrid Map Task Scheduling for GPU-Based Heterogeneous Clusters", in: Cloud Computing Technology and Science (CloudCom), Nov. 30 2010 - Dec. 3 2010, pages 733-740 [Online]. Available: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5708524
[6] "What Is Big Data?", O'Reilly Radar, January 11, 2012 [Online]. Available: http://radar.oreilly.com/2012/01/what-is-big-data.html
[7] "Big Data", Wipro [Online]. Available: http://www.slideshare.net/wiprotechnologies/wipro-infographicbig-data
[8] "Hadoop at Yahoo!", Yahoo Developer Network [Online]. Available: http://developer.yahoo.com/hadoop/
[9] Owen O'Malley, "Introduction to Hadoop" [Online]. Available: http://wiki.apache.org/hadoop/
[10] "HDFS Architecture", Hadoop 0.20 Documentation [Online]. Available: http://hadoop.apache.org/docs/r0.20.2/hdfs_design.html
[11] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, "The Hadoop Distributed File System". Available: http://storageconference.org/2010/Papers/MSST/Shvachko.pdf
[12] "Oracle and Hadoop Overview" [Online]. Available: http://www.oracle.com/technetwork/database/bi-datawarehousing/twp-hadoop-oracle-194542.pdf
[13] "Introduction to Hadoop", Dell [Online]. Available: http://i.dell.com/sites/content/business/solutions/whitepapers/en/Documents/hadoop-introduction.pdf
[14] "Yahoo! Hadoop tutorial", Yahoo Developer Network (YDN) [Online]. Available: http://developer.yahoo.com/hadoop/tutorial/module1.html#comparison
[15] "Hadoop's Limitations for Big Data Analytics", ParAccel Inc. [Online]. Available: http://www.paraccel.com/resources/Whitepapers/Hadoop-Limitations-for-Big-Data-ParAccel-Whitepaper.pdf
[16] "Hadoop Wiki, Powered By" [Online]. Available: http://wiki.apache.org/hadoop/PoweredBy#I
