
HADOOP (BIG DATA) INTRODUCTION

07/03/2014
ASML-OBIEE

BABY MANI DEEPA S


Technology
d.deepa1@tcs.com

Confidentiality Statement
Confidentiality and Non-Disclosure Notice
The information contained in this document is confidential and proprietary to TATA
Consultancy Services. This information may not be disclosed, duplicated or used for any
other purposes. The information contained in this document may not be released in
whole or in part outside TCS for any purpose without the express written permission of
TATA Consultancy Services.

Tata Code of Conduct


We, in our dealings, are self-regulated by a Code of Conduct as enshrined in the Tata
Code of Conduct. We request your support in helping us adhere to the Code in letter and
spirit. We request that any violation or potential violation of the Code by any person be
promptly brought to the notice of the Local Ethics Counselor or the Principal Ethics
Counselor or the CEO of TCS. All communication received in this regard will be treated
and kept as confidential.

Table of Contents
1. Introduction
   1.1 Hadoop Cluster
2. Big data analytics
   2.1 Predictive analytics
   2.2 Data mining
   2.3 NoSQL database
   2.4 MapReduce framework
3. Apache Hadoop framework
   3.1 Hadoop Common
   3.2 Hadoop Distributed File System (HDFS)
   3.3 Hadoop YARN (Yet Another Resource Negotiator)
   3.4 Hadoop MapReduce
4. Pig and Hive
5. Hadoop Uniqueness

1. Introduction
Big data is the term for collections of data sets so large and complex that they become difficult to process using on-hand database management tools or traditional data processing applications. When the volume of data grows enormously, big data techniques are used, since it is practically impossible to handle such volumes with a traditional RDBMS.

1.1 Hadoop Cluster


Hadoop is a software framework that supports data-intensive processes and enables applications to work with big data. Hadoop is based on the map-reduce model. Two well-known companies that use Hadoop to process their large data sets are Facebook and Yahoo. The Hadoop platform can solve problems where the required analysis is deep and complex and the data is unstructured, but the work still needs to be done in reasonable time. Apache Hadoop is an open-source software framework. Hadoop maintains and manages the data across a cluster of independent servers. An individual user cannot directly gain access to the data, as the data is divided among these servers. Additionally, a single piece of data can be replicated on multiple servers, which keeps the data available in case of a disaster or a single-machine failure.

2. Big data analytics


Big data analytics is the process of examining large volumes of varied data to uncover hidden patterns; it provides competitive advantages over rival organizations and results in business benefits. The primary goal of big data analytics is to help companies make better business decisions by enabling data scientists and other users to analyze huge volumes of transaction data as well as other data sources that may be left untapped by conventional business intelligence (BI) programs.
Big data analytics can be done with the software tools commonly used as part of advanced analytics disciplines such as:

Predictive analytics
Data mining
NoSQL databases
MapReduce framework

2.1 Predictive analytics


Predictive analytics is the branch of data mining concerned with the prediction of future probabilities and trends.
The central element of predictive analytics is the predictor, a variable that can be measured for an individual or
other entity to predict future behaviour. For example, an insurance company is likely to take into account potential
driving safety predictors such as age, gender, and driving record when issuing car insurance policies.

2.2 Data mining


Data mining is sorting through data to identify patterns and establish relationships.
Data mining parameters include:

Association - looking for patterns where one event is connected to another event.
Sequence or path analysis - looking for patterns where one event leads to a later event.
Classification - looking for new patterns (which may result in a change in the way the data is organized).
Clustering - finding and visually documenting groups of facts not previously known.
Forecasting - discovering patterns in data that can lead to reasonable predictions about the future (this area of data mining is known as predictive analytics).

2.3 NoSQL database


A NoSQL database, also called a Not Only SQL database, seeks to solve the scalability and big data performance issues that relational databases weren't designed to address. NoSQL is especially useful when an enterprise needs to access and analyse massive amounts of unstructured data, or data that is stored remotely on multiple virtual servers in the cloud.

2.4 MapReduce framework


The framework is divided into two parts:

Map - a function that parcels out work to different nodes in the distributed cluster.
Reduce - a function that collates the work and resolves the results into a single value.

The MapReduce framework is fault-tolerant because each node in the cluster is expected to report back
periodically with completed work and status updates. If a node remains silent for longer than the expected
interval, a master node makes note and re-assigns the work to other nodes.
This technology is conceptually simple but very powerful when combined with the Hadoop framework. There are two major steps:

Map
Reduce

In the Map step, the master node takes the input, divides it into smaller chunks, and distributes them to the worker nodes. In the Reduce step, the master collects all the partial solutions and returns the output as one unified answer. Both of these steps use functions that operate on key-value pairs. This process runs on the various nodes in parallel and brings faster results for the framework.
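To make the key-value flow concrete, here is a minimal word-count sketch written against the standard Hadoop Java MapReduce API; the class names (WordCount, TokenMapper, SumReducer) are illustrative rather than taken from any particular application. The map function emits a (word, 1) pair for every word in its chunk of input, and the reduce function sums the counts that arrive for each word key.

```java
// Minimal word-count sketch against the standard Hadoop MapReduce Java API.
// Class names (WordCount, TokenMapper, SumReducer) are illustrative.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map step: each worker parses its chunk of the input and emits (word, 1) pairs.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);          // key-value pair handed to the shuffle
            }
        }
    }

    // Reduce step: all counts for the same word arrive together and are folded into one value.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));   // one unified answer per key
        }
    }
}
```

Many such map tasks run in parallel across the cluster, followed by a parallel reduce phase; the Hadoop components described in the next section exist to schedule this work and feed it with data.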

3. Apache Hadoop framework


The Apache Hadoop framework is composed of the following modules:

Hadoop Common - contains libraries and utilities needed by other Hadoop modules.
Hadoop Distributed File System (HDFS) - a distributed file-system that stores data on commodity machines,
providing very high aggregate bandwidth across the cluster.
Hadoop YARN (Yet Another Resource Negotiator) - a resource-management platform responsible for
managing compute resources in clusters and using them for scheduling of users' applications.
Hadoop MapReduce - a programming model for large scale data processing.

3.1 Hadoop Common


The Hadoop Common package provides file system and OS-level abstractions, a MapReduce engine, the necessary Java Archive (JAR) files and scripts needed to start Hadoop, source code, documentation, and a contribution section that includes projects from the Hadoop community.

3.2 Hadoop Distributed File System (HDFS)


HDFS comprises interconnected clusters of nodes where files and directories reside. An HDFS cluster consists of:
A single node, known as the NameNode, which manages file system namespace operations such as opening, closing and renaming files and directories, and regulates client access to files. The NameNode also maps data blocks to DataNodes, which handle read and write requests from HDFS clients. The current design has a single NameNode for each cluster, and the NameNode keeps the entire namespace image in RAM.
The HDFS namespace is a hierarchy of files and directories. Files and directories are represented on the NameNode by inodes. Inodes record attributes like permissions, modification and access times, and namespace and disk space quotas.
Image: The inodes and the list of blocks that define the metadata of the name system are called the image.
Journal: Each client-initiated transaction is recorded in the journal.

DataNodes - a cluster can have thousands of DataNodes and tens of thousands of HDFS clients, as each DataNode may execute multiple application tasks concurrently. DataNodes store file data as blocks. The content of a file is split into large blocks (typically 128 megabytes), and each block is independently replicated on multiple DataNodes. The blocks are stored on the local file system of the DataNodes.
Each block replica on a DataNode is represented by two files in the local native file system: the first contains the data itself, and the second records the block's metadata, including checksums for the data and the generation stamp. The size of the data file equals the actual length of the block; no extra space is needed to round it up to the nominal block size as in traditional file systems. Thus, a block that is half full needs only half the space of a full block on the local drive.

Handshake: During startup each DataNode connects to the NameNode and performs a handshake.
The purpose of the handshake is to verify the namespace ID and the software version of the DataNode.
If either does not match that of the NameNode, the DataNode automatically shuts down.
Namespace ID: The namespace ID is assigned to the file system instance when it is formatted. The namespace
ID is persistently stored on all nodes of the cluster. Nodes with a different namespace ID will not be able to join
the cluster, thus protecting the integrity of the file system. A DataNode that is newly initialized and without any
namespace ID is permitted to join the cluster and receive the cluster's namespace ID.
Storage ID: After the handshake, the DataNode registers with the NameNode.
DataNodes persistently store their unique storage IDs. The storage ID is an internal identifier of the DataNode,
which makes it recognizable even if it is restarted with a different IP address or port. The storage ID is assigned
to the DataNode when it registers with the NameNode for the first time and never changes after that. DataNodes also create, delete, and replicate data blocks according to instructions from the governing NameNode.
The NameNode actively monitors the number of replicas of a block. When a replica of a block is lost due to a
DataNode failure or disk failure, the NameNode creates another replica of the block. The NameNode maintains
the namespace tree and the mapping of blocks to DataNodes, holding the entire namespace image in RAM.
The NameNode is the repository for all HDFS metadata, and user data never flows through the NameNode.
Heartbeats: A DataNode identifies block replicas in its possession to the NameNode by sending a block report.
A block report contains the block ID, the generation stamp and the length for each block replica the server
hosts. The first block report is sent immediately after the DataNode registration. Subsequent block reports are
sent every hour and provide the NameNode with an up-to-date view of where block replicas are located on the
cluster.
During normal operation DataNodes send heartbeats to the NameNode to confirm that the DataNode is
operating and the block replicas it hosts are available. The default heartbeat interval is three seconds. If the
NameNode does not receive a heartbeat from a DataNode in ten minutes the NameNode considers the
DataNode to be out of service and the block replicas hosted by that DataNode to be unavailable. The
NameNode then schedules creation of new replicas of those blocks on other DataNodes.
Heartbeats from a DataNode also carry information about total storage capacity, fraction of storage in use, and
the number of data transfers currently in progress. These statistics are used for the NameNode's block
allocation and load balancing decisions.
The NameNode does not directly send requests to DataNodes. It sends instructions to the DataNodes by
replying to heartbeats sent by those DataNodes. The instructions include commands to: replicate blocks to
other nodes, remove local block replicas, re-register and send an immediate block report, or shut down the
node.
These features ensure that Hadoop clusters are highly functional and highly available:
Rack awareness - allows a node's physical location to be considered when allocating storage and scheduling tasks.
Minimal data motion - MapReduce moves compute processes to the data on HDFS, not the other way around, so processing tasks can run on the physical node where the data resides. This significantly reduces network I/O, keeps most I/O on the local disk or within the same rack, and provides very high aggregate read/write bandwidth.
Utilities - diagnose the health of the file system and can rebalance the data across nodes.
Rollback - allows system operators to bring back the previous version of HDFS after an upgrade, in case of human or system errors.
Standby NameNode - provides redundancy and supports high availability.
Highly operable - Hadoop handles many cluster conditions that might otherwise require operator intervention, allowing a single operator to maintain a cluster of thousands of nodes.
HDFS has demonstrated scalability of up to 200 PB of storage in a single cluster of 4,500 servers, supporting close to a billion files and blocks.
The NameNode is a multithreaded system and processes requests simultaneously from multiple clients. Saving a transaction to disk becomes a bottleneck, since all other threads need to wait until the synchronous flush-and-sync procedure initiated by one of them is complete. In order to optimize this process, the NameNode batches multiple transactions. When one of the NameNode's threads initiates a flush-and-sync operation, all the transactions batched at that time are committed together. The remaining threads only need to check that their transactions have been saved and do not need to initiate a flush-and-sync operation of their own.
Secondary NameNode: The HDFS file system includes a so-called secondary NameNode, a name which misleads some people into thinking that when the primary NameNode goes offline, the secondary NameNode takes over. In fact, the secondary NameNode regularly connects with the primary NameNode and builds snapshots of the primary NameNode's directory information, which the system then saves to local or remote directories. These checkpointed images can be used to restart a failed primary NameNode without having to replay the entire journal of file-system actions and then edit the log to create an up-to-date directory structure.
Because the NameNode is the single point for storage and management of metadata, it can become a bottleneck when supporting a huge number of files, especially a large number of small files. HDFS Federation, a newer addition, aims to tackle this problem to a certain extent by allowing multiple NameNodes, each of which serves an independent portion of the file system namespace.
File reads and writes: An application adds data to HDFS by creating a new file and writing the data to it. After the file is closed, the bytes written cannot be altered or removed, except that new data can be added to the file by reopening it for append. HDFS implements a single-writer, multiple-reader model.
The HDFS client that opens a file for writing is granted a lease for the file; no other client can write to the file. The writing client periodically renews the lease by sending a heartbeat to the NameNode. When the file is closed, the lease is revoked. The lease duration is bound by a soft limit and a hard limit. Until the soft limit expires, the writer is certain of exclusive access to the file. If the soft limit expires and the client fails to close the file or renew the lease, another client can preempt the lease. If the hard limit (one hour) expires and the client has failed to renew the lease, HDFS assumes that the client has quit, automatically closes the file on behalf of the writer, and recovers the lease. The writer's lease does not prevent other clients from reading the file; a file may have many concurrent readers.
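As a hedged illustration of this single-writer, multiple-reader model, the sketch below uses the standard org.apache.hadoop.fs.FileSystem client API to create, append to, and read back a file. The path is hypothetical, and the append step assumes the cluster permits appends.

```java
// Sketch of HDFS file I/O via the FileSystem client API; the path below is hypothetical.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/example.txt");

        // Write: this client holds the lease for the file until the stream is closed.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Append: the closed file can be reopened for append, but existing bytes cannot change.
        try (FSDataOutputStream out = fs.append(file)) {
            out.write("appended line\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: readers are not blocked by the writer's lease; many can read concurrently.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```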

Replication Management: The NameNode endeavors to ensure that each block always has the intended number of replicas. The NameNode detects that a block has become under- or over-replicated when a block report from a DataNode arrives. When a block becomes over-replicated, the NameNode chooses a replica to remove. It prefers to remove a replica from the DataNode with the least amount of available disk space; the goal is to balance storage utilization across DataNodes without reducing the block's availability.
When a block becomes under-replicated, it is put in the replication priority queue. A block with only one replica has the highest priority, while a block with a number of replicas greater than two thirds of its replication factor has the lowest priority.
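From the client's point of view, the intended number of replicas is simply per-file metadata. The short sketch below (hypothetical path, standard FileSystem API) reads the current replication factor and requests a new one; the NameNode then schedules the additional replicas or removals asynchronously, as described above.

```java
// Sketch: inspect and change a file's replication factor; the path is hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/example.txt");

        short current = fs.getFileStatus(file).getReplication();   // factor stored in NameNode metadata
        System.out.println("current replication: " + current);

        fs.setReplication(file, (short) 5);   // NameNode schedules new replicas (or removals) in the background
        fs.close();
    }
}
```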
Checkpoint: The persistent record of the image stored in the NameNode's local native file system is called a checkpoint.
CheckpointNode and BackupNode: The NameNode in HDFS, in addition to its primary role of serving client requests, can alternatively execute one of two other roles, either a CheckpointNode or a BackupNode; the role is specified at node startup. The CheckpointNode periodically combines the existing checkpoint and journal to create a new checkpoint and an empty journal. A more recently introduced feature of HDFS is the BackupNode. Like a CheckpointNode, the BackupNode is capable of creating periodic checkpoints, but in addition it maintains an in-memory, up-to-date image of the file system namespace that is always synchronized with the state of the NameNode.
The BackupNode accepts the journal stream of namespace transactions from the active NameNode, saves them in a journal on its own storage directories, and applies these transactions to its own namespace image in memory. The NameNode treats the BackupNode as a journal store in the same way as it treats journal files in its storage directories. If the NameNode fails, the BackupNode's image in memory and the checkpoint on disk together form a record of the latest namespace state.

3.3 Hadoop YARN (Yet Another Resource Negotiator)


The fundamental idea of YARN is to split the two major responsibilities of the MapReduce JobTracker - resource management and job scheduling/monitoring - into separate daemons: a global ResourceManager and a per-application ApplicationMaster (AM).
The ResourceManager and the per-node slave, the NodeManager (NM), form the new, generic system for managing applications in a distributed manner.
The main components of YARN are:
ResourceManager - the ultimate authority that arbitrates resources among all the applications in the system.
ApplicationMaster - the per-application ApplicationMaster is, in effect, a framework-specific entity tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the component tasks.
NodeManager - YARN's per-node agent, which takes care of an individual compute node in a Hadoop cluster. This includes keeping up to date with the ResourceManager (RM), overseeing container life-cycle management, monitoring the resource usage (memory, CPU) of individual containers, tracking node health, and managing logs and auxiliary services that may be exploited by different YARN applications.
MapReduce is a computational paradigm in which an application is divided into self-contained units of work; each of these units of work can be executed on any node in the cluster.
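As a small, hedged illustration of the ResourceManager/NodeManager split, the sketch below uses the org.apache.hadoop.yarn.client.api.YarnClient API to ask the ResourceManager for its running applications and for the per-node resources reported by the NodeManagers; it assumes a valid yarn-site.xml is on the client's classpath.

```java
// Sketch: query the ResourceManager through the YarnClient API.
import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClusterInfo {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());   // reads yarn-site.xml for the ResourceManager address
        yarnClient.start();

        // Applications known to the ResourceManager, each driven by its own ApplicationMaster.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + " " + app.getName()
                    + " " + app.getYarnApplicationState());
        }

        // Per-node resources reported by the NodeManagers.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId() + " used=" + node.getUsed()
                    + " capacity=" + node.getCapability());
        }
        yarnClient.stop();
    }
}
```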

3.4 Hadoop MapReduce


A MapReduce job splits the input data set into independent "chunks" that are processed by map tasks in parallel. The framework sorts the map outputs, which then become the input to the reduce tasks. Job inputs and outputs are stored in the file system. The MapReduce framework and HDFS typically run on the same set of nodes, which enables the framework to schedule tasks on the nodes that contain the data.
The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per node. The master is responsible for scheduling job component tasks on the slaves, monitoring tasks, and re-executing failed tasks. The slaves execute tasks as directed by the master. Minimally, applications specify input and output locations and supply map and reduce functions through implementations of the appropriate interfaces or abstract classes. Although the Hadoop framework is implemented in Java, MapReduce applications do not have to be written in Java.
HDFS uses a master/slave architecture in which one device (the master) controls one or more other devices (the slaves). A small Hadoop cluster includes a single master and multiple worker nodes. The master node consists of a JobTracker, TaskTracker, NameNode and DataNode. A slave or worker node acts as both a DataNode and a TaskTracker, though it is possible to have data-only worker nodes and compute-only worker nodes.
The JobTracker schedules map or reduce tasks to TaskTrackers with an awareness of the data location. For example, if node A contains data (x,y,z) and node B contains data (a,b,c), the JobTracker schedules node B to perform map or reduce tasks on (a,b,c) and node A to perform map or reduce tasks on (x,y,z). This reduces the amount of traffic that goes over the network and prevents unnecessary data transfer.
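A minimal job driver, sketched below, shows how an application specifies its input and output locations and supplies its map and reduce implementations through the Job API. It reuses the hypothetical TokenMapper and SumReducer classes from the word-count sketch in section 2.4 and takes the HDFS input and output paths as command-line arguments.

```java
// Sketch of a MapReduce driver; WordCount.TokenMapper/SumReducer are the hypothetical classes from section 2.4.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenMapper.class);
        job.setCombinerClass(WordCount.SumReducer.class);   // optional local aggregation on the map side
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // job input location in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // job output location in HDFS

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The compiled job is typically packaged as a JAR and submitted with the hadoop jar command, after which the framework handles scheduling, data locality and retries.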


4. Pig and Hive


Pig is a high-level platform for creating MapReduce programs with Hadoop. Hive is a data warehouse infrastructure built on Hadoop for analysis and aggregation (summarization) of data. Both compile down to MapReduce jobs. Pig is a procedural language in which one describes the procedures to apply to the data; Hive provides an SQL-like declarative language. Yahoo uses both Pig and Hive in its Hadoop toolkit.
Apache Pig, Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating those programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. At present, Pig's infrastructure layer consists of a compiler that produces sequences of MapReduce programs, and its language layer consists of a textual language called Pig Latin, which is easy to use, optimized, and extensible.
Apache Hive, Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc
queries, and the analysis of large datasets stored in Hadoop compatible file systems. It provides a mechanism to
project structure onto this data and query the data using a SQL-like language called HiveQL. Hive also allows
traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or
inefficient to express this logic in HiveQL.
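As a hedged sketch of how HiveQL is commonly submitted from a Java program, the example below goes through the HiveServer2 JDBC driver; the connection URL, table name and query are illustrative and assume a HiveServer2 instance is reachable on the default port 10000.

```java
// Sketch: running HiveQL through the HiveServer2 JDBC driver; URL, table and query are illustrative.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement()) {

            // The HiveQL below is compiled into MapReduce jobs behind this call.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT word, COUNT(*) AS cnt FROM words GROUP BY word")) {
                while (rs.next()) {
                    System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
                }
            }
        }
    }
}
```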
Apache HCatalog, HCatalog is a metadata abstraction layer for referencing data without using the underlying file names or formats. It insulates users and scripts from how and where the data is physically stored.
Apache HBase, HBase (Hadoop Database) is a distributed, column-oriented database. HBase uses HDFS for its underlying storage. It supports both batch-style computations using MapReduce and point queries (random reads).
The main components of HBase are:
The HBase Master, which is responsible for negotiating load balancing across all RegionServers and maintaining the state of the cluster. It is not part of the actual data storage or retrieval path.
The RegionServer, which is deployed on each machine, hosts data and processes I/O requests. A short client sketch is given at the end of this section.
Apache ZooKeeper, ZooKeeper is a centralized service for maintaining configuration information, naming,
providing distributed synchronization, and providing group services which are very useful for a variety of
distributed systems. HBase is not operational without ZooKeeper.
Apache Oozie, Apache Oozie is a workflow/coordination system to manage Hadoop jobs.
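To make the HBase description above concrete, here is a hedged sketch of a write followed by a point query using the standard HBase Java client API; the table name and column family are hypothetical, and the client finds the cluster through the ZooKeeper quorum configured in hbase-site.xml.

```java
// Sketch: HBase write and point query; table "users" and column family "info" are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePointQuery {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml, including the ZooKeeper quorum
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell; the client talks directly to the RegionServer hosting the row.
            Put put = new Put(Bytes.toBytes("row-1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Deepa"));
            table.put(put);

            // Point query (random read) by row key.
            Get get = new Get(Bytes.toBytes("row-1"));
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```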


5. Hadoop Uniqueness
Hadoop enables a computing solution that is:

Scalable - new nodes can be added as needed, without changing data formats, how data is loaded, how jobs are written, or the applications on top.
Cost effective - Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all of your data.
Flexible - Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any single system can provide.
Fault tolerant - when a node is lost, the system redirects work to another location of the data and continues processing without missing a beat.


Thank You

Contact
For more information, contact d.deepa1@tcs.com

About Tata Consultancy Services (TCS)


Tata Consultancy Services is an IT services, consulting and business solutions organization that delivers real results to global business, ensuring a level of certainty no other firm can match. TCS offers a consulting-led, integrated portfolio of IT and IT-enabled infrastructure, engineering and assurance services. This is delivered through its unique Global Network Delivery Model™, recognized as the benchmark of excellence in software development. A part of the Tata Group, India's largest industrial conglomerate, TCS has a global footprint and is listed on the National Stock Exchange and Bombay Stock Exchange in India.
For more information, visit us at www.tcs.com.

IT Services
Business Solutions
Consulting
All content / information present here is the exclusive property of Tata Consultancy Services Limited (TCS). The content /
information contained here is correct at the time of publishing. No material from here may be copied, modified, reproduced,
republished, uploaded, transmitted, posted or distributed in any form without prior written permission from TCS.
Unauthorized use of the content / information appearing here may violate copyright, trademark and other applicable laws,
and could result in criminal or civil penalties. Copyright 2011 Tata Consultancy Services Limited
