07/03/2014
ASML-OBIEE
Confidentiality Statement
Confidentiality and Non-Disclosure Notice
The information contained in this document is confidential and proprietary to TATA
Consultancy Services. This information may not be disclosed, duplicated or used for any
other purposes. The information contained in this document may not be released in
whole or in part outside TCS for any purpose without the express written permission of
TATA Consultancy Services.
Table of Contents
1. Introduction
  1.1 Hadoop Cluster
2.
  2.2
  2.3
  2.4
3.
  3.2
  3.3
  3.4
4.
5. Hadoop Uniqueness
1. Introduction
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data-processing applications. When the volume of data grows enormously large, we turn to big data technologies, since it is practically impossible to handle such volumes using a traditional RDBMS.
Common big data technologies and techniques include:
Predictive analytics
Data mining
NoSQL databases
The MapReduce framework
Association - looking for patterns where one event is connected to another event.
Sequence or path analysis - looking for patterns where one event leads to another later event.
Classification - looking for new patterns (this may result in a change in the way the data is organized, but that is acceptable).
Clustering - finding and visually documenting groups of facts not previously known.
Forecasting - discovering patterns in data that can lead to reasonable predictions about the future (This area of
data mining is known as predictive analytics.)
Map - a function that parcels out work to different nodes in the distributed cluster.
Reduce - another function that collates the work and resolves the results into a single value.
The MapReduce framework is fault-tolerant because each node in the cluster is expected to report back
periodically with completed work and status updates. If a node remains silent for longer than the expected
interval, a master node makes note and re-assigns the work to other nodes.
This technology is conceptually simple but very powerful when combined with the Hadoop framework. There are two major steps:
Map
Reduce
In the Map step, the master node takes the input, divides it into smaller chunks, and distributes them to worker nodes. In the Reduce step, it collects all the partial solutions to the problem and combines them into one unified answer. Both steps use functions that rely on key-value pairs. The process runs on the various nodes in parallel, which delivers faster results for the framework.
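The two steps above can be illustrated with the classic word-count example. The following is a minimal, single-process sketch of the key-value flow, not actual Hadoop code; in a real cluster the map calls run in parallel on worker nodes.

```python
from collections import defaultdict

def map_step(chunk):
    # Map: emit a (key, value) pair for every word in the chunk.
    return [(word, 1) for word in chunk.split()]

def reduce_step(pairs):
    # Reduce: collate values by key and resolve them into totals.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# The master divides the input into smaller chunks for the workers.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = []
for chunk in chunks:  # in Hadoop, these map tasks run in parallel
    mapped.extend(map_step(chunk))

result = reduce_step(mapped)
print(result["the"])  # 3
print(result["fox"])  # 2
```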
Hadoop Common - contains libraries and utilities needed by other Hadoop modules.
Hadoop Distributed File System (HDFS) - a distributed file-system that stores data on commodity machines,
providing very high aggregate bandwidth across the cluster.
Hadoop YARN (Yet Another Resource Negotiator) - a resource-management platform responsible for
managing compute resources in clusters and using them for scheduling of users' applications.
Hadoop MapReduce - a programming model for large scale data processing.
DataNodes: a cluster can have thousands of DataNodes and tens of thousands of HDFS clients, as each DataNode may execute multiple application tasks concurrently. DataNodes store data as blocks within files. The file content is split into large blocks (typically 128 megabytes),
and each block of the file is independently replicated at multiple DataNodes. The blocks are stored on the local
file system on the DataNodes.
Each block replica on a DataNode is represented by two files in the local native file system.
The first file contains the data itself and the second file records the block's metadata
including checksums for the data and the generation stamp. The size of the data file equals the actual length of
the block and does not require extra space to round it up to the nominal block size as in traditional file systems.
Thus, if a block is half full it needs only half of the space of a full block on the local drive.
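The space accounting described above can be shown with a short calculation, assuming the 128 MB nominal block size mentioned earlier (the function name is illustrative, not an HDFS API):

```python
BLOCK_SIZE = 128 * 1024 * 1024  # nominal HDFS block size: 128 MB

def block_lengths(file_size):
    # Split a file into full blocks plus one final partial block.
    # The last block occupies only its actual length on disk; it is
    # never rounded up to the nominal block size.
    full, remainder = divmod(file_size, BLOCK_SIZE)
    return [BLOCK_SIZE] * full + ([remainder] if remainder else [])

# A 200 MB file occupies one full 128 MB block plus one 72 MB block;
# the partial block consumes only 72 MB of local storage.
sizes = block_lengths(200 * 1024 * 1024)
print([s // (1024 * 1024) for s in sizes])  # [128, 72]
```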
Handshake: During startup each DataNode connects to the NameNode and performs a handshake.
The purpose of the handshake is to verify the namespace ID and the software version of the DataNode.
If either does not match that of the NameNode, the DataNode automatically shuts down.
Namespace id: The namespace ID is assigned to the file system instance when it is formatted. The namespace
ID is persistently stored on all nodes of the cluster. Nodes with a different namespace ID will not be able to join
the cluster, thus protecting the integrity of the file system. A DataNode that is newly initialized and without any
namespace ID is permitted to join the cluster and receive the cluster's namespace ID.
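The handshake rule above reduces to a simple check, sketched here under the assumption that nodes expose their namespace ID and software version (the dictionaries and function name are illustrative, not Hadoop APIs):

```python
def handshake_ok(datanode, namenode):
    # During startup each DataNode performs a handshake that verifies the
    # namespace ID and software version; on a mismatch it shuts down.
    # A newly initialized DataNode (no namespace ID yet) may still join
    # and will adopt the cluster's namespace ID.
    if (datanode["namespace_id"] is not None
            and datanode["namespace_id"] != namenode["namespace_id"]):
        return False
    return datanode["version"] == namenode["version"]

nn = {"namespace_id": 42, "version": "2.7"}
print(handshake_ok({"namespace_id": 42, "version": "2.7"}, nn))    # True
print(handshake_ok({"namespace_id": 7, "version": "2.7"}, nn))     # False
print(handshake_ok({"namespace_id": None, "version": "2.7"}, nn))  # True
```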
Storage id: After the handshake the DataNode registers with the NameNode.
DataNodes persistently store their unique storage IDs. The storage ID is an internal identifier of the DataNode,
which makes it recognizable even if it is restarted with a different IP address or port. The storage ID is assigned
to the DataNode when it registers with the NameNode for the first time and never changes after that. DataNodes also create, delete, and replicate data blocks according to instructions from the governing NameNode.
The NameNode actively monitors the number of replicas of a block. When a replica of a block is lost due to a
DataNode failure or disk failure, the NameNode creates another replica of the block. The NameNode maintains
the namespace tree and the mapping of blocks to DataNodes, holding the entire namespace image in RAM.
The NameNode is the repository for all HDFS metadata, and user data never flows through the NameNode.
Block reports and heartbeats: A DataNode identifies the block replicas in its possession to the NameNode by sending a block report.
A block report contains the block ID, the generation stamp and the length for each block replica the server
hosts. The first block report is sent immediately after the DataNode registration. Subsequent block reports are
sent every hour and provide the NameNode with an up-to-date view of where block replicas are located on the
cluster.
During normal operation DataNodes send heartbeats to the NameNode to confirm that the DataNode is
operating and the block replicas it hosts are available. The default heartbeat interval is three seconds. If the
NameNode does not receive a heartbeat from a DataNode in ten minutes the NameNode considers the
DataNode to be out of service and the block replicas hosted by that DataNode to be unavailable. The
NameNode then schedules creation of new replicas of those blocks on other DataNodes.
Heartbeats from a DataNode also carry information about total storage capacity, fraction of storage in use, and
the number of data transfers currently in progress. These statistics are used for the NameNode's block
allocation and load balancing decisions.
The NameNode does not directly send requests to DataNodes. It sends instructions to the DataNodes by
replying to heartbeats sent by those DataNodes. The instructions include commands to: replicate blocks to
other nodes, remove local block replicas, re-register and send an immediate block report, or shut down the
node.
These specific features ensure that the Hadoop clusters are highly functional and highly available:
Rack awareness allows consideration of a node's physical location when allocating storage and scheduling tasks.
Minimal data motion. MapReduce moves compute processes to the data on HDFS and not the other way
around. Processing tasks can occur on the physical node where the data resides.
This significantly reduces the network I/O patterns and keeps most of the I/O on the local disk or within the
same rack and provides very high aggregate read/write bandwidth.
Utilities diagnose the health of the file system and can rebalance data across nodes.
Rollback allows system operators to bring back the previous version of HDFS after an upgrade, in case of
human or system errors
Standby NameNode provides redundancy and supports high availability
Highly operable. Hadoop handles many types of cluster failures that might otherwise require operator intervention. This design allows a single operator to maintain a cluster of thousands of nodes.
HDFS has demonstrated scalability of up to 200 PB of storage and a single cluster of 4500 servers,
supporting close to a billion files and blocks.
The NameNode is a multithreaded system and processes requests simultaneously from multiple clients.
Saving a transaction to disk becomes a bottleneck since all other threads need to wait until the synchronous
flush-and-sync procedure
initiated by one of them is complete. In order to optimize this process, the NameNode batches multiple
transactions.
When one of the NameNode's threads initiates a flush-and-sync operation, all the transactions batched at that
time are committed together.
Remaining threads only need to check that their transactions
have been saved and do not need to initiate a flush-and-sync operation.
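The batching described above resembles the classic "group commit" pattern. The following is a minimal sketch of the idea using a lock and a shared flush watermark; the class and method names are illustrative, not NameNode internals:

```python
import threading

class BatchedJournal:
    # Group commit: one thread flushes all transactions batched so far;
    # the remaining threads only check that their transaction is already
    # covered by an earlier flush and skip the expensive sync.
    def __init__(self):
        self.lock = threading.Lock()
        self.next_id = 0   # last transaction id handed out
        self.flushed = 0   # highest transaction id durably on disk

    def log(self):
        # Record a transaction and return its id.
        with self.lock:
            self.next_id += 1
            return self.next_id

    def sync(self, txid):
        # Returns True if this call performed the flush-and-sync,
        # False if an earlier flush already made txid durable.
        with self.lock:
            if txid <= self.flushed:
                return False
            # Simulated flush-and-sync: commit everything batched so far.
            self.flushed = self.next_id
            return True

journal = BatchedJournal()
a, b, c = journal.log(), journal.log(), journal.log()
print(journal.sync(c))  # True: flushes transactions 1..3 together
print(journal.sync(a))  # False: already durable, no second flush needed
```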
Secondary NameNode: The HDFS file system includes a so-called secondary NameNode, which misleads some people into thinking that when the primary NameNode goes offline, the secondary NameNode takes over. In fact, the secondary NameNode regularly connects with the primary NameNode and builds snapshots of the primary NameNode's directory information, which the system then saves to local or remote directories. These checkpointed images can be used to restart a failed primary NameNode without replaying the entire journal of file-system actions and then editing the log to create an up-to-date directory structure.
Because the NameNode is the single point for storage and management of metadata, it can become a bottleneck when supporting a huge number of files, especially a large number of small files. HDFS Federation, a newer addition, aims to tackle this problem to a certain extent by allowing multiple independent NameNodes, each serving its own namespace, to share the cluster's storage.
File read and writes: An application adds data to HDFS by creating a new file and writing the data to it. After the
file is closed, the bytes written cannot be altered or removed except that new data can be added to the file by
reopening the file for append. HDFS implements a single-writer, multiple-reader model.
The HDFS client that opens a file for writing is granted a lease for the file; no other client can write to the file. The
writing client periodically renews the lease by sending a heartbeat to the NameNode. When the file is closed, the
lease is revoked. The lease duration is bound by a soft limit and a hard limit. Until the soft limit expires, the writer
is certain of exclusive access to the file. If the soft limit expires and the client fails to close the file or renew the
lease, another client can preempt the lease. If the hard limit (one hour) expires and the client has failed to renew the lease, HDFS assumes that the client has quit, automatically closes the file on behalf of the writer, and recovers the lease. The writer's lease does not prevent other clients from reading the file; a file may have many
concurrent readers.
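The two lease limits can be captured in a small state function. The one-hour hard limit comes from the text; the soft limit value below is an assumption for illustration, and the function name is not an HDFS API:

```python
SOFT_LIMIT = 60.0    # seconds; illustrative soft-limit value (assumption)
HARD_LIMIT = 3600.0  # one hour, per the text

def lease_state(last_renewal, now):
    # Until the soft limit expires, the writer has exclusive access.
    # After the soft limit, another client may preempt the lease.
    # After the hard limit, HDFS closes the file on the writer's
    # behalf and recovers the lease.
    age = now - last_renewal
    if age <= SOFT_LIMIT:
        return "exclusive"
    if age <= HARD_LIMIT:
        return "preemptible"
    return "recovered"

print(lease_state(0.0, 30.0))    # exclusive
print(lease_state(0.0, 600.0))   # preemptible
print(lease_state(0.0, 4000.0))  # recovered
```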
Replication Management: The NameNode endeavors to ensure that each block always has the intended number
of replicas. The NameNode detects that a block has become under- or over-replicated
when a block report from a DataNode arrives. When a block becomes over-replicated, the NameNode chooses a replica to remove. The NameNode prefers to remove a replica from the DataNode with the least amount of available disk space. The goal is to balance storage utilization across DataNodes without reducing the block's availability. When a block becomes under-replicated, it is put in the replication priority queue. A block with only one replica
has the highest priority, while a block with a number of replicas that is greater than two thirds of its replication
factor has the lowest priority.
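The priority rule above can be expressed as a small function. This is a three-level sketch of the scheme described in the text (lower number = higher priority); the function name and exact tiering are illustrative:

```python
def replication_priority(live_replicas, replication_factor):
    # A block with a single remaining replica is most urgent (priority 0).
    # A block with more than two thirds of its target replica count is
    # least urgent (priority 2). Everything else sits in between.
    if live_replicas == 1:
        return 0
    if live_replicas > (2 * replication_factor) / 3:
        return 2
    return 1

# With the default replication factor of 3:
print(replication_priority(1, 3))  # 0: one replica left, highest priority
print(replication_priority(2, 3))  # 1: under-replicated but not critical
print(replication_priority(3, 3))  # 2: above two thirds, lowest priority
```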
checkpoint:
The persistent record of the image stored in the NameNode's local native filesystem is called a checkpoint.
Checkpoint node:
The NameNode in HDFS, in addition to its primary role of serving client requests, can alternatively execute one of two other roles, either a CheckpointNode or a BackupNode. The role is specified at node startup.
The CheckpointNode periodically combines the existing checkpoint and journal to create a new checkpoint and an empty journal. A recently introduced feature of HDFS is the BackupNode. Like a CheckpointNode, the
BackupNode is capable of creating periodic checkpoints, but in addition it maintains an in-memory, up-to-date
image of the filesystem namespace that is always synchronized with the state of the NameNode.
The BackupNode accepts the journal stream of namespace transactions from the active NameNode, saves them
in journal on its own storage directories, and applies these transactions to its own namespace image in memory.
The NameNode treats the BackupNode as a journal store the same way as it treats journal files in its storage
directories. If the NameNode fails, the BackupNode's in-memory image and the checkpoint on disk provide a record of the latest namespace state.
monitoring resource usage (memory, CPU) of individual containers, tracking node health, managing logs, and running auxiliary services that may be exploited by different YARN applications. MapReduce is a computational paradigm in which an application is divided into self-contained units of work. Each of these units of work can be executed on any node in the cluster.
5. Hadoop Uniqueness
Hadoop enables a computing solution that is:
Scalable - New nodes can be added as needed and added without needing to change data formats, how data
is loaded, how jobs are written, or the applications on top.
Cost effective - Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.
Flexible - Hadoop is schema-less, and can absorb any type of data, structured or not, from any number of
sources. Data from multiple sources can be joined and aggregated in arbitrary ways enabling deeper analyses
than any one system can provide.
Fault tolerant - When you lose a node, the system redirects work to another replica of the data and continues processing without missing a beat.
Thank You
Contact
For more information, contact d.deepa1@tcs.com (Email Id of ISU)
IT Services
Business Solutions
Consulting
All content / information present here is the exclusive property of Tata Consultancy Services Limited (TCS). The content /
information contained here is correct at the time of publishing. No material from here may be copied, modified, reproduced,
republished, uploaded, transmitted, posted or distributed in any form without prior written permission from TCS.
Unauthorized use of the content / information appearing here may violate copyright, trademark and other applicable laws,
and could result in criminal or civil penalties. Copyright 2011 Tata Consultancy Services Limited