In the traditional system, storing and retrieving large volumes of data had three major issues:

• Cost: HDFS offers zero licensing and support costs.
• Speed: search and analysis are time-consuming. Hadoop clusters can read/write terabytes of data per second.
• Reliability: HDFS copies the data multiple times.
HDFS is a distributed file system that provides access to data across Hadoop clusters.
It manages and supports analysis of very large volumes of Big Data.
Characteristics of HDFS
HDFS is economical
How Does HDFS Work?
EXAMPLE

A patron gifts his books to a college library. The librarian decides to arrange the books on a small rack. The librarian then distributes multiple copies of each book on other racks based on the category.
Similarly, HDFS creates multiple copies of a data block and stores them in separate systems for easy access.
HDFS Storage

[Diagram: a very large data file is split into blocks B1–B4, which are distributed across Nodes A through E; the NameNode holds the metadata.]

• HDFS stores files in a number of blocks.
• Each block is replicated to a few separate computers.
• Metadata keeps information about each block and its replication. It is stored in the NameNode.
HDFS and YARN
Topic 2—HDFS Architecture and Components
HDFS Architecture

It is also known as the master and slave architecture.

[Diagram: the NameNode (master) maintains the file system metadata, for example File.txt = A,C with block locations DN1: A,C; DN2: A,C; DN3: A,C. The DataNodes are the slaves.]

• The NameNode is responsible for accepting jobs from the clients and maintains the file system metadata.
• A file is split into one or more blocks, which are stored and replicated in the slave nodes.
• Data required for an operation is loaded and segregated into chunks of data blocks.
HDFS Components
• NameNode
• Secondary NameNode
• File System
• Metadata
• DataNode
HDFS Components

NAMENODE AND SECONDARY NAMENODE

The Secondary NameNode server is responsible for maintaining a backup of the NameNode server. It maintains the edit log and namespace image information in sync with the NameNode server.

There can be only one Secondary NameNode server in a cluster. It cannot be treated as a disaster-recovery server; it only partially restores the NameNode server in case of failure.
HDFS Components

FILE SYSTEM

[Diagram: the HDFS namespace is a directory tree rooted at /, containing /Dir 1.1, File A, /Dir 2.1, and File B.]
HDFS Components

METADATA

Metadata keeps information about the file system and its blocks. It is held by the NameNode.
HDFS Components

DATANODE

[Diagram: a client sends a data block, which is stored and replicated across DataNode 1 through DataNode 5 on Rack 1.]
Data Block Split
Data block split is an important process of HDFS architecture. Each file is split into one or more blocks
and the blocks are stored and replicated in DataNodes.
Block replication refers to creating copies of a block in multiple DataNodes. Usually, the data is split into parts, such as part-0 and part-1.

[Diagram: the NameNode/JobTracker assigns Job 1 to DataNode server 1, which holds blocks B1–B3; the blocks are replicated to DataNode server 2 so that Job 1 can be resubmitted there if needed.]
Replication Method
• Each file is split into a sequence of blocks (of the same size, except the last one).
• Blocks are replicated for fault-tolerance.
• The block replication factor is usually configured at the cluster level (can also be done at the file level).
• The NameNode receives a heartbeat and a block report from each DataNode in the cluster.
• A block report lists the blocks on a DataNode.
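The split-and-replicate scheme above can be sketched as a small simulation. This is illustrative Python, not HDFS code; the block size matches the common HDFS default, but the node names and the round-robin placement are invented for the example.

```python
# Illustrative simulation of the HDFS split-and-replicate scheme described
# above; it is not HDFS code. Node names and placement are arbitrary.
BLOCK_SIZE = 128  # MB, a common HDFS default block size

def split_into_blocks(file_size_mb, block_size=BLOCK_SIZE):
    """Split a file into equal-size blocks; only the last one may be smaller."""
    sizes = []
    remaining = file_size_mb
    while remaining > 0:
        sizes.append(min(block_size, remaining))
        remaining -= block_size
    return sizes

def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes (round-robin)."""
    placement = {}
    for b in range(num_blocks):
        placement[f"B{b + 1}"] = [
            datanodes[(b + r) % len(datanodes)] for r in range(replication)
        ]
    return placement

blocks = split_into_blocks(300)  # -> [128, 128, 44]
layout = place_replicas(len(blocks), ["NodeA", "NodeB", "NodeC", "NodeD"])
```

Note how every block except the last has the same size, matching the first bullet above, and every block ends up on three distinct nodes.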
What Is a Rack?

• In large Hadoop clusters, to improve network traffic while reading/writing HDFS files, the NameNode chooses DataNodes that are on the same rack or a nearby rack to serve the read/write request.
• The NameNode obtains this rack information by maintaining the rack IDs of each DataNode.
• This concept of choosing closer DataNodes based on rack information is called Rack Awareness in Hadoop.
Replication and Rack Awareness in Hadoop

The topology of the replicas is critical to ensure the reliability of HDFS. Usually, data is replicated three times. The suggested replication topology is as follows:

[Diagram: block B1 is stored on node R1N3 in Rack 1 and replicated to nodes R2N1 and R2N3 in Rack 2.]
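The suggested topology (one replica on the writer's rack, two on a single remote rack) can be sketched as follows. The rack and node names are illustrative, and this is only the simple three-replica case; HDFS's real placement policy handles more situations.

```python
# Sketch of the rack-aware placement suggested above: with replication 3,
# the first replica goes on the local rack and the other two on a single
# remote rack. Illustrative only; not HDFS's actual implementation.
def place_block(local_rack, racks):
    """racks: dict rack_id -> list of node names. Returns 3 (rack, node) pairs."""
    first = (local_rack, racks[local_rack][0])
    remote_rack = next(r for r in racks if r != local_rack)
    second = (remote_rack, racks[remote_rack][0])
    third = (remote_rack, racks[remote_rack][1])
    return [first, second, third]

racks = {"rack1": ["R1N1", "R1N2", "R1N3"], "rack2": ["R2N1", "R2N2", "R2N3"]}
replicas = place_block("rack1", racks)
# Two of the three replicas share the remote rack, so only two racks are used.
```

This placement balances safety (the block survives the loss of a whole rack) against write cost (only one cross-rack transfer from the writer's rack).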
Replication and Rack Awareness: Example

The diagram illustrates a Hadoop cluster with three racks. Each rack contains multiple nodes. R1N1 represents Node 1 on Rack 1, and so on. The NameNode decides which DataNode belongs to which rack.

[Diagram: the NameNode places blocks B1, B2, and B3 across the nodes of the three racks.]
Implementing Rack Awareness in Hadoop

2. Create a topology.py script file: topology scripts are used by Hadoop to determine the rack location of nodes. The file is present in /etc/hadoop/conf/topology.sh
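A minimal sketch of what such a topology script might look like: Hadoop invokes the script with one or more DataNode addresses as arguments and expects one rack path per line. The IP-to-rack mapping below is hypothetical; a real script would cover every node in the cluster.

```python
#!/usr/bin/env python
# Minimal sketch of a Hadoop topology script. Hadoop calls it with DataNode
# IPs/hostnames as arguments and reads one rack path per input back.
# The mapping below is hypothetical and for illustration only.
import sys

RACK_MAP = {
    "10.0.1.11": "/rack1",
    "10.0.1.12": "/rack1",
    "10.0.2.11": "/rack2",
}
DEFAULT_RACK = "/default-rack"

def resolve(nodes):
    """Return the rack path for each node, falling back to the default rack."""
    return [RACK_MAP.get(node, DEFAULT_RACK) for node in nodes]

if __name__ == "__main__":
    print("\n".join(resolve(sys.argv[1:])))
```

Unknown nodes fall back to /default-rack, which is also what Hadoop assumes when no topology script is configured at all.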
Anatomy of File Write

Step 4: Similarly, the second DataNode stores the packet and forwards it to the third (and last) DataNode in the pipeline.

Step 6: When the client has finished writing data, it calls close() on the stream.
Anatomy of File Read

Step 1: First, the client opens the file by calling the open() method on the FileSystem object.

Step 3: The client then calls read() on the DFSInputStream, which connects to the first (closest) DataNode for the first block in the file.

Step 4: Data is streamed from the DataNode back to the client, which calls read() repeatedly on the stream.

Step 5: When the end of the block is reached, the DFSInputStream closes the connection to the DataNode and finds the best DataNode for the next block.

Step 6: Blocks are read in order. When the client has finished reading, it calls close() on the FSDataInputStream.
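The read path above can be mimicked with a toy model: ask the "NameNode" for block locations once, then stream each block directly from a DataNode. All class names and the block map here are invented for illustration; the real HDFS client is a Java API with a different shape.

```python
# Toy model of the HDFS read path described above. Illustrative only;
# these classes do not correspond to the real HDFS client API.
class ToyNameNode:
    def __init__(self, block_locations):
        # file path -> ordered list of (block_id, [datanodes holding it])
        self.block_locations = block_locations

    def get_block_locations(self, path):
        return self.block_locations[path]

class ToyClient:
    def __init__(self, namenode, datanodes):
        self.namenode = namenode
        self.datanodes = datanodes  # node name -> {block_id: bytes}

    def read(self, path):
        """Read blocks in order, contacting the closest DataNode for each."""
        data = b""
        for block_id, holders in self.namenode.get_block_locations(path):
            node = holders[0]  # stand-in for "closest DataNode"
            data += self.datanodes[node][block_id]
        return data

nn = ToyNameNode({"/f": [("B1", ["NodeA"]), ("B2", ["NodeB"])]})
client = ToyClient(nn, {"NodeA": {"B1": b"hello "}, "NodeB": {"B2": b"world"}})
assert client.read("/f") == b"hello world"
```

The key property the toy model preserves is that the NameNode only hands out locations; the data itself never flows through it.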
HDFS Access

HDFS can be accessed through a Web GUI, utilized through an HTTP browser, or through the FS shell, which executes commands on HDFS.

HDFS Command Line

Get a directory listing of the user's home directory in HDFS:
$ hdfs dfs -ls

In this demonstration, you will view how to run a few basic command lines of HDFS. You can also execute the commands in your lab environment for practice.
Hue File Browser
The file browser in Hue helps you view and manage HDFS directories and files.
Demo
HDFS Access using Hue
In this demonstration, you will see how to access HDFS using Hue. You will also learn how to view and manage
your HDFS directories and files using Hue. You can also execute it in your lab environment for practice.
HDFS and YARN
Topic 3—Introduction to YARN (Yet Another Resource Negotiator)
What Is YARN: Case Study
Yahoo was the first company to embrace Hadoop and became a trendsetter within the Hadoop ecosystem.
In late 2012, Yahoo struggled to handle iterative and stream processing of data on the Hadoop infrastructure
due to MapReduce limitations.
Both iterative and stream processing were important for Yahoo in facilitating its move from batch computing
to continuous computing.
This move helped Yahoo handle more than 100 billion events (clicks, impressions, email content, metadata, etc.) per day.
What Is YARN?
YARN is a resource manager created by separating the processing engine from the resource-management function of MapReduce. It monitors and manages workloads, maintains a multi-tenant environment, manages the high-availability features of Hadoop, and implements security controls.
Advantages of Using YARN
• Lower operational costs
The YARN infrastructure is responsible for providing computational resources for application executions.

[Diagram: in Hadoop 2.7, MapReduce and other data-processing frameworks run as applications on top of YARN, which provides cluster resource management over the distributed file system.]

Components of YARN Architecture

RESOURCE MANAGER

The Resource Manager mediates the available resources in the cluster among competing applications and ensures maximum cluster utilization.

[Diagram: the Resource Manager hosts the ApplicationsManager and the Scheduler; Node Managers and Application Masters run on the worker nodes.]
Components of YARN Architecture

RESOURCE MANAGER COMPONENT: SCHEDULER

It includes a pluggable scheduler called the YarnScheduler, which allows different policies for managing constraints, such as capacity, fairness, and Service-Level Agreements.
RESOURCE MANAGER COMPONENT: APPLICATIONSMANAGER

The ApplicationsManager accepts job submissions, negotiates the first container for executing the application, and restarts the ApplicationMaster container on failure.
Components of YARN Architecture

RESOURCE MANAGER: INTERNAL COMPONENTS

[Diagram: the Resource Manager's internal components include the ClientService, the YARN Scheduler, the NodesListManager, and the ContainerAllocationExpirer.]
Components of YARN Architecture

HOW RESOURCE MANAGER OPERATES

• The Resource Manager communicates with the clients through an interface called the ClientService.
• Administrative requests are served by a separate interface called the AdminService. Operators can get updated information about the cluster operation using this interface.
• In parallel, the ResourceTrackerService receives node heartbeats from the Node Manager to track new or decommissioned nodes.
• The NMLivelinessMonitor and NodesListManager keep an updated status of which nodes are healthy so that the Scheduler and the ResourceTrackerService can allocate work appropriately.
• The ApplicationMasterService manages Application Masters on all nodes, keeping the Scheduler informed.
• The AMLivelinessMonitor keeps a list of Application Masters and their last heartbeat times to let the Resource Manager know which applications are healthy on the cluster.
• Any Application Master that does not send a heartbeat within a certain interval is marked as dead and rescheduled to run on a new container.
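The liveness-monitoring idea in the last two bullets can be sketched with a few lines of Python. The function name, the timestamps, and the timeout value are illustrative; they are not YARN internals.

```python
# Toy sketch of heartbeat-based liveness monitoring as described above:
# any Application Master whose last heartbeat is older than the timeout
# is marked as dead. All names and values here are illustrative.
def find_dead_masters(last_heartbeats, now, timeout=600):
    """last_heartbeats: dict am_id -> time of last heartbeat (seconds)."""
    return sorted(am for am, t in last_heartbeats.items() if now - t > timeout)

beats = {"am-001": 1000, "am-002": 1550, "am-003": 400}
dead = find_dead_masters(beats, now=1600)  # -> ["am-003"]
```

In the real system, each Application Master returned by such a check would then be rescheduled into a new container.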
Components of YARN Architecture

RESOURCE MANAGER: HIGH AVAILABILITY MODE

Before Hadoop 2.4, the Resource Manager was the single point of failure in a YARN cluster. The High Availability (HA) feature adds an Active/Standby Resource Manager pair to remove this single point of failure.
Components of YARN Architecture

NODE MANAGER

Each Node Manager takes instructions from the Resource Manager and reports and handles containers on a single node.

When a container is leased to an application, the Node Manager sets up the container's environment, including the resource constraints specified in the lease and any dependencies.

The Node Manager runs on each node and manages the following:
• Container lifecycle management
• Container dependencies
• Container leases
CONTAINER

A YARN container is the result of a successful resource allocation; that is, the RM has granted an application a lease to use specified resources on a specific node.

To launch the container, the Application Master must provide a container launch context (CLC) that includes the following information:
• Environment variables
• Dependencies, that is, local resources such as data files or shared objects needed prior to launch

APPLICATION MASTER

The Application Master in YARN is a framework-specific library that negotiates resources from the RM and works with the Node Manager or managers to execute and monitor containers and their resource consumption.
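As a sketch, the information carried in a container launch context could be modeled like this. The field names and example values are invented for illustration; YARN's real ContainerLaunchContext is a Java API with more fields.

```python
# Illustrative model of a container launch context (CLC) as described above.
# Field names are invented; this is not YARN's real ContainerLaunchContext.
from dataclasses import dataclass, field

@dataclass
class ContainerLaunchContext:
    environment: dict = field(default_factory=dict)      # environment variables
    local_resources: list = field(default_factory=list)  # data files, shared objects

clc = ContainerLaunchContext(
    environment={"JAVA_HOME": "/usr/lib/jvm/default"},
    local_resources=["app.jar", "config.xml"],
)
```

The point of the structure is that everything a container needs before launch (environment plus local resources) travels with the allocation request.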
Components of YARN Architecture

APPLICATION MASTER: USES

Application Masters exist for the many frameworks that run on YARN:
• Batch (MapReduce)
• Interactive (Tez)
• Online (HBase)
• Streaming (Storm, S4, …)
• Graph (Giraph)
• In-memory (Spark)
• HPC MPI (OpenMPI)
• Other (Search, Weave, …)

YARN provides cluster resource management on top of the distributed file system.
Running an Application Through YARN

Step 1: The client submits an application to the Resource Manager. Users submit applications to the Resource Manager by typing the hadoop jar command. The Resource Manager maintains the list of applications on the cluster and of the available resources on the Node Managers, and determines the next application that receives a portion of the cluster resources.

Step 2: The Resource Manager allocates a container. When the Resource Manager accepts a new application submission, the Scheduler first selects a container. Next, the Application Master is started in it and is responsible for the entire life cycle of the application.

Step 3: The Application Master contacts the related Node Manager, and the Node Manager launches the container.

Step 4: The container executes the Application Master. The Node Manager does not monitor tasks; it only monitors the resource usage in the containers. The Application Master negotiates containers to launch all of the tasks needed to complete the application. After the application is complete, the Application Master shuts itself down and releases its own container.

[Diagram: the client runs $ my-Hadoop-app; the Resource Manager allocates containers on Node Manager/DataNode hosts ("Here are your containers"), and the Application Master reports "I'm done!" when the application finishes.]
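The four steps above can be strung together as a toy simulation. Every class and method here is invented for illustration; none of them correspond to real YARN APIs.

```python
# Toy walk-through of the YARN submission flow described above. All classes
# are invented for illustration and do not match real YARN APIs.
class ToyNodeManager:
    def __init__(self):
        self.containers = []

    def launch_container(self, payload):
        # Step 3: the Node Manager launches the container. It tracks only
        # resource usage, not the tasks running inside.
        self.containers.append(payload)
        return payload

class ToyResourceManager:
    def __init__(self, node_managers):
        self.node_managers = node_managers

    def submit(self, app):
        # Steps 1-2: accept the submission, pick a container on a node,
        # and start the Application Master in it.
        nm = self.node_managers[0]
        return nm.launch_container(f"AM for {app}")

nms = [ToyNodeManager(), ToyNodeManager()]
rm = ToyResourceManager(nms)
am = rm.submit("my-Hadoop-app")
# Step 4: the Application Master would now negotiate task containers and
# release its own container once the application completes.
```

Even in this toy form, the division of labor is visible: the Resource Manager decides where things run, while the Node Manager does the launching.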
Tools for YARN Development

YARN can be accessed using three tools: the YARN Web UI, the Hue Job Browser, and the YARN command line.

HUE JOB BROWSER

The Hue Job Browser allows you to monitor the status of a job, kill a running job, and view logs.

YARN COMMAND LINE

Most of the YARN commands are for administrators rather than developers. A few useful commands:
• yarn -help: lists all YARN commands

In this demonstration, you will learn how to use the YARN Web UI, the Hue Job Browser, and the YARN command line. You can also execute the commands in your lab environment for practice.
Key Takeaways
HDFS chunks data into blocks and distributes them across the cluster.
Slave nodes run DataNode daemons that are managed by a single NameNode.
YARN works with HDFS to run tasks where the data is stored.
YARN executes jobs that can be monitored using Hue, the YARN Web UI, or the YARN command line.
Quiz

QUIZ 1
How is NameNode failure in non-HA mode tackled?
QUIZ 2
Which of the following statements best describes how a large (100 GB) file is stored in HDFS?

a. The file is replicated three times by default. Each copy of the file is stored on a separate DataNode.
b. The master copy of the file is stored on a single DataNode. The replica copies are divided into fixed-size blocks, which are stored on multiple DataNodes.
c. The file is divided into fixed-size blocks, which are stored on multiple DataNodes. Each block is replicated three times by default. Multiple blocks from the same file might reside on the same DataNode.
d. The file is divided into fixed-size blocks, which are stored on multiple DataNodes. Each block is replicated three times by default. HDFS guarantees that different blocks from the same file are never on the same DataNode.
QUIZ 3
Which of the following describes how a client reads a file from HDFS?

a. The client queries the NameNode for the block location(s). The NameNode returns the block location(s) to the client. The client reads the data directly off the DataNode(s).
b. The client queries all DataNodes in parallel. The DataNode that contains the requested data responds directly to the client. The client reads the data directly off the DataNode.
d. The NameNode contacts the DataNode that holds the requested data block. Data is transferred from the DataNode to the NameNode and then from the NameNode to the client.
QUIZ 4
Which of the following is/are valid statements?

a. HDFS is optimized for storing a large number of files smaller than the HDFS block size.
c. HDFS is a distributed file system that replaces ext3 or ext4 on Linux nodes in a Hadoop cluster.
d. HDFS is a distributed file system that runs on top of native OS file systems and is well-suited for storage of very large datasets.