In the traditional system, storing and retrieving large volumes of data had three major issues:

• Cost: HDFS offers zero licensing and support costs.
• Speed: search and analysis are time-consuming. Hadoop clusters can read/write terabytes of data per second.
• Reliability: HDFS copies the data multiple times.
HDFS is a distributed file system that provides access to data across Hadoop clusters.
It manages and supports analysis of very large volumes of Big Data.
Characteristics of HDFS
HDFS is economical
How Does HDFS Work?
EXAMPLE

A patron gifts his books to a college library. The librarian decides to arrange the books on a small rack. The librarian then distributes multiple copies of each book on other racks based on the category.
Similarly, HDFS creates multiple copies of a data block and stores them in separate systems for easy access.
HDFS Storage

[Diagram: a very large data file is split into blocks B1–B4, which are distributed across Nodes A through E; the NameNode holds the metadata.]

• HDFS stores files in a number of blocks.
• Each block is replicated to a few separate computers.
• Metadata keeps information about each block and its replication. It is stored in the NameNode.
HDFS and YARN
Topic 2—HDFS Architecture and Components
HDFS Architecture

It is also known as the master and slave architecture.

[Diagram: the NameNode (master) maintains the file system metadata, for example File.txt = A,C with block locations DN1: A,C; DN2: A,C; DN3: A,C. The DataNodes are the slaves.]

• The NameNode is responsible for accepting jobs from the clients and maintains the file system metadata.
• A file is split into one or more blocks, which are stored and replicated in the slave nodes.
• Data required for an operation is loaded and segregated into chunks of data blocks.
HDFS Components
• NameNode
• Secondary NameNode
• File System
• Metadata
• DataNode
HDFS Components

NAMENODE AND SECONDARY NAMENODE

The Secondary NameNode server is responsible for maintaining a backup of the NameNode server. It maintains the edit log and namespace image information in sync with the NameNode server.

There can be only one Secondary NameNode server in a cluster. It cannot be treated as a disaster-recovery server; it only partially restores the NameNode server in case of failure.
HDFS Components

FILE SYSTEM

[Diagram: the HDFS namespace is a directory tree rooted at /, containing /Dir 1.1, File A, /Dir 2.1, and File B.]
HDFS Components

METADATA

Metadata keeps information about the file system and its blocks. It is held by the NameNode.
HDFS Components

DATANODE

[Diagram: a client sends a data block, which is stored and replicated across DataNode 1 through DataNode 5 on Rack 1.]
Data Block Split
Data block split is an important process of HDFS architecture. Each file is split into one or more blocks
and the blocks are stored and replicated in DataNodes.
Block replication refers to creating copies of a block in multiple DataNodes. Usually, the data is split into parts, such as part-0 and part-1.

[Diagram: the NameNode/JobTracker assigns Job 1 to DataNode server 1, which holds blocks B1–B3; the blocks are replicated to DataNode server 2 so that Job 1 can be resubmitted there if needed.]
Replication Method
• Each file is split into a sequence of blocks (of the same size, except the last one).
• Blocks are replicated for fault-tolerance.
• The block replication factor is usually configured at the cluster level (can also be done at the file level).
• The NameNode receives a heartbeat and a block report from each DataNode in the cluster.
• A block report lists the blocks on a DataNode.
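The split-and-replicate scheme above can be sketched as a small simulation. This is illustrative Python, not HDFS code; the block size matches the common HDFS default, but the node names and the round-robin placement are invented for the example.

```python
# Illustrative simulation of the HDFS split-and-replicate scheme described
# above; it is not HDFS code. Node names and placement are arbitrary.
BLOCK_SIZE = 128  # MB, a common HDFS default block size

def split_into_blocks(file_size_mb, block_size=BLOCK_SIZE):
    """Split a file into equal-size blocks; only the last one may be smaller."""
    sizes = []
    remaining = file_size_mb
    while remaining > 0:
        sizes.append(min(block_size, remaining))
        remaining -= block_size
    return sizes

def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes (round-robin)."""
    placement = {}
    for b in range(num_blocks):
        placement[f"B{b + 1}"] = [
            datanodes[(b + r) % len(datanodes)] for r in range(replication)
        ]
    return placement

blocks = split_into_blocks(300)  # -> [128, 128, 44]
layout = place_replicas(len(blocks), ["NodeA", "NodeB", "NodeC", "NodeD"])
```

Note how every block except the last has the same size, matching the first bullet above, and every block ends up on three distinct nodes.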
What Is a Rack?

• In large Hadoop clusters, to improve network traffic while reading/writing HDFS files, the NameNode chooses DataNodes that are on the same rack or a nearby rack to serve the read/write request.
• The NameNode obtains this rack information by maintaining the rack IDs of each DataNode.
• This concept of choosing closer DataNodes based on rack information is called Rack Awareness in Hadoop.
Replication and Rack Awareness in Hadoop

The topology of the replicas is critical to ensure the reliability of HDFS. Usually, data is replicated three times. The suggested replication topology is as follows:

[Diagram: block B1 is stored on node R1N3 in Rack 1 and replicated to nodes R2N1 and R2N3 in Rack 2.]
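The suggested topology (one replica on the writer's rack, two on a single remote rack) can be sketched as follows. The rack and node names are illustrative, and this is only the simple three-replica case; HDFS's real placement policy handles more situations.

```python
# Sketch of the rack-aware placement suggested above: with replication 3,
# the first replica goes on the local rack and the other two on a single
# remote rack. Illustrative only; not HDFS's actual implementation.
def place_block(local_rack, racks):
    """racks: dict rack_id -> list of node names. Returns 3 (rack, node) pairs."""
    first = (local_rack, racks[local_rack][0])
    remote_rack = next(r for r in racks if r != local_rack)
    second = (remote_rack, racks[remote_rack][0])
    third = (remote_rack, racks[remote_rack][1])
    return [first, second, third]

racks = {"rack1": ["R1N1", "R1N2", "R1N3"], "rack2": ["R2N1", "R2N2", "R2N3"]}
replicas = place_block("rack1", racks)
# Two of the three replicas share the remote rack, so only two racks are used.
```

This placement balances safety (the block survives the loss of a whole rack) against write cost (only one cross-rack transfer from the writer's rack).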
Replication and Rack Awareness: Example

The diagram illustrates a Hadoop cluster with three racks. Each rack contains multiple nodes. R1N1 represents Node 1 on Rack 1, and so on. The NameNode decides which DataNode belongs to which rack.

[Diagram: the NameNode places blocks B1, B2, and B3 across the nodes of the three racks.]
Implementing Rack Awareness in Hadoop

2. Create a topology.py script file: topology scripts are used by Hadoop to determine the rack location of nodes. The file is present in /etc/hadoop/conf/topology.sh
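A minimal sketch of what such a topology script might look like: Hadoop invokes the script with one or more DataNode addresses as arguments and expects one rack path per line. The IP-to-rack mapping below is hypothetical; a real script would cover every node in the cluster.

```python
#!/usr/bin/env python
# Minimal sketch of a Hadoop topology script. Hadoop calls it with DataNode
# IPs/hostnames as arguments and reads one rack path per input back.
# The mapping below is hypothetical and for illustration only.
import sys

RACK_MAP = {
    "10.0.1.11": "/rack1",
    "10.0.1.12": "/rack1",
    "10.0.2.11": "/rack2",
}
DEFAULT_RACK = "/default-rack"

def resolve(nodes):
    """Return the rack path for each node, falling back to the default rack."""
    return [RACK_MAP.get(node, DEFAULT_RACK) for node in nodes]

if __name__ == "__main__":
    print("\n".join(resolve(sys.argv[1:])))
```

Unknown nodes fall back to /default-rack, which is also what Hadoop assumes when no topology script is configured at all.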
Anatomy of File Write

Step 4: Similarly, the second DataNode stores the packet and forwards it to the third (and last) DataNode in the pipeline.

Step 6: When the client has finished writing data, it calls close() on the stream.
Anatomy of File Read

Step 1: First, the client opens the file by calling the open() method on the FileSystem object.

Step 3: The client then calls read() on the DFSInputStream, which connects to the first (closest) DataNode for the first block in the file.

Step 4: Data is streamed from the DataNode back to the client, which calls read() repeatedly on the stream.

Step 5: When the end of the block is reached, the DFSInputStream closes the connection to the DataNode and finds the best DataNode for the next block.

Step 6: Blocks are read in order. When the client has finished reading, it calls close() on the FSDataInputStream.
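The read path above can be mimicked with a toy model: ask the "NameNode" for block locations once, then stream each block directly from a DataNode. All class names and the block map here are invented for illustration; the real HDFS client is a Java API with a different shape.

```python
# Toy model of the HDFS read path described above. Illustrative only;
# these classes do not correspond to the real HDFS client API.
class ToyNameNode:
    def __init__(self, block_locations):
        # file path -> ordered list of (block_id, [datanodes holding it])
        self.block_locations = block_locations

    def get_block_locations(self, path):
        return self.block_locations[path]

class ToyClient:
    def __init__(self, namenode, datanodes):
        self.namenode = namenode
        self.datanodes = datanodes  # node name -> {block_id: bytes}

    def read(self, path):
        """Read blocks in order, contacting the closest DataNode for each."""
        data = b""
        for block_id, holders in self.namenode.get_block_locations(path):
            node = holders[0]  # stand-in for "closest DataNode"
            data += self.datanodes[node][block_id]
        return data

nn = ToyNameNode({"/f": [("B1", ["NodeA"]), ("B2", ["NodeB"])]})
client = ToyClient(nn, {"NodeA": {"B1": b"hello "}, "NodeB": {"B2": b"world"}})
assert client.read("/f") == b"hello world"
```

The key property the toy model preserves is that the NameNode only hands out locations; the data itself never flows through it.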
HDFS Access

HDFS can be accessed through a Web GUI, utilized through an HTTP browser, or through the FS shell, which executes commands on HDFS.

HDFS Command Line

Get a directory listing of the user's home directory in HDFS:
$ hdfs dfs -ls

In this demonstration, you will view how to run a few basic command lines of HDFS. You can also execute the commands in your lab environment for practice.
Hue File Browser
The file browser in Hue helps you view and manage HDFS directories and files.
Demo
HDFS Access using Hue
In this demonstration, you will see how to access HDFS using Hue. You will also learn how to view and manage
your HDFS directories and files using Hue. You can also execute it in your lab environment for practice.
HDFS and YARN
Topic 3—Introduction to YARN (Yet Another Resource Negotiator)
What Is YARN: Case Study
Yahoo was the first company to embrace Hadoop and became a trendsetter within the Hadoop ecosystem.
In late 2012, Yahoo struggled to handle iterative and stream processing of data on the Hadoop infrastructure
due to MapReduce limitations.
Both iterative and stream processing were important for Yahoo in facilitating its move from batch computing
to continuous computing.
This move helped Yahoo handle more than 100 billion events (clicks, impressions, email content, metadata, etc.) per day.
What Is YARN?
YARN is a resource manager created by separating the processing engine from the resource-management function of MapReduce. It monitors and manages workloads, maintains a multi-tenant environment, manages the high-availability features of Hadoop, and implements security controls.
Advantages of Using YARN
• Lower operational costs
The YARN infrastructure is responsible for providing computational resources for application executions.

[Diagram: in Hadoop 2.7, MapReduce and other data-processing frameworks run as applications on top of YARN, which provides cluster resource management over the distributed file system.]

Components of YARN Architecture

RESOURCE MANAGER

The Resource Manager mediates the available resources in the cluster among competing applications and ensures maximum cluster utilization.

[Diagram: the Resource Manager hosts the ApplicationsManager and the Scheduler; Node Managers and Application Masters run on the worker nodes.]
Components of YARN Architecture

RESOURCE MANAGER COMPONENT: SCHEDULER

It includes a pluggable scheduler called the YarnScheduler, which allows different policies for managing constraints, such as capacity, fairness, and Service-Level Agreements.
RESOURCE MANAGER COMPONENT: APPLICATIONSMANAGER

The ApplicationsManager accepts job submissions, negotiates the first container for executing the application, and restarts the ApplicationMaster container on failure.
Components of YARN Architecture

RESOURCE MANAGER: INTERNAL COMPONENTS

[Diagram: the Resource Manager's internal components include the ClientService, the YARN Scheduler, the NodesListManager, and the ContainerAllocationExpirer.]
Components of YARN Architecture

HOW RESOURCE MANAGER OPERATES

• The Resource Manager communicates with the clients through an interface called the ClientService.
• Administrative requests are served by a separate interface called the AdminService. Operators can get updated information about the cluster operation using this interface.
• In parallel, the ResourceTrackerService receives node heartbeats from the Node Manager to track new or decommissioned nodes.
• The NMLivelinessMonitor and NodesListManager keep an updated status of which nodes are healthy so that the Scheduler and the ResourceTrackerService can allocate work appropriately.
• The ApplicationMasterService manages Application Masters on all nodes, keeping the Scheduler informed.
• The AMLivelinessMonitor keeps a list of Application Masters and their last heartbeat times to let the Resource Manager know which applications are healthy on the cluster.
• Any Application Master that does not send a heartbeat within a certain interval is marked as dead and rescheduled to run on a new container.
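The liveness-monitoring idea in the last two bullets can be sketched with a few lines of Python. The function name, the timestamps, and the timeout value are illustrative; they are not YARN internals.

```python
# Toy sketch of heartbeat-based liveness monitoring as described above:
# any Application Master whose last heartbeat is older than the timeout
# is marked as dead. All names and values here are illustrative.
def find_dead_masters(last_heartbeats, now, timeout=600):
    """last_heartbeats: dict am_id -> time of last heartbeat (seconds)."""
    return sorted(am for am, t in last_heartbeats.items() if now - t > timeout)

beats = {"am-001": 1000, "am-002": 1550, "am-003": 400}
dead = find_dead_masters(beats, now=1600)  # -> ["am-003"]
```

In the real system, each Application Master returned by such a check would then be rescheduled into a new container.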
Components of YARN Architecture

RESOURCE MANAGER: HIGH AVAILABILITY MODE

Before Hadoop 2.4, the Resource Manager was the single point of failure in a YARN cluster. The High Availability (HA) feature adds an Active/Standby Resource Manager pair to remove this single point of failure.
Components of YARN Architecture

NODE MANAGER

Each Node Manager takes instructions from the Resource Manager and reports and handles containers on a single node.

When a container is leased to an application, the Node Manager sets up the container's environment, including the resource constraints specified in the lease and any dependencies.

The Node Manager runs on each node and manages the following:
• Container lifecycle management
• Container dependencies
• Container leases
CONTAINER

A YARN container is the result of a successful resource allocation; that is, the RM has granted an application a lease to use specified resources on a specific node.

To launch the container, the Application Master must provide a container launch context (CLC) that includes the following information:
• Environment variables
• Dependencies, that is, local resources such as data files or shared objects needed prior to launch

APPLICATION MASTER

The Application Master in YARN is a framework-specific library that negotiates resources from the RM and works with the Node Manager or managers to execute and monitor containers and their resource consumption.
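As a sketch, the information carried in a container launch context could be modeled like this. The field names and example values are invented for illustration; YARN's real ContainerLaunchContext is a Java API with more fields.

```python
# Illustrative model of a container launch context (CLC) as described above.
# Field names are invented; this is not YARN's real ContainerLaunchContext.
from dataclasses import dataclass, field

@dataclass
class ContainerLaunchContext:
    environment: dict = field(default_factory=dict)      # environment variables
    local_resources: list = field(default_factory=list)  # data files, shared objects

clc = ContainerLaunchContext(
    environment={"JAVA_HOME": "/usr/lib/jvm/default"},
    local_resources=["app.jar", "config.xml"],
)
```

The point of the structure is that everything a container needs before launch (environment plus local resources) travels with the allocation request.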
Components of YARN Architecture

APPLICATION MASTER: USES

Application Masters exist for the many frameworks that run on YARN:
• Batch (MapReduce)
• Interactive (Tez)
• Online (HBase)
• Streaming (Storm, S4, …)
• Graph (Giraph)
• In-memory (Spark)
• HPC MPI (OpenMPI)
• Other (Search, Weave, …)

YARN provides cluster resource management on top of the distributed file system.
Running an Application Through YARN

Step 1: The client submits an application to the Resource Manager. Users submit applications to the Resource Manager by typing the hadoop jar command. The Resource Manager maintains the list of applications on the cluster and of the available resources on the Node Managers, and determines the next application that receives a portion of the cluster resources.

Step 2: The Resource Manager allocates a container. When the Resource Manager accepts a new application submission, the Scheduler first selects a container. Next, the Application Master is started in it and is responsible for the entire life cycle of the application.

Step 3: The Application Master contacts the related Node Manager, and the Node Manager launches the container.

Step 4: The container executes the Application Master. The Node Manager does not monitor tasks; it only monitors the resource usage in the containers. The Application Master negotiates containers to launch all of the tasks needed to complete the application. After the application is complete, the Application Master shuts itself down and releases its own container.

[Diagram: the client runs $ my-Hadoop-app; the Resource Manager allocates containers on Node Manager/DataNode hosts ("Here are your containers"), and the Application Master reports "I'm done!" when the application finishes.]
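The four steps above can be strung together as a toy simulation. Every class and method here is invented for illustration; none of them correspond to real YARN APIs.

```python
# Toy walk-through of the YARN submission flow described above. All classes
# are invented for illustration and do not match real YARN APIs.
class ToyNodeManager:
    def __init__(self):
        self.containers = []

    def launch_container(self, payload):
        # Step 3: the Node Manager launches the container. It tracks only
        # resource usage, not the tasks running inside.
        self.containers.append(payload)
        return payload

class ToyResourceManager:
    def __init__(self, node_managers):
        self.node_managers = node_managers

    def submit(self, app):
        # Steps 1-2: accept the submission, pick a container on a node,
        # and start the Application Master in it.
        nm = self.node_managers[0]
        return nm.launch_container(f"AM for {app}")

nms = [ToyNodeManager(), ToyNodeManager()]
rm = ToyResourceManager(nms)
am = rm.submit("my-Hadoop-app")
# Step 4: the Application Master would now negotiate task containers and
# release its own container once the application completes.
```

Even in this toy form, the division of labor is visible: the Resource Manager decides where things run, while the Node Manager does the launching.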
Tools for YARN Development

YARN can be accessed using three tools: the YARN Web UI, the Hue Job Browser, and the YARN command line.

HUE JOB BROWSER

The Hue Job Browser allows you to monitor the status of a job, kill a running job, and view logs.

YARN COMMAND LINE

Most of the YARN commands are for administrators rather than developers. A few useful commands:
• yarn -help: lists all YARN commands

In this demonstration, you will learn how to use the YARN Web UI, the Hue Job Browser, and the YARN command line. You can also execute the commands in your lab environment for practice.
Key Takeaways
HDFS chunks data into blocks and distributes them across the cluster.
Slave nodes run DataNode daemons that are managed by a single NameNode.
YARN works with HDFS to run tasks where the data is stored.
YARN executes jobs that can be monitored using Hue, the YARN Web UI, or the YARN command line.
Quiz

QUIZ 1
How is NameNode failure in non-HA mode tackled?
QUIZ 2
Which of the following statements best describes how a large (100 GB) file is stored in HDFS?

a. The file is replicated three times by default. Each copy of the file is stored on a separate DataNode.
b. The master copy of the file is stored on a single DataNode. The replica copies are divided into fixed-size blocks, which are stored on multiple DataNodes.
c. The file is divided into fixed-size blocks, which are stored on multiple DataNodes. Each block is replicated three times by default. Multiple blocks from the same file might reside on the same DataNode.
d. The file is divided into fixed-size blocks, which are stored on multiple DataNodes. Each block is replicated three times by default. HDFS guarantees that different blocks from the same file are never on the same DataNode.
QUIZ 3
Which of the following describes how a client reads a file from HDFS?

a. The client queries the NameNode for the block location(s). The NameNode returns the block location(s) to the client. The client reads the data directly off the DataNode(s).
b. The client queries all DataNodes in parallel. The DataNode that contains the requested data responds directly to the client. The client reads the data directly off the DataNode.
d. The NameNode contacts the DataNode that holds the requested data block. Data is transferred from the DataNode to the NameNode and then from the NameNode to the client.
QUIZ 4
Which of the following is/are valid statements?

a. HDFS is optimized for storing a large number of files smaller than the HDFS block size.
c. HDFS is a distributed file system that replaces ext3 or ext4 on Linux nodes in a Hadoop cluster.
d. HDFS is a distributed file system that runs on top of native OS file systems and is well-suited for storage of very large datasets.