Which operating system(s) are supported for production Hadoop deployment?

A. Any operating system that supports Java 1.6 including Linux, Windows, Solaris and Mac OS
B. Any flavor of Linux or Unix is supported. Windows and Mac OS are only supported for
development and testing
C. Only Centos and Red Hat are supported for Hadoop production deployment
D. Any flavors of Linux, Unix, Windows, and Mac OS are supported.
Correct Answer is B - The main supported operating system is Linux. However, with
some additional software Hadoop can be deployed on Windows.

What is the role of the namenode?

A. Namenode splits big files into smaller blocks and sends them to different datanodes.
B. Namenode is responsible for assigning names to each slave node so that they can be identified by
the clients.
C. Namenode is responsible for managing the HDFS system and supplies addresses of the data on the
different datanodes.
D. Both B and C are valid answers.
Correct Answer is C - The namenode is the "brain" of the Hadoop cluster and is
responsible for managing the distribution of blocks across the system based on the replication
policy. The namenode also supplies the specific addresses of the data based on
client requests.

What happens on the namenode when a client tries to read a data file?

A. The namenode will look up the information about the file in the edit log and then retrieve the
remaining information from the in-memory filesystem snapshot
B. The namenode is not involved in retrieving the data file since the data is stored on the datanodes
C. The namenode will retrieve data from the datanodes and start streaming data back to the clients
D. None of these answers are correct
Correct Answer is A - Since the namenode needs to support a large number of
clients, the primary namenode only sends back the location information for the data. The
datanode itself is responsible for the retrieval.

What are the hardware requirements for a Hadoop cluster (primary and secondary namenodes and datanodes)?

A. Hadoop was designed to work on commodity hardware. There are no requirements for the
hardware
B. There are no requirements for datanodes. However, the namenodes require a specified amount of
RAM to store the filesystem image in memory.
C. A Hadoop cluster will only be as fast as its individual nodes. Therefore, you need to buy machines
with good CPU speeds
D. Older versions of Hadoop had no hardware requirements. With all .22+ releases though, there are
minimum CPU and RAM requirements.
Correct Answer is B - Based on the design of the primary and secondary
namenodes, the entire filesystem image is stored in memory. Therefore, both
namenodes need enough memory to hold the entire filesystem image.
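
For illustration, on Hadoop 1.x the namenode (and secondary namenode) heap is typically sized in hadoop-env.sh; the value below is a hypothetical example, not a sizing recommendation:

    # conf/hadoop-env.sh on the namenode and secondary namenode
    # Heap size in MB; must be large enough to hold the entire filesystem image
    export HADOOP_HEAPSIZE=4000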

What mode(s) can Hadoop code be run in?

A. Hadoop can be deployed in distributed mode only.
B. Hadoop can be deployed in stand-alone mode or distributed mode
C. Hadoop can be deployed in stand-alone mode, pseudo-distributed mode or fully-distributed
mode.
D. None of these are applicable modes for Hadoop
Correct Answer is C - Hadoop was specifically designed to be deployed on a multi-node
cluster. However, it can also be deployed on a single machine, even as a single process, for
testing purposes.
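
A rough sketch of how the three modes differ in Hadoop 1.x configuration; hostnames and ports are placeholders:

    # Stand-alone (local) mode: defaults, everything runs in a single JVM on the local filesystem
    #   fs.default.name = file:///                   (core-site.xml)
    # Pseudo-distributed mode: all daemons on one machine, talking over HDFS
    #   fs.default.name    = hdfs://localhost:9000   (core-site.xml)
    #   mapred.job.tracker = localhost:9001          (mapred-site.xml)
    #   dfs.replication    = 1                       (hdfs-site.xml)
    # Fully-distributed mode: same properties, pointing at the real namenode/jobtracker hosts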

How would a Hadoop administrator deploy the various components of Hadoop in production?

A. Deploy namenode, jobtracker, datanode and tasktracker on the Master Node
B. Deploy namenode and jobtracker on the master node, and deploy datanodes and tasktrackers on
multiple slave nodes
C. Deploy each component on a separate machine to make sure that there are enough resources
D. Deploy namenode, jobtracker and tasktracker on the master node, and tasktrackers on multiple slave
nodes

Correct Answer is B - There is a need for only one namenode and jobtracker on the
system. The number of datanodes depends on the available hardware.

What is the best practice to deploy the secondary namenode?

A. Deploy secondary namenode on the same machine as the primary namenode
B. Deploy secondary namenode on every machine that runs a datanode
C. Deploy secondary namenode on a separate standalone machine
D. Only deploy secondary namenode in active-active configuration
Correct Answer is C - The secondary namenode needs to be deployed on a separate
machine so that it does not interfere with primary namenode operations. The
secondary namenode has the same memory requirements as the primary
namenode.
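
In Hadoop 1.x the host that runs the secondary namenode is, somewhat confusingly, listed in the conf/masters file; the hostnames below are placeholders:

    # conf/masters lists the host(s) that run the secondary namenode
    echo "snn-host.example.com" > $HADOOP_HOME/conf/masters
    # conf/slaves lists the hosts that run datanodes and tasktrackers
    cat $HADOOP_HOME/conf/slaves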

Is there a standard procedure to deploy Hadoop?

A. Yes, all Hadoop deployments comply with well-defined standards
B. No, the deployment procedure varies from one vendor to another
C. No, there are some differences between various distributions. However, they all require that
Hadoop jars be installed on the machine
D. No, each deployment is different due largely to differences in cluster topography for each
implementation
Correct Answer is C - There are some common requirements for all Hadoop
distributions but the specific procedures will be different for different vendors since they
all have some degree of proprietary software.

What is the role of the secondary namenode?

A. Secondary namenode is a backup namenode that will serve requests if primary namenode goes
down.
B. Secondary namenode performs the CPU-intensive operation of combining the edit log and the current
filesystem snapshot
C. Secondary namenode is always active and serves 50% of the requests to the primary namenode
D. There is no secondary namenode

Correct Answer is B - The secondary namenode was separated out as its own process because of its
CPU-intensive operations and the additional requirement of backing up the metadata.

What are the side effects of not running a secondary name node?

A. The cluster performance will degrade over time since edit log will grow bigger and bigger
B. The primary namenode will become overloaded and response time will be slower.
C. The data reads will be faster since the primary namenode does not have to replicate to secondary
namenode
D. The only possible impact is that, when there is an outage, a failover to the secondary namenode will
not occur. This is a rare occurrence.
Correct Answer is A - If the secondary namenode is not running at all, the edit log will
grow significantly and it will slow the system down. Also, the system will go into
safemode for an extended time since the namenode needs to combine the edit log and
the current filesystem checkpoint image.
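
As a sketch, the checkpoint frequency of the secondary namenode in Hadoop 1.x is controlled by the properties below, which bound how large the edit log can grow between checkpoints; the values shown are illustrative:

    # core-site.xml (Hadoop 1.x property names)
    #   fs.checkpoint.period = 3600        # seconds between checkpoints
    #   fs.checkpoint.size   = 67108864    # force a checkpoint once the edit log reaches this size (bytes)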

What happens if a datanode loses network connection for a few minutes?

A. The namenode will detect that a datanode is not responsive and will start replication of the data
from the remaining replicas. When the datanode comes back online, the administrator will need to manually
delete the extra replicas
B. All data will be lost on that node. The administrator has to ensure proper data distribution
between nodes
C. If the datanode comes back online after just a few minutes, the cluster won't detect that it was not
available and will continue working normally
D. The namenode will detect that a datanode is not responsive and will start replication of the data
from the remaining replicas. When the datanode comes back online, the extra replicas will be deleted
Correct Answer is D - The replication factor is actively maintained by the namenode.
The namenode monitors the status of all datanodes and keeps track of which blocks are
located on each node. The moment a datanode becomes unavailable, it triggers replication
of the data from the existing replicas. However, if the datanode comes back up, the
over-replicated data will be deleted. Note: the data might be deleted from the original
datanode.

What happens if one of the datanodes has a much slower CPU? How will it affect the performance of the cluster?

A. The task execution will only be as fast as the slowest worker. However, if speculative execution is
enabled, the slowest worker will not have such a big impact.
B. The slowest worker will significantly impact job execution time. It will slow everything down.
C. The namenode will detect the slowest worker and will never send tasks to that node
D. It depends on the level of priority assigned to the task. All high priority tasks are executed in
parallel twice. A slower datanode would therefore be bypassed. If task is not high priority, however,
performance will be affected.
Correct Answer is A - Hadoop was specifically designed to work with commodity
hardware. Speculative execution helps to offset slow workers: multiple
instances of the same task are created, the jobtracker takes the result of the first instance
to finish, and the other instances are killed.

What is speculative execution?

A. There is no such thing as speculative execution. Hadoop tasks are preassigned priorities very
much like in relational databases. Administrators can allocate CPU based on a combination of task and/or
priority.
B. Speculative execution is the way in which jobtrackers decide where to assign a task based on the
results of the previous jobs. It predetermines the most efficient route for completing the task.
C. Speculative execution allows multiple jobs to be executed at the same time
D. If speculative execution is enabled, the job tracker will issue multiple instances of the same task
on multiple nodes and it will take the result of the task that finished first. The other instances of the task
will be killed.
Correct Answer is D - Speculative execution is used to offset the impact of
slow workers in the cluster. The jobtracker creates multiple instances of the same task
and takes the result of the first successful task. The rest of the instances are discarded.
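
A minimal sketch of how speculative execution is toggled with the old MapReduce (MR1) property names; the job jar, class and paths are hypothetical, and the job must use ToolRunner for -D options to apply:

    # Per-job, via the generic -D options (can also be set cluster-wide in mapred-site.xml)
    hadoop jar my-job.jar MyJob \
      -D mapred.map.tasks.speculative.execution=true \
      -D mapred.reduce.tasks.speculative.execution=true \
      /input /output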

How many racks do you need to create a Hadoop cluster in order to make sure that the cluster operates reliably?

A. In order to ensure reliable operation it is recommended to have at least two racks with rack
placement configured
B. Due to replication of the data, one rack is good enough to ensure the reliable operation of the
Hadoop cluster
C. A highly available Hadoop cluster requires at least four different racks for datanodes
D. There is no best practice related to number of racks. It depends on the size of the cluster, SLAs,
and type of hardware.
Correct Answer is A - Hadoop has a built-in rack awareness mechanism that allows
data distribution between different racks based on the configuration.

Are there any special requirements for the namenode?

A. No, Hadoop is specifically designed to run on commodity hardware. This is true for the
namenode as well.
B. Yes, the namenode serves all data back to the clients and needs to have an extra fast CPU
C. Yes, the namenode holds information about all files in the system and needs to be extra reliable
D. Yes, the namenode serves all data back to the clients and needs both an extra fast CPU and more
RAM.
Correct Answer is C - The namenode is a single point of failure. It needs to be extra
reliable, and its metadata needs to be replicated in multiple places. Note that the community
is working on solving the namenode's single-point-of-failure issue.

If you have a file of 128M size and the replication factor is set to 3, how many blocks can you find on the cluster that
will correspond to that file (assuming the default Apache and Cloudera configuration)?

A. 3
B. 6
C. 9
D. 12
Correct Answer is B - Based on the configuration settings, the file will be divided into
multiple blocks according to the default block size of 64M: 128M / 64M = 2. Each block
will be replicated according to the replication factor setting (default 3): 2 * 3 = 6.
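
As a sanity check, assuming the Hadoop 1.x defaults of a 64 MB block size (dfs.block.size) and a replication factor of 3 (dfs.replication), the block count can be confirmed with fsck; the file path is a placeholder:

    # 128 MB / 64 MB = 2 blocks; 2 blocks * 3 replicas = 6 block replicas on the cluster
    hadoop fsck /user/hadoop/bigfile.dat -files -blocks -locations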

What is distributed copy (distcp)?

A. Distcp is a Hadoop utility for launching MapReduce jobs to copy data. The primary usage is for
copying a large amount of data
B. Distcp is a way to copy data from one datanode to another before namenode can send tasks to
that datanode.
C. distcp is used to distribute data evenly among all datanodes in the cluster
D. Distcp is used for both A and C

Correct Answer is A - One of the major challenges in the Hadoop environment is
copying data across multiple clusters, and distcp allows multiple datanodes to be
leveraged for parallel copying of the data.
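
A typical distcp invocation between two clusters looks like the following; the namenode hostnames, ports and paths are placeholders:

    # Copy a directory from one cluster to another using a MapReduce job
    hadoop distcp hdfs://nn1.example.com:8020/data/logs hdfs://nn2.example.com:8020/data/logs
    # -update copies only files that are missing or have changed on the destination
    hadoop distcp -update hdfs://nn1.example.com:8020/data/logs hdfs://nn2.example.com:8020/data/logs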

What is replication factor?

A. Replication factor controls how many times the namenode replicates its metadata
B. Replication factor creates multiple copies of the same file to be served to clients
C. Replication factor controls how many times each individual block can be replicated
D. None of these answers are correct.
Correct Answer is C - Data is replicated in the Hadoop cluster based on the replication
factor. A higher replication factor guarantees data availability in the event of failure.
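
For illustration, the replication factor can be set cluster-wide or per file; the value and path below are examples:

    # Cluster-wide default (hdfs-site.xml): dfs.replication = 3
    # Change the replication factor of an existing file (-w waits for replication to complete)
    hadoop fs -setrep -w 3 /user/hadoop/important.dat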

What daemons run on Master nodes?

A. NameNode, DataNode, JobTracker and TaskTracker
B. NameNode, DataNode and JobTracker
C. NameNode, Secondary NameNode and JobTracker
D. NameNode, Secondary NameNode, JobTracker, TaskTracker and DataNode
Correct Answer is C - Hadoop comprises five separate daemons, and each of these
daemons runs in its own JVM. NameNode, Secondary NameNode and JobTracker run on
master nodes. DataNode and TaskTracker run on each slave node.
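
If the split is unclear, a quick way to see which daemons run where on a Hadoop 1.x cluster is the jps command; the comments show what one would typically expect, not captured output:

    # On the master node
    jps        # NameNode, SecondaryNameNode, JobTracker
    # On a slave node
    jps        # DataNode, TaskTracker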

What is rack awareness?

A. Rack awareness is the network topology that needs to be configured for Hadoop cluster
B. Rack awareness is taken into consideration when a client makes a request for specific data
C. Rack awareness is the way in which the namenode decides how to place blocks based on the rack
definitions
D. Rack awareness is the way in which the datanode determines how to place blocks based on rack
definitions

Correct Answer is C - Hadoop will try to minimize the network traffic between
datanodes within the same rack and will only contact remote racks if it has to. The
namenode is able to control this due to rack awareness.
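
A minimal sketch of how rack awareness is wired up in Hadoop 1.x: the namenode calls an administrator-supplied script that maps an IP address or hostname to a rack path; the script path and rack names below are placeholders:

    # core-site.xml: topology.script.file.name = /etc/hadoop/conf/topology.sh
    # The script prints a rack path such as /dc1/rack1 for each address passed to it;
    # nodes without a mapping fall back to /default-rack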

What is the role of the jobtracker in a Hadoop cluster?

A. The jobtracker is responsible for scheduling tasks on slave nodes, collecting results, retrying
failed tasks
B. The job tracker is responsible for job monitoring on the cluster, sending alerts to master nodes
C. The job tracker is responsible for cleaning up the temporary data from running jobs, copying
necessary data to each datanode and communication with namenode
D. Both A and C are valid answers
Correct Answer is A - The jobtracker is the main component of the MapReduce
execution. It controls the division of the job into smaller tasks, submits tasks to individual
tasktrackers, tracks the progress of the jobs and reports results back to the calling code.

How does the Hadoop cluster tolerate datanode failures?

A. Failures are anticipated. When they occur, the jobs are re-executed.
B. Datanodes talk to each other and figure out what needs to be re-replicated if one of the nodes goes
down
C. If a datanode goes down, that data will not be available until the datanode comes back up
D. Since Hadoop is designed to run on commodity hardware, datanode failures are expected. The
namenode keeps track of all available datanodes and actively maintains the replication factor on all data.
Correct Answer is D - The namenode actively tracks the status of all datanodes and
acts immediately if the datanodes become non-responsive. The namenode is the central
"brain" of the HDFS and starts replication of the data the moment a disconnect is
detected.

What is the procedure for namenode recovery?

A. A namenode can be recovered in two ways: starting new namenode from backup metadata or
promoting secondary namenode to primary namenode
B. The Hadoop administrator can always reformat the namenode and start over. All data will be preserved.
C. The secondary namenode will become the primary namenode automatically. No additional work is
required from the administrator's point of view.
D. The namenode is a single point of failure for Hadoop. There is no recovery in place today
(though various vendors are working on developing one)
Correct Answer is A - The namenode recovery procedure is very important to ensure
the reliability of the data. It can be accomplished by starting a new namenode using
backup metadata or by promoting the secondary namenode to primary.
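
A hedged sketch of the first recovery path on Hadoop 1.x, starting a new namenode from the secondary namenode's checkpoint data; directory locations are placeholders:

    # Point fs.checkpoint.dir at the secondary namenode's checkpoint data (or copy it over),
    # then start the namenode so it imports that checkpoint into an empty dfs.name.dir
    hadoop namenode -importCheckpoint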

Web-UI shows that half of the datanodes are in decommissioning mode. What does that mean? Is it safe to
remove those nodes from the network?

A. This means that clients will not be directed to those datanodes and it is safe to remove them from
the network.
B. This means that the namenode is trying to retrieve data from those datanodes by moving replicas to the
remaining datanodes. There is a possibility that data can be lost if the administrator removes those datanodes
before decommissioning is finished
C. This means that the cluster is going down and subsequent requests to the Hadoop cluster will be
rejected. The administrator can stop decommissioning by restarting the Hadoop cluster
D. This means that there is available capacity in half the datanodes and traffic can be redirected to
them as appropriate.
Correct Answer is B - Due to the replication strategy, it is possible to lose data if
datanodes are removed en masse before the decommissioning process completes.
Decommissioning refers to the namenode retrieving data from those datanodes by moving their
replicas to the remaining datanodes.
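
As an illustration of how nodes usually enter decommissioning on Hadoop 1.x; the exclude-file path and hostname are placeholders:

    # hdfs-site.xml: dfs.hosts.exclude = /etc/hadoop/conf/excludes
    echo "dn7.example.com" >> /etc/hadoop/conf/excludes
    hadoop dfsadmin -refreshNodes    # namenode starts moving replicas off the excluded node
    # Remove the node only after the web UI / dfsadmin -report shows it as decommissioned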

What does the Hadoop administrator have to do after adding new datanodes to the Hadoop cluster?

A. Since the new nodes will not have any data on them, the administrator needs to start the balancer
to redistribute data evenly between all nodes.
B. The namenode will detect new datanodes automatically and will start sending data to them. No
interaction from the Hadoop administrator is required
C. The administrator needs to provision new datanodes on all nodes so that the datanodes are aware
of each other's location and can work in concert.
D. Each new node will not have any data. The administrator need not do anything. As new data is
added to Hadoop the system will begin to use the new nodes
Correct Answer is A - The Hadoop cluster will detect new datanodes automatically.
However, in order to optimize cluster performance, it is recommended to start the
balancer to redistribute the data evenly between datanodes.
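
A minimal example of starting the balancer; the threshold value is illustrative:

    # Rebalance until each datanode's utilization is within 10% of the cluster average
    hadoop balancer -threshold 10
    # or, equivalently, via the helper script
    start-balancer.sh -threshold 10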

If the Hadoop administrator needs to make a change, which configuration file does he need to change?

A. It depends on the nature of the change. Each node has its own set of configuration files and they
are not always the same on each node.
B. The change needs to be made in master node. The slave nodes will pick up the change
automatically from the master node.
C. The change needs to be done in java code while setting up job object
D. The administrator need only modify mstconfig, which maintains a master list of configurations
for the cluster
Correct Answer is A - Each node in the Hadoop cluster has its own configuration files
and the changes need to be made on every node. One of the reasons for this is that the
configuration can be different for every node.
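
For reference, the per-node configuration in Hadoop 1.x lives in a handful of files under the conf directory, which are commonly pushed out to every node with a tool such as rsync; hostnames and paths are placeholders:

    # conf/core-site.xml, conf/hdfs-site.xml, conf/mapred-site.xml, conf/hadoop-env.sh
    for host in $(cat $HADOOP_HOME/conf/slaves); do
      rsync -av $HADOOP_HOME/conf/ $host:$HADOOP_HOME/conf/
    done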

MapReduce jobs are failing on a cluster that was just restarted. They worked before the restart. What could be
wrong?

A. The data may have been deleted during restart. It needs to be repopulated again.
B. The cluster needs to be reformatted and restarted again. The namenode metadata may have been
corrupted.
C. The cluster is in safe mode. The administrator needs to wait for the namenode to exit safe mode
before restarting the jobs again.
D. None of these answers are correct
Correct Answer is C - This is a very common mistake by Hadoop administrators when
there is no secondary namenode on the cluster and the cluster has not been restarted in
a long time. The namenode will go into safemode while it combines the edit log and the current
filesystem image.
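
The safemode state can be checked and, if necessary, exited manually with dfsadmin; a sketch:

    hadoop dfsadmin -safemode get      # report whether the namenode is in safe mode
    hadoop dfsadmin -safemode wait     # block until the namenode leaves safe mode on its own
    hadoop dfsadmin -safemode leave    # force it out (use with care)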

MapReduce jobs take too long. What can be done to improve the performance of the cluster?

A. One of the most common reasons for performance problems on a Hadoop cluster is uneven
distribution of tasks. The number of tasks has to match the number of available slots on the cluster
B. Hadoop software needs to be redeployed on machines with faster CPUs and more memory.
C. The MapReduce developer needs to redesign code by using better, high-performance algorithms
D. One of the more common performance problems is network performance limitations. Consider
upgrading network bandwidth.

Correct Answer is A - Hadoop is not a hardware-aware system. It is the responsibility
of the developers and the administrators to make sure that resource supply and
demand match.
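
As a hedged example of the MR1 knobs involved in matching tasks to slots; the values are illustrative, not recommendations:

    # mapred-site.xml on each tasktracker
    #   mapred.tasktracker.map.tasks.maximum    = 8   # map slots per node
    #   mapred.tasktracker.reduce.tasks.maximum = 4   # reduce slots per node
    # Per job, the number of reduce tasks can be set to match the available slots, e.g.
    #   -D mapred.reduce.tasks=32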

How often do you need to reformat the namenode?

A. The namenode needs to be re-formatted when cluster reaches 50% of the provisioned capacity.
B. Never. The namenode needs to be formatted only once, in the beginning. Reformatting the
namenode will lead to loss of the data on the entire cluster
C. The namenode needs to be reformatted on a regular basis or after about every 0.5 TB
D. Never. The namenode is automatically reformatted daily to ensure optimal performance
Correct Answer is B - The namenode is the only system that needs to be formatted, and
only once. Formatting creates the directory structure for the filesystem metadata and the
namespaceID for the entire filesystem.
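
The one-time format step on Hadoop 1.x looks like the following; it must never be repeated on a cluster that holds data:

    # Run once, as the HDFS user, before starting the cluster for the first time
    hadoop namenode -format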

After increasing the replication level, I still see that data is under-replicated. What could be wrong?

A. Data replication takes time due to large quantities of data. The Hadoop administrator should
allow sufficient time for data replication
B. The cluster needs to be restarted. The replication factor will be taken into consideration only after
restart
C. The replication factor is a global setting and cannot be set on a file level
D. The Hadoop administrator should never change the replication level without first reformatting
the system. Otherwise you could encounter data corruption issues.
Correct Answer is A - Depending on the data size, replication will take some
time. The Hadoop cluster still needs to copy data around, and if the data size is big enough it is
not uncommon for replication to take from a few minutes to a few hours.
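
Progress can be watched with fsck, which reports the number of under-replicated blocks; the file path in the second command is a placeholder:

    hadoop fsck / | grep -i "under.replicated"
    # or inspect a single file
    hadoop fsck /user/hadoop/important.dat -files -blocks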
