
1.What are the core components of Hadoop
Hadoop comprises HDFS and MapReduce: MapReduce handles the processing of large
data sets, while HDFS handles the storage of large sets of data.
2.What is HDFS
HDFS is the Hadoop Distributed File System, where we store large amounts of data
across the cluster.
3.Key features of Hadoop
It is fault tolerant
It provides high throughput
It runs on commodity hardware
4.Fault tolerance
In Hadoop we use fault tolerance to prevent the loss of data.
Since we are using commodity hardware there is a high chance of losing data, so
the default replication factor is 3: each block of data is replicated 3 times.
Two replicas are stored on two different machines within the same rack and the
third replica is stored on another rack, so even if we lose the first copy of a
block we can still get the data from the other two replicas.
MapReduce
The processing of large sets of data happens in MapReduce. We have a map task and
a reduce task: the mapper phase processes the data, while the reducer phase
reduces the mapper output and gives us the final result.
Master slave architecture in Hadoop
Here we have 5 daemon processes running.
On the master we have the JobTracker, NameNode and Secondary NameNode, and on
each slave node we have a TaskTracker and a DataNode.
Input format and output data format
SequenceFileInputFormat, TextInputFormat, KeyValueTextInputFormat,
FileInputFormat, whole-file input format.

Reducer
It combines the multiple mapper outputs into one, grouping the values by key.
How can you override the default input format in Hadoop
We can override it in the job configuration, as in the sketch below.
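A minimal driver sketch of overriding the default input format through the job
configuration, using the org.apache.hadoop.mapreduce API; the class name, job
name and command-line paths are made-up placeholders, not anything from the
original notes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InputFormatDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "input-format-example");
        job.setJarByClass(InputFormatDriver.class);

        // Override the default TextInputFormat (offset/line pairs) with
        // KeyValueTextInputFormat, which splits each line at the first tab
        // into a Text key and a Text value.
        job.setInputFormatClass(KeyValueTextInputFormat.class);

        // No mapper/reducer set here, so the identity mapper and reducer are
        // used; their output is therefore Text/Text pairs.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```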
Is there a map input format in Hadoop
No, but we can use SequenceFileInputFormat.
What happens if the mapper output does not match the reducer input
A runtime exception will be thrown.
Can you provide multiple input paths to a MapReduce job in Hadoop
Yes, we can add any number of input paths, as sketched below.
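A small sketch of registering several input paths on one job; the directory
names are invented examples, and the Job is assumed to be configured as in the
driver sketch above.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MultipleInputPaths {
    // Adds several input paths to the same job (paths are hypothetical).
    static void addInputs(Job job) throws IOException {
        FileInputFormat.addInputPath(job, new Path("/data/2015"));
        FileInputFormat.addInputPath(job, new Path("/data/2016"));
        // Or several comma-separated paths in one call:
        FileInputFormat.addInputPaths(job, "/data/2017,/data/2018");
    }
}
```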
How can you set an arbitrary no. of mappers to be created for a job in Hadoop
We cannot set it directly; the number of mappers is driven by the input splits.
How can you set an arbitrary no. of reducers to be created for a job in Hadoop
We can set it by calling setNumReduceTasks() on the JobConf class (see the
sketch below).
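A minimal illustration of fixing the reducer count; it is shown on the newer
mapreduce Job API here, though the old-API JobConf class named above exposes
the same setNumReduceTasks(int) method. The job name and the value 4 are
arbitrary examples.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reducer-count-example");
        // Request exactly 4 reduce tasks for this job.
        job.setNumReduceTasks(4);
        System.out.println("Reducers requested: " + job.getNumReduceTasks());
    }
}
```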
Where are Hadoop's configuration files located
In the conf sub-directory.
Hadoop's three configuration files
hdfs-site.xml
core-site.xml
mapred-site.xml
Parameters of the mapper
LongWritable (input key)
Text (input value)
Text (intermediate output key)
IntWritable (intermediate output value)
Parameters of the reducer
Text (input key)
IntWritable (input value)
Text (output key)
IntWritable (output value)
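The parameter lists above match the classic word-count example; here is a
sketch of a mapper and reducer with exactly those type parameters. The class
and field names are illustrative, not taken from the original notes.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<LongWritable, Text, Text, IntWritable>:
// input = line offset and line text, intermediate output = word and the count 1.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer<Text, IntWritable, Text, IntWritable>:
// input = word and its list of counts, output = word and total count.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```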
Four modules in the Hadoop framework
Hadoop Common
Hadoop YARN
Hadoop HDFS
Hadoop MapReduce
What is MapReduce
The MapReduce framework is used to process large sets of data. In MapReduce we
have a mapper process and a reducer process: the mapper processes large amounts
of data, the framework then performs the sort and shuffle phase, and the
reducer reduces the mapper output and gives the final result.
2.compute node: the node where the actual data gets processed.
Storage node: the node where the actual data is stored.
3.basic components in Hadoop
Input location
Output location
Map task
Reducer task
Job config
4.input format and output format
The framework takes its input as key-value pairs and also produces its output
as key-value pairs.
5.restriction on the key and value classes
The key and value classes must be serializable; in Hadoop we have the Writable
interface for this. Since keys must also be comparable, a key class must
implement WritableComparable, as in the sketch below.
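A sketch of a custom key type that satisfies this restriction; the class name
and its fields are invented purely for illustration.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// A custom key must be serializable (write/readFields) and comparable (compareTo).
public class YearMonthKey implements WritableComparable<YearMonthKey> {
    private int year;
    private int month;

    public YearMonthKey() { }                        // required no-arg constructor

    public YearMonthKey(int year, int month) {
        this.year = year;
        this.month = month;
    }

    @Override
    public void write(DataOutput out) throws IOException {    // serialize
        out.writeInt(year);
        out.writeInt(month);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialize
        year = in.readInt();
        month = in.readInt();
    }

    @Override
    public int compareTo(YearMonthKey other) {                 // sort order for keys
        int byYear = Integer.compare(year, other.year);
        return byYear != 0 ? byYear : Integer.compare(month, other.month);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof YearMonthKey)) return false;
        YearMonthKey other = (YearMonthKey) o;
        return year == other.year && month == other.month;
    }

    @Override
    public int hashCode() {                                    // used by the hash partitioner
        return 31 * year + month;
    }
}
```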

6.interface to be implemented to create a mapper and reducer
org.apache.hadoop.mapreduce.Mapper
org.apache.hadoop.mapreduce.Reducer
(in the new mapreduce API these are classes that you extend)
7.what the mapper does
When it gets key-value pairs from the record reader, it processes those pairs
and produces output, which is sent to the reducer as input.
8.what is an InputSplit
An InputSplit is the chunk of input data that a single mapper processes; the
number of map tasks equals the number of splits.
9.what is an InputFormat
The InputFormat describes how the input is divided into InputSplits and
provides the RecordReader that turns each split into key-value pairs for the
mapper.
10.where do you specify the mapper implementation
In the job itself (for example via Job.setMapperClass()).
11.methods in the Mapper interface
run(), setup(), map(), cleanup()
12.what happens if you don't override the mapper methods
The identity mapper behaviour is used: we get the same key-value pairs as
output that were received as input.
13.use of the context object
The Context object lets the mapper or reducer interact with the rest of the
framework: it is used to write the output key-value pairs, report progress and
status, and read the job configuration.
14.how can you add arbitrary key-value pairs to a mapper
Set them on the job Configuration in the driver and read them back in the
mapper through context.getConfiguration(), as sketched below.
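A sketch of passing an arbitrary key-value pair from the driver to the mapper
through the Configuration; the property name "myjob.threshold", the class
names and the filtering logic are all invented for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class ThresholdExample {
    // In the driver: put the arbitrary key-value pair on the Configuration
    // before the Job is created, so every task can see it.
    public static Job createJob() throws IOException {
        Configuration conf = new Configuration();
        conf.setInt("myjob.threshold", 10);            // hypothetical property name
        return Job.getInstance(conf, "threshold-example");
    }

    // In the mapper: read the value back from the Configuration in setup().
    public static class ThresholdMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private int threshold;

        @Override
        protected void setup(Context context) {
            threshold = context.getConfiguration().getInt("myjob.threshold", 0);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit only lines longer than the threshold passed in by the driver.
            if (value.getLength() > threshold) {
                context.write(new Text(value), new IntWritable(value.getLength()));
            }
        }
    }
}
```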
15.how does the mapper run() method work
run() is called once per map task: it calls setup() once, then calls map() for
every key-value pair in the input split, and finally calls cleanup(). The
sketch below shows the essence of the default implementation.
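Roughly what the default run() loop in org.apache.hadoop.mapreduce.Mapper does,
paraphrased as an equivalent override rather than a copy of the library source;
the class name and type parameters are placeholders.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// A mapper whose run() mirrors the default Mapper implementation.
public class RunLoopMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);                              // called once per task
        while (context.nextKeyValue()) {             // iterate over the input split
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
        cleanup(context);                            // called once at the end
    }
}
```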
16.which object is used to get the progress of a particular job
The Context object.
17.what is the next step after the mapper
After the mapper, the framework performs the sort and shuffle phase. The
partitioner applies a hash algorithm to each intermediate key to decide which
mapper output goes to which reducer; finally we have the reducer phase. A
sketch of this hashing is shown below.
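This is essentially what the default hash partitioner does, written out as an
equivalent sketch rather than the exact library source: the key's hash code,
made non-negative, modulo the number of reducers picks the target reducer.

```java
import org.apache.hadoop.mapreduce.Partitioner;

// Equivalent of the default hash partitioning behaviour.
public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```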
18.how many maps are there in a particular job
The number of maps usually depends on the size of the input data and the block
size, so it is driven purely by the input. For example, with 10TB of input data
and a block size of 128MB you end up with about 82,000 maps
(10TB = 10,485,760MB; 10,485,760 / 128 = 81,920).

19.how many reducers should be configured
The number of reducers is set per job with setNumReduceTasks() (or through the
mapred.reduce.tasks property in the configuration file).
20.how the client communicates with HDFS
The client communicates by using the HDFS API. The client asks the NameNode for
the DataNode locations of the relevant blocks, and then interacts directly with
those DataNodes to read or write the data. (MapReduce jobs, by contrast, are
submitted to the JobTracker.)
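A small sketch of reading a file through the HDFS client API; the path is a
made-up example. The FileSystem object hides the NameNode lookup and the
direct streaming from the DataNodes described above.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();     // picks up core-site.xml etc.
        FileSystem fs = FileSystem.get(conf);         // HDFS client entry point

        // Opening the file asks the NameNode for block locations, then the
        // bytes are streamed directly from the DataNodes.
        Path path = new Path("/user/hadoop/sample.txt");   // hypothetical path
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```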
21.what is a DataNode? How many instances of DataNode run on a Hadoop cluster
The DataNode holds the actual data stored in HDFS; the data is divided into
blocks. There is one DataNode per slave node, and each runs in its own JVM
process. DataNodes interact with each other mostly during replication. So there
is a single DataNode instance per slave machine.
22.what is the NameNode
The NameNode contains the metadata: it monitors the DataNodes and holds the
metadata about which blocks are stored on which DataNodes. It receives
heartbeats from the DataNodes and sends block-level instructions (such as
replication commands) back to them. It runs in a separate JVM. The NameNode is
a single point of failure (SPOF).
23.HDFS block size
In HDFS the data is split into blocks; by default a block is 64MB. Data is
replicated 3 times by default; both values can be changed in the configuration
files, as sketched below.
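These values normally live in hdfs-site.xml (dfs.replication and the block-size
property); the sketch below shows the equivalent client-side override through
the Configuration object. The concrete values are made-up examples, and the
block-size property name differs between older and newer Hadoop releases.

```java
import org.apache.hadoop.conf.Configuration;

public class HdfsBlockConfig {
    public static Configuration buildConf() {
        Configuration conf = new Configuration();
        // Replication factor for files created by this client (default is 3).
        conf.setInt("dfs.replication", 2);
        // Block size in bytes (128MB here); older releases use the
        // "dfs.block.size" name, newer ones "dfs.blocksize".
        conf.setLong("dfs.block.size", 128L * 1024 * 1024);
        return conf;
    }
}
```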
24.how does a NameNode handle the failure of the DataNodes
It receives a heartbeat and a block report from each DataNode periodically; the
block report lists the blocks available on that node. If the NameNode does not
receive a heartbeat from a DataNode, it assumes that node has failed and
re-replicates its blocks onto the other DataNodes.
