Hadoop Deployment and Configuration - Single machine and a cluster

Typical Hardware
HP Compaq 8100 Elite CMT PC
Processor: Intel Core i7-860
RAM: 8GB PC3-10600 Memory (2X4GB)
Network: Intel 82578 GbE (Integrated)

Network switch
Netgear GS2608
N Port
10/100/1000 Mbps Gigabit Switch

Gateway node
Dell Optiplex GX280
Processor: Intel Pentium 4 2.80 GHz


Install the Ubuntu Server (Maverick Meerkat) operating system
that is available for download from the Ubuntu releases site.

Some important points to remember while installing the OS:

Ensure that the SSH server is selected to be installed
Enter the proxy details needed for systems to connect to the internet
from within your network
Create a user on each installation, preferably with the same password on each node

Supported Platforms
GNU/Linux is supported as a development and production platform. Hadoop has
been demonstrated on GNU/Linux clusters with 2000 nodes.
Win32 is supported as a development platform. Distributed operation has not
been well tested on Win32, so it is not supported as a production platform.

Required Software
Required software for Linux and Windows includes:
Java™ 1.7.x, preferably from Sun, must be installed.
ssh must be installed and sshd must be running to use the Hadoop scripts that
manage remote Hadoop daemons.

Additional requirements for Windows include:

Cygwin - Required for shell support in addition to the required software above.

Installing Software
If your cluster doesn't have the requisite software you will need to install it.
For example on Ubuntu Linux:
$ sudo apt-get install ssh
$ sudo apt-get install rsync

On Windows, if you did not install the required software when you installed
Cygwin, start the Cygwin installer and select the packages:
openssh - the Net category

Install Sun's Java JDK

Install Sun's Java JDK on each node in the cluster.
Add the Canonical partner repository to your list of apt repositories.
You can do this by adding the line below to your /etc/apt/sources.list file:
deb maverick partner
Update the source list:
sudo apt-get update

Install sun-java6-jdk:
sudo apt-get install sun-java6-jdk

Select Sun's Java as the default on the machine:

sudo update-java-alternatives -s java-6-sun

Verify the installation by running the command:

java -version

Adding a dedicated Hadoop system user

Use a dedicated Hadoop user account for running Hadoop.
While that's not required, it is recommended because it helps to
separate the Hadoop installation from other software
applications and user accounts running on the same machine
(think: security, permissions, backups, etc).

The following commands add the user hduser and the group hadoop to your
local machine:
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser

Configuring SSH
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus
your local machine if you want to use Hadoop on it.
For a single-node setup of Hadoop, we therefore need to configure SSH access
to localhost for the hduser user we created on the previous slide.
You need SSH up and running on your machine, configured to allow SSH
public key authentication.
First, generate an SSH key for the hduser user.
user@ubuntu:~$ su - hduser
hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
The key's randomart image is:

Second, you have to enable SSH access to your local machine with this newly
created key.
hduser@ubuntu:~$ cat $HOME/.ssh/ >> $HOME/.ssh/authorized_keys
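
The append step can be made safe to repeat. The sketch below is one possible way to do that and is not part of the original setup. It uses a hypothetical helper, authorize_key, and runs in a scratch directory with a placeholder key so that no real $HOME/.ssh is touched. The file name id_rsa.pub and the key text are assumptions made for illustration.

```shell
# authorize_key PUBKEY_FILE AUTH_FILE
# Append the public key to AUTH_FILE unless that exact line is already present.
# (Hypothetical helper; not part of OpenSSH or Hadoop.)
authorize_key() {
  touch "$2"
  chmod 600 "$2"    # keep the file private to its owner
  grep -qxF -e "$(cat "$1")" "$2" || cat "$1" >> "$2"
}

# Demonstration in a scratch directory with a placeholder key (not a real key).
demo_dir=$(mktemp -d)
echo "ssh-rsa AAAAplaceholder hduser@ubuntu" > "$demo_dir/id_rsa.pub"

authorize_key "$demo_dir/id_rsa.pub" "$demo_dir/authorized_keys"
authorize_key "$demo_dir/id_rsa.pub" "$demo_dir/authorized_keys"   # second run adds nothing

# The file still holds exactly one line.
wc -l < "$demo_dir/authorized_keys"
```

On a real node the two arguments would be the hduser public key file and $HOME/.ssh/authorized_keys.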

The final step is to test the SSH setup by connecting to your local machine
with the hduser user.
This step is also needed to save your local machine's host key fingerprint to
the hduser user's known_hosts file.
If you have any special SSH configuration for your local machine, like a
non-standard SSH port, you can define host-specific SSH options
in $HOME/.ssh/config (see man ssh_config for more information).
hduser@ubuntu:~$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS

Disabling IPv6
One problem with IPv6 on Ubuntu is that the various networking-related
Hadoop configuration options can result in Hadoop
binding to IPv6 addresses.
To disable IPv6 on Ubuntu 10.04 LTS, open /etc/sysctl.conf in the editor of
your choice and add the following lines to the end of the file:
#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

You have to reboot your machine for the changes to take effect.
You can check whether IPv6 is enabled on your machine with the following
command (a value of 0 means IPv6 is enabled, a value of 1 means it is disabled):
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
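
The check can be wrapped so that it prints a readable answer. This is a small sketch only; the helper name ipv6_status is made up for this example, and the fallback branch covers machines where the /proc entry does not exist.

```shell
# ipv6_status VALUE: translate the content of disable_ipv6 into words.
# (Hypothetical helper for illustration.)
ipv6_status() {
  case "$1" in
    0) echo "enabled" ;;
    1) echo "disabled" ;;
    *) echo "unknown" ;;
  esac
}

flag=/proc/sys/net/ipv6/conf/all/disable_ipv6
if [ -r "$flag" ]; then
  echo "IPv6 is $(ipv6_status "$(cat "$flag")")"
else
  echo "IPv6 is $(ipv6_status missing)"   # no such file on this system
fi
```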

You can also disable IPv6 only for Hadoop as documented in HADOOP-3437.
You can do so by adding the following line to conf/
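
A minimal sketch of such a Hadoop-only setting follows. It assumes that the file meant is conf/hadoop-env.sh and that the intended mechanism is the standard JVM system property java.net.preferIPv4Stack; neither is spelled out on this slide.

```shell
# Assumed to be placed in conf/hadoop-env.sh (file name is an assumption).
# Ask the JVMs started by Hadoop to prefer the IPv4 stack,
# keeping any options that were already set.
export HADOOP_OPTS="${HADOOP_OPTS:-} -Djava.net.preferIPv4Stack=true"
```

This leaves IPv6 enabled for the rest of the system and only affects the Java processes that read HADOOP_OPTS.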

Hadoop Installation
Download Hadoop from the Apache Download Mirrors and extract the contents
of the Hadoop package to a location of your choice, say /usr/local/hadoop.
Make sure to change the owner of all the files to the hduser user
and hadoop group, for example:
$ cd /usr/local
$ sudo tar xzf hadoop-xxxx.tar.gz
$ sudo mv hadoop-xxxxx hadoop
$ sudo chown -R hduser:hadoop hadoop

Alternatively, keep the versioned directory name and create a symlink named hadoop that points to hadoop-xxxxx.

Update $HOME/.bashrc
Add the following lines to the end of the $HOME/.bashrc file of user hduser.
If you use a shell other than bash, you should of course update its
appropriate configuration files instead of .bashrc.
# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-6-sun

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
# Requires installed 'lzop' command.
lzohead () {
    hadoop fs -cat "$1" | lzop -dc | head -1000 | less
}

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin

Configuration files

The $HADOOP_INSTALL/hadoop/conf directory contains some configuration files
for Hadoop. These are:
The environment script - This file contains some environment variable settings used by
Hadoop. You can use these to affect some aspects of Hadoop daemon behavior,
such as where log files are stored, the maximum amount of heap used etc. The
only variable you should need to change in this file is JAVA_HOME, which
specifies the path to the Java installation used by Hadoop.
slaves - This file lists the hosts, one per line, where the Hadoop slave daemons
(datanodes and tasktrackers) will run. By default this contains a single entry.
hadoop-default.xml - This file contains generic default settings for Hadoop daemons
and Map/Reduce jobs. Do not modify this file.
mapred-site.xml - This file contains site-specific settings for the Hadoop
Map/Reduce daemons and jobs. The file is empty by default. Putting
configuration properties in this file will override Map/Reduce settings in the
hadoop-default.xml file. Use this file to tailor the behavior of Map/Reduce on
your site.
core-site.xml - This file contains site-specific settings for all Hadoop daemons
and Map/Reduce jobs. This file is empty by default. Settings in this file override
those in hadoop-default.xml and mapred-default.xml. This file should contain
settings that must be respected by all servers and clients in a Hadoop
installation, for instance, the location of the namenode and the jobtracker.

Configuration: Single node
The only required environment variable we have to configure for Hadoop
in this case is JAVA_HOME.
Open etc/hadoop/conf/ in the editor of your choice and
set the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory:
export JAVA_HOME=/usr/lib/jvm/java-6-sun

We then configure the following files: conf/core-site.xml,
conf/mapred-site.xml and conf/hdfs-site.xml.



Configure HDFS
We will configure the directory where Hadoop will store its data files, the
network ports it listens to, etc.
Our setup will use Hadoop's Distributed File System, HDFS, even though our
little cluster only contains our single local machine.
You can leave the settings below as is, with the exception of
the hadoop.tmp.dir variable, which you have to change to the directory of
your choice.
We will use the directory /app/hadoop/tmp.
Hadoop's default configurations use hadoop.tmp.dir as the base temporary
directory both for the local file system and HDFS, so don't be surprised if you
see Hadoop creating the specified directory automatically on HDFS at some
later point.
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
# ...and if you want to tighten up security, chmod from 755 to 750...
$ sudo chmod 750 /app/hadoop/tmp

<!-- In: conf/core-site.xml -->
<description>A base for other temporary directories.</description>

<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>

<!-- In: conf/mapred-site.xml -->
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.</description>

<!-- In: conf/hdfs-site.xml -->
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.</description>
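
The descriptions above belong to property entries whose names and values are not shown on the slide. The sketch below writes one plausible set of single-node files into a scratch directory so the overall shape is visible. The property names (hadoop.tmp.dir, fs.default.name, mapred.job.tracker, dfs.replication) follow Hadoop 1.x conventions, and the localhost addresses and the replication factor of 1 are assumptions for a one-machine setup; write_site is a hypothetical helper.

```shell
# write_site FILE NAME VALUE [NAME VALUE ...]
# Emit a Hadoop-style site file with one <property> per NAME/VALUE pair.
# (Hypothetical helper; values below are assumptions for a single node.)
write_site() {
  out=$1
  shift
  {
    echo '<?xml version="1.0"?>'
    echo '<configuration>'
    while [ "$#" -ge 2 ]; do
      printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
      shift 2
    done
    echo '</configuration>'
  } > "$out"
}

conf_dir=$(mktemp -d)   # scratch directory standing in for conf/

write_site "$conf_dir/core-site.xml" \
  hadoop.tmp.dir /app/hadoop/tmp \
  fs.default.name hdfs://localhost:54310
write_site "$conf_dir/mapred-site.xml" mapred.job.tracker localhost:54311
write_site "$conf_dir/hdfs-site.xml" dfs.replication 1

cat "$conf_dir/core-site.xml"
```

The description elements are left out here for brevity; in the real files each property can carry one.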

Formatting the HDFS and Starting

To format the filesystem (which simply initializes the directory specified by
the variable), run the command
hadoop namenode -format

Run the start script: this will start up a NameNode, DataNode, JobTracker and a
TaskTracker on your machine.
Run the stop script to stop all processes.
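
One way to confirm that all four daemons came up is to look at the output of jps, the JDK tool that lists running Java processes. The sketch below checks such a listing for the expected daemon names. The helper missing_daemons and the sample listing are made up for illustration; on a real node you would pass "$(jps)" instead of the sample.

```shell
# missing_daemons JPS_OUTPUT: print each expected daemon that is absent.
# (Hypothetical helper; the daemon list matches the four named above.)
missing_daemons() {
  for d in NameNode DataNode JobTracker TaskTracker; do
    printf '%s\n' "$1" | grep -qw "$d" || echo "$d"
  done
}

# Made-up jps listing in which the TaskTracker did not start.
sample='2085 DataNode
2349 Jps
1788 NameNode
2149 JobTracker
2170 SecondaryNameNode'

missing_daemons "$sample"
```

An empty result means every expected daemon appears in the listing.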

Download example input data

Create a directory /home//gutenberg and place the following example texts in it:
The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
The Notebooks of Leonardo Da Vinci
Ulysses by James Joyce

Copy local example data to HDFS

hdfs dfs -copyFromLocal gutenberg gutenberg
hadoop dfs -ls gutenberg

Run the MapReduce job

Now we run the WordCount example job:
hadoop jar /usr/lib/hadoop/hadoop-xxxx-example.jar wordcount gutenberg gutenberg-out
This command will
read all the files in the HDFS directory /user/cloudera/gutenberg,
process them, and
store the result in the HDFS directory /user/cloudera/gutenberg-out
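
To make the job's output concrete, the following sketch computes the same kind of word/count pairs locally with standard Unix tools. It only illustrates what WordCount produces, not how Hadoop executes it; the sample sentence and the function name local_wordcount are made up for this example.

```shell
# local_wordcount: read text on stdin, print "word<TAB>count" lines.
# (Hypothetical helper; splits on whitespace only.)
local_wordcount() {
  tr -s '[:space:]' '\n' | sed '/^$/d' | sort | uniq -c | awk '{ print $2 "\t" $1 }'
}

printf 'to be or not to be\n' | local_wordcount
```

Each output line has the same word-then-count layout as the lines in a part-r-00000 file.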

Check if the result is successfully stored in the HDFS directory gutenberg-out

hdfs dfs -ls gutenberg-out

Retrieve the job result from HDFS

hdfs dfs -cat gutenberg-out/part-r-00000

hdfs dfs -cat gutenberg-out/part-r-00000 | sort -nk 2,2 -r | less
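
The sort options in the last command can be tried without a cluster. The sketch below applies them to a made-up three-line sample in the same word-then-count layout.

```shell
# Made-up sample in the "word<TAB>count" layout of part-r-00000.
sample=$(printf 'the\t3\nof\t10\na\t7\n')

# -n: compare numerically, -k 2,2: sort on the second field only,
# -r: reverse the order so the largest count comes first.
printf '%s\n' "$sample" | sort -nk 2,2 -r
```

Without -n the counts would be compared as text, which would place 10 before 3 and 7 in ascending order.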

Hadoop Web Interfaces

Hadoop comes with several web interfaces which are by default
(see conf/hadoop-default.xml) available at these locations:
http://localhost:50030/ - web UI for MapReduce job tracker(s)
http://localhost:50060/ - web UI for task tracker(s)
http://localhost:50070/ - web UI for HDFS name node(s)

Cluster setup
Basic idea
Use Bitvise
SSH port

[Figure: what we have done so far: two single-node setups]





Calling by name
Now that you have two single-node clusters up and running, we will modify
the Hadoop configuration to make
one Ubuntu box the master (which will also act as a slave) and
the other Ubuntu box a slave.
We will call the designated master machine just the master from now on and
the slave-only machine the slave.
We will also give the two machines these respective hostnames in their
networking setup, most notably in /etc/hosts.

If the hostnames of your machines are different (e.g. node01) then you must
adapt the settings as appropriate.

Connect both machines via a single hub or switch and configure the network
interfaces to use a common network such as 192.168.0.x/24.
To keep it simple, we will assign one IP address on this network to the master
machine and one to the slave machine.
Update /etc/hosts on both machines with the following lines:
# /etc/hosts (for master AND slave)
<IP address of master>    master
<IP address of slave>     slave
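
As an illustration of the hosts entries, the sketch below adds them to a scratch file rather than the real /etc/hosts. The addresses 192.168.0.1 and 192.168.0.2 are assumed values inside the 192.168.0.x/24 network mentioned above, and add_host is a hypothetical helper.

```shell
# add_host HOSTS_FILE IP NAME: append "IP<TAB>NAME" unless NAME is already listed.
# (Hypothetical helper for illustration.)
add_host() {
  grep -qE "[[:space:]]$3([[:space:]]|\$)" "$1" 2>/dev/null ||
    printf '%s\t%s\n' "$2" "$3" >> "$1"
}

hosts_file=$(mktemp)   # scratch copy; on a real node this would be /etc/hosts (needs root)

add_host "$hosts_file" 192.168.0.1 master   # assumed address
add_host "$hosts_file" 192.168.0.2 slave    # assumed address
add_host "$hosts_file" 192.168.0.1 master   # repeated call changes nothing

cat "$hosts_file"
```

The same two entries have to end up on both machines so that each can resolve the other by name.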

SSH access
The hduser user on the master (aka hduser@master) must be able to connect
a) to its own user account on the master, i.e. ssh master in this
context and not necessarily ssh localhost, and
b) to the hduser user account on the slave (aka hduser@slave) via a
password-less SSH login.
You just have to add hduser@master's public SSH key (which should be
in $HOME/.ssh/) to the authorized_keys file of hduser@slave (in
this user's $HOME/.ssh/authorized_keys).

ssh-copy-id -i $HOME/.ssh/ hduser@slave

Verify that the password-less access from the master to all slaves works:
ssh hduser@slave
ssh hduser@master

What the final multi-node cluster will look like

Naming again
The master node will run the master daemons for each layer:
NameNode for the HDFS storage layer, and
JobTracker for the MapReduce processing layer

Both machines will run the slave daemons:
DataNode for the HDFS layer, and
TaskTracker for the MapReduce processing layer

The master daemons are responsible for coordination and management of
the slave daemons, while the latter do the actual data storage and data
processing work.
Typically one machine in the cluster is designated as the NameNode and
another machine as the JobTracker, exclusively.
These are the actual master nodes.
The rest of the machines in the cluster act as both DataNode and TaskTracker.
These are the slaves or worker nodes.

conf/masters (master only)

The conf/masters file defines on which machines Hadoop will
start secondary NameNodes in our multi-node cluster.

In our case, this is just the master machine.

The primary NameNode and the JobTracker will always be the
machines on which you run the respective HDFS and MapReduce start scripts in bin/.
The primary NameNode and the JobTracker will be started on
the same machine if you run the combined start script in bin/.
On master, update conf/masters so that it looks like this:
master

conf/slaves (master only)

This conf/slaves file lists the hosts, one per line, where the Hadoop slave
daemons (DataNodes and TaskTrackers) will be run.
We want both the master box and the slave box to act as Hadoop slaves
because we want both of them to store and process data.
On master, update conf/slaves so that it looks like this:
master
slave

If you have additional slave nodes, just add them to the conf/slaves file, one
per line (do this on all machines in the cluster).

conf/*-site.xml (all machines)

You have to change the configuration files conf/core-site.xml,
conf/mapred-site.xml and conf/hdfs-site.xml on ALL machines:
The default file system name (in conf/core-site.xml): A URI whose scheme and
authority determine the FileSystem implementation. Set as hdfs://master:54310
mapred.job.tracker: The host and port that the MapReduce job tracker runs at.
Set as master:54311
dfs.replication: Default block replication. Set as 2
mapred.local.dir: Determines where temporary MapReduce data is written. It
also may be a list of directories.
The number of map tasks: As a rule of thumb, use 10x the number of slaves (i.e.,
number of TaskTrackers).
mapred.reduce.tasks: As a rule of thumb, use 2x the number of slave processors
(i.e., number of TaskTrackers).
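
The two rules of thumb amount to simple arithmetic. The sketch below works them out for a given number of slaves, reading "slave processors" as the number of TaskTrackers, as the parenthetical suggests; that reading and the helper name suggest_tasks are assumptions.

```shell
# suggest_tasks N_SLAVES: print rule-of-thumb task counts
# (10x the slaves for map tasks, 2x the slaves for reduce tasks).
suggest_tasks() {
  echo "map tasks: $((10 * $1))"
  echo "reduce tasks: $((2 * $1))"
}

# The two-machine cluster of this section has two TaskTrackers.
suggest_tasks 2
```

For the master-plus-slave cluster described here this gives 20 map tasks and 4 reduce tasks as starting values to tune from.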

Formatting the HDFS and Starting

To format the filesystem (which simply initializes the directory
specified by the variable on the NameNode), run
the command
hdfs namenode -format

Starting the multi-node cluster
Starting the cluster is done in two steps.
First, the HDFS daemons are started:
NameNode daemon is started on master, and
DataNode daemons are started on all slaves (here: master and slave)

Second, the MapReduce daemons are started:

JobTracker is started on master, and
TaskTracker daemons are started on all slaves (here: master and slave)

To stop the cluster, run the corresponding stop scripts, one followed by the other.

Run the PiEstimator example

hadoop jar /usr/lib/hadoop/hadoop-xxxxx-example.jar pi 2 100000

Day 1: Hadoop Deployment and Configuration - Single machine and a cluster

