Installation Steps
Install Java
Install ssh and sshd
$ gunzip hadoop-0.18.0.tar.gz
$ tar xvf hadoop-0.18.0.tar
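The steps above can be sketched as one sequence. This is a hedged sketch, not the course's exact commands: the JAVA_HOME path is an assumption for illustration, and conf/hadoop-env.sh is where Hadoop 0.18 expects the JDK location to be exported.

```shell
# Unpack the Hadoop 0.18.0 tarball (assumes it is already downloaded)
gunzip hadoop-0.18.0.tar.gz
tar xvf hadoop-0.18.0.tar
cd hadoop-0.18.0
# Point Hadoop at a JDK; /usr/lib/jvm/java is a placeholder path --
# substitute wherever your JDK actually lives.
echo 'export JAVA_HOME=/usr/lib/jvm/java' >> conf/hadoop-env.sh
```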
Additional Configuration
conf/masters
contains the hostname of the SecondaryNameNode
It should be a fully-qualified domain name.
conf/slaves
contains the hostname of every machine in the cluster that
should start the TaskTracker and DataNode daemons
Ex:
slave01
slave02
slave03
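For comparison, conf/masters typically holds a single line naming the SecondaryNameNode host; the hostname below is a placeholder, not one from the course:

```
master01.example.com
```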
Advanced Configuration
enable passwordless ssh
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
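After the two commands above, it is worth verifying the key actually works; this is a hedged sketch of a common check, not part of the original steps:

```shell
# sshd often rejects authorized_keys files with loose permissions
chmod 600 ~/.ssh/authorized_keys
# Should log in and print the hostname without prompting for a password
ssh localhost hostname
```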
Advanced Configuration
Various directories should be created on each
node
The NameNode requires the NameNode metadata
directory
$ mkdir -p /home/hadoop/dfs/name
Advanced Configuration..
bin/slaves.sh allows a command to be
executed on all nodes in the slaves file.
$ mkdir -p /tmp/hadoop
$ export HADOOP_CONF_DIR=${HADOOP_HOME}/conf
$ export HADOOP_SLAVES=${HADOOP_CONF_DIR}/slaves
$ ${HADOOP_HOME}/bin/slaves.sh "mkdir -p /tmp/hadoop"
$ ${HADOOP_HOME}/bin/slaves.sh "mkdir -p /home/hadoop/dfs/data"
Format HDFS
$ bin/hadoop namenode -format
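Once HDFS is formatted, the natural next step (assumed here, not stated in the slides) is to start the daemons from the NameNode using the standard Hadoop 0.18 scripts, which ssh into each host listed in conf/slaves:

```shell
# Starts the NameNode and DataNodes (plus the SecondaryNameNode
# on the host listed in conf/masters)
bin/start-dfs.sh
# Starts the JobTracker and TaskTrackers
bin/start-mapred.sh
```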
Selecting Machines
Hadoop is designed to take advantage of
whatever hardware is available
Hadoop jobs written in Java can consume
between 1 and 2 GB of RAM per core
If you use HadoopStreaming to write your jobs
in a scripting language such as Python, more
memory may be advisable.
Cluster Configurations
Small Clusters: 2-10 Nodes
Medium Clusters: 10-40 Nodes
Large Clusters: Multiple Racks
configuration in conf/hadoop-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>head.server.node.com:9001</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://head.server.node.com:9000</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hadoop/dfs/data</value>
<final>true</final>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/hadoop/dfs/name</value>
<final>true</final>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop</value>
<final>true</final>
</property>
<property>
<name>mapred.system.dir</name>
<value>/hadoop/mapred/system</value>
<final>true</final>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>
NameNode Backup
The cluster's hadoop-site.xml file should then
instruct the NameNode to write to this
directory as well:
<property>
<name>dfs.name.dir</name>
<value>/home/hadoop/dfs/name,/mnt/namenode-backup</value>
<final>true</final>
</property>
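The backup directory must exist before the NameNode starts; the assumption here (not stated above) is that /mnt/namenode-backup is an NFS export from another machine, so the metadata survives loss of the NameNode's local disk:

```shell
# Create the second dfs.name.dir location on the (assumed) NFS mount
mkdir -p /mnt/namenode-backup
```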
conf/hadoop-site.xml
Nodes must be decommissioned on a schedule that allows the
blocks they store to be re-replicated to other nodes.
conf/hadoop-site.xml
<property>
<name>dfs.hosts.exclude</name>
<value>/home/hadoop/excludes</value>
<final>true</final>
</property>
<property>
<name>mapred.hosts.exclude</name>
<value>/home/hadoop/excludes</value>
<final>true</final>
</property>
create an empty file with this name:
$ touch /home/hadoop/excludes
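With the excludes file in place, decommissioning a node can be sketched as below; the hostname is a placeholder, and dfsadmin -refreshNodes is the standard command that makes the NameNode re-read its host lists:

```shell
# List the node to retire in the excludes file (placeholder hostname)
echo "slave03" >> /home/hadoop/excludes
# Tell the NameNode to re-read dfs.hosts.exclude and begin
# re-replicating that node's blocks elsewhere
bin/hadoop dfsadmin -refreshNodes
```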
Replication Setting
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
Tutorial
Configure a Hadoop cluster on two nodes.
Tutorial-Installed Hadoop in Cluster.docx
Performance Monitoring
Ganglia
Nagios
Ganglia
performance monitoring framework for
distributed systems
collects metrics on individual machines and
forwards them to an aggregator
designed to be integrated into other
applications
Ganglia
Install and configure Ganglia
create a file named hadoop-metrics.properties
in the $HADOOP_HOME/conf directory
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=localhost:8649
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period=10
mapred.servers=localhost:8649
Nagios
a machine and service monitoring system
designed for large clusters
provides useful diagnostic information for
tuning your cluster, including network, disk,
and CPU utilization across machines.
Tutorial
Install Ganglia/Nagios and monitor Hadoop
Tutorial-MonitorHadoopWithGanglia.docx