
Managing a Hadoop Cluster

Topology of a typical Hadoop cluster.

Installation Steps
Install Java
Install ssh and sshd
Unpack the release: gunzip hadoop-0.18.0.tar.gz,
then tar xvf hadoop-0.18.0.tar

Set JAVA_HOME in conf/hadoop-env.sh

Modify hadoop-site.xml
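
A minimal sketch of the unpack and JAVA_HOME steps on one node, assuming
the tarball sits in /home/hadoop and the JDK lives under /usr/lib/jvm/java
(both paths are assumptions; adjust for your system):

$ cd /home/hadoop
$ gunzip hadoop-0.18.0.tar.gz
$ tar xvf hadoop-0.18.0.tar
$ cd hadoop-0.18.0
# point Hadoop at the JDK (path is an example)
$ echo 'export JAVA_HOME=/usr/lib/jvm/java' >> conf/hadoop-env.sh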

Hadoop Installation Flavors


Standalone
Pseudo-distributed
Fully distributed: clusters of multiple nodes

Additional Configuration
conf/masters
contains the hostname of the SecondaryNameNode
It should be a fully-qualified domain name.

conf/slaves
contains the hostname of every machine in the cluster
that should run the TaskTracker and DataNode daemons
Ex:
slave01
slave02
slave03
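
For illustration, conf/masters would then contain a single fully-qualified
hostname (the name below is hypothetical):

secondary.server.node.com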

Advanced Configuration
enable passwordless ssh
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

The ~/.ssh/id_dsa.pub and authorized_keys files should be
replicated on all machines in the cluster.
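
A simple way to replicate the key files, assuming every hostname in
conf/slaves is reachable over ssh and already has a ~/.ssh directory
(this loop is a sketch, not part of the original steps):

$ for host in $(cat conf/slaves); do
    scp ~/.ssh/id_dsa.pub ~/.ssh/authorized_keys ${host}:~/.ssh/
  done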

Advanced Configuration
Various directories should be created on each
node
The NameNode requires the NameNode metadata
directory
$ mkdir -p /home/hadoop/dfs/name

Every node needs the Hadoop tmp directory and
DataNode directory created

Advanced Configuration..
bin/slaves.sh allows a command to be
executed on all nodes in the slaves file.
$ mkdir -p /tmp/hadoop
$ export HADOOP_CONF_DIR=${HADOOP_HOME}/conf
$ export HADOOP_SLAVES=${HADOOP_CONF_DIR}/slaves
$ ${HADOOP_HOME}/bin/slaves.sh "mkdir -p /tmp/hadoop"
$ ${HADOOP_HOME}/bin/slaves.sh "mkdir -p /home/hadoop/dfs/data"

Format HDFS
$ bin/hadoop namenode -format

start the cluster:
$ bin/start-all.sh
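
A quick way to confirm the daemons came up, assuming the JDK's jps tool
is on the PATH (this check is an addition, not part of the original steps):

$ jps
# on the master you should see NameNode, JobTracker (and SecondaryNameNode);
# on the slaves, DataNode and TaskTracker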

Selecting Machines
Hadoop is designed to take advantage of
whatever hardware is available
Hadoop jobs written in Java can consume
between 1 and 2 GB of RAM per core
If you use Hadoop Streaming to write your jobs
in a scripting language such as Python, more
memory may be advisable.

Cluster Configurations
Small Clusters: 2-10 Nodes
Medium Clusters: 10-40 Nodes
Large Clusters: Multiple Racks

Small Clusters: 2-10 Nodes


In a two-node cluster:
one node runs the NameNode/JobTracker and a
DataNode/TaskTracker;
the other node runs a DataNode/TaskTracker.

Clusters of three or more machines typically
use a dedicated NameNode/JobTracker, and
all other nodes are workers.

configuration in conf/hadoop-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>head.server.node.com:9001</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://head.server.node.com:9000</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/hadoop/dfs/data</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/dfs/name</value>
    <final>true</final>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/hadoop/mapred/system</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
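
Once this file is in place on every node and the cluster is started, a
quick sanity check (an addition to the original slides) is to ask the
NameNode for a cluster report:

$ bin/hadoop dfsadmin -report
# should list one live DataNode per worker and the configured capacity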

Medium Clusters: 10-40 Nodes


The single point of failure in a Hadoop cluster
is the NameNode.
Hence, back up the NameNode metadata.
One machine in the cluster should be designated
as the NameNode's backup.
It does not run the normal Hadoop daemons;
it exposes a directory via NFS which is
mounted only on the NameNode.
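
As a sketch of that NFS arrangement (the backup hostname and export path
below are assumptions, not from the slides): export a directory on the
backup machine and mount it on the NameNode at the path used in
dfs.name.dir.

# on the backup machine, exporting only to the NameNode host
$ echo '/export/namenode-backup head.server.node.com(rw,sync)' >> /etc/exports
$ exportfs -a

# on the NameNode
$ mkdir -p /mnt/namenode-backup
$ mount -t nfs backup.server.node.com:/export/namenode-backup /mnt/namenode-backup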

NameNode Backup
The cluster's hadoop-site.xml file should then
instruct the NameNode to write to this
directory as well:
<property>
<name>dfs.name.dir</name>
<value>/home/hadoop/dfs/name,/mnt/namenode-backup</value>
<final>true</final>
</property>

Decommissioning Nodes
Nodes must be decommissioned on a schedule that permits
replication of blocks being decommissioned.
Add to conf/hadoop-site.xml:
<property>
<name>dfs.hosts.exclude</name>
<value>/home/hadoop/excludes</value>
<final>true</final>
</property>
<property>
<name>mapred.hosts.exclude</name>
<value>/home/hadoop/excludes</value>
<final>true</final>
</property>
create an empty file with this name:
$ touch /home/hadoop/excludes
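
To actually decommission a node (the hostname below is just an example),
add it to the excludes file and tell the NameNode to re-read its host lists:

$ echo slave03 >> /home/hadoop/excludes
$ bin/hadoop dfsadmin -refreshNodes
# wait until the node is reported as decommissioned before shutting it down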

Replication Setting
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
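
This setting applies to files written after it takes effect; for files
already in HDFS, the replication factor can be changed explicitly
(the path below is just an example):

$ bin/hadoop fs -setrep -R 3 /user/hadoop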

Tutorial
Configure a Hadoop cluster on two nodes.
Tutorial-Installed Hadoop in Cluster.docx

Performance Monitoring
Ganglia
Nagios

Ganglia
performance monitoring framework for
distributed systems
collects metrics on individual machines and
forwards them to an aggregator
designed to be integrated into other
applications

Ganglia
Install and configure Ganglia
Create a file named hadoop-metrics.properties
in the $HADOOP_HOME/conf directory:
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=localhost:8649
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period=10
mapred.servers=localhost:8649
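
The metrics configuration is read at daemon startup, so restart the
daemons after creating the file (a step implied but not shown on the slide):

$ bin/stop-all.sh
$ bin/start-all.sh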

Nagios
a machine and service monitoring system
designed for large clusters
provides useful diagnostic information for
tuning your cluster, including network, disk,
and CPU utilization across machines

Tutorial
Install Ganglia/Nagios and monitor Hadoop
Tutorial-MonitorHadoopWithGanglia.docx
