
Hadoop Administrator Training Lab Hand Book

List of Lab Exercises


Lab VM Details
Lab 1 : Hadoop configuration
Lab 2 : HDFS Lab
Lab 3 : Map Reduce Word Count
Lab 4 : Map Reduce Lab Total Transactions By Each Product
Lab 5 : HDFS Monitoring
Lab 6 : Job Tracker Monitoring
Lab 7 : Sqoop Lab : Export and Import of data
Lab 8 : Hive configuration
Lab 9 : Hive Programming
Lab 10: Pig Configuration
Lab 11 : Pig Programming
Lab VM Details
A. Linux version: Ubuntu 12.04
B. User name: user   Password: password
C. Super user password: password
D. Create these useful directories in the VM

The following sub-directories should be created under /home/user


Directory Name    Description

Downloads         Contains all installables (Hadoop, Pig, Hive, Sqoop, HBase) which we will download
Lab               For all lab activities
Lab/hdfs          For hdfs-related configuration contents
Lab/mapred        For mapred-related configuration contents
Lab/software      Folder for installing Hadoop, Hive, Pig and Sqoop
Lab/data          Input files for lab exercises
Lab/programs      For all Map Reduce programs

Lab 1 : Hadoop Configuration


All directory paths below are under the home directory /home/user.
A. Check whether ssh is already configured by typing ssh localhost. If it fails, install
the SSH server by typing the following in the Terminal window:
sudo apt-get install openssh-server
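
The Hadoop start scripts log in to localhost over ssh, so ssh localhost should also work
without a password prompt. If it still asks for a password, a key pair set up along these
lines (a standard OpenSSH recipe, not specific to this VM) usually fixes it:

ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost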

B. We need to download the following tarball installations from the relevant project sites.
a) hadoop
b) pig
c) hive
d) sqoop
e) hbase
C. Untar the Hadoop tarball
o Go to lab/software in the Terminal window of the VM.

o Untar the Hadoop files into the software folder:

tar -xvf ../../Downloads/hadoop-1.0.3.tar.gz [note the space after tar]

o Browse through the directories and check which subdirectory contains which files

D. Create a new file called .bash_profile [yes, the name starts with a dot] in the
/home/user directory.

E. Install OpenJDK in Ubuntu by entering the below command in the Terminal window
sudo apt-get install openjdk-6-jdk

F. To edit files in a graphical text editor, download WinSCP [an SFTP tool for editing and
moving files on the VM]. Type ifconfig in the Terminal window to get the IP address of
the VM, enter it in the WinSCP host name field along with the username and password,
and connect.
Enter the following settings in .bash_profile:

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386
export HADOOP_INSTALL=/home/user/lab/software/hadoop-1.0.3
export PATH=$PATH:$HADOOP_INSTALL/bin

Save and exit .bash_profile.

Run the following command to load the settings into the current shell:

. .bash_profile

Verify whether the variables are defined by typing export (or env) at the
command prompt.

Check the installed versions:

java -version
hadoop version

(hadoop version should report 1.0.3.)

b. Create the directories


Create the following directories under lab/hdfs

mkdir namenodep
mkdir datan1
mkdir checkp

Change permissions for the data directory under lab/hdfs

chmod 755 datan1

Create the following directories under lab/mapred

mkdir local1
mkdir system
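
As an optional shortcut, the same directory layout can be created in one command from
/home/user (assuming the default bash shell, which supports brace expansion):

mkdir -p lab/hdfs/{namenodep,datan1,checkp} lab/mapred/{local1,system}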

c. Configuring pseudo-distributed mode


Go to the conf directory under HADOOP_HOME (/home/user/lab/software/hadoop-1.0.3)

Modify core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
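<!-- fs.default.name: the URI clients use to reach the HDFS namenode -->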
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:8020</value>
<final>true</final>
</property>
</configuration>

Modify hdfs-site.xml
<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
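<!-- dfs.name.dir holds namenode metadata, dfs.data.dir holds datanode
     blocks, dfs.checkpoint.dir holds secondary-namenode checkpoints;
     dfs.replication is 1 because this is a single-node cluster -->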
<property>
<name>dfs.name.dir</name>
<value>/home/user/lab/hdfs/namenodep</value>
<final>true</final>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/user/lab/hdfs/datan1</value>
<final>true</final>
</property>
<property>
<name>dfs.checkpoint.dir</name>
<value>/home/user/lab/hdfs/checkp</value>
<final>true</final>
</property>
</configuration>

Modify mapred-site.xml

<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
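<!-- mapred.job.tracker: jobtracker address; the local/system dirs hold
     intermediate map output and shared job files; the *.maximum
     properties cap concurrent map and reduce tasks per tasktracker -->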
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
<final>true</final>
</property>
<property>
<name>mapred.local.dir</name>
<value>/home/user/lab/mapred/local1</value>
<final>true</final>
</property>
<property>
<name>mapred.system.dir</name>
<value>/home/user/lab/mapred/system</value>
<final>true</final>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>3</value>
<final>true</final>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>3</value>
<final>true</final>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx400m</value>
<!-- Not marked as final so jobs can include JVM debugging options -->
</property>
</configuration>

d. Format the namenode

Enter the following command at the prompt:

hadoop namenode -format [Note: if prompted to confirm a re-format, answer with an
uppercase Y, not a lowercase y]

Go to the namenodep directory and check which folders and files have been created.

[Not all folders/files exist at this point; some are created only once the cluster
is started up, not at the beginning]
e. Start HDFS services
Go to the conf directory under HADOOP_HOME

Edit hadoop-env.sh and set JAVA_HOME:

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386

[Use the same JDK path as in .bash_profile; on a 64-bit VM this is java-6-openjdk-amd64]

Go to the bin directory under HADOOP_HOME and type the following command:

./start-dfs.sh

Run jps and verify that the NameNode, DataNode and SecondaryNameNode processes are running.

f. Start Map Reduce services

Go to the bin directory under HADOOP_HOME and type the following command:

./start-mapred.sh

Run jps and verify that the JobTracker and TaskTracker processes are now also running.

If all five processes (NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker)
are running, then Hadoop is up and running.


Lab 2 : HDFS Lab
A. Create an input and output directory under hdfs for all input and output files

hadoop fs -mkdir /input


hadoop fs -mkdir /output

B. Check directories

hadoop fs -ls /

C. Copy files from the local system to hdfs and check that the files are copied

hadoop fs -copyFromLocal /home/user/lab/data/txns /input


hadoop fs -copyFromLocal /home/user/lab/data/custs /input
hadoop fs -ls /input

D. Go to datan1 and check how the files are split and stored as multiple blocks [Hint: check
the sizes of the files in the folder called current, where the blk files are stored]
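
For example (the path assumes the directory layout created in Lab 1):

ls -lh /home/user/lab/hdfs/datan1/current

The blk_* files are the raw block data; with the default 64 MB block size, a file larger
than 64 MB will show up as several of them.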

Lab 3 : Map Reduce Word Count


A. Switch the workspace to a known folder.
B. Open Eclipse and create a new Java Project called MRLab

C. Hint : File -> New -> Other... -> Java Project


D. Create a package com.evenkat under the src folder of project MRLab

E. Add the Hadoop jar files to the project


Hint : Right click on MRLab -> Properties -> Java Build Path -> Add External JARs
Add all jar files under d:\software\hadoop-1.0.3 and d:\software\hadoop-1.0.3\lib
[the folder on the host machine where the Hadoop tarball was extracted]
F. Create a class called WordCountTesting
G. The classes to be imported are:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

import java.io.IOException;
import java.util.StringTokenizer;

H. Map Reduce Program code

Note: This would be inside the class WordCountTesting.

public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);

        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, new IntWritable(1));
        }
    }
} // end of MyMapper class

public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {

        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        // Write the total once per key, after all values have been summed
        context.write(key, new IntWritable(sum));
    }
} // end of MyReducer class
I. Driver Code [Note: the driver code below also goes inside the class WordCountTesting]

public static void main(String[] args) throws Exception {

    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

    Job job = new Job(conf, "Word Counter");

    job.setJarByClass(WordCountTesting.class);
    job.setMapperClass(MyMapper.class);
    job.setReducerClass(MyReducer.class);

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
J. Create the jar file
Right click on MRLab -> Export -> Java -> JAR file (give the name WordCount.jar). DO NOT
CREATE A MAIN ENTRY CLASS NOW.
K. Transfer the jar file to the VM under /home/user/lab/programs
Hint : Use the WinSCP software to copy the jar file to the Linux VM

L. Create a words file under the /home/user/lab/data directory and write a few lines of text
in the file (Hint: copy some paragraphs from your favourite website)
M. Copy the words file to hdfs under the input folder, using the command shown below
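
This follows the same copyFromLocal pattern as Lab 2 (assuming the file was saved with
the name words):

hadoop fs -copyFromLocal /home/user/lab/data/words /input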
N. Go to /home/user/lab/programs and run the job
O. hadoop jar WordCount.jar com.evenkat.WordCountTesting /input/words
/output/wcount

P. Check output
hadoop fs -cat /output/wcount/part-r-00000 [one part-r-* file is produced per reducer]
Lab 4 : Map Reduce Lab Total Transactions by Each Product
Input File : txns
Txnid, date, custid, amount, product category, sub category, city, state, credit or cash
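
The handbook gives only the run command for this lab, so the following is a minimal sketch
of what the SortingDriver class might contain, written against the same Hadoop 1.0.3 API
used in Lab 3. The package and class name are taken from the run command below; the mapper
and reducer names, the comma-separated parsing of txns, and the choice of counting by the
product category field are assumptions.

package com.evenkat;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortingDriver {

    // Mapper: emit (product category, 1) for every transaction line.
    // Assumption: txns is comma-separated; field 5 (index 4) is the product category.
    public static class TxnMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text product = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length > 4) {
                product.set(fields[4]);
                context.write(product, one);
            }
        }
    }

    // Reducer: sum the counts to get the total transactions per product category
    public static class TxnReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Transactions By Product");
        job.setJarByClass(SortingDriver.class);
        job.setMapperClass(TxnMapper.class);
        job.setReducerClass(TxnReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Build and export this as TxnSorting.jar exactly as in Lab 3 (steps J and K) and copy it to
/home/user/lab/programs.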
Run the map reduce program as follows:

hadoop jar TxnSorting.jar com.evenkat.SortingDriver /input/txns /output/pbyamt

Check the output files:

hadoop fs -ls /output/pbyamt
hadoop fs -cat /output/pbyamt/part-r-00000

Lab 5 : HDFS Monitoring


A. HDFS filesystem statistics

hadoop dfsadmin -report

Gives you a detailed report of the hdfs system, including:

Total capacity allocated, used and available
Number of files and blocks
Total number of under-replicated or missing blocks

B. Checking the health of files in HDFS

hadoop fsck /
hadoop fsck /input/txns -files -blocks

The first command gives you a detailed health report of all files in hdfs; the second
reports on the specified file only (-files lists each file checked, -blocks prints its
block details). The report includes:

Total number of blocks and their size
Under-replicated or missing blocks, if any

C. HDFS Web UI

Open your browser and enter the following url:

http://<ip address of the VM>:50070/

Lab 6 : Job Tracker Monitoring

Open your browser and enter the following url:

http://<ip address of the VM>:50030/

The Job Tracker page shows the cluster's map and reduce slot capacity and the lists of
running, completed, retired and failed jobs; click a job id to drill into its map and
reduce tasks.
