
Ex.No: 1.a)
Data structures in Java - Stack

OBJECTIVE: To implement Stack in java

DESCRIPTION:

Stack is a subclass of Vector that implements a standard last-in, first-out (LIFO) stack.
Stack defines only the default constructor, which creates an empty stack. Stack includes
all the methods defined by Vector and adds several of its own. Java provides this
predefined Stack class in the java.util package with methods such as push(), pop(),
peek(), search() and empty().

PROGRAM:

import java.util.*;
public class stackpro {
public static void main(String[] args) {
Stack<Integer> s=new Stack<Integer>();
Scanner sc=new Scanner(System.in);
int i;
do{
System.out.println("1:push");
System.out.println("2:pop");
System.out.println("3:peek");
System.out.println("4:search");
System.out.println("5:isEmpty");
System.out.println("Enter the choice");
i=sc.nextInt();
switch(i)
{
case 1:
System.out.println("Enter the element:");
int x=sc.nextInt();
s.push(x);
System.out.println("stack is "+s);
break;
case 2:
int y=s.pop();
System.out.println("The value popped is "+y);
break;
case 3:
int z=s.peek();
System.out.println("The peek element is "+z);
break;
case 4:
System.out.println("Enter the element to be searched");
int b=sc.nextInt();
int a=s.search(b);//returns the 1-based position from the top, or -1 if not found
if(a==-1)
System.out.println("Element is not available");
else
System.out.println("Element is available at position "+a);
break;
case 5:
System.out.println("The stack is empty: "+s.empty());
break;
case 6:
System.exit(0);
}
}while(i<=6);
}
}

OUTPUT:

1:push
2:pop
3:peek
4:search
5:isEmpty
Enter the choice
1
Enter the element:
10
stack is [10]
1:push
2:pop
3:peek
4:search
5:isEmpty
Enter the choice
1
Enter the element:
20
stack is [10, 20]
1:push
2:pop
3:peek
4:search
5:isEmpty
Enter the choice
1
Enter the element:
30
stack is [10, 20, 30]
1:push
2:pop
3:peek
4:search
5:isEmpty
Enter the choice
3
The peek element is 30
1:push
2:pop
3:peek
4:search
5:isEmpty
Enter the choice
2
The value popped is 30
1:push
2:pop
3:peek
4:search
5:isEmpty
Enter the choice
4
Enter the element to be searched
20
Element is available at position 1
1:push
2:pop
3:peek
4:search
5:isEmpty
Enter the choice
5
The stack is empty: false
1:push
2:pop
3:peek
4:search
5:isEmpty
Enter the choice
6

VIVA QUESTIONS:

1. Stack class is available in which package?


Ans: The java.util package.

2. What is the purpose of peek () method?


Ans: The peek() method returns the top value of the stack without removing it.

3. What is the purpose of pop () method?


Ans: The pop() method removes the top value from the stack and returns it.

Ex.No: 1.b)
LinkedList

OBJECTIVE: To implement the LinkedList data structure.

DESCRIPTION:

The LinkedList class extends AbstractSequentialList and implements the List
interface. It provides a linked-list data structure.
PROGRAM:

import java.util.*;
public class LinkedListDemo {

public static void main(String args[]) {


// create a linked list
LinkedList ll = new LinkedList();

// add elements to the linked list


ll.add("F");
ll.add("B");
ll.add("D");
ll.add("E");
ll.add("C");
ll.addLast("Z");
ll.addFirst("A");
ll.add(1, "A2");
System.out.println("Original contents of ll: " + ll);

// remove elements from the linked list


ll.remove("F");
ll.remove(2);
System.out.println("Contents of ll after deletion: " + ll);

// remove first and last elements


ll.removeFirst();
ll.removeLast();
System.out.println("ll after deleting first and last: " + ll);

// get and set a value


Object val = ll.get(2);
ll.set(2, (String) val + " Changed");
System.out.println("ll after change: " + ll);
}
}
OUTPUT:

Original contents of ll: [A, A2, F, B, D, E, C, Z]


Contents of ll after deletion: [A, A2, D, E, C, Z]
ll after deleting first and last: [A2, D, E, C]
ll after change: [A2, D, E Changed, C]

VIVA QUESTIONS:

1. How can you insert elements into a LinkedList?


Ans: By using add()
2. How can you insert an element into a LinkedList as the first element?
Ans: addFirst()
3. How can you add all elements of one collection to a List?
Ans: addAll(), as sketched below.
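A minimal sketch of addAll() (the extras list here is a hypothetical second collection, not part of the program above):

import java.util.*;

public class AddAllDemo {
    public static void main(String[] args) {
        LinkedList<String> ll = new LinkedList<>(Arrays.asList("A", "B"));
        List<String> extras = Arrays.asList("C", "D"); // hypothetical second collection
        ll.addAll(extras);                             // appends every element of extras
        System.out.println(ll);                        // [A, B, C, D]
    }
}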

Ex.No: 1.c)

SET

OBJECTIVE: To implement the Set data structure.

DESCRIPTION:

A Set is a Collection that cannot contain duplicate elements. It models the


mathematical set abstraction. The Set interface contains only methods inherited from
Collection and adds the restriction that duplicate elements are prohibited.

PROGRAM:

import java.util.*;

public class SetDemo {

public static void main(String[] args) {
LinkedHashSet<String> lset=new LinkedHashSet<String>();
lset.add("pratyusha");
lset.add("pratyusha");//set does not allow duplicate values.
lset.add("bindu");
lset.add("aruna");
for(String s:lset)//enhanced for loop used to iterate over the set.
{
System.out.println(s);
}
System.out.println(lset);
TreeSet<String> tset=new TreeSet<String>();//sorted order
tset.add("praneeth");
tset.add("anuradha");
tset.add("pratyusha");
System.out.println(tset);
TreeSet<Integer> set=new TreeSet<Integer>();
set.add(10);
set.add(100);
set.add(90);
set.add(18);
System.out.println(set);
HashSet<String> hset=new HashSet<String>();//random order
hset.add("pratyusha");
hset.add("anuradha");
hset.add("srinivas");
hset.add("bindu");
hset.add("vineela");
hset.add("jyothsna");
System.out.println(hset);
LinkedHashSet<Integer> a=new LinkedHashSet<Integer>();
a.add(14);
a.add(18);
a.add(28);
a.add(35);
System.out.println(a.contains(14));//contains returns a boolean value
int sum=0;
for(Integer i:a)
{
sum=sum+i;
}
System.out.println(sum);
}
}

OUTPUT:
[pratyusha, bindu, aruna]
[anuradha, praneeth, pratyusha]
[10, 18, 90, 100]
[jyothsna, vineela, anuradha, srinivas, bindu, pratyusha]
true
95

VIVA QUESTIONS:

1. What interfaces are implemented by the HashSet class? Which is the


superclass of the HashSet class?
Ans: HashSet implements three interfaces: Serializable, Cloneable and Set.
AbstractSet is the superclass of the HashSet class.
2. Difference between HashSet and TreeSet?
Ordering: HashSet stores objects in no particular order; there is no guarantee
that the element inserted first will be printed first in the output. In a TreeSet,
elements are sorted according to their natural ordering. If the objects have no
natural ordering, a Comparator can be supplied (or the class can implement
Comparable) to define how the elements of the TreeSet are sorted, as sketched below.
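A minimal sketch of supplying a Comparator to a TreeSet (ordering strings by length is only an illustration, not part of the program above):

import java.util.*;

public class TreeSetComparatorDemo {
    public static void main(String[] args) {
        // Order strings by length instead of their natural alphabetical order
        TreeSet<String> byLength = new TreeSet<>(Comparator.comparingInt(String::length));
        byLength.add("pratyusha");
        byLength.add("bindu");
        byLength.add("anuradha");
        System.out.println(byLength); // [bindu, anuradha, pratyusha]
    }
}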
3. When to prefer TreeSet over HashSet?
1. When sorted unique elements are required instead of just unique elements. The sorted
view given by a TreeSet is always in ascending order.
2. TreeSet has greater locality than HashSet. If two entries are nearby in the
ordering, TreeSet places them near each other in the data structure and hence in
memory, while HashSet spreads the entries all over memory regardless of the keys
they are associated with.

Ex.No: 1.d)
Map

OBJECTIVE: To implement the Map data structure

DESCRIPTION:

A Map stores values on the basis of keys, i.e. key and value pairs. Each key and value
pair is known as an entry. A Map contains only unique keys. A Map is useful when you have
to search, update or delete elements on the basis of a key.

PROGRAM:

import java.util.*;
public class map {
public static void main(String[] args) {
Scanner sc=new Scanner(System.in);

TreeMap<String,Double>tmap=new TreeMap<String,Double>();//sorted order


tmap.put("13a91a0514",80.6);
tmap.put("13a91a0528",82.6);
tmap.put("13a91a0518",81.6);
tmap.put("13a91a0535",83.6);
tmap.put("13a91a0535",83.6);//values do not repeat
System.out.println(tmap);

HashMap<String,Double>hmap=new HashMap<String,Double>();//random order


hmap.put("13a91a0514",80.6);
hmap.put("13a91a0518",81.6);
hmap.put("13a91a0535",83.6);
hmap.put("13a91a0528",82.6);
System.out.println(hmap);
LinkedHashMap<String,Double> lmap=new LinkedHashMap<String,Double>();//insertion order preserved
lmap.put("13a91a0514",80.6);
lmap.put("13a91a0518",81.6);
lmap.put("13a91a0535",83.6);
lmap.put("13a91a0528",82.6);
System.out.println(lmap);

//taking input from the user


System.out.println("How many elements are there");
int no=sc.nextInt();
System.out.println("Enter"+no+" keys and values");
TreeMap<Integer,String> t=new TreeMap<Integer,String>();
for(int i=0;i<no;i++)
{
int key=sc.nextInt();
String value=sc.next();
t.put(key, value);
}
System.out.println(t);
//advanced for loop
for(Map.Entry<Integer,String> e:t.entrySet())
{
System.out.println(e.getKey());
System.out.println(e.getValue());
}
}
}
OUTPUT:

{13a91a0514=80.6, 13a91a0518=81.6, 13a91a0528=82.6, 13a91a0535=83.6}
{13a91a0514=80.6, 13a91a0518=81.6, 13a91a0528=82.6, 13a91a0535=83.6}
{13a91a0514=80.6, 13a91a0518=81.6, 13a91a0535=83.6, 13a91a0528=82.6}
How many elements are there
4
Enter4 keys and values
1
14
2
18
3
35
4
28
{1=14, 2=18, 3=35, 4=28}
1
14
2
18
3
35
4
28

VIVA QUESTIONS:

1. Sort a Map on the keys?


Ans: The simplest way is to copy the map into a TreeMap, which keeps its entries sorted
by key; alternatively, put the Map.Entry objects into a list and sort it with a
comparator on the keys, as sketched below.
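A minimal sketch of key-sorting by copying an unsorted map into a TreeMap (the HashMap here mirrors the one in the program above):

import java.util.*;

public class SortByKeyDemo {
    public static void main(String[] args) {
        Map<String, Double> hmap = new HashMap<>();
        hmap.put("13a91a0535", 83.6);
        hmap.put("13a91a0514", 80.6);
        hmap.put("13a91a0528", 82.6);
        // The TreeMap copy keeps its entries sorted by key
        Map<String, Double> sorted = new TreeMap<>(hmap);
        System.out.println(sorted);
    }
}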
2. What are the commons ways for implementing Map?
Ans: The 4 commonly used implementations of Map in Java - HashMap, TreeMap,
Hashtable and LinkedHashMap.

 HashMap is implemented as a hash table, and there is no ordering on keys or


values.
 TreeMap is implemented based on red-black tree structure, and it is ordered by
the key.
 LinkedHashMap preserves the insertion order
 Hashtable is synchronized, in contrast to HashMap. It has an overhead for
synchronization.

3. Difference between HashMap, TreeMap, and Hashtable


Ans: There are three main implementations of Map interface in
Java: HashMap, TreeMap, and Hashtable.

The most important differences include:

1. The order of iteration. HashMap and Hashtable make no guarantees as to the


order of the map; in particular, they do not guarantee that the order will remain
constant over time. TreeMap, however, iterates over the entries according to the
"natural ordering" of the keys or according to a comparator.
2. Key-value permissions. HashMap allows one null key and any number of null values
(only one null key, since keys must be unique). Hashtable does not
allow null keys or null values. TreeMap throws an exception for a null key if it uses
natural ordering or if its comparator does not allow null keys.

3. Synchronized. Only Hashtable is synchronized, others are not. Therefore, "if a
thread-safe implementation is not needed, it is recommended to use HashMap in
place of Hashtable."

Ex.No: 1.e)
GENERIC PROGRAMMING

OBJECTIVE: To implement Generic Concepts

DESCRIPTION:
Java Generic methods and generic classes enable programmers to specify, with a
single method declaration, a set of related methods, or with a single class declaration,
a set of related types, respectively. Generics also provide compile-time type safety
that allows programmers to catch invalid types at compile time.

Example.

PROGRAM:

class A<T>
{
T x;
void add(T x)
{
this.x=x;
}
T get()
{
return x;
}
}
public class gen
{
public static void main(String[] args) {
A<Integer> o=new A<Integer>();
o.add(2);
System.out.println(o.get());

A<String> o1=new A<String>();
o1.add("neelima");
System.out.println(o1.get());
}
}

OUTPUT:

2
neelima

VIVA QUESTIONS:

1. What are Generics?

Ans:Generics are used to create Generic Classes and Generic methods which
can work with different Types(Classes).

2. How do you declare a Generic Class?


Ans: A generic class is declared with a type parameter, for example:

class MyListGeneric<T>

Instead of T, any valid identifier can be used.

3. How can we restrict Generics to a subclass of particular class?

Ans: In MyListGeneric, the type T is defined as part of the class declaration, so any

Java type can be used for it. If we want to restrict the types allowed for a generic
type, we can use a bounded type parameter, as sketched below.
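A minimal sketch of such a restriction using a bounded type parameter (MyListRestricted is a hypothetical class name):

import java.util.*;

// T may only be Number or one of its subclasses (Integer, Double, ...)
class MyListRestricted<T extends Number> {
    private final List<T> values = new ArrayList<>();

    void add(T value) { values.add(value); }

    // The bound lets us call Number methods such as doubleValue()
    double sum() {
        double s = 0;
        for (T v : values) s += v.doubleValue();
        return s;
    }
}

public class BoundedTypeDemo {
    public static void main(String[] args) {
        MyListRestricted<Integer> nums = new MyListRestricted<>();
        nums.add(2);
        nums.add(3);
        System.out.println(nums.sum()); // 5.0
        // MyListRestricted<String> would not compile: String is not a Number
    }
}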

Ex.No: 1.f)
Serialization

OBJECTIVE: To implement serialization

DESCRIPTION:

Java provides a mechanism, called object serialization where an object can be


represented as a sequence of bytes that includes the object's data as well as
information about the object's type and the types of data stored in the object.

After a serialized object has been written into a file, it can be read from the file and
deserialized; that is, the type information and the bytes that represent the object and its
data can be used to recreate the object in memory.

PROGRAM:

Student.java:

import java.io.*;
public class Student implements Serializable
{
int no;
String name;
}

SeriEx.java:

import java.io.*;
public class SeriEx
{
public static void main(String args[]) throws Exception
{
Student S1=new Student();
S1.no=12;
S1.name="CSEA";
ObjectOutputStream out=new ObjectOutputStream(new
FileOutputStream("D:/serex.ser"));
out.writeObject(S1);
out.close();
}
}

Deser.java:

import java.io.*;
public class Deser
{
public static void main(String[] args) throws Exception
{
Student S1=null;
FileInputStream fileIn=new FileInputStream("D:/serex.ser");
ObjectInputStream in=new ObjectInputStream(fileIn);
S1=(Student)in.readObject();
in.close();
System.out.println("Deserialized student...");
System.out.println("Name:"+S1.name);
System.out.println("Number:"+S1.no);
}}

OUTPUT:

Deserialized student...
Name:CSEA
Number:12

VIVA QUESTIONS:

1. How to make a Java class Serializable?


Making a class Serializable in Java is very easy: your Java class just needs to
implement the java.io.Serializable interface and the JVM will take care of
serializing objects in the default format.
2. How many methods Serializable has? If no method then what is the purpose
of Serializable interface?

The Serializable interface exists in the java.io package and forms the core of the Java


serialization mechanism. It doesn't have any methods and is therefore called a marker
interface in Java. When your class implements the java.io.Serializable interface it
becomes serializable and gives the compiler an indication that the Java
serialization mechanism should be used to serialize objects of this class.
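A common refinement, not shown in the listing above, is to declare an explicit serialVersionUID in the Serializable class so that version checks during deserialization are predictable; a minimal sketch based on the Student class:

import java.io.Serializable;

public class Student implements Serializable {
    // Explicit version identifier used by the serialization runtime
    private static final long serialVersionUID = 1L;
    int no;
    String name;
}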

Ex.No: 1.g)
Queue

OBJECTIVE: To implement Queue

DESCRIPTION:

A Queue is a collection for holding elements prior to processing. Besides


basic Collection operations, queues provide additional insertion, removal, and
inspection operations.

The Queue interface follows.

public interface Queue<E> extends Collection<E> {


E element();
boolean offer(E e);
E peek();
E poll();
E remove();
}
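A minimal sketch of the non-throwing methods offer(), peek() and poll(), using ArrayDeque as the Queue implementation (the values are arbitrary):

import java.util.ArrayDeque;
import java.util.Queue;

public class QueueMethodsDemo {
    public static void main(String[] args) {
        Queue<Integer> q = new ArrayDeque<>();
        q.offer(10);                  // insert; returns false instead of throwing when capacity-restricted
        q.offer(20);
        System.out.println(q.peek()); // 10 - inspect the head without removing it
        System.out.println(q.poll()); // 10 - remove and return the head
        System.out.println(q.poll()); // 20
        System.out.println(q.poll()); // null - queue is now empty
    }
}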

PROGRAM:

import java.util.*;
class TestCollection12{
public static void main(String args[]){
PriorityQueue<String> queue=new PriorityQueue<String>();
queue.add("Amit");
queue.add("Vijay");
queue.add("Karan");
queue.add("Jai");
queue.add("Rahul");
System.out.println("head:"+queue.element());
System.out.println("head:"+queue.peek());
System.out.println("iterating the queue elements:");
Iterator<String> itr=queue.iterator();
while(itr.hasNext()){
System.out.println(itr.next());
}
queue.remove();
queue.poll();
System.out.println("after removing two elements:");
Iterator<String> itr2=queue.iterator();
while(itr2.hasNext()){
System.out.println(itr2.next());
}
}
}

OUTPUT:

head:Amit
head:Amit
iterating the queue elements:
Amit
Jai
Karan
Vijay
Rahul
after removing two elements:
Karan
Rahul
Vijay

VIVA QUESTIONS:

1. Queue is available in which package?


Ans: The java.util package.
2. Queue retrieves elements in which order?

Ans: First in First out Order

Ex.No: 1.h)
Wrapper Classes

OBJECTIVE: To implement Wrapper Classes

DESCRIPTION:

Wrapper class in java provides the mechanism to convert primitive into object and
object into primitive.

Since J2SE 5.0, the autoboxing and unboxing feature converts primitives into objects and
objects into primitives automatically. The automatic conversion of a primitive into an object
is known as autoboxing, and the reverse is known as unboxing.
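Unboxing works in the opposite direction; a minimal sketch:

public class UnboxingExample {
    public static void main(String[] args) {
        Integer i = Integer.valueOf(20);
        int a = i.intValue(); // explicit unboxing: converting Integer into int
        int b = i;            // auto-unboxing: the compiler inserts i.intValue() internally
        System.out.println(a + " " + b);
    }
}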


PROGRAM:

public class WrapperExample1{
public static void main(String args[])
{
//Converting int into Integer
int a=20;
Integer i=Integer.valueOf(a);//converting int into Integer explicitly
Integer j=a;//autoboxing, now compiler will write Integer.valueOf(a) internally
System.out.println(a+" "+i+" "+j);
}
}

OUTPUT:

20 20 20

VIVA QUESTIONS:

1. What is wrapper class?


Ans: A wrapper class converts a primitive data type into an object (and back).
2. What is the wrapper class for boolean data type?

Ans: Boolean is the predefined wrapper class for the boolean data type.

Ex.No: 2
Perform setting up and Installing Hadoop

OBJECTIVE: Installing Hadoop

DESCRIPTION:

Hadoop can be run in 3 different modes. Different modes of Hadoop are

Standalone Mode

 Default mode of Hadoop


 HDFS is not utilized in this mode. The local file system is used for input and
output
 Used for debugging purposes
 No custom configuration is required in the three Hadoop configuration files
(mapred-site.xml, core-site.xml, hdfs-site.xml)
 Standalone mode is much faster than pseudo-distributed mode

Pseudo Distributed Mode(Single Node Cluster)

 Configuration is required in given 3 files for this mode


 Replication factor is one for HDFS.
 Here one node will be used as Master Node / Data Node / Job Tracker / Task
Tracker
 Used for Real Code to test in HDFS.
 Pseudo distributed cluster is a cluster where all daemons are
running on one node itself.

Fully distributed mode (or multiple node cluster)

 This is a Production Phase


 Data are used and distributed across many nodes.

 Different Nodes will be used as Master Node / Data Node / Job Tracker / Task
Tracker

PROGRAM:

Installation of Hadoop

Step 1: Verifying JAVA Installation

Java must be installed on your system before installing Hadoop. Let us verify the Java
installation using the following command:

$ java –version

If Java is already installed on your system, you get to see the following response:

java version "1.7.0_71"

Java(TM) SE Runtime Environment (build 1.7.0_71-b13)

Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

If java is not installed in your system, then follow the steps given below for installing
java.
Installing Java

Step I:

Download java (JDK <latest version> - X64.tar.gz) by visiting the following link:
http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html
Then jdk-7u71-linux-x64.tar.gz will be downloaded onto your system.
Then jdk-7u71-linux-x64.tar.gz will be downloaded onto your system.

Step II:

Generally you will find the downloaded java file in the Downloads folder. Verify it
and extract the jdk-7u71-linux-x64.gz file using the following commands.

$ cd Downloads/

$ ls

jdk-7u71-linux-x64.gz

$ tar zxf jdk-7u71-linux-x64.gz

$ ls

jdk1.7.0_71 jdk-7u71-linux-x64.gz
Step III:

To make java available to all the users, you have to move it to the location
“/usr/local/”. Open root, and type the following commands.

$ su

password:
# mv jdk1.7.0_71 /usr/local/
# exit

Step IV:

For setting up PATH and JAVA_HOME variables, add the following commands to
~/.bashrc file.
export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin

Now apply all the changes into the current running system.

$ source ~/.bashrc

Step V:

Use the following commands to configure java alternatives:

# alternatives --install /usr/bin/java java usr/local/java/bin/java 2


# alternatives --install /usr/bin/javac javac usr/local/java/bin/javac 2
# alternatives --install /usr/bin/jar jar usr/local/java/bin/jar 2
# alternatives --set java usr/local/java/bin/java
# alternatives --set javac usr/local/java/bin/javac
# alternatives --set jar usr/local/java/bin/jar

Now verify the installation using the command java -version from the terminal as
explained above.

Step 2: Verifying Hadoop Installation

Let us check whether Hadoop is already installed on your system using the following
command:

$ hadoop version

If Hadoop is already installed, you will see a response similar to the following:

Hadoop 2.4.1

Compiled by hortonmu on 2013-10-07T06:28Z

Compiled with protoc 2.5.0

From source with checksum 79e53ce7994d1628b240f09af91e1af4
If Hadoop is not installed on your system, then proceed with the following steps:
Downloading Hadoop
Download and extract Hadoop 2.4.1 from Apache Software Foundation using the
following commands.

$ su

password:

# cd /usr/local

# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz

# tar xzf hadoop-2.4.1.tar.gz

# mv hadoop-2.4.1/* hadoop/

# exit

Installing Hadoop in Pseudo Distributed Mode

The following steps are used to install Hadoop 2.4.1 in pseudo distributed mode.

Step I: Setting up Hadoop

You can set Hadoop environment variables by appending the following commands to
~/.bashrc file.

export HADOOP_HOME=/usr/local/hadoop

export HADOOP_MAPRED_HOME=$HADOOP_HOME

export HADOOP_COMMON_HOME=$HADOOP_HOME

export HADOOP_HDFS_HOME=$HADOOP_HOME

export YARN_HOME=$HADOOP_HOME

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Now apply all the changes into the current running system.

$ source ~/.bashrc
Step II: Hadoop Configuration

You can find all the Hadoop configuration files in the location
“$HADOOP_HOME/etc/hadoop”. You need to make suitable changes in those
configuration files according to your Hadoop infrastructure.

$ cd $HADOOP_HOME/etc/hadoop

In order to develop Hadoop programs using java, you have to reset the java
environment variables in hadoop-env.sh file by replacing JAVA_HOME value with
the location of java in your system.

export JAVA_HOME=/usr/local/jdk1.7.0_71

Given below are the list of files that you have to edit to configure Hadoop.
core-site.xml

The core-site.xml file contains information such as the port number used for Hadoop
instance, memory allocated for the file system, memory limit for storing the data, and
the size of Read/Write buffers.
Open the core-site.xml and add the following properties in between the
<configuration> and </configuration> tags.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

hdfs-site.xml

The hdfs-site.xml file contains information such as the value of replication data, the
namenode path, and the datanode path of your local file systems. It means the place
where you want to store the Hadoop infra.

Let us assume the following data.


dfs.replication (data replication value) = 1
(In the following path /hadoop/ is the user name.
hadoopinfra/hdfs/namenode is the directory created by hdfs file system.)
namenode path = //home/hadoop/hadoopinfra/hdfs/namenode
(hadoopinfra/hdfs/datanode is the directory created by hdfs file system.)
datanode path = //home/hadoop/hadoopinfra/hdfs/datanode

Open this file and add the following properties in between the <configuration>,
</configuration> tags in this file.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/namenode </value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/datanode </value >
</property>
</configuration>

Note: In the above file, all the property values are user-defined and you can make
changes according to your Hadoop infrastructure.

yarn-site.xml
This file is used to configure yarn into Hadoop. Open the yarn-site.xml file and add
the following properties in between the <configuration>, </configuration> tags in this
file.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
mapred-site.xml

This file is used to specify which MapReduce framework we are using. By default,
Hadoop contains a template named mapred-site.xml.template. First of all, you need to copy
the file from mapred-site.xml.template to mapred-site.xml using the following command.
$ cp mapred-site.xml.template mapred-site.xml
Open mapred-site.xml file and add the following properties in between the
<configuration>, </configuration> tags in this file.

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

Verifying Hadoop Installation


The following steps are used to verify the Hadoop installation.
Step I: Name Node Setup
Set up the namenode using the command “hdfs namenode -format” as follows.
$ cd ~
$ hdfs namenode -format
The expected result is as follows.
10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/192.168.1.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.4.1
...
...
10/24/14 21:30:56 INFO common.Storage: Storage directory
/home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to
retain 1 images with txid >= 0
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0
10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11
************************************************************/

Step II: Verifying Hadoop dfs


The following command is used to start dfs. Executing this command will start your
Hadoop file system.
$ start-dfs.sh
The expected output is as follows:
10/24/14 21:37:56
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-
hadoop-namenode-localhost.out
localhost: starting datanode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-
hadoop-datanode-localhost.out
Starting secondary namenodes [0.0.0.0]
Step III: Accessing Hadoop on Browser
The default port number to access Hadoop is 50070. Use the following URL to get
Hadoop services on your browser.
http://localhost:50070/
Step IV: Verify all applications for the cluster
The default port number to access all applications of the cluster is 8088. Use the
following URL to visit this service.

http://localhost:8088/

VIVA QUESTIONS

1. What are the three installation modes of Hadoop installation?


 Standalone mode
 Pseudo-distributed mode
 Fully distributed mode

2. Why are SSH setup and key generation required?

SSH setup is required to perform different operations on a cluster such as starting,


stopping, and distributed daemon shell operations. To authenticate different users of
Hadoop, it is required to provide a public/private key pair for a Hadoop user and share
it with the different users.

3. What are the features of hdfs?

 It is suitable for the distributed storage and processing.


 Hadoop provides a command interface to interact with HDFS.
 The built-in servers of namenode and datanode help users to easily check the
status of cluster.
 Streaming access to file system data.
 HDFS provides file permissions and authentication.

Ex.No: 3)
File Management Tasks in Hadoop

OBJECTIVE: To implement adding, retrieving and deleting files and directories in HDFS

DESCRIPTION:

The File System (FS) shell includes various shell-like commands that directly interact
with the Hadoop Distributed File System (HDFS) as well as other file systems that
Hadoop supports, such as Local FS, HFTP FS, S3 FS, and others. The following
commands are used for interacting with HDFS.

cat

Usage: hdfs dfs -cat URI [URI ...]

Copies source paths to stdout.

Example:

 hdfs dfs -cat hdfs://nn1.example.com/file1


hdfs://nn2.example.com/file2
 hdfs dfs -cat file:///file3 /user/hadoop/file4

Exit Code:

Returns 0 on success and -1 on error.

chgrp

Usage: hdfs dfs -chgrp [-R] GROUP URI [URI ...]

Change group association of files. The user must be the owner of files, or else a super-
user. Additional information is in the Permissions Guide.

Options

 The -R option will make the change recursively through the directory structure.

chmod

Usage: hdfs dfs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]

Change the permissions of files. With -R, make the change recursively through the
directory structure. The user must be the owner of the file, or else a super-user.
Additional information is in the Permissions Guide.

Options

 The -R option will make the change recursively through the directory structure.

chown

Usage: hdfs dfs -chown [-R] [OWNER][:[GROUP]] URI [URI ]

Change the owner of files. The user must be a super-user. Additional information is in
the Permissions Guide.

Options

 The -R option will make the change recursively through the directory structure.

copyFromLocal

Usage: hdfs dfs -copyFromLocal <localsrc> URI

Similar to put command, except that the source is restricted to a local file reference.

Options:

 The -f option will overwrite the destination if it already exists.

copyToLocal

Usage: hdfs dfs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

Similar to get command, except that the destination is restricted to a local file reference.

count

Usage: hdfs dfs -count [-q] <paths>

Count the number of directories, files and bytes under the paths that match the specified
file pattern. The output columns with -count are: DIR_COUNT, FILE_COUNT,
CONTENT_SIZE FILE_NAME

The output columns with -count -q are: QUOTA, REMAINING_QUOTA, SPACE_QUOTA,


REMAINING_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE, FILE_NAME

Example:

 hdfs dfs -count hdfs://nn1.example.com/file1


hdfs://nn2.example.com/file2
 hdfs dfs -count -q hdfs://nn1.example.com/file1

Exit Code:

Returns 0 on success and -1 on error.

cp

Usage: hdfs dfs -cp [-f] URI [URI ...] <dest>

Copy files from source to destination. This command allows multiple sources as well in
which case the destination must be a directory.

Options:

 The -f option will overwrite the destination if it already exists.

Example:

 hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2


 hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2
/user/hadoop/dir

Exit Code:

Returns 0 on success and -1 on error.
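The same file-management tasks can also be performed programmatically through Hadoop's Java FileSystem API; a minimal sketch (all paths are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml settings
        FileSystem fs = FileSystem.get(conf);

        fs.mkdirs(new Path("/user/ponny/demo"));                    // like: hadoop fs -mkdir
        fs.copyFromLocalFile(new Path("/home/ponny/csea"),          // like: -copyFromLocal
                new Path("/user/ponny/demo/csea"));
        fs.copyToLocalFile(new Path("/user/ponny/demo/csea"),       // like: -copyToLocal
                new Path("/home/ponny/csea_copy"));
        fs.delete(new Path("/user/ponny/demo"), true);              // like: -rmr (recursive delete)
        fs.close();
    }
}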

PROGRAM:

Interacting with Local File System Commands:


ponny@ubuntu:~$ cat >cseaa
hello how are you

ponny@ubuntu:~$ cat cseaa


hello how are you

ponny@ubuntu:~$ ls
AirPassengers.csv Pictures
big.txt protobuf-2.4.1
classes protobuf-2.4.1.tar.gz
core protobuf-2.5.0
cseaa protobuf-2.5.0.tar.gz
cseblearners Public

data10.txt PVP College
Data1.txt R
ponny@ubuntu:~$ clear

ponny@ubuntu:~$ cat >cseaa


hi how are you

ponny@ubuntu:~$ cat cseaa


hi how are you

ponny@ubuntu:~$ mkdir Dps


ponny@ubuntu:~$ cd Dps

ponny@ubuntu:~/Dps$ cd\

ponny@ubuntu:~$ cd Dps

ponny@ubuntu:~/Dps$ mkdir train

ponny@ubuntu:~/Dps$ cd train

ponny@ubuntu:~/Dps/train$ cd\

ponny@ubuntu:~$ ls

AirPassengers.csv pa.txt~
big.txt Pictures
classes protobuf-2.4.1
core protobuf-2.4.1.tar.gz
cseaa protobuf-2.5.0
cseblearners protobuf-2.5.0.tar.gz
data10.txt Public
Data1.txt PVP College
ponny@ubuntu:~$ clear

ponny@ubuntu:~$ cd Dps

ponny@ubuntu:~/Dps$ ls

train
ponny@ubuntu:~/Dps$ cd\

ponny@ubuntu:~$ jps

4520
4662 FsShell
3660 TaskTracker
2832 NameNode
4698 Jps
3328 SecondaryNameNode
3412 JobTracker
3079 DataNode

Interacting with Hadoop File System Commands

ponny@ubuntu:~$ hadoop fs -ls

Found 2 items
-rw-r--r-- 1 ponny supergroup 15 2016-08-19 10:32
/user/ponny/hadooplab
drwxr-xr-x - ponny supergroup 0 2016-08-18 15:38
/user/ponny/training

ponny@ubuntu:~$ hadoop fs -mkdir hadoop


Warning: $HADOOP_HOME is deprecated.

ponny@ubuntu:~$ hadoop fs -copyFromLocal csea hadoop


Warning: $HADOOP_HOME is deprecated.

ponny@ubuntu:~$ hadoop fs -cat hadoop\csea


Warning: $HADOOP_HOME is deprecated.

cat: File does not exist: /user/ponny/hadoopcsea


ponny@ubuntu:~$ hadoop fs -cat hadoop/csea

Warning: $HADOOP_HOME is deprecated.

hi this is hadoop lab

ponny@ubuntu:~$ hadoop fs -copyToLocal hadoop/csea cseloc

Warning: $HADOOP_HOME is deprecated.


ponny@ubuntu:~$ cat cseloc

hi this is hadoop lab

ponny@ubuntu:~$

1.create "training" file in local system and copy that file to hdfs directory using
"put" cmd
ponny@ubuntu:~$ cat >training
Hello Welcome to the world of Bigdata

ponny@ubuntu:~$ hadoop fs -put training hadoop


Warning: $HADOOP_HOME is deprecated.

ponny@ubuntu:~$ hadoop fs -ls hadoop


Warning: $HADOOP_HOME is deprecated.
Found 2 items
-rw-r--r-- 1 ponny supergroup 22 2016-08-19 10:55
/user/ponny/hadoop/csea
-rw-r--r-- 1 ponny supergroup 38 2016-08-19 11:04
/user/ponny/hadoop/training

ponny@ubuntu:~$ hadoop fs -cat hadoop/training


Warning: $HADOOP_HOME is deprecated.
Hello Welcome to the world of Bigdata

2.create It directory in Hdfs and copy "csea" file to hdfs It directory

ponny@ubuntu:~$ hadoop fs -mkdir IT


Warning: $HADOOP_HOME is deprecated.

ponny@ubuntu:~$ cat csea


hi this is hadoop lab

ponny@ubuntu:~$ hadoop fs -copyFromLocal csea IT


Warning: $HADOOP_HOME is deprecated.

ponny@ubuntu:~$ hadoop fs -cat IT/csea


Warning: $HADOOP_HOME is deprecated.
hi this is hadoop lab

3.create ECE directory in HDFS and copy the csea file from the hadoop HDFS directory
to the ECE HDFS directory
ponny@ubuntu:~$ hadoop fs -mkdir ECE
Warning: $HADOOP_HOME is deprecated.

ponny@ubuntu:~$ hadoop fs -cp hadoop/csea ECE


Warning: $HADOOP_HOME is deprecated.

ponny@ubuntu:~$ hadoop fs -cat ECE/csea


Warning: $HADOOP_HOME is deprecated.
hi this is hadoop lab

mv command:

ponny@ubuntu:~$ cat >mvnew


hi this is about mv command

ponny@ubuntu:~$ cat mvnew


hi this is about mv command

ponny@ubuntu:~$ hadoop fs -copyFromLocal mvnew hadoop

Warning: $HADOOP_HOME is deprecated.


ponny@ubuntu:~$ hadoop fs -mv hadoop/mvnew ECE

Warning: $HADOOP_HOME is deprecated.

ponny@ubuntu:~$ hadoop fs -ls hadoop

Warning: $HADOOP_HOME is deprecated.


Found 3 items
-rw-r--r-- 1 ponny supergroup 22 2016-08-19 10:55
/user/ponny/hadoop/csea
-rw-r--r-- 1 ponny supergroup 38 2016-08-19 11:04
/user/ponny/hadoop/training
-rw-r--r-- 1 ponny supergroup 44 2016-08-19 11:04
/user/ponny/hadoop/mvnew
ponny@ubuntu:~$ hadoop fs -ls hadoop

Warning: $HADOOP_HOME is deprecated


.
Found 2 items
-rw-r--r-- 1 ponny supergroup 22 2016-08-19 10:55
/user/ponny/hadoop/csea
-rw-r--r-- 1 ponny supergroup 38 2016-08-19 11:04
/user/ponny/hadoop/training

4.Copy the training bigdata file from HDFS to Desktop (local file system)

ponny@ubuntu:~$ hadoop fs -copyToLocal hadoop/training Desktop

Warning: $HADOOP_HOME is deprecated.

ponny@ubuntu:~$ hadoop fs -copyFromLocal


/home/ponny/Desktop/Cancerpatientchi.R ECE

Warning: $HADOOP_HOME is deprecated.

ponny@ubuntu:~$ hadoop fs -ls ECE

Warning: $HADOOP_HOME is deprecated.


Found 3 items
-rw-r--r-- 1 ponny supergroup 460 2016-08-19 12:19
/user/ponny/ECE/Cancerpatientchi.R
-rw-r--r-- 1 ponny supergroup 22 2016-08-19 11:30
/user/ponny/ECE/csea
-rw-r--r-- 1 ponny supergroup 27 2016-08-19 11:46
/user/ponny/ECE/mvnew
ponny@ubuntu:~$ hadoop fs -rmr IT

Warning: $HADOOP_HOME is deprecated.

Deleted hdfs://localhost:54310/user/ponny/IT

VIVA QUESTIONS:

1. What is the purpose of copyFromLocal?

Ans: To copy data from local file system to HDFS

2. What is the purpose of copyToLocal?

Ans: To copy data from HDFS to Local

3. What is the purpose of hadoop fs -mkdir?

Ans: To create directory in Hadoop file system.

Ex.No: 4)
Word Count Map Reduce program

OBJECTIVE: To implement the Word Count program using MapReduce

DESCRIPTION:

MapReduce is a processing technique and a program model for distributed computing


based on java. The MapReduce algorithm contains two important tasks, namely Map
and Reduce. Map takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key/value pairs). Secondly, reduce
task, which takes the output from a map as an input and combines those data tuples
into a smaller set of tuples. As the sequence of the name MapReduce implies, the
reduce task is always performed after the map job. WordCount is a simple
application that counts the number of occurrences of each word in a given input
set. This works with a local standalone, pseudo-distributed or fully-distributed
Hadoop installation.

PROGRAM

Driver code:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
public static void main(String[] args) throws Exception {

String input = "test1.txt";
String output = "out";

// Create a new job
Job job = new Job();

// Set job name to locate it in the distributed environment
job.setJarByClass(WordCountDriver.class);
job.setJobName("Word Count");

// Set input and output Path, note that we use the default input format
// which is TextInputFormat (each record is a line of input)
FileInputFormat.addInputPath(job, new Path(input));
FileOutputFormat.setOutputPath(job, new Path(output));

// Set Mapper and Reducer class
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);

// Set Output key and value
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

Mapper Class:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
extends Mapper<LongWritable, Text, Text, IntWritable>{
private static final IntWritable one = new IntWritable(1);
private Text word = new Text();
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException
{
String line = value.toString();
String[] words = line.split(" ");
for (String w : words) {
word.set(w);
context.write(word, one);
}
}
}

Reducer:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer
extends Reducer<Text, IntWritable, Text, IntWritable>{
protected void reduce(Text key, Iterable<IntWritable> values,
Context context)
throws IOException, InterruptedException {
int sum = 0;
for(IntWritable value:values)
{
sum += value.get();
}
context.write(key, new IntWritable(sum));
}
}
OUTPUT:

Input File:
Welcome every1.
Welcome to Hadoop lab.
Today we are going to work on Hadoop MapReduce concept.

Output File:
MapReduce 1
Today 1
Welcome 2
are 1
concept. 1
every1. 1
going 1
Hadoop 2
lab. 1
on 1
to 2
we 1
work 1

VIVA QUESTIONS:

1. Which method is used for writing the mapper logic?


Ans: protected void map()

2. Which method is used for writing the reducer logic?


Ans: protected void reduce()

3. What are the various Hadoop data types?


Ans: LongWritable, IntWritable, Text, etc.

Ex.No: 5)
Matrix Multiplication using Map Reduce Approach

OBJECTIVE: To implement Matrix Multiplication using MapReduce

DESCRIPTION:

In the map function each input from the dataset is organized to produce a key value
pair such that reducer can do the entire computation of the corresponding output cell.

PROGRAM

Driver code:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Matrix {
public static void main(String[] args) throws Exception {

String input = "test1.txt";
String output = "out";

// Create a new job
Job job = new Job();

// Set job name to locate it in the distributed environment
job.setJarByClass(Matrix.class);
job.setJobName("Matrix Multiplication");

// Set input and output Path, note that we use the default input format
// which is TextInputFormat (each record is a line of input)
FileInputFormat.addInputPath(job, new Path(input));
FileOutputFormat.setOutputPath(job, new Path(output));

// Set Mapper and Reducer class
job.setMapperClass(MatrixMapper.class);
job.setReducerClass(MatrixReducer.class);

// The mapper emits Text values, which differ from the final IntWritable output values
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);

// Set Output key and value
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

Mapper Class:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MatrixMapper extends
Mapper<LongWritable, Text, Text, Text>
{
// Matrix dimensions, assumed here to be 5 to match the 5x5 result shown in the output
private static final int iMax = 5;
private static final int lMax = 5;

@Override
protected void map
(LongWritable key, Text value, Context context)
throws IOException, InterruptedException
{
// input format is ["a", 0, 0, 63]
String[] csv = value.toString().split(",");
String matrix = csv[0].trim();
int row = Integer.parseInt(csv[1].trim());
int col = Integer.parseInt(csv[2].trim());
if(matrix.contains("a"))
{
for (int i=0; i < lMax; i++)
{
String akey = Integer.toString(row) + "," + Integer.toString(i);
context.write(new Text(akey), value);
}
}
if(matrix.contains("b"))
{
for (int i=0; i < iMax; i++)
{
String akey = Integer.toString(i) + "," + Integer.toString(col);
context.write(new Text(akey), value);
}
}
}
}

Reducer:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MatrixReducer extends Reducer<Text, Text, Text, IntWritable> {

@Override
protected void reduce
(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {

int[] a = new int[5];
int[] b = new int[5];
// b, 2, 0, 30
for (Text value : values) {
System.out.println(value);
String cell[] = value.toString().split(",");
if (cell[0].contains("a")) // take rows here
{
int col = Integer.parseInt(cell[2].trim());
a[col] = Integer.parseInt(cell[3].trim());
}
else if (cell[0].contains("b")) // take col here
{
int row = Integer.parseInt(cell[1].trim());
b[row] = Integer.parseInt(cell[3].trim());
}
}
int total = 0;
for (int i = 0; i < 5; i++) {
int val = a[i] * b[i];
total += val;
}
context.write(key, new IntWritable(total));
}
}
OUTPUT:

0,0 11878
0,1 14044
0,2 16031
0,3 5964
0,4 15874
1,0 4081
1,1 6914
1,2 8282
1,3 7479
1,4 9647
2,0 6844
2,1 9880
2,2 10636
2,3 6973
2,4 8873
3,0 10512
3,1 12037
3,2 10587
3,3 2934
3,4 5274
4,0 11182
4,1 14591
4,2 10954
4,3 1660
4,4 9981

VIVA QUESTIONS:

1. Which method is used to split each input record?


Ans: split()

2. Which method is used for writing data to the output?


Ans: context.write()

Ex.No: 6)
Mining Weather Data using MapReduce

OBJECTIVE: To mine weather data using MapReduce

DESCRIPTION:

Sensors sense weather data in a big text format containing station ID, year, date, time,
temperature, quality etc., and each reading is stored in a single line. Suppose there are
thousands of such sensors; then we have thousands of records in no particular order. We
require only the year and the maximum temperature of a particular quality in that year.

For example:

Input string from sensor:

0029029070999991902010720004+64333+023450FM-12+

000599999V0202501N027819999999N0000001N9-00331+
99999098351ADDGF102991999999999999999999

Here: 1902 is year

0033 is temperature

1 is measurement quality (Range between 0 or 1 or 4 or 5 or 9)

Here each mapper takes as input key the "byte offset of the line" and as value "one weather
sensor reading, i.e. one line". It parses the line and produces the "year" as the intermediate
key and the "temperature of the accepted measurement qualities" as the intermediate value for
that year.

PROGRAM

Driver code:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Weather {

public static void main(String[] args) {

Configuration conf=new Configuration();
Job job;
try {
job = new Job(conf,"WeatherDataExtraction");
job.setJobName("WeatherDataExtraction");
job.setJarByClass(Weather.class);
job.setMapperClass(MaxTemperatureMapper.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job,new
Path("E:\\Nitin\\Programming\\DATA\\01001.dat\\01001.dat"));
FileOutputFormat.setOutputPath(job,new
Path("E:\\Nitin\\output20.txt"));
try {
job.waitForCompletion(true);
} catch (ClassNotFoundException | IOException | InterruptedException e) {
e.printStackTrace();
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
Mapper Class:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
extends Mapper<LongWritable, Text, Text, IntWritable> {

@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {

String line = value.toString();

//In the input string, the year occupies character positions 15 to 19 and this is
//fixed for every input record, so take a substring to get the year from the line.
String year = line.substring(15, 19);

int airTemperature;

//The temperature (including the sign character) occupies positions 87 to 92.
//For comparison we do not need a leading "+" sign (+11C equals 11C),
//but a "-" sign must be kept, so strip the "+" before parsing.
if (line.charAt(87) == '+') {
// parseInt doesn't like leading plus signs
airTemperature = Integer.parseInt(line.substring(88, 92));
} else {
airTemperature = Integer.parseInt(line.substring(87, 92));
}

//The measurement quality is the single character after the temperature field;
//we accept only readings with quality 0, 1, 4, 5 or 9.
String quality = line.substring(92, 93);

//If it matches, write the year as key and the temperature as value to the context output.
if (quality.matches("[01459]")) {
context.write(new Text(year), new IntWritable(airTemperature));
}
}
}

Reducer:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Reduce extends Reducer<Text,IntWritable,Text,IntWritable>{

public void reduce(Text key,Iterable<IntWritable> values,Context context) throws


IOException, InterruptedException
{
Integer max=new Integer(0);
for(IntWritable val:values) {
if (val.get()>max.intValue()) { max=val.get();}
}
context.write(key,new IntWritable(max.intValue()));
}
}
OUTPUT:
1949 111
1955 22

VIVA QUESTIONS:

1. Which method can be used for getting the maximum value?


Ans: Math.max()

2. Which method can be used for getting the minimum value?

Ans: Math.min()

Ex.No: 7

Pig Latin scripts

OBJECTIVE: To install and run Pig, then write Pig Latin scripts to sort,
group, join, project, and filter your data.
DESCRIPTION:

Apache Pig is a platform for analyzing large data sets that consists of a high-level
language for expressing data analysis programs, coupled with infrastructure for
evaluating these programs. The salient property of Pig programs is that their
structure is amenable to substantial parallelization, which in turn enables them to
handle very large data sets. At the present time, Pig's infrastructure layer consists of
a compiler that produces sequences of Map-Reduce programs, for which large-scale
parallel implementations already exist (e.g., the Hadoop subproject). Pig's language
layer currently consists of a textual language called Pig Latin, which has the following
key properties:

 Ease of programming. It is trivial to achieve parallel execution of


simple, "embarrassingly parallel" data analysis tasks. Complex tasks
comprised of multiple interrelated data transformations are explicitly
encoded as data flow sequences, making them easy to write, understand,
and maintain.
 Optimization opportunities. The way in which tasks are encoded
permits the system to optimize their execution automatically, allowing
the user to focus on semantics rather than efficiency.
 Extensibility. Users can create their own functions to do special-purpose
processing.

Install Apache Pig

After downloading the Apache Pig software, install it in your Linux


environment by following the steps given below.

Step 1
Create a directory with the name Pig in the same directory where the
installation directories of Hadoop, Java, and other software were
installed. (In our tutorial, we have created the Pig directory in the user
named Hadoop).
$ mkdir Pig

Step 2
Extract the downloaded tar files as shown below.
$ cd Downloads/
$ tar zxvf pig-0.15.0-src.tar.gz
$ tar zxvf pig-0.15.0.tar.gz

Step 3
Move the content of the extracted pig-0.15.0-src directory to the Pig directory
created earlier as shown below.
$ mv pig-0.15.0-src/* /home/Hadoop/Pig/
Configure Apache Pig
After installing Apache Pig, we have to configure it. To configure, we
need to edit two files − bashrc and pig.properties.
.bashrc file

PigLatin Script.

A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);


B = FOREACH A GENERATE name;

DUMP B;
(John)
(Mary)
(Bill)
(Joe)

customers.txt

1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00

orders.txt

102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060

Self – join

Self-join is used to join a table with itself as if the table were two relations,
temporarily renaming at least one relation.

grunt> customers1 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING


PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);

grunt> customers2 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING


PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);

Let us perform a self-join operation on the relation customers by joining the two relations


customers1 and customers2 as shown below.

grunt> customers3 = JOIN customers1 BY id, customers2 BY id;

Verify the relation customers3 using the DUMP operator as shown below.

grunt> Dump customers3;


Output
It will produce the following output, displaying the contents of the relation
customers3.

(1,Ramesh,32,Ahmedabad,2000,1,Ramesh,32,Ahmedabad,2000)
(2,Khilan,25,Delhi,1500,2,Khilan,25,Delhi,1500)
(3,kaushik,23,Kota,2000,3,kaushik,23,Kota,2000)
(4,Chaitali,25,Mumbai,6500,4,Chaitali,25,Mumbai,6500)
(5,Hardik,27,Bhopal,8500,5,Hardik,27,Bhopal,8500)
(6,Komal,22,MP,4500,6,Komal,22,MP,4500)
(7,Muffy,24,Indore,10000,7,Muffy,24,Indore,10000)

Inner Join
Inner Join is used quite frequently; it is also referred to as equijoin. An inner join
returns rows when there is a match in both tables.

Let us perform an inner join operation on the two relations customers and orders as shown below.

grunt> coustomer_orders = JOIN customers BY id, orders BY customer_id;


Verification
Verify the relation coustomer_orders using the DUMP operator as shown below.

grunt> Dump coustomer_orders;


Output
We will get the following output, displaying the contents of the relation named
coustomer_orders.

(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
Outer Join: Unlike inner join, an outer join returns all the rows from at least one of the
relations. An outer join operation is carried out in three ways −

Left outer join


Right outer join
Full outer join
Left Outer Join
Let us perform left outer join operation on the two relations customers and orders as
shown below.

grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;


Verification
Verify the relation outer_left using the DUMP operator as shown below.

grunt> Dump outer_left;


Output
It will produce the following output, displaying the contents of the relation outer_left.

(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)
Right Outer Join
The right outer join operation returns all rows from the right table, even if there are
no matches in the left table.

Let us perform right outer join operation on the two relations customers and orders as
shown below.

grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;


Verification
Verify the relation outer_right using the DUMP operator as shown below.

grunt> Dump outer_right


Output
It will produce the following output, displaying the contents of the relation
outer_right.

(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
Full Outer Join
The full outer join operation returns rows when there is a match in one of the
relations.

Example
Let us perform full outer join operation on the two relations customers and orders as
shown below.

grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;


Verification
Verify the relation outer_full using the DUMP operator as shown below.

grunt> Dump outer_full;


Output
It will produce the following output, displaying the contents of the relation outer_full.

(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)

VIVA QUESTIONS
1) What is Pig in Hadoop?

Ans: Pig is an Apache open-source project which runs on Hadoop and provides an engine for
parallel data flow. It includes a language called Pig Latin for expressing these data
flows, with operations such as join, sort and filter, and the ability to write User
Defined Functions (UDFs) for processing, reading and writing data. Pig uses both HDFS
and MapReduce, i.e. storing and processing.

2) What is the difference between Pig and SQL?


Ans: Pig Latin is a procedural counterpart of SQL. Pig has some similarities with SQL but
also many differences. SQL is a declarative query language: the user asks a question in
query form, and SQL produces the answer without saying how it is computed. If a user
wants to perform multiple operations on tables, SQL requires multiple queries and
temporary tables for intermediate results; subqueries are supported, but many SQL users
find them confusing and difficult to form properly, since they create an inside-out
design where the first step in the data pipeline is the innermost query. Pig is designed
with a long series of data operations in mind, so there is no need to write the data
pipeline as an inverted set of subqueries or to worry about storing data in temporary
tables.

3) How does Pig differ from MapReduce?


Ans: In MapReduce, the group-by operation is performed on the reducer side, while filter
and projection can be implemented in the map phase. Pig Latin provides standard
operations similar to MapReduce, such as order by, filter and group by. A Pig script can
be analyzed to understand the data flow, and errors can be detected early. Pig Latin is
also much lower cost to write and maintain than Java code for MapReduce.

Ex.No: 8
Use Hive to Manage Databases, Tables and Views

OBJECTIVE: To install and run Hive, then use Hive to create, alter, and drop databases, tables,
views, functions, and indexes.

DESCRIPTION:
Hive is a data warehouse infrastructure tool to process structured data in
Hadoop. It resides on top of Hadoop to summarize Big Data, and makes
querying and analyzing easy.

Data Base Creation:


hive> create database CSE;
OK
hive> show databases;
OK
cs
cse
hive> show databases like 'c*';
OK
cs
cse
cseb
hive> show tables;
OK
cseb
customer
order
hive> create table customer(cid BIGINT,cname STRING,cage INT)
> row format delimited
> fields terminated by ','
> stored as textfile;
OK
hive> use default;
OK
hive> drop database cse cascade;
OK
Time taken: 6.524 seconds
hive> show tables;
OK
customer
hive> LOAD DATA LOCAL
> INPATH '/home/lalitha2/Desktop/deepu.txt'
> OVERWRITE INTO TABLE customer;
OK
hive> select * from customer;
OK
1 A 20
2 B 30
3 C 35
4 D 40
hive> create table Order(oid BIGINT,oname STRING,cid INT)
> row format delimited
> fields terminated by ','
> stored as textfile;
OK
hive> LOAD DATA LOCAL
> INPATH '/home/lalitha2/Desktop/orderdet.txt'
> OVERWRITE INTO TABLE Order;
hive> select * from Order;
OK
101 pendrive 1
102 mouse 2
103 laptop 3
104 laptop 4
105 mouse 2
1.Write a query to display cid,oid who are having an order item pendrive
select cid,oid from Order WHERE oname="pendrive";
OK
1 101
2.write a query to display oid,oname which is having cid=2
hive> select oid,oname from Order WHERE cid=2;
OK
102 mouse
105 mouse
3.write a query to display oid of laptop
hive> select oid from Order WHERE oname="laptop";
OK
103
104
4.write a query to display oid of laptop or mouse
hive> select oid from Order WHERE Oname="laptop" OR Oname="mouse";
OK
102
103
104
105
5.write a query to display oid and cid
hive> select oid,cid from Order;
OK
101 1
102 2
103 3
104 4
105 2
1.write a query to display customer name of cid=2 from customer
hive> select cname from customer where cid=2;
OK
B
2.write a query to display customer names who are having customer id < 4
hive> select cname from customer where cid<4;
OK
A
B
C
JOINS
hive> select c.cid,c.cname,o.oid,o.oname from customer c join Order o on(c.cid=o.cid);
OK
1 A 101 laptop
2 B 102 cd
3 C 103 pendrive
4 D 104 dd
hive> select c.cid,c.cname,o.oid,o.oname from customer c left outer join Order o
on(c.cid=o.oid);
OK
1 A NULL NULL
2 B NULL NULL
3 C NULL NULL
4 D NULL NULL

hive> select c.cid,c.cname,o.oid,o.oname from customer c right outer join Order o


on(c.cid=o.oid);
OK
NULL NULL 101 pendrive
NULL NULL 102 mouse
NULL NULL 103 laptop
NULL NULL 104 laptop
NULL NULL 105 mouse
hive> select c.cid,c.cname,o.oid,o.oname from customer c full outer join Order o
on(o.oid=c.cid);
OK
NULL NULL 101 pendrive
NULL NULL 102 mouse
NULL NULL 103 laptop
NULL NULL 104 laptop
NULL NULL 105 mouse
Views:
hive> create view customer_view as
> select c.cid,c.cname,o.oid,o.oname
> from customer c full outer join Order o
> on (c.cid=o.cid);
OK
hive> select * from customer_view;

OK
1 A 101 laptop
2 B 102 cd
3 C 103 pendrive
4 D 104 dd
NULL NULL 105 ddd

VIVA QUESTIONS:

1. What is Hive?

Ans: Hive is a data warehouse tool for analyzing structured data in Hadoop.

2. What are the different types of Joins?


Ans: Inner Join, Left Outer Join, Right Outer Join, Full Outer Join
3. How can we copy data from local file system to Hive?

Ans: By using the "LOAD DATA" command.

