
Basics of MapReduce

Example for complete Mapreduce Flow


MapReduce API classes and interfaces
Find MapReduce old API in
https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/mapred/package-summary.html

Find MapReduce new API in

https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/mapreduce/package-summary.html
SNO  Primitive type   Java wrapper class   Hadoop type class

1    byte             Byte                 ByteWritable

2    boolean          Boolean              BooleanWritable

3    char or String   String               Text

4    double           Double               DoubleWritable

5    float            Float                FloatWritable

6    int              Integer              IntWritable

7    long             Long                 LongWritable

8    short            Short                ShortWritable

Converting between Java types and Hadoop Writable types:

int a = 10;
IntWritable i = new IntWritable(a);   // wrap the Java int in a Hadoop Writable
int x = i.get();                      // unwrap back to a Java int
Class JobConf

• Used to describe a MapReduce job configuration to the Hadoop framework for execution
• It specifies the Mapper, Combiner, Partitioner, Reducer, InputFormat and OutputFormat to be used in the MapReduce program

Methods

*public JobConf(Configuration conf, Class driverClass)


*public void setJobName(String name)
*public void setJarByClass(Class cls)
*public String getJobName()
*public void setOutputKeyClass(TypeClass.class)

Hadoop has its own data types. Writable classes commonly used for key/value pairs are BooleanWritable, ByteWritable, DoubleWritable, FloatWritable, IntWritable, LongWritable, Text and NullWritable.

* public Typeclass getOutputKeyClass()


*public void setOutputValueClass(Typeclass.class)
*public Typeclass getOutputValueClass()
*public void setMapOutputValueClass(Typeclass.class)
*public Typeclass getMapOutputValueClass()
* public void setMapOutputKeyClass(Typeclass.class)
*public Typeclass.class getMapOutputKeyClass()

*public void setInputFormat(Class<? extends InputFormat> theClass)

TextInputFormat is the default input format.

->TextInputFormat: Each line in a text file is a record. The key is the byte offset of the line and the value is the content of the line.
Type of key: LongWritable   Type of value: Text

->KeyValueTextInputFormat: Each record is split at the tab character ('\t'). Everything before the tab is the key and everything after it is the value.
Type of key: Text   Type of value: Text

->SequenceFileInputFormat: A sequence file is a compressed binary file format.
Type of key: user defined   Type of value: user defined

->NLineInputFormat: Same as TextInputFormat, but each split contains exactly N lines.
Type of key: LongWritable   Type of value: Text
*public InputFormat getInputFormat()
*public void setOutputFormat(Class<? extends OutputFormat> theClass)
*public OutputFormat getOutputFormat()

*public void setMapperClass(Class<? extends Mapper> .class)


* public void setReducerClass(Class<? extends Reducer>.class)
*public void setCombinerClass(Class<? extends Reducer>.class)
*public void setPartitionerClass(Class<? extends Partitioner> .class)
*public void setNumMapTasks(int n)
*public void setNumReduceTasks(int n)

Example
JobConf job = new JobConf(new Configuration(), MyJob.class);
job.setJobName("Max Temperature");
job.setMapperClass(MyMapper.class);
job.setCombinerClass(MyReducer.class);
job.setReducerClass(MyReducer.class);
job.setInputFormat(TextInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
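
The example above keeps the default TextInputFormat. As a sketch only (assuming the old org.apache.hadoop.mapred API), switching to one of the other input formats listed earlier would look like this; the linespermap property name is taken from the old NLineInputFormat and should be checked against your Hadoop version:

// Sketch: selecting a non-default input format on the same JobConf (old API)
job.setInputFormat(org.apache.hadoop.mapred.KeyValueTextInputFormat.class);   // key = text before the tab, value = text after it

// Or: exactly N lines per split
// job.setInputFormat(org.apache.hadoop.mapred.lib.NLineInputFormat.class);
// job.setInt("mapred.line.input.format.linespermap", 10);   // assumed old-API property name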

In the new Java MapReduce API, the JobConf class is replaced by the Job class.
Most configuration methods of JobConf have equivalents on Job (for example, setInputFormatClass instead of setInputFormat). Its usage is:

Job job = new Job();
job.setJobName("Max Temperature");
Class FileInputFormat
It is used to set and get the input paths

*public static void addInputPath(JobConf conf, Path path)


*public static void addInputPaths(JobConf conf, String commaSeparatedPaths)
*public static Path[] getInputPaths(JobConf conf)

Class FileOutputFormat
It is used to set output path

public static void setOutputPath(JobConf conf, Path outputDir)

Note: This outputDir shouldn’t exist before running job.

Example
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

Interface InputFormat<K,V>

*InputSplit[] getSplits(JobConf job, int numSplits) throws IOException

Logically split the set of input files for the job.

*RecordReader<K,V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException

Get the RecordReader for the given InputSplit.


Interface RecordReader<K,V>
RecordReader reads <key, value> pairs from an InputSplit

• K createKey(): Create an object of the appropriate type to be used as a key.
• V createValue(): Create an object of the appropriate type to be used as a value.
• long getPos() throws IOException: Returns the current position in the input.
• boolean next(K key, V value) throws IOException: Reads the next key/value pair; returns false once the split is exhausted.
• void close() throws IOException: Close the InputSplit.
• float getProgress() throws IOException: Returns how much of the input has been consumed, from 0.0 to 1.0.
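
The framework drives a RecordReader roughly as in the following sketch (old-API types; the class and method names of the sketch itself are illustrative, not from the original notes):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class RecordReaderSketch {
    static void readSplit(InputFormat<LongWritable, Text> inputFormat,
                          InputSplit split, JobConf job) throws IOException {
        RecordReader<LongWritable, Text> reader =
                inputFormat.getRecordReader(split, job, Reporter.NULL);
        LongWritable key = reader.createKey();
        Text value = reader.createValue();
        while (reader.next(key, value)) {   // false once the split is exhausted
            // each (key, value) pair would be passed to the mapper here
        }
        reader.close();
    }
}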

Interface OutputFormat<K,V>

OutputFormat describes & validates the output-specification for a Map-Reduce job

*void checkOutputSpecs(FileSystem ignored, JobConf job) throws IOException


*RecordWriter<K,V> getRecordWriter(FileSystem ignored, JobConf job, String name,
Progressable progress) throws IOException

Interface RecordWriter<K,V>

RecordWriter writes the output <key, value> pairs to an output file.

*void write(K key, V value) throws IOException


*void close(Reporter reporter) throws IOException
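
A corresponding sketch for the output side, again using old-API types (class name and the "part-00000" file name are only illustrative):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.OutputFormat;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;

public class RecordWriterSketch {
    static void writeOne(OutputFormat<Text, IntWritable> outputFormat,
                         FileSystem fs, JobConf job) throws IOException {
        RecordWriter<Text, IntWritable> writer =
                outputFormat.getRecordWriter(fs, job, "part-00000", Reporter.NULL);
        writer.write(new Text("hadoop"), new IntWritable(1));   // emit one <key, value> pair
        writer.close(Reporter.NULL);
    }
}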
Interface Mapper<K1,V1,K2,V2>

*void map(K1 key, V1 value, OutputCollector<K2,V2> output, Reporter reporter) throws IOException

Output <k,v> pairs are collected using OutputCollector.collect(Object, Object).

In the new MapReduce API, Mapper is an abstract class. Its map method is:

void map(K1 key, V1 value, Context context) throws IOException, InterruptedException

Output <k,v> pairs are added to the output using the write method, i.e.

context.write(object, object)
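
For comparison with the new-API WordMapper shown later, a minimal old-API word-count mapper could look like the sketch below (the class name is made up for illustration):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class OldApiWordMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // split the line into words and emit (word, 1) for each
        for (String word : value.toString().split("\\W+")) {
            if (word.length() > 0) {
                output.collect(new Text(word), new IntWritable(1));
            }
        }
    }
}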

Interface Partitioner<K2,V2>

Partitioner controls the partitioning of the keys of the intermediate map-outputs to reducers

* int getPartition(K2 key, V2 value, int numPartitions)

Gets the partition number for a given key (hence record), given the total number of partitions, i.e. the number of reduce tasks for the job.
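
A minimal sketch of a custom old-API Partitioner that routes keys by their first character (the class name and routing rule are illustrative, not from the original notes):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class FirstCharPartitioner implements Partitioner<Text, IntWritable> {

    public void configure(JobConf job) { }   // no configuration needed

    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0;
        }
        // all keys starting with the same character go to the same reducer
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
}

It would be registered in the driver with job.setPartitionerClass(FirstCharPartitioner.class), together with an appropriate setNumReduceTasks value.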
Interface Reducer<K2,V2,K3,V3>: Reducer has 3 primary phases:

1) Shuffle  2) Sort  3) Secondary sort

Secondary sort is implemented using a composite key comparator.

*void reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output, Reporter reporter) throws IOException

Output <k,v> pairs are collected using OutputCollector.collect(TypeClass, TypeClass).
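
A matching old-API word-count reducer sketch (again, the class name is only illustrative):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class OldApiWordReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {          // add up the counts for this word
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}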

Class MapReduceBase

Base class for Mapper and Reducer implementations. Provides default no-op implementations for a few methods.

*void close(): used to close the stream and release resources
*public void configure(JobConf job)

class JobClient

JobClient provides facilities to submit jobs and track their progress.

*public static RunningJob runJob(JobConf job) throws IOException

Utility that submits a job and polls for progress until it is complete.
Here is an example of how to use JobClient:

JobConf job = new JobConf(new Configuration(), MyJob.class);
job.setJobName("myjob");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MyJob.MyMapper.class);
job.setReducerClass(MyJob.MyReducer.class);
// Submit the job, then poll for progress until the job is complete
JobClient.runJob(job);

Note:
In the new API, we use the method waitForCompletion instead of JobClient's runJob method.

boolean waitForCompletion(boolean verbose)

This method submits the job and waits for it to finish.
The argument indicates whether progress information should be printed: when true, the job writes information about its progress to the console.
The return value is a boolean indicating success (true) or failure (false); the driver typically converts it to an exit code of 0 or 1, as in the example below.

Wordcount using new API

Wordcount.java

import java.io.IOException;


import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class Wordcount
{
public static void main(String[] args) throws IOException, ClassNotFoundException,
InterruptedException
{
if (args.length != 2)
{
System.out.println("Usage: MaxTemperature <input path> <output path>");
System.exit(-1);
}
Job conf= new Job();
conf.setJarByClass(Wordcount.class);
FileInputFormat.addInputPath(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
conf.setMapperClass(WordMapper.class);
conf.setReducerClass(WordReducer.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
System.exit(conf.waitForCompletion(true) ? 0 : 1);
}
}
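
Since the word-count reduce function is associative and commutative, the same reducer class could also be registered as a combiner in the driver above. This line is an optional addition, not part of the original listing:

conf.setCombinerClass(WordReducer.class);   // optional: combine partial counts on the map side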
WordMapper.java

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>


{

public void map(LongWritable key, Text value,Context context)


throws IOException, InterruptedException
{
String s = value.toString();
for (String word : s.split("\\W+"))
{
if (word.length() > 0)
{
context.write(new Text(word), new IntWritable(1));
}
}
}
}
WordReducer.java

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

public void reduce(Text key, Iterable<IntWritable> values,Context context)


throws IOException, InterruptedException {

int wordCount = 0;

for (IntWritable value : values) {

wordCount += value.get();
}

context.write(key, new IntWritable(wordCount));


}
}
Executing WordCount MapReduce Program
Open Eclipse
Provide your Workspace Name
Minimize Welcome Screen
Create new Project
Double click on Java Project
Provide Project name
After successful creation of project
Create classes for Driver Logic,Mapper,Reducer Programs
Created empty class. Now start coding Driver Logic
After coding Driver Logic
In the same way code Mapper,Reducer program
Add MapReduce libraries for JAVA Project
Goto Libraries Tab and click on “Add External JARs”
Adding External JARs and click on “OK”

Select all External JARs loaded and click on “OK”
Now all errors will disappear
Running Map reduce program locally
(Run Configurations of Project in Eclipse IDE)
Place your input data in your project workspace folder
Open eclipse -> Right Click on Project->Run Configurations
Create configuration for your Java applications
->Provide Project Name, Main class
name
Provide input and output arguments in Arguments tab ,and click on Run
Observe status of your running program in console
Output will be placed in your project workspace folder
Open output folder to see actual output
Running Map reduce program on HDFS
Create and place input in HDFS
Open HDFS to view input
After coding the Driver Logic, Mapper and Reducer, create an executable JAR to run the MapReduce program
Select Runnable JARFILE
Provide configuration name and destination for JAR file
Provide destination path for JAR file along with name
Click on Finish
Ignore Warning message
Creating JAR file in progress
Execute MapReduce Program using JAR File
Output is saved in HDFS
Apache Hive ™

The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large
datasets residing in distributed storage using SQL.
Apache Hive is an open source project run by volunteers at the Apache Software Foundation. Before becoming an Apache open source project, Hive originated at Facebook.
It provides
● Tools to enable easy access to data via HiveQL, thus enabling data warehousing tasks such as
extract/transform/load (ETL)
● A mechanism to impose structure on a variety of data formats.
● Access to files stored either directly in Apache HDFS or in other data storage systems such as
Apache HBaseTM
● Built-in user defined functions (UDFs) to manipulate dates, strings, and other data-mining tools.
● SQL-like queries (HiveQL), which are implicitly converted into MapReduce or Tez, or Spark jobs.
● By default, Hive stores metadata in an embedded Apache Derby database, and other client/server
databases like MySQL can optionally be used
● Hive indexing is to improve the speed of query lookup on certain columns of a table
Limitations of Hive:
● Hive is not designed for online transaction processing (OLTP); it is only used for online analytical processing (OLAP).
● Hive supports overwriting data, but not row-level updates and deletes.
● In Hive, subqueries are not supported.
● There is no "INSERT INTO table VALUES ..." statement; you can only load data using bulk loads.
● There is no "DELETE FROM" command; you can only do bulk deletes.
Getting started with Apache HIVE

Step 1: Download apache-hive-1.2.1-bin.tar.gz (or)any latest version

Step 2: Extract apache-hive-1.2.1-bin.tar.gz to your HOME directory


Step 3: Set HADOOP_HOME=/home/satish/hadoop-1.2.1 in hive-config.sh

Find hive-config.sh in /home/satish/apache-hive-1.2.1-bin/bin

Step 4: Set HIVE_HOME and HADOOP_HOME in bashrc

~/hadoop-1.2.1$ sudo gedit ~/.bashrc


Step 5: Create HIVE configuration file

The HIVE distribution includes a template configuration file that provides all default settings for HIVE. To customize HIVE you need to copy the template file to the file named hive-site.xml:

~$ cd apache-hive-1.2.1-bin
~/apache-hive-1.2.1-bin$ cd conf
~/apache-hive-1.2.1-bin/conf$ cp hive-default.xml.template hive-site.xml

Now delete all default properties from hive-site.xml.
HIVE Data Definition Language (DDL )

1)HIVE Database
*Creating database
Syntax:
CREATE DATABASE [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];

Example1:

hive> show databases;


OK
default
Time taken: 0.013 seconds, Fetched: 1 row(s)

hive> set hive.cli.print.current.db=true;


hive (default)> create database employ;
OK
Time taken: 0.134 seconds

hive (default)> use employ;


OK
Time taken: 0.015 seconds
hive (employ)>
Example2:

hive (default)> create database IF NOT EXISTS employ


> COMMENT 'Database for employees'
> LOCATION '/hive/warehouse'
> with DBPROPERTIES ('creator'='ram','city'='vijayawada');
OK
Time taken: 0.014 seconds

Trying to create the same database again:

hive (default)> create database employ


> COMMENT 'Database for employees'
> LOCATION '/hive/warehouse'
> with DBPROPERTIES ('creator'='ram','city'='vijayawada');
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Database
employ already exists

hive (default)> show databases;


OK
default
employ
Time taken: 0.012 seconds, Fetched: 2 row(s)
*Drop database
Syntax:
DROP DATABASE [IF EXISTS] database_name [RESTRICT|CASCADE];
Example:

hive> Drop database employ;


OK
Time taken: 1.416 seconds

hive> show databases;


OK
default
Time taken: 0.25 seconds, Fetched: 1 row(s)
Hive>

● The default behavior is RESTRICT, where DROP DATABASE will fail if the database is not
empty.
● To drop the tables in the database as well, use DROP DATABASE db_name CASCADE

*Describe database

hive> DESCRIBE DATABASE EXTENDED employ;


OK
employ Database for employees hdfs://localhost:9000/hive/warehouse satish USER
{city=vijayawada, creator=ram}
Time taken: 0.018 seconds, Fetched: 1 row(s)
*Alter database
Syntax:
ALTER DATABASE database_name SET DBPROPERTIES (property_name=property_value, ...);
ALTER DATABASE database_name SET OWNER [USER|ROLE] user_or_role;
Example:
hive (default)> ALTER database employ SET DBPROPERTIES ('creator'='rajesh','city'='Guntur');
OK
Time taken: 0.117 seconds

hive (default)> DESCRIBE DATABASE EXTENDED employ;


OK
employ Database for employees hdfs://localhost:9000/hive/warehouse satish USER {city=Guntur,
creator=rajesh}
Time taken: 0.022 seconds, Fetched: 1 row(s)

*Use Database
● Statement USE sets the current database for all subsequent HiveQL statements.
● To revert to the default database, use the keyword "default" instead of a database name.
● To check which database is currently being used: SELECT current_database()

Example
hive (default)> select current_database();
OK
default
Time taken: 0.96 seconds, Fetched: 1 row(s)
hive (default)> use employ;
OK
Time taken: 0.017 seconds

hive (employ)> select current_database();


OK
employ
Time taken: 0.115 seconds, Fetched: 1 row(s)
hive (employ)>

● If you have a lot of databases, you can restrict the ones listed using a regular expression,
● The following example lists only those databases that start with the letter e and end with
any other characters (the .* part):

hive (employ)> show databases LIKE 'e.*';


OK
employ
Time taken: 0.136 seconds, Fetched: 1 row(s)
hive (employ)>
HIVE datatypes

hive> CREATE TABLE data_types_table (
    >   our_tinyint   TINYINT   COMMENT '1 byte signed integer',
    >   our_smallint  SMALLINT  COMMENT '2 byte signed integer',
    >   our_int       INT       COMMENT '4 byte signed integer',
    >   our_bigint    BIGINT    COMMENT '8 byte signed integer',
    >   our_float     FLOAT     COMMENT 'Single precision floating point',
    >   our_double    DOUBLE    COMMENT 'Double precision floating point',
    >   our_decimal   DECIMAL   COMMENT 'Precise decimal type based on Java BigDecimal object',
    >   our_timestamp TIMESTAMP COMMENT 'YYYY-MM-DD HH:MM:SS.fffffffff (9 decimal place precision)',
    >   our_boolean   BOOLEAN   COMMENT 'TRUE or FALSE boolean data type',
    >   our_string    STRING    COMMENT 'Character string data type',
    >   our_binary    BINARY    COMMENT 'Data type for storing an arbitrary number of bytes',
    >   our_array     ARRAY<TINYINT> COMMENT 'A collection of fields all of the same data type indexed by an integer',
    >   our_map       MAP<STRING,INT> COMMENT 'A collection of key/value pairs where the key is a primitive type and the value can be anything; the chosen data types for the keys and values must remain the same per map',
    >   our_struct    STRUCT<first : SMALLINT, second : FLOAT, third : STRING> COMMENT 'A nested complex data structure',
    >   our_union     UNIONTYPE<INT,FLOAT,STRING> COMMENT 'A complex data type that can hold one of its possible data types at once')
    > COMMENT 'Table illustrating all Apache Hive data types'
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > COLLECTION ITEMS TERMINATED BY '|'
    > MAP KEYS TERMINATED BY '^'
    > LINES TERMINATED BY '\n'
    > STORED AS TEXTFILE
    > TBLPROPERTIES ('creator'='Bruce Brown', 'created_at'='Sat Sep 21 20:46:32 EDT 2013');
OK
Time taken: 0.886 seconds
2)HIVE tables
*Creating HIVE tables
CREATE TABLE creates a table with the given name. An error is thrown if a table or view
with the same name already exists. You can use IF NOT EXISTS to skip the error.

hive> create table IF NOT EXISTS Employee(
    > SNO int comment 'sequence number',
    > name string comment 'Employee Name',
    > position string comment 'Employee Role',
    > salary int comment 'Employee Salary',
    > dept string comment 'Employee group')
    > comment 'Employee details'
    > row format delimited
    > fields terminated by ','
    > lines terminated by '\n'
    > STORED AS TEXTFILE
    > LOCATION '/hive/warehouse/employ.db/Employee'
    > TBLPROPERTIES ('creator'='ram');
The default TEXTFILE format for HIVE tables is slower to process and consumes a lot of disk space unless you compress it.
For these reasons, the Apache Hive community came up with several choices for storing tables on HDFS:

TEXTFILE
● Use STORED AS TEXTFILE if the data needs to be stored as plain text files.
● Default file format for HIVE records.
● Alphanumeric Unicode characters are used to store your data.

SEQUENCEFILE
● Use STORED AS SEQUENCEFILE if the data needs to be compressed.
● A binary file format composed of key/value pairs.

RCFILE (Record Columnar File)
● Stores records in a column-oriented fashion rather than a row-oriented fashion.
● Use the RCFILE format if you have a large number of columns but only a few columns are typically queried.

ORC (Optimized Row Columnar File)
● A format with significant optimizations to improve HIVE reads, writes and the processing of tables.
Drop Table
DROP TABLE [IF EXISTS] table_name [PURGE];
● DROP TABLE removes metadata and data for this table. The data is actually moved to the .Trash/Current
directory if Trash is configured
● If PURGE is specified, the table data does not go to the .Trash/Current directory and so cannot be retrieved
in the event of a mistaken DROP.
Example:

hive (employ)> DROP table IF EXISTS employee2;


OK
Time taken: 0.305 seconds

Truncate Table

hive (employ)> TRUNCATE TABLE table_name ;


HIVE DML(Data Manipulation language)

Loading data in to table


LOAD DATA [LOCAL] INPATH 'path to file' [OVERWRITE] INTO TABLE table_name
Example: Consider a text file of employee records in the local filesystem.
hive (employ)> LOAD DATA LOCAL INPATH 'hiveinput/employ' INTO TABLE Employee;
Loading data to table employ.employee
Table employ.employee stats: [numFiles=0, totalSize=0]
OK
Time taken: 0.507 seconds


hive (employ)> select * from Employee;


OK
1 Anne Admin 50000 A
2 Gokul Admin 50000 B
3 Janet Sales 60000 A
4 Hari Admin 50000 C
......
.....
Apache Pig
Step 1: Download pig from here
https://pig.apache.org/releases.html

Step 2: Extract to home directory “/home/aliet”

Step 3: Modify the .bashrc file. To open the bashrc file use this command
$ sudo gedit /etc/bash.bashrc

--> In bashrc file append the below statements

export PIG_HOME=/home/aliet/pig-0.15.0
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=/home/aliet/hadoop-1.2.1/conf
export JAVA_HOME=$JAVA_HOME/usr

Step 4: Start all Hadoop daemons; open a terminal and type the following command
hadoop-1.2.1$bin/start-all.sh

Step 5 :On the command prompt type


$ pig -h
Step 6 :To Start pig in local mode
$ pig -x local
grunt>
Step 7 :To start pig in mapreduce mode
$ pig -x mapreduce or $ pig
LOAD ,DUMP,DESCRIBE operators
Loading Employ data in to bag

001 Robin 22 newyork


002 BO 23 Kolkata
003 Maya 23 Tokyo
004 Sara 25 London
005 David 23 Bhuwaneshwar
006 Maggy 22 Chennai

grunt> A = LOAD 'piginput/employ';


grunt> DUMP A;

(001,Robin,22,newyork)
(002,BO,23,Kolkata)
(003,Maya,23,Tokyo)
(004,Sara,25,London)
(005,David,23,Bhuwaneshwar)
(006,Maggy,22,Chennai)

grunt> DESCRIBE A;
Schema for A unknown.
grunt>
Defining schema and Using function
Employ data:
001:Robin:22:newyork
002:BO:23:Kolkata
003:Maya:23:Tokyo
004:Sara:25:London
005:David:23:Bhuwaneshwar
006:Maggy:22:Chennai
grunt> A= LOAD 'piginput/employ' USING PigStorage(':') AS (id:int,name:chararray,age:int,city);
grunt> DUMP A;
(1,Robin,22,newyork)
(2,BO,23,Kolkata)
(3,Maya,23,Tokyo)
(4,Sara,25,London)
(5,David,23,Bhuwaneshwar)
(6,Maggy,22,Chennai)
grunt> DESCRIBE A;
A: {id: int,name: chararray,age: int,city: bytearray}

If we omit the type of a field, the default type is bytearray.

grunt> A = LOAD 'piginput/employ' USING PigStorage(':') AS (id,name:chararray,age:int,city);


grunt> DESCRIBE A;
A: {id: bytearray,name: chararray,age: int,city: bytearray}
grunt>
GROUP and COGROUP operators
Grouping Employ by age
grunt> A= LOAD 'piginput/employ' USING PigStorage(':') AS (id:int,name:chararray,age:int,city);
grunt> B= GROUP A by age;
grunt>DUMP B;

(22,{(1,Robin,22,newyork),(6,Maggy,22,Chennai)})
(23,{(2,BO,23,Kolkata),(3,Maya,23,Tokyo),(5,David,23,Bhuwaneshwar)})
(25,{(4,Sara,25,London)})

grunt> DESCRIBE B;
B: {group: int,A: {(id: int,name: chararray,age: int,city: bytearray)}}
grunt>

The COGROUP operator is used in statements involving two or more relations.

grunt> employ = LOAD 'piginput/employ' USING PigStorage(':') AS (id:int,name:chararray,age:int,city);

grunt> student = LOAD 'piginput/student' USING PigStorage(',') AS (id:int,firstname:chararray,lastname:chararray,age:int,phno:long,city:chararray);

grunt> cogroup_data = COGROUP student by age, employ by age;


grunt> DUMP cogroup_data
(21,{(1,Rajiv,Reddy,21,9848022337,Hyderabad),
(3,Rajesh,Khanna,21,9848022339,Hyderabad),
(4,Preethi,Agarwal,21,9848022330,Punei)},{})
(22,{(2,siddarth,Battacharya,22,9848022338,Kolkata)},{(1,Robin,22,newyork),
(6,Maggy,22,Chennai)})
(23,{(5,Trupthi,Mohanthy,23,9848022336,Chennai),
(6,Archana,Mishra,23,9848022335,Chennai)},{(2,BO,23,Kolkata),(3,Maya,23,Tokyo),
(5,David,23,Bhuwaneshwar)})
(24,{(7,Komal,Nayak,24,9848022334,trivendram),
(8,Bharathi,Nambiayar,24,9848022333,trivendram)},{})
(25,{},{(4,Sara,25,London)})

FILTER: Selects tuples from a relation based on some condition.
● Use the FILTER operator to work with tuples or rows of data.
● FILTER is commonly used to select the data that you want; or, conversely, to filter out (remove) the data you don't want.

grunt> X = FILTER A BY a3 == 3;
grunt> DUMP X;
(1,2,3)
(4,3,3)
(4,3,3)
(1,2,3)

grunt> X = FILTER A BY (a1 == 8) OR (NOT (a2+a3 > a1));
grunt> DUMP X;
(8,3,4)
LIMIT
● Use the LIMIT operator to limit the number of output tuples.
● If the specified number of output tuples is equal to or exceeds the number of tuples in the relation, the output will include all tuples in the relation.
● A particular set of tuples can be requested using the ORDER operator followed by LIMIT.
grunt> X= LIMIT student 3;
grunt> DUMP X;
(1,Rajiv,Reddy,21,9848022337,Hyderabad)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(3,Rajesh,Khanna,21,9848022339,Hyderabad)

DISTINCT
● Use the DISTINCT operator to remove duplicate tuples in a relation.
● DISTINCT does not preserve the original order of the contents.
● You cannot use DISTINCT on a subset of fields.

grunt> A = LOAD 'data' USING PigStorage(' ') AS (a1:int,a2:int,a3:int);


grunt> DUMP A;
(8,3,4)
(1,2,3)
(4,3,3)
(4,3,3)
(1,2,3)
CROSS
Use the CROSS operator to compute the cross product (Cartesian product) of two or more
relations.
grunt>A = LOAD 'data1' AS (a1:int,a2:int,a3:int);
grunt>DUMP A;
(1,2,3)
(4,2,1)
grunt>B = LOAD 'data2' AS (b1:int,b2:int);
grunt>DUMP B;
(2,4)
(8,9)
(1,3)
In this example the cross product of relation A and B is computed.

grunt>X = CROSS A, B;
grunt>DUMP X;
(1,2,3,2,4)
(1,2,3,8,9)
(1,2,3,1,3)
(4,2,1,2,4)
(4,2,1,8,9)
(4,2,1,1,3)
UNION

Use the UNION operator to merge the contents of two or more relations. The UNION operator:
• Does not preserve the order of tuples. Both the input and output relations are interpreted as unordered bags of tuples.
• Does not ensure (as databases do) that all tuples belong to the same schema or that they have the same number of fields.
• Does not eliminate duplicate tuples.

grunt>A = LOAD 'data' AS (a1:int,a2:int,a3:int);


grunt>DUMP A;
(1,2,3)
(4,2,1)
grunt>B = LOAD 'data' AS (b1:int,b2:int);
grunt>DUMP B;
(2,4)
(8,9)
(1,3)
grunt>X = UNION A, B;
grunt>DUMP X;
(1,2,3)
(4,2,1)
(2,4)
(8,9)
(1,3)
SPLIT

Use the SPLIT operator to partition the contents of a relation into two or more relations
based
on some expression. Depending on the conditions stated in the expression:
• A tuple may be assigned to more than one relation.
• A tuple may not be assigned to any relation.

SPLIT alias INTO alias IF expression, alias IF expression [, alias IF expression ...];

In this example relation A is split into three relations, X, Y, and Z.

grunt>A = LOAD 'data' AS (f1:int,f2:int,f3:int);


grunt>DUMP A;
(1,2,3)
(4,5,6)
(7,8,9)
grunt>SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);
grunt>DUMP X;
(1,2,3)
(4,5,6)
grunt>DUMP Y;
(4,5,6)
grunt>DUMP Z;
(1,2,3)
(7,8,9)
FOREACH

Use the FOREACH ...GENERATE operation to work with columns of data (if you want to
work with tuples or rows of data, use the FILTER operation).

grunt>A = LOAD 'data1' AS


(a1:int,a2:int,a3:int);
grunt>DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
grunt>X = FOREACH A GENERATE *;

grunt>DUMP X;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
:~/hadoop-1.2.1$ bin/hadoop dfs -put piginput/in piginput/in

:~/hadoop-1.2.1$ pig -x mapreduce


grunt> A = LOAD 'piginput/in' USING PigStorage(' ') AS (a1:int,a2:int,a3:int);
grunt> DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
grunt> X = FOREACH A GENERATE a1, a2;
grunt> STORE X INTO 'output';

Two fields in relation A are summed to form relation X. A schema is defined for the
projected field.

grunt> X = FOREACH A GENERATE a1+a2 AS f1:int;
grunt> DESCRIBE X;
X: {f1: int}
grunt> STORE X INTO 'output';
JOIN
● Use the JOIN operator to perform an inner join of two or more relations based on common field values.
● Inner joins ignore null keys.
● JOIN creates a flat set of output records while COGROUP creates a nested set of output records.

Syntax:
alias = JOIN left-alias BY left-alias-column [LEFT|RIGHT|FULL] [OUTER], right-alias BY right-alias-column;
A = LOAD 'data1' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

B = LOAD 'data2' AS (b1:int,b2:int);
DUMP B;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)
In this example relations A and B are joined by their first fields.

X = JOIN A BY a1, B BY b1;


DUMP X;
(1,2,3,1,3)
(4,2,1,4,6)
(4,3,3,4,6)
(4,2,1,4,9)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)
LEFT OUTER JOIN

grunt> C = JOIN A by $0 LEFT OUTER, B BY $0;


(1,2,3,1,3)
(4,2,1,4,6)
(4,2,1,4,9)
(4,3,3,4,6)
(4,3,3,4,9)
(7,2,5,,)
(8,3,4,8,9)
(8,4,3,8,9)

RIGHT OUTER JOIN


grunt> C = JOIN A by $0 RIGHT OUTER, B BY $0;

(1,2,3,1,3)
(,,,2,4)
(,,,2,7)
(,,,2,9)
(4,2,1,4,6)
(4,2,1,4,9)
(4,3,3,4,6)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)
FULL OUTER JOIN
grunt> C = JOIN A by $0 FULL, B BY $0;
(1,2,3,1,3)
(,,,2,4)
(,,,2,7)
(,,,2,9)
(4,2,1,4,6)
(4,2,1,4,9)
(4,3,3,4,6)
(4,3,3,4,9)
(7,2,5,,)
(8,3,4,8,9)
(8,4,3,8,9)
