
Basics of MapReduce

Example for complete Mapreduce Flow


MapReduce API classes and interfaces
Find MapReduce old API in
https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/mapred/package-summary.html

Find MapReduce new API in

https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/mapreduce/package-summary.html
SNO  Primitive type   Java wrapper class   Hadoop type class

1    byte             Byte                 ByteWritable

2    boolean          Boolean              BooleanWritable

3    char or String   String               Text

4    double           Double               DoubleWritable

5    float            Float                FloatWritable

6    int              Integer              IntWritable

7    long             Long                 LongWritable

8    short            Short                ShortWritable

Converting between Java types and Hadoop Writable types:

int a = 10;
IntWritable i = new IntWritable(a);   // wrap the Java int in a Hadoop Writable
int x = i.get();                      // unwrap back to a Java int
Class JobConf

• Used to describe a MapReduce job configuration to the Hadoop framework for execution
• It specifies the Mapper, Combiner, Partitioner, Reducer, InputFormat and OutputFormat to be used in the MapReduce program

Methods

*public JobConf(Configuration conf, Class driverClass)


*public void setJobName(String name)
*public void setJarByClass(Class cls)
*public String getJobName()
*public void setOutputKeyClass(TypeClass.class)

Hadoop has its own data types. Writable classes commonly used for key/value pairs are BooleanWritable, ByteWritable, DoubleWritable, FloatWritable, IntWritable, LongWritable, Text and NullWritable.

* public Typeclass getOutputKeyClass()


*public void setOutputValueClass(Typeclass.class)
*public Typeclass getOutputValueClass()
*public void setMapOutputValueClass(Typeclass.class)
*public Typeclass getMapOutputValueClass()
* public void setMapOutputKeyClass(Typeclass.class)
*public Typeclass.class getMapOutputKeyClass()

*public void setInputFormat(Class<? extends InputFormat> theClass)

TextInputFormat is the default input format.

->TextInputFormat: Each line in a text file is a record. The key is the byte offset of the line and the value is the content of the line.
Type of key: LongWritable   Type of value: Text

->KeyValueTextInputFormat: Each record is split at the tab character ('\t'). Everything before the tab is the key and everything after it is the value.
Type of key: Text   Type of value: Text

->SequenceFileInputFormat: A sequence file is a compressed binary file format.
Type of key: user defined   Type of value: user defined

->NLineInputFormat: Same as TextInputFormat, but each split contains exactly N lines.
Type of key: LongWritable   Type of value: Text
*public InputFormat getInputFormat()
*public void setOutputFormat(Class<? extends OutputFormat> theClass)
*public OutputFormat getOutputFormat()

*public void setMapperClass(Class<? extends Mapper> .class)


* public void setReducerClass(Class<? extends Reducer>.class)
*public void setCombinerClass(Class<? extends Reducer>.class)
*public void setPartitionerClass(Class<? extends Partitioner> .class)
*public void setNumMapTasks(int n)
*public void setNumReduceTasks(int n)

Example
JobConf job = new JobConf(new Configuration(), MyJob.class);
job.setJobName("Max Temperature");
job.setMapperClass(MyMapper.class);
job.setCombinerClass(MyReducer.class);
job.setReducerClass(MyReducer.class);
job.setInputFormat(TextInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
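
The example above keeps the default TextInputFormat. As a sketch only (assuming the old org.apache.hadoop.mapred API), switching to one of the other input formats listed earlier would look like this; the linespermap property name is taken from the old NLineInputFormat and should be checked against your Hadoop version:

// Sketch: selecting a non-default input format on the same JobConf (old API)
job.setInputFormat(org.apache.hadoop.mapred.KeyValueTextInputFormat.class);   // key = text before the tab, value = text after it

// Or: exactly N lines per split
// job.setInputFormat(org.apache.hadoop.mapred.lib.NLineInputFormat.class);
// job.setInt("mapred.line.input.format.linespermap", 10);   // assumed old-API property name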

In the new Java MapReduce API, the JobConf class is replaced by the Job class.
Most configuration methods of JobConf have equivalents on Job (for example, setInputFormatClass instead of setInputFormat). Its usage is:

Job job = new Job();
job.setJobName("Max Temperature");
Class FileInputFormat
It is used to set and get the input paths

*public static void addInputPath(JobConf conf, Path path)


*public static void addInputPaths(JobConf conf, String commaSeparatedPaths)
*public static Path[] getInputPaths(JobConf conf)

Class FileOutputFormat
It is used to set output path

public static void setOutputPath(JobConf conf, Path outputDir)

Note: This outputDir shouldn’t exist before running job.

Example
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

Interface InputFormat<K,V>

*InputSplit[] getSplits(JobConf job, int numSplits) throws IOException

Logically split the set of input files for the job.

*RecordReader<K,V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException

Get the RecordReader for the given InputSplit.


Interface RecordReader<K,V>
RecordReader reads <key, value> pairs from an InputSplit

• K createKey(): Create an object of the appropriate type to be used as a key.
• V createValue(): Create an object of the appropriate type to be used as a value.
• long getPos() throws IOException: Returns the current position in the input.
• boolean next(K key, V value) throws IOException: Reads the next key/value pair; returns false once the split is exhausted.
• void close() throws IOException: Close the InputSplit.
• float getProgress() throws IOException: Returns how much of the input has been consumed, from 0.0 to 1.0.
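
The framework drives a RecordReader roughly as in the following sketch (old-API types; the class and method names of the sketch itself are illustrative, not from the original notes):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class RecordReaderSketch {
    static void readSplit(InputFormat<LongWritable, Text> inputFormat,
                          InputSplit split, JobConf job) throws IOException {
        RecordReader<LongWritable, Text> reader =
                inputFormat.getRecordReader(split, job, Reporter.NULL);
        LongWritable key = reader.createKey();
        Text value = reader.createValue();
        while (reader.next(key, value)) {   // false once the split is exhausted
            // each (key, value) pair would be passed to the mapper here
        }
        reader.close();
    }
}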

Interface OutputFormat<K,V>

OutputFormat describes & validates the output-specification for a Map-Reduce job

*void checkOutputSpecs(FileSystem ignored, JobConf job) throws IOException


*RecordWriter<K,V> getRecordWriter(FileSystem ignored, JobConf job, String name,
Progressable progress) throws IOException

Interface RecordWriter<K,V>

RecordWriter writes the output <key, value> pairs to an output file.

*void write(K key, V value) throws IOException


*void close(Reporter reporter) throws IOException
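
A corresponding sketch for the output side, again using old-API types (class name and the "part-00000" file name are only illustrative):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.OutputFormat;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;

public class RecordWriterSketch {
    static void writeOne(OutputFormat<Text, IntWritable> outputFormat,
                         FileSystem fs, JobConf job) throws IOException {
        RecordWriter<Text, IntWritable> writer =
                outputFormat.getRecordWriter(fs, job, "part-00000", Reporter.NULL);
        writer.write(new Text("hadoop"), new IntWritable(1));   // emit one <key, value> pair
        writer.close(Reporter.NULL);
    }
}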
Interface Mapper<K1,V1,K2,V2>

*void map(K1 key, V1 value, OutputCollector<K2,V2> output, Reporter reporter) throws IOException

Output <k,v> pairs are collected using OutputCollector.collect(Object, Object).

In the new MapReduce API, Mapper is an abstract class. Its map method is:

void map(K1 key, V1 value, Context context) throws IOException, InterruptedException

Output <k,v> pairs are added to the output using the write method, i.e.

context.write(object, object)
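
For comparison with the new-API WordMapper shown later, a minimal old-API word-count mapper could look like the sketch below (the class name is made up for illustration):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class OldApiWordMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // split the line into words and emit (word, 1) for each
        for (String word : value.toString().split("\\W+")) {
            if (word.length() > 0) {
                output.collect(new Text(word), new IntWritable(1));
            }
        }
    }
}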

Interface Partitioner<K2,V2>

Partitioner controls the partitioning of the keys of the intermediate map-outputs to reducers

* int getPartition(K2 key, V2 value, int numPartitions)

Gets the partition number for a given key (hence record), given the total number of partitions, i.e. the number of reduce tasks for the job.
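
A minimal sketch of a custom old-API Partitioner that routes keys by their first character (the class name and routing rule are illustrative, not from the original notes):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class FirstCharPartitioner implements Partitioner<Text, IntWritable> {

    public void configure(JobConf job) { }   // no configuration needed

    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0;
        }
        // all keys starting with the same character go to the same reducer
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
}

It would be registered in the driver with job.setPartitionerClass(FirstCharPartitioner.class), together with an appropriate setNumReduceTasks value.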
Interface Reducer<K2,V2,K3,V3>: Reducer has 3 primary phases:

1) Shuffle  2) Sort  3) Secondary sort

Secondary sort is implemented using a composite key comparator.

*void reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output, Reporter reporter) throws IOException

Output <k,v> pairs are collected using OutputCollector.collect(TypeClass, TypeClass).
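
A matching old-API word-count reducer sketch (again, the class name is only illustrative):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class OldApiWordReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {          // add up the counts for this word
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}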

Class MapReduceBase

Base class for Mapper and Reducer implementations. Provides default no-op implementations for a few methods.

*void close(): used to close the stream and release resources
*public void configure(JobConf job)

class JobClient

JobClient provides facilities to submit jobs and track their progress.

*public static RunningJob runJob(JobConf job) throws IOException

Utility that submits a job and polls for progress until it is complete.
Here is an example of how to use JobClient:

JobConf job = new JobConf(new Configuration(), MyJob.class);
job.setJobName("myjob");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MyJob.MyMapper.class);
job.setReducerClass(MyJob.MyReducer.class);
// Submit the job, then poll for progress until the job is complete
JobClient.runJob(job);

Note:
In the new API, we use the method waitForCompletion instead of JobClient's runJob method.

boolean waitForCompletion(boolean verbose)

This method submits the job and waits for it to finish.
The argument indicates whether progress information should be printed: when true, the job writes information about its progress to the console.
The return value is a boolean indicating success (true) or failure (false); the driver typically converts it to an exit code of 0 or 1, as in the example below.

Wordcount using new API

Wordcount.java

import java.io.IOException;


import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class Wordcount
{
public static void main(String[] args) throws IOException, ClassNotFoundException,
InterruptedException
{
if (args.length != 2)
{
System.out.println("Usage: MaxTemperature <input path> <output path>");
System.exit(-1);
}
Job conf= new Job();
conf.setJarByClass(Wordcount.class);
FileInputFormat.addInputPath(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
conf.setMapperClass(WordMapper.class);
conf.setReducerClass(WordReducer.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
System.exit(conf.waitForCompletion(true) ? 0 : 1);
}
}
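
Since the word-count reduce function is associative and commutative, the same reducer class could also be registered as a combiner in the driver above. This line is an optional addition, not part of the original listing:

conf.setCombinerClass(WordReducer.class);   // optional: combine partial counts on the map side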
WordMapper.java

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>


{

public void map(LongWritable key, Text value,Context context)


throws IOException, InterruptedException
{
String s = value.toString();
for (String word : s.split("\\W+"))
{
if (word.length() > 0)
{
context.write(new Text(word), new IntWritable(1));
}
}
}
}
WordReducer.java

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

public void reduce(Text key, Iterable<IntWritable> values,Context context)


throws IOException, InterruptedException {

int wordCount = 0;

for (IntWritable value : values) {

wordCount += value.get();
}

context.write(key, new IntWritable(wordCount));


}
}
Executing WordCount MapReduce Program
Open Eclipse
Provide your Workspace Name
Minimize Welcome Screen
Create new Project
Double click on Java Project
Provide Project name
After successful creation of project
Create classes for Driver Logic,Mapper,Reducer Programs
Created empty class. Now start coding Driver Logic
After coding Driver Logic
In the same way code Mapper,Reducer program
Add MapReduce libraries for JAVA Project
Goto Libraries Tab and click on “Add External JARs”
Adding External JARs and click on “OK”

Select all External JARs loaded and click on “OK”
Now all errors will disappear
Running Map reduce program locally
(Run Configurations of Project in Eclipse IDE)
Place your input data in your project workspace folder
Open eclipse -> Right Click on Project->Run Configurations
Create configuration for your Java applications
->Provide Project Name, Main class
name
Provide input and output arguments in Arguments tab ,and click on Run
Observe status of your running program in console
Output will be placed in your project workspace folder
Open output folder to see actual output
Running Map reduce program on HDFS
Create and place input in HDFS
Open HDFS to view input
After coding the Driver Logic, Mapper and Reducer, create an executable JAR to run the MapReduce program
Select Runnable JARFILE
Provide configuration name and destination for JAR file
Provide destination path for JAR file along with name
Click on Finish
Ignore Warning message
Creating JAR file in progress
Execute MapReduce Program using JAR File
Output is saved in HDFS
Apache Hive ™

The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large
datasets residing in distributed storage using SQL.
Apache Hive is an open source project run by volunteers at the Apache Software Foundation. Before becoming an Apache open source project, Hive originated at Facebook.
It provides
● Tools to enable easy access to data via HiveQL, thus enabling data warehousing tasks such as
extract/transform/load (ETL)
● A mechanism to impose structure on a variety of data formats.
● Access to files stored either directly in Apache HDFS or in other data storage systems such as
Apache HBaseTM
● Built-in user defined functions (UDFs) to manipulate dates, strings, and other data-mining tools.
● SQL-like queries (HiveQL), which are implicitly converted into MapReduce or Tez, or Spark jobs.
● By default, Hive stores metadata in an embedded Apache Derby database, and other client/server
databases like MySQL can optionally be used
● Hive indexing is to improve the speed of query lookup on certain columns of a table
Limitations of Hive:
● Hive is not designed for online transaction processing (OLTP); it is only used for online analytical processing (OLAP).
● Hive supports overwriting data, but not row-level updates and deletes.
● In Hive, subqueries are not supported.
● There is no "INSERT INTO table VALUES ..." statement; you can only load data using bulk loads.
● There is no "DELETE FROM" command; you can only do bulk deletes.
Getting started with Apache HIVE

Step 1: Download apache-hive-1.2.1-bin.tar.gz (or)any latest version

Step 2: Extract apache-hive-1.2.1-bin.tar.gz to your HOME directory


Step 3: Set HADOOP_HOME=/home/satish/hadoop-1.2.1 in hive-config.sh

Find hive-config.sh in /home/satish/apache-hive-1.2.1-bin/bin

Step 4: Set HIVE_HOME and HADOOP_HOME in bashrc

~/hadoop-1.2.1$ sudo gedit ~/.bashrc


Step 5: Create HIVE configuration file

The HIVE distribution includes a template configuration file that provides all default settings for HIVE. To customize HIVE you need to copy the template file to the file named hive-site.xml:

~$ cd apache-hive-1.2.1-bin
~/apache-hive-1.2.1-bin$ cd conf
~/apache-hive-1.2.1-bin/conf$ cp hive-default.xml.template hive-site.xml

Now delete all default properties from hive-site.xml.
HIVE Data Definition Language (DDL )

1)HIVE Database
*Creating database
Syntax:
CREATE DATABASE [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];

Example1:

hive> show databases;


OK
default
Time taken: 0.013 seconds, Fetched: 1 row(s)

hive> set hive.cli.print.current.db=true;


hive (default)> create database employ;
OK
Time taken: 0.134 seconds

hive (default)> use employ;


OK
Time taken: 0.015 seconds
hive (employ)>
Example2:

hive (default)> create database IF NOT EXISTS employ


> COMMENT 'Database for employees'
> LOCATION '/hive/warehouse'
> with DBPROPERTIES ('creator'='ram','city'='vijayawada');
OK
Time taken: 0.014 seconds

Trying to create the same database again:

hive (default)> create database employ


> COMMENT 'Database for employees'
> LOCATION '/hive/warehouse'
> with DBPROPERTIES ('creator'='ram','city'='vijayawada');
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Database
employ already exists

hive (default)> show databases;


OK
default
employ
Time taken: 0.012 seconds, Fetched: 2 row(s)
*Drop database
Syntax:
DROP DATABASE [IF EXISTS] database_name [RESTRICT|CASCADE];
Example:

hive> Drop database employ;


OK
Time taken: 1.416 seconds

hive> show databases;


OK
default
Time taken: 0.25 seconds, Fetched: 1 row(s)
Hive>

● The default behavior is RESTRICT, where DROP DATABASE will fail if the database is not
empty.
● To drop the tables in the database as well, use DROP DATABASE db_name CASCADE

*Describe database

hive> DESCRIBE DATABASE EXTENDED employ;


OK
employ Database for employees hdfs://localhost:9000/hive/warehouse satish USER
{city=vijayawada, creator=ram}
Time taken: 0.018 seconds, Fetched: 1 row(s)
*Alter database
Syntax:
ALTER DATABASE database_name SET DBPROPERTIES (property_name=property_value, ...);
ALTER DATABASE database_name SET OWNER [USER|ROLE] user_or_role;
Example:
hive (default)> ALTER database employ SET DBPROPERTIES ('creator'='rajesh','city'='Guntur');
OK
Time taken: 0.117 seconds

hive (default)> DESCRIBE DATABASE EXTENDED employ;


OK
employ Database for employees hdfs://localhost:9000/hive/warehouse satish USER {city=Guntur,
creator=rajesh}
Time taken: 0.022 seconds, Fetched: 1 row(s)

*Use Database
● Statement USE sets the current database for all subsequent HiveQL statements.
● To revert to the default database, use the keyword "default" instead of a database name.
● To check which database is currently being used: SELECT current_database()

Example
hive (default)> select current_database();
OK
default
Time taken: 0.96 seconds, Fetched: 1 row(s)
hive (default)> use employ;
OK
Time taken: 0.017 seconds

hive (employ)> select current_database();


OK
employ
Time taken: 0.115 seconds, Fetched: 1 row(s)
hive (employ)>

● If you have a lot of databases, you can restrict the ones listed using a regular expression,
● The following example lists only those databases that start with the letter e and end with
any other characters (the .* part):

hive (employ)> show databases LIKE 'e.*';


OK
employ
Time taken: 0.136 seconds, Fetched: 1 row(s)
hive (employ)>
HIVE datatypes

hive> CREATE TABLE data_types_table (
    >   our_tinyint   TINYINT   COMMENT '1 byte signed integer',
    >   our_smallint  SMALLINT  COMMENT '2 byte signed integer',
    >   our_int       INT       COMMENT '4 byte signed integer',
    >   our_bigint    BIGINT    COMMENT '8 byte signed integer',
    >   our_float     FLOAT     COMMENT 'Single precision floating point',
    >   our_double    DOUBLE    COMMENT 'Double precision floating point',
    >   our_decimal   DECIMAL   COMMENT 'Precise decimal type based on Java BigDecimal object',
    >   our_timestamp TIMESTAMP COMMENT 'YYYY-MM-DD HH:MM:SS.fffffffff (9 decimal place precision)',
    >   our_boolean   BOOLEAN   COMMENT 'TRUE or FALSE boolean data type',
    >   our_string    STRING    COMMENT 'Character string data type',
    >   our_binary    BINARY    COMMENT 'Data type for storing an arbitrary number of bytes',
    >   our_array     ARRAY<TINYINT> COMMENT 'A collection of fields all of the same data type indexed by an integer',
    >   our_map       MAP<STRING,INT> COMMENT 'A collection of key/value pairs where the key is a primitive type and the value can be anything; the chosen data types for the keys and values must remain the same per map',
    >   our_struct    STRUCT<first : SMALLINT, second : FLOAT, third : STRING> COMMENT 'A nested complex data structure',
    >   our_union     UNIONTYPE<INT,FLOAT,STRING> COMMENT 'A complex data type that can hold one of its possible data types at once')
    > COMMENT 'Table illustrating all Apache Hive data types'
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > COLLECTION ITEMS TERMINATED BY '|'
    > MAP KEYS TERMINATED BY '^'
    > LINES TERMINATED BY '\n'
    > STORED AS TEXTFILE
    > TBLPROPERTIES ('creator'='Bruce Brown', 'created_at'='Sat Sep 21 20:46:32 EDT 2013');
OK
Time taken: 0.886 seconds
2)HIVE tables
*Creating HIVE tables
CREATE TABLE creates a table with the given name. An error is thrown if a table or view
with the same name already exists. You can use IF NOT EXISTS to skip the error.

hive> create table IF NOT EXISTS Employee(
    > SNO int comment 'sequence number',
    > name string comment 'Employee Name',
    > position string comment 'Employee Role',
    > salary int comment 'Employee Salary',
    > dept string comment 'Employee group')
    > comment 'Employee details'
    > row format delimited
    > fields terminated by ','
    > lines terminated by '\n'
    > STORED AS TEXTFILE
    > LOCATION '/hive/warehouse/employ.db/Employee'
    > TBLPROPERTIES ('creator'='ram');
The default TEXTFILE format for HIVE tables is slower to process and consumes a lot of disk space unless you compress it.
For these reasons, the Apache Hive community came up with several choices for storing tables on HDFS:

TEXTFILE
● Use STORED AS TEXTFILE if the data needs to be stored as plain text files.
● Default file format for HIVE records.
● Alphanumeric Unicode characters are used to store your data.

SEQUENCEFILE
● Use STORED AS SEQUENCEFILE if the data needs to be compressed.
● A binary file format composed of key/value pairs.

RCFILE (Record Columnar File)
● Stores records in a column-oriented fashion rather than a row-oriented fashion.
● Use the RCFILE format if you have a large number of columns but only a few columns are typically queried.

ORC (Optimized Row Columnar File)
● A format with significant optimizations to improve HIVE reads, writes and the processing of tables.
Drop Table
DROP TABLE [IF EXISTS] table_name [PURGE];
● DROP TABLE removes metadata and data for this table. The data is actually moved to the .Trash/Current
directory if Trash is configured
● If PURGE is specified, the table data does not go to the .Trash/Current directory and so cannot be retrieved
in the event of a mistaken DROP.
Example:

hive (employ)> DROP table IF EXISTS employee2;


OK
Time taken: 0.305 seconds

Truncate Table

hive (employ)> TRUNCATE TABLE table_name ;


HIVE DML(Data Manipulation language)

Loading data in to table


LOAD DATA [LOCAL] INPATH 'path to file' [OVERWRITE] INTO TABLE table_name
Example: Consider a text file of employee records in the local filesystem.
hive (employ)> LOAD DATA LOCAL INPATH 'hiveinput/employ' INTO TABLE Employee;
Loading data to table employ.employee
Table employ.employee stats: [numFiles=0, totalSize=0]
OK
Time taken: 0.507 seconds


hive (employ)> select * from Employee;


OK
1 Anne Admin 50000 A
2 Gokul Admin 50000 B
3 Janet Sales 60000 A
4 Hari Admin 50000 C
......
.....
Apache Pig
Step 1: Download pig from here
https://pig.apache.org/releases.html

Step 2: Extract to home directory “/home/aliet”

Step 3: Modify the .bashrc file. To open the bashrc file use this command
$ sudo gedit /etc/bash.bashrc

--> In bashrc file append the below statements

export PIG_HOME=/home/aliet/pig-0.15.0
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=/home/aliet/hadoop-1.2.1/conf
export JAVA_HOME=$JAVA_HOME/usr

Step 4: Start all Hadoop daemons; open a terminal and type the following command
hadoop-1.2.1$bin/start-all.sh

Step 5 :On the command prompt type


$ pig -h
Step 6 :To Start pig in local mode
$ pig -x local
grunt>
Step 7 :To start pig in mapreduce mode
$ pig -x mapreduce or $ pig
LOAD ,DUMP,DESCRIBE operators
Loading Employ data in to bag

001 Robin 22 newyork


002 BO 23 Kolkata
003 Maya 23 Tokyo
004 Sara 25 London
005 David 23 Bhuwaneshwar
006 Maggy 22 Chennai

grunt> A = LOAD 'piginput/employ';


grunt> DUMP A;

(001,Robin,22,newyork)
(002,BO,23,Kolkata)
(003,Maya,23,Tokyo)
(004,Sara,25,London)
(005,David,23,Bhuwaneshwar)
(006,Maggy,22,Chennai)

grunt> DESCRIBE A;
Schema for A unknown.
grunt>
Defining schema and Using function
Employ data:
001:Robin:22:newyork
002:BO:23:Kolkata
003:Maya:23:Tokyo
004:Sara:25:London
005:David:23:Bhuwaneshwar
006:Maggy:22:Chennai
grunt> A= LOAD 'piginput/employ' USING PigStorage(':') AS (id:int,name:chararray,age:int,city);
grunt> DUMP A;
(1,Robin,22,newyork)
(2,BO,23,Kolkata)
(3,Maya,23,Tokyo)
(4,Sara,25,London)
(5,David,23,Bhuwaneshwar)
(6,Maggy,22,Chennai)
grunt> DESCRIBE A;
A: {id: int,name: chararray,age: int,city: bytearray}

If we omit the type of a field, the default type is bytearray.

grunt> A = LOAD 'piginput/employ' USING PigStorage(':') AS (id,name:chararray,age:int,city);


grunt> DESCRIBE A;
A: {id: bytearray,name: chararray,age: int,city: bytearray}
grunt>
GROUP and COGROUP operators
Grouping Employ by age
grunt> A= LOAD 'piginput/employ' USING PigStorage(':') AS (id:int,name:chararray,age:int,city);
grunt> B= GROUP A by age;
grunt>DUMP B;

(22,{(1,Robin,22,newyork),(6,Maggy,22,Chennai)})
(23,{(2,BO,23,Kolkata),(3,Maya,23,Tokyo),(5,David,23,Bhuwaneshwar)})
(25,{(4,Sara,25,London)})

grunt> DESCRIBE B;
B: {group: int,A: {(id: int,name: chararray,age: int,city: bytearray)}}
grunt>

The COGROUP operator is used in statements involving two or more relations.

grunt> employ = LOAD 'piginput/employ' USING PigStorage(':') AS (id:int,name:chararray,age:int,city);

grunt> student = LOAD 'piginput/student' USING PigStorage(',') AS (id:int,firstname:chararray,lastname:chararray,age:int,phno:long,city:chararray);

grunt> cogroup_data = COGROUP student by age, employ by age;


grunt> DUMP cogroup_data
(21,{(1,Rajiv,Reddy,21,9848022337,Hyderabad),
(3,Rajesh,Khanna,21,9848022339,Hyderabad),
(4,Preethi,Agarwal,21,9848022330,Punei)},{})
(22,{(2,siddarth,Battacharya,22,9848022338,Kolkata)},{(1,Robin,22,newyork),
(6,Maggy,22,Chennai)})
(23,{(5,Trupthi,Mohanthy,23,9848022336,Chennai),
(6,Archana,Mishra,23,9848022335,Chennai)},{(2,BO,23,Kolkata),(3,Maya,23,Tokyo),
(5,David,23,Bhuwaneshwar)})
(24,{(7,Komal,Nayak,24,9848022334,trivendram),
(8,Bharathi,Nambiayar,24,9848022333,trivendram)},{})
(25,{},{(4,Sara,25,London)})

FILTER: Selects tuples from a relation based on some condition.
● Use the FILTER operator to work with tuples or rows of data.
● FILTER is commonly used to select the data that you want; or, conversely, to filter out (remove) the data you don't want.

grunt> X = FILTER A BY a3 == 3;
grunt> DUMP X;
(1,2,3)
(4,3,3)
(4,3,3)
(1,2,3)

grunt> X = FILTER A BY (a1 == 8) OR (NOT (a2+a3 > a1));
grunt> DUMP X;
(8,3,4)
LIMIT
● Use the LIMIT operator to limit the number of output tuples.
● If the specified number of output tuples is equal to or exceeds the number of tuples in the relation, the output will include all tuples in the relation.
● A particular set of tuples can be requested using the ORDER operator followed by LIMIT.
grunt> X= LIMIT student 3;
grunt> DUMP X;
(1,Rajiv,Reddy,21,9848022337,Hyderabad)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(3,Rajesh,Khanna,21,9848022339,Hyderabad)

DISTINCT
● Use the DISTINCT operator to remove duplicate tuples in a relation.
● DISTINCT does not preserve the original order of the contents.
● You cannot use DISTINCT on a subset of fields.

grunt> A = LOAD 'data' USING PigStorage(' ') AS (a1:int,a2:int,a3:int);


grunt> DUMP A;
(8,3,4)
(1,2,3)
(4,3,3)
(4,3,3)
(1,2,3)
CROSS
Use the CROSS operator to compute the cross product (Cartesian product) of two or more
relations.
grunt>A = LOAD 'data1' AS (a1:int,a2:int,a3:int);
grunt>DUMP A;
(1,2,3)
(4,2,1)
grunt>B = LOAD 'data2' AS (b1:int,b2:int);
grunt>DUMP B;
(2,4)
(8,9)
(1,3)
In this example the cross product of relation A and B is computed.

grunt>X = CROSS A, B;
grunt>DUMP X;
(1,2,3,2,4)
(1,2,3,8,9)
(1,2,3,1,3)
(4,2,1,2,4)
(4,2,1,8,9)
(4,2,1,1,3)
UNION

Use the UNION operator to merge the contents of two or more relations. The UNION operator:
• Does not preserve the order of tuples. Both the input and output relations are interpreted as unordered bags of tuples.
• Does not ensure (as databases do) that all tuples belong to the same schema or that they have the same number of fields.
• Does not eliminate duplicate tuples.

grunt>A = LOAD 'data' AS (a1:int,a2:int,a3:int);


grunt>DUMP A;
(1,2,3)
(4,2,1)
grunt>B = LOAD 'data' AS (b1:int,b2:int);
grunt>DUMP B;
(2,4)
(8,9)
(1,3)
grunt>X = UNION A, B;
grunt>DUMP X;
(1,2,3)
(4,2,1)
(2,4)
(8,9)
(1,3)
SPLIT

Use the SPLIT operator to partition the contents of a relation into two or more relations
based
on some expression. Depending on the conditions stated in the expression:
• A tuple may be assigned to more than one relation.
• A tuple may not be assigned to any relation.

SPLIT alias INTO alias IF expression, alias IF expression [, alias IF expression ...];

In this example relation A is split into three relations, X, Y, and Z.

grunt>A = LOAD 'data' AS (f1:int,f2:int,f3:int);


grunt>DUMP A;
(1,2,3)
(4,5,6)
(7,8,9)
grunt>SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);
grunt>DUMP X;
(1,2,3)
(4,5,6)
grunt>DUMP Y;
(4,5,6)
grunt>DUMP Z;
(1,2,3)
(7,8,9)
FOREACH

Use the FOREACH ...GENERATE operation to work with columns of data (if you want to
work with tuples or rows of data, use the FILTER operation).

grunt>A = LOAD 'data1' AS


(a1:int,a2:int,a3:int);
grunt>DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
grunt>X = FOREACH A GENERATE *;

grunt>DUMP X;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
:~/hadoop-1.2.1$ bin/hadoop dfs -put piginput/in piginput/in

:~/hadoop-1.2.1$ pig -x mapreduce


grunt> A = LOAD 'piginput/in' USING PigStorage(' ') AS (a1:int,a2:int,a3:int);
grunt> DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
grunt> X = FOREACH A GENERATE a1, a2;
grunt> STORE X INTO 'output';

Two fields in relation A are summed to form relation X. A schema is defined for the
projected field.

grunt> X = FOREACH A GENERATE a1+a2 AS f1:int;
grunt> DESCRIBE X;
X: {f1: int}
grunt> STORE X INTO 'output';
JOIN
● Use the JOIN operator to perform an inner join of two or more relations based on common field values.
● Inner joins ignore null keys.
● JOIN creates a flat set of output records while COGROUP creates a nested set of output records.

Syntax:
alias = JOIN left-alias BY left-alias-column [LEFT|RIGHT|FULL] [OUTER], right-alias BY right-alias-column;
A = LOAD 'data1' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

B = LOAD 'data2' AS (b1:int,b2:int);
DUMP B;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)
In this example relations A and B are joined by their first fields.

X = JOIN A BY a1, B BY b1;


DUMP X;
(1,2,3,1,3)
(4,2,1,4,6)
(4,3,3,4,6)
(4,2,1,4,9)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)
LEFT OUTER JOIN

grunt> C = JOIN A by $0 LEFT OUTER, B BY $0;


(1,2,3,1,3)
(4,2,1,4,6)
(4,2,1,4,9)
(4,3,3,4,6)
(4,3,3,4,9)
(7,2,5,,)
(8,3,4,8,9)
(8,4,3,8,9)

RIGHT OUTER JOIN


grunt> C = JOIN A by $0 RIGHT OUTER, B BY $0;

(1,2,3,1,3)
(,,,2,4)
(,,,2,7)
(,,,2,9)
(4,2,1,4,6)
(4,2,1,4,9)
(4,3,3,4,6)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)
FULL OUTER JOIN
grunt> C = JOIN A by $0 FULL, B BY $0;
(1,2,3,1,3)
(,,,2,4)
(,,,2,7)
(,,,2,9)
(4,2,1,4,6)
(4,2,1,4,9)
(4,3,3,4,6)
(4,3,3,4,9)
(7,2,5,,)
(8,3,4,8,9)
(8,4,3,8,9)
