https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/mapreduce/package-summary.html
S.No | Primitive type | Java wrapper class | Hadoop type class
1    | boolean        | Boolean            | BooleanWritable
2    | byte           | Byte               | ByteWritable
3    | double         | Double             | DoubleWritable
4    | float          | Float              | FloatWritable
5    | int            | Integer            | IntWritable
6    | long           | Long               | LongWritable
7    | String         | String             | Text
• Used to describe a MapReduce job configuration to the Hadoop framework for execution
• It specifies the Mapper, Combiner, Partitioner, Reducer, InputFormat, and OutputFormat to be used in the MapReduce program
Methods
Hadoop has its own data types. The type classes used for key/value pairs are:
BooleanWritable, ByteWritable, DoubleWritable, FloatWritable, IntWritable, LongWritable, Text, NullWritable
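For example, a minimal sketch of wrapping and unwrapping values (the values here are chosen arbitrarily):
IntWritable count = new IntWritable(42);   // wrap a Java int
int n = count.get();                       // unwrap back to a Java int
Text word = new Text("hadoop");            // Hadoop's counterpart of String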
-> TextInputFormat: Each line in the text file is a record. The key is the byte offset of the line and the value is the content of the line.
Type of key: LongWritable    Type of value: Text
Example
JobConf job = new JobConf(new Configuration(), MyJob.class);
job.setJobName("Max Temperature");
job.setMapperClass(MyMapper.class);
job.setCombinerClass(MyReducer.class);
job.setReducerClass(MyReducer.class);
job.setInputFormat(TextInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
In the new Java MapReduce API, the class JobConf is replaced with the Job class.
All methods of JobConf are also applicable to the Job class. Its usage is shown below.
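A minimal sketch of the equivalent configuration with the new API; the class names are reused from the old-API example above, and the format classes come from org.apache.hadoop.mapreduce.lib.input / lib.output:
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Max Temperature");
job.setJarByClass(MyJob.class);
job.setMapperClass(MyMapper.class);
job.setCombinerClass(MyReducer.class);
job.setReducerClass(MyReducer.class);
job.setInputFormatClass(TextInputFormat.class);    // org.apache.hadoop.mapreduce.lib.input.TextInputFormat
job.setOutputFormatClass(TextOutputFormat.class);  // org.apache.hadoop.mapreduce.lib.output.TextOutputFormat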
Classes FileInputFormat and FileOutputFormat
They are used to set the job's input and output paths
Example
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
Interface InputFormat<K,V>
Interface OutputFormat<K,V>
Interface RecordWriter<K,V>
In the new MapReduce API, Mapper is a class (rather than an interface, as in the old API); its map method emits output through the Context object:
context.write(key, value)
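For illustration, a minimal sketch of a new-API Mapper for word count; the class and field names are assumptions, not part of the notes:
public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private final Text word = new Text();
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (token.isEmpty()) continue;
      word.set(token);
      context.write(word, one);   // emit (word, 1) for each token in the line
    }
  }
}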
Interface Partitioner<K2,V2>
The Partitioner controls the partitioning of the keys of the intermediate map outputs to reducers.
getPartition returns the partition number for a given key (and hence record), given the total number of partitions, i.e. the number of reduce tasks for the job.
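A minimal sketch of a custom old-API partitioner; the class name and the hashing scheme are assumptions:
public class MyPartitioner implements Partitioner<Text, IntWritable> {
  @Override
  public void configure(JobConf job) { }   // no configuration needed for this example
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // spread keys across reducers by hash code
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}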
Interface Reducer<K2,V2,K3,V3>: Reducer has 3 primary phases: shuffle, sort, and reduce.
*void reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output, Reporter reporter) throws IOException
Class MapReduceBase
Class JobClient
JobClient.runJob(job);
Note:
In the new API, we use the method waitForCompletion instead of JobClient's runJob method.
boolean waitForCompletion(boolean verbose)
This method allows the user to submit the job and wait for it to finish.
The single boolean argument indicates whether progress output is generated: when true, the job writes information about its progress to the console.
The return value is a boolean indicating success (true) or failure (false); drivers typically map this to exit codes 0 and 1.
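A typical new-API driver therefore ends with a line like this (a minimal sketch):
System.exit(job.waitForCompletion(true) ? 0 : 1);   // exit code 0 on success, 1 on failure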
Wordcount using new API
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
// Reducer sums the counts emitted for each word (class name illustrative)
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int wordCount = 0;
    for (IntWritable value : values) wordCount += value.get();
    context.write(key, new IntWritable(wordCount));   // emit (word, total count)
  }
}
Select all external JARs loaded and click on "OK"
Now all errors will disappear
Running a MapReduce program locally
(Run Configurations of Project in Eclipse IDE)
Place your input data in your project workspace folder
Open eclipse -> Right Click on Project->Run Configurations
Create configuration for your Java applications
-> Provide the project name and the main class name
Provide input and output arguments in the Arguments tab, and click on Run
Observe the status of your running program in the console
Output will be placed in your project workspace folder
Open output folder to see actual output
Running a MapReduce program on HDFS
Create and place input in HDFS
Open HDFS to view input
After coding the driver logic, Mapper, and Reducer, create an executable JAR to run the MapReduce program
Select Runnable JARFILE
Provide configuration name and destination for JAR file
Provide destination path for JAR file along with name
Click on Finish
Ignore Warning message
Creating JAR file in progress
Execute MapReduce Program using JAR File
Output is saved in HDFS
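A command-line sketch of these steps; the JAR name, driver class, and HDFS paths here are assumptions, not the actual names used in the lab:
~/hadoop-1.2.1$ bin/hadoop jar wordcount.jar WordCount wcinput wcoutput
~/hadoop-1.2.1$ bin/hadoop dfs -cat wcoutput/part-r-00000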
Apache Hive ™
The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large
datasets residing in distributed storage using SQL.
Apache Hive is an open source project run by volunteers at the Apache Software Foundation. Before becoming an open source Apache project, Hive originated at Facebook.
It provides
● Tools to enable easy access to data via HiveQL, thus enabling data warehousing tasks such as extract/transform/load (ETL)
● A way to impose structure on a variety of data formats
● Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase™
● Built-in user-defined functions (UDFs) to manipulate dates, strings, and other data-mining tools
● SQL-like queries (HiveQL), which are implicitly converted into MapReduce, Tez, or Spark jobs
● By default, Hive stores metadata in an embedded Apache Derby database; other client/server databases such as MySQL can optionally be used
● Hive indexing to improve the speed of query lookups on certain columns of a table
Limitations of Hive:
● Hive is not designed for online transaction processing (OLTP); it is used only for online analytical processing (OLAP).
● Hive supports overwriting data, but not updates and deletes.
● In Hive, sub-queries are not supported.
● There is no "insert into table values ..." statement.
● You can only load data using bulk load.
● There is no "delete from" command.
● You can only do bulk delete.
Getting started with Apache HIVE
~$ cd apache-hive-1.2.1-bin
~/apache-hive-1.2.1-bin$ cd conf
~/apache-hive-1.2.1-bin/conf$ cp hive-default.xml.template hive-site.xml
1)HIVE Database
*Creating database
Syntax:
CREATE DATABASE [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];
Example1:
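The example itself is not included in the notes; a minimal sketch, with the database name, comment, and properties assumed:
hive> CREATE DATABASE IF NOT EXISTS employ
    > COMMENT 'employee data'
    > WITH DBPROPERTIES ('creator' = 'aliet');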
*Drop database
● The default behavior is RESTRICT, where DROP DATABASE will fail if the database is not empty.
● To drop the tables in the database as well, use DROP DATABASE db_name CASCADE
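For example (database name assumed):
hive> DROP DATABASE IF EXISTS employ CASCADE;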
*Describe database
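A minimal sketch (database name assumed):
hive> DESCRIBE DATABASE employ;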
*Use Database
● Statement USE sets the current database for all subsequent HiveQL statements.
● To revert to the default database, use the keyword "default" instead of a database name.
● To check which database is currently being used: SELECT current_database()
Example
hive (default)> select current_database();
OK
default
Time taken: 0.96 seconds, Fetched: 1 row(s)
hive (default)> use employ;
OK
Time taken: 0.017 seconds
● If you have a lot of databases, you can restrict the ones listed using a regular expression.
● The following example lists only those databases that start with the letter e and end with any other characters (the .* part):
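A sketch of such a query (the employ database created earlier would match):
hive> SHOW DATABASES LIKE 'e.*';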
Truncate Table
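A minimal sketch (the table name is an assumption):
hive> TRUNCATE TABLE employ_details;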
In HDFS, truncating a table removes the data files under the table's warehouse directory.
Step 3: Modify the .bashrc file. To open the bashrc file, use this command:
$ sudo gedit /etc/bash.bashrc
export PIG_HOME=/home/aliet/pig-0.15.0
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=/home/aliet/hadoop-1.2.1/conf
export JAVA_HOME=$JAVA_HOME/usr
Step 4: Start all Hadoop daemons; open a terminal and type the following command
hadoop-1.2.1$ bin/start-all.sh
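To enter the Grunt shell used below (MapReduce mode assumed):
~$ pig -x mapreduce
The LOAD statement that produced the following dump is not shown in the notes; a sketch that would yield it, assuming colon-separated fields and no schema:
grunt> A = LOAD 'piginput/employ' USING PigStorage(':');
grunt> DUMP A;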
(001,Robin,22,newyork)
(002,BO,23,Kolkata)
(003,Maya,23,Tokyo)
(004,Sara,25,London)
(005,David,23,Bhuwaneshwar)
(006,Maggy,22,Chennai)
grunt> DESCRIBE A;
Schema for A unknown.
grunt>
Defining a schema and using the PigStorage function
Employ data:
001:Robin:22:newyork
002:BO:23:Kolkata
003:Maya:23:Tokyo
004:Sara:25:London
005:David:23:Bhuwaneshwar
006:Maggy:22:Chennai
grunt> A= LOAD 'piginput/employ' USING PigStorage(':') AS (id:int,name:chararray,age:int,city);
grunt> DUMP A;
(1,Robin,22,newyork)
(2,BO,23,Kolkata)
(3,Maya,23,Tokyo)
(4,Sara,25,London)
(5,David,23,Bhuwaneshwar)
(6,Maggy,22,Chennai)
grunt> DESCRIBE A;
A: {id: int,name: chararray,age: int,city: bytearray}
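The GROUP statement behind the next dump is not shown in the notes; a sketch, assuming grouping on the age field:
grunt> B = GROUP A BY age;
grunt> DUMP B;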
(22,{(1,Robin,22,newyork),(6,Maggy,22,Chennai)})
(23,{(2,BO,23,Kolkata),(3,Maya,23,Tokyo),(5,David,23,Bhuwaneshwar)})
(25,{(4,Sara,25,London)})
grunt> DESCRIBE B;
B: {group: int,A: {(id: int,name: chararray,age: int,city: bytearray)}}
grunt>
DISTINCT
• Use the DISTINCT operator to remove duplicate tuples in a relation.
• DISTINCT does not preserve the original order of the contents.
• You cannot use DISTINCT on a subset of fields.
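A minimal sketch (relation names assumed):
grunt> X = DISTINCT A;
grunt> DUMP X;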
CROSS
grunt> X = CROSS A, B;
grunt>DUMP X;
(1,2,3,2,4)
(1,2,3,8,9)
(1,2,3,1,3)
(4,2,1,2,4)
(4,2,1,8,9)
(4,2,1,1,3)
UNION
Use the UNION operator to merge the contents of two or more relations. The UNION
operator:
• Does not preserve the order of tuples. Both the input and output relations are interpreted as unordered bags of tuples.
• Does not ensure (as databases do) that all tuples belong to the same schema or that they have the same number of fields.
• Does not eliminate duplicate tuples.
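A minimal sketch (relation names assumed):
grunt> X = UNION A, B;
grunt> DUMP X;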
SPLIT
Use the SPLIT operator to partition the contents of a relation into two or more relations based on some expression. Depending on the conditions stated in the expression:
• A tuple may be assigned to more than one relation.
• A tuple may not be assigned to any relation.
SPLIT alias INTO alias IF expression, alias IF expression [, alias IF expression ...];
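A minimal sketch (the field position and thresholds are assumptions):
grunt> SPLIT A INTO X IF $0 < 5, Y IF $0 >= 5;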
FOREACH ... GENERATE
Use the FOREACH ... GENERATE operation to work with columns of data (if you want to work with tuples or rows of data, use the FILTER operation).
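The FOREACH statement behind the following dump is not shown in the notes; a sketch that would reproduce it, assuming X simply projects every field of A:
grunt> X = FOREACH A GENERATE *;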
grunt>DUMP X;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
~/hadoop-1.2.1$ bin/hadoop dfs -put piginput/in piginput/in
Two fields in relation A are summed to form relation X. A schema is defined for the
projected field.
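A sketch of such a statement (the field positions follow the Pig documentation's example and are an assumption):
grunt> X = FOREACH A GENERATE $0 + $2 AS f1:int;
The join statement behind the output below is also missing from the notes; the null padding suggests a right outer join of A with B, for example:
grunt> C = JOIN A BY $0 RIGHT OUTER, B BY $0;
grunt> DUMP C;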
(1,2,3,1,3)
(,,,2,4)
(,,,2,7)
(,,,2,9)
(4,2,1,4,6)
(4,2,1,4,9)
(4,3,3,4,6)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)
FULL OUTER JOIN
grunt> C = JOIN A by $0 FULL, B BY $0;
(1,2,3,1,3)
(,,,2,4)
(,,,2,7)
(,,,2,9)
(4,2,1,4,6)
(4,2,1,4,9)
(4,3,3,4,6)
(4,3,3,4,9)
(7,2,5,,)
(8,3,4,8,9)
(8,4,3,8,9)