
Integration of R with Hadoop

When it comes to Statistical Analysis, R is one of the most preferred options, and by integrating it with Hadoop, we can successfully use it for Big Data Analytics. In this post, we will discuss the step-by-step process for integrating R with Hadoop and will perform various operations on HDFS using the R console.

RHadoop is a collection of three R packages that provide large-scale data operations within an R environment. Each of the three packages offers different Hadoop features:

1. Rhdfs

2. Rmr

3. Rhbase

Rhdfs:

Rhdfs is an R package that provides basic connectivity to the Hadoop Distributed File System. R programmers can browse, read, write, and modify files stored in HDFS from within R. The Rhdfs package calls the HDFS API in the backend to operate on the data sources stored in HDFS. This package needs to be installed only on the node that will run the R client.

Rmr:

Rmr is an R package that allows R developers to perform Statistical Analysis in R via Hadoop's MapReduce functionality on a Hadoop cluster. With the help of this package, an R programmer's job is reduced to splitting the application logic into map and reduce phases and submitting it with the Rmr methods. Rmr then calls the Hadoop streaming MapReduce API with several job parameters such as input directory, output directory, mapper, reducer, and so on, to run the R MapReduce job on the Hadoop cluster. This package must be installed on every node in the cluster.
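
As a rough sketch of how an Rmr job looks in practice (using the rmr2 package, which is how Rmr is distributed), the example below squares a vector of integers; the dataset and the map logic are purely illustrative:

library(rmr2)

# write a small illustrative dataset into HDFS
small.ints <- to.dfs(1:1000)

# map phase only: emit each value together with its square
result <- mapreduce(
  input = small.ints,
  map   = function(k, v) keyval(v, v ^ 2)
)

# pull the (key, value) pairs back from HDFS into the R session
out <- from.dfs(result)
head(out$val)
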
Rhbase:

Rhbase is an R interface for operating on Hadoop's HBase data source, accessed over the distributed network via a Thrift server. The Rhbase package provides several methods for initialization, read/write, and table manipulation operations. In this post, we will look into the Rhdfs package, which provides the basic connectivity to the Hadoop Distributed File System. Before delving deeper, let's look at how to set up RHadoop.

Steps for Setting up RHadoop

The prerequisites for installing RHadoop are Hadoop and R. Assuming they are already installed, let's get started with the setup process.

1. Installing Java and Hadoop

2. Installing R

Required Packages for Installation

We require several R packages to be installed for connecting R with Hadoop. The list of packages is as follows:

rJava
RJSONIO
itertools
digest
Rcpp
httr
functional
devtools
plyr
reshape2
We will discuss installing all of these packages in two different ways. They are as follows:

1. Using install.packages from R Console:


install.packages(c('rJava','RJSONIO','itertools','digest','Rcpp','httr','functional',
'devtools','plyr','reshape2'), dependencies=TRUE, repos='http://cran.rstudio.com/')
Note: Before installing rJava, we should set the JAVA_HOME path and should log in to R with sudo privileges.
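
For reference, one way to set JAVA_HOME from within R before installing rJava is sketched below; the JDK path is only an assumption and should be changed to match your system (alternatively, export JAVA_HOME in the shell and run sudo R CMD javareconf before starting R):

# assumed JDK location; point this at the Java installation on your node
Sys.setenv(JAVA_HOME = "/usr/lib/jvm/java-8-openjdk-amd64")
install.packages("rJava", repos = "http://cran.rstudio.com/")
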
2. Downloading Packages and installing through R CMD:
Download the required packages from the link below.

Link: https://drive.google.com/open?id=0B5dejdhAYHztRkgzbGZOeUdXdVE
After downloading the packages, extract them and use the command below:

unzip Rhadoop_packages.zip

To install these packages, we will be using R CMD.

R CMD INSTALL <package name>

Now we will install rJava; refer to the command below.
sudo R CMD INSTALL rJava_0.9-6.tar.gz

We need to follow the same command to install all the other required packages.
sudo R CMD INSTALL <package.tar.gz>

Note: Before installing rhdfs, we should set the HADOOP_CMD environment variable.


You can follow the steps sketched below for installing rhdfs.
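
A minimal sketch of installing rhdfs from within R follows; the tarball filename (rhdfs_1.0.8.tar.gz) and the Hadoop path are assumptions, so substitute the file you downloaded and your own installation path. Equivalently, you can export HADOOP_CMD in the shell and run sudo R CMD INSTALL on the tarball, as we did for the other packages.

# tell rhdfs where the hadoop binary lives (assumed path; adjust for your cluster)
Sys.setenv(HADOOP_CMD = "/usr/local/hadoop/bin/hadoop")

# install the downloaded rhdfs tarball from the local filesystem
install.packages("rhdfs_1.0.8.tar.gz", repos = NULL, type = "source")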

For accessing HDFS, we should start the Hadoop daemons; make sure that all your HDFS daemons are up.

Check the files in HDFS from the command line.


Now we will access HDFS from the R console:

1. Log in to the R console

2. Set the environment variables

3. Load the required package, rhdfs

4. After loading the rhdfs package, initiate the connection using hdfs.init()
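
Putting these steps together, a minimal sketch from the R console could look like this; the HADOOP_CMD and HADOOP_STREAMING paths are assumptions and must point at your own Hadoop binary and streaming jar:

# environment variables used by rhdfs (and rmr); assumed paths, adjust to your cluster
Sys.setenv(HADOOP_CMD = "/usr/local/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING = "/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar")

# load rhdfs and initialize the connection to HDFS
library(rhdfs)
hdfs.init()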

Accessing HDFS through the R console

Listing the files in the HDFS root directory:


hdfs.ls('/')
To get the HDFS default configurations used for this connection, use:
hdfs.defaults("conf")

File manipulation

hdfs.put: This is used to copy files from the local filesystem to the HDFS
filesystem.
hdfs.put('localfile source','hdfs destination')

hdfs.mkdir: This is used to create a new directory in HDFS.

hdfs.mkdir('/new_dir')
hdfs.move: This is used to move a file from one HDFS directory to another
HDFS directory.
hdfs.move('/test_file','/new_dir/')

hdfs.rename: This is used to rename a file stored in HDFS from R.

hdfs.rename('/new_dir/test_file','/new_dir/test_file1')

hdfs.chmod: This is used to change the permissions of a file.

hdfs.chmod('/Wc.txt', permissions='777')
hdfs.delete: This is used to delete an HDFS file or directory from R.
hdfs.delete("/RHadoop")
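
Besides these commands, rhdfs also provides file handles for reading and writing data, which is how the read/write capability mentioned earlier is exposed. A hedged sketch, with an illustrative path and the built-in mtcars data frame as example data:

# write a serialized R object to an HDFS file (path is illustrative)
out <- hdfs.file("/new_dir/mtcars.bin", "w")
hdfs.write(serialize(mtcars, NULL), out)
hdfs.close(out)

# read the bytes back and reconstruct the object
con <- hdfs.file("/new_dir/mtcars.bin", "r")
df <- unserialize(hdfs.read(con))
hdfs.close(con)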

Hope this blog helped you in learning how to integrate R with Hadoop. Keep visiting our site for more updates on Big Data and other technologies. Click Here to learn more.
