R MANUAL
A BRIEF TUTORIAL TO LEARN R
DR. R. SRIVATSAN
CONSULTANT FACULTY | IVY PROFESSIONAL SCHOOL
1 R sessions
R programming can be carried out in an interactive R session. To start an R session, type R
from the command line in Windows or Linux. For example, from the shell prompt $ in Linux,
$R
This generates the following output before entering > prompt of R:
Copyright (C) 2010 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: i486-pc-linux-gnu (32-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type license() or licence() for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type contributors() for more information and
citation() on how to cite R or R packages in publications.
Type demo() for some demos, help() for on-line help, or
help.start() for an HTML browser interface to help.
Type q() to quit R.
[Previously saved workspace restored]
>
Once we are inside an R session, we can directly execute R language commands by typing them
line by line. Pressing the return key terminates the typing of a command and brings back the > prompt.
In the example session below, we declare 2 variables a and b to have values 5 and 6 respectively,
and assign their sum to another variable called c:
> a = 5
> b = 6
> c = a + b
> c
[1] 11
To get help on any function of R, type help(function-name) at the R prompt. For example,
if we need help on the if construct, type
> help("if")
and the help lines are printed.
To exit the R session, type quit() in the R prompt:
> quit()
Save workspace image? [y/n/c]: n
Copyright (c) from 2012 R. Srivatsan
In R, we can assign integer and floating point values to variables directly. The mathematical
operations can be performed with symbols following a format exactly similar to languages like C,
C++, java, python and perl. We have already performed a simple operation c = a+b in the previous
section.
For example, we can perform the operation $c = \frac{a+b}{ab}$ by assigning values directly at declaration:
> a = 7.5
> b = 6
> c = (a+b)/(a*b)
> c
[1] 0.3
In another example, we apply the formula $y = \sqrt{1.0 + (z/r)^{0.38}}$ as,
> z = 22.9
> r = 6.7
> y = sqrt(1.0 + (z/r)^0.38)
> y
[1] 1.610978
Note the explicit usage of brackets for grouping the terms to remove ambiguity.
A list of useful inbuilt functions in R is given below:
Function                     Description
-----------------------------------------------------------------------
abs(x)                       absolute value
sqrt(x)                      square root
ceiling(x)                   ceiling(3.475) is 4
floor(x)                     floor(3.475) is 3
trunc(x)                     trunc(5.99) is 5
round(x, digits=n)           round(3.475, digits=2) is 3.48
signif(x, digits=n)          signif(3.475, digits=2) is 3.5
cos(x), sin(x), tan(x)       trigonometric cosine, sine and tangent functions
acos(x), cosh(x), acosh(x)   arccosine, hyperbolic cosine and inverse hyperbolic cosine functions
log(x)                       natural logarithm
log10(x)                     common logarithm
log2(x)                      logarithm to the base of 2
exp(x)                       e^{x}
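A few of these functions can be tried directly at the prompt. The short sketch below is an illustration added here; the input values are arbitrary choices made for this example.

```r
# Rounding helpers applied to the same number
x <- 3.475
ceiling(x)            # smallest integer not less than x: 4
floor(x)              # largest integer not greater than x: 3
trunc(5.99)           # drops the fractional part: 5
signif(x, digits=2)   # keeps two significant digits: 3.5

# Logarithms and the exponential undo each other
log(exp(2))           # 2
log10(1000)           # 3
log2(8)               # 3
```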
3 String operations
A substring can be formed by calling substr() function specifying the start and stop
character locations of the substring in the main string. To form a substring from location 4 to 8 of
string scat,
> ssub <- substr(scat,4,8)
> ssub
[1] "acabc"
We can also replace a portion of string with other substring:
> substr(scat,4,8) <- "UUUUU"
> scat
[1] "abcUUUUUaqqqqqq"
In case we want a substring from a given start position to the end of the original string, give an
arbitrarily large integer for the end location:
> str3 = "WWW.objsite.com"
> sublg <- substr(str3,5,100000000L)
> sublg
[1] "objsite.com"
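As an alternative to an arbitrarily large end location, the length of the string can be computed with nchar(). This is a small sketch using the same string; the variable name sub2 is introduced here only for illustration.

```r
str3 <- "WWW.objsite.com"
# nchar() returns the number of characters, which is also the last position
sub2 <- substr(str3, 5, nchar(str3))
sub2   # "objsite.com"
```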
A string can be truncated to a certain number of characters from its beginning with the
strtrim() function:
> str4 <- "AECH9939-ALM"
> strunk <- strtrim(str4, 4)
> strunk
[1] "AECH"
The function strsplit() is used for splitting a string by a given character. For example,
> strsplit("fname.doc", "\\.")
[[1]]
[1] "fname" "doc"
strsplit() returns a list. The two portions of the split string can be extracted as a
character vector with unlist(), as shown below. More on lists later:
> aa <- unlist(strsplit("fname.doc", "\\."))
> aa[1]
[1] "fname"
> aa[2]
[1] "doc"
For converting lower case to upper case and vice versa, we use the functions toupper()
and tolower():
> strr <- "This is a sentence"
> strrup <- toupper(strr)
> strrup
[1] "THIS IS A SENTENCE"
> tolower(strrup)
[1] "this is a sentence"
4 Data structures in R
R has a wide variety of data types including scalars, vectors (numerical, character, logical), matrices,
dataframes, and lists. We can perform many algebraic operations with these data structures and
many useful built-in functions are defined for these data types. Here we will learn to use them.
4.1 Vectors
> vstr*100
[1] 10 20 30 40 50 60 70 80 90
> vstr/100
[1] 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
In the same way, the algebraic operations between one or more vectors are applied
to their individual elements. Thus if we add two vectors of same number of elements, their
individual elements are correspondingly added to give a new vector. This is illustrated in the following
operations between vectors vec1 and vec2 below:
> vec1 <- c(1.5,2.5,3.5,4.5,5.5,6.5)
> vec2 <- c(10,20,30,40,50,60)
> vec1+vec2
[1] 11.5 22.5 33.5 44.5 55.5 66.5
> vec1-vec2
[1]  -8.5 -17.5 -26.5 -35.5 -44.5 -53.5
> vec1*vec2
[1]  15  50 105 180 275 390
> vec1/vec2
[1] 0.1500000 0.1250000 0.1166667 0.1125000 0.1100000 0.1083333
> log(vec2)
[1] 2.302585 2.995732 3.401197 3.688879 3.912023 4.094345
Vectors can be combined with other vectors or individual elements and grow in
size. For example, with the vectors vec1 and vec2 defined above,
> cvec <- c(vec1, vec2)
> cvec
 [1]  1.5  2.5  3.5  4.5  5.5  6.5 10.0 20.0 30.0 40.0 50.0 60.0
We can also perform mathematical operations with whole vectors, provided they
contain the same number of elements. The mathematical operations with vectors are applied
to their respective elements individually:
> rvec <- 3*vec1 + vec2
> rvec
[1] 14.5 27.5 40.5 53.5 66.5 79.5
A vector can be sorted in ascending order with the sort() function:
> vec <- c(8.9, 1.5, 3.4, 6.7, 12.8, 7.4)
> sort(vec)
[1]  1.5  3.4  6.7  7.4  8.9 12.8
We can get the maximum and minimum values among the elements of a vector:
> vec <- c(8.9, 1.5, 3.4, 6.7, 12.8, 7.4)
> max(vec)
[1] 12.8
> min(vec)
[1] 1.5
It is very easy to generate a sequence of numbers in R. Use seq function to generate
a sequence from a given number to an end number:
> sq <- seq(1,50)
> sq
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
Use the following syntax to generate a sequence from 1 to 50 in steps of 5:
> sq <- seq(1,50,5)
> sq
 [1]  1  6 11 16 21 26 31 36 41 46
The sequence can be generated in reverse by flipping the sign of the step:
> sq <- seq(50,1,-5)
> sq
 [1] 50 45 40 35 30 25 20 15 10  5
Given a vector, logical operations can be performed on each of its elements to evaluate a
TRUE or FALSE value. For example, in the following R statement, TRUE is assigned for all elements
of vector lseq that satisfy the given logical condition, and FALSE is assigned for the others that
do not satisfy this condition. The resulting vector lresult has TRUE or FALSE values
in the corresponding positions.
> lseq <- c(23,15,34,25,46,58, 59,34,29,36,44,89)
> lresult <- ( (lseq > 24) & (lseq < 60) )
> lresult
 [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
Vectors in R can handle missing values. A missing value is represented by NA.
For example, in the statements below, zm is a vector of elements: numbers 1 to 10, and two NA
(missing) values. The vector data has FALSE for genuine values, and TRUE for missing values
since is.na() is looking for missing values:
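The statements described here can be sketched as follows. The way zm is built with c(1:10, NA, NA) is an assumption; only the contents of zm and the use of is.na() are stated in the text.

```r
# zm holds the numbers 1 to 10 followed by two missing values
# (assumed construction; the text only describes the contents)
zm <- c(1:10, NA, NA)
zm

# is.na() returns TRUE at the positions of the missing values
data <- is.na(zm)
data
```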
Infinite values like -Inf and Inf are not taken to be missing values: is.na() evaluates to
FALSE for them. (NaN, however, is reported as missing by is.na().)
The missing values in a vector can be replaced by zeroes as shown below:
> x <- c(1.22, 3.44, 2.95, 4.23, NA, NA, 5.99)
> x[is.na(x)] <- 0
> x
[1] 1.22 3.44 2.95 4.23 0.00 0.00 5.99
Finally, if we want to create a character vector with indexed strings like X1, X2, X3, ...
etc.,
> labs <- paste( c("X"), 1:20, sep="")
> labs
 [1] "X1"  "X2"  "X3"  "X4"  "X5"  "X6"  "X7"  "X8"  "X9"  "X10" "X11" "X12"
[13] "X13" "X14" "X15" "X16" "X17" "X18" "X19" "X20"
4.2
An array in R can have one, two or more dimensions. An array is just a vector stored with additional
attributes like dimensions and, if needed, names for the dimensions.
All the elements of an array should be of the same data type (number, character string,
logical etc).
Also, all the rows of an array should have the same length; the number of rows and the number of
columns can differ.
We can convert a vector into any dimensional array by specifying its dimensions in the array()
function. Below, we convert a vector x into an array arr of dimension (4,3), where (4,3) refers to 4
rows and 3 columns:
> x <- c(10,20,30,40,50,60,70,80,90,100,110,120)
> arr <- array(x, dim=c(4,3))
> arr
     [,1] [,2] [,3]
[1,]   10   50   90
[2,]   20   60  100
[3,]   30   70  110
[4,]   40   80  120
> x <- c(10,20,30,40,50,60,70,80,90,100,110,120)
> brr <- array(x, dim=c(2,2,3))
> brr
, , 1
     [,1] [,2]
[1,]   10   30
[2,]   20   40
, , 2
     [,1] [,2]
[1,]   50   70
[2,]   60   80
, , 3
     [,1] [,2]
[1,]   90  110
[2,]  100  120
Individual elements of an array can be accessed by giving the name of the array followed by
the subscripts in square brackets, separated by commas. For example, in the above arrays, we can
write arr[2,1], arr[1,1], brr[1,1,1], brr[2,1,3] etc.
Leaving a subscript empty selects all elements along that dimension.
Thus, arr[2,] is the entire second row, and arr[,1] is the entire first column.
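The indexing rules above can be checked with the two arrays defined earlier. The sketch below rebuilds arr and brr so that it can be run on its own.

```r
x <- c(10,20,30,40,50,60,70,80,90,100,110,120)
arr <- array(x, dim=c(4,3))
brr <- array(x, dim=c(2,2,3))

arr[2,1]     # row 2, column 1: 20
brr[2,1,3]   # row 2, column 1 of the third slice: 100
arr[2,]      # entire second row: 20 60 100
arr[,1]      # entire first column: 10 20 30 40
```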
A Matrix is a 2 dimensional array. Like an array, all elements of a matrix are of same data
type and all the rows should be of same length. The general format for creating a matrix of r rows
and c columns from a vector vec is,
amat <- matrix(vec, nrow=r, ncol=c, byrow=FALSE)
Here, byrow=TRUE indicates that the matrix should be filled by rows. byrow=FALSE indicates
that the matrix should be filled by columns (the default).
For example,
> x <- c(10,20,30,40,50,60,70,80,90,100,110,120)
> amat <- matrix(x, nrow=3, ncol=4)
> amat
     [,1] [,2] [,3] [,4]
[1,]   10   40   70  100
[2,]   20   50   80  110
[3,]   30   60   90  120
> amat <- matrix(x, nrow=3, ncol=4, byrow=TRUE)
> amat
     [,1] [,2] [,3] [,4]
[1,]   10   20   30   40
[2,]   50   60   70   80
[3,]   90  100  110  120
The rows and columns of a matrix can be named using the functions rownames() and colnames():
> rownames(amat) <- c("r1","r2","r3")
> colnames(amat) <- c("c1", "c2", "c3", "c4")
> amat
   c1  c2  c3  c4
r1 10  20  30  40
r2 50  60  70  80
r3 90 100 110 120
The number of rows and columns in matrix amat will be returned by function calls
nrow(amat) and ncol(amat)
> nrow(amat)
[1] 3
> ncol(amat)
[1] 4
A number of functions to manipulate matrices exist in the R library. Sophisticated
operations like matrix multiplication, inversion, transpose, computing eigenvalues and eigenvectors,
and computing determinants can be done by calling the appropriate functions from the R library.
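A few of these matrix functions are sketched below on a small 2 x 2 matrix. The matrix m and the names minv and ev are chosen here only for illustration; t(), %*%, det(), solve() and eigen() are the standard base R functions for these operations.

```r
# A small square matrix chosen only for illustration
m <- matrix(c(2, 1, 1, 3), nrow=2, ncol=2)

t(m)              # transpose
m %*% m           # matrix multiplication (not element by element)
det(m)            # determinant: 2*3 - 1*1 = 5
minv <- solve(m)  # inverse of m
minv %*% m        # identity matrix, up to rounding error
ev <- eigen(m)    # list with components values and vectors
ev$values
```

Note that m * m would multiply the matrices element by element, while m %*% m performs the matrix product.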
4.3 Dataframes
A dataframe is a data structure similar to a matrix, with the special feature that different
columns can have different data types. A dataframe is very useful for combining vectors of the same
length with different data types into a single data matrix.
Similar to matrices, all the columns of a data frame should have the same number of
rows.
In the example below, we create a data frame called frm1 with three vectors namely data1,
data2 and data3. We call the function data.frame() for this. The created data frame will have
columns data1, data2 and data3.
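The creation of frm1 described above can be sketched as follows. The values of the three vectors are inferred from the frame contents printed later in this section, so they should be read as a reconstruction rather than the original listing.

```r
# Values inferred from the output shown later in this section
data1 <- c("Iron", "Sulphur", "Calcium", "Magnecium", "Copper")
data2 <- c(12.5, 32.6, 16.7, 20.6, 7.5)
data3 <- c(1122, 1123, 1124, 1125, 1126)

# Combine the three vectors of equal length into a data frame
frm1 <- data.frame(data1, data2, data3)
frm1
```

In R versions before 4.0, data.frame() converts character vectors to factors by default, which is why a Levels: line appears when the character column is printed later in this section.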
Note that the column names of the data frame frm1 we have created are just the names of the
objects themselves.
To get the column names of a data frame, use names(frame name):
> names(frm1)
[1] "data1" "data2" "data3"
The columns of a data frame can be named explicitly using a vector of strings. For the above
frame frm1, we can set the column names with our own vector of strings:
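Renaming the columns can be sketched as below. The column names Element, Proportion and Product_ID are the ones that appear in the output of this section; the small three-row frame is rebuilt here only to keep the example self-contained.

```r
# A small frame with default column names
frm1 <- data.frame(data1=c("Iron", "Sulphur", "Calcium"),
                   data2=c(12.5, 32.6, 16.7),
                   data3=c(1122, 1123, 1124))

# Assign new column names with a character vector of the same length
names(frm1) <- c("Element", "Proportion", "Product_ID")
names(frm1)
frm1[1:2,]    # first two rows, all columns
```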
> frm1[1:3,]
  Element Proportion Product_ID
1    Iron       12.5       1122
2 Sulphur       32.6       1123
3 Calcium       16.7       1124
We can also access a column of a dataframe by its name, by typing the frame name and the
column names separated by a $ sign:
> frm1$Element
[1] Iron      Sulphur   Calcium   Magnecium Copper
Levels: Calcium Copper Iron Magnecium Sulphur
> frm1$Proportion
[1] 12.5 32.6 16.7 20.6  7.5
> frm1$Product_ID
[1] 1122 1123 1124 1125 1126
In case we feel that typing the $ sign every time we want to access a column in a data
frame is becoming tiresome, R provides a command to attach the variables in the data frame to our
workspace:
> attach(frm1)
Now, the column named Proportion can be accessed directly by its name, rather than by a $
sign:
> Proportion
[1] 12.5 32.6 16.7 20.6  7.5
4.4 Lists
A list is an ordered collection of objects known as its components. The objects collected inside a
list can be of very different kinds, like arrays, matrices, vectors and dataframes. Each object inside
a list, and each element of that object, can be accessed by the proper symbol and indexing.
To create a list with many objects, use the list() function. In the example below, we will create
a list of the following 5 vectors of different data types:
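The list described above can be sketched as follows. The contents of the five vectors are inferred from the printed list in this section; the vector names expts, props, ids, codes and lab are hypothetical and chosen here for readability. Only the list name alis comes from the text.

```r
# Five vectors of different types and lengths (names are hypothetical)
expts <- c("Experiment-A", "Experiment-B", "Experiment-C", "Experiment-D", "Experiment-E")
props <- c(12.5, 32.6, 16.7, 20.6, 7.5)
ids   <- c(1122, 1123, 1124, 1125, 1126)
codes <- c("A", "S", "P", "K", "G")
lab   <- "IBAB_LAB"

# Collect them into a single list and print it
alis <- list(expts, props, ids, codes, lab)
alis
```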
[[1]]
[1] "Experiment-A" "Experiment-B" "Experiment-C" "Experiment-D" "Experiment-E"
[[2]]
[1] 12.5 32.6 16.7 20.6  7.5
[[3]]
[1] 1122 1123 1124 1125 1126
[[4]]
[1] "A" "S" "P" "K" "G"
[[5]]
[1] "IBAB_LAB"
Each object in the list can be accessed in its entirety by typing the position of the object in the list
within double square brackets after the list name:
> alis[[1]]
[1] "Experiment-A" "Experiment-B" "Experiment-C" "Experiment-D" "Experiment-E"
> alis[[2]]
[1] 12.5 32.6 16.7 20.6  7.5
> alis[[5]]
[1] "IBAB_LAB"
To access the individual members of a specific vector in the list, use a second subscript as
shown:
> alis[[1]][1]
[1] "Experiment-A"
> alis[[4]][2]
[1] "S"
> alis[[3]][3] * 100
[1] 112400
Now we will create a list consisting of components of different data types:
> Lst <- list(name="AA-list", lengths=c(12.5,32.6,16.7,20.6,7.5),
+ XX=array(1:20, dim=c(4,5)))
The above list called Lst consists of a string component called name, a vector called lengths
and a 2 dimensional array called XX with dimension (4,5). Note that we have given names to the
components and created them inside the list() function itself.
Now let us print the components of the list:
> Lst
$name
[1] "AA-list"
$lengths
[1] 12.5 32.6 16.7 20.6  7.5
$XX
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20
The elements of each of the components can be directly accessed by the format
list_name$component_name. For example, element (3,4) of array XX can be accessed
by "Lst$XX[3,4]", and the elements of the third row of array XX are accessed by "Lst$XX[3,]".
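The access patterns for a named list can be tried with the following sketch, which rebuilds the list Lst so that it runs on its own.

```r
Lst <- list(name="AA-list", lengths=c(12.5,32.6,16.7,20.6,7.5),
            XX=array(1:20, dim=c(4,5)))

Lst$name          # "AA-list"
Lst$lengths[2]    # 32.6
Lst$XX[3,4]       # element in row 3, column 4: 15
Lst$XX[3,]        # third row: 3 7 11 15 19
Lst[["lengths"]]  # the same component, accessed with double brackets
```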
5 R Scripts
So far we have been typing the R commands at the R prompt >. Though this method is convenient
for learning a few lines of commands, it cannot be used for real life applications where code spanning
many tens of lines has to be written. For this purpose, R allows us to write a script, which is
a collection of many lines of R statements written in a file. The statements are written one below the
other separated by line breaks, without the > character at the beginning of each statement. This
script file can be executed at the R prompt with a single command, which in turn executes
the statements in the script one by one sequentially. This way, long and complicated code
can be written and executed.
An R script is recognized by the file extension r or R. Thus, test.r is an R script named
test and compute.R is an R script called compute.
As an example, create a text file with the name test.r and write the following lines of code in
the file:
a=5
b=6
c = a*(a+b)
print(c)
To execute this code, go to R prompt and source the script file with the command:
> source("test.r")
[1] 55
If we source an R script inside another R script, then the variables of the sourced script will be
accessible to the second script. They need not be declared separately inside the second script.
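The sharing of variables through source() can be sketched as below. To keep the example self-contained, the first script is written to a temporary file with tempfile() and writeLines(); the variable names p, q, total and d are chosen here for illustration, and in practice the script would simply be a file such as test.r.

```r
# Write a small script to a temporary file (a stand-in for a file like test.r)
script <- tempfile(fileext=".r")
writeLines(c("p = 5", "q = 6", "total = p*(p+q)"), script)

# Source the script; its variables become available to the code that follows
source(script)
d <- total + 10
print(d)    # 65
```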
6 Logical statements and control loops
In R, an expression or a statement consists of one or more data objects and operations applied to
them. For example,
c=a+b
Here, + is an addition operator that operates on the numbers a and b (called operands).
We have already learnt about the arithmetic operators +, -, *, / and %% (the modulus operator,
written with two percent signs in R).
R has many operators for creating relational and logical expressions. They are mostly used in
the control flow statements for executing simple or compound statements based on whether a given
expression is evaluated to be true or false.
The syntax of logical expressions and control flow statements in R is very similar to the
corresponding constructs in the C language.
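The important equality, relational and logical operators are listed below:
Operator     Function                    Usage
-------------------------------------------------------------------
<            less than                   expression1 < expression2
<=           less than or equal to       expression1 <= expression2
>            greater than                expression1 > expression2
>=           greater than or equal to    expression1 >= expression2
==           equality                    expression1 == expression2
!=           inequality                  expression1 != expression2
&            logical AND                 expression1 & expression2
||           logical OR                  expression1 || expression2
!            logical NOT                 !expression
isTRUE(x)    test if x is true           isTRUE(expression)
is.na(x)     test if x is missing (NA)   is.na(expression)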
In the above table, expression, expression1 and expression2 stand for expressions like,
3.1456 (simple constant)
radius (simple variable)
xvalue * sin(x) (a compound expression)
These operators operate on one or more expressions.
To understand the way a logical expression works in R, have a look at the following tiny R-script
and the output it generates:
x=5
y=6
print(x < y)
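The same script can be extended with a few more comparisons to see how each operator evaluates. This is an added sketch using the values of x and y from above; the expected result of each line is noted in the comment.

```r
x = 5
y = 6
print(x < y)                # TRUE
print(x == y)               # FALSE
print(x != y)               # TRUE
print((x < y) & (y > 10))   # FALSE, because the second condition fails
print(!(x > y))             # TRUE
```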
6.1 Data filtering with logical statements
Using simple logical expressions, data stored in various data structures of R can be easily filtered to
create subsets of data. In this chapter, we demonstrate this through examples in the form of small
scripts which can be typed into a file and sourced at the R prompt as shown in the previous
chapter.
We start with a simple example in which we filter out the elements of a vector whose values
are greater than a certain number. In a second filter operation, we filter values in a range.
To achieve this, we place the required logical statement inside the square brackets where array
elements are accessed. See the script below:
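A script of this kind can be sketched as follows. The values in marray are hypothetical: they are chosen here so that the two filters give the results discussed in this section, and the two filter statements are the ones quoted in the text.

```r
# Hypothetical data vector (the original values are not shown in this text)
marray <- c(1.2, 3.4, 7.3, 9.7, 7.6, 9.9, 11.4, 14.6, 17.4, 16.5, 5.1)

# The logical condition inside the square brackets selects the matching elements
highfilt <- marray[marray > 9.5]
bandfilt <- marray[(marray > 7) & (marray < 15.0)]

print("High filter : values above 9.5")
print(highfilt)
print("Band filter : values between 7 and 15")
print(bandfilt)
```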
When the above code lines are executed in an R script, the following output is created.
[1] "High filter : values above 7"
[1] 9.7 9.9 11.4 14.6 17.4 16.5
[1] "Band filter : values between 7 and 15"
[1] 7.3 9.7 7.6 9.9 11.4 14.6
In the above statements, the statement
highfilt <- marray[marray > 9.5]
picks the elements of vector marray whose values are greater than 9.5 and creates the
vector highfilt with these numbers. The print(highfilt) statement prints the elements of highfilt.
Similarly, the statement
bandfilt <- marray[(marray > 7) & (marray < 15.0)]
picks up the elements of vector marray whose values are greater than 7 and less than 15 to create
the vector bandfilt with these numbers.
In the second example, we create a vector of numbers with some missing values (i.e. NA). We
will apply a filter to select elements which are not NAs and at the same time have values below 100,
and write them into another vector. In a second operation, we will replace all the NA values in
the original vector itself by 0.
The script below achieves this:
tarray <- c(2, 7, 29, 32, 41, 11, 15, NA, NA, 55, 32, NA, 42, 109)
karray <- tarray[ !is.na(tarray) & (tarray < 100) ]
tarray[is.na(tarray)] <- 0
print("Filter with NAs and numbers greater than 100 removed:")
print(karray)
print("Filter with NAs replaced by 0")
print(tarray)
When the above code lines are executed in an R script, the following output is created.
[1] "Filter with NAs and numbers greater than 100 removed:"
[1] 2 7 29 32 41 11 15 55 32 42
[1] "Filter with NAs replaced by 0"
 [1]   2   7  29  32  41  11  15   0   0  55  32   0  42 109
In the above script, the statement
tarray[ !is.na(tarray) & (tarray < 100) ]
selects elements of vector tarray that are not NAs and at the same time less than 100. The
statement
tarray[is.na(tarray)] <- 0
assigns the value 0 to the elements of vector tarray that are missing values (NAs).
After this, all NAs in vector tarray are replaced by 0.
From a data set, a subset can be created by applying conditions on one or more columns.
For example, suppose a data frame called datframe has many columns and one of them
is named npcol. Then the statement
subdata <- subset(datframe, datframe$npcol > 30.0)
will select all the rows of datframe in which npcol is greater than 30 to create a new data frame
called subdata.
The subset function can be applied to data types like vectors and data frames.
As a third example, we will create a data frame with (imaginary) experimental data. In
this data set, there are 7 genes for which some experimental measurements are available from 7
experiments.
We first create a data frame with these data vectors, and then use the subset() function to create
subsets of the data after filtering on individual column values.
The code below demonstrates this. The comments are self explanatory.
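A script along these lines can be sketched as follows. The data values and column names are taken from the output printed in this section; the name expdata for the full data frame is a hypothetical choice, and the three subset() conditions are inferred from the titles printed with each subframe.

```r
# Full data frame: 7 rows, gene name, gender and 7 measurement columns
# (values reconstructed from the printed output; expdata is a hypothetical name)
expdata <- data.frame(
  GeneName = c("gene-1","gene-2","gene-3","gene-4","gene-5","gene-5","gene-6"),
  Gender   = c("M","M","F","M","F","F","M"),
  expt1    = c(12.3, 11.5, 13.6, 15.4,  9.4,  8.1, 10.0),
  expt2    = c(22.1, 25.7, 32.5, 42.5, 12.6, 15.5, 17.6),
  expt3    = c(15.5, 13.4, 11.5, 21.7, 14.5, 16.5, 12.1),
  expt4    = c(14.4, 16.6, 45.0, 11.0,  9.7, 10.0, 12.5),
  expt51   = c(12.2, 15.5, 17.4, 19.4, 10.2,  9.8,  9.0),
  expt52   = c(13.3, 14.5, 21.6, 17.9, 15.6, 14.4, 12.0),
  expt6    = c(11.0, 10.0, 12.2, 14.3, 23.3, 19.8, 13.4)
)

# Filter on a numerical column
subframe1 <- subset(expdata, expt2 > 20)
# Filter on a character column
subframe2 <- subset(expdata, Gender == "F")
# Combine two conditions with a logical AND
subframe3 <- subset(expdata, Gender == "M" & expt2 < 30.0)

print("subframe1 : Rows with expt2 > 20")
print(subframe1)
print("subframe2 : Rows with gender Female")
print(subframe2)
print("subframe3 : Rows with Male gender and expt2 < 30.0")
print(subframe3)
```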
When the above code lines are executed in an R script, the following output is created.
[1] "subframe1 : Rows with expt2 > 20"
  GeneName Gender expt1 expt2 expt3 expt4 expt51 expt52 expt6
1   gene-1      M  12.3  22.1  15.5  14.4   12.2   13.3  11.0
2   gene-2      M  11.5  25.7  13.4  16.6   15.5   14.5  10.0
3   gene-3      F  13.6  32.5  11.5  45.0   17.4   21.6  12.2
4   gene-4      M  15.4  42.5  21.7  11.0   19.4   17.9  14.3
[1] "subframe2 : Rows with gender Female"
  GeneName Gender expt1 expt2 expt3 expt4 expt51 expt52 expt6
3   gene-3      F  13.6  32.5  11.5  45.0   17.4   21.6  12.2
5   gene-5      F   9.4  12.6  14.5   9.7   10.2   15.6  23.3
6   gene-5      F   8.1  15.5  16.5  10.0    9.8   14.4  19.8
[1] "subframe3 : Rows with Male gender and expt2 < 30.0"
  GeneName Gender expt1 expt2 expt3 expt4 expt51 expt52 expt6
1   gene-1      M  12.3  22.1  15.5  14.4   12.2   13.3  11.0
2   gene-2      M  11.5  25.7  13.4  16.6   15.5   14.5  10.0
7   gene-6      M  10.0  17.6  12.1  12.5    9.0   12.0  13.4
6.2
The if conditional statement helps us to execute certain commands subject to the condition
that a given statement is TRUE.
The general syntax of the if statement is given by
if(condition) statement
Here if is a reserved keyword. The condition typed inside parentheses refers to a logical statement.
The statement refers to a single statement, or a set of statements enclosed in curly brackets, which
will be executed if the condition is true.
First the condition is logically evaluated and if it evaluates to TRUE, the statement is executed. If
the condition evaluates to FALSE, the statement is not executed.
Following script illustrates this:
a = 5.0
b = 10.0
if(a < b)
print("a is less than b")
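An else branch can be added to execute an alternative statement when the condition is FALSE:
a = 5.0
b = 10.0
c = 15.0
d = 20.0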
if(a > b)
{
print("a is greater than b")
} else {print("a is less than b")}
Since a=5.0 and b=10.0, the condition (a > b) evaluates to false in the above code, and the
control is transferred to else condition and the following line is printed:
[1] "a is less than b"
A set of nested if...else if conditions can be set up as shown in the example below. The code
is self explanatory.
a = 5.0
b = 10.0
c = 15.0
d = 20.0
k = 0   # counter, initialized before it is updated below
if(a > b)
{
k = k + 1
print("a is greater than b")
} else if(b > c)
{
k = k - 1
print("b is greater than c")
} else { print("both are not true")}
Since the conditions (a > b) and (b > c) both evaluate to FALSE, the print statement inside the
final else is executed to print the following line:
[1] "both are not true"
6.3
The for loop is useful for iteratively executing a group of instructions. The general format is
given by
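As a minimal sketch of this format, the loop below iterates directly over the elements of a vector. The general form for(variable in sequence) { statements } is standard R; the vector avec and the accumulator total are illustrative choices.

```r
# General form:  for(variable in sequence) { statements }
avec <- c(2.1, 3.2, 4.3, 5.4, 6.5, 7.6)
total = 0
for( val in avec )
{
   # val takes the value of each element of avec in turn
   total = total + val
   print(val)
}
print(total)
```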
In the second method, elements of a vector can be iteratively accessed inside the for loop by
the index generated inside. Carefully go through this script:
avec <- c(2.1, 3.2, 4.3, 5.4, 6.5, 7.6)
for( i in 1:length(avec) )
{
num = avec[i]*10
print(num)
}
In the above example, length(avec) returns the number 6, which is the length of the vector as defined
in the code. Thus, 1:length(avec) generates a sequence from 1 to 6. As we have seen before, the
for loop iterates through this sequence, assigning the values 1 to 6 to the variable i. Inside the loop,
avec[i]*10 accesses the values of vector avec using this index and multiplies them by 10. The resulting
output is presented here:
[1] 21
[1] 32
[1] 43
[1] 54
[1] 65
[1] 76
6.4
The while loop is used for executing a statement repeatedly as long as a condition is valid. The loop
terminates when the condition fails. The general format is
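A minimal sketch of this format is given below. The form while(condition) { statements } is standard R; the counter variable count and the limit 5 are illustrative choices.

```r
# General form:  while(condition) { statements }
count = 1
while(count <= 5)
{
   print(count*10)
   count = count + 1   # without this update the loop would never end
}
```

When the loop ends, count has the value 6, the first value for which the condition (count <= 5) fails.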
6.5
The break statement breaks out of a for or while loop. When a break is encountered, control
is transferred to the first statement outside the inner-most loop. When combined with an if condition,
the break can be effectively used for the conditional termination of for or while loops. The example below
illustrates this concept.
nevent = 100
for(i in 1:nevent)
{
if(i*12.0 > 200)
break;
print(i)
}
print("Now control is outside the for loop")
The value of the iterator i varies from 1 to nevent = 100 inside the for loop. There is an if
condition inside the for loop that tests whether i*12 is greater than 200 for every value of i.
When this test is true, the break statement transfers the control outside the first enclosing for loop.
Since 17 x 12 = 204 > 200, the for loop prints i only for the first 16 iterations, when i varies from 1 to 16.
This code prints out the following lines as expected:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 11
[1] 12
[1] 13
[1] 14
[1] 15
[1] 16
[1] "Now control is outside the for loop"
Similarly, we can break out of while loop under specific condition.
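A sketch of the same test inside a while loop is shown below. The loop condition is always TRUE, so the break statement is the only way out of the loop; this is an added illustration, not part of the original script.

```r
i = 0
while(TRUE)
{
   i = i + 1
   # leave the loop as soon as i*12 exceeds 200
   if(i*12.0 > 200)
      break
}
print(i)   # 17, the first value for which i*12 exceeds 200
print("Now control is outside the while loop")
```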
Like other languages, R has the ability to support user defined functions. An R function takes
objects and data variables as function arguments and returns an object.
The function in R has the following structure:
myfunction <- function(argument1, argument2, ...) {
statements
return(object)
}
Here, function is a keyword used for defining the function. argument1, argument2 etc. are
function arguments. They can be either simple variable types or objects like arrays, lists etc. Inside
the function, the arguments passed in are used by the statements, which are the lines of
script that do the computation. Finally, the function passes the computed object back through a
return statement. All the lines of code inside the function are enclosed in a pair of curly brackets
following the keyword function and its argument list.
myfunction refers to the name given to the function. When the function is called, it is called with
this name.
Once a function is defined, it can be called with the general syntax,
objectName <- myfunction(arg1, arg2, ...)
The object returned by myfunction is copied to the new object called objectName. As with any
other program logic, the data type, number and order of the arguments in the definition and in the
call of the function should match. If not, an error is flagged by R.
In the example script below, a function called normalize takes a vector avec and a number anum
as arguments. It divides each element of this vector by the number and takes the square root. The
resulting vector of normalized numbers is then returned.
In the script, a vector called vec and a number called anumber are created and the function call
is given. The resulting vector is printed.
The script is given below, which is self explanatory:
# defining a function called normalize
normalize <- function(avec, anum) {
norvec <- (avec/anum)^0.5
return(norvec)
}
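The call of this function, as described in the text, can be sketched as below. The function definition is repeated so that the example runs on its own; the values given to vec and anumber, and the name result, are chosen here only for illustration.

```r
# defining a function called normalize
normalize <- function(avec, anum) {
   norvec <- (avec/anum)^0.5
   return(norvec)
}

# a vector and a number to pass to the function (illustrative values)
vec <- c(4, 16, 36, 64)
anumber <- 4

# call the function and print the returned vector
result <- normalize(vec, anumber)
print(result)   # 1 2 3 4
```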
8 Plots in R
Various types of sophisticated plots can be created in R. For each plot type, a plot function has to
be called with parameters to set plot properties like range, axis, point type, line type, titles, legend
etc.
We will start with plots with points and lines. We will discuss all aspects of plots in this section,
most of which are common to all plot types. In the later sections, we discuss the specific features of each
plot type.
8.1
Point and line plots can be produced using plot() function, which takes x and y points either as
vectors or single number along with many other parameters. The parameters x and y are necessary.
For others, default value will be used in the absence of the parameter.
In the script below, we create 2 vectors called x and y with data points and call the plot
function. Obviously, vectors x and y should have equal number of elements:
# We create vectors of (X,Y) points and plot.
x <- c(1,2,3,4,5,6,7,8,9,10)
y = c(12, 23, 36, 48, 53, 64, 78, 89, 91, 110)
# Just plot points
plot(x,y)
The above code creates a plot with points. This is a simple plot in black color with both axes
labelled with the vector names. Other parameters of the plot have been given default values.
We will now add more features to this plot in steps. First, we join the points with a line
while retaining the points. This is achieved by the parameter called type, which takes a character
value in double quotes. See the code:
# We create vectors of (X,Y) points and plot.
x <- c(1,2,3,4,5,6,7,8,9,10)
y = c(12, 23, 36, 48, 53, 64, 78, 89, 91, 110)
# plot points overlaid by lines
plot(x,y,type="o")
The type parameter takes the following values:
type = "p"   plots points
type = "l"   plots lines
type = "b"   plots points and lines
type = "o"   plots points overlaid by lines
type = "h"   plot with histogram like vertical lines
type = "s"   plot with stair steps
type = "n"   no plotting - blank plot with axis marked (x,y)
Now we will choose a symbol and size for the data points. This is achieved by the
parameters pch (meaning point character) for point symbol, and cex for the symbol size. The code
is below:
# We create vectors of (X,Y) points and plot.
x <- c(1,2,3,4,5,6,7,8,9,10)
y = c(12, 23, 36, 48, 53, 64, 78, 89, 91, 110)
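A plot call using these two parameters can be sketched as follows. The values pch=20 and cex=1.1 are the ones used in the plot statements later in this section; the data vectors are repeated so that the example is self-contained.

```r
# We create vectors of (X,Y) points and plot.
x <- c(1,2,3,4,5,6,7,8,9,10)
y = c(12, 23, 36, 48, 53, 64, 78, 89, 91, 110)

# pch=20 draws small filled circles as point symbols,
# cex=1.1 makes the symbols 10 percent larger than the default size
plot(x, y, type="o", pch=20, cex=1.1)
```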
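Important values of these parameters are given here. For details, see R manual.
pch ---> an integer from 0 to 25 selecting a point symbol (for example, pch=20 is a
         small filled circle), or a single character to be used as the symbol
cex ---> a number giving the symbol size relative to the default size
         cex = 1 is default, cex = 2 is twice the default size etc.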
lty ---> a number like 1,2,3... indicating the type of line like plain line,
         dashed line, dot dashed etc.
         See manual for details of each type.
lwd ---> number indicating the line width
         lwd = 1 is default
         lwd = 2 is twice the default width
         lwd = 3 is thrice the width etc.
8.1.3
We should now add color to the data points and line we have plotted. This is done using
col parameter.
Modify the plot statement in our code as follows (we omit printing other lines of code)
plot(x,y,type="o", pch=20, cex=1.1, lty=3, lwd=1,col="dark red")
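The col parameter can be defined in 3 ways:
col = 5         ---> a number selecting a color from the current color palette
col = "blue"    ---> name of the color given as a string. See manual for list
col = "#FFFFFF" ---> hexadecimal RGB code of the color given as a string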
Now we will add main title to the graph with its own color, font and sizes.
For this we use main parameter. This takes a string value which will be displayed as the main
title of the plot at the top.
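The properties of the main title are set with the following parameters:
col.main  ---> color of the main title
font.main ---> font of the main title (1 = plain, 2 = bold, 3 = italic, 4 = bold italic)
cex.main  ---> size of the main title relative to the default size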
With these, the plot call for a plot with main title is as below:
plot(x,y,type="o", pch=20, cex=1.1, lty=3, lwd=1,col="dark red",
main="Plot of Data-1", col.main="blue", font.main=4, cex.main=1.2 )
8.1.5
A subtitle at the bottom of the plot can be added with sub parameter, which takes a string value
and displays it at the bottom of the plot. The other properties of this text are set with parameters
col.sub, font.sub, cex.sub which take usual values. See the plot statement below:
plot(x,y,type="o", col="dark red", main="Plot of Data-1",
col.main="blue", font.main=4, cex.main=1.2,
sub = "This is sub title", col.sub="blue", font.sub=7,
cex.sub=1.0)
8.1.6
We will also add axis titles with chosen color, size and font. The titles to the X and Y axis can
be given with xlab, ylab parameters (meaning X-label and Y-label). These two parameters take
string values which are displayed as labels for X and Y axis. The font type, color and size are set
through font.lab, col.lab, cex.lab whose values are similar to the ones we saw before.
Here is the plot statement with X and Y labels set:
x <- c(1,2,3,4,5,6,7,8,9,10)
y = c(12, 23, 36, 48, 53, 64, 78, 89, 91, 110)
plot(x,y,type="o", pch=20, cex=1.1, lty=3, lwd=1,col="dark red",
main="Plot of Data-1", col.main="blue", font.main=4, cex.main=1.2,
sub = "This is sub title", col.sub="blue", font.sub=7, cex.sub=1.0,
xlab="This is X-axis Label", ylab="This is Y-axis Label", col.lab="red",
font.lab=6, cex.lab=1.1)
The plot with axis titles and subtitle drawn for the above statement is shown here:
When the plot function is called with data, it computes the ranges of the X and Y axes from the data.
Sometimes we need to fix the axis ranges by hand rather than let the data decide them.
The ranges of the X and Y axes can be set using the xlim and ylim parameters (xlim means x-limit
and ylim means y-limit).
The parameters xlim and ylim each take a 2 element vector as input. The first number is the
beginning of the range and the second is the end of the range. Thus, xlim=c(1,10) means an X axis range
from 1 to 10. Similarly for ylim.
See the plot call below. This plots with the x axis in the range 1 to 20 and the y axis in the range 1 to
150.
Xvalue <- c(1,2,3,4,5,6,7,8,9,10)
Yvalue = c(12, 23, 36, 48, 53, 64, 78, 89, 91, 110)
# Ranges for X,Y axis (other parameters to default, for clarity.)
plot(Xvalue, Yvalue, xlim=c(1,20), ylim=c(1,150))
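Instead of typing the limits as fixed numbers, we can also compute them from the data with the range function. The sketch below is one possible use: both axes are made to start at 0 while the upper ends stay at the data maximum. The names xr and yr are introduced here only for illustration.

```r
Xvalue <- c(1,2,3,4,5,6,7,8,9,10)
Yvalue <- c(12, 23, 36, 48, 53, 64, 78, 89, 91, 110)

# range() returns a 2 element vector: minimum and maximum
xr <- range(Xvalue)
yr <- range(Yvalue)
print(xr)
print(yr)

# Start both axes at 0, end them at the largest data value
plot(Xvalue, Yvalue, xlim=c(0, xr[2]), ylim=c(0, yr[2]))
```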
We can write text inside the plot, for explanation and for labelling curves, using the text function. This
function takes 2 numbers for the (x,y) coordinates of the text position, and a text string which
is displayed inside the plot at the given (x,y). By default the string is centred on this point. Note that the units of these coordinates
are the same as the units of the x and y axes used in the plot.
If text labels are to be written near many points in the plot, we can give x and y as vectors, and
another vector of strings as labels. In this case, all 3 vectors should be of the same length. The
general format of the text command is,
text(x, y, textString, col=color, cex=value, font=fontType)
The code below shows the text() call written inside the plot() call as well as after the call to plot().
The separate call after plot() is the usual form, since text() draws on a plot that already exists.
See the plot calls below for all these:
Xvalue <- c(1,2,3,4,5,6,7,8,9,10)
Yvalue = c(12, 23, 36, 48, 53, 64, 78, 89, 91, 110)
# To add text to a plot. We add text at a particular
# location (2, 100) in the plot.
# First look at the plot, and decide the units for (x,y)!!
plot(Xvalue,Yvalue,text(2,100,"This is text at (2,100)"))
# Here, we place a text near every point, at 0.3 unit
# of x distance from each point.
cch <- c("a","b","c","d","e","f","g","h","i","j")
plot(Xvalue,Yvalue, text(Xvalue+0.3, Yvalue, cch, col="blue"))
# We will draw both - text at a particular location as well as text at
# every point as labels.
plot(Xvalue,Yvalue, text(Xvalue+0.3, Yvalue, cch, col="blue"))
text(2, 100, "This is text at (2,100)", col="red")
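Rather than shifting the x coordinate by hand, the text function also has a pos parameter that places the label beside the point (1 = below, 2 = left, 3 = above, 4 = right). A minimal sketch, in which the labels are taken from the built-in letters vector:

```r
Xvalue <- c(1,2,3,4,5,6,7,8,9,10)
Yvalue <- c(12, 23, 36, 48, 53, 64, 78, 89, 91, 110)
# The first 10 lower case letters as labels, one per point
cch <- letters[1:10]

plot(Xvalue, Yvalue)
# pos=4 writes each label to the right of its point
text(Xvalue, Yvalue, labels=cch, pos=4, col="blue")
```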
8.2
To plot more than one curve on a single plot in R, we proceed as follows. First, create the first
plot. For the subsequent curves, do not use the plot function. Instead, each of the subsequent curves
is plotted using the points and lines functions, whose calls are similar to that of the plot function. See the
code below:
# multiple graphs on the same plot with legends
x <- c(1,2,3,4,5,6,7)
y1 <- c(1,4,9,16,25,36,49)
y2 <- c(1, 5, 12, 21, 34, 51, 72)
y3 <- c(1, 6, 14, 28, 47, 73, 106 )
# First curve is plotted
plot(x, y1, type="o", col="blue", pch="o", lty=1)
# second curve on same plot -- use points() and lines() function
points(x, y2, col="red", pch="*")
lines(x, y2, col="red",lty=2)
# third curve on the same plot. Use points() and lines() function.
points(x, y3, col="dark red",pch="+")
lines(x, y3, col="dark red", lty=3)
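Note that the first plot call fixes the axis ranges from y1 alone (1 to 49), so the larger values of y2 and y3 fall outside the plotting region. One way to show all three curves completely, sketched below, is to compute a common Y range first and pass it as ylim. The name yall is introduced here only for illustration.

```r
x <- c(1,2,3,4,5,6,7)
y1 <- c(1,4,9,16,25,36,49)
y2 <- c(1, 5, 12, 21, 34, 51, 72)
y3 <- c(1, 6, 14, 28, 47, 73, 106)

# Common Y range covering all three curves
yall <- range(y1, y2, y3)
print(yall)

# First curve, with the Y axis wide enough for every curve
plot(x, y1, type="o", col="blue", pch="o", lty=1, ylim=yall, ylab="y")
# Second and third curves on the same plot
points(x, y2, col="red", pch="*")
lines(x, y2, col="red", lty=2)
points(x, y3, col="dark red", pch="+")
lines(x, y3, col="dark red", lty=3)
```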
Legends can be added to the plot within a box at a desired location using the legend function.
This function takes the following parameters:
X and Y axis locations in the graph coordinates.
A vector of strings consisting of legends, typically one per graph
A vector of colors for the col parameter. These colors are the same as the ones used in the graph
A vector of character symbols for the pch parameter, the same as the ones used as pch parameters
in the plots.
A vector of line types for the lty parameter, the same as the ones used for plotting the curves
In the code below, we have added a legend to the above plot. Full code is given.
# multiple graphs on the same plot with legends
x <- c(1,2,3,4,5,6,7)
y1 <- c(1,4,9,16,25,36,49)
y2 <- c(1, 5, 12, 21, 34, 51, 72)
y3 <- c(1, 6, 14, 28, 47, 73, 106 )
# First curve is plotted
plot(x, y1, type="o", col="blue", pch="o", lty=1)
# second curve on same plot -- use points() and lines() function
points(x, y2, col="red", pch="*")
lines(x, y2, col="red",lty=2)
# third curve on the same plot. Use points() and lines() function.
points(x, y3, col="black",pch="+")
lines(x, y3, col="black", lty=3)
# Adding a legend inside box at the location (2,40) in graph coordinates.
legend(2,40,c("y1","y2","y3"), col=c("blue","red","black"),
pch=c("o","*","+"),lty=c(1,2,3))
8.3 2D Scatter Plot
A 2D scatter plot is the same as a plot with points only. We just have to pass the X and Y vectors for the
two coordinates to the plot function as arguments. All other settings are similar. The example code
is given here:
# R Scatter plot demo
# Generate 10000 random numbers from gaussian distribution
Xrandom <- 10*rnorm(10000)
# Generate 10000 numbers from Gaussian
Yrandom <- 10*rnorm(10000)
# plot the scatter plot. We choose color in Hexadecimal system
plot(Xrandom, Yrandom, cex=0.2, col="#FF9999", main="2D Scatter plot")
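Since rnorm gives different numbers on every run, the scatter plot changes each time the script is executed. If the same picture is needed again, the random number generator can be started from a fixed seed with set.seed. In the sketch below the seed value 123 is an arbitrary choice.

```r
# Fix the seed so that the same random numbers are generated every time
set.seed(123)
Xrandom <- 10*rnorm(10000)
Yrandom <- 10*rnorm(10000)
plot(Xrandom, Yrandom, cex=0.2, col="#FF9999", main="2D Scatter plot")

# Setting the same seed again reproduces the same numbers
set.seed(123)
Xagain <- 10*rnorm(10000)
print(identical(Xrandom, Xagain))
```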
8.4 Histogram
We can generate histograms in R using the hist function. The arguments of this function are almost
the same as those of plot.
In a histogram, we have to decide the number of bins beforehand. The function hist has a
parameter called breaks. When given as a single number, this is the number of bins in the histogram;
R treats it as a suggestion and may pick a nearby number that gives round bin boundaries.
In the simplest code below, we generate 10000 points from a Gaussian distribution and make a
histogram of them.
# Generate Gaussian deviates with mean=5, and SD=3
data <- rnorm(10000, mean=5, sd=3)
#plot histogram with 40 bins
hist(data, breaks=40, col="red", xlim=c(-10,20), ylim=c(0,800),
main="Simulated Data", col.main="blue")
The above script creates a histogram of these 10000 data points on the screen.
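If an exact number of bins is required, breaks can also be given as a vector of bin boundaries instead of a single number. The sketch below is one way to do this: it builds 41 equally spaced boundaries that span the data, which gives exactly 40 bins. The names brk and h are introduced here only for illustration.

```r
# Generate Gaussian deviates with mean=5, and SD=3
data <- rnorm(10000, mean=5, sd=3)

# 41 equally spaced boundaries covering the whole data range give 40 bins
brk <- seq(floor(min(data)), ceiling(max(data)), length.out=41)

h <- hist(data, breaks=brk, col="red", main="Simulated Data", col.main="blue")
# Number of bins actually used
print(length(h$counts))
```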
8.4.1
We can also access the data of the histogram through the object returned by the histogram. For
the above data, try this:
# Generate Gaussian deviates with mean=5, and SD=3
data <- rnorm(10000, mean=5, sd=3)
#plot histogram with 40 bins and get the returned histogram object.
hdat <- hist(data, breaks=40, col="red", xlim=c(-10,20), ylim=c(0,800),
main="Simulated Data", col.main="blue")
# print the contents of hdat, which has the histogram data
print(hdat)
# we can access (for example) first 10 elements of the bin break points
print( hdat$breaks[1:10])
# we can access (for example) first 10 elements of counts in the bins
print(hdat$counts[1:10])
# First 10 elements of intensities (an older name for density;
# recent versions of R may not return it, and then NULL is printed)
print(hdat$intensities[1:10])
# First 10 elements of densities
print(hdat$density[1:10])
# First 10 elements of mid values
print(hdat$mids[1:10])
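The counts and density components of the histogram object are related: the density of a bin is its count divided by the total count times the bin width, so that the total area of the histogram is 1. The sketch below checks this relation. It uses plot=FALSE to compute the histogram without drawing it; the names width and dens are introduced here only for illustration.

```r
data <- rnorm(10000, mean=5, sd=3)
# Compute the histogram object without drawing it
hdat <- hist(data, breaks=40, plot=FALSE)

# Width of each bin from consecutive break points
width <- diff(hdat$breaks)
# Density computed by hand from the counts
dens <- hdat$counts / (sum(hdat$counts) * width)

# Compare with the density stored in the object
print(all.equal(dens, hdat$density))
# Total area under the histogram
print(sum(hdat$density * width))
```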
8.5 Box-Whisker Plot
The Box-Whisker plot creates a pictorial representation of statistical spread in the data. In R,
the function boxplot creates this plot. This function can take many data types as inputs. We can
pass a vector, a list of vectors or a data frame made up of column vectors as input. For each one of
the columns of data, a Box-Whisker diagram is created.
We first create a single vector of data and call the boxplot function:
#
x
y
z
Some important parameters of the boxplot function are:
range ---> this determines how far the plot whiskers extend out from the
           box. If range is positive, the whiskers extend to the most
           extreme data point which is no more than range times the
           interquartile range from the box. A value of zero causes the
           whiskers to extend to the data extremes.
horizontal ---> a TRUE value for this will make the plot horizontal.
           Default is vertical
varwidth ---> if varwidth is TRUE, the boxes are drawn with widths
           proportional to the square-roots of the number of
           observations in the groups.
notch ---> if notch is TRUE, a notch is drawn in each side of the
           boxes. If the notches of two plots do not overlap, this is
           strong evidence that the two medians differ
outline ---> if outline is not TRUE, the outliers are not drawn
names ---> group labels which will be printed under each boxplot. Can
           be a character vector or an expression
boxwex ---> a scale factor to be applied to all boxes. When there are
           only a few groups, the appearance of the plot can be improved
           by making the boxes narrower
border ---> an optional vector of colors for the outlines of the
           boxplots.
col ---> colors to be used for the bodies of the box plots
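A minimal sketch of a boxplot call on a single vector, using some of the parameters listed above, is given here. The data values are made up for illustration and are not from the original listing. The boxplot function also returns an object with the computed statistics, which we store in b.

```r
# Made-up data with one large value
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 10, 25)

# Horizontal box plot; whiskers reach at most 1.5 times the
# interquartile range from the box, points beyond are outliers
b <- boxplot(x, range=1.5, horizontal=TRUE, col="lightblue",
             border="blue", main="Box-Whisker plot of x")

# Five numbers drawn for the box: lower whisker, lower hinge,
# median, upper hinge, upper whisker
print(b$stats)
# Points drawn as outliers
print(b$out)
```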
We can also call the boxplot function with a list of vectors or with a data frame, in which the column vectors
are arranged like the columns of a matrix. In the script below, we first create a list of vectors and call boxplot. Next,
we create a data frame of the same numeric vectors and call boxplot. Both the calls create a plot
with three boxplots, one for each column of data.
#
x
y
z
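A minimal sketch of such a script is given here. The three vectors hold made-up values chosen only for illustration, and the names b1 and b2 are introduced to hold the objects returned by the two calls.

```r
# Made-up data: three numeric vectors of the same length
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 10, 25)
y <- c(1, 3, 5, 7, 9, 11, 13, 15, 17, 19)
z <- c(10, 12, 12, 13, 14, 15, 15, 16, 18, 20)

# Box plot from a list of vectors
b1 <- boxplot(list(x=x, y=y, z=z), col=c("red","green","blue"),
              main="From a list")

# Box plot from a data frame with the same vectors as columns
adf <- data.frame(x, y, z)
b2 <- boxplot(adf, col=c("red","green","blue"), main="From a data frame")

# Both calls draw one box per vector
print(b1$names)
print(ncol(b2$stats))
```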
8.6 Pie Charts
Pie charts in R can be drawn using the pie function of the graphics package. This function is called with
a vector x of data and a vector of colors for the segments. We can also choose whether the data segments are
drawn clockwise or anticlockwise, which is the default. In the script below, we draw 2 pie charts,
one with simple labels and no legend, and the other with a legend and percentages marked:
result <- c(10, 30, 60, 40, 90)
# Create a Pie chart with a heading and rainbow colors
pie(result, main="Experiment-1", col=rainbow(length(result)),
labels=c("Mol-1","Mol-2","Mol-3", "Mol-4", "Mol-5"))
# Calculate the percentage of sections and put it in the label
alabels <- round((result/sum(result)) * 100, 1)
alabels <- paste(alabels, "%", sep="")
colors <- c("blue", "green","red", "white", "black")
pie(result, main="Experiment-1", col=colors, labels=alabels, cex=0.8)
# draw the legend
legend(-1.2, 1.0, c("molecule-1", "molecule-2", "molecule-3",
"molecule-4", "molecule-5"), fill=colors)
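Two small variations on the pie chart script are sketched here. With round and paste, a value like 13.0 is printed as 13%, so the labels have uneven width; sprintf with the format "%.1f%%" always keeps one decimal place. The clockwise parameter of pie, set to TRUE, draws the segments clockwise instead of the default anticlockwise direction.

```r
result <- c(10, 30, 60, 40, 90)

# Percentage labels with exactly one decimal place
alabels <- sprintf("%.1f%%", 100 * result / sum(result))
print(alabels)

colors <- c("blue", "green", "red", "white", "black")
# Draw the segments clockwise
pie(result, main="Experiment-1", col=colors, labels=alabels,
    cex=0.8, clockwise=TRUE)
```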
8.7 Bar-plots
In bar plots, individual categories are represented as vertical bars standing next to each other for
quantitative comparison. In R, the barplot function is called to create bar plots. This function can
take a vector or a matrix as data input. The code below shows this:
# We plot various bar charts here
# Define a data vector
data <- c(1,3,6,4,9)
#bar plot the vector -- simple plot with no legends and colors
barplot(data, main="Cancer-data", xlab="Days", ylab="Response Index",
names.arg=c("grp-1","grp-2","grp-3","grp-4","grp-5"),
border="blue", density=c(10,20,30,40,50))
# Create a data frame
col1 <- c(1,3,6,4,9)
col2 <- c(2,5,4,5,12)
col3 <- c(4,4,6,6,16)
data <- data.frame(col1,col2,col3)
names(data) <- c("patient-1","patient-2","patient-3")
# barplot with colors
barplot(as.matrix(data), main="Experiment-1", ylab="dosage", beside=TRUE,
col=rainbow(5))
#Add legends
legend("topleft", c("day1","day2","day3","day4","day5"), cex=1.0, bty="n",
fill=rainbow(5))
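The barplot function returns the X positions of the bar centres, which is useful for writing the value of each bar above it with the text function. The sketch below reuses the data of the script above; the names m and mids and the choice ylim=c(0,18) are introduced here only for illustration.

```r
col1 <- c(1,3,6,4,9)
col2 <- c(2,5,4,5,12)
col3 <- c(4,4,6,6,16)
m <- as.matrix(data.frame(col1, col2, col3))

# With beside=TRUE, barplot returns a matrix of bar centres,
# one entry per bar, with the same shape as the data matrix
mids <- barplot(m, beside=TRUE, col=rainbow(5), ylim=c(0, 18),
                names.arg=c("patient-1","patient-2","patient-3"),
                main="Experiment-1", ylab="dosage")

# Write each value just above its bar
text(as.vector(mids), as.vector(m), labels=as.vector(m), pos=3, cex=0.8)
```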
8.8
We can place multiple plots in a single figure. For this, we use the par() function in R. The mfrow
parameter of par() sets up plots one by one along rows, and the mfcol parameter sets up plots one by one along the
columns.
For example, par(mfrow = c(2,3)) sets up a figure with the first three plots along the first row and the next
three plots along the second row. When this command is given, a blank screen is created by the device.
The plots are allotted these positions one by one as they are plotted.
The code below splits the screen into 2 rows and 3 columns to contain 6 plots. The comments
make the code easy to understand.
# This script demonstrates multiple plots in a single figure.
## Set up plotting in two rows and three columns.
## Set the outer margin so that bottom, left, and right are 0 and
## top is 2 lines of text.
## Plotting goes along rows first.
## To plot along columns, use "mfcol" instead of mfrow.
par( mfrow = c( 2, 3 ), oma = c( 0, 0, 2, 0 ) )
## Call the first plot. This is automatically located in row 1, column 1:
plot( rnorm( n = 10 ), col = "red", main = "plot 1", cex.lab = 1.1 )
## Call the second plot. This is automatically located in row 1, column 2:
plot( runif( n = 10 ), col = "blue", main = "plot 2", cex.lab = 1.1 )
##Call the third plot. This is located in row 1, column 3:
plot( rt( n = 10, df = 8 ), col = "springgreen4", main = "plot 3",
cex.lab = 1.1 )
## Call the fourth plot. It is located in row 2, column 1:
plot( rpois( n = 10, lambda = 2 ), col = "black", main = "plot 4",
cex.lab = 1.1 )
## plot.new() skips a position.
plot.new()
## The fifth plot is located in row 2, column 3:
plot( rf( n = 10, df1 = 4, df2 = 8 ), col = "gray30", main = "plot 5",
cex.lab = 1.1 )
# Title is given to the whole of the plot.
title("Many distributions", outer=TRUE)
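The settings made by par() stay in force for all later plots on the device. A common practice, sketched below, is to store the old settings returned by par() and restore them at the end. The sketch also uses mfcol, so that the positions are filled column by column; the plotted data are arbitrary.

```r
# Set 2 rows and 2 columns, filled along columns; keep the old setting in op
op <- par(mfcol = c(2, 2))

plot(1:10, main = "plot 1")         # row 1, column 1
plot(10:1, main = "plot 2")         # row 2, column 1
plot((1:10)^2, main = "plot 3")     # row 1, column 2
plot(sqrt(1:10), main = "plot 4")   # row 2, column 2

# Restore the old setting: one plot per figure again
par(op)
print(par("mfrow"))
```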
The plot is shown in the next page.
The plot created by the above code is shown in the next page.
Multiple plots of variable sizes in a single figure drawn by splitting the screen:
9 INPUT/OUTPUT OPERATIONS IN R
Input/Output operations in R
When we start R, it starts an interactive session by default. The user gives input from keyboard
and output is printed on the screen. We can also take input from files and scripts into R session and
write into external files from R session. This section explains various Input/Output operations in R.
Including a script in current session: source() function
Using the source() function, we can include a script in the R session or in another script. For example,
the command in the R session
source("datfile.r")
includes the whole contents of datfile.r in the current session. Subsequently, we can use every
object declared inside datfile.r in the current session. This can also be used to source one script
inside another script, so that the second script can use the variables and objects of the sourced file.
The example below illustrates this:
script datfile.r
PI = 3.14
Epsilon = 0.034
K = 1.788
KMM = 2*PI*Epsilon
Now, the following script includes the first script and uses the variables in it:
script calcul.r
source("datfile.r")
cc = KMM * 29.5
print(cc)
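The two scripts above need the file datfile.r to exist in the working directory. To try source() in a single session without creating files by hand, the sketch below writes the same assignments into a temporary file with writeLines and then sources it. The use of tempfile and the name tmp are choices made here for illustration.

```r
# Write a small script into a temporary file
tmp <- tempfile(fileext = ".r")
writeLines(c("PI = 3.14",
             "Epsilon = 0.034",
             "K = 1.788",
             "KMM = 2*PI*Epsilon"), tmp)

# Include the script; its variables are now available in the session
source(tmp)
cc = KMM * 29.5
print(cc)
```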
Writing the output of current session into a file: sink() function
Using the sink() function, we can direct the output of the session to a file instead of the terminal. This function
takes arguments like: the output file name as a string, an append parameter that decides whether to
append to an existing file or overwrite it, and a split parameter that also prints to the screen when
TRUE. Once the function is called with a file name, all the subsequent print statements write to the
file. Another call sink() with no parameter terminates the writing to the external file. See the code
below:
# call the sink function.
# append=FALSE means do not append to the existing file
# split=FALSE means do not write on screen
sink("test.txt", append=FALSE, split=FALSE)
# Following print statements written to the file test.txt
for(i in 1:10)
{
print("Start Printing")
i = i*10 + 5
print(i)
}
# now return output to terminal
sink()
# Now the following statement is printed on the terminal, not in the file
print("Hi, this is over")
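To confirm what actually went into the file, we can read it back with readLines after the sink is closed. The sketch below does this with a temporary file; the name out and the printed strings are chosen here only for illustration.

```r
# Send printed output to a temporary file
out <- tempfile(fileext = ".txt")
sink(out)
print("written to file")
# Stop writing to the file; output returns to the terminal
sink()
print("written to screen")

# Read the file back as a vector of lines
contents <- readLines(out)
print(contents)
```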
Writing R objects into external files
We cannot use the sink() function to redirect the graphic output plotted by R into a file, or to store
R objects so that they can be read back. For this,
there are many functions given by various libraries. Here we demonstrate the use of the function
save for writing R data structures into a file, and load for retrieving the data from it. The function
dput is another way of writing an R object into a file.
# Creating some vectors
avec1 <- c(1,2,3,4,5,6)
avec2 <- c(10,20,30,40,50,60)
avec3 <- c(100,200,300,400,500,600)
svec <- c("aa","bb","cc","cc","dd")
cvec <- c("A","B","C")
astr <- "ATGCCTGAACGCCGGATT"
# create a data frame with vectors
aframe <- data.frame(avec1,avec2,avec3)
# create a list of this data frame and 3 more vectors
alis <- list(aframe, svec, cvec, astr)
# another vector
kvec <- c("AAA","BBB","CCC")
# Write two R objects into file using save() function. See help for options.
# We can save many such objects
save(list=c("alis", "kvec"), file="test1.out")
# load them into R using load() function
load("test1.out")
# Once loaded, just use them by name!!
print(alis[[1]]$avec1)
print(alis[[2]])
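The dput function mentioned above writes a single R object into a file as text, and dget reads it back. Unlike load, dget returns the object, so it can be assigned to any name. A minimal sketch with a temporary file follows; the names f and kvec2 are introduced here only for illustration.

```r
kvec <- c("AAA","BBB","CCC")

# Write the object as text into a temporary file
f <- tempfile(fileext = ".txt")
dput(kvec, file = f)

# The file holds a text representation of the object
print(readLines(f))

# Read the object back and assign it to a new name
kvec2 <- dget(f)
print(identical(kvec, kvec2))
```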
Writing R plots into image files
Many libraries exist for writing the R plots produced on screen into image files like .png, .jpeg,
PDF etc.
The image is plotted and saved in the following steps.
(1) A device is needed for plotting. The default device is the screen itself.
(2) Call an image function in R, such as jpeg(), png(), bmp() or pdf(), with the image file name.
This opens the file as the plotting device.
(3) Plot the image, with the plot() function for example. The plot is written to the file named in the
image function.
(4) Close the device by a dev.off() call. Now the image is saved in the directory given for the file
name.
See the code below:
# For writing plot into jpeg file
jpeg("figure1.jpeg")
plot(c(1,2,3,4), c(1,2,3,4))
dev.off()
# For writing plot into png file
png("figure1.png")
plot(c(1,2,3,4),c(1,2,3,4))
dev.off()
# For writing into bmp file
bmp("figure1.bmp")
plot(c(1,2,3,4), c(1,2,3,4))
dev.off()
# For writing into PDF file
pdf("figure1.pdf")
plot(c(1,2,3,4),c(1,2,3,4))
dev.off()
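The image functions also take parameters for the size of the figure. For pdf(), the width and height are given in inches. The sketch below writes a 6 by 4 inch PDF into the temporary directory of the session and then checks that the file was created; the location and the size are arbitrary choices for illustration.

```r
# File name inside the temporary directory of this R session
f <- file.path(tempdir(), "figure1.pdf")

# Open a PDF device of 6 x 4 inches, plot, and close the device
pdf(f, width = 6, height = 4)
plot(c(1,2,3,4), c(1,2,3,4), type = "o", main = "Saved plot")
dev.off()

# The file exists once the device is closed
print(file.exists(f))
```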