
basic_R.R 7/31/09 10:15 AM

### Basic R Workshop


### Christopher Crutchfield
### crutchfield@gmail.com
Download this file as a *.R
Download brauer_microarray.csv
Get R
### Note, all comments in this code will be prefixed by 3 #s.

### I/O
### The best way to learn R is to manipulate data you are familiar with. Most
### people are familiar with spreadsheet software such as Excel, Gnumeric, or
### Calc. One of the easiest ways to shuttle your data to R is to arrange it
### in a manageable matrix layout in your spreadsheet program and save that
### spreadsheet as a *.csv (comma-separated values) file. Likewise, writing
### to a *.csv file and opening it in a spreadsheet program is an easy way to
### quickly scan your current dataset.

### A common mistake is trying to read in a file that is located in a
### different directory from the one where R is running. To find out where R
### will look for files, use the basic (and useful) command:

getwd()

### When I enter this command I get the output:

### [1] "/home/ccrutchf/Desktop"

### This output lets me know that right now R is working on my desktop. If I
### don't give any more detail for input/output commands, R will read and
### write files on my desktop by default. If I know I want to work in another
### directory (say, "Documents"), I can easily change the directory R will
### access (called the "working directory"):

setwd("/home/userid/Documents");

### If I forgot the exact name of a file I want to load, and I know I am in the
### proper directory, I can check which files R can access via the function:

list.files()

### Here I get the output:

### [1] "brauer_microarray.csv"
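
### A minimal sketch of the same three commands, run against a throwaway
### directory so that nothing on your machine is touched. The names demo_dir
### and example.csv are made up for illustration; they are not part of the
### workshop files.

```r
### Create a temporary directory containing one empty *.csv file
demo_dir <- file.path(tempdir(), "wd_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, "example.csv"))

### setwd() returns the previous working directory, so we can restore it
old_wd <- setwd(demo_dir)

### List only the files whose names end in .csv
csv_files <- list.files(pattern = "\\.csv$")
print(csv_files)

### file.exists() is a quick check before attempting to read a file
has_file <- file.exists("example.csv")

### Go back to where we started
setwd(old_wd)
```

### Saving the value returned by setwd() is a convenient way to return to the
### original directory once you are done.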

### The most straightforward way to load this file into R and access its
### contents is to assign it to a "data frame." The more we work with R, the
### more powerful ways we will find to manipulate data frames. The simplest
### method for loading a *.csv (comma-separated values) file is as follows:

brauer_data <- read.csv("brauer_microarray.csv")

### This method, while simple, doesn't take full advantage of the read.csv()
### function. To examine the use of a function, simply type a question mark in
### front of its name at the command line, such as:

?read.csv

### The documentation, especially for core functions in R, usually goes into
### great depth. Typically the help file lists a function's parameters
### together with their default values. The only parameter we are interested
### in changing at this time is row.names. Because we are working with genes,
### we would prefer the rows of our data frame to be labeled with those genes,
### not with an arbitrary number. Setting row.names=1 tells read.csv() to use
### the first column of the file as the row names.

http://www.princeton.edu/~ccrutchf/R/basic_R.html Page 1 of 8

head(brauer_data)
brauer_data <- read.csv("brauer_microarray.csv", row.names=1)
head(brauer_data)

### You will notice the use of the function head(). This is a useful function
### for quickly inspecting the first six rows of a large data frame. Here its
### use shows in your console the difference between reading in a data frame
### with and without row names set. Conversely, use

tail(brauer_data)

### for reading the last six rows of a data frame.
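
### To see the effect of row.names without the real dataset, here is a small
### sketch using a three-gene file written to a temporary location. The file
### contents and the column names (G0.05, G0.10) are invented for this
### example and may not match the layout of brauer_microarray.csv.

```r
### Write a tiny CSV file to a temporary location
toy_csv <- tempfile(fileext = ".csv")
writeLines(c(
    "gene,G0.05,G0.10",
    "HXT1,0.5,0.7",
    "HXT2,-0.2,0.1",
    "ADH1,1.1,1.3"
), toy_csv)

### Without row.names the genes are an ordinary column and rows are numbered
no_names <- read.csv(toy_csv)

### With row.names=1 the first column becomes the row labels
with_names <- read.csv(toy_csv, row.names = 1)

print(rownames(no_names))
print(rownames(with_names))

### head() and tail() accept n to control how many rows are shown
print(head(with_names, n = 2))
print(tail(with_names, n = 1))
```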

### There will likely be times when you are manipulating a data frame that you
### want to save a copy of it *as is*. If you want to save a copy directly to a
### file you will likely want to use the write.csv() function.

write.csv(brauer_data, file="my_file_name.csv")
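
### A short round-trip sketch: write.csv() stores the row names as the first
### column by default, so reading the file back with row.names=1 should
### recover the original data frame. The data frame toy and its values are
### made up for illustration.

```r
### A two-gene data frame with gene names as row names
toy <- data.frame(
    G0.05 = c(0.5, -0.2),
    G0.10 = c(0.7, 0.1),
    row.names = c("HXT1", "HXT2")
)

### Save it, then read it back in
out_file <- tempfile(fileext = ".csv")
write.csv(toy, file = out_file)
round_trip <- read.csv(out_file, row.names = 1)

### all.equal() compares the two data frames, allowing for rounding
same <- all.equal(toy, round_trip)
print(same)
```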

### Creating new objects in R is easy. A vector containing the 6 dilution rates
### in the experiment can be generated using the c() (concatenate) function.
### The assignment operator, "<-", stores the result in a new object that we
### will call x_d.

x_d <- c(0.05, 0.10, 0.15, 0.20, 0.25, 0.30)
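
### As an aside, evenly spaced values like these can also be generated with
### seq(). This is only an alternative sketch; the names x_by_hand and
### x_by_seq are hypothetical. Because of floating-point rounding, compare
### the two vectors with all.equal() rather than with ==.

```r
### The same six dilution rates, typed by hand and generated by seq()
x_by_hand <- c(0.05, 0.10, 0.15, 0.20, 0.25, 0.30)
x_by_seq <- seq(0.05, 0.30, by = 0.05)

print(length(x_by_seq))
print(all.equal(x_by_hand, x_by_seq))
```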

### Now say we are interested in only the HXT genes. It's relatively easy to
### get at these genes using what's called a "regular expression." A more
### familiar, related idea is the wildcard, "*". In our case we are
### interested only in those genes whose names begin with HXT. The next
### command is our first "nested" function call. It may seem complicated at
### first, but learning these commands will eventually save you time and make
### your coding far less error-prone.

HXT <- brauer_data[
    grep(
        glob2rx("HXT*"),
        rownames(brauer_data)
    ),
]

### Now we have created a new object, a data frame containing only those
### genes whose names begin with the three characters HXT.
### We managed this by combining functions:

### "<-", the assignment operator, assigns the value on its right to the
### object it points to. At the top level of a script, the alternative
### operator "=" can be substituted here and works just the same.

### "[" is called the subset operator. Here we are ultimately searching our
### data frame for the row numbers (or indices) that match our requirements.
### The format typically used is data[rows, columns]. If we already knew the
### row numbers that match our specification
### (it turns out they are 1373 1704 1836 1872 2460 2765 2998 3244 3299 5434),
### we could generate the same data frame with the following command.
### Note: we concatenate the locations of the rows using the c() function.

HXTspecify <- brauer_data[
    c(
        1373, 1704, 1836, 1872, 2460, 2765, 2998, 3244,
        3299, 5434
    ),
]


### Take note that we left the column position (after the comma) blank,
### because in this case we are not selecting a subset of the columns in the
### data frame. In most cases, however, we do *not* want to look up each row
### index manually, which is why we use the function grep().

### grep() is a function that matches the pattern ("regular expression") it
### is given against each element of the character vector that follows, and
### returns the indices of the elements that match.
### Because we are interested in genes beginning with the characters "HXT",
### we use the glob2rx() function, which converts a string containing a
### wildcard ("*") into a regular expression. If you already know how to
### write regular expressions,
### which in this case would be "^HXT", you could avoid using glob2rx().
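
### A small sketch showing that the wildcard and the regular expression select
### the same elements. The vector genes is invented for illustration; note
### that "AHXT" contains HXT but does not begin with it, so it is skipped.

```r
### glob2rx() translates the wildcard pattern into a regular expression
hxt_regex <- glob2rx("HXT*")
print(hxt_regex)

### A made-up vector of gene names
genes <- c("HXT1", "AHXT", "HXT10", "ADH1")

### Both calls return the positions of the names that begin with HXT
via_glob <- grep(glob2rx("HXT*"), genes)
via_regex <- grep("^HXT", genes)

print(via_glob)
print(identical(via_glob, via_regex))
```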

### The character vector we are searching is the row names (in our case, gene
### names) of our data frame. If you call

rownames(brauer_data);

### the console will display all 5572 gene names in their order in the data
### frame. By matching our regular expression against this character vector
### of gene names, grep() will return those numeric row locations, or
### indices, that fit our pattern (HXT*).

grep(
    glob2rx("HXT*"),
    rownames(brauer_data)
);

### Note that the output

### [1] 1373 1704 1836 1872 2460 2765 2998 3244 3299 5434

### looks very similar to the input in the HXTspecify example, above.
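
### grep() has two close relatives that are handy when subsetting: the
### value=TRUE argument returns the matching names instead of their
### positions, and grepl() returns TRUE/FALSE for every element. The sketch
### below uses a made-up four-gene data frame (toy), not the Brauer data.

```r
### A made-up data frame with gene names as row names
toy <- data.frame(
    G0.05 = c(0.5, 1.1, -0.2, 0.3),
    row.names = c("HXT1", "ADH1", "HXT2", "PHO84")
)

### Positions of the matches
hxt_index <- grep("^HXT", rownames(toy))

### The matching names themselves
hxt_names <- grep("^HXT", rownames(toy), value = TRUE)

### One TRUE/FALSE per row; this also works inside the subset operator
hxt_logical <- grepl("^HXT", rownames(toy))

### drop=FALSE keeps the result a data frame even with a single column
toy_hxt <- toy[hxt_logical, , drop = FALSE]
print(toy_hxt)
```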

### We can get even more fancy, using a combination of a for loop, paste,
### regular expressions, and the explicit assign() function.

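limitation <- c("G", "N", "P", "S", "L", "U")
gene <- "HXT"

for (lim in limitation) {
    assign(
        ### Name of the new object, e.g. "HXT_G"
        paste(gene, "_", lim, sep = ""),
        brauer_data[
            ### Rows: genes whose names begin with the value of gene
            grep(
                glob2rx(paste(gene, "*", sep = "")),
                rownames(brauer_data)
            ),
            ### Columns: names beginning with the limitation letter
            grep(
                glob2rx(paste(lim, "*", sep = "")),
                colnames(brauer_data)
            )
        ]
    )
}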


### Using this loop, we automatically split the Brauer dataset into six
### objects (one per nutrient limitation) containing only those genes
### beginning with the characters HXT. The data frames we created using
### assign() are HXT_G, HXT_N, HXT_P, HXT_S, HXT_L, and HXT_U.

### A far less formal (and non-working) way to look at this code is

### for(each different limitation)
### {
###     create_a_new_object
###     (
###         with the name HXT_("limitation")
###         , make it a subset of brauer data containing:
###         [
###             only those rows with names beginning in HXT*
###             ,
###             and only those columns specific to that limitation
###         ]
###     )
### }
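
### An alternative to creating many separately named objects with assign() is
### to collect the subsets in one named list. This is only a sketch of that
### swapped-in technique, not the workshop's method. It uses a made-up data
### frame (toy) with invented column names and just two limitations.

```r
### A made-up data frame: three genes, two limitations, two rates each
toy <- data.frame(
    G0.05 = c(0.5, -0.2, 1.1), G0.10 = c(0.7, 0.1, 1.3),
    N0.05 = c(0.4, -0.1, 0.9), N0.10 = c(0.6, 0.0, 1.0),
    row.names = c("HXT1", "HXT2", "ADH1")
)
toy_limitation <- c("G", "N")

### Rows whose names begin with HXT
hxt_rows <- grep("^HXT", rownames(toy))

### One data frame per limitation, stored in a list
by_limitation <- lapply(toy_limitation, function(lim) {
    lim_cols <- grep(paste0("^", lim), colnames(toy))
    toy[hxt_rows, lim_cols, drop = FALSE]
})
names(by_limitation) <- toy_limitation

### Access a subset by name instead of via get()
print(by_limitation$G)
```

### With a list, later loops can use by_limitation[[lim]] directly instead of
### building object names with paste() and fetching them with get().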

### Now we might be interested in making use of this organization and
### producing some output that we could not as easily produce with a
### spreadsheet program such as Excel.
### First let us explore the lm() function

### Fit expression of the first gene in HXT_G against the dilution rates
lm_example <- lm(as.numeric(HXT_G[1, ]) ~ x_d)

### Typing the name of the object prints it
lm_example

### Here we assigned the output of fitting a linear model of HXT5 expression in
### glucose limitation against its dilution rates.
### Printing the fitted object shows only the call and the estimated
### coefficients (the intercept and the slope).
### More information can be extracted using summary()

summary(lm_example)

### Usually we want these values so we can pass them to another function.
### unclass() exposes the summary as a plain list of its components

unclass(
summary(lm_example)
)

### Now we can pull out all sorts of values

### R-squared
unclass(summary(lm_example))$r.squared

### Intercept
unclass(summary(lm_example))$coefficients[1]

### Slope
unclass(summary(lm_example))$coefficients[2]
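
### Indexing the coefficient table by position works, but selecting by name
### can be easier to read. A sketch on made-up data follows; toy_x, toy_y,
### and toy_fit are hypothetical names, and the y values are invented.

```r
### Made-up data lying roughly on the line y = 1 + 2x
toy_x <- c(0.05, 0.10, 0.15, 0.20, 0.25, 0.30)
toy_y <- 1 + 2 * toy_x + c(0.01, -0.01, 0.02, -0.02, 0.01, -0.01)
toy_fit <- lm(toy_y ~ toy_x)

### coef() returns a named vector of the estimates
toy_intercept <- coef(toy_fit)[["(Intercept)"]]
toy_slope <- coef(toy_fit)[["toy_x"]]

### The summary's coefficient table can be indexed by row and column name
toy_summary <- summary(toy_fit)
slope_from_table <- toy_summary$coefficients["toy_x", "Estimate"]
toy_r2 <- toy_summary$r.squared

print(c(intercept = toy_intercept, slope = toy_slope, r2 = toy_r2))
```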

### Let us try taking these values we just extracted, and make a useful plot


### For the sake of showing more functional programming, let's first
### reorganize the values we just learned to access.

### The following piece of code works, but doesn't take advantage of
### functional methods. With functional programming it is easier to reuse the
### code for pulling out different parameters, assaying different genes, etc.

### Glucose r^2

G_r2 <- unclass(
    summary(
        lm(as.numeric(HXT_G[1, ]) ~ x_d)
    )
)$r.squared

for (i in 2:length(HXT_G[, 1])) {
    G_r2 <- append(
        G_r2,
        summary(
            lm(as.numeric(HXT_G[i, ]) ~ x_d)
        )$r.squared
    )
}
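
### One possible functional version of the same idea wraps the per-row fit in
### a small function and applies it with vapply(). This is a sketch under
### stated assumptions, not the workshop's own code: row_r2 is a hypothetical
### helper, and toy_expr holds invented expression values.

```r
### Made-up dilution rates and expression values for two genes
toy_x <- c(0.05, 0.10, 0.15, 0.20, 0.25, 0.30)
toy_expr <- data.frame(
    G0.05 = c(0.10, 0.90), G0.10 = c(0.22, 0.70), G0.15 = c(0.28, 0.65),
    G0.20 = c(0.41, 0.40), G0.25 = c(0.52, 0.35), G0.30 = c(0.58, 0.10),
    row.names = c("HXT1", "HXT2")
)

### Hypothetical helper: r-squared of each row of expr regressed on x
row_r2 <- function(expr, x) {
    vapply(
        seq_len(nrow(expr)),
        function(i) summary(lm(as.numeric(expr[i, ]) ~ x))$r.squared,
        numeric(1)
    )
}

toy_r2 <- row_r2(toy_expr, toy_x)
names(toy_r2) <- rownames(toy_expr)
print(toy_r2)
```

### seq_len(nrow(expr)) also behaves sensibly for a data frame with a single
### row, where a loop written as 2:length(...) would count backwards.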

### You can write this same logic more generally with nested loops, and get
### out all the parameters you might be interested in. This code is not
### strictly necessary for the result: a series of vectors representing the
### different parameters for the different limitations could just as easily
### be generated by repeating the code above while changing the parameter and
### the nutrient limitation by hand.

parameters <- c("r2", "slope", "intercept")

for (p in parameters) {
    for (lim in limitation) {
        ### Look up the subset for this limitation, e.g. HXT_G
        expr <- get(paste(gene, lim, sep = "_"))

        ### One value per gene (row) in the subset
        values <- numeric(nrow(expr))

        for (n in seq_len(nrow(expr))) {
            ### Fit expression of gene n against the dilution rates
            fit <- summary(lm(as.numeric(expr[n, ]) ~ x_d))

            ### Pull out the requested parameter
            values[n] <- switch(p,
                r2 = fit$r.squared,
                slope = fit$coefficients[2],
                intercept = fit$coefficients[1],
                stop("unknown parameter")
            )
        }

        ### Store the result under a name such as HXT_G_r2
        assign(paste(gene, lim, p, sep = "_"), values)
    }
}

### Let us assume we're interested in the relation between slopes and
### intercepts across 5 of the conditions (G, N, P, L, and U) for the
### retrieved HXT genes. If we want to generate a scatter plot composed of
### these values, we can generate one using the following code

### Note, you can omit this "pdf" line if you want to see the plot
### generated dynamically

pdf(file="test.pdf")

### By concatenating all the values we will be plotting at the same time,
### we avoid needing to calculate the necessary axis ranges

plot(
    c(
        HXT_G_intercept,
        HXT_N_intercept,
        HXT_P_intercept,
        HXT_L_intercept,
        HXT_U_intercept
    ),
    c(
        HXT_G_slope,
        HXT_N_slope,
        HXT_P_slope,
        HXT_L_slope,
        HXT_U_slope
    ),
    ylab = "slope",
    xlab = "intercept"
)

### We can add some (much needed) color using the following code
### pch=19 is a filled circle

points(HXT_G_intercept, HXT_G_slope, pch=19, col="green")
points(HXT_N_intercept, HXT_N_slope, pch=19, col="orange")
points(HXT_P_intercept, HXT_P_slope, pch=19, col="purple")
points(HXT_L_intercept, HXT_L_slope, pch=19, col="red")
points(HXT_U_intercept, HXT_U_slope, pch=19, col="gray")

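### The code for a legend:
### Note the first two values denote its placement on the plot (the x and y
### coordinates of its top-left corner)

legend(
    6, 15,
    c("G", "N", "P", "L", "U"),
    col = c("green", "orange", "purple", "red", "gray"),
    pch = 19
)

### Close the PDF device so that test.pdf is actually written to disk
### (skip this line if you omitted the pdf() call above)

dev.off()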
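
### A self-contained sketch of the whole pdf/plot/legend cycle on made-up
### numbers. A file opened with pdf() is only complete once the device is
### closed with dev.off(). Here the legend is placed with the keyword
### "topright" instead of x and y coordinates, which avoids having to know
### the axis ranges in advance. All toy_ names and values are invented.

```r
### Open a PDF device pointing at a temporary file
plot_file <- tempfile(fileext = ".pdf")
pdf(file = plot_file)

### Made-up intercepts and slopes for a single condition
toy_intercept <- c(0.1, 0.4, 0.8)
toy_slope <- c(2.0, 1.5, -0.5)

plot(
    toy_intercept, toy_slope,
    xlab = "intercept", ylab = "slope",
    pch = 19, col = "green"
)
legend("topright", legend = "G", col = "green", pch = 19)

### Close the device; only now is the file complete
dev.off()

print(file.exists(plot_file))
```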
