R 7/31/09 10:15 AM
### I/O
### The best way to learn R is to manipulate data you are familiar with. Most
### people are familiar with spreadsheet software, such as Excel, Gnumeric, or
### Calc. One of the easiest ways to move your data into R is to arrange it in
### a manageable matrix layout in your spreadsheet program and save that
### spreadsheet as a *.csv (comma-separated values) file. Likewise, writing a
### comma-separated values file from R and opening it in a spreadsheet program
### is an easy way to quickly scan your current dataset.
getwd()
### This output lets me know that right now R is working on my desktop. If I
### don't give any more detail for input/output commands, R will default to
### reading and writing files on my desktop. If I know I want to work in
### another directory (say, "Documents"), I can easily change the directory R
### will access (called the "working directory"):
setwd("/home/userid/Documents");
### If I forget the exact name of a file I want to load, and I know I am in the
### proper directory, I can check which files R can access with the function:
list.files()
### The most straightforward way to load a file into R and access its
### contents is to assign it to a "data frame." The more we work with R, the
### more powerful ways we will find to manipulate data frames. The simplest
### method for loading a *.csv (comma-separated values) file is as follows:
brauer_data <- read.csv("brauer_microarray.csv")
### This method, while simple, doesn't take full advantage of the read.csv()
### function. To examine the use of a function, simply type a question mark in
### front of its name at the command line, such as:
?read.csv
### The documentation, especially for core functions in R, is usually very
### thorough. Typically the help file lists a function's usage and shows the
### defaults for its parameters. The only parameter we are interested in
### changing at this time is row.names. Because we are working with genes, we
### would prefer the rows of our data frame to correspond to those genes, not
### to an arbitrary number.
http://www.princeton.edu/~ccrutchf/R/basic_R.html Page 1 of 8
head(brauer_data)
brauer_data <- read.csv("brauer_microarray.csv", row.names=1)
head(brauer_data)
### You will notice the use of the function head(). This is a useful function
### for quickly inspecting the first six rows of a large data frame.
### Here its use will show in your console the difference between reading in a
### data frame with and without row names set. Conversely, to inspect the last
### six rows, use
tail(brauer_data)
### There will likely be times when you are manipulating a data frame and
### want to save a copy of it *as is*. To save a copy directly to a file, use
### the write.csv() function.
write.csv(brauer_data, file="my_file_name.csv")
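### As a minimal, self-contained sketch of the round trip described above, the
### lines below write a small made-up data frame to a temporary file and read
### it back with and without row.names. The object names (toy, toy_file,
### toy_plain, toy_named) and the values are assumptions for illustration
### only; they are not part of the Brauer dataset.
toy <- data.frame(G1 = c(0.1, 0.4), G2 = c(0.2, 0.5),
                  row.names = c("HXT1", "HXT2"))
toy_file <- tempfile(fileext = ".csv")            # throwaway file outside the working directory
write.csv(toy, file = toy_file)                   # row names are written as the first column
toy_plain <- read.csv(toy_file)                   # row names come back as an ordinary column
toy_named <- read.csv(toy_file, row.names = 1)    # first column becomes the row names
head(toy_plain)
head(toy_named)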
### Creating new objects in R is easy. A vector containing the 6 dilution rates
### in the experiment can be generated using the c() (concatenate) function
### and assigning its value, with "<-", to a new object that we will call x_d.
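### A minimal sketch of such an assignment. The six values below are an
### assumption (evenly spaced rates from 0.05 to 0.30 per hour); substitute
### the rates from your own experiment.
x_d <- c(0.05, 0.10, 0.15, 0.20, 0.25, 0.30)   # assumed values, for illustration
x_d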
### Now say we are interested in only the HXT genes. It's relatively easy to get
### at these genes using what's called a "regular expression."
### A more familiar term related to regular expressions is the wildcard, "*".
### In our case we are interested only in those genes whose names begin with
### HXT. This next function will be our first "nested" function. It may seem
### complicated at first, but learning these commands will eventually save you
### time and make your coding far less error-prone.
### Now we have created our first object, a data frame containing only those
### genes whose names begin with the three characters HXT.
### We managed this by combining functions:
### "<-", the assignment operator, assigns the value of one object to the
### object it points to. An alternative operator, "=", can be substituted
### here and works the same way.
### "[" is called the subset operator. Here we are ultimately searching our
### data frame for the row numbers (or indices) that match our requirements.
### The format typically used is data[rows,columns]. If we already knew the row
### numbers that match our specification
### (it turns out they are 1373 1704 1836 1872 2460 2765 2998 3244 3299 5434),
### we could generate the same data frame with the following call.
### Note: we concatenate the locations of the rows using the c() function.
brauer_data[c(1373, 1704, 1836, 1872, 2460, 2765, 2998, 3244, 3299, 5434), ]
### Take note that we left the column specification (after the comma) blank,
### because in this case we are not specifying a subset of the columns in the
### data frame. In most cases we do *not* care to look up each row index
### manually, which is why we use the function grep().
### The character vector we are searching is the row names, in our case gene
### names, of our data frame. If you call
rownames(brauer_data);
### the console will display all 5572 gene names in their order in the data
### frame. By matching our regular expression to this
### character vector of gene names, grep() will return the numeric
### row locations, or indices, that fit our pattern (HXT*).
grep(glob2rx("HXT*"), rownames(brauer_data));
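### A minimal sketch of the same idea on a made-up data frame, so the pieces
### can be run without the Brauer file. The names toy, idx, by_pattern and
### by_index, and all values, are assumptions for illustration. On the real
### data the nested call would take the form
###   brauer_data[grep(glob2rx("HXT*"), rownames(brauer_data)), ]
toy <- data.frame(G1 = c(1, 2, 3, 4), G2 = c(5, 6, 7, 8),
                  row.names = c("ADH1", "HXT1", "PHO5", "HXT7"))
glob2rx("HXT*")                                # the wildcard rewritten as a regular expression
idx <- grep(glob2rx("HXT*"), rownames(toy))    # indices of the matching row names
by_pattern <- toy[idx, ]                       # rows selected by pattern
by_index <- toy[c(2, 4), ]                     # the same rows selected by hand
identical(by_pattern, by_index)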
### We can get even more fancy, using a combination of a for loop, paste,
### regular expressions, and the explicit assign() function.
)
}
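### The lines below sketch such a loop on made-up data. They assume that each
### column name starts with the one-letter code of its nutrient limitation
### (for example "G_0.05"); that naming scheme, the object names and the
### values are assumptions, so adjust the pattern to your own column names.
toy_HXT <- data.frame(G_0.05 = c(1, 2), G_0.10 = c(3, 4),
                      N_0.05 = c(5, 6), N_0.10 = c(7, 8),
                      row.names = c("HXT1", "HXT7"))
for (lim in c("G", "N")) {
  assign(
    paste("toy_HXT", lim, sep = "_"),   # builds the names toy_HXT_G and toy_HXT_N
    toy_HXT[, grep(glob2rx(paste(lim, "*", sep = "")), colnames(toy_HXT))]
  )
}
toy_HXT_G
toy_HXT_N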
### Using this loop, we automatically split the Brauer dataset into six
### objects (one per nutrient limitation) containing only
### those genes beginning with the characters HXT.
### The matrices that represent them, which we created using assign(), are
### HXT_G, HXT_N, HXT_P, HXT_S, HXT_L, and HXT_U.
### A far less formal (and working) way to look at this code is
### Now we might be interested in making use of this organization and
### outputting some data in a way we could not do as easily with a
### spreadsheet program such as Excel.
### First let us explore the lm() function
lm_example <- lm(as.numeric(HXT_G[1, ]) ~ x_d)
lm_example
### Here we assigned the output of fitting a linear model of HXT5 expression in
### glucose limitation against the dilution rates.
### Just calling this object will output only the model call and the fitted
### coefficients (intercept and slope).
### More information can be extracted using summary()
summary(lm_example)
### Usually we want these values so we can pass them to another function
unclass(summary(lm_example))
### R-squared
unclass(summary(lm_example))$r.squared
### Intercept
unclass(summary(lm_example))$coefficients[1]
### Slope
unclass(summary(lm_example))$coefficients[2]
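### As a hedged aside, the same three values can also be pulled out by name
### rather than by position, which is easier to read back later. The sketch
### below uses made-up numbers; toy_x, toy_y and toy_fit are hypothetical
### names, not objects from this script.
toy_x <- c(0.05, 0.10, 0.15, 0.20, 0.25, 0.30)
toy_y <- c(1.0, 1.4, 2.1, 2.4, 3.1, 3.4)
toy_fit <- summary(lm(toy_y ~ toy_x))
toy_fit$r.squared                                  # R-squared
toy_fit$coefficients["(Intercept)", "Estimate"]    # intercept, same value as $coefficients[1]
toy_fit$coefficients["toy_x", "Estimate"]          # slope, same value as $coefficients[2]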
### Let us try taking these values we just extracted, and make a useful plot
### For the sake of showing more functional programming, let's first reorganize
### the values we just learned to access.
### The loop below works, but it doesn't take advantage of functional
### methods. With functional programming it is easier to reuse the code for
### pulling out different parameters, assaying different genes, etc.
### G_r2 starts with the R-squared of the first gene so that the loop can
### append the values for the remaining genes to it.
G_r2 <- summary(lm(as.numeric(HXT_G[1, ]) ~ x_d))$r.squared
for (i in 2:length(HXT_G[, 1])) {
  G_r2 <- append(G_r2, summary(lm(as.numeric(HXT_G[i, ]) ~ x_d))$r.squared)
}
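### A minimal sketch of the functional style mentioned above, on made-up data.
### fit_stat() is a hypothetical helper, not part of the original script: it
### fits one gene's expression against the dilution rates and returns the
### requested statistic, and sapply() applies it to every row. All names and
### values here are assumptions for illustration.
toy_x <- c(0.05, 0.10, 0.15, 0.20, 0.25, 0.30)     # assumed dilution rates
toy_G <- data.frame(d1 = c(1.0, 3.2), d2 = c(1.4, 2.9), d3 = c(2.1, 2.5),
                    d4 = c(2.4, 2.2), d5 = c(3.1, 1.4), d6 = c(3.4, 1.1),
                    row.names = c("HXT1", "HXT7"))
fit_stat <- function(i, data, rates, stat) {
  fit <- summary(lm(as.numeric(data[i, ]) ~ rates))
  switch(stat,
         r2        = fit$r.squared,
         intercept = fit$coefficients[1],
         slope     = fit$coefficients[2],
         stop("unknown parameter"))
}
rows <- seq_len(nrow(toy_G))
toy_G_r2    <- sapply(rows, fit_stat, data = toy_G, rates = toy_x, stat = "r2")
toy_G_slope <- sapply(rows, fit_stat, data = toy_G, rates = toy_x, stat = "slope")
names(toy_G_r2) <- names(toy_G_slope) <- rownames(toy_G)
toy_G_r2
toy_G_slope
### Asking for a different statistic or a different data frame only changes
### the arguments, not the body of the code.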
### You can write this same code more generally, and get out all the parameters
### you might be interested in. This code is not necessary for the result: a
### series of vectors representing the different parameters for the different
### limitations could as easily be generated using the code above, modifying
### the parameter and the specified nutrient limitation each time.
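### The loop below expects three objects named gene, limitation and
### parameters. The values here are inferred from how the loop uses them and
### from the object names HXT_G ... HXT_U above, so treat them as assumptions.
gene <- "HXT"                                    # prefix of the per-limitation objects
limitation <- c("G", "N", "P", "S", "L", "U")    # one-letter nutrient limitation codes
parameters <- c("r2", "slope", "intercept")      # statistics handled by the loop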
for (p in parameters) {
  for (lim in limitation) {
    n_length <- c(1:dim(get(paste(gene, "_", lim, sep = "")))[1])
    assign(paste(gene, lim, p, sep = "_"), n_length)
    for (n in 1:length(n_length)) {
      ifelse(
        p == "r2",
        assign(
          paste(gene, lim, p, sep = "_"),
          replace(
            get(paste(gene, lim, p, sep = "_")),
            n,
            summary(
              lm(
                as.numeric(
                  subset(t(get(paste(gene, lim, sep = "_"))), select = c(n))
                ) ~ x_d
              )
            )$r.squared
          )
        ),
        ifelse(
          p == "slope",
          assign(
            paste(gene, lim, p, sep = "_"),
            replace(
              get(paste(gene, lim, p, sep = "_")),
              n,
              summary(
                lm(
                  as.numeric(
                    subset(t(get(paste(gene, lim, sep = "_"))), select = c(n))
                  ) ~ x_d
                )
              )$coefficients[[2]]
            )
          ),
          ifelse(
            p == "intercept",
            assign(
              paste(gene, lim, p, sep = "_"),
              replace(
                get(paste(gene, lim, p, sep = "_")),
                n,
                summary(
                  lm(
                    as.numeric(
                      subset(t(get(paste(gene, lim, sep = "_"))), select = c(n))
                    ) ~ x_d
                  )
                )$coefficients[1]
              )
            ),
            print("unknown parameter")
          )
        )
      )
    }
  }
}
### Let us assume we're interested in the relation of slopes and intercepts
### across the 5 conditions for the retrieved HXT gene expression.
### If we want to generate a scatter plot composed of these values, we
### can generate one using the following code.
### Note, you can omit this "pdf" line if you want to see the plot
### drawn on screen instead
pdf(file="test.pdf")
### By concatenating all the values, we plot them all at the same time
plot(
  c(HXT_G_intercept, HXT_N_intercept, HXT_P_intercept, HXT_L_intercept, HXT_U_intercept),
  c(HXT_G_slope, HXT_N_slope, HXT_P_slope, HXT_L_slope, HXT_U_slope),
  ylab = "slope",
  xlab = "intercept"
)
### We can add some (much needed) color using the following code
### pch=19 is a filled circle
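legend(
  6, 15,
  c("G", "N", "P", "L", "U"),
  col = c("green", "orange", "purple", "red", "gray"),
  pch = c(19)
)
### Close the pdf device so that test.pdf is written to disk
### (skip this line if you omitted the "pdf" line above)
dev.off()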
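### A minimal sketch of one way to color points by nutrient limitation, on
### made-up values. It passes one color per point through the col argument of
### plot(); toy_intercept, toy_slope, toy_group, toy_cols and toy_pdf are
### hypothetical names, and the numbers are assumptions for illustration.
toy_intercept <- c(1.0, 1.5, 2.0, 2.5, 3.0, 3.5)
toy_slope     <- c(4.0, 2.0, 5.0, 1.0, 3.0, 6.0)
toy_group     <- c("G", "G", "N", "N", "P", "P")      # limitation of each point
toy_cols      <- c(G = "green", N = "orange", P = "purple")
toy_pdf <- tempfile(fileext = ".pdf")                 # throwaway output file
pdf(file = toy_pdf)
plot(toy_intercept, toy_slope, xlab = "intercept", ylab = "slope",
     col = toy_cols[toy_group], pch = 19)             # pch = 19 is a filled circle
legend("topleft", legend = names(toy_cols), col = toy_cols, pch = 19)
dev.off()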