
R for absolute beginners

Duncan Golicher

November 17, 2008

Contents

Introduction
  What can R do for me?
  What exactly is R?
  How long will it take me to learn R?
  How much does R cost?
  How do I install R?
  Exercises and activities

Getting started in R. Finding help and extending R with packages
  What do I do now?
  How do I work with R?
  How do I get help?
  How do I use the R-help list?
  How do I extend R by installing packages?
  Exercises and activities

Vectors: Working with one variable at a time
  How can I use R as a calculator?
  How do I assign values to a variable?
  Where do the numbers go?
  How do I transform a vector?
  How do I generate a sequence of numbers in R?

Simple statistics in R: A t-test demonstration
  How do I run statistical analyses in R?
  How do I simulate some data from a known population?
  How do I test for normality?
  How do I run a t-test in R?

R for absolute beginners A gentle introduction Duncan Golicher

  How can I break down the calculations of a mean step by step?
  How do I calculate a sample standard deviation?
  How do I calculate a standard error?
  What does the t-distribution look like and how do I use it to test a null hypothesis?

Introduction
Don't do Sudokus, use R

This document is very loosely based on the material that I have taught to masters students in Spanish
since 2004. I have translated it and brought it up to date in the hope that it might be useful as an introduction
to R for a general audience of postgraduate students and researchers. I am an ecologist, not a statistician, and
have no formal training in mathematics. I feel this is probably an advantage when teaching introductory material.
It allows me to appreciate the difficulties many have with R and to draw on my own experiences in order to help.
My first contact with R was far from encouraging. A statistical colleague mentioned R in an email,
claiming it to be the most useful piece of software around. I downloaded and installed R version 1.7. The icon on
my desktop was removed by Windows after about six months due to lack of use. I simply couldn't see what to do
with it. Eventually I did find some time to experiment. I followed the precise but rather terse introductory material
available at the time. This was hard, but fortunately previous experience with computer languages did help
me to get started. I began to use R and quickly found that my colleague was right all along. Using R instead
of a typical combination of Excel and SPSS does increase the productivity of non-statisticians. I was amazed to
find that a single line of R could do things that were very difficult to achieve in other ways. I quickly became a
convert and wanted to spread the word.
My subsequent experiences have been both encouraging and frustrating. I have aimed at helping students and
researchers to follow my path and adopt R as their preferred platform for data analysis. Overall I have been
pleasantly surprised by the high number of successes. I am often contacted for advice by those I taught several
years ago as they begin to use R for serious research. At the same time I have been frustrated by the difficulty in
persuading more to adopt and use R on a routine basis. I have noticed two quite distinct classes of recalcitrants.
Students who have never previously seen a command line are quite understandably put off at the start. The R
style of working does not offer immediate attractions. R has a very steep learning curve. With some gentle
encouragement this barrier can often be overcome, but it is always a struggle. The second class of potential R users
is much more difficult to persuade. They are experienced researchers who have invested considerable effort in
learning how to use alternative statistics packages such as SPSS or SAS. For a researcher, time is the resource in
shortest supply. Learning a new, superficially more complicated, way to analyse data appears to be a luxury they
simply can't afford. This is also understandable. However, it is regrettable. This is the class of users that could
potentially benefit most from a working knowledge of R. While students can be forced to use R through giving
them evaluated course work, there is little that can be used to encourage researchers apart from demonstrating
results that they would like to produce themselves.
This course aims to help absolute beginners of both types to move up the initially steep gradient of the learning
curve and begin to enjoy using R. Once the hard barriers are overcome, potential users should find that developing
skills in R is a satisfying experience in its own right. At the same time they will find that knowledge of R can
result in much greater scientific productivity. I have aimed for an informal style as much as possible in order to
engage with the reader and help to soften the impact. I have also adopted a question and answer format for most
of the document. At each stage I will try to identify the key FAQs and provide my own answers to them.

What can R do for me?

An example of potential productivity gains is provided by this document itself. I have written the material using
Sweave in LaTeX (LyX). This is a great combination of tools for scientific writing. It allows me to embed R
code directly within the text. This code is run when the document is compiled. The output is thus incorporated
into quite a professional looking PDF document with no effort. Once the document is set up I don't need to think
about the formatting or typesetting. I can simply work on the content. This detail alone shows the immense
power of R and its associated tools. The potential for productivity gains once one has learnt how to use them
is endless.
I will be explaining what R code is later on. However, here is an example of a simple line of code that produces
output.

library(fortunes)
fortune("pizza")

Roger D. Peng: I don't think anyone actually believes that R is designed to
make *everyone* happy. For me, R does about 99% of the things I need to do, but
sadly, when I need to order a pizza, I still have to pick up the telephone.
Douglas Bates: There are several chains of pizzerias in the U.S. that provide
for Internet-based ordering (e.g. www.papajohnsonline.com) so, with the
Internet modules in R, it's only a matter of time before you will have a
pizza-ordering function available.
Brian D. Ripley: Indeed, the GraphApp toolkit (used for the RGui interface
under R for Windows, but Guido forgot to include it) provides one (for use in
Sydney, Australia, we presume as that is where the GraphApp author hails from).
Alternatively, a Padovian has no need of ordering pizzas with both home and
neighbourhood restaurants ....
-- Roger D. Peng, Douglas Bates and Brian D. Ripley
R-help (June 2004)

Perhaps pizza lovers outside Australia will be disappointed. However, this exchange on the R-help list makes a
very serious point. I frequently use R to bring in data from the Internet and analyse it. R is often used in finance
for producing automated reports on the state of the stock market. Geneticists can use R to consult huge data
banks. R is very well integrated within the contemporary cloud style of scientific computing.
As an example of the unusual ways that R can be used, the function fetchSudokuUK from the sudoku package
fetches the daily sudoku puzzle from
http://www.sudoku.org.uk/

library(sudoku)
puz <- fetchSudokuUK()
puz

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,]    0    0    0    0    1    3    0    5    0
[2,]    1    5    0    9    4    2    3    0    6
[3,]    0    0    0    0    0    0    0    0    0
[4,]    0    2    0    0    0    6    0    3    0
[5,]    0    0    0    0    9    0    0    0    0
[6,]    0    7    0    2    0    0    0    8    0
[7,]    0    0    0    0    0    0    0    0    0
[8,]    8    0    3    4    7    1    0    2    9
[9,]    0    9    0    8    3    0    0    0    0

As Greg Snow, the author of the package, notes: "Don't submit your solution for the prize contest if you used
solveSudoku or playSudoku with solve=TRUE. That would be cheating."

solveSudoku(puz)


     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,]    6    4    9    7    1    3    2    5    8
[2,]    1    5    8    9    4    2    3    7    6
[3,]    2    3    7    5    6    8    1    9    4
[4,]    9    2    4    1    8    6    7    3    5
[5,]    5    8    1    3    9    7    4    6    2
[6,]    3    7    6    2    5    4    9    8    1
[7,]    7    1    5    6    2    9    8    4    3
[8,]    8    6    3    4    7    1    5    2    9
[9,]    4    9    2    8    3    5    6    1    7

Solving Sudokus may seem trivial, but it certainly shows the power of the R language to save at least someone's
time. It is now very difficult to identify forms of serious scientific computing that haven't been implemented in R.
However, there is an important caveat that is very neatly summed up in this exchange.

fortune("Yoda")

Evelyn Hall: I would like to know how (if) I can extract some of the
information from the summary of my nlme.
Simon Blomberg: This is R. There is no if. Only how.
-- Evelyn Hall and Simon 'Yoda' Blomberg
R-help (April 2005)

In other words, although you can do almost anything in R, it is often far from obvious how. The main aim of
this course is to help you get to the stage where you can begin to find ways to solve your own problems using R.
Some very advanced analysis can only be achieved using R. As Frank Harrell notes, R is at the cutting edge.

fortune(10)

Overall, SAS is about 11 years behind R and S-Plus in statistical capabilities
(last year it was about 10 years behind) in my estimation.
-- Frank Harrell (SAS User, 1969-1991)
R-help (September 2003)

fortune(120)

Rene M. Raupp: Does anybody know any work comparing R with other (charged)
statistical software (like Minitab, SPSS, SAS)? [...] I have to show it's as
good as the others.
Kjetil Brinchmann Halvorsen: Sorry. That will be difficult. Couldn't it do to
prove it is better?
-- Rene M. Raupp and Kjetil Brinchmann Halvorsen
R-help (May 2005)

What exactly is R?

This is not an easy question to answer. In one sense the flexibility and power of R mean that it becomes something
different to every user. The conventional answer is that R is a system for statistical computation and graphics.
It consists of a language plus a run-time environment with graphics, a debugger, access to certain system
functions, and the ability to run programs stored in script files. In other words, R is both a computer language
and a set of procedures that have already been implemented in order to carry out specific tasks.


There was a time when most computer users had a working knowledge of computer programming. This is no
longer the case. In fact, to a very close approximation, none of the students I have taught R had ever written a
line at a command prompt before. This is not surprising given the ubiquity of the point-and-click, menu-driven
interfaces that have made computers universally accessible. What is more surprising to the students who have
learned R is the discovery that many statistical tasks are actually easier to perform using a command than by using
a menu. Doing statistics and working with tables of numerical data is not the same sort of task as word or
image processing. There are many instances where menu-driven approaches simply get in the way and cause confusion.
This will become clearer as we go on. At this point the simple message is that R is a computer language,
but you certainly do not need to be a computer programmer to use it. However, you might need to develop some
programming skills to finally get the most from it.

fortune(52)

Can one be a good data analyst without being a half-good programmer? The short
answer to that is, 'No.' The long answer to that is, 'No.'
-- Frank Harrell
1999 S-PLUS User Conference, New Orleans (October 1999)

How long will it take me to learn R?

The rest of your life. There is far too much in R to learn in any less time than that. However, getting up to speed
in R probably takes a couple of months, assuming that you are prepared to put in fifteen to twenty minutes a day
of practice with simple exercises. If you regard learning R as an enjoyable mental challenge that will also make you
more productive, this is not unreasonable. Having an experienced R user to hand will help to prevent time being
wasted as a result of misunderstandings.

fortune("learning curve")

The learning curve is steep - but then like many people, I'd like to be able to
do sophisticated modelling with deep understanding and no effort :-)
-- Sean O'Riordain (in a thread about the helpfulness of documentation)
R-help (July 2005)

How much does R cost?

Nothing. R is Open Source software. You can redistribute it and/or modify it under the terms of the GNU General
Public License as published by the Free Software Foundation. A copy of the GNU General Public License is
available via WWW at
http://www.gnu.org/copyleft/gpl.html
The fact that R is Open Source does not mean that you are likely to actually make any changes to the source
code yourself. However, the fact that others can read and modify the code gives R an important edge for scientific
users. It means that no-one has to reinvent wheels. Once a statistical procedure is written for R and found
to work correctly it can be reused and modified as necessary. Also, despite the rather worrying cautionary notice
that "R is free software and comes with ABSOLUTELY NO WARRANTY", all the most important code in R has
been closely scrutinised by the finest academic statisticians in the business. R is reliable and accepted as such by
scientific publishers. An R user is in the privileged position of standing on the shoulders of giants.

fortune("dodgy software")


Mingzhai Sun: When you use it [R], since it is written by so many authors, how
do you know that the results are trustable?
Bill Venables: The R engine [...] is pretty well uniformly excellent code but
you have to take my word for that. Actually, you don't. The whole engine is
open source so, if you wish, you can check every line of it. If people were out
to push dodgy software, this is not the way they'd go about it.
-- Mingzhai Sun and Bill Venables
R-help (January 2004)

How do I install R?

R can be run under Windows, Linux, Mac or Solaris and probably most other platforms. A quote from Barry
Rowlingson that still has not made it into the fortunes package: "I'd like to see a Nintendo Wii port, just so I can
play Super Mario Generalised Linear Modelling by waving the controller around."
At the time of writing R for Windows is available by clicking on the link below.
http://cran.r-project.org/bin/windows/base/R-2.8.0-win32.exe
Versions change over time. You should ensure that you install the latest version (for example R-2.9.0-win32.exe
when it becomes available). Linux users now find versions of R in the standard repositories of most popular
distributions. In my own case I prefer to run R on Ubuntu. This Linux distribution is laptop friendly and handles
most dependency issues very cleanly. Packages can be installed using Synaptic or aptitude. As a minimum r-base,
r-base-dev, r-base-core, r-base-html and r-base-latex are needed. A search for r-cran in the Synaptic Package
Manager will show a large number of extension packages. It is probably worth installing all of them right from
the start in order to save time when they are found to be needed. Linux users can also install RKWard, which
provides a sophisticated graphical interface for R. RKWard also includes a large number of scripts for routine
statistical analysis. Together they form a user-friendly free alternative to SPSS.
Windows users can obtain a graphical interface from the sciviews project.
http://www.sciviews.org/
Even if the full GUI is not used I recommend the program Tinn-R for editing R scripts with syntax highlighting
if you work under Windows.
http://www.sciviews.org/Tinn-R/index.html
That said, I will not be using any graphical interface to R in this course. All the material will be generic,
cross-platform and robust to any future changes in versions of R. In the next section I will explain how to work
through the material, along with the general procedure for extending R using packages, which applies to users
on any platform.

Exercises and activities


1. Install the latest version of R on your computer.
2. Investigate the contents of CRAN (http://cran.r-project.org/)
3. Download all the relevant manuals and courses from the Contributed section of CRAN.
4. Browse the R graphics gallery (http://addictedtor.free.fr/graphiques/)

Getting started in R. Finding help and extending R with packages


Just to get started I will assume that you are running R in Windows. Mac users have a similar experience. In the
case of Linux, R is run by typing R in the terminal. By default in Linux no GUI is produced. I will assume that
Linux users are already hardened to this style of working. On starting R in Windows you are presented with the
superficially unhelpful looking interface shown below.

What do I do now?

This is always where the panic starts. The Windows GUI version at least gives you a few things to click on (the
console version in Linux just has a cursor), but they don't seem to do very much. There are no immediate indications
that R can do any statistics. The experience for many is quite off-putting.
The first thing to point out is that this interface is in fact more helpful than it looks at first sight. But let's
return to that later. First, just to break the ice, let's make R do something. Anything.
At this point I will introduce the convention that will be followed throughout this course. All lines that appear in
the format below can (and should) be either typed directly into the console or copied and pasted from this
document.

demo(graphics)

They will then run. So if you type demo(graphics) things will start happening. In this case R runs through a
number of scripts that give some examples of the sort of graphical output it is capable of. This particular demo is
now quite old and doesn't really do full justice to R's graphical potential. As the script runs you will be prompted
to press return to get more output. I have included the first graph of several such demos below.

demo(graphics)


[Figure: first plot from demo(graphics), titled "Simple Use of Color In a Plot", with x-axis label "Just a Whisper of a Label"]

demo(image)

[Figure: image plot from demo(image), titled "Maunga Whau Volcano", with axes x and y and subtitle col=terrain.colors(100)]

demo(persp)


[Figure: perspective plot from demo(persp), titled z = Sinc(sqrt(x^2 + y^2)), with axes x and y]

demo(plotmath)

[Figure: first page of demo(plotmath), a table pairing plotmath expressions (for example x + y, x %*% y, x[i], x^2,
sqrt(x), x >= y, italic(x), list(x, y, z)) with their typeset output, grouped under Arithmetic Operators, Radicals,
Relations, Sub/Superscripts, Juxtaposition, Typeface and Lists]

How do I work with R?

Fear of the command line seems to be the biggest barrier to using R. At the same time, adopting a script-based
approach to data analysis is the greatest advantage of R. So it is worth taking some time at this stage to explain
carefully how to work with the R console. Why is the R interface so minimalist? When you realise what
R can do and think carefully, the reason becomes obvious. A menu-based GUI for a statistics program is simply
a way to trace a path to a function and prompt the user for inputs to that function. So, in SPSS or Excel, you
might drill through a couple of layers of menu in order to find the function to produce a boxplot of a vector of
data called, say, treediams. In R you write boxplot(treediams). Once you have learnt basic commands this is an
efficient way of working. Statistical analysis is a vast subject. How many different function calls do you need in
order to be able to apply any of the multitude of statistical methods that are available in R? The answer is unknown,
but it must easily be in the tens of thousands. A menu-based GUI that provided access to all of them
would be huge. It would no doubt look much more threatening to new users than the command line. Even if
it were to be built, few would use it. In many cases it would still be quicker to type commands once they are
known.
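As a minimal sketch of the one-line style described above, the following creates a small vector and draws its boxplot. The values in treediams are invented for illustration and are not data from this course.

```r
# Hypothetical tree diameters in cm (invented values, for illustration only)
treediams <- c(12.5, 18.2, 22.1, 9.8, 30.4, 15.6, 27.3, 19.9)

# One function call draws the plot; the labels are optional arguments
boxplot(treediams, ylab = "Diameter (cm)", main = "Tree diameters")

# The five-number summary (minimum, lower hinge, median, upper hinge, maximum)
# on which the boxplot is based
fivenum(treediams)
```

The menu route and the command route end at the same function; the command simply names it directly.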
Speed isn't the only advantage of using written commands to run analyses. The biggest advantage is that every
step of the analysis is documented. You can collect the steps you took to produce a figure or table together into a
script and reproduce the results exactly. So the typical method of working is to open R and also open a text editor
like Notepad. I recommend Tinn-R for Windows users as it has built-in syntax highlighting. I usually experiment
with a command in the console first. When I find it does what I intended I copy and paste it to the script I
am building. At the end of a session using R I have a complete record of what I have been doing.
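One way to rerun a saved script in a single step is the base R function source(). The sketch below is only an illustration: it writes two commands to a temporary file, standing in for a script saved from your text editor, and the values and the names treediams and mean_diam are invented.

```r
# Create a temporary script file (a stand-in for a file saved from a text editor)
script <- tempfile(fileext = ".R")
writeLines(c("treediams <- c(12.5, 18.2, 22.1, 9.8)",
             "mean_diam <- mean(treediams)"),
           script)

# source() runs every line of the script in order, reproducing the analysis
source(script)

# Objects created by the script are now available in the session
mean_diam
```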

How do I get help?

So, how do you know what commands are available? There are two complementary ways. The first is to follow
a book or course like this one that introduces you to commands in a logical sequence. The other is to use the
comprehensive R help system. The R help system will not teach you any statistics nor will it explain why you
might want to run a function. However it will show you how to run almost all the functions in R and also provide
an example of their use. If it takes a little effort to find out how to run a function users might be encouraged to
spend more time finding out why and whether they need it.

fortune(51)

The documentation level of R is already much higher than average for open
source software and even than some commercial packages (esp. SPSS is notorious
for its attitude of "You want to do one of these things. If you don't
understand what the output means, click help and we'll pop up five lines of
mumbo-jumbo that you're not going to understand either.")
-- Peter Dalgaard
R-help (April 2002)

You can often simply substitute the data used in the example for your own to get results. You can open a web
browser interface to the help system from the console.

This will take you to the page shown below. You can also do the same by writing a command to call a function,
as that is all the GUI really does.

help.start()


The most used links on this page are "Packages" and "Search Engine and Keywords". To use the search engine you
need Java installed. Try searching for "histogram".

This will then show a number of links to functions associated with the production of histograms. Try looking
at the function hist. You will find a standard page layout which is the same for all functions, including sections
labelled Description, Usage, Arguments, Details, Value, References, See Also and Examples. Probably the most
important section at this stage is Examples. This provides you with a template for using the function. All the
examples can be run in the console either by copying and pasting the code on the help page or by typing
example(thefunctionyouwant). If you have a good idea of the name of the function, the help page will be shown by
simply typing ?hist or help(hist).

help("hist")
example("hist")

[Figure: output of example("hist"), four panels: two titled "Histogram of islands" (x-axis islands, y-axis Frequency)
and two titled "Histogram of sqrt(islands)" (x-axis sqrt(islands), y-axes Frequency and Density)]
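The earlier advice about substituting your own data into a help page example can be sketched as follows. The islands data set ships with R. The vector mydata and its values are invented for illustration, and the call is written in the style of the hist examples rather than copied from them.

```r
# A call in the style of the hist() help page, using the built-in islands data
hist(sqrt(islands), breaks = 12, col = "lightblue")

# The same call with a vector of your own (hypothetical values)
mydata <- c(2.1, 3.4, 3.9, 4.2, 4.8, 5.0, 5.5, 6.1, 7.3, 9.0)
h <- hist(mydata, breaks = 5, col = "lightblue")

# hist() also returns what it plotted, as described in the Value section of ?hist
h$counts

# args() prints a function's argument list as a quick reminder
args(hist.default)
```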


How do I use the R-help list?

If you have real problems with R you can get direct help from the best in the business. These are the programmers,
developers and long-time R users on the R-help list. To subscribe or unsubscribe visit
https://stat.ethz.ch/mailman/listinfo/r-help or, via email, send a message with subject or body "help" to
r-help-request@r-project.org. However, you should think carefully before using this fantastic resource directly.
All the previous answers to questions are held online and are easily found by the usual search engines. Trivial
questions that have already been answered are rarely tolerated. R developers are extremely busy people who have
neither time nor inclination to help with homework. The posting guide recommends the following steps before posting.

1. Do help.search("keyword") and apropos("keyword") with different keywords (type this at the R prompt).
2. Do RSiteSearch("keyword") with different keywords (at the R prompt) to search R functions, contributed
packages and R-Help postings. See ?RSiteSearch for further options and to restrict searches.
3. Read the online help for relevant functions (type ?functionname, e.g., ?prod, at the R prompt).
4. If something seems to have changed in R, look in the latest NEWS file on CRAN for information about it.
5. Search the R-faq and the R-windows-faq if it might be relevant (http://cran.r-project.org/faqs.html).
6. Read at least the relevant section in An Introduction to R.
7. If the function is from a package accompanying a book, e.g., the MASS package, consult the book before
posting.
8. It helps to provide a small example that someone can actually run.
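The first step in the list can be tried straight away. This is a small sketch using "hist" and "histogram" as example keywords. The results depend on which packages are installed, so no output is shown.

```r
# apropos() lists objects on the search path whose names contain the keyword
matches <- apropos("hist")
matches

# help.search() looks through the installed help pages for the keyword;
# typing ??histogram at the prompt is shorthand for the same search
results <- help.search("histogram")

# The result records which help pages matched (one row per match)
head(results$matches)
```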

fortune("demigod")

You may have not been long enough on this list to see that some of the old-time
gurus have reached a demigod like status. Demigods have all rights to be 'rude'
(that's almost a definition of a demi-deity).
-- Jari Oksanen (in a discussion on whether answers on R-help should be more
polite)
R-help (December 2004)

How do I extend R by installing packages?

A fundamental concept of R is the idea of packages. The initial installation of R provides a base of functions, most
of which have been developed and maintained by a small core team of programmers. However, R is capable of
carrying out a huge number of additional analytical techniques. These are often written in R itself. Fortran or
C code can also be linked into R and run as R commands. It is this extensibility that has led to R becoming
the lingua franca of statistical computing. The biggest challenge is keeping up with the vast number of packages
and being aware of what is available. It is safe to assume that someone has implemented almost any
standard technique you might need in R.
The list of packages with a short description of what they do can be found on CRAN. CRAN stands for Comprehensive
R Archive Network and is mirrored throughout the world. Because the list of packages is now so vast, a
Task Views section has been set up that helps users to find packages associated with specific types of work. For
example, the Spatial task view would be a first stop if you are interested in using R for processing geographical
information. Environmetrics shows some of the most useful packages for ecologists and resource managers.
The important element to remember with regard to packages is the difference between installing a package and
making it available for use during an R session. When a package is installed it is downloaded to your hard disk
and can be used. This needs to be done only once, with the exception of updating packages as new versions become
available (see footnote 1).
Packages can be installed under Windows from the graphical interface by choosing "Install package(s)" under the
Packages menu. Again, the job can also be done through a command. This is my preferred way of installing packages.
The following line will install the package vegan, a key tool for multivariate analysis in ecology, and vcd for
visualising categorical data. The addition of dep=T tells R to install all other packages upon which these packages
depend.

install.packages(c("vegan", "vcd"), dep = T)

The notion of dependencies is well known to those who use open source software.
There are some key points about R to mention at this point. First R is case sensitive. The line below will not
work.

Install.packages(c("vegan", "vcd"), dep = T)

The next point is that you only need to install a package to the hard disk once. However, you must load it into
memory in each R session in which you need to use a function from the package. This is achieved using the command
library. For example

library(vegan)

The line above makes the vegan package available for use. The distinction will become clearer over time.
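The difference between installing once and loading in every session can be sketched with a small helper. The function use_package is hypothetical and not part of R; it only wraps the base functions requireNamespace, install.packages and library. Note that dep = T in the commands above is an abbreviation of dependencies = TRUE.

```r
# Hypothetical helper (not part of R): install a package only if it is missing,
# then load it for the current session
use_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Only reached when the package is not yet on the hard disk
    install.packages(pkg, dependencies = TRUE)
  }
  # character.only = TRUE lets library() take the name from a variable
  library(pkg, character.only = TRUE)
}

# "stats" ships with every R installation, so nothing is downloaded here
use_package("stats")

# search() lists the packages attached in the current session
"package:stats" %in% search()
```

Calling use_package("vegan") in the same way would download vegan on first use and simply load it afterwards.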

Exercises and activities


1. Make a list of the packages on CRAN that are potentially useful for multivariate analysis.
2. Run examples of canonical correspondence analysis and non-metric multidimensional scaling using the package
vegan. (Note, you do not necessarily have to understand the analysis at this stage; the exercise is aimed
at practice in using the help system and examples.)
3. Install the packages nortest and moments. What do they do and how might they be useful?
4. Run an example of a test for normality.

Vectors: Working with one variable at a time


The first goal when you begin working with R is to become sufficiently comfortable with the underlying concepts
of the R language to be able to manipulate data easily. This ability does not come overnight. You will need to
practice with a lot of examples. At first it may seem difficult to achieve simple results. The pay back is that with
experience it will become simple to achieve difficult results.
Footnote 1: Linux users will already be familiar with this concept. Debian users also have the advantage that
packages installed with aptitude will be automatically updated.


How can I use R as a calculator?

R can be used as a scientific calculator. Any operation written in the console will be evaluated and the result
returned to the console.

1 + 1

[1] 2

More complex operations follow the usual order of operator precedence. Be careful to use brackets correctly. An extra
pair of brackets doesn't do any harm, but leaving one out may give results you don't expect.

1 + 1 * 3

[1] 4

(1 + 1) * 3

[1] 6

3 * 100/10 + 5

[1] 35

3 * 100/(10 + 5)

[1] 20

(3 * 100)/(10 + 5)

[1] 20

10 * (3 - 1)

[1] 20

10 * (3 - 1)^2

[1] 40

10 * 3 - 1^2

[1] 29
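
Two further cases of operator order often surprise newcomers. They are additional illustrations rather than part of the examples above: exponentiation is evaluated before a leading minus sign, and %/% and %% give the integer quotient and the remainder of a division.

```r
# Exponentiation is evaluated before the unary minus
a <- -2^2      # same as -(2^2), so the result is -4
b <- (-2)^2    # brackets apply the minus first, so the result is 4

# Integer division and remainder
q <- 17 %/% 5  # 3
r <- 17 %% 5   # 2

c(a, b, q, r)
```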


How do I assign values to a variable?


You may have noticed that the file menu in the Windows R console does not provide an obvious way of getting
data into R. On the introductory R help page there is a link to a document called R Data Import/Export. This
is a comprehensive and useful document for experienced R users written by Brian Ripley, a well-known R guru.

fortune(47)

Seldom are prizes, credit, and gratitude given, else Brian would be drowning in
them.
-- Anthony Rossini (about the merits of implementing software)
R-help (May 2004)

However, I do not recommend it to beginners. This course will contain a whole section devoted to importing
and exporting data from other statistical packages such as SPSS and from spreadsheets and databases. For the
time being we will enter data by hand in the console. If you type

x <- scan()

you can enter numbers one by one, pressing enter after each. Pressing enter twice in a row ends the input.
A more reproducible way of assigning numbers to a vector is by concatenating them with c(). This can be included in a script.

x <- c(1, 3, 6, 7, 9, 10, 12, 23)


x

[1] 1 3 6 7 9 10 12 23

A vector is simply a sequence of numbers in a single dimension. R will refer to all the numbers by the name x and
operate on them. There are various points to mention here. First of all, the <- symbol. In an informal sense it
means take everything that is at the tail of the arrow and put it into the object at the head. So x<-c(1,2,3)
gives x the values of 1, 2 and 3. Usually the arrow points to the left, towards the name of the object. It is perfectly
valid to turn it around, as in c(1,2,3)->x, but this would usually be confusing and is not done. You can also use
the equals sign to do the same job.

x = c(1, 3, 4, 6, 7, 12, 23)


x

[1] 1 3 4 6 7 12 23

The use of = as an assignment operator is common in many computer languages, but I much prefer the arrow
syntax as it avoids confusion.
To copy the contents of x to y is simple. Note that x will still contain the same numbers.

y <- x
x

[1] 1 3 4 6 7 12 23
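
The lines below sketch two of the points made so far. The arrow can be written in either direction, and y is an independent copy, so changing y afterwards leaves x untouched. The name x2 is introduced only for this illustration.

```r
x <- c(1, 3, 4, 6, 7, 12, 23)    # the usual form
c(1, 3, 4, 6, 7, 12, 23) -> x2   # the arrow turned around: valid, but harder to read

y <- x          # y receives a copy of the values in x
y[1] <- 100     # change the first element of y only

x                  # still starts with 1
y                  # now starts with 100
identical(x, x2)   # TRUE: both forms of the arrow gave the same vector
```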

Secondly, be aware that you must use the concatenation function c() to form a vector. Neither of these lines will
work.

x<-1,2,3,5
x<-(1,3,4,6,7)


Where do the numbers go?

It wasn't until I had taught two courses on R and heard this question several times that I realised that those who
ask it are expecting a completely non-technical answer. They don't want to know details about the way R uses
memory. The problem that arises in some students' minds is related to the almost ubiquitous use of spreadsheets.

fortune(59)

Let's not kid ourselves: the most widely used piece of software for statistics
is Excel.
-- Brian D. Ripley ('Statistical Methods Need Software: A View of
Statistical Computing')
Opening lecture RSS 2002, Plymouth (September 2002)

Spreadsheet users are used to typing in numbers. The numbers remain staring at them until they move away.
The notion of more abstract data objects is natural to anyone who has rudimentary contact with programming
languages. However the idea is not intuitive for everyone.
This is an unexpected barrier to communication between those already used to the R way of doing things and the
beginner. It needs dealing with carefully.
My explanation is that as I work on an R session I produce a collection of objects held in the computer's
memory. The basic properties of these objects should also be held in my own memory. I need to have a good idea of
what I have put into R and why. However, I really don't want to be looking at the numbers themselves all the
time. This just causes clutter and confusion. As you think more about it this seems reasonable. What is the
difference in R between a vector called x containing 10 numbers and one containing 10,000? The answer is
essentially nothing. The first line below produces a vector with ten thousand numbers. The second multiplies them
by 2, and it is identical regardless of the size of the vector.

x <- 1:10000
x <- 2 * x

Almost anything you can do to one number you can do to 10,000 just as easily, apart from one thing: looking at
them all at once. So as you work with R you must get used to not wanting to look directly at the numbers
themselves. However, it is a good idea to look at properties of the numbers to make sure that everything is as it should
be. This can be done through figures and statistical summaries.
A very useful function is str(). This produces a description of the structure of the data object. In this case it is
a vector of numbers. The function summary produces a statistical summary of the numbers and head prints out
the first six members by default.

str(x)

num [1:10000] 2 4 6 8 10 12 14 16 18 20 ...

summary(x)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      2    5002   10000   10000   15000   20000

head(x)

[1] 2 4 6 8 10 12
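
head() takes a second argument giving the number of elements to show, and tail() does the same for the end of the vector. A short sketch using the same x, rebuilt here so that the lines run on their own:

```r
x <- 2 * (1:10000)

n_values <- length(x)       # how many numbers are stored
first_ten <- head(x, n = 10)  # first ten values instead of the default six
last_three <- tail(x, n = 3)  # last three values

n_values
first_ten
last_three
range(x)                    # smallest and largest value
```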


How do I transform a vector?

Any operation will be performed on the whole vector. Try these

x <- c(1, 2, 4, 6, 10)


2 * x
x + 2
x^2
log(x)
log2(x)
log(x * 100)
exp(x)
sqrt(x)

Note that if you don't assign the results of an operation they are simply printed out and lost. To transform x to
its logarithm to the base 10 you need to write the following lines.

x <- c(1, 6, 10, 100, 200)


x <- log10(x)
x

[1] 0.0000000 0.7781513 1.0000000 2.0000000 2.3010300
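
Because the transformed values have replaced the originals, it is worth knowing how to get back. As a sketch, raising 10 to the power of each element reverses log10(). Keeping the original under a separate name avoids having to do so. The names x_raw, x_log and x_back are chosen here for illustration.

```r
x_raw <- c(1, 6, 10, 100, 200)
x_log <- log10(x_raw)      # transformed copy; x_raw is left unchanged

x_back <- 10^x_log         # back-transform, element by element
all.equal(x_back, x_raw)   # TRUE, allowing for floating point rounding
```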

How do I generate a sequence of numbers in R?

One of the best features of R is the ease with which you can generate sequences of numbers and simulated data
sets.
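
As a brief sketch, regular sequences are usually built with the colon operator, seq() and rep(), all of which are part of base R. The particular numbers and the object names below are arbitrary illustrations.

```r
s_colon <- 1:10                        # integers from 1 to 10
s_by <- seq(0, 1, by = 0.25)           # fixed step size
s_len <- seq(0, 100, length.out = 5)   # fixed number of values
r_times <- rep(c(1, 2), times = 3)     # repeat the whole vector: 1 2 1 2 1 2
r_each <- rep(c(1, 2), each = 3)       # repeat each element:     1 1 1 2 2 2

s_colon
s_by
s_len
r_times
r_each
```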

Simple statistics in R: A t-test demonstration

How do I run statistical analyses in R?

The R way of working is quite different to SAS or SPSS.


R can be especially useful for teaching basic statistics because it is easy to break down all the elements used in a
calculation as a series of relatively easily understandable steps. Take the example of a t-test, designed to evaluate
the probability that a sample of numbers could have been drawn from a population with a mean of zero.

How do I simulate some data from a known population?

It is often a good idea to simulate data from a known distribution to understand more fully the logic behind a
statistical procedure. In this case we can ensure that the assumptions used and the data coincide. It is then
easier to see what problems might be associated with the analysis of a real data set using the same procedure.

set.seed(1)
x <- rnorm(10, mean = 1, sd = 2)
x

 [1] -0.2529076  1.3672866 -0.6712572  4.1905616  1.6590155 -0.6409368
 [7]  1.9748581  2.4766494  2.1515627  0.3892232
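
The call to set.seed() is what makes the ten numbers above reproducible. A small sketch: resetting the seed to the same value gives exactly the same draw, while drawing again without resetting gives new numbers. The names first, second and third are chosen here for illustration.

```r
set.seed(1)
first <- rnorm(10, mean = 1, sd = 2)

set.seed(1)
second <- rnorm(10, mean = 1, sd = 2)   # same seed, so the same numbers

third <- rnorm(10, mean = 1, sd = 2)    # seed not reset, so new numbers

identical(first, second)   # TRUE
identical(first, third)    # FALSE
```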


hist(x, col = "grey")

[Figure: Histogram of x. Horizontal axis: x; vertical axis: Frequency.]

This example should help to explain the sometimes obscure logic of classic statistical reasoning. We have just
chosen ten numbers at random from a known distribution. This is a normal distribution with a standard
deviation of 2 and a mean of 1. However, it is a small sample, so we can never really know that from the ten numbers
themselves. We try to draw inferences based on the limited knowledge these ten numbers provide.

How do I test for normality?

A better question might be why and when you should bother testing for normality, but we will leave that aside.
It is no trouble to run several normality tests in R. A convenient way is with the package nortest. If you have not
done so you will have to install it from CRAN first. Then we can run Anderson-Darling, Lilliefors
(Kolmogorov-Smirnov) and Cramer-von Mises tests.

library(nortest)
ad.test(x)

Anderson-Darling normality test

data: x
A = 0.2806, p-value = 0.5608

lillie.test(x)

Lilliefors (Kolmogorov-Smirnov) normality test

data: x
D = 0.1345, p-value = 0.8754

cvm.test(x)

Cramer-von Mises normality test

data: x
W = 0.0395, p-value = 0.6554
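
If nortest is not available, base R offers the Shapiro-Wilk test through shapiro.test() in the stats package, which is loaded by default. This is an additional test, not one of the three run above. The sketch regenerates the same x so that it runs on its own, and sw is a name chosen for illustration.

```r
set.seed(1)
x <- rnorm(10, mean = 1, sd = 2)

sw <- shapiro.test(x)   # Shapiro-Wilk normality test from the stats package
sw$statistic            # the W statistic
sw$p.value              # a large p-value gives no evidence against normality
```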

We can also test for skewness and kurtosis by installing another small package called moments.


library(moments)
agostino.test(x)

D'Agostino skewness test

data: x
skew = 0.2962, z = 0.3482, p-value = 0.7277
alternative hypothesis: data have a skewness

anscombe.test(x)

Anscombe-Glynn kurtosis test

data: x
kurt = 2.2753, z = -0.0702, p-value = 0.944
alternative hypothesis: kurtosis is not equal to 3

Notice that testing for significant deviations from the desired normal properties in the case of small samples is
not particularly useful. The null hypothesis is much less likely to be rejected when there is little power
available. There is nothing wrong with our assumption that the numbers were drawn from a normal distribution. In
this special case we know it to be absolutely correct because that is what we told R to do for us. However, a
histogram of the small sample itself does not look particularly normal. In fact if we repeat the process 36 times we
can see that the histogram of a sample of ten numbers very rarely looks normal even when they are drawn from a
normal population.

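par(mfcol = c(6, 6), mar = c(0.5, 0.5, 0.5, 0.5))  # 6 x 6 grid of panels with narrow margins

# Draw 36 histograms, each of a fresh sample of ten; invisible() hides the returned list
invisible(replicate(36, hist(rnorm(10, 1, 2), col = "grey", xlab = "",
    ylab = "", main = "", axes = FALSE)))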

This explains why testing for the normality of small samples is far less important than having a good justification
for assuming that they could have been drawn from a population with normal properties. In this case, if negative
values were not possible the whole process would be completely meaningless. Let's assume that they represent the
differences between two paired values, which could indeed have either sign.
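
The low power of normality tests on small samples can itself be illustrated by simulation. The sketch below is an added illustration under stated assumptions, not part of the analysis above: it draws samples of ten from a clearly skewed exponential distribution and counts how often the base R function shapiro.test() rejects normality at the 5% level. The names nsim, pvals and power_est are hypothetical.

```r
set.seed(1)
nsim <- 1000

# p-values of the Shapiro-Wilk test for samples of ten from a skewed distribution
pvals <- replicate(nsim, shapiro.test(rexp(10))$p.value)

# Proportion of samples in which normality is rejected at the 5% level
power_est <- mean(pvals < 0.05)
power_est
```

Every sample in which the test fails to reject is a clearly non-normal sample of ten that would have passed as normal.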

How do I run a t-test in R?

Now the idea of statistical hypothesis testing is to try to estimate the probability attached to various statements we
could make about some underlying population from which these ten numbers were drawn. In this case we know
what this population is, but in real life we do not. Here a fairly reasonable statement to test would be that
the true difference between the pairs of observations is zero, in other words that nothing much is happening.
In a practical research setting you run a t-test in R quickly with a single function. There are various inputs to
the test, but we will use the default two-tailed option.

t.test(x)

One Sample t-test

data: x
t = 2.5612, df = 9, p-value = 0.03063
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
0.1476105 2.3812007
sample estimates:
mean of x
1.264406
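
The printed output is only one view of the result. t.test() returns an object whose components can be stored and reused, which is helpful when the numbers are needed in later calculations. A sketch, regenerating x so that it runs alone (res is a name chosen for illustration):

```r
set.seed(1)
x <- rnorm(10, mean = 1, sd = 2)

res <- t.test(x)   # one sample, two-sided test against a mean of 0
res$statistic      # t value
res$parameter      # degrees of freedom
res$p.value        # two-tailed p-value
res$conf.int       # 95 percent confidence interval
res$estimate       # sample mean
```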

How can I break down the calculations of a mean step by step?

So now R has told us what the result should be. Let's do all the calculations step by step. The mean of the
sample is calculated by

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

To have each step of this simple operation calculated in turn we calculate the sum in R and then divide by n.

n <- length(x)
n

[1] 10

sumx <- sum(x)


sumx

[1] 12.64406

meanx <- sumx/n


meanx

[1] 1.264406

This is interesting. The mean of the population from which the numbers were drawn is 1. However the mean of
the sample is some distance from this. In this case it is greater. If we took another sample the result would be
different, perhaps less. It could even be below zero. For example.

mean(rnorm(10, 1, 2))

[1] 1.461282


We can even simulate this 100,000 times, look at the results as a histogram and calculate the proportion of the
replicated samples that have a mean below zero. The theoretical basis of our null hypothesis test is based on this
concept.

samps <- replicate(1e+05, mean(rnorm(10, 1, 2)))


hist(samps, col = "grey", breaks = 20)
sum(samps < 0)/1e+05

[1] 0.05695

[Figure: Histogram of samps. Horizontal axis: samps; vertical axis: Frequency.]
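
Because the population is known here, the simulated proportion can be checked against theory. The mean of ten draws from a normal distribution with mean 1 and standard deviation 2 is itself normally distributed with mean 1 and standard deviation 2/sqrt(10), so pnorm() gives the probability of a sample mean below zero directly. A sketch, with simulated and theoretical as illustrative names:

```r
set.seed(1)
samps <- replicate(1e+05, mean(rnorm(10, 1, 2)))

simulated <- mean(samps < 0)                         # proportion of sample means below zero
theoretical <- pnorm(0, mean = 1, sd = 2/sqrt(10))   # value implied by the normal distribution

c(simulated = simulated, theoretical = theoretical)
```

Both values should be close to 0.057, in line with the simulation above.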

Our two problems are that we don't really know that the population standard deviation is exactly 2 and we don't
know what the mean is. Testing a hypothesis looks rather tricky. The best we can do is assume that the
standard deviation of the sample is an estimate of the standard deviation of the population. In order to conduct a
null hypothesis test we ask the question: what is the probability of obtaining a sample with this mean, or one
more extreme, if the population mean were really zero? We do this in a slightly indirect way by calculating the
t-statistic first, which has a clever built-in compensation for small sample sizes.

How do I calculate a sample standard deviation?

Now, how can we calculate a standard deviation by hand? The formula for the sample standard deviation s,
whose square is an unbiased estimator of the population variance $\sigma^2$, is

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (\bar{x} - x_i)^2}$$

sumsquare <- sum((x - meanx)^2)


sumsquare

[1] 21.93532

meansquare <- sumsquare/(n - 1)


meansquare

[1] 2.437258


rootmeansquare <- sqrt(meansquare)


rootmeansquare

[1] 1.561172

sdx <- rootmeansquare

Of course R has built-in functions for these calculations that save all this rigmarole.

mean(x)

[1] 1.264406

sd(x)

[1] 1.561172

Notice that we've got the estimate for the population standard deviation wrong! It should be 2. The result of an
estimate that is too low will be to increase the risk of type one errors. However, this is built into the procedure. If we knew
what the true standard deviation was we could use the z statistic.
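
As a sketch of that last point: in this simulated example the true standard deviation is known to be 2, so a z statistic can be computed and referred to the normal distribution instead of the t distribution. This is only possible because the data were simulated. The names sigma, z and p_z are chosen here for illustration.

```r
set.seed(1)
x <- rnorm(10, mean = 1, sd = 2)

sigma <- 2                                     # true population sd, known only by construction
z <- (mean(x) - 0)/(sigma/sqrt(length(x)))     # z statistic under a null mean of zero
p_z <- 2 * pnorm(-abs(z))                      # two-tailed p-value from the normal distribution

z
p_z
```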

How do I calculate a standard error?


The standard error can be calculated from the standard deviation by dividing by the square root of the sample
size. It represents the variability in means that we would expect if we did as above and took many samples from
the population.

$$SE_{\bar{x}} = \frac{s}{\sqrt{n}}$$

se <- sdx/sqrt(n)
se

[1] 0.4936859
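
The interpretation of the standard error as the variability of sample means can be checked by simulation. In this sketch the standard deviation of many simulated means is compared with the theoretical value 2/sqrt(10). The estimate from our single sample (0.494) differs from both because it is based on the sample standard deviation rather than the true one. The names se_sim and se_theory are illustrative.

```r
set.seed(1)
samps <- replicate(1e+05, mean(rnorm(10, 1, 2)))

se_sim <- sd(samps)        # spread of the simulated sample means
se_theory <- 2/sqrt(10)    # theoretical standard error for n = 10 and sd = 2

c(simulated = se_sim, theoretical = se_theory)
```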

What does the t-distribution look like and how do I use it to test a null hypothesis?
Now if under our null hypothesis the population mean is assumed to be zero ($\mu_0 = 0$) and the standard
error is estimated as $SE_{\bar{x}} = 0.494$, we calculate a t statistic by subtracting the mean under the null hypothesis from
the mean we obtained (1.264) and dividing by the standard error.

$$t = \frac{\bar{x} - \mu_0}{SE_{\bar{x}}}$$

t <- meanx/se
t

[1] 2.561154

R has built-in functions for many statistical distributions. We have already used rnorm to generate the numbers.
The functions for the t distribution follow the same general pattern. Ten simulated values of t with 9 degrees of
freedom can be generated by rt(10,df=9). To get the density use dt. The cumulative distribution function is given
by pt and qt gives the quantile function. We can use these to plot a density function for t and shade the tails that
have values equal to or more extreme than the value we got. A t distribution has longer tails than a normal distribution, and
this represents the compensation we are making for the fact that we have to estimate the sd from a small sample.


plot(function(x) dt(df = n - 1, x), -4, 4)


xvals <- seq(-4, -t, length = 50)
dvals <- dt(xvals, df = n - 1)
polygon(c(xvals, rev(xvals)), c(rep(0, 50), rev(dvals)), col = "gray")
xvals <- seq(4, t, length = 50)
dvals <- dt(xvals, df = n - 1)
polygon(c(xvals, rev(xvals)), c(rep(0, 50), rev(dvals)), col = "gray")
grid()
[Figure: density of the t distribution with n - 1 degrees of freedom, plotted from -4 to 4, with both tails beyond the observed t shaded grey.]
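
The longer tails of the t distribution can also be seen numerically by comparing quantiles. As a sketch, qt() and qnorm() give the value that cuts off the upper 2.5% of each distribution. The value for t with 9 degrees of freedom lies further from zero. The names t_crit and z_crit are chosen here for illustration.

```r
t_crit <- qt(0.975, df = 9)   # upper 2.5% point of t with 9 degrees of freedom
z_crit <- qnorm(0.975)        # upper 2.5% point of the standard normal

round(c(t = t_crit, normal = z_crit), 3)   # 2.262 and 1.960
```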

Now finally the null hypothesis test involves calculating the cumulative area under the two tails. We can use pt
to do this.

tail1 <- pt(-t, df = 9)


tail1

[1] 0.01531455

tail2 <- 1 - pt(t, df = 9)


tail2

[1] 0.01531455

pvalue <- tail1 + tail2


pvalue

[1] 0.03062909
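
As a final check, the pieces calculated by hand can be put together and compared with what t.test() returns. The sketch below recomputes everything from the same simulated sample, obtains the two-tailed p-value in a single step, and rebuilds the 95 percent confidence interval from qt(). The names t_stat and ci are chosen here for illustration.

```r
set.seed(1)
x <- rnorm(10, mean = 1, sd = 2)
n <- length(x)

se <- sd(x)/sqrt(n)            # standard error of the mean
t_stat <- (mean(x) - 0)/se     # t statistic under a null mean of zero

pvalue <- 2 * pt(-abs(t_stat), df = n - 1)              # both tails at once
ci <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * se   # 95 percent confidence interval

pvalue
ci
all.equal(pvalue, t.test(x)$p.value)   # TRUE: the hand calculation agrees with t.test()
```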
