
Introduction to Statistical Packages

Session 1: Introduction to R and RStudio


MSBA 615
Nick Holt, Ph.D.
Outline
• Installing R and RStudio
• Installing Packages
• Introductions & Course Overview
• Lecture
• Brief history of R
• RStudio Tour
• R Console / Operators
• R Objects / Types
• Intro to Data Visualization with ggplot2

2
Installing R
• Visit https://cloud.r-project.org/
• Download the appropriate version for your system
• Click the downloaded file and follow prompts to install
• Choosing the default options should be fine

• A new major version of R is released once a year, with 2-3 minor
releases in between. You should update R regularly.

3
Installing RStudio
• RStudio is an integrated development environment, or IDE, for R
programming.

• Visit: http://www.rstudio.com/download
• Download the appropriate version and install

• RStudio is updated a few times each year.

• When a new version is available, RStudio will let you know.

• You should update regularly because RStudio is constantly adding new
features that improve R in very useful ways.
4
About Me
• Louisville native
• Undergrad in Psychology at Morehead State University
• Grad School at University of Louisville
• Ph.D. in Experimental Psychology – Perception and Cognition
• Infant Cognition Lab
• Face Perception and Recognition
• Causal Learning and Reasoning
• Motor Skills -> Thinking
• Taught graduate statistics in the psych department in grad school
• Moved into data science and analytics at a national research firm
• Head Instructor for the Institute for Advanced Analytics at Bellarmine University
• Currently VP, Director of Analytics at Doe-Anderson, Inc. (Ad Agency)

5
Doe-Anderson
• https://www.doeanderson.com/

Not really…

6
About You
• Name
• Where do you work? What do you do there?

7
About the Course
• Syllabus available on Blackboard

8
Student Input
• Take 5 minutes to write down 3 things you are hoping to learn more about in this course

• I will try to touch on the most popular topics at some point

9
History of R
• R is an open source programming language/software environment
• An implementation of the S programming language
• S was created by John Chambers at Bell Labs as an internal environment for
statistical analysis
• S was designed so that users could begin interacting with the language in a way that
doesn’t “feel” like programming
• As users’ needs became more sophisticated, they could gradually
transition into programming

10
History of R
• R was created in New Zealand by Ross Ihaka and Robert Gentleman
• Both authors’ names begin with R, but R is also a play on S
• R version 1.0.0 was released in 2000
• Runs on almost any platform

11
History of R

http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html

12
History of R

http://www.forbes.com/sites/gilpress/2015/10/21/the-number-of-data-scientists-has-doubled-over-the-last-4-years/2/#3310e65e1f6a
13
History of R

http://blog.revolutionanalytics.com/2015/12/r-is-the-fastest-growing-language-on-stackoverflow.html
14
Why is R becoming so popular?
• Most commercial stats software costs thousands of dollars
• Just about any statistical technique can be implemented in R
• Users can contribute “packages” -> many of the most advanced/new methods are available
in R before they reach other platforms
• State-of-the-art graphics capabilities
• R is highly adaptive: works with most data/file types

15
R Resources
• https://www.r-project.org
• News
• Download latest version of R
• CRAN: Comprehensive R Archive Network

• R has many useful functions built into the software (referred to as base R)
• CRAN stores user-contributed “packages” that add new functionality
• Currently, the CRAN package repository features 10k+ available packages
for free download
• Packages can be installed directly from the R Console (R will connect to
CRAN and download)
• https://gallery.shinyapps.io/087-crandash/
16
R Community
• Very active community of users
• https://www.r-bloggers.com/
• Twitter: #rstats, @hadleywickham, etc.
• StackOverflow: 313,323 questions tagged “R”
• useR! International conference for users

17
RStudio
• https://www.rstudio.com
• RStudio is an integrated development environment (IDE) for R
• Console
• Syntax-highlighting editor with direct code execution
• Tools for plotting, history, debugging, workspace management, version control (git
integration), publishing, etc.
• Open source (freely available)

• Created by JJ Allaire in late 2010


• Hadley Wickham is the Chief Scientist at RStudio (he writes lots of nice
packages)

18
RStudio Overview

19
Getting Started with RStudio
• Settings
• Key features
• RStudio “Projects”

20
Getting Started with R
• R is a case-sensitive, interpreted language

• Enter commands one at a time from the command prompt or run entire scripts

• In a basic sense, R can function as an advanced calculator


• Demo: 2+2, etc.

21
RStudio Script Editor
• Create a new script: Ctrl + Shift + N

• You should now see 4 panes

• Script editor:

• Use for code you want to save

• Ctrl + Enter runs the current line in the editor

• Ctrl + Shift + S runs the entire script

• Syntax errors will be highlighted with a red “x” and squiggly red underlining
22
Input
• Things we enter into the console are called expressions

• R evaluates the expression(s) and returns output

• The # symbol indicates a comment (anything following this symbol is not
evaluated)
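
As a rough illustration of expressions and comments, the lines below can be typed into the console one at a time. The numbers and the object name `result` are arbitrary choices for this sketch; assignment with `<-` is covered later in this session.

```r
# Everything after the # symbol is ignored by R
2 + 2                    # R evaluates the expression and prints 4

# Parentheses control the order of evaluation
result <- (3 + 5) * 2    # store the value so it can be reused
result                   # typing the name prints the stored value
```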

23
Important Operators
• Arithmetic: +, -, *, /

• Exponents: ^

• The : operator creates a sequence from one number to another in steps of 1

• c() can be used to concatenate values (create vectors)

• Remember that R is case sensitive (c does something very different from C)
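
The operators on this slide can be tried with a short sketch like the following. The object names are illustrative only and are not part of the course materials.

```r
# Arithmetic follows the usual order of operations
simple <- 7 - 2 * 3        # multiplication happens first, so this is 1
power  <- 2^10             # exponent: 1024

# The : operator builds a sequence in steps of 1 (it can also count down)
one_to_five <- 1:5         # 1 2 3 4 5
countdown   <- 3:1         # 3 2 1

# c() combines values, or whole vectors, into one vector
evens    <- c(2, 4, 6)
combined <- c(evens, one_to_five)
length(combined)           # 8
```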

24
Relational Operators
• Comparing for equality: ==

• Not equal: !=

• Not: !

• Others: >, <, >=, <=
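
A minimal sketch of the relational operators in use. The `scores` vector is a made-up example; the point is that a comparison returns logical values and works element by element on a vector.

```r
same      <- 3 == 3        # TRUE
different <- 3 != 3        # FALSE
negated   <- !(3 == 3)     # FALSE: ! flips a logical value

# Comparisons are applied to each element of a vector
scores <- c(55, 70, 85)    # hypothetical values
passed <- scores >= 70     # FALSE TRUE TRUE
```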

25
Mathematical Functions
• Almost any mathematical function you can think of is built into R:

• Trig: sin(), cos(), tan(), asin(), acos(), atan()

• Logs & Exponents: log(), exp()
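
A few of these functions in action, as a small sketch. Two details are worth knowing: the trig functions expect radians, and `log()` is the natural logarithm unless a `base` is supplied. The object names are illustrative.

```r
# Trigonometric functions work in radians
sine_val <- sin(pi / 2)            # sine of 90 degrees
pi_est   <- 4 * atan(1)            # atan(1) is pi / 4

# log() is the natural log by default; exp() is its inverse
nat_log <- log(exp(2))
log_b10 <- log(100, base = 10)     # a different base can be requested
```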

26
Objects in R
• In R, almost EVERYTHING is an “object”
• Numbers

• Variables

• Functions

• Objects can be assigned names and stored using “<-”

27
Basic Assignment in R
Assigning (storing) the value 8 in an object called x

x <- 8
Variable (object): x    Assigner: <-    Value: 8
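
Once a value is stored, the name can be used anywhere the value could be. A short sketch of that idea:

```r
x <- 8        # store the value 8 in an object called x
x             # typing the name prints the stored value
x * 2         # x can be used in later expressions: 16
x <- x + 1    # assigning again overwrites the old value; x is now 9
```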
28
Objects in R
• R has five atomic classes of objects*:
• Character

• Numeric

• Integer

• Complex

• Logical

*more on this later
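
The class of any value can be checked with `class()`. The sketch below uses one arbitrary example value for each of the five classes listed above.

```r
classes <- c(
  class("hello"),   # "character"
  class(3.14),      # "numeric"
  class(5L),        # "integer" (the L suffix requests an integer)
  class(2 + 3i),    # "complex"
  class(TRUE)       # "logical"
)
classes

class(5)            # "numeric": without the L suffix, 5 is stored as a double
```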

29
Vectors in R
• Vectors are the most basic kind of object

• A vector is an ordered set of values (objects)

• Vectors can only contain objects of the same class

• The exception to this rule is a special type of vector called a “list”

Creating a vector:

• vector() can create an empty vector

• c( <insert values separated by commas >) can also create a vector

• A vector of numbers in a sequence can be created using a colon: 1:10
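
The three ways of creating a vector, plus what happens when classes are mixed, can be sketched as follows. The object names are illustrative. Note that `c()` does not raise an error on mixed input; it converts everything to a common class.

```r
empty      <- vector("numeric", length = 3)   # 0 0 0
abc        <- c("a", "b", "c")
one_to_ten <- 1:10

# Mixing classes inside c() coerces every value to one class
mixed <- c(1, "a", TRUE)
mixed                # "1" "a" "TRUE"
class(mixed)         # "character"

# A list keeps each element's own class
items <- list(1, "a", TRUE)
class(items[[3]])    # "logical"
```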


30
Creating Vectors in R
Assigning (storing) the vector of integers from 1 to 8 in an object called x

x <- 1:8
Variable (object): x    Assigner: <-    Vector: 1:8
31
Creating Vectors in R
Assigning (storing) the vector of integers from 1 to 3 in an object called x

x <- c(1,2,3)
Variable (object): x    Assigner: <-    Vector: c(1,2,3)
32
“Vectorization”
• Vectorization is a term in R that can mean different things in different contexts

• “Vectorized” can mean that an operator or function will act on each element
of the object without an explicit loop
• 10 * 1:5 # here * is vectorized: it multiplies each element of 1:5 by 10

• “Vectorized” can mean that a function takes a vector as input and
calculates a summary statistic
• sum(1:5)

• median(1:5)
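
To make the first two meanings concrete, the sketch below compares a vectorized multiplication with the explicit loop it replaces, then applies two summary functions to the same vector. The object names are illustrative.

```r
values <- 1:5

# Meaning 1: the operator acts on every element, no loop needed
scaled <- 10 * values                  # 10 20 30 40 50

# The same result written as an explicit loop
looped <- numeric(length(values))
for (i in seq_along(values)) {
  looped[i] <- 10 * values[i]
}
identical(scaled, looped)              # TRUE

# Meaning 2: a whole vector goes in, one summary value comes out
total  <- sum(values)                  # 15
middle <- median(values)               # 3
```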

33
Vectors in R
• The third meaning is vectorization over arguments:

• Some functions calculate a summary statistic from several input
arguments:
• sum(1, 2, 3, 4, 5)

• But most do not:
• median(1, 2, 3, 4, 5)

• How can we make this function work properly?
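
One possible answer, sketched below: `median()` treats only its first argument as data, so the values need to be wrapped in a single vector with `c()`. The object names are illustrative.

```r
total <- sum(1, 2, 3, 4, 5)          # 15: sum() accepts many arguments

# Only the first argument (1) is treated as the data here
wrong <- median(1, 2, 3, 4, 5)       # 1

# Wrapping the values in c() passes them as one vector
right <- median(c(1, 2, 3, 4, 5))    # 3
```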

34
Quick Check
• Answer the following:
• Is the sum of all integers between 1 and 500 greater than 50 cubed?

35
Quick Check Solution
• Answer the following:
• Is the sum of all integers between 1 and 500 greater than 50 cubed?
1. Create an object called x that holds integers between 1 and 500

x <- 1:500

2. Create an object called y that holds 50 cubed

y <- 50^3

3. Create test logic using relational operators (>) to find the answer

x > y   # compares each element of x to y, so this returns 500 values rather than one answer

36
Quick Check Solution
• Answer the following:
• Is the sum of all integers between 1 and 500 greater than 50 cubed?
1. Create an object called x that holds the sum of integers between 1 and 500

x <- sum(1:500)

2. Create an object called y that holds 50 cubed

y <- 50^3

3. Create test logic using relational operators (>) to find the answer

x > y
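
Running the steps gives the answer, and the closed-form formula n(n + 1)/2 offers an independent check on the sum. The sketch also shows why comparing the unsummed vector does not answer the question; the names `closed_form` and `elementwise` are illustrative.

```r
x <- sum(1:500)     # 125250
y <- 50^3           # 125000
x > y               # TRUE: the sum is larger

# Independent check using n * (n + 1) / 2
closed_form <- 500 * 501 / 2
closed_form == x    # TRUE

# Without sum(), the comparison is made element by element
elementwise <- 1:500 > y
length(elementwise) # 500 separate answers
any(elementwise)    # FALSE: no single integer up to 500 exceeds 125000
```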

37
Numbers (Numeric Class)
• Numbers are objects of the numeric class

• An L suffix can be used to denote an integer

• Special Numbers:
• Inf: represents infinity (-Inf, also)

• NaN: “not a number”

• NA: not available – used to denote missing values
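
A brief sketch of where these special values come from in practice. The vector `readings` is a made-up example of data with a missing value.

```r
class(5)      # "numeric"
class(5L)     # "integer"

pos_inf <- 1 / 0      # Inf
neg_inf <- -1 / 0     # -Inf
not_num <- 0 / 0      # NaN

# Missing values propagate through calculations
readings  <- c(1, NA, 3)                  # hypothetical data
avg_all   <- mean(readings)               # NA
avg_known <- mean(readings, na.rm = TRUE) # 2: NA is dropped first
```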

38
Numbers (Numeric Class)
• There are functions available to check for special numbers:

v <- c(0, Inf, -Inf, NaN, NA)


• Create the vector above in your console and try the following functions:
• is.finite()

• is.nan()

• is.na()

• What do these functions do? What data type is their output?

39
Logical Values (Logical Class)
• TRUE and FALSE are special words in R

• You cannot assign values to them (lower- and mixed-case names such as true or True
are ordinary object names, so assigning to those will work)

• T and F are preassigned short-hand expressions for TRUE and FALSE (they
can be redefined, though, so be careful)
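
The risk with T and F can be seen in a small sketch. It deliberately overwrites `T`, then removes the new object so that `T` means TRUE again; the names `redefined` and `restored` are illustrative.

```r
T == TRUE         # TRUE by default

T <- 0            # allowed: T is an ordinary object name
redefined <- T    # now 0, so code relying on T would misbehave

rm(T)             # delete our object; T refers to the built-in value again
restored <- T     # TRUE

# TRUE <- 0       # uncommenting this line raises an error
```

Writing out TRUE and FALSE in full avoids the problem entirely.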

40
Logical Values (Logical Class)
• Other Logical Operators
• Not: !

• And: &

• Or: |
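
The three operators applied to every combination of TRUE and FALSE, followed by a typical use that combines two comparisons. The object names and the value 7 are arbitrary.

```r
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

not_a   <- !a       # FALSE FALSE TRUE  TRUE
a_and_b <- a & b    # TRUE  FALSE FALSE FALSE
a_or_b  <- a | b    # TRUE  TRUE  TRUE  FALSE

# Combining relational tests
value    <- 7
in_range <- value > 5 & value < 10   # TRUE
```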

41
Objects in R: Attributes
• Objects can have attributes
• Names: names()

• Dimensions: dim()

• Class: class()

• Length: length()

• Metadata (other attributes)

• str(): structure function


• Provides quick information about any object
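
The attribute functions can be explored with a sketch like this one. The `ages` vector and its names are made up, and `matrix()` is used only to produce an object that has dimensions.

```r
ages <- c(alice = 31, bob = 45, carol = 27)   # hypothetical named vector

names(ages)     # "alice" "bob" "carol"
length(ages)    # 3
class(ages)     # "numeric"
dim(ages)       # NULL: a plain vector has no dimensions

grid <- matrix(1:6, nrow = 3)
dim(grid)       # 3 2 (rows, then columns)

str(ages)       # compact summary of the object's type, size and contents
```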

42
Combining Vectors into new Objects
• Base R contains functions that can smash two vectors into an object with a
tabular data structure (think table of data)

• Smashing two vectors into a table is referred to as binding


• rbind() – aka ROW bind – binds two or more vectors into separate rows of a table

• cbind() – aka COLUMN bind – binds two or more vectors into columns of a table

43
rbind

x <- 1:10
y <- 11:20
rbind(x, y)

44
cbind

x <- 1:10
y <- 11:20
cbind(x, y)
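
Comparing the dimensions of the two results shows the difference between the functions. This sketch reuses the same `x` and `y`; the names `by_row` and `by_col` are illustrative.

```r
x <- 1:10
y <- 11:20

by_row <- rbind(x, y)   # each vector becomes a row
by_col <- cbind(x, y)   # each vector becomes a column

dim(by_row)             # 2 10
dim(by_col)             # 10 2

# The vector names become row or column labels
by_col[3, "y"]          # 13: third row of the y column
```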

45
Quick Check
• Answer the following:
• Quick Check 2: Create an object called z that stores 2 columns of data.
• The first column consists of the numbers 100, 200, 300, 400, 500.

• The second column stores the square of each value in the first column divided by half
of the value in the first column

46
Quick Check Solution
• Answer the following:
• Quick Check 2: Create an object called z that stores 2 columns of data.
• The first column consists of the numbers 100, 200, 300, 400, 500.

• The second column stores the square of each value in the first column divided by half
of the value in the first column

x <- seq(100, 500, by = 100)


y <- (x^2)/(x/2)
z <- cbind(x, y)
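
As a sanity check on this solution, note that x^2 divided by x/2 simplifies to 2x, so the second column should simply be double the first. A short sketch that confirms it:

```r
x <- seq(100, 500, by = 100)
y <- (x^2) / (x / 2)
z <- cbind(x, y)

all.equal(y, 2 * x)   # TRUE: the second column is twice the first
dim(z)                # 5 2
z[1, ]                # first row: x = 100, y = 200
```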
47