• Introduction to R
• Descriptive Statistics
• Correlation and Regression
• t-Test and ANOVA
• Chi-Square Test
Introduction to R
Why R?
Source: https://www.r-bloggers.com/new-surveys-show-continued-popularity-of-r/
R Components
• R Base
• R IDE - RStudio
• R Package - CRAN
R Installation
- Windows and Mac OS X
Download the installer (.exe for Windows, .pkg for Mac):
http://www.r-project.org/
- Linux
Ubuntu or Debian: r-base
Red Hat or Fedora: R.i386
Suse: R-base
Example:
$ sudo apt-get install r-base
- RStudio installation:
Follow this link: https://www.rstudio.com/
Go to Products > RStudio > Download RStudio Desktop
Getting Started
Before starting an analysis in R, specify the working directory. All files
related to the analysis should be placed in this directory.
- In R Base / R GUI:
File > Change dir... > Choose a directory
- In RStudio:
File > New Project > New Directory > Choose project type > Specify the
project name and path
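The same step can also be done from the console. The sketch below is only an illustration: the folder name "r-course" is a hypothetical example, created inside R's temporary directory so that the code can run anywhere.

```r
# Hypothetical project folder inside R's temporary directory;
# replace it with the path of your own project.
project_dir <- file.path(tempdir(), "r-course")
dir.create(project_dir, showWarnings = FALSE)

old_dir <- setwd(project_dir)  # setwd() returns the previous directory
current_dir <- getwd()         # confirm where R now reads and writes files
print(current_dir)
print(list.files())            # files available to the analysis

setwd(old_dir)                 # go back to where we started
```

Calling getwd() after setwd() is a quick way to confirm that R is looking at the intended folder.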
Basic Operators and Data Types
Arithmetic operators:
• Add ( + )
• Subtract ( - )
• Multiply ( * )
• Divide ( / )
• Exponent ( ^ )
> 5+6
[1] 11
Operators to define a variable:
- " <- " or
- " -> " or
- " = "
> age <- 20
> age
[1] 20
> age = 20
> age
[1] 20
> 20 -> age
> age
[1] 20
Data types:
- Numeric: x = 10.25
- Character: x = "ten"
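A short sketch of how the operators and data types above behave in practice. The variable names total, cube and y are introduced here only for illustration.

```r
# Arithmetic operators
total <- 5 + 6   # addition
cube  <- 2 ^ 3   # exponentiation: 2 * 2 * 2

# Data types
x <- 10.25       # numeric
y <- "ten"       # character

print(class(x))  # type of x
print(class(y))  # type of y

# A number stored as text must be converted before doing arithmetic
print(as.numeric("10.25") + 1)
```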
# find all rows that contain unexpected values (for instance "F")
> which(data$sex == 'F')
[1] 210 211 212
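A minimal sketch of the same check on a small made-up data frame. The data frame survey and its values are hypothetical, and recoding "F" to "Female" is an assumption about what the intended value is.

```r
# Hypothetical data: "F" is an unexpected code for "Female"
survey <- data.frame(
  id  = 1:6,
  sex = c("Male", "Female", "F", "Male", "F", "Female"),
  stringsAsFactors = FALSE
)

print(table(survey$sex))             # overview of all codes in use

bad_rows <- which(survey$sex == "F") # row numbers with the unexpected code
print(bad_rows)

survey$sex[bad_rows] <- "Female"     # recode them to the expected value
print(table(survey$sex))
```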
> str(data)
'data.frame': 30 obs. of 9 variables:
$ team : Factor w/ 30 levels "Arizona Diamondbacks",..: 1 2 3 4 5 6 7 8 9 10 ...
$ code : Factor w/ 30 levels "ARI","ATL","BAL",..: 1 2 3 4 5 6 7 8 9 10 ...
$ league : Factor w/ 2 levels "AL","NL": 2 2 1 1 2 1 2 1 2 1 ...
$ division: Factor w/ 3 levels "Central","East",..: 3 2 2 2 1 1 1 1 3 1 ...
$ games : int 162 162 162 162 162 162 162 162 162 162 ...
$ wins : int 81 94 93 69 61 85 97 68 64 88 ...
$ losses : int 81 68 69 93 101 77 65 94 98 74 ...
$ pct : num 0.5 0.58 0.574 0.426 0.377 0.525 0.599 0.42 0.395 0.543 ...
$ payroll : int 67069833 86208000 76704000 110386000 80422700 118208000 80309500 78911300 75485000 131394000 ...
Scatterplots
Show a relation between two variables
Labeling in scatterplot
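One possible way to draw and label such a plot, sketched with only the first three teams from the str() output above. Using text() for the labels is an assumption, since the original plotting commands are not shown.

```r
# First three rows taken from the str() output above
teams <- data.frame(
  code    = c("ARI", "ATL", "BAL"),
  wins    = c(81, 94, 93),
  payroll = c(67069833, 86208000, 76704000)
)

# Scatterplot of wins against payroll
plot(teams$payroll, teams$wins,
     xlab = "Payroll", ylab = "Wins",
     ylim = c(75, 100),
     main = "Wins vs payroll")

# Put each team code just above its point
text(teams$payroll, teams$wins, labels = teams$code, pos = 3)
```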
> by(payroll,league,sum)
league: AL
[1] 1424254675
------------------------------------------------------------------
league: NL
[1] 1512099665
Bar Plot
> barplot(by(payroll,league,sum))
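The grouping step behind the bar plot can be sketched with tapply(), which returns a plain named vector. Only the first three teams from the str() output are used here, so the totals differ from the full-data result above.

```r
# First three rows taken from the str() output above
teams <- data.frame(
  league  = c("NL", "NL", "AL"),
  payroll = c(67069833, 86208000, 76704000)
)

# Total payroll per league
payroll_by_league <- tapply(teams$payroll, teams$league, sum)
print(payroll_by_league)

# One bar per league
barplot(payroll_by_league,
        xlab = "League", ylab = "Total payroll")
```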
> str(data)
'data.frame': 2759 obs. of 11 variables:
$ NAME : Factor w/ 2757 levels "Abbeville, LA",..: 4 347 1263 2444 17 2033 2408 26 124 715 ...
$ LSAD : Factor w/ 4 levels "County or equivalent",..: 3 1 1 1 3 1 1 3 1 1 ...
$ CENSUS2010POP : int 165252 13544 20202 131506 703200 161419 541781 157308 3451 94565 ...
$ NPOPCHG_2010 : int 417 -12 27 402 -332 -38 -294 277 -60 156 ...
$ NATURALINC2010 : int 228 -14 10 232 310 65 245 220 4 147 ...
$ BIRTHS2010 : int 609 36 41 532 1945 385 1560 542 6 363 ...
$ DEATHS2010 : int 381 50 31 300 1635 320 1315 322 2 216 ...
$ NETMIG2010 : int 190 2 17 171 -631 -101 -530 57 -61 11 ...
$ INTERNATIONALMIG2010: int 77 1 2 74 127 26 101 36 0 32 ...
$ DOMESTICMIG2010 : int 113 1 15 97 -758 -127 -631 21 -61 -21 ...
$ RESIDUAL2010 : int -1 0 0 -1 -11 -2 -9 0 -3 -2 ...
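Descriptive Statistics
> sort(data$CENSUS2010POP)
# index.return = TRUE also returns the ordering indices in output$ix
# (decreasing = TRUE is assumed here, to match the order() call below)
> output <- sort(data$CENSUS2010POP, decreasing = TRUE, index.return = TRUE)
> data[output$ix[1:10],1:2]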
> data[order(-data$CENSUS2010POP)[1:10],1:2]
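The difference between sort() and order() can be seen on the first five population values from the str() output above. The vector name pop is introduced only for this sketch.

```r
# First five CENSUS2010POP values from the str() output above
pop <- c(165252, 13544, 20202, 131506, 703200)

# sort() returns the values themselves, here from largest to smallest
print(sort(pop, decreasing = TRUE))

# order() returns the positions of the values in sorted order;
# negating the vector gives a descending order
idx <- order(-pop)
print(idx)

# The positions can then be used to pick rows, e.g. the three largest
top3 <- pop[idx[1:3]]
print(top3)
```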
Data Grouping
> by(data$CENSUS2010POP,data$LSAD,mean)
data$LSAD: County or equivalent
[1] 161779.3
--------------------------------------------------
data$LSAD: Metropolitan Division
[1] 2803270
--------------------------------------------------
data$LSAD: Metropolitan Statistical Area
[1] 705786.2
--------------------------------------------------
data$LSAD: Micropolitan Statistical Area
[1] 53721.44
Descriptive Statistics
Data Distribution – Box Plot & Histogram
Kurtosis
> kurtosis(data.micro[,3:11])
CENSUS2010POP NPOPCHG_2010 NATURALINC2010
6.757994 17.459700 9.590567
BIRTHS2010 DEATHS2010 NETMIG2010
6.504231 6.819837 21.681844
INTERNATIONALMIG2010 DOMESTICMIG2010 RESIDUAL2010
34.850185 22.369340 17.521871
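The kurtosis() function is not part of base R, and the package used above is not named. As a rough sketch, kurtosis can be computed by hand as the fourth central moment divided by the squared variance. The helper pearson_kurtosis() is hypothetical and may use a different definition from the function above (some packages report excess kurtosis, which subtracts 3).

```r
# Hypothetical helper: Pearson kurtosis = m4 / m2^2
# (a normal distribution has a value of 3 under this definition)
pearson_kurtosis <- function(x) {
  x  <- x[!is.na(x)]            # drop missing values
  m2 <- mean((x - mean(x))^2)   # second central moment
  m4 <- mean((x - mean(x))^4)   # fourth central moment
  m4 / m2^2
}

k <- pearson_kurtosis(c(1, 2, 3, 4, 5))
print(k)       # kurtosis
print(k - 3)   # excess kurtosis
```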
Correlation and Regression
Parametric
Correlation
Case Study: “parenthood.Rdata”
http://bdsrc.binus.ac.id/RM/parenthood.Rdata
> load( "parenthood.Rdata" )
> attach(parenthood)
> str(parenthood)
'data.frame': 100 obs. of 4 variables:
$ dan.sleep : num 7.59 7.91 5.14 7.71 6.68 5.99 8.19 7.19 7.4 6.58 ...
$ baby.sleep: num 10.18 11.66 7.92 9.61 9.75 ...
$ dan.grump : num 56 60 82 55 67 72 53 60 60 71 ...
$ day : int 1 2 3 4 5 6 7 8 9 10 ...
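A sketch of a Pearson correlation on this data, using only the first ten values of dan.sleep and dan.grump shown in the str() output above. Results on the full 100 observations will differ.

```r
# First ten observations from the str() output above
dan.sleep <- c(7.59, 7.91, 5.14, 7.71, 6.68, 5.99, 8.19, 7.19, 7.4, 6.58)
dan.grump <- c(56, 60, 82, 55, 67, 72, 53, 60, 60, 71)

# Pearson correlation coefficient (the default method)
r <- cor(dan.sleep, dan.grump)
print(r)

# Significance test for the correlation
result <- cor.test(dan.sleep, dan.grump)
print(result$p.value)
```

A negative coefficient here means that less sleep goes together with a higher grumpiness score.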
Parametric
Histogram
Correlation
> hist(hours)
> hist(grade)
Non-Parametric
Correlation
Spearman’s Rank Correlation
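Spearman's rank correlation is the Pearson correlation computed on the ranks of the data. The sketch below illustrates this by reusing the first ten parenthood values shown earlier, which is an assumption made only to keep the example self-contained.

```r
# First ten observations of the parenthood data (reused for illustration)
dan.sleep <- c(7.59, 7.91, 5.14, 7.71, 6.68, 5.99, 8.19, 7.19, 7.4, 6.58)
dan.grump <- c(56, 60, 82, 55, 67, 72, 53, 60, 60, 71)

# Spearman's rank correlation
rho <- cor(dan.sleep, dan.grump, method = "spearman")
print(rho)

# The same value, obtained as a Pearson correlation on the ranks
rho_by_rank <- cor(rank(dan.sleep), rank(dan.grump))
print(rho_by_rank)
```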
Coefficients:
(Intercept) horsepower
-4630.7 173.1
> plot(horsepower,price)
> abline(model)
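The car price data used above is not included here, so the sketch below fits the same kind of model on R's built-in mtcars data set instead. This is a stand-in, and its coefficients are unrelated to the price model above.

```r
# Simple linear regression: fuel efficiency explained by horsepower
model <- lm(mpg ~ hp, data = mtcars)
print(coef(model))              # intercept and slope

# Scatterplot with the fitted regression line
plot(mtcars$hp, mtcars$mpg,
     xlab = "Horsepower", ylab = "Miles per gallon")
abline(model)

# Model evaluation: R-squared and coefficient tests
print(summary(model)$r.squared)
```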
Parametric Regression – Linear Regression
Model evaluation
Call:
lm(formula = price ~ length + engine.size + horsepower + city.mpg)
Coefficients:
(Intercept) length engine.size horsepower city.mpg
-28480.00 114.58 115.32 52.74 61.51
price = -28480.00 + 114.58 × length + 115.32 × engine.size + 52.74 × horsepower + 61.51 × city.mpg
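The fitted equation can be used to predict a price by hand. In the sketch below, the coefficients are the ones reported above, while the car itself (new_car) is a hypothetical example.

```r
# Coefficients reported by lm() above
coefs <- c(intercept = -28480.00, length = 114.58, engine.size = 115.32,
           horsepower = 52.74, city.mpg = 61.51)

# Hypothetical car, values chosen only for illustration
new_car <- c(length = 170, engine.size = 120, horsepower = 100, city.mpg = 25)

# Predicted price = intercept + sum of coefficient * value
predicted_price <- coefs[["intercept"]] +
  sum(coefs[names(new_car)] * new_car)
print(predicted_price)
```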
Parametric Regression – Log-Linear Regression
> lm(city.mpg ~ log(horsepower))
Call:
lm(formula = city.mpg ~ log(horsepower))
Coefficients:
(Intercept) log(horsepower)
101.44 -16.62
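A sketch of how to read these coefficients: the predicted city.mpg is 101.44 - 16.62 × log(horsepower), where log() is the natural logarithm. The horsepower value of 100 is a hypothetical input.

```r
# Coefficients reported by lm() above
intercept <- 101.44
slope     <- -16.62

# Predicted city mileage for a hypothetical 100 horsepower car
pred_100 <- intercept + slope * log(100)
print(pred_100)

# Because the predictor is logged, doubling horsepower always changes
# the prediction by the same amount: slope * log(2)
change_when_doubled <- slope * log(2)
print(change_when_doubled)
```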
> load("D:/BDSRC/DataScienceWithR/modul2/harpo.Rdata")
> str(harpo)
'data.frame': 33 obs. of 2 variables:
$ grade: num 65 72 66 74 73 71 66 76 69 79 ...
$ tutor: Factor w/ 2 levels "Anastasia","Bernadette": 1 2 2 1 1 2 2 2 2 2 ...
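Comparing the mean grade of the two tutors is a typical use of an independent-samples t-test. The sketch below uses only the first ten rows shown in the str() output above, and the choice of t.test() with its default settings (Welch test) is an assumption, since the original command is not shown.

```r
# First ten rows from the str() output above
harpo_head <- data.frame(
  grade = c(65, 72, 66, 74, 73, 71, 66, 76, 69, 79),
  tutor = factor(c("Anastasia", "Bernadette", "Bernadette", "Anastasia",
                   "Anastasia", "Bernadette", "Bernadette", "Bernadette",
                   "Bernadette", "Bernadette"))
)

# Mean grade per tutor
group_means <- tapply(harpo_head$grade, harpo_head$tutor, mean)
print(group_means)

# Welch two-sample t-test (does not assume equal variances)
result <- t.test(grade ~ tutor, data = harpo_head)
print(result)
```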
> tapply(count,spray,mean)
A B C D E F
14.500000 15.333333 2.083333 4.916667 3.500000 16.666667
> tapply(count,spray,var)
A B C D E F
22.272727 18.242424 3.901515 6.265152 3.000000 38.606061
> tapply(count,spray,length)
A B C D E F
12 12 12 12 12 12
ANOVA
-> Comparison of means among more than two independent sample groups
datasets::InsectSprays
> oneway.test(count~spray)
One-way analysis of means (not assuming equal variances)
data: count and spray
F = 36.0654, num df = 5.000, denom df = 30.043, p-value = 7.999e-12
> qf(.95,5,30.043)
[1] 2.533065
Decision: Reject H0 because the F statistic (36.0654) is larger than the critical
F value (2.533065), and the p-value (7.999e-12) is very small.
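The decision rule above can be reproduced in code. The sketch also fits the classic one-way ANOVA with aov(), which assumes equal variances. Including aov() is an addition for comparison and is not part of the analysis above.

```r
# Welch one-way test, as above (InsectSprays ships with R)
welch <- oneway.test(count ~ spray, data = InsectSprays)

# Critical F value at the 5% level, using the test's own degrees of freedom
f_crit <- qf(0.95, welch$parameter[1], welch$parameter[2])

# Reject H0 when the F statistic exceeds the critical value
reject_h0 <- unname(welch$statistic) > f_crit
print(reject_h0)

# Classic one-way ANOVA for comparison (assumes equal variances)
fit <- aov(count ~ spray, data = InsectSprays)
print(summary(fit))
```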
ANOVA
Test for homogeneity of variance: Bartlett test (assumes normality) or Levene test (more robust to non-normality)
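A sketch of the Bartlett test on the InsectSprays data, using bartlett.test() from base R. The Levene test is not included because it requires an additional package.

```r
# H0: the variance of count is the same for every spray
result <- bartlett.test(count ~ spray, data = InsectSprays)
print(result)

# A small p-value indicates that the group variances differ
variances_differ <- result$p.value < 0.05
print(variances_differ)
```

When the variances differ, a test that does not assume equal variances, such as oneway.test() with its default settings, is the safer choice.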
> prob = c(clubs = .25, diamonds = .25, hearts = .25, spades = .25)
> prob
clubs diamonds hearts spades
0.25 0.25 0.25 0.25
Critical Value:
p-value:
> pchisq( q = 8.44, df = 3, lower.tail = FALSE )
[1] 0.03774185
Reject H0 if p-value < 0.05 (α = 0.05, i.e. a 95% confidence level)
Decision:
There is a significant difference in the probability of each suit being picked (the choices are not random)
chi-square
Another approach: the chisq.test() function
> chisq.test(observed)
data: observed
X-squared = 8.44, df = 3, p-value = 0.03774
Decision:
There is a significant difference in the probability of each suit being picked (the choices are not random)
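The whole goodness-of-fit calculation can be sketched by hand. The observed counts are not shown above, so the counts below are an assumption: they were chosen because they reproduce the reported statistic of 8.44.

```r
# Assumed observed counts (chosen to reproduce X-squared = 8.44)
observed <- c(clubs = 35, diamonds = 51, hearts = 64, spades = 50)

# H0: every suit is equally likely
prob     <- c(clubs = .25, diamonds = .25, hearts = .25, spades = .25)
expected <- sum(observed) * prob

# Chi-square statistic: sum of (O - E)^2 / E
stat <- sum((observed - expected)^2 / expected)

# Critical value and p-value with df = number of categories - 1
crit    <- qchisq(0.95, df = length(observed) - 1)
p_value <- pchisq(stat, df = length(observed) - 1, lower.tail = FALSE)
print(c(statistic = stat, critical = crit, p.value = p_value))

# The same test in one call
print(chisq.test(observed, p = prob))
```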
chi-square test of independence
Data -> chapek9.Rdata
http://bdsrc.binus.ac.id/RM/chapek9.Rdata
> load( "chapek9.Rdata" )
> attach(chapek9)
> str(chapek9)
'data.frame': 180 obs. of 2 variables:
$ species: Factor w/ 2 levels "robot","human": 1 2 2 2 1 2 2 1 2 1 ...
$ choice : Factor w/ 3 levels "puppy","flower",..: 2 3 3 3 3 2 3 3 1 2 ...
Critical value:
> qchisq(0.95,2)
[1] 5.991465
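A sketch of the test of independence on a 2 x 3 contingency table. The chapek9 data file is not included here, so the counts below are illustrative assumptions (they total 180 observations), and the third choice level, which is elided in the str() output, is given the placeholder name "other".

```r
# Illustrative counts (assumed); "other" is a placeholder level name
counts <- matrix(c(13, 15, 30, 13, 44, 65), nrow = 2,
                 dimnames = list(species = c("robot", "human"),
                                 choice  = c("puppy", "flower", "other")))
print(counts)

# H0: choice is independent of species
result <- chisq.test(counts)
print(result)

# df = (rows - 1) * (columns - 1) = 2, so the critical value is qchisq(0.95, 2)
crit <- qchisq(0.95, df = unname(result$parameter))
reject_h0 <- unname(result$statistic) > crit
print(reject_h0)
```

With the real data, the table of counts can be built with table(chapek9$species, chapek9$choice).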
Fisher test for small N
> mhs <- matrix(c(1, 2, 1, 3), nrow = 2, dimnames = list(c("Passed", "Not passed"), c("Depressed", "Not depressed")))
> mhs
Depressed Not depressed
Passed 1 1
Not passed 2 3
> chisq.test(mhs)
data: mhs
X-squared = 1.438e-32, df = 1, p-value = 1
Warning message:
In chisq.test(mhs) : Chi-squared approximation may be incorrect
Fisher test for small N
> fisher.test(mhs)
data: mhs
p-value = 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.01279034 156.23100767
sample estimates:
odds ratio
1.414185
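Odds Ratio
Based on this result, the estimated odds ratio is about 1.41, but the p-value is 1
and the 95% confidence interval (0.013 to 156.2) includes 1, so this sample gives
no evidence of an association between exam result and depression.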
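The odds ratio can also be computed directly from the table as a cross-product. The sketch below shows why this simple value differs slightly from the one reported by fisher.test(), which uses a conditional maximum likelihood estimate.

```r
mhs <- matrix(c(1, 2, 1, 3), nrow = 2,
              dimnames = list(c("Passed", "Not passed"),
                              c("Depressed", "Not depressed")))

# Sample odds ratio: (a * d) / (b * c)
sample_or <- (mhs[1, 1] * mhs[2, 2]) / (mhs[1, 2] * mhs[2, 1])
print(sample_or)

# Fisher's exact test: conditional estimate and its confidence interval
result <- fisher.test(mhs)
print(result$estimate)
print(result$conf.int)

# An interval that contains 1 means "no association" cannot be ruled out
contains_one <- result$conf.int[1] < 1 && result$conf.int[2] > 1
print(contains_one)
```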
MOOC:
-https://www.r-bloggers.com/how-to-learn-r-2/
-https://www.datacamp.com/
-http://tryr.codeschool.com/
-https://www.coursera.org/learn/r-programming
-https://www.rstudio.com/online-learning/