
Course: RSCH8079 – IT Research Methodology

Data Science with R


Session 09

D3502 - Bens Pardamean, B.Sc., M.Sc., PhD


Outline

• Introduction to R
• Descriptive Statistics
• Correlation and Regression
• t-Test and ANOVA
• Chi-Square Test
Introduction to R
Why R?

• Open source and cross-platform
• Supports the principle of reproducibility
• Produces high-quality visualizations
• Backed by a large community (> 2 million users)

Source: https://www.r-bloggers.com/new-surveys-show-continued-popularity-of-r/
R Components
• R Base
• R IDE – RStudio
• R Packages – CRAN
R Installation
- Windows and Mac OS X

Download the executable file (.exe for Windows, .pkg for Mac):
http://www.r-project.org/

- Linux
Ubuntu or Debian: r-base
Red Hat or Fedora: R.i386
SUSE: R-base

Example:
$ sudo apt-get install r-base

- RStudio installation:
Follow this link: https://www.rstudio.com/
Go to Products > RStudio > Download RStudio Desktop
Getting Started
To start working in R, we first specify a working directory. All files related to
the analysis should be placed in this directory.

- In R Base / R GUI:
File > Change dir... > Choose a directory

- In RStudio:
File > New Project > New Directory > Choose project type > Specify name and path
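The same setup can be scripted instead of done through the menus. A minimal sketch, assuming a hypothetical project folder ~/r-analysis:

> getwd()                  # print the current working directory
> setwd("~/r-analysis")    # point R at the analysis folder
> list.files()             # files here can now be referenced by name alone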
Basic Operator and Data Type
Arithmetic operators:
• Add ( + )
• Subtract ( - )
• Multiply ( * )
• Divide ( / )
• Power ( ^ )

> 5 + 6
[1] 11

Operators to assign a variable: "<-", "->", or "=":

> age = 20
> age
[1] 20

> 20 -> age
> age
[1] 20

> age <- 20
> age
[1] 20

Data types:
- Numeric    x = 10.25
- Integer    x = 10
- Complex    x = 10 + 3i
- Logical    x = TRUE
- Character  x = "ten"
- Factor     x = "agree"
             y = "disagree"
             z = "neutral"
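The type of any value can be checked with class(). A short sketch (note that a plain 10 is stored as numeric; the L suffix is needed for a true integer):

> class(10.25)       # "numeric"
> class(10L)         # "integer" (the L suffix forces an integer)
> class(10 + 3i)     # "complex"
> class(TRUE)        # "logical"
> class("ten")       # "character"
> class(factor(c("agree", "disagree", "neutral")))   # "factor"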
Package Installation
R provides a comprehensive library of packages for data analysis through CRAN.
> install.packages('stringr')
Installing package into ‘C:/Users/Arif/Documents/R/win-library/3.3’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.3/stringr_1.1.0.zip'
Content type 'application/zip' length 119734 bytes (116 KB)
downloaded 116 KB

package ‘stringr’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in


C:\Users\Arif\AppData\Local\Temp\RtmpCKsMwX\downloaded_packages
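Once installed, a package must be loaded in each session before use. A quick sketch with the package installed above:

> library(stringr)
> str_length("data science")
[1] 12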

R Help System

> help(cat)
> ?cat
Data Import
- CSV
data.csv <- read.csv("filename.csv", header = TRUE)
- EXCEL
library(xlsx)
data.xlsx <- read.xlsx("filename.xlsx", sheetName = "Sheet1")
- SPSS
library(memisc)
data.spss <- as.data.set(spss.system.file('filename.sav'))
- TXT
data.txt <- read.table("filename.txt")
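read.csv() can also read directly from a URL, which avoids a manual download step. A sketch, assuming the survey file used below is still hosted at the course address:

> survey <- read.csv("http://bdsrc.binus.ac.id/RM/survey.csv")
> head(survey)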
Data Exploration
Case Study – Health data analytics
Please follow this link to download survey.csv file:
http://bdsrc.binus.ac.id/RM/survey.csv
#import the data
> survey = read.csv("survey.csv")

#print the structure of the data


> str(survey)
'data.frame': 237 obs. of 6 variables:
$ sex : Factor w/ 6 levels "F","female","Female",..: 3 6 6 6 6 3 6 3 6 6 ...
$ height : int 68 70 NA 63 65 68 72 62 69 66 ...
$ weight : int 158 256 204 187 168 172 160 116 262 189 ...
$ handedness: Factor w/ 2 levels "Left","Right": 2 1 2 2 2 2 2 2 2 2 ...
$ exercise : Factor w/ 3 levels "Freq","None",..: 3 2 2 2 3 3 1 1 3 3 ...
$ smoke : Factor w/ 4 levels "Heavy","Never",..: 2 4 3 2 2 2 2 2 2 2 ...
Data Exploration

Data entry error

- Fix incorrect data


#print all possible values for one variable
> unique(survey$sex)
[1] Female Male <NA> F M male female
Levels: F female Female M male
#If we want only "Female" and "Male" for this variable, we need to recode all
other values

#find all rows that contain an unexpected value (for instance "F")
> which(survey$sex == 'F')
[1] 210 211 212

#replace it with the correct value
> survey$sex[which(survey$sex == 'F')] = 'Female'
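After recoding, the old factor levels ("F", "M", ...) stay attached to the variable even when no rows use them anymore. A sketch of the remaining cleanup, recoding the other shorthand values the same way and dropping unused levels:

> survey$sex[which(survey$sex == 'M')] = 'Male'
> survey$sex[which(survey$sex == 'female')] = 'Female'
> survey$sex[which(survey$sex == 'male')] = 'Male'
> survey$sex = droplevels(survey$sex)   # remove the now-empty levels
> unique(survey$sex)                    # should show only Female and Male (and NA, if any)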
Data Exploration

Data entry error Data entry error

- Missing value (NA) -Missing values (NA) – Data Imputation


#check if there is NA Replace NA with an appropriate value
> sum(is.na(survey$height))
[1] 28 #replace NA on “height” with 160
> data$height[is.na(data$height)] = 160
#find rows contain NA
> which(is.na(data$height))
[1] 3 12 15 25 26 29 31 35 58 68 70 81 #replace NA on “height” with height
83 84 90 92 96 108 121 average for each sex
[20] 133 157 173 179 203 213 217 225 226 > female.height =
mean(data$height[which(data$sex==
#exclude NA in mean calculation ‘Female’)], na.rm=T)
> mean(data$height, na.rm = T) > data$height[which(data$sex ==
[1] 67.89474 ‘Female’& is.na(data$height))] =
female.height
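The sex-specific imputation can be repeated for the other group; a sketch:

> male.height = mean(survey$height[which(survey$sex == 'Male')], na.rm = T)
> survey$height[which(survey$sex == 'Male' & is.na(survey$height))] = male.height
> sum(is.na(survey$height))   # verify that no missing heights remain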
Descriptive Statistics
Case Study
Data: Major League Baseball (MLB)
http://bdsrc.binus.ac.id/RM/teams.csv

> data = read.csv('teams.csv')
> attach(data)

> str(data)
'data.frame': 30 obs. of 9 variables:
$ team : Factor w/ 30 levels "Arizona Diamondbacks",..: 1 2 3 4 5 6 7 8 9 10 ...
$ code : Factor w/ 30 levels "ARI","ATL","BAL",..: 1 2 3 4 5 6 7 8 9 10 ...
$ league : Factor w/ 2 levels "AL","NL": 2 2 1 1 2 1 2 1 2 1 ...
$ division: Factor w/ 3 levels "Central","East",..: 3 2 2 2 1 1 1 1 3 1 ...
$ games : int 162 162 162 162 162 162 162 162 162 162 ...
$ wins : int 81 94 93 69 61 85 97 68 64 88 ...
$ losses : int 81 68 69 93 101 77 65 94 98 74 ...
$ pct : num 0.5 0.58 0.574 0.426 0.377 0.525 0.599 0.42 0.395 0.543 ...
$ payroll : int 67069833 86208000 76704000 110386000 80422700 118208000 80309500
78911300 75485000 131394000 ...
Scatterplots
Show the relation between two variables:

> plot(payroll, wins)

Labeling points interactively:

> plot(payroll, wins)
> id = identify(payroll, wins, labels = code, n = 5)

Labeling all points:

> plot(payroll, wins)
> with(data, text(payroll, wins, labels = code, pos = 1, cex = 0.5))
Scatterplots
Data grouping (categorical):

> s1 = which(league == 'NL')
> s2 = which(league == 'AL')
> plot(payroll[s1], wins[s1], xlim = range(payroll),
       ylim = range(wins), xlab = 'payroll', ylab = 'wins')
> points(payroll[s2], wins[s2], pch = 2)

Data grouping (numeric):

> s3 = which(pct > 0.5)
> s4 = which(pct <= 0.5)
> plot(payroll[s3], wins[s3], pch = 3, xlim = range(payroll),
       ylim = range(wins), xlab = 'payroll', ylab = 'wins')
> points(payroll[s4], wins[s4], pch = 4)
Scatterplots
Adding a line to separate the two groups, and a legend:

> plot(payroll[s3], wins[s3], xlim = range(payroll), ylim = range(wins),
       xlab = 'payroll', ylab = 'wins')
> points(payroll[s4], wins[s4], pch = 2)
> lines(range(payroll), c(81, 81), lty = 3)
> legend('bottomright', c('pct > 0.5', 'pct <= 0.5'),
         pch = c(1, 2), title = 'Legend')
Data Aggregation
Comparing the sum of payroll between the two leagues:

> sum(payroll[which(league == 'NL')])


[1] 1512099665
> sum(payroll[which(league == 'AL')])
[1] 1424254675

> by(payroll,league,sum)
league: AL
[1] 1424254675
------------------------------------------------------------------
league: NL
[1] 1512099665
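aggregate() produces the same group sums as a data frame, which is often easier to reuse in later steps; a sketch:

> aggregate(payroll ~ league, data = data, FUN = sum)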
Bar Plot

> barplot(by(payroll,league,sum))

> par(xpd=T, mar=par()$mar + c(0,0,0,4))


> barplot(by(payroll,list(division,league),
sum),col=2:4)
> legend(2.5, 8e8, c('Central', 'East', 'West'), fill = 2:4)
Pie Chart

> pie(by(as.numeric(payroll), league, sum))

> labels = c('AL Central', 'AL East', 'AL West', 'NL Central', 'NL East', 'NL West')
> pie(as.numeric(by(payroll, list(division, league), sum)), labels)
Descriptive Statistics
Case study: metropolitan.csv
http://bdsrc.binus.ac.id/RM/metropolitan.csv

> data = read.csv('metropolitan.csv')
> attach(data)

> dim(data)
[1] 2759 11
> nrow(data)
[1] 2759
> ncol(data)
[1] 11

> head(data)
> tail(data)
> summary(data)

> str(data)
'data.frame': 2759 obs. of 11 variables:
$ NAME : Factor w/ 2757 levels "Abbeville, LA",..: 4 347 1263 2444 17 2033 2408 26 124 715 ...
$ LSAD : Factor w/ 4 levels "County or equivalent",..: 3 1 1 1 3 1 1 3 1 1 ...
$ CENSUS2010POP : int 165252 13544 20202 131506 703200 161419 541781 157308 3451 94565 ...
$ NPOPCHG_2010 : int 417 -12 27 402 -332 -38 -294 277 -60 156 ...
$ NATURALINC2010 : int 228 -14 10 232 310 65 245 220 4 147 ...
$ BIRTHS2010 : int 609 36 41 532 1945 385 1560 542 6 363 ...
$ DEATHS2010 : int 381 50 31 300 1635 320 1315 322 2 216 ...
$ NETMIG2010 : int 190 2 17 171 -631 -101 -530 57 -61 11 ...
$ INTERNATIONALMIG2010: int 77 1 2 74 127 26 101 36 0 32 ...
$ DOMESTICMIG2010 : int 113 1 15 97 -758 -127 -631 21 -61 -21 ...
$ RESIDUAL2010 : int -1 0 0 -1 -11 -2 -9 0 -3 -2 ...
Descriptive Statistics

#sort the population counts
> sort(data$CENSUS2010POP)

#sort in decreasing order, keeping the original row indices
> output = sort(data$CENSUS2010POP, decreasing = T, index.return = T)

#the ten most populous areas
> data[output$ix[1:10], 1:2]

#equivalent, using order()
> data[order(-data$CENSUS2010POP)[1:10], 1:2]

Data Grouping

> by(data$CENSUS2010POP,data$LSAD,mean)
data$LSAD: County or equivalent
[1] 161779.3
--------------------------------------------------
data$LSAD: Metropolitan Division
[1] 2803270
--------------------------------------------------
data$LSAD: Metropolitan Statistical Area
[1] 705786.2
--------------------------------------------------
data$LSAD: Micropolitan Statistical Area
[1] 53721.44
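The following slides use a data frame called data.micro that the handout never defines. A plausible definition, assuming it is the subset of micropolitan statistical areas:

> data.micro = data[data$LSAD == "Micropolitan Statistical Area", ]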
Descriptive Statistics
Data Distribution – Box Plot & Histogram

> boxplot(data$BIRTHS2010 ~ data$LSAD)
> hist(data.micro$BIRTHS2010)


Descriptive Statistics
Skewness
> library(moments)
> skewness(data.micro[,3:11])
       CENSUS2010POP         NPOPCHG_2010       NATURALINC2010
           1.7384473            2.6371220            1.0143676
          BIRTHS2010           DEATHS2010           NETMIG2010
           1.6833753            1.5502585            2.6078737
INTERNATIONALMIG2010      DOMESTICMIG2010         RESIDUAL2010
           4.4857400            2.3719011            0.9202234

Kurtosis
> kurtosis(data.micro[,3:11])
CENSUS2010POP NPOPCHG_2010 NATURALINC2010
6.757994 17.459700 9.590567
BIRTHS2010 DEATHS2010 NETMIG2010
6.504231 6.819837 21.681844
INTERNATIONALMIG2010 DOMESTICMIG2010 RESIDUAL2010
34.850185 22.369340 17.521871
Correlation and Regression
Parametric Correlation
Case Study: "parenthood.Rdata"
http://bdsrc.binus.ac.id/RM/parenthood.Rdata

> load("parenthood.Rdata")
> attach(parenthood)

> str(parenthood)
'data.frame': 100 obs. of 4 variables:
$ dan.sleep : num 7.59 7.91 5.14 7.71 6.68 5.99 8.19 7.19 7.4 6.58 ...
$ baby.sleep: num 10.18 11.66 7.92 9.61 9.75 ...
$ dan.grump : num 56 60 82 55 67 72 53 60 60 71 ...
$ day : int 1 2 3 4 5 6 7 8 9 10 ...
Parametric Correlation
Histograms

> hist(dan.grump)
> hist(dan.sleep)
> hist(baby.sleep)
Parametric Correlation
qqnorm()
> qqnorm(dan.grump); qqline(dan.grump, col = 'red')
> qqnorm(dan.sleep); qqline(dan.sleep, col = 'red')
> qqnorm(baby.sleep); qqline(baby.sleep, col = 'red')
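The visual Q-Q check can be complemented with a formal test; a sketch using base R's Shapiro-Wilk test, where a p-value below 0.05 suggests a departure from normality:

> shapiro.test(dan.sleep)
> shapiro.test(dan.grump)
> shapiro.test(baby.sleep)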
Parametric Correlation
Scatterplot

> plot(dan.grump, dan.sleep)


> plot(dan.grump, baby.sleep)
> plot(dan.sleep, baby.sleep)
Parametric Correlation
Pearson's Correlation Coefficient
The cor() function:

> cor(dan.sleep, dan.grump)
[1] -0.903384
> cor(baby.sleep, dan.grump)
[1] -0.5659637
> cor(baby.sleep, dan.sleep)
[1] 0.6279493
Non-Parametric Correlation
Case Study: “effort.Rdata”
http://bdsrc.binus.ac.id/RM/effort.Rdata

> load( "effort.Rdata" )


> attach(effort)
> effort
hours grade
1 2 13
2 76 91
3 40 79
4 6 14
5 16 21
6 28 74
7 27 47
8 59 85

> hist(hours)
> hist(grade)
Non-Parametric Correlation
Spearman's Rank Correlation

> hours.rank = rank(hours)
> hours.rank
 [1]  1 10  6  2  3  5  4  8  7  9
> grade.rank = rank(grade)
> grade.rank
 [1]  1 10  6  2  3  5  4  8  7  9

> cor(hours.rank, grade.rank)
[1] 1

> cor(hours, grade, method = "spearman")
[1] 1
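The ranking step does not have to be done by hand; cor.test() also accepts method = "spearman" and adds a significance test. A sketch:

> cor.test(hours, grade, method = "spearman")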
Significance of Correlation
The cor.test() function:

> cor.test(dan.sleep, dan.grump)

        Pearson's product-moment correlation

data:  dan.sleep and dan.grump
t = -20.854, df = 98, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9340614 -0.8594714
sample estimates:
      cor
-0.903384
Parametric Regression – Linear Regression
Case study: “auto.csv” (chapter 6)
http://bdsrc.binus.ac.id/RM/auto.csv

> auto = read.csv('auto.csv')


> attach(auto)
> str(auto)
> plot(horsepower, price)
Parametric Regression – Linear Regression
The lm() function fits the linear model

y = f(x; w) = w0 + w1 * x + ε

> model = lm(price ~ horsepower)
> model

Call:
lm(formula = price ~ horsepower)

Coefficients:
(Intercept)   horsepower
    -4630.7        173.1

price = -4630.7022 + 173.1292 * horsepower
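The fitted coefficients can be extracted from the model object and used directly; for example, a manual prediction at horsepower = 100 (matching the predict() output shown later):

> w = coef(model)       # (Intercept), horsepower
> w[1] + w[2] * 100     # -4630.7022 + 173.1292 * 100 ≈ 12682.22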


Parametric Regression – Linear Regression
Regression model visualization:

> plot(horsepower,price)
> abline(model)
Parametric Regression – Linear Regression
Model evaluation:

> summary(model)

Call:
lm(formula = price ~ horsepower)

Residuals:
     Min       1Q   Median       3Q      Max
-10296.1  -2243.5   -450.1   1794.7  18174.9

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -4630.70     990.58  -4.675 5.55e-06 ***
horsepower    173.13       8.99  19.259  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4728 on 191 degrees of freedom
Multiple R-squared:  0.6601, Adjusted R-squared:  0.6583
F-statistic: 370.9 on 1 and 191 DF,  p-value: < 2.2e-16
Parametric Regression – Linear Regression
Prediction – the predict() function:

> new.data = data.frame(horsepower = c(100, 125, 150, 175, 200))
> predict(model, new.data)
       1        2        3        4        5
12682.21 17010.44 21338.67 25666.90 29995.13
> predict(model, new.data, interval = 'confidence', level = 0.95)
fit lwr upr
1 12682.21 12008.03 13356.40
2 17010.44 16238.24 17782.65
3 21338.67 20275.14 22402.20
4 25666.90 24232.01 27101.79
5 29995.13 28156.72 31833.53
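A prediction interval for an individual car is wider than the confidence interval for the mean response, because it also accounts for the residual noise; a sketch:

> predict(model, new.data, interval = 'prediction', level = 0.95)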
Parametric Regression – Multivariate Linear Regression

> lm(price ~ length + engine.size + horsepower + city.mpg)

Call:
lm(formula = price ~ length + engine.size + horsepower + city.mpg)

Coefficients:
(Intercept) length engine.size horsepower city.mpg
-28480.00 114.58 115.32 52.74 61.51

price = 114.58 x length + 115.32 x engine.size + 52.74 x horsepower + 61.51 x city.mpg – 28480.00
Parametric Regression – Log-Linear Regression
> lm(city.mpg ~ log(horsepower))

Call:
lm(formula = city.mpg ~ log(horsepower))

Coefficients:
(Intercept) log(horsepower)
101.44 -16.62

city.mpg = 101.44 - 16.62 * log(horsepower)


Parametric Regression – Log-Linear Regression
Linear regression vs. log-linear regression:

> summary(lm(city.mpg ~ horsepower))

Call:
lm(formula = city.mpg ~ horsepower)

Residuals:
    Min      1Q  Median      3Q     Max
-7.5162 -1.9232 -0.1454  0.8365 17.2934

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.842721   0.741080   53.76   <2e-16 ***
horsepower  -0.140279   0.006725  -20.86   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.538 on 191 degrees of freedom
Multiple R-squared:  0.6949, Adjusted R-squared:  0.6933
F-statistic: 435.1 on 1 and 191 DF,  p-value: < 2.2e-16

> summary(lm(city.mpg ~ log(horsepower)))

Call:
lm(formula = city.mpg ~ log(horsepower))

Residuals:
    Min      1Q  Median      3Q     Max
-6.7491 -1.7312 -0.1621  1.2798 15.0499

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     101.4362     2.8703   35.34   <2e-16 ***
log(horsepower) -16.6204     0.6251  -26.59   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.954 on 191 degrees of freedom
Multiple R-squared:  0.7873, Adjusted R-squared:  0.7862
F-statistic: 707 on 1 and 191 DF,  p-value: < 2.2e-16

The log-linear model fits better (R-squared 0.79 vs. 0.69).
Non-Parametric Regression – Regression Tree

> library(rpart)
> fit = rpart(price ~ length + engine.size + horsepower + city.mpg)
> fit
n= 193

node), split, n, deviance, yval
      * denotes terminal node

 1) root 193 12563190000 13285.030
   2) engine.size< 182 176  3805169000 11241.450
     4) horsepower< 94.5 94   382629400  7997.319
       8) length< 172.5 72   108629400  7275.847 *
       9) length>=172.5 22   113868600 10358.500 *
     5) horsepower>=94.5 82  1299182000 14960.330
      10) length< 176.4 33   444818200 12290.670
        20) city.mpg>=22 21    94343020 10199.330 *
        21) city.mpg< 22 12    97895460 15950.500 *
      11) length>=176.4 49   460773500 16758.270 *
   3) engine.size>=182 17   413464300 34442.060 *
Non-Parametric Regression – Regression Tree
> plot(fit, uniform = T)
> text(fit, digits = 6, cex=0.6)
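The fitted tree predicts the mean price of the leaf node that a new observation falls into; a sketch with hypothetical cars:

> new.cars = data.frame(length = c(170, 190), engine.size = c(120, 200),
                        horsepower = c(90, 160), city.mpg = c(30, 18))
> predict(fit, new.cars)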
t-Test and ANOVA
One-sample t-test
-> Mean comparison against a given value
datasets::sleep

> attach(sleep)
> sleep
   extra group ID
1    0.7     1  1
2   -1.6     1  2
3   -0.2     1  3
4   -1.2     1  4
5   -0.1     1  5
6    3.4     1  6
7    3.7     1  7
8    0.8     1  8
9    0.0     1  9
10   2.0     1 10
11   1.9     2  1
......

Assume the population average increase in sleep duration is 0.

> mean(extra)
[1] 1.54

H0: There is no significant difference between the observed sample mean of
sleep-duration increase and the hypothesized population mean.
H1: There is a significant difference between the observed sample mean of
sleep-duration increase and the hypothesized population mean.
One-sample t-test
-> Mean comparison against a given value
datasets::sleep

> t.test(extra, mu = 0)

        One Sample t-test

data:  extra
t = 3.413, df = 19, p-value = 0.002918
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.5955845 2.4844155
sample estimates:
mean of x
     1.54

Reject H0 and accept H1 if p-value < 0.05; accept H0 if p-value > 0.05.

Conclusion: There is a significant difference between the observed sample mean
of sleep-duration increase and the hypothesized population mean.
Dependent-samples t-test
-> Mean comparison between two paired sample groups

> mean(extra[group == 1])
[1] 0.75
> mean(extra[group == 2])
[1] 2.33

H0: There is no significant difference in mean sleep duration between the two groups.
H1: There is a significant difference in mean sleep duration between the two groups.

> t.test(extra ~ group, sleep, paired = T)

        Paired t-test

data:  extra by group
t = -4.0621, df = 9, p-value = 0.002833
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.4598858 -0.7001142
sample estimates:
mean of the differences
                  -1.58
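A paired t-test is equivalent to a one-sample t-test on the per-subject differences; a sketch that reproduces the same statistic:

> diff = extra[group == 1] - extra[group == 2]   # difference for each of the 10 subjects
> t.test(diff, mu = 0)                           # t = -4.0621, df = 9, p-value = 0.002833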
Independent-samples t-test
-> Mean comparison between two independent sample groups
Case study: "harpo.Rdata" -> http://bdsrc.binus.ac.id/RM/harpo.Rdata
> load("D:/BDSRC/DataScienceWithR/modul2/harpo.Rdata")
> str(harpo)
'data.frame': 33 obs. of 2 variables:
$ grade: num 65 72 66 74 73 71 66 76 69 79 ...
$ tutor: Factor w/ 2 levels "Anastasia","Bernadette": 1 2 2 1 1 2 2 2 2 2 ...

H0: There is no significant difference in mean grade between the two groups
(grouped by tutor).

H1: There is a significant difference in mean grade between the two groups
(grouped by tutor).
Independent-samples t-test
-> Mean comparison between two independent sample groups

> tapply(grade, tutor, mean)
 Anastasia Bernadette
  74.53333   69.05556
> tapply(grade, tutor, sd)
 Anastasia Bernadette
  8.998942   5.774918
> tapply(grade, tutor, length)
 Anastasia Bernadette
        15         18

> t.test(grade ~ tutor, harpo, var.equal = T)

        Two Sample t-test

data:  grade by tutor
t = 2.1154, df = 31, p-value = 0.04253
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  0.1965873 10.7589683
sample estimates:
 mean in group Anastasia mean in group Bernadette
                74.53333                 69.05556

Conclusion: There is a significant difference in mean grade between the two
groups (grouped by tutor).
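The var.equal = T option assumes homogeneous variances. That assumption can be checked with an F test, and dropping the option makes R run Welch's t-test instead; a sketch:

> var.test(grade ~ tutor, data = harpo)   # a large p-value supports var.equal = TRUE
> t.test(grade ~ tutor, data = harpo)     # Welch's t-test (no equal-variance assumption)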
ANOVA
-> Mean comparison among more than two independent sample groups
datasets::InsectSprays

> data(InsectSprays)
> attach(InsectSprays)
> str(InsectSprays)
'data.frame': 72 obs. of 2 variables:
$ count: num 10 7 20 14 14 12 10 23 17 20 ...
$ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ..

> tapply(count,spray,mean)
A B C D E F
14.500000 15.333333 2.083333 4.916667 3.500000 16.666667

> tapply(count,spray,var)
A B C D E F
22.272727 18.242424 3.901515 6.265152 3.000000 38.606061
> tapply(count,spray,length)
A B C D E F
12 12 12 12 12 12
ANOVA
-> Mean comparison among more than two independent sample groups
datasets::InsectSprays

>boxplot(count ~ spray, InsectSprays)


ANOVA
-> Mean comparison among more than two independent sample groups
datasets::InsectSprays

H0: There is no significant difference among the groups.
H1: There is a significant difference among the groups.

> oneway.test(count~spray)
One-way analysis of means (not assuming equal variances)
data: count and spray
F = 36.0654, num df = 5.000, denom df = 30.043, p-value = 7.999e-12

> qf(.95,5,30.043)
[1] 2.533065

Decision: Reject H0 because F (36.0654) exceeds the critical F value (2.533065)
and the p-value (7.999e-12) is very small.
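oneway.test() uses the Welch correction and does not assume equal variances. The classical equal-variance ANOVA table can be produced with aov(); a sketch:

> summary(aov(count ~ spray, data = InsectSprays))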
ANOVA
Bartlett test (parametric) or Levene test (non-parametric) for homogeneity of variances

H0: The population variances are homogeneous.
H1: At least two population variances differ.

> bartlett.test(count ~ spray, InsectSprays)

        Bartlett test of homogeneity of variances

data:  count by spray
Bartlett's K-squared = 25.96, df = 5, p-value = 9.085e-05

Decision: Reject H0 because the p-value (9.085e-05) is less than 0.05.
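The Levene test mentioned above is not in base R; a sketch, assuming the car package is installed:

> library(car)
> leveneTest(count ~ spray, data = InsectSprays)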


ANOVA
Tukey Honest Significant Differences

> aov.out = aov(count ~ spray, data = InsectSprays)
> TukeyHSD(aov.out)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = count ~ spray, data = InsectSprays)

$spray
         diff      lwr      upr     p adj
B-A   0.83333  -3.8661  5.53274 0.9951810
C-A -12.41700 -17.1160 -7.71730 0.0000000
D-A  -9.58330 -14.2830 -4.88390 0.0000014
E-A -11.00000 -15.6990 -6.30060 0.0000000
F-A   2.16667  -2.5327  6.86608 0.7542147
C-B -13.25000 -17.9490 -8.55060 0.0000000
D-B -10.41700 -15.1160 -5.71730 0.0000002
E-B -11.83300 -16.5330 -7.13390 0.0000000
F-B   1.33333  -3.3661  6.03274 0.9603075
D-C   2.83333  -1.8661  7.53274 0.4920707
E-C   1.41667  -3.2827  6.11608 0.9488669
F-C  14.58330   9.8839 19.28270 0.0000000
E-D  -1.41670  -6.1161  3.28274 0.9488669
F-D  11.75000   7.0506 16.44940 0.0000000
F-E  13.16670   8.4673 17.86610 0.0000000

Based on these results, there are significant differences between the pairs
C-A, D-A, E-A, C-B, D-B, E-B, F-C, F-D, and F-E; the adjusted p-values for
these comparisons are less than 0.05.
Chi-Square Test
Chi-square – Goodness of fit (randomness)
Data -> randomness.Rdata
http://bdsrc.binus.ac.id/RM/randomness.Rdata
> load("randomness.Rdata")
> attach(cards)
> str(cards)
'data.frame': 200 obs. of 3 variables:
$ id : Factor w/ 200 levels "subj1","subj10",..: 1 112 124 135 146 157 168
179 190 2 ...
$ choice_1: Factor w/ 4 levels "clubs","diamonds",..: 4 2 3 4 3 1 3 2 4 2 ...
$ choice_2: Factor w/ 4 levels "clubs","diamonds",..: 1 1 1 1 4 3 2 1 1 4 ...

> observed = table(choice_1)


> observed
choice_1
clubs diamonds hearts spades
35 51 64 50
Chi-square

H0: All cards have the same probability of being picked:
Clubs: 25% | Diamonds: 25% | Hearts: 25% | Spades: 25%

H1: The probability of being picked differs significantly across cards.

> prob = c(clubs = .25, diamonds = .25, hearts = .25, spades = .25)
> prob
   clubs diamonds   hearts   spades
    0.25     0.25     0.25     0.25

> N = 200 # sample size


> expected = N * prob # expected frequencies
> expected
clubs diamonds hearts spades
50 50 50 50
Chi-square

> observed - expected
choice_1
   clubs diamonds   hearts   spades
     -15        1       14        0

> (observed - expected)^2
choice_1
   clubs diamonds   hearts   spades
     225        1      196        0

> (observed - expected)^2/expected


choice_1
clubs diamonds hearts spades
4.50 0.02 3.92 0.00

> sum((observed - expected)^2/expected)


[1] 8.44
Chi-square

Critical value:

> qchisq( p = .95, df = 3 )
[1] 7.814728

p-value:

> pchisq( q = 8.44, df = 3, lower.tail = FALSE )
[1] 0.03774185

Reject H0 if p-value < 0.05 (α = 0.05, i.e., a 95% confidence level).

Decision: Reject H0 and accept H1.
The probability of being picked differs significantly across cards (the choices
are not random).
Chi-square
Another approach, using the chisq.test() function:

> chisq.test(observed)

        Chi-squared test for given probabilities

data:  observed
X-squared = 8.44, df = 3, p-value = 0.03774

Decision: Reject H0 and accept H1.
The probability of being picked differs significantly across cards (the choices
are not random).
Chi-square test of independence
Data -> chapek9.Rdata
http://bdsrc.binus.ac.id/RM/chapek9.Rdata
> load( "chapek9.Rdata" )
> attach(chapek9)
> str(chapek9)
'data.frame': 180 obs. of 2 variables:
$ species: Factor w/ 2 levels "robot","human": 1 2 2 2 1 2 2 1 2 1 ...
$ choice : Factor w/ 3 levels "puppy","flower",..: 2 3 3 3 3 2 3 3 1 2 ...

> summary(chapek9)
  species        choice
 robot:87   puppy : 28
 human:93   flower: 43
            data  :109

> tbl = table(species, choice)
> tbl
         choice
species   puppy flower data
  robot      13     30   44
  human      15     13   65
chi-square test of independence

H0: There is no association between species and decision making.

H1: There is an association between species and decision making.

> chisq.test(species, choice)

        Pearson's Chi-squared test

data:  species and choice
X-squared = 10.722, df = 2, p-value = 0.004697

Critical value:

> qchisq(0.95,2)
[1] 5.991465
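The test can also be run on the contingency table built earlier, and the expected cell counts under independence can be inspected; a sketch:

> result = chisq.test(tbl)
> result$expected   # the chi-square approximation is reliable when all expected counts are >= 5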
Fisher test for small N

> mhs <- matrix(c(1, 2, 1, 3),nrow = 2, dimnames = list (c(“Passed", “not passed"), c("D
epressed", “Not Depressed")))
> mhs
Depressed Not depressed
Passed 1 1
Not passed 2 3

> chisq.test(mhs)

Pearson's Chi-squared test with Yates' continuity correction

data: mhs
X-squared = 1.438e-32, df = 1, p-value = 1

Warning message:
In chisq.test(mhs) : Chi-squared approximation may be incorrect
Fisher test for small N
> fisher.test(mhs)

Fisher's Exact Test for Count Data

data: mhs
p-value = 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.01279034 156.23100767
sample estimates:
odds ratio
1.414185

Odds Ratio
Based on this result, the odds of depression for students who passed are about
1.41 times the odds for those who did not pass; the difference is not
statistically significant (p-value = 1).
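For a 2x2 table, the simple sample odds ratio is the cross-product ratio; fisher.test() reports a slightly different value (1.414185) because it uses the conditional maximum-likelihood estimate. A sketch of the simple version:

> (mhs[1, 1] * mhs[2, 2]) / (mhs[1, 2] * mhs[2, 1])
[1] 1.5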
MOOCs:
- https://www.r-bloggers.com/how-to-learn-r-2/
- https://www.datacamp.com/
- http://tryr.codeschool.com/
- https://www.coursera.org/learn/r-programming
- https://www.rstudio.com/online-learning/
References
Adler, J. (2012). R in a Nutshell. Sebastopol, California: O'Reilly Media.

Matloff, N. (2011). The Art of R Programming. San Francisco: No Starch Press, Inc.

Pardamean, B., Baurley, J.W., Muljo, H.M., Perbangsa, A.S., & Suparyanto, T. (2014).
Data Management and Analysis System for Genome-Wide Association Study.
Bioinformatics Research Group, Bina Nusantara University.

Pathak, M.A. (2014). Beginning Data Science with R. California, USA: Springer.

Teetor, P. (2011). R Cookbook. Sebastopol, California: O’Reilly Media, Inc.

Venables, W.N., & Smith, D.M. (2008). An Introduction to R. Network Theory.

Verzani, J. (2014). Using R for Introductory Statistics. Chapman and Hall/CRC.
