Logistic Regression, R and S-PLUS
Richard Mott
http://bioinformatics.well.ox.ac.uk/lectures/
What is R?
Modelling in R
Data for individuals labelled i=1..n:
Response yi
Genotypes gij for markers j=1..m
Missing Data
NA default in R
Using R
Load genetic logistic regression tools
> source("logistic.R")
Column names
names(t)
t$y response (0,1)
t$m1, t$m2, ... genotypes for each marker
Contingency Tables in R
ftable(t$y,t$m31) prints the contingency table
> ftable(t$y,t$m31)
   11  12  22
0 515 387  75
1  28  11   2
>
Chi-Squared Test in R
> chisq.test(t$y,t$m31)
Pearson's Chi-squared test
data: t$y and t$m31
X-squared = 3.8424, df = 2, p-value = 0.1464
Warning message:
Chi-squared approximation may be incorrect in:
chisq.test(t$y, t$m31)
>
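The warning appears because at least one expected cell count is small. As a minimal sketch, the table can be rebuilt from the counts printed above and tested directly; the object names `tab` and `res` are chosen here for illustration and do not come from the lecture.

```r
# Counts copied from the ftable output above; `tab` is an illustrative name.
tab <- matrix(c(515, 387, 75,
                 28,  11,  2),
              nrow = 2, byrow = TRUE,
              dimnames = list(y = c("0", "1"), m31 = c("11", "12", "22")))

# chisq.test() accepts a matrix of counts as well as two vectors.
res <- suppressWarnings(chisq.test(tab))
res$statistic   # Pearson X-squared
res$parameter   # degrees of freedom
res$expected    # expected counts under independence; cells below 5 trigger the warning
```

Inspecting `res$expected` shows which cell is responsible, here the affected individuals with genotype 22.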
Prob(Y=0)
Models: Recessive, Dominant, Additive, Genotype
Additive Model
Code genotypes as
AA x=0, AG x=1, GG x=2
Linear Predictor
eta = b0 + x*b1
Additive Model: b0 = -2 b1 = 2
PAA = 0.12 PAG = 0.50 PGG = 0.88
Prob(Y=0)
Additive Model: b0 = 0 b1 = 2
PAA = 0.50 PAG = 0.88 PGG = 0.98
Prob(Y=0)
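The probabilities quoted for the additive model follow from passing the linear predictor through the inverse logit. A short sketch, assuming the standard logistic link (base R's `plogis`); the function name `additive_prob` is made up for this example.

```r
# Inverse logit: maps the linear predictor b0 + x*b1 to a probability.
# plogis(eta) is base R's logistic function, 1 / (1 + exp(-eta)).
additive_prob <- function(b0, b1, x = c(AA = 0, AG = 1, GG = 2)) {
  plogis(b0 + x * b1)
}

round(additive_prob(-2, 2), 2)   # agrees with 0.12, 0.50, 0.88 above
round(additive_prob(0, 2), 2)    # agrees with 0.50, 0.88, 0.98 above
```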
Recessive Model
Code genotypes as
AA x=0, AG x=0, GG x=1
Linear Predictor
eta = b0 + x*b1
Recessive Model: b0 = 0 b1 = 2
PAA = PAG = 0.50 PGG = 0.88
Prob(Y=0)
Genotype Model
Code genotypes as
AA x=0, y=0
AG x=1, y=0
GG x=0, y=1
Linear Predictor
eta = b0 + x*b1 + y*b2
Genotype Model: b0 = 0 b1 = 2 b2 = -1
PAA = 0.5 PAG = 0.88 PGG = 0.27
Prob(Y=0)
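With two indicator variables each genotype gets its own probability. The sketch below recomputes the three values from the stated coefficients, assuming the indicators x (for AG) and y (for GG) enter the linear predictor as b0 + x*b1 + y*b2; the object names are illustrative.

```r
# Indicator coding for the genotype model: x marks AG, y marks GG.
coding <- data.frame(x = c(0, 1, 0), y = c(0, 0, 1),
                     row.names = c("AA", "AG", "GG"))
b <- c(b0 = 0, b1 = 2, b2 = -1)

eta <- b[["b0"]] + coding$x * b[["b1"]] + coding$y * b[["b2"]]
p   <- setNames(plogis(eta), rownames(coding))
round(p, 2)   # agrees with 0.5, 0.88, 0.27 above
```

Because AG and GG have separate coefficients, the heterozygote probability need not lie between the two homozygote probabilities, which the one-parameter models cannot represent.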
Models in R
response y
genotype g (AA, AG, GG)
model      formula           DF
Recessive  y ~ recessive(g)  1
Dominant   y ~ dominant(g)   1
Additive   y ~ additive(g)   1
Genotype   y ~ genotype(g)   2
Data Transformation
g <- t$m1
use these functions to treat a genotype vector in a certain way:
a <- additive(g)
r <- recessive(g)
d <- dominant(g)
g <- genotype(g)
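The source of logistic.R is not reproduced here, so the definitions below are hypothetical stand-ins that only illustrate what such helpers might do. They assume genotypes are stored as "11", "12" and "22" (as in the contingency table) with allele 2 as the allele of interest, and they leave NA as NA.

```r
# Hypothetical stand-ins for the helpers loaded from logistic.R (assumed, not the original code).
additive  <- function(g) unname(c("11" = 0, "12" = 1, "22" = 2)[as.character(g)])
recessive <- function(g) as.numeric(additive(g) == 2)   # only 22 carries the effect
dominant  <- function(g) as.numeric(additive(g) >= 1)   # 12 and 22 carry the effect
genotype  <- function(g) factor(as.character(g), levels = c("11", "12", "22"))

geno <- c("11", "12", "22", NA, "12")   # toy genotype vector
data.frame(geno,
           a = additive(geno), r = recessive(geno),
           d = dominant(geno), gt = genotype(geno))
```

The first three return numeric covariates (one regression coefficient each), while `genotype` returns a factor, which `glm` expands into two indicator columns.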
afit <- glm( t$y ~ additive(g), family=binomial)
rfit <- glm( t$y ~ recessive(g), family=binomial)
dfit <- glm( t$y ~ dominant(g), family=binomial)
gfit <- glm( t$y ~ genotype(g), family=binomial)
Equivalent models:
genotype = dominant + recessive
genotype = additive + recessive
genotype = additive + dominant
genotype ~ standard chi-squared test of genotype association
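The equivalences can be checked numerically: any two of the one-parameter codings span the same model space as the genotype factor, so the fits should share one residual deviance. This sketch rebuilds one row per individual from the m31 counts shown earlier and uses inline codings (assumptions, not the logistic.R helpers).

```r
# One row per individual, reconstructed from the m31 contingency table.
counts <- data.frame(y = rep(c(0, 1), each = 3),
                     g = rep(c("11", "12", "22"), times = 2),
                     n = c(515, 387, 75, 28, 11, 2),
                     stringsAsFactors = FALSE)
d <- counts[rep(seq_len(nrow(counts)), counts$n), c("y", "g")]

# Illustrative codings of the genotype.
d$add <- unname(c("11" = 0, "12" = 1, "22" = 2)[as.character(d$g)])
d$rec <- as.numeric(d$g == "22")
d$dom <- as.numeric(d$g != "11")

fits <- list(
  genotype = glm(y ~ factor(g), family = binomial, data = d),
  add_rec  = glm(y ~ add + rec, family = binomial, data = d),
  add_dom  = glm(y ~ add + dom, family = binomial, data = d),
  dom_rec  = glm(y ~ dom + rec, family = binomial, data = d)
)

# The four residual deviances agree up to numerical tolerance.
devs <- sapply(fits, deviance)
devs
```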
Parameter Estimates
> summary(glm( t$y ~ genotype(t$m31), family='binomial'))
Coefficients:
                        Estimate Std. Error z value Pr(>|z|)
b0 (Intercept)           -2.9120     0.1941 -15.006   <2e-16 ***
b1 genotype(t$m31)12     -0.6486     0.3621  -1.791   0.0733 .
b2 genotype(t$m31)22     -0.7124     0.7423  -0.960   0.3372
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
>
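For a single categorical predictor the estimates have a closed form: the intercept is the log odds of y = 1 in the baseline genotype 11, and b1, b2 are log odds ratios relative to that baseline. A small sketch using the m31 counts from the contingency table (object names are illustrative):

```r
# Counts from the ftable output for marker m31.
tab <- matrix(c(515, 387, 75, 28, 11, 2), nrow = 2, byrow = TRUE,
              dimnames = list(y = c("0", "1"), m31 = c("11", "12", "22")))

log_odds <- log(tab["1", ] / tab["0", ])        # log odds of y = 1 within each genotype
b0 <- log_odds[["11"]]                          # baseline genotype 11
b1 <- log_odds[["12"]] - log_odds[["11"]]       # 12 versus 11
b2 <- log_odds[["22"]] - log_odds[["11"]]       # 22 versus 11
round(c(b0 = b0, b1 = b1, b2 = b2), 4)          # agrees with the Estimate column above
```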
Analysis of Deviance
Chi-Squared Test
> anova(glm( t$y ~ genotype(t$m31), family='binomial'))
Analysis of Deviance Table
Model: binomial, link: logit
Response: t$y
Terms added sequentially (first to last)
NULL
genotype(t$m31)
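The deviance test compares the null model with the genotype model; the drop in deviance is referred to a chi-squared distribution with 2 degrees of freedom. The sketch below carries this out by hand and with `anova(..., test = "Chisq")` on data rebuilt from the m31 counts; the variable names are chosen here and the numerical output is not quoted from the lecture.

```r
# One row per individual, reconstructed from the m31 contingency table.
counts <- data.frame(y = rep(c(0, 1), each = 3),
                     g = rep(c("11", "12", "22"), times = 2),
                     n = c(515, 387, 75, 28, 11, 2),
                     stringsAsFactors = FALSE)
d <- counts[rep(seq_len(nrow(counts)), counts$n), c("y", "g")]

fit <- glm(y ~ factor(g), family = binomial, data = d)

# Likelihood-ratio (deviance) test of the genotype term.
lr <- fit$null.deviance - deviance(fit)       # drop in deviance
df <- fit$df.null - fit$df.residual           # parameters added
p  <- pchisq(lr, df = df, lower.tail = FALSE)

aod <- anova(fit, test = "Chisq")             # same test in table form
aod
```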
Model Comparison
Compare general model with additive,
dominant or recessive models:
> afit <- glm(t$y ~ additive(t$m20))
> gfit <- glm(t$y ~ genotype(t$m20))
> anova(afit,gfit)
Analysis of Deviance Table
Model 1: t$y ~ additive(t$m20)
Model 2: t$y ~ genotype(t$m20)
  Resid. Df Resid. Dev Df Deviance
1      1016     38.301
2      1015     38.124  1    0.177
>
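The calls in this transcript omit the `family` argument, in which case `glm` falls back to its Gaussian default. A sketch of the same comparison as a logistic regression, with a chi-squared test on the change in deviance, might look as follows; the data are simulated purely for illustration and all names are assumptions.

```r
set.seed(1)                                         # simulated data, illustration only
n <- 1000
x <- rbinom(n, size = 2, prob = 0.3)                # copies of the minor allele: 0, 1, 2
y <- rbinom(n, size = 1, prob = plogis(-2 + 0.5 * x))

afit <- glm(y ~ x,         family = binomial)       # additive model: 1 df
gfit <- glm(y ~ factor(x), family = binomial)       # genotype model: 2 df

# 1-df test of whether the genotype model improves on the additive model.
cmp <- anova(afit, gfit, test = "Chisq")
cmp
```

A small deviance difference on 1 degree of freedom indicates that the additive coding describes the marker adequately.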
Multilocus Models
Can test the effects of fitting two or more markers simultaneously
Several multilocus models are possible
Interaction Model assumes that each combination of genotypes has a different effect
e.g. t$y ~ t$m10 * t$m15
Multi-Locus Models
> f <- glm( t$y ~ genotype(t$m13) * genotype(t$m26) , family='binomial')
> anova(f)
Analysis of Deviance Table
Model: binomial, link: logit
Response: t$y
Terms added sequentially (first to last)
NULL
genotype(t$m13)
genotype(t$m26)
genotype(t$m13):genotype(t$m26)
anova(fit1,fit2)
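One way to read `anova(fit1,fit2)` is as a test of the interaction: fit the two markers with main effects only, then with all genotype combinations, and compare. The following sketch uses simulated genotypes; `main_fit` and `int_fit` are illustrative names playing the roles of fit1 and fit2.

```r
set.seed(2)                                          # simulated data, illustration only
n  <- 2000
g1 <- factor(rbinom(n, 2, 0.4), labels = c("11", "12", "22"))
g2 <- factor(rbinom(n, 2, 0.5), labels = c("11", "12", "22"))
y  <- rbinom(n, 1, plogis(-1 + 0.4 * (g1 == "22") + 0.3 * (g2 != "11")))

main_fit <- glm(y ~ g1 + g2, family = binomial)      # main effects of each locus
int_fit  <- glm(y ~ g1 * g2, family = binomial)      # one effect per genotype combination

# The extra 4 df correspond to the g1:g2 interaction terms.
cmp <- anova(main_fit, int_fit, test = "Chisq")
cmp
```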
Multiple Testing
Take care interpreting significance levels when performing multiple tests
Linkage disequilibrium can reduce the effective number of independent tests
Permutation is a safe procedure to determine significance
Repeat j=1..N times:
  Permute disease status y between individuals
  Fit all markers
  Record maximum deviance maxdev[j] over all markers
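The permutation loop above can be sketched as follows. The data are simulated, the helper `marker_dev` is a hypothetical name, and the final step (comparing the observed maximum with the permutation distribution to obtain an adjusted p-value) is the usual continuation rather than something stated in the lecture.

```r
set.seed(3)                                          # simulated data, illustration only
n <- 300; m <- 5; N <- 200
G <- matrix(rbinom(n * m, 2, 0.3), nrow = n)         # genotypes: n individuals, m markers
y <- rbinom(n, 1, 0.2)                               # disease status with no true effect

# Deviance explained by each marker under the genotype model.
marker_dev <- function(y, G) {
  apply(G, 2, function(g) {
    fit <- glm(y ~ factor(g), family = binomial)
    fit$null.deviance - deviance(fit)
  })
}

observed <- max(marker_dev(y, G))

# Permute y, refit all markers, record the maximum deviance each time.
maxdev <- replicate(N, max(marker_dev(sample(y), G)))

# Fraction of permutations whose best marker does at least as well as the observed best.
p_adj <- (1 + sum(maxdev >= observed)) / (N + 1)
p_adj
```

Permuting y while leaving G intact preserves the linkage disequilibrium between markers, which is why the procedure accounts for correlated tests.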
Haplotype Association
Different from multiple genotype models
Phase taken into account
Haplotype association can be modelled in a similar logistic framework