Logistic Regression Using R

Association Analysis,
Logistic Regression,
R and S-PLUS
Richard Mott
http://bioinformatics.well.ox.ac.uk/lectures/
Logistic Regression in Statistical Genetics
Applicable to Association Studies
Data:
Binary outcomes (eg disease status)
Dependent on genotypes [+ sex, environment]
Aim is to identify which factors influence the outcome
Rigorous tests of statistical significance
Flexible modelling language
Generalisation of Chi-Squared Test
What is R?
Statistical analysis package

Free
Similar to commercial package S-PLUS
Runs on Unix, Windows, Mac
www.r-project.org
Many packages for statistical genetics,
microarray analysis available in R
Easily Programmable
Modelling in R
Data for individuals labelled i=1..n:
Response yi
Genotypes gij for markers j=1..m
Coding Unphased Genotypes

Several possibilities:
AA, AG, GG original genotypes
11, 12, 22
1, 2, 3
0, 1, 2 # of G alleles
Missing Data
NA default in R
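As a minimal sketch of the last coding, the snippet below converts genotype strings to counts of the G allele. The vector `geno` is made-up illustration data, not part of the lecture's data set; missing calls stay `NA`.

```r
# Hypothetical unphased genotypes; NA marks a missing call
geno <- c("AA", "AG", "GG", NA, "AG")

# Look up the number of G alleles for each genotype string
n_G <- unname(c(AA = 0, AG = 1, GG = 2)[geno])
n_G
```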
Using R
Load genetic logistic regression tools
> source("logistic.R")
Read data table from file
> t <- read.table("geno.dat", header=TRUE)
Column names
> names(t)
t$y response (0,1)
t$m1, t$m2, ... genotypes for each marker
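To see the reading step end to end without the lecture's `geno.dat`, the sketch below writes a tiny made-up table to a temporary file and reads it back. The file contents and the name `d` are assumptions for illustration only.

```r
# Write a tiny hypothetical genotype table to a temporary file
path <- tempfile(fileext = ".dat")
writeLines(c("y m1 m2",
             "0 11 12",
             "1 12 22",
             "0 NA 11"), path)

# header=TRUE takes column names from the first line; "NA" is read as missing
d <- read.table(path, header = TRUE)
names(d)
is.na(d$m1)
```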
Contingency Tables in R
ftable(t$y,t$m31) prints the contingency table
> ftable(t$y,t$m31)
    11  12  22
0  515 387  75
1   28  11   2
>
Chi-Squared Test in R
> chisq.test(t$y,t$m31)
Pearson's Chi-squared test
data: t$y and t$m31
X-squared = 3.8424, df = 2, p-value = 0.1464
Warning message:
Chi-squared approximation may be incorrect in:
chisq.test(t$y, t$m31)
>
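The test can be reproduced from the six counts in the contingency table alone, which also shows where the warning comes from. This is a sketch that rebuilds the table by hand; the object names `tab` and `res` are illustrative.

```r
# Counts from the contingency table of y against marker m31
tab <- matrix(c(515, 28, 387, 11, 75, 2), nrow = 2,
              dimnames = list(y = c("0", "1"), m31 = c("11", "12", "22")))

# chisq.test warns because one expected count is below 5;
# the warning is suppressed here and the expected counts printed instead
res <- suppressWarnings(chisq.test(tab))
res$statistic   # Pearson X-squared
res$parameter   # degrees of freedom
res$p.value
res$expected    # the (y = 1, genotype 22) cell is the small one
```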
The Logistic Model
Prob(Yi=0) = exp(ηi)/(1+exp(ηi))
ηi = Σj xij bj - Linear Predictor
xij Design Matrix (genotypes etc)
bj Model Parameters (to be estimated)
Model is investigated by
estimating the bjs by maximum likelihood
testing if the estimates are different from 0
The Logistic Function
Prob(Yi=0) = exp(ηi)/(1+exp(ηi))
[Figure: plot of the logistic function, axis label Prob(Y=0)]
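In R the logistic function is available as `plogis`. The short check below, with arbitrary illustrative values of the linear predictor, compares it with the formula written out by hand.

```r
# Arbitrary values of the linear predictor
eta <- c(-4, -2, 0, 2, 4)

p_manual  <- exp(eta) / (1 + exp(eta))  # formula from the slide
p_builtin <- plogis(eta)                # base R logistic function

all.equal(p_manual, p_builtin)
```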
Types of genetic effect at a single locus
[Figure: effect of genotypes AA, AG, GG under the Recessive, Dominant, Additive and Genotype models]
Additive Genotype Model
Code genotypes as
AA  x=0
AG  x=1
GG  x=2
Linear Predictor
η = b0 + xb1
P(Y=0|x) = exp(b0 + xb1)/(1+exp(b0 + xb1))

PAA = P(Y=0|x=0) = exp(b0)/(1+exp(b0))
PAG = P(Y=0|x=1) = exp(b0 + b1)/(1+exp(b0 + b1))
PGG = P(Y=0|x=2) = exp(b0 + 2b1)/(1+exp(b0 + 2b1))
Additive Model: b0 = -2, b1 = 2
PAA = 0.12, PAG = 0.50, PGG = 0.88
[Figure: plot of Prob(Y=0)]
Additive Model: b0 = 0, b1 = 2
PAA = 0.50, PAG = 0.88, PGG = 0.98
[Figure: plot of Prob(Y=0)]
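The two sets of probabilities can be checked directly by plugging the additive coding into the logistic function. A small sketch, with `x` as the additive code for each genotype:

```r
# Additive coding of the three genotypes
x <- c(AA = 0, AG = 1, GG = 2)

p1 <- round(plogis(-2 + 2 * x), 2)  # b0 = -2, b1 = 2
p2 <- round(plogis( 0 + 2 * x), 2)  # b0 =  0, b1 = 2
p1   # 0.12 0.50 0.88
p2   # 0.50 0.88 0.98
```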
Recessive Model
Code genotypes as
AA  x=0
AG  x=0
GG  x=1
Linear Predictor
η = b0 + xb1
P(Y=0|x) = exp(b0 + xb1)/(1+exp(b0 + xb1))

PAA = PAG = P(Y=0|x=0) = exp(b0)/(1+exp(b0))
PGG = P(Y=0|x=1) = exp(b0 + b1)/(1+exp(b0 + b1))
Recessive Model: b0 = 0, b1 = 2
PAA = PAG = 0.50, PGG = 0.88
[Figure: plot of Prob(Y=0)]
Genotype Model
Each genotype has an independent probability
Code genotypes as (for example)
AA  x=0, y=0
AG  x=1, y=0
GG  x=0, y=1
Linear Predictor
η = b0 + xb1 + yb2 (two genotype parameters, b1 and b2)
P(Y=0|x,y) = exp(b0 + xb1+yb2)/(1+exp(b0 + xb1+yb2))

PAA = P(Y=0|x=0,y=0) = exp(b0)/(1+exp(b0))
PAG = P(Y=0|x=1,y=0) = exp(b0 + b1)/(1+exp(b0 + b1))
PGG = P(Y=0|x=0,y=1) = exp(b0 + b2)/(1+exp(b0 + b2))
Genotype Model: b0 = 0, b1 = 2, b2 = -1
PAA = 0.50, PAG = 0.88, PGG = 0.27
[Figure: plot of Prob(Y=0)]
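The genotype model is a convenient place to see the design matrix at work: each genotype is a row, each parameter a column, and the linear predictor is the matrix product. The sketch below uses the example coding and parameter values from this slide; the object names are illustrative.

```r
# Design matrix: intercept column, then the indicator codes x and y
X <- rbind(AA = c(1, 0, 0),
           AG = c(1, 1, 0),
           GG = c(1, 0, 1))
colnames(X) <- c("b0", "b1", "b2")

b   <- c(0, 2, -1)          # parameter values from the example
eta <- drop(X %*% b)        # linear predictor for each genotype
p   <- round(plogis(eta), 2)
p                           # 0.50 0.88 0.27
```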
Models in R
response y
genotype g
model      DF  formula
Recessive  1   y ~ recessive(g)
Dominant   1   y ~ dominant(g)
Additive   1   y ~ additive(g)
Genotype   2   y ~ genotype(g)
Data Transformation
g <- t$m1
use these functions to treat a genotype vector in a certain way:
a <- additive(g)
r <- recessive(g)
d <- dominant(g)
g <- genotype(g)
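The definitions of these functions live in `logistic.R`, which is not reproduced here. The stand-ins below are hypothetical and only illustrate what such codings could look like, assuming genotypes coded 11, 12, 22 with allele 2 as the counted allele; the `_sketch` names are made up to avoid suggesting they are the lecture's implementations.

```r
# Hypothetical stand-ins, NOT the functions from logistic.R.
# Assumption: genotypes are coded 11, 12, 22 and allele 2 is counted.
count2 <- function(g) unname(c("11" = 0, "12" = 1, "22" = 2)[as.character(g)])

additive_sketch  <- function(g) count2(g)                  # 0, 1, 2 copies
dominant_sketch  <- function(g) as.numeric(count2(g) >= 1) # at least one copy
recessive_sketch <- function(g) as.numeric(count2(g) == 2) # two copies
genotype_sketch  <- function(g) factor(g, levels = c(11, 12, 22))

g <- c(11, 12, 22, NA)   # made-up genotype vector with one missing value
additive_sketch(g)
dominant_sketch(g)
recessive_sketch(g)
genotype_sketch(g)
```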
Fitting the Model
afit <- glm( t$y ~ additive(g), family=binomial)
rfit <- glm( t$y ~ recessive(g), family=binomial)
dfit <- glm( t$y ~ dominant(g), family=binomial)
gfit <- glm( t$y ~ genotype(g), family=binomial)
Equivalent models:
genotype = dominant + recessive
genotype = additive + recessive
genotype = additive + dominant
genotype ~ standard chi-squared test of genotype association
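The equivalences can be checked numerically: any two of the one-parameter codings together span the same model as the genotype factor, so the fitted deviances agree. The sketch below uses simulated data and plain base R codings rather than the `logistic.R` functions; the sample size, seed and effect sizes are arbitrary.

```r
set.seed(1)
n <- 500
x <- sample(0:2, n, replace = TRUE)        # simulated count of G alleles
y <- rbinom(n, 1, plogis(-1 + 0.5 * x))    # simulated binary response

gfit  <- glm(y ~ factor(x),             family = binomial)  # genotype
drfit <- glm(y ~ I(x >= 1) + I(x == 2), family = binomial)  # dominant + recessive
arfit <- glm(y ~ x + I(x == 2),         family = binomial)  # additive + recessive
adfit <- glm(y ~ x + I(x >= 1),         family = binomial)  # additive + dominant

devs <- sapply(list(genotype = gfit, dom_rec = drfit,
                    add_rec = arfit, add_dom = adfit), deviance)
devs
```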
Parameter Estimates
> summary(glm( t$y ~ genotype(t$m31), family='binomial'))
Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
b0 (Intercept)         -2.9120     0.1941 -15.006   <2e-16 ***
b1 genotype(t$m31)12   -0.6486     0.3621  -1.791   0.0733 .
b2 genotype(t$m31)22   -0.7124     0.7423  -0.960   0.3372
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
>
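These estimates can be recovered from the contingency table for m31 alone, which makes their meaning concrete. The sketch below expands the six cell counts into one row per individual and fits the genotype model with a plain factor instead of `genotype()`. Note that `glm` with a 0/1 response models the log-odds of the outcome coded 1, so the intercept is log(28/515) and the other coefficients are log odds ratios relative to genotype 11.

```r
# Cell counts from the contingency table of y against m31
counts <- data.frame(y = rep(c(0, 1), times = 3),
                     g = rep(c("11", "12", "22"), each = 2),
                     n = c(515, 28, 387, 11, 75, 2))

# One row per individual (1018 rows in total)
d <- counts[rep(seq_len(nrow(counts)), counts$n), c("y", "g")]
d$g <- factor(d$g, levels = c("11", "12", "22"))

fit <- glm(y ~ g, family = binomial, data = d)
round(coef(fit), 4)   # compare with the Estimate column above
exp(coef(fit))[-1]    # odds ratios for genotypes 12 and 22 versus 11
```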
Analysis of Deviance
Chi-Squared Test
> anova(glm( t$y ~ genotype(t$m31), family='binomial'))
Analysis of Deviance Table
Model: binomial, link: logit
Response: t$y
Terms added sequentially (first to last)
                Df Deviance Resid. Df Resid. Dev
NULL                             1017     343.71
genotype(t$m31)  2     3.96      1015     339.76
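The table reports the deviance but not a p-value. Under the null hypothesis the deviance is approximately chi-squared distributed with the listed degrees of freedom, so the p-value follows from `pchisq`. A short worked step using the numbers in the table:

```r
# Deviance explained by the genotype term and its degrees of freedom
dev <- 3.96
df  <- 2

p <- pchisq(dev, df, lower.tail = FALSE)
p   # about 0.138, close to the Pearson test's 0.1464
```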
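Model Comparison
Compare general model with additive, dominant or recessive models:
> afit <- glm(t$y ~ additive(t$m20))
> gfit <- glm(t$y ~ genotype(t$m20))
> anova(afit,gfit)
Model 1: t$y ~ additive(t$m20)
Model 2: t$y ~ genotype(t$m20)
  Resid. Df Resid. Dev Df Deviance
1      1016     38.301
2      1015     38.124  1    0.177
>
Note: these two calls omit family=binomial, so glm() uses its default gaussian family; add family=binomial, as in the other fits, for a logistic comparison.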
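A sketch of the same comparison as a logistic fit, on simulated data and with base R codings standing in for `additive()` and `genotype()`. Passing `test = "Chisq"` makes `anova` report the p-value for the extra parameter; the hand computation with `pchisq` gives the same number.

```r
set.seed(4)
n <- 500
x <- sample(0:2, n, replace = TRUE)        # simulated additive genotype code
y <- rbinom(n, 1, plogis(-1 + 0.5 * x))    # simulated binary response

afit <- glm(y ~ x,         family = binomial)  # additive model, 1 parameter
gfit <- glm(y ~ factor(x), family = binomial)  # genotype model, 2 parameters

cmp <- anova(afit, gfit, test = "Chisq")
cmp

# The same p-value by hand: deviance difference on 1 degree of freedom
dev_diff <- deviance(afit) - deviance(gfit)
p <- pchisq(dev_diff, df = 1, lower.tail = FALSE)
```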
Scanning all Markers
> logscan(t,model=additive)
Deviance DF Pval LogPval
m1 8.604197e+00 1 3.353893e-03 2.474450800
m2 7.037336e+00 1 7.982767e-03 2.097846522
m3 6.603882e-01 1 4.164229e-01 0.380465360
m4 3.812860e+00 1 5.086054e-02 1.293619014
m5 7.194936e+00 1 7.310960e-03 2.136025588
m6 2.449127e+00 1 1.175903e-01 0.929628598
m7 2.185613e+00 1 1.393056e-01 0.856031566
m8 1.227191e+00 1 2.679539e-01 0.571939852
m9 2.532562e+01 1 4.842353e-07 6.314943565
m10 5.729634e+01 1 3.748518e-14 13.426140380
m11 3.107441e+01 1 2.483233e-08 7.604982503
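The columns of this output are related in a simple way: Pval is the chi-squared tail probability of the deviance, and LogPval is -log10(Pval). The sketch below is not the `logscan` source, which is not shown here; `scan_sketch` is a hypothetical stand-in that produces the same four columns on simulated data.

```r
set.seed(2)
n <- 400
d <- data.frame(m1 = sample(0:2, n, TRUE), m2 = sample(0:2, n, TRUE))
d$y <- rbinom(n, 1, plogis(-1 + 0.8 * d$m2))  # only m2 has an effect here

# Hypothetical scan: one additive logistic fit per marker column
scan_sketch <- function(d, markers) {
  res <- t(sapply(markers, function(m) {
    fit <- glm(d$y ~ d[[m]], family = binomial)
    dev <- fit$null.deviance - fit$deviance   # deviance explained by the marker
    df  <- fit$df.null - fit$df.residual
    p   <- pchisq(dev, df, lower.tail = FALSE)
    c(Deviance = dev, DF = df, Pval = p, LogPval = -log10(p))
  }))
  as.data.frame(res)
}

res <- scan_sketch(d, c("m1", "m2"))
res

# Check against the first row of the output above
pchisq(8.604197, 1, lower.tail = FALSE)   # Pval for m1
-log10(3.353893e-03)                      # LogPval for m1
```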
Multilocus Models
Can test the effects of fitting two or more
markers simultaneously
Several multilocus models are possible
Interaction Model assumes that each
combination of genotypes has a different
effect
eg t$y ~ t$m10 * t$m15
Multi-Locus Models
> f <- glm( t$y ~ genotype(t$m13) * genotype(t$m26) , family='binomial')
> anova(f)
Model: binomial, link: logit
Response: t$y
Terms added sequentially (first to last)
                                Df Deviance Resid. Df Resid. Dev
NULL                                             1017     343.71
genotype(t$m13)                  2   108.68      1015     235.03
genotype(t$m26)                  2     1.14      1013     233.89
genotype(t$m13):genotype(t$m26)  3     6.03      1010     227.86
> pchisq(6.03,2,lower.tail=F)   # calculate p-value
[1] 0.04904584
Note: the interaction row above has Df = 3; with 3 degrees of freedom the p-value for a deviance of 6.03 is about 0.11.
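A sketch of a two-locus interaction fit on simulated data, using plain factors in place of `genotype()`. When all nine genotype combinations are observed, the interaction term takes 4 degrees of freedom; fewer (as in the table above) indicates that some combination is absent from the data. The simulation settings are arbitrary.

```r
set.seed(5)
n  <- 600
g1 <- factor(sample(c("11", "12", "22"), n, replace = TRUE))
g2 <- factor(sample(c("11", "12", "22"), n, replace = TRUE))

# Simulated response: risk is raised only for the 22/22 combination
y <- rbinom(n, 1, plogis(-1 + 0.8 * (g1 == "22") * (g2 == "22")))

f <- glm(y ~ g1 * g2, family = binomial)
a <- anova(f, test = "Chisq")   # sequential table with p-values
a

# Degrees of freedom and deviance of the interaction term
a["g1:g2", c("Df", "Deviance")]
```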
Adding the effects of Sex and other Covariates
Read in sex and other covariate data, eg. age, from a file into variables, say a$sex, a$age
Fit models of the form
fit1 <- glm(t$y ~ additive(t$m10) + a$sex + a$age, family=binomial)
fit2 <- glm(t$y ~ a$sex + a$age, family=binomial)
Adding the effects of Sex and other Covariates
Compare models using anova: test if the effect of the marker m10 is significant after taking into account sex and age
anova(fit1,fit2)
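A self-contained sketch of this comparison on simulated data. The variables `x`, `sex` and `age` are made up and stand in for `additive(t$m10)`, `a$sex` and `a$age`. Listing the smaller model first and passing `test = "Chisq"` gives the deviance and p-value for the marker after adjusting for the covariates.

```r
set.seed(6)
n   <- 500
x   <- sample(0:2, n, replace = TRUE)               # simulated additive marker code
sex <- factor(sample(c("F", "M"), n, replace = TRUE))
age <- round(runif(n, 20, 70))
y   <- rbinom(n, 1, plogis(-2 + 0.5 * x + 0.02 * age))

fit1 <- glm(y ~ x + sex + age, family = binomial)   # marker plus covariates
fit2 <- glm(y ~ sex + age,     family = binomial)   # covariates only

cmp <- anova(fit2, fit1, test = "Chisq")            # marker effect, 1 df
cmp
```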
Multiple Testing
Take care interpreting significance levels when
performing multiple tests
Linkage disequilibrium can reduce the effective number of
independent tests
Permutation is a safe procedure to determine significance
Repeat j=1..N times:
Permute disease status y between individuals
Fit all markers
Record maximum deviance maxdev[j] over all markers
Permutation p-value for a marker is the proportion of times the permuted maximum deviance across all markers exceeds the observed deviance for the marker
logscan(t,permute=1000)   # slow!
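The permutation steps above can be sketched in a few lines of base R. This is not how `logscan` is implemented (its source is not shown here); the data are simulated, the marker matrix `G` and helper `marker_dev` are hypothetical, and N is kept small so the example runs quickly.

```r
set.seed(3)
n <- 200
G <- matrix(sample(0:2, n * 5, replace = TRUE), n, 5,
            dimnames = list(NULL, paste0("m", 1:5)))   # simulated markers
y <- rbinom(n, 1, plogis(-1 + 0.7 * G[, "m3"]))        # only m3 has an effect

# Deviance explained by each marker under the additive model
marker_dev <- function(y, G) {
  apply(G, 2, function(g) {
    fit <- glm(y ~ g, family = binomial)
    fit$null.deviance - fit$deviance
  })
}

obs <- marker_dev(y, G)          # observed deviances

N <- 200                         # number of permutations (small, for speed)
maxdev <- replicate(N, max(marker_dev(sample(y), G)))  # permute y, keep the max

# Proportion of permutations whose maximum reaches the observed deviance
perm_p <- sapply(obs, function(d) mean(maxdev >= d))
perm_p
```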
Haplotype Association
Different from multiple genotype models
Phase taken into account
Haplotype association can be modelled in a similar logistic framework
Treat haplotypes as extended alleles
Fit additive, recessive, dominant & genotype models as before
Eg haplotypes are h = AAGCAT, ATGCTT, etc
y ~ additive(h)
y ~ dominant(h) etc
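Treating a haplotype as an extended allele means counting how many copies of it each individual carries, exactly as for the G allele at a single marker. The sketch below assumes phased data held as two hypothetical vectors `h1` and `h2`, one haplotype per chromosome; the values are made up.

```r
# Hypothetical phased data: two haplotypes per individual
h1 <- c("AAGCAT", "ATGCTT", "AAGCAT", "ATGCTT")
h2 <- c("AAGCAT", "AAGCAT", "ATGCTT", "ATGCTT")

target <- "AAGCAT"                              # haplotype treated as the allele
copies <- (h1 == target) + (h2 == target)       # additive code: 0, 1 or 2 copies

dominant_h  <- as.numeric(copies >= 1)          # at least one copy
recessive_h <- as.numeric(copies == 2)          # two copies
copies
```

These codes can then enter `glm(..., family = binomial)` in the same way as the single-marker codings.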
