Sei sulla pagina 1di 31

Association Analysis,

Logistic Regression,
R and S-PLUS
Richard Mott
http://bioinformatics.well.ox.ac.uk/lectures/

Logistic Regression in Statistical


Genetics
Applicable to Association Studies
Data:
Binary outcomes (eg disease status)
Dependent on genotypes [+ sex, environment]

Aim is to identify which factors influence the


outcome
Rigorous tests of statistical significance
Flexible modelling language
Generalisation of Chi-Squared Test

What is R ?

Statistical analysis package


Free
Similar to commercial package S-PLUS
Runs on Unix, Windows, Mac
www.r-project.org
Many packages for statistical genetics,
microarray analysis available in R
Easily Programmable

Modelling in R
Data for individual labelled i=1n:
Response yi
Genotypes gij for markers j=1..m

Coding Unphased Genotypes


Several possibilities:
AA, AG, GG original genotypes
12, 21, 22
1, 2, 3
0, 1, 2 # of G alleles

Missing Data
NA default in R

Using R
Load genetic logistic regression tools
> source(logistic.R)

Read data table from file


> t <- read.table(geno.dat,
header=TRUE)

Column names
names(t)
t$y response (0,1)
t$m1, t$m2, . Genotypes for each marker

Contigency Tables in R
ftable(t$y,t$m31) prints the contingency table
> ftable(t$y,t$m31)
11 12 22
0
1
>

515 387
28 11

75
2

Chi-Squared Test in R
> chisq.test(t$y,t$m31)
Pearson's Chi-squared test
data: t$y and t$m31
X-squared = 3.8424, df = 2, p-value = 0.1464
Warning message:
Chi-squared approximation may be incorrect in:
chisq.test(t$y, t$m31)
>

The Logistic Model


Prob(Yi=0) = exp(iexp(i))
i = j xij bj - Linear Predictor
xij Design Matrix (genotypes etc)
bj Model Parameters (to be estimated)
Model is investigated by
estimating the bjs by maximum likelihood
testing if the estimates are different from 0

The Logistic Function


Prob(Yi=0) = exp(iexp(i))

Prob(Y=0)

Types of genetic effect at a single


locus
AA

AG

GG

Recessive

Dominant

Additive

Genotype

Additive Genotype Model


Code genotypes as
AA
AG
GG

x=0,
x=1,
x=2

Linear Predictor
= b0 + xb1

P(Y=0|x) = exp(b0 + xb1)/(1+exp(b0 + xb1))


PAA = P(Y=0|x=0) = exp(b0)/(1+exp(b0))
PAG = P(Y=0|x=1) = exp(b0 + b1)/(1+exp(b0 + b1))
PGG = P(Y=0|x=2) = exp(b0 + 2b1)/(1+exp(b0 + 2b1))

Additive Model: b0 = -2 b1 = 2
PAA = 0.12 PAG = 0.50 PGG = 0.88

Prob(Y=0)

Additive Model: b0 = 0 b1 = 2
PAA = 0.50 PAG = 0.88 PGG = 0.98

Prob(Y=0)

Recessive Model
Code genotypes as
AA
AG
GG

x=0,
x=0,
x=1

Linear Predictor
= b0 + xb1

P(Y=0|x) = exp(b0 + xb1)/(1+exp(b0 + xb1))


PAA = PAG = P(Y=0|x=0) = exp(b0)/(1+exp(b0))
PGG = P(Y=0|x=1) = exp(b0 + b1)/(1+exp(b0 + b1))

Recessive Model: b0 = 0 b1 = 2
PAA = PAG = 0.50 PGG = 0.88

Prob(Y=0)

Genotype Model

Each genotype has an independent probability


Code genotypes as (for example)
AA
AG
GG

x=0, y=0
x=1, y=0
x=0, y=1

Linear Predictor

= b0 + xb1+yb2 two parameters

P(Y=0|x) = exp(b0 + xb1+yb2)/(1+exp(b0 + xb1+yb2))


PAA = P(Y=0|x=0,y=0) = exp(b0)/(1+exp(b0))
PAG = P(Y=0|x=1,y=0) = exp(b0 + b1)/(1+exp(b0 + b1))
PGG = P(Y=0|x=0,y=1) = exp(b0 + b2)/(1+exp(b0 + b2))

Genotype Model: b0 = 0 b1 = 2 b2 = -1
PAA = 0.5 PAG = 0.88 PGG = 0.27

Prob(Y=0)

Models in R
response y
genotype g
AA

AG

GG

model

DF

Recessive

y ~ dominant(g)

Dominant

y ~ recessive(g)

Additive

y ~ additive(g)

Genotype

y ~ genotype(g)

Data Transformation
g <- t$m1
use these functions to treat a genotype
vector in a certain way:
a
r
d
g

<<<<-

additive(g)
recessive(g)
dominant(g)
genotype(g)

Fitting the Model

afit
rfit
dfit
gfit

<<<<-

glm(
glm(
glm(
glm(

t$y
t$y
t$y
t$y

~
~
~
~

additive(g),family=binomial)
recessive(g),family=binomial)
dominant(g),family=binomial)
genotype(g),family=binomial)

Equivalent models:
genotype = dominant + recessive
genotype = additive + recessive
genotype = additive + dominant
genotype ~ standard chi-squared test of genotype
association

Parameter Estimates
> summary(glm( t$y ~ genotype(t$m31), family='binomial'))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
b0 (Intercept)
-2.9120
0.1941 -15.006
<2e-16 ***
b1 genotype(t$m31)12 -0.6486
0.3621 -1.791
0.0733 .
b2 genotype(t$m31)22 -0.7124
0.7423 -0.960
0.3372
--Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
>

Analysis of Deviance
Chi-Squared Test
> anova(glm( t$y ~ genotype(t$m31), family='binomial'))
Analysis of Deviance Table
Model: binomial, link: logit
Response: t$y
Terms added sequentially (first to last)

NULL
genotype(t$m31)

Df Deviance Resid. Df Resid. Dev


1017
343.71
2
3.96
1015
339.76

Model Comparison
Compare general model with additive,
dominant or recessive models:
> afit <- glm(t$y ~ additive(t$m20))
> gfit <- glm(t$y ~ genotype(t$m20))
> anova(afit,gfit)
Analysis of Deviance Table
Model 1: t$y ~ additive(t$m20)
Model 2: t$y ~ genotype(t$m20)
Resid. Df Resid. Dev
Df Deviance
1
1016
38.301
2
1015
38.124
1
0.177
>

Scanning all Markers


> logscan(t,model=additive)
Deviance DF
Pval
LogPval
m1 8.604197e+00 1 3.353893e-03 2.474450800
m2 7.037336e+00 1 7.982767e-03 2.097846522
m3 6.603882e-01 1 4.164229e-01 0.380465360
m4 3.812860e+00 1 5.086054e-02 1.293619014
m5 7.194936e+00 1 7.310960e-03 2.136025588
m6 2.449127e+00 1 1.175903e-01 0.929628598
m7 2.185613e+00 1 1.393056e-01 0.856031566
m8 1.227191e+00 1 2.679539e-01 0.571939852
m9 2.532562e+01 1 4.842353e-07 6.314943565
m10 5.729634e+01 1 3.748518e-14 13.426140380
m11 3.107441e+01 1 2.483233e-08 7.604982503

Multilocus Models
Can test the effects of fitting two or more
markers simultaneously
Several multilocus models are possible
Interaction Model assumes that each
combination of genotypes has a different
effect
eg t$y ~ t$m10 * t$m15

Multi-Locus Models
> f <- glm( t$y ~ genotype(t$m13) * genotype(t$m26) , family='binomial')
> anova(f)
Analysis of Deviance Table
Model: binomial, link: logit
Response: t$y
Terms added sequentially (first to last)

NULL
genotype(t$m13)
genotype(t$m26)
genotype(t$m13):genotype(t$m26)

Df Deviance Resid. Df Resid. Dev


1017
343.71
2
108.68
1015
235.03
2
1.14
1013
233.89
3
6.03
1010
227.86

> pchisq(6.03,2,lower.tail=F) calculate p-value


[1] 0.04904584

Adding the effects of Sex and other


Covariates
Read in sex and other covariate data, eg.
age from a file into variables, say a$sex,
a$age
Fit models of the form

fit1 <- glm(t$y ~ additive(t$m10) + a$sex + a$age, family=binomial)

fit2 <- glm(t$y ~ a$sex + a$age, family=binomial)

Adding the effects of Sex and other


Covariates
Compare models using anova test if the effect
of the marker m10 is significant after taking into
account sex and age

anova(fit1,fit2)

Multiple Testing
Take care interpreting significance levels when
performing multiple tests
Linkage disequilibrium can reduce the effective number of
independent tests
Permutation is a safe procedure to determine significance
Repeat j=1..N times:
Permute disease status y between individuals
Fit all markers
Record maximum deviance maxdev[j] over all markers

Permutation p-value for a marker is the proportion of


times the permuted maximum deviance across all
markers exceeds the observed deviance for the marker
logscan(t,permute=1000) slow!

Haplotype Association
Haplotype Association
Different from multiple genotype models
Phase taken into account
Haplotype association can be modelled in a similar logistic
framework

Treat haplotypes as extended alleles


Fit additive, recessive, dominant & genotype models as
before
Eg haplotypes are h = AAGCAT, ATGCTT, etc
y ~ additive(h)
y ~ dominant(h) etc

Potrebbero piacerti anche