Lecture 221 The Essentials of Microarray Data Analysis

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this site.
Copyright 2006, The Johns Hopkins University and Karl Broman. All rights reserved. Use of these materials
permitted only in accordance with license rights granted. Materials provided AS IS; no representations or
warranties provided. User assumes all responsibility for use, and all liability related thereto, and must independently
review all materials for accuracy and efficacy. May contain materials owned by others. User is responsible for
obtaining permissions for use from third parties as needed.

The essentials of microarray data analysis
(from a complete novice)
Thanks to Rafael Irizarry for the slides!

Outline
Experimental design
Take logs!
Pre-processing: affy chips and 2-color arrays
Clustering
Identifying differentially expressed genes

Resources
Expressionists working group (search Google for "hopkins expressionists")
Top ten things to know held monthly
Stats for gene expression course (140.688): http://www.biostat.jhsph.edu/~ririzarr/Teaching/688

Biological question: differentially expressed genes, sample class prediction, etc.
Experimental design
Microarray experiment
16-bit TIFF files
Image analysis
(Rfg, Rbg), (Gfg, Gbg)
Normalization
R, G
Estimation, Testing, Clustering, Discrimination
Biological verification and interpretation

Experimental design
Proper experimental design is needed to ensure that questions of interest can be answered and that this can be done accurately and precisely, given experimental constraints, such as cost of reagents and availability of mRNA.

Experimental design
Choice of platform
Array design
Creation of probes
Location on the array
Controls
Target samples

Preparing mRNA samples:
Mouse model
Dissection of tissue
RNA Isolation
Amplification
Probe labelling
Hybridization
Biological Replicates
Technical replicates

Avoidance of bias
Conditions of an experiment (mRNA extraction and processing, the reagents, the operators, the scanners and so on) can leave a global signature in the resulting expression data.
Balance
Randomization

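The balance and randomization points under "Avoidance of bias" can be sketched in code. This is a minimal illustration, not a procedure taken from the slides: the function name `balanced_random_batches`, the sample labels, and the idea of treating a processing day as a "batch" are all assumptions made for the example. Each group is shuffled (randomization) and then dealt round-robin across batches (balance), so no batch is dominated by one condition.

```python
import random

def balanced_random_batches(groups, n_batches, seed=0):
    """Assign samples to processing batches (hypothetical helper).

    groups: dict mapping a condition label to a list of sample names.
    Samples are shuffled within each condition, then dealt round-robin,
    so every batch receives a near-equal share of every condition.
    """
    rng = random.Random(seed)
    batches = [[] for _ in range(n_batches)]
    for label, samples in groups.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)          # randomization within the condition
        for i, sample in enumerate(shuffled):
            batches[i % n_batches].append((label, sample))  # balance across batches
    return batches

# Illustrative sample names only.
groups = {
    "treated": ["t1", "t2", "t3", "t4"],
    "control": ["c1", "c2", "c3", "c4"],
}
batches = balanced_random_batches(groups, n_batches=2, seed=1)
for day, batch in enumerate(batches, start=1):
    print("day", day, batch)
```

With four samples per condition and two batches, each day processes two treated and two control samples, so a day-specific signature cannot be confused with the treatment effect.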
Technical replication - amplification
3 sets of two different samples performed on different days
#10 and #12 were from the same RNA isolation and amplification
#12 and #18 were from different dissections and amplifications
All 3 data sets were labeled separately before hybridization
Data provided by Dave Lin (Cornell)

Two-color layouts
For two-color platforms it is assumed that the spot/probe effect is too large to trust the absolute intensities. Thus we always use relative measurements.
Vertices: mRNA samples; Edges: hybridization; Direction: dye assignment.
Cy3 sample, Cy5 sample

Experiment for which a number of designs are suitable for use
Time Series
[Figure: design graphs on samples T1, T2, T3, T4 and Ref]

Common reference design
[Figure: samples T1, T2, Tn-1, Tn and Ref]
Experiment for which the common reference design is appropriate:
Meaningful biological control (C). Identify genes that responded differently / similarly across two or more treatments relative to control.
Large scale comparison. To discover tumor subtypes when you have many different tumor samples.
Advantages:
Ease of interpretation.
Extensibility: extend the current study, or compare the results from the current study to other array projects.

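One way to see why the common reference design is easy to interpret: if every sample is hybridized against the same reference, the log-ratio between any two samples is the difference of their log-ratios against the reference, because the reference term cancels. The sketch below uses made-up, noise-free intensities; the function name `indirect_log_ratio` is an assumption for illustration. On real arrays each log-ratio carries measurement error, so the indirect comparison is noisier than this idealized arithmetic suggests.

```python
import math

def indirect_log_ratio(m_a_vs_ref, m_b_vs_ref):
    """log2(A/B) recovered from log2(A/Ref) and log2(B/Ref)."""
    return m_a_vs_ref - m_b_vs_ref

# Hypothetical intensities for one gene.
t1, t2, ref = 800.0, 200.0, 400.0

m1 = math.log2(t1 / ref)   # T1 relative to the reference
m2 = math.log2(t2 / ref)   # T2 relative to the reference

indirect = indirect_log_ratio(m1, m2)
direct = math.log2(t1 / t2)

print(m1, m2, indirect, direct)
```

The same subtraction works for a sample added later, which is the extensibility advantage listed above, provided the same reference is used.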
Pooling: looking at very small amount of tissues
Mouse model
Dissection of tissue
RNA Isolation
Pooling
Probe labelling
Hybridization

Pooling
Should I pool mRNA samples across subjects in an effort to reduce the effect of biological variability?
Notice, in many cases, samples are cheap but arrays are expensive.

Problem with pooling everything
You cannot measure variability
You cannot take logs before averaging

Alternative pooling strategy
Instead of pooling everything, how about pooling groups?
For example, will I obtain the same results with 12 individuals on 12 chips as with 12 individuals on 4 chips?

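On the point that you cannot take logs before averaging: a pooled sample mixes mRNA physically, so the array sees something close to the average of the raw abundances, and the log is taken afterwards. The sketch below assumes the idealization that the pooled intensity equals the arithmetic mean of the individual intensities; the numbers are invented. It compares the log of the pooled value with the average of the individual logs, which is what separate chips would give.

```python
import math
import statistics

# Hypothetical raw intensities for one gene in four individuals;
# the last individual is an outlier.
individuals = [100.0, 100.0, 100.0, 1600.0]

# Separate chips: take logs first, then average.
mean_of_logs = statistics.mean(math.log2(x) for x in individuals)

# Pooled chip (idealized): average on the raw scale, then take the log.
pooled_intensity = statistics.mean(individuals)
log_of_mean = math.log2(pooled_intensity)

print("mean of logs:", round(mean_of_logs, 3))
print("log of mean :", round(log_of_mean, 3))
```

The log of the mean is never smaller than the mean of the logs, and the gap grows with the spread of the individuals, so a single outlier pulls a pooled measurement much harder than it would pull an average taken on the log scale.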
Pooled vs Individual samples
Design 1
Design 2
Taken from Kendziorski et al. (2003)

More on Pooling and outliers

Images of probe level data
This is the raw data

Images of probe level data
Log scale version much more informative

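A small numeric sketch of why the log-scale image is more informative. The intensities are invented, and the helper `rescale` is an assumption for the example: it maps values to the 0..1 range, which is roughly what an image display does when assigning gray levels. On the raw scale a single bright probe uses up almost the whole range; on the log2 scale the dim probes are spread out and become distinguishable.

```python
import math

def rescale(values):
    """Map values linearly onto [0, 1] (hypothetical display helper)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical probe intensities within the 16-bit range.
raw = [64.0, 128.0, 256.0, 512.0, 32768.0]

raw_scaled = rescale(raw)
log_scaled = rescale([math.log2(v) for v in raw])

for r, a, b in zip(raw, raw_scaled, log_scaled):
    print(f"{r:8.0f}  raw scale: {a:.3f}  log scale: {b:.3f}")
```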
Work flow
Raw data (.DAT files)
Image analysis
Probe intensities (.CEL files)
Pre-processing, normalization
Expression measures (tables)
Statistical test
Rank (list)
Choose filter
Significance level
Candidate genes (short list)

Affymetrix GeneChip Design
Reference sequence: TGTGATGGTGGGGAATGGGTCAGAAGGCCTCCGATGCGCCGATTGAGAAT
Perfect match: CCCTTACCCAGTCTTCCGGAGGCTA
Mismatch: CCCTTACCCAGTGTTCCGGAGGCTA
NSB & SB
NSB

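The relationship between the three sequences on the GeneChip design slide can be checked directly. The sketch below uses only the sequences shown above; the helper names `complement` and `mismatch_probe` are assumptions for illustration. It confirms that the perfect-match probe is the base-wise complement of a 25-base stretch of the reference, and that the mismatch probe differs only at the middle (13th) base, which is replaced by its complement.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(seq):
    """Base-wise complement of a DNA string (no reversal)."""
    return "".join(COMPLEMENT[base] for base in seq)

def mismatch_probe(perfect_match):
    """Flip the middle base of a probe to its complement."""
    mid = len(perfect_match) // 2
    return perfect_match[:mid] + COMPLEMENT[perfect_match[mid]] + perfect_match[mid + 1:]

reference = "TGTGATGGTGGGGAATGGGTCAGAAGGCCTCCGATGCGCCGATTGAGAAT"
pm = "CCCTTACCCAGTCTTCCGGAGGCTA"

mm = mismatch_probe(pm)
target_start = reference.find(complement(pm))  # where the probe pairs with the reference

print("MM probe        :", mm)
print("target starts at:", target_start)
```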
Affy Technical Replicates
Boxplot
Different scanners were used.

Self-self hybridization
log2 R vs. log2 G

M vs. A
$M = \log_2 R - \log_2 G$, $A = (\log_2 R + \log_2 G)/2$

M vs. A
Robust local regression within sectors (print-tip-groups) of intensity log-ratio M on average log-intensity A.

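The M and A definitions translate directly into code. This is a minimal sketch with made-up red and green intensities; the function name `ma_transform` is an assumption. It does not perform the robust local regression mentioned above, it only computes the two coordinates that such a regression would be fitted on.

```python
import math

def ma_transform(red, green):
    """Return (M, A) lists for paired red/green intensities.

    M = log2(R) - log2(G)        (log-ratio)
    A = (log2(R) + log2(G)) / 2  (average log-intensity)
    """
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [(math.log2(r) + math.log2(g)) / 2 for r, g in zip(red, green)]
    return m, a

# Hypothetical spot intensities.
red = [8.0, 4.0, 256.0]
green = [2.0, 4.0, 1024.0]

M, A = ma_transform(red, green)
print("M:", M)
print("A:", A)
```

In a self-self hybridization every M should be near 0 at every A, so any trend of M against A in such data reflects the measurement process rather than biology.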
Boxplots by print-tip-group
[Figure: boxplots of intensity log-ratio, M]

What can we do?
Throw away the data and start again? Maybe.
Statistics offers hope:
Use control genes to adjust
Assume most genes are not differentially expressed
Assume the distributions of expression are the same

Simplest Idea
Assume all arrays have the same median log expression or relative log expression
Subtract median from each array
In cDNA this makes the median log ratio 0 (we assume there are as many over-expressed as under-expressed)
In Affy we usually add a constant that takes us back to the original range
Notice subtracting in the original scale won't work well

How does it work

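A minimal sketch of the median idea, under stated assumptions: the function name `median_normalize` and its `target` argument are invented for the example, and the two "arrays" are made-up intensities where the second is exactly twice as bright as the first. With `target=0.0` the result corresponds to the cDNA case (median log ratio 0); passing another constant corresponds to shifting back to the original range as described for Affy. The last lines show what happens if the median is subtracted on the raw scale instead.

```python
import math
import statistics

def median_normalize(log_arrays, target=0.0):
    """Shift each array of log values so its median equals `target`."""
    normalized = []
    for values in log_arrays:
        shift = target - statistics.median(values)
        normalized.append([v + shift for v in values])
    return normalized

# Hypothetical raw intensities; array B is a 2x brighter copy of array A.
raw_a = [64.0, 256.0, 1024.0]
raw_b = [128.0, 512.0, 2048.0]

log_a = [math.log2(v) for v in raw_a]
log_b = [math.log2(v) for v in raw_b]
norm_a, norm_b = median_normalize([log_a, log_b])

# Subtracting the median on the raw scale leaves the arrays mismatched.
raw_shift_a = [v - statistics.median(raw_a) for v in raw_a]
raw_shift_b = [v - statistics.median(raw_b) for v in raw_b]

print("log scale :", norm_a, norm_b)
print("raw scale :", raw_shift_a, raw_shift_b)
```

A brightness difference between arrays is multiplicative, so it becomes a constant offset only after taking logs; that is why the subtraction works on the log scale and not on the original scale.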
How does it work
The medians match but there are some discrepancies

What are the consequences
These are two technical replicates with spike-ins.

Quantile normalization
All these non-linear methods perform similarly
Quantiles is my favorite because it's fast
Basic idea:
Order the values in each array
Take the average across arrays of the probes at each rank
Substitute each probe intensity with the average for its rank
Put the values back in the original order

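The four steps of the basic idea can be sketched in a few lines. This is an illustrative implementation under simplifying assumptions, not the code used for the slides: the function name `quantile_normalize` is invented, every array must have the same number of probes, and tied values are simply ranked in their order of appearance.

```python
def quantile_normalize(arrays):
    """Give every array the same distribution of values.

    arrays: list of equal-length lists, one list of probe values per array.
    """
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all arrays must have the same number of probes")

    # Step 1: order the values in each array.
    sorted_arrays = [sorted(a) for a in arrays]

    # Step 2: average across arrays at each rank.
    rank_means = [
        sum(s[rank] for s in sorted_arrays) / len(arrays)
        for rank in range(n_probes)
    ]

    # Steps 3 and 4: substitute the rank average, keeping the original order.
    normalized = []
    for a in arrays:
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = rank_means[rank]
        normalized.append(out)
    return normalized

# Two hypothetical arrays with three probes each.
result = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
print(result)
```

After normalization both arrays contain exactly the same set of values, only assigned to different probes, which is a stronger requirement than matching medians.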
How does it work
Before and after quantile normalization

Does it wash away real differential expression?
These are two technical replicates. Only the red are differentially expressed

Clustering is not a universally good tool

Work flow
Raw data (.DAT files)
Image analysis
Probe intensities (.CEL files)
Pre-processing, normalization
Expression measures (tables)
Statistical test
Rank (list)
Choose filter
Significance level
Candidate genes (short list)

Differential gene expression
Identify genes whose expression levels are associated with a response or covariate of interest:
clinical outcome (e.g. survival, response to treatment, tumor class)
covariate such as treatment, dose, time.
Estimation: In a statistical framework, assigning a score can be viewed as estimating an effect of interest (e.g. difference in means, slope, interaction). We can also take the variability of these estimates into account.
Testing: In a statistical framework, deciding on a cut-off can be viewed as an assessment of the statistical significance of the observed associations.

Example: Two populations
A common problem is to find genes that are differentially expressed in two populations.
Many method papers appear in both statistical and molecular biology literature.
The proposed scores range from:
ad-hoc summaries of fold-change
variants on the t-test
and posterior means obtained from Bayesian or empirical Bayes methods.
What's the difference? Mainly the way in which the variation within population is incorporated

Should we consider variability?

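To make the question of variability concrete, the sketch below scores two invented genes in two ways: by log fold change alone, and by a Welch-type t-statistic, which divides the same difference by an estimate of its standard error. The function names and the log2 expression values are assumptions for illustration; this is one member of the "variants on the t-test" family, not the specific score recommended in the slides.

```python
import math
import statistics

def log_fold_change(group_x, group_y):
    """Difference in mean log expression (y minus x)."""
    return statistics.mean(group_y) - statistics.mean(group_x)

def welch_t(group_x, group_y):
    """Fold change scaled by its estimated standard error."""
    se = math.sqrt(
        statistics.variance(group_x) / len(group_x)
        + statistics.variance(group_y) / len(group_y)
    )
    return log_fold_change(group_x, group_y) / se

# Hypothetical log2 expression values, three replicates per population.
quiet_x, quiet_y = [7.9, 8.0, 8.1], [8.9, 9.0, 9.1]
noisy_x, noisy_y = [6.0, 8.0, 10.0], [7.0, 9.0, 11.0]

m_quiet = log_fold_change(quiet_x, quiet_y)
m_noisy = log_fold_change(noisy_x, noisy_y)
t_quiet = welch_t(quiet_x, quiet_y)
t_noisy = welch_t(noisy_x, noisy_y)

print("fold change:", round(m_quiet, 3), round(m_noisy, 3))
print("t-statistic:", round(t_quiet, 2), round(t_noisy, 2))
```

Both genes show the same two-fold change, so a fold-change ranking cannot separate them, while the t-statistic ranks the consistent gene far above the noisy one. With only a few replicates the variance estimates are themselves unstable, which is part of the motivation for the Bayesian and empirical Bayes scores mentioned above.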
Useful Plots
The MA plot shows M=log fold change, plotted against A=average log intensity
If we have various replicates in each population we can plot M=difference in log averages in two populations
The volcano plot shows, for a particular test, -log p-value against the effect size (M)

Example
If you insist on using MAS 5.0 it really helps

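The coordinates of a volcano plot are simple to compute once each gene has an effect size and a p-value. The sketch below assumes base-10 logs for the p-value axis, which is a common convention but is not stated in the slides; the function name `volcano_points` and the numbers are invented, and the p-values are taken as given rather than computed.

```python
import math

def volcano_points(effects, p_values):
    """Return (M, -log10 p) pairs, one per gene."""
    points = []
    for m, p in zip(effects, p_values):
        if not 0.0 < p <= 1.0:
            raise ValueError("p-values must lie in (0, 1]")
        points.append((m, -math.log10(p)))
    return points

# Hypothetical effect sizes (log fold changes) and p-values for three genes.
effects = [1.5, -0.2, -2.0]
p_values = [0.001, 0.5, 0.01]

points = volcano_points(effects, p_values)
for m, y in points:
    print(f"M = {m:+.1f}   -log10(p) = {y:.2f}")
```

Genes in the upper left and upper right corners have both a large effect and a small p-value, which is why the plot is useful for choosing a short list.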
Summary
Experimental design is critical
Biological vs technical replicates
Pre-processing (calibration) can have an enormous effect on the end results
Analysis should be guided not by what others have done, but by what you want to learn.
Regarding multiple testing: view this process as exploratory, and don't get hung up on adjustments for multiple testing.
