490
Lecture #2
Feb. 26, 2004
DNA Sequence Comparison & Alignment
Chris Burge
Review of Lecture 1: “Genome Sequencing & DNA Sequence Analysis”
• Flavors of BLAST
- BLAST[PNX], TBLAST[NX]
[Figure: shotgun sequencing. 200 kb (NIH) / 3 Gb (Celera) → sonicate, subclone → subclones → sequence, assemble → shotgun contigs]
What would cause problems with assembly?
DNA Sequence Alignment IV
[Figure: two panels, A and B (plots not recovered)]
For a normal distribution with mean $m$ and variance $\sigma^2$, the height of the curve is described by
$Y = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-m)^2}{2\sigma^2}\right]$
For an extreme value distribution, the height of the curve is described by
$Y = \exp[-x - e^{-x}]$ …and $P(S > x) = 1 - \exp[-e^{-\lambda(x-u)}]$, where $u = (\ln Kmn)/\lambda$
Can show that the mean extreme score is ~ $\log_2(nm)$, and the probability of getting a score that exceeds some number of “standard deviations” $x$ is:
$P(S > x) \sim Kmn\,e^{-\lambda x}$. ***K and λ are tabulated for different matrices***
For the less statistically inclined: $E \sim Kmn\,e^{-\lambda S}$
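The two formulas above can be tried numerically. The sketch below is only an illustration: the function names and the values chosen for K, λ, m and n are assumptions made here, not parameters taken from the lecture or from any published table.

```python
import math

def evalue(score, m, n, K, lam):
    """Expected number of chance hits with score >= `score`: E ~ K*m*n*exp(-lam*score)."""
    return K * m * n * math.exp(-lam * score)

def pvalue(score, m, n, K, lam):
    """Extreme value tail P(S > score) = 1 - exp(-E), using u = ln(K*m*n)/lam."""
    # exp(-lam*(score - u)) equals K*m*n*exp(-lam*score), i.e. the E-value.
    return -math.expm1(-evalue(score, m, n, K, lam))

# Hypothetical inputs, for illustration only.
K, lam = 0.7, 1.37        # assumed; real values are tabulated per scoring matrix
m, n = 100, 1_000_000     # assumed query length and database length

for s in (15, 20, 25):
    print(s, evalue(s, m, n, K, lam), pvalue(s, m, n, K, lam))
```

For small E the two quantities nearly coincide, which is why E is often read as if it were a probability.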
DNA Sequence Comparison & Alignment
$\sum_{i,j} p_i p_j e^{\lambda s_{ij}} = 1$

$s_{ij}$:
     j:  A  C  G  T
i:  A    1  m  m  m
    C    m  1  m  m
    G    m  m  1  m
    T    m  m  m  1

Then $m = s_{ij} = s_{ij}/s_{ii} = \dfrac{\ln(q_{ij}/p_i p_j)/\lambda}{\ln(q_{ii}/p_i p_i)/\lambda}$ $(i \neq j)$
⇒ $m = \ln(4(1-r)/3)/\ln(4r)$
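The last step can be spelled out. The derivation below assumes, since the slide does not state it, that all four bases are equally likely ($p_i = 1/4$), that $r$ is the fraction of identical positions in the alignments being targeted, and that mismatches are spread evenly over the 12 off-diagonal pairs.

```latex
\begin{align*}
q_{ii} &= \frac{r}{4}, &
\frac{q_{ii}}{p_i p_i} &= \frac{r/4}{1/16} = 4r, \\
q_{ij} &= \frac{1-r}{12} \quad (i \neq j), &
\frac{q_{ij}}{p_i p_j} &= \frac{(1-r)/12}{1/16} = \frac{4(1-r)}{3}, \\
m &= \frac{\ln\bigl(4(1-r)/3\bigr)}{\ln(4r)}.
\end{align*}
```

Under the same assumptions, $s_{ii} = 1$ gives $\lambda = \ln(4r)$, and the normalization holds: $4 \cdot \tfrac{1}{16} \cdot 4r + 12 \cdot \tfrac{1}{16} \cdot \tfrac{4(1-r)}{3} = r + (1-r) = 1$.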
DNA Sequence Alignment IX
m = ln(4(1-r)/3)/ln(4r)
Examples:
m -1 -2 -3
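A short numerical sketch of the formula follows. It assumes that r is the target fraction of identical positions and that base frequencies are uniform at 1/4; the function names and the three r values are choices made here for illustration, not values given on the slide.

```python
import math

def mismatch_score(r):
    """Mismatch score m (match score fixed at 1) for target identity r, 0.25 < r < 1."""
    if not 0.25 < r < 1.0:
        raise ValueError("r must lie strictly between 0.25 and 1")
    return math.log(4.0 * (1.0 - r) / 3.0) / math.log(4.0 * r)

def normalization(r):
    """Evaluate sum_ij p_i p_j exp(lambda * s_ij) with p_i = 1/4 and lambda = ln(4r)."""
    lam = math.log(4.0 * r)
    m = mismatch_score(r)
    matches = 4 * (1 / 16) * math.exp(lam * 1.0)    # 4 diagonal cells, score 1
    mismatches = 12 * (1 / 16) * math.exp(lam * m)  # 12 off-diagonal cells, score m
    return matches + mismatches

for r in (0.75, 0.95, 0.99):   # illustrative identity levels (assumed)
    print(f"r = {r:.2f}  m = {mismatch_score(r):.2f}  sum = {normalization(r):.6f}")
```

These three identity levels give m of roughly -1, -2 and -3, so a more negative mismatch score corresponds to targeting more highly conserved alignments.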
[Screenshot: nucleotide-nucleotide BLAST (BLASTN) web server, “Other advanced” options]
DNA Sequence Alignment X
… DNA …
Transcription (N)
pre-mRNA
Splicing (N)
mRNA AAAAAAA
Translation (C)
Protein
Typical Human Gene Statistics
• PipMaker: applications to
- human/mouse exon finding
- human/mouse regulatory region finding
Schwartz, Scott, Zheng Zhang, Kelly A. Frazer, Arian Smit, Cathy Riemer, John Bouck, Richard Gibbs,
Ross Hardison, and Webb Miller. "PipMaker--A Web Server for Aligning Two Genomic DNA Sequences."
Genome Res. 10 (April 2000): 577-586.
Application of PipMaker #1 - finding human/mouse exons
A Computational Biology Paradigm
Loots, GG, RM Locksley, CM Blankespoor, ZE Wang, W Miller, EM Rubin, and KA Frazer. "Identification of A
Coordinate Regulator of Interleukins 4, 13, and 5 by Cross-species Sequence Comparisons." Science 288, no. 5463
(7 April 2000): 136-40.
Effects on Transcription of Deleting CNS-1 region
“Phylogenetic Shadowing”
Yeast Genome
A 1/3 T 1/3
C 1/6 G 1/6
Neyman-Pearson Lemma:
Optimal decision rules are of the form R > C
Pos -3 -2 -1 +1 +2 +3 +4 +5 +6
A 0.3 0.6 0.1 0.0 0.0 0.4 0.7 0.1 0.1
S = S1 S2 S3 S4 S5 S6 S7 S8 S9
Statistical Independence
5’ splice signal
Con:  C    A    G    …   G    T
Pos  -3   -2   -1    …  +5   +6
A    0.3  0.6  0.1   …  0.1  0.1
C    0.4  0.1  0.0   …  0.1  0.2
G    0.2  0.2  0.8   …  0.8  0.2

Background
Pos  Generic
A    0.25
C    0.25
G    0.25
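S = S1 S2 S3 S4 S5 S6 S7 S8 S9
Odds ratio: $R = \dfrac{P(S|+)}{P(S|-)} = \dfrac{P_{-3}(S_1)\,P_{-2}(S_2)\,P_{-1}(S_3) \cdots P_{5}(S_8)\,P_{6}(S_9)}{P_{bg}(S_1)\,P_{bg}(S_2)\,P_{bg}(S_3) \cdots P_{bg}(S_8)\,P_{bg}(S_9)} = \prod_{k=1}^{9} \dfrac{P_{-4+k}(S_k)}{P_{bg}(S_k)}$
Score: $s = \log_2 R = \sum_{k=1}^{9} \log_2 \dfrac{P_{-4+k}(S_k)}{P_{bg}(S_k)}$
(The position labels skip 0, so the subscript $-4+k$ is exact only for $k \le 3$; for $k \ge 4$ the position is $-3+k$. Read $P_{-4+k}$ as column $k$ of the signal matrix.)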
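The score can be computed directly from a position-specific table. In this sketch the A row and the C and G entries at positions -3, -2, -1, +5 and +6 follow the tables above; every other entry, including the whole T row and the T background value, is a hypothetical fill-in chosen so that each column sums to 1. The names `SIGNAL`, `BACKGROUND` and `log_odds_score` are likewise illustrative.

```python
import math

# Columns are positions -3 -2 -1 +1 +2 +3 +4 +5 +6 (window index k = 0..8).
# Entries not shown in the lecture tables are HYPOTHETICAL fill-ins.
SIGNAL = {
    "A": [0.3, 0.6, 0.1, 0.0, 0.0, 0.4, 0.7, 0.1, 0.1],
    "C": [0.4, 0.1, 0.0, 0.0, 0.0, 0.1, 0.1, 0.1, 0.2],
    "G": [0.2, 0.2, 0.8, 1.0, 0.0, 0.4, 0.1, 0.8, 0.2],
    "T": [0.1, 0.1, 0.1, 0.0, 1.0, 0.1, 0.1, 0.0, 0.5],
}
# Generic background; the T value is assumed equal to the listed A, C, G values.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def log_odds_score(window):
    """Return s = log2 R for a 9-base window; -inf if a base is impossible under the signal."""
    window = window.upper()
    if len(window) != 9:
        raise ValueError("window must contain exactly 9 bases")
    s = 0.0
    for k, base in enumerate(window):
        p = SIGNAL[base][k]
        if p == 0.0:
            return float("-inf")   # log2(0): the odds ratio R is 0
        s += math.log2(p / BACKGROUND[base])
    return s

print(log_odds_score("CAGGTAAGT"))   # the consensus-like 9-mer
print(log_odds_score("CAGATAAGT"))   # A at +1 has signal probability 0
```

Summing logs instead of multiplying ratios relies on the independence assumption between positions, and it keeps the arithmetic stable for longer motifs.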
Neyman-Pearson Lemma:
Optimal decision rules are of the form R > C
ttgacctagatgagatgtcgttcacttttactgagctacagaaaa
……
Assign score to each 9 base window.
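A minimal sketch of the scan, combined with a decision rule of the form R > C, is shown below. The matrix entries that the lecture tables do not give are hypothetical fill-ins, as are the cutoff, the function names and the second test sequence; the code also assumes the sequence contains only a, c, g and t.

```python
import math

# Columns are positions -3 -2 -1 +1 +2 +3 +4 +5 +6.
# Entries not given in the lecture tables are HYPOTHETICAL fill-ins.
SIGNAL = {
    "A": [0.3, 0.6, 0.1, 0.0, 0.0, 0.4, 0.7, 0.1, 0.1],
    "C": [0.4, 0.1, 0.0, 0.0, 0.0, 0.1, 0.1, 0.1, 0.2],
    "G": [0.2, 0.2, 0.8, 1.0, 0.0, 0.4, 0.1, 0.8, 0.2],
    "T": [0.1, 0.1, 0.1, 0.0, 1.0, 0.1, 0.1, 0.0, 0.5],
}
BG = 0.25  # generic background, assumed the same for all four bases

def score(window):
    """log2 odds ratio of one 9-base window (signal vs. background)."""
    total = 0.0
    for k, base in enumerate(window.upper()):
        p = SIGNAL[base][k]
        if p == 0.0:
            return float("-inf")
        total += math.log2(p / BG)
    return total

def window_scores(seq, width=9):
    """Score every window of `width` bases, sliding one base at a time."""
    return [(i, seq[i:i + width], score(seq[i:i + width]))
            for i in range(len(seq) - width + 1)]

def predicted_sites(seq, cutoff):
    """Keep windows with log2 R > cutoff, i.e. R > C with cutoff = log2 C."""
    return [(i, w, s) for i, w, s in window_scores(seq) if s > cutoff]

lecture_seq = "ttgacctagatgagatgtcgttcacttttactgagctacagaaaa"
demo_seq = "ttgacc" + "caggtaagt" + "ctacag"   # hypothetical, with an embedded site

for name, seq in (("lecture", lecture_seq), ("demo", demo_seq)):
    print(name, [(i, w, round(s, 2)) for i, w, s in predicted_sites(seq, cutoff=0.0)])
```

Raising the cutoff trades missed true sites for fewer decoys; the value 0.0 used here is arbitrary.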
[Figure: “decoy” 5’ splice sites vs. true 5’ splice sites]