Molecular evolution
The growing number of completely sequenced organisms, and the importance of the evolutionary processes that shape species history, have increased interest in studying molecular evolution events at the sequence level.
[Figure: Applications. Processes acting on a species genome: expansion (duplication, HGT), exchange (HGT), deletion (loss), genesis, phylogeny, and selection.]
[Figure: duplication and speciation events shown on a species tree and on a gene tree; original version and actual version.]
Hurles M (2004) Gene Duplication: The Genomic Trade in Spare Parts. PLoS Biol 2(7): e206.
Orthologs: A1 vs A2 and B1 vs B2
[Figure: sequence analysis of genes A1 and B1 in Species-1 (S1) and genes A2 and B2 in Species-2 (S2).]
Applications:
Molecular evolution analysis has clarified the evolutionary relationships between humans and other primates, and the origins of AIDS.
GACGACCATAGACCAGCA TAG GACTACCATAGACTGCAAAG
Changes may give rise to new genes, which become fixed if they give the organism a selective advantage.
Neutral theory
The majority of evolutionary change at the molecular level is caused by random genetic drift of mutations that are selectively neutral or nearly neutral. The theory describes cases in which selection (purifying or positive) is not strong enough to outweigh random events.
Neutral mutation is an ongoing process that gives rise to genetic polymorphisms; changes in the environment can select for some of these alleles.
Positive selection
Positive selection is Darwinian selection that fixes advantageous mutations.
The term is used interchangeably with molecular adaptation and adaptive molecular evolution. Positive selection can be shown to play a role in some evolutionary events. This is demonstrated at the molecular level if the rate of nonsynonymous mutation at a site is greater than the rate of synonymous mutation. Most substitution rates are determined by either neutral evolution or purifying selection against deleterious mutations.
We observe, and try to decode, the process of molecular evolution from the differences accumulated among related genes from one or several organisms.
DNA yields more phylogenetic information than proteins. The nucleotide sequences of a pair of homologous genes have a higher information content than the amino acid sequences of the corresponding proteins, because mutations that result in synonymous changes alter the DNA sequence but do not affect the amino acid sequence.
[Figure: types of nucleotide substitution between two sequences and the differences actually observed. Panels include: 1 change, 1 difference; 2 changes, 1 difference; parallel substitution (2 changes, no difference); back substitution (2 changes, no difference); 3 changes, no difference.]
A transition changes one purine for another or one pyrimidine for another. A transversion changes a purine for a pyrimidine or vice versa.
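The distinction above can be sketched in a few lines of Python. This is only an illustration: the function names `substitution_type` and `count_ts_tv` are hypothetical, and the sketch assumes two aligned, gap-free, upper-case DNA sequences of equal length.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(x, y):
    """Classify the difference between two nucleotides."""
    if x == y:
        return "identical"
    # A transition stays within purines or within pyrimidines.
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def count_ts_tv(seq1, seq2):
    """Return (transitions, transversions) between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned and of equal length")
    kinds = [substitution_type(a, b) for a, b in zip(seq1, seq2)]
    return kinds.count("transition"), kinds.count("transversion")
```

The two counts, divided by the sequence length, give the proportions of transitional and transversional differences used by two-parameter distance corrections.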
Because there are only 20 amino acids but 64 possible codons, the same amino acid is often encoded by several different codons, which usually differ in the third base of the triplet. Because of this redundancy the genetic code is said to be degenerate, and codons that encode the same amino acid are called synonymous codons (e.g. GAT and GAC code for aspartic acid (Asp, D), whereas GAA and GAG both code for glutamic acid (Glu, E)).
Threefold degenerate sites are codon positions where 3 of the 4 possible nucleotides encode the same aa, while the fourth nucleotide results in a different aa.
There is only 1 threefold degenerate site: the 3rd position of an isoleucine codon. ATT, ATC and ATA all encode isoleucine, but ATG encodes methionine.
Five amino acids are each encoded by 4 codons that differ only in the third position. These sites are called fourfold degenerate sites.
A   Ala   Alanine     GCT GCC GCA GCG
G   Gly   Glycine     GGG GGA GGT GGC
P   Pro   Proline     CCT CCC CCA CCG
T   Thr   Threonine   ACT ACC ACA ACG
V   Val   Valine      GTT GTC GTA GTG
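As a rough check of the degeneracy classes described above, the sketch below counts, for a given codon position, how many of the 4 nucleotides leave the amino acid unchanged. It assumes the standard genetic code; the helper name `degeneracy` and the compact code-table construction are illustrative choices, not part of the original slides.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G ("*" marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def degeneracy(codon, pos):
    """Number of nucleotides (1 to 4) at position `pos` (0-based)
    that encode the same amino acid as `codon`, the original included."""
    aa = CODE[codon]
    variants = (codon[:pos] + b + codon[pos + 1:] for b in BASES)
    return sum(CODE[v] == aa for v in variants)


# Sense codons whose 3rd position is threefold degenerate.
threefold = sorted(c for c, a in CODE.items()
                   if a != "*" and degeneracy(c, 2) == 3)
```

With this convention a value of 4 corresponds to a fourfold degenerate site, 3 to a threefold degenerate site, and 1 to a non-degenerate site.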
Transitions: A/G and C/T.
Nonsynonymous substitutions are nucleotide substitutions that change amino acids.
Nonsense mutations are mutations that result in stop codons.
Example: Gly. Any change at the 3rd position of the codon still results in Gly; any change at the 2nd position changes the amino acid, and so does any change at the 1st position.
G Gly Glycine GGG GGA GGT GGC
Example codons: Glu (GAG); Ser (AGC).
Nonsynonymous/synonymous substitutions
Estimation of synonymous and nonsynonymous substitution rates is important in understanding the dynamics of molecular sequence evolution. As synonymous (silent) mutations are largely invisible to natural selection, while nonsynonymous (amino-acid replacing) mutations may be under strong selective pressure, comparison of the rates of fixation of those two types of mutations provides a powerful tool for understanding the mechanisms of DNA sequence evolution. For example, variable nonsynonymous/synonymous rate ratios among lineages may indicate adaptive evolution or relaxed selective constraints along certain lineages. Likewise, models of variable nonsynonymous/synonymous rate ratios among sites may provide important insights into functional constraints at different amino acid sites and may be used to detect sites under positive selection.
Codon usage
There are 64 (4^3) possible codons that code for 20 amino acids (and stop signals). If nucleotide substitution occurs at random at each nucleotide site, every nucleotide site is expected to have one of the 4 nucleotides, A, T, C and G, with equal probability. Therefore, if there is no selection and no mutation bias, one would expect the codons encoding the same amino acid to occur on average at equal frequencies in protein-coding regions of DNA.
In practice, the frequencies of different codons for the same amino acid are usually different, and some codons are used more often than others. This codon usage bias is often observed.
Codon usage bias is controlled by both mutation pressure and purifying selection.
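Codon usage bias can be made visible by comparing, for each amino acid, the relative frequencies of its synonymous codons in a coding sequence. The following is a minimal sketch under stated assumptions: the standard genetic code, an in-frame upper-case sequence containing only A, C, G and T, and a hypothetical helper name `codon_usage`.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, bases ordered T, C, A, G ("*" marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def codon_usage(cds):
    """Relative frequency of each codon among the codons that encode
    the same amino acid, for an in-frame coding sequence."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(codons)
    totals = Counter()
    for codon, n in counts.items():
        totals[CODE[codon]] += n
    usage = defaultdict(dict)
    for codon, n in counts.items():
        aa = CODE[codon]
        usage[aa][codon] = n / totals[aa]
    return dict(usage)
```

Without selection or mutation bias, each of the k synonymous codons of an amino acid would be expected at a relative frequency close to 1/k; strong departures from that value indicate codon usage bias.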
When two compared codons differ at 2 or 3 positions, there are 2 or 6 parsimonious pathways, respectively, along which one codon could change into the other, and all of them should be considered.
Since different pathways may involve different numbers of synonymous and nonsynonymous changes, they should be weighted differently.
Codon 1: GAA --> GAC; 1 nucleotide difference, 1 nonsynonymous difference.
Codon 2: GTT --> GTC; 1 nucleotide difference, 1 synonymous difference.
Path 1: implies 1 nonsynonymous and 1 synonymous substitution.
Path 2: implies 2 nonsynonymous substitutions.
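The pathway counting can be illustrated with a short Python sketch that enumerates every order in which the differing positions can be changed and tallies synonymous and nonsynonymous steps along each path. The function name `pathways` is hypothetical, the standard genetic code is assumed, and discarding paths that pass through a stop codon is an assumption borrowed from common practice rather than something stated in the slides. The codon pair TTT/GTA used in the usage note is likewise an illustrative choice.

```python
from itertools import permutations, product

# Standard genetic code, bases ordered T, C, A, G ("*" marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def pathways(codon1, codon2):
    """List (synonymous, nonsynonymous) step counts for each parsimonious
    pathway from codon1 to codon2. Paths through a stop codon are dropped
    (an assumption of this sketch)."""
    diff = [i for i in range(3) if codon1[i] != codon2[i]]
    results = []
    for order in permutations(diff):
        current, syn, nonsyn = codon1, 0, 0
        for pos in order:
            nxt = current[:pos] + codon2[pos] + current[pos + 1:]
            if CODE[nxt] == "*":
                break
            if CODE[nxt] == CODE[current]:
                syn += 1
            else:
                nonsyn += 1
            current = nxt
        else:
            results.append((syn, nonsyn))
    return results
```

For example, `pathways("TTT", "GTA")` yields one path with 1 synonymous and 1 nonsynonymous step (TTT, GTT, GTA) and one path with 2 nonsynonymous steps (TTT, TTA, GTA), which matches the pattern of Path 1 and Path 2 above.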
K   Lys   Lysine   AAA AAG
Only one of the possible mutations at the 3rd position leaves lysine unchanged.
Ziheng Yang & Rasmus Nielsen (2000) Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol Biol Evol. 17:32-43.
Purifying selection:
Most of the time selection eliminates deleterious mutations, keeping the protein as it is.
Positive selection:
In a few instances we find that dN (also denoted Ka) is much greater than dS (also denoted Ks), i.e. dN/dS >> 1 (Ka/Ks >> 1). This is strong evidence that selection has acted to change the protein.
Positive selection was tested for by comparing the number of nonsynonymous substitutions per nonsynonymous site (dN) to the number of synonymous substitutions per synonymous site (dS). Because these numbers are normalized to the number of sites, if evolution were neutral (i.e., as for a pseudogene) the dN/dS ratio would be equal to 1. An unequivocal sign of positive selection is a dN/dS ratio significantly exceeding 1, indicating a functional benefit to diversifying the amino acid sequence.
Positive selection is very important for the evolution of new functions, especially for duplicated genes.
(It must occur soon after duplication; otherwise null mutations will be fixed, producing pseudogenes.)
dN/dS (or Ka/Ks) measures selection pressure
Mutational saturation
Mutational saturation in DNA and protein sequences occurs when sites have undergone multiple mutations, so that sequence dissimilarity (the observed differences) no longer accurately reflects the true evolutionary distance, i.e. the number of substitutions that have actually occurred since the divergence of two sequences. Correct estimation of the evolutionary distance is crucial. Generally, sequence pairs with dS > 2 are excluded to avoid the saturation effect of nucleotide substitution.
YN00 - P13.4.C13.18.fa.paml   ns = 13   ls = 29
Estimation by the method of Yang & Nielsen (2000):

seq.           seq.           S      N      t      kappa  omega  dN +- SE       dS +- SE
YALI0A08195g   YALI0A17963g   15.1   71.9   0.37   1.31   0.20   0.07 +- 0.03   0.36 +- 0.22
YALI0E25443g   YALI0A17963g   17.3   69.7   1.8    1.31   0.05   0.13 +- 0.05   2.55 +- 13.95
YALI0E25443g   YALI0A08195g   17.6   69.4   1.00   1.31   0.06   0.08 +- 0.03   1.35 +- 0.70
YALI0C21230g   YALI0A17963g   24.1   62.9   5.35   1.31   0.75   ..             ..
YALI0C21230g   YALI0A08195g   24.5   62.5   6.58   1.31   0.57   ..             ..
YALI0C21230g   YALI0E25443g   24.9   62.1   4.76   1.31   1.27   ..             ..
YALI0C21230g   YALI0A02783g   24.6   62.4   4.71   1.31   ..     ..             ..
YALI0C21230g   YALI0C21252g   25.4   61.6   6.64   1.31   ..     ..             ..
YALI0C21230g   YALI0C21274g   25.3   61.7   6.54   1.31   ..     ..             ..
YALI0C21230g   YALI0F09944g   24.3   62.7   7.51   1.31   ..     ..             ..
YALI0C21230g   YALI0A13497g   28.2   58.8   7.13   1.31   ..     ..             ..
YALI0C21230g   YALI0B06160g   27.1   59.9   7.34   1.31   ..     ..             ..
YALI0D11638g   YALI0C21230g   27.3   59.7   8.04   1.31   ..     ..             ..
YALI0E19140g   YALI0C21230g   25.2   61.8   7.67   1.31   ..     ..             ..
YALI0E19140g   YALI0D11638g   22.4   64.6   4.12   1.31   ..     ..             ..
http://abacus.gene.ucl.ac.uk/software/paml.html
The number of substitutions between any two species is assumed to be the sum of the number of substitutions along the branches of the tree connecting them:
d13 = dA1 + dA3
d23 = dA2 + dA3
d12 = dA1 + dA2
d13, d23 and d12 are measures of the differences between 1 and 3, 2 and 3, and 1 and 2, respectively.
dA1 = (d12 + d13 - d23)/2
dA2 = (d12 + d23 - d13)/2
dA1 and dA2 should be the same (A is the common ancestor of 1 and 2).
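The branch lengths can be recovered from the three pairwise distances by solving the additive equations directly. The sketch below is a minimal illustration: the function name `branch_lengths` is hypothetical, and the expression for dA3 is not given in the text but follows from the same three equations.

```python
def branch_lengths(d12, d13, d23):
    """Solve d12 = dA1 + dA2, d13 = dA1 + dA3, d23 = dA2 + dA3
    for the three branch lengths."""
    dA1 = (d12 + d13 - d23) / 2
    dA2 = (d12 + d23 - d13) / 2
    dA3 = (d13 + d23 - d12) / 2  # derived from the same equations
    return dA1, dA2, dA3
```

If species 1 and 2 evolve at the same rate since their common ancestor A, dA1 and dA2 are expected to be close; a large difference between them suggests unequal rates along the two lineages.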
Evolution of functionally important regions over time. Immediately after a speciation event, the two copies of the genomic region are 100% identical (see graph on left). Over time, regions under little or no selective pressure, such as introns, are saturated with mutations, whereas regions under negative selection, such as most exons, retain a higher percent identity (see graph on right). Many sequences involved in regulating gene expression also maintain a higher percent identity than do sequences with no function.
Webb Miller, Kateryna D. Makova, Anton Nekrutenko, and Ross C. Hardison. Comparative Genomics. Annual Review of Genomics and Human Genetics.
Reference
Yang & Nielsen, Estimating Synonymous and Nonsynonymous Substitution Rates Under Realistic Evolutionary Models. Mol. Biol. Evol. 2000, 17:32-43
ps = Sd/S and pn = Nd/N.
var(ps) = ps(1-ps)/S.
var(pn) = pn(1-pn)/N.
Sd and Nd are respectively the total numbers of synonymous and nonsynonymous differences calculated over all codons. S and N are the numbers of synonymous and nonsynonymous sites. S + N = n, the total number of nucleotides, and N >> S. ps is often denoted Ks and pn is denoted Ka.
Number of synonymous (ds) and nonsynonymous (dn) substitutions per site
1) Jukes and Cantor, one-parameter method, denoted 1-p: this model assumes that the rate of nucleotide substitution is the same for all pairs of the four nucleotides A, T, C and G (generally not true!).
d = -(3/4)*Ln(1 - (4/3)*p)
where p is either ps or pn.
2) Kimura's 2-parameter method, denoted 2-p: the rate of transitional nucleotide substitution is often higher than that of transversional substitution.
d = -(1/2)*Ln(1 - 2*P - Q) - (1/4)*Ln(1 - 2*Q)
P is the proportion of transitional differences and Q is the proportion of transversional differences. P and Q are calculated separately over synonymous and nonsynonymous differences.
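The two corrections can be written down directly from the formulas above. The sketch below is illustrative only; the function names `jukes_cantor` and `kimura_2p` are hypothetical. Both raise a `ValueError` from `math.log` when the observed proportions are so large that the logarithm is undefined, which corresponds to saturated sequences.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance for an observed proportion of differences p."""
    return -0.75 * math.log(1 - (4 / 3) * p)


def kimura_2p(P, Q):
    """Kimura 2-parameter distance from the proportions of transitional (P)
    and transversional (Q) differences."""
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)
```

As a consistency check, when transitions make up one third of the differences (P = p/3, Q = 2p/3) the two formulas give the same distance, as expected when all substitution types are equally frequent.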
Substitution rate matrices (rows: original nucleotide; columns: mutant nucleotide).

Kimura 2-parameter model:
      A      T      C      G
A     -      β      β      α
T     β      -      α      β
C     β      α      -      β
G     α      β      β      -
α and β are the rates of transitional and transversional substitutions.

Tamura model:
      A         T         C      G
A     -         (1-θ)β    θβ     θα
T     (1-θ)β    -         θα     θβ
C     (1-θ)β    (1-θ)α    -      θβ
G     (1-θ)α    (1-θ)β    θβ     -
α and β are the rates of transitional and transversional substitutions and θ is the G+C content.

Hasegawa et al. model:
      A      T      C      G
A     -      βgT    βgC    αgG
T     βgA    -      αgC    βgG
C     βgA    αgT    -      βgG
G     αgA    βgT    βgC    -
α and β are the rates of transitional and transversional substitutions and gi are the nucleotide frequencies (i = A, T, C, G).

Tamura-Nei model:
      A       T       C       G
A     -       βgT     βgC     α1gG
T     βgA     -       α2gC    βgG
C     βgA     α2gT    -       βgG
G     α1gA    βgT     βgC     -
α1 and α2 are the rates of transitional substitutions between purines and between pyrimidines; β is the rate of transversional substitutions; and gi are the nucleotide frequencies (i = A, T, C, G).
Procedure
1. Align the protein sequences of a family using clustalW.
2. Align the corresponding DNA sequences, using the amino acid alignment obtained in step 1 as a template.
3. Format the DNA alignment in yn00 format.
4. Run the yn00 program (PAML package) on the resulting DNA alignment.
5. Clean the yn00 output to get the YN (Yang & Nielsen) estimates in a file. Estimates with large standard errors were eliminated.
6. From the YN estimates, extract gene pairs with w = dN/dS >= 3 and gene pairs with w <= 0.3, respectively.
7. Genes with w >= 3 are considered candidate genes on which positive selection may operate, whereas genes with w <= 0.3 are candidates for purifying (negative) selection.
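Steps 6 and 7 amount to a simple filter on the pairwise estimates. The sketch below assumes the cleaned estimates have already been loaded into a list of dictionaries with hypothetical keys `pair`, `dN` and `dS`; it does not parse yn00 output. The `max_ds` cut-off reflects the saturation rule of thumb (dS > 2) mentioned earlier, and applying it at this stage is an assumption of the sketch.

```python
def classify_pairs(estimates, high=3.0, low=0.3, max_ds=2.0):
    """Split gene pairs into candidates for positive selection
    (w >= high) and for purifying selection (w <= low).

    `estimates` is a list of dicts with keys "pair", "dN", "dS"
    (a hypothetical layout for the cleaned YN estimates)."""
    positive, purifying = [], []
    for e in estimates:
        # Skip saturated pairs and pairs where w is undefined.
        if e["dS"] <= 0 or e["dS"] > max_ds:
            continue
        w = e["dN"] / e["dS"]
        if w >= high:
            positive.append(e["pair"])
        elif w <= low:
            purifying.append(e["pair"])
    return positive, purifying
```

Pairs with intermediate w values fall into neither list, which mirrors the procedure: only the two extremes are retained as candidates.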
Most of the genes are under purifying selection. Only a few genes might be under positive selection.
m      std    n      min    Max
0.90   0.6    5085   0.0    4.98
2.96   1.3    5085   0.0    6.84
0.34   0.32   5085   0.0    4.45
3.6    0.57   10     3.0    4.45
Codon volatility
A recently introduced method whose utility is still under debate; it has interesting consequences for the study of codon variability.
Detecting Selection
If a protein coding region of a nucleotide sequence has undergone an excess number of amino-acid substitutions, then the region will on average contain an overabundance of volatile codons, compared with the genome as a whole. Using the concept of codon volatility, we can scan an entire genome to find genes that show significantly more, or less, pressure for amino-acid substitutions than the genome as a whole. If a gene contains many residues under pressure for aa replacements, then the resulting codons in that gene will on average exhibit elevated volatility.
If a gene is under purifying selection not to change its aa, then the resulting sequence will on average exhibit lower volatility.
Plotkin et al. Nature 428; 942-945
Codon volatility
The codon CGA encoding arginine (R), has 8 potential ancestor codons (i.e. non stop codon) that differ from CGA by one substitution.
Volatility of a codon is defined as the proportion of nonsynonymous codons over the total neighbour sense codons obtained by a single substitution.
The volatility of CGA is 4/8. The volatility of AGA, which also encodes arginine, is 6/8.
Plotkin et al. 2004. Nature 428. p.942-945
22 codons have at least one synonymous codon with a different volatility.
Volatility of a codon c:
v(c) = (1/n) * sum over i = 1..n of D[aacid(c), aacid(ci)]
where the ci are the n neighbouring codons (stop codons excluded) that differ from c by a single substitution, and D is the Hamming distance: 0 if the 2 aa are identical, 1 otherwise.
Volatility of a gene G:
v(G) = sum over k = 1..l of v(ck)
where l is the number of codons in the gene G.
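The definitions above translate almost literally into code. This is a sketch assuming the standard genetic code and an in-frame coding sequence without stop codons; the names `volatility` and `gene_volatility` are hypothetical.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G ("*" marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def volatility(codon):
    """Fraction of single-substitution sense neighbours of `codon`
    that encode a different amino acid."""
    neighbours = [codon[:i] + b + codon[i + 1:]
                  for i in range(3) for b in BASES if b != codon[i]]
    sense = [n for n in neighbours if CODE[n] != "*"]
    return sum(CODE[n] != CODE[codon] for n in sense) / len(sense)


def gene_volatility(cds):
    """Sum of codon volatilities over an in-frame coding sequence."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    return sum(volatility(c) for c in codons)
```

For the two arginine codons discussed earlier, `volatility("CGA")` gives 4/8 and `volatility("AGA")` gives 6/8.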
Volatility is used to quantify the probability that the most recent substitution of a site caused an amino-acid change. Each gene's observed volatility is compared with a bootstrap distribution of alternative synonymous sequences, drawn
Volatility p-value of G: The observed v(G) is compared with a bootstrap distribution of 10^6 synonymous versions of the gene G. In each randomization sample, a nucleotide sequence G' is constructed so that it has the same translation as G but its codons are drawn randomly according to the relative frequencies of synonymous codons in the whole genome. The p-value for G is the proportion of randomized samples such that v(G') > v(G).
1-p is a p-value that tests whether a gene is significantly less volatile than the genome as a whole.
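The randomization test can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the standard genetic code is assumed, `genome_counts` is a hypothetical dictionary of codon counts for the whole genome, and the default number of samples is kept far below 10^6 so the sketch runs quickly.

```python
import random
from itertools import product

# Standard genetic code, bases ordered T, C, A, G ("*" marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def volatility(codon):
    """Fraction of sense neighbours encoding a different amino acid."""
    neighbours = [codon[:i] + b + codon[i + 1:]
                  for i in range(3) for b in BASES if b != codon[i]]
    sense = [n for n in neighbours if CODE[n] != "*"]
    return sum(CODE[n] != CODE[codon] for n in sense) / len(sense)


def volatility_p_value(cds, genome_counts, n_samples=1000, seed=0):
    """Proportion of synonymous resamplings G' with v(G') > v(G)."""
    rng = random.Random(seed)
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    observed = sum(volatility(c) for c in codons)
    # Group sense codons by the amino acid they encode.
    synonyms = {}
    for codon, aa in CODE.items():
        if aa != "*":
            synonyms.setdefault(aa, []).append(codon)
    exceed = 0
    for _ in range(n_samples):
        resampled = 0.0
        for c in codons:
            choices = synonyms[CODE[c]]
            weights = [genome_counts.get(s, 0) for s in choices]
            # Draw a synonymous codon according to genome-wide usage.
            resampled += volatility(rng.choices(choices, weights=weights)[0])
        if resampled > observed:
            exceed += 1
    return exceed / n_samples
```

A gene built only from the least volatile codons of its amino acids gets a p-value near one, and a gene built only from the most volatile codons gets a p-value near zero, in line with the interpretation of p and 1-p given in the text.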
A p-value near zero indicates significantly elevated volatility, whereas a p-value near one indicates significantly depressed volatility. The probability that a site's most recent substitution caused a non-synonymous change is:
http://www.cgr.harvard.edu/volatility
1) Paul M. Sharp. Gene "volatility" is Most Unlikely to Reveal Adaptation. MBE Advance Access published on December 22, 2004. doi:10.1093/molbev/msi073
2) Tal Dagan and Dan Graur. The Comparative Method Rules! Codon Volatility Cannot Detect Positive Darwinian Selection Using a Single Genome Sequence. MBE Advance Access published on November 3, 2004. doi:10.1093/molbev/msi033
3) Robert Friedman and Austin L. Hughes. Codon Volatility as an Indicator of Positive Selection: Data from Eukaryotic Genome Comparisons. MBE Advance Access originally published on November 3, 2004. This version published November 8, 2004. doi:10.1093/molbev/msi038
4) Hahn MW, Mezey JG, Begun DJ, Gillespie JH, Kern AD, Langley CH, Moyle LC. Evolutionary genomics: Codon bias and selection on single genomes. Nature. 2005 Jan 20;433(7023):E5-6.
5) Nielsen R, Hubisz MJ. Evolutionary genomics: Detecting selection needs comparative data. Nature. 2005 Jan 20;433(7023):E6.
6) Chen Y, Emerson JJ, Martin TM. Evolutionary genomics: Codon volatility does not detect selection. Nature. 2005 Jan 20;433(7023):E6-7.
7) Zhang J, 2005. On the evolution of codon volatility. Genetics 169: 495-501.
8) Plotkin JB, Dushoff J, Fraser HB. Evolutionary genomics: Codon volatility does not detect selection (reply). Nature. 2005 Jan 20;433(7023):E7-8.
9) Plotkin JB, Dushoff J, Desai MM and Fraser HB. Synonymous codon and selection on proteins
-> Extreme volatility classes have interesting properties, in terms of aa composition or codon bias.
-> Authors: some genes are under more positive, or less negative, selection than others.
Codon volatility (simple substitution model): codons and volatility under the simple substitution model
taa: number of neighbouring sense codons; daa: number of those neighbours encoding a different aa; Vol = daa/taa.

aa   codon   taa   daa   Vol    G+C   A+T
A    GCT     9     6     0.67   2     1
A    GCC     9     6     0.67   3     0
A    GCA     9     6     0.67   2     1
A    GCG     9     6     0.67   3     0
R    CGT     9     6     0.67   2     1
R    CGC     9     6     0.67   3     0
R    CGA     8     4     0.50   2     1
R    CGG     9     5     0.56   3     0
R    AGA     8     6     0.75   1     2
R    AGG     9     7     0.78   2     1
N    AAT     9     8     0.89   0     3
N    AAC     9     8     0.89   1     2
D    GAT     9     8     0.89   1     2
D    GAC     9     8     0.89   2     1
C    TGT     8     7     0.88   1     2
C    TGC     8     7     0.88   2     1
Q    CAA     8     7     0.88   1     2
Q    CAG     8     7     0.88   2     1
E    GAA     8     7     0.88   1     2
E    GAG     8     7     0.88   2     1
G    GGT     9     6     0.67   2     1
G    GGC     9     6     0.67   3     0
G    GGA     8     5     0.63   2     1
G    GGG     9     6     0.67   3     0
H    CAT     9     8     0.89   1     2
H    CAC     9     8     0.89   2     1
I    ATT     9     7     0.78   0     3
I    ATC     9     7     0.78   1     2
I    ATA     9     7     0.78   0     3
L    TTA     7     5     0.71   0     3
L    TTG     8     6     0.75   1     2
L    CTT     9     6     0.67   1     2
L    CTC     9     6     0.67   2     1
L    CTA     9     5     0.56   1     2
L    CTG     9     5     0.56   2     1
K    AAA     8     7     0.88   0     3
K    AAG     8     7     0.88   1     2
M    ATG     9     9     1.00   1     2
F    TTT     9     8     0.89   0     3
F    TTC     9     8     0.89   1     2
P    CCT     9     6     0.67   2     1
P    CCC     9     6     0.67   3     0
P    CCA     9     6     0.67   2     1
P    CCG     9     6     0.67   3     0
S    TCT     9     6     0.67   1     2
S    TCC     9     6     0.67   2     1
S    TCA     7     4     0.57   1     2
S    TCG     8     5     0.63   2     1
S    AGT     9     8     0.89   1     2
S    AGC     9     8     0.89   2     1
T    ACT     9     6     0.67   1     2
T    ACC     9     6     0.67   2     1
T    ACA     9     6     0.67   1     2
T    ACG     9     6     0.67   2     1
W    TGG     7     7     1.00   2     1
Y    TAT     7     6     0.86   0     3
Y    TAC     7     6     0.86   1     2
V    GTT     9     6     0.67   1     2
V    GTC     9     6     0.67   2     1
V    GTA     9     6     0.67   1     2
V    GTG     9     6     0.67   2     1
[Figure: distribution of amino acids and codons (AA_Codons) over the volatility classes 0.5, 0.56, 0.57, 0.63, 0.67, 0.71, 0.75, 0.78, 0.86, 0.88, 0.89 and 1; volatility axis from 0.4 to 1; Leu and Ser labelled.]
References:
Ziheng Yang and Rasmus Nielsen (2000) Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol Biol Evol. 17:32-43.
Phylogeny programs : http://evolution.genetics.washington.edu/phylip/sftware.html
MEGA: http://www.megasoftware.net/
PAML: http://abacus.gene.ucl.ac.uk/software/paml.html
Books:
Fundamental Concepts of Bioinformatics. Dan E. Krane and Michael L. Raymer
Genomes, 2nd edition. T.A. Brown
Molecular Evolution: A Phylogenetic Approach. Page, RDM and Holmes, EC. Blackwell Science