
PAIRWISE ALIGNMENT

SEQUENCES ARE RELATED


Darwin: all organisms are related through descent with modification.
Sequences are related through descent with modification.
Similar molecules have similar functions in different organisms.
[Figure: phylogenetic tree based on ribosomal RNA, showing the three domains of life]

WHY COMPARE SEQUENCES?


To determine evolutionary relationships
To decide if two proteins (or genes) are related structurally or functionally
[Figure: Protein 1 binds oxygen; Protein 2 shows sequence similarity to Protein 1. Does Protein 2 also bind oxygen?]
To identify domains or motifs that are shared between proteins

TERMINOLOGIES
Similarity: the extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation.
Identity: the extent to which two sequences are invariant.
Conservation: changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physicochemical properties of the original residue.

CONSERVED RESIDUES

Residues conserved among various G protein coupled receptors are highlighted in green

CONSERVATION OF FUNCTION
Alignments can reveal which parts of the sequences are likely to be important for the function, if the proteins are involved in similar processes.
In parts of the sequence of a protein which are not very critical for its function, random mutations can easily accumulate.
In parts of the sequence that are critical for the function of the protein, hardly any mutations will be accepted; nearly all changes in such regions will destroy the function.

INSERTIONS/DELETIONS AND PROTEIN STRUCTURE


Why is it that two similar sequences may have large insertions/deletions?
Some insertions and deletions may not significantly affect the structure of a protein.
Loop structures: insertions/deletions here are not so significant.

COMPARING THE PROTEIN KINASE KRAF_HUMAN AND THE UNCHARACTERIZED O22558 FROM ARABIDOPSIS USING BLAST
546 AA
Score = 185 bits (464), Expect = 1e-45
Identities = 107/283 (37%), Positives = 172/283 (59%), Gaps = 15/283 (5%)

Query: 337 DSSYYWEIEASEVMLSTRIGSGSFGTVYKGKWHG-DVAVKILKVVDPTPEQFQAFRNEVA 395
           D +  WEI+ +++ +  ++ SGS+G +++G +   +VA+K LK      E  + F  EV
Sbjct: 274 DGTDEWEIDVTQLKIEKKVASGSYGDLHRGTYCSQEVAIKFLKPDRVNNEMLREFSQEVF 333

Query: 396 VLRKTRHVNILLFMGYMTKD-NLAIVTQWCEGSSLYKHLHVQETKFQMFQLIDIARQTAQ 454
           ++RK RH N++ F+G  T+   L IVT++    S+Y  LH Q+  F++  L+ +A   A+
Sbjct: 334 IMRKVRHKNVVQFLGACTRSPTLCIVTEFMARGSIYDFLHKQKCAFKLQTLLKVALDVAK 393

Query: 455 GMDYLHAKNIIHRDMKSNNIFLHEGLTVKIGDFGLATVKSRWSGSQQVEQPTGSVLWMAP 514
           GM YLH  NIIHRD+K+ N+ + E   VK+ DFG+A V+   SG    E  TG+  WMAP
Sbjct: 394 GMSYLHQNNIIHRDLKTANLLMDEHGLVKVADFGVARVQIE-SGVMTAE--TGTYRWMAP 450

Query: 515 EVIRMQDNNPFSFQSDVYSYGIVLYELMTGELPYSHINNRDQIIFMVGRGYASPDLSKLY 574
           EVI   ++ P++ ++DV+SY IVL+EL+TG++PY+ +      + +V +G   P + K
Sbjct: 451 EVI---EHKPYNHKADVFSYAIVLWELLTGDIPYAFLTPLQAAVGVVQKG-LRPKIPK-- 504

Query: 575 KNCPKAMKRLVADCVKKVKEERPLFPQILSSIELLQHSLPKIN 617
           K  PK +K L+  C  +  E+RPLF +I   IE+LQ  + ++N
Sbjct: 505 KTHPK-VKGLLERCWHQDPEQRPLFEEI---IEMLQQIMKEVN 543

SIMILARITY AND HOMOLOGY


Similarity can be expressed as a percentage. It does not imply any reasons for the observed sameness.
Homology is an evolutionary term used to describe relationship via descent from a common ancestor.
Homologous things are often similar, but not always (e.g. a whale flipper and a human arm).
Homology is NEVER expressed as a percentage.

HOMOLOGY
Homologous sequences can be divided into three groups:
Orthologous sequences: sequences that diverged due to a speciation event (e.g. human α-globin and mouse α-globin).
Paralogous sequences: sequences that diverged due to a gene duplication event (e.g. human α-globin and human β-globin, various versions of both).
Xenologous sequences: sequences for which the history of one of them involves interspecies transfer since the time of their common ancestor.


SIMILARITY AND HOMOLOGY


Sequence homology can be reliably inferred from statistically significant similarity over a majority of the sequence length.
Non-homology CANNOT be inferred from non-similarity, because non-similar things can still share a common ancestor.
Homologous proteins share common structures, but not necessarily common sequence or function.
Homology is all or nothing. There is no such thing as 50% homology.

QUESTION 1
True or False. Homology is synonymous with similarity

SEARCHING SEQUENCE DATABASES


When we search a sequence database, we are usually looking for related sequences.
Unfortunately, the algorithms that we have for searching databases do not search for homology; they search for similarity.
When similarity is found, we must determine if this similarity is a result of homology or if it comes from another source.

WHY SEARCH FOR SIMILARITY?


I have just sequenced something. What is known about the thing I sequenced?
I have a unique sequence. Is there similarity to another gene that has a known function?
I found a new protein in a lower organism. Is it similar to a protein from another species?
I have decided to work on a new gene. The people in the field will not give me the plasmid. I need the complete cDNA sequence to perform RT-PCR or some other experiment.

SEQUENCE ALIGNMENT: DEFINITION


The process of lining up two or more sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology.

SEQUENCE ALIGNMENT
Comparing sequences provides information as to which genes or proteins have the same function.
Sequences are compared by aligning them: sliding them along each other to find the most matches with a few gaps.
An alignment can be scored: count matches, and optionally penalize mismatches and gaps.
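The scoring idea on this slide can be sketched in a few lines of Python. This is only an illustration: the function name and the default values (match = 1, mismatch = 0, gap = -1) are assumptions chosen for the example, not values prescribed by the slides.

```python
def score_alignment(row_a, row_b, match=1, mismatch=0, gap=-1):
    """Score two already-aligned rows column by column ('-' marks a gap).

    The default scores are illustrative assumptions.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap        # a residue paired with a gap
        elif a == b:
            total += match      # identical residues
        else:
            total += mismatch   # different residues
    return total


# Six matching columns and one gap column: 6 * 1 + (-1) = 5
print(score_alignment("GATTACA", "GAT-ACA"))
```

Changing the three parameters changes which of several candidate alignments scores best, which is why the scoring scheme has to be fixed before alignments are compared.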

QUESTION 2
Whenever possible, it is better to:
A. Compare proteins than to compare genes
B. Compare genes than to compare proteins
Discuss as a group and cite points that defend your argument.

IT IS MUCH EASIER TO ALIGN PROTEINS


4 DNA bases vs. 20 amino acids: less chance similarity between proteins
There are varying degrees of similarity between different amino acids
Protein databanks are much smaller than DNA databanks

PAIRWISE ALIGNMENT
The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem.
Two sequences can always be aligned.
Sequence alignments have to be scored.
Often there is more than one solution with the same score.

PAIRWISE ALIGNMENTS: PURPOSE


Identification of sequences with significant similarity to (a) sequence(s) in a sequence repository
Identification of all homologous sequences within the repository
Identification of domains with sequence similarity

METHODS OF ALIGNMENT
By hand: slide sequences on two lines of a word processor
Dot plot
Rigorous mathematical approach
  Dynamic programming (slow, optimal)
Heuristic methods (fast, approximate)
  BLAST and FASTA (use word matching and hash tables)

ALIGNMENT BY HAND
GATCGCCTA_TTACGTCCTGGAC <---> AGGCATACGTA_GCCCTTTCGC

A scoring system is still essential to find the best alignment

DOTPLOTS
Not technically an alignment
Gives a picture of the correspondence between pairs of sequences
A dot represents similarity between segments of the two sequences

QUESTION 3
Do diagonals correspond to conserved regions?
A. Yes
B. No

QUESTION 3 REDUX
Take note that the dots are placed at grid points where two sequences have identical residues.
Do diagonals correspond to conserved regions?
A. Yes
B. No

A LIMITATION TO DOT MATRIX COMPARISON


Where part of one sequence shares a long stretch of similarity with the other sequence, a diagonal of dots will be evident in the matrix.
However, when single bases are compared at each position, most of the dots in the matrix will be due to background similarity.
That is, for any two nucleotides compared between the two sequences, there is a 1 in 4 chance of a match, assuming equal frequencies of A, G, C and T.

SIMPLE DOT PLOT


[Figure: dot plot comparing single nucleotides; axis sequence: G G G A T T G A C C C G]

A SOLUTION
This background noise can be filtered out by comparing groups of l nucleotides, rather than single nucleotides, at each position.
For example, if we compare dinucleotides (l = 2), the probability of two dinucleotides chosen at random from each sequence matching is 1/16, rather than 1/4.
Therefore, the number of background matches will be lower.
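The arithmetic behind this filter can be checked with a short Python sketch. It assumes, as the slide does, equal base frequencies and an exact match over the whole window; the function names and the example sequence lengths (12 and 10) are hypothetical.

```python
def background_match_probability(window, alphabet_size=4):
    """Chance that two random windows of this length are identical,
    assuming every symbol is equally frequent."""
    return (1 / alphabet_size) ** window


def expected_background_dots(m, n, window, alphabet_size=4):
    """Expected number of chance dots when every window of one sequence
    (length m) is compared with every window of the other (length n)."""
    comparisons = (m - window + 1) * (n - window + 1)
    return comparisons * background_match_probability(window, alphabet_size)


for l in (1, 2, 3):
    print(l, background_match_probability(l), expected_background_dots(12, 10, l))
```

For l = 1 and l = 2 the probabilities are 1/4 and 1/16, matching the figures quoted above; each extra nucleotide in the window divides the chance of a random match by four.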

A FILTERED DOT PLOT


[Figure: filtered dot plot; axis labels: G G G A T T G A C C C G G C T T G A C C G G]

THE DOT MATRIX ALGORITHM


The dot-matrix algorithm can be generalized for sequences s and t of sizes m and n, respectively, and window size l.
For each position in sequence s, compare a window of l nucleotides centered at that position with each window of l nucleotides in sequence t.
Conceptually, you can think of windows of length l sliding along each axis, so that all possible windows of l nucleotides are compared between the two sequences.
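A minimal Python sketch of the windowed comparison follows. Two simplifications are assumptions of this sketch rather than part of the slides: windows are anchored at their first position instead of being centered, and the stringency is expressed as a minimum number of identical positions per window (the hypothetical parameter `min_matches`), not as a substitution-matrix score.

```python
def dot_matrix(s, t, window=1, min_matches=None):
    """Return a grid of booleans; cell [i][j] is True when the windows
    starting at s[i] and t[j] agree in at least min_matches positions."""
    if min_matches is None:
        min_matches = window  # default: require an exact window match
    rows = len(s) - window + 1
    cols = len(t) - window + 1
    grid = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(1 for k in range(window) if s[i + k] == t[j + k])
            row.append(matches >= min_matches)
        grid.append(row)
    return grid


def render(grid):
    """Draw the grid as text: '*' for a dot, '.' for an empty cell."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in grid)


print(render(dot_matrix("GGGATTGACCCG", "GCTTGACCGG", window=1)))
print()
print(render(dot_matrix("GGGATTGACCCG", "GCTTGACCGG", window=3)))
```

Printing the grid for window sizes 1 and 3 shows the scattered background dots thinning out while the diagonal for the shared stretch remains.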

[Figure: dot plot with window size l = 3; axis labels: G G G A T T G A C C C G G C T T G A C C G G]

DOT MATRIX SEQUENCE COMPARISON EXAMPLES

COMPARING A PROTEIN WITH ITSELF


Proteins can be compared with themselves to show internal duplications or repeating sequences.
A self-matrix produces a central diagonal line through the origin, indicating an exact match between the x and y axes.
The parallel diagonals that appear off the central line are indicative of repeated sequence elements in different locations of the same protein.

HAPTOGLOBIN
Haptoglobin is a protein that is secreted into the blood by the liver. This protein binds free hemoglobin.
The concentration of "free" hemoglobin (that is, outside red blood cells) in plasma (the fluid portion of blood) is ordinarily very low. However, free hemoglobin is released when red blood cells hemolyze for any reason.
After haptoglobin binds hemoglobin, it is taken up by the liver. The liver recycles the iron, heme, and amino acids contained in the hemoglobin protein.

OUR COMPARISON
Files used:
  1006264A Haptoglobin H2
DNA sequencing shows that the intragenic duplication within the human haptoglobin Hp2 allele was formed by a nonhomologous, probably random, crossing-over within different introns of two Hp1 genes.
A repeated sequence (starting with ADDGCP...) is observed at positions 30-90 and 90-150, probably due to a duplication event in one of these locations.

Window: 30, Stringency: 3, BLOSUM 62 matrix

SEARCHING FOR REPEATS IN DOTPLOTS


One of the strengths of dot-matrix searches is that they make repeats easy to detect by comparing a sequence against itself.
In self comparisons, direct repeats appear as diagonals parallel to the main line of identity.

COMPARISON OF TWO SIMILAR SEQUENCES


Files used:
  P03035 Repressor protein from E. coli phage P22
  RPBPL Repressor protein from E. coli phage lambda

Lambda phages infect E. coli. They can be lytic and destroy the host cell, making hundreds of progeny. They can also be lysogenic, and live quietly within the DNA of the bacterium.
A gene makes the repressor protein that prevents the phage from going destructively lytic.
Phage P22 is a related phage that also makes a repressor. Both proteins form a dimer and bind DNA to prevent lysis.

LAMBDA REPRESSOR/OPERATOR COMPLEX (1LMB)

DOT MATRIX SEQUENCE COMPARISON


A row of dots represents a region of sequence similarity. Background matching also appears as scattered dots.

[Figure panels, all using the BLOSUM 62 matrix:
  Window: 10, Stringency: 1
  Window: 10, Stringency: 3
  Window: 30, Stringency: 1
  Window: 30, Stringency: 3]

QUESTION 4
Which of the following combinations of parameters will produce the least background noise?
A. Low window, low stringency
B. Low window, high stringency
C. High window, low stringency
D. High window, high stringency

DISADVANTAGES TO DOT PLOTS


While dot-matrix searches provide a great deal of information in a visual fashion, they can only be considered semi-quantitative, and therefore do not lend themselves to statistical analysis.
Also, dot-matrix searches do not provide a precise alignment between two sequences.

RIGOROUS ALGORITHMS
DYNAMIC PROGRAMMING

ALGORITHM
An algorithm is a complete, unambiguous procedure for solving a specified problem in a finite number of steps.
Algorithms leave nothing undefined and require no intuition to achieve their end.

FIVE FEATURES OF AN ALGORITHM:


An algorithm must stop after a finite number of steps.
All steps of the algorithm must be precisely defined.
Input to the algorithm must be specified.
Output of the algorithm must be specified. There must be at least one output.
An algorithm must be effective, i.e. its operations must be basic and doable.

DYNAMIC PROGRAMMING
Algorithmic technique for optimization problems that have two properties:
  Optimal substructure: an optimal solution can be computed from optimal solutions to subproblems
  Overlapping subproblems: subproblems overlap such that the total number of distinct subproblems to be solved is relatively small
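Both properties can be seen in a small Python sketch that computes the best global alignment score by recursion with memoization. The scoring values (match = 1, mismatch = -1, gap = -2) and the function names are assumptions for illustration only. The cache ensures that each pair of prefix lengths is solved once, so the number of distinct subproblems is (m + 1) x (n + 1) even though the plain recursion would revisit them many times.

```python
from functools import lru_cache


def best_global_score(s, t, match=1, mismatch=-1, gap=-2):
    """Return (best score, number of distinct subproblems solved)."""

    @lru_cache(maxsize=None)
    def best(i, j):
        # Best score for aligning the prefixes s[:i] and t[:j].
        if i == 0:
            return j * gap          # only gaps can pair with t[:j]
        if j == 0:
            return i * gap          # only gaps can pair with s[:i]
        pair = match if s[i - 1] == t[j - 1] else mismatch
        return max(
            best(i - 1, j - 1) + pair,   # align s[i-1] with t[j-1]
            best(i - 1, j) + gap,        # s[i-1] against a gap
            best(i, j - 1) + gap,        # t[j-1] against a gap
        )

    score = best(len(s), len(t))
    return score, best.cache_info().misses


print(best_global_score("ACGT", "AGT"))
```

The recursion is fine for short sequences; the table-filling formulation used by the rigorous algorithms computes the same values without deep recursion.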

RIGOROUS ALGORITHMS
Needleman-Wunsch (Global)
Smith-Waterman (Local)

GLOBAL VS. LOCAL ALIGNMENT


Global alignment algorithms start at the beginning of two sequences and add gaps to each until the end of one is reached.
Local alignment algorithms find the region (or regions) of highest similarity between two sequences and build the alignment outward from there.


GLOBAL ALIGNMENT
The Needleman-Wunsch algorithm creates a global alignment over the length of both sequences (EMBOSS program: needle).
Global algorithms are often not effective for highly diverged sequences: they do not reflect the biological reality that two sequences may only share limited regions of conserved sequence.
  Sometimes two sequences may be derived from ancient recombination events where only a single functional domain is shared.
Global methods are useful when you want to force two sequences to align over their entire length.
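A compact Python sketch of Needleman-Wunsch global alignment is shown here for orientation. It assumes a simple linear gap penalty and illustrative scores (match = 1, mismatch = -1, gap = -1); production tools such as needle use substitution matrices and affine gaps, so treat this as a teaching sketch, not a reimplementation of any program.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-1):
    """Global alignment with a linear gap penalty.

    Returns (score, aligned_s, aligned_t); ties are broken arbitrarily.
    """
    m, n = len(s), len(t)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap       # s prefix aligned to gaps only
    for j in range(1, n + 1):
        score[0][j] = j * gap       # t prefix aligned to gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to the top-left corner.
    row_s, row_t = [], []
    i, j = m, n
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return score[m][n], "".join(reversed(row_s)), "".join(reversed(row_t))


score, top, bottom = needleman_wunsch("ACGT", "AGT")
print(score)
print(top)
print(bottom)
```

Because the traceback always runs corner to corner, every residue of both sequences appears in the output, which is exactly the "forced" end-to-end behaviour described above.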

LOCAL ALIGNMENT
Identify the most similar sub-region shared between two sequences.
There is no attempt to force entire sequences into an alignment, just those parts that appear to have good similarity, according to some criterion.
Smith-Waterman algorithm (EMBOSS program: water)
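The local variant can be sketched in Python as follows. The scores (match = 2, mismatch = -1, gap = -2) and the linear gap penalty are assumptions made for this example. Cell scores are never allowed to fall below zero, and the traceback starts at the highest-scoring cell and stops at the first zero.

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-2):
    """Local alignment with a linear gap penalty.

    Returns (score, aligned_s, aligned_t) for one best-scoring local alignment.
    """
    m, n = len(s), len(t)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(0,                          # start a new alignment
                              score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero-valued cell is reached.
    row_s, row_t = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        pair = match if s[i - 1] == t[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + pair:
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return best, "".join(reversed(row_s)), "".join(reversed(row_t))


print(smith_waterman("TTTACGTTT", "GGACGGG"))
```

In the usage line only the shared ACG stretch is reported; the unrelated flanking residues are left out rather than being forced into the alignment.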

LOCAL ALIGNMENTS
It may seem that one should always use local alignments.
However, it may be difficult to spot an overall similarity, as opposed to just a domain-to-domain similarity, if one uses only local alignment. So global alignment is useful in some cases.
The popular programs BLAST and FASTA for searching sequence databases produce local alignments.

GAPS AND INSERTIONS


In an alignment, much better correspondence can be obtained between two sequences if a gap can be introduced in one sequence.
Alternatively, an insertion could be allowed in the other sequence.
Biologically, this corresponds to a mutation event that eliminates a part of a gene, or introduces new DNA into a gene.

GAPS
Positions at which a letter is paired with a null are called gaps.
Gap scores are typically negative.

QUESTION 5
Which is more significant? The presence of a gap or the length of a gap?
A. The presence of a gap
B. The length of a gap

GAPS
Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is considered more significant than the length of the gap.

OPTIMAL ALIGNMENT
The alignment that is the best, given a defined set of rules and parameter values for comparing different alignments.
There is no such thing as the single best alignment, since optimality always depends on the assumptions one bases the alignment on. For example, what penalty should gaps carry?
All sequence alignment procedures make some such assumptions.

PARAMETERS OF SEQUENCE ALIGNMENT


Scoring systems:
  Each symbol pairing is assigned a numerical value, based on a symbol comparison table.
Gap penalties:
  Opening: the cost to introduce a gap
  Extension: the cost to elongate a gap

DNA SCORING SYSTEMS VERY SIMPLE

Match: 1
Mismatch: 0
[Figure: example DNA alignment, Score = 5]

PROTEIN SCORING SYSTEMS

Amino acids have different biochemical and physical properties that influence their relative replaceability in evolution.

Scoring matrices reflect:
  Number of mutations to convert one to another
  Chemical similarity
  Observed mutation frequencies
  The probability of occurrence of each amino acid

Widely used scoring matrices:
  PAM
  BLOSUM

PAM MATRICES
Point Accepted Mutation
Family of matrices: PAM 80, PAM 120, PAM 250
The number attached to a PAM matrix represents the evolutionary distance between the sequences on which the matrix is based.
  PAM 250 = 250 mutations per 100 residues
Greater numbers denote greater evolutionary distance.
Derived from global alignments of protein families. Family members share at least 85% identity.
Construction of a phylogenetic tree and ancestral sequences for each protein family
Computation of the number of replacements for each pair of amino acids

PAM 250 MATRIX

PAM LIMITATIONS
Based on only one original dataset
Based mainly on small globular proteins so the matrix is biased

Examines proteins with few differences (85% identity)

BLOSUM MATRICES
BLOcks SUbstitution Matrix
Derived from alignments of domains of distantly related proteins
Different BLOSUMn matrices are calculated independently from blocks (ungapped local alignments)
For BLOSUMn, sequences within a block that share at least n percent identity are clustered and counted as one
BLOSUM 62 represents closer sequences than BLOSUM 45
Built from the BLOCKS database: from the most conserved regions of aligned sequences
~2000 blocks from 500 families have been used
BLOSUM 62 is the most popular matrix and is the default matrix for the standard BLAST program.

BLOSUM 50 MATRIX

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   5  -2  -1  -2  -1  -1  -1   0  -2  -1  -2  -1  -1  -3  -1   1   0  -3  -2   0
R       7  -1  -2  -4   1   0  -3   0  -4  -3   3  -2  -3  -3  -1  -1  -3  -1  -3
N           7   2  -2   0   0   0   1  -3  -4   0  -2  -4  -2   1   0  -4  -2  -3
D               8  -4   0   2  -1  -1  -4  -4  -1  -4  -5  -1   0  -1  -5  -3  -4
C                  13  -3  -3  -3  -3  -2  -2  -3  -2  -2  -4  -1  -1  -5  -3  -1
Q                       7   2  -2   1  -3  -2   2   0  -4  -1   0  -1  -1  -1  -3
E                           6  -3   0  -4  -3   1  -2  -3  -1  -1  -1  -3  -2  -3
G                               8  -2  -4  -4  -2  -3  -4  -2   0  -2  -3  -3  -4
H                                  10  -4  -3   0  -1  -1  -2  -1  -2  -3   2  -4
I                                       5   2  -3   2   0  -3  -3  -1  -3  -1   4
L                                           5  -3   3   1  -4  -3  -1  -2  -1   1
K                                               6  -2  -4  -1   0  -1  -3  -2  -3
M                                                   7   0  -3  -2  -1  -1   0   1
F                                                       8  -4  -3  -2   1   4  -1
P                                                          10  -1  -1  -4  -3  -3
S                                                               5   2  -4  -2  -2
T                                                                   5  -3  -2   0
W                                                                      15   2  -3
Y                                                                           8  -1
V                                                                               5

Positive scores on diagonal (identities)
Similar residues get higher (positive) scores
Dissimilar residues get smaller (negative) scores
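How such a matrix is used during scoring can be sketched in Python with a handful of entries copied from the BLOSUM 50 table (for example R/K = 3, D/E = 2, I/V = 4, D/F = -5). The dictionary holds only this small subset, and the names `BLOSUM50_SUBSET`, `substitution_score` and `score_ungapped` are hypothetical; a real program would load the full matrix.

```python
# A few BLOSUM 50 entries taken from the table; NOT the complete matrix.
BLOSUM50_SUBSET = {
    ("A", "A"): 5, ("R", "R"): 7, ("D", "D"): 8, ("I", "I"): 5,
    ("R", "K"): 3, ("D", "E"): 2, ("I", "V"): 4, ("I", "L"): 2,
    ("F", "Y"): 4, ("D", "F"): -5, ("G", "W"): -3,
}


def substitution_score(a, b, matrix=BLOSUM50_SUBSET):
    """Look up a pair score; the matrix is symmetric, so try both orders."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    if (b, a) in matrix:
        return matrix[(b, a)]
    raise KeyError(f"pair {a}/{b} is not in this illustrative subset")


def score_ungapped(seq_a, seq_b, matrix=BLOSUM50_SUBSET):
    """Sum the substitution scores of two equal-length, ungapped sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(substitution_score(a, b, matrix) for a, b in zip(seq_a, seq_b))


# Conservative replacements still earn positive scores: 3 + 2 + 4 = 9
print(score_ungapped("RDI", "KEV"))
# A dissimilar pair is penalized
print(substitution_score("D", "F"))
```

The first call shows why a protein alignment can score well with no identical residues at all: chemically similar replacements count in its favour.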

QUESTION 6
Which is the appropriate matrix to use when comparing highly divergent sequences?
A. BLOSUM with a lower n
B. PAM with a lower n
C. BLOSUM with a higher n
D. Both B and C

PAM VS. BLOSUM


Equivalent matrices (more distant sequences further down the list):
  PAM 100 = BLOSUM 90
  PAM 120 = BLOSUM 80
  PAM 160 = BLOSUM 60
  PAM 200 = BLOSUM 52
  PAM 250 = BLOSUM 45

                    PAM        BLOSUM
General use         PAM 120    BLOSUM 62
Close relations     PAM 160    BLOSUM 80
Distant relations   PAM 250    BLOSUM 45

TIPS ON CHOOSING A SCORING MATRIX


Generally, BLOSUM matrices perform better than PAM matrices for local similarity searches (Henikoff & Henikoff, 1993).
When comparing closely related proteins, one should use lower PAM or higher BLOSUM matrices; for distantly related proteins, higher PAM or lower BLOSUM matrices.
For database searching, the commonly used matrix is BLOSUM62.

PAM
  Built from global alignments
  Built from a small amount of data
  Based on minimum replacement or maximum parsimony
  Better for finding global alignments and remote homologs
  Higher PAM series means more divergence

BLOSUM
  Built from local alignments
  Built from a vast amount of data
  Based on groups of related sequences counted as one
  Better for finding local alignments
  Lower BLOSUM series means more divergence

SCORING INSERTIONS AND DELETIONS

The creation of a gap is penalized with a negative score value.

WHY GAP PENALTIES?

The optimal alignment of two similar sequences is usually that which
  maximizes the number of matches and
  minimizes the number of gaps.
Permitting the insertion of arbitrarily many gaps can lead to high scoring alignments of non-homologous sequences.
Penalizing gaps forces alignments to have relatively few gaps.

BALANCING GAPS WITH MISMATCHES


Gaps must get a steep penalty, or else you'll end up with nonsense alignments.
In real sequences, multi-base (or amino acid) gaps are quite common.
Affine gap penalties give a big penalty for each new gap, but a much smaller gap extension penalty.
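The effect of an affine penalty can be illustrated with a short Python sketch. The values (opening = -10, extension = -1) are assumptions for this example, and conventions differ between programs: here the opening penalty covers the first gapped position and each further position costs one extension.

```python
from itertools import groupby


def affine_gap_cost(length, gap_open=-10, gap_extend=-1):
    """Cost of one gap of the given length under an affine scheme."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


def row_gap_cost(row, gap_open=-10, gap_extend=-1):
    """Total gap cost of one aligned row, where '-' marks gapped positions."""
    total = 0
    for symbol, run in groupby(row):
        if symbol == "-":
            total += affine_gap_cost(len(list(run)), gap_open, gap_extend)
    return total


print(row_gap_cost("AC----GT"))   # one gap of length 4: -10 + 3 * (-1) = -13
print(row_gap_cost("A-C-G-T-"))   # four gaps of length 1: 4 * (-10) = -40
```

Both rows contain four gapped positions, but the single long gap costs far less than four separate ones, reflecting the idea that opening a gap matters more than lengthening it.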
