Pairwise Alignment Prelab PDF

PAIRWISE ALIGNMENT
SEQUENCES ARE RELATED

Darwin: all organisms are related through descent with modification.
Sequences are related through descent with modification.
Similar molecules have similar functions in different organisms.
Figure: phylogenetic tree based on ribosomal RNA, showing the three domains of life.
WHY COMPARE SEQUENCES?

To determine evolutionary relationships.
To decide if two proteins (or genes) are related structurally or functionally.
Example: Protein 1 binds oxygen and shows sequence similarity to Protein 2. Does Protein 2 bind oxygen?
WHY COMPARE SEQUENCES?

To identify domains or motifs that are shared between proteins.
TERMINOLOGIES
Similarity: the extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation.
Identity: the extent to which two sequences are invariant.
TERMINOLOGIES
Conservation: changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physicochemical properties of the original residue.
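The three terms above can be made concrete with a small sketch. It assumes the two sequences are already aligned (equal length, `-` for gaps), and it uses a hypothetical, simplified grouping of amino acids by physicochemical property; the grouping is an illustrative assumption, not a standard classification or a scoring matrix.

```python
# Hypothetical grouping of residues with similar physicochemical properties.
# This is an illustrative assumption, not a published classification.
CONSERVATIVE_GROUPS = [set("ILVM"), set("FYW"), set("KRH"),
                       set("DE"), set("STNQ"), set("AG")]


def is_conservative(a, b):
    """True if a and b differ but fall in the same property group."""
    return a != b and any(a in g and b in g for g in CONSERVATIVE_GROUPS)


def identity_and_similarity(row1, row2):
    """Return (percent identity, percent similarity) over aligned columns.

    Similarity is counted as identity plus conservation, as defined above.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    columns = list(zip(row1, row2))
    identical = sum(a == b and a != "-" for a, b in columns)
    conserved = sum(is_conservative(a, b) for a, b in columns)
    n = len(columns)
    return 100 * identical / n, 100 * (identical + conserved) / n


# Ten ungapped columns used purely as sample input.
print(identity_and_similarity("DSSYYWEIEA", "DGTDEWEIDV"))
```

With this grouping, four of the ten columns are identical and two more are conservative changes (S/T and E/D), so the call reports 40% identity and 60% similarity.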
CONSERVED RESIDUES
Figure: residues conserved among various G protein-coupled receptors are highlighted in green.
CONSERVATION OF FUNCTION
Alignments can reveal which parts of the sequences are likely to be important for the function, if the proteins are involved in similar processes.
In parts of the sequence of a protein which are not very critical for its function, random mutations can easily accumulate.
In parts of the sequence that are critical for the function of the protein, hardly any mutations will be accepted; nearly all changes in such regions will destroy the function.
INSERTIONS/DELETIONS AND PROTEIN STRUCTURE

Why is it that two similar sequences may have large insertions/deletions?
Some insertions and deletions may not significantly affect the structure of a protein.
Loop structures: insertions/deletions here are not so significant.
COMPARING THE PROTEIN KINASE KRAF_HUMAN AND THE UNCHARACTERIZED O22558 FROM ARABIDOPSIS USING BLAST
546 AA
Score = 185 bits (464), Expect = 1e-45
Identities = 107/283 (37%), Positives = 172/283 (59%), Gaps = 15/283 (5%)

Query: 337 DSSYYWEIEASEVMLSTRIGSGSFGTVYKGKWHG-DVAVKILKVVDPTPEQFQAFRNEVA 395
           D +  WEI+ +++ +  ++ SGS+G +++G +   +VA+K LK      E  + F  EV
Sbjct: 274 DGTDEWEIDVTQLKIEKKVASGSYGDLHRGTYCSQEVAIKFLKPDRVNNEMLREFSQEVF 333

Query: 396 VLRKTRHVNILLFMGYMTKD-NLAIVTQWCEGSSLYKHLHVQETKFQMFQLIDIARQTAQ 454
           ++RK RH N++ F+G  T+   L IVT++    S+Y  LH Q+  F++  L+ +A   A+
Sbjct: 334 IMRKVRHKNVVQFLGACTRSPTLCIVTEFMARGSIYDFLHKQKCAFKLQTLLKVALDVAK 393

Query: 455 GMDYLHAKNIIHRDMKSNNIFLHEGLTVKIGDFGLATVKSRWSGSQQVEQPTGSVLWMAP 514
           GM YLH  NIIHRD+K+ N+ + E   VK+ DFG+A V+   SG    E  TG+  WMAP
Sbjct: 394 GMSYLHQNNIIHRDLKTANLLMDEHGLVKVADFGVARVQIE-SGVMTAE--TGTYRWMAP 450

Query: 515 EVIRMQDNNPFSFQSDVYSYGIVLYELMTGELPYSHINNRDQIIFMVGRGYASPDLSKLY 574
           EVI   ++ P++ ++DV+SY IVL+EL+TG++PY+ +      + +V +G   P + K
Sbjct: 451 EVI---EHKPYNHKADVFSYAIVLWELLTGDIPYAFLTPLQAAVGVVQKG-LRPKIPK-- 504

Query: 575 KNCPKAMKRLVADCVKKVKEERPLFPQILSSIELLQHSLPKIN 617
           K  PK +K L+  C  +  E+RPLF +I   IE+LQ  + ++N
Sbjct: 505 KTHPK-VKGLLERCWHQDPEQRPLFEEI---IEMLQQIMKEVN 543
SIMILARITY AND HOMOLOGY

Similarity can be expressed as a percentage. It does not imply any reasons for the observed sameness.
Homology is an evolutionary term used to describe relationship via descent from a common ancestor.
Homologous things are often similar, but not always (e.g. whale flipper vs. human arm).
Homology is NEVER expressed as a percentage.
HOMOLOGY
Homologous sequences can be divided into three groups:
Orthologous sequences: sequences that diverged due to a speciation event (e.g. human α-globin and mouse α-globin).
Paralogous sequences: sequences that diverged due to a gene duplication event (e.g. human α-globin and human β-globin, various versions of both).
Xenologous sequences: sequences for which the history of one of them involves interspecies transfer since the time of their common ancestor.
HOMOLOGY
SIMILARITY AND HOMOLOGY

Sequence homology can be reliably inferred from statistically significant similarity over a majority of the sequence length.
Non-homology CANNOT be inferred from non-similarity, because non-similar things can still share a common ancestor.
Homologous proteins share common structures, but not necessarily common sequence or function.
Homology is all or nothing. There is no such thing as 50% homology.
QUESTION 1
True or False. Homology is synonymous with similarity
SEARCHING SEQUENCE DATABASES

When we search a sequence database, we are usually looking for related sequences.
Unfortunately, the algorithms that we have for searching databases do not search for homology; they search for similarity.
When similarity is found, we must determine whether it is a result of homology or comes from another source.
WHY SEARCH FOR SIMILARITY?

I have just sequenced something. What is known about the thing I sequenced?
I have a unique sequence. Is there similarity to another gene that has a known function?
I found a new protein in a lower organism. Is it similar to a protein from another species?
I have decided to work on a new gene. The people in the field will not give me the plasmid. I need the complete cDNA sequence to perform RT-PCR or some other experiment.
SEQUENCE ALIGNMENT: DEFINITION

The process of lining up two or more sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology.
SEQUENCE ALIGNMENT
Comparing sequences provides information as to which genes or proteins have the same function.
Sequences are compared by aligning them: sliding them along each other to find the most matches with a few gaps.
An alignment can be scored: count matches, and penalize mismatches and gaps.
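As a rough illustration of scoring an alignment by counting matches and penalizing mismatches and gaps, the sketch below scores two already-aligned rows column by column. The function name and the default values (`match=1`, `mismatch=0`, `gap=-1`) are assumptions chosen for the example, not values prescribed by these slides.

```python
def score_alignment(row1, row2, match=1, mismatch=0, gap=-1):
    """Score two aligned rows column by column.

    '-' marks a gap. The default scores are illustrative assumptions.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score += gap        # a letter paired with a null
        elif a == b:
            score += match      # identical residues
        else:
            score += mismatch   # different residues
    return score


# Six matching columns and one gap column.
print(score_alignment("GATTACA", "GA-TACA"))
```

Changing the three parameters changes which alignment of the same two sequences scores best, which is why the scoring system has to be stated alongside any alignment.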
QUESTION 2
Whenever possible, it is better to
A. Compare proteins than to compare genes
B. Compare genes than to compare proteins
Discuss as a group and cite points that defend your argument.
IT IS MUCH EASIER TO ALIGN PROTEINS

4 DNA bases vs. 20 amino acids: less chance similarity
There are varying degrees of similarity between different amino acids
Protein databanks are much smaller than DNA databanks
PAIRWISE ALIGNMENT
The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem.
Two sequences can always be aligned.
Sequence alignments have to be scored.
Often there is more than one solution with the same score.
PAIRWISE ALIGNMENTS: PURPOSE

Identification of sequences with significant similarity to (a) sequence(s) in a sequence repository
Identification of all homologous sequences within the repository
Identification of domains with sequence similarity
METHODS OF ALIGNMENT
By hand: slide sequences on two lines of a word processor
Dot plot
Rigorous mathematical approach: dynamic programming (slow, optimal)
Heuristic methods (fast, approximate): BLAST and FASTA (use word matching and hash tables)
ALIGNMENT BY HAND
GATCGCCTA_TTACGTCCTGGAC <---> AGGCATACGTA_GCCCTTTCGC
A scoring system is still essential to find the best alignment
DOTPLOTS
Not technically an alignment
Gives a picture of correspondence between pairs of sequences
A dot represents similarity between segments of the two sequences
QUESTION 3
Do diagonals correspond to conserved regions?
A. Yes
B. No
QUESTION 3 REDUX
Take note that the dots are placed at grid points where the two sequences have identical residues.
Do diagonals correspond to conserved regions?
A. Yes
B. No
A LIMITATION TO DOT MATRIX COMPARISON

Where part of one sequence shares a long stretch of similarity with the other sequence, a diagonal of dots will be evident in the matrix.
However, when single bases are compared at each position, most of the dots in the matrix will be due to background similarity.
That is, for any two nucleotides compared between the two sequences, there is a 1 in 4 chance of a match, assuming equal frequencies of A, G, C and T.
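A minimal sketch of a single-residue dot plot makes the background problem visible. The two short sequences are taken from the axis labels that survive in the dot plot slides of this deck; how those labels split into two sequences is an assumption, and the function name is hypothetical.

```python
def dot_plot(s, t):
    """Return one text row per residue of s: '*' where s[i] == t[j], else '.'."""
    return ["".join("*" if a == b else "." for b in t) for a in s]


s = "GGGATTGACCCG"   # vertical axis (assumed from the slide residue)
t = "GCTTGACCGG"     # horizontal axis (assumed from the slide residue)

rows = dot_plot(s, t)
print("  " + t)
for residue, row in zip(s, rows):
    print(residue + " " + row)

# Fraction of grid points that carry a dot.
dots = sum(row.count("*") for row in rows)
print(dots, "dots out of", len(s) * len(t), "grid points")
```

Every G in one sequence produces a dot against every G in the other, so the true diagonal is surrounded by scattered dots that carry no positional information.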
SIMPLE DOT PLOT

Figure: simple dot plot; sequence GGGATTGACCCG along one axis.
A SOLUTION
This background noise can be filtered out by comparing groups of l nucleotides, rather than single nucleotides, at each position.
For example, if we compare dinucleotides (l = 2), the probability of two dinucleotides chosen at random from each sequence matching is 1/16, rather than 1/4.
Therefore, the number of background matches will be lower.
A FILTERED DOT PLOT

Figure: filtered dot plot; axis sequences GGGATTGACCCG and GCTTGACCGG.
THE DOT MATRIX ALGORITHM

The dot-matrix algorithm can be generalized for sequences s and t of sizes m and n, respectively, and window size l.
For each position in sequence s, compare a window of l nucleotides centered at that position with each window of l nucleotides in sequence t.
Conceptually, you can think of windows of length l sliding along each axis, so that all possible windows of l nucleotides are compared between the two sequences.
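The windowed comparison can be sketched as below. Two simplifications are assumptions of this sketch: windows are indexed by their start position rather than their center (which only shifts the coordinates), and a `stringency` parameter is interpreted as the minimum number of matching positions within a window. The sequences are the short examples read from the slide residue.

```python
def windowed_dot_plot(s, t, window=3, stringency=3):
    """Return (i, j) start positions where the windows s[i:i+window] and
    t[j:j+window] match in at least `stringency` positions."""
    hits = []
    for i in range(len(s) - window + 1):
        for j in range(len(t) - window + 1):
            matches = sum(s[i + k] == t[j + k] for k in range(window))
            if matches >= stringency:
                hits.append((i, j))
    return hits


s = "GGGATTGACCCG"
t = "GCTTGACCGG"
print(windowed_dot_plot(s, t, window=3, stringency=3))
```

With a window of 3 and full stringency, only five dots remain, four of them on a single diagonal (the shared stretch TTGACC), compared with the scattered dots of the single-residue plot.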
Figure: dot plot with window size l = 3; axis sequences GGGATTGACCCG and GCTTGACCGG.
DOT MATRIX SEQUENCE COMPARISON EXAMPLES
COMPARING A PROTEIN WITH ITSELF

Proteins can be compared with themselves to show internal duplications or repeating sequences.
A self-matrix produces a central diagonal line through the origin, indicating an exact match between the x and y axes.
The parallel diagonals that appear off the central line are indicative of repeated sequence elements in different locations of the same protein.
HAPTOGLOBIN
Haptoglobin is a protein that is secreted into the blood by the liver. This protein binds free hemoglobin.
The concentration of "free" hemoglobin (that is, outside red blood cells) in plasma (the fluid portion of blood) is ordinarily very low. However, free hemoglobin is released when red blood cells hemolyze for any reason.
After haptoglobin binds hemoglobin, it is taken up by the liver. The liver recycles the iron, heme, and amino acids contained in the hemoglobin protein.
OUR COMPARISON
Files used: 1006264A Haptoglobin H2
DNA sequencing shows that the intragenic duplication within the human haptoglobin Hp2 allele was formed by a nonhomologous, probably random, crossing-over within different introns of two Hp1 genes.
A repeated sequence (starting with ADDGCP...) is observed beginning at positions 30-90 and 90-150, probably due to a duplication event in one of these locations.
Window: 30, Stringency: 3, BLOSUM 62 matrix
SEARCHING FOR REPEATS IN DOTPLOTS

One of the strengths of dot-matrix searches is that they make repeats easy to detect by comparing a sequence against itself.
In self comparisons, direct repeats appear as diagonals parallel to the main line of identity.
COMPARISON OF TWO SIMILAR SEQUENCES

Files used:
P03035: repressor protein from Salmonella phage P22
RPBPL: repressor protein from E. coli phage lambda
Lambda phages infect E. coli. They can be lytic and destroy the host cell, making hundreds of progeny. They can also be lysogenic and live quietly within the DNA of the bacterium.
A gene makes the repressor protein that prevents the phage from going destructively lytic.
Phage P22 is a related phage that also makes a repressor. Both proteins form a dimer and bind DNA to prevent lysis.
LAMBDA REPRESSOR/OPERATOR COMPLEX (1LMB)
DOT MATRIX SEQUENCE COMPARISON

A row of dots represents a region of sequence similarity. Background matching also appears as scattered dots.
QUESTION 4
Which of the following combinations of parameters will produce the least background noise?

A. Low window, low stringency
B. Low window, high stringency
C. High window, low stringency
D. High window, high stringency
DISADVANTAGES TO DOT PLOTS

While dot-matrix searches provide a great deal of information in a visual fashion, they can only be considered semi-quantitative, and therefore do not lend themselves to statistical analysis.
Also, dot-matrix searches do not provide a precise alignment between two sequences.
RIGOROUS ALGORITHMS
DYNAMIC PROGRAMMING
ALGORITHM
An algorithm is a complete, unambiguous procedure for solving a specified problem in a finite number of steps.
Algorithms leave nothing undefined and require no intuition to achieve their end.
FIVE FEATURES OF AN ALGORITHM:

An algorithm must stop after a finite number of steps.
All steps of the algorithm must be precisely defined.
Input to the algorithm must be specified.
Output of the algorithm must be specified. There must be at least one output.
An algorithm must be effective, i.e. its operations must be basic and doable.
DYNAMIC PROGRAMMING
Algorithmic technique for optimization problems that have two properties:
Optimal substructure: the optimal solution can be computed from optimal solutions to subproblems.
Overlapping subproblems: subproblems overlap such that the total number of distinct subproblems to be solved is relatively small.
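Both properties can be seen in a toy problem that is closely related to alignment: the length of the longest common subsequence of two strings. This problem is not taken from the slides; it is used here only as an assumed illustration. Each subproblem is a pair of positions `(i, j)`, its answer is built from answers to smaller subproblems, and a cache ensures each distinct pair is solved once.

```python
from functools import lru_cache


def lcs_length(s, t):
    """Return (LCS length, number of distinct subproblems actually solved)."""

    @lru_cache(maxsize=None)
    def solve(i, j):
        # Subproblem: LCS length of the suffixes s[i:] and t[j:].
        if i == len(s) or j == len(t):
            return 0
        if s[i] == t[j]:
            return 1 + solve(i + 1, j + 1)            # optimal substructure
        return max(solve(i + 1, j), solve(i, j + 1))  # overlapping subproblems

    best = solve(0, 0)
    return best, solve.cache_info().currsize


length, solved = lcs_length("GATTACA", "GCATGCU")
print(length, solved)
```

The number of cached subproblems can never exceed (m + 1) × (n + 1) for strings of lengths m and n, whereas the same recursion without the cache would revisit many of those pairs repeatedly.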
RIGOROUS ALGORITHMS
Needleman-Wunsch (Global)
Smith-Waterman (Local)
GLOBAL VS. LOCAL ALIGNMENT

Global alignment algorithms start at the beginning of two sequences and add gaps to each until the end of one is reached.
Local alignment algorithms find the region (or regions) of highest similarity between two sequences and build the alignment outward from there.
GLOBAL VS. LOCAL ALIGNMENT
GLOBAL ALIGNMENT
The Needleman-Wunsch algorithm creates a global alignment over the length of both sequences (EMBOSS program: needle).
Global algorithms are often not effective for highly diverged sequences, because they do not reflect the biological reality that two sequences may share only limited regions of conserved sequence.
Sometimes two sequences may be derived from ancient recombination events where only a single functional domain is shared.
Global methods are useful when you want to force two sequences to align over their entire length.
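A compact sketch of Needleman-Wunsch global alignment follows. It assumes a simple scoring scheme (`match=1`, `mismatch=-1`) and a linear gap penalty (`gap=-2` per gapped position); these values and the function name are illustrative assumptions, and real tools typically use substitution matrices and affine gap penalties instead.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty.

    Returns (score, aligned_s, aligned_t). Scoring values are assumptions.
    """
    m, n = len(s), len(t)
    # F[i][j] = best score for aligning s[:i] with t[:j].
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,  # align s[i-1] with t[j-1]
                          F[i - 1][j] + gap,      # gap in t
                          F[i][j - 1] + gap)      # gap in s
    # Trace back from the bottom-right corner to recover one optimal alignment.
    a, b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return F[m][n], "".join(reversed(a)), "".join(reversed(b))


score, row1, row2 = needleman_wunsch("GATTACA", "GATACA")
print(score)
print(row1)
print(row2)
```

Because the traceback starts in the corner of the matrix, every residue of both sequences ends up in the alignment, which is the defining property of a global alignment. When several alignments share the optimal score, this sketch returns only one of them.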
LOCAL ALIGNMENT
Identify the most similar sub-region shared between two sequences.
There is no attempt to force entire sequences into an alignment, just those parts that appear to have good similarity, according to some criterion.
Smith-Waterman (EMBOSS program: water)
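A sketch of Smith-Waterman local alignment is given below, again under assumed scoring values (`match=2`, `mismatch=-1`, linear `gap=-2`) and a hypothetical function name. The two differences from a global alignment are that cell scores are never allowed to drop below zero, and that the traceback starts at the highest-scoring cell and stops when it reaches zero.

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-2):
    """Local alignment with a linear gap penalty.

    Returns (best score, aligned part of s, aligned part of t).
    Scoring values are illustrative assumptions.
    """
    m, n = len(s), len(t)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(0,                     # start a fresh local alignment
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score falls to zero.
    a, b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


print(smith_waterman("AAATTGACCTTT", "GGGTTGACCGGG"))
```

Only the shared stretch is reported; the dissimilar flanking residues are left out rather than being forced into the alignment.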
LOCAL ALIGNMENTS
It may seem that one should always use local alignments.
However, it may be difficult to spot an overall similarity, as opposed to just a domain-to-domain similarity, if one uses only local alignment. So global alignment is useful in some cases.
The popular programs BLAST and FASTA for searching sequence databases produce local alignments.
GAPS AND INSERTIONS

In an alignment, much better correspondence can be obtained between two sequences if a gap can be introduced in one sequence.
Alternatively, an insertion could be allowed in the other sequence.
Biologically, this corresponds to a mutation event that eliminates a part of a gene, or introduces new DNA into a gene.
GAPS
Positions at which a letter is paired with a null are called gaps.
Gap scores are typically negative.
QUESTION 5
Which is more significant? The presence of a gap or the length of a gap?
A. The presence of a gap
B. The length of a gap
GAPS
Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is considered more significant than the length of the gap.
OPTIMAL ALIGNMENT
The alignment that is the best, given a defined set of rules and parameter values for comparing different alignments.
There is no such thing as the single best alignment, since optimality always depends on the assumptions one bases the alignment on. For example, what penalty should gaps carry?
All sequence alignment procedures make some such assumptions.
PARAMETERS OF SEQUENCE ALIGNMENT

Scoring systems:
Each symbol pairing is assigned a numerical value, based on a symbol comparison table.

Gap penalties:
Opening: the cost to introduce a gap
Extension: the cost to elongate a gap
DNA SCORING SYSTEMS VERY SIMPLE
Match: 1 Mismatch: 0 Score = 5
PROTEIN SCORING SYSTEMS

Amino acids have different biochemical and physical properties that influence their relative replaceability in evolution.

Scoring matrices reflect:
Number of mutations to convert one to another
Chemical similarity
Observed mutation frequencies
The probability of occurrence of each amino acid

Widely used scoring matrices: PAM, BLOSUM
PAM MATRICES
Point Accepted Mutation
Family of matrices: PAM 80, PAM 120, PAM 250
The number with a PAM matrix represents the evolutionary distance between the sequences on which the matrix is based.
PAM 250 = 250 mutations per 100 residues
Greater numbers denote greater evolutionary distance
PAM MATRICES
Derived from global alignments of protein families. Family members share at least 85% identity.
Construction of phylogenetic tree and ancestral sequences of each protein family
Computation of number of replacements for each pair of amino acids
PAM 250 MATRIX
PAM LIMITATIONS
Based on only one original dataset
Based mainly on small globular proteins so the matrix is biased
Examines proteins with few differences (85% identity)
BLOSUM MATRICES
BLOcks SUbstitution Matrix
Derived from alignments of domains of distantly related proteins
Different BLOSUMn matrices are calculated independently from blocks (ungapped local alignments)
BLOSUMn is based on a cluster of BLOCKS of sequences that share at least n percent identity
BLOSUM 62 represents closer sequences than BLOSUM 45
BLOSUM MATRICES
Built from the BLOCKS database: from the most conserved regions of aligned sequences
~2000 blocks from 500 families have been used
BLOSUM 62 is the most popular matrix and is the default matrix for the standard BLAST program.
BLOSUM 50 MATRIX
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   5  -2  -1  -2  -1  -1  -1   0  -2  -1  -2  -1  -1  -3  -1   1   0  -3  -2   0
R       7  -1  -2  -4   1   0  -3   0  -4  -3   3  -2  -3  -3  -1  -1  -3  -1  -3
N           7   2  -2   0   0   0   1  -3  -4   0  -2  -4  -2   1   0  -4  -2  -3
D               8  -4   0   2  -1  -1  -4  -4  -1  -4  -5  -1   0  -1  -5  -3  -4
C                  13  -3  -3  -3  -3  -2  -2  -3  -2  -2  -4  -1  -1  -5  -3  -1
Q                       7   2  -2   1  -3  -2   2   0  -4  -1   0  -1  -1  -1  -3
E                           6  -3   0  -4  -3   1  -2  -3  -1  -1  -1  -3  -2  -3
G                               8  -2  -4  -4  -2  -3  -4  -2   0  -2  -3  -3  -4
H                                  10  -4  -3   0  -1  -1  -2  -1  -2  -3   2  -4
I                                       5   2  -3   2   0  -3  -3  -1  -3  -1   4
L                                           5  -3   3   1  -4  -3  -1  -2  -1   1
K                                               6  -2  -4  -1   0  -1  -3  -2  -3
M                                                   7   0  -3  -2  -1  -1   0   1
F                                                       8  -4  -3  -2   1   4  -1
P                                                          10  -1  -1  -4  -3  -3
S                                                               5   2  -4  -2  -2
T                                                                   5  -3  -2   0
W                                                                      15   2  -3
Y                                                                           8  -1
V                                                                               5
Positive scores on diagonal (identities)
Similar residues get higher (positive) scores
Dissimilar residues get smaller (negative) scores
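To show how such a matrix is used as a symbol comparison table, the sketch below looks up pair scores in a small excerpt of the BLOSUM 50 values listed in this section. Only a handful of pairs are included, so this is not a complete matrix; the dictionary layout and function names are assumptions made for the example.

```python
# A few BLOSUM 50 entries copied from the matrix in this section.
# Keys are residue pairs in alphabetical order; the matrix is symmetric.
BLOSUM50_EXCERPT = {
    ("A", "A"): 5, ("C", "C"): 13, ("W", "W"): 15,   # identities (diagonal)
    ("I", "V"): 4, ("I", "L"): 2, ("K", "R"): 3,     # similar residues
    ("D", "E"): 2, ("F", "Y"): 4,
    ("D", "L"): -4, ("D", "W"): -5, ("G", "W"): -3,  # dissimilar residues
}


def pair_score(a, b, table=BLOSUM50_EXCERPT):
    """Look up the score for residues a and b, ignoring their order.

    Raises KeyError for pairs that are not part of the excerpt.
    """
    return table[tuple(sorted((a, b)))]


def ungapped_score(s, t, table=BLOSUM50_EXCERPT):
    """Sum the pair scores of two equal-length, ungapped sequences."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b, table) for a, b in zip(s, t))


print(pair_score("V", "I"))            # similar residues score positive
print(pair_score("W", "D"))            # dissimilar residues score negative
print(ungapped_score("KIDW", "RVEW"))  # 3 + 4 + 2 + 15
```

Note that identities are not all worth the same: a conserved tryptophan (15) contributes three times as much as a conserved alanine (5).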
QUESTION 6
Which is the appropriate matrix to use when comparing highly divergent sequences?

A. BLOSUM with a lower n
B. PAM with a lower n
C. BLOSUM with a higher n
D. Both B and C
PAM VS. BLOSUM

PAM 100 = BLOSUM 90
PAM 120 = BLOSUM 80
PAM 160 = BLOSUM 60
PAM 200 = BLOSUM 52
PAM 250 = BLOSUM 45
Moving down the list: more distant sequences
PAM 120 for general use
PAM 60 for close relations
PAM 250 for distant relations
BLOSUM 62 for general use
BLOSUM 80 for close relations
BLOSUM 45 for distant relations
TIPS ON CHOOSING A SCORING MATRIX

Generally, BLOSUM matrices perform better than PAM matrices for local similarity searches (Henikoff & Henikoff, 1993).
When comparing closely related proteins, one should use lower PAM or higher BLOSUM matrices; for distantly related proteins, higher PAM or lower BLOSUM matrices.
For database searching, the commonly used matrix is BLOSUM62.
PAM
Built from global alignments
Built from a small amount of data
Based on minimum replacement or maximum parsimony
Better for finding global alignments and remote homologs
Higher PAM series means more divergence
BLOSUM
Built from local alignments
Built from a vast amount of data
Based on groups of related sequences counted as one
Better for finding local alignments
Lower BLOSUM series means more divergence
SCORING INSERTIONS AND DELETIONS
The creation of a gap is penalized with a negative score value.
WHY GAP PENALTIES?
WHY GAP PENALTIES?

The optimal alignment of two similar sequences is usually that which maximizes the number of matches and minimizes the number of gaps.
Permitting the insertion of arbitrarily many gaps can lead to high scoring alignments of non-homologous sequences.
Penalizing gaps forces alignments to have relatively few gaps.
BALANCING GAPS WITH MISMATCHES

Gaps must get a steep penalty, or else you'll end up with nonsense alignments.
In real sequences, multi-base (or amino acid) gaps are quite common.
Affine gap penalties give a big penalty for each new gap, but a much smaller gap extension penalty.
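The difference between a flat per-residue penalty and an affine penalty can be sketched as follows. The values `gap_open=10` and `gap_extend=1` are illustrative assumptions, and so is the convention that the opening cost covers the first gapped position; some programs instead charge the extension cost for every position, including the first.

```python
def affine_gap_cost(length, gap_open=10, gap_extend=1):
    """Penalty for ONE gap of `length` positions.

    Assumed convention: the opening cost covers the first position and each
    further position costs `gap_extend`.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


def linear_gap_cost(length, per_position=10):
    """Penalty when every gapped position costs the same."""
    return per_position * length


# One gap of five positions versus five separate one-position gaps.
print(affine_gap_cost(5))       # one mutational event
print(5 * affine_gap_cost(1))   # five independent events
print(linear_gap_cost(5))       # linear scoring cannot tell the two apart
```

Under the affine scheme a single long gap is much cheaper than the same number of gapped positions scattered across the alignment, which reflects the idea that the presence of a gap matters more than its length.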
SCORING INSERTIONS AND DELETIONS


Dot matrix comparison panels (BLOSUM 62 matrix): window 10, stringency 1; window 10, stringency 3; window 30, stringency 1; window 30, stringency 3.
