
Pharmaceutical Bioinformatics, 7.5p
Lecture notes

Alignments in bioinformatics

Lecture notes

Compiled by:

Ola Spjuth [ola.spjuth@farmbio.uu.se]


Department of Pharmaceutical Biosciences
Uppsala University


ALIGNMENTS IN BIOINFORMATICS
Sequence Analysis
Biological Background for Sequence Analysis
Searching of databases for sequences similar to a new sequence
Sequence alignment
Multiple sequence alignment
Evaluating local multiple alignments
Tools for sequence alignment
BLAST
Clustal
Uses of multiple alignment
Searching
PCR primer design
Structural alignments
Data produced by structural alignment
References


Sequence Analysis

Biological Background for Sequence Analysis

The fundamental building blocks of life are proteins. Enzymes, which are the
molecular machines responsible for virtually all of the chemical transformations that
cells are capable of, are proteins. In addition, much of the structure of a cell is made
up of proteins. That part of the structure which is not made up of proteins is produced
by enzymes which are proteins. A human contains on the order of 100,000 different
proteins. It is the properties of and the interactions between these 100,000 proteins
that make us what we are.

Proteins are variable length linear, mixed polymers of 20 different amino acids. Other
terms used more or less interchangeably for amino acid polymers are peptides and
polypeptides. These topologically linear polymers fold upon themselves to generate a
shape characteristic of each different protein, and this shape along with the different
chemical properties of the 20 amino acids determine the function of the protein. One
of the most important concepts in modern biology is that the functional properties of
proteins are determined largely by the sequence of the 20 amino acids in the linear
polypeptide chain, and that in many cases proteins are largely self-folding. Thus, in
theory, knowing the sequence of a protein (the order in which the amino acids
occur) one could infer its function.

What determines the order of amino acids in a protein? The Central Dogma of
Molecular Biology describes how the genetic information we inherit from our parents
is stored in DNA, and that information is used to make identical copies of that DNA
and is also transferred from DNA to RNA to protein. DNA is a linear polymer of 4
nucleotides [4]: deoxyAdenosine monophosphate (abbreviated A), deoxyThymidine
monophosphate (abbreviated T), deoxyGuanosine monophosphate (abbreviated G)
and deoxyCytidine monophosphate (abbreviated C). RNA is a very similar polymer
of Adenosine monophosphate, Guanosine monophosphate, Cytidine monophosphate,
and Uridine monophosphate. Uridine monophosphate, abbreviated U, is a nucleotide
functionally equivalent to Thymidine monophosphate.

A property of both DNA and RNA is that the linear polymers can pair one with
another, such pairing being sequence specific. In such double polymers (referred to as
a "double helix" due to the shape they assume) G pairs with C and A pairs with T or
U. All possible combinations of DNA and RNA double helices occur. One strand
of DNA can serve as a template for the construction of a complementary strand, and this
complementary strand can be used to recreate the original strand. This is the basis of
DNA replication and thus all of genetics. Similar templating results in an RNA copy
of a DNA sequence. Conversion of that RNA sequence into a protein sequence is
more complex. This occurs by translation of a code consisting of three nucleotides
into one amino acid, a process accomplished by cellular machinery including tRNA
and ribosomes.


Four different nucleotides taken three at a time can result in 64 different possible
triplet codes; more than enough to encode 20 amino acids. The way that these 64
codes are mapped onto 20 amino acids is first, that one amino acid may be encoded
by 1 to 6 different triplet codes, and second, that 3 of the 64 codes, called stop codons,
specify "end of peptide sequence". Where multiple codons specify the same amino
acid, the different codons are used with unequal frequency and this distribution of
frequency is referred to as "codon usage". Codon usage varies between species.
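
The counts in this paragraph can be checked with a few lines of Python. This is a minimal sketch assuming the standard genetic code; the names `CODON_TABLE` and `codons_per_amino_acid` are illustrative and do not come from any library.

```python
from collections import Counter
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every triplet (e.g. "ATG") to its one-letter amino acid code.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def codons_per_amino_acid(table=CODON_TABLE):
    """Count how many codons encode each amino acid ('*' counts stop codons)."""
    return Counter(table.values())

counts = codons_per_amino_acid()
print(len(CODON_TABLE))            # number of triplet codes
print(counts["*"])                 # number of stop codons
print(counts["M"], counts["L"])    # codons for methionine and for leucine
```

Codon usage for a given species could be tallied the same way by counting codons observed in its coding sequences rather than entries in the table.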

The fact that DNA nucleotides need to be read three at a time to specify a protein
sequence implies that a DNA sequence has three different reading frames determined
by whether you start at nucleotide one, two, or three. (Nucleotide four will be in the
same frame as nucleotide one and so on). Both strands of DNA can be copied into
RNA (for translation into protein). Thus, a DNA sequence with its (inferred)
complementary strand can specify six different reading frames.
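
The six reading frames can be enumerated directly. The sketch below is illustrative only: the function names are hypothetical, the input is assumed to contain only the letters A, C, G and T, and incomplete codons at the end of a frame are simply dropped.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(dna):
    """Return the complementary strand, read in its own 5'->3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))

def reading_frames(dna):
    """Split a DNA sequence into codons for all six reading frames.

    Returns a dict mapping a frame label ('+1'..'+3' for the given strand,
    '-1'..'-3' for the complementary strand) to a list of codons.
    """
    frames = {}
    for sign, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            codons = [strand[i:i + 3]
                      for i in range(offset, len(strand) - 2, 3)]
            frames[f"{sign}{offset + 1}"] = codons
    return frames

for label, codons in reading_frames("ATGGCCTAA").items():
    print(label, codons)
```

Starting at nucleotide four would reproduce frame +1 minus its first codon, which is why only three frames per strand are distinct.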

It is possible to chemically determine the sequence of amino acids in a protein and of
nucleotides in RNA or DNA. However, it is vastly easier at present to determine the
sequence of DNA than that of RNA or protein. Since the sequence of a protein can be
determined from the DNA sequence that encodes it, most protein sequences are in
fact inferred from DNA sequences. Conversion of RNA to a DNA copy (cDNA) is a
simple laboratory procedure, so RNA molecules are themselves sequenced as cDNA
copies.

Searching of databases for sequences similar to a new sequence

If you have just determined a sequence of an interesting bit of DNA, one of the first
questions you are likely to ask yourself is "has anybody else seen anything like this?"
Fortunately, there has been a very successful international effort to collect all the
sequences people have determined in one place so they can be searched. For DNA
sequences, three groups have cooperated in this effort, one in Japan, one in Europe,
and one in the United States to produce DDBJ, EMBL and GenBank, respectively.
These databases are frequently reconciled with each other, so that searching any one
is virtually the same as searching all three. The problem is that these databases are
HUGE and, as a result, you must compare your sequence with this vast number of
other sequences efficiently. A number of programs have been written to rapidly
search a database for a query sequence, two of which, BLAST and FASTA, will be
discussed in this course. The techniques used by these programs to make searching
rapid result in some loss of rigor of comparison. It is possible (although, as it turns
out, unlikely) that a weak but relevant similarity could be missed by these programs.
In addition, many times these programs will flag a sequence as being similar to your
query sequence when this similarity is not significant. Thus, these programs should be
seen as tools for identifying a small subset of sequences from the database for
retrieval and further analysis rather than ends in themselves.

Databases of protein sequences, including UniProt and PIR, also exist and can
similarly be searched.


Which program should you use to search a database, FASTA or BLAST? This
question is about as controversial as that over choices of computers (Mac vs. PC) or
religions. In fact, as you enter the world of sequence analysis, you will find religious
wars between proponents of different programs over and over. Worse, new programs
are constantly appearing. In addition, even after having selected a program, you will
frequently have to select values for "parameters" and always have to interpret the
output. There are no magic answers to help you do these things.

Sequence alignment
In bioinformatics, a sequence alignment is a way of arranging the primary sequences
of DNA, RNA, or protein to identify regions of similarity that may be a consequence
of functional, structural, or evolutionary relationships between the sequences. Aligned
sequences of nucleotide or amino acid residues are typically represented as rows
within a matrix. Gaps are inserted between the residues so that residues with identical
or similar characters are aligned in successive columns.

Figure: A sequence alignment, produced by ClustalW, between two human zinc finger
proteins identified by GenBank accession number.

If two sequences in an alignment share a common ancestor, mismatches can be
interpreted as point mutations and gaps as indels (that is, insertion or deletion
mutations) introduced in one or both lineages in the time since they diverged from
one another. In protein sequence alignment, the degree of similarity between amino
acids occupying a particular position in the sequence can be interpreted as a rough
measure of how conserved a particular region or sequence motif is among lineages.
The absence of substitutions, or the presence of only very conservative substitutions
(that is, the substitution of amino acids whose side chains have similar biochemical
properties) in a particular region of the sequence, suggests that this region has
structural or functional importance. Although DNA and RNA nucleotide bases are
more similar to each other than are amino acids, the conservation of base pairing can
indicate a similar functional or structural role. Sequence alignment can also be used for
non-biological sequences, such as those present in natural language or in financial
data.

Very short or very similar sequences can be aligned by hand; however, most
interesting problems require the alignment of lengthy, highly variable or extremely
numerous sequences that cannot be aligned solely by human effort. Instead, human
knowledge is primarily applied in constructing algorithms to produce high-quality
sequence alignments, and occasionally in adjusting the final results to reflect patterns
that are difficult to represent algorithmically (especially in the case of nucleotide
sequences). Computational approaches to sequence alignment generally fall into two
categories: global alignments and local alignments. Calculating a global alignment is
a form of global optimization that "forces" the alignment to span the entire length of
all query sequences. By contrast, local alignments identify regions of similarity within
long sequences that are often widely divergent overall. Local alignments are often
preferable, but can be more difficult to calculate because of the additional challenge
of identifying the regions of similarity. A variety of computational algorithms have
been applied to the sequence alignment problem, including slow but formally
optimizing methods like dynamic programming and efficient heuristic or probabilistic
methods designed for large-scale database search.
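
As an illustration of the dynamic programming approach, the sketch below computes the score of the best global alignment of two sequences in the style of the Needleman-Wunsch algorithm. The scoring values (match, mismatch, gap) are arbitrary assumptions chosen for the example, and the function name is hypothetical.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score of a global alignment of sequences a and b.

    Uses a linear gap penalty; the default scores are illustrative only.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,   # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,        # gap in b
                score[i][j - 1] + gap,        # gap in a
            )
    return score[-1][-1]

print(global_alignment_score("ACGT", "ACT"))
```

The table has (len(a)+1) x (len(b)+1) cells, so time and memory grow with the product of the sequence lengths. That cost is why heuristic methods are preferred for searching large databases.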

Multiple sequence alignment


Multiple sequence alignment (MSA) is a sequence alignment of three or more
biological sequences, generally protein, DNA, or RNA. In general, the input set of
query sequences is assumed to have an evolutionary relationship by which they
share a lineage and are descended from a common ancestor. From the resulting MSA,
sequence homology can be inferred and phylogenetic analysis can be conducted to
assess the sequences' shared evolutionary origins. Visual depictions of the alignment,
as in the figure below, illustrate mutation events such as point mutations (single
amino acid or nucleotide changes) that appear as differing characters in a single
alignment column, and insertion or deletion mutations (or indels) that appear as gaps
in one or more of the sequences in the alignment. Multiple sequence alignment is
often used to assess sequence conservation of protein domains, tertiary and secondary
structures, and even individual amino acids or nucleotides.

Multiple sequence alignment also refers to the process of aligning such a sequence
set. Because three or more sequences of biologically relevant length are nearly
impossible to align by hand, computational algorithms are used to produce and
analyze the alignments. MSAs require more sophisticated methodologies than
pairwise alignment because they are more computationally complex to produce. Most
multiple sequence alignment programs use heuristic methods rather than global
optimization because identifying the optimal alignment between more than a few
sequences of moderate length is prohibitively computationally expensive.


Figure: First 90 positions of a protein multiple sequence alignment of instances of the
acidic ribosomal protein P0 (L10E) from several organisms. Generated with ClustalW.

Sequences can be aligned across their entire length (global alignment) or only in
certain regions (local alignment). This is true for pairwise and multiple alignments.
Global alignments need to use gaps (representing insertions/deletions) while local
alignments can avoid them, aligning regions between gaps.

Evaluating local multiple alignments

Some programs give quantitative measures for the significance of the alignment.
These are usually based on the chance occurrence of such alignments and depend on
the size and composition of the aligned sequences. Empirical measures are also
extremely useful for deciding the 'correctness' of the multiple alignment. Consistency
is a powerful measure for correct multiple alignments. If the same alignment is found
in the sequence-to-sequence searches and various multiple alignment methods it is
most probably correct. One pitfall to avoid is biased sequence composition that may
lead to trivial alignments.

Experimental data can be used in evaluating, and even constructing, multiple
alignments. For example, if we know the catalytic site in the aligned proteins we
expect the sites to be aligned together and may 'force' that alignment. Such manual
alignments can serve as a seed to an alignment with more sequences.

Local multiple alignments (blocks) from different programs can be joined or used
together. Another approach is 'divide and conquer'. Blocks present in all sequences
divide them into separate parts, in each of which more blocks can be searched for.

Tools for sequence alignment

BLAST

BLAST is an acronym for Basic Local Alignment Search Tool, and it consists of a set
of algorithms for comparing biological sequences such as nucleotide or protein
sequences. A nucleotide sequence is a DNA sequence (or part of one)
expressed as a long string of 4 characters: A, T, C and G. They stand for Adenine,
Thymine, Cytosine and Guanine. So, every nucleotide sequence consists of only these
four characters arranged in different orders.

BLAST allows you to compare your sequence against a database of sequences and
informs you if your sequence matches any of the sequences in the database, along
with a lot of information like:

* Identity of match (% of characters matched; often loosely called homology)
* Alignment length (over what length did the nucleotides match)
* E-value (Expectation value. The number of different alignments with scores
  equivalent to or better than S, the score of the reported alignment, that are
  expected to occur in a database search by chance. The lower the E-value, the
  more significant the score)
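
To make the E-value definition concrete, the sketch below uses the Karlin-Altschul relation E = K * m * n * exp(-lambda * S), where m is the query length and n is the database length. This relation is not stated in these notes and is added here for illustration. The default values of lambda and K are placeholder assumptions: in practice they depend on the scoring matrix and gap penalties. The function names are hypothetical.

```python
import math

def expect_value(raw_score, query_len, db_len, lam=0.318, k=0.134):
    """Estimate E = K * m * n * exp(-lambda * S).

    lam and k are illustrative placeholders, not values from these notes.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)

def bit_score(raw_score, lam=0.318, k=0.134):
    """Normalized score S' = (lambda * S - ln K) / ln 2, so E = m * n * 2**(-S')."""
    return (lam * raw_score - math.log(k)) / math.log(2)

# A higher raw score gives a lower E-value for the same search space.
for s in (40, 60, 80):
    print(s, round(bit_score(s), 1), expect_value(s, 300, 10_000_000))
```

Because E grows in proportion to the database size, the same alignment score becomes less significant as the database grows.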

For a complete BLAST glossary you may visit
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html

So, now that you know BLAST can be used to align two sequences and to study the
similarity between two or more sequences, let us look into the principles of sequence
alignment briefly.

Sequence alignment refers to arranging two sequences in an order such that their
similar portions are highlighted.

For example:

AGCTATGGGCAAATTTGGAACAAACCAAAAAGT
........ ........ ...............
AGCTATGGACAAATTTGCAACAAACCAAAAAGT

Positions in the two sequences that do not match are shown as blanks in the middle (match) line.
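
The match line and the percentage of matching characters for the example above can be reproduced with a short script. This is a minimal sketch; the function names are hypothetical, and the two sequences are assumed to be already aligned and of equal length.

```python
def match_line(seq1, seq2):
    """Mark identical positions with '.' and differing positions with ' '."""
    return "".join("." if x == y else " " for x, y in zip(seq1, seq2))

def percent_identity(seq1, seq2):
    """Percentage of aligned positions that hold the same character."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(x == y for x, y in zip(seq1, seq2))
    return 100.0 * matches / len(seq1)

top = "AGCTATGGGCAAATTTGGAACAAACCAAAAAGT"
bottom = "AGCTATGGACAAATTTGCAACAAACCAAAAAGT"

print(top)
print(match_line(top, bottom))
print(bottom)
print(f"{percent_identity(top, bottom):.1f}% identity over {len(top)} positions")
```

The two sequences differ at positions 9 and 18, so 31 of the 33 positions match.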

Global Alignment: It refers to the alignment in which all the characters in both
sequences participate in the alignment.

Local Alignment: It refers to finding closely matching regions between sequences. In
local alignment the beginning part (say nucleotides 0-100) of a sequence may align
with the ending part of another sequence (say 400-500).
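
A local alignment can be found with a dynamic programming scheme in the style of the Smith-Waterman algorithm, in which scores are never allowed to fall below zero. The sketch below returns only the best score and where the best local alignment ends. The scoring values and the function name are assumptions made for the example, and this is not the heuristic that BLAST itself uses.

```python
def local_alignment(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b, and where it ends.

    Returns (score, end_in_a, end_in_b) with 1-based end positions;
    (0, 0, 0) means no positively scoring local alignment was found.
    """
    best, best_pos = 0, (0, 0)
    prev = [0] * (len(b) + 1)              # previous row of the score table
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # A score of 0 restarts the alignment at this cell.
            curr[j] = max(0, prev[j - 1] + pair, prev[j] + gap, curr[j - 1] + gap)
            if curr[j] > best:
                best, best_pos = curr[j], (i, j)
        prev = curr
    return best, best_pos[0], best_pos[1]

# The end of the first sequence matches the start of the second.
print(local_alignment("TTTTACGTACGT", "ACGTACGTCCCC"))
```

In this example the shared region ACGTACGT lies at the end of one sequence and at the beginning of the other, which a global alignment would be forced to pad with gaps.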

BLAST flavours

The BLAST programs are widely used tools for searching DNA and protein databases
for sequence similarity to identify homologs to a query sequence. While often referred
to as just "BLAST", this can really be thought of as a set of programs: blastp, blastn,
blastx, tblastn, and tblastx.

The five flavours of BLAST perform the following tasks:

* blastp
  o Compares an amino acid query sequence against a protein sequence
    database
* blastn
  o Compares a nucleotide query sequence against a nucleotide sequence
    database
* blastx
  o Compares the six-frame conceptual translation products of a nucleotide
    query sequence (both strands) against a protein sequence database
* tblastn
  o Compares a protein query sequence against a nucleotide sequence
    database dynamically translated in all six reading frames (both
    strands)
* tblastx
  o Compares the six-frame translations of a nucleotide query sequence
    against the six-frame translations of a nucleotide sequence database.
    (Due to the nature of tblastx, gapped alignments are not available with
    this option)

Links for BLAST:

* NCBI's BLAST tool can be found at http://www.ncbi.nlm.nih.gov/blast/
* An article on the methodology behind BLAST:
  http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
* How to interpret BLAST output:
  http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/tut2.html

Clustal

Clustal is a fully automatic program for global multiple alignment of DNA and
protein sequences. The alignment is progressive and considers the sequence
redundancy. Trees can also be calculated from multiple alignments (see below). The
program has some adjustable parameters with reasonable defaults. ClustalW is
available on the WWW and for various computer operating systems.

How does ClustalW work (very simple explanation)?

1. Determine all pairwise alignments between sequences and the degree of similarity
between them.

2. Construct a similarity tree.

3. Combine the alignments from 1 in the order specified in 2 using the rule "once a
gap, always a gap".

In stage 1:

1.1. ClustalW uses a pairwise alignment method to compute all pairwise alignments.

1.2. Using the alignments from 1.1 it computes a distance.

1.2.1. The distance is commonly calculated by looking at the non-gapped positions
and counting the number of mismatches between the two sequences. This value is then
divided by the number of non-gapped pairs to give the distance. Once the distances
for all pairs are calculated they go into a matrix, which is used in stage 2.
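
The distance calculation in 1.2.1 can be sketched as follows. The function names and the use of '-' as the gap character are assumptions made for the example. The pairwise alignments are taken as given, since computing them belongs to step 1.1.

```python
def pairwise_distance(aligned1, aligned2, gap="-"):
    """Mismatches divided by the number of positions where neither sequence has a gap."""
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned1, aligned2) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no non-gapped positions to compare")
    mismatches = sum(x != y for x, y in pairs)
    return mismatches / len(pairs)

def distance_matrix(names, pairwise_alignments):
    """Build a symmetric distance matrix.

    pairwise_alignments maps (name_i, name_j) to the two aligned strings,
    for every pair with name_i listed before name_j in names.
    """
    n = len(names)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            s1, s2 = pairwise_alignments[(names[i], names[j])]
            matrix[i][j] = matrix[j][i] = pairwise_distance(s1, s2)
    return matrix

# Five non-gapped pairs, one mismatch (A against T in the last column).
print(pairwise_distance("ACG-TA", "ACGTTT"))
```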


2. Using the matrix from 1.2.1. and Neighbor-Joining, ClustalW constructs the
similarity tree. The root is placed in the middle of the longest chain of consecutive
edges.

3. Combine the alignments, starting from the most closely related groups (going from the
tips of the tree towards the root).

Uses of multiple alignment


The basic information from a multiple alignment of protein sequences is the position
and nature of the conserved regions in each member of the group. Conserved
sequence regions correspond to functionally and structurally important parts of the
protein. We often only know the sequence-to-function relation for one or two
members of the group. Multiple alignments let us transfer that knowledge to the other
members in the group. Hypotheses about functional importance or specific roles can
then be directly tested by mutagenesis and truncation experiments.

Viewing
Multiple alignments of many sequences and those with different sequence weights are
difficult to visualize. Sequence logos are a graphical way for presenting multiple
alignments.

ID ADH_IRON_1; BLOCK
AC BL00913C; distance from previous block=(56,76)
DE Iron-containing alcohol dehydrogenases proteins.
BL HHG motif; width=22; seqs=11; 99.5%=492; strength=1428
ADHE_CLOAB ( 720) CHSMAIKLSSEHNIPSGIANAL 66
FUCO_ECOLI ( 262) VHGMAHPLGAFYNTPHGVANAI 44
GLDA_BACST ( 259) HNGFTALEGEIHHLTHGEKVAF 100
GLDA_ECOLI ( 269) VHNGLTAIPDAHHYYHGEKVAF 100
MEDH_BACMT ( 259) VHSISHQVGGVYKLQHGICNSV 78
ADH1_CLOAB ( 258) CHSMAHKTGAVFHIPHGCANAI 47
ADHE_ECOLI ( 721) CHSMAHKLGSQFHIPHGLANAL 47
ADH2_ZYMMO ( 261) VHAMAHQLGGYYNLPHGVCNAV 36
ADH4_YEAST ( 263) VHALAHQLGGFYHLPHGVCNAV 41
ADHA_CLOAB ( 266) CHPMEHELSAYYDITHGVGLAI 50
ADHB_CLOAB ( 266) VHLMEHELSAYYDITHGVGLAI 49
//


Figure: Block and logo of a conserved region in iron-containing alcohol
dehydrogenases. The block is first transformed into a position specific scoring matrix
(PSSM) that allows for the sequence weights and expected frequencies of different
amino acids (aa). The logo shows the aa present in each alignment position. The
taller the aa and the stack, the more conserved they are. The conservation is shown
in bits and the aa are shaded according to their properties. The conserved histidines
probably bind the ferrous ion(s) required for these enzymes' activity.
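
The height of a logo stack, measured in bits, can be approximated from the block itself. The sketch below is a simplification: it treats all sequences as equally weighted and assumes all 20 amino acids are equally expected, whereas the PSSM described in the caption accounts for sequence weights and expected frequencies. The function name is hypothetical.

```python
import math
from collections import Counter

# The 11 aligned segments of the ADH_IRON_1 block shown above.
BLOCK = [
    "CHSMAIKLSSEHNIPSGIANAL", "VHGMAHPLGAFYNTPHGVANAI",
    "HNGFTALEGEIHHLTHGEKVAF", "VHNGLTAIPDAHHYYHGEKVAF",
    "VHSISHQVGGVYKLQHGICNSV", "CHSMAHKTGAVFHIPHGCANAI",
    "CHSMAHKLGSQFHIPHGLANAL", "VHAMAHQLGGYYNLPHGVCNAV",
    "VHALAHQLGGFYHLPHGVCNAV", "CHPMEHELSAYYDITHGVGLAI",
    "VHLMEHELSAYYDITHGVGLAI",
]

def column_information(column, alphabet_size=20):
    """Information content in bits: log2(alphabet size) minus Shannon entropy."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy

columns = list(zip(*BLOCK))   # one tuple of residues per alignment position
for position, column in enumerate(columns, start=1):
    most_common = Counter(column).most_common(1)[0][0]
    print(f"{position:2d}  {most_common}  {column_information(column):.2f} bits")
```

A column in which every sequence has the same residue reaches the maximum of log2(20), about 4.32 bits. More variable columns score lower.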

A different graphical view of multiply aligned sequences is a tree relating their
sequence similarity. This is very useful when the aligned sequences are of several
functional subtypes and we wish to know to which one our sequence(s) belong. A
way to estimate the significance of a tree is by bootstrap values. Simply put, these
values show how many times each bifurcation (branching point) was observed with
different resampled versions of the input data. The higher the bootstrap fraction
(number of observations/number of trials), the more confident we can be that the
sequences emerging from that branch point cluster together.
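
The idea behind bootstrap values can be sketched in a much simplified form. A full bootstrap rebuilds the whole tree for each trial. The sketch below instead resamples alignment columns with replacement and only asks whether one sequence remains strictly closest to another. The function names, the mismatch-fraction distance and the fixed random seed are all assumptions made for the example.

```python
import random

def mismatch_fraction(s1, s2):
    """Fraction of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(s1, s2)) / len(s1)

def bootstrap_support(alignment, first, second, trials=100, seed=0):
    """Fraction of trials in which `second` is strictly the closest sequence to `first`.

    alignment maps sequence names to aligned sequences of equal length.
    """
    rng = random.Random(seed)
    length = len(alignment[first])
    rivals = [name for name in alignment if name not in (first, second)]
    hits = 0
    for _ in range(trials):
        # Resample alignment columns with replacement.
        columns = [rng.randrange(length) for _ in range(length)]
        sample = {name: "".join(seq[c] for c in columns)
                  for name, seq in alignment.items()}
        d_pair = mismatch_fraction(sample[first], sample[second])
        if all(d_pair < mismatch_fraction(sample[first], sample[r]) for r in rivals):
            hits += 1
    return hits / trials

toy = {"a": "ACGTACGTAC", "b": "ACGTACGTAC", "c": "TGCATGCATG"}
print(bootstrap_support(toy, "a", "b"))
```

With real data the support usually falls between 0 and 1, and groupings that depend on only a few alignment columns receive low values.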


Fig: A tree made from the three blocks in the iron containing alcohol dehydrogenases
family. Bootstrap values are for 100 trials. The tree was calculated from the blocks
with the ClustalW program and drawn with the TreeView program.

Searching

Multiple alignments are powerful tools for identifying new members of the aligned
group. It is possible to query databases of multiple alignments with single sequences
and to query sequence databases with multiple alignments. It has been shown that
such searches are more sensitive and selective than sequence-to-sequence searches. A
simple (but very effective!) 'hybrid' approach is to use a properly made consensus
sequence.
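
A very simple way to derive a consensus sequence from a multiple alignment is a majority rule applied to each column. The sketch below is an assumption about what such a consensus could look like, not the procedure these notes refer to as "properly made". The function name, the 50% threshold and the use of 'X' for undecided positions are all illustrative.

```python
from collections import Counter

def consensus(aligned_sequences, min_fraction=0.5, unknown="X"):
    """Majority-rule consensus of equal-length aligned sequences.

    A position gets the most common residue if it occurs in more than
    min_fraction of the sequences, and the `unknown` symbol otherwise.
    """
    result = []
    for column in zip(*aligned_sequences):
        residue, count = Counter(column).most_common(1)[0]
        result.append(residue if count / len(column) > min_fraction else unknown)
    return "".join(result)

print(consensus(["ACGT", "ACGA", "ACCT"]))
```

The resulting single sequence can then be used as the query in an ordinary sequence-to-sequence search.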

PCR primer design

Design of degenerate PCR primers is emerging as a major use for multiple
alignments. PCR can identify the sequence of a gene in genomic or other DNA from
two short flanking segments (primers). Conserved sequence regions are (by
definition) a good source for primer design. When designing primers the conservation
of the regions, the degeneracy of the genetic code and parameters of the PCR reaction
must be considered. The Blocks WWW server designs PCR primers for each family
in the database, for sequence groups submitted to be aligned and for multiple
alignments submitted to be reformatted. These primers are degenerate at the 3' end and
consensus at the 5' end (CODEHOP: COnsensus-DEgenerate Hybrid Oligonucleotide
Primers). The design is fully automatic but the user can set the requested Tm, genetic
code and bias the primers toward some of the sequences. CODEHOP primers were
shown to be more effective than simple degenerate primers in various cases.
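
The effect of genetic code degeneracy on primer design can be illustrated by counting how many different DNA sequences encode a short peptide. This sketch assumes the standard genetic code and is not the CODEHOP algorithm. The function names are hypothetical.

```python
from collections import Counter

# Standard genetic code written as one amino acid per codon ('*' = stop),
# so counting letters gives the number of codons per amino acid.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS_PER_AA = dict(Counter(AMINO_ACIDS))

def degeneracy(peptide):
    """Number of distinct DNA sequences that encode the peptide."""
    total = 1
    for aa in peptide:
        total *= CODONS_PER_AA[aa]
    return total

def least_degenerate_window(peptide, width):
    """Return (start, window, degeneracy) for the window with the fewest encodings."""
    candidates = [(degeneracy(peptide[i:i + width]), i)
                  for i in range(len(peptide) - width + 1)]
    best_degeneracy, start = min(candidates)
    return start, peptide[start:start + width], best_degeneracy

print(degeneracy("HG"))                       # 2 codons for H times 4 for G
print(least_degenerate_window("LSRMWHG", 2))
```

Regions rich in methionine and tryptophan (one codon each) need far fewer primer variants than regions rich in leucine, serine or arginine (six codons each).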

Structural alignments
Structural alignment is a form of sequence alignment based on comparison of shape.
These alignments attempt to establish equivalences between two or more polymer
structures based on their shape and three-dimensional conformation. This process is
usually applied to protein tertiary structures but can also be used for large RNA
molecules. In contrast to simple structural superposition, where at least some
equivalent residues of the two structures are known, structural alignment requires no a
priori knowledge of equivalent positions. Structural alignment is a valuable tool for
the comparison of proteins with low sequence similarity, where evolutionary
relationships between proteins cannot be easily detected by standard sequence
alignment techniques. Structural alignment can therefore be used to imply
evolutionary relationships between proteins that share very little common sequence.
However, caution should be used in using the results as evidence for shared
evolutionary ancestry because of the possible confounding effects of convergent
evolution by which multiple unrelated amino acid sequences converge on a common
tertiary structure.

Structural alignments can compare two sequences or multiple sequences. Because
these alignments rely on information about all the query sequences' three-dimensional
conformations, the method can only be used on sequences where these structures are
known. These are usually found by X-ray crystallography or NMR spectroscopy. It is
possible to perform a structural alignment on structures produced by structure
prediction methods. Indeed, evaluating such predictions often requires a structural
alignment between the model and the true known structure to assess the model's
quality. Structural alignments are especially useful in analyzing data from structural
genomics and proteomics efforts, and they can be used as comparison points to
evaluate alignments produced by purely sequence-based bioinformatics methods.

The outputs of a structural alignment are a superposition of the atomic coordinate sets
and a minimal root mean square distance (RMSD) between the structures. The RMSD
of two aligned structures indicates their divergence from one another. Structural
alignment can be complicated by the existence of multiple protein domains within one
or more of the input structures, because changes in relative orientation of the domains
between two structures to be aligned can artificially inflate the RMSD.
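
The RMSD calculation itself is simple once the structures are superposed and equivalent atoms are paired. The sketch below assumes both of those steps have already been done, which is the hard part of structural alignment. The function name and the toy coordinates are illustrative.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) points.

    Assumes the structures are already superposed and that point i in one
    list is equivalent to point i in the other.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

# Toy example: two "domains" of two atoms each; only the second domain moves.
reference = [(0, 0, 0), (1, 0, 0), (5, 0, 0), (6, 0, 0)]
moved = [(0, 0, 0), (1, 0, 0), (5, 2, 0), (6, 2, 0)]
print(rmsd(reference, moved))
```

In the toy example the first domain superposes perfectly, yet the overall RMSD is well above zero because the second domain has shifted. That is the inflation effect described in the paragraph above.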


Fig: Structural alignment of thioredoxins from humans and the fly Drosophila
melanogaster. The proteins are shown as ribbons, with the human protein in red, and
the fly protein in yellow. Generated from PDB 3TRX and 1XWC.

Data produced by structural alignment

The minimum information produced from a successful structural alignment is a set of
superposed three-dimensional coordinates for each input structure. (Note that one
input element may be fixed as a reference and therefore its superposed coordinates do
not change.) The fitted structures can be used to calculate mutual RMSD values, as
well as other more sophisticated measures of structural similarity such as the global
distance test (GDT, the metric used in CASP). The structural alignment also implies a
corresponding one-dimensional sequence alignment from which a sequence identity,
or the percentage of residues that are identical between the input structures, can be
calculated as a measure of how closely the two sequences are related.

References

http://en.wikipedia.org/wiki/Structural_alignment
http://en.wikipedia.org/wiki/Sequence_alignment_software
http://en.wikipedia.org/wiki/Multiple_sequence_alignment
http://puneetwadhwa.blogspot.com/2005/10/introduction-to-blast-basic-local.html
http://en.wikipedia.org/wiki/BLAST
