Methods for applying multiple sequence alignment

During the last decade, great progress has been made in devising methods for applying multiple sequence alignments of known proteins to identify related sequences in database searches. The results are central to contemporary applications of bioinformatics, including the interpretation of genomes. Three important methods are Profiles, PSI-BLAST and Hidden Markov Models (HMMs).

Profiles

Profiles express the patterns inherent in a multiple sequence alignment of a set of homologous sequences. They have several applications. They permit greater accuracy in alignments of distantly related sequences. Sets of residues that are highly conserved are likely to be part of the active site, and give clues to function. The conservation patterns facilitate identification of other homologous sequences.


Patterns from the sequences are useful in classifying subfamilies within a set of homologues. Sets of residues that show little conservation, and are subject to insertion and deletion, are likely to be in surface loops. This information has been applied to vaccine design, because such regions are likely to elicit antibodies that will cross-react well with the native structure.

Profiles contd.

Most structure prediction methods are more reliable if based on a multiple sequence alignment than on a single sequence. For example, homology modelling depends crucially on correct sequence alignments.

How do profiles work?


The basic idea in using profile patterns to identify homologues is to match each query sequence from the database against the sequences in the alignment table, giving higher weight to positions that are conserved than to those that are variable.

But one must not insist too strictly on conservation, as in that case there is a chance of missing interesting distant relatives.

A quantitative measure of conservation


For each position in the table of aligned sequences, take inventory of the distribution of amino acids. Let us take an example.
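Number of each amino acid observed at positions 25-30 (counts that cannot be read reliably are shown as ...):

Position   Amino acid (count)
25         A (1), L (2), V (13)
26         D (16)
27         F (16)
28         S (7), W (5), T (...), Y (...)
29         A (16)
30         E (4), P (2), ...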

Quantitative measure

It is evident that positions 26, 27 and 29 are completely conserved: agreement at these positions should contribute a very high score, and disagreement a very low score. For moderately conserved positions, such as position 28, we want a modest positive contribution to the score if the query sequence has an S or a W at this position, and a smaller contribution if it has T or Y.

Profiles

So the general idea is to score each residue from the query sequence based on the amino acid distribution at that position in the multiple sequence alignment table.
A simple approach would be to use the inventories as scores directly.

The sequence VDFSAE would score 13+16+16+7+16+4=72


An alternative query sequence with A, a residue other than D, a residue other than F, W, A and P at positions 25-30 would score 1+0+0+5+16+2=24.
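The two worked scores above can be reproduced with a few lines of Python. This is only a minimal sketch: the name `INVENTORY` and the function `inventory_score` are hypothetical, the table holds only the counts that appear in the two sums, and G is an arbitrary stand-in for "not D" and "not F".

```python
# Observed counts per alignment position, copied from the two worked sums.
# Residues that are not listed score 0 in this sketch.
INVENTORY = {
    25: {"V": 13, "A": 1},
    26: {"D": 16},
    27: {"F": 16},
    28: {"S": 7, "W": 5},
    29: {"A": 16},
    30: {"E": 4, "P": 2},
}


def inventory_score(query, first_position=25):
    """Sum the observed count of each query residue at its aligned position."""
    total = 0
    for offset, residue in enumerate(query):
        counts = INVENTORY[first_position + offset]
        total += counts.get(residue, 0)
    return total


print(inventory_score("VDFSAE"))  # 13+16+16+7+16+4 = 72
print(inventory_score("AGGWAP"))  # 1+0+0+5+16+2 = 24 (G stands in for non-D, non-F)
```

Using raw counts means that any residue never seen at a position contributes nothing, which is exactly the weakness that a more general inventory has to address.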

Profiles

Thus, for each query sequence we have to test all possible alignments with the multiple alignment table and take the largest total score. If the table contained a large and unbiased sample of sequences, the inventory would provide the correct picture of the potential distribution of residues at each position. By the same argument, if our sample were small, the pattern derived would be unlikely to reflect the complete repertoire.

How to make the inventory general?

Let a_1, a_2, a_3, ..., a_20 be the amino acid distribution at a given residue position, stored as a 20-membered array. A better scoring scheme would evaluate any amino acid according to its chance of being substituted for one of the observed amino acids. If D(i,j) is the amino acid substitution matrix (PAM250 or BLOSUM62), then amino acid i could score

a_1 D(i,1) + a_2 D(i,2) + ... + a_20 D(i,20)

Thus this scheme distributes the score among the observed amino acids, weighted according to the substitution probability.
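The weighted sum can be sketched as follows. The function name `substitution_weighted_score`, the counts, and the small matrix `TOY_D` are all assumptions made for illustration; the matrix covers only three residues and its entries are placeholders, so in practice the full PAM250 or BLOSUM62 table would be used instead.

```python
def substitution_weighted_score(residue, counts, sub_matrix):
    """Return the sum over observed residues j of a_j * D(residue, j)."""
    return sum(
        count * sub_matrix[residue][observed]
        for observed, count in counts.items()
    )


# Hypothetical inventory at one alignment position: only S and W observed.
counts = {"S": 7, "W": 5}

# Illustrative placeholder substitution scores for three residues only.
# Replace with entries from a real PAM250 or BLOSUM62 table in practice.
TOY_D = {
    "S": {"S": 4, "T": 1, "W": -3},
    "T": {"S": 1, "T": 5, "W": -2},
    "W": {"S": -3, "T": -2, "W": 11},
}

for residue in "STW":
    print(residue, substitution_weighted_score(residue, counts, TOY_D))
```

Note that T receives a score here even though it was never observed at this position, because the matrix says how readily it substitutes for S and W.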

Profile

An amino acid in the query sequence could score higher either if it appears frequently in the inventory at this position or if it has a high probability of arising by mutation from residue types that are common at this position. A good approach is to use as the amino acid distribution a combination of the observed inventory and a general background level of the amino acid composition. The result is a set of probability scores for each amino acid (or gap) at each position of the alignment, called a position specific scoring matrix (PSSM).

Database searches with a scoring matrix such as a PSSM

PSSM stands for position specific scoring matrix.
It represents an alignment of sequence patterns of the same length without gaps. It is constructed by a simple logarithmic transformation of a matrix giving the frequency of each amino acid in the motif.

Considerations for PSSM


If the number of sequences with the found motif is large and reasonably diverse, the sequences represent a good statistical sampling of all sequences that are likely to be found with the same motif.

Concept of pseudocounts in PSSM

If the data set is small, then unless the motif has almost identical amino acids in each column, the column frequencies in the motif may not be highly representative of all other occurrences of the motif.
In such cases, adding extra amino acid counts broadens the evolutionary reach of the profile to variation. Pseudocounts are added based on previous variations seen in the aligned sequences.

Expression for the frequencies in PSSM


The probability p_ca that amino acid a is in column c in all occurrences of the block is

p_ca = (n_ca + b_ca) / (N_c + B_c)

where n_ca and b_ca are the real counts and pseudocounts, respectively, of amino acid a in column c, and N_c and B_c are the total numbers of real counts and pseudocounts in that column.
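A small numerical sketch of this formula may help. The function names and the example column are hypothetical. The pseudocounts here are a flat 0.2 per amino acid, which is a simplification of the variation-based pseudocounts described above, and the log-odds step assumes a uniform background frequency of 1/20 and base-2 logarithms.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_probabilities(real_counts, pseudocounts):
    """Apply p_ca = (n_ca + b_ca) / (N_c + B_c) to every amino acid in one column."""
    n_total = sum(real_counts.values())   # N_c: total real counts
    b_total = sum(pseudocounts.values())  # B_c: total pseudocounts
    return {
        a: (real_counts.get(a, 0) + pseudocounts.get(a, 0)) / (n_total + b_total)
        for a in AMINO_ACIDS
    }


def log_odds(probabilities, background=1 / 20):
    """Log-odds score of each amino acid against an assumed uniform background."""
    return {a: math.log2(p / background) for a, p in probabilities.items()}


# Hypothetical column: D observed in all 16 sequences, flat pseudocounts of 0.2.
real = {"D": 16}
pseudo = {a: 0.2 for a in AMINO_ACIDS}

probs = column_probabilities(real, pseudo)
scores = log_odds(probs)
print(round(probs["D"], 2), round(probs["A"], 2))  # 0.81 0.01
```

Without pseudocounts every residue other than D would have probability zero and an undefined log-odds score; with them, unseen residues get a small probability and a finite negative score.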

The Matrix
The log odds ratio is calculated as before. Here, each column corresponds to one position of the motif and each row to one amino acid.

As a sequence is searched with the PSSM, the value of the first amino acid in the sequence is looked up in the first column of the PSSM, then the value of the second amino acid in the second column, and so on until the length scanned is the same as the motif width represented by the matrix.
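The lookup described above can be sketched as a sliding window over the sequence. Everything in this example is assumed for illustration: the three-column `TOY_PSSM`, its scores, the default score of -2 for residues not listed in a column, and the function name `scan_with_pssm`. Gaps are not handled.

```python
def scan_with_pssm(sequence, pssm, default=-2):
    """Score every ungapped window of the sequence against the PSSM.

    pssm is a list with one dict per motif position (column), mapping
    amino acid -> score. Returns a list of (start_index, score) pairs.
    """
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # First residue is looked up in the first column, second in the second, ...
        score = sum(
            column.get(residue, default)
            for residue, column in zip(window, pssm)
        )
        hits.append((start, score))
    return hits


# Hypothetical 3-column PSSM; unlisted residues fall back to the default score.
TOY_PSSM = [
    {"D": 4, "E": 1},
    {"F": 4, "Y": 2},
    {"S": 2, "W": 2, "T": 1},
]

hits = scan_with_pssm("AVDFSAE", TOY_PSSM)
best = max(hits, key=lambda hit: hit[1])
print(best)  # (2, 10): the window DFS starting at index 2
```

The best-scoring window marks the most likely occurrence of the motif in the sequence.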

How is it different from the scoring matrices you have already learnt?

Substitutions of the same amino acid within the matrix may be scored differently, depending on its position. Amino acids in highly conserved positions score higher than those in weakly conserved positions. This matrix is used to score the next BLAST search and the matrix is refined again.
