Objective: In these exercises, students will be introduced to BLASTn by comparing a
nucleotide sequence against a nucleotide database.
For this exercise, students are required to have some knowledge of: using links, the NCBI
webpage, taxonomic classification, and the concept of nucleotide point mutations.
BLASTn is the program used to compare nucleotide query sequences against nucleotide
databases. In addition to BLASTn, BLAST allows one to search for protein matches to a
nucleotide query through the BLASTx program that translates the nucleotide sequence
query into an amino acid query. It is also possible to search for protein sequence matches in
the database to a protein sequence query through the BLASTp program. The following
exercises will introduce students to the use of BLASTn; the other programs will be
considered in subsequent exercises.
NCBI provides very useful tutorials to help new and veteran users to understand and utilize
tools, including BLAST. Here is the link to their BLAST tutorial:
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html
Reference:
Altschul, S. F., W. Gish, W. Miller, E. W. Myers and D. J. Lipman. (1990). “Basic Local
Alignment Search Tool.” J. Mol. Biol. 215: 403-410.
BLASTn Example:
You will need to determine the source from which the following DNA fragment was
obtained by comparing the “unknown 1” nucleotide sequence (query sequence) against a
nucleotide sequence database using BLASTn.
>Unknown 1
gagcaggtgcctcactatcgacaagccctagacatgatcttggacctggaacctgatgaagagctggaagacaaccccaaccaga
gtgacttgattgagcaggcggccgagatgctctatgggttgatccacgcccgctacatcctcaccaaccggggcattgcacaaatgt
tggaaaagtaccagcaaggagactttggctactgtcctcgagtatactgtgagaaccagccgatgcttcccatcggcctttcggacat
cccaggagaggccatggtgaagctctactgccccaagtgcatggacgtgtacacacccaagtcctctaggcaccaccacacggat
ggcgcatacttcggcactggtttccctcacatgctcttcatggtgcatcccgagtaccggcccaagcggccggccaaccagtttgtgc
ccaggctctacggtttcaagatccatccaatggcctaccagctgcagctccaagccgccagcaacttcaagagcccagtcaagacg
attcgctgagtgccctcccacctcctctgcctgtgacaccaccgtccctccgctgccaccctttcaggaagtctatggtttttagt
You will be prompted with a window informing you that your request was successfully
submitted to BLAST. It will assign you a “Request ID” and will give you the option to
“Format!” the results or to “Reset all.” Hit “Format!” when you wish to obtain your
results. (NOTE: The NCBI web page changes constantly. The figures presented in this
example are in the format used by NCBI in 2005 and may differ somewhat from those you
encounter.)
[Figure callout: default settings]
The results will be given to you according to the selections shown under “Format.” The
default settings (in 2005) include a “graphical view” of the query sequence aligned to
different sequences in the database; “linkout” to other specialized NCBI databases,
“sequence retrieval” of the corresponding matches with the “NCBI-gi” identifiers, accession
numbers, and locus names. It also provides an alignment in html form for the specified
number of pair alignments with reported high-scoring segment pairs (HSPs). The number
of pair alignments is set to 50 but can be changed. You can click on any parameter to get
more information about its setting.
After you hit “Format!” a new page will appear informing you of the estimated time
needed to retrieve the results. The results will be given to you in a new page with the
bibliographic reference to the database search program used in your BLAST session,
followed by your request ID, the query name, and number of nucleotides considered in the
search. The databases used in the search are also specified on this page and a link to
BLAST FAQs (frequently asked questions) is provided.
Click on “taxonomy report” to obtain information about the different organisms containing
DNA sequences similar to your query. Is this piece of DNA more likely to come from a
bacterium or from a eukaryote? Can you recognize the names of all of the organisms
listed? If not, you can click on the ones unfamiliar to you and get a taxonomic description.
[Figure callout: number of hits to each organism]
Return to your BLAST results page. Look at the graphical view of the results. You will
see a set of parallel horizontal bars of different colors and lengths. The color indicates the
level of similarity between the query sequence and the matching sequence from the
database. A red bar denotes a high similarity score while black denotes very low similarity
between the two sequences.
[Figure callout: description]
Below the graph there is a list of sequences from the database producing significant
alignments, each described by a one-line summary called the “description.” The alignments
are sorted by E-value (expect value), with the lowest value (0.0) presented at the top of the
list. The E-value is the number of alignments with an equal or better score that would be
expected to occur by chance in a database of this size, rather than through real sequence
similarity. Therefore, the lower the value, the more significant the alignment. E-values are
very useful in helping one decide which results are more meaningful. In addition to the
E-value, the description provides a score, another statistic that summarizes the alignment.
The score measures the quality of the alignment: the higher the score, the better. (For more
information about the statistics used in BLAST visit the following page:
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html#head2).
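The link between score and E-value can be made concrete with a small calculation. The
sketch below assumes the Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit
score and m and n are the lengths of the query and the database. The function name and the
numbers in the usage section are illustrative assumptions, not values taken from the BLAST
report discussed here.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance alignments scoring at least `bit_score`.

    Implements E = m * n * 2**(-S') with m = query_len and n = db_len.
    BLAST itself uses *effective* lengths, so reported values differ slightly.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical search space: a 600-nt query against 1e9 nt of database.
    for score in (30, 60, 120):
        print(f"bit score {score:>3}: E = {expect_value(score, 600, 10**9):.3g}")
```

Each additional bit of score halves the E-value, which is why hits with very high scores are
reported with E-values that round to 0.0.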
Click on one of the bars to obtain the alignment. Notice that the window at the top of the
graphical view shows the “description” for the selected bar. You can also retrieve the
alignment by clicking on one of the “descriptions” below the graph or by scrolling down the
page.
You will obtain your alignment in the following form:
[Figure callouts: link to the actual sequence; gene name; alignment statistics; perfect
alignment]
The first part of the result is the “description” with a link to the source from which the
matching sequence was obtained, and a description of the sequence. Following the
description is the statistical information for the alignment, including the score, the E-value,
and the “identities” (the ratio of the number of identical nucleotides to the total number of
nucleotides considered in the alignment). In this case the identity is 100%, indicating that
the 605 nucleotides analyzed from the query sequence were identical to 605 nucleotides in
the sequence of the rat casein kinase II beta subunit. If you look at the alignment you will
see that both sequences are identical. The low E-value (0.0) together with the high score
(1199) and the 100% identity strongly suggest that the “Unknown 1” sequence comes from a
rat and is a portion of the casein kinase II beta subunit (CK2) mRNA sequence.
Let’s compare this result to one with a higher E-value. For example, click on the chicken
casein kinase II hit with an E-value of e-102. Notice that this alignment also has a lower
score (381) and a lower identity, 444/528 (84%), than the previous sequence. These
statistical values reflect the mismatches between the two sequences. Therefore the sequence
with the higher E-value is less likely to correspond to the “Unknown 1” sequence.
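The identity figures quoted in the report can be checked by hand. The following is a minimal
sketch, assuming the identities field is written as "identical/aligned (percent%)" as in the
examples in this section; the function name is a hypothetical helper, not part of any BLAST
tool.

```python
import re


def parse_identity(field: str) -> tuple:
    """Parse an identities field such as '444/528 (84%)'.

    Returns (identical, aligned, percent), where percent is recomputed
    from the two counts rather than read from the text.
    """
    match = re.search(r"(\d+)\s*/\s*(\d+)", field)
    if match is None:
        raise ValueError(f"no identity ratio found in {field!r}")
    identical, aligned = int(match.group(1)), int(match.group(2))
    return identical, aligned, 100.0 * identical / aligned


if __name__ == "__main__":
    for field in ("605/605 (100%)", "444/528 (84%)"):
        identical, aligned, percent = parse_identity(field)
        # Positions that are not identical are mismatches or gaps.
        print(field, "->", round(percent), "%,", aligned - identical, "non-identical")
```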
[Figure callout: lots of mismatches]
Once you have decided which description best fits your query, you can obtain the complete
record (the source for the matching sequence). Click on the NCBI gi/accession number in
the sequence description that links to the full sequence of the gene. You will be taken to a
page containing a complete description of the sequence, including the accession number
under which the sequence is stored in the database, the name of the gene and of the
organism from which the sequence was obtained, the complete taxonomic lineage of the
organism, a reference to the publication reporting the sequence with comments about the
sequence, and the protein and nucleotide sequences. This page also includes “Links” to
other relevant sections of NCBI such as PubMed, Taxonomy Browser, Gene, etc. (see the
upper right-hand corner of the page). You can follow each of those links to get more
information about the sequence.
[Figure callouts: accession number; gene name; taxonomic report; link to PubMed]
Exercise 1.
ttgttatctcgacgccagatccccactataatctttgttcctcaccatgaaatatggaactggagaactatcatgtctagctaaaggtgtgt
aaattcaccagtcagcaagctgtgtctaactcaaggtttgtaaaggcaccaatcagcaccctgtgtctagctcaagatttgtaaatgca
ccaatcagtcctctgtgtctagctaatctagtggtgacttgaagactttcgtgtctagctgaaggattgtaaacgcactaatcagca
This DNA sequence fragment was obtained from a person with some kind of olfactory
problem. Using BLASTn, could you identify the target gene from which this DNA
fragment was obtained and give some explanation for the possible causes of the olfactory
disorder? (Hint: consider possible mutations that this person may have.)
Explain your reasoning for choosing the appropriate target gene, taking advantage of the
information provided in the description of the sequence pair alignments provided by
BLAST such as gene name, score, expected value and identity. Compare your choice with
alternative choices.
2.2 BLAST AND PRIMERS FOR POLYMERASE CHAIN REACTION (PCR)
Polymerase Chain Reaction (PCR) amplification and sequencing are among the most widely
used molecular biology techniques. Many questions in biology, such as the detection of
nucleotide polymorphisms or the discovery of important functional genes, can be addressed
by obtaining and analyzing sequences. Millions of copies of a specific DNA fragment can
be obtained through PCR amplification and then sequenced to determine the string of
nucleotides forming the sequence. The general principle of PCR is based on targeting a
particular region of a genome using short pieces of single-stranded DNA that are highly
similar to the desired sequence. These short pieces are called primers, and they determine
the limits of the amplified fragment by attaching to their complementary sequence in the
denatured genomic DNA. The enzyme DNA polymerase uses the free 3’ end of the
primers to start synthesizing the DNA complementary to the target sequence. If the primers
are not well attached to the targeted sequence on the genomic DNA, no synthesis is
possible. Therefore, it is crucial to decide what set of primers to use for PCR amplification
and further sequencing. In general, a good primer set should be specific, so that only the
target locus in a given genome is amplified. In addition, primers should be placed in
regions of the genome that are highly conserved among the group of organisms to which the
organism under study belongs. Otherwise the primers may not be able to anneal to the
target DNA. Other considerations are intrinsic to the properties of the primers themselves,
such as the formation of loops, compatibility between the two primers of the set, optimal
annealing temperature, etc.
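Two of the primer properties mentioned above can be estimated with a few lines of code. The
sketch below computes GC content and a rough melting temperature using the Wallace rule
(Tm = 2 °C per A or T plus 4 °C per G or C). This rule is a common rule of thumb for short
oligonucleotides and is an assumption here, not a method prescribed by this exercise; the
example primer is hypothetical.

```python
def gc_fraction(primer: str) -> float:
    """Fraction of G and C bases in the primer (0.0 to 1.0)."""
    seq = primer.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(primer: str) -> int:
    """Rough melting temperature in degrees C: 2*(A+T) + 4*(G+C).

    Only a first approximation, and intended for short primers.
    """
    seq = primer.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    primer = "ATGCGTACCGTTAGCTAG"  # hypothetical 18-mer
    print(f"GC content: {gc_fraction(primer):.0%}, Tm ~ {wallace_tm(primer)} C")
```

The two primers of a set are usually chosen so that their estimated melting temperatures are
close to each other, which is one aspect of the compatibility mentioned above.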
You can read more about PCR using the link to reference books in BookShelf of NCBI.
This is an example taken from the book “Human Molecular Genetics” by Tom Strachan and
Andrew P. Read and available through the BookShelf of NCBI:
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=hmg.section.552.
In 2005 a group of scientists reported a new active transposable element (Herves) in the
genome of the Anopheles gambiae mosquito to GenBank. This finding could have important
implications for human health since the “African malaria mosquito” is the single most
efficient vector of malaria. Thus, the presence of the Herves element may give some clues
about malarial infection. In their sequence submission the authors reported polymorphisms
(differences of one or a few nucleotides between individuals) found in the Herves
sequence. It would be interesting to evaluate the importance of such polymorphisms in the
susceptibility of the mosquito to infection with Plasmodium (the protozoan carried by
Anopheles mosquitoes that causes malaria). The first step is to compare the sequence of
the transposase gene in Herves among different A. gambiae mosquitoes and see if the
presence of the polymorphisms correlates with the presence of Plasmodium in the host. To
do so it is necessary to obtain the sequences by PCR amplification.
Reverse primer: 5’ ttacaaaaagtgcattacaaaacaattatt 3’ (not specific, but it still anneals at
the correct position, so it can still be used for PCR)
(Plus/Minus)
A. Go to BLAST and select “search for short, nearly exact matches” under
“nucleotide.”
B. Enter both primer sequences in the window provided by the program.
C. Limit your search to Anopheles gambiae [orgn] in the appropriate place
under “options for advanced blasting.”
D. Hit “BLAST.”
E. Obtain the corresponding Herves sequence by clicking on the gi accession
number on the description of the best match.
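The “Plus/Minus” label in a BLAST report means that the primer matches the reverse
complement of the database strand, which is what is expected for a reverse primer. The
sketch below shows the idea on a toy template: the forward primer is searched on the plus
strand, the reverse primer is reverse-complemented first, and the region between the two
sites is the expected amplicon. The template and the short primers are hypothetical, and
exact string matching is a simplifying assumption (BLAST tolerates mismatches).

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (case is preserved)."""
    return seq.translate(COMPLEMENT)[::-1]


def amplicon(template: str, forward: str, reverse: str):
    """Return the plus-strand region bounded by the two primers, or None."""
    template, forward, reverse = template.upper(), forward.upper(), reverse.upper()
    start = template.find(forward)
    if start == -1:
        return None
    # The reverse primer anneals to the plus strand, so its reverse
    # complement is what appears in the plus-strand sequence.
    reverse_site = reverse_complement(reverse)
    end = template.find(reverse_site, start + len(forward))
    if end == -1:
        return None
    return template[start:end + len(reverse_site)]


if __name__ == "__main__":
    template = "GGATGCCATTTTTTGGGAACCTT"  # hypothetical plus strand
    print(amplicon(template, forward="ATGCCA", reverse="GGTTCC"))
    # Reverse primer from this exercise, shown as it would read on the plus strand:
    print(reverse_complement("ttacaaaaagtgcattacaaaacaattatt"))
```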
Assignment 1.
Using BLAST as described above answer the following questions (make sure to look
carefully at the BLAST report of the original sequence of the gene):
For help with the BLAST programs, visit the NCBI BLAST tutorials website:
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html
Exercise 2.
Pseudomonas aeruginosa is a bacterium that lives on the surface of many organisms,
including humans, without causing major problems to the host. However, P. aeruginosa
can take advantage of breaks in the host’s defense system and produce infection in different
tissues. P. aeruginosa is of high concern to patients with AIDS, cancer or severe burns. In
fact, 50% of the infections of P. aeruginosa in these patients result in death. Hospitals are
reservoirs for this bacterium, and this organism is the most common cause of nosocomial
infections (infections caused by staying in a hospital).
You want to compare the diversity of P. aeruginosa found in the hospitals in the US with
that observed in Ireland (Finnan et al 2004. “Genome diversity of Pseudomonas aeruginosa
isolates from cystic fibrosis patients and the hospital environment.” J Clin Microbiol. Dec:
42(12): 5783-92). To do this, you need to sequence the malate dehydrogenase (mdh) gene
of a number of isolates. Therefore, you will need to use a set of primers to amplify the mdh
gene for further sequencing. The best way to start is to try using the primers used by Finnan
et al. 2004.
A. Go to the NCBI webpage and find the Finnan et al. (2004) paper in PubMed to get the
primers. For that purpose you should type the author’s last name followed by the
organism’s complete name. Look at the primers used in that paper and write the
sequence of both the forward and the reverse primers, or copy the sequences to a
separate document.
B. Go to BLAST and select “search for short, nearly exact matches” under
“nucleotide.”
C. Enter both primer sequences in the window provided by the program.
D. Limit your search to Pseudomonas aeruginosa [orgn] in the appropriate place
under “options for advanced blasting.”
E. Hit “BLAST.”
Assignment 2.
Polymerase chain reaction (PCR) amplification is a molecular method used to obtain many
copies of a particular DNA fragment. It is a type of cloning that can be done without using
living cells and in a very short time. A typical PCR reaction consists of a mixture of
genomic DNA containing the targeted DNA, a pair of primers needed to initiate the DNA
synthesis, a supply of nucleotides, the DNA polymerase enzyme and a buffer to maintain
the optimal conditions for the synthesis. This mixture is exposed to cycles of different
temperatures in which (i) the genomic DNA is denatured, (ii) the primers anneal to
the complementary sequence in the denatured DNA, and (iii) the polymerase
proceeds with the synthesis of the new DNA. The PCR reaction is very sensitive and
needs only very small amounts of DNA to initiate synthesis. Therefore any unwanted DNA
present in the reaction can produce nonspecific amplification. Other common sources of
error in PCR amplification are related to the annealing of the primers to regions in the
genome other than the desired target. This may be due to a low annealing temperature used
during the cycles or to a high similarity between the primers’ sequences and several regions
in the genomic DNA (rather than one specific sequence). Learning to be critical about the
PCR results may help in obtaining the desired amplification.
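The sensitivity described above follows from the arithmetic of the cycles: each cycle can at
most double the number of target molecules. The sketch below is a simple idealized model,
assuming a constant per-cycle efficiency; the efficiency parameter and the function name are
illustrative assumptions, and real reactions plateau as reagents are used up.

```python
def pcr_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Idealized copy number after `cycles` rounds of PCR.

    efficiency = 1.0 means perfect doubling in every cycle;
    lower values model incomplete amplification.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


if __name__ == "__main__":
    for cycles in (10, 20, 30):
        print(cycles, "cycles:", f"{pcr_copies(1, cycles):.3g}", "copies from one template")
```

Under this model a single contaminating molecule is amplified just as efficiently as the
intended template, which is why unwanted DNA in the reaction is a concern.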
You can read more about PCR using the link to reference books in BookShelf of NCBI.
This is an example taken from the book “GENOMES 2” by T. A. Brown and available
through the BookShelf of NCBI:
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Search&db=books&doptcmdl=GenBookHL&term=PCR+AND+genomes%5Bbook%5D+AND+229260%5Buid%5D&rid=genomes.section.6064
Exercise 1.
You are confronted with the following scenario. You need to characterize a sequence of a
particular gene for your experiment and invariably your Polymerase Chain Reaction (PCR)
gives you two distinct bands. After changing your PCR conditions, the problem persists,
so you decide to isolate the two bands and sequence them.
How may the sequencing results help in solving the problem? There are several hypotheses
to explain the PCR problem:
(i) The primers are not specific to the desired sequence and therefore can amplify
different regions in the genome of the organism under consideration;
[Figure: double-stranded DNA with 5’ and 3’ ends labeled, showing the amplified fragments]
(ii) The DNA used is contaminated with DNA from a different organism;
(iii) The selected gene has a secondary structure (like a loop) so the primers will
sometimes amplify the DNA in the loop and sometimes not;
[Figure callout: secondary structure, DNA loop]
(iv) There are two copies of the gene in the genome, one of which underwent a
deletion/insertion (pseudogene), and both are being amplified by your primers;
Functional gene
agctgtgatcattgcttggatagccctgatgctagtcgctagctcg
Deletion
agctgtgatcattgcttgtgatgctagtcgctagctcg
Pseudogene
(v) There are fragments of DNA sequences repeated one beside the other (in tandem)
that cause the polymerase to slip during PCR amplification.
agctgtgatcattcattcattcattcattcattcatagtcgctagctcg
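For hypothesis (iv), the deletion separating the functional gene from the pseudogene can be
located by comparing the two example sequences directly. The sketch below assumes that the
two sequences differ by one contiguous insertion or deletion and nothing else; real data
would call for a proper alignment (for example with bl2seq). The function name is a
hypothetical helper.

```python
FUNCTIONAL = "agctgtgatcattgcttggatagccctgatgctagtcgctagctcg"
PSEUDOGENE = "agctgtgatcattgcttgtgatgctagtcgctagctcg"


def locate_indel(seq_a: str, seq_b: str):
    """Return (position, extra_segment) for a single contiguous indel.

    `position` is the 0-based offset in the longer sequence where the extra
    segment starts. Returns None if the sequences differ in any other way.
    """
    longer, shorter = (seq_a, seq_b) if len(seq_a) >= len(seq_b) else (seq_b, seq_a)
    # Length of the shared prefix.
    prefix = 0
    while prefix < len(shorter) and longer[prefix] == shorter[prefix]:
        prefix += 1
    # Length of the shared suffix, not allowed to overlap the prefix.
    suffix = 0
    while suffix < len(shorter) - prefix and longer[-1 - suffix] == shorter[-1 - suffix]:
        suffix += 1
    if prefix + suffix != len(shorter):
        return None
    return prefix, longer[prefix:len(longer) - suffix]


if __name__ == "__main__":
    print(locate_indel(FUNCTIONAL, PSEUDOGENE))
```

With the two sequences shown above, the pseudogene lacks an 8-nucleotide segment, so a
primer pair flanking this region would give two bands that differ slightly in size.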
As an example let’s imagine that after DNA sequencing of the two bands you found the
following two sequences:
>Unknown 1
GNCCATCNCCAGGCGGACCACCAAACCGAGAATGAATTCTAACAAAATTATAC
CAGAATGAAAACAGAAAAACAAACCTGTGAACCCTCCTCCAGTCTCTAGCCCT
GAAATTATTCCAGAAACGCTTTGTTCTTTCTTTTAAAGTCCTGAACCAGCGTTCA
ATACAGTTCCTCGGCCCGAAGGTTACGTGAACAAAATCCAGTCCAAGGGATTTA
AAAGCAGACTTGTACCACTTCCCACCATCAACCAGAAAAACAGGCTGTCCTTTG
CAGGATTTCAAAACAACCAGAATGAAGTCTCTAGCAATCCACCAGTTTCTGGTT
GTTGTAATCCAGACTGCGAGAACTTCCCTGCTCTCAACGTCAAGTGCAGCCCAG
AGAAATCTCTTCTCCCCGTTGATTTTTACGACAGTCTCATCAACTGCGATAAAAT
TCCTCTGTTTTCTTACTGCGAGGAGTGTTGGCTGGTAAACTGCTTCTGCTAGTTT
TTGAACTGCCTCCCAGACTGTTGTGTGGCTGATTTTGAGGATTTTTCCGACTTGT
CTGTAGCTGAGGCCTCGCAGGTATAGTTCTACTGCTCTGATTTTCTTTTCTGGTG
GGATTTTGTTGCGGCGAAAAGGTTTTAAGACTGAAATCAGTGAGTATAGAATGC
TCTCAGCCTTCATTTCTCTCCCTTTCTTTTTCTGAAAAATTATCAGAAACTTAAA
CCTAACGCCCTACCGCTTATCCTAACAGTATCTACACCCTAAGAGAAAAATTTT
TATTGGGTATCATTTNCCCAAATTA
>Unknown 2
TTAACATTTCTACCTCTCTCNGTCTCAATCCCTCCATAAAACTCTTTTCTTAATTA
CTTTGTCTTCTAAAAGTCTAACCCTTTAGTATGAGATATAGTAAATTGTATGCTT
TCCCTGAACTCCTTGGCTACCTTCATTATCTCTCTTAATCCTTCAGTAAGCCTGA
GAACTAGAACTAAATTCCTCAACTTATAGCCTTTTTTTTATAGTTCCCATAATAT
TTTTTGCTCAAAGGCTATTTTCCAAACTTCAGTGTATTCCCAAATAAATTGTCTA
TGCTCCTATGTTCTTAAAAGGACAGTAAAAGAGATATAAAAGGTTTCCTGTTCT
CGGGAGAGACACTGTCAGGATAAGCAGTGGGACGTTAGGTTTAAGTTTCTGAT
ACCTCTTCAGAAAAGGGGAGAGAAATGAGGACTGAAACCATTATTTACTTACT
GGTTTCAGTCTTAAAAACCTTTCGCCGGAACAAAATCCCAGCAAAAAAGAAAA
CCAGGGCAATAAACCTGTACCTGCACGGACTAAGTTACAGACAGGTAGGAACA
ATCCTCGAAATCAGCCACACAACAGTCTGGGAAACAGTCCAAAAATTCGCGAA
AGCAGTTTACCAGCCGAAAATCCTCGCAGTCAAAAACAGAGAANCTTCATCGC
AATNGACGAGACAGNGATAAAGATCANCGGCCAGAAAAGATTTCTCTGGGCTG
CAATCGACGTTGAAAGCAAAGAAATCCTAGCAG
1. The first thing we need to do is to compare the two sequences to determine if they
correspond to the same gene:
(Result: the two sequences are similar at both ends; by elimination this points to
hypothesis 4. To confirm this, run two separate BLAST searches to show that both
sequences match the same sequence.)
a. Go to the NCBI web page by typing the URL http://www.ncbi.nlm.nih.gov/
b. Go to BLAST
c. Go to “Align two sequences (bl2seq)”
d. Insert the two sequences in each of the windows provided
e. Hit “Align”
The bl2seq result in 2005 looked like this:
[Figure callout: This information describes some parameters used to align the sequences.]
[Figure callout: These values give you the statistical values associated with the
alignment. A perfect alignment will have an expected value of 0.0, while highly divergent
sequences will give high expected values. In this case, this is a very low value, suggesting
a good match between the aligned portions of the two sequences.]
[Figure callout: Note: the first sequence is in the 5’ to 3’ direction while the other is
3’ to 5’. They only overlap in a small fragment.]
What did you obtain? Are the two sequences completely, partially or not at all similar?
In our example the two sequences are partially similar because only a small portion of the
amplified sequences aligned. Note also that the expected value is not 0.0 (the value
obtained if the aligned portion were completely identical) and indeed, the aligned portion
shows several mismatches. This evidence suggests that we were hitting two different genes
with our primers.
2. Testing our second hypothesis. We can verify whether the two sequences came from the
same organism.
f. Go to BLAST
g. Go to “Nucleotide-nucleotide BLAST (BLASTn)”
h. Enter one of your sequences
i. Hit “blast”
You can click on the score (bit) link on the “description” or move down the page on your
screen to see more information about the matches including the actual alignments.
As you can see, this first sequence most likely comes from an organism very similar to
Pyrococcus furiosus (an archaeon). If you click on the first red bar you will be able to see
the alignment of your sequence (query) with the best hit in GenBank. It should look like
this:
[Figure callout: name of the organism]
Your fragment is very similar to the sequence from base pair 10315 to 11090 in the record
identified as >gi|18891962|gb|AE010132.1|. If you click on this number you
will retrieve the sequence and the region you were amplifying.
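The record identifier shown above packs several fields into one string separated by vertical
bars. The sketch below splits such an identifier into its parts. It assumes the
“gi|number|database|accession|” layout seen in this example, and the function name is a
hypothetical helper rather than part of any NCBI tool.

```python
def parse_gi_defline(defline: str) -> dict:
    """Split an identifier like '>gi|18891962|gb|AE010132.1|' into its fields."""
    fields = defline.lstrip(">").strip().split("|")
    if len(fields) < 4 or fields[0] != "gi":
        raise ValueError(f"unexpected identifier format: {defline!r}")
    return {"gi": fields[1], "database": fields[2], "accession": fields[3]}


if __name__ == "__main__":
    record = parse_gi_defline(">gi|18891962|gb|AE010132.1|")
    print(record)
    # Length of the matching region reported in the alignment (inclusive coordinates).
    print("aligned region length:", 11090 - 10315 + 1)
```

The accession (here AE010132.1) is the part to quote when referring to the record.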
[Figure callout: The query sequence is very similar to the genes in this region (10315 to
11090), and similar to gene number PF0069.]
So far we have established that the first sequence is very similar to the transposase PF0069
of Pyrococcus furiosus.
Repeat the BLASTn with the second sequence. You should get these results:
After looking for the genes corresponding to nucleotides 5203 to 5925 in the above
accession number you should be able to get this information:
[Figure callout: This sequence is similar to parts of two genes, including PF0536, which is
also a transposase, and PF0537.]
Summarizing the results: The two sequences come from the DNA sequence of an
organism that is very similar to P. furiosus. They both amplify a portion of a transposase
gene but each one is amplifying a different transposase gene (PF0069 and PF0536
respectively). Therefore we should accept the first and fourth hypotheses that the primers
are not specific to the desired sequence (because the gene is duplicated) and therefore can
amplify different regions in the genome of the organism under consideration. In this case,
the double band problem can be solved by re-designing primers in regions that
are unique in the genome of the organism under study.
Exercise 2.
Assignment 1.
Using the same analytical path as in the previous exercise, evaluate the different hypotheses
for the existence of double bands in PCR amplification with the following pairs of
sequences. As a reminder, these are the hypotheses to be tested:
Each of the following sequence pairs should match to one of these hypotheses.
>Unknown 1B
TAAGCTAAATTGTTATGATAAAAATTAACCCGTGTGGGTACCCTTTACCCCCTA
TTTTTTAGTCTTAAAACTATGCTTAAATAAGCCGACCAAAAATTCAAAAAAAAC
AAAATTACACGATTTTTTAGAAAATTAGTAAATTTTGCAATTTTCGGAGAAATT
TTAGATCTTATCAAATTTAAAGCATAACAAAGTGGTATTTTTAAAGCTCAATTT
GCAATTTTTATAATTTTTAATTTCTGCTGATAAGCTAAATTGTTATGATAAAAAT
TAACCCGTGTGGGTACCCTTTACCCCCTATTTTTTAGTCTTAAAACTATGCTTAA
ATAAGCCGACCAAAAATTCAAAAAAAACAAAATTACACGATTTTTTAGAAAAT
TAGTAAATTTTGCAATTTTCGGAGAAATTTTAGATCTTATCAAATTTAAAGCATA
ACAAAGTGGTATTTTTAAAGCTCAATTTGCAATTTTTATAATTTTTAATTTCTGC
TGATAAGCTAAATTGTTATGATAAAAATTAACCCGTGTGGGTACCCTTTACCCC
CTATTTTTTAGTCTTAAAACTATGCTTAAATAAGCCGACCAAAAATTCAAAAAA
AACAAAATTACACGATTTTTTAGAAAATTAGTAAATTTTGCAATTTTCGGAGAA
ATTTTAGATCTTATCAAATTTAAAGCATAACAAAGTGGTATTTTTAAAGCTCAA
TTTGCAATTTTTATAATTTTTAATTTCTGCTGATAAGCTAAATTGTTATGATAAA
AATTAACCCGTGTGGGTACCCTTTACCCCCTATTTTTTAGTCTTAAAACTATGCT
TAAATAAGCCGACCAAAAATTCAAAAAAAACAAAATTACACGATTTTTTAGAA
AATTAGTAAATTTTGCAATTTTCGGAGAAATTTTAGATCTTATCAAATTTAAAGC
ATAAC
>Unknown 2B
ATAACCGCAACTGCTGGCACAAAATTTGTTATTAATTTAAATATTTCTAAATCTT
AAGTTCTTAAATTTTTAATAATATTTACTACTTATATTTAATTAATTTATTATTAA
AATAAATAAAAATTATTACTAAAATTTATATATAAAATAAATTTATAAATAAAT
TATTCTAAACCATAAAAAATTTTTTTATTAAAAATAAGCTAAACAAGCTTTTGG
GCTCATACCTCA
3. Contamination
>Unknown 3A
ATCTGCCAGTACCTGCTGGCCCGGGATTGCGAGGACCACTCCTTCTCCATTGTC
ATTGAGACCGTCCAGGTGAGTTTTGCCAGCCTGGCGGCTGGTCGGGTGGCAGCC
TGAGGCGCTGCGTAGCTGCTGGGCTGAGTGGGCAGCCCCTGGAAGGCGTGGAC
CACTGCTTTCGTGCACGAGGGAGGTCTCTTCCAGACTTCAGTTAGCCCGTTCTTC
ATCCAGAAACCCAGCCCATCCCCTCTCTCCCTAGATGCTGGGGGAGGGCATCCT
GCCCTGACACTAATGCCGTCTCAGGGCTGGCTCAGCTCCCCACTCAGGGGAAGC
ATCCTCCTTGGGTCTAGCCTTTTGCCCCCGTAACTAACCCTCGATGTTTGGAGCT
GGGGGTGGCTGGGGTGGGGGCAAAGAGGGCCTGTCTTCTGCTCTGGACTCTATG
GTACGTGCGTGGGTTCCTTTCTCTAGGGATGGAGGCCAAGATGAGGACGAATA
GAGTGCACAGGGCTTCCCCTGGGGATAAAAGTCCCTGATTCTCTTTGGAACAGT
AGGTCCGAGGAACACTGTGAAGCAGAGGCCCATGGTTGTTTATGAAAGACCGG
GCCAGCTGGGAAAAGGTGCGAGTAAGGATGCACTTAGCGCCTTCTTGCTGCAC
AACTCTAGGGGGCCCCATTTACATGGAGTATTCTGGGAGAAGGGAGTTGTAGG
CTGTGAGAAGGCCAGAGGCTCTCTGGGGTTGGGGCCTTTCTCTGATTAAGAGGG
TCCTGGACTGGGGTGCTGGACAGGCAG
>Unknown 3b
ATATGCCAGTACCTGCTGGCCCGGGACTGTGAGGACCACTCTTTCTCCATTGTC
ATTGAGACCATGCAGGTGAGCCTGGCCAGGATGGTTGCTATGGCTGCAGCCCCT
GGGGGGGCTGCAGGCCATTGCTTTCTGTGCAGGAGGGAAGATCCTTCTTCCAGG
CTTCAGTCAGCCCCTCATTCTTCCAGAAACCCTACCCATATCTCTGTCCCTAGAT
GTAGGAGAAGGACCTGCCTGCTGACCACATCATTCAGGGCTGGCTTGGCTCCCA
GAGAAGCTTCCGTGGGTACTTCAGGTCCTGCCTGTTGCCCTCGTGGTGAACCCC
TGCAGTGTGAAGCTGGGTGACTGGGGTGGGGGCAAGGAGGAATGTCTGCCATT
CTGCTCTGCTGTGGCAGGTTGTAGACTTATTTTCTTTAGAAAACAGCAAAGATG
AAGACAAATAGAGTGCTGAAGTCCCCGATTCTCTTTGGAACAATAAGCCCAAG
GAACAAGCTAAGCCAGAGGCCATGGTTCTTTATAAAGACCAGGCCAGGATGGG
AAAGGATGTGGGAGAGGTTGGGCATAGTACCTGCACAACTCTAGGGGTCCCAC
TGACAGTGTTTTGTGAGAATGGAGCTGTAGTCAGCGAGAAGGCTAGAGGCTCTC
TGGAGCTGGGCCTTTCTCTGACTCAGTAGGTCTTGGGCTGGGGTGCTGGACTGG
TAAGGGGTACTATGGGATCAGCCTGCACCCTCTCTTGGCAGTGTGCTGATGACC
CTGACGCTGTCTGTACCCGCTCAGTGACCATTCGGCTGCCTGACCTGC
>Unknown 4a
GGGACGGACGACGAATCAGGAGGAGGAGAACAAAGAAATAATTGGAGGAGGT
GGAGGCGAGAGAGAGAGGGGTGCGTCCTTATAGATAAAGAACCGGTGTTTGAT
TTGGCGGTGGGTGTGGTGGCCGCTCGCGCGCGAGAAAGCGACACCGGGGGAAT
CGGCGATGGTGTATGTGTAGTATTGGCGGACGACGCAGCAGCAACGCACGGTC
GCTACCTGCAAACCAGATTGCTATCTAATCTAGCTATTATCTCGCTGCTGGTGA
AGTGGTGGCAATGGACTTAATGGTGAGGCACCACACGCACTGTTCGGAAACAG
GGAGGCCATTGGCACCCATTTGCAGCTGTGCCTTAGCTAGCTAGCTTGCCCTGC
AGATTATATTGTGGCTAATGAAGTAGCTTGATTTGGCCATGCTTTCTGCTCCTGC
TAGCAAGATCAAACTGGCGATGATTAATGTTATTAGCCGGGAGTTGTCTAGCTA
GGACAGTGCGTTGCCTTCGGGGTTAATTAGGAGGTGACAATGACATGTGGGGC
CTGGA
>Unknown 4b
TCCTAGGGACGGACGACGAATCAGGAGGAGGAGAACAAAGAAATAATTGGAG
GAGGTGGAGGCGAGAGAGAGAGGGGTGCGTCCTTATAGATAAAGAACCGGTGT
TTGATTTGGCGGTGGGTGTGGTGGCCGCTCGCGCGCGAGAAAGCGACACCGGG
GGAATCGGCGATGGTGTATGTGTAGTATTGGCGGACGACGCAGCAGCAACGCA
CGGTCGCTACCTGCAAACCAGATTGCTATCTAATCTAGCTATTATCTCGCTGCTG
GTGAAGTGGTGGCAATGGACTTAATGGTGAGGCACCACACGCACTGTTCGGAA
ACAGGGAGGCCATTGGCACCCATTTGCAGCTGTGCCTTAGCTAGCTAGCTTGCC
CTGCAGATTATATTGTGGCTAATGAAGTAGCTTGATTTGGCCATGCTTTCTGCTC
CTGCTAGCAAGATCAAACTGGCGATGATTAATGTTATTAGCCGGGAGTTGTCTA
GCTAGGACAGTGCGTTGCCTTCGGGGTTAATTAGGAGGTGACAATGACATGTGG
GGCCCTGGACGGGAGGAAGGCCAATTCCAGTACAGTAGCCCGGTACTCCATGT
CACTATTTATCAGGGTCGTTGTGAGTGGTTTCGAGCCTCGAATATGTGATATAG
TACGCAGGAAAGAGACGGATCACTGGA
*To look for a brief description of unfamiliar terms visit the following URL:
http://www.answers.com.
Exercise 1.
The following sequence was obtained from the genome of a strain of a hyperthermophilic
archaeon Pyrococcus spp isolated on Vulcano Island (Italy).
ACGTCATGGTTAACGGCGACAAGCCTCCGGACGATTTTGATATTGAGATAATAG
TTGCAAAGCCCAAGAGGTTTAGGATAAAACCAGGAATTTACCAGATGGCATGG
CACCTTGTTTTCAAGGCTTATGGAGATGATGAGCTGATTAAAGTTGGCTATGTA
GTTGGCTTTGGGGAAAAGAACTCGCTCGGCTTTGGAATGGTTAAAGTCGAGGGT
GGTAAACGTGTATCGGGTGGAGTTAAAAGTAGAGTTATAACTCCGGCGTTTATC
CGAGGAGCAGACCAGAAGAAGGCGGAGTTGAGGGTAGCGTTTACTAGAAGGA
ATAATTTATCTGCTAACTATCCTTCCTCTTATGAGTCCTCTATCAGAGAGGTGAT
TTTTCTATGTTGAAATTTACTGGCAACTGGTTCATAGATGCTGGCATTTTGGGGT
TCGTGAATTTGATGGAAGAGGTTTACGGTTGGGATTTGGAGGAGCTTCAGAGAC
ATATCAAAGAAGAGCCGGAGAAGGTCTATTATGGGTATTTTCCACTAGCTTACT
TTTACAGTTTAGCACCCAAGGGCCAAGAGAACAAAGGGCTTCTCTTACAGGCTA
TGCAAGAAATAGAAACTTTTGAAGGAGACAAACATAAATTGCTCGAGCTCGTG
TGG
Taking advantage of the fact that several Pyrococcus strains have been completely
sequenced, the researchers decided to compare this query sequence to the existing
sequences in the GenBank database, hoping to find a similar sequence in another
Pyrococcus spp. However, based on the results obtained through their BLAST analysis the
researchers concluded that this sequence is very likely the result of an event of horizontal
gene transfer (HGT) between Pyrococcus and a distantly related organism.
Assignment 1.
Using the BLAST programs from NCBI, try to recreate the logic that led the previous
researchers to come to the conclusion that a HGT event had occurred. Give the name of the
species that potentially donated the new gene to the Pyrococcus spp strain. Remember that
if the donor species is a distantly related species to Pyrococcus spp, then the genomic
similarity at the nucleotide level may be so small that you may not be able to get a
meaningful result with BLASTn. However the sequence may have some similarity at the
protein level, as protein sequences evolve less rapidly than nucleotide sequences. In that
case you may consider using BLASTx. (Hint: make sure to look for similarity over the entire
sequence length; the graphical representation is very useful. Think about “mosaic”
genomes.)
For help with the BLAST programs visit the NCBI BLAST tutorials website:
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html
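BLASTx works by translating the nucleotide query into protein in all six reading frames
(three on each strand) before searching. The sketch below shows that translation step only,
assuming the standard genetic code; it is an illustration of the idea, not the BLASTx
implementation, and codons containing ambiguous bases such as N are written as X.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order, '*' marks a stop codon.
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq: str) -> str:
    """Translate a nucleotide sequence starting at its first base."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translation(seq: str) -> dict:
    """Return translations for frames +1, +2, +3, -1, -2 and -3."""
    plus = seq.upper()
    minus = plus.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(plus[offset:])
        frames[f"-{offset + 1}"] = translate(minus[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTGA").items():
        print(frame, protein)
```

Because several codons encode the same amino acid, two sequences can differ at many
nucleotide positions and still give similar translations, which is why a protein-level
search can detect relationships that BLASTn misses.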
Exercise 2.
Assignment 2.
What is the importance of each of the following genes to HGT? Explain your answer using
the information presented in the report page of your BLASTn search for the sequence that
best matches your query. You may even want to read the abstract of the paper that
published the sequence by using the direct link to PubMed.
Unknown 1
ggcactgcaccctaatagtgggacgtaagaaaaacacttttaggcgaccagttttctgtactgtacagaaaactggtcgtttaatctctg
ttgaagtctagtttcattataaaatgtaatgtcatttttaacaatatttgttatactatctttgttgtattttctcctattatggagataaaaggtttc
agtctttaggacggagtgaaatcattcaatacaggcattatctgcaggtgtttcttttcgagacattgagcagataatgcctttttccgtgc
aagcctggtagtaagccatagaagtatacactgagccttggtcactgtgtaagagtgctcctttaggcaattttaactgattaagggtgt
ctagtaaaaaatccgtgtcctgacactctgaaatagtgtaagctatcatttctcggttatagagatccataattgatgagagatacaattta
cagttaccgaaataatataggtaggtaatatctgttacaagcttttccataggcttatcggaatggaaatcccgactcaatttgttatctgtt
aaatcataagctttacccaaattgggaactttcttggtgcgtgtccgacaaagccggccattatttttcatgatacgatagactttctttgta
ttaacagtcaatccgtggatctttttgagcaatcgtgtaatggtacgatagccataaataaagtgattctacatgcagagctgttcaattaa
ttcaatgacatcatcttttttttgcggcttctcatactcctttttccaacggtaataagtcgaccatttgacgtcaaaacagtctagaatgaca
gctatctggtggttgtttttatagtcttccacaagcttgataagacttactttatcgatttccttatcaagcctcgatacttttttaagaggtcaa
cttgtaatggtaattgttccacttcagatagatgttccaagcctttaccataggtatattgcttgccaacaccttgatgagaacgataaagc
tcctcgttttcgtgccatttcatccaagtatagatttgactatcatttttgatacctaaagtctccataatcactctgttagacttgcctgctttc
ttcttctcgatgcaagccagcttagtttcccatgaatatgcttttttaaccataataaaacactcctgtttctagtttactagatttcaacagga
gtgtttttcttttgtctcattttagggattcagtgcc
Unknown 2
tcgaatttgggaactttgagcaagaggcaaatgaaaaacaagaaaatgcactttatctgattattattctttcaaggactagtataacata
aatcgtctacaaatagacaaaaaacctgcacgcttaatgtagatcaaaagcttaacgcaaatgaaatagattgacctcccaataacac
cacgttagttattgggagtcaatctatgaaatgcgattaagctttttctaattcacataagcgtgcaggtttaaagtacataaaaaatataat
gaaaaaaagcatcattatactaacgttataccaacattatactaattgcttattccaatttcctattggttggaaccaacaggcgttagtgt
gttgttgagttggtactttcatgggattaatcccatgaaacccccaaccaactcgccaaagctttggctaacacacacgccattccaac
caatagttttctcggcattaaagccatgctctgacgcttaaatgcactaatgccttaaaaaaacattaaagtctaacacactagacttattt
tcattcgtaattaagtcgttaaaccgtgtgctctacgaccaaaagtataaaacctttaagaactttcttttttcttgtaaaaaaagaaactag
ataaatctctcatatcttttattcaataatcgcatcagattgcagtataaatttaacgatcactcatcatgttcatatttatcagagctcgtgct
ataattatactaattttataaggaggaaaaaataaagagggttataatgaacgagaaaaatataaaacacagtcaaaactttattacttca
aaacataatatagataaaataatgacaaatataagattaaatgaacatgataatatctttgaaatcggctcaggaaaagggcattttacc
cttgaattagtacagaggtgtaatttcgtaactgccattgaaatagaccataaattatgcaaaactacagaaaataaacttgttgatcacg
ataatttccaagttttaaacaaggatatattgcagtttaaatttcctaaaaaccaatcctataaaatatttggtaatataccttataacataagt
acggatataatacgcaaaattgtttttgatagtatagctgatgagatttatttaatcgtggaatacgggtttgctaaaagattattaaataca
aaacgctcattcgcattatttttaatggcagaagttgatatttctatattaagtatggttccaagagcag
Unknown 3
atgaacgatacaacagagcatcatggacctaatccgctaaacgctccaccacctagcaactcacagagcaatgatcttttaaatttgct
agactcattatatcctaaagggagtttaggggagcaaagatttcacgaagctttaaagaatcaagaagagttgaaaaatatcctaatag
aaatagaaaagctaccgcaagaaaaaaggtatgaacttctgatgcagataggacaagccaagcaaagaataatggaagcatatgct
cattcattcttaggatatatagggggactagagcatctgttaggattgtgtatgggtgggatatttgttttgtttgcaatctattttgtattttta
agaactagcaaaaacatggagctagtggaaagtctaaaaacaaaactaaaacttcagtatttttactatgcctttggtgtgggtgcggtt
ttgttttttggattagaaacaattagatctatttatgaactatatatcttaggaattggtagcactaacgacaaggtgctctttgttttgaaaaa
catttgcttcataggtatgggctatttgatttataaagttattaaggttattggtataaaaaattttatcaatggtcttttcacttcaaagaaaca
aggcggtgcagaatga
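The three unknowns above are wrapped across many lines, and text copied from a handout can pick up stray spaces or digits. As a minimal sketch (the helper name `to_fasta` and the 70-character line width are assumptions, not part of the exercise), the following Python function turns a pasted sequence into a clean FASTA record that can be entered in the BLASTn query box:

```python
import re


def to_fasta(name: str, raw: str, width: int = 70) -> str:
    """Build one FASTA record from sequence text copied out of a document.

    Every character that is not a letter (spaces, line breaks, stray digits
    such as page numbers) is dropped before the sequence is re-wrapped.
    """
    sequence = re.sub(r"[^A-Za-z]", "", raw)
    if not sequence:
        raise ValueError("no sequence letters found")
    # Cut the cleaned sequence into fixed-width lines.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines)


if __name__ == "__main__":
    # Only the first few bases of Unknown 1 are used here for illustration.
    pasted = "ggcactgcaccctaatagtg\nggacgtaagaaaaacactttt"
    print(to_fasta("Unknown 1", pasted, width=30))
```

Paste the full sequence in place of the short fragment to prepare each query.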
Exercise 3.
The origin of photosynthesis in algae and plants seems to be the result of an incredible
horizontal gene transfer event. While many organisms are capable of photosynthesis, only a
few split water molecules as their source of electrons, releasing oxygen in the process. This
type of photosynthesis is called oxygenic photosynthesis, and it is possible thanks to the
presence of two photosystems in the chloroplasts of organisms with this ability.
The amino acid sequences given below belong to two proteins present in the chloroplasts of
organisms that perform oxygenic photosynthesis, such as land plants. The first sequence
corresponds to the chloroplast photosystem I subunit III and the second one to the
chloroplast photosystem II reaction center protein Z.
1. mmdfnlpsifvplvglvfpaiamaslflyvqknkiv
2. mtiafqlavfalivtssvlvisvplvfaspdgwsnnknvvfsgtslwiglvflvailnslis
The chloroplasts of plants can perform oxygenic photosynthesis because they have the two
photosystems mentioned above. Chloroplasts in green algae and plants (Viridiplantae) have
an endosymbiotic origin in which an ancestral eukaryote engulfed and maintained inside its
cytoplasm a prokaryote that already had the two photosystems.
Assignment 3.
What group of prokaryotes is most likely to be the ancestor of the chloroplasts in plants?
Explain your answer by using BLASTp to determine the presence or absence of the two
photosystems in each of the different groups of organisms listed in the table below.
Remember that you can limit your BLASTp search to a particular group of organisms by
typing the name of the group followed by [orgn] (e.g., Chloroflexi[orgn]) in the space
provided by BLASTp below the window where you entered your query.
Clostridia (Bacteria)
Metazoa (Animals)
Stramenopiles (Brown algae)
Rhodophyta (Red algae)
Viridiplantae (Green algae and Plants)
Glaucocystophyceae (Algae)
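To avoid typing the organism limit by hand for every group, the [orgn] strings can be generated in one go. The sketch below is only an illustration: the function names are hypothetical, and the group names are the ones listed in the table above.

```python
# Groups from the table above; each string goes in the organism field of BLASTp.
GROUPS = [
    "Clostridia",
    "Metazoa",
    "Stramenopiles",
    "Rhodophyta",
    "Viridiplantae",
    "Glaucocystophyceae",
]


def organism_filter(group: str) -> str:
    """Return the Entrez-style organism limit for one taxonomic group."""
    return f"{group.strip()}[orgn]"


def build_filters(groups):
    """Map every group name to its organism limit string."""
    return {group: organism_filter(group) for group in groups}


if __name__ == "__main__":
    for group, limit in build_filters(GROUPS).items():
        print(f"{group:20s} -> {limit}")
    # Assumption: if Biopython is installed, the same string could be passed
    # as the entrez_query argument of Bio.Blast.NCBIWWW.qblast to run the
    # search from a script. That call needs network access and is not
    # made here.
```

Run each of the two photosystem sequences once per group and record whether a significant hit is returned.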
When asked to print results, please print the entire web page as displayed in your browser.
Unless told otherwise, leave all settings on their default values.
In this question, you will perform an iterated protein BLAST search, using the results of
each iteration to form the new search sequence.
>WHOAMI
STKKKPLTQEQLEDARRLKAIYEKKKNELGLSQESVADKMGMGQS
GVGALFNGINVLQAYNAALLAKILKVSVEEFSPSIAREIYEMYEAVSMQPSLRSE
YEYPVFSHVQAGMFSPELRTFTKGDAERWVSTTKKASDSAFWLEVEGNSM
TAPTGSKPSFPDGMLILVDPEQAVEPGDFCIARLGGDEFTFKKLIRDSGQ
VFLQPLNPQYPMIPCNESCSVVGKVIASQWPEETFG
2. Find the points where the entered sequence differs from the best match.
What can you say (in brief) about the amino acids each sequence has at those points?
You may also find this a useful guide to amino acid chemical structure and
properties:
http://www.escience.ws/b572/L9/L9.htm
4. Find the first sequence that newly appeared in the results of this iteration and note
down its reference number.
5. Run a few more iterations of PSI-BLAST, looking at the number of hits returned and
the E values of the hits you are receiving.
Is the number of hits generally going up or down as you proceed through the
iterations?
What about the E values?
Why do you think that is?
6. How many iterations did it take until no more new results appeared?
7. Let's say a database contains many sequences, including sequences 'A' and 'B'. When
we perform a BLAST search with sequence A against the database, the best hit is for
sequence B, with score 500 and expect value 2e-100. Now we perform a BLAST
search with sequence B against the database. What (if anything) can we predict about
the score, expect value and position of sequence A in the results?
8. Let's say a database contains many sequences. When we perform a BLAST search
with sequence X against the database, the best hit is for sequence Y, with score 300
and expect value 2e-60. A year later, we come back to the database, which has of
course grown in the meantime. Now we perform a BLAST search with the exact
same sequence X as before. What (if anything) can we predict about the score, expect
value and position of sequence Y in the results?
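When reasoning about questions 7 and 8, it can help to experiment with how an expect value depends on the search space. The sketch below uses the standard approximation E = m × n × 2^(-S'), where S' is the bit score, m the query length and n the total database length. The lengths used are invented for illustration, the scores quoted in the questions may be raw scores rather than bit scores, and real BLAST applies effective-length corrections that are ignored here.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E value for a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical numbers: a 250-residue query and two database sizes.
    query_len = 250
    for db_len in (1_000_000_000, 2_000_000_000):
        e = expect_value(bit_score=100.0, query_len=query_len, db_len=db_len)
        print(f"database length {db_len:>13,d}: E = {e:.3e}")
```

Try changing the query length as well as the database length, and compare what you see with your answers.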
In this question, we will try different types of comparison between protein sequences and
nucleotide sequences.
2. Perform a search for RL1_SERMA in one window and RL1_HALCU in the other.
3. Based purely on the Comments section in the resulting pages, what type of
similarities would you expect to see between the proteins?
4. Open a new Notepad window and copy both proteins' sequences into it in
FASTA format by clicking the link in the bottom right of the table.
5. Now go back and click the GenBank links under the Cross-References section for
each protein.
6. Copy both genes' sequences in FASTA format into your Notepad window.
9. We are now going to perform 5 different pairwise comparisons. In each case, we will
compare one sequence related to RL1_HALCU and one related to RL1_SERMA.
The five searches are as follows:
A. The nucleotide sequence for RL1_HALCU against the nucleotide sequence for
RL1_SERMA.
B. The protein sequence for RL1_HALCU against the protein sequence for
RL1_SERMA.
C. The translated nucleotide sequence for RL1_HALCU against the protein sequence
for RL1_SERMA.
D. The protein sequence for RL1_HALCU against the translated nucleotide sequence
for RL1_SERMA.
E. The translated nucleotide sequence for RL1_HALCU against the translated
nucleotide sequence for RL1_SERMA.
Print the results for each of these five comparisons, labelling them A to E as above.
You may find this description of all the different BLAST types useful:
http://www.ncbi.nlm.nih.gov/blast/html/BLASThomehelp.html#BLAST2SEQ
10. If comparison B took 0.1 seconds and it takes 0.01 seconds to translate a nucleotide
sequence into one of its possible protein sequences, approximately how long would
you expect comparison C to take? How about comparison E?
11. Which comparison, A or B, gave a stronger match? How might this have happened?
12. Which comparison, B or E, had a more significant E value? Can you think why?
13. Which comparison, C or D, had a more significant E value? Suggest a reason why.
14. Which comparison, B or C, had a more significant E value? How do their scores
compare? Can you explain this?
15. When aligning one protein sequence against one nucleotide sequence, what is the
difference (if any) between using BLASTX or TBLASTN?
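Comparisons C, D and E above rely on translating a nucleotide sequence into protein. A nucleotide sequence can be read in three frames on each strand, giving six possible translations. The following Python sketch shows what that means; it assumes the standard genetic code, marks stop codons with `*`, and uses function names chosen here for illustration only.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from position 0; unknown codons become X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna: str) -> dict:
    """Return the three forward (+) and three reverse (-) frame translations."""
    forward = dna.upper()
    reverse = reverse_complement(forward)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCCATTGTAATGGGCCGC").items():
        print(frame, protein)
```

Counting how many translations each sequence needs, and how many protein pairs must then be compared, is a useful starting point for question 10.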
In this question, we will experiment with using different amino acid comparison matrices.
1. In 2 separate windows, go to the NCBI home page and perform a Protein search for
rta_rat in one and lshr_rat in the other.
2. Extract the two sequences in FASTA format and paste them into a Notepad window.
4. Select User-entered sequence in both places on the page and paste in your
sequences without their description lines. Set the Number of alignments to be
computed to 1.
6. Perform alignments using PAM120, PAM250 and PAM400, noting down the scores
for each.
8. Do you think these protein sequences are unrelated, distantly related or closely
related? Why?
To answer this, you may find it helpful to click the PRSS link in some of your search
results. This will take you to another report which ends with an assessment which
can be converted easily to an E value.