Slides 2012 08 26 Recomb Ab 1

Open Computational Problems in Metagenomics
Gabriel Valiente
Algorithms, Bioinformatics, Complexity and Formal Methods Research Group,
Technical University of Catalonia
Computational Biology and Bioinformatics Research Group,
Research Institute of Health Science, University of the Balearic Islands
Centre for Genomic Regulation,
Barcelona Biomedical Research Park
University of Zaragoza
18 May 2012
Abstract
Next-generation sequencing technologies allow for the genetic
study of complex microbial communities, which were so far
largely unknown because they cannot be cultured in the
laboratory.
The core problem of metagenomics is to determine and
quantify the composition of a sample consisting of a mixture
of different, and possibly unknown, microbial species (National
Research Council, 2007).
Solving this core biological problem involves a series of
algorithmic and computational problems, ranging from the
simulation of metagenomic samples to the alignment or
mapping of sequence reads, the non-taxonomic assignment or
binning of sequence reads, and the taxonomic assignment of
sequence reads (Ribeca and Valiente, 2011).
Biological background I
Biological background II
Kingdom Archaea Bacteria Eukaryota Viruses
Phylum Streptophyta
Class Streptophytina
Order Solanales
Family Solanaceae
Genus Solanum
Species Solanum lycopersicum
Solanum caule inermi herbaceo, foliis pinnatis incisis, racemis
simplicibus
Biological background III
Kingdom Archaea Bacteria Eukaryota Viruses
Phylum Chordata
Class Mammalia
Order Primates
Family Hominidae
Genus Gorilla Homo Pan Pongo
Species Homo sapiens

Computational background I
Computational background II
(Van de Peer et al., 2000)

Computational background III
(Ashelford et al., 2005)

Computational background IV
(Schloss, 2009)
Simulating metagenomic samples I
Different metagenomic analysis pipelines produce different results,
and standardized simulated data are essential to their evaluation.
The simulated sequence reads have to reflect the diverse taxonomic
composition of a metagenomic dataset. Typical experimental
settings involve targeted or random sequencing,
single-end or paired-end reads, and one or many dominant species.
Ribeca and Valiente (2011) showed that current simulation
tools (Richter et al., 2008) do not properly reflect the diverse
taxonomic composition of a metagenomic dataset.
Problem 1 Devise an algorithm to simulate a metagenomic dataset
of single-end reads.
Problem 2 Devise an algorithm to simulate a metagenomic dataset
of paired-end reads.
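To make Problems 1 and 2 concrete, here is a minimal Python sketch of a naive read simulator. It is not one of the tools cited above, and it does not solve either problem. Every name in it (`simulate_reads`, `add_errors`, `fragment_len`, `error_rate`) is hypothetical. It assumes that each genome is at least as long as one fragment, that `fragment_len` is at least `read_len`, that reads are drawn in proportion to a user-supplied abundance weight, and that sequencing errors are uniform substitutions.

```python
import random

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(_COMPLEMENT)[::-1]

def add_errors(seq, error_rate, rng):
    """Substitute each base by a different base with probability error_rate."""
    out = []
    for base in seq:
        if rng.random() < error_rate:
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)

def simulate_reads(genomes, abundance, n_reads, read_len,
                   fragment_len=None, error_rate=0.0, seed=0):
    """Draw reads from genomes (name -> sequence) weighted by abundance.

    Single-end mode (fragment_len is None) returns (source, read) tuples.
    Paired-end mode returns (source, read1, read2), where read2 is read from
    the opposite strand at the far end of a fragment of length fragment_len.
    """
    rng = random.Random(seed)
    names = sorted(genomes)
    weights = [abundance[name] for name in names]
    span = fragment_len or read_len
    reads = []
    for _ in range(n_reads):
        # Pick the source genome according to the abundance profile.
        name = rng.choices(names, weights=weights, k=1)[0]
        genome = genomes[name]
        start = rng.randrange(len(genome) - span + 1)
        fragment = genome[start:start + span]
        first = add_errors(fragment[:read_len], error_rate, rng)
        if fragment_len is None:
            reads.append((name, first))
        else:
            mate = reverse_complement(fragment[-read_len:])
            reads.append((name, first, add_errors(mate, error_rate, rng)))
    return reads
```

Keeping the source genome with each read is deliberate: it gives the ground truth against which a mapping, binning, or assignment pipeline can later be scored.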
Simulating metagenomic samples II
Domain    Phylum               Class                       Genomes
Bacteria  Actinobacteria       Actinobacteria                    9
          Bacteroidetes        Cytophagia                        1
          Chlorobi             Chlorobia                         7
          Chloroflexi          Chloroflexi                       1
          Cyanobacteria        Cyanobacteria                     6
          Deinococcus-Thermus  Deinococci                        1
          Firmicutes           Bacilli                          13
                               Clostridia                        8
          Proteobacteria       Alphaproteobacteria              17
                               Betaproteobacteria               13
                               Gammaproteobacteria              25
                               Deltaproteobacteria               6
                               Epsilonproteobacteria             1
                               unclassified Proteobacteria       1
Archaea   Euryarchaeota        Methanomicrobia                   3
                               Thermoplasmata                    1
Simulating metagenomic samples III
• Low-complexity microbial community with one dominant
population
• Medium-complexity microbial community with three dominant
populations flanked by low-abundance populations
• High-complexity microbial community with no dominant
population at all
Number of reads    simLC    simMC    simHC
Most abundant     28,861   22,956    2,384
2nd abundant       9,277   16,577    2,248
3rd abundant       5,168   10,484    2,191
4th abundant       1,149    6,107    2,127
5th abundant       1,109    4,868    2,083
6th abundant       1,074    1,146    2,051
Rest              50,857   52,319  103,687
(Alonso-Alemany et al., 2011)

Mapping sequence reads I
The composition of a metagenomic dataset can be assessed by
aligning or mapping the sequence reads to a reference database of
known sequences from a large set of different organisms. The high
yield of high-throughput sequencing technologies requires
extremely efficient mapping programs, ruling out traditional
alignment programs like BLAST (Altschul et al., 1990).
Current mapping programs achieve efficiency at the price of
accuracy: they are not exhaustive, they allow only a small number of
insertions and mismatches, and they report only a single match in
case of ambiguities (Ribeca and Valiente, 2011).
Problem 3 Devise an exhaustive algorithm to map sequence reads
with long insertions and an arbitrary number of mismatches.
Mapping sequence reads II
• Align the 328,723 simulated sequence reads to the 113
microbial genomes using BLAST (a larger database is often
used when the target sequences are not known beforehand)
• Ambiguities arise when a sequence read is aligned with more
than one target sequence
• Take as candidate alignments all those sequences with the
same E-value as the top BLAST hit
• Sequence reads with no hit in the database of microbial
genomes are due to sequencing errors
Data set   No hit   One hit   Ambiguous     Total
simLC          59    76,513      20,923    97,495
simMC          76    86,705      27,676   114,457
simHC         100    99,619      17,052   116,771
(Alonso-Alemany et al., 2011)
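The candidate-selection rule on this slide (keep every target tied with the top hit on E-value) can be sketched in Python as follows. The input is assumed to be already parsed into `(read_id, target_id, evalue)` triples. The function names `candidate_hits` and `summarize` are hypothetical, and the exact float comparison assumes that tied E-values were parsed from identical strings.

```python
from collections import defaultdict

def candidate_hits(hits):
    """Map each read to the targets whose E-value equals that of its best hit.

    hits: iterable of (read_id, target_id, evalue) triples.
    """
    by_read = defaultdict(list)
    for read_id, target_id, evalue in hits:
        by_read[read_id].append((evalue, target_id))
    candidates = {}
    for read_id, scored in by_read.items():
        best = min(evalue for evalue, _ in scored)
        candidates[read_id] = sorted(t for evalue, t in scored if evalue == best)
    return candidates

def summarize(read_ids, candidates):
    """Count reads with no hit, exactly one candidate, or several candidates."""
    counts = {"no hit": 0, "one hit": 0, "ambiguous": 0}
    for read_id in read_ids:
        n = len(candidates.get(read_id, []))
        key = "no hit" if n == 0 else "one hit" if n == 1 else "ambiguous"
        counts[key] += 1
    return counts
```

Applied to every read of a dataset, `summarize` yields the three categories tabulated above.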

Binning sequence reads
Those sequence reads that cannot be mapped to any sequence in a
reference database of known sequences are usually assumed to
come from unknown species. Pairwise similarities among sequence
reads are used to group them into clusters of related species.
Ribeca and Valiente (2011) showed that current binning
tools (Schloss et al., 2009; Caporaso et al., 2010) overestimate
diversity and richness in a simulated
metagenomic dataset.
Problem 4 Devise a binning algorithm that reflects the microbial
composition of a metagenomic dataset.
Assigning sequence reads I
Ambiguities may arise when mapping sequence reads to a reference
database of known sequences. Sequence reads are attributed to
species at the closest possible taxonomic rank, and any ambiguities
are usually resolved by assigning ambiguous sequence reads to either
the consensus or lowest common ancestor (LCA) of all matching
sequences in a reference taxonomy (Huson et al., 2007), or to a node
in the reference taxonomy that provides optimal sensitivity and
specificity (Clemente et al., 2010, 2011).
Ribeca and Valiente (2011) showed that current assignment
tools (Cole et al., 2009; Schloss et al., 2009; Alonso-Alemany
et al., 2011) underestimate diversity and richness in
a simulated metagenomic dataset. Alonso-Alemany et al. (2011)
extended these results to taxonomic diversity.
Problem 5 Devise an assignment algorithm that reflects the
taxonomic composition of a metagenomic dataset.
Assigning sequence reads II
Input   A genomic reference S (a set of sequences)
        A taxonomic reference T (a tree) with leaf set L, where each
        leaf in L has an associated known sequence of S
        A set R of (short or long) sequence reads
        A positive integer k
Output  For each read R_i ∈ R, a single node in T that represents
        in a “good” way the subset M_i ⊆ L of hits or matches whose
        sequences contain a substring with at most k mismatches to R_i
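The matching condition in the output specification (a substring with at most k mismatches to the read) can be checked exhaustively with a quadratic scan. The Python sketch below is only that brute-force check, not an efficient mapper. The names `has_match` and `matching_leaves` are hypothetical, and it considers substitutions only, with no insertions or deletions.

```python
def has_match(read, sequence, k):
    """True if some substring of `sequence` with the same length as `read`
    differs from `read` in at most k positions."""
    m = len(read)
    for start in range(len(sequence) - m + 1):
        mismatches = 0
        for a, b in zip(read, sequence[start:start + m]):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break  # this window already exceeds the budget
        else:
            return True  # window scanned fully without exceeding k
    return False

def matching_leaves(read, leaf_sequences, k):
    """The set of leaves whose sequence matches the read with up to k mismatches.

    leaf_sequences: dict mapping each leaf of the taxonomy to its sequence.
    """
    return {leaf for leaf, seq in leaf_sequences.items() if has_match(read, seq, k)}
```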
Assigning sequence reads III
Given a reference taxonomy T, a set R of sequence reads, and a
threshold value k of sequence similarity,
• Let R_i be the ith read
• Let M_i be the leaves of T matching R_i with up to k
mismatches
• Let T_i be the subtree of T rooted at the LCA of M_i
• Let N_i be the leaves of T_i not matching R_i with up to k
mismatches
For the ith read, the leaves of T_i can be partitioned into the
following four subsets:
• TP_i = M_i (true positives)
• FP_i = N_i (false positives)
• TN_i = ∅ (true negatives)
• FN_i = ∅ (false negatives)
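The definitions above can be sketched in Python, assuming the taxonomy is stored as a child-to-parent map with the root mapped to `None`. That representation and the names `path_to_root`, `lca`, and `leaves_below` are illustrative choices, not part of the talk.

```python
def path_to_root(node, parent):
    """Nodes from `node` up to the root, in that order."""
    path = [node]
    while parent.get(node) is not None:
        node = parent[node]
        path.append(node)
    return path

def lca(nodes, parent):
    """Lowest common ancestor of a non-empty collection of nodes."""
    nodes = list(nodes)
    common = path_to_root(nodes[0], parent)  # ordered deepest first
    for node in nodes[1:]:
        ancestors = set(path_to_root(node, parent))
        common = [a for a in common if a in ancestors]
    return common[0]  # deepest node shared by all root paths

def leaves_below(node, parent):
    """Leaves of the subtree rooted at `node`."""
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)
    stack, leaves = [node], set()
    while stack:
        v = stack.pop()
        kids = children.get(v, [])
        if kids:
            stack.extend(kids)
        else:
            leaves.add(v)
    return leaves

# Toy taxonomy (hypothetical labels): the read matches leaves a1 and b1.
parent = {"root": None, "A": "root", "B": "root", "a1": "A", "a2": "A", "b1": "B"}
M_i = {"a1", "b1"}
root_of_T_i = lca(M_i, parent)                   # LCA of the matching leaves
N_i = leaves_below(root_of_T_i, parent) - M_i    # leaves of T_i that do not match
```

In this toy case the subtree T_i is the whole tree, so TP_i = {a1, b1} and FP_i = {a2}, while TN_i and FN_i are empty by construction.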
Assigning sequence reads IV
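[Figure: the subtree T_i, whose leaves are split into N_i (= FP_i) and M_i (= TP_i)]
\[ P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN} \]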
Assigning sequence reads V
Given a reference taxonomy T, a set R of sequence reads, and a
threshold value k of sequence similarity,
• Let T_ij be the subtree of T rooted at the jth node of T_i
• Let M_ij be the leaves of T_ij matching R_i with up to k
mismatches
• Let N_ij be the leaves of T_ij not matching R_i with up to k
mismatches
For the ith read and the jth node of T_i, the leaves of T_i can be
partitioned into the following four subsets:
• TP_ij = M_ij (true positives)
• FP_ij = N_ij (false positives)
• TN_ij = N_i \ N_ij (true negatives)
• FN_ij = M_i \ M_ij (false negatives)
Assigning sequence reads VI
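[Figure: the subtree T_i with the nested subtree T_ij; the leaves of T_i are split into N_i \ N_ij (= TN_ij), N_ij (= FP_ij), M_ij (= TP_ij), and M_i \ M_ij (= FN_ij)]
\[ P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN} \]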
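As an illustration of the four subsets and of P and R for a candidate node, here is a small Python sketch. It assumes the taxonomy is given as a parent-to-children map, and the names `leaves_under`, `confusion_sets`, and `precision_recall` are hypothetical.

```python
def leaves_under(node, children):
    """Leaves of the subtree rooted at `node` (a node without children is a leaf)."""
    kids = children.get(node, [])
    if not kids:
        return {node}
    result = set()
    for kid in kids:
        result |= leaves_under(kid, children)
    return result

def confusion_sets(node_j, root_i, children, matching):
    """TP, FP, TN, FN leaf sets for assigning the read to node_j inside T_i.

    root_i: root of T_i (the LCA of the matching leaves).
    matching: set of leaves matching the read with up to k mismatches.
    """
    all_leaves = leaves_under(root_i, children)
    M_i, N_i = all_leaves & matching, all_leaves - matching
    sub = leaves_under(node_j, children)
    M_ij, N_ij = sub & matching, sub - matching
    return {"TP": M_ij, "FP": N_ij, "TN": N_i - N_ij, "FN": M_i - M_ij}

def precision_recall(sets):
    """Precision P = TP/(TP+FP) and recall R = TP/(TP+FN) from the leaf sets."""
    tp, fp, fn = len(sets["TP"]), len(sets["FP"]), len(sets["FN"])
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Choosing the root of T_i as the candidate node reproduces the earlier special case: no leaf is a negative, so recall is 1 and precision is the fraction of matching leaves.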
Assigning sequence reads VII
Bacteria
  Aquificae
    Aquificae
      Aquificales
        Aquificaceae
          Aquifex
            Aquifex pyrophilus
          Hydrogenobaculum
            Hydrogenobaculum acidophilum
          Hydrogenobacter
            Hydrogenobacter subterraneus
            Hydrogenobacter thermophilus
            Hydrogenobacter hydrogenophilus
          Persephonella
            Persephonella hydrogeniphila
            Persephonella marina
            Persephonella guaymasensis
          Sulfurihydrogenibium
            Sulfurihydrogenibium subterraneum
            Sulfurihydrogenibium azorense
            Sulfurihydrogenibium yellowstonense
          Thermocrinis
            Thermocrinis albus
            Thermocrinis ruber
          Hydrogenivirga
            Hydrogenivirga caldilitoris
Node annotations from the figure:
P = 6/(6 + 8) = 43%, R = 6/(6 + 0) = 100%, F = 60%
P = 3/(3 + 0) = 100%, R = 3/(3 + 3) = 50%, F = 67%
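The F values annotated on this slide agree with the usual F-measure, the harmonic mean of precision and recall. The slide does not state the formula, so that reading is an assumption. A short Python check of the arithmetic, with a hypothetical function name:

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall, computed from leaf counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Counts taken from the two annotations on the slide.
broad_node = f_measure(tp=6, fp=8, fn=0)    # P = 6/14, R = 1
narrow_node = f_measure(tp=3, fp=0, fn=3)   # P = 1, R = 1/2
print(round(100 * broad_node), round(100 * narrow_node))
```

Under this measure the narrower node scores higher, because what it gains in precision outweighs what it loses in recall.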
Assigning sequence reads VIII
Fischer and Huson (2010) introduced the notion of
LCA-skeleton-tree: the restriction of a tree to a given subset of the
nodes. Clemente et al. (2011) noticed that the LCA-skeleton-tree
of a reference taxonomy suffices for optimal taxonomic assignment.
Problem 6 Devise a fast algorithm for the LCA-skeleton-tree of a
subset of the nodes in a reference taxonomy.
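For orientation only, here is a naive quadratic Python baseline. It is not the fast algorithm that Problem 6 asks for. It reads the LCA-skeleton-tree as the tree on the chosen nodes plus their pairwise LCAs, with each kept node attached to its nearest kept proper ancestor. That reading, the child-to-parent representation, and the name `skeleton_tree` are assumptions.

```python
def skeleton_tree(parent, subset):
    """Child -> parent map of the tree induced by `subset` and its pairwise LCAs.

    parent: child -> parent map of the full tree, with the root mapped to None.
    """
    def ancestors(v):
        # v, parent(v), ..., root
        chain = []
        while v is not None:
            chain.append(v)
            v = parent[v]
        return chain

    def lca(u, v):
        seen = set(ancestors(u))
        for a in ancestors(v):
            if a in seen:
                return a

    nodes = list(subset)
    keep = set(nodes)
    # Adding all pairwise LCAs makes the kept set closed under LCA.
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            keep.add(lca(u, v))

    skeleton = {}
    for v in keep:
        # Walk up to the nearest kept proper ancestor, if any.
        up = parent[v]
        while up is not None and up not in keep:
            up = parent[up]
        skeleton[v] = up
    return skeleton
```

Every pairwise LCA here walks two root paths, so the cost grows with the square of the subset size times the tree depth.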
Assigning sequence reads IX
Kingdom Archaea
Phylum Crenarchaeota
Class Thermoprotei
Order
Family
Genus
Species
References I
D. Alonso-Alemany, J. C. Clemente, J. Jansson, and G. Valiente.
Taxonomic assignment in metagenomics with TANGO.
EMBnet.journal, 17(2):46–50, 2011.
S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J.
Lipman. Basic Local Alignment Search Tool. J. Mol. Biol., 215
(3):403–410, 1990.
K. E. Ashelford, N. A. Chuzhanova, J. C. Fry, A. J. Jones, and
A. J. Weightman. At least 1 in 20 16S rRNA sequence records
currently held in public repositories is estimated to contain
substantial anomalies. Appl. Environ. Microbiol., 71(12):
7724–7736, 2005.
J. G. Caporaso, J. Kuczynski, J. Stombaugh, et al. QIIME allows
analysis of high-throughput community sequencing data. Nat.
Methods, 7(5):335–336, 2010.
References II
J. C. Clemente, J. Jansson, and G. Valiente. Accurate taxonomic
assignment of short pyrosequencing reads. In Proc. 15th Pacific
Symp. Biocomputing, volume 15, pages 3–9. World Scientific,
2010.
J. C. Clemente, J. Jansson, and G. Valiente. Flexible taxonomic
assignment of ambiguous sequencing reads. BMC
Bioinformatics, 12:8, 2011.
J. R. Cole, Q. Wang, E. Cardenas, J. Fish, B. Chai, R. J. Farris,
A. S. Kulam-Syed-Mohideen, D. M. McGarrell, T. Marsh, G. M.
Garrity, and J. M. Tiedje. The Ribosomal Database Project:
Improved alignments and new tools for rRNA analysis. Nucleic
Acids Res., 37(D):141–145, 2009.
J. Fischer and D. H. Huson. New common ancestor problems in
trees and directed acyclic graphs. Inform. Process. Lett., 110
(8–9):331–335, 2010.
References III
D. H. Huson, A. F. Auch, J. Qi, and S. C. Schuster. MEGAN
analysis of metagenomic data. Genome Res., 17(3):377–386,
2007.
National Research Council. The New Science of Metagenomics:
Revealing the Secrets of Our Microbial Planet. The National
Academies Press, Washington, DC, 2007.
P. Ribeca and G. Valiente. Computational challenges of sequence
classification in microbiomic data. Brief. Bioinform., 12(6):
614–625, 2011.
D. C. Richter, F. Ott, A. F. Auch, R. Schmid, and D. H. Huson.
MetaSim: A sequencing simulator for genomics and
metagenomics. PLoS ONE, 3(10):e3373, 2008.
P. D. Schloss. A high-throughput DNA sequence aligner for
microbial ecology studies. PLoS ONE, 4(12):e8230, 2009.
References IV
P. D. Schloss, S. L. Westcott, T. Ryabin, et al. Introducing
mothur: Open-source, platform-independent,
community-supported software for describing and comparing
microbial communities. Appl. Environ. Microbiol., 75(23):
7537–7541, 2009.
Y. Van de Peer, P. De Rijk, J. Wuyts, T. Winkelmans, and R. De
Wachter. The European small subunit ribosomal RNA database.
Nucleic Acids Res., 28(1):175–176, 2000.
