
Mining for Viral Genomes within Sequences of Complex Communities

Andrea Garretto & Catherine Putonti, Loyola University Chicago

Abstract

Metagenomics has facilitated the sequencing of viral communities inhabiting niches from across the globe. Genomes of novel viral species, particularly those in high abundance, have been successfully excavated directly from the metagenomes of complex communities. Discovery of such viral genomes often relies heavily on manual curation. Prior studies have employed a variety of different criteria when sifting through sequence data.

To provide an automated and comprehensive means of viral genome discovery, we developed the tool virMine. Key features of this tool are its ease of use and flexibility, allowing researchers to select from a variety of filters for their search. In addition to benchmark tests with synthetic communities of sequences, virMine was used to examine viral metagenomic data sets from several studies of: (1) freshwater viromes, (2) urinary viromes, and (3) gut viromes, identifying several novel viral genomes. The virMine tool provides a robust and expedient means for viral genome sequence discovery from complex community sequence data.

Methods

Viral genomes are identified & annotated by virMine via the following steps:

STEP 1: Assembly
Sickle [1] for raw read quality control
SPAdes [2], metaSPAdes [3], or MEGAHIT [4] for sequence assembly

STEP 2: Filter contigs
Options include size, coverage, presence of genes or sequences of interest (e.g., CRISPR spacers)

STEP 3: Open Reading Frame (ORF) Prediction
GLIMMER [5] scripts have been modified to better predict viral coding regions

STEP 4: Classify contigs by ORFs
BLASTx or BLASTp against viral & non-viral data sets
BLAST scores used to categorize contigs as viral, bacterial or unknown

OUTPUT: Contigs predicted to be viral or unknown (putative novel taxa) are written to file, as are their ORF predictions and BLAST results.

Fig. 1: Overview of virMine pipeline.
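The classification step (STEP 4) can be illustrated with a minimal Python sketch. The poster states only that BLAST scores are used to categorize contigs as viral, bacterial or unknown; the exact decision rule is not given. The rule below (sum the bit scores of each contig's ORF hits per database, then take the larger total) is therefore an assumption for illustration, as are the function name `classify_contigs`, the `min_bitscore` cutoff, and the input layout.

```python
def classify_contigs(contig_ids, orf_hits, min_bitscore=50.0):
    """Label each contig as 'viral', 'bacterial' or 'unknown'.

    contig_ids: iterable of contig names (so contigs without hits are kept).
    orf_hits:   iterable of (contig_id, database, bitscore) tuples, where
                database is 'viral' or 'nonviral' (hypothetical layout).
    Assumed rule: compare summed bit scores per database; ties and
    contigs without any qualifying hit are called 'unknown'.
    """
    totals = {cid: {"viral": 0.0, "nonviral": 0.0} for cid in contig_ids}
    for contig_id, database, bitscore in orf_hits:
        if contig_id not in totals or bitscore < min_bitscore:
            continue  # skip unlisted contigs and weak hits
        totals[contig_id][database] += bitscore

    labels = {}
    for contig_id, score in totals.items():
        if score["viral"] > score["nonviral"]:
            labels[contig_id] = "viral"
        elif score["nonviral"] > score["viral"]:
            labels[contig_id] = "bacterial"
        else:
            labels[contig_id] = "unknown"
    return labels


example_hits = [
    ("contig_1", "viral", 200.0),
    ("contig_1", "nonviral", 80.0),
    ("contig_2", "nonviral", 150.0),
    ("contig_3", "viral", 10.0),  # below the assumed cutoff
]
example_labels = classify_contigs(["contig_1", "contig_2", "contig_3"], example_hits)
```

Keeping contigs without qualifying hits in the 'unknown' class mirrors the pipeline's output, where unknown contigs are retained as putative novel taxa rather than discarded.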

Benchmarking
Synthetic data sets were created using MetaSim [6] from a single non-viral sequence (Pseudomonas aeruginosa UW4 (NC_019670.1)) and a single viral sequence (Pseudomonas phage PB1 (NC_011810.1)) mixed at various concentrations. While these synthetic data sets were made both with and without mutation, in many cases the mutated sequence sets could not be assembled. Contigs were classified using the set of all annotated RefSeq phage genes and bacterial COGs [7] (less those annotated as belonging to the mobilome; code X).
Only one filter was used in our search: contigs were required to meet a coverage threshold of 3. Fig. 2 shows the results of this analysis. When 50% or more of the reads were from the PB1 genome, the full PB1 genome was retrieved. As the N50 values for each of the assemblies (written above the corresponding bars) show, the virMine-assembled viral genome is longer than PB1's annotated genome (65,764 bp); this is a result of the terminal repeats in the PB1 sequence. The contigs classified as unknown were further investigated and found to correspond to phage-associated coding regions within the P. aeruginosa genome.
Fig. 2: Benchmark statistics.
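The two quantities used in this benchmark, the coverage filter and the N50 statistic, can be sketched in Python as follows. This is an illustration rather than virMine's own code. It assumes SPAdes-style contig names of the form `NODE_1_length_5000_cov_12.5`, and it assumes the threshold is applied as "coverage greater than or equal to 3", which the poster does not spell out. The function names are hypothetical.

```python
import re

# SPAdes-style contig name, e.g. "NODE_1_length_5000_cov_12.5"
HEADER_PATTERN = re.compile(r"length_(\d+)_cov_([0-9.]+)")


def filter_by_coverage(headers, min_coverage=3.0):
    """Keep contig names whose reported coverage is >= min_coverage.

    The >= comparison is an assumption; names that do not match the
    SPAdes-style pattern are dropped.
    """
    kept = []
    for header in headers:
        match = HEADER_PATTERN.search(header)
        if match and float(match.group(2)) >= min_coverage:
            kept.append(header)
    return kept


def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list).

    N50 is the length of the shortest contig in the smallest set of
    longest contigs that together cover at least half of the assembly.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


kept_contigs = filter_by_coverage(
    ["NODE_1_length_5000_cov_12.5", "NODE_2_length_800_cov_1.9"]
)
example_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

Returning 0 for an empty contig set matches the convention visible in the results tables, where a class with no contigs is reported with an N50 of 0.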

Viral Metagenomic Analysis


virMine was tested on three types of data sets: freshwater, urinary, and gut viromes. While several studies have been conducted on the latter, the viral populations within freshwaters and the urinary microbiota are not as well studied. Furthermore, viruses, in particular phages, within freshwaters and the urinary microbiota are not well represented in the sequence databases. The same databases used during the benchmark test were used here to classify viral, unknown, and bacterial contigs. The SPAdes [2] assembler was selected for these analyses, given its superior performance in phage sequence assembly [8]. In the data sets examined, we anticipate that phage species will be the most abundant viruses present [9-13]. Each predicted viral and unknown contig was compared against the complete nt database via BLASTn; hits with a query coverage >50% were considered confident hits and are recorded in the tables below.
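The query-coverage criterion described above can be sketched in Python. The sketch assumes that the BLASTn results were written in tabular form with the columns `qseqid sseqid pident qcovs evalue bitscore`; that column layout, the function name `confident_hits`, and the choice to keep only the highest-scoring hit per contig are illustrative assumptions, not details taken from virMine.

```python
def confident_hits(tabular_lines, min_query_coverage=50.0):
    """Return {contig: best subject} for hits with query coverage > threshold.

    Each line is expected to hold six tab-separated fields:
    qseqid, sseqid, pident, qcovs, evalue, bitscore (assumed layout).
    Among the qualifying hits of a contig, the one with the highest
    bit score is kept.
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        qseqid, sseqid, _pident, qcovs, _evalue, bitscore = line.split("\t")
        if float(qcovs) <= min_query_coverage:
            continue  # the poster requires coverage strictly above 50%
        score = float(bitscore)
        if qseqid not in best or score > best[qseqid][1]:
            best[qseqid] = (sseqid, score)
    return {contig: subject for contig, (subject, _score) in best.items()}


example_lines = [
    "contig_1\tNC_011810.1\t99.1\t97\t0.0\t1200",
    "contig_1\tsubject_B\t88.0\t60\t1e-50\t300",
    "contig_2\tNC_019670.1\t90.0\t50\t1e-20\t150",  # exactly 50%: not confident
    "contig_3\tsubject_C\t80.0\t20\t1e-5\t60",
]
example_confident = confident_hits(example_lines)
```

Counting the keys of the returned dictionary gives a per-class "# blast hits" figure of the kind reported in the tables.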

Freshwater Viromes

                                OVERALL            VIRAL                            UNKNOWN                          BACTERIA
Data Set            SRA ID      # contigs  N50     # contigs  N50     # blast hits  # contigs  N50     # blast hits  # contigs
Lake Michigan [9]   SRR1302020  2555       12326   122        10904   19            298        892     124           2135
Lake Michigan [9]   SRR1302010  934        49497   41         26538   10            246        564     64            647
Lake Michigan [9]   SRR1301999  2046       4355    120        34260   11            762        642     47            1164
Lake Michigan [9]   SRR1296481  1518       49442   60         5648    27            73         529     22            1385
Lough Neagh [10]    SRR2147000  3869       8406    1641       8868    4             484        629     0             1744

The disparity between the number of contigs and the blast hits points to the identification of novel genetic information, specifically viral, as a number of the contigs contained viral-like genes.

Urinary Viromes

            OVERALL            VIRAL                            UNKNOWN                          BACTERIA
Sample ID   # contigs  N50     # contigs  N50     # blast hits  # contigs  N50     # blast hits  # contigs
mgm4568646  325        6082    22         909     0             100        264     0             203
mgm4568645  447        6498    16         8215    0             128        265     4             303
mgm4568644  502        6453    15         1137    0             117        283     3             370
mgm4568643  342        7977    16         1711    0             31         404     0             295
mgm4568642  272        9301    16         3979    0             75         339     0             181
mgm4568641  1114       7155    67         7825    2             321        256     0             726
mgm4568640  1284       7147    67         4718    0             525        258     13            692
mgm4568639  251        6287    15         4098    1             19         755     0             217
mgm4568638  437        6760    12         600     1             223        310     79            202
mgm4568637  226        7206    7          11424   0             10         600     0             209

All of these data sets are from the study of Santiago-Rodriguez et al. [11]. virMine results demonstrate how little is currently known about the urinary microbiota. Contigs identified here often have little to no sequence similarity to known sequences (the nt database).

Gut Viromes

                OVERALL            VIRAL                            UNKNOWN                          BACTERIA
Sample ID       # contigs  N50     # contigs  N50     # blast hits  # contigs  N50     # blast hits  # contigs
SRR073432       215        29905   147        12527   98            29         1044    22            39
SRR073433       5          18658   4          18658   3             0          0       0             1
SRR073436       1499       12414   576        15850   427           669        314     224           254
All 3 combined  2937       7412    927        10730   538           1361       285     427           649

To further validate our approach, the above data sets [12] were tested as they were one of the first used for de novo assembly of a novel phage genome, the crAssphage [13].

                     OVERALL            VIRAL                            UNKNOWN                          BACTERIA
Sample ID            # contigs  N50     # contigs  N50     # blast hits  # contigs  N50     # blast hits  # contigs
MH0002_081203_clear  1138       16703   68         11135   30            321        464     263           749
MH0003_081224        4542       6246    266        7668    124           1277       476     833           2999
MH0006_081109_clean  5471       7243    197        1715    33            3857       263     1533          1417
MH0006_lane2         4666       10627   140        3966    53            1945       567     1024          2581
MH0006_lane6         6482       8432    182        4632    53            2525       580     1296          3775
MH0006_lane7         6458       8126    243        4140    61            2286       551     1179          3929
MH0006_lane8         5650       9649    195        3991    59            1917       558     1046          3538
MH0009_090109        10116      6187    318        2680    83            5667       292     2843          4131
MH0012_081203        1453       13167   93         8156    17            712        510     374           648
MH0012_081224        6588       6129    189        9428    51            2576       564     1288          3823
MH0012_lane2         9741       9391    289        8113    54            3800       548     1801          5652
MH0013_081113_clean  10659      530     230        13640   34            9365       258     3387          1064
MH0013_081120        12716      439     467        6891    96            9772       250     3591          2477

This table represents a subset of the data generated by Qin et al. [14] that was examined by virMine. A disparity can be seen between the number of contigs and blast hits. Further investigation of the blast hits revealed homologies to characterized prophages.

Conclusions & Future Directions

Our analyses of complex communities have uncovered a number of novel viral genomes; we are in the process of inspecting and curating these genomes. As our benchmark test showed, virMine is capable of identifying both prophage and viral sequences. Given its automation, virMine can be used to identify viruses in any niche and thus further our understanding of this vast reservoir of genetic diversity.

Acknowledgments

This work is supported by the Carbon Fellowship at Loyola University (AG), the Computing Research Association (AG), and the National Science Foundation (CP; awards 1149387 & 1661357).

References:
[1] Joshi NA, Fass JN. (2011) Sickle (Version 1.33) [Software].
[2] Bankevich et al. (2012) J Comput Biol. 19: 455-77.
[3] Nurk et al. (2017) Genome Res. 27: 824-34.
[4] Li et al. (2015) Bioinformatics. 31: 1674-6.
[5] Delcher et al. (1999) Nucleic Acids Res. 27: 4636-41.
[6] Richter et al. (2008) PLoS One. 3: e3373.
[7] Galperin et al. (2015) Nucleic Acids Res. 43: D261-9.
[8] Rihtman et al. (2016) PeerJ. 4: e2055.
[9] Sible et al. (2015) Data Brief. 5: 9-12.
[10] Skvortsov et al. (2016) PLoS One. 11: e0150361.
[11] Santiago-Rodriguez et al. (2015) Front Microbiol. 6: 14.
[12] Reyes et al. (2010) Nature. 466: 334-8.
[13] Dutilh et al. (2014) Nat Commun. 5: 4498.
[14] Qin et al. (2010) Nature. 464: 59-65.
