Science & Society

Microbiome Data Science: Understanding Our Microbial Planet

Nikos C. Kyrpides,1,* Emiley A. Eloe-Fadrosh,1 and Natalia N. Ivanova1

Microbiology is experiencing a revolution brought on by recent developments in sequencing technology. The unprecedented volume of microbiome data being generated poses significant challenges that are currently hindering progress in the field. Here, we outline the major bottlenecks and propose a vision to advance microbiome research as a data-driven science.

Bottlenecks in Microbiome Research

The vast increase in sequencing output during the last decade [1] has not been matched with analogous scaling and democratization in computational resources, either in the form of available computational capacity for data processing or data integration. Although unprocessed microbiome data are deposited in INSDC (International Nucleotide Sequence Database Collaboration) centers, there are currently no funded efforts to process and integrate all the microbiome data. This has resulted in the majority of microbiome sequences being single-use, that is, they have limited, if any, data reuse beyond the original scope of the study. This phenomenon has led to a certain compartmentalization of microbiome studies whereby the data are stored in an ad hoc manner, and are often inaccessible to other scientists who want to reproduce the results of the study or mine the data for other applications. It also prevents systematic review and meta-analysis of the data using newly developed strategies and tools that have the potential to dramatically increase the power of individual studies and generate valuable insights. This approach has greatly hindered discovery by wasting significant resources and limiting interpretation of results by underutilizing the available sequence information. Further, this has resulted in a high degree of duplicated efforts since every analysis group has to partially redevelop data storage, integration, and analysis platforms.

In the light of the problems listed above, we have identified the following major bottlenecks (all of which are related to funding) currently impeding progress in microbiome research.

(a) Lack of a grand vision in supporting Microbiome Data Science. There is currently a sharp contrast between what is needed and what is available and/or financially supported in the field. The need for a national and unified international microbiome effort was proposed several years ago [2] and renewed interest is beginning to gain broader community support [3,4]. However, it is still unclear whether these calls will gain traction to drive the effort from a conceptual idea to realization. Even with today's small-scale data (relative to what is expected 10 years from now), there is a profound lack of a grand vision in appropriate funding to support the extraction of knowledge from big data (i.e., across studies). Most microbiome projects currently have their data analyzed in the context of their own study and largely do not incorporate datasets from other publicly available studies. Current research efforts work well for small- to medium-scale projects, but fail to support and promote larger endeavors at global multifaceted analysis that may require processing and integration of all relevant publicly available data.

Furthermore, the reference data needed to contextualize the myriad microbiome samples is sorely lacking. A prime example is the fact that less than 20% of the bacterial and archaeal type strains have been sequenced, despite evidence for the scientific value of generating these basic data [5,6]. While recent trends emphasize hypothesis-driven science and a shift away from exploratory sequencing, we argue that part of a grand vision for microbiome data science necessitates the continued generation of reference data. These data are fundamental for interpretation of how microbiomes function in a community context, and how they interact within the environments and hosts they inhabit. Systematic decoding of microbes and their environments to fill in the gaps in our databases is a key step towards hypothesis-driven science and enabling a better understanding of microbial life.

(b) Inefficient funding mechanism. The increasingly interdisciplinary approach to biology has enabled us to reach the point where scientific progress can be hindered by the insulation of individual funding agencies. This is especially evident in the segregation of funding from individual agencies supporting big data integration and analysis, for example, the EarthCube (http://earthcube.org/) initiative by the National Science Foundation (NSF), the Human Microbiome initiative (https://commonfund.nih.gov/hmp/index) by the National Institutes of Health (NIH), and several other initiatives by the Department of Energy (DOE). Rather than joining forces to create interagency funding models to face the grand challenges of big data ahead (following up on existing recommendations from the scientific community) [7], agencies each support separate smaller-scale efforts. Furthermore, support for big data integration and analysis requires long-term commitment, which is required for microbiome research but has been nearly impossible to obtain due to the limited funding period for databases responsible for large data integration and analysis.

Trends in Microbiology, June 2016, Vol. 24, No. 6 425


(c) Insufficient data standards and interoperability. Although international consortia for the establishment and propagation of standards have already formed [8], they are either limited in scope [9] or their adoption lacks the appropriate mandate from funding agencies and publishers alike. As a result, the lack of standards for all the steps from preparing the samples to the end point of processing and comparing data is currently impeding the community's ability to perform efficient comparative analysis.

Vision to Advance Microbiome Research: Enabling Data Science

Key to moving forward in the face of these bottlenecks is a vision for transforming the deluge of data from a problem to a solution, by enabling the research community to utilize and explore the data produced worldwide. To achieve this, it is imperative to develop a long-term strategy that will support the anticipated data growth, and that will ensure that the data revolution will not become disruptive for the field through the balkanization (i.e., fragmentation) of microbiome data generation and analysis, as is currently the case. The development of this strategy requires a major cultural and conceptual transformation whereby the generation of vast amounts of biological data is no longer considered the goal or the end result of funded studies, but rather, the most important tool needed in order to efficiently address fundamental biological questions critical to human health, biotechnology, energy, food, and environmental sustainability. Analogous to the telescope for astronomy and the particle accelerator for high-energy physics, biological sequence data should be considered an instrumental tool for the study of biological systems. Tools like the Hubble telescope or CERN's particle accelerator required several years for construction, multibillion dollar funding efforts, and very large and distributed research networks. Funding or development of data-science-related tools of that scale are currently not available for microbiome research, even though it is well appreciated that we live on a microbial planet [10] and that attempts to understand biological phenomena based on incomplete data only lead to erroneous conclusions [11].

Establishing a Distributed National Microbiome Data Center

Although biology is rapidly moving towards a holistic view of life, we are witnessing an increase in funding awards that are regressing biology towards individual, partially redundant, and largely disconnected efforts, instead of preparing it to fulfill its destiny as a quantifiable science akin to physics. To this end, there is great demand for creating a distributed national microbiome data center that would organize, process, and serve all available environmental genomic data. Significant improvements in computational methods for data processing and high performance in distributed data-management systems, coupled with the ability to utilize high-performance computing (HPC), are now rendering such an endeavor possible. The key objective of this center would be to develop and maintain a state-of-the-art data-management system integrating all available environmental genomic data. This would enable efficient handling and processing (i.e., assembly and annotation) of all publicly available primary microbiome data and metadata generated around the globe for downstream interpretation and discovery. In this respect this facility should also serve as an international microbiome data center. The envisioned data center would support both grand vision projects as well as smaller studies by providing the ability to conduct effective comparative analysis in an integrated context. Efficient data handling and interpretation rests on three major pillars, all of which are profoundly interconnected and interdependent:

(a) Comparative analysis. This represents the hallmark of data interpretation. It is well known that the single most important tool for interpreting genomic and metagenomic sequences is their analysis on a comparative level. The larger the size of the dataset used in the comparison, discovery becomes more likely and more accurate. In principle, we do not know upfront which data are the most relevant to compare, and because life has no divisions or boundaries along the lines of the funding agencies or application areas, comparative analysis should not be restricted to individual organisms, individual environments, or funding focus scope.

(b) Data integration. The success of the comparative analysis is directly dependent on the efficiency of data integration, which, in turn, depends on the breadth of the integration, the quality of the integrated data, and the underlying structure of the integration. Each of these parameters is critical for building successful data-integration platforms. The breadth of the integration here refers not only to the number and diversity of the integrated datasets, but also to the various types of omics data as they become available. As the microbiome field moves towards more holistic approaches, and the emerging technologies enable exploration of whole systems (e.g., human or plant microbiome), it becomes essential to integrate a wide array of data types across all domains of life. The quality of the integration directly depends on the quality of the data, as reflected by the level of data contamination, coherence of annotations, availability of metadata, and the overall level of detail in identifying accuracy and completeness of the integrated data. Finally, the underlying structure should not only enable integration of a wide range of interdisciplinary data, it should also support vigorous data visualization and sustain an unprecedented growth in data.

(c) Data standards. Standardizing the description and processing methods of microbiome data is critical for comparisons across different samples and studies that have adopted incompatible recommendations from different
international bodies promoting standards in microbiome research.

A number of large data-management systems are currently available for supporting the comparative analysis of assembled [12] or unassembled [13] microbiome data and their associated metadata [14], as well as systems designed for predictive modeling (https://kbase.us/) and cyberinfrastructures [15]. Similar successful systems with existing and dedicated long-term funding should be an integral part of such a distributed national microbiome data center.

Concluding Remarks

Future endeavors in microbiome research are expected to lead us to a new age of holistic understanding of microbial life, develop novel therapeutic strategies to treat infectious diseases, identify solutions for protecting the environment, and ultimately understand and harness the power of the most abundant natural resources on our planet. To achieve these endeavors and enable the vision described above, the research community requires a major restructuring in the current research-funding policies through the development of innovative funding mechanisms that will provide long-term support for microbiome data science. Examples of such mechanisms can be drawn from existing models such as the Brain Initiative (https://www.whitehouse.gov/share/brain-initiative), a grand challenge research effort to revolutionize our understanding of the human brain. At the dawn of the third decade of microbial genomics, and well into the information age, the time is ripe to embark on the greatest endeavor to understand Earth's microbiome. Microbiome data science, through the establishment of a national microbiome data center, can pave the way.

Acknowledgments

We thank Victor Markowitz, Torben Nielsen, and Heather Maughan for critical reading and suggestions on the manuscript. This work was conducted by the U.S. Department of Energy Joint Genome Institute, a DOE Office of Science User Facility, under Contract No. DE-AC02-05CH11231.

1 Prokaryotic Super Program, Department of Energy Joint Genome Institute, Walnut Creek, CA, USA
*Correspondence: nckyrpides@lbl.gov (N.C. Kyrpides).
http://dx.doi.org/10.1016/j.tim.2016.02.011

References

1. Koboldt, D.C. et al. (2013) The next-generation sequencing revolution and its impact on genomics. Cell 155, 27–38
2. Kyrpides, N.C. (2009) Fifteen years of microbial genomics: meeting the challenges and fulfilling the dream. Nat. Biotechnol. 27, 627–632
3. Alivisatos, A.P. et al. (2015) A unified initiative to harness Earth's microbiomes. Science 350, 507–508
4. Dubilier, N. et al. (2015) Microbiology: Create a global microbiome effort. Nature 526, 631–634
5. Wu, D. et al. (2009) A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature 462, 1056–1060
6. Kyrpides, N.C. et al. (2014) Genomic encyclopedia of bacteria and archaea: sequencing a myriad of type strains. PLoS Biol. 12, e1001920
7. Gilbert, J.A. et al. (2014) Meeting report: Ocean omics science, technology and cyberinfrastructure: current challenges and future requirements (August 20–23, 2013). Stand. Genomic Sci. 9, 1252–1258
8. Field, D. et al. (2011) The Genomic Standards Consortium. PLoS Biol. 9, e1001088
9. Field, D. et al. (2011) Genomic standards consortium projects. Stand. Genomic Sci. 9, 514–526
10. National Research Council (US) Committee on Metagenomics: Challenges and Functional Applications (2007) The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet, National Academies Press (US)
11. Ioannidis, J.P.A. (2005) Why most published research findings are false. PLoS Med. 2, 696–701
12. Markowitz, V.M. et al. (2014) IMG/M 4 version of the integrated metagenome comparative analysis system. Nucleic Acids Res. 42, D568–D573
13. Wilke, A. et al. (2013) A metagenomics portal for a democratized sequencing world. Meth. Enzymol. 531, 487–523
14. Reddy, T.B. et al. (2015) The Genomes OnLine Database (GOLD) v.5: a metadata management system based on a four level (meta)genome project classification. Nucleic Acids Res. 43, D1099–D1106
15. Goff, S.A. et al. (2011) The iPlant Collaborative: cyberinfrastructure for plant biology. Front. Plant Sci. 2, 34

Spotlight

Engineering Coronaviruses to Evaluate Emergence and Pathogenic Potential

Susanna K.P. Lau1,2,3,4,5,* and Patrick C.Y. Woo1,2,3,4,5,*

A recent study provides a platform for generating infectious coronavirus genomes using sequence data, examining their capabilities of replicating in human cells and causing diseases in animal models, and evaluating therapeutics and vaccines. Similar approaches could be used to assess the potential of human emergence and pathogenicity for other viruses.

The severe acute respiratory syndrome (SARS) epidemic in 2003 and the Middle East respiratory syndrome (MERS) epidemic in the last 3 years have shown that coronaviruses (CoVs) have the capability to cause major epidemics. For the SARS epidemic, a total of >8000 laboratory-confirmed cases with >800 deaths were observed (http://www.cdc.gov/sars/about/fs-sars.html). This horrific epidemic was followed by the publication of >7500 scientific papers on CoVs visible in PubMed, which represents two-thirds of the total number of publications on CoVs in PubMed. Despite the numerous studies on CoVs, it is still difficult to predict which CoV may have the potential to emerge as the next culprit. A recent study in PNAS by Menachery et al. [1] and another similar study in Nature Medicine published in December 2015 by the same group [2] reported the use of existing sequence data with reverse genetics to engineer SARS-related CoVs and evaluate their potential of emergence and pathogenicity.

Shortly after the emergence of SARS-CoV, SARS-related CoVs were found in civets [3]. However, multiple lines of evidence showed that the civets are just the intermediate or amplification hosts for SARS-CoV. Through intensive surveillance studies in various mammals in Hong Kong, Lau et al. reported the presence of SARS-related CoVs in Chinese horseshoe bats in Hong Kong [4]. A similar observation was also reported by another group in mainland China [5]. Since then, numerous SARS-related CoV sequences were observed in different

