Science & Society
Microbiome Data Science: Understanding Our Microbial Planet

Nikos C. Kyrpides,1,* Emiley A. Eloe-Fadrosh,1 and Natalia N. Ivanova1

Microbiology is experiencing a revolution brought on by recent developments in sequencing technology. The unprecedented volume of microbiome data being generated poses significant challenges that are currently hindering progress in the field. Here, we outline the major bottlenecks and propose a vision to advance microbiome research as a data-driven science.

Bottlenecks in Microbiome Research
The vast increase in sequencing output during the last decade [1] has not been matched with analogous scaling and democratization in computational resources, either in the form of available computational capacity for data processing or data integration. Although unprocessed microbiome data are deposited in INSDC (International Nucleotide Sequence Database Collaboration) centers, there are currently no funded efforts to process and integrate all the microbiome data. This has resulted in the majority of microbiome sequences being single-use, that is, they have limited, if any, data reuse beyond the original scope of the study. This phenomenon has led to a certain compartmentalization of microbiome studies whereby the data are stored in an ad hoc manner, and are often inaccessible to other scientists who want to reproduce the results of the study or mine the data for other applications. It also prevents systematic review and meta-analysis of the data using newly developed strategies and tools that have the potential to dramatically increase the power of individual studies and generate valuable insights. This approach has greatly hindered discovery by wasting significant resources and limiting interpretation of results by underutilizing the available sequence information. Further, this has resulted in a high degree of duplicated efforts since every analysis group has to partially redevelop data storage, integration, and analysis platforms.

In light of the problems listed above, we have identified the following major bottlenecks (all of which are related to funding) currently impeding progress in microbiome research.

(a) Lack of a grand vision in supporting Microbiome Data Science. There is currently a sharp contrast between what is needed and what is available and/or financially supported in the field. The need for a national and unified international microbiome effort was proposed several years ago [2] and renewed interest is beginning to gain broader community support [3,4]. However, it is still unclear whether these calls will gain traction to drive the effort from a conceptual idea to realization. Even with today's small-scale data (relative to what is expected 10 years from now), there is a profound lack of a grand vision in appropriate funding to support the extraction of knowledge from big data (i.e., across studies). Most microbiome projects currently have their data analyzed in the context of their own study and largely do not incorporate datasets from other publicly available studies. Current research efforts work well for small- to medium-scale projects, but fail to support and promote larger endeavors at global multifaceted analysis that may require processing and integration of all relevant publicly available data.

Furthermore, the reference data needed to contextualize the myriad microbiome samples are sorely lacking. A prime example is the fact that less than 20% of the bacterial and archaeal type strains have been sequenced, despite evidence for the scientific value of generating these basic data [5,6]. While recent trends emphasize hypothesis-driven science and a shift away from exploratory sequencing, we argue that part of a grand vision for microbiome data science necessitates the continued generation of reference data. These data are fundamental for interpretation of how microbiomes function in a community context, and how they interact within the environments and hosts they inhabit. Systematic decoding of microbes and their environments to fill in the gaps in our databases is a key step towards hypothesis-driven science and enabling a better understanding of microbial life.

(b) Inefficient funding mechanism. The increasingly interdisciplinary approach to biology has enabled us to reach the point where scientific progress can be hindered by the insulation of individual funding agencies. This is especially evident in the segregation of funding from individual agencies supporting big data integration and analysis, for example, the EarthCube (http://earthcube.org/) initiative by the National Science Foundation (NSF), the Human Microbiome initiative (https://commonfund.nih.gov/hmp/index) by the National Institutes of Health (NIH), and several other initiatives by the Department of Energy (DOE). Rather than joining forces to create interagency funding models to face the grand challenges of big data ahead (following up on existing recommendations from the scientific community) [7], agencies each support separate smaller-scale efforts. Furthermore, support for big data integration and analysis requires a long-term commitment. Microbiome research depends on such a commitment, yet it has been nearly impossible to obtain because of the limited funding periods for the databases responsible for large data integration and analysis.
Trends in Microbiology, June 2016, Vol. 24, No. 6 425
(c) Insufficient data standards and interoperability. Although international consortia for the establishment and propagation of standards have already formed [8], they are either limited in scope [9] or their adoption lacks the appropriate mandate from funding agencies and publishers alike. As a result, the lack of standards for all the steps from preparing the samples to the end point of processing and comparing data is currently impeding the community's ability to perform efficient comparative analysis.

Vision to Advance Microbiome Research: Enabling Data Science
Key to moving forward in the face of these bottlenecks is a vision for transforming the deluge of data from a problem to a solution, by enabling the research community to utilize and explore the data produced worldwide. To achieve this, it is imperative to develop a long-term strategy that will support the anticipated data growth, and that will ensure that the data revolution will not become disruptive for the field through the balkanization (i.e., fragmentation) of microbiome data generation and analysis, as is currently the case. The development of this strategy requires a major cultural and conceptual transformation whereby the generation of vast amounts of biological data is no longer considered the goal or the end result of funded studies, but rather, the most important tool needed in order to efficiently address fundamental biological questions critical to human health, biotechnology, energy, food, and environmental sustainability. Analogous to the telescope for astronomy and the particle accelerator for high-energy physics, biological sequence data should be considered an instrumental tool for the study of biological systems. Tools like the Hubble telescope or CERN's particle accelerator required several years for construction, multibillion dollar funding efforts, and very large and distributed research networks. Funding or development of data-science-related tools of that scale is currently not available for microbiome research, even though it is well appreciated that we live on a microbial planet [10] and that attempts to understand biological phenomena based on incomplete data only lead to erroneous conclusions [11].

Establishing a Distributed National Microbiome Data Center
Although biology is rapidly moving towards a holistic view of life, we are witnessing an increase in funding awards that are regressing biology towards individual, partially redundant, and largely disconnected efforts, instead of preparing it to fulfill its destiny as a quantifiable science akin to physics. To this end, there is great demand for creating a distributed national microbiome data center that would organize, process, and serve all available environmental genomic data. Significant improvements in computational methods for data processing and high performance in distributed data-management systems, coupled with the ability to utilize high-performance computing (HPC), are now rendering such an endeavor possible. The key objective of this center would be to develop and maintain a state-of-the-art data-management system integrating all available environmental genomic data. This would enable efficient handling and processing (i.e., assembly and annotation) of all publicly available primary microbiome data and metadata generated around the globe for downstream interpretation and discovery. In this respect this facility should also serve as an international microbiome data center. The envisioned data center would support both grand vision projects as well as smaller studies by providing the ability to conduct effective comparative analysis in an integrated context. Efficient data handling and interpretation rests on three major pillars, all of which are profoundly interconnected and interdependent:

(a) Comparative analysis. This represents the hallmark of data interpretation. It is well known that the single most important tool for interpreting genomic and metagenomic sequences is their analysis on a comparative level. The larger the dataset used in the comparison, the more likely and more accurate discovery becomes. In principle, we do not know upfront which data are the most relevant to compare, and because life has no divisions or boundaries along the lines of the funding agencies or application areas, comparative analysis should not be restricted to individual organisms, individual environments, or funding focus scope.

(b) Data integration. The success of the comparative analysis is directly dependent on the efficiency of data integration, which, in turn, depends on the breadth of the integration, the quality of the integrated data, and the underlying structure of the integration. Each of these parameters is critical for building successful data-integration platforms. The breadth of the integration here refers not only to the number and diversity of the integrated datasets, but also to the various types of omics data as they become available. As the microbiome field moves towards more holistic approaches, and the emerging technologies enable exploration of whole systems (e.g., human or plant microbiome), it becomes essential to integrate a wide array of data types across all domains of life. The quality of the integration directly depends on the quality of the data, as reflected by the level of data contamination, coherence of annotations, availability of metadata, and the overall level of detail in identifying accuracy and completeness of the integrated data. Finally, the underlying structure should not only enable integration of a wide range of interdisciplinary data, but should also support vigorous data visualization and sustain an unprecedented growth in data.

(c) Data standards. Standardizing the description and processing methods of microbiome data is critical for comparisons across different samples and studies that have adopted incompatible recommendations from different
international bodies promoting standards in microbiome research.

A number of large data-management systems are currently available for supporting the comparative analysis of assembled [12] or unassembled [13] microbiome data and their associated metadata [14], as well as systems designed for predictive modeling (https://kbase.us/) and cyberinfrastructures [15]. Similar successful systems with existing and dedicated long-term funding should be an integral part of such a distributed national microbiome data center.

Concluding Remarks
Future endeavors in microbiome research are expected to lead us to a new age of holistic understanding of microbial life, develop novel therapeutic strategies to treat infectious diseases, identify solutions for protecting the environment, and ultimately understand and harness the power of the most abundant natural resources on our planet. To achieve these endeavors and enable the vision described above, the research community requires a major restructuring in the current research-funding policies through the development of innovative funding mechanisms that will provide long-term support for microbiome data science. Examples of such mechanisms can be drawn from existing models such as the Brain Initiative (https://www.whitehouse.gov/share/brain-initiative), a grand challenge research effort to revolutionize our understanding of the human brain. At the dawn of the third decade of microbial genomics, and well into the information age, the time is ripe to embark on the greatest endeavor to understand Earth's microbiome. Microbiome data science, through the establishment of a national microbiome data center, can pave the way.

Acknowledgments
We thank Victor Markowitz, Torben Nielsen, and Heather Maughan for critical reading and suggestions on the manuscript. This work was conducted by the U.S. Department of Energy Joint Genome Institute, a DOE Office of Science User Facility, under Contract No. DE-AC02-05CH11231.

1 Prokaryotic Super Program, Department of Energy Joint Genome Institute, Walnut Creek, CA, USA
*Correspondence: nckyrpides@lbl.gov (N.C. Kyrpides).
http://dx.doi.org/10.1016/j.tim.2016.02.011

References
1. Koboldt, D.C. et al. (2013) The next-generation sequencing revolution and its impact on genomics. Cell 155, 27–38
2. Kyrpides, N.C. (2009) Fifteen years of microbial genomics: meeting the challenges and fulfilling the dream. Nat. Biotechnol. 27, 627–632
3. Alivisatos, A.P. et al. (2015) A unified initiative to harness Earth's microbiomes. Science 350, 507–508
4. Dubilier, N. et al. (2015) Microbiology: Create a global microbiome effort. Nature 526, 631–634
5. Wu, D. et al. (2009) A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature 462, 1056–1060
6. Kyrpides, N.C. et al. (2014) Genomic encyclopedia of bacteria and archaea: sequencing a myriad of type strains. PLoS Biol. 12, e1001920
7. Gilbert, J.A. et al. (2014) Meeting report: Ocean omics science, technology and cyberinfrastructure: current challenges and future requirements (August 20–23, 2013). Stand. Genomic Sci. 9, 1252–1258
8. Field, D. et al. (2011) The Genomic Standards Consortium. PLoS Biol. 9, e1001088
9. Field, D. et al. (2011) Genomic standards consortium projects. Stand. Genomic Sci. 9, 514–526
10. National Research Council (US) Committee on Metagenomics: Challenges and Functional Applications (2007) The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet, National Academies Press (US)
11. Ioannidis, J.P.A. (2005) Why most published research findings are false. PLoS Med. 2, 696–701
12. Markowitz, V.M. et al. (2014) IMG/M 4 version of the integrated metagenome comparative analysis system. Nucleic Acids Res. 42, D568–D573
13. Wilke, A. et al. (2013) A metagenomics portal for a democratized sequencing world. Meth. Enzymol. 531, 487–523
14. Reddy, T.B. et al. (2015) The Genomes OnLine Database (GOLD) v.5: a metadata management system based on a four level (meta)genome project classification. Nucleic Acids Res. 43, D1099–D1106
15. Goff, S.A. et al. (2011) The iPlant Collaborative: cyberinfrastructure for plant biology. Front. Plant Sci. 2, 34

Spotlight
Engineering Coronaviruses to Evaluate Emergence and Pathogenic Potential

Susanna K.P. Lau1,2,3,4,5,* and Patrick C.Y. Woo1,2,3,4,5,*

A recent study provides a platform for generating infectious coronavirus genomes using sequence data, examining their capabilities of replicating in human cells and causing diseases in animal models, and evaluating therapeutics and vaccines. Similar approaches could be used to assess the potential of human emergence and pathogenicity for other viruses.

The severe acute respiratory syndrome (SARS) epidemic in 2003 and the Middle East respiratory syndrome (MERS) epidemic in the last 3 years have shown that coronaviruses (CoVs) have the capability to cause major epidemics. For the SARS epidemic, a total of >8000 laboratory-confirmed cases with >800 deaths were observed (http://www.cdc.gov/sars/about/fs-sars.html). This horrific epidemic was followed by the publication of >7500 scientific papers on CoVs visible in PubMed, which represents two-thirds of the total number of publications on CoVs in PubMed. Despite the numerous studies on CoVs, it is still difficult to predict which CoV may have the potential to emerge as the next culprit. A recent study in PNAS by Menachery et al. [1] and another similar study in Nature Medicine published in December 2015 by the same group [2] reported the use of existing sequence data with reverse genetics to engineer SARS-related CoVs and evaluate their potential of emergence and pathogenicity.

Shortly after the emergence of SARS-CoV, SARS-related CoVs were found in civets [3]. However, multiple lines of evidence showed that the civets are just the intermediate or amplification hosts for SARS-CoV. Through intensive surveillance studies in various mammals in Hong Kong, Lau et al. reported the presence of SARS-related CoVs in Chinese horseshoe bats in Hong Kong [4]. A similar observation was also reported by another group in mainland China [5]. Since then, numerous SARS-related CoV sequences were observed in different