Abstract

The direct “metagenomic” sequencing of genomic material from complex assemblages of bacteria, archaea, viruses and microeukaryotes has yielded new insights into the structure of microbial communities. For example, analysis of metagenomic data has revealed the existence of previously unknown microbial taxa whose spatial distributions are limited by environmental conditions, ecological competition, and dispersal mechanisms. However, differences in genotypes that might lead biologists to designate two microbes as taxonomically distinct need not imply differences in ecological function. Hence, there is a growing need for large-scale analysis of the distribution of microbial function across habitats. Here, we present a framework for investigating the biogeography of microbial function by analyzing the distribution of protein families inferred from environmental sequence data across a global collection of sites. We map over 6,000,000 protein sequences from unassembled reads from the Global Ocean Sampling (GOS) dataset to protein families, generating a protein family relative abundance matrix that describes the distribution of each protein family across sites. We then use non-negative matrix factorization (NMF) to approximate these protein family profiles as linear combinations of a small number of ecological components. Each component has a characteristic functional profile and site profile. Our approach identifies common functional signatures within several of the components. We use our method as a filter to estimate functional distance between sites, and find that an NMF-filtered measure of functional distance is more strongly correlated with environmental distance than a comparable PCA-filtered measure. We also find that functional distance is more strongly correlated with environmental distance than with geographic distance, in agreement with prior studies.
We identify similar protein functions in several components and suggest that functional co-occurrence across metagenomic samples could lead to future methods for de novo functional prediction. We conclude by discussing how NMF and other dimension-reduction methods can help enable a macroscopic functional description of marine ecosystems.
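
The comparison of functional distance with environmental and geographic distance can be illustrated with a short sketch. It is not the authors' pipeline: it assumes Euclidean distances between sites and a Mantel-style statistic (the Pearson correlation over all site pairs), and it runs on synthetic data in which functional profiles are constructed to depend on the environmental variables. The names `pairwise_dist`, `mantel_r`, `env`, `geo`, and `profiles` are hypothetical. A filtered version of the measure would pass a low-rank reconstruction of the abundance matrix in place of `profiles`.

```python
import numpy as np

def pairwise_dist(X):
    """Euclidean distance between every pair of rows of X (sites x features)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def mantel_r(D1, D2):
    """Pearson correlation over the i < j entries of two distance matrices."""
    iu = np.triu_indices_from(D1, k=1)
    return float(np.corrcoef(D1[iu], D2[iu])[0, 1])

# Synthetic data only (assumption): 10 sites, 20 protein families.
rng = np.random.default_rng(0)
n_sites, n_families = 10, 20
env = rng.random((n_sites, 2))   # hypothetical environmental variables
geo = rng.random((n_sites, 2))   # hypothetical coordinates, drawn independently of env
loadings = rng.random((2, n_families))
# Functional profiles (sites x families) driven by env plus a little noise.
profiles = env @ loadings + 0.01 * rng.random((n_sites, n_families))

D_func = pairwise_dist(profiles)
D_env = pairwise_dist(env)
D_geo = pairwise_dist(geo)

r_env = mantel_r(D_func, D_env)  # functional vs. environmental distance
r_geo = mantel_r(D_func, D_geo)  # functional vs. geographic distance
print(f"r(functional, environmental) = {r_env:.2f}")
print(f"r(functional, geographic)    = {r_geo:.2f}")
```

Assessing the significance of such a correlation would normally require a permutation test over site labels, which this sketch omits.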

Highlights

  • Metagenomics – large-scale sequencing of DNA isolated directly from environmental samples – has greatly facilitated the study of microbial communities [1,2,3,4,5,6]

  • We approximated the Global Ocean Sampling (GOS) dataset of over 6,000,000 unique protein sequences, representing 8214 Pfam abundances distributed across 45 sites, as a combination of five components, each with a characteristic functional profile and site profile

  • We showed that using this non-negative matrix factorization (NMF) decomposition as a lens allowed identification of novel patterns of clustering of Pfams, and overlaps between these clusters
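
The decomposition described in the highlights can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the classic Lee-Seung multiplicative updates for the squared-error objective, and it runs on a small synthetic matrix instead of the GOS data. The names `nmf`, `V`, `W`, `H`, and the toy dimensions are assumptions introduced here. In the setting above, `V` would be the 8214 × 45 Pfam-by-site abundance matrix and `k` would be 5, so each column of `W` would play the role of a functional profile and each row of `H` the matching site profile.

```python
import numpy as np

def nmf(V, k, n_iter=500, eps=1e-9, seed=0):
    """Approximate a non-negative matrix V (families x sites) as W @ H.

    W (families x k) and H (k x sites) stay non-negative because they start
    positive and are only ever multiplied by non-negative ratios.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = V.shape
    W = rng.random((n_rows, k)) + eps
    H = rng.random((k, n_cols)) + eps
    for _ in range(n_iter):
        # Lee-Seung multiplicative updates for ||V - W H||_F^2.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic example (assumption): 30 "families", 8 "sites", 3 hidden components.
rng = np.random.default_rng(1)
V = rng.random((30, 3)) @ rng.random((3, 8))

W, H = nmf(V, k=3)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative reconstruction error: {rel_err:.3f}")
```

Multiplicative updates are used here only because they fit in a few lines. The result depends on the random initialization, so in practice several restarts are usually compared.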

Introduction

Metagenomics – large-scale sequencing of DNA isolated directly from environmental samples – has greatly facilitated the study of microbial communities [1,2,3,4,5,6]. This wealth of information has created a new set of challenges in understanding the factors underlying the functional processes mediated by microbes at community, regional and global scales [2]. A complementary series of analyses is necessary to quantify the functional properties of microbial communities and to explain how differences in their functional properties relate to environmental and geographic factors. Such analyses have the potential to help form the empirical foundation for the study of microbial biogeography [13,14,15,16].
