Abstract

Background: Whole-genome “shotgun” (WGS) metagenomic sequencing is an increasingly widely used tool for analyzing the metagenomic content of microbiome samples. While WGS data contains gene-level information, it can be challenging to analyze the millions of microbial genes which are typically found in microbiome experiments. To mitigate the ultrahigh dimensionality challenge of gene-level metagenomics, it has been proposed to cluster genes by co-abundance to form Co-Abundant Gene groups (CAGs). However, exhaustive co-abundance clustering of millions of microbial genes across thousands of biological samples has previously been intractable because of the computational challenge of performing trillions of pairwise comparisons.

Results: Here we present a novel computational approach to the analysis of WGS datasets in which microbial gene groups are the fundamental unit of analysis. We use the Approximate Nearest Neighbor heuristic for near-exhaustive average linkage clustering to group millions of genes by co-abundance. This results in thousands of high-quality CAGs representing complete and partial microbial genomes. We applied this method to publicly available WGS microbiome surveys and found that the resulting microbial CAGs associated with inflammatory bowel disease (IBD) and colorectal cancer (CRC) were highly reproducible and could be validated in multiple independent cohorts.

Conclusions: This approach to gene-level metagenomics provides a powerful path forward for identifying the biological links between the microbiome and human health. By proposing a new computational approach for handling high-dimensional metagenomics data, we identified specific microbial gene groups associated with disease, which can be used to identify strains of interest for further preclinical and mechanistic experimentation.

Highlights

  • Whole-genome “shotgun” (WGS) metagenomic sequencing is an increasingly widely used tool for analyzing the metagenomic content of microbiome samples

  • While the microbiome has been implicated in a number of human diseases, we chose to focus on colorectal cancer (CRC) and inflammatory bowel disease (IBD)

  • We chose to group together all participants with any form of the disease state, as the criteria for disease classification were not consistent across studies. In this discovery-validation approach, those Co-Abundant Gene groups (CAGs) which had a q value < 0.2 in the discovery cohort were subsequently tested in an additional “validation” cohort, and those CAGs which had a q value < 0.2 in that second step and the same direction of effect were considered to be associated with disease. With this approach, we found that, for the set of CAGs associated with disease in the discovery cohort, the estimated coefficient of disease status in the discovery cohort was significantly correlated with the estimated coefficient in the validation cohort (Fig. 2a, b; CRC: r = 0.36, p < 2E−16; IBD: r = 0.30, p < 2E−16)
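
The discovery-validation rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the `CagResult` record, the `validated_cags` function, and the example numbers are all hypothetical, and the q values are assumed to have been computed beforehand (the validation q values only over CAGs that passed the discovery step).

```python
from dataclasses import dataclass


@dataclass
class CagResult:
    """Hypothetical per-CAG association result for one cohort."""
    cag_id: str
    coef: float     # estimated coefficient of disease status
    q_value: float  # multiple-testing-adjusted p value (assumed precomputed)


def validated_cags(discovery, validation, q_threshold=0.2):
    """Return IDs of CAGs with q < q_threshold in both cohorts
    and the same direction of effect (same sign of the coefficient)."""
    validation_by_id = {r.cag_id: r for r in validation}
    hits = []
    for d in discovery:
        if d.q_value >= q_threshold:
            continue  # fails the discovery step, never tested further
        v = validation_by_id.get(d.cag_id)
        if v is None or v.q_value >= q_threshold:
            continue  # fails the validation step
        # A zero coefficient has no direction, so it never counts as a match.
        if d.coef * v.coef > 0:
            hits.append(d.cag_id)
    return hits


# Made-up example values, for illustration only.
discovery = [
    CagResult("CAG1", 1.2, 0.01),
    CagResult("CAG2", -0.5, 0.10),
    CagResult("CAG3", 0.8, 0.50),
    CagResult("CAG4", 0.9, 0.05),
]
validation = [
    CagResult("CAG1", 0.7, 0.05),
    CagResult("CAG2", 0.4, 0.01),
    CagResult("CAG3", 0.9, 0.01),
    CagResult("CAG4", 1.0, 0.30),
]
print(validated_cags(discovery, validation))
```

In this toy input only CAG1 passes: CAG2 flips direction between cohorts, CAG3 fails the discovery threshold, and CAG4 fails the validation threshold.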


Summary

Introduction

Whole-genome “shotgun” (WGS) metagenomic sequencing is an increasingly widely used tool for analyzing the metagenomic content of microbiome samples. While WGS data contains gene-level information, it can be challenging to analyze the millions of microbial genes which are typically found in microbiome experiments. Metagenomic analysis of the microbiome typically falls into the categories of taxonomic classification, metabolic pathway reconstruction, or genome reconstruction. Gene-level metagenomics is not new to this study; it has been proposed previously as an alternative to taxonomic or metabolic pathway analysis [13]. We took the previously described approach of grouping together genes that are consistently found at a similar level of abundance across multiple samples [13]. Grouping genes by co-abundance finds low-dimensional structure in high-dimensional gene-level data, mitigating challenges with the statistical analysis of high-dimensional metagenomics data.
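
To make the idea of co-abundance grouping concrete, the following toy sketch clusters four genes by the similarity of their abundance profiles across samples. It is an illustration under stated assumptions, not the method used in the study: it performs exhaustive average linkage clustering (feasible only for tiny inputs) rather than the Approximate Nearest Neighbor heuristic, and the cosine distance, the `max_dist` threshold of 0.2, the function names, and the abundance values are all chosen here for illustration.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two abundance profiles."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an all-zero profile as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def average_linkage(c1, c2, abund):
    """Mean pairwise distance between the genes of two clusters."""
    total = sum(cosine_distance(abund[g1], abund[g2]) for g1 in c1 for g2 in c2)
    return total / (len(c1) * len(c2))


def cluster_genes(abund, max_dist=0.2):
    """Greedy agglomerative clustering: repeatedly merge the closest pair
    of clusters until no pair is within max_dist (assumed threshold)."""
    clusters = [[g] for g in sorted(abund)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], abund)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_dist:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Made-up abundances of four genes across four samples.
abundances = {
    "geneA": [1, 2, 3, 0],
    "geneB": [2, 4, 6, 0],
    "geneC": [0, 0, 1, 5],
    "geneD": [0, 1, 2, 9],
}
cags = cluster_genes(abundances)
print(cags)
```

Here geneA and geneB rise and fall together across samples, as do geneC and geneD, so the four genes collapse into two groups. The pairwise search in this sketch scales quadratically with the number of genes, which is the cost that makes exhaustive clustering impractical for millions of genes.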
