Abstract

BackgroundSequencing partial 16S rRNA genes is a cost effective method for quantifying the microbial composition of an environment, such as the human gut. However, downstream analysis relies on binning reads into microbial groups by either considering each unique sequence as a different microbe, querying a database to get taxonomic labels from sequences, or clustering similar sequences together. However, these approaches do not fully capture evolutionary relationships between microbes, limiting the ability to identify differentially abundant groups of microbes between a diseased and control cohort. We present sequence-based biomarkers (SBBs), an aggregation method that groups and aggregates microbes using single variants and combinations of variants within their 16S sequences. We compare SBBs against other existing aggregation methods (OTU clustering and Microphenoor DiTaxa features) in several benchmarking tasks: biomarker discovery via permutation test, biomarker discovery via linear discriminant analysis, and phenotype prediction power. We demonstrate the SBBs perform on-par or better than the state-of-the-art methods in biomarker discovery and phenotype prediction.ResultsOn two independent datasets, SBBs identify differentially abundant groups of microbes with similar or higher statistical significance than existing methods in both a permutation-test-based analysis and using linear discriminant analysis effect size. . By grouping microbes by SBB, we can identify several differentially abundant microbial groups (FDR <.1) between children with autism and neurotypical controls in a set of 115 discordant siblings. Porphyromonadaceae, Ruminococcaceae, and an unnamed species of Blastocystis were significantly enriched in autism, while Veillonellaceae was significantly depleted. Likewise, aggregating microbes by SBB on a dataset of obese and lean twins, we find several significantly differentially abundant microbial groups (FDR<.1). We observed Megasphaera andSutterellaceae highly enriched in obesity, and Phocaeicola significantly depleted. SBBs also perform on bar with or better than existing aggregation methods as features in a phenotype prediction model, predicting the autism phenotype with an ROC-AUC score of .64 and the obesity phenotype with an ROC-AUC score of .84.ConclusionsSBBs provide a powerful method for aggregating microbes to perform differential abundance analysis as well as phenotype prediction. Our source code can be freely downloaded from http://github.com/briannachrisman/16s_biomarkers.

Highlights

  • Sequencing partial 16S rRNA genes is a cost effective method for quan‐ tifying the microbial composition of an environment, such as the human gut

  • sequence-based biomarker (SBB) provide a powerful method for aggregating microbes to perform differential abundance analysis as well as phenotype prediction

  • Multi‐loci SBBs yield high statistical power in identifying differentially abundant microbial groups Using sequence-based biomarkers, we were able to identify groups of microbes differentially enriched in autism as well as in obesity with high statistical power

Read more

Summary

Introduction

Sequencing partial 16S rRNA genes is a cost effective method for quan‐ tifying the microbial composition of an environment, such as the human gut. Downstream analysis relies on binning reads into microbial groups by either consider‐ ing each unique sequence as a different microbe, querying a database to get taxo‐ nomic labels from sequences, or clustering similar sequences together These approaches do not fully capture evolutionary relationships between microbes, limiting the ability to identify differentially abundant groups of microbes between a diseased and control cohort. Microbial community profiling by 16S sequencing (Fig. 1) involves designing primers to target conserved sequences around a hypervariable region of choice, amplifying the region from a mixture of diverse genomes, and performing short read sequencing of the amplicons These exact sequencing variants (ESVs) are preprocessed to remove noise and sequencing artifacts in order Amplicon Sequence Variants (ASV) [6]. The final output of this pipeline is a matrix of read counts for each ASV

Methods
Results
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call