An empirical Bayes approach to normalization and differential abundance testing for microbiome data

Tiantian Liu, Tao Wang, Hongyu Zhao

doi:10.1186/s12859-020-03552-z


Open Access



Journal: BMC Bioinformatics
Publication Date: Jun 3, 2020
Citations: 14
License type: open-access

Affiliation: Shanghai Jiao Tong University, Yale University

Abstract

Background: Advances in DNA sequencing have offered researchers an unprecedented opportunity to better study the variety of species living in and on the human body. However, the analysis of microbiome data is complicated by several challenges. First, the sequencing depth may vary by orders of magnitude across samples. Second, many species are rare and the data often contain many zeros. Third, the specimen is a fraction of the microbial ecosystem, and so the data are compositional, carrying only relative information. Other characteristics of microbiome data include pronounced over-dispersion in taxon abundances and the existence of a phylogenetic tree that relates all bacterial species. To address some of these challenges, microbiome analysis workflows often normalize the read counts prior to downstream analysis. However, there are limitations in the current literature on the normalization of microbiome data.

Results: Under the multinomial distribution for the read counts and a prior for the unknown proportions, we propose an empirical Bayes approach to microbiome data normalization. Using a tree-based extension of the Dirichlet prior, we further extend our method by incorporating the phylogenetic tree into the normalization process. We study the impact of normalization on differential abundance analysis. In the presence of tree structure, we propose a phylogeny-aware detection procedure.

Conclusions: Extensive simulations and gut microbiome data applications are conducted to demonstrate the superior performance of our empirical Bayes method over other normalization methods, and over commonly used methods for differential abundance testing. Original R scripts are available at GitHub (https://github.com/liudoubletian/eBay).
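
To make the idea in the abstract concrete, the following is a minimal sketch of empirical Bayes normalization under a multinomial likelihood with a plain Dirichlet prior. It is not the authors' implementation (their R scripts are in the linked repository). The function names are hypothetical, the prior is estimated by a simple method of moments on observed proportions rather than by whatever estimator the paper uses, and the tree-based extension is not covered. Given a prior with parameters alpha_j, the normalized abundance of taxon j in sample i is taken to be the posterior mean (x_ij + alpha_j) / (N_i + sum_j alpha_j), where N_i is the sequencing depth.

```python
from typing import List


def estimate_dirichlet_prior(counts: List[List[int]]) -> List[float]:
    """Rough method-of-moments estimate of Dirichlet parameters.

    Assumption of this sketch: moments are matched on observed proportions,
    ignoring the extra multinomial sampling noise, so the concentration is
    only approximate. Needs at least two samples, each with nonzero depth.
    """
    n = len(counts)
    k = len(counts[0])
    # Observed per-sample proportions.
    props = [[x / sum(row) for x in row] for row in counts]
    mean = [sum(p[j] for p in props) / n for j in range(k)]
    var = [sum((p[j] - mean[j]) ** 2 for p in props) / (n - 1) for j in range(k)]
    total_var = sum(var)
    if total_var <= 0:
        raise ValueError("proportions do not vary across samples")
    # For a Dirichlet, Var[p_j] = m_j (1 - m_j) / (alpha_0 + 1); pool over taxa.
    alpha0 = sum(m * (1 - m) for m in mean) / total_var - 1.0
    if alpha0 <= 0:
        raise ValueError("moment estimate of the concentration is not positive")
    return [alpha0 * m for m in mean]


def eb_normalize(counts: List[List[int]], alpha: List[float]) -> List[List[float]]:
    """Posterior-mean proportions under a multinomial-Dirichlet model."""
    alpha0 = sum(alpha)
    return [
        [(x + a) / (sum(row) + alpha0) for x, a in zip(row, alpha)]
        for row in counts
    ]


# Toy counts (made up for illustration): 3 samples, 3 taxa, very different depths.
toy_counts = [[10, 0, 90], [300, 20, 680], [5, 1, 44]]
toy_alpha = estimate_dirichlet_prior(toy_counts)
toy_normalized = eb_normalize(toy_counts, toy_alpha)
for row in toy_normalized:
    print([round(v, 4) for v in row])
```

In this sketch, the zero count in the first toy sample receives a small positive normalized value because the prior pulls it toward the mean across samples. Shallow samples are shrunk more strongly than deep ones.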

Highlights

We generated bacterial counts from a Dirichlet-multinomial (DM) or Dirichlet-tree multinomial (DTM) model, with the true vector of proportions π estimated from a real dataset [32], which contains the counts of 60 taxa from 1897 samples, together with a phylogenetic tree describing the evolutionary relationships among these taxa.
We examined the downstream effect of normalization in the context of differential abundance analysis (Fig. 7: differentially abundant bacterial species between normal-weight and obese individuals).
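
The DM simulation mentioned in the highlights can be illustrated with a short sketch. Everything specific here is an assumption for illustration. The proportion vector is invented, not estimated from dataset [32]. The over-dispersion parameter theta and the mapping alpha_j = pi_j (1 - theta) / theta are one common DM parameterization, not necessarily the one used in the paper. The function names are hypothetical, and the Dirichlet-tree variant is not covered.

```python
import random
from typing import List


def sample_dirichlet(alpha: List[float], rng: random.Random) -> List[float]:
    """Draw one Dirichlet vector by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_multinomial(depth: int, probs: List[float], rng: random.Random) -> List[int]:
    """Draw multinomial counts by assigning each read to a taxon."""
    counts = [0] * len(probs)
    for idx in rng.choices(range(len(probs)), weights=probs, k=depth):
        counts[idx] += 1
    return counts


def simulate_dm_counts(
    pi: List[float], theta: float, depths: List[int], seed: int = 0
) -> List[List[int]]:
    """Simulate one row of Dirichlet-multinomial counts per sequencing depth.

    pi    : mean proportions (hypothetical values, all positive, summing to 1)
    theta : over-dispersion in (0, 1); larger means more variable samples
    """
    rng = random.Random(seed)
    scale = (1.0 - theta) / theta  # assumed parameterization
    alpha = [p * scale for p in pi]
    rows = []
    for depth in depths:
        # Sample-specific proportions first, then the read counts.
        sample_probs = sample_dirichlet(alpha, rng)
        rows.append(sample_multinomial(depth, sample_probs, rng))
    return rows


# Depths spanning more than an order of magnitude, as the abstract describes.
example_depths = [100, 1000, 50]
example_counts = simulate_dm_counts([0.5, 0.3, 0.2], theta=0.1, depths=example_depths)
for depth, row in zip(example_depths, example_counts):
    print(depth, row)
```

Drawing fresh proportions for every sample is what produces over-dispersion relative to a plain multinomial with fixed π.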

Summary

Introduction

Advances in DNA sequencing have offered researchers an unprecedented opportunity to better study the variety of species living in and on the human body. Microbiome data are characterized by pronounced over-dispersion in taxon abundances and by a phylogenetic tree that relates all bacterial species. The human microbiome, which refers to the collection of microbes and their genetic information in the human body, contributes to healthy human physiology and development, and dysbiosis of microbial communities is linked to many diseases, such as obesity, type 2 diabetes, and inflammatory bowel disease [1,2,3]. To understand the taxonomic composition and biological function of microbiomes, high-throughput sequencing technologies and advanced bioinformatics tools are routinely employed in microbiome studies [6]. The evolutionary relationships among operational taxonomic units (OTUs) can be inferred either by using a reference database or by constructing the phylogenetic tree de novo [7].

Methods

Results

Discussion

Conclusion