A marginalized two-part Beta regression model for microbiome compositional data.

Haitao Chai, Lu Lin, Lei Liu, Hongmei Jiang, Dan Knights

doi:10.1371/journal.pcbi.1006329

Abstract

In microbiome studies, an important goal is to detect differential abundance of microbes across clinical conditions and treatment options. However, microbiome compositional data (quantified by relative abundance) are highly skewed, bounded in [0, 1), and often contain many zeros. A two-part model is commonly used to separate zeros and positive values explicitly through two submodels: in Part I, a logistic model for the probability that a species is present, and in Part II, a Beta regression model for the relative abundance conditional on the species being present. However, the regression coefficients in Part II do not provide a marginal (unconditional) interpretation of covariate effects on microbial abundance, which is of great interest in many applications. In this paper, we propose a marginalized two-part Beta regression model that captures the zero-inflation and skewness of microbiome data and also allows investigators to examine covariate effects on the marginal (unconditional) mean. We demonstrate its practical performance using simulation studies and apply the model to a real metagenomic dataset on mouse skin microbiota. We find that, under the proposed marginalized model, the likelihood ratio test controls the type I error better than the tests under conventional methods, with no loss in power.
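
The relationship between the conditional and marginal quantities can be sketched as follows. The symbols here (pi_i, mu_i, nu_i, alpha, beta, gamma) are illustrative notation and the logit link on the marginal mean is an assumption, so the paper's own parameterization may differ.

```latex
\begin{align*}
\text{Part I:}\quad
  & \operatorname{logit}(\pi_i) = x_i^\top \alpha,
  && \pi_i = \Pr(Y_i > 0), \\
\text{Part II (conventional):}\quad
  & \operatorname{logit}(\mu_i) = x_i^\top \beta,
  && \mu_i = E(Y_i \mid Y_i > 0), \\
\text{Marginal mean:}\quad
  & \nu_i = E(Y_i) = \pi_i \, \mu_i, \\
\text{Part II (marginalized, assumed link):}\quad
  & \operatorname{logit}(\nu_i) = x_i^\top \gamma,
  && \mu_i = \nu_i / \pi_i .
\end{align*}
```

In the conventional form, the marginal mean depends on both alpha and beta, so neither coefficient vector alone describes a covariate's effect on it. Modeling nu_i directly lets gamma carry that interpretation.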

Highlights

In recent years, metagenomics studies have been growing rapidly owing to advances in next-generation sequencing (NGS) technologies [1]
Semi-continuous compositional data are typically analyzed using two-part models which separately describe the probability of zero values and the distribution of positive values
It is of great interest in many applications to assess the covariate effect on the marginal mean of the response

Summary

Introduction

Metagenomics studies have been growing rapidly owing to advances in next-generation sequencing (NGS) technologies [1]. Microbial abundance is usually measured in read counts. These counts are not directly comparable across samples because the total number of sequences differs from sample to sample. The read counts are therefore often normalized to relative abundances, which sum to 1 over all microbes in a sample [8]. Relative abundance can be characterized by a point mass at zero and a right-skewed continuous distribution with positive support, the so-called “semi-continuous” or “zero-inflated continuous” data. A zero value indicates either that a microbe is absent from the sample or that a rare microbe is present but was missed due to undersampling, while the continuous distribution with positive support describes the levels of relative abundance among the microbes that are present.
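
As a minimal sketch of the normalization described above, the following Python snippet converts read counts to relative abundances and reports the share of zeros. The function name `relative_abundance` and the count matrix are hypothetical and serve only as an illustration; they are not data or code from the study.

```python
import numpy as np

def relative_abundance(counts):
    """Convert a samples-by-taxa matrix of read counts to relative abundances."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)  # total reads per sample
    if np.any(totals == 0):
        raise ValueError("every sample needs at least one read")
    return counts / totals  # each row now sums to 1

# Hypothetical read counts: 3 samples (rows) x 4 taxa (columns).
counts = np.array([
    [120, 0, 30, 50],
    [10, 0, 0, 990],
    [0, 5, 45, 0],
])

rel = relative_abundance(counts)

# Share of exact zeros, i.e. the point mass that Part I of a two-part model describes.
zero_fraction = float(np.mean(rel == 0))

# Positive values only, i.e. the part that a Beta regression would describe.
positive_values = rel[rel > 0]
```

Dividing by the per-sample total removes differences in sequencing depth, while zeros in the counts remain exact zeros in the relative abundances.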

Results

Discussion

Conclusion