A Zero-Inflated Latent Dirichlet Allocation Model for Microbiome Studies.

Rebecca A. Deek, Hongzhe Li

doi:10.3389/fgene.2020.602594

Abstract

The human microbiome consists of a community of microbes in varying abundances and has been shown to be associated with many diseases. An important first step in many microbiome studies is to identify possible distinct microbial communities in a given data set and to identify the important bacterial taxa that characterize these communities. The data from typical microbiome studies are high-dimensional count data with excessive zeros due to both absence of species (structural zeros) and low sequencing depth or dropout (sampling zeros). Although methods have been developed for identifying microbial communities based on mixture models of counts, these methods do not account for the excessive zeros observed in the data and do not differentiate structural from sampling zeros. In this paper, we introduce a zero-inflated Latent Dirichlet Allocation model (zinLDA) for sparse count data observed in microbiome studies. zinLDA builds on the flexible Latent Dirichlet Allocation model and allows for zero inflation in observed counts. We develop an efficient Markov chain Monte Carlo (MCMC) sampling procedure to fit the model. Results from our simulations show that zinLDA provides better fits to the data and is able to separate structural zeros from sampling zeros. We apply zinLDA to the data set from the American Gut Project and identify microbial communities characterized by different bacterial genera.

Highlights

The advent and proliferation of next-generation sequencing (NGS) technologies have given rise to many large-scale high-throughput microbiome studies (Turnbaugh et al, 2007; Gilbert et al, 2014; McDonald et al, 2018).
We propose a modification to the Latent Dirichlet Allocation model that allows the latent subcommunity organization to include both structural zeros (taxa that truly do not belong to the community) and sampling zeros (taxa that belong to the community but are not captured due to low sequencing depth or dropout).
The data were simulated from a true zero-inflated Latent Dirichlet Allocation (zinLDA) model, following the steps specified by the generative algorithm given in Section 2.3.
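
The distinction between structural and sampling zeros can be illustrated with a small simulation. The sketch below is not the generative algorithm of Section 2.3, which is not reproduced here. It is a simplified stand-in that assumes each taxon is structurally absent from a subcommunity with a fixed probability `pi`, and that the remaining taxa receive ordinary Dirichlet weights. The function name, parameter names, and default values are all hypothetical.

```python
import numpy as np

def simulate_zero_inflated_lda(n_samples=50, n_taxa=30, n_communities=3,
                               reads_per_sample=200, pi=0.4, alpha=0.5,
                               eta=0.5, seed=0):
    """Simulate counts from a simplified zero-inflated LDA-style model.

    Assumption: structural zeros are drawn independently with probability
    `pi`, and present taxa get symmetric Dirichlet(eta) weights. This is an
    illustrative simplification, not the prior used in the paper.
    """
    rng = np.random.default_rng(seed)

    # structural[k, j] is True when taxon j truly does not belong to community k
    structural = rng.random((n_communities, n_taxa)) < pi
    for k in range(n_communities):
        if structural[k].all():
            # keep at least one taxon present so the community is well defined
            structural[k, rng.integers(n_taxa)] = False

    # Community-level taxon distributions: exactly zero on structural zeros
    beta = np.zeros((n_communities, n_taxa))
    for k in range(n_communities):
        present = ~structural[k]
        beta[k, present] = rng.dirichlet(np.full(present.sum(), eta))

    # Sample-level mixing proportions over communities
    theta = rng.dirichlet(np.full(n_communities, alpha), size=n_samples)

    counts = np.zeros((n_samples, n_taxa), dtype=int)
    for d in range(n_samples):
        # Split the reads of sample d across communities, then across taxa
        reads_per_community = rng.multinomial(reads_per_sample, theta[d])
        for k in range(n_communities):
            counts[d] += rng.multinomial(reads_per_community[k], beta[k])
    return counts, structural, beta, theta


counts, structural, beta, theta = simulate_zero_inflated_lda()
# A taxon absent from every community can never be observed (structural zero);
# any other zero count arises from finite sequencing depth (sampling zero).
absent_everywhere = structural.all(axis=0)
print("fraction of zero counts:", (counts == 0).mean())
print("taxa absent from all communities:", int(absent_everywhere.sum()))
```

In this sketch, a zero count for a taxon that has positive weight in at least one subcommunity present in the sample is a sampling zero, and lowering `reads_per_sample` increases the number of such zeros without changing `structural`.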

Summary

Introduction

The advent and proliferation of next-generation sequencing (NGS) technologies have given rise to many large-scale high-throughput microbiome studies (Turnbaugh et al, 2007; Gilbert et al, 2014; McDonald et al, 2018). Single-species analyses, which focus on one species at a time in a “parts-list” fashion, are not able to capture complex and dynamic interactions. These inter-species interactions form the basis of distinct underlying subcommunity structures, and failing to account for them contributes to the data heterogeneity commonly seen in microbiome studies. Network-based approaches have been successfully applied in this area (Faust and Raes, 2012; Layeghifard et al, 2017). These methods use co-occurrence or correlation measures to identify pairwise interactions in cross-sectional studies (Faust et al, 2012; Friedman and Alm, 2012; Kurtz et al, 2015). Others use temporally conserved covariance to identify interactions in longitudinal studies (Raman et al, 2019).

Methods

Results

Conclusion