Abstract
Compositional data sets are ubiquitous in science, including geology, ecology, and microbiology. In microbiome research, compositional data primarily arise from high-throughput sequence-based profiling experiments. These data comprise microbial compositions in their natural habitat and are often paired with covariate measurements that characterize physicochemical habitat properties or the physiology of the host. Inferring parsimonious statistical associations between microbial compositions and habitat- or host-specific covariate data is an important step in exploratory data analysis. A standard statistical model linking compositional covariates to continuous outcomes is the linear log-contrast model. This model describes the response as a linear combination of log-ratios of the original compositions and has been extended to the high-dimensional setting via regularization. In this contribution, we propose a general convex optimization model for linear log-contrast regression which includes many previous proposals as special cases. We introduce a proximal algorithm that solves the resulting constrained optimization problem exactly with rigorous convergence guarantees. We illustrate the versatility of our approach by investigating the performance of several model instances on soil and gut microbiome data analysis tasks.
Highlights
Compositional data sets are ubiquitous in many areas of science, spanning such disparate fields as geology and ecology
The quantification of the relative abundances of operational taxonomic units (OTUs) in the environment is often accompanied by measurements of other covariates, including physicochemical properties of the underlying habitats, variables related to the health status of the host, or those coming from other highthroughput protocols, such as metabolomics or flow cytometry
In the low-dimensional setting, the standard log-contrast model with zero-sum constraints can be estimated by solving a least-squares problem subject to a linear constraint, or alternatively, via standard linear regression applied to isometrically log-ratio transformed compositions [14]
Summary
Compositional data sets are ubiquitous in many areas of science, spanning such disparate fields as geology and ecology. Compositional data arise from high-throughput sequence-based microbiome profiling techniques, such as targeted amplicon sequencing (TAS) and metagenomic profiling These methods generate large-scale genomic survey data of microbial community compositions in their natural habitat, ranging from marine ecosystems to host-associated environments. A particular feature of our model is the joint estimation of regression vectors and associated scales for log-contrast models, similar to the scaled Lasso approach in high-dimensional linear regression [31]. This is achieved by leveraging recent results on the connection between perspective functions and statistical models [8,9,10]. We illustrate the versatility of our modeling approach by applying novel log-contrast model instances to environmental and gut microbiome data analysis tasks
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.