Abstract

Background: Compositional data refer to the data that lie on a simplex, which are common in many scientific domains such as genomics, geology and economics. As the components in a composition must sum to one, traditional tests based on unconstrained data become inappropriate, and new statistical methods are needed to analyze this special type of data.

Results: In this paper, we consider a general problem of testing for the compositional difference between K populations. Motivated by microbiome and metagenomics studies, where the data are often over-dispersed and high-dimensional, we formulate a well-posed hypothesis from a Bayesian point of view and suggest a nonparametric test based on inter-point distance to evaluate statistical significance. Unlike most existing tests for compositional data, our method does not rely on any data transformation, sparsity assumption or regularity conditions on the covariance matrix, but directly analyzes the compositions. Simulated data and two real data sets on the human microbiome are used to illustrate the promise of our method.

Conclusions: Our simulation studies and real data applications demonstrate that the proposed test is more sensitive to the compositional difference than the mean-based method, especially when the data are over-dispersed or zero-inflated. The proposed test is easy to implement and computationally efficient, facilitating its application to large-scale datasets.
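To make the idea of a distance-based nonparametric test concrete, the following is a minimal sketch of a K-group permutation test built on inter-point distances. It is not the statistic proposed in the paper, which the abstract does not specify. The statistic used here (mean between-group distance minus mean within-group distance), the Euclidean metric, and the names `between_minus_within` and `permutation_pvalue` are all assumptions chosen for illustration. The sketch assumes at least two groups with at least two samples each.

```python
import itertools
import math
import random


def between_minus_within(samples, labels):
    """Mean between-group distance minus mean within-group distance.

    Illustrative statistic only; assumes >= 2 groups and >= 2 samples per group.
    """
    between, within = [], []
    for i, j in itertools.combinations(range(len(samples)), 2):
        # Euclidean distance computed directly on the untransformed compositions
        d = math.dist(samples[i], samples[j])
        (within if labels[i] == labels[j] else between).append(d)
    return sum(between) / len(between) - sum(within) / len(within)


def permutation_pvalue(samples, labels, n_perm=999, seed=0):
    """Permutation p-value: share of label shuffles at least as extreme as observed."""
    rng = random.Random(seed)
    observed = between_minus_within(samples, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any association between labels and samples
        if between_minus_within(samples, shuffled) >= observed:
            exceed += 1
    # The +1 terms count the observed labelling as one of the permutations
    return (exceed + 1) / (n_perm + 1)


# Hypothetical toy data: three groups of 3-part compositions around distinct centres
_rng = random.Random(1)


def noisy(center):
    raw = [max(c + _rng.uniform(-0.02, 0.02), 0.0) for c in center]
    total = sum(raw)
    return [r / total for r in raw]  # renormalise so the parts sum to one


centers = {"A": [0.6, 0.3, 0.1], "B": [0.3, 0.4, 0.3], "C": [0.1, 0.3, 0.6]}
samples, labels = [], []
for name, center in centers.items():
    for _ in range(8):
        samples.append(noisy(center))
        labels.append(name)

p_value = permutation_pvalue(samples, labels)
print(p_value)
```

Because the reference distribution comes from shuffling group labels, a test of this kind needs no data transformation and no assumption on the covariance structure, which is the property the abstract emphasizes.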

Highlights

  • Compositional data refer to the data that lie on a simplex, which are common in many scientific domains such as genomics, geology and economics

  • Data that lie on the simplex $S^{d-1} = \{(x_1, x_2, \ldots, x_d) : \min_j x_j \ge 0,\ \sum_{j=1}^{d} x_j = 1\}$ are often called (d − 1)-dimensional compositional data, and they arise in many scientific disciplines such as genomics, geology and economics [1,2,3]

  • Unlike most existing tests for compositional data, our method does not rely on any data transformation, sparsity assumption or regularity conditions on the covariance matrix, but directly analyzes the compositions
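The simplex constraint in the highlights (non-negative parts that sum to one) can be checked, and raw counts mapped onto the simplex, with a few lines of code. This is a minimal sketch; the function names `is_composition` and `closure` and the tolerance value are assumptions for illustration, not taken from the paper.

```python
import math


def is_composition(x, tol=1e-9):
    """True if x lies on the simplex: all parts non-negative and summing to one."""
    return min(x) >= 0 and math.isclose(sum(x), 1.0, abs_tol=tol)


def closure(counts):
    """Rescale non-negative counts (e.g. taxon read counts) so they sum to one."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must have a positive total")
    return [c / total for c in counts]


# Hypothetical read counts for three taxa; zeros are allowed on the simplex
composition = closure([10, 0, 30])
print(composition, is_composition(composition))
```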

Summary

Introduction

Compositional data refer to the data that lie on a simplex, which are common in many scientific domains such as genomics, geology and economics. In the Human Microbiome Project, it is common to have hundreds to thousands of bacterial taxa while only tens of samples are available for analysis. To this end, Cao et al. (2017) developed a powerful two-sample test for high-dimensional means using a centered log-ratio transformation [3]. Cao et al.’s test achieves satisfactory statistical power under high-dimensional sparse settings, and the consistency of the test has been well established under some regularity conditions. However, this test has several shortcomings that have limited its application in practice. First, it can only handle two-sample comparisons, and its validity depends on a list of regularity conditions on the underlying covariance matrices. Second, it is a maximum-type test, and its performance relies on the sparsity assumption, i.e., only a small proportion of components in the composition differ across groups.
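For reference, the centered log-ratio (CLR) transformation mentioned above maps a composition x to log(x_j) minus the mean of the log parts. The sketch below implements only that standard definition; it does not reproduce Cao et al.'s test, and the function name `clr` is an assumption for illustration. Note that the transformation is undefined whenever a part is zero, so zero-inflated data would need some additional adjustment before it can be applied.

```python
import math


def clr(composition):
    """Centered log-ratio transform: log of each part minus the mean log part."""
    if min(composition) <= 0:
        # log(0) is undefined, so zeros must be handled before transforming
        raise ValueError("CLR is undefined for zero or negative components")
    logs = [math.log(v) for v in composition]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


# Hypothetical 3-part composition; the CLR coordinates sum to (numerically) zero
z = clr([0.2, 0.3, 0.5])
print(z)
```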

