Abstract

Random forest is considered one of the most successful machine learning algorithms and has been widely used to construct microbiome-based predictive models. However, its use as a statistical testing method has not been explored. In this study, we propose the “Random Forest Test” (RFtest), a global (community-level) test based on random forest for high-dimensional and phylogenetically structured microbiome data. RFtest is a permutation test that uses the generalization error of random forest as the test statistic. Our simulations demonstrate that RFtest has controlled type I error rates, that it is more powerful than competing methods for phylogenetically clustered signals, and that it is robust to outliers and adaptive to interaction effects and non-linear associations. Finally, we apply RFtest to two real microbiome datasets to determine whether the microbial communities are associated with the outcome variables.
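The idea of a permutation test whose statistic is the generalization error of a random forest can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes scikit-learn's RandomForestClassifier, a binary or categorical outcome, and the out-of-bag (OOB) misclassification rate as the estimate of generalization error (the abstract does not say how the error is estimated). The function names oob_error and rf_permutation_test are hypothetical, and the phylogenetic tree is not used here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def oob_error(X, y, n_trees=100, seed=0):
    """Out-of-bag misclassification rate of a random forest.

    Assumption: OOB error stands in for the generalization error.
    """
    forest = RandomForestClassifier(
        n_estimators=n_trees, oob_score=True, random_state=seed
    )
    forest.fit(X, y)
    return 1.0 - forest.oob_score_


def rf_permutation_test(X, y, n_perm=99, seed=0):
    """Permutation test of association between an abundance table X and outcome y.

    Returns the observed OOB error and a one-sided permutation p-value.
    """
    rng = np.random.default_rng(seed)
    observed = oob_error(X, y, seed=seed)

    # Null distribution: shuffling y breaks any link between X and the
    # outcome while keeping the structure of X intact.
    null_errors = np.array(
        [oob_error(X, rng.permutation(y), seed=seed) for _ in range(n_perm)]
    )

    # A small error is evidence of association, so count the permutations
    # whose error is at least as small as the observed one. The +1 terms
    # include the observed statistic itself in the reference set.
    p_value = (1 + np.sum(null_errors <= observed)) / (n_perm + 1)
    return observed, p_value
```

Here X would be a samples-by-taxa abundance matrix and y the outcome labels. The smallest attainable p-value is 1 / (n_perm + 1), so the number of permutations bounds the resolution of the test.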

Highlights

  • We propose a community-level test based on random forest (RFtest) for testing the associations between the microbiome and an outcome variable



Introduction

The microbiome, the collection of microorganisms and their genetic materials in an environment, has been intricately linked to human health (Gao et al, 2018; Gentile and Weir, 2018) and ecosystem functioning (Fierer, 2017). For the past three decades, marker gene sequencing has been the dominant approach for investigating the phylogenies and abundances of microbial groups (Weisburg et al, 1991), while shotgun metagenomics has become increasingly popular for studying the functional potential of the microbiome (Quince et al, 2017). Sequences produced by marker gene sequencing are usually quality-filtered, merged, and clustered into operational taxonomic units (OTUs) (Schloss et al, 2009; Edgar, 2013) or denoised into amplicon sequence variants (ASVs) (Callahan et al, 2016; Bharti and Grimm, 2021). These OTUs and ASVs are regarded as surrogates of microbial taxa. Downstream statistical analyses are based on the OTU/ASV abundance table, which records the frequencies of the detected OTUs/ASVs in each microbiome sample, together with a phylogenetic tree relating the OTUs/ASVs and metadata describing the characteristics of the samples.
