Abstract

Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model for instance) are already available in RNA-Seq focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory and use simulations and empirical data to demonstrate substantial improvements provided by a relevant mixture model framework over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. We further compare different Negative Binomial methods with a recently-described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but it nevertheless tends toward a higher false positive rate. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq.
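To make the two normalizations discussed above concrete, the following base R sketch applies simple proportions and rarefying (subsampling every library down to the smallest library size) to a small, hypothetical OTU count table. The counts, the rarefy_one helper, and all parameter values are illustrative assumptions, not data or code from this study.

```r
# Minimal sketch (base R, hypothetical toy counts) of the two normalizations
# criticized above: simple proportions and rarefying.
set.seed(123)
counts <- matrix(rnbinom(4 * 3, mu = c(50, 5, 500, 20), size = 1), nrow = 4,
                 dimnames = list(paste0("OTU", 1:4), paste0("Sample", 1:3)))

# (1) Simple proportions: divide each sample by its library size.
#     Library-size differences are removed, but each proportion carries a
#     different sampling variance, which downstream tests then ignore.
props <- sweep(counts, 2, colSums(counts), "/")

# (2) Rarefying: subsample every library down to the smallest library size,
#     discarding reads (and potentially whole samples) in the process.
depth <- min(colSums(counts))
rarefy_one <- function(x, depth) {
  reads <- rep(seq_along(x), times = x)        # expand counts into individual reads
  tabulate(sample(reads, depth), nbins = length(x))
}
rarefied <- apply(counts, 2, rarefy_one, depth = depth)
colSums(rarefied)                              # all equal to `depth`
```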

Highlights

  • Modern, massively parallel DNA sequencing technologies have changed the scope and technique of investigations across many fields of biology [1,2]

  • Even though the statistical methods available for analyzing microarray data have matured to a high level of sophistication [8], these methods are not directly applicable because DNA sequencing data consists of discrete counts of sequence reads rather than continuous values derived from the fluorescence intensity of hybridized probes

  • In recent-generation DNA sequencing, the total number of reads per sample can vary by orders of magnitude within a single sequencing run (see the short sketch after this list)

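As a quick illustration of the library-size variation mentioned in the last highlight, the sketch below (base R, simulated counts; all values are hypothetical) computes per-sample library sizes as column sums of an OTU count table and reports their fold-range.

```r
# Hypothetical OTU count table: rows are OTUs, columns are samples from one run.
set.seed(1)
otu <- matrix(rnbinom(5 * 6, mu = 100, size = 0.3), nrow = 5,
              dimnames = list(paste0("OTU", 1:5), paste0("Sample", 1:6)))

lib_sizes <- colSums(otu)        # total reads per sample ("library size")
lib_sizes
max(lib_sizes) / min(lib_sizes)  # fold-difference across samples in the same run
```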


Introduction

Massively parallel DNA sequencing technologies have changed the scope and technique of investigations across many fields of biology [1,2]. In gene expression studies the standard measurement technique has shifted away from microarray hybridization to direct sequencing of cDNA, a technique often referred to as RNA-Seq [3]. Comparison across samples with different library sizes requires more than a simple linear or logarithmic scaling adjustment, because a difference in library size also implies a difference in the uncertainty of each measurement, as quantified by the sampling variance of the proportion estimate for each feature (a feature is a gene in the RNA-Seq context, and a species or Operational Taxonomic Unit, OTU, in the context of microbiome sequencing). A Gamma mixture of Poisson variables gives the Negative Binomial (NB) distribution [10,11], and several RNA-Seq analysis packages model the count $K_{ij}$ for gene $i$ in sample $j$ according to

$$K_{ij} \sim \mathrm{NB}(s_j \mu_i,\, \phi_i)$$

where $s_j$ is a scaling factor accounting for the library size of sample $j$, $\mu_i$ is the mean proportion for gene $i$, and $\phi_i$ is the dispersion parameter for gene $i$.
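The following base R sketch (with assumed, purely illustrative parameter values) demonstrates the two points made above: a Gamma mixture of Poisson variables reproduces the Negative Binomial distribution, and counts across samples can then be simulated under the model $K_{ij} \sim \mathrm{NB}(s_j \mu_i, \phi_i)$.

```r
# Minimal sketch (base R, assumed parameters) of the Gamma-Poisson mixture and
# the NB count model described in the paragraph above.
set.seed(42)

# (1) Gamma mixture of Poissons vs. a direct Negative Binomial draw.
n   <- 1e5
mu  <- 50    # mean count
phi <- 0.25  # dispersion, so that Var = mu + phi * mu^2
lambda   <- rgamma(n, shape = 1 / phi, scale = mu * phi)  # Gamma-distributed rates
gp_mix   <- rpois(n, lambda)                              # Poisson draws given each rate
nb_draws <- rnbinom(n, size = 1 / phi, mu = mu)
c(mean(gp_mix), mean(nb_draws))  # both close to mu
c(var(gp_mix),  var(nb_draws))   # both close to mu + phi * mu^2

# (2) Simulated counts for one feature i across samples j with library-size
#     scaling factors s_j (values here are purely illustrative).
s_j  <- c(0.5, 1, 2, 4)                               # relative library sizes
K_ij <- rnbinom(length(s_j), size = 1 / phi, mu = s_j * mu)
K_ij
```

The key point the simulation makes is that the mixture and the NB distribution agree in both mean and variance, so the NB model simultaneously captures library-size scaling (through $s_j$) and biological overdispersion (through $\phi_i$).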
