Normalization and microbial differential abundance strategies depend upon data characteristics

Sophie Weiss, Shyamal Peddada, Yoshiki Vázquez-Baeza, Rob Knight, Amnon Amir, Antonio Gonzalez, Embriette R Hyde, Zhenjiang Zech Xu, Jesse R Zaneveld, Catherine Lozupone, Amanda Birmingham, Kyle Bittinger

doi:10.1186/s40168-017-0237-y

Abstract

Background: Data from 16S ribosomal RNA (rRNA) amplicon sequencing present challenges to ecological and statistical interpretation. In particular, library sizes often vary over several orders of magnitude, and the data contain many zeros. Although we are typically interested in comparing the relative abundance of taxa in the ecosystems of two or more groups, we can only measure taxon relative abundance in specimens obtained from those ecosystems. First, because comparing taxon relative abundance in the specimen is not equivalent to comparing taxon relative abundance in the ecosystem, this presents a special challenge. Second, because the relative abundances of taxa in the specimen (as well as in the ecosystem) sum to 1, these are compositional data. Because compositional data are constrained to the simplex (they sum to 1) rather than lying freely in Euclidean space, many standard methods of analysis are not applicable. Here, we evaluate how these challenges impact the performance of existing normalization methods and differential abundance analyses.

Results: Effects on normalization: Most normalization methods enable successful clustering of samples according to biological origin when the groups differ substantially in their overall microbial composition. Rarefying more clearly clusters samples according to biological origin than other normalization techniques do for ordination metrics based on presence or absence. Alternate normalization measures are potentially vulnerable to artifacts due to library size.

Effects on differential abundance testing: We build on previous work to evaluate seven proposed statistical methods using rarefied as well as raw data. Our simulation studies suggest that the false discovery rates of many differential abundance-testing methods are not increased by rarefying itself, although rarefying does result in a loss of sensitivity because a portion of the available data is eliminated. For groups with large (~10×) differences in the average library size, rarefying lowers the false discovery rate. DESeq2, without addition of a constant, increases sensitivity on smaller datasets (<20 samples per group) but tends towards a higher false discovery rate with more samples, very uneven (~10×) library sizes, and/or compositional effects. For drawing inferences regarding taxon abundance in the ecosystem, analysis of composition of microbiomes (ANCOM) is not only very sensitive (for >20 samples per group) but also, critically, the only method tested that has good control of the false discovery rate.

Conclusions: These findings guide which normalization and differential abundance techniques to use based on the data characteristics of a given study.
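As a rough illustration of the rarefying step discussed in the abstract, the sketch below subsamples each sample's reads without replacement down to a common depth and drops samples that fall below that depth. It is a minimal stdlib-only sketch, not the authors' pipeline: the function name `rarefy`, the chosen depth of 100, and the three example samples are all hypothetical.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample one sample's OTU counts to `depth` reads without replacement.

    Returns None when the sample has fewer than `depth` reads, mirroring the
    common practice of discarding samples below the chosen depth.
    """
    if sum(counts) < depth:
        return None
    rng = random.Random(seed)
    # Expand the count vector into one entry per read, labelled by OTU index.
    reads = [otu for otu, n in enumerate(counts) for _ in range(n)]
    kept = rng.sample(reads, depth)  # sampling without replacement
    rarefied = [0] * len(counts)
    for otu in kept:
        rarefied[otu] += 1
    return rarefied


# Hypothetical samples (illustrative values only) with very uneven library sizes.
deep = [500, 300, 150, 40, 10]   # 1000 reads
shallow = [60, 30, 10, 0, 0]     # 100 reads
tiny = [5, 3, 0, 0, 0]           # 8 reads, below the chosen depth

rarefied_deep = rarefy(deep, 100)
rarefied_shallow = rarefy(shallow, 100)
rarefied_tiny = rarefy(tiny, 100)
```

After this step every retained sample has the same library size, which is the point of the procedure; the cost is visible too, since 900 of the deep sample's reads and the whole of the smallest sample are thrown away.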

Highlights

Data from 16S ribosomal RNA amplicon sequencing present challenges to ecological and statistical interpretation
Alternatives to rarefying recommend discarding low-depth samples if they cluster separately from the rest of the data [4, 42]. These results demonstrate that previous microbiome ordinations using rarefying as a normalization method likely clustered compared to newer techniques, especially if some low-depth samples were removed
Of methods for normalizing microbial data for ordination analysis, we found that DESeq normalization [30, 42], which was developed for RNA-Seq data and makes use of a log-like transformation, does not work well with ecologically useful metrics, except weighted UniFrac [58]

Summary

Introduction

Data from 16S ribosomal RNA (rRNA) amplicon sequencing present challenges to ecological and statistical interpretation. Following initial quality control steps to account for errors in the sequencing process, microbial community sequencing data are typically organized into large matrices in which the columns represent samples and the rows contain observed counts of clustered sequences, commonly known as operational taxonomic units (OTUs), that represent bacterial types. These tables are often referred to as OTU tables. Sparsity, together with the fact that researchers want to draw inferences about taxon abundance in the ecosystem from specimen-level data, represents a serious challenge for interpreting data from microbial survey studies.
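To make the table layout concrete, here is a small sketch of an OTU table with OTUs as rows and samples as columns, along with the three quantities the text keeps returning to: library size per sample, the fraction of zero cells (sparsity), and relative abundances that sum to 1 within each sample. The table values, the OTU and sample names, and the helper functions are hypothetical and chosen only for illustration.

```python
# Hypothetical OTU table: each row is an OTU, each column is a sample.
samples = ["S1", "S2", "S3"]
otu_table = {
    "OTU_1": [120, 0, 15],
    "OTU_2": [0, 0, 3],
    "OTU_3": [880, 40, 0],
    "OTU_4": [0, 10, 0],
}


def library_sizes(table, n_samples):
    """Total read count per sample (column sums)."""
    return [sum(row[j] for row in table.values()) for j in range(n_samples)]


def sparsity(table):
    """Fraction of cells in the table that are zero."""
    cells = [count for row in table.values() for count in row]
    return sum(1 for count in cells if count == 0) / len(cells)


def relative_abundance(table, n_samples):
    """Divide each count by its sample's library size (assumes no empty sample)."""
    sizes = library_sizes(table, n_samples)
    return {
        otu: [row[j] / sizes[j] for j in range(n_samples)]
        for otu, row in table.items()
    }


sizes = library_sizes(otu_table, len(samples))
zero_fraction = sparsity(otu_table)
rel = relative_abundance(otu_table, len(samples))
# Per-sample sums of relative abundance; each equals 1 up to rounding,
# which is the compositional (simplex) constraint.
column_sums = [sum(rel[otu][j] for otu in rel) for j in range(len(samples))]
```

In this toy table the library sizes span more than an order of magnitude and half of the cells are zero, which mirrors, on a very small scale, the two data characteristics the paper highlights.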



Journal: Microbiome
Publication Date: Mar 3, 2017
Citations: 1458
License type: open-access
