Bioconductor workflow for microbiome data analysis: from raw reads to community analyses

Ben J Callahan, Paul J McMurdie, Julia A Fukuyama, Kris Sankaran, Susan P Holmes

doi:10.12688/f1000research.8986.1


Open Access



Journal: F1000Research	Publication Date: Jun 24, 2016
Citations: 592	License type: CC BY 4.0

Affiliation: Stanford University

Abstract

High-throughput sequencing of PCR-amplified taxonomic markers (like the 16S rRNA gene) has enabled a new level of analysis of complex bacterial communities known as microbiomes. Many tools exist to quantify and compare abundance levels or microbial composition of communities in different conditions. The sequencing reads have to be denoised and assigned to the closest taxa from a reference database. Common approaches use a notion of 97% similarity and normalize the data by subsampling to equalize library sizes. In this paper, we show that statistical models allow more accurate abundance estimates. By providing a complete workflow in R, we enable the user to do sophisticated downstream statistical analyses, including both parametric and nonparametric methods. We provide examples of using the R packages dada2, phyloseq, DESeq2, ggplot2 and vegan to filter, visualize and test microbiome data. We also provide examples of supervised analyses using random forests, partial least squares and linear models as well as nonparametric testing using community networks and the ggnetwork package.
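To make concrete the "subsampling to equalize library sizes" that the abstract describes as the common approach, here is a minimal sketch of that normalization. It illustrates the practice the paper argues against, not the paper's own R workflow. It is written in Python for illustration only; the sample names, taxon names, counts and the `rarefy` helper are all hypothetical.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample one sample's reads, without replacement, down to `depth` reads.

    `counts` maps a taxon label to its read count (hypothetical toy data).
    """
    total = sum(counts.values())
    if depth > total:
        raise ValueError("depth exceeds the sample's library size")
    # Expand the count table into one label per read.
    reads = [taxon for taxon, n in counts.items() for _ in range(n)]
    # Draw `depth` reads without replacement; the seed makes the draw repeatable.
    kept = random.Random(seed).sample(reads, depth)
    # Re-tabulate the retained reads, keeping taxa that dropped to zero.
    out = {taxon: 0 for taxon in counts}
    for taxon in kept:
        out[taxon] += 1
    return out


# Two toy samples with very different library sizes (1000 vs 100 reads).
samples = {
    "sample_A": {"taxon_1": 500, "taxon_2": 300, "taxon_3": 200},
    "sample_B": {"taxon_1": 60, "taxon_2": 30, "taxon_3": 10},
}

# Equalize every sample to the smallest library size.
depth = min(sum(c.values()) for c in samples.values())
rarefied = {name: rarefy(c, depth) for name, c in samples.items()}

for name, c in rarefied.items():
    print(name, sum(c.values()), c)
```

In this toy case 900 of the 1000 reads in `sample_A` are discarded to match `sample_B`, which shows how much data this style of normalization can throw away.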

Highlights

The microbiome is formed from the ecological communities of microorganisms that dominate the living world
Previous standard workflows depended on clustering all 16S rRNA sequences that occur within a 97% radius of similarity and assigning these to ‘Operational Taxonomic Units’ (OTUs) from reference trees [1,2]
We have shown how a complete workflow in R is available to denoise, identify and normalize next-generation amplicon sequencing reads using probabilistic models with parameters fit using the data at hand

Summary

Leo Lahti Finland

Zachary Charlop-Powers, The Rockefeller University, New York, USA. University of California, San Francisco, San Francisco, USA. Any reports and responses or comments on the article can be found at the end of the article. This article is included in the Bioconductor gateway. This article is included in the Phylogenetics collection.

Introduction

Methods

Conclusions


Findings
