Abstract

Molecular microbial ecology investigations often employ large marker gene datasets, for example, ribosomal RNAs, to represent the occurrence of single-cell genomes in microbial communities. Massively parallel DNA sequencing technologies enable extensive surveys of marker gene libraries that sometimes include nearly identical sequences. Computational approaches that rely on pairwise sequence alignments for similarity assessment and de novo clustering with de facto similarity thresholds to partition high-throughput sequencing datasets constrain fine-scale resolution descriptions of microbial communities. Minimum Entropy Decomposition (MED) provides a computationally efficient means to partition marker gene datasets into ‘MED nodes’, which represent homogeneous operational taxonomic units. By employing Shannon entropy, MED uses only the information-rich nucleotide positions across reads and iteratively partitions large datasets while omitting stochastic variation. When applied to analyses of microbiomes from two deep-sea cryptic sponges, Hexadella dendritifera and Hexadella cf. dendritifera, MED resolved a key Gammaproteobacteria cluster into multiple MED nodes that are specific to different sponges, and revealed that these closely related sympatric sponge species maintain distinct microbial communities. MED analysis of a previously published human oral microbiome dataset also revealed that taxa separated by less than 1% sequence variation were distributed across distinct niches in the oral cavity. The information theory-guided decomposition process behind the MED algorithm enables sensitive discrimination of closely related organisms in marker gene amplicon datasets without relying on extensive computational heuristics and user supervision.
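As a rough illustration of the Shannon entropy idea mentioned above, the following minimal sketch computes the entropy of each nucleotide position across a set of equal-length reads. The function names, the toy reads, and the choice of base-2 logarithm are assumptions made for this example and are not taken from the MED implementation.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the characters observed at one position."""
    counts = Counter(column)
    total = len(column)
    # Sum of p * log2(1/p) over the observed characters; 0.0 for a uniform column.
    return sum((n / total) * math.log2(total / n) for n in counts.values())

def positional_entropy(reads):
    """Entropy of every position across reads of identical length."""
    if len({len(read) for read in reads}) != 1:
        raise ValueError("reads must be non-empty and all the same length")
    # zip(*reads) yields one tuple of characters per position.
    return [column_entropy(column) for column in zip(*reads)]

# Hypothetical toy reads: positions 0 and 2 are invariant, 1 and 3 vary.
toy_reads = ["ACGT", "ACGT", "ATGT", "ATGA"]
for position, entropy in enumerate(positional_entropy(toy_reads)):
    print(position, round(entropy, 3))
```

Positions where every read carries the same nucleotide have zero entropy and carry no information for partitioning, while positions with more balanced variation score higher.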

Highlights

  • Marker gene analyses of microbial diversity require categorizing DNA sequences into ecologically meaningful units

  • Oligotyping relies on information-rich variable sites and discards low-entropy nucleotide positions in a group of sequencing reads; the Minimum Entropy Decomposition algorithm iteratively partitions a dataset of amplicon sequences into homogeneous operational taxonomic units (OTUs)

  • Minimum Entropy Decomposition (MED) identified organisms in two example datasets that differ by only a few nucleotides, yet distribute differently across environments, and recapitulated published oligotyping results


Summary

Introduction

Marker gene analyses of microbial diversity require categorizing DNA sequences into ecologically meaningful units. By relying on information-rich variable sites and discarding low-entropy nucleotide positions in a group of sequencing reads, oligotyping facilitates the identification of closely related but distinct organisms that may differ by as little as one nucleotide. The Minimum Entropy Decomposition algorithm iteratively partitions a dataset of amplicon sequences into homogeneous OTUs (‘MED nodes’) that serve as input to alpha- and beta-diversity analyses. The algorithm can detect biologically meaningful differences between closely or distantly related sequences in large datasets without requiring CPU-intensive alignment. The parameter c defines the maximum number of nucleotide positions with entropy values greater than m that are used to decompose each node. MED uses M to filter noise as described for oligotyping (Eren et al., 2013a): if the abundance of the most abundant unique sequence in a node is smaller than the user-defined value of M, MED removes that node from the analysis. Reads were trimmed from the 3′ end using a sliding window of average quality score (The Human Microbiome Project Consortium, 2012a).
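The roles of c, m and M described above can be sketched as a small recursive routine. This is an illustrative approximation, not the published MED implementation: the function names, the toy reads, the base-2 logarithm, and the point at which the M filter is applied are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def positional_entropy(reads):
    """Shannon entropy (bits) of each position across equal-length reads."""
    total = len(reads)
    return [
        sum((n / total) * math.log2(total / n) for n in Counter(column).values())
        for column in zip(*reads)
    ]

def decompose(reads, m, c, M):
    """Split reads into homogeneous nodes; returns a list of read lists."""
    if m < 0:
        raise ValueError("m must be non-negative")
    if not reads:
        return []
    # Noise filter (assumed placement): drop the node when its most
    # abundant unique sequence occurs fewer than M times.
    if Counter(reads).most_common(1)[0][1] < M:
        return []
    entropies = positional_entropy(reads)
    # Keep at most c positions whose entropy exceeds m, highest entropy first.
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    positions = [i for i in ranked if entropies[i] > m][:c]
    if not positions:
        return [reads]  # no informative position left: this node is final
    # Group reads by the nucleotides they carry at the selected positions.
    children = defaultdict(list)
    for read in reads:
        children[tuple(read[i] for i in positions)].append(read)
    nodes = []
    for child in children.values():
        nodes.extend(decompose(child, m, c, M))
    return nodes

# Hypothetical toy dataset: two abundant variants and one singleton.
toy_reads = ["ACGT"] * 6 + ["ATGT"] * 5 + ["ATGA"]
nodes = decompose(toy_reads, m=0.2, c=1, M=2)
print([len(node) for node in nodes])  # prints [6, 5]
```

In this toy run the singleton variant forms its own node after the second split and is then discarded by the M filter, leaving two homogeneous nodes. Lowering M to 1 would retain it as a third node.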
