DMSC: A Dynamic Multi-Seeds Method for Clustering 16S rRNA Sequences Into OTUs.

Ze-Gang Wei, Shao-Wu Zhang

doi:10.3389/fmicb.2019.00428

Abstract

Next-generation sequencing (NGS)-based 16S rRNA sequencing, which combines PCR amplification with NGS technology, is a cost-effective technique that has been successfully used to study the phylogeny and taxonomy of samples from complex microbiomes or environments. Clustering 16S rRNA sequences into operational taxonomic units (OTUs) is often the first step for many downstream analyses. Heuristic clustering is one of the most widely employed approaches for generating OTUs. However, most heuristic OTU clustering methods select a single seed sequence to represent each cluster, so their results suffer from either overestimation of the number of OTUs or sensitivity to sequencing errors. In this paper, we present a novel dynamic multi-seeds clustering method, DMSC, to pick OTUs. DMSC first heuristically generates clusters according to the distance threshold. When the size of a cluster reaches the pre-defined minimum size, DMSC selects the multi-core sequences (MCS) as the seeds; the MCS are defined as n core sequences (n ≥ 3) in which the distance between any two sequences is less than the distance threshold. A new sequence is assigned to the corresponding cluster depending on its average distance to the MCS and the standard deviation of the distances within the MCS. Each time a new sequence is added to a cluster, the MCS are updated dynamically, until no further sequence is merged into the cluster. DMSC was tested on several simulated and real-life sequence datasets and compared with traditional heuristic methods such as CD-HIT, UCLUST, and DBH. Experimental results in terms of the inferred number of OTUs, normalized mutual information (NMI), and Matthews correlation coefficient (MCC) demonstrate that DMSC produces higher-quality clusters with low memory usage and reduces OTU overestimation. DMSC is also robust to sequencing errors. The DMSC software can be freely downloaded from https://github.com/NWPU-903PR/DMSC.

Highlights

Bacteria are the most diverse domain on our planet and play an essential role in various biogeochemical activities as well as an important role in human health and disease (Fuks et al, 2018).
The two major approaches for binning 16S ribosomal RNA (rRNA) sequences are: (i) taxonomy dependent methods, where each query sequence is compared against a reference taxonomy database and assigned to the organism of the best-matched annotated sequence using sequence searching (Altschul et al, 1990) or classification (Liu et al, 2017, 2018), and (ii) taxonomy independent methods (Chen et al, 2013b), where sequences are grouped into operational taxonomic units (OTUs) based on pairwise sequence similarities.
The DMSC method has four main phases: (i) sequences are processed one by one and heuristically clustered according to the distance threshold θ, forming a series of clusters; (ii) when the size of a cluster reaches the pre-defined minimum number of sequences (η), the multi-core sequences (MCS) are selected as the seeds; (iii) a new sequence is assigned to the corresponding cluster according to its average distance to the MCS and the standard deviation (σ) of the pairwise distances within the MCS; and (iv) after a new sequence is added to a cluster, the MCS are updated.
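The four phases above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several pieces are assumptions made only for illustration: `toy_distance` is a simple mismatch fraction standing in for a real sequence distance, the acceptance rule `mean distance to MCS <= theta - sigma` is one plausible way to combine the average distance with σ (the text does not state the exact rule), the MCS are chosen by brute force as the tightest valid n-tuple, and a sequence joins the first cluster that accepts it. All function and parameter names (`dmsc_sketch`, `select_mcs`, `accepts`, `n_core`) are hypothetical.

```python
from dataclasses import dataclass, field
from itertools import combinations
from statistics import mean, pstdev
from typing import List, Optional, Tuple


def toy_distance(a: str, b: str) -> float:
    """Assumed stand-in distance: mismatch fraction, length difference counted as mismatches."""
    mismatches = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return mismatches / max(len(a), len(b))


@dataclass
class Cluster:
    seed: str                                  # first sequence, used until MCS exist
    members: List[str] = field(default_factory=list)
    mcs: Optional[Tuple[str, ...]] = None      # multi-core sequences (phase ii)


def select_mcs(members: List[str], theta: float, n_core: int) -> Optional[Tuple[str, ...]]:
    """Pick n_core members whose pairwise distances are all < theta.

    Assumption: among valid tuples, keep the one with the smallest mean
    pairwise distance. Brute force, so only suitable for tiny examples.
    """
    best, best_score = None, None
    for combo in combinations(members, n_core):
        dists = [toy_distance(a, b) for a, b in combinations(combo, 2)]
        if max(dists) < theta:
            score = mean(dists)
            if best_score is None or score < best_score:
                best, best_score = combo, score
    return best


def accepts(cluster: Cluster, seq: str, theta: float) -> bool:
    """Phase (i) test against the single seed, or phase (iii) test against the MCS."""
    if cluster.mcs is None:
        return toy_distance(seq, cluster.seed) <= theta
    to_core = mean([toy_distance(seq, s) for s in cluster.mcs])
    sigma = pstdev([toy_distance(a, b) for a, b in combinations(cluster.mcs, 2)])
    # Assumed rule: tighten the threshold by the spread within the MCS.
    return to_core <= theta - sigma


def dmsc_sketch(seqs: List[str], theta: float, eta: int, n_core: int = 3) -> List[Cluster]:
    clusters: List[Cluster] = []
    for seq in seqs:
        for c in clusters:
            if accepts(c, seq, theta):
                c.members.append(seq)
                # Phases (ii) and (iv): select or refresh the MCS once the
                # cluster holds at least eta sequences.
                if len(c.members) >= eta:
                    c.mcs = select_mcs(c.members, theta, n_core)
                break
        else:
            # No existing cluster accepted the sequence: it seeds a new one.
            clusters.append(Cluster(seed=seq, members=[seq]))
    return clusters


if __name__ == "__main__":
    reads = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATA",
             "AAAAAAATAA", "CCCCCCCCCC", "CCCCCCCCCG"]
    for c in dmsc_sketch(reads, theta=0.3, eta=3):
        print(len(c.members), c.mcs)
```

The threshold `theta=0.3` in the demo is chosen only so that ten-base toy reads cluster visibly; it is not a value taken from the paper.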

Summary

Introduction

Bacteria are the most diverse domain on our planet and play an essential role in various biogeochemical activities as well as an important role in human health and disease (Fuks et al, 2018). Bypassing the necessity of isolating single organisms for cultivation, advanced sequencing technology can produce millions of 16S rRNA sequences and has become a powerful tool for in-depth analysis of bacterial community composition (Zhang et al, 2013; Wei and Zhang, 2018). The two major approaches for binning 16S rRNA sequences are: (i) taxonomy dependent methods, where each query sequence is compared against a reference taxonomy database and assigned to the organism of the best-matched annotated sequence using sequence searching (Altschul et al, 1990) or classification (Liu et al, 2017, 2018), and (ii) taxonomy independent methods (also called de novo clustering) (Chen et al, 2013b), where sequences are grouped into OTUs based on pairwise sequence similarities. De novo clustering methods divide sequences into OTUs without needing any reference database and have become the preferred choice for researchers (Cai et al, 2017).

Methods

Results

Discussion

Conclusion