Abstract

Next-generation sequencing (NGS)-based 16S rRNA sequencing, which combines PCR amplification with NGS technology, has been used successfully to study the phylogeny and taxonomy of samples from complex environments. The first step in many downstream analyses is clustering 16S rRNA sequences into operational taxonomic units (OTUs). Heuristic clustering, in which one or more seed sequences are selected to represent each cluster, is one of the most widely employed approaches for generating OTUs. In this work, we choose five random seeds for each cluster from a gene library and present a novel distance measure for clustering the bacteria in a sample. Artificially created sets of 16S rRNA genes selected from databases are clustered with more than 98% accuracy, sensitivity, and specificity.

Highlights

  • Bacteria play an important role in human health and disease [1]

  • The most widely used biomarker for describing microbial communities is the 16S ribosomal RNA (rRNA) marker gene, with sequences generated by high-throughput sequencing technology [4]

  • There are two major approaches to binning 16S rRNA sequences: (a) taxonomy-dependent methods, where each query sequence is compared against a reference taxonomy database and assigned to the organism of the best-matched annotated sequence using sequence searching [11] or classification [12][13], and (b) taxonomy-independent methods [14], where sequences are grouped into operational taxonomic units (OTUs) based on pairwise sequence similarities

Summary

Introduction

Bacteria play an important role in human health and disease [1]. In addition, they have an essential role in various biogeochemical activities. Hierarchical clustering methods such as mothur [17], HPC-CLUST [19], ESPRIT [20], and mcClust [21] require a distance matrix, which is computed over all sequence pairs after pairwise sequence alignment or a multiple sequence alignment. These methods still carry a high computational burden [26]. For this reason, hierarchical, model-based, and network-based clustering methods quickly run into limits on computational time and memory usage when dealing with large-scale sequencing data [17].
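
To illustrate why the all-pairs distance matrix becomes a bottleneck, the following minimal Python sketch fills such a matrix for a few toy reads and counts the pairs involved. It assumes the sequences are already aligned to equal length and uses a plain mismatch fraction as a stand-in distance; this is neither the distance measure proposed in this work nor the one used by the tools cited above, and the names `mismatch_distance`, `distance_matrix`, `pair_count`, and `toy_reads` are hypothetical.

```python
from itertools import combinations


def mismatch_distance(a: str, b: str) -> float:
    """Fraction of differing positions between two pre-aligned, equal-length sequences.

    Assumption: a simple stand-in distance chosen for illustration only.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(seqs: list[str]) -> list[list[float]]:
    """Symmetric n x n matrix holding the distance for every pair of sequences."""
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    # Every unordered pair is visited once, so the work grows as n * (n - 1) / 2.
    for i, j in combinations(range(n), 2):
        d = mismatch_distance(seqs[i], seqs[j])
        matrix[i][j] = matrix[j][i] = d
    return matrix


def pair_count(n: int) -> int:
    """Number of pairwise comparisons needed for n sequences."""
    return n * (n - 1) // 2


# Hypothetical toy reads, not real 16S rRNA data.
toy_reads = ["ACGTACGT", "ACGTACGA", "TTGTACGT", "ACGAACGA"]
matrix = distance_matrix(toy_reads)
```

Four reads need only `pair_count(4) == 6` comparisons, but one million reads would need roughly 5 × 10^11, and the matrix itself would hold 10^12 entries. This quadratic growth in both time and memory is the limitation described above.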

Materials and methods
Results
Conclusion
