Abstract

To analyze the complex biodiversity of microbial communities, 16S rRNA marker gene sequences are often assigned to operational taxonomic units (OTUs). The abundance of methods for assigning 16S rRNA marker gene sequences to OTUs has prompted debate over which one is better. Some researchers suggest that clustering methods should be stable, meaning that the generated OTU assignments do not change as additional sequences are added to the dataset; others contend that it is more important for the methods to properly represent the distances between sequences. We add one more de novo clustering algorithm, Rolling Snowball, to the existing ones, which include single linkage, complete linkage, average linkage, abundance-based greedy clustering, distance-based greedy clustering, and Swarm, as well as the open- and closed-reference methods. We use the GreenGenes, RDP, and SILVA 16S rRNA gene databases to show the success of the method. The highest accuracy is obtained with the SILVA library.

Highlights

  • Metagenomics is a recent and highly popular field that studies the genomic content of microbial communities living in particular environments and seeks to understand the structure and function of these communities by sequencing genomic fragments from environmental samples, without the need to cultivate them in a laboratory (Huttenhower et al, 2012; Qin et al, 2010)

  • This type of clustering is referred to as phylotyping (Schloss & Westcott, 2011) or closed-reference clustering (Navas-Molina et al, 2013). This approach compares sequence reads to a reference database and clusters reads that are similar to the same reference sequence into the same operational taxonomic unit (OTU)

  • In de novo clustering (Navas-Molina et al, 2013), also referred to as distance-based clustering (Schloss & Westcott, 2011), the distance between sequences is used to bin sequences into OTUs, rather than a reference database being used to calculate distances
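
The contrast between closed-reference and de novo clustering described above can be illustrated with a minimal Python sketch. It is not the Rolling Snowball algorithm or any published tool's implementation. The function names (`identity`, `closed_reference`, `de_novo_greedy`), the toy reads, and the 0.9 threshold are all hypothetical, and reads are assumed to be pre-aligned to equal length so that identity is a simple per-position match fraction.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length, pre-aligned reads."""
    if len(a) != len(b):
        raise ValueError("reads must be pre-aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closed_reference(reads, references, threshold):
    """Bin each read with its best-matching reference; reads below threshold stay unassigned."""
    otus, unassigned = {}, []
    for read in reads:
        best_id, best_score = None, 0.0
        for ref_id, ref_seq in references.items():
            score = identity(read, ref_seq)
            if score > best_score:
                best_id, best_score = ref_id, score
        if best_id is not None and best_score >= threshold:
            otus.setdefault(best_id, []).append(read)
        else:
            unassigned.append(read)
    return otus, unassigned


def de_novo_greedy(reads, threshold):
    """Greedy centroid clustering: a read joins the first centroid it matches, else seeds a new OTU."""
    centroids, otus = [], []
    for read in reads:
        for i, centroid in enumerate(centroids):
            if identity(read, centroid) >= threshold:
                otus[i].append(read)
                break
        else:  # no existing centroid was close enough
            centroids.append(read)
            otus.append([read])
    return otus


# Toy data (hypothetical): two reads near each reference, one read near neither.
reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "TTTTGGGGCA", "GGGGGGGGGG"]
references = {"refA": "ACGTACGTAC", "refB": "TTTTGGGGCC"}

closed_otus, leftover = closed_reference(reads, references, threshold=0.9)
denovo_otus = de_novo_greedy(reads, threshold=0.9)
```

In this toy run the closed-reference pass leaves the read that resembles no reference unassigned, whereas the de novo pass gives it an OTU of its own. The de novo result also depends on the order in which reads are processed, which is one way the stability question raised in the abstract can arise.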


Summary

INTRODUCTION

Metagenomics is a recent and highly popular field that studies the genomic content of microbial communities living in particular environments and seeks to understand the structure and function of these communities by sequencing genomic fragments from environmental samples, without the need to cultivate them in a laboratory (Huttenhower et al, 2012; Qin et al, 2010). Rapid development in next-generation sequencing (NGS) has made it possible to directly sequence huge numbers of DNA/RNA fragments extracted from environmental samples such as the human gut, marine water, or soil in a reasonable time (Eisen, 2011). It has made sequencing faster and far more economical, providing a unique opportunity to study the microbial diversity of many complex environments at a much lower cost (Desai et al, 2013). To reduce the complexity of the large datasets generated by NGS technologies, sequences are clustered into meaningful bins. These bins are called operational taxonomic units (OTUs) and are used to study the biodiversity within and between different samples (Schloss & Westcott, 2011). Popular reference databases include the Ribosomal Database Project (RDP) (Cole et al, 2009), Greengenes (DeSantis et al, 2006), SILVA (Pruesse et al, 2007), NCBI (Federhen, 2012), Open Tree of Life Taxonomy (OTT) (Hinchliff et al, 2014), and UNITE (Kõljalg et al, 2013).

Closed Reference Approach
De novo Approach
Open Reference Approach
Greedy Heuristic Clustering
Hierarchical Clustering
Model-based Clustering
Denoising
Taxonomy Prediction
PROBLEM STATEMENT
BACKGROUND
MATERIALS AND METHODS
Findings
Signal Similarity
