Separating metagenomic short reads into genomes via clustering

Olga Tanaseichuk, Tao Jiang, James Borneman

doi:10.1186/1748-7188-7-27

Open Access

https://doi.org/10.1186/1748-7188-7-27

Abstract

Background: The metagenomics approach allows the simultaneous sequencing of all genomes in an environmental sample. This results in high-complexity datasets in which, in addition to repeats and sequencing errors, the number of genomes and their abundance ratios are unknown. Recently developed next-generation sequencing (NGS) technologies significantly improve sequencing efficiency and cost. On the other hand, they produce shorter reads, which makes the separation of reads from different species harder. Among the existing computational tools for metagenomic analysis, there are similarity-based methods that use reference databases to align reads and composition-based methods that use composition patterns (i.e., frequencies of short words or l-mers) to cluster reads. Similarity-based methods are unable to classify reads from unknown species without close references (which constitute the majority of reads). Since composition patterns are preserved only in sufficiently large fragments, composition-based tools cannot be used for very short reads, which becomes a significant limitation with the development of NGS. A recently proposed algorithm, AbundanceBin, introduced another method that bins reads based on predicted abundances of the genomes sequenced. However, it does not separate reads from genomes of similar abundance levels.

Results: In this work, we present a two-phase heuristic algorithm for separating short paired-end reads from different genomes in a metagenomic dataset. We use the observation that most of the l-mers belong to unique genomes when l is sufficiently large. The first phase of the algorithm results in clusters of l-mers, each of which belongs to one genome. During the second phase, clusters are merged based on l-mer repeat information. These final clusters are used to assign reads. The algorithm can handle very short reads and sequencing errors. It was initially designed for genomes with similar abundance levels and then extended to handle arbitrary abundance ratios. The software can be downloaded for free at http://www.cs.ucr.edu/~tanaseio/toss.htm.

Conclusions: Our tests on a large number of simulated metagenomic datasets concerning species at various phylogenetic distances demonstrate that genomes can be separated if the number of common repeats is smaller than the number of genome-specific repeats. For such genomes, our method can separate NGS reads with high precision and sensitivity.
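The observation that most l-mers belong to a single genome once l is large enough can be illustrated with a small sketch. This is not the authors' algorithm or software; the helper names (`lmer_set`, `shared_fraction`) and the two toy sequences are assumptions made up for illustration, and real genomes are of course far longer.

```python
def lmer_set(seq, l):
    """Return the set of all length-l substrings (l-mers) of a sequence."""
    return {seq[i:i + l] for i in range(len(seq) - l + 1)}


def shared_fraction(genome_a, genome_b, l):
    """Fraction of distinct l-mers that occur in both sequences.

    Computed as |A & B| / |A | B| over the two l-mer sets; returns 0.0
    when neither sequence is long enough to contain an l-mer.
    """
    a = lmer_set(genome_a, l)
    b = lmer_set(genome_b, l)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy "genomes", invented for this example only.
toy_a = "ACGTACGTTGCA"
toy_b = "TTGCAGGCATCG"

for l in (2, 4, 6):
    print(l, shared_fraction(toy_a, toy_b, l))
```

On these toy sequences the shared fraction drops as l grows, which is the intuition behind clustering l-mers by genome. How large l must be for real data depends on genome sizes and repeat content, which this sketch does not model.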

Highlights

The metagenomics approach allows the simultaneous sequencing of all genomes in an environmental sample.
Due to the lack of appropriate short-read clustering tools for comparison, we modify a well-known genome assembly software, Velvet [26], to make it behave like a genome separation tool and compare our clustering results with those of the modified Velvet.
Although similarity-based methods work on short reads, they explore the taxonomic content of metagenomic data according to known genomes rather than classifying reads.

Summary

Introduction

The metagenomics approach allows the simultaneous sequencing of all genomes in an environmental sample. A recently proposed algorithm, AbundanceBin, introduced another method that bins reads based on predicted abundances of the genomes sequenced. However, it does not separate reads from genomes of similar abundance levels. Many well-known metagenomics projects use the whole genome shotgun sequencing approach in combination with Sanger sequencing technologies. This approach has produced datasets from the Sargasso Sea [4], Human Gut Microbiome [5] and Acid Mine Drainage Biofilm [6]. The only drawback of NGS technologies is that read length is reduced: NGS reads are usually 25-150 bp long (Illumina/SOLiD), compared to 800-1000 bp for Sanger reads.


Journal: Algorithms for Molecular Biology
Publication Date: Sep 26, 2012
Citations: 19
License type: CC BY 2.0

Similar Papers

Separating Metagenomic Short Reads into Genomes via Clustering
Olga Tanaseichuk ... Tao Jiang
01 Jan 2010

CUSHAW Suite: Parallel and Efficient Algorithms for NGS Read Alignment
Yongchao Liu ... Bertil Schmidt
01 Jan 2017

MetaObtainer: A Tool for Obtaining Specified Species from Metagenomic Reads of Next-generation Sequencing.
Weihua Pan ... Yun Xu
Interdisciplinary sciences, computational life sciences | VOL. 7
21 Aug 2015

Factors that affect large subunit ribosomal DNA amplicon sequencing studies of fungal communities: classification method, primer choice, and error.
Teresita M Porter ... G Brian Golding
PloS one | VOL. 7
27 Apr 2012
