Abstract

Background: The problem of supervised DNA sequence classification arises in several fields of computational molecular biology. Although this problem has been extensively studied, it is still computationally challenging due to the size of the datasets that modern sequencing technologies can produce.

Results: We introduce CLARK, a novel approach to classify metagenomic reads at the species or genus level with high accuracy and high speed. Extensive experimental results on various metagenomic samples show that the classification accuracy of CLARK is better than or comparable to the best state-of-the-art tools, and that it is significantly faster than any of its competitors. In its fastest single-threaded mode, CLARK classifies, with high accuracy, about 32 million metagenomic short reads per minute. CLARK can also classify BAC clones or transcripts to chromosome arms and centromeric regions.

Conclusions: CLARK is a versatile, fast and accurate sequence classification method, especially useful for metagenomics and genomics applications. It is freely available at http://clark.cs.ucr.edu/.

Electronic supplementary material: The online version of this article (doi:10.1186/s12864-015-1419-2) contains supplementary material, which is available to authorized users.

Highlights

  • The problem of supervised DNA sequence classification arises in several fields of computational molecular biology

  • Once all k-spectrums of target sequences have been collected in the index, CLARK removes any common k-mers between targets

  • Performance statistics for several choices of the k-mer length for NBC, KRAKEN, CLARK and their fast variants on the classification of the “HiSeq”, “MiSeq”, “simBA-5” and “simHC.20.500” metagenomic datasets against the 695 genus-level targets; precision and sensitivity are expressed as percentages, while speed is expressed in 10^3 reads per minute; KRAKEN-Q and CLARK-E are faster, but less accurate, variants of these tools; CLARK-l is a less memory-intensive version of CLARK which runs only for k = 27; experiments were carried out in single-threaded mode; ∗ parameter k is referred to as N in the NBC manuscript
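
The index-building step mentioned in the highlights (collect the k-spectrum of every target, then remove any k-mer shared between targets) can be illustrated with a small Python sketch. This is not CLARK's implementation: the function names, the toy sequences, and the rule of assigning a read to the target with the most matching k-mers are assumptions made only for illustration, and details such as reverse complements and memory-efficient storage are ignored.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return the k-spectrum of seq as a set of all its length-k substrings."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_discriminative_index(targets, k):
    """Map every k-mer found in exactly one target to that target's name.

    k-mers shared by two or more targets are dropped from the index.
    """
    owners = defaultdict(set)
    for name, seq in targets.items():
        for kmer in kmers(seq, k):
            owners[kmer].add(name)
    return {
        kmer: next(iter(names))
        for kmer, names in owners.items()
        if len(names) == 1
    }


def classify(read, index, k):
    """Assign read to the target with the most matching k-mers (assumed rule).

    Returns None when no k-mer matches or when the top two targets tie.
    """
    hits = Counter()
    for i in range(len(read) - k + 1):
        target = index.get(read[i:i + k])
        if target is not None:
            hits[target] += 1
    ranked = hits.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Toy targets (hypothetical sequences, not real genomes).
targets = {"A": "ACGTACGTTT", "B": "ACGTGGGCCC"}
index = build_discriminative_index(targets, k=4)

print("ACGT" in index)                # the shared k-mer has been removed
print(classify("CGTACG", index, 4))   # only k-mers unique to target A match
print(classify("ACGT", index, 4))     # only the shared k-mer: unclassified
```

Because shared k-mers are removed up front, every k-mer left in the index points to exactly one target, so a lookup during classification needs no further disambiguation.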


Summary

Results

We introduce CLARK, a novel approach to classify metagenomic reads at the species or genus level with high accuracy and high speed. Extensive experimental results on various metagenomic samples show that the classification accuracy of CLARK is better than or comparable to the best state-of-the-art tools, and that it is significantly faster than any of its competitors. In its fastest single-threaded mode, CLARK classifies, with high accuracy, about 32 million metagenomic short reads per minute. CLARK can also classify BAC clones or transcripts to chromosome arms and centromeric regions.

