Abstract

Metagenomics yields new discoveries and insights into the uncultured microbial world. A fundamental task in metagenomic analysis is determining the taxonomy of raw sequence fragments. Modern sequencing technologies produce relatively short fragments in far greater numbers, which makes taxonomic classification considerably more difficult than before. Fast and accurate techniques are therefore needed to classify fragments at large scale. We propose EnSVM (Ensemble Support Vector Machine) and its extension EnSVMB (EnSVM with BLAST) to classify fragments accurately. EnSVM divides fragments into a large confident set and a small diffident set, according to whether a fragment receives consistent or inconsistent predictions from linear SVMs trained with different k-mers. An empirical study shows that the sensitivity and specificity of EnSVM on the confident set are higher than 90% and 97%, respectively, whereas on the diffident set they are lower than 60% and 75%. To improve performance on the diffident set, EnSVMB uses the best BLAST hits to reclassify the fragments in that set. Experimental results show that EnSVM can efficiently and effectively divide fragments into confident and diffident sets, and that EnSVMB achieves higher accuracy, higher sensitivity, and more true positives than related state-of-the-art methods while maintaining specificity comparable to the best of them.
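The split into confident and diffident sets, followed by BLAST-based reclassification, can be illustrated with a minimal Python sketch. It assumes that "consistent" means every classifier predicts the same label, and that BLAST best hits are already available as a lookup table; the function names `split_by_agreement` and `reclassify_with_best_hits` are hypothetical and are not taken from the paper.

```python
def split_by_agreement(predictions):
    """Split fragments by whether all classifiers agree.

    predictions maps a fragment id to a list of labels, one label per
    classifier (for example, one linear SVM per k-mer size).
    Assumption: "consistent" is read here as unanimous agreement.
    """
    confident = {}   # fragment id -> agreed label
    diffident = []   # fragment ids with conflicting labels
    for frag_id, labels in predictions.items():
        if len(set(labels)) == 1:
            confident[frag_id] = labels[0]
        else:
            diffident.append(frag_id)
    return confident, diffident


def reclassify_with_best_hits(diffident, best_hits, fallback=None):
    """Assign each diffident fragment the label of its best alignment hit.

    best_hits is a hypothetical, precomputed mapping from fragment id to
    the label of the top-scoring hit; fragments without a hit get fallback.
    """
    return {frag_id: best_hits.get(frag_id, fallback) for frag_id in diffident}


predictions = {
    "frag1": ["E. coli", "E. coli", "E. coli"],
    "frag2": ["E. coli", "B. subtilis", "E. coli"],
}
confident, diffident = split_by_agreement(predictions)
relabelled = reclassify_with_best_hits(diffident, {"frag2": "B. subtilis"})
```

In this toy input, `frag1` lands in the confident set and `frag2` is deferred to the alignment lookup.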

Highlights

  • Metagenomic fragment classification assigns each fragment to a corresponding species

  • We evaluate the performance of EnSVMB with different parameters

  • On the medium dataset, the accuracy of VW decreases from 85.24% to 84.29%, that of the naive Bayesian classifier (NBC) from 75.45% to 71.95%, Kraken from 84.33% to 79.45%, BWA from 81.57% to 78.20%, BLAST from 83.71% to 83.60%, and EnSVMB from 88.04% to 87.36%


Summary

Introduction

Metagenomic fragment classification assigns a fragment to a corresponding species (or taxonomy). Many computational methods have been proposed to determine the taxonomy of fragments automatically. These methods fall roughly into two categories: alignment-based and composition-based. Alignment-based methods use alignment tools (e.g., BLAST[2]) to align fragments to known reference sequences and assign each fragment to a species based on the best match[3, 4]. Composition-based methods usually assign fragments based on their k-mer signatures. Traditional kNN suffers from the curse of dimensionality when the dimensionality of the k-mer profiles is high[10]. To address this problem, TACOA[10] introduces a Gaussian kernel to extend traditional kNN and applies the extended kNN to fragment classification. PhyloPythia[14] takes the oligonucleotide

www.nature.com/scientificreports/
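To make the notion of a k-mer signature concrete, the following Python sketch computes a normalized k-mer frequency profile for a DNA fragment. It is an illustration under stated assumptions, not the feature extraction used by any of the cited tools: the name `kmer_profile`, the restriction to the alphabet ACGT, and the choice to skip k-mers containing other characters are all assumptions made here.

```python
from itertools import product


def kmer_profile(seq, k):
    """Return the normalized k-mer frequency vector of a DNA sequence.

    The vector has 4**k entries, ordered lexicographically over "ACGT".
    Assumption: k-mers containing any other character (such as N) are skipped.
    """
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    total = 0
    # Slide a window of width k across the sequence and count each k-mer.
    for start in range(len(seq) - k + 1):
        pos = index.get(seq[start:start + k])
        if pos is not None:
            counts[pos] += 1
            total += 1
    if total == 0:
        return [0.0] * len(index)
    return [c / total for c in counts]


profile = kmer_profile("ACGTAC", 2)
```

The profile length grows as 4^k (16 entries for k = 2, 4096 for k = 6), which is the source of the high dimensionality mentioned above.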
