Comparing K-mer based methods for improved classification of 16S sequences.

Hilde Vinje, Trygve Almøy, Kristian Hovde Liland, Lars Snipen

doi:10.1186/s12859-015-0647-4

Open Access

https://doi.org/10.1186/s12859-015-0647-4

Abstract

Background

The need for precise and stable taxonomic classification is highly relevant in modern microbiology. Parallel to the explosion in the amount of accessible sequence data, there has also been a shift in focus for classification methods. Previously, alignment-based methods were the most applicable tools. Now, methods based on counting K-mers by sliding windows are the most interesting classification approach with respect to both speed and accuracy. Here, we present a systematic comparison of five different K-mer based classification methods for the 16S rRNA gene. The methods differ from each other in both data usage and modelling strategy. We based our study on the well-known and widely used naïve Bayes classifier from the RDP project; four other methods were also implemented, and all were tested on two different data sets, on full-length sequences as well as fragments of typical read length.

Results

The differences in classification error between the methods seemed small, but they were stable across both data sets tested. The Preprocessed nearest-neighbour (PLSNN) method performed best for full-length 16S rRNA sequences, significantly better than the naïve Bayes RDP method. On fragmented sequences the naïve Bayes Multinomial method performed best, significantly better than all other methods. For both data sets explored, and on both full-length and fragmented sequences, all five methods reached an error plateau.

Conclusions

We conclude that no K-mer based method is universally best for classifying both full-length sequences and fragments (reads). All methods approach an error plateau, indicating that improved training data is needed to improve classification from here. Classification errors occur most frequently for genera with few sequences present. To improve the taxonomy and to test new classification methods, a better, more universal and more robust training data set is crucial.
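The idea of counting K-mers by sliding windows, and of feeding those counts to a multinomial naïve Bayes model, can be illustrated with a minimal Python sketch. This is not the RDP classifier or the authors' implementation: the function names, the add-one smoothing, the uniform prior over genera, the skipping of windows with ambiguous bases, and the toy sequences and genus labels are all assumptions made for illustration.

```python
from collections import Counter
from math import log


def count_kmers(sequence, k):
    """Count overlapping K-mers by sliding a window of width k along the sequence."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Assumption of this sketch: windows with non-ACGT symbols are skipped.
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def train_multinomial(labelled_seqs, k):
    """Pool K-mer counts per genus; store counts and a smoothed denominator."""
    pooled = {}
    for genus, seq in labelled_seqs:
        pooled.setdefault(genus, Counter()).update(count_kmers(seq, k))
    n_possible = 4 ** k  # number of distinct K-mers over the alphabet ACGT
    # Add-one (Laplace) smoothing: every possible K-mer gets one pseudo-count.
    return {g: (c, sum(c.values()) + n_possible) for g, c in pooled.items()}


def classify(model, sequence, k):
    """Return the genus with the highest multinomial log-likelihood (uniform prior)."""
    query = count_kmers(sequence, k)

    def score(genus):
        counts, denom = model[genus]
        return sum(n * log((counts[kmer] + 1) / denom) for kmer, n in query.items())

    return max(model, key=score)


# Toy usage with hypothetical sequences and genus labels.
training = [("GenusA", "AAAAAAAACCCC"), ("GenusB", "GGGGGGGGTTTT")]
model = train_multinomial(training, k=3)
predicted = classify(model, "AAAACC", k=3)
```

In this toy setting the query shares its K-mers only with the first training sequence, so `predicted` is `"GenusA"`. Real 16S training sets contain many sequences per genus and use a longer word length than this example.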

Highlights

The need for precise and stable taxonomic classification is highly relevant in modern microbiology
The exploration of microbial communities is a major focus in microbiology, opening new approaches to the study of microbiomes of humans and other organisms as well as the communities found in natural environments of air, water or soil [1]
The classification of 16S sequences obtained from some samples is a classical pattern recognition problem, i.e. recognizing some pattern in

Journal: BMC Bioinformatics
Publication Date: Jul 1, 2015
Citations: 34
License type: CC BY 4.0
