Abstract

Background: Microbiome studies often involve sequencing a marker gene to identify the microorganisms in samples of interest. Sequence classification is a critical component of this process, whereby sequences are assigned to a reference taxonomy containing known sequence representatives of many microbial groups. Previous studies have shown that existing classification programs often assign sequences to reference groups even if they belong to novel taxonomic groups that are absent from the reference taxonomy. This high rate of “over classification” is particularly detrimental in microbiome studies because reference taxonomies are far from comprehensive.

Results: Here, we introduce IDTAXA, a novel approach to taxonomic classification that employs principles from machine learning to reduce over classification errors. Using multiple reference taxonomies, we demonstrate that IDTAXA has higher accuracy than popular classifiers such as BLAST, MAPSeq, QIIME, SINTAX, SPINGO, and the RDP Classifier. Similarly, IDTAXA yields far fewer over classifications on Illumina mock microbial community data when the expected taxa are absent from the training set. Furthermore, IDTAXA offers many practical advantages over other classifiers, such as maintaining low error rates across varying input sequence lengths and withholding classifications from input sequences composed of random nucleotides or repeats.

Conclusions: IDTAXA’s classifications may lead to different conclusions in microbiome studies because of the substantially reduced number of taxa that are incorrectly identified through over classification. Although misclassification error is relatively minor, we believe that many remaining misclassifications are likely caused by errors in the reference taxonomy. We describe how IDTAXA is able to identify many putative mislabeling errors in reference taxonomies, enabling training sets to be automatically corrected by eliminating spurious sequences. IDTAXA is part of the DECIPHER package for the R programming language, available through the Bioconductor repository or accessible online (http://DECIPHER.codes).

Highlights

  • Microbiome studies often involve sequencing a marker gene to identify the microorganisms in samples of interest

  • The IDTAXA algorithm exhibits lower over classification error rates

  • We focused on the basal taxonomic rank in each training set for benchmarking classification accuracy because the basal rank is the most difficult to predict

  • Here, we have shown that IDTAXA substantially reduces false positive classifications of test sequences falling outside the scope of a training set


Summary

Introduction

Microbiome studies frequently involve sequencing a taxonomic marker, such as the 16S ribosomal RNA (rRNA) gene or internal transcribed spacer (ITS), to identify the microorganisms that are present in a sample. Previous studies have shown that existing classification programs often assign sequences to reference groups even when the sequences belong to novel taxonomic groups that are absent from the reference taxonomy. This high rate of “over classification” is detrimental in microbiome studies because reference taxonomies are far from comprehensive. Nearest neighbor methods are popular in part because of their simplicity and clearly defined basis for taxonomic assignment, but they frequently fail where taxonomic groups do not conform to standard distance cutoffs [6].
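
To make the over classification problem concrete, the following is a minimal Python sketch of a toy nearest neighbor classifier. It is not the IDTAXA algorithm and not the method of any classifier named above; the k-mer Jaccard distance, the function names, the toy sequences, and the 0.5 cutoff are all assumptions chosen purely for illustration. Without a cutoff, a query from an unrepresented group is still given the label of some reference group; with a cutoff, the classification is withheld.

```python
from typing import Dict, Optional, Set, Tuple


def kmers(seq: str, k: int = 4) -> Set[str]:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """Jaccard distance between two k-mer sets (1.0 if both are empty)."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def nearest_neighbor(
    query: str,
    reference: Dict[str, str],
    cutoff: Optional[float] = None,
    k: int = 4,
) -> Tuple[Optional[str], float]:
    """Assign the label of the closest reference sequence.

    If `cutoff` is given and the best distance exceeds it, the label is
    withheld (None) instead of forcing an assignment.
    """
    q = kmers(query, k)
    best_label: Optional[str] = None
    best_dist = float("inf")
    for label, ref_seq in reference.items():
        d = jaccard_distance(q, kmers(ref_seq, k))
        if d < best_dist:
            best_label, best_dist = label, d
    if cutoff is not None and best_dist > cutoff:
        return None, best_dist
    return best_label, best_dist


# Hypothetical toy reference taxonomy: one sequence per group.
reference = {
    "GenusA": "ACGTACGTTGCAACGTTAGC",
    "GenusB": "TTGACCGGTAACCGGTTAAC",
}
known = "ACGTACGTTGCAACGTTAGG"  # one base away from the GenusA reference
novel = "GGGCCCATATATCGCGCGAT"  # stands in for a group absent from the reference

print(nearest_neighbor(known, reference, cutoff=0.5))
print(nearest_neighbor(novel, reference))              # forced assignment
print(nearest_neighbor(novel, reference, cutoff=0.5))  # classification withheld
```

A single fixed cutoff like this one is exactly the kind of standard distance threshold that the passage says can fail when taxonomic groups differ in how much sequence variation they contain, so the sketch illustrates the problem rather than a solution.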

