Abstract
Prediction of taxonomy for marker gene sequences such as 16S ribosomal RNA (rRNA) is a fundamental task in microbiology. Most experimentally observed sequences are diverged from reference sequences of authoritatively named organisms, creating a challenge for prediction methods. I assessed the accuracy of several algorithms using cross-validation by identity, a new benchmark strategy which explicitly models the variation in distances between query sequences and the closest entry in a reference database. When the accuracy of genus predictions was averaged over a representative range of identities with the reference database (100%, 99%, 97%, 95% and 90%), all tested methods had ≤50% accuracy on the currently-popular V4 region of 16S rRNA. Accuracy was found to fall rapidly with identity; for example, better methods were found to have V4 genus prediction accuracy of ∼100% at 100% identity but ∼50% at 97% identity. The relationship between identity and taxonomy was quantified as the probability that a rank is the lowest shared by a pair of sequences with a given pair-wise identity. With the V4 region, 95% identity was found to be a twilight zone where taxonomy is highly ambiguous because the probabilities that the lowest shared rank between pairs of sequences is genus, family, order or class are approximately equal.
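The lowest common rank (LCR) probability described above can be illustrated with a minimal sketch. It is not the author's implementation: the function names, the rank list, the dict-based lineage format and the placeholder taxon names are all assumptions made for illustration, and the toy pairs are invented rather than taken from any database. The sketch estimates the probability as a simple frequency over annotated sequence pairs that share a given pair-wise identity.

```python
from collections import Counter, defaultdict

# Assumed rank order, highest to lowest (hypothetical; adjust to the taxonomy in use).
RANKS = ["domain", "phylum", "class", "order", "family", "genus"]


def lowest_common_rank(tax_a, tax_b):
    """Return the lowest rank at which two lineages agree, or None if none is shared.

    Each lineage is a dict mapping a rank name to a taxon name.
    """
    lcr = None
    for rank in RANKS:
        name_a, name_b = tax_a.get(rank), tax_b.get(rank)
        if name_a is None or name_a != name_b:
            break  # lineages diverge (or are unannotated) from this rank down
        lcr = rank
    return lcr


def lcr_probabilities(pairs):
    """Estimate P(LCR = rank | identity) as a frequency over annotated pairs.

    `pairs` is an iterable of (identity_percent, tax_a, tax_b) tuples.
    Pairs are grouped by exact identity value, which is a simplification.
    """
    counts = defaultdict(Counter)
    for identity, tax_a, tax_b in pairs:
        counts[identity][lowest_common_rank(tax_a, tax_b)] += 1
    return {
        identity: {rank: n / sum(c.values()) for rank, n in c.items()}
        for identity, c in counts.items()
    }


# Invented toy lineages with placeholder names, for illustration only.
base = {"domain": "D", "phylum": "P", "class": "C", "order": "O"}
seq_a = {**base, "family": "F1", "genus": "G1"}
seq_b = {**base, "family": "F1", "genus": "G2"}
seq_c = {**base, "family": "F1", "genus": "G1"}
seq_d = {**base, "family": "F2", "genus": "G3"}

toy_pairs = [
    (95, seq_a, seq_b),  # same family, different genus
    (95, seq_a, seq_c),  # same genus
    (95, seq_a, seq_d),  # same order, different family
    (95, seq_b, seq_d),  # same order, different family
]
probs = lcr_probabilities(toy_pairs)
```

In practice the pairs would come from pair-wise alignments of reference sequences with trusted annotations, and identities would normally be binned rather than matched exactly.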
Highlights
Next-generation sequencing of tags such as the 16S ribosomal RNA (rRNA) gene and fungal internal transcribed spacer (ITS) region has revolutionized the study of microbial communities in environments ranging from the human body (Cho & Blaser, 2012; Pflughoeft & Versalovic, 2012) to oceans (Moran, 2015) and soils (Hartmann et al, 2014).
Many taxonomy prediction algorithms have been developed, including the RDP Naive Bayesian Classifier (NBC) (Wang et al, 2007), GAST (Huse et al, 2008), the lowest common ancestor (LCA) method in MEGAN (Mitra, Stark & Huson, 2011), 16Sclassifier (Chaudhary et al, 2015), SPINGO (Allard et al, 2015), Metaxa2 (Bengtsson-Palme et al, 2015), SINTAX (Edgar, 2016), PROTAX (Somervuo et al, 2016), microclass (Liland, Vinje & Snipen, 2017), and methods implemented by the mothur (Schloss et al, 2009), QIIME v1 (Caporaso et al, 2010) and QIIME v2 packages.
Lowest common rank (LCR) probabilities have the advantage of independence from clustering methods and cluster quality metrics, which give conflicting results for optimal threshold values (Edgar, 2018a).
Summary
Next-generation sequencing of tags such as the 16S ribosomal RNA (rRNA) gene and fungal internal transcribed spacer (ITS) region has revolutionized the study of microbial communities in environments ranging from the human body (Cho & Blaser, 2012; Pflughoeft & Versalovic, 2012) to oceans (Moran, 2015) and soils (Hartmann et al, 2014). Many taxonomy prediction algorithms have been developed, including the RDP Naive Bayesian Classifier (NBC) (Wang et al, 2007), GAST (Huse et al, 2008), the lowest common ancestor (LCA) method in MEGAN (Mitra, Stark & Huson, 2011), 16Sclassifier (Chaudhary et al, 2015), SPINGO (Allard et al, 2015), Metaxa2 (Bengtsson-Palme et al, 2015), SINTAX (Edgar, 2016), PROTAX (Somervuo et al, 2016), microclass (Liland, Vinje & Snipen, 2017), and methods implemented by the mothur (Schloss et al, 2009), QIIME v1 (Caporaso et al, 2010) and QIIME v2 (https://qiime2.org) packages. Most taxonomies in the RDP database were predicted by the RDP NBC, while most taxonomies in Greengenes and SILVA were annotated by a combination of database-specific computational prediction methods and manual curation (McDonald et al, 2012; Yilmaz et al, 2014).