Abstract

Current methods to identify unknown insect (class Insecta) cytochrome c oxidase subunit I (COI barcode) sequences often rely on thresholds of distances that can be difficult to define, sequence similarity cut-offs, or monophyly. Some of the most commonly used metagenomic classification methods do not provide a measure of confidence for the taxonomic assignments they provide. The aim of this study was to use a naïve Bayesian classifier (Wang et al. Applied and Environmental Microbiology, 2007; 73: 5261) to automate taxonomic assignments for large batches of insect COI sequences such as data obtained from high-throughput environmental sequencing. This method provides rank-flexible taxonomic assignments with an associated bootstrap support value, and it is faster than the BLAST-based methods commonly used in environmental sequence surveys. We have developed and rigorously tested the performance of three different training sets using leave-one-out cross-validation, two field data sets, and targeted testing of Lepidoptera, Diptera and Mantodea sequences obtained from the Barcode of Life Data system. We found that type I error rates (incorrect taxonomic assignments with high bootstrap support) were already relatively low but could be lowered further by ensuring that all query taxa are actually present in the reference database. Choosing bootstrap support cut-offs according to query length and summarizing taxonomic assignments to more inclusive ranks can also help to reduce error while retaining the maximum number of assignments. Additionally, we highlight gaps in the taxonomic and geographic representation of insects in public sequence databases that will require further work by taxonomists to improve the quality of assignments generated using any method.

Highlights

  • Identification of insects collected from the field using morphological traits is time-consuming and requires specialist knowledge

  • This study shows that a naïve Bayesian classifier can be effectively applied to classify large numbers of mitochondrial cytochrome c oxidase subunit 1 (COI) barcode sequences from insects

  • We benchmarked the performance of our training sets using partial-length COI sequences, such as those commonly generated by next-generation sequencing (NGS) platforms, and provided bootstrap support cut-off guidelines
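
The idea of choosing a bootstrap support cut-off according to query length, and then reporting each assignment only to the most specific rank that still meets that cut-off, can be illustrated with a minimal Python sketch. The cut-off values, the function names (`cutoff_for_length`, `summarize_assignment`) and the example lineage below are hypothetical placeholders chosen for illustration; they are not the guidelines or the software from the study.

```python
from typing import List, Tuple

# One classified rank: (rank name, taxon name, bootstrap support in [0, 1]).
RankCall = Tuple[str, str, float]

# HYPOTHETICAL cut-offs: (minimum query length in bp, minimum bootstrap support).
# Shorter queries are given a stricter cut-off. Placeholder values only.
CUTOFFS_BY_LENGTH = [(400, 0.60), (200, 0.70), (0, 0.80)]


def cutoff_for_length(query_length: int) -> float:
    """Return the bootstrap cut-off for a query of the given length."""
    if query_length < 0:
        raise ValueError("query_length must be non-negative")
    for min_len, cutoff in CUTOFFS_BY_LENGTH:
        if query_length >= min_len:
            return cutoff
    raise AssertionError("unreachable: the last tier starts at 0 bp")


def summarize_assignment(lineage: List[RankCall], query_length: int) -> List[RankCall]:
    """Keep ranks, from most to least inclusive, until support drops below the cut-off."""
    cutoff = cutoff_for_length(query_length)
    accepted: List[RankCall] = []
    for rank, taxon, support in lineage:
        if support < cutoff:
            break  # stop here and report at the more inclusive rank
        accepted.append((rank, taxon, support))
    return accepted


# Illustrative lineage with made-up support values.
example = [
    ("class", "Insecta", 1.00),
    ("order", "Lepidoptera", 0.95),
    ("family", "Noctuidae", 0.75),
    ("genus", "Agrotis", 0.55),
]

short_read = summarize_assignment(example, query_length=150)  # stops at order
long_read = summarize_assignment(example, query_length=450)   # stops at family
```

Under these placeholder settings, the same lineage is reported to order for a 150 bp query and to family for a 450 bp query, which shows how a stricter cut-off for short reads trades resolution for fewer high-support errors.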


Summary

Introduction

Identification of insects collected from the field using morphological traits is time-consuming and requires specialist knowledge. As a result, the public databases (GenBank and BOLD) contain many insufficiently identified DNA barcode sequences (Kwong et al. 2012). The term ‘insufficiently identified’ was used by Nilsson et al. (2005) to describe sequences that are identified only to higher (more inclusive) taxonomic ranks, as opposed to ‘fully identified’ sequences that are identified to the species rank. What is urgently needed is a reference set of DNA barcode sequences from fully identified insects. As in many other fields, the main bottleneck in this process is not the production of sequences, but rather the accurate taxonomic identification of samples to the species rank by acknowledged specialists.
