A reference cytochrome c oxidase subunit I database curated for hierarchical classification of arthropod metabarcoding data.

Rodney T Richardson, Mary M Gardiner, Reed M Johnson, Johan Bengtsson-Palme

doi:10.7717/peerj.5126

Open Access

Journal: PeerJ	Publication Date: Jun 26, 2018
Citations: 15	License type: CC BY 4.0

Affiliation: The Ohio State University, University of Gothenburg

Abstract

Metabarcoding is a popular application which warrants continued methods optimization. To maximize barcoding inferences, hierarchy-based sequence classification methods are increasingly common. We present methods for the construction and curation of a database designed for hierarchical classification of a 157 bp barcoding region of the arthropod cytochrome c oxidase subunit I (COI) locus. We produced a comprehensive arthropod COI amplicon dataset including annotated arthropod COI sequences and COI sequences extracted from arthropod whole mitochondrion genomes, the latter of which provided the only source of representation for Zoraptera, Callipodida and Holothyrida. The database contains extracted sequences of the target amplicon from all major arthropod clades, including all insect orders, all arthropod classes and Onychophora, Tardigrada and Mollusca outgroups. During curation, we extracted the COI region of interest from approximately 81 percent of the input sequences, corresponding to 73 percent of the genus-level diversity found in the input data. Further, our analysis revealed a high degree of sequence redundancy within the NCBI nucleotide database, with a mean of approximately 11 sequence entries per species in the input data. The curated, low-redundancy database is included in the Metaxa2 sequence classification software (http://microbiology.se/software/metaxa2/). Using this database with the Metaxa2 classifier, we performed a cross-validation analysis to characterize the relationship between the Metaxa2 reliability score, an estimate of classification confidence, and classification error probability. We used this analysis to select a reliability score threshold which minimized error. We then estimated classification sensitivity, false discovery rate and overclassification, the propensity to classify sequences from taxa not represented in the reference database. 
Our work will help researchers design and evaluate classification databases and conduct metabarcoding on arthropods and alternate taxa.
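The threshold selection described in the abstract can be illustrated with a minimal sketch. It assumes the cross-validation results are available as (reliability score, correct or not) pairs, that sensitivity is the fraction of all test sequences classified correctly at or above the threshold, and that the false discovery rate is the fraction of accepted classifications that are wrong. These definitions, the function names and the selection rule (lowest threshold meeting a target false discovery rate) are assumptions for illustration, not the procedure used in the paper.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical record type: (reliability score, classification was correct)
Record = Tuple[float, bool]


def evaluate_threshold(records: List[Record], threshold: float) -> Tuple[float, float]:
    """Return (sensitivity, false discovery rate) at a score threshold."""
    # Keep only classifications whose score meets the threshold.
    accepted = [correct for score, correct in records if score >= threshold]
    true_pos = sum(accepted)
    false_pos = len(accepted) - true_pos
    # Assumed definitions: sensitivity over all test sequences,
    # false discovery rate over accepted classifications only.
    sensitivity = true_pos / len(records) if records else 0.0
    fdr = false_pos / len(accepted) if accepted else 0.0
    return sensitivity, fdr


def pick_threshold(
    records: List[Record], candidates: Iterable[float], max_fdr: float
) -> Optional[float]:
    """Return the lowest candidate threshold whose FDR is at most max_fdr."""
    for threshold in sorted(candidates):
        _, fdr = evaluate_threshold(records, threshold)
        if fdr <= max_fdr:
            return threshold
    return None  # no candidate met the target


# Toy cross-validation results (made up for illustration).
toy_records: List[Record] = [
    (95, True), (90, True), (80, True), (75, False),
    (70, True), (60, False), (55, False), (50, True),
]
chosen = pick_threshold(toy_records, [50, 60, 70, 80, 90], max_fdr=0.25)
```

Raising the threshold trades sensitivity for a lower false discovery rate, which is why the choice has to be made against held-out sequences rather than fixed in advance.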

Highlights

With the increasing availability of high-throughput DNA sequencing, scientists with a wide diversity of backgrounds and interests are increasingly utilizing this technology to achieve a variety of goals.
Upon analyzing the representativeness of this initial database across arthropod classes and insect orders, we found that amplicon sequences from two insect orders, Strepsiptera and Embioptera, were not present in the curated database, likely due to their poor sequence similarity to the reference sequence used to designate the amplicon barcode region of interest.
We obtained 199,206 reference amplicon sequences belonging to 51,416 arthropod species.

Summary

Introduction

With the increasing availability of high-throughput DNA sequencing, scientists with a wide diversity of backgrounds and interests are increasingly utilizing this technology to achieve a variety of goals. Using universal primers designed to amplify conserved genomic regions across a broad diversity of taxonomic groups of interest, researchers are afforded the opportunity to survey biological communities at unprecedented scales. While such advancements hold great promise for improving our knowledge of the biological world, they present new challenges to the scientific community. Researchers continue to utilize a diversity of methods to draw taxonomic inferences from amplicon sequence data. Relative to alignment-based nearest-neighbor and lowest common ancestor-type classification approaches, methods involving hierarchical classification of DNA sequences are popular as they are often designed to estimate the probabilistic confidence of taxonomic inferences at each taxonomic rank. Studies explicitly examining the accuracy of classification confidence estimates are rare (Somervuo et al., 2016).
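To make the idea of rank-wise confidence concrete, the following sketch shows one generic way a hierarchical assignment can be reported: the lineage is walked from the broadest rank downward and cut at the first rank whose confidence falls below a chosen minimum. The data layout, the function name and the example scores are hypothetical, and this is not a description of how any particular classifier computes or applies its confidence values.

```python
from typing import List, Tuple

# Hypothetical layout: (rank, taxon name, confidence estimate for that rank)
RankCall = Tuple[str, str, float]


def truncate_lineage(calls: List[RankCall], min_confidence: float) -> List[Tuple[str, str]]:
    """Keep ranks from the top of the hierarchy down to the last confident one."""
    kept: List[Tuple[str, str]] = []
    for rank, taxon, confidence in calls:
        if confidence < min_confidence:
            break  # everything below an unreliable rank is dropped as well
        kept.append((rank, taxon))
    return kept


# Made-up confidence values for a single amplicon sequence.
example_calls: List[RankCall] = [
    ("class", "Insecta", 99.0),
    ("order", "Hymenoptera", 97.5),
    ("family", "Apidae", 88.0),
    ("genus", "Bombus", 61.0),
    ("species", "Bombus impatiens", 40.0),
]
reported = truncate_lineage(example_calls, min_confidence=80.0)
```

With these toy values the sequence is reported to family level only, which reflects the appeal of hierarchical methods noted above: a sequence need not be discarded outright when its finer-rank assignment is uncertain.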

Methods

Results

Discussion

Conclusion