Abstract

The accuracy of DNA barcode databases is critical for research and practical applications. Here we apply a frequency matrix to assess sequencing errors in a very large set of avian BARCODEs. Using 11,000 sequences from 2,700 bird species, we show most avian cytochrome c oxidase I (COI) nucleotide and amino acid sequences vary within a narrow range. Except for third codon positions, nearly all (96%) sites were highly conserved or limited to two nucleotides or two amino acids. A large number of positions had very low frequency variants present in single individuals of a species; these were strongly concentrated at the ends of the barcode segment, consistent with sequencing error. In addition, a small fraction (0.1%) of BARCODEs had multiple very low frequency variants shared among individuals of a species; these were found to represent overlooked cryptic pseudogenes lacking stop codons. The calculated upper limit of sequencing error was 8 × 10⁻⁵ errors/nucleotide, which was relatively high for direct Sanger sequencing of amplified DNA, but unlikely to compromise species identification. Our results confirm the high quality of the avian BARCODE database and demonstrate significant quality improvement in avian COI records deposited in GenBank over the past decade. This approach has potential application for genetic database quality control, discovery of cryptic pseudogenes, and studies of low-level genetic variation.
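As a rough illustration of the frequency matrix idea, the sketch below tabulates per-position nucleotide frequencies across a set of aligned sequences and flags variants that fall below a cutoff. It is not the authors' implementation: the function names, the 0.001 default threshold, and the handling of ambiguous bases are all assumptions made for this example.

```python
from collections import Counter


def frequency_matrix(aligned_seqs, alphabet="ACGT"):
    """Return one dict per alignment column mapping each base to its frequency.

    Symbols outside `alphabet` (e.g. N or gaps) are ignored; this is an
    assumption of the sketch, not a rule taken from the paper.
    """
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    matrix = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned_seqs if s[col] in alphabet)
        total = sum(counts.values())
        matrix.append({b: (counts[b] / total if total else 0.0) for b in alphabet})
    return matrix


def low_frequency_variants(aligned_seqs, threshold=0.001, alphabet="ACGT"):
    """List (sequence_index, position, base) for bases rarer than `threshold`.

    The default threshold is a hypothetical cutoff chosen for illustration.
    """
    matrix = frequency_matrix(aligned_seqs, alphabet)
    hits = []
    for i, seq in enumerate(aligned_seqs):
        for pos, base in enumerate(seq):
            if base in alphabet and matrix[pos][base] < threshold:
                hits.append((i, pos, base))
    return hits


if __name__ == "__main__":
    # Toy alignment: nine identical sequences and one with a rare final base.
    toy = ["ACGT"] * 9 + ["ACGA"]
    print(low_frequency_variants(toy, threshold=0.2))
```

In a real analysis, flagged variants would then be examined for where they fall in the barcode segment and whether they recur among individuals of a species, the two patterns the abstract associates with sequencing error and cryptic pseudogenes respectively.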

Highlights

  • Beginning in 2003, researchers have been building a library of short genetic identifiers – DNA barcodes – for all animal, plant, and fungal species [1,2]

  • The agreed-upon standard DNA barcode for animals is a 648 base pair region encompassing 216 codons of cytochrome c oxidase I (COI), which contains enough sequence diversity to separate most species and is relatively easy to amplify from most taxa using a limited set of primers [4,5,6]

  • Most nucleotide and amino acid positions in the COI barcode region were more than 99.9% conserved (Table 1, Fig. 2)

Introduction

Beginning in 2003, researchers have been building a library of short genetic identifiers – DNA barcodes – for all animal, plant, and fungal species [1,2]. The effort aims to simplify species identification, including for specimens missing diagnostic features (e.g. fragments and immature or vegetative forms) or when taxonomic expertise is not available [3]. The agreed-upon standard DNA barcode for animals is a 648 base pair (bp) region encompassing 216 codons of cytochrome c oxidase I (COI), which contains enough sequence diversity to separate most species and is relatively easy to amplify from most taxa using a limited set of primers [4,5,6]. COI barcodes represent the largest, most taxonomically diverse set of mitochondrial sequences presently available, with approximately 260,000 records from 37,000 animal species in GenBank under keyword BARCODE. The next largest set of mtDNA sequences in GenBank is cytochrome b, with 157,000 records from 26,000 species. Advantages of the BARCODE standard include a minimum of 500 bp from a defined region, linkage to museum specimens, and publicly archived trace files documenting a minimum quality score [4].
