Measurement of a Barcode’s Accuracy in Identifying Species

John L Spouge

doi:10.1007/978-3-319-41840-7_2

Abstract

This chapter describes a workflow for measuring a barcode’s accuracy when identifying species. First, assemble a database of specimens with their marker sequences and their species binomials. The species binomials provide a “taxonomic gold standard” for species identification and should be as accurate as possible, to avoid penalizing correct species assignment. Second, select a computer algorithm for assigning species to barcode sequences. Only one algorithm (BLAST+P) has improved notably on the simple strategy of assigning each specimen to the species of the nearest database sequence(s) under p-distance (the proportion of aligned sites at which two sequences differ). Global sequence alignments (e.g., with the Needleman-Wunsch algorithm, or with multiple sequence alignment algorithms) align entire barcode sequences, using all available information, so they sometimes produce more accurate species identifications than local sequence alignments (e.g., with BLAST), particularly when BLAST aligns only small subsequences within the barcode sequences. Finally, consensus has settled on “the probability of correct identification” (PCI) as the appropriate measurement of species identification accuracy. The overall PCI for a data set is the average of the species PCIs, taken over all species in the data set. The chapter discusses some variant PCIs, their calculation, and the estimation of their statistical sampling errors. It also discusses good practice in incorporating PCR failure and species with singleton representatives into data summaries. For software relevant to this chapter, see http://tinyurl.com/spouge-barcode.
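
The workflow summarized above can be illustrated with a minimal Python sketch. It is not the chapter's software, and several choices in it are assumptions made only for illustration: sequences are taken to be already globally aligned to equal length, each specimen is scored by leave-one-out against the rest of the database, a tie between species at the minimum p-distance counts as a failed identification, and a species PCI is the fraction of that species' specimens identified correctly. The function names (`p_distance`, `assign_species`, `overall_pci`) and the toy species labels are hypothetical.

```python
from collections import defaultdict


def p_distance(a, b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: score only sites where both sequences carry an unambiguous base.
    pairs = [(x, y) for x, y in zip(a, b) if x in "ACGT" and y in "ACGT"]
    if not pairs:
        return 1.0
    return sum(x != y for x, y in pairs) / len(pairs)


def assign_species(query, database):
    """Species of the nearest database sequence(s); None if several species tie."""
    distances = [(p_distance(query, seq), species) for species, seq in database]
    best = min(d for d, _ in distances)
    nearest = {species for d, species in distances if d == best}
    # Assumption: an ambiguous nearest neighbor is treated as a failure.
    return nearest.pop() if len(nearest) == 1 else None


def overall_pci(specimens):
    """Leave-one-out species PCIs and their unweighted average over species."""
    outcomes = defaultdict(list)
    for i, (species, seq) in enumerate(specimens):
        rest = specimens[:i] + specimens[i + 1:]
        outcomes[species].append(assign_species(seq, rest) == species)
    species_pci = {sp: sum(hits) / len(hits) for sp, hits in outcomes.items()}
    return species_pci, sum(species_pci.values()) / len(species_pci)


# Toy database of (species label, aligned barcode); labels are hypothetical.
specimens = [
    ("sp_A", "AAAAAAAA"),
    ("sp_A", "AAAAAAAT"),
    ("sp_B", "CCCCCCCC"),
    ("sp_B", "CCCCCCCA"),
    ("sp_C", "AAAAAATT"),  # singleton: no conspecific remains after leave-one-out
]

species_pci, overall = overall_pci(specimens)
print(species_pci)
print(overall)
```

On this toy data the second sp_A specimen is equidistant from an sp_A and an sp_C sequence, so it counts as a failure under the tie rule, giving species PCIs of 0.5, 1.0, and 0.0 and an overall PCI of 0.5. The singleton sp_C necessarily scores 0 under leave-one-out, which is one reason singleton species need explicit handling in data summaries; the chapter's own treatment may differ from this sketch.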

Full Text