The Use of Mean Instead of Smallest Interspecific Distances Exaggerates the Size of the “Barcoding Gap” and Leads to Misidentification

Guanyang Zhang, Farhan Ali, Kelly Zamudio, Rudolf Meier

doi:10.1080/10635150802406343

Abstract

DNA barcoding is one of the best-funded and most visible large-scale initiatives in systematic biology and has generated both much interest and much controversy. It has also attracted significant support from foundations that had previously shown little interest in systematics. Yet the project is controversial because many systematists feel that the conceptual foundation of DNA barcoding remains weak. This problem can only be alleviated through additional research that leads to improved tools and concepts. Here, we scrutinize a key concept of DNA barcoding, the so-called barcoding gap (Meyer and Paulay, 2005), and use empirical data to document that it needs to be computed based on the smallest instead of the mean interspecific distances. In the literature on DNA barcoding, the “barcoding gap” (Meyer and Paulay, 2005) refers to the separation between mean intra- and interspecific sequence variability for congeneric COI sequences. The barcoding gap is so essential to barcoding that a widely cited publication was dedicated to documenting these gaps across major metazoan taxa (Hebert et al., 2003b). It is also regularly mentioned in articles promoting barcoding to a broader audience (Check, 2005; Cognato and Caesar, 2006; Dasmahapatra and Mallet, 2006) and is one of the few metrics included in the Web-based identification system BOLD, “The Barcode of Life Data System,” which is a major identification tool for the DNA barcoding community (http://www.barcodinglife.org; Ratnasingham and Hebert, 2007).
Large barcoding gaps are routinely used to predict DNA-barcoding success for the taxon under study (Hebert et al., 2003a, 2003b, 2004a, 2004b; Hogg and Hebert, 2004; Powers, 2004; Zehner et al., 2004; Armstrong and Ball, 2005; Ball et al., 2005; Barrett and Hebert, 2005; Lorenz et al., 2005; Saunders, 2005; Smith et al., 2005, 2006; Ward et al., 2005; Cywinska et al., 2006; Hajibabaei et al., 2006a, 2006b; Lefebure et al., 2006; Clare et al., 2007; Seifert et al., 2007). However, here we argue and document that barcoding gaps are currently computed incorrectly and that the values reported in the barcoding literature are misleading. The main problem is that the barcoding gap is generally quantified as the difference between intraspecific and mean interspecific congeneric distances, whereas we will argue here that for species identification only the smallest interspecific distance should be used. Other authors have also pointed out that the use of smallest interspecific distances would be more appropriate (see Sperling, 2003; Moritz and Cicero, 2004; Vences et al., 2005a, 2005b; Cognato, 2006; Meier et al., 2006; Meyer and Paulay, 2005; Roe and Sperling, 2007), but currently we lack a comparative study that documents that the two measures yield different results. Here we provide evidence based on 43,137 COI sequences from 12,459 metazoan species that barcoding gaps based on mean interspecific distances are artificially inflated and that only smallest interspecific distances correctly reflect that species identification becomes more difficult as more species are sampled. Using DNA barcodes for species identification is analogous to identifying an unidentified specimen by comparing it to a reference collection of identified specimens. Initially one may compare an unidentified specimen to all identified material in the same genus, but ultimately the identification problem pares down to deciding whether a specimen belongs to one of a few, very similar, congeneric species.
Determining an unidentified specimen to species is straightforward if the intraspecific variability is small (i.e., the unidentified specimen is a good match to a referenced species) and the difference between the best-matching species and the next-best match is large (i.e., the specimen is a good match to only one of the referenced species). Analogously, the ease with which a query sequence can be identified to species depends only on how different it is from the most similar allospecific sequence, whereas its distinctness from a hypothetical “average” congeneric species does not matter (see Sperling, 2003; Moritz and Cicero, 2004; Vences et al., 2005a, 2005b; Cognato, 2006; Meier et al., 2006; Meyer and Paulay, 2005; Roe and Sperling, 2007). Yet DNA barcoding publications and BOLD continue to report the mean instead of the smallest interspecific distances for congeneric species.
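
The contrast between the two measures can be illustrated with a minimal Python sketch. It is not the authors' analysis pipeline: the function name `barcoding_gaps`, the toy labels, and the distance values are all invented for illustration, and the gap is taken here simply as an interspecific distance minus the mean intraspecific distance within a single genus.

```python
from itertools import combinations
from statistics import mean


def barcoding_gaps(labels, dist):
    """Return (mean_intra, mean_inter, min_inter) for congeneric sequences.

    labels: species name of each sequence (hypothetical input format).
    dist:   symmetric matrix (list of lists) of pairwise distances.
    """
    intra, inter = [], []
    for i, j in combinations(range(len(labels)), 2):
        # Same species -> intraspecific pair, otherwise interspecific pair.
        (intra if labels[i] == labels[j] else inter).append(dist[i][j])
    return mean(intra), mean(inter), min(inter)


# Toy data, made up for illustration: species A and B are close, C is distant.
labels = ["A", "A", "B", "B", "C", "C"]
dist = [
    [0.00, 0.01, 0.02, 0.03, 0.12, 0.13],
    [0.01, 0.00, 0.03, 0.02, 0.13, 0.12],
    [0.02, 0.03, 0.00, 0.01, 0.12, 0.13],
    [0.03, 0.02, 0.01, 0.00, 0.13, 0.12],
    [0.12, 0.13, 0.12, 0.13, 0.00, 0.01],
    [0.13, 0.12, 0.13, 0.12, 0.01, 0.00],
]

mean_intra, mean_inter, min_inter = barcoding_gaps(labels, dist)
gap_from_mean = mean_inter - mean_intra
gap_from_smallest = min_inter - mean_intra
print(f"gap using mean interspecific distance:     {gap_from_mean:.3f}")
print(f"gap using smallest interspecific distance: {gap_from_smallest:.3f}")
```

In this toy genus the distant species C pulls the mean interspecific distance upward, so the mean-based gap looks comfortable even though species A and B are separated by very little. The smallest interspecific distance is the quantity that reflects how hard it is to tell A from B.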

Full Text