Employing statistical learning to derive species‐level genetic diversity for mammalian species

Carlos G Schrago, Beatriz Mello

doi:10.1111/mam.12192

Abstract

The patterns of genetic diversity in several genomic regions have been used in mammalian systematics for decades. For instance, when studying closely related species, it is generally assumed that the mitochondrial cytochrome b gene (cytb) carries enough information to differentiate between intraspecies and interspecies variation in mammals. Because of sampling limitations, early analyses of this proposition were conducted mainly on rodents and bats. Currently, more than 57000 cytb sequences are available covering all major lineages of mammals, and sequencing of several individuals per species is common practice in molecular systematics. We were thus prompted to carry out a large‐scale analysis of the utility of cytb genetic variation as a predictor of whether a pair of sequences came from a within‐species or a between‐species comparison. Using predetermined species‐level assignments, we employed standard methods from statistical learning to calculate the cut‐off values able to classify genetic distances into either the intraspecies or the interspecies category; we then measured the performance of such statistical classifiers in predicting the species‐level taxonomic rank as defined by experts. Depending on the classifier, our results showed that when adopting cytb distance cut‐off values of 7.3% and 5.5% for small mammals (Metatheria, Rodentia, Chiroptera, and Eulipotyphla) and 4.3% and 3% for medium‐sized to large mammals (Primates, Carnivora, and Artiodactyla), the frequency of incorrect assignment of within‐species divergences to the between‐species category (type I error) varied from 7 to 11%. To avoid over‐splitting by future researchers, we calculated cut‐off values using a more conservative evaluation and provided a list of mammalian species that are likely to consist of complexes of cryptic species.
We show that our supervised method can provide practical guidelines to improve the performance of unsupervised algorithms for species delimitation. Finally, we discuss limitations of large‐scale approaches (e.g. effects of misclassification in databases and the need for case‐by‐case evaluation of cryptic species complexes) and their consequences for conservation policies.
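As a rough illustration of the cut‐off idea described in the abstract, the following Python sketch picks a distance threshold from labelled pairwise distances and reports the resulting type I error. It is a minimal example under stated assumptions, not the authors' pipeline: the function names are hypothetical, the selection rule (minimising total misclassifications) is a simple stand‐in for the statistical learning classifiers the abstract mentions, and the distances are invented toy values rather than data from the study.

```python
# Toy sketch: choose a genetic-distance cut-off from labelled comparisons.
# All names and numbers here are illustrative assumptions, not study data.

def misclassified(cutoff, within, between):
    """Count errors for the rule: distance > cutoff means 'between species'."""
    false_between = sum(d > cutoff for d in within)    # type I errors
    false_within = sum(d <= cutoff for d in between)   # missed between-species pairs
    return false_between + false_within


def best_cutoff(within, between):
    """Return the candidate cut-off with the fewest total misclassifications.

    Candidates are midpoints between consecutive observed distances.
    """
    values = sorted(set(within) | set(between))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return min(candidates, key=lambda c: misclassified(c, within, between))


def type_i_error(cutoff, within):
    """Fraction of within-species distances assigned to 'between species'."""
    return sum(d > cutoff for d in within) / len(within)


# Invented pairwise distances (proportion of differing sites).
within_species = [0.01, 0.02, 0.03, 0.04, 0.05, 0.10]
between_species = [0.08, 0.09, 0.12, 0.15, 0.20, 0.25]

cutoff = best_cutoff(within_species, between_species)
error = type_i_error(cutoff, within_species)
print(f"cut-off = {cutoff:.3f}, type I error = {error:.2f}")
```

A more conservative cut‐off, in the sense used in the abstract, would trade some sensitivity for a lower type I error, for example by raising the threshold until `type_i_error` falls below a chosen tolerance.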

Full Text