Specimens at the Center: An Informatics Workflow and Toolkit for Specimen-level Analysis of Public DNA Database Data

Kasey Pham, Nathan J Derieg, Julian R Starr, Takuji Hoshino, Eric H Roalson, Bethany H Brown, David A Simpson, Berit Gehrke, Matthias H Hoffmann, Robert F C Naczi, Karen L Wilson, Sangtae Kim, Sebastian Gebauer, Kate Lueders, Marcial Escudero, Enrique Maguilla, Jeremy J Bruhl, Marcia J Waterway, Modesto Luceño, Andrew L Hipp, Marlene Hahn, Okihito Yano, Tamara Villaverde, Anton A Reznicek, Bruce A Ford, Shuren Zhang, Pedro Jiménez‐Mejías, Jong-Yun Jung, Kyong Sook Chung, Léo P Bruederle, Santiago Martín‐Bravo

doi:10.1600/036364416x692505

Abstract

Major public DNA databases, namely NCBI GenBank, the DNA DataBank of Japan (DDBJ), and the European Molecular Biology Laboratory (EMBL), are invaluable biodiversity libraries. Systematists and other biodiversity scientists commonly mine these databases for sequence data to use in phylogenetic studies, but such studies generally use only the taxonomic identity of the sequenced tissue, not the specimen identity. Studies that use DNA supermatrices to construct phylogenetic trees with species at the tips therefore typically overlook two facts: for many individuals in the public DNA databases, several DNA regions have been sampled, and for many species, two or more individuals have been sampled. As a result, these studies typically do not make full use of the multigene datasets in public DNA databases to test species coherence and to select optimal sequences to represent a species. In this study, we introduce a set of tools developed in the R programming language to construct individual-based trees from...
