A central problem in biodiversity data science remains the inability to precisely aggregate observations that originate from the same species. Current approaches still rely heavily on the ‘Genus species’ pair of Linnaean taxonomy to label and group data. However, such binomial species names do not alone contain enough information to distinguish variation in a name’s conceptual usage (i.e., meaning) across time, place, or investigator. This “species name-to-meaning” problem is well known, but no robust and scalable solutions yet exist (Berendsohn 1995, Bisby 2000, Sutherland et al. 2000, Page 2007, Franz and Sterner 2018, Upham et al. 2021, Sterner et al. 2023, Sterner et al. 2020, Beach et al. 1993). The problem is especially widespread in well-studied groups such as mammals, for which 45% more species are now recognized than 30 years ago, including many species splits and rearrangements (Burgin et al. 2018, Mammal Diversity Database 2024). To address this problem, we propose to digitally package and sign the relevant taxonomic data associated with a given species name usage to form a ‘Taxonomic Data Object’ (TDO). These TDOs are conceptually similar to ‘secundus’ references (i.e., species name sec. author, Berendsohn 1995) but differ in being machine readable and thus operational from the perspective of high-throughput data science. TDOs aim to move from analog secundus references, which require a human to track meaning by reading the cited article, to a digital and machine-actionable taxonomic reference. Versioned TDOs that are digitally signed using hash algorithms (e.g., md5 or sha256, see Elliott et al. 2023) will allow for precisely communicating, with verified data provenance from sender to receiver, certain aspects of what a species name means according to a given authority at a given time.
Example TDO contents include DarwinCore terms (e.g., scientificName, namePublishedIn, nameAccordingTo, nomenclaturalStatus; Darwin Core Maintenance Group 2021) as well as meaning-rich digital assets like geographic range maps (e.g., GeoJSON format), exemplar DNA sequences (e.g., FASTA format or National Center for Biotechnology Information (NCBI) accession number), holotype specimen information (e.g., catalog number, type locality coordinates), and taxonomic treatment texts (including material citations, e.g., as digitized and extracted by Plazi TreatmentBank, Agosti and Egloff 2009; see Fig. 1). We discuss pilot examples comparing global bat species according to the v1.2 vs. v1.11 taxonomies of the Mammal Diversity Database (Sep 2020 vs. Apr 2023), showing how TDOs enable the reliable tracking of taxonomic meaning for the 63 bat species affected by splits during this period. We posit that widespread use of TDOs will enable the traceable exchange of taxonomic information across a variety of existing platforms, providing a path for accurate species-level data aggregation at global scales, at least for well-studied taxa.
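To make the packaging-and-signing idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a TDO can be represented as a JSON document and that "signing" means computing a sha256 content hash over a canonical serialization. The function names (`canonical_bytes`, `content_id`, `verify_tdo`), the `sha256:` prefix, the non-DarwinCore field names, and all field values are hypothetical placeholders chosen for illustration.

```python
import hashlib
import json


def canonical_bytes(tdo: dict) -> bytes:
    """Serialize a TDO deterministically so equal content gives equal bytes.

    Sorting keys and fixing separators is an assumed canonicalization
    scheme; the source does not specify one.
    """
    text = json.dumps(tdo, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def content_id(tdo: dict) -> str:
    """Return a sha256-based identifier for the TDO content (hypothetical format)."""
    return "sha256:" + hashlib.sha256(canonical_bytes(tdo)).hexdigest()


def verify_tdo(tdo: dict, expected_id: str) -> bool:
    """Check that a received TDO matches the identifier sent alongside it."""
    return content_id(tdo) == expected_id


# Placeholder values only: none of these describe a real species record.
example_tdo = {
    # DarwinCore terms named in the text
    "scientificName": "Genus species",
    "namePublishedIn": "placeholder citation",
    "nameAccordingTo": "Mammal Diversity Database v1.11",
    "nomenclaturalStatus": "placeholder status",
    # Meaning-rich assets; these key names are assumptions for illustration
    "typeLocality": {"type": "Point", "coordinates": [0.0, 0.0]},  # GeoJSON geometry
    "holotypeCatalogNumber": "PLACEHOLDER-0000",
    "exemplarSequenceAccession": "PLACEHOLDER",
}

tdo_id = content_id(example_tdo)
print(tdo_id)
```

Under these assumptions, a receiver can recompute the hash and compare it with `tdo_id`. Any change to the packaged content, such as pointing `nameAccordingTo` at v1.2 instead of v1.11, yields a different identifier, which is what lets two usages of the same binomial name be told apart.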