Abstract

Evolutionary and organismal biology have become inundated with data. At the same rate, we are experiencing a surge in broader evolutionary and ecological syntheses for which tree-thinking is the staple for a variety of post-tree analyses. To fully take advantage of this wealth of data to discover and understand large-scale evolutionary and ecological patterns, computational data integration, i.e., the use of machines to link data at large scale, is crucial. The most common shared entity by which evolutionary and ecological data need to be linked is the taxon to which they belong. We propose a set of requirements that a system for defining such taxa should meet for computational data science: taxon definitions should maintain conceptual consistency, be reproducible via a known algorithm, be computationally automatable, and be applicable across the Tree of Life. We argue that Linnaean names, the most prevalent means of linking data to taxa, fail to meet these requirements due to fundamental theoretical and practical shortfalls. We argue that for the purposes of data integration we should instead use phylogenetic definitions transformed into formal logic expressions. We call such expressions phyloreferences, and argue that, unlike Linnaean names, they meet all requirements for effective data integration.

Highlights

  • The last two decades have witnessed a vast increase of available digital biodiversity data

  • The rapidly increasing knowledge across the Tree of Life has enabled a synthesis of phylogenetic hypotheses on a Tree of Life scale, to produce an encompassing, and digitally fully reusable, view of Life’s evolution, the Open Tree of Life (Hinchliff et al. 2015; McTavish et al. 2017)

  • While the approach we propose in this paper fits more naturally with a form of phylogenetic nomenclature, it is compatible with retaining Linnaean names

Summary

Introduction

The last two decades have witnessed a vast increase of available digital biodiversity data. The latter of these, instead, consists in generating purely phylogenetic definitions of clades. To arbitrate between these alternatives, we propose the following four requirements that any system suitable for data integration should meet: (i) The mapping maintains conceptual consistency, meaning that when mapped to different phylogenies, the semantics of the retrieved clades are consistent. (ii) The mapping of a given clade concept to a given phylogenetic hypothesis is exactly reproducible via a known algorithm. What is at stake is the best way of defining taxa for data integration, and not the names of these taxa or whether they can be listed as species.
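Requirements (i) and (ii) can be illustrated with a small sketch. It is not the authors' implementation: the paper describes phyloreferences as formal logic expressions, whereas the code below substitutes a plain tree traversal to show what "reproducible via a known algorithm" could look like. It assumes a clade is defined as the smallest clade containing a set of specifier taxa, and it represents each phylogeny as a child-to-parent map. The function names (`resolve_minimum_clade`, `mrca`, `leaves_under`), the internal node labels, and the two toy phylogenies are all hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Optional

# A phylogeny as a child -> parent map; the root maps to None (assumed encoding).
Tree = Dict[str, Optional[str]]


def path_to_root(tree: Tree, node: str) -> List[str]:
    """Return the node followed by all of its ancestors, ending at the root."""
    path: List[str] = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = tree[current]
    return path


def mrca(tree: Tree, specifiers: Iterable[str]) -> str:
    """Return the most recent common ancestor of all specifiers."""
    names = list(specifiers)
    if not names:
        raise ValueError("at least one specifier is required")
    first_path = path_to_root(tree, names[0])
    other_paths = [set(path_to_root(tree, name)) for name in names[1:]]
    # Walk upward from the first specifier; the first node shared by every
    # other specifier's ancestry is the most recent common ancestor.
    for node in first_path:
        if all(node in ancestors for ancestors in other_paths):
            return node
    raise ValueError("specifiers share no common ancestor")


def leaves_under(tree: Tree, ancestor: str) -> FrozenSet[str]:
    """Return every leaf that descends from (or is) the given node."""
    internal = {parent for parent in tree.values() if parent is not None}
    return frozenset(
        node
        for node in tree
        if node not in internal and ancestor in path_to_root(tree, node)
    )


def resolve_minimum_clade(tree: Tree, specifiers: Iterable[str]) -> FrozenSet[str]:
    """Resolve 'the smallest clade containing all specifiers' on one phylogeny."""
    return leaves_under(tree, mrca(tree, specifiers))


# Two toy, competing hypotheses for the same four taxa (illustrative only).
# tree_a: (((Homo, Pan), Gorilla), Pongo)
tree_a: Tree = {
    "root": None, "n1": "root", "Pongo": "root",
    "n2": "n1", "Gorilla": "n1", "Homo": "n2", "Pan": "n2",
}
# tree_b: ((Homo, (Pan, Gorilla)), Pongo)
tree_b: Tree = {
    "root": None, "m1": "root", "Pongo": "root",
    "Homo": "m1", "m2": "m1", "Pan": "m2", "Gorilla": "m2",
}

for label, tree in (("tree_a", tree_a), ("tree_b", tree_b)):
    print(label, sorted(resolve_minimum_clade(tree, ["Homo", "Pan"])))
```

On the first toy phylogeny the definition resolves to Homo and Pan; on the second it also includes Gorilla. The membership differs, but the meaning of the definition is the same in both cases, and the result for a given phylogeny is fully determined by the algorithm.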

The Poverty of Linnaean Names
The Linnaean shortfall limits data discovery
Linnaean names make data discovery difficult to reproduce
The Richness of Phylogenetic Definitions
What Is a Phyloreference?
Other Efforts to Improve the Computability of Taxon Concepts
Challenges and Limitations
Specifiers
Genealogical discordance
Adoption cost
Final Remarks
Literature cited