Abstract

Documented occurrences of fossil taxa are the empirical foundation for understanding large-scale biodiversity changes and evolutionary dynamics in deep time. The fossil record contains vast numbers of understudied taxa, yet compiling the huge volumes of data needed to study them remains a labour-intensive impediment to a more complete understanding of Earth's biodiversity history. Even so, many occurrence records of species and genera in these taxa can be uncovered in the palaeontological literature. Here, we extract observations of fossils and their inferred ages from unstructured text in books and scientific articles using machine-learning approaches. We use Bryozoa, a group of marine invertebrates with a rich fossil record, as a case study. Building on recent advances in computational linguistics, we develop a pipeline to recognize taxonomic names and geologic time intervals in published literature and use supervised learning to machine-read whether the species in question occurred in a given age interval. Intermediate machine error rates appear comparable to human error rates in a simple trial, and the resulting genus richness curves capture the main features of published fossil diversity studies of bryozoans. We believe our automated pipeline, which greatly reduced the time required to compile our dataset, can help others compile similar data for other taxa.

Highlights

  • How have scientists determined the history of biodiversity on our planet? The radiations of unicellular organisms, plants and animals, rates of diversification and extinction, correlation of past biodiversity levels with environmental forcing factors, mass extinctions and recoveries—all of these and more are reliant on, or at least calibrated by, published occurrences of fossil taxa and their geologic ages

  • Despite considerable progress in statistical methods that aim to compensate for occurrence gaps and known biases of the fossil record [4,5,6,7,8], we are still some ways away from a comprehensive understanding of the history of global biodiversity

  • The classifier accuracy at 82.2% is comparable to our interannotator labelling accuracy at 84.1%
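The comparison in the last highlight can be illustrated with a small sketch. It assumes both figures are plain percent agreement, which this summary does not state. The label lists below are hypothetical and are not data from the study.

```python
def accuracy(predicted, reference):
    """Fraction of positions where two equal-length label lists agree."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Hypothetical labels: 1 = "text asserts the taxon occurs in the interval".
annotator_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
annotator_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
classifier = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]

# Agreement between two human annotators (assumed definition).
interannotator = accuracy(annotator_b, annotator_a)
# Agreement between the classifier and one annotator's labels.
machine = accuracy(classifier, annotator_a)
```

Treating one annotator as the reference is a simplification. A chance-corrected statistic such as Cohen's kappa would be an alternative if the label classes were unbalanced.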


Summary

Introduction

How have scientists determined the history of biodiversity on our planet? The radiations of unicellular organisms, plants and animals, rates of diversification and extinction, correlation of past biodiversity levels with environmental forcing factors, mass extinctions and recoveries: all of these and more rely on, or are at least calibrated by, published occurrences of fossil taxa and their geologic ages. Community efforts have built large public data compilations of taxonomic nomenclature or taxon occurrences, for instance, the Global Biodiversity Information Facility [11], the World Register of Marine Species [12] and the Paleobiology Database (https://paleobiodb.org/). The cheilostomes are the most species-rich group of bryozoans, with about 4800 known extant members [29]. We explicitly quantify both human and machine error in retrieving taxon names and their time intervals of occurrence. We chose to use a rule-based approach, not least because a nearly exhaustive list of post-Palaeozoic bryozoan Linnaean binomials (including all cheilostome bryozoans, our target group) is already available. We used this list and our compiled list of geologic age interval ...
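The rule-based step described above can be sketched as dictionary matching. This is a minimal illustration, not the authors' implementation. The two short lists are hypothetical stand-ins for the full binomial and age-interval lists, and the example sentence is made up rather than taken from the literature.

```python
import re
from itertools import product

# Hypothetical, tiny stand-ins for the curated name lists.
TAXA = ["Membranipora membranacea", "Microporella ciliata"]
INTERVALS = ["Burdigalian", "Serravallian", "Miocene", "Pliocene"]


def compile_terms(terms):
    """Build one regex matching any listed term as a whole word or phrase."""
    # Longest first, so a longer name wins over a shorter one sharing a prefix.
    ordered = sorted(terms, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b")


TAXON_RE = compile_terms(TAXA)
INTERVAL_RE = compile_terms(INTERVALS)


def candidate_pairs(sentence):
    """Return every (taxon, interval) pair that co-occurs in one sentence."""
    taxa = TAXON_RE.findall(sentence)
    intervals = INTERVAL_RE.findall(sentence)
    return list(product(taxa, intervals))


# Made-up sentence for illustration only, not a real occurrence record.
sentence = ("Microporella ciliata is recorded from Burdigalian and "
            "Serravallian deposits.")
pairs = candidate_pairs(sentence)
```

Co-occurrence alone does not show that the text asserts the occurrence. In the pipeline the abstract describes, each candidate pair would then go to a supervised classifier that decides whether the taxon is reported from that interval.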

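[Figure: dependency parse of a text fragment containing the stage names Burdigalian and Serravallian, with relations labelled "nominal modifier" and "conjunct".]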