Abstract

One of the main challenges in biodiversity data reusability is transforming the content of research publications into reusable formats that follow the FAIR (Findable, Accessible, Interoperable, Reusable) principles (Agosti and Egloff 2009). Most often, data is confined to text, figures and tables in the so-called “PDF prison” or other flat formats. Plazi's infrastructure and workflow (Guidoti et al. 2021) transform such data into reusable formats that can then be exported and linked across different platforms, such as the Global Biodiversity Information Facility (GBIF), Biodiversity Literature Repository, Zenodo, Synospecies, ChecklistBank, and OpenBiodiv, among others. To liberate the many relevant pieces of information in these publications, such as taxonomic treatments (Catapano 2019), material citations (Darwin Core term MaterialCitation) or bibliographic references, a single document or a batch of documents has to be run through a series of extraction steps, which can be done manually or automatically through the use of templates. A template is a set of parameters that tells the Plazi-dedicated software (GoldenGATE suite) how to read a document and where to find key pieces of information; these parameters are established by examining publication standards and publisher-specific layouts, followed by a series of iterative tests to ascertain the quality of the automation. However, even after extensive testing to improve extraction, human quality control is still needed (Simoes et al. 2021). To that end, Plazi has a quality control process, based on logical rules, which checks the components of the extracted document and flags errors at four levels of severity; a trained user can then check and, if needed, correct them. These errors are also used in a data transit control mechanism, internally dubbed “the gatekeeper”, which blocks certain data transits (those that create deposits or enable data reuse) when specific errors are present. In this presentation, we will go through the steps of the entire process, from publication to liberated data (and how it is presented in the linked platforms), highlighting the importance of accurate quality control, and explore some of the many challenges along the way.
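
As a rough illustration of the rule-based quality control and gatekeeper described above, the following Python sketch may help. It is not Plazi's implementation: the severity names, the rule names, the document layout and the mapping of blocking errors to destinations are all hypothetical, chosen only to show how logical rules could flag errors at four severity levels and how specific errors could block a given data transit.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable


class Severity(IntEnum):
    # Hypothetical names; the abstract only states that there are four levels.
    MINOR = 1
    MODERATE = 2
    MAJOR = 3
    BLOCKER = 4


@dataclass(frozen=True)
class Rule:
    name: str
    severity: Severity
    check: Callable[[dict], bool]  # returns True when the document passes


# Illustrative rules, not Plazi's actual rule set.
RULES = [
    Rule("treatment_has_taxon_name", Severity.BLOCKER,
         lambda doc: all(t.get("taxon_name") for t in doc["treatments"])),
    Rule("material_citation_has_country", Severity.MODERATE,
         lambda doc: all(m.get("country") for m in doc["material_citations"])),
    Rule("has_bibliographic_references", Severity.MINOR,
         lambda doc: len(doc["references"]) > 0),
]

# Assumed mapping: which flagged rules block a transit to which destination.
BLOCKING_ERRORS = {
    "gbif": {"treatment_has_taxon_name", "material_citation_has_country"},
    "zenodo": {"treatment_has_taxon_name"},
}


def run_quality_control(document: dict, rules: list[Rule]) -> list[Rule]:
    """Return every rule the extracted document violates, in rule order."""
    return [rule for rule in rules if not rule.check(document)]


def gatekeeper_allows(flagged: list[Rule], destination: str) -> bool:
    """Allow a transit only if none of the flagged errors blocks that destination."""
    blocking = BLOCKING_ERRORS[destination]
    return not any(rule.name in blocking for rule in flagged)


# A toy extracted document with one incomplete material citation.
document = {
    "treatments": [{"taxon_name": "Aus bus"}],
    "material_citations": [{"country": ""}],
    "references": [],
}

flagged = run_quality_control(document, RULES)
for rule in flagged:
    print(f"{rule.severity.name}: {rule.name}")
print("gbif allowed:", gatekeeper_allows(flagged, "gbif"))
print("zenodo allowed:", gatekeeper_allows(flagged, "zenodo"))
```

In this sketch, a trained user would review the flagged rules and correct the document, after which the same checks could be rerun before the transit is attempted again.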

