A Pre-ingestion Framework for Darwin Core Archives

Mahmoud Sadeghi, Peggy Newman, Patricia Koh

doi:10.3897/biss.6.93853

Abstract

The Atlas of Living Australia’s (ALA) Pre-ingestion Framework is our alternative to managing datasets via the Global Biodiversity Information Facility's (GBIF) Integrated Publishing Toolkit (IPT). The framework uses a system-agnostic Python codebase to create and update Darwin Core archives: building an archive from core and extension CSV files, merging two archives, deleting records, and identifying duplicates based on their identifiers. The framework dynamically supports current Darwin Core and GBIF namespace terms. Previously, this functionality was handled internally by a Java-based biocache-store ingestion application. While flexible and easy to call, this black-box approach to data management made it difficult to remove problem records and to track and verify data sources. Last year, as the ALA merged our ingestion codebase with GBIF's pipelines and upgraded our data store infrastructure, we took the opportunity to manage our source data exclusively as full Darwin Core archives, rather than as partial text files or spreadsheets. Consequently, the Python-based framework consolidates work previously managed using a range of methodologies and technologies, including Talend, Java and Unix-based scripting. Alongside the Darwin Core archive manipulation tools, it has handlers for harvesting data from secure external web services, web hosts or file servers. The standardised approach to data loading paves the way for improved automation and workflow. The work has the potential to become an open source project to share with the Living Atlas and biodiversity informatics communities.
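The four archive operations named in the abstract (build, merge, delete, find duplicates) can be illustrated with a minimal Python sketch. This is not the ALA framework's code: the function names, the sample records, the `occurrence.csv` file name and the core-only layout (no extension files) are all assumptions made for illustration, and only the standard library is used. The `meta.xml` descriptor follows the general shape of the Darwin Core text guidelines.

```python
import csv
import io
import zipfile
from collections import Counter
from xml.sax.saxutils import quoteattr

DWC = "http://rs.tdwg.org/dwc/terms/"  # Darwin Core term namespace


def find_duplicates(rows, id_field="occurrenceID"):
    """Return identifiers that occur more than once, in first-seen order."""
    counts = Counter(row[id_field] for row in rows)
    return [key for key, n in counts.items() if n > 1]


def merge_rows(base, update, id_field="occurrenceID"):
    """Merge two record lists; a record in `update` replaces one with the same id."""
    merged = {row[id_field]: row for row in base}
    merged.update({row[id_field]: row for row in update})
    return list(merged.values())


def delete_rows(rows, ids, id_field="occurrenceID"):
    """Drop every record whose identifier is in `ids`."""
    unwanted = set(ids)
    return [row for row in rows if row[id_field] not in unwanted]


def build_archive(rows, fields, id_field="occurrenceID"):
    """Return the bytes of a core-only archive: occurrence.csv plus meta.xml."""
    data = io.StringIO()
    writer = csv.DictWriter(data, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)

    # One <field> element per column, mapped to its Darwin Core term URI.
    field_xml = "".join(
        f'    <field index="{i}" term={quoteattr(DWC + name)}/>\n'
        for i, name in enumerate(fields)
    )
    meta = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<archive xmlns="http://rs.tdwg.org/dwc/text/">\n'
        f'  <core rowType="{DWC}Occurrence" encoding="UTF-8" fieldsTerminatedBy="," '
        'linesTerminatedBy="\\n" fieldsEnclosedBy="&quot;" ignoreHeaderLines="1">\n'
        '    <files><location>occurrence.csv</location></files>\n'
        f'    <id index="{fields.index(id_field)}"/>\n'
        f'{field_xml}'
        '  </core>\n'
        '</archive>\n'
    )

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("occurrence.csv", data.getvalue())
        archive.writestr("meta.xml", meta)
    return buffer.getvalue()


# Hypothetical sample records, for illustration only.
FIELDS = ["occurrenceID", "scientificName", "basisOfRecord"]
base = [
    {"occurrenceID": "a1", "scientificName": "Macropus giganteus", "basisOfRecord": "HumanObservation"},
    {"occurrenceID": "a2", "scientificName": "Dromaius novaehollandiae", "basisOfRecord": "HumanObservation"},
]
update = [
    {"occurrenceID": "a2", "scientificName": "Dromaius novaehollandiae", "basisOfRecord": "PreservedSpecimen"},
    {"occurrenceID": "a3", "scientificName": "Phascolarctos cinereus", "basisOfRecord": "HumanObservation"},
]

merged = merge_rows(base, update)
cleaned = delete_rows(merged, ["a1"])
archive_bytes = build_archive(cleaned, FIELDS)
```

Keying every operation on the record identifier is a design choice of this sketch: it keeps merge, delete and duplicate detection consistent with one another, which matches the abstract's description of duplicates being identified "based on their identifiers". A production tool would also need to handle extension files and validate term names.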

Journal: Biodiversity Information Science and Standards
Publication Date: Aug 23, 2022
License type: cc-by

Similar Papers

A Google Sheet Add-on for Biodiversity Data Standardization and Sharing
José Augusto Salim ... Antonio Saraiva
Biodiversity Information Science and Standards | VOL. 4
02 Oct 2020

From text to structured data: Converting a word-processed floristic checklist into Darwin Core Archive format
David Remsen ... Sandra Knapp
PhytoKeys | VOL. 9
30 Jan 2012

The GBIF integrated publishing toolkit: facilitating the efficient publishing of biodiversity data on the internet.
Tim Robertson ... Laura Russell
PLoS ONE | VOL. 9
06 Aug 2014

Connecting West and Central African Herbaria Data: A new Living Atlases regional data platform
Sylvain Morin ... Alice Ainsa
Biodiversity Information Science and Standards | VOL. 5
13 Sep 2021
