Abstract

The Atlas of Living Australia’s (ALA) Pre-ingestion Framework is our alternative to managing datasets via the Global Biodiversity Information Facility's (GBIF) Integrated Publishing Toolkit (IPT). The framework uses a system-agnostic Python codebase to create and update Darwin Core archives: building an archive from core and extension CSV files, merging two archives, deleting records, and identifying duplicates based on record identifiers. The framework dynamically supports current Darwin Core and GBIF namespace terms. Previously, this functionality was handled internally by a Java-based biocache-store ingestion application. While flexible and easy to call, this black-box approach to data management made it difficult to remove problem records and to track and verify data sources. Last year, as the ALA merged its ingestion codebase with GBIF's pipelines and upgraded its data store infrastructure, we took the opportunity to manage our source data exclusively as full Darwin Core archives, rather than as partial text files or spreadsheets. Consequently, the Python-based framework consolidates much of the work previously managed using a range of methodologies and technologies, including Talend, Java and Unix-based scripting. Alongside the Darwin Core archive manipulation tools, it has handlers for harvesting data from secure external web services, web hosts or file servers. This standardised approach to data loading paves the way for improved automation and workflow. The work has the potential to become an open source project to share with the Living Atlas and biodiversity informatics communities.
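
The archive operations described above (building, merging, deleting and duplicate detection) can be illustrated with a minimal sketch using only the Python standard library. This is not the framework's actual code: the function names, the `occurrenceID` identifier field, and the single-core layout without extension files are all assumptions made for illustration.

```python
import csv
import io
import zipfile
from collections import Counter
from xml.sax.saxutils import quoteattr

# Darwin Core term namespace; all names below are hypothetical, not the ALA API.
DWC = "http://rs.tdwg.org/dwc/terms/"


def find_duplicate_ids(rows, id_field="occurrenceID"):
    """Return the identifiers that occur more than once, sorted."""
    counts = Counter(row[id_field] for row in rows)
    return sorted(i for i, n in counts.items() if n > 1)


def merge_records(base, updates, id_field="occurrenceID"):
    """Merge two record lists; a record in `updates` replaces one with the same id."""
    merged = {row[id_field]: row for row in base}
    merged.update({row[id_field]: row for row in updates})
    return list(merged.values())


def delete_records(rows, ids, id_field="occurrenceID"):
    """Return the records whose identifier is not in `ids`."""
    drop = set(ids)
    return [row for row in rows if row[id_field] not in drop]


def build_archive(rows, fields, id_field="occurrenceID"):
    """Return the bytes of a zip holding occurrence.csv and a matching meta.xml."""
    data = io.StringIO()
    writer = csv.DictWriter(data, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)

    # One <field> element per column, mapped to a Darwin Core term URI.
    field_xml = "\n".join(
        f"    <field index=\"{i}\" term={quoteattr(DWC + name)}/>"
        for i, name in enumerate(fields)
    )
    meta = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<archive xmlns="http://rs.tdwg.org/dwc/text/">\n'
        '  <core encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\\n"\n'
        '        fieldsEnclosedBy="&quot;" ignoreHeaderLines="1"\n'
        f'        rowType="{DWC}Occurrence">\n'
        "    <files><location>occurrence.csv</location></files>\n"
        f'    <id index="{fields.index(id_field)}"/>\n'
        f"{field_xml}\n"
        "  </core>\n"
        "</archive>\n"
    )

    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("occurrence.csv", data.getvalue())
        zf.writestr("meta.xml", meta)
    return buf.getvalue()
```

In this sketch, records are plain dictionaries keyed by Darwin Core term names, so merging reduces to a dictionary update keyed on the identifier. A fuller implementation would also need to handle extension files and an EML metadata document.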
