Abstract

The Global Biodiversity Information Facility (GBIF) runs a global data infrastructure that integrates data from more than 1700 institutions. Combining data at this scale has been achieved by deploying open Application Programming Interfaces (APIs) that adhere to the open data standards provided by Biodiversity Information Standards (TDWG). In this presentation, we will give an overview of the GBIF infrastructure and APIs, and share lessons learned while operating and evolving these systems around long-term API stability, ease of use, and efficiency. This will include the following topics:

• The registry component provides RESTful APIs for managing the organizations, repositories and datasets that comprise the network, and for controlling access permissions. Stability and ease of use have been critical to its being embedded in many systems (see the registry sketch after this list).

• Changes within the registry trigger data crawling processes, which connect to external systems through their APIs and deposit datasets into GBIF's central data warehouse. One challenge here is keeping data consistent across a distributed network.

• Once a dataset is crawled, the data processing infrastructure organizes and enriches its data using reference catalogues accessed through open APIs, such as the vocabulary server and the taxonomic backbone (see the name-matching sketch below). Processing data quickly as source data and reference catalogues change is a challenge for this component.

• The data access APIs provide search and download services (see the search-and-download sketch below). Some of these must be asynchronous, and long-term stability is a requirement for widespread adoption. We will discuss policies for schema evolution that avoid incompatible changes, which would otherwise cause failures in client systems.

• The APIs that drive the user interface have specific needs, such as efficient use of network bandwidth. We will present how we approached this, and how we are currently adopting GraphQL as the next generation of these APIs (see the GraphQL sketch below).

• Several APIs should be of use to the data publishing community, including APIs that help with data quality aspects and that surface new data of interest through the data clustering algorithms GBIF deploys (see the related-records sketch below).
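To make the registry discussion concrete, here is a minimal sketch of reading from the registry's public REST services using Python and the requests library. The endpoints shown (https://api.gbif.org/v1/dataset and https://api.gbif.org/v1/organization) are the public v1 registry APIs; the dataset UUID is only an illustrative key, and error handling is omitted.

```python
import requests

API = "https://api.gbif.org/v1"

# Fetch a single dataset record by its UUID (any public dataset key works here).
dataset_key = "50c9509d-22c7-4a22-a47d-8c48425ef4a7"
dataset = requests.get(f"{API}/dataset/{dataset_key}", timeout=30).json()
print(dataset["title"], "-", dataset["type"])

# Registry list endpoints return paged results controlled by limit/offset.
page = requests.get(f"{API}/organization", params={"limit": 5}, timeout=30).json()
for org in page["results"]:
    print(org["key"], org["title"])
```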
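The taxonomic backbone used during processing is likewise exposed through an open API. The sketch below matches a scientific name against the backbone via the public /species/match endpoint, the same kind of lookup the processing pipeline performs when interpreting records.

```python
import requests

# Align a scientific name with the GBIF taxonomic backbone.
# The response carries the backbone usageKey plus a match type and
# confidence score that callers can use to accept or reject the match.
match = requests.get(
    "https://api.gbif.org/v1/species/match",
    params={"name": "Puma concolor", "kingdom": "Animalia"},
    timeout=30,
).json()
print(match["usageKey"], match["scientificName"], match["matchType"], match["confidence"])
```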
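The split between synchronous search and asynchronous download can also be sketched. Search is a simple paged GET; a download is submitted as a predicate and polled until the archive is ready. The request shape and status values below follow the public occurrence download API, but the credentials are placeholders and a real call needs a (free) GBIF account.

```python
import time
import requests

API = "https://api.gbif.org/v1"

# Synchronous search: paged results, suited to small or interactive queries.
hits = requests.get(
    f"{API}/occurrence/search",
    params={"taxonKey": 2435099, "country": "DK", "limit": 3},
    timeout=30,
).json()
print(hits["count"], "matching occurrence records")

# Asynchronous download: submit a predicate, then poll until the archive is built.
request_body = {
    "creator": "your_username",  # placeholder GBIF account name
    "format": "SIMPLE_CSV",
    "predicate": {"type": "equals", "key": "COUNTRY", "value": "DK"},
}
key = requests.post(
    f"{API}/occurrence/download/request",
    json=request_body,
    auth=("your_username", "your_password"),  # placeholder credentials
    timeout=30,
).text

while True:
    download = requests.get(f"{API}/occurrence/download/{key}", timeout=30).json()
    if download["status"] in ("SUCCEEDED", "FAILED", "KILLED", "CANCELLED"):
        break
    time.sleep(60)  # archives are built by a queue and can take minutes
print(download["status"], download.get("downloadLink"))
```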
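For the GraphQL direction, the sketch below shows the pattern rather than a guaranteed contract: the client names exactly the fields it needs, so the response carries no more than the UI will render. The endpoint URL and the dataset query shape are assumptions based on the service behind gbif.org and may differ from the schema we present.

```python
import requests

# GraphQL lets a client fetch exactly the fields it needs in one round
# trip, which is the bandwidth saving discussed above.
# NOTE: the endpoint and field names here are assumptions for illustration.
query = """
query ($key: ID!) {
  dataset(key: $key) {
    title
    created
  }
}
"""
resp = requests.post(
    "https://graphql.gbif.org/graphql",
    json={"query": query, "variables": {"key": "50c9509d-22c7-4a22-a47d-8c48425ef4a7"}},
    timeout=30,
)
print(resp.json().get("data"))
```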
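Finally, the clustering output is reachable per record. The sketch below asks for the records the clustering algorithms consider related to a given occurrence; the /experimental/related path is, as its name says, experimental, and both the path and the response fields used here should be treated as assumptions that may change.

```python
import requests

# Fetch records that GBIF's clustering algorithms relate to a given
# occurrence, e.g. suspected duplicates or specimen cross-links.
# The "experimental" path segment signals that this API may change.
occurrence_id = 1234567890  # placeholder GBIF occurrence id
related = requests.get(
    f"https://api.gbif.org/v1/occurrence/{occurrence_id}/experimental/related",
    timeout=30,
).json()
for item in related.get("relatedOccurrences", []):
    print(item.get("reasons"), item.get("occurrence", {}).get("key"))
```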

Corresponding author: Tim Robertson (trobertson@gbif.org)
Received: Sep 2021 | Published: Sep 2021
Citation: Robertson T, Mendez F, Blissett M, Høfft M, Stjernegaard Jeppesen T, Volik N, Gonzalez ML, Podolskiy M, Döring M (2021) GBIF Integration of Open Data. Biodiversity Information Science and Standards 5: e75606.
