Data Location Quality at GBIF

John Waller

doi:10.3897/biss.3.35829

Abstract

I will cover how the Global Biodiversity Information Facility (GBIF) handles data quality issues, with a specific focus on coordinate location issues such as gridded datasets (Fig. 1) and country centroids. I will highlight the challenges GBIF faces in identifying potential data quality problems and what we and others (Zizka et al. 2019) are doing to discover and address them. GBIF is the largest open-data portal for biodiversity data, built on a large network of individual datasets (more than 40,000) from various sources and publishers. Because these datasets vary both internally and from one dataset to the next, users face a challenge when they want to use data collected from museums, smartphones, atlases, satellite tracking, DNA sequencing, and various other sources for research or analysis. Data quality at GBIF will always be a moving target (Chapman 2005), and GBIF already handles many obvious problems such as zero or impossible coordinates, empty or invalid data fields, and fuzzy taxon matches. Since GBIF primarily (but not exclusively) serves latitude-longitude location information, there is an expectation that occurrences fall reasonably close to where the species actually occurs. This is not always the case. Occurrence records can be hundreds of kilometers away from where the species naturally occurs, for several reasons that might not be obvious to users. One reason is that many GBIF datasets are gridded. Gridded datasets are datasets that have low spatial resolution due to equally-spaced sampling. This can be a data quality issue because a user might assume an occurrence record was recorded exactly at its coordinates. Country centroids are another reason why a species occurrence record might be far from where the species occurs naturally. GBIF does not yet flag country centroids, which are records where the dataset publisher has entered the latitude-longitude center of a country instead of leaving the field blank.
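To make the idea of equally-spaced sampling concrete, the following is a minimal sketch of one possible heuristic for spotting a gridded dataset. It is not GBIF's detection method: the function names, the 4-decimal rounding, and the 0.8 threshold are all assumptions chosen for illustration. The idea is that if most gaps between neighbouring unique latitudes (and longitudes) are identical, the coordinates probably sit on a regular grid.

```python
from collections import Counter


def most_common_spacing(values, decimals=4):
    """Return (spacing, share) for the most frequent gap between
    neighbouring unique coordinate values along one axis.

    `share` is the fraction of all gaps that equal that spacing.
    Hypothetical helper; rounding precision is an assumption.
    """
    unique = sorted({round(v, decimals) for v in values})
    gaps = [round(b - a, decimals) for a, b in zip(unique, unique[1:])]
    if not gaps:
        return None, 0.0
    spacing, count = Counter(gaps).most_common(1)[0]
    return spacing, count / len(gaps)


def looks_gridded(lats, lons, min_share=0.8, min_unique=5, decimals=4):
    """Heuristic: call a dataset gridded when, on both axes, at least
    `min_share` of the gaps between unique values are the same size.

    Thresholds are illustrative assumptions, not calibrated values.
    """
    for axis in (lats, lons):
        n_unique = len({round(v, decimals) for v in axis})
        if n_unique < min_unique:
            return False  # too few distinct values to judge spacing
        _, share = most_common_spacing(axis, decimals)
        if share < min_share:
            return False
    return True
```

A dataset whose latitudes and longitudes both step in increments of 0.1 degrees would pass this check, while irregularly scattered observations would not. A real implementation would also need to handle projected grids, which are not evenly spaced in latitude-longitude.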
I will discuss the challenges of locating these issues and the current solutions (such as the CoordinateCleaner R package). I will touch on how existing Darwin Core Archive (DwC-A) terms like coordinateUncertaintyInMeters and footprintWKT are being used to highlight low coordinate resolution. Finally, I will highlight some other emerging data quality issues and how GBIF is beginning to experiment with dataset-level flagging. Currently we have flagged around 500 datasets as gridded and around 400 datasets as citizen science, but there are many more potential dataset flags.
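The country-centroid problem described in the abstract can be illustrated with a simple distance test. The sketch below is an assumption-laden example, not the method used by GBIF or by CoordinateCleaner: the centroid table uses a made-up country code and made-up coordinates, the function names are hypothetical, and the 5 km radius is arbitrary. Records are plain dictionaries keyed by the Darwin Core terms decimalLatitude, decimalLongitude, and countryCode.

```python
import math

EARTH_RADIUS_KM = 6371.0

# Hypothetical centroid table: "XX" and its coordinates are placeholders,
# not a real country. A real table would come from a gazetteer.
HYPOTHETICAL_CENTROIDS = {"XX": (10.0, 20.0)}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat-lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def near_country_centroid(record, centroids, radius_km=5.0):
    """Return True when the record lies within `radius_km` of the
    centroid listed for its countryCode (radius is an assumption)."""
    centroid = centroids.get(record.get("countryCode"))
    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")
    if centroid is None or lat is None or lon is None:
        return False  # nothing to compare against
    return haversine_km(lat, lon, centroid[0], centroid[1]) <= radius_km
```

A record flagged this way is only a candidate: a species can genuinely occur at the center of a country, so a flag of this kind is better treated as a prompt for review than as proof of an error.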

Journal: Biodiversity Information Science and Standards
Publication Date: Jun 13, 2019
Citations: 2
License type: CC0
