Abstract
Biodiversity information is made available through numerous databases, each with its own data model, web services, and data types. Combining data across databases leads to new insights but is not easy, because each database uses its own system of identifiers. In the absence of stable and interoperable identifiers, databases are often linked using taxonomic names. This labor-intensive, error-prone, and lengthy process relies on accessible versions of nomenclatural authorities and on fuzzy-matching algorithms. Linking diverse data requires more than technology. New social collaborations are needed to make concrete progress on finding relationships between biodiversity datasets. One example is the Global Unified Open Data Architecture (GUODA), which combines the skills of computer engineers from iDigBio, server resources from the Advanced Computing and Information Systems (ACIS) Lab, global-scale data presentation from EOL, and contributions from independent developers and researchers. This paper discusses a technical solution developed by the GUODA collaboration for faster linking across databases, with a use case linking Wikidata and the Global Biotic Interactions database (GloBI). The GUODA infrastructure is a 12-node high-performance computing cluster with about 192 threads, 12 TB of storage, and 288 GB of memory. Using GUODA, 20 GB of compressed JSON from Wikidata was processed and linked to GloBI in about 10–11 min. Instead of comparing name strings or relying on a single identifier, Wikidata and GloBI were linked by comparing graphs of biodiversity identifiers external to each system. This method added 119,957 Wikidata links to GloBI, an increase of 13.7% of all outgoing name links in GloBI. Wikidata and GloBI were also compared to the Open Tree of Life Reference Taxonomy to examine consistency and coverage. Parsing the Wikidata, Open Tree of Life Reference Taxonomy, and GloBI archives and calculating consistency metrics took minutes on the GUODA platform. As a model collaboration, GUODA has the potential to revolutionize biodiversity science by bringing diverse, technically minded people together with high-performance computing resources that are accessible from a laptop or desktop. However, participating in such a collaboration still requires basic programming skills.
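The idea of linking two databases through identifiers external to both, rather than through name strings, can be illustrated with a minimal sketch. This is not the GUODA implementation, which ran on a cluster over full data dumps. It assumes each source has already been reduced to a mapping from its own record ID to the set of external identifiers it asserts for that record. The function names, the record IDs, and the identifier values below are all hypothetical placeholders.

```python
from collections import defaultdict


def find(parent, x):
    """Return the representative of x's component (with path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, a, b):
    """Merge the components containing a and b, adding new nodes as needed."""
    parent.setdefault(a, a)
    parent.setdefault(b, b)
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a


def link_sources(mappings_a, mappings_b):
    """Pair records of source A and source B that fall in the same
    connected component of the combined identifier graph."""
    parent = {}
    # Each "record asserts external identifier" statement is one graph edge.
    for mappings in (mappings_a, mappings_b):
        for record_id, externals in mappings.items():
            parent.setdefault(record_id, record_id)
            for ext in externals:
                union(parent, record_id, ext)

    # Group the records of each source by the component they ended up in.
    components = defaultdict(lambda: (set(), set()))
    for record_id in mappings_a:
        components[find(parent, record_id)][0].add(record_id)
    for record_id in mappings_b:
        components[find(parent, record_id)][1].add(record_id)

    return {(a, b)
            for a_ids, b_ids in components.values()
            for a in a_ids
            for b in b_ids}


# Hypothetical placeholder data; none of these values are real identifiers.
wikidata = {
    "WD:Q_x1": {"NCBI:100", "ITIS:200"},
    "WD:Q_x2": {"NCBI:101"},
}
globi = {
    "GLOBI:n1": {"ITIS:200", "EOL:300"},
    "GLOBI:n2": {"EOL:999"},
}

links = link_sources(wikidata, globi)
print(links)
```

In this toy data, only the first record of each source shares an external identifier, so only that pair is linked. Because components are transitive, two records can also be linked through a chain of identifiers contributed by other records. That widens coverage but can over-merge when a source contains a wrong mapping, which is one reason a consistency check against a reference taxonomy is useful.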
Highlights
Biodiversity databases provide global access to information about species via the Web
Many biodiversity databases share information with each other (Bingham et al., 2017), but creating the links can be very difficult for several reasons, including the size of the databases, the heterogeneous nature of the data, and the heterogeneous nature of the identifiers used by the different resources (Page, 2008)
After 10 min of processing, the Global Biotic Interactions database (GloBI) was linked to Wikidata using pre-existing identifier mappings
Summary
Biodiversity databases provide global access to information about species via the Web. These databases contain information as varied as observation records, text descriptions, images, maps, genetic sequences, phylogenetic trees, and trait data (Table 1). All of these data become much more useful if they can be linked. The more popular methods for linking biodiversity databases include taxonomic names, LSID (Life Sciences Identifier), and DOI (Digital Object Identifier). TBMap provides links from TreeBase across several taxonomic databases, such as ITIS and NCBI (Page, 2007). This mapping was achieved using taxonomic names, but in some cases GenBank accession numbers and museum specimen codes were available to supplement them. The use of taxonomic names to aggregate data can lead to errors and requires significant a priori knowledge, either in the form of curators or an authoritative nomenclature.