PeerJ Computer Science | VOL. 4

20 GB in 10 minutes: a case for linking major biodiversity databases using an open socio-technical infrastructure and a pragmatic, cross-institutional collaboration

Publication Date: Sep 17, 2018


Biodiversity information is made available through numerous databases, each with its own data model, web services, and data types. Combining data across databases leads to new insights, but is not easy because each database uses its own system of identifiers. In the absence of stable and interoperable identifiers, databases are often linked using taxonomic names. This labor-intensive, error-prone, and lengthy process relies on accessible versions of nomenclatural authorities and fuzzy-matching algorithms. Approaching the challenge of linking diverse data takes more than technology. New social collaborations are needed to make concrete progress on finding relationships between biodiversity datasets. One such collaboration is the Global Unified Open Data Architecture (GUODA), which combines the skills of computer engineers from iDigBio, server resources from the Advanced Computing and Information Systems (ACIS) Lab, global-scale data presentation from EOL, and the work of independent developers and researchers. This paper discusses a technical solution developed by the GUODA collaboration for faster linking across databases, with a use case linking Wikidata and the Global Biotic Interactions database (GloBI). The GUODA infrastructure is a 12-node, high-performance computing cluster with about 192 threads, 12 TB of storage, and 288 GB of memory. Using GUODA, 20 GB of compressed JSON from Wikidata was processed and linked to GloBI in about 10–11 min. Instead of comparing name strings or relying on a single identifier, Wi...
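To make the contrast with name-string matching concrete, the following is a minimal sketch of identifier-based linking. It is not the GUODA implementation, which the abstract does not describe in detail. It assumes the Wikidata JSON dump layout of one entity per line inside a JSON array. The selection of Wikidata properties (P685, P846, P815), the prefix scheme, the function names, and the sample records are all illustrative assumptions.

```python
import json

# Assumption: these Wikidata properties carry external taxon identifiers.
# The prefixes are a hypothetical convention chosen for this sketch.
TAXON_ID_PROPERTIES = {
    "P685": "NCBI",
    "P846": "GBIF",
    "P815": "ITIS",
}


def external_ids(entity):
    """Collect the prefixed external taxon IDs claimed by one entity."""
    ids = set()
    claims = entity.get("claims", {})
    for prop, prefix in TAXON_ID_PROPERTIES.items():
        for claim in claims.get(prop, []):
            datavalue = claim.get("mainsnak", {}).get("datavalue")
            if datavalue is not None:  # skip "no value" / "unknown value" snaks
                ids.add(f"{prefix}:{datavalue['value']}")
    return ids


def parse_dump_lines(lines):
    """Yield entities from dump lines, skipping the enclosing array brackets."""
    for line in lines:
        text = line.strip().rstrip(",")
        if text in ("", "[", "]"):
            continue
        yield json.loads(text)


def link_records(wikidata_lines, other_ids):
    """Map each Wikidata entity ID to the identifiers it shares with other_ids."""
    links = {}
    for entity in parse_dump_lines(wikidata_lines):
        shared = external_ids(entity) & other_ids
        if shared:
            links[entity["id"]] = sorted(shared)
    return links


if __name__ == "__main__":
    # Made-up records; the entity IDs and identifier values are placeholders.
    sample = [
        "[",
        '{"id": "QX1", "claims": {"P685": [{"mainsnak": {"datavalue": {"value": "111"}}}]}},',
        '{"id": "QX2", "claims": {}}',
        "]",
    ]
    print(link_records(sample, {"NCBI:111", "GBIF:999"}))
```

Because the join is an exact set intersection on identifiers, no fuzzy matching is involved, and each entity can be processed independently of the others, which is the kind of work that spreads naturally across a multi-node cluster.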


Fuzzy-matching Algorithms; Cross-institutional Collaboration; Biodiversity Datasets; Consistency Metrics; GB Memory; Biodiversity Information; Social Collaborations; Numerous Databases; Taxonomic Names; Advanced Computing
