Abstract
We present a streamlined technical solution ("Publish First") designed to help smaller, resource-constrained herbaria rapidly publish their specimens to the Global Biodiversity Information Facility (GBIF). Specimen data from smaller herbaria, particularly those in biodiversity-rich regions of the world, provide a valuable and often unique contribution to the global pool of biodiversity knowledge (Marsico et al. 2020). However, these institutions often face challenges that larger herbaria do not, including a lack of staff with technical skills, limited staff hours for digitization work, inadequate financial resources for specialized scanning equipment, cameras, lights, and imaging stands, limited (or no) access to computers and collection management software, and unreliable internet connections. Data-scarce and biodiversity-rich countries are also often linguistically diverse (Gorenflo et al. 2012), and staff may not read English, which means pre-existing online data publication resources and guides are of limited use.

The "Publish First" method we are trialing addresses several of these issues: it drastically simplifies the publication process so that technical skills are not necessary; it minimizes administrative tasks, saving time; it uses simple, cheap, and easily available hardware; it does not require any specialized software; and the process is so simple that there is little to no need for written instructions. "Publish First" requires staff to attach QR-code labels containing identifiers to herbarium specimen sheets, scan the sheets using a document scanner costing around €300, then drag and drop the resulting files into an S3 bucket (a cloud container that specializes in storing files). The images are then automatically processed by an Optical Character Recognition (OCR) service to extract text, which is passed to OpenAI's Generative Pre-trained Transformer 4 (GPT-4) Application Programming Interface (API) for standardization. The standardized data is integrated into a Darwin Core Archive file that is automatically published through GBIF's Integrated Publishing Toolkit (IPT) (GBIF 2021).

The most technically challenging aspect of this project has been the standardization of OCR output to Darwin Core using the GPT-4 API, particularly in crafting precise prompts to address the inherent inconsistency and unreliability of these Large Language Models (LLMs). Despite this, GPT-4 outperformed our manual scraping efforts. Our choice of GPT-4 as a model was a naive one: we ran the workflow on pre-digitized specimens from previously published Norwegian collections, compared the published data on GBIF with GPT-4's Darwin Core-standardized output, and found the results satisfactory. Moving forward, we plan more rigorous research to compare the effectiveness and cost-efficiency of different LLMs as Darwin Core standardization engines. We are also particularly interested in exploring the "function calling" feature added to the GPT-4 API, as it promises to let us retrieve standardized data in a more consistent and structured format.

This workflow is currently being trialed in Tajikistan and may be used in Uzbekistan, Armenia, and Italy in the near future.
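As a concrete illustration of the ingestion and standardization steps described above, the following Python sketch copies a scanned sheet into an S3 bucket, extracts its label text with an OCR service, and asks GPT-4 to map the raw text to Darwin Core terms. The bucket name, file names, prompt wording, and the choice of Google Cloud Vision as the OCR provider are illustrative assumptions (the abstract names neither the OCR service nor the exact prompts); the sketch assumes the boto3, google-cloud-vision, and openai Python packages.

```python
# Sketch of the "Publish First" ingestion steps: copy a scanned sheet into the
# S3 bucket, OCR it, then ask GPT-4 to map the raw label text to Darwin Core.
# Bucket name, file names and prompt wording are illustrative assumptions.
import json

import boto3
from google.cloud import vision  # example OCR provider; the abstract does not name one
from openai import OpenAI

S3_BUCKET = "publish-first-scans"   # hypothetical bucket name
SCAN_KEY = "scans/TJK-000123.jpg"   # file named after the sheet's QR-code identifier


def upload_scan(local_path: str, key: str = SCAN_KEY) -> None:
    """Step 1: the 'drag and drop' step -- copy the scanned sheet into the S3 bucket."""
    boto3.client("s3").upload_file(local_path, S3_BUCKET, key)


def ocr_label_text(image_bytes: bytes) -> str:
    """Step 2: extract raw text from the scan (Google Cloud Vision shown as one option)."""
    client = vision.ImageAnnotatorClient()
    response = client.document_text_detection(image=vision.Image(content=image_bytes))
    return response.full_text_annotation.text


def standardize_to_dwc(label_text: str) -> dict:
    """Step 3: ask GPT-4 to rewrite the OCR output as Darwin Core terms.

    GPT-4 may still return malformed JSON; that unreliability is why the
    function-calling approach sketched further below is attractive.
    """
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system",
             "content": ("You convert herbarium label text into Darwin Core. "
                         "Reply with a single JSON object whose keys are Darwin Core "
                         "terms (scientificName, recordedBy, eventDate, country, "
                         "locality) and whose values are strings or null.")},
            {"role": "user", "content": label_text},
        ],
    )
    return json.loads(response.choices[0].message.content)
```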
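The Darwin Core Archive published through the IPT is built around a simple occurrence table. The sketch below assembles such a table from the standardized records; the column selection, identifier format, and S3 URL pattern are illustrative assumptions rather than the project's actual configuration, and an IPT instance would still need to be configured to map and publish the file.

```python
# Sketch of assembling standardized records into a tab-separated occurrence
# table that an IPT instance could be configured to publish as a Darwin Core
# Archive. Column choice, identifier format and the S3 URL are illustrative.
import csv

FIELDS = ["occurrenceID", "scientificName", "recordedBy",
          "eventDate", "country", "locality", "associatedMedia"]


def write_occurrence_table(records: list[dict], path: str = "occurrence.txt") -> None:
    """Write one row per specimen sheet; missing terms are left blank."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS, delimiter="\t")
        writer.writeheader()
        for record in records:
            writer.writerow({field: record.get(field, "") for field in FIELDS})


# Made-up example record: the QR-code identifier links the GPT-4 output to the
# scanned image stored in the S3 bucket.
write_occurrence_table([{
    "occurrenceID": "TJK-000123",
    "scientificName": "Tulipa greigii Regel",
    "country": "Tajikistan",
    "associatedMedia": "https://publish-first-scans.s3.amazonaws.com/scans/TJK-000123.jpg",
}])
```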
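Finally, the "function calling" feature mentioned above can constrain GPT-4 to return arguments matching a JSON Schema rather than free text. In current versions of the OpenAI Python client this is exposed through the tools and tool_choice parameters; the sketch below declares a small, illustrative subset of Darwin Core terms and a hypothetical function name.

```python
# Sketch of the GPT-4 "function calling" approach: a JSON Schema covering a
# small, illustrative subset of Darwin Core terms forces the model to return
# structured arguments instead of free text.
import json

from openai import OpenAI

DWC_FUNCTION = {
    "name": "record_darwin_core",  # hypothetical function name
    "description": "Record Darwin Core terms extracted from a herbarium label.",
    "parameters": {
        "type": "object",
        "properties": {
            "scientificName": {"type": "string"},
            "recordedBy": {"type": "string"},
            "eventDate": {"type": "string", "description": "ISO 8601 date"},
            "country": {"type": "string"},
            "locality": {"type": "string"},
        },
        "required": ["scientificName"],
    },
}


def extract_dwc(label_text: str) -> dict:
    """Force GPT-4 to 'call' the declared function, then parse its arguments."""
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": label_text}],
        tools=[{"type": "function", "function": DWC_FUNCTION}],
        tool_choice={"type": "function", "function": {"name": "record_darwin_core"}},
    )
    call = response.choices[0].message.tool_calls[0]
    return json.loads(call.function.arguments)
```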