As the urgency of the climate crisis intensifies, accurate and comprehensive biodiversity data have become crucial for informing climate change studies, tracking key environmental indicators, and building global biodiversity monitoring platforms. The Biodiversity Heritage Library (BHL) is a vital part of the core biodiversity infrastructure, housing over 60 million pages of digitized literature about life on Earth. Recognizing the value of more than 500 years of data held in BHL, a global network of BHL staff is working to establish a scalable data pipeline that delivers actionable occurrence data from BHL's vast and diverse collections. However, transforming textual content into FAIR (findable, accessible, interoperable, reusable) data is challenging: descriptive metadata are often missing, and commercial text engines produce error-ridden, unstructured output (Fig. 1).

Although the knowledge in BHL is now available to global audiences, the underutilization of the biodiversity and climate data contained in its textual corpus hinders scientific research, hampers informed decision-making for conservation efforts, and limits our understanding of biodiversity patterns crucial for addressing the climate crisis. By leveraging recent advances in text recognition engines, cutting-edge AI (Artificial Intelligence) models such as OpenAI's CLIP (Contrastive Language-Image Pre-Training), and nascent features in transcription platforms, BHL staff are beginning to process vast amounts of textual and image data and to transform centuries' worth of data from BHL collections into computationally usable formats. These technological breakthroughs offer a transformative opportunity to provide the global biodiversity community with insights from our shared past and to integrate historical knowledge into climate action initiatives.

To bridge gaps in the historical record and unlock the potential of BHL, a multi-pronged effort using innovative, cross-disciplinary approaches is being piloted. These technical approaches were selected for their efficiency and their ability to generate rapid results applicable across the diverse range of materials in BHL (Fig. 2). Piloting a data pipeline that scales to 60 million pages requires considerable investigation, experimentation, and resources, but it will have an appreciable impact on global conservation efforts by establishing historical baselines deeper in time. This presentation focuses on the identification, extraction, and transformation of optical character recognition (OCR) output into structured data in BHL.
Approaches include:

* Upgrading legacy OCR text with the Tesseract OCR engine to improve data quality by 20% and openly publish 40 GB of textual data as FAIR data (see the Tesseract sketch below);
* Evaluating handwritten text recognition (HTR) engines (Microsoft Azure Computer Vision, Google Cloud Vision API (Application Programming Interface), and Amazon Textract) to improve scientific name-finding in BHL's handwritten archival materials, using algorithms developed by the Global Names Architecture (see the name-finding sketch below);
* Extracting collecting-event data from HTR coordinate outputs, using the Python library pandas to create structured DataFrames (see the pandas sketch below);
* Classifying BHL page-level images with OpenAI's CLIP, a neural network model, to accurately identify the handwritten sub-corpus of primary source materials in BHL (see the CLIP sketch below);
* Running an A/B test to evaluate the efficiency and accuracy of human-keyed transcription, providing high-quality, human-vetted datasets that can be deposited with data aggregators (see the evaluation sketch below).

The ongoing development of a scalable data pipeline for BHL's biodiversity- and climate-related datasets requires sustained support and partnership with the biodiversity community. Initial results demonstrate that liberating data from archival and handwritten field notes is arduous but feasible. Extending these methodologies to the broader scientific literature presents new research opportunities. Extracting and normalizing data from unstructured textual sources can significantly advance biodiversity research and inform environmental policy. BHL staff are committed to building multiple scalable data pipelines, with the ultimate goal of erecting a global biodiversity knowledge graph, rich in interconnected data and semantic meaning, that enables informed decisions for the preservation and sustainable management of Earth's biodiversity.
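As a minimal sketch of the OCR-upgrade step, the snippet below re-OCRs a single page image with Tesseract via the pytesseract bindings and reports word-level confidences. The file name, engine settings, and confidence heuristic are illustrative assumptions, not BHL's production configuration.

```python
# Hedged sketch: re-OCR a legacy BHL page image with Tesseract.
# The page path and quality heuristic are hypothetical.
import pytesseract
from PIL import Image

image = Image.open("bhl_page_0042.jpg")  # hypothetical page scan

# Tesseract's LSTM engine (--oem 1) with automatic page segmentation (--psm 3)
config = "--oem 1 --psm 3 -l eng"
text = pytesseract.image_to_string(image, config=config)

# The data output includes per-word confidences, useful for
# flagging pages whose legacy OCR most needs replacing.
data = pytesseract.image_to_data(
    image, config=config, output_type=pytesseract.Output.DICT
)
confidences = [float(c) for c in data["conf"] if float(c) >= 0]
mean_conf = sum(confidences) / len(confidences) if confidences else 0.0
print(f"{len(text.split())} words, mean word confidence {mean_conf:.1f}")
```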
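For the name-finding step, a hedged sketch of passing HTR output through Global Names' gnfinder web service follows. The endpoint, payload, and response field names reflect our reading of the public gnfinder API and should be checked against its current documentation; the sample sentence is invented.

```python
# Hedged sketch: locate scientific names in HTR output with the
# gnfinder service. Endpoint and field names are assumptions.
import requests

htr_text = "Specimens of Quercus alba L. were collected near the river bank."

resp = requests.post(
    "https://finder.globalnames.org/api/v1/find",
    json={"text": htr_text, "verification": True},
    timeout=30,
)
resp.raise_for_status()

# Each hit is expected to carry the detected name string and its
# character offsets in the input text.
for hit in resp.json().get("names", []):
    print(hit.get("name"), hit.get("start"), hit.get("end"))
```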
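The collecting-event extraction relies on the word-level coordinates that HTR engines return alongside recognized text. The pandas sketch below uses a simplified stand-in for that output and buckets words into physical lines by their vertical position; the 20-pixel line height and the sample records are invented for illustration.

```python
# Minimal sketch: turn word-level HTR output (text plus bounding-box
# coordinates) into structured rows with pandas. The input is a
# simplified stand-in for what Azure/Google/Textract actually return.
import pandas as pd

words = [
    {"text": "1923-06-14", "x": 40,  "y": 102},
    {"text": "Quercus",    "x": 180, "y": 100},
    {"text": "alba",       "x": 260, "y": 101},
    {"text": "42.3601",    "x": 400, "y": 103},
    {"text": "1924-05-02", "x": 40,  "y": 142},
    {"text": "Acer",       "x": 180, "y": 140},
    {"text": "rubrum",     "x": 255, "y": 141},
    {"text": "41.8781",    "x": 400, "y": 143},
]

df = pd.DataFrame(words)
# Bucket words into physical lines by quantizing y (assumed ~20 px line
# height), then read each line left to right.
df["line"] = (df["y"] // 20).astype(int)
lines = (
    df.sort_values(["line", "x"])
      .groupby("line")["text"]
      .apply(" ".join)
      .tolist()
)
for line in lines:
    print(line)  # one reconstructed collecting-event record per line
```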
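For the page-classification step, a zero-shot CLIP sketch follows, using the openly published openai/clip-vit-base-patch32 checkpoint through Hugging Face Transformers. The prompt wording and the two-class framing are assumptions for illustration, not the exact classifier BHL staff use.

```python
# Hedged sketch: zero-shot classification of a BHL page scan as
# handwritten vs. printed with OpenAI's CLIP. Prompts are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a page of handwritten field notes", "a page of printed text"]
image = Image.open("bhl_page_0042.jpg")  # hypothetical page scan

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape (1, num_labels)
probs = logits.softmax(dim=1).squeeze().tolist()

for label, p in zip(labels, probs):
    print(f"{label}: {p:.3f}")
```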
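Finally, one plausible way to score the transcription A/B test is character error rate (CER) against a human-vetted reference. The metric choice, sample strings, and arm names below are illustrative, not the project's actual evaluation protocol.

```python
# Illustrative sketch: compare two human-transcription arms (an A/B
# test) against a vetted reference using character error rate (CER).
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance over characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

reference = "Quercus alba L., collected 14 June 1923"   # vetted reference
arm_a = "Quercus alba L., collected 14 June 1923"       # platform A output
arm_b = "Quercus alba L, colected 14 June 1923"         # platform B output

for name, hyp in [("A", arm_a), ("B", arm_b)]:
    cer = levenshtein(reference, hyp) / len(reference)
    print(f"Arm {name}: CER = {cer:.3f}")
```

In a real evaluation this would run over many pages per arm, with time-on-task recorded alongside CER to capture the efficiency half of the comparison.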