Abstract

The rapid development and refinement of digital technologies in the last two decades have spearheaded a wave of digitization in natural history collections. This has generated a massive number of digitized images, and many more are expected with the planned European Distributed System of Scientific Collections (DiSSCo) infrastructure. Many of these images will contain labels with data written on them, typed on them or interpretable from them, but capturing these data remains a challenge. Automated as well as manual methods are being investigated and have yielded mixed results. In addition, previously captured data or data to be captured this way will need to be interoperable in order to make digital access and enrichment most effective. Finally, institutions holding the physical specimens will need to remain capable of efficiently curating the digital, potentially annotated, counterparts. This will require compatibility with the diverse data models of local Collection Management Systems (CMS). In the context of the ICEDIG (Innovation and consolidation for large scale digitisation of natural heritage) project, a benchmark dataset of herbarium specimens was assembled from nine contributing institutions (Dillen et al. 2019). This dataset was used to evaluate automated methods of text recognition, such as OCR (Optical Character Recognition) and HTR (Handwritten Text Recognition), and post-capture classification, such as language identification or NER (Named Entity Recognition). A pipeline from scan to scientifically useful data was drafted and guidelines for selecting appropriate software solutions were provided. The benchmark dataset was also processed through multiple crowdsourcing platforms, after which the quality and interoperability of the resulting transcriptions were analyzed.
The aptitude of local Collection Management Systems to curate these digitized specimens efficiently was investigated, as well as the fitness of data standards in use to ensure and maintain proper interoperability. In addition, available surveys on CMS use and satisfaction were summarized, and in-depth assessments of the CMS in use at the ICEDIG partner institutes were performed. A summary of results and recommendations will be presented.

Highlights

  • In the context of the ICEDIG (Innovation and consolidation for large scale digitisation of natural heritage) project, a benchmark dataset of herbarium specimens was assembled from nine contributing institutions (Dillen et al. 2019)

  • The benchmark dataset was processed through multiple crowdsourcing platforms, after which the quality and interoperability of the resulting transcriptions were analyzed

  • ICEDIG – “Innovation and consolidation for large scale digitisation of natural heritage” H2020-INFRADEV-2016-2017 – Grant Agreement No 777483
