Towards a scientific workflow featuring Natural Language Processing for the digitisation of natural history collections

David Owen, Alex Hardisty, Irena Spasić, Myriam Van Walsum, Quentin Groom, Noortje Wijkamp, Laurence Livermore, Thijs Leegwater

doi:10.3897/rio.6.e55789

Abstract

We describe an effective approach to automated text digitisation with respect to natural history specimen labels. These labels contain much useful data about the specimen including its collector, country of origin, and collection date. Our approach to automatically extracting these data takes the form of a pipeline. Recommendations are made for the pipeline's component parts based on some of the state-of-the-art technologies.

Optical Character Recognition (OCR) can be used to digitise text on images of specimens. However, recognising text quickly and accurately from these images can be a challenge for OCR. We show that OCR performance can be improved by prior segmentation of specimen images into their component parts. This ensures that only text-bearing labels are submitted for OCR processing as opposed to whole specimen images, which inevitably contain non-textual information that may lead to false positive readings. In our testing Tesseract OCR version 4.0.0 offers promising text recognition accuracy with segmented images.

Not all the text on specimen labels is printed. Handwritten text varies much more and does not conform to standard shapes and sizes of individual characters, which poses an additional challenge for OCR. Recently, deep learning has allowed for significant advances in this area. Google's Cloud Vision, which is based on deep learning, is trained on large-scale datasets, and is shown to be quite adept at this task. This may take us some way towards negating the need for humans to routinely transcribe handwritten text.

Determining the countries and collectors of specimens has been the goal of previous automated text digitisation research activities. Our approach also focuses on these two pieces of information. An area of Natural Language Processing (NLP) known as Named Entity Recognition (NER) has matured enough to semi-automate this task. Our experiments demonstrated that existing approaches can accurately recognise location and person names within the text extracted from segmented images via Tesseract version 4.0.0. Potentially, NER could be used in conjunction with other online services, such as those of the Biodiversity Heritage Library, to map the named entities to entities in the biodiversity literature (https://www.biodiversitylibrary.org/docs/api3.html).

We have highlighted the main recommendations for potential pipeline components. The document also provides guidance on selecting appropriate software solutions. These include automatic language identification, terminology extraction, and integrating all pipeline components into a scientific workflow to automate the overall digitisation process.
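The pipeline idea in the abstract can be pictured as a chain of interchangeable stages that each enrich a record for one segmented label. The following is a minimal sketch under stated assumptions, not the authors' implementation: `LabelRecord`, `run_pipeline`, and the three `stub_*` stages are hypothetical names, and the stages return canned values where a real deployment would call an OCR engine, a language identifier, and an NER tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class LabelRecord:
    """Data accumulated for one segmented label as it moves through the stages."""
    image_id: str
    text: str = ""
    language: str = ""
    entities: Dict[str, List[str]] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)


# A stage takes a record and returns the (enriched) record.
Stage = Callable[[LabelRecord], LabelRecord]


def run_pipeline(record: LabelRecord, stages: List[Stage]) -> LabelRecord:
    """Apply each stage in order and note which stages have run."""
    for stage in stages:
        record = stage(record)
        record.history.append(stage.__name__)
    return record


def stub_ocr(record: LabelRecord) -> LabelRecord:
    # Hypothetical stand-in for an OCR call on the segmented label image.
    canned = {"label-001": "Leg. J. Smith, Kenya, 12 May 1931"}
    record.text = canned.get(record.image_id, "")
    return record


def stub_language_id(record: LabelRecord) -> LabelRecord:
    # Hypothetical stand-in: a real stage would detect the language from the text.
    record.language = "en" if record.text else "und"
    return record


def stub_ner(record: LabelRecord) -> LabelRecord:
    # Hypothetical stand-in: a real stage would fill these lists from an NER tool.
    record.entities = {"PERSON": [], "LOCATION": []}
    return record


result = run_pipeline(
    LabelRecord("label-001"),
    [stub_ocr, stub_language_id, stub_ner],
)
print(result.history)
```

Keeping every stage behind the same record-in, record-out signature is one way to let a component be swapped (for example, a different OCR engine for handwritten labels) without touching the rest of the chain.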

Highlights

We do not know how many specimens are held in the world's museums and herbaria.
This paper examines the state of the art in automated text digitisation with respect to specimen images.
Named Entity Recognition (NER) is commonly used in information extraction to identify text segments that refer to entities from predefined categories (Nadeau and Sekine 2009).
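To make the input and output of entity recognition concrete, here is a deliberately simple sketch that tags country and collector names in OCR text. It is a dictionary and pattern lookup standing in for NER, not a trained recogniser and not the tooling evaluated in the paper. The `COUNTRIES` list, the `COLLECTOR_PATTERN` expression, and the assumption that a collector name follows "Leg." or "Coll." are all illustrative.

```python
import re
from typing import Dict, List, Set

# Hypothetical, tiny gazetteer used only for illustration.
COUNTRIES: Set[str] = {"Kenya", "Belgium", "Netherlands"}

# Assumption: collector names appear as initials plus surname after "Leg." or "Coll.".
COLLECTOR_PATTERN = re.compile(
    r"\b(?:Leg|Coll)\.\s+((?:[A-Z]\.\s*)+[A-Z][a-z]+)"
)


def find_entities(text: str) -> Dict[str, List[str]]:
    """Return location and person mentions found by lookup and pattern matching."""
    capitalised_words = re.findall(r"[A-Z][a-z]+", text)
    locations = [word for word in capitalised_words if word in COUNTRIES]
    persons = [match.strip() for match in COLLECTOR_PATTERN.findall(text)]
    return {"LOCATION": locations, "PERSON": persons}


entities = find_entities("Leg. J. Smith, Kenya, 12 May 1931")
print(entities)
```

A lookup like this misses unseen names and OCR misspellings, which is where statistical or neural NER models are generally expected to do better.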

Summary

Background

We do not know how many specimens are held in the world's museums and herbaria, but estimates of three billion seem reasonable (Wheeler et al. 2012). Perhaps the method most widely used today to extract data from specimen labels is for expert technicians to type the specimen details into a dedicated collection management system. The recommendations in this paper are designed to enhance the digitisation and transcription pipelines that exist at partner institutions. They are intended to provide guidance towards a proposed centralised specimen enrichment pipeline that could be created under a pan-European Research Infrastructure for biodiversity collections (DiSSCo 2020). This pipeline would provide state-of-the-art label digitisation services to institutions that need them. Herbaria have been among the first to mass image their collections, so there is a vast number of specimen images available for testing.

Digitisation Workflow

Project Context

Data Collection

Data Properties

Habitat and altitude

Metadata

Optical Character Recognition

Handwritten Text Recognition

Language Identification

Named Entity Recognition

Terminology Extraction

Putting It All Together

Conclusions

Glossary

Findings

Funding program


Journal: Research Ideas and Outcomes
Publication Date: Jul 3, 2020
Citations: 6
License type: CC BY 4.0
