A novel automated label data extraction and data base generation system from herbarium specimen images using OCR and NER

Atsuko Takano, Theodor C H Cole, Hajime Konagai

doi:10.1038/s41598-023-50179-0

Open Access

Journal: Scientific Reports
Publication Date: Jan 2, 2024
Citations: 4
License type: CC BY 4.0

Abstract

Digital extraction of label data from natural history specimens, along with more efficient procedures for data entry and processing, is essential for improving documentation and global information availability. Herbaria have made great advances in this direction recently. In this study, using optical character recognition (OCR) and named entity recognition (NER) techniques, we make further advances towards fully automatic extraction of label data from herbarium specimen images. This system can be developed and run on a consumer-grade desktop computer with standard specifications, and can also be applied to extracting label data from diverse kinds of natural history specimens, such as those in entomological collections. This system can facilitate the digitization and publication of natural history museum specimens around the world.
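
The overall flow described above (OCR text, then entity extraction, then database generation) can be illustrated with a minimal sketch. It is not the authors' implementation: the OCR step is assumed to have already run, the label text is an invented example, simple regular-expression rules stand in for a trained NER model, and the field names and the `specimens` table are hypothetical.

```python
import re
import sqlite3

# Hypothetical OCR output for one specimen label (invented for illustration).
# A real system would obtain this text from an OCR engine run on the image.
OCR_TEXT = """HERBARIUM EXAMPLE
Viola mandshurica W.Becker
Loc.: Hyogo Pref., Sanda City
Date: 12 May 1998
Coll.: A. Example No. 1234"""

# Regex rules stand in for a trained NER model (an assumption of this sketch).
PATTERNS = {
    "scientific_name": re.compile(r"^([A-Z][a-z]+ [a-z]+)", re.MULTILINE),
    "locality": re.compile(r"^Loc\.:\s*(.+)$", re.MULTILINE),
    "date": re.compile(r"^Date:\s*(.+)$", re.MULTILINE),
    "collector": re.compile(r"^Coll\.:\s*(.+?)\s+No\.", re.MULTILINE),
    "number": re.compile(r"No\.\s*(\d+)"),
}


def extract_entities(text):
    """Return one record with a value (or None) for each label field."""
    record = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        record[field] = match.group(1).strip() if match else None
    return record


def store(conn, record):
    """Insert one extracted record into a hypothetical specimens table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS specimens "
        "(scientific_name TEXT, locality TEXT, date TEXT, "
        "collector TEXT, number TEXT)"
    )
    conn.execute(
        "INSERT INTO specimens VALUES "
        "(:scientific_name, :locality, :date, :collector, :number)",
        record,
    )
    conn.commit()


record = extract_entities(OCR_TEXT)
conn = sqlite3.connect(":memory:")  # in-memory database for the demo
store(conn, record)
rows = conn.execute(
    "SELECT scientific_name, collector FROM specimens"
).fetchall()
print(record)
print(rows)
```

Hand-written rules like these break down on the varied layouts of real labels, which is the motivation for using a learned NER model in their place.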

Full Text