Herbarium specimens have historically provided documented occurrences of plants in specific locations over time, and herbarium collections have therefore been the basis of systematic botany for centuries (Younis et al. 2020). According to the latest summary report based on data from Index Herbariorum, there are around 3400 active herbaria in the world, spread across 182 countries and containing 397 million specimens (Thiers 2021). Rapid growth in high-quality image capture devices, driven by the enormous number of collections that remain uncovered, has led to rising interest in large-scale digitisation initiatives across the world (Le Bras et al. 2017). As herbarium specimens are increasingly digitised and made accessible in online repositories, a need has emerged for automated tools that process and enrich these collections and so improve access to the preserved archives. The rising number of digitised herbarium sheets provides an opportunity to employ computer-based image processing techniques, such as deep learning, to automatically identify species and higher taxa (Carranza-Rojas and Joly 2018, Carranza-Rojas et al. 2017, Younis et al. 2020) or to extract other useful information from the sheets, such as handwritten text, color bars, scales and barcodes. Species identification works well for herbarium sheets that have only one species on a page. However, many herbarium books have multiple species on the same page (as shown in Fig. 1), for which the identification problem becomes far more complex. Enriching such pages manually also takes a great deal of time and effort. In this work, we propose a pipeline that can automatically detect, identify, and enrich plant species in multi-specimen herbaria.
The core idea of the pipeline is to detect the unique plant species and the handwritten text around them, and to map the text to the correct plant species. As shown in Fig. 2, the proposed pipeline begins with pre-processing of the images: each image is rotated and aligned so that its longest edge is its height. In the case of herbarium books, the pages are detected and morphological transformations are performed to reduce occlusions (Thirukokaranam Chandrasekar and Verstockt 2020). A YOLOv3 (You Only Look Once version 3) object detection model (Zhao and Li 2020) is trained from scratch to detect plants and text. The model was trained on a dataset of single-species herbarium sheets, with a mosaic augmentation technique used to extend the plant model to detect multiple species. The first training results are impressive, although they could be further improved with more labelled data. We also plan to train an object segmentation model and compare its performance with that of the plant detection model on multi-specimen herbarium sheets. After both the plants and the text have been detected, the text will be recognized with a state-of-the-art handwritten text recognition (HTR) model. The recognized text can then be matched against a database of specimens to identify each detected specimen. Furthermore, additional textual metadata visible on the sheet (e.g. date, locality, collector's name, institution) will be recognized and used to enrich the collection.
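To make the text-to-plant mapping step concrete, the following minimal Python sketch shows one way detected bounding boxes could be paired. It is an illustration only: the abstract does not state how text is matched to plants, so the nearest-centre rule used here is an assumption, and the names `Box`, `needs_rotation` and `assign_text_to_plants` are hypothetical and not part of the described pipeline.

```python
from __future__ import annotations

from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical type)."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def centre(self) -> tuple[float, float]:
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def needs_rotation(width: int, height: int) -> bool:
    """Return True for a landscape image, which would need a 90 degree
    rotation so that its longest edge becomes the height."""
    return width > height


def assign_text_to_plants(plants: list[Box], texts: list[Box]) -> dict[str, list[str]]:
    """Map each text box to the plant box whose centre is nearest.

    Nearest-centre matching is an assumption made for illustration.
    """
    if not plants:
        return {}
    mapping: dict[str, list[str]] = {plant.label: [] for plant in plants}
    for text in texts:
        tx, ty = text.centre
        nearest = min(
            plants,
            key=lambda p: hypot(p.centre[0] - tx, p.centre[1] - ty),
        )
        mapping[nearest.label].append(text.label)
    return mapping


# Toy page with two plants side by side and one caption under each.
plants = [Box("plant_a", 0, 0, 100, 200), Box("plant_b", 300, 0, 400, 200)]
texts = [Box("text_1", 10, 210, 90, 240), Box("text_2", 310, 210, 390, 240)]
result = assign_text_to_plants(plants, texts)
print(result)
```

On real pages, captions may sit between two specimens or be shared by several, so a distance threshold or a layout-aware rule would probably be needed on top of this simple heuristic.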