Articles published on Specimen Labels
217 Search results
- New
- Research Article
- 10.3897/nhcm.3.179306
- Feb 2, 2026
- Natural History Collections and Museomics
- Jayme Sones + 4 more
The digitization and labelling of historic and new specimens is a time-consuming, error-prone process as labels are still often hand-cut. To improve efficiency and consistency, we developed a low-cost solution for high-throughput environments which employs Cricut Maker ® 3 to automatically cut precision entomological labels. We optimized digital label templates for use with Cricut ® software and developed custom accessories for efficient label transfer and organization. Implementation of pre-cut label batches increased workflow efficiency, reduced user strain, and improved label quality. This system is compatible with existing digitization pipelines and supports standardized labelling formats, including those for DNA barcode workflows. With minimal set-up and maintenance costs, Cricut ® offers an effective suite of tools for generating specimen labels in natural history collections. Modernizing entomology labelling workflows supports data standardization, collection digitization, and the scientific value of natural history specimens.
- New
- Research Article
- 10.3897/bdj.14.e177202
- Jan 19, 2026
- Biodiversity Data Journal
- Juan Wen
The digitization of herbarium specimens is crucial for advancing biodiversity research and data sharing. However, this process is often hindered by the inefficiency of manual transcription and the technical challenges posed by the massive volume of specimens, heterogeneous label layouts, and the prevalence of handwritten texts. To overcome these bottlenecks, this study proposed an automated pipeline that integrates the PaddleOCR engine with the DeepSeek large language model (LLM) for structured information extraction from specimen labels. The pipeline is designed to extract 16 key metadata fields from both printed and handwritten labels. Evaluated on a benchmark dataset, it achieved a high field-level accuracy of 95.4% for printed labels, demonstrating strong reliability. For handwritten labels, the system maintained functionality while correctly identifying its limitations through a confidence-based quality control mechanism. A key finding was the compensatory role of the LLM, which effectively corrected upstream OCR errors, as evidenced by a weak correlation (r = 0.32) between OCR (Optical Character Recognition) confidence and final extraction accuracy. This hybrid architecture ensures data security through local image processing and cost-efficiency via text-only LLM parsing. This work provides a robust, scalable, and practical solution for accelerating the digitization of botanical collections. The method is directly applicable to real-world digitization workflows and promises to significantly enhance the efficiency of biodiversity data creation and sharing.
- Research Article
- 10.3897/biss.9.183295
- Dec 23, 2025
- Biodiversity Information Science and Standards
- Christian Bölling + 4 more
Natural history collections preserve invaluable records of biodiversity across time and space. Each specimen is typically accompanied by one or more labels documenting provenance, locality, and contextual data. Mobilizing this information is crucial for research on biodiversity change, biogeography, and taxonomy. To date, much of this data remains inaccessible for computational knowledge engineering approaches because manually processing and converting the sources into structured, interoperable data formats is a labor-intensive challenge due to the volume, heterogeneity, and complexity of the documents, and curatorial resources for fulfilling these tasks are generally insufficient. Artificial Intelligence (AI)-based methodologies can significantly accelerate this process and make it economically scalable. We present KIEBIDS*1, an open-source framework for specifying and executing AI-based workflows for information extraction from specimen label images. Following a linear data-pipeline architecture, workflows comprise five sequential functional steps for information extraction: image pre-processing (to prepare input images for subsequent analysis), layout analysis (to identify image regions that are relevant for information extraction), optical character recognition (to identify text on the syntactical level), semantic parsing (to identify text that references categories of interest), and entity linking (to identify entities of interest mentioned in the text with authority records).
Modularity and adaptability are central design principles for the framework's architecture. Each function can be realized by one or more modules that operate independently through file-based input and output, enabling substitution or extension as new technologies emerge. This ensures flexible adaptation to various information extraction goals or new data domains. In the current release, image pre-processing is implemented using the OpenCV framework with steps for resizing, grayscale conversion, noise reduction, and binarization. Layout analysis, based on the Segment Anything Model, identifies image regions that depict labels. Character recognition is implemented using two alternative modules. Besides EasyOCR, Moondream is used to leverage locally-deployable vision-language model (VLM) technology. Semantic parsing is implemented using spaCy and regular expressions, as rule-based parsing has proven efficient for syntactically well-defined entities, such as dates or coordinates, given the sparse context of label texts. Entity linking, in the current release, is realized for geographical place names using the GeoNames application programming interface (API). The input for a given workflow run consists of document images and configuration parameters. The configuration parameters encompass settings for the pipeline as a whole, such as location of input and output files and execution mode, and the configurable settings for each functional step of the pipeline, e.g., models to be used, model parameters or the tag selection for the semantic tagging. The workflow's output consists of PAGE-XML files containing image annotations, including the extracted and annotated text.
Optionally, intermediate data and evaluation metrics can be assessed. Integrating seamlessly with Python codebases, Prefect is used for scheduling, monitoring, and graphical user interaction. By combining existing open frameworks rather than developing new components, the project leverages recent advances in computer vision and natural language processing to mobilize biodiversity data. Future developments will focus on improving user experience, integrating better models for handwritten text, and expanding semantic analysis capabilities. KIEBIDS' source code*2 is openly available and locally deployable with moderate hardware requirements.
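As a rough illustration of the rule-based semantic parsing step described in the abstract above, the following minimal sketch extracts dates and decimal coordinates from OCR text with regular expressions. The patterns, the function name `parse_label`, the sample label text, and the assumed day-month-year date order are all hypothetical; the abstract does not specify the rules KIEBIDS actually uses, and spaCy is omitted here.

```python
import re

# Hypothetical patterns for illustration only; not taken from KIEBIDS.
# Dates such as 12.5.1987 or 12/05/1987 (day-month-year order is an assumption).
DATE_RE = re.compile(r"\b(\d{1,2})[./](\d{1,2})[./](\d{4})\b")
# Decimal coordinates followed by a hemisphere letter, e.g. "52.52° N".
COORD_RE = re.compile(r"(-?\d{1,3}\.\d+)\s*°?\s*([NSEW])")


def parse_label(text):
    """Return ISO-formatted dates and signed decimal coordinates found in text."""
    dates = [
        f"{year}-{int(month):02d}-{int(day):02d}"
        for day, month, year in DATE_RE.findall(text)
    ]
    coords = []
    for value, hemisphere in COORD_RE.findall(text):
        # Southern and western hemispheres become negative values.
        sign = -1.0 if hemisphere in "SW" else 1.0
        coords.append((sign * float(value), hemisphere))
    return {"dates": dates, "coords": coords}


result = parse_label("Berlin, 12.5.1987, 52.52° N 13.40° E leg. Mueller")
print(result)
```

Rules of this kind work only for syntactically regular entities; free-text fields such as collector or locality names would need the other modules the abstract lists.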
- Research Article
- 10.1093/clinchem/hvaf086.329
- Oct 2, 2025
- Clinical Chemistry
- Li Tiang Goh + 5 more
Background: Neonatal jaundice is a common condition characterized by elevated bilirubin levels in newborns. Since bilirubin is a photolabile analyte, delays in transporting blood samples can lead to inaccurate results. With a change in blood collection tube, from heparinized capillary tubes (SafeCap®, USA) to Microcuvette® 200 Lithium Heparin LH tubes (Sarstedt, Germany), blood samples for neonatal bilirubin were sent with the blood tube labelled with a specimen label only. A new in-house transport container, repurposed from a urine dipstick container, was implemented to protect the sample from light. This study aims to assess whether the specimen label alone provides adequate protection from light, or if a transport container is required. Methods: Twenty blood samples with neonatal bilirubin concentrations ranging from 63 to 390 µmol/L were collected in Lithium Heparin tubes (Becton Dickinson, USA). These were concurrently transferred into Microcuvette® 200 Lithium Heparin LH tubes. The samples were subjected to different transport conditions: with or without a transport container, and with tubes either fully or half-wrapped with a specimen label. The samples were centrifuged at 2853 g for 5 minutes at 1-, 4- and 6-hour intervals, followed by plasma neonatal bilirubin measurements using a Reichert® UNISTAT Bilirubinometer. Results: Neonatal bilirubin measurements for samples at 1, 4, and 6 hours were compared against 0-hour samples. The average percentage differences in neonatal bilirubin concentration for samples sent with transport containers were 1.0%, 0.9%, and 1.1% respectively. For samples sent with a specimen label but without transport containers, the percentage differences were -0.5%, -2.9%, and -3.8% respectively. For samples that were half-wrapped with a specimen label but without a container, the percentage differences were 0.8%, -3.9%, and -4.4% respectively.
All neonatal bilirubin differences calculated at different time intervals were within The Royal College of Pathologists of Australasia Quality Assurance Program Chemical Pathology Allowable Performance Specifications 2022 (± 8 for ≤ 80 µmol/L and ± 10% for > 80 µmol/L). Neonatal bilirubin levels in samples transported with containers remained stable for up to 6 hours. Samples sent without transport containers showed a greater degree of neonatal bilirubin degradation between 4 and 6 hours. Conclusion: These results demonstrated that specimen labels alone were insufficient in protecting neonatal bilirubin samples from photodegradation, hence the use of transport containers is recommended. The use of this transport container can also be expanded to transporting other photolabile analytes to the laboratory.
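The comparison against 0-hour samples described above comes down to simple arithmetic, sketched below. This is only an illustration: the function names are hypothetical, the example concentrations are made up, and the reading of the allowable limit as ± 8 µmol/L at or below 80 µmol/L and ± 10% above it is an assumption about the quoted specification rather than something the abstract states outright.

```python
def percent_difference(baseline, measured):
    """Signed percentage difference of a later measurement from the 0-hour value."""
    return (measured - baseline) / baseline * 100.0


def within_allowable(baseline, measured,
                     abs_limit=8.0, pct_limit=10.0, cutoff=80.0):
    """Check a result against an assumed two-tier allowable limit.

    Assumption: an absolute limit (in µmol/L) applies at or below the cutoff
    and a percentage limit applies above it.
    """
    if baseline <= cutoff:
        return abs(measured - baseline) <= abs_limit
    return abs(percent_difference(baseline, measured)) <= pct_limit


# Made-up example: a 200 µmol/L sample that reads 191.2 µmol/L later on.
print(round(percent_difference(200.0, 191.2), 1))
print(within_allowable(200.0, 191.2))
```

A drop of this size stays inside the assumed limit, which matches the abstract's statement that all observed differences were within the allowable specifications.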
- Research Article
- 10.1002/aisy.202500005
- Aug 8, 2025
- Advanced Intelligent Systems
- Naifeng Zhang + 4 more
The Natural History Museum, UK (NHM), is at the forefront of digitizing vast natural history collections, with over six million of its 80 million specimens already digitized. Extensive, high‐quality, digital specimen datasets are crucial for the integration and analysis of biological information, providing global accessibility and digital preservation. However, at current rates, it could take centuries to digitize entire collections. To accelerate this, researchers at NHM are exploring the use of collaborative robots (cobots) for digitization. Here, the focus is on the development of artificial intelligence (AI) pipelines for the digitization of one of the largest NHM collections: pinned insects. A proof‐of‐concept workflow is presented that leverages AI to assist in precise identification, handling, and digitization of insect specimens and labels. The pipeline is designed to be adaptable across different museum specimen datasets, and to one day integrate seamlessly with the newly introduced cobot at NHM. Experimental results achieved accuracies of 0.95 for specimen identification, 0.79 for pinheads, and 0.92 for specimen labels, in independent image and video test sets. These results demonstrate the potential of this workflow in accelerating digitization efforts whilst prototyping novel cobot‐integrated digitization systems and advancing biodiversity informatics for data creation and accessibility.
- Research Article
- 10.3897/bdj.13.e160553
- Jul 31, 2025
- Biodiversity Data Journal
- Alan Stenhouse + 1 more
The digitisation of natural history collections represents a critical step towards preserving and increasing accessibility to valuable scientific data. Despite their fundamental importance to taxonomy, ecology and conservation, the world's natural history collections remain underutilised due to the labour-intensive process of extracting metadata from specimen labels. This paper describes SpeciMate, a software application that uses a human-AI collaborative approach to accelerate the extraction of metadata from digitised specimen images. The system leverages artificial intelligence web services including optical character recognition (OCR), automated translation and large language and multimodal models (LLMs) to extract structured metadata, while requiring human expertise for prompt engineering and data curation. We describe the application's architecture, functionality and workflows, which enable effective processing of various specimen types including herbarium sheets and insect slides. Our trials indicate that this tool significantly improves the efficiency of metadata extraction while maintaining high data quality. The combination of automated AI processing with human supervision and refinement represents a promising approach to accelerating the digitisation and databasing of natural history collections, thereby enabling broader access to these invaluable resources for research, education and conservation efforts.
- Research Article
- 10.11606/1807-0205/2025.65.024
- Jul 31, 2025
- Papéis Avulsos de Zoologia
- Leonardo Alho Gomes + 3 more
The Invertebrate Collection of the Instituto Nacional de Pesquisas da Amazônia (INPA) is one of the most significant insect collections in Brazil, encompassing a vast diversity of species. It is particularly notable for its Hymenoptera collection, considered one of the most important in the country. This catalog documents the type specimens of the family Bethylidae (Hymenoptera: Chrysidoidea) deposited at INPA, comprising a total of 329 type specimens, including 55 holotypes and 274 paratypes, distributed across nine genera and 86 species. All specimen label information has been carefully compiled and is presented here alongside additional data from original descriptions and INPA records.
- Research Article
1
- 10.1093/biosci/biaf042
- Jul 17, 2025
- Bioscience
- Robert Turnbull + 3 more
Specimen-associated biodiversity data are crucial for biological, environmental, and conservation sciences. A rate shift is needed to extract data from specimen images efficiently, moving beyond human-mediated transcription. We developed Hespi (for herbarium specimen sheet pipeline) using advanced computer vision techniques to extract authoritative data applicable for a range of research purposes from primary specimen labels on herbarium specimens. Hespi integrates two object detection models: one for detecting the components of the sheet and another for fields on the primary specimen label. It classifies labels as printed, typed, handwritten, or mixed and uses optical character recognition and handwritten text recognition for extraction. The text is then corrected against authoritative taxon databases and refined using a multimodal large language model. Hespi accurately detects and extracts text from specimen sheets across international herbaria, and its modular design allows users to train and integrate custom models.
- Research Article
- 10.14258/turczaninowia.28.2.22
- Jun 30, 2025
- Turczaninowia
- Irina I Gureyeva + 1 more
Lectotypification of the names of 15 taxa of the genus Saussurea, described on the basis of materials stored in the Krylov Herbarium (TK) of Tomsk State University, has been carried out. Lectotypes of 8 validly published names of the taxa (one species and seven varieties) described by famous Tomsk botanists P. N. Krylov, L. P. Sergievskaya, B. K. Schischkin, S. V. Gudoschnikov are designated. In accordance with the rules of the “International Code of Nomenclature”, the category of type specimens of seven taxon names previously cited as “Typus” (“Holotypus”) has been corrected to “Lectotypus”. For each designated lectotype, the nomenclatural citation, text of the herbarium specimen label, categories and number of other type specimens, text of the protologue, and, if necessary, a note are provided. The names are lectotypified regardless of whether they are currently accepted or listed among synonyms.
- Research Article
- 10.31939/vieraea.2025.48.02
- Jun 16, 2025
- Vieraea Folia scientiarum biologicarum canariensium
- Miriam Del Carmen Herrera Darias + 3 more
A total of 38 typus of phanerogams (29 taxa), which are part of the TFC Herbarium, are analyzed. Specimen labels are transcribed and the phenological states of the material as well as the presence of associated documentation are explored. The information on the taxa is updated (distribution, habitat, protection, and new nomenclatural status).
- Research Article
- 10.1002/ece3.71665
- Jun 1, 2025
- Ecology and Evolution
- Madeleine M Ostwald + 7 more
Community or volunteer participation in research has the potential to significantly help mobilize the wealth of biodiversity and functional ecological data housed in natural history collections. Many such projects recruit community scientists to transcribe specimen label data from images; a next step is to task community scientists with conducting straightforward morphological measurements (e.g., body size) from specimen images. We investigated whether community science could be an effective approach to generating significant body size datasets from specimen images generated by museum digitization initiatives. Using the community science platform Notes from Nature, we engaged community scientists in a specimen measurement task to estimate body size (i.e., intertegular distance) from images of bee specimens. Community scientists showed high engagement and completion of this task, with each user measuring 43.6 specimens on average and self-reporting successful measurement of 98.0% of the images. Community scientist measurements were significantly larger than measurements conducted by trained researchers, though the average measurement error was only 2.3%. These results suggest that community science participation could be an effective approach for bee body size measurement, for descriptive studies or for research questions where this degree of expected error is deemed acceptable. For larger-bodied organisms (e.g., vertebrates), where modest measurement errors represent a smaller proportion of body size, community science approaches may be particularly effective. Methods we present here may serve as a blueprint for future projects aimed at engaging the public in biodiversity and collections-based research efforts.
- Research Article
1
- 10.3897/zookeys.1233.140726
- Mar 26, 2025
- ZooKeys
- Dirk Ahrens + 3 more
We provide short tutorials on how to read out specimen label data from typed as well as handwritten labels in a rapid and easy way with a mobile phone. We apply them in general, but test them in particular for insect specimen labels, which are generally quite small. We provide alternative procedure instructions for Android- and Apple-based environments, as well as protocols for single and bulk scans. We expect that this way of data capture will be of great help for a simple digitization in taxonomy and collection management, independent from large industrial digitization pipelines. By omitting the step of taking/maintaining images of the labels, this approach is more rapid, cheaper, and environmentally more sustainable because no storage with carbon footprint is required for label images. We see the biggest advantage of this protocol in the use of readily available commercial devices, which are easy to handle, as they are used on a daily basis and can be replaced at relatively low cost when they come into (informatic) age, which is also a matter of cyber security.
- Research Article
- 10.1098/rspb.2024.2748
- Feb 1, 2025
- Proceedings of the Royal Society B: Biological Sciences
- J C Ordoñez + 7 more
The flowering phenology of many tropical mountain forest tree species remains poorly understood, including flowering synchrony and its drivers across neotropical ecosystems. We obtained herbarium records for 427 tree species from a long-term monitoring transect on the northwestern Ecuadorian Andes, sourced from the Global Biodiversity Information Facility and the Herbario Nacional del Ecuador. Using machine learning algorithms, we identified flowering phenophases from digitized specimen labels and applied circular statistics to build phenological calendars across six climatic regions within the neotropics. We found 47 939 herbarium records, of which 14 938 were classified as flowering by Random Forest Models. We constructed phenological calendars for six regions and 86 species with at least 20 flowering records. Phenological patterns varied considerably across regions, among species within regions, and within species across regions. There was limited interannual synchronicity in flowering patterns within regions primarily driven by bimodal species whose flowering peaks coincided with irradiance peaks. The predominantly high variability of phenological patterns among species and within species likely confers adaptative advantages by reducing interspecific competition during reproductive periods and promoting species coexistence in highly diverse regions with little or no seasonality.
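The abstract above applies circular statistics to flowering dates. As a minimal sketch of why that is needed, the example below computes a circular mean day of year and the mean resultant length, a standard measure of how concentrated the dates are. The function name, the 365-day period, and the sample dates are assumptions for illustration; the abstract does not describe the authors' actual implementation.

```python
import math


def circular_mean_day(days, period=365.0):
    """Return (mean day of year, mean resultant length) for day-of-year values.

    Days are mapped to angles on a circle so that late December and early
    January count as neighbours rather than opposite ends of a line.
    """
    angles = [2.0 * math.pi * day / period for day in days]
    mean_sin = sum(math.sin(a) for a in angles) / len(angles)
    mean_cos = sum(math.cos(a) for a in angles) / len(angles)
    # atan2 gives the mean direction; the modulo maps it into [0, 2*pi).
    mean_angle = math.atan2(mean_sin, mean_cos) % (2.0 * math.pi)
    # Resultant length: near 1 for tightly clustered dates, near 0 when spread out.
    resultant = math.hypot(mean_sin, mean_cos)
    return mean_angle * period / (2.0 * math.pi), resultant


# Hypothetical flowering records on either side of the year boundary.
mean_day, concentration = circular_mean_day([360, 5])
print(mean_day, concentration)
```

An ordinary arithmetic mean of days 360 and 5 would land in early July, whereas the circular mean falls at the year boundary, where the records actually cluster.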
- Research Article
- 10.35699/2675-5327.2024.46756
- Nov 29, 2024
- Lundiana: International Journal of Biodiversity
- Alessandro R Lima + 2 more
The Entomological Collection of the Center of Taxonomic Collections (CCT-UFMG) at the Federal University of Minas Gerais is a significant repository of insect specimens, housing a diverse range of orders and providing invaluable resources for taxonomic and ecological research. With over 245 thousand registered specimens, it stands as the largest collection within the CCT-UFMG. This paper presents a catalog of the collection's insect type specimens, excluding Odonata. The cataloging process involved examination of digital databases, scientific literature, and specimen labels to identify and document type specimens and their associated information. For each taxon we provide the bibliographic reference containing the original description, preservation method, sex, the number of specimens deposited in the CCT-UFMG, geographic distribution, and notes on inconsistencies with the literature. For holotypes and neotypes, we also describe the data label and the condition of the 88 specimens. Hymenoptera predominates with 77 holotypes and neotypes (87% of the total), of which 53 are bees. Other orders represented in holotypes include Coleoptera (1), Hemiptera (8), Mecoptera (1), and Phasmatodea (1). Paratypes encompass 130 species across six orders: Coleoptera, Diptera, Hemiptera, Hymenoptera, and Phasmatodea, with Hymenoptera again being the dominant order, comprising 543 paratypes of 112 species, including 78 species of bees (74% of total paratypes).
- Research Article
1
- 10.1002/aps3.11623
- Nov 5, 2024
- Applications in Plant Sciences
- Robert P Guralnick + 3 more
One of the slowest steps in digitizing natural history collections is converting labels associated with specimens into a digital data record usable for collections management and research. Here, we address how herbarium specimen labels can be converted into digital data records via extraction into standardized Darwin Core fields. We first showcase the development of a rule-based approach and compare outcomes with a large language model-based approach, in particular ChatGPT4. We next quantified omission and commission error rates across target fields for a set of labels transcribed using optical character recognition (OCR) for both approaches. For example, we find that ChatGPT4 often creates field names that are not Darwin Core compliant while rule-based approaches often have high commission error rates. Our results suggest that these approaches each have different strengths and limitations. We therefore developed an ensemble approach that leverages the strengths of each individual method and documented that ensembling strongly reduced overall information extraction errors. This work shows that an ensemble approach has particular value for creating high-quality digital data records, even for complicated label content. While human validation is still needed to ensure the best possible quality, automated approaches can speed digitization of herbarium specimen labels and are likely to be broadly usable for all natural history collection types.
- Research Article
2
- 10.1093/aob/mcae183
- Oct 21, 2024
- Annals of Botany
- Paulo Henrique Gaem + 6 more
Herbaria are the most important source of information for plant taxonomic work. Resources and technologies available today, such as digitized collections and herbarium DNA sequencing, can help accelerate taxonomic decisions in challenging plant groups. Here we employ an integrative methodology relying exclusively on herbarium specimens to investigate species boundaries in the Neotropical Myrcia neoobscura complex (Myrtaceae). We collected morphometric data from high-resolution images of herbarium sheets and analysed them using hierarchical clustering. We subsequently tested the obtained morpho-groups with phylogenomics using the Angiosperms353 probe kit. We also gathered phenological and geographical information from specimen labels and built phenological histograms and ecological niche models to investigate ecological differences amongst taxa. Current circumscriptions of Myrcia arenaria, Myrcia neoglabra and Myrcia neoregeliana are confirmed in this study. Conversely, the four pieces of evidence together support Calyptranthes langsdorffii var. grandiflora, Marlierea regeliana var. parviflora and Marlierea warmingiana as separate from Myrcia marliereana, Myrcia neoriedeliana and Myrcia neoobscura, respectively, contrary to arrangements proposed by previous authors. Integrated analyses also support separation between Myrcia excoriata and two similar, undescribed taxa. Our data reveal the need for major changes in the systematics of the group, with recognition of 12 species. The successful delivery of our study aims was possible due to obtaining robust, high-quality data from museum specimens. We emphasize the importance of maintaining botanical collections physically and digitally available for taxonomic work and advocate their use to accelerate taxonomic solutions of tropical species complexes holistically. This is urgent, given the paucity of funds for fieldwork and unprecedented rates of habitat loss in the tropics.
- Research Article
- 10.3897/biss.8.138512
- Oct 4, 2024
- Biodiversity Information Science and Standards
- Arianna Salili-James + 5 more
The Natural History Museum in the UK (NHM) is home to more than 80 million objects spanning 4.5 billion years of history. Each of these contains a wealth of data, whether on specimen labels, index cards, registers and/or diaries. Transcribing and categorising this information can help unlock crucial research potential. To do this at scale, we turn to computer vision (CV) and Machine Learning (ML) techniques to automate this work. Over a million of the museum’s specimens are ornithological, including one of the largest and most comprehensive egg collections in the world. Representing 52% of known bird species, with over 300,000 clutches (where a clutch defines the total group of eggs laid in a nest) collected over the last 200 years, this is arguably the most important archive of avian environmental change data in existence (Norris et al. 2023). The eggs were historically catalogued using index cards, containing key information such as identification, collection date, locality and clutch size. A proportion of these egg cards have now been imaged and this led to the start of this project, focusing on a sample of 15,000 photographed egg cards (example seen in Fig. 1). Our initial approach used Google Vision to perform Optical Character Recognition (OCR) to transcribe all text within the egg cards. By focusing on textboxes around key terms (e.g., “Collector”), and using CV tools, we approximated boxes around every key category. Finally, each text segment was associated with a category box, followed by minor post-processing in order to extract (i.e., transcribe and categorise) the data. Here we successfully extracted the data within the sample, with a 98.6% average accuracy. Although our methods worked well for our sample, they did rely on consistency within the structures of cards. To expand the project further, and to mitigate the reliance on consistent structures within cards, we turned to Large Language Models (LLMs).
This allowed us to explore automatic data extraction from different types of cards and labels, despite variation in the card structure, and even handle unknown categories of text. Consequently, the scope of the data collected was widened, such as adding ornithological specimen data (e.g., skins), as well as external datasets through collaboration with the British Trust for Ornithology, who manage the Nest Record Scheme (Crick et al. 2003), which holds decades of vital information on the progress of monitored nests in the UK. This index-card data-extraction project is just the beginning. As we expand our data extraction capabilities, our aim is to develop a novel pipeline that can be applied not just to avifauna-related cards, but any structured textual data, with the potential to unlock invaluable insights.
- Research Article
- 10.3897/biss.8.138060
- Sep 30, 2024
- Biodiversity Information Science and Standards
- Atsuko Takano + 4 more
We would like to introduce our recently developed systems for taking images of herbarium specimens and for the automatic extraction of data from specimen labels at the Herbarium of the Museum of Nature and Human Activities, Hyogo, Japan (HYO). Firstly, we designed a low-cost, but high-quality specimen imaging system for non-professional photographers to obtain images rapidly (Takano et al. 2019). Our system uses a mass-produced, mirrorless single-lens reflex (SLR) camera (SONY ILCE6300) with a zoom lens (Samyang Optics SYIO35AF-E35 mm F/2.8). We made a photo stand by ourselves to reduce costs. In addition, we have adopted an LED (light-emitting diode) lighting system with high color rendering. This imaging system has been introduced, with some improvements or adjustments for available space, to various herbaria in Japan (e.g., University of Tokyo (TI), Kyoto University (KYO)), contributing to the digitization of herbarium specimens across Japan. Next, we developed a system to extract label information from specimen images. The specimen image was uploaded to Google OCR and data were extracted in the form of text. Uploading the whole specimen image decreased the reading accuracy of the software because the plant images behaved as OCR (Optical Character Reader) noise. Therefore, the label part was cut out from the whole specimen image by using D-Lib*1 and uploaded to tesseract OCR*2 for OCR extraction of the label information (Aoki 2019, Takano et al. 2020). When installing this system for HYO, we designed it as an application accessible externally via the internet, which proved very useful during the coronavirus pandemic: part-time workers checked and conducted label data input from home. Finally, we decided to develop a system that would automatically label the text data extracted by OCR and input them into the appropriate cells of the database. Even though the text data could be extracted from specimen images, it needed a human to input them into the database. 
Therefore, we adopted Named Entity Recognition (NER), a system that extracts named entities such as place names, identifying proper nouns from unstructured text data. It enables information recorded in herbarium specimens to be tagged as named entities. We tried text matching at first, but the result was not satisfactory, so we started to use machine learning instead. We compared three natural language libraries for Japanese: BERT (Bidirectional Encoder Representations from Transformers), Albert (A Lite version of BERT), and SpaCy. Despite BERT and SpaCy returning similarly high f-scores (indicating good performance), we decided to use SpaCy because it runs better on ordinary PCs or servers. With sufficient machine learning after the creation of a text corpus (a specialised dataset) specific to labels on herbarium specimens, we successfully developed the application. The project files are available on GitHub*3 (Takano et al. 2024). We then examined whether this system could be applied to non-plant specimen images, i.e., fishes or birds, and found that it could efficiently extract data. Therefore, we decided to publicize this system on the cloud server and share it with other natural history museums in Japan*4. Curators can obtain a unique ID and password and upload specimen images from their collection to extract label data. The digitization of natural history collections in Japan has been long behind other countries, and this system will help to accelerate it. The system mentioned above is specialized for the natural history collections of Japan, but we believe it is possible to build similar programs in other countries, and we hope our experience will contribute to the mobilization of the world’s natural history collections.
- Research Article
- 10.3389/fevo.2024.1305931
- Jul 17, 2024
- Frontiers in Ecology and Evolution
- Beulah H Garner + 10 more
Introduction: Historic museum collections hold a wealth of biodiversity data that are essential to our understanding of the rapidly changing natural world. Novel curatorial practices are needed to extract and digitise these data, especially for the innumerable pinned insects whose collecting information is held on small labels. Methods: We piloted semi-automated specimen imaging and digitisation of specimen labels for a collection of ~29,000 pinned ground beetles (Carabidae: Lebiinae) held at the Natural History Museum, London. Raw transcription data were curated against literature sources and non-digital collection records. The primary data were subjected to statistical analyses to infer trends in collection activities and descriptive taxonomy over the past two centuries. Results: This work produced research-ready digitised records for 2,546 species (40% of known species of Lebiinae). Label information on geography was available for 91% of identified specimens and on the time of collection for 39.8% of specimens; the time of collection could be approximated for nearly all specimens. Label data revealed the great age of this collection (average age 91.4 years) and the peak period of specimen acquisition between 1880 and 1930, with little difference among continents. Specimen acquisition declined greatly after about 1950. Early detected species generally were present in numerous specimens but were missing records from recent decades, while more recently acquired species (after 1950) were represented mostly by singleton specimens only. The slowing collection growth was mirrored by the decreasing rate of species description, which was affected by huge time lags of several decades to formal description after the initial specimen acquisition. Discussion: Historic label information provides a unique resource for assessing the state of biodiversity backwards to pre-industrial times.
Many species held in historical collections, especially from tropical super-diverse areas, may never be discovered again, and if they are, their recognition requires access to digital resources and more complete levels of species description. A final challenge is to link the historical specimens to contemporary collections, which are mostly made by mechanical trapping of specimens and DNA-based species recognition.
- Research Article
- 10.1080/23818107.2024.2351992
- May 26, 2024
- Botany Letters
- Lantotiana M E Randriamanana + 6 more
A taxonomic revision of the mostly Malagasy genus Saldinia (Rubiaceae, Rubioideae, Lasiantheae) was conducted by Bremekamp 66 years ago. However, no holotype was selected for four of his 22 species and no distribution maps of these species were presented. The objectives of this study are, therefore, a) to produce a world checklist of the species of Saldinia with information about the type specimens, b) to lectotypify four Saldinia species, and c) to provide new distribution maps of the 22 described species. A total of 966 specimens of Saldinia (including 27 types) from the P herbarium were examined together with the protologues of all recognized species. The geographic coordinates of the 22 species recognized by Bremekamp were gathered from the specimen labels. We present a world checklist of Saldinia and four lectotypifications for Saldinia axillaris, S. acuminata, S. pallida, and S. stenophylla, and provide new distribution maps for all 22 species of Saldinia. The checklist will serve as a basis for future systematic studies in Saldinia and the species distribution maps will be useful for the ongoing taxonomic revision of the genus.