Abstract

This report reviews the current state-of-the-art applied approaches on automated tools, services and workflows for extracting information from images of natural history specimens and their labels. We consider the potential for repurposing existing tools, including workflow management systems; and areas where more development is required. This paper was written as part of the SYNTHESYS+ project for software development teams and informatics teams working on new software-based approaches to improve mass digitisation of natural history specimens.

Highlights

  • A key limiting factor in organising and using information from global natural history specimens is making that information structured and computable

  • The tools evaluated in this landscape analysis include both unsupervised and supervised machine learning approaches, with a key difference being that unsupervised methods do not require a training dataset

  • As the Specimen Data Refinery is intended to integrate both artificial intelligence (AI) and human-in-the-loop (HitL) approaches to extraction and annotation, citizen science platforms such as plant identification apps and volunteer transcription services were included in the initial research

Read more

Summary

Introduction

A key limiting factor in organising and using information from global natural history specimens is making that information structured and computable. The objective of the Specimen Data Refinery (SDR) is to combine these technologies into a cloud-based platform for processing specimen images and their labels en masse in order to extract essential data efficiently and effectively, according to standard best practices As part of this process a workflow was developed, illustrating the steps required to fully automate the procedure from image capture to a full specimen dataset (Fig. 1). This report does not include: technical evaluation of existing tools, service registries and platform-based approaches; evaluation and recommendations on using, integrating and merging partial (prior/previously created) specimen data; assessment of hardware and physical infrastructure requirements; assessment for the potential to use pan-European Collaborative Data Infrastructure; creation of reference/ground truth/training datasets

Machine Learning and Training Data Sets
Prior Research on Automation
Crowdsourcing and Human-in-the-Loop
Project Context
Methodology
Gap Analysis
Image segmentation
Building a Workflow
Selecting a Human-in-the-Loop Workflow Management Systems
Implementing a standardised workflow language for interoperability
Incorporating prior information and the statistical framework
Assembling the workflow
The Specimen Data Refinery techology stack
Conclusion
Findings
Funding program

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.