Semi‐automated workflows for acquiring specimen data from label images in herbarium collections

Íñigo Granzow-De La Cerda, James H Beach

doi:10.1002/tax.596014

Abstract

Computational workflow environments are an active area of computer science and informatics research; they promise to be effective for automating biological information processing and thereby increasing research efficiency and impact. In this project, semi‐automated data processing workflows were developed to test the efficiency of computerizing information contained in herbarium plant specimen labels. Our test sample consisted of Mexican and Central American plant specimens held in the University of Michigan Herbarium (MICH). The initial data acquisition process consisted of two parts: (1) the capture of digital images of specimen labels and of full‐specimen herbarium sheets, and (2) creation of a minimal field database, or "pre‐catalog", of records that contain only the information necessary to uniquely identify specimens. For entering "pre‐catalog" data, two methods were tested: key‐stroking the information (a) directly from the specimen labels, or (b) from digital images of the specimen labels. In a second step, locality and latitude/longitude data fields were filled in if the values were present on the labels or images. If values were not available, geo‐coordinates were assigned based on further analysis of the descriptive locality information on the label. Time and effort for the various steps were measured and recorded. Our analysis demonstrates a clear efficiency benefit of articulating a biological specimen data acquisition workflow into discrete steps, each of which can then be individually optimized. First, we separated the step of capturing data from the specimen from most keystroke data entry tasks. We did this by capturing a digital image of the specimen for the first step, and by limiting initial key‐stroking of data to the creation of a minimal "pre‐catalog" database for the latter tasks. By doing this, specimen handling logistics were streamlined to minimize staff time and cost.
Second, by then obtaining most of the specimen data from the label images, the more intellectually challenging task of label data interpretation could be moved electronically out of the herbarium to the location of more highly trained specialists for greater efficiency and accuracy. This project used experts in the plants' country of origin, Mexico, to verify localities and geography and to derive geo‐coordinates. Third, with careful choice of data fields for the "pre‐catalog" database, specimen image files linked to the minimal tracking records could be sorted by collector and date of collection to minimize key‐stroking of redundant data in a continuous series of labels, resulting in improved data entry efficiency and data quality.
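
The third point, sorting image‐linked "pre‐catalog" records by collector and collection date so that labels sharing the same data are keyed consecutively, can be illustrated with a minimal sketch. It is not the project's actual software: the record fields (`barcode`, `collector`, `collected_on`, `image_file`), the function names, and the sample values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date
from itertools import groupby


@dataclass(frozen=True)
class PreCatalogRecord:
    """Minimal tracking record; every field name here is an assumption."""
    barcode: str        # hypothetical unique specimen identifier
    collector: str
    collected_on: date
    image_file: str     # hypothetical file name of the linked label image


def sort_for_entry(records):
    """Order records by collector, then collection date, then barcode."""
    return sorted(records, key=lambda r: (r.collector, r.collected_on, r.barcode))


def batch_by_collector_and_date(records):
    """Group sorted records into runs that share collector and date.

    Within one run, shared label data (for example the locality) would
    only need to be keyed once and carried over to the rest of the run.
    """
    ordered = sort_for_entry(records)
    return [
        (key, list(run))
        for key, run in groupby(ordered, key=lambda r: (r.collector, r.collected_on))
    ]


# Made-up sample data, for illustration only.
records = [
    PreCatalogRecord("MICH-0003", "Collector B", date(1970, 5, 1), "img_0003.jpg"),
    PreCatalogRecord("MICH-0001", "Collector A", date(1965, 3, 2), "img_0001.jpg"),
    PreCatalogRecord("MICH-0002", "Collector A", date(1965, 3, 2), "img_0002.jpg"),
]

batches = batch_by_collector_and_date(records)
for (collector, collected_on), run in batches:
    print(collector, collected_on.isoformat(), [r.barcode for r in run])
```

Sorting before grouping matters because `itertools.groupby` only merges adjacent items; the sort is what turns scattered specimens into the continuous series of labels described above.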

Full Text