A Pipeline for Deep Learning with Specimen Images in iDigBio - Applying and Generalizing an Examination of Mercury Use in Preparing Herbarium Specimens

Matthew Collins, Rebecca Dikow, Gaurav Yeole, Sylvia Orli, Renato Figueiredo, Paul Frandsen

doi:10.3897/biss.2.25699

Abstract

iDigBio (Matsunaga et al. 2013) currently references over 22 million media files and stores approximately 120 terabytes of those files co-located with our compute infrastructure. Using these images for scientific research is a logistical and technical challenge. Transferring large numbers of images requires programming skill, bandwidth, and storage space. While simple image transformations such as resizing and generating histograms are approachable on desktops and laptops, the neural networks commonly used for learning from images require server-based graphics processing units (GPUs) to run effectively. Using the GUODA (Global Unified Open Data Access) infrastructure, we have built a model pipeline for applying user-defined processing to any subset of the images stored in iDigBio. This pipeline runs on servers located in the Advanced Computing and Information Systems lab (ACIS) alongside the iDigBio storage system. We use Apache Spark, the Hadoop Distributed File System (HDFS), and Mesos to perform the processing. We have placed a Jupyter notebook server in front of this architecture, which gives end users an easy environment, preloaded with deep learning libraries for Python, in which to write their own models. Users can access the stored data and images, manipulate them according to their requirements, and make their work publicly available on GitHub. As an example of how this pipeline can be used in research, we applied a neural network developed at the Smithsonian Institution to identify herbarium sheets that were prepared with hazardous mercury-containing solutions (Schuettpelz et al. 2017). The model was trained with Smithsonian resources on their images and transferred to the GUODA infrastructure hosted at ACIS, which also houses iDigBio. We then applied this model to additional images in iDigBio to classify them, illustrating how these techniques can be applied to broad image corpora and potentially used to notify other data publishers of contamination.
We present the results of this classification not as a verified research result, but as an example of the collaborative and scalable workflows this pipeline and infrastructure enable.
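
The general shape of the pipeline described above (select a subset of media records, then apply a user-defined function to each one in parallel) can be illustrated with a minimal Python sketch. This is not the GUODA implementation: the record fields, the threshold, and the names `select_subset`, `stand_in_classifier`, and `run_pipeline` are all hypothetical, and a local thread pool stands in for the distributed map step that Spark would perform over images in HDFS.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Record = Dict[str, object]

# Hypothetical media records. The field names are illustrative only and
# are not the iDigBio schema.
RECORDS: List[Record] = [
    {"media_id": "a1", "collection": "herbarium", "feature": 0.42},
    {"media_id": "b2", "collection": "entomology", "feature": 0.77},
    {"media_id": "c3", "collection": "herbarium", "feature": 0.81},
]


def select_subset(records: List[Record],
                  predicate: Callable[[Record], bool]) -> List[Record]:
    """Keep only the records the user is interested in."""
    return [r for r in records if predicate(r)]


def stand_in_classifier(record: Record) -> Record:
    """Placeholder for a trained model (assumption): thresholds a made-up
    numeric feature at an arbitrary value instead of running a network."""
    label = "flagged" if record["feature"] > 0.5 else "clear"
    return {"media_id": record["media_id"], "label": label}


def run_pipeline(records: List[Record],
                 predicate: Callable[[Record], bool],
                 process: Callable[[Record], Record],
                 workers: int = 4) -> List[Record]:
    """Filter the records, then map the user-defined function over them.

    The thread pool is a local stand-in for a distributed map; the
    results come back in the same order as the selected subset.
    """
    subset = select_subset(records, predicate)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process, subset))


results = run_pipeline(
    RECORDS,
    predicate=lambda r: r["collection"] == "herbarium",
    process=stand_in_classifier,
)
```

Keeping the selection predicate and the processing function as separate arguments mirrors the idea of user-defined processing over an arbitrary subset: either piece can be swapped without touching the other.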

Full Text