Identification Pipelines Research Articles

Insect populations are changing rapidly, and monitoring these changes is essential for understanding the causes and consequences of such shifts. However, large-scale insect identification projects are time-consuming and expensive when done solely by human identifiers. Machine learning offers a possible solution to help collect insect data quickly and efficiently. Here, we outline a methodology for training classification models to identify pitfall trap-collected insects from image data and then apply the method to identify ground beetles (Carabidae). All beetles were collected by the National Ecological Observatory Network (NEON), a continental-scale ecological monitoring project with sites across the United States. We describe the procedures for image collection, image data extraction, data preparation, and model training, and compare the performance of five machine learning algorithms and two classification methods (hierarchical vs. single-level) identifying ground beetles from the species to subfamily level. All models were trained using pre-extracted feature vectors, not raw image data. Our methodology allows for data to be extracted from multiple individuals within the same image, thus enhancing time efficiency; utilizes relatively simple models that allow for direct assessment of model performance; and can be performed on relatively small datasets. The best performing algorithm, linear discriminant analysis (LDA), reached an accuracy of 84.6% at the species level when naively identifying species, which was further increased to >95% when classifications were limited by known local species pools. Model performance was negatively correlated with taxonomic specificity, with the LDA model reaching an accuracy of ~99% at the subfamily level. When classifying carabid species not included in the training dataset at higher taxonomic levels, the models performed significantly better than if classifications were made randomly.
We also observed greater performance when classifications were made using the hierarchical classification method compared to the single-level classification method at higher taxonomic levels. The general methodology outlined here serves as a proof-of-concept for classifying pitfall trap-collected organisms using machine learning algorithms, and the image data extraction methodology may also serve non-machine-learning purposes. We propose that integration of machine learning in large-scale identification pipelines will increase efficiency and lead to a greater flow of insect macroecological data, with the potential to be expanded for use with other noninsect taxa.
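The abstract reports that limiting classifications to known local species pools raised species-level accuracy, and that accuracy improved at coarser taxonomic levels. It does not say how either step was implemented, so the following is only a minimal Python sketch of one plausible approach. It assumes a classifier that already outputs per-species scores, drops species absent from a site's pool, renormalises the rest, and sums scores per genus to show a coarser-level readout. The function names, the scores, and the site pool are all invented for illustration, and the genus roll-up is not the paper's hierarchical method.

```python
from typing import Dict, Optional, Set


def restrict_to_pool(posteriors: Dict[str, float], pool: Set[str]) -> Dict[str, float]:
    """Keep only species in the local pool and renormalise their scores."""
    kept = {sp: p for sp, p in posteriors.items() if sp in pool}
    total = sum(kept.values())
    if total <= 0.0:
        # No overlap with the pool: fall back to the unrestricted scores.
        return dict(posteriors)
    return {sp: p / total for sp, p in kept.items()}


def roll_up_to_genus(posteriors: Dict[str, float]) -> Dict[str, float]:
    """Sum species scores per genus (first word of the binomial name)."""
    genus_scores: Dict[str, float] = {}
    for sp, p in posteriors.items():
        genus = sp.split()[0]
        genus_scores[genus] = genus_scores.get(genus, 0.0) + p
    return genus_scores


def predict(posteriors: Dict[str, float], pool: Optional[Set[str]] = None) -> str:
    """Return the top-scoring label, optionally restricted to a local pool."""
    scores = restrict_to_pool(posteriors, pool) if pool else posteriors
    return max(scores, key=scores.get)


# Invented classifier scores for a single specimen (illustrative values only).
scores = {
    "Pterostichus melanarius": 0.40,
    "Pterostichus adstrictus": 0.35,
    "Harpalus pensylvanicus": 0.25,
}
# Hypothetical species pool for the collection site.
site_pool = {"Pterostichus adstrictus", "Harpalus pensylvanicus"}

naive_label = predict(scores)              # ignores the site pool
pooled_label = predict(scores, site_pool)  # restricted to the site pool
genus_scores = roll_up_to_genus(scores)    # coarser-level scores
```

In this toy case the naive prediction is a species that is not in the site pool, while the restricted prediction picks the best-scoring species that is. The genus-level scores are more decisive than any single species score, which matches the direction of the reported trend but says nothing about its size.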

Natural history collections are leading successful large-scale projects of specimen digitization (images, metadata, DNA barcodes), thereby transforming taxonomy into a big data science. Yet, little effort has been directed towards safeguarding and subsequently mobilizing the considerable amount of original data generated during the process of naming 15,000–20,000 species every year. From the perspective of alpha-taxonomists, we provide a review of the properties and diversity of taxonomic data, assess their volume and use, and establish criteria for optimizing data repositories. We surveyed 4113 alpha-taxonomic studies in representative journals for 2002, 2010, and 2018, and found an increasing yet comparatively limited use of molecular data in species diagnosis and description. In 2018, of the 2661 papers published in specialized taxonomic journals, molecular data were widely used in mycology (94%), regularly in vertebrates (53%), but rarely in botany (15%) and entomology (10%). Images play an important role in taxonomic research on all taxa, with photographs used in >80% and drawings in 58% of the surveyed papers. The use of omics (high-throughput) approaches or 3D documentation is still rare. Improved archiving strategies for metabarcoding consensus reads, genome and transcriptome assemblies, and chemical and metabolomic data could help to mobilize the wealth of high-throughput data for alpha-taxonomy. Because long-term—ideally perpetual—data storage is of particular importance for taxonomy, energy footprint reduction via less storage-demanding formats is a priority if their information content suffices for the purpose of taxonomic studies. Whereas taxonomic assignments are quasifacts for most biological disciplines, they remain hypotheses pertaining to evolutionary relatedness of individuals for alpha-taxonomy. 
For this reason, an improved reuse of taxonomic data, including machine-learning-based species identification and delimitation pipelines, requires a cyberspecimen approach: linking data via unique specimen identifiers, and thereby making them findable, accessible, interoperable, and reusable for taxonomic research. This poses both qualitative challenges to adapt the existing infrastructure of data centers to a specimen-centered concept and quantitative challenges to host and connect an estimated ≤2 million images produced per year by alpha-taxonomic studies, plus many millions of images from digitization campaigns. Of the 30,000–40,000 taxonomists globally, many are thought to be nonprofessionals, and capturing the data for online storage and reuse therefore requires low-complexity submission workflows and cost-free repository use. Expert taxonomists are the main stakeholders able to identify and formalize the needs of the discipline; their expertise is needed to implement the envisioned virtual collections of cyberspecimens. [Big data; cyberspecimen; new species; omics; repositories; specimen identifier; taxonomy; taxonomic data.]
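The cyberspecimen idea described above, linking heterogeneous data through a unique specimen identifier, can be illustrated with a small Python sketch. This is not a schema from the paper or from any repository: the `DataRecord` class, the `build_cyberspecimens` function, the identifier format, and the record locations are all hypothetical placeholders. The sketch only shows how records of different kinds might be grouped under one specimen.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DataRecord:
    specimen_id: str  # unique specimen identifier (hypothetical format)
    kind: str         # type of data, e.g. "image" or "sequence"
    location: str     # where the data lives (placeholder values below)


def build_cyberspecimens(records: List[DataRecord]) -> Dict[str, Dict[str, List[str]]]:
    """Group record locations by specimen identifier, then by data kind."""
    grouped: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for rec in records:
        grouped[rec.specimen_id][rec.kind].append(rec.location)
    # Convert nested defaultdicts to plain dicts for a stable, printable result.
    return {sid: dict(kinds) for sid, kinds in grouped.items()}


# Placeholder records; identifiers and locations are invented for illustration.
records = [
    DataRecord("SPEC-0001", "image", "images/spec-0001-dorsal.jpg"),
    DataRecord("SPEC-0001", "image", "images/spec-0001-lateral.jpg"),
    DataRecord("SPEC-0001", "sequence", "sequences/spec-0001-coi.fasta"),
    DataRecord("SPEC-0002", "image", "images/spec-0002-dorsal.jpg"),
]

cyberspecimens = build_cyberspecimens(records)
```

Keying everything on the specimen identifier means a later re-identification of a specimen only touches one entry, and all linked images and sequences follow it.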

Articles published on Identification Pipelines

Therapeutic targeting of senescent cells in the CNS.

Disentangling the Black Hole Mass Spectrum with Photometric Microlensing Surveys

ITCep: a deep learning framework for identification of T cell epitopes by harnessing fusion features.

Sex chromosome aneuploidies give rise to changes in the circular RNA profile: A circular transcriptome-wide study of Turner and Klinefelter syndrome across different tissues.

3pHLA-score improves structure-based peptide-HLA binding affinity prediction

SETApp: A machine learning and image analysis based application to automate the sea urchin embryo test

Flukebook: an open-source AI platform for cetacean photo identification

Extensive Variation in Gene Expression is Revealed in 13 Fertility-Related Genes Using RNA-Seq, ISO-Seq, and CAGE-Seq From Brahman Cattle.

Reference-free discovery of nuclear SNPs permits accurate, sensitive identification of Carya (hickory) species and hybrids.

Immunopeptidomics toolkit library (IPTK): a python-based modular toolbox for analyzing immunopeptidomics data

Abstract 274: Comparison of proteomics identification pipelines for lymphocyte characterization

Creation and filtering of a recurrent spectral library of CHO cell metabolites and media components.

The Taxon Hypothesis Paradigm-On the Unambiguous Detection and Communication of Taxa.

Robust and simplified machine learning identification of pitfall trap-collected ground beetles at the continental scale.

RepAHR: an improved approach for de novo repeat identification by assembly of the high-frequency reads

Photo Sleuth

Metaviral SPAdes: assembly of viruses from metagenomic data

Prewas: data pre-processing for more informative bacterial GWAS.

Repositories for Taxonomic Data: Where We Are and What is Missing.

MALDI-TOF Mass Spectrometry and Specific Biomarkers: Potential New Key for Swift Identification of Antimicrobial Resistance in Foodborne Pathogens.
