Abstract

Hymenoptera, the insect order that includes sawflies, bees, wasps, and ants, exhibits an incredible diversity of phenotypes, with over 145,000 species described in a corpus of textual knowledge since Carolus Linnaeus. Without specialized training, which often spans decades, however, these articles can be challenging to decipher. Much of the vocabulary is domain-specific (e.g., Hymenoptera biology), has historically lacked a comprehensive glossary, and contains much homonymous and synonymous terminology. The Hymenoptera Anatomy Ontology (HAO) was developed to surmount this challenge and to aid future communication related to hymenopteran anatomy, as well as to support domain experts so that they may actively benefit from the anatomy ontology development. As part of HAO development, an active learning, dictionary-based, natural language recognition tool was implemented to facilitate Hymenoptera anatomy term discovery in literature. We present this tool, referred to as the ‘Proofer’, as part of an iterative approach to growing phenotype-relevant ontologies, regardless of domain. The process of ontology development results in a critical mass of terms that is applied as a filter to the source collection of articles in order to reveal term occurrence and biases in natural language species descriptions. Our results indicate that taxonomists use domain-specific terminology that follows taxonomic specialization, particularly at superfamily and family level groupings, and that the Proofer tool is effective for term discovery, facilitating ontology construction.

Highlights

  • The vast majority of our biological knowledge exists only in printed, prosaic natural language, or ‘analog’ texts [1]

  • Publication protocol for the description of a new animal species requires that an organism ‘diagnosis’ for each new taxon be published in a journal in accordance with the International Code of Zoological Nomenclature [2]; until 2011, these diagnoses had to appear in journals printed on paper

  • For 25 randomly selected articles, comparing the terms presented to the user with and without the Proofer stop word list showed that the list reduced the number of displayed terms by one third of the article's total word count, an 80% reduction in the number of word combinations displayed by the Proofer tool
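
The effect of a stop word list on the number of candidate word combinations can be illustrated with a minimal Python sketch. This is not the Proofer's implementation, which is not described here: the stop word list, the `tokenize` and `candidate_terms` helpers, and the three-word limit on combinations are all assumptions made for illustration only.

```python
import re

# Hypothetical stop word list; the Proofer's actual list is not given in the text.
STOP_WORDS = frozenset({"the", "of", "and", "a", "is", "with", "in", "to"})


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def candidate_terms(tokens, max_len=3, stop_words=frozenset()):
    """Return every run of 1..max_len consecutive tokens that contains no stop word."""
    candidates = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            gram = tokens[start:start + length]
            if len(gram) < length:
                break  # ran past the end of the token list
            if any(tok in stop_words for tok in gram):
                break  # longer combinations from this start also contain the stop word
            candidates.append(" ".join(gram))
    return candidates


# Illustrative sentence, not taken from the article corpus.
sentence = "The mesosoma is fused with the first abdominal segment"
tokens = tokenize(sentence)

unfiltered = candidate_terms(tokens)
filtered = candidate_terms(tokens, stop_words=STOP_WORDS)

print(len(unfiltered), "candidates without the stop word list")
print(len(filtered), "candidates with the stop word list")
print(filtered)
```

Because a single stop word removes every combination that spans it, the number of candidates falls much faster than the number of individual words, which is consistent in spirit with the reduction reported above.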


Summary

Introduction

The vast majority of our biological knowledge exists only in printed, prosaic natural language, or ‘analog’ texts [1]. This is true of descriptive taxonomy, the subdomain of biology responsible for describing organisms and classifying them into nested sets cataloged with scientific names (i.e. taxa). Recent modifications to the International Code of Zoological Nomenclature allow for entirely electronic publication under certain conditions [3]. Language usage in species diagnoses depends on the describing authors, journal editors, and reviewers of the manuscript, without standardized vocabularies across domains. The collected anatomical (i.e. morphological) terms were applied to the construction of the HAO, based on principles of structural similarity [5,6], enabling future diagnoses to be tied a priori to a structured vocabulary that is detailed enough in morphological terminology to be effective for comparable and accurate descriptions [7,8].

