Abstract

The cTAKES package (using the ClearTK Natural Language Processing toolkit Bethard et al. 2014,http://cleartk.github.io/cleartk/) has been successfully used to automatically read clinical notes in the medical field (Albright et al. 2013, Styler et al. 2014). It is used on a daily basis to automatically process clinical notes and extract relevant information by dozens of medical institutions. ClearEarth is a collaborative project that brings together computational linguistics and domain scientists to port Natural Language Processing (NLP) modules trained on the same types of linguistic annotation to the fields of geology, cryology, and ecology. The goal for ClearEarth in the ecology domain is the extraction of ecologically-relevant terms, including eco-phenotypic traits from text and the assignment of those traits to taxa. Four annotators used Anafora (an annotation software; https://github.com/weitechen/anafora) to mark seven entity types (biotic, aggregate, abiotic, locality, quality, unit, value) and six reciprocal property types (synonym of/has synonym, part of/has part, subtype/supertype) in 133 documents from primarily Encyclopedia of Life (EOL) and Wikipedia according to project guidelines (https://github.com/ClearEarthProject/AnnotationGuidelines). Inter-annotator agreement ranged from 43% to 90%. Performance of ClearEarth on identifying named entities in biology text overall was good (precision: 85.56%; recall: 71.57%). The named entities with the best performance were organisms and their parts/products (biotic entities - precision: 72.09%; recall: 54.17%) and systems and environments (aggregate entities - precision: 79.23%; recall: 75.34%). Terms and their relationships extracted by ClearEarth can be embedded in the new ecocore ontology after vetting (http://www.obofoundry.org/ontology/ecocore.html). This project enables use of advanced industry and research software within natural sciences for downstream operations such as data discovery, assessment, and analysis. In addition, ClearEarth uses the NLP results to generate domain-specific ontologies and other semantic resources.

Highlights

  • The cTAKES package has been successfully used to automatically read clinical notes in the medical field (Albright et al 2013, Styler et al 2014). It is used on a daily basis to automatically process clinical notes and extract relevant information by dozens of medical institutions

  • ClearEarth is a collaborative project that brings together computational linguistics and domain scientists to port Natural Language Processing (NLP) modules trained on the same types of linguistic annotation to the fields of geology, cryology, and ecology

  • The goal for ClearEarth in the ecology domain is the extraction of ecologically-relevant terms, including eco-phenotypic traits from text and the assignment of those traits to taxa

Read more

Summary

Introduction

The cTAKES package (using the ClearTK Natural Language Processing toolkit Bethard et al 2014, http://cleartk.github.io/cleartk/) has been successfully used to automatically read clinical notes in the medical field (Albright et al 2013, Styler et al 2014). Corresponding author: Anne E Thessen (annethessen@gmail.com) Received: 22 Apr 2018 | Published: 22 May 2018 Citation: Thessen A, Preciado J, Jain P, Martin J, Palmer M, Bhat R (2018) Automated Trait Extraction using ClearEarth, a Natural Language Processing System for Text Mining in Natural Sciences. Biodiversity Information Science and Standards 2: e26080.

Results
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call