Abstract

Concept recognition (CR) is a foundational task in the biomedical domain. It supports the important process of transforming unstructured resources into structured knowledge. To date, several CR approaches have been proposed, most of which focus on a particular set of biomedical ontologies. Their underlying mechanisms vary from shallow natural language processing and dictionary lookup to specialized machine learning modules. However, no prior approach considers the effect of the case sensitivity characteristics and the term distribution of the underlying ontology on the CR process. This article proposes a framework that models the CR process as an information retrieval task in which both case sensitivity and the information gain associated with tokens in lexical representations (e.g., term labels, synonyms) are central components of a strategy for generating term variants. The case sensitivity of a given ontology is assessed based on the distribution of so-called case sensitive tokens in its terms, while information gain is modelled using a combination of divergence from randomness and mutual information. An extensive evaluation has been carried out using the CRAFT corpus. Experimental results show that case sensitivity awareness leads to an increase of up to 0.07 F1 against a non-case sensitive baseline on the Protein Ontology and GO Cellular Component. Similarly, the use of information gain leads to an increase of up to 0.06 F1 against a standard baseline in the case of GO Biological Process and Molecular Function, and GO Cellular Component. Overall, subject to the underlying token distribution, these methods lead to valid complementary strategies for augmenting term label sets to improve concept recognition.

Impact of Case Sensitivity and Term Information Gain on Biomedical Concept Recognition. PLOS ONE | DOI:10.1371/journal.pone.0119091 | March 19, 2015

Highlights

  • The latest advances in high-throughput methods in the biomedical field have led to an explosion of publicly available data, much of which has been published in free text form, i.e., manuscripts, technical reports, etc.

  • The public 1.0 version of the CRAFT corpus consists of 67 full-length articles that have been manually annotated against several ontologies covering various aspects, such as proteins, chemical entities or cells.

  • From a case sensitivity perspective, the results are mixed. Setting aside the Cell Ontology, which was assessed as non-case sensitive, GO Cellular Component (GO_CC) and the Protein Ontology (PRO) recorded an increase in F-score over the baseline (2.05% and 7.15%), while most of the other ontologies were affected negatively.

Summary

Introduction

The latest advances in high-throughput methods in the biomedical field have led to an explosion of publicly available data, much of which has been published in free text form, i.e., manuscripts, technical reports, etc. This vast amount of data makes manual curation of biological entities (e.g., genes, proteins) infeasible [1].

The 2-Poisson model relies on the premise that informative words are supported by an elite set of documents, in which these words tend to be more frequent than in the rest of the documents. Words that are not supported by such an elite set have frequencies that follow a random distribution. A divergence from randomness (DFR) model relies on the assumption that a word carries more information within a particular document the more its within-document frequency diverges from its frequency in the collection. The weight of a word is inversely related to the probability of its within-document frequency in a document d, under a particular model of randomness M (see Eq 1).
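
As a rough illustration of the inverse relationship between weight and probability described above, the sketch below computes a weight as the negative base-2 logarithm of the probability of a word's within-document frequency. It assumes a Poisson distribution as the model of randomness M, with its mean set to the word's collection frequency divided by the number of documents. The function and parameter names are hypothetical, and nothing here should be read as the exact form of Eq 1 or as the normalisation steps a full DFR model would add.

```python
import math


def dfr_poisson_weight(tf: int, collection_freq: int, num_docs: int) -> float:
    """Informative content, in bits, of seeing a word `tf` times in a document.

    Assumption (not taken from the source): the model of randomness M is a
    Poisson distribution with mean lam = collection_freq / num_docs.
    The weight is -log2 P(tf | lam), so less probable frequencies weigh more.
    """
    if tf < 0:
        raise ValueError("tf must be non-negative")
    if collection_freq <= 0 or num_docs <= 0:
        raise ValueError("collection_freq and num_docs must be positive")

    lam = collection_freq / num_docs
    # Natural log of the Poisson probability: -lam + tf*ln(lam) - ln(tf!)
    log_p = -lam + tf * math.log(lam) - math.lgamma(tf + 1)
    # Convert to bits and negate, so that low probability gives high weight.
    return -log_p / math.log(2)


if __name__ == "__main__":
    # Same within-document frequency, very different collection frequencies.
    rare = dfr_poisson_weight(tf=5, collection_freq=10, num_docs=1000)
    common = dfr_poisson_weight(tf=5, collection_freq=5000, num_docs=1000)
    print(f"rare word weight:   {rare:.2f} bits")
    print(f"common word weight: {common:.2f} bits")
```

Under these assumptions, a word that occurs five times in a document but only rarely in the collection receives a much larger weight than a word that occurs five times in a document and is frequent everywhere, which is the behaviour the divergence-based weighting is meant to capture.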
