Abstract

In recent years, the number of published biomedical articles has increased as researchers have focused on biological domains to investigate the functions of biological objects, such as genes and proteins. However, the ambiguous nature of genes and their products has rendered the literature more complex for readers and curators of molecular interaction databases. To address this challenge, a normalization technique that can link variants of biological objects to a single, standardized form was applied. In this work, we developed a species normalization module, which recognizes species names and normalizes them to NCBI Taxonomy IDs. Unlike most previous work, which ignored the prefix of a gene name that represents an abbreviation of the species name to which the gene belongs, the recognition results of our module include the prefixed species. The developed species normalization module achieved an overall F-score of 0.954 on an instance-level species normalization corpus. For gene normalization, two separate modules were employed: one to recognize gene mentions and one to normalize those mentions to their Entrez Gene IDs, using a multistage normalization algorithm developed for processing full-text articles. All of the developed modules are BioC-compatible .NET framework libraries and are publicly available from the NuGet gallery. Database URL: https://sites.google.com/site/hjdairesearch/Projects/isn-corpus

Highlights

  • Life science researchers are interested in exploring biological processes and principles, and their associated objects

  • The instance-level GN (IGN) corpus compiled in our previous work was selected for evaluation

  • All 543 abstracts of the corpus were assigned to our in-lab annotators with life science backgrounds to annotate all mentions of species and their NCBI Taxonomy IDs


Summary

Introduction

Life science researchers are interested in exploring biological processes and principles, and their associated objects. The ability to acquire timely and up-to-date information on genes/proteins cited in the large collection of biomedical literature has become a topic of interest to life scientists. To this end, data mining researchers are developing text-mining techniques to extract high-quality information from the biomedical literature. This work, which was part of the BioC task and was presented at the BioCreative V workshop, developed three BioC-compatible modules for processing abstracts and full-text articles in the BioC format; the modules generate annotations for species and gene/protein names along with their NCBI Taxonomy IDs and Entrez Gene IDs. Most previously released species recognition tools [5,6] only recognize complete species terms such as ‘human’ in the gene name ‘human brain 25 kDa lysophospholipid-specific lysophospholipase’ and normalize them to their corresponding records in the NCBI Taxonomy database. The corpus is represented in the BioC XML format and is publicly available at https://sites.google.com/site/hjdairesearch/Projects/isn-corpus.
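
The difference between recognizing complete species terms and also recognizing species prefixes of gene names can be illustrated with a minimal Python sketch. This is not the authors' .NET module: the tiny lexicon, the prefix table (`h`, `m`, `At`), the regular expressions and the function name `find_species` are all assumptions made for illustration. Only the NCBI Taxonomy IDs themselves (9606 for human, 10090 for mouse, 3702 for Arabidopsis thaliana) are real identifiers.

```python
import re
from typing import List, NamedTuple

# Illustrative lexicon (a tiny, hypothetical subset): species term -> NCBI Taxonomy ID.
SPECIES_TERMS = {
    "human": 9606,
    "homo sapiens": 9606,
    "mouse": 10090,
    "mus musculus": 10090,
    "arabidopsis thaliana": 3702,
}

# Hypothetical gene-name prefixes that abbreviate a species name.
SPECIES_PREFIXES = {"h": 9606, "m": 10090, "At": 3702}


class SpeciesMention(NamedTuple):
    start: int        # character offset where the mention begins
    end: int          # character offset just past the mention
    text: str         # surface form found in the text
    taxonomy_id: int  # normalized NCBI Taxonomy ID
    kind: str         # "term" for a full species term, "prefix" for a gene-name prefix


# Longest terms first, so multi-word names win over any shorter alternative.
_TERM_RE = re.compile(
    r"\b("
    + "|".join(re.escape(t) for t in sorted(SPECIES_TERMS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)

# A prefix counts only when it is directly followed by a gene-like token
# that starts with an uppercase letter (assumed heuristic; it will over-match
# on names such as "mTOR", where "m" does not stand for mouse).
_PREFIX_RE = re.compile(
    r"\b("
    + "|".join(re.escape(p) for p in sorted(SPECIES_PREFIXES, key=len, reverse=True))
    + r")(?=[A-Z][A-Za-z0-9]+\b)"
)


def find_species(text: str) -> List[SpeciesMention]:
    """Return species mentions (full terms and gene-name prefixes) in text order."""
    mentions = []
    for m in _TERM_RE.finditer(text):
        tax_id = SPECIES_TERMS[m.group(1).lower()]
        mentions.append(SpeciesMention(m.start(1), m.end(1), m.group(1), tax_id, "term"))
    for m in _PREFIX_RE.finditer(text):
        tax_id = SPECIES_PREFIXES[m.group(1)]
        mentions.append(SpeciesMention(m.start(1), m.end(1), m.group(1), tax_id, "prefix"))
    return sorted(mentions)  # NamedTuple sorts by start offset first


if __name__ == "__main__":
    sentence = "hZIP2 and human brain lysophospholipase were compared with AtMYB12."
    for mention in find_species(sentence):
        print(mention)
```

In the example sentence, a recognizer limited to complete terms would report only ‘human’, whereas the sketch also reports the prefixes ‘h’ in ‘hZIP2’ and ‘At’ in ‘AtMYB12’. A real system would need a full taxonomy lexicon and disambiguation of prefixes that do not denote a species.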

Materials and methods
Results
Evaluation of species recognition performance
NLP+SR
Species in the same sentence
Focus species
Conclusions