Abstract

A large proportion of life science research is gene-oriented: scientists investigate the roles that genes play in biological processes and their involvement in biological mechanisms. As a result, gene names and their related information are among the main objects of interest in the biomedical literature. Although gene mention recognition has made significant progress, its results are still insufficient for direct use because gene names are ambiguous. Gene normalization (GN) goes beyond the recognition task by linking a gene mention to a database ID. Unlike most previous work, we approach GN at the instance level and evaluate its overall performance on the recognition and normalization steps in abstracts and full texts. We release the first instance-level gene normalization (IGN) corpus in the BioC format, which includes annotations for the boundaries of all gene mentions and the corresponding IDs for human gene mentions. Species information, along with existing co-reference chains and full name/abbreviation pairs, is also provided for each gene mention. Using the released corpus, we have designed a collective instance-level GN approach that uses not only the contextual information of each individual instance, but also the relations among instances and the inherent characteristics of full-text sections. Our experimental results show that the collective approach achieves an F-score of 0.743. Exploiting section characteristics in full-text articles improves the F-scores of information-lacking sections by up to 1.8%. In addition, the proposed refinement process improves the F-score of gene mention recognition by 0.125 and that of GN by 0.03. Although the current experimental results are limited to the human species, we plan to continue updating the annotations of the IGN corpus and to examine how the proposed approach can be extended to other species.

Highlights

  • Knowledge about the functions and behaviours of genes and proteins is the primary research interest of life scientists, because it helps them gain a deeper understanding of the complex mechanisms behind biological phenomena

  • Although significant progress has been achieved in named entity recognition, its results are still insufficient for direct use because of the wide array of synonyms and the high ambiguity of name variations across documents [2]

  • The instance-level evaluation measures gene normalization (GN) performance at a fine-grained resolution; the precision, recall, and F-score (PRF) are calculated from the summed true positive, false positive, and false negative counts of linked IDs over all gene mention instances
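
The instance-level scoring described in the last highlight can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: it assumes each mention instance is represented as a (document, start offset, end offset, gene ID) tuple and that a prediction counts as a true positive only when all four fields match a gold annotation. The function name, the tuple layout, and the placeholder IDs ("G1", "G2", ...) are hypothetical.

```python
def instance_level_prf(gold, predicted):
    """Micro-averaged precision, recall, and F-score over mention instances.

    gold, predicted: sets of (doc_id, start, end, gene_id) tuples.
    Assumption: a true positive requires an exact match on all four fields.
    """
    tp = len(gold & predicted)   # linked IDs that match a gold instance
    fp = len(predicted - gold)   # predicted instances with no gold match
    fn = len(gold - predicted)   # gold instances that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


# Hypothetical toy data: counts are summed across both documents.
gold = {("d1", 0, 5, "G1"), ("d1", 20, 24, "G2"), ("d2", 3, 8, "G1")}
predicted = {
    ("d1", 0, 5, "G1"),    # correct
    ("d1", 20, 24, "G3"),  # right span, wrong ID
    ("d2", 3, 8, "G1"),    # correct
    ("d2", 30, 34, "G4"),  # spurious mention
}
p, r, f = instance_level_prf(gold, predicted)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

Because the counts are pooled before the ratios are taken, a gene mentioned many times contributes once per mention, which is what distinguishes this from a document-level evaluation where each ID counts once per article.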

Summary

Introduction

Knowledge about the functions and behaviours of genes and proteins is the primary research interest of life scientists, because it helps them gain a deeper understanding of the complex mechanisms behind biological phenomena. In contrast to a bibliographic query, a gene/protein query tends to return a large number of results because gene/protein names are ambiguous and abbreviations are frequently used in such queries. When the same term is used to query GQuery, a global cross-database NCBI search engine, even more complex results are obtained, which suggests that distinguishing the true identity of named entities is an indispensable process. Several preliminary results [7,8] have demonstrated that such a disambiguation process can improve search quality. It can also support the manual curation of databases [9] and the indexing of entries [10], facilitate links among data across resources [11,12], and improve the online browsing experience [13].

