Abstract

Linking gene and protein names mentioned in the literature to unique identifiers in referent genomic databases is an essential step in accessing and integrating knowledge in the biomedical domain. However, it remains a challenging task because of lexical and terminological variation and the ambiguity of gene name mentions in documents. We present a generic and effective rule-based approach to linking gene mentions in the literature to referent genomic databases, in which both the gene synonyms in the databases and the gene mentions in text are first pre-processed. The mapping method employs a cascaded approach that combines exact, exact-like and token-based approximate matching, using flexible representations of the gene synonym dictionary and of the gene mentions generated during the pre-processing phase. We also consider multi-gene name mentions and permutation of components in gene names. A systematic evaluation of the suggested methods has identified steps that are beneficial for improving either precision or recall in gene name identification. Experiments on the BioCreAtIvE2 data sets (identification of human gene names) showed that our methods achieve highly encouraging results, with an F-measure of up to 81.20%.
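The cascade described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy dictionary, the identifiers, the normalisation rule, the use of Jaccard overlap for the token-based stage and the `min_overlap` threshold are all assumptions made for illustration.

```python
import re

# Hypothetical toy dictionary: identifier -> synonyms (not real database content).
SYNONYMS = {
    "GENE:1": ["tumor necrosis factor alpha", "TNF-alpha"],
    "GENE:2": ["interleukin 2 receptor", "IL2R"],
}


def normalise(name):
    """Exact-like key: lower-case and keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def tokens(name):
    """Order-insensitive token set, split at non-alphanumerics and letter/digit boundaries."""
    return frozenset(re.findall(r"[a-z]+|[0-9]+", name.lower()))


def build_index(synonyms):
    """Pre-process the dictionary into one representation per matching stage."""
    exact, exact_like, token_sets = {}, {}, []
    for gene_id, names in synonyms.items():
        for name in names:
            exact.setdefault(name, set()).add(gene_id)
            exact_like.setdefault(normalise(name), set()).add(gene_id)
            token_sets.append((tokens(name), gene_id))
    return exact, exact_like, token_sets


def link(mention, index, min_overlap=0.75):
    """Try each stage in turn and stop at the first one that yields candidates."""
    exact, exact_like, token_sets = index
    if mention in exact:
        return "exact", exact[mention]
    key = normalise(mention)
    if key in exact_like:
        return "exact-like", exact_like[key]
    mention_tokens = tokens(mention)
    hits = set()
    for synonym_tokens, gene_id in token_sets:
        union = mention_tokens | synonym_tokens
        # Jaccard overlap of token sets; an assumed stand-in for approximate matching.
        if union and len(mention_tokens & synonym_tokens) / len(union) >= min_overlap:
            hits.add(gene_id)
    if hits:
        return "token-based", hits
    return "none", set()


index = build_index(SYNONYMS)
```

Because the last stage compares token sets rather than strings, a mention such as "receptor interleukin 2" still reaches the dictionary entry "interleukin 2 receptor", which is one simple way to accommodate permuted name components.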

Highlights

  • Finding, integrating and exploiting information on genes and proteins they encode is an essential task in the biomedical domain

  • Automated identification of gene and protein names in biomedical text is a fundamental step in biomedical text mining [1, 2]. For example, identifying the names of proteins that act as transcription factors, together with their corresponding target genes, is the first step in the semi-automated construction of regulatory networks from the literature

Introduction

Finding, integrating and exploiting information on genes and proteins they encode is an essential task in the biomedical domain. Gene/protein name identification refers to the process of linking a mention of a name in text to a relevant entry in a genomic database (e.g. Entrez Gene [3] or UniProt [4]). It involves two steps. The first step detects gene and protein name mentions in text. The second step maps the detected mentions to standardised gene identifiers (gene name normalisation or mapping), with the aim of generating a list of unique identifiers (typically from a referent genomic database) for each of the gene and protein mentions. Normalisation allows different mentions associated with the same entity to be treated as equivalent, which is essential for information access and integration.
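As a small illustration of why normalisation matters, the sketch below groups the mentions detected in one document under the identifier they map to. The lookup table, the identifiers and the function name are hypothetical, and a plain case-insensitive lookup stands in for the actual mapping method.

```python
from collections import defaultdict

# Hypothetical mapping from lower-cased synonym to identifier (illustrative only).
LOOKUP = {
    "tnf-alpha": "GENE:1",
    "tumor necrosis factor alpha": "GENE:1",
    "il2r": "GENE:2",
}


def normalise_mentions(mentions, lookup):
    """Group detected mentions by the identifier they map to; unmapped mentions are dropped."""
    grouped = defaultdict(list)
    for mention in mentions:
        gene_id = lookup.get(mention.lower())
        if gene_id is not None:
            grouped[gene_id].append(mention)
    return dict(grouped)


detected = ["TNF-alpha", "IL2R", "tumor necrosis factor alpha", "unknown protein"]
by_identifier = normalise_mentions(detected, LOOKUP)
unique_ids = sorted(by_identifier)  # the document-level list of unique identifiers
```

Here two differently worded mentions end up under the same identifier, so anything stated about either of them can be attributed to a single database entry.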
