A Multistage Gene Normalization System Integrating Multiple Effective Methods

Lishuang Li, Degen Huang, Shanshan Liu, Wenting Fan, Lihua Li, Huiwei Zhou

doi:10.1371/journal.pone.0081956


Open Access

https://doi.org/10.1371/journal.pone.0081956


Abstract

Gene/protein recognition and normalization is an important preliminary step for many biological text mining tasks. In this paper, we present a multistage gene normalization system consisting of four major subtasks: pre-processing, dictionary matching, ambiguity resolution and filtering. For the first subtask, we apply the gene mention tagger developed in our earlier work, which achieves an F-score of 88.42% on the BioCreative II GM testing set. In the dictionary matching stage, exact matching and approximate matching between gene names and the EntrezGene lexicon are combined. For the ambiguity resolution subtask, we propose a semantic similarity disambiguation method based on Munkres' Assignment Algorithm. In the last step, a Wikipedia-based filter removes false positives. Experimental results show that the presented system achieves an F-score of 90.1%, outperforming most state-of-the-art systems.
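As an illustration of the dictionary matching stage, the following minimal Python sketch tries exact matching first and falls back to approximate matching. It is not the system's implementation: the toy lexicon holds only the two CARD10 identifiers quoted in the Introduction, and the `normalize` rule, the `match` function, the 0.8 cutoff and the use of `difflib` as the similarity measure are all assumptions made for this example.

```python
import difflib

# Toy lexicon: normalized gene name -> list of (EntrezGene ID, species).
# Only the two CARD10 entries come from the text; a real system would load
# the full EntrezGene lexicon instead.
LEXICON = {
    "card10": [("29775", "human"), ("105844", "mouse")],
}


def normalize(name):
    """Lowercase and drop spaces and hyphens (an assumed normalization)."""
    return "".join(ch for ch in name.lower() if ch not in " -")


def match(mention, cutoff=0.8):
    """Return candidate (ID, species) pairs for a gene mention.

    Exact matching is tried first; approximate matching is the fallback.
    """
    key = normalize(mention)
    if key in LEXICON:
        return LEXICON[key]
    # Approximate matching via difflib's similarity ratio (a stand-in metric).
    close = difflib.get_close_matches(key, list(LEXICON), n=1, cutoff=cutoff)
    return LEXICON[close[0]] if close else []


print(match("CARD10"))   # exact match after normalization
print(match("CARD-10"))  # hyphen removed by normalize, still exact
print(match("CARD1O"))   # letter O instead of zero, approximate matching
print(match("XYZ99"))    # hypothetical name with no candidate
```

Every successful lookup here returns both the human and the mouse identifier, which is why an ambiguity resolution stage has to follow dictionary matching.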

Highlights

As a critical step of text mining in biomedical literature, gene name normalization [1] is the determination of the unique identifiers of genes and proteins mentioned in biomedical literature, so as to create the linkage between these entities and the biological databases.
Many solutions have been proposed for the gene name normalization task.
For gene name recognition, the fundamental step of gene normalization, we apply the gene mention tagger developed in our earlier work [13], which is based on a two-layer stacking hybrid method and achieves an F-score of 88.42% on the BioCreative II GM testing set.

Summary

Introduction

As a critical step of text mining in biomedical literature, gene name normalization [1] is the determination of the unique identifiers of genes and proteins mentioned in biomedical literature, so as to create the linkage between these entities and the biological databases. For example, “CARD10” with ID “29775” is a human gene, whereas “CARD10” with ID “105844” is a mouse gene. Although they share a name, they are different genes, and this ambiguity must be resolved; this task is known as gene normalization. Many solutions have been proposed for the gene name normalization task, but despite these efforts it remains challenging. The main challenges for gene name normalization are as follows:

Methods

Results

Discussion

Conclusion