Recognition of protein/gene names from text using an ensemble of classifiers

Guodong Zhou,Soonheng Tan,Jian Su,Jie Zhang,Dan Shen

doi:10.1186/1471-2105-6-s1-s7

Open Access

Abstract

This paper proposes an ensemble of classifiers for biomedical name recognition in which three classifiers, one Support Vector Machine and two discriminative Hidden Markov Models, are combined effectively using a simple majority voting strategy. In addition, we incorporate three post-processing modules (an abbreviation resolution module, a protein/gene name refinement module and a simple dictionary matching module) into the system to further improve performance. Evaluation shows that our system achieves the best performance among 10 systems, with a balanced F-measure of 82.58 on the closed evaluation of the BioCreative protein/gene name recognition task (Task 1A).
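The majority voting step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the B/I/O tag scheme, the function name `majority_vote`, the example tag sequences, and the tie-breaking rule (fall back to one designated classifier when all three disagree) are all assumptions introduced here.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]], fallback: int = 0) -> List[str]:
    """Combine per-token tags from several classifiers by simple majority.

    predictions: one tag sequence per classifier, all of the same length.
    fallback: index of the classifier whose tag is used when no tag wins a
              strict majority (an assumed tie-breaking rule).
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all classifiers must tag the same number of tokens")

    voted = []
    for tags in zip(*predictions):
        tag, count = Counter(tags).most_common(1)[0]
        # Accept the most frequent tag only if more than half the voters agree.
        voted.append(tag if count > len(tags) / 2 else tags[fallback])
    return voted


# Hypothetical outputs of the three classifiers for a five-token sentence.
svm = ["O", "B", "I", "O", "O"]
dhmm1 = ["O", "B", "I", "O", "B"]
dhmm2 = ["B", "B", "O", "O", "I"]

combined = majority_vote([svm, dhmm1, dhmm2], fallback=1)
print(combined)
```

With three voters, a strict majority exists whenever at least two classifiers agree, so the fallback is only used when all three tags differ, as for the last token in this example.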

Highlights

With an overwhelming amount of textual information in biomedicine, there is a need for effective and efficient literature mining and knowledge discovery that can help biologists to gather and make use of the knowledge encoded in text documents.
The only difference between the two discriminative Hidden Markov Models (DHMMs) comes from the part-of-speech (POS) features, which are trained on different corpora.
Our evaluation on the dry-run data shows that the Support Vector Machine (SVM) using the POS feature trained on the refined BioCreative-POS corpus (see below for details) has high precision and low recall, DHMM1 using the POS feature trained on the refined BioCreative-POS corpus has balanced precision and recall, and DHMM2 using the POS feature trained on the unrefined BioCreative-POS corpus has low precision and high recall.
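To make the precision/recall trade-off above concrete, the balanced F-measure is the harmonic mean of precision and recall. The sketch below computes all three from entity-level counts; the function name and the counts are hypothetical and chosen only to show the three profiles described (high precision with low recall, balanced, and low precision with high recall). They are not results from the paper.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, balanced F-measure) from raw counts.

    tp: names correctly recognized
    fp: names recognized that are not in the gold standard
    fn: gold-standard names that were missed
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts out of 100 gold-standard names, for illustration only.
profiles = {
    "high precision, low recall": (60, 5, 40),
    "balanced": (80, 20, 20),
    "low precision, high recall": (90, 45, 10),
}

for name, (tp, fp, fn) in profiles.items():
    p, r, f = precision_recall_f1(tp, fp, fn)
    print(f"{name}: P={p:.3f} R={r:.3f} F={f:.3f}")
```

Because the harmonic mean penalizes an imbalance between precision and recall, classifiers whose errors lean in opposite directions are natural candidates for combination by voting.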

Summary

Introduction

With an overwhelming amount of textual information in biomedicine, there is a need for effective and efficient literature mining and knowledge discovery that can help biologists to gather and make use of the knowledge encoded in text documents. MEDLINE [1], the primary research database serving the biomedical community, is an online bibliographic source of citations and abstracts dating from 1966 to the present and currently contains over 12 million abstracts, with 60,000 new abstracts added each month. Each of them contains entries ranging from thousands to millions and multiplies rapidly. All of these resources are annotated manually by human experts. Such manual handling is throughput-limited, extremely time-consuming and enormously expensive. To make organized and structured information available, automatic recognition of biomedical names becomes critical; it is important for protein-protein interaction extraction, pathway construction, automatic database curation, and similar tasks.

Methods

Results

Conclusion