NERBio: using selected word conjunctions, term normalization, and global patterns to improve biomedical named entity recognition

Richard Tzong-Han Tsai, Ting-Yi Sung, Hong-Jie Dai, Hsieh-Chuan Hung, Cheng-Lung Sung, Wen-Lian Hsu

doi:10.1186/1471-2105-7-s5-s11

Open Access

https://doi.org/10.1186/1471-2105-7-s5-s11

Abstract

Background

Biomedical named entity recognition (Bio-NER) is a challenging problem because, in general, biomedical named entities of the same category (e.g., proteins and genes) do not follow one standard nomenclature. They have many irregularities and sometimes appear in ambiguous contexts. In recent years, machine-learning (ML) approaches have become increasingly common and now represent the cutting edge of Bio-NER technology. This paper addresses three problems faced by ML-based Bio-NER systems. First, most ML approaches usually employ singleton features that comprise one linguistic property (e.g., the current word is capitalized) and at least one class tag (e.g., B-protein, the beginning of a protein name). However, such features may be insufficient in cases where multiple properties must be considered. Adding conjunction features that contain multiple properties can be beneficial, but it would be infeasible to include all conjunction features in an NER model since memory resources are limited and some features are ineffective. To resolve the problem, we use a sequential forward search algorithm to select an effective set of features. Second, variations in the numerical parts of biomedical terms (e.g., "2" in the biomedical term IL2) cause data sparseness and generate many redundant features. In this case, we apply numerical normalization, which solves the problem by replacing all numerals in a term with one representative numeral to help classify named entities. Third, the assignment of NE tags does not depend solely on the target word's closest neighbors, but may depend on words outside the context window (e.g., a context window of five consists of the current word plus two preceding and two subsequent words). We use global patterns generated by the Smith-Waterman local alignment algorithm to identify such structures and modify the results of our ML-based tagger. This is called pattern-based post-processing.

Results

To develop our ML-based Bio-NER system, we employ conditional random fields, which have performed effectively in several well-known tasks, as our underlying ML model. Adding selected conjunction features, applying numerical normalization, and employing pattern-based post-processing improve the F-scores by 1.67%, 1.04%, and 0.57%, respectively. The combined increase of 3.28% yields a total score of 72.98%, which is better than the baseline system that only uses singleton features.

Conclusion

We demonstrate the benefits of using the sequential forward search algorithm to select effective conjunction feature groups. In addition, we show that numerical normalization can effectively reduce the number of redundant and unseen features. Furthermore, the Smith-Waterman local alignment algorithm can help ML-based Bio-NER deal with difficult cases that need longer context windows.
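The abstract names the Smith-Waterman local alignment algorithm as the source of its global patterns but does not spell out the procedure. The following is a minimal sketch of Smith-Waterman applied to two token sequences. Aligning at the word level, the scoring values (`match=2`, `mismatch=-1`, `gap=-1`), and the function name `smith_waterman` are all assumptions made for illustration; they are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def smith_waterman(
    a: Sequence[str],
    b: Sequence[str],
    match: int = 2,      # assumed reward for identical tokens
    mismatch: int = -1,  # assumed penalty for differing tokens
    gap: int = -1,       # assumed penalty for skipping a token
) -> Tuple[int, List[Tuple[str, str]]]:
    """Return the best local alignment score and the aligned token pairs."""
    n, m = len(a), len(b)
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    aligned: List[Tuple[str, str]] = []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            aligned.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            aligned.append((a[i - 1], "-"))  # token of a aligned to a gap
            i -= 1
        else:
            aligned.append(("-", b[j - 1]))  # token of b aligned to a gap
            j -= 1
    aligned.reverse()
    return best, aligned


if __name__ == "__main__":
    s1 = "the binding of IL1 to its receptor".split()
    s2 = "we studied binding of CD1 to the receptor".split()
    score, pairs = smith_waterman(s1, s2)
    print(score)
    for left, right in pairs:
        print(left, right)
```

Positions where the two sentences agree could serve as the fixed parts of a pattern, and positions where they differ as slots. How the paper actually turns alignments into patterns is not stated in this abstract.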

Highlights

Biomedical named entity recognition (Bio-NER) is a challenging problem because, in general, biomedical named entities of the same category do not follow one standard nomenclature.
Datasets: In our experiment, we employ the dataset used in the JNLPBA 2004 shared task [26], which was converted from the GENIA corpus.
In the JNLPBA 2004 shared task, the GENIA corpus was still used as training data.

Summary

Introduction

Biomedical named entity recognition (Bio-NER) is a challenging problem because, in general, biomedical named entities of the same category (e.g., proteins and genes) do not follow one standard nomenclature. They have many irregularities and sometimes appear in ambiguous contexts. Most ML approaches employ singleton features that comprise one linguistic property (e.g., the current word is capitalized) and at least one class tag (e.g., B-protein, the beginning of a protein name). Variations in the numerical parts of biomedical terms (e.g., "2" in the biomedical term IL2) cause data sparseness and generate many redundant features. In this case, we apply numerical normalization, which solves the problem by replacing all numerals in a term with one representative numeral to help classify named entities. Depending on the underlying application, Bio-NER systems can extract objects ranging from protein/gene names to disease/virus names.
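Numerical normalization as described above can be sketched in a few lines. This is an illustration only: the text says numerals are replaced with "one representative numeral" but does not say which one, so the choice of "1", the decision to collapse each run of digits into a single numeral, and the name `normalize_numerals` are assumptions.

```python
import re

# Assumption: "1" stands in for every numeral; the source does not name the numeral used.
REPRESENTATIVE = "1"

# Assumption: a whole run of digits (e.g. "12") is treated as one numeral.
_DIGIT_RUN = re.compile(r"[0-9]+")


def normalize_numerals(token: str) -> str:
    """Replace each run of ASCII digits in a token with the representative numeral."""
    return _DIGIT_RUN.sub(REPRESENTATIVE, token)


if __name__ == "__main__":
    for term in ["IL2", "IL12", "CD28", "NF-kappaB"]:
        print(term, "->", normalize_numerals(term))
```

Under these assumptions IL2 and IL12 map to the same string, so a feature learned from one term also fires on the other, which is the sparseness reduction the text describes.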

Journal: BMC Bioinformatics
Publication Date: Dec 1, 2006
Citations: 128
License type: CC BY 2.0
