BCC-NER: bidirectional, contextual clues named entity tagger for gene/protein mention recognition

Gurusamy Murugesan, Balu Bhasuran, Jeyakumar Natarajan, Sabenabanu Abdulkadhar

doi:10.1186/s13637-017-0060-6

Open Access

https://doi.org/10.1186/s13637-017-0060-6

Abstract

Tagging biomedical entities such as genes, proteins, cells, and cell lines is the first step and an important prerequisite in biomedical literature mining. In this paper, we describe our hybrid named entity tagging approach, BCC-NER (bidirectional, contextual clues named entity tagger for gene/protein mention recognition). BCC-NER consists of three modules. The first module handles text processing, which includes basic NLP pre-processing, feature extraction, and feature selection. The second module handles training and model building: bidirectional conditional random fields (CRF) parse the text in both directions (forward and backward), and the backward and forward trained models are integrated using the margin-infused relaxed algorithm (MIRA). The third and final module performs post-processing to improve performance, which includes surrounding text features, parenthesis mismatching, and a two-tier abbreviation algorithm. Evaluated on the BioCreative II GM test corpus, BCC-NER achieves a precision of 89.95, a recall of 84.15, and an overall F-score of 86.95, which is higher than the other currently available open source taggers.
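As a quick consistency check on the reported figures, the sketch below recomputes the F-score from the stated precision and recall. It assumes the F-score is the balanced F1 measure (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name `f_score` is illustrative and not taken from the paper.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F1 measure: harmonic mean of precision and recall.

    Assumption: the abstract's "overall F-score" is F1 (beta = 1).
    Inputs and output are percentages on a 0-100 scale.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
reported_precision = 89.95
reported_recall = 84.15

computed = f_score(reported_precision, reported_recall)
print(f"F-score: {computed:.2f}")
```

Under that assumption, the computed value rounds to 86.95, matching the reported F-score.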

Highlights

Scientific literature is the major source of biomedical knowledge, and the interest in developing automated text mining solutions to extract useful information from biomedical text is increasing every year.
Named entity recognition (NER) in the biomedical domain is generally considered to be more difficult than in other domains such as newswire: there is no standard nomenclature for biomedical entities such as gene and protein names, which results in ambiguity, and millions of biomedical entity names are in use, with more added regularly [2, 3].
Leaman et al. [2] proposed a machine learning-based open source biomedical named entity system that combined conditional random fields (CRF) with some post-processing methods to tag genes/proteins.

Summary

Introduction

Scientific literature is the major source of biomedical knowledge, and the interest in developing automated text mining solutions to extract useful information from biomedical text is increasing every year. Named entity recognition (NER) in the biomedical domain is generally considered to be more difficult than in other domains such as newswire: there is no standard nomenclature for biomedical entities such as gene and protein names, which results in ambiguity, and millions of biomedical entity names are in use, with more added regularly [2, 3]. The commonly used techniques for the NER task include rule-based approaches [4] and dictionary-based approaches [3]. Leaman et al. [2] proposed a machine learning-based open source biomedical named entity system that combined conditional random fields (CRF) with some post-processing methods to tag genes/proteins. The results of SVM as well as CRF were fused, and a useful algorithm was developed after applying two rules.

Methods

Results

Conclusion