A neural network approach to chemical and gene/protein entity recognition in patents

Ling Luo, Pei Yang, Hongfei Lin, Lei Wang, Zhihao Yang, Jian Wang, Yin Zhang

doi:10.1186/s13321-018-0318-3

Open Access

Journal: Journal of Cheminformatics	Publication Date: Dec 1, 2018
Citations: 10	License type: open-access

Affiliation: Dalian University of Technology

Abstract

In biomedical research, patents contain a significant amount of information, and biomedical text mining of patents has recently received much attention. To accelerate the development of biomedical text mining for patents, the BioCreative V.5 challenge organized three tracks focused on biomedical entity recognition in patents: chemical entity mention recognition (CEMP), gene and protein related object recognition (GPRO), and technical interoperability and performance of annotation servers. This paper describes our neural network approach for the CEMP and GPRO tracks. In this approach, a bidirectional long short-term memory network with a conditional random field layer is employed to recognize biomedical entities in patents. To improve performance, we explored the effect of additional features (i.e., part of speech, chunking and named entity recognition features generated by the GENIA tagger) on the neural network model. In the official results, our best runs achieve the highest performance among all participating teams in both tracks: a precision of 88.32%, a recall of 92.62%, and an F-score of 90.42% in the CEMP track, and a precision of 76.65%, a recall of 81.91%, and an F-score of 79.19% in the GPRO track.
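The reported F-scores can be checked against the reported precision and recall values. The short sketch below assumes that the F-score is the balanced F-measure (the harmonic mean of precision and recall); the function name `f_score` is ours and is not taken from the paper.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) as reported in the abstract
cemp = f_score(88.32, 92.62)
gpro = f_score(76.65, 81.91)

print(f"CEMP F-score: {cemp:.2f}")  # 90.42
print(f"GPRO F-score: {gpro:.2f}")  # 79.19
```

Under this assumption, both computed values agree with the figures in the abstract to two decimal places.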

Highlights

Biomedical named entity recognition (NER) aims to automatically find biomedical mentions in text, which is crucial for information extraction in the biomedical domain
We explored the effect of additional features (i.e., part of speech (POS), chunking and NER features generated by the GENIA tagger) on the neural network model
Annotations for the gene and protein related object recognition (GPRO) track are divided into two groups: type 1, covering GPRO mentions that can be normalized to a database record; and type 2, covering GPRO mentions that in principle cannot be normalized to a unique bio-entity database record [30]
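
As a rough illustration of the additional features mentioned in the highlights, the sketch below shows one way a token representation could combine a word embedding with POS, chunking and NER features. This is an assumption for illustration only: the tag inventories, the one-hot encoding, the example values and the helper names (`one_hot`, `token_vector`) are hypothetical, and the model described in the paper may encode these features differently (for example, with learned feature embeddings).

```python
# Hypothetical tag inventories; a real system would derive them from the
# tagger output seen in the training data.
POS_TAGS = ["NN", "IN", "JJ", "OTHER"]
CHUNK_TAGS = ["B-NP", "I-NP", "B-PP", "O"]
NER_TAGS = ["B-protein", "I-protein", "O"]


def one_hot(value, inventory):
    """Return a one-hot list for value; unknown values map to the last slot."""
    index = inventory.index(value) if value in inventory else len(inventory) - 1
    return [1.0 if i == index else 0.0 for i in range(len(inventory))]


def token_vector(word_embedding, pos, chunk, ner):
    """Concatenate a word embedding with one-hot POS, chunk and NER features."""
    return (
        list(word_embedding)
        + one_hot(pos, POS_TAGS)
        + one_hot(chunk, CHUNK_TAGS)
        + one_hot(ner, NER_TAGS)
    )


# Illustrative tokens: made-up 2-dimensional embeddings and hand-written tags,
# not real tagger output.
sentence = [
    ([0.1, 0.3], "NN", "B-NP", "O"),
    ([0.7, 0.2], "IN", "B-PP", "O"),
    ([0.5, 0.9], "NN", "B-NP", "B-protein"),
]
features = [token_vector(*token) for token in sentence]
```

Each resulting vector would then serve as the per-token input to a sequence model such as the bidirectional LSTM with a CRF layer named in the abstract.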

Summary

Introduction

Biomedical named entity recognition (NER) aims to automatically find biomedical mentions in text, which is crucial for information extraction in the biomedical domain. Previous BioCreative challenges [1,2,3] addressed various tasks for recognizing biomedical entities (such as genes/proteins, chemicals and diseases) in the scientific literature. Patents are another important source, since they contain a wealth of useful biomedical information. Automatic extraction of information from patents has received much attention, and automatic biomedical entity recognition in medicinal chemistry patents has become an important research task [4]. To promote the development of NER systems, BioCreative V.5, a major challenge event in biomedical natural language processing, organized three tracks to focus on biomedical entity recognition in patents.

Methods

Results

Conclusion