Abstract

Background

Biomedical named entity recognition (BioNER) is an important task for understanding biomedical texts, but it can be challenging due to the lack of large-scale labeled training data and domain knowledge. To address this challenge, in addition to using powerful encoders (e.g., biLSTM and BioBERT), one possible method is to leverage extra knowledge that is easy to obtain. Previous studies have shown that auto-processed syntactic information can be a useful resource for improving model performance, but their approaches are limited to directly concatenating the embeddings of syntactic information to the input word embeddings. As a result, such syntactic information is leveraged in an inflexible way, where inaccurate information may hurt model performance.

Results

In this paper, we propose BioKMNER, a BioNER model for biomedical texts that uses key-value memory networks (KVMN) to incorporate auto-processed syntactic information. We evaluate BioKMNER on six English biomedical datasets, where our method with KVMN outperforms the strong BioBERT baseline from previous work on all datasets. Specifically, the F1 scores of our best-performing model are 85.29% on BC2GM, 77.83% on JNLPBA, 94.22% on BC5CDR-chemical, 90.08% on NCBI-disease, 89.24% on LINNAEUS, and 76.33% on Species-800, with state-of-the-art performance obtained on four of them (i.e., BC2GM, BC5CDR-chemical, NCBI-disease, and Species-800).

Conclusion

The experimental results on six English benchmark datasets demonstrate that auto-processed syntactic information can be a useful resource for BioNER and that our method with KVMN can appropriately leverage such information to improve model performance.
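To make the mechanism concrete, the sketch below shows how a key-value memory can attend over auto-processed syntactic features. It is a minimal PyTorch illustration: the memory layout, the choice of keys (context word ids) and values (POS-tag ids), and the residual combination are our assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class KVMN(nn.Module):
    """Minimal key-value memory over auto-processed syntactic features.
    Keys and values here (context words and POS tags) are illustrative
    assumptions; the paper's exact feature set may differ."""

    def __init__(self, hidden_dim: int, n_keys: int, n_values: int):
        super().__init__()
        self.key_emb = nn.Embedding(n_keys, hidden_dim)     # e.g., context word ids
        self.value_emb = nn.Embedding(n_values, hidden_dim) # e.g., POS-tag ids

    def forward(self, h, key_ids, value_ids):
        # h: (batch, seq, hidden) encoder output, used as the query
        # key_ids / value_ids: (batch, seq, mem) syntactic memory slots per token
        K = self.key_emb(key_ids)           # (batch, seq, mem, hidden)
        V = self.value_emb(value_ids)       # (batch, seq, mem, hidden)
        scores = torch.einsum('bsh,bsmh->bsm', h, K)   # query-key similarity
        weights = torch.softmax(scores, dim=-1)        # per-token attention
        o = torch.einsum('bsm,bsmh->bsh', weights, V)  # weighted syntactic evidence
        return h + o   # combine with the encoder output before the tagger
```

Because the attention weights are computed per token, unreliable syntactic features can receive low weights instead of being used at full strength, which is the flexibility that direct concatenation lacks.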

Highlights

  • Biomedical named entity recognition (BioNER) is an important task for understanding biomedical texts, which can be challenging due to the lack of large-scale labeled training data and domain knowledge

  • The results demonstrate the effectiveness of our method for BioNER, where BioKMNER outperforms the BioBERT results reported by Lee et al. [19] on all datasets and achieves state-of-the-art results on four of them

  • These datasets focus on four different biomedical entity types: the BC2GM dataset [35] and JNLPBA dataset [14] for gene/protein named entity recognition (NER), the BC5CDR-chemical dataset [44] for chemical NER, the NCBI-disease dataset [8] for disease NER, and the LINNAEUS dataset [9] and Species-800 dataset [29] for species NER


Summary

Introduction

Biomedical named entity recognition (BioNER) is an important task for understanding biomedical texts, but it can be challenging due to the lack of large-scale labeled training data and domain knowledge. Previous studies have shown that auto-processed syntactic information can be a useful resource for improving model performance, but their approaches are limited to directly concatenating the embeddings of syntactic information to the input word embeddings. Such syntactic information is therefore leveraged in an inflexible way, where inaccurate information may hurt model performance. Pretrained models such as ELMo [30] and BERT [6] have achieved state-of-the-art performance on many NLP tasks in the general domain. Lee et al. [19] proposed a variant of BERT, namely BioBERT, for the biomedical domain, which is pretrained on large raw biomedical corpora and achieves state-of-the-art performance in BioNER.
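For contrast, the direct-concatenation approach described above can be sketched as follows. This is a toy illustration rather than any particular prior system's code: the syntactic embedding always contributes at a fixed, full strength, whether or not the automatic parse behind it is correct.

```python
import torch

def concat_syntax(word_emb: torch.Tensor, syn_emb: torch.Tensor) -> torch.Tensor:
    """Direct-concatenation baseline: append syntactic embeddings to word
    embeddings. Every auto-parsed feature is used with a fixed weight,
    so parsing errors propagate into the encoder unchecked."""
    # word_emb: (batch, seq, d_word), syn_emb: (batch, seq, d_syn)
    return torch.cat([word_emb, syn_emb], dim=-1)  # (batch, seq, d_word + d_syn)
```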
