Abstract

Background: Biomedical named entity recognition is one of the most essential tasks in biomedical information extraction. Previous studies suffer from inadequate annotated datasets, and in particular from the limited knowledge contained in them.

Methods: To remedy this issue, we propose a novel Biomedical Named Entity Recognition (BioNER) framework with label re-correction and knowledge distillation strategies, which can not only create large, high-quality datasets but also yield a high-performance recognition model. Our framework is inspired by two observations: (1) named entity recognition should be considered from the perspectives of both coverage and accuracy; and (2) trustable annotations are best yielded by iterative correction. First, for coverage, we annotate chemical and disease entities in a large-scale unlabeled dataset with PubTator to generate a weakly labeled dataset. For accuracy, we then filter it using multiple knowledge bases to generate a second weakly labeled dataset. Next, the two datasets are revised by a label re-correction strategy to construct two high-quality datasets, which are used to train two recognition models, respectively. Finally, we compress the knowledge in the two models into a single recognition model with knowledge distillation.

Results: Experiments on the BioCreative V chemical-disease relation (CDR) corpus and the NCBI Disease corpus show that knowledge from large-scale datasets significantly improves the performance of BioNER, especially its recall, leading to new state-of-the-art results.

Conclusions: We propose a framework with label re-correction and knowledge distillation strategies. Comparison results show that the two perspectives of knowledge in the two re-corrected datasets are complementary and both effective for BioNER.
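To make the knowledge-base filtering step concrete, here is a minimal Python sketch of checking weak PubTator annotations against knowledge bases. The mention format, the `filter_weak_annotations` function, and the tiny example MeSH/CTD term sets are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: keep a PubTator-derived mention if its surface
# form or normalized ID appears in at least one knowledge base. The mention
# format and the example term sets below are assumptions for this sketch.

def filter_weak_annotations(annotations, knowledge_bases):
    filtered = []
    for mention in annotations:
        surface = mention["text"].lower()
        if any(surface in kb or mention.get("id") in kb for kb in knowledge_bases):
            filtered.append(mention)
    return filtered

# Hypothetical usage with tiny stand-ins for MeSH and CTD term sets.
mesh_terms = {"aspirin", "D001241", "diabetes mellitus"}
ctd_terms = {"aspirin", "nephrotoxicity"}
weak = [
    {"text": "Aspirin", "type": "Chemical", "id": "D001241"},
    {"text": "foobarin", "type": "Chemical", "id": None},
]
print(filter_weak_annotations(weak, [mesh_terms, ctd_terms]))
# Keeps the aspirin mention; drops the unmatched one.
```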

Highlights

  • Biomedical Named Entity Recognition (BioNER) is a fundamental step for downstream biomedical natural language processing tasks

  • We introduce knowledge distillation to compress the recognition models trained on the two datasets into a single recognition model

  • We propose a novel label re-correction strategy to improve recall without significantly introducing noise into the weakly labeled datasets, by leveraging a small manually annotated dataset, i.e., the chemical-disease relation (CDR) or NCBI Disease corpus (a minimal sketch of this idea follows this list)
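This excerpt does not spell out the re-correction procedure, so the following Python sketch is only one plausible reading: a model fine-tuned on the small gold corpus re-labels the weakly labeled data, and a weak "O" label is overwritten only when that model confidently predicts an entity. The `predict_proba` API and the 0.9 threshold are hypothetical.

```python
# Minimal sketch of label re-correction under stated assumptions: recover
# likely false negatives by trusting a confident entity prediction from a
# model trained on the gold corpus (CDR or NCBI Disease).

CONFIDENCE = 0.9  # assumed cut-off for accepting a re-corrected label

def recorrect(tokens, weak_labels, gold_model):
    corrected = []
    for token, weak_label in zip(tokens, weak_labels):
        label, prob = gold_model.predict_proba(token)  # hypothetical API
        if weak_label == "O" and label != "O" and prob >= CONFIDENCE:
            corrected.append(label)       # recover a likely false negative
        else:
            corrected.append(weak_label)  # otherwise keep the weak label
    return corrected
```

Repeating this pass with a model retrained on the corrected data would give the iterative correction mentioned in the abstract.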


Summary

Methods

Table notes: the highest scores are highlighted in bold. 1: models with word and character features; 2: models with additional domain-resource and linguistic features; 3: models with multi-task learning; 4: models with large-scale unlabeled datasets. An asterisk (*) indicates results calculated by us from the chemical and disease results reported in the original papers.

Our model with word-vector dimension 100 achieves performance competitive with that of Lee et al. [24], whose vectors have dimension 768, on both corpora. This demonstrates the effectiveness of our label re-correction and knowledge distillation strategies. During training on the weakly labeled dataset, our word vectors are fine-tuned at the same time, so they retain rich knowledge about chemical and disease entity recognition. When we use BioBERT as the encoder to re-correct the weakly labeled datasets and train a distilled recognition model, it outperforms Lee et al. [24]. The probability of the label "O" is 55.31%, which is larger than that of the label "B-Chemical" at 23.38%. This illustrates that the student can effectively distill trustable knowledge from the teachers, and that the re-correction procedure reduces some false negatives.
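Since the excerpt describes compressing two teacher models into one student, here is a hedged PyTorch sketch of a two-teacher distillation loss. The equal-weight averaging of teacher distributions, the temperature, and the mixing weight `alpha` are assumptions for the sketch, not the authors' exact objective.

```python
# Two-teacher knowledge distillation loss (sketch). Soft targets are the
# average of the teachers' tempered label distributions; the student also
# gets a hard-label cross-entropy term on the (re-corrected) labels.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher1_logits, teacher2_logits,
                      labels, temperature=2.0, alpha=0.5):
    # student_logits: (batch, seq_len, num_labels); labels: (batch, seq_len)
    soft_targets = 0.5 * (
        F.softmax(teacher1_logits / temperature, dim=-1)
        + F.softmax(teacher2_logits / temperature, dim=-1)
    )
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 to keep gradient magnitudes comparable (Hinton et al.).
    kd = F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits.reshape(-1, student_logits.size(-1)),
                         labels.reshape(-1))
    return alpha * kd + (1 - alpha) * ce
```

Under such soft targets, a student token can receive, for example, 55.31% of the probability mass on "O" and 23.38% on "B-Chemical", which is the kind of distribution discussed above.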
