Abstract
Medical named entity recognition (NER) is a fundamental information-extraction task in the medical domain and a key step toward entity relation extraction and medical health question answering. To address the problems of small NER datasets, sparse entity types, and difficult dataset construction in the medical domain, we built our own large-scale medical Q&A dataset, Ywbd-2022, from medical Q&A statement pairs. The dataset contains nine entity categories and covers more than 94,000 statement pairs annotated with three annotation methods. Experiments show that the dataset is highly usable and effective for training. Moreover, vocabulary-enhancement methods can avoid the drop in model performance and recognition accuracy caused by word-segmentation errors. In this paper, we therefore apply vocabulary enhancement to medical NER: we pre-train BERT on a large unlabeled medical-domain corpus, incorporate the ECA attention mechanism, and fine-tune on the professional dataset, obtaining a vocabulary-enhanced model for medical NER, BERT-SL-ECA. Experimentally, BERT-SL-ECA achieves the best results on all three CCKS conference evaluation datasets from 2017 to 2019, and reaches an F1 score of 96.64% on the Ywbd-2022 dataset.
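The vocabulary-enhancement idea referenced above is commonly realized by matching each character of a sentence against a domain lexicon and recording, per character, which matched words begin, continue, end, or wholly coincide at that position (a SoftLexicon-style scheme; the paper's actual lexicon, features, and fusion layer are not shown in the abstract, so the lexicon and sentence below are illustrative assumptions):

```python
# Hedged sketch of SoftLexicon-style character-word matching, a common
# vocabulary-enhancement step for Chinese NER. Lexicon contents are
# hypothetical; the paper's real matching and fusion may differ.

def match_lexicon(sentence, lexicon):
    """For each character, collect matched lexicon words into B/M/E/S sets:
    B = word Begins here, M = character is in the Middle of the word,
    E = word Ends here, S = Single-character word."""
    n = len(sentence)
    sets = [{"B": set(), "M": set(), "E": set(), "S": set()} for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n + 1):
            word = sentence[i:j]
            if word not in lexicon:
                continue
            if len(word) == 1:
                sets[i]["S"].add(word)
            else:
                sets[i]["B"].add(word)       # word starts at character i
                sets[j - 1]["E"].add(word)   # word ends at character j-1
                for k in range(i + 1, j - 1):
                    sets[k]["M"].add(word)   # interior characters
    return sets

# Toy example: "头痛" (headache) with a tiny illustrative lexicon.
lexicon = {"头痛", "头", "痛", "感冒"}
feats = match_lexicon("头痛", lexicon)
print(feats[0]["B"])  # → {'头痛'}
```

These per-character word sets are then pooled into vectors and concatenated with the BERT character representations before the tagging layer, which is how lexicon information reaches the model without relying on an error-prone word segmenter.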