EPIC: An epidemiological investigation of COVID-19 dataset for Chinese named entity recognition

Pu Li, Guohao Zhou, Suzhi Zhang, Yanbu Guo, Yuncheng Jiang, Yong Tang

doi:10.1016/j.ipm.2023.103541

Abstract

The COVID-19 outbreak has had a huge impact worldwide. In China, a large number of epidemiological investigation reports have been produced in response to COVID-19. To analyze and use these reports more effectively in preparation for future large-scale epidemics, this paper proposes a new three-stage annotation method and uses it to construct EPIC (EPidemiological Investigation of COVID-19), a dataset of COVID-19 epidemiological investigations for Chinese named entity recognition (CNER). EPIC contains 10 categories of named entities and focuses on the travel history of confirmed cases. The corpus consists of 226 official epidemiological investigation reports, and its inter-annotator agreement reaches 0.97. Based on EPIC, this paper proposes the Semantic Embedding with Chinese Character Structural Features (SECCSF) method to improve the accuracy of segmentation boundary detection and entity category determination in CNER. In the experiments, several baselines are implemented and evaluated on EPIC. The baseline with the SECCSF method achieves an F1 value of 0.892, which indicates the effectiveness of SECCSF for the NER task on EPIC. We release EPIC at: https://github.com/tinyyhorm/EPIC.

Full Text