Phenonizer: A fine-grained phenotypic named entity recognizer for Chinese clinical texts

Qunsheng Zou, Xuezhong Zhou, Xiaodong Li, Xiaoping Zhang, Kai Chang, Kuo Yang

doi:10.1109/bibm52615.2021.9669766

Abstract

Biomedical named entity recognition from clinical texts is a fundamental task for clinical data analysis, owing to the large volume of electronic medical record data available in real-world clinical settings, most of which is in free-text format. Clinical text contains important phenotypic medical entities, which can be used to profile the clinical characteristics of patients with specific disease conditions. However, general approaches mostly rely on coarse-grained annotations (e.g. mentions of symptom terms) of phenotypic entities in benchmark text datasets. Because clinical texts contain numerous negated expressions of phenotypic entities (e.g. “no fever”, “no cough” and “no hypertension”), such coarse-grained annotations cannot supply the subsequent data analysis process with well-prepared structured clinical data. We therefore constructed a fine-grained Chinese clinical corpus and proposed a phenotypic named entity recognizer (Phenonizer). Results on the test set show that Phenonizer outperforms methods based on Word2Vec, with an F1-score of 0.896. A comparison of character embeddings trained on different data shows that character embeddings trained on clinical corpora improve the F-score by 0.0103. Furthermore, the fine-grained dataset enables methods to distinguish negated symptoms from present symptoms, and so avoids interference from negated symptoms. Finally, we tested the generalization performance of Phenonizer, achieving a superior F1-score of 0.8389. In summary, together with the fine-grained annotated benchmark dataset, Phenonizer provides a feasible approach for effectively extracting symptom information from Chinese clinical texts with acceptable performance.

Full Text