Abstract

Background: In this study, we focus on building a fine-grained entity annotation corpus, together with a corresponding annotation guideline, for traditional Chinese medicine (TCM) clinical records. Our aim is to provide a basis for the future construction of fine-grained corpora of TCM clinical records.

Methods: We developed a four-step approach suited to constructing a corpus of TCM medical records. First, we determined the entity types included in this study through sample annotation. Then, we drafted a fine-grained annotation guideline by summarizing the characteristics of the dataset and referring to some existing guidelines. We iteratively updated the guideline until the inter-annotator agreement (IAA) exceeded a Cohen’s kappa value of 0.9. Finally, comprehensive annotation was performed while keeping the IAA value above 0.9.

Results: We annotated the 10,197 clinical records in five rounds. Four entity categories involving 13 entity types were employed. The final fine-grained annotated entity corpus consists of 1104 entities and 67,799 tokens. The final IAA averaged 0.936 across the three annotators, indicating that the fine-grained entity recognition corpus is of high quality.

Conclusions: These results will provide a foundation for future research on corpus construction and named entity recognition tasks in the TCM clinical domain.

Highlights

  • In this study, we focus on building a fine-grained entity annotation corpus, together with a corresponding annotation guideline, for traditional Chinese medicine (TCM) clinical records

  • The inter-annotator agreement (IAA) values exceeded 0.9, indicating that the three annotators understood the labels and the TCM records consistently and were able to complete the annotation tasks with satisfactory consistency

  • We presented a method of building a fine-grained annotated entity corpus based on case records of TCM
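
The agreement figures above are reported as Cohen's kappa averaged over three annotators. The following is a minimal sketch of how such a value could be computed, assuming each annotator assigns exactly one label per token and that the average is taken over all annotator pairs; the source does not state the unit of comparison or the averaging scheme, so both are assumptions here. The function names and the label names (`SYMPTOM`, `HERB`, `O`) are hypothetical and are not taken from the paper's guideline.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items that received the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def mean_pairwise_kappa(annotations):
    """Average Cohen's kappa over every pair of annotators.

    `annotations` maps an annotator name to that annotator's label list.
    Pairwise averaging is an assumption, not the paper's stated procedure.
    """
    pairs = list(combinations(annotations.values(), 2))
    if not pairs:
        raise ValueError("at least two annotators are required")
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical token-level labels from three annotators.
    annotations = {
        "annotator_1": ["SYMPTOM", "SYMPTOM", "HERB", "O"],
        "annotator_2": ["SYMPTOM", "SYMPTOM", "HERB", "HERB"],
        "annotator_3": ["SYMPTOM", "SYMPTOM", "HERB", "O"],
    }
    print(f"mean pairwise kappa: {mean_pairwise_kappa(annotations):.3f}")
```

In a workflow like the one described, a guideline revision round would be repeated while this value stays at or below the 0.9 threshold.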


Summary

Introduction

We focus on building a fine-grained entity annotation corpus, together with a corresponding annotation guideline, for traditional Chinese medicine (TCM) clinical records. TCM clinical datasets are scarce for two reasons. First, concerns about patients’ privacy and about revealing unfavorable institutional practices [14] mean that these records are rarely shared. Second, Chinese clinical text is highly complex to analyze: it has sublanguage features [15], so raw TCM free-text clinical records differ markedly from general Chinese text. As a result, constructing a corpus of TCM clinical records remains difficult, electronic capture and retrieval of TCM clinical text data is still a challenge, and research into NLP tasks on TCM clinical free text is at a preliminary stage.

