A fine-grained Chinese word segmentation and part-of-speech tagging corpus for clinical text

Ying Xiong, Dehuan Jiang, Xiaolong Wang, Zhongmin Wang, Jun Yan, Buzhou Tang, Hua Xu, Qingcai Chen

doi:10.1186/s12911-019-0770-7

Abstract

Background: Chinese word segmentation (CWS) and part-of-speech (POS) tagging are two fundamental tasks of Chinese text processing. They are usually preliminary steps for many Chinese natural language processing (NLP) tasks. Although CWS and POS tagging have been studied extensively in various domains, few studies have addressed the clinical domain, where the appropriate granularity of words is hard to determine.

Methods: In this paper, we investigated CWS and POS tagging for Chinese clinical text at a fine-grained level and manually annotated a corpus. On the corpus, we compared two state-of-the-art methods: conditional random fields (CRF) and bidirectional long short-term memory (BiLSTM) with a CRF layer (BiLSTM-CRF). To validate the plausibility of the fine-grained annotation, we further investigated the effect of CWS and POS tagging on Chinese clinical named entity recognition (NER) on another independent corpus.

Results: When only CWS was considered, CRF achieved higher precision, recall and F-measure than BiLSTM-CRF. When both CWS and POS tagging were considered, CRF also had an advantage over BiLSTM-CRF. CRF outperformed BiLSTM-CRF by 0.14% in F-measure on CWS and by 0.34% in F-measure on POS tagging. On the NER task, the CWS information brought a maximum improvement of 0.34% in F-measure, while the combined CWS and POS information brought a maximum improvement of 0.74% in F-measure.

Conclusions: Our proposed fine-grained CWS and POS tagging corpus is reliable and meaningful, as the output of the CWS and POS tagging systems developed on this corpus improved the performance of a Chinese clinical NER system on another independent corpus.
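The precision, recall and F-measure figures quoted in the abstract can be illustrated with a minimal sketch of word-level scoring for segmentation. It assumes the common convention that a predicted word counts as correct only when its character span matches a gold word exactly; the exact evaluation protocol used by the authors is not stated on this page. The sample sentence and the names `spans` and `prf` are hypothetical.

```python
def spans(words):
    """Convert a word list into a set of (start, end) character offsets."""
    out, start = set(), 0
    for w in words:
        out.add((start, start + len(w)))
        start += len(w)
    return out


def prf(gold, pred):
    """Word-level precision, recall and F-measure (exact span match)."""
    g, p = spans(gold), spans(pred)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical sentence: the gold annotation splits "自觉" into "自" and "觉",
# while the prediction keeps it as one word, as a newswire model might.
gold = ["患者", "自", "觉", "头痛"]
pred = ["患者", "自觉", "头痛"]

precision, recall, f_measure = prf(gold, pred)
print(f"P={precision:.4f} R={recall:.4f} F={f_measure:.4f}")
```

Here two of the three predicted words match gold spans, and two of the four gold words are recovered, so a single coarse-grained segmentation decision lowers both precision and recall.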

Highlights

Chinese word segmentation (CWS) and part-of-speech (POS) tagging are two fundamental tasks of Chinese text processing.
“自觉” is a single word that usually means “conscientiously” in the newswire domain, but in the clinical domain it is two words, “自” and “觉”, meaning “feels ... by himself/herself”. “上下颚”, which denotes the two body parts “上颚” and “下颚”, should be split into three words, “上”, “下” and “颚”: only then can the two body parts be formed by combining “上” and “下” with “颚”, respectively. This matters for subsequent tasks such as clinical named entity recognition and normalization.
To make sure that CWS and POS tagging for clinical text are consistent with subsequent clinical natural language processing (NLP) tasks, we investigated the two fundamental tasks comprehensively at a fine-grained level, and manually annotated a benchmark corpus of 1800 clinical notes from a tier 3A hospital in China.
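The “上下颚” example can be sketched in a few lines. This is only an illustration of why the fine-grained split helps a downstream step, not the authors' normalization method; the function name `expand_coordinated` and the assumption that the last segment is the shared head are hypothetical.

```python
def expand_coordinated(segments):
    """Combine each leading modifier with the shared final head word.

    Assumes the last segment is the head (e.g. "颚") and every earlier
    segment is a modifier (e.g. "上", "下").
    """
    *modifiers, head = segments
    return [modifier + head for modifier in modifiers]


# Fine-grained segmentation: three words, so both body parts are recoverable.
fine = ["上", "下", "颚"]
body_parts = expand_coordinated(fine)
print(body_parts)  # ['上颚', '下颚']

# Coarse segmentation: a single word, so there is nothing to recombine.
coarse = ["上下颚"]
print(expand_coordinated(coarse))  # []
```

With the coarse segmentation, recovering “上颚” and “下颚” would require splitting inside a word, which is exactly what the fine-grained annotation avoids.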

Summary

Introduction

Chinese word segmentation (CWS) and part-of-speech (POS) tagging are two fundamental tasks of Chinese text processing. To make sure that CWS and POS tagging for clinical text are consistent with subsequent clinical NLP tasks, we investigated the two fundamental tasks comprehensively at a fine-grained level, and manually annotated a benchmark corpus of 1800 clinical notes from a tier 3A hospital in China. On this corpus, we first compared two state-of-the-art methods: conditional random fields (CRF) and bidirectional long short-term memory (BiLSTM) with a CRF layer (called BiLSTM-CRF). We then investigated the effect of CWS and POS tagging on Chinese clinical named entity recognition (NER) on another independent benchmark corpus, namely the corpus for task 1 of the 2017 CCKS (China Conference on Knowledge Graph and Semantic Computing) challenge. The experimental results indicate that our proposed fine-grained CWS and POS tagging corpus is reliable and meaningful.
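Both CRF and BiLSTM-CRF are sequence labelers, so a segmented corpus is typically converted to one tag per character before training. The sketch below uses the widely used B/M/E/S scheme (begin, middle, end, single) with an optional POS suffix for joint tagging. The tag scheme, the POS labels and the sample words are assumptions for illustration; this page does not state which scheme the authors used.

```python
def to_char_tags(words, pos=None):
    """Turn a segmented sentence into (character, tag) pairs.

    Tags follow the B/M/E/S scheme; if `pos` is given, each tag becomes
    "<boundary>-<POS>" so one labeler predicts both tasks jointly.
    """
    pairs = []
    for i, word in enumerate(words):
        if len(word) == 1:
            boundary = ["S"]
        else:
            boundary = ["B"] + ["M"] * (len(word) - 2) + ["E"]
        if pos is not None:
            boundary = [f"{b}-{pos[i]}" for b in boundary]
        pairs.extend(zip(word, boundary))
    return pairs


def from_char_tags(pairs):
    """Rebuild the word list from (character, tag) pairs."""
    words, current = [], ""
    for char, tag in pairs:
        current += char
        if tag[0] in ("S", "E"):  # a word ends on S or E
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends mid-word
        words.append(current)
    return words


words = ["上", "下", "颚", "疼痛"]          # hypothetical segmented phrase
tagged = to_char_tags(words)
print(tagged)

# Joint CWS and POS labels; the POS names here are placeholders.
joint = to_char_tags(words, pos=["f", "f", "n", "v"])
print(joint)
```

The round trip through `from_char_tags` returns the original words, which is how a labeler's per-character output is turned back into a segmentation for scoring.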

Journal: BMC Medical Informatics and Decision Making
Publication Date: Apr 1, 2019
Citations: 14
License type: open-access
