Abstract
Abstract There exists no sentence boundary in most classical Chinese literature texts. Since it is difficult to read literature of this kind, experts in literature or linguistics would segment the sentence manually. This article explores the effectiveness of classical Chinese sentence segmentation method so as to provide a reference for classical Chinese punctuation. On the basis of the machine learning methods, we chose three components of machine learning, namely models, tagging schemes, and features, to compare the learning results. The models include conditional random field (CRF) models, long short term memory (LSTM) models, BiLSTM–CRF models, and three Bidirectional Encoder Representation from Transformers (BERT) models. There are five tagging schemes in this article and three features including the statistical feature, Guangyun, and Fanqie. Finally, the performance of the combined feature template is evaluated by ten-fold cross-validation on four classical Chinese texts in different genres. The SikuBERT model is proved to be the most effective model for sentence segmentation at present. Different tagging schemes and various features are introduced. The results show that 5-tag-J tagging schemes can improve performance. Statistical feature, as an important clue for classical Chinese sentence segmentation, is useful in related tasks, but Guangyun and Fanqie have little impact. Other important factors of sentence segmentation are genres and writing styles.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.