Abstract

Because no public tagged corpora are available for ancient Chinese word segmentation (CWS), state-of-the-art CWS methods cannot be applied to ancient Chinese. To address this problem, this paper proposes a word segmentation method based on word alignment (WSWA). Specifically, the method segments ancient Chinese text according to the word alignment between modern Chinese words and ancient Chinese characters: if multiple consecutive ancient Chinese characters align to the same modern Chinese word, they are treated as one word. Because many modern Chinese words are derived from ancient Chinese, the method also exploits characters that co-occur in the modern and ancient text to extract words for CWS. Moreover, to reduce the effect of alignment errors, the method removes word alignments that are likely to cause CWS errors. We quantitatively analyze the effects of modern CWS and word alignment on the WSWA method using hand-annotated corpora. Our method outperforms the state-of-the-art methods on the WSA experiment on Shiji by a large margin, which demonstrates the effectiveness of using word alignment to perform ancient CWS.
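The core grouping rule described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `segment_by_alignment`, the representation of an alignment as one modern-word index per ancient character (with `None` for an unaligned character), and the toy sentence are all assumptions made for illustration. The co-occurrence step and the filtering of unreliable alignments are not shown.

```python
from typing import List, Optional


def segment_by_alignment(chars: List[str],
                         align: List[Optional[int]]) -> List[str]:
    """Merge consecutive characters that align to the same modern word.

    chars: ancient Chinese characters of one sentence (hypothetical input).
    align: for each character, the index of the modern Chinese word it is
           aligned to, or None if it has no alignment (assumed encoding).
    """
    words: List[str] = []
    prev: Optional[int] = None
    for ch, idx in zip(chars, align):
        # Extend the current word only when this character and the previous
        # one are both aligned, and to the same modern word.
        if words and idx is not None and idx == prev:
            words[-1] += ch
        else:
            words.append(ch)
        prev = idx
    return words


# Toy alignment, invented for illustration only.
print(segment_by_alignment(list("天下大乱"), [0, 0, 1, 2]))
```

In this sketch an unaligned character always starts a new single-character word, which is one possible design choice; the paper may handle unaligned characters differently.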
