
Classical Chinese differs fundamentally from Modern Chinese in both syntax and morphology. While a number of recent works address part-of-speech (PoS) tagging for Modern Chinese, PoS tagging for Classical Chinese has been largely neglected. To the best of our knowledge, this is the first work in the area. Fortunately, in terms of tagging, Classical Chinese is easier than Modern Chinese in that most Classical Chinese words consist of a single character, so no segmentation is needed. In this paper, we propose and analyze a simple statistical approach to PoS tagging of Classical Chinese. We first design a tagset for Classical Chinese, which is later shown to be accurate and efficient. We then apply the hidden Markov model (HMM) with the Viterbi algorithm and make several improvements, such as sparse data handling and unknown word guessing, both designed specifically for Classical Chinese. As the training set grows larger, the accuracies of the bigram and trigram models increase to 94.9% and 97.6%, respectively. The contribution of our work also lies in proposing and solving some previously unseen problems in processing Classical Chinese.

Keywords: Hidden Markov Model, Natural Language Processing, Viterbi Algorithm, Unknown Word, Sparse Data Problem
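To make the bigram HMM tagging step concrete, the following is a minimal sketch of Viterbi decoding over single-character words. It is an illustration under stated assumptions, not the implementation described in the paper: add-one smoothing stands in for the paper's sparse data handling and unknown word guessing (whose details are not given here), the tag labels "n" and "v" are hypothetical rather than taken from the paper's tagset, and the three-sentence corpus is invented for the example. All function names are likewise hypothetical.

```python
import math
from collections import defaultdict


def train_bigram_hmm(tagged_sentences):
    """Count tag-transition and tag-emission frequencies.

    Each sentence is a list of (character, tag) pairs; no segmentation
    is needed because every word is assumed to be one character.
    """
    trans = defaultdict(lambda: defaultdict(int))  # trans[prev_tag][tag]
    emit = defaultdict(lambda: defaultdict(int))   # emit[tag][char]
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-start pseudo-tag
        for char, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][char] += 1
            vocab.add(char)
            prev = tag
    return trans, emit, sorted(emit), vocab


def log_prob(counts, context, event, n_outcomes):
    """Add-one smoothed log P(event | context).

    Assumption: add-one smoothing is a simple placeholder for the
    paper's own sparse data and unknown word treatment.
    """
    row = counts.get(context, {})
    total = sum(row.values())
    return math.log((row.get(event, 0) + 1) / (total + n_outcomes))


def viterbi(chars, trans, emit, tags, vocab):
    """Return the most probable tag sequence under the bigram HMM."""
    if not chars:
        return []
    n_tags = len(tags)
    n_chars = len(vocab) + 1  # one extra outcome reserved for unseen characters
    # best[i][t] = (log score of best path ending in tag t at i, backpointer)
    best = [{}]
    for t in tags:
        score = (log_prob(trans, "<s>", t, n_tags)
                 + log_prob(emit, t, chars[0], n_chars))
        best[0][t] = (score, None)
    for i in range(1, len(chars)):
        best.append({})
        for t in tags:
            prev_tag, score = max(
                ((p, best[i - 1][p][0] + log_prob(trans, p, t, n_tags))
                 for p in tags),
                key=lambda pair: pair[1],
            )
            score += log_prob(emit, t, chars[i], n_chars)
            best[i][t] = (score, prev_tag)
    # Follow backpointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(chars) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]


# Invented toy corpus with hypothetical tags: "n" (noun), "v" (verb).
corpus = [
    [("子", "n"), ("曰", "v")],
    [("王", "n"), ("曰", "v")],
    [("子", "n"), ("見", "v"), ("王", "n")],
]
model = train_bigram_hmm(corpus)
print(viterbi(["王", "見", "子"], *model))
```

A trigram variant would condition each transition on the two preceding tags, which enlarges the state space and makes data sparsity more acute; that is where a more careful smoothing scheme than the add-one placeholder used here becomes important.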
