The Application Analysis of the Construction Method of Minimum Entropy Unsupervised Thesaurus in Ancient Chinese Word Segmentation

Yuyao Li, Jinhao Liang, Xiujuan Huang

doi:10.1109/icisce50968.2020.00281

Abstract

Segmenting ancient Chinese text is foundational work for the intelligent processing of ancient books. In this paper, an unsupervised thesaurus construction algorithm based on the minimum entropy model is applied to a large-scale corpus of ancient texts to extract a lexicon of high-frequency co-occurring adjacent characters. The lexicon is then combined with existing word segmentation tools in segmentation experiments on ancient texts. The experimental results show that the method improves segmentation to different degrees for texts from different periods, which indicates that the lexicon is effective within a certain range. This paper is also one of the few works that apply monolingual word segmentation methods to ancient Chinese word segmentation, and it enriches the research in related fields.
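
The lexicon extraction step described above can be illustrated with a minimal sketch. The abstract does not give the algorithm's details, so everything here is an assumption made for illustration: the function name extract_lexicon, the thresholds min_count and min_pmi, the restriction to two-character candidates, and the use of pointwise mutual information (PMI) as the association score for adjacent characters. It is not the authors' implementation.

```python
import math
from collections import Counter


def extract_lexicon(corpus, min_count=2, min_pmi=1.0):
    """Collect adjacent character pairs that co-occur often and strongly.

    Hypothetical helper: min_count and min_pmi are illustrative
    thresholds, not values taken from the paper.
    """
    chars = Counter()
    pairs = Counter()
    for line in corpus:
        chars.update(line)  # single-character frequencies
        pairs.update(line[i:i + 2] for i in range(len(line) - 1))

    total_chars = sum(chars.values())
    total_pairs = sum(pairs.values())

    lexicon = {}
    for pair, count in pairs.items():
        if count < min_count:
            continue  # drop low-frequency candidates
        p_pair = count / total_pairs
        p_first = chars[pair[0]] / total_chars
        p_second = chars[pair[1]] / total_chars
        # PMI compares the observed pair probability with what
        # independent characters would give (assumed scoring rule).
        pmi = math.log(p_pair / (p_first * p_second))
        if pmi >= min_pmi:
            lexicon[pair] = count
    return lexicon


if __name__ == "__main__":
    sample = ["君子不器", "君子喻於義", "小人喻於利"]
    print(extract_lexicon(sample))
```

In a pipeline of the kind the abstract describes, the resulting entries would be supplied to an existing segmentation tool as additional vocabulary; how that is done depends on the tool and is not specified here.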

Full Text