Abstract

Chinese word segmentation has received extensive attention in recent years. Character-based tagging, which recasts word segmentation as a sequence labeling problem, has greatly improved segmentation performance and has become the dominant approach. To study the performance of this approach further, we use a maximum entropy sequence labeling model in this paper. We compare experimental results across two different word-position tag sets and three feature templates. We also investigate unknown words and segmentation ambiguity in the segmentation results. First, we combine N-grams with cohesion and degree of freedom to address unknown words. Then a maximum entropy model is trained on the new segmentation to eliminate ambiguity. A closed test was conducted on the Bakeoff 2005 corpus of the international Chinese word segmentation evaluation. Experiments show that the six-tag position set, combined with the corresponding feature templates, achieves better word segmentation performance. After adding unknown-word and disambiguation processing, segmentation performance on some data sets improves further, reaching the best results of Bakeoff 2005.
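
To illustrate how character-based tagging turns segmentation into sequence labeling, the minimal sketch below converts a segmented sentence into one position tag per character and decodes tags back into words. The tag inventories are assumptions, since the abstract does not name its tags: the four-tag set B/M/E/S and the six-tag set B/B2/B3/M/E/S that are common in the literature are used here. The function names are hypothetical, and the maximum entropy model that would predict the tags is not shown.

```python
from typing import List, Tuple


def word_to_tags(word: str, scheme: str = "6") -> List[str]:
    """Return one position tag per character of a single word.

    scheme "4": B (begin), M (middle), E (end), S (single-character word).
    scheme "6": B, B2, B3 (first three characters), M, E, S.
    Both inventories are assumed for illustration.
    """
    n = len(word)
    if n == 0:
        return []
    if n == 1:
        return ["S"]
    if scheme == "4":
        return ["B"] + ["M"] * (n - 2) + ["E"]
    if scheme == "6":
        # Up to three distinct head tags, never covering the final character.
        head = ["B", "B2", "B3"][: min(n - 1, 3)]
        return head + ["M"] * (n - 1 - len(head)) + ["E"]
    raise ValueError(f"unknown tag scheme: {scheme!r}")


def sentence_to_tags(words: List[str], scheme: str = "6") -> Tuple[List[str], List[str]]:
    """Flatten a segmented sentence into parallel character and tag lists."""
    chars: List[str] = []
    tags: List[str] = []
    for word in words:
        chars.extend(word)
        tags.extend(word_to_tags(word, scheme))
    return chars, tags


def tags_to_words(chars: List[str], tags: List[str]) -> List[str]:
    """Decode a tag sequence back into words: a word closes at E or S."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        # Keep any trailing characters whose tags never closed a word.
        words.append(current)
    return words
```

In this sketch, a tagger trained on the output of `sentence_to_tags` would predict a tag for every character of an unsegmented sentence, and `tags_to_words` would recover the segmentation from those predictions. The six-tag set differs from the four-tag set only in how it labels the second and third characters of longer words.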
