Abstract

Chinese word segmentation is a fundamental problem in Chinese information processing, because segmentation results are the basis for computers to understand natural language. Unlike most Western languages, Chinese has no fixed delimiter such as white space to mark word boundaries. Moreover, Chinese grammar is complex, and the segmentation criteria vary with context, which makes Chinese word segmentation a difficult task. Many algorithms have been proposed for this problem, but to the best of our knowledge none of them outperforms all the others. In this paper, we develop a novel algorithm based on semantics and context. We propose a semantic word similarity measure that uses the concept hierarchy in knowledge graphs, and we use this measure to prune the differing results generated by several state-of-the-art Chinese word segmentation methods. The idea is to compute, for each candidate word, its concept similarity to the other words in the text, and to choose the candidate with the highest concept similarity score. To evaluate the effectiveness of the proposed approach, we conduct a series of experiments on two real datasets. The results show that our method outperforms all the state-of-the-art algorithms by filtering out wrong results and retaining correct ones.
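
The selection idea described above can be illustrated with a minimal sketch. The abstract does not give the similarity formula, so this sketch substitutes the well-known Wu-Palmer measure over a tiny hand-made concept hierarchy. The hierarchy `PARENT`, the helper names, the context words, and the averaging used to score a candidate are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List, Optional

# Hypothetical toy concept hierarchy: each concept or word maps to its parent.
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "place": "entity",
    "person": "entity",
    "city": "place",
    "structure": "place",
    "official": "person",
    "南京市": "city",
    "南京": "city",
    "城市": "city",
    "长江大桥": "structure",
    "桥梁": "structure",
    "市长": "official",
}


def ancestors(concept: str) -> List[str]:
    """Return the path from a concept up to the root, concept first."""
    path = []
    node: Optional[str] = concept
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def similarity(a: str, b: str) -> float:
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    if a not in PARENT or b not in PARENT:
        return 0.0  # words missing from the hierarchy get no support
    path_a, path_b = ancestors(a), ancestors(b)
    shared = set(path_b)
    # The first ancestor of a that also lies above b is the deepest common one.
    lcs = next(node for node in path_a if node in shared)
    return 2 * len(ancestors(lcs)) / (len(path_a) + len(path_b))


def score(candidate: List[str], context: List[str]) -> float:
    """Mean similarity of every candidate word to every context word."""
    pairs = [similarity(w, c) for w in candidate for c in context]
    return sum(pairs) / len(pairs)


def pick_candidate(candidates: List[List[str]], context: List[str]) -> List[str]:
    """Keep the segmentation whose words fit the context best."""
    return max(candidates, key=lambda cand: score(cand, context))


# Two conflicting segmentations of 南京市长江大桥, as different segmenters
# might produce them, and two context words taken from the surrounding text.
candidates = [["南京市", "长江大桥"], ["南京", "市长", "江大桥"]]
context = ["桥梁", "城市"]
best = pick_candidate(candidates, context)
```

In this toy setting the segmentation 南京市 / 长江大桥 is retained, because its words share deeper ancestors with the context words than 市长 does, and 江大桥 is absent from the hierarchy. A real system would draw the hierarchy from a knowledge graph and would need a policy for out-of-vocabulary words.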
