New Words Discovery Method Based On Word Segmentation Result

Heyang Liu, Yi Xiao, Pengdong Gao

doi:10.1109/icis.2018.8466490

Abstract

This paper presents a new-word discovery method based on word segmentation results. Word segmentation is an important part of many Chinese natural language processing (NLP) tasks, and improving its accuracy is a matter of great concern. As the volume of web text grows, more and more Chinese NLP tasks rely on micro-blogs, movie reviews, and other web text. The content of web text changes very quickly and often contains a large number of new words. The inability of word segmentation tools to identify these new words is an important factor affecting segmentation accuracy. One way to solve this problem is to discover new words in the target text and add them to the dictionaries on which the word segmentation tool depends. Traditional new-word discovery methods can only find words that do not exist in the word segmentation tool's dictionary. But these words do not necessarily affect the segmentation result; that is, they may be segmented correctly even if they are not added to the dictionary. To address this issue, we propose building the collection of candidate new words from the segmentation result. All new words discovered in this way are words that the word segmentation tool segmented incorrectly. Adding them to the tool's dictionary can therefore improve segmentation accuracy more than traditional methods. Experiments on the DouBan movie review dataset show that our method finds better new words, improving accuracy on movie review sentiment classification.
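
The idea of drawing candidates from the segmentation result can be illustrated with a minimal sketch. It is not the procedure used in the paper, whose candidate construction and filtering are not described in the abstract. It assumes that a candidate is a run of adjacent tokens produced by the segmenter, and that a run recurring often enough may be a single word that was split by mistake. The function name `candidate_new_words`, the parameters `min_count` and `max_parts`, and the toy sentences are all hypothetical.

```python
from collections import Counter


def candidate_new_words(segmented_sentences, min_count=3, max_parts=3):
    """Collect candidate new words from already-segmented text.

    Assumption (not from the paper): a candidate is a run of 2..max_parts
    adjacent tokens that occurs at least min_count times. Because every
    candidate is glued together from separate tokens, it is by construction
    a string that the segmenter split apart.
    """
    counts = Counter()
    for tokens in segmented_sentences:
        for n in range(2, max_parts + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    # Join the token runs back into strings and keep only the frequent ones.
    return {"".join(parts): c for parts, c in counts.items() if c >= min_count}


# Toy segmenter output in which the slang word "给力" was split into "给" + "力".
sentences = [
    ["这", "部", "电影", "真", "给", "力"],
    ["剧情", "很", "给", "力"],
    ["演员", "给", "力"],
]

candidates = candidate_new_words(sentences, min_count=3)
print(candidates)
```

In practice such frequency-only candidates would need further filtering before being added to a segmentation tool's dictionary; the thresholds here are chosen only to keep the toy example small.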

Full Text