Challenges in Chinese Text Similarity Research

Xiuhong Wang, Shiguang Ju, Shengli Wu

doi:10.1109/isip.2008.76

Abstract

Chinese text similarity research, one of the most important issues in information retrieval, presents many opportunities and challenges, and quite a few models and approaches have been investigated for it. Chinese is one of the most complicated languages in terms of morphology, syntax, semantics and pragmatics. Unlike English, Chinese has no explicit delimiter between words. Difficulties in Chinese natural language processing, such as word segmentation, reduce both the effectiveness and the efficiency of text similarity computation. This paper addresses some of the challenges in Chinese text similarity computation, which arise from Chinese linguistics and from the models and approaches used in information retrieval. We take Chinese text similarity computation to cover the broad topics of word, sentence and document similarity. Our work provides insights into the difficulties and bottlenecks in this research, including the tradeoffs between effectiveness and efficiency. New directions for future work are also discussed.
