Abstract

Present text retrieval systems are generally built on the reductionist basis that words in texts (keywords) are used as indexing terms to represent the texts. A necessary precursor to these systems is word extraction, which, for English texts, can be achieved automatically by using spaces and punctuation marks as word delimiters. This approach cannot be readily applied to Chinese texts because they do not have obvious word boundaries. A Chinese text consists of a linear sequence of nonspaced or equally spaced ideographic characters, which are similar to morphemes in English. Researchers of Chinese text retrieval have been seeking methods of text segmentation to divide Chinese texts automatically into words. First, a review of these methods is provided, in which the different approaches to Chinese text segmentation are classified to give a general picture of the research activity in this area, and some of the most important work is described. There follows a discussion of the problems of Chinese text segmentation, illustrated with examples. These problems include morphological complexities, segmentation ambiguity, and parsing problems, and they demonstrate that text segmentation remains one of the most challenging and interesting areas for Chinese text retrieval. © 1993 John Wiley & Sons, Inc.
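
The contrast drawn above between English word extraction and Chinese text segmentation can be made concrete with a minimal sketch. It is an illustration only and is not taken from the paper: the function names, the toy lexicon, and the choice of forward maximum matching (a common dictionary-based heuristic, used here simply because it is short) are all assumptions.

```python
import re

# Hypothetical toy lexicon, for illustration only; a real system would use a
# large dictionary and further disambiguation.
LEXICON = {"研究", "研究生", "生命", "起源"}


def extract_english_words(text):
    """Extract words by treating spaces and punctuation as delimiters."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)


def forward_maximum_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: at each position take the longest
    lexicon entry, falling back to a single character when nothing matches."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


english = extract_english_words("Text retrieval, for English texts, is easy.")
# Whitespace splitting finds no boundaries in unspaced Chinese text.
unsegmented = "研究生命起源".split()
segmented = forward_maximum_match("研究生命起源", LEXICON)
```

With this toy lexicon, the greedy match segments 研究生命起源 ("study the origin of life") as 研究生 / 命 / 起源, taking "graduate student" first, whereas the intended reading is 研究 / 生命 / 起源. This is a small instance of the segmentation ambiguity the abstract mentions.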
