Abstract
Text representation, a fundamental and necessary step in text-based intelligent information processing, comprises two tasks: determining the index terms for documents and producing the numeric vectors that correspond to them. In this paper, the multi-word, which is regarded as carrying more contextual semantics than an individual word and as having favorable statistical characteristics, is proposed with theoretical support as an alternative index term for text representation in the vector space model. For comparison, we investigate two traditional indexing methods, TF*IDF (term frequency-inverse document frequency) and LSI (latent semantic indexing). The performance of TF*IDF, LSI and multi-word is examined on text classification tasks, which here include information retrieval (IR) and text categorization (TC), on a Chinese and an English document collection respectively. We also tune the rescaling factor of LSI and observe its effect on text classification. The experimental results show that TF*IDF and multi-word perform comparably on IR and TC, and that LSI performs worst of the three. Moreover, the rescaling factor of LSI has an insignificant influence on its effectiveness for both Chinese and English text classification.
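As an illustration of the second task named above, producing numeric vectors from index terms, the following is a minimal sketch of TF*IDF weighting. It assumes one common variant (raw term frequency multiplied by log(N/df)); the abstract does not state which variant the paper uses. The function name `tfidf_vectors` and the toy documents are hypothetical. The tokens could equally be single words or multi-word terms.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) for a list of tokenized documents.

    Assumed variant (not taken from the paper): tf * log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts; absent terms count as 0
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors

# Hypothetical toy collection of already tokenized documents.
docs = [
    ["text", "representation", "text"],
    ["text", "classification"],
    ["information", "retrieval"],
]
vocab, vectors = tfidf_vectors(docs)
```

Each document becomes one row of weights over the shared vocabulary. A term that occurs in every document would receive weight zero under this variant, which is why rarer index terms dominate the vectors.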