A two-stage feature selection method for text categorization

Jiana Meng, Hongfei Lin

doi:10.1109/fskd.2010.5569324

Publication Date: Aug 1, 2010

Citations: 10

Affiliation: Dalian University of Technology

Abstract

Feature selection for text classification is a well-studied problem whose goals are to improve classification effectiveness, computational efficiency, or both. In this paper, we propose a two-stage feature selection algorithm that combines a feature selection method with latent semantic indexing. Traditional word-matching-based text categorization systems use the vector space model to represent documents. However, this model requires a high-dimensional space and does not take into account the semantic relationships between terms, which can lead to poor classification accuracy. Latent semantic indexing can overcome these problems by using statistically derived conceptual indices instead of individual words. It constructs a conceptual vector space in which each term or document is represented as a vector. This not only greatly reduces the dimensionality but also uncovers important associative relationships between terms. Because constructing a new semantic space is computationally expensive, our algorithm first applies a feature selection method to reduce the term dimensions. It then constructs a new, reduced semantic space between terms using latent semantic indexing. In applications involving spam database categorization, we find that our two-stage feature selection method performs better.
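
The two stages described in the abstract can be illustrated with a minimal sketch. The abstract does not name the first-stage method, so the chi-square statistic used here is an assumption, as are the function names `chi_square_scores` and `two_stage_reduce`, the binary spam/non-spam labels, and the toy document-term matrix. Latent semantic indexing is implemented as a truncated singular value decomposition with NumPy. This is not the authors' implementation.

```python
import numpy as np

def chi_square_scores(X, y):
    """Chi-square score of each term's presence against a binary label.
    (Assumed first-stage criterion; the abstract does not specify one.)"""
    present = (np.asarray(X) > 0).astype(float)   # 1.0 where a term occurs in a document
    y = np.asarray(y).astype(bool)
    n = present.shape[0]
    a = present[y].sum(axis=0)      # positive documents containing the term
    b = present[~y].sum(axis=0)     # negative documents containing the term
    c = y.sum() - a                 # positive documents without the term
    d = (~y).sum() - b              # negative documents without the term
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    # Terms that occur in every document (or in none) get a score of 0.
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

def two_stage_reduce(X, y, n_terms, n_concepts):
    """Stage 1: keep the n_terms best-scoring terms.
    Stage 2: project documents onto n_concepts latent dimensions (LSI)."""
    X = np.asarray(X, dtype=float)
    scores = chi_square_scores(X, y)
    keep = np.sort(np.argsort(-scores, kind="stable")[:n_terms])
    X_sel = X[:, keep]              # reduced document-term matrix
    # LSI: truncated SVD of the reduced matrix.
    U, s, Vt = np.linalg.svd(X_sel, full_matrices=False)
    k = min(n_concepts, len(s))
    docs_k = U[:, :k] * s[:k]       # document coordinates in the concept space
    return keep, docs_k, Vt[:k]

# Toy document-term matrix (rows: documents, columns: terms); values are made up.
X = [[2, 1, 0, 0, 1],
     [1, 2, 0, 0, 1],
     [3, 1, 0, 1, 1],
     [0, 0, 2, 1, 1],
     [0, 0, 1, 2, 1],
     [0, 1, 3, 1, 1]]
y = [1, 1, 1, 0, 0, 0]              # 1 = spam, 0 = non-spam (hypothetical labels)

keep, docs_k, concepts = two_stage_reduce(X, y, n_terms=2, n_concepts=2)
print(keep)          # indices of the terms kept by stage 1
print(docs_k.shape)  # (number of documents, number of concepts)
```

Running the selection first means the SVD is computed on a matrix with `n_terms` columns instead of the full vocabulary, which is where the computational saving motivated in the abstract would come from.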

Full Text