A Two-Stage Unsupervised Dimension Reduction Method for Text Clustering

Kusum Kumari Bharti, Pramod Kumar Singh

doi:10.1007/978-81-322-1041-2_45

Abstract

Feature selection is widely used in text clustering to reduce the dimensionality of the feature space. In this paper, we study and propose two-stage unsupervised feature selection methods that determine a subset of relevant features in order to improve the accuracy of the underlying algorithm. We experiment with two hybrid approaches: feature selection followed by feature selection (FS–FS) and feature selection followed by feature extraction (FS–FE). In the first stage, each feature in the document is scored by its importance for clustering using one of two feature selection methods, applied individually: Mean-Median (MM) and Mean Absolute Difference (MAD). In the second stage, in two separate experiments, we hybridize them with a feature selection method, absolute cosine (AC), and a feature extraction method, principal component analysis (PCA), to further reduce the dimensionality of the feature space. We perform comprehensive experiments to compare FS, FS–FS and FS–FE using k-means clustering on the Reuters-21578 dataset. The experimental results show that the two-stage feature selection methods are more effective in obtaining good-quality results from the underlying clustering algorithm. Additionally, we observe that the FS–FE approach is superior to the FS–FS approach.
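
The FS–FE pipeline described above can be illustrated with a minimal sketch. The abstract does not give the scoring formulas, so the code assumes the commonly used definitions: MAD as the mean absolute deviation of a term's values from its mean, and MM as the absolute difference between a term's mean and median. The function names, the toy matrix, and the values of `k` and `n_components` are hypothetical and are not taken from the paper.

```python
import numpy as np

def mad_scores(X):
    """Assumed MAD score: mean absolute deviation of each column from its mean."""
    return np.mean(np.abs(X - X.mean(axis=0)), axis=0)

def mm_scores(X):
    """Assumed MM score: absolute difference between column mean and median."""
    return np.abs(X.mean(axis=0) - np.median(X, axis=0))

def select_top_k(X, scores, k):
    """Stage 1 (FS): keep the k highest-scoring columns, in original order."""
    idx = np.sort(np.argsort(scores)[::-1][:k])
    return X[:, idx], idx

def pca_project(X, n_components):
    """Stage 2 (FE): project centered data onto the leading principal axes."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Toy document-term matrix: 20 documents, 50 terms (illustrative data only).
rng = np.random.default_rng(0)
X = rng.random((20, 50))
X[:, :40] *= 0.01  # make the first 40 terms nearly constant across documents

X_fs, kept = select_top_k(X, mad_scores(X), k=10)   # FS stage with MAD
X_fe = pca_project(X_fs, n_components=3)            # FE stage with PCA
```

The reduced matrix `X_fe` is what would then be passed to the clustering step (k-means in the paper's experiments). Replacing `mad_scores` with `mm_scores` gives the MM variant of the first stage under the same assumptions.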
