Users in the digital era are overwhelmed by large volumes of text collections, most of which come without class labels. Clustering is therefore the only feasible way to extract valuable insights from such data. However, clustering text collections in a high dimensional space is inefficient. Several unsupervised dimensionality reduction methods have been proposed in the literature. Among them, feature selection methods are easy to interpret, and filter feature selection methods have proven scalable and efficient for high dimensional datasets. The aim of this work is to propose an unsupervised univariate filter feature selection method for efficient clustering of very high dimensional text datasets in a low dimensional feature space. Wavelets are mathematical functions that transform a signal into the time-frequency (or space-frequency) domain so that it can be analyzed at different resolutions. Wavelet transforms are efficient at identifying the transients of a signal. Here, the stationary wavelet transform with a Symlet of order 2 is used to identify the most discriminant features of text documents for efficient clustering in a low dimensional feature space. The proposed feature selection method is compared with nine other relevant methods in terms of the quality of the clustering solution on seven real text document collections. The proposed method identified the most discriminative features, achieving the best peak clustering performance, better than clustering with all the features, on four of the seven datasets while using at most 1.5% of the top-rated features.
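The abstract does not spell out how the wavelet coefficients are turned into feature scores, so the following is only a minimal sketch of the general idea and not the authors' method. It assumes (hypothetically) that each term's weights across the documents are treated as a one-dimensional signal, that one level of an undecimated (stationary) transform with the four-tap order-2 Symlet filter is applied with periodic boundary handling, and that the energy of the detail coefficients serves as the univariate score. The function names, the toy matrix, and the scoring rule are all illustrative assumptions.

```python
import numpy as np

# Four-tap scaling (low-pass) filter of the order-2 Symlet.
SQRT3 = np.sqrt(3.0)
LOW_PASS = np.array([1 + SQRT3, 3 + SQRT3, 3 - SQRT3, 1 - SQRT3]) / (4 * np.sqrt(2.0))
# Matching high-pass filter via the quadrature mirror relation g[k] = (-1)^k h[3 - k].
HIGH_PASS = np.array([LOW_PASS[3], -LOW_PASS[2], LOW_PASS[1], -LOW_PASS[0]])


def circular_filter(signal, taps):
    """Undecimated (no downsampling) circular convolution of signal with taps."""
    out = np.zeros(len(signal), dtype=float)
    for k, tap in enumerate(taps):
        out += tap * np.roll(signal, k)
    return out


def detail_energy_scores(X):
    """Score each column (feature) by the energy of its level-1 detail coefficients.

    ASSUMPTION: this scoring rule is illustrative only; the source does not
    define how the transform output is converted into a feature ranking.
    """
    X = np.asarray(X, dtype=float)
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        detail = circular_filter(X[:, j], HIGH_PASS)
        scores[j] = np.sum(detail ** 2)
    return scores


def select_top_features(X, fraction):
    """Return indices of the top-scoring features (at least one)."""
    scores = detail_energy_scores(X)
    k = max(1, int(round(fraction * X.shape[1])))
    return np.argsort(scores)[::-1][:k]


# Toy document-term matrix: 8 documents (rows) by 4 terms (columns).
X_toy = np.zeros((8, 4))
X_toy[:, 0] = 1.0              # term present equally in every document
X_toy[3, 1] = 1.0              # term concentrated in a single document
X_toy[::2, 2] = 1.0            # term present in every other document
X_toy[:, 3] = 0.2              # another uniformly spread term

toy_scores = detail_energy_scores(X_toy)
toy_selected = select_top_features(X_toy, fraction=0.5)
```

In this toy setting the uniformly spread terms receive a score of essentially zero, because the high-pass filter sums to zero, while terms whose weights change abruptly across documents are retained. Note that the score depends on the order of the documents, which is a further simplification of this sketch.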