Center-based clustering algorithms identify clusters efficiently, which makes them applicable to large data sets. A data set may contain many features, but not all of them are equally informative for cluster detection. Sparse center-based clustering methods address this by selecting only those features that are useful for identifying the clusters present in the data. However, to determine automatically the degree to which features should be selected, these methods rely on the Permutation Method, which involves generating and clustering multiple randomly permuted data sets and therefore incurs much higher computation costs. In this paper, we propose an improved approach to model selection for sparse clustering that uses expressions of the Bayesian Information Criterion (BIC) derived for the center-based clustering methods k-Means and Fuzzy c-Means. The derived BIC expressions require significantly less computation, yet allow us to compare candidate sparse partitions that may have selected different subsets of features and to choose a suitable one among them. Experiments on synthetic and real-world data sets show that using BIC for model selection leads to remarkable improvements in the identification of sparse clusterings for both Sparse k-Means and Sparse Fuzzy c-Means.
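To illustrate the general idea of scoring candidate partitions with BIC rather than re-clustering permuted data, the following is a minimal sketch. It is not the expression derived in this paper: it assumes a textbook form of BIC for a hard partition under a spherical Gaussian model with one shared variance, and the function name `kmeans_bic`, the parameter count, and the toy data are all illustrative assumptions. It also scores partitions on a fixed set of features only; comparing partitions built on different feature subsets is not handled here.

```python
import numpy as np

def kmeans_bic(X, labels):
    """BIC of a hard partition under an ASSUMED spherical Gaussian model
    with a single shared variance (lower is better). Illustrative only;
    not the BIC expression derived in the paper."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n, d = X.shape
    clusters = np.unique(labels)
    k = clusters.size

    sse = 0.0      # within-cluster sum of squared distances to the centers
    log_mix = 0.0  # sum over clusters of n_c * log(n_c / n)
    for c in clusters:
        members = X[labels == c]
        n_c = members.shape[0]
        center = members.mean(axis=0)
        sse += ((members - center) ** 2).sum()
        log_mix += n_c * np.log(n_c / n)

    # Maximum-likelihood estimate of the shared variance, floored to avoid log(0).
    variance = max(sse / (n * d), 1e-12)
    log_lik = log_mix - 0.5 * n * d * np.log(2.0 * np.pi * variance) - 0.5 * n * d

    # Assumed free parameters: k centers (k * d), k - 1 mixing weights, 1 variance.
    n_params = k * d + (k - 1) + 1
    return -2.0 * log_lik + n_params * np.log(n)

# Toy data (hypothetical): two well-separated groups in two dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(10.0, 1.0, (50, 2))])

one_cluster = np.zeros(100, dtype=int)   # candidate partition with k = 1
two_clusters = np.repeat([0, 1], 50)     # candidate partition with k = 2

bic_one = kmeans_bic(X, one_cluster)
bic_two = kmeans_bic(X, two_clusters)
```

Each candidate partition is scored once from its cluster sizes and within-cluster sum of squares, so no additional clustering runs are needed to rank the candidates; the partition with the lower score is preferred under these assumptions.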