In this paper we have coupled feature selection problem with semi-supervised clustering. Semi-supervised clustering utilizes the information of unsupervised and supervised learning in order to overcome the problems related to them. But in general all the features present in the data set may not be important for clustering purpose. Thus appropriate selection of features from the set of all features is very much relevant from clustering point of view. In this paper we have solved the problem of automatic feature selection and semi-supervised clustering using multiobjective optimization. A recently created simulated annealing based multiobjective optimization technique titled archived multiobjective simulated annealing (AMOSA) is used as the underlying optimization technique. Here features and cluster centers are encoded in the form of a string. We assume that for each data set for 10% data points class level information are known to us. Two internal cluster validity indices reflecting different data properties, an external cluster validity index measuring the similarity between the obtained partitioning and the true labelling for 10% data points and a measure counting the number of features present in a particular string are optimized using the search capability of AMOSA. AMOSA is utilized to detect the appropriate subset of features, appropriate number of clusters as well as the appropriate partitioning from any given data set. The effectiveness of the proposed semi-supervised feature selection technique as compared to the existing techniques is shown for seven real-life data sets of varying complexities.
Read full abstract