Feature selection and semi-supervised clustering using multiobjective optimization.

Sriparna Saha, Rachamadugu Spandana, Asif Ekbal, Abhay Kumar Alok

doi:10.1186/2193-1801-3-465


Open Access

https://doi.org/10.1186/2193-1801-3-465


Journal: SpringerPlus	Publication Date: Aug 26, 2014
Citations: 9	License type: CC BY 2.0

Affiliation: Indian Institute of Technology Patna

Abstract

In this paper we couple the feature selection problem with semi-supervised clustering. Semi-supervised clustering combines information from unsupervised and supervised learning in order to overcome the limitations of each. In general, however, not all the features present in a data set are important for clustering, so selecting an appropriate subset of features is highly relevant from the clustering point of view. We solve the problem of automatic feature selection and semi-supervised clustering using multiobjective optimization. A recently developed simulated annealing based multiobjective optimization technique, archived multiobjective simulated annealing (AMOSA), is used as the underlying optimization technique. Features and cluster centers are encoded in the form of a string. We assume that, for each data set, class label information is known for 10% of the data points. Four objectives are optimized using the search capability of AMOSA: two internal cluster validity indices reflecting different data properties, an external cluster validity index measuring the similarity between the obtained partitioning and the true labelling for the 10% of labelled data points, and a measure counting the number of features present in a particular string. AMOSA is used to detect the appropriate subset of features, the appropriate number of clusters, and the appropriate partitioning for any given data set. The effectiveness of the proposed semi-supervised feature selection technique relative to existing techniques is shown on seven real-life data sets of varying complexity.
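As a rough illustration of two of the four objectives named in the abstract, the sketch below scores a partitioning against the labelled subset and counts the selected features. The abstract does not say which external validity index is used, so the pair-counting Rand index here is a stand-in assumption; the function names and the dictionary of known labels are also hypothetical.

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of point pairs on which two labellings agree.
    Stand-in for the external validity index (an assumption, not the paper's choice)."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)

def semi_supervised_score(known_labels, cluster_ids):
    """Compare cluster assignments with true labels on the labelled subset only.
    known_labels: hypothetical dict mapping point index -> class label
                  (needs at least two labelled points).
    cluster_ids:  cluster assignment for every point in the data set."""
    idx = sorted(known_labels)
    return rand_index([known_labels[i] for i in idx],
                      [cluster_ids[i] for i in idx])

def feature_count(feature_mask):
    """Number of features switched on in a binary feature mask."""
    return sum(feature_mask)
```

Only the labelled points enter the external score; the remaining points influence the search through the internal validity indices, which are not sketched here.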

Highlights

Clustering, also termed unsupervised learning, is the method of grouping data items into different partitions or clusters in such a way that points belonging to the same cluster are similar in some manner and points belonging to different clusters are dissimilar in the same manner (Saha and Bandyopadhyay 2010).
3.1 String representation and population initialization. In Semi-FeaClusMOO, a state of archived multiobjective simulated annealing (AMOSA) comprises two items: a) a set of real numbers which represents the coordinates of the centers of the clusters and the associated partitioning of the data, and b) a set of binary numbers which represents different feature combinations.
All the features are utilized for distance computation and point symmetry based distance is used for assignment of points to different clusters. c) The traditional K-means clustering technique, with all the features utilized for distance computation. d) VGAPS-clustering (Bandyopadhyay and Saha 2008), a point symmetry based automatic clustering technique utilizing the search capability of GAs, where all the features are utilized for distance computation and point symmetry based distance is used for assignment of points to different clusters.
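The two-part state described in the string representation highlight can be sketched as a small data structure: real-valued cluster centers plus a binary feature mask. Everything beyond that pairing is an assumption made for illustration: the names `State`, `random_state`, and `assign`, the initialization of centers from randomly chosen data points, the range of cluster counts, and the use of squared Euclidean distance restricted to the selected features (the excerpt does not say which distance Semi-FeaClusMOO uses).

```python
import random
from dataclasses import dataclass
from typing import List

@dataclass
class State:
    """One candidate solution: cluster centers plus a binary feature mask."""
    centers: List[List[float]]  # K centers, one coordinate per feature
    feature_mask: List[int]     # 1 = feature selected, 0 = feature ignored

def random_state(data, k_min, k_max, rng):
    """Build a random state (hypothetical initialization scheme)."""
    n_features = len(data[0])
    k = rng.randint(k_min, k_max)                     # number of clusters varies per state
    centers = [list(p) for p in rng.sample(data, k)]  # assumption: seed centers from data points
    mask = [rng.randint(0, 1) for _ in range(n_features)]
    if not any(mask):                                 # keep at least one feature selected
        mask[rng.randrange(n_features)] = 1
    return State(centers, mask)

def masked_sq_distance(point, center, mask):
    """Squared Euclidean distance over the selected features only (an assumption)."""
    return sum(m * (p - c) ** 2 for p, c, m in zip(point, center, mask))

def assign(data, state):
    """Assign each point to its nearest center under the feature mask."""
    return [
        min(range(len(state.centers)),
            key=lambda j: masked_sq_distance(x, state.centers[j], state.feature_mask))
        for x in data
    ]
```

Because the number of centers is part of the state, different states can encode different numbers of clusters, which is what lets a search over such states pick the cluster count along with the feature subset.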

Summary

Introduction

Clustering, also termed unsupervised learning, is the method of grouping data items into different partitions or clusters in such a way that points belonging to the same cluster are similar in some manner and points belonging to different clusters are dissimilar in the same manner (Saha and Bandyopadhyay 2010). In supervised learning, a training set must be available that captures prior knowledge about the class labels of some points. A classifier is trained using this training set and can then predict the class labels of unlabelled data from the model it has built. In unsupervised learning, by contrast, no prior knowledge about the data points is available. Unsupervised learning classifies the data based on the actual distribution of the data items and well-quantified intrinsic properties.
