Abstract

Classification-via-clustering (CvC) is a widely used approach in which a clustering procedure is employed to perform classification tasks. In this paper, a novel K-Means-based CvC algorithm is presented, analysed and evaluated. Two additional techniques are employed to mitigate the limitations of K-Means: a hypercube of constraints is defined for each centroid, and weights are acquired for each attribute of each class so that a weighted Euclidean distance can be used as the similarity criterion in the clustering procedure. Experiments are conducted with 42 well-known classification datasets. The experimental results demonstrate that the proposed algorithm outperforms CvC with simple K-Means.

Highlights

  • Data science, and especially data mining [1], is a rapidly evolving field, with the extraction of valuable knowledge from large accumulations of information being a major challenge

  • The diagrams were created as follows: the result obtained with simple K-Means was subtracted from the respective C-K-Means result, and the differences were sorted in descending order

  • The algorithm includes two major modifications compared to simple K-Means: (i) the use of a hypercube of constraints for each centroid, extracted from the information in the training data, and (ii) the use of weights for each attribute and each class together with the weighted Euclidean distance as the similarity criterion for the clustering procedure (a minimal distance sketch follows this list)
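
As a rough illustration of the second modification, the snippet below sketches a per-class weighted Euclidean distance. The summary does not specify the exact weighting scheme, so the formula and the names `weighted_euclidean`, `centroids` and `weights` are illustrative assumptions, not the paper's definition.

```python
import numpy as np

def weighted_euclidean(x, centroid, weights):
    # One weight per attribute, acquired for the centroid's class;
    # the precise weighting scheme is assumed here for illustration.
    diff = np.asarray(x, dtype=float) - np.asarray(centroid, dtype=float)
    return float(np.sqrt(np.sum(np.asarray(weights, dtype=float) * diff ** 2)))

# Hypothetical usage: assign a sample to the nearest class centroid
# under class-specific attribute weights.
x = [1.0, 2.0]
centroids = {"A": [0.0, 0.0], "B": [3.0, 3.0]}
weights = {"A": [1.0, 0.5], "B": [0.2, 1.0]}
label = min(centroids, key=lambda c: weighted_euclidean(x, centroids[c], weights[c]))
```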


Summary

INTRODUCTION

Data science, and especially data mining [1], is a rapidly evolving field, with the extraction of valuable knowledge from large accumulations of information being a major challenge. In K-Means, each data point is assigned to the proper cluster and the centroids are recalculated given the updated clusters. This algorithm is widely used in CvC, where the clusters are matched to classes. A K-Means-based CvC algorithm is presented, introducing the use of constraints on the centroids' movement and a weighted Euclidean distance as a similarity criterion. Two main alterations to K-Means are proposed in order to exploit background knowledge: (i) the application of constraints to the initialization and update of the centroids, and (ii) the employment of a weighted Euclidean distance function. The clustering procedure and the formation of the clusters take place on the test data rather than the training set. This helps to better classify the testing instances, using the distance of each observation from the centroids and updating the centroids according to the clusters formed in the test set.
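
To make the two alterations more concrete, the sketch below shows one assignment/update pass of a constrained, weighted K-Means over the test data. It assumes that each class's hypercube is the [min, max] box of its training samples and that one centroid corresponds to each class; the function names (`fit_hypercubes`, `constrained_kmeans_step`) and these details are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def fit_hypercubes(X_train, y_train):
    # Per-class attribute bounds (the "hypercube of constraints").
    # Assumption: the [min, max] box of each class's training samples;
    # the paper may derive the bounds differently.
    cubes = {}
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        cubes[c] = (Xc.min(axis=0), Xc.max(axis=0))
    return cubes

def constrained_kmeans_step(X_test, centroids, weights, cubes):
    # One assignment/update pass over the test data, as described above.
    classes = list(centroids)
    # Assignment: nearest centroid under the class-specific weighted distance.
    dists = np.stack([
        np.sqrt((((X_test - centroids[c]) ** 2) * weights[c]).sum(axis=1))
        for c in classes
    ], axis=1)
    labels = np.array(classes)[dists.argmin(axis=1)]
    # Update: recompute each centroid, then clip it back into its hypercube
    # so it cannot drift outside the training-derived bounds.
    for c in classes:
        members = X_test[labels == c]
        if len(members):
            lo, hi = cubes[c]
            centroids[c] = np.clip(members.mean(axis=0), lo, hi)
    return labels, centroids
```

Clipping the updated centroid back into its class hypercube is one simple way to realize the stated constraint on centroid movement; the actual enforcement mechanism in C-K-Means may differ.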

Description of C-K-Means
EXPERIMENTAL RESULTS AND DISCUSSION
CONCLUSIONS