Machine learning algorithm for cluster analysis of mixed dataset based on instance-cluster closeness metric

K Balaji, K Lavanya

doi:10.1016/j.chemolab.2021.104346

Abstract

Recently, clustering algorithms have attracted considerable attention owing to their significance in machine learning. Many methods have been proposed for clustering mixed datasets. However, these methods suffer from several issues. First, most of them directly apply existing machine learning methods to cluster instances without any improvement. Second, some methods convert categorical features into numerical features, which may lose information and introduce noise. Third, some methods convert numerical features into categorical features but ignore the influence of initial inflow conditions. To address these issues, we propose an intelligent method for clustering datasets with categorical and numerical features, based on the Instance Cluster Closeness Metric (ICCM) algorithm. Specifically, we first present a similarity metric for numerical features. We then design a novel metric for categorical features. Finally, we design a new learning algorithm to cluster mixed datasets. The proposed algorithm achieves clustering accuracies of 89.2% on the heart disease dataset and 89.4%, 84.9%, 85.5%, and 91.2% on the Kaggle, factors, kinase, and UV chemoinformatics datasets, respectively. It is also compared with state-of-the-art approaches, and the results demonstrate that the proposed method has the best efficiency.
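
The abstract does not give the ICCM formulas, so the following is only a generic illustration of how closeness between an instance and a cluster can be scored on mixed features without converting one feature type into the other. It is not the authors' metric. The function names `closeness` and `assign`, the inverse-distance score for numerical features, and the relative-frequency score for categorical features are all assumptions made for this sketch.

```python
from collections import Counter

def closeness(instance, cluster, num_idx, cat_idx):
    """Hypothetical closeness in [0, 1] between an instance and a cluster.

    Assumption: NOT the paper's ICCM, just a simple mixed-type score.
    cluster is a non-empty list of instances (tuples of feature values).
    """
    scores = []
    # Numerical features: inverse distance to the cluster mean.
    for j in num_idx:
        mean = sum(row[j] for row in cluster) / len(cluster)
        scores.append(1.0 / (1.0 + abs(instance[j] - mean)))
    # Categorical features: share of cluster members with the same category.
    for j in cat_idx:
        counts = Counter(row[j] for row in cluster)
        scores.append(counts[instance[j]] / len(cluster))
    # Average the per-feature scores so both feature types contribute.
    return sum(scores) / len(scores)

def assign(instance, clusters, num_idx, cat_idx):
    """Return the index of the cluster with the highest closeness."""
    return max(
        range(len(clusters)),
        key=lambda k: closeness(instance, clusters[k], num_idx, cat_idx),
    )

# Toy data (made up for illustration): feature 0 is numerical, feature 1 is categorical.
clusters = [
    [(1.0, "a"), (1.2, "a")],
    [(5.0, "b"), (5.5, "c")],
]
nearest_for_low = assign((1.1, "a"), clusters, num_idx=[0], cat_idx=[1])
nearest_for_high = assign((5.2, "b"), clusters, num_idx=[0], cat_idx=[1])
```

In this toy setup the first instance is assigned to the first cluster and the second instance to the second cluster. In practice, numerical features would typically need scaling before such a score is meaningful.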
