Abstract

The k-prototypes algorithm is a hybrid clustering algorithm that can process both categorical and numerical data. In this study, the method for selecting initial cluster centers was improved and a new hybrid dissimilarity coefficient was proposed. Based on this coefficient, a weighted k-prototype clustering algorithm (WKPCA) was developed. WKPCA not only improves the selection of initial cluster centers but also introduces a new method for calculating the dissimilarity between data objects and cluster centers. Real data from the UCI repository were used to test WKPCA. Experimental results show that WKPCA is more efficient and robust than other k-prototypes algorithms.

Highlights

  • Cluster analysis belongs to unsupervised learning and is an important research direction in the field of machine learning [1]

  • The modes vector combines, for each feature, the feature value that occurs most frequently in the subcluster. The dissimilarity between a data object to be clustered and a cluster is calculated by simple Hamming distance, so only categorical data can be processed

  • Researchers have carried out a series of exploratory studies. The k-prototypes algorithm [6] and its variants are mixed-type data clustering algorithms that account for the dissimilarity coefficients of categorical and numerical features at the same time
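The modes vector and Hamming dissimilarity mentioned in the highlights can be sketched as follows. This is a minimal illustration, not the WKPCA coefficient proposed in the paper: the function names are hypothetical, and `mixed_dissimilarity` assumes the commonly cited k-prototypes form (squared Euclidean distance on numerical features plus a weight `gamma` times the Hamming distance on categorical features).

```python
from collections import Counter

def modes_vector(rows):
    """Most frequent value of each categorical feature in a subcluster."""
    columns = zip(*rows)
    return tuple(Counter(col).most_common(1)[0][0] for col in columns)

def hamming(a, b):
    """Number of positions where two categorical vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def mixed_dissimilarity(num_x, cat_x, num_center, cat_center, gamma=1.0):
    """Assumed k-prototypes-style dissimilarity: numeric part + gamma * categorical part."""
    numeric = sum((p - q) ** 2 for p, q in zip(num_x, num_center))
    return numeric + gamma * hamming(cat_x, cat_center)

# Toy subcluster with two categorical features (illustrative data only).
cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
mode = modes_vector(cluster)

categorical_only = hamming(("blue", "large"), mode)
mixed = mixed_dissimilarity((1.0, 2.0), ("blue", "small"), (0.0, 0.0), mode, gamma=0.5)
```

In this sketch `gamma` controls how much the categorical mismatch counts relative to the numerical distance; how that balance should be set is exactly the kind of question a weighted hybrid coefficient addresses.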

Summary

Introduction

Cluster analysis belongs to unsupervised learning and is an important research direction in the field of machine learning [1]. There are four disadvantages to using binary encoding for data preprocessing: (1) the original structure of the categorical data is destroyed, so the converted binary features are meaningless; (2) implicit dissimilarity information is ignored, so the encoding cannot truly reflect the structure of the dataset; (3) if a feature has a large range of values, the converted binary features have a high dimension; and (4) maintenance is difficult: if new values are added for a categorical feature, all data objects change [5]. To solve these problems, researchers have carried out a series of exploratory studies. In this study, the k-prototypes algorithm and its variants were analyzed and compared, the automatic determination of initial cluster centers was improved, and a new hybrid dissimilarity coefficient was proposed.
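A small sketch can make the binary-encoding drawbacks concrete. It assumes that "binary encoding" means one-hot encoding, and the helper names and the toy `colors` feature are hypothetical. It shows that the encoded width equals the number of distinct values, that every pair of different values ends up equally far apart, and that adding a new value changes the encoding of every object.

```python
def one_hot(values, categories):
    """Encode each value as a 0/1 vector with one position per category."""
    return [[1 if v == c else 0 for c in categories] for v in values]

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

colors = ["red", "green", "blue", "red"]
categories = sorted(set(colors))        # ['blue', 'green', 'red']
encoded = one_hot(colors, categories)   # one binary feature per distinct value

# Any two different values are the same distance apart after encoding,
# so any graded notion of dissimilarity between values is lost.
d_red_green = squared_distance(encoded[0], encoded[1])
d_red_blue = squared_distance(encoded[0], encoded[2])

# Adding one new value widens the encoding of every existing object.
wider = one_hot(colors, sorted(set(colors + ["yellow"])))
```

Here the three-value feature becomes three binary features, and introducing "yellow" forces all four objects to be re-encoded with four features.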

The k-Prototypes Algorithm
Quantized Numerical Dissimilarity Coefficient
Weighted Hybrid Dissimilarity Coefficient
Cost Function considering Weights
Experimental Results and Analysis
