Initialization strategies for clustering mixed-type data with the k-prototypes algorithm

  • Abstract
  • Highlights & Summary
  • References
  • Similar Papers
Abstract

One of the most popular partitioning cluster algorithms is k-means, which is applicable only to numerical data. An extension to mixed-type data containing numerical and categorical variables is the k-prototypes algorithm. Due to its iterative structure, the algorithm may converge only to a local rather than a global minimum. Therefore, just as for the original k-means, the resulting cluster partition depends on the initialization. In general, there are two ways to improve on a random initialization of the algorithm: one possibility is to determine concrete initial cluster centers, and the other is to repeat the algorithm with different randomly chosen initial centers. In this work, initializations of both kinds are analyzed and evaluated comparatively in a benchmark study. To this end, selected initialization strategies for the k-means algorithm are transferred to the application on mixed-type data. For the simulation study, several data sets are artificially generated and cluster partitions are determined using the competing initialization strategies. It is shown that an improvement of the cluster algorithm’s target criterion can be achieved, as well as a better ability to identify appropriate groups, with manageable time expenditure.
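
As a concrete illustration of the second initialization option discussed in the abstract, the following minimal Python sketch implements a toy k-prototypes (squared Euclidean distance for the numerical part, a gamma-weighted mismatch count for the categorical part) and repeats the algorithm with different randomly chosen initial centers, keeping the partition with the lowest total cost. The function name `kprototypes` and all parameters are illustrative assumptions, not the paper's code; in practice one would use an established implementation such as the R package clustMixType.

```python
import random

def kprototypes(X_num, X_cat, k, gamma=1.0, n_init=10, max_iter=100, seed=0):
    """Toy k-prototypes with random restarts (illustrative sketch only).

    X_num: list of numeric tuples, X_cat: list of categorical tuples.
    The algorithm is restarted n_init times and the partition with the
    lowest total within-cluster cost is returned.
    """
    rng = random.Random(seed)
    n = len(X_num)
    best_cost, best_labels = None, None
    for _ in range(n_init):
        idx = rng.sample(range(n), k)                 # random initial centers
        cn = [list(X_num[i]) for i in idx]            # numeric prototypes
        cc = [list(X_cat[i]) for i in idx]            # categorical prototypes
        labels = [-1] * n
        for _ in range(max_iter):
            # assignment step: nearest prototype under the mixed distance
            new_labels = []
            for i in range(n):
                d = [
                    sum((a - b) ** 2 for a, b in zip(X_num[i], cn[j]))
                    + gamma * sum(a != b for a, b in zip(X_cat[i], cc[j]))
                    for j in range(k)
                ]
                new_labels.append(d.index(min(d)))
            if new_labels == labels:                  # converged
                break
            labels = new_labels
            # update step: mean for numeric parts, mode for categorical parts
            for j in range(k):
                members = [i for i in range(n) if labels[i] == j]
                if not members:
                    continue
                for p in range(len(cn[j])):
                    cn[j][p] = sum(X_num[i][p] for i in members) / len(members)
                for p in range(len(cc[j])):
                    vals = [X_cat[i][p] for i in members]
                    cc[j][p] = max(set(vals), key=vals.count)
        cost = sum(
            sum((a - b) ** 2 for a, b in zip(X_num[i], cn[labels[i]]))
            + gamma * sum(a != b for a, b in zip(X_cat[i], cc[labels[i]]))
            for i in range(n)
        )
        if best_cost is None or cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_labels, best_cost
```

Restarting is the cheap strategy the abstract contrasts with deterministic center selection: each restart costs one full run, but only the best partition by the target criterion is kept.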

Highlights

  • Cluster analysis aims to identify similarity structures and unknown groups in data (Hennig et al 2015)

  • An alternative approach for mixed-type data is the well-known partitioning around medoids (PAM) algorithm based on Gower distances (Kaufman and Rousseeuw 1990)

  • The upper plot visualizes the evaluation of the cluster partition by the adjusted Rand index, the middle plot refers to the internal validation criterion average silhouette width and the lower plot visualizes the sum of the within-cluster sums of the partitions, i.e. the sums of the distances between each observation and the respective prototype, which represents the optimization criterion of the k-prototypes algorithm


Summary

Introduction

Cluster analysis aims to identify similarity structures and unknown groups in data (Hennig et al 2015). A distinction is made between hierarchical methods and partitioning algorithms, where the k-means algorithm is certainly one of the most widely used cluster algorithms (Jain et al 1999) and has been further developed many times over the years (for more information, see e.g. Jain 2010). An alternative approach for mixed-type data is the well-known partitioning around medoids (PAM) algorithm based on Gower distances (Kaufman and Rousseeuw 1990). In their comprehensive review of state-of-the-art mixed data clustering algorithms, Ahmad and Khan (2019) concluded that, despite certain limitations, algorithms based on partitional clustering are typically preferred by researchers and practitioners due to their interpretability, scalability with large data sets, and adaptability to parallel architectures. This study focuses on the k-prototypes algorithm, which was identified in a benchmark study by Preud’homme et al (2021) on simulated and real-life data as the only efficient distance-based method for clustering heterogeneous data.
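
The Gower dissimilarity mentioned above can be sketched for a single pair of mixed-type records as follows. This is a minimal illustration, not the PAM implementation; the function name `gower` and the `num_ranges` argument (mapping each numeric attribute index to its range over the data set) are assumptions for this sketch, and all attributes are weighted equally.

```python
def gower(x, y, num_ranges):
    """Gower dissimilarity between two mixed-type records (sketch).

    Numeric attributes contribute a range-normalized absolute difference;
    all other attributes are treated as categorical (0/1 mismatch).
    """
    parts = []
    for p, (a, b) in enumerate(zip(x, y)):
        if p in num_ranges:                       # numeric attribute
            parts.append(abs(a - b) / num_ranges[p])
        else:                                     # categorical attribute
            parts.append(0.0 if a == b else 1.0)
    return sum(parts) / len(parts)                # equal-weight average
```

Because every attribute contributes a value in [0, 1], the result is itself in [0, 1], which is what makes Gower distances a natural basis for medoid-based clustering of heterogeneous data.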

Objectives
Results
Conclusion
References (showing 10 of 36 papers)
  • Cited by 64
  • 10.1109/ijcnn.2004.1379917
Initialization of cluster refinement algorithms: a review and comparative study
  • Jul 25, 2004
  • Ji He + 4 more

  • Cited by 68
  • 10.18637/jss.v067.i06
Rmixmod: The R Package of the Model-Based Unsupervised, Supervised, and Semi-Supervised Classification Mixmod Library
  • Jan 1, 2015
  • Journal of Statistical Software
  • Rémi Lebret + 5 more

  • Cited by 376
  • 10.1201/b19706
Handbook of Cluster Analysis
  • Dec 16, 2015

  • Cited by 11
  • 10.5445/ksp/1000098011/02
Cluster Validation for Mixed-Type Data
  • Jun 25, 2020
  • Rabea Aschenbruck + 1 more

  • Cited by 32
  • 10.1109/his.2009.73
Initializing K-means Clustering Using Affinity Propagation
  • Jan 1, 2009
  • Yan Zhu + 2 more

  • Cited by 635
  • 10.1016/j.datak.2007.03.016
A k-mean clustering algorithm for mixed numeric and categorical data
  • Apr 11, 2007
  • Data & Knowledge Engineering
  • Amir Ahmad + 1 more

  • Cited by 2226
  • 10.1023/a:1009769707641
Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values
  • Sep 1, 1998
  • Data Mining and Knowledge Discovery
  • Zhexue Huang

  • Cited by 15
  • 10.1007/s00357-022-09422-y
Imputation Strategies for Clustering Mixed-Type Data with Missing Values
  • Nov 26, 2022
  • Journal of Classification
  • Rabea Aschenbruck + 2 more

  • Cited by 31
  • 10.1007/s00180-017-0742-2
OpenML: An R package to connect to the machine learning platform OpenML
  • Jun 19, 2017
  • Computational Statistics
  • Giuseppe Casalicchio + 8 more

  • Cited by 348
  • 10.1109/97.329844
A new initialization technique for generalized Lloyd iteration
  • Oct 1, 1994
  • IEEE Signal Processing Letters
  • I Katsavounidis + 2 more

Similar Papers
  • Research Article
  • 10.29244/ijsa.v5i2p228-242
K-prototypes Algorithm for Clustering Schools Based on The Student Admission Data in IPB University
  • Jun 27, 2021
  • Indonesian Journal of Statistics and Its Applications
  • Sri Sulastri + 2 more

New student admissions are held regularly every year at all levels of education, including at IPB University. Since 2013, IPB University has kept a track record of every school that has succeeded in sending its graduates, even until they successfully completed their education at IPB University. There were 5,345 schools recorded in the data. Each school in the data needed to be assigned to a cluster, so that IPB could see which schools were good or not good at sending their graduates to continue their education at IPB, based on the characteristics of the clusters. This study uses the k-prototypes algorithm because it can be applied to data consisting of categorical and numerical variables (mixed-type data). The k-prototypes algorithm maintains the efficiency of the k-means algorithm in handling large data sizes while eliminating the limitations of k-means. The results showed that the optimal number of clusters in this study was four. The fourth cluster (421 school members) was the best cluster with respect to student admission at IPB University. On the other hand, the third cluster (391 school members) was the worst cluster in this study.

  • Research Article
  • Cited by 9
  • 10.3390/sym9040058
A Fast K-prototypes Algorithm Using Partial Distance Computation
  • Apr 21, 2017
  • Symmetry
  • Byoungwook Kim

The k-means is one of the most popular and widely used clustering algorithms; however, it is limited to numerical data only. The k-prototypes algorithm is well known for dealing with both numerical and categorical data. However, there have been no studies to accelerate it. In this paper, we propose a new, fast k-prototypes algorithm that provides the same answers as the original k-prototypes algorithm. The proposed algorithm avoids distance computations using partial distance computation. Our k-prototypes algorithm finds the minimum distance without computing distances over all attributes between an object and a cluster center, which allows it to reduce time complexity. Partial distance computation uses the fact that the maximum difference between two categorical attribute values is 1. If data objects have m categorical attributes, the maximum categorical difference between an object and a cluster center is m. Our algorithm first computes distances with numerical attributes only. If the difference between the minimum and the second smallest numerical distance is higher than m, we can find the minimum distance between an object and a cluster center without computing distances over the categorical attributes. The experimental results show that the computational performance of the proposed k-prototypes algorithm is superior to the original k-prototypes algorithm on our dataset.
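
The pruning idea described in this abstract can be sketched as follows. This is an illustration under simplifying assumptions (squared Euclidean numerical distance, a categorical weight `gamma`, and the hypothetical function name `nearest_center_partial`), not the authors' implementation: since each categorical attribute contributes at most 1 (times `gamma`) to the mixed distance, a sufficiently large gap between the two smallest numerical distances makes the categorical part irrelevant.

```python
def nearest_center_partial(x_num, x_cat, centers_num, centers_cat, gamma=1.0):
    """Nearest-center search that skips categorical distance computations
    whenever the numerical part alone is decisive (sketch of the idea)."""
    # numeric distances to every center
    d_num = [sum((a - b) ** 2 for a, b in zip(x_num, cn)) for cn in centers_num]
    order = sorted(range(len(d_num)), key=d_num.__getitem__)
    best = order[0]
    # the categorical part adds at most gamma * m, with m categorical attributes
    bound = gamma * len(x_cat)
    if len(order) > 1 and d_num[order[1]] - d_num[best] > bound:
        return best          # numeric gap is decisive; categorical part skipped
    # otherwise fall back to full mixed distances
    d_full = [
        dn + gamma * sum(a != b for a, b in zip(x_cat, cc))
        for dn, cc in zip(d_num, centers_cat)
    ]
    return d_full.index(min(d_full))
```

When clusters are well separated in the numerical attributes, the early return fires for most objects, which is where the reported speedup comes from.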

  • Book Chapter
  • Cited by 11
  • 10.1007/978-3-540-71701-0_129
K-Centers Algorithm for Clustering Mixed Type Data
  • May 22, 2007
  • Wei-Dong Zhao + 2 more

The K-modes and K-prototypes algorithms both apply a frequency-based update method for centroids, which considers only the attribute values with the highest frequency and neglects other attribute values, affecting the accuracy of the clustering results. To solve this problem, the K-centers clustering algorithm is proposed to handle mixed-type data. As extensions of the K-prototypes algorithm, hard and fuzzy K-centers algorithms are presented; focusing on the effect of attribute values with different frequencies on clustering accuracy, a new update method for centroids is proposed in this paper. Experiments on many UCI machine-learning databases show that the K-centers algorithm can cluster categorical and mixed-type data more efficiently and effectively than the K-modes and K-prototypes algorithms.

  • Research Article
  • 10.29244/ijsa.v9i1p117-135
K-Prototypes Algorithm for School Indexing in Report Card-Based Student Admissions
  • Jun 24, 2025
  • Indonesian Journal of Statistics and Its Applications
  • Ervina Dwi Anggrahini + 2 more

Institut Pertanian Bogor, also known as IPB University, is a state university that was ranked as the best university in Indonesia by the Ministry of Research and Technology in 2020. It has three main channels in its new student admission selection system; one of them, “Seleksi Nasional Berdasarkan Prestasi”, is an admission pathway based on report cards without a test. Selecting new students based on report cards requires creating a school index to assess the quality and commitment of each school by grouping the schools among “Seleksi Nasional Berdasarkan Prestasi” applicants. One method that can be used is the K-Prototypes algorithm. K-Prototypes can cluster large and mixed-type data (numeric and categorical) by combining distance measures from two non-hierarchical methods, namely the K-Means and K-Modes algorithms. Based on the analysis, the K-Prototypes algorithm yields three optimal clusters, each with distinct characteristics. Cluster 1 is the lowest cluster because it comprises schools with the lowest quality and commitment to new student admissions at IPB University, as indicated by the report cards. Cluster 2 is not superior in quality to Cluster 3 but is higher than Cluster 1. Cluster 3 is the best cluster because it consists of schools that show high quality and commitment to new student admissions at IPB University through the report card route.

  • Conference Article
  • 10.1109/iccpcct.2018.8574332
Concept Based Document Clustering Using K Prototype Algorithm
  • Mar 1, 2018
  • Sneha Pasarate + 1 more

The Internet is a wide network of unstructured data such as blogs, tweets, mails, and files, so the retrieval of correct data is necessary. Data clustering is the process of putting data together into groups that are coherently similar. Clustering minimizes search time because similar documents end up in the same cluster. The named entity recognition method assigns noun entities to categories such as person and location. Latent Dirichlet allocation (LDA) treats documents as mixtures of topics and works as a generative model. We used the Reuters 21578 dataset, a self-created dataset, a news article dataset, and a web dataset for processing. The proposed system first preprocesses the web document data to remove unwanted content. Next is the feature extraction phase, using the named entity recognition method and the topic modeling approach (LDA); feature extraction shrinks the data dimensionality. The K-prototype clustering algorithm performs better for clustering because it takes into consideration the number of mismatches for categorical data. The execution time and space utilized by the K-prototype algorithm are better than those of the fuzzy clustering algorithm.

  • Conference Article
  • 10.1109/icetc.2010.5529620
The improvement of initial point selection method for fuzzy K-Prototype clustering algorithm
  • Jun 1, 2010
  • Zhou Caiying + 1 more

K-Prototype is one of the important and effective cluster analysis algorithms for dealing with mixed data types. This article discusses a fuzzy clustering algorithm based on K-Prototype in detail and makes improvements to solve its initial value problems. The proposed method is simple, easy to understand, and can be implemented easily.

  • Book Chapter
  • Cited by 4
  • 10.4018/978-1-5225-3686-4.ch010
Clustering Mixed Datasets Using K-Prototype Algorithm Based on Crow-Search Optimization
  • Jan 1, 2018
  • Lakshmi K + 3 more

Data mining techniques are useful to discover interesting knowledge from large numbers of data objects. Clustering is one of the data mining techniques for knowledge discovery; it is an unsupervised learning method that analyses data objects without knowing class labels. The k-prototype is the most widely used partitional clustering algorithm for clustering data objects with mixed numeric and categorical data. This algorithm yields a local optimum solution due to its random selection of initial prototypes. Recently, a number of optimization algorithms have been introduced to obtain a global optimum solution. The Crow Search algorithm is one of the recently developed population-based meta-heuristic optimization algorithms and is based on the intelligent behavior of crows. In this paper, the k-prototype clustering algorithm is integrated with Crow Search optimization to produce a global optimum solution.

  • Research Article
  • Cited by 15
  • 10.1007/s00357-022-09422-y
Imputation Strategies for Clustering Mixed-Type Data with Missing Values
  • Nov 26, 2022
  • Journal of Classification
  • Rabea Aschenbruck + 2 more

Incomplete data sets with different data types are difficult to handle but are regularly found in practical clustering tasks. Therefore, in this paper two procedures for clustering mixed-type data with missing values are derived and analyzed in a simulation study with respect to the factors partition, prototypes, imputed values, and cluster assignment. Both approaches are based on the k-prototypes algorithm (an extension of k-means), which is one of the most common clustering methods for mixed-type data (i.e., numerical and categorical variables). For k-means clustering of incomplete data, the k-POD algorithm was recently proposed, which imputes the missing values with the values of the associated cluster center. We derive an adaptation of the latter and additionally present a cluster aggregation strategy after multiple imputation. It turns out that even a simplified and time-saving variant of the presented method can compete with multiple imputation and subsequent pooling.
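
The k-POD-style fill-in step described above (replacing each missing entry with the corresponding value of the assigned cluster's prototype) can be sketched as follows. `impute_from_prototypes` is a hypothetical helper, with missing entries encoded as `None`, and it shows a single imputation step rather than the full iterate-until-convergence procedure:

```python
def impute_from_prototypes(X, labels, centers):
    """k-POD-style fill-in (sketch): replace each missing entry (None)
    with the corresponding value of the assigned cluster's prototype.

    X: list of rows with None for missings; labels[i] gives row i's
    cluster; centers[j] is the prototype of cluster j.
    """
    return [
        [c if v is None else v for v, c in zip(row, centers[labels[i]])]
        for i, row in enumerate(X)
    ]
```

In the full procedure this step alternates with re-clustering the completed data until the partition stabilizes.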

  • Research Article
  • 10.29313/statistika.v22i1.469
Clustering of Villages in Poso Regency Based on Facilities, Infrastructure, and Health Workers Using the K-Prototype Method with a Genetic Algorithm
  • Nov 18, 2022
  • STATISTIKA Journal of Theoretical Statistics and Its Applications
  • Septyani Kawuwung

Cluster analysis is a technique for grouping a set of similar objects into one group so that they are dissimilar to objects in other groups. Cluster analysis is generally applied to objects with numeric data types, but in practice clustering also involves categorical data types. Clustering of mixed-type data can be done by applying the k-prototype algorithm, but the determination of the cluster center initialization tends to be sensitive. To handle the determination of the initial cluster centers, a genetic algorithm can be applied. This study examines the facilities, infrastructure, and health workers in Poso Regency, where the infrastructure and health personnel in the district are adequate but not evenly distributed across several areas. The results of this study indicate that the k-prototype algorithm yields 8 clusters with cluster centers optimized using the genetic algorithm, namely 36, 7, 99, 49, 69, 104, 105, and 110.

  • Research Article
  • Cited by 3
  • 10.19184/geosi.v6i1.23347
Landslide Hazard Analysis Using a Multilayered Approach Based on Various Input Data Configurations
  • Apr 25, 2021
  • Geosfera Indonesia
  • Tay Lea Tien + 2 more

Landslide is a natural disaster that occurs mostly in hilly areas. Landslide hazard mapping is used to classify the prone areas to mitigate the risk of landslide hazards. This paper aims to compare spatial landslide prediction performance using an artificial neural network (ANN) model based on different data input configurations, different numbers of hidden neurons, and two types of normalization techniques on the data set of Penang Island, Malaysia. The data set involves twelve landslide influencing factors, of which five are continuous values while the remaining seven are categorical/discrete values. These factors are considered in three different configurations, i.e., original (OR), frequency ratio (FR), and mixed-type (MT) data, which act as inputs to train the ANN model separately. The number of hidden neurons in the hidden layer has a significant effect on the final output. In addition, the three data configurations are processed using two different normalization methods, i.e., mean-standard deviation (Mean-SD) and Min-Max. The landslide causative data often consist of correlated information caused by overlapping of input instances; therefore, the principal component analysis (PCA) technique is used to eliminate the correlated information. The area under the receiver operating characteristic (ROC) curve, i.e., AUC, is applied to verify the produced landslide hazard maps. The best AUC results for the Mean-SD and Min-Max schemes with PCA are 96.72% and 96.38%, respectively. The results show that Mean-SD with PCA of the MT data configuration yields the best validation accuracy, AUC, and lowest AIC at 100 hidden neurons. The MT data configuration with the Mean-SD normalization and PCA scheme is more robust and stable in training the MLP model for landslide prediction.
Keywords: Landslide; ANN; Hidden Neurons; Normalization; PCA; ROC; Hazard map

  • Research Article
  • Cited by 11
  • 10.5445/ksp/1000098011/02
Cluster Validation for Mixed-Type Data
  • Jun 25, 2020
  • Rabea Aschenbruck + 1 more

For cluster analysis based on mixed-type data (i.e. data consisting of numerical and categorical variables), comparatively few clustering methods are available. One popular approach to deal with this kind of problem is an extension of the k-means algorithm (Huang, 1998), the so-called k-prototypes algorithm, which is implemented in the R package clustMixType (Szepannek and Aschenbruck, 2019). It is further known that the selection of a suitable number of clusters k is particularly crucial in partitioning cluster procedures. Many implementations of cluster validation indices in R are not suitable for mixed-type data. This paper examines the transferability of validation indices, such as the Gamma index, Average Silhouette Width or Dunn index, to mixed-type data. Furthermore, the R package clustMixType is extended by these indices and their application is demonstrated. Finally, the behaviour of the adapted indices is tested by a short simulation study using different data scenarios.
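
One way to transfer a validation index such as the average silhouette width to mixed-type data, as this abstract describes, is to compute it from a precomputed mixed-type dissimilarity matrix rather than from raw coordinates. A minimal sketch follows; the function name and the convention of assigning silhouette 0 to singleton clusters are assumptions of this illustration, not the paper's implementation.

```python
def average_silhouette(dist, labels):
    """Average silhouette width from a precomputed distance matrix,
    so any mixed-type dissimilarity (e.g. the k-prototypes one) can be
    plugged in. Values near 1 indicate well-separated clusters."""
    n = len(labels)
    clusters = set(labels)
    sil = []
    for i in range(n):
        # a(i): mean distance to the other members of i's own cluster
        own = [dist[i][j] for j in range(n) if j != i and labels[j] == labels[i]]
        if not own:                 # singleton cluster: silhouette set to 0
            sil.append(0.0)
            continue
        a = sum(own) / len(own)
        # b(i): mean distance to the nearest other cluster
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c)
            / sum(1 for j in range(n) if labels[j] == c)
            for c in clusters if c != labels[i]
        )
        sil.append((b - a) / max(a, b))
    return sum(sil) / n
```

Because the index only consumes pairwise dissimilarities, the same code works unchanged for numerical, categorical, or mixed data once a suitable distance matrix is supplied.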

  • Research Article
  • Cited by 1
  • 10.11591/ijeecs.v4.i3.pp617-628
Clustering Large Data with Mixed Values Using Extended Fuzzy Adaptive Resonance Theory
  • Dec 1, 2016
  • Indonesian Journal of Electrical Engineering and Computer Science
  • Asadi Srinivasulu + 1 more

Clustering is a content mining technique used for grouping similar items. Clustering software datasets with mixed values is a major challenge in clustering applications. Previous work deals with unsupervised feature learning techniques such as k-Means and C-Means, which cannot process mixed-type data. There are several drawbacks in the previous work, such as poor cluster tendency, poor partitioning, low accuracy, and low performance. To overcome these problems, extended fuzzy adaptive resonance theory (EFART) was developed, which combines fuzzy ART with a traditional approach. This work deals with mixed-type data by applying unsupervised feature learning to achieve a sparse representation, making it easier for clustering algorithms to separate the data. The advantages of extended fuzzy adaptive resonance theory are high accuracy, high performance, good partitioning, and good cluster tendency. EFART adopts unsupervised feature learning, which helps to cluster large data sets such as the teaching assistant evaluation, iris, and wine datasets. Finally, the obtained results may consist of clusters formed based on the similarity of their attribute types and values.

  • Research Article
  • Cited by 10
  • 10.1155/2020/5143797
Weighted k-Prototypes Clustering Algorithm Based on the Hybrid Dissimilarity Coefficient
  • Jul 25, 2020
  • Mathematical Problems in Engineering
  • Ziqi Jia + 1 more

The k-prototypes algorithm is a hybrid clustering algorithm that can process categorical and numerical data. In this study, the method of initial cluster center selection is improved and a new hybrid dissimilarity coefficient is proposed. Based on this coefficient, a weighted k-prototypes clustering algorithm (WKPCA) is proposed. The WKPCA algorithm not only improves the selection of initial cluster centers but also introduces a new method to calculate the dissimilarity between data objects and cluster centers. Real UCI datasets were used to test the WKPCA algorithm. Experimental results show that the WKPCA algorithm is more efficient and robust than other k-prototypes algorithms.

  • Research Article
  • Cited by 15
  • 10.1007/s12046-018-0823-0
An equi-biased k-prototypes algorithm for clustering mixed-type data
  • Mar 1, 2018
  • Sādhanā
  • Ravi Sankar Sangam + 1 more

Clustering has been recognized as a very important approach for data analysis that partitions the data according to some (dis)similarity criterion. In recent years, the problem of clustering mixed-type data has attracted many researchers. The k-prototypes algorithm is well known for its scalability in this respect. In this paper, the limitations of the dissimilarity coefficient used in the k-prototypes algorithm are discussed with some illustrative examples. We propose a new hybrid dissimilarity coefficient for the k-prototypes algorithm, which can be applied to data with numerical, categorical, and mixed attributes. Besides retaining the scalability of the k-prototypes algorithm in our method, the dissimilarity functions for either type of attribute are defined on the same scale with respect to their dimensionality, which is very beneficial for improving the quality of the clustering results. The efficacy of our method is shown by experiments on real and synthetic data sets.

  • Research Article
  • Cited by 22
  • 10.1016/j.eswa.2008.06.100
Constraint-based clustering and its applications in construction management
  • Jun 27, 2008
  • Expert Systems with Applications
  • Ying-Mei Cheng + 1 more


More from: Advances in Data Analysis and Classification
  • Research Article
  • 10.1007/s11634-025-00659-0
Data-driven logistic regression ensembles with applications in genomics
  • Nov 25, 2025
  • Advances in Data Analysis and Classification
  • Anthony-Alexander Christidis + 2 more

  • Research Article
  • 10.1007/s11634-025-00660-7
Editorial for ADAC issue 4 of volume 19 (2025)
  • Nov 17, 2025
  • Advances in Data Analysis and Classification
  • Maurizio Vichi + 2 more

  • Research Article
  • 10.1007/s11634-025-00655-4
Low-bias discrimination of circular data with measurement errors
  • Oct 18, 2025
  • Advances in Data Analysis and Classification
  • Marco Di Marzio + 3 more

  • Research Article
  • 10.1007/s11634-025-00650-9
Two-stage principal component analysis on interval-valued data using patterned covariance structures
  • Jul 19, 2025
  • Advances in Data Analysis and Classification
  • Anuradha Roy

  • Addendum
  • 10.1007/s11634-025-00648-3
Correction to: Sparse correspondence analysis for large contingency tables
  • Jun 26, 2025
  • Advances in Data Analysis and Classification
  • Ruiping Liu + 3 more

  • Research Article
  • 10.1007/s11634-025-00646-5
Sparse constrained and unconstrained non-symmetric correspondence analysis
  • Jun 23, 2025
  • Advances in Data Analysis and Classification
  • Mark De Rooij + 1 more

  • Research Article
  • 10.1007/s11634-025-00651-8
Flexible multi-class cost-sensitive thresholding
  • Jun 22, 2025
  • Advances in Data Analysis and Classification
  • Jorge C-Rella + 1 more

  • Research Article
  • 10.1007/s11634-025-00639-4
Initialization strategies for clustering mixed-type data with the k-prototypes algorithm
  • Jun 12, 2025
  • Advances in Data Analysis and Classification
  • Rabea Aschenbruck + 2 more

  • Research Article
  • 10.1007/s11634-025-00643-8
Modeling time-dependent population proportions in a finite mixture model setting
  • Jun 6, 2025
  • Advances in Data Analysis and Classification
  • Igor Melnykov + 1 more

  • Research Article
  • 10.1007/s11634-025-00649-2
Increasing biases can be more efficient than increasing weights
  • Jun 1, 2025
  • Advances in Data Analysis and Classification
  • Carlo Metta + 10 more
