Initialization strategies for clustering mixed-type data with the k-prototypes algorithm
Abstract One of the most popular partitioning cluster algorithms is k-means, which is only applicable to numerical data. An extension to mixed-type data containing numerical and categorical variables is the k-prototypes algorithm. Due to its iterative structure, the algorithm may converge only to a local rather than a global minimum. Therefore, just like the solution of the original k-means, the resulting cluster partition depends on the initialization. In general, there are two ways of improving on the random initialization of the algorithm: one possibility is to determine concrete initial cluster centers, and the other is to repeat the algorithm with different randomly chosen initial centers. In this work, algorithm initializations of both kinds are analyzed and evaluated comparatively in a benchmark study. To this end, selected initialization strategies of the k-means algorithm are transferred to the application on mixed-type data. For the simulation study, several data sets are artificially generated and cluster partitions are determined using the competing initialization strategies. It is shown that an improvement of the cluster algorithm’s target criterion can be achieved, as well as a better ability to identify appropriate groups, with manageable time expenditure.
Highlights
Cluster analysis aims to identify similarity structures and unknown groups in data (Hennig et al. 2015)
An alternative approach is the well-known partitioning around medoids (PAM) algorithm based on Gower distances (Kaufman and Rousseeuw 1990)
The upper plot visualizes the evaluation of the cluster partitions by the adjusted Rand index, the middle plot refers to the internal validation criterion average silhouette width, and the lower plot visualizes the within-cluster sums of distances of the partitions, i.e. the sums of the distances between each observation and the respective prototype, which represent the optimization criterion of the k-prototypes algorithm
Summary
Cluster analysis aims to identify similarity structures and unknown groups in data (Hennig et al. 2015). A distinction is made between hierarchical methods and partitioning algorithms, where the k-means algorithm is certainly one of the most widely used cluster algorithms (Jain et al. 1999) and has been further developed many times over the years (for more information, see e.g. Jain 2010). An alternative approach is the well-known partitioning around medoids (PAM) algorithm based on Gower distances (Kaufman and Rousseeuw 1990). In their comprehensive review of state-of-the-art mixed data clustering algorithms, Ahmad and Khan (2019) concluded that, despite certain limitations, algorithms based on partitional clustering are typically preferred by researchers and practitioners due to their interpretability, scalability to large data sets, and adaptability to parallel architectures. This study focuses on the k-prototypes algorithm, which was identified in a benchmark study by Preud’homme et al. (2021) on simulated and real-life data as the only efficient distance-based method for clustering heterogeneous data.
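As background for the methods discussed on this page, the k-prototypes dissimilarity combines the squared Euclidean distance on the numerical attributes with a weighted count of mismatches on the categorical attributes (Huang 1998). A minimal Python sketch of the distance and the assignment step follows; the function names and the default weight `gamma` are illustrative, not taken from the paper:

```python
def kproto_distance(x_num, x_cat, c_num, c_cat, gamma=1.0):
    """k-prototypes dissimilarity: squared Euclidean distance on the
    numerical attributes plus gamma times the number of categorical
    mismatches."""
    num_part = sum((a - b) ** 2 for a, b in zip(x_num, c_num))
    cat_part = sum(a != b for a, b in zip(x_cat, c_cat))
    return num_part + gamma * cat_part

def assign(points, prototypes, gamma=1.0):
    """Assign each observation (a pair of numeric and categorical
    tuples) to the index of its nearest prototype."""
    labels = []
    for x_num, x_cat in points:
        dists = [kproto_distance(x_num, x_cat, c_num, c_cat, gamma)
                 for c_num, c_cat in prototypes]
        labels.append(dists.index(min(dists)))
    return labels
```

The weight `gamma` balances the two attribute types; how it is chosen (and how prototypes are updated afterwards) is exactly where the initialization strategies compared in the paper come into play.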
- 10.1109/ijcnn.2004.1379917 (Jul 25, 2004)
- 10.18637/jss.v067.i06 (Jan 1, 2015), Journal of Statistical Software
- 10.1201/b19706 (Dec 16, 2015)
- 10.5445/ksp/1000098011/02 (Jun 25, 2020)
- 10.1109/his.2009.73 (Jan 1, 2009)
- 10.1016/j.datak.2007.03.016 (Apr 11, 2007), Data & Knowledge Engineering
- 10.1023/a:1009769707641 (Sep 1, 1998), Data Mining and Knowledge Discovery
- 10.1007/s00357-022-09422-y (Nov 26, 2022), Journal of Classification
- 10.1007/s00180-017-0742-2 (Jun 19, 2017), Computational Statistics
- 10.1109/97.329844 (Oct 1, 1994), IEEE Signal Processing Letters
- Research Article: 10.29244/ijsa.v5i2p228-242 (Jun 27, 2021), Indonesian Journal of Statistics and Its Applications
New student admissions are held every year at all levels of education, including at IPB University. Since 2013, IPB University has kept a track record of every school that has succeeded in sending its graduates there, even until they successfully completed their education at IPB University; 5,345 schools are included in the data. Clustering these schools allows IPB to see which schools perform well or poorly in sending their graduates to continue their education at IPB, based on the characteristics of the clusters. This study uses the k-prototypes algorithm because it can be applied to data consisting of both categorical and numerical variables (mixed-type data). The k-prototypes algorithm retains the efficiency of the k-means algorithm in handling large data sets while eliminating the limitations of k-means. The results show that the optimal number of clusters in this study is four. The fourth cluster (421 schools) is the best cluster with respect to student admission at IPB University, whereas the third cluster (391 schools) is the worst cluster in this study.
- Research Article: 10.3390/sym9040058 (Apr 21, 2017), Symmetry
The k-means is one of the most popular and widely used clustering algorithms; however, it is limited to numerical data only. The k-prototypes algorithm is well known for dealing with both numerical and categorical data, but there have been no studies on accelerating it. In this paper, we propose a new, fast k-prototypes algorithm that provides the same answers as the original k-prototypes algorithm. The proposed algorithm avoids distance computations by using partial distance computation. Our k-prototypes algorithm finds the minimum distance without computing distances over all attributes between an object and a cluster center, which reduces time complexity. Partial distance computation uses the fact that the maximum difference between two categorical attribute values is 1 during distance computation: if data objects have m categorical attributes, the maximum categorical difference between an object and a cluster center is m. Our algorithm first computes the distance over the numerical attributes only. If the difference between the minimum distance and the second smallest distance on the numerical attributes is higher than m, the minimum distance between an object and a cluster center can be found without computing the categorical distances. The experimental results show that the computational performance of the proposed k-prototypes algorithm is superior to that of the original k-prototypes algorithm on our dataset.
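The pruning idea described in this abstract can be sketched directly: since the categorical part of a k-prototypes distance lies between 0 and γ·m for m categorical attributes, the numeric-only winner is final whenever even its worst-case categorical penalty still beats the numeric-only runner-up. A hedged Python sketch, not the paper's implementation (names are illustrative):

```python
def nearest_center_partial(x_num, x_cat, centers, gamma=1.0):
    """Find the index of the nearest center, skipping categorical
    distance computations when the numeric-only gap already decides.
    The categorical part of the distance lies in [0, gamma * m],
    where m is the number of categorical attributes."""
    m = len(x_cat)
    # Numeric-only squared distances to every center.
    num_d = [sum((a - b) ** 2 for a, b in zip(x_num, c_num))
             for c_num, _ in centers]
    order = sorted(range(len(centers)), key=num_d.__getitem__)
    best, second = order[0], order[1]
    # Pruning test: even a worst-case categorical part cannot
    # overturn the numeric-only winner.
    if num_d[best] + gamma * m < num_d[second]:
        return best
    # Otherwise fall back to the full distance computation.
    full = [num_d[i] + gamma * sum(a != b for a, b in zip(x_cat, c_cat))
            for i, (_, c_cat) in enumerate(centers)]
    return full.index(min(full))
```

The fallback path makes the sketch give the same answer as a full computation, which matches the abstract's claim that the accelerated algorithm is exact.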
- Book Chapter: 10.1007/978-3-540-71701-0_129 (May 22, 2007)
The K-modes and K-prototypes algorithms both apply a frequency-based update method for centroids that regards only the attribute value with the highest frequency and neglects all other values, which affects the accuracy of the clustering results. To solve this problem, the K-centers clustering algorithm is proposed for handling mixed-type data. As an extension of the K-prototypes algorithm, hard and fuzzy K-centers algorithms are presented that focus on the effect of attribute values with different frequencies on clustering accuracy, and a new update method for centroids is proposed in this paper. Experiments on several UCI machine-learning databases show that the K-centers algorithm clusters categorical and mixed-type data more efficiently and effectively than the K-modes and K-prototypes algorithms.
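The update step criticized above can be contrasted in code: k-modes/k-prototypes keep only the most frequent category per attribute, whereas a distribution-based center keeps the frequency of every value. The following minimal sketch only illustrates that difference; it does not reproduce the K-centers paper's exact formulas:

```python
from collections import Counter

def mode_center(values):
    """k-modes/k-prototypes style update: keep only the single most
    frequent categorical value."""
    return Counter(values).most_common(1)[0][0]

def frequency_center(values):
    """Distribution-based update: keep the relative frequency of
    every value, so less frequent categories still contribute."""
    counts = Counter(values)
    n = len(values)
    return {v: c / n for v, c in counts.items()}

def frequency_dissimilarity(value, center):
    """Dissimilarity of a categorical value to a frequency-based
    center: one minus the value's relative frequency there."""
    return 1.0 - center.get(value, 0.0)
```

With the mode-based center, the information that "b" and "c" each cover a quarter of the cluster is lost; the frequency-based center retains it in the dissimilarity.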
- Research Article: 10.29244/ijsa.v9i1p117-135 (Jun 24, 2025), Indonesian Journal of Statistics and Its Applications
Institut Pertanian Bogor, also known as IPB University, is a state university that was ranked first as the best university in Indonesia by the Ministry of Research and Technology in 2020. It has three main channels in the new student admission selection system. The selection method is called “Seleksi Nasional Berdasarkan Prestasi”. “Seleksi Nasional Berdasarkan Prestasi” is one of the new student admission pathways at IPB University based on report cards without a test. The selection of new student admissions based on report cards requires creating a school index to assess the quality and commitment of each school by grouping schools among “Seleksi Nasional Berdasarkan Prestasi” applicants. One method that can be used is the K-Prototypes algorithm. K-Prototypes can be used to cluster large and mixed-type data (numeric and categorical) by combining distance measures from two non-hierarchical methods, namely the K-Means and K-Modes algorithms. Based on the analysis, the K-Prototypes algorithm yields three optimal clusters, each with distinct characteristics. Cluster 1 is the lowest cluster because it comprises schools with the lowest quality and commitment to new student admissions at IPB University, as indicated by the report card. Cluster 2 has a quality that is not superior to Cluster 3 but is higher than that of Cluster 1. Cluster 3 is the best cluster because it consists of schools that have high quality and commitment to new student admissions at IPB University through the report card route.
- Conference Article: 10.1109/iccpcct.2018.8574332 (Mar 1, 2018)
The Internet is a wide network of unstructured data such as blogs, tweets, mails, and files, so the retrieval of the correct data is essential. Data clustering is the process of putting data into groups that are coherently similar; clustering minimizes search time because similar documents end up in the same cluster. The named entity recognition method allocates noun entities to different sections such as person or location. Latent Dirichlet allocation treats documents as mixtures of topics and works as a generative model. We used the Reuters 21578 dataset, a self-created dataset, a news article dataset, and a web dataset for processing. The proposed system first preprocesses the web document data to remove unwanted content. Next is the feature extraction phase, using the named entity recognition method and a topic modeling approach (LDA); feature extraction shrinks the data dimensionality. The k-prototype clustering algorithm performs better for clustering because it takes into consideration the number of mismatches for categorical data. The execution time and space utilized by the k-prototype algorithm are better than those of the fuzzy clustering algorithm.
- Conference Article: 10.1109/icetc.2010.5529620 (Jun 1, 2010)
K-Prototype is one of the important and effective cluster analysis algorithms for dealing with mixed data types. This article discusses a fuzzy clustering algorithm based on K-Prototype in detail and makes improvements to solve its initial value problems. The proposed method is simple, easy to understand, and can be implemented easily.
- Book Chapter: 10.4018/978-1-5225-3686-4.ch010 (Jan 1, 2018)
Data mining techniques are useful to discover interesting knowledge from large amounts of data objects. Clustering is one of the data mining techniques for knowledge discovery; it is an unsupervised learning method that analyses data objects without knowing class labels. The k-prototype algorithm is the most widely used partitional clustering algorithm for clustering data objects with mixed numerical and categorical attributes. This algorithm provides a local optimum solution due to its random selection of initial prototypes. Recently, a number of optimization algorithms have been introduced to obtain the global optimum solution. The Crow Search algorithm is one of the recently developed population-based meta-heuristic optimization algorithms and is based on the intelligent behavior of crows. In this paper, the k-prototype clustering algorithm is integrated with the Crow Search optimization algorithm to produce the global optimum solution.
- Research Article: 10.1007/s00357-022-09422-y (Nov 26, 2022), Journal of Classification
Incomplete data sets with different data types are difficult to handle but are regularly found in practical clustering tasks. Therefore, in this paper, two procedures for clustering mixed-type data with missing values are derived and analyzed in a simulation study with respect to the factors of partition, prototypes, imputed values, and cluster assignment. Both approaches are based on the k-prototypes algorithm (an extension of k-means), which is one of the most common clustering methods for mixed-type data (i.e., numerical and categorical variables). For k-means clustering of incomplete data, the k-POD algorithm has recently been proposed, which imputes the missing values with the values of the associated cluster center. We derive an adaptation of the latter and additionally present a cluster aggregation strategy after multiple imputation. It turns out that even a simplified and time-saving variant of the presented method can compete with multiple imputation and subsequent pooling.
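The k-POD-style imputation step mentioned above can be sketched as a single operation: after clustering the currently completed data, each missing entry is overwritten with the corresponding attribute value of its cluster's prototype, and the cluster-impute cycle repeats. A minimal sketch of that step for mixed-type rows (names are illustrative; the paper's two variants add further machinery around it):

```python
def impute_from_prototypes(data, labels, prototypes, missing=None):
    """k-POD-style step: replace each missing entry (marked by the
    sentinel `missing`) with the value of the same attribute in the
    prototype of the row's assigned cluster. Works for mixed rows,
    since prototypes carry numeric means and categorical modes."""
    completed = []
    for row, lab in zip(data, labels):
        proto = prototypes[lab]
        completed.append([proto[j] if v is missing else v
                          for j, v in enumerate(row)])
    return completed
```

In the full procedure this step alternates with re-running the clustering on the completed data until the partition stabilizes.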
- Research Article: 10.29313/statistika.v22i1.469 (Nov 18, 2022), STATISTIKA Journal of Theoretical Statistics and Its Applications
Cluster analysis is a technique for grouping similar objects into one group so that they are dissimilar to objects in other groups. Cluster analysis is generally applied to objects with numerical attributes, but in practice clustering also involves categorical attributes. Mixed-type data can be clustered by applying the k-prototype algorithm, but the determination of the initial cluster centers tends to be sensitive. To handle the initialization of the cluster centers, a genetic algorithm can be applied. This study examines the facilities, infrastructure, and health workers in Poso Regency, where the infrastructure and personnel in the district are adequate but not evenly distributed across several areas. The results of this study indicate that the k-prototype algorithm yields 8 clusters, with cluster centers optimized using the genetic algorithm, namely observations 36, 7, 99, 49, 69, 104, 105, and 110.
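The genetic-algorithm idea for initializing cluster centers can be sketched generically: individuals are candidate sets of center indices, and the usual selection, crossover, and mutation operators search for a set with low clustering cost. The following toy Python sketch is an illustration under stated assumptions, with hypothetical names and operators; it does not reproduce the study's actual GA:

```python
import random

def ga_initial_centers(n_points, k, fitness, generations=30,
                       pop_size=20, seed=0):
    """Toy genetic algorithm for choosing k initial center indices.
    Individuals are tuples of k distinct indices; `fitness` maps an
    index tuple to a cost (lower is better). Selection keeps the
    fitter half, crossover samples from the union of two parents,
    and mutation occasionally swaps in a random index."""
    rng = random.Random(seed)
    pop = [tuple(rng.sample(range(n_points), k)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        survivors = pop[: pop_size // 2]      # elitist selection
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            genes = list(dict.fromkeys(p1 + p2))   # union of parents
            child = rng.sample(genes, k) if len(genes) >= k else list(p1)
            if rng.random() < 0.2:                 # mutation
                child[rng.randrange(k)] = rng.randrange(n_points)
            # Discard children with duplicate indices.
            children.append(tuple(child) if len(set(child)) == k else p1)
        pop = survivors + children
    return min(pop, key=fitness)
```

A natural `fitness` here is the k-prototypes target criterion itself, i.e. the total distance of all observations to their nearest chosen center.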
- Research Article: 10.19184/geosi.v6i1.23347 (Apr 25, 2021), Geosfera Indonesia
Landslide is a natural disaster that occurs mostly in hill areas. Landslide hazard mapping is used to classify prone areas in order to mitigate the risk of landslide hazards. This paper aims to compare spatial landslide prediction performance using an artificial neural network (ANN) model based on different data input configurations, different numbers of hidden neurons, and two types of normalization techniques on the data set of Penang Island, Malaysia. The data set involves twelve landslide influencing factors, of which five are continuous-valued while the remaining seven are categorical/discrete. These factors are considered in three different configurations, i.e., original (OR), frequency ratio (FR), and mixed-type (MT) data, which act as input to train the ANN model separately. The number of hidden neurons in the hidden layer has a significant effect on the final output. In addition, the three data configurations are processed using two different normalization methods, i.e., mean-standard deviation (Mean-SD) and Min-Max. The landslide causative data often consist of correlated information caused by overlapping of input instances; therefore, the principal component analysis (PCA) technique is used to eliminate the correlated information. The area under the receiver operating characteristic (ROC) curve, i.e., AUC, is applied to verify the produced landslide hazard maps. The best AUC results for the Mean-SD and Min-Max schemes with PCA are 96.72% and 96.38%, respectively. The results show that Mean-SD with PCA on the MT data configuration yields the best validation accuracy and AUC and the lowest AIC at 100 hidden neurons. The MT data configuration with Mean-SD normalization and the PCA scheme is more robust and stable in training the MLP model for landslide prediction.
 Keywords: Landslide; ANN; Hidden Neurons; Normalization; PCA; ROC; Hazard map
 
- Research Article: 10.5445/ksp/1000098011/02 (Jun 25, 2020)
For cluster analysis based on mixed-type data (i.e. data consisting of numerical and categorical variables), comparatively few clustering methods are available. One popular approach to this kind of problem is an extension of the k-means algorithm (Huang, 1998), the so-called k-prototypes algorithm, which is implemented in the R package clustMixType (Szepannek and Aschenbruck, 2019). It is further known that the selection of a suitable number of clusters k is particularly crucial in partitioning cluster procedures. Many implementations of cluster validation indices in R are not suitable for mixed-type data. This paper examines the transferability of validation indices such as the Gamma index, average silhouette width, or Dunn index to mixed-type data. Furthermore, the R package clustMixType is extended by these indices and their application is demonstrated. Finally, the behaviour of the adapted indices is tested in a short simulation study using different data scenarios.
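Of the indices named above, the average silhouette width transfers to mixed-type data simply by substituting a mixed dissimilarity for the Euclidean one. A minimal Python sketch assuming a precomputed pairwise dissimilarity matrix (any mixed-type dissimilarity works; names are illustrative, and this is not the clustMixType implementation):

```python
def average_silhouette_width(dist, labels):
    """Average silhouette width from a precomputed pairwise
    dissimilarity matrix. For each point i,
    s(i) = (b(i) - a(i)) / max(a(i), b(i)),
    where a(i) is the mean distance to i's own cluster and b(i) the
    smallest mean distance to any other cluster."""
    n = len(labels)
    clusters = set(labels)
    scores = []
    for i in range(n):
        own = [dist[i][j] for j in range(n)
               if labels[j] == labels[i] and j != i]
        if not own:              # singleton cluster: silhouette is 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(sum(dist[i][j] for j in range(n) if labels[j] == c)
                / sum(1 for j in range(n) if labels[j] == c)
                for c in clusters if c != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / n
```

Values near 1 indicate compact, well-separated clusters, so the index can be maximized over candidate numbers of clusters k.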
- Research Article: 10.11591/ijeecs.v4.i3.pp617-628 (Dec 1, 2016), Indonesian Journal of Electrical Engineering and Computer Science
Clustering is one of the techniques in content mining and is used for grouping similar items. Clustering software datasets with mixed values is a major challenge in clustering applications. Previous work relied on unsupervised feature learning techniques such as k-means and c-means, which cannot process mixed types of data, and suffered from several drawbacks such as poor cluster tendency, poor partitioning, low accuracy, and low performance. To overcome these problems, the extended fuzzy adaptive resonance theory (EFART) was developed, which combines fuzzy ART with a traditional approach. This work deals with mixed types of data by applying unsupervised feature learning to achieve a sparse representation, making it easier for clustering algorithms to separate the data. The advantages of extended fuzzy adaptive resonance theory are high accuracy, high performance, good partitioning, and good cluster tendency. EFART adopts unsupervised feature learning, which helps to cluster large data sets such as the teaching assistant evaluation, iris, and wine datasets. The obtained clusters are formed based on the similarity of their attribute types and values.
- Research Article: 10.1155/2020/5143797 (Jul 25, 2020), Mathematical Problems in Engineering
The k-prototypes algorithm is a hybrid clustering algorithm that can process categorical and numerical data. In this study, the method of initial cluster center selection is improved and a new hybrid dissimilarity coefficient is proposed. Based on the proposed coefficient, a weighted k-prototypes clustering algorithm (WKPCA) is presented. The proposed WKPCA algorithm not only improves the selection of initial cluster centers but also introduces a new method to calculate the dissimilarity between data objects and cluster centers. Real datasets from the UCI repository were used to test the WKPCA algorithm. Experimental results show that the WKPCA algorithm is more efficient and robust than other k-prototypes algorithms.
- Research Article: 10.1007/s12046-018-0823-0 (Mar 1, 2018), Sādhanā
Clustering has been recognized as a very important approach for data analysis that partitions the data according to some (dis)similarity criterion. In recent years, the problem of clustering mixed-type data has attracted many researchers. The k-prototypes algorithm is well known for its scalability in this respect. In this paper, the limitations of the dissimilarity coefficient used in the k-prototypes algorithm are discussed with some illustrative examples. We propose a new hybrid dissimilarity coefficient for the k-prototypes algorithm, which can be applied to data with numerical, categorical, and mixed attributes. Besides retaining the scalability of the k-prototypes algorithm, in our method the dissimilarity functions for both attribute types are defined on the same scale with respect to their dimensionality, which is very beneficial for improving the quality of the clustering results. The efficacy of our method is shown by experiments on real and synthetic data sets.
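The scale issue this abstract addresses can be illustrated in a few lines: dividing each part of the dissimilarity by the number of attributes of its type keeps either part in a comparable range, so neither attribute type dominates purely by having more dimensions. The sketch below is an illustration of that idea under the stated assumption that numeric attributes are pre-scaled to [0, 1]; it is not the paper's exact coefficient:

```python
def scaled_mixed_dissimilarity(x_num, x_cat, y_num, y_cat):
    """Per-type-averaged dissimilarity: each part is divided by its
    own dimensionality so both parts live on the same [0, 1] scale.
    Assumes numeric attributes are pre-scaled to [0, 1]."""
    num_part = (sum(abs(a - b) for a, b in zip(x_num, y_num)) / len(x_num)
                if x_num else 0.0)
    cat_part = (sum(a != b for a, b in zip(x_cat, y_cat)) / len(x_cat)
                if x_cat else 0.0)
    # Equal weighting of the two parts is one simple choice; a
    # published coefficient may weight them differently.
    return (num_part + cat_part) / 2
```

Without the per-type division, a data set with 50 categorical and 2 numeric attributes would let the categorical mismatch count swamp the numeric term regardless of any weight.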
- Research Article: 10.1016/j.eswa.2008.06.100 (Jun 27, 2008), Expert Systems with Applications
Constraint-based clustering and its applications in construction management