Clustering heterogeneous categorical data using enhanced mini batch K-means with entropy distance measure

Nurshazwani Muhamad Mahfuz,Marina Yusoff,Zainura Idrus

doi:10.11591/ijece.v13i1.pp1048-1059

Nurshazwani Muhamad Mahfuz, Marina Yusoff + Show 1 more

Open Access

https://doi.org/10.11591/ijece.v13i1.pp1048-1059

Copy DOI

Abstract

Clustering methods in data mining aim to group a set of patterns based on their similarity. In a data survey, heterogeneous information is established with various types of data scales like nominal, ordinal, binary, and Likert scales. A lack of treatment of heterogeneous data and information leads to loss of information and scanty decision-making. Although many similarity measures have been established, solutions for heterogeneous data in clustering are still lacking. The recent entropy distance measure seems to provide good results for the heterogeneous categorical data. However, it requires many experiments and evaluations. This article presents a proposed framework for heterogeneous categorical data solution using a mini batch k-means with entropy measure (MBKEM) which is to investigate the effectiveness of similarity measure in clustering method using heterogeneous categorical data. Secondary data from a public survey was used. The findings demonstrate the proposed framework has improved the clustering’s quality. MBKEM outperformed other clustering algorithms with the accuracy at 0.88, v-measure (VM) at 0.82, adjusted rand index (ARI) at 0.87, and Fowlkes-Mallow’s index (FMI) at 0.94. It is observed that the average minimum elapsed time-varying for cluster generation, k at 0.26 s. In the future, the proposed solution would be beneficial for improving the quality of clustering for heterogeneous categorical data problems in many domains.

Full Text