Abstract

Clustering methods in data mining group a set of patterns based on their similarity. Survey data are often heterogeneous, combining several data scales such as nominal, ordinal, binary, and Likert scales. Failing to treat heterogeneous data appropriately leads to loss of information and poor decision-making. Although many similarity measures have been established, solutions for heterogeneous data in clustering are still lacking. The recent entropy distance measure appears to provide good results for heterogeneous categorical data, but it still requires extensive experimentation and evaluation. This article proposes a framework for heterogeneous categorical data that uses mini batch k-means with an entropy measure (MBKEM) to investigate the effectiveness of the similarity measure when clustering heterogeneous categorical data. Secondary data from a public survey were used. The findings demonstrate that the proposed framework improves clustering quality. MBKEM outperformed the other clustering algorithms, with an accuracy of 0.88, a v-measure (VM) of 0.82, an adjusted Rand index (ARI) of 0.87, and a Fowlkes-Mallows index (FMI) of 0.94. The average minimum elapsed time for cluster generation with varying k was 0.26 s. In the future, the proposed solution could help improve clustering quality for heterogeneous categorical data problems in many domains.
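
For readers unfamiliar with the external validation indices reported above, the following is a minimal sketch of how the adjusted Rand index and the Fowlkes-Mallows index can be computed from pair counts, using their standard textbook definitions. It is not the authors' code; the function names (`pair_counts`, `fowlkes_mallows`, `adjusted_rand`) and the toy label lists are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import comb, sqrt


def pair_counts(labels_true, labels_pred):
    """Count sample pairs that share a cluster in both, true, and predicted labelings."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    contingency = Counter(zip(labels_true, labels_pred))
    same_both = sum(comb(n, 2) for n in contingency.values())
    same_true = sum(comb(n, 2) for n in Counter(labels_true).values())
    same_pred = sum(comb(n, 2) for n in Counter(labels_pred).values())
    return same_both, same_true, same_pred


def fowlkes_mallows(labels_true, labels_pred):
    """FMI: geometric mean of pairwise precision and recall."""
    same_both, same_true, same_pred = pair_counts(labels_true, labels_pred)
    if same_true == 0 or same_pred == 0:
        return 0.0
    return same_both / sqrt(same_true * same_pred)


def adjusted_rand(labels_true, labels_pred):
    """ARI: pair agreement corrected for the agreement expected by chance."""
    same_both, same_true, same_pred = pair_counts(labels_true, labels_pred)
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: treated here as trivially identical
    expected = same_true * same_pred / total_pairs
    max_index = (same_true + same_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every sample in its own cluster
    return (same_both - expected) / (max_index - expected)


if __name__ == "__main__":
    # Hypothetical toy labels; cluster ids are arbitrary, only the grouping matters.
    truth = [0, 0, 1, 1]
    predicted = [1, 1, 0, 0]
    print(fowlkes_mallows(truth, predicted), adjusted_rand(truth, predicted))
```

Both indices depend only on which samples are grouped together, so relabeling the clusters does not change the score. Mature implementations are available in common libraries and are preferable for real evaluations.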

