Abstract

A mix of numerical and nominal data types is common in many modern data collections. Examples include banking data, sales history and healthcare records, where both continuous attributes such as age and nominal ones such as blood type are used to characterize account details, business transactions or individuals. However, only a few standard clustering techniques and consensus clustering methods have so far been provided for examining such data. Given this insight, the paper introduces novel extensions of the link-based cluster ensemble that are accurate for analyzing mixed-type data. They promote diversity within an ensemble by using different initializations of the k-prototypes algorithm as base clusterings, and then refine the summarized data using a link-based approach. Based on NMI (Normalized Mutual Information) averaged across different combinations of benchmark datasets and experimental settings, these new models reach an improved level of 0.34, while the best model found in the literature obtains only around 0.24. In addition, the parameter analysis included herein helps to enhance their performance even further, given the relations between clustering quality and the algorithmic variables specific to the underlying link-based models. Moreover, another significant factor, ensemble size, is examined in order to justify a tradeoff between complexity and accuracy.
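Since the reported scores are NMI values, a minimal sketch of how NMI between a clustering and a reference labelling can be computed may help in reading them. This excerpt does not state which normalization the paper uses; the sketch assumes the geometric-mean form (mutual information divided by the square root of the product of the two entropies), and the function names `entropy`, `mutual_information` and `nmi` are illustrative only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label vector."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two label vectors of equal length."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) )
        mi += (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
    return mi


def nmi(a, b):
    """NMI with geometric-mean normalization (an assumption of this sketch)."""
    if len(a) != len(b):
        raise ValueError("label vectors must have the same length")
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 or h_b == 0.0:
        # Convention used here: two trivial partitions agree perfectly,
        # a trivial partition shares no information with a non-trivial one.
        return 1.0 if h_a == h_b else 0.0
    return mutual_information(a, b) / math.sqrt(h_a * h_b)
```

Under this definition the score is invariant to relabelling of clusters, so a clustering is compared with the known classes by structure rather than by label names.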

Highlights

  • Cluster analysis has been widely used to explore the structure of a given dataset

  • LCEWCT and LCEWTQ are more effective than their baseline model (i.e., HBGF), while significantly improving the quality of data partitions acquired by the base clusterings, i.e., k-prototypes

  • Their performance levels are better than other cluster ensemble methods and standard clustering algorithms included in this evaluation
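As a rough illustration of the ensemble information these methods work with, the sketch below turns a set of base clusterings (for example, label vectors from several k-prototypes runs) into a binary cluster-association matrix, with one row per data point and one column per cluster of every base clustering. This is only an assumed starting representation of the kind used by bipartite-graph methods such as HBGF; the link-based refinement behind LCEWCT and LCEWTQ is not reproduced here, and `binary_association_matrix` is a hypothetical helper name.

```python
def binary_association_matrix(base_clusterings):
    """Build a binary point-by-cluster matrix from base clusterings.

    base_clusterings: list of label vectors, one per ensemble member,
    all of the same length. Returns (matrix, columns), where columns[j]
    is the (member index, cluster label) pair that column j stands for.
    """
    if not base_clusterings:
        raise ValueError("at least one base clustering is required")
    n = len(base_clusterings[0])
    columns = []
    for m, labels in enumerate(base_clusterings):
        if len(labels) != n:
            raise ValueError("all base clusterings must label the same points")
        # One column per distinct cluster found by ensemble member m.
        for c in sorted(set(labels)):
            columns.append((m, c))
    # Entry is 1 when point i belongs to that cluster in that member, else 0.
    matrix = [
        [1 if base_clusterings[m][i] == c else 0 for (m, c) in columns]
        for i in range(n)
    ]
    return matrix, columns


# Toy ensemble: two base clusterings of four data points.
toy_ensemble = [[0, 0, 1, 1], [0, 1, 1, 1]]
toy_matrix, toy_columns = binary_association_matrix(toy_ensemble)
```

Each row of the matrix sums to the ensemble size, because every point belongs to exactly one cluster in each base clustering.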

Summary

Introduction

Cluster analysis has been widely used to explore the structure of a given dataset. This analytical tool is usually employed in the initial stage of data interpretation, especially for a new problem where prior knowledge is limited. The goal of acquiring knowledge from data sources has been a major driving force, making cluster analysis one of the most active research subjects.

CMC, 2022, vol., no.1

Recent development of clustering approaches for cancer gene expression data has attracted a lot of interest among computer scientists and biological and clinical researchers [9,10].
