The DRk-M for Clustering Categorical Datasets With Uncertainty

Semeh Ben Salem, Sami Naouali, Zied Chtourou

doi:10.1109/mis.2020.3038837

Abstract

Categorical clustering has attracted much attention in recent years, since many real-world applications produce or consume categorical data. The k-modes algorithm was among the first developed for categorical clustering, using the notion of modes as cluster centroids. Its major drawback, however, is the random update of the modes in each iteration. This article proposes identifying the most adequate modes among a list of candidate ones in the mode update step of the process. The proposed algorithm, called density rough k-modes (DRk-M), uses the modes' density to characterize the distribution of their observations and rough set theory (RST) to deal with the uncertainty involved in this process. DRk-M was evaluated on UCI datasets and compared with many state-of-the-art methods such as k-modes (1998), Ng's method (2007), Cao's method (2012), and their variants. The results indicate an average performance improvement reaching 17% in some cases and in more than 25.5% of the total experiments, with an average improvement of more than 7% over these methods.
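To make the mode update issue concrete, the following is a minimal sketch of how ties arise when computing a cluster mode in standard k-modes. It assumes the usual simple matching dissimilarity and per-attribute majority vote; the function names (`matching_dissimilarity`, `candidate_modes`) and the toy records are hypothetical and introduced only for illustration. It does not implement the density or rough set selection of DRk-M, whose details are not given in the abstract.

```python
from collections import Counter
from itertools import product


def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def candidate_modes(cluster):
    """Return, per attribute, every value tied for the highest frequency."""
    candidates = []
    for column in zip(*cluster):
        counts = Counter(column)
        top = max(counts.values())
        candidates.append(sorted(v for v, c in counts.items() if c == top))
    return candidates


# Toy cluster (illustrative data, not taken from the paper).
cluster = [
    ("red", "small", "round"),
    ("red", "large", "square"),
    ("blue", "small", "round"),
    ("blue", "large", "round"),
]

cands = candidate_modes(cluster)
# Every combination of tied values is an equally frequent candidate mode.
all_modes = list(product(*cands))
```

Here the first two attributes are tied, so four candidate modes exist and a plain k-modes update would have to pick one arbitrarily. Choosing among such candidates in a principled way is the step the abstract says DRk-M addresses.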

Full Text