Uncertainty mode selection in categorical clustering using the rough set theory

Sami Naouali, Semeh Ben Salem, Zied Chtourou

doi:10.1016/j.eswa.2020.113555

Abstract

Clustering is an unsupervised machine learning technique widely used to arrange a set of observations into distinct groups called clusters. The problem of categorical clustering has attracted much attention since many real-world applications produce such data. The k-modes algorithm was among the first developed in this context. It uses the notion of modes to represent the centroids of the clusters. However, its major drawback lies in the random selection of the modes in each iteration of the clustering process. In this paper, we tackle this random selection issue and propose a new method that identifies the most adequate modes among a list of candidates. The proposed algorithm, called Density Rough k-modes (DRk-M), computes the density of each candidate mode to characterize the distribution of the observations around it, and then uses Rough Set Theory to deal with the uncertainty involved in this process. DRk-M was evaluated on real-world datasets extracted from the UCI (University of California, Irvine) Machine Learning Repository, the Global Terrorism Database (GTD), and a set of scraped tweets. It was compared with several state-of-the-art methods, including k-modes (1998), Ng’s method (2007), Cao’s method (2012), and Bai’s technique (2014), and showed great efficiency.
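To make the mode selection problem concrete, the following Python sketch illustrates how several candidate modes can arise in a cluster and how a density score could rank them. It is an illustration under stated assumptions, not the DRk-M algorithm itself: the function names (`candidate_modes`, `density`, `select_mode`), the density formula (average simple-matching similarity over all records), and the toy data are hypothetical, and the Rough Set Theory step is omitted entirely.

```python
from collections import Counter
from itertools import product


def matching_dissimilarity(a, b):
    """Simple matching: number of attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b))


def candidate_modes(cluster):
    """List every mode the cluster admits.

    For each attribute, keep all values tied for the highest frequency;
    the candidates are all combinations of those values. With no ties
    there is exactly one candidate.
    """
    per_attribute = []
    for column in zip(*cluster):
        counts = Counter(column)
        top = max(counts.values())
        per_attribute.append(sorted(v for v, c in counts.items() if c == top))
    return [tuple(m) for m in product(*per_attribute)]


def density(candidate, records):
    """Assumed density: average matching similarity of a candidate to the records."""
    m = len(candidate)
    total = sum(1 - matching_dissimilarity(candidate, r) / m for r in records)
    return total / len(records)


def select_mode(cluster, records):
    """Pick the candidate mode with the highest assumed density instead of a random one."""
    return max(candidate_modes(cluster), key=lambda c: density(c, records))


# Toy data (hypothetical): "red" and "blue" tie within the cluster.
cluster = [("red", "small"), ("blue", "small")]
records = cluster + [("red", "large")]
best = select_mode(cluster, records)
```

In this sketch the density is computed over all records rather than over the cluster alone, because candidates that differ only in tied values score identically within their own cluster. That choice is an assumption made for the illustration.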

Full Text