An automated robust algorithm for clustering multivariate data

Gajendra K Vishwakarma, Chinmoy Paul, Ali S Hadi, A.M Elsawah

doi:10.1016/j.cam.2023.115219

Abstract

Clustering analysis is widely used in applications such as marketing, biology, medical science, finance, data mining, image processing, data analysis and pattern recognition. For instance, clustering can be used to characterize customer groups based on their purchasing patterns by discovering distinct groups in the customer base; to derive plant and animal taxonomies, categorize genes with similar functionalities and gain insight into structures inherent to populations; to identify cancer cells; and to detect credit card fraud. The k-means, hierarchical and self-organizing (Kohonen) map algorithms are widely used for clustering. Practice has shown that these algorithms have significant limitations and drawbacks. This manuscript presents an automated robust algorithm for clustering multivariate data without prior information about the number of clusters. Robust estimates of the location and covariance matrix are used to define the Mahalanobis distance and the corresponding radius of the clustering algorithm. The algorithm is designed to control both masking and swamping effects, and it automatically divides a given data set into a number of clusters. Some properties of the algorithm are demonstrated, which help in finding clusters that accommodate observations with large deviations. A method that avoids the use of a fixed cutoff for determining outliers is discussed. The performance of the proposed algorithm is compared with that of existing clustering algorithms and robust multiple outlier detection methods.
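
To make the idea of a Mahalanobis distance built on robust estimates concrete, the following is a minimal sketch. It does not reproduce the estimator or the algorithm proposed in the manuscript. The function name `robust_mahalanobis_2d`, the restriction to two dimensions, and the crude "keep the closest half" trimming step are all assumptions chosen for illustration only.

```python
from statistics import median


def robust_mahalanobis_2d(points):
    """Mahalanobis distances of 2-D points from a crude robust centre.

    Illustrative assumption: location and covariance are estimated from
    the half of the data closest to the coordinate-wise median, so a few
    extreme observations cannot distort them. This is NOT the estimator
    used in the paper.
    """
    n = len(points)
    # Step 1: initial robust centre = coordinate-wise median.
    cx = median(p[0] for p in points)
    cy = median(p[1] for p in points)
    # Step 2: keep roughly the half of the points nearest that centre.
    ranked = sorted(points, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)
    core = ranked[: n // 2 + 1]
    m = len(core)
    # Step 3: mean and sample covariance of the core subset only.
    mx = sum(p[0] for p in core) / m
    my = sum(p[1] for p in core) / m
    sxx = sum((p[0] - mx) ** 2 for p in core) / (m - 1)
    syy = sum((p[1] - my) ** 2 for p in core) / (m - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in core) / (m - 1)
    det = sxx * syy - sxy ** 2  # assumed non-zero (core not collinear)
    # Step 4: distance using the closed-form inverse of the 2x2 covariance.
    dists = []
    for x, y in points:
        dx, dy = x - mx, y - my
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        dists.append(d2 ** 0.5)
    return dists


# Hypothetical toy data: seven points near the origin and one far away.
points = [(0, 0), (1, 0.5), (0.5, 1), (-1, -0.5),
          (-0.5, -1), (0.2, -0.3), (-0.3, 0.2), (10, 10)]
dists = robust_mahalanobis_2d(points)
for p, d in zip(points, dists):
    print(p, round(d, 2))
```

In this toy data the distant point `(10, 10)` receives by far the largest distance, because it is excluded from the subset used to estimate the centre and the covariance. With classical (non-robust) estimates, the same point would pull the mean toward itself and inflate the covariance, which is one way masking can arise.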

Full Text