Determining the number of clusters a priori is one of the most challenging aspects of unsupervised learning. Clustering Internal Validity Indices (CIVIs) evaluate the partitions produced by unsupervised algorithms using metrics such as compactness, separation, and density. However, CIVIs have been designed for specific applications, and no general CIVI works in all scenarios. The absence of CIVIs based on crisp uncertainty metrics is especially critical in decision-making processes that involve ambiguity, non-convex distributions, outliers, and overlapping data. To address this problem, we propose a novel Uncertainty Fréchet (UF) CIVI that assesses the certainty of a well-defined partition. UF combines uncertainty fingerprints based on Type-2 fuzzy Gaussian Mixture Models (T2FGMM) with the Fréchet distance between clusters to define a metric of partition quality. We integrate UF into a merging methodology that combines similar clusters within a partition, which lets us determine the number of clusters without running the clustering algorithm iteratively, as other CIVIs require. We evaluate our proposal on 5,250 convex and 36 non-convex synthetic datasets and on five real benchmark datasets. In addition, we apply UF in a real-world scenario that involves high uncertainty: Passive Acoustic Monitoring (PAM) of ecosystems, which aims to study ecological transformations through acoustic recordings. UF performs well in both synthetic and real-world scenarios, obtaining an Adjusted Mutual Information (AMI) score higher than 0.88 on datasets drawn from normal, uniform, gamma, and triangular distributions. In the PAM application, clustering algorithms combined with UF identify the transformation of ecosystems through sound, achieving an F1 score of 0.84. These results indicate that the UF index is a suitable tool for researchers and practitioners working with highly uncertain data.
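To make the Fréchet distance between clusters concrete, the sketch below computes the closed-form Fréchet distance between two Gaussian components. It is an illustration only: the abstract does not state which form of the distance UF uses, so the restriction to diagonal covariances, the function name `frechet_distance_diag`, and its parameters are assumptions and not the authors' implementation. The T2FGMM uncertainty fingerprints are not modeled here.

```python
import math


def frechet_distance_diag(mean_a, var_a, mean_b, var_b):
    """Fréchet distance between two Gaussians with diagonal covariances.

    Assumption (not from the paper): each cluster is summarized by a mean
    vector and a vector of per-dimension variances. With diagonal
    covariances the general formula
        d^2 = ||m_a - m_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2))
    reduces to a sum over dimensions.
    """
    # Squared Euclidean distance between the two means.
    mean_term = sum((ma - mb) ** 2 for ma, mb in zip(mean_a, mean_b))
    # Trace term: squared difference of per-dimension standard deviations.
    cov_term = sum(
        (math.sqrt(va) - math.sqrt(vb)) ** 2 for va, vb in zip(var_a, var_b)
    )
    return math.sqrt(mean_term + cov_term)


# Two clusters with identical spread whose means are 5 units apart.
distance = frechet_distance_diag([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0])
print(distance)  # 5.0
```

Under this reading, a small distance between two components would mark them as candidates for merging; the actual merging criterion is defined in the paper and is not reproduced here.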