A comparative study of validity indices on estimating the optimal number of clusters

Aikaterini Karanikola, Sotiris Kotsiantis, Charalampos M. Liapis

doi:10.1109/iisa52424.2021.9555497

Abstract

In clustering, finding the optimal number of clusters is usually one of the most crucial steps in the whole partitioning process. The decision about the optimal number of clusters, however, is not easy to make. In addition, the term “optimal” is rather vague. In general, the optimal number of clusters depends directly on the method used to measure similarity and on the parameter selection of the partitioning method. Moreover, certain inherent characteristics of the datasets, such as overlapping clusters or clusters that contain subclusters, often make the task more difficult. Given the above, different kinds of indicators have been proposed over the years to estimate the optimal number of clusters in each distinct clustering case. In this study, a large number of such indicators, called validity indices, based on the so-called relative criteria approach, are examined comparatively. Specifically, a total of 26 validity indices are examined in two separate case studies: one on real-world data and one on artificially generated data. Every index is applied under 9 different clustering methods, which together incorporate a total of 5 different distance metrics. The results are presented in various explanatory forms.
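As a rough illustration of the relative-criteria idea described above, the following sketch clusters a small synthetic dataset for several candidate values of k, scores each resulting partition with a single validity index, and keeps the k with the best score. Everything in it is an assumption made for illustration: the simple k-means routine, the silhouette index, the Euclidean distance, the candidate range 2 to 6, the synthetic data, and all function names are stand-ins, not the methods, indices, or datasets used in the study.

```python
import math
import random


def kmeans(points, k, iters=50):
    """Minimal k-means (illustrative stand-in for a clustering method)."""
    # Deterministic farthest-first initialisation: start from the first
    # point, then repeatedly add the point farthest from the chosen centres.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centre.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        # Update step: each centre moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # keep an empty cluster's centre
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def silhouette(points, labels):
    """Mean silhouette width (one example of a validity index; higher is better)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        total += (b - a) / max(a, b)
    return total / len(points)


# Synthetic data (an assumption): three well-separated 2-D Gaussian blobs.
rng = random.Random(0)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
data = [
    (rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
    for cx, cy in blob_centers
    for _ in range(30)
]

# Relative criteria: same method, varying k, compared through the index.
scores = {k: silhouette(data, kmeans(data, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
print("index value per k:", scores)
print("estimated number of clusters:", best_k)
```

Swapping in a different index, clustering method, or distance function changes only the corresponding piece of this loop, and different choices can favour different values of k on the same data.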

Full Text