Partitional clustering is one of the most relevant unsupervised learning and pattern recognition techniques. Unfortunately, one of the main drawbacks of these methodologies is that the number of clusters is generally assumed to be known beforehand, and automating its selection is not straightforward. By the same token, internal validity measures, such as the Silhouette index and the Davies-Bouldin and Caliński-Harabasz measures, have emerged as the standard techniques for comparing the goodness of clustering results obtained via different clustering methods. These measures take into consideration both inter- and intra-cluster similarities and can be adapted to different metrics. Unfortunately, their use has been hindered by their large computational complexity, which is commonly quadratic with respect to the number of instances in the data set. In this work, we show that the time complexity of computing the most popular internal validity measures can be drastically reduced by making use of the within-cluster errors and different properties of the Bregman divergences. This contribution ultimately allows us to massively speed up the selection of an adequate number of clusters for a given data set, as verified through extensive empirical comparisons.
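To give a flavor of how within-cluster errors can replace quadratic pairwise computations, the sketch below verifies a standard identity for squared Euclidean distance (the Bregman divergence generated by the squared norm): the sum of all pairwise squared distances inside a cluster equals twice the cluster size times its within-cluster error (SSE). This is only an illustrative example of the kind of property the abstract alludes to, not a reproduction of the paper's actual derivations; all names and data here are hypothetical.

```python
import numpy as np

# Illustrative check: for squared Euclidean distance, the O(n^2) sum of
# pairwise distances within a cluster can be recovered in O(n) from the
# within-cluster error (SSE). Synthetic data; purely a demonstration.
rng = np.random.default_rng(0)
cluster = rng.normal(size=(500, 3))  # one hypothetical cluster in R^3

# Naive O(n^2) computation: sum over all ordered pairs (x, y).
diffs = cluster[:, None, :] - cluster[None, :, :]
pairwise_sum = np.sum(diffs ** 2)

# O(n) alternative via the identity
#   sum_{x, y in C} ||x - y||^2 = 2 * |C| * SSE(C),
# where SSE(C) = sum_{x in C} ||x - mu_C||^2.
centroid = cluster.mean(axis=0)
sse = np.sum((cluster - centroid) ** 2)
fast_sum = 2 * len(cluster) * sse

assert np.isclose(pairwise_sum, fast_sum)  # agree up to float error
```

Because measures such as Calinski-Harabasz are built from exactly these aggregate within- and between-cluster quantities, identities of this kind are what allow the quadratic pairwise step to be bypassed.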