Most clustering methods require not only the number of clusters to be set in advance but also various hyperparameters, such as initial centroids, the number of nearest neighbours, the minimum number of points, the neighbourhood radius, and the cutoff distance. Although clustering is one of the most promising unsupervised learning approaches in machine intelligence, existing clustering methods cannot simultaneously handle datasets with arbitrary shapes, different densities, distinct sizes, and overlapping clusters. Background outliers and high dimensionality make clustering problems more challenging still. In this paper, we propose a novel universal clustering methodology, called G2-SCANN, which yields the best clustering performance on all 30 synthetic and real datasets without any hyperparameter tuning, provided that the exact number of clusters is known. Firstly, the shortest path length (SPL) in a complex network, or graph-based geodesic distance, is used to give a locally backbone-structured description of graph vertex similarity. Accordingly, the SPL-weighted local degree (SLD) is defined as a vertex attribute of an SPL-weighted graph expressed by the G2-SPL adjacency matrix with ε-natural neighbourhood. Secondly, calculating the SLD for every data point in a bottom-up way directly divides the complete graph constituted by all data points into a group of SLD trees. This provides interpretability and allows lone trees to be eliminated. Thirdly, contrastive learning of the largest SLD values is conducted to find the root vertex of each divisive tree, and a top-down category message is then transmitted from each root vertex to all the leaf vertices of its SLD tree, which eventually produces tree-like clusters. Overall, the proposed G2-SCANN method leverages both the local neighbouring similarity of data points and global information about the data distribution, which allows it to perform better than other methods.
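To illustrate the idea of shortest path length as a graph-based geodesic distance, the following is a minimal sketch and not the G2-SCANN implementation. It assumes a plain k-nearest-neighbour graph with Euclidean edge weights in place of the ε-natural neighbourhood named in the abstract, and the helper names `knn_graph` and `shortest_path_lengths` are hypothetical. It shows only how a geodesic distance can follow the shape of the data where the straight-line distance does not.

```python
import heapq
import math


def knn_graph(points, k):
    """Build a symmetric k-nearest-neighbour graph (an assumption made here;
    the abstract uses an epsilon-natural neighbourhood instead)."""
    n = len(points)
    adj = {i: {} for i in range(n)}
    for i in range(n):
        # Sort all other points by Euclidean distance to point i.
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            adj[i][j] = d
            adj[j][i] = d  # keep the graph undirected
    return adj


def shortest_path_lengths(adj, source):
    """Dijkstra's algorithm: shortest path length from source to every vertex."""
    dist = {v: math.inf for v in adj}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Seven points laid out in a U shape; the two ends are close in straight-line
# distance but far apart when travelling along the data.
points = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0), (1.0, 2.0),
          (2.0, 2.0), (2.0, 1.0), (2.0, 0.0)]
adj = knn_graph(points, k=1)
spl = shortest_path_lengths(adj, source=0)

euclidean = math.dist(points[0], points[6])
geodesic = spl[6]
print(f"Euclidean: {euclidean}, geodesic (SPL): {geodesic}")
```

In this toy layout the two ends of the U are 2 units apart in Euclidean distance but 6 units apart along the graph, which is the kind of shape-aware similarity that an SPL-based measure is meant to capture.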