Hypergraph-Clustering Method Based on an Improved Apriori Algorithm

Feng Wang,Feng Hu,Libing Bai,Rumeng Chen

doi:10.3390/app131910577

Abstract

With the complexity and variability of data structures and dimensions, traditional clustering algorithms face various challenges. The integration of network science and clustering has become a popular field of exploration. One of the main challenges is how to handle large-scale and complex high-dimensional data effectively. Hypergraphs can accurately represent multidimensional heterogeneous data, making them important for improving clustering performance. In this paper, we propose a hypergraph-clustering method dubbed the “high-dimensional data clustering method” based on hypergraph partitioning using an improved Apriori algorithm (HDHPA). First, the method constructs a hypergraph based on the improved Apriori association rule algorithm, where frequent itemsets existing in high-dimensional data are treated as hyperedges. Then, different frequent itemsets are mined in parallel to obtain hyperedges with corresponding ranks, avoiding the generation of redundant rules and improving mining efficiency. Next, we use the dense subgraph partition (DSP) algorithm to divide the hypergraph into multiple subclusters. Finally, we merge the subclusters through dense sub-hypergraphs to obtain the clustering results. The advantage of this method lies in its use of the hypergraph model to discretize the association between data in space, which further enhances the effectiveness and accuracy of clustering. We comprehensively compare the proposed HDHPA method with several advanced hypergraph-clustering methods using seven different types of high-dimensional datasets and then compare their running times. The results show that the clustering evaluation index values of the HDHPA method are generally superior to all other methods. The maximum ARI value can reach 0.834, an increase of 42%, and the average running time is lower than other methods. All in all, HDHPA exhibits an excellent comparable performance on multiple real networks. The research results of this paper provide an effective solution for processing and analyzing large-scale network datasets and are also conducive to broadening the application range of clustering techniques.

Full Text