Canopy-MMD Text Clustering Algorithm Based on Simulated Annealing and Canopy Optimization

Jun-Wu Zhai, Yu-Chen Tian, Kun Liang, Wen-Tao Li

doi:10.53106/199115992023023401006

Abstract

To address two weaknesses of traditional K-means text clustering, namely that it cannot determine the number of clusters automatically and that it is sensitive to the initial cluster centers, this paper proposes a Canopy-MMD text clustering algorithm based on simulated annealing and silhouette coefficient optimization. The algorithm combines simulated annealing with the silhouette coefficient to optimize the Canopy algorithm and find the optimal number of clusters, then uses that number to determine the scale coefficient in the MMD algorithm, which yields a better text clustering result. Experiments on the Sohu News dataset from Sogou Lab compare the proposed algorithm with traditional K-means and with algorithms from the literature. The results show that the proposed algorithm outperforms both: relative to traditional K-means, accuracy, precision, recall and F value improve by 8.02%, 8.91%, 8.02% and 9.51%, respectively. The algorithm is applicable to fields such as text mining, knowledge graphs and natural language processing.
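The two building blocks named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it omits the simulated annealing search and the silhouette coefficient, and it passes the cluster count to the max-min distance step directly instead of deriving a scale coefficient from it. The function names `canopy_count` and `max_min_centers`, the single threshold `t2`, and the use of Euclidean distance on toy 2-D points are all assumptions made for illustration.

```python
import math


def canopy_count(points, t2):
    """Estimate a cluster count with a simplified, single-threshold Canopy pass.

    Assumption: the first remaining point becomes a canopy center and every
    remaining point within distance t2 of it is removed from the pool.
    """
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining[0]
        centers.append(center)
        # Keep only points strictly farther than t2 from the new center.
        remaining = [p for p in remaining if math.dist(p, center) > t2]
    return len(centers)


def max_min_centers(points, k):
    """Pick k initial centers by the max-min distance rule.

    Start from the first point, then repeatedly add the point whose distance
    to its nearest already-chosen center is largest.
    """
    centers = [points[0]]
    while len(centers) < k:
        farthest = max(
            points,
            key=lambda p: min(math.dist(p, c) for c in centers),
        )
        centers.append(farthest)
    return centers


if __name__ == "__main__":
    # Toy data: three visually separated groups of 2-D points.
    points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0), (21, 0)]
    k = canopy_count(points, t2=3.0)
    print(k, max_min_centers(points, k))
```

In the paper's setting the threshold would be the quantity tuned by simulated annealing, with the silhouette coefficient as the objective. Here `t2` is fixed by hand to keep the sketch deterministic.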
