Clustering is a fundamental machine learning task, which aim at assigning instances into groups so that similar samples belong to the same cluster while dissimilar samples belong to different clusters. Shallow clustering methods usually assume that data are collected and expressed as feature vectors within which clustering is performed. However, clustering high-dimensional data, such as images, texts, videos, and graphs, poses significant challenges for clustering tasks, such as indiscriminate representation and intricate relationships among instances. Over the past decades, deep learning has achieved remarkable success in effective representation learning and modeling complex relationships. Motivated by these advancements, Deep Clustering seeks to improve clustering outcomes through deep learning techniques, garnering considerable interest from both academia and industry. Despite many contributions to this vibrant area of research, the lack of systematic analysis and a comprehensive taxonomy has hindered progress in this field. In this survey, we first explore how deep learning can be integrated into deep clustering and identify two fundamental components: the representation learning module and the clustering module. Then, we summarize and analyze the representative design of these two modules. Furthermore, we introduce a novel taxonomy of deep clustering based on how these two modules interact, specifically through multistage, generative, iterative, and simultaneous approaches. In addition, we present well-known benchmark datasets, evaluation metrics, and open-source tools to clearly demonstrate different experimental approaches. Finally, we examine the practical applications of deep clustering and propose challenging areas for future research.
Read full abstract