Multi-modal clustering is a formidable challenge in unsupervised learning. Its objective is to group data collected from diverse modalities, such as audio, visual, and textual sources, into distinct clusters. Multi-modal clustering techniques extract features shared across modalities in an unsupervised manner, exploiting the fact that the features describing the same real-world object are highly correlated across modalities. Capturing these cross-modal correlations is vital for improving clustering accuracy in multi-modal settings. This survey explores Deep Learning (DL) techniques applied to multi-modal clustering, encompassing methodologies such as Convolutional Neural Networks (CNN), Autoencoders (AE), Recurrent Neural Networks (RNN), and Graph Convolutional Networks (GCN). Notably, this survey represents the first attempt to investigate DL techniques specifically for multi-modal clustering. It presents a novel taxonomy for DL-based multi-modal clustering, conducts a comparative analysis of existing multi-modal clustering approaches, and discusses the datasets employed in their evaluation. Additionally, it identifies research gaps in multi-modal clustering and offers insights into promising avenues for future research.
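To make the general pipeline sketched above concrete (unsupervised extraction of per-modality features, followed by clustering of a fused representation), the following minimal Python sketch trains two modality-specific autoencoders on synthetic unlabeled data and applies k-means to their concatenated latent codes. The network sizes, the synthetic two-view data, and the concatenation-based fusion are illustrative assumptions, not a specific method surveyed here.

```python
# Minimal illustrative sketch of DL-based multi-modal clustering:
# one autoencoder per modality learns a latent code without labels,
# and k-means clusters the concatenated codes. All dimensions and the
# simple concatenation fusion are assumptions made for this example.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class ModalityAE(nn.Module):
    """Per-modality autoencoder; the bottleneck serves as the extracted feature."""
    def __init__(self, in_dim, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, in_dim))
    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)

# Toy unlabeled data: 500 samples with a 100-d "visual" and a 50-d "textual" view.
x_vis, x_txt = torch.randn(500, 100), torch.randn(500, 50)
ae_vis, ae_txt = ModalityAE(100), ModalityAE(50)
opt = torch.optim.Adam(list(ae_vis.parameters()) + list(ae_txt.parameters()), lr=1e-3)
loss_fn = nn.MSELoss()

for _ in range(100):  # fully unsupervised: reconstruction loss only
    z_v, rec_v = ae_vis(x_vis)
    z_t, rec_t = ae_txt(x_txt)
    loss = loss_fn(rec_v, x_vis) + loss_fn(rec_t, x_txt)
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    # Fuse the two latent representations by concatenation and cluster them.
    z = torch.cat([ae_vis(x_vis)[0], ae_txt(x_txt)[0]], dim=1)
labels = KMeans(n_clusters=5, n_init=10).fit_predict(z.numpy())
```

The methods surveyed in later sections replace the independent per-modality training and post-hoc k-means of this sketch with joint objectives that explicitly model cross-modal correlations.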