Manual analysis cannot keep pace with the rising demand for retinal optical coherence tomography (OCT) imaging, so we implemented a self-supervised learning-based clustering model. We used a public dataset of 83,484 OCT images labelled as choroidal neovascularization (CNV), diabetic macular edema (DME), drusen, or normal fundus. The Semantic Pseudo Labeling for Image Clustering (SPICE) framework, a self-supervised learning-based method, was employed to cluster unlabeled OCT images into two and into four categories, and its performance was compared with that of baseline models. We also analysed the feature distribution using t-SNE and examined the cluster centers, attention maps, and misclassified images. In addition, the DME and CNV subsets were each clustered into two groups, and the results were interpreted by two retinal specialists. SPICE outperformed the baseline models in both the binary and the four-category clustering tasks, achieving accuracies of 0.886 and 0.846, respectively. In the t-SNE analysis, the four image types formed distinct clusters. The cluster centers corresponded to the human labels, and the attention maps showed that the model focused on important biomarkers. Misclassified images shared features with the classes to which they were incorrectly assigned. The model also split the DME subset and the CNV subset into two distinct groups each. Self-supervised clustering effectively distinguished between diseases and revealed their common features, with a notable capability to detect heterogeneity within a disease through biomarkers.
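The accuracies reported for a clustering model depend on how arbitrary cluster indices are matched to human labels. The text does not state how this matching was done; the sketch below assumes the common convention of choosing the one-to-one cluster-to-class assignment that maximizes agreement. The function name `clustering_accuracy` and the toy labels are hypothetical and serve only as illustration. A brute-force search over permutations is used here because it is practical for two or four classes; it is not necessarily what the authors used.

```python
from itertools import permutations


def clustering_accuracy(true_labels, cluster_ids, n_classes):
    """Best-case accuracy over all one-to-one cluster-to-class mappings.

    Assumes labels and cluster ids are integers in range(n_classes).
    Brute force is fine for small n_classes (4 classes -> 24 mappings).
    """
    # counts[c][t] = number of samples in cluster c whose true class is t
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, c in zip(true_labels, cluster_ids):
        counts[c][t] += 1

    best_matched = 0
    for mapping in permutations(range(n_classes)):
        # mapping[c] is the class assigned to cluster c
        matched = sum(counts[c][mapping[c]] for c in range(n_classes))
        best_matched = max(best_matched, matched)
    return best_matched / len(true_labels)


# Toy example with hypothetical labels: 0=CNV, 1=DME, 2=drusen, 3=normal.
true_labels = [0, 0, 1, 1, 2, 2, 3, 3]
cluster_ids = [1, 1, 0, 0, 3, 3, 2, 0]  # cluster indices are arbitrary
acc = clustering_accuracy(true_labels, cluster_ids, n_classes=4)
print(acc)  # 0.875: seven of eight samples agree under the best mapping
```

For larger numbers of clusters, the same optimum is usually found with the Hungarian algorithm rather than by enumerating permutations.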