Unsupervised cross-modal hashing faces two notable issues. First, the inter- and intra-modal similarity matrices in the original and Hamming spaces lack sufficient neighborhood information and semantic consistency. Second, relying solely on the reconstruction of instance-level similarity matrices fails to capture the global intrinsic correlation and manifold structure of the training samples. We propose a novel method that combines multi-similarity reconstruction with clustering-based contrastive hashing. First, we construct multi-similarity matrices from the image features, text features, and joint-semantic features in the original space, along with their corresponding hash-code similarity matrices in the Hamming space, to enhance the semantic consistency of the inter- and intra-modal reconstructions. Second, clustering-based contrastive hashing is proposed to capture the global intrinsic correlation and manifold structure of the image-text pairs. Extensive experimental results on the Wiki, NUS-WIDE, MIRFlickr-25K, and MS-COCO datasets demonstrate the promising cross-modal retrieval performance of the proposed method.