Image-based 3D shape retrieval (IBSR) is a cross-modal matching task which searches similar shapes from a 3D repository using a natural image. Continuous attention has been paid to this topic, such as joint embedding, adversarial learning, and contrastive learning. Modality gap and diversity of instance similarities are two obstacles for accurate and fine-grained cross-modal matching. To overcome the two obstacles, we propose a style-mixed contrastive learning method (SC-IBSR). On one hand, we propose a style transition module to mix the styles of images and rendered shape views to form an intermediate style and inject it into image contents. The obtained style-mixed image features serve as a bridge for later contrastive learning in order to alleviate the modality gap. On the other hand, the proposed strategy of fine-grained consistency constraint aims at cross-domain contrast and considers the different importance of negative (positive) samples. Extensive experiments demonstrate the superiority of the style-mixed cross-modal contrastive learning on both the instance-level retrieval benchmark (i.e., Pix3D, Stanford Cars, and Comp Cars that annotate shapes to images), and the unsupervised category-level retrieval benchmark (i.e., MI3DOR-1 and MI3DOR-2 with unlabeled 3D shapes). Moreover, experiments are conducted on Office-31 dataset to validate the generalization capability of our method. Code and pretrained models will be available at https://github.com/honoria0204/SC-IBSR .
Read full abstract