We present a novel and effective joint embedding approach for retrieving the 3D shape most similar to a single image query. Our approach builds upon hybrid 3D representations, namely the octree-based representation and the multi-view image representation, which characterize shape geometry in different ways. We first pre-train a 3D feature space by jointly embedding 3D shapes under both representations, and then introduce a transform layer and an image encoder that map shape codes and real images into a common space via a second joint embedding. The pre-training benefits from the hybrid representation of 3D shapes and builds a more discriminative 3D shape space than using either representation alone. The transform layer helps to bridge the gap between the 3D shape space and the real image space. We validate the efficacy of our method on the instance-level single-image 3D retrieval task and achieve significant improvements over existing methods.
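To make the second-stage joint embedding concrete, the sketch below illustrates one way such a setup could look. It is not the authors' implementation: the module sizes, the ResNet-18 image backbone, and the triplet loss are all illustrative assumptions; only the roles of the transform layer (mapping pre-trained shape codes into a common space) and the image encoder (mapping real images into the same space) follow the description above.

```python
# Minimal sketch of a second-stage joint embedding (assumptions: 512-d shape
# codes, 256-d common space, ResNet-18 backbone, triplet loss).
import torch
import torch.nn as nn
import torchvision.models as models


class TransformLayer(nn.Module):
    """Maps pre-trained 3D shape codes into the common embedding space."""
    def __init__(self, shape_dim=512, embed_dim=256):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(shape_dim, embed_dim),
            nn.ReLU(inplace=True),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, shape_code):
        return nn.functional.normalize(self.fc(shape_code), dim=-1)


class ImageEncoder(nn.Module):
    """Maps a real RGB image into the same common embedding space."""
    def __init__(self, embed_dim=256):
        super().__init__()
        backbone = models.resnet18(weights=None)  # backbone choice is an assumption
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.backbone = backbone

    def forward(self, image):
        return nn.functional.normalize(self.backbone(image), dim=-1)


# Toy training step: pull an image embedding toward the code of its matching
# shape and push it away from a non-matching one.
transform, img_enc = TransformLayer(), ImageEncoder()
criterion = nn.TripletMarginLoss(margin=0.2)

images = torch.randn(4, 3, 224, 224)   # batch of query images
pos_codes = torch.randn(4, 512)        # pre-trained codes of matching shapes
neg_codes = torch.randn(4, 512)        # codes of non-matching shapes

loss = criterion(img_enc(images), transform(pos_codes), transform(neg_codes))
loss.backward()
```

At retrieval time, under the same assumptions, a query image would be embedded once and compared against the transformed shape codes by cosine similarity (the embeddings are L2-normalized), returning the nearest shape as the instance-level match.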