Abstract

To address 3D object retrieval, substantial efforts have been made to generate highly discriminative descriptors for 3D objects represented by a single modality, such as voxels, point clouds, or multi-view images. It is promising to leverage complementary information from multimodal representations of 3D objects to further improve retrieval performance. However, multimodal 3D object retrieval has rarely been developed or analyzed on large-scale datasets. In this paper, we propose a self-and-cross-attention-based aggregation of point clouds and multi-view images (SCA-PVNet) for 3D object retrieval. With deep features extracted from point clouds and multi-view images, we design two types of feature aggregation modules, namely the in-modality aggregation module (IMAM) and the cross-modality aggregation module (CMAM), for effective feature fusion. IMAM leverages a self-attention mechanism to aggregate multi-view features, whereas CMAM exploits a cross-attention mechanism to enable interaction between the point-cloud and multi-view features. The final descriptor of a 3D object for retrieval is obtained by concatenating the aggregated feature outputs of both modules. Extensive experiments and analyses were conducted on four datasets, ranging from small to large scale, to demonstrate the superiority of the proposed SCA-PVNet over state-of-the-art methods. In addition to achieving state-of-the-art retrieval performance, our method is more robust in challenging scenarios where views or points are missing during inference.
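To illustrate the fusion strategy described above, the sketch below shows how self-attention could aggregate per-view features (IMAM), how cross-attention could let a point-cloud feature attend to the view features (CMAM), and how the two aggregated outputs could be concatenated into the final retrieval descriptor. This is a minimal PyTorch sketch, not the authors' implementation: the module names, feature dimension, number of views, attention configuration, and pooling choices are all assumptions made for illustration.

```python
# Minimal sketch of the fusion described in the abstract; NOT the authors' code.
# All dimensions, head counts, and pooling choices are illustrative assumptions.
import torch
import torch.nn as nn


class IMAMSketch(nn.Module):
    """In-modality aggregation: self-attention over per-view image features."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, view_feats):               # view_feats: (B, V, dim)
        out, _ = self.attn(view_feats, view_feats, view_feats)
        return out.mean(dim=1)                   # (B, dim) aggregated multi-view feature


class CMAMSketch(nn.Module):
    """Cross-modality aggregation: point-cloud feature attends to view features."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, point_feat, view_feats):   # (B, dim), (B, V, dim)
        query = point_feat.unsqueeze(1)          # use the point-cloud feature as the query
        out, _ = self.attn(query, view_feats, view_feats)
        return out.squeeze(1)                    # (B, dim) cross-modal feature


class SCAPVNetSketch(nn.Module):
    """Concatenate both aggregated features into the final object descriptor."""

    def __init__(self, dim=512):
        super().__init__()
        self.imam = IMAMSketch(dim)
        self.cmam = CMAMSketch(dim)

    def forward(self, point_feat, view_feats):
        in_modal = self.imam(view_feats)
        cross_modal = self.cmam(point_feat, view_feats)
        return torch.cat([in_modal, cross_modal], dim=-1)   # (B, 2*dim) descriptor


if __name__ == "__main__":
    B, V, D = 2, 12, 512                         # assumed: 12 rendered views, 512-d features
    point_feat = torch.randn(B, D)               # e.g. output of a point-cloud backbone
    view_feats = torch.randn(B, V, D)            # e.g. per-view outputs of an image backbone
    descriptor = SCAPVNetSketch(D)(point_feat, view_feats)
    print(descriptor.shape)                      # torch.Size([2, 1024])
```

In this sketch the descriptors would then be compared with a standard similarity measure (e.g. cosine distance) for retrieval; the backbones that produce the point-cloud and per-view features are omitted.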
