Abstract

As an important solution for 3D shape retrieval, multi-view shape descriptors have achieved impressive performance. One crucial part of view-based shape descriptors is interpreting 3D structures through various 2D observations. Most existing methods, such as MVCNN, assume that a strong classification model trained with deep learning can provide an effective shape embedding for 3D shape retrieval. However, these methods focus on discriminative models, and none of them explicitly incorporates the underlying 3D properties of objects observed from 2D images. In this paper, we present a novel encoder-decoder recurrent feature aggregation network (ERFA-Net) to address this problem. To emphasize the 3D properties of shapes when fusing multiple view features, we introduce 3D property prediction tasks into 3D shape retrieval. Specifically, an image sequence of a shape is recurrently aggregated into a discriminative shape embedding by an LSTM network, and this latent embedding is then trained to predict the original voxel grid and to estimate images from unseen viewpoints. These generation tasks provide effective supervision that drives the network to exploit the 3D properties of shapes from various 2D images. Our method achieves state-of-the-art performance for 3D shape retrieval on two large-scale 3D shape datasets, ModelNet and ShapeNetCore55. Extensive experiments show that the proposed 3D representation discriminates robustly under view occlusion and exhibits strong generative ability across various 3D shape tasks.
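To make the described pipeline concrete, below is a minimal PyTorch sketch of an encoder-decoder recurrent aggregation network of this kind: per-view CNN features are fused by an LSTM into a single shape embedding, which two generative heads decode into a voxel grid and an image from a requested novel viewpoint. All module names, layer sizes, and the azimuth/elevation pose parameterization are our illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class ERFANetSketch(nn.Module):
    """Hypothetical sketch: LSTM-aggregated multi-view embedding supervised
    by voxel reconstruction and novel-view image generation heads."""

    def __init__(self, feat_dim=512, embed_dim=256, voxel_res=32, img_res=64):
        super().__init__()
        # Per-view image encoder (a small stand-in for a CNN backbone).
        self.view_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(64 * 4 * 4, feat_dim),
        )
        # Recurrent aggregation of the view-feature sequence.
        self.lstm = nn.LSTM(feat_dim, embed_dim, batch_first=True)
        # Generative heads that supervise the embedding with 3D properties.
        self.voxel_head = nn.Linear(embed_dim, voxel_res ** 3)
        self.view_head = nn.Linear(embed_dim + 2, 3 * img_res * img_res)
        self.voxel_res, self.img_res = voxel_res, img_res

    def forward(self, views, novel_pose):
        # views: (B, V, 3, H, W); novel_pose: (B, 2) azimuth/elevation.
        B, V = views.shape[:2]
        feats = self.view_encoder(views.flatten(0, 1)).view(B, V, -1)
        _, (h, _) = self.lstm(feats)   # final hidden state = shape embedding
        embed = h[-1]                  # (B, embed_dim), used for retrieval
        voxels = torch.sigmoid(self.voxel_head(embed))
        voxels = voxels.view(B, self.voxel_res, self.voxel_res, self.voxel_res)
        novel = torch.sigmoid(self.view_head(torch.cat([embed, novel_pose], 1)))
        novel = novel.view(B, 3, self.img_res, self.img_res)
        return embed, voxels, novel

# Example: 12 rendered views of a shape plus one unseen target pose.
net = ERFANetSketch()
views = torch.rand(2, 12, 3, 64, 64)
pose = torch.rand(2, 2)
embed, voxels, novel = net(views, pose)
```

At retrieval time only the embedding would be used (e.g., nearest-neighbor search), while the voxel and novel-view heads act as auxiliary training losses that force the embedding to capture 3D structure rather than purely discriminative cues.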
