Cross-modal retrieval attracts considerable research attention due to its wide applications in numerous search systems. Sketch-based 3D shape retrieval is a particularly challenging cross-modal retrieval task because of the large divergence between the sketch modality and the 3D shape view modality. Existing approaches project sketches and shapes into a common space for feature updating and data alignment. However, these methods suffer from several disadvantages. First, most approaches ignore modality-shared information for divergence compensation during descriptor generation. Second, traditional fusion of multi-view features introduces considerable redundancy, which reduces the discriminability of shape descriptors. Third, most approaches focus only on cross-modal alignment and overlook modality-specific data relevance. To address these limitations, we propose a Joint Feature Learning Network (JFLN). First, we design a novel modality-shared feature extraction network that exploits both modality-specific characteristics and modality-shared information for descriptor generation. Second, we introduce a hierarchical view attention module that gradually focuses on effective information for multi-view feature updating and aggregation. Finally, we propose a novel cross-modal feature learning network that simultaneously accounts for modality-specific data distributions and cross-modal data alignment. We conduct extensive experiments on three public databases, and the results validate the superiority of the proposed method. Code is available at https://github.com/dlmuyy/JFLN.
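The abstract does not describe the view attention module in detail; as an illustration only, the minimal PyTorch sketch below shows one common way attention-weighted aggregation of multi-view features into a single shape descriptor can be implemented. The class and variable names (ViewAttentionPooling, score, the 12-view setting) are hypothetical and are not taken from the JFLN code.

```python
# Illustrative sketch (not the paper's implementation): attention-weighted
# aggregation of per-view features into a single shape descriptor.
import torch
import torch.nn as nn


class ViewAttentionPooling(nn.Module):
    """Aggregate per-view features with learned attention weights."""

    def __init__(self, feat_dim: int, hidden_dim: int = 128):
        super().__init__()
        # Small MLP that scores each view; architecture choices are assumptions.
        self.score = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, view_feats: torch.Tensor) -> torch.Tensor:
        # view_feats: (batch, num_views, feat_dim)
        weights = torch.softmax(self.score(view_feats), dim=1)  # (B, V, 1)
        # Weighted sum over the view dimension -> (batch, feat_dim)
        return (weights * view_feats).sum(dim=1)


if __name__ == "__main__":
    pooling = ViewAttentionPooling(feat_dim=512)
    views = torch.randn(4, 12, 512)   # 4 shapes, 12 rendered views each
    descriptor = pooling(views)
    print(descriptor.shape)           # torch.Size([4, 512])
```

In this kind of scheme, the softmax over views lets the network down-weight redundant or uninformative views rather than averaging them uniformly, which is the general motivation behind view attention for multi-view shape descriptors.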