Abstract

Cross-modal retrieval aims to find relevant data across different modalities, such as images and text. To bridge the modality gap, most existing methods require a large number of coupled sample pairs as training data. To reduce this demand, we propose a cross-modal retrieval framework that utilizes both coupled and uncoupled samples. The framework consists of two parts: Abstraction, which provides high-level single-modal representations from uncoupled samples, and Association, which links different modalities through a few coupled training samples. Under this framework, we implement a cross-modal retrieval method based on the consistency between the semantic structures of multiple modalities. First, both images and text are given a semantic structure-based representation, which describes each sample by its similarities to reference points generated by single-modal clustering. Then, the reference points of different modalities are aligned through an active learning strategy. Finally, cross-modal similarity is measured by the consistency between the semantic structures. The experimental results demonstrate that, given a proper abstraction of the single-modal data, the relationship between modalities is simplified, and even limited coupled cross-modal training data are sufficient for satisfactory retrieval accuracy.
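The block below is a minimal sketch of this pipeline, assuming k-means cluster centers serve as the single-modal reference points and cosine similarity measures both the sample-to-reference relations and the cross-modal consistency. The function names, dimensions, and the identity placeholder for the reference-point alignment are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def semantic_structure_representation(features, n_references=32, seed=0):
    """Abstraction step (assumed): represent each sample by its cosine
    similarities to reference points obtained from single-modal clustering."""
    kmeans = KMeans(n_clusters=n_references, random_state=seed, n_init=10).fit(features)
    references = kmeans.cluster_centers_             # (n_references, d)
    return cosine_similarity(features, references)   # (n_samples, n_references)

def cross_modal_similarity(image_repr, text_repr, alignment):
    """Association step (assumed): permute the text reference axes into the image
    order, then score pairs by the consistency of their semantic structures."""
    aligned_text_repr = text_repr[:, alignment]              # reorder reference axes
    return cosine_similarity(image_repr, aligned_text_repr)  # (n_images, n_texts)

# Placeholder single-modal features; in practice these come from any feature extractor.
img_feat = np.random.randn(200, 512)
txt_feat = np.random.randn(300, 256)

img_repr = semantic_structure_representation(img_feat)
txt_repr = semantic_structure_representation(txt_feat)

# The paper aligns reference points with an active learning strategy;
# an identity mapping stands in for that learned alignment here.
alignment = np.arange(32)
scores = cross_modal_similarity(img_repr, txt_repr, alignment)  # higher = more relevant
ranking = np.argsort(-scores, axis=1)                           # image-to-text retrieval order
```

Note that only the clustering and alignment use any training signal: the reference points come from uncoupled single-modal data, and only the reference-point alignment requires coupled pairs.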

Highlights

  • Recent years have witnessed a surge in the need to jointly analyze multimodal data [1, 2]

  • Through this paper, we demonstrate the importance of uncoupled samples for preserving intra-modal relations and the correlation between the semantic structures of different modalities, which together make cross-modal retrieval possible with limited coupled training samples. The main contribution can be summarized as follows: (1) the AbsAss cross-modal retrieval framework

  • We compare the retrieval performance of the proposed method with eight baselines. The first is CCA [1]: with canonical correlation analysis (CCA), a shared space is learned for different modalities in which they are maximally correlated (a minimal sketch follows this list)
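As a reference for this first baseline, the sketch below shows CCA-based retrieval with scikit-learn, assuming pre-extracted image and text features with placeholder dimensions; it is not the paper's experimental configuration.

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.metrics.pairwise import cosine_similarity

# Coupled training pairs (placeholder features; row i of each array is one image-text pair).
img_train = np.random.randn(500, 512)
txt_train = np.random.randn(500, 300)

# Learn a shared space where the projected modalities are maximally correlated.
cca = CCA(n_components=10)
img_proj, txt_proj = cca.fit_transform(img_train, txt_train)

# Image-to-text retrieval: rank all candidate texts for each query image
# by similarity in the shared space.
ranking = np.argsort(-cosine_similarity(img_proj, txt_proj), axis=1)
```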


Summary

Introduction

Recent years have witnessed a surge in the need to jointly analyze multimodal data [1, 2]. A common approach to bridging the modality gap is to construct a shared representation space in which multimodal samples can be represented uniformly [2]. This is not easy, because it requires detailed knowledge of the content of each modality and the correspondence between them [6]. A variety of tools have been used to construct the shared space, such as canonical correlation analysis (CCA) [1, 7,8,9,10], topic models [11,12,13], and hashing [14,15,16,17,18]. Among these methods, the deep neural network (DNN) has become the most popular because of its strong learning ability [6, 19,20,21,22,23,24]. However, collecting coupled training data is labor-intensive and time-consuming.

