Abstract

Multi-modal retrieval is challenging because of the heterogeneous gap and the complex semantic relationships between data of different modalities. Typical approaches map the different modalities into a common subspace using a one-to-one correspondence or a similarity/dissimilarity relationship between inter-modal data, so that distances between heterogeneous data can be compared directly and inter-modal retrieval reduces to nearest-neighbor search. However, most of these approaches ignore intra-modal relations and the complicated semantics among multi-modal data. In this paper, we propose a deep multi-modal metric learning method with multi-scale semantic correlation for retrieval tasks between image and text modalities. A deep model with two branches is designed to nonlinearly map raw heterogeneous data into comparable representations. In contrast to binary similarity, we formulate the semantic relationship as a multi-scale similarity to learn fine-grained multi-modal distances. Inter-modal and intra-modal correlations constructed on this multi-scale semantic similarity are incorporated to train the deep model end to end. Experiments validate the effectiveness of the proposed method on multi-modal retrieval tasks, and it outperforms state-of-the-art methods on the NUS-WIDE, MIR Flickr, and Wikipedia datasets.
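To make the two-branch design concrete, below is a minimal sketch, assuming pre-extracted image features (e.g., CNN activations) and text features (e.g., tag or bag-of-words vectors) as inputs. The class name TwoBranchEmbedder, the layer sizes, and the use of PyTorch are illustrative assumptions, not the authors' exact architecture.

    import torch.nn as nn
    import torch.nn.functional as F

    class TwoBranchEmbedder(nn.Module):
        """Hypothetical two-branch model: each branch nonlinearly maps one
        modality into a shared space where image-text distances are comparable."""

        def __init__(self, img_dim=4096, txt_dim=1000, embed_dim=128):
            super().__init__()
            # Image branch: nonlinear projection of raw image features.
            self.img_branch = nn.Sequential(
                nn.Linear(img_dim, 1024), nn.ReLU(),
                nn.Linear(1024, embed_dim),
            )
            # Text branch: nonlinear projection of raw text features.
            self.txt_branch = nn.Sequential(
                nn.Linear(txt_dim, 1024), nn.ReLU(),
                nn.Linear(1024, embed_dim),
            )

        def forward(self, img_feat, txt_feat):
            # L2-normalize so distances across modalities can be compared directly.
            v = F.normalize(self.img_branch(img_feat), dim=1)
            t = F.normalize(self.txt_branch(txt_feat), dim=1)
            return v, t

Given such embeddings, inter-modal retrieval reduces to nearest-neighbor search in the shared space, as described in the abstract.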

Highlights

  • With the development of the Internet and social media, people encounter massive amounts of data from various modalities in their daily lives

  • We propose a deep multi-modal metric learning method for image–text retrieval, in which two network branches are learned simultaneously as metric functions that measure image–text distances according to the multi-modal semantic relationship

  • The average performance of MS-DMML with α = 0.4, β = 0.6 is close to that with α = 0.8, β = 0.2 on NUS-WIDE and Wikipedia, while on MIR Flickr the average performance with α = 0.4, β = 0.6 (89.27%) is higher than with α = 0.8, β = 0.2 (83.80%). This may be because MIR Flickr has more labels per sample on average than NUS-WIDE and Wikipedia, so the weight α of the similarity term should be relatively small to balance the multi-scale similarity and dissimilarity terms (see the loss sketch below)
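A hedged sketch of the trade-off described in the last highlight: semantic similarity is graded (multi-scale) rather than binary, approximated here by label-set overlap, and α and β weight the similarity and dissimilarity terms of a contrastive-style pair loss. The function names, the overlap-based similarity, and the margin are assumptions for illustration, not the paper's exact formulation.

    import torch

    def multi_scale_similarity(labels_a, labels_b):
        # Graded similarity in [0, 1]: overlap of multi-hot label vectors
        # (an illustrative stand-in for the paper's multi-scale similarity).
        inter = (labels_a * labels_b).sum(dim=1)
        union = ((labels_a + labels_b) > 0).float().sum(dim=1).clamp(min=1)
        return inter / union

    def weighted_pair_loss(emb_a, emb_b, labels_a, labels_b,
                           alpha=0.4, beta=0.6, margin=1.0):
        s = multi_scale_similarity(labels_a, labels_b)    # 0 = unrelated, 1 = identical labels
        d = torch.norm(emb_a - emb_b, dim=1)              # distance in the shared space
        sim_term = s * d.pow(2)                                     # pulls related pairs together
        dis_term = (1 - s) * torch.clamp(margin - d, min=0).pow(2)  # pushes unrelated pairs apart
        return (alpha * sim_term + beta * dis_term).mean()

On label-rich data such as MIR Flickr, a smaller α keeps the similarity term from dominating the dissimilarity term, which is consistent with the trade-off the highlight describes.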


Summary

Introduction

With the development of the Internet and social media, people encounter massive amounts of data from various modalities (e.g., image, text, audio, and video) in their daily lives. In recent studies [22,23,24], multi-modal data are embedded as low-dimensional representations via two different deep models and mapped into a shared latent space at the top layers by pulling similar inter-modal points together while pushing dissimilar points far apart. These methods learn multi-layer nonlinear transformations to align heterogeneous data based on a binary similar/dissimilar inter-modal relationship, but they ignore the intra-modal correlation of multi-modal data, which has been shown to be effective in subspace learning methods [8,10,25,26,27]. Our method benefits from combining inter-modal and intra-modal correlations and achieves good performance on all four kinds of retrieval tasks
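A minimal sketch of how inter-modal and intra-modal correlations could be combined in one training objective, in the spirit of this paragraph; the pairing scheme, the contrastive form of pair_loss, and the lambda_intra weight are assumptions, not the paper's exact loss.

    import torch

    def pair_loss(a, b, sim, margin=1.0):
        # Contrastive-style term: pull semantically similar pairs together,
        # push dissimilar pairs beyond a margin.
        d = torch.norm(a - b, dim=1)
        return (sim * d.pow(2) + (1 - sim) * torch.clamp(margin - d, min=0).pow(2)).mean()

    def total_loss(img_emb, txt_emb, sim_it, sim_img, sim_txt, lambda_intra=0.5):
        # Inter-modal term on matched image-text pairs; intra-modal terms on
        # within-modality pairs formed by shifting the batch (illustrative pairing).
        inter = pair_loss(img_emb, txt_emb, sim_it)
        intra = pair_loss(img_emb, img_emb.roll(1, 0), sim_img) + \
                pair_loss(txt_emb, txt_emb.roll(1, 0), sim_txt)
        return inter + lambda_intra * intra

Here sim_it, sim_img, and sim_txt would be the graded similarities of the corresponding pairs, e.g. computed from shared labels as in the sketch after the highlights.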

Related Work
Deep Multi-Modal Metric Learning with Multi-Scale Correlation
Deep Networks for Image and Text Modalities
Multi-Scale Metric Learning
Experiments
Dataset and Measurement
Implementation Details
Validation on Inter-Modal and Intra-Modal Correlations
Validation on Multi-Scale Correlation
Comparison with Others
Methods
Performance Curves
Retrieval Examples
Findings
Conclusions