Abstract

This paper proposes a new approach that drastically improves cross-modal retrieval performance between vision and language (hereinafter referred to as “vision and language retrieval”). Vision and language retrieval takes data of one modality as a query to retrieve relevant data of the other modality, enabling flexible retrieval across different modalities. Most existing methods learn embeddings of visual and lingual information into a single common representation space. However, we argue that forcing all information into one embedding space loses key information from sentences and images. In this paper, we propose a simple but robust vision and language retrieval method that makes effective use of multiple representation spaces. The proposed method exploits individual representation spaces through text-to-image and image-to-text models. Experimental results show that the proposed approach enhances the performance of existing methods that embed visual and lingual information into a single common representation space.
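As a rough illustration of how the individual spaces might be used alongside a common space, the sketch below computes additional similarities in the visual space V (by translating the text query with a text-to-image model) and in the lingual space L (by translating each candidate image with an image-to-text model). All encoder functions here are hypothetical placeholders returning random features; the cosine measure and the translation directions are assumptions made for this sketch, not details confirmed by the paper.

```python
# Rough sketch (not the authors' code): extra similarities in the individual
# spaces V and L via cross-modal models. All encoders below are hypothetical
# placeholders that return random features.
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def text_to_image_features(query_text, dim=512):
    """Placeholder for a text-to-image model mapping the query into visual space V."""
    return rng.standard_normal(dim)

def image_to_text_features(image_feature, dim=300):
    """Placeholder for an image-to-text model: caption the image, encode the caption in lingual space L."""
    return rng.standard_normal(dim)

def lingual_features(query_text, dim=300):
    """Placeholder lingual encoder for the query in space L."""
    return rng.standard_normal(dim)

query = "a dog catching a frisbee on the beach"
candidate_images_V = [rng.standard_normal(512) for _ in range(5)]  # visual features in V

q_V = text_to_image_features(query)   # query translated into space V
q_L = lingual_features(query)         # query encoded in space L
s_V = [cosine(q_V, v) for v in candidate_images_V]                           # similarity in V
s_L = [cosine(q_L, image_to_text_features(v)) for v in candidate_images_V]   # similarity in L
print(s_V, s_L)
```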

Highlights

  • Single-modal retrieval such as document retrieval from keyword queries [1] and image retrieval from an image query [2] has been traditionally conducted

  • For each method, the proposed approach drastically improves the mean and median ranks compared with the state-of-the-art methods, which means the proposed approach is effective at improving the retrieval performance of various conventional embedding methods for vision and language retrieval

  • The best median rank is obtained around the settings α = 0.3, β = 0.5 and γ = 0.2. These results mean that the similarity sEn carries the most important information, the similarities sVn and sLn also carry important information, and we can enhance vision and language retrieval by using multiple representation spaces in addition to the common space E (a sketch of such a weighted combination follows this list)
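The weighted combination behind this highlight can be pictured as below. This is a minimal illustrative sketch, not the paper's implementation: it assumes the final score is a simple weighted sum of the per-candidate similarities in spaces E, V, and L, the assignment of α, β, γ to the individual similarities is an assumption, and the toy similarity values are made up.

```python
# Illustrative fusion of per-candidate similarities from spaces E, V, and L.
# Assumption: the final score is a weighted sum alpha*sE + beta*sV + gamma*sL;
# which weight scales which similarity is not confirmed by the summary above.
import numpy as np

def fuse_similarities(s_E, s_V, s_L, alpha, beta, gamma):
    """Combine per-candidate similarities from the three spaces into one score."""
    s_E, s_V, s_L = map(np.asarray, (s_E, s_V, s_L))
    return alpha * s_E + beta * s_V + gamma * s_L

def rank_candidates(scores):
    """Return candidate indices ordered from most to least similar."""
    return np.argsort(-np.asarray(scores))

# Toy similarities for five candidate images (made-up numbers).
s_E = [0.82, 0.40, 0.65, 0.30, 0.71]
s_V = [0.60, 0.55, 0.70, 0.20, 0.68]
s_L = [0.50, 0.45, 0.35, 0.25, 0.66]

scores = fuse_similarities(s_E, s_V, s_L, alpha=0.3, beta=0.5, gamma=0.2)
print(rank_candidates(scores))  # retrieval order under these example weights
```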

Summary

INTRODUCTION

Single-modal retrieval, such as document retrieval from keyword queries [1] and image retrieval from an image query [2], has traditionally been conducted. In a text-to-image retrieval scenario, a lingual feature extracted from the query in space L and visual features extracted from the candidate images in space V are projected into a learned common representation space E in which the two modalities can be compared. This embedding approach is currently one of the most popular approaches. We enhance the retrieval performance of conventional embedding methods that rely only on space E by additionally utilizing text-to-image and image-to-text models.
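For concreteness, the conventional single-space baseline described above reduces to projecting both modalities into E and ranking candidates by similarity. The following is a minimal sketch under assumed conditions: the random linear projections stand in for learned encoders, and the feature dimensions and cosine ranking are illustrative choices, not details from the paper.

```python
# Minimal sketch of embedding-based text-to-image retrieval in a common space E.
# The random linear maps below stand in for learned encoders; dimensions and the
# cosine ranking rule are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
D_L, D_V, D_E = 300, 512, 256              # assumed dims of lingual, visual, and common spaces

W_text = rng.standard_normal((D_E, D_L))   # stand-in for a learned text encoder L -> E
W_image = rng.standard_normal((D_E, D_V))  # stand-in for a learned image encoder V -> E

def embed(x, W):
    """Project a feature vector into the common space E and L2-normalize it."""
    e = W @ x
    return e / (np.linalg.norm(e) + 1e-12)

query_L = rng.standard_normal(D_L)             # lingual feature of the text query
candidates_V = rng.standard_normal((100, D_V)) # visual features of 100 candidate images

q_E = embed(query_L, W_text)
c_E = np.stack([embed(v, W_image) for v in candidates_V])
s_E = c_E @ q_E                 # cosine similarity of each candidate with the query in E
ranking = np.argsort(-s_E)      # best-matching images first
print(ranking[:5])
```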

RELATED WORKS
SIMILARITY CALCULATION IN SPACE L
SIMILARITY CALCULATION IN SPACE E
VERIFYING THE EFFECTIVENESS OF LINGUAL AND VISUAL SPACES FOR RETRIEVAL
EXPERIMENTAL SETUP
QUANTITATIVE EVALUATION ON MSCOCO DATASET
Findings
CONCLUSION