Annotation Efficient Cross-Modal Retrieval with Adversarial Attentive Alignment

Po-Yao Huang, Xiaojun Chang, Alexander G Hauptmann, Guoliang Kang, Wenhe Liu

doi:10.1145/3343031.3350894

Abstract

Visual-semantic embeddings are central to many multimedia applications, such as cross-modal retrieval between visual data and natural language descriptions. Conventionally, learning a joint embedding space relies on large parallel multimodal corpora. Because large-scale human annotation is expensive to obtain, there is strong motivation to develop versatile algorithms that learn from large corpora with fewer annotations. In this paper, we propose a novel framework that leverages automatically extracted regional semantics from un-annotated images as additional weak supervision for learning visual-semantic embeddings. The proposed model employs adversarial attentive alignments to close the inherent heterogeneous gaps between the annotated and un-annotated portions of the visual and textual domains. To demonstrate its superiority, we conduct extensive experiments on sparsely annotated multimodal corpora. The experimental results show that the proposed model outperforms state-of-the-art visual-semantic embedding models by a significant margin on cross-modal retrieval tasks on the sparse Flickr30k and MS-COCO datasets. Notably, despite using only 20% of the annotations, the proposed model achieves competitive performance (Recall at 10 > 80.0% for 1K and > 70.0% for 5K text-to-image retrieval) compared to benchmarks trained with the complete annotations.
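The Recall at 10 figures quoted above are an instance of the Recall at K retrieval metric: the fraction of queries whose correct item appears among the top K ranked candidates. The following is a minimal sketch of that metric, not the authors' evaluation code. It assumes a simplified one-to-one setup in which the correct candidate for query i sits at index i; the function name `recall_at_k` and the toy scores are hypothetical.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of queries whose ground-truth candidate ranks in the top k.

    Assumption for this sketch: similarity[i][j] scores query i against
    candidate j, and the correct candidate for query i is at index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank: how many candidates score strictly higher
        # than the ground-truth candidate.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Toy example: 4 text queries scored against 4 images (made-up numbers).
toy = [
    [0.9, 0.1, 0.2, 0.3],  # ground truth ranked 1st
    [0.8, 0.5, 0.1, 0.2],  # ground truth ranked 2nd
    [0.7, 0.6, 0.4, 0.1],  # ground truth ranked 3rd
    [0.1, 0.2, 0.3, 0.9],  # ground truth ranked 1st
]

print(recall_at_k(toy, 1))  # 0.5
print(recall_at_k(toy, 2))  # 0.75
```

In a text-to-image setting on datasets such as Flickr30k or MS-COCO, each image typically has several captions, so a real evaluation would map each caption query to its image index rather than rely on the diagonal as this sketch does.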

Similar Papers

Improving visual-semantic embeddings by learning semantically-enhanced hard negatives for cross-modal information retrieval
Yan Gong ... Georgina Cosma
Pattern Recognition | VOL. 137
18 Dec 2022

Multi-View Visual Semantic Embedding
Zheng Li ... Caili Guo
01 Jul 2022

Forward and Backward Multimodal NMT for Improved Monolingual and Multilingual Cross-Modal Retrieval
Po-Yao Huang ... Alexander Hauptmann
08 Jun 2020

Fine-Grained Visual Textual Alignment for Cross-Modal Retrieval Using Transformer Encoders
Nicola Messina ... Andrea Esuli
ACM Transactions on Multimedia Computing, Communications, and Applications | VOL. 17
12 Nov 2021
