Abstract

Visual-semantic embedding (VSE) networks learn joint image–text representations that map images and texts into a shared embedding space, enabling various information retrieval-related tasks such as image–text retrieval, image captioning, and visual question answering. The most recent state-of-the-art VSE-based networks are VSE++, SCAN, VSRN, and UNITER. This study evaluates the performance of these networks on the task of image-to-text retrieval and identifies and analyses their strengths and limitations to guide future research on the topic. Experimental results on Flickr30K show that the pre-trained network, UNITER, achieved 61.5% average Recall@5 on the task of retrieving all relevant descriptions, while the traditional networks VSRN, SCAN, and VSE++ achieved 50.3%, 47.1%, and 29.4% average Recall@5, respectively, on the same task. An additional analysis was performed on image–text pairs from the 25 worst-performing classes in a subset of the Flickr30K-based dataset to identify the limitations of the best-performing models, VSRN and UNITER. These limitations are discussed from the perspective of image scenes, image objects, image semantics, and basic functions of neural networks, with the aim of guiding further research into the use of VSE networks for cross-modal information retrieval tasks.
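
As a rough, illustrative sketch of the retrieval setting evaluated here (not code from any of the networks above), the snippet below ranks candidate captions for each query image by cosine similarity in a shared embedding space; the embedding dimensions, array names, and the use of NumPy are assumptions for illustration only.

```python
# Minimal sketch of image-to-text retrieval in a shared embedding space.
# Assumes precomputed image and caption embeddings (e.g. from a VSE network);
# the shapes and variable names here are illustrative, not from the paper.
import numpy as np

def rank_captions(image_embs: np.ndarray, caption_embs: np.ndarray) -> np.ndarray:
    """Return, for each image, caption indices sorted from most to least similar."""
    # L2-normalise so that the dot product equals cosine similarity.
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    cap = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    sims = img @ cap.T                # (n_images, n_captions) similarity matrix
    return np.argsort(-sims, axis=1)  # caption indices in descending similarity

# Toy usage: 4 images, 20 captions (5 per image), 256-d joint space.
rng = np.random.default_rng(0)
rankings = rank_captions(rng.normal(size=(4, 256)), rng.normal(size=(20, 256)))
print(rankings[0][:5])  # top-5 retrieved caption indices for the first image
```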

Highlights

  • Visual-semantic embedding (VSE) networks jointly learn representations of images and texts to enable various cross-modal information retrieval-related tasks, such as image–text retrieval [1,2,3,4], image captioning [5,6,7,8], and visual question answering (VQA) [4,9,10]

  • VSE++, stacked cross attention (SCAN), visual semantic reasoning network (VSRN), and universal image-text representation (UNITER) were evaluated in terms of their performance in retrieving any one of the five relevant textual descriptions for each query

  • VSE++, SCAN, VSRN, and UNITER were evaluated in terms of their performance in retrieving all five of the relevant textual descriptions for each query (a sketch of both Recall@K variants is given after this list)
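
To make the two evaluation settings above concrete, the following sketch shows one common way to compute Recall@K for image-to-text retrieval, covering both the "any one" and the "all five" variants. The function name, the assumption of precomputed rankings, and the five ground-truth captions per image mirror a Flickr30K-style setup, but the code is illustrative and not the authors' evaluation implementation.

```python
# Illustrative Recall@K for image-to-text retrieval on a Flickr30K-style split
# (5 ground-truth captions per image). Not the evaluation code used in the paper.
import numpy as np

def recall_at_k(rankings: np.ndarray, gt: list[set[int]], k: int = 5,
                require_all: bool = False) -> float:
    """rankings[i] = caption indices sorted by similarity for image i;
    gt[i] = set of ground-truth caption indices for image i."""
    hits = 0
    for ranked, relevant in zip(rankings, gt):
        top_k = set(ranked[:k].tolist())
        if require_all:
            hits += relevant.issubset(top_k)   # all relevant captions retrieved
        else:
            hits += bool(relevant & top_k)     # at least one caption retrieved
    return hits / len(gt)

# recall_at_k(rankings, gt, k=5)                    -> "any one" Recall@5
# recall_at_k(rankings, gt, k=5, require_all=True)  -> "all" Recall@5
```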

Summary

Introduction

Visual-semantic embedding (VSE) networks jointly learn representations of images and texts to enable various cross-modal information retrieval-related tasks, such as image–text retrieval [1,2,3,4], image captioning [5,6,7,8], and visual question answering (VQA) [4,9,10]. Chen et al. [4] introduced a network pre-trained with the transformer [12], namely the universal image-text representation (UNITER), to unify various cross-modal tasks such as VQA and image–text matching. The limitations identified in this study fall into groups, including Group 2, in which the VSE networks do not give enough attention to detailed visual information, and Group 3, in which the VSE networks' capability to extract higher-level visual semantics needs to be improved (all limitations in both groups apply to VSRN and UNITER).
