Abstract

Visual grounding is a vision and language understanding task that aims to locate a region in an image according to a given query phrase. However, most previous studies address this task only for English. Although cross-lingual vision and language studies exist, they focus on image and video captioning and visual question answering. In this paper, we present the first work on cross-lingual visual grounding, expanding the task to other languages and studying an effective yet efficient way to perform visual grounding in them. We construct a visual grounding dataset for French via crowdsourcing. Our dataset consists of 14k, 3k, and 3k query phrases with their corresponding image regions for 5k, 1k, and 1k training, validation, and test images, respectively. In addition, we propose a cross-lingual visual grounding approach that transfers the knowledge from a learnt English model to a French model. Although our French dataset is only 1/6 the size of the English dataset, experiments indicate that our model achieves an accuracy of 65.17%, which is comparable to the 69.04% accuracy of the English model. Our dataset and code are available at https://github.com/ids-cv/Multi-Lingual-Visual-Grounding .
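The abstract does not detail how knowledge is transferred from the English model to the French model. As a rough illustration only, the sketch below shows one common strategy for this kind of cross-lingual transfer: reuse the language-agnostic visual and fusion weights of a trained English grounding model and train only a new French text encoder. It is not the authors' method, and all class, function, and parameter names (GroundingModel, BagOfEmbeddings, build_french_model, etc.) are hypothetical.

```python
# Hypothetical sketch of cross-lingual transfer for visual grounding (not the
# paper's actual method): copy a trained English model's region/fusion weights
# and swap in a fresh French text encoder, which is then fine-tuned on the
# smaller French dataset.

import copy
import torch
import torch.nn as nn


class BagOfEmbeddings(nn.Module):
    """Toy phrase encoder: mean of word embeddings (stands in for an LSTM/BERT)."""
    def __init__(self, vocab_size: int, dim: int = 300):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        return self.emb(tokens).mean(dim=1)     # (batch, dim)


class GroundingModel(nn.Module):
    """Scores candidate image regions against an encoded query phrase."""
    def __init__(self, text_encoder: nn.Module, region_dim: int = 2048,
                 text_dim: int = 300, hidden: int = 512):
        super().__init__()
        self.text_encoder = text_encoder                   # language-specific part
        self.region_proj = nn.Linear(region_dim, hidden)   # language-agnostic parts
        self.text_proj = nn.Linear(text_dim, hidden)
        self.scorer = nn.Linear(hidden, 1)

    def forward(self, region_feats, query_tokens):
        # region_feats: (batch, num_regions, region_dim); query_tokens: (batch, seq_len)
        q = self.text_proj(self.text_encoder(query_tokens))   # (batch, hidden)
        r = self.region_proj(region_feats)                    # (batch, R, hidden)
        fused = torch.tanh(r + q.unsqueeze(1))                # simple additive fusion
        return self.scorer(fused).squeeze(-1)                 # (batch, R) region scores


def build_french_model(english_model: GroundingModel,
                       french_text_encoder: nn.Module) -> GroundingModel:
    """Keep the trained visual/fusion weights; replace only the text encoder."""
    french_model = copy.deepcopy(english_model)
    french_model.text_encoder = french_text_encoder
    return french_model


# Usage sketch: the English model is assumed already trained; the French model
# inherits its region and fusion weights, so only the French encoder starts cold.
english = GroundingModel(BagOfEmbeddings(vocab_size=10000))
french = build_french_model(english, BagOfEmbeddings(vocab_size=12000))
scores = french(torch.randn(2, 36, 2048), torch.randint(0, 12000, (2, 5)))
print(scores.shape)  # torch.Size([2, 36]) -- one score per candidate region
```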

Highlights

  • Studies on various vision and language tasks, such as image captioning [1] and visual question answering [2], have significantly promoted the research on joint vision and language understanding

  • Visual grounding, which aims at finding a specific region in an image corresponding to a query phrase, plays a fundamental role in enhancing the performance of many joint vision and language tasks

  • Although English is the dominant language in existing datasets, visual grounding is also important for other languages to enable joint vision and language understanding in those languages


Summary

Introduction

Studies on various vision and language tasks, such as image captioning [1] and visual question answering [2], have significantly promoted the research on joint vision and language understanding. Since the emergence of the first work on visual grounding in [3], research efforts have been dedicated to improving its accuracy [4]–[11]. These studies mostly target English because large-scale visual grounding datasets are available only in English. Studies on English may not necessarily address the visual grounding task for other languages, as the query phrases may be completely different across languages, and a visual grounding model is strongly tied to the specific language it is trained on.

