Abstract

Referring expression comprehension, the ability to identify the object in an image that a linguistic expression refers to, plays an important role in creating common ground. Many models that fuse visual and linguistic features have been proposed. However, few models consider fusing linguistic features with multiple visual features that have different receptive field sizes, even though the appropriate receptive field size intuitively varies with the expression. In this paper, we introduce a neural network architecture that uses linguistic features to modulate visual features with varying receptive field sizes. We evaluate our architecture on tasks related to referring expression comprehension in two visual dialogue games. The results show the advantages and broad applicability of our architecture. Source code is available at https://github.com/Alab-NII/lcfp .
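The core idea of the abstract, conditioning multi-scale visual features on a language representation, can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's actual architecture: the FiLM-style scale-and-shift form, the pyramid sizes, and all parameter shapes are assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def modulate(feature_map, lang_vec, w_gamma, w_beta):
    """FiLM-style modulation: scale and shift each channel of a
    visual feature map using language-derived coefficients."""
    gamma = lang_vec @ w_gamma  # per-channel scale, shape (C,)
    beta = lang_vec @ w_beta    # per-channel shift, shape (C,)
    return gamma[None, None, :] * feature_map + beta[None, None, :]

C, D = 8, 16  # hypothetical channel count and language embedding size

# A feature pyramid: smaller maps correspond to larger receptive fields.
pyramid = [rng.normal(size=(s, s, C)) for s in (32, 16, 8)]
lang = rng.normal(size=D)  # stand-in for an encoded referring expression

# Each pyramid level gets its own modulation parameters, so the language
# signal can weight coarse and fine receptive fields differently.
params = [(rng.normal(size=(D, C)), rng.normal(size=(D, C))) for _ in pyramid]
fused = [modulate(f, lang, wg, wb) for f, (wg, wb) in zip(pyramid, params)]

print([f.shape for f in fused])  # spatial structure is preserved at every scale
```

In a real model the modulated maps would feed a prediction head that scores candidate objects; here the point is only that the same linguistic vector conditions every level of the pyramid through level-specific parameters.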

Highlights

  • Referring expressions are a ubiquitous part of human communication (Krahmer and Van Deemter, 2012) that must be studied in order to create machines that work smoothly with humans

  • We introduce a neural network architecture for referring expression comprehension that considers visual features with different receptive field sizes, and evaluate it on the OneCommon task

  • To confirm the broad applicability of our architecture, we further evaluate it on another task, which, because it uses photographs, is expected to demand object category recognition more strongly than OneCommon does


Summary

Introduction

Referring expressions are a ubiquitous part of human communication (Krahmer and Van Deemter, 2012) that must be studied in order to create machines that work smoothly with humans. Much effort has been devoted to improving methods of creating visual common ground between humans and machines, which have limited means of expression and limited knowledge of the real world, from the perspectives of both referring expression comprehension and generation (Moratz et al., 2002; Tenbrink and Moratz, 2003; Funakoshi et al., 2004, 2005, 2006; Fang et al., 2013). Many models have been proposed for referring expression comprehension. As image recognition matured, Guadarrama et al. (2014) studied object retrieval methods based on category labels predicted by recognition models. More recently, models that fuse linguistic features with visual features using deep learning have been studied (Hu et al., 2016b,a; Anderson et al., 2018; Deng et al., 2018; Misra et al., 2018; Li et al., 2018; Yang et al., 2019a,b; Liu et al., 2019; Can et al., 2020).


