Abstract

With the rapid development of computer vision and natural language processing technologies in recent years, there has been growing interest in multimodal intelligent tasks that require the ability to concurrently understand various forms of input data, such as images and text. Vision-and-language navigation (VLN) requires the alignment and grounding of multimodal input data to enable real-time perception of the task status from panoramic images and natural language instructions. This study proposes JMEBS, a novel deep neural network model with joint multimodal embedding and backtracking search for VLN tasks. The proposed JMEBS model uses a transformer-based joint multimodal embedding module and exploits both multimodal and temporal context. It also employs backtracking-enabled greedy local search (BGLS), a novel algorithm that uses backtracking to improve the task success rate and optimize the navigation path based on the local and global scores of candidate actions. A novel global scoring method further improves performance by comparing the partial trajectories searched thus far with multiple natural language instructions. The performance of the proposed model on various tasks was then experimentally demonstrated and compared with that of other models using the Matterport3D Simulator and room-to-room (R2R) benchmark datasets.
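
As a concrete illustration of the search described above, the following Python sketch shows how a backtracking-enabled greedy local search could combine local and global scores: the agent greedily expands the best-scoring candidate action and, when it reaches a dead end, backtracks to the most promising candidate it previously skipped. This is a hypothetical sketch, not the paper's implementation; `candidate_actions`, `local_score`, `global_score`, and the mixing weight `alpha` are assumed placeholders for the model's learned components.

```python
# Hypothetical sketch of a backtracking-enabled greedy local search (in the
# spirit of BGLS); NOT the paper's implementation. `candidate_actions`,
# `local_score`, `global_score`, and `alpha` are assumed placeholders.

def bgls(start, candidate_actions, local_score, global_score,
         alpha=0.5, max_steps=30):
    trajectory = [start]      # partial trajectory searched so far
    frontier = []             # (score, node, prefix) of candidates skipped earlier
    visited = {start}

    for _ in range(max_steps):
        current = trajectory[-1]
        scored = []
        for action, next_node in candidate_actions(current):
            if next_node in visited:
                continue
            # Combine step-level (local) and trajectory-level (global) evidence.
            score = alpha * local_score(current, action) \
                + (1 - alpha) * global_score(trajectory + [next_node])
            scored.append((score, next_node))

        if scored:
            scored.sort(key=lambda s: s[0], reverse=True)
            _, best_node = scored[0]
            # Remember the runners-up so the agent can backtrack to them later.
            frontier.extend((s, n, list(trajectory)) for s, n in scored[1:])
            trajectory.append(best_node)
            visited.add(best_node)
        elif frontier:
            # Dead end: backtrack to the most promising skipped candidate.
            frontier.sort(key=lambda f: f[0], reverse=True)
            _, node, prefix = frontier.pop(0)
            trajectory = prefix + [node]
            visited.add(node)
        else:
            break

    return trajectory
```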

Highlights

  • Driven by the rapid growth of computer vision and natural language processing technologies, there has been growing interest in recent years in multimodal intelligent tasks that require the ability to concurrently understand various forms of input data such as images and text

  • To overcome the limitations of previous studies on vision-and-language navigation (VLN) tasks, we propose joint multimodal embedding and backtracking search (JMEBS), a novel deep neural network model

  • This study proposed the novel deep neural network model JMEBS as an efficient tool to solve VLN tasks

Summary

Introduction

Driven by the rapid growth of computer vision and natural language processing technologies, there has been growing interest in recent years in multimodal intelligent tasks that require the ability to concurrently understand various forms of input data, such as images and text. It is a great challenge for attention-based VLN models to extract sufficient context from multimodal input data, including natural language instructions and images, to make real-time action decisions. To address this drawback, researchers have proposed transformer-based pretrained models [20,21]. The proposed model uses a transformer-based joint multimodal embedding module to obtain a text context that is effective for action selection based on natural language instructions and real-time input images. One of its salient features is that the context information extracted by the module can be integrated into various path-planning and action-selection strategies. It is also equipped with a backtracking-enabled local search feature designed to improve the task success rate and optimize the navigation path based on the local and global scores of candidate actions.
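
To make the idea of a joint multimodal embedding module more concrete, the PyTorch sketch below fuses instruction tokens and panoramic view features with a shared transformer encoder. The class name, dimensions, and modality-type embeddings are assumptions chosen for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class JointMultimodalEmbedding(nn.Module):
    """Hypothetical sketch of a transformer-based joint multimodal embedding:
    instruction tokens and panoramic view features are projected into a shared
    space, tagged with modality-type embeddings, and fused by a transformer
    encoder to produce joint context for action selection. Not the paper's
    exact module; all dimensions are illustrative."""

    def __init__(self, vocab_size, img_feat_dim=2048, d_model=512,
                 nhead=8, num_layers=4):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.img_proj = nn.Linear(img_feat_dim, d_model)
        self.type_emb = nn.Embedding(2, d_model)  # 0 = text token, 1 = image view
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, instr_tokens, view_feats):
        # instr_tokens: (B, L) token ids; view_feats: (B, V, img_feat_dim)
        text = self.word_emb(instr_tokens) + self.type_emb.weight[0]
        views = self.img_proj(view_feats) + self.type_emb.weight[1]
        joint = torch.cat([text, views], dim=1)   # (B, L + V, d_model)
        context = self.encoder(joint)             # fused multimodal context
        # Split back into text context (for grounding the instruction) and
        # view context (for scoring candidate actions).
        L = instr_tokens.size(1)
        return context[:, :L], context[:, L:]
```

The returned view context could then feed the local and global scorers used by the search procedure described in the abstract.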

Related Works
Problem Description
Proposed Model
Local Scoring
Global Scoring
Backtracking-Enabled Greedy Local Search
Dataset and Model Training
Experiments
Qualitative Analysis
Conclusions
