RES-StS: Referring Expression Speaker via Self-Training With Scorer for Goal-Oriented Vision-Language Navigation

Liuyi Wang, Huiyi Chen, Ronghao Dang, Qijun Chen, Zongtao He, Chengju Liu

doi:10.1109/tcsvt.2022.3233554

Abstract

Finding a specified target object through autonomous exploration of an unstructured environment, guided only by a natural language description, is a practical but difficult task. Because human-annotated data are expensive to gather for the goal-oriented vision-language navigation (GVLN) task, the standard datasets are small, which has significantly limited the accuracy of previous techniques. In this work, we aim to improve the robustness and generalization of the navigator by dynamically providing high-quality pseudo-instructions using a proposed RES-StS paradigm. Specifically, we establish a referring expression speaker (RES) to predict descriptive instructions for a given path to the goal object. Using an environment-and-object fusion (EOF) module, RES derives spatial representations from the input trajectories, which are subsequently encoded by a number of transformer layers. Additionally, because the quality of the pseudo labels is important for data augmentation, while the limited dataset may also hinder RES learning, we propose to strengthen the generation ability of RES with a self-training approach. A trajectory-instruction matching scorer (TIMS) network based on contrastive learning is proposed to selectively use rehearsal of prior knowledge. Finally, all network modules in the system are integrated through a multi-stage training strategy, allowing them to assist one another and thus enhance performance on the GVLN task. Experimental results demonstrate the effectiveness of our approach. Compared with the SOTA methods, our method improves SR, SPL, and RGS by 4.72%, 2.55%, and 3.45%, respectively, on the REVERIE dataset, and by 4.58%, 3.75%, and 3.14%, respectively, on the SOON dataset.
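As a rough illustration of what a contrastive trajectory-instruction matching objective can look like, the sketch below scores pairs by cosine similarity and computes a symmetric InfoNCE loss over a batch. This is a generic contrastive formulation, not the TIMS architecture or loss from the paper, which the abstract does not specify. The function names, the plain-list embeddings, and the `temperature` value are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(traj_embs, instr_embs, temperature=0.1):
    """Symmetric InfoNCE loss (generic sketch, not the paper's loss).

    Assumes traj_embs[i] and instr_embs[i] form a matched pair and that
    every other pairing in the batch acts as a negative.
    """
    n = len(traj_embs)
    # Temperature-scaled similarity matrix: rows are trajectories,
    # columns are instructions.
    sims = [[cosine(t, s) / temperature for s in instr_embs] for t in traj_embs]

    def cross_entropy(logits, target):
        # Numerically stable log-sum-exp minus the target logit.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        return log_z - logits[target]

    # Trajectory-to-instruction direction (rows of the matrix).
    t2i = sum(cross_entropy(sims[i], i) for i in range(n)) / n
    # Instruction-to-trajectory direction (columns of the matrix).
    i2t = sum(
        cross_entropy([sims[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (t2i + i2t)


# Toy embeddings (hypothetical values): matched pairs should give a
# lower loss than mismatched pairs.
trajectories = [[1.0, 0.0], [0.0, 1.0]]
matched = [[1.0, 0.0], [0.0, 1.0]]
mismatched = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = info_nce(trajectories, matched)
loss_mismatched = info_nce(trajectories, mismatched)
```

A scorer trained with an objective of this kind assigns higher similarity to trajectories and instructions that belong together, which is the property a self-training loop would need in order to judge generated pseudo-instructions.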

Journal: IEEE Transactions on Circuits and Systems for Video Technology
Publication Date: Jul 1, 2023
Citations: 5

Similar Papers

Automatic Object Searching and Behavior Learning for Mobile Robots in Unstructured Environment by Deep Belief Networks
Jiru Wang ... Fuchun Sun
IEEE Transactions on Cognitive and Developmental Systems | VOL. 11
01 Sep 2019

Grounding Language Attributes to Objects using Bayesian Eigenobjects
Vanya Cohen ... Stefanie Tellex
01 Nov 2019

Improving Inertial Sensor-Based Activity Recognition in Neurological Populations.
Yunus Celik ... Kadir Sabanci
Sensors | VOL. 22
15 Dec 2022

Fall prediction, control, and recovery of quadruped robots
Hao Sun ... Changhong Wang
ISA Transactions | VOL. 151
25 May 2024
