Abstract

Prior work in vision-and-language navigation (VLN) focuses on using long short-term memory (LSTM) to carry the flow of information in either the navigation model (navigator) or the instruction-generating model (speaker). The outstanding capability of LSTM to process intermodal interactions has been widely verified; however, LSTM neglects intramodal interactions, which negatively affects either the navigator or the speaker. Attention-based Transformers perform well in sequence-to-sequence translation domains, but applying the Transformer structure directly to VLN has yet to yield satisfactory results. In this article, we propose novel Transformer-based multimodal frameworks for the navigator and the speaker, respectively. In our frameworks, multihead self-attention with residual connections carries the information flow. Specifically, in our navigator framework we set a switch that prevents certain inputs from directly entering the information flow. Experiments verify the effectiveness of the proposed approach and show significant performance advantages over the baselines.
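To make the core building block concrete, here is a minimal PyTorch sketch of a multihead self-attention layer with a residual connection carrying the information flow, extended with a hypothetical gating "switch" on the attention output. The module name, the gate design, and all hyperparameters are illustrative assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn


class GatedSelfAttentionBlock(nn.Module):
    """Sketch: self-attention whose output is gated before joining the
    residual information flow. The gate stands in for the paper's
    'switch'; its exact form here is an assumption."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # Scalar gate in (0, 1) per token, deciding how much of the new
        # attention output is allowed into the residual stream.
        self.gate = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Intramodal interactions via multihead self-attention.
        attn_out, _ = self.attn(x, x, x)
        # Gate the update, then add it back via the residual connection,
        # which carries the information flow across layers.
        g = self.gate(attn_out)              # (batch, seq, 1)
        return self.norm(x + g * attn_out)


# Usage: a batch of 2 sequences, 10 tokens each, 512-dim features.
x = torch.randn(2, 10, 512)
block = GatedSelfAttentionBlock()
print(block(x).shape)  # torch.Size([2, 10, 512])
```

When the gate saturates near zero, the block reduces to an identity mapping over the residual path, which is one simple way a switch can keep an input from directly entering the information flow.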
