Abstract

In the wake of developments in deep learning, an increasing amount of research focuses on the intersection of natural language processing and machine vision; vision-and-language navigation (VLN) is one such field. The VLN task requires an embodied agent to follow a natural language instruction and navigate through a real 3D environment with the help of visual information, planning a trajectory from a start point to a goal location. In this paper, inspired by previous work, we introduce a multi-head attention module with a parallel attention computation method that applies the multi-head attention mechanism to the visual and textual inputs to enhance the performance of the model. Specifically, we first design a multi-head attention module with trainable parameters that extracts associated attention from the textual and visual information; the extracted attention variables help the agent become aware of which parts of the sentence or image are more important. Second, to help the agent perceive more useful information, we compute attention in parallel over the visual and textual input features and then use layer normalization to combine the attended features with the inputs. Experimental results indicate that our proposed module enables the model to achieve better performance and surpass the baseline. Our model attains a success rate of 51% and an oracle success rate of 62% with low navigation error.
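To make the described design concrete, below is a minimal PyTorch sketch of a parallel cross-modal multi-head attention block of the kind the abstract outlines. The class name, feature dimension, head count, and residual fusion details are illustrative assumptions, since the abstract does not specify the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ParallelCrossModalAttention(nn.Module):
    """Sketch of a parallel cross-modal attention block (assumed design):
    text attends to vision and vision attends to text in parallel, and
    each branch is fused with its input via a residual connection
    followed by layer normalization."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Two multi-head attention branches computed in parallel.
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vis_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_norm = nn.LayerNorm(dim)
        self.vis_norm = nn.LayerNorm(dim)

    def forward(self, text_feat: torch.Tensor, vis_feat: torch.Tensor):
        # text_feat: (batch, num_words, dim) -- instruction encoding
        # vis_feat:  (batch, num_views, dim) -- visual observation features
        attended_text, _ = self.text_attn(text_feat, vis_feat, vis_feat)
        attended_vis, _ = self.vis_attn(vis_feat, text_feat, text_feat)
        # Combine attended features with the inputs via layer normalization.
        text_out = self.text_norm(text_feat + attended_text)
        vis_out = self.vis_norm(vis_feat + attended_vis)
        return text_out, vis_out

# Example usage with toy shapes (20-token instruction, 36 panoramic views):
module = ParallelCrossModalAttention()
text = torch.randn(2, 20, 512)
views = torch.randn(2, 36, 512)
text_out, vis_out = module(text, views)
```

The two attention branches share no parameters and can run concurrently, which is one plausible reading of the "parallel attention computation" the abstract mentions.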
