Abstract

In the Remote Embodied Visual Referring Expression (REVERIE) task, an agent must navigate an unseen environment to identify a referred object following only high-level instructions. Despite recent advances in vision-and-language navigation (VLN), previous methods commonly rely on detailed navigational instructions, which may not be available in practice. To address this issue, we present a method that strengthens vision-and-language (V&L) navigators with object-awareness. By combining object-aware textual grounding and visual grounding operations, our technique helps the navigator relate instructions to the contents of observed images. As a generic method, the proposed solution can be seamlessly integrated into other V&L navigators built on different frameworks (for example, Seq2Seq or BERT). To alleviate data scarcity, we synthesize augmented data using a simple yet effective prompt template that retains object and destination information. Experimental results on the REVERIE and R2R datasets demonstrate the proposed method's applicability and performance improvements across different domains.
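The following is a minimal sketch of how prompt-template-based instruction synthesis might look; the template wording, function name, and annotation fields are illustrative assumptions and are not specified in the abstract.

```python
# Sketch: synthesize an augmented high-level instruction from a prompt
# template that retains object and destination information.
# The exact template used in the paper may differ.

def synthesize_instruction(object_name: str, destination: str) -> str:
    """Fill a simple prompt template with object and destination fields."""
    template = "Go to the {destination} and find the {object}."
    return template.format(destination=destination, object=object_name)


if __name__ == "__main__":
    # Hypothetical annotations, e.g. drawn from scene object labels.
    print(synthesize_instruction("white pillow", "bedroom on level two"))
    # -> "Go to the bedroom on level two and find the white pillow."
```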
