VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View

Raphael Schumann, Weixi Feng, Stefan Riezler, Wanrong Zhu, Tsu-Jui Fu, William Yang Wang

doi:10.1609/aaai.v38i17.29858

Abstract

Incremental decision making in real-world environments is one of the most challenging tasks in embodied artificial intelligence. One particularly demanding scenario is Vision and Language Navigation (VLN), which requires visual and natural language understanding as well as spatial and temporal reasoning capabilities. The embodied agent needs to ground its understanding of navigation instructions in observations of a real-world environment like Street View. Despite the impressive results of LLMs in other research areas, how best to connect them with an interactive visual environment remains an open problem. In this work, we propose VELMA, an embodied LLM agent that uses a verbalization of the trajectory and of visual environment observations as the contextual prompt for the next action. Visual information is verbalized by a pipeline that extracts landmarks from the human-written navigation instructions and uses CLIP to determine their visibility in the current panorama view. We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples. We further finetune the LLM agent on a few thousand examples and achieve around 25% relative improvement in task completion over the previous state-of-the-art for two datasets.
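
The verbalization idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the prompt layout, the 0.5 threshold, and the action words are all assumptions made for illustration, and the per-landmark scores are plain numbers standing in for whatever visibility score a CLIP-based pipeline would produce.

```python
# Illustrative sketch only: names, prompt layout and threshold are assumptions,
# not the format used in the paper.

def verbalize_observation(landmark_scores, threshold=0.5):
    """Turn per-landmark visibility scores into a short text observation.

    landmark_scores maps a landmark name (extracted from the instructions)
    to a score; here the score is a placeholder for a CLIP-style value.
    """
    visible = [name for name, score in landmark_scores.items() if score >= threshold]
    if not visible:
        return ""  # nothing to report at this step
    return "Visible landmarks: " + ", ".join(visible) + "."


def build_prompt(instructions, history, current_scores, threshold=0.5):
    """Build a text prompt from the instructions, the past trajectory and
    the current observation; the LLM would be asked to continue it with
    the next action."""
    lines = ["Navigation instructions:", instructions, "Action sequence:"]
    step = 1
    for scores, action in history:
        observation = verbalize_observation(scores, threshold)
        if observation:
            lines.append(observation)
        lines.append(f"{step}. {action}")
        step += 1
    observation = verbalize_observation(current_scores, threshold)
    if observation:
        lines.append(observation)
    lines.append(f"{step}.")  # open slot for the next action
    return "\n".join(lines)


if __name__ == "__main__":
    history = [({"Starbucks": 0.9, "bank": 0.1}, "forward")]
    print(build_prompt(
        "Go past the Starbucks and turn left at the bank.",
        history,
        {"Starbucks": 0.2, "bank": 0.8},
    ))
```

In this sketch the prompt ends with an open numbered step, so a language model completing the text would produce the next action; in-context examples could be prepended in the same layout.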

Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Publication Date: Mar 24, 2024
Citations: 1

Similar Papers

VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation
Jialu Li ... Gaurav Sukhatme
Proceedings of the AAAI Conference on Artificial Intelligence | VOL. 38
24 Mar 2024

WebVLN: Vision-and-Language Navigation on Websites
Qi Chen ... Gengze Zhou
Proceedings of the AAAI Conference on Artificial Intelligence | VOL. 38
24 Mar 2024

Talk2Nav: Long-Range Vision-and-Language Navigation with Dual Attention and Spatial Memory
Arun Balajee Vasudevan ... Luc Van Gool
International Journal of Computer Vision | VOL. 129
31 Aug 2020
