Abstract

Effectively fusing information from the visual and language modalities remains a significant challenge. To achieve deep integration of natural language and visual information, this research introduces a multimodal fusion neural network model that combines visual information (RGB images and depth maps) with language information (natural-language navigation instructions). First, the authors use Faster R-CNN and ResNet50 to extract image features, with an attention mechanism applied to select the most relevant visual information. Second, a GRU model extracts language features. Finally, another GRU fuses the visual and language features and retains history information, allowing the model to issue the next action instruction to the robot. Experimental results demonstrate that the proposed method effectively addresses the localization and decision-making challenges faced by robotic vacuum cleaners.
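
To make the described pipeline concrete, the following is a minimal PyTorch sketch of the fusion architecture as summarized above: pre-extracted visual region features are attended under the instruction encoding, and a second GRU carries history forward to predict the next action. All module names, dimensions, and the action space are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the visual-language fusion pipeline from the abstract.
# Dimensions, vocabulary size, and the 6-way action space are assumptions.
import torch
import torch.nn as nn

class VisualLanguageFusion(nn.Module):
    def __init__(self, vis_dim=2048, vocab_size=1000, hid_dim=512, n_actions=6):
        super().__init__()
        # Language encoder: embeds instruction tokens, summarizes with a GRU.
        self.embed = nn.Embedding(vocab_size, hid_dim)
        self.lang_gru = nn.GRU(hid_dim, hid_dim, batch_first=True)
        # Attention: scores each visual region against the instruction state.
        self.attn = nn.Linear(vis_dim + hid_dim, 1)
        # Fusion GRU cell: carries history across time steps; its input is the
        # attended visual feature concatenated with the language state.
        self.fuse_gru = nn.GRUCell(vis_dim + hid_dim, hid_dim)
        self.policy = nn.Linear(hid_dim, n_actions)

    def forward(self, vis_feats, instr_tokens, h_prev):
        # vis_feats: (B, R, vis_dim) region features, e.g. from a
        # Faster R-CNN / ResNet50 backbone (extraction not shown here).
        # instr_tokens: (B, T) token ids; h_prev: (B, hid_dim) fusion history.
        _, lang_h = self.lang_gru(self.embed(instr_tokens))
        lang_h = lang_h.squeeze(0)                           # (B, hid_dim)
        B, R, _ = vis_feats.shape
        lang_exp = lang_h.unsqueeze(1).expand(B, R, -1)      # (B, R, hid_dim)
        # Soft attention over the R regions, conditioned on the instruction.
        scores = self.attn(torch.cat([vis_feats, lang_exp], dim=-1))
        alpha = torch.softmax(scores, dim=1)                 # (B, R, 1)
        vis_attended = (alpha * vis_feats).sum(dim=1)        # (B, vis_dim)
        # Fuse attended vision with language; history flows through h_next.
        h_next = self.fuse_gru(torch.cat([vis_attended, lang_h], dim=-1), h_prev)
        return self.policy(h_next), h_next                   # action logits, history
```

At each navigation step the robot would feed fresh region features, the fixed instruction, and the previous hidden state into this module, so the fusion GRU's hidden state serves as the retained history mentioned in the abstract.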
