Abstract

For robots operating in human environments, it has long been expected that they can execute specified tasks by following language instructions. Most current methods rely only on visual perception to understand a language instruction, which may not be sufficient to fully interpret the instruction when visually identical objects are present. In this paper, we propose the task of audio–visual language instruction understanding for robotic sorting, in which the robot uses both visual and audio information to fully understand and execute the given instruction. To solve the proposed task, we develop an audio–visual fusion framework that combines visual localization and audio recognition models for the robotic sorting task following language instructions. We have also collected a multimodal dataset for evaluation; extensive experiments on this dataset, together with generalization tests in new physical-world scenarios, demonstrate the effectiveness of the proposed framework.
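To make the fusion idea concrete, the sketch below shows one hypothetical way to combine the outputs of a visual localization model and an audio recognition model when visually identical objects must be disambiguated. The class names, scores, and fusion weight `alpha` are illustrative assumptions, not the paper's actual models or parameters.

```python
"""A minimal, hypothetical sketch of late audio-visual fusion for
instruction-driven sorting; it does not reproduce the paper's framework."""
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    box: tuple           # (x, y, w, h) of a detected object region
    visual_score: float  # how well the region matches the instruction visually
    audio_score: float   # how well its sound matches the instruction's audio cue


def fuse_scores(candidates: List[Candidate], alpha: float = 0.5) -> Candidate:
    """Pick the sorting target by a weighted sum of visual and audio evidence.

    `alpha` is an assumed fusion weight, not a value reported in the paper.
    """
    return max(
        candidates,
        key=lambda c: alpha * c.visual_score + (1.0 - alpha) * c.audio_score,
    )


if __name__ == "__main__":
    # Two visually near-identical detections: only the audio cue separates them.
    candidates = [
        Candidate(box=(10, 20, 50, 50), visual_score=0.92, audio_score=0.15),
        Candidate(box=(80, 20, 50, 50), visual_score=0.91, audio_score=0.88),
    ]
    target = fuse_scores(candidates)
    print("Sort target box:", target.box)
```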
