Abstract

Learning from demonstration holds the promise of enabling robots to learn diverse actions from expert experience. In contrast to learning from observation-action pairs, humans imitate in a more flexible and efficient manner: they learn behaviors simply by “watching.” In this article, we propose a “watch-and-act” imitation learning pipeline that endows a robot with the ability to learn diverse manipulations from visual demonstrations. Specifically, we address this problem by casting it as two subtasks: 1) understanding the demonstration video and 2) learning the demonstrated manipulations. First, a captioning module based on visual change is presented to understand the demonstration by translating the demonstration video into a command sentence. Then, to execute the generated command, a manipulation module that learns the demonstrated manipulations is built upon an instance segmentation model and a manipulation affordance prediction model. We validate the superiority of each module over existing methods through extensive experiments, and we demonstrate the complete robotic imitation system built from the two modules in diverse scenarios on a real robotic arm. A supplementary video is available at https://vsislab.github.io/watch-and-act/.
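As a rough illustration of the two-stage architecture described above, the sketch below shows how the “watch” (captioning) and “act” (manipulation) stages could be wired together. All class and method names (e.g., VideoCaptioner, InstanceSegmenter, AffordancePredictor, RobotArm) are hypothetical placeholders for illustration only and do not reflect the authors' actual implementation.

```python
# Minimal sketch of a "watch-and-act" pipeline, assuming hypothetical
# component interfaces; this is not the authors' published API.

class WatchAndActPipeline:
    def __init__(self, captioner, segmenter, affordance_model, robot):
        self.captioner = captioner              # visual-change-based captioning module
        self.segmenter = segmenter              # instance segmentation model
        self.affordance_model = affordance_model  # manipulation affordance predictor
        self.robot = robot                      # interface to the real robotic arm

    def imitate(self, demo_video, scene_image):
        # 1) "Watch": translate the demonstration video into a command sentence.
        command = self.captioner.caption(demo_video)

        # 2) "Act": ground the command in the current scene and execute it.
        instances = self.segmenter.segment(scene_image)
        affordance = self.affordance_model.predict(scene_image, instances, command)
        self.robot.execute(affordance)
        return command, affordance
```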
