Abstract

Visual imitation learning is a promising approach that enables robots to learn skills from visual demonstrations. However, current visual imitation learning approaches rely on the unrealistic assumption that the contexts of the visual demonstrations and the robot observations are consistent, which limits the flexibility and scalability of these approaches. Learning from visual demonstrations with inconsistent contexts is therefore a key challenge: inconsistent contexts can cause large differences in the pixel distributions of the operator and the environment, rendering vision-based control policies largely ineffective. In this paper, we propose a novel imitation learning framework that enables robots to reproduce behavior by watching human demonstrations with inconsistent contexts, such as different viewpoints, operators, backgrounds, object appearances, and object positions. Specifically, our framework consists of three networks: a flow-based viewpoint transformation network (FVTrans), a robot2human alignment network (RANet), and an inverse dynamics network (IDNet). First, FVTrans transforms various third-person demonstrations into the fixed robot execution view; with a meta-learning strategy, FVTrans can quickly adapt to novel contexts from only a few samples. Then, RANet aligns the human and the robot at the feature level, so that the demonstration feature can serve as a subgoal for the current time step. Finally, IDNet predicts the joint angles of the robot. We collect a multi-context dataset on a real robot (UR5) for three tasks: grasping cups, sweeping garbage, and placing objects. We empirically demonstrate that our framework performs the three tasks with a high success rate and generalizes effectively to different contexts.
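To make the three-stage pipeline concrete, the following is a minimal illustrative sketch of how FVTrans, RANet, and IDNet could be chained at inference time. The module interfaces, layer sizes, and shared use of one encoder for both the demonstration and the robot observation are assumptions for illustration only, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class FVTrans(nn.Module):
    """Sketch: maps a third-person demo frame to the fixed robot execution view."""
    def __init__(self, ch=3):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(ch, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, ch, 3, padding=1))

    def forward(self, demo_frame):
        return self.net(demo_frame)


class RANet(nn.Module):
    """Sketch: encodes a view-aligned frame into a feature usable as a subgoal."""
    def __init__(self, ch=3, dim=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(ch, 16, 3, stride=2), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(16, dim))

    def forward(self, frame):
        return self.enc(frame)


class IDNet(nn.Module):
    """Sketch: predicts robot joint angles from the observation feature and subgoal."""
    def __init__(self, dim=128, n_joints=6):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_joints))

    def forward(self, obs_feat, subgoal_feat):
        return self.mlp(torch.cat([obs_feat, subgoal_feat], dim=-1))


# Inference flow: demo frame -> robot view -> subgoal feature -> joint command.
fvtrans, ranet, idnet = FVTrans(), RANet(), IDNet()
demo = torch.rand(1, 3, 64, 64)          # third-person human demonstration frame
obs = torch.rand(1, 3, 64, 64)           # robot's current observation
subgoal = ranet(fvtrans(demo))           # demo mapped into the robot view, then encoded
obs_feat = ranet(obs)                    # same encoder reused here purely for brevity
joint_angles = idnet(obs_feat, subgoal)  # shape (1, 6), e.g. the six joints of a UR5
```

The key design point the sketch tries to convey is the decoupling described in the abstract: context inconsistency is handled before control (FVTrans and RANet), so IDNet only ever sees features expressed in the robot's own execution view.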
