Abstract

Understanding the physical interactions of objects with their environment is critical for multi-object robotic manipulation tasks. A predictive dynamics model can forecast the future states of manipulated objects, and these predictions can be used to plan plausible actions that drive the objects to desired goal states. However, most current approaches to dynamics learning from high-dimensional visual observations have limitations: they either rely on large amounts of real-world data or build a model for a fixed number of objects, which makes it difficult to generalize to unseen objects. This paper proposes a Deep Object-centric Interaction Network (DOIN) that encodes object-centric representations for multiple objects from raw RGB images and reasons about the future trajectory of each object in latent space. The proposed model is trained only on large amounts of random interaction data collected in simulation. Combined with a model predictive control framework, the learned model enables a robot to search for action sequences that manipulate objects into desired configurations. The proposed method is evaluated on multi-object pushing tasks in both simulation and real-world experiments. Extensive simulation experiments show that DOIN achieves high prediction accuracy in scenes with different numbers of objects and outperforms state-of-the-art baselines on the manipulation tasks. Real-world experiments demonstrate that the model trained on simulated data can be transferred to a real robot and can successfully push previously unseen objects with significant variations in shape and size.
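The abstract pairs the learned dynamics model with a model predictive control framework but does not spell out the planner itself. The sketch below illustrates one common realization of such a planner, random-shooting MPC: sample candidate action sequences, roll each out through the predicted dynamics, and execute the first action of the lowest-cost sequence. The names predicted_dynamics, goal_cost, and plan_push are hypothetical stand-ins introduced here for illustration, and a toy linear map plays the role of DOIN's learned latent dynamics; this is a minimal sketch under those assumptions, not the paper's implementation.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the learned model (not from the paper): a toy
# linear map that drifts each object's 2-D state toward the push action.
def predicted_dynamics(states, action):
    return states + 0.1 * (action - states)

# Cost of a predicted configuration: sum of per-object distances to the goal.
def goal_cost(states, goal_states):
    return float(np.sum(np.linalg.norm(states - goal_states, axis=-1)))

def plan_push(states, goal_states, horizon=5, n_candidates=256):
    """Random-shooting MPC: sample action sequences, roll each out through
    the dynamics model, and return the first action of the best sequence."""
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, 2))
    best_cost, best_action = np.inf, None
    for seq in candidates:
        rollout = states
        for action in seq:
            rollout = predicted_dynamics(rollout, action)
        cost = goal_cost(rollout, goal_states)
        if cost < best_cost:
            best_cost, best_action = cost, seq[0]
    return best_action

# Example: three objects in a 2-D workspace pushed toward goal positions.
current = rng.uniform(-1.0, 1.0, size=(3, 2))
goal = rng.uniform(-1.0, 1.0, size=(3, 2))
print("first planned action:", plan_push(current, goal))

In a closed-loop setting, only the first planned action is executed, the scene is re-observed, and planning repeats, which lets the controller correct for prediction errors of the learned model.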

Highlights

  • For ShapeNet objects seen during training, the average position errors of DOR-NN are slightly worse than those of the proposed method

  • The results show that the proposed model with the interaction network achieves lower average final distances in most scenes across all test sets, significantly outperforming the state-of-the-art video-prediction method VisualMPC and alternatives that do not reason about physical interactions

Introduction

Humans possess a natural physical intuition that lets them predict the effects of actions and the future states of objects. This ability to interpret the physical world is acquired and improved through experience. Understanding the effects of physical interactions is likewise essential for planning actions in robotic manipulation tasks: if a robot could reason about the effects of its actions and the future states of the objects it manipulates, it could plan feasible actions that drive those objects to their goal configurations. The ability to reason about physical interactions has been a longstanding challenge in robotics.
