A Novel Robotic Pushing and Grasping Method Based on Vision Transformer and Convolution.

Sheng Yu, Di-Hua Zhai, Yuanqing Xia

doi:10.1109/tnnls.2023.3244186

Abstract

Robotic grasping techniques have been widely studied in recent years. However, grasping in cluttered scenes remains a challenging problem for robots. In such scenes, objects are placed close to each other, leaving no space around them for the robot to place its gripper, which makes it difficult to find a suitable grasping position. To solve this problem, this article proposes combining pushing and grasping (PG) actions to aid grasp pose detection and robot grasping. We propose a combined pushing-grasping grasping network (GN), called the PG method based on transformer and convolution (PGTC). For the pushing action, we propose the pushing transformer network (PTNet), a vision transformer (ViT)-based object position prediction network that captures global and temporal features well and can better predict the positions of objects after pushing. For grasp detection, we propose a cross dense fusion network (CDFNet), which makes full use of the RGB and depth images by fusing and refining them several times. Compared with previous networks, CDFNet detects the optimal grasping position more accurately. Finally, we apply the network in both simulation and real UR3 robot grasping experiments and achieve state-of-the-art (SOTA) performance. Video and dataset are available at https://youtu.be/Q58YE-Cc250.

Full Text