Because objects vary widely in shape, material, and color, planar grasp detection for robots remains challenging. Traditional methods annotate only a discrete set of grasp configurations and ignore the many other valid ones, which limits network generalization and makes it difficult to handle diverse objects. Manually re-annotating datasets with continuous labels can address this issue, but at significant cost. This paper therefore proposes a Pixel-level Grasp framework. First, the APGLG algorithm automatically generates pixel-level grasp labels, converting discrete labels into continuous ones; this increases the information content of each sample and improves network generalization. Second, we propose Max-Grasp-Net, a U-shaped network built on the Multi-axis Vision Transformer and Dynamic Convolution Decomposition, with a dedicated grasp decoder and deep supervision to further enhance generalization. Our method achieves state-of-the-art results: a grasp detection accuracy of 99.55% on the Cornell dataset, an average success rate of 97.92% in single-object grasping, and 95.83% in multi-object grasping. Physical grasping experiments verify the effectiveness of the proposed label generation algorithm and network design.
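The abstract describes APGLG only at a high level. For intuition, the sketch below shows one common way, in the spirit of pixel-wise grasp map representations (GG-CNN style), to turn a single discrete grasp-rectangle annotation into continuous per-pixel quality, angle, and width maps; it is not the paper's actual APGLG procedure, and the function name, the 150-pixel width normalization, and the rectangle conventions are illustrative assumptions.

```python
import numpy as np
from skimage.draw import polygon  # used only to rasterize the rectangle


def rect_to_pixel_labels(center, angle, width, length, shape):
    """Rasterize one grasp rectangle into continuous per-pixel maps.

    center : (row, col) of the rectangle center in pixels
    angle  : grasp angle in radians
    width  : gripper opening (extent along the grasp axis), pixels
    length : jaw size (extent across the grasp axis), pixels
    shape  : (H, W) of the output maps
    """
    q_map   = np.zeros(shape, dtype=np.float32)  # grasp quality
    cos_map = np.zeros(shape, dtype=np.float32)  # cos(2 * angle)
    sin_map = np.zeros(shape, dtype=np.float32)  # sin(2 * angle)
    w_map   = np.zeros(shape, dtype=np.float32)  # normalized width

    # Axis-aligned corner offsets (row, col), rotated by the grasp angle.
    dy, dx = length / 2.0, width / 2.0
    corners = np.array([[-dy, -dx], [-dy, dx], [dy, dx], [dy, -dx]])
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle),  np.cos(angle)]])
    corners = corners @ rot.T + np.asarray(center, dtype=np.float64)

    # Every pixel inside the rectangle becomes a valid, continuously labeled grasp center.
    rr, cc = polygon(corners[:, 0], corners[:, 1], shape)
    q_map[rr, cc]   = 1.0
    cos_map[rr, cc] = np.cos(2.0 * angle)   # 2*angle removes the 180-degree ambiguity
    sin_map[rr, cc] = np.sin(2.0 * angle)
    w_map[rr, cc]   = min(width / 150.0, 1.0)  # 150 px scale is an assumed normalization

    return q_map, cos_map, sin_map, w_map
```

Compared with keeping a handful of discrete rectangles, filling the rectangle's interior this way supervises every pixel it covers, which is the kind of densified signal the abstract credits for improved generalization.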