Abstract

Deep learning methods have achieved excellent results in grasp detection. However, deep learning-based models for general object detection lack a proper balance between accuracy and inference speed, resulting in poor performance on real-time grasping tasks. This work proposes an efficient grasp detection network that takes n-channel images as input for robotic grasping. The proposed network is a lightweight, one-stage generative architecture for grasp detection. Specifically, a Gaussian kernel-based grasp representation is introduced to encode the training samples, so that the grasp center corresponds to the point of highest grasp confidence. A receptive field block is plugged into the bottleneck to improve the model's feature discriminability. In addition, pixel-based and channel-based attention mechanisms are combined into a multidimensional attention fusion network that fuses valuable semantic information by suppressing noisy features and highlighting object features. The proposed method is evaluated on the Cornell, Jacquard, and extended OCID grasp datasets. The experimental results show that our method achieves an excellent balance between accuracy and running speed: the network runs at 6 ms per image and reaches 97.8%, 95.6%, and 76.4% accuracy on the Cornell, Jacquard, and extended OCID grasp datasets, respectively. Finally, a high grasp success rate is obtained in a physical environment using a UR5 robot arm.
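To make the encoding concrete, the sketch below illustrates one plausible reading of the Gaussian kernel-based grasp representation described above: each training sample's grasp quality map is a 2D Gaussian whose peak (value 1.0) sits at the annotated grasp center, so confidence decays smoothly with distance from that center. This is a minimal illustration, not the paper's implementation; the function name `gaussian_grasp_heatmap`, the map size, and the `sigma` value are assumptions for demonstration only.

```python
import numpy as np

def gaussian_grasp_heatmap(height, width, center, sigma=2.0):
    """Encode a grasp center as a 2D Gaussian quality map.

    The map peaks at 1.0 at the grasp center and decays with
    distance, so pixels nearer the center carry higher confidence.
    (Hypothetical sketch; parameters are not from the paper.)
    """
    cx, cy = center
    ys, xs = np.mgrid[0:height, 0:width]
    # Squared Euclidean distance of every pixel from the grasp center.
    d2 = (xs - cx) ** 2 + (ys - cy) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Example: a 224x224 quality map with an assumed grasp center at (100, 120).
quality = gaussian_grasp_heatmap(224, 224, center=(100, 120), sigma=2.0)
assert quality.max() == 1.0  # confidence is highest at the center
```

Encoding targets this way, rather than as hard binary masks, gives the regression loss a smooth gradient toward the true center, which is the usual motivation for Gaussian heatmap targets in detection networks.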
