Abstract

In this paper, we present a novel approach that combines a transformer with a grasping CNN to achieve better grasps in complex real-world scenes. The approach comprises two designs that improve grasp precision in cluttered scenes. The first uses self-attention to capture contextual information from RGB images, increasing the contrast between key object features and their surroundings; we tune the internal parameters to balance accuracy and computational cost. The second builds a feature fusion bridge that processes all one-dimensional sequence features at once to produce a coherent visual representation for the detection stage, ensuring a seamless combination of the transformer block and the CNN. Together, these designs suppress noise features in complex backgrounds and emphasize graspable object features, providing useful semantic information to the subsequent grasping CNN for appropriate grasp prediction. We evaluated the approach on the Cornell and VMRD datasets. Experimental results show that our method outperforms the original grasping CNN in both single-object and multi-object scenarios, achieving 97.7% and 72.2% accuracy on the Cornell and VMRD grasp datasets, respectively, using RGB input.
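
To make the high-level description concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of a transformer encoder over RGB patch tokens followed by a fusion bridge that reshapes the one-dimensional token sequence back into a two-dimensional feature map for a grasp-prediction CNN. All module names, dimensions, and the four-channel grasp head are assumptions introduced for illustration.

```python
# Hypothetical sketch of a transformer + grasping-CNN hybrid (illustrative only).
import torch
import torch.nn as nn


class FusionBridge(nn.Module):
    """Reshape a (B, N, C) token sequence into a (B, C, H, W) feature map for a CNN."""

    def __init__(self, grid_size: int):
        super().__init__()
        self.grid_size = grid_size

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b, n, c = tokens.shape
        assert n == self.grid_size ** 2, "token count must match the spatial grid"
        return tokens.transpose(1, 2).reshape(b, c, self.grid_size, self.grid_size)


class TransformerGraspNet(nn.Module):
    def __init__(self, embed_dim=256, patch=16, grid=14, heads=8, layers=4):
        super().__init__()
        # Patch embedding: split the RGB image into a grid of patch tokens.
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, grid * grid, embed_dim))
        # Self-attention blocks capture global context between objects and background.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=heads, dim_feedforward=embed_dim * 4,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        # Fusion bridge: 1-D sequence features -> 2-D map consumed by the grasp CNN.
        self.bridge = FusionBridge(grid)
        # Minimal grasp head: per-cell grasp quality, angle (sin/cos), and width maps.
        self.grasp_head = nn.Sequential(
            nn.Conv2d(embed_dim, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 4, 1))  # [quality, sin(2θ), cos(2θ), width]

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        tokens = self.patch_embed(rgb).flatten(2).transpose(1, 2)  # (B, N, C)
        tokens = self.encoder(tokens + self.pos_embed)
        fmap = self.bridge(tokens)                                  # (B, C, H, W)
        return self.grasp_head(fmap)


if __name__ == "__main__":
    net = TransformerGraspNet()
    out = net(torch.randn(1, 3, 224, 224))  # 224 / 16 = 14-token grid
    print(out.shape)                         # torch.Size([1, 4, 14, 14])
```

In this sketch, the fusion bridge is a simple reshape; the paper's bridge additionally processes all sequence features jointly before handing them to the detection stage, and the grasp head stands in for whichever grasping CNN is used downstream.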
