In complex environments with multi-object stacking, the spatial relationships between objects necessitate a sequential grasping strategy to ensure both the safety of target objects and the efficiency of robotic-arm operations. To address this challenge, this study introduces a Visual Manipulation Relationship Network (VMRN) to determine the optimal grasping sequence. Traditional VMRN frameworks typically rely on convolutional neural networks (CNNs) for feature extraction, which often struggle to capture high-frequency features and to cope with long-tail data distributions and real-time computational demands in multi-object stacking scenarios. To overcome these limitations, we propose a lightweight, convolution-free Transformer-based feature extraction network integrated into the visual detection model. The network is tailored to visual reasoning and optimized for lightweight operation, strengthening feature extraction for stacked objects. It incorporates local window attention, global information aggregation and broadcasting, and a dual-dimensional attention-based feedforward network to improve feature representation. Additionally, a novel loss function is designed to mitigate the over-suppression of rare categories in imbalanced datasets and the resulting degradation in long-tail detection performance. Experimental results demonstrate that the proposed model significantly improves both detection accuracy and computational efficiency, making it well suited to real-time robotic grasping tasks in complex environments.
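The abstract does not include implementation details, but the combination of local window attention with global aggregation and broadcasting can be illustrated by a minimal sketch. Everything below (the module name `WindowGlobalAttention`, mean-pooling as the aggregation step, and the shapes and hyperparameters) is an assumption made for illustration, not the authors' implementation.

```python
# Hedged sketch, not the paper's code: attention inside non-overlapping windows,
# then per-window summary tokens exchange information globally and are
# broadcast back to their windows. Assumes H and W are divisible by `window`.
import torch
import torch.nn as nn


class WindowGlobalAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, window: int = 7):
        super().__init__()
        self.window = window
        # Attention restricted to each window (local mixing).
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Attention across per-window summary tokens (global mixing).
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) feature map from the patch-embedding stage.
        B, H, W, C = x.shape
        w = self.window
        # Partition into (B * num_windows, w*w, C) token groups.
        x = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        tokens = x.reshape(-1, w * w, C)

        # 1) Local window attention with a residual connection.
        t = self.norm1(tokens)
        local, _ = self.local_attn(t, t, t)
        tokens = tokens + local

        # 2) Global aggregation: mean-pool each window into one summary token,
        #    let summaries attend to each other, then broadcast the result back.
        n_win = tokens.shape[0] // B
        summary = tokens.mean(dim=1).view(B, n_win, C)
        s = self.norm2(summary)
        mixed, _ = self.global_attn(s, s, s)
        tokens = tokens + mixed.reshape(B * n_win, 1, C)

        # Reverse the window partition back to (B, H, W, C).
        x = tokens.view(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        return x.reshape(B, H, W, C)


# Usage with assumed sizes: a 56x56 feature map, 96 channels, 7x7 windows.
feat = torch.randn(2, 56, 56, 96)
out = WindowGlobalAttention(dim=96)(feat)
print(out.shape)  # torch.Size([2, 56, 56, 96])
```

Restricting full attention to windows keeps the cost roughly linear in the number of tokens, while the pooled summary tokens provide the long-range context that stacked-object reasoning needs; the actual aggregation and broadcasting mechanism in the paper may differ from the mean-pooling used here.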