Abstract

Object recognition with multimodal representations has recently attracted great interest in intelligent robotics, where visual-tactile fusion learning shows potential to improve performance. However, existing approaches focus primarily on capturing complementary features from the two modalities, while ignoring the disparities between vision and touch and overlooking the fusion of features at different scales. In this article, we propose an alignment and multi-scale fusion method (AMSF) for robotic object recognition to address these challenges. The proposed method exploits a novel contrastive-learning-based alignment strategy across the two modalities to yield a better-grounded visual and tactile representation. To capture interactive information during fusion, a transformer-based multi-scale fusion module integrates features at different scales from the two modalities and generates a unified representation for each pair of visual and tactile data. Extensive experiments on three public datasets validate the superiority of our method, and ablation studies demonstrate the effectiveness of the two proposed modules.
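To make the two ideas in the abstract concrete, the sketch below shows (i) a symmetric contrastive (InfoNCE-style) loss that aligns paired visual and tactile embeddings, and (ii) a minimal transformer encoder that fuses tokens drawn from two feature scales of both modalities. This is an illustrative sketch, not the authors' released implementation: the function and module names, feature dimensions, and pooling choices are all assumptions made here for demonstration.

```python
# Illustrative sketch of contrastive visual-tactile alignment and
# transformer-based multi-scale fusion. All names and shapes are
# hypothetical placeholders, not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def visuo_tactile_contrastive_loss(visual_emb, tactile_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    visual_emb, tactile_emb: (B, D) tensors; matching rows are positive pairs.
    """
    v = F.normalize(visual_emb, dim=-1)
    t = F.normalize(tactile_emb, dim=-1)
    logits = v @ t.T / temperature                 # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, targets)    # vision -> touch direction
    loss_t2v = F.cross_entropy(logits.T, targets)  # touch -> vision direction
    return 0.5 * (loss_v2t + loss_t2v)


class MultiScaleFusion(nn.Module):
    """Fuse visual and tactile tokens from two scales with a transformer encoder."""

    def __init__(self, dims=(256, 512), d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        # Project each scale of each modality into a shared token dimension.
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in dims])
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, visual_scales, tactile_scales):
        # visual_scales / tactile_scales: lists of (B, N_i, dims[i]) token tensors.
        tokens = []
        for i, (v, t) in enumerate(zip(visual_scales, tactile_scales)):
            tokens.append(self.proj[i](v))
            tokens.append(self.proj[i](t))
        # Joint self-attention lets tokens interact across modalities and scales.
        fused = self.encoder(torch.cat(tokens, dim=1))
        return fused.mean(dim=1)                   # pooled multimodal representation


if __name__ == "__main__":
    B = 8
    loss = visuo_tactile_contrastive_loss(torch.randn(B, 256), torch.randn(B, 256))
    fusion = MultiScaleFusion()
    rep = fusion([torch.randn(B, 49, 256), torch.randn(B, 16, 512)],
                 [torch.randn(B, 49, 256), torch.randn(B, 16, 512)])
    print(loss.item(), rep.shape)  # scalar loss, (8, 256) fused representation
```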
