Abstract

Visual-tactile fusion learning for robotic object recognition has achieved appealing performance because visual and tactile data offer complementary information. However, 1) the distinct gap between vision and touch makes it difficult to fully exploit this complementary information, which can lead to performance degradation, and 2) most existing visual-tactile fusion learning methods assume that visual and tactile data are complete, an assumption that is often difficult to satisfy in real-world applications. In this article, we propose a partial visual-tactile fused (PVTF) framework for robotic object recognition to address these challenges. Specifically, we first employ two modality-specific (MS) encoders to encode partial visual-tactile data into two incomplete subspaces (i.e., a visual subspace and a tactile subspace). Then, a modality gap mitigated (MGM) network is adopted to discover modality-invariant high-level label information, which is used to generate a gap loss that in turn helps update the MS encoders so that they produce relatively consistent visual and tactile subspaces. In this way, the large gap between vision and touch is mitigated, which further facilitates mining the complementary visual-tactile information. Finally, to achieve data completeness and exploration of complementary visual-tactile information simultaneously, a cycle subspace learning technique is proposed to project the incomplete subspaces into a complete subspace by fully exploiting all obtainable samples, where complete latent representations with maximum complementary information can be learned. Extensive comparative experiments on three visual-tactile datasets validate the advantage of the proposed PVTF framework over state-of-the-art baselines.
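
The sketch below is an illustrative, simplified reading of the components named in the abstract (modality-specific encoders and a gap loss driven by label-level consistency), not the authors' implementation; the network sizes, feature dimensions, and loss choice are assumptions for illustration only.

```python
# Hedged sketch of the PVTF-style components described in the abstract.
# All dimensions, layer choices, and the MSE-based gap loss are assumptions.
import torch
import torch.nn as nn


class ModalitySpecificEncoder(nn.Module):
    """Encodes one modality (vision or touch) into its own subspace."""

    def __init__(self, in_dim: int, latent_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class MGMHead(nn.Module):
    """Maps a subspace representation to label logits; disagreement between
    the visual and tactile predictions acts as a modality-gap signal."""

    def __init__(self, latent_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(latent_dim, num_classes)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.classifier(z)


# Hypothetical feature sizes for a batch of paired visual/tactile samples.
visual_enc = ModalitySpecificEncoder(in_dim=512, latent_dim=64)
tactile_enc = ModalitySpecificEncoder(in_dim=128, latent_dim=64)
mgm = MGMHead(latent_dim=64, num_classes=10)

x_v = torch.randn(8, 512)   # visual features
x_t = torch.randn(8, 128)   # paired tactile features

z_v, z_t = visual_enc(x_v), tactile_enc(x_t)

# Gap loss: encourage the two modality-specific subspaces to yield consistent
# high-level (label) predictions, mitigating the vision-touch gap.
gap_loss = nn.functional.mse_loss(
    mgm(z_v).softmax(dim=-1), mgm(z_t).softmax(dim=-1)
)
print(gap_loss.item())
```

In a full pipeline this gap loss would be combined with the recognition objective and the cycle subspace learning step that handles missing-modality samples; those parts are omitted here because the abstract does not specify their form.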
