Abstract

Humans use multimodal sensory information to understand the physical properties of their environment. Intelligent decision-making systems, such as those used in robotic applications, can likewise fuse multimodal information to improve their performance and reliability. In recent years, machine learning and deep learning methods have been at the heart of such intelligent systems. Developing visuo-tactile models is challenging because of limited datasets and the need to balance accuracy, reliability, and computational efficiency. In this research, we propose four efficient models based on dynamic neural network architectures for unimodal and multimodal object recognition. For unimodal object recognition, we propose TactileNet and VisionNet. For multimodal object recognition, FusionNet-A and FusionNet-B are designed to implement early and late fusion strategies, respectively. The proposed models have a flexible structure and can adapt at training or test time to the amount of available information. Model confidence calibration is employed to enhance the reliability and generalization of the models. The proposed models are evaluated on the MIT CSAIL large-scale multimodal dataset, and the results demonstrate accurate performance in both unimodal and multimodal scenarios. By using different fusion strategies and augmenting the tactile-based models with visual information, the top-1 error rate of the single-frame tactile model was reduced by 78% and the mean average precision was increased by 2.19 times. Although the focus here is on fusing tactile and visual modalities, the proposed design methodology generalizes to additional modalities.
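
To make the early versus late fusion distinction concrete, the sketch below shows a minimal PyTorch-style implementation of both strategies, including a late-fusion branch that can be dropped when only tactile input is available. The class names (EarlyFusionNet, LateFusionNet), feature dimensions, and layer choices are illustrative assumptions and do not reproduce the actual FusionNet-A or FusionNet-B architectures.

```python
# Hypothetical sketch of early vs. late fusion; sizes and names are assumptions,
# not the authors' FusionNet-A/FusionNet-B implementations.
import torch
import torch.nn as nn

class EarlyFusionNet(nn.Module):
    """Early fusion: tactile and visual features are concatenated before a shared head."""
    def __init__(self, tactile_dim=128, visual_dim=512, num_classes=10):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(tactile_dim + visual_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, tactile_feat, visual_feat):
        return self.head(torch.cat([tactile_feat, visual_feat], dim=-1))

class LateFusionNet(nn.Module):
    """Late fusion: each modality has its own classifier; their logits are averaged."""
    def __init__(self, tactile_dim=128, visual_dim=512, num_classes=10):
        super().__init__()
        self.tactile_head = nn.Linear(tactile_dim, num_classes)
        self.visual_head = nn.Linear(visual_dim, num_classes)

    def forward(self, tactile_feat, visual_feat=None):
        logits = self.tactile_head(tactile_feat)
        if visual_feat is not None:  # the visual branch can be omitted at test time
            logits = 0.5 * (logits + self.visual_head(visual_feat))
        return logits
```

In this reading, late fusion naturally accommodates a missing modality at test time, which is one way a model can adapt to the amount of available information as described above.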
