During human-robot interaction, touch gesture and emotion recognition (TGER) can help robots perceive and infer human intentions and emotional states, enabling a more harmonious human-robot relationship. Both conventional machine learning methods based on handcrafted features and end-to-end deep learning networks have been used for TGER. However, existing TGER methods are mainly based on single-task models, i.e., touch gestures and emotions are recognized separately with different models, which suffer from relatively low recognition accuracy, high computational complexity, or high memory burden. To address these problems, this paper proposes multi-scale spatiotemporal convolutions with attention mechanisms for touch gesture and emotion recognition (MUSCAT), and establishes a multi-task TGER framework for overall model construction. MUSCAT uses a factorized spatiotemporal convolutional neural network to sequentially extract spatial and temporal features, with multi-scale convolutions applied in the temporal stage. Attention mechanisms are applied after the spatial, temporal, and point-wise convolutions to focus on key information in the spatial, temporal, and channel dimensions, respectively. The proposed MUSCAT model has a deeper structure than standard three-dimensional convolution, achieving better performance with fewer model parameters. Extensive experiments were conducted on two touch datasets, and the results verify the feasibility of the proposed method.
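To make the factorized design concrete, the following is a minimal PyTorch sketch of one spatiotemporal block as the abstract describes it: a spatial (1 x k x k) convolution, parallel multi-scale temporal (k x 1 x 1) convolutions, and a point-wise fusion convolution, each followed by an attention gate. The abstract does not specify the attention design, layer widths, or task heads, so the squeeze-and-excitation-style channel gate, the kernel sizes, and the names (`ChannelAttention`, `FactorizedSTBlock`, `MultiTaskTGER`) are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style gate; a stand-in for the paper's attention."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (B, C, T, H, W)
        w = x.mean(dim=(2, 3, 4))               # global average pool -> (B, C)
        w = self.fc(w).view(x.size(0), -1, 1, 1, 1)
        return x * w                            # reweight channels


class FactorizedSTBlock(nn.Module):
    """Spatial conv -> multi-scale temporal convs -> point-wise fusion,
    with an attention gate after each stage (per the abstract's description)."""
    def __init__(self, in_ch, out_ch, temporal_ks=(3, 5, 7)):
        super().__init__()
        # Spatial convolution: 1 x k x k kernel acts only on H and W.
        self.spatial = nn.Conv3d(in_ch, out_ch, (1, 3, 3), padding=(0, 1, 1))
        self.att_spatial = ChannelAttention(out_ch)
        # Multi-scale temporal convolutions: parallel k x 1 x 1 kernels.
        self.temporal = nn.ModuleList(
            nn.Conv3d(out_ch, out_ch, (k, 1, 1), padding=(k // 2, 0, 0))
            for k in temporal_ks
        )
        self.att_temporal = ChannelAttention(out_ch * len(temporal_ks))
        # Point-wise (1 x 1 x 1) convolution fuses the multi-scale branches.
        self.pointwise = nn.Conv3d(out_ch * len(temporal_ks), out_ch, 1)
        self.att_channel = ChannelAttention(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                       # x: (B, C, T, H, W)
        x = self.att_spatial(self.act(self.spatial(x)))
        x = torch.cat([branch(x) for branch in self.temporal], dim=1)
        x = self.att_temporal(self.act(x))
        return self.att_channel(self.act(self.pointwise(x)))


class MultiTaskTGER(nn.Module):
    """Hypothetical multi-task head: shared backbone, two classification outputs."""
    def __init__(self, num_gestures, num_emotions):
        super().__init__()
        self.backbone = nn.Sequential(
            FactorizedSTBlock(1, 16),           # input: single-channel pressure maps
            FactorizedSTBlock(16, 32),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
        )
        self.gesture_head = nn.Linear(32, num_gestures)
        self.emotion_head = nn.Linear(32, num_emotions)

    def forward(self, x):                       # x: (B, 1, T, H, W) touch sequence
        feat = self.backbone(x)
        return self.gesture_head(feat), self.emotion_head(feat)
```

The shared backbone with per-task linear heads reflects the multi-task framing in the abstract: one model recognizes gestures and emotions jointly, so parameters are shared and only the output layers differ. A forward pass with `MultiTaskTGER(6, 4)(torch.randn(2, 1, 16, 8, 8))` returns a (2, 6) gesture logit tensor and a (2, 4) emotion logit tensor.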