Abstract

Various physiological signals can reflect human emotional states objectively. How to exploit the common as well as complementary properties of different physiological signals in representing emotional states is an interesting problem. Although various models have been constructed to fuse multimodal physiological signals for emotion recognition, the possible incongruity among different physiological signals in representing emotional states and the redundancy resulting from fusion, both of which may seriously degrade the performance of fusion schemes, have seldom been considered. To this end, a fusion model is proposed that can eliminate the incongruity among different physiological signals and reduce the information redundancy to some extent. First, one physiological signal is chosen as the primary modality owing to its prominent performance in emotion recognition, and the remaining physiological signals are treated as auxiliary modalities. Secondly, the Cross Modal Transformer (CMT) is adopted to optimize the features of the auxiliary modalities by eliminating the incongruity among them, and Low Rank Fusion (LRF) is then performed to eliminate the information redundancy caused by fusion. Thirdly, a modified CMT (MCMT) is constructed to enhance the primary modality feature with each optimized auxiliary modality feature. Fourthly, a Self-Attention Transformer (SAT) is applied to the concatenation of all the enhanced primary modality features to take full advantage of the common as well as complementary properties among them in representing emotional states. Finally, the enhanced primary modality feature and the optimized auxiliary features are fused by concatenation for emotion recognition. Extensive experimental results on the DEAP and WESAD datasets demonstrate that (i) incongruity does exist among different physiological signals, and the CMT-based auxiliary modality feature optimization strategy can eliminate it effectively; (ii) the emotion prediction accuracy of the primary modality can be enhanced by the auxiliary modalities; (iii) all the key modules in the proposed model, CMT, LRF, and MCMT, contribute to its performance gains; and (iv) the proposed model outperforms State-Of-The-Art (SOTA) models on the emotion recognition task.
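To make the staged pipeline concrete, the following is a minimal PyTorch sketch of how the described components (CMT, LRF, MCMT, SAT, final concatenation) could be wired together. Everything here is an assumption for illustration only: the module structures, dimensions, the shared-weight and single-layer simplifications, and the choice of EEG as the primary modality with EDA and respiration as auxiliaries are hypothetical and are not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class CrossModalTransformer(nn.Module):
    """Cross-attention block: queries come from x, keys/values from y."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x, y):
        # x is refined by attending to modality y.
        h, _ = self.attn(query=x, key=y, value=y)
        x = self.norm1(x + h)
        return self.norm2(x + self.ffn(x))


class LowRankFusion(nn.Module):
    """Rank-r factorized bilinear fusion of two modality features."""

    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        self.proj_a = nn.Linear(dim, rank * dim)
        self.proj_b = nn.Linear(dim, rank * dim)
        self.rank, self.dim = rank, dim

    def forward(self, a, b):
        # Summing element-wise products of rank-r factors approximates a
        # full outer-product fusion at far lower parameter cost.
        pa = self.proj_a(a).view(*a.shape[:-1], self.rank, self.dim)
        pb = self.proj_b(b).view(*b.shape[:-1], self.rank, self.dim)
        return (pa * pb).sum(dim=-2)


class FusionModel(nn.Module):
    """End-to-end sketch: one primary modality, two auxiliaries."""

    def __init__(self, dim: int = 64, n_classes: int = 2):
        super().__init__()
        self.cmt = CrossModalTransformer(dim)    # auxiliary feature optimization
        self.lrf = LowRankFusion(dim)            # redundancy-reducing fusion
        self.mcmt = CrossModalTransformer(dim)   # primary-feature enhancement
        self.sat = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.head = nn.Linear(2 * dim, n_classes)

    def forward(self, primary, aux1, aux2):
        # Step 2: optimize each auxiliary feature against the other (CMT),
        # then fuse them with low-rank fusion (LRF).
        a1 = self.cmt(aux1, aux2)
        a2 = self.cmt(aux2, aux1)
        aux = self.lrf(a1, a2)
        # Step 3: enhance the primary feature by each optimized auxiliary (MCMT).
        p1 = self.mcmt(primary, a1)
        p2 = self.mcmt(primary, a2)
        # Step 4: self-attention (SAT) over the concatenated enhanced primaries.
        p = self.sat(torch.cat([p1, p2], dim=1))
        # Step 5: concatenate pooled enhanced-primary and auxiliary features.
        z = torch.cat([p.mean(dim=1), aux.mean(dim=1)], dim=-1)
        return self.head(z)


# Usage: three modality feature streams of shape (batch, time, dim).
model = FusionModel()
eeg, eda, resp = (torch.randn(8, 32, 64) for _ in range(3))
logits = model(eeg, eda, resp)   # -> (8, 2)
```

Note the design intuition the abstract argues for: cross-attention (CMT/MCMT) aligns modalities before fusion so that incongruent evidence is reconciled rather than averaged, while the low-rank factorization keeps the fused representation from duplicating information already carried by the individual streams.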
