TCA-NET: Triplet Concatenated-Attentional Network for Multimodal Engagement Estimation

Abstract

Human social interactions involve intricate social signals that artificial intelligence and machine learning models aim to decipher, particularly in the context of artificial mediators that can enhance human interactions in domains such as education and healthcare. Engagement, a key aspect of these interactions, relies heavily on multimodal information such as facial expressions, voice, and posture. Many deep learning methods have recently been deployed for engagement estimation. However, they often rely on a single modality or a pair of modalities, so their results lack robustness and adaptability in the face of noise and varying individual responses. To address this challenge, we introduce a novel modality fusion framework named Triplet Concatenated-Attentional Net (TCA-Net). This framework takes three distinct data modalities (video, audio, and Kinect) as inputs and outputs a prediction score. Within this network, a specially designed concatenated-attention fusion mechanism fuses the modalities while preserving intra-modal features. Experimental results validate the effectiveness of TCA-Net in enhancing the accuracy and reliability of engagement estimation across diverse scenarios, with a test-set Concordance Correlation Coefficient (CCC) of 0.75. We release our code at https://github.com/Daming-W/Multimodal_Engagement_Estimation.
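The CCC reported above is a standard agreement metric (Lin, 1989) that penalizes both decorrelation and shifts in mean or scale between predictions and ground truth. The sketch below shows its textbook definition in NumPy; it is an illustration of the metric, not code taken from the paper's repository.

```python
import numpy as np

def concordance_ccc(y_true, y_pred):
    """Concordance Correlation Coefficient (Lin, 1989).

    CCC = 2 * cov(t, p) / (var(t) + var(p) + (mean(t) - mean(p))^2)

    Equals 1 only for perfect agreement; unlike Pearson correlation,
    it drops when predictions are biased or wrongly scaled.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()
    return 2.0 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)

# Identical sequences agree perfectly; a constant offset lowers the score
# even though the Pearson correlation would still be 1.
print(concordance_ccc([1, 2, 3], [1, 2, 3]))  # 1.0
print(concordance_ccc([1, 2, 3], [2, 3, 4]))  # 4/7 ≈ 0.571
```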
