Context-aware emotion recognition (CAER) leverages comprehensive scene information, including facial expressions, body postures, and contextual background. However, current studies predominantly rely on facial expressions, body postures, and global contextual features; the interactions between the agents (target individuals) and other objects in the scene are usually absent or only partially modeled. In this article, a three-dimensional view relationship-based CAER (TDRCer) method is proposed, which comprises two branches: the personal emotional branch (PEB) and the contextual emotional branch (CEB). First, the PEB is designed to extract facial expression and body posture features from the agent. A vision transformer (ViT), pretrained by contrastive learning with a novel loss function combining Euclidean distance and cosine similarity, is applied to enhance the robustness of the facial expression features. Meanwhile, human body contour images extracted by semantic segmentation are fed into another ViT to extract body posture features. Second, the CEB is constructed to extract global contextual features and the interactive relationships among objects in the scene. Images with the agents' bodies masked out are fed into a ViT to extract global contextual features. By leveraging both the gaze angle and the depth map, a three-dimensional view graph (3DVG) is constructed to represent the interactive relationships between agents and objects in the scene. A graph convolutional network is then employed to extract interactive relationship features from the 3DVG. Finally, a multiplicative fusion strategy is applied to fuse the features of the two branches, and the fused features are used to classify emotions. TDRCer achieves an accuracy of 89.90% on the CAER-S dataset and a mean average precision (mAP) of 36.02% on the EMOTions In Context (EMOTIC) dataset. The code can be accessed at https://github.com/mengTender/TDRCer.
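The abstract states that the facial-expression ViT is pretrained with a contrastive loss combining Euclidean distance and cosine similarity, but does not give its exact form. Below is a minimal PyTorch-style sketch of one plausible way to combine the two terms in a pairwise contrastive objective; the weighting `alpha`, the margin-based formulation, and all function names are illustrative assumptions rather than the authors' implementation (the actual code is in the linked repository).

```python
import torch
import torch.nn.functional as F


def combined_distance(z1: torch.Tensor, z2: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Distance mixing Euclidean distance with (1 - cosine similarity).

    `alpha` is a hypothetical weighting; the abstract only states that the
    two terms are combined, not how they are weighted.
    """
    euclidean = torch.norm(z1 - z2, p=2, dim=-1)
    cosine = F.cosine_similarity(z1, z2, dim=-1)
    return alpha * euclidean + (1.0 - alpha) * (1.0 - cosine)


def contrastive_loss(z1: torch.Tensor, z2: torch.Tensor,
                     label: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """Pairwise contrastive loss built on the combined distance.

    `label` is 1 for pairs with the same expression class and 0 otherwise.
    The margin-based form is an assumption; the paper's exact formulation
    may differ.
    """
    d = combined_distance(z1, z2)
    pos = label * d.pow(2)                      # pull same-class embeddings together
    neg = (1 - label) * F.relu(margin - d).pow(2)  # push different-class embeddings apart
    return (pos + neg).mean()
```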