Affective body expression recognition technology enables machines to interpret non-verbal emotional signals from human movements, which is crucial for facilitating natural and empathetic human–machine interaction (HMI). This work proposes a new framework for emotion recognition from body movements, providing a universal and effective solution for decoding the temporal–spatial mapping between emotions and body expressions. Compared with previous studies, our approach extracts interpretable temporal and spatial features by constructing a body expression energy model (BEEM) and a multi-input symmetric positive definite matrix network (MSPDnet). In particular, the temporal features extracted from the BEEM reveal the energy distribution, dynamical complexity, and frequency activity of the body expression under different emotions, while the spatial features obtained by MSPDnet capture the spatial Riemannian properties between body joints. Furthermore, this paper introduces an attentional temporal–spatial feature fusion (ATSFF) algorithm to adaptively fuse temporal and spatial features with different semantics and scales, significantly improving the discriminability and generalizability of the fused features. The proposed method achieves recognition accuracies over 90% across four public datasets, outperforming most state-of-the-art approaches.
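To make the fusion idea concrete, the following is a minimal, hypothetical sketch of attention-based fusion of a temporal and a spatial feature vector of different dimensions; it illustrates the general principle of adaptively weighting two feature streams, not the paper's actual ATSFF algorithm, and all class names, dimensions, and projection choices here are assumptions.

```python
import torch
import torch.nn as nn


class AttentionalFeatureFusion(nn.Module):
    """Illustrative attention-weighted fusion of two feature streams
    (e.g., temporal and spatial features with different dimensions).
    This is a generic sketch, not the ATSFF algorithm from the paper."""

    def __init__(self, temporal_dim: int, spatial_dim: int, fused_dim: int = 128):
        super().__init__()
        # Project both streams into a shared embedding space.
        self.temporal_proj = nn.Linear(temporal_dim, fused_dim)
        self.spatial_proj = nn.Linear(spatial_dim, fused_dim)
        # Scalar attention score per stream, normalized with softmax.
        self.score = nn.Linear(fused_dim, 1)

    def forward(self, temporal_feat: torch.Tensor, spatial_feat: torch.Tensor) -> torch.Tensor:
        # (batch, fused_dim) embeddings for each stream
        t = torch.tanh(self.temporal_proj(temporal_feat))
        s = torch.tanh(self.spatial_proj(spatial_feat))
        # (batch, 2) attention weights over the two streams
        weights = torch.softmax(torch.cat([self.score(t), self.score(s)], dim=1), dim=1)
        # Weighted sum gives the fused representation, shape (batch, fused_dim)
        return weights[:, :1] * t + weights[:, 1:] * s


# Example: fuse a 64-dim temporal vector with a 256-dim spatial vector.
fusion = AttentionalFeatureFusion(temporal_dim=64, spatial_dim=256)
fused = fusion(torch.randn(8, 64), torch.randn(8, 256))
print(fused.shape)  # torch.Size([8, 128])
```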