Human action recognition (HAR) is a critical area in computer vision with wide-ranging applications, including video surveillance, healthcare monitoring, and abnormal behavior detection. Current HAR methods predominantly rely on full-body data, which can limit their effectiveness in real-world scenarios where occlusion is common. In such situations, the face often remains visible, providing valuable cues for action recognition. This paper introduces Face in Action (FIA), a novel two-stream method that leverages facial action cues for robust action recognition under conditions of significant occlusion. FIA consists of an RGB stream and a landmark stream. The RGB stream processes facial image sequences using a fine-spatio-multitemporal (FSM) 3D convolution module, which employs smaller spatial receptive fields to capture detailed local facial movements and larger temporal receptive fields to model broader temporal dynamics. The landmark stream processes facial landmark sequences using a normalized temporal attention (NTA) module within an NTA-GCN block, enhancing the detection of key facial frames and improving overall recognition accuracy. We validate the effectiveness of FIA using the NTU RGB+D and NTU RGB+D 120 datasets, focusing on action categories related to medical conditions. Our experiments demonstrate that FIA significantly outperforms existing methods in scenarios with extensive occlusion, highlighting its potential for practical applications in surveillance and healthcare settings.
Read full abstract