In dynamic facial expression sequences captured in natural environments, the number and distribution of expression peak frames (key frames) vary significantly across video sequences. Efficient extraction and utilization of these key expression frames are crucial for improving the accuracy of expression recognition. To address these challenges, this paper proposes the Key Frame Extraction via Semantic Coherence (KFE-SC) method. The method explores the spatial-temporal emotional semantic information within sequences, employing a coarse-to-fine approach to extract key expression frames and thereby facilitate classification. By introducing a Spatial Semantic Coherence (SSC) module and designing a Global-Local Feature Aggregation (GLFA) unit, the method captures spatial semantic correlations at both global and local levels in images; by calculating weight proportions, it achieves a coarse-grained extraction of key frames in the spatial dimension. Since video sequences carry both spatial and temporal semantic relevance, KFE-SC further introduces a Temporal Semantic Coherence (TSC) module, which includes a Temporal Alignment Aggregation (TA2) unit to align short- and long-term temporal information and explore temporal semantic relevance within the sequences, enabling fine-grained extraction of key frames in the temporal dimension through weight calculations.
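The abstract does not specify how the spatial and temporal weights are combined, but the coarse-to-fine selection it describes can be sketched at a high level. In this hypothetical illustration, `spatial_scores` and `temporal_scores` stand in for the per-frame weights an SSC-style and a TSC-style module might produce; the function names and parameters are assumptions, not the paper's API:

```python
def select_key_frames(spatial_scores, temporal_scores, coarse_k=6, fine_k=3):
    """Coarse-to-fine key-frame selection from per-frame weights (sketch)."""
    # Coarse stage: keep the coarse_k frames with the highest spatial weight,
    # a stand-in for the SSC/GLFA spatial weighting described in the abstract.
    coarse = sorted(range(len(spatial_scores)),
                    key=lambda i: spatial_scores[i], reverse=True)[:coarse_k]
    # Fine stage: re-rank the surviving candidates by temporal weight,
    # a stand-in for the TSC/TA2 temporal weighting, and keep fine_k of them.
    fine = sorted(coarse, key=lambda i: temporal_scores[i], reverse=True)[:fine_k]
    # Return the selected key-frame indices in chronological order.
    return sorted(fine)

# Toy example: 10 frames with made-up weights.
spatial = [0.1, 0.8, 0.3, 0.9, 0.2, 0.7, 0.4, 0.95, 0.15, 0.5]
temporal = [0.2, 0.6, 0.1, 0.9, 0.3, 0.4, 0.8, 0.5, 0.2, 0.7]
print(select_key_frames(spatial, temporal))  # -> [3, 6, 9]
```

The two-stage filtering mirrors the abstract's design: a cheap spatial pass prunes the sequence before the temporal pass does finer ranking over the remaining candidates.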