Abstract

In-the-wild dynamic facial expression recognition (DFER) is a very challenging task, and previous methods based on convolutional neural networks (CNNs), recurrent neural networks (RNNs), or Transformers emphasize the extraction of either short-term or long-term temporal information from facial video sequences. Unlike existing methods, this paper proposes a long short-term perception network (LSTPNet) for dynamic facial expression recognition, which jointly perceives both of these temporal cues to benefit the DFER task. Specifically, we propose a long short-term temporal Transformer (LSTformer) that can effectively perceive both long-term and short-term temporal information. In addition, we introduce a temporal channel excitation (TCE) module, extended from the notable efficient channel attention (ECA) module, to establish temporal attention over intermediate features within the backbone network and obtain more temporally representative features. Experimental results on three benchmark datasets demonstrate the state-of-the-art performance of the proposed LSTPNet. The code will be available at https://github.com/LLFabiann/LSTPNet/.
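To make the TCE idea concrete, the following is a minimal NumPy sketch of ECA-style channel attention applied to a temporal feature sequence. The abstract gives no implementation details, so the shapes, the kernel size, and the toy convolution weights here are all assumptions; ECA's core design, a small 1D convolution across channels with no dimensionality reduction, is what the sketch illustrates.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def temporal_channel_excitation(features, kernel_size=3):
    """ECA-style channel attention over a (T, C) feature sequence.

    features: array of shape (T, C), per-frame channel descriptors
    (hypothetical layout; the paper's actual tensor shapes are unknown).
    """
    T, C = features.shape
    # Aggregate over the temporal axis -> one descriptor per channel (C,)
    pooled = features.mean(axis=0)
    # ECA's key idea: a cheap 1D conv across neighboring channels,
    # with no channel reduction; uniform weights are a toy stand-in
    weights = np.ones(kernel_size) / kernel_size
    pad = kernel_size // 2
    padded = np.pad(pooled, pad, mode="edge")
    conv = np.convolve(padded, weights, mode="valid")  # length C for odd kernels
    gates = sigmoid(conv)               # per-channel attention weights in (0, 1)
    return features * gates[None, :]    # re-weight every frame's channels
```

In a real network the gating would act on backbone feature maps and the convolution weights would be learned; this sketch only shows the attention mechanics.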
