Abstract

In-the-wild dynamic facial expression recognition (DFER) is a highly challenging task, and previous methods based on convolutional neural networks (CNNs), recurrent neural networks (RNNs), or Transformers emphasize extracting either short-term or long-term temporal information from facial video sequences. Different from existing methods, this paper proposes a long short-term perception network (LSTPNet) for dynamic facial expression recognition, which jointly perceives both kinds of temporal cues to benefit the DFER task. Specifically, we propose a long short-term temporal Transformer (LSTformer) that effectively perceives both long-term and short-term temporal information. In addition, we introduce a temporal channel excitation (TCE) module, extended from the well-known efficient channel attention (ECA) module, to establish temporal attention over intermediate features within the backbone network and obtain more temporally representative features. Experimental results on three benchmark datasets demonstrate the state-of-the-art performance of the proposed LSTPNet. The code will be available at https://github.com/LLFabiann/LSTPNet/.
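To make the TCE idea concrete, the sketch below shows one plausible way an ECA-style, parameter-light attention could be redirected from the channel axis to the temporal axis of intermediate backbone features. This is only an illustrative assumption: the abstract does not specify the exact TCE design, so the tensor layout, kernel size, and the `TemporalChannelExcitation` class name are hypothetical.

```python
# Hypothetical sketch of a temporal channel excitation (TCE) block in the
# spirit of ECA, re-weighting backbone features along the time axis.
# Shapes and hyperparameters are assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class TemporalChannelExcitation(nn.Module):
    """Gates (B, T, C, H, W) features with attention computed along time."""

    def __init__(self, kernel_size: int = 3):
        super().__init__()
        # A single lightweight 1D convolution, as in ECA, but applied over
        # the temporal dimension instead of the channel dimension.
        self.conv = nn.Conv1d(1, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = x.shape
        # Spatially pooled descriptor per frame and channel: (B, T, C)
        desc = x.mean(dim=(3, 4))
        # Fold channels into the batch so the convolution mixes information
        # only along the temporal axis: (B*C, 1, T)
        desc = desc.permute(0, 2, 1).reshape(b * c, 1, t)
        attn = self.sigmoid(self.conv(desc))              # (B*C, 1, T)
        attn = attn.reshape(b, c, t).permute(0, 2, 1)     # (B, T, C)
        # Re-weight each frame's channel features by its temporal score
        return x * attn.view(b, t, c, 1, 1)


# Usage on a clip of 16 frames with 64-channel intermediate features
features = torch.randn(2, 16, 64, 28, 28)
tce = TemporalChannelExcitation(kernel_size=3)
out = tce(features)  # same shape as the input: (2, 16, 64, 28, 28)
```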
