Abstract

Speech emotion recognition (SER) plays a vital role in intelligent human–computer interaction (HCI). Convolutional Neural Networks (CNNs) are widely used in SER and effectively capture static local features, but they ignore the temporal dynamic features necessary for SER. To address this problem, we insert an Attentional Temporal Dynamic Activation (ATDA) module into a CNN-based model, enabling it to learn static and dynamic features simultaneously. In particular, the ATDA module comprises a Temporal Dynamic Activation (TDA) block followed by a Multi-view and Multi-granularity Attention (MMA) block. The TDA block computes temporal differences at the feature level to activate dynamic information and generate fundamental dynamic features. The MMA block further detects and amplifies emotion-related dynamic features using multiple attention views and granularities. Together, these two blocks activate and extract the dynamic emotional features. Meanwhile, static features are obtained by a convolutional layer and combined with the dynamic features to produce the final emotional representations. Experiments on the IEMOCAP, MSP-IMPROV, and MELD datasets show that the proposed ATDA-CNN model achieves competitive results and improves SER accuracy by learning meaningful emotional representations.
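
As a concrete illustration of the temporal-difference idea, below is a minimal PyTorch-style sketch of a TDA-like block. The abstract does not specify the authors' formulation, so the class name, the zero-padding, the 1x1 projection, and the sigmoid gating are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalDifferenceActivation(nn.Module):
    """Illustrative TDA-style block (hypothetical): activates dynamic
    information by differencing consecutive frames at the feature level
    and using the result to gate the static CNN features."""

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 conv projecting the raw difference map; an assumed design
        # choice, not taken from the paper.
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)
        self.gate = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, freq) feature map from a CNN layer.
        diff = x[:, :, 1:, :] - x[:, :, :-1, :]  # frame-to-frame difference
        diff = F.pad(diff, (0, 0, 0, 1))         # zero-pad to keep the time length
        # Emphasize positions where the spectrum changes over time.
        return x * self.gate(self.proj(diff))

# Dummy usage on a log-Mel-style feature map
feats = torch.randn(4, 32, 100, 40)  # (batch, channels, frames, mel bins)
tda = TemporalDifferenceActivation(32)
print(tda(feats).shape)              # torch.Size([4, 32, 100, 40])
```

The differencing step suppresses information that is constant across frames, so the gate highlights regions of the feature map where the spectrum changes over time, which is the kind of dynamic cue the abstract argues plain CNNs overlook.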
