Abstract

Transformers have achieved great success in many artificial intelligence fields, such as computer vision (CV), audio processing and natural language processing (NLP). In speech emotion recognition (SER), transformer-based architectures usually compute attention in a token-by-token (frame-by-frame) manner, but this approach lacks adequate capacity to capture local emotion information and is easily affected by noise. This paper proposes a novel SER architecture, referred to as block and token self-attention (BAT), that splits a mixed spectrogram into blocks and computes self-attention by combining these blocks with tokens, which alleviates the effect of local noise while capturing authentic sentiment expressions. Furthermore, we present a cross-block attention mechanism to facilitate information interaction among blocks and integrate a frequency compression and channel enhancement (FCCE) module to smooth the attention biases between blocks and tokens. BAT achieves 73.2% weighted accuracy (WA) and 75.2% unweighted accuracy (UA) on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset, surpassing previous state-of-the-art approaches under the same dataset partitioning. Further experimental results show that the proposed method is also well suited to cross-database and cross-domain tasks, achieving 89% WA and 87.4% UA on Emo-DB and a top-1 recognition accuracy of 88.32% with only 15.01 M parameters on the CIFAR-10 image dataset, without data augmentation or pretraining.
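
The abstract only sketches the block-and-token idea at a high level; the following is a minimal, hypothetical illustration of that idea, not the paper's implementation. It assumes non-overlapping blocks along the time axis, standard multi-head attention, and mean pooling as the block summary; the names (BlockTokenAttention, block_size, d_model) and all hyperparameters are illustrative.

```python
# Illustrative sketch only: token-level attention inside each block, then
# block-level attention across block summaries (a stand-in for the paper's
# cross-block attention). Details are assumptions, not the authors' code.
import torch
import torch.nn as nn


class BlockTokenAttention(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4, block_size: int = 8):
        super().__init__()
        self.block_size = block_size
        # Attention among tokens within one block (local emotion cues).
        self.token_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Attention among block summaries (information exchange across blocks).
        self.block_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time frames, d_model), i.e. a spectrogram already
        # projected to d_model along the frequency axis.
        B, T, D = x.shape
        s = self.block_size
        assert T % s == 0, "pad the time axis to a multiple of block_size"
        n_blocks = T // s

        # 1) Token attention inside each block.
        tokens = x.reshape(B * n_blocks, s, D)
        tokens, _ = self.token_attn(tokens, tokens, tokens)

        # 2) Summarize each block and let the summaries attend to each other.
        blocks = tokens.mean(dim=1).reshape(B, n_blocks, D)
        blocks, _ = self.block_attn(blocks, blocks, blocks)

        # 3) Broadcast each block's context back onto its tokens.
        tokens = tokens.reshape(B, n_blocks, s, D) + blocks.unsqueeze(2)
        return tokens.reshape(B, T, D)


if __name__ == "__main__":
    layer = BlockTokenAttention(d_model=64, n_heads=4, block_size=8)
    spec = torch.randn(2, 128, 64)  # (batch, time frames, feature dim)
    print(layer(spec).shape)        # torch.Size([2, 128, 64])
```

Under these assumptions, noise confined to one block mainly perturbs that block's local attention and its single summary vector, which is the intuition behind restricting token attention to blocks before exchanging information across them.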
