Abstract

Emotion is an essential aspect of human life, and effectively identifying emotions in different scenarios helps advance human-computer interaction systems. Emotion classification has therefore gradually become a challenging and popular research field. Compared with text-based emotion analysis, emotion analysis of audio data is still relatively immature. Traditional audio emotion analysis relies on acoustic features such as MFCC and MFSC, combined with temporal-memory models such as LSTMs and RNNs. With the rapid development of transformers and attention mechanisms, many researchers have shifted from the RNN family to the transformer family or to deep learning models equipped with attention. This paper therefore proposes a method that converts audio data into spectrograms and applies a Vision Transformer model based on transfer learning for emotion classification. Experiments are conducted on the IEMOCAP and MELD datasets. The results show that the Vision Transformer achieves emotion classification accuracies of 56.18% on IEMOCAP and 37.1% on MELD.
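
The sketch below illustrates the kind of pipeline the abstract describes: an audio waveform is converted to a log-mel spectrogram, rendered as an image, and fed to an ImageNet-pretrained Vision Transformer whose classification head is replaced for emotion labels. It is a minimal illustration only; the spectrogram parameters, backbone (torchvision's ViT-B/16), label count, and file name are assumptions, not the paper's reported configuration.

```python
# Hypothetical sketch: spectrogram + transfer-learned ViT for emotion classification.
import torch
import torch.nn as nn
import torchaudio
from torchvision.models import vit_b_16, ViT_B_16_Weights

NUM_EMOTIONS = 6  # assumed label count; IEMOCAP/MELD setups vary

# 1. Audio -> log-mel spectrogram (assumed parameter choices)
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=1024, hop_length=256, n_mels=128
)
to_db = torchaudio.transforms.AmplitudeToDB()

def audio_to_image(waveform: torch.Tensor) -> torch.Tensor:
    """Turn a mono waveform into a 3-channel 224x224 'image' for the ViT."""
    spec = to_db(to_mel(waveform))                     # (1, n_mels, frames)
    spec = (spec - spec.min()) / (spec.max() - spec.min() + 1e-8)
    spec = spec.repeat(3, 1, 1).unsqueeze(0)           # (1, 3, n_mels, frames)
    return nn.functional.interpolate(spec, size=(224, 224), mode="bilinear")

# 2. Transfer learning: pretrained ViT with a new emotion-classification head
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
model.heads = nn.Linear(model.hidden_dim, NUM_EMOTIONS)

# 3. Forward pass on one utterance (placeholder file name)
waveform, sr = torchaudio.load("utterance.wav")
waveform = torchaudio.functional.resample(waveform, sr, 16000).mean(0, keepdim=True)
logits = model(audio_to_image(waveform))
predicted_emotion = logits.argmax(dim=-1)
```

In practice the pretrained backbone would be fine-tuned on the spectrogram images of the target emotion dataset rather than used zero-shot; the example only shows how the modalities are bridged.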
