Abstract

Emotion is an essential aspect of human life, and effectively identifying emotions in different scenarios helps advance human-computer interaction systems. Emotion classification has therefore gradually become a challenging and popular research field. Compared with text-based emotion analysis, emotion analysis of audio data is still relatively immature. Traditional audio emotion analysis relies on acoustic features such as MFCC and MFSC, combined with temporal-memory models such as LSTMs and RNNs. With the rapid development of transformers and attention mechanisms, many researchers have shifted from the RNN family to the transformer family or to deep learning models equipped with attention. This paper therefore proposes a method that converts audio data into spectrograms and applies a Vision Transformer model based on transfer learning for emotion classification. Experiments are conducted on the IEMOCAP and MELD datasets. The results show that the Vision Transformer achieves emotion classification accuracies of 56.18% on IEMOCAP and 37.1% on MELD.
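
The sketch below illustrates the kind of pipeline the abstract describes: an audio waveform is converted to a log-mel spectrogram, rendered as an image, and fed to an ImageNet-pretrained Vision Transformer whose classification head is replaced for emotion labels. It is a minimal illustration only; the spectrogram parameters, backbone (torchvision's ViT-B/16), label count, and file name are assumptions, not the paper's reported configuration.

```python
# Hypothetical sketch: spectrogram + transfer-learned ViT for emotion classification.
import torch
import torch.nn as nn
import torchaudio
from torchvision.models import vit_b_16, ViT_B_16_Weights

NUM_EMOTIONS = 6  # assumed label count; IEMOCAP/MELD setups vary

# 1. Audio -> log-mel spectrogram (assumed parameter choices)
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=1024, hop_length=256, n_mels=128
)
to_db = torchaudio.transforms.AmplitudeToDB()

def audio_to_image(waveform: torch.Tensor) -> torch.Tensor:
    """Turn a mono waveform into a 3-channel 224x224 'image' for the ViT."""
    spec = to_db(to_mel(waveform))                     # (1, n_mels, frames)
    spec = (spec - spec.min()) / (spec.max() - spec.min() + 1e-8)
    spec = spec.repeat(3, 1, 1).unsqueeze(0)           # (1, 3, n_mels, frames)
    return nn.functional.interpolate(spec, size=(224, 224), mode="bilinear")

# 2. Transfer learning: pretrained ViT with a new emotion-classification head
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
model.heads = nn.Linear(model.hidden_dim, NUM_EMOTIONS)

# 3. Forward pass on one utterance (placeholder file name)
waveform, sr = torchaudio.load("utterance.wav")
waveform = torchaudio.functional.resample(waveform, sr, 16000).mean(0, keepdim=True)
logits = model(audio_to_image(waveform))
predicted_emotion = logits.argmax(dim=-1)
```

In practice the pretrained backbone would be fine-tuned on the spectrogram images of the target emotion dataset rather than used zero-shot; the example only shows how the modalities are bridged.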
