Speech Emotion Recognition (SER) is attracting growing academic interest as machine intelligence advances in the service industries. Prior research has validated the efficacy of multimodality in SER, yet most studies have focused on one-time fusion techniques. This paper proposes a hybrid fusion architecture that combines the advantages of multiple fusion techniques and modalities. The model is built predominantly on the Transformer architecture. The study also extends the classic cross-entropy loss, designing a novel loss function that differentiates among misprediction patterns. The architecture is evaluated on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) with cross-validation. It reaches 89.7% accuracy, surpassing state-of-the-art (SOTA) methods. The proposed loss function further improves performance to 91.1% accuracy. In addition, the models show computational scalability and require little hyperparameter fine-tuning. The article concludes that more comprehensive fusion techniques merit exploration for multimodal SER, and that Transformers are well suited to modeling emotional characteristics and driving the classification process.
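As a rough illustration of what a loss that "differentiates misprediction patterns" could look like (the abstract does not give the paper's exact formulation, so the penalty matrix, function name, and values below are assumptions), one minimal sketch is a cross-entropy term scaled by a per-(true, predicted) penalty weight:

```python
import torch
import torch.nn.functional as F

def pattern_weighted_cross_entropy(logits, targets, penalty):
    """Cross-entropy scaled by a per-(true, predicted) misprediction penalty.

    logits:  (batch, num_classes) raw model scores
    targets: (batch,) integer ground-truth labels
    penalty: (num_classes, num_classes) matrix; penalty[t, p] weights the cost
             of predicting class p when the true class is t.
    """
    ce = F.cross_entropy(logits, targets, reduction="none")  # per-sample loss
    preds = logits.argmax(dim=1)                             # hard predictions
    weights = penalty[targets, preds]                        # per-sample weight
    return (weights * ce).mean()

# Illustrative example with the 8 RAVDESS emotion classes: penalize the
# hypothetical confusion between classes 0 and 1 twice as heavily.
num_classes = 8
penalty = torch.ones(num_classes, num_classes)
penalty[0, 1] = penalty[1, 0] = 2.0

logits = torch.randn(4, num_classes)
targets = torch.tensor([0, 1, 2, 3])
loss = pattern_weighted_cross_entropy(logits, targets, penalty)
```

The design choice here, weighting rather than replacing the cross-entropy term, keeps the loss a drop-in substitute in any standard training loop while letting specific confusion patterns contribute more to the gradient; it is offered only as a sketch of the general idea, not as the paper's method.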