Abstract
Speech emotion recognition (SER) is an essential part of human–computer interaction, and SER systems have increasingly exploited multimodal information in recent years. This paper focuses on exploiting the acoustic and textual modalities for the SER task. We propose a bimodal network based on an Audio–Text Interactional Attention (ATIA) structure, which facilitates the interaction and fusion of emotionally salient information across the acoustic and textual modalities. We explore four different ATIA structures, verify their effectiveness, and select the best-performing one to build our bimodal network. Furthermore, our SER model adopts an additive angular margin loss, named ArcFace loss, originally applied in deep face recognition. ArcFace loss improves the discriminative power of features by focusing on the angles between the features and the class weights, and our visualization results demonstrate its effectiveness compared with the widely used Softmax loss. To the best of our knowledge, this is the first application of ArcFace loss to SER. Experimental results show that the bimodal network combined with ArcFace loss achieves 72.8% Weighted Accuracy (WA) and 62.5% Unweighted Accuracy (UA) on seven-class emotion classification, and 82.4% WA and 80.6% UA on four-class emotion classification on the IEMOCAP dataset.
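The following is a minimal sketch of the additive angular margin (ArcFace) loss the abstract refers to, written in PyTorch. It illustrates the general idea of adding a margin m to the angle between a feature and its target-class weight vector before softmax cross-entropy; the scale s, margin m, feature dimension, and class count shown here are illustrative assumptions and are not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceLoss(nn.Module):
    """Additive angular margin (ArcFace) loss sketch for emotion classification.

    Hypothetical hyperparameters: s (scale) and m (margin, in radians) are
    illustrative values, not the settings used in the paper.
    """
    def __init__(self, feat_dim, num_classes, s=30.0, m=0.5):
        super().__init__()
        self.s = s
        self.m = m
        # one learnable weight vector per emotion class
        self.weight = nn.Parameter(torch.empty(num_classes, feat_dim))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, features, labels):
        # cosine of the angle between L2-normalized features and class weights
        cosine = F.linear(F.normalize(features), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        # add the angular margin m only to the target-class angle
        target_logits = torch.cos(theta + self.m)
        one_hot = F.one_hot(labels, num_classes=cosine.size(1)).float()
        logits = one_hot * target_logits + (1.0 - one_hot) * cosine
        # scale and apply standard cross-entropy over the margin-adjusted logits
        return F.cross_entropy(self.s * logits, labels)

# usage sketch: assumed 256-dim fused audio-text embeddings, 4 emotion classes
loss_fn = ArcFaceLoss(feat_dim=256, num_classes=4)
embeddings = torch.randn(8, 256)        # batch of fused bimodal features
labels = torch.randint(0, 4, (8,))      # emotion class labels
loss = loss_fn(embeddings, labels)
```

Because the margin penalizes the target-class angle, features of the same emotion are pulled into a tighter angular region than plain Softmax loss would produce, which is the discriminative effect the abstract describes.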