Abstract

Speech emotion recognition (SER) is challenging owing to the complexity of emotional representation. This article therefore focuses on multimodal SER, which analyzes the speaker's emotional state from both audio signals and textual content. Existing multimodal approaches use sequential networks to capture temporal dependencies in the feature sequences, ignoring the underlying relations within the acoustic and textual modalities. Moreover, current feature-level and decision-level fusion methods still have unresolved limitations. Therefore, this paper develops a novel multimodal fusion graph convolutional network that comprehensively models information interactions within and between the two modalities. Specifically, we construct intra-modal relations to extract the intrinsic characteristics exclusive to each modality. For inter-modal fusion, a multi-perspective fusion mechanism is devised to integrate the complementary information between the two modalities. Extensive experiments on the IEMOCAP and RAVDESS datasets demonstrate that our approach achieves superior performance.
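
The abstract does not specify the network architecture in detail, so the following is only a minimal PyTorch sketch of the general idea: intra-modal graph convolutions over audio frames and text tokens, followed by a simple cross-modal attention fusion and an utterance-level classifier. All dimensions, the fully connected adjacency matrices, and the attention-based fusion are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a multimodal fusion GCN for SER (assumed design, not the paper's).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphConv(nn.Module):
    """Single graph-convolution layer: H' = ReLU(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, h, adj):
        # adj: row-normalized adjacency over the nodes (audio frames or text tokens)
        return F.relu(self.linear(adj @ h))


class MultimodalFusionGCN(nn.Module):
    """Intra-modal GCNs for audio and text, then a cross-modal attention fusion."""
    def __init__(self, audio_dim=128, text_dim=300, hidden=128, num_classes=4):
        super().__init__()
        self.audio_gcn = GraphConv(audio_dim, hidden)   # intra-modal (acoustic)
        self.text_gcn = GraphConv(text_dim, hidden)     # intra-modal (textual)
        self.cross = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, audio, text, audio_adj, text_adj):
        # Intra-modal relations: propagate information within each modality.
        a = self.audio_gcn(audio, audio_adj)            # (N_a, hidden)
        t = self.text_gcn(text, text_adj)               # (N_t, hidden)
        # Inter-modal fusion: let each modality attend to the other one.
        a2t, _ = self.cross(a.unsqueeze(0), t.unsqueeze(0), t.unsqueeze(0))
        t2a, _ = self.cross(t.unsqueeze(0), a.unsqueeze(0), a.unsqueeze(0))
        # Pool node representations and classify the utterance-level emotion.
        fused = torch.cat([a2t.mean(dim=1), t2a.mean(dim=1)], dim=-1)
        return self.classifier(fused)                   # (1, num_classes)


if __name__ == "__main__":
    n_frames, n_tokens = 50, 12
    audio = torch.randn(n_frames, 128)                  # e.g., frame-level acoustic features
    text = torch.randn(n_tokens, 300)                   # e.g., token-level word embeddings
    # Fully connected, row-normalized adjacency matrices as placeholders.
    audio_adj = torch.full((n_frames, n_frames), 1.0 / n_frames)
    text_adj = torch.full((n_tokens, n_tokens), 1.0 / n_tokens)
    model = MultimodalFusionGCN()
    print(model(audio, text, audio_adj, text_adj).shape)  # torch.Size([1, 4])
```

In this sketch the cross-attention stands in for the paper's multi-perspective fusion mechanism; the abstract gives no further detail on how the complementary information is integrated.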
