Abstract

Emotion recognition based on text-audio modalities is a core technology for transforming a graphical user interface into a voice user interface, and it plays a vital role in natural human-computer interaction systems. Mainstream multimodal learning research has designed various fusion strategies to learn intermodality interactions, but it rarely accounts for the fact that not all modalities play equal roles in emotion recognition. Therefore, a main challenge in multimodal emotion recognition is how to implement effective fusion algorithms built on an auxiliary structure, in which one modality assists another. To address this problem, this article proposes an adaptive interactive attention network (AIA-Net). In AIA-Net, text is treated as the primary modality and audio as an auxiliary modality. AIA-Net adapts to textual and acoustic features with different dimensions and learns their dynamic interactive relations in a flexible way. The interactive relations are encoded as interactive attention weights that focus on the acoustic features most useful for the textual emotional representation. In this way, AIA-Net adaptively assists the textual emotional representation with acoustic emotional information. Moreover, multiple collaborative learning (co-learning) layers in AIA-Net enable repeated multimodal interactions and a deep, bottom-up evolution of the emotional representations. Experimental results on three benchmark datasets demonstrate the effectiveness of the proposed method and its advantages over state-of-the-art methods.
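
To make the described mechanism concrete, the following is a minimal, hypothetical sketch of one text-audio co-learning layer in PyTorch, assuming a standard attention formulation in which text features act as queries and projected acoustic features act as keys and values. The class name, dimensions, and fusion details are illustrative assumptions, not the authors' implementation, which is defined in the full paper.

```python
# Hypothetical sketch of a single text-audio co-learning layer (PyTorch).
# Names, dimensions, and the exact attention form are assumptions for
# illustration only; the actual AIA-Net layer is specified in the paper.
import torch
import torch.nn as nn


class InteractiveAttentionLayer(nn.Module):
    """Text (primary) attends to audio (auxiliary) features.

    Acoustic features with a different dimensionality are projected into the
    textual space, interactive attention weights over the audio frames are
    computed from the text-audio interaction, and the attended acoustic
    context is fused back into the textual representation.
    """

    def __init__(self, text_dim: int, audio_dim: int):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, text_dim)  # align dimensions
        self.query = nn.Linear(text_dim, text_dim)
        self.key = nn.Linear(text_dim, text_dim)
        self.value = nn.Linear(text_dim, text_dim)
        self.fuse = nn.Linear(2 * text_dim, text_dim)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_feats, audio_feats):
        # text_feats:  (batch, text_len, text_dim)
        # audio_feats: (batch, audio_len, audio_dim)
        audio = self.audio_proj(audio_feats)                    # (B, La, Dt)
        q, k, v = self.query(text_feats), self.key(audio), self.value(audio)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)  # (B, Lt, La)
        attn = scores.softmax(dim=-1)      # interactive attention weights
        audio_context = attn @ v                                # (B, Lt, Dt)
        fused = self.fuse(torch.cat([text_feats, audio_context], dim=-1))
        return self.norm(text_feats + fused)  # audio-assisted text features


# Stacking several such layers gives the repeated multimodal interactions
# ("co-learning") described in the abstract; the counts and feature sizes
# below are placeholders.
layers = nn.ModuleList(InteractiveAttentionLayer(768, 74) for _ in range(3))
text = torch.randn(2, 20, 768)   # e.g., token-level text features (assumed)
audio = torch.randn(2, 50, 74)   # e.g., frame-level acoustic features (assumed)
for layer in layers:
    text = layer(text, audio)
```

Under these assumptions, the attention weights play the role described in the abstract: they select which acoustic frames contribute to each textual position, so the audio modality refines rather than replaces the textual emotional representation.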
