Abstract

Multimodal emotion recognition in conversation (ERC) is challenging because of the complexity of cross-modal relationships and the difficulty of fusing semantics across modalities. Graph learning, recognized for its ability to capture intricate data relations, has been proposed as a solution for ERC. However, existing graph-based ERC models rarely address fundamental limitations of graph learning, such as the assumption of pairwise interactions and the neglect of high-frequency signals in semantically poor modalities, which leads to an over-reliance on text. While these issues may be negligible in other applications, they are critical to the success of ERC. In this paper, we propose a novel framework for ERC, namely multimodal graph learning with framelet-based stochastic configuration networks (Frame-SCN). Specifically, framelet-based stochastic configuration networks, which employ 2D directional Haar framelets to extract both low- and high-pass components, are introduced to learn unified semantic embeddings from multimodal data, mitigating the prediction bias caused by excessive reliance on text without introducing an unnecessarily large number of parameters. We also develop a modality-aware information extraction module that extracts both general and sensitive information in a multimodal semantic space, alleviating potential noise issues. Extensive experimental results demonstrate that the proposed Frame-SCN outperforms many state-of-the-art approaches on two widely used multimodal ERC datasets.
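To make the two components named above concrete, the following is a minimal sketch, not the authors' implementation: (i) a one-level 2D Haar split of a feature map into low- and high-pass parts, standing in for the richer directional framelet transform, and (ii) a stochastic configuration network (SCN) that grows random hidden nodes and re-solves its output weights by least squares. All function names, shapes, and the simplified node-selection rule are illustrative assumptions.

```python
# Illustrative sketch only; not the paper's released code.
import numpy as np


def haar_framelet_split(X):
    """Split a 2D feature matrix into low-pass (LL) and high-pass (LH, HL, HH) parts.

    A decimated one-level Haar transform is used for brevity; the directional,
    undecimated framelets referenced in the paper keep full resolution and more
    orientations.
    """
    r, c = X.shape
    X = np.pad(X, ((0, r % 2), (0, c % 2)), mode="edge")   # pad to even size
    a, b = X[0::2, :], X[1::2, :]                           # pair up rows
    lo_r, hi_r = (a + b) / 2.0, (a - b) / 2.0               # row-wise low / high pass

    def split_cols(M):
        p, q = M[:, 0::2], M[:, 1::2]                       # pair up columns
        return (p + q) / 2.0, (p - q) / 2.0

    LL, LH = split_cols(lo_r)
    HL, HH = split_cols(hi_r)
    return LL, np.stack([LH, HL, HH])


def scn_fit(X, T, max_nodes=50, candidates=100, tol=1e-3, seed=0):
    """Fit a single-hidden-layer SCN regressor mapping X (n x d) to targets T (n x m).

    Each step draws several random candidate nodes, keeps the one whose activation
    is best aligned with the current residual (a simplification of the full SCN
    supervisory inequality), and re-solves the linear output weights.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    T = T.reshape(n, -1)
    H = np.empty((n, 0))
    beta = np.zeros((0, T.shape[1]))
    residual = T.copy()
    W, b = [], []
    for _ in range(max_nodes):
        if np.linalg.norm(residual) < tol:
            break
        w_c = rng.uniform(-1.0, 1.0, (candidates, d))
        b_c = rng.uniform(-1.0, 1.0, candidates)
        h_c = np.tanh(X @ w_c.T + b_c)                      # n x candidates activations
        align = np.abs(h_c.T @ residual).sum(axis=1)        # alignment with residual
        k = int(np.argmax(align / (np.linalg.norm(h_c, axis=0) + 1e-12)))
        W.append(w_c[k]); b.append(b_c[k])
        H = np.column_stack([H, h_c[:, k]])
        beta, *_ = np.linalg.lstsq(H, T, rcond=None)        # refit output weights
        residual = T - H @ beta
    return np.array(W), np.array(b), beta
```

In this reading, the framelet split separates smooth (low-frequency) content from the high-frequency detail that semantically poor modalities such as audio and vision may carry, and an SCN-style randomized layer maps the concatenated components to emotion labels without adding a large number of trainable parameters; how these pieces are actually combined with the graph learner is specified in the paper itself.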
