Multimodal sentiment analysis combines text, audio, and visual signals to infer human emotions. However, current methods often struggle to handle asynchronous signals and to capture long-range dependencies across modalities. Early approaches that fuse the modalities directly often introduce unnecessary complexity, while newer methods that model each modality separately can miss important cross-modal relationships. Transformer-based models are effective but typically too resource-intensive for practical deployment. To address these issues, we introduce the multimodal GRU model (MulG), which uses a cross-modal attention mechanism to better align the different signals and capture their dependencies. MulG also employs GRU layers, which handle sequential data efficiently, making the model both accurate and computationally lightweight. Extensive experiments on the CMU-MOSI, CMU-MOSEI, and IEMOCAP datasets demonstrate that MulG outperforms existing methods in accuracy, F1 score, and correlation. Specifically, MulG achieves 82.2% accuracy on CMU-MOSI’s 7-class task, 82.1% on CMU-MOSEI, and 90.6% on IEMOCAP’s emotion classification. Ablation studies further show that each component of the model contributes significantly to overall performance. By addressing the limitations of previous approaches, MulG offers a practical and scalable solution for applications such as analyzing user-generated content and improving human-computer interaction.
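To make the abstract’s architectural idea concrete, the sketch below shows a minimal cross-modal attention block followed by a GRU in PyTorch. It is an illustration of the general technique only, under assumed dimensions and wiring; the class name, hyperparameters, and residual/normalization choices are our assumptions, not the paper’s implementation of MulG.

```python
# Illustrative sketch only: cross-modal attention feeding a GRU, in the
# spirit of the abstract's description. All names, dimensions, and wiring
# are assumptions, not the authors' MulG implementation.
import torch
import torch.nn as nn


class CrossModalGRUBlock(nn.Module):
    """Lets a target modality (e.g. text) attend to a source modality
    (e.g. audio), then summarizes the fused sequence with a GRU."""

    def __init__(self, dim: int = 64, num_heads: int = 4, hidden: int = 64):
        super().__init__()
        # Cross-modal attention: queries come from the target modality,
        # keys/values from the source modality (batch_first for [B, T, D]).
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # A GRU keeps sequential modeling cheaper than a full Transformer encoder.
        self.gru = nn.GRU(dim, hidden, batch_first=True)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: [B, T_t, D], source: [B, T_s, D]; sequence lengths may differ,
        # which is how attention accommodates unaligned (asynchronous) streams.
        attended, _ = self.cross_attn(query=target, key=source, value=source)
        fused = self.norm(target + attended)   # residual connection
        _, last_hidden = self.gru(fused)       # last_hidden: [1, B, hidden]
        return last_hidden.squeeze(0)          # [B, hidden] summary vector


if __name__ == "__main__":
    text = torch.randn(8, 50, 64)    # e.g. 50 word-level text features
    audio = torch.randn(8, 120, 64)  # e.g. 120 frame-level audio features
    block = CrossModalGRUBlock()
    print(block(text, audio).shape)  # torch.Size([8, 64])
```

Because queries and keys/values come from different modalities with different sequence lengths, no explicit alignment step is needed, which reflects the abstract’s point about handling asynchronous signals; a full model would apply such blocks across all modality pairs before classification.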