Abstract

Multimodal networks, which have access to richer information, are expected to outperform their unimodal counterparts. In our experiments, however, we observe that this is not always the case. Prior efforts on multimodal tasks mainly design a uniform optimization algorithm for all modalities, and thus obtain only a sub-optimal multimodal representation from the fusion of under-optimized unimodal representations; such methods remain vulnerable to the performance drop on multimodal networks caused by heterogeneity among modalities. In this work, to remove this slowdown on multimodal tasks, we decouple the learning procedures of the unimodal and multimodal networks by dynamically balancing the learning rates of the individual modalities, so that a modality-specific optimization algorithm can be obtained for each modality. Specifically, an adaptive tracking factor (ATF) is introduced to adjust the learning rate of each modality in real time. Furthermore, adaptive convergent equalization (ACE) and bilevel directional optimization (BDO) are proposed to equalize and update the ATF, avoiding sub-optimal unimodal representations due to overfitting or underfitting. Extensive experiments on multimodal sentiment analysis demonstrate that our method achieves superior performance.
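To make the general idea concrete, the following is a minimal sketch of per-modality learning-rate balancing driven by a tracking factor, not the authors' implementation: the encoders, the proxy unimodal losses, the tracking-factor update rule, and all hyperparameters are illustrative assumptions rather than the paper's ATF/ACE/BDO formulation.

```python
# Minimal sketch (assumed, not the paper's method): each modality gets its own
# optimizer parameter group, and a per-modality tracking factor rescales its
# learning rate based on how quickly a proxy unimodal loss is improving.
import torch
import torch.nn as nn

# Two hypothetical unimodal encoders plus a fusion head (dimensions are arbitrary).
audio_enc = nn.Linear(64, 32)
text_enc = nn.Linear(128, 32)
fusion = nn.Linear(64, 2)

base_lr = 1e-3
optimizer = torch.optim.SGD([
    {"params": audio_enc.parameters(), "lr": base_lr},  # group 0: audio
    {"params": text_enc.parameters(), "lr": base_lr},   # group 1: text
    {"params": fusion.parameters(), "lr": base_lr},     # group 2: fusion
])

criterion = nn.CrossEntropyLoss()
prev_loss = {"audio": None, "text": None}
atf = {"audio": 1.0, "text": 1.0}  # assumed per-modality tracking factors

for step in range(100):
    x_a = torch.randn(16, 64)        # synthetic audio features
    x_t = torch.randn(16, 128)       # synthetic text features
    y = torch.randint(0, 2, (16,))   # synthetic labels

    h_a, h_t = audio_enc(x_a), text_enc(x_t)
    logits = fusion(torch.cat([h_a, h_t], dim=-1))
    loss = criterion(logits, y)

    # Proxy unimodal losses: score each modality with the other one zeroed out.
    with torch.no_grad():
        uni = {
            "audio": criterion(fusion(torch.cat([h_a, torch.zeros_like(h_t)], -1)), y).item(),
            "text": criterion(fusion(torch.cat([torch.zeros_like(h_a), h_t], -1)), y).item(),
        }

    # Assumed update rule: raise the factor for a modality whose loss stalls,
    # lower it for one that improves quickly, and clamp to a safe range so no
    # modality is pushed toward overfitting or left underfitted.
    for m in atf:
        if prev_loss[m] is not None:
            improvement = prev_loss[m] - uni[m]
            atf[m] = max(0.5, min(2.0, atf[m] - 0.1 * improvement))
        prev_loss[m] = uni[m]

    optimizer.param_groups[0]["lr"] = base_lr * atf["audio"]
    optimizer.param_groups[1]["lr"] = base_lr * atf["text"]

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The key design choice the sketch illustrates is decoupling: the fused objective is still optimized jointly, but each unimodal branch receives its own effective learning rate, so a dominant modality no longer dictates the pace at which the weaker one converges.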
