Abstract
Multi-modal fusion can exploit complementary information from different modalities and improve the accuracy of prediction and classification tasks. In this paper, we propose a semi-tensor product-based multi-modal factorized multilinear (STP-MFM) pooling method for information fusion in sentiment analysis. First, we extend bilinear pooling to multilinear pooling for multi-modal fusion. Next, we propose a multi-modal factorized multilinear pooling (MFM) method, which parametrizes the fusion weight tensor with the Tucker decomposition. Furthermore, we propose to use the Semi-Tensor Product (STP) in MFM to obtain more flexible and compact tensor decompositions with smaller factor sizes. The semi-tensor mode product permits the connection of two factors with different dimensionality, so the proposed method removes the dimension-consistency requirement of matrix multiplication and expresses the information in a more compact structure that uses less memory. Most importantly, the STP leverages temporal and spatial information from video, audio, and text, producing a better representation of intra-modality correlations. We evaluated the proposed STP-MFM for sentiment analysis on the CMU-MOSI and IEMOCAP datasets. The experimental results indicate that the proposed method outperforms the baselines by a significant margin. Moreover, it trains faster and has lower model complexity.
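To illustrate how a semi-tensor product removes the dimension-consistency requirement, the following is a minimal sketch of the standard left semi-tensor product of two matrices: for A of size m x n and B of size p x q, with t = lcm(n, p), the product is (A ⊗ I_{t/n})(B ⊗ I_{t/p}). This is not the STP-MFM model itself, and the semi-tensor mode product used in the paper may differ in detail. The function names (`kron`, `identity`, `matmul`, `semi_tensor_product`) are hypothetical helpers chosen for this example.

```python
from math import gcd


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [x * y for x in row_a for y in row_b]
        for row_a in a
        for row_b in b
    ]


def identity(n):
    """n x n identity matrix as a list of rows."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Ordinary matrix product; inner dimensions must agree."""
    assert len(a[0]) == len(b), "inner dimensions must match"
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


def semi_tensor_product(a, b):
    """Left semi-tensor product of a (m x n) and b (p x q).

    Both factors are padded with identity blocks via the Kronecker
    product so that their inner dimensions meet at t = lcm(n, p).
    The result has shape (m * t / n) x (q * t / p).
    """
    n, p = len(a[0]), len(b)
    t = n * p // gcd(n, p)  # least common multiple of n and p
    return matmul(kron(a, identity(t // n)), kron(b, identity(t // p)))


# A is 1 x 4 and B is 2 x 1, so the ordinary product A @ B is undefined.
A = [[1, 2, 3, 4]]
B = [[1], [2]]
print(semi_tensor_product(A, B))  # [[7, 10]]
```

When n equals p, both identity blocks are 1 x 1 and the result is the ordinary matrix product, so the semi-tensor product is a strict generalization of matrix multiplication.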