Abstract

Multi-modal fusion can achieve better predictions by amalgamating information from different modalities. To improve prediction accuracy, a method based on Higher-order Orthogonal Iteration Decomposition and Projection (HOIDP) is proposed. In the fusion process, the higher-order orthogonal iteration decomposition algorithm and factor matrix projection are used to remove redundant information duplicated across modalities and to produce fewer parameters with minimal information loss. The performance of the proposed method is verified on three different multi-modal datasets. The numerical results show that, compared with five other methods, the proposed method improves accuracy by 0.4% to 4% in sentiment analysis, 0.3% to 8% in personality trait recognition, and 0.2% to 25% in emotion recognition on the three multi-modal datasets.
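
As a minimal sketch of the compression idea described above (not the authors' implementation), the snippet below applies higher-order orthogonal iteration (HOOI), as implemented by tensorly's tucker() routine, to a stand-in fused tensor; the modality dimensions and ranks are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of HOOI-based compression of a fused multi-modal tensor.
# Assumes tensorly is installed; all dimensions and ranks are hypothetical.
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker
from tensorly.tenalg import multi_mode_dot

d_audio, d_video, d_text = 32, 32, 64  # hypothetical feature sizes
fused = tl.tensor(np.random.rand(d_audio, d_video, d_text))  # stand-in fused tensor

# HOOI (tensorly's tucker) yields a small core tensor plus one
# orthogonal factor matrix per mode.
core, factors = tucker(fused, rank=[8, 8, 16])

# "Factor matrix projection": multiplying each mode by the transposed
# factor matrices projects the full tensor down onto the core.
projected = multi_mode_dot(fused, factors, transpose=True)

# The compressed representation has 8*8*16 parameters instead of 32*32*64;
# the reconstruction error measures the information lost.
approx = tl.tucker_to_tensor((core, factors))
print("relative error:", float(tl.norm(fused - approx) / tl.norm(fused)))
```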

Highlights

  • The multi-modal fusion technique has become an interesting topic in AI technology fields

  • Nowadays it has been applied in a broad range of applications, such as multimedia event detection [2,3], sentiment analysis [1,4], cross-modal translation [5,6,7], and Visual Question Answering (VQA) [8,9]

  • To verify the improvement of the method, we compare our method with Deep Fusion (DF) [22], Multi-attention Recurrent Network (MARN) [23], Memory Fusion Network (MFN) [24], TFN [16], and Low-rank Multi-modal Fusion (LMF) [17] in sentiment analysis, personality trait recognition, and emotion recognition on three different multi-modal datasets

Summary

Introduction

The multi-modal fusion technique has become an interesting topic in AI technology fields. It integrates the information from multiple modalities and is expected to yield better predictions than any single modality alone [1]. Zadeh [16] proposed a tensor fusion network (TFN), which captures the interactions between different modalities via the cross-product of tensors. Such representations suffer from exponential growth in feature dimensions, resulting in a costly training process. To tackle this problem, an efficient decomposition method, Low-rank Multi-modal Fusion (LMF), was proposed [17]; it uses low-rank tensor factors to greatly reduce computational complexity while preserving the capacity to express inter-modal interactions. The performance of the proposed method has been verified through evaluations on three commonly available multi-modal task datasets.
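
For intuition, the following is a small sketch (not the paper's code) of TFN-style cross-product fusion; it shows why the fused feature count grows multiplicatively with each added modality. The embedding dimensions are illustrative assumptions.

```python
# Sketch of TFN-style fusion: append a constant 1 to each unimodal embedding,
# then take the tensor outer product. The appended 1 preserves the unimodal
# and bimodal terms inside the resulting trimodal tensor.
import numpy as np

def tfn_fuse(z_a: np.ndarray, z_v: np.ndarray, z_t: np.ndarray) -> np.ndarray:
    z_a = np.append(z_a, 1.0)
    z_v = np.append(z_v, 1.0)
    z_t = np.append(z_t, 1.0)
    # Outer product over three modes -> (d_a+1) x (d_v+1) x (d_t+1) tensor.
    return np.einsum('i,j,k->ijk', z_a, z_v, z_t)

fused = tfn_fuse(np.random.rand(16), np.random.rand(16), np.random.rand(32))
print(fused.shape)  # (17, 17, 33): dimensions multiply, hence the high cost
```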

Relevant Mathematical Notations
Methodology
Multi-Modal Fusion Based on Tensor Representation
Higher-Order Orthogonal Iteration Decomposition
Factor Matrix Projection
Experimental Methodology
Datasets
Multimodal Data Features
Model Architecture
Evaluation Metrics
Comparison with the State-of-the-Art
Computation Accuracy Analysis
Conclusions