Abstract

Language understanding is multimodal. In human communication, messages are conveyed not only by words in textual form, but also through the speakers' speech patterns, gestures and facial expressions. It is therefore crucial to fuse information from different modalities to achieve a joint comprehension. With the rapid progress of deep learning, neural networks have emerged as the most popular approach to multimodal data fusion [1, 6, 7, 12]. While these models can effectively combine multimodal features by learning from data, they do not explicitly exhibit how the different modalities are related to each other, owing to the inherently low interpretability of neural networks [2]. Meanwhile, Quantum Theory (QT) has given rise to principled approaches that incorporate interactions between textual features into a holistic textual representation [3, 5, 8, 10], where the concepts of superposition and entanglement are widely exploited to formulate such interactions, and the resulting models have shown advantages in capturing complicated correlations between textual features. We hereby propose research on quantum-inspired multimodal data fusion, claiming that this limitation of neural multimodal fusion can be tackled by quantum-driven models. In particular, we propose to employ superposition to formulate intra-modal interactions, while the interplay between different modalities is expected to be captured by entanglement measures. In this way, the interactions within multimodal data can be rendered explicitly in a unified quantum formalism, improving both performance and interpretability on concrete multimodal tasks. It will also expand the application domains of quantum theory to multimodal tasks, where only preliminary efforts have been made [11]. We therefore aim to answer the following research question: RQ. Can we fuse multimodal data with quantum-inspired models? To answer this question, we propose to fuse multimodal data with complex-valued neural networks, motivated by the theoretical link between neural networks and quantum theory [4] and by recent advances in complex-valued neural networks [9]. Our model begins with a separate complex-valued embedding learned for each modality, following existing works [5, 10], which inherently assumes superposition between intra-modal features. We then construct a many-body system in an entangled state for the multimodal data, where cross-modal interactions are naturally reflected by entanglement measures, and apply quantum measurement operators to the entangled state to address the concrete multimodal task at hand. The whole process is implemented as a complex-valued neural network, which learns from data how multimodal features are combined and, at the same time, explains the combination by means of quantum superposition and entanglement measures. We plan to evaluate the proposed models on CMU-MOSI [12] and CMU-MOSEI [1], two benchmark multimodal sentiment analysis datasets that target classifying sentiment into 2, 5 or 7 classes from textual, visual and acoustic inputs. We expect comparable effectiveness to state-of-the-art models, and we will examine superposition and entanglement measures to better understand the inter-modal interactions.
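
To make the proposed pipeline concrete, the sketch below illustrates its main ingredients in plain NumPy: complex-valued unimodal states as superpositions over feature basis states, a joint bimodal state whose non-separability is quantified by the von Neumann entropy of its reduced density matrix, and projective measurements whose outcome probabilities would feed a classification head. This is a minimal illustration under our own assumptions, not the learned model itself; the function names, the coupling matrix and the random features are hypothetical placeholders for trainable components of the complex-valued network.

```python
# Minimal sketch (illustrative, not the trained model) of quantum-inspired fusion:
# superposed unimodal states, an entangled joint state, an entanglement measure,
# and projective measurements producing probabilities for a downstream classifier.
import numpy as np

def unimodal_state(amplitudes, phases):
    """Normalized complex state |psi> = sum_j r_j e^{i phi_j} |j>,
    i.e. a superposition over unimodal basis features."""
    psi = amplitudes * np.exp(1j * phases)
    return psi / np.linalg.norm(psi)

def joint_state(psi_a, psi_b, coupling=None):
    """Combine two unimodal states into a joint bimodal state.
    With coupling=None (all-ones) this is a product state; a non-uniform
    coupling matrix generally makes the joint state entangled."""
    if coupling is None:
        coupling = np.ones((len(psi_a), len(psi_b)))
    joint = coupling * np.outer(psi_a, psi_b)   # amplitude matrix C_jk
    return joint / np.linalg.norm(joint)

def entanglement_entropy(joint):
    """Von Neumann entropy of the reduced density matrix, used here as the
    cross-modal interaction (entanglement) measure."""
    s = np.linalg.svd(joint, compute_uv=False)  # Schmidt coefficients
    p = s**2
    p = p[p > 1e-12]
    return float(-(p * np.log2(p)).sum())

def measure(joint, projectors):
    """Apply measurement operators to the joint state; the resulting
    probabilities would be passed to a classification head."""
    v = joint.ravel()
    rho = np.outer(v, v.conj())                 # pure-state density matrix
    return np.array([np.real(np.trace(P @ rho)) for P in projectors])

# Toy usage with random features standing in for learned embeddings.
rng = np.random.default_rng(0)
psi_t = unimodal_state(rng.random(4), rng.random(4) * 2 * np.pi)   # textual
psi_v = unimodal_state(rng.random(4), rng.random(4) * 2 * np.pi)   # visual
joint = joint_state(psi_t, psi_v, coupling=rng.random((4, 4)))
print("entanglement:", entanglement_entropy(joint))

# Two rank-1 projectors onto random measurement states (illustrative only).
basis = np.linalg.qr(rng.random((16, 2)) + 1j * rng.random((16, 2)))[0]
projs = [np.outer(basis[:, k], basis[:, k].conj()) for k in range(2)]
print("measurement probabilities:", measure(joint, projs))
```

In the full model, the coupling and the measurement projectors would be learned end-to-end by the complex-valued network, while the entanglement entropy remains available as an interpretable readout of cross-modal interaction strength.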
