Abstract
There has been increased interest in multimodal language processing, including multimodal dialog, question answering, sentiment analysis, and speech recognition. However, naturally occurring multimodal data is often imperfect due to missing entries, noise corruption, or otherwise degraded modalities. To address these concerns, we present a regularization method based on tensor rank minimization. Our method is based on the observation that high-dimensional multimodal time series data often exhibit correlations across time and modalities, which lead to low-rank tensor representations. The presence of noise or incomplete values, however, breaks these correlations and results in tensor representations of higher rank. We design a model to learn such tensor representations and effectively regularize their rank. Experiments on multimodal language data show that our model achieves good results across various levels of imperfection.
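The abstract does not spell out the regularizer itself; the following is a minimal sketch of the idea, assuming the fused representation is built as a sum of per-timestep rank-1 outer products and that the tensor nuclear norm serves as a convex surrogate for rank. The names z_l, z_v, z_a, and lambda_rank are illustrative, not taken from the paper.

    import torch

    def rank_regularizer(z_l, z_v, z_a):
        # z_l, z_v, z_a: (T, d_l), (T, d_v), (T, d_a) per-timestep modality
        # representations. For a tensor M = sum_t z_l[t] (x) z_v[t] (x) z_a[t],
        # the nuclear norm (a convex surrogate for tensor rank) is upper-bounded
        # by the sum over t of the products of the factor vectors' L2 norms.
        return (z_l.norm(dim=1) * z_v.norm(dim=1) * z_a.norm(dim=1)).sum()

    # Hypothetical training objective: task loss plus the rank penalty.
    # loss = task_loss(pred, label) + lambda_rank * rank_regularizer(z_l, z_v, z_a)

Minimizing this bound discourages high-rank fused tensors, which, by the observation above, tend to arise from noisy or incomplete inputs rather than from clean data.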
Highlights
Analyzing multimodal language sequences spans various fields, including multimodal dialog (Das et al., 2017; Rudnicky, 2005), question answering (Antol et al., 2015; Tapaswi et al., 2015; Das et al., 2018), sentiment analysis (Morency et al., 2011), and speech recognition (Palaskar et al., 2018).
Simple tensors that can be represented as outer products of language, visual, and acoustic representations are low-rank.
We show that T2FN increases the capacity of TFN to capture high-rank tensor representations, which in turn leads to improved prediction performance.
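As a rough illustration of the capacity difference described above: TFN fuses single utterance-level vectors with one outer product, yielding a rank-1 tensor (the original TFN also pads each vector with a constant 1, omitted here), whereas a T2FN-style fusion sums one outer product per timestep, so the fused tensor's rank can grow up to the sequence length T. This is a sketch consistent with the summary, not the authors' exact implementation.

    import torch

    def tfn_fuse(z_l, z_v, z_a):
        # One three-way outer product of utterance-level vectors -> rank-1 tensor.
        return torch.einsum('i,j,k->ijk', z_l, z_v, z_a)

    def t2fn_fuse(z_l_seq, z_v_seq, z_a_seq):
        # Sum of per-timestep outer products: the repeated index t is summed out,
        # so the fused tensor's rank can be as high as the sequence length T.
        return torch.einsum('ti,tj,tk->ijk', z_l_seq, z_v_seq, z_a_seq)

    # Example: T = 20 timesteps, 32-dim features per modality -> (32, 32, 32) tensor.
    M = t2fn_fuse(torch.randn(20, 32), torch.randn(20, 32), torch.randn(20, 32))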
Summary
Analyzing multimodal language sequences spans various fields, including multimodal dialog (Das et al., 2017; Rudnicky, 2005), question answering (Antol et al., 2015; Tapaswi et al., 2015; Das et al., 2018), sentiment analysis (Morency et al., 2011), and speech recognition (Palaskar et al., 2018). These multimodal sequences contain heterogeneous sources of information across the language, visual, and acoustic modalities.