Abstract

Complementarity plays a significant role in the synergistic effect created by the different components of a complex data object. Learning complementarity from multimodal data poses fundamental challenges for representation learning, because complementarity exists both across multiple modalities and among the one or more items within each modality. An appropriate metric is also needed to measure complementarity in the representation space; existing methods that rely on similarity-based metrics cannot adequately capture it. In this work, we propose a novel deep architecture for systematically learning the complementarity of components from multimodal multi-item data. The proposed model consists of three major modules: 1) unimodal aggregation, which extracts intramodal complementarity; 2) cross-modal fusion, which extracts intermodal complementarity at the modality level; and 3) interactive aggregation, which extracts intermodal complementarity at the item level. To quantify complementarity, we use the TUBE distance metric to measure the difference between the composite data object and its label in the representation space. Experiments on three real datasets show that our model outperforms the state of the art by +6.8% in mean reciprocal rank (MRR) on object classification and +3.0% in MRR on hold-out item prediction. Qualitative analyses reveal that complementarity is significantly different from similarity.
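The reported gains are stated in mean reciprocal rank (MRR). As a reminder of what that metric measures, here is a minimal sketch of the standard MRR definition: the average over queries of 1/rank of the correct item. The function name `mean_reciprocal_rank` and the list-based input format are illustrative assumptions and are not taken from the paper's evaluation code.

```python
def mean_reciprocal_rank(ranked_lists, targets):
    """Average of 1/rank of the correct item over all queries.

    ranked_lists: one ranked candidate list per query, best candidate first.
    targets: the correct item for each query.
    (Hypothetical input format, chosen for illustration only.)
    """
    if len(ranked_lists) != len(targets):
        raise ValueError("ranked_lists and targets must have the same length")
    if not targets:
        return 0.0

    total = 0.0
    for ranking, target in zip(ranked_lists, targets):
        # Reciprocal rank is 1/position of the correct item (1-indexed),
        # and 0 if the correct item does not appear in the ranking.
        for position, candidate in enumerate(ranking, start=1):
            if candidate == target:
                total += 1.0 / position
                break
    return total / len(targets)


# Three queries whose correct item "a" is ranked 1st, 2nd, and 3rd.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
score = mean_reciprocal_rank(rankings, ["a", "a", "a"])
```

Here `score` is the mean of 1, 1/2, and 1/3. A higher MRR means the correct label or held-out item tends to appear nearer the top of the ranking.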
