Abstract

Recently, there has been a surge of interest in image-text multimodal representation learning, and many neural network based models have been proposed that aim to capture the interaction between the two modalities with different forms of functions. Despite their success, a potential limitation of these methods is that a single set of static parameters may be insufficient to model all kinds of interactions. To alleviate this problem, we present a dynamic interaction network in which the parameters of the interaction function are dynamically generated by a meta network. Additionally, to provide the multimodal features that the meta network requires, we propose a new neural module called the Multimodal Transformer. Experimentally, we not only conduct a comprehensive quantitative evaluation on four image-text tasks, but also present interpretable analyses of our models, revealing the internal working mechanism of dynamic parameter learning.
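
To make the idea of dynamic parameter generation concrete, the sketch below shows one way a meta network can emit the weights of an interaction function conditioned on the input pair, rather than relying on a fixed set of static parameters. This is a minimal illustration under assumed choices (PyTorch, a simple concatenation-based fusion, and arbitrary dimensions such as `img_dim=512` and `hidden_dim=256`); it is not the paper's exact architecture or its Multimodal Transformer module.

```python
# Minimal sketch (assumptions: PyTorch, concatenation fusion, illustrative sizes).
# A meta network maps a fused multimodal summary to the parameters (weight + bias)
# of the interaction layer, so the interaction weights differ per input pair.
import torch
import torch.nn as nn


class DynamicInteraction(nn.Module):
    def __init__(self, img_dim=512, txt_dim=512, hidden_dim=256):
        super().__init__()
        self.hidden_dim = hidden_dim
        in_dim = img_dim + txt_dim
        # Meta network: predicts a per-example weight matrix and bias.
        self.meta_net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, in_dim * hidden_dim + hidden_dim),
        )

    def forward(self, img_feat, txt_feat):
        # img_feat: (B, img_dim), txt_feat: (B, txt_dim)
        fused = torch.cat([img_feat, txt_feat], dim=-1)          # (B, in_dim)
        in_dim = fused.size(-1)
        params = self.meta_net(fused)                            # (B, in_dim*H + H)
        w, b = params.split([in_dim * self.hidden_dim, self.hidden_dim], dim=-1)
        w = w.view(-1, self.hidden_dim, in_dim)                  # per-example weights
        # Dynamically parameterised interaction: apply the generated weights.
        out = torch.bmm(w, fused.unsqueeze(-1)).squeeze(-1) + b  # (B, H)
        return torch.relu(out)


# Usage: interaction features for a toy batch of 4 image-text pairs.
img = torch.randn(4, 512)
txt = torch.randn(4, 512)
print(DynamicInteraction()(img, txt).shape)  # torch.Size([4, 256])
```

The design point illustrated here is that the interaction layer's weights are an output of the meta network rather than trainable constants, so the model can adapt the interaction function to each image-text pair.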
