Abstract

The heterogeneity gap leads to inconsistent distributions and representations of image and text, which raises the challenging task of measuring their similarity and constructing cross-media correlation between them. Existing works mainly model cross-media correlation in a common subspace, but such a third-party subspace, built through intermediate unidirectional transformation, yields insufficient correlation modeling. Inspired by recent advances in neural machine translation, which establishes correspondence between two entirely different languages, we observe a striking commonality with cross-media correlation learning: image and text can be treated as bilingual pairs, where the image acts as a special kind of language providing visual description. Bidirectional transformation can then be conducted between image and text to effectively explore cross-media correlation in the feature space of each media type. We therefore propose a reinforced cross-media bidirectional translation (RCBT) approach to model the correlation between visual and textual descriptions. First, a cross-media bidirectional translation mechanism conducts direct transformation between the bilingual pairs of visual and textual descriptions in both directions, so that cross-media correlation is effectively captured in the feature spaces of both image and text through bidirectional translation training. Second, a cross-media context-aware network with residual attention exploits rich spatial and temporal context hints via a cross-media convolutional recurrent neural network, leading to more precise correlation learning that promotes the bidirectional translation process. Third, cross-media reinforcement learning casts each round between image and text as a two-agent communication game to boost the bidirectional translation process; we further extract inter-media and intra-media reward signals to provide complementary clues for learning cross-media correlation. Extensive experiments on cross-media retrieval verify the effectiveness of the proposed RCBT approach against 11 state-of-the-art methods on three cross-media datasets.
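To make the core idea concrete, the following is a minimal conceptual sketch of bidirectional translation between modality feature spaces. It is not the authors' RCBT network: the translators here are plain linear maps fit by least squares (RCBT uses convolutional recurrent networks with residual attention and reinforcement learning), and the feature dimensions and random "encodings" are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for encoded features of N paired samples:
# image features of dimension d_img, text features of dimension d_txt.
N, d_img, d_txt = 8, 6, 4
img = rng.normal(size=(N, d_img))
txt = rng.normal(size=(N, d_txt))

# Bidirectional "translators": image -> text space and text -> image space.
# Linear least-squares maps here; hypothetical placeholders for learned networks.
W_i2t, *_ = np.linalg.lstsq(img, txt, rcond=None)  # shape (d_img, d_txt)
W_t2i, *_ = np.linalg.lstsq(txt, img, rcond=None)  # shape (d_txt, d_img)

def translation_losses(img, txt):
    """Translation errors measured in each modality's own feature space."""
    loss_i2t = np.mean((img @ W_i2t - txt) ** 2)          # image -> text
    loss_t2i = np.mean((txt @ W_t2i - img) ** 2)          # text -> image
    # Round-trip (back-translation) error: image -> text space -> image space,
    # the kind of signal bidirectional translation training can exploit.
    loss_cycle = np.mean((img @ W_i2t @ W_t2i - img) ** 2)
    return loss_i2t, loss_t2i, loss_cycle

l_i2t, l_t2i, l_cycle = translation_losses(img, txt)
```

In this toy setting, each loss is a mean squared error computed directly in the image or text feature space, rather than in a third-party common subspace; training both directions jointly is what lets correlation be modeled in both native spaces.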

