Abstract

With the explosive growth of multimodal data, cross-modal correlation classification has become an important research topic and is in great demand for many cross-modal applications. A variety of classification schemes and predictive models have been built upon existing cross-modal correlation categorizations. However, these schemes typically rest on the prior assumption that paired cross-modal samples are strictly related, and therefore concentrate on fine-grained relevant types of cross-modal correlation while ignoring the large volume of implicitly relevant data, which is often misclassified as irrelevant. Moreover, previous predictive models fall short of reflecting the essence of cross-modal correlation as defined, especially in the modeling of network structure. In this paper, after comprehensively reviewing current image-text correlation classification research, we propose a new classification scheme for cross-modal correlation based on implicit and explicit relevance. To predict image-text correlation types under this definition, we further devise the Association and Alignment Network (AnANet), which models implicit and explicit relevance by capturing both the implicit association of global discrepancy and commonality between image and text and the explicit alignment of cross-modal local relevance. Experiments on our newly constructed image-text correlation dataset verify the effectiveness of the proposed model.
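The abstract describes a two-branch design: an implicit-association branch over global image and text representations (their commonality and discrepancy) and an explicit-alignment branch over local regions and words. The following is only a minimal illustrative sketch of such a two-branch correlation classifier; the module names, feature dimensions, attention-based alignment, fusion scheme, and number of correlation classes are all assumptions and are not taken from the paper.

```python
# Minimal sketch of a two-branch image-text correlation classifier in the
# spirit described by the abstract. All structural details are assumed.
import torch
import torch.nn as nn


class TwoBranchCorrelationClassifier(nn.Module):
    def __init__(self, dim=512, num_classes=4):  # num_classes is assumed
        super().__init__()
        # Implicit-association branch: compares global image/text vectors
        # through their commonality and discrepancy.
        self.assoc_head = nn.Sequential(
            nn.Linear(dim * 3, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        # Explicit-alignment branch: attends text tokens to image regions
        # as a stand-in for fine-grained local alignment.
        self.align_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.align_head = nn.Linear(dim, dim)
        self.classifier = nn.Linear(dim * 2, num_classes)

    def forward(self, img_global, txt_global, img_regions, txt_tokens):
        # Commonality (element-wise product) and discrepancy (difference)
        # between global image and text representations.
        common = img_global * txt_global
        discrep = img_global - txt_global
        assoc = self.assoc_head(
            torch.cat([common, discrep, img_global + txt_global], dim=-1)
        )

        # Cross-attention from text tokens to image regions, pooled into
        # a single alignment vector.
        aligned, _ = self.align_attn(txt_tokens, img_regions, img_regions)
        align = self.align_head(aligned.mean(dim=1))

        # Fuse both branches and predict the correlation type.
        return self.classifier(torch.cat([assoc, align], dim=-1))


# Usage with dummy features (batch of 2, 36 regions, 20 tokens, dim 512):
model = TwoBranchCorrelationClassifier()
logits = model(torch.randn(2, 512), torch.randn(2, 512),
               torch.randn(2, 36, 512), torch.randn(2, 20, 512))
print(logits.shape)  # torch.Size([2, 4])
```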
