Abstract

With the explosive growth of multimodal data, cross-modal correlation classification has become an important research topic and is in great demand for many cross-modal applications. A variety of classification schemes and predictive models have been built upon existing cross-modal correlation categorizations. However, these schemes typically rest on the prior assumption that paired cross-modal samples are strictly related, and therefore concentrate on fine-grained relevant types of cross-modal correlation while ignoring the large volume of implicitly relevant data, which is often misclassified as irrelevant. Moreover, previous predictive models fall short of reflecting the essence of cross-modal correlation as defined, especially in the modeling of network structure. In this paper, after comprehensively reviewing current image-text correlation classification research, we propose a new classification scheme for cross-modal correlation based on implicit and explicit relevance. To predict image-text correlation types under this definition, we further devise the Association and Alignment Network (AnANet), which models implicit and explicit relevance by capturing both the implicit association of global discrepancy and commonality between image and text and the explicit alignment of cross-modal local relevance. Experiments on our newly constructed image-text correlation dataset verify the effectiveness of the proposed model.
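The abstract describes a two-branch design: an implicit-association branch over global image and text representations (their commonality and discrepancy) and an explicit-alignment branch over local regions and words. The following is only a minimal illustrative sketch of such a two-branch correlation classifier; the module names, feature dimensions, attention-based alignment, fusion scheme, and number of correlation classes are all assumptions and are not taken from the paper.

```python
# Minimal sketch of a two-branch image-text correlation classifier in the
# spirit described by the abstract. All structural details are assumed.
import torch
import torch.nn as nn


class TwoBranchCorrelationClassifier(nn.Module):
    def __init__(self, dim=512, num_classes=4):  # num_classes is assumed
        super().__init__()
        # Implicit-association branch: compares global image/text vectors
        # through their commonality and discrepancy.
        self.assoc_head = nn.Sequential(
            nn.Linear(dim * 3, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        # Explicit-alignment branch: attends text tokens to image regions
        # as a stand-in for fine-grained local alignment.
        self.align_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.align_head = nn.Linear(dim, dim)
        self.classifier = nn.Linear(dim * 2, num_classes)

    def forward(self, img_global, txt_global, img_regions, txt_tokens):
        # Commonality (element-wise product) and discrepancy (difference)
        # between global image and text representations.
        common = img_global * txt_global
        discrep = img_global - txt_global
        assoc = self.assoc_head(
            torch.cat([common, discrep, img_global + txt_global], dim=-1)
        )

        # Cross-attention from text tokens to image regions, pooled into
        # a single alignment vector.
        aligned, _ = self.align_attn(txt_tokens, img_regions, img_regions)
        align = self.align_head(aligned.mean(dim=1))

        # Fuse both branches and predict the correlation type.
        return self.classifier(torch.cat([assoc, align], dim=-1))


# Usage with dummy features (batch of 2, 36 regions, 20 tokens, dim 512):
model = TwoBranchCorrelationClassifier()
logits = model(torch.randn(2, 512), torch.randn(2, 512),
               torch.randn(2, 36, 512), torch.randn(2, 20, 512))
print(logits.shape)  # torch.Size([2, 4])
```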
