Abstract

Because of the heterogeneity gap, the data representations of different media types are inconsistent, which makes it challenging to measure the fine-grained gap between them. To this end, we propose a self-attention-based hybrid network to learn common representations of different media data. Specifically, we first utilize a local self-attention layer to learn a common attention space between different media data. We then propose a similarity concatenation method to capture the content relationships among features. To further improve the robustness of the model, we also learn a local position encoding that captures the spatial relationships between features. As a result, our approach effectively reduces the gap between feature distributions in cross-media retrieval tasks. Extensive experiments and ablation studies demonstrate that our proposed method achieves state-of-the-art performance. The source code and models are publicly available at: https://github.com/NUST-Machine-Intelligence-Laboratory/SAFGCMHN.
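For intuition, below is a minimal PyTorch sketch of the kind of block the abstract describes: a local self-attention layer over a shared attention space, a similarity-concatenation step, and a learned local position encoding. All module names, tensor shapes, and the cosine-based similarity term are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalSelfAttentionBlock(nn.Module):
    """Hypothetical sketch of the abstract's building block: features from
    either modality are projected into a shared attention space, fused with
    a similarity term (similarity concatenation), and offset by a learned
    local position encoding. Shapes and names are assumptions."""

    def __init__(self, dim: int, num_positions: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # learned local position encoding over feature positions
        self.pos = nn.Parameter(torch.zeros(1, num_positions, dim))
        # fuse attended features with a scalar similarity term
        self.fuse = nn.Linear(dim + 1, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_positions, dim), features from one modality
        x = x + self.pos                                   # inject spatial relationships
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) / x.size(-1) ** 0.5, dim=-1)
        out = attn @ v                                     # common attention space
        # similarity concatenation: cosine similarity of each position
        # against the global context, appended as an extra feature channel
        ctx = out.mean(dim=1, keepdim=True)
        sim = F.cosine_similarity(out, ctx, dim=-1).unsqueeze(-1)
        return self.fuse(torch.cat([out, sim], dim=-1))

# Usage: the same block could be applied to image-region and text-token
# features so both modalities share one attention space.
block = LocalSelfAttentionBlock(dim=512, num_positions=49)
common = block(torch.randn(2, 49, 512))                    # (2, 49, 512)
```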
