Abstract

Because of the heterogeneity gap, the data representations of different media types are inconsistent, which makes it challenging to measure the fine-grained gap between them. To this end, we propose a self-attention-based hybrid network to learn common representations of different media data. Specifically, we first utilize a local self-attention layer to learn a common attention space between different media data. We then propose a similarity concatenation method to capture the content relationships among features. To further improve the robustness of the model, we also learn a local position encoding that captures the spatial relationships between features. As a result, our approach effectively reduces the gap between feature distributions in cross-media retrieval tasks. Extensive experiments and ablation studies demonstrate that our proposed method achieves state-of-the-art performance. The source code and models are publicly available at: https://github.com/NUST-Machine-Intelligence-Laboratory/SAFGCMHN.
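For intuition, below is a minimal PyTorch sketch of the kind of block the abstract describes: a local self-attention layer over a shared attention space, a similarity-concatenation step, and a learned local position encoding. All module names, tensor shapes, and the cosine-based similarity term are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalSelfAttentionBlock(nn.Module):
    """Hypothetical sketch of the abstract's building block: features from
    either modality are projected into a shared attention space, fused with
    a similarity term (similarity concatenation), and offset by a learned
    local position encoding. Shapes and names are assumptions."""

    def __init__(self, dim: int, num_positions: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # learned local position encoding over feature positions
        self.pos = nn.Parameter(torch.zeros(1, num_positions, dim))
        # fuse attended features with a scalar similarity term
        self.fuse = nn.Linear(dim + 1, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_positions, dim), features from one modality
        x = x + self.pos                                   # inject spatial relationships
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) / x.size(-1) ** 0.5, dim=-1)
        out = attn @ v                                     # common attention space
        # similarity concatenation: cosine similarity of each position
        # against the global context, appended as an extra feature channel
        ctx = out.mean(dim=1, keepdim=True)
        sim = F.cosine_similarity(out, ctx, dim=-1).unsqueeze(-1)
        return self.fuse(torch.cat([out, sim], dim=-1))

# Usage: the same block could be applied to image-region and text-token
# features so both modalities share one attention space.
block = LocalSelfAttentionBlock(dim=512, num_positions=49)
common = block(torch.randn(2, 49, 512))                    # (2, 49, 512)
```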
