Abstract
Extracting semantically consistent representations from multi-modal data helps computers understand the human world more comprehensively. Visual-semantic matching, one of the fundamental tasks in multi-modal learning, continues to attract attention. Recent research has made sustained efforts to improve matching performance, but sometimes at the expense of the delicate balance between efficiency and effectiveness. In this paper, we aim to resolve this dilemma with a newly proposed attention-based architecture. For effectiveness, we adopt the Transformer Encoder (TE) as our base model and introduce two key modifications that tailor it to visual-semantic matching. First, we incorporate fine-grained supervision into the classic TE, allowing the model to capture sophisticated correspondences between modalities. Second, we employ a dynamic attention-evolving strategy that selectively passes useful information and strengthens the consistency of attention patterns between adjacent TE blocks. For efficiency, we propose a novel Select & Re-rank strategy that lets the model discard redundant information, substantially reducing computational cost and increasing matching speed with minimal performance degradation. Under the supervision of both fine-grained and global similarity, the proposed architecture gradually captures and reorganizes useful inter-modality and intra-modality information, yielding more comprehensive and discriminative embeddings. Experiments on two benchmark datasets show that the proposed method achieves competitive results in terms of both efficiency and effectiveness.
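To make the Select & Re-rank idea concrete, the sketch below shows a generic two-stage retrieval loop of this kind: a cheap global similarity first selects the top-K candidates, and only those survivors are re-scored by a more expensive fine-grained matcher. This is an illustrative sketch under our own assumptions, not the authors' implementation; in particular, `fine_scorer` is a hypothetical stand-in for the fine-grained cross-modal model, and the value of `k` is arbitrary.

```python
# Illustrative sketch (assumptions, not the paper's code): a generic
# two-stage "select & re-rank" retrieval step.
import torch


def select_and_rerank(query_emb, cand_embs, fine_scorer, k=20):
    """
    query_emb  : (d,)   global embedding of the query (e.g. a sentence).
    cand_embs  : (N, d) global embeddings of all candidates (e.g. images).
    fine_scorer: callable(query_emb, selected_cand_embs) -> (k,) scores;
                 hypothetical stand-in for the expensive fine-grained matcher.
    """
    # Stage 1 (select): cheap cosine similarity over the whole gallery.
    q = torch.nn.functional.normalize(query_emb, dim=-1)
    c = torch.nn.functional.normalize(cand_embs, dim=-1)
    coarse = c @ q                          # (N,) global similarities
    _, topk_idx = coarse.topk(k)            # keep only the K best candidates

    # Stage 2 (re-rank): expensive fine-grained scoring on the K survivors only,
    # so the costly model never sees the full gallery.
    fine = fine_scorer(query_emb, cand_embs[topk_idx])  # (k,)
    order = fine.argsort(descending=True)
    return topk_idx[order], fine[order]
```

In this pattern the fine-grained model runs on K items instead of N, which is where the computational savings with limited accuracy loss would come from.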