Abstract
Extracting semantically consistent representations from multi-modal data helps computers understand the human world more comprehensively. Visual-semantic matching, one of the fundamental tasks in multi-modal learning, continues to attract attention. Recent research has made sustained efforts to improve matching performance, but sometimes at the expense of the delicate balance between efficiency and effectiveness. In this paper, we aim to resolve this dilemma with a newly proposed attention-based architecture. For effectiveness, we adopt the Transformer Encoder (TE) as our base model and introduce two key modifications that tailor it to visual-semantic matching. First, we incorporate fine-grained supervision into the classic TE, allowing the model to capture sophisticated correspondences between modalities. Second, we employ a dynamic attention-evolving strategy that selectively passes useful information and strengthens the consistency of attention patterns between adjacent TE blocks. For efficiency, we propose a novel Select & Re-rank strategy that lets the model discard redundant information, substantially reducing computational cost and increasing matching speed with minimal performance degradation. Under the supervision of both fine-grained and global similarity, the proposed architecture gradually captures and reorganizes useful inter-modality and intra-modality information, yielding more comprehensive and discriminative embeddings. Experiments on two benchmark datasets show that the proposed method achieves competitive results in terms of both efficiency and effectiveness.
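To make the Select & Re-rank idea concrete, the sketch below shows a generic two-stage retrieval loop of this kind: a cheap global similarity first selects the top-K candidates, and only those survivors are re-scored by a more expensive fine-grained matcher. This is an illustrative sketch under our own assumptions, not the authors' implementation; in particular, `fine_scorer` is a hypothetical stand-in for the fine-grained cross-modal model, and the value of `k` is arbitrary.

```python
# Illustrative sketch (assumptions, not the paper's code): a generic
# two-stage "select & re-rank" retrieval step.
import torch


def select_and_rerank(query_emb, cand_embs, fine_scorer, k=20):
    """
    query_emb  : (d,)   global embedding of the query (e.g. a sentence).
    cand_embs  : (N, d) global embeddings of all candidates (e.g. images).
    fine_scorer: callable(query_emb, selected_cand_embs) -> (k,) scores;
                 hypothetical stand-in for the expensive fine-grained matcher.
    """
    # Stage 1 (select): cheap cosine similarity over the whole gallery.
    q = torch.nn.functional.normalize(query_emb, dim=-1)
    c = torch.nn.functional.normalize(cand_embs, dim=-1)
    coarse = c @ q                          # (N,) global similarities
    _, topk_idx = coarse.topk(k)            # keep only the K best candidates

    # Stage 2 (re-rank): expensive fine-grained scoring on the K survivors only,
    # so the costly model never sees the full gallery.
    fine = fine_scorer(query_emb, cand_embs[topk_idx])  # (k,)
    order = fine.argsort(descending=True)
    return topk_idx[order], fine[order]
```

In this pattern the fine-grained model runs on K items instead of N, which is where the computational savings with limited accuracy loss would come from.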