Abstract
Current research in cross-modal retrieval has primarily focused on aligning the global features of videos and sentences. However, a video conveys a much broader range of information than text, so text-video matching should focus on the similarity between the frames that carry critical information and the text semantics. This paper proposes a cross-modal conditional feature aggregation model based on the attention mechanism. It includes two novel modules: (1) a cross-modal attentional feature aggregation module, which uses the semantic text features as conditional projections to extract the most relevant features from the video frames and aggregates them into a global video feature; and (2) a global-local similarity calculation module, which computes similarities at two granularities (video-sentence and frame-word) so that both topic and detail features are considered in text-video matching. Experiments on the four widely used MSR-VTT, LSMDC, MSVD, and DiDeMo datasets demonstrate the effectiveness of our model and its superiority over state-of-the-art methods. The results show that the cross-modal attention aggregation approach effectively captures the primary semantic information of the video, while the global-local similarity calculation module accurately matches text and video based on both topic and detail features.
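To make the two modules concrete, the sketch below shows one plausible reading of the abstract in PyTorch: a text feature is projected into a query that scores each frame, the scores weight a sum of frame features to form the global video feature, and the final score combines a coarse video-sentence similarity with a fine frame-word similarity. The layer names, dimensions, scaled-dot-product scoring, max-over-frames aggregation, and the weighting factor alpha are our assumptions for illustration, not the paper's exact formulation.

```python
# Illustrative sketch only; dimensions and aggregation choices are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextConditionedAggregation(nn.Module):
    """Aggregate frame features into a global video feature, weighting each
    frame by its relevance to the sentence feature (cross-modal attention)."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)  # projects the text feature into a query
        self.key_proj = nn.Linear(dim, dim)    # projects each frame feature into a key

    def forward(self, frame_feats: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, num_frames, dim), text_feat: (B, dim)
        q = self.query_proj(text_feat).unsqueeze(1)             # (B, 1, dim)
        k = self.key_proj(frame_feats)                          # (B, F, dim)
        scores = (q * k).sum(-1) / k.size(-1) ** 0.5            # (B, F) scaled dot product
        weights = scores.softmax(dim=-1)                        # relevance of each frame to the text
        video_feat = (weights.unsqueeze(-1) * frame_feats).sum(1)  # (B, dim) global video feature
        return video_feat


def global_local_similarity(video_feat, text_feat, frame_feats, word_feats, alpha=0.5):
    """Combine a coarse video-sentence similarity with a fine frame-word similarity."""
    # Global (topic-level) similarity: cosine between the pooled features.
    s_global = F.cosine_similarity(video_feat, text_feat, dim=-1)               # (B,)

    # Local (detail-level) similarity: for each word, take its best-matching
    # frame, then average over words (one plausible aggregation choice).
    f = F.normalize(frame_feats, dim=-1)                                        # (B, F, D)
    w = F.normalize(word_feats, dim=-1)                                         # (B, W, D)
    s_local = torch.einsum('bfd,bwd->bfw', f, w).max(dim=1).values.mean(dim=-1) # (B,)

    return alpha * s_global + (1 - alpha) * s_local
```

In this reading, the same text feature drives both the frame aggregation and the matching score, so frames irrelevant to the query sentence contribute little to either the global or the local similarity.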