Abstract

Much current research on video-text cross-modal retrieval focuses on narrowing the semantic gap between video and text, but ignores the semantic differences among the sampled frames of the same video and the correlation between the feature distributions of the objects those frames contain. As a result, the learned frame features cannot adequately represent the semantics of the whole video. To overcome these shortcomings, we first use a pre-trained video frame classification-aggregation network to bring the object categories contained in different sampled frames of the same video closer to the important object categories of the whole video, which promotes consistent feature distributions across the sampled frames and increases the relevance of object features across frames. We then propose a video internal frame aggregation loss module to resolve the inconsistency between the individual frame features produced by the video encoder and the aggregated feature of the sampled frames, thereby strengthening the representational power of the aggregated frame feature. Experiments on three common datasets, MSVD, MSR-VTT, and DiDeMo, demonstrate the effectiveness of the proposed approach.
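To make the second contribution concrete, the sketch below shows one plausible form of an intra-video frame aggregation loss: each sampled frame's feature is pulled toward the aggregated feature of its own video via cosine similarity. This is a minimal illustration under assumed tensor shapes, not the authors' exact formulation; the function name and shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

def frame_aggregation_loss(frame_feats: torch.Tensor,
                           agg_feat: torch.Tensor) -> torch.Tensor:
    """Hypothetical intra-video frame aggregation loss (a sketch).

    frame_feats: (B, N, D) -- features of N sampled frames per video
    agg_feat:    (B, D)    -- aggregated feature of each whole video
    """
    frame_feats = F.normalize(frame_feats, dim=-1)
    agg_feat = F.normalize(agg_feat, dim=-1)
    # Cosine similarity between every frame and its video's aggregate.
    sim = torch.einsum('bnd,bd->bn', frame_feats, agg_feat)
    # Minimizing (1 - cos) pushes frame features toward the aggregate,
    # encouraging consistent feature distributions within a video.
    return (1.0 - sim).mean()
```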
