Abstract

Text-video retrieval is challenged by the semantic gap between the two modalities. Some existing methods map text and video into a shared subspace and measure their similarity there. However, such methods do not impose a semantic consistency constraint when associating the semantic encodings of the two modalities, so the learned association is weak. In this paper, we propose a multi-modal retrieval algorithm based on semantic association and multi-task learning. First, multi-level features of the video and the text are extracted with multiple deep networks so that the information in both modalities is fully encoded. Then, in the common feature space into which both modalities are mapped, we formulate a multi-task learning framework that combines text-video semantic similarity measurement with semantic consistency classification. The semantic consistency classification task constrains the learning of the semantic association task, so multi-task learning guides a better feature mapping of the two modalities and improves the construction of the unified feature subspace. Finally, experimental results on the Microsoft Video Description dataset (MSVD) and MSR-Video to Text (MSR-VTT) surpass existing work, demonstrating that our algorithm improves cross-modal retrieval performance.
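
The abstract does not spell out the loss formulation, so the following is only a minimal PyTorch sketch of how a joint objective combining a text-video similarity (ranking) task with a semantic consistency classification task over a shared embedding space might look. All module names, the margin-based ranking loss, and the task weight `alpha` are illustrative assumptions, not the authors' exact method.

```python
# Minimal sketch (assumed formulation, not the paper's exact losses):
# a joint objective over a shared text-video embedding space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointSpaceMultiTaskLoss(nn.Module):
    def __init__(self, embed_dim=512, num_classes=20, margin=0.2, alpha=1.0):
        super().__init__()
        # Shared-space classifier for the semantic consistency task.
        self.classifier = nn.Linear(embed_dim, num_classes)
        self.margin = margin
        self.alpha = alpha  # weight of the classification task (assumption)

    def forward(self, video_emb, text_emb, labels):
        # video_emb, text_emb: (B, D) features already mapped to the common space.
        v = F.normalize(video_emb, dim=-1)
        t = F.normalize(text_emb, dim=-1)
        sim = t @ v.T                   # (B, B) cosine similarity matrix
        pos = sim.diag().unsqueeze(1)   # similarities of matched text-video pairs

        # Bidirectional max-margin ranking loss (text-video similarity task).
        mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
        cost_t2v = (self.margin + sim - pos).clamp(min=0).masked_fill(mask, 0)
        cost_v2t = (self.margin + sim - pos.T).clamp(min=0).masked_fill(mask, 0)
        rank_loss = cost_t2v.mean() + cost_v2t.mean()

        # Semantic consistency classification: both modalities should predict
        # the same semantic category, constraining the shared space.
        cls_loss = F.cross_entropy(self.classifier(video_emb), labels) + \
                   F.cross_entropy(self.classifier(text_emb), labels)
        return rank_loss + self.alpha * cls_loss
```

In this sketch the classification branch acts as the "constraint" described above: the shared space must both rank matched pairs highly and keep the two modalities separable by semantic category.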

Highlights

  • As the scale and diversity of online information grow, video platforms such as YouTube are developing rapidly and short-video applications such as TikTok have become widely popular

  • To address these problems, this paper proposes a multi-level, multi-task learning approach based on semantic association for text-video retrieval

  • Experiments are conducted on the Microsoft Video Description dataset (MSVD) and MSR-Video to Text (MSR-VTT), and the results demonstrate the effectiveness of the proposed algorithm


Summary

Introduction

In today’s era of increasing information scale and more diverse information forms, video media websites such as YouTube are developing rapidly, and short-video applications such as TikTok are widely popular. Since it is hard to exhaustively express the semantic information of queries through concepts, word embedding techniques are integrated with visual features and mapped to a common space that serves as a “bridge” for comparing the similarity between text and visual data [9]. This kind of method is concept-free and performs video retrieval using the whole text query [10,11,12]. A slightly modified SlowFast [17] model is used to extract accurate video features in the spatial domain, and the BERT [18] model is used to embed high-level text semantics at the sentence level rather than the word level.
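
To make the sentence-level encoding concrete, here is a hedged sketch of extracting a sentence embedding with a pretrained BERT (via the Hugging Face `transformers` library) and projecting it into a common space shared with video features. The SlowFast features are assumed to be precomputed, and the projection dimensions (768 → 512, 2304 → 512) are assumptions; the paper's exact modifications to SlowFast are not reproduced here.

```python
# Illustrative sketch only: sentence-level BERT embedding projected into a
# common text-video space; SlowFast clip features are assumed precomputed.
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

text_proj = nn.Linear(768, 512)    # BERT hidden size -> common space (assumed dim)
video_proj = nn.Linear(2304, 512)  # 2304 = concatenated SlowFast-R50 pathways (assumption)

def encode_sentence(sentence: str) -> torch.Tensor:
    # Sentence-level embedding: use the [CLS] token of the last hidden layer
    # rather than averaging individual word embeddings.
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = bert(**inputs)
    cls = outputs.last_hidden_state[:, 0, :]   # (1, 768)
    return text_proj(cls)                      # (1, 512) in the common space

# Usage: compare a text query with a (placeholder) SlowFast clip feature.
query_emb = encode_sentence("a man is playing a guitar")
clip_emb = video_proj(torch.randn(1, 2304))    # stand-in for real SlowFast features
similarity = torch.cosine_similarity(query_emb, clip_emb)
```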

Related Work
Cross-Modal Learning Based on Semantic Correlation and Multi-Task Learning
Multi-Level Video Semantic Feature Encoding
Global Encoding
Temporal-Aware Encoding
Temporal-Domain Multi-Scale Encoding
Multi-Level Text Semantic Feature Encoding
Cross Modal Multi-Task Learning
Text-Video Similarity Task Loss
Text-Video Semantic Consistency Classification Task Loss
Experiments
Dataset
Measurements
Implementation Details
Methods
Conclusions