Multiple Visual-Semantic Embedding for Video Retrieval from Query Sentence

Huy Manh Nguyen, Yoshihiro Sugaya, Shinichiro Omachi, Tomo Miyazaki

doi:10.3390/app11073214

Abstract

Visual-semantic embedding aims to learn a joint embedding space where related video and sentence instances are located close to each other. Most existing methods put instances in a single embedding space. However, they struggle to embed instances because it is difficult to match the visual dynamics in videos to the textual features in sentences. A single space is not enough to accommodate the variety of videos and sentences. In this paper, we propose a novel framework that maps instances into multiple individual embedding spaces so that we can capture multiple relationships between instances, leading to compelling video retrieval. We produce a final similarity between instances by fusing the similarities measured in each embedding space using a weighted sum strategy. We determine the weights from the query sentence, so the model can flexibly emphasize a particular embedding space. We conducted sentence-to-video retrieval experiments on a benchmark dataset. The proposed method achieved superior performance, and the results are competitive with state-of-the-art methods. These experimental results demonstrate the effectiveness of the proposed multiple embedding approach compared to existing methods.
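
The weighted-sum fusion described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the use of cosine similarity within each space, the softmax normalization of the weights, and the names `cosine`, `softmax`, `fused_similarity`, and `weight_logits` are all assumptions made for illustration. In the paper the weights are determined from the sentence; here they are simply passed in as per-space scores.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (assumed per-space similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1 (an assumption)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fused_similarity(video_embs, sentence_embs, weight_logits):
    """Weighted sum of per-space similarities.

    video_embs[k] and sentence_embs[k] are the embeddings of one video and one
    sentence in embedding space k. weight_logits[k] is a hypothetical
    sentence-dependent score for space k.
    """
    weights = softmax(weight_logits)
    sims = [cosine(v, s) for v, s in zip(video_embs, sentence_embs)]
    return sum(w * s for w, s in zip(weights, sims))


# Two embedding spaces: the pair matches perfectly in the first space and is
# orthogonal in the second. Equal logits give equal weights.
video = [[1.0, 0.0], [0.0, 1.0]]
sentence = [[1.0, 0.0], [1.0, 0.0]]
print(fused_similarity(video, sentence, [0.0, 0.0]))  # 0.5
```

Raising the logit of the first space shifts the fused score toward that space's similarity, which is the sense in which a sentence-dependent weighting can emphasize one embedding space over another.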

Highlights

Video has become an essential source for humans to learn and acquire knowledge. Due to the increased demand for sharing and accumulating information, a massive amount of video is produced in the world every day.
We summarize the results of the sentence-to-video retrieval task on the Microsoft Research Video to Text dataset (MSR-VTT).
We presented a novel framework for embedding videos and sentences into multiple embedding spaces.

Summary

Introduction

Video has become an essential source for humans to learn and acquire knowledge. Due to the increased demand for sharing and accumulating information, a massive amount of video is produced in the world every day. Just as humans experience the world with multiple senses, the goal of multimodal learning is to develop a model that can simultaneously process multiple modalities, such as visual, text, and audio, in an integrated manner by constructing a joint embedding space. Such models can map various modalities into a shared Euclidean space where distances and directions capture useful semantic relationships. We propose a novel framework equipped with multiple embedding networks so that we can capture various relationships between video and sentence, leading to more compelling video retrieval. We conducted video retrieval experiments using query sentences on the standard benchmark dataset and demonstrated that our approach improves on existing methods.

Vision and Language Understanding

Video and Sentence Embedding

Overview

Textual Embedding Network

Global Visual Network

Sequential Visual Network

Similarity Aggregation

Optimization

Extendability

Experiments

Sentence-to-Video Retrieval Results

Method

Embedding Spaces

Spatial Attention Mechanism

Findings

Conclusions


Journal: Applied Sciences	Publication Date: Apr 3, 2021
Citations: 1	License type: CC BY 4.0

