A comprehensive review of the video-to-text problem

Jesus Perez-Martin, Benjamin Bustos, Grethel Coello Said, Jorge Pérez, Ivan Sipiran, Silvio Jamil F Guimarães

doi:10.1007/s10462-021-10104-1

Abstract

Research in the Vision and Language area encompasses challenging topics that seek to connect visual and textual information. When the visual information comes from videos, this leads to Video-Text Research, which includes several challenging tasks such as video question answering, video summarization with natural language, and video-to-text and text-to-video conversion. This paper reviews the video-to-text problem, in which the goal is to associate an input video with its textual description. This association can be made mainly in two ways: by retrieving the most relevant descriptions from a corpus, or by generating a new one given a context video. These two approaches correspond to essential tasks for the Computer Vision and Natural Language Processing communities: the text retrieval from video task and the video captioning/description task. Both tasks are substantially more complex than predicting or retrieving a single sentence from an image. The spatiotemporal information present in videos introduces diversity and complexity regarding the visual content and the structure of the associated language descriptions. This review categorizes and describes the state-of-the-art techniques for the video-to-text problem. It covers the main video-to-text methods and the ways to evaluate their performance. We analyze twenty-six benchmark datasets, showing their drawbacks and strengths with respect to the problem requirements. We also show the progress that researchers have made on each dataset, cover the challenges in the field, and discuss future research directions.

Journal: Artificial Intelligence Review
Publication Date: Jan 16, 2022
Citations: 10
