Text-Driven Video Prediction

Xue Song, Bin Zhu, Yu-Gang Jiang, Jingjing Chen

doi:10.1145/3675171

Abstract

Current video generation models usually convert appearance and motion signals, received from inputs (e.g., an image and text) or sampled from latent spaces (e.g., noise vectors), into consecutive frames. Generation is therefore stochastic, owing to the uncertainty introduced by latent code sampling. However, this generation pattern lacks deterministic constraints on both appearance and motion, leading to uncontrollable and undesirable outcomes. To address this, we propose a new task called Text-driven Video Prediction (TVP). Taking the first frame and a text caption as inputs, the task aims to synthesize the following frames: the image provides the appearance component and the caption provides the motion component. The key to the TVP task is to fully explore the motion information underlying the text description so as to generate plausible videos. The task is intrinsically a cause-and-effect problem, since the text content directly influences how motion changes across frames. To investigate the capability of text for causal inference of progressive motion information, our TVP framework contains a Text Inference Module (TIM), which produces step-wise embeddings that regulate motion inference for subsequent frames. In particular, a refinement mechanism incorporating global motion semantics guarantees coherent generation. Extensive experiments on the Something-Something V2 and Single Moving MNIST datasets show that our model outperforms the baselines, verifying the effectiveness of the proposed framework.
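The data flow of the TVP task (first frame plus caption in, subsequent frames out, with one text-derived embedding regulating each step) can be illustrated with a minimal sketch. This is a toy under stated assumptions, not the authors' model: `predict_video`, `toy_text_inference`, and `toy_frame_predictor` are hypothetical names, the "embedding" is just a `[dx, dy]` pixel shift read from direction words in the caption, and the refinement mechanism is not modelled.

```python
from typing import Callable, List

Frame = List[List[int]]      # toy grayscale frame: rows of pixel values
StepEmbedding = List[int]    # toy per-step motion code: [dx, dy]


def toy_text_inference(caption: str, num_steps: int) -> List[StepEmbedding]:
    """Hypothetical stand-in for a text inference module: one motion code per step."""
    directions = {"right": [1, 0], "left": [-1, 0], "down": [0, 1], "up": [0, -1]}
    # Use the first direction word found in the caption; default to no motion.
    motion = next((v for k, v in directions.items() if k in caption.lower()), [0, 0])
    return [list(motion) for _ in range(num_steps)]


def toy_frame_predictor(prev: Frame, emb: StepEmbedding) -> Frame:
    """Hypothetical stand-in for a frame generator: shift pixels, wrapping at edges."""
    dx, dy = emb
    h, w = len(prev), len(prev[0])
    return [[prev[(y - dy) % h][(x - dx) % w] for x in range(w)] for y in range(h)]


def predict_video(
    first_frame: Frame,
    caption: str,
    num_frames: int,
    text_inference: Callable[[str, int], List[StepEmbedding]] = toy_text_inference,
    frame_predictor: Callable[[Frame, StepEmbedding], Frame] = toy_frame_predictor,
) -> List[Frame]:
    """Appearance comes from the first frame; motion comes from the caption."""
    step_embeddings = text_inference(caption, num_frames - 1)
    frames = [first_frame]
    for emb in step_embeddings:
        # Each new frame depends on the previous frame and that step's embedding.
        frames.append(frame_predictor(frames[-1], emb))
    return frames


video = predict_video([[1, 0, 0, 0]], "move right", num_frames=3)
print(video)  # [[[1, 0, 0, 0]], [[0, 1, 0, 0]], [[0, 0, 1, 0]]]
```

Because the caption and first frame fully determine the rollout here, the output is deterministic, which is the property the task definition contrasts with latent-code sampling.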

More From: ACM Transactions on Multimedia Computing, Communications, and Applications

Similar Papers

Straight or curved? From deterministic to probabilistic models of 3D motion perception
Martin Lages
Frontiers in Behavioral Neuroscience | VOL. 7
01 Jan 2013

ODD-VGAN: Optimised Dual Discriminator Video Generative Adversarial Network for Text-to-Video Generation with Heuristic Strategy
Rayeesa Mehmood ... Kaiser J Giri
Journal of Information & Knowledge Management | VOL. -
29 Jul 2023

Do Cross Modal Systems Leverage Semantic Relationships?
Shah Nawaz ... Arif Mahmood
01 Oct 2019

Recurrent Deconvolutional Generative Adversarial Networks with Application to Video Generation
Hongyuan Yu ... Yan Huang
01 Jan 2019
