Abstract

Generating videos with semantic meaning, such as gestures in sign language, is a challenging problem. The model must not only learn to generate videos with realistic appearance, but also attend to crucial details in each frame so that precise information is conveyed. In this paper, we focus on the problem of generating long-term gesture videos that carry precise and complete semantic meaning. We develop a novel architecture to learn the temporal and spatial transforms in regions of interest, i.e., the gesturing hands and face in our case. We adopt a hierarchical approach to gesture video generation: we first predict future pose configurations, and then use an encoder-decoder architecture to synthesize future frames conditioned on the predicted pose structures. We introduce an action-progress scheme in our architecture to represent how far an action has advanced within its expected execution, which instructs the model to synthesize actions at various paces. Our approach is evaluated on two challenging datasets for the task of gesture video generation. Experimental results show that our method produces gesture videos with more realistic appearance and more precise meaning than state-of-the-art video generation approaches.
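To make the two-stage pipeline concrete, below is a minimal sketch, assuming PyTorch: a pose predictor conditioned on an action-progress scalar, followed by an encoder-decoder that renders a frame from a reference image and a rasterized pose map. The module names, tensor shapes, and layer choices here are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of the hierarchical pipeline described in the abstract.
import torch
import torch.nn as nn


class PosePredictor(nn.Module):
    """Predicts future pose keypoints, conditioned on action progress in [0, 1]."""

    def __init__(self, num_keypoints=21, hidden=256):
        super().__init__()
        # Input per step: flattened (x, y) keypoints plus one progress scalar.
        self.gru = nn.GRU(num_keypoints * 2 + 1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_keypoints * 2)

    def forward(self, past_poses, progress):
        # past_poses: (B, T, K*2); progress: (B, T, 1), fraction of action completed.
        h, _ = self.gru(torch.cat([past_poses, progress], dim=-1))
        return self.head(h)  # predicted poses for the next T steps


class FrameSynthesizer(nn.Module):
    """Encoder-decoder that renders a frame from a reference image and a pose map."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3 + 1, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, ref_frame, pose_map):
        # ref_frame: (B, 3, H, W); pose_map: (B, 1, H, W) rasterized keypoints.
        return self.decoder(self.encoder(torch.cat([ref_frame, pose_map], dim=1)))
```

In this reading, the progress scalar lets the same pose predictor be driven faster or slower at inference time, which is one plausible way to realize the "various paces" behavior the abstract describes.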
