Abstract

Generative adversarial networks (GANs) have demonstrated strong performance on image generation tasks, and a large number of studies have extended image generation models to video generation as well. However, because of the complexity of video generation, applying GANs in the video domain is far from trivial: the generated content must be both spatially and temporally coherent. Generating videos from text is even more challenging, since semantic consistency between the input text and the video must be maintained in addition to spatial and temporal coherence. In this paper, we compare three recently proposed text-to-video GAN architectures. The first is the Text-Filter conditioning Generative Adversarial Network (TFGAN), which employs an effective feature-fusion scheme in which discriminative convolutional filters are generated from the text features and then convolved with the image features in the discriminator. The second is the Introspective Recurrent Convolutional GAN (IRC-GAN), which leverages mutual-information introspection to maintain semantic consistency between the generated videos and the input text. The third is the Bottom-up GAN (BoGAN), which introduces losses at three levels: a region-level loss, a frame-level loss, and a video-level loss.
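
To make the TFGAN-style fusion scheme concrete, the sketch below shows one way text-conditioned convolutional filters can be generated from a sentence embedding and convolved with the discriminator's image features. This is a minimal illustration in PyTorch, not the authors' reference implementation; the module name, layer sizes, and tensor shapes are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextFilterFusion(nn.Module):
    """Minimal sketch of text-conditioned filter fusion (TFGAN-style).

    A text embedding is mapped to a bank of convolutional filters, which
    are then convolved with the discriminator's image feature map. All
    sizes here are illustrative, not taken from the paper.
    """

    def __init__(self, text_dim=256, feat_channels=64, n_filters=32, k=3):
        super().__init__()
        self.n_filters = n_filters
        self.k = k
        # Predict one filter bank (n_filters x feat_channels x k x k) per caption.
        self.filter_gen = nn.Linear(text_dim, n_filters * feat_channels * k * k)

    def forward(self, img_feat, text_emb):
        # img_feat: (B, C, H, W) image features from the discriminator backbone
        # text_emb: (B, text_dim) sentence embedding of the caption
        b, c, h, w = img_feat.shape
        filters = self.filter_gen(text_emb).view(b * self.n_filters, c, self.k, self.k)
        # Grouped convolution applies each sample's own filter bank to its own feature map.
        out = F.conv2d(img_feat.view(1, b * c, h, w), filters,
                       padding=self.k // 2, groups=b)
        return out.view(b, self.n_filters, h, w)


# Usage: fuse caption-conditioned filters with image features before scoring real/fake.
fusion = TextFilterFusion()
img_feat = torch.randn(4, 64, 16, 16)   # dummy discriminator features
text_emb = torch.randn(4, 256)          # dummy caption embeddings
fused = fusion(img_feat, text_emb)      # -> (4, 32, 16, 16)
```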
