Abstract
Recent advances in unsupervised deep generative models have produced impressive results in image and video generation tasks. However, generating high-quality video remains challenging: existing approaches often yield blurred frames and poor overall video quality. In this paper, we introduce a novel generative framework, the dynamic generative adversarial network (dynamic GAN), for regulating adversarial training and generating photo-realistic, high-quality sign language videos. The proposed model takes skeletal pose information and a person image as input and produces high-quality videos. In the generator phase, a U-Net-like network generates target frames from the skeletal poses. The generated samples are then classified with a VGG-19 network to identify their word class. The discriminator network distinguishes real from generated samples, and the resulting frames are concatenated to produce the high-quality video output. Unlike existing approaches, the proposed framework produces photo-realistic results without resorting to animation or avatar techniques. The model is evaluated qualitatively and quantitatively on three benchmark datasets, with plausible results: the RWTH-PHOENIX-Weather 2014T dataset, our self-created Indian Sign Language dataset (ISL-CSLTR), and the UCF-101 action recognition dataset. The proposed model achieves an average PSNR of 28.7167, an average SSIM of 0.921, an average FID of 14, and an average Inception Score of 8.73 ± 0.23, improving on existing approaches.
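To make the described pipeline concrete, below is a minimal PyTorch sketch of the generation step the abstract outlines: a U-Net-like generator conditioned on a skeletal-pose map and a person image, a discriminator scoring frames, and a VGG-19 backbone for word-class recognition. The module names (PoseGenerator, FrameDiscriminator), the shallow encoder-decoder standing in for the full U-Net, the vocabulary size, and all layer dimensions are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the dynamic GAN pipeline described in the abstract.
# All module names, layer sizes, and the vocabulary size are assumptions
# for illustration, not the authors' actual architecture.
import torch
import torch.nn as nn
from torchvision.models import vgg19

class PoseGenerator(nn.Module):
    """U-Net-like generator: skeletal-pose map + person image -> target frame."""
    def __init__(self):
        super().__init__()
        # Encoder input has 6 channels: 3 (pose map rendered as RGB) + 3 (person image).
        self.enc = nn.Sequential(
            nn.Conv2d(6, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Decoder mirrors the encoder back up to a 3-channel frame in [-1, 1].
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, pose, person):
        x = torch.cat([pose, person], dim=1)  # condition on both inputs
        return self.dec(self.enc(x))

class FrameDiscriminator(nn.Module):
    """Patch-style discriminator scoring frames as real or generated."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 1, 4, stride=2, padding=1),
        )

    def forward(self, frame):
        return self.net(frame)

# Word-class recognition of generated frames via a VGG-19 backbone,
# with the final layer resized to the sign vocabulary (size assumed).
num_word_classes = 100  # hypothetical vocabulary size
classifier = vgg19(weights=None)
classifier.classifier[6] = nn.Linear(4096, num_word_classes)

# One generation step on dummy data; in the full pipeline, generated
# frames are concatenated along the time axis to form the output video.
pose = torch.randn(1, 3, 256, 256)
person = torch.randn(1, 3, 256, 256)
G, D = PoseGenerator(), FrameDiscriminator()
fake_frame = G(pose, person)
realism_score = D(fake_frame)
word_logits = classifier(fake_frame)
print(fake_frame.shape, realism_score.shape, word_logits.shape)
```

In the paper's setup, the discriminator also drives the concatenation of frames into the final video; here that step is reduced to a comment, since the abstract does not specify how it is implemented.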