Abstract

Deep learning based visual-to-sound generation systems have been developed that infer and synthesize audio features from video signals. However, these techniques often fail to consider the time-synchronicity of the visual and audio features. In this paper we introduce a novel method for guiding a class-conditioned GAN to synthesize representative audio from temporally extracted visual information. We accomplish this visual-to-sound generation task by modeling the synchronicity between the audio and visual modalities. Our proposed FoleyGAN model conditions on action sequences of visual events to generate visually aligned, realistic soundtracks. We expanded our previously proposed Automatic Foley data set. We evaluated FoleyGAN's synthesized sound output through human surveys, which show noteworthy audio-visual synchronicity performance (81% on average). Our approach outperforms baseline models on audio-visual data sets in statistical and ablation experiments, achieving improved IS, FID, and NDB scores. In ablation analysis we showed the significance of our visual and temporal feature extraction method as well as the improved performance of our generation network. Overall, our FoleyGAN model achieved a sound retrieval accuracy of 76.08%, surpassing existing visual-to-audio synthesis deep neural networks.
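The abstract does not describe the implementation, but the core idea of a class-conditioned generator driven by temporally extracted visual features can be illustrated with a minimal sketch. The following PyTorch-style code is a hypothetical example, not the FoleyGAN architecture: all dimensions, module names (e.g. ClassConditionedAudioGenerator), and layer choices are assumptions made for illustration.

```python
# Hypothetical sketch of a class-conditioned audio generator that takes
# temporally extracted visual features as conditioning input. Dimensions,
# names, and architecture are illustrative assumptions, not the paper's model.
import torch
import torch.nn as nn

class ClassConditionedAudioGenerator(nn.Module):
    def __init__(self, num_classes=10, visual_dim=512, noise_dim=128,
                 hidden_dim=256, audio_len=16384):
        super().__init__()
        self.class_embed = nn.Embedding(num_classes, hidden_dim)
        # Summarize per-frame visual features over time (temporal context).
        self.temporal_encoder = nn.GRU(visual_dim, hidden_dim, batch_first=True)
        # Map [noise | class embedding | visual summary] to an audio waveform.
        self.decoder = nn.Sequential(
            nn.Linear(noise_dim + 2 * hidden_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, audio_len),
            nn.Tanh(),  # waveform samples in [-1, 1]
        )

    def forward(self, noise, class_ids, visual_seq):
        # noise:      (batch, noise_dim)
        # class_ids:  (batch,) integer action/class labels
        # visual_seq: (batch, frames, visual_dim) per-frame visual features
        _, h = self.temporal_encoder(visual_seq)   # h: (1, batch, hidden_dim)
        visual_summary = h.squeeze(0)
        cond = torch.cat([noise, self.class_embed(class_ids), visual_summary], dim=1)
        return self.decoder(cond)

# Example usage with random tensors.
gen = ClassConditionedAudioGenerator()
audio = gen(torch.randn(2, 128), torch.tensor([3, 7]), torch.randn(2, 30, 512))
print(audio.shape)  # torch.Size([2, 16384])
```

In an adversarial setup, a discriminator would receive the generated waveform together with the same class and visual conditioning, so that temporal misalignment between sound and video is penalized during training.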
