Abstract

Deep learning based visual-to-sound generation systems can identify and synthesize audio features from video signals. However, these techniques often fail to account for the temporal synchronicity between the visual and audio features. In this paper we introduce a novel method for guiding a class-conditioned GAN to synthesize representative audio from temporally extracted visual information, exploiting the synchronicity between the audio and visual modalities. Our proposed FoleyGAN model conditions on the action sequences of visual events to generate realistic soundtracks that are aligned with the video. We also expand our previously proposed Automatic Foley data set. Human surveys of FoleyGAN's synthesized sound output show noteworthy audio-visual synchronicity (81% on average). In statistical and ablation experiments our approach outperforms baseline models across audio-visual data sets, achieving improved Inception Score (IS), Fréchet Inception Distance (FID), and Number of Statistically Different Bins (NDB) scores. The ablation analysis demonstrates the significance of our visual and temporal feature extraction method, as well as the improved performance of our generation network. Overall, FoleyGAN achieves a sound retrieval accuracy of 76.08%, surpassing existing visual-to-audio synthesis deep neural networks.
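To make the class-conditioned generation idea concrete, the sketch below shows a minimal conditional generator that combines a noise vector, an action-class label, and a temporally pooled visual feature embedding to produce a spectrogram-like output. This is an illustrative assumption, not the authors' FoleyGAN architecture; all layer sizes, names (e.g., `ConditionalAudioGenerator`), and dimensions are hypothetical.

```python
# Illustrative sketch (not the FoleyGAN implementation): a class-conditioned
# generator conditioned on temporal visual features. All dimensions are
# hypothetical placeholders.
import torch
import torch.nn as nn


class ConditionalAudioGenerator(nn.Module):
    def __init__(self, noise_dim=128, num_classes=12, visual_dim=256,
                 spec_bins=80, spec_frames=128):
        super().__init__()
        self.class_embed = nn.Embedding(num_classes, 64)
        self.net = nn.Sequential(
            nn.Linear(noise_dim + 64 + visual_dim, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, 1024),
            nn.ReLU(inplace=True),
            nn.Linear(1024, spec_bins * spec_frames),
            nn.Tanh(),  # spectrogram values scaled to [-1, 1]
        )
        self.spec_bins = spec_bins
        self.spec_frames = spec_frames

    def forward(self, noise, class_idx, visual_feat):
        # Concatenate noise, class embedding, and pooled temporal visual features.
        cond = torch.cat([noise, self.class_embed(class_idx), visual_feat], dim=1)
        out = self.net(cond)
        return out.view(-1, 1, self.spec_bins, self.spec_frames)


# Usage: one forward pass with random inputs.
gen = ConditionalAudioGenerator()
z = torch.randn(4, 128)
labels = torch.randint(0, 12, (4,))
vis = torch.randn(4, 256)        # e.g., temporally pooled per-clip visual features
fake_spec = gen(z, labels, vis)  # shape: (4, 1, 80, 128)
```

In a full GAN setup, a discriminator would score the generated spectrograms against real ones conditioned on the same class and visual features; that component is omitted here for brevity.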
