Abstract

Conventional motion prediction has achieved promising performance. However, the predicted motion sequences in most prior work are short, and the rhythm of the generated pose sequence has rarely been explored. To pursue high-quality, rhythmic, and long-term pose sequence prediction, this paper explores a novel dancing-with-sound task, which is appealing and challenging in the field of computer vision. To tackle this problem, we propose a novel model that takes sound as an indicator input and outputs a dancing pose sequence. Specifically, our model is based on the variational autoencoder (VAE) framework, which encodes the continuity and rhythm of the sound into the latent space to generate coherent, diverse, rhythmic, and long-term pose videos. Extensive experiments validate the effectiveness of audio cues in the generation of dancing pose sequences. In addition, a novel dataset for audiovisual multimodal sequence generation has been released to promote the development of this field.
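
The abstract describes the model only at a high level; the following minimal PyTorch sketch illustrates one way an audio-conditioned VAE for pose sequences could be structured. The module choices (GRU encoder/decoder) and all dimensions (audio_dim, pose_dim, latent_dim) are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class AudioConditionedPoseVAE(nn.Module):
    """Illustrative sketch: encode an audio feature sequence into a latent
    distribution, then decode a pose (keypoint) sequence from a sample.
    Dimensions and layer types are placeholders, not the paper's design."""

    def __init__(self, audio_dim=128, pose_dim=34, latent_dim=64, hidden=256):
        super().__init__()
        # GRU encoder summarizes the audio sequence (continuity and rhythm cues).
        self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)
        # GRU decoder unrolls one pose frame per audio frame, conditioned on z.
        self.dec = nn.GRU(latent_dim + audio_dim, hidden, batch_first=True)
        self.to_pose = nn.Linear(hidden, pose_dim)

    def forward(self, audio):                      # audio: (B, T, audio_dim)
        _, h = self.audio_enc(audio)               # h: (1, B, hidden)
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        z_seq = z.unsqueeze(1).expand(-1, audio.size(1), -1)
        out, _ = self.dec(torch.cat([z_seq, audio], dim=-1))
        return self.to_pose(out), mu, logvar       # poses: (B, T, pose_dim)

def vae_loss(pred, target, mu, logvar, beta=1e-3):
    """Standard VAE objective: pose reconstruction plus KL regularization."""
    recon = nn.functional.mse_loss(pred, target)
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kld
```

Sampling different latent vectors z for the same audio clip is what would give diverse pose sequences under this kind of formulation.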

Highlights

  • Audiovisual multimodal generation is an important research problem in computer vision, with a wide range of applications such as recovering missing modalities, generating zero-shot samples, and inspiring artistic creation

  • To pursue rhythmic and long-term pose sequence prediction, this paper explores a novel dancing-with-sound task, which is appealing and challenging in the field of computer vision

  • Our model is based on the variational autoencoder (VAE) framework, which encodes the continuity and rhythm of the sound into the latent space to generate coherent, diverse, rhythmic, and long-term pose videos


Summary

Introduction

Audiovisual multimodal generation is an important research problem in computer vision, with a wide range of applications such as recovering missing modalities, generating zero-shot samples, and inspiring artistic creation. The generation of audiovisual modalities still faces huge challenges due to the large discrepancy between the visual and audio modalities. To address the problem of audiovisual multimodal generation, Chen et al. [1] developed a conditional generative adversarial network (GAN), which has several defects. The mutual generation of visual-to-audio and audio-to-visual is realized through two independent networks, and generation along each path is carried out in two stages: the first stage extracts the discriminative information of the known modality, and the second stage uses the extracted information to generate the corresponding unknown modality. These two phases limit the efficiency of generation.
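
As a rough illustration of the two-stage design criticized above, the sketch below separates stage 1 (extracting discriminative features from the known modality) from stage 2 (generating the unknown modality), with a symmetric, independent network implied for the reverse direction. The layer choices and dimensions are hypothetical placeholders, not the architecture from [1].

```python
import torch.nn as nn

# Hypothetical feature sizes, for illustration only.
AUDIO_DIM, VISUAL_DIM, FEAT_DIM = 128, 512, 64

class TwoStageAudioToVisual(nn.Module):
    """Sketch of one generation path in a two-stage cross-modal pipeline:
    stage 1 extracts discriminative features from the known (audio) modality,
    stage 2 generates the unknown (visual) modality from those features.
    A separate, independent network handles the visual-to-audio path."""

    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(AUDIO_DIM, FEAT_DIM), nn.ReLU())   # discriminative info
        self.stage2 = nn.Sequential(nn.Linear(FEAT_DIM, VISUAL_DIM), nn.Tanh())  # conditional generator

    def forward(self, audio):
        feat = self.stage1(audio)   # stage 1: extract features from the known modality
        return self.stage2(feat)    # stage 2: generate the unknown modality
```

The two sequential stages, duplicated across two independent networks, are what the paper identifies as the efficiency bottleneck.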

