Abstract

Audio-visual cross-modality generation refers to generating audio or visual content from input in the other modality. One of the key tasks in this field is the generation of realistic talking facial videos from audio and head pose information, which has significant applications in human–computer interaction, virtual reality, and video production. However, previous work suffers from limitations such as the inability to generate natural head poses or to couple head motion with the audio, which compromises the realism and expressive power of the generated videos. This paper aims to address these issues and improve the state of the art in this field. To this end, we propose an autoregressive generation method called Flow2Flow and collect a large-scale in-the-wild solo-singing-themed audio-visual dataset called AVVS to investigate rhythmic head movement patterns. The Flow2Flow model employs a multimodal transformer block with cross-attention that encodes audio features and historical head poses to capture the latent audio-visual motion entanglement, and uses normalizing flows to generate future facial motion representation sequences. The generated motion representations are identity-independent, allowing the method to be transferred to any face identity. We model the motion of the image content with warping flows derived from 3D keypoints driven by the facial motion representation sequences, estimate dense motion fields from these deformation flows, and use a neural rendering model to produce photo-realistic talking facial videos. Experimental results show that our method generates photo-realistic videos with natural head poses and accurate lip-syncing, and comparisons with state-of-the-art methods on two public datasets validate its effectiveness.
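To make the motion-generation stage described above concrete, the following is a minimal sketch of how a cross-attention transformer block fusing audio features with historical head poses could feed a conditional normalizing flow that samples the next facial motion representation. All module names, dimensions, and the single affine-coupling layer are illustrative assumptions, not the paper's exact architecture.

```python
# Hypothetical sketch: cross-modal fusion + conditional normalizing flow.
import torch
import torch.nn as nn


class CrossModalBlock(nn.Module):
    """Cross-attention: head-pose history queries attend to audio features."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, pose_hist: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # pose_hist: (B, T_pose, dim), audio: (B, T_audio, dim)
        x = self.norm1(pose_hist + self.attn(pose_hist, audio, audio)[0])
        return self.norm2(x + self.ff(x))


class ConditionalAffineCoupling(nn.Module):
    """One affine-coupling step of a conditional normalizing flow."""

    def __init__(self, motion_dim: int = 64, cond_dim: int = 256):
        super().__init__()
        self.half = motion_dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + cond_dim, 256), nn.ReLU(),
            nn.Linear(256, 2 * (motion_dim - self.half)),
        )

    def forward(self, z: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # Map a Gaussian latent z to a motion code, conditioned on fused features.
        z1, z2 = z[:, : self.half], z[:, self.half:]
        scale, shift = self.net(torch.cat([z1, cond], dim=-1)).chunk(2, dim=-1)
        return torch.cat([z1, z2 * torch.exp(scale) + shift], dim=-1)


class MotionGenerator(nn.Module):
    """One autoregressive step: fuse modalities, then sample the next motion code."""

    def __init__(self, dim: int = 256, motion_dim: int = 64):
        super().__init__()
        self.fuse = CrossModalBlock(dim)
        self.flow = ConditionalAffineCoupling(motion_dim, dim)

    def forward(self, pose_hist: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        cond = self.fuse(pose_hist, audio).mean(dim=1)          # (B, dim)
        z = torch.randn(pose_hist.size(0), self.flow.half * 2)  # base Gaussian sample
        return self.flow(z, cond)                               # (B, motion_dim)


if __name__ == "__main__":
    gen = MotionGenerator()
    motion = gen(torch.randn(2, 10, 256), torch.randn(2, 40, 256))
    print(motion.shape)  # torch.Size([2, 64])
```

In an autoregressive rollout, the sampled motion code would be appended to the pose history and the step repeated; the resulting identity-independent motion sequence would then drive the 3D-keypoint warping and neural rendering stages.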
