Abstract
In music-driven dance movement generation, music-movement matching models and statistical mapping models fit the generated dance to the music itself poorly: the generated dance movements are often incomplete, long-term dance sequences lack smoothness and plausibility, and traditional models cannot generate novel dance moves. To address these issues, we design a dance generation algorithm based on movements and neural networks that extracts the mapping between music and movement features. In the first stage, the prosody features and audio beat features extracted from the music are used as music features, and the coordinates of key points of the human body extracted from dance videos are used as motion features for training. In the second stage, the generator module of the model realizes the basic mapping from music to dance movements and generates smooth dance poses; the discriminator module enforces consistency between the dance and the music; and the autoencoder module makes the audio features more representative. In the third and final stage, a modified version of the model transforms the dance pose sequence into a realistic dance video, yielding a realistic dance that fits the music. The experimental data are obtained from dance videos on the Internet, and the experimental results are analyzed from five aspects: loss function value, comparison against different baselines, evaluation of the sequence generation effect, a user study, and quality evaluation of the realistic dance videos. The results show that the proposed dance generation model performs well in the transformation into realistic dance videos.
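As a hedged illustration of the first-stage feature extraction described above, the sketch below pairs per-frame prosody features (MFCCs and onset strength) with beat positions using the librosa library. The paper does not specify its exact feature set, sampling rate, or hop length, so those choices here are assumptions for demonstration only.

```python
# Minimal sketch of stage-one music feature extraction.
# Assumptions: librosa for audio analysis; MFCC/onset/beat features
# are illustrative, not the paper's exact configuration.
import librosa
import numpy as np

def extract_music_features(audio_path, sr=16000, hop_length=512):
    y, sr = librosa.load(audio_path, sr=sr)

    # Prosody-style features: MFCCs capture the spectral envelope.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop_length)

    # Onset strength as a per-frame rhythmic intensity signal.
    onset_env = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop_length)

    # Beat tracking: flag frames that coincide with detected beats.
    _, beat_frames = librosa.beat.beat_track(
        onset_envelope=onset_env, sr=sr, hop_length=hop_length)
    beat_flags = np.zeros(onset_env.shape[0])
    beat_flags[beat_frames] = 1.0

    # Align frame counts defensively, then stack into one
    # (n_frames, n_features) matrix for the model.
    n = min(mfcc.shape[1], onset_env.shape[0])
    features = np.vstack([mfcc[:, :n],
                          onset_env[None, :n],
                          beat_flags[None, :n]]).T
    return features
```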
Highlights
With the development and popularization of deep learning in recent years, artificial neural networks have been successfully applied to the generation of dance movements.
For dance visualization, prior work typically draws the skeleton of the human body or applies animation processing directly from the coordinates of the body's key points, leaving room for further improvement in visual quality.
It is necessary to extract continuous dance pose data from a specific dance video using human pose estimation technology, and to design a dedicated audio feature encoder to extract audio features from the music matched to the dance (a pose-extraction sketch is given below). The dance data reflect the changes in the coordinates of the key points of the human body over time.
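As a concrete sketch of the pose-extraction step, the code below reads a dance video frame by frame and records 2D key-point coordinates. The paper does not name a specific pose estimator, so the use of MediaPipe Pose (with OpenCV for video decoding) is an assumption here.

```python
# Sketch: per-frame human key-point extraction from a dance video.
# Assumptions: MediaPipe Pose as the estimator, OpenCV for decoding;
# the paper's actual pose-estimation pipeline may differ.
import cv2
import mediapipe as mp
import numpy as np

def extract_pose_sequence(video_path):
    cap = cv2.VideoCapture(video_path)
    poses = []
    with mp.solutions.pose.Pose(static_image_mode=False) as pose:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB input; OpenCV decodes as BGR.
            result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if result.pose_landmarks is None:
                continue  # skip frames where no person is detected
            # Normalized (x, y) coordinates of the 33 body landmarks.
            coords = [(lm.x, lm.y) for lm in result.pose_landmarks.landmark]
            poses.append(coords)
    cap.release()
    # Shape: (n_frames, 33, 2) -- key-point trajectories over time.
    return np.asarray(poses)
```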
Summary
Current Research Status. Cross-modal generation from audio to video can be divided into three categories: body motion generation, audio-driven image generation, and talking-face video generation. Synthesizing the corresponding face video from speech or music is a typical cross-modal generation task. A typical approach uses two networks: one maps the audio to lip landmarks, and the other is a conditional GAN that generates facial images from the given lip landmarks. Together, these two networks can generate a natural talking-face sequence synchronized with the input audio track. Their model is trained in a self-supervised manner using the naturally aligned audio and video features in the video. Another type of cross-modal generation task is to generate the corresponding speech video from voice or text end to end, without the intervention of specified rules. Some researchers combined acoustic analysis with text [10], demonstrating a method of generating 3D virtual humans from audio signals by inferring the acoustic and semantic characteristics of speech. Other researchers trained on speech videos of any given speaker in a self-supervised manner [11], generated the corresponding speaking posture without adding any semantic information, and synthesized realistic speech videos.