Abstract

Audio-to-visual conversion is the fundamental problem in speech-driven facial animation. Since the task is to predict facial control parameters from the acoustic speech signal, an informative representation of the audio, i.e., the audio feature, is crucial for accurate prediction. This paper presents a performance comparison of prosodic, articulatory, and perceptual features for the audio-to-visual conversion problem on a common test bed. Experimental results show that the Mel frequency cepstral coefficients (MFCCs) produce the best performance, followed by the perceptual linear prediction coefficients (PLPCs), the linear predictive cepstral coefficients (LPCCs), and the prosodic feature set (F0 and energy). Combining the three kinds of features further improves the prediction of facial parameters, which indicates that different audio features carry complementary information relevant to facial animation.
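
As a concrete illustration (not the paper's own pipeline), the sketch below extracts three of the compared feature types with librosa: MFCCs, LPCCs, and the prosodic pair of F0 and energy. The file name, sampling rate, frame sizes, and feature orders are all illustrative assumptions; PLPCs are omitted because librosa does not provide them, and they are typically computed with speech toolkits such as HTK or Kaldi.

```python
import librosa
import numpy as np

def lpc_to_lpcc(a, n_ceps):
    """Convert an LPC polynomial [1, a1, ..., ap] (as returned by librosa.lpc)
    into cepstral coefficients via the standard recursion
    c_n = -a_n - (1/n) * sum_{k=1}^{n-1} k * c_k * a_{n-k}."""
    p = len(a) - 1
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = -a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc -= (k / n) * c[k - 1] * a[n - k]
        c[n - 1] = acc
    return c

# "speech.wav" is a placeholder path; 16 kHz is an assumed sampling rate.
y, sr = librosa.load("speech.wav", sr=16000)

# Perceptual feature: 13 Mel frequency cepstral coefficients per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)          # (13, n_frames)

# Articulatory-style feature: 12 LPCCs from order-12 LPC on Hamming-windowed
# 25 ms frames with a 10 ms hop (400/160 samples at 16 kHz); silent frames
# may need a small noise floor before LPC fitting in practice.
frames = librosa.util.frame(y, frame_length=400, hop_length=160)
window = np.hamming(400)
lpcc = np.stack([lpc_to_lpcc(librosa.lpc(f * window, order=12), 12)
                 for f in frames.T])                        # (n_frames, 12)

# Prosodic features: F0 via pYIN and short-time energy (RMS).
f0, voiced, _ = librosa.pyin(y, fmin=65, fmax=500, sr=sr)   # Hz, NaN when unvoiced
energy = librosa.feature.rms(y=y)                           # (1, n_frames)
```

Each extractor yields a frame-level sequence, so the three feature streams can be concatenated per frame to form the combined representation that the abstract reports as the strongest predictor.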
