Abstract
This paper addresses talking head synthesis based on the concatenation of units comprising both acoustic and visual information. The diphone units used to synthesize a given text string are selected by minimizing a weighted linear combination of four costs that reflect linguistic, acoustic, and visual considerations. We present initial work toward a method for automatically determining the weight applied to each cost, using a series of metrics that quantitatively assess synthesis performance.
Index Terms: talking head, audiovisual speech synthesis, selection, optimization
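The weighted selection criterion described in the abstract can be illustrated with a minimal sketch. The abstract does not name the four costs or give their values, so the cost vectors, the weights, and the function names `combined_cost` and `select_unit` below are hypothetical. The sketch also picks a single candidate in isolation, whereas a full system would presumably search over a whole sequence of units.

```python
from typing import Mapping, Sequence


def combined_cost(costs: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted linear combination of per-unit costs: sum of w_i * c_i."""
    if len(costs) != len(weights):
        raise ValueError("costs and weights must have the same length")
    return sum(w * c for w, c in zip(weights, costs))


def select_unit(candidates: Mapping[str, Sequence[float]],
                weights: Sequence[float]) -> str:
    """Return the name of the candidate unit with the lowest combined cost.

    `candidates` maps a (hypothetical) unit name to its four cost values.
    """
    return min(candidates, key=lambda name: combined_cost(candidates[name], weights))


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the paper.
    candidates = {
        "unit_a": [1.0, 2.0, 3.0, 4.0],
        "unit_b": [4.0, 3.0, 2.0, 1.0],
    }
    print(select_unit(candidates, weights=[1.0, 0.0, 0.0, 0.0]))
    print(select_unit(candidates, weights=[0.0, 0.0, 0.0, 1.0]))
```

Changing the weights changes which candidate wins, which is why choosing them automatically rather than by hand is of interest.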