Abstract

In human heads there is a strong structural linkage between the vocal tract and facial behavior during speech. For a robotic talking head to exhibit human-like behavior, this linkage should be emulated. One way to do so is to estimate the articulatory features of a given utterance and use them to control a talking head. In this paper, we describe an algorithm that estimates the articulatory features of a spoken sentence using a novel computational model of human vocalization based on fuzzy rules and genetic optimization. The places of articulation are treated as fuzzy sets whose degrees of membership are the values of the articulatory features. The fuzzy rules represent the relationships between places of articulation and speech acoustic parameters, and the genetic algorithm estimates the degrees of membership of the places of articulation according to an optimization criterion, thereby performing imitation learning. We verify the model through audio-visual subjective tests of animated talking heads, which show that the algorithm produces correct results. In particular, subjective listening tests of sentences synthesized from the articulatory description yielded an average phonetic accuracy slightly under 80%. By analyzing large amounts of natural speech, the algorithm can learn the places of articulation of all phonemes of a given speaker. The estimated places of articulation are then used to control talking heads in humanoid robotics.
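To make the optimization idea concrete, the following is a minimal Python sketch, not the authors' implementation: place-of-articulation names, the linear "rule" weights, the acoustic targets, and all parameter values are illustrative assumptions standing in for the paper's fuzzy rules and optimization criterion. It only shows the general shape of estimating fuzzy membership degrees with a genetic algorithm so that predicted acoustic parameters approach observed ones.

# Sketch only: fuzzy membership degrees of places of articulation are evolved
# by a simple genetic algorithm so that acoustic parameters predicted by
# hypothetical rules match an observed frame. All weights/targets are made up.
import random

PLACES = ["bilabial", "alveolar", "velar", "open_jaw", "lip_rounding"]
N_ACOUSTIC = 3  # assumed number of acoustic parameters per frame

# Hypothetical linear stand-in for the paper's fuzzy rules: each place of
# articulation contributes to each acoustic parameter with a fixed weight.
RULE_WEIGHTS = [
    [0.2, -0.4, 0.1],
    [0.5,  0.3, -0.2],
    [-0.3, 0.6, 0.4],
    [0.7, -0.1, 0.2],
    [-0.2, -0.5, 0.6],
]

def predict_acoustics(memberships):
    """Map membership degrees (0..1 per place) to acoustic parameters."""
    return [sum(m * RULE_WEIGHTS[i][k] for i, m in enumerate(memberships))
            for k in range(N_ACOUSTIC)]

def fitness(memberships, target):
    """Negative squared error between predicted and observed acoustics."""
    pred = predict_acoustics(memberships)
    return -sum((p - t) ** 2 for p, t in zip(pred, target))

def estimate_memberships(target, pop_size=60, generations=200, mut=0.1):
    """Genetic algorithm: evolve membership vectors toward the target frame."""
    pop = [[random.random() for _ in PLACES] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda ind: fitness(ind, target), reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(PLACES))
            child = a[:cut] + b[cut:]                    # one-point crossover
            child = [min(1.0, max(0.0, g + random.gauss(0, mut)))  # mutation
                     for g in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda ind: fitness(ind, target))

if __name__ == "__main__":
    observed = [0.4, -0.1, 0.3]  # stand-in acoustic frame
    for place, degree in zip(PLACES, estimate_memberships(observed)):
        print(f"{place:13s} {degree:.2f}")

In the paper, the analogous output (one membership vector per analyzed frame or phoneme) is what drives the articulatory control of the animated talking head.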
