Abstract

In this paper, we present a multimodal acquisition setup that combines several motion-capture systems. This setup is mainly aimed at recording an expressive audiovisual corpus for audiovisual speech synthesis. When recording speech, standard optical motion-capture systems fail to track the articulators finely, especially in the inner mouth region, because some markers disappear during articulation. In addition, some systems have limited frame rates and are not suitable for smooth speech tracking. In this work, we demonstrate how these limitations can be overcome by building a heterogeneous system that takes advantage of several tracking systems. Within the scope of this work, we recorded a prototypical corpus with our combined setup for a single subject. This corpus was used to validate our multimodal data-acquisition protocol and to assess the quality of the expressiveness before recording a large corpus. We conducted two evaluations of the recorded data: the first concerns the production aspect of speech and the second focuses on the perception aspect (both evaluations cover the visual and acoustic modalities). The production analysis allowed us to identify characteristics specific to each expressive context and showed that the expressive content of the recorded data is globally in line with what is commonly reported in the literature. The perceptual evaluation, conducted as a human emotion-recognition task with different types of stimuli, confirmed that the recorded emotions were well perceived.
