Abstract

Articulatory synthesis is based on modeling various physical phenomena of speech production, including sound radiation from the mouth. With regard to sound radiation, the most common approach is to approximate it in terms of a simple spherical source of strength equal to the mouth volume velocity. However, because this approximation is only valid at very low frequencies and does not account for the diffraction by the head and the torso, we simulated two alternative radiation characteristics that are potentially more realistic: the radiation from a vibrating piston in a spherical baffle, and the radiation from the mouth of a detailed model of the human head and torso. Using the articulatory speech synthesizer VocalTractLab, a corpus of 10 sentences was synthesized with the different radiation characteristics combined with three different phonation types. The synthesized sentences were acoustically compared with natural recordings of the same sentences in terms of their long-term average spectra (LTAS), and evaluated in terms of their naturalness and intelligibility. The intelligibility was not affected by the type of radiation characteristic. However, it was found that the more similar their LTAS was to real speech, the more natural the synthetic sentences were perceived to be. Hence, the naturalness was not directly determined by the realism of the radiation characteristic, but by the combined spectral effect of the radiation characteristic and the voice source. While the more realistic radiation models do not per se improve synthesis quality, they provide new insights in the study of speech production and articulatory synthesis.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call