Abstract

Various proposals have attempted to put speech targets into invariant “acoustic” form, sometimes with an additional transformation into “auditory” space. These targets, however, are not strictly acoustic. Targets for vowels, for example, are transformed so that vocal tract length differences (between talkers, especially across men, women, and children) are taken into account. Such a transformation makes the resulting targets combinations of articulatory and acoustic information. The auditory transformation improves automatic speech recognition, but the theoretical underpinnings of this result have been unclear. Ghosh, Goldstein, and Narayanan [J. Acoust. Soc. Am. 129, 4014–4022] show that the auditory transform maximizes articulatory information, indicating that this transform does not operate solely in the acoustic domain. Moreover, a lowered F3 has been proposed as the production target for American English /r/ [e.g., Nieto-Castanon et al., J. Acoust. Soc. Am. 117, 3196–3212], but synthesis that retains an exemplary /r/ F3 while altering F1 and F2 results in other percepts, such as /w/ or a pharyngeal glide. F3 is therefore insufficient by itself as an acoustic target; the target must also incorporate the articulatory dynamics implied by the other formants. Thus, “acoustic” invariants, to the extent they work at all, do so by incorporating articulation.
