Abstract
Our recent experiments with vocoded natural speech, in which the spectral envelope and fundamental frequency (f0) are manipulated independently, have confirmed that some coordination of f0 and formant patterns is beneficial to vowel identification by humans. In an effort to model this perceptual dependency more precisely, we have investigated the performance of several alternative pattern recognition models on natural speech samples. This paper reports on several quite distinct methods of exploiting statistical relations between formant frequencies and f0 for recognition. Many of these methods yield quite similar results on the classic Peterson and Barney data and on larger, more recently collected data sets. Methods involving indirect normalization, in which the f0 of a single token is restricted to the role of estimating the formant frequency average of a speaker’s entire vowel system, perform well. Indeed, they often perform better than a method in which the role of f0 is unconstrained and which can therefore accommodate inherent pitch differences among vowels. The indirect use of f0 also allows f0 and formant range information to be combined in ways that preliminary results suggest are more effective for modeling perceptual effects with modified stimuli. More formal evaluation against perceptual data will be presented.
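The idea of indirect normalization can be illustrated with a minimal sketch. The abstract does not specify the form of the estimator, so everything here is an assumption made for illustration: a linear fit in log frequency, the hypothetical function names `fit_f0_to_formant_mean` and `normalize_token`, and the synthetic speaker values, which are not taken from any of the data sets mentioned above.

```python
import numpy as np

def fit_f0_to_formant_mean(speaker_mean_f0_hz, speaker_mean_log_formant):
    """Fit speaker mean log-formant ~ a + b * log(f0) by least squares.

    Assumed model for illustration only; the paper's estimator may differ.
    """
    log_f0 = np.log(np.asarray(speaker_mean_f0_hz, dtype=float))
    # np.polyfit returns coefficients highest power first: [slope, intercept]
    b, a = np.polyfit(log_f0, np.asarray(speaker_mean_log_formant, dtype=float), 1)
    return a, b

def normalize_token(formants_hz, f0_hz, a, b):
    """Use the token's f0 only to estimate the speaker's mean log-formant,
    then express the token's formants relative to that estimate."""
    predicted_mean = a + b * np.log(f0_hz)
    return np.log(np.asarray(formants_hz, dtype=float)) - predicted_mean

# Synthetic training speakers (hypothetical values, exactly linear by construction)
train_f0 = np.array([120.0, 220.0, 270.0])
train_log_formant_mean = 5.0 + 0.4 * np.log(train_f0)

a, b = fit_f0_to_formant_mean(train_f0, train_log_formant_mean)

# One token from an unseen speaker: F1 and F2 in Hz, plus the token's f0
normalized = normalize_token([500.0, 1500.0], 200.0, a, b)
```

In this sketch f0 never enters the classifier as a feature in its own right; it only shifts the formant values, which is what distinguishes the indirect use of f0 from an unconstrained one. The normalized values could then be passed to any pattern classifier.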