Abstract

Many glottal source models have been proposed, but none has been systematically validated perceptually. Our previous work showed that the accuracy with which a model fits the negative peak of the flow derivative is the most important predictor of perceptual similarity to the target voice. In this study, a new voice source model motivated by high-speed laryngeal videoendoscopy is proposed to capture perceptually important aspects of source shape. Six voice source models (the proposed model, two previous models developed at UCLA, and the Fujisaki-Ljungkvist, Liljencrants-Fant, and Rosenberg models) were fitted to source pulses derived from 40 natural voices by inverse filtering and analysis-by-synthesis (AbS). We generated synthetic copies of the voices using each modeled source pulse, with all other parameters held constant, and then conducted a visual sort-and-rate task in which listeners assessed the extent of perceived match between the original natural voice samples and each copy. Model fitting results showed that the proposed model provides a more accurate fit to the AbS-derived source than the other models, and perceptual experiments showed that it provides a close match to the original natural voices. Further perceptual studies examining the extent to which each model matches the target tokens will also be reported. [Work supported by NSF grant IIS-1018863 and NIH/NIDCD grant DC01797.]
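To make the key quantity concrete, the sketch below generates one period of a glottal flow derivative pulse roughly in the spirit of the Liljencrants-Fant (LF) model named in the abstract. It is a minimal illustration under stated assumptions, not a reproduction of any fitted model from the study: the R-parameter values and the open-phase growth rate `alpha` are illustrative choices (a full LF implementation would instead solve `alpha` from the zero-net-flow area-balance constraint). The negative peak at time Te is the waveform feature the abstract identifies as the strongest predictor of perceptual similarity.

```python
import numpy as np

def lf_pulse(fs=16000, f0=100.0, Ra=0.01, Rg=1.2, Rk=0.3, Ee=1.0):
    """One period of a glottal flow derivative, LF-model-style sketch.

    Ra, Rg, Rk follow Fant's R-parameterization; the values here are
    illustrative assumptions, not fitted. `alpha` is fixed heuristically
    rather than solved from the area-balance (zero net flow) constraint.
    Returns (dUdt, Te): the sampled pulse and the negative-peak instant.
    """
    T0 = 1.0 / f0                 # fundamental period
    Tp = T0 / (2.0 * Rg)          # instant of maximum glottal flow
    Te = Tp * (1.0 + Rk)          # instant of the negative peak (main excitation)
    Ta = Ra * T0                  # return-phase time constant
    Tc = T0                       # glottal closure at the end of the period

    # Solve eps * Ta = 1 - exp(-eps * (Tc - Te)) by fixed-point iteration,
    # so the return phase decays to (almost) zero exactly at Tc.
    eps = 1.0 / Ta
    for _ in range(50):
        eps = (1.0 - np.exp(-eps * (Tc - Te))) / Ta

    wg = np.pi / Tp               # open-phase sinusoid: peak flow at Tp
    alpha = 3.0 / Te              # assumed growth rate (illustrative)
    E0 = -Ee / (np.exp(alpha * Te) * np.sin(wg * Te))  # so E(Te) = -Ee

    t = np.arange(int(round(fs * T0))) / fs
    open_phase = E0 * np.exp(alpha * t) * np.sin(wg * t)
    return_phase = -(Ee / (eps * Ta)) * (
        np.exp(-eps * (t - Te)) - np.exp(-eps * (Tc - Te))
    )
    dUdt = np.where(t < Te, open_phase, return_phase)
    return dUdt, Te
```

Fitting a model of this kind to an AbS-derived source pulse would then amount to adjusting parameters such as Ra, Rg, Rk, and Ee to minimize the error against the inverse-filtered flow derivative, with particular weight on matching the negative peak.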
