Abstract
We introduce a method for measuring the correspondence between low-level speech features and human perception, using a cognitive model of speech perception implemented directly on speech recordings. We evaluate two speaker normalization techniques using this method and find that in both cases, speech features that are normalized across speakers predict human data better than unnormalized speech features, consistent with previous research. Results further reveal differences across normalization methods in how well each predicts human data. This work provides a new framework for evaluating low-level representations of speech on their match to human perception, and lays the groundwork for creating more ecologically valid models of speech perception.
Highlights
Understanding the features that listeners extract from the speech signal is a critical part of understanding phonetic learning and perception
In this paper we introduce a method for measuring the correspondence between low-level speech feature representations and human speech perception
We find that mel frequency cepstral coefficients (MFCCs) normalized by vocal tract length normalization (VTLN) outperform z-scored MFCCs
Summary
Understanding the features that listeners extract from the speech signal is a critical part of understanding phonetic learning and perception. Changing the signal processing methods used to extract features from the speech waveform (Hermansky, 1990; Hermansky and Morgan, 1994) and applying speaker normalization techniques to those features (Wegmann et al., 1996; Povey and Saon, 2006) can improve the performance of an automatic speech recognition (ASR) system. Knowing how closely the feature representations used in ASR resemble those of human listeners is potentially useful in low-resource settings, where systems rely heavily on these features to guide generalization across speakers and dialects.
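As an illustration of the kind of low-level representations compared here, the sketch below shows MFCC extraction followed by per-speaker z-score normalization, one of the two normalization strategies evaluated. This is a minimal example using librosa, not the authors' exact pipeline; the frame parameters and the small variance floor are illustrative assumptions.

```python
# Minimal sketch (not the paper's exact pipeline): compute MFCCs for each
# recording and z-score each coefficient using that speaker's own statistics.
import numpy as np
import librosa


def extract_mfccs(wav_path, n_mfcc=13, sr=16000):
    """Return an (n_frames, n_mfcc) MFCC matrix for one recording."""
    signal, sr = librosa.load(wav_path, sr=sr)
    mfccs = librosa.feature.mfcc(
        y=signal, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(0.025 * sr),       # 25 ms analysis window (assumed)
        hop_length=int(0.010 * sr),  # 10 ms frame shift (assumed)
    )
    return mfccs.T  # frames as rows


def zscore_by_speaker(features_by_speaker):
    """Z-score each MFCC dimension with the speaker's own mean and variance.

    features_by_speaker maps a speaker ID to a list of (n_frames, n_mfcc)
    arrays, one per recording from that speaker.
    """
    normalized = {}
    for speaker, recordings in features_by_speaker.items():
        stacked = np.vstack(recordings)        # pool all frames from this speaker
        mu = stacked.mean(axis=0)
        sigma = stacked.std(axis=0) + 1e-8     # variance floor to avoid division by zero
        normalized[speaker] = [(rec - mu) / sigma for rec in recordings]
    return normalized
```

Per-speaker z-scoring removes gross differences in the mean and scale of each coefficient across talkers; VTLN instead warps the frequency axis before the filterbank, which is why the two methods can differ in how well they match human perceptual data.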