Abstract

The syllables of speech carry information about the vocal tract length (VTL) of the speaker as well as the phonetic message. Ideally, the pre-processor used for automatic speech recognition (ASR) should segregate the phonetic message from the VTL information. This paper describes a method for computing VTL-invariant auditory feature vectors from speech, in which the message and the VTL information are segregated. Spectra produced by an auditory filterbank are summarized by a Gaussian mixture model (GMM) to produce a low-dimensional feature vector. These features are evaluated for robustness in comparison with conventional mel-frequency cepstral coefficients (MFCCs) using a hidden-Markov-model (HMM) recognizer. A dynamic, compressive gammachirp (dcGC) auditory filterbank is also introduced; it provides a level-dependent spectral analysis with near-instantaneous compression and two-tone suppression.
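The central step, summarizing an auditory spectral profile with a small GMM to obtain a low-dimensional feature vector, can be illustrated with a minimal sketch. This is not the paper's implementation: the energy-weighted resampling along a log-frequency axis, the component count, and the choice of sorted component means and mixture weights as the feature vector are all assumptions made here for illustration (using scikit-learn's GaussianMixture).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_spectral_features(spectrum, center_freqs, n_components=4, seed=0):
    """Summarize one auditory spectral frame with a small GMM.

    Hypothetical sketch: the spectrum is treated as a distribution over
    log filter centre frequency; points are drawn in proportion to the
    energy in each channel, a GMM is fitted along that axis, and the
    sorted component means and weights form the feature vector.
    """
    log_f = np.log(center_freqs)
    # Turn channel energies into sampling probabilities so the GMM fit
    # reflects the shape of the spectral profile.
    p = np.clip(spectrum, 0.0, None)
    p = p / p.sum()
    samples = np.random.default_rng(seed).choice(log_f, size=2000, p=p)
    gmm = GaussianMixture(n_components=n_components, random_state=seed)
    gmm.fit(samples[:, None])
    # Sort components by frequency so the feature vector is consistent
    # across frames, then concatenate means and mixture weights.
    order = np.argsort(gmm.means_.ravel())
    return np.concatenate([gmm.means_.ravel()[order], gmm.weights_[order]])
```

One property worth noting: on a log-frequency axis, a change in VTL appears largely as a translation of the spectral envelope, so in a sketch like this the spacings between sorted component means, rather than their absolute positions, are the more VTL-robust part of the feature vector.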
