Abstract

Human speech recognition is robust to large changes in vocal tract length (VTL), but automatic speech recognition is not. In an effort to improve VTL normalization, an auditory model was used to derive formant-like features from syllables. The robustness supported by these auditory features was compared to that provided by traditional Mel-frequency cepstral coefficients (MFCCs), using a standard hidden Markov model (HMM) recognizer. The speech database consisted of 180 syllables, each scaled with the STRAIGHT vocoder to cover a wide range of VTLs and glottal pulse rates (GPRs). Training took place with syllables from a small, central range of scale values. When tested on the full range of scaled syllables, average performance for MFCC-based recognition was 73.5%, with performance falling close to 0% for syllables with extreme VTL values. The feature vectors constructed with the auditory model led to much better performance: the average over the full range of scaled syllables was 91%, and performance never fell below 65%, even for extreme combinations of VTL and GPR. Moreover, the auditory feature vectors contain just 12 features, whereas the standard MFCC vectors contain 39. Research supported by the UK-MRC (G0500221) and EOARD (FA8655-05-1-3043).
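For context, the 39-dimensional MFCC vectors mentioned above follow the standard convention of 13 static coefficients augmented with their first ("delta") and second ("delta-delta") time derivatives. The sketch below illustrates that composition using librosa; the file name, sample rate, and parameters are illustrative assumptions, not the paper's actual front end.

    import librosa
    import numpy as np

    # Hypothetical input file; 16 kHz is a common rate for ASR front ends.
    y, sr = librosa.load("syllable.wav", sr=16000)

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # 13 static coefficients
    d1 = librosa.feature.delta(mfcc)                    # delta (velocity)
    d2 = librosa.feature.delta(mfcc, order=2)           # delta-delta (acceleration)

    features = np.vstack([mfcc, d1, d2])                # (39, n_frames) feature matrix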
