Abstract

Human speech recognition is robust to large changes in vocal tract length (VTL), but automatic speech recognition is not. In an effort to improve VTL normalization, an auditory model was used to derive formant-like features from syllables. The robustness supported by these auditory features was compared to that provided by traditional Mel-frequency cepstral coefficients (MFCCs), using a standard hidden Markov model (HMM) recognizer. The speech database consisted of 180 syllables, each scaled with the STRAIGHT vocoder to cover a wide range of VTLs and glottal pulse rates (GPRs). Training took place with syllables from a small, central range of scale values. When tested on the full range of scaled syllables, average performance for MFCC-based recognition was 73.5%, with performance falling close to 0% for syllables with extreme VTL values. The feature vectors constructed with the auditory model led to much better performance: the average over the full range of scaled syllables was 91%, and performance never fell below 65%, even for extreme combinations of VTL and GPR. Moreover, the auditory feature vectors contain just 12 features, whereas the standard MFCC vectors contain 39. Research supported by the UK-MRC (G0500221) and EOARD (FA8655-05-1-3043).
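For context, the 39-dimensional MFCC vectors mentioned above follow the standard convention of 13 static coefficients augmented with their first ("delta") and second ("delta-delta") time derivatives. The sketch below illustrates that composition using librosa; the file name, sample rate, and parameters are illustrative assumptions, not the paper's actual front end.

    import librosa
    import numpy as np

    # Hypothetical input file; 16 kHz is a common rate for ASR front ends.
    y, sr = librosa.load("syllable.wav", sr=16000)

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # 13 static coefficients
    d1 = librosa.feature.delta(mfcc)                    # delta (velocity)
    d2 = librosa.feature.delta(mfcc, order=2)           # delta-delta (acceleration)

    features = np.vstack([mfcc, d1, d2])                # (39, n_frames) feature matrix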
