Variable frame rate hierarchical analysis for robust speech recognition

Jean Rouat, Stephane Molotchnikoff, Stephane Loiselle

doi:10.1109/iros.2011.6094672

Abstract

A new bio-inspired speech analysis system that extracts acoustical speech events is proposed and used in the design of a variable frame rate (VFR) speech recognizer. The same speech recognizer, based on Hidden Markov Models (HMM) and Mel Frequency Cepstrum Coefficients (MFCC), has been used with both the proposed VFR analysis and a conventional fixed frame rate (FFR) approach. In comparison with other VFR recognizers, the hierarchical features in the proposed system have the potential to serve as classification parameters of a complete bio-inspired speech recognition system. In addition, no voice activity detection is required and the system takes no hard decisions. Events are used to label and identify the moments at which the acoustical properties of speech are stable or changing. These events are markers on which an analysis window can be positioned to perform the recognition. Inspired by our knowledge of the auditory and visual systems, hierarchical complex features such as transients and energy orientation are used. Training has been done on clean speech, and recognition on noisy speech (Signal to Noise Ratios, SNR, from 20 dB down to −10 dB) or reverberated speech, using the TI 46-word database corrupted with 4 noises taken from the Aurora 2 data. In comparison with an FFR recognizer, our VFR system yields an increase of more than 50% in recognition rates for a speaker-independent isolated word recognition task when SNRs are between 0 and 20 dB.
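To make the noisy test condition concrete, the following is a minimal sketch of how a clean utterance can be corrupted with additive noise at a chosen SNR. It is not the procedure used by the authors, which the abstract does not describe. The function names `mix_at_snr` and `measured_snr_db` are hypothetical, the sine wave and Gaussian noise stand in for TI 46-word speech and Aurora 2 noise, and the SNR is assumed to be the ratio of average powers over the whole utterance.

```python
import math
import random


def mean_power(samples):
    """Average power of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Add scaled noise to speech so that the mixture has the requested SNR.

    Assumption: SNR is defined over the whole utterance as
    10 * log10(speech power / noise power).
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("speech and noise must have non-zero power")
    # Choose gain g so that p_speech / (g**2 * p_noise) == 10 ** (snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixture):
    """Recover the SNR of a mixture when the clean speech is known."""
    residual = [m - s for s, m in zip(speech, mixture)]
    return 10 * math.log10(mean_power(speech) / mean_power(residual))


# Stand-in signals (assumptions): one second of a 440 Hz tone at 8 kHz
# in place of speech, and white Gaussian noise in place of Aurora 2 noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(8000)]

# Sweep the SNR range quoted in the abstract.
for snr in (20, 10, 0, -10):
    noisy = mix_at_snr(speech, noise, snr)
    print(snr, round(measured_snr_db(speech, noisy), 2))
```

At −10 dB the noise power is ten times the speech power, which is why recognition in that condition is considerably harder than in the 0 to 20 dB range for which the improvement is reported.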

Full Text