Abstract
This paper addresses performance degradation in mismatched speech recognition. F-Ratio analysis is used to measure the significance of different frequency bands for speech unit classification. We find that frequencies around 1 kHz and 3 kHz, which are the upper bounds of the first and second formants for most vowels, should be emphasized more than they are in the Mel-frequency cepstral coefficients (MFCC). The analysis result also remains stable in several typical mismatched situations. By analogy with the Mel-frequency scale, we therefore propose another frequency scale, the F-Ratio scale, to optimize the filter bank design for the MFCC features so that each subband carries equal significance for speech unit classification. Under comparable conditions, the modified features yield a relative reduction in sentence error rate, compared with MFCC, of 43.20% for emotion-affected speech recognition, 35.54% and 23.03% for noisy speech recognition at 15 dB and 0 dB SNR (signal-to-noise ratio), respectively, and 64.50% for the three years' 863 test data. Applying the F-Ratio analysis to the clean training set of the Aurora2 database demonstrates its robustness across languages, texts, and sampling rates.
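As an illustration only, the two ideas in the abstract can be sketched in Python. The abstract does not give the exact formulas, so this sketch assumes the common definition of the F-ratio (variance of the class means divided by the average within-class variance, computed per band) and assumes the equal-significance scale is obtained by splitting the cumulative F-ratio into equal shares. The function names `f_ratio` and `equal_significance_edges` and all parameters are hypothetical, not taken from the paper.

```python
import numpy as np


def f_ratio(band_energies, labels):
    """Per-band F-ratio (assumed definition, not necessarily the paper's):
    variance of the class means divided by the mean within-class variance.

    band_energies: array of shape (n_frames, n_bands)
    labels: array of shape (n_frames,) giving the speech unit of each frame
    """
    band_energies = np.asarray(band_energies, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # Mean and variance of each band within each class.
    class_means = np.stack(
        [band_energies[labels == c].mean(axis=0) for c in classes]
    )
    class_vars = np.stack(
        [band_energies[labels == c].var(axis=0) for c in classes]
    )
    between = class_means.var(axis=0)   # spread of the class means
    within = class_vars.mean(axis=0)    # average spread inside a class
    return between / (within + 1e-12)   # small constant avoids division by zero


def equal_significance_edges(band_edges_hz, ratios, n_filters):
    """Filter edges chosen so every subband holds an equal share of the
    cumulative F-ratio (an assumed construction of the warped scale).

    band_edges_hz: edges of the analysis bands, length n_bands + 1
    ratios: non-negative F-ratio per analysis band, length n_bands
    """
    band_edges_hz = np.asarray(band_edges_hz, dtype=float)
    ratios = np.asarray(ratios, dtype=float)
    # Normalized cumulative significance at each analysis band edge.
    cumulative = np.concatenate([[0.0], np.cumsum(ratios)])
    cumulative /= cumulative[-1]
    # Invert the cumulative curve at equally spaced significance levels.
    targets = np.linspace(0.0, 1.0, n_filters + 1)
    return np.interp(targets, cumulative, band_edges_hz)
```

Under these assumptions, bands with a high F-ratio receive narrower, more densely packed filters, which is one way to emphasize regions such as those around 1 kHz and 3 kHz; the resulting edges would then replace the Mel-spaced edges in an otherwise standard MFCC pipeline.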