Abstract

In this paper, we propose a novel noise-robustness method known as weighted sub-band histogram equalization (WS-HEQ) to improve speech recognition accuracy in noise-corrupted environments. Based on the observation that the high- and low-pass portions of the intra-frame cepstral features are of unequal importance for noise-corrupted speech recognition, WS-HEQ is intended to reduce the high-pass components of the cepstral features. Furthermore, we provide four types of WS-HEQ, whose structure is partially based on that of spatial histogram equalization (S-HEQ). In experiments conducted on the Aurora-2 noisy-digit database, the presented WS-HEQ yields significant recognition improvements relative to the Mel-scaled filter-bank cepstral coefficient (MFCC) baseline and to cepstral histogram normalization (CHN) in various noise-corrupted situations, and it outperforms S-HEQ in most cases.

Highlights

  • The performance of speech recognition systems is often degraded due to noise in application environments

  • The work in [11] revealed that in the cepstral histogram normalization (CHN) method, even though each cepstral channel is processed by histogram equalization (HEQ), a significant histogram mismatch still exists between the training and testing cepstral features for the low-pass filtered (LPF) and high-pass filtered (HPF) portions of the intra-frame cepstra

  • The presented weighted sub-band histogram equalization (WS-HEQ) is evaluated in terms of recognition accuracy

Summary

Introduction

The performance of speech recognition systems is often degraded due to noise in application environments. Typical examples of noise-robustness methods that address this problem are perceptual masking [1], empirical mode decomposition [2], optimally modified log-spectral amplitude estimation [3], wavelet packet decomposition with AR modeling [4], cepstral mean and variance normalization (MVN) [5], cepstral histogram normalization (CHN) [6,7], MVN with ARMA filtering (MVA) [8], higher order cepstral moment normalization (HOCMN) [9], and temporal structure normalization (TSN) [10]. In some of these methods, the compensation is performed on each individual cepstral channel sequence of an utterance by assuming that these channels are mostly uncorrelated [7]. We change the order of the procedures in S-HEQ by first splitting the original intra-frame cepstra (not the CHN-preprocessed cepstra) into LPF and HPF portions, then compensating the LPF and HPF portions individually, and finally normalizing the full-band cepstra. This new structure can reduce the effect of noise on the LPF and HPF portions of the plain cepstra more directly than S-HEQ does.
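
The processing order described above (split the plain cepstra, compensate each sub-band, then normalize the full band) can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not correspond to any particular one of the four WS-HEQ types. Several details are assumptions made only for illustration: the two-tap filter pair applied along the cepstral index, the standard normal reference distribution, the rank-based estimate of the cumulative distribution, and the hypothetical parameter high_weight (with the arbitrary value 0.5) that scales down the HPF portion.

```python
import numpy as np
from statistics import NormalDist


def heq_channel(x):
    """Rank-based HEQ of one feature sequence onto a standard normal
    reference (the reference distribution is an assumption)."""
    n = len(x)
    ranks = np.argsort(np.argsort(x))      # rank of each frame, 0..n-1
    cdf = (ranks + 0.5) / n                # empirical CDF, strictly in (0, 1)
    ref = NormalDist()                     # zero mean, unit variance
    return np.array([ref.inv_cdf(float(p)) for p in cdf])


def heq_per_channel(X):
    """Apply HEQ to every column (one column per channel) of a
    frames-by-channels matrix."""
    return np.column_stack([heq_channel(X[:, k]) for k in range(X.shape[1])])


def split_subbands(C):
    """Split each frame's cepstral vector into LPF and HPF portions with an
    assumed two-tap filter pair along the cepstral index; low + high == C."""
    prev = np.concatenate([np.zeros((C.shape[0], 1)), C[:, :-1]], axis=1)
    low = 0.5 * (C + prev)
    high = 0.5 * (C - prev)
    return low, high


def ws_heq_sketch(C, high_weight=0.5):
    """Illustrative pipeline: split the plain cepstra, equalize each
    sub-band, down-weight the HPF portion, then equalize the full band.
    high_weight is a hypothetical parameter, not a value from the paper."""
    low, high = split_subbands(C)
    low_eq = heq_per_channel(low)
    high_eq = heq_per_channel(high)
    full = low_eq + high_weight * high_eq  # weighted recombination
    return heq_per_channel(full)           # full-band normalization


# Usage on synthetic data standing in for one utterance
# (200 frames, 13 cepstral coefficients per frame).
rng = np.random.default_rng(0)
C = rng.normal(loc=2.0, scale=3.0, size=(200, 13))
out = ws_heq_sketch(C)
```

Choosing high_weight below 1 reflects the stated intent of reducing the high-pass components; the weights and filters used in the actual method should be taken from the full paper.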

Proposed approach
Experimental setup
Method
Experimental results and discussions for the Aurora-2 task
The experiment on the TCC-300 Mandarin dataset
Conclusions
