Abstract

This article describes a modified technique for enhancing noisy speech to improve automatic speech recognition (ASR) performance. The proposed approach improves on the widely used spectral subtraction method, which inherently suffers from musical noise artifacts. Psychoacoustic masking combined with critical band variance normalization minimizes the artifacts produced by spectral subtraction and thereby improves ASR accuracy. The widely used ETSI-2 advanced front end is tested for comparison. Speech recognition evaluations on the standard noisy AURORA-2 tasks show enhanced performance for all noise conditions.

Highlights

  • Enhancement of noise-corrupted speech signals is a challenging task for speech processing systems that are to be deployed in real-world applications

  • Apart from extracting robust features, in which the extracted features are modified so that their parameters are less sensitive to noise [1], other research directions aimed at improving the performance of speech recognizers in noise are speech signal enhancement, model adaptation, and hybrid methods [2,3,4,5]

  • Spectral subtraction was used to reduce the broadband noise due to peaks, and the combination of masking and variance normalization was effective in reducing the resulting artifacts by reducing the dynamic range of the magnitude spectrum, which improved speech recognition performance

Summary

Introduction

Enhancement of noise-corrupted speech signals is a challenging task for speech processing systems that are to be deployed in real-world applications. The application of human auditory masking in Kalman filtering to speech enhancement is considered in [9]. Another approach, based on sub-band variance normalization, was proposed in which speech frames are characterized by high variance and noise frames by low variance; the low-variance frames are suppressed to improve ASR performance in the presence of both additive noise and reverberation [10].

The non-linear mapping of spectral estimates that fall below a threshold, where noise has been overestimated, results in some randomly located negative values for the estimated clean speech magnitude. This leads to undesired residual noise called musical noise (a narrow-band spectrum with tones randomly distributed over time and frequency). After the global masking threshold is applied, variance normalization is performed to further suppress the tones at random frequencies. These normalized values are then used as weights that are multiplied with the filter bank energies as shown

[Figure: original PSD]

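
The processing chain summarized above (spectral subtraction, then variance-based weights multiplied with the filter bank energies) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The over-subtraction factor `alpha`, the spectral floor `beta`, the band layout `band_edges`, and the choice to normalize each band variance by the largest variance in the utterance are all assumptions made for this example; the psychoacoustic masking step is omitted.

```python
from statistics import pvariance


def spectral_subtract(noisy_power, noise_power, alpha=2.0, beta=0.01):
    """Subtract a scaled noise estimate from one frame of a power spectrum.

    alpha (over-subtraction) and beta (spectral floor) are illustrative
    values, not taken from the article. Bins where the subtraction would
    go negative or very low are clamped to the floor beta * noise.
    """
    return [max(y - alpha * n, beta * n)
            for y, n in zip(noisy_power, noise_power)]


def band_variance_weights(frames, band_edges):
    """Return weights[frame][band] in [0, 1] from per-band variances.

    frames: list of per-frame power spectra (lists of floats).
    band_edges: list of (start, stop) bin indices per band (hypothetical
    layout). Assumption: each variance is divided by the largest variance
    found, so low-variance (noise-like) bands get small weights.
    """
    variances = [[pvariance(frame[a:b]) for a, b in band_edges]
                 for frame in frames]
    peak = max(max(row) for row in variances)
    if peak == 0:
        return [[0.0 for _ in row] for row in variances]
    return [[v / peak for v in row] for row in variances]


def weight_energies(energies, weights):
    """Multiply filter bank energies by the matching weights."""
    return [[e * w for e, w in zip(e_row, w_row)]
            for e_row, w_row in zip(energies, weights)]


# Two frames with four bins each, grouped into two bands.
frames = [[1.0, 9.0, 2.0, 2.0], [1.0, 1.0, 1.0, 1.0]]
weights = band_variance_weights(frames, [(0, 2), (2, 4)])
weighted = weight_energies([[10.0, 4.0], [2.0, 2.0]], weights)
```

In this toy input only the first band of the first frame has any spread across its bins, so it keeps its full energy while the flat bands are weighted down to zero. A real front end would compute the energies from a mel or critical band filter bank and would likely use a softer weighting.
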
Conclusions
