Abstract

Automatic speech recognition performance degrades significantly when speech is affected by environmental noise. The major challenge today is to achieve robustness in adverse noisy conditions so that automatic speech recognizers can be used in real situations. Spectral subtraction (SS) is a well-known and effective approach; it was originally designed to improve the quality of the speech signal as judged by human listeners. SS techniques usually improve the quality and intelligibility of the speech signal, whereas speech recognition systems need compensation techniques to reduce the mismatch between noisy speech features and acoustic models trained on clean speech. Nevertheless, a correlation can be expected between speech quality improvement and the increase in recognition accuracy. This paper proposes a novel approach to this problem: instead of treating SS and the speech recognizer as two independent entities cascaded together, it treats them as two interconnected components of a single system sharing the common goal of improved speech recognition accuracy. Information from the statistical models of the recognition engine is fed back to tune the SS parameters. With this architecture, we overcome the drawbacks of previously proposed methods and achieve better recognition accuracy. Experimental evaluations show that the proposed method achieves significant improvements in recognition rates across a wide range of signal-to-noise ratios.
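
To ground the discussion, the sketch below shows a conventional power spectral subtraction front end in the spirit of the classic SS literature (frame-wise subtraction of an estimated noise power spectrum, with over-subtraction and a spectral floor). It is only a baseline illustration, not the MLBSS algorithm proposed in this paper; the function name and the parameters alpha (over-subtraction factor) and beta (spectral floor) are illustrative choices.

```python
import numpy as np

def spectral_subtraction(noisy, noise_only, frame_len=256, hop=128,
                         alpha=2.0, beta=0.01):
    """Enhance a 1-D noisy signal given a noise-only segment (illustrative)."""
    window = np.hanning(frame_len)

    # Average noise power spectrum estimated from the noise-only segment.
    noise_psd = np.mean(
        [np.abs(np.fft.rfft(noise_only[i:i + frame_len] * window)) ** 2
         for i in range(0, len(noise_only) - frame_len, hop)], axis=0)

    enhanced = np.zeros(len(noisy))
    for i in range(0, len(noisy) - frame_len, hop):
        frame = noisy[i:i + frame_len] * window
        spec = np.fft.rfft(frame)
        power = np.abs(spec) ** 2
        # Over-subtract the noise power (alpha) and floor the result (beta)
        # to limit "musical noise" artifacts.
        clean_power = np.maximum(power - alpha * noise_psd, beta * noise_psd)
        # Combine the enhanced magnitude with the noisy phase.
        clean_spec = np.sqrt(clean_power) * np.exp(1j * np.angle(spec))
        enhanced[i:i + frame_len] += np.fft.irfft(clean_spec, n=frame_len) * window
    return enhanced
```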

Highlights

  • As computers and electronic devices play an ever greater role in daily life, traditional interfaces such as the mouse, keyboard, buttons, and knobs are no longer satisfactory, and the desire for more convenient and natural interfaces has grown

  • We evaluate the performance of the maximum likelihood-based SS (MLBSS) algorithm in noisy training conditions by using noisy speech data in the training phase

  • Recognition results obtained from the noisy training conditions are shown in Figure 5, where the following deductions can be made: (i) a higher signal-to-noise ratio (SNR) difference between the training and testing speech causes a higher degree of mismatch and results in greater degradation in recognition performance; (ii) in matched conditions, where the recognition system is trained with speech having the same level of noise as the test speech, the best recognition accuracies are obtained; (iii) the MLBSS is more effective than Kamath and Loizou's [29] multiband spectral subtraction (KLMBSS) method (see the sketch after this list) in overcoming environmental mismatch where models are trained with noisy speech but the noise type and SNR level of the noisy speech are not known a priori; (iv) in the KLMBSS method, a lower SNR of the training data results in greater degradation in recognition performance
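
As a rough illustration of the multiband idea behind KLMBSS [29], the sketch below splits one frame's power spectrum into a few bands and over-subtracts each band according to its own segmental SNR. The band edges, the SNR-to-over-subtraction mapping, and the spectral floor used here are simplified assumptions for illustration, not the exact settings of [29].

```python
import numpy as np

def oversubtraction_factor(band_snr_db):
    """Map a band's segmental SNR (dB) to an over-subtraction factor (simplified)."""
    if band_snr_db < -5.0:
        return 4.75
    if band_snr_db > 20.0:
        return 1.0
    return 4.0 - 0.15 * band_snr_db  # linear in between

def multiband_ss_frame(noisy_power, noise_power, n_bands=4, floor=0.002):
    """Subtract the estimated noise power from one frame's power spectrum, band by band."""
    edges = np.linspace(0, len(noisy_power), n_bands + 1).astype(int)
    clean_power = np.empty_like(noisy_power)
    for b in range(n_bands):
        lo, hi = edges[b], edges[b + 1]
        band_snr = 10.0 * np.log10(np.sum(noisy_power[lo:hi]) /
                                   (np.sum(noise_power[lo:hi]) + 1e-12))
        alpha = oversubtraction_factor(band_snr)
        sub = noisy_power[lo:hi] - alpha * noise_power[lo:hi]
        # Floor each bin to a small fraction of the noisy power to avoid negative values.
        clean_power[lo:hi] = np.maximum(sub, floor * noisy_power[lo:hi])
    return clean_power
```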

Introduction

As computers and electronic devices play an ever greater role in daily life, traditional interfaces such as the mouse, keyboard, buttons, and knobs are no longer satisfactory, and the desire for more convenient and natural interfaces has grown; speech is among the most natural of these, which makes robust automatic speech recognition essential. When noise affects the speech signal, the distributions of the features extracted from noisy speech differ from the corresponding distributions extracted from clean speech in the training phase. This mismatch results in misclassification and decreases speech recognition accuracy [1, 2]. The degradation can only be ameliorated by reducing the difference between the distributions of the test data and those used by the recognizer. The problem of noisy speech recognition still poses a challenge to the signal processing community.
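
The toy sketch below (an assumed setup, not taken from the paper) makes this mismatch concrete: additive noise shifts the distribution of a simple frame-level feature, and the shift can be quantified as the KL divergence between a Gaussian fitted to clean-speech features and one fitted to noisy-speech features.

```python
import numpy as np

def add_noise(clean, noise, snr_db):
    """Scale `noise` so that clean + noise has the requested SNR in dB."""
    gain = np.sqrt(np.mean(clean ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10.0)))
    return clean + gain * noise

def frame_log_energy(x, frame_len=256, hop=128):
    """Frame-level log energies of a 1-D signal."""
    return np.array([np.log(np.sum(x[i:i + frame_len] ** 2) + 1e-12)
                     for i in range(0, len(x) - frame_len, hop)])

rng = np.random.default_rng(0)
t = np.arange(16000) / 8000.0                                  # 2 s at 8 kHz
clean = (0.5 + 0.5 * np.sin(2 * np.pi * 3 * t)) * np.sin(2 * np.pi * 200 * t)  # stand-in "speech"
noisy = add_noise(clean, rng.standard_normal(t.size), snr_db=5.0)

fe_clean, fe_noisy = frame_log_energy(clean), frame_log_energy(noisy)
mu_c, var_c = fe_clean.mean(), fe_clean.var()
mu_n, var_n = fe_noisy.mean(), fe_noisy.var()

# KL divergence of the noisy-test Gaussian from the clean-trained Gaussian:
# the larger it is, the worse a clean-trained model matches the noisy features.
kl = 0.5 * (var_n / var_c + (mu_n - mu_c) ** 2 / var_c - 1.0 + np.log(var_c / var_n))
print(f"Feature-distribution mismatch (KL divergence): {kl:.2f}")
```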
