Abstract
Limiting the loss of performance caused by changes in the acoustic environment remains a major challenge for continuous speech recognition (CSR) systems. We propose a novel approach that combines the Karhunen-Loève transform (KLT) in the mel-frequency domain with a genetic algorithm (GA) to enhance the data representing corrupted speech. The idea is to project noisy speech parameters onto the space generated by the genetically optimized principal axes obtained from the KLT. The enhanced parameters increase the recognition rate in highly interfering noise environments. When included in the front-end of an HTK-based CSR system, the proposed hybrid technique outperforms the conventional recognition process in severe interfering car noise environments for a wide range of signal-to-noise ratios (SNRs) varying from 16 dB to dB. We also show the effectiveness of the KLT-GA method in recognizing speech subject to telephone channel degradations.
Highlights
Continuous speech recognition (CSR) systems remain faced with the serious problem of acoustic condition changes
We propose an approach that can be viewed as a signal transformation via a mapping operator, using a mel-frequency space decomposition based on the Karhunen-Loève transform (KLT) and a genetic algorithm (GA) with real-coded encoding
For the KLT- and KLT-GA-based CSR systems, we found that using KLT-GA as a preprocessing step to enhance the mel-frequency cepstral coefficients (MFCCs) led to a significant improvement in word recognition accuracy. Recognition used triphone hidden Markov models (HMMs) with N-mixture Gaussian densities, for N = 1, 2, 4, and 8
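The projection step described in these highlights can be illustrated with a minimal sketch. It is not the authors' implementation: it covers only a plain KLT (eigendecomposition of the feature covariance) and leaves out the GA optimization of the axes entirely. The function names `klt_axes` and `klt_project`, the 13-dimensional synthetic features, and the choice of 6 retained axes are all assumptions made for illustration.

```python
import numpy as np

def klt_axes(features, n_axes):
    """Estimate the mean and the top `n_axes` principal axes (as columns)."""
    mean = features.mean(axis=0)
    centered = features - mean
    cov = centered.T @ centered / (len(features) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_axes]    # keep the largest-variance axes
    return mean, eigvecs[:, order]

def klt_project(features, mean, axes):
    """Project features onto the retained axes and map back to feature space."""
    coeffs = (features - mean) @ axes
    return coeffs @ axes.T + mean

# Synthetic stand-ins for clean and noisy feature vectors (hypothetical data).
rng = np.random.default_rng(0)
clean = rng.normal(size=(500, 13)) * np.linspace(3.0, 0.3, 13)
noisy = clean + rng.normal(scale=0.5, size=clean.shape)

mean, axes = klt_axes(clean, n_axes=6)            # axes estimated from clean data
enhanced = klt_project(noisy, mean, axes)         # noisy vectors projected onto them
```

In the approach summarized above, the principal axes would additionally be refined by a real-coded GA before projection; here they are used exactly as the eigendecomposition returns them.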
Summary
Continuous speech recognition (CSR) systems remain faced with the serious problem of acoustic condition changes. Their performance often degrades under unknown adverse conditions (e.g., room acoustics, ambient noise, speaker variability, sensor characteristics, and other transmission channel artifacts). These speech variations create mismatches between the training data and the test data. The major methods in this field are based on finding a robust distortion measure that emphasizes the regions of the spectrum that are less influenced by noise [6].