Abstract

A major challenge in the field of automatic speech recognition (ASR) lies in designing noise-resilient systems. Such systems are crucial for real-world applications, where high levels of noise are often present. We introduce a noise-robust system based on a recently developed approach to training a recurrent neural network (RNN), namely, the echo state network (ESN). To evaluate the performance of the proposed system, we used our recently released public Arabic dataset, which contains about 10,000 examples of 20 isolated words spoken by 50 speakers. The feature extraction methods considered in this study are mel-frequency cepstral coefficients (MFCCs), perceptual linear prediction (PLP), and relative spectral transform PLP (RASTA-PLP). The extracted features were fed to the ESN, and the results were compared with a baseline hidden Markov model (HMM), giving six models in total (three feature types combined with two classifiers). These models were trained on clean data and then tested on unseen data with different levels and types of noise. The ESN models outperformed the HMM models under almost all feature extraction methods, noise levels, and noise types. The best performance was obtained by the model that combined RASTA-PLP with ESN.
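As a rough illustration of the feature-extraction-plus-ESN pipeline summarized above, the sketch below classifies an isolated word by running its cepstral frames through a fixed random reservoir and training only a linear readout. This is not the paper's implementation: MFCCs (via librosa) stand in for the feature front end even though the paper's best configuration used RASTA-PLP, the mean-pooled reservoir states and ridge-regression readout are common ESN choices assumed here, and all hyperparameters (reservoir size, spectral radius, input scaling, ridge penalty) are illustrative placeholders.

```python
# Hypothetical sketch of an ESN isolated-word classifier; hyperparameters
# and design choices below are assumptions, not the paper's settings.
import numpy as np
import librosa

rng = np.random.default_rng(0)

N_MFCC, N_RES = 13, 500              # feature dimension and reservoir size (assumed)
SPECTRAL_RADIUS, INPUT_SCALE = 0.9, 0.5

# Fixed random input and reservoir weights; only the readout is trained.
W_in = rng.uniform(-INPUT_SCALE, INPUT_SCALE, (N_RES, N_MFCC))
W = rng.uniform(-1.0, 1.0, (N_RES, N_RES))
W *= SPECTRAL_RADIUS / np.max(np.abs(np.linalg.eigvals(W)))

def mfcc_features(wav_path, sr=16000):
    """Load a waveform and return its MFCC frames, shape (T, N_MFCC)."""
    y, _ = librosa.load(wav_path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC).T

def reservoir_embedding(frames):
    """Drive the reservoir with one utterance and mean-pool its states."""
    x = np.zeros(N_RES)
    states = []
    for u in frames:
        x = np.tanh(W_in @ u + W @ x)
        states.append(x)
    return np.mean(states, axis=0)

def train_readout(embeddings, labels, n_classes, ridge=1e-2):
    """Closed-form ridge-regression readout on one-hot word targets."""
    X = np.asarray(embeddings)                  # (n_samples, N_RES)
    Y = np.eye(n_classes)[np.asarray(labels)]   # (n_samples, n_classes)
    return np.linalg.solve(X.T @ X + ridge * np.eye(N_RES), X.T @ Y)

def predict(W_out, frames):
    """Classify one utterance as the argmax of the readout scores."""
    return int(np.argmax(reservoir_embedding(frames) @ W_out))
```

In this setup the reservoir acts as a fixed nonlinear temporal expansion of the cepstral frames, so training reduces to a single linear regression; this is what makes the ESN cheap to train compared with a fully trained RNN, while the HMM baseline in the paper would model each word with its own state sequence instead.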
