Noise Robust Automatic Speech Recognition Research Articles

Introduction: An Automatic Speech Recognition (ASR) system recognizes speech utterances and can therefore be used to convert speech into text for various purposes. These systems are deployed in different environments, clean or noisy, and are used by people of all ages and types. This variety presents some of the major difficulties faced in developing an ASR system, which needs to be efficient as well as accurate and robust. Our main goal in implementing an ASR system is to minimize the error rate during both the training and testing phases. The performance of ASR depends on the combination of feature extraction technique and back-end technique. In this paper, using a continuous speech recognition system, we present a performance comparison of different combinations of feature extraction techniques and back-end techniques.

Methods: Hidden Markov Models (HMMs), Subspace Gaussian Mixture Models (SGMMs), and Deep Neural Networks (DNNs) with DNN-HMM architecture, namely Karel’s, Dan’s, and hybrid DNN-SGMM architectures, are used at the back end of the implemented system. Mel Frequency Cepstral Coefficients (MFCC), Perceptual Linear Prediction (PLP), and Gammatone Frequency Cepstral Coefficients (GFCC) are used as feature extraction techniques at the front end of the proposed system. The Kaldi toolkit has been used to implement the proposed work. The system is trained on the Texas Instruments-Massachusetts Institute of Technology (TIMIT) speech corpus for the English language.

Results: The experimental results show that MFCC outperforms GFCC and PLP in noiseless conditions, while PLP tends to outperform MFCC and GFCC in noisy conditions. Furthermore, the hybrid of Dan’s DNN implementation with SGMM performs best for back-end acoustic modeling. The proposed architecture, with PLP feature extraction at the front end and the hybrid of Dan’s DNN implementation with SGMM at the back end, outperforms the other combinations in a noisy environment.

Conclusion: Automatic Speech Recognition has numerous applications in our lives, such as home automation, personal assistants, and robotics. It is highly desirable to build an ASR system with good performance. The performance of Automatic Speech Recognition is affected by various factors, including vocabulary size, whether the system is speaker dependent or independent, whether speech is isolated, discontinuous or continuous, and adverse conditions such as noise. The paper presented an ensemble architecture that uses PLP for feature extraction at the front end and a hybrid of SGMM and Dan’s DNN at the back end to build a noise-robust ASR system.

Discussion: The work presented in this paper compares the performance of continuous ASR systems developed using different combinations of front-end feature extraction (MFCC, PLP, and GFCC) and back-end acoustic modeling (mono-phone, tri-phone, SGMM, DNN, and hybrid DNN-SGMM) techniques. Each front-end technique is tested in combination with each back-end technique. Finally, the results of these combinations are compared to find the best-performing combination in noisy and clean conditions.
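
The abstract above speaks of minimizing the error rate but does not say which metric is used. As a minimal sketch, assuming the usual edit-distance definition (substitutions, deletions, and insertions divided by the reference length), the error rate of one hypothesis against one reference could be computed as follows. The function names and the example sentences are hypothetical and are not taken from the paper; the same function applies to phone sequences if the tokens are phone labels rather than words.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j tokens of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            cost = 0 if ref_token == hyp_token else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis):
    """Token error rate: edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one substitution and one deletion over six tokens.
print(error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Comparing front-end and back-end combinations then amounts to computing this ratio over the same test set for each combination, in both clean and noisy conditions.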

User applications such as voice-based web search, online learning, and video gaming require an effective speech recognition module to take user commands. Nowadays, even children frequently use such tools, especially for online learning and gaming. This has increased the demand for a noise-robust automatic speech recognition (ASR) system that can effectively transcribe children’s speech under varied ambient conditions. However, automatic recognition of children’s speech is extremely challenging because data from child speakers is scarce in most of the world’s languages. Consequently, in this zero-resource condition, we are forced to decode children’s speech on systems trained using adults’ data. The acoustic mismatch between adults’ and children’s speech, such as differences in pitch, formant frequencies, and speaking rates, then leads to highly degraded recognition performance. To enhance the recognition rate under zero-resource conditions, we explore the role of formant- and duration-modification-based out-of-domain data augmentation in this paper. For that purpose, the formant frequencies of the adults’ speech data are upscaled by warping the linear predictive coding coefficients. Pooling the original and formant-modified adults’ speech data for training reduces the mismatch in formant locations, leading to better recognition performance. Further improvement in recognition rate can be achieved by simultaneously modifying the duration as well as the formant frequencies of the training data. This case of out-of-domain data augmentation has also been studied in this work and found to yield added gains. In addition to data augmentation, a noise- and pitch-robust front-end acoustic feature extraction approach exploiting higher-order spectral analysis (simple and cross-bispectrum) is also proposed in this paper. The proposed features are noise-robust due to the inherent immunity of the bispectrum to additive noise. An added advantage of the bispectrum is reduced pitch sensitivity, as demonstrated in this work. This, in turn, helps alleviate the aforementioned pitch-induced acoustic mismatch. The experimental evaluations presented in this paper demonstrate that the proposed acoustic features and the out-of-domain data augmentation techniques are well suited to zero-resource children’s speech recognition tasks under clean and noisy conditions.
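
To make the bispectrum mentioned above concrete, here is a minimal sketch of the textbook direct estimator, B(k1, k2) = mean over frames of X(k1) X(k2) X*(k1 + k2), where X is the discrete Fourier transform of a frame. It is not the feature extraction pipeline of the paper, which the abstract does not describe; the function names, the frame length, and the toy signal are all assumptions made for illustration. The estimator is a third-order statistic, and third-order statistics of zero-mean Gaussian noise vanish in expectation, which is the usual argument behind the noise immunity claimed above.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of one real-valued frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def bispectrum(frames):
    """Direct bispectrum estimate averaged over equal-length frames.

    B[k1][k2] is the mean over frames of X(k1) * X(k2) * conj(X(k1 + k2)),
    with the index k1 + k2 taken modulo the frame length.
    """
    n = len(frames[0])
    acc = [[0j] * n for _ in range(n)]
    for frame in frames:
        spec = dft(frame)
        for k1 in range(n):
            for k2 in range(n):
                acc[k1][k2] += (
                    spec[k1] * spec[k2] * spec[(k1 + k2) % n].conjugate()
                )
    return [[value / len(frames) for value in row] for row in acc]


# Hypothetical toy frame: three cosines at bins K1, K2, and K1 + K2.
N, K1, K2 = 16, 2, 3
frame = [
    math.cos(2 * math.pi * K1 * t / N)
    + math.cos(2 * math.pi * K2 * t / N)
    + math.cos(2 * math.pi * (K1 + K2) * t / N)
    for t in range(N)
]
B = bispectrum([frame])
print(abs(B[K1][K2]))  # large: the three components are phase-coupled
print(abs(B[1][1]))    # near zero: the frame has no energy at bin 1
```

The naive transform keeps the sketch dependency-free; a practical front end would use a fast Fourier transform and average over many windowed frames.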

Articles published on Noise Robust Automatic Speech Recognition

Knowledge Distillation-Based Training of Speech Enhancement for Noise-Robust Automatic Speech Recognition

Noise robust automatic speech recognition: review and analysis

A Joint Speech Enhancement and Self-Supervised Representation Learning Framework for Noise-Robust Speech Recognition

End-to-End Lip-Reading Open Cloud-Based Speech Architecture.

Performance Analysis of various Front-end and Back End Amalgamations for Noise-robust DNN-based ASR

Robust children’s speech recognition in zero resource condition

GFCC based discriminatively trained noise robust continuous ASR system for Hindi language

Histogram equalization with Bayesian estimation for noise robust speech recognition.

Dual-channel spectral weighting for robust speech recognition in mobile devices

Online MVDR Beamformer Based on Complex Gaussian Mixture Model With Spatial Prior for Noise Robust ASR

A Consolidated Perspective on Multimicrophone Speech Enhancement and Source Separation

Bayesian feature enhancement using independent vector analysis and reverberation parameter re-estimation for noisy reverberant speech recognition

Dual‐channel VTS feature compensation for noise‐robust speech recognition on mobile devices

Spectral Reconstruction and Noise Model Estimation Based on a Masking Model for Noise Robust Speech Recognition

Modeling State-Conditional Observation Distribution Using Weighted Stereo Samples for Factorial Speech Processing Models

Feature enhancement of reverberant speech by distribution matching and non-negative matrix factorization

Kernel Power Flow Orientation Coefficients for Noise-Robust Speech Recognition

Minimum Mean-Square Error Estimation of Mel-Frequency Cepstral Features–A Theoretically Consistent Approach

Noise Robust Automatic Speech Recognition Scheme with Histogram of Oriented Gradient Features

Sparse coding of the modulation spectrum for noise-robust automatic speech recognition
