Multi-condition Training Research Articles

The self-attention mechanism has recently marked a new milestone in automatic speech recognition (ASR). Nevertheless, its performance is susceptible to environmental interference, because the system predicts each output symbol from the full input sequence and the previous predictions. A popular remedy is to add an independent speech enhancement module as a front-end. However, because it is trained separately from the ASR module, an independent enhancement front-end easily ends up sub-optimal for recognition. In addition, the handcrafted loss function of the enhancement module tends to introduce unseen distortions, which can even degrade ASR performance. Inspired by the extensive applications of generative adversarial networks (GANs) in speech enhancement and ASR tasks, we propose an adversarial joint training framework with the self-attention mechanism to boost the noise robustness of the ASR system. It consists of a self-attention speech enhancement GAN and a self-attention end-to-end ASR model. The framework has two notable advantages. First, it benefits from advances in both the self-attention mechanism and GANs. Second, the GAN discriminator acts as a global discriminant network during adversarial joint training, guiding the enhancement front-end to capture structures that are more compatible with the subsequent ASR module and thereby offsetting the limitations of separate training and handcrafted loss functions. With adversarial joint optimization, the proposed framework is expected to learn more robust representations suited to the ASR task. We conduct systematic experiments on the AISHELL-1 corpus. On the artificial noisy test set, the proposed framework achieves relative improvements of 66% over an ASR model trained solely on clean data, 35.1% over the speech enhancement and ASR scheme without joint training, and 5.3% over multi-condition training.
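The relative improvements quoted above can be read as relative error-rate reductions. The sketch below shows that arithmetic under this assumption; the abstract does not state the underlying metric or the absolute error rates, so the function name `relative_improvement` and the numbers in the usage line are purely illustrative.

```python
def relative_improvement(baseline_error: float, proposed_error: float) -> float:
    """Relative error-rate reduction, in percent, of a proposed system over a baseline."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - proposed_error) / baseline_error


# Hypothetical error rates, chosen only to illustrate the arithmetic;
# they are not taken from the paper.
print(relative_improvement(20.0, 15.0))  # 25.0
```

A relative figure of this kind says nothing about absolute error rates, which is why it is usually reported alongside them in the full paper.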

We propose an integrated end-to-end automatic speech recognition (ASR) paradigm based on joint learning of front-end speech signal processing and back-end acoustic modeling. We believe that “only good signal processing can lead to top ASR performance” in challenging acoustic environments. This notion leads to a unified deep neural network (DNN) framework for distant speech processing that can achieve both high-quality enhanced speech and high-accuracy ASR simultaneously. Our goal is accomplished with two techniques: (i) a reverberation-time-aware DNN-based speech dereverberation architecture that can handle a wide range of reverberation times to enhance the quality of reverberant and noisy speech, followed by (ii) DNN-based multicondition training that takes both clean-condition and multicondition speech into consideration, exploiting data acquired and processed with multichannel microphone arrays, to improve ASR performance. The final end-to-end system is established by jointly optimizing the speech enhancement and recognition DNNs. The recent REverberant Voice Enhancement and Recognition Benchmark (REVERB) Challenge task is used as a test bed for evaluating the proposed framework. We first report objective measures for enhanced speech on the simulated data test set that are superior to those listed in the 2014 REVERB Challenge Workshop. Moreover, we obtain the best single-system word error rate (WER) of 13.28% on the 1-channel REVERB simulated data with the proposed DNN-based pre-processing algorithm and clean-condition training. With joint training, more discriminative ASR features, and improved neural network based language models, a low single-system WER of 4.46% is attained. Next, a new multi-channel-condition joint learning and testing scheme delivers a state-of-the-art WER of 3.76% on the 8-channel simulated data with a single ASR system. Finally, we also report a preliminary yet promising experiment with the REVERB real test data.
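Multicondition training, as mentioned above, trains on clean speech together with degraded versions of it. The sketch below illustrates one common way to build such a set, by adding noise at a chosen signal-to-noise ratio (SNR). It is a simplification and not the method of the abstract, which relies on reverberant multichannel data; the helper names `mix_at_snr` and `build_multicondition_set`, the SNR grid, and the toy signals are all assumptions made for illustration. Plain Python lists are used to keep the example dependency-free; an array library would normally be used for real audio.

```python
import math
import random


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add it to `clean`."""
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    clean_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum(x * x for x in noise) / len(noise)
    if clean_power == 0 or noise_power == 0:
        raise ValueError("signals must have non-zero power")
    # SNR (dB) = 10 * log10(clean_power / noise_power_after_scaling)
    target_noise_power = clean_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [c + gain * n for c, n in zip(clean, noise)]


def build_multicondition_set(clean_utts, noise_clips, snrs_db, rng):
    """Keep every clean utterance and add one noisy copy at a randomly chosen SNR.

    Assumes each noise clip is at least as long as every clean utterance.
    """
    dataset = []
    for clean in clean_utts:
        dataset.append(("clean", clean))
        noise = rng.choice(noise_clips)[: len(clean)]
        snr = rng.choice(snrs_db)
        dataset.append((f"snr{snr}", mix_at_snr(clean, noise, snr)))
    return dataset


# Toy usage: one sine-wave "utterance" and one white-noise clip (both synthetic).
rng = random.Random(0)
utterance = [math.sin(2 * math.pi * 5 * t / 200) for t in range(200)]
noise_clip = [rng.gauss(0.0, 1.0) for _ in range(400)]
train_set = build_multicondition_set([utterance], [noise_clip], [0, 10, 20], rng)
print([label for label, _ in train_set])
```

Keeping the clean copy alongside the noisy one mirrors the abstract's point that both clean-condition and multicondition speech are taken into consideration.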

Articles published on Multi-condition Training

Speech emotion recognition with transfer learning and multi-condition training for noisy environments

Improving License Plate Identification in Morocco: Intelligent Region Segmentation Approach, Multi-Font and Multi-Condition Training

Making Speaker Diarization System Noise Tolerant

Deep MCANC: A deep learning approach to multi-channel active noise control

Addressing smartphone mismatch in Parkinson’s disease detection aid systems based on speech

A Principle Solution for Enroll-Test Mismatch in Speaker Recognition

Adversarial joint training with self-attention mechanism for robust end-to-end speech recognition

Deep ANC: A deep learning approach to active noise control

A Novel Loss Function and Training Strategy for Noise-Robust Keyword Spotting

Multicondition Training for Noise-Robust Detection of Benign Vocal Fold Lesions From Recorded Speech

Far-Field Automatic Speech Recognition

Multi-condition training for noise-robust speech emotion recognition

Towards the reduction of the effects of muscle fatigue on myoelectric control of upper limb prostheses

Robust Speaker Identification and Verification in Adverse Acoustic Condition

A Spiking Neural Network Framework for Robust Sound Classification

Application of Machine Learning for the Spatial Analysis of Binaural Room Impulse Responses

An End-to-End Deep Learning Approach to Simultaneous Speech Dereverberation and Acoustic Modeling for Robust Speech Recognition

Robust Speech Dereverberation With a Neural Network-Based Post-Filter That Exploits Multi-Conditional Training of Binaural Cues

Predicting speech intelligibility with deep neural networks

A Speaker-Dependent Approach to Single-Channel Joint Speech Separation and Acoustic Modeling Based on Deep Neural Networks for Robust Recognition of Multi-Talker Speech
