Automatic Speech Recognition Performance Research Articles

Recently, the self-attention mechanism has marked a new milestone in automatic speech recognition (ASR). Nevertheless, its performance is susceptible to environmental interference, because the system predicts the next output symbol from the full input sequence and its previous predictions. A popular solution is to add an independent speech enhancement module as the front-end. However, because it is trained separately from the ASR module, an independent enhancement front-end easily converges to a sub-optimal solution. Moreover, the handcrafted loss function of the enhancement module tends to introduce unseen distortions, which can even degrade ASR performance. Inspired by the extensive application of generative adversarial networks (GANs) in speech enhancement and ASR tasks, we propose an adversarial joint training framework with the self-attention mechanism to boost the noise robustness of the ASR system. It consists of a self-attention speech enhancement GAN and a self-attention end-to-end ASR model. The framework has two notable advantages. First, it benefits from advances in both the self-attention mechanism and GANs. Second, the GAN discriminator acts as a global discriminant network during adversarial joint training, guiding the enhancement front-end to capture structures that are more compatible with the subsequent ASR module and thereby offsetting the limitations of separate training and handcrafted loss functions. With adversarial joint optimization, the proposed framework is expected to learn more robust representations suited to the ASR task.
We conduct systematic experiments on the AISHELL-1 corpus. On the artificial noisy test set, the proposed framework achieves relative improvements of 66% over the ASR model trained solely on clean data, 35.1% over the speech enhancement and ASR scheme without joint training, and 5.3% over multi-condition training.
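
Relative improvements of this kind are conventionally computed as the reduction in error rate divided by the baseline error rate. The following is a minimal sketch of that calculation, assuming the underlying metric is an error rate such as character error rate. The abstract does not report absolute values, so the numbers and the function name `relative_improvement` are hypothetical.

```python
def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Relative error-rate reduction, as a percentage of the baseline."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical error rates in percent, for illustration only (not from the paper).
baseline_cer = 20.0
proposed_cer = 15.0
print(relative_improvement(baseline_cer, proposed_cer))  # 25.0
```

Under this definition, a relative improvement says nothing about the absolute error rate: the same percentage can correspond to very different baselines.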

In the domain of air traffic control (ATC) systems, efforts to train a practical automatic speech recognition (ASR) model always face the problem of small training sets, since collecting and annotating speech samples is an expert- and domain-dependent task. In this work, a novel training approach based on pretraining and transfer learning is proposed to address this issue, and an improved end-to-end deep learning model is developed to address the specific challenges of ASR in the ATC domain. An unsupervised pretraining strategy is first proposed to learn speech representations from the unlabeled samples of a given dataset. Specifically, a masking strategy is applied to improve the diversity of the samples without losing their general patterns. Subsequently, transfer learning is applied to fine-tune a pretrained model or another optimized baseline model to accomplish the supervised ASR task. By virtue of the common terminology used in the ATC domain, the transfer learning task can be regarded as a sub-domain adaptation task, in which the transferred model is optimized using a joint corpus consisting of baseline samples and newly transcribed samples from the target dataset. This joint corpus construction strategy increases the size and diversity of the training samples, which is important for addressing the issue of the small transcribed corpus. In addition, speed perturbation is applied to augment the newly transcribed samples to further improve the quality of the speech corpus. Three real ATC datasets are used to validate the proposed ASR model and training strategies. The experimental results demonstrate that ASR performance is significantly improved on all three datasets, with an absolute character error rate only one-third of that achieved through supervised training. The applicability of the proposed strategies to other ASR approaches is also validated.
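
Character error rate (CER), the metric reported in this abstract, is conventionally the Levenshtein edit distance between the reference and hypothesis character sequences divided by the reference length. The following is a minimal sketch of that standard definition. It is not the authors' evaluation code, and the function names and example transcripts are made up for illustration.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over characters (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # cost of deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Made-up ATC-style transcripts, for illustration only.
print(cer("climb to flight level 90", "climb to flight level 19"))
```

Because insertions count as errors, CER computed this way can exceed 1.0 when the hypothesis is much longer than the reference.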

Articles published on Automatic Speech Recognition Performance

Performance evaluation and implementations of MFCC, SVM and MLP algorithms in the FPGA board

Out Domain Data Augmentation on Punjabi Children Speech Recognition using Tacotron

A cross-language study of speech recognition systems for English, German, and Hebrew

Adversarial joint training with self-attention mechanism for robust end-to-end speech recognition

Curriculum Learning based approaches for robust end-to-end far-field speech recognition

On training targets for deep learning approaches to clean speech magnitude spectrum estimation.

Deep neural network-based generalized sidelobe canceller for dual-channel far-field speech recognition

An exploration of semi-supervised and language-adversarial transfer learning using hybrid acoustic model for Hindi speech recognition

Improving speech recognition models with small samples for air traffic control systems

Gated Recurrent Context: Softmax-Free Attention for Online Encoder-Decoder Speech Recognition

Hindi speech recognition in noisy environment using hybrid technique

Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-Resource Speech Recognition

Multi-Channel Multi-Frame ADL-MVDR for Target Speech Separation

Gated Recurrent Fusion With Joint Training Framework for Robust End-to-End Speech Recognition

Automatic Speech Recognition in Different Languages Using High-Density Surface Electromyography Sensors

Parameter Tuning-Free Missing-Feature Reconstruction for Robust Sound Recognition

A Cross-Entropy-Guided Measure (CEGM) for Assessing Speech Recognition Performance and Optimizing DNN-Based Speech Enhancement

Real Time Speech Recognition based on PWP Thresholding and MFCC using SVM

Neural candidate-aware language models for speech recognition

Automatic speech recognition system with pitch dependent features for Punjabi language on KALDI toolkit
