This paper presents the development of a real-time automatic speech recognition (ASR) system for the Azerbaijani language, addressing the prevalent gap in speech recognition systems for underrepresented languages. Our research adopts a hybrid acoustic modeling approach that combines Hidden Markov Models (HMMs) and Deep Neural Networks (DNNs) to model the complexities of Azerbaijani acoustic patterns effectively. Recognizing the agglutinative nature of Azerbaijani, the ASR system employs a syllable-based n-gram model for language modeling, ensuring that the system accurately captures the syntax and semantics of Azerbaijani speech. To enable real-time operation, we incorporate WebSocket technology, which provides the efficient bidirectional client-server communication needed to process streaming speech data with minimal latency. The Kaldi and SRILM toolkits are used to train the acoustic and language models, respectively, contributing to the system's robust performance and adaptability. We conducted comprehensive experiments to evaluate the system, and the results strongly corroborate the utility of syllable-based subword modeling for Azerbaijani speech recognition. Our proposed ASR system achieves superior recognition accuracy and rapid response times, outperforming other systems tested on the same language data. Beyond its benefits for Azerbaijani, the system provides a valuable framework for future applications to other agglutinative languages, thereby contributing to the promotion of linguistic diversity in automatic speech recognition technology.