Baseline Acoustic Model Research Articles

In this paper, a semisupervised speech data extraction method is presented and applied to create a new dataset designed for the development of fully bilingual Automatic Speech Recognition (ASR) systems for Basque and Spanish. The dataset is drawn from an extensive collection of Basque Parliament plenary sessions containing frequent code-switching. Since session minutes are not exact, only the most reliable speech segments are kept for training. To that end, we use phonetic similarity scores between nominal and recognized phone sequences. The process starts with baseline acoustic models trained on generic out-of-domain data, then iteratively updates the models with the extracted data and applies the updated models to refine the training dataset until the observed improvement between two iterations becomes small enough. A development dataset, involving five plenary sessions not used for training, has been manually audited for tuning and evaluation purposes. Cross-validation experiments (with 20 random partitions) have been carried out on the development dataset, using the baseline and the iteratively updated models. On average, Word Error Rate (WER) falls from 16.57% (baseline) to 4.41% (first iteration) and further to 4.02% (second iteration), which corresponds to relative WER reductions of 73.4% and 8.8%, respectively. When considering only Basque segments, WER falls on average from 16.57% (baseline) to 5.51% (first iteration) and further to 5.13% (second iteration), which corresponds to relative WER reductions of 66.7% and 6.9%, respectively. As a result of this work, a new bilingual Basque–Spanish resource has been produced based on Basque Parliament sessions, including 998 h of training data (audio segments + transcriptions), a 17 h development set designed for tuning and evaluation under a cross-validation scheme, and a fully bilingual trigram language model.
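
The relative WER reductions quoted in the abstract above can be checked with a few lines of arithmetic. The following is a minimal sketch, not code from the paper; the function names `relative_wer_reduction` and `stepwise_reductions` are hypothetical, and only the WER figures themselves come from the abstract.

```python
def relative_wer_reduction(wer_before: float, wer_after: float) -> float:
    """Return the relative WER reduction between two systems, in percent."""
    if wer_before <= 0:
        raise ValueError("wer_before must be positive")
    return 100.0 * (wer_before - wer_after) / wer_before


def stepwise_reductions(wers):
    """Relative reduction between each consecutive pair of WER values, rounded to 0.1."""
    return [round(relative_wer_reduction(a, b), 1) for a, b in zip(wers, wers[1:])]


# WER figures (in %) as quoted in the abstract: baseline, first iteration, second iteration.
all_segments = [16.57, 4.41, 4.02]
basque_only = [16.57, 5.51, 5.13]

print(stepwise_reductions(all_segments))  # relative reductions over all segments
print(stepwise_reductions(basque_only))   # relative reductions over Basque segments only
```

Each reduction is measured against the immediately preceding system, which is why the second-iteration gain looks small in relative terms even though it is computed on an already low WER.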

This study investigated large-scale semi-supervised training (SST) to improve acoustic models for automatic speech recognition. Conventional self-training, the recently proposed committee-based SST using heterogeneous neural networks, and lattice-based SST were examined and compared. Large-scale SST was studied in deep neural network acoustic modeling with respect to the automatic transcription quality, importance-based data filtering, the training data quantity and other data attributes of a large quantity of multi-genre unsupervised live data. We found that SST behavior on large-scale ASR tasks was very different from the behavior observed in small-scale SST: 1) big data can tolerate a certain degree of mislabeling in the automatic transcription for SST, so further performance gains are possible with more unsupervised fresh data even when the automatic transcriptions contain a certain degree of error; 2) the audio attributes, transcription quality and importance of the fresh data are more important than the increased data quantity for large-scale SST; and 3) there are large differences in performance gains on different recognition tasks, such that the benefits depend heavily on the selected data attributes of the unsupervised data and the data scale of the baseline ASR system. Furthermore, we proposed a novel utterance filtering approach based on active learning to improve the data selection in large-scale SST. The experimental results showed that SST with the proposed data filtering yields a 2–11% relative word error rate reduction on five multi-genre recognition tasks, even with a baseline acoustic model that was already well trained on a 10,000-hour supervised dataset.
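
To make the idea of data filtering in self-training concrete, the sketch below shows a generic confidence-threshold filter over automatically transcribed utterances. It is an illustrative assumption only: the abstract does not describe the proposed active-learning filter in detail, so the `Hypothesis` class, its `confidence` field, the function `select_for_self_training`, and the threshold value are all hypothetical and do not represent the authors' method.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    utt_id: str
    text: str          # automatic transcription produced by the baseline ASR system
    confidence: float  # hypothetical per-utterance confidence score in [0, 1]


def select_for_self_training(hyps, threshold=0.9):
    """Keep only utterances whose confidence meets the (assumed) threshold."""
    return [h for h in hyps if h.confidence >= threshold]


# Toy, made-up hypotheses used purely to exercise the filter.
hyps = [
    Hypothesis("utt1", "hello world", 0.97),
    Hypothesis("utt2", "helo word", 0.55),
    Hypothesis("utt3", "good morning", 0.91),
]

selected = select_for_self_training(hyps, threshold=0.9)
print([h.utt_id for h in selected])
```

In conventional self-training, the selected utterances and their automatic transcriptions would then be added to the supervised data before retraining the acoustic model. A stricter threshold trades data quantity for transcription quality, which is the trade-off the abstract discusses.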

Articles published on Baseline Acoustic Model

THE NO TRAIN NO GAIN SYSTEM FOR O-COCOSDA AND VLSP 2022 - A-MSV SHARED TASK: ASIAN MULTILINGUAL SPEAKER VERIFICATION

Semisupervised Speech Data Extraction from Basque Parliament Sessions and Validation on Fully Bilingual Basque–Spanish ASR

Space-and-speaker-aware acoustic modeling with effective data augmentation for recognition of multi-array conversational speech

Improving Speech Enhancement Framework via Deep Learning

Efficient acoustic feature transformation in mismatched environments using a Guided-GAN

Large-Scale Semi-Supervised Training in Deep Learning Acoustic Model for ASR

Improved acoustic models for spontaneous speech recognition

Three-Stage Framework for Unsupervised Acoustic Modeling Using Untranscribed Spoken Content
