Relative Word Error Rate Reduction Research Articles

In this paper, a semisupervised speech data extraction method is presented and applied to create a new dataset designed for the development of fully bilingual Automatic Speech Recognition (ASR) systems for Basque and Spanish. The dataset is drawn from an extensive collection of Basque Parliament plenary sessions containing frequent code-switching. Since session minutes are not exact, only the most reliable speech segments are kept for training. To that end, we use phonetic similarity scores between nominal and recognized phone sequences. The process starts with baseline acoustic models trained on generic out-of-domain data, then iteratively updates the models with the extracted data and applies the updated models to refine the training dataset until the observed improvement between two iterations becomes small enough. A development dataset, comprising five plenary sessions not used for training, has been manually audited for tuning and evaluation purposes. Cross-validation experiments (with 20 random partitions) have been carried out on the development dataset, using the baseline and the iteratively updated models. On average, Word Error Rate (WER) drops from 16.57% (baseline) to 4.41% (first iteration) and further to 4.02% (second iteration), which corresponds to relative WER reductions of 73.4% and 8.8%, respectively. When considering only Basque segments, WER drops on average from 16.57% (baseline) to 5.51% (first iteration) and further to 5.13% (second iteration), which corresponds to relative WER reductions of 66.7% and 6.9%, respectively. As a result of this work, a new bilingual Basque–Spanish resource has been produced based on Basque Parliament sessions, including 998 h of training data (audio segments + transcriptions), a development set (17 h long) designed for tuning and evaluation under a cross-validation scheme, and a fully bilingual trigram language model.
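The relative WER reductions quoted in the abstract above follow from the usual definition, 100 × (WER before − WER after) / WER before. The short Python sketch below recomputes them from the reported WER values. The function and variable names are illustrative assumptions, not taken from the paper.

```python
def relative_wer_reduction(wer_before: float, wer_after: float) -> float:
    """Relative WER reduction in percent: 100 * (before - after) / before."""
    if wer_before <= 0:
        raise ValueError("wer_before must be positive")
    return 100.0 * (wer_before - wer_after) / wer_before


def stepwise_reductions(wers):
    """Relative reduction between each pair of consecutive WER values,
    rounded to one decimal place as in the abstract."""
    return [round(relative_wer_reduction(a, b), 1) for a, b in zip(wers, wers[1:])]


# WER values (in %) as reported in the abstract:
# baseline, first iteration, second iteration.
all_segments = [16.57, 4.41, 4.02]
basque_only = [16.57, 5.51, 5.13]

print(stepwise_reductions(all_segments))  # [73.4, 8.8]
print(stepwise_reductions(basque_only))   # [66.7, 6.9]
```

Each reduction is measured against the previous iteration rather than against the baseline, which is why the second figure in each pair is much smaller than the first.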

Knowledge distillation (KD) has been widely used to improve the performance of a simpler student model by imitating the outputs or intermediate representations of a more complex teacher model. The most commonly used KD technique is to minimize a Kullback-Leibler divergence between the output distributions of the teacher and student models. When it is applied to compressing acoustic models trained with a connectionist temporal classification (CTC) criterion, an assumption is made that the teacher and student share the same frame-level feature-transcription alignment. However, frame-level alignments learned by teachers can be inaccurate and unstable due to the lack of fine-grained frame-level guidance during CTC training. Forcing the student to learn inaccurate alignments limits the achievable performance improvement. In this article, we investigate building powerful teacher models with more accurate and stable feature-transcription alignments. We achieve this goal by using a novel alignment-consistent ensemble (ACE) technique, where all models within an ensemble are jointly trained along with a regularization term to encourage consistent and stable alignments. With a well-trained deep bidirectional LSTM (DBLSTM) ACE as a teacher, we can directly use the traditional frame-wise KD method to train DBLSTM students. When applying KD to transfer knowledge from a DBLSTM ACE to a deep unidirectional LSTM (DLSTM) student, a simple yet effective target delay technique is proposed to handle the alignment difference between bidirectional and unidirectional models. Experimental results on the Switchboard-I speech recognition task show that, with a DBLSTM ACE as a teacher, the simple frame-wise KD method can achieve competitive or better performance than other complex KD methods on DBLSTM students. When applying KD to build DLSTM students from DBLSTM teachers, our proposed target delay technique can achieve relative word error rate reductions of 14.2% to 14.8% compared with the models trained from scratch, which outperforms other carefully designed KD methods.
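As a rough illustration of the frame-wise KD objective described above, the Python sketch below averages the Kullback-Leibler divergence between teacher and student output distributions over frames. The abstract does not specify how the target delay is implemented; the `delay` parameter here is only one plausible reading (student frame t + delay is matched against teacher frame t) and is an assumption, as are all function names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one frame's output logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))


def framewise_kd_loss(teacher_logits, student_logits, delay=0):
    """Mean over frames of KL(teacher || student).

    `delay` is a hypothetical parameter: student frame t + delay is
    compared with teacher frame t, and unmatched frames are dropped.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must have the same number of frames")
    if delay < 0 or delay >= len(teacher_logits):
        raise ValueError("delay must be in [0, number of frames)")
    n = len(teacher_logits) - delay
    total = 0.0
    for t in range(n):
        p = softmax(teacher_logits[t])
        q = softmax(student_logits[t + delay])
        total += kl_divergence(p, q)
    return total / n


# Toy example: 4 frames, 3 output classes (values are made up).
teacher = [[2.0, 0.0, -1.0], [0.0, 3.0, 0.0], [1.0, 1.0, 4.0], [0.5, 0.2, 0.1]]
# A student whose outputs lag the teacher by exactly one frame.
lagging_student = [[0.0, 0.0, 0.0]] + teacher[:-1]

print(framewise_kd_loss(teacher, lagging_student))           # positive: frames misaligned
print(framewise_kd_loss(teacher, lagging_student, delay=1))  # close to zero: aligned by the delay
```

In this toy case the lagging student matches the teacher only once the one-frame delay is accounted for, which mirrors the motivation given in the abstract for handling the alignment difference between bidirectional and unidirectional models.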

Articles published on Relative Word Error Rate Reduction

Hypernetworks for Personalizing ASR to Atypical Speech

Decoupled structure for improved adaptability of end-to-end models

CAM: A cross-lingual adaptation framework for low-resource language speech recognition

DNN-based Multilingual Acoustic Modeling for Four Ethiopian Languages

Language fusion via adapters for low-resource speech recognition

Adapting Pre-Trained Self-Supervised Learning Model for Speech Recognition with Light-Weight Adapters

Semisupervised Speech Data Extraction from Basque Parliament Sessions and Validation on Fully Bilingual Basque–Spanish ASR

Wav2vec‐MoE: An unsupervised pre‐training and adaptation method for multi‐accent ASR

Combining hybrid DNN-HMM ASR systems with attention-based models using lattice rescoring

Morphology aware data augmentation with neural language models for online hybrid ASR

Data Augmentation Using Spectral Warping for Low Resource Children ASR

Reliability-Based Large-Vocabulary Audio-Visual Speech Recognition

Multilingual speech recognition for GlobalPhone languages

Combining Frame-Synchronous and Label-Synchronous Systems for Speech Recognition

Optimizing Data Usage for Low-Resource Speech Recognition

End-to-End Dereverberation, Beamforming, and Speech Recognition in a Cocktail Party

Towards Contextual Spelling Correction for Customization of End-to-End Speech Recognition Systems

Auxiliary Networks for Joint Speaker Adaptation and Speaker Change Detection

Online Speaker Adaptation Using Memory-Aware Networks for Speech Recognition

Improving Knowledge Distillation of CTC-Trained Acoustic Models With Alignment-Consistent Ensemble and Target Delay
