Speaker Adaptation Techniques Research Articles

Speaker adaptation techniques provide a powerful solution to customise automatic speech recognition (ASR) systems for individual users. Practical application of unsupervised model-based speaker adaptation techniques to data-intensive end-to-end ASR systems is hindered by the scarcity of speaker-level data and performance sensitivity to transcription errors. To address these issues, a set of compact and data-efficient speaker-dependent (SD) parameter representations is used to facilitate both speaker adaptive training and test-time unsupervised speaker adaptation of state-of-the-art Conformer ASR systems. The sensitivity to supervision quality is reduced using a confidence score-based selection of the less erroneous subset of speaker-level adaptation data. Two lightweight confidence score estimation modules are proposed to produce more reliable confidence scores. The data sparsity issue, which is exacerbated by data selection, is addressed by modelling the SD parameter uncertainty using Bayesian learning. Experiments on the benchmark 300-hour Switchboard and the 233-hour AMI datasets suggest that the proposed confidence score-based adaptation schemes consistently outperformed the baseline speaker-independent (SI) Conformer model and conventional non-Bayesian, point estimate-based adaptation using no speaker data selection. Similar consistent performance improvements were retained after external Transformer and LSTM language model rescoring. In particular, on the 300-hour Switchboard corpus, statistically significant WER reductions of 1.0%, 1.3%, and 1.4% absolute (9.5%, 10.9%, and 11.3% relative) were obtained over the baseline SI Conformer on the NIST Hub5'00, RT02, and RT03 evaluation sets, respectively. Similar WER reductions of 2.7% and 3.3% absolute (8.9% and 10.2% relative) were also obtained on the AMI development and evaluation sets.
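The confidence score-based data selection described in the abstract above can be illustrated with a minimal sketch. It assumes each first-pass hypothesis already carries an utterance-level confidence score in [0, 1]; the `Utterance` class, the `select_adaptation_data` function, and the 0.8 threshold are hypothetical names and values chosen for illustration, not the paper's confidence estimation modules or its actual selection criterion.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    utt_id: str
    hypothesis: str    # first-pass ASR output used as the adaptation label
    confidence: float  # assumed utterance-level confidence score in [0, 1]


def select_adaptation_data(utterances, threshold=0.8):
    """Keep only utterances whose confidence meets the (assumed) threshold."""
    return [u for u in utterances if u.confidence >= threshold]


# Toy speaker-level data; the scores are made up for illustration.
speaker_data = [
    Utterance("utt1", "hello there", 0.95),
    Utterance("utt2", "recognise speech", 0.55),
    Utterance("utt3", "see you tomorrow", 0.82),
]

selected = select_adaptation_data(speaker_data, threshold=0.8)
selected_ids = [u.utt_id for u in selected]
print(selected_ids)  # ['utt1', 'utt3']
```

Raising the threshold keeps fewer, cleaner hypotheses, which is the data sparsity trade-off that the abstract says is addressed with Bayesian learning.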

Acoustic modeling based on deep architectures has recently gained remarkable success, with substantial improvement of speech recognition accuracy in several automatic speech recognition (ASR) tasks. For distant speech recognition, multi-channel deep neural network based approaches rely on the powerful modeling capability of the deep neural network (DNN) to learn a suitable representation of distant speech directly from its multi-channel source. In this model-based combination of multiple microphones, features from each channel are concatenated and used together as input to the DNN. This allows integrating the multi-channel audio for acoustic modeling without any pre-processing steps. Despite the powerful modeling capabilities of DNNs, an environmental mismatch due to noise and reverberation may result in severe performance degradation when features are simply fed to a DNN without a feature enhancement step. In this paper, we introduce a nonlinear bottleneck feature mapping approach using a DNN to transform noisy and reverberant features into their clean versions. The bottleneck features derived from the DNN are used as a teacher signal because they contain information relevant to phoneme classification, and the mapping is performed with the objective of suppressing noise and reverberation. The individual and combined impacts of beamforming and speaker adaptation techniques, along with the feature mapping, are examined for distant large vocabulary speech recognition using single and multiple far-field microphones. As an alternative to beamforming, experiments with concatenating multiple channel features are conducted. The experimental results on the AMI meeting corpus show that the feature mapping, used in combination with beamforming and speaker adaptation, yields a distant speech recognition word error rate (WER) below 50%, using a DNN for acoustic modeling.
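The channel concatenation mentioned in the abstract above can be sketched as follows. This is only an illustration under stated assumptions: features are plain Python lists of frames, all channels are time-aligned with the same number of frames, and `concatenate_channels` is a hypothetical helper name rather than anything from the paper.

```python
def concatenate_channels(channel_features):
    """Join per-channel feature vectors frame by frame.

    channel_features: list of channels; each channel is a list of frames,
    and each frame is a list of floats. Assumes time-aligned channels.
    """
    num_frames = len(channel_features[0])
    if any(len(channel) != num_frames for channel in channel_features):
        raise ValueError("all channels must have the same number of frames")
    return [
        [value for channel in channel_features for value in channel[t]]
        for t in range(num_frames)
    ]


# Two microphones, two frames each, two feature dimensions per frame (toy values).
mic1 = [[0.1, 0.2], [0.3, 0.4]]
mic2 = [[1.1, 1.2], [1.3, 1.4]]

dnn_input = concatenate_channels([mic1, mic2])
print(dnn_input)  # [[0.1, 0.2, 1.1, 1.2], [0.3, 0.4, 1.3, 1.4]]
```

Each output frame has a dimension equal to the sum of the per-channel dimensions, so the network input grows linearly with the number of microphones.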

Articles published on Speaker Adaptation Techniques

Confidence Score Based Speaker Adaptation of Conformer Speech Recognition Systems

Investigations on speaker adaptation using a continuous vocoder within recurrent neural network based text-to-speech synthesis

Speaker Adaptation Using Spectro-Temporal Deep Features for Dysarthric and Elderly Speech Recognition

Bayesian Learning for Deep Neural Network Adaptation

Evaluation of speaker-dependent and average-voice Vietnamese statistical speech synthesis systems

Language-independent acoustic cloning of HTS voices

Feature-space SVM adaptation for speaker adapted word prominence detection

Gaussian mixture models for adaptation of deep neural network acoustic models in automatic speech recognition systems

Non-Parallel Training in Voice Conversion Using an Adaptive Restricted Boltzmann Machine

Deep CNNs Along the Time Axis With Intermap Pooling for Robustness to Spectral Variations

Many-to-many voice conversion using hidden Markov model-based speech recognition and synthesis

Feature mapping using far-field microphones for distant speech recognition

Multistage data selection-based unsupervised speaker adaptation for personalized speech emotion recognition

Linear Regression Based Acoustic Adaptation for the Subspace Gaussian Mixture Model

Adaptation of hidden Markov model mean parameters using two‐dimensional PCA with constraint on speaker weight

Evolutionary approach for integration of multiple pronunciation patterns for enhancement of dysarthric speech recognition

Reusing Speech Techniques for Video Semantic Indexing [Applications Corner]

Aging speech recognition with speaker adaptation techniques: Study on medium vocabulary continuous Bengali speech

Adaptation of Hidden Markov Models Using Model-as-Matrix Representation

Unified framework for basis-based speaker adaptation based on sample covariance matrix of variable dimension
