Hierarchical Bayesian combination of plug-in maximum a posteriori decoders in deep neural networks-based speech recognition and speaker adaptation

Zhen Huang, Sabato Marco Siniscalchi, Chin-Hui Lee

doi:10.1016/j.patrec.2017.08.001

Abstract

We propose a novel decoding framework that dynamically combines K plug-in maximum a posteriori (MAP) decoders, each solving for a sequence of symbols state by state in time, subject to a set of constraints on the symbol sequences in space. Scores are combined at the state level, with the K combination weights either set equal (the equal weighting scheme) or learned from a collection of data in a hierarchical Bayesian setting. When applied to automatic speech recognition (ASR), the framework leverages characteristic differences in how feed-forward deep neural networks (DNNs) and Gaussian mixture models (GMMs) compute acoustic probabilities at the hidden Markov phone state level, so that these scores can be discriminatively combined in plug-in MAP decoding. The DNN and GMM parameters can be trained on a large collection of speaker-independent (SI) speech data and further refined with a small set of speaker adaptation (SA) utterances. The per-speaker, per-state combination weights can be learned from SA data through the proposed hierarchical Bayesian approach. Experimental results on the Switchboard ASR task show that an ad hoc fixed-weight combination already reduces the word error rate (WER) to 16.9% from an SI WER of 17.4%. Model adaptation with 20 utterances reduces the WER to 16.7%, which is further reduced to 16.1% by using the SA models with fixed-weight combination decoding. The best WER of 15.3% is attained by using the proposed hierarchical Bayesian learned weights to combine the two SA and two SI systems. Finally, we contrast the proposed technique with a state-of-the-art static system combination approach that relies on multiple word lattices generated by different ASR systems and on minimum Bayes risk. The experimental results demonstrate that static system combination cannot improve on the performance of the individual systems, and that the proposed dynamic combination scheme is needed.
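The state-level score combination described in the abstract can be illustrated with a small sketch. The abstract does not give the combination formula, so the code below assumes a weighted sum of per-state log-scores, with weights of 1/K under the equal weighting scheme. The function name `combine_state_scores`, the toy probabilities, and the per-state weights are hypothetical and are not taken from the paper; the hierarchical Bayesian learning of the weights is not shown.

```python
import math


def combine_state_scores(log_scores, weights=None):
    """Combine per-state log-scores from K decoders for one frame.

    log_scores: K lists, each holding one log-score per state.
    weights:    K lists of per-state combination weights, or None for the
                equal weighting scheme (1/K for every decoder and state).
    Returns one combined score per state.

    Assumption: scores are combined as a weighted sum in the log domain.
    """
    num_decoders = len(log_scores)
    num_states = len(log_scores[0])
    if any(len(row) != num_states for row in log_scores):
        raise ValueError("every decoder must score the same number of states")
    if weights is None:
        weights = [[1.0 / num_decoders] * num_states for _ in range(num_decoders)]
    if len(weights) != num_decoders or any(len(row) != num_states for row in weights):
        raise ValueError("weights must match the shape of log_scores")
    return [
        sum(weights[k][s] * log_scores[k][s] for k in range(num_decoders))
        for s in range(num_states)
    ]


# Toy example: K = 2 decoders (for instance a DNN and a GMM) scoring 3 states.
toy_probs = [[0.7, 0.2, 0.1],
             [0.5, 0.3, 0.2]]
toy_log_scores = [[math.log(p) for p in row] for row in toy_probs]

# Equal weighting scheme.
equal_combined = combine_state_scores(toy_log_scores)

# Hypothetical per-state weights (each column sums to 1).
per_state_weights = [[0.9, 0.5, 0.2],
                     [0.1, 0.5, 0.8]]
weighted_combined = combine_state_scores(toy_log_scores, per_state_weights)
```

In this sketch the per-state weights let one decoder dominate for some states and the other decoder for others, which is the kind of flexibility that per-speaker, per-state weights would provide in the setting the abstract describes.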

Journal: Pattern Recognition Letters
Publication Date: Aug 4, 2017
Citations: 9

Similar Papers

A study of an active approach to speaker and task adaptation based on automatic analysis of vocabulary confusability
Wei Li
25 Apr 2012

Speaker-adaptive speech recognition using speaker diarization for improved transcription of large spoken archives
Petr Cerva ... Ladislav Seps
Speech Communication | VOL. 55
08 Jul 2013

Speaker Adaptation and Adaptive Training for Jointly Optimised Tandem Systems
Yu Wang ... Chao Zhang
02 Sep 2018

An active approach to speaker and task adaptation based on automatic analysis of vocabulary confusability
Qiang Huo ... Wei Li
27 Aug 2007
