Abstract

We studied semi-supervised training of a fully connected deep neural network (DNN), an unfolded recurrent neural network (RNN), and a long short-term memory recurrent neural network (LSTM-RNN) with respect to transcription quality, importance-based data sampling, and the amount of training data. We found that the DNN, unfolded RNN, and LSTM-RNN are increasingly sensitive to labeling errors. For example, with simulated erroneous training transcriptions at the 5%, 10%, or 15% word error rate (WER) level, the semi-supervised DNN yields a 2.37%, 4.84%, or 7.46% relative WER increase against the baseline model trained with perfect transcriptions; the corresponding increase is 2.53%, 4.89%, or 8.85% for the unfolded RNN and 4.47%, 9.38%, or 14.01% for the LSTM-RNN. We further found that importance sampling has a similar impact on all three models, giving a 2~3% relative WER reduction compared to random sampling. Lastly, we compared modeling capability with increased training data. Experimental results suggest that the LSTM-RNN benefits more from enlarged training data than the unfolded RNN and DNN do. We trained a semi-supervised LSTM-RNN using 2,600 hours of transcribed and 10,000 hours of untranscribed data on a mobile speech task. The semi-supervised LSTM-RNN yields a 7.9% relative WER reduction against the supervised baseline.
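To make the data-selection comparison concrete, below is a minimal sketch of how importance sampling of untranscribed utterances might be contrasted with random sampling in a semi-supervised setup. It assumes, as one common realization not specified in the abstract, that "importance" is an utterance-level confidence score assigned by a seed model to its own automatic transcription; the function and field names (`select_untranscribed_batch`, `confidence`, `duration_hours`) are hypothetical and not the paper's actual pipeline.

```python
import random

def select_untranscribed_batch(utterances, budget_hours, strategy="importance"):
    """Pick untranscribed utterances (each carrying a seed-model hypothesis and
    confidence) up to a time budget, either uniformly at random or by confidence."""
    if strategy == "random":
        pool = random.sample(utterances, len(utterances))
    else:  # assumed importance criterion: prefer hypotheses the seed model trusts most
        pool = sorted(utterances, key=lambda u: u["confidence"], reverse=True)

    selected, hours = [], 0.0
    for utt in pool:
        if hours + utt["duration_hours"] > budget_hours:
            break
        selected.append(utt)
        hours += utt["duration_hours"]
    return selected

# Hypothetical usage: each record holds the seed model's 1-best hypothesis,
# an utterance-level confidence, and the audio duration in hours.
untranscribed = [
    {"id": "utt1", "hypothesis": "call mom", "confidence": 0.93, "duration_hours": 0.002},
    {"id": "utt2", "hypothesis": "play music", "confidence": 0.61, "duration_hours": 0.003},
    {"id": "utt3", "hypothesis": "navigate home", "confidence": 0.88, "duration_hours": 0.002},
]
batch = select_untranscribed_batch(untranscribed, budget_hours=0.004)
print([u["id"] for u in batch])  # highest-confidence utterances within the budget
```

The selected batch, paired with its automatic transcriptions, would then be mixed with the transcribed data for training; swapping `strategy="random"` reproduces the random-sampling baseline the abstract compares against.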
