Abstract

Knowledge distillation (KD) has been widely used to improve the performance of a simpler student model by imitating the outputs or intermediate representations of a more complex teacher model. The most commonly used KD technique is to minimize the Kullback-Leibler divergence between the output distributions of the teacher and student models. When it is applied to compressing acoustic models trained with a connectionist temporal classification (CTC) criterion, an assumption is made that the teacher and student share the same frame-level feature-transcription alignment. However, the frame-level alignments learned by teachers can be inaccurate and unstable due to the lack of fine-grained frame-level guidance during CTC training, and forcing the student to learn inaccurate alignments leads to limited performance improvements. In this article, we investigate building powerful teacher models with more accurate and stable feature-transcription alignments. We achieve this goal with a novel alignment-consistent ensemble (ACE) technique, where all models within an ensemble are jointly trained along with a regularization term that encourages consistent and stable alignments. With a well-trained deep bidirectional LSTM (DBLSTM) ACE as a teacher, we can directly use the traditional frame-wise KD method to train DBLSTM students. When applying KD to transfer knowledge from a DBLSTM ACE to a deep unidirectional LSTM (DLSTM) student, a simple yet effective target delay technique is proposed to handle the alignment difference between bidirectional and unidirectional models. Experimental results on the Switchboard-I speech recognition task show that, with a DBLSTM ACE as the teacher, the simple frame-wise KD method achieves competitive or better performance than other, more complex KD methods on DBLSTM students. When applying KD to build DLSTM students from DBLSTM teachers, the proposed target delay technique achieves relative word error rate reductions of 14.2% $\sim$ 14.8% compared with models trained from scratch, outperforming other carefully designed KD methods.
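To make the two ingredients named above concrete, the sketch below illustrates a frame-wise KD loss (a per-frame KL divergence between teacher and student output distributions) and one simple way a target delay could be applied when distilling a bidirectional teacher into a unidirectional student. This is a minimal PyTorch-style illustration under stated assumptions, not the paper's implementation: the tensor layout (batch, time, vocab), the temperature parameter, the function names `frame_wise_kd_loss` and `target_delay_kd_loss`, and the frame-dropping scheme at utterance boundaries are all assumptions introduced here for clarity.

```python
import torch.nn.functional as F


def frame_wise_kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Frame-level KD loss: KL divergence between the teacher's and the
    student's per-frame output distributions.

    Both tensors are assumed to have shape (batch, time, vocab) and to hold
    unnormalized scores; matching frame t to frame t encodes the shared
    frame-level alignment assumption discussed in the abstract.
    """
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # Flatten (batch, time) so "batchmean" averages the per-frame KL terms.
    kl = F.kl_div(
        log_p_student.reshape(-1, log_p_student.size(-1)),
        p_teacher.reshape(-1, p_teacher.size(-1)),
        reduction="batchmean",
    )
    return kl * (temperature ** 2)


def target_delay_kd_loss(student_logits, teacher_logits, delay, temperature=1.0):
    """Frame-wise KD with a target delay of `delay` frames: the unidirectional
    student at frame t is trained to match the bidirectional teacher at
    frame t - delay, so the student has seen `delay` extra frames of input
    before it must reproduce each teacher output.

    Dropping the first `delay` student frames and the last `delay` teacher
    frames is one simple (assumed) way to realize the shift.
    """
    if delay == 0:
        return frame_wise_kd_loss(student_logits, teacher_logits, temperature)
    return frame_wise_kd_loss(
        student_logits[:, delay:, :],
        teacher_logits[:, :-delay, :],
        temperature,
    )
```

In practice such a KD term would typically be interpolated with the CTC training loss and padded frames would be masked out; the abstract does not specify those details, so they are omitted from the sketch.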
