Abstract

Knowledge distillation (KD) has been widely used to improve the performance of a simpler student model by imitating the outputs or intermediate representations of a more complex teacher model. The most commonly used KD technique is to minimize the Kullback-Leibler divergence between the output distributions of the teacher and student models. When it is applied to compressing acoustic models trained with a connectionist temporal classification (CTC) criterion, an assumption is made that the teacher and student share the same frame-level feature-transcription alignment. However, the frame-level alignments learned by teachers can be inaccurate and unstable due to the lack of fine-grained frame-level guidance during CTC training, and forcing the student to learn inaccurate alignments leads to limited performance improvements. In this article, we investigate building powerful teacher models with more accurate and stable feature-transcription alignments. We achieve this goal with a novel alignment-consistent ensemble (ACE) technique, where all models within an ensemble are jointly trained along with a regularization term that encourages consistent and stable alignments. With a well-trained deep bidirectional LSTM (DBLSTM) ACE as a teacher, we can directly use the traditional frame-wise KD method to train DBLSTM students. When applying KD to transfer knowledge from a DBLSTM ACE to a deep unidirectional LSTM (DLSTM) student, a simple yet effective target delay technique is proposed to handle the alignment difference between bidirectional and unidirectional models. Experimental results on the Switchboard-I speech recognition task show that, with a DBLSTM ACE as the teacher, the simple frame-wise KD method achieves competitive or better performance than other, more complex KD methods on DBLSTM students. When applying KD to build DLSTM students from DBLSTM teachers, the proposed target delay technique achieves relative word error rate reductions of 14.2% $\sim$ 14.8% compared with models trained from scratch, outperforming other carefully designed KD methods.
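To make the two ingredients named above concrete, the sketch below illustrates a frame-wise KD loss (a per-frame KL divergence between teacher and student output distributions) and one simple way a target delay could be applied when distilling a bidirectional teacher into a unidirectional student. This is a minimal PyTorch-style illustration under stated assumptions, not the paper's implementation: the tensor layout (batch, time, vocab), the temperature parameter, the function names `frame_wise_kd_loss` and `target_delay_kd_loss`, and the frame-dropping scheme at utterance boundaries are all assumptions introduced here for clarity.

```python
import torch.nn.functional as F


def frame_wise_kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Frame-level KD loss: KL divergence between the teacher's and the
    student's per-frame output distributions.

    Both tensors are assumed to have shape (batch, time, vocab) and to hold
    unnormalized scores; matching frame t to frame t encodes the shared
    frame-level alignment assumption discussed in the abstract.
    """
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # Flatten (batch, time) so "batchmean" averages the per-frame KL terms.
    kl = F.kl_div(
        log_p_student.reshape(-1, log_p_student.size(-1)),
        p_teacher.reshape(-1, p_teacher.size(-1)),
        reduction="batchmean",
    )
    return kl * (temperature ** 2)


def target_delay_kd_loss(student_logits, teacher_logits, delay, temperature=1.0):
    """Frame-wise KD with a target delay of `delay` frames: the unidirectional
    student at frame t is trained to match the bidirectional teacher at
    frame t - delay, so the student has seen `delay` extra frames of input
    before it must reproduce each teacher output.

    Dropping the first `delay` student frames and the last `delay` teacher
    frames is one simple (assumed) way to realize the shift.
    """
    if delay == 0:
        return frame_wise_kd_loss(student_logits, teacher_logits, temperature)
    return frame_wise_kd_loss(
        student_logits[:, delay:, :],
        teacher_logits[:, :-delay, :],
        temperature,
    )
```

In practice such a KD term would typically be interpolated with the CTC training loss and padded frames would be masked out; the abstract does not specify those details, so they are omitted from the sketch.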
