Abstract

Attention-based encoder-decoder models significantly reduce the burden of developing multilingual speech recognition systems. Through end-to-end modeling and parameter sharing, a single model can be efficiently trained and deployed for all languages. Although a single model benefits from joint training across languages, it must also handle their variation and diversity. In this paper, we exploit knowledge distillation from multiple teachers to improve the recognition accuracy of an end-to-end multilingual model. Since teacher models trained on monolingual and multilingual data capture distinct, language-specific knowledge, we introduce multiple teachers, a monolingual teacher for each language plus a multilingual teacher, to teach a same-sized multilingual student, so that the student learns the diverse knowledge embedded in the data and can outperform the multilingual teacher. Unlike conventional knowledge distillation, which typically relies on a linear interpolation of the hard loss from the true labels and the soft losses from the teachers, we propose a random augmented training strategy that switches the student's optimization between the hard and soft losses in random order. Experiments on a multilingual speech dataset composed of Wall Street Journal (English) and AISHELL-1 (Chinese) show that the proposed multiple teachers and distillation strategy significantly boost the student's performance relative to the multilingual teacher.
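The core idea of the random augmented training strategy can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the distillation temperature, and the uniform sampling over objectives are all assumptions; the paper's exact switching schedule and loss definitions may differ.

```python
import math
import random

def softmax(logits, T=1.0):
    # Temperature-scaled softmax over a list of logits.
    m = max(l / T for l in logits)
    exps = [math.exp(l / T - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def hard_loss(student_logits, true_label):
    # Cross-entropy against the one-hot ground-truth label.
    probs = softmax(student_logits)
    return -math.log(probs[true_label])

def soft_loss(student_logits, teacher_logits, T=2.0):
    # Cross-entropy against a teacher's temperature-softened distribution.
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))

def random_switch_loss(student_logits, true_label, teacher_logits_list, T=2.0):
    # Instead of a fixed linear interpolation of all losses, randomly pick
    # ONE objective per training step: either the hard loss from the true
    # label, or a soft loss from one of the teachers (the monolingual
    # teachers or the multilingual teacher).
    choice = random.randrange(len(teacher_logits_list) + 1)
    if choice == 0:
        return hard_loss(student_logits, true_label)
    return soft_loss(student_logits, teacher_logits_list[choice - 1], T)
```

In a real system, `student_logits` and the per-teacher logits would be produced per output token by the encoder-decoder models, and the randomly selected loss would drive that step's gradient update.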
