Abstract

While larger acoustic models provide better speech recognition performance, smaller models are appropriate when computational resources are limited. Knowledge distillation is used to train small models on the basis of soft labels obtained from larger models instead of hard labels obtained from reference transcriptions. In this work, we investigated two methods for using both types of labels: sequence-level distillation (SD), in which a loss function based on either the hard or the soft labels is selected, and sequence-level interpolation (SI), in which the two loss functions are interpolated. Experiments showed that SI was consistently better than SD, and that SI with annealing performed the best.
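
The sketch below illustrates the general idea of interpolating a hard-label loss with a soft-label distillation loss, together with an annealed interpolation weight. It is not the authors' implementation: it uses a simple frame-level cross-entropy and KL-divergence formulation rather than the paper's sequence-level losses, and the weight schedule, function names, and tensor shapes are illustrative assumptions.

```python
# Minimal sketch (assumed formulation, not the paper's exact method):
# combine a hard-label loss (cross-entropy against reference labels) with a
# soft-label loss (KL divergence against teacher outputs), weighted by alpha.
import torch
import torch.nn.functional as F

def interpolated_loss(student_logits, teacher_logits, hard_labels, alpha):
    """Return alpha * hard-label loss + (1 - alpha) * soft-label loss."""
    # Hard-label term: cross-entropy with the reference transcription labels.
    hard_loss = F.cross_entropy(student_logits, hard_labels)
    # Soft-label term: KL divergence between student and teacher distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    return alpha * hard_loss + (1.0 - alpha) * soft_loss

def annealed_alpha(step, total_steps, start=1.0, end=0.0):
    """Linearly anneal the interpolation weight over training (assumed schedule)."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac

# Toy usage: a batch of 8 frames with 40 output classes.
student_logits = torch.randn(8, 40, requires_grad=True)
teacher_logits = torch.randn(8, 40)
hard_labels = torch.randint(0, 40, (8,))
loss = interpolated_loss(student_logits, teacher_logits, hard_labels,
                         alpha=annealed_alpha(step=100, total_steps=1000))
loss.backward()
```

In this sketch, annealing gradually shifts the weight from the hard-label term toward the soft-label term as training proceeds; the linear schedule is only one possible choice.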
