Improving Non-Autoregressive Speech Recognition with Autoregressive Pretraining

Yanjia Li,Lahiru Samarakoon,Ivan Fung

doi:10.1109/icassp49357.2023.10096815

Abstract

Autoregressive (AR) automatic speech recognition (ASR) models predict each output token conditioning on the previous ones, which slows down their inference speed. On the other hand, non-autoregressive (NAR) models predict tokens independently and simultaneously within a constant number of decoding iterations, which brings high inference speed. However, NAR models generally have lower accuracy than AR models. In this work, we propose AR pretraining to the NAR encoder to reduce the accuracy gap between AR and NAR models. The experiment results show that our AR-pretrained MaskCTC reaches the same accuracy as AR Conformer on Aishell-1 (both 4.9% CER) and reduce the performance gap with AR Conformer on LibriSpeech by relatively 50%. Moreover, our AR-pretrained MaskCTC only needs single decoding iteration, which reduces inference time by 50%. We also investigate multiple masking strategies in training the masked language model of MaskCTC.

Full Text