Abstract

Transformer-based models have led to significant innovation in various classic and practical fields, including speech processing, natural language processing, and computer vision. Building on the Transformer, attention-based end-to-end automatic speech recognition (ASR) models have become popular in recent years. Specifically, an emerging research topic is non-autoregressive modeling, which achieves fast inference while delivering performance competitive with conventional autoregressive methods. In addition, in the context of natural language processing, the bidirectional encoder representations from Transformers (BERT) model and its variants have received widespread attention, partially due to their ability to infer contextualized word representations and achieve superior performance on downstream tasks through simple fine-tuning. However, to our knowledge, leveraging the synergistic power of non-autoregressive modeling and pre-trained language models for ASR remains relatively underexplored. In this regard, this study presents a novel pre-trained language model-based non-autoregressive ASR framework. A series of experiments was conducted on two publicly available Chinese datasets, AISHELL-1 and AISHELL-2, to demonstrate that the proposed ASR models achieve competitive or superior results compared with well-practiced baseline systems. In addition, a set of comparative experiments with different settings was likewise carried out to analyze the performance of the proposed framework.
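To make the central idea concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of how a BERT-like masked language model can act as a non-autoregressive decoder on top of an acoustic encoder: the entire target sequence is initialized with mask tokens and all output positions are predicted in a single parallel pass rather than token by token. All class names, dimensions, and hyperparameters below are hypothetical placeholders chosen for brevity.

```python
# Illustrative sketch only: an acoustic Transformer encoder feeds a BERT-like
# decoder that refines an all-[MASK] sequence non-autoregressively.
import torch
import torch.nn as nn


class NonAutoregressiveASR(nn.Module):
    def __init__(self, vocab_size=4233, d_model=256, n_heads=4, n_layers=6, max_len=64):
        super().__init__()
        # Acoustic encoder: a standard Transformer encoder over speech features.
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.feat_proj = nn.Linear(80, d_model)  # e.g., 80-dim filterbank input

        # BERT-like decoder: attends to the encoder output and fills in an
        # all-[MASK] token sequence in one parallel pass (no left-to-right order).
        self.token_emb = nn.Embedding(vocab_size + 1, d_model)  # +1 for [MASK]
        self.mask_id = vocab_size
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)
        self.max_len = max_len

    def forward(self, feats):
        # feats: (batch, frames, 80) acoustic features
        memory = self.encoder(self.feat_proj(feats))
        batch = feats.size(0)
        # Initialize every target position with [MASK] and decode all of them
        # simultaneously; no causal mask is applied, so decoding is parallel.
        masks = torch.full((batch, self.max_len), self.mask_id,
                           dtype=torch.long, device=feats.device)
        hidden = self.decoder(self.token_emb(masks), memory)
        return self.out(hidden)  # (batch, max_len, vocab_size), predicted in parallel


if __name__ == "__main__":
    model = NonAutoregressiveASR()
    logits = model(torch.randn(2, 120, 80))  # two utterances, 120 frames each
    hypotheses = logits.argmax(dim=-1)       # greedy parallel decoding
    print(hypotheses.shape)                  # torch.Size([2, 64])
```

In practice, the decoder in such a framework would be initialized from pre-trained BERT weights and the masking/refinement strategy would be more elaborate; this sketch only conveys the parallel-decoding mechanism that distinguishes non-autoregressive from autoregressive ASR.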
