Abstract

The accuracy of Automatic Speech Recognition (ASR) is critical to speech-based products such as subtitling, speech translation, and spoken dialogue. We aim to enhance ASR accuracy by correcting errors in ASR hypotheses with sequence-to-sequence (seq2seq) models. In this paper, we propose to boost Transformer-based ASR error correction by fusing a pre-trained BERT [1] into the encoder and a copying mechanism into the decoder, which exploit externally well-learned token representations and copy correct tokens from the ASR transcript, respectively. In addition, we leverage Text-to-Speech (TTS) synthesized data and ASR 5-best hypotheses to augment the training data and make it more diverse. We evaluate our approach on two internal test sets and two public ASR test sets. Experimental results show that the proposed approach decreases the average Character Error Rate (CER) from 9.36% to 7.30% compared with the ASR hypotheses without correction, and outperforms the Transformer-based baseline by a large margin.
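The copying mechanism mentioned above can be illustrated with a pointer-generator-style mixing step, in which the decoder's generation distribution is interpolated with a copy distribution built from attention over the source (ASR transcript) tokens. This is a minimal NumPy sketch under that assumption; the names (`p_vocab`, `attn`, `p_gen`) are illustrative and not taken from the paper.

```python
import numpy as np

def copy_mechanism_step(p_vocab, attn, source_ids, p_gen, vocab_size):
    """One decoding step of a pointer-generator-style copy mechanism (sketch).

    p_vocab    : (vocab_size,) softmax distribution over the output vocabulary
    attn       : (src_len,) attention weights over source (ASR hypothesis) tokens
    source_ids : (src_len,) vocabulary ids of the source tokens
    p_gen      : scalar in [0, 1], probability of generating vs. copying
    """
    p_copy = np.zeros(vocab_size)
    # Scatter-add attention mass onto the vocabulary ids of the source tokens,
    # so tokens already correct in the ASR transcript can be copied verbatim.
    np.add.at(p_copy, source_ids, attn)
    # Interpolate generation and copy distributions; result still sums to 1.
    return p_gen * p_vocab + (1.0 - p_gen) * p_copy

# Toy example: a 10-word vocabulary, two source tokens both with id 3.
p_vocab = np.full(10, 0.1)
attn = np.array([0.7, 0.3])
final = copy_mechanism_step(p_vocab, attn, np.array([3, 3]), p_gen=0.5, vocab_size=10)
```

In this toy example, the source token (id 3) receives extra probability mass from the copy branch, which is how correct ASR tokens are preserved rather than re-generated.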
