Abstract

Automatic Speech Recognition (ASR) systems can now accurately recognize which words are spoken. However, because spontaneous speech contains disfluencies, grammatical errors, and other phenomena, verbatim ASR transcription suffers from poor readability, which is crucial both for human comprehension and for downstream tasks that need to understand the meaning and intent of what is spoken. In this work, we formulate ASR post-processing for readability (APR) as a sequence-to-sequence text generation problem that aims to transform incorrect and noisy ASR output into text that is readable for humans and downstream tasks. We leverage the Metadata Extraction (MDE) corpus to construct a task-specific dataset for our study. To address the scarcity of training data, we propose a novel data augmentation method that synthesizes large-scale training data from a grammatical error correction dataset. We propose a model based on a pre-trained language model to perform the APR task and train it with a two-stage training strategy to better exploit the augmented data. On the constructed test set, our approach outperforms the best baseline system by a large margin: 17.53 on BLEU and 13.26 on readability-aware WER (RA-WER). Human evaluation also shows that our model generates more human-readable transcripts than the baseline method.
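The readability-aware WER (RA-WER) reported above is defined in the paper itself; as background, the underlying metric is standard word error rate, i.e. word-level edit distance between a hypothesis and a reference, normalized by reference length. The sketch below computes plain WER (the function name `wer` and the example strings are illustrative, not from the paper):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# A readability edit (removing the filler "uh") counts as one error under
# plain WER; RA-WER, per the paper, is designed not to penalize such edits.
print(wer("i uh mean the cat sat", "i mean the cat sat"))
```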
