Robustness is a long-standing challenge for automatic speech recognition (ASR) as the applied environment of any ASR system faces much noisier speech samples than clean training corpora. However, it is impractical to annotate every types of noisy environments. In this work, we propose a novel phonetic-semantic pre-training (PSP) framework that allows a model to effectively improve the performance of ASR against practical noisy environments via seamlessly integrating pre-training, self-supervised learning, and fine-tuning. In particular, there are three fundamental stages in PSP. First, pre-train the phone-to-word transducer (PWT) to map the generated phone sequence to the target text using only unpaired text data; second, continue training the PWT on more complex data generated from an empirical phone-perturbation heuristic, in additional to self-supervised signals by recovering the tainted phones; and third, fine-tune the resultant PWT with real world speech data. We perform experiments on two real-life datasets collected from industrial scenarios and synthetic noisy datasets, which show that the PSP effectively improves the traditional ASR pipeline with relative character error rate (CER) reductions of 28.63% and 26.38%, respectively, in two real-life datasets. It also demonstrates its robustness against synthetic highly noisy speech datasets.
Read full abstract