Abstract

In this paper, we develop an end-to-end automatic speech recognition (ASR) model designed for a common low-resource scenario: no pronunciation dictionary or phonemic transcripts, very limited transcribed speech, and much larger non-parallel text and speech corpora. Our semi-supervised model is built on top of an encoder-decoder model with attention and takes advantage of non-parallel speech and text corpora in several ways: a denoising text autoencoder that shares parameters with the ASR decoder, a speech autoencoder that shares parameters with the ASR encoder, and adversarial training that encourages the speech and text encoders to use the same embedding space. We show that a model with this architecture significantly outperforms the baseline in this low-resource condition. We additionally perform an ablation study, demonstrating that all of our added components contribute substantially to the overall performance of our model. We propose several avenues for further work, noting in particular that a model with this architecture could potentially enable fully unsupervised speech recognition.
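To make the parameter-sharing scheme concrete, below is a minimal sketch of the architecture the abstract describes. Everything specific here is an assumption, not the paper's actual configuration: the GRU encoders and decoder, dot-product attention, module sizes, the token-dropout noising function, and the uniform loss weighting are all illustrative stand-ins. The sketch only shows how one decoder can serve both the ASR and text-autoencoder losses, how one encoder can serve both the ASR and speech-autoencoder losses, and where an adversarial discriminator fits.

```python
# Illustrative sketch only; all hyperparameters and module choices are assumptions.
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Maps acoustic features into the shared embedding space; reused by both
    the ASR branch and the speech autoencoder."""
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, hidden)

    def forward(self, feats):                      # feats: (B, T, feat_dim)
        out, _ = self.rnn(feats)
        return self.proj(out)                      # (B, T, hidden)

class TextEncoder(nn.Module):
    """Encodes (noised) token sequences into the same embedding space."""
    def __init__(self, vocab=1000, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, tokens):                     # tokens: (B, L)
        out, _ = self.rnn(self.emb(tokens))
        return out                                 # (B, L, hidden)

class AttentionDecoder(nn.Module):
    """Dot-product-attention decoder, shared between ASR and the text
    denoising autoencoder."""
    def __init__(self, vocab=1000, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, hidden)
        self.cell = nn.GRUCell(2 * hidden, hidden)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, enc, targets):               # enc: (B, T, H); targets: (B, L)
        state = enc.mean(dim=1)                    # crude initial state
        logits = []
        for t in range(targets.size(1)):           # teacher forcing; BOS shift omitted
            scores = torch.bmm(enc, state.unsqueeze(2)).squeeze(2)
            ctx = torch.bmm(scores.softmax(1).unsqueeze(1), enc).squeeze(1)
            state = self.cell(torch.cat([self.emb(targets[:, t]), ctx], 1), state)
            logits.append(self.out(state))
        return torch.stack(logits, dim=1)          # (B, L, vocab)

class SpeechDecoder(nn.Module):
    """Reconstructs features from shared embeddings (speech autoencoder branch)."""
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, feat_dim)

    def forward(self, h):
        out, _ = self.rnn(h)
        return self.out(out)

class Discriminator(nn.Module):
    """Guesses whether an embedding came from speech (1) or text (0); the
    encoders are trained to fool it, aligning the two embedding spaces."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, h):                          # (B, T, H) -> (B, T)
        return self.net(h).squeeze(-1)
```

A single training step might then combine the four losses along these lines (again an assumption; in particular, a real adversarial setup alternates encoder and discriminator updates, or uses gradient reversal, rather than the single encoder-side term shown):

```python
speech_enc, text_enc = SpeechEncoder(), TextEncoder()
dec, speech_dec, disc = AttentionDecoder(), SpeechDecoder(), Discriminator()
ce, bce, mse = nn.CrossEntropyLoss(), nn.BCEWithLogitsLoss(), nn.MSELoss()

feats = torch.randn(4, 50, 80)                    # small paired set: speech ...
trans = torch.randint(1, 1000, (4, 12))           # ... and its transcripts
text_only = torch.randint(1, 1000, (4, 12))       # larger non-parallel text
speech_only = torch.randn(4, 50, 80)              # larger non-parallel speech

# (1) Supervised ASR loss on the limited parallel data.
h_sp = speech_enc(feats)
asr_loss = ce(dec(h_sp, trans).flatten(0, 1), trans.flatten())

# (2) Denoising text autoencoder: the *shared* decoder reconstructs clean
#     text from a noised copy (crude token dropout as a stand-in).
noised = text_only.clone(); noised[:, ::3] = 0
dae_loss = ce(dec(text_enc(noised), text_only).flatten(0, 1), text_only.flatten())

# (3) Speech autoencoder: the *shared* encoder feeds a reconstruction head.
ae_loss = mse(speech_dec(speech_enc(speech_only)), speech_only)

# (4) Adversarial term with flipped labels, so each encoder tries to make its
#     embeddings look like the other modality's.
adv_loss = (bce(disc(h_sp), torch.zeros(4, 50)) +
            bce(disc(text_enc(text_only)), torch.ones(4, 12)))

total = asr_loss + dae_loss + ae_loss + adv_loss  # uniform weights: an assumption
```

The key design point the sketch illustrates is that the same decoder weights receive gradients from both cross-entropy losses, and the same encoder weights from both the ASR and reconstruction losses, while the adversarial term pushes speech and text embeddings toward indistinguishability so the shared decoder can consume either.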
