Abstract
End-to-end speech translation typically leverages audio-to-text parallel data to train a speech translation model, and this approach has shown impressive results on various speech translation tasks. Because collecting audio-to-text parallel data is labor-intensive, speech translation is inherently a low-resource translation scenario, which greatly hinders its progress. In this paper, we propose a new adversarial training method that leverages target-language monolingual data to alleviate the low-resource limitation of speech translation. In our method, the existing speech translation model is treated as a Generator that produces target-language output, and a neural Discriminator is trained to distinguish the speech translation model's outputs from true target-language monolingual sentences. Experimental results on the CCMT 2019-BSTC speech translation task demonstrate that the proposed method significantly improves the performance of the end-to-end speech translation system.
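To make the training scheme concrete, here is a minimal PyTorch-style sketch of the adversarial objective described above. All names (st_model, Discriminator, the soft token-embedding output of the generator) and hyperparameters are illustrative assumptions for exposition, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Scores a sequence of target-language token embeddings as
    real (monolingual text) vs. generated (ST model output).
    Architecture is a hypothetical stand-in, not the paper's."""
    def __init__(self, emb_dim=512, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)

    def forward(self, token_embs):               # (batch, seq_len, emb_dim)
        _, h = self.rnn(token_embs)               # h: (1, batch, hidden)
        return torch.sigmoid(self.score(h[-1]))   # (batch, 1) prob. of "real"

bce = nn.BCELoss()

def adversarial_step(st_model, disc, audio, real_embs, g_opt, d_opt):
    # Generator pass: assume the ST model maps audio to soft
    # target-language token embeddings, keeping the step differentiable.
    fake_embs = st_model(audio)

    # Discriminator update: real monolingual sentences -> 1, ST outputs -> 0.
    d_opt.zero_grad()
    d_real = disc(real_embs)
    d_fake = disc(fake_embs.detach())   # detach: don't update the generator here
    d_loss = bce(d_real, torch.ones_like(d_real)) + \
             bce(d_fake, torch.zeros_like(d_fake))
    d_loss.backward()
    d_opt.step()

    # Generator update: push the ST model to produce outputs the
    # discriminator scores as real target-language text.
    g_opt.zero_grad()
    g_fake = disc(fake_embs)
    g_loss = bce(g_fake, torch.ones_like(g_fake))
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```

In practice this adversarial term would be combined with the usual supervised cross-entropy loss on the audio-to-text parallel data; only the adversarial term consumes the target-language monolingual sentences.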
Highlights
A traditional speech translation (ST) system usually consists of two components: an automatic speech recognition (ASR) model and a machine translation (MT) model
The existing speech translation model is treated as a Generator that produces target-language output, and a neural Discriminator is trained to distinguish the speech translation model's outputs from true target-language monolingual sentences
The Adversarial Training method obtains 19.1 BLEU, an improvement of 1.4 BLEU over the end-to-end baseline model, and even outperforms the multitask method
Summary
A traditional speech translation (ST) system usually consists of two components: an automatic speech recognition (ASR) model and a machine translation (MT) model. Owing to the success of end-to-end approaches in both automatic speech recognition and machine translation, researchers are increasingly interested in end-to-end speech translation. However, available audio-to-text parallel corpora contain only tens to hundreds of hours of speech, equivalent to roughly hundreds of thousands of bilingual sentence pairs. Liu et al. (2019) proposed a Knowledge Distillation approach that uses a text-only MT model to guide the ST model, since there is a large performance gap between the end-to-end ST model and the MT model. Despite their success, these approaches still require additional labeled data, such as source-language speech, source-language transcripts, and target-language translations.
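For reference, the knowledge distillation idea mentioned above is commonly formulated as a token-level KL divergence that pulls the ST student's output distribution toward the text-only MT teacher's. The sketch below follows that common formulation; the tensor names and temperature T are assumptions for illustration, not Liu et al.'s exact recipe.

```python
import torch.nn.functional as F

def kd_loss(st_logits, mt_logits, T=1.0):
    """KL divergence pushing the ST student's per-token distribution
    toward the text-only MT teacher's distribution.
    st_logits, mt_logits: (batch * seq_len, vocab_size)."""
    teacher = F.softmax(mt_logits / T, dim=-1).detach()   # teacher is frozen
    student = F.log_softmax(st_logits / T, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean") * (T * T)
```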