Abstract
The end-to-end speech translation (ST) model usually adopts an encoder-decoder structure, taking speech in the source language as input and directly outputting its translation in the target language. Since the model performs cross-modal translation, it must extract semantically rich representations from speech. Compared with text, speech is more fine-grained and contains more noise, which places a heavy burden on the encoder of the ST model. This modal gap between speech and text usually causes the ST model to underperform the corresponding machine translation (MT) model. To bridge the cross-modal gap, this paper proposes to use adversarial training to relieve the burden on the ST encoder by providing internal supervision signals. With this approach, the encoder in the ST model can extract representations with rich semantics, which greatly improves performance. Experiments on the Augmented LibriSpeech English-French and MuST-C English-German datasets show the effectiveness of our method. Further analysis indicates that the proposed method also performs well in low-resource conditions compared to strong baselines.
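To make the idea concrete, the following is a minimal PyTorch sketch of one way such adversarial internal supervision could look: a discriminator is trained to tell speech-encoder representations from text-encoder representations, and the speech encoder is trained to fool it. The abstract gives no implementation details, so every module, shape, and hyperparameter here (SpeechEncoder stand-in, text encoder, discriminator, learning rates) is an illustrative assumption, not the paper's actual method.

```python
# Hypothetical sketch of adversarial modality matching for an ST encoder.
# All names and hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn

HIDDEN = 256

# Stand-ins for the real encoders: the ST speech encoder and an MT text
# encoder whose representations supply the internal supervision signal.
speech_encoder = nn.GRU(input_size=80, hidden_size=HIDDEN, batch_first=True)
text_encoder = nn.Embedding(num_embeddings=1000, embedding_dim=HIDDEN)

# Discriminator: predicts whether a pooled representation came from
# speech (label 0) or text (label 1).
discriminator = nn.Sequential(
    nn.Linear(HIDDEN, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, 1)
)

bce = nn.BCEWithLogitsLoss()
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
opt_e = torch.optim.Adam(speech_encoder.parameters(), lr=1e-4)

# One adversarial step on a toy batch: 8 utterances of 50 frames of
# 80-dim filterbank features, paired with 8 token sequences of length 20.
speech = torch.randn(8, 50, 80)
tokens = torch.randint(0, 1000, (8, 20))

speech_repr = speech_encoder(speech)[0].mean(dim=1)  # (8, HIDDEN)
text_repr = text_encoder(tokens).mean(dim=1)         # (8, HIDDEN)

# 1) Train the discriminator to tell the two modalities apart.
d_loss = bce(discriminator(speech_repr.detach()), torch.zeros(8, 1)) \
       + bce(discriminator(text_repr.detach()), torch.ones(8, 1))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# 2) Train the speech encoder to fool the discriminator, pulling speech
# representations toward the text space (used alongside the usual ST loss).
adv_loss = bce(discriminator(speech_repr), torch.ones(8, 1))
opt_e.zero_grad()
adv_loss.backward()
opt_e.step()
```

In a full training loop, this adversarial term would be combined with the standard translation loss; the two-step alternation above is the standard GAN-style schedule, chosen here only as one plausible realization of the abstract's description.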