Abstract

This paper describes Edinburgh’s submissions to the IWSLT2021 multilingual speech translation (ST) task. We aim to improve multilingual translation and zero-shot performance in the constrained setting (without using any extra training data) through methods that encourage transfer learning and larger-capacity modeling with advanced neural components. We build our end-to-end multilingual ST model on the Transformer, integrating techniques including adaptive speech feature selection, language-specific modeling, multi-task learning, deep and big Transformer, sparsified linear attention and root mean square layer normalization. We adopt data augmentation for ST using machine translation models, which converts the zero-shot problem into a zero-resource one. Experimental results show that these methods deliver substantial improvements, surpassing the official baseline by > 15 average BLEU and outperforming our cascading system by > 2 average BLEU. Our final submission achieves competitive performance (runner-up).
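The data augmentation mentioned above can be read as pseudo-labelling: a machine translation model trained on the permitted text pairs translates the transcripts of existing speech data into otherwise unseen target languages, so each zero-shot direction gains synthetic supervision. A minimal sketch of that idea follows; `synthesize_st_pairs`, `mt_translate`, and the corpus layout are hypothetical stand-ins, not the authors' actual pipeline.

```python
# Hypothetical sketch of zero-shot -> zero-resource data augmentation.
def synthesize_st_pairs(st_corpus, mt_translate, tgt_lang):
    """Create synthetic ST pairs for a direction with no training data.

    st_corpus:    iterable of (audio, source_transcript) pairs
    mt_translate: hypothetical MT function, (text, tgt_lang) -> text
    tgt_lang:     a target language unseen in the original ST data
    """
    for audio, transcript in st_corpus:
        yield audio, mt_translate(transcript, tgt_lang)
```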

Highlights

  • Although end-to-end (E2E) speech translation (ST) has achieved great success in recent years, outperforming its cascading counterpart and delivering state-of-the-art performance on several benchmarks (Ansari et al., 2020; Zhang et al., 2020a; Zhao et al., 2020), it still suffers from the relatively low amount of dedicated speech-to-translation parallel training data (Salesky et al., 2021).

  • Our study demonstrates that rectified linear attention (ReLA) generalizes well to ST (a sketch of the core operation follows this list).

  • Zhang and Sennrich (2019b) propose root mean square layer normalization (RMSNorm), which regularizes activations using the root mean square statistic alone and is a drop-in replacement for LayerNorm (see the sketch after this list).
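As a rough illustration of ReLA, the sketch below replaces the softmax over scaled dot-product attention scores with ReLU, which yields sparse (exactly zero) weights, and compensates for the missing normalization with an RMS rescaling of the output. The learned gain and gating of the full ReLA method are omitted, and the tensor shapes are illustrative assumptions, so this is a sketch of the idea rather than the authors' implementation.

```python
import math
import torch

def rela(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Minimal sketch of rectified linear attention (ReLA).

    Softmax over attention scores is replaced by ReLU, so the weights
    are sparse (exact zeros) and unnormalized across keys; an RMS
    rescaling of the output compensates (the full method also learns
    a gain and a gate, omitted here).
    """
    d = q.size(-1)
    weights = torch.relu(q @ k.transpose(-2, -1) / math.sqrt(d))  # sparse, >= 0
    out = weights @ v
    # RMS-normalize along the feature dimension
    return out * torch.rsqrt(out.pow(2).mean(dim=-1, keepdim=True) + 1e-8)

# Usage with illustrative shapes: (batch, seq_len, dim)
q, k, v = (torch.randn(2, 5, 64) for _ in range(3))
print(rela(q, k, v).shape)  # torch.Size([2, 5, 64])
```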

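RMSNorm is easiest to see in code. Below is a minimal PyTorch sketch matching the description in the highlight above; the `eps` default is an illustrative assumption.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root mean square layer normalization (sketch).

    Unlike LayerNorm, no mean is subtracted: activations are rescaled
    by their root mean square alone, then multiplied by a learned
    gain, making this a drop-in replacement for LayerNorm.
    """
    def __init__(self, dim: int, eps: float = 1e-8):
        super().__init__()
        self.eps = eps                            # illustrative default
        self.gain = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x / sqrt(mean(x^2) + eps), over the feature dimension
        rms_inv = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms_inv * self.gain

# Usage: same call pattern as nn.LayerNorm over the last dimension
norm = RMSNorm(dim=512)
print(norm(torch.randn(2, 10, 512)).shape)  # torch.Size([2, 10, 512])
```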

Summary

Introduction

Although end-to-end (E2E) speech translation (ST) has achieved great success in recent years, outperforming its cascading counterpart and delivering state-of-the-art performance on several benchmarks (Ansari et al., 2020; Zhang et al., 2020a; Zhao et al., 2020), it still suffers from the relatively low amount of dedicated speech-to-translation parallel training data (Salesky et al., 2021). Whether and how similar success can be obtained in very low-resource (and practical) scenarios for multilingual ST with E2E models remains an open question. To address this question, we participated in the IWSLT2021 multilingual speech translation task, which focuses on low-resource ST language pairs in a multilingual setup. The task is organized in two settings: a constrained setting and an unconstrained setting. The former restricts participants to using only the given Multilingual TEDx data (Salesky et al., 2021) for experiments, while the latter allows additional ASR/ST/MT/other training data.


