Abstract

This paper describes USTC-NELSLIP’s submissions to the IWSLT 2021 Simultaneous Speech Translation task. We propose a novel simultaneous translation model, the Cross-Attention Augmented Transducer (CAAT), which extends the conventional RNN-T to sequence-to-sequence tasks without monotonic constraints, e.g., simultaneous translation. Experiments on speech-to-text (S2T) and text-to-text (T2T) simultaneous translation tasks show that CAAT achieves better quality-latency trade-offs than wait-k, one of the previous state-of-the-art approaches. Based on the CAAT architecture and data augmentation, we build S2T and T2T simultaneous translation systems for this evaluation campaign. Compared to last year’s optimal systems, our S2T simultaneous translation system improves by an average of 11.3 BLEU across all latency regimes, and our T2T simultaneous translation system improves by an average of 4.6 BLEU.
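Quality-latency trade-offs in this task are typically reported as BLEU against Average Lagging (AL; Ma et al., 2019). As a point of reference, here is a minimal sketch of the AL computation; the function name and the fallback for paths that never consume the full source are our own illustrative choices, not part of the paper.

```python
def average_lagging(g, src_len, tgt_len):
    """Average Lagging (AL; Ma et al., 2019).

    g[i] is the number of source tokens read before emitting target
    token i+1; src_len and tgt_len are the full sequence lengths.
    """
    gamma = tgt_len / src_len  # target-to-source length ratio
    # tau: 0-based index of the first target token emitted after the
    # whole source has been read (fall back to the last target token).
    tau = next((i for i, g_i in enumerate(g) if g_i >= src_len), tgt_len - 1)
    return sum(g[i] - i / gamma for i in range(tau + 1)) / (tau + 1)
```

For a wait-3 path on a 10-token source with a 10-token target, g = [3, 4, ..., 10, 10, 10] and the function returns 3.0, matching the intuition that wait-k lags the source by k tokens.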

Highlights

  • This paper describes the submission to the IWSLT 2021 Simultaneous Speech Translation task by the National Engineering Laboratory for Speech and Language Information Processing (NELSLIP), University of Science and Technology of China

  • We propose a novel architecture, the Cross-Attention Augmented Transducer (CAAT), together with a latency loss function that ensures the CAAT model works at an appropriate latency

  • We show that CAAT significantly outperforms the wait-k (Ma et al., 2019) baseline in both text-to-text and speech-to-text simultaneous translation tasks (a minimal sketch of the wait-k schedule follows this list)
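As noted in the last highlight, the fixed-policy baseline is simple to state. The sketch below is our own illustration of the wait-k read/write schedule, not the authors’ code; the function name and the g-vector convention (the same one average_lagging above consumes) are assumptions.

```python
def wait_k_schedule(k, src_len, tgt_len):
    """Illustrative wait-k schedule (Ma et al., 2019): READ k source
    tokens, then alternate WRITE and READ; once the source is
    exhausted, WRITE the remaining target tokens.

    Returns g, where g[i] is the number of source tokens read before
    emitting target token i+1.
    """
    return [min(k + i, src_len) for i in range(tgt_len)]
```

For example, wait_k_schedule(3, 10, 10) yields [3, 4, 5, 6, 7, 8, 9, 10, 10, 10].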

Introduction

This paper describes the submission to the IWSLT 2021 Simultaneous Speech Translation task by the National Engineering Laboratory for Speech and Language Information Processing (NELSLIP), University of Science and Technology of China. Recent work in text-to-text simultaneous translation tends to fall into two categories, fixed policy and flexible policy, represented by wait-k (Ma et al., 2019) and monotonic attention (Arivazhagan et al., 2019; Ma et al., 2020b) respectively. Flexible policies often lead to difficulties in model optimization. We found it impossible to calculate the marginal probability over read/write paths with conventional attention encoder-decoder architectures (Sennrich et al., 2016), the Transformer (Vaswani et al., 2017) included, because the source context and the target-history context are deeply coupled in the decoder. To solve this problem, we propose a novel architecture, the Cross-Attention Augmented Transducer (CAAT), together with a latency loss function that ensures the CAAT model works at an appropriate latency. The policy is integrated into the translation model and learned jointly.
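To make the architectural point concrete, the following is a minimal, illustrative PyTorch sketch, not the authors’ implementation: the prediction-network state is computed from the target history alone, and a cross-attention joiner combines it with the encoder states of the source prefix at each lattice node. All module names, dimensions, and the blank-label convention are our own assumptions.

```python
import torch.nn as nn

class CAATJoiner(nn.Module):
    """Illustrative CAAT-style joiner: at each lattice node (t, u) it
    cross-attends from a target-history state over the encoder states
    of the source prefix read so far, then scores the next target
    token or a blank (keep-reading) label."""

    def __init__(self, d_model=256, n_heads=4, vocab_size=1000):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size + 1)  # +1 for the blank label

    def forward(self, enc_prefix, dec_state):
        # enc_prefix: (B, t+1, D) encoder states for the source read so far.
        # dec_state:  (B, 1, D) prediction-network state built from the
        # target history ONLY; it never sees the source, which is what
        # keeps the T x U lattice marginalization tractable.
        ctx, _ = self.cross_attn(dec_state, enc_prefix, enc_prefix)
        return self.out(ctx).log_softmax(dim=-1)  # (B, 1, V+1) log-probs
```

Because the target-side states never depend on the source, the per-node log-probabilities over the whole T x U lattice factorize as in RNN-T, so the marginal probability over all read/write paths can be computed with the standard forward algorithm, and a latency term can then penalize paths that read too far ahead.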
