Abstract

The cross-attention mechanism enables the Transformer to capture correspondences between the input and output. However, in end-to-end (E2E) speech-to-text translation (ST), the learned cross-attention weights often fail to correspond to actual alignments, since speech and text must be aligned across different modalities and languages. In this paper, we present a simple yet effective method, regularized cross-attention learning (RCAL), for end-to-end speech translation in a multitask learning (MTL) framework. RCAL leverages knowledge from auxiliary automatic speech recognition (ASR) and machine translation (MT) tasks to generate a teacher cross-attention matrix, which serves as prior alignment knowledge to guide cross-attention learning within the ST task. An additional loss function is introduced into the MTL objective to facilitate this process. We conducted experiments on the MuST-C benchmark dataset to evaluate the effectiveness of RCAL. The results demonstrate that the proposed approach yields significant improvements over the baseline, with an average gain of +0.8 BLEU across four translation directions in two experimental settings, outperforming state-of-the-art E2E and cascaded speech translation models. Further analysis and visualization reveal that the model with RCAL effectively learns high-quality alignment information from the auxiliary ASR and MT tasks, thereby improving ST alignment quality. Moreover, experiments with different sizes of MT and ST data provide strong evidence of the model's robustness in various scenarios.
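The abstract does not spell out how the teacher matrix is constructed or which loss term is added, so the following is a minimal PyTorch sketch under stated assumptions: the teacher is formed by composing the MT (target-to-source) and ASR (source-to-speech) cross-attention matrices, and the regularizer is a KL divergence between the teacher and the ST cross-attention. The function name rcal_loss and the weight lambda are hypothetical, not taken from the paper.

import torch
import torch.nn.functional as F

def rcal_loss(attn_st, attn_asr, attn_mt, eps=1e-8):
    """Hypothetical sketch of an RCAL-style regularizer.

    attn_st:  (T_tgt, T_speech) ST decoder cross-attention (student).
    attn_asr: (T_src, T_speech) ASR decoder cross-attention.
    attn_mt:  (T_tgt, T_src)    MT decoder cross-attention.
    """
    # Compose target->source and source->speech attention into a
    # target->speech teacher alignment; no gradients flow through it.
    teacher = (attn_mt @ attn_asr).detach()
    # Renormalize rows so each target step is a distribution over speech frames.
    teacher = teacher / (teacher.sum(dim=-1, keepdim=True) + eps)
    # KL divergence pulls the student attention toward the teacher alignment.
    return F.kl_div(torch.log(attn_st + eps), teacher, reduction="batchmean")

In an MTL setup, this term would presumably be added to the task losses, e.g. L = L_ST + L_ASR + L_MT + lambda * L_RCAL, with lambda tuned on a development set.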
