Abstract
Speech translation (ST) is a bimodal conversion task from source speech to target text. Deep learning-based ST systems generally require sufficient training data to achieve competitive results, even with a state-of-the-art model. However, the training data usually cannot meet this completeness condition because of small-sample problems. Most low-resource ST approaches improve data integrity with a single model, but this optimization operates along a single dimension and has limited effectiveness. In contrast, multimodality leverages different dimensions of the data for multiperspective modeling: the modalities mutually fill each other's gaps, enriching the data representation and improving the utilization of the training samples. Exploiting the enormous amount of multimodal out-of-domain information to improve low-resource tasks is therefore a new challenge. This paper describes how to use multimodal out-of-domain information to improve low-resource models. First, we propose a low-resource ST framework that reconstructs large-scale label-free audio by incorporating self-supervised learning. At the same time, we introduce a machine translation (MT) pretraining model to complement text embedding and fine-tune decoding. In addition, we analyze layer similarity on the decoder side: we reduce invalid multimodal pseudolabels by performing random depth pruning in the similar layers to minimize error propagation, and we add an auxiliary CTC loss in the dissimilar layers to optimize the ensemble loss. Finally, we study the weighting ratio of the fusion technique in the multimodal decoder. Our experimental results show that the proposed method is promising for low-resource ST, with improvements of up to +3.6 BLEU points over baseline low-resource ST models.
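The two weighting mechanisms mentioned in the abstract, the auxiliary CTC term in the ensemble loss and the weighted fusion of modalities in the decoder, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the names `ensemble_loss`, `fuse`, and the hyperparameters `ctc_weight` and `alpha` are hypothetical placeholders.

```python
def ensemble_loss(ce_loss: float, ctc_loss: float, ctc_weight: float = 0.3) -> float:
    """Combine the main cross-entropy translation loss with an auxiliary
    CTC loss; ctc_weight is an assumed tunable hyperparameter."""
    return ce_loss + ctc_weight * ctc_loss

def fuse(audio_repr: list, text_repr: list, alpha: float = 0.5) -> list:
    """Weighted fusion of audio-side and text-side decoder states,
    interpolated element-wise with a fusion ratio alpha."""
    return [alpha * a + (1.0 - alpha) * t for a, t in zip(audio_repr, text_repr)]
```

Here `alpha` plays the role of the fusion weighting ratio studied in the paper: at `alpha = 1.0` the decoder uses only the audio representation, at `alpha = 0.0` only the text representation.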
Highlights
Language translation has become an essential skill today
Speech translation tasks include cascade and end-to-end structures. The cascade structures are based on jointly trained automatic speech recognition (ASR) [1] and machine translation (MT) [2] models. The advantage of this method is that it leverages text and audio resources to the greatest possible extent [3, 4]
Baseline, Base, and Large translation tasks of different sizes and in different languages are used to evaluate the impact of external MT on BLEU scores. The results show that the pretrained models for MT tasks in different languages effectively improve the performance of the baseline models
Summary
Language translation has become an essential skill today. Existing methods have achieved good results thanks to the large amount of available speech and text resources, and they meet practical standards for both research and everyday use. Speech translation tasks include cascade and end-to-end structures. In a low-resource context, cascade methods have been shown to learn phoneme quality better than end-to-end ST [10, 11], so traditional low-resource ST tasks often use a cascade structure. However, cascade structures are prone to error propagation, leading to incorrect translations [12]. Therefore, in this paper, we attempt to mitigate the error propagation problem that exists in cascade structures for low-resource ST tasks.