Speech-to-speech translation (S2ST) has evolved from cascaded systems that chain Automatic Speech Recognition (ASR), Machine Translation (MT), and Text-to-Speech (TTS) into end-to-end models, driven by advances in model performance and the expansion of cross-lingual speech datasets. Although research on Tibetan speech translation remains scarce, this paper tackles direct Tibetan-to-Chinese speech-to-speech translation within a multi-task learning framework, combining self-supervised learning (SSL) with sequence-to-sequence model training. Leveraging the HuBERT model to extract discrete units of the target speech, we develop an encoder-decoder speech-to-unit translation (S2UT) model whose predicted units are converted to speech by a unit-based vocoder. By employing SSL and using discrete representations as training targets, our approach effectively captures linguistic differences between the two languages, enabling direct translation. We evaluate the HuBERT model under various configurations and select the optimal setup based on Phone-unit Normalized Mutual Information (PNMI). After fine-tuning the chosen HuBERT model on task-specific corpora, we introduce auxiliary tasks that enhance translation performance, underscoring the pivotal role of multi-task learning in overall model efficacy. Experimental results validate the feasibility of Tibetan-to-Chinese S2ST, demonstrating promising translation quality and preservation of semantic content despite limited data availability.
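For readers unfamiliar with the discrete-unit pipeline the abstract describes, the following is a minimal sketch (not the paper's exact configuration) of extracting frame-level discrete units from target speech with a pretrained HuBERT encoder plus k-means quantization. The torchaudio checkpoint, the layer index, the cluster count, and the file name are illustrative assumptions; in practice the k-means codebook is fit over a whole corpus of features, and the layer/cluster choices are the configurations compared via PNMI.

```python
# Hedged sketch: HuBERT features -> k-means -> discrete unit IDs.
import torch
import torchaudio
from sklearn.cluster import KMeans

bundle = torchaudio.pipelines.HUBERT_BASE  # assumed checkpoint choice
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("target_speech.wav")  # hypothetical file
waveform = waveform.mean(0, keepdim=True)            # force mono, shape (1, T)
if sr != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.no_grad():
    # extract_features returns one hidden-state tensor per transformer layer
    features, _ = model.extract_features(waveform)
    feats = features[5].squeeze(0)  # assumed layer choice; tuned via PNMI

# Fit k-means over a corpus in practice; a single utterance is used here
# only to keep the sketch self-contained. 100 clusters is an assumption.
kmeans = KMeans(n_clusters=100, n_init=10).fit(feats.numpy())
units = kmeans.predict(feats.numpy())  # frame-level discrete unit sequence
print(units[:20])
```

These unit sequences serve as the S2UT decoder's training targets; a unit-based vocoder then maps predicted units back to waveforms.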
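The selection criterion mentioned above, PNMI, is the mutual information between aligned frame-level phone labels and unit IDs, normalized by the phone entropy: PNMI = I(phone; unit) / H(phone). A small hedged sketch of that computation follows; the function name and the toy label arrays are illustrative, and inputs are assumed to be frame-aligned label sequences.

```python
# Hedged sketch: Phone-unit Normalized Mutual Information (PNMI).
import numpy as np
from sklearn.metrics import mutual_info_score

def pnmi(phone_labels: np.ndarray, unit_ids: np.ndarray) -> float:
    """PNMI in [0, 1]; higher means units capture more phonetic content."""
    mi = mutual_info_score(phone_labels, unit_ids)  # I(phone; unit), in nats
    _, counts = np.unique(phone_labels, return_counts=True)
    p = counts / counts.sum()
    h_phone = -np.sum(p * np.log(p))                # H(phone), in nats
    return float(mi / h_phone)

# Toy usage with hypothetical aligned labels:
phones = np.array([0, 0, 1, 1, 2, 2, 2, 0])
units = np.array([5, 5, 9, 9, 3, 3, 3, 5])
print(pnmi(phones, units))  # 1.0 for a perfect one-to-one mapping
```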