Abstract

Speech translation has traditionally been tackled under a cascade approach, chaining speech recognition and machine translation components to translate from an audio source in a given language into text or speech in a target language. Leveraging deep learning approaches to natural language processing, recent studies have explored the potential of direct end-to-end neural modelling for the speech translation task. Though several benefits may come from end-to-end modelling, such as reduced latency and error propagation, the comparative merits of each approach still deserve detailed evaluation and analysis. In this work, we compare state-of-the-art cascade and direct approaches on the under-resourced Basque–Spanish language pair, which features challenging phenomena such as marked differences in morphology and word order. This case study thus complements other studies in the field, which mostly revolve around the English language. We describe and analyse in detail the mintzai-ST corpus, prepared from sessions of the Basque Parliament, and evaluate the strengths and limitations of cascade and direct speech translation models trained on this corpus, with variants exploiting additional data as well. Our results indicate that, despite significant progress with end-to-end models, which may outperform alternatives in terms of automated metrics in some cases, a cascade approach proved optimal overall in our experiments and manual evaluations.
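The contrast between the two paradigms can be summarised with a minimal sketch. This is illustrative only: the function and parameter names below are placeholders and not the interface of the systems evaluated in the paper.

```python
from typing import Callable

def cascade_st(audio: bytes,
               asr: Callable[[bytes], str],
               mt: Callable[[str], str]) -> str:
    """Cascade ST: chain ASR and MT components.

    Recognition errors in the intermediate transcript propagate into the
    translation step, and the two components add up in latency.
    """
    transcript = asr(audio)   # source-language speech -> source-language text
    return mt(transcript)     # source-language text   -> target-language text

def direct_st(audio: bytes,
              st_model: Callable[[bytes], str]) -> str:
    """Direct (end-to-end) ST: a single neural model maps source-language
    speech straight to target-language text, with no intermediate transcript."""
    return st_model(audio)
```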

Highlights

  • Speech translation (ST) systems have been traditionally designed under a cascade approach, where independent automatic speech recognition (ASR) and machine translation (MT) components are chained, feeding the ASR output into the MT component, oftentimes with task-specific bridging to optimise component communication [1,2,3].

  • The remainder of this paper is organised as follows: Section 2 presents related work in the field; Section 3 describes the mintzai-ST corpus, including the data acquisition process and data statistics; Section 4 describes the baseline models built for Basque–Spanish speech translation, including cascade and end-to-end models; Section 5 discusses comparative results for the baseline models; Section 6 describes several direct ST model variants and their results on automated metrics; Section 7 describes the protocol and results of our manual evaluation of the best cascade and end-to-end models, along with targeted evaluations of specific linguistic phenomena and of the impact of relative input difficulty; Section 8 draws the main conclusions from this work.

  • ASR models trained with either an end-to-end neural model (E2E) or the Kaldi toolkit (KAL); ASR and MT models trained either on in-domain data only (IND) or on a combination of in-domain and out-of-domain data (ALL), integrating the OpenDataEuskadi dataset to train the language and casing models for speech recognition and the translation models for the MT component; and MT models obtained by fine-tuning a model trained on the out-of-domain dataset with the in-domain data, in addition to the models trained on in-domain data only and on all available data (see the configuration sketch below).
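How these choices combine into system variants can be sketched as a small configuration grid. This is a sketch under the assumption that backbone and data choices combine freely (the paper may evaluate only a subset of the grid); apart from the E2E, KAL, IND and ALL labels, the class and field names are illustrative and not taken from the paper.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class CascadeVariant:
    """One cascade configuration; labels follow the naming used above."""
    asr_backend: str  # "E2E" (end-to-end neural ASR) or "KAL" (Kaldi)
    asr_data: str     # "IND" (in-domain only) or "ALL" (in-domain + OpenDataEuskadi)
    mt_data: str      # "IND", "ALL", or "FT" (out-of-domain model fine-tuned on in-domain data)

# Enumerate the full grid of combinations implied by the description above.
variants = [CascadeVariant(backend, asr_data, mt_data)
            for backend, asr_data, mt_data
            in product(("E2E", "KAL"), ("IND", "ALL"), ("IND", "ALL", "FT"))]

for v in variants:
    print(f"ASR={v.asr_backend}-{v.asr_data}  MT={v.mt_data}")
```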


Summary

Related Work

Standard speech-to-text translation systems operate on the basis of separate components for speech recognition and machine translation, feeding the output of the ASR module into the MT component. One of the main reasons for this state of affairs was training data scarcity, i.e., the lack of sufficiently large speech-to-text datasets to train direct ST systems, in contrast with the comparatively larger training data available for the ASR and MT components considered separately. Another relevant factor was the need to improve end-to-end ST architectures or training methods. Recent improvements in ST modelling have closed the gap between direct and cascade approaches on standard datasets: whereas the latter outperformed the former in the IWSLT 2019 shared task, results from the 2020 edition featured similar performances overall [23].

The mintzai-ST Corpus
Data Acquisition
Alignment and Filtering
Data Distribution
Baseline Models
Cascade Models
Speech Recognition
Machine Translation
End-to-End Baseline Models
Comparative Baseline Results
Advanced End-to-End Models
Architectural Variants
Pretraining
Knowledge Distillation
Comparative Direct Models’ Results
Targeted Evaluations of Cascade and Advanced Direct Models
Manual Ranking Task
Divergence on Specific Phenomena
Error Propagation
Conclusions