Streaming cascade-based speech translation leveraged by a direct segmentation model

Javier Iranzo-Sánchez, Javier Jorge, Pau Baquero-Arnal, Joan Albert Silvestre-Cerdà, Adrià Giménez, Jorge Civera, Albert Sanchis, Alfons Juan

doi:10.1016/j.neunet.2021.05.013

Open Access

Abstract

The cascade approach to Speech Translation (ST) is based on a pipeline in which an Automatic Speech Recognition (ASR) system is followed by a Machine Translation (MT) system. State-of-the-art ST systems are built from deep neural networks designed for an offline setup, in which the audio input to be translated is fully available in advance. A streaming setup poses a very different problem: an unbounded audio input becomes available gradually, and the translation must be generated at the same time under real-time constraints. In this work, we present a state-of-the-art streaming ST system in which the neural models integrated in the ASR and MT components are carefully adapted, in both their training and decoding procedures, to run under a streaming setup. In addition, we introduce a direct segmentation model that adapts the continuous ASR output to the capacity of simultaneous MT systems trained at the sentence level, in order to guarantee low latency while preserving the translation quality of the complete ST system. The resulting ST system is thoroughly evaluated on the real-life streaming Europarl-ST benchmark to gauge the trade-off between quality and latency for each component individually as well as for the complete ST system.
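
To make the data flow of such a cascade concrete, the following is a minimal Python sketch of an ASR word stream passing through a segmenter and then a sentence-level translator. It is an illustration only, not the system described in the paper: the function names (`segment_stream`, `toy_boundary`, `toy_translate`, `cascade`), the punctuation-based boundary rule, and the `max_len` cap are all assumptions introduced here, with the toy rule standing in for a learned segmentation model and the toy translator standing in for a neural MT system.

```python
from typing import Callable, Iterable, Iterator, List


def segment_stream(words: Iterable[str],
                   is_boundary: Callable[[List[str], str], bool],
                   max_len: int = 20) -> Iterator[List[str]]:
    """Group a (possibly unbounded) word stream into sentence-like chunks."""
    buffer: List[str] = []
    for word in words:
        buffer.append(word)
        # Emit a chunk as soon as the boundary rule fires, or when the
        # buffer reaches max_len words, so that latency stays bounded.
        if is_boundary(buffer, word) or len(buffer) >= max_len:
            yield buffer
            buffer = []
    if buffer:
        # Flush whatever remains once the stream ends.
        yield buffer


def toy_boundary(history: List[str], word: str) -> bool:
    # Hypothetical placeholder for a learned segmentation model:
    # split after any word that ends in sentence punctuation.
    return word.endswith((".", "?", "!"))


def toy_translate(chunk: List[str]) -> str:
    # Hypothetical placeholder for an MT system: upper-cases the chunk.
    return " ".join(chunk).upper()


def cascade(asr_words: Iterable[str],
            translate: Callable[[List[str]], str] = toy_translate,
            is_boundary: Callable[[List[str], str], bool] = toy_boundary,
            max_len: int = 20) -> Iterator[str]:
    """ASR words -> segmenter -> translator, one chunk at a time."""
    for chunk in segment_stream(asr_words, is_boundary, max_len):
        yield translate(chunk)
```

Because every stage is a generator, a translation is produced as soon as a chunk is closed, without waiting for the end of the input. For example, `list(cascade("hello world. how are you? fine".split()))` yields one translated string per detected chunk.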

Full Text