Abstract

Using end-to-end models for speech translation (ST) has increasingly been the focus of the ST community. These models collapse the previously cascaded systems by directly converting sound waves into translated text. However, cascaded models have the advantage of including automatic speech recognition output, which is useful for the many practical ST systems that display transcripts to the user alongside the translations. To bridge this gap, recent work has shown initial progress on the feasibility of end-to-end models producing both of these outputs. However, all previous work has only looked at this problem from the consecutive perspective, leaving it uncertain whether these approaches remain effective in the more challenging streaming setting. We develop an end-to-end streaming ST model based on a re-translation approach and compare it against standard cascading approaches. We also introduce a novel inference method for the joint case, interleaving both transcript and translation during generation and removing the need for separate decoders. Our evaluation across a range of metrics capturing accuracy, latency, and consistency shows that our end-to-end models are statistically similar to cascading models while having half the number of parameters. We also find that both systems provide strong translation quality at low latency, keeping 99% of consecutive quality at a lag of just under a second.
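To make the interleaved inference method concrete, the sketch below shows how a single decoder could emit both outputs in one pass: transcript and translation tokens alternate in the generated sequence and are split apart afterward. The model interface (predict_next, bos_id, eos_id) and the even/odd position convention are illustrative assumptions, not the authors' exact implementation.

    def interleaved_decode(model, audio_features, max_len=200):
        # Greedily decode one interleaved sequence, then split it into
        # a source transcript and a target translation.
        # (Hypothetical model interface, for illustration only.)
        tokens = [model.bos_id]
        while len(tokens) < max_len:
            next_id = model.predict_next(audio_features, tokens)  # argmax step
            if next_id == model.eos_id:
                break
            tokens.append(next_id)
        body = tokens[1:]          # drop BOS
        transcript = body[0::2]    # even positions: transcript tokens
        translation = body[1::2]   # odd positions: translation tokens
        return transcript, translation

Because a single decoder produces both sequences, no second decoder (and no second set of decoder parameters) is needed for the joint task.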

Highlights

  • Speech translation (ST) is the task of translating acoustic sound waves into text in a language different from the one originally spoken. This paper focuses on ST in a particular setting described by two characteristics: (1) we desire models that translate in a streaming fashion, producing output incrementally as the speaker talks, and (2) we desire models that jointly produce a source transcript alongside the translation

  • The cascaded model has nearly twice as many parameters as the E2E models (217M vs. 107M). When we examine these models under a variety of inference conditions (using constrained decoding and mask-k as in Arivazhagan et al. (2020a); see the sketch after these highlights), we further see this trend illustrated in the quality vs. latency trade-off, with both models retaining 99% of their BLEU at an average lagging (AL) of less than 1.0 seconds

  • We focus on the task of streaming speech translation, producing both a target translation and a source transcript from an audio source

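As referenced in the highlights above, mask-k is a display policy for re-translation systems (in the spirit of Arivazhagan et al., 2020a): every time new audio arrives the model re-translates the full input, but the last k tokens of each hypothesis are withheld, since they are the most likely to change in the next revision. Below is a minimal sketch, with translate_prefix as a hypothetical stand-in for the underlying ST model:

    def mask_k_stream(translate_prefix, audio_chunks, k=3):
        # Yield the stabilized (masked) hypothesis after each audio chunk.
        received = []
        for chunk in audio_chunks:
            received.append(chunk)
            hypothesis = translate_prefix(received)  # full re-translation
            yield hypothesis[:-k] if k > 0 else hypothesis
        # Once the input ends, show the final hypothesis unmasked.
        yield translate_prefix(received)

Masking trades a small amount of latency for stability: a larger k means fewer visible revisions (less flicker) but later delivery of the final words.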

Summary

Introduction

Previous approaches to streaming ST have typically utilized a cascaded system that pipelines the output of an automatic speech recognition (ASR) system through a machine translation (MT) model for the final result. These systems have been the preeminent strategy, taking the top place in recent streaming ST competitions (Pham et al., 2019; Jan et al., 2019; Elbayad et al., 2020; Ansari et al., 2020). In contrast, end-to-end (E2E) models are appealing from computational and engineering standpoints, reducing model complexity and decreasing parameter count.
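The architectural contrast is easy to see in code. Below is a minimal sketch of the two designs; asr_model, mt_model, and e2e_model are hypothetical callables standing in for the component systems, not the specific models evaluated in the paper.

    def cascaded_st(asr_model, mt_model, audio):
        # Cascade: ASR produces a transcript, then MT translates it.
        # The transcript comes for free, but recognition errors
        # propagate into the translation stage.
        transcript = asr_model(audio)
        translation = mt_model(transcript)
        return transcript, translation

    def end_to_end_st(e2e_model, audio):
        # E2E: a single model maps audio directly to translated text
        # (and, in the joint setting, to the transcript as well).
        return e2e_model(audio)

The cascade maintains two full models, which is why it carries roughly twice the parameter count of the E2E system in this paper's comparison.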
