Abstract

In this work, we develop SimulSpeech, an end-to-end simultaneous speech to text translation system which translates speech in a source language into text in a target language concurrently. SimulSpeech consists of a speech encoder, a speech segmenter, and a text decoder, where 1) the segmenter builds upon the encoder and leverages a connectionist temporal classification (CTC) loss to split the input streaming speech in real time, and 2) the encoder-decoder attention adopts a wait-k strategy for simultaneous translation. Training SimulSpeech is more challenging than training previous cascaded systems (with simultaneous automatic speech recognition (ASR) and simultaneous neural machine translation (NMT)). We introduce two novel knowledge distillation methods to ensure good performance: 1) attention-level knowledge distillation transfers knowledge from the product of the attention matrices of the simultaneous NMT and ASR models to guide the training of the attention mechanism in SimulSpeech; 2) data-level knowledge distillation transfers knowledge from the full-sentence NMT model and also reduces the complexity of the data distribution to aid the optimization of SimulSpeech. Experiments on the MuST-C English-Spanish and English-German spoken language translation datasets show that SimulSpeech achieves reasonable BLEU scores and lower delay compared to full-sentence end-to-end speech to text translation (without simultaneous translation), and better performance than the two-stage cascaded simultaneous translation model in terms of both BLEU scores and translation delay.
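As context for the wait-k strategy mentioned above: the decoder may only attend to source speech segments that have already been read, starting k segments ahead of the current target position, so translation begins before the full utterance is available. The sketch below is purely illustrative; the function name, the use of PyTorch, and the assumption that the CTC segmenter has already produced discrete source segments are ours, not details from the paper.

```python
import torch

def wait_k_attention_mask(num_tgt_steps: int, num_src_segments: int, k: int) -> torch.Tensor:
    """Boolean mask for wait-k encoder-decoder attention (illustrative only).

    Entry (t, s) is True when the decoder, while emitting target step t,
    may attend to source segment s. Under wait-k, target step t (0-indexed)
    only sees the first t + k source segments, so output starts before the
    full speech input has been received.
    """
    t = torch.arange(num_tgt_steps).unsqueeze(1)      # (T, 1) target steps
    s = torch.arange(num_src_segments).unsqueeze(0)   # (1, S) source segments
    return s < (t + k)

# With k = 3, the first target token attends to the first 3 speech segments,
# the second token to the first 4, and so on.
mask = wait_k_attention_mask(num_tgt_steps=5, num_src_segments=8, k=3)
print(mask.int())
```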

Highlights

  • In this work, we develop SimulSpeech, an end-to-end simultaneous speech to text translation system which translates speech in a source language into text in a target language concurrently

  • Experiments on MuST-C English-Spanish and English-German spoken language translation datasets demonstrate that SimulSpeech: 1) achieves reasonable BLEU scores and lower delay compared to full-sentence end-to-end speech to text translation, and 2) obtains better performance than the two-stage cascaded simultaneous translation model in terms of BLEU scores and translation delay

  • To better train the SimulSpeech model, we propose a novel attention-level knowledge distillation method specially designed for speech to text translation, complemented by data-level knowledge distillation from a full-sentence NMT model (a sketch of the attention-level loss is given below)
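As a rough illustration of attention-level knowledge distillation: multiplying the teacher NMT attention matrix (target text over source text) with the teacher ASR attention matrix (source text over speech) yields a target-text-to-speech alignment that can serve as a soft target for the SimulSpeech encoder-decoder attention. The sketch below is an assumption-laden illustration rather than the paper's implementation: PyTorch, single attention matrices, and a KL-divergence distance are all our choices.

```python
import torch
import torch.nn.functional as F

def attention_kd_loss(st_attn: torch.Tensor,
                      nmt_attn: torch.Tensor,
                      asr_attn: torch.Tensor) -> torch.Tensor:
    """Sketch of attention-level knowledge distillation (shapes are assumed).

    st_attn:  (T_tgt, S_speech)  encoder-decoder attention of SimulSpeech
    nmt_attn: (T_tgt, T_src)     attention of the teacher simultaneous NMT model
    asr_attn: (T_src, S_speech)  attention of the teacher simultaneous ASR model

    The product of the NMT and ASR attention matrices gives a
    target-text-to-speech alignment that serves as a soft target for the
    SimulSpeech attention.
    """
    teacher_attn = nmt_attn @ asr_attn                                # (T_tgt, S_speech)
    teacher_attn = teacher_attn / teacher_attn.sum(dim=-1, keepdim=True)
    return F.kl_div(st_attn.clamp_min(1e-9).log(), teacher_attn, reduction="batchmean")
```

In practice such a term would be added to the main translation loss with a tunable weight; the weight and the choice of divergence are hyperparameters.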

Summary

Introduction

We develop SimulSpeech, an end-to-end simultaneous speech to text translation system which translates speech in a source language into text in a target language concurrently. Simultaneous speech to text translation (Fugen et al., 2007; Oda et al., 2014; Dalvi et al., 2018), which translates source-language speech into target-language text concurrently, is of great importance to the real-time understanding of spoken lectures or conversations and is widely used in many scenarios, including live video streaming and international conferences. It is considered one of the most challenging tasks in the machine translation domain because the system has to understand the speech while trading off translation accuracy against delay. Experiments on the MuST-C English-Spanish and English-German spoken language translation datasets demonstrate that SimulSpeech: 1) achieves reasonable BLEU scores and lower delay compared to full-sentence end-to-end speech to text translation (without simultaneous translation), and 2) obtains better performance than the two-stage cascaded simultaneous translation model in terms of BLEU scores and translation delay.

Preliminaries
The SimulSpeech Model
Training of SimulSpeech
Training Segmenter with CTC Loss
Attention-Level Knowledge Distillation
Experiment Settings
Experiment Results
Ablation Study
Method
Simultaneous Translation
Speech to Text Translation