Abstract

We propose an efficient inference procedure for non-autoregressive machine translation that iteratively refines translations purely in the continuous space. Given a continuous latent variable model for machine translation (Shu et al., 2020), we train an inference network to approximate the gradient of the marginal log probability of the target sentence, using only the latent variable as input. This allows us to use gradient-based optimization to find the target sentence at inference time that approximately maximizes its marginal probability. As each refinement step only involves computation in the latent space of low dimensionality (we use 8 in our experiments), we avoid the computational overhead incurred by existing non-autoregressive inference procedures that often refine in token space. We compare our approach to a recently proposed EM-like inference procedure (Shu et al., 2020) that optimizes in a hybrid space, consisting of both discrete and continuous variables. We evaluate our approach on WMT’14 En→De, WMT’16 Ro→En and IWSLT’16 De→En, and observe two advantages over the EM-like inference: (1) it is computationally efficient, i.e. each refinement step is twice as fast, and (2) it is more effective, resulting in higher marginal probabilities and BLEU scores with the same number of refinement steps. On WMT’14 En→De, for instance, our approach is able to decode 6.2 times faster than the autoregressive model with minimal degradation to translation quality (0.9 BLEU).
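The refinement loop described above can be pictured as plain gradient ascent in the continuous latent space, followed by a single non-autoregressive decoding pass. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the names `score_net`, `decoder`, the step size `eta`, and the calling conventions are hypothetical, and `score_net` is assumed to approximate the gradient of the target's marginal log probability with respect to the latent variable.

```python
import torch

@torch.no_grad()
def refine_and_decode(z0, src_enc, score_net, decoder, steps=4, eta=1.0):
    """Hypothetical sketch of latent-space iterative refinement.

    z0       : initial latents, shape (batch, tgt_len, 8); the latent dimensionality
               is 8 as in the paper, and tgt_len is assumed to be predicted beforehand.
    src_enc  : encoded source sentence.
    score_net: approximates the gradient of the marginal log probability of the
               target with respect to z, so each step is gradient ascent performed
               entirely in the low-dimensional continuous space.
    decoder  : non-autoregressive decoder mapping (z, src_enc) to token logits.
    """
    z = z0
    for _ in range(steps):
        # move z uphill in (approximate) marginal log probability
        z = z + eta * score_net(z, src_enc)
    # one parallel decoding pass maps the refined latent to tokens
    logits = decoder(z, src_enc)          # (batch, tgt_len, vocab)
    return logits.argmax(dim=-1)
```

Because each step only touches the 8-dimensional latents rather than token-level distributions, the per-step cost stays small relative to token-space refinement procedures.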

Highlights

  • Most neural machine translation systems are autoregressive, so decoding latency grows linearly with the length of the target sentence

  • We emphasize that the same underlying latent variable model is used across three different inference procedures (Delta, Energy, Score) to compare their efficiency and effectiveness

  • We propose an efficient inference procedure for non-autoregressive machine translation that refines translations purely in the continuous space


Summary

Introduction

Most neural machine translation systems are autoregressive, so decoding latency grows linearly with the length of the target sentence. While various training objectives are used to admit refinement (e.g. denoising, evidence lower bound maximization and masked language modeling), the generation process of these models is similar in that the refinement happens in the discrete space of sentences. Another line of work proposed to use continuous latent variables for non-autoregressive translation, such that the distribution of the target sentence can be factorized over time given the latent variables (Ma et al., 2019; Shu et al., 2020). We refer the readers to Mansimov et al. (2019) for a formal definition of a sequence generation framework that unifies these models, and briefly discuss the inference procedure below.
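Concretely, "factorized over time given the latent variables" means that once the latent variable z is fixed, the target tokens are conditionally independent and can all be decoded in parallel. A generic form of such a model is shown below; the notation is illustrative and not copied from the cited papers.

```latex
% Generic latent-variable non-autoregressive factorization (illustrative notation):
% conditioned on z, the T target tokens are independent, enabling parallel decoding.
p(y \mid x) \;=\; \int p(z \mid x) \prod_{t=1}^{T} p(y_t \mid z, x) \, dz
```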
