Abstract

Non-autoregressive models generate target words in parallel, which achieves faster decoding but sacrifices translation accuracy. To remedy the flawed translations produced by non-autoregressive models, a promising approach is to train a conditional masked translation model (CMTM) and refine the generated results over several iterations. Unfortunately, such an approach hardly considers the \textit{sequential dependency} among target words, which inevitably degrades translation quality. Hence, instead of solely training a Transformer-based CMTM, we propose a Self-Review Mechanism to infuse sequential information into it. Concretely, we insert a left-to-right mask into the same decoder of the CMTM, and then induce it to autoregressively review whether each word generated by the CMTM should be replaced or kept. The experimental results on WMT14 En$\leftrightarrow$De and WMT16 En$\leftrightarrow$Ro demonstrate that our model requires dramatically less training computation than the typical CMTM, and outperforms several state-of-the-art non-autoregressive models by over 1 BLEU. Through knowledge distillation, our model even surpasses a typical left-to-right Transformer, while significantly speeding up decoding.
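The sketch below (PyTorch) illustrates how such a left-to-right review pass could look in practice; it is a minimal illustration, not the authors' implementation, and the names `causal_mask`, `ReviewHead`, `hidden_dim`, and the toy shapes are assumptions. The same decoder re-reads the CMTM draft under a causal attention mask and scores, per position, whether the generated word should be kept or replaced.

```python
# Minimal sketch of a left-to-right "self-review" pass over a CMTM draft.
# Everything here (module names, dimensions, the 2-way keep/replace head) is illustrative.
import torch
import torch.nn as nn


def causal_mask(length: int) -> torch.Tensor:
    """Left-to-right mask: position t may only attend to positions <= t."""
    return torch.triu(torch.ones(length, length, dtype=torch.bool), diagonal=1)


class ReviewHead(nn.Module):
    """Binary keep/replace classifier on top of decoder states (illustrative)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, 2)  # 0 = keep, 1 = replace

    def forward(self, decoder_states: torch.Tensor) -> torch.Tensor:
        return self.proj(decoder_states)  # (batch, tgt_len, 2)


# Toy usage: review a draft translation of 6 tokens with a single decoder layer.
hidden_dim, tgt_len = 64, 6
decoder_layer = nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=4, batch_first=True)
review_head = ReviewHead(hidden_dim)

draft_embeddings = torch.randn(1, tgt_len, hidden_dim)  # embeddings of the CMTM draft
encoder_memory = torch.randn(1, 8, hidden_dim)          # source-side encoder outputs

states = decoder_layer(
    draft_embeddings,
    encoder_memory,
    tgt_mask=causal_mask(tgt_len),  # the inserted left-to-right mask
)
keep_or_replace = review_head(states).argmax(-1)  # (1, tgt_len); 1 marks words to re-predict
```

Words flagged for replacement would then be re-predicted in the next refinement iteration, while kept words stay fixed.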

Highlights

  • Neural Machine Translation (NMT) models have achieved great success in recent years (Sutskever et al., 2014; Bahdanau et al., 2015; Cho et al., 2014; Kalchbrenner et al., 2016; Gehring et al., 2017; Vaswani et al., 2017)

  • Given a source sentence $x = \{x_1, x_2, \ldots, x_{|x|}\}$, an NMT model aims to generate a target-language sentence $y = \{y_1, y_2, \ldots, y_{|y|}\}$ expressing the same semantics, where $|x|$ and $|y|$ denote the lengths of the source and target sentences, respectively

  • We identify a drawback of conditional masked translation modeling (CMTM): it is insufficient to capture the sequential dependency among target words


Summary

Introduction

Neural Machine Translation (NMT) models have achieved great success in recent years (Sutskever et al., 2014; Bahdanau et al., 2015; Cho et al., 2014; Kalchbrenner et al., 2016; Gehring et al., 2017; Vaswani et al., 2017). Most NMT models use autoregressive decoders, where target words are generated one by one. Non-autoregressive models instead generate all target words in parallel; despite this gain in computational efficiency, they usually suffer a loss of translation accuracy. Even worse, they decode a target in only one shot, and miss the chance to remedy a flawed translation.

The training objective of an autoregressive NMT model is expressed as a chain of conditional probabilities in a left-to-right manner:

$$P(y \mid x) = \prod_{t=1}^{|y|+1} p\big(y_t \mid y_{<t}, x\big),$$

where $y_0$ and $y_{|y|+1}$ are $\langle s \rangle$ and $\langle /s \rangle$, standing for the start and end of a sentence, respectively. These probabilities are parameterized with a standard encoder-decoder architecture (Sutskever et al., 2014), where the decoder follows an autoregressive strategy to capture the left-to-right dependency among target words. Different from this training objective, we adopt conditional masked translation modeling (CMTM) (Ghazvininejad et al., 2019) to optimize our proposed non-autoregressive NMT model. Based on the assumption that the words in $y_{\text{mask}}$ are conditionally independent of each other, the training objective of CMTM is formulated as:

$$\mathcal{L}_{\text{CMTM}} = -\sum_{y_t \in y_{\text{mask}}} \log p\big(y_t \mid y_{\text{obs}}, x\big),$$

where $y_{\text{mask}}$ is the set of masked target words and $y_{\text{obs}}$ denotes the remaining observed target words.

[Figure: model overview with a length-prediction head and a linear + softmax output layer; example target "the cat is cool".]
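To make the CMTM objective concrete, here is a minimal sketch (not the authors' code) of the masked cross-entropy loss: a random subset of target tokens is replaced with a mask token, the model predicts them conditioned on the observed tokens and the source, and the loss is taken only over the masked positions. The `model(src_tokens, masked_inputs)` call and the special-token ids are assumed placeholders.

```python
# Minimal sketch of the CMTM training loss: cross-entropy over masked target positions only.
import torch
import torch.nn.functional as F

MASK_ID, PAD_ID = 4, 0  # illustrative special-token ids


def cmtm_loss(model, src_tokens, tgt_tokens):
    """L_CMTM = - sum_{y_t in y_mask} log p(y_t | y_obs, x), averaged over masked tokens."""
    # Sample how many tokens to mask per sentence, uniformly between 1 and |y|,
    # as in Mask-Predict (Ghazvininejad et al., 2019).
    lengths = (tgt_tokens != PAD_ID).sum(-1)
    num_to_mask = (torch.rand_like(lengths, dtype=torch.float) * lengths).long() + 1

    # Pick random positions to mask, never touching padding.
    scores = torch.rand(tgt_tokens.shape)
    scores[tgt_tokens == PAD_ID] = float("inf")
    ranks = scores.argsort(-1).argsort(-1)
    mask = ranks < num_to_mask.unsqueeze(-1)               # True where y_t is in y_mask

    masked_inputs = tgt_tokens.masked_fill(mask, MASK_ID)  # y_obs with [MASK] holes
    logits = model(src_tokens, masked_inputs)              # (batch, tgt_len, vocab) -- assumed signature

    return F.cross_entropy(
        logits[mask],       # predictions at masked positions only
        tgt_tokens[mask],   # the original words they should recover
    )


# Toy check with a dummy "model" that returns random logits over a 10-word vocabulary.
dummy_model = lambda src, tgt: torch.randn(tgt.shape[0], tgt.shape[1], 10)
print(cmtm_loss(dummy_model, torch.randint(5, 10, (2, 7)), torch.randint(5, 10, (2, 6))))
```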

