Abstract

Reinforcement Learning (RL) is a powerful framework to address the discrepancy between the loss functions used during training and the evaluation metrics used at test time. When applied to neural Machine Translation (MT), it minimises the mismatch between the cross-entropy loss and non-differentiable evaluation metrics like BLEU. However, the suitability of these metrics as reward functions at training time is questionable: they tend to be sparse and biased towards the specific words used in the reference texts. We propose to address this problem by making models less reliant on such metrics in two ways: (a) with an entropy-regularised RL method that not only maximises a reward function but also explores the action space to avoid peaky distributions; (b) with a novel RL method that explores a dynamic unsupervised reward function to balance exploration and exploitation. We base our proposals on the Soft Actor-Critic (SAC) framework, adapting the off-policy maximum entropy model to language generation applications such as MT. We demonstrate that SAC with a BLEU reward tends to overfit less to the training data and performs better on out-of-domain data. We also show that our dynamic unsupervised reward can lead to better translation of ambiguous words.
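For context, the maximum-entropy objective that SAC-style methods optimise can be sketched as follows. This is the standard formulation with illustrative notation (states s_t, actions a_t, temperature α), not an equation reproduced from the paper:

```latex
% Standard maximum-entropy RL objective (illustrative notation):
% the policy is rewarded both for the task reward and for its own entropy.
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
  \Big[ r(s_t, a_t) + \alpha \,\mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big],
\quad \text{where } \mathcal{H}\big(\pi(\cdot \mid s_t)\big)
  = -\mathbb{E}_{a \sim \pi(\cdot \mid s_t)}\big[\log \pi(a \mid s_t)\big].
```

The entropy term is what discourages the peaky output distributions mentioned above: raising α pushes the policy towards broader exploration of the action (token) space, while α → 0 recovers plain reward maximisation.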

Highlights

  • Autoregressive sequence-to-sequence neural architectures have become the de facto approach in Machine Translation (MT)

  • We demonstrate that Soft Actor-Critic (SAC) results in improved generalisation compared to Maximum Likelihood Estimation (MLE) training, leading to better translation of out-of-domain data, and we propose a dynamic unsupervised reward within the SAC framework (Section 3.4)

  • We clearly observe that Entropy-Regularised Actor-Critic (ERAC) models tend to perform better on the more in-domain 2016 data


Summary

Introduction

Autoregressive sequence-to-sequence (seq2seq) neural architectures have become the de facto approach in Machine Translation (MT). Such models include Recurrent Neural Networks (RNNs) (Sutskever et al., 2014; Bahdanau et al., 2014) and Transformer networks (Vaswani et al., 2017), among others. A serious limitation of these models is the discrepancy between their training and inference-time regimes. They are traditionally trained with Maximum Likelihood Estimation (MLE), which aims to maximise the log-likelihood of a categorical ground-truth distribution (the samples in the training corpus) using loss functions such as cross-entropy. This objective is very different from the evaluation metrics used at inference time, which generally compare string similarity between the system output and reference translations. MLE training therefore causes: (a) “exposure bias”, where the model recursively conditions on its own errors at test time, since it has never been exposed to its own predictions during training; (b) a mismatch between the training objective and the test objective, where the latter relies on discrete and non-differentiable measures such as BLEU (Papineni et al., 2002).
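As a minimal, self-contained sketch of this train/test mismatch (our own illustration, not code from the paper), the snippet below contrasts the differentiable token-level cross-entropy used for MLE training with the non-differentiable BLEU score computed on decoded strings. The toy tensors and the use of the sacrebleu library are assumptions made purely for illustration:

```python
# Sketch of the MLE-training vs. BLEU-evaluation mismatch (illustrative only).
import torch
import torch.nn.functional as F
import sacrebleu

# Toy model outputs: logits over a vocabulary of size 5 for a 3-token target.
logits = torch.randn(3, 5, requires_grad=True)   # (target_length, vocab_size)
target_ids = torch.tensor([1, 4, 2])             # ground-truth token indices

# (1) MLE / cross-entropy: differentiable and defined on the reference tokens,
#     so gradients flow and it can be optimised directly.
mle_loss = F.cross_entropy(logits, target_ids)
mle_loss.backward()

# (2) BLEU: computed on decoded strings, discrete and non-differentiable,
#     so it cannot be optimised with backpropagation (hence RL-style training).
hypothesis = "the cat sat on the mat"
reference = "a cat was sitting on the mat"
bleu = sacrebleu.sentence_bleu(hypothesis, [reference])

print(f"cross-entropy = {mle_loss.item():.3f}, BLEU = {bleu.score:.1f}")
```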
