Abstract

Many Data Augmentation (DA) methods have been proposed for neural machine translation. Existing work measures the superiority of a DA method by its performance on a specific test set, but we find that some DA methods do not exhibit consistent improvements across translation tasks. Based on this observation, this paper makes an initial attempt to answer a fundamental question: what benefits, consistent across different methods and tasks, does DA in general obtain? Inspired by recent theoretical advances in deep learning, the paper understands DA from two perspectives on the generalization ability of a model: input sensitivity and prediction margin, both of which are defined independently of any specific test set and may therefore lead to findings with relatively low variance. Extensive experiments show that relatively consistent benefits are achieved on both perspectives across five DA methods and four translation tasks.
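The abstract does not spell out the exact formulas for the two measures, but a minimal sketch of plausible token-level versions, assuming a PyTorch view of NMT, may help fix intuition. The function names, tensor shapes, and the gradient-norm proxy for sensitivity are illustrative assumptions here, not the paper's definitions:

```python
import torch

def prediction_margin(logits, gold):
    """Per-token margin: probability of the gold token minus the highest
    probability the model assigns to any competing token (one plausible
    reading of 'prediction margin'; not necessarily the paper's formula)."""
    probs = logits.softmax(dim=-1)                       # (T, V)
    gold_p = probs.gather(-1, gold.unsqueeze(-1)).squeeze(-1)
    others = probs.scatter(-1, gold.unsqueeze(-1), 0.0)  # zero out the gold token
    return gold_p - others.max(dim=-1).values

def input_sensitivity(loss, src_emb):
    """Gradient-norm proxy for input sensitivity: how much the loss moves
    under an infinitesimal perturbation of the source embeddings.
    Assumes src_emb was created with requires_grad=True."""
    (grad,) = torch.autograd.grad(loss, src_emb, retain_graph=True)
    return grad.norm()

# Toy check of the margin: 3 target positions over a 5-token vocabulary.
logits = torch.randn(3, 5)
gold = torch.tensor([0, 2, 4])
print(prediction_margin(logits, gold))  # larger = more confidently correct
```

Both quantities are computed from the model and its inputs alone, which is what makes them independent of any particular test set.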

Highlights

  • Data Augmentation (DA) is a training paradigm that has proven very effective in many modalities (Park et al., 2019; Perez and Wang, 2017; Sennrich et al., 2016a), especially for classification (Perez and Wang, 2017).

  • A metric computed on a specific test set, compared with the whole data population that generates all possible data, has large variance, which leads to this inconsistency.

  • This paper aims to deliver relatively consistent measures of the benefit of DA, motivated by the phenomenon of inconsistent BLEU improvements across translation tasks.

Summary

Introduction

Data Augmentation (DA) is a training paradigm that has proven very effective in many modalities (Park et al., 2019; Perez and Wang, 2017; Sennrich et al., 2016a), especially for classification (Perez and Wang, 2017). However, we find that some DA methods do not yield consistent BLEU improvements across translation tasks. One possible reason is that a metric computed on a specific test set, compared with the whole data population that generates all possible data, has large variance, which leads to this inconsistency. This evaluation dilemma is recognized and explored by Recht et al. (2018, 2019) and Werpachowski et al. (2019), and is especially notorious for language generation tasks (Chaganty et al., 2018; Hashimoto et al., 2019), where the evaluation metrics, e.g. BLEU (Papineni et al., 2001), are extrinsic and rely heavily on the references provided. We therefore ask a fundamental question: what benefits, consistent across different DA methods and translation tasks, can DA in general obtain?
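As a concrete illustration of that reference dependence (a minimal sketch, assuming the sacrebleu package; the example sentences are invented for this sketch):

```python
import sacrebleu

hyp = ["the cat sat quietly on the mat"]
refs_exact = [["the cat sat quietly on the mat"]]
refs_paraphrase = [["a cat was sitting silently on the rug"]]  # same meaning, different wording

print(sacrebleu.corpus_bleu(hyp, refs_exact).score)       # 100.0: exact match
print(sacrebleu.corpus_bleu(hyp, refs_paraphrase).score)  # far lower, despite adequacy
```

An adequate translation can thus score poorly simply because it does not echo the particular reference, which is one source of the test-set variance discussed above.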
