Abstract
Many Data Augmentation (DA) methods have been proposed for neural machine translation. Existing work measures the superiority of a DA method by its performance on a specific test set, but we find that some DA methods do not yield consistent improvements across translation tasks. Based on this observation, this paper makes an initial attempt to answer a fundamental question: what benefits, consistent across different methods and tasks, does DA in general obtain? Inspired by recent theoretical advances in deep learning, the paper examines DA from two perspectives on a model's generalization ability: input sensitivity and prediction margin. Both are defined independently of any specific test set and may therefore lead to findings with relatively low variance. Extensive experiments show that relatively consistent benefits in both respects are achieved across five DA methods and four translation tasks.
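The two perspectives can be made concrete with a small sketch. The code below uses one common, generic reading of each notion (the paper's exact formulations may differ): prediction margin as the gap between the top-1 and top-2 output probabilities, and input sensitivity as the average output change under small random input perturbations. All names here are illustrative, not from the paper.

```python
import numpy as np

def prediction_margin(probs):
    """Gap between the top-1 and top-2 class probabilities.

    A larger margin suggests the prediction sits farther from the
    decision boundary (a generic definition used for illustration).
    """
    top2 = np.sort(np.asarray(probs))[-2:]
    return float(top2[1] - top2[0])

def input_sensitivity(f, x, eps=1e-3, n_samples=8, seed=0):
    """Mean output change of `f` under small Gaussian input noise.

    `f` maps an input vector to an output vector; lower sensitivity
    is generally associated with better generalization.
    """
    rng = np.random.default_rng(seed)
    base = f(x)
    diffs = [
        np.linalg.norm(f(x + rng.normal(scale=eps, size=x.shape)) - base)
        for _ in range(n_samples)
    ]
    return float(np.mean(diffs))
```

For example, `prediction_margin([0.1, 0.2, 0.7])` returns `0.5`, and a model whose output barely moves when the input is perturbed gets a sensitivity near zero.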
Highlights
Data Augmentation (DA) is a training paradigm that has proved very effective in many modalities (Park et al., 2019; Perez and Wang, 2017; Sennrich et al., 2016a), especially for classification (Perez and Wang, 2017)
A metric computed on a specific test set, compared with the whole data population that generates all possible data, has large variance, which leads to this inconsistency
This paper aims to deliver relatively consistent measures of the benefit of DA, motivated by the phenomenon of inconsistent BLEU improvements across translation tasks
Summary
Data Augmentation (DA) is a training paradigm that has proved very effective in many modalities (Park et al., 2019; Perez and Wang, 2017; Sennrich et al., 2016a), especially for classification (Perez and Wang, 2017). (Work done at Tencent AI Lab.) A metric computed on a specific test set, compared with the whole data population that generates all possible data, has large variance, which leads to this inconsistency. This evaluation dilemma is recognized and explored by Recht et al. (2018, 2019) and Werpachowski et al. (2019), and it is especially notorious for language generation tasks (Chaganty et al., 2018; Hashimoto et al., 2019), where the evaluation metrics, e.g. BLEU (Papineni et al., 2001), are extrinsic and rely heavily on the provided references. We ask a fundamental question: what benefits, consistent across different DA methods and translation tasks, can DA in general obtain?
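The variance argument above can be illustrated with a toy simulation (hypothetical, not from the paper): if a model has some fixed population-level success rate, the metric measured on a single finite test set fluctuates from set to set, and the fluctuation shrinks as the test set grows. This is the sense in which a test-set-specific score can rank DA methods inconsistently across tasks.

```python
import numpy as np

def metric_std(test_set_size, true_quality=0.6, n_sets=1000, seed=0):
    """Std. dev. of a success-rate metric across many simulated test sets.

    Each test set of size `test_set_size` is drawn i.i.d. from a population
    where the model succeeds with probability `true_quality` (an assumed
    toy value). Returns the spread of the per-set metric estimates.
    """
    rng = np.random.default_rng(seed)
    successes = rng.binomial(test_set_size, true_quality, size=n_sets)
    return float((successes / test_set_size).std())
```

With `true_quality=0.6`, `metric_std(100)` comes out near 0.05 while `metric_std(10_000)` is near 0.005, matching the binomial standard error sqrt(p(1-p)/n): a tenfold larger test set gives roughly a tenfold tighter estimate.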