Abstract
Recent advances in commonsense reasoning depend on large-scale human-annotated training data to achieve peak performance. However, manual curation of training examples is expensive and has been shown to introduce annotation artifacts that neural models can readily exploit and overfit to. We investigate G-DAUG^C, a novel generative data augmentation method that aims to achieve more accurate and robust learning in the low-resource setting. Our approach generates synthetic examples using pretrained language models, and selects the most informative and diverse set of examples for data augmentation. In experiments with multiple commonsense reasoning benchmarks, G-DAUG^C consistently outperforms existing data augmentation methods based on back-translation, and establishes a new state-of-the-art on WinoGrande, CODAH, and CommonsenseQA. Further, in addition to improvements in in-distribution accuracy, G-DAUG^C-augmented training also enhances out-of-distribution generalization, showing greater robustness against adversarial or perturbed examples. Our analysis demonstrates that G-DAUG^C produces a diverse set of fluent training examples, and that its selection and training approaches are important for performance. Our findings encourage future research toward generative data augmentation to enhance both in-distribution learning and out-of-distribution generalization.
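The abstract describes a two-stage recipe: sample candidate examples from a pretrained language model, then keep only an informative and diverse subset before mixing them into training. The snippet below is a minimal sketch of that generate-then-select loop, assuming a Hugging Face GPT-2 generator; the function names, the informativeness score, and the overlap-based diversity filter are simplified illustrations rather than G-DAUG^C's actual selection procedure.

```python
# Minimal sketch of generate-then-select data augmentation. The scoring and
# diversity heuristics are illustrative placeholders, not G-DAUG^C's method.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def generate_candidates(prompts, num_per_prompt=5):
    """Sample candidate synthetic examples from a pretrained language model."""
    candidates = []
    for prompt in prompts:
        outputs = generator(
            prompt,
            max_new_tokens=40,
            do_sample=True,
            top_p=0.9,
            num_return_sequences=num_per_prompt,
        )
        candidates.extend(o["generated_text"] for o in outputs)
    return candidates

def select_informative_and_diverse(candidates, score_fn, k=100):
    """Rank candidates by an informativeness score, then greedily drop near-duplicates."""
    selected = []
    for text in sorted(candidates, key=score_fn, reverse=True):
        tokens = set(text.lower().split())
        # Crude diversity filter: skip candidates that overlap heavily with
        # anything already selected.
        too_similar = any(
            len(tokens & set(s.lower().split())) / max(len(tokens), 1) > 0.8
            for s in selected
        )
        if not too_similar:
            selected.append(text)
        if len(selected) == k:
            break
    return selected

# Example usage: score_fn could be, e.g., a task model's loss on the candidate
# (higher loss = more informative); here length is a trivial stand-in.
synthetic = generate_candidates(["Q: Where would you store a spare blanket?"])
augmented_pool = select_informative_and_diverse(synthetic, score_fn=len, k=3)
```

In practice, the selected synthetic examples would be combined with the original human-authored training set before fine-tuning the task model.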
Highlights
While recent advances in large-scale neural language models (Devlin et al., 2019; Liu et al., 2019; Radford et al., 2019; Raffel et al., 2019) have led to strong performance on several commonsense reasoning benchmarks (Talmor et al., 2019; Lv et al., 2020; Sakaguchi et al., 2020), their accuracy largely depends on the availability of large-scale human-authored training data.
In experiments across multiple commonsense benchmarks, we show that G-DAUG^C can mitigate the expense and brittleness associated with creating large training sets for commonsense reasoning tasks.
We present experiments on four commonsense multiple-choice QA benchmarks: CommonsenseQA (Talmor et al., 2019), WinoGrande (Sakaguchi et al., 2020), CODAH (Chen et al., 2019), and HellaSwag (Zellers et al., 2019).
Summary
While recent advances in large-scale neural language models (Devlin et al., 2019; Liu et al., 2019; Radford et al., 2019; Raffel et al., 2019) have led to strong performance on several commonsense reasoning benchmarks (Talmor et al., 2019; Lv et al., 2020; Sakaguchi et al., 2020), their accuracy largely depends on the availability of large-scale human-authored training data. A candidate solution that has shown promise in other tasks, such as reading comprehension, is to augment a human-authored training set with a large set of synthetically generated examples (Zhou et al., 2017; Du et al., 2017; Zhao et al., 2018a).