Abstract

Data augmentation methods - a family of techniques for generating synthetic training data - have shown remarkable results in a variety of Machine Learning and Deep Learning tasks. Despite their widespread and successful adoption within the computer vision community, data augmentation techniques designed for natural language processing (NLP) tasks have advanced much more slowly and achieved only limited performance gains. As a consequence, with the exception of back-translation applied to machine translation tasks, these techniques have not been thoroughly explored by the wider NLP community. Recent research on the subject also lacks a practical understanding of the relationship between data augmentation and several important aspects of model design, such as hyperparameters and regularization. In this paper, we perform a comprehensive study of NLP data augmentation techniques, comparing their relative performance under different settings. We also propose Deep Back-Translation, a novel NLP data augmentation technique, and apply it to benchmark datasets. We analyze the quality of the generated synthetic data, evaluate its performance gains, and compare all of these aspects to previously existing data augmentation procedures.
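Back-translation, the technique mentioned above, augments a text dataset by translating a sentence into a pivot language and back into the source language, producing a paraphrase that preserves the original label. A minimal sketch of the round trip is shown below; the `translate` callable and the toy dictionary stand in for a real machine translation model and are purely illustrative assumptions, not part of the paper's method.

```python
def back_translate(sentence, translate, src="en", pivot="fr"):
    """Round-trip a sentence through a pivot language to obtain a paraphrase.

    `translate` is assumed to be any callable with the signature
    translate(text, src, tgt) -> str, backed by a real MT model in practice.
    """
    pivot_text = translate(sentence, src=src, tgt=pivot)
    return translate(pivot_text, src=pivot, tgt=src)


# Toy stand-in for a real translation model, for illustration only.
_TOY_TABLE = {
    ("en", "fr"): {"the movie was great": "le film était génial"},
    ("fr", "en"): {"le film était génial": "the film was great"},
}

def toy_translate(text, src, tgt):
    return _TOY_TABLE[(src, tgt)][text]

# The round trip yields a paraphrase usable as an extra training example.
paraphrase = back_translate("the movie was great", toy_translate)
```

In practice the pivot translations come from trained translation models, and sampling or beam-search noise in those models is what produces paraphrase diversity.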
