Abstract

Data augmentation is a commonly used technique for avoiding over-fitting in deep learning, yet the mechanism behind effective data augmentation methods remains unclear. To address this issue, we identify two critical factors for assessing the quality of data augmentation in natural language processing: semantic preservation and diversity. Our study focuses on text sentiment classification and examines these two factors in two commonly used data augmentation methods: synonym replacement and random deletion. Based on these findings, we propose two new augmentation methods: TF-IDF word dropout and adaptive synonym replacement. Experimental results demonstrate that both new methods are effective. Moreover, with further experiments, we distill three strategies for improving data augmentation in sentiment classification tasks: employing online augmentation, incorporating word importance into the word-sampling process, and filtering augmented data based on the current model state. We hope that our study inspires new perspectives on the underlying principles of data augmentation's effectiveness and contributes to a systematic study of data augmentation methods in the future.
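The abstract does not specify how TF-IDF word dropout is implemented; a minimal sketch of one plausible reading (words with low TF-IDF weight are dropped with higher probability, so that informative words tend to survive augmentation) might look like the following. The function names, the corpus, and the exact keep-probability formula are illustrative assumptions, not the paper's method.

```python
import math
import random
from collections import Counter

def tfidf_weights(corpus):
    """Compute per-document TF-IDF weights.

    corpus: list of documents, each a list of tokens.
    Returns a list of {word: tf-idf weight} dicts, one per document.
    """
    n_docs = len(corpus)
    df = Counter()                      # document frequency of each word
    for doc in corpus:
        df.update(set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        weights.append({
            w: (tf[w] / len(doc)) * math.log(n_docs / df[w])
            for w in tf
        })
    return weights

def tfidf_word_dropout(doc, doc_weights, drop_rate=0.3, seed=None):
    """Drop words with probability inversely related to their TF-IDF weight.

    A word with the maximum TF-IDF weight in the document is never dropped;
    a word with zero weight is dropped with probability `drop_rate`.
    (This keep-probability schedule is an assumption for illustration.)
    """
    rng = random.Random(seed)
    max_w = max(doc_weights.values()) or 1.0   # guard against all-zero weights
    return [
        w for w in doc
        if rng.random() > drop_rate * (1 - doc_weights[w] / max_w)
    ]
```

For example, augmenting the document `["good", "movie", "good"]` against a three-document corpus keeps the sentiment-bearing word "good" more often than the corpus-wide word "movie", since "good" carries a higher TF-IDF weight.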
